Heat in Data Centers: The Conundrum of Thermal Management 

By Robert Hult | October 22, 2024

Heat has a direct effect on device reliability. As the internal temperature of a chip exceeds its maximum rating, reliability begins to drop off. This results in a shortened life or complete chip failure.

Advancing the limits of computing performance has been a constant defining objective of system designers since the first commercial computers were introduced in the early 1950s. Computers have undergone exponential increases in speed while shrinking their physical envelope and cost at rates unmatched by any other device or technology we use today. The performance of refrigerator-sized machines that required hundreds of square feet in a controlled environment is now far exceeded by a cellular phone that fits in your back pocket. For nearly 50 years the industry marched to the pace defined by Moore’s Law. As they say, all good things must come to an end.

The last few years have seen rapid growth in computational power as well as energy consumption. Every component in a data center, from high-performance GPUs to a length of wire, consumes some amount of power, a portion of which is converted to unwanted heat. Multiply that by millions of devices and you have a real problem. High-performance systems typically package components in close proximity, exacerbating the problem.

Elevated temperatures have a direct effect on device reliability. As the internal temperature of a chip exceeds its maximum rating, reliability begins to drop off, resulting in a shortened life or complete chip failure. A report by the International Energy Agency estimated that global data center infrastructure consumed 460 terawatt-hours of power in 2022, about 2% of all global electricity usage. Individual hyperscale data centers can consume up to 100 megawatts of power, and a single rack of equipment can draw 140,000 watts. That is a lot of heat concentrated in a small space.

Elevated temperatures can also impact other components in a system. Design engineers must verify that the thermal coefficient of expansion (TCE) of the PCB matches that of mounted devices, such as connectors, to minimize stress on solder joints during thermal cycling.

That level of energy consumption, and the burning of fossil fuel typically required to generate it, has also become an environmental issue. Availability of sufficient electrical power as well as cooling water has become a major factor in the site selection of new data centers. The carbon footprint of data centers has been estimated at approximately 2–4% of global carbon emissions. The energy efficiency of a data center is measured by its Power Usage Effectiveness (PUE) ratio, which is determined by dividing the total amount of power entering the data center by the power used to run the IT equipment within it. PUEs improved dramatically over the 2007 to 2018 period but have since stalled in the 1.58 range. Given the current frenzied expansion of computing resources dedicated to data-intensive artificial intelligence, the International Energy Agency forecasts that data center energy usage could double by 2026.
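To make that ratio concrete, here is a minimal sketch of the PUE arithmetic described above; the 10 MW facility size and its IT load are assumed values chosen only to reproduce the stalled 1.58 industry average.

```python
# Minimal sketch of the PUE calculation: total facility power / IT equipment power.
# The facility and IT load figures below are illustrative assumptions, not measured data.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

total_kw = 10_000.0   # hypothetical 10 MW facility
it_kw = 6_330.0       # hypothetical IT load

print(f"PUE = {pue(total_kw, it_kw):.2f}")        # ~1.58, the stalled industry average
print(f"Overhead = {total_kw - it_kw:.0f} kW")    # power spent on cooling, conversion, lighting
```

Every point of PUE above 1.0 represents power that performs no computation, which is why cooling efficiency has become a headline metric for operators.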

Image Credit: SemiWiki.com

The conventional solution to moving heat away from individual devices is a passive heat sink: a metal structure with low thermal resistance and a large surface area attached to the top of a heat-generating device.

Heat is transferred from the top of the device to the heat sink and dissipated into the ambient air stream by natural convection or fan-forced air. Heat sinks can also be attached to the external frame of equipment to conduct heat away from internal components. A thermally conductive paste or gasket between the surface of the device and the heat sink increases the efficiency of the heat sink.
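The value of a low-resistance heat sink and a good thermal interface can be seen in a simple junction-to-ambient resistance model, sketched below; the power level and all resistance values are assumed for illustration rather than taken from any vendor datasheet.

```python
# Illustrative junction temperature estimate for a heat-sunk device:
#   T_junction = T_ambient + P * (R_junction_to_case + R_TIM + R_heatsink_to_air)
# All values below are assumptions for illustration only.

def junction_temp(power_w, t_ambient_c, r_jc, r_tim, r_hs):
    """Estimate junction temperature from the series thermal resistance chain."""
    return t_ambient_c + power_w * (r_jc + r_tim + r_hs)

power_w = 300.0    # assumed device dissipation, W
t_ambient = 35.0   # assumed rack inlet air temperature, degrees C
r_jc = 0.10        # junction-to-case resistance, C/W
r_tim = 0.05       # thermal paste or gasket, C/W
r_hs = 0.15        # heat sink to moving air, C/W

tj = junction_temp(power_w, t_ambient, r_jc, r_tim, r_hs)
print(f"Estimated junction temperature: {tj:.0f} C")  # 35 + 300 * 0.30 = 125 C
```

Shaving even a few hundredths of a degree-per-watt from the interface or sink resistance translates directly into a cooler junction at these power levels.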

A typical server tray may include multiple fans at the rear to ensure adequate air movement inside the box. The constant whine of thousands of fans can push a data center toward the occupational noise limits defined by OSHA. The hot air exhausted by racks of equipment must be transferred out of the building using large chillers.

The electrical consumption of the fans and building cooling equipment adds to the total energy budget of the data center. Multiple industry studies have estimated that cooling accounts for between 40% and 50% of a data center's total power consumption.

Certain components are driving the trend of higher energy consumption. High-speed switches, for example, have followed a roadmap to ever-higher performance but are also responsible for a 22X increase in power consumption over a 10-year period.

System designers have concluded that this rate of increase is unsustainable and represents a major obstacle to future system advances.

The solution likely will require addressing the problem from at least two perspectives.

  • Reducing power consumption of each component to minimize the resulting heat generation.
  • Increasing the efficiency of the equipment cooling process.

Picojoules per bit (pJ/bit) has become a key design criterion of system efficiency. Chips that now consume 10 to 15 pJ/bit will be required to evolve to 1 pJ/bit or less in the future.
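A back-of-the-envelope calculation shows why the pJ/bit figure dominates at modern bandwidths; the 51.2 Tb/s switch throughput used below is an assumed example for illustration, not a figure from this article.

```python
# Power = (energy per bit) * (bit rate): W = pJ/bit * 1e-12 * bits/s.
# The 51.2 Tb/s aggregate throughput is an assumed example value.

def io_power_watts(pj_per_bit: float, throughput_tbps: float) -> float:
    """I/O power implied by an energy-per-bit figure at a given aggregate bit rate."""
    return pj_per_bit * 1e-12 * throughput_tbps * 1e12

for pj in (15, 10, 1):
    print(f"{pj:>2} pJ/bit at 51.2 Tb/s -> {io_power_watts(pj, 51.2):6.1f} W")
# 15 pJ/bit -> 768.0 W, 10 pJ/bit -> 512.0 W, 1 pJ/bit -> 51.2 W of I/O power
```

At today's 10 to 15 pJ/bit, the interconnect alone can consume hundreds of watts per device, which is why the 1 pJ/bit target matters so much.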

At the chip level, reduced process dimensions shorten connections between features, resulting in higher bandwidth as well as reduced power consumption and latency. Current chips have a long way to go to achieve the 1 pJ/bit goal.

Reducing the length of copper conductors and eliminating non-essential components in the system can also lower power consumption. Linear drive optics, which eliminate the digital signal processor from a pluggable transceiver, and co-packaged optics are being promoted as offering up to 30% lower power than conventional interconnect technology.

The other major avenue of change in achieving industry power management goals while increasing system performance is to improve the efficiency of moving heat out of the system.

With the advent of 1,000-watt processors and 700-watt GPUs, it appears that traditional fan cooling is approaching its capacity limits, especially in AI clusters. Although more costly, direct liquid cooling offers the advantages of greater thermal transfer capacity, increased component reliability, and quiet operation. Liquid cooling, traditionally the domain of supercomputers, is now being considered in new data center construction.

Heat pipes and cold plates are enhancements of the static heat sink. Where heat sinks transfer heat to ambient air, liquid-cooled systems use water or electrically inert fluids to transfer heat outside of the box. A heat pipe consists of a sealed tube containing a fluid that carries heat away from a semiconductor in vapor form and returns the condensed liquid to the heat source.

Image Credit: Celsia

Cold plates are closed-loop devices that circulate chilled water through a transfer plate attached to the device. Warm return water can be sent back to a chiller located inside the rack, at the end of a row of racks, or to an outdoor cooler.
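Sizing a cold-plate loop follows a simple energy balance, Q = mass flow × specific heat × temperature rise; the sketch below uses an assumed 1,000 W device and a 10 °C coolant temperature rise purely for illustration.

```python
# Cold-plate energy balance: Q = m_dot * c_p * delta_T.
# The 1,000 W load and 10 C coolant rise are illustrative assumptions.

C_P_WATER = 4186.0    # J/(kg*K), specific heat of water
DENSITY_WATER = 1.0   # kg/L, approximate

def required_flow_lpm(heat_w: float, delta_t_c: float) -> float:
    """Liters per minute of water needed to absorb heat_w with a delta_t_c temperature rise."""
    mass_flow_kg_s = heat_w / (C_P_WATER * delta_t_c)
    return mass_flow_kg_s / DENSITY_WATER * 60.0

print(f"{required_flow_lpm(1000.0, 10.0):.2f} L/min")  # ~1.43 L/min for a 1,000 W device
```

A flow of well under two liters per minute can carry away a kilowatt of heat, which illustrates why water is so much more effective than air as a transfer medium.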

Some cold plate designs create a sealed area on the top of a device allowing cooling liquid to be sprayed directly on the lid of a processor chip to increase heat transfer.

Non-drip fluid couplings allow quick removal and installation of liquid-cooled daughtercards.

Direct immersion cooling was first utilized in the CRAY-2 supercomputer in 1985. The superior efficiency of immersion cooling is generating increased interest, particularly in AI clusters. This technology immerses the entire system in a tub of non-conductive liquid. Every component on each PCB is in intimate contact with the circulating fluid, allowing a consistent temperature gradient across each PCB.

Immersion cooling is implemented using one of two technologies.

Image Credit: Fujitsu

Single-phase immersion cooling submerges a rack of cards in externally chilled liquid, which is circulated either by pumps or by natural convection.

Two-phase immersion cooling is a closed-loop system that uses low-boiling-point fluids. The fluid vaporizes on contact with hot surfaces, extracting heat. The vapor rises to the top of the enclosure, where it condenses and returns to the tub.
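In a two-phase tank, the heat removed is governed by the fluid's latent heat of vaporization; the sketch below assumes a fluorocarbon-like value of roughly 100 kJ/kg, an illustrative figure rather than a property of any specific product.

```python
# Two-phase cooling sketch: heat absorbed by boiling, Q = m_dot * h_fg.
# The 100 kJ/kg latent heat is an assumed, fluorocarbon-like value for illustration.

H_FG = 100_000.0  # J/kg, assumed latent heat of vaporization of the dielectric fluid

def vaporization_rate_g_s(heat_w: float) -> float:
    """Grams of fluid boiled off per second to carry away heat_w."""
    return heat_w / H_FG * 1000.0

print(f"{vaporization_rate_g_s(1000.0):.0f} g/s")  # ~10 g/s of vapor for a 1,000 W device
```

Because boiling absorbs heat at an essentially constant temperature, the device surface stays close to the fluid's boiling point regardless of local hot spots.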

Immersion cooling represents the most drastic alternative to current fan cooling. It offers up to 50% energy reduction without the need for pumps or fans, enables increased compute density, and provides long-term future-proofing as device technology continues to advance. Adoption of immersion cooling will require significant changes in system design as well as packaging.

Recent forecasts of the global market for liquid cooling equipment indicate that this segment will grow to $15 billion by 2029.

The choice of cooling technology impacts connectors. Years ago, connector profiles were reduced to minimize restriction of airflow. Connector housings were cored to allow airflow through the connector body, and backplane connector housings were adapted to accept dripless liquid couplers. Pluggable transceiver PCB connectors added special features to maximize heat transfer. New high-current PCB and bus-bar connectors designed specifically for power distribution were introduced. As immersion cooling grows, the impedance of high-speed connectors will require modification to compensate for the difference between the dielectric constant of air and that of the cooling fluid.
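As a first-order illustration of that impedance shift, characteristic impedance for a fixed geometry scales roughly with the inverse square root of the effective dielectric constant; the coolant value of εr ≈ 1.8 below is an assumption, and real connectors mix air, plastic, and fluid regions, so the actual shift would be smaller.

```python
# First-order sketch: for a fixed conductor geometry, characteristic impedance
# scales roughly as 1 / sqrt(effective dielectric constant).
# The e_r = 1.8 coolant value is an assumption; real connectors are mixed-dielectric.

import math

def rescaled_impedance(z0_in_air: float, e_r_fluid: float) -> float:
    """Impedance after replacing an air dielectric (e_r = 1) with a fluid of e_r_fluid."""
    return z0_in_air / math.sqrt(e_r_fluid)

print(f"{rescaled_impedance(100.0, 1.8):.1f} ohms")  # a nominal 100-ohm pair drops to ~74.5 ohms
```

Even a partial shift of that magnitude would create reflections on high-speed channels, which is why connector geometries will need to be retuned for immersion environments.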

Designers of high-performance computers are not likely to allow energy conservation, thermal management, or sustainability challenges to limit future generations of high-performance computing equipment. Connectors will play an important role in supporting solutions to the thermal management demands of these systems.
