Saturday, February 28, 2026

The AI Infrastructure Build

Cooling: The Unsexy Necessity

Post 6: Terrestrial Foundation

From Air to Liquid — Why Blackwell GPUs Changed Everything

By Randy Gipe | March 2026

NVIDIA GPUs don't just consume power. They generate massive heat.

An H100 chip: 700W. That's seven 100-watt incandescent light bulbs' worth of heat, in a chip the size of your palm.

Blackwell: 1,000W. Ten light bulbs. And you’re putting 80,000 of them in one building.

80 megawatts of heat. Continuously. 24/7.

Air conditioning can’t handle it anymore. The entire industry is shifting to liquid cooling—pumping coolant directly onto chips, or even submerging entire servers in fluid.

This is the unglamorous infrastructure nobody photographs. But without it, AI stops.

Part 1: The Heat Problem

How Much Heat Are We Talking About?

🔥 GPU HEAT GENERATION (2018-2026)

Evolution of AI chip heat:

| Chip | TDP (watts) | Heat per rack (40 GPUs) | Cooling challenge |
|---|---|---|---|
| NVIDIA V100 (2018) | 300 W | 12 kW | Air cooling sufficient |
| NVIDIA A100 (2020) | 400 W | 16 kW | Air cooling strained |
| NVIDIA H100 (2022) | 700 W | 28 kW | Liquid recommended |
| Blackwell B200 (2025) | 1,000 W | 40 kW | Liquid required |

For a 10,000 GPU cluster (Blackwell):

  • 10,000 GPUs × 1,000W = 10 MW of heat
  • Equivalent to running 10,000 space heaters simultaneously
  • Or: Heating 500 average homes in winter

Data center cooling rule of thumb:

  • For every 1 MW of IT power, need 0.3-0.5 MW of cooling power
  • 10 MW IT load → 3-5 MW cooling → 13-15 MW total facility power
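The rule of thumb above can be sketched as a quick calculation (a minimal back-of-envelope sketch; the 0.3-0.5 cooling ratio is the rough industry range quoted above, not a measured figure):

```python
def facility_power_mw(it_load_mw, cooling_ratio_low=0.3, cooling_ratio_high=0.5):
    """Estimate total facility power from IT load using the
    0.3-0.5 MW-of-cooling-per-MW-of-IT rule of thumb."""
    return (it_load_mw * (1 + cooling_ratio_low),
            it_load_mw * (1 + cooling_ratio_high))

low, high = facility_power_mw(10)  # 10 MW IT load
print(f"Total facility power: {low:.0f}-{high:.0f} MW")  # 13-15 MW
```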

This is why Post 3's power crisis matters—cooling multiplies the electricity need.

Why Air Cooling Fails at Scale

Traditional data center cooling (pre-AI):

  • Cold air blown into server racks
  • Hot air exhausted out the back
  • Works fine for 5-10 kW per rack (traditional servers)

AI data center cooling (2024+):

  • 40+ kW per rack (Blackwell)
  • Air can't absorb heat fast enough
  • GPUs overheat → throttle performance → wasted money
  • Air cooling maxes out at ~20-25 kW/rack

The physics problem:

  • Air's specific heat: ~1 kJ/(kg·K); water's: ~4.2 kJ/(kg·K)
  • Per unit mass, water absorbs ~4x as much heat as air
  • Water is also ~800x denser, so per unit volume it carries roughly 3,500x more heat
  • Result: liquid cooling is the only viable option for Blackwell-density racks
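To see concretely why air fails, apply the heat balance Q = ṁ·cp·ΔT to a single 40 kW Blackwell rack (a sketch under assumed conditions; the 10 K coolant temperature rise is a hypothetical design point, not a spec):

```python
# Heat balance: Q = m_dot * cp * dT  =>  m_dot = Q / (cp * dT)
CP_AIR = 1005.0     # specific heat of air, J/(kg*K)
CP_WATER = 4186.0   # specific heat of water, J/(kg*K)
RHO_AIR = 1.2       # density of air, kg/m^3
RHO_WATER = 1000.0  # density of water, kg/m^3

def mass_flow_kg_s(heat_w, cp, delta_t_k):
    """Mass flow needed to carry away heat_w watts at a given temperature rise."""
    return heat_w / (cp * delta_t_k)

Q_RACK = 40_000.0   # 40 kW Blackwell rack
DT = 10.0           # assumed 10 K coolant temperature rise

air_flow = mass_flow_kg_s(Q_RACK, CP_AIR, DT)      # ~4.0 kg/s of air
water_flow = mass_flow_kg_s(Q_RACK, CP_WATER, DT)  # ~0.96 kg/s of water

air_m3_s = air_flow / RHO_AIR                      # ~3.3 m^3 of air per second
water_l_min = water_flow / RHO_WATER * 1000 * 60   # ~57 L of water per minute
```

Roughly 3.3 cubic meters of air per second versus about 57 liters of water per minute for the same rack, which is why fans stop scaling long before pumps do.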

Part 2: The Liquid Cooling Revolution

Direct-to-Chip Liquid Cooling

💧 HOW DIRECT LIQUID COOLING WORKS

The system:

  1. Cold plates: Metal plates mounted directly onto GPUs/CPUs
  2. Coolant: Water or water-glycol mixture flows through cold plates
  3. Heat transfer: Coolant absorbs heat from chips (direct contact)
  4. Heat rejection: Hot coolant pumped to cooling towers/chillers outside building
  5. Circulation: Cooled fluid returns to servers, cycle repeats

Advantages:

  • Efficiency: 30-40% more efficient than air cooling (less energy for same cooling)
  • Density: Can cool 40-80 kW racks (Blackwell + future chips)
  • Noise: Quieter (no loud fans)
  • Space: Smaller cooling infrastructure footprint

Disadvantages:

  • Complexity: Plumbing, leak risks, maintenance
  • Cost: 30-50% higher capex than air cooling
  • Expertise: Requires skilled technicians (can't just swap parts like air systems)

Adoption (2026):

  • 50%+ of new AI data centers use direct liquid cooling
  • Up from <10% in 2022 (pre-H100 era)
  • Projected: 80%+ by 2028 (as Blackwell deploys at scale)

Immersion Cooling — The Extreme Solution

For ultra-high-density deployments: Submerge entire servers in liquid.

🌊 IMMERSION COOLING

How it works:

  • Servers placed in tanks filled with dielectric fluid (non-conductive, doesn't short-circuit electronics)
  • GPUs, memory, everything submerged
  • Heat transfers directly from components to fluid
  • Hot fluid pumped to heat exchangers, cooled, returned

Types of immersion:

1. Single-phase immersion:

  • Fluid stays liquid (doesn't boil)
  • Simpler, more common
  • Can cool 100-200 kW per tank

2. Two-phase immersion:

  • Fluid boils at low temperature (~50°C)
  • Vapor rises, condenses, returns as liquid
  • More efficient but complex
  • Can cool 250+ kW per tank

Advantages:

  • Extreme density: Can cool 100+ kW racks (beyond Blackwell, future-proof)
  • Efficiency: 40-50% more efficient than air (PUE ~1.05 vs. air's 1.3-1.5)
  • No dust: Sealed systems, no particulate contamination

Disadvantages:

  • Cost: 2-3x more expensive than air cooling
  • Maintenance: Accessing components requires draining tanks
  • Fluid cost: Dielectric fluids expensive ($50-200/gallon, thousands of gallons needed)
  • Psychological barrier: Operators nervous about submerging expensive GPUs

Adoption (2026):

  • ~5-10% of new AI data centers use immersion
  • Mostly hyperscalers experimenting (Microsoft, Meta testing)
  • Bitcoin miners pivoting to AI (Post 4) often use immersion (already had infrastructure)

The Cooling Adoption Curve

| Year | Air cooling | Direct liquid | Immersion | Driver |
|---|---|---|---|---|
| 2020 | 95% | 4% | 1% | A100 era (400 W, air sufficient) |
| 2023 | 70% | 25% | 5% | H100 (700 W, liquid recommended) |
| 2026 | 40% | 50% | 10% | Blackwell (1,000 W, liquid required) |
| 2028 (proj.) | 20% | 65% | 15% | Next-gen GPUs (1,200-1,500 W) |

Air cooling won't disappear (still used for inference, legacy systems), but liquid dominates new AI builds.

Part 3: The Cooling Infrastructure Winners

Vertiv — The Data Center Cooling Leader

❄️ VERTIV

What they do:

  • Data center infrastructure: Cooling, power distribution, monitoring
  • Leading provider of direct liquid cooling systems for AI

Revenue (2025):

  • ~$7.5B total revenue (up 15-20% YoY, AI-driven)
  • Thermal management (cooling): ~40% of revenue (~$3B)
  • Gross margins: ~30-35%

AI cooling products:

  • Liebert XDU coolant distribution units (CDUs) for direct-to-chip liquid cooling
  • Liebert DSE free-cooling and EconoPhase pumped-refrigerant economizer systems
  • Cold plates, manifolds, and heat rejection equipment

Customer base:

  • Hyperscalers (AWS, Azure, Google Cloud)
  • Data center REITs (Digital Realty, Equinix)
  • Enterprises deploying on-prem AI

Stock performance:

  • Nov 2022 (ChatGPT launch): ~$10
  • March 2026: ~$90-110
  • +800-1,000% gain (massive AI infrastructure winner)

Why Vertiv wins:

  • Incumbent advantage (already in 80%+ of large data centers)
  • End-to-end solutions (cooling + power + monitoring integrated)
  • Scale: Can deliver thousands of cooling units/year

Schneider Electric — The Diversified Giant

⚡ SCHNEIDER ELECTRIC

What they do:

  • Energy management, industrial automation, data center infrastructure
  • Cooling, UPS (uninterruptible power), power distribution

Revenue (2025):

  • ~€40B total (~$43B USD)
  • Data center segment: ~€8-10B (~$9-11B, 20-25% of total)
  • AI driving data center growth 20-30% YoY

AI cooling products:

  • EcoStruxure: Integrated data center management platform
  • APC by Schneider: Liquid cooling systems, in-row coolers
  • Partnerships with hyperscalers for custom solutions

Why Schneider competes:

  • Diversified (not dependent on data centers alone)
  • Global scale (operates in 100+ countries)
  • Software integration (cooling + power + monitoring via EcoStruxure)

Startups & Niche Players

LiquidStack:

  • Immersion cooling specialist
  • Two-phase immersion systems
  • Backed by Bitcoin mining pivot companies

CoolIT Systems:

  • Direct liquid cooling (cold plates, CDUs)
  • Focus: High-performance computing (HPC), AI

Asetek:

  • Liquid cooling for servers/GPUs
  • Originally gaming PC cooling (scaled to data centers)

These startups have 10-15% combined market share. Vertiv + Schneider dominate 60-70%.

Part 4: The Economics — 15-20% of Data Center Capex

Cooling Cost Breakdown

💰 EXAMPLE: 500 MW AI DATA CENTER (BLACKWELL)

Total IT load: 500 MW

Cooling requirements:

  • 500 MW IT × 1.3 PUE (Power Usage Effectiveness) = 650 MW total facility power
  • Cooling power: ~150 MW

Cooling capex (direct liquid cooling):

1. In-rack cooling (cold plates, manifolds):

  • ~50,000 servers × $5,000-8,000/server = $250-400M

2. Coolant distribution units (CDUs):

  • ~500 units × $100k-200k = $50-100M

3. Heat rejection (cooling towers, chillers):

  • 150 MW cooling capacity × $500k-1M/MW = $75-150M

4. Piping, pumps, controls:

  • $100-200M

Total cooling capex: $475-850M

Total data center capex: $3-4B (GPUs, servers, networking, cooling, building, power)

Cooling as % of total: 12-28% (average ~15-20%)

For comparison, air cooling would be:

  • ~$300-500M (30-40% cheaper)
  • But can't handle Blackwell density (wouldn't work)
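The capex line items above can be tallied in a few lines (the per-unit costs are the rough ranges from this section, not vendor quotes):

```python
# (low, high) capex estimates in USD, using the rough ranges above
capex = {
    "in_rack_cooling": (50_000 * 5_000,  50_000 * 8_000),    # cold plates/manifolds, per server
    "cdus":            (500 * 100_000,   500 * 200_000),     # coolant distribution units
    "heat_rejection":  (150 * 500_000,   150 * 1_000_000),   # per MW of cooling capacity
    "piping_pumps":    (100_000_000,     200_000_000),       # piping, pumps, controls
}

low = sum(v[0] for v in capex.values())
high = sum(v[1] for v in capex.values())
print(f"Cooling capex: ${low/1e6:.0f}M-${high/1e6:.0f}M")     # $475M-$850M
print(f"Share of $3-4B build: {low/4e9:.0%}-{high/3e9:.0%}")  # ~12%-28%
```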

Operating Costs (Opex)

Cooling also consumes power continuously:

  • Air cooling PUE: 1.3-1.5 (30-50% overhead on IT power)
  • Liquid cooling PUE: 1.15-1.25 (15-25% overhead)
  • Immersion PUE: 1.05-1.15 (5-15% overhead)

For a 500 MW IT load (assuming $0.08/kWh and 8,760 hours/year):

  • Air cooling: 150-250 MW of cooling power → $105-175M/year
  • Liquid cooling: 75-125 MW → $53-88M/year
  • Immersion: 25-75 MW → $18-53M/year

Opex savings from liquid cooling vs. air: roughly $50-90M/year

Payback on the higher capex (~$175-350M more than air cooling): roughly 2-7 years, so liquid cooling pays for itself via energy savings.
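The opex and payback arithmetic works out as follows (midpoint values; the $0.08/kWh rate is the assumption used throughout this section):

```python
HOURS_PER_YEAR = 8_760
RATE_USD_PER_KWH = 0.08  # assumed industrial electricity rate

def annual_cooling_cost_musd(cooling_mw):
    """Annual cost in $M of running cooling_mw of cooling load continuously."""
    return cooling_mw * 1_000 * HOURS_PER_YEAR * RATE_USD_PER_KWH / 1e6

air_cost = annual_cooling_cost_musd(200)     # midpoint of 150-250 MW -> ~$140M/yr
liquid_cost = annual_cooling_cost_musd(100)  # midpoint of 75-125 MW  -> ~$70M/yr

savings = air_cost - liquid_cost             # ~$70M/yr
extra_capex_musd = 250                       # midpoint of the liquid-over-air premium
payback_years = extra_capex_musd / savings   # ~3.6 years at midpoint assumptions
```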

Part 5: The Verdict — Cooling = Unglamorous but Essential

Nobody writes headlines about cooling. But without it, $400M GPU clusters become space heaters.

The picks-and-shovels thesis:

  • Vertiv: +800-1,000% since ChatGPT launch (infrastructure winner)
  • Schneider Electric: Data center segment growing 20-30% YoY
  • Cooling = 15-20% of data center capex (non-trivial)

The transition is inevitable:

  • Blackwell requires liquid (1,000W/chip)
  • Next-gen GPUs will be even hotter (1,200-1,500W projected)
  • Air cooling relegated to legacy/inference workloads
  • Liquid becomes standard by 2028

Infrastructure players capture steady returns while AI apps burn cash searching for business models.

What's Next in the Series

Post 7 (FINAL POST OF SECTION 1): Who Pays? — The $220B Capex Explosion

Microsoft, Google, Amazon, Meta spending $220 billion collectively in 2025. Where does it all go?

What we'll cover:

  • Hyperscaler capex breakdown (GPUs 40-50%, networking 20-30%, power/cooling 15-20%, buildings 10-15%)
  • OpenAI's $6B annual burn (mostly compute costs)
  • When does ROI kick in? (Azure AI revenue growing, but not yet profitable)
  • The coming capex taper? (2027-2028 risk if AI revenue doesn't materialize)

This completes Section 1: Terrestrial Foundation!

Then Section 2: The Power Solution (SMR nuclear, grid expansion)

SOURCES

GPU Heat Specifications:

  • NVIDIA product datasheets: H100, H200, Blackwell TDP (thermal design power)

Cooling Technology:

  • Vertiv, Schneider Electric product documentation (direct liquid, immersion systems)
  • Industry reports (Uptime Institute, Data Center Dynamics): PUE benchmarks, adoption rates

Company Financials:

  • Vertiv quarterly earnings (2025): Revenue growth, stock performance
  • Schneider Electric annual reports: Data center segment revenue

Cost Estimates:

  • Industry sources (JLL, CBRE): Data center construction costs, cooling capex breakdowns
