The Networking Layer
Post 5: Terrestrial Foundation
Moving Petabytes Between GPUs — The 20-30% Nobody Talks About
By Randy Gipe | March 2026
But AI training isn’t just about individual chips. It’s about connecting thousands of GPUs so they can work together.
Training GPT-4 reportedly required moving petabytes of data between 25,000+ GPUs. Every microsecond of latency matters. Every dropped packet kills performance.
And networking—switches, cables, optics—costs 20-30% as much as the GPUs themselves.
This is the invisible layer that makes or breaks AI infrastructure.
Part 1: Why AI Needs Massive Networking
The Data Movement Problem
Traditional computing: CPU does work locally, occasionally fetches data from memory or storage.
AI training: Thousands of GPUs constantly exchanging model weights, gradients, activations.
🔄 HOW AI TRAINING USES NETWORKING
The process (simplified):
- Model parallelism: Different GPUs hold different parts of a large model (GPT-4, Claude, Gemini too big to fit on one GPU)
- Data parallelism: Different GPUs process different training batches simultaneously
- After each training step: All GPUs must synchronize (exchange gradients to update model weights)
- Result: Constant all-to-all communication between thousands of GPUs
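The synchronization step in the list above is usually an all-reduce. As a rough illustration (not NVIDIA's or any framework's actual implementation), here is a pure-Python sketch of ring all-reduce, the pattern NCCL-style libraries use: each GPU exchanges data only with its ring neighbors, yet after 2(n−1) steps every GPU holds the summed gradients.

```python
# Toy ring all-reduce over plain Python lists. Each "GPU" holds a list of
# gradient values; afterwards every list holds the element-wise sum.
# Simplified sketch -- real collectives (e.g. NCCL) pipeline chunks over
# physical links and overlap the phases.

def ring_allreduce(grads):
    n = len(grads)                                   # GPUs in the ring
    bounds = [len(grads[0]) * i // n for i in range(n + 1)]

    def chunk(c):                                    # slice for chunk c
        return slice(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. After n-1 rotations, GPU r owns the fully
    # summed chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        data = {r: grads[r][chunk(c)] for r, c in sends}   # snapshot
        for r, c in sends:
            dst = (r + 1) % n
            grads[dst][chunk(c)] = [a + b for a, b in
                                    zip(grads[dst][chunk(c)], data[r])]

    # Phase 2: all-gather. Rotate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        data = {r: grads[r][chunk(c)] for r, c in sends}
        for r, c in sends:
            grads[(r + 1) % n][chunk(c)] = data[r]
    return grads

print(ring_allreduce([[1.0, 2.0], [3.0, 4.0]]))   # -> [[4.0, 6.0], [4.0, 6.0]]
```

Each GPU pushes roughly 2× the gradient volume through its own link per step, regardless of cluster size — which is exactly why per-GPU bandwidth and latency dominate the economics below.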
Bandwidth requirements:
- Training GPT-4 class model: Moving 10-100+ TB/hour between GPUs
- Per GPU: 200-400 Gbps (gigabits per second) network links
- Latency critical: Every microsecond of delay = slower training = higher cost
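To make those bandwidth figures concrete, here is a back-of-envelope sketch. Every input is an assumption chosen for illustration (a 1-trillion-parameter model, fp16 gradients, one 400 Gbps NIC per GPU, pure data parallelism), not a measurement from any real cluster:

```python
# Back-of-envelope gradient-sync math. All inputs are assumptions.

params = 1.0e12                      # assumed model size: 1T parameters
grad_bytes = params * 2              # fp16 = 2 bytes/gradient -> 2 TB

link_bytes_per_s = 400e9 / 8         # one 400 Gbps link = 50 GB/s

# A ring all-reduce pushes ~2x the gradient volume through each GPU's link:
sync_s = 2 * grad_bytes / link_bytes_per_s

print(f"Gradients per step: {grad_bytes / 1e12:.0f} TB")
print(f"Naive all-reduce time on one 400G link: {sync_s:.0f} s")
```

The naive answer (80 seconds per step) is why real clusters layer NVLink inside nodes, shard optimizer state, and overlap communication with compute — but the underlying bandwidth bill never goes away.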
Why this matters for costs:
- 10,000 H100 GPUs ≈ $300-400M in chips (at roughly $30-40k each)
- Networking (switches, cables, optics) = $60-120M (roughly 20-30% of GPU cost at scale)
- If networking is slow, GPUs sit idle waiting for data → wasted money
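That idle-GPU point converts directly into dollars. A sketch with assumed figures — a $400M cluster amortized over four years, GPUs blocked on the fabric 15% of the time; both numbers are invented for illustration:

```python
# Cost of network-induced GPU idle time. All figures are assumptions.

cluster_cost = 400e6                 # GPUs + networking, assumed
amort_years = 4
stall_fraction = 0.15                # assumed share of time GPUs wait on the fabric

hourly_burn = cluster_cost / (amort_years * 365 * 24)
wasted_per_year = (cluster_cost / amort_years) * stall_fraction

print(f"Cluster burn rate: ${hourly_burn:,.0f}/hour")
print(f"Capex wasted on network stalls: ${wasted_per_year / 1e6:.0f}M/year")
```

At these assumptions the fabric stalls burn roughly $15M of capex per year — comparable to the cost of the switches themselves.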
InfiniBand vs. Ethernet — The Architecture War
Two competing standards for GPU interconnects:
| Technology | Leader | Bandwidth | Latency | Cost | AI Use |
|---|---|---|---|---|---|
| InfiniBand | NVIDIA (Mellanox) | 400-800 Gbps | ~1 μs | High | Training (dominant) |
| Ethernet | Arista, Broadcom, Cisco | 100-400 Gbps | ~5-10 μs | Medium | Inference, general |
Why InfiniBand dominates AI training:
- Lower latency: 1 microsecond vs. 5-10 microseconds (critical for tight GPU synchronization)
- RDMA (Remote Direct Memory Access): GPUs can read/write each other's memory directly (no CPU overhead)
- NVIDIA integration: H100/H200/Blackwell designed to work optimally with NVIDIA InfiniBand switches
Why Ethernet fights back:
- Lower cost: Commodity standard, multiple vendors compete
- Flexibility: Works with any server/GPU (not locked to NVIDIA ecosystem)
- Improving: Ultra Ethernet Consortium (UEC) working on AI-optimized Ethernet specs
Current split (2026):
- Training clusters: 70-80% InfiniBand (NVIDIA dominance)
- Inference deployments: 60-70% Ethernet (cost/flexibility matter more)
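The latency gap in the table compounds because collectives chain many sequential hops. A sketch with assumed numbers — 1,024 GPUs in a single ring, 100 ms of compute per training step; illustrative, not benchmarked:

```python
# Sensitivity of a training step to fabric latency. Assumed topology:
# one ring all-reduce across n_gpus, plus a fixed compute time per step.

def step_overhead(latency_us, n_gpus=1024, compute_ms=100.0):
    hops = 2 * (n_gpus - 1)          # sequential hops in a ring all-reduce
    sync_ms = latency_us * hops / 1000
    return sync_ms, sync_ms / (sync_ms + compute_ms)

ib_ms, ib_frac = step_overhead(1.0)    # InfiniBand-class: ~1 us/hop
eth_ms, eth_frac = step_overhead(7.0)  # generic Ethernet: ~5-10 us/hop

print(f"InfiniBand: {ib_ms:.1f} ms sync ({ib_frac:.1%} of each step)")
print(f"Ethernet:   {eth_ms:.1f} ms sync ({eth_frac:.1%} of each step)")
```

Under these assumptions a ~1 μs fabric costs about 2% of every step while a ~7 μs fabric costs about 12% — that difference is the "idle GPU" tax from the cost section above.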
Part 2: The Networking Winners
NVIDIA (Mellanox) — Vertical Integration
2020: NVIDIA acquired Mellanox for $6.9 billion.
Why it mattered:
- Mellanox = #1 InfiniBand supplier (80%+ market share)
- NVIDIA now controls both the GPUs AND the networking connecting them
- Can optimize end-to-end (GPU ↔ switch ↔ GPU performance tuned together)
🔌 NVIDIA NETWORKING REVENUE
FY2024 (Jan 2024):
- Networking revenue: ~$11B (18% of total $60.9B NVIDIA revenue)
- InfiniBand switches, ConnectX NICs (network interface cards), cables, optics
FY2025 (projected):
- Networking revenue: ~$20-25B (15-19% of $130B+ total)
- Growing alongside GPU sales (every H100/Blackwell cluster needs networking)
Margins:
- Similar to GPUs (~70-75% gross margins)
- Monopoly pricing power (InfiniBand lock-in for training)
Why this creates a moat:
- Customers buying H100s automatically buy NVIDIA networking (integrated ecosystem)
- Switching to AMD GPUs harder because networking also needs replacement
- NVIDIA captures 20-30% more revenue per cluster than just selling GPUs
Arista Networks — The Ethernet Champion
📡 ARISTA NETWORKS
What they do:
- High-performance Ethernet switches for data centers
- Focus: Cloud-scale networking (AWS, Microsoft, Meta top customers)
Revenue (2025):
- ~$7B annual revenue (up 30-40% YoY, AI-driven)
- Gross margins: ~60-65% (excellent for networking hardware)
AI strategy:
- 400G/800G Ethernet switches optimized for AI workloads
- Partnering with hyperscalers to build AI-specific Ethernet fabrics
- Lower cost than InfiniBand → targets inference, hybrid training
Stock performance:
- Nov 2022 (ChatGPT launch): ~$120
- March 2026: ~$300-350
- +150-190% gain (AI boom direct beneficiary)
Why Arista wins in Ethernet:
- Cloud providers prefer multi-vendor (avoid NVIDIA lock-in)
- Software-defined networking (EOS operating system = flexibility)
- Proven at hyperscale (AWS backbone runs on Arista)
Broadcom — The Chip Inside the Switch
💻 BROADCOM
What they do:
- Network switch silicon (chips that power Arista, Cisco, others' switches)
- Optical transceivers, custom AI accelerators
AI networking revenue (2025):
- ~$12B from networking/custom AI chips (part of $50B+ total revenue)
- Tomahawk/Jericho switch chips inside most Ethernet data center switches
Custom AI silicon:
- Google TPU chips co-designed with Broadcom (long-running design partnership; fabrication is at TSMC)
- Meta, ByteDance custom AI chips also Broadcom partnerships
- Revenue: $5-7B annually from custom AI accelerators
Why Broadcom matters:
- Arista/Cisco switches use Broadcom chips (Broadcom wins regardless of who sells switches)
- Diversified: Networking + custom AI silicon + software (VMware acquisition)
- Margins: ~60-70% on networking chips
Cisco (Acacia) — Long-Haul Coherent Optics
2021: Cisco acquired Acacia Communications, a coherent-optics specialist, for roughly $4.5 billion.
Why optics matter:
- Within data center: Copper cables + active optical cables (short distance)
- Between data centers: coherent pluggable optics (400G/800G modules)
- Hyperscalers training large models across multiple data centers (geo-distributed)
Use case:
- Microsoft trains models across Virginia + Iowa data centers (latency-tolerant stages)
- Needs 400-800 Gbps optical links between sites
- Coherent modules: $5,000-15,000 each, thousands needed per cluster
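The module economics above multiply out quickly. A one-liner using the $5,000-15,000 price range from the text, with an assumed count of 2,000 modules per site pair (the text says "thousands"; the exact count is invented):

```python
# Inter-site optics spend. Module count is an assumption; prices are the
# $5k-$15k range quoted in the text.

modules = 2_000                      # assumed coherent modules per site pair
low, high = 5_000, 15_000            # $ per pluggable module
print(f"Optics bill: ${modules * low / 1e6:.0f}M - ${modules * high / 1e6:.0f}M")
```

Even at the low end, a single geo-distributed training pairing carries a $10M+ optics line item.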
Revenue impact:
- Cisco networking revenue: ~$15B annually (stable but slow growth historically)
- The acquired optics business adds $1-2B of high-margin revenue (AI-driven growth)
Part 3: The Cost Breakdown
What Does Networking Cost in an AI Cluster?
💰 EXAMPLE: 10,000 GPU CLUSTER (H100)
GPU cost:
- 10,000 H100 GPUs × $30,000 = $300M
Networking cost (InfiniBand):
1. Network interface cards (NICs):
- 10,000 GPUs ÷ 8 GPUs/server = 1,250 servers
- Each server: 4-8 ConnectX-7 NICs (400 Gbps each) = $3,000-6,000/server
- Total NICs: $4-8M
2. Switches (leaf + spine architecture):
- Leaf switches: 40-80 units × $100k-200k = $4-16M
- Spine switches: 10-20 units × $300k-500k = $3-10M
- Total switches: $7-26M
3. Cables + optics:
- Direct-attach copper (short runs): $200-500 each × thousands = $1-3M
- Active optical cables (longer runs): $1,000-3,000 each × thousands = $5-15M
- Pluggable optics (inter-rack): $2,000-10,000 each × hundreds = $2-5M
- Total cables/optics: $8-23M
Total networking cost: $19-57M
As percentage of GPU cost: 6-19%
But for larger clusters (50,000+ GPUs), networking complexity grows → 20-30% of GPU cost.
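The breakdown above collapses into a small cost model. Midpoints of the ranges in the text are used throughout; any real cluster would differ, so treat this purely as a sanity check:

```python
# Networking cost model for an InfiniBand cluster, using midpoints of the
# ranges in the text. Illustrative only -- not vendor pricing.

def cluster_network_cost(n_gpus, gpus_per_server=8):
    servers = n_gpus // gpus_per_server
    nics = servers * 4_500           # ~$3k-6k of ConnectX NICs per server
    leaf = 60 * 150_000              # ~40-80 leaf switches at $100k-200k
    spine = 15 * 400_000             # ~10-20 spine switches at $300k-500k
    cables = 15_000_000              # copper + AOC + pluggable optics (midpoint)
    return nics + leaf + spine + cables

gpu_cost = 10_000 * 30_000           # $300M of H100s
net_cost = cluster_network_cost(10_000)
print(f"GPUs: ${gpu_cost / 1e6:.0f}M, networking: ${net_cost / 1e6:.1f}M "
      f"({net_cost / gpu_cost:.0%} of GPU cost)")
```

Midpoints land around $36M, i.e. ~12% of GPU cost — inside the 6-19% range for a 10,000-GPU build. The 20-30% figure kicks in as cluster size grows and extra switch tiers and longer optical runs are added.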
Part 4: The Ultra Ethernet Consortium — Fighting NVIDIA
The Challenge to InfiniBand Dominance
July 2023: Ultra Ethernet Consortium (UEC) founded.
Members:
- AMD, Intel, Microsoft, Meta, Broadcom, Cisco, Arista, HPE
- Notably absent: NVIDIA
Goal:
- Develop Ethernet specifications optimized for AI workloads
- Match InfiniBand performance (low latency, RDMA-like features)
- Break NVIDIA's networking lock-in
Technical targets:
- Latency: Reduce from 5-10 μs → 1-2 μs (close to InfiniBand)
- Congestion control: AI-specific flow management
- RDMA over Ethernet: GPU-to-GPU direct memory access via Ethernet
Timeline:
- 2024-2025: Spec development
- 2026-2027: First Ultra Ethernet products shipping
- 2028+: Potential InfiniBand displacement (if performance matches)
Why this matters:
- Hyperscalers want alternatives to NVIDIA monopoly
- If Ethernet matches InfiniBand, customers save 30-50% on networking costs
- NVIDIA's networking revenue ($20-25B) at risk if Ultra Ethernet succeeds
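The stakes reduce to simple arithmetic. Applying the 30-50% savings claim above to an assumed ~$75M InfiniBand networking bill on a 10,000-GPU cluster (the $75M midpoint is an assumption, not a quote):

```python
# Per-cluster savings if Ultra Ethernet reaches InfiniBand parity.
# The $75M fabric cost is an assumed midpoint for a 10k-GPU cluster.

ib_fabric_cost = 75e6
for savings in (0.30, 0.50):
    print(f"At {savings:.0%} savings: "
          f"${ib_fabric_cost * savings / 1e6:.1f}M per cluster")
```

Roughly $22-38M per cluster, multiplied across every hyperscaler build — which is why the consortium's membership list is essentially "everyone except NVIDIA."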
NVIDIA's response:
- Pushing 800G InfiniBand (staying ahead on bandwidth)
- Tighter GPU-network integration (harder to replicate with generic Ethernet)
- Betting Ultra Ethernet won't achieve <2 μs latency at scale
Part 5: The Verdict — Networking = Hidden 20-30%
Everyone obsesses over GPUs. Networking is the invisible 20-30%.
Why it matters:
- Bottleneck: Slow networking = idle GPUs = wasted money
- Lock-in: NVIDIA networking reinforces GPU dominance
- Cost: $300M GPU cluster needs $60-90M networking (non-trivial)
- Winners: NVIDIA (InfiniBand), Arista (Ethernet), Broadcom (switch chips)
Picks-and-shovels thesis holds: Arista +150-190% since ChatGPT, NVIDIA networking $20-25B revenue.
What's Next in the Series
Post 6 (next): Cooling — The Unsexy Necessity
Blackwell GPUs generate 1,000W of heat each. Multiply by 10,000 GPUs = 10 MW of heat. How do you cool it?
What we'll cover:
- Air cooling → liquid cooling revolution (50% adoption in new builds)
- Immersion cooling (GPUs submerged in dielectric fluid)
- Vertiv, Schneider Electric: The cooling infrastructure winners
- Why cooling = 15-20% of data center capex
Then Post 7: Who Pays? — The $220B Capex Explosion (completes Section 1!)
SOURCES
Networking Technology:
- InfiniBand vs. Ethernet: Technical specs, vendor documentation (NVIDIA Mellanox, Arista)
- Ultra Ethernet Consortium: Official announcements, member list, technical roadmap
Company Financials:
- NVIDIA: FY2024/FY2025 earnings (networking revenue disclosed in 10-Qs)
- Arista Networks: Quarterly earnings (revenue growth, AI-driven bookings)
- Broadcom: Annual reports (networking + custom silicon revenue)
Cost Breakdowns:
- Industry reports (Omdia, Dell'Oro Group): Data center networking spend
- Vendor pricing: Publicly available list prices, confirmed via industry sources
