Ch 6 — Network Topologies for AI Clusters

Fat-tree, leaf-spine, rail-optimized, 3D torus — how to wire thousands of GPUs
AI Traffic Is Different
All-to-all communication patterns break traditional network designs
The Highway System Analogy
Traditional data center traffic is like a hub-and-spoke airline system — most traffic flows between servers and a central point (storage, internet gateway). You can oversubscribe the network because not everyone talks at once.

AI training traffic is like every car on the highway needing to visit every other car simultaneously. During an AllReduce operation, every GPU sends data to every other GPU at the same time. This is called all-to-all traffic.

Key differences from traditional data center traffic:
Volume: AI network traffic is ~3x general data center traffic
Pattern: All-to-all (every node talks to every node)
Tolerance: Near-zero packet loss required (even 0.1% loss causes massive slowdowns)
Burstiness: Traffic comes in synchronized bursts (all GPUs finish compute and communicate simultaneously)
Traffic Pattern Comparison
Traditional data center:
Pattern: client → server (N:1)
Oversubscription: 3:1 to 8:1
Loss tolerance: 0.1–1% (TCP retransmits)
Burst handling: buffer and queue

AI training cluster:
Pattern: all-to-all (N:N)
Oversubscription: 1:1 (non-blocking)
Loss tolerance: ~0% (RDMA fails on loss)
Burst handling: must sustain peak

DGX H100 cluster numbers:
GPUs per server: 8, each with its own 400G NIC (8 NICs per server)
Per-server aggregate: 8 × 400G = 3.2 Tb/s
1,000-GPU cluster: 125 servers × 3.2 Tb/s = 400 Tb/s total fabric capacity

The fabric must sustain 400 Tb/s of all-to-all traffic with near-zero packet loss. This is why AI networking is so expensive.
Key insight: You cannot oversubscribe an AI training network. In a traditional data center, you might build a 3:1 oversubscribed network because not all servers talk at once. In AI training, all GPUs communicate simultaneously during AllReduce. Any oversubscription creates a bottleneck that slows every GPU in the cluster.
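The arithmetic above can be sketched in a few lines of Python (the 8 NICs per server and 400G per NIC are the text's DGX H100 figures):

```python
def fabric_capacity_tbps(num_gpus, gpus_per_server=8, nic_gbps=400):
    """Total capacity a non-blocking fabric must sustain for all-to-all traffic."""
    servers = num_gpus // gpus_per_server
    per_server_tbps = gpus_per_server * nic_gbps / 1000   # 8 x 400G = 3.2 Tb/s
    return servers * per_server_tbps

print(fabric_capacity_tbps(1000))   # 400.0 (125 servers x 3.2 Tb/s)
```

Any oversubscription ratio applied to this number becomes the cluster-wide bottleneck, which is why the design target stays 1:1.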
Fat-Tree Topology
The classic HPC network design — full bisection bandwidth through hierarchical switching
How Fat-Tree Works
A fat-tree topology arranges switches in a tree hierarchy where the “trunk” (higher levels) has more bandwidth than the “branches” (lower levels) — hence “fat” tree.

Three tiers:
Leaf switches (bottom): Connect directly to servers. Each leaf switch connects to 32–64 server ports.
Spine switches (middle): Connect leaf switches together. Provide cross-rack connectivity.
Core switches (top): Connect spine switches for full-cluster connectivity.

The key property: full bisection bandwidth. If you split the cluster in half, the bandwidth between the two halves equals the total bandwidth of either half. This means no bottleneck, regardless of which GPUs need to communicate.

NVIDIA’s DGX SuperPOD uses a three-tier fat-tree with Quantum-2 InfiniBand switches at 400 Gb/s per port.
Fat-Tree Design
3-tier fat-tree (1,024 GPUs):

Tier 1 — Leaf (ToR) switches: 32 switches, each with 32 server ports + 32 uplinks, 400G per port
Tier 2 — Spine switches: 32 switches, each with 32 downlinks + 32 uplinks, 400G per port
Tier 3 — Core switches: 32 switches, each with 32 downlinks, 400G per port

Properties:
Bisection bandwidth: full (1:1)
Max hops: 5 (leaf→spine→core→spine→leaf)
Latency: ~2–4 μs
Switch count: 96 total

Full bisection bandwidth means any GPU can talk to any other at full line rate, with no congestion regardless of traffic pattern.
Key insight: Fat-tree is the gold standard for AI clusters because it guarantees full bisection bandwidth. The downside is cost — you need a lot of switches and cables. For a 1,000-GPU cluster, the networking infrastructure can cost $5–15 million, sometimes rivaling the cost of the GPUs themselves.
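A minimal sizing sketch of the 3-tier design above, assuming 64-port switches split evenly between downlinks and uplinks (this mirrors the layout in the table, not the general k-ary fat-tree formula):

```python
import math

def fat_tree_3tier(nic_ports, radix=64):
    """Size a 3-tier fat-tree where half of each leaf's ports face servers
    and half face spines, preserving 1:1 (non-blocking) bandwidth."""
    down = radix // 2
    leaves = math.ceil(nic_ports / down)
    spines = leaves          # one tier-2 switch per leaf in this layout
    cores = spines           # equal tier sizes, as in the design above
    return {"leaf": leaves, "spine": spines, "core": cores,
            "total": leaves + spines + cores}

print(fat_tree_3tier(1024))   # {'leaf': 32, 'spine': 32, 'core': 32, 'total': 96}
```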
Leaf-Spine: The Modern Standard
Two-tier simplification of fat-tree — easier to build, good enough for most clusters
Leaf-Spine Architecture
Leaf-spine is a two-tier simplification of the fat-tree. It removes the core tier, connecting every leaf switch directly to every spine switch:

Leaf switches: Top-of-rack (ToR) switches that connect to servers. Each leaf has uplinks to every spine switch.

Spine switches: Provide cross-rack connectivity. Every leaf connects to every spine, creating a non-blocking fabric.

Advantages over 3-tier fat-tree:
• Fewer switch tiers = lower latency (max 3 hops vs 5)
• Simpler cabling and management
• Easier to scale by adding more spine switches
• Lower cost (fewer switches total)

Limitation: Scales to ~500–2,000 servers before you need a third tier. For larger clusters, you add a “super-spine” tier, effectively becoming a fat-tree again.
Leaf-Spine Numbers
Leaf-spine for 512 GPUs:

Leaf (ToR) switches: 16 switches (64-port, 400G), each with 32 server ports + 32 uplinks; 4 servers per leaf (8 GPUs each)
Spine switches: 32 switches (32-port, 400G), each connecting to all 16 leaves

Properties:
Max hops: 3 (leaf→spine→leaf)
Latency: ~1.5–3 μs
Bisection bandwidth: 1:1 (non-blocking)
Switch count: 48 total

Switch capacity (2025):
Broadcom Tomahawk 5: 51.2 Tb/s
NVIDIA Spectrum-4: 51.2 Tb/s
Ports: 64× 800G or 128× 400G

Modern 51.2 Tb/s switches can handle 64 servers at 800G each on a single leaf, dramatically reducing switch count.
Key insight: Leaf-spine is the most common topology for AI clusters up to ~2,000 GPUs. It’s simpler and cheaper than a full fat-tree while still providing non-blocking bandwidth. Most cloud providers (AWS, Azure, GCP) use leaf-spine variants for their GPU clusters. Only the very largest clusters (10,000+ GPUs) need three tiers.
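The 512-GPU sizing above can be reproduced with a short sketch, assuming one 400G NIC port per GPU and 64-port leaves split half down, half up, with one uplink from every leaf to every spine:

```python
import math

def leaf_spine(nic_ports, leaf_radix=64):
    """Size a non-blocking leaf-spine fabric: half the leaf ports face
    servers, and each uplink port connects to a distinct spine switch."""
    down = leaf_radix // 2
    leaves = math.ceil(nic_ports / down)
    spines = leaf_radix - down       # one spine per uplink port on a leaf
    return {"leaves": leaves, "spines": spines, "total": leaves + spines}

print(leaf_spine(512))   # {'leaves': 16, 'spines': 32, 'total': 48}
```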
Rail-Optimized Topology
NVIDIA’s innovation — align network rails with GPU positions for optimal AI traffic
The Train Track Analogy
In a standard leaf-spine, all 8 GPUs in a server connect to the same leaf switch. But AI training traffic is mostly GPU-to-same-position-GPU across servers (GPU 0 talks to GPU 0 in other servers, GPU 1 to GPU 1, etc.).

Rail-optimized topology exploits this pattern. Instead of connecting all GPUs to one switch, each GPU connects to a dedicated “rail” — a separate leaf switch shared only with the same-position GPUs from other servers.

Think of it like 8 parallel train tracks, one for each GPU position. GPU 0 from every server rides rail 0. GPU 1 rides rail 1. Traffic stays on its own rail, avoiding cross-rail congestion.

This means no GPU is more than one hop away from any same-position GPU in the cluster. For AllReduce operations (which dominate training), this is optimal.
Rail-Optimized Design
Standard leaf-spine:
All 8 GPUs → 1 leaf switch
Cross-server traffic: 2–3 hops
Congestion at leaf uplinks

Rail-optimized:
GPU 0 from all servers → Rail 0
GPU 1 from all servers → Rail 1
...
GPU 7 from all servers → Rail 7

Benefits:
Same-position GPU traffic: 1 hop
No cross-rail congestion
Predictable latency
Optimal for AllReduce

DGX H100 SuperPOD uses this design: 8 rails per Scalable Unit, each rail served by a dedicated leaf switch with 100 Tb/s of ToR switch capacity.

Limitation: cross-position GPU traffic (GPU 0 → GPU 3 on another server) still needs spine hops, so rail optimization works best when the traffic pattern matches the rail alignment. NVIDIA designs its training frameworks to match rail topology, so traffic naturally aligns.
Key insight: Rail-optimized topology is a co-design of hardware and software. NVIDIA designs both the network topology AND the communication patterns in their training frameworks (NCCL) to match. This co-design is why NVIDIA clusters often outperform competitors with similar hardware — the network and software are optimized together.
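A toy model of the hop counts described above; the 0/1/3-hop values assume the rail layout in this section and ignore failures and rerouting:

```python
def hops(src, dst):
    """Switch hops between two GPUs in a rail-optimized fabric.
    src/dst are (server, gpu_position) tuples."""
    s_srv, s_pos = src
    d_srv, d_pos = dst
    if s_srv == d_srv:
        return 0            # same server: NVLink, never touches the fabric
    if s_pos == d_pos:
        return 1            # same rail: one shared leaf switch
    return 3                # cross-rail: leaf -> spine -> leaf

print(hops((0, 3), (5, 3)))   # 1: GPU 3 to GPU 3 stays on rail 3
print(hops((0, 0), (5, 3)))   # 3: crosses rails via the spine
```

This is why NCCL-style AllReduce traffic, which is dominated by same-position exchanges, benefits so directly from rail alignment.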
3D Torus: Google’s TPU Approach
No switches at all — each chip connects directly to its 6 neighbors
The Rubik’s Cube Analogy
Google’s TPU pods use a radically different approach: no network switches at all. Instead, each TPU chip has direct connections to its 6 neighbors (up, down, left, right, front, back) in a 3D grid. The edges wrap around, forming a torus.

Think of it like a Rubik’s Cube where each small cube is a TPU chip, and each face connects to the adjacent cube. To send data from one corner to the opposite corner, it hops through intermediate chips.

TPU v5p pod topology: Up to 8,960 chips arranged in a 16×20×28 3D torus. Each chip has dedicated inter-chip interconnect (ICI) links to its neighbors.

Advantages: No expensive switches, very low latency for nearest-neighbor communication, natural fit for data-parallel and model-parallel training where communication is mostly local.

Disadvantage: Long-distance communication requires many hops. Worst-case latency grows with cluster size.
3D Torus vs Switch-Based
Google TPU v5p pod:
Topology: 3D torus
Max chips: 8,960
Dimensions: 16 × 20 × 28
Links per chip: 6 (±x, ±y, ±z)
ICI bandwidth: 2,765 GB/s per chip
Switches: 0

Trillium (v6):
ICI bandwidth: 2× v5e
Pod size: 256 chips
Jupiter fabric: 100K+ chips, 13 Pb/s bisection bandwidth

Pros vs switch-based:
✓ No switch cost
✓ Low nearest-neighbor latency
✓ Natural for grid-based communication
✓ Scales with chip count

Cons vs switch-based:
✗ Multi-hop for distant chips
✗ Latency grows with distance
✗ Harder to route arbitrary patterns
✗ Requires topology-aware software
Key insight: Google’s 3D torus works because they control the entire stack — chip, interconnect, software (JAX/XLA), and training frameworks. The compiler automatically maps computation to the physical topology, minimizing long-distance communication. This vertical integration is why TPU pods achieve excellent scaling efficiency despite the simpler topology.
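The hop-count behavior can be illustrated with wraparound Manhattan distance, assuming the v5p's published 16 × 20 × 28 dimensions:

```python
def torus_hops(a, b, dims=(16, 20, 28)):
    """Minimum hop count between two chips in a 3D torus; wraparound
    links let traffic take the short way around each ring."""
    return sum(min(abs(x - y), n - abs(x - y))
               for x, y, n in zip(a, b, dims))

print(torus_hops((0, 0, 0), (15, 19, 27)))   # 3 (opposite corner, via wraparound)
print(torus_hops((0, 0, 0), (8, 10, 14)))    # 32 (true worst case: mid-torus)
```

The worst case is not the far corner (wraparound makes corners adjacent) but the chip halfway around every ring, which is why worst-case latency grows with each torus dimension.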
Real-World: DGX SuperPOD Architecture
NVIDIA’s reference architecture for large-scale AI training clusters
SuperPOD Design
The NVIDIA DGX SuperPOD is the reference architecture for building large AI training clusters. It’s what companies like Meta, Microsoft, and Oracle use:

Building block — DGX H100: 8 H100 GPUs connected by NVSwitch (900 GB/s per GPU). Each GPU has a 400G InfiniBand NIC.

Scalable Unit (SU): 32 DGX H100 systems (256 GPUs) connected by a rail-optimized InfiniBand fabric. 8 leaf switches (one per rail) + spine switches.

SuperPOD: Multiple Scalable Units connected by a spine/super-spine fabric. Scales to 16 SUs = 4,096 GPUs per SuperPOD.

GB200 NVL72 variant: Each rack has 72 GPUs connected by NVLink (130 TB/s). Racks connect via 400G/800G InfiniBand or Ethernet. The NVLink domain is much larger (72 vs 8 GPUs), reducing the need for cross-rack communication.
SuperPOD Numbers
DGX H100 SuperPOD:

Per DGX node (8 GPUs):
NVLink bandwidth: 900 GB/s per GPU
InfiniBand: 8× 400G NICs
Power: ~10.2 kW

Per Scalable Unit (256 GPUs):
DGX nodes: 32
Leaf switches: 8 (rail-optimized)
Spine switches: 8
Power: ~330 kW

Full SuperPOD (4,096 GPUs):
Scalable Units: 16
Total switches: ~256+
Total cables: thousands
Power: ~5.3 MW
Cost: ~$150–200M

The networking infrastructure (switches, cables, optics) can cost 15–25% of the total cluster cost. Not a rounding error.
Key insight: A DGX SuperPOD is not just GPUs — it’s a carefully engineered system where compute, networking, storage, power, and cooling are all designed together. The networking alone (switches, cables, optics, NICs) can cost $20–50 million for a 4,096-GPU cluster. This is why “just buy more GPUs” is never the full answer.
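A quick roll-up of the per-Scalable-Unit figures above (illustrative only; real deployments vary in node counts and power draw):

```python
def superpod_totals(sus=16, gpus_per_su=256, su_power_kw=330):
    """Roll per-Scalable-Unit figures up to a full SuperPOD."""
    return {"gpus": sus * gpus_per_su,
            "power_mw": round(sus * su_power_kw / 1000, 2)}

print(superpod_totals())   # {'gpus': 4096, 'power_mw': 5.28}
```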
Network Failures: The Silent Killer
10.7% of significant GPU job failures are caused by network issues
Failure Modes
At scale, network failures are one of the biggest operational challenges. Research on large GPU clusters shows that 10.7% of significant job failures are caused by network issues:

Link failures: Individual cables or optics fail. In a cluster with thousands of cables, this happens daily. The fabric must route around failures without dropping traffic.

Switch failures: A leaf or spine switch goes down. All servers connected to that switch lose connectivity. Redundant paths must absorb the traffic.

Congestion hotspots: Topology-dependent congestion occurs when multiple flows compete for the same switch port. Even with non-blocking design, hash collisions in ECMP (Equal-Cost Multi-Path) routing can create hotspots.

Configuration errors: Misconfigured PFC thresholds, incorrect ECN settings, or wrong routing tables. These cause subtle performance degradation that’s hard to diagnose — training runs slower but doesn’t fail outright.
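The ECMP hotspot effect mentioned above can be illustrated with a balls-in-bins simulation, modeling a static hash as a uniform random path choice (a simplification; real hashes are deterministic per flow, but collide the same way):

```python
import random

def ecmp_busiest_path(flows=64, paths=64, trials=2000, seed=42):
    """Average load on the busiest path when a hash spreads `flows`
    flows across `paths` equal-cost paths."""
    rng = random.Random(seed)
    total_worst = 0
    for _ in range(trials):
        load = [0] * paths
        for _ in range(flows):
            load[rng.randrange(paths)] += 1   # each flow hashes to one path
        total_worst += max(load)
    return total_worst / trials

# With 64 flows on 64 equal-cost paths, the busiest path typically carries
# 4-5 flows (several times its fair share) while other paths sit idle.
print(ecmp_busiest_path())
```

This is why even a nominally non-blocking fabric can show transient hotspots under ECMP, and why adaptive routing and flow-aware load balancing matter for AI fabrics.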
Impact at Scale
Meta Llama 3 training (16K GPUs):
Total interruptions: 466 over 54 days (419 of them unexpected)
Unexpected failures per day: 7.76
Network-related: ~10.7%, roughly 50 events

Impact of network issues:

Link flap (1 cable):
Affected GPUs: 8 (one server)
Recovery time: ~30 seconds
Training impact: checkpoint + restart

Switch failure (1 leaf):
Affected GPUs: 32–64
Recovery time: ~5–15 minutes
Training impact: full job restart

Congestion (subtle):
Affected GPUs: all
Detection time: hours to days
Impact: 5–20% slower training

Subtle congestion is the worst: training doesn't fail, it just runs slower. You might not notice for days, wasting millions.
Key insight: Network monitoring and observability are critical at scale. You need real-time visibility into per-port utilization, packet drops, latency histograms, and ECMP hash distribution. Many organizations underinvest in network monitoring tools, leading to silent performance degradation that wastes far more money than the monitoring would cost.
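To put a number on "wasting millions", a rough sketch; the $2/GPU-hour rate and one-week detection window are illustrative assumptions, not figures from the text:

```python
def silent_slowdown_cost(gpus=16_000, gpu_hour_usd=2.0, slowdown=0.10, days=7):
    """Rough cost of an undetected training slowdown: the slowed fraction
    of every GPU-hour is effectively wasted spend."""
    return gpus * 24 * days * gpu_hour_usd * slowdown

print(f"${silent_slowdown_cost():,.0f}")   # $537,600
```

Even a 10% slowdown caught within a week costs half a million dollars at these assumed rates, which is the economic case for per-port monitoring.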
Choosing a Topology
Match your network design to your cluster size, budget, and workload
Topology Decision Guide
Small cluster (8–64 GPUs):
Topology: simple leaf-spine
Switches: 2–4 leaf + 2–4 spine
Fabric: 400G Ethernet or InfiniBand
Networking cost: $50K–200K

Medium cluster (64–512 GPUs):
Topology: leaf-spine or rail-optimized
Switches: 8–32 leaf + 8–32 spine
Fabric: 400G InfiniBand or RoCEv2
Networking cost: $500K–5M

Large cluster (512–4K GPUs):
Topology: rail-optimized fat-tree
Switches: 100+ total
Fabric: 400G/800G InfiniBand
Networking cost: $5M–50M

Massive cluster (4K+ GPUs):
Topology: 3-tier fat-tree or multi-rail SuperPOD
Fabric: 800G InfiniBand or custom
Networking cost: $50M+
Key Design Principles
1. Never oversubscribe for training. AI training requires full bisection bandwidth. Any oversubscription creates a bottleneck that affects every GPU.

2. Match topology to traffic pattern. Rail-optimized for AllReduce-heavy training. Standard leaf-spine for mixed workloads. 3D torus if you control the full stack.

3. Plan for 800G. Current 400G fabrics will be upgraded to 800G within 2 years. Design your cabling and switch placement to support the upgrade.

4. Invest in monitoring. Network issues cause 10%+ of job failures. Real-time per-port monitoring pays for itself quickly.

5. Budget 15–25% for networking. If your GPU budget is $100M, plan $15–25M for networking. Skimping on the network wastes GPU investment.
Key insight: The network is the nervous system of an AI cluster. You can have the fastest GPUs in the world, but if they can’t communicate efficiently, they’ll spend more time waiting than computing. Network topology is not an afterthought — it’s a first-class design decision that determines cluster performance.