Ch 5 — Interconnects: How GPUs Talk

NVLink, NVSwitch, PCIe, InfiniBand, RoCEv2 — the bandwidth hierarchy
Why GPUs Need to Talk
Models don’t fit on one GPU — so GPUs must share data constantly
The Team Project Analogy
Imagine a team of 8 people working on a massive jigsaw puzzle. Each person has a section, but the pieces at the edges need to match. Every few minutes, they need to compare edge pieces with their neighbors.

If they’re sitting at the same table (fast interconnect), sharing is instant. If they’re in different rooms connected by a hallway (slow interconnect), every comparison takes minutes. The puzzle gets done at the speed of the slowest communication, not the fastest worker.

This is exactly what happens in distributed AI training. A 70B model split across 8 GPUs needs constant synchronization — gradients must be averaged (AllReduce), activations must be passed between layers (pipeline parallelism), and tensor slices must be gathered (tensor parallelism).

The speed of GPU-to-GPU communication directly determines training speed. Slow interconnects mean GPUs spend more time waiting than computing.
Communication Patterns in AI
Data Parallelism (AllReduce): every GPU holds a full model copy; after each batch, gradients are averaged across all GPUs. Data moved: ~2× model_size per training step.

Tensor Parallelism (AllGather): model layers are split across GPUs; each forward pass gathers partial results from all GPUs. Data moved: ~activations_size per layer, per step.

Pipeline Parallelism: model layers sit on different GPUs; activations are passed GPU→GPU. Data moved: ~batch × hidden_dim per pipeline stage.

For a 70B model on 8 GPUs, AllReduce moves ~280 GB of FP32 gradients per step. At 1,800 GB/s (NVLink 5.0): ~0.16 sec. At 128 GB/s (PCIe 5.0): ~2.2 sec. That's a 14x difference.
Key insight: In distributed training, the interconnect is often the bottleneck, not the GPU compute. A cluster of 8 GPUs connected by slow PCIe can be slower than 4 GPUs connected by fast NVLink. Communication speed determines how efficiently you can scale to more GPUs.
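The transfer-time arithmetic above can be sketched in a few lines. The bandwidth figures are the nominal round numbers quoted in this chapter, not measurements, and the model ignores latency and overlap:

```python
# Per-step AllReduce time for a 70B model's FP32 gradients (~280 GB),
# at the nominal bidirectional bandwidths quoted in this chapter.
PARAMS = 70e9
GRADIENT_BYTES = PARAMS * 4          # FP32: 4 bytes per parameter

LINKS_GBPS = {
    "NVLink 5.0": 1800,              # GB/s
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 128,
}

for name, gbps in LINKS_GBPS.items():
    seconds = GRADIENT_BYTES / (gbps * 1e9)
    print(f"{name:>12}: {seconds:.2f} s per AllReduce")
```

Even this idealized model shows the order-of-magnitude gap that dominates multi-GPU step time.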
PCIe: The Baseline
Universal but slow — the minimum connection every GPU has
PCIe in AI Systems
PCI Express (PCIe) is the standard interface that connects GPUs to the CPU and to each other in consumer and entry-level server systems. Every GPU has a PCIe connection.

PCIe 4.0 x16: 32 GB/s per direction, 64 GB/s bidirectional. Used in older servers and consumer PCs.

PCIe 5.0 x16: 64 GB/s per direction, 128 GB/s bidirectional. Current standard in modern servers.

PCIe 6.0 x16: 128 GB/s per direction, 256 GB/s bidirectional. Emerging in 2025–2026.

PCIe is fine for single-GPU workloads (loading model weights from CPU memory to GPU) but becomes a severe bottleneck for multi-GPU communication. At 128 GB/s, transferring a 70B model’s gradients (280 GB in FP32) takes over 2 seconds — an eternity when training steps should take milliseconds.
PCIe Generation Comparison
PCIe bandwidth (x16 slot, per direction):
• Gen 3.0: 16 GB/s
• Gen 4.0: 32 GB/s
• Gen 5.0: 64 GB/s
• Gen 6.0: 128 GB/s

Compare to NVLink 5.0: 900 GB/s per direction vs PCIe 5.0's 64 GB/s, a 14x difference.

PCIe use cases in AI:
✓ CPU ↔ GPU data transfer
✓ Single-GPU inference
✓ Loading model weights
✓ Storage I/O (GPUDirect)
✗ Multi-GPU training sync
✗ Tensor parallelism
✗ High-performance inference

PCIe is the "last resort" for GPU-to-GPU communication. NVLink is always preferred when available.
Key insight: PCIe is like a country road connecting two cities. It works, but it’s slow for heavy traffic. For multi-GPU AI workloads, you need a highway (NVLink) or a railroad (InfiniBand). PCIe remains important for CPU-GPU communication and storage I/O, but it’s never the right choice for GPU-to-GPU data exchange at scale.
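Where PCIe does belong, the bandwidth table translates directly into load times. A sketch of staging a 70B model's FP16 weights (140 GB) from CPU memory to one GPU, per generation, under the idealized assumption of zero protocol overhead:

```python
# Time to stage a 70B model's FP16 weights (140 GB) from CPU memory to one
# GPU over a single x16 slot, per PCIe generation (per-direction bandwidth).
# Idealized: real transfers lose a few percent to protocol overhead.
WEIGHT_BYTES = 70e9 * 2                  # FP16: 2 bytes per parameter

PCIE_X16_GBPS = {"3.0": 16, "4.0": 32, "5.0": 64, "6.0": 128}

for gen, gbps in PCIE_X16_GBPS.items():
    print(f"PCIe {gen} x16: {WEIGHT_BYTES / (gbps * 1e9):5.2f} s to load weights")
```

A one-to-two-second load is perfectly acceptable once at startup; it is only when this class of transfer happens every training step that PCIe becomes the bottleneck.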
NVLink: The GPU Highway
Direct GPU-to-GPU connections at 14x PCIe speed
NVLink Architecture
NVLink is NVIDIA’s proprietary high-speed interconnect that connects GPUs directly to each other, bypassing the CPU and PCIe entirely. Think of it as a private highway between GPUs — dedicated lanes, no traffic lights, no speed limits.

Each NVLink generation has dramatically increased bandwidth:

NVLink 2.0 (Volta, 2017): 300 GB/s bidirectional per GPU. Connected up to 6 GPUs.

NVLink 3.0 (Ampere, 2020): 600 GB/s. 12 links per GPU.

NVLink 4.0 (Hopper, 2022): 900 GB/s. 18 links per GPU. Used in H100 and H200.

NVLink 5.0 (Blackwell, 2024): 1,800 GB/s (1.8 TB/s). 18 links at 100 GB/s each. Used in B200 and GB200.

NVLink 5.0 delivers 14x the bandwidth of PCIe 5.0. For tensor parallelism (which requires constant GPU-to-GPU data exchange), this difference is the difference between feasible and impossible.
NVLink Generation Comparison
NVLink evolution:
Gen | Year | BW per GPU | Links | Architecture
2.0 | 2017 | 300 GB/s | 6 | Volta
3.0 | 2020 | 600 GB/s | 12 | Ampere
4.0 | 2022 | 900 GB/s | 18 | Hopper
5.0 | 2024 | 1,800 GB/s | 18 | Blackwell

Practical impact: 70B model gradient sync via AllReduce (280 GB):
• PCIe 5.0: ~2.2 sec
• NVLink 4.0: ~0.31 sec
• NVLink 5.0: ~0.16 sec

Tensor parallelism communication overhead:
• PCIe 5.0: ~40–60% of step time
• NVLink 4.0: ~5–10% of step time
• NVLink 5.0: ~3–5% of step time

NVLink makes tensor parallelism practical. Without it, splitting a model across GPUs within a node would be too slow.
Key insight: NVLink is what makes multi-GPU training within a server practical. Without NVLink, you’d be limited to data parallelism (each GPU has a full model copy) which requires much more memory. NVLink enables tensor parallelism (splitting layers across GPUs) which is essential for models too large to fit on a single GPU.
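The "2× model_size" figure used earlier falls out of the standard ring AllReduce algorithm: a reduce-scatter phase followed by an all-gather phase, each putting (N-1)/N of the buffer on the wire per GPU. A minimal sketch of that textbook formula (real libraries such as NCCL add chunking and compute/communication overlap):

```python
# Ring AllReduce traffic multiplier: reduce-scatter plus all-gather,
# each phase sending (N-1)/N of the buffer from every GPU.
def ring_allreduce_factor(n_gpus: int) -> float:
    """How many times the buffer size each GPU transmits."""
    return 2 * (n_gpus - 1) / n_gpus

for n in (2, 4, 8, 64):
    print(f"{n:>3} GPUs: each GPU sends {ring_allreduce_factor(n):.2f}x the buffer size")
# The factor approaches 2 as the GPU count grows, which is where the
# "~2x model_size per step" rule of thumb comes from.
```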
NVSwitch: Full-Mesh GPU Connectivity
Every GPU talks to every other GPU at full bandwidth — no bottlenecks
The Telephone Exchange Analogy
NVLink connects pairs of GPUs. But in a server with 8 GPUs, you can’t directly connect every GPU to every other GPU — that would require 28 direct links (8 choose 2).

NVSwitch solves this. It’s a dedicated chip that acts like a telephone exchange — any GPU can talk to any other GPU at full NVLink bandwidth, simultaneously.

In a DGX H100 server (8 GPUs), 4 NVSwitch chips create a full-mesh fabric. Every GPU has 900 GB/s to every other GPU. Total bisection bandwidth: 3.6 TB/s.

The Blackwell GB200 NVL72 takes this further: 9 NVSwitch trays connect 72 GPUs in a single NVLink domain with 130 TB/s total bandwidth. Any GPU can access any other GPU’s memory as if it were local. This creates a single “virtual GPU” with 72 × 192 GB = 13.8 TB of unified memory.
NVSwitch Scaling
DGX H100 (8 GPUs):
• NVSwitch chips: 4
• GPU↔GPU bandwidth: 900 GB/s each
• Total bisection: 3.6 TB/s
• NVLink domain: 8 GPUs

GB200 NVL72 (72 GPUs):
• NVSwitch trays: 9
• GPU↔GPU bandwidth: 1,800 GB/s each
• Total bisection: 130 TB/s
• NVLink domain: 72 GPUs
• Unified memory: 13.8 TB

5th-gen NVSwitch (Blackwell) can connect up to 576 GPUs in a single NVLink domain with 1 PB/s total fabric bandwidth. 576 GPUs × 192 GB = 110 TB of unified GPU memory, enough to hold a 55-trillion-parameter model in FP16.
Key insight: NVSwitch transforms multiple discrete GPUs into what behaves like a single massive GPU. The 72-GPU GB200 NVL72 with 13.8 TB of unified memory can hold and serve the largest models without any model partitioning complexity. This is NVIDIA’s answer to “models keep getting bigger” — just make the GPU bigger.
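The link-count arithmetic behind the switch can be checked directly: a direct full mesh needs C(n, 2) point-to-point links, while a switched fabric needs only one connection per GPU. Memory totals use the 192 GB/GPU figure from the Blackwell examples above:

```python
# Full-mesh link counts vs switched fabric, plus unified memory totals
# at 192 GB per GPU (the Blackwell figure used in this chapter).
from math import comb

for n in (8, 72, 576):
    mesh_links = comb(n, 2)              # n choose 2 direct links
    unified_tb = n * 192 / 1000
    print(f"{n:>3} GPUs: full mesh = {mesh_links:>6} links, "
          f"unified memory = {unified_tb:,.1f} TB")
```

At 8 GPUs the 28-link mesh is merely awkward; at 72 GPUs the 2,556-link mesh is physically impossible, which is why a switch fabric is the only way to scale the NVLink domain.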
InfiniBand: The Data Center Fabric
When GPUs in different servers need to communicate — the gold standard
Beyond the Server
NVLink connects GPUs within a server. But training at scale requires hundreds or thousands of GPUs across many servers. This is where network fabric comes in.

InfiniBand is the traditional choice for HPC and AI clusters. Originally developed for supercomputers, it provides:

Ultra-low latency: 1–1.6 microseconds end-to-end. Cut-through switching means data starts forwarding before the full packet arrives.

Native RDMA: Remote Direct Memory Access lets one GPU read/write another GPU’s memory directly, bypassing the CPU entirely. Zero-copy, kernel-bypass.

Credit-based flow control: Guaranteed lossless delivery without the complexity of TCP congestion control.

Current speed: 400 Gb/s (NDR) per port, with 800 Gb/s (XDR) arriving in 2025–2026 via NVIDIA Quantum-X800 switches.
InfiniBand Specs
InfiniBand generations:
• HDR (2019): 200 Gb/s per port
• NDR (2022): 400 Gb/s per port
• XDR (2025): 800 Gb/s per port

InfiniBand characteristics:
• Latency: 1–1.6 μs
• RDMA: native (hardware)
• Flow control: credit-based
• Packet loss: near zero
• Vendor: NVIDIA (Mellanox)

DGX H100 cluster networking:
• 8 GPUs per server
• Each GPU: 1× 400G InfiniBand NIC
• Per server: 3.2 Tb/s aggregate
• 1:1 GPU-to-NIC ratio

Market position: in 2023, ~80% of AI clusters used InfiniBand; by 2025, Ethernet is taking the lead as RoCEv2 matures.
Key insight: InfiniBand’s 1–1.6 microsecond latency vs RoCEv2’s 5–6 microseconds matters enormously for distributed training. In AllReduce operations that happen thousands of times per training step, those extra microseconds compound. But InfiniBand costs 1.5–2.5x more per port than Ethernet, and it’s controlled by NVIDIA (via Mellanox acquisition).
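How microseconds compound is easy to sketch. The per-operation latencies are the figures quoted above; the number of collectives per step is a made-up round number for illustration (real counts depend on layer count, microbatching, and fusion):

```python
# Latency floors compound: a training step can issue thousands of small
# collectives (per layer, per microbatch). OPS_PER_STEP is a hypothetical
# round number; the latencies are the per-op figures quoted above.
OPS_PER_STEP = 2000

for fabric, latency_us in (("InfiniBand", 1.3), ("RoCEv2", 5.5)):
    per_step_ms = OPS_PER_STEP * latency_us / 1000
    print(f"{fabric:>10}: {per_step_ms:.1f} ms/step spent purely on latency")
# The ~8 ms/step gap looks small, but it is pure waiting: it cannot be
# hidden by more bandwidth, and it grows with the collective count.
```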
RoCEv2: Ethernet Fights Back
RDMA over standard Ethernet — cheaper, more familiar, and catching up fast
RoCEv2 Explained
RoCEv2 (RDMA over Converged Ethernet v2) brings RDMA capabilities to standard Ethernet networks. Instead of InfiniBand’s proprietary fabric, RoCEv2 carries RDMA traffic over UDP/IP on commodity Ethernet switches.

Advantages over InfiniBand:
• Cost: 1.5–2.5x cheaper per port
• Familiarity: network engineers already know Ethernet
• Vendor choice: multiple switch vendors (Arista, Cisco, Broadcom) vs NVIDIA-only for InfiniBand
• Convergence: same fabric for AI traffic and general data center traffic

Challenges:
• Higher latency (5–6 μs vs 1–1.6 μs)
• Requires careful tuning: Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and a congestion control scheme such as DCQCN
• Store-and-forward switching adds latency vs InfiniBand’s cut-through
• Packet loss under congestion requires more complex handling
InfiniBand vs RoCEv2
InfiniBand
Latency: 1–1.6 μs
RDMA: Native hardware
Loss: Near zero (credit-based)
Cost: 1.5–2.5x higher
Vendor: NVIDIA only
Best for: Top-tier training
RoCEv2 Ethernet
Latency: 5–6 μs
RDMA: Over UDP/IP
Loss: Requires PFC/ECN tuning
Cost: Baseline
Vendor: Multi-vendor
Best for: Cost-sensitive, hybrid
Key insight: The InfiniBand vs Ethernet debate is shifting. In 2023, InfiniBand dominated 80% of AI clusters. By 2025, Ethernet has taken the lead as hyperscalers (Google, Meta, Microsoft) validated RoCEv2 at massive scale. The Ultra Ethernet Consortium is standardizing AI-optimized Ethernet features. For most organizations, RoCEv2 at 400G/800G is “good enough” and significantly cheaper.
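The cost-vs-overhead tradeoff can be made concrete with a toy calculator. Every number here (cluster size, overhead fractions, $/GPU-hr) is a hypothetical placeholder, not a quoted price:

```python
# Toy fabric-choice calculator. All inputs are hypothetical placeholders.
def monthly_gpu_waste(n_gpus: int, comm_overhead: float,
                      usd_per_gpu_hr: float = 3.0) -> float:
    """Dollar value of GPU time lost to communication per 30-day month."""
    return n_gpus * comm_overhead * usd_per_gpu_hr * 24 * 30

cluster = 1024
ib_overhead, roce_overhead = 0.10, 0.15   # hypothetical fractions of step time
delta = (monthly_gpu_waste(cluster, roce_overhead)
         - monthly_gpu_waste(cluster, ib_overhead))
print(f"extra waste on RoCEv2: ${delta:,.0f}/month")
# If the InfiniBand price premium for this cluster exceeds the cumulative
# delta over the hardware's life, RoCEv2 is the cheaper end-to-end choice.
```

The point of the sketch is the decision structure, not the numbers: the right fabric depends on cluster size, workload communication intensity, and how long the hardware will run.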
The Bandwidth Hierarchy
From 128 GB/s to 1.8 TB/s — each level serves a different purpose
The Complete Picture
AI infrastructure has a clear bandwidth hierarchy. Each level connects different scopes of hardware:

Level 1 — Within the GPU: HBM to compute cores. 3,350–8,000 GB/s. This is the memory bandwidth we covered in Chapter 4.

Level 2 — GPU to GPU (same server): NVLink. 900–1,800 GB/s. Enables tensor parallelism within a node.

Level 3 — GPU to CPU: PCIe. 64–128 GB/s. For data loading, model weight transfer, and orchestration.

Level 4 — Server to server (same rack): InfiniBand or RoCEv2. 50–100 GB/s per link. Enables data parallelism across nodes.

Level 5 — Rack to rack: Spine switches. 400G–800G links. The backbone of the cluster network.

Each level is several times to an order of magnitude slower than the one above it. This hierarchy determines how you partition your model and your training strategy.
Bandwidth at Each Level
Bandwidth hierarchy (B200-class system):
• L1: HBM → compute: 8,000 GB/s (8 TB/s)
• L2: GPU ↔ GPU (NVLink 5.0): 1,800 GB/s (4.4x slower than L1)
• L3: GPU ↔ CPU (PCIe 5.0): 128 GB/s (14x slower than L2)
• L4: Server ↔ server (InfiniBand NDR): 50 GB/s per link (400 Gb/s; 2.6x slower than L3)
• L5: Rack ↔ rack (spine): 50–100 GB/s aggregated; varies by topology

The 36x gap between NVLink (1,800 GB/s) and InfiniBand (50 GB/s) is why tensor parallelism stays within a node and data parallelism goes across.
Key insight: The bandwidth hierarchy dictates your parallelism strategy. Tensor parallelism (which needs the most communication) uses NVLink within a node. Data parallelism (less communication) uses InfiniBand/Ethernet across nodes. Pipeline parallelism (moderate communication) can span nodes if the network is fast enough. Matching your parallelism strategy to the bandwidth hierarchy is the key to efficient distributed training.
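One way to see why the hierarchy dictates parallelism placement: compare a single layer's AllGather transfer time on the intra-node and inter-node fabrics. The 2 GB message size is a hypothetical round number for one layer's activations, not a measured value:

```python
# Per-layer AllGather transfer time, intra-node vs inter-node fabric.
# ACTIVATION_BYTES is a hypothetical round number for illustration.
ACTIVATION_BYTES = 2e9

FABRICS_GBPS = {
    "NVLink 5.0 (intra-node)": 1800,
    "InfiniBand NDR (inter-node)": 50,
}

for name, gbps in FABRICS_GBPS.items():
    ms = ACTIVATION_BYTES / (gbps * 1e9) * 1000
    print(f"{name:>28}: {ms:6.2f} ms per layer")
# Multiplied across ~80 transformer layers, the inter-node path would burn
# seconds per step on tensor-parallel traffic alone, hence TP stays in-node.
```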
Real-World Impact: Communication Overhead
How interconnect speed affects training time and GPU utilization
The Scaling Efficiency Problem
In a perfect world, doubling the number of GPUs would halve training time. In reality, communication overhead means you get less than 2x speedup:

Linear scaling (ideal): 8 GPUs = 8x faster. Never happens in practice.

Good scaling (NVLink + IB): 8 GPUs = 6–7x faster. ~80–90% efficiency. Achievable with fast interconnects and optimized communication.

Poor scaling (PCIe only): 8 GPUs = 3–4x faster. ~40–50% efficiency. Communication dominates, GPUs idle waiting.

At 1,000+ GPU scale, even small inefficiencies compound. If each AllReduce takes 10% of step time with InfiniBand, it takes 25% with Ethernet. Over months of training, that’s millions of dollars in wasted GPU-hours.

This is why organizations like Meta and Google invest billions in networking infrastructure — the ROI on faster interconnects is enormous at scale.
Scaling Efficiency Numbers
Training a 70B model on 8 GPUs:

NVLink 4.0 + InfiniBand NDR:
• Compute time: 85%
• Communication: 12%
• Overhead: 3%
• Scaling efficiency: ~87%

PCIe 5.0 + 100G Ethernet:
• Compute time: 45%
• Communication: 50%
• Overhead: 5%
• Scaling efficiency: ~45%

At 16,384 GPUs (Meta's Llama 3 cluster), network configuration errors caused 10.7% of significant job failures. Even 1% communication overhead equals ~164 idle GPUs' worth of waste, roughly $12K/day at $3/GPU-hr. At scale, interconnect investment pays for itself many times over: a $10M networking upgrade that cuts communication overhead by 5% saves ~$1.8M/month on a 16K-GPU cluster.
Key insight: Interconnect speed has diminishing returns at small scale but enormous returns at large scale. For a team with 8 GPUs, PCIe might be acceptable. For a team with 1,000+ GPUs, every microsecond of communication latency translates to millions of dollars in wasted compute. This is why the biggest AI labs invest disproportionately in networking.
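The idle-GPU arithmetic behind these dollar figures is simple enough to sketch. The cluster size and GPU-hour price match the Llama 3 example above; the overhead fraction is an input, not a measurement:

```python
# Idle-GPU waste arithmetic: overhead fraction times cluster size gives
# GPU-equivalents doing nothing, then price that idle capacity per day.
def wasted_per_day(n_gpus: int, comm_overhead: float,
                   usd_per_gpu_hr: float = 3.0):
    idle = n_gpus * comm_overhead
    return idle, idle * usd_per_gpu_hr * 24

idle, usd = wasted_per_day(16_384, 0.01)
print(f"1% overhead on 16,384 GPUs ≈ {idle:.0f} idle GPUs ≈ ${usd:,.0f}/day")
```

Run the same function with an 8-GPU cluster and the daily waste is a few dollars, which is why small teams rarely notice interconnect overhead and hyperscalers obsess over it.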
© 2026 Kiran Shirol — The AI Atlas. All rights reserved.