Ch 5 — Interconnects: How GPUs Talk

NVLink, NVSwitch, PCIe, InfiniBand, RoCEv2 — the bandwidth hierarchy
Why GPUs Need to Talk
Models don’t fit on one GPU — so GPUs must share data constantly
The Team Project Analogy
Imagine a team of 8 people working on a massive jigsaw puzzle. Each person has a section, but the pieces at the edges need to match. Every few minutes, they need to compare edge pieces with their neighbors.

If they’re sitting at the same table (fast interconnect), sharing is instant. If they’re in different rooms connected by a hallway (slow interconnect), every comparison takes minutes. The puzzle gets done at the speed of the slowest communication, not the fastest worker.

This is exactly what happens in distributed AI training. A 70B model split across 8 GPUs needs constant synchronization — gradients must be averaged (AllReduce), activations must be passed between layers (pipeline parallelism), and tensor slices must be gathered (tensor parallelism).

The speed of GPU-to-GPU communication directly determines training speed. Slow interconnects mean GPUs spend more time waiting than computing.
Communication Patterns in AI
Data Parallelism (AllReduce): every GPU holds a full model copy; after each batch, gradients are averaged across all GPUs. Data moved: ~2× model_size per training step.

Tensor Parallelism (AllGather): model layers are split across GPUs; each forward pass gathers partial results from all GPUs. Data moved: ~activations_size per layer, per step.

Pipeline Parallelism: model layers sit on different GPUs; activations are passed GPU→GPU. Data moved: ~batch × hidden_dim per pipeline stage.

For a 70B model on 8 GPUs, AllReduce moves ~280 GB of FP32 gradients per step. At 1,800 GB/s (NVLink 5.0): ~0.16 sec. At 128 GB/s (PCIe 5.0): ~2.2 sec. That's a 14x difference.
Key insight: In distributed training, the interconnect is often the bottleneck, not the GPU compute. A cluster of 8 GPUs connected by slow PCIe can be slower than 4 GPUs connected by fast NVLink. Communication speed determines how efficiently you can scale to more GPUs.
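The transfer-time arithmetic above can be sketched in a few lines. The bandwidth figures are the nominal round numbers quoted in this chapter, not measurements, and the model ignores latency and overlap:

```python
# Per-step AllReduce time for a 70B model's FP32 gradients (~280 GB),
# at the nominal bidirectional bandwidths quoted in this chapter.
PARAMS = 70e9
GRADIENT_BYTES = PARAMS * 4          # FP32: 4 bytes per parameter

LINKS_GBPS = {
    "NVLink 5.0": 1800,              # GB/s
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 128,
}

for name, gbps in LINKS_GBPS.items():
    seconds = GRADIENT_BYTES / (gbps * 1e9)
    print(f"{name:>12}: {seconds:.2f} s per AllReduce")
```

Even this idealized model shows the order-of-magnitude gap that dominates multi-GPU step time.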
PCIe: The Baseline
Universal but slow — the minimum connection every GPU has
PCIe in AI Systems
PCI Express (PCIe) is the standard interface that connects GPUs to the CPU and to each other in consumer and entry-level server systems. Every GPU has a PCIe connection.

PCIe 4.0 x16: 32 GB/s per direction, 64 GB/s bidirectional. Used in older servers and consumer PCs.

PCIe 5.0 x16: 64 GB/s per direction, 128 GB/s bidirectional. Current standard in modern servers.

PCIe 6.0 x16: 128 GB/s per direction, 256 GB/s bidirectional. Emerging in 2025–2026.

PCIe is fine for single-GPU workloads (loading model weights from CPU memory to GPU) but becomes a severe bottleneck for multi-GPU communication. At 128 GB/s, transferring a 70B model’s gradients (280 GB in FP32) takes over 2 seconds — an eternity when training steps should take milliseconds.
PCIe Generation Comparison
PCIe bandwidth (x16 slot, per direction):
• Gen 3.0: 16 GB/s
• Gen 4.0: 32 GB/s
• Gen 5.0: 64 GB/s
• Gen 6.0: 128 GB/s

Compare to NVLink 5.0: 900 GB/s per direction vs PCIe 5.0's 64 GB/s, a 14x difference.

PCIe use cases in AI:
✓ CPU ↔ GPU data transfer
✓ Single-GPU inference
✓ Loading model weights
✓ Storage I/O (GPUDirect)
✗ Multi-GPU training sync
✗ Tensor parallelism
✗ High-performance inference

PCIe is the "last resort" for GPU-to-GPU communication. NVLink is always preferred when available.
Key insight: PCIe is like a country road connecting two cities. It works, but it’s slow for heavy traffic. For multi-GPU AI workloads, you need a highway (NVLink) or a railroad (InfiniBand). PCIe remains important for CPU-GPU communication and storage I/O, but it’s never the right choice for GPU-to-GPU data exchange at scale.
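Where PCIe does belong, the bandwidth table translates directly into load times. A sketch of staging a 70B model's FP16 weights (140 GB) from CPU memory to one GPU, per generation, under the idealized assumption of zero protocol overhead:

```python
# Time to stage a 70B model's FP16 weights (140 GB) from CPU memory to one
# GPU over a single x16 slot, per PCIe generation (per-direction bandwidth).
# Idealized: real transfers lose a few percent to protocol overhead.
WEIGHT_BYTES = 70e9 * 2                  # FP16: 2 bytes per parameter

PCIE_X16_GBPS = {"3.0": 16, "4.0": 32, "5.0": 64, "6.0": 128}

for gen, gbps in PCIE_X16_GBPS.items():
    print(f"PCIe {gen} x16: {WEIGHT_BYTES / (gbps * 1e9):5.2f} s to load weights")
```

A one-to-two-second load is perfectly acceptable once at startup; it is only when this class of transfer happens every training step that PCIe becomes the bottleneck.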
NVLink: The GPU Highway
Direct GPU-to-GPU connections at 14x PCIe speed
NVLink Architecture
NVLink is NVIDIA’s proprietary high-speed interconnect that connects GPUs directly to each other, bypassing the CPU and PCIe entirely. Think of it as a private highway between GPUs — dedicated lanes, no traffic lights, no speed limits.

Each NVLink generation has dramatically increased bandwidth:

NVLink 2.0 (Volta, 2017): 300 GB/s bidirectional per GPU. Connected up to 6 GPUs.

NVLink 3.0 (Ampere, 2020): 600 GB/s. 12 links per GPU.

NVLink 4.0 (Hopper, 2022): 900 GB/s. 18 links per GPU. Used in H100 and H200.

NVLink 5.0 (Blackwell, 2024): 1,800 GB/s (1.8 TB/s). 18 links at 100 GB/s each. Used in B200 and GB200.

NVLink 5.0 delivers 14x the bandwidth of PCIe 5.0. For tensor parallelism (which requires constant GPU-to-GPU data exchange), this difference is the difference between feasible and impossible.
NVLink Generation Comparison
NVLink evolution:
Gen | Year | BW per GPU | Links | Architecture
2.0 | 2017 | 300 GB/s | 6 | Volta
3.0 | 2020 | 600 GB/s | 12 | Ampere
4.0 | 2022 | 900 GB/s | 18 | Hopper
5.0 | 2024 | 1,800 GB/s | 18 | Blackwell

Practical impact: 70B model gradient sync via AllReduce (280 GB):
• PCIe 5.0: ~2.2 sec
• NVLink 4.0: ~0.31 sec
• NVLink 5.0: ~0.16 sec

Tensor parallelism communication overhead:
• PCIe 5.0: ~40–60% of step time
• NVLink 4.0: ~5–10% of step time
• NVLink 5.0: ~3–5% of step time

NVLink makes tensor parallelism practical. Without it, splitting a model across GPUs within a node would be too slow.
Key insight: NVLink is what makes multi-GPU training within a server practical. Without NVLink, you’d be limited to data parallelism (each GPU has a full model copy) which requires much more memory. NVLink enables tensor parallelism (splitting layers across GPUs) which is essential for models too large to fit on a single GPU.
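The "2× model_size" figure used earlier falls out of the standard ring AllReduce algorithm: a reduce-scatter phase followed by an all-gather phase, each putting (N-1)/N of the buffer on the wire per GPU. A minimal sketch of that textbook formula (real libraries such as NCCL add chunking and compute/communication overlap):

```python
# Ring AllReduce traffic multiplier: reduce-scatter plus all-gather,
# each phase sending (N-1)/N of the buffer from every GPU.
def ring_allreduce_factor(n_gpus: int) -> float:
    """How many times the buffer size each GPU transmits."""
    return 2 * (n_gpus - 1) / n_gpus

for n in (2, 4, 8, 64):
    print(f"{n:>3} GPUs: each GPU sends {ring_allreduce_factor(n):.2f}x the buffer size")
# The factor approaches 2 as the GPU count grows, which is where the
# "~2x model_size per step" rule of thumb comes from.
```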
NVSwitch: Full-Mesh GPU Connectivity
Every GPU talks to every other GPU at full bandwidth — no bottlenecks
The Telephone Exchange Analogy
NVLink connects pairs of GPUs. But in a server with 8 GPUs, you can’t directly connect every GPU to every other GPU — that would require 28 direct links (8 choose 2).

NVSwitch solves this. It’s a dedicated chip that acts like a telephone exchange — any GPU can talk to any other GPU at full NVLink bandwidth, simultaneously.

In a DGX H100 server (8 GPUs), 4 NVSwitch chips create a full-mesh fabric. Every GPU has 900 GB/s to every other GPU. Total bisection bandwidth: 3.6 TB/s.

The Blackwell GB200 NVL72 takes this further: 9 NVSwitch trays connect 72 GPUs in a single NVLink domain with 130 TB/s total bandwidth. Any GPU can access any other GPU’s memory as if it were local. This creates a single “virtual GPU” with 72 × 192 GB = 13.8 TB of unified memory.
NVSwitch Scaling
DGX H100 (8 GPUs):
• NVSwitch chips: 4
• GPU↔GPU bandwidth: 900 GB/s each
• Total bisection: 3.6 TB/s
• NVLink domain: 8 GPUs

GB200 NVL72 (72 GPUs):
• NVSwitch trays: 9
• GPU↔GPU bandwidth: 1,800 GB/s each
• Total bisection: 130 TB/s
• NVLink domain: 72 GPUs
• Unified memory: 13.8 TB

5th-gen NVSwitch (Blackwell) can connect up to 576 GPUs in a single NVLink domain with 1 PB/s total fabric bandwidth. 576 GPUs × 192 GB = 110 TB of unified GPU memory, enough to hold a 55-trillion-parameter model in FP16.
Key insight: NVSwitch transforms multiple discrete GPUs into what behaves like a single massive GPU. The 72-GPU GB200 NVL72 with 13.8 TB of unified memory can hold and serve the largest models without any model partitioning complexity. This is NVIDIA’s answer to “models keep getting bigger” — just make the GPU bigger.
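The link-count arithmetic behind the switch can be checked directly: a direct full mesh needs C(n, 2) point-to-point links, while a switched fabric needs only one connection per GPU. Memory totals use the 192 GB/GPU figure from the Blackwell examples above:

```python
# Full-mesh link counts vs switched fabric, plus unified memory totals
# at 192 GB per GPU (the Blackwell figure used in this chapter).
from math import comb

for n in (8, 72, 576):
    mesh_links = comb(n, 2)              # n choose 2 direct links
    unified_tb = n * 192 / 1000
    print(f"{n:>3} GPUs: full mesh = {mesh_links:>6} links, "
          f"unified memory = {unified_tb:,.1f} TB")
```

At 8 GPUs the 28-link mesh is merely awkward; at 72 GPUs the 2,556-link mesh is physically impossible, which is why a switch fabric is the only way to scale the NVLink domain.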
InfiniBand: The Data Center Fabric
When GPUs in different servers need to communicate — the gold standard
Beyond the Server
NVLink connects GPUs within a server. But training at scale requires hundreds or thousands of GPUs across many servers. This is where network fabric comes in.

InfiniBand is the traditional choice for HPC and AI clusters. Originally developed for supercomputers, it provides:

Ultra-low latency: 1–1.6 microseconds end-to-end. Cut-through switching means data starts forwarding before the full packet arrives.

Native RDMA: Remote Direct Memory Access lets one GPU read/write another GPU’s memory directly, bypassing the CPU entirely. Zero-copy, kernel-bypass.

Credit-based flow control: Guaranteed lossless delivery without the complexity of TCP congestion control.

Current speed: 400 Gb/s (NDR) per port, with 800 Gb/s (XDR) arriving in 2025–2026 via NVIDIA Quantum-X800 switches.
InfiniBand Specs
InfiniBand generations:
• HDR (2019): 200 Gb/s per port
• NDR (2022): 400 Gb/s per port
• XDR (2025): 800 Gb/s per port

InfiniBand characteristics:
• Latency: 1–1.6 μs
• RDMA: native (hardware)
• Flow control: credit-based
• Packet loss: near zero
• Vendor: NVIDIA (Mellanox)

DGX H100 cluster networking:
• 8 GPUs per server
• Each GPU: 1× 400G InfiniBand NIC
• Per server: 3.2 Tb/s aggregate
• 1:1 GPU-to-NIC ratio

Market position: in 2023, ~80% of AI clusters used InfiniBand; by 2025, Ethernet is taking the lead as RoCEv2 matures.
Key insight: InfiniBand’s 1–1.6 microsecond latency vs RoCEv2’s 5–6 microseconds matters enormously for distributed training. In AllReduce operations that happen thousands of times per training step, those extra microseconds compound. But InfiniBand costs 1.5–2.5x more per port than Ethernet, and it’s controlled by NVIDIA (via Mellanox acquisition).
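How microseconds compound is easy to sketch. The per-operation latencies are the figures quoted above; the number of collectives per step is a made-up round number for illustration (real counts depend on layer count, microbatching, and fusion):

```python
# Latency floors compound: a training step can issue thousands of small
# collectives (per layer, per microbatch). OPS_PER_STEP is a hypothetical
# round number; the latencies are the per-op figures quoted above.
OPS_PER_STEP = 2000

for fabric, latency_us in (("InfiniBand", 1.3), ("RoCEv2", 5.5)):
    per_step_ms = OPS_PER_STEP * latency_us / 1000
    print(f"{fabric:>10}: {per_step_ms:.1f} ms/step spent purely on latency")
# The ~8 ms/step gap looks small, but it is pure waiting: it cannot be
# hidden by more bandwidth, and it grows with the collective count.
```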
RoCEv2: Ethernet Fights Back
RDMA over standard Ethernet — cheaper, more familiar, and catching up fast
RoCEv2 Explained
RoCEv2 (RDMA over Converged Ethernet v2) brings RDMA capabilities to standard Ethernet networks. Instead of InfiniBand’s proprietary fabric, RoCEv2 carries RDMA traffic over UDP/IP on commodity Ethernet switches.

Advantages over InfiniBand:
• Cost: 1.5–2.5x cheaper per port
• Familiarity: network engineers already know Ethernet
• Vendor choice: multiple switch vendors (Arista, Cisco, Broadcom) vs NVIDIA-only for InfiniBand
• Convergence: same fabric for AI traffic and general data center traffic

Challenges:
• Higher latency (5–6 μs vs 1–1.6 μs)
• Requires careful tuning: Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and a congestion control scheme such as DCQCN
• Store-and-forward switching adds latency vs InfiniBand’s cut-through
• Packet loss under congestion requires more complex handling
InfiniBand vs RoCEv2
InfiniBand
Latency: 1–1.6 μs
RDMA: Native hardware
Loss: Near zero (credit-based)
Cost: 1.5–2.5x higher
Vendor: NVIDIA only
Best for: Top-tier training
RoCEv2 Ethernet
Latency: 5–6 μs
RDMA: Over UDP/IP
Loss: Requires PFC/ECN tuning
Cost: Baseline
Vendor: Multi-vendor
Best for: Cost-sensitive, hybrid
Key insight: The InfiniBand vs Ethernet debate is shifting. In 2023, InfiniBand dominated 80% of AI clusters. By 2025, Ethernet has taken the lead as hyperscalers (Google, Meta, Microsoft) validated RoCEv2 at massive scale. The Ultra Ethernet Consortium is standardizing AI-optimized Ethernet features. For most organizations, RoCEv2 at 400G/800G is “good enough” and significantly cheaper.
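The cost-vs-overhead tradeoff can be made concrete with a toy calculator. Every number here (cluster size, overhead fractions, $/GPU-hr) is a hypothetical placeholder, not a quoted price:

```python
# Toy fabric-choice calculator. All inputs are hypothetical placeholders.
def monthly_gpu_waste(n_gpus: int, comm_overhead: float,
                      usd_per_gpu_hr: float = 3.0) -> float:
    """Dollar value of GPU time lost to communication per 30-day month."""
    return n_gpus * comm_overhead * usd_per_gpu_hr * 24 * 30

cluster = 1024
ib_overhead, roce_overhead = 0.10, 0.15   # hypothetical fractions of step time
delta = (monthly_gpu_waste(cluster, roce_overhead)
         - monthly_gpu_waste(cluster, ib_overhead))
print(f"extra waste on RoCEv2: ${delta:,.0f}/month")
# If the InfiniBand price premium for this cluster exceeds the cumulative
# delta over the hardware's life, RoCEv2 is the cheaper end-to-end choice.
```

The point of the sketch is the decision structure, not the numbers: the right fabric depends on cluster size, workload communication intensity, and how long the hardware will run.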
The Bandwidth Hierarchy
From 128 GB/s to 1.8 TB/s — each level serves a different purpose
The Complete Picture
AI infrastructure has a clear bandwidth hierarchy. Each level connects different scopes of hardware:

Level 1 — Within the GPU: HBM to compute cores. 3,350–8,000 GB/s. This is the memory bandwidth we covered in Chapter 4.

Level 2 — GPU to GPU (same server): NVLink. 900–1,800 GB/s. Enables tensor parallelism within a node.

Level 3 — GPU to CPU: PCIe. 64–128 GB/s. For data loading, model weight transfer, and orchestration.

Level 4 — Server to server (same rack): InfiniBand or RoCEv2. 50–100 GB/s per link. Enables data parallelism across nodes.

Level 5 — Rack to rack: Spine switches. 400G–800G links. The backbone of the cluster network.

Each level is several times to an order of magnitude slower than the one above it. This hierarchy determines how you partition your model and your training strategy.
Bandwidth at Each Level
Bandwidth hierarchy (B200-class system):
• L1: HBM → compute: 8,000 GB/s (8 TB/s)
• L2: GPU ↔ GPU (NVLink 5.0): 1,800 GB/s (4.4x slower than L1)
• L3: GPU ↔ CPU (PCIe 5.0): 128 GB/s (14x slower than L2)
• L4: Server ↔ server (InfiniBand NDR): 50 GB/s per link (400 Gb/s; 2.6x slower than L3)
• L5: Rack ↔ rack (spine): 50–100 GB/s aggregated; varies by topology

The 36x gap between NVLink (1,800 GB/s) and InfiniBand (50 GB/s) is why tensor parallelism stays within a node and data parallelism goes across.
Key insight: The bandwidth hierarchy dictates your parallelism strategy. Tensor parallelism (which needs the most communication) uses NVLink within a node. Data parallelism (less communication) uses InfiniBand/Ethernet across nodes. Pipeline parallelism (moderate communication) can span nodes if the network is fast enough. Matching your parallelism strategy to the bandwidth hierarchy is the key to efficient distributed training.
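One way to see why the hierarchy dictates parallelism placement: compare a single layer's AllGather transfer time on the intra-node and inter-node fabrics. The 2 GB message size is a hypothetical round number for one layer's activations, not a measured value:

```python
# Per-layer AllGather transfer time, intra-node vs inter-node fabric.
# ACTIVATION_BYTES is a hypothetical round number for illustration.
ACTIVATION_BYTES = 2e9

FABRICS_GBPS = {
    "NVLink 5.0 (intra-node)": 1800,
    "InfiniBand NDR (inter-node)": 50,
}

for name, gbps in FABRICS_GBPS.items():
    ms = ACTIVATION_BYTES / (gbps * 1e9) * 1000
    print(f"{name:>28}: {ms:6.2f} ms per layer")
# Multiplied across ~80 transformer layers, the inter-node path would burn
# seconds per step on tensor-parallel traffic alone, hence TP stays in-node.
```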
Real-World Impact: Communication Overhead
How interconnect speed affects training time and GPU utilization
The Scaling Efficiency Problem
In a perfect world, doubling the number of GPUs would halve training time. In reality, communication overhead means you get less than 2x speedup:

Linear scaling (ideal): 8 GPUs = 8x faster. Never happens in practice.

Good scaling (NVLink + IB): 8 GPUs = 6–7x faster. ~80–90% efficiency. Achievable with fast interconnects and optimized communication.

Poor scaling (PCIe only): 8 GPUs = 3–4x faster. ~40–50% efficiency. Communication dominates, GPUs idle waiting.

At 1,000+ GPU scale, even small inefficiencies compound. If each AllReduce takes 10% of step time with InfiniBand, it takes 25% with Ethernet. Over months of training, that’s millions of dollars in wasted GPU-hours.

This is why organizations like Meta and Google invest billions in networking infrastructure — the ROI on faster interconnects is enormous at scale.
Scaling Efficiency Numbers
Training a 70B model on 8 GPUs:

NVLink 4.0 + InfiniBand NDR:
• Compute time: 85%
• Communication: 12%
• Overhead: 3%
• Scaling efficiency: ~87%

PCIe 5.0 + 100G Ethernet:
• Compute time: 45%
• Communication: 50%
• Overhead: 5%
• Scaling efficiency: ~45%

At 16,384 GPUs (Meta's Llama 3 cluster), network configuration errors caused 10.7% of significant job failures. Even 1% communication overhead equals ~164 idle GPUs' worth of waste, roughly $12K/day at $3/GPU-hr. At scale, interconnect investment pays for itself many times over: a $10M networking upgrade that cuts communication overhead by 5% saves ~$1.8M/month on a 16K-GPU cluster.
Key insight: Interconnect speed has diminishing returns at small scale but enormous returns at large scale. For a team with 8 GPUs, PCIe might be acceptable. For a team with 1,000+ GPUs, every microsecond of communication latency translates to millions of dollars in wasted compute. This is why the biggest AI labs invest disproportionately in networking.
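The idle-GPU arithmetic behind these dollar figures is simple enough to sketch. The cluster size and GPU-hour price match the Llama 3 example above; the overhead fraction is an input, not a measurement:

```python
# Idle-GPU waste arithmetic: overhead fraction times cluster size gives
# GPU-equivalents doing nothing, then price that idle capacity per day.
def wasted_per_day(n_gpus: int, comm_overhead: float,
                   usd_per_gpu_hr: float = 3.0):
    idle = n_gpus * comm_overhead
    return idle, idle * usd_per_gpu_hr * 24

idle, usd = wasted_per_day(16_384, 0.01)
print(f"1% overhead on 16,384 GPUs ≈ {idle:.0f} idle GPUs ≈ ${usd:,.0f}/day")
```

Run the same function with an 8-GPU cluster and the daily waste is a few dollars, which is why small teams rarely notice interconnect overhead and hyperscalers obsess over it.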
© 2026 Kiran Shirol — The AI Atlas. All rights reserved.