Ch 7 — Distributed Training: Splitting the Work

Data, tensor, and pipeline parallelism — how to train on thousands of GPUs
Why Distribute? The Memory and Time Problem
Models are too big for one GPU, and training takes too long on one GPU
Two Reasons to Distribute
Reason 1: It doesn’t fit. A 70B model needs ~1.5 TB for training (weights + optimizer + gradients + activations). An H100 has 80 GB. You physically cannot train it on one GPU.

Reason 2: It takes too long. Even if a model fits on one GPU, training on trillions of tokens takes months on a single GPU. Using 1,000 GPUs can reduce this to days.

Distributed training solves both problems by splitting the work across multiple GPUs. But how you split it matters enormously. There are three fundamental strategies, each splitting along a different dimension:

Data parallelism: split the training data across GPUs
Tensor parallelism: split individual layers (the weight matrices) across GPUs
Pipeline parallelism: split the model into stages of consecutive layers

Modern large-scale training combines all three — called 3D parallelism.
The Scale of the Problem
Training memory requirements:
  7B model:   ~150 GB   → 2 H100s minimum
  13B model:  ~300 GB   → 4 H100s minimum
  70B model:  ~1,500 GB → 20 H100s minimum
  405B model: ~8,000 GB → 100 H100s minimum

Training time on a single H100:
  7B on 1T tokens:    ~45 days
  70B on 15T tokens:  ~29 years
  405B on 15T tokens: ~170 years

With distributed training:
  Llama 3 70B:  ~6.4M H100-hours
  Llama 3 405B: ~31M H100-hours (16,384 H100s, ~54 days)

Without distributed training, large models would be impossible. It's not an optimization — it's a requirement.
Key insight: Distributed training isn’t about making things faster — it’s about making things possible. A 70B model literally cannot be trained on a single GPU. The question isn’t “should we distribute?” but “how should we distribute?”
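The memory figures above follow a rough rule of thumb: about 20 bytes per parameter for mixed-precision Adam training (2 B FP16 weights + 2 B gradients + 8 B FP32 Adam moments, plus activations and overhead). A sketch of that arithmetic — the ~21.5 B/param constant is my assumption, chosen to land near the table's figures, not an exact formula:

```python
import math

def training_memory_gb(params_billions, bytes_per_param=21.5):
    # 2 (FP16 weights) + 2 (grads) + 8 (FP32 Adam moments) = 12 B/param,
    # plus activations/overhead -- ~21.5 B/param is an assumed fit
    return params_billions * bytes_per_param

def min_h100s(params_billions, gpu_memory_gb=80):
    # Minimum GPU count just to hold the training state
    return math.ceil(training_memory_gb(params_billions) / gpu_memory_gb)

print(training_memory_gb(70))   # ~1505 GB
print(min_h100s(70))            # 19 GPUs just for one model's training state
```

This is why "does it fit?" is always the first question: the GPU count is set by memory before throughput even enters the picture.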
Data Parallelism: The Photocopy Strategy
Every GPU has a full copy of the model, but processes different data
How Data Parallelism Works
Think of a classroom with 8 teachers, each with an identical copy of the textbook. Each teacher grades a different stack of exams. At the end, they compare notes and agree on the grading rubric for next time.

Step 1: Copy the full model to every GPU.
Step 2: Split each training batch into mini-batches, one per GPU.
Step 3: Each GPU computes gradients on its mini-batch independently.
Step 4: AllReduce — average all gradients across all GPUs.
Step 5: Every GPU updates its model with the averaged gradients.

The result: each GPU processes 1/N of the data, so training is ~N times faster. The communication cost is one AllReduce per training step, which moves 2× the model size in data.
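A toy NumPy sketch (my example, not from the chapter: a linear model with squared loss) showing why step 4 is mathematically sound — averaging per-GPU gradients over equal shards reproduces the full-batch gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))   # full batch: 32 examples, 4 features
y = rng.standard_normal(32)
w = rng.standard_normal(4)

def grad(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Single-GPU gradient on the full batch
g_full = grad(X, y, w)

# "4 GPUs": each computes a gradient on its quarter of the batch,
# then an AllReduce averages the four results
shards = zip(np.split(X, 4), np.split(y, 4))
g_avg = np.mean([grad(Xb, yb, w) for Xb, yb in shards], axis=0)

assert np.allclose(g_full, g_avg)   # identical update on every GPU
```

Because every GPU applies the same averaged gradient, all N model copies stay bit-for-bit in sync without any extra coordination.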

DDP (Distributed Data Parallel) in PyTorch implements this. It’s the simplest form of distributed training and works well when the model fits on a single GPU.
DDP Code Pattern
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

# Initialize the process group (one process per GPU)
dist.init_process_group("nccl")
rank = dist.get_rank()

# Each GPU gets the same model (MyModel / dataset are placeholders)
model = MyModel().to(rank)
model = DistributedDataParallel(model, device_ids=[rank])
optimizer = torch.optim.AdamW(model.parameters())

# Each GPU gets different data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop — DDP syncs gradients automatically
for batch in loader:
    loss = model(batch)
    loss.backward()        # AllReduce happens here
    optimizer.step()
    optimizer.zero_grad()
Key insight: Data parallelism is the simplest strategy but requires each GPU to hold the full model. For a 70B model needing 1.5 TB for training, you’d need 19+ GPUs just for one copy — and DDP needs one copy per GPU. This is why FSDP was invented: it shards the model across GPUs instead of copying it.
Tensor Parallelism: Splitting the Puzzle
Split individual layers across GPUs — each GPU computes part of every operation
How Tensor Parallelism Works
Think of a jigsaw puzzle split among 4 people. Each person has 1/4 of the pieces for every section. They work simultaneously, but need to share edge pieces constantly.

In tensor parallelism, each large matrix multiplication is split across GPUs. For a weight matrix W of size [8192 × 8192]:

GPU 0 holds columns 0–2047
GPU 1 holds columns 2048–4095
GPU 2 holds columns 4096–6143
GPU 3 holds columns 6144–8191

Each GPU computes its partial result, then they AllGather to combine results. This happens for every layer, every forward and backward pass.

Tensor parallelism requires extremely fast GPU-to-GPU communication because it communicates at every layer. This is why it’s only used within a node (NVLink), never across nodes (InfiniBand).
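The column split above can be verified numerically. A minimal NumPy sketch (my example) of column-parallel matrix multiplication: each "GPU" holds a quarter of W's columns, computes its partial output, and the concatenation step plays the role of the AllGather:

```python
import numpy as np

hidden, tp = 8192, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden), dtype=np.float32)        # 2 tokens
W = rng.standard_normal((hidden, hidden), dtype=np.float32)

# Column-parallel split: each "GPU" holds hidden/tp columns of W
shards = np.split(W, tp, axis=1)          # 4 shards of shape [8192, 2048]
partials = [x @ w for w in shards]        # each device computes its slice
y_tp = np.concatenate(partials, axis=1)   # AllGather: combine along columns

assert np.allclose(y_tp, x @ W)           # identical to the unsplit matmul
```

Note each output column depends on one shard only, so the split changes nothing mathematically — the cost is the AllGather at every single layer, which is what demands NVLink bandwidth.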
Tensor Parallelism Trade-offs
Tensor parallel degree = 4:
  Weight matrix: [8192 × 8192]
  Per GPU: [8192 × 2048]
  Memory saved: 4× per layer

Communication per layer:
  AllGather: hidden_dim × batch
  ReduceScatter: hidden_dim × batch
  Total: ~2 × hidden_dim × batch

For a 70B model, TP=4:
  Communication per step: 80 layers × 2 × 8192 × batch
  With NVLink (900 GB/s): fast
  With PCIe (128 GB/s): unusable

When to use:
  ✓ Model doesn't fit on 1 GPU
  ✓ Within a single NVLink node
  ✓ TP degree = 2, 4, or 8
  ✗ Never across nodes (too slow)
  ✗ Not for small models that fit
Key insight: Tensor parallelism is the most communication-intensive strategy because it communicates at every layer. It only works within a node where NVLink provides 900+ GB/s. Across nodes with InfiniBand (50 GB/s), the communication overhead would make each GPU spend more time waiting than computing.
Pipeline Parallelism: The Assembly Line
Split model into stages — each GPU handles a group of consecutive layers
How Pipeline Parallelism Works
Think of an assembly line in a car factory. Station 1 builds the frame, Station 2 adds the engine, Station 3 paints it, Station 4 adds interiors. Each station works on a different car simultaneously.

In pipeline parallelism, the model’s layers are split into stages, each assigned to a different GPU:

GPU 0: Layers 0–19 (Stage 1)
GPU 1: Layers 20–39 (Stage 2)
GPU 2: Layers 40–59 (Stage 3)
GPU 3: Layers 60–79 (Stage 4)

Data flows through the pipeline: GPU 0 processes a micro-batch, sends activations to GPU 1, then immediately starts on the next micro-batch. This keeps all GPUs busy.

The challenge: pipeline bubbles. At the start and end of each batch, some GPUs sit idle waiting for work to reach their stage. With 4 stages and a single micro-batch, 75% of the time is wasted in bubbles; splitting each batch into many micro-batches shrinks this dramatically.
Pipeline Efficiency
Pipeline with 4 stages:
  1 micro-batch:    bubble fraction 75% (terrible)
  16 micro-batches: bubble fraction ~16% (better)
  32 micro-batches: bubble fraction ~9% (good)

Formula:
  bubble = (stages − 1) / (stages − 1 + micro_batches)

Communication per stage:
  Only activations between stages
  Size: batch × hidden_dim × 2 bytes
  Much less than tensor parallel

When to use:
  ✓ Scaling across nodes
  ✓ Model too big for one node
  ✓ Combined with TP within a node
  ✗ Small models
  ✗ When bubble overhead > benefit

Pipeline parallelism communicates far less than tensor parallelism, so it works across nodes over InfiniBand/Ethernet.
Key insight: Pipeline parallelism trades some GPU utilization (bubbles) for much less communication than tensor parallelism. This makes it ideal for scaling across nodes where network bandwidth is limited. The key optimization is using many micro-batches to minimize bubble time.
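The bubble formula is easy to check directly. A minimal sketch for a simple GPipe-style schedule:

```python
def bubble_fraction(stages, micro_batches):
    # The pipeline occupies (stages - 1 + micro_batches) time slots,
    # of which (stages - 1) are fill/drain bubbles
    return (stages - 1) / (stages - 1 + micro_batches)

for m in (1, 16, 32):
    print(f"{m:2d} micro-batches: {bubble_fraction(4, m):.0%} bubble")
```

With 4 stages this prints 75%, 16%, and 9% — diminishing returns set in once micro_batches is several times the stage count, which is why schedules typically use 4–8× as many micro-batches as stages.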
FSDP: The Shared Storage Locker
Shard everything — weights, gradients, optimizer states — across all GPUs
The Storage Locker Analogy
DDP gives every GPU a full copy of everything — like every worker having their own complete toolbox. Wasteful when you have 1,000 workers.

FSDP (Fully Sharded Data Parallel) is like a shared storage locker. Each worker keeps only 1/N of the tools. When they need a tool they don’t have, they borrow it from a neighbor, use it, and return it.

FSDP shards three things across GPUs:
Model weights: Each GPU holds 1/N of the parameters
Gradients: Each GPU computes and stores 1/N of gradients
Optimizer states: Each GPU maintains 1/N of Adam states

Before each layer’s forward pass, FSDP does an AllGather to reconstruct the full layer weights from all GPUs. After the backward pass, it does a ReduceScatter to distribute gradients. Then it discards the gathered weights to free memory.
FSDP Memory Savings
70B model, 8 GPUs:

DDP (full copy per GPU):
  Weights:   140 GB per GPU
  Gradients: 140 GB per GPU
  Optimizer: 560 GB per GPU
  Total:     840 GB per GPU ✗

FSDP (sharded across 8 GPUs):
  Weights:   17.5 GB per GPU
  Gradients: 17.5 GB per GPU
  Optimizer: 70 GB per GPU
  Total:     105 GB per GPU ✓

Memory reduction: 8×

Trade-off: more communication (AllGather + ReduceScatter per layer), but the model actually fits in memory.

PyTorch FSDP code:
  model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

FSDP is now the standard PyTorch choice for training models that don't fit on a single GPU.
Key insight: FSDP is the most memory-efficient form of data parallelism. It reduces per-GPU memory by N× (where N is the number of GPUs) at the cost of more communication. For most training scenarios where the model doesn’t fit on one GPU, FSDP is the go-to strategy before adding tensor or pipeline parallelism.
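The panel's arithmetic as a function — a sketch assuming FP16 weights and gradients plus FP32 Adam moments (8 B/param), fully sharded in the FULL_SHARD / ZeRO-3 style:

```python
def per_gpu_memory_gb(params_billions, n_gpus,
                      weight_bytes=2, grad_bytes=2, opt_bytes=8):
    # FULL_SHARD divides weights, gradients, and optimizer states
    # evenly across the world size (activations not included)
    total = params_billions * (weight_bytes + grad_bytes + opt_bytes)
    return total / n_gpus

print(per_gpu_memory_gb(70, 1))   # full copy (DDP-like): 840.0 GB
print(per_gpu_memory_gb(70, 8))   # FSDP across 8 GPUs:  105.0 GB
```

Activations are the missing term: they are not sharded by FSDP, which is why activation checkpointing is usually enabled alongside it.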
3D Parallelism: Combining Everything
Data + Tensor + Pipeline — how trillion-parameter models are actually trained
The 3D Strategy
For the largest models (70B+), no single parallelism strategy is enough. Modern training combines all three, each operating at a different level of the hardware hierarchy:

Tensor Parallelism (TP=8): Within a single node. Split layers across 8 GPUs connected by NVLink. Handles the most communication-intensive splitting.

Pipeline Parallelism (PP=8): Across nodes within a rack. Split model stages across 8 nodes. Moderate communication via InfiniBand.

Data Parallelism (DP=256): Across racks. Each pipeline replica processes different data. Gradient sync via AllReduce over the full cluster.

Total GPUs: TP × PP × DP = 8 × 8 × 256 = 16,384 GPUs

This mirrors how Meta trained Llama 3 405B: TP=8 within each node over NVLink, PP=16 across nodes, and DP=128 across the full 16,384-GPU cluster.
3D Parallelism Layout
Llama 3 405B training config:
  Total GPUs: 16,384
  Tensor parallel: 8 (within node)
  Pipeline parallel: 16 (across nodes)
  Data parallel: 128 (across cluster)

Communication hierarchy:
  TP (NVLink, 900 GB/s):
    Every layer, every step — highest bandwidth needed
  PP (InfiniBand, 400 Gb/s):
    Between pipeline stages — moderate bandwidth needed
  DP (InfiniBand, 400 Gb/s):
    AllReduce once per step — can overlap with compute

Why this mapping:
  TP on NVLink (fastest)
  PP on InfiniBand (medium)
  DP on InfiniBand (least urgent; overlaps with the backward pass)

The parallelism strategy maps directly onto the bandwidth hierarchy from Chapter 5.
Key insight: 3D parallelism maps each parallelism strategy to the appropriate level of the bandwidth hierarchy. Tensor parallelism (most communication) uses NVLink (fastest). Pipeline parallelism (moderate) uses InfiniBand. Data parallelism (least urgent, can overlap) uses the remaining network capacity. This co-design of parallelism and hardware is what makes large-scale training efficient.
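The mapping from GPU rank to (DP, PP, TP) coordinates is mechanical bookkeeping. A sketch of the layout most frameworks use — TP innermost so that TP peers are NVLink neighbors on the same node (the specific ordering here is an assumption, using the panel's TP=8/PP=16/DP=128 figures):

```python
TP, PP, DP = 8, 16, 128
WORLD = TP * PP * DP          # 16,384 GPUs

def coords(rank, tp=TP, pp=PP):
    # TP is the innermost dimension (ranks 0-7 share a node),
    # PP is next, DP is outermost
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t

print(WORLD)        # 16384
print(coords(7))    # (0, 0, 7)  -- same node: a TP peer of rank 0
print(coords(8))    # (0, 1, 0)  -- next pipeline stage
print(coords(128))  # (1, 0, 0)  -- next data-parallel replica
```

Reading it backwards: ranks that differ only in the last coordinate talk constantly over NVLink; ranks that differ in the first coordinate only need one AllReduce per step.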
AllReduce: The Gradient Synchronization
The most critical collective operation in distributed training
How AllReduce Works
AllReduce is the operation that keeps all GPUs in sync. After each training step, every GPU has computed gradients on its local data. AllReduce sums (or averages) these gradients across all GPUs so every GPU has the same result.

The naive approach (send everything to one GPU, sum, broadcast back) doesn’t scale. Instead, NCCL (NVIDIA’s communication library) uses ring AllReduce:

1. Ring ReduceScatter: Each GPU sends 1/N of its data to the next GPU in a ring. After N–1 steps, each GPU has the sum of 1/N of the data.

2. Ring AllGather: Each GPU shares its summed chunk around the ring. After N–1 steps, every GPU has the complete summed result.

Total data moved per GPU: 2 × (N–1)/N × model_size. For large N, this approaches 2 × model_size — independent of the number of GPUs. This is why AllReduce scales well.
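Both phases can be simulated in a few lines. A toy pure-Python sketch (my example; "sends" are applied from a snapshot to mimic simultaneous ring transfers, and the chunk indexing follows the standard ring schedule):

```python
import copy
import random

def ring_allreduce(local):
    """local[g][c] = chunk c on GPU g. Returns data where every GPU
    holds the element-wise sum of all GPUs' chunks."""
    n = len(local)
    data = copy.deepcopy(local)
    # Phase 1 -- ReduceScatter: at each step, GPU g sends chunk
    # (g - step) % n to its ring neighbor, which accumulates it
    for step in range(n - 1):
        sent = [(g, (g - step) % n) for g in range(n)]
        vals = [data[g][c] for g, c in sent]        # snapshot before applying
        for (g, c), v in zip(sent, vals):
            data[(g + 1) % n][c] += v
    # Now GPU g owns the complete sum of chunk (g + 1) % n
    # Phase 2 -- AllGather: circulate each completed chunk around the ring
    for step in range(n - 1):
        sent = [(g, (g + 1 - step) % n) for g in range(n)]
        vals = [data[g][c] for g, c in sent]
        for (g, c), v in zip(sent, vals):
            data[(g + 1) % n][c] = v
    return data

random.seed(0)
local = [[random.randint(0, 9) for _ in range(4)] for _ in range(4)]
result = ring_allreduce(local)
sums = [sum(local[g][c] for g in range(4)) for c in range(4)]
print(all(row == sums for row in result))   # True
```

Count the traffic: each GPU sends one chunk (1/N of the data) per step, for 2(N−1) steps total, giving the 2 × (N−1)/N × model_size figure from the text.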
AllReduce Performance
Ring AllReduce for a 70B model:
  Model size (FP16): 140 GB
  Data moved per GPU: 2 × 140 = 280 GB
  (independent of GPU count!)

Time estimates:
  8 GPUs (NVLink, 900 GB/s):
    280 GB ÷ 900 GB/s ≈ 0.31 s
  1,024 GPUs (InfiniBand, 50 GB/s):
    280 GB ÷ 50 GB/s = 5.6 s

Optimization: overlap with compute
  Start the AllReduce for one layer's gradients while the
  backward pass computes the next layer's.
  Hides ~50–80% of communication time.

NCCL (NVIDIA Collective Communications Library):
  Handles ring/tree AllReduce
  Topology-aware routing
  Automatic algorithm selection
  Used by PyTorch, TensorFlow, JAX

NCCL is the unsung hero of distributed training. It makes multi-GPU communication "just work" for most users.
Key insight: The beauty of ring AllReduce is that each GPU only moves 2× the model size of data, regardless of how many GPUs participate. Adding more GPUs doesn’t increase per-GPU communication. The bottleneck is the slowest link in the ring — which is why network bandwidth matters so much.
Practical: DDP vs FSDP vs DeepSpeed
Choosing the right framework for your training scale
Framework Comparison
PyTorch DDP:
  Best for: model fits on 1 GPU
  Memory:   full copy per GPU
  Comm:     AllReduce per step
  Ease:     easiest to set up
  Scale:    up to ~100 GPUs

PyTorch FSDP:
  Best for: model doesn't fit on 1 GPU
  Memory:   sharded across GPUs
  Comm:     AllGather + ReduceScatter
  Ease:     moderate complexity
  Scale:    up to ~1,000 GPUs

DeepSpeed (Microsoft):
  Best for: very large models
  Memory:   ZeRO stages 1/2/3
  Comm:     optimized collectives
  Ease:     config-driven
  Scale:    up to ~10,000+ GPUs

Megatron-LM (NVIDIA):
  Best for: maximum performance
  Memory:   full 3D parallelism
  Comm:     optimized for NVLink/IB
  Ease:     complex, expert-level
  Scale:    up to ~100,000 GPUs
Decision Guide
Start Simple
Model <7B, fits on 1 GPU:
Use DDP. 3 lines of code change.

Model 7B–70B:
Use FSDP. Shards automatically.

Fine-tuning any size:
Use FSDP + LoRA. Minimal memory.
Scale Up
Model 70B+, 100+ GPUs:
Use DeepSpeed ZeRO-3 or FSDP with TP.

Model 200B+, 1000+ GPUs:
Use Megatron-LM with 3D parallelism.

Maximum efficiency:
Custom 3D parallel config tuned to your hardware topology.
Key insight: Start with the simplest strategy that works. DDP for small models, FSDP for medium, DeepSpeed/Megatron for large. Premature optimization of parallelism strategy wastes engineering time. Only move to 3D parallelism when you’re training at 100+ GPU scale and need every percent of efficiency.
© 2026 Kiran Shirol — The AI Atlas. All rights reserved.