How Data Parallelism Works
Think of a classroom with 8 teachers, each with an identical copy of the textbook. Each teacher grades a different stack of exams. At the end, they compare notes and agree on the grading rubric for next time.
Step 1: Copy the full model to every GPU.
Step 2: Split each training batch into mini-batches, one per GPU.
Step 3: Each GPU computes gradients on its mini-batch independently.
Step 4: AllReduce — average all gradients across all GPUs.
Step 5: Every GPU updates its model with the averaged gradients.
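The five steps above can be simulated on a single machine with plain NumPy. This is an illustrative sketch, not framework code: a toy linear-regression "model", four simulated GPUs, and a mean standing in for AllReduce. It also demonstrates why averaging works: with equal-sized shards, the averaged gradient is exactly the full-batch gradient.

```python
import numpy as np

# Toy model: linear regression, loss = mean((x @ w - y)^2).
# Simulate steps 2-5 of data parallelism on one machine. All names
# here are illustrative; nothing below is a PyTorch/NCCL API.

rng = np.random.default_rng(0)
N_GPUS = 4
w = np.zeros(3)                                 # step 1: every "GPU" holds identical weights
x = rng.normal(size=(32, 3))
y = x @ np.array([1.0, -2.0, 0.5])

def grad(w, xb, yb):
    # Gradient of mean squared error w.r.t. w
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

shards = np.array_split(np.arange(32), N_GPUS)          # step 2: one mini-batch per GPU
local_grads = [grad(w, x[i], y[i]) for i in shards]     # step 3: independent gradients
avg_grad = np.mean(local_grads, axis=0)                 # step 4: "AllReduce" = average
w -= 0.1 * avg_grad                                     # step 5: identical update on every GPU

# With equal shard sizes, the averaged gradient equals the gradient
# computed on the full batch at the original weights:
assert np.allclose(avg_grad, grad(np.zeros(3), x, y))
```

Because every replica starts from the same weights and applies the same averaged gradient, the replicas stay bit-for-bit identical after every step, with no extra synchronization of the weights themselves.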
The result: each GPU processes 1/N of the data, so training runs up to ~N times faster (minus communication overhead). The communication cost is one AllReduce per training step; with a ring AllReduce, each GPU sends and receives roughly 2× the gradient payload, which is about the size of the model.
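That 2× factor can be made concrete. A ring AllReduce (the algorithm NCCL commonly uses) moves 2·(N−1)/N times the gradient payload per GPU, which approaches 2× the model size as N grows. The model size and GPU count below are illustrative:

```python
# Back-of-the-envelope AllReduce traffic per GPU per training step,
# assuming a ring AllReduce over fp32 gradients. The 7B / 8-GPU
# numbers are illustrative, not from the text.

def ring_allreduce_bytes_per_gpu(n_params: int, n_gpus: int,
                                 bytes_per_param: int = 4) -> float:
    payload = n_params * bytes_per_param          # fp32 gradient buffer
    return 2 * (n_gpus - 1) / n_gpus * payload    # sent + received per GPU

# A 7B-parameter model on 8 GPUs:
traffic = ring_allreduce_bytes_per_gpu(7_000_000_000, 8)
print(f"{traffic / 1e9:.1f} GB per GPU per step")  # 49.0 GB
```

At that volume, overlapping communication with the backward pass (which DDP does by bucketing gradients) is what keeps the interconnect from becoming the bottleneck.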
DDP (Distributed Data Parallel) in PyTorch implements this. It’s the simplest form of distributed training and works well when the model fits on a single GPU.
DDP Code Pattern
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler

# Initialize the process group (NCCL backend for GPUs)
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each GPU gets the same model
model = MyModel().to(rank)
model = DistributedDataParallel(model, device_ids=[rank])
optimizer = torch.optim.AdamW(model.parameters())

# Each GPU gets different data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop — DDP syncs gradients automatically
for batch in loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)  # forward pass + loss (user-defined)
    loss.backward()                    # AllReduce happens here
    optimizer.step()
Key insight: Data parallelism is the simplest strategy but requires each GPU to hold the full model. For a 70B model needing 1.5 TB for training, you’d need 19+ GPUs just for one copy — and DDP needs one copy per GPU. This is why FSDP was invented: it shards the model across GPUs instead of copying it.
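The arithmetic behind that claim is worth spelling out. 1.5 TB over 70B parameters works out to roughly 21 bytes per parameter, consistent with common mixed-precision training footprints (fp16 params and grads plus fp32 Adam state); the exact breakdown is an assumption, only the 1.5 TB figure comes from the text:

```python
import math

# Rough check of the numbers above. The 80 GB per-GPU figure is an
# assumption (a typical high-end accelerator), not from the text.
TRAIN_BYTES = 1.5e12      # ~1.5 TB training footprint for a 70B model
GPU_MEM = 80e9            # assuming 80 GB of memory per GPU

print(TRAIN_BYTES / 70e9)                    # ~21.4 bytes per parameter
print(math.ceil(TRAIN_BYTES / GPU_MEM))      # 19 GPUs for ONE training copy
```

DDP would then multiply that footprint by the number of GPUs, which is exactly the redundancy FSDP's sharding eliminates.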