Ch 5 — Full Fine-Tuning & Distributed — Under the Hood
DeepSpeed configs, FSDP code, Accelerate setup, gradient checkpointing, and multi-node training
A. Data Parallelism & Communication
How GPUs coordinate during distributed training
In data-parallel training, GPU 0 through GPU N each hold a model replica and process their own batch shard (GPU 0 gets shard 0, GPU 1 gets shard 1, and so on). After the backward pass, gradients are synchronized across devices with an all-reduce. NCCL (NVIDIA Collective Communications Library) provides the GPU-to-GPU communication.
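A minimal sketch of this pattern with PyTorch DistributedDataParallel, assuming a `torchrun` launcher and the NCCL backend (the linear model and random batch are stand-ins for a real model and data shard):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL handles GPU-to-GPU comms
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # all-reduces grads in backward

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")      # this rank's batch shard
    loss = model(x).pow(2).mean()
    loss.backward()                              # gradients all-reduced here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 ddp_demo.py`; DDP inserts the all-reduce during `backward()`, so every replica takes an identical optimizer step.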
B. DeepSpeed ZeRO Configuration
JSON configs for Stage 1, 2, and 3
A ZeRO-2 config (ds_config.json) is launched through Accelerate with `accelerate launch`; for large models, a ZeRO-3 config is used instead. CPU offloading moves optimizer states to CPU RAM when GPU memory is insufficient.
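A minimal ZeRO-2 `ds_config.json` sketch; the `offload_optimizer` block is the optional CPU offload described above, and the `"auto"` values are filled in by Accelerate at launch time:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```

Changing `"stage"` to 3 additionally shards the parameters themselves, which is what makes ZeRO-3 suitable for large models; the config file is then referenced from the Accelerate config and the job started with `accelerate launch train.py`.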
C. PyTorch FSDP Configuration
Accelerate FSDP config and wrapping policies
Running `accelerate config` generates the FSDP configuration; with an auto-wrap policy, the model is wrapped per transformer layer, so each layer is sharded as its own FSDP unit.
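The same wrapping can be sketched with PyTorch FSDP directly. This assumes a `torchrun` launcher and a GPT-2-style model; the `GPT2Block` class is specific to that architecture and would change for other models:

```python
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

dist.init_process_group("nccl")                       # launched via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Auto-wrap policy: each GPT2Block (one transformer layer) becomes its own
# FSDP unit, so parameters are sharded and gathered per layer.
policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}
)
model = FSDP(model.cuda(), auto_wrap_policy=policy)
```

In the Accelerate config the equivalent choice is the `TRANSFORMER_BASED_WRAP` auto-wrap policy plus the name of the transformer layer class to wrap.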
D. Memory Optimization Code
Gradient checkpointing, accumulation, and mixed precision
- Gradient checkpointing: recompute activations during the backward pass instead of storing them.
- Gradient accumulation: simulate a large batch by stepping the optimizer once every several micro-batches.
- Flash attention: fused attention kernels with O(n) memory.
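A sketch combining the three techniques, assuming a small causal LM and illustrative hyperparameters; flash attention is left as a comment since it requires the flash-attn package and a model that supports it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    # attn_implementation="flash_attention_2",  # O(n)-memory attention,
    #                                           # if flash-attn is installed
).cuda()
model.config.use_cache = False           # KV caching conflicts with checkpointing
model.gradient_checkpointing_enable()    # recompute activations in backward

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
accum_steps = 8                          # effective batch = 8 micro-batches
texts = ["a stand-in training example"] * 32

for step, text in enumerate(texts):
    enc = tok(text, return_tensors="pt").to("cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision
        loss = model(**enc, labels=enc["input_ids"]).loss
    (loss / accum_steps).backward()      # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```

Scaling the loss by `accum_steps` before `backward()` keeps the accumulated gradient equal to the average over the simulated large batch; bf16 autocast needs no gradient scaler, unlike fp16.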
E. Complete Full Fine-Tuning Script
End-to-end training with Accelerate + DeepSpeed
The full script is ready to run end to end: launch training, monitor loss curves in WandB, and checkpoint regularly so a preempted job can resume from the last saved state.
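A condensed end-to-end sketch with Accelerate; the dataset, project name, and checkpoint path are illustrative, and in practice it would be paired with the ZeRO config above via `accelerate launch`:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("full-ft-demo")          # hypothetical project name

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

enc = tok(["a stand-in training example"] * 64,    # stand-in dataset
          return_tensors="pt", padding=True)
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]),
                    batch_size=4)

# prepare() wires in the DeepSpeed/FSDP backend and distributed data loading.
model, opt, loader = accelerator.prepare(model, opt, loader)

for step, (ids, mask) in enumerate(loader):
    loss = model(input_ids=ids, attention_mask=mask, labels=ids).loss
    accelerator.backward(loss)                     # backend-aware backward
    opt.step()
    opt.zero_grad()
    accelerator.log({"loss": loss.item()}, step=step)  # WandB loss curve
    if step % 10 == 0:
        accelerator.save_state("ckpt")  # resume with accelerator.load_state("ckpt")

accelerator.end_training()
```

`save_state` captures model, optimizer, and RNG state, so a preempted run restarts from the last checkpoint rather than from scratch.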