Ch 5 — Full Fine-Tuning & Distributed — Under the Hood
DeepSpeed configs, FSDP code, Accelerate setup, gradient checkpointing, and multi-node training
A. Data Parallelism & Communication
How GPUs coordinate during distributed training
In data-parallel training, GPU 0 through GPU N each hold a model replica and process their own batch shard (GPU 0 gets shard 0, GPU 1 gets shard 1, and so on). After the backward pass, gradients are synchronized across devices with an all-reduce. NCCL (NVIDIA Collective Communications Library) provides the GPU-to-GPU communication.
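A minimal sketch of this pattern with PyTorch DistributedDataParallel, assuming a `torchrun` launcher and the NCCL backend (the linear model and random batch are stand-ins for a real model and data shard):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL handles GPU-to-GPU comms
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # all-reduces grads in backward

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")      # this rank's batch shard
    loss = model(x).pow(2).mean()
    loss.backward()                              # gradients all-reduced here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 ddp_demo.py`; DDP inserts the all-reduce during `backward()`, so every replica takes an identical optimizer step.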
B. DeepSpeed ZeRO Configuration
JSON configs for Stage 1, 2, and 3
A ZeRO-2 config (ds_config.json) is launched through Accelerate with `accelerate launch`; for large models, a ZeRO-3 config is used instead. CPU offloading moves optimizer states to CPU RAM when GPU memory is insufficient.
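A minimal ZeRO-2 `ds_config.json` sketch; the `offload_optimizer` block is the optional CPU offload described above, and the `"auto"` values are filled in by Accelerate at launch time:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```

Changing `"stage"` to 3 additionally shards the parameters themselves, which is what makes ZeRO-3 suitable for large models; the config file is then referenced from the Accelerate config and the job started with `accelerate launch train.py`.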
C. PyTorch FSDP Configuration
Accelerate FSDP config and wrapping policies
Running `accelerate config` generates the FSDP configuration; with an auto-wrap policy, the model is wrapped per transformer layer, so each layer is sharded as its own FSDP unit.
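The same wrapping can be sketched with PyTorch FSDP directly. This assumes a `torchrun` launcher and a GPT-2-style model; the `GPT2Block` class is specific to that architecture and would change for other models:

```python
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

dist.init_process_group("nccl")                       # launched via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Auto-wrap policy: each GPT2Block (one transformer layer) becomes its own
# FSDP unit, so parameters are sharded and gathered per layer.
policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block}
)
model = FSDP(model.cuda(), auto_wrap_policy=policy)
```

In the Accelerate config the equivalent choice is the `TRANSFORMER_BASED_WRAP` auto-wrap policy plus the name of the transformer layer class to wrap.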
D. Memory Optimization Code
Gradient checkpointing, accumulation, and mixed precision
- Gradient checkpointing: recompute activations during the backward pass instead of storing them.
- Gradient accumulation: simulate a large batch by stepping the optimizer once every several micro-batches.
- Flash attention: fused attention kernels with O(n) memory.
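A sketch combining the three techniques, assuming a small causal LM and illustrative hyperparameters; flash attention is left as a comment since it requires the flash-attn package and a model that supports it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    # attn_implementation="flash_attention_2",  # O(n)-memory attention,
    #                                           # if flash-attn is installed
).cuda()
model.config.use_cache = False           # KV caching conflicts with checkpointing
model.gradient_checkpointing_enable()    # recompute activations in backward

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
accum_steps = 8                          # effective batch = 8 micro-batches
texts = ["a stand-in training example"] * 32

for step, text in enumerate(texts):
    enc = tok(text, return_tensors="pt").to("cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision
        loss = model(**enc, labels=enc["input_ids"]).loss
    (loss / accum_steps).backward()      # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```

Scaling the loss by `accum_steps` before `backward()` keeps the accumulated gradient equal to the average over the simulated large batch; bf16 autocast needs no gradient scaler, unlike fp16.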
E. Complete Full Fine-Tuning Script
End-to-end training with Accelerate + DeepSpeed
The full script is ready to run end to end: launch training, monitor loss curves in WandB, and checkpoint regularly so a preempted job can resume from the last saved state.
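A condensed end-to-end sketch with Accelerate; the dataset, project name, and checkpoint path are illustrative, and in practice it would be paired with the ZeRO config above via `accelerate launch`:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("full-ft-demo")          # hypothetical project name

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

enc = tok(["a stand-in training example"] * 64,    # stand-in dataset
          return_tensors="pt", padding=True)
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]),
                    batch_size=4)

# prepare() wires in the DeepSpeed/FSDP backend and distributed data loading.
model, opt, loader = accelerator.prepare(model, opt, loader)

for step, (ids, mask) in enumerate(loader):
    loss = model(input_ids=ids, attention_mask=mask, labels=ids).loss
    accelerator.backward(loss)                     # backend-aware backward
    opt.step()
    opt.zero_grad()
    accelerator.log({"loss": loss.item()}, step=step)  # WandB loss curve
    if step % 10 == 0:
        accelerator.save_state("ckpt")  # resume with accelerator.load_state("ckpt")

accelerator.end_training()
```

`save_state` captures model, optimizer, and RNG state, so a preempted run restarts from the last checkpoint rather than from scratch.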