What FSDP Does
FSDP (Fully Sharded Data Parallel; Zhao et al., 2023, from Meta) is PyTorch's native equivalent of DeepSpeed ZeRO Stage 3. It shards model parameters, gradients, and optimizer states across GPUs. During computation, each unit's parameters are gathered on demand (all-gather), used for the forward or backward pass, and then freed; gradients are averaged and re-sharded with a reduce-scatter.
FSDP is built into PyTorch (no external library needed) and is the preferred approach for Meta's own LLM training.
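A minimal sketch of wrapping a model in FSDP, assuming a multi-GPU host launched with torchrun (which sets the environment variables torch.distributed reads); the model and hyperparameters are placeholders:

```python
# Minimal FSDP sketch. Assumes launch via torchrun with one process per GPU;
# model and learning rate are stand-ins, not a recommended configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")  # one process per GPU under torchrun
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Transformer().cuda()  # placeholder for a real LLM
    # Wrapping shards parameters, gradients, and optimizer state across ranks;
    # all-gathers and reduce-scatters happen automatically inside forward/backward.
    model = FSDP(model)

    # Build the optimizer AFTER wrapping so it references the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    return model, optimizer


if __name__ == "__main__":
    main()
```

Note the ordering: the optimizer is constructed after the FSDP wrap, so its state is allocated per-shard rather than for the full model.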
FSDP vs DeepSpeed
FSDP advantages:
- Native PyTorch (no external dependency)
- Better integration with PyTorch ecosystem
- Preferred by Meta, used for Llama training
- Simpler configuration
DeepSpeed advantages:
- More mature, battle-tested at scale
- CPU/NVMe offloading (ZeRO-Infinity)
- ZeRO Stages 1 and 2 (less communication overhead than Stage 3)
- Better documentation and community support
Both achieve similar memory savings and throughput in practice; choose based on your ecosystem and tooling.
Sharding Strategies
FULL_SHARD: Shard everything (like ZeRO Stage 3). Maximum memory savings, highest communication.
SHARD_GRAD_OP: Shard gradients and optimizer states only (like ZeRO Stage 2). Good balance.
NO_SHARD: Standard data parallelism. No memory savings but fastest communication.
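The strategy is selected with the sharding_strategy argument when wrapping the model; a short sketch (the model and process-group setup are assumed, as in a torchrun launch):

```python
# Selecting an FSDP sharding strategy. Assumes torch.distributed is already
# initialized (e.g. via torchrun); `model` is a placeholder module.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy


def wrap(model, strategy=ShardingStrategy.FULL_SHARD):
    # FULL_SHARD    -> params + grads + optimizer state sharded (ZeRO-3-like)
    # SHARD_GRAD_OP -> grads + optimizer state sharded only (ZeRO-2-like)
    # NO_SHARD      -> standard data parallelism (DDP-like)
    return FSDP(model, sharding_strategy=strategy)
```

Switching strategies is a one-argument change, which makes it cheap to measure the memory/communication trade-off on your own workload.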
# FSDP with HuggingFace Accelerate
# accelerate config → select FSDP
# Then launch:
accelerate launch \
  --num_processes 8 \
  train.py
# Or with torchrun:
torchrun --nproc_per_node 8 train.py
For most users: Use HuggingFace Accelerate with either DeepSpeed or FSDP. Accelerate abstracts the complexity and lets you switch between backends with a config file. Run accelerate config to set up, then accelerate launch train.py to train.