Ch 5 — Full Fine-Tuning & Distributed Training

When to use full fine-tuning, DeepSpeed ZeRO, FSDP, multi-GPU strategies, and memory optimization
When Full Fine-Tuning Beats LoRA
The cases where updating all parameters is worth the cost
Full Fine-Tuning Advantages
1. Maximum quality: Updating all parameters gives the model the most capacity to adapt. For complex domain shifts (e.g., English model to Japanese, general model to specialized medical reasoning), full fine-tuning can outperform LoRA by 2-5%.

2. Large datasets: With 100K+ high-quality examples, full fine-tuning can leverage the extra capacity. LoRA with rank 16 may bottleneck on very large datasets.

3. Continued pre-training: When adapting a model to a new domain by training on raw text (not instruction pairs), full fine-tuning is standard. This is sometimes called "domain-adaptive pre-training."

4. Creating a new base model: If you're building a foundation model for others to fine-tune on top of, full fine-tuning produces a cleaner result than a merged LoRA adapter.
When LoRA Is Still Better
1. Limited data (<50K examples): LoRA acts as a regularizer. Full fine-tuning on small datasets overfits more easily.

2. Limited compute: Full fine-tuning of a 7B model needs 4-8 GPUs; LoRA needs only one.

3. Multiple tasks: LoRA adapters can be swapped. Full fine-tuning creates a separate model per task.

4. Rapid iteration: LoRA trains 2-5x faster than full fine-tuning. Better for experimentation.
Full Fine-Tuning: all parameters updated; maximum quality; 4-8+ GPUs needed; hours to days.
LoRA: 0.1-1% of parameters updated; near-full quality; 1 GPU sufficient; minutes to hours.
The 80/20 rule: LoRA handles 80% of fine-tuning use cases. Full fine-tuning is for the remaining 20% where you need maximum quality, have abundant data and compute, or are doing continued pre-training. Always try LoRA first.
Memory Requirements
Why full fine-tuning needs so much GPU memory
The Four Memory Components
For a 7B model in bf16 with AdamW optimizer:

1. Model weights (bf16): 7B × 2 bytes = 14 GB

2. Gradients (bf16): 7B × 2 bytes = 14 GB

3. Optimizer states (fp32): AdamW stores first moment (m) and second moment (v), both in fp32: 7B × 4 × 2 = 56 GB

4. Activations: Intermediate values saved for backpropagation. Depends on batch size and sequence length: 2-16 GB

Total: ~86-100 GB for a 7B model. That's 3× A100 40GB GPUs or 2× A100 80GB GPUs.
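The arithmetic above can be sketched as a small helper (a back-of-the-envelope estimate, not a profiler; the byte counts follow the bf16 + AdamW setup described here):

```python
def full_ft_memory_gb(params_b: float, activation_gb: float = 0.0) -> dict:
    """Rough memory estimate for full fine-tuning in bf16 with AdamW.

    params_b: parameter count in billions.
    Weights and gradients: 2 bytes/param each (bf16).
    Optimizer states: two fp32 moments at 4 bytes each -> 8 bytes/param.
    """
    weights = params_b * 2   # GB: 1B params * 2 bytes ~= 2 GB
    grads = params_b * 2
    optimizer = params_b * 8
    total = weights + grads + optimizer + activation_gb
    return {"weights": weights, "grads": grads,
            "optimizer": optimizer, "total": total}

# 7B model: 14 + 14 + 56 = 84 GB before activations
print(full_ft_memory_gb(7))
```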
Scaling to Larger Models
Llama 3 8B: ~90 GB total → 2× A100 80GB
Llama 3 70B: ~800 GB total → 16× A100 80GB (2 nodes)
Llama 3 405B: ~4.6 TB total → 64+ A100 80GB (8+ nodes)

This is why distributed training is essential for full fine-tuning. No single GPU can hold a 7B model's full training state.
Memory Optimization Techniques
Gradient checkpointing: Trade compute for memory. Recompute activations during backward pass instead of storing them. Saves 60-70% of activation memory at ~30% compute overhead.

Mixed precision (bf16): Keep weights and gradients in bf16, only optimizer states in fp32. Standard practice on modern GPUs.

Gradient accumulation: Process small batches, accumulate gradients, then update. Simulates a larger batch size without the memory cost.
The optimizer states dominate memory. For a 7B model, weights are 14 GB but optimizer states are 56 GB (4x the weights). This is why DeepSpeed ZeRO and FSDP focus on distributing optimizer states across GPUs.
DeepSpeed ZeRO
Zero Redundancy Optimizer: distribute training state across GPUs
The Problem ZeRO Solves
In standard data parallelism, every GPU holds a complete copy of the model, gradients, and optimizer states. With 8 GPUs, you have 8 redundant copies. ZeRO (Rajbhandari et al., 2020, Microsoft) eliminates this redundancy by partitioning the training state across GPUs.
ZeRO Stage 1: Partition Optimizer States
Each GPU holds only 1/N of the optimizer states (where N = number of GPUs). For 8 GPUs: each holds 56/8 = 7 GB of optimizer states instead of 56 GB. Weights and gradients are still replicated.

Memory per GPU (7B, 8 GPUs): 14 (weights) + 14 (gradients) + 7 (optimizer) = 35 GB
ZeRO Stage 2: + Partition Gradients
Also partition gradients across GPUs. Each GPU holds 1/N of gradients.

Memory per GPU: 14 (weights) + 1.75 (gradients) + 7 (optimizer) = ~23 GB
ZeRO Stage 3: + Partition Weights
Partition everything: weights, gradients, and optimizer states. Each GPU holds only 1/N of each. During forward/backward pass, weights are gathered on-demand (all-gather) and released after use.

Memory per GPU: 1.75 (weights) + 1.75 (gradients) + 7 (optimizer) = ~11 GB
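The per-GPU figures for each stage follow directly from dividing the partitioned components by the GPU count. A sketch under the same bf16 + AdamW assumptions as above, with activations ignored:

```python
def zero_memory_per_gpu(params_b: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for ZeRO stages 1-3, bf16 + AdamW."""
    weights, grads, optim = params_b * 2, params_b * 2, params_b * 8
    if stage >= 1:
        optim /= n_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3: also partition weights
    return weights + grads + optim

for s in (1, 2, 3):
    print(f"Stage {s}: {zero_memory_per_gpu(7, 8, s):.2f} GB/GPU")
# Stage 1: 35.00, Stage 2: 22.75 (~23), Stage 3: 10.50 (~11)
```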
Stage 1: partitions optimizer states; ~35 GB/GPU; low communication; easiest to use.
Stage 2: partitions optimizer states + gradients; ~23 GB/GPU; medium communication; good default.
Stage 3: partitions everything; ~11 GB/GPU; high communication; for large models.
Stage 3 + Offload: adds CPU/NVMe offload; minimal GPU memory; highest communication; last resort.
Recommendation: Start with ZeRO Stage 2 (best balance of memory savings and speed). Use Stage 3 only when the model doesn't fit with Stage 2. Avoid CPU offloading unless absolutely necessary (it's 3-10x slower). HuggingFace Accelerate makes DeepSpeed integration easy.
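A minimal ZeRO Stage 2 configuration looks roughly like this (a sketch; field names follow the DeepSpeed JSON schema, and "auto" lets HuggingFace Trainer fill in values from its own arguments):

```python
# Minimal DeepSpeed ZeRO Stage 2 config as a Python dict.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # overlap gradient comms with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
# Save as ds_config.json and pass it via
# TrainingArguments(deepspeed="ds_config.json"), or pass the dict directly.
print(ds_config["zero_optimization"]["stage"])
```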
PyTorch FSDP
Fully Sharded Data Parallel: PyTorch-native distributed training
What FSDP Does
FSDP (Zhao et al., 2023, Meta) is PyTorch's native equivalent of DeepSpeed ZeRO Stage 3. It shards model parameters, gradients, and optimizer states across GPUs. During computation, parameters are gathered on demand (all-gather), used, and then freed; gradients are synchronized and partitioned with a reduce-scatter.

FSDP is built into PyTorch (no external library needed) and is the preferred approach for Meta's own LLM training.
FSDP vs DeepSpeed
FSDP advantages:
- Native PyTorch (no external dependency)
- Better integration with PyTorch ecosystem
- Preferred by Meta, used for Llama training
- Simpler configuration

DeepSpeed advantages:
- More mature, battle-tested at scale
- CPU/NVMe offloading (ZeRO-Infinity)
- ZeRO Stage 1 and 2 (less communication overhead)
- Better documentation and community support

Both achieve similar results. Choose based on your ecosystem.
Sharding Strategies
FULL_SHARD: Shard everything (like ZeRO Stage 3). Maximum memory savings, highest communication.

SHARD_GRAD_OP: Shard gradients and optimizer states only (like ZeRO Stage 2). Good balance.

NO_SHARD: Standard data parallelism. No memory savings but fastest communication.
# FSDP with HuggingFace Accelerate
# accelerate config → select FSDP
# Then launch:
accelerate launch \
  --num_processes 8 \
  train.py

# Or with torchrun:
torchrun --nproc_per_node 8 train.py
For most users: Use HuggingFace Accelerate with either DeepSpeed or FSDP. Accelerate abstracts the complexity and lets you switch between backends with a config file. Run accelerate config to set up, then accelerate launch train.py to train.
Training Optimization Techniques
Gradient checkpointing, accumulation, and mixed precision
Gradient Checkpointing
Problem: During backpropagation, all intermediate activations must be stored. For a 7B model with batch_size=4 and seq_len=2048, activations can use 8-16 GB.

Solution: Don't store all activations. Instead, save only at "checkpoint" boundaries (typically every transformer layer). During backward pass, recompute the activations between checkpoints.

Trade-off: Saves 60-70% of activation memory at the cost of ~30% more compute (each activation is computed twice). Almost always worth it for full fine-tuning.
Gradient Accumulation
Problem: Large batch sizes improve training stability but don't fit in memory.

Solution: Process small micro-batches, accumulate gradients over N steps, then perform one optimizer update. Effective batch size = micro_batch × num_GPUs × accumulation_steps.

Example: micro_batch=2, 4 GPUs, accumulation=4 → effective batch = 32.
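The arithmetic, plus the inverse question you usually ask in practice ("how many accumulation steps do I need to reach my target batch?"), can be sketched as:

```python
import math

def effective_batch_size(micro_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Number of examples contributing to a single optimizer update."""
    return micro_batch * num_gpus * accum_steps

def accum_steps_for(target_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Smallest accumulation count reaching at least the target effective batch."""
    return math.ceil(target_batch / (micro_batch * num_gpus))

print(effective_batch_size(2, 4, 4))  # 32, matching the example above
print(accum_steps_for(128, 2, 4))     # 16 steps to reach effective batch 128
```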
Mixed Precision Training
bf16 (bfloat16): The standard for modern LLM training. Same exponent range as fp32 (no overflow issues) but half the memory. Supported on A100, H100, RTX 30xx+, and Apple M-series.

fp16: Older alternative. Requires loss scaling to avoid underflow. Use only if bf16 is not available.

The mixed precision recipe: Forward and backward pass in bf16. Optimizer states in fp32. Weight updates in fp32, then cast back to bf16. This is handled automatically by HuggingFace Trainer with bf16=True.
Flash Attention
FlashAttention-2 (Dao, 2023) fuses the attention computation into a single GPU kernel, avoiding the O(n²) memory cost of materializing the attention matrix. It reduces attention memory from O(n²) to O(n) in sequence length and speeds up training by 20-40%. In HuggingFace Transformers it is not on by default: enable it by passing attn_implementation="flash_attention_2" at model load time (requires the flash-attn package).
Always enable: gradient_checkpointing=True, bf16=True, and FlashAttention-2. These three optimizations together reduce memory by 50-70% with minimal speed impact. They are the baseline for any full fine-tuning run.
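In HuggingFace Trainer terms, the baseline maps to a handful of arguments. This is a configuration sketch, not a full training script; batch-size values are placeholders:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                       # mixed precision
    gradient_checkpointing=True,     # recompute activations in backward pass
    per_device_train_batch_size=2,   # placeholder; tune to your GPU
    gradient_accumulation_steps=8,   # placeholder; sets effective batch size
)
# FlashAttention-2 is selected at model load time, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, attn_implementation="flash_attention_2")
```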
Cost Analysis
GPU hours, cloud costs, and cost optimization
GPU Cost Reference (2024-2025)
Cloud GPU hourly rates (approximate):
NVIDIA A100 80GB: $1.50-$3.00/hr
NVIDIA H100 80GB: $2.50-$4.00/hr
NVIDIA A100 40GB: $1.00-$2.00/hr
NVIDIA L40S 48GB: $1.00-$1.50/hr
NVIDIA RTX 4090 24GB: $0.40-$0.80/hr

Providers: Lambda Labs, RunPod, Together AI, Vast.ai, AWS, GCP, Azure. Spot/preemptible instances are 50-70% cheaper.
Training Time Estimates
Llama 3 8B, full fine-tuning:
10K examples, 3 epochs: ~4-8 hours on 4× A100 80GB
Cost: ~$24-$96

Llama 3 8B, LoRA:
10K examples, 3 epochs: ~1-2 hours on 1× A100 80GB
Cost: ~$3-$6

Llama 3 70B, full fine-tuning:
10K examples, 3 epochs: ~24-48 hours on 8× A100 80GB
Cost: ~$288-$1,152
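These estimates are just GPUs × hours × hourly rate; a quick sketch using the ranges above:

```python
def training_cost(num_gpus: int, hours: tuple, rate_per_gpu_hr: tuple) -> tuple:
    """(low, high) USD cost range for a training run."""
    return (num_gpus * hours[0] * rate_per_gpu_hr[0],
            num_gpus * hours[1] * rate_per_gpu_hr[1])

# Llama 3 8B full FT: 4x A100 80GB at $1.50-$3.00/hr for 4-8 hours
print(training_cost(4, (4, 8), (1.50, 3.00)))    # (24.0, 96.0)
# Llama 3 70B full FT: 8 GPUs for 24-48 hours
print(training_cost(8, (24, 48), (1.50, 3.00)))  # (288.0, 1152.0)
```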
Cost Optimization Tips
1. Use spot instances: 50-70% cheaper. Save checkpoints frequently to resume after preemption.

2. Start with LoRA: Validate your data and approach with LoRA ($5-$20) before committing to full fine-tuning ($50-$500+).

3. Use smaller models first: Prototype on 8B, then scale to 70B only if needed.

4. Optimize data first: Better data quality at 5K examples beats mediocre data at 50K. Less data = less training time = less cost.

5. Early stopping: Monitor validation loss. Stop training when it plateaus or increases. Don't waste compute on overfitting.
The cost-effective path: (1) Prototype with QLoRA on 8B ($2-$5). (2) If quality is good, train with LoRA on 8B ($5-$20). (3) If you need more quality, try full fine-tuning on 8B ($50-$100). (4) Only scale to 70B if 8B is genuinely insufficient. Most tasks don't need 70B.
Decision Framework
Choosing the right training strategy
Decision Tree
Q: Do you have <50K examples?
→ Yes: Use LoRA (less overfitting risk)

Q: Do you need multi-task/multi-tenant?
→ Yes: Use LoRA (swappable adapters)

Q: Is this continued pre-training on raw text?
→ Yes: Use full fine-tuning

Q: Do you have 4+ A100 GPUs available?
→ No: Use LoRA or QLoRA
→ Yes: Full fine-tuning is an option

Q: Is this a major domain shift (e.g., language change)?
→ Yes: Full fine-tuning likely needed
→ No: LoRA is probably sufficient
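The tree above can be encoded as a small helper for sanity-checking a plan (question order and thresholds taken directly from the tree; it returns a suggestion, not a verdict):

```python
def suggest_strategy(n_examples: int, multi_task: bool,
                     continued_pretraining: bool, num_a100s: int,
                     major_domain_shift: bool) -> str:
    """Walk the decision tree from the text, in order; purely illustrative."""
    if n_examples < 50_000:
        return "LoRA (less overfitting risk)"
    if multi_task:
        return "LoRA (swappable adapters)"
    if continued_pretraining:
        return "full fine-tuning"
    if num_a100s < 4:
        return "LoRA or QLoRA (limited compute)"
    if major_domain_shift:
        return "full fine-tuning (likely needed)"
    return "LoRA (probably sufficient)"

print(suggest_strategy(100_000, False, True, 8, False))  # full fine-tuning
```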
Distributed Training Cheat Sheet
1 GPU (24 GB): QLoRA for 8B-70B, LoRA for 8B
1 GPU (80 GB): LoRA for 8B-70B, full FT for 8B (tight)
2-4 GPUs (80 GB each): Full FT for 8B with DeepSpeed ZeRO-2
8 GPUs (80 GB each): Full FT for 8B-70B with ZeRO-2/3 or FSDP
16+ GPUs: Full FT for 70B+ with ZeRO-3 or FSDP
Distributed Framework Choice
Single node (1-8 GPUs): DeepSpeed ZeRO-2 or FSDP SHARD_GRAD_OP. Both work well.

Multi-node (16+ GPUs): DeepSpeed ZeRO-3 or FSDP FULL_SHARD. DeepSpeed has better multi-node support historically, but FSDP is catching up.

Wrapper: Always use HuggingFace Accelerate. It abstracts the complexity and lets you switch backends easily.
The golden rule: Start simple. Try LoRA first. If quality is insufficient, try LoRA with higher rank or more target modules. Only move to full fine-tuning when you have evidence that LoRA is the bottleneck, not data quality or hyperparameters.