Ch 5 — Full Fine-Tuning & Distributed Training

When to use full fine-tuning, DeepSpeed ZeRO, FSDP, multi-GPU strategies, and memory optimization
When Full Fine-Tuning Beats LoRA
The cases where updating all parameters is worth the cost
Full Fine-Tuning Advantages
1. Maximum quality: Updating all parameters gives the model the most capacity to adapt. For complex domain shifts (e.g., English model to Japanese, general model to specialized medical reasoning), full fine-tuning can outperform LoRA by 2-5%.

2. Large datasets: With 100K+ high-quality examples, full fine-tuning can leverage the extra capacity. LoRA with rank 16 may bottleneck on very large datasets.

3. Continued pre-training: When adapting a model to a new domain by training on raw text (not instruction pairs), full fine-tuning is standard. This is sometimes called "domain-adaptive pre-training."

4. Creating a new base model: If you're building a foundation model for others to fine-tune on top of, full fine-tuning produces a cleaner result than a merged LoRA adapter.
When LoRA Is Still Better
1. Limited data (<50K examples): LoRA acts as a regularizer. Full fine-tuning on small datasets overfits more easily.

2. Limited compute: Full fine-tuning of a 7B model needs 4-8 GPUs; LoRA needs only one.

3. Multiple tasks: LoRA adapters can be swapped. Full fine-tuning creates a separate model per task.

4. Rapid iteration: LoRA trains 2-5x faster than full fine-tuning. Better for experimentation.
Full Fine-Tuning: all parameters updated; maximum quality; 4-8+ GPUs needed; hours to days.
LoRA: 0.1-1% of parameters updated; near-full quality; 1 GPU sufficient; minutes to hours.
The 80/20 rule: LoRA handles 80% of fine-tuning use cases. Full fine-tuning is for the remaining 20% where you need maximum quality, have abundant data and compute, or are doing continued pre-training. Always try LoRA first.
Memory Requirements
Why full fine-tuning needs so much GPU memory
The Four Memory Components
For a 7B model in bf16 with AdamW optimizer:

1. Model weights (bf16): 7B × 2 bytes = 14 GB

2. Gradients (bf16): 7B × 2 bytes = 14 GB

3. Optimizer states (fp32): AdamW stores first moment (m) and second moment (v), both in fp32: 7B × 4 × 2 = 56 GB

4. Activations: Intermediate values saved for backpropagation. Depends on batch size and sequence length: 2-16 GB

Total: ~86-100 GB for a 7B model. That's 3× A100 40GB GPUs or 2× A100 80GB GPUs.
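The arithmetic above can be sketched as a small helper (a back-of-the-envelope estimate, not a profiler; the byte counts follow the bf16 + AdamW setup described here):

```python
def full_ft_memory_gb(params_b: float, activation_gb: float = 0.0) -> dict:
    """Rough memory estimate for full fine-tuning in bf16 with AdamW.

    params_b: parameter count in billions.
    Weights and gradients: 2 bytes/param each (bf16).
    Optimizer states: two fp32 moments at 4 bytes each -> 8 bytes/param.
    """
    weights = params_b * 2   # GB: 1B params * 2 bytes ~= 2 GB
    grads = params_b * 2
    optimizer = params_b * 8
    total = weights + grads + optimizer + activation_gb
    return {"weights": weights, "grads": grads,
            "optimizer": optimizer, "total": total}

# 7B model: 14 + 14 + 56 = 84 GB before activations
print(full_ft_memory_gb(7))
```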
Scaling to Larger Models
Llama 3 8B: ~90 GB total → 2× A100 80GB
Llama 3 70B: ~800 GB total → 16× A100 80GB (2 nodes)
Llama 3 405B: ~4.6 TB total → 64+ A100 80GB (8+ nodes)

This is why distributed training is essential for full fine-tuning. No single GPU can hold a 7B model's full training state.
Memory Optimization Techniques
Gradient checkpointing: Trade compute for memory. Recompute activations during backward pass instead of storing them. Saves 60-70% of activation memory at ~30% compute overhead.

Mixed precision (bf16): Keep weights and gradients in bf16, only optimizer states in fp32. Standard practice on modern GPUs.

Gradient accumulation: Process small batches, accumulate gradients, then update. Simulates a larger batch size without the memory cost.
The optimizer states dominate memory. For a 7B model, weights are 14 GB but optimizer states are 56 GB (4x the weights). This is why DeepSpeed ZeRO and FSDP focus on distributing optimizer states across GPUs.
DeepSpeed ZeRO
Zero Redundancy Optimizer: distribute training state across GPUs
The Problem ZeRO Solves
In standard data parallelism, every GPU holds a complete copy of the model, gradients, and optimizer states. With 8 GPUs, you have 8 redundant copies. ZeRO (Rajbhandari et al., 2020, Microsoft) eliminates this redundancy by partitioning the training state across GPUs.
ZeRO Stage 1: Partition Optimizer States
Each GPU holds only 1/N of the optimizer states (where N = number of GPUs). For 8 GPUs: each holds 56/8 = 7 GB of optimizer states instead of 56 GB. Weights and gradients are still replicated.

Memory per GPU (7B, 8 GPUs): 14 (weights) + 14 (gradients) + 7 (optimizer) = 35 GB
ZeRO Stage 2: + Partition Gradients
Also partition gradients across GPUs. Each GPU holds 1/N of gradients.

Memory per GPU: 14 (weights) + 1.75 (gradients) + 7 (optimizer) = ~23 GB
ZeRO Stage 3: + Partition Weights
Partition everything: weights, gradients, and optimizer states. Each GPU holds only 1/N of each. During forward/backward pass, weights are gathered on-demand (all-gather) and released after use.

Memory per GPU: 1.75 (weights) + 1.75 (gradients) + 7 (optimizer) = ~11 GB
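The per-GPU figures for each stage follow directly from dividing the partitioned components by the GPU count. A sketch under the same bf16 + AdamW assumptions as above, with activations ignored:

```python
def zero_memory_per_gpu(params_b: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for ZeRO stages 1-3, bf16 + AdamW."""
    weights, grads, optim = params_b * 2, params_b * 2, params_b * 8
    if stage >= 1:
        optim /= n_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3: also partition weights
    return weights + grads + optim

for s in (1, 2, 3):
    print(f"Stage {s}: {zero_memory_per_gpu(7, 8, s):.2f} GB/GPU")
# Stage 1: 35.00, Stage 2: 22.75 (~23), Stage 3: 10.50 (~11)
```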
Stage 1: partitions optimizer states; ~35 GB/GPU; low communication; easiest to use.
Stage 2: partitions optimizer states + gradients; ~23 GB/GPU; medium communication; good default.
Stage 3: partitions everything; ~11 GB/GPU; high communication; for large models.
Stage 3 + Offload: adds CPU/NVMe offload; minimal GPU memory; highest communication; last resort.
Recommendation: Start with ZeRO Stage 2 (best balance of memory savings and speed). Use Stage 3 only when the model doesn't fit with Stage 2. Avoid CPU offloading unless absolutely necessary (it's 3-10x slower). HuggingFace Accelerate makes DeepSpeed integration easy.
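A minimal ZeRO Stage 2 configuration looks roughly like this (a sketch; field names follow the DeepSpeed JSON schema, and "auto" lets HuggingFace Trainer fill in values from its own arguments):

```python
# Minimal DeepSpeed ZeRO Stage 2 config as a Python dict.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # overlap gradient comms with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
# Save as ds_config.json and pass it via
# TrainingArguments(deepspeed="ds_config.json"), or pass the dict directly.
print(ds_config["zero_optimization"]["stage"])
```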
PyTorch FSDP
Fully Sharded Data Parallel: PyTorch-native distributed training
What FSDP Does
FSDP (Zhao et al., 2023, Meta) is PyTorch's native equivalent of DeepSpeed ZeRO Stage 3. It shards model parameters, gradients, and optimizer states across GPUs. During computation, parameters are gathered on demand (all-gather), used, and then freed; gradients are synchronized and partitioned with a reduce-scatter.

FSDP is built into PyTorch (no external library needed) and is the preferred approach for Meta's own LLM training.
FSDP vs DeepSpeed
FSDP advantages:
- Native PyTorch (no external dependency)
- Better integration with PyTorch ecosystem
- Preferred by Meta, used for Llama training
- Simpler configuration

DeepSpeed advantages:
- More mature, battle-tested at scale
- CPU/NVMe offloading (ZeRO-Infinity)
- ZeRO Stage 1 and 2 (less communication overhead)
- Better documentation and community support

Both achieve similar results. Choose based on your ecosystem.
Sharding Strategies
FULL_SHARD: Shard everything (like ZeRO Stage 3). Maximum memory savings, highest communication.

SHARD_GRAD_OP: Shard gradients and optimizer states only (like ZeRO Stage 2). Good balance.

NO_SHARD: Standard data parallelism. No memory savings but fastest communication.
# FSDP with HuggingFace Accelerate
# accelerate config → select FSDP
# Then launch:
accelerate launch \
  --num_processes 8 \
  train.py

# Or with torchrun:
torchrun --nproc_per_node 8 train.py
For most users: Use HuggingFace Accelerate with either DeepSpeed or FSDP. Accelerate abstracts the complexity and lets you switch between backends with a config file. Run accelerate config to set up, then accelerate launch train.py to train.
Training Optimization Techniques
Gradient checkpointing, accumulation, and mixed precision
Gradient Checkpointing
Problem: During backpropagation, all intermediate activations must be stored. For a 7B model with batch_size=4 and seq_len=2048, activations can use 8-16 GB.

Solution: Don't store all activations. Instead, save only at "checkpoint" boundaries (typically every transformer layer). During backward pass, recompute the activations between checkpoints.

Trade-off: Saves 60-70% of activation memory at the cost of ~30% more compute (each activation is computed twice). Almost always worth it for full fine-tuning.
Gradient Accumulation
Problem: Large batch sizes improve training stability but don't fit in memory.

Solution: Process small micro-batches, accumulate gradients over N steps, then perform one optimizer update. Effective batch size = micro_batch × num_GPUs × accumulation_steps.

Example: micro_batch=2, 4 GPUs, accumulation=4 → effective batch = 32.
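The arithmetic, plus the inverse question you usually ask in practice ("how many accumulation steps do I need to reach my target batch?"), can be sketched as:

```python
import math

def effective_batch_size(micro_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Number of examples contributing to a single optimizer update."""
    return micro_batch * num_gpus * accum_steps

def accum_steps_for(target_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Smallest accumulation count reaching at least the target effective batch."""
    return math.ceil(target_batch / (micro_batch * num_gpus))

print(effective_batch_size(2, 4, 4))  # 32, matching the example above
print(accum_steps_for(128, 2, 4))     # 16 steps to reach effective batch 128
```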
Mixed Precision Training
bf16 (bfloat16): The standard for modern LLM training. Same exponent range as fp32 (no overflow issues) but half the memory. Supported on A100, H100, RTX 30xx+, and Apple M-series.

fp16: Older alternative. Requires loss scaling to avoid underflow. Use only if bf16 is not available.

The mixed precision recipe: Forward and backward pass in bf16. Optimizer states in fp32. Weight updates in fp32, then cast back to bf16. This is handled automatically by HuggingFace Trainer with bf16=True.
Flash Attention
FlashAttention-2 (Dao, 2023) fuses the attention computation into a single GPU kernel, avoiding the O(n²) memory cost of materializing the attention matrix. It reduces attention memory from O(n²) to O(n) in sequence length and speeds up training by 20-40%. In HuggingFace Transformers it is not on by default: enable it by passing attn_implementation="flash_attention_2" at model load time (requires the flash-attn package).
Always enable: gradient_checkpointing=True, bf16=True, and FlashAttention-2. These three optimizations together reduce memory by 50-70% with minimal speed impact. They are the baseline for any full fine-tuning run.
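In HuggingFace Trainer terms, the baseline maps to a handful of arguments. This is a configuration sketch, not a full training script; batch-size values are placeholders:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                       # mixed precision
    gradient_checkpointing=True,     # recompute activations in backward pass
    per_device_train_batch_size=2,   # placeholder; tune to your GPU
    gradient_accumulation_steps=8,   # placeholder; sets effective batch size
)
# FlashAttention-2 is selected at model load time, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, attn_implementation="flash_attention_2")
```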
Cost Analysis
GPU hours, cloud costs, and cost optimization
GPU Cost Reference (2024-2025)
Cloud GPU hourly rates (approximate):
NVIDIA A100 80GB: $1.50-$3.00/hr
NVIDIA H100 80GB: $2.50-$4.00/hr
NVIDIA A100 40GB: $1.00-$2.00/hr
NVIDIA L40S 48GB: $1.00-$1.50/hr
NVIDIA RTX 4090 24GB: $0.40-$0.80/hr

Providers: Lambda Labs, RunPod, Together AI, Vast.ai, AWS, GCP, Azure. Spot/preemptible instances are 50-70% cheaper.
Training Time Estimates
Llama 3 8B, full fine-tuning:
10K examples, 3 epochs: ~4-8 hours on 4× A100 80GB
Cost: ~$24-$96

Llama 3 8B, LoRA:
10K examples, 3 epochs: ~1-2 hours on 1× A100 80GB
Cost: ~$3-$6

Llama 3 70B, full fine-tuning:
10K examples, 3 epochs: ~24-48 hours on 8× A100 80GB
Cost: ~$288-$1,152
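These estimates are just GPUs × hours × hourly rate; a quick sketch using the ranges above:

```python
def training_cost(num_gpus: int, hours: tuple, rate_per_gpu_hr: tuple) -> tuple:
    """(low, high) USD cost range for a training run."""
    return (num_gpus * hours[0] * rate_per_gpu_hr[0],
            num_gpus * hours[1] * rate_per_gpu_hr[1])

# Llama 3 8B full FT: 4x A100 80GB at $1.50-$3.00/hr for 4-8 hours
print(training_cost(4, (4, 8), (1.50, 3.00)))    # (24.0, 96.0)
# Llama 3 70B full FT: 8 GPUs for 24-48 hours
print(training_cost(8, (24, 48), (1.50, 3.00)))  # (288.0, 1152.0)
```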
Cost Optimization Tips
1. Use spot instances: 50-70% cheaper. Save checkpoints frequently to resume after preemption.

2. Start with LoRA: Validate your data and approach with LoRA ($5-$20) before committing to full fine-tuning ($50-$500+).

3. Use smaller models first: Prototype on 8B, then scale to 70B only if needed.

4. Optimize data first: Better data quality at 5K examples beats mediocre data at 50K. Less data = less training time = less cost.

5. Early stopping: Monitor validation loss. Stop training when it plateaus or increases. Don't waste compute on overfitting.
The cost-effective path: (1) Prototype with QLoRA on 8B ($2-$5). (2) If quality is good, train with LoRA on 8B ($5-$20). (3) If you need more quality, try full fine-tuning on 8B ($50-$100). (4) Only scale to 70B if 8B is genuinely insufficient. Most tasks don't need 70B.
Decision Framework
Choosing the right training strategy
Decision Tree
Q: Do you have <50K examples?
→ Yes: Use LoRA (less overfitting risk)

Q: Do you need multi-task/multi-tenant?
→ Yes: Use LoRA (swappable adapters)

Q: Is this continued pre-training on raw text?
→ Yes: Use full fine-tuning

Q: Do you have 4+ A100 GPUs available?
→ No: Use LoRA or QLoRA
→ Yes: Full fine-tuning is an option

Q: Is this a major domain shift (e.g., language change)?
→ Yes: Full fine-tuning likely needed
→ No: LoRA is probably sufficient
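The tree above can be encoded as a small helper for sanity-checking a plan (question order and thresholds taken directly from the tree; it returns a suggestion, not a verdict):

```python
def suggest_strategy(n_examples: int, multi_task: bool,
                     continued_pretraining: bool, num_a100s: int,
                     major_domain_shift: bool) -> str:
    """Walk the decision tree from the text, in order; purely illustrative."""
    if n_examples < 50_000:
        return "LoRA (less overfitting risk)"
    if multi_task:
        return "LoRA (swappable adapters)"
    if continued_pretraining:
        return "full fine-tuning"
    if num_a100s < 4:
        return "LoRA or QLoRA (limited compute)"
    if major_domain_shift:
        return "full fine-tuning (likely needed)"
    return "LoRA (probably sufficient)"

print(suggest_strategy(100_000, False, True, 8, False))  # full fine-tuning
```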
Distributed Training Cheat Sheet
1 GPU (24 GB): QLoRA for 8B-70B, LoRA for 8B
1 GPU (80 GB): LoRA for 8B-70B, full FT for 8B (tight)
2-4 GPUs (80 GB each): Full FT for 8B with DeepSpeed ZeRO-2
8 GPUs (80 GB each): Full FT for 8B-70B with ZeRO-2/3 or FSDP
16+ GPUs: Full FT for 70B+ with ZeRO-3 or FSDP
Distributed Framework Choice
Single node (1-8 GPUs): DeepSpeed ZeRO-2 or FSDP SHARD_GRAD_OP. Both work well.

Multi-node (16+ GPUs): DeepSpeed ZeRO-3 or FSDP FULL_SHARD. DeepSpeed has better multi-node support historically, but FSDP is catching up.

Wrapper: Always use HuggingFace Accelerate. It abstracts the complexity and lets you switch backends easily.
The golden rule: Start simple. Try LoRA first. If quality is insufficient, try LoRA with higher rank or more target modules. Only move to full fine-tuning when you have evidence that LoRA is the bottleneck, not data quality or hyperparameters.