Multi-Tier Checkpointing Strategy
Production training runs use a tiered approach, balancing speed against durability:
Tier 1 — Local NVMe (every 5–10 min): Each GPU writes its shard to the node’s local NVMe SSD. At ~7 GB/s, writing a 1.1 GB shard takes well under a second. Protects against GPU and process failures, but not node failures: if the node dies, its local disk goes with it.
Tier 2 — Parallel FS (every 30–60 min): The full checkpoint is written to Lustre/GPFS. At ~50 GB/s aggregate bandwidth, 560 GB takes ~11 seconds. Protects against node failures but not data-center failures.
Tier 3 — Cloud/Remote (every few hours): The checkpoint is copied to S3/GCS for disaster recovery. At ~10 GB/s, 560 GB takes ~56 seconds. Protects against everything, but is the slowest tier. (The write-time arithmetic for all three tiers is checked just below.)
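A back-of-envelope check of those write times, using only the figures quoted above (not measurements):

```python
# Write time per tier = data written / sustained bandwidth.
tiers = {
    # tier: (GB written, GB/s)
    "local NVMe (per-GPU shard)": (1.1, 7.0),
    "parallel FS (full checkpoint)": (560.0, 50.0),
    "cloud object store (full checkpoint)": (560.0, 10.0),
}
for name, (size_gb, bw_gbps) in tiers.items():
    print(f"{name}: {size_gb} GB / {bw_gbps} GB/s = {size_gb / bw_gbps:.1f} s")

# local NVMe (per-GPU shard): 1.1 GB / 7.0 GB/s = 0.2 s
# parallel FS (full checkpoint): 560.0 GB / 50.0 GB/s = 11.2 s
# cloud object store (full checkpoint): 560.0 GB / 10.0 GB/s = 56.0 s
```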
Asynchronous writes: All tiers use async I/O — training continues while checkpoints drain to storage in the background. The GPU only stalls if the next checkpoint starts before the previous one finishes writing.
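As a concrete illustration of that stall rule, here is a minimal tier-1 sketch in PyTorch (an assumed framework; `checkpoint_async`, `_to_cpu`, and the NVMe path are illustrative names, not from the source). State is snapshotted to host memory first, then a single writer thread drains it to disk; the call only blocks if the previous write is still in flight:

```python
import concurrent.futures
import torch

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
_pending = None  # checkpoint currently draining to local NVMe, if any

def _to_cpu(obj):
    # Recursively clone tensors to host memory so the background writer
    # never races with the training step mutating GPU state.
    if torch.is_tensor(obj):
        return obj.detach().cpu().clone()
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def checkpoint_async(model, optimizer, path):
    """Snapshot to host memory, then write to disk off the training thread."""
    global _pending
    if _pending is not None:
        _pending.result()  # the only stall: previous write still in flight
    state = {"model": _to_cpu(model.state_dict()),
             "optim": _to_cpu(optimizer.state_dict())}
    _pending = _executor.submit(torch.save, state, path)

# In the training loop, e.g. every 5 minutes' worth of steps
# (the path is illustrative):
# checkpoint_async(model, optimizer, f"/local_nvme/ckpt_{step:08d}.pt")
```

The host-memory snapshot is the load-bearing detail: handing live GPU tensors to a writer thread would race with the next optimizer step. Tiers 2 and 3 would sit on top as separate background jobs that copy the newest local file outward on their own, slower timers.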
Cost of Checkpoint Frequency
Scenario: 16K GPU cluster, $2.50/hr/GPU
Cluster cost: $40,000/hr (≈$667/min)
One failure every 3 hours (roughly the interruption rate Meta reported for Llama 3 training), i.e. 8 failures/day. On average a failure lands halfway through a checkpoint interval, so expected lost work per failure is half the maximum:
30-min checkpoints:
Max lost work: 30 min × $667/min ≈ $20,000
Average lost: $10,000 per failure
Daily (8 failures): $80,000/day
10-min checkpoints:
Max lost work: 10 min × $667/min = $6,670
Average lost: $3,335 per failure
Daily: $26,680/day
5-min checkpoints (local NVMe):
Max lost work: 5 min × $667/min = $3,335
Average lost: $1,668 per failure
Daily: $13,340/day
Savings: 5-min vs 30-min = $66,660/day
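The whole comparison collapses to one line of arithmetic: expected daily loss = failures/day × (interval / 2) × cost/min. A quick reproduction, using only the section's figures (outputs differ from the list above by a fraction of a percent because the list rounds $666.67/min up to $667):

```python
GPUS = 16_000
COST_PER_GPU_HR = 2.50
COST_PER_MIN = GPUS * COST_PER_GPU_HR / 60   # ≈ $666.67/min
FAILURES_PER_DAY = 24 / 3                    # one failure every 3 hours

def expected_daily_loss(interval_min: float) -> float:
    # A failure lands, on average, mid-interval, so expected lost work
    # per failure is (interval / 2) minutes of cluster time.
    return FAILURES_PER_DAY * (interval_min / 2) * COST_PER_MIN

for interval in (30, 10, 5):
    print(f"{interval}-min checkpoints: ${expected_daily_loss(interval):,.0f}/day lost")

# 30-min checkpoints: $80,000/day lost
# 10-min checkpoints: $26,667/day lost
# 5-min checkpoints: $13,333/day lost
```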
Key insight: Checkpointing is like the autosave in a video game. Save too rarely and you lose hours of progress when you die. Save too often and you spend all your time at save points instead of playing. The multi-tier approach is like having a quick-save (local NVMe) for frequent saves and a full save-to-cloud for the important milestones.