Ch 8 — Training Clusters: Anatomy of an AI Factory

DGX SuperPOD, GB200 NVL72, failure recovery, checkpointing, and real costs
The AI Factory Concept
A training cluster is a factory that converts electricity and data into intelligence
What Is an AI Factory?
NVIDIA CEO Jensen Huang coined the term “AI Factory” to describe modern training clusters. The analogy is apt:

Raw materials: Training data (trillions of tokens), electricity (megawatts), cooling (water/air)

Machinery: Thousands of GPUs, networking fabric, storage systems

Product: Trained model weights — the “intelligence” that powers AI applications

Quality control: Evaluation benchmarks, loss curves, human evaluation

Just like a physical factory, an AI factory requires careful engineering of every component. The GPUs are the machines, the network is the conveyor belt, the storage is the warehouse, and the power/cooling systems are the utilities. A bottleneck in any one component slows the entire factory.

The largest AI factories today cost $1–10 billion to build and consume 50–500 MW of power — equivalent to a small city.
Major AI Factories (2024–2026)
Meta (Llama 3 cluster):
  GPUs: 16,384 H100
  Network: InfiniBand NDR
  Power: ~100 MW
  Cost: ~$2-3B (estimated)

xAI (Colossus, Memphis):
  GPUs: 100,000 H100
  Network: InfiniBand
  Power: ~150 MW
  Built in: 122 days

Microsoft (Azure AI):
  GPUs: ~100K+ H100/A100
  Network: InfiniBand + Ethernet
  Multiple data centers

AWS (Project Rainier):
  Chips: ~500K Trainium2
  Facility: 1,200 acres
  For Anthropic training

Google (TPU pods):
  Chips: 100K+ TPU v5/v6
  Network: Jupiter (13 Pb/s)
  For Gemini training
Key insight: The AI industry is in an unprecedented infrastructure buildout. Over $600 billion in capex is planned for AI infrastructure in 2025–2026. The companies building the largest AI factories today will have a structural advantage in training the next generation of models — because training compute is the primary bottleneck.
DGX H100: The Building Block
8 GPUs, NVSwitch, InfiniBand — the standard unit of AI compute
Inside a DGX H100
The NVIDIA DGX H100 is the standard building block for AI clusters. Each system contains:

8× H100 SXM GPUs: 640 GB total HBM3. Connected by NVSwitch for 900 GB/s per GPU all-to-all bandwidth.

2× Intel Xeon CPUs: For orchestration, data preprocessing, and system management.

8× ConnectX-7 NICs: 400G InfiniBand per NIC. One NIC per GPU for rail-optimized networking. 3.2 Tb/s total network bandwidth.

2 TB system RAM: For data staging and preprocessing.

8× NVMe SSDs: 30 TB total local storage for checkpoints and data caching.

Power: ~10.2 kW per system. Air-cooled.

Price: ~$300,000–$400,000 per system ($37,500–$50,000 per GPU).
DGX H100 Specs
DGX H100 System:
  GPUs: 8× H100 SXM 80GB
  GPU Memory: 640 GB HBM3 total
  NVLink: 900 GB/s per GPU
  NVSwitch: 4 chips (full mesh)
  FP8 Total: 15.8 PFLOPS
  CPUs: 2× Intel Xeon 8480+
  System RAM: 2 TB DDR5
  Storage: 30 TB NVMe SSD
  Network: 8× 400G InfiniBand
  Power: 10.2 kW
  Cooling: Air
  Weight: ~120 kg
  Price: ~$300-400K

Scalable Unit (32 DGX):
  GPUs: 256
  Memory: 20 TB HBM3
  FP8: 506 PFLOPS
  Power: ~330 kW
  Price: ~$10-13M

Measured in low-precision (FP8) throughput, a single DGX H100 out-computes the world's fastest supercomputer from 2010.
Key insight: The DGX H100 is designed as a “unit of compute” — everything inside is balanced. The NVLink bandwidth matches GPU compute needs, the InfiniBand bandwidth matches cross-node communication needs, and the local storage handles checkpoint I/O. This balance is what makes it effective as a building block for larger clusters.
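The per-system numbers above compose linearly into the 32-DGX scalable unit. A minimal sketch of that arithmetic, using only the figures quoted in this section:

```python
# Sketch: aggregate the per-DGX figures from this section up to a
# 32-system "scalable unit". Power counts compute trays only; it
# excludes networking switches and cooling overhead.
DGX_GPUS = 8
DGX_FP8_PFLOPS = 15.8   # per system, FP8 total
DGX_HBM_GB = 640        # per system, HBM3
DGX_POWER_KW = 10.2     # per system

def scalable_unit(n_systems: int = 32) -> dict:
    """Scale the per-DGX specs to an n-system unit."""
    return {
        "gpus": n_systems * DGX_GPUS,
        "fp8_pflops": n_systems * DGX_FP8_PFLOPS,
        "hbm_tb": n_systems * DGX_HBM_GB / 1000,
        "power_kw": n_systems * DGX_POWER_KW,
    }

unit = scalable_unit(32)
print(unit)  # 256 GPUs, ~506 PFLOPS FP8, ~20.5 TB HBM, ~326 kW
```

The small gaps versus the sidebar's ~330 kW and "20 TB" reflect rounding and excluded networking gear, not a different formula.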
GB200 NVL72: The Next Generation
72 GPUs in one rack with 130 TB/s NVLink — a fundamentally different architecture
Rack-Scale Computing
The GB200 NVL72 represents a paradigm shift from server-scale to rack-scale computing:

72 Blackwell GPUs + 36 Grace CPUs in a single liquid-cooled rack. All 72 GPUs are connected by 5th-gen NVLink through 9 NVSwitch trays, creating one massive NVLink domain.

Key difference from DGX H100: In DGX H100, the NVLink domain is 8 GPUs (one server). Cross-server communication uses InfiniBand. In GB200 NVL72, the NVLink domain is 72 GPUs (one rack). This means tensor parallelism can span the entire rack at NVLink speeds.

Unified memory: 72 × 192 GB = 13.8 TB of GPU memory accessible at NVLink bandwidth. A 405B model in FP16 (810 GB) fits entirely within one rack’s NVLink domain — no InfiniBand needed for model weights.

Performance: NVIDIA quotes up to 30× faster real-time LLM inference and up to 4× faster training per GPU compared with the H100.
GB200 NVL72 Specs
GB200 NVL72 Rack:
  GPUs: 72× B200
  CPUs: 36× Grace (ARM)
  GPU Memory: 13.8 TB HBM3e
  NVLink BW: 130 TB/s total
  FP4 Total: ~720 PFLOPS
  Cooling: Liquid (required)
  Power: ~120 kW per rack
  Weight: ~1,400 kg

DGX H100 vs GB200 NVL72:
                  DGX H100    NVL72
  GPUs/unit:      8           72
  NVLink domain:  8           72
  GPU memory:     640 GB      13.8 TB
  NVLink BW:      7.2 TB/s    130 TB/s
  Cooling:        Air         Liquid
  Power:          10.2 kW     ~120 kW

The NVLink domain going from 8 to 72 GPUs is the biggest architectural change. It eliminates the InfiniBand bottleneck for most models.
Key insight: The GB200 NVL72 changes the economics of large model training. By putting 72 GPUs in one NVLink domain, it eliminates the InfiniBand bottleneck for models up to ~7 trillion parameters. This means simpler parallelism strategies, fewer networking costs, and higher GPU utilization. The trade-off: liquid cooling is mandatory, and the rack costs $2–3 million.
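The "fits in one NVLink domain" claim is simple arithmetic on weight bytes. A sketch, using this section's memory figures (weights only; optimizer state, KV cache, and activations need additional room in practice):

```python
# Sketch: does a model's weight footprint fit inside one NVLink domain,
# so tensor parallelism never has to cross InfiniBand? Capacities are
# this chapter's figures (GB200 NVL72: 72 x 192 GB = 13,824 GB).
NVLINK_DOMAIN_GB = {"DGX_H100": 640, "GB200_NVL72": 13_824}

def fits_in_domain(params_billions: float, bytes_per_param: int,
                   system: str) -> bool:
    weight_gb = params_billions * bytes_per_param  # 1B params x 2 B = 2 GB
    return weight_gb <= NVLINK_DOMAIN_GB[system]

# Llama 3 405B in FP16 is 810 GB of weights:
print(fits_in_domain(405, 2, "DGX_H100"))     # False -> needs InfiniBand
print(fits_in_domain(405, 2, "GB200_NVL72"))  # True -> one rack holds it
# FP16 weight ceiling for one NVL72 rack, in billions of parameters:
print(NVLINK_DOMAIN_GB["GB200_NVL72"] / 2)    # 6912.0, i.e. ~7T params
```

The last line is where the "~7 trillion parameters" figure in the key insight comes from.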
Failure Is the Norm, Not the Exception
Meta’s Llama 3: 419 failures in 54 days — one every 3 hours
The Reality of Large-Scale Training
When you run 16,384 GPUs for 54 days straight, things break. Constantly. Meta’s published data on Llama 3 405B training reveals the brutal reality:

419 unexpected interruptions over 54 days. That’s one failure every 3 hours.

Failure breakdown:
GPU failures: 148 events (30.1%) — the #1 cause
HBM3 memory: 72 events (17.2%) — memory dies fail
GPU SRAM/processor: 36 events (8.6%)
Network switches/cables: 35 events (8.4%)
Software/other: remaining events
CPU failures: only 2 events (CPUs are reliable)

Despite this, Meta achieved >90% effective training time through automated recovery. Only 3 events required manual intervention in 54 days.
Failure Statistics
Llama 3 405B training failures:
  Duration: 54 days
  GPUs: 16,384 H100
  Total failures: 419
  Failures/day: 7.76
  MTBF (cluster): ~3 hours

Failure causes:
  GPU hardware: 30.1% (148)
  HBM3 memory: 17.2% (72)
  GPU SRAM: 5.5% (23)
  GPU SXM: 3.1% (13)
  Network: 8.4% (35)
  Host/software: 17.8% (74)
  CPU: 0.5% (2)
  Other: 17.4% (72)

Recovery:
  Automated: 416 events
  Manual: 3 events
  Effective time: >90%

Environmental note: daily temperature changes caused 1-2% throughput fluctuation across the cluster.
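Meta's numbers also let you extrapolate failure rates to other cluster sizes. A rough sketch, assuming failures are independent and evenly attributable per GPU (a simplification, since many interruptions were not GPU faults):

```python
# Sketch: derive a per-GPU MTBF from Meta's published Llama 3 failure
# counts, then see how cluster-level MTBF shrinks as GPU count grows.
GPUS, DAYS, FAILURES = 16_384, 54, 419

cluster_hours = DAYS * 24                  # 1,296 h of wall-clock training
cluster_mtbf = cluster_hours / FAILURES    # ~3.1 h between interruptions
per_gpu_mtbf = cluster_mtbf * GPUS         # ~50,700 h (~5.8 years) per GPU

for n in (8, 1_024, 16_384, 100_000):
    print(f"{n:>7,} GPUs -> one interruption every {per_gpu_mtbf / n:7.1f} h")
```

At 100,000 GPUs this naive extrapolation predicts an interruption roughly every half hour, which is why the design goal is fast automated recovery, not failure prevention.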
Key insight: At 16K GPU scale, hardware failure is a statistical certainty, not a rare event. The key to successful training isn’t preventing failures (impossible) but recovering from them quickly. Meta’s automated recovery system handled 99.3% of failures without human intervention. Building this resilience infrastructure is as important as buying the GPUs.
Checkpoint Strategies: Your Insurance Policy
Saving model state frequently enough to minimize lost work on failure
The Save Game Analogy
Checkpointing is like saving your game. If the game crashes, you lose everything since your last save. Save too rarely and you lose hours of progress. Save too often and you spend all your time saving instead of playing.

What gets saved: Model weights, optimizer states, learning rate schedule, data loader position, random number generator states. For a 70B model, a full checkpoint is roughly 750 GB (FP16 weights plus FP32 Adam states) and can approach 1.5 TB if FP32 master weights and gradient state are also saved.

Multi-tier checkpointing:
Tier 1 — Local NVMe (fast): Save to local SSDs every 5–10 minutes. Fast write (~10 GB/s), but lost if the server fails.
Tier 2 — Parallel filesystem (medium): Save to Lustre/GPFS every 30–60 minutes. Survives server failure but slower write.
Tier 3 — Cloud/tape (slow): Save to S3/GCS every few hours. Survives data center failure. Slowest but most durable.

On failure: try Tier 1 first (fastest recovery). If corrupted, fall back to Tier 2, then Tier 3.
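The tiered save-and-recover logic above can be sketched in a few lines. This is an illustrative toy, not a real checkpoint API: the directory names stand in for the actual mounts, the promotion cadence (every save / every 6th / every 36th) mimics the 5-minute / 30-minute / multi-hour tiers, and a real system would verify a checksum instead of just reading bytes.

```python
# Toy sketch of multi-tier checkpointing: every save hits the fast
# tier, fewer saves are promoted to slower, more durable tiers, and
# recovery tries the fastest tier first.
import tempfile
from pathlib import Path

class TieredCheckpointer:
    """Tier 0 = local NVMe, tier 1 = parallel FS, tier 2 = object store."""
    def __init__(self, base: Path, promote_every=(1, 6, 36)):
        self.tiers = [base / name for name in ("nvme", "lustre", "s3")]
        self.promote_every = promote_every
        for t in self.tiers:
            t.mkdir(parents=True, exist_ok=True)

    def save(self, step: int, state: bytes) -> None:
        # Write to each tier whose cadence divides the step number.
        for tier, every in zip(self.tiers, self.promote_every):
            if step % every == 0:
                (tier / f"step_{step}.ckpt").write_bytes(state)

    def load_latest(self):
        # Fastest tier first; fall back if a tier is empty.
        for tier in self.tiers:
            ckpts = sorted(tier.glob("step_*.ckpt"),
                           key=lambda p: int(p.stem.split("_")[1]))
            if ckpts:
                return ckpts[-1].read_bytes()
        return None

ck = TieredCheckpointer(Path(tempfile.mkdtemp()))
for step in range(1, 13):
    ck.save(step, f"weights+optimizer@{step}".encode())
print(ck.load_latest())  # b'weights+optimizer@12'
```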
Checkpoint Math
70B model checkpoint size:
  Weights (FP16): 140 GB
  Optimizer (FP32): 560 GB
  Other state: ~50 GB
  Total: ~750 GB per replica

405B model checkpoint:
  Total: ~4.5 TB per replica

Checkpoint frequency trade-off:
  Every 5 min (aggressive):
    Lost work on failure: 5 min max
    Write overhead: ~8-15% of time
    Storage: ~100 TB/day
  Every 30 min (balanced):
    Lost work on failure: 30 min max
    Write overhead: ~2-4% of time
    Storage: ~20 TB/day
  Every 2 hours (conservative):
    Lost work on failure: 2 hrs max
    Write overhead: <1% of time
    Storage: ~5 TB/day

With failures every 3 hours, 30-minute checkpoints lose ~15 min of work on average. At $50K/hr for 16K GPUs, that's $12,500 per failure.
Key insight: Checkpoint frequency is an economic optimization. More frequent checkpoints cost storage and write overhead but save compute on recovery. With failures every 3 hours and GPU costs of $50K/hr, the optimal checkpoint interval is typically 10–30 minutes. Async checkpointing (writing while training continues) can reduce overhead to near zero.
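The "optimal interval" can be estimated with the classic Young/Daly approximation, which balances write overhead against expected lost work: interval ≈ √(2 × write_time × MTBF). A sketch, assuming a hypothetical 2-minute checkpoint write and the ~3-hour cluster MTBF from the Llama 3 numbers:

```python
# Sketch: Young/Daly approximation for the checkpoint interval that
# minimizes total overhead (write time + expected lost work on failure).
import math

def optimal_interval_s(write_time_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * write_time_s * mtbf_s)

# Hypothetical 120 s checkpoint write, ~3 h cluster MTBF:
tau = optimal_interval_s(write_time_s=120, mtbf_s=3 * 3600)
print(f"checkpoint every ~{tau / 60:.0f} minutes")  # ~27 minutes
```

The result lands squarely inside the 10-30 minute range quoted above; a faster (e.g. async) write shrinks the optimal interval further.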
Job Scheduling and Resource Management
Keeping thousands of GPUs busy 24/7 is an operations challenge
Scheduling Challenges
A 16K-GPU cluster is shared by multiple teams running different jobs. The scheduler must:

1. Maximize utilization: Idle GPUs cost $3–4/hr each. 1,000 idle GPUs = $3,000–4,000/hr wasted. Target: >90% utilization.

2. Handle priorities: A critical training run for the next product launch takes priority over experimental fine-tuning jobs.

3. Gang scheduling: A 1,024-GPU training job needs all 1,024 GPUs to start simultaneously. You can’t start with 900 and add 124 later.

4. Preemption: When a high-priority job arrives, lower-priority jobs must checkpoint and yield their GPUs. This requires reliable checkpointing.

5. Topology awareness: A 64-GPU job should be scheduled on GPUs that are close together in the network topology (same rack, same rail) to minimize communication latency.
Scheduling Tools
Common schedulers:

Slurm (HPC heritage):
  Most common for GPU clusters
  Gang scheduling built-in
  Topology-aware placement
  Preemption support
  Used by: Meta, most HPC labs

Kubernetes + GPU plugin:
  Cloud-native approach
  Better for mixed workloads
  GPU device plugin for allocation
  Less mature for large training
  Used by: cloud providers

Ray (distributed framework):
  Python-native scheduling
  Good for ML pipelines
  Autoscaling support
  Used by: Anyscale, OpenAI

Utilization targets:
  Training clusters: >85%
  Inference clusters: >60%
  Mixed clusters: >70%

Average GPU utilization across the industry is only 30-50%. Better scheduling can save millions per year.
Key insight: GPU utilization is the single biggest lever for cost optimization. Industry average is 30–50%, but well-managed clusters achieve 85–95%. The difference between 50% and 90% utilization on a 1,000-GPU cluster is $13M/year in wasted compute. Scheduling and operations matter as much as hardware selection.
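The "$13M/year" figure follows directly from the hourly rates in this section. A sketch, using $3.70/GPU-hr as an illustrative midpoint of the chapter's $3-4 range:

```python
# Sketch: annual cost of idle GPUs on a 1,000-GPU cluster at a given
# utilization level. The $3.70/GPU-hr rate is an assumed midpoint.
HOURS_PER_YEAR = 8_760

def wasted_per_year(gpus: int, rate_per_gpu_hr: float,
                    utilization: float) -> float:
    """Dollars per year spent on GPU-hours that do no work."""
    return gpus * rate_per_gpu_hr * HOURS_PER_YEAR * (1 - utilization)

for u in (0.5, 0.7, 0.9):
    print(f"utilization {u:.0%}: ~${wasted_per_year(1_000, 3.70, u)/1e6:.1f}M/yr idle")

recovered = wasted_per_year(1_000, 3.70, 0.5) - wasted_per_year(1_000, 3.70, 0.9)
print(f"50% -> 90% utilization recovers ~${recovered/1e6:.0f}M/yr")  # ~$13M
```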
The Real Cost of a Training Run
GPUs are just the start — networking, storage, power, and people add up fast
Cost Breakdown
The headline GPU cost is just part of the total. A complete training run includes:

GPU compute: The largest cost. 16,384 H100s at $3/GPU-hr for 54 days = ~$63M. This is the cloud cost; on-prem amortized cost is lower but requires upfront capital.

Networking: InfiniBand switches, cables, optics. 15–25% of cluster cost. For a 16K-GPU cluster: $30–60M in networking hardware.

Storage: Parallel filesystem for checkpoints and training data. 5–10% of cluster cost. Petabytes of high-performance storage.

Power and cooling: ~10 kW per DGX × 2,048 DGX systems = 20 MW. At $0.10/kWh, that's roughly $1.4M/month in electricity alone.

People: ML engineers, infrastructure engineers, SREs. A team of 20–50 people supporting a large training run. $5–15M/year in salaries.

Wasted compute: Failures, restarts, debugging, hyperparameter search. Typically 20–40% of total compute is “wasted.”
Llama 3 405B Cost Estimate
Llama 3 405B training cost:
  GPU compute (cloud equivalent):
    16,384 GPUs × $3/hr × 54 days = ~$63M
  Including wasted compute (~30%): ~$82M
  Networking (amortized): ~$5-10M
  Storage (amortized): ~$2-5M
  Power/cooling: ~$2-3M
  People (training period): ~$2-4M
  Total estimated cost: ~$95-105M

For comparison:
  GPT-4: ~$100-200M (estimated)
  Gemini: ~$100-300M (estimated)
  Claude 3: ~$50-100M (estimated)

These costs are why only a handful of organizations can train frontier models.
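The GPU line item above is the only computed value; the rest are rough estimates. A sketch reproducing it:

```python
# Sketch: cloud-equivalent GPU compute cost for a training run, with an
# optional multiplier for wasted compute (failures, restarts, sweeps).
def gpu_compute_cost(gpus: int, rate_per_hr: float, days: float,
                     waste_factor: float = 0.0) -> float:
    return gpus * rate_per_hr * days * 24 * (1 + waste_factor)

base = gpu_compute_cost(16_384, 3.0, 54)              # ~$63.7M
with_waste = gpu_compute_cost(16_384, 3.0, 54, 0.30)  # ~$82.8M
print(f"${base/1e6:.1f}M base, ${with_waste/1e6:.1f}M with ~30% wasted compute")
```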
Key insight: The $100M+ cost of training a frontier model creates a massive barrier to entry. But the cost per FLOP is dropping ~2x every 2 years as hardware improves. What costs $100M today will cost $25M in 4 years. The question isn’t whether training costs will decrease, but whether model sizes will grow faster than costs decrease.
Scaling: From 8 GPUs to 100,000
The challenges multiply at every order of magnitude
Scaling Challenges by Size
8 GPUs (1 server):
  Challenges: none significant
  Parallelism: DDP or FSDP
  Network: NVLink only
  Failures: rare
  Cost: ~$25/hr

64 GPUs (8 servers):
  Challenges: network config
  Parallelism: FSDP + TP
  Network: NVLink + InfiniBand
  Failures: weekly
  Cost: ~$200/hr

1,024 GPUs (128 servers):
  Challenges: scheduling, monitoring
  Parallelism: 3D parallel
  Network: rail-optimized
  Failures: daily
  Cost: ~$3,000/hr

16,384 GPUs (2,048 servers):
  Challenges: everything
  Parallelism: full 3D + custom
  Network: SuperPOD
  Failures: every 3 hours
  Cost: ~$50,000/hr

100,000 GPUs:
  Challenges: physics limits
  Power: ~100+ MW
  Cooling: liquid required
  Cost: ~$300,000/hr
What Changes at Each Scale
8 → 64 GPUs: You need InfiniBand or fast Ethernet. Network configuration becomes important. NCCL tuning starts to matter.

64 → 1,024 GPUs: You need a real scheduler (Slurm). Monitoring becomes essential. Failures happen often enough to need automated recovery. Checkpoint strategy matters.

1,024 → 16K GPUs: You need a dedicated infrastructure team. Custom diagnostics tools. Topology-aware scheduling. Multi-tier checkpointing. Environmental monitoring (temperature affects throughput).

16K → 100K GPUs: You’re building a data center, not renting one. Power procurement (100+ MW) takes years. Liquid cooling is mandatory. You need custom networking fabric. The operational complexity rivals running a power plant.
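The "100+ MW" figure at the top end is back-of-envelope power math. A sketch, where the per-GPU draw, host overhead, and PUE (total facility power divided by IT power) are illustrative assumptions rather than quoted specs:

```python
# Sketch: estimate facility power for a GPU fleet.
# Assumptions: ~700 W per H100-class GPU, ~25% extra for CPUs, NICs,
# and storage, and a PUE of 1.25 for cooling and distribution losses.
def facility_mw(gpus: int, watts_per_gpu: float,
                host_overhead: float, pue: float) -> float:
    it_load_w = gpus * watts_per_gpu * (1 + host_overhead)
    return it_load_w * pue / 1e6

print(f"{facility_mw(100_000, 700, 0.25, 1.25):.0f} MW")  # ~109 MW
```

Any reasonable choice of these parameters lands in the same regime: a 100K-GPU cluster is a ~100 MW facility, which is why power procurement dominates the planning timeline.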
Key insight: Scaling AI training is not linear. Every 10x increase in GPU count brings qualitatively new challenges. The jump from 1,000 to 10,000 GPUs isn’t just “10x more of the same” — it requires fundamentally different infrastructure, operations, and engineering. This is why experience at scale is one of the most valuable assets in AI.