Ch 10 — Storage & Data Pipelines: Feeding the GPUs

Parallel file systems, GPUDirect Storage, checkpointing, data loading, and storage architecture
The Storage Bottleneck: Starving GPUs
Why $400K GPUs sit idle waiting for data
The Speed Gap
Modern GPUs process data far faster than storage can deliver it. An H100 GPU consumes data at 3.35 TB/s from its HBM3 memory. But the fastest NVMe SSD delivers only ~7 GB/s — a 478× gap. Even a network-attached parallel file system tops out at 10–50 GB/s per client.

The saving grace: GPUs don’t need continuous storage I/O during training. Data is loaded into GPU memory in batches, processed through forward and backward passes, then the next batch is loaded. The question is whether storage can deliver the next batch before the GPU finishes the current one.

When it can’t, GPUs idle. On a 1,000-GPU cluster at $2.50/hr per GPU, every 1% of idle time from storage starvation costs about $18,250 per month ($219,000 per year).
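The idle-cost arithmetic above can be captured in a few lines. This is a back-of-envelope helper, not a billing tool; the 730 hours/month and the example rates are illustrative.

```python
# Back-of-envelope cost of storage-starved GPUs (illustrative rates).

def idle_cost_per_month(num_gpus: int, hourly_rate: float, idle_fraction: float) -> float:
    """Monthly cost of GPU-hours lost to storage stalls (730 hours/month)."""
    return num_gpus * hourly_rate * 730 * idle_fraction

# 1,000 GPUs at $2.50/hr with 1% storage-induced idle time:
print(f"${idle_cost_per_month(1000, 2.50, 0.01):,.0f}/month")  # → $18,250/month
```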
Storage I/O Patterns in AI
Training data loading: Sequential reads of large files (tokenized datasets). Predictable, high-throughput. Typically 1–10 GB/s per node.

Checkpointing: Periodic writes of model state. Bursty, write-heavy. A 70B model checkpoint is ~140 GB; a 1T model is ~2 TB. Must complete fast enough to not stall training.

Shuffling & preprocessing: Random reads across large datasets. Latency-sensitive. Can bottleneck data pipelines if storage IOPS is insufficient.

Model loading (inference): Read entire model weights at startup. A 70B FP16 model = 140 GB. At 7 GB/s (NVMe), that’s 20 seconds. At 1 GB/s (network), that’s 2.3 minutes.
Storage Bandwidth Hierarchy
Storage Tier            Bandwidth      Latency
──────────────────────────────────────────────
L2 Cache (GPU)          ~12 TB/s       ~ns
HBM3 (GPU memory)       3,350 GB/s     ~ns
PCIe Gen5 (GPU↔CPU)     64 GB/s        ~μs
Local NVMe SSD          7-14 GB/s      ~10 μs
NVMe-oF (networked)     25-100 GB/s    ~50 μs
Parallel FS (Lustre)    10-50 GB/s     ~100 μs
Object Store (S3)       1-10 GB/s      ~10 ms
HDD (archival)          0.2 GB/s       ~5 ms

# The key metric: can storage fill GPU memory
# before the GPU finishes processing?

Example: Training batch loading
  Batch size:           4 MB (tokenized text)
  GPU processing time:  ~200 ms
  Required bandwidth:   4 MB / 200 ms = 20 MB/s
  → Even S3 can keep up for text training!

Example: Image training batch
  Batch size:           2 GB (256 images × 8 MB)
  GPU processing time:  ~500 ms
  Required bandwidth:   2 GB / 500 ms = 4 GB/s
  → Needs local NVMe or parallel FS
Key insight: Storage for AI training is like a kitchen’s prep station. A fast chef (GPU) with a slow prep cook (storage) means the chef stands around waiting for ingredients. The solution isn’t a faster chef — it’s a better prep line. Most AI teams over-invest in GPUs and under-invest in storage.
Parallel File Systems: Lustre & GPFS
Distributing data across hundreds of storage nodes for aggregate bandwidth
Why Parallel File Systems?
A single storage server maxes out at ~10–20 GB/s. A 1,000-GPU cluster needs 100+ GB/s of aggregate storage bandwidth. The solution: spread data across many servers and read from all of them simultaneously.

Parallel file systems stripe files across multiple Object Storage Targets (OSTs). A 1 TB file might be split into 1 MB chunks distributed across 100 servers. When a client reads the file, it reads from all 100 servers in parallel, achieving 100× the bandwidth of a single server.
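The striping scheme is easy to sketch: chunks are dealt round-robin across OSTs, so any byte offset maps deterministically to one target. A minimal model, with hypothetical stripe parameters (real Lustre layouts are set per-file with `lfs setstripe`):

```python
# Sketch of Lustre-style round-robin striping: map a byte offset in a file
# to (which OST holds it, offset within that OST's object).

def stripe_location(offset: int, stripe_size: int, stripe_count: int) -> tuple[int, int]:
    """Return (ost_index, offset_within_object) for a byte offset."""
    stripe_index = offset // stripe_size       # which chunk of the file
    ost = stripe_index % stripe_count          # chunks dealt round-robin to OSTs
    full_rounds = stripe_index // stripe_count # rounds completed before this chunk
    within = full_rounds * stripe_size + offset % stripe_size
    return ost, within

# Byte 5,500,000 of a file striped 1 MB wide across 4 OSTs:
print(stripe_location(5_500_000, stripe_size=1_000_000, stripe_count=4))
# → (1, 1500000): chunk 5 lands on OST 1, in its second stored chunk
```

Because consecutive chunks live on different servers, a sequential read naturally fans out across all OSTs, which is where the aggregate bandwidth comes from.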
Lustre Architecture
Lustre (Linux + Cluster) is the dominant open-source parallel file system for AI and HPC. Key components:

MDS (Metadata Server): Handles file namespace — directory listings, permissions, file-to-OST mappings. Typically 1–4 active MDS nodes for clusters up to ~50K clients.

OSS/OST (Object Storage Server/Target): Stores actual file data. Each OSS manages multiple OSTs (typically NVMe or SSD arrays). Stripe files across OSTs for parallel access.

Client: POSIX-compatible kernel module on compute nodes. Reads/writes appear as normal file operations but are transparently parallelized across OSTs.

Scaling: Add more OSS nodes for more bandwidth. DDN’s EXAScaler (Lustre-based) delivers 4 TB/s to NVIDIA’s Eos supercomputer across hundreds of OSS nodes.
Lustre vs GPFS (IBM Storage Scale)
Feature            Lustre               GPFS
────────────────────────────────────────────────────
License            Open source          Commercial
Max throughput     4+ TB/s              2+ TB/s
Max clients        ~100K                ~20K
Metadata           Separate MDS         Distributed
Small files        Weaker               Better
POSIX compliance   Full                 Full
Cloud offering     AWS FSx for Lustre   IBM Cloud
Typical users      HPC, AI labs         Enterprise, banks
Cost               Lower (open source)  Higher (license)

AWS FSx for Lustre (2025):
  Max throughput:     1,200 Gbps (150 GB/s) per client
  GPUDirect Storage:  Supported
  S3 integration:     Automatic lazy-load from S3
  Cost:               ~$0.14/GB-month (persistent)

# Lustre is the "Linux of storage" for AI —
# not the prettiest, but battle-tested at scale.
Key insight: A parallel file system is like a highway with many lanes. A single-server NAS is a one-lane road — no matter how fast the cars go, throughput is limited. Lustre opens 100+ lanes. Each GPU reads from its own lane, and aggregate bandwidth scales linearly with the number of storage servers.
GPUDirect Storage: Bypassing the CPU
Direct DMA transfers from storage to GPU memory
The CPU Bounce Problem
In a traditional data path, reading data from storage to GPU requires two copies:

1. Storage → CPU system memory (via DMA)
2. CPU system memory → GPU memory (via PCIe)

Each copy consumes CPU cycles, system memory bandwidth, and PCIe bandwidth. The CPU becomes a bottleneck, especially when multiple GPUs compete for data. On an 8-GPU node, the CPU must orchestrate 8 parallel data streams — often saturating the memory bus before the GPUs are full.

GPUDirect Storage (GDS) eliminates the CPU from the data path entirely. Storage devices transfer data directly to GPU memory via DMA (or RDMA, Remote Direct Memory Access, for networked storage), bypassing system memory completely.
How GDS Works
1. GPU application requests a file read via cuFile API
2. GDS driver maps GPU memory pages for DMA access
3. Storage controller (NVMe or network) performs DMA directly to GPU memory
4. GPU is notified when transfer completes

No CPU involvement. No system memory copy. The data path is: Storage → PCIe/Network → GPU HBM.

GDS requires: NVIDIA GPU (Ampere+), compatible NVMe driver or network adapter, and CUDA 11.4+ with the cuFile library.
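The contention argument can be made concrete with a toy model. This is not cuFile code, and the bandwidth figures are assumed for illustration: the point is that on the bounce path every byte crosses the system memory bus twice (storage→RAM, then RAM→GPU) and all GPUs on a node share that bus, while the direct path is limited only by storage.

```python
# Toy model of per-GPU read bandwidth on one multi-GPU node.

def bounce_bw_per_gpu(n_gpus: int, storage_bw: float, mem_bus_bw: float) -> float:
    """GB/s per GPU via the CPU bounce path: each byte crosses the
    shared memory bus twice, split across all GPUs on the node."""
    return min(storage_bw, mem_bus_bw / (2 * n_gpus))

def gds_bw_per_gpu(storage_bw: float) -> float:
    """GB/s per GPU with direct DMA: the memory bus drops out entirely."""
    return storage_bw

# 8 GPUs, 25 GB/s NVMe-oF storage, ~80 GB/s usable system memory bandwidth:
print(bounce_bw_per_gpu(8, 25, 80))  # → 5.0 GB/s per GPU
print(gds_bw_per_gpu(25))            # → 25 GB/s per GPU (5× better)
```

With slow local NVMe the two paths look similar; the gap opens up exactly when storage gets fast enough that the memory bus becomes the choke point, which matches the 2-5× range quoted below.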
Performance Impact
Traditional path (CPU bounce):
  Storage → System RAM → GPU HBM
  Bandwidth:  ~6-8 GB/s per GPU
  CPU load:   ~30-50% (data copying)
  Latency:    ~200 μs

GPUDirect Storage:
  Storage → GPU HBM (direct DMA)
  Bandwidth:  ~12-40 GB/s per GPU
  CPU load:   ~0% (for data transfer)
  Latency:    ~50 μs

Improvement:
  Bandwidth:  2-5×
  CPU freed:  30-50% (for preprocessing)
  Latency:    lower

At scale (DDN EXAScaler + GDS):
  Sequential reads:      351 GiB/s
  Aggregate to cluster:  4 TB/s

# DeepNVMe (DeepSpeed) + local NVMe + GDS:
# 20× faster checkpoint I/O with PCIe Gen5
NVIDIA SCADA (2025)
NVIDIA’s Storage Compute Accelerated Data Access (SCADA), announced November 2025, takes GDS further by completely eliminating CPU involvement in storage orchestration. The GPU itself manages data placement, prefetching, and caching — treating storage as an extension of its memory hierarchy.

This is particularly important for checkpointing, where a 1T model must write 2 TB of state to storage without stalling training. With SCADA, the GPU can overlap checkpoint writes with the next training step.
Key insight: GPUDirect Storage is like a delivery truck driving straight into the factory floor instead of unloading at the loading dock, carrying boxes to the warehouse, then carting them to the assembly line. Cutting out the middleman (CPU) doubles the delivery speed and frees the warehouse workers (CPU cores) for other tasks.
Checkpointing: Saving Progress Without Stalling
Multi-tier strategies for protecting billion-dollar training runs
Why Checkpointing Is Critical
A frontier model training run costs $50–100M+ and runs for weeks to months. Hardware failures happen every few hours on large clusters (Meta reported one failure every 3 hours on 16K GPUs). Without checkpointing, a failure could lose days of training progress — millions of dollars of compute wasted.

The challenge: checkpoints are huge. A 70B model’s full training state (weights + optimizer states + gradients) is ~560 GB. A 1T model: ~8 TB. Writing this to storage fast enough to not stall training is a serious engineering problem.
Checkpoint Size Math
# Full training checkpoint includes:
Model weights:     params × bytes_per_param
Optimizer (Adam):  2 × model_weights (momentum + variance)
Gradients:         1 × model_weights
Total:             ~4 × model_weights

Example: Llama 3 70B (mixed precision)
  Weights (FP16):    140 GB
  Optimizer:         280 GB
  Gradients (FP16):  140 GB
  Total:             ~560 GB

Example: 1T parameter model
  Weights (FP16):    2 TB
  Optimizer:         4 TB
  Gradients (FP16):  2 TB
  Total:             ~8 TB

# With FSDP, each GPU only saves its shard:
70B on 512 GPUs:   560 GB / 512 = ~1.1 GB/GPU
1T on 4096 GPUs:   8 TB / 4096  = ~2 GB/GPU
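The ~4×-weights rule above is a one-liner worth having on hand. A sketch, using the simplified accounting from the block above (optimizer states counted as 2× the stored weight bytes):

```python
def checkpoint_size_gb(params_billions: float, bytes_per_param: int = 2,
                       optimizer_multiple: int = 2, grad_multiple: int = 1) -> float:
    """Approximate full training-state size in GB: weights + optimizer + grads."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params × bytes each
    return weights_gb * (1 + optimizer_multiple + grad_multiple)

print(checkpoint_size_gb(70))    # → 560.0 GB (Llama 3 70B, mixed precision)
print(checkpoint_size_gb(1000))  # → 8000.0 GB (~8 TB for a 1T model)
```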
Multi-Tier Checkpointing Strategy
Production training runs use a tiered approach, balancing speed against durability:

Tier 1 — Local NVMe (every 5–10 min): Each GPU writes its shard to the node’s local NVMe SSD. At 7 GB/s, writing 1.1 GB takes <1 second. Protects against GPU failures but not node failures.

Tier 2 — Parallel FS (every 30–60 min): Full checkpoint written to Lustre/GPFS. At 50 GB/s aggregate, 560 GB takes ~11 seconds. Protects against node failures but not data center failures.

Tier 3 — Cloud/Remote (every few hours): Checkpoint copied to S3/GCS for disaster recovery. At 10 GB/s, 560 GB takes ~56 seconds. Protects against everything but is slowest.

Asynchronous writes: All tiers use async I/O — training continues while checkpoints drain to storage in the background. The GPU only stalls if the next checkpoint starts before the previous one finishes writing.
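The async pattern can be sketched in plain Python. Real stacks use torch.save or torch.distributed.checkpoint with sharded state; the class name, file layout, and use of pickle here are illustrative stand-ins.

```python
# Sketch of tier-1 async checkpointing: training continues while the
# snapshot drains to disk in a background thread.
import os
import pickle
import threading

class AsyncCheckpointer:
    """Write checkpoints in a background thread so training keeps running."""

    def __init__(self, directory: str):
        self.directory = directory
        self._pending = None  # thread draining the previous checkpoint, if any

    def save(self, step: int, state: dict) -> None:
        # Stall only if the previous write hasn't finished yet -- exactly
        # the failure mode described above.
        if self._pending is not None:
            self._pending.join()
        snapshot = dict(state)  # shallow copy: training may mutate `state`
        path = os.path.join(self.directory, f"ckpt_{step:06d}.pkl")
        self._pending = threading.Thread(target=self._write, args=(path, snapshot))
        self._pending.start()

    @staticmethod
    def _write(path: str, snapshot: dict) -> None:
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    def wait(self) -> None:
        """Block until the in-flight checkpoint (if any) is durable."""
        if self._pending is not None:
            self._pending.join()
```

save() returns immediately while the snapshot drains to local NVMe; a second tier would copy finished files up to the parallel FS in the background.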
Cost of Checkpoint Frequency
Scenario: 16K GPU cluster, $2.50/hr/GPU
  Cluster cost: $40,000/hr ($667/min)
  Failure every 3 hours (Meta Llama 3 data)

30-min checkpoints:
  Max lost work:       30 min × $667 = $20,000
  Average lost:        $10,000 per failure
  Daily (8 failures):  $80,000/day

10-min checkpoints:
  Max lost work:  10 min × $667 = $6,670
  Average lost:   $3,335 per failure
  Daily:          $26,680/day

5-min checkpoints (local NVMe):
  Max lost work:  5 min × $667 = $3,335
  Average lost:   $1,668 per failure
  Daily:          $13,340/day

Savings: 5-min vs 30-min = $66,660/day
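The expected-loss math in the block above generalizes to any cluster. A small calculator, assuming (as above) that on average half a checkpoint interval of work is lost per failure:

```python
def daily_lost_work_cost(cluster_hourly: float, ckpt_interval_min: float,
                         failures_per_day: float) -> float:
    """Expected daily cost of recomputed work after failures.
    On average, half a checkpoint interval is lost each time."""
    cost_per_minute = cluster_hourly / 60
    avg_lost_per_failure = (ckpt_interval_min / 2) * cost_per_minute
    return avg_lost_per_failure * failures_per_day

# 16K-GPU cluster ($40,000/hr), 8 failures/day (one every 3 hours):
print(daily_lost_work_cost(40_000, 30, 8))  # → $80,000/day
print(daily_lost_work_cost(40_000, 5, 8))   # → ~$13,333/day
```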
Key insight: Checkpointing is like the autosave in a video game. Save too rarely and you lose hours of progress when you die. Save too often and you spend all your time at save points instead of playing. The multi-tier approach is like having a quick-save (local NVMe) for frequent saves and a full save-to-cloud for the important milestones.
Data Loading: The Hidden Bottleneck
How PyTorch DataLoader, streaming, and preprocessing affect GPU utilization
The DataLoader Pipeline
Before data reaches the GPU, it passes through a multi-stage pipeline:

1. Read from storage: Fetch raw data (text, images, audio) from disk or network.
2. Deserialize: Parse file formats (Parquet, TFRecord, WebDataset).
3. Transform: Tokenize text, resize images, augment data.
4. Batch: Collate samples into tensors of uniform shape.
5. Transfer: Copy batch from CPU memory to GPU memory (pinned memory + async copy).

Each stage can become a bottleneck. PyTorch’s default DataLoader uses CPU workers for stages 1–4 and overlaps them with GPU training via prefetching. But with large models on fast GPUs, even 16 CPU workers may not keep up.
Common Bottlenecks
# Diagnosing data loading bottlenecks:

Symptom: GPU utilization drops periodically
  → DataLoader can't prefetch fast enough
  Fix: increase num_workers, use persistent_workers

Symptom: CPU at 100%, GPU at 50%
  → Preprocessing is CPU-bound
  Fix: pre-tokenize data, use DALI for GPU preprocessing

Symptom: High I/O wait
  → Storage bandwidth insufficient
  Fix: local NVMe cache, parallel FS, larger prefetch

# PyTorch DataLoader tuning:
DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # CPU workers
    pin_memory=True,          # faster CPU→GPU copies
    prefetch_factor=4,        # batches staged ahead per worker
    persistent_workers=True,  # keep workers alive between epochs
)
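The prefetching idea itself fits in a few lines of pure Python. This sketch mirrors what num_workers and prefetch_factor do in torch.utils.data.DataLoader, with a bounded queue standing in for the worker pool:

```python
# Pure-Python prefetching: a worker thread stages batches into a bounded
# queue so the consumer (the GPU step) rarely waits for data.
import queue
import threading

def prefetching_loader(dataset, batch_size, prefetch=4):
    q = queue.Queue(maxsize=prefetch)  # bounded: caps memory held by staged batches
    stop = object()                    # sentinel marking end of the dataset

    def worker():
        batch = []
        for sample in dataset:
            batch.append(sample)
            if len(batch) == batch_size:
                q.put(batch)           # blocks if the consumer falls behind
                batch = []
        if batch:
            q.put(batch)               # final partial batch
        q.put(stop)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not stop:
        yield item

print(list(prefetching_loader(range(10), batch_size=4)))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The bounded queue is the key design choice: it lets the producer run ahead of the consumer, but only by a fixed number of batches, so memory stays predictable.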
Streaming vs Pre-Downloaded
Pre-downloaded (traditional): Copy entire dataset to local/parallel storage before training. Simple, fast reads, but requires storage capacity equal to dataset size on every node. A 15 TB dataset (Common Crawl) needs 15 TB per node or a shared parallel FS.

Streaming (MosaicML, HuggingFace): Read data directly from object storage (S3/GCS) during training. No local copy needed. Data is shuffled and batched on-the-fly.

Streaming works well for text (low bandwidth per sample) but struggles for images/video (high bandwidth). The sweet spot: stream from S3 to a local NVMe cache, with the cache acting as a sliding window over the dataset.
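The sliding-window cache can be sketched as a bounded map that evicts the oldest shard as training streams forward. The ShardCache name and the fetch_shard callback are illustrative; real implementations (e.g. MosaicML Streaming) manage shard files on local NVMe.

```python
# Sliding-window shard cache: keep at most `capacity` shards locally,
# evicting the oldest as the job streams through the dataset.
from collections import OrderedDict

class ShardCache:
    def __init__(self, capacity: int, fetch_shard):
        self.capacity = capacity
        self.fetch_shard = fetch_shard  # stand-in for an S3 download
        self._cache = OrderedDict()     # insertion order = age
        self.misses = 0

    def get(self, shard_id: int):
        if shard_id not in self._cache:
            self.misses += 1
            self._cache[shard_id] = self.fetch_shard(shard_id)
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)  # evict the oldest shard
        return self._cache[shard_id]

cache = ShardCache(capacity=2, fetch_shard=lambda i: f"shard-{i}")
for i in [0, 1, 1, 2, 0]:
    cache.get(i)
print(cache.misses)  # → 4: shard 0 was evicted and had to be re-fetched
```

This is why shuffling is done across shards rather than across individual samples: random access within the window is cheap, but jumping back to an evicted shard costs a full re-download.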
Data Formats for AI Training
Format        Read Speed   Compression   Random Access
──────────────────────────────────────────────────────
JSON/JSONL    Slow         None          Line-level
Parquet       Fast         Columnar      Column-level
TFRecord      Fast         Optional      Sequential
WebDataset    Very fast    Tar-based     Sequential
Arrow/IPC     Very fast    Optional      Row-level
Mosaic MDSv2  Very fast    Shard-based   Shard-level

# WebDataset: tar files of samples
# Sequential reads → maximizes disk throughput
# Shuffling done across shards, not within

# Rule of thumb: pre-tokenize + shard into
# 256 MB–1 GB files for optimal parallel reads
Key insight: Data loading is like a supply chain. The GPU is a factory that can process 1,000 widgets per second, but if the delivery trucks (DataLoader workers) only bring 500 per second, the factory runs at half capacity. Optimizing the supply chain (format, caching, prefetching) is often cheaper than buying a faster factory (bigger GPU).
Object Storage Integration: S3, GCS & Beyond
Using cloud object stores as the data lake for AI training
Object Storage for AI
Most organizations store their training data in object storage (S3, GCS, Azure Blob) because it’s cheap ($0.023/GB-month for S3 Standard), infinitely scalable, and durable (11 nines). The challenge is getting data from object storage to GPUs fast enough.

S3 throughput: A single S3 GET request delivers ~100 MB/s. With 1,000 parallel requests, you can reach ~100 GB/s aggregate. But latency is 10–50ms per request — 1,000× slower than local NVMe.

Egress costs: AWS charges $0.09/GB for data leaving a region. Training on a 15 TB dataset with 3 epochs = 45 TB of reads = $4,050 in egress if data and compute are in different regions. Same-region is free.
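The egress math above is worth keeping as a helper, since it often decides where compute gets placed. A sketch, with the $0.09/GB rate as an assumption (check your provider's current pricing):

```python
def egress_cost(dataset_tb: float, epochs: int, per_gb: float = 0.09) -> float:
    """Cross-region read cost: every epoch re-reads the full dataset."""
    return dataset_tb * 1000 * epochs * per_gb  # TB → GB, then × rate

# 15 TB dataset, 3 epochs, compute in a different region than the data:
print(f"${egress_cost(15, 3):,.0f}")  # → $4,050
# Same-region reads are free: the cheapest fix is colocating data and compute.
```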
Tiered Caching Architecture
The production pattern combines object storage with local caching:

Cold tier (S3/GCS): Full dataset, cheap, durable. Source of truth.

Warm tier (Parallel FS): Active training data cached on Lustre/GPFS. High bandwidth, shared across nodes. AWS FSx for Lustre can auto-hydrate from S3.

Hot tier (Local NVMe): Current epoch’s data cached on node-local SSDs. Fastest access, limited capacity (1–8 TB per node).

Data flows: S3 → Lustre (background prefetch) → Local NVMe (per-epoch cache) → GPU memory (per-batch). Each tier acts as a progressively faster, smaller cache.
AWS FSx for Lustre + S3 Integration
# FSx for Lustre: managed Lustre linked to S3

Setup:
  S3 bucket:   s3://my-training-data (15 TB)
  FSx volume:  /fsx/training (linked to S3)
  Throughput:  1,200 Gbps (150 GB/s) per client

How it works:
  1. File appears in /fsx/ as a stub (metadata only)
  2. First read triggers lazy-load from S3
  3. Subsequent reads served from Lustre (fast)
  4. Write-back to S3 on demand or schedule

Cost comparison (15 TB dataset, 1 month):
  S3 Standard:           $345/month
  FSx Persistent:        $2,100/month
  FSx Scratch:           $840/month (no durability)
  Local NVMe (8 nodes):  Included with instances

# Best practice: FSx Scratch for training data
# (can re-hydrate from S3), FSx Persistent for
# checkpoints (need durability).
Key insight: The storage architecture for AI is like a library system. S3 is the central warehouse with every book ever printed — cheap to store but slow to retrieve. The parallel file system is the local branch library — faster, but limited shelf space. Local NVMe is the stack of books on your desk — instant access, but only room for what you’re reading right now. Good data engineering moves the right books to the right tier at the right time.
Storage Architecture Patterns
How to design storage for training clusters, inference, and mixed workloads
Training Cluster Storage
A training cluster’s storage must handle three simultaneous workloads: data loading (read-heavy, sequential), checkpointing (write-heavy, bursty), and logging/metrics (write-heavy, small files). These have conflicting I/O patterns.

Best practice: Separate storage pools for each workload:

Data pool: Lustre with read-optimized OSTs. Stripe across many servers. Tune for sequential read throughput. Cache hot data on local NVMe.

Checkpoint pool: Lustre or GPFS with write-optimized OSTs. Needs burst bandwidth (entire cluster writes simultaneously). Size for peak write rate, not average.

Scratch pool: Local NVMe on compute nodes. Used for Tier-1 checkpoints, data caching, and temporary files. No network overhead.
Inference Storage
Inference storage is simpler but has different priorities:

Model loading: Read entire model weights at startup. Speed determines cold-start latency. A 70B model from NVMe: 20s. From network: 2+ minutes. From S3: 5+ minutes.

Model registry: Store multiple model versions, quantization variants, LoRA adapters. Needs fast random access for adapter loading.

Solution: Keep active models on local NVMe or a fast NAS. Use S3 as the model registry with a pull-through cache. Pre-pull models during scaling events to avoid cold-start latency.
Reference Architecture: 1,000-GPU Cluster
Compute: 125 nodes × 8 GPUs (H100)

Local storage (per node):
  4× NVMe Gen4 3.84 TB = 15.36 TB
  Bandwidth: 28 GB/s (4× 7 GB/s)
  Use: Tier-1 checkpoints, data cache

Parallel FS (shared):
  DDN EXAScaler (Lustre-based)
  Capacity:   2 PB usable
  Throughput: 500 GB/s aggregate
  OSS nodes:  40 (NVMe-backed)
  Use: Training data, Tier-2 checkpoints

Object storage:
  S3-compatible (MinIO or AWS S3)
  Capacity: 10+ PB
  Use: Raw datasets, Tier-3 checkpoints, archive

Estimated storage cost:
  Local NVMe:   Included with nodes
  Parallel FS:  ~$500K-1M (DDN appliance)
  S3 (10 PB):   ~$2,760/month (AWS)
  Total:        ~3-5% of cluster cost
Key insight: Storage architecture for AI is like plumbing in a building. Nobody notices it when it works, but when a pipe bursts (storage bottleneck), the whole building floods (GPUs idle). Spend 3–5% of your cluster budget on storage to protect the other 95% from sitting idle.
Storage Sizing: Capacity, Bandwidth & Cost
How to calculate storage requirements for real AI workloads
Capacity Sizing
Training data:
  Text (Common Crawl):    ~15 TB (compressed)
  Image (LAION-5B):       ~240 TB
  Video (HD, 1M hours):   ~2 PB
  Code (The Stack v2):    ~4 TB

Checkpoint storage:
  Per checkpoint (70B):  560 GB  → keep last 5: 2.8 TB
  Per checkpoint (1T):   8 TB    → keep last 5: 40 TB

Model artifacts (inference):
  Per model version:       140-280 GB
  10 models × 3 versions:  4-8 TB
  LoRA adapters (100):     70 GB

Rule of thumb:
  Parallel FS:   2× dataset + 10× largest checkpoint
  Object store:  All raw data + all checkpoint history
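The parallel-FS rule of thumb above is simple enough to encode directly. A sketch (the multipliers 2× and 10× come from the rule above, not from any standard):

```python
def parallel_fs_capacity_tb(dataset_tb: float, largest_ckpt_tb: float) -> float:
    """Rule of thumb: 2× dataset + headroom for 10 checkpoint generations."""
    return 2 * dataset_tb + 10 * largest_ckpt_tb

# 15 TB text dataset, 560 GB (0.56 TB) checkpoints:
print(parallel_fs_capacity_tb(15, 0.56))  # → 35.6 TB of parallel FS
```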
Bandwidth Sizing
Data loading bandwidth: Calculate based on batch size, processing time, and number of nodes. For text training, even modest bandwidth (100 MB/s per node) suffices. For image/video, plan for 1–10 GB/s per node.

Checkpoint bandwidth: Total checkpoint size ÷ acceptable write time. If a 560 GB checkpoint must complete in 30 seconds: need 18.7 GB/s aggregate write bandwidth. With FSDP (each GPU writes its shard in parallel), 125 nodes writing 4.5 GB each at 7 GB/s (local NVMe) finishes in <1 second.
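Both checkpoint-bandwidth calculations above reduce to one division each. A small helper pair, using the same figures as the text:

```python
def required_write_bw_gb_s(ckpt_gb: float, budget_s: float) -> float:
    """Aggregate write bandwidth needed to land a checkpoint within a budget."""
    return ckpt_gb / budget_s

def shard_write_time_s(ckpt_gb: float, n_nodes: int, node_bw_gb_s: float) -> float:
    """Per-node write time when the checkpoint is sharded across nodes (FSDP)."""
    return (ckpt_gb / n_nodes) / node_bw_gb_s

print(round(required_write_bw_gb_s(560, 30), 1))  # → 18.7 GB/s aggregate
print(round(shard_write_time_s(560, 125, 7), 2))  # → 0.64 s per node on local NVMe
```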
Total Cost of Storage
1,000-GPU training cluster (1 year):
  GPU compute: 1,000 × $2.50/hr × 8,760 hrs = $21.9M

Storage:
  Parallel FS (2 PB, DDN):  $800K (amortized)
  Local NVMe:               Included with nodes
  S3 (10 PB):               $33K
  Total storage:            $833K (~3.8% of compute)

Cost of NOT having good storage:
  5% GPU idle from storage starvation:
  1,000 × $2.50 × 8,760 × 0.05 = $1.095M

# Bad storage costs MORE than good storage!
# $833K investment prevents $1.095M+ waste.

Market context:
  NVMe-oF market growth:  27.8% CAGR
  AI storage market:      $36B (2025) → $322B (2035)
Key insight: Storage is the cheapest insurance policy for your GPU investment. Spending 3–5% of your cluster budget on proper storage prevents 5–15% of your GPUs from sitting idle. The math is simple: $833K in storage saves $1M+ in wasted compute. Every dollar spent on storage returns $1.30+ in GPU utilization.