Ch 4 — Memory: The Real Bottleneck

HBM, bandwidth math, KV cache explosion, and why memory determines everything
HBM vs GDDR: Two Approaches to GPU Memory
Apartment building vs row of houses — stacking changes everything
The Apartment Building Analogy
GDDR (used in gaming GPUs like the RTX 4090) is like a row of houses along a street. Each house (memory chip) has its own driveway (data bus). To get more capacity, you build more houses along the street. But the street has limited width — bandwidth is constrained.

HBM (used in AI GPUs like H100, B200) is like an apartment building. Memory chips are stacked vertically, connected by thousands of tiny elevators (through-silicon vias, or TSVs). This stacking gives you a massively wider data path — 1,024 bits wide per stack vs GDDR’s 32 bits per chip.

The result: HBM delivers 5–10x more bandwidth than GDDR in the same physical footprint. This is why every serious AI GPU uses HBM — the bandwidth is essential for feeding thousands of Tensor Cores.
HBM vs GDDR Comparison
GDDR6X (RTX 4090):
  Capacity: 24 GB
  Bandwidth: 1,008 GB/s
  Bus width: 384 bits
  Chips: 12 (side by side)
  Power: ~50 W for memory

HBM3 (H100):
  Capacity: 80 GB
  Bandwidth: 3,350 GB/s
  Bus width: 5,120 bits
  Stacks: 5 (8-high each)
  Power: ~30 W for memory

HBM3e (B200):
  Capacity: 192 GB
  Bandwidth: 8,000 GB/s
  Stacks: 8 (12-high each)

HBM3e delivers 8x the bandwidth of GDDR6X while using less power per bit. This is why AI GPUs cost $25K+.
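The bandwidth gap follows directly from bus width times per-pin data rate. A minimal sketch in Python (the function name is illustrative; GDDR6X's ~21 Gbps per-pin rate is published, while the H100's effective HBM3 rate is inferred here from its 3,350 GB/s spec):

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s = bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8

# GDDR6X (RTX 4090): 384-bit bus at ~21 Gbps per pin
print(peak_bandwidth_gbs(384, 21.0))      # ~1,008 GB/s

# HBM3 (H100): 5 stacks x 1,024 bits at ~5.2 Gbps effective per pin
print(peak_bandwidth_gbs(5 * 1024, 5.2))  # ~3,330 GB/s, close to the 3,350 GB/s spec
```

The takeaway: GDDR chases higher per-pin speeds over a narrow bus, while HBM wins by making the bus more than 13x wider.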
Key insight: HBM is the single most expensive component in an AI GPU — it can account for 40–60% of the chip’s cost. The global HBM market is projected to grow from $38 billion (2025) to $58 billion (2026). SK Hynix, Samsung, and Micron are in an arms race to produce more HBM stacks, and supply constraints directly affect GPU availability.
How HBM Stacking Works
Through-silicon vias connect vertically stacked DRAM dies for massive bandwidth
The Stacking Technology
HBM achieves its extraordinary bandwidth through vertical stacking:

1. DRAM dies are stacked: 8 or 12 DRAM dies are placed on top of each other, like floors in a building. Each die is thinned to ~50 micrometers (thinner than a human hair).

2. TSVs connect them: Thousands of Through-Silicon Vias (TSVs) — tiny copper pillars drilled through each die — create vertical electrical connections between all layers. Each stack has over 5,000 TSVs.

3. Wide interface: Each HBM stack has a 1,024-bit wide interface. With 5 stacks (H100), that’s 5,120 bits of data moving simultaneously. Compare to GDDR’s 32 bits per chip.

4. Close to the GPU: HBM stacks sit on the same silicon interposer as the GPU die, just millimeters away. This proximity reduces latency and power consumption vs off-package memory.
HBM Generation Evolution
Gen     Layers   Cap/Stack   BW/Stack      Data Rate
HBM2    8-high   8 GB        256 GB/s      2.4 Gbps
HBM2e   8-high   16 GB       460 GB/s      3.6 Gbps
HBM3    8-high   16 GB       665 GB/s      6.4 Gbps
HBM3e   12-high  36 GB       1,200 GB/s    9.6 Gbps
HBM4    16-high  48 GB       2,000+ GB/s   TBD

Used in:
  HBM2e: A100 (80 GB = 5 stacks)
  HBM3: H100 (80 GB = 5 stacks)
  HBM3e: H200 (141 GB), B200 (192 GB)
  HBM4: next-gen GPUs (2026+)

Market leaders (Q2 2025): SK Hynix 62%, Micron 21%, Samsung 17%.
Key insight: HBM4 (expected 2026) doubles the interface width to 2,048 bits and targets 2+ TB/s per stack. Samsung has already announced HBM4E at 4 TB/s per stack. This means future GPUs could have 10–20x the memory bandwidth of today’s H100 — fundamentally changing what’s possible for real-time inference of very large models.
Bandwidth Math: Why It Determines Token Speed
Every token generated requires reading the entire model from memory
The Fundamental Equation
During LLM inference (generating tokens), the GPU must read every model weight from memory for every token. This is because each token passes through every layer of the model.

This means your maximum token generation speed is directly limited by memory bandwidth:

Max tokens/sec = Memory Bandwidth ÷ Model Size

For a 70B model in FP16 (140 GB of weights):
H100: 3,350 GB/s ÷ 140 GB = ~24 tokens/sec (theoretical max)
B200: 8,000 GB/s ÷ 140 GB = ~57 tokens/sec (theoretical max)

Real-world numbers are lower due to KV cache reads, activation memory, and overhead. But this formula gives you the ceiling — no amount of TFLOPS can exceed it.
Bandwidth-Limited Token Rates
Theoretical max tokens/sec (single GPU):

Formula: BW ÷ Model_Size_Bytes

7B model (FP16 = 14 GB):
  H100: 3,350 / 14 = ~239 tok/s
  B200: 8,000 / 14 = ~571 tok/s

13B model (FP16 = 26 GB):
  H100: 3,350 / 26 = ~129 tok/s
  B200: 8,000 / 26 = ~308 tok/s

70B model (FP16 = 140 GB):
  H100: 3,350 / 140 = ~24 tok/s
  B200: 8,000 / 140 = ~57 tok/s

70B model (INT4 = 35 GB):
  H100: 3,350 / 35 = ~96 tok/s
  B200: 8,000 / 35 = ~229 tok/s

Quantization (FP16→INT4) gives 4x more tokens/sec because you read 4x less data per token. This is why quantization matters so much.
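The ceiling formula above can be sketched as a small Python helper (the function name is illustrative):

```python
def max_tokens_per_sec(bandwidth_gbs: float, params_billions: float,
                       bytes_per_param: float) -> float:
    """Bandwidth-limited ceiling on token generation rate for a single GPU.

    Every generated token requires streaming all weights from memory once,
    so tokens/sec <= bandwidth / model size in bytes.
    """
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gbs / model_size_gb

# 70B model in FP16 (2 bytes/param = 140 GB of weights):
print(round(max_tokens_per_sec(3350, 70, 2)))    # H100: ~24 tok/s
print(round(max_tokens_per_sec(8000, 70, 2)))    # B200: ~57 tok/s
# Same model quantized to INT4 (0.5 bytes/param = 35 GB):
print(round(max_tokens_per_sec(3350, 70, 0.5)))  # H100: ~96 tok/s
```

Note this is a ceiling, not a prediction — KV cache reads and overhead push real throughput below it.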
Key insight: This is why memory bandwidth — not TFLOPS — determines inference speed for most LLM workloads. The H100 has 990 TFLOPS FP16 but only 3,350 GB/s bandwidth. For single-user inference, those TFLOPS sit mostly idle, waiting for data. Bandwidth is the bottleneck, and quantization is the most effective way to work around it.
Model Weight Memory: The Sizing Problem
A 70B model needs 140 GB just for weights — before anything else
Memory Budget for Model Weights
Every parameter in a neural network takes memory. The amount depends on precision:

FP32 (4 bytes per param):
• 7B model = 28 GB
• 13B model = 52 GB
• 70B model = 280 GB
• 405B model = 1,620 GB (!)

FP16/BF16 (2 bytes per param):
• 7B = 14 GB, 13B = 26 GB, 70B = 140 GB

INT8 (1 byte per param):
• 7B = 7 GB, 13B = 13 GB, 70B = 70 GB

INT4 (0.5 bytes per param):
• 7B = 3.5 GB, 13B = 6.5 GB, 70B = 35 GB

But weights are just the start. You also need memory for activations, optimizer states (during training), and the KV cache (during inference).
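The precision table above reduces to one multiplication. A minimal sketch (the lookup table and function name are illustrative):

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for model weights alone, in GB (treating 1B params as 1e9)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70, "FP16"))   # 140.0 GB
print(weight_memory_gb(70, "INT4"))   # 35.0 GB
print(weight_memory_gb(405, "FP32"))  # 1620.0 GB
```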
Training Memory: The 16x Multiplier
Training a 7B model (FP32/FP16 mixed):
  Model weights (FP16):    14 GB
  Master weights (FP32):   28 GB
  Gradients (FP16):        14 GB
  Optimizer states (FP32):
    Adam momentum:         28 GB
    Adam variance:         28 GB
  ─────────────────────────────
  Total (weights + grads + optimizer): 112 GB

Plus activations: ~20-60 GB (depends on batch size and sequence length)

Total for 7B training: ~130-170 GB
Total for 70B training: ~1,300-1,700 GB

A 70B model needs ~1.5 TB just to train. That's 19 H100s worth of memory — minimum. This is why distributed training exists.
Key insight: The “16x rule” for training: you need roughly 16 bytes per parameter for mixed-precision training with Adam optimizer. A 70B model needs ~1,120 GB just for weights + optimizer, before activations. This is why training always requires multiple GPUs, while inference can sometimes fit on one.
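The 16-bytes-per-parameter breakdown can be sketched as follows (function and key names are illustrative; activation memory is passed in separately because it depends on batch size and sequence length):

```python
def training_memory_gb(params_billions: float, activations_gb: float = 0.0) -> dict:
    """Mixed-precision training memory with Adam, per the ~16 bytes/param rule:
    2 B FP16 weights + 4 B FP32 master + 2 B FP16 grads + 8 B FP32 Adam state."""
    b = params_billions
    parts = {
        "weights_fp16": 2 * b,
        "master_fp32": 4 * b,
        "grads_fp16": 2 * b,
        "adam_momentum_fp32": 4 * b,
        "adam_variance_fp32": 4 * b,
        "activations": activations_gb,
    }
    parts["total"] = sum(parts.values())
    return parts

print(training_memory_gb(7, activations_gb=40)["total"])  # 152.0 GB, in the ~130-170 GB range
print(training_memory_gb(70)["total"])                    # 1120.0 GB before activations
```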
The KV Cache Explosion
Every token you generate makes the next one more expensive
What Is the KV Cache?
During LLM inference, the attention mechanism needs to look back at all previous tokens. To avoid recomputing attention for every past token, GPUs store the Key and Value vectors for each token in a cache.

Think of it like a conversation transcript. Every time you say something new, the model needs to re-read the entire transcript to understand context. The KV cache stores this transcript in GPU memory so it doesn’t have to recompute it.

The problem: this cache grows linearly with sequence length and batch size. For long conversations or large batches, the KV cache can consume more memory than the model weights themselves.

For a 70B model with 80 layers, 64 attention heads, and 128-dim per head:
KV cache per token = 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB per token
KV Cache Memory Math
70B model KV cache (full multi-head attention):
  Per token: ~2.6 MB (FP16)
  Single user, 4K context: 4,096 × 2.6 MB = ~10.7 GB
  Single user, 128K context: 131,072 × 2.6 MB = ~341 GB (!!)
  Batch of 32 users, 4K each: 32 × 10.7 GB = ~342 GB

Total GPU memory needed:
  Weights (FP16):       140 GB
+ KV cache (32 × 4K):   342 GB
+ Activations:          ~10 GB
─────────────────────────
  Total:                ~492 GB

That's 6 H100s just for memory. The KV cache is often the largest memory consumer in production inference systems.
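The cache math above generalizes to any model shape. A minimal sketch (the function name is illustrative; the third call shows the effect of sharing KV heads, the grouped/multi-query technique mentioned below):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (one K and one V vector) x layers x KV heads
    x head_dim x bytes per element, per token, times tokens in flight."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * seq_len * batch / 1e9

# 70B-class model, full multi-head attention (80 layers, 64 heads, 128-dim, FP16):
print(kv_cache_gb(80, 64, 128, seq_len=4096))            # ~10.7 GB per user
print(kv_cache_gb(80, 64, 128, seq_len=4096, batch=32))  # ~343 GB for 32 users
# Grouped-query attention with 8 shared KV heads shrinks the cache 8x:
print(kv_cache_gb(80, 8, 128, seq_len=4096))             # ~1.3 GB per user
```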
Key insight: The KV cache is why long-context models (128K+ tokens) are so expensive to serve. It’s also why techniques like KV cache quantization (storing keys/values in INT8 instead of FP16), PagedAttention (used in vLLM), and multi-query attention (sharing KV heads) are critical for production inference. They can reduce KV cache memory by 4–8x.
Compute-Bound vs Memory-Bound
Understanding which bottleneck you’re hitting changes your optimization strategy
Two Different Bottlenecks
Every GPU workload is limited by either compute (not enough TFLOPS) or memory bandwidth (can’t feed data fast enough). Knowing which one you’re hitting determines your optimization strategy:

Compute-bound (training, batched inference):
• GPU cores are fully utilized
• Adding more TFLOPS helps
• Larger batch sizes help (more work per memory read)
• Lower precision helps (more ops per cycle)

Memory-bound (single-user inference, small batches):
• GPU cores are idle, waiting for data
• More TFLOPS don’t help
• More bandwidth helps
• Quantization helps (less data to read)
• Smaller models help
The Arithmetic Intensity Test
Arithmetic Intensity (AI) = FLOPs per byte of memory accessed

H100 balance point: 990 TFLOPS ÷ 3,350 GB/s = 295 FLOPs per byte
  If your workload does <295 FLOPs per byte → memory-bound
  If your workload does >295 FLOPs per byte → compute-bound

Typical workloads:
  Training (large batch):  ~500-2,000 FLOPs/byte → compute-bound
  Inference (batch=1):     ~1-2 FLOPs/byte → memory-bound
  Inference (batch=64):    ~64-128 FLOPs/byte → still memory-bound
  Inference (batch=512):   ~512-1,024 FLOPs/byte → compute-bound
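The balance-point test can be sketched as a small classifier (function name illustrative):

```python
def bound_type(flops_per_byte: float, peak_tflops: float,
               bandwidth_gbs: float) -> str:
    """Classify a workload against a GPU's compute/bandwidth balance point
    (the roofline model's ridge point)."""
    balance = peak_tflops * 1e12 / (bandwidth_gbs * 1e9)  # FLOPs per byte
    return "compute-bound" if flops_per_byte > balance else "memory-bound"

# H100: 990 TFLOPS FP16, 3,350 GB/s -> balance point ~295 FLOPs/byte
print(bound_type(1.5, 990, 3350))    # batch-1 inference: memory-bound
print(bound_type(1000, 990, 3350))   # large-batch training: compute-bound
```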
Key insight: Most LLM inference is memory-bound. This is why buying a GPU with more TFLOPS doesn’t always make inference faster — you need more bandwidth. It’s also why the H200 (same compute as H100, but 43% more bandwidth) gives a meaningful inference speedup despite having zero additional TFLOPS.
The HBM Roadmap: What’s Coming
HBM4, HBM4E, and the path to 4 TB/s per stack
HBM Technology Roadmap
Memory technology is evolving rapidly to keep pace with AI compute demands:

HBM3e (current, 2024–2025): 9.6 Gbps data rate, up to 36 GB per stack, ~1.2 TB/s per stack. Used in the H200, B200, and MI350X. Micron claims its HBM3e is 30% more power-efficient than competing parts. Production is scaling rapidly.

HBM4 (2026): JEDEC spec released April 2025. Doubles interface width to 2,048 bits. Targets 2+ TB/s per stack. Will enable GPUs with 256–384 GB of memory at 16+ TB/s total bandwidth.

HBM4E (2027+): Samsung announced 16 Gbps speed, up to 48 GB per stack, 4 TB/s per stack. This would give a single GPU potentially 384 GB at 32 TB/s — enough to run a 200B model on a single chip.

The memory industry is investing tens of billions in HBM manufacturing capacity. It’s the bottleneck that determines how fast AI accelerators can be built.
Impact on AI Workloads
What HBM4/4E enables:

Single-GPU model capacity (FP16):
  HBM3e (B200): 192 GB → ~96B params
  HBM4 (2026): ~384 GB → ~192B params
  HBM4E (2027): ~384 GB → ~192B params

Inference speed (70B FP16):
  HBM3 (H100): ~24 tok/s
  HBM3e (B200): ~57 tok/s
  HBM4 (est.): ~114 tok/s
  HBM4E (est.): ~228 tok/s

HBM market growth:
  2024: $26 billion
  2025: $38 billion
  2026: $58 billion (projected)

Memory bandwidth is doubling every ~2 years. This is what makes each GPU generation dramatically faster at inference.
Key insight: The HBM roadmap is as important as the GPU compute roadmap. Each HBM generation roughly doubles bandwidth, which directly translates to faster inference. The combination of more bandwidth (HBM4) and lower precision (FP4/FP8) means future GPUs could serve today’s largest models at real-time speeds on a single chip.
Practical GPU Sizing Guide
How many GPUs do you actually need for your model?
Inference Sizing
Rule of thumb for inference:

GPUs needed = Model_Size ÷ GPU_Memory
(leave 20-30% headroom for KV cache)

7B model (FP16 = 14 GB):
  H100 (80 GB): 1 GPU ✓
  RTX 4090 (24 GB): 1 GPU ✓

13B model (FP16 = 26 GB):
  H100: 1 GPU ✓
  RTX 4090: 2 GPUs (or 1 with INT4)

70B model (FP16 = 140 GB):
  H100: 2 GPUs
  B200 (192 GB): 1 GPU ✓
  MI300X (192 GB): 1 GPU ✓

405B model (FP16 = 810 GB):
  H100: ~12 GPUs
  B200: ~5 GPUs
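The inference rule of thumb can be sketched as a calculator (function name illustrative; the optional `headroom` parameter reserves the 20-30% for KV cache):

```python
import math

def gpus_for_inference(params_billions: float, bytes_per_param: float,
                       gpu_memory_gb: float, headroom: float = 0.0) -> int:
    """GPUs needed to hold the weights, optionally reserving a fraction
    of each GPU's memory for KV cache and activations."""
    weights_gb = params_billions * bytes_per_param
    usable_gb = gpu_memory_gb * (1 - headroom)
    return math.ceil(weights_gb / usable_gb)

print(gpus_for_inference(70, 2, 80))         # 70B FP16 on H100s: 2 (no headroom, tight)
print(gpus_for_inference(70, 2, 80, 0.25))   # with 25% KV-cache headroom: 3
print(gpus_for_inference(70, 2, 192))        # 70B FP16 on a B200: 1
print(gpus_for_inference(405, 2, 192))       # 405B FP16 on B200s: 5
```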
Training Sizing
Rule of thumb for training:

Memory = ~16 bytes × parameters + activations (varies with batch)

7B model training:
  Memory: ~130-170 GB
  H100s needed: 2-4 (with FSDP sharding)

13B model training:
  Memory: ~250-350 GB
  H100s needed: 4-8

70B model training:
  Memory: ~1,300-1,700 GB
  H100s needed: 16-32

405B model training:
  Memory: ~7,500-10,000 GB
  H100s needed: 128+

These are minimums. Production training often uses 2-4x more GPUs for faster throughput.
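The training rule combines the 16 bytes/param rule with activation memory. A minimal sketch (function name illustrative; assumes states are fully sharded across GPUs, as with FSDP or ZeRO-3):

```python
import math

def gpus_for_training(params_billions: float, activations_gb: float,
                      gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPU count for mixed-precision Adam training:
    ~16 bytes per parameter plus activation memory, sharded evenly."""
    total_gb = 16 * params_billions + activations_gb
    return math.ceil(total_gb / gpu_memory_gb)

print(gpus_for_training(7, 40))    # 7B model: 2 H100s minimum
print(gpus_for_training(70, 400))  # 70B model: 19 H100s minimum
```

This matches the 19-H100 figure in the training memory breakdown above; real clusters use more GPUs for throughput, not just capacity.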
Key insight: Memory is the first constraint to check when planning any AI workload. Before worrying about TFLOPS, interconnects, or networking, ask: “Does my model fit?” If it doesn’t fit in GPU memory, nothing else matters until you solve that problem — either with more GPUs, quantization, or a smaller model.