The Analogy
Without PagedAttention, each request pre-allocates GPU memory for its maximum possible context length, like reserving an entire restaurant for one guest who might bring friends. PagedAttention (Kwon et al., 2023, vLLM) borrows the idea of virtual memory paging from operating systems: allocate KV cache in small fixed-size pages, on demand, as the sequence grows. Almost no memory is wasted, and this simple idea improved serving throughput by 2-4×.
Key insight: Before PagedAttention, 60-80% of KV cache memory was typically wasted in production serving. A model configured for a 128K context would reserve on the order of 16 GB per request even when the actual conversation was only 1K tokens. PagedAttention instead allocates in 16-token blocks, achieving near-zero waste. vLLM, the serving framework built around it, has become a standard for serving open-source LLMs and is widely used in industry.
How It Works
# Traditional KV cache allocation:
# Request 1: reserve 128K tokens → 16 GB
# Request 2: reserve 128K tokens → 16 GB
# Actual usage: 2K tokens each → 0.25 GB
# Waste: 31.5 GB (98.4%!)
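To check the arithmetic above, a few lines of Python. The 16 GB / 128K figures are the example's assumed model, which works out to roughly 128 KB of KV cache per token:

```python
# Numbers from the example above: two requests, each reserving a
# 128K-token context that costs 16 GB of KV cache if fully allocated.
GB = 1024**3
bytes_per_token = 16 * GB // (128 * 1024)   # ~128 KB/token (assumed model)

n_requests = 2
reserved_tokens = 128 * 1024                # pre-allocated per request
used_tokens = 2 * 1024                      # actually used per request

reserved = n_requests * reserved_tokens * bytes_per_token
used = n_requests * used_tokens * bytes_per_token
waste = reserved - used
print(f"reserved {reserved / GB:.1f} GB, used {used / GB:.2f} GB, "
      f"waste {waste / reserved:.1%}")
# → reserved 32.0 GB, used 0.50 GB, waste 98.4%
```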
# PagedAttention (vLLM):
# Allocate in pages (blocks of 16 tokens)
# Request 1: 2K tokens → 128 blocks of 16 → 0.25 GB
# Request 2: 2K tokens → 128 blocks of 16 → 0.25 GB
# Waste: ~0 GB
# → Can serve ~64× more concurrent requests (0.25 GB vs 16 GB each)!
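The paging itself can be sketched as a tiny block-table allocator. This is an illustrative toy, not vLLM's actual implementation; the 16-token `BLOCK_SIZE` matches the blocks described above:

```python
# Minimal sketch of paged KV-cache bookkeeping. Each sequence owns a
# block table mapping its logical blocks to arbitrary physical blocks,
# so memory grows one block at a time instead of being reserved up front.
BLOCK_SIZE = 16  # tokens per block, as in vLLM

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Grab a new physical block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # existing blocks are full
            table.append(self.free_blocks.pop())

cache = PagedKVCache(num_physical_blocks=1024)
for t in range(2048):                        # a 2K-token sequence
    cache.append_token("req-1", t)
print(len(cache.block_tables["req-1"]))      # → 128 blocks, not 128K tokens
```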
# Additional benefits:
# - Shared prefixes (system prompts)
# - Copy-on-write for beam search
# - Dynamic memory growth
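The prefix-sharing and copy-on-write benefits rest on per-block reference counts: forked sequences (beam candidates, requests with the same system prompt) share blocks until one of them writes. A hedged sketch, with illustrative names rather than vLLM's API:

```python
# Copy-on-write over refcounted KV blocks (illustrative, not vLLM code).
class Block:
    def __init__(self):
        self.ref_count = 1

class SharedCache:
    def fork(self, blocks):
        """A forked sequence shares all parent blocks; nothing is copied."""
        for b in blocks:
            b.ref_count += 1
        return list(blocks)

    def write_last(self, blocks):
        """Writing to a shared block first copies it (copy-on-write)."""
        last = blocks[-1]
        if last.ref_count > 1:
            last.ref_count -= 1
            blocks[-1] = Block()  # private copy for the writing sequence

prompt = [Block() for _ in range(8)]   # shared system-prompt blocks
cache = SharedCache()
child = cache.fork(prompt)             # e.g. a second beam candidate
cache.write_last(child)                # only the written block is copied
print(prompt[-1].ref_count, child[-1] is prompt[-1])  # → 1 False
```

Only the block being written is duplicated; the shared prefix stays at one physical copy no matter how many sequences reference it.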
# vLLM throughput vs baselines:
# 2-4× higher throughput than prior serving systems (e.g., HuggingFace TGI)
# Near-zero memory waste