Ch 10 — Context Windows & Memory

How LLMs remember conversations — KV cache, context length, RAG, and the quest for infinite memory
The Context Window: An LLM’s Working Memory
Everything the model can “see” at once
The Analogy
The context window is like a desk. A small desk (2K tokens) can hold one page. A large desk (128K tokens) can hold a short novel. Everything on the desk is visible simultaneously — the model can attend to any token from any other token. But once something falls off the desk, it’s gone. LLMs have no persistent memory between conversations. Each API call starts with a blank desk.
Key insight: Context windows have grown dramatically: GPT-2 (1K tokens) → GPT-3 (2K) → GPT-3.5 (4K/16K) → GPT-4 (8K/32K, then 128K with GPT-4 Turbo) → Claude 3 (200K) → Gemini 1.5 (1M+). But attention is O(n²) in sequence length, so doubling the context quadruples the attention compute. A 128K context window computes 128K × 128K ≈ 16.4 billion attention scores per layer.
Context Window Sizes
# Context windows (tokens):
# GPT-2 (2019):          1,024
# GPT-3 (2020):          2,048
# GPT-4 (2023):        128,000
# Claude 3.5 (2024):   200,000
# Gemini 1.5 (2024): 1,000,000
# Llama 3.1 (2024):    128,000
#
# What fits in 128K tokens?
# ~96,000 words ≈ a 300-page book
# ~200 pages of code
# ~3 hours of meeting transcript
#
# Attention cost: O(n²)
# 2K context:   4M attention scores/layer
# 128K context: 16.4B attention scores/layer
# 1M context:   1T attention scores/layer
# → Quadratic scaling is the core challenge
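The quadratic numbers above can be reproduced directly. A minimal sketch (the function name is illustrative, not a library API):

```python
# Sketch: quadratic growth of attention work with context length.
# Every token attends to every token, so scores = n × n per layer.

def attention_scores_per_layer(n_tokens: int) -> int:
    """Number of query-key dot products one attention layer computes."""
    return n_tokens * n_tokens

for n in (2_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_scores_per_layer(n):,} scores/layer")
# → 2K gives 4M, 128K gives ~16.4B, 1M gives 1T scores per layer
```

Doubling n from 2K to 4K multiplies the score count by 4, which is the scaling problem every technique in this chapter is working around.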
The KV Cache: Avoiding Redundant Work
Why we store key-value pairs from previous tokens
The Analogy
Without a KV cache, generating each new token would require recomputing attention for all previous tokens from scratch — like re-reading an entire book every time you write one word. The KV cache stores the Key and Value vectors from all previous tokens. When generating token N+1, you only compute Q/K/V for the new token and look up the cached K/V for tokens 1…N. This turns O(n²) per token into O(n).
Key insight: The KV cache is the biggest memory consumer during inference. For Llama 3 8B at 128K context: 32 layers × 2 (K+V) × 128K tokens × 1024 KV dims (GQA: 8 KV heads × 128 dims) × 2 bytes = 16 GB just for the KV cache of one request, and it would be 64 GB without GQA's 4× reduction! This is why serving long-context models is so expensive, and why KV cache compression (quantization to FP8, eviction strategies) is a hot research area.
KV Cache Math
# Without KV cache (naive):
# Token 1: process [t1]        → 1 step
# Token 2: process [t1, t2]    → 2 steps
# Token N: process [t1...tN]   → N steps
# Total: 1+2+...+N ≈ N²/2 steps
#
# With KV cache:
# Token 1: compute K1,V1, cache them
# Token 2: compute K2,V2, attend to K1,V1
# Token N: compute KN,VN, attend to K1..KN-1
# Total: N steps (linear!)
#
# KV cache memory (Llama 3 8B, BF16):
# Per token per layer:
#   K: 1024 dims × 2 bytes = 2 KB (GQA: 8 KV heads)
#   V: 1024 dims × 2 bytes = 2 KB
# Per token (all 32 layers): 128 KB
# 128K tokens: 128K × 128 KB = 16 GB
# (GQA reduces this 4× vs full MHA)
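The cached decode loop can be sketched in a few lines of NumPy. This is a toy single-head version with made-up dimensions; `decode_step` and the dict-based cache are illustrative names, not any serving library's API:

```python
import numpy as np

def decode_step(q, cache, k_new, v_new):
    """Append the new token's K/V, then attend over everything cached:
    O(n) work per step instead of recomputing all n² scores."""
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    K = np.stack(cache["K"])              # (n_tokens, d)
    V = np.stack(cache["V"])              # (n_tokens, d)
    scores = K @ q / np.sqrt(len(q))      # one score per cached token
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax attention weights
    return w @ V                          # (d,) attention output

rng = np.random.default_rng(0)
cache = {"K": [], "V": []}
for _ in range(5):                        # five decode steps
    q, k, v = rng.normal(size=(3, 64))    # toy Q/K/V for the new token
    out = decode_step(q, cache, k, v)
print(len(cache["K"]))                    # → 5 cached K vectors
```

Note that only the new token's Q/K/V are computed each step; the stacked `K` and `V` come straight from the cache, which is exactly the memory that grows to gigabytes at 128K tokens.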
Prefill vs Decode: Two Phases of Inference
Why the first token is slow and the rest are fast
The Analogy
When you start reading a long document, the first read-through takes time (you’re processing everything). After that, answering questions is fast (you remember the content). LLM inference has the same two phases: Prefill processes the entire prompt in parallel (compute-bound, fast per token). Decode generates tokens one at a time (memory-bound, slow). Time-to-first-token (TTFT) depends on prefill; tokens-per-second depends on decode.
Key insight: Prefill is compute-bound (GPU cores are the bottleneck) while decode is memory-bound (memory bandwidth is the bottleneck). This is why different optimizations target different phases: FlashAttention speeds up prefill, while KV cache quantization and batching speed up decode. Understanding this split is essential for optimizing LLM serving.
The Two Phases
# Phase 1: PREFILL (prompt processing)
# Input: 4000-token prompt
# Process all 4000 tokens in parallel
# Fill the KV cache for all prompt tokens
# Compute-bound: GPU cores fully utilized
# Time: ~200ms for 4K tokens on H100
#
# Phase 2: DECODE (token generation)
# Generate 1 token at a time
# Each step: read entire KV cache from memory
# Memory-bound: waiting for data transfer
# Time: ~20ms per token on H100
#
# For a 4K prompt + 500-token response:
# TTFT: ~200ms (prefill)
# Generation: 500 × 20ms = 10s (decode)
# Total: ~10.2s
# → Decode dominates total time!
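The arithmetic above generalizes to a tiny back-of-envelope latency model. The function name is a hypothetical helper; the timings are the rough H100 figures quoted in the text, not measurements:

```python
# Toy two-phase latency model: TTFT comes from prefill,
# total generation time from the per-token decode cost.

def total_latency_ms(prefill_ms: float, n_output_tokens: int, ms_per_token: float):
    ttft = prefill_ms                         # time to first token ≈ prefill
    decode = n_output_tokens * ms_per_token   # sequential decode steps
    return ttft, decode, ttft + decode

ttft, decode, total = total_latency_ms(200, 500, 20)
print(ttft, decode, total)  # → 200 10000 10200 (decode dominates)
```

This is why a chat UI can show the first words quickly yet take ten more seconds to finish: streaming hides decode latency, but it cannot hide a slow prefill.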
Extending Context: RoPE Scaling & Beyond
How models trained on 4K tokens work at 128K
The Analogy
Imagine a ruler marked up to 10 cm. To measure 100 cm, you could stretch the ruler (each mark now represents 10 cm) or use a different encoding. RoPE (Rotary Position Embeddings, Ch 2) encodes position as rotations. To extend context, you can interpolate the rotations (squeeze more positions into the same rotation range) or use NTK-aware scaling (adjust different frequency components differently).
Key insight: Llama 3.1 extended from 8K to 128K context using RoPE scaling + continued pretraining on long documents. LongRoPE (Microsoft) achieves 2M tokens by non-uniformly rescaling RoPE frequencies with only 1K fine-tuning steps. The key discovery: high-frequency RoPE dimensions should be scaled less than low-frequency ones, because they encode local position information that shouldn’t change.
Extension Methods
# RoPE position interpolation:
# Original: position i → rotate by i·θ
# Extended: position i → rotate by i·θ/scale
# scale = new_length / original_length
#
# Methods (from simple to advanced):
# 1. Linear interpolation (Chen et al., 2023)
#    Scale all frequencies equally
#    Simple but loses short-context quality
# 2. NTK-aware (Reddit/bloc97, 2023)
#    Scale high-freq less, low-freq more
#    Better short-context preservation
# 3. YaRN (Peng et al., 2023)
#    NTK + attention scaling + temperature
#    Used by many open-source models
# 4. LongRoPE (Microsoft, 2024)
#    Evolutionary search for optimal scaling
#    2M tokens, 1K fine-tuning steps
#    Used in Phi-3
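Method 1 (linear position interpolation) is simple enough to sketch directly. A minimal NumPy version, assuming the standard RoPE frequency schedule with base 10000; `rope_angles` is an illustrative helper, not a library function:

```python
import numpy as np

def rope_angles(position, dim=64, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to one position.
    scale > 1 is linear position interpolation: positions are squeezed
    so an extended context reuses the rotation range seen in training."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # per-pair frequencies
    return (position / scale) * inv_freq

orig_max, new_max = 8_192, 131_072
scale = new_max / orig_max  # 16.0 for an 8K → 128K extension

# The new max position now lands on angles the model saw at the old max:
assert np.allclose(rope_angles(new_max, scale=scale), rope_angles(orig_max))
```

NTK-aware scaling and YaRN replace the single `scale` with a per-frequency schedule, which is exactly the "scale high-freq less, low-freq more" idea listed above.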
PagedAttention: Virtual Memory for KV Cache
How vLLM serves thousands of requests efficiently
The Analogy
Without PagedAttention, each request pre-allocates GPU memory for its maximum possible context length — like reserving an entire restaurant for one person who might bring friends. PagedAttention (Kwon et al., 2023, vLLM) borrows from OS virtual memory: allocate KV cache in small pages on demand. No wasted memory. This simple idea improved serving throughput by 2-4×.
Key insight: Before PagedAttention, KV cache memory waste was 60-80% in production. A 128K-capable model would reserve 16 GB per request even if the actual conversation was only 1K tokens. PagedAttention allocates in 16-token blocks, achieving near-zero waste. vLLM is now the standard serving framework for open-source LLMs, used by thousands of companies.
How It Works
# Traditional KV cache allocation:
# Request 1: reserve 128K tokens → 16 GB
# Request 2: reserve 128K tokens → 16 GB
# Actual usage: 2K tokens each   → 0.25 GB
# Waste: 31.5 GB (98.4%!)
#
# PagedAttention (vLLM):
# Allocate in pages (blocks of 16 tokens)
# Request 1: 2K tokens → 125 pages → 0.25 GB
# Request 2: 2K tokens → 125 pages → 0.25 GB
# Waste: ~0 GB
# → Can serve 60× more concurrent requests!
#
# Additional benefits:
# - Shared prefixes (system prompts)
# - Copy-on-write for beam search
# - Dynamic memory growth
#
# vLLM throughput vs baselines:
# 2-4× higher than HuggingFace TGI
# Near-zero memory waste
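The on-demand paging idea can be sketched as a toy allocator. This is a simplified illustration of the block-table bookkeeping, not vLLM's actual implementation; the class and method names are invented:

```python
class PagedKVAllocator:
    """Toy PagedAttention-style allocator: KV memory grows in fixed-size
    pages on demand instead of one max-length reservation per request."""
    BLOCK_TOKENS = 16  # vLLM's default block size

    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))  # free physical pages
        self.tables = {}   # request id -> list of physical block ids
        self.counts = {}   # request id -> tokens stored so far

    def append_token(self, req: str):
        n = self.counts.get(req, 0)
        if n % self.BLOCK_TOKENS == 0:  # last block full: grab a new page
            self.tables.setdefault(req, []).append(self.free_blocks.pop())
        self.counts[req] = n + 1

alloc = PagedKVAllocator(total_blocks=1024)
for _ in range(2000):                    # a 2K-token conversation
    alloc.append_token("req-1")
print(len(alloc.tables["req-1"]))        # → 125 pages (2000 / 16)
```

The block table is the analogue of an OS page table: logical token positions map to scattered physical blocks, so requests only hold the pages they actually fill.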
RAG: Retrieval-Augmented Generation
Giving LLMs access to external knowledge
The Analogy
Even with 128K tokens, an LLM can’t hold all of Wikipedia in its context. RAG is like giving a student access to a library during an exam: when asked a question, first search a knowledge base for relevant documents, then stuff them into the context window, then generate an answer. The model’s parametric knowledge (from training) is augmented with retrieved knowledge (from the database).
Key insight: RAG solves two critical problems: hallucination (the model can cite real documents instead of making things up) and freshness (the knowledge base can be updated without retraining). The embedding models from Ch 2 power the retrieval step: embed the query, find the nearest document embeddings via cosine similarity, and inject the top-K documents into the prompt. This is how most enterprise AI applications work.
RAG Pipeline
# RAG pipeline:
# 1. Index (offline, once):
#    Split documents into chunks (~512 tokens)
#    Embed each chunk → vector (1536-dim)
#    Store in vector database (Pinecone, etc.)
# 2. Retrieve (per query):
#    Embed user query → vector
#    Find top-K nearest chunks (cosine sim)
#    K = 3-10 typically
# 3. Generate (per query):
prompt = f"""Context: {retrieved_chunks}

Question: {user_query}

Answer based on the context above:"""
response = llm.generate(prompt)
# Benefits:
# ✓ Reduces hallucination (citable sources)
# ✓ Always up-to-date (update DB, not model)
# ✓ Domain-specific (your company's docs)
# ✓ No fine-tuning needed
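Step 2 (retrieve) reduces to a nearest-neighbor search by cosine similarity. A self-contained NumPy sketch with toy 4-dim "embeddings" standing in for a real embedding model's output; `top_k_chunks` is an illustrative helper, not a vector-database API:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose embeddings are closest to the query
    by cosine similarity (the core of RAG's retrieval step)."""
    q = query_vec / np.linalg.norm(query_vec)
    C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = C @ q                        # cosine similarity per chunk
    top = np.argsort(sims)[::-1][:k]    # indices of the k best chunks
    return [chunks[i] for i in top]

# Toy index: 3 chunks with hand-made 4-dim embeddings.
chunks = ["refund policy", "shipping times", "privacy notice"]
vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.1])
print(top_k_chunks(query, vecs, chunks, k=2))
# → ['refund policy', 'privacy notice']
```

A production system swaps the brute-force `argsort` for an approximate nearest-neighbor index, but the retrieved chunks are injected into the prompt exactly as in step 3 above.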
The Limits: Lost in the Middle & Beyond
Long context doesn’t mean perfect memory
Known Limitations
Research shows LLMs suffer from “lost in the middle” (Liu et al., 2023): information at the start and end of the context is recalled well, but information in the middle is often missed. A 128K context window doesn’t mean perfect recall of all 128K tokens. Additionally, performance degrades on tasks requiring precise retrieval from very long contexts. The effective context is often shorter than the advertised maximum.
Key insight: The future of LLM memory likely combines multiple approaches: long context windows (128K-1M), RAG for external knowledge, tool use for real-time data, and potentially persistent memory stores that accumulate across conversations. No single approach solves the memory problem completely. Understanding these tradeoffs is essential for building robust AI applications.
Memory Strategies
# LLM memory landscape:
#
# In-context (this chapter):
#   ✓ Immediate, no setup
#   ✗ Limited to context window
#   ✗ Lost-in-the-middle problem
#
# RAG (retrieval):
#   ✓ Unlimited knowledge base
#   ✓ Citable, updatable
#   ✗ Retrieval can miss relevant docs
#
# Fine-tuning (Ch 7):
#   ✓ Baked into model weights
#   ✗ Expensive, can't easily update
#
# Tool use (function calling):
#   ✓ Real-time data (APIs, databases)
#   ✗ Requires integration work
#
# Best practice: combine all four
# Long context + RAG + fine-tuning + tools
# = robust, production-ready AI system