Ch 11 — Making LLMs Fast

Quantization, FlashAttention, batching, distillation — the engineering that makes LLMs practical
Chapter roadmap: Quantize → FlashAttn → Batching → Speculative → Distill → On-Device → Serving
Quantization: Shrinking Models 4×
Using fewer bits per parameter with minimal quality loss
The Analogy
Imagine a painting rendered with 65,536 distinct shades (16-bit). If you reduce it to 16 shades (4-bit), most of the image looks the same — the human eye can’t tell the difference for most pixels. Quantization reduces the precision of model weights from 16-bit (2 bytes) to 8-bit (1 byte) or 4-bit (0.5 bytes). A 70B model goes from 140 GB to 35 GB — fitting on a single GPU instead of four.
Key insight: Modern quantization methods (GPTQ, AWQ, GGUF) are remarkably good. 4-bit quantization typically loses less than 1% on benchmarks compared to full precision. The key: weights are approximately normally distributed, so you can design quantization schemes (like NF4 from QLoRA) that are information-theoretically optimal for this distribution. Llama.cpp popularized running 4-bit models on CPUs, enabling LLMs on laptops.
Quantization Methods
# Precision levels:
#   FP32: 4 bytes/param   → 280 GB for 70B
#   BF16: 2 bytes/param   → 140 GB for 70B
#   INT8: 1 byte/param    →  70 GB for 70B
#   INT4: 0.5 byte/param  →  35 GB for 70B

# Popular methods:
#   GPTQ: post-training, layer-by-layer
#   AWQ: activation-aware, preserves important
#        weights at higher precision
#   GGUF: llama.cpp format, CPU-friendly
#   bitsandbytes: HuggingFace integration

# Quality impact (Llama 3 70B):
#   BF16: MMLU 82.0% (baseline)
#   INT8: MMLU 81.8% (-0.2%)
#   INT4: MMLU 81.1% (-0.9%)
#   INT3: MMLU 78.5% (-3.5%, noticeable)

# Rule of thumb: 4-bit is the sweet spot.
# Below 4-bit, quality degrades noticeably.
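To make the idea concrete, here is a minimal sketch of symmetric absmax quantization in NumPy. This is a toy per-tensor version; real 4-bit methods (GPTQ, AWQ, NF4) quantize small groups of weights separately and use calibration data or distribution-aware codebooks, so treat the code as an illustration of the round-to-integer-plus-scale idea, not any production scheme:

```python
import numpy as np

def quantize_absmax(w, bits=4):
    """Map floats to signed integers with one shared scale (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1         # 7 for 4-bit
    scale = np.abs(w).max() / qmax     # per-tensor here; per-group in practice
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights ~ normal
q, scale = quantize_absmax(w)
err = np.abs(w - dequantize(q, scale)).mean()            # bounded by scale / 2
```

At 4 bits each weight becomes one of 15 integer levels (-7…7) plus a shared scale, which is where the 4× shrink relative to FP16 comes from; real formats also pack two 4-bit values per byte.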
FlashAttention: IO-Aware Attention
2-4× faster attention by being smart about memory access
The Analogy
Standard attention computes the full N×N attention matrix, writes it to GPU memory, then reads it back for the next step. This is like a chef who carries each ingredient from the pantry to the counter one at a time. FlashAttention (Dao et al., 2022) is like bringing all ingredients at once: it computes attention in tiles that fit in fast SRAM, never writing the full attention matrix to slow HBM. Same result, 2-4× faster.
Key insight: FlashAttention doesn’t change the math — it computes exactly the same attention. The speedup comes entirely from reducing memory reads/writes (IO). GPU SRAM is ~10× faster than HBM but much smaller (~20 MB vs 80 GB). By tiling the computation to fit in SRAM, FlashAttention avoids the memory bottleneck. FlashAttention-2 and FlashAttention-3 further optimize for newer hardware. It’s now the default in every major framework.
The Speedup
# Standard attention memory access:
#   1. Compute QK^T → write N×N to HBM
#   2. Read N×N from HBM → softmax
#   3. Write N×N to HBM
#   4. Read N×N from HBM → multiply by V
#   IO: 4 × N² reads/writes to slow memory

# FlashAttention:
#   Process in tiles that fit in SRAM
#   Never materialize full N×N matrix
#   Use online softmax (running-max trick)
#   IO: O(N²/SRAM_size) — much less!

# Practical speedup:
#   Standard:            100ms for 4K context
#   FlashAttn-2:          35ms (2.9× faster)
#   FlashAttn-3 (H100):   25ms (4× faster)

# Memory savings:
#   Standard:  O(N²) memory for attn matrix
#   FlashAttn: O(N) memory — linear!
#   Enables much longer context windows
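The "online softmax (running-max trick)" line is the heart of the method, and it fits in a short sketch. This toy NumPy version handles a single query row: it streams over key/value tiles keeping only a running max `m`, denominator `d`, and output accumulator, rescaling past work whenever a new max appears. It never holds the full probability vector, yet matches ordinary softmax exactly (no GPU or SRAM details here, just the math that makes tiling exact):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, tile=4):
    """Streaming softmax(scores) @ values, one tile at a time."""
    m = -np.inf                      # running max of scores seen so far
    d = 0.0                          # running softmax denominator
    acc = np.zeros(values.shape[1])  # running (unnormalized) output
    for i in range(0, len(scores), tile):
        s, v = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, s.max())
        # rescale previous partial sums to the new running max
        correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        d = d * correction + p.sum()
        acc = acc * correction + p @ v
        m = m_new
    return acc / d

rng = np.random.default_rng(1)
scores = rng.normal(size=16)         # one query row of QK^T
values = rng.normal(size=(16, 8))    # V
out_tiled = online_softmax_weighted_sum(scores, values)
p = np.exp(scores - scores.max())
out_full = (p / p.sum()) @ values    # standard, fully materialized softmax
```

The same rescaling argument applies tile-by-tile across the N×N score matrix on a GPU, which is why FlashAttention can avoid writing that matrix to HBM while producing bit-for-bit the same attention output (up to floating-point order).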
Continuous Batching: Serving Many Users
Processing multiple requests simultaneously
The Analogy
A restaurant that serves one customer at a time wastes most of its kitchen capacity. Batching serves multiple customers simultaneously. Continuous batching (Orca, Yu et al., 2022) goes further: as soon as one request finishes, a new one takes its slot — no waiting for the entire batch to complete. This keeps the GPU busy at all times, improving throughput by 10-20× compared to naive sequential serving.
Key insight: The decode phase is memory-bound (Ch 10): the GPU spends most time reading KV cache, not computing. Batching amortizes this: reading the model weights once serves B requests simultaneously. With batch size 32, you get ~32× throughput at only ~1.5× latency. This is why LLM APIs can serve millions of users affordably — the per-request cost drops dramatically with batching.
Batching Strategies
# Static batching (naive):
#   Wait for B requests, process together
#   All must finish before new batch starts
#   Short requests wait for long ones → waste

# Continuous batching (Orca/vLLM):
#   Start processing immediately
#   When a request finishes → slot freed
#   New request fills the slot immediately
#   No wasted GPU cycles

# Throughput comparison (Llama 3 8B, H100):
#   Sequential:           ~50 tokens/sec
#   Static batch (B=32): ~800 tokens/sec
#   Continuous batch:   ~1200 tokens/sec
#   + PagedAttention:   ~2000 tokens/sec

# Cost per 1M tokens (approximate):
#   Sequential:        $2.00
#   Optimized serving: $0.05-0.10
#   → 20-40× cost reduction!
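The waste from static batching is easy to see with a toy step-counting model. The sketch below is hypothetical and deliberately simple: one token per slot per decode step, prefill ignored; it only exists to show why refilling freed slots beats waiting for the whole batch:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its LONGEST request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled immediately."""
    pending = list(lengths)
    slots = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]
    steps = 0
    while slots:
        steps += 1
        # every active slot emits one token; drop requests that just finished
        slots = [s - 1 for s in slots if s > 1]
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))
    return steps

lengths = [20] + [2] * 19  # one long request among many short ones
static = static_batch_steps(lengths, batch_size=4)
cont = continuous_batch_steps(lengths, batch_size=4)
```

With these lengths, static batching takes 28 steps (every short batch waits its turn behind the long one), while continuous batching finishes in 20: the short requests slip through slots freed alongside the 20-token request, so the GPU is never idle.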
Speculative Decoding & Parallel Strategies
Breaking the sequential bottleneck
Recap
We covered speculative decoding in Ch 9: a small draft model generates K tokens, the large model verifies in parallel, giving 2-3× speedup with identical output. Other parallel strategies include Medusa (multiple prediction heads on one model), EAGLE (feature-level drafting), and lookahead decoding (Jacobi iteration). All exploit the same insight: verification is parallel, generation is sequential.
Key insight: Speculative decoding is especially powerful for structured outputs (JSON, code) where the draft model’s predictions are highly accurate. For code generation, acceptance rates of 85-95% are common, giving near-linear speedup. Combined with batching and quantization, these techniques make LLM serving 50-100× more efficient than naive approaches.
Parallel Generation Methods
# Speculative decoding variants:

# Classic (Leviathan et al., 2023):
#   Draft: separate small model
#   Verify: target model, 1 forward pass
#   Speedup: 2-3×

# Medusa (Cai et al., 2024):
#   Draft: extra prediction heads on same model
#   No separate model needed
#   Speedup: 2-3×

# EAGLE (Li et al., 2024):
#   Draft from hidden states, not tokens
#   Higher acceptance rate
#   Speedup: 2.5-3.5×

# Combined optimization stack:
#   Quantization:     4× memory reduction
#   FlashAttention:   2-4× attention speedup
#   Continuous batch: 20× throughput
#   PagedAttention:   2-4× memory efficiency
#   Speculative:      2-3× latency reduction
#   Total: 50-100× over naive baseline
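The classic draft-and-verify loop can be sketched with stand-ins for the two models. Here `target_next` and `draft_next` are hypothetical toys (a positional oracle over a fixed string, with the draft wrong at every 7th position), and acceptance is the greedy-decoding special case rather than the full probability-ratio rule; the point is the control flow: K cheap draft steps, one verification counted as a single parallel target pass, accept the matching prefix, and take one corrected token from that same pass:

```python
TEXT = '{"name": "Ada", "age": 36}'   # hypothetical target output (structured JSON)

def target_next(prefix):
    """Stand-in for the large model: the true next character."""
    return TEXT[len(prefix)] if len(prefix) < len(TEXT) else ""

def draft_next(prefix):
    """Stand-in for the small draft model: right except every 7th position."""
    return "?" if len(prefix) % 7 == 6 else target_next(prefix)

def speculative_decode(k=4):
    out, target_calls = "", 0
    while target_next(out):
        # 1. draft proposes k tokens sequentially (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + "".join(draft)))
        # 2. ONE target pass verifies all k positions in parallel
        target_calls += 1
        accepted = 0
        for i, tok in enumerate(draft):
            if tok == target_next(out + "".join(draft[:i])):
                accepted += 1
            else:
                break
        out += "".join(draft[:accepted])
        # 3. the same verification pass yields one corrected/extra token for free
        out += target_next(out)
    return out, target_calls

out, calls = speculative_decode(k=4)
```

Because the draft is accurate on this structured string, the 26 characters cost only 7 verification passes instead of 26 sequential ones, mirroring the high acceptance rates the text reports for JSON and code.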
Knowledge Distillation: Teaching Small Models
Transfer knowledge from a large teacher to a small student
The Analogy
A professor (large model) has deep understanding but is expensive to consult. A teaching assistant (small model) can learn from the professor and handle most questions at a fraction of the cost. Distillation trains a small model to mimic a large model’s outputs. The student learns from the teacher’s “soft” probability distributions, which contain more information than hard labels alone.
Key insight: Google’s Gemma 2 models were distilled from larger Gemini models. DeepSeek-R1-Distill models distill reasoning ability from DeepSeek-R1 (671B) into 1.5B-70B models. The distilled 32B model matches or exceeds GPT-4o on math and coding benchmarks. Distillation is how the “small model revolution” (Ch 5) actually works in practice.
Distillation Process
# Knowledge distillation:
#   Teacher: Llama 3 405B (large, slow)
#   Student: Llama 3 8B (small, fast)

# Standard training:
#   Student learns from hard labels
#   "Paris" = [0, 0, 1, 0, ...]

# Distillation:
#   Student learns from teacher's soft probs
#   [0.01, 0.02, 0.85, 0.05, 0.03, ...]
#   "Paris is most likely, but Lyon and
#    Marseille are also plausible"
#   → Richer learning signal!

#   Loss = α·CE(student, labels)
#        + (1-α)·KL(student, teacher)

# Real examples:
#   Gemma 2 9B: distilled from Gemini
#   DeepSeek-R1-Distill-Qwen-32B:
#     distilled from R1 671B
#     matches GPT-4o on AIME math
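The combined loss can be written out directly. Below is a minimal NumPy sketch in the style of Hinton et al. (2015); note the temperature T, the T² scaling, and the forward KL(teacher‖student) direction are conventions from that line of work, not details given in this chapter:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    # Hard-label term: ordinary cross-entropy against the ground-truth class
    ce = -np.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) at temperature T
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kl = np.sum(pt * (np.log(pt) - np.log(ps)))
    return alpha * ce + (1 - alpha) * T**2 * kl

teacher = [1.0, 2.0, 8.0, 3.0]   # peaked on class 2 ("Paris")
matching = distill_loss(teacher, teacher, label=2)                 # student mimics teacher
uniform = distill_loss([0.0, 0.0, 0.0, 0.0], teacher, label=2)     # student knows nothing
```

The temperature softens both distributions so the student also sees the teacher's ranking of wrong answers ("Lyon and Marseille are plausible"), which is exactly the extra signal hard labels throw away.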
On-Device LLMs: AI on Your Phone
Running models locally with no internet
The Trend
LLMs are moving from cloud to device. Apple Intelligence runs a ~3B model on iPhone. Google runs Gemini Nano on Pixel. Qualcomm’s Snapdragon runs 7B models. The recipe: small model (1-3B) + aggressive quantization (4-bit) + hardware-specific optimization. A 3B model at 4-bit needs only 1.5 GB — fits easily in phone RAM. Latency is better (no network round-trip) and privacy is preserved (data never leaves the device).
Key insight: The combination of overtraining (Ch 5), distillation, and quantization means a 3B model in 2025 can match a 13B model from 2023. Apple’s on-device model handles autocomplete, summarization, and rewriting. The trade-off: on-device models are less capable than cloud models for complex reasoning, but for common tasks, they’re fast, private, and free.
On-Device Stack
# On-device LLM requirements:
#   Memory: < 4 GB (phone RAM budget)
#   Speed:  > 10 tokens/sec (usable)
#   Power:  < 5 W (battery-friendly)

# How to fit:
#   3B model  × 4-bit = 1.5 GB ✓
#   7B model  × 4-bit = 3.5 GB (tight)
#   13B model × 4-bit = 6.5 GB ✗ (too big)

# Frameworks:
#   llama.cpp: CPU, cross-platform
#   Apple MLX: Apple Silicon optimized
#   MLC-LLM:   mobile (Android/iOS)
#   Ollama:    desktop, easy to use

# Performance (Llama 3.2 3B, 4-bit):
#   iPhone 15 Pro: ~15 tokens/sec
#   M3 MacBook:    ~40 tokens/sec
#   RTX 4090:     ~100 tokens/sec
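The "how to fit" rows are just parameter count × bits ÷ 8 bytes. A one-line helper makes them checkable (weights only; the KV cache and activations come on top of this budget):

```python
def model_memory_gb(params_billions, bits):
    """Weight memory for a dense model: params × (bits / 8) bytes."""
    return params_billions * 1e9 * (bits / 8) / 1e9

# The chapter's examples:
#   3B  at 4-bit → 1.5 GB (fits a phone)
#   7B  at 4-bit → 3.5 GB (tight)
#   13B at 4-bit → 6.5 GB (too big for 4 GB RAM)
#   70B at BF16  → 140 GB (the cloud-model baseline)
```

The same formula is behind the quantization table earlier in the chapter: halving the bits halves the footprint, independent of architecture.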
The Serving Stack: Putting It All Together
How production LLM APIs actually work
Production Stack
A production LLM serving system combines all these optimizations: Quantized model (4-8 bit) + FlashAttention + PagedAttention + continuous batching + speculative decoding + tensor parallelism across GPUs. Frameworks like vLLM, TensorRT-LLM (NVIDIA), and TGI (HuggingFace) package all of this together. The result: serving costs have dropped 100× since GPT-3’s launch.
Key insight: The cost of LLM inference has been falling faster than Moore’s Law. GPT-3.5 cost $60/M tokens at launch (2022); equivalent quality models now cost $0.10-0.50/M tokens. This 100-600× cost reduction comes from better hardware (H100 vs A100), better software (vLLM, FlashAttention), smaller models (Phi, Gemma), and quantization. Making LLMs fast isn’t just engineering — it’s what makes AI accessible to everyone.
The Full Stack
# Production LLM serving stack:

# Layer 1: Model optimization
#   - Quantization (AWQ/GPTQ, 4-8 bit)
#   - Distillation (smaller model)
#   - GQA (fewer KV heads)

# Layer 2: Compute optimization
#   - FlashAttention (IO-aware)
#   - Speculative decoding
#   - Tensor parallelism (multi-GPU)

# Layer 3: Serving optimization
#   - PagedAttention (memory)
#   - Continuous batching (throughput)
#   - Prefix caching (shared prompts)

# Layer 4: Infrastructure
#   - Load balancing, auto-scaling
#   - Request routing, rate limiting
#   - Monitoring, logging

# Popular frameworks:
#   vLLM, TensorRT-LLM, TGI, SGLang