Ch 5 — Scaling Up: From Transformer to LLM

Why bigger models are smarter — scaling laws, compute budgets, and the race to a trillion parameters
Why Bigger Models Are Smarter
The surprising regularity of scaling
The Analogy
Think of a library. A small library with 1,000 books can answer basic questions. A city library with 100,000 books can handle more complex topics. A national library with millions of volumes can connect ideas across disciplines. LLMs work the same way: more parameters = more “shelf space” for storing patterns, facts, and reasoning strategies. The remarkable discovery is that this scaling follows predictable mathematical laws.
Key insight: Loss (how wrong the model is) decreases as a smooth power law with more parameters, more data, and more compute. This means you can predict how good a model will be before training it. OpenAI used this to plan GPT-4: they trained small models, measured the scaling curve, and accurately predicted GPT-4’s performance months before it finished training.
The Growth of LLMs
# Model sizes over time:
# GPT-1 (2018): 117M params
# GPT-2 (2019): 1.5B params (13× bigger)
# GPT-3 (2020): 175B params (117× bigger)
# PaLM (2022): 540B params
# Llama 2 (2023): 70B params
# Llama 3 (2024): 405B params
# GPT-4 (2023): ~1.8T params (MoE, est.)

# Training data growth:
# GPT-2: 40B tokens (WebText)
# GPT-3: 300B tokens
# Llama 2: 2T tokens
# Llama 3: 15T tokens

# Compute growth (FLOPs):
# GPT-3: 3.14 × 10²³ FLOPs
# Llama 3 405B: ~3.8 × 10²⁵ FLOPs
Kaplan Scaling Laws (2020)
OpenAI’s first discovery: loss follows power laws
The Discovery
Kaplan et al. at OpenAI (Jan 2020) found that model loss follows a power law with respect to three variables: number of parameters (N), dataset size (D), and compute budget (C). The key equation: L(N) ≈ (N_c / N)^α where α ≈ 0.076. This means every 10× increase in parameters shrinks the reducible loss by a constant factor (about 0.84). The relationship is remarkably smooth — no sudden jumps or plateaus.
Key insight: Kaplan’s main recommendation was to prioritize model size over data: make the model as big as possible, even if you can’t train it on that much data. This led to GPT-3 being trained on “only” 300B tokens despite having 175B parameters (~1.7 tokens per parameter). This advice turned out to be wrong — Chinchilla would later show why.
The Power Law
# Kaplan scaling laws (simplified):
# L(N) = (N_c / N)^α_N + L_∞
# L(D) = (D_c / D)^α_D + L_∞
# L(C) = (C_c / C)^α_C + L_∞
#
# Where:
# L = cross-entropy loss (lower = better)
# N = number of parameters
# D = number of training tokens
# C = compute (FLOPs)
# L_∞ = irreducible loss (entropy of text)

# Practical implication:
# 10× more params → reducible loss shrinks
# by a constant factor (10^-0.076 ≈ 0.84)
# This is why GPT-3 was ~117× bigger than GPT-2
# and why the race to scale began
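The parameter law above can be sketched in a few lines. The exponent is Kaplan's; the normalization constant and irreducible loss below are illustrative assumptions, not the paper's exact fit.

```python
ALPHA_N = 0.076     # exponent from Kaplan et al. (2020)
N_C = 8.8e13        # assumed normalization constant (illustrative)
L_INF = 1.69        # assumed irreducible loss in nats/token (illustrative)

def kaplan_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N + L_INF

# Every 10× in parameters multiplies the *reducible* part of the loss
# by 10**-0.076 ≈ 0.84 — smooth, diminishing, but predictable.
for n in [1.5e9, 15e9, 150e9]:
    print(f"{n:.1e} params -> predicted loss {kaplan_loss(n):.3f}")
```

Note the subtraction of L_∞ when comparing: the power law governs the reducible loss, not the absolute loss.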
Real World
Doubling a library’s books gives diminishing returns — each doubling adds less new knowledge than the last
In LLMs
Power law: 10× more parameters cuts the reducible loss by a constant fraction. Diminishing returns, but predictable and consistent
Chinchilla: The Compute-Optimal Recipe
DeepMind’s 2022 correction — data matters as much as size
The Analogy
Imagine studying for an exam. Kaplan said: “Get the biggest brain possible.” Chinchilla said: “A medium-sized brain that studies more books will outperform a giant brain that barely studied.” DeepMind trained 400+ models and found that parameters and training tokens should scale equally. The optimal ratio: roughly 20 tokens per parameter.
Key insight: Chinchilla (70B params, 1.4T tokens) matched the much larger Gopher (280B params, 300B tokens) while being 4× smaller. This proved GPT-3 was massively undertrained: with 175B params, it should have seen ~3.5T tokens, not 300B. Chinchilla reshaped the entire field — suddenly, data collection became as important as GPU procurement.
The Numbers
# Chinchilla optimal: D ≈ 20 × N
# (tokens ≈ 20 × parameters)

# Was GPT-3 optimal?
# N = 175B, D = 300B
# Ratio: 300B / 175B = 1.7 tokens/param
# Optimal would be: 175B × 20 = 3.5T tokens
# GPT-3 was ~12× undertrained!

# Chinchilla-optimal examples:
# 1B model → 20B tokens
# 7B model → 140B tokens
# 70B model → 1.4T tokens
# 175B model → 3.5T tokens

# Compute formula: C ≈ 6 × N × D
# (6 FLOPs per param per token)
# Chinchilla 70B: 6 × 70B × 1.4T
# ≈ 5.9 × 10²³ FLOPs
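Combining C ≈ 6ND with D ≈ 20N gives C ≈ 120N², so the compute-optimal split for a given budget is N = √(C/120). A minimal sketch (`chinchilla_optimal` is a made-up helper name):

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Given a compute budget C (FLOPs), return (params N, tokens D)
    under C = 6*N*D and D = 20*N, i.e. N = sqrt(C / 120)."""
    n = math.sqrt(compute_flops / 120)
    return n, 20 * n

# Plugging in roughly Chinchilla's own budget recovers its shape:
n, d = chinchilla_optimal(5.9e23)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")  # ~70B params, ~1.4T tokens
```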
Beyond Chinchilla: Overtrain for Cheap Inference
Why Llama 3 8B trains ~94× past the optimal point
The Analogy
Chinchilla tells you the cheapest way to train. But what about using the model? A 70B model costs 10× more per query than a 7B model. If you’re serving billions of queries, it’s cheaper to overtrain a smaller model: spend more on training (one-time cost) to get a smaller model that’s cheaper to run forever. It’s like investing more in a fuel-efficient car — higher upfront cost, lower lifetime cost.
Key insight: Llama 3 8B was trained on 15T tokens — that’s ~1,875 tokens per parameter, or ~94× the Chinchilla ratio. Microsoft’s Phi-3 Mini (3.8B) used 3.3T tokens (870 tokens/param). These “overtrained” small models match much larger Chinchilla-optimal models while being dramatically cheaper to serve. This is the dominant strategy in 2024-2025.
The Shift
# Chinchilla-optimal vs actual training:

# Llama 3 8B:
# Chinchilla optimal: 160B tokens
# Actually trained on: 15T tokens
# Overtrain factor: 94×
# Result: matches Llama 2 70B on many tasks

# Llama 3 70B:
# Chinchilla optimal: 1.4T tokens
# Actually trained on: 15T tokens
# Overtrain factor: 11×
# Result: approaches GPT-4 on many tasks

# Phi-3 Mini (3.8B):
# Chinchilla optimal: 76B tokens
# Actually trained on: 3.3T tokens
# Overtrain factor: 43×
# Result: matches Llama 2 13B

# The new rule: train small, train long
# Inference cost dominates at scale
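The tradeoff can be made concrete by summing one-time training FLOPs (6ND) and lifetime inference FLOPs (~2N per generated token). The 10T-token serving volume below is an assumed figure for illustration, not a reported number.

```python
def total_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Lifetime compute = one-time training cost + per-token serving cost."""
    train = 6 * n_params * train_tokens   # C_train ≈ 6 * N * D
    serve = 2 * n_params * served_tokens  # ~2 FLOPs per param per token served
    return train + serve

served = 1e13  # assume 10T tokens served over the model's lifetime

big   = total_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small = total_flops(8e9, 15e12, served)    # overtrained 8B (Llama 3 style)
print(f"70B lifetime: {big:.2e} FLOPs, 8B lifetime: {small:.2e} FLOPs")
```

At this serving volume the overtrained 8B wins on total FLOPs despite spending more on training, which is exactly the argument above.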
Anatomy of a Large Model
Where do all those billions of parameters live?
Parameter Breakdown
An LLM’s parameters are stored in weight matrices. The two biggest consumers: attention (Q, K, V, and output projections) and FFN (the expand/compress layers). In a classic GPT-style block, FFN holds roughly two-thirds of the parameters and attention one-third; with GQA and a SwiGLU FFN (as in Llama 3), the FFN share climbs to ~80% of each block, or ~70% of the whole model. Embeddings and norms are a small fraction. Understanding where parameters live helps you understand what the model “knows.”
Key insight: Memory required = parameters × bytes per parameter. At full precision (FP32), each parameter is 4 bytes. A 70B model needs 280 GB just for weights — that’s 4 high-end GPUs. At half precision (FP16/BF16), it’s 140 GB. At 4-bit quantization, it’s 35 GB — fits on a single GPU. This is why quantization (Ch 11) is so important.
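The memory arithmetic above is a one-liner — weight bytes are just parameters × bits ÷ 8 (a sketch of weight storage only; activations and KV cache add more):

```python
def weight_gb(n_params: float, bits: int) -> float:
    """GB needed just to store the weights at the given precision."""
    return n_params * bits / 8 / 1e9

for bits, name in [(32, "FP32"), (16, "FP16/BF16"), (4, "INT4")]:
    print(f"70B model @ {name}: {weight_gb(70e9, bits):.0f} GB")
# FP32: 280 GB, FP16/BF16: 140 GB, INT4: 35 GB — matching the text above
```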
Llama 3 8B Breakdown
# Llama 3 8B parameter breakdown:
# Total: ~8.03B parameters

# Embedding table:
# 128,256 × 4,096 = 525M (6.5%)

# Per transformer block (×32):
# Attention (Q,K,V,O projections):
#   Q: 4096 × 4096 = 16.8M
#   K: 4096 × 1024 = 4.2M (GQA: 8 KV heads)
#   V: 4096 × 1024 = 4.2M (GQA: 8 KV heads)
#   O: 4096 × 4096 = 16.8M
#   Subtotal: 42M per block
# FFN (SwiGLU, 3 matrices):
#   W1: 4096 × 14336 = 58.7M
#   W3: 4096 × 14336 = 58.7M
#   W2: 14336 × 4096 = 58.7M
#   Subtotal: 176M per block
# Norms: 2 × 4096 = 8K (negligible)

# 32 blocks: 32 × 218M = 6.98B (87%)
# Output head: 4096 × 128256 = 525M (6.5%)
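As a sanity check, the breakdown above can be recomputed from the published Llama 3 8B config (d_model 4096, d_ff 14336, 32 layers, 8 KV heads of dim 128, vocab 128,256):

```python
d_model, d_ff, n_layers = 4096, 14336, 32
vocab, n_kv_heads, head_dim = 128_256, 8, 128

embed = vocab * d_model                                 # input embedding table
attn = 2 * d_model * d_model \
     + 2 * d_model * (n_kv_heads * head_dim)            # Q,O full; K,V shrunk by GQA
ffn = 3 * d_model * d_ff                                # SwiGLU: W1, W2, W3
blocks = n_layers * (attn + ffn)
head = d_model * vocab                                  # untied output head
total = embed + blocks + head

print(f"total ≈ {total/1e9:.2f}B params, FFN share {n_layers*ffn/total:.0%}")
```

This lands at ~8.03B, with the FFN matrices holding about 70% of all parameters.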
The Hardware Behind Scale
GPUs, clusters, and the infrastructure that makes LLMs possible
The Analogy
Training an LLM is like building a skyscraper. You need cranes (GPUs), coordination (distributed training), and time. A single NVIDIA H100 GPU can do ~2,000 TFLOPS (FP8). Training Llama 3 405B required ~30.8 million GPU-hours on H100s. That’s 16,384 GPUs running for ~78 days. The electricity alone costs millions of dollars.
Key insight: Training cost scales as C ≈ 6ND (6 FLOPs per parameter per token). At ~$2/GPU-hour for H100s, training Llama 3 405B cost roughly $60M+ in compute alone. GPT-4 is estimated at $100M+. This is why only a handful of organizations can train frontier models — and why efficient architectures and training recipes matter enormously.
Training Infrastructure
# Key GPU specs (NVIDIA):
# A100: 80GB HBM2e, 312 TFLOPS (BF16)
# H100: 80GB HBM3, 990 TFLOPS (BF16)
# H200: 141GB HBM3e, 990 TFLOPS (BF16)
# B200: 192GB HBM3e, 2250 TFLOPS (BF16)

# Training clusters:
# GPT-3: ~1,000 V100s, ~34 days
# Llama 3 405B: 16,384 H100s, ~78 days
# Estimated cost: $60-100M+

# Distributed training techniques:
# - Data parallelism: split batches
# - Tensor parallelism: split matrices
# - Pipeline parallelism: split layers
# - FSDP: shard optimizer states
# All combined for 16K+ GPU training
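A back-of-envelope estimate ties these numbers together: total FLOPs (6ND) divided by sustained per-GPU throughput (peak × MFU). The ~33% MFU (model FLOPs utilization) here is an assumption chosen because it roughly reproduces the reported ~30.8M GPU-hours, not a figure from Meta.

```python
def gpu_hours(n_params: float, tokens: float,
              peak_tflops: float = 990, mfu: float = 0.33) -> float:
    """Estimate GPU-hours: training FLOPs / sustained FLOPs per second / 3600."""
    flops = 6 * n_params * tokens
    return flops / (peak_tflops * 1e12 * mfu) / 3600

hours = gpu_hours(405e9, 15e12)  # Llama 3 405B on H100s (BF16 peak)
print(f"~{hours/1e6:.1f}M GPU-hours, ~${hours * 2 / 1e6:.0f}M at $2/GPU-hour")
```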
Mixture of Experts: Scaling Without the Cost
How to have a trillion parameters but only use 10% at a time
The Analogy
Imagine a hospital with 8 specialist doctors. For each patient, a triage nurse (router) decides which 2 specialists to consult. The hospital has the knowledge of 8 doctors but only pays 2 per patient. Mixture of Experts (MoE) works the same way: replace the single FFN with 8 (or more) expert FFNs, and a learned router picks the top-2 for each token. Total parameters are huge, but active parameters per token are small.
Key insight: GPT-4 is widely reported to use MoE with ~1.8 trillion total parameters but only ~220B active per token. Mixtral 8x7B has 46.7B total params but only 12.9B active (2 of 8 experts per token). DeepSeek-V3 uses 671B total / 37B active. MoE gives you the quality of a huge model at the inference cost of a small one — the dominant architecture for frontier models.
How MoE Works
class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, top_k):
        super().__init__()
        # n_experts parallel FFNs (SwiGLU_FFN as defined earlier)
        self.experts = nn.ModuleList([
            SwiGLU_FFN(d_model, d_ff) for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        # Router scores for each expert: (..., n_experts)
        scores = self.router(x)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        # Only run the top-k experts per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            hit = (indices == i)        # (..., top_k): slots where expert i was picked
            routed = hit.any(dim=-1)    # tokens routed to expert i
            if routed.any():
                # Gather the routing weight assigned to expert i per token
                w = (weights * hit).sum(dim=-1, keepdim=True)
                out[routed] += w[routed] * expert(x[routed])
        return out

# Mixtral: 8 experts, top-2
# Total: 46.7B, Active: 12.9B per token
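The total-vs-active gap follows directly from the expert count and top-k. A rough count for a Mixtral-style config, expert FFNs only (attention and embeddings make up the remaining ~1.6B of Mixtral's 46.7B):

```python
d_model, d_ff, n_experts, top_k, n_layers = 4096, 14336, 8, 2, 32

ffn = 3 * d_model * d_ff                  # one SwiGLU expert (W1, W2, W3)
total_ffn  = n_layers * n_experts * ffn   # all experts live in memory
active_ffn = n_layers * top_k * ffn       # only top-k run per token

print(f"expert params: {total_ffn/1e9:.1f}B total, {active_ffn/1e9:.1f}B active")
```

Memory scales with all 8 experts; per-token compute scales with just the 2 that fire, which is the entire appeal of MoE.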
The Small Model Revolution
Phi, Gemma, Qwen — when small models punch above their weight
The Trend
The frontier isn’t just about getting bigger — it’s about getting smaller and smarter. Microsoft’s Phi-3 Mini (3.8B) matches Llama 2 13B. Google’s Gemma 2 9B rivals Llama 3 8B. The secret: better data (curated, synthetic, deduplicated) and longer training (overtraining far past Chinchilla). A 3B model in 2025 outperforms a 175B model from 2020.
Key insight: Data quality matters more than data quantity. Phi models use heavily curated “textbook quality” data and synthetic examples. Llama 3 invested heavily in data filtering pipelines. The lesson: a small model trained on excellent data beats a large model trained on noisy web scrapes. This is why data engineering is now the most critical part of LLM development.
Small Model Landscape
# Small models that punch above weight:

# Phi-3 Mini (3.8B, 3.3T tokens)
# → Matches Llama 2 13B (3.5× bigger)
# → Runs on a phone

# Gemma 2 9B (8T tokens)
# → Rivals Llama 3 8B
# → Knowledge distillation from Gemini

# Qwen 2.5 7B (18T tokens)
# → Competitive with Llama 3 70B on some tasks

# Llama 3.2 3B (9T tokens)
# → Designed for on-device

# The pattern: smaller model + more data
# + better data + longer training
# = surprisingly competitive results
The Scaling Landscape: Where We Are
Current frontiers and what comes next
The Big Picture
We’ve covered the three levers of scaling: parameters (model size), data (training tokens), and compute (FLOPs). Kaplan said “scale parameters.” Chinchilla said “balance parameters and data.” Modern practice says “overtrain small models for inference efficiency.” MoE says “have many parameters but only use some.” The field is converging on a nuanced understanding: there’s no single best approach — it depends on your deployment constraints.
Key insight: The next frontier is “test-time compute” scaling — spending more compute during inference (chain-of-thought, search, verification) rather than just during training. OpenAI’s o1/o3 and DeepSeek-R1 show that letting models “think longer” at inference time can substitute for larger model sizes. This is a fundamentally new dimension of scaling.
Scaling Strategies Summary
# Three eras of scaling:

# Era 1 (2018-2021): "Bigger is better"
#   GPT-1 → GPT-2 → GPT-3
#   Just make the model bigger

# Era 2 (2022-2023): "Balance params & data"
#   Chinchilla, Llama 1, Llama 2
#   20 tokens per parameter

# Era 3 (2024+): "Overtrain + MoE + Distill"
#   Llama 3, Phi-3, Mixtral, DeepSeek-V3
#   Small active params, huge total knowledge
#   + test-time compute (o1, R1)

# The compute equation:
# C_train ≈ 6 × N × D (one-time)
# C_infer ≈ 2 × N_active (per token)
# Optimize for total cost = train + infer