The Analogy
Imagine studying for an exam. Kaplan et al. (2020) said: “Get the biggest brain possible.” Chinchilla said: “A medium-sized brain that studies more books will outperform a giant brain that barely studied.” DeepMind trained 400+ models and found that model size and training tokens should scale in equal proportion as compute grows. The optimal ratio: roughly 20 tokens per parameter.
Key insight: Chinchilla (70B params, 1.4T tokens) matched the much larger Gopher (280B params, 300B tokens) while being 4× smaller. This implied GPT-3 was massively undertrained: at 175B params, it should have seen ~3.5T tokens, not 300B. Chinchilla reshaped the entire field: suddenly, data collection became as important as GPU procurement.
The Numbers
# Chinchilla optimal: D ≈ 20 × N
# (tokens ≈ 20 × parameters)
# Was GPT-3 optimal?
# N = 175B, D = 300B
# Ratio: 300B / 175B = 1.7 tokens/param
# Optimal would be: 175B × 20 = 3.5T tokens
# GPT-3 was ~12× undertrained!
# Chinchilla-optimal examples:
# 1B model → 20B tokens
# 7B model → 140B tokens
# 70B model → 1.4T tokens
# 175B model → 3.5T tokens
# Compute formula: C ≈ 6 × N × D
# (6 FLOPs per param per token)
# Chinchilla 70B: 6 × 70B × 1.4T
# ≈ 5.9 × 10²³ FLOPs
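The arithmetic above can be checked with a short script. This is a minimal sketch of the two rules of thumb from the text (D ≈ 20 × N and C ≈ 6 × N × D); the helper names are my own, not from any scaling-laws codebase.

```python
def optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-optimal training tokens: D ≈ 20 × N."""
    return tokens_per_param * params

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ≈ 6 × N × D FLOPs."""
    return 6.0 * params * tokens

if __name__ == "__main__":
    # GPT-3: N = 175B params, trained on only D = 300B tokens
    n_gpt3, d_gpt3 = 175e9, 300e9
    print(f"GPT-3 ratio: {d_gpt3 / n_gpt3:.1f} tokens/param")            # 1.7
    print(f"Optimal: {optimal_tokens(n_gpt3) / 1e12:.1f}T tokens")       # 3.5T
    print(f"Undertrained by ~{optimal_tokens(n_gpt3) / d_gpt3:.0f}x")    # ~12x

    # Chinchilla itself: N = 70B, D = 1.4T
    c = training_flops(70e9, 1.4e12)
    print(f"Chinchilla compute: {c:.1e} FLOPs")                          # 5.9e+23
```

The same two helpers reproduce every row of the examples above (e.g. `optimal_tokens(7e9)` gives 140B tokens for a 7B model).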