Ch 2 — Benchmarks: The Scoreboard

MMLU, HumanEval, SWE-bench, GPQA — what they measure and why they saturate
MMLU — The Knowledge Test
Massive Multitask Language Understanding
What It Measures
MMLU tests broad academic knowledge across 57 subjects — from abstract algebra to world religions. Each question is multiple-choice (A/B/C/D) with a single correct answer. It was designed to measure how much a model “knows” across the full breadth of human knowledge.
The Numbers
14,042 questions across 57 tasks. Random baseline: 25% (4 choices). Human expert baseline: ~89.8%. Current frontier models cluster above 88%, making MMLU essentially saturated for differentiating top models.
Why It Matters (and Doesn’t)
MMLU was the first widely adopted general benchmark and remains a standard reporting metric. But its saturation means it can no longer differentiate frontier models. A model scoring 87% vs 89% on MMLU tells you almost nothing about which is better for your use case.
Saturation status: MMLU is effectively solved. Frontier models exceed human expert performance. The field has moved to MMLU-Pro (harder, 10-choice questions) and GPQA (graduate-level science) as replacements.
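Grading a multiple-choice benchmark like MMLU reduces to matching predicted answer letters against the key and reading the result against the 25% random baseline. A minimal sketch (the function name and data shapes are illustrative, not from any official harness):

```python
def mc_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy for multiple-choice answers ('A'..'D')."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/key length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# With 4 answer choices, random guessing converges to 0.25 —
# the baseline any MMLU score should be read against.
```

In practice, harnesses also have to extract the answer letter from free-form model output, which is its own source of scoring noise.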
HumanEval & Coding Benchmarks
Can the model write working code?
HumanEval
164 Python programming problems with function signatures and docstrings. The model generates the function body, which is tested against unit tests. Metric: pass@k (probability of at least one correct solution in k attempts). Most frontier models now score above 85%.
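The pass@k metric has a standard unbiased estimator, introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and compute the probability that a random subset of k samples contains at least one pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem,
    c of them passed the unit tests, k is the attempt budget.
    Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 200 samples of which 50 pass, pass@1 is 0.25; averaging this per-problem estimate over all 164 problems gives the reported score.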
Beyond HumanEval
HumanEval+: Same problems, 80x more test cases to catch edge cases
MBPP: 974 crowd-sourced Python problems (easier)
LiveCodeBench: Fresh problems from competitive programming sites, updated monthly to resist contamination
The Contamination Problem
HumanEval problems are now in most training datasets. LiveCodeBench showed models dropping 20–30% on problems released after their training cutoff — strong evidence that high HumanEval scores partly reflect memorization, not genuine coding ability.
Key insight: A model scoring 92% on HumanEval might score 60% on novel problems it hasn’t seen. Always test with fresh, domain-specific coding tasks rather than relying on public benchmark scores.
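One practical way to probe for contamination, in the spirit of LiveCodeBench, is to split your results by each problem's release date relative to the model's training cutoff and compare pass rates. A sketch (the data shape is hypothetical):

```python
from datetime import date

def contamination_gap(results, cutoff):
    """Split per-problem outcomes by release date relative to the model's
    training cutoff. results: iterable of (release_date, passed) pairs.
    A large before/after gap suggests the pre-cutoff score partly
    reflects memorization rather than capability."""
    before = [passed for d, passed in results if d <= cutoff]
    after = [passed for d, passed in results if d > cutoff]

    def rate(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(before), rate(after)
```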
SWE-bench — Real Software Engineering
Can the model fix actual GitHub issues?
What It Tests
SWE-bench presents real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.) and asks models to generate patches that pass the repository’s test suite. This tests end-to-end software engineering: reading code, understanding context, writing fixes, and passing tests.
Variants
SWE-bench Lite: 300 curated, well-specified issues (easier subset)
SWE-bench Verified: Human-validated subset with clear specifications
SWE-bench Pro: Private, harder subset — GPT-5 and Claude Opus 4.1 both score only ~23%
Current Scores (2025–2026)
// SWE-bench Verified (top scores)
Gemini 3 Pro      76.2%
Claude Sonnet 4   72.7%
GPT-5             ~69%

// SWE-bench Pro (private, harder)
GPT-5             23.1%
Claude Opus 4.1   23.1%
Key insight: The gap between SWE-bench Verified (~72%) and SWE-bench Pro (~23%) shows how much benchmark difficulty matters. Easy benchmarks create an illusion of capability that harder tests shatter.
GPQA & Reasoning Benchmarks
Graduate-level science and mathematical reasoning
GPQA (Graduate-Level Google-Proof Q&A)
448 expert-crafted questions in biology, physics, and chemistry, designed to be “Google-proof”: the answers can’t simply be looked up. PhD-level domain experts score only ~65%, and skilled non-experts score far lower even with unrestricted web access. Current leader: Gemini 3 Pro at 92.6%.
GSM8K & Math
GSM8K: 8,500 grade-school math word problems. Status: saturated (frontier models score 95%+). Replaced by MATH (competition-level problems) and AIME (American Invitational Mathematics Exam) for harder reasoning tests.
Why Reasoning Benchmarks Matter
Reasoning benchmarks test whether models can chain logical steps, not just recall facts. A model that scores well on GPQA demonstrates genuine multi-step reasoning ability — the kind needed for complex real-world tasks like research, analysis, and problem-solving.
The saturation pattern: GSM8K (2021) → saturated in 2 years. MMLU (2020) → saturated in 3 years. GPQA (2023) → approaching saturation. Each generation of benchmarks has a shorter useful lifespan.
Chatbot Arena & Human Preference
When humans are the benchmark
How It Works
LMSYS Chatbot Arena presents users with two anonymous model responses side by side. Users pick which they prefer, and the votes are aggregated into Elo-style ratings (like chess rankings). As of January 2026: nearly 5 million votes across 296 models.
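The pairwise-vote aggregation can be sketched with a classic chess-style Elo update. This is a simplification: Arena's published rankings use more robust statistical fits (Bradley–Terry-style models), but the intuition is the same.

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One pairwise vote: nudge both ratings toward the observed outcome.
    expected_a is A's win probability implied by the current ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))
```

Two equally rated models have an expected score of 0.5, so a single win moves the winner up by k/2 points; an upset against a much higher-rated opponent moves ratings further.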
Why It Works
Arena avoids the contamination problem because every query is unique. It measures what users actually care about — helpfulness, clarity, accuracy — rather than performance on artificial tasks. It’s become the most trusted ranking for comparing frontier models.
Limitations
Style bias: Users prefer longer, more verbose answers even when shorter ones are more accurate
Demographic skew: Voters are mostly tech-savvy English speakers
Task distribution: Heavily weighted toward general chat, not domain-specific tasks
No granularity: “Which is better?” doesn’t tell you why or at what
Key insight: Chatbot Arena is the best general ranking we have, but it measures “average user preference on average tasks.” Your specific use case may have very different requirements.
The Saturation Problem
When benchmarks stop being useful
The Saturation Timeline
// Benchmark lifecycle
GSM8K     (2021) → Saturated 2023   ~2 years
MMLU      (2020) → Saturated 2024   ~3 years
HumanEval (2021) → Saturated 2024   ~3 years
GPQA      (2023) → Approaching...   ~2 years
SWE-bench (2023) → Still useful     ongoing
Why Benchmarks Saturate
Three forces drive saturation:

1. Data contamination: Benchmark questions leak into training data
2. Optimization pressure: Labs specifically optimize for benchmark scores
3. Capability ceiling: Models genuinely improve and exceed the benchmark’s difficulty level
The treadmill: The field is stuck in a cycle: create benchmark → models saturate it → contamination renders it useless → create harder benchmark. And the window keeps shrinking: the most recent benchmarks last only 6–12 months before the cycle repeats.
How to Read Benchmark Scores
A practical guide to not being fooled
Red Flags
Cherry-picked benchmarks: Only showing scores where the model excels
No error bars: Single-run scores without confidence intervals
Old benchmarks only: Reporting MMLU/GSM8K but not harder alternatives
No comparison to baselines: Scores without context are meaningless
Benchmark-specific fine-tuning: Models optimized for the test, not the task
What to Do Instead
1. Build your own eval set from real production data (50–200 examples)
2. Test on multiple benchmarks across different capabilities
3. Use contamination-resistant benchmarks (LiveCodeBench, SWE-bench Pro)
4. Check Chatbot Arena for general quality ranking
5. Run your own A/B tests with real users
Rule of thumb: Public benchmark scores are weak predictors of performance on your specific task. A model that scores well on benchmarks might still fail in your domain, and one that scores poorly might excel there. Always test on your own data.
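Step 1 in the list above can start as a loop this small. A minimal sketch, where `model_fn` and `grader_fn` are hypothetical placeholders for your model call and your correctness check:

```python
def run_eval(examples, model_fn, grader_fn):
    """Minimal custom-eval loop. examples: (input, expected) pairs.
    model_fn and grader_fn stand in for your model API call and your
    correctness check (exact match, contains, LLM judge, ...)."""
    passed = sum(bool(grader_fn(model_fn(x), y)) for x, y in examples)
    return passed / len(examples)
```

Swap in a real API call for `model_fn` and a grader that fits your task; even 50–200 real examples scored this way will tell you more than any public leaderboard.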
The Benchmark Landscape Map
Choosing the right benchmark for your evaluation
By Capability
// Match benchmark to what you need
General knowledge  → MMLU-Pro, ARC
Reasoning          → GPQA, MATH, ARC-C
Coding             → SWE-bench, LiveCodeBench
Instruction follow → IFEval, MT-Bench
Safety             → TruthfulQA, BBQ
Multimodal         → MMMU, MathVista
Overall quality    → Chatbot Arena Elo
The Bottom Line
Benchmarks are necessary but not sufficient. They give you a starting point for model selection and a common language for comparison. But the only benchmark that truly matters is performance on your specific task with your specific data. Everything else is a proxy.
Next up: In Chapter 3, we’ll explore LLM-as-Judge — using AI to evaluate AI at scale, achieving 80–90% human agreement at a fraction of the cost. This is the technique that makes systematic evaluation practical.