Ch 2 — Benchmarks: The Scoreboard

MMLU, HumanEval, SWE-bench, GPQA — what they measure and why they saturate
MMLU — The Knowledge Test
Massive Multitask Language Understanding
What It Measures
MMLU tests broad academic knowledge across 57 subjects — from abstract algebra to world religions. Each question is multiple-choice (A/B/C/D) with a single correct answer. It was designed to measure how much a model “knows” across the full breadth of human knowledge.
The Numbers
14,042 questions across 57 tasks. Random baseline: 25% (4 choices). Human expert baseline: ~89.8%. Current frontier models cluster above 88%, making MMLU essentially saturated for differentiating top models.
Why It Matters (and Doesn’t)
MMLU was the first widely adopted general benchmark and remains a standard reporting metric. But its saturation means it can no longer differentiate frontier models. A model scoring 87% vs 89% on MMLU tells you almost nothing about which is better for your use case.
Saturation status: MMLU is effectively solved. Frontier models exceed human expert performance. The field has moved to MMLU-Pro (harder, 10-choice questions) and GPQA (graduate-level science) as replacements.
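Grading a multiple-choice benchmark like MMLU reduces to matching predicted answer letters against the key and reading the result against the 25% random baseline. A minimal sketch (the function name and data shapes are illustrative, not from any official harness):

```python
def mc_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy for multiple-choice answers ('A'..'D')."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/key length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# With 4 answer choices, random guessing converges to 0.25 —
# the baseline any MMLU score should be read against.
```

In practice, harnesses also have to extract the answer letter from free-form model output, which is its own source of scoring noise.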
HumanEval & Coding Benchmarks
Can the model write working code?
HumanEval
164 Python programming problems with function signatures and docstrings. The model generates the function body, which is tested against unit tests. Metric: pass@k (probability of at least one correct solution in k attempts). Most frontier models now score above 85%.
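The pass@k metric has a standard unbiased estimator, introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and compute the probability that a random subset of k samples contains at least one pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem,
    c of them passed the unit tests, k is the attempt budget.
    Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 200 samples of which 50 pass, pass@1 is 0.25; averaging this per-problem estimate over all 164 problems gives the reported score.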
Beyond HumanEval
HumanEval+: Same problems, 80x more test cases to catch edge cases
MBPP: 974 crowd-sourced Python problems (easier)
LiveCodeBench: Fresh problems from competitive programming sites, updated monthly to resist contamination
The Contamination Problem
HumanEval problems are now in most training datasets. LiveCodeBench showed models dropping 20–30% on problems released after their training cutoff — strong evidence that high HumanEval scores partly reflect memorization, not genuine coding ability.
Key insight: A model scoring 92% on HumanEval might score 60% on novel problems it hasn’t seen. Always test with fresh, domain-specific coding tasks rather than relying on public benchmark scores.
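One practical way to probe for contamination, in the spirit of LiveCodeBench, is to split your results by each problem's release date relative to the model's training cutoff and compare pass rates. A sketch (the data shape is hypothetical):

```python
from datetime import date

def contamination_gap(results, cutoff):
    """Split per-problem outcomes by release date relative to the model's
    training cutoff. results: iterable of (release_date, passed) pairs.
    A large before/after gap suggests the pre-cutoff score partly
    reflects memorization rather than capability."""
    before = [passed for d, passed in results if d <= cutoff]
    after = [passed for d, passed in results if d > cutoff]

    def rate(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(before), rate(after)
```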
SWE-bench — Real Software Engineering
Can the model fix actual GitHub issues?
What It Tests
SWE-bench presents real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.) and asks models to generate patches that pass the repository’s test suite. This tests end-to-end software engineering: reading code, understanding context, writing fixes, and passing tests.
Variants
SWE-bench Lite: 300 curated, well-specified issues (easier subset)
SWE-bench Verified: Human-validated subset with clear specifications
SWE-bench Pro: Private, harder subset — GPT-5 and Claude Opus 4.1 both score only ~23%
Current Scores (2025–2026)
// SWE-bench Verified (top scores)
Gemini 3 Pro      76.2%
Claude Sonnet 4   72.7%
GPT-5             ~69%

// SWE-bench Pro (private, harder)
GPT-5             23.1%
Claude Opus 4.1   23.1%
Key insight: The gap between SWE-bench Verified (~72%) and SWE-bench Pro (~23%) shows how much benchmark difficulty matters. Easy benchmarks create an illusion of capability that harder tests shatter.
GPQA & Reasoning Benchmarks
Graduate-level science and mathematical reasoning
GPQA (Graduate-Level Google-Proof Q&A)
448 expert-crafted questions in biology, physics, and chemistry, designed to be “Google-proof”: the answers can’t simply be looked up. PhD-level domain experts score only ~65%, and skilled non-experts score far lower even with unrestricted web access. Current leader: Gemini 3 Pro at 92.6%.
GSM8K & Math
GSM8K: 8,500 grade-school math word problems. Status: saturated (frontier models score 95%+). Replaced by MATH (competition-level problems) and AIME (American Invitational Mathematics Exam) for harder reasoning tests.
Why Reasoning Benchmarks Matter
Reasoning benchmarks test whether models can chain logical steps, not just recall facts. A model that scores well on GPQA demonstrates genuine multi-step reasoning ability — the kind needed for complex real-world tasks like research, analysis, and problem-solving.
The saturation pattern: GSM8K (2021) → saturated in 2 years. MMLU (2020) → saturated in 3 years. GPQA (2023) → approaching saturation. Each generation of benchmarks has a shorter useful lifespan.
Chatbot Arena & Human Preference
When humans are the benchmark
How It Works
LMSYS Chatbot Arena presents users with two anonymous model responses side by side. Users pick which they prefer, and the votes are aggregated into Elo-style ratings (like chess rankings). As of January 2026: nearly 5 million votes across 296 models.
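The pairwise-vote aggregation can be sketched with a classic chess-style Elo update. This is a simplification: Arena's published rankings use more robust statistical fits (Bradley–Terry-style models), but the intuition is the same.

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One pairwise vote: nudge both ratings toward the observed outcome.
    expected_a is A's win probability implied by the current ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    return (r_a + k * (score_a - expected_a),
            r_b + k * ((1.0 - score_a) - (1.0 - expected_a)))
```

Two equally rated models have an expected score of 0.5, so a single win moves the winner up by k/2 points; an upset against a much higher-rated opponent moves ratings further.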
Why It Works
Arena avoids the contamination problem because every query is unique. It measures what users actually care about — helpfulness, clarity, accuracy — rather than performance on artificial tasks. It’s become the most trusted ranking for comparing frontier models.
Limitations
Style bias: Users prefer longer, more verbose answers even when shorter ones are more accurate
Demographic skew: Voters are mostly tech-savvy English speakers
Task distribution: Heavily weighted toward general chat, not domain-specific tasks
No granularity: “Which is better?” doesn’t tell you why or at what
Key insight: Chatbot Arena is the best general ranking we have, but it measures “average user preference on average tasks.” Your specific use case may have very different requirements.
The Saturation Problem
When benchmarks stop being useful
The Saturation Timeline
// Benchmark lifecycle
GSM8K     (2021) → Saturated 2023   ~2 years
MMLU      (2020) → Saturated 2024   ~3 years
HumanEval (2021) → Saturated 2024   ~3 years
GPQA      (2023) → Approaching...   ~2 years
SWE-bench (2023) → Still useful     ongoing
Why Benchmarks Saturate
Three forces drive saturation:

1. Data contamination: Benchmark questions leak into training data
2. Optimization pressure: Labs specifically optimize for benchmark scores
3. Capability ceiling: Models genuinely improve and exceed the benchmark’s difficulty level
The treadmill: The field is stuck in a cycle: create benchmark → models saturate it → contamination renders it useless → create harder benchmark. And the window keeps shrinking: the most recent benchmarks last only 6–12 months before the cycle repeats.
How to Read Benchmark Scores
A practical guide to not being fooled
Red Flags
Cherry-picked benchmarks: Only showing scores where the model excels
No error bars: Single-run scores without confidence intervals
Old benchmarks only: Reporting MMLU/GSM8K but not harder alternatives
No comparison to baselines: Scores without context are meaningless
Benchmark-specific fine-tuning: Models optimized for the test, not the task
What to Do Instead
1. Build your own eval set from real production data (50–200 examples)
2. Test on multiple benchmarks across different capabilities
3. Use contamination-resistant benchmarks (LiveCodeBench, SWE-bench Pro)
4. Check Chatbot Arena for general quality ranking
5. Run your own A/B tests with real users
Rule of thumb: Public benchmark scores are weak predictors of performance on your specific task. A model that scores well on benchmarks might still fail in your domain, and one that scores poorly might excel there. Always test on your own data.
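Step 1 in the list above can start as a loop this small. A minimal sketch, where `model_fn` and `grader_fn` are hypothetical placeholders for your model call and your correctness check:

```python
def run_eval(examples, model_fn, grader_fn):
    """Minimal custom-eval loop. examples: (input, expected) pairs.
    model_fn and grader_fn stand in for your model API call and your
    correctness check (exact match, contains, LLM judge, ...)."""
    passed = sum(bool(grader_fn(model_fn(x), y)) for x, y in examples)
    return passed / len(examples)
```

Swap in a real API call for `model_fn` and a grader that fits your task; even 50–200 real examples scored this way will tell you more than any public leaderboard.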
The Benchmark Landscape Map
Choosing the right benchmark for your evaluation
By Capability
// Match benchmark to what you need
General knowledge  → MMLU-Pro, ARC
Reasoning          → GPQA, MATH, ARC-C
Coding             → SWE-bench, LiveCodeBench
Instruction follow → IFEval, MT-Bench
Safety             → TruthfulQA, BBQ
Multimodal         → MMMU, MathVista
Overall quality    → Chatbot Arena Elo
The Bottom Line
Benchmarks are necessary but not sufficient. They give you a starting point for model selection and a common language for comparison. But the only benchmark that truly matters is performance on your specific task with your specific data. Everything else is a proxy.
Next up: In Chapter 3, we’ll explore LLM-as-Judge — using AI to evaluate AI at scale, achieving 80–90% human agreement at a fraction of the cost. This is the technique that makes systematic evaluation practical.