Ch 4 — Benchmarks & Evaluation Results

Reading the scoreboard — what benchmarks measure, what ranges are good, and what to be skeptical of
Knowledge Benchmarks: MMLU & Friends
The “SAT test” for AI models — and why it’s becoming less useful
MMLU (Massive Multitask Language Understanding)
57 subjects, roughly 16,000 multiple-choice questions covering STEM, humanities, social sciences, and professional areas like law and medicine. Think of it as a college entrance exam for AI. Scoring guide: 25–40% = barely above random guessing (25% is chance with four choices). 60–80% = strong. 80–90% = approaching human expert level. 90%+ = exceeds average human performance. Leading models now score 91–94%, making MMLU “saturated” — it can no longer distinguish between frontier models.
MMLU-Pro & GPQA
MMLU-Pro: A harder successor with 10 answer choices instead of 4, requiring chain-of-thought reasoning. Scores are typically 20–30 points lower than MMLU, making it better for distinguishing current models.

GPQA (Graduate-Level Google-Proof Q&A): PhD-level questions in physics, chemistry, and biology, written so the answers can’t be found with a quick search. Even domain experts only score ~65%. Model scores in the 40–60% range indicate strong reasoning. This is the “hard” knowledge benchmark.
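The score floors quoted above come straight from the number of answer choices: with k options, blind guessing averages 1/k. A minimal sketch (the function name is ours, for illustration):

```python
# Random-guess baselines for multiple-choice benchmarks.
# With k answer choices, uniform guessing scores 1/k on average,
# so a meaningful score must clear that floor.

def chance_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices

# MMLU uses 4 choices; MMLU-Pro uses 10 -> a lower chance floor,
# which is part of why MMLU-Pro scores run 20-30 points lower.
print(f"MMLU floor:     {chance_baseline(4):.0%}")
print(f"MMLU-Pro floor: {chance_baseline(10):.0%}")
```

This is also why a 30% on MMLU-Pro means more than a 30% on MMLU: the former is 20 points above chance, the latter only 5.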
Key insight: When you see MMLU on a card, check the score range. Above 90% is table stakes for frontier models — it no longer differentiates. MMLU-Pro and GPQA scores are more informative for comparing current models.
Coding Benchmarks: HumanEval & SWE-bench
Can this model actually write code that works?
HumanEval
164 Python programming problems where the model writes a function and it’s tested against unit tests. Measures pass@1 (does the first attempt pass?). Scores: 30–50% = basic coding. 50–70% = competent. 70–90% = strong. 90%+ = frontier (and the benchmark is becoming saturated). Note: HumanEval tests isolated function writing, not real-world software engineering.
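The pass@1 headline comes from a simple statistic. A minimal sketch of the unbiased pass@k estimator popularized by the HumanEval paper (the sample counts below are invented for illustration):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    (drawn from n total, of which c passed the unit tests) is correct."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 3 passing -> estimated pass@1 of 0.3
print(round(pass_at_k(n=10, c=3, k=1), 2))
```

pass@1 is the strictest setting: the model’s first attempt must pass, which is closest to how you would actually use it.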
SWE-bench
SWE-bench Verified tests whether a model can solve real GitHub issues from popular open-source projects — understanding codebases, writing patches, and passing test suites. This is much harder than HumanEval. Scores of 30–50% are currently state-of-the-art. SWE-bench is the closest benchmark to “can this model actually do engineering work?”
Key insight: If you need a model for coding, HumanEval tells you about syntax and algorithm ability. SWE-bench tells you about real-world engineering. A model scoring 90% on HumanEval but 20% on SWE-bench can write functions but can’t navigate codebases.
Reasoning & Math: GSM8K, MATH, BBH
Can this model think, or just pattern-match?
Math Benchmarks
GSM8K: 8,500 grade-school math word problems. Tests basic multi-step arithmetic reasoning. Frontier models now score 95%+, so it’s becoming saturated. Still useful for testing smaller models.

MATH: Competition-level math problems (AMC, AIME, Olympiad). Much harder. Scores of 50–70% are strong; 70%+ is exceptional. This benchmark genuinely tests mathematical reasoning.
Reasoning Benchmarks
BBH (BIG-Bench Hard): 23 challenging tasks from the BIG-Bench suite where models previously lagged behind humans. Tests logical reasoning, causal understanding, and multi-step inference.

IFEval (Instruction Following Eval): Tests whether the model follows specific formatting instructions (“respond in JSON,” “use exactly 3 paragraphs”). Critical for production use where output format matters.

ARC (AI2 Reasoning Challenge): Science questions from grades 3–9. The “Challenge Set” requires complex reasoning beyond fact retrieval.
Key insight: Reasoning benchmarks are currently the best discriminators between models. Knowledge benchmarks are saturated above 90%, but reasoning scores still range widely (40–85%), making them more informative for model comparison.
Human Preference: Arena Elo & Chatbot Arena
The closest thing to “how good does it actually feel to use this model?”
What Chatbot Arena Is
Chatbot Arena (by LMSYS) is a platform where users chat with two anonymous models side-by-side and vote for the better response. Over millions of human votes, models receive an Elo rating (like chess). This is widely considered the most meaningful benchmark for real-world model quality because it reflects actual human preferences, not synthetic test questions.
How to Interpret It
Arena Elo is a relative ranking, not an absolute score. A model with Elo 1250 will beat a model with Elo 1150 roughly 64% of the time. The top 5 models are typically within 30–50 Elo points of each other. Arena scores are separated by category: overall, coding, math, hard prompts, creative writing — a model might rank #1 in coding but #10 in creative writing.
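That “roughly 64%” figure falls directly out of the standard Elo expected-score formula. A quick sketch:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 100-point gap (1250 vs 1150) -> ~64% win rate for the higher-rated model
print(f"{elo_win_prob(1250, 1150):.0%}")
```

Note how flat the curve is near the top: a 30–50 point gap between frontier models translates to only a 54–57% win rate, i.e. close to a coin flip.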
Key insight: If you had to look at only one evaluation metric, Arena Elo is the most informative. It captures everything benchmarks miss: tone, helpfulness, nuance, and the “feel” of the model. Look for Arena rankings if the model card mentions them.
Few-Shot vs. Zero-Shot Testing
Why the same model gets different scores depending on how it’s tested
The Difference
Zero-shot: The model gets only the question. “What is the capital of France?” No examples, no hints.

Few-shot (e.g., 5-shot): The model gets 5 example question-answer pairs before the actual question. This “primes” the model and typically produces higher scores because the model learns the expected format from the examples.

The difference can be 5–15 percentage points on the same benchmark.
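Concretely, the only difference between the two settings is what gets prepended to the question. A minimal sketch (the Q/A pairs and prompt format are invented; real harnesses use benchmark-specific templates, and a full 5-shot prompt would carry five pairs):

```python
# Few-shot vs. zero-shot: same question, different prompt.
examples = [
    ("What is the capital of Japan?", "Tokyo"),
    ("What is the capital of Italy?", "Rome"),
    # ...three more pairs would complete a 5-shot prompt
]

def build_prompt(question: str, shots: list[tuple[str, str]]) -> str:
    """Prepend example Q/A pairs (few-shot) before the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("What is the capital of France?", [])
few_shot = build_prompt("What is the capital of France?", examples)
```

The examples teach the model the expected answer format, which is where much of the 5–15 point gap comes from.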
Why This Matters for Comparison
If Model A reports “MMLU: 85% (5-shot)” and Model B reports “MMLU: 80% (0-shot),” you cannot directly compare them. Model B might actually be better — it just wasn’t given examples. Always check whether the card specifies the shot count. When comparing models on a leaderboard, make sure all models were tested under the same conditions (same shot count, same prompt template).
Key insight: Testing conditions matter as much as the score itself. A model card that doesn’t specify whether its scores are 0-shot or 5-shot is hiding important context. Always look for the shot count next to the benchmark name.
Open LLM Leaderboard v1 vs. v2
Why the benchmarks changed and what the new ones tell you
The v1 Benchmarks (Retired)
The original Open LLM Leaderboard used: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K. These benchmarks became saturated — top models scored 85–95% on most of them, making it impossible to distinguish between frontier models. Worse, some models showed signs of data contamination (training on benchmark test data), inflating scores artificially.
The v2 Benchmarks (Current)
The updated leaderboard uses harder alternatives: MMLU-Pro, IFEval, BBH, MATH, GPQA, MuSR. These span a wider score range (40–85%), are less contaminated, and test genuine reasoning rather than pattern matching. If a model card shows v1 benchmarks (ARC, HellaSwag), the scores may be outdated or inflated. Look for v2 benchmarks for more reliable comparisons.
Key insight: If you see a model card that prominently features HellaSwag and TruthfulQA scores but omits MMLU-Pro and GPQA, the card may be using outdated (and flattering) benchmarks. Always check which benchmark version is being reported.
Red Flags in Benchmark Reporting
How to spot cherry-picked, inflated, or misleading evaluation results
What to Watch For
Cherry-picking: The card only shows benchmarks where the model wins and omits standard ones where it underperforms. A card that shows 5 custom benchmarks but no MMLU, no HumanEval, and no Arena score is likely hiding something.

No comparison baseline: Showing a score without context. “85% on our internal eval” means nothing if you don’t know what other models score on the same eval.

Contamination signals: A small model matching or beating much larger models on specific benchmarks. A 7B model scoring 92% on MMLU should make you skeptical.
The Healthy Card
A trustworthy model card: shows standard benchmarks (at least MMLU/MMLU-Pro, one coding benchmark, one reasoning benchmark), includes comparison with the base model and competing models of similar size, specifies testing conditions (shot count, prompt template), and acknowledges weaknesses (“this model underperforms on multilingual tasks”).
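The checklist above can be mechanized. A hypothetical sketch of a card sanity check (the field names and the `card` dict shape are invented for illustration, not any real model-card schema):

```python
# Flag the red flags described above, given a model card as a dict.
STANDARD = {"MMLU", "MMLU-Pro", "GPQA", "HumanEval", "SWE-bench", "BBH"}

def card_red_flags(card: dict) -> list[str]:
    """Return a list of reporting red flags found in a model-card dict."""
    flags = []
    reported = set(card.get("benchmarks", {}))
    if not reported & STANDARD:
        flags.append("no standard benchmarks reported")
    if not card.get("baselines"):
        flags.append("no comparison baseline given")
    if card.get("shot_count") is None:
        flags.append("testing conditions (shot count) unspecified")
    return flags

# A card showing only a custom internal eval trips all three checks:
suspect = {"benchmarks": {"InternalEval-7": 85.0}}
print(card_red_flags(suspect))
```

A card that passes all three checks isn’t automatically trustworthy, but one that fails them deserves extra skepticism.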
Key insight: A model card that only shows benchmarks where it wins is like a resume that only lists strengths. Always check what is missing. The benchmarks a model card chooses not to report are as informative as the ones it does.
Your Benchmark Decoder
A one-line summary for every benchmark you’ll encounter
Quick Reference
MMLU: Knowledge across 57 subjects
MMLU-Pro: Harder MMLU, 10 choices, CoT
GPQA: PhD-level science questions
HumanEval: Python function generation
SWE-bench: Real GitHub issue resolution
GSM8K: Grade-school math word problems
MATH: Competition-level math
BBH: Hard reasoning tasks
IFEval: Instruction following accuracy
ARC: Science reasoning (grades 3–9)
HellaSwag: Common-sense continuation
TruthfulQA: Resistance to false claims
Arena Elo: Human preference rating
Your Reading Strategy
Step 1: Identify which benchmarks are reported. Are they standard (MMLU, HumanEval) or custom?
Step 2: Check the testing conditions (shot count, v1 vs v2 benchmarks).
Step 3: Compare against models of the same size class, not just the leaderboard top.
Step 4: Check for Arena Elo if available — it’s the most holistic measure.
Step 5: Note what’s missing. No coding benchmark? The model may not be good at code.
Key insight: Benchmarks are proxies, not proof. They tell you a model’s performance on specific tests under specific conditions. The only real benchmark is how well it works on your task. Use card benchmarks to narrow your shortlist, then test the finalists yourself.