Ch 7 — Benchmarks & Evaluation

GSM8K, MATH, ARC-AGI, GPQA, HumanEval, contamination, and measuring reasoning vs memorization
High Level: Goal → Data → Run → Score → Sanity → Learn
Why Reasoning Benchmarks Exist
From perplexity to problem-solving metrics
What We Want to Measure
Classic LM metrics (perplexity, BLEU) poorly capture multi-step correctness. Reasoning benchmarks provide tasks with reference answers (or verifiers) so you can compute accuracy, pass@k, or partial credit. Good benchmarks stress compositionality (combining skills), generalization (held-out templates), and robustness (paraphrases, distractors). Bad benchmarks are saturated (everyone scores ~100%), leaky (answers in common crawl), or gameable (format hacks). Your job as a practitioner is not just to read leaderboard numbers but to know what capability each benchmark isolates and what setup was used (CoT? tools? self-consistency? test-time compute?).
Evaluation Dimensions
- Correctness: exact match / verifier
- Process: step-level labels (PRMs)
- Cost: tokens, latency, $/task
- Reliability: variance across seeds
- Safety: refusal, injection, leaks
// Reasoning ≠ single scalar score
Key insight: A higher score is meaningless if you don’t know the conditions: prompting, tools, sampling temperature, and whether the benchmark is already in training data.
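The reliability dimension is easy to operationalize: run the same benchmark under several sampling seeds and report the spread, not a single number. A minimal sketch, where `model_fn` is a hypothetical stand-in for your own inference call:

```python
import random
import statistics

def evaluate(model_fn, problems, seed):
    """Score a model on (question, gold_answer) pairs with a fixed sampling seed."""
    rng = random.Random(seed)
    correct = sum(model_fn(q, rng) == a for q, a in problems)
    return correct / len(problems)

def report(model_fn, problems, seeds=(0, 1, 2, 3, 4)):
    """Run the identical eval under several seeds; report mean and spread."""
    scores = [evaluate(model_fn, problems, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Logging the mean with its standard deviation (alongside prompt, temperature, and tool setup) makes single-run leaderboard deltas much easier to judge.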
GSM8K & MATH
Grade-school and competition-style math word problems
The Standard Math Duo
GSM8K (Cobbe et al., 2021) is a widely used set of ~8K grade-school math word problems with natural language explanations and numeric final answers. It became the flagship benchmark for chain-of-thought improvements because it rewards multi-step arithmetic and planning. MATH (Hendrycks et al., 2021) is a harder corpus of competition-style problems spanning algebra, geometry, calculus, and more, often requiring longer derivations. Together they anchor math reasoning evaluation. Typical reporting includes: accuracy with CoT, pass@k with sampling, and comparisons with/without tools (Python). When models approach ceiling on GSM8K, researchers emphasize MATH and harder subsets to separate models.
Practical Notes
- GSM8K: ~8K problems, school-level
- Standard metric: final-answer accuracy
- CoT / PAL / tool use common
- MATH: competition difficulty
- Longer chains; harder verification
// Always log prompt + tool setup
Key insight: GSM8K is a sanity check; MATH is where test-time compute and advanced methods really separate systems.
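Final-answer accuracy on GSM8K is typically computed by extracting the last number in a completion and comparing it to the gold answer (GSM8K gold solutions mark theirs with `#### <answer>`). A minimal scorer sketch:

```python
import re

def extract_final_answer(text):
    """Take the last number in a completion as the final answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def final_answer_accuracy(completions, golds):
    """Fraction of completions whose extracted answer matches the gold numerically."""
    correct = 0
    for comp, gold in zip(completions, golds):
        pred = extract_final_answer(comp)
        if pred is not None and float(pred) == float(gold):
            correct += 1
    return correct / len(golds)
```

Real harnesses add normalization (units, fractions, LaTeX) — for MATH-style answers, string extraction alone is much less reliable and symbolic comparison is common.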
HumanEval & Code Reasoning Suites
From single-function synthesis to repositories
Coding as Executable Reasoning
HumanEval (Chen et al., 2021, OpenAI) popularized pass@k on Python docstring-to-code problems with hidden unit tests. It measures whether the model can produce code that executes correctly — a crisp verifier. Limitations: small size, risk of contamination, and narrow task format. The field has expanded to multi-file benchmarks (e.g., repository-level tasks) where models must navigate context, edit files, and pass broader test suites. For reasoning evaluation, coding benchmarks matter because they combine planning, API knowledge, and debugging loops (especially when paired with sandboxes).
Metrics
- pass@1: single sample
- pass@k: success if any of k samples pass
- Tests: hidden unit tests (primary)
// Report k, temperature, and filters
Key insight: Code benchmarks reward executable correctness — closer to PAL/tool reasoning than free-form explanation.
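The pass@k numbers reported with HumanEval generally use the unbiased estimator from Chen et al. (2021): generate n samples per problem, count the c that pass, and estimate the probability that a random draw of k samples contains at least one pass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one pass."""
    if n - c < k:  # fewer failures than draws: a hit is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Naively taking the best of k samples overestimates pass@k at small n; this estimator is why papers report n (often much larger than k) alongside temperature.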
ARC-AGI & Abstract Visual Reasoning
Chollet’s emphasis on generalization, not memorization tricks
What ARC Tests
The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet, presents small grid transformation puzzles: a few input/output examples define a latent rule; the model must predict the output for a new input. ARC is designed to stress few-shot generalization and core knowledge priors rather than memorized trivia. ARC-AGI is a related benchmark framing aimed at measuring progress toward more general reasoning; public reporting varies by model generation and evaluation setup. Treat ARC scores as indicators of progress on a particular style of puzzle generalization — not as a complete picture of “AGI.” It remains a valuable counterweight to benchmarks that large models can brute-force via pretraining scale.
Interpretation
- Strength: novel rule induction
- Weakness: narrow domain (grids)
- Caution: compare only like-for-like evaluation protocols
// Read benchmark cards carefully
Key insight: ARC-style tasks punish pattern-matching without abstraction. They’re useful precisely because many text benchmarks are easier to game with scale.
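ARC's evaluation contract can be sketched in a few lines: a solver proposes transformation rules, and a rule counts only if it reproduces every demonstration pair exactly before being applied to the test input. The two toy candidate rules below are illustrative assumptions, not part of ARC itself:

```python
def fits_all(rule, train_pairs):
    """Accept a rule only if it reproduces every demonstration exactly;
    ARC gives no partial credit for 'almost right' grids."""
    return all(rule(inp) == out for inp, out in train_pairs)

def solve(candidate_rules, train_pairs, test_input):
    """Return the first consistent rule's prediction, else None."""
    for rule in candidate_rules:
        if fits_all(rule, train_pairs):
            return rule(test_input)
    return None

# Toy candidate rules over grids encoded as tuples of tuples of ints.
def flip_horizontal(grid):
    return tuple(row[::-1] for row in grid)

def transpose(grid):
    return tuple(zip(*grid))
```

The hard part ARC actually tests — generating plausible candidate rules from priors — is exactly what this sketch leaves to the solver.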
GPQA & Expert-Level QA
Hard multiple-choice science questions
Graduate-Level Knowledge + Reasoning
GPQA (Rein et al., 2023) is a multiple-choice benchmark of difficult questions in biology, physics, and chemistry, written to be challenging for non-experts and validated by domain experts. GPQA Diamond is a filtered subset often used for cleaner evaluation. Strong scores indicate a combination of factual depth and discriminative reasoning (eliminating distractors). GPQA is frequently cited alongside reasoning models (e.g., OpenAI o-series reporting) because it remains difficult for models that only excel at surface pattern tasks. Like all MC benchmarks, watch for guessing baselines and calibration issues.
Usage Tips
- Report: accuracy / normalized score
- Controls: random baseline, human expert
- Variants: Diamond vs full
// Pair with open-ended probes
Key insight: GPQA measures expert difficulty in a constrained format. It complements math/code but does not replace safety or real-world task evals.
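The guessing-baseline caveat is easy to operationalize: rescale multiple-choice accuracy so the random baseline maps to 0 and a perfect score maps to 1 (GPQA uses four answer choices, so chance is 25%). A small sketch:

```python
def normalized_score(accuracy, n_choices=4):
    """Map raw accuracy so random guessing scores 0 and perfection scores 1.
    Negative values mean performing below chance."""
    baseline = 1.0 / n_choices
    return (accuracy - baseline) / (1.0 - baseline)
```

Reporting this alongside raw accuracy makes cross-benchmark comparisons (different numbers of choices) far less misleading.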
BIG-Bench Hard & Broad Suites
Multi-task stress tests and hard subsets
Breadth vs Depth
BIG-Bench is a large collaborative benchmark spanning diverse tasks. BIG-Bench Hard (BBH) focuses on tasks that remained difficult for early instruction-tuned models, including multi-step reasoning, symbolic manipulation, and weird compositional puzzles. Suites like BBH are useful for regression testing across capabilities after you change a model or a prompt. They also appear in tool-learning papers (e.g., PAL reported gains on selected BBH tasks). The trade-off is complexity: aggregation can hide task-specific regressions, so teams often drill into per-task dashboards.
How to Use Suites
- Aggregate: macro average with care
- Drill-down: per-task failures
- Compare: same tokenizer + parser
// Watch for formatting sensitivity
Key insight: Broad suites are great for coverage; narrow benchmarks are great for diagnosis. Use both in a tiered eval strategy.
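The "aggregation can hide regressions" point can be enforced in code: compare per-task scores between two runs and flag drops even when the macro average improves. A sketch with hypothetical task names:

```python
def macro_average(per_task):
    """Unweighted mean over tasks -- simple, but can mask regressions."""
    return sum(per_task.values()) / len(per_task)

def regressions(before, after, tol=0.01):
    """Tasks whose score dropped by more than `tol` between two runs."""
    return {task: (before[task], after[task])
            for task in before
            if task in after and after[task] < before[task] - tol}
```

Wiring `regressions` into CI as a hard gate is usually more useful than tracking the macro average alone.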
Contamination, Memorization, and Leakage
When high scores mislead
The Credibility Problem
If benchmark examples or solutions appear in pretraining data, models can succeed via memorization instead of reasoning. Mitigations include: n-gram overlap checks, training-time exclusion experiments, held-out rewrites of questions, and dynamic benchmarks (freshly generated instances). Another subtle issue is answer leakage through style: models learn dataset quirks. For chain-of-thought evaluation, also watch training on CoT traces from the same distribution. Contamination science is imperfect — treat it as risk management: assume some leakage unless you control data; prefer evaluations with private test sets for high-stakes decisions.
Signals of Trouble
- Suspicious: perfect BBH + weak probes
- Check: paraphrase robustness
- Check: counterfactual numbers
- Check: tool-off vs tool-on gaps
// Memorization often breaks under edits
Key insight: Strong in-distribution scores plus fragility to small edits is a classic sign the model is matching surface patterns, not executing a stable procedure.
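The n-gram overlap check mentioned above is simple to prototype: tokenize crudely on whitespace, build the set of n-grams seen in training text, and flag benchmark items that share any. 13-grams are a common choice in published contamination audits; the exact n and threshold here are adjustable assumptions:

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(benchmark_items, corpus_texts, n=13):
    """Benchmark items sharing at least one n-gram with the corpus.
    Crude by design: real audits also normalize punctuation and numbers,
    and treat a hit as a risk signal, not proof of memorization."""
    corpus = set()
    for text in corpus_texts:
        corpus |= ngrams(text, n)
    return [item for item in benchmark_items if ngrams(item, n) & corpus]
```

At web-corpus scale this set would be replaced by Bloom filters or suffix-array lookups, but the decision rule is the same.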
Building an Evaluation Stack
From playground experiments to production gates
A Practical Recipe
Layer evaluations: (1) Core reasoning — GSM8K/MATH + a small private set of your own word problems. (2) Code — HumanEval-style plus internal integration tests. (3) Knowledge-heavy reasoning — GPQA or domain exams. (4) Generalization probes — ARC-style or custom puzzles. (5) Operational metrics — latency, cost, tool failure rates, escalation to human. Automate runs in CI with pinned prompts; store traces (CoT, tool calls) for debugging. For product teams, add human rubrics on a sample of production tasks — benchmarks alone rarely capture UX.
Minimal Stack
- L1: fast regression (GSM8K subset)
- L2: hard math (MATH subset)
- L3: code (unit tests)
- L4: domain private tests
- L5: online monitoring + rubrics
// Next chapter: future of reasoning
Key insight: Treat evaluation as infrastructure: versioned prompts, recorded configs, and per-task dashboards. Reasoning quality is a moving target — your harness must move with it.
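The layered recipe above can be sketched as a gated runner: tiers execute in order, each run is recorded with a timestamp for later dashboards, and a tier falling below a gate stops the stack early. Tier names and the `run_task` callback are hypothetical placeholders for your own harness:

```python
import time

TIERS = [
    ("L1-fast-regression", ["gsm8k_subset"]),
    ("L2-hard-math", ["math_subset"]),
    ("L3-code", ["code_unit_tests"]),
]

def run_stack(run_task, tiers=TIERS, gate=0.8):
    """Run tiers in order; record a trace per tier and stop early
    when a tier's mean score falls below the gate."""
    trace = []
    for name, tasks in tiers:
        scores = {task: run_task(task) for task in tasks}
        mean = sum(scores.values()) / len(scores)
        trace.append({"tier": name, "scores": scores,
                      "mean": mean, "ts": time.time()})
        if mean < gate:
            break  # don't burn budget on harder tiers after a regression
    return trace
```

In practice the trace entries would also pin prompt versions, sampling configs, and tool settings, so any score can be reproduced later.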