Ch 7 — Benchmarks & Evaluation

GSM8K, MATH, ARC-AGI, GPQA, HumanEval, contamination, and measuring reasoning vs memorization
High Level: Goal → Data → Run → Score → Sanity → Learn
Why Reasoning Benchmarks Exist
From perplexity to problem-solving metrics
What We Want to Measure
Classic LM metrics (perplexity, BLEU) poorly capture multi-step correctness. Reasoning benchmarks provide tasks with reference answers (or verifiers) so you can compute accuracy, pass@k, or partial credit. Good benchmarks stress compositionality (combining skills), generalization (held-out templates), and robustness (paraphrases, distractors). Bad benchmarks are saturated (everyone scores ~100%), leaky (answers in common crawl), or gameable (format hacks). Your job as a practitioner is not just to read leaderboard numbers but to know what capability each benchmark isolates and what setup was used (CoT? tools? self-consistency? test-time compute?).
Evaluation Dimensions
- Correctness: exact match / verifier
- Process: step-level labels (PRMs)
- Cost: tokens, latency, $/task
- Reliability: variance across seeds
- Safety: refusal, injection, leaks
// Reasoning ≠ single scalar score
Key insight: A higher score is meaningless if you don’t know the conditions: prompting, tools, sampling temperature, and whether the benchmark is already in training data.
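The reliability dimension is easy to operationalize: run the same benchmark under several sampling seeds and report the spread, not a single number. A minimal sketch, where `model_fn` is a hypothetical stand-in for your own inference call:

```python
import random
import statistics

def evaluate(model_fn, problems, seed):
    """Score a model on (question, gold_answer) pairs with a fixed sampling seed."""
    rng = random.Random(seed)
    correct = sum(model_fn(q, rng) == a for q, a in problems)
    return correct / len(problems)

def report(model_fn, problems, seeds=(0, 1, 2, 3, 4)):
    """Run the identical eval under several seeds; report mean and spread."""
    scores = [evaluate(model_fn, problems, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Logging the mean with its standard deviation (alongside prompt, temperature, and tool setup) makes single-run leaderboard deltas much easier to judge.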
GSM8K & MATH
Grade-school and competition-style math word problems
The Standard Math Duo
GSM8K (Cobbe et al., 2021) is a widely used set of ~8K grade-school math word problems with natural language explanations and numeric final answers. It became the flagship benchmark for chain-of-thought improvements because it rewards multi-step arithmetic and planning. MATH (Hendrycks et al., 2021) is a harder corpus of competition-style problems spanning algebra, geometry, calculus, and more, often requiring longer derivations. Together they anchor math reasoning evaluation. Typical reporting includes: accuracy with CoT, pass@k with sampling, and comparisons with/without tools (Python). When models approach ceiling on GSM8K, researchers emphasize MATH and harder subsets to separate models.
Practical Notes
- GSM8K: ~8K problems, school-level
- Standard metric: final-answer accuracy
- CoT / PAL / tool use common
- MATH: competition difficulty
- Longer chains; harder verification
// Always log prompt + tool setup
Key insight: GSM8K is a sanity check; MATH is where test-time compute and advanced methods really separate systems.
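Final-answer accuracy on GSM8K is typically computed by extracting the last number in a completion and comparing it to the gold answer (GSM8K gold solutions mark theirs with `#### <answer>`). A minimal scorer sketch:

```python
import re

def extract_final_answer(text):
    """Take the last number in a completion as the final answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def final_answer_accuracy(completions, golds):
    """Fraction of completions whose extracted answer matches the gold numerically."""
    correct = 0
    for comp, gold in zip(completions, golds):
        pred = extract_final_answer(comp)
        if pred is not None and float(pred) == float(gold):
            correct += 1
    return correct / len(golds)
```

Real harnesses add normalization (units, fractions, LaTeX) — for MATH-style answers, string extraction alone is much less reliable and symbolic comparison is common.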
HumanEval & Code Reasoning Suites
From single-function synthesis to repositories
Coding as Executable Reasoning
HumanEval (Chen et al., 2021, OpenAI) popularized pass@k on Python docstring-to-code problems with hidden unit tests. It measures whether the model can produce code that executes correctly — a crisp verifier. Limitations: small size, risk of contamination, and narrow task format. The field has expanded to multi-file benchmarks (e.g., repository-level tasks) where models must navigate context, edit files, and pass broader test suites. For reasoning evaluation, coding benchmarks matter because they combine planning, API knowledge, and debugging loops (especially when paired with sandboxes).
Metrics
- pass@1: single sample
- pass@k: success if any of k samples pass
- Tests: hidden unit tests (primary)
// Report k, temperature, and filters
Key insight: Code benchmarks reward executable correctness — closer to PAL/tool reasoning than free-form explanation.
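The pass@k numbers reported with HumanEval generally use the unbiased estimator from Chen et al. (2021): generate n samples per problem, count the c that pass, and estimate the probability that a random draw of k samples contains at least one pass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one pass."""
    if n - c < k:  # fewer failures than draws: a hit is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Naively taking the best of k samples overestimates pass@k at small n; this estimator is why papers report n (often much larger than k) alongside temperature.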
ARC-AGI & Abstract Visual Reasoning
Chollet’s emphasis on generalization, not memorization tricks
What ARC Tests
The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet, presents small grid transformation puzzles: a few input/output examples define a latent rule; the model must predict the output for a new input. ARC is designed to stress few-shot generalization and core knowledge priors rather than memorized trivia. ARC-AGI is a related benchmark framing aimed at measuring progress toward more general reasoning; public reporting varies by model generation and evaluation setup. Treat ARC scores as indicators of progress on a particular style of puzzle generalization — not as a complete picture of “AGI.” It remains a valuable counterweight to benchmarks that large models can brute-force via pretraining scale.
Interpretation
- Strength: novel rule induction
- Weakness: narrow domain (grids)
- Caution: compare only like-for-like evaluation protocols
// Read benchmark cards carefully
Key insight: ARC-style tasks punish pattern-matching without abstraction. They’re useful precisely because many text benchmarks are easier to game with scale.
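ARC's evaluation contract can be sketched in a few lines: a solver proposes transformation rules, and a rule counts only if it reproduces every demonstration pair exactly before being applied to the test input. The two toy candidate rules below are illustrative assumptions, not part of ARC itself:

```python
def fits_all(rule, train_pairs):
    """Accept a rule only if it reproduces every demonstration exactly;
    ARC gives no partial credit for 'almost right' grids."""
    return all(rule(inp) == out for inp, out in train_pairs)

def solve(candidate_rules, train_pairs, test_input):
    """Return the first consistent rule's prediction, else None."""
    for rule in candidate_rules:
        if fits_all(rule, train_pairs):
            return rule(test_input)
    return None

# Toy candidate rules over grids encoded as tuples of tuples of ints.
def flip_horizontal(grid):
    return tuple(row[::-1] for row in grid)

def transpose(grid):
    return tuple(zip(*grid))
```

The hard part ARC actually tests — generating plausible candidate rules from priors — is exactly what this sketch leaves to the solver.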
GPQA & Expert-Level QA
Hard multiple-choice science questions
Graduate-Level Knowledge + Reasoning
GPQA (Rein et al., 2023) is a multiple-choice benchmark of difficult questions in biology, physics, and chemistry, written to be challenging for non-experts and validated by domain experts. GPQA Diamond is a filtered subset often used for cleaner evaluation. Strong scores indicate a combination of factual depth and discriminative reasoning (eliminating distractors). GPQA is frequently cited alongside reasoning models (e.g., OpenAI o-series reporting) because it remains difficult for models that only excel at surface pattern tasks. Like all MC benchmarks, watch for guessing baselines and calibration issues.
Usage Tips
- Report: accuracy / normalized score
- Controls: random baseline, human expert
- Variants: Diamond vs full
// Pair with open-ended probes
Key insight: GPQA measures expert difficulty in a constrained format. It complements math/code but does not replace safety or real-world task evals.
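The guessing-baseline caveat is easy to operationalize: rescale multiple-choice accuracy so the random baseline maps to 0 and a perfect score maps to 1 (GPQA uses four answer choices, so chance is 25%). A small sketch:

```python
def normalized_score(accuracy, n_choices=4):
    """Map raw accuracy so random guessing scores 0 and perfection scores 1.
    Negative values mean performing below chance."""
    baseline = 1.0 / n_choices
    return (accuracy - baseline) / (1.0 - baseline)
```

Reporting this alongside raw accuracy makes cross-benchmark comparisons (different numbers of choices) far less misleading.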
BIG-Bench Hard & Broad Suites
Multi-task stress tests and hard subsets
Breadth vs Depth
BIG-Bench is a large collaborative benchmark spanning diverse tasks. BIG-Bench Hard (BBH) focuses on tasks that remained difficult for early instruction-tuned models, including multi-step reasoning, symbolic manipulation, and weird compositional puzzles. Suites like BBH are useful for regression testing across capabilities after you change a model or a prompt. They also appear in tool-learning papers (e.g., PAL reported gains on selected BBH tasks). The trade-off is complexity: aggregation can hide task-specific regressions, so teams often drill into per-task dashboards.
How to Use Suites
- Aggregate: macro average with care
- Drill-down: per-task failures
- Compare: same tokenizer + parser
// Watch for formatting sensitivity
Key insight: Broad suites are great for coverage; narrow benchmarks are great for diagnosis. Use both in a tiered eval strategy.
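The "aggregation can hide regressions" point can be enforced in code: compare per-task scores between two runs and flag drops even when the macro average improves. A sketch with hypothetical task names:

```python
def macro_average(per_task):
    """Unweighted mean over tasks -- simple, but can mask regressions."""
    return sum(per_task.values()) / len(per_task)

def regressions(before, after, tol=0.01):
    """Tasks whose score dropped by more than `tol` between two runs."""
    return {task: (before[task], after[task])
            for task in before
            if task in after and after[task] < before[task] - tol}
```

Wiring `regressions` into CI as a hard gate is usually more useful than tracking the macro average alone.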
Contamination, Memorization, and Leakage
When high scores mislead
The Credibility Problem
If benchmark examples or solutions appear in pretraining data, models can succeed via memorization instead of reasoning. Mitigations include: n-gram overlap checks, training-time exclusion experiments, held-out rewrites of questions, and dynamic benchmarks (freshly generated instances). Another subtle issue is answer leakage through style: models learn dataset quirks. For chain-of-thought evaluation, also watch training on CoT traces from the same distribution. Contamination science is imperfect — treat it as risk management: assume some leakage unless you control data; prefer evaluations with private test sets for high-stakes decisions.
Signals of Trouble
- Suspicious: perfect BBH + weak probes
- Check: paraphrase robustness
- Check: counterfactual numbers
- Check: tool-off vs tool-on gaps
// Memorization often breaks under edits
Key insight: Strong in-distribution scores plus fragility to small edits is a classic sign the model is matching surface patterns, not executing a stable procedure.
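The n-gram overlap check mentioned above is simple to prototype: tokenize crudely on whitespace, build the set of n-grams seen in training text, and flag benchmark items that share any. 13-grams are a common choice in published contamination audits; the exact n and threshold here are adjustable assumptions:

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(benchmark_items, corpus_texts, n=13):
    """Benchmark items sharing at least one n-gram with the corpus.
    Crude by design: real audits also normalize punctuation and numbers,
    and treat a hit as a risk signal, not proof of memorization."""
    corpus = set()
    for text in corpus_texts:
        corpus |= ngrams(text, n)
    return [item for item in benchmark_items if ngrams(item, n) & corpus]
```

At web-corpus scale this set would be replaced by Bloom filters or suffix-array lookups, but the decision rule is the same.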
Building an Evaluation Stack
From playground experiments to production gates
A Practical Recipe
Layer evaluations: (1) Core reasoning — GSM8K/MATH + a small private set of your own word problems. (2) Code — HumanEval-style plus internal integration tests. (3) Knowledge-heavy reasoning — GPQA or domain exams. (4) Generalization probes — ARC-style or custom puzzles. (5) Operational metrics — latency, cost, tool failure rates, escalation to human. Automate runs in CI with pinned prompts; store traces (CoT, tool calls) for debugging. For product teams, add human rubrics on a sample of production tasks — benchmarks alone rarely capture UX.
Minimal Stack
- L1: fast regression (GSM8K subset)
- L2: hard math (MATH subset)
- L3: code (unit tests)
- L4: domain private tests
- L5: online monitoring + rubrics
// Next chapter: future of reasoning
Key insight: Treat evaluation as infrastructure: versioned prompts, recorded configs, and per-task dashboards. Reasoning quality is a moving target — your harness must move with it.
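The layered recipe above can be sketched as a gated runner: tiers execute in order, each run is recorded with a timestamp for later dashboards, and a tier falling below a gate stops the stack early. Tier names and the `run_task` callback are hypothetical placeholders for your own harness:

```python
import time

TIERS = [
    ("L1-fast-regression", ["gsm8k_subset"]),
    ("L2-hard-math", ["math_subset"]),
    ("L3-code", ["code_unit_tests"]),
]

def run_stack(run_task, tiers=TIERS, gate=0.8):
    """Run tiers in order; record a trace per tier and stop early
    when a tier's mean score falls below the gate."""
    trace = []
    for name, tasks in tiers:
        scores = {task: run_task(task) for task in tasks}
        mean = sum(scores.values()) / len(scores)
        trace.append({"tier": name, "scores": scores,
                      "mean": mean, "ts": time.time()})
        if mean < gate:
            break  # don't burn budget on harder tiers after a regression
    return trace
```

In practice the trace entries would also pin prompt versions, sampling configs, and tool settings, so any score can be reproduced later.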