What We Want to Measure
Classic LM metrics (perplexity, BLEU) capture multi-step correctness poorly. Reasoning benchmarks provide tasks with reference answers (or programmatic verifiers) so you can compute accuracy, pass@k, or partial credit. Good benchmarks stress compositionality (combining skills), generalization (held-out templates), and robustness (paraphrases, distractors). Bad benchmarks are saturated (everyone scores near 100%), leaky (answers present in Common Crawl), or gameable (format hacks). Your job as a practitioner is not just to read leaderboard numbers but to know which capability each benchmark isolates and which setup was used (CoT? tools? self-consistency? extra test-time compute?).
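The pass@k metric mentioned above is usually computed with the standard unbiased combinatorial estimator: draw n samples per task, count the c correct ones, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch (function name is my own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset of the n samples contains no correct one.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all tasks gives the benchmark-level pass@k; computing it directly from the subset counts avoids the high variance of naively resampling k of the n completions.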
Evaluation Dimensions
Correctness: exact match / verifier
Process: step-level labels (process reward models, PRMs)
Cost: tokens, latency, $/task
Reliability: variance across seeds
Safety: refusal, injection, leaks
// Reasoning ≠ single scalar score
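The reliability dimension above is cheap to quantify: rerun the same evaluation under several sampling seeds and report the spread, not just a point estimate. A minimal sketch, assuming you already have per-seed accuracies (the function and mapping names are illustrative):

```python
import statistics

def seed_report(acc_by_seed: dict[int, float]) -> tuple[float, float]:
    """Summarize accuracy across seeds as (mean, sample std dev).

    acc_by_seed: hypothetical mapping of random seed -> benchmark accuracy
    from otherwise-identical evaluation runs.
    """
    accs = list(acc_by_seed.values())
    return statistics.mean(accs), statistics.stdev(accs)
```

A two-point leaderboard gap is uninterpretable if the seed-to-seed standard deviation is larger than the gap itself, so reporting both numbers is the honest default.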
Key insight: A higher score is meaningless if you don’t know the conditions under which it was obtained: prompting style, tool access, sampling temperature, and whether the benchmark was already in the training data.
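One way to make those conditions non-optional is to store them next to every score. A minimal sketch of such a record; the schema and field names are assumptions, not a standard:

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalRecord:
    """Hypothetical per-run record: a score is only meaningful with its setup."""
    benchmark: str      # e.g. "GSM8K"
    score: float        # headline metric (accuracy, pass@k, ...)
    prompting: str      # e.g. "zero-shot" vs "CoT"
    tools: bool         # whether tool use was enabled
    temperature: float  # sampling temperature
    n_samples: int      # samples per task (self-consistency / pass@k)

record = EvalRecord("GSM8K", 0.81, "CoT", False, 0.7, 8)
row = asdict(record)  # flat dict, ready to log or serialize
```

Making the record frozen and logging it with every result means two runs can only be compared when their condition fields match.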