Knowledge & Reasoning
MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2021): ~14K multiple-choice questions across 57 subjects, from elementary to professional level. Evaluated 5-shot. The most widely reported benchmark. Limitations: the 4-choice format has a 25% random-guess floor and is easy to game, and training-data contamination is widespread.
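As a concrete illustration, here is a minimal sketch of the standard 5-shot MMLU prompt layout. The exemplar dicts and function names are hypothetical; real runs draw the five shots from the benchmark's dev split for the same subject.

```python
# Minimal sketch of the standard 5-shot MMLU prompt layout. The exemplar
# dicts are hypothetical placeholders; real runs draw the five shots from
# the dev split of the same subject.

CHOICE_LETTERS = "ABCD"  # MMLU-Pro widens this to "ABCDEFGHIJ"

def format_question(question, choices, answer=None):
    """Render one question; append the gold letter only for few-shot exemplars."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {CHOICE_LETTERS[answer]}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, test_question):
    """Standard MMLU header, five answered exemplars, then the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    exemplars = "\n\n".join(
        format_question(s["question"], s["choices"], s["answer"]) for s in shots[:5]
    )
    return header + exemplars + "\n\n" + format_question(
        test_question["question"], test_question["choices"]
    )
```

Harnesses typically score this by comparing the model's next-token probabilities for the choice letters after the final "Answer:".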
MMLU-Pro (Wang et al. 2024): Harder version with 10 answer choices instead of 4 and more reasoning-heavy questions. Expert-reviewed to reduce noisy items. Used in Open LLM Leaderboard v2.
ARC (AI2 Reasoning Challenge, Clark et al. 2018): Grade-school science questions. ARC-Easy and ARC-Challenge splits. Tests basic scientific reasoning.
TruthfulQA (Lin et al. 2022): 817 questions across 38 categories, crafted so that imitating common human misconceptions produces wrong answers. Measures whether the model answers truthfully rather than repeating popular falsehoods.
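TruthfulQA's multiple-choice variant is typically scored with the MC2 metric: the probability mass the model assigns to the set of true reference answers, normalized over true plus false references. A minimal sketch, assuming a `logprob(question, answer)` helper that returns the model's total log-probability of the answer conditioned on the question:

```python
import math

# Sketch of TruthfulQA's MC2 metric, assuming a logprob(question, answer)
# helper that returns the model's total log-probability of the answer
# conditioned on the question. MC2 is the normalized probability mass
# assigned to the true reference answers.

def mc2_score(question, true_answers, false_answers, logprob):
    p_true = [math.exp(logprob(question, a)) for a in true_answers]
    p_false = [math.exp(logprob(question, a)) for a in false_answers]
    return sum(p_true) / (sum(p_true) + sum(p_false))
```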
Reasoning & Math
GSM8K (Cobbe et al. 2021): 8.5K grade-school math word problems. Tests multi-step arithmetic reasoning. Chain-of-thought prompting dramatically improves scores.
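Scoring is usually exact match on the final number: gold solutions in the dataset end with a "#### <number>" marker, and the prediction is taken as the last number in the model's chain-of-thought completion. A sketch of that extraction step (the regexes are illustrative, not the reference implementation):

```python
import re

# Sketch of the usual GSM8K scoring step. Gold solutions end with
# "#### <answer>"; the prediction is the last number in the model's
# chain-of-thought completion. Regexes are illustrative.

GOLD_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")
NUM_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_gold(solution):
    return GOLD_RE.search(solution).group(1).replace(",", "")

def extract_pred(completion):
    nums = NUM_RE.findall(completion)
    return nums[-1].replace(",", "") if nums else None

def is_correct(completion, solution):
    pred = extract_pred(completion)
    return pred is not None and float(pred) == float(extract_gold(solution))
```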
MATH (Hendrycks et al. 2021): 12.5K competition-level math problems across 7 subjects, each rated difficulty level 1-5. The hardest split is what Open LLM Leaderboard v2 reports as MATH Lvl 5.
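MATH gold answers sit inside \boxed{...} in the reference solution, and because the contents can nest braces (e.g. \boxed{\frac{1}{2}}), a balanced-brace scan is safer than a plain regex. A sketch of that extraction, mirroring typical harness behavior rather than any official scorer:

```python
# Sketch of pulling the gold answer out of a MATH reference solution,
# where it is wrapped in \boxed{...}. A balanced-brace scan handles
# nested braces like \boxed{\frac{1}{2}}.

def extract_boxed(solution):
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(solution):
        c = solution[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces

# extract_boxed(r"So the answer is \boxed{\frac{1}{2}}.") -> "\frac{1}{2}"
```

Comparing the extracted string against a model's prediction still needs normalization (whitespace, equivalent forms of the same fraction), which is one place evaluation harnesses differ.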
BBH (BIG-Bench Hard, Suzgun et al. 2022): 23 challenging tasks selected from BIG-Bench because prior models scored below the average human rater on them. Covers multistep arithmetic, algorithmic reasoning, language understanding, and world knowledge.
HellaSwag (Zellers et al. 2019): Sentence-completion benchmark built with adversarial filtering. Tests commonsense reasoning about everyday situations. Hard when released, but frontier models now score above 95%.
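HellaSwag is scored without generation: the model ranks the four candidate endings by log-likelihood of the ending given the context, usually length-normalized (the acc_norm style) so longer endings are not penalized. A minimal sketch, assuming an `ending_logprob(context, ending)` helper:

```python
# Sketch of HellaSwag scoring, assuming an ending_logprob(context, ending)
# helper that returns the model's total log-probability of the ending
# tokens conditioned on the context. Dividing by length gives
# acc_norm-style length normalization.

def pick_ending(context, endings, ending_logprob):
    scores = [ending_logprob(context, e) / max(len(e), 1) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)
```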
Benchmark saturation is real. HellaSwag, ARC-Easy, and the original MMLU are now too easy to separate frontier models, which is why the Open LLM Leaderboard moved to v2 with harder benchmarks (MMLU-Pro, GPQA, BBH, MATH Lvl 5, MuSR, IFEval). Always check whether a benchmark still discriminates between the models you are comparing.