Ch 9 — Evaluation & Benchmarks

MMLU, HumanEval, MT-Bench, Chatbot Arena, Open LLM Leaderboard, custom evaluation, and human evaluation
Why Evaluation Matters
Training loss is not enough — you need to measure real-world capability
The Problem with Loss Curves
Training loss goes down — is the model better? Not necessarily. Low training loss can mean overfitting, memorization, or reward hacking. A model can have perfect loss on your training data while performing worse on real tasks.

The gap between loss and usefulness: Loss measures how well the model predicts the next token on your specific dataset. It does not measure reasoning ability, factual accuracy, instruction following, safety, or whether users actually prefer the outputs.
What Evaluation Actually Measures
Knowledge & reasoning: Can the model answer questions across domains? (MMLU, ARC, GPQA)

Code generation: Can it write working code? (HumanEval, MBPP)

Math: Can it solve problems step-by-step? (GSM8K, MATH)

Instruction following: Does it follow formatting and constraint instructions? (IFEval)

Human preference: Do humans actually prefer its outputs? (Chatbot Arena, MT-Bench)
The Evaluation Stack
Automated benchmarks: Standardized tests with known correct answers. Fast, reproducible, but can be gamed. Examples: MMLU, HumanEval, GSM8K.

LLM-as-judge: Use a strong model (GPT-4) to evaluate outputs. Scalable, correlates well with humans (~80% agreement). Examples: MT-Bench, AlpacaEval.

Human evaluation: Real humans rate outputs. Gold standard but expensive ($1–5 per comparison) and slow. Examples: Chatbot Arena, custom annotation.

Best practice: Use all three. Automated benchmarks for quick iteration, LLM-as-judge for development, human evaluation for final validation.
The #1 mistake in fine-tuning: Only looking at training loss. You must evaluate on held-out data using task-relevant benchmarks. A model that scores well on MMLU but poorly on your specific task is useless for your use case. Always include custom evaluation on your actual task.
Standard Benchmarks
The canonical tests for measuring LLM capabilities
Knowledge & Reasoning
MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2021): 57 subjects from elementary to professional level. 14K multiple-choice questions. 5-shot evaluation. The most widely reported benchmark. Limitation: 4-choice format is easy to game; data contamination is widespread.

MMLU-PRO (Wang et al. 2024): Harder version with 10 choices instead of 4, requiring more reasoning. Expert-reviewed to reduce noise. Used in Open LLM Leaderboard v2.

ARC (AI2 Reasoning Challenge, Clark et al. 2018): Grade-school science questions. ARC-Easy and ARC-Challenge splits. Tests basic scientific reasoning.

TruthfulQA (Lin et al. 2022): 817 questions designed to elicit common misconceptions. Measures whether the model gives truthful answers rather than popular but wrong ones.
Reasoning & Math
GSM8K (Cobbe et al. 2021): 8.5K grade-school math word problems. Tests multi-step arithmetic reasoning. Chain-of-thought prompting dramatically improves scores.

MATH (Hendrycks et al. 2021): 12.5K competition-level math problems across 7 subjects. Levels 1-5 difficulty. MATH Lvl 5 (hardest) is used in Open LLM Leaderboard v2.

BBH (BIG-Bench Hard, Suzgun et al. 2022): 23 challenging tasks from BIG-Bench where models previously failed. Covers multistep arithmetic, algorithmic reasoning, language understanding, and world knowledge.

HellaSwag (Zellers et al. 2019): Sentence completion benchmark. Tests commonsense reasoning. Was hard when released but most modern models score >95%.
Benchmark saturation is real. HellaSwag, ARC-Easy, and original MMLU are now too easy for frontier models. This is why the Open LLM Leaderboard moved to v2 with harder benchmarks (MMLU-PRO, GPQA, BBH, MATH Lvl 5, MuSR, IFEval). Always check if a benchmark still discriminates between models.
Code & Instruction-Following Benchmarks
Measuring functional correctness and constraint adherence
Code Generation
HumanEval (Chen et al. 2021, OpenAI): 164 hand-crafted Python problems. Each has a function signature, docstring, and unit tests. The model generates code, which is executed against the tests. Measures functional correctness, not lexical similarity.

Pass@k metric: Generate k samples per problem, report the fraction where at least one passes all tests. Pass@1 = single-shot accuracy. Pass@10 = probability with 10 attempts. Original Codex: 28.8% pass@1, 70.2% pass@100.
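
Estimating pass@k from a single batch of k samples is high-variance, so the Codex paper computes an unbiased estimate from n ≥ k samples per problem. A minimal sketch of that estimator:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021).

    n: total samples generated for a problem
    c: number of samples that passed all unit tests
    k: attempt budget
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i
    return 1.0 - result

# Example: 200 samples per problem, 57 pass -> pass@1 is the raw pass rate
print(round(pass_at_k(200, 57, 1), 3))   # 0.285
```

Averaging this quantity over all 164 problems gives the reported pass@k score.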

MBPP (Austin et al. 2021, Google): 974 crowd-sourced Python problems. Simpler than HumanEval but larger. Also uses execution-based evaluation.

HumanEval+ (Liu et al. 2023): Extended version with 80x more test cases per problem. Catches models that pass original tests by coincidence.
Instruction Following
IFEval (Zhou et al. 2023): Tests whether models follow specific formatting instructions. Examples: “Write exactly 3 paragraphs”, “Include the word ‘therefore’ at least twice”, “Respond in JSON format”. Uses strict and loose metrics. Used in Open LLM Leaderboard v2.
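
Constraints like these are verifiable with plain code, which is what makes IFEval cheap and reproducible. A toy sketch of three checkers (simplified illustrations, not the actual IFEval implementation):

```python
import json
import re

def check_exact_paragraphs(text: str, n: int) -> bool:
    """Paragraphs delimited by blank lines (a simplifying assumption)."""
    return len([p for p in text.split("\n\n") if p.strip()]) == n

def check_min_word_count(text: str, word: str, at_least: int) -> bool:
    """Count whole-word occurrences, ignoring case and punctuation."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)) >= at_least

def check_json(text: str) -> bool:
    """Does the whole response parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

response = "First point.\n\nSecond point, therefore.\n\nThird point, therefore."
print(check_exact_paragraphs(response, 3),
      check_min_word_count(response, "therefore", 2),
      check_json('{"verdict": "pass"}'))   # True True True
```

The strict metric requires checks like these to pass on the raw response; the loose metric retries after normalizations such as stripping markdown.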

GPQA (Rein et al. 2023): Graduate-level questions written by PhD experts in biology, physics, and chemistry. “Google-proof” — even domain experts with internet access struggle. Restricted access to prevent data contamination. Used in Open LLM Leaderboard v2.
At a glance:
- HumanEval: 164 Python problems, execution-based, Pass@k metric (Chen et al. 2021)
- MBPP: 974 Python problems, execution-based, simpler but larger (Austin et al. 2021)
- IFEval: instruction following, format constraints, strict/loose metrics (Zhou et al. 2023)
- GPQA: PhD-level science, Google-proof, restricted access (Rein et al. 2023)
LLM-as-Judge Evaluation
Using strong models to evaluate weaker ones
MT-Bench
MT-Bench (Zheng et al. 2023, LMSYS): 80 multi-turn questions across 8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities. GPT-4 scores each response on a 1–10 scale with detailed explanations.

Key innovation: Multi-turn evaluation. Turn 1 asks a question, Turn 2 follows up. This tests the model’s ability to maintain context and handle follow-up instructions — closer to real conversations than single-turn benchmarks.

Judge agreement: GPT-4 as judge achieves >80% agreement with human preferences, matching inter-human agreement rates.
AlpacaEval 2
AlpacaEval 2 (Dubois et al. 2024, Stanford): 805 instructions from diverse sources. Compares model outputs against a reference model (GPT-4 Turbo). Reports length-controlled (LC) win rate to penalize models that win by being verbose rather than better.

Why LC matters: Naive win rate rewards longer outputs (judges prefer verbose answers). LC win rate regresses out the length effect, giving a fairer comparison.
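
The idea can be sketched with a toy logistic model: regress wins on the length difference, then read off the predicted win rate at zero length difference. (AlpacaEval's real LC computation uses a more elaborate GLM; this is only an illustration.)

```python
import math

def lc_win_rate(results, steps=5000, lr=0.5):
    """Fit p(win) = sigmoid(b0 + b1 * len_diff) by gradient descent.
    results: list of (win, len_diff) pairs, win in {0, 1}.
    The length-controlled win rate is sigmoid(b0): the predicted
    win rate when both answers have equal length."""
    b0 = b1 = 0.0
    n = len(results)
    for _ in range(steps):
        g0 = g1 = 0.0
        for win, d in results:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * d)))
            g0 += p - win
            g1 += (p - win) * d
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return 1.0 / (1.0 + math.exp(-b0))

# Toy data: the model is usually longer, and wins mostly when it is
data = ([(1, 1.0)] * 48 + [(0, 1.0)] * 12       # longer: wins 48/60
        + [(1, -1.0)] * 12 + [(0, -1.0)] * 28)  # shorter: wins 12/40
naive = sum(w for w, _ in data) / len(data)
print(round(naive, 3), round(lc_win_rate(data), 3))   # 0.6 0.567
```

The naive win rate (0.60) overstates quality because much of it is carried by length; controlling for length drops the estimate.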
Arena-Hard
Arena-Hard (Li et al. 2024, LMSYS): 500 challenging user queries sourced from real Chatbot Arena conversations. Uses GPT-4 Turbo as judge. Designed to have high agreement with Chatbot Arena rankings (>89% separability). Faster and cheaper than running a full arena.
Strengths & Limitations
Strengths: Scalable, reproducible, evaluates open-ended generation (not just multiple choice), correlates well with human preferences.

Limitations: Judge bias (GPT-4 favors GPT-4-style outputs), position bias (first response preferred), verbosity bias (longer = better), self-enhancement bias (models rate their own outputs higher). Mitigations: swap positions, use LC metrics, use multiple judges.
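
The position-swap mitigation takes only a few lines: judge each pair twice with positions reversed and count only consistent verdicts. Here `judge` is a hypothetical callable standing in for an actual API call:

```python
def debiased_compare(judge, question, answer_a, answer_b):
    """Run the judge twice with positions swapped; only consistent
    verdicts count. `judge(question, first, second)` is a hypothetical
    callable (e.g. wrapping an API request) returning "A" or "B"."""
    first = judge(question, answer_a, answer_b)
    swapped = judge(question, answer_b, answer_a)   # positions reversed
    swapped = {"A": "B", "B": "A"}[swapped]         # map back to original labels
    return first if first == swapped else "tie"    # disagreement -> tie

# A pathological judge that always prefers whatever sits in position one:
position_biased = lambda q, first, second: "A"
print(debiased_compare(position_biased, "q", "x", "y"))   # tie
```

A purely position-biased judge produces nothing but ties, so its bias cannot inflate either model's win rate.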
LLM-as-judge is the practical sweet spot. It is 10-100x cheaper than human evaluation, 100x faster, and correlates at >80% with human preferences. Use MT-Bench for multi-turn chat quality, AlpacaEval 2 LC for instruction following, and Arena-Hard for challenging real-world queries.
Chatbot Arena & Elo Ratings
The gold standard for human preference evaluation
How Chatbot Arena Works
LMSYS Chatbot Arena (Zheng et al. 2023): An open platform where users chat with two anonymous models side-by-side and vote for the better response. The models are revealed only after voting. Over 800,000 votes collected across 90+ models since May 2023.

Blind A/B testing: Users don’t know which model they’re talking to, eliminating brand bias. This is the closest thing to a controlled experiment for LLM quality.

Elo rating system: Borrowed from chess. Each model starts with a baseline rating. Wins against strong opponents increase rating more than wins against weak ones. The rating converges to a stable ranking over thousands of votes.
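
A minimal sketch of the classic Elo update (Arena's published rankings are now computed with a Bradley-Terry fit over all votes, but the online update conveys the intuition):

```python
def elo_update(r_a, r_b, winner, k=32):
    """One online Elo update after a single A-vs-B vote.
    winner: "A", "B", or "tie"; k controls the update size."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta   # zero-sum: B loses what A gains

# An upset: the 1400-rated model beats the 1500-rated one,
# so it gains more than k/2 points
new_a, new_b = elo_update(1400, 1500, "A")
print(round(new_a, 1), round(new_b, 1))   # 1420.5 1479.5
```

Beating an underdog at 1400 from 1500 would have moved the winner far less, which is exactly the property that makes the ranking converge.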
Why Arena Matters
Real user queries: Unlike curated benchmarks, Arena tests models on whatever users actually ask — from creative writing to debugging code to explaining physics.

Domain-specific rankings: Arena provides separate Elo scores for coding, math, hard prompts, and general conversation. A model can rank #1 in coding but #5 overall.

The 1500 Elo barrier: As of early 2026, only a handful of frontier models exceed 1500 Elo, driven by test-time compute (“thinking”) approaches that spend more time reasoning before responding.
For fine-tuning evaluation: You probably won’t submit your fine-tuned model to Chatbot Arena. But Arena rankings tell you which models are the strongest judges for LLM-as-judge evaluation. Use the top-ranked model as your judge. Arena also sets the ceiling — if your fine-tuned 8B model approaches the Arena score of a 70B model on your specific task, that’s a major win.
Open LLM Leaderboard
HuggingFace’s standardized evaluation for open-source models
Leaderboard v1 (Retired)
Original benchmarks (2023–2024):
- ARC (25-shot, science reasoning)
- HellaSwag (10-shot, commonsense)
- MMLU (5-shot, knowledge)
- TruthfulQA (0-shot, truthfulness)
- Winogrande (5-shot, coreference)
- GSM8K (5-shot, math)

Why it was retired: Benchmarks became saturated. Top models scored >95% on HellaSwag and ARC. Data contamination was rampant — models were trained on benchmark data. The leaderboard no longer discriminated between good and great models.
Leaderboard v2 (Current)
New benchmarks (launched mid-2024):
- MMLU-PRO: 10 choices, reasoning-heavy
- GPQA: PhD-level science, Google-proof
- BBH: 23 hard tasks from BIG-Bench
- MATH Lvl 5: Competition math, hardest level
- MuSR: ~1000-word multi-step reasoning
- IFEval: Instruction-following compliance

Key changes: All benchmarks are harder. Scores dropped significantly (average went from ~70% to ~30%). This restored the leaderboard’s ability to differentiate models. All evaluations run through EleutherAI’s lm-evaluation-harness for reproducibility.
For your fine-tuned model: Run the same benchmarks locally using lm-evaluation-harness. Compare your fine-tuned model against the base model on the v2 suite. If your fine-tuned model drops significantly on general benchmarks, you may have catastrophic forgetting. If it improves on your target task without dropping elsewhere, you have a successful fine-tune.
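
A sketch of running the v2 benchmarks locally with lm-evaluation-harness (the model path is a placeholder; task names vary by harness version, so confirm them with `lm_eval --tasks list`):

```shell
pip install lm-eval

# Evaluate the fine-tuned model on a subset of the leaderboard v2 tasks
lm_eval --model hf \
  --model_args pretrained=./my-finetuned-model \
  --tasks leaderboard_ifeval,leaderboard_bbh,leaderboard_gpqa \
  --batch_size 8 \
  --output_path results/finetuned
```

Running the identical command with the base model's path gives the side-by-side comparison described above.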
Custom & Human Evaluation
Building evaluation that matches your actual use case
Custom Evaluation Design
Step 1: Define success criteria. What does “good” mean for your task? Factual accuracy? Correct format? Appropriate tone? Specific domain knowledge?

Step 2: Build a test set. 100–500 examples that represent your real use case. Include edge cases. Never use training data for evaluation. Have domain experts create or validate the test set.

Step 3: Choose metrics. For classification: accuracy, F1, precision/recall. For generation: LLM-as-judge scores, ROUGE/BLEU (if reference answers exist), exact match (for structured output). For preference: win rate vs baseline model.

Step 4: Automate. Run evaluation after every training run. Compare against the base model and your best previous fine-tune. Track metrics over time in WandB.
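
The generation-side metrics in Step 3 are a few lines each. A minimal sketch of exact match (for structured output) and win rate against a baseline, with ties counted as half a win (one common convention, not the only one):

```python
def exact_match(predictions, references):
    """Fraction of predictions matching the reference after light
    normalization (lowercase, collapsed whitespace)."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

def win_rate(judgments):
    """judgments: per-example verdicts "finetuned", "base", or "tie"
    from a judge comparing the two models' outputs."""
    score = {"finetuned": 1.0, "tie": 0.5, "base": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)

print(exact_match(["Paris", "  berlin "], ["Paris", "Berlin"]))   # 1.0
print(win_rate(["finetuned", "finetuned", "tie", "base"]))        # 0.625
```

Wiring these into the post-training pipeline makes the Step 4 automation a single script that logs both numbers to WandB.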
Human Evaluation
When to use: Final validation before deployment. When automated metrics don’t capture what matters (tone, creativity, safety). When you need stakeholder buy-in.

Design principles:
- Blind comparison (A/B testing, evaluator doesn’t know which model is which)
- Clear rubric (define exactly what “better” means)
- Multiple evaluators (at least 3 per example)
- Inter-annotator agreement (Cohen’s kappa > 0.6 is acceptable)
- Sufficient sample size (100+ comparisons for statistical significance)
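
Cohen's kappa for two annotators is straightforward to compute (the multi-annotator case typically uses Fleiss' kappa instead). A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - chance) / (1.0 - chance)

a = ["win", "win", "lose", "win", "lose", "lose", "win", "win"]
b = ["win", "win", "lose", "lose", "lose", "win", "win", "win"]
print(round(cohens_kappa(a, b), 3))   # 0.467 -- below the 0.6 bar
```

A score this low suggests the rubric needs tightening before the votes are trusted for a final decision.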

Cost: $1–5 per comparison. A 200-example evaluation with 3 annotators costs $600–3,000. Use LLM-as-judge for iteration, human evaluation for final decisions.
The complete evaluation recipe: (1) Run lm-evaluation-harness for general benchmarks — check for catastrophic forgetting. (2) Run your custom test set with LLM-as-judge — measure task-specific quality. (3) Do a small human evaluation (50–100 examples) — validate that LLM-judge scores match human preferences. (4) Deploy with monitoring — track real-world performance and user feedback.