By Capability
// Match benchmark to what you need
General knowledge → MMLU-Pro, ARC
Reasoning → GPQA, MATH, ARC-C
Coding → SWE-bench, LiveCodeBench
Instruction following → IFEval, MT-Bench
Safety → TruthfulQA, BBQ
Multimodal → MMMU, MathVista
Overall quality → Chatbot Arena Elo
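To make the mapping easy to reuse, here is a minimal sketch of it as a lookup table. The `BENCHMARKS_BY_CAPABILITY` dict and the `suggest_benchmarks` helper are illustrative names I'm introducing here, not part of any eval library.

```python
# Capability-to-benchmark cheat sheet from the list above, as a lookup table.
BENCHMARKS_BY_CAPABILITY = {
    "general_knowledge":     ["MMLU-Pro", "ARC"],
    "reasoning":             ["GPQA", "MATH", "ARC-C"],
    "coding":                ["SWE-bench", "LiveCodeBench"],
    "instruction_following": ["IFEval", "MT-Bench"],
    "safety":                ["TruthfulQA", "BBQ"],
    "multimodal":            ["MMMU", "MathVista"],
    "overall_quality":       ["Chatbot Arena Elo"],
}

def suggest_benchmarks(capabilities: list[str]) -> list[str]:
    """Return a deduplicated list of benchmarks covering the requested capabilities."""
    picked: list[str] = []
    for cap in capabilities:
        for bench in BENCHMARKS_BY_CAPABILITY.get(cap, []):
            if bench not in picked:
                picked.append(bench)
    return picked

# Example: a coding assistant that also needs to follow instructions closely.
print(suggest_benchmarks(["coding", "instruction_following"]))
# ['SWE-bench', 'LiveCodeBench', 'IFEval', 'MT-Bench']
```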
The Bottom Line
Benchmarks are necessary but not sufficient. They give you a starting point for model selection and a common language for comparison. But the only benchmark that truly matters is performance on your specific task with your specific data. Everything else is a proxy.
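As a concrete starting point, here is a minimal sketch of that task-specific evaluation: run your model over your own labeled examples and score with a metric you trust. The JSONL schema (`prompt`/`expected`), the `call_model` callable, and the exact-match scoring are all placeholder assumptions standing in for whatever client and metric fit your task.

```python
# Minimal task-specific eval harness: score a model on your own data.
import json
from typing import Callable

def evaluate(dataset_path: str, call_model: Callable[[str], str]) -> float:
    """Return the fraction of examples where the model output matches the expected answer."""
    correct = 0
    total = 0
    with open(dataset_path) as f:
        for line in f:
            example = json.loads(line)  # {"prompt": ..., "expected": ...}
            output = call_model(example["prompt"])
            # Swap this exact-match check for whatever scoring fits your task
            # (regex, numeric tolerance, a rubric, an LLM judge, ...).
            if output.strip().lower() == example["expected"].strip().lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0

# Usage: accuracy = evaluate("my_task_eval.jsonl", call_model=my_client.complete)
```

A hundred examples drawn from your real traffic, scored this way, will tell you more about a model than any leaderboard position.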
Next up: In Chapter 3, we'll explore LLM-as-Judge, using AI to evaluate AI at scale with 80–90% agreement with human raters at a fraction of the cost. This is the technique that makes systematic evaluation practical.