Where Reasoning Models Shine
Reasoning models show dramatic improvements on tasks requiring multi-step reasoning:

Mathematics: On AIME 2024, o1 scored 83.3% versus GPT-4o's 13.4%. On the MATH benchmark, o1 achieved 94.8%. These are competition-level problems requiring multi-step solutions.

Science: On GPQA Diamond (PhD-level questions), o1 scored 77.3%, surpassing human PhD experts (69.7%). The model can reason through complex physics and chemistry problems.

Coding: On Codeforces, o1 reached the 89th percentile. On SWE-bench Verified, o3-mini solved 49.3% of real-world GitHub issues.

Novel reasoning: On ARC-AGI, a benchmark designed to test genuine reasoning on novel tasks, o3 scored 87.5%.

Where they DON'T help much: simple factual questions, creative writing, summarization, translation. These tasks don't benefit from extended reasoning. Using o1 for "What year was the Eiffel Tower built?" wastes compute.
Benchmark Comparison
// Reasoning model benchmarks

Mathematics:
  AIME 2024:
    GPT-4o: 13.4%
    o1:     83.3%  (+69.9 pts)
  MATH:
    GPT-4o: 76.6%
    o1:     94.8%  (+18.2 pts)

Science:
  GPQA Diamond (PhD-level):
    Human PhD: 69.7%
    GPT-4o:    53.6%
    o1:        77.3%  // surpasses human experts

Coding:
  Codeforces:
    GPT-4o: 11th percentile
    o1:     89th percentile

Novel Reasoning:
  ARC-AGI:
    GPT-4o: 5%
    o3:     87.5%

Where NOT to use:
  Simple facts, creative writing,
  summarization, translation
  // No benefit from extra thinking
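The gains above are absolute percentage-point differences, not relative improvements. A quick sketch that recomputes them from the table (scores hardcoded from the figures quoted in this section):

```python
# Absolute point gains of reasoning models over GPT-4o,
# using the benchmark scores quoted in the table above.
scores = {
    "AIME 2024":    {"GPT-4o": 13.4, "reasoning": 83.3},  # o1
    "MATH":         {"GPT-4o": 76.6, "reasoning": 94.8},  # o1
    "GPQA Diamond": {"GPT-4o": 53.6, "reasoning": 77.3},  # o1
    "ARC-AGI":      {"GPT-4o": 5.0,  "reasoning": 87.5},  # o3
}

for bench, s in scores.items():
    gain = round(s["reasoning"] - s["GPT-4o"], 1)
    print(f"{bench}: +{gain} pts")  # e.g. "AIME 2024: +69.9 pts"
```

Note that even the smallest jump here (MATH, +18.2 points) is large by benchmark standards; the AIME and ARC-AGI gaps are dramatic.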
Key insight: Reasoning models are not universally better. They excel specifically on tasks requiring multi-step reasoning: math, science, coding, logic. For everything else, standard models are faster, cheaper, and equally good. Choose the right model for the task.
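The "choose the right model" advice can be sketched as a simple task-based router. This is a hypothetical illustration, not a real API: the task categories, the `pick_model` helper, and the model names are stand-ins for whatever your application actually uses.

```python
# Hypothetical task-based model router: send multi-step reasoning
# tasks to a reasoning model, everything else to a standard model.
# Categories and model names are illustrative, not a real API.

REASONING_TASKS = {"math", "science", "coding", "logic"}

def pick_model(task_type: str) -> str:
    """Return a model name based on the kind of task."""
    if task_type in REASONING_TASKS:
        return "o1"      # slower and pricier, but far stronger here
    return "gpt-4o"      # faster, cheaper, and equally good

print(pick_model("math"))         # o1
print(pick_model("translation"))  # gpt-4o
```

In practice the routing signal might come from a classifier or from explicit user intent, but the principle is the same: reserve the expensive reasoning budget for tasks that actually benefit from it.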