Ch 4 — Test-Time Compute Scaling

OpenAI o1/o3, thinking tokens, reinforcement learning for reasoning, and the new scaling paradigm
High Level
Train → Think → Tokens → Verify → Scale → Result
The Two Scaling Paradigms
Train-time vs test-time compute
The Paradigm Shift
Since GPT-3, AI progress has been driven by train-time scaling: bigger models, more data, more training compute. This is captured by the Chinchilla scaling laws (Hoffmann et al., 2022): performance improves predictably with model size and training tokens. But train-time scaling is hitting diminishing returns. Training GPT-4 cost an estimated $100M+. Each generation requires roughly 10x more compute for modest gains.

Enter test-time compute scaling: instead of making the model bigger, let it think longer on each problem. A smaller model that spends 10 minutes reasoning can outperform a larger model that answers instantly. This is the insight behind OpenAI’s o1 (September 2024): the model generates extended internal reasoning chains — sometimes thousands of tokens — before producing a final answer. More thinking tokens = better answers. Performance scales with inference compute, not just model size.
Two Scaling Laws
// Train-time scaling (2020-2023)
Performance ∝ f(model_size, data, compute)
GPT-3: 175B params, $4.6M
GPT-4: ~1.8T params, $100M+
Each gen: ~10x compute, ~1.5x gain
// Diminishing returns

// Test-time scaling (2024+)
Performance ∝ f(thinking_tokens)
Same model, more inference compute
o1 on AIME: 83.3% (vs GPT-4o: 13.4%)
o1 on GPQA: 77.3% (PhD-level)
// Spend compute at inference

The Key Insight:
Train-time: fixed cost, fast inference
Test-time: lower train cost, flexible
inference spend per problem
Easy question → few thinking tokens
Hard question → many thinking tokens
// Adaptive compute allocation

Analogy:
Train-time = studying for an exam
Test-time = thinking during the exam
Both matter. o1 does both.
Key insight: Test-time compute scaling doesn’t replace train-time scaling — it complements it. o1 is both a well-trained model AND one that thinks at inference. The breakthrough is that performance scales with inference compute, giving you a new knob to turn beyond model size.
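The "flexible inference spend per problem" idea can be sketched as a toy allocation rule. Everything here is illustrative — the budget numbers and the log-linear curve are my assumptions, not OpenAI's actual policy:

```python
def thinking_budget(difficulty: float, base: int = 50, max_tokens: int = 10_000) -> int:
    """Toy test-time compute allocator.

    Maps an estimated difficulty in [0, 1] to a thinking-token budget,
    interpolating log-linearly between `base` and `max_tokens` so that
    hard problems get orders of magnitude more tokens than easy ones.
    All numbers are illustrative, not real o1 internals.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    return round(base * (max_tokens / base) ** difficulty)

# Easy question -> tiny budget; competition-level problem -> large budget.
print(thinking_budget(0.0))  # 50
print(thinking_budget(1.0))  # 10000
```

The point of the sketch is the shape of the curve, not the constants: compute spent per problem spans two orders of magnitude, which no fixed model size can match.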
OpenAI o1: The First Reasoning Model
September 2024 — “Learning to Reason with LLMs”
How o1 Works
OpenAI released o1 in September 2024 with the tagline “Learning to Reason with LLMs.” The core mechanism: o1 is trained via large-scale reinforcement learning to generate an extended chain of reasoning (“thinking tokens”) before producing a final answer. Unlike CoT prompting (where we ask the model to reason), o1 has been trained to reason. The RL training teaches it: when to explore alternative approaches, how to recognize and recover from errors, when to try a different strategy, and how to break complex problems into sub-problems.

The thinking tokens are hidden from the user — you only see a summary and the final answer. Internally, o1 may generate hundreds to thousands of thinking tokens. On the AIME 2024 math competition, o1 scored 83.3% (top 500 US students), compared to GPT-4o’s 13.4%. On GPQA Diamond (PhD-level science), o1 achieved 77.3%, surpassing human PhD experts. On Codeforces, o1 reached the 89th percentile.
o1 Architecture
// How o1 processes a question
User: "Prove that √2 is irrational."

o1 Internal Thinking (hidden):
"I need to prove √2 is irrational.
The classic approach is proof by contradiction.
Let me assume √2 is rational, so √2 = p/q
where p,q are integers with no common factors...
Then 2 = p²/q², so p² = 2q².
This means p² is even, so p is even. Let p = 2k...
Then (2k)² = 2q², so 4k² = 2q², so q² = 2k²,
meaning q is also even. But we assumed p,q have
no common factors — contradiction!"

o1 Output (visible to user):
[Clean, structured proof]

Benchmarks:
AIME 2024: 83.3% (GPT-4o: 13.4%)
GPQA: 77.3% (PhD expert: 69.7%)
Codeforces: 89th percentile
// Trained to think, not just prompted
Key insight: The critical difference from CoT prompting: o1 was trained via RL to reason, not just prompted. The model has learned when to think harder, how to backtrack, and what strategies to try. This is internalized search, not external orchestration.
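The contrast with external orchestration is easiest to see in code. Here is a minimal sketch of the external approach — best-of-N sampling with a verifier, using hypothetical `solve` and `verify` callables — which is the explore-then-check loop that o1 learned to run internally:

```python
def best_of_n(solve, verify, problem, n=8):
    """External search: sample n candidate solutions, return the first one
    the verifier accepts. Frameworks like ToT orchestrate this loop outside
    the model; o1 runs an equivalent loop inside its thinking tokens."""
    candidates = [solve(problem) for _ in range(n)]
    for candidate in candidates:
        if verify(problem, candidate):
            return candidate
    return candidates[0]  # no candidate verified; fall back to a guess

# Toy demo: the "solver" proposes guesses, the verifier checks x^2 == 49.
attempts = iter([3, 9, 7, 2])
answer = best_of_n(lambda p: next(attempts),
                   lambda p, c: c * c == 49,
                   "find x with x^2 = 49", n=4)
print(answer)  # 7
```

With o1 there is no outer loop to write: exploration, verification, and backtracking all happen in the hidden reasoning chain of a single call.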
Thinking Tokens: How They Work
The mechanics of extended reasoning chains
The Mechanism
Thinking tokens are the internal reasoning tokens that o1 generates before producing a visible answer. They serve as the model’s “scratch paper.” Key characteristics:

- Variable length — simple questions may generate 50 thinking tokens; hard problems can generate 10,000+. The model learns to allocate compute proportional to difficulty.
- Hidden from users — OpenAI shows a summary (“Thinking for 12 seconds...”) but not the full chain. This is partly for IP protection and partly because raw thinking can be messy.
- Self-correction — unlike standard CoT, thinking tokens include self-correction: “Wait, that’s wrong. Let me try a different approach...” The model has learned to catch its own mistakes.
- Strategy switching — the model may try one approach, realize it’s not working, and switch to a completely different strategy. This is the backtracking behavior that ToT achieves externally, but o1 does internally.
- Cost implication — you pay for thinking tokens. A problem that generates 5,000 thinking tokens costs significantly more than a standard GPT-4o call.
Thinking Token Patterns
// Patterns in o1 thinking tokens

1. Exploration:
"Let me try approach A first..."
"Hmm, what if I use calculus here?"
// Considers multiple strategies

2. Self-Correction:
"Wait, I made an error in step 3."
"Actually, that formula doesn't apply here.
Let me reconsider..."
// Catches and fixes mistakes

3. Strategy Switching:
"This algebraic approach is getting too complex.
Let me try a geometric argument instead..."
// Backtracking (internalized ToT)

4. Verification:
"Let me check: if x = 3, then 3² + 2(3) = 15.
Yes, that works."
// Self-verification before answering

Token Counts (approximate):
Simple question: ~50 tokens
Medium problem: ~500 tokens
Hard math: ~5,000 tokens
Competition-level: ~10,000+ tokens
// Adaptive compute allocation
Key insight: Thinking tokens are the mechanism that makes test-time scaling work. The model adaptively allocates compute: easy questions get few tokens, hard questions get many. This is fundamentally more efficient than using the same model size for every question.
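The cost implication is easy to quantify. A sketch with placeholder per-token rates (the prices below are assumptions for illustration, not current OpenAI pricing); the key point is that hidden thinking tokens are billed as output tokens:

```python
def reasoning_cost(prompt_tokens: int, thinking_tokens: int, answer_tokens: int,
                   usd_per_m_input: float = 2.50,
                   usd_per_m_output: float = 10.00) -> float:
    """Estimate the cost in USD of one reasoning-model call.

    Thinking tokens are billed as output tokens even though the user never
    sees them. The per-million-token prices are illustrative placeholders.
    """
    output_tokens = thinking_tokens + answer_tokens
    return (prompt_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# Same prompt, same 300-token visible answer; the hidden chain dominates cost.
hard = reasoning_cost(500, 5_000, 300)   # competition-level problem
easy = reasoning_cost(500, 50, 300)      # simple question
```

This is why adaptive allocation matters economically: billing a 5,000-token hidden chain for a question that needed 50 tokens is pure waste.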
RL Training for Reasoning
How reinforcement learning teaches models to think
The Training Process
How do you train a model to reason? OpenAI uses large-scale reinforcement learning. The process (simplified):

Step 1: Base model — start with a strong pre-trained LLM (likely GPT-4 class).
Step 2: Generate reasoning traces — for math/code/science problems with known answers, have the model generate many reasoning attempts.
Step 3: Reward signal — correct final answers get positive reward; incorrect get negative. Crucially, you can also reward good reasoning steps (process rewards), not just final answers (outcome rewards).
Step 4: Policy optimization — use RL algorithms (likely PPO or similar) to update the model to generate reasoning traces that lead to correct answers.

Over many iterations, the model learns: which reasoning strategies work, when to explore alternatives, how to self-correct, and how much to think for different difficulty levels. OpenAI reported that o1’s performance improves consistently with both more training compute AND more test-time compute — two independent scaling axes.
RL Training Pipeline
// Simplified RL training for reasoning

Step 1: Collect Problems
Math: GSM8K, MATH, competition
Code: HumanEval, Codeforces
Science: GPQA, physics, chemistry
// Problems with verifiable answers

Step 2: Generate Traces
For each problem, generate N reasoning
traces (thinking tokens)
Some reach correct answer, some don't

Step 3: Assign Rewards
Outcome reward:
Correct answer → +1
Wrong answer → -1
Process reward (optional):
Good step → +0.1
Bad step → -0.1
// Reward the journey, not just the end

Step 4: RL Update
Policy gradient (PPO / GRPO):
Increase probability of traces that
led to correct answers
Decrease probability of bad traces

Step 5: Iterate
Repeat thousands of times
Model learns reasoning strategies
// Performance scales with RL compute
Key insight: The RL training is what separates o1 from CoT prompting. CoT relies on patterns the model learned during pre-training. RL explicitly optimizes the model to reason well. The model develops reasoning strategies that may not exist in the training data.
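Steps 3 and 4 can be sketched in a few lines. Real GRPO training updates model weights with a clipped policy-gradient objective; this toy version (my simplification) shows only the core scoring idea — each sampled trace is scored relative to its own group, with no learned value network:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages, the core of GRPO reduced to one function.

    Sample N reasoning traces for a single problem, reward each one, then
    score each trace by how much better it did than the group average.
    Traces with positive advantage get their token probabilities pushed up.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One problem, four sampled traces: +1 if the final answer was correct, else -1.
rewards = [1.0, -1.0, -1.0, 1.0]
advantages = grpo_advantages(rewards)
# Correct traces receive positive advantage, wrong traces negative.
```

Normalizing within the group is what removes the need for a separate value model: "good" is defined relative to the model's own other attempts at the same problem.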
Reasoning Effort Levels
o3-mini and compute-optimal inference
Adaptive Reasoning
In January 2025, OpenAI released o3-mini with a key innovation: configurable reasoning effort. Users can choose low, medium, or high reasoning effort:

- Low — minimal thinking tokens, fast responses, cheap. Good for simple questions that don’t need deep reasoning.
- Medium — moderate thinking, balanced cost/quality. Matches o1’s performance on most tasks at lower cost.
- High — extensive thinking, maximum accuracy. Best for competition-level math and hard coding problems.

This is compute-optimal inference: allocate exactly as much compute as the problem needs. Not every question deserves 10,000 thinking tokens. The practical implication: you can use the same model for both quick Q&A and deep reasoning, just by adjusting the effort level. o3-mini with medium effort matches o1 on math, coding, and science benchmarks while being significantly faster and cheaper. With high effort, it exceeds o1 on many benchmarks.
Reasoning Effort
// o3-mini reasoning effort levels

Low Effort:
Thinking tokens: ~50-200
Latency: ~1-3s
Use: simple questions, chat
"What is the capital of France?"
// Fast and cheap

Medium Effort:
Thinking tokens: ~500-2000
Latency: ~5-15s
Use: moderate reasoning, coding
"Write a binary search function"
// Matches o1 at lower cost

High Effort:
Thinking tokens: ~5000-20000
Latency: ~30-120s
Use: competition math, hard proofs
"Solve this AIME problem"
// Maximum accuracy

API Usage:
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[...]
)
// One model, flexible compute
Key insight: Configurable reasoning effort is the practical realization of test-time compute scaling. Instead of one-size-fits-all, you match compute to problem difficulty. This makes reasoning models economically viable for production use — you only pay for the thinking you need.
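Matching effort to difficulty can be automated with a small router in front of the API. A toy heuristic sketch — the keyword lists are illustrative assumptions; a production router would use a classifier or a cheap model's own difficulty estimate:

```python
def choose_effort(question: str) -> str:
    """Toy router: map a question to an o3-mini reasoning_effort level.

    The keyword heuristics below are illustrative only; the three returned
    values ("low"/"medium"/"high") match the API's effort levels.
    """
    q = question.lower()
    hard_markers = ("prove", "aime", "olympiad", "optimize", "complexity")
    medium_markers = ("write a", "implement", "debug", "explain why")
    if any(marker in q for marker in hard_markers):
        return "high"
    if any(marker in q for marker in medium_markers):
        return "medium"
    return "low"

print(choose_effort("What is the capital of France?"))  # low
print(choose_effort("Prove that √2 is irrational"))     # high
```

The returned level is exactly what you would pass as the `reasoning_effort` argument, so one model serves both quick Q&A and deep reasoning.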
Benchmark Performance
Where reasoning models excel — and where they don’t
Where Reasoning Models Shine
Reasoning models show dramatic improvements on tasks requiring multi-step reasoning:

- Mathematics — AIME 2024: o1 scored 83.3% vs GPT-4o’s 13.4%. MATH benchmark: o1 achieved 94.8%. These are competition-level problems requiring multi-step proofs.
- Science — GPQA Diamond (PhD-level): o1 scored 77.3%, surpassing human PhD experts (69.7%). The model can reason through complex physics and chemistry problems.
- Coding — Codeforces: o1 reached the 89th percentile. SWE-bench Verified: o3-mini solved 49.3% of real-world GitHub issues.
- Novel reasoning — ARC-AGI: o3 scored 87.5% on a benchmark designed to test genuine reasoning on novel tasks.

Where they DON’T help much: simple factual questions, creative writing, summarization, translation. These tasks don’t benefit from extended reasoning. Using o1 for “What year was the Eiffel Tower built?” wastes compute.
Benchmark Comparison
// Reasoning model benchmarks

Mathematics:
AIME 2024:
GPT-4o: 13.4%
o1: 83.3% (+69.9 pts)
MATH:
GPT-4o: 76.6%
o1: 94.8% (+18.2 pts)

Science:
GPQA Diamond (PhD-level):
Human PhD: 69.7%
GPT-4o: 53.6%
o1: 77.3%
// Surpasses human experts

Coding:
Codeforces:
GPT-4o: 11th percentile
o1: 89th percentile

Novel Reasoning:
ARC-AGI:
GPT-4o: 5%
o3: 87.5%

Where NOT to use:
Simple facts, creative writing,
summarization, translation
// No benefit from extra thinking
Key insight: Reasoning models are not universally better. They excel specifically on tasks requiring multi-step reasoning: math, science, coding, logic. For everything else, standard models are faster, cheaper, and equally good. Choose the right model for the task.
Open-Source Reasoning: DeepSeek-R1
Democratizing test-time compute scaling
The Open-Source Revolution
In January 2025, Chinese AI lab DeepSeek released DeepSeek-R1, an open-source reasoning model that matches o1’s performance on many benchmarks. This was a watershed moment:

- Performance — R1 matches or approaches o1 on AIME (79.8%), MATH (97.3%), and coding benchmarks. On some benchmarks it exceeds o1.
- Open weights — unlike o1, R1’s weights are publicly available. Researchers can study, fine-tune, and deploy it.
- Training recipe revealed — DeepSeek published their training approach: start with a base model, apply RL (using GRPO — Group Relative Policy Optimization), and the model learns to reason. Remarkably, they showed that RL alone (without supervised fine-tuning on reasoning traces) can produce reasoning behavior.
- Distillation — DeepSeek also released smaller distilled versions (1.5B, 7B, 14B, 32B, 70B) that retain much of R1’s reasoning ability. The 32B distilled model outperforms o1-mini on several benchmarks.
- Cost — R1 API pricing is roughly 90% cheaper than o1’s.
DeepSeek-R1
// DeepSeek-R1 vs OpenAI o1

Performance:
AIME 2024:
o1: 83.3%
R1: 79.8%
MATH:
o1: 94.8%
R1: 97.3% (!)
Codeforces:
o1: 89th percentile
R1: 96.3rd percentile (!)

Training Recipe:
1. Base model (DeepSeek-V3)
2. RL with GRPO algorithm
3. No supervised CoT data needed!
4. Model discovers reasoning itself
// RL alone produces reasoning

Distilled Versions:
R1-1.5B, R1-7B, R1-14B, R1-32B, R1-70B
R1-32B > o1-mini on many tasks
// Small models can reason too

Impact:
Open weights → research access
90% cheaper than o1
Proves reasoning is not proprietary
// Democratization of reasoning
Key insight: DeepSeek-R1 proved that reasoning ability is not a proprietary secret. The recipe is straightforward: strong base model + RL training = reasoning. Open-source models now match frontier closed models, and distilled versions bring reasoning to smaller deployments.
The Future of Test-Time Scaling
Where this paradigm is heading
What’s Next
Test-time compute scaling is still in its early days. Key directions:

- Compute-optimal routing — automatically decide how much thinking each question needs. Route simple questions to fast models, hard questions to reasoning models. This is already happening with o3-mini’s effort levels, but will become more granular.
- Specialized reasoning — models trained for specific domains: legal reasoning, medical diagnosis, financial analysis. Domain-specific RL training on domain-specific problems.
- Longer horizons — current reasoning models think for seconds to minutes. Future models may reason for hours on very hard problems (research-level math, complex engineering).
- Multi-modal reasoning — reasoning over images, diagrams, and code simultaneously. Thinking tokens that reference visual elements.
- Verification integration — tighter integration of reasoning and verification. The model generates a proof, then verifies each step, then revises. This is where process reward models (next chapter) become critical.
- Efficiency — making thinking tokens cheaper through distillation, quantization, and speculative decoding for reasoning chains.
Future Directions
// The future of test-time scaling

2024: Foundation
o1: first reasoning model
Fixed thinking, hidden tokens
// Proof of concept

2025: Democratization
o3-mini: configurable effort
DeepSeek-R1: open-source
Qwen QwQ: Chinese open-source
// Everyone gets reasoning

2025-2026: Optimization
Compute-optimal routing
Speculative decoding for thinking
Distilled reasoning models
// Making it cheaper and faster

2026+: Frontier
Hours-long reasoning sessions
Multi-modal reasoning
Domain-specific reasoning models
Reasoning + tool use + agents
// Reasoning as infrastructure

The Big Picture:
Train-time scaling: hitting limits
Test-time scaling: just beginning
Both together: the path to AGI?
// Two scaling laws, one goal
Key insight: We’re at the beginning of the test-time compute era. Train-time scaling gave us capable base models. Test-time scaling gives them the ability to think. The combination of both — well-trained models that think adaptively — is the current frontier of AI capability.