RL with Process Rewards
// Outcome vs Process supervision
Outcome Supervision:
Solution: [S1, S2, S3, S4, S5]
Final answer: correct
Reward: [?, ?, ?, ?, +1]
// Which steps were good? Unknown!
Solution: [S1, S2, S3, S4, S5]
Final answer: wrong
Reward: [?, ?, ?, ?, -1]
// S1-S3 might be correct but
// all get penalized equally
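The outcome case above can be sketched in a few lines. This is a minimal illustration, not a real training loop; the function name and signature are hypothetical.

```python
def outcome_rewards(steps, final_answer_correct):
    """Outcome supervision: one scalar reward from the final answer,
    inherited by every step. Individual step quality is unobserved
    (the '?' entries in the notes above)."""
    r = 1.0 if final_answer_correct else -1.0
    return [r] * len(steps)

# A wrong final answer penalizes all five steps equally,
# even if S1-S3 were correct:
outcome_rewards(["S1", "S2", "S3", "S4", "S5"], False)
# -> [-1.0, -1.0, -1.0, -1.0, -1.0]
```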
Process Supervision:
Solution: [S1, S2, S3, S4, S5]
PRM scores: [+1, +1, +1, -1, -1]
// Error at S4! Reinforce S1-S3,
// penalize S4-S5
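The process case can be sketched similarly. A sketch under the assumption that the PRM emits a per-step correctness probability and that an error invalidates every later step; the helper name and threshold are illustrative, not a standard API.

```python
def prm_step_rewards(step_probs, threshold=0.5):
    """Process supervision: convert PRM step-correctness probabilities
    into +1/-1 rewards. Once a step falls below the threshold, it and
    all subsequent steps are penalized, since reasoning built on an
    erroneous step is unreliable."""
    rewards, errored = [], False
    for p in step_probs:
        errored = errored or p < threshold
        rewards.append(-1.0 if errored else 1.0)
    return rewards

# Error first detected at S4: reinforce S1-S3, penalize S4-S5.
prm_step_rewards([0.9, 0.8, 0.95, 0.2, 0.7])
# -> [1.0, 1.0, 1.0, -1.0, -1.0]
```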
RL Efficiency:
Outcome: ~100K episodes to converge
Process: ~20K episodes to converge
// ~5x more sample-efficient (illustrative figures)
Alignment Benefit:
Outcome: may reward flawed reasoning
that accidentally reaches the right answer
Process: rewards each correct reasoning step,
even when the final answer is wrong
// Aligned with human-endorsed reasoning