Ch 2 — Chain-of-Thought Prompting

Wei et al. (2022), zero-shot CoT, self-consistency, and why “let’s think step by step” works
High Level
[Interactive diagram: Question → Few-Shot → Zero-Shot → Chain → Sample → Answer]
The Original Paper: Wei et al. (2022)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The Breakthrough
In January 2022, Jason Wei and colleagues at Google Brain published the paper that launched the reasoning revolution. The core idea is deceptively simple: instead of asking a model to jump directly from question to answer, provide exemplars that include intermediate reasoning steps. The model then learns to generate its own reasoning steps for new problems.

Testing on Google’s PaLM (540B-parameter) model, chain-of-thought prompting with just 8 exemplars achieved state-of-the-art accuracy on the GSM8K math benchmark — surpassing a fine-tuned GPT-3 with a verifier. The improvement was dramatic across three categories: arithmetic reasoning (GSM8K, SVAMP, MultiArith), commonsense reasoning (StrategyQA, CSQA), and symbolic reasoning (last-letter concatenation, coin flip).

Crucially, CoT only helps with large models (roughly 100B+ parameters). Smaller models generate incoherent chains that hurt performance.
Few-Shot CoT Example
// Few-shot chain-of-thought prompt

Exemplar:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each.
   How many tennis balls does he have?
A: Roger started with 5 balls. 2 cans of 3 balls each = 2 × 3 = 6.
   5 + 6 = 11. The answer is 11.

New Question:
Q: The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more,
   how many apples do they have?
A: The cafeteria started with 23. They used 20, so 23 - 20 = 3.
   They bought 6 more, so 3 + 6 = 9. The answer is 9.

Results (PaLM 540B):
GSM8K: 17.9% → 56.9% (+39.0 points)
SVAMP: 79.0% → 86.6%
// State-of-the-art with 8 exemplars
Key insight: Chain-of-thought prompting is a few-shot technique: you provide 4–8 exemplars that include reasoning steps, and the model learns to generate its own. The magic is that it works without any fine-tuning — just prompting. But it only works with large models (100B+).
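As a concrete sketch, assembling the exemplars and the new question is simple string templating. This is a minimal illustration, not code from the paper; `build_cot_prompt` is a hypothetical helper:

```python
# A minimal sketch of few-shot CoT prompt assembly.
# build_cot_prompt is a hypothetical helper, not from Wei et al.

def build_cot_prompt(exemplars, question):
    """Join (question, reasoning) exemplars, then append the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have?",
    "Roger started with 5 balls. 2 cans of 3 balls each = 2 x 3 = 6. "
    "5 + 6 = 11. The answer is 11.",
)]
prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 for lunch and bought "
    "6 more, how many apples do they have?",
)
```

Because each exemplar ends with "The answer is X", the model tends to copy that format, which makes the final answer easy to parse later.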
Zero-Shot CoT: “Let’s Think Step by Step”
Kojima et al. (2022) — no exemplars needed
The Discovery
Just two months after Wei et al., Kojima et al. made a surprising discovery: you don’t need hand-crafted exemplars at all. Simply appending “Let’s think step by step” to the prompt triggers chain-of-thought reasoning in large language models.

This zero-shot CoT approach is remarkably effective: on MultiArith, accuracy jumped from 17.7% to 78.7% with InstructGPT. On GSM8K, it went from 10.4% to 40.7%. The technique works in two stages: (1) append “Let’s think step by step” and let the model generate reasoning, then (2) append “Therefore, the answer is” to extract the final answer.

While zero-shot CoT doesn’t match few-shot CoT performance (there are no exemplars to guide the reasoning format), it’s dramatically simpler to use. No need to craft domain-specific exemplars — one magic phrase works across tasks.
Zero-Shot CoT
// Zero-shot CoT (Kojima et al. 2022)

Standard Prompt:
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls
   are blue. How many blue golf balls are there?
A: 8 // Wrong!

Zero-Shot CoT:
Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls
   are blue. How many blue golf balls are there?
   Let's think step by step.
A: There are 16 balls total. Half are golf balls: 16 / 2 = 8.
   Half of the golf balls are blue: 8 / 2 = 4.
   Therefore, the answer is 4.

Results (InstructGPT):
MultiArith: 17.7% → 78.7%
GSM8K: 10.4% → 40.7%
SVAMP: 63.7% → 79.3%

Other Effective Phrases:
"Let's work this out step by step"
"Let's break this down"
"Think carefully and show your work"
Key insight: “Let’s think step by step” is arguably the most impactful five words in AI history. It proved that LLMs already have latent reasoning capabilities — the reasoning is there; it just needs to be elicited by the right prompt.
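The two-stage pipeline can be sketched in a few lines. Here `complete` stands in for any LLM call, and `fake_model` is a toy stand-in used only to show the control flow — neither is from the paper:

```python
# Sketch of the two-stage zero-shot CoT pipeline described above.
# `complete` is any callable that maps a prompt string to a completion.

def zero_shot_cot(question, complete):
    # Stage 1: elicit reasoning with the trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = complete(reasoning_prompt)
    # Stage 2: extract the final answer from the generated reasoning.
    extract_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return complete(extract_prompt)

def fake_model(prompt):
    # Toy stand-in: returns canned text so the pipeline can be traced.
    if "Therefore, the answer is" in prompt:
        return " 4."
    return "16 / 2 = 8 golf balls; 8 / 2 = 4 are blue."

answer = zero_shot_cot(
    "A juggler has 16 balls. Half are golf balls, and half of the golf "
    "balls are blue. How many blue golf balls are there?",
    fake_model,
)
```

The key design point is that stage 2 reuses the generated reasoning as context, so the extraction call only has to read off a conclusion the model already wrote down.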
Self-Consistency: Majority Vote over Paths
Wang et al. (2022) — sample diverse reasoning, pick the consensus
The Idea
Standard CoT uses greedy decoding: the model generates one reasoning path and returns one answer. But complex problems often have multiple valid approaches. What if the model takes a wrong turn in its reasoning?

Self-consistency (Wang et al., 2022, published at ICLR 2023) addresses this by: (1) sampling multiple reasoning paths using temperature > 0 (e.g., temperature = 0.7, 40 paths); (2) extracting the final answer from each path; (3) taking the majority vote — the most common answer across all paths wins. The intuition: if a problem has a correct answer, multiple different reasoning approaches should converge on it, while wrong answers are more random and spread across different values.

Results were striking: GSM8K improved by +17.9%, SVAMP by +11.0%, AQuA by +12.2%. Self-consistency is now the standard approach for any reasoning task where you can afford the extra inference cost.
Self-Consistency
// Self-consistency (Wang et al. 2022)

Step 1: Sample N reasoning paths (temperature = 0.7, N = 40)
Path 1: "23 - 20 = 3, 3 + 6 = 9"    → Answer: 9
Path 2: "23 + 6 = 29, 29 - 20 = 9"  → Answer: 9
Path 3: "20 - 6 = 14, 23 - 14 = 9"  → Answer: 9
Path 4: "23 - 20 = 13, 13 + 6 = 19" → Answer: 19 // Wrong path!

Step 2: Majority vote
Answer 9:  36 votes (90%)
Answer 19:  3 votes (7.5%)
Answer 3:   1 vote  (2.5%)

Step 3: Return majority → 9

Results (CoT + Self-Consistency):
GSM8K: 56.9% → 74.4% (+17.9%)
AQuA:  35.8% → 48.0% (+12.2%)
// Significant gains over greedy CoT
Key insight: Self-consistency trades compute for accuracy. Sampling 40 paths costs 40x more inference than greedy decoding, but the accuracy gains are substantial. The trade-off is worth it for high-stakes reasoning tasks. This is an early form of test-time compute scaling.
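The vote in steps (2) and (3) is just a counter over the extracted answers. A minimal sketch (the path sampling itself is omitted here):

```python
# Majority vote over final answers extracted from sampled reasoning paths.
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across sampled paths."""
    return Counter(answers).most_common(1)[0][0]

# 40 sampled paths as in the walkthrough above: 36 of them agree on 9.
sampled = ["9"] * 36 + ["19"] * 3 + ["3"]
consensus = majority_vote(sampled)
```

Note that voting happens over the *answers*, not the reasoning text: two paths that reach 9 by different arithmetic still count as agreement.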
Why Does Chain-of-Thought Work?
Theories and empirical evidence
Proposed Explanations
Why does generating intermediate text improve reasoning? Several theories:

External working memory — the chain of text serves as scratch space. Each token generated becomes part of the context for the next token, effectively giving the model more “memory” to work with.

Problem decomposition — CoT breaks one hard problem into many easy sub-problems. Each step is simple enough for the model to handle in a single forward pass.

Distributional shift — training data contains many examples of step-by-step reasoning (textbooks, tutorials, Stack Overflow). CoT prompts the model to access this “reasoning mode” of its training distribution.

Attention allocation — intermediate steps help the model attend to the right parts of the problem at each stage, rather than trying to attend to everything at once.

Error localization — when reasoning fails, you can identify which step went wrong, enabling targeted correction.
Why It Works
// Theories for why CoT works

1. External Working Memory:
Standard: all reasoning in 1 pass
CoT: each step adds to context
→ More "memory" for computation
// Like writing on scratch paper

2. Problem Decomposition:
Hard: "What is 23 - 20 + 6?"
Easy: "23 - 20 = ?" then "3 + 6 = ?"
→ Each sub-step is trivial
// Divide and conquer

3. Training Distribution:
Training data has step-by-step text:
textbooks, tutorials, forums
CoT activates this "mode"
// Accessing learned patterns

4. Attention Allocation:
Step 1: focus on "23 apples"
Step 2: focus on "used 20"
Step 3: focus on "bought 6"
// Guided attention per step

Empirical Finding:
CoT only helps models ≥ 100B params
Smaller models: incoherent chains
// Need enough capacity to reason
Key insight: The most compelling explanation is that CoT provides external working memory. A transformer’s hidden state has limited capacity for intermediate computation. By writing steps as text, each step becomes part of the context window — effectively unlimited scratch space.
CoT Variants & Extensions
Least-to-most, complexity-based, and more
Beyond Basic CoT
Researchers have developed many CoT variants:

Least-to-Most Prompting (Zhou et al., 2022) — first decompose the problem into sub-problems, then solve each sub-problem sequentially, using previous answers as context. Particularly effective for problems requiring compositional generalization.

Complexity-Based Prompting (Fu et al., 2023) — among multiple sampled reasoning paths, select the ones with the most reasoning steps (highest complexity). The intuition: longer reasoning chains tend to be more thorough.

Active Prompting (Diao et al., 2023) — automatically select the most informative exemplars for CoT by identifying questions where the model is most uncertain.

Auto-CoT (Zhang et al., 2022) — automatically generate CoT exemplars using zero-shot CoT, eliminating the need for manual exemplar creation. Clusters questions by similarity and generates demonstrations for each cluster.
CoT Variants
// CoT variant comparison

Least-to-Most (Zhou et al. 2022):
Step 1: Decompose into sub-problems
  "What sub-problems do I need to solve first?"
Step 2: Solve each sequentially
// Best for compositional problems

Complexity-Based (Fu et al. 2023):
Sample N reasoning paths
Select paths with MOST steps
(not just majority vote on the answer)
// Longer chains = more thorough

Auto-CoT (Zhang et al. 2022):
1. Cluster questions by similarity
2. Zero-shot CoT on each cluster
3. Use generated chains as exemplars
// No manual exemplar creation

Active Prompting (Diao et al. 2023):
Find questions the model is uncertain on
Create exemplars for those
// Targeted exemplar selection

When to Use What:
Simple: Zero-shot CoT
Important: Few-shot CoT + self-consistency
Compositional: Least-to-Most
Automated: Auto-CoT
Key insight: The practical recommendation: start with zero-shot CoT (“let’s think step by step”). If accuracy matters, add self-consistency (sample 5–40 paths). If the problem is compositional, use least-to-most. Only craft manual exemplars if these simpler approaches fail.
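The least-to-most loop — solve sub-problems in order, feeding each answer back into the context — can be sketched as follows. `complete` stands in for any LLM call, and `fake_model` is a toy stand-in used only to trace the loop; neither is from Zhou et al.:

```python
# Sketch of the least-to-most solving loop (decomposition step omitted).

def least_to_most(subproblems, complete):
    context, answers = "", []
    for sub in subproblems:
        answer = complete(f"{context}Q: {sub}\nA:")
        answers.append(answer)
        context += f"Q: {sub}\nA: {answer}\n"  # prior answers stay in context
    return answers

def fake_model(prompt):
    # Toy stand-in: answers whichever sub-problem appears last in the prompt.
    if "used 20" in prompt.splitlines()[-2]:
        return "23 - 20 = 3."
    return "3 + 6 = 9. The answer is 9."

answers = least_to_most(
    ["The cafeteria had 23 apples and used 20. How many remain?",
     "They then bought 6 more. How many now?"],
    fake_model,
)
```

The design choice that distinguishes this from plain CoT is the growing `context`: each sub-answer becomes part of the prompt for the next sub-problem, which is what enables compositional generalization.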
Practical Implementation
How to use CoT in production systems
Implementation Guide
Implementing CoT in production requires attention to several details:

Prompt structure — for few-shot CoT, include 4–8 exemplars that match your task domain. Each exemplar should show the question, step-by-step reasoning, and a clearly marked final answer.

Answer extraction — use a consistent format like “The answer is X” or “Therefore: X” to make parsing reliable. Regex extraction is common.

Temperature — for greedy CoT, use temperature = 0. For self-consistency, use temperature = 0.5–0.8 to get diverse paths.

Cost management — CoT generates more tokens (reasoning + answer), increasing cost. Self-consistency multiplies this by N samples. Budget accordingly.

Streaming — for user-facing applications, you can stream the reasoning steps to show “thinking” progress, or hide them and show only the final answer.
Code Example
# Self-consistency with CoT (Python)
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()

def solve_with_sc(question, n=10):
    prompt = f"""Solve step by step.
Q: {question}
A: Let's think step by step."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        text = resp.choices[0].message.content
        # Extract the final numeric answer
        match = re.search(r"answer is (\d+)", text)
        if match:
            answers.append(match.group(1))
    # Majority vote across sampled paths
    return Counter(answers).most_common(1)[0][0]
Key insight: In production, the biggest practical concern is cost. CoT generates 3–10x more tokens than direct answering. Self-consistency multiplies that by N. For cost-sensitive applications, use zero-shot CoT with greedy decoding. Reserve self-consistency for high-stakes decisions.
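As a back-of-envelope illustration of that cost multiplication — the token counts and `price_per_1k` below are illustrative assumptions, not vendor pricing:

```python
# Toy cost model for the CoT / self-consistency overhead discussed above.
# All numbers are hypothetical; plug in your own token counts and prices.

def cot_cost(output_tokens, n_samples, price_per_1k=0.01):
    """Total output-token cost: every sampled path regenerates the reasoning."""
    return output_tokens * n_samples * price_per_1k / 1000

direct = cot_cost(output_tokens=50, n_samples=1)    # short direct answer
cot = cot_cost(output_tokens=400, n_samples=1)      # ~8x tokens for reasoning
sc = cot_cost(output_tokens=400, n_samples=40)      # self-consistency, N = 40
```

With these assumed numbers, self-consistency at N = 40 costs 320x the direct answer — which is why it is reserved for high-stakes decisions.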
Limitations of Chain-of-Thought
When CoT fails and what to do about it
Known Limitations
CoT is powerful but not a silver bullet:

Faithfulness problem — the generated reasoning may not reflect the model’s actual computation. The model might arrive at the right answer for the wrong reasons, or generate plausible-looking reasoning that leads to a wrong answer. The chain is a rationalization, not necessarily the true reasoning process.

Error propagation — if an early step is wrong, all subsequent steps build on that error. Unlike humans, who can catch and correct mistakes, standard CoT has no self-correction mechanism.

Model size dependency — CoT only helps models with roughly 100B+ parameters. Smaller models generate incoherent chains that actually hurt performance (an “inverse scaling” effect).

Computational overhead — generating reasoning tokens is expensive. For simple questions, CoT wastes compute.

Not true reasoning — CoT improves pattern matching over reasoning-like text, but the model still doesn’t truly “understand” logic.
Failure Modes
// CoT failure modes

1. Unfaithful Reasoning:
Chain says: "5 × 3 = 15, 15 + 2 = 17"
But the model may have just guessed 17
and generated plausible steps after
// Post-hoc rationalization

2. Error Propagation:
Step 1: "23 - 20 = 13" ← WRONG
Step 2: "13 + 6 = 19" ← Correct math
Final: 19 ← Wrong answer
// One bad step ruins everything

3. Small Model Failure:
7B model with CoT:
"The cafeteria had apples. Apples are red.
 Red is a color. Colors are nice.
 The answer is 42."
// Incoherent chain, worse result

4. Overthinking Simple Problems:
"What is 2 + 2?"
CoT: "Let me think... 2 is a number
representing a pair. Adding another
pair gives us..." → wastes tokens

Solutions:
→ Self-consistency (fixes some errors)
→ Verification (Ch 5: PRMs)
→ Tree search (Ch 3: ToT)
→ Tool use (Ch 6: calculators)
Key insight: The faithfulness problem is the deepest limitation of CoT. We can’t be sure the generated reasoning reflects the model’s actual computation. This is why verification (Chapter 5) and mechanistic interpretability are so important — we need to verify reasoning, not just trust it.
CoT in the Bigger Picture
From prompting trick to foundational technique
The Legacy
Chain-of-thought prompting was the spark that ignited the reasoning revolution. Its impact extends far beyond the original paper:

It proved reasoning is elicitable — LLMs have latent reasoning capabilities that can be activated through prompting. This changed how we think about model capabilities.

It established the paradigm — “think before you answer” is now the default for any complex task. Every modern reasoning technique builds on this idea.

It inspired training approaches — OpenAI’s o1/o3 models are trained to generate “thinking tokens”: essentially CoT baked into the model through reinforcement learning.

It changed benchmarking — reasoning benchmarks (GSM8K, MATH) became the primary measure of model capability, shifting focus from perplexity to problem-solving.

Looking ahead: CoT is the foundation. Tree-of-Thought (next chapter) extends it with search. Test-time compute (Chapter 4) trains it into the model. Verification (Chapter 5) validates it.
Evolution of CoT
// How CoT evolved into modern reasoning

2022: Prompting Era
CoT prompting (Wei et al.)
Zero-shot CoT (Kojima et al.)
Self-consistency (Wang et al.)
// External technique, no training

2023: Search Era
Tree-of-Thought (Yao et al.)
Graph-of-Thought
Least-to-Most decomposition
// CoT + structured search

2024: Training Era
OpenAI o1: CoT trained via RL
"Thinking tokens" = learned CoT
Process reward models verify steps
// CoT baked into the model

2025: Democratization
DeepSeek-R1: open-source o1
o3-mini: cheap reasoning
Reasoning as a commodity
// Everyone gets CoT

The Thread:
All of these are the same idea:
"Think before you answer"
Just implemented differently
// CoT is the foundation of it all
Key insight: Every modern reasoning technique — from Tree-of-Thought to o1 to DeepSeek-R1 — is a descendant of chain-of-thought prompting. The core idea (“generate intermediate reasoning steps”) is the same; only the implementation has evolved from prompting to search to training.