Ch 4 — Chain-of-Thought Prompting

“Think step by step” and why it works — the difference between wrong and right on hard problems
Why “Think Step by Step” Actually Works
Each generated token is a computation step — more reasoning tokens = more compute = better answers
The Core Idea
An LLM gets a fixed amount of computation per token it generates. When you ask for a direct answer, the model has to compress all its reasoning into a single token (the answer). When you ask it to “think step by step,” each reasoning token becomes an intermediate computation that feeds into the next. More tokens = more compute = better answers on hard problems.

Chain-of-thought prompting was formalized in the 2022 paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” by Wei et al. at Google Brain, which elicited reasoning by including worked examples in the prompt. Shortly after, Kojima et al. (2022) showed that simply appending “Let’s think step by step” raised accuracy on the MultiArith math benchmark from 17.7% to 78.7%, with large gains on GSM8K (a grade-school math benchmark) as well.
The Analogy
Imagine someone asks you: “What’s 17 × 24?” If you have to answer instantly (one shot), you’ll probably guess wrong. But if you’re allowed to write on a scratch pad — 17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408 — you get it right every time. Chain-of-thought prompting gives the model a scratch pad. The intermediate text it generates is its working memory.
Key insight: CoT doesn’t make the model “smarter” — it gives the model more room to think. The knowledge was already there; the bottleneck was forcing a complex answer into a single step. This connects directly to the “fixed compute per token” concept from How LLMs Work (Ch 13).
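The scratch-pad decomposition in the analogy checks out in code:

```python
# Verify the scratch-pad steps for 17 × 24 from the analogy above.
partial_tens = 17 * 20   # 340
partial_ones = 17 * 4    # 68
total = partial_tens + partial_ones
print(total)             # 408, matching 17 * 24 computed directly
```

Each intermediate product is trivially easy; it is the single-shot combination that invites errors.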
The Direct-Answer Trap
When you ask for just the answer, the model skips steps and gets it wrong
Math Problem
Prompt: A store has 15 apples. They sell 40% of them in the morning and receive a shipment of 20 more in the afternoon. How many apples do they have at the end of the day? Just give me the number.
Model Output (Direct)
27
Why It’s Wrong
The model jumped to an answer without working through the steps. The correct calculation: 40% of 15 = 6 apples sold, 15 − 6 = 9 remaining, 9 + 20 = 29 apples. The model likely did something like 15 − 8 + 20 = 27, miscalculating 40% of 15 as 8 instead of 6. With no scratch pad, small errors compound.
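The two arithmetic paths, the correct chain and the model's likely shortcut, can be checked directly:

```python
# Correct chain: 40% of 15 sold in the morning, then add the shipment.
sold = 15 * 40 // 100        # 6 apples sold
end_of_day = 15 - sold + 20
print(end_of_day)            # 29

# The likely faulty shortcut: treating 40% of 15 as 8.
print(15 - 8 + 20)           # 27, the wrong direct answer
```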
Logic Problem
Prompt: A farmer has 3 fields. Field A produces twice as much as Field B. Field C produces 30 bushels more than Field A. Together they produce 390 bushels. How much does each field produce? Answer directly.
Model Output (Direct)
Field A: 120 bushels
Field B: 60 bushels
Field C: 150 bushels
Check
120 + 60 + 150 = 330 ≠ 390. Wrong. The correct answer: B = 72, A = 144, C = 174 (total: 390). The model’s numbers look plausible but don’t add up because it never verified the sum.
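With B = x, the equation x + 2x + (2x + 30) = 390 collapses to 5x + 30 = 390, which is quick to verify in code:

```python
# Solve 5x + 30 = 390 for Field B, then derive Fields A and C.
b = (390 - 30) / 5   # 72.0
a = 2 * b            # twice Field B
c = a + 30           # 30 more than Field A
print(b, a, c, a + b + c)   # 72.0 144.0 174.0 390.0
```

The check the model skipped (summing the three fields) takes one line.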
The pattern: Direct-answer prompts fail on multi-step problems because the model has to do all computation “in its head” (in a single forward pass). The more steps, the more likely an error. This is the exact problem CoT solves.
The Fix: “Think Through This Step by Step”
Same problems, one line added — the model shows its work and gets it right
Math Problem with CoT
Prompt: A store has 15 apples. They sell 40% of them in the morning and receive a shipment of 20 more in the afternoon. How many apples do they have at the end of the day? Think through this step by step before giving the final answer.
Model Output (with CoT)
Step 1: Calculate apples sold in the morning
40% of 15 = 0.40 × 15 = 6 apples sold
Step 2: Calculate remaining after morning
15 − 6 = 9 apples remaining
Step 3: Add afternoon shipment
9 + 20 = 29 apples
Final answer: 29 apples
Logic Problem with CoT
Prompt: A farmer has 3 fields. Field A produces twice as much as Field B. Field C produces 30 bushels more than Field A. Together they produce 390 bushels. How much does each field produce? Work through the algebra step by step.
Model Output (with CoT)
Let B = x (Field B's production)
Field A = 2x (twice Field B)
Field C = 2x + 30 (30 more than A)
Total equation: x + 2x + (2x + 30) = 390
5x + 30 = 390
5x = 360
x = 72
Field B: 72 bushels
Field A: 144 bushels
Field C: 174 bushels
Verification: 72 + 144 + 174 = 390 ✓
Key insight: Notice the model even verified its answer. When you give it room to reason, it naturally adopts good problem-solving habits like checking its work. The “think step by step” instruction doesn’t just improve accuracy — it makes the model’s reasoning transparent and auditable.
Zero-Shot CoT: The Magic Phrase
Just append “Let’s think step by step” — no examples needed
What Is Zero-Shot CoT?
Zero-shot CoT is the simplest form: you add a trigger phrase to the end of your prompt and the model starts reasoning. No examples, no special formatting. The original paper by Kojima et al. (2022) tested various phrases and found “Let’s think step by step” was the most effective trigger.
Effective Trigger Phrases
# Most effective (original research)
"Let's think step by step."

# Good alternatives
"Think through this step by step before answering."
"Break this down into steps."
"Work through the reasoning first, then give your final answer."

# For specific domains
"Walk through the debugging process step by step."
"Analyze this systematically, considering each factor."
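A minimal helper that appends one of these triggers to any prompt (the function name and default are illustrative, not from the paper):

```python
def with_cot(prompt: str, trigger: str = "Let's think step by step.") -> str:
    """Append a zero-shot CoT trigger phrase to the end of a prompt."""
    return f"{prompt.rstrip()}\n\n{trigger}"

print(with_cot("What's 17 x 24?"))
# What's 17 x 24?
#
# Let's think step by step.
```

Putting the trigger at the end keeps it adjacent to where the model starts generating, which is where the original research placed it.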
When Zero-Shot CoT Is Enough
Use zero-shot CoT when:
• The problem has a clear logical structure (math, logic puzzles)
• You want the model to show its work for transparency
• The reasoning steps are straightforward (not domain-specific)
• You’re prototyping and want a quick accuracy boost
Cost Consideration
Zero-shot CoT typically generates 3–5x more output tokens than a direct answer. For the apple problem: direct answer = ~2 tokens, CoT answer = ~60 tokens. At GPT-4 output pricing (~$30/M tokens), that’s the difference between $0.00006 and $0.0018 per call. Negligible for most use cases, but worth noting for high-volume pipelines.
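The cost comparison above works out as follows (the token counts and per-million price are the section's own assumptions):

```python
PRICE_PER_MILLION = 30.00   # assumed output pricing, $ per 1M tokens

def output_cost(tokens: int) -> float:
    """Dollar cost of generating `tokens` output tokens."""
    return tokens * PRICE_PER_MILLION / 1_000_000

print(output_cost(2))    # ~$0.00006, direct answer
print(output_cost(60))   # ~$0.0018, CoT answer (30x more, still tiny)
```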
Key insight: Zero-shot CoT is the lowest-effort, highest-impact prompting technique. One sentence added to your prompt can turn a wrong answer into a right one. If you learn only one technique from this entire course, make it this one.
Few-Shot CoT: Show the Reasoning Pattern
When you need the model to reason in a specific way, show it an example with reasoning
Why Few-Shot CoT?
Zero-shot CoT lets the model reason however it wants. Sometimes you need it to reason in a specific way — following your debugging methodology, your analysis framework, or your decision criteria. Few-shot CoT combines the power of examples (Ch 3) with chain-of-thought reasoning.
Example: Pricing Decision Analysis
Analyze pricing decisions using this framework: Market → Costs → Competitors → Recommendation.

Example:
Q: Should we raise the price of our Basic plan from $9/mo to $12/mo?
A: Let me analyze this systematically.
Market: Our Basic plan has 12,000 users with 2.1% monthly churn. The segment is price-sensitive freelancers.
Costs: Current margin is 34%. At $12, margin would be 48%. Infrastructure costs are fixed up to 20K users.
Competitors: Notion charges $10, Coda $10, Monday $9. A $12 price puts us above all three in the basic tier.
Recommendation: Don't raise to $12. The 14-point margin gain doesn't offset the churn risk in a price-sensitive segment where we'd be the most expensive option. Consider $10 instead — matches competitors, improves margin to 40%.

Now analyze:
Q: Should we add a $29/mo Pro tier between Basic ($9) and Enterprise ($99)?
What the Example Teaches
The example doesn’t just say “think step by step” — it shows the exact framework to follow:

1. Structure: Market → Costs → Competitors → Recommendation
2. Depth: Include specific numbers, not vague statements
3. Reasoning style: Weigh trade-offs, consider risks
4. Conclusion format: Clear yes/no with justification

The model will now analyze the new question using the same four-step framework with the same level of specificity.
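One way to assemble a few-shot CoT prompt like this programmatically (a sketch; the function and variable names are illustrative):

```python
FRAMEWORK = "Market -> Costs -> Competitors -> Recommendation"

def few_shot_cot_prompt(example_q: str, example_a: str, new_q: str) -> str:
    """Build a few-shot CoT prompt: framework instruction, one worked
    example with its reasoning chain, then the new question."""
    return (
        f"Analyze pricing decisions using this framework: {FRAMEWORK}.\n\n"
        f"Example:\nQ: {example_q}\nA: {example_a}\n\n"
        f"Now analyze:\nQ: {new_q}"
    )

prompt = few_shot_cot_prompt(
    "Should we raise the price of our Basic plan from $9/mo to $12/mo?",
    "Let me analyze this systematically. Market: ... Recommendation: ...",
    "Should we add a $29/mo Pro tier between Basic ($9) and Enterprise ($99)?",
)
print(prompt)
```

Keeping the worked example in a template like this makes it easy to swap in new questions while the reasoning pattern stays fixed.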
Zero-Shot CoT
“Should we add a $29 Pro tier? Let’s think step by step.”

Model reasons freely — might focus on features, might focus on pricing psychology, might miss competitor analysis entirely. Unpredictable structure.
Few-Shot CoT
Model follows the Market → Costs → Competitors → Recommendation framework exactly, includes specific numbers, weighs trade-offs, and gives a clear recommendation with justification.
Key insight: Zero-shot CoT says “think.” Few-shot CoT says “think like this.” Use zero-shot when any reasoning helps. Use few-shot when you need reasoning to follow a specific methodology.
Domain Example: Debugging an API Error
Direct prompt gives generic advice — CoT prompt gives systematic diagnosis
The Scenario
Your Flask API returns 500 Internal Server Error on POST /api/users but GET /api/users works fine. You have the error log and the route handler code.
Direct Prompt (Superficial)
Prompt: “My Flask API returns 500 on POST /api/users but GET works fine. What’s wrong?”

Output:
Here are some common causes of 500 errors on POST routes:
1. Missing request body parsing
2. Database connection issues
3. Validation errors
4. CORS configuration
5. Missing required fields

Try checking your error logs for more details.

Generic checklist. Doesn’t look at your actual code.
CoT Prompt (Deliberate)
Debug this systematically. For each step, state what you're checking and why.

My Flask API returns 500 on POST /api/users but GET /api/users works fine.

Error log:
sqlalchemy.exc.IntegrityError: (psycopg2.errors.NotNullViolation) null value in column "created_at"

Route handler:
@app.route('/api/users', methods=['POST'])
def create_user():
    data = request.get_json()
    user = User(
        name=data['name'],
        email=data['email']
    )
    db.session.add(user)
    db.session.commit()
    return jsonify(user.to_dict()), 201

Walk through the debugging process step by step.
Model Output (CoT)
Step 1: Read the error message
The error is NotNullViolation on "created_at". This means the database requires created_at but it's not being set.
Step 2: Check the route handler
The User() constructor sets name and email but NOT created_at. The column likely has nullable=False in the model definition.
Step 3: Why GET works but POST doesn't
GET reads existing rows (which have created_at). POST creates new rows without setting it → DB constraint violation.
Step 4: The fix
Option A: Add a default in the model:
created_at = db.Column(db.DateTime, default=datetime.utcnow)
Option B: Set it in the route:
user = User(name=data['name'], email=data['email'], created_at=datetime.utcnow())
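The shape of Option A's fix, a default supplied at construction time, can be sketched with a stdlib dataclass so it runs without SQLAlchemy (the real fix belongs in the db.Column(..., default=...) definition):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stdlib stand-in for the SQLAlchemy User model: with Option A applied,
# created_at is filled in automatically instead of staying None.
@dataclass
class User:
    name: str
    email: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

user = User(name="Ada", email="ada@example.com")
print(user.created_at is not None)  # True: no NOT NULL violation analogue
```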
Key insight: The CoT prompt didn’t just say “think step by step” — it said “debug systematically, state what you’re checking and why.” This guided the model to follow a real debugging methodology: read the error, trace the code, explain the asymmetry, propose a fix. Domain-specific CoT triggers produce domain-specific reasoning.
Self-Consistency: Ask 5 Times, Take the Majority Vote
When one CoT path might be wrong, sample multiple paths and let them vote
The Problem CoT Doesn’t Fully Solve
CoT dramatically improves accuracy, but it’s not perfect. The model might take a wrong turn in its reasoning and confidently arrive at a wrong answer. Since each CoT path is generated with some randomness (temperature > 0), different runs can produce different reasoning chains with different answers.
The Self-Consistency Technique
Proposed by Wang et al. (2022), self-consistency works like this:

1. Send the same CoT prompt 5–20 times with temperature > 0 (e.g., 0.7)
2. Each run produces a different reasoning chain
3. Extract the final answer from each chain
4. Take the majority vote

If 4 out of 5 runs say “29 apples” and 1 says “27 apples,” the answer is 29. The wrong reasoning path gets outvoted.
Implementation Pattern
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()

def extract_final_answer(text: str) -> str:
    """Naive extractor: take the last number in the response
    (an assumption; adapt this to your task's answer format)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else text.strip().splitlines()[-1]

def self_consistent_answer(prompt: str, n: int = 5):
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # randomness makes each reasoning chain different
        )
        answers.append(extract_final_answer(resp.choices[0].message.content))
    # Majority vote across the sampled chains
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers
Cost vs Accuracy Trade-off
5 samples: 5x cost, significant accuracy boost
10 samples: 10x cost, diminishing returns
20 samples: 20x cost, marginal improvement

For most tasks, 5 samples is the sweet spot. Use self-consistency for high-stakes decisions where being wrong is expensive (medical triage, financial analysis, legal review).
Key insight: Self-consistency treats the model’s randomness as a feature, not a bug. Different reasoning paths explore different solution strategies. The majority vote filters out the occasional wrong path. It’s like asking 5 experts instead of 1 — the group is more reliable than any individual.
When to Use (and Not Use) Chain-of-Thought
CoT is powerful but not always necessary — here’s the decision framework
Use CoT When
Multi-step reasoning: Math, logic, planning, scheduling

Debugging: Tracing errors through code, logs, or configurations

Analysis: Comparing options, weighing trade-offs, making recommendations

Complex classification: When the category depends on multiple factors

Transparency matters: When you need to audit why the model chose an answer, not just what it chose
Skip CoT When
Simple retrieval: “What’s the capital of France?” — CoT adds tokens without improving accuracy

Format conversion: “Convert this JSON to YAML” — mechanical task, no reasoning needed

Creative writing: “Write a poem about autumn” — CoT can make creative output feel formulaic

High-volume, low-stakes: When 3–5x token cost matters and errors are cheap to fix
The CoT Decision Tree
# Does the task require multiple steps?
NO  → Skip CoT (direct answer is fine)
YES → Use CoT

# Do you need a specific reasoning style?
NO  → Zero-shot CoT ("Let's think step by step")
YES → Few-shot CoT (show an example with reasoning)

# Is the answer high-stakes?
NO  → A single CoT call is enough
YES → Self-consistency (5 calls, majority vote)
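The decision tree translates directly into code (the function and parameter names are illustrative):

```python
def choose_cot_strategy(multi_step: bool, needs_specific_style: bool,
                        high_stakes: bool) -> str:
    """Walk the decision tree: direct answer, then zero- vs few-shot CoT,
    then optionally self-consistency for high-stakes calls."""
    if not multi_step:
        return "direct answer"
    strategy = "few-shot CoT" if needs_specific_style else "zero-shot CoT"
    if high_stakes:
        strategy += " + self-consistency (5 calls, majority vote)"
    return strategy

print(choose_cot_strategy(True, False, True))
# zero-shot CoT + self-consistency (5 calls, majority vote)
```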
Combining with Previous Techniques
CoT stacks with everything you’ve learned:

Ch 2 + CoT: Role + Context + Task + “Think step by step”
Ch 3 + CoT: Few-shot examples that include reasoning chains

These techniques are composable. The best prompts combine clear structure (Ch 2), good examples (Ch 3), and explicit reasoning (Ch 4).
Key insight: Chain-of-thought is the single most impactful technique for hard problems. It costs more tokens but dramatically reduces errors. Think of it as buying compute with tokens — you’re trading output length for output quality. For anything that requires reasoning, always default to CoT.