Ch 4 — Chain-of-Thought Prompting

“Think step by step” and why it works — the difference between wrong and right on hard problems
Why “Think Step by Step” Actually Works
Each generated token is a computation step — more reasoning tokens = more compute = better answers
The Core Idea
An LLM gets a fixed amount of computation per token it generates. When you ask for a direct answer, the model has to compress all its reasoning into a single token (the answer). When you ask it to “think step by step,” each reasoning token becomes an intermediate computation that feeds into the next. More tokens = more compute = better answers on hard problems.

Chain-of-thought prompting was formalized in the 2022 paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” by Wei et al. at Google Brain, which elicited reasoning by including worked examples in the prompt. Shortly after, Kojima et al. (2022) showed that simply appending “Let’s think step by step” raised accuracy on the MultiArith math benchmark from 17.7% to 78.7%, with large gains on GSM8K (a grade-school math benchmark) as well.
The Analogy
Imagine someone asks you: “What’s 17 × 24?” If you have to answer instantly (one shot), you’ll probably guess wrong. But if you’re allowed to write on a scratch pad — 17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408 — you get it right every time. Chain-of-thought prompting gives the model a scratch pad. The intermediate text it generates is its working memory.
Key insight: CoT doesn’t make the model “smarter” — it gives the model more room to think. The knowledge was already there; the bottleneck was forcing a complex answer into a single step. This connects directly to the “fixed compute per token” concept from How LLMs Work (Ch 13).
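The scratch-pad decomposition in the analogy checks out in code:

```python
# Verify the scratch-pad steps for 17 × 24 from the analogy above.
partial_tens = 17 * 20   # 340
partial_ones = 17 * 4    # 68
total = partial_tens + partial_ones
print(total)             # 408, matching 17 * 24 computed directly
```

Each intermediate product is trivially easy; it is the single-shot combination that invites errors.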
The Direct-Answer Trap
When you ask for just the answer, the model skips steps and gets it wrong
Math Problem
Prompt: A store has 15 apples. They sell 40% of them in the morning and receive a shipment of 20 more in the afternoon. How many apples do they have at the end of the day? Just give me the number.
Model Output (Direct)
27
Why It’s Wrong
The model jumped to an answer without working through the steps. The correct calculation: 40% of 15 = 6 apples sold, 15 − 6 = 9 remaining, 9 + 20 = 29 apples. The model likely did something like 15 − 8 + 20 = 27, miscalculating 40% of 15 as 8 instead of 6. With no scratch pad, small errors compound.
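The two arithmetic paths, the correct chain and the model's likely shortcut, can be checked directly:

```python
# Correct chain: 40% of 15 sold in the morning, then add the shipment.
sold = 15 * 40 // 100        # 6 apples sold
end_of_day = 15 - sold + 20
print(end_of_day)            # 29

# The likely faulty shortcut: treating 40% of 15 as 8.
print(15 - 8 + 20)           # 27, the wrong direct answer
```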
Logic Problem
Prompt: A farmer has 3 fields. Field A produces twice as much as Field B. Field C produces 30 bushels more than Field A. Together they produce 390 bushels. How much does each field produce? Answer directly.
Model Output (Direct)
Field A: 120 bushels
Field B: 60 bushels
Field C: 150 bushels
Check
120 + 60 + 150 = 330 ≠ 390. Wrong. The correct answer: B = 72, A = 144, C = 174 (total: 390). The model’s numbers look plausible but don’t add up because it never verified the sum.
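With B = x, the equation x + 2x + (2x + 30) = 390 collapses to 5x + 30 = 390, which is quick to verify in code:

```python
# Solve 5x + 30 = 390 for Field B, then derive Fields A and C.
b = (390 - 30) / 5   # 72.0
a = 2 * b            # twice Field B
c = a + 30           # 30 more than Field A
print(b, a, c, a + b + c)   # 72.0 144.0 174.0 390.0
```

The check the model skipped (summing the three fields) takes one line.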
The pattern: Direct-answer prompts fail on multi-step problems because the model has to do all computation “in its head” (in a single forward pass). The more steps, the more likely an error. This is the exact problem CoT solves.
The Fix: “Think Through This Step by Step”
Same problems, one line added — the model shows its work and gets it right
Math Problem with CoT
Prompt: A store has 15 apples. They sell 40% of them in the morning and receive a shipment of 20 more in the afternoon. How many apples do they have at the end of the day? Think through this step by step before giving the final answer.
Model Output (with CoT)
Step 1: Calculate apples sold in the morning
40% of 15 = 0.40 × 15 = 6 apples sold
Step 2: Calculate remaining after morning
15 − 6 = 9 apples remaining
Step 3: Add afternoon shipment
9 + 20 = 29 apples
Final answer: 29 apples
Logic Problem with CoT
Prompt: A farmer has 3 fields. Field A produces twice as much as Field B. Field C produces 30 bushels more than Field A. Together they produce 390 bushels. How much does each field produce? Work through the algebra step by step.
Model Output (with CoT)
Let B = x (Field B's production)
Field A = 2x (twice Field B)
Field C = 2x + 30 (30 more than A)
Total equation: x + 2x + (2x + 30) = 390
5x + 30 = 390
5x = 360
x = 72
Field B: 72 bushels
Field A: 144 bushels
Field C: 174 bushels
Verification: 72 + 144 + 174 = 390 ✓
Key insight: Notice the model even verified its answer. When you give it room to reason, it naturally adopts good problem-solving habits like checking its work. The “think step by step” instruction doesn’t just improve accuracy — it makes the model’s reasoning transparent and auditable.
Zero-Shot CoT: The Magic Phrase
Just append “Let’s think step by step” — no examples needed
What Is Zero-Shot CoT?
Zero-shot CoT is the simplest form: you add a trigger phrase to the end of your prompt and the model starts reasoning. No examples, no special formatting. The original paper by Kojima et al. (2022) tested various phrases and found “Let’s think step by step” was the most effective trigger.
Effective Trigger Phrases
# Most effective (original research)
"Let's think step by step."

# Good alternatives
"Think through this step by step before answering."
"Break this down into steps."
"Work through the reasoning first, then give your final answer."

# For specific domains
"Walk through the debugging process step by step."
"Analyze this systematically, considering each factor."
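A minimal helper that appends one of these triggers to any prompt (the function name and default are illustrative, not from the paper):

```python
def with_cot(prompt: str, trigger: str = "Let's think step by step.") -> str:
    """Append a zero-shot CoT trigger phrase to the end of a prompt."""
    return f"{prompt.rstrip()}\n\n{trigger}"

print(with_cot("What's 17 x 24?"))
# What's 17 x 24?
#
# Let's think step by step.
```

Putting the trigger at the end keeps it adjacent to where the model starts generating, which is where the original research placed it.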
When Zero-Shot CoT Is Enough
Use zero-shot CoT when:
• The problem has a clear logical structure (math, logic puzzles)
• You want the model to show its work for transparency
• The reasoning steps are straightforward (not domain-specific)
• You’re prototyping and want a quick accuracy boost
Cost Consideration
Zero-shot CoT typically generates 3–5x more output tokens than a direct answer. For the apple problem: direct answer = ~2 tokens, CoT answer = ~60 tokens. At GPT-4 output pricing (~$30/M tokens), that’s the difference between $0.00006 and $0.0018 per call. Negligible for most use cases, but worth noting for high-volume pipelines.
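The cost comparison above works out as follows (the token counts and per-million price are the section's own assumptions):

```python
PRICE_PER_MILLION = 30.00   # assumed output pricing, $ per 1M tokens

def output_cost(tokens: int) -> float:
    """Dollar cost of generating `tokens` output tokens."""
    return tokens * PRICE_PER_MILLION / 1_000_000

print(output_cost(2))    # ~$0.00006, direct answer
print(output_cost(60))   # ~$0.0018, CoT answer (30x more, still tiny)
```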
Key insight: Zero-shot CoT is the lowest-effort, highest-impact prompting technique. One sentence added to your prompt can turn a wrong answer into a right one. If you learn only one technique from this entire course, make it this one.
Few-Shot CoT: Show the Reasoning Pattern
When you need the model to reason in a specific way, show it an example with reasoning
Why Few-Shot CoT?
Zero-shot CoT lets the model reason however it wants. Sometimes you need it to reason in a specific way — following your debugging methodology, your analysis framework, or your decision criteria. Few-shot CoT combines the power of examples (Ch 3) with chain-of-thought reasoning.
Example: Pricing Decision Analysis
Analyze pricing decisions using this framework: Market → Costs → Competitors → Recommendation.

Example:
Q: Should we raise the price of our Basic plan from $9/mo to $12/mo?
A: Let me analyze this systematically.
Market: Our Basic plan has 12,000 users with 2.1% monthly churn. The segment is price-sensitive freelancers.
Costs: Current margin is 34%. At $12, margin would be 48%. Infrastructure costs are fixed up to 20K users.
Competitors: Notion charges $10, Coda $10, Monday $9. A $12 price puts us above all three in the basic tier.
Recommendation: Don't raise to $12. The 14-point margin gain doesn't offset the churn risk in a price-sensitive segment where we'd be the most expensive option. Consider $10 instead — matches competitors, improves margin to 40%.

Now analyze:
Q: Should we add a $29/mo Pro tier between Basic ($9) and Enterprise ($99)?
What the Example Teaches
The example doesn’t just say “think step by step” — it shows the exact framework to follow:

1. Structure: Market → Costs → Competitors → Recommendation
2. Depth: Include specific numbers, not vague statements
3. Reasoning style: Weigh trade-offs, consider risks
4. Conclusion format: Clear yes/no with justification

The model will now analyze the new question using the same four-step framework with the same level of specificity.
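One way to assemble a few-shot CoT prompt like this programmatically (a sketch; the function and variable names are illustrative):

```python
FRAMEWORK = "Market -> Costs -> Competitors -> Recommendation"

def few_shot_cot_prompt(example_q: str, example_a: str, new_q: str) -> str:
    """Build a few-shot CoT prompt: framework instruction, one worked
    example with its reasoning chain, then the new question."""
    return (
        f"Analyze pricing decisions using this framework: {FRAMEWORK}.\n\n"
        f"Example:\nQ: {example_q}\nA: {example_a}\n\n"
        f"Now analyze:\nQ: {new_q}"
    )

prompt = few_shot_cot_prompt(
    "Should we raise the price of our Basic plan from $9/mo to $12/mo?",
    "Let me analyze this systematically. Market: ... Recommendation: ...",
    "Should we add a $29/mo Pro tier between Basic ($9) and Enterprise ($99)?",
)
print(prompt)
```

Keeping the worked example in a template like this makes it easy to swap in new questions while the reasoning pattern stays fixed.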
Zero-Shot CoT
“Should we add a $29 Pro tier? Let’s think step by step.”

Model reasons freely — might focus on features, might focus on pricing psychology, might miss competitor analysis entirely. Unpredictable structure.
Few-Shot CoT
Model follows the Market → Costs → Competitors → Recommendation framework exactly, includes specific numbers, weighs trade-offs, and gives a clear recommendation with justification.
Key insight: Zero-shot CoT says “think.” Few-shot CoT says “think like this.” Use zero-shot when any reasoning helps. Use few-shot when you need reasoning to follow a specific methodology.
Domain Example: Debugging an API Error
Direct prompt gives generic advice — CoT prompt gives systematic diagnosis
The Scenario
Your Flask API returns 500 Internal Server Error on POST /api/users but GET /api/users works fine. You have the error log and the route handler code.
Direct Prompt (Superficial)
Prompt: “My Flask API returns 500 on POST /api/users but GET works fine. What’s wrong?”

Output:
Here are some common causes of 500 errors on POST routes:
1. Missing request body parsing
2. Database connection issues
3. Validation errors
4. CORS configuration
5. Missing required fields

Try checking your error logs for more details.

Generic checklist. Doesn’t look at your actual code.
CoT Prompt (Deliberate)
Debug this systematically. For each step, state what you're checking and why.

My Flask API returns 500 on POST /api/users but GET /api/users works fine.

Error log:
sqlalchemy.exc.IntegrityError: (psycopg2.errors.NotNullViolation) null value in column "created_at"

Route handler:
@app.route('/api/users', methods=['POST'])
def create_user():
    data = request.get_json()
    user = User(
        name=data['name'],
        email=data['email']
    )
    db.session.add(user)
    db.session.commit()
    return jsonify(user.to_dict()), 201

Walk through the debugging process step by step.
Model Output (CoT)
Step 1: Read the error message
The error is NotNullViolation on "created_at". This means the database requires created_at but it's not being set.
Step 2: Check the route handler
The User() constructor sets name and email but NOT created_at. The column likely has nullable=False in the model definition.
Step 3: Why GET works but POST doesn't
GET reads existing rows (which have created_at). POST creates new rows without setting it → DB constraint violation.
Step 4: The fix
Option A: Add a default in the model:
created_at = db.Column(db.DateTime, default=datetime.utcnow)
Option B: Set it in the route:
user = User(name=data['name'], email=data['email'], created_at=datetime.utcnow())
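The shape of Option A's fix, a default supplied at construction time, can be sketched with a stdlib dataclass so it runs without SQLAlchemy (the real fix belongs in the db.Column(..., default=...) definition):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stdlib stand-in for the SQLAlchemy User model: with Option A applied,
# created_at is filled in automatically instead of staying None.
@dataclass
class User:
    name: str
    email: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

user = User(name="Ada", email="ada@example.com")
print(user.created_at is not None)  # True: no NOT NULL violation analogue
```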
Key insight: The CoT prompt didn’t just say “think step by step” — it said “debug systematically, state what you’re checking and why.” This guided the model to follow a real debugging methodology: read the error, trace the code, explain the asymmetry, propose a fix. Domain-specific CoT triggers produce domain-specific reasoning.
Self-Consistency: Ask 5 Times, Take the Majority Vote
When one CoT path might be wrong, sample multiple paths and let them vote
The Problem CoT Doesn’t Fully Solve
CoT dramatically improves accuracy, but it’s not perfect. The model might take a wrong turn in its reasoning and confidently arrive at a wrong answer. Since each CoT path is generated with some randomness (temperature > 0), different runs can produce different reasoning chains with different answers.
The Self-Consistency Technique
Proposed by Wang et al. (2022), self-consistency works like this:

1. Send the same CoT prompt 5–20 times with temperature > 0 (e.g., 0.7)
2. Each run produces a different reasoning chain
3. Extract the final answer from each chain
4. Take the majority vote

If 4 out of 5 runs say “29 apples” and 1 says “27 apples,” the answer is 29. The wrong reasoning path gets outvoted.
Implementation Pattern
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()

def extract_final_answer(text: str) -> str:
    """Naive extractor: take the last number in the response
    (an assumption; adapt this to your task's answer format)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else text.strip().splitlines()[-1]

def self_consistent_answer(prompt: str, n: int = 5):
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # randomness makes each reasoning chain different
        )
        answers.append(extract_final_answer(resp.choices[0].message.content))
    # Majority vote across the sampled chains
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers
Cost vs Accuracy Trade-off
5 samples: 5x cost, significant accuracy boost
10 samples: 10x cost, diminishing returns
20 samples: 20x cost, marginal improvement

For most tasks, 5 samples is the sweet spot. Use self-consistency for high-stakes decisions where being wrong is expensive (medical triage, financial analysis, legal review).
Key insight: Self-consistency treats the model’s randomness as a feature, not a bug. Different reasoning paths explore different solution strategies. The majority vote filters out the occasional wrong path. It’s like asking 5 experts instead of 1 — the group is more reliable than any individual.
When to Use (and Not Use) Chain-of-Thought
CoT is powerful but not always necessary — here’s the decision framework
Use CoT When
Multi-step reasoning: Math, logic, planning, scheduling

Debugging: Tracing errors through code, logs, or configurations

Analysis: Comparing options, weighing trade-offs, making recommendations

Complex classification: When the category depends on multiple factors

Transparency matters: When you need to audit why the model chose an answer, not just what it chose
Skip CoT When
Simple retrieval: “What’s the capital of France?” — CoT adds tokens without improving accuracy

Format conversion: “Convert this JSON to YAML” — mechanical task, no reasoning needed

Creative writing: “Write a poem about autumn” — CoT can make creative output feel formulaic

High-volume, low-stakes: When 3–5x token cost matters and errors are cheap to fix
The CoT Decision Tree
# Does the task require multiple steps?
NO  → Skip CoT (direct answer is fine)
YES → Use CoT

# Do you need a specific reasoning style?
NO  → Zero-shot CoT ("Let's think step by step")
YES → Few-shot CoT (show an example with reasoning)

# Is the answer high-stakes?
NO  → A single CoT call is enough
YES → Self-consistency (5 calls, majority vote)
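The decision tree translates directly into code (the function and parameter names are illustrative):

```python
def choose_cot_strategy(multi_step: bool, needs_specific_style: bool,
                        high_stakes: bool) -> str:
    """Walk the decision tree: direct answer, then zero- vs few-shot CoT,
    then optionally self-consistency for high-stakes calls."""
    if not multi_step:
        return "direct answer"
    strategy = "few-shot CoT" if needs_specific_style else "zero-shot CoT"
    if high_stakes:
        strategy += " + self-consistency (5 calls, majority vote)"
    return strategy

print(choose_cot_strategy(True, False, True))
# zero-shot CoT + self-consistency (5 calls, majority vote)
```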
Combining with Previous Techniques
CoT stacks with everything you’ve learned:

Ch 2 + CoT: Role + Context + Task + “Think step by step”
Ch 3 + CoT: Few-shot examples that include reasoning chains

These techniques are composable. The best prompts combine clear structure (Ch 2), good examples (Ch 3), and explicit reasoning (Ch 4).
Key insight: Chain-of-thought is the single most impactful technique for hard problems. It costs more tokens but dramatically reduces errors. Think of it as buying compute with tokens — you’re trading output length for output quality. For anything that requires reasoning, always default to CoT.