Ch 1 — The Reasoning Gap

Why LLMs struggle with logic, math, and multi-step problems
High Level
LLM → Fast → Slow → Fail → CoT → Reason
LLMs Are Pattern Matchers, Not Reasoners
Next-token prediction vs. logical reasoning
The Core Problem
Large language models are trained to predict the next token. This makes them excellent at pattern matching, fluent text generation, and retrieving memorized knowledge. But reasoning — the ability to follow logical steps, perform multi-step computation, and arrive at correct conclusions through deduction — is fundamentally different from pattern matching.

When you ask an LLM “What is 23 × 47?”, it doesn’t compute the answer. It predicts what tokens are likely to follow that question based on training data. For simple problems it has seen many times, this works. For novel problems requiring genuine computation, it fails.

This is the reasoning gap: the difference between what LLMs appear to understand and what they can actually reason about.
Where LLMs Fail
// Tasks that expose the reasoning gap

Multi-step math:
  "If a train travels 60 mph for 2.5 hours,
   then 80 mph for 1.5 hours, what's the
   total distance?"
  // Requires: multiply, then add

Logical deduction:
  "All roses are flowers. Some flowers fade
   quickly. Can we conclude that some roses
   fade quickly?"
  // Answer: No (invalid syllogism)
  // LLMs often say Yes

Planning:
  "Move blocks A, B, C to form a specific
   tower configuration"
  // Requires: state tracking + search

Counting:
  "How many r's in 'strawberry'?"
  // Famously difficult for LLMs

Pattern:
  Easy tasks: memorized patterns
  Hard tasks: require actual reasoning
  // LLMs fake reasoning via patterns
Key insight: LLMs don’t reason — they pattern-match. When the pattern is familiar (common math problems), they appear to reason. When the problem is novel, the illusion breaks. The entire field of reasoning AI is about closing this gap.
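The invalid syllogism above can be checked mechanically with sets: build one consistent world where both premises hold but the conclusion fails. The specific flowers below are hypothetical, chosen only to construct the counterexample.

```python
# Counterexample to "All roses are flowers; some flowers fade quickly;
# therefore some roses fade quickly." One consistent world refutes it.

roses = {"rose"}
flowers = {"rose", "tulip"}       # all roses are flowers
fades_quickly = {"tulip"}         # some flowers fade quickly

premise_1 = roses <= flowers                  # "all roses are flowers" holds
premise_2 = bool(flowers & fades_quickly)     # "some flowers fade quickly" holds
conclusion = bool(roses & fades_quickly)      # "some roses fade quickly"?

assert premise_1 and premise_2 and not conclusion
print("Premises hold, conclusion fails -> invalid syllogism")
```

Humans and formal logic find this counterexample easily; a pattern-matcher that has seen many valid-looking syllogisms often does not.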
System 1 vs System 2 Thinking
Kahneman’s framework applied to AI
Two Systems
Daniel Kahneman’s “Thinking, Fast and Slow” describes two modes of human cognition:

System 1 — fast, automatic, intuitive. Recognizing faces, reading text, answering “2 + 2 = ?”. Requires no deliberate effort.

System 2 — slow, deliberate, analytical. Solving “17 × 24”, planning a route, writing a proof. Requires focused attention and step-by-step reasoning.

Standard LLMs are System 1 machines: they generate responses in a single forward pass, with no ability to “think harder” about difficult problems. They spend the same compute on “What color is the sky?” as on “Prove Fermat’s Last Theorem.”

The revolution in reasoning AI is about giving LLMs System 2 capabilities: the ability to slow down, think step by step, explore multiple paths, and verify their work.
System 1 vs System 2
// Kahneman's framework for AI

System 1 (Standard LLMs):
  Fast, single forward pass
  Pattern matching
  Same compute for easy and hard
  No self-correction
  // "What's 2+2?" = same effort as
  // "Prove the Riemann hypothesis"

System 2 (Reasoning Models):
  Slow, deliberate
  Step-by-step reasoning
  More compute for harder problems
  Self-verification and backtracking
  // Thinks harder when needed

How to Add System 2:
  Chain-of-Thought prompting
  Tree-of-Thought search
  Test-time compute scaling (o1/o3)
  Process reward models
  Tool use (calculators, code)

The Key Insight:
  System 2 = more compute at inference
  Not more parameters, more thinking
  // Trade inference cost for accuracy
Key insight: The fundamental shift in reasoning AI is from “bigger models” to “more thinking.” Instead of scaling parameters (training compute), we scale inference compute — letting models think longer on harder problems. This is the System 1 → System 2 transition.
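A toy sketch of the “more thinking, not more parameters” idea: a noisy solver stands in for one LLM sample, and majority voting over more samples converts extra inference compute into accuracy. The solver, its error model, and the sample budgets are all invented for this illustration.

```python
import random

def noisy_solver(true_answer, error_rate, rng):
    """One sampled answer: correct most of the time, off-by-one otherwise."""
    return true_answer if rng.random() > error_rate else true_answer + 1

def solve(true_answer, n_samples, error_rate=0.3, seed=0):
    """Sample n times and take the majority vote (self-consistency style)."""
    rng = random.Random(seed)
    votes = [noisy_solver(true_answer, error_rate, rng) for _ in range(n_samples)]
    return max(set(votes), key=votes.count)

# One sample = System 1-style single pass; many samples = System 2-style
# extra thinking. More samples, higher chance the majority is correct.
print(solve(42, n_samples=1))
print(solve(42, n_samples=25))
```

The accuracy gain comes purely from spending more inference compute on the same fixed "model" — the core trade behind test-time compute scaling.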
Types of Reasoning
Mathematical, logical, commonsense, and causal
Reasoning Categories
Not all reasoning is the same. LLMs struggle differently with each type:

Mathematical reasoning — arithmetic, algebra, word problems. LLMs fail on novel computations but can solve problems similar to training data. Chain-of-thought helps significantly.

Logical reasoning — deduction, induction, abduction. LLMs struggle with formal logic, especially negation and quantifiers (“all”, “some”, “none”).

Commonsense reasoning — understanding the physical world, social norms, and temporal relationships. LLMs are surprisingly good at this because it’s heavily represented in training data.

Causal reasoning — understanding cause and effect, counterfactuals. LLMs learn correlations, not causation.

Planning — multi-step goal-directed behavior with state tracking. One of the weakest areas for LLMs.
Reasoning Difficulty
// LLM reasoning capabilities

Commonsense:   ████████░░  Good
  "Is ice cream usually cold?" → Yes
  Well-represented in training data

Mathematical:  █████░░░░░  Mixed
  Simple: "5 × 3 = ?" → 15 ✓
  Complex: "∫x²dx from 0 to 3" → ?
  CoT helps dramatically

Logical:       ████░░░░░░  Weak
  "All A are B. All B are C.
   Are all A also C?" → Usually ✓
  "Not all A are B..." → Often ✗

Causal:        ███░░░░░░░  Weak
  Correlation ≠ causation
  Counterfactuals are hard

Planning:      ██░░░░░░░░  Very Weak
  State tracking fails
  Can't search solution spaces
  // Worst reasoning category
Key insight: LLMs are best at commonsense reasoning (pattern-heavy) and worst at planning (requires search and state tracking). The techniques in this course — CoT, ToT, tool use — target the weak areas by adding structure and computation to the reasoning process.
Why Standard Prompting Fails
The limitations of direct answer generation
Direct Prompting
Standard prompting asks the model to produce an answer directly: “Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. How many does he have now? A:”

The model must jump from question to answer in a single step. For simple problems, this works. But for multi-step problems, the model must perform all intermediate computations implicitly within its forward pass. This is like asking a human to solve a complex math problem entirely in their head, without writing anything down.

The problem: the model’s “working memory” is its hidden state, which has limited capacity. Complex reasoning requires more intermediate state than the model can maintain in a single forward pass.

The solution: let the model “write down its work” by generating intermediate reasoning steps as text.
Direct vs Chain-of-Thought
// Direct prompting (fails)
Q: Roger has 5 tennis balls. He buys 2 more
   cans of 3 tennis balls each. How many
   tennis balls does he have?
A: 11
// Correct! But only because it's simple

Q: A cafeteria had 23 apples. They used 20
   for lunch and bought 6 more. How many
   apples do they have?
A: 27
// Wrong! Should be 9

// Chain-of-thought (succeeds)
Q: Same question...
A: The cafeteria started with 23.
   They used 20, so 23 - 20 = 3.
   They bought 6 more, so 3 + 6 = 9.
   The answer is 9.
// Correct!

Why it works:
  Intermediate steps = external memory
  Each step is a simple computation
  Model "writes down its work"
Key insight: Chain-of-thought works because it converts one hard problem into many easy problems. Each intermediate step is simple enough for the model to handle in a single forward pass. The chain of text serves as “external working memory.”
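The cafeteria problem, decomposed the way a chain of thought decomposes it: each step is one trivial operation, and the running value plays the role of the “external working memory” a single forward pass lacks.

```python
# One hard problem -> two easy steps, with the intermediate
# result "written down" between them (the CoT trick).

def cafeteria_apples(start=23, used=20, bought=6):
    steps = []
    remaining = start - used              # step 1: 23 - 20 = 3
    steps.append(f"{start} - {used} = {remaining}")
    total = remaining + bought            # step 2: 3 + 6 = 9
    steps.append(f"{remaining} + {bought} = {total}")
    return total, steps

answer, work = cafeteria_apples()
print(work)    # ['23 - 20 = 3', '3 + 6 = 9']
print(answer)  # 9
```

Each line of `work` is the textual analogue of one reasoning step: trivial on its own, but jointly solving a problem the direct prompt got wrong.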
The Reasoning Revolution Timeline
From CoT prompting to o3: a brief history
Key Milestones
The field of LLM reasoning has evolved rapidly:

Jan 2022 — Wei et al. publish “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” The foundational paper.

Mar 2022 — Wang et al. introduce self-consistency: sample multiple reasoning paths, take a majority vote.

May 2022 — Kojima et al. discover zero-shot CoT: “Let’s think step by step” works without examples.

May 2023 — Yao et al. publish Tree-of-Thought: BFS/DFS over reasoning paths.

Sep 2024 — OpenAI releases o1: the first model trained specifically for reasoning with test-time compute scaling.

Jan 2025 — DeepSeek releases R1: an open-source reasoning model matching o1 performance.

Jan 2025 — OpenAI releases o3-mini: 14x cheaper than o1 with better math performance.
Timeline
// Reasoning AI timeline

2022:
  Jan: CoT prompting (Wei et al.)
  Mar: Self-consistency (Wang et al.)
  May: Zero-shot CoT (Kojima et al.)
  // The prompting era

2023:
  Feb: Toolformer (Schick et al.)
  May: Tree-of-Thought (Yao et al.)
  Nov: Let's Verify Step by Step (PRM)
  // Search + verification era

2024:
  Sep: OpenAI o1 released
  Dec: o1 system card published
  // Test-time compute era begins

2025:
  Jan: DeepSeek-R1 (open source!)
  Jan: OpenAI o3-mini (14x cheaper)
  Apr: o3 + o4-mini system card
  // Reasoning becomes accessible

Trend:
  Prompting → Search → Training
  External tricks → Built-in reasoning
Key insight: The field evolved from external tricks (prompting) to built-in reasoning (o1/o3). The trend is clear: reasoning is moving from something we add at inference time to something trained into the model itself. DeepSeek-R1 made this accessible to everyone.
Scaling Laws: Parameters vs Compute
Two dimensions of scaling for AI capability
Two Scaling Dimensions
AI capability can be improved along two dimensions:

Training compute (traditional scaling) — more parameters, more data, more training time. This is the approach that gave us GPT-3 → GPT-4. Diminishing returns: doubling parameters doesn’t double capability.

Test-time compute (reasoning scaling) — more computation at inference. Let the model think longer on harder problems. This is the approach behind o1/o3.

The key insight from OpenAI’s research: for reasoning tasks, scaling test-time compute can be more efficient than scaling model size. A smaller model that thinks longer can outperform a larger model that answers immediately.

This changes the economics of AI: instead of always needing bigger models, you can use smaller models with more inference compute for reasoning-heavy tasks.
Scaling Comparison
// Two dimensions of scaling

Training Compute Scaling:
  GPT-3 (175B) → GPT-4 (~1.8T?)
  More params = better at everything
  But: diminishing returns
  Cost: $100M+ per training run
  // Traditional approach

Test-Time Compute Scaling:
  Same model, more thinking time
  Easy question: think briefly
  Hard question: think extensively
  Cost: pay per query, scales with difficulty
  // The reasoning approach

Key Finding:
  For reasoning tasks:
  Small model + long thinking
  > Large model + quick answer
  // o3-mini beats o1 on math
  // at 14x lower cost

Implication:
  Not always "bigger is better"
  Sometimes "think harder" is better
  Adaptive compute per problem
Key insight: Test-time compute scaling is the most important paradigm shift in AI since the transformer. It means we can improve AI reasoning without training bigger models — just by letting existing models think longer. This democratizes access to reasoning capabilities.
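The new economics can be made concrete with the per-million-token prices quoted later in this chapter (o3-mini: $1.10 in / $4.40 out; o1: $15 / $60). The query size below is a hypothetical, chosen only to make the ratio visible.

```python
# Per-query cost at published per-million-token prices.
PRICES = {               # USD per million tokens: (input, output)
    "o3-mini": (1.10, 4.40),
    "o1": (15.00, 60.00),
}

def query_cost(model, input_tokens, output_tokens):
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Hypothetical reasoning-heavy query: 2k prompt tokens, 20k thinking/output.
cheap = query_cost("o3-mini", 2_000, 20_000)
pricey = query_cost("o1", 2_000, 20_000)
print(f"o3-mini: ${cheap:.4f}  o1: ${pricey:.4f}  ratio: {pricey / cheap:.1f}x")
# ratio works out to ~13.6x, i.e. the "14x cheaper" figure
```

The ratio is constant across query sizes here because both prices differ by the same factor on input and output tokens.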
The Reasoning Toolkit
Techniques covered in this course
Course Overview
This course covers the full spectrum of reasoning techniques:

Chain-of-Thought (Ch 2) — the foundational technique. Generate intermediate reasoning steps. Zero-shot and few-shot variants. Self-consistency for robustness.

Tree-of-Thought (Ch 3) — explore multiple reasoning paths using search algorithms (BFS, DFS, MCTS).

Test-Time Compute (Ch 4) — the o1/o3 approach. Train models to reason with “thinking tokens.” Scale compute at inference.

Verification (Ch 5) — process reward models (PRMs) that verify each reasoning step. Outcome reward models (ORMs).

Tool Use (Ch 6) — augment reasoning with code interpreters, calculators, and retrieval.

Benchmarks (Ch 7) — how we measure reasoning. GSM8K, MATH, ARC-AGI, GPQA.

Future (Ch 8) — open-source reasoning models, hybrid approaches, and what’s next.
Technique Spectrum
// Reasoning techniques spectrum

Prompting (no training needed):
  Chain-of-Thought (few-shot)
  Zero-shot CoT ("think step by step")
  Self-consistency (majority vote)
  // Cheapest, easiest to apply

Search (inference-time):
  Tree-of-Thought (BFS/DFS)
  Monte Carlo Tree Search
  Beam search over reasoning paths
  // More compute, better results

Verification (trained):
  Process Reward Models (PRM)
  Outcome Reward Models (ORM)
  Step-level verification
  // Learned quality checking

Built-in Reasoning (trained):
  OpenAI o1/o3 (thinking tokens)
  DeepSeek-R1 (open source)
  Test-time compute scaling
  // Most powerful, most expensive

Tool Augmentation:
  Code interpreters, calculators
  Retrieval, databases
  // Offload computation to tools
Key insight: These techniques are not mutually exclusive — they’re complementary. The most powerful reasoning systems combine multiple approaches: built-in reasoning (o1) + tool use (code interpreter) + verification (PRM) + search (best-of-N sampling).
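The simplest such combination is best-of-N sampling with a verifier: sample N candidate solutions, score each with a reward model, keep the best. Everything below is a stub for illustration; in practice the candidates would be LLM samples and the scorer a trained PRM or ORM.

```python
# Best-of-N: generation proposes, a verifier disposes.

def best_of_n(candidates, score):
    """Return the highest-scoring candidate solution."""
    return max(candidates, key=score)

# Stubbed candidates for "23 - 20 + 6" and a toy scorer that just
# checks the final digit of the answer line (a real ORM would be learned).
candidates = ["The answer is 27", "The answer is 9", "The answer is 3"]
score = lambda c: 1.0 if c.endswith("9") else 0.0

print(best_of_n(candidates, score))   # picks "The answer is 9"
```

Swap the sampler for o1-style generation, the scorer for a PRM, and the `max` for a tree search, and you have the skeleton of a modern reasoning system.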
Why This Matters Now
Reasoning is the frontier of AI capability
The Stakes
Reasoning is arguably the most important frontier in AI:

AGI connection — many researchers believe that robust reasoning is the key missing piece for artificial general intelligence. If we can make AI truly reason, we unlock capabilities far beyond pattern matching.

Practical impact — reasoning AI enables reliable code generation (plan before coding), scientific discovery (hypothesis generation and testing), mathematical proof (formal verification), medical diagnosis (multi-step differential diagnosis), and legal analysis (complex argument construction).

Economic shift — reasoning models change the cost structure of AI. Instead of paying for the biggest model, you pay for the right amount of thinking. o3-mini costs 14x less than o1 but outperforms it on math.

Democratization — DeepSeek-R1 (open source, Jan 2025) made reasoning capabilities available to everyone, not just OpenAI customers.
Impact Areas
// Why reasoning AI matters

Code Generation:
  Plan → implement → test → debug
  Not just autocomplete
  // Reasoning = reliable code

Science:
  Hypothesis → experiment → analyze
  AlphaFold, AlphaProof
  // AI as research partner

Mathematics:
  o3 scored 25.2% on FrontierMath
  (previous SOTA: ~2%)
  Formal proof verification
  // Approaching human-level

Medicine:
  Multi-step differential diagnosis
  Drug interaction reasoning
  // Where errors cost lives

Economics:
  o3-mini: $1.10/$4.40 per M tokens
  o1: $15/$60 per M tokens
  // 14x cheaper, better at math

Open Source:
  DeepSeek-R1: matches o1
  Fully open weights
  // Reasoning for everyone
Key insight: We are living through the most important shift in AI since the transformer: the transition from “predict the next token” to “reason about the problem.” This course will give you a deep understanding of how this works and where it’s going.