Ch 3 — Tree-of-Thought & Search

Yao et al. (2023), BFS/DFS over reasoning paths, Monte Carlo Tree Search for LLMs
From Chain to Tree
Why a single reasoning path isn’t enough
The Limitation of Chains
Chain-of-thought generates a single linear path from question to answer. This works for problems with straightforward solutions, but many problems require exploration: trying different approaches, evaluating partial solutions, and backtracking when a path leads nowhere.

Consider the “Game of 24”: given four numbers, use arithmetic operations to make 24. For example, with {4, 5, 6, 10}: one approach might try 4 × 6 = 24, but then you have unused numbers. Another might try 10 - 4 = 6, then 6 × 5 = 30, which doesn’t work. You need to explore multiple paths. Standard CoT commits to one path and can’t backtrack.

Tree-of-Thought (Yao et al., NeurIPS 2023) solves this by treating reasoning as a search problem: generate multiple candidate “thoughts” at each step, evaluate them, and use search algorithms (BFS, DFS) to find the best path.
Chain vs Tree
// Chain-of-Thought (linear)
Input → Step 1 → Step 2 → Answer
One path, no backtracking
If Step 1 is wrong, everything fails
// Like walking a single road

// Tree-of-Thought (branching)
Input
├─ Thought A
│  ├─ Thought A1 → ✗ (dead end)
│  └─ Thought A2 → ✓ Answer!
├─ Thought B
│  ├─ Thought B1 → ✗
│  └─ Thought B2 → ✗
└─ Thought C
   └─ Thought C1 → ✓ Answer!
Multiple paths, can backtrack
Evaluate before committing
// Like exploring a maze

Game of 24 Results:
  CoT (GPT-4): 4% success
  ToT (GPT-4): 74% success
  // 18.5x improvement!
Key insight: The jump from 4% to 74% on Game of 24 demonstrates that some problems are effectively out of reach for linear reasoning. They require search — the ability to explore, evaluate, and backtrack. ToT brings classical AI search to LLM reasoning.
account_tree
The ToT Framework
Four components: decompose, generate, evaluate, search
Framework Components
Tree-of-Thought has four key components:

1. Thought decomposition — define what constitutes a “thought” (a coherent unit of reasoning). For Game of 24, a thought is one arithmetic operation. For creative writing, it might be a paragraph plan. The granularity matters: too fine-grained creates too many branches; too coarse misses opportunities for correction.

2. Thought generation — at each node, generate k candidate next thoughts. Use the LLM to propose multiple continuations (e.g., k = 5 candidates).

3. State evaluation — use the LLM to evaluate each candidate thought: “Is this partial solution promising? Rate it as sure/maybe/impossible.” This is the heuristic that guides the search.

4. Search algorithm — use BFS or DFS to traverse the tree, guided by the evaluations. BFS explores breadth-first (all options at each level); DFS goes deep first and backtracks.
ToT Components
// Tree-of-Thought framework

1. Thought Decomposition:
   Define granularity of each "thought"
   Game of 24: one arithmetic operation
   Writing: one paragraph plan
   Crossword: one word placement
   // Task-specific design choice

2. Thought Generation:
   At each node, generate k candidates
   k = 5 (typical)
   Use LLM: "Given [state], what are possible next steps?"
   // LLM proposes, doesn't decide

3. State Evaluation:
   LLM evaluates each candidate:
   "Is this partial solution promising?"
   → "sure" / "maybe" / "impossible"
   // LLM as heuristic function

4. Search Algorithm:
   BFS: explore all options per level
   DFS: go deep, backtrack on failure
   Prune "impossible" branches early
   // Classical AI search + LLM
Key insight: ToT uses the LLM for two roles: (1) generating candidate thoughts (the “imagination”) and (2) evaluating them (the “judgment”). This separation of generation and evaluation is a powerful pattern that recurs throughout reasoning AI.
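The generate/evaluate separation can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `propose` and `value` stand in for the two LLM roles and are replaced here with deterministic stubs so the control flow runs end to end.

```python
# Sketch of the ToT loop: the LLM plays two roles, proposer and judge.
# `propose` and `value` are hypothetical stand-ins for LLM calls; here
# they are deterministic stubs so the skeleton is runnable.

def propose(state, k=3):
    """Generate k candidate next thoughts from a state (stub: append digits)."""
    return [state + str(i) for i in range(k)]

def value(state):
    """Rate a partial solution as 'sure' / 'maybe' / 'impossible' (stub)."""
    if state.endswith("0"):
        return "sure"
    return "maybe" if state.endswith("1") else "impossible"

def tot_step(frontier, beam=2):
    """One ToT level: generate candidates, judge them, prune, keep top beam."""
    rank = {"sure": 0, "maybe": 1, "impossible": 2}
    candidates = [c for s in frontier for c in propose(s)]
    alive = [c for c in candidates if value(c) != "impossible"]  # prune early
    alive.sort(key=lambda s: rank[value(s)])                     # best first
    return alive[:beam]

frontier = ["s"]
for _ in range(3):            # three levels of thought
    frontier = tot_step(frontier)
print(frontier)               # → ['s000', 's100']
```

Swapping the stubs for real LLM calls (a proposal prompt and a sure/maybe/impossible rating prompt) turns this skeleton into ToT with BFS.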
BFS vs DFS for Reasoning
Choosing the right search strategy
Search Strategies
Breadth-First Search (BFS) explores all candidate thoughts at each level before going deeper. At each step, keep the top-b most promising thoughts (beam search). Best for problems where you need to compare multiple approaches at each stage. Used by Yao et al. for Game of 24 (b = 5).

Depth-First Search (DFS) explores one path deeply before backtracking. It goes as far as possible along one branch, then backtracks when it hits a dead end or an “impossible” evaluation. Best for problems with deep solution paths where early pruning is effective. Used by Yao et al. for Mini Crosswords.

When to use which: BFS when the solution space is wide but shallow (many options at each step, few steps); DFS when it is narrow but deep (few options, many steps). In practice, BFS with beam search (keep top-b at each level) is the most common approach.
BFS vs DFS
// BFS (Breadth-First Search)
Level 0:  [Problem]
Level 1:  [A] [B] [C] [D] [E]
          Evaluate all → keep top b=3
Level 1': [A] [C] [E]
Level 2:  [A1][A2][C1][C2][E1][E2]
          Evaluate all → keep top b=3
... continue until solution
// Wide exploration, beam search
// Best for: Game of 24, puzzles

// DFS (Depth-First Search)
Start: [Problem]
→ Try [A]
  → Try [A1]
    → Try [A1a] → ✗ dead end
  → Backtrack to [A1]
    → Try [A1b] → ✓ solution!
// Deep exploration, backtracking
// Best for: writing, planning

Comparison:
  BFS: more LLM calls per level
  DFS: fewer calls, but may go deep
  BFS: better for wide problems
  DFS: better for deep problems
Key insight: The choice between BFS and DFS depends on the problem structure. For most reasoning tasks, BFS with beam search (keep top-b candidates at each level) is the default. DFS is better when you need deep exploration with early pruning.
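DFS with early pruning and backtracking fits in a few lines. In this sketch, `expand` and `judge` are hypothetical stand-ins for the LLM's proposal and evaluation calls, stubbed out on a toy goal (building the string "abc") so the backtracking behavior is visible and testable.

```python
# DFS over thoughts with early pruning. Toy goal: build the string "abc"
# one character at a time. `expand` and `judge` are hypothetical
# stand-ins for LLM generation and evaluation.

def expand(state):
    """Propose candidate next thoughts (stub: try three letters)."""
    return [state + ch for ch in "cab"]

def judge(state, goal="abc"):
    """Heuristic: 'impossible' once the state stops being a prefix of the goal."""
    return "maybe" if goal.startswith(state) else "impossible"

def dfs(state, goal="abc", trace=None):
    """Go deep along one branch; return to the caller (backtrack) on dead ends."""
    trace = trace if trace is not None else []
    trace.append(state)
    if state == goal:
        return state, trace
    for cand in expand(state):
        if judge(cand, goal) == "impossible":
            continue                      # prune: never descend into it
        found = dfs(cand, goal, trace)
        if found:
            return found
    return None                           # dead end → caller backtracks

solution, visited = dfs("")
print(solution)       # "abc"
print(len(visited))   # states actually expanded (pruning keeps this small)
```

Pruning is what makes DFS viable: without the `judge` check, the search would expand every branch instead of only prefixes of the goal.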
Monte Carlo Tree Search (MCTS)
The AlphaGo algorithm applied to LLM reasoning
MCTS for LLMs
Monte Carlo Tree Search is the algorithm that powered AlphaGo’s victory over Lee Sedol in 2016. It combines tree search with random sampling (rollouts) to efficiently explore large search spaces. Applied to LLM reasoning, MCTS works in four phases:

1. Selection — traverse the tree from the root to a leaf node, using UCB1 (Upper Confidence Bound) to balance exploration vs exploitation.
2. Expansion — at the leaf, use the LLM to generate candidate next thoughts.
3. Simulation (rollout) — from each candidate, use the LLM to quickly complete the reasoning to a final answer (possibly with greedy decoding).
4. Backpropagation — update the value estimates of all nodes along the path based on whether the rollout reached a correct answer.

MCTS is more compute-intensive than BFS/DFS but handles larger search spaces better. It is particularly effective when combined with a learned value function (as in AlphaGo).
MCTS Algorithm
// MCTS for LLM reasoning

Repeat N iterations:

1. SELECT:
   Traverse tree using UCB1
   UCB1 = value + c × √(ln(N)/n)
   // Balance exploit vs explore

2. EXPAND:
   At leaf node, LLM generates
   k candidate next thoughts
   // LLM as thought generator

3. SIMULATE (Rollout):
   From each candidate, LLM completes
   reasoning to final answer (greedy)
   Check if answer is correct
   // Quick evaluation via completion

4. BACKPROPAGATE:
   Update value estimates along path
   Correct answer → increase value
   Wrong answer → decrease value
   // Learn which paths are promising

vs BFS/DFS:
   More compute-intensive
   Better for large search spaces
   Naturally balances explore/exploit
// The AlphaGo approach for reasoning
Key insight: MCTS is the bridge between classical AI search and modern LLM reasoning. It’s the same algorithm that beat the world Go champion, now applied to mathematical reasoning and code generation. OpenAI’s o1 is believed to use MCTS-like search internally.
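The four phases can be demonstrated on a toy search problem. Everything here is illustrative: the task (reach 11 from 2 using +1 or ×2 in at most 4 moves) and the class/function names are invented for the sketch, and the rollout uses random moves where an LLM-based system would use model completions.

```python
import math, random

# Minimal MCTS sketch. Toy task: starting from 2, apply +1 or ×2 for up
# to 4 moves and land exactly on 11 (only ×2 first works). In a ToT/MCTS
# hybrid, expansion and rollout would be LLM calls; here they are random.

random.seed(0)
ACTIONS = [lambda x: x + 1, lambda x: x * 2]
GOAL, MAX_DEPTH = 11, 4

class Node:
    def __init__(self, state, depth, parent=None):
        self.state, self.depth, self.parent = state, depth, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):
    """Upper Confidence Bound: mean value plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")               # unvisited → explore first
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def rollout(state, depth):
    """3. SIMULATE: play random moves to the end; reward 1 iff goal is hit."""
    while depth < MAX_DEPTH and state < GOAL:
        state = random.choice(ACTIONS)(state)
        depth += 1
    return 1.0 if state == GOAL else 0.0

def mcts(root, iters=200):
    for _ in range(iters):
        node = root
        while node.children:                             # 1. SELECT via UCB1
            node = max(node.children, key=ucb1)
        if node.depth < MAX_DEPTH and node.visits > 0:   # 2. EXPAND
            node.children = [Node(a(node.state), node.depth + 1, node)
                             for a in ACTIONS]
            node = node.children[0]
        reward = rollout(node.state, node.depth)
        while node:                                      # 4. BACKPROPAGATE
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits)    # most-visited move

best = mcts(Node(2, 0))
print(best.state)   # the first move MCTS settles on
```

Because the +1 branch can never reach 11 within the move limit, its rollouts always return 0, and UCB1 steadily shifts visits toward the ×2 branch: exploration finds the signal, exploitation concentrates on it.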
Game of 24: A Case Study
How ToT achieves 74% where CoT gets 4%
The Task
The Game of 24 is a mathematical puzzle: given four numbers, use +, −, ×, ÷ to make exactly 24. Example: {1, 2, 3, 4} → 1 × 2 × 3 × 4 = 24.

This is a perfect test case for ToT because: (1) it requires search — there are many possible operation sequences, (2) it has verifiable answers — you can check if the result equals 24, (3) it requires backtracking — most paths don’t lead to 24.

Yao et al. used BFS with b = 5 (keep top 5 candidates at each level). Each “thought” is one arithmetic operation that combines two numbers. The LLM evaluates each partial state as “sure” (definitely solvable), “maybe” (possibly solvable), or “impossible” (can’t reach 24).

Results: GPT-4 with standard prompting solved 7.3%, CoT solved 4.0%, CoT + self-consistency (100 samples) solved 9.0%, and ToT solved 74%.
Game of 24 Example
// Game of 24: {4, 5, 6, 10}

Level 0: {4, 5, 6, 10}

Level 1 (generate 5 thoughts):
  A: 10 - 4 = 6  → {5, 6, 6}
  B: 5 × 4 = 20  → {6, 10, 20}
  C: 10 - 6 = 4  → {4, 4, 5}
  D: 6 - 4 = 2   → {2, 5, 10}
  E: 10 + 4 = 14 → {5, 6, 14}

Evaluate (LLM judges each):
  A: "maybe"  B: "maybe"  C: "maybe"
  D: "sure"   E: "maybe"
  Keep top b=5

Level 2 (from D: {2, 5, 10}):
  D1: 2 × 10 = 20 → {5, 20}
  D2: 5 × 2 = 10  → {10, 10}
  D3: 10 / 2 = 5  → {5, 5}
  ...

Level 3 (from D1: {5, 20}):
  D1a: 20 + 5 = 25 → ✗
  D1b: 20 - 5 = 15 → ✗
  ... every branch under D dead-ends
  // The "sure" rating was wrong — backtrack

From A: {5, 6, 6}:
  A1: 5 × 6 = 30 → {6, 30}
  A1a: 30 - 6 = 24 → ✓ solution!

// Found: 5 × 6 - (10 - 4) = 24
Key insight: The Game of 24 result is the most dramatic demonstration of search vs. linear reasoning. Self-consistency with 100 samples (100x compute) only reached 9%. ToT with BFS reached 74%. More samples of the same approach (linear) can’t match structured search.
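Because each “thought” is one operation combining two numbers, the entire search tree can be enumerated by a brute-force solver. This is a verification baseline only, not the paper’s method (ToT’s whole point is that an LLM cannot afford exhaustive enumeration); the function names are ours. Exact rational arithmetic avoids floating-point issues with division.

```python
from fractions import Fraction
from itertools import combinations

# Brute-force Game of 24 solver: each "thought" combines two numbers
# with one operation, mirroring the ToT decomposition above.

def solve24(nums, target=Fraction(24)):
    """Return one expression reaching the target, or None if unsolvable."""
    return _search([(Fraction(n), str(n)) for n in nums], target)

def _search(items, target):
    if len(items) == 1:
        val, expr = items[0]
        return expr if val == target else None
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        # All results of one "thought" (order matters for - and /):
        cands = [(a + b, f"({ea}+{eb})"), (a * b, f"({ea}*{eb})"),
                 (a - b, f"({ea}-{eb})"), (b - a, f"({eb}-{ea})")]
        if b != 0: cands.append((a / b, f"({ea}/{eb})"))
        if a != 0: cands.append((b / a, f"({eb}/{ea})"))
        for val, expr in cands:
            found = _search(rest + [(val, expr)], target)
            if found:
                return found
    return None

print(solve24([4, 5, 6, 10]))   # one valid expression, e.g. 5*6 - (10-4)
print(solve24([1, 1, 1, 1]))    # None — no way to reach 24
```

A solver like this is also how ToT experiments score themselves: the answer is cheap to verify even though it is expensive to find.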
Graph-of-Thought & Beyond
Extensions that go beyond tree structures
Beyond Trees
Trees are just the beginning. Researchers have explored richer reasoning structures:

Graph-of-Thought (Besta et al., 2023) — allows thoughts to merge and form cycles, not just branch. Two reasoning paths can converge when they reach the same intermediate conclusion. This models how humans reason: we combine insights from different angles.

Algorithm-of-Thought (Sel et al., 2023) — teaches LLMs to mimic algorithmic search patterns (like DFS) within a single prompt, without actually running multiple LLM calls.

Everything-of-Thought (Ding et al., 2023) — combines MCTS with an external “thought library” of reusable reasoning patterns.

Buffer-of-Thought (Yang et al., 2024) — maintains a library of high-level “thought templates” that can be retrieved and adapted for new problems.

The trend: increasingly sophisticated reasoning structures that combine LLM generation with classical AI search.
Reasoning Structures
// Evolution of reasoning structures

Chain-of-Thought:
  A → B → C → D → Answer
  Linear, no branching
  // Simplest

Tree-of-Thought:
      ┌─ B1 ─ C1
  A ──┤
      └─ B2 ─ C2 → Answer
  Branching, no merging
  // Search over branches

Graph-of-Thought:
      ┌─ B1 ─┐
  A ──┤      ├─ D → Answer
      └─ B2 ─┘
  Branching AND merging
  // Combine insights from paths

Algorithm-of-Thought:
  Single prompt mimics DFS/BFS
  No multiple LLM calls needed
  // Efficient but less powerful

Practical Recommendation:
  Simple tasks: CoT
  Search tasks: ToT with BFS
  Complex tasks: MCTS
  Research: Graph-of-Thought
Key insight: The field is converging on a common pattern: use LLMs for generation (proposing thoughts) and evaluation (judging quality), combined with classical AI search algorithms for exploration. The specific structure (chain, tree, graph) is a design choice based on the problem.
Cost & Efficiency Trade-offs
Search is powerful but expensive
The Cost of Search
Tree search dramatically improves reasoning, but at significant computational cost:

LLM calls multiply — ToT with BFS (b = 5, depth = 3) requires generating 5 × 3 = 15 thoughts, plus 15 evaluation calls, for roughly 30 LLM calls per problem. Compare to 1 call for standard prompting.

Token generation — each thought generation and evaluation produces tokens. Total token count can be 50–100x standard prompting.

Latency — sequential search (DFS) adds latency proportional to search depth. BFS can parallelize within a level but still requires multiple rounds.

Diminishing returns — increasing beam width or search depth has diminishing returns. Going from b = 5 to b = 10 rarely doubles accuracy.

Practical guidance: use search only when the problem genuinely requires exploration. For most reasoning tasks, CoT + self-consistency is sufficient. Reserve ToT/MCTS for search-heavy problems (puzzles, planning, code generation).
Cost Comparison
// Compute cost comparison

Standard Prompting:
  LLM calls: 1
  Tokens: ~100
  Latency: ~1s
  Cost: $0.001

CoT Prompting:
  LLM calls: 1
  Tokens: ~300 (reasoning + answer)
  Latency: ~2s
  Cost: $0.003

Self-Consistency (N=40):
  LLM calls: 40
  Tokens: ~12,000
  Latency: ~5s (parallel)
  Cost: $0.12

ToT (BFS, b=5, depth=3):
  LLM calls: ~30
  Tokens: ~10,000
  Latency: ~15s (sequential levels)
  Cost: $0.10

MCTS (100 iterations):
  LLM calls: ~200
  Tokens: ~50,000
  Latency: ~60s
  Cost: $0.50

// Use the cheapest method that works
Key insight: The practical rule: start with the cheapest approach (CoT) and only escalate to search when accuracy demands it. For most applications, CoT + self-consistency (5–10 samples) provides the best cost-accuracy trade-off. Reserve ToT/MCTS for problems that genuinely require exploration.
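The call counts above come from simple arithmetic, which can be made explicit. This is a back-of-the-envelope model following the section's own accounting (b thoughts generated plus b evaluated per level); the token counts and per-token price below are illustrative placeholders, not real provider rates.

```python
# Back-of-the-envelope LLM-call accounting for each method, following
# the section's own arithmetic. Prices/token counts are illustrative.

def calls_self_consistency(n):
    return n                          # one independent sample per call

def calls_tot_bfs(b, depth):
    return b * depth * 2              # generate + evaluate, per level

def calls_mcts(iters, calls_per_iter=2):
    return iters * calls_per_iter     # expand + rollout (rough estimate)

def cost_usd(calls, tokens_per_call, usd_per_1k_tokens=0.01):
    """Illustrative price model: flat rate per 1k tokens (placeholder)."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens

print(calls_tot_bfs(b=5, depth=3))     # 30 — matches the "~30 calls" above
print(calls_self_consistency(40))      # 40
print(calls_mcts(100))                 # 200
print(cost_usd(30, tokens_per_call=300))
```

Even this crude model makes the escalation rule concrete: each step up the ladder (CoT → self-consistency → ToT → MCTS) multiplies calls, so it has to buy a corresponding accuracy gain.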
From External Search to Learned Search
How ToT inspired o1 and built-in reasoning
The Evolution
ToT demonstrated that search dramatically improves LLM reasoning. But running search externally (orchestrating multiple LLM calls) is expensive and complex. The natural next step: train the search into the model itself. This is exactly what OpenAI did with o1:

External search (ToT) — an external program orchestrates multiple LLM calls, manages the tree, and runs the search algorithm. The LLM is a component.

Learned search (o1/o3) — the model is trained via reinforcement learning to perform search-like reasoning internally. It generates “thinking tokens” that explore different approaches, evaluate them, and backtrack — all within a single generation. The model has internalized the search process.

This is the key insight of Chapter 4 (Test-Time Compute Scaling): you can train a model to do what ToT does externally, but more efficiently and with learned heuristics rather than hand-designed search algorithms.
External vs Learned Search
// Evolution of search in reasoning

2022: External Prompting
  CoT: linear reasoning via prompts
  Self-consistency: sample + vote
  // No search, just sampling

2023: External Search
  ToT: BFS/DFS over LLM-generated
  thoughts, orchestrated externally
  MCTS: Monte Carlo rollouts
  // Search algorithm outside the model

2024: Learned Search
  o1: RL-trained to search internally
  "Thinking tokens" = internalized ToT
  Model learns when to explore,
  evaluate, and backtrack
  // Search algorithm inside the model

2025: Democratized
  DeepSeek-R1: open-source learned
  search, matching o1 performance
  // Everyone gets internal search

The Pattern:
  External technique → works well →
  Train it into the model →
  Model does it natively
  // Same pattern as CoT → o1
Key insight: The trajectory is clear: external search (ToT) proved the concept, then learned search (o1) made it practical. This is the same pattern as CoT: first we showed prompting works, then we trained it in. Next chapter: how test-time compute scaling makes this work.