Ch 9 — How LLMs Generate Text

Autoregressive decoding, temperature, top-p, and the art of controlled randomness
The Autoregressive Loop
LLMs generate text one token at a time, feeding each output back as input
The Analogy
Imagine writing a story by typing one word, then reading everything you’ve written so far before typing the next word. That’s autoregressive generation. The model takes the entire context (prompt + generated tokens so far), runs it through all transformer layers, and produces a probability distribution over the vocabulary for the next token. It picks one token, appends it, and repeats. A 500-token response requires 500 forward passes through the model.
Key insight: This is why LLMs are slow: they’re fundamentally sequential. Each token depends on all previous tokens. You can’t generate token 100 without first generating tokens 1-99. For Llama 3 8B generating at ~50 tokens/second on an A100, a 500-token response takes ~10 seconds. The KV cache (Ch 10) avoids recomputing attention for previous tokens, but the sequential bottleneck remains.
The Generation Loop
def generate(model, prompt_ids, max_tokens):
    tokens = prompt_ids.clone()
    for _ in range(max_tokens):
        # Forward pass: all tokens → logits
        logits = model(tokens)
        # Only care about last position
        next_logits = logits[:, -1, :]
        # Convert to probabilities
        probs = torch.softmax(next_logits, dim=-1)
        # Pick next token (many strategies!)
        next_token = sample(probs)
        # Append and repeat
        tokens = torch.cat([tokens, next_token], dim=-1)
        # Stop if EOS token
        if next_token == eos_token:
            break
    return tokens

# The "sample" function is where all the
# decoding strategy magic happens...
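The loop above leaves `sample` abstract and assumes a PyTorch model. As a self-contained illustration of the same generate-append-repeat structure, here is a pure-Python sketch that uses a hand-built bigram table as the "model" (the table and token names are invented for the example; a real LLM conditions on the whole context, not just the last token):

```python
import random

# Toy "model": a bigram table mapping the last token to a
# next-token distribution, standing in for the transformer.
BIGRAMS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        # "Forward pass": look up the next-token distribution
        probs = BIGRAMS[tokens[-1]]
        # Sample one token from it (plain multinomial sampling)
        next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_token)   # feed the output back in
        if next_token == "</s>":    # stop at end-of-sequence
            break
    return tokens

print(generate(["<s>"]))
```

With a fixed seed the output is reproducible; changing the seed changes the sampled path, exactly as with a real sampling-based decoder.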
Greedy Decoding: Always Pick the Most Likely
Simple but repetitive
The Analogy
Greedy decoding is like always ordering the most popular dish at a restaurant. It’s safe and predictable, but you’ll eat the same thing every time. The model always picks the highest-probability token: argmax(probs). This produces deterministic output (same input = same output) but tends to be repetitive and boring for long text. It works well for factual tasks where there’s one right answer.
Key insight: Greedy decoding is optimal for short, factual outputs (like answering “What is 2+2?”). But for creative writing, it produces degenerate text: “The cat sat on the mat. The cat sat on the mat. The cat sat on the mat...” This happens because the most likely next token is often the same pattern repeating. Sampling introduces the randomness needed for diverse, natural text.
Greedy vs Sampling
# Greedy: always pick highest probability
next_token = torch.argmax(probs, dim=-1)

# Example probabilities for next token:
# "the"  → 0.35  ← greedy picks this
# "a"    → 0.20
# "this" → 0.15
# "my"   → 0.10
# ...    → 0.20 (rest of vocabulary)

# Sampling: randomly pick based on probs
next_token = torch.multinomial(probs, 1)
# 35% chance "the", 20% chance "a", etc.
# Different output each time!

# Greedy output: deterministic, repetitive
# "The city was beautiful. The city was
#  known for its beautiful architecture..."

# Sampled output: diverse, natural
# "The city was beautiful. Ancient spires
#  caught the morning light as we..."
Temperature: Controlling Randomness
The single most important generation parameter
The Analogy
Temperature is like a confidence dial. At T=0 (frozen), the model is maximally confident — always picks the top choice (greedy). At T=1 (normal), probabilities are used as-is. At T=2 (hot), the distribution flattens — unlikely tokens become more probable, producing wild, creative (sometimes nonsensical) text. Lower temperature = more focused and predictable. Higher = more creative and risky.
Key insight: Temperature works by dividing logits before softmax: P(token) = softmax(logits / T). At T→0, the distribution becomes a spike on the top token. At T→∞, it becomes uniform (all tokens equally likely). Most APIs default to T=0.7-1.0. For code generation, T=0.2 works well (near-deterministic). For creative writing, T=0.8-1.0. For brainstorming, T=1.2+.
Temperature in Action
# Temperature scaling:
#   scaled_logits = logits / temperature
#   probs = softmax(scaled_logits)

# Original logits: [3.0, 2.0, 1.0, 0.5]
# → softmax: [0.51, 0.19, 0.07, 0.04...]

# T=0.5 (cold): logits/0.5 = [6, 4, 2, 1]
# → softmax: [0.84, 0.11, 0.01, 0.005...]
# Very peaked — almost greedy

# T=1.0 (normal): unchanged
# → softmax: [0.51, 0.19, 0.07, 0.04...]

# T=2.0 (hot): logits/2 = [1.5, 1, 0.5, 0.25]
# → softmax: [0.32, 0.20, 0.12, 0.09...]
# Flatter — more randomness

def sample_with_temperature(logits, T=0.7):
    scaled = logits / T
    probs = torch.softmax(scaled, dim=-1)
    return torch.multinomial(probs, 1)
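To make the scaling concrete, here is a pure-Python check (no PyTorch). It uses only the four listed logits, so the exact numbers differ slightly from the figures above, which assume a longer vocabulary tail; the direction of the effect is the same:

```python
import math

def softmax(logits, T=1.0):
    """Softmax over temperature-scaled logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 2.0, 1.0, 0.5]
cold = softmax(logits, T=0.5)  # peaked: near-greedy
warm = softmax(logits, T=1.0)  # probabilities as-is
hot = softmax(logits, T=2.0)   # flatter: more randomness

# Lowering T concentrates mass on the top token;
# raising T spreads it out across the alternatives.
print(round(cold[0], 2), round(warm[0], 2), round(hot[0], 2))
# → 0.86 0.63 0.44
```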
Top-K Sampling: Limiting the Candidate Pool
Only consider the K most likely tokens
The Analogy
Imagine choosing a restaurant. With 10,000 options (full vocabulary), you might accidentally pick a terrible one. Top-K narrows it down: only consider the K best restaurants, then randomly pick from those. K=1 is greedy. K=50 gives variety while avoiding the worst options. K equal to the full vocabulary size is pure sampling. The problem: K is fixed, but sometimes there are 3 good options and sometimes 300.
Top-K Implementation
def top_k_sample(logits, k=50, T=0.7):
    scaled = logits / T
    # Keep only top-k logits
    top_k_vals, top_k_idx = scaled.topk(k)
    # Set everything else to -infinity
    filtered = torch.full_like(scaled, float('-inf'))
    filtered.scatter_(-1, top_k_idx, top_k_vals)
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, 1)

# Example: vocab of 128,256 tokens
# K=50: only consider top 50 tokens
# The other 128,206 get probability 0

# Problem with fixed K:
# "The capital of France is ___"
# → Only 1 good answer (Paris)
# → K=50 includes 49 bad options
# "I like to eat ___"
# → Hundreds of valid foods
# → K=50 might exclude good options
Top-P (Nucleus Sampling): Adaptive Filtering
The smart alternative to top-K
The Analogy
Instead of always considering exactly K options, top-p says: “Consider the smallest set of options that covers P% of the probability.” If one token has 95% probability, top-p=0.95 might only include 1 token. If the distribution is flat, it might include 200 tokens. It adapts to the situation. This is why top-p (also called “nucleus sampling”) is the most popular method in production.
Key insight: Most LLM APIs (OpenAI, Anthropic, etc.) use temperature + top-p as the default combination. Typical defaults: T=1.0, top_p=0.95. This means: use the model’s natural probabilities, but cut off the long tail of unlikely tokens. For deterministic output, set T=0 (which makes top-p irrelevant). For creative output, T=0.8 + top_p=0.95 is a good starting point.
Top-P Implementation
def top_p_sample(logits, p=0.95, T=1.0):
    scaled = logits / T
    probs = torch.softmax(scaled, dim=-1)
    # Sort by probability (descending)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    # Cumulative sum
    cumsum = sorted_probs.cumsum(dim=-1)
    # Remove tokens beyond threshold p
    mask = cumsum - sorted_probs > p
    sorted_probs[mask] = 0
    # Renormalize and sample
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)]

# Adaptive behavior:
# "Capital of France is ___"
# P(Paris)=0.97 → nucleus = {Paris}
# Only 1 token! (very confident)

# "I enjoy eating ___"
# P(pizza)=0.08, P(pasta)=0.07, ...
# Nucleus = {pizza, pasta, sushi, ...}
# ~50 tokens (uncertain, many valid)
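The adaptive nucleus size is easy to verify with a pure-Python sketch. The two example distributions below are invented: one peaked (a confident model) and one flat (an uncertain model with eight equally likely tokens):

```python
def nucleus_size(probs, p=0.95):
    """Size of the smallest token set covering at least p probability."""
    total, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        total += q
        n += 1
        if total >= p:
            break
    return n

# Confident distribution: one dominant token
peaked = [0.97, 0.01, 0.01, 0.005, 0.005]
# Uncertain distribution: eight equally likely tokens
flat = [1 / 8] * 8

print(nucleus_size(peaked))  # → 1
print(nucleus_size(flat))    # → 8
```

The same p=0.95 threshold yields a one-token nucleus in the first case and the whole candidate set in the second, which is exactly the adaptivity that a fixed K cannot provide.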
Beam Search: Looking Ahead
Maintaining multiple candidates simultaneously
The Analogy
Greedy search is like a chess player who only thinks one move ahead. Beam search thinks multiple moves ahead by keeping B candidate sequences (“beams”) alive simultaneously. At each step, it expands all beams, scores the results, and keeps the top B. It finds better global sequences but is slower (B× more computation). Used mainly for translation and summarization, not for chat.
Key insight: Beam search is deterministic and tends to produce “safe” but bland text. It was the standard for machine translation (2017-2020) but has been largely replaced by sampling methods for open-ended generation. Modern chat models almost never use beam search — temperature + top-p gives better subjective quality for conversational text.
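A minimal beam-search sketch over an invented bigram model shows the look-ahead effect: greedy would commit to the locally best first token ("the") and miss the globally more probable sequence starting with "a". All probabilities and tokens here are made up for illustration, and the model conditions only on the last token (a real decoder scores the full sequence):

```python
import math

# Toy next-token model: log-probabilities given the last token.
STEP = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"end": math.log(0.55), "cat": math.log(0.45)},
    "a":   {"cat": math.log(0.9), "end": math.log(0.1)},
    "cat": {"end": math.log(1.0)},
    "end": {},  # terminal token
}

def beam_search(B=2, max_steps=3):
    beams = [(["<s>"], 0.0)]  # (sequence, total log-prob)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            nexts = STEP[seq[-1]]
            if not nexts:  # finished sequence, carry it forward
                candidates.append((seq, score))
                continue
            for tok, lp in nexts.items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the B highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0]

best, score = beam_search()
print(best)  # → ['<s>', 'a', 'cat', 'end']
```

Greedy ends up with "the end" (probability 0.6 × 0.55 = 0.33), while the beam keeps the "a" branch alive long enough to find "a cat end" (0.4 × 0.9 × 1.0 = 0.36), the higher-probability sequence overall.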
When to Use What
# Decoding strategy cheat sheet:

# Greedy (T=0):
#   Best for: math, code, factual QA
#   Deterministic, no randomness

# Low temperature (T=0.2-0.5):
#   Best for: code generation, structured output
#   Mostly deterministic, slight variation

# Medium temperature (T=0.7-1.0) + top-p:
#   Best for: chat, general tasks
#   Natural, diverse, coherent

# High temperature (T=1.0-1.5):
#   Best for: creative writing, brainstorming
#   Very diverse, occasionally wild

# Beam search (B=4-8):
#   Best for: translation, summarization
#   Optimal sequences, but bland

# Common API defaults:
#   OpenAI:    T=1.0, top_p=1.0
#   Anthropic: T=1.0, top_p=0.999
#   Llama:     T=0.6, top_p=0.9
Speculative Decoding: Generating Faster
Use a small model to draft, a large model to verify
The Analogy
Imagine a junior writer drafts 5 sentences quickly, then a senior editor reviews them. If the editor agrees with 4 out of 5, you’ve saved time — the editor only needs to rewrite 1 sentence instead of writing all 5. Speculative decoding uses a small, fast “draft” model to generate K tokens, then the large model verifies them all in a single forward pass (parallel, not sequential). Accepted tokens are free; rejected ones get regenerated.
Key insight: Verification is parallel, but generation is sequential. The large model can check K draft tokens in one forward pass (same cost as generating 1 token). If the draft model agrees with the large model 80% of the time and K=5, you get ~4 tokens for the cost of 1 large-model pass. This gives 2-3× speedup with mathematically identical output to the large model alone.
How It Works
# Speculative decoding algorithm (pseudocode):
# Draft model:  Llama 3 1B  (fast)
# Target model: Llama 3 70B (slow, accurate)

for each generation step:
    # 1. Draft model generates K tokens
    draft_tokens = draft_model.generate(K=5)
    # Fast: 5 sequential passes of the 1B model

    # 2. Target model verifies ALL K at once
    target_probs = target_model(context + draft_tokens)
    # One parallel pass of the 70B model

    # 3. Accept/reject each draft token
    for i in range(K):
        if accept(draft_probs[i], target_probs[i]):
            keep token i  # free!
        else:
            resample from target  # fix it
            break  # discard rest of draft

# Result: 2-3× faster, EXACT same output
# Used by: vLLM, TensorRT-LLM, Apple MLX
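Under the simplifying assumption that each draft token is accepted independently with probability a, the expected number of tokens produced per target-model pass has a closed form. This sketch checks the ballpark figure quoted above (the true acceptance rate is position-dependent, so treat this as a rough model):

```python
def expected_tokens(a, K):
    """Expected tokens per target pass with acceptance rate a and K drafts.

    A pass yields between 1 token (first draft rejected, resampled)
    and K+1 tokens (all K accepted plus one bonus token):
        E = 1 + a + a^2 + ... + a^K = (1 - a^(K+1)) / (1 - a)
    """
    return (1 - a ** (K + 1)) / (1 - a)

print(round(expected_tokens(0.8, 5), 2))  # → 3.69
```

So with an 80% acceptance rate and K=5 drafts, each 70B pass yields about 3.7 tokens on average, which is where the "~4 tokens per pass" and "2-3× speedup" estimates come from.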
Chain-of-Thought: Thinking Before Answering
Test-time compute scaling — the newest frontier
The Analogy
When you face a hard math problem, you don’t blurt out the answer — you think step by step. Chain-of-thought (CoT) prompting makes LLMs do the same: generate intermediate reasoning steps before the final answer. OpenAI’s o1/o3 models take this further with test-time compute scaling: the model generates many reasoning chains, evaluates them, and picks the best. More thinking time = better answers.
Key insight: Test-time compute is a new scaling dimension (Ch 5). Instead of making the model bigger (more parameters) or training longer (more data), you let it think longer at inference time. DeepSeek-R1 generates thousands of reasoning tokens internally before answering. This trades inference cost for quality — and for hard problems (math, code, reasoning), it’s dramatically more effective than simply using a larger model.
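One simple form of test-time compute is self-consistency: sample several reasoning chains at nonzero temperature, keep only each chain's final answer, and take a majority vote. Here is a toy sketch; the sampled answers are mock data standing in for real model outputs:

```python
from collections import Counter

# Mock final answers extracted from five sampled reasoning chains
# for "What is 127 × 43?" (one chain made an arithmetic slip).
sampled_answers = ["5461", "5461", "5341", "5461", "5461"]

def majority_vote(answers):
    """Return the most common answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(sampled_answers))  # → 5461
```

Correct reasoning paths tend to converge on the same answer while errors scatter, so the vote filters out the occasional faulty chain at the cost of running the model several times.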
CoT and Test-Time Compute
# Standard generation:
# "What is 127 × 43?" → "5461"
# (often wrong — no intermediate steps)

# Chain-of-thought:
# "What is 127 × 43? Think step by step."
# → "127 × 40 = 5080
#    127 × 3  = 381
#    5080 + 381 = 5461"
# (correct — intermediate steps help!)

# Test-time compute (o1/o3, R1):
# 1. Generate multiple reasoning chains
# 2. Verify each chain internally
# 3. Select best answer
# 4. May use 10,000+ tokens of "thinking"
#    for a 100-token answer

# The tradeoff:
# Standard: fast, cheap, less accurate
# CoT: slower, more expensive, more accurate
# o1-style: very slow, very expensive,
#   dramatically more accurate on
#   math, code, and reasoning tasks