Ch 13 — Emergent Abilities & Limitations

What LLMs can surprisingly do, what they fundamentally can’t, and why the difference matters
Emergent Abilities: Surprise Capabilities
Skills that appear suddenly at scale — or do they?
The Analogy
Imagine heating water. At 99°C, nothing special. At 100°C, it suddenly boils — a phase transition. Early research (Wei et al., 2022) claimed LLMs show similar jumps: abilities like arithmetic, translation, and chain-of-thought reasoning that are absent in small models but suddenly appear in large ones. GPT-3 (175B) could do few-shot learning that GPT-2 (1.5B) couldn’t. This was called “emergence.”
Key insight: The emergence debate is nuanced. Schaeffer et al. (2023) argued that “emergent abilities are a mirage” — they appear sudden only because of how we measure (discrete accuracy metrics create artificial thresholds). With continuous metrics, improvement is smooth and predictable. The truth is likely in between: some capabilities do improve smoothly, but the practical utility of those capabilities can have sharp thresholds (e.g., 60% accuracy on math is useless, 90% is useful).
Claimed Emergent Abilities
# Abilities that "emerge" at scale:
# Few-shot learning (GPT-3, 175B)
#   → Give 3 examples, model generalizes
#   → Not present in GPT-2 (1.5B)
# Chain-of-thought reasoning (~100B+)
#   → "Think step by step" works
#   → Small models can't do this
# Multi-step arithmetic (~50B+)
#   → 3-digit addition, multiplication
# Code generation (~10B+)
#   → Write working programs from descriptions
#
# The debate:
# Wei et al. (2022): "Emergence is real"
# Schaeffer et al. (2023): "It's a mirage"
# Reality: smooth improvement, but sharp
#   practical utility thresholds
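The "mirage" argument can be made concrete with a toy calculation. Suppose each token of an L-token answer is correct with probability p, and p improves smoothly with scale. An all-or-nothing exact-match metric scores p**L, which shoots up sharply even though the underlying per-token skill grows gradually. The numbers below are illustrative, not from any paper:

```python
# Toy sketch: smooth per-token accuracy vs. sharp exact-match accuracy.
# If every token of an L-token answer is independently correct with
# probability p, the exact-match score is p**L — which looks like a
# sudden "phase transition" as p climbs smoothly.

L = 10  # assumed answer length in tokens

for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    exact_match = p ** L
    print(f"per-token acc {p:.2f} -> exact-match {exact_match:.3f}")
```

A model at p = 0.7 looks hopeless on exact match (under 3%), while p = 0.95 clears 50%: the discrete metric manufactures the apparent jump.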
Reasoning: Real or Simulated?
Can LLMs actually reason, or are they just pattern matching?
The Evidence
LLMs can solve math problems, write code, and pass bar exams. But do they reason or just pattern-match? Evidence for reasoning: they solve novel problems not in training data, chain-of-thought improves accuracy, and they can combine concepts in new ways. Evidence against: they fail on simple variations of problems they can solve, are sensitive to irrelevant details, and struggle with tasks requiring true logical deduction.
Key insight: The most honest answer: LLMs do something that looks like reasoning and is useful like reasoning, but may not be reasoning in the way humans do it. They’re incredibly powerful pattern matchers that have learned patterns complex enough to approximate reasoning on many tasks. Whether this constitutes “real” reasoning is partly a philosophical question. What matters practically: they’re reliable enough for many tasks but fail unpredictably on others.
Reasoning Failures
# Tasks LLMs handle well:
# ✓ Math (with CoT): GSM8K 95%+
# ✓ Code: HumanEval 90%+ (o1)
# ✓ Knowledge: MMLU 90%+
# ✓ Translation: near-human quality
#
# Tasks that reveal limitations:
# ✗ "I have 5 apples, eat 3, buy 2 more,
#    give half away. How many?" → often wrong
#    when steps are complex enough
# ✗ Spatial reasoning: "If I face north
#    and turn right twice, which way?"
# ✗ Counting: "How many r's in strawberry?"
#    (famously hard for tokenized models)
# ✗ Novel logic puzzles with no training
#    data analogues
# ✗ Consistent long-range planning
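The counting failure above is worth seeing side by side with ordinary code. A three-line program counts characters exactly; a token-based model never observes individual letters, only subword chunks. The token split shown is illustrative only, since actual splits vary by tokenizer:

```python
# Plain code operates on characters, so counting is trivial and exact.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # -> 3

# An LLM instead sees subword tokens (illustrative split, not from a
# real tokenizer), so "how many r's?" has no direct character signal.
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == word
```

This is why "use code for logic" is a recurring theme: character-level and arithmetic operations belong in deterministic code, not in sampled text.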
Hallucination: Confident Fabrication
Why LLMs make things up and how to mitigate it
The Analogy
An LLM is like a very confident storyteller who never says “I don’t know.” Ask about a real paper and it might cite the right authors with the wrong title, or invent a plausible-sounding paper that doesn’t exist. This happens because the model is trained to predict the most likely next token, not the most truthful next token. Plausible-sounding text is rewarded even if it’s factually wrong.
Key insight: Hallucination is not a bug that can be fully fixed — it’s a fundamental property of how LLMs work. The model generates text by sampling from probability distributions (Ch 9). It has no internal fact-checker, no database to verify against, and no concept of “truth” separate from “what text usually follows this context.” Mitigations: RAG (Ch 10), chain-of-thought verification, and training models to say “I don’t know” (alignment, Ch 8).
Types of Hallucination
# Hallucination types:
# 1. Factual: wrong facts stated confidently
#    "Einstein was born in 1880" (actually 1879)
# 2. Fabrication: invented entities
#    "The Smith et al. (2023) paper showed..."
#    (paper doesn't exist)
# 3. Inconsistency: contradicts itself
#    "X is true" then later "X is false"
# 4. Unfaithful: contradicts provided context
#    Given a document, summarizes incorrectly
#
# Mitigation strategies:
# RAG: ground responses in real documents
# CoT: verify reasoning step by step
# Low temperature: reduce randomness
# RLHF: train to say "I don't know"
# Citations: require source attribution
# Human review: for high-stakes outputs
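One of the mitigations above, low temperature, can be sketched in a few lines. Temperature rescales logits before the softmax: lowering it concentrates probability on the top token, reducing (but not eliminating) sampling-driven fabrication. The logits here are made-up numbers for illustration:

```python
import math

# Sketch of temperature scaling on a next-token distribution.
# Hypothetical logits for tokens completing "Einstein was born in 18__":
logits = {"79": 2.0, "80": 1.5, "82": 0.5}

def softmax(logits, temperature):
    # Divide logits by temperature, then normalize with exp.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / z for tok, v in scaled.items()}

for temp in (1.0, 0.2):
    probs = softmax(logits, temp)
    top = max(probs, key=probs.get)
    print(f"T={temp}: top token '{top}' with p={probs[top]:.3f}")
```

Note the limit of this mitigation: if the model's most likely token is itself wrong, low temperature just makes the model confidently wrong more consistently.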
Inverse Scaling: When Bigger Is Worse
Counterintuitive cases where larger models perform worse
The Phenomenon
Scaling laws (Ch 5) say bigger models are better. But on some tasks, larger models perform worse. On TruthfulQA, bigger models give more confident but less truthful answers — they’ve memorized common misconceptions more strongly. On tasks requiring overriding memorized knowledge (e.g., “if gravity were repulsive, what would happen?”), larger models are more anchored to memorized facts and less flexible.
Key insight: Inverse scaling reveals a fundamental tension: larger models are better at pattern matching but also more committed to learned patterns. They’re harder to “steer” away from memorized associations. This is why alignment (Ch 8) becomes more important, not less, as models scale. A more capable model that’s harder to control is potentially more dangerous than a less capable one.
Examples
# Inverse scaling examples:
# TruthfulQA (Lin et al., 2022):
#   GPT-2 (1.5B): 40% truthful
#   GPT-3 (175B): 28% truthful (!)
#   Larger model = more confident lies
#   (memorized popular misconceptions)
#
# Redefinition tasks:
#   "If π = 4, what is the area of a
#    circle with radius 3?"
#   Small model: might try 4 × 9 = 36
#   Large model: "28.27" (used real π!)
#   Larger model = more anchored to π=3.14
#
# Sycophancy (larger models):
#   User: "I think 2+2=5, right?"
#   Small model: "No, 2+2=4"
#   Large model: "You raise an interesting
#   point..." (agrees to please user)
Fundamental Limitations
What LLMs structurally cannot do
Architectural Limits
No real-time knowledge: training data has a cutoff date.
No persistent memory: each conversation starts fresh (without external tools).
No true understanding: the model manipulates tokens, not concepts.
No self-awareness: it doesn't know what it knows or doesn't know.
No grounding: it has never experienced the physical world.
Fixed compute per token: every token gets the same amount of processing, regardless of difficulty.
Key insight: The “fixed compute per token” limitation is particularly important. When you ask a hard math question, the model has the same number of FLOPs to generate each token of the answer as when generating “the.” Chain-of-thought (Ch 9) partially addresses this by spreading computation across more tokens, but it’s a workaround, not a solution. This is a fundamental architectural constraint of autoregressive transformers.
Limitation Map
# Structural limitations:
# 1. No world model
#    Can't simulate physics, causality
#    "What happens if I drop a ball?" → guesses
# 2. No persistent state
#    Forgets everything between API calls
#    Workaround: external memory, RAG
# 3. Tokenization artifacts
#    Can't count characters reliably
#    "How many r's in strawberry?" → wrong
#    Because "strawberry" = ["str","aw","berry"]
# 4. No backtracking
#    Once a token is generated, it's committed
#    Can't revise earlier reasoning
#    Workaround: CoT, but still left-to-right
# 5. Brittle to framing
#    Same problem, different wording → different
#    answer. Humans are robust to this.
# 6. Training data bias
#    Reflects biases in internet text
#    Partially mitigated by RLHF (Ch 8)
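The fixed-compute limitation can be quantified with the common back-of-envelope approximation that a dense transformer spends roughly 2N FLOPs per generated token for N parameters. Under that assumption, chain-of-thought doesn't make any single token "smarter"; it buys more total compute by emitting more tokens. The model size and token counts below are illustrative:

```python
# Sketch (assumed numbers): compute per token is constant, so the only
# way an autoregressive model "thinks longer" is by generating more
# tokens. Uses the rough 2*N FLOPs-per-token approximation.

N = 7e9                      # assumed parameter count (7B model)
flops_per_token = 2 * N      # same for "the" and for a hard math step

short_answer = 5             # tokens: "The answer is 42."
cot_answer = 200             # tokens: full step-by-step reasoning

print(f"direct answer: {short_answer * flops_per_token:.1e} FLOPs")
print(f"with CoT:      {cot_answer * flops_per_token:.1e} FLOPs")
```

Here CoT spends 40x the compute of the direct answer, which is exactly why it helps on multi-step problems, and exactly why it is a workaround rather than a fix: the extra compute is rationed per token, not allocated by difficulty.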
The Understanding Debate
Stochastic parrots vs. genuine comprehension
Two Perspectives
The field is divided. Skeptics (Bender & Koller, 2020; Marcus, 2022) argue LLMs are “stochastic parrots” — sophisticated autocomplete that manipulates form without understanding meaning. Optimists (Bubeck et al., 2023) argue GPT-4 shows “sparks of AGI” — genuine reasoning, planning, and creativity that go beyond pattern matching. The truth likely depends on how you define “understanding.”
Key insight: This debate matters practically. If LLMs are “just” pattern matchers, their failures are predictable (they fail when the pattern is novel). If they have some form of understanding, their failures are harder to predict. For building reliable systems, assume the worst case: treat LLMs as powerful but unreliable tools that need verification, guardrails, and human oversight for high-stakes decisions.
The Spectrum
# The understanding spectrum:
# "Stochastic parrot" view:
#   - No understanding, just statistics
#   - Impressive but fundamentally limited
#   - Will plateau without new architectures
#   Proponents: Bender, Marcus, LeCun
#
# "Sparks of AGI" view:
#   - Genuine reasoning and understanding
#   - Scaling will continue to unlock abilities
#   - We're on the path to AGI
#   Proponents: Bubeck, Altman, Sutskever
#
# Pragmatic middle ground:
#   - "Understanding" is a spectrum
#   - LLMs have SOME form of representation
#   - Not human-like, but not trivial either
#   - Focus on what they CAN do reliably
#   - Build systems with appropriate guardrails
Practical Takeaways
How to work effectively with LLM strengths and weaknesses
Best Practices
Understanding limitations makes you a better AI practitioner. Use LLMs for what they’re good at: drafting, summarization, translation, code generation, brainstorming. Add guardrails for what they’re bad at: factual accuracy (use RAG), reasoning (use CoT + verification), consistency (use structured output). Never trust without verification for high-stakes decisions.
Key insight: The most effective AI systems combine LLM strengths with traditional software strengths. LLM for language understanding + database for facts + code for logic + human review for judgment. This “compound AI system” approach is how production AI applications work. Understanding what LLMs can and can’t do is the most valuable skill in AI engineering.
The Reliability Matrix
# When to trust LLMs:
# HIGH reliability:
# ✓ Text summarization
# ✓ Translation
# ✓ Code explanation
# ✓ Creative writing
# ✓ Format conversion (JSON, CSV, etc.)
#
# MEDIUM reliability (verify!):
# ~ Code generation (test it!)
# ~ Math (use CoT + calculator)
# ~ Factual questions (use RAG)
# ~ Analysis (check reasoning)
#
# LOW reliability (always verify):
# ✗ Specific dates, numbers, citations
# ✗ Legal/medical advice
# ✗ Counting, character-level operations
# ✗ Novel logical puzzles
# ✗ Claims about itself ("I was trained...")
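The compound-system idea translates into code as routing: each task goes through the lightest verification that matches its reliability tier. This is a structural sketch only; `call_llm` is a hypothetical stub standing in for any real chat-completion API:

```python
# Sketch of a compound AI system: LLM for language, retrieval for
# facts, code/tests for logic. `call_llm` is a hypothetical stub.

def call_llm(prompt: str) -> str:
    return "placeholder response"  # stand-in for a real API call

def summarize(text: str) -> str:
    # HIGH reliability tier: usually safe with light human review.
    return call_llm(f"Summarize:\n{text}")

def generate_code(task: str) -> str:
    # MEDIUM reliability tier: never ship without running its tests.
    return call_llm(f"Write Python for: {task}")

def answer_fact(question: str, documents: list[str]) -> str:
    # LOW reliability tier unaided: ground in retrieved documents (RAG)
    # and forbid answers from parametric memory.
    context = "\n".join(documents)
    return call_llm(
        f"Answer ONLY from this context:\n{context}\n\nQ: {question}"
    )
```

The point of the structure is that the verification strategy is chosen per tier in code, rather than trusting one model uniformly across all task types.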