Ch 14 — Large Language Models: The New General-Purpose Technology

How LLMs are built, what they can and can’t do, and why they hallucinate
High Level: Data → Pre-train → Align → Emerge → Limits → Deploy
The Three-Stage Training Pipeline
How a raw neural network becomes ChatGPT
Stage 1: Pre-Training
The model reads trillions of words from the internet — books, articles, websites, code, forums — and learns to predict the next word. This is the most expensive stage (Chapter 12: $10M–$100M+). The result is a model with broad knowledge but no particular skill at following instructions or being helpful. It’s like a brilliant but unfocused intern who has read everything but doesn’t know what you want.
Stage 2: Supervised Fine-Tuning (SFT)
Human contractors write thousands of ideal question-answer pairs: “Summarize this article” with a perfect summary, “Write a professional email” with a polished draft. The model learns the format and style of helpful responses. This transforms it from a next-word predictor into an instruction follower.
Stage 3: RLHF
Reinforcement Learning from Human Feedback is the secret sauce. Human raters compare multiple model responses to the same prompt and rank them from best to worst. A “reward model” learns these preferences, and the LLM is then optimized to produce responses the reward model scores highly. This is what makes the model helpful, harmless, and honest — not just capable.
Key insight: RLHF is now the default alignment strategy, adopted by 70% of enterprises (up from 25% in 2023). It’s what separates a raw language model from a useful assistant. Without RLHF, GPT-4 would be a brilliant but unpredictable text generator. With it, the model learns to be helpful, refuse harmful requests, and acknowledge uncertainty.
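The ranking step in Stage 3 is typically trained with a pairwise preference loss: the reward model is pushed to score the human-preferred response above the rejected one. A minimal sketch in Python, assuming a Bradley-Terry-style loss (the reward values are invented for illustration):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss for training an RLHF reward model.

    The loss shrinks toward zero as the reward model scores the
    human-preferred response increasingly above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)  # -log(sigmoid(margin))

# A rater preferred response A (scored 2.0) over response B (scored 0.5):
loss_agree = preference_loss(2.0, 0.5)     # small: reward model agrees with the rater
loss_disagree = preference_loss(0.5, 2.0)  # large: reward model disagrees
```

Once the reward model is trained on many such comparisons, the LLM itself is optimized (e.g. with PPO) to produce responses the reward model scores highly.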
Emergent Abilities
Capabilities that appear at scale — without being explicitly trained
What Emergence Means
As language models scale up, they develop capabilities that weren’t explicitly trained and don’t exist in smaller models. GPT-2 (1.5B parameters) could generate coherent paragraphs. GPT-3 (175B) could suddenly do arithmetic, translate between languages it wasn’t trained on, and write code. GPT-4 (~1.8T) could pass the bar exam, reason about complex scenarios, and explain its reasoning step by step.
Examples of Emergence
Chain-of-thought reasoning — The ability to break complex problems into steps and reason through them, appearing only in models above ~100B parameters.
In-context learning — Learning new tasks from just a few examples in the prompt, without any retraining.
Code generation — Writing functional code from natural language descriptions.
Multi-step planning — Decomposing complex goals into actionable sub-tasks.
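In-context learning, in practice, is just prompt construction: you show the model a few input/output pairs and it infers the task, with no retraining. A minimal sketch (the example strings are invented for illustration):

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt from (input, output) example pairs.

    The model infers the task pattern from the examples alone --
    this is in-context learning, with no weight updates involved.
    """
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("happy", "positive"), ("terrible", "negative")],
    "delightful",
)
# The model completes the final "Output:" line, inferring the
# sentiment-labeling task purely from the two examples above.
```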
Why It Matters
Emergent abilities mean that the next generation of models may be capable of things we can’t predict based on current models. This has profound implications for planning: the AI capabilities available to your organization in 12 months may be qualitatively different from today, not just incrementally better. It also means benchmarks become obsolete quickly — a test that challenges today’s models may be trivial for next year’s.
Key insight: Emergence is both exciting and unsettling. It means AI capabilities are advancing faster than our ability to predict them. For executives, this argues for flexible AI strategies that can absorb new capabilities as they appear, rather than rigid plans built around today’s limitations. What’s impossible today may be routine in 18 months.
Context Windows: The Model’s Working Memory
How much information the model can consider at once
What a Context Window Is
The context window is the maximum amount of text a model can process in a single interaction — both your input and its output combined. It’s measured in tokens (roughly ¾ of a word). A 128K token window can hold approximately a 300-page book. A 1M token window can hold several books. Everything outside the context window is invisible to the model — it literally doesn’t exist for that interaction.
Current Landscape
GPT-4 Turbo — 128,000 tokens. Noticeable quality degradation near capacity.
Claude — 200,000 tokens standard; up to 1 million tokens in extended mode, with less than 5% accuracy degradation across the full window.
Gemini Pro — Up to 1 million tokens available.
Cohere Command-R+ — 128,000 tokens, optimized for retrieval tasks.
Advertised vs. Effective
A critical distinction: most models underperform their advertised context window. Research shows models typically become unreliable around 65% of their claimed capacity. A model advertising 128K tokens may deliver consistent quality only up to ~80K tokens. Information in the middle of very long contexts is often “lost” — the model pays more attention to the beginning and end.
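The arithmetic above is easy to operationalize. A back-of-the-envelope sketch, assuming the ~¾-word-per-token ratio and the ~65% effective-capacity rule of thumb from the text (real tokenizers give exact counts; this is a planning heuristic only):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4/3 tokens per word (a token is ~3/4 of a word)."""
    return round(len(text.split()) * 4 / 3)

def fits_reliably(text: str, advertised_window: int,
                  effective_ratio: float = 0.65) -> bool:
    """Check the document against the *effective* window (~65% of the
    advertised capacity), not the advertised number itself."""
    return estimate_tokens(text) <= advertised_window * effective_ratio

doc = "word " * 90_000  # a ~90,000-word document (~120K tokens)
fits_reliably(doc, advertised_window=128_000)  # False: ~120K tokens > ~83K effective
```

A document that nominally "fits" a 128K window can still land in the unreliable zone, which is exactly the advertised-vs-effective gap described above.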
Key insight: Context window size is one of the most important practical considerations when choosing an LLM. If your use case involves analyzing long documents (contracts, reports, codebases), you need a model with a large effective context window, not just a large advertised one. Ask vendors about performance at capacity, not just capacity itself.
Hallucinations: The Fundamental Limitation
Why LLMs confidently state things that aren’t true
What Hallucination Is
LLMs generate text by predicting the most likely next token based on patterns learned during training. They have no mechanism to verify whether what they’re saying is true. When the model encounters a question where its training data is sparse or contradictory, it doesn’t say “I don’t know” — it generates the most statistically plausible response, which may be entirely fabricated. It does this with the same confidence as when stating verified facts.
The Hallucination Spectrum
Hallucination rates vary dramatically by model and task. On the BullshitBench benchmark:

Best performers — ~3% rate of confident false statements (Claude Sonnet with high reasoning).
Mid-range — 5–10% false statement rates.
Concerning finding — Some reasoning-enhanced models actually hallucinate more, not less. Increased compute can enable better rationalization of false premises — the “Reasoning Paradox.”
Mitigation Strategies
Retrieval-Augmented Generation (RAG) — Ground the model’s responses in retrieved documents (Chapter 18).
Structured output — Constrain the model to output in specific formats with citations.
Human-in-the-loop — Use LLMs for drafting, not final decisions. A human reviews and approves.
Temperature control — Lower “temperature” settings make the model more conservative and less creative, reducing hallucination risk.
Multi-model verification — Cross-check critical outputs across multiple models.
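Temperature control, mentioned above, works by rescaling the model's next-token probability distribution before sampling. A self-contained sketch with invented logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert next-token logits to probabilities at a given temperature.

    Lower temperature sharpens the distribution toward the top token
    (more conservative output); higher temperature flattens it
    (more creative, and more prone to unlikely continuations).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=1.5)
# cold[0] > hot[0]: low temperature concentrates mass on the most likely token
```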
Critical for leaders: Hallucination is not a bug that will be “fixed” in the next version. It’s a fundamental property of how these models work — they generate plausible text, not verified truth. Any deployment in a domain where accuracy matters (legal, medical, financial, regulatory) must include verification mechanisms. Treat LLM output as a first draft from a knowledgeable but unreliable source.
The Model Landscape
Choosing between GPT, Claude, Gemini, LLaMA, and others
Closed-Source Leaders
OpenAI (GPT series) — The market pioneer. Strong at data extraction and general-purpose tasks. Largest ecosystem of integrations and developer tools.

Anthropic (Claude) — Leads in document analysis (94.2% accuracy), code review, and hallucination resistance. Largest effective context window. Strong safety focus.

Google (Gemini) — Best multimodal capabilities (text + image + video). Competitive pricing, especially the Flash tier ($0.075/1M tokens). Deep Google Cloud integration.
Open-Source Alternatives
Meta LLaMA — The leading open-source model family. Free to use, can be hosted on your own infrastructure, and fine-tuned on proprietary data. Rapidly closing the gap with closed-source models.
Mistral — European open-source models known for efficiency. Strong performance relative to size.
DeepSeek — Chinese open-source models achieving competitive performance at dramatically lower training costs ($5.6M vs. $100M for comparable quality).
Key insight: There is no single “best” LLM. The right choice depends on your specific use case, data sensitivity requirements, budget, and integration needs. Many enterprises use multiple models: a powerful closed-source model for complex reasoning, a cost-efficient model for high-volume simple tasks, and an open-source model for sensitive data that can’t leave your infrastructure.
Fine-Tuning vs. Foundation
When to customize and when to use off-the-shelf
Three Levels of Customization
1. Prompt engineering — Use the model as-is with carefully crafted instructions. Zero cost, immediate results. Sufficient for 60–70% of enterprise use cases.

2. Fine-tuning — Train the model on your specific data and tasks. Costs $1K–$50K. Improves domain expertise, output format, and consistency. Best when you need the model to adopt your organization’s terminology, style, or specialized knowledge.

3. Pre-training from scratch — Build a model from the ground up. Costs $10M+. Only justified for highly specialized domains (biotech, defense) or organizations with unique data at massive scale.
The Decision Framework
Start with prompting. If the model performs adequately with good prompts, stop there. Most organizations over-invest in fine-tuning when better prompts would suffice.

Fine-tune when: The model consistently fails on domain-specific tasks, you need a specific output format at scale, or you need to reduce per-query costs by using a smaller fine-tuned model instead of a larger general one.

Don’t fine-tune when: Your data is small (<1,000 examples), the task changes frequently, or the foundation model already performs well with prompting.
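The framework above can be encoded as a simple decision function. The thresholds come from the text; the function itself is an illustrative sketch, not a product:

```python
def customization_level(prompting_adequate: bool,
                        num_examples: int,
                        task_stable: bool) -> str:
    """Recommend a customization level per the decision framework.

    Start with prompting; fine-tune only when prompting consistently
    falls short AND you have enough stable training data.
    """
    if prompting_adequate:
        return "prompt engineering"  # stop here: flexible, immediate, free
    if num_examples < 1_000 or not task_stable:
        return "prompt engineering"  # too little data / moving target: don't fine-tune
    return "fine-tune"

customization_level(prompting_adequate=False, num_examples=5_000, task_stable=True)
# -> "fine-tune"
```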
Key insight: The most common mistake in enterprise AI is jumping to fine-tuning before exhausting prompt engineering. Fine-tuning is a commitment — it requires data preparation, training infrastructure, and ongoing maintenance. Prompting is flexible, immediate, and free. Always start there.
Scaling Laws & the Frontier
What gets better with scale — and what doesn’t
What Scaling Laws Tell Us
Research has established predictable relationships between model size, data volume, compute budget, and performance. More parameters + more data + more compute = better performance, following a power law. This predictability is what justified the billions invested in training frontier models — researchers could estimate in advance how much improvement a given investment would yield.
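The power law can be written down directly. A sketch using the parameter-scaling constants reported by Kaplan et al. (2020), L(N) = (N_c / N)^α with α ≈ 0.076 and N_c ≈ 8.8 × 10¹³; treat these as illustrative constants, not planning inputs:

```python
def predicted_loss(n_params: float,
                   alpha: float = 0.076,
                   n_c: float = 8.8e13) -> float:
    """Predicted pre-training loss as a power law in model size N.

    Loss falls smoothly as N grows -- this predictability is what let
    labs estimate the payoff of a training run before spending on it.
    """
    return (n_c / n_params) ** alpha

# A GPT-2-scale model (1.5B params) vs. a GPT-3-scale model (175B params):
predicted_loss(1.5e9) > predicted_loss(175e9)  # True: bigger model, lower predicted loss
```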
The Scaling Debate
The industry is divided on whether pure scaling will continue to deliver breakthroughs:

The optimists argue that scaling laws still hold and the next 10× in compute will produce another qualitative leap in capability.
The pragmatists note diminishing returns: RLHF scaling is less efficient than pre-training, and larger models benefit less from alignment with fixed reward models. The focus is shifting to data quality, inference-time compute, and architectural innovation.
Inference-Time Compute
A major trend: instead of making models bigger, make them think harder at inference time. Techniques like chain-of-thought prompting, tree-of-thought reasoning, and test-time compute scaling allow models to spend more computation on difficult problems. This is more cost-effective than training a larger model and can be applied selectively — simple queries get fast, cheap responses; complex queries get deeper reasoning.
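One concrete inference-time-compute technique is self-consistency: sample several independent reasoning paths for the same question and take the majority answer, spending more samples only on hard queries. A minimal sketch (the sampled answers are invented; in practice each would come from a separate model call):

```python
from collections import Counter

def self_consistency(sampled_answers: list[str]) -> str:
    """Return the majority answer across several sampled reasoning paths.

    More samples = more inference-time compute = higher accuracy on hard
    problems, without training a larger model.
    """
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled chains of thought arrived at these final answers:
self_consistency(["42", "42", "41", "42", "40"])  # -> "42"
```

Because the sample count is a runtime knob, it can be set per query: one sample for simple questions, many for complex ones.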
Key insight: The era of “just make it bigger” is evolving into “make it smarter.” For enterprise planning, this means AI capabilities will continue to improve rapidly, but through efficiency gains and architectural innovation, not just brute-force scaling. The models you use next year will be better and cheaper than today’s — a rare combination in technology.
The LLM Decision Framework
What every executive needs to evaluate
Five Questions for Every LLM Decision
1. What’s the accuracy requirement? — If errors have serious consequences (legal, medical, financial), you need verification layers. LLMs alone are insufficient for high-stakes accuracy.

2. Where does the data go? — Closed-source APIs send data to third-party servers. If your data is sensitive, consider open-source models on your own infrastructure or enterprise agreements with data isolation guarantees.

3. What’s the volume and latency? — High-volume, low-latency use cases need cost-efficient models. Complex, low-volume tasks justify premium models.
Five Questions (Continued)
4. How much customization is needed? — Start with prompting. Move to fine-tuning only if prompting consistently falls short. Pre-training from scratch is almost never justified.

5. What’s the total cost of ownership? — Include API costs at projected volume, integration development, monitoring, human review for critical outputs, and ongoing prompt/model maintenance. The API cost is often the smallest component.
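The total-cost-of-ownership point is easy to check with arithmetic. A sketch with invented volumes and rates (plug in your own numbers; the $3-per-million-tokens price here is illustrative, not a quote):

```python
def monthly_tco(queries_per_month: int,
                tokens_per_query: int,
                price_per_million_tokens: float,
                engineering_cost: float,
                review_cost: float) -> dict:
    """Rough monthly total cost of ownership for an LLM deployment."""
    api = queries_per_month * tokens_per_query / 1_000_000 * price_per_million_tokens
    return {
        "api": api,
        "engineering": engineering_cost,  # integration, monitoring, prompt maintenance
        "review": review_cost,            # human review of critical outputs
        "total": api + engineering_cost + review_cost,
    }

# 1M queries/month at 2K tokens each, $3 per 1M tokens:
cost = monthly_tco(1_000_000, 2_000, 3.0,
                   engineering_cost=25_000, review_cost=15_000)
# cost["api"] == 6000.0 -- often the smallest line item, as the text notes
```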
The bottom line: LLMs are the most versatile AI technology ever created. They can draft, analyze, translate, code, reason, and converse. But they are not oracles. They hallucinate, they have knowledge cutoffs, and they can be confidently wrong. The organizations that succeed with LLMs are those that deploy them as powerful assistants with human oversight, not as autonomous decision-makers. Treat them as a brilliant but unreliable colleague who always needs their work checked.