Ch 14 — Large Language Models: The New General-Purpose Technology

How LLMs are built, what they can and can’t do, and why they hallucinate
High Level: Data → Pre-train → Align → Emerge → Limits → Deploy
The Three-Stage Training Pipeline
How a raw neural network becomes ChatGPT
Stage 1: Pre-Training
The model reads trillions of words from the internet — books, articles, websites, code, forums — and learns to predict the next word. This is the most expensive stage (Chapter 12: $10M–$100M+). The result is a model with broad knowledge but no particular skill at following instructions or being helpful. It’s like a brilliant but unfocused intern who has read everything but doesn’t know what you want.
Stage 2: Supervised Fine-Tuning (SFT)
Human contractors write thousands of ideal question-answer pairs: “Summarize this article” with a perfect summary, “Write a professional email” with a polished draft. The model learns the format and style of helpful responses. This transforms it from a next-word predictor into an instruction follower.
Stage 3: RLHF
Reinforcement Learning from Human Feedback is the secret sauce. Human raters compare multiple model responses to the same prompt and rank them from best to worst. A “reward model” learns these preferences, and the LLM is then optimized to produce responses the reward model scores highly. This is what makes the model helpful, harmless, and honest — not just capable.
Key insight: RLHF is now the default alignment strategy, adopted by 70% of enterprises (up from 25% in 2023). It’s what separates a raw language model from a useful assistant. Without RLHF, GPT-4 would be a brilliant but unpredictable text generator. With it, the model learns to be helpful, refuse harmful requests, and acknowledge uncertainty.
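The ranking step in Stage 3 is typically trained with a pairwise preference loss: the reward model is pushed to score the human-preferred response above the rejected one. A minimal sketch in Python, assuming a Bradley-Terry-style loss (the reward values are invented for illustration):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss for training an RLHF reward model.

    The loss shrinks toward zero as the reward model scores the
    human-preferred response increasingly above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)  # -log(sigmoid(margin))

# A rater preferred response A (scored 2.0) over response B (scored 0.5):
loss_agree = preference_loss(2.0, 0.5)     # small: reward model agrees with the rater
loss_disagree = preference_loss(0.5, 2.0)  # large: reward model disagrees
```

Once the reward model is trained on many such comparisons, the LLM itself is optimized (e.g. with PPO) to produce responses the reward model scores highly.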
Emergent Abilities
Capabilities that appear at scale — without being explicitly trained
What Emergence Means
As language models scale up, they develop capabilities that weren’t explicitly trained and don’t exist in smaller models. GPT-2 (1.5B parameters) could generate coherent paragraphs. GPT-3 (175B) could suddenly do arithmetic, translate between languages it wasn’t trained on, and write code. GPT-4 (~1.8T) could pass the bar exam, reason about complex scenarios, and explain its reasoning step by step.
Examples of Emergence
Chain-of-thought reasoning — The ability to break complex problems into steps and reason through them, appearing only in models above ~100B parameters.
In-context learning — Learning new tasks from just a few examples in the prompt, without any retraining.
Code generation — Writing functional code from natural language descriptions.
Multi-step planning — Decomposing complex goals into actionable sub-tasks.
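In-context learning, in practice, is just prompt construction: you show the model a few input/output pairs and it infers the task, with no retraining. A minimal sketch (the example strings are invented for illustration):

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt from (input, output) example pairs.

    The model infers the task pattern from the examples alone --
    this is in-context learning, with no weight updates involved.
    """
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("happy", "positive"), ("terrible", "negative")],
    "delightful",
)
# The model completes the final "Output:" line, inferring the
# sentiment-labeling task purely from the two examples above.
```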
Why It Matters
Emergent abilities mean that the next generation of models may be capable of things we can’t predict based on current models. This has profound implications for planning: the AI capabilities available to your organization in 12 months may be qualitatively different from today, not just incrementally better. It also means benchmarks become obsolete quickly — a test that challenges today’s models may be trivial for next year’s.
Key insight: Emergence is both exciting and unsettling. It means AI capabilities are advancing faster than our ability to predict them. For executives, this argues for flexible AI strategies that can absorb new capabilities as they appear, rather than rigid plans built around today’s limitations. What’s impossible today may be routine in 18 months.
Context Windows: The Model’s Working Memory
How much information the model can consider at once
What a Context Window Is
The context window is the maximum amount of text a model can process in a single interaction — both your input and its output combined. It’s measured in tokens (roughly ¾ of a word). A 128K token window can hold approximately a 300-page book. A 1M token window can hold several books. Everything outside the context window is invisible to the model — it literally doesn’t exist for that interaction.
Current Landscape
GPT-4 Turbo — 128,000 tokens. Noticeable quality degradation near capacity.
Claude — 200,000 tokens standard; up to 1 million tokens in extended mode, with less than 5% accuracy degradation across the full window.
Gemini Pro — Up to 1 million tokens available.
Cohere Command-R+ — 128,000 tokens, optimized for retrieval tasks.
Advertised vs. Effective
A critical distinction: most models underperform their advertised context window. Research shows models typically become unreliable around 65% of their claimed capacity. A model advertising 128K tokens may deliver consistent quality only up to ~80K tokens. Information in the middle of very long contexts is often “lost” — the model pays more attention to the beginning and end.
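The arithmetic above is easy to operationalize. A back-of-the-envelope sketch, assuming the ~¾-word-per-token ratio and the ~65% effective-capacity rule of thumb from the text (real tokenizers give exact counts; this is a planning heuristic only):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4/3 tokens per word (a token is ~3/4 of a word)."""
    return round(len(text.split()) * 4 / 3)

def fits_reliably(text: str, advertised_window: int,
                  effective_ratio: float = 0.65) -> bool:
    """Check the document against the *effective* window (~65% of the
    advertised capacity), not the advertised number itself."""
    return estimate_tokens(text) <= advertised_window * effective_ratio

doc = "word " * 90_000  # a ~90,000-word document (~120K tokens)
fits_reliably(doc, advertised_window=128_000)  # False: ~120K tokens > ~83K effective
```

A document that nominally "fits" a 128K window can still land in the unreliable zone, which is exactly the advertised-vs-effective gap described above.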
Key insight: Context window size is one of the most important practical considerations when choosing an LLM. If your use case involves analyzing long documents (contracts, reports, codebases), you need a model with a large effective context window, not just a large advertised one. Ask vendors about performance at capacity, not just capacity itself.
Hallucinations: The Fundamental Limitation
Why LLMs confidently state things that aren’t true
What Hallucination Is
LLMs generate text by predicting the most likely next token based on patterns learned during training. They have no mechanism to verify whether what they’re saying is true. When the model encounters a question where its training data is sparse or contradictory, it doesn’t say “I don’t know” — it generates the most statistically plausible response, which may be entirely fabricated. It does this with the same confidence as when stating verified facts.
The Hallucination Spectrum
Hallucination rates vary dramatically by model and task. On the BullshitBench benchmark:

Best performers — ~3% rate of confident false statements (Claude Sonnet with high reasoning).
Mid-range — 5–10% false statement rates.
Concerning finding — Some reasoning-enhanced models actually hallucinate more, not less. Increased compute can enable better rationalization of false premises — the “Reasoning Paradox.”
Mitigation Strategies
Retrieval-Augmented Generation (RAG) — Ground the model’s responses in retrieved documents (Chapter 18).
Structured output — Constrain the model to output in specific formats with citations.
Human-in-the-loop — Use LLMs for drafting, not final decisions. A human reviews and approves.
Temperature control — Lower “temperature” settings make the model more conservative and less creative, reducing hallucination risk.
Multi-model verification — Cross-check critical outputs across multiple models.
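Temperature control, mentioned above, works by rescaling the model's next-token probability distribution before sampling. A self-contained sketch with invented logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert next-token logits to probabilities at a given temperature.

    Lower temperature sharpens the distribution toward the top token
    (more conservative output); higher temperature flattens it
    (more creative, and more prone to unlikely continuations).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=1.5)
# cold[0] > hot[0]: low temperature concentrates mass on the most likely token
```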
Critical for leaders: Hallucination is not a bug that will be “fixed” in the next version. It’s a fundamental property of how these models work — they generate plausible text, not verified truth. Any deployment in a domain where accuracy matters (legal, medical, financial, regulatory) must include verification mechanisms. Treat LLM output as a first draft from a knowledgeable but unreliable source.
The Model Landscape
Choosing between GPT, Claude, Gemini, LLaMA, and others
Closed-Source Leaders
OpenAI (GPT series) — The market pioneer. Strong at data extraction and general-purpose tasks. Largest ecosystem of integrations and developer tools.

Anthropic (Claude) — Leads in document analysis (94.2% accuracy), code review, and hallucination resistance. Largest effective context window. Strong safety focus.

Google (Gemini) — Best multimodal capabilities (text + image + video). Competitive pricing, especially the Flash tier ($0.075/1M tokens). Deep Google Cloud integration.
Open-Source Alternatives
Meta LLaMA — The leading open-source model family. Free to use, can be hosted on your own infrastructure, and fine-tuned on proprietary data. Rapidly closing the gap with closed-source models.
Mistral — European open-source models known for efficiency. Strong performance relative to size.
DeepSeek — Chinese open-source models achieving competitive performance at dramatically lower training costs ($5.6M vs. $100M for comparable quality).
Key insight: There is no single “best” LLM. The right choice depends on your specific use case, data sensitivity requirements, budget, and integration needs. Many enterprises use multiple models: a powerful closed-source model for complex reasoning, a cost-efficient model for high-volume simple tasks, and an open-source model for sensitive data that can’t leave your infrastructure.
Fine-Tuning vs. Foundation
When to customize and when to use off-the-shelf
Three Levels of Customization
1. Prompt engineering — Use the model as-is with carefully crafted instructions. Zero cost, immediate results. Sufficient for 60–70% of enterprise use cases.

2. Fine-tuning — Train the model on your specific data and tasks. Costs $1K–$50K. Improves domain expertise, output format, and consistency. Best when you need the model to adopt your organization’s terminology, style, or specialized knowledge.

3. Pre-training from scratch — Build a model from the ground up. Costs $10M+. Only justified for highly specialized domains (biotech, defense) or organizations with unique data at massive scale.
The Decision Framework
Start with prompting. If the model performs adequately with good prompts, stop there. Most organizations over-invest in fine-tuning when better prompts would suffice.

Fine-tune when: The model consistently fails on domain-specific tasks, you need a specific output format at scale, or you need to reduce per-query costs by using a smaller fine-tuned model instead of a larger general one.

Don’t fine-tune when: Your data is small (<1,000 examples), the task changes frequently, or the foundation model already performs well with prompting.
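The framework above can be encoded as a simple decision function. The thresholds come from the text; the function itself is an illustrative sketch, not a product:

```python
def customization_level(prompting_adequate: bool,
                        num_examples: int,
                        task_stable: bool) -> str:
    """Recommend a customization level per the decision framework.

    Start with prompting; fine-tune only when prompting consistently
    falls short AND you have enough stable training data.
    """
    if prompting_adequate:
        return "prompt engineering"  # stop here: flexible, immediate, free
    if num_examples < 1_000 or not task_stable:
        return "prompt engineering"  # too little data / moving target: don't fine-tune
    return "fine-tune"

customization_level(prompting_adequate=False, num_examples=5_000, task_stable=True)
# -> "fine-tune"
```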
Key insight: The most common mistake in enterprise AI is jumping to fine-tuning before exhausting prompt engineering. Fine-tuning is a commitment — it requires data preparation, training infrastructure, and ongoing maintenance. Prompting is flexible, immediate, and free. Always start there.
Scaling Laws & the Frontier
What gets better with scale — and what doesn’t
What Scaling Laws Tell Us
Research has established predictable relationships between model size, data volume, compute budget, and performance. More parameters + more data + more compute = better performance, following a power law. This predictability is what justified the billions invested in training frontier models — researchers could estimate in advance how much improvement a given investment would yield.
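The power law can be written down directly. A sketch using the parameter-scaling constants reported by Kaplan et al. (2020), L(N) = (N_c / N)^α with α ≈ 0.076 and N_c ≈ 8.8 × 10¹³; treat these as illustrative constants, not planning inputs:

```python
def predicted_loss(n_params: float,
                   alpha: float = 0.076,
                   n_c: float = 8.8e13) -> float:
    """Predicted pre-training loss as a power law in model size N.

    Loss falls smoothly as N grows -- this predictability is what let
    labs estimate the payoff of a training run before spending on it.
    """
    return (n_c / n_params) ** alpha

# A GPT-2-scale model (1.5B params) vs. a GPT-3-scale model (175B params):
predicted_loss(1.5e9) > predicted_loss(175e9)  # True: bigger model, lower predicted loss
```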
The Scaling Debate
The industry is divided on whether pure scaling will continue to deliver breakthroughs:

The optimists argue that scaling laws still hold and the next 10× in compute will produce another qualitative leap in capability.
The pragmatists note diminishing returns: RLHF scaling is less efficient than pre-training, and larger models benefit less from alignment with fixed reward models. The focus is shifting to data quality, inference-time compute, and architectural innovation.
Inference-Time Compute
A major trend: instead of making models bigger, make them think harder at inference time. Techniques like chain-of-thought prompting, tree-of-thought reasoning, and test-time compute scaling allow models to spend more computation on difficult problems. This is more cost-effective than training a larger model and can be applied selectively — simple queries get fast, cheap responses; complex queries get deeper reasoning.
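One concrete inference-time-compute technique is self-consistency: sample several independent reasoning paths for the same question and take the majority answer, spending more samples only on hard queries. A minimal sketch (the sampled answers are invented; in practice each would come from a separate model call):

```python
from collections import Counter

def self_consistency(sampled_answers: list[str]) -> str:
    """Return the majority answer across several sampled reasoning paths.

    More samples = more inference-time compute = higher accuracy on hard
    problems, without training a larger model.
    """
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled chains of thought arrived at these final answers:
self_consistency(["42", "42", "41", "42", "40"])  # -> "42"
```

Because the sample count is a runtime knob, it can be set per query: one sample for simple questions, many for complex ones.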
Key insight: The era of “just make it bigger” is evolving into “make it smarter.” For enterprise planning, this means AI capabilities will continue to improve rapidly, but through efficiency gains and architectural innovation, not just brute-force scaling. The models you use next year will be better and cheaper than today’s — a rare combination in technology.
The LLM Decision Framework
What every executive needs to evaluate
Five Questions for Every LLM Decision
1. What’s the accuracy requirement? — If errors have serious consequences (legal, medical, financial), you need verification layers. LLMs alone are insufficient for high-stakes accuracy.

2. Where does the data go? — Closed-source APIs send data to third-party servers. If your data is sensitive, consider open-source models on your own infrastructure or enterprise agreements with data isolation guarantees.

3. What’s the volume and latency? — High-volume, low-latency use cases need cost-efficient models. Complex, low-volume tasks justify premium models.
Five Questions (Continued)
4. How much customization is needed? — Start with prompting. Move to fine-tuning only if prompting consistently falls short. Pre-training from scratch is almost never justified.

5. What’s the total cost of ownership? — Include API costs at projected volume, integration development, monitoring, human review for critical outputs, and ongoing prompt/model maintenance. The API cost is often the smallest component.
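The total-cost-of-ownership point is easy to check with arithmetic. A sketch with invented volumes and rates (plug in your own numbers; the $3-per-million-tokens price here is illustrative, not a quote):

```python
def monthly_tco(queries_per_month: int,
                tokens_per_query: int,
                price_per_million_tokens: float,
                engineering_cost: float,
                review_cost: float) -> dict:
    """Rough monthly total cost of ownership for an LLM deployment."""
    api = queries_per_month * tokens_per_query / 1_000_000 * price_per_million_tokens
    return {
        "api": api,
        "engineering": engineering_cost,  # integration, monitoring, prompt maintenance
        "review": review_cost,            # human review of critical outputs
        "total": api + engineering_cost + review_cost,
    }

# 1M queries/month at 2K tokens each, $3 per 1M tokens:
cost = monthly_tco(1_000_000, 2_000, 3.0,
                   engineering_cost=25_000, review_cost=15_000)
# cost["api"] == 6000.0 -- often the smallest line item, as the text notes
```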
The bottom line: LLMs are the most versatile AI technology ever created. They can draft, analyze, translate, code, reason, and converse. But they are not oracles. They hallucinate, they have knowledge cutoffs, and they can be confidently wrong. The organizations that succeed with LLMs are those that deploy them as powerful assistants with human oversight, not as autonomous decision-makers. Treat them as a brilliant but unreliable colleague who always needs their work checked.