Ch 10 — Large Language Models

From GPT-1 to frontier models — pretraining, alignment, and emergent abilities
High Level
Tokenize → Pretrain → Scale → Align → Emerge → Deploy
Tokenization & Vocabulary
Turning text into numbers the model can process
What Is Tokenization?
LLMs don’t see words — they see tokens. Tokenization splits text into subword units using algorithms like Byte Pair Encoding (BPE). Common words become single tokens; rare words are split into pieces. Each token maps to an integer ID, which maps to a learned embedding vector.
# BPE tokenization example (GPT-4 tokenizer)
"Hello world"   → [9906, 1917]       (2 tokens)
"unhappiness"   → [359, 71, 7907]    (3: un + happ + iness)
"ChatGPT"       → [16047, 38, 2898]  (3: Chat + G + PT)
"¡Hola mundo!"  → [...]              (more tokens)

# ~1 token ≈ 4 characters in English
# Non-English text uses more tokens
Vocabulary Sizes
Model     Vocab Size   Method
GPT-2     50,257       BPE
GPT-3     50,257       BPE
GPT-4     100,277      BPE (cl100k)
Llama 2   32,000       SentencePiece
Llama 3   128,256      BPE (tiktoken)
Gemini    256,000      SentencePiece

# Larger vocab = fewer tokens per text
# = faster inference, but a bigger embedding matrix
Why BPE? Character-level is too granular (sequences too long). Word-level can’t handle unseen words. BPE finds the sweet spot: common words are single tokens, rare words decompose into known subwords. The vocabulary is built by iteratively merging the most frequent byte pairs in the training corpus.
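The merge procedure described above can be sketched in a few lines. This is a minimal, illustrative version (real tokenizers like GPT's operate on bytes, handle whitespace specially, and cache merges):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus]   # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = []
        for w in words:                    # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, words = bpe_merges(["low", "low", "lower", "lowest"], 2)
# First merge: ("l", "o"); second: ("lo", "w") → "low" becomes one token
```

On this tiny corpus the frequent stem "low" collapses into a single token after two merges, while the rarer suffixes "er" and "est" stay decomposed, which is exactly the sweet spot described above.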
Pretraining: Next-Token Prediction
The deceptively simple objective that powers all LLMs
The Objective
Pretraining is self-supervised: given a sequence of tokens, predict the next token. No human labels needed. The model reads trillions of tokens from the internet — books, code, Wikipedia, forums — and learns to predict what comes next. This simple objective forces the model to learn grammar, facts, reasoning, and world knowledge.
# Next-token prediction
Input:  "The capital of France is"
Target: "Paris"
Loss:   -log P("Paris" | "The capital of France is")

# Minimize cross-entropy loss over all tokens
# in trillions of training examples

Training data scale:
GPT-3:   300B tokens
Llama 2: 2T tokens
Llama 3: 15T tokens
GPT-4:   ~13T tokens (estimated)
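The per-token loss is just cross-entropy over the vocabulary. A minimal sketch with a toy three-word vocabulary (the logit values are made up for illustration):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction.
    logits: raw scores over the vocabulary; target_id: index of the true token."""
    m = max(logits)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    p_target = exps[target_id] / sum(exps)      # softmax probability of the target
    return -math.log(p_target)

# Toy vocabulary: 0="Paris", 1="London", 2="Berlin"
logits = [4.0, 1.0, 0.5]                        # model strongly favors "Paris"
loss = next_token_loss(logits, 0)               # small loss, ~0.08
```

A confident correct prediction gives a loss near zero; a uniform guess over the vocabulary gives log(vocab size). Pretraining minimizes the average of this quantity over every token position in the corpus.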
What the Model Learns
To predict the next token well, the model must learn:

Syntax: grammar rules, sentence structure
Semantics: word meanings, relationships
Facts: “Paris is the capital of France”
Reasoning: “If A > B and B > C, then A > C”
Code: programming patterns, APIs
Style: formal vs. casual, tone
The bitter lesson revisited: Next-token prediction seems too simple to produce intelligence. But at sufficient scale, this objective creates models that pass bar exams, write code, translate languages, and reason about novel problems. The simplicity of the objective is its strength — it scales without human bottlenecks.
Scaling Laws
Bigger models + more data + more compute = better performance
The Discovery
Kaplan et al. (2020) at OpenAI discovered that LLM loss follows power laws in three variables: model parameters (N), dataset size (D), and compute (C). Double any one → predictable improvement. This means you can forecast model quality before training — enabling billion-dollar investment decisions.
# Scaling laws (simplified)
L(N) ∝ N^(-0.076)   # loss vs parameters
L(D) ∝ D^(-0.095)   # loss vs data
L(C) ∝ C^(-0.050)   # loss vs compute

# Power laws = straight lines on log-log plots
# No sign of plateauing yet
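The forecasting power of these laws is easy to see numerically. Using the parameter exponent above, doubling model size shrinks loss by a fixed, predictable factor:

```python
# Predicted relative loss when doubling model size, using the
# Kaplan et al. parameter exponent (alpha_N ≈ 0.076).
alpha_N = 0.076
ratio = 2 ** (-alpha_N)   # L(2N) / L(N) ≈ 0.949

# Every doubling of parameters cuts loss by ~5%, regardless of
# the starting size: that is what makes budgets forecastable.
ten_doublings = ratio ** 10   # 1024x more params → ~40% lower loss
```

This is why labs can commit to a training run in advance: the expected loss at the target scale falls out of a straight-line extrapolation on a log-log plot.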
Chinchilla & Beyond
DeepMind’s Chinchilla (2022) showed GPT-3 was undertrained: for the same compute, a 70B model on 1.3T tokens beats a 280B model on 300B tokens. The rule: scale parameters and tokens equally. But Llama 3 (8B on 15T tokens) showed that overtrained small models are cheaper to serve — inference cost matters too.
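Two back-of-the-envelope rules make the Chinchilla argument concrete. Both are common approximations rather than exact results: ~20 training tokens per parameter as the compute-optimal ratio, and ~6 FLOPs per parameter per token for training cost:

```python
def chinchilla_optimal_tokens(n_params):
    """Rough compute-optimal token budget: ~20 tokens per parameter
    (a rule of thumb distilled from the Chinchilla paper)."""
    return 20 * n_params

def train_flops(n_params, n_tokens):
    """Standard estimate: training costs ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

tokens_70b = chinchilla_optimal_tokens(70e9)    # 70B params → ~1.4T tokens
gpt3_flops = train_flops(175e9, 300e9)          # ~3.15e23 FLOPs
```

By this arithmetic GPT-3 (175B params, 300B tokens, ~1.7 tokens per parameter) was far below the ~20:1 ratio, which is exactly the "undertrained" claim above, while Llama 3's 8B-on-15T recipe deliberately overshoots it for cheaper inference.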
The compute frontier: GPT-3 cost ~$4.6M to train. GPT-4 is estimated at $100M+. Frontier models in 2025 cost $500M–$1B. Each generation uses ~10x more compute. The question: how long can this scaling continue before hitting data, energy, or economic limits?
Alignment: From Base Model to Assistant
Instruction tuning + RLHF + DPO
The Three-Stage Pipeline
A pretrained LLM is a next-token predictor — it completes text but doesn’t follow instructions. Alignment transforms it into a helpful, harmless assistant through three stages:
Stage 1: Supervised Fine-Tuning (SFT)
  Train on (instruction, response) pairs
  ~10K–100K high-quality examples
  Human-written or distilled from stronger models
  Model learns to follow instructions

Stage 2: Reward Model Training
  Humans rank multiple model responses
  Train a reward model to predict human preferences
  RM(response) → scalar score

Stage 3: RLHF / DPO
  Optimize the LLM to maximize reward
  RLHF: PPO reinforcement learning
  DPO: direct preference optimization (simpler)
  Model becomes helpful AND safe
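The DPO variant of stage 3 is simple enough to sketch. For one preference pair it penalizes the policy when the rejected response gains probability relative to the chosen one, measured against a frozen reference model (scalar log-probs stand in for real model outputs here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).
    logp_w / logp_l: policy log-prob of the chosen (w) and rejected (l) response.
    ref_logp_*:      the same quantities under the frozen reference model.
    beta:            strength of the implicit KL penalty."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# If the policy equals the reference, the margin is 0 and loss = log(2);
# boosting the chosen response relative to the reference lowers the loss.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

The appeal over RLHF is visible in the signature: there is no reward model and no sampling loop, just a supervised loss over ranked pairs, which is why the slide calls it "simpler."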
Base Model
“Write a poem about cats” → continues with random text, Wikipedia-style facts, or more instructions. No concept of “following” the request.
Aligned Model
“Write a poem about cats” → produces a well-structured poem about cats. Refuses harmful requests. Admits uncertainty.
InstructGPT (2022) showed that a 1.3B aligned model was preferred over a 175B base model. Alignment is not just safety — it’s what makes LLMs useful. The aligned model follows instructions, stays on topic, and produces the format users expect.
Emergent Abilities
Capabilities that appear at scale without being explicitly trained
What Are Emergent Abilities?
Some capabilities appear only above a certain scale — they’re absent in small models and suddenly present in large ones. These include multi-step reasoning, arithmetic, code generation, and following complex instructions. Whether these are truly “emergent” or just hard to measure at small scale is debated.
# Abilities that improve with scale

In-Context Learning (ICL):
  Show examples in the prompt → model learns
  "cat:gato, dog:perro, house:" → "casa"
  No weight updates needed!

Chain-of-Thought (CoT):
  "Let's think step by step..."
  Breaks complex problems into sub-steps
  Dramatically improves math/reasoning

Instruction Following:
  "Explain quantum physics to a 5-year-old"
  Adapts tone, complexity, format
The GPT Timeline
GPT-1 (2018):    117M params, 4.6GB data. Showed pretraining + fine-tuning works
GPT-2 (2019):    1.5B params, 40GB data. Coherent paragraphs, "too dangerous" to release
GPT-3 (2020):    175B params, 300B tokens. In-context learning, few-shot prompting
ChatGPT (2022):  GPT-3.5 + RLHF. 100M users in 2 months
GPT-4 (2023):    ~1.8T params (MoE, estimated). Passes bar exam, multimodal
The debate: Schaeffer et al. (2023) argued emergent abilities are a “mirage” caused by metric choice: switch from discontinuous to continuous metrics and the sudden jump disappears. Others counter that some capabilities (like multi-digit multiplication) genuinely require a minimum model size. The truth likely lies somewhere in between.
How LLMs Generate Text
Sampling, temperature, and decoding strategies
Autoregressive Generation
LLMs generate text one token at a time. At each step, the model outputs a probability distribution over the entire vocabulary. A decoding strategy selects the next token. The chosen token is appended to the input, and the process repeats until a stop token or max length.
# Generation process
Input: "The best programming language is"

Step 1: P(next) = {Python: 0.35, Java: 0.15, C: 0.08, Rust: 0.07, ...}
        Pick: "Python" (sampled or greedy)
Step 2: Input = "...language is Python"
Step 3: P(next) = {because: 0.25, .: 0.15, ...}

# Repeat until <EOS> or max_tokens
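The loop itself is short. Here is a greedy version against a hypothetical toy "model" (a bigram lookup table standing in for a real network; `TABLE` and its contents are invented for illustration):

```python
def generate(next_logits, prompt, max_tokens=5, eos="<EOS>"):
    """Greedy autoregressive generation.
    next_logits(tokens) returns {token: score} for the next step."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = next_logits(tokens)
        nxt = max(dist, key=dist.get)   # greedy decoding: take the argmax
        if nxt == eos:
            break
        tokens.append(nxt)              # append and feed back in
    return tokens

# Hypothetical toy model: scores depend only on the last token
TABLE = {
    "is":     {"Python": 2.0, "Java": 1.0},
    "Python": {"<EOS>": 3.0, "because": 1.5},
}
out = generate(lambda t: TABLE.get(t[-1], {"<EOS>": 0.0}), ["language", "is"])
# → ["language", "is", "Python"]
```

Swapping the `max` for sampling from the distribution turns this into the stochastic decoding described next; the surrounding loop is unchanged.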
Decoding Strategies
Greedy: always pick the highest-probability token
  Deterministic but repetitive

Temperature sampling: P′ = softmax(logits / T)
  T=0: greedy   T=1: standard   T>1: creative
  Lower T = more focused, higher T = more random

Top-k: sample from the top k tokens only
  k=50 is common

Top-p (nucleus): sample from the smallest set whose cumulative probability ≥ p
  p=0.9 is common, adapts to context

Best practice: top-p=0.9 + temperature=0.7
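The strategies above combine naturally: temperature reshapes the distribution, then nucleus sampling truncates it. A self-contained sketch:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales before softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # stabilize the exponentials
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_sample(logits, p=0.9, temperature=0.7, rng=random):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample within it."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:                              # accumulate until ≥ p
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

When one token dominates, the nucleus collapses to a single candidate and the sampler behaves greedily; in flat distributions many tokens survive the cutoff, which is the "adapts to context" property noted above.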
The same model, different outputs: Temperature=0 gives deterministic, repeatable responses. Temperature=1.0 gives creative, varied text. This is why the same LLM can serve both code generation (low T) and creative writing (high T): the decoding strategy controls the behavior.
The LLM Ecosystem
Open vs. closed, fine-tuning, RAG, and agents
Closed-source (API access):
  GPT-4, Claude, Gemini
  Best performance, no weight access
  Pay per token, vendor lock-in

Open-weight:
  Llama 3, Mistral, Qwen, DeepSeek
  Download and run locally
  Fine-tune for your domain
  Full control, privacy

Fine-tuning approaches:
  Full fine-tune: update all weights (expensive)
  LoRA: update low-rank adapters (~1% of params)
  QLoRA: LoRA + 4-bit quantization (fits on 1 GPU)
Extending LLMs
RAG (Retrieval-Augmented Generation):
  Query → search knowledge base → inject relevant docs into prompt → generate
  Reduces hallucination, adds fresh knowledge

Tool Use / Function Calling:
  LLM decides when to call external tools
  Calculator, search, database, APIs
  Grounds the model in real data

Agents:
  LLM + tools + planning + memory
  Multi-step task execution
  ReAct, AutoGPT, coding agents
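The RAG flow reduces to "retrieve, then prepend." A toy sketch using word overlap as the retriever (real systems use dense embeddings and a vector index; the documents here are invented examples):

```python
def retrieve(query, docs, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_rag_prompt(query, docs):
    """Inject the retrieved context into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Llama 3 was trained on 15T tokens.",
    "BPE merges frequent byte pairs.",
    "Chinchilla scaled data and parameters equally.",
]
prompt = build_rag_prompt("How many tokens was Llama 3 trained on?", docs)
```

The model never needs the fact in its weights; the answer arrives through the prompt, which is how RAG sidesteps both hallucination and the knowledge cutoff.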
The trend: LLMs are becoming platforms, not just models. RAG adds knowledge, tools add capabilities, agents add autonomy. The model is the reasoning engine; everything else is infrastructure around it.
Limitations & Key Takeaways
What LLMs can’t do — and what they’ve changed forever
Known Limitations
Hallucination: Confidently generates false information. No reliable internal “I don’t know” signal.

Reasoning: Struggles with novel multi-step logic, especially math and planning.

Knowledge cutoff: Training data has a date boundary. No real-time knowledge without RAG/tools.

Context window: Limited input size (though growing: 128K–1M+ tokens).

Cost: Frontier models cost millions to train and significant compute to serve.
Key Takeaways
1. LLMs are decoder-only transformers trained on next-token prediction

2. Tokenization (BPE) converts text to subword tokens

3. Scaling laws predict performance from compute budget

4. Alignment (SFT + RLHF/DPO) transforms base models into assistants

5. Emergent abilities appear at scale: ICL, CoT, instruction following

6. Temperature and sampling control generation diversity

7. RAG, tools, and agents extend LLM capabilities beyond the weights
Coming up: Ch 11 explores Generative AI beyond text — image generation (diffusion models, DALL-E), video, audio, and the creative AI revolution.