Ch 10 — Large Language Models

From GPT-1 to frontier models — pretraining, alignment, and emergent abilities
High Level
Tokenize → Pretrain → Scale → Align → Emerge → Deploy
Tokenization & Vocabulary
Turning text into numbers the model can process
What Is Tokenization?
LLMs don’t see words — they see tokens. Tokenization splits text into subword units using algorithms like Byte Pair Encoding (BPE). Common words become single tokens; rare words are split into pieces. Each token maps to an integer ID, which maps to a learned embedding vector.
# BPE tokenization example (GPT-4 tokenizer)
"Hello world"   → [9906, 1917]       (2 tokens)
"unhappiness"   → [359, 71, 7907]    (3: un + happ + iness)
"ChatGPT"       → [16047, 38, 2898]  (3: Chat + G + PT)
"¡Hola mundo!"  → [...]              (more tokens)

# ~1 token ≈ 4 characters in English
# Non-English text uses more tokens
Vocabulary Sizes
Model     Vocab Size   Method
GPT-2     50,257       BPE
GPT-3     50,257       BPE
GPT-4     100,277      BPE (cl100k)
Llama 2   32,000       SentencePiece
Llama 3   128,256      BPE (tiktoken)
Gemini    256,000      SentencePiece

# Larger vocab = fewer tokens per text
# = faster inference, but a bigger embedding matrix
Why BPE? Character-level is too granular (sequences too long). Word-level can’t handle unseen words. BPE finds the sweet spot: common words are single tokens, rare words decompose into known subwords. The vocabulary is built by iteratively merging the most frequent byte pairs in the training corpus.
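The merge procedure described above can be sketched in a few lines. This is a minimal, illustrative version (real tokenizers like GPT's operate on bytes, handle whitespace specially, and cache merges):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus]   # start from character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = []
        for w in words:                    # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, words = bpe_merges(["low", "low", "lower", "lowest"], 2)
# First merge: ("l", "o"); second: ("lo", "w") → "low" becomes one token
```

On this tiny corpus the frequent stem "low" collapses into a single token after two merges, while the rarer suffixes "er" and "est" stay decomposed, which is exactly the sweet spot described above.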
Pretraining: Next-Token Prediction
The deceptively simple objective that powers all LLMs
The Objective
Pretraining is self-supervised: given a sequence of tokens, predict the next token. No human labels needed. The model reads trillions of tokens from the internet — books, code, Wikipedia, forums — and learns to predict what comes next. This simple objective forces the model to learn grammar, facts, reasoning, and world knowledge.
# Next-token prediction
Input:  "The capital of France is"
Target: "Paris"
Loss:   -log P("Paris" | "The capital of France is")

# Minimize cross-entropy loss over all tokens
# in trillions of training examples

Training data scale:
GPT-3:   300B tokens
Llama 2: 2T tokens
Llama 3: 15T tokens
GPT-4:   ~13T tokens (estimated)
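The per-token loss is just cross-entropy over the vocabulary. A minimal sketch with a toy three-word vocabulary (the logit values are made up for illustration):

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction.
    logits: raw scores over the vocabulary; target_id: index of the true token."""
    m = max(logits)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    p_target = exps[target_id] / sum(exps)      # softmax probability of the target
    return -math.log(p_target)

# Toy vocabulary: 0="Paris", 1="London", 2="Berlin"
logits = [4.0, 1.0, 0.5]                        # model strongly favors "Paris"
loss = next_token_loss(logits, 0)               # small loss, ~0.08
```

A confident correct prediction gives a loss near zero; a uniform guess over the vocabulary gives log(vocab size). Pretraining minimizes the average of this quantity over every token position in the corpus.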
What the Model Learns
To predict the next token well, the model must learn:

Syntax: grammar rules, sentence structure
Semantics: word meanings, relationships
Facts: “Paris is the capital of France”
Reasoning: “If A > B and B > C, then A > C”
Code: programming patterns, APIs
Style: formal vs. casual, tone
The bitter lesson revisited: Next-token prediction seems too simple to produce intelligence. But at sufficient scale, this objective creates models that pass bar exams, write code, translate languages, and reason about novel problems. The simplicity of the objective is its strength — it scales without human bottlenecks.
Scaling Laws
Bigger models + more data + more compute = better performance
The Discovery
Kaplan et al. (2020) at OpenAI discovered that LLM loss follows power laws in three variables: model parameters (N), dataset size (D), and compute (C). Double any one → predictable improvement. This means you can forecast model quality before training — enabling billion-dollar investment decisions.
# Scaling laws (simplified)
L(N) ∝ N^(-0.076)   # loss vs parameters
L(D) ∝ D^(-0.095)   # loss vs data
L(C) ∝ C^(-0.050)   # loss vs compute

# Power laws = straight lines on log-log plots
# No sign of plateauing yet
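The forecasting power of these laws is easy to see numerically. Using the parameter exponent above, doubling model size shrinks loss by a fixed, predictable factor:

```python
# Predicted relative loss when doubling model size, using the
# Kaplan et al. parameter exponent (alpha_N ≈ 0.076).
alpha_N = 0.076
ratio = 2 ** (-alpha_N)   # L(2N) / L(N) ≈ 0.949

# Every doubling of parameters cuts loss by ~5%, regardless of
# the starting size: that is what makes budgets forecastable.
ten_doublings = ratio ** 10   # 1024x more params → ~40% lower loss
```

This is why labs can commit to a training run in advance: the expected loss at the target scale falls out of a straight-line extrapolation on a log-log plot.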
Chinchilla & Beyond
DeepMind’s Chinchilla (2022) showed GPT-3 was undertrained: for the same compute, a 70B model on 1.3T tokens beats a 280B model on 300B tokens. The rule: scale parameters and tokens equally. But Llama 3 (8B on 15T tokens) showed that overtrained small models are cheaper to serve — inference cost matters too.
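Two back-of-the-envelope rules make the Chinchilla argument concrete. Both are common approximations rather than exact results: ~20 training tokens per parameter as the compute-optimal ratio, and ~6 FLOPs per parameter per token for training cost:

```python
def chinchilla_optimal_tokens(n_params):
    """Rough compute-optimal token budget: ~20 tokens per parameter
    (a rule of thumb distilled from the Chinchilla paper)."""
    return 20 * n_params

def train_flops(n_params, n_tokens):
    """Standard estimate: training costs ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

tokens_70b = chinchilla_optimal_tokens(70e9)    # 70B params → ~1.4T tokens
gpt3_flops = train_flops(175e9, 300e9)          # ~3.15e23 FLOPs
```

By this arithmetic GPT-3 (175B params, 300B tokens, ~1.7 tokens per parameter) was far below the ~20:1 ratio, which is exactly the "undertrained" claim above, while Llama 3's 8B-on-15T recipe deliberately overshoots it for cheaper inference.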
The compute frontier: GPT-3 cost ~$4.6M to train. GPT-4 is estimated at $100M+. Frontier models in 2025 cost $500M–$1B. Each generation uses ~10x more compute. The question: how long can this scaling continue before hitting data, energy, or economic limits?
Alignment: From Base Model to Assistant
Instruction tuning + RLHF + DPO
The Three-Stage Pipeline
A pretrained LLM is a next-token predictor — it completes text but doesn’t follow instructions. Alignment transforms it into a helpful, harmless assistant through three stages:
Stage 1: Supervised Fine-Tuning (SFT)
  Train on (instruction, response) pairs
  ~10K–100K high-quality examples
  Human-written or distilled from stronger models
  Model learns to follow instructions

Stage 2: Reward Model Training
  Humans rank multiple model responses
  Train a reward model to predict human preferences
  RM(response) → scalar score

Stage 3: RLHF / DPO
  Optimize the LLM to maximize reward
  RLHF: PPO reinforcement learning
  DPO: direct preference optimization (simpler)
  Model becomes helpful AND safe
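The DPO variant of stage 3 is simple enough to sketch. For one preference pair it penalizes the policy when the rejected response gains probability relative to the chosen one, measured against a frozen reference model (scalar log-probs stand in for real model outputs here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).
    logp_w / logp_l: policy log-prob of the chosen (w) and rejected (l) response.
    ref_logp_*:      the same quantities under the frozen reference model.
    beta:            strength of the implicit KL penalty."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# If the policy equals the reference, the margin is 0 and loss = log(2);
# boosting the chosen response relative to the reference lowers the loss.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0)
```

The appeal over RLHF is visible in the signature: there is no reward model and no sampling loop, just a supervised loss over ranked pairs, which is why the slide calls it "simpler."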
Base Model
“Write a poem about cats” → continues with random text, Wikipedia-style facts, or more instructions. No concept of “following” the request.
Aligned Model
“Write a poem about cats” → produces a well-structured poem about cats. Refuses harmful requests. Admits uncertainty.
InstructGPT (2022) showed that a 1.3B aligned model was preferred over a 175B base model. Alignment is not just safety — it’s what makes LLMs useful. The aligned model follows instructions, stays on topic, and produces the format users expect.
Emergent Abilities
Capabilities that appear at scale without being explicitly trained
What Are Emergent Abilities?
Some capabilities appear only above a certain scale — they’re absent in small models and suddenly present in large ones. These include multi-step reasoning, arithmetic, code generation, and following complex instructions. Whether these are truly “emergent” or just hard to measure at small scale is debated.
# Abilities that improve with scale

In-Context Learning (ICL):
  Show examples in the prompt → model learns
  "cat:gato, dog:perro, house:" → "casa"
  No weight updates needed!

Chain-of-Thought (CoT):
  "Let's think step by step..."
  Breaks complex problems into sub-steps
  Dramatically improves math/reasoning

Instruction Following:
  "Explain quantum physics to a 5-year-old"
  Adapts tone, complexity, format
The GPT Timeline
GPT-1 (2018):    117M params, 4.6GB data. Showed pretraining + fine-tuning works
GPT-2 (2019):    1.5B params, 40GB data. Coherent paragraphs, "too dangerous" to release
GPT-3 (2020):    175B params, 300B tokens. In-context learning, few-shot prompting
ChatGPT (2022):  GPT-3.5 + RLHF. 100M users in 2 months
GPT-4 (2023):    ~1.8T params (MoE, estimated). Passes bar exam, multimodal
The debate: Schaeffer et al. (2023) argued emergent abilities are a “mirage” caused by metric choice: switch from discontinuous to continuous metrics and the sudden jump disappears. Others counter that some capabilities (like multi-digit multiplication) genuinely require a minimum model size. The truth likely lies somewhere in between.
How LLMs Generate Text
Sampling, temperature, and decoding strategies
Autoregressive Generation
LLMs generate text one token at a time. At each step, the model outputs a probability distribution over the entire vocabulary. A decoding strategy selects the next token. The chosen token is appended to the input, and the process repeats until a stop token or max length.
# Generation process
Input: "The best programming language is"

Step 1: P(next) = {Python: 0.35, Java: 0.15, C: 0.08, Rust: 0.07, ...}
        Pick: "Python" (sampled or greedy)
Step 2: Input = "...language is Python"
Step 3: P(next) = {because: 0.25, .: 0.15, ...}

# Repeat until <EOS> or max_tokens
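The loop itself is short. Here is a greedy version against a hypothetical toy "model" (a bigram lookup table standing in for a real network; `TABLE` and its contents are invented for illustration):

```python
def generate(next_logits, prompt, max_tokens=5, eos="<EOS>"):
    """Greedy autoregressive generation.
    next_logits(tokens) returns {token: score} for the next step."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = next_logits(tokens)
        nxt = max(dist, key=dist.get)   # greedy decoding: take the argmax
        if nxt == eos:
            break
        tokens.append(nxt)              # append and feed back in
    return tokens

# Hypothetical toy model: scores depend only on the last token
TABLE = {
    "is":     {"Python": 2.0, "Java": 1.0},
    "Python": {"<EOS>": 3.0, "because": 1.5},
}
out = generate(lambda t: TABLE.get(t[-1], {"<EOS>": 0.0}), ["language", "is"])
# → ["language", "is", "Python"]
```

Swapping the `max` for sampling from the distribution turns this into the stochastic decoding described next; the surrounding loop is unchanged.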
Decoding Strategies
Greedy: always pick the highest-probability token
  Deterministic but repetitive

Temperature sampling: P′ = softmax(logits / T)
  T=0: greedy   T=1: standard   T>1: creative
  Lower T = more focused, higher T = more random

Top-k: sample from the top k tokens only
  k=50 is common

Top-p (nucleus): sample from the smallest set whose cumulative probability ≥ p
  p=0.9 is common, adapts to context

Best practice: top-p=0.9 + temperature=0.7
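The strategies above combine naturally: temperature reshapes the distribution, then nucleus sampling truncates it. A self-contained sketch:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales before softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # stabilize the exponentials
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_sample(logits, p=0.9, temperature=0.7, rng=random):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample within it."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:                              # accumulate until ≥ p
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

When one token dominates, the nucleus collapses to a single candidate and the sampler behaves greedily; in flat distributions many tokens survive the cutoff, which is the "adapts to context" property noted above.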
The same model, different outputs: Temperature=0 gives deterministic, repeatable responses. Temperature=1.0 gives creative, varied text. This is why the same LLM can serve both code generation (low T) and creative writing (high T): the decoding strategy controls the behavior.
The LLM Ecosystem
Open vs. closed, fine-tuning, RAG, and agents
Closed-source (API access):
  GPT-4, Claude, Gemini
  Best performance, no weight access
  Pay per token, vendor lock-in

Open-weight:
  Llama 3, Mistral, Qwen, DeepSeek
  Download and run locally
  Fine-tune for your domain
  Full control, privacy

Fine-tuning approaches:
  Full fine-tune: update all weights (expensive)
  LoRA: update low-rank adapters (~1% of params)
  QLoRA: LoRA + 4-bit quantization (fits on 1 GPU)
Extending LLMs
RAG (Retrieval-Augmented Generation):
  Query → search knowledge base → inject relevant docs into prompt → generate
  Reduces hallucination, adds fresh knowledge

Tool Use / Function Calling:
  LLM decides when to call external tools
  Calculator, search, database, APIs
  Grounds the model in real data

Agents:
  LLM + tools + planning + memory
  Multi-step task execution
  ReAct, AutoGPT, coding agents
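The RAG flow reduces to "retrieve, then prepend." A toy sketch using word overlap as the retriever (real systems use dense embeddings and a vector index; the documents here are invented examples):

```python
def retrieve(query, docs, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_rag_prompt(query, docs):
    """Inject the retrieved context into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Llama 3 was trained on 15T tokens.",
    "BPE merges frequent byte pairs.",
    "Chinchilla scaled data and parameters equally.",
]
prompt = build_rag_prompt("How many tokens was Llama 3 trained on?", docs)
```

The model never needs the fact in its weights; the answer arrives through the prompt, which is how RAG sidesteps both hallucination and the knowledge cutoff.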
The trend: LLMs are becoming platforms, not just models. RAG adds knowledge, tools add capabilities, agents add autonomy. The model is the reasoning engine; everything else is infrastructure around it.
Limitations & Key Takeaways
What LLMs can’t do — and what they’ve changed forever
Known Limitations
Hallucination: Confidently generates false information. No reliable internal “I don’t know” signal.

Reasoning: Struggles with novel multi-step logic, especially math and planning.

Knowledge cutoff: Training data has a date boundary. No real-time knowledge without RAG/tools.

Context window: Limited input size (though growing: 128K–1M+ tokens).

Cost: Frontier models cost millions to train and significant compute to serve.
Key Takeaways
1. LLMs are decoder-only transformers trained on next-token prediction

2. Tokenization (BPE) converts text to subword tokens

3. Scaling laws predict performance from compute budget

4. Alignment (SFT + RLHF/DPO) transforms base models into assistants

5. Emergent abilities appear at scale: ICL, CoT, instruction following

6. Temperature and sampling control generation diversity

7. RAG, tools, and agents extend LLM capabilities beyond the weights
Coming up: Ch 11 explores Generative AI beyond text — image generation (diffusion models, DALL-E), video, audio, and the creative AI revolution.