Ch 6 — Language Models & Generation

N-grams, perplexity, RNN/LSTM language models, and the path to neural text generation
High Level
N-gram → RNN/LSTM → Perplexity → Generate → Decode → LLMs
What Is a Language Model?
Predicting the next word — the foundation of all modern NLP
The Core Idea
A language model assigns probabilities to sequences of words. Given a context, it predicts what comes next. "The cat sat on the ___" — a good language model assigns high probability to "mat" and low probability to "democracy." Formally, a language model computes P(w1, w2, ..., wn) — the probability of a sequence. By the chain rule, this decomposes into: P(w1) × P(w2|w1) × P(w3|w1,w2) × ... Language models are the foundation of modern NLP. Spell checkers, autocomplete, machine translation, speech recognition, and every LLM from GPT to Claude are language models at their core. The entire history of NLP can be told through the evolution of language models: from counting word sequences to neural networks that generate human-quality text.
Language Model Basics
Core task: predict the next word.
"The cat sat on the ___"
- P(mat) = 0.15 (high)
- P(floor) = 0.08
- P(democracy) = 0.0001 (low)

Chain rule decomposition:
P("the cat sat") = P("the") × P("cat" | "the") × P("sat" | "the cat")

Applications: autocomplete and spell check, machine translation, speech recognition, text generation (GPT, Claude). Every LLM is a language model.
Key insight: The seemingly simple task of "predict the next word" turns out to require deep understanding of grammar, facts, reasoning, and world knowledge. This is why scaling language models produces increasingly capable AI systems.
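The chain-rule decomposition above can be sketched in a few lines of Python. The conditional probabilities here are made up for illustration; a real model would estimate them from data.

```python
# Chain-rule decomposition P(w1..wn) = Π P(wi | w1..wi-1),
# with made-up conditional probabilities for illustration.
cond = {
    (): {"the": 0.2},                # P("the")
    ("the",): {"cat": 0.01},         # P("cat" | "the")
    ("the", "cat"): {"sat": 0.05},   # P("sat" | "the cat")
}

def sequence_prob(words, cond):
    prob = 1.0
    for i, w in enumerate(words):
        # Multiply in P(current word | all previous words)
        prob *= cond[tuple(words[:i])][w]
    return prob

print(sequence_prob(["the", "cat", "sat"], cond))  # ≈ 0.0001 = 0.2 × 0.01 × 0.05
```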
N-gram Language Models
Count sequences of words — the statistical foundation
N-gram Models
N-gram models estimate the probability of a word based on the previous N−1 words. A bigram model uses one word of context: P("mat" | "the"). A trigram uses two: P("mat" | "on the"). Training is just counting: P("mat" | "on the") = count("on the mat") / count("on the"). N-gram models are fast, simple, and surprisingly effective for tasks like spell checking and speech recognition. But they have fundamental limitations. Sparsity: most n-grams never appear in training data, so their probability is zero. Smoothing techniques (Laplace, Kneser-Ney) redistribute probability mass to unseen n-grams. Limited context: even a 5-gram model only sees 4 previous words, missing long-range dependencies like "The doctor who treated the patient in the emergency room last Tuesday ___."
N-gram Examples
- Unigram (N=1): P(word), e.g. P("the") = 0.07, P("cat") = 0.001
- Bigram (N=2): P(word | prev_word), e.g. P("cat" | "the") = 0.01, P("sat" | "cat") = 0.005
- Trigram (N=3): P(word | prev_2_words), e.g. P("on" | "cat sat") = 0.15, P("mat" | "on the") = 0.02

Sparsity problem: "on the mat" appears 50 times; "on the xylophone" appears 0 times, so P("xylophone" | "on the") = 0.

Smoothing redistributes probability mass, giving a small probability to unseen n-grams. Kneser-Ney is the best-performing smoothing method.
Key insight: N-gram models reveal a fundamental tension in language modeling: more context improves predictions but makes sparsity worse. A 5-gram model is more accurate in theory but most 5-grams never appear in training data. Neural models solved this by learning continuous representations.
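Training-as-counting and add-one smoothing can be sketched on a toy corpus. The corpus and counts here are illustrative, and real systems would use Kneser-Ney rather than Laplace smoothing.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and their contexts from the toy corpus.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def bigram_mle(prev, word):
    # Maximum-likelihood estimate: count(prev word) / count(prev)
    return bigrams[(prev, word)] / contexts[prev]

def bigram_laplace(prev, word, vocab_size):
    # Add-one (Laplace) smoothing: unseen bigrams get small nonzero mass.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

vocab = len(set(corpus))
print(bigram_mle("the", "cat"))                   # 2/3: "the cat" twice, "the" 3x as context
print(bigram_mle("the", "xylophone"))             # 0.0: the sparsity problem
print(bigram_laplace("the", "xylophone", vocab))  # small but nonzero
```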
Perplexity
How to measure whether a language model is any good
Measuring Language Models
Perplexity is the standard metric for language models. Intuitively, it measures how "surprised" the model is by the test data. A perplexity of 100 means the model is as uncertain as if it were choosing uniformly among 100 words at each step. Lower perplexity = better model. Formally, perplexity is 2 raised to the power of the average negative log-probability: PPL = 2^(−(1/N) ∑ log2 P(wi | context)). A perfect model that always predicts the correct next word has perplexity 1. A model that assigns equal probability to a 50,000-word vocabulary has perplexity 50,000. Modern LLMs achieve perplexities of 10–30 on standard benchmarks, meaning they effectively narrow down the next word to 10–30 candidates. Perplexity is useful for comparing models on the same test set but doesn't directly measure generation quality.
Perplexity Examples
Perplexity = 2^(avg negative log prob). Intuition: "effective vocabulary size".
- PPL = 1: perfect prediction
- PPL = 10: choosing among ~10 words
- PPL = 100: choosing among ~100 words
- PPL = 50,000: random guessing over a 50,000-word vocabulary

Typical perplexities:
- Unigram model: ~1000
- Trigram + Kneser-Ney: ~80-100
- LSTM: ~50-70
- GPT-2 (1.5B): ~18-22
- GPT-3 (175B): ~10-15

Limitations: only comparable on the same test set; doesn't measure generation quality; low PPL doesn't guarantee good text.
Key insight: Perplexity measures how well a model predicts text, not how well it generates text. A model with low perplexity might still produce repetitive or incoherent output. Generation quality requires additional evaluation (human judgment, task-specific metrics).
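The perplexity formula can be checked directly. The probability lists below are illustrative: each entry is the probability the model assigned to the word that actually occurred.

```python
import math

def perplexity(probs):
    # probs: model probability assigned to each actual next word.
    # PPL = 2^(-(1/N) * sum(log2 p))
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A perfect model (probability 1 everywhere) has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))   # 1.0
# Uniform over a 50,000-word vocabulary: perplexity 50,000.
print(perplexity([1 / 50000] * 10))  # ≈ 50000
# Probability 0.1 at every step: "effectively choosing among 10 words".
print(perplexity([0.1, 0.1, 0.1]))   # ≈ 10
```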
Neural Language Models
RNNs and LSTMs — learning to predict words with neural networks
From Counting to Learning
Neural language models replaced counting with learned representations. Bengio's 2003 neural language model used a feed-forward network over word embeddings — the first to show that neural networks could outperform n-grams. Recurrent Neural Networks (RNNs) process text one word at a time, maintaining a hidden state that summarizes everything seen so far. Unlike n-grams, RNNs have no fixed context window — in theory, they can use the entire preceding text. In practice, vanilla RNNs suffer from the vanishing gradient problem: information from early words fades as the sequence grows. LSTMs (Long Short-Term Memory) solved this with gating mechanisms that control what to remember and what to forget. LSTM language models achieved perplexities of 50–70, cutting n-gram perplexity nearly in half.
Neural LM Evolution
2003: Bengio's neural LM
- Feed-forward network over embeddings
- Fixed context window (like n-grams)
- But: continuous representations

2010+: RNN language models
- Process one word at a time
- Hidden state = memory of past: h_t = f(h_{t-1}, x_t)
- Problem: vanishing gradients

2014+: LSTM language models
- Gates: forget, input, output
- Cell state carries long-range information
- PPL: 50-70 (vs n-gram ~100)

Key advantages over n-grams: no sparsity problem, unlimited context (in theory), learned word similarities.
Key insight: Neural language models solved the sparsity problem that plagued n-grams. Because they operate on continuous embeddings, similar words get similar predictions — even for word sequences never seen in training.
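A scalar toy version of the RNN update h_t = f(h_{t-1}, x_t) shows the intuition behind the vanishing-gradient problem: information from early inputs fades with each step. The weights here are made up, and real models use weight matrices rather than scalars.

```python
import math

# Minimal vanilla-RNN hidden-state update: h_t = tanh(W_h * h_prev + W_x * x_t),
# with scalar weights for readability (illustrative, not a real model).
W_h, W_x = 0.5, 1.0

def rnn_step(h_prev, x_t):
    return math.tanh(W_h * h_prev + W_x * x_t)

# A signal at the first step, then silence: the hidden state decays
# toward zero, so the early input is gradually "forgotten".
h = 0.0
for x in [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]:
    h = rnn_step(h, x)
print(h)  # small: the first input's influence has mostly faded
```

LSTMs address exactly this decay: their gated cell state can carry the early signal forward nearly unchanged when the forget gate stays open.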
Text Generation
Turning a language model into a text generator — one token at a time
Autoregressive Generation
A language model becomes a text generator through autoregressive generation: predict the next word, add it to the context, predict the next word again, repeat. Given "The cat", the model predicts "sat" (highest probability), yielding "The cat sat". Then it predicts "on" from "The cat sat", and so on. This simple loop is how every modern text generator works, from GPT to Claude. The quality of generated text depends on two things: the quality of the language model (how well it predicts) and the decoding strategy (how it chooses from the predicted distribution). Greedy decoding always picks the most probable word, but this produces repetitive, boring text. Better strategies introduce controlled randomness to produce diverse, natural-sounding output.
Autoregressive Loop
Generation loop, starting from prompt = "The cat":
- Step 1: P(next | "The cat"): "sat" 0.15, "is" 0.12, "ran" 0.08. Pick "sat" → "The cat sat"
- Step 2: P(next | "The cat sat"): "on" 0.20, "down" 0.10, "in" 0.08. Pick "on" → "The cat sat on"
- Step 3: P(next | "The cat sat on"): "the" 0.25, "a" 0.12, "my" 0.08. Pick "the" → "The cat sat on the"
- ... continue until the <EOS> token.

This is how GPT generates text: one token at a time, left to right.
Key insight: Autoregressive generation is inherently sequential — each token depends on all previous tokens. This is why LLM inference is slow: you can't parallelize the generation loop. Each token requires a full forward pass through the model.
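The loop above can be sketched with greedy decoding over a toy next-word table. The probabilities mirror the made-up example in the text; a real model would compute the distribution with a forward pass at each step.

```python
# Toy next-word distributions, keyed by the full context so far
# (illustrative numbers, matching the worked example).
next_word = {
    "The cat": {"sat": 0.15, "is": 0.12, "ran": 0.08},
    "The cat sat": {"on": 0.20, "down": 0.10, "in": 0.08},
    "The cat sat on": {"the": 0.25, "a": 0.12, "my": 0.08},
    "The cat sat on the": {"<EOS>": 1.0},
}

def generate(prompt):
    text = prompt
    while True:
        dist = next_word[text]               # "forward pass": look up distribution
        word = max(dist, key=dist.get)       # greedy: pick the argmax token
        if word == "<EOS>":
            return text
        text = text + " " + word             # append and repeat: autoregression

print(generate("The cat"))  # "The cat sat on the"
```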
Decoding Strategies
Greedy, beam search, top-k, and nucleus sampling — how to choose the next word
Choosing from the Distribution
The language model outputs a probability distribution over the vocabulary at each step. The decoding strategy determines how to select from this distribution. Greedy decoding always picks the highest-probability token — fast but produces repetitive, generic text. Beam search maintains the top-k candidate sequences at each step, finding globally better sequences than greedy — good for translation and summarization but still tends toward safe, boring output. Temperature sampling scales the logits before softmax: temperature < 1 makes the distribution sharper (more deterministic), temperature > 1 makes it flatter (more random). Top-k sampling restricts sampling to the k most probable tokens. Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds p, adapting the candidate pool to model confidence.
Decoding Strategies
- Greedy: always pick the argmax. Repetitive, generic, boring.
- Beam search (beam_size=5): keep the top 5 candidate sequences at each step. Good for translation and summarization; bad for creative/open-ended text.
- Temperature: T=0.1 very deterministic (sharp), T=1.0 original distribution, T=2.0 very random (flat).
- Top-k (k=50): sample from the top 50 tokens only; fixed candidate pool size.
- Top-p / nucleus (p=0.9): sample from the smallest set with cumulative probability ≥ 0.9. Adaptive: small set when confident, large set when uncertain.
Key insight: Top-p (nucleus) sampling is the modern default because it adapts to model confidence. When the model is sure ("The capital of France is ___"), the nucleus is tiny. When it's uncertain ("The best movie is ___"), the nucleus expands to allow diversity.
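Minimal sketches of temperature scaling, top-k, and top-p filtering, assuming a list of logits for temperature and a probability list for the other two. All numbers are illustrative; real decoders sample from the filtered distribution rather than just printing it.

```python
import math

def temperature(logits, T):
    # Scale logits, then softmax: T < 1 sharpens, T > 1 flattens.
    scaled = [l / T for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_k(probs, k):
    # Keep the k most probable tokens, renormalize, zero the rest.
    keep = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

def top_p(probs, p):
    # Nucleus: smallest set whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, total = [], 0.0
    for i in order:
        keep.append(i)
        total += probs[i]
        if total >= p:
            break
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k(probs, 2))     # [0.625, 0.375, 0.0, 0.0, 0.0]
print(top_p(probs, 0.85))  # keeps the first three tokens, renormalized
```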
Generation Challenges
Repetition, hallucination, and the problems that plague text generation
What Goes Wrong
Text generation faces several persistent challenges. Repetition: models tend to repeat phrases or sentences, especially with greedy/beam search. Repetition penalties and n-gram blocking help but don't eliminate the problem. Hallucination: models generate plausible-sounding but factually incorrect text because they optimize for probability, not truth. Coherence degradation: as generated text grows longer, it tends to drift off-topic or contradict earlier statements because the model has no explicit memory of its own output beyond the context window. Exposure bias: during training, the model sees gold-standard context; during generation, it sees its own (potentially wrong) predictions, causing errors to compound. Evaluation difficulty: there's no single metric that captures generation quality — fluency, coherence, factuality, and relevance are all separate dimensions.
Generation Problems
- Repetition: "The cat sat on the mat. The cat sat on the mat. The cat sat on..." Fix: repetition penalty, n-gram blocking.
- Hallucination: "Einstein invented the telephone in 1876" sounds plausible but is completely wrong. Fix: retrieval augmentation, grounding.
- Coherence drift: paragraph 1 is about cats; paragraph 5 is somehow about economics. Fix: planning, outline-first generation.
- Exposure bias: training sees the correct previous words; generation sees its own (possibly wrong) words, so errors compound over long sequences.
- Evaluation: no single metric captures quality; human evaluation is the gold standard.
Key insight: Hallucination is not a bug — it's a fundamental property of language models that optimize for probability. A fluent, probable sentence can be factually wrong. This is why retrieval-augmented generation (RAG) and grounding are essential for factual applications.
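One common repetition mitigation, no-repeat n-gram blocking, can be sketched for bigrams. This is a simplified illustration of the idea, not a full decoder: candidate tokens that would recreate an already-generated bigram are removed before sampling.

```python
def generated_bigrams(tokens):
    # All bigrams that have already appeared in the output.
    return set(zip(tokens, tokens[1:]))

def filter_candidates(generated, candidates):
    # Drop any candidate that would repeat a previously generated bigram.
    banned = generated_bigrams(generated)
    prev = generated[-1]
    return [c for c in candidates if (prev, c) not in banned]

generated = ["the", "cat", "sat", "on", "the"]
# "cat" is blocked: the bigram ("the", "cat") already occurred.
print(filter_candidates(generated, ["cat", "mat", "floor"]))  # ['mat', 'floor']
```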
From Language Models to LLMs
The scaling insight that changed everything
The Scaling Revolution
The path from n-gram language models to GPT-4 and Claude is a story of scale. Each generation of language model was bigger, trained on more data, and surprisingly more capable. GPT-1 (2018, 117M parameters) showed that pre-trained language models could be fine-tuned for downstream tasks. GPT-2 (2019, 1.5B) demonstrated that larger models could generate remarkably coherent text. GPT-3 (2020, 175B) revealed emergent abilities: few-shot learning, basic reasoning, and code generation appeared without explicit training. The scaling laws (Kaplan et al., 2020) showed that language model performance improves predictably with model size, data size, and compute. This insight — that next-word prediction at sufficient scale produces general intelligence — is the foundation of the modern AI revolution. The next chapter covers the transformer architecture that made this scaling possible.
The Scaling Timeline
- 2018: GPT-1 (117M params): pre-train + fine-tune paradigm; 12 transformer layers.
- 2019: GPT-2 (1.5B params): coherent multi-paragraph text; initially deemed "too dangerous to release".
- 2020: GPT-3 (175B params): few-shot learning emerges; no fine-tuning needed for many tasks.
- 2020: Scaling laws (Kaplan et al.): performance scales predictably with model size (parameters), dataset size (tokens), and compute (FLOPs).

The insight: next-word prediction at scale → general language understanding.
Key insight: The entire LLM revolution is built on language modeling — the same "predict the next word" task from n-grams. The difference is scale: billions of parameters, trillions of tokens, and the transformer architecture that made it computationally feasible.