Ch 6 — Language Models & Generation

N-grams, perplexity, RNN/LSTM language models, and the path to neural text generation
High Level
N-gram → RNN/LSTM → Perplexity → Generate → Decode → LLMs
What Is a Language Model?
Predicting the next word — the foundation of all modern NLP
The Core Idea
A language model assigns probabilities to sequences of words. Given a context, it predicts what comes next. "The cat sat on the ___" — a good language model assigns high probability to "mat" and low probability to "democracy." Formally, a language model computes P(w1, w2, ..., wn) — the probability of a sequence. By the chain rule, this decomposes into: P(w1) × P(w2|w1) × P(w3|w1,w2) × ... Language models are the foundation of modern NLP. Spell checkers, autocomplete, machine translation, speech recognition, and every LLM from GPT to Claude are language models at their core. The entire history of NLP can be told through the evolution of language models: from counting word sequences to neural networks that generate human-quality text.
Language Model Basics
Core task: predict the next word.
"The cat sat on the ___"
- P(mat) = 0.15 (high)
- P(floor) = 0.08
- P(democracy) = 0.0001 (low)

Chain rule decomposition:
P("the cat sat") = P("the") × P("cat" | "the") × P("sat" | "the cat")

Applications: autocomplete and spell check, machine translation, speech recognition, text generation (GPT, Claude). Every LLM is a language model.
Key insight: The seemingly simple task of "predict the next word" turns out to require deep understanding of grammar, facts, reasoning, and world knowledge. This is why scaling language models produces increasingly capable AI systems.
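The chain-rule decomposition above can be sketched in a few lines of Python. The conditional probabilities here are made up for illustration; a real model would estimate them from data.

```python
# Chain-rule decomposition P(w1..wn) = Π P(wi | w1..wi-1),
# with made-up conditional probabilities for illustration.
cond = {
    (): {"the": 0.2},                # P("the")
    ("the",): {"cat": 0.01},         # P("cat" | "the")
    ("the", "cat"): {"sat": 0.05},   # P("sat" | "the cat")
}

def sequence_prob(words, cond):
    prob = 1.0
    for i, w in enumerate(words):
        # Multiply in P(current word | all previous words)
        prob *= cond[tuple(words[:i])][w]
    return prob

print(sequence_prob(["the", "cat", "sat"], cond))  # ≈ 0.0001 = 0.2 × 0.01 × 0.05
```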
N-gram Language Models
Count sequences of words — the statistical foundation
N-gram Models
N-gram models estimate the probability of a word based on the previous N−1 words. A bigram model uses one word of context: P("mat" | "the"). A trigram uses two: P("mat" | "on the"). Training is just counting: P("mat" | "on the") = count("on the mat") / count("on the"). N-gram models are fast, simple, and surprisingly effective for tasks like spell checking and speech recognition. But they have fundamental limitations. Sparsity: most n-grams never appear in training data, so their probability is zero. Smoothing techniques (Laplace, Kneser-Ney) redistribute probability mass to unseen n-grams. Limited context: even a 5-gram model only sees 4 previous words, missing long-range dependencies like "The doctor who treated the patient in the emergency room last Tuesday ___."
N-gram Examples
- Unigram (N=1): P(word), e.g. P("the") = 0.07, P("cat") = 0.001
- Bigram (N=2): P(word | prev_word), e.g. P("cat" | "the") = 0.01, P("sat" | "cat") = 0.005
- Trigram (N=3): P(word | prev_2_words), e.g. P("on" | "cat sat") = 0.15, P("mat" | "on the") = 0.02

Sparsity problem: "on the mat" appears 50 times; "on the xylophone" appears 0 times, so P("xylophone" | "on the") = 0.

Smoothing redistributes probability mass, giving a small probability to unseen n-grams. Kneser-Ney is the best-performing smoothing method.
Key insight: N-gram models reveal a fundamental tension in language modeling: more context improves predictions but makes sparsity worse. A 5-gram model is more accurate in theory but most 5-grams never appear in training data. Neural models solved this by learning continuous representations.
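Training-as-counting and add-one smoothing can be sketched on a toy corpus. The corpus and counts here are illustrative, and real systems would use Kneser-Ney rather than Laplace smoothing.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and their contexts from the toy corpus.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def bigram_mle(prev, word):
    # Maximum-likelihood estimate: count(prev word) / count(prev)
    return bigrams[(prev, word)] / contexts[prev]

def bigram_laplace(prev, word, vocab_size):
    # Add-one (Laplace) smoothing: unseen bigrams get small nonzero mass.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

vocab = len(set(corpus))
print(bigram_mle("the", "cat"))                   # 2/3: "the cat" twice, "the" 3x as context
print(bigram_mle("the", "xylophone"))             # 0.0: the sparsity problem
print(bigram_laplace("the", "xylophone", vocab))  # small but nonzero
```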
Perplexity
How to measure whether a language model is any good
Measuring Language Models
Perplexity is the standard metric for language models. Intuitively, it measures how "surprised" the model is by the test data. A perplexity of 100 means the model is as uncertain as if it were choosing uniformly among 100 words at each step. Lower perplexity = better model. Formally, perplexity is 2 raised to the power of the average negative log-probability: PPL = 2^(−(1/N) ∑ log2 P(wi | context)). A perfect model that always predicts the correct next word has perplexity 1. A model that assigns equal probability to a 50,000-word vocabulary has perplexity 50,000. Modern LLMs achieve perplexities of 10–30 on standard benchmarks, meaning they effectively narrow down the next word to 10–30 candidates. Perplexity is useful for comparing models on the same test set but doesn't directly measure generation quality.
Perplexity Examples
Perplexity = 2^(avg negative log prob). Intuition: "effective vocabulary size".
- PPL = 1: perfect prediction
- PPL = 10: choosing among ~10 words
- PPL = 100: choosing among ~100 words
- PPL = 50,000: random guessing over a 50,000-word vocabulary

Typical perplexities:
- Unigram model: ~1000
- Trigram + Kneser-Ney: ~80-100
- LSTM: ~50-70
- GPT-2 (1.5B): ~18-22
- GPT-3 (175B): ~10-15

Limitations: only comparable on the same test set; doesn't measure generation quality; low PPL doesn't guarantee good text.
Key insight: Perplexity measures how well a model predicts text, not how well it generates text. A model with low perplexity might still produce repetitive or incoherent output. Generation quality requires additional evaluation (human judgment, task-specific metrics).
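The perplexity formula can be checked directly. The probability lists below are illustrative: each entry is the probability the model assigned to the word that actually occurred.

```python
import math

def perplexity(probs):
    # probs: model probability assigned to each actual next word.
    # PPL = 2^(-(1/N) * sum(log2 p))
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A perfect model (probability 1 everywhere) has perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))   # 1.0
# Uniform over a 50,000-word vocabulary: perplexity 50,000.
print(perplexity([1 / 50000] * 10))  # ≈ 50000
# Probability 0.1 at every step: "effectively choosing among 10 words".
print(perplexity([0.1, 0.1, 0.1]))   # ≈ 10
```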
Neural Language Models
RNNs and LSTMs — learning to predict words with neural networks
From Counting to Learning
Neural language models replaced counting with learned representations. Bengio's 2003 neural language model used a feed-forward network over word embeddings — the first to show that neural networks could outperform n-grams. Recurrent Neural Networks (RNNs) process text one word at a time, maintaining a hidden state that summarizes everything seen so far. Unlike n-grams, RNNs have no fixed context window — in theory, they can use the entire preceding text. In practice, vanilla RNNs suffer from the vanishing gradient problem: information from early words fades as the sequence grows. LSTMs (Long Short-Term Memory) solved this with gating mechanisms that control what to remember and what to forget. LSTM language models achieved perplexities of 50–70, cutting n-gram perplexity nearly in half.
Neural LM Evolution
2003: Bengio's neural LM
- Feed-forward network over embeddings
- Fixed context window (like n-grams)
- But: continuous representations

2010+: RNN language models
- Process one word at a time
- Hidden state = memory of past: h_t = f(h_{t-1}, x_t)
- Problem: vanishing gradients

2014+: LSTM language models
- Gates: forget, input, output
- Cell state carries long-range information
- PPL: 50-70 (vs n-gram ~100)

Key advantages over n-grams: no sparsity problem, unlimited context (in theory), learned word similarities.
Key insight: Neural language models solved the sparsity problem that plagued n-grams. Because they operate on continuous embeddings, similar words get similar predictions — even for word sequences never seen in training.
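A scalar toy version of the RNN update h_t = f(h_{t-1}, x_t) shows the intuition behind the vanishing-gradient problem: information from early inputs fades with each step. The weights here are made up, and real models use weight matrices rather than scalars.

```python
import math

# Minimal vanilla-RNN hidden-state update: h_t = tanh(W_h * h_prev + W_x * x_t),
# with scalar weights for readability (illustrative, not a real model).
W_h, W_x = 0.5, 1.0

def rnn_step(h_prev, x_t):
    return math.tanh(W_h * h_prev + W_x * x_t)

# A signal at the first step, then silence: the hidden state decays
# toward zero, so the early input is gradually "forgotten".
h = 0.0
for x in [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]:
    h = rnn_step(h, x)
print(h)  # small: the first input's influence has mostly faded
```

LSTMs address exactly this decay: their gated cell state can carry the early signal forward nearly unchanged when the forget gate stays open.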
Text Generation
Turning a language model into a text generator — one token at a time
Autoregressive Generation
A language model becomes a text generator through autoregressive generation: predict the next word, add it to the context, predict the next word again, repeat. Given "The cat", the model predicts "sat" (highest probability), yielding "The cat sat". Then it predicts "on" from "The cat sat", and so on. This simple loop is how every modern text generator works, from GPT to Claude. The quality of generated text depends on two things: the quality of the language model (how well it predicts) and the decoding strategy (how it chooses from the predicted distribution). Greedy decoding always picks the most probable word, but this produces repetitive, boring text. Better strategies introduce controlled randomness to produce diverse, natural-sounding output.
Autoregressive Loop
Generation loop, starting from prompt = "The cat":
- Step 1: P(next | "The cat"): "sat" 0.15, "is" 0.12, "ran" 0.08. Pick "sat" → "The cat sat"
- Step 2: P(next | "The cat sat"): "on" 0.20, "down" 0.10, "in" 0.08. Pick "on" → "The cat sat on"
- Step 3: P(next | "The cat sat on"): "the" 0.25, "a" 0.12, "my" 0.08. Pick "the" → "The cat sat on the"
- ... continue until the <EOS> token.

This is how GPT generates text: one token at a time, left to right.
Key insight: Autoregressive generation is inherently sequential — each token depends on all previous tokens. This is why LLM inference is slow: you can't parallelize the generation loop. Each token requires a full forward pass through the model.
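The loop above can be sketched with greedy decoding over a toy next-word table. The probabilities mirror the made-up example in the text; a real model would compute the distribution with a forward pass at each step.

```python
# Toy next-word distributions, keyed by the full context so far
# (illustrative numbers, matching the worked example).
next_word = {
    "The cat": {"sat": 0.15, "is": 0.12, "ran": 0.08},
    "The cat sat": {"on": 0.20, "down": 0.10, "in": 0.08},
    "The cat sat on": {"the": 0.25, "a": 0.12, "my": 0.08},
    "The cat sat on the": {"<EOS>": 1.0},
}

def generate(prompt):
    text = prompt
    while True:
        dist = next_word[text]               # "forward pass": look up distribution
        word = max(dist, key=dist.get)       # greedy: pick the argmax token
        if word == "<EOS>":
            return text
        text = text + " " + word             # append and repeat: autoregression

print(generate("The cat"))  # "The cat sat on the"
```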
Decoding Strategies
Greedy, beam search, top-k, and nucleus sampling — how to choose the next word
Choosing from the Distribution
The language model outputs a probability distribution over the vocabulary at each step. The decoding strategy determines how to select from this distribution. Greedy decoding always picks the highest-probability token — fast but produces repetitive, generic text. Beam search maintains the top-k candidate sequences at each step, finding globally better sequences than greedy — good for translation and summarization but still tends toward safe, boring output. Temperature sampling scales the logits before softmax: temperature < 1 makes the distribution sharper (more deterministic), temperature > 1 makes it flatter (more random). Top-k sampling restricts sampling to the k most probable tokens. Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds p, adapting the candidate pool to model confidence.
Decoding Strategies
- Greedy: always pick the argmax. Repetitive, generic, boring.
- Beam search (beam_size=5): keep the top 5 candidate sequences at each step. Good for translation and summarization; bad for creative/open-ended text.
- Temperature: T=0.1 very deterministic (sharp), T=1.0 original distribution, T=2.0 very random (flat).
- Top-k (k=50): sample from the top 50 tokens only; fixed candidate pool size.
- Top-p / nucleus (p=0.9): sample from the smallest set with cumulative probability ≥ 0.9. Adaptive: small set when confident, large set when uncertain.
Key insight: Top-p (nucleus) sampling is the modern default because it adapts to model confidence. When the model is sure ("The capital of France is ___"), the nucleus is tiny. When it's uncertain ("The best movie is ___"), the nucleus expands to allow diversity.
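Minimal sketches of temperature scaling, top-k, and top-p filtering, assuming a list of logits for temperature and a probability list for the other two. All numbers are illustrative; real decoders sample from the filtered distribution rather than just printing it.

```python
import math

def temperature(logits, T):
    # Scale logits, then softmax: T < 1 sharpens, T > 1 flattens.
    scaled = [l / T for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_k(probs, k):
    # Keep the k most probable tokens, renormalize, zero the rest.
    keep = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

def top_p(probs, p):
    # Nucleus: smallest set whose cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, total = [], 0.0
    for i in order:
        keep.append(i)
        total += probs[i]
        if total >= p:
            break
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k(probs, 2))     # [0.625, 0.375, 0.0, 0.0, 0.0]
print(top_p(probs, 0.85))  # keeps the first three tokens, renormalized
```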
Generation Challenges
Repetition, hallucination, and the problems that plague text generation
What Goes Wrong
Text generation faces several persistent challenges. Repetition: models tend to repeat phrases or sentences, especially with greedy/beam search. Repetition penalties and n-gram blocking help but don't eliminate the problem. Hallucination: models generate plausible-sounding but factually incorrect text because they optimize for probability, not truth. Coherence degradation: as generated text grows longer, it tends to drift off-topic or contradict earlier statements because the model has no explicit memory of its own output beyond the context window. Exposure bias: during training, the model sees gold-standard context; during generation, it sees its own (potentially wrong) predictions, causing errors to compound. Evaluation difficulty: there's no single metric that captures generation quality — fluency, coherence, factuality, and relevance are all separate dimensions.
Generation Problems
- Repetition: "The cat sat on the mat. The cat sat on the mat. The cat sat on..." Fix: repetition penalty, n-gram blocking.
- Hallucination: "Einstein invented the telephone in 1876" sounds plausible but is completely wrong. Fix: retrieval augmentation, grounding.
- Coherence drift: paragraph 1 is about cats; paragraph 5 is somehow about economics. Fix: planning, outline-first generation.
- Exposure bias: training sees the correct previous words; generation sees its own (possibly wrong) words, so errors compound over long sequences.
- Evaluation: no single metric captures quality; human evaluation is the gold standard.
Key insight: Hallucination is not a bug — it's a fundamental property of language models that optimize for probability. A fluent, probable sentence can be factually wrong. This is why retrieval-augmented generation (RAG) and grounding are essential for factual applications.
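One common repetition mitigation, no-repeat n-gram blocking, can be sketched for bigrams. This is a simplified illustration of the idea, not a full decoder: candidate tokens that would recreate an already-generated bigram are removed before sampling.

```python
def generated_bigrams(tokens):
    # All bigrams that have already appeared in the output.
    return set(zip(tokens, tokens[1:]))

def filter_candidates(generated, candidates):
    # Drop any candidate that would repeat a previously generated bigram.
    banned = generated_bigrams(generated)
    prev = generated[-1]
    return [c for c in candidates if (prev, c) not in banned]

generated = ["the", "cat", "sat", "on", "the"]
# "cat" is blocked: the bigram ("the", "cat") already occurred.
print(filter_candidates(generated, ["cat", "mat", "floor"]))  # ['mat', 'floor']
```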
From Language Models to LLMs
The scaling insight that changed everything
The Scaling Revolution
The path from n-gram language models to GPT-4 and Claude is a story of scale. Each generation of language model was bigger, trained on more data, and surprisingly more capable. GPT-1 (2018, 117M parameters) showed that pre-trained language models could be fine-tuned for downstream tasks. GPT-2 (2019, 1.5B) demonstrated that larger models could generate remarkably coherent text. GPT-3 (2020, 175B) revealed emergent abilities: few-shot learning, basic reasoning, and code generation appeared without explicit training. The scaling laws (Kaplan et al., 2020) showed that language model performance improves predictably with model size, data size, and compute. This insight — that next-word prediction at sufficient scale produces general intelligence — is the foundation of the modern AI revolution. The next chapter covers the transformer architecture that made this scaling possible.
The Scaling Timeline
- 2018: GPT-1 (117M params): pre-train + fine-tune paradigm; 12 transformer layers.
- 2019: GPT-2 (1.5B params): coherent multi-paragraph text; initially deemed "too dangerous to release".
- 2020: GPT-3 (175B params): few-shot learning emerges; no fine-tuning needed for many tasks.
- 2020: Scaling laws (Kaplan et al.): performance scales predictably with model size (parameters), dataset size (tokens), and compute (FLOPs).

The insight: next-word prediction at scale → general language understanding.
Key insight: The entire LLM revolution is built on language modeling — the same "predict the next word" task from n-grams. The difference is scale: billions of parameters, trillions of tokens, and the transformer architecture that made it computationally feasible.