Ch 7 — The Transformer Revolution

Self-attention, BERT, GPT, T5 — how one architecture changed everything
High Level: Attention → Encoder → Decoder → BERT → GPT → T5
Self-Attention: The Core Innovation
Every token attends to every other token — in parallel
How Self-Attention Works
The transformer's core innovation is self-attention: a mechanism that lets every token in a sequence look at every other token to determine how much each one matters for understanding the current token. Each token is projected into three vectors: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I provide?). Attention scores are computed as the dot product of queries and keys, scaled and softmaxed into weights, then used to create a weighted sum of values. The result: each token's representation is enriched by information from the most relevant other tokens. Unlike RNNs that process sequentially, self-attention processes all tokens in parallel, enabling massive GPU parallelism. This is why transformers can be trained on billions of tokens — they're orders of magnitude faster than RNNs.
Self-Attention Mechanics
For each token:
  Q = W_q × token_embedding  (query)
  K = W_k × token_embedding  (key)
  V = W_v × token_embedding  (value)

Attention scores:
  score(i, j) = Q_i · K_j / √d_k
  weights = softmax(scores)
  output_i = ∑_j weights[i, j] × V_j

Example: "The cat sat on the mat"
  For "sat": high attention to "cat" (subject)
  For "mat": high attention to "on" (preposition)

Key advantages over RNNs:
  All tokens processed in parallel
  No sequential bottleneck
  Direct connections between any two tokens
Key insight: Self-attention gives every token a direct connection to every other token, regardless of distance. In an RNN, information from token 1 must pass through every intermediate token to reach token 100. In a transformer, it's one step.
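The mechanics above can be sketched in a few lines of NumPy. The dimensions and weight matrices here are arbitrary toy values for illustration, not those of any real model:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_k).
    """
    Q = X @ W_q                      # queries: what is each token looking for?
    K = X @ W_k                      # keys: what does each token contain?
    V = X @ W_v                      # values: what information does each token provide?
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) pairwise relevance
    # softmax over the key dimension turns each row of scores into weights
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    return weights @ V               # each output row is a weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8     # e.g. the six tokens of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                     # one enriched d_k-dimensional vector per token
```

Note that nothing in the computation is sequential: every row of the score matrix is computed at once, which is exactly the parallelism the text describes.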
The Transformer Architecture
Multi-head attention, feed-forward layers, and positional encoding
Building Blocks
The full transformer architecture (Vaswani et al., 2017) stacks several components. Multi-head attention runs self-attention multiple times in parallel (typically 8–16 heads), each learning different relationship types — one head might learn syntactic dependencies, another semantic similarity. Feed-forward layers after each attention block add non-linear transformations, increasing the model's capacity. Layer normalization and residual connections stabilize training of deep networks. Positional encoding injects word order information, since self-attention is inherently order-agnostic — without it, "dog bites man" and "man bites dog" would have identical representations. The original transformer used sinusoidal positional encodings; modern models use learned positional embeddings or relative position encodings like RoPE.
Transformer Block
One transformer block:

  Input tokens + positional encoding
        ↓
  Multi-Head Self-Attention (8–16 heads)
  + residual connection + LayerNorm
        ↓
  Feed-Forward Network (2 linear layers)
  + residual connection + LayerNorm
        ↓
  Output representations

Stack N blocks:
  BERT-base:  12 blocks,   768-dim, 12 heads
  BERT-large: 24 blocks,  1024-dim, 16 heads
  GPT-3:      96 blocks, 12288-dim, 96 heads

Positional encoding:
  Original: sinusoidal functions
  Modern: learned embeddings, RoPE
  Without it: no word order!
Key insight: The transformer is a remarkably simple architecture: just attention + feed-forward + normalization, stacked repeatedly. Its power comes from depth (many layers) and width (many attention heads), not from architectural complexity.
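A minimal NumPy sketch of one block, assuming the post-norm layout of the original paper, toy dimensions, random weights, and no dropout or bias terms:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, params, n_heads):
    """Multi-head self-attention + FFN, each with residual + LayerNorm."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = X @ params["W_q"], X @ params["W_k"], X @ params["W_v"]
    heads = []
    for h in range(n_heads):                      # each head attends independently
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    attn = np.concatenate(heads, axis=-1) @ params["W_o"]
    X = layer_norm(X + attn)                      # residual + LayerNorm
    ff = np.maximum(X @ params["W_1"], 0) @ params["W_2"]  # 2 linear layers, ReLU
    return layer_norm(X + ff)                     # residual + LayerNorm

rng = np.random.default_rng(1)
seq_len, d_model, n_heads, d_ff = 5, 32, 4, 64
params = {k: rng.normal(size=(d_model, d_model)) * 0.1
          for k in ("W_q", "W_k", "W_v", "W_o")}
params["W_1"] = rng.normal(size=(d_model, d_ff)) * 0.1
params["W_2"] = rng.normal(size=(d_ff, d_model)) * 0.1
X = rng.normal(size=(seq_len, d_model))
out = transformer_block(X, params, n_heads)
print(out.shape)  # same shape in and out, which is why blocks stack cleanly
```

Because input and output shapes match, stacking N blocks is just a loop over `transformer_block`, mirroring the "Stack N blocks" configurations above.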
Three Architectural Variants
Encoder-only, decoder-only, and encoder-decoder — each optimized for different tasks
The Three Families
The original transformer had both an encoder and decoder. Subsequent models discovered that using just one half works better for specific tasks. Encoder-only models (BERT, RoBERTa) use bidirectional attention — every token sees every other token. This is ideal for understanding tasks: classification, NER, question answering. Decoder-only models (GPT, LLaMA, Claude) use causal attention — each token can only see previous tokens. This is ideal for generation tasks and is the architecture behind all modern LLMs. Encoder-decoder models (T5, BART) use the encoder to understand the input and the decoder to generate the output. This is ideal for transformation tasks: translation, summarization, where you need to comprehend input before producing output.
Three Variants
Encoder-only (BERT, RoBERTa):
  Attention: bidirectional (sees all)
  Best for: classification, NER, QA
  Pre-training: masked language modeling

Decoder-only (GPT, LLaMA, Claude):
  Attention: causal (sees only the past)
  Best for: text generation, chat
  Pre-training: next-token prediction
  Dominant architecture for LLMs

Encoder-decoder (T5, BART):
  Encoder: bidirectional understanding
  Decoder: autoregressive generation
  Best for: translation, summarization
  Pre-training: span corruption
Key insight: The choice of architecture is a trade-off between understanding and generation. Encoders understand best (bidirectional), decoders generate best (autoregressive), and encoder-decoders do both. Modern LLMs chose generation (decoder-only) and compensate with scale.
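Mechanically, the difference between bidirectional and causal attention is just a mask over the score matrix. A sketch, assuming a boolean convention where True means "token i may attend to token j":

```python
import numpy as np

def attention_mask(seq_len, causal):
    """mask[i, j] is True iff token i may attend to token j."""
    if causal:
        # decoder-style: lower triangle only, so each token sees just the past
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # encoder-style: every token sees every token
    return np.ones((seq_len, seq_len), dtype=bool)

bidir = attention_mask(4, causal=False)
causal = attention_mask(4, causal=True)
print(causal.astype(int))  # lower-triangular: token i sees tokens 0..i
```

In practice the mask is applied by setting disallowed scores to a large negative value before the softmax, so their attention weights become zero.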
BERT: Bidirectional Understanding
Masked language modeling and the pre-train/fine-tune revolution
How BERT Works
BERT (Bidirectional Encoder Representations from Transformers, Google 2018) was the first model to demonstrate that a single pre-trained model could achieve state-of-the-art results on 11 NLP benchmarks simultaneously. BERT is pre-trained with two objectives: Masked Language Modeling (MLM) — randomly mask 15% of tokens and predict them from context ("The [MASK] sat on the mat" → "cat"), and Next Sentence Prediction (NSP) — predict whether two sentences are consecutive. MLM forces bidirectional understanding: to predict a masked word, the model must use both left and right context. For downstream tasks, BERT adds a task-specific head (a linear layer) on top and fine-tunes the entire model. Classification uses the [CLS] token representation; NER uses per-token representations. This pre-train-then-fine-tune paradigm became the standard workflow for NLP.
BERT Details
Pre-training objectives:
  MLM: "The [MASK] sat on the [MASK]"
    Predict: "cat", "mat"
    Uses both left and right context
  NSP: are these consecutive sentences?
    "The cat sat." + "It was tired." → Yes
    "The cat sat." + "Paris is nice." → No

Fine-tuning:
  Classification: [CLS] → linear → label
  NER: each token → linear → IOB tag
  QA: start/end token prediction

Results:
  SOTA on 11 benchmarks at once
  BERT-base: 110M params, 12 layers
  BERT-large: 340M params, 24 layers
Key insight: BERT's real innovation was proving that unsupervised pre-training on raw text produces representations that transfer to any NLP task. You no longer need task-specific architectures — one pre-trained model, many fine-tuned applications.
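A simplified sketch of MLM masking in plain Python. It deterministically masks a fixed ~15% of positions; real BERT samples per token and, of the chosen tokens, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask ~mask_rate of positions and record the originals as targets."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * len(tokens)))       # how many positions to mask
    idx = set(rng.sample(range(len(tokens)), k))     # which positions to mask
    masked = ["[MASK]" if i in idx else t for i, t in enumerate(tokens)]
    # the model must predict each target from both left and right context
    targets = {i: tokens[i] for i in sorted(idx)}
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

The training loss is computed only at the masked positions: the model's prediction for each [MASK] slot is compared against the recorded target token.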
GPT: Autoregressive Generation
Predict the next token — and scale until emergent abilities appear
The GPT Approach
GPT (Generative Pre-trained Transformer, OpenAI 2018) took the opposite approach from BERT: instead of bidirectional understanding, GPT uses causal (left-to-right) attention and pre-trains on next-token prediction. This makes GPT a natural text generator. GPT-1 (117M parameters) demonstrated that pre-training + fine-tuning works for generation. GPT-2 (1.5B) showed that scaling produces coherent multi-paragraph text. GPT-3 (175B) revealed emergent abilities: without any fine-tuning, it could perform tasks from just a few examples in the prompt (few-shot learning). This discovery — that scale alone produces new capabilities — launched the LLM revolution. The decoder-only architecture became the dominant paradigm: GPT-4, Claude, LLaMA, Gemini, and Mistral all use decoder-only transformers.
GPT Evolution
Pre-training: next token prediction
  "The cat sat on the" → predict "mat"
  Causal attention: only sees past tokens

GPT-1 (2018): 117M params
  Pre-train + fine-tune for tasks
  12 layers, 768-dim
GPT-2 (2019): 1.5B params
  Coherent paragraphs, no fine-tuning
  Zero-shot task performance
GPT-3 (2020): 175B params
  Few-shot learning emerges
  In-context learning from examples
  96 layers, 12288-dim

The insight: scale + next-token prediction = general-purpose AI system
Key insight: GPT proved that generation and understanding are not separate. A model trained only to predict the next token develops deep language understanding as a side effect. This is why decoder-only models dominate modern AI.
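The autoregressive loop itself is easy to illustrate with a toy bigram counter standing in for GPT's learned next-token distribution; `train_bigram` and `generate` are hypothetical helpers for this sketch, not any real API:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-token frequencies, a toy stand-in for a learned distribution."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, prompt, n_tokens):
    """Autoregressive decoding: repeatedly append a predicted next token."""
    out = prompt.split()
    for _ in range(n_tokens):
        nxt = counts[out[-1]]
        if not nxt:                               # no continuation seen in training
            break
        out.append(nxt.most_common(1)[0][0])      # greedy argmax; LLMs usually sample
    return " ".join(out)

corpus = ["the cat sat on the mat", "the cat sat on the rug"]
model = train_bigram(corpus)
print(generate(model, "the cat", 3))  # → "the cat sat on the"
```

A GPT model replaces the bigram table with a deep transformer conditioning on the entire prefix, but the decoding loop — predict, append, repeat — is the same.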
T5: Text-to-Text Unification
Every NLP task as a text transformation
The Text-to-Text Framework
T5 (Text-to-Text Transfer Transformer, Google 2019) proposed a radical simplification: every NLP task is a text-to-text problem. Classification: "classify: This movie is great" → "positive." Translation: "translate English to German: Hello" → "Hallo." Summarization: "summarize: [long text]" → "[summary]." By framing all tasks identically, T5 uses the same model, loss function, and training procedure for everything. T5 uses an encoder-decoder architecture with span corruption pre-training: random spans of text are replaced with sentinel tokens, and the model learns to reconstruct them. The systematic study behind T5 tested dozens of design choices (model size, pre-training objective, data, fine-tuning strategy) and published the results, making it one of the most influential NLP papers for practical guidance.
T5 Task Framing
Everything is text-to-text:
  Classification:
    Input: "classify: This movie is great"
    Output: "positive"
  Translation:
    Input: "translate English to German: Hello"
    Output: "Hallo"
  Summarization:
    Input: "summarize: [long article]"
    Output: "[short summary]"
  Question answering:
    Input: "question: What color is the sky? context: The sky is blue."
    Output: "blue"

Pre-training: span corruption
  "The <X> sat on <Y> mat" → "<X> cat <Y> the"
Key insight: T5's text-to-text framing is conceptually elegant: it eliminates the need for task-specific architectures, loss functions, or output formats. This unification influenced the design of instruction-tuned models and modern LLM prompting.
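A sketch of span corruption in Python; `corrupt_spans` is a hypothetical helper that takes explicit span indices for clarity, whereas T5 samples span positions and lengths randomly:

```python
def corrupt_spans(tokens, spans):
    """Replace (start, end) half-open spans with sentinels <X>, <Y>, ...

    Returns the corrupted input and the reconstruction target, T5-style.
    Spans must be sorted and non-overlapping.
    """
    sentinels = iter(["<X>", "<Y>", "<Z>"])
    corrupted, target = [], []
    pos = 0
    for start, end in spans:
        s = next(sentinels)
        corrupted += tokens[pos:start] + [s]   # drop the span, leave a sentinel
        target += [s] + tokens[start:end]      # the model must regenerate the span
        pos = end
    corrupted += tokens[pos:]
    return " ".join(corrupted), " ".join(target)

tokens = "the cat sat on the mat".split()
inp, tgt = corrupt_spans(tokens, [(1, 2), (3, 5)])
print(inp)  # the <X> sat <Y> mat
print(tgt)  # <X> cat <Y> on the
```

The encoder reads the corrupted input; the decoder generates the target, so pre-training exercises exactly the same input-comprehension-plus-generation pattern as translation or summarization.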
BERT vs GPT: When to Use Which
Understanding vs generation — choosing the right architecture
Choosing an Architecture
The choice between BERT-style and GPT-style models depends on your task. BERT (encoder) excels at tasks that require understanding the full input: classification, NER, semantic similarity, extractive QA. Its bidirectional attention means every token can attend to every other token, producing richer representations for understanding. GPT (decoder) excels at generating text: dialogue, creative writing, code generation, and any task where you need to produce new text. Its causal attention naturally supports autoregressive generation. In practice, the distinction has blurred: modern LLMs (GPT-4, Claude) are so capable that they can perform understanding tasks through generation (classify by generating the label). But for production systems where efficiency matters, BERT-style models remain the better choice for classification and extraction tasks — they're 10–100x smaller and faster.
Architecture Decision Guide
Use BERT (encoder) when:
  Classification, NER, extraction
  Semantic similarity, search
  You need efficiency (small, fast)
  You have labeled fine-tuning data
  110M–340M params, fast inference

Use GPT (decoder) when:
  Text generation, dialogue
  Creative / open-ended tasks
  Few-shot learning (no fine-tuning data)
  General-purpose assistant
  7B–175B+ params, slower inference

Use T5 (encoder-decoder) when:
  Translation, summarization
  Tasks requiring input comprehension + structured output generation
  220M–11B params
Key insight: In 2024+, decoder-only models have largely won. They can do understanding tasks through generation, and their scaling properties are better understood. But encoder models remain the practical choice when you need fast, efficient, task-specific inference.
Why Transformers Won
Parallelism, scalability, and the pre-training paradigm
The Transformer Advantage
Transformers didn't just improve NLP — they unified it. Before transformers, each NLP task had its own specialized architecture. After transformers, one architecture handles everything. Three properties explain their dominance. Parallelism: unlike RNNs that process tokens sequentially, transformers process all tokens simultaneously, making them 10–100x faster to train on GPUs. Scalability: transformer performance improves predictably with model size, data size, and compute (scaling laws). This predictability enables billion-dollar training investments. Transfer learning: pre-training on massive unlabeled text produces representations that transfer to any downstream task with minimal fine-tuning. The transformer's impact extends far beyond NLP: Vision Transformers (ViT) for images, AlphaFold for protein structure, and multimodal models all use the same architecture.
Why Transformers Dominate
1. Parallelism:
  RNN: process tokens 1, 2, 3, ... (serial)
  Transformer: process all tokens at once
  10–100x faster training on GPUs

2. Scalability:
  Performance scales predictably
  More params + data + compute = better
  Enables billion-dollar investments

3. Transfer learning:
  Pre-train once on massive text
  Fine-tune for any task
  Amortizes training cost

Beyond NLP:
  Vision: ViT, DINO
  Protein: AlphaFold
  Audio: Whisper
  Multimodal: GPT-4V, Gemini
  One architecture for everything
Key insight: The transformer is arguably the most important architecture in the history of AI. Not because it's the most elegant design, but because it scales. And in deep learning, the architecture that scales best wins.