Ch 9 — Attention & Transformers

“Attention Is All You Need” — the architecture that changed everything
High Level: Attention → Q K V → Multi-Head → Layers → Position → Variants
The Attention Mechanism
From add-on to the entire architecture
The Core Idea
Attention lets a model look at all positions in a sequence and decide which ones are relevant for the current task. Instead of compressing everything into a single vector (Ch 8’s seq2seq bottleneck), attention creates a weighted combination of all positions — different weights for each query.
# Attention in one sentence:
# "Given what I'm looking for (query), which parts of the input (keys)
#  match, and what information (values) should I gather?"

# Bahdanau (2015): attention as an add-on to an RNN encoder
# Vaswani (2017): attention IS the architecture
Why Attention Beats RNNs
RNN:        Word 1 → Word 2 → ... → Word 100
            100 steps to connect Word 1 to Word 100
            Sequential: can't parallelize

Attention:  Word 1 ↔ Word 100 directly
            1 step to connect any two positions
            All positions computed in parallel
The breakthrough: Attention reduces the path length between any two positions from O(n) to O(1). This means the model can learn long-range dependencies without information having to pass through many intermediate steps.
Query, Key, Value
The three projections that make attention work
The QKV Framework
Each input token is projected into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The attention score between two tokens is the dot product of the query of one and the key of the other.
# Scaled dot-product attention
Q = X · W_Q   # queries
K = X · W_K   # keys
V = X · W_V   # values

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

# √d_k scaling prevents softmax saturation
# softmax turns scores into probabilities
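As a minimal sketch, the formula above can be written in a few lines of NumPy (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) raw similarity scores
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights              # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` sums to 1: it is the probability distribution over which positions the corresponding query attends to.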
Analogy: Library Search
Think of a library. Your query is your search term. Each book has a key (its title/tags). The value is the book’s content. You match your query against all keys, and the most relevant books contribute their values to your answer — weighted by relevance.
Self-attention: When Q, K, and V all come from the same sequence, it’s called self-attention. Each word attends to every other word in the same sentence. “The cat sat on the mat because it was tired” — self-attention lets “it” attend strongly to “cat.”
Multi-Head Attention
Multiple attention patterns in parallel
Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs multiple attention operations in parallel, each with different learned W_Q, W_K, W_V projections. One head might learn syntax, another coreference, another semantic similarity.
# Multi-head attention
head₁ = Attention(X·W₁_Q, X·W₁_K, X·W₁_V)
head₂ = Attention(X·W₂_Q, X·W₂_K, X·W₂_V)
...
headₕ = Attention(X·Wₕ_Q, X·Wₕ_K, X·Wₕ_V)

MultiHead = Concat(head₁, ..., headₕ) · W_O
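A sketch of the same computation in NumPy, assuming the common "slice the model dimension" layout where each head operates on its own d_k = d_model / h slice (names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Split d_model into num_heads slices, attend per head, concat, project."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for h in range(num_heads):
        s = slice(h * d_k, (h + 1) * d_k)    # this head's slice of Q, K, V
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_O  # Concat(head₁..headₕ)·W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # 5 tokens, d_model = 16
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=4)
```

Note that the total compute matches a single head of width d_model; the heads just partition it into independent attention patterns.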
Typical Configurations
Model         d_model   heads   d_k
BERT-base     768       12      64
GPT-2         768       12      64
GPT-3         12288     96      128
GPT-4 (est.)  ~12K      ~128    ~96
Llama 3 70B   8192      64      128

# d_k = d_model / num_heads
# Each head operates on a slice
# Total compute = same as single large head
What heads learn: Research shows different heads specialize. Some track positional patterns (attend to adjacent tokens), some track syntactic relations (subject-verb), some handle coreference (“it” → “cat”). This emergent specialization is not programmed — it’s learned.
The Transformer Block
Attention + feed-forward + residuals + layer norm
One Transformer Layer
Each transformer layer has two sub-layers: multi-head self-attention and a feed-forward network (FFN). Both are wrapped with a residual connection and layer normalization. The FFN is applied independently to each position — it’s where the model stores factual knowledge.
# One transformer layer (Pre-LN variant)

1. Self-attention sub-layer:
   x = x + MultiHeadAttn(LayerNorm(x))

2. Feed-forward sub-layer:
   x = x + FFN(LayerNorm(x))

FFN(x) = GELU(x·W₁ + b₁) · W₂ + b₂

# FFN hidden dim = 4 × d_model (typical)
# e.g., d_model=768 → FFN hidden=3072
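The two sub-layers above can be sketched in NumPy. This is a toy forward pass, assuming a single-head self-attention stand-in for MultiHeadAttn and omitting the learned LayerNorm scale/shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (no learned scale/shift here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x):
    """Toy single-head self-attention, standing in for MultiHeadAttn."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def transformer_layer(x, ffn_w1, ffn_w2):
    """Pre-LN block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + self_attention(layer_norm(x))        # self-attention sub-layer
    h = gelu(layer_norm(x) @ ffn_w1)             # FFN hidden = 4 × d_model
    return x + h @ ffn_w2                        # feed-forward sub-layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, d_model = 8
w1 = rng.normal(size=(8, 32)) * 0.1              # hidden dim 32 = 4 × 8
w2 = rng.normal(size=(32, 8)) * 0.1
y = transformer_layer(x, w1, w2)
```

Because both sub-layers are residual (`x + ...`), the output keeps the input's shape, which is what allows these blocks to be stacked dozens of times.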
Stacking Layers
The original transformer used 6 layers. Modern models stack many more: BERT-large has 24, GPT-3 has 96, GPT-4 is estimated at 120+. Each layer refines the representation — early layers capture syntax, middle layers capture semantics, late layers prepare for the task.
Residual connections (x + sublayer(x)) are essential. Without them, gradients vanish in deep transformers just like in deep CNNs. Layer normalization stabilizes training by normalizing across the feature dimension. Together, they enable stacking 100+ layers.
Positional Encoding
Teaching transformers about word order
The Problem
Self-attention is permutation-invariant — it treats “cat sat mat” the same as “mat cat sat.” Unlike RNNs, which process tokens sequentially, transformers see all tokens at once. We must explicitly inject position information.
# Original: sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

# Each position gets a unique pattern
# Different frequencies for different dims
# Can generalize to unseen sequence lengths

input = token_embedding + positional_encoding
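The sinusoidal table above is easy to build directly (a sketch, assuming an even d_model):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; as `pos` grows, the low-index dimensions oscillate fast and the high-index ones slowly, giving each position a unique fingerprint.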
Modern Approaches
Absolute (original):
- Fixed sinusoidal or learned embeddings
- Added to input once

Relative (Shaw 2018, T5):
- Encode distance between tokens
- Added to attention scores

RoPE (Su 2021, used in Llama/GPT-4):
- Rotary Position Embedding
- Encodes position via rotation matrices
- Naturally handles relative distances
- Scales well to long contexts
RoPE is now the dominant approach. It rotates the query and key vectors by an angle proportional to their position. The dot product between rotated Q and K naturally depends on the relative distance between tokens — elegant and effective.
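A minimal RoPE sketch: rotate each (even, odd) pair of dimensions by an angle proportional to the token's position. The relative-distance property can then be checked numerically (names and the pairing convention here are illustrative; real implementations vary in how they pair dimensions):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dim pair of x[i] by positions[i] * theta_j."""
    n, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)       # per-pair frequencies
    angles = positions[:, None] * theta[None, :]    # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
q1, k1 = rope(np.stack([a, b]), np.array([2, 5]))   # positions 2 and 5
q2, k2 = rope(np.stack([a, b]), np.array([7, 10]))  # shifted by 5: same gap
```

Since rotations are orthogonal, `q1 @ k1` equals `q2 @ k2`: the dot product depends only on the relative distance (3 in both cases), not on absolute positions.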
Encoder vs. Decoder vs. Encoder-Decoder
Three transformer families for different tasks
Encoder-only (BERT, 2018):
- Bidirectional self-attention
- Each token sees ALL other tokens
- Task: understanding (classification, NER)
- Training: masked language modeling (MLM)

Decoder-only (GPT, 2018):
- Causal (masked) self-attention
- Each token sees only PREVIOUS tokens
- Task: generation (text, code, chat)
- Training: next-token prediction

Encoder-decoder (T5, 2019; original transformer):
- Encoder: bidirectional
- Decoder: causal + cross-attention to the encoder
- Task: translation, summarization
- Training: span corruption / denoising
The Causal Mask
Decoder models use a causal mask that prevents each position from attending to future tokens. When generating “The cat sat,” the model predicting “sat” can only see “The” and “cat” — not future words. This enables autoregressive generation.
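The causal mask is just an upper-triangular matrix of -inf added to the attention scores before the softmax. A sketch (single head, illustrative names):

```python
import numpy as np

def causal_attention_weights(X):
    """Masked self-attention weights: position i only attends to j <= i."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future tokens -> -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)          # exp(-inf) = 0 weight

w = causal_attention_weights(np.random.default_rng(0).normal(size=(4, 8)))
```

After the softmax, every entry above the diagonal is exactly zero, and the first token can only attend to itself (weight 1.0).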
The GPT family won. Decoder-only models dominate modern AI: GPT-4, Claude, Llama, Gemini, Mistral. The simplicity of next-token prediction + massive scale proved more powerful than the architectural complexity of encoder-decoder models. BERT-style models remain useful for classification and retrieval.
Efficiency & Scaling
The O(n²) problem and how to tame it
The Quadratic Cost
Self-attention computes scores between every pair of tokens: O(n²) time and memory. For a 2048-token sequence, that’s ~4 million attention scores per head. For 128K tokens (GPT-4 Turbo): ~16 billion. This is the transformer’s biggest limitation.
# Attention cost scaling
Sequence    Pairs    Memory
512         262K     ~2 MB
2,048       4.2M     ~32 MB
8,192       67M      ~512 MB
32,768      1.07B    ~8 GB
131,072     17.2B    ~128 GB

# Per head, per layer
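The quadratic growth in the "Pairs" column is easy to verify: it is simply n² score entries per head.

```python
def attention_pairs(n):
    """Number of query-key score pairs for a sequence of n tokens."""
    return n * n

for n in (512, 2048, 8192, 32768, 131072):
    print(f"{n:>7} tokens -> {attention_pairs(n):>14,} scores per head")
```

Going from 2K to 128K tokens multiplies the sequence length by 64 but the score count by 4096, which is why long-context attention is memory-bound.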
Efficiency Solutions
FlashAttention (Dao, 2022):
- Same math, better memory access
- Fuses operations, avoids materializing the full n×n attention matrix
- 2-4x faster, much less memory

KV cache (inference):
- Cache key/value from previous tokens
- Only compute attention for the new token
- Avoids recomputing the entire sequence

Grouped-Query Attention (GQA):
- Share K,V heads across query heads
- Llama 2 70B: 64 Q heads, 8 KV heads
- 8x less KV cache memory
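The KV-cache idea can be sketched in a few lines: during generation, each new token appends its key and value, and attention is computed as a single row against everything cached so far instead of a full n×n matrix. This is a toy single-head version with illustrative names:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only cache: each new token attends over all cached K, V."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)                     # store this token's key/value
        self.V.append(v)
        K = np.stack(self.K)                 # (t, d_k) keys so far
        V = np.stack(self.V)
        scores = K @ q / np.sqrt(len(q))     # one row of scores, not n×n
        return softmax(scores) @ V

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(5):                           # generate 5 tokens one at a time
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
```

Per generated token the cost is O(t) rather than O(t²); the price is the memory for the cached K and V, which is exactly what GQA's shared KV heads reduce.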
FlashAttention is now standard in all major frameworks. It doesn’t change the math — it changes the memory access pattern to exploit GPU hardware. The result: longer contexts, faster training, and lower memory usage with mathematically identical results.
Impact & Key Takeaways
Why the transformer is the most important architecture in AI history
The Transformer Revolution
Published in June 2017 by Vaswani et al. at Google Brain, “Attention Is All You Need” has become the most cited ML paper in history. The transformer replaced RNNs, CNNs (for NLP), and became the backbone of GPT, BERT, T5, Llama, Gemini, Claude, DALL-E, Stable Diffusion, AlphaFold 2, and virtually every frontier AI system.
Beyond NLP: Transformers now dominate computer vision (ViT), protein folding (AlphaFold), drug discovery, weather forecasting (GraphCast), music generation, robotics, and code generation. The architecture is domain-agnostic — any data that can be tokenized can be processed by a transformer.
Key Takeaways
1. Attention connects any two positions in O(1) steps

2. Query-Key-Value: Q asks, K matches, V provides content

3. Multi-head attention captures diverse relationships

4. Transformer block = attention + FFN + residual + LayerNorm

5. Positional encoding injects order (RoPE is now standard)

6. Decoder-only (GPT-style) dominates modern AI

7. O(n²) cost is the main limitation; FlashAttention + KV cache mitigate it
Coming up: Ch 10 explores how transformers scale into Large Language Models — pretraining, fine-tuning, RLHF, emergent abilities, and the path from GPT-1 to frontier models.