Ch 11 — The Attention Mechanism

Bahdanau attention, self-attention, multi-head attention, and why attention replaced recurrence
The Bottleneck Problem
Why fixed-size context vectors fail for long sequences
The Seq2Seq Limitation
Recall from Chapter 7: the seq2seq encoder compresses an entire input sentence into a single fixed-size vector. For short sentences, this works. But for a 50-word sentence, cramming all meaning into a 512-dimensional vector is like summarizing a novel in a tweet. Information about early words gets overwritten by later ones. Translation quality degrades sharply for sentences longer than ~20 words. The fundamental problem: the decoder has equal access to all parts of the input (via the context vector) but no way to focus on specific parts when generating each output word.
The Problem Illustrated
// Translating a long sentence
Input: "The agreement on the European
Economic Area was signed in August 1992."

// Seq2seq encoder:
// All 12 words → single 512-dim vector
// Early words ("The agreement") are
// overwritten by later words ("1992")

// When generating "L'accord" (French),
// decoder needs "The agreement" but
// can't selectively access it
Critical in AI: This bottleneck was the single biggest limitation of seq2seq models. Bahdanau’s attention mechanism solved it by allowing the decoder to “look back” at all encoder states, not just the final one.
Bahdanau Attention (2015)
Learning to align and translate
The Breakthrough
Bahdanau, Cho, and Bengio (ICLR 2015) proposed that instead of compressing the entire input into one vector, the decoder should attend to different parts of the input at each step. At each decoding step, the model computes attention weights — a probability distribution over all encoder hidden states — indicating which input words are most relevant for the current output word. The context vector is then a weighted sum of all encoder states, dynamically changing at each step. When generating “L’accord,” the model attends heavily to “The agreement.”
Bahdanau Attention
// At decoder step t:
// h₁...hₙ = encoder hidden states
// sₜ = decoder hidden state

// 1. Compute alignment scores
eₜᵢ = score(sₜ₋₁, hᵢ)      // for each input i

// 2. Normalize to attention weights
αₜᵢ = softmax(eₜᵢ)

// 3. Compute context vector
cₜ = Σᵢ αₜᵢ · hᵢ           // weighted sum

// 4. Use cₜ + sₜ₋₁ to generate output
// Different cₜ at each decoder step!
Key insight: Attention creates a direct connection between each decoder step and every encoder position. Information no longer needs to survive the bottleneck — the decoder can reach back and “look at” any part of the input at any time.
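The four steps above can be sketched directly in NumPy using the paper's additive scoring function, score(s, h) = vᵀ·tanh(W·s + U·h). The matrix and variable names below (W_s, W_h, v) are illustrative stand-ins for the learned parameters, with random values in place of training:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def bahdanau_context(s_prev, H, W_s, W_h, v):
    """One decoder step of additive (Bahdanau) attention.

    s_prev: previous decoder state s_{t-1}, shape (d_s,)
    H:      encoder hidden states h_1..h_n, shape (n, d_h)
    """
    # 1. Alignment scores: e_ti = v^T tanh(W_s s_{t-1} + W_h h_i)
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])
    # 2. Normalize to a probability distribution over input positions
    alpha = softmax(scores)
    # 3. Context vector = weighted sum of all encoder states
    return alpha @ H, alpha

n, d_h, d_s, d_a = 5, 8, 8, 16       # 5 input words, toy dimensions
rng = np.random.default_rng(0)
H   = rng.standard_normal((n, d_h))  # encoder states (random stand-ins)
s   = rng.standard_normal(d_s)       # decoder state s_{t-1}
W_s = rng.standard_normal((d_a, d_s))
W_h = rng.standard_normal((d_a, d_h))
v   = rng.standard_normal(d_a)

c_t, alpha = bahdanau_context(s, H, W_s, W_h, v)
print(c_t.shape)   # context vector has encoder-state dimension: (8,)
```

Because the scores depend on sₜ₋₁, calling this with a different decoder state yields a different context vector, which is exactly the "different cₜ at each decoder step" behavior described above.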
Queries, Keys & Values
The database analogy that unified attention
The QKV Framework
Vaswani et al. (2017) generalized attention using a database analogy. A query (Q) is what you’re looking for. Keys (K) are labels on stored items. Values (V) are the actual stored content. Attention computes the similarity between the query and each key, then returns a weighted sum of values. The similarity is a scaled dot product: score = Q·Kᵀ/√d_k, where d_k is the key dimension. The scaling prevents dot products from growing too large in high dimensions, which would push softmax into regions with tiny gradients.
Scaled Dot-Product Attention
// Scaled dot-product attention
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

// Q: (seq_len, d_k) — what am I looking for?
// K: (seq_len, d_k) — what do I contain?
// V: (seq_len, d_v) — what do I return?

// Q · Kᵀ: (seq_len × seq_len) attention matrix
// Each row = attention weights for one position
// softmax makes weights sum to 1
// √d_k scaling prevents gradient saturation
Key insight: Q, K, V are all learned linear projections of the input. The network learns what to query for, what to advertise as keys, and what to return as values. This flexibility is what makes attention so powerful.
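The formula is a few lines of NumPy. In this minimal sketch, random Q, K, V matrices stand in for the learned linear projections of the input:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) attention matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # (seq_len=3, d_k=4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 8))   # (seq_len=3, d_v=8)

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # output has the value dimension: (3, 8)
```

Note that the output shape follows V, not Q: attention retrieves content, and the query/key dimension only determines how similarity is scored.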
Self-Attention
A sequence attending to itself
From Cross-Attention to Self-Attention
Bahdanau attention is cross-attention: the decoder attends to the encoder. Self-attention is when a sequence attends to itself. Each position computes Q, K, V from its own representation, then attends to all other positions in the same sequence. This lets every word in a sentence directly interact with every other word, regardless of distance. In “The cat sat on the mat because it was tired,” self-attention lets “it” directly attend to “cat” to resolve the pronoun — no need to pass information through intermediate words.
Self-Attention vs. RNN
// RNN: information must pass through
// every intermediate position
cat → sat → on → the → mat → because → it
// Path length: O(n) — 6 steps
// Signal degrades over distance

// Self-attention: direct connection
it ←──────────────────────── cat
// Path length: O(1) — 1 step
// No degradation over distance

// Self-attention: every position attends
// to every other position simultaneously
// → fully parallel computation
Key insight: Self-attention has O(1) maximum path length between any two positions (vs. O(n) for RNNs). This means long-range dependencies are just as easy to learn as short-range ones. This is the fundamental advantage over recurrence.
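The only change from cross-attention is that Q, K, and V are all projections of the same sequence. A minimal single-head sketch, with random matrices standing in for learned projections and token positions following the example sentence:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# "The cat sat on the mat because it was tired" → 10 token positions
tokens = "The cat sat on the mat because it was tired".split()
seq_len, d_model, d_k = len(tokens), 16, 16

rng = np.random.default_rng(0)
X   = rng.standard_normal((seq_len, d_model))  # token representations (random stand-ins)
W_Q = rng.standard_normal((d_model, d_k)) * 0.1
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_k)) * 0.1

# Q, K, V all come from the SAME sequence X — that's what makes it "self"-attention
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len): every pair scored directly
out = weights @ V

# "it" (position 7) has a direct, one-step weight on "cat" (position 1)
print(weights[tokens.index("it"), tokens.index("cat")])
```

With untrained random projections the weights are meaningless, but the structure makes the O(1) path-length point: `weights[7, 1]` exists as a single learned scalar, with no chain of intermediate steps between "it" and "cat".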
Multi-Head Attention
Attending to different things simultaneously
Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs h parallel attention operations (heads), each with its own learned Q, K, V projections. Different heads learn to attend to different things: one head might capture syntactic relationships (subject-verb), another semantic similarity, another positional patterns. The outputs of all heads are concatenated and projected back to the model dimension. GPT-3 uses 96 heads; GPT-4 is estimated to use 128+.
Multi-Head Attention
// Multi-head attention
for i in range(h):                 // h heads
    Qᵢ = X · W_Qᵢ                  // (seq, d_k)
    Kᵢ = X · W_Kᵢ                  // (seq, d_k)
    Vᵢ = X · W_Vᵢ                  // (seq, d_v)
    headᵢ = Attention(Qᵢ, Kᵢ, Vᵢ)

output = Concat(head₁...headₕ) · W_O

// d_model = 768, h = 12 heads
// d_k = d_v = 768/12 = 64 per head
// Same total compute as single-head
Key insight: Multi-head attention doesn’t increase computation — each head operates on a smaller dimension (d_model/h). It’s like having multiple “perspectives” on the same data, each specialized for different types of relationships.
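The loop above translates almost line-for-line into NumPy. This sketch uses a toy configuration (d_model = 64, h = 8) in place of the 768/12 setup from the comments, with random stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    d_k = X.shape[-1] // h                    # each head gets d_model / h dims
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # (seq, d_k) each
        w = softmax(Q @ K.T / np.sqrt(d_k))            # per-head attention
        heads.append(w @ V)
    # Concatenate heads back to d_model, then apply output projection
    return np.concatenate(heads, axis=-1) @ W_O

seq, d_model, h = 6, 64, 8
d_k = d_model // h
rng = np.random.default_rng(0)
X   = rng.standard_normal((seq, d_model))
W_Q = rng.standard_normal((h, d_model, d_k)) * 0.1    # one projection per head
W_K = rng.standard_normal((h, d_model, d_k)) * 0.1
W_V = rng.standard_normal((h, d_model, d_k)) * 0.1
W_O = rng.standard_normal((d_model, d_model)) * 0.1

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)   # back to (6, 64) — same shape as the input
```

Because each head works in d_model/h dimensions, the total projection and score computation matches a single full-width head, which is the "same total compute" claim above.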
Attention Patterns & Visualization
What attention heads actually learn
What Heads Learn
Research on BERT and GPT attention patterns reveals that different heads specialize in different linguistic relationships: positional heads attend to adjacent tokens, syntactic heads connect subjects to verbs, coreference heads link pronouns to their antecedents, and delimiter heads attend to special tokens like [SEP] or [CLS]. Some heads appear to be “no-op” heads that can be pruned without affecting performance. This emergent specialization happens without any explicit supervision.
Attention Matrix
// Attention matrix for "The cat sat"
// Each row = where that token attends

        The   cat   sat
The   [ 0.1   0.7   0.2 ]   // "The" → "cat"
cat   [ 0.3   0.2   0.5 ]   // "cat" → "sat"
sat   [ 0.1   0.6   0.3 ]   // "sat" → "cat"

// Rows sum to 1 (softmax)
// High values = strong attention
// This head learned subject-verb links
Key insight: The attention matrix is interpretable — you can visualize which tokens attend to which. This makes attention-based models more transparent than RNNs, where information flow is hidden inside the recurrent state.
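The toy matrix above can be inspected programmatically: a few lines of NumPy confirm the rows are valid softmax distributions and recover the per-token "attends most to" annotations:

```python
import numpy as np

tokens = ["The", "cat", "sat"]
A = np.array([[0.1, 0.7, 0.2],    # "The" → "cat"
              [0.3, 0.2, 0.5],    # "cat" → "sat"
              [0.1, 0.6, 0.3]])   # "sat" → "cat"

# Softmax output: every row is a probability distribution
assert np.allclose(A.sum(axis=1), 1.0)

# Argmax of each row = the token this position attends to most
for i, t in enumerate(tokens):
    print(f"{t} attends most to {tokens[A[i].argmax()]}")
```

The same argmax-per-row reading is what attention-visualization tools draw as lines between tokens.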
The Quadratic Cost
Attention’s O(n²) complexity and solutions
The Scaling Problem
Self-attention computes a score between every pair of positions, creating an n×n attention matrix. For a sequence of length n, this costs O(n²) in both time and memory. A 1,000-token sequence needs 1 million attention scores; a 100,000-token sequence needs 10 billion. This quadratic cost is why early transformers were limited to 512–2,048 tokens. Solutions include sparse attention (attend only to nearby plus selected distant positions), linear attention (approximate the softmax), FlashAttention (GPU memory-efficient exact attention), and sliding-window attention (used in Mistral).
Complexity Comparison
// Attention complexity
Full attention:   O(n²) time, O(n²) memory
Sparse attention: O(n√n) or O(n·log n)
Linear attention: O(n) time,  O(n) memory
FlashAttention:   O(n²) time, O(n) memory

// Context lengths over time:
GPT-2  (2019): 1,024 tokens
GPT-3  (2020): 2,048 tokens
GPT-4  (2023): 8K–128K tokens
Gemini (2024): 1M–2M tokens

// FlashAttention + sparse methods
// made long contexts practical
Key insight: FlashAttention (Dao et al., 2022) doesn’t change the algorithm — it computes exact attention but reorganizes memory access patterns to avoid materializing the full n×n matrix. This 2-4× speedup made long-context models practical.
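To make the quadratic growth concrete, here is a back-of-the-envelope estimate of the attention-matrix size for a single head in a single layer, assuming 2-byte float16 scores (real memory use depends on implementation, batch size, and head count):

```python
# Bytes needed to materialize the full n×n score matrix, one head, one layer.
# Assumes float16 (2 bytes per score) — an illustrative estimate only.
def attn_matrix_bytes(n, bytes_per_score=2):
    return n * n * bytes_per_score

for n in (1_000, 2_048, 100_000):
    print(f"n={n:>7,}: {n * n:>18,} scores, "
          f"{attn_matrix_bytes(n) / 1e9:8.3f} GB")
```

At n = 100,000 the single-head matrix alone is tens of gigabytes, which is exactly the matrix FlashAttention avoids materializing by computing attention in tiles that fit in fast GPU memory.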
From Attention to Transformers
The mechanism that changed everything
The Paradigm Shift
Attention started as an add-on to RNNs (Bahdanau, 2015). Then Vaswani et al. (2017) asked the radical question: what if we use only attention, with no recurrence at all? The answer was the Transformer — an architecture built entirely from self-attention and feedforward layers. It was faster to train (fully parallelizable), handled long-range dependencies better (O(1) path length), and achieved state-of-the-art results on translation. Every major AI model since — BERT, GPT, LLaMA, Gemini, Claude — is a Transformer.
The connection: The next and final chapter covers the Transformer architecture in full: encoder-decoder structure, positional encoding, layer design, and how “Attention Is All You Need” became the foundation of modern AI. This is the bridge from deep learning fundamentals to the LLM era.
Attention Timeline
// The attention revolution
2015: Bahdanau attention (RNN add-on)
2015: Luong attention (simplified)
2017: Transformer (attention only!)
2018: BERT (bidirectional self-attention)
2018: GPT (causal self-attention)
2020: Vision Transformer (ViT)
2022: FlashAttention (efficient exact)
2023: Grouped-query attention (GQA)
2024: Ring attention (distributed long ctx)