Ch 9 — Attention & Transformers

“Attention Is All You Need” — the architecture that changed everything
High Level: Attention → Q K V → Multi-Head → Layers → Position → Variants
The Attention Mechanism
From add-on to the entire architecture
The Core Idea
Attention lets a model look at all positions in a sequence and decide which ones are relevant for the current task. Instead of compressing everything into a single vector (Ch 8’s seq2seq bottleneck), attention creates a weighted combination of all positions — different weights for each query.
# Attention in one sentence:
# "Given what I'm looking for (query), which parts of the input (keys)
#  match, and what information (values) should I gather?"

# Bahdanau (2015): attention as an add-on to an RNN encoder
# Vaswani (2017): attention IS the architecture
Why Attention Beats RNNs
RNN:        Word 1 → Word 2 → ... → Word 100
            100 steps to connect Word 1 to Word 100
            Sequential: can't parallelize

Attention:  Word 1 ↔ Word 100 directly
            1 step to connect any two positions
            All positions computed in parallel
The breakthrough: Attention reduces the path length between any two positions from O(n) to O(1). This means the model can learn long-range dependencies without information having to pass through many intermediate steps.
Query, Key, Value
The three projections that make attention work
The QKV Framework
Each input token is projected into three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The attention score between two tokens is the dot product of the query of one and the key of the other.
# Scaled dot-product attention
Q = X · W_Q   # queries
K = X · W_K   # keys
V = X · W_V   # values

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

# √d_k scaling prevents softmax saturation
# softmax turns scores into probabilities
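As a minimal sketch, the formula above can be written in a few lines of NumPy (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k) raw similarity scores
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights              # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` sums to 1: it is the probability distribution over which positions the corresponding query attends to.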
Analogy: Library Search
Think of a library. Your query is your search term. Each book has a key (its title/tags). The value is the book’s content. You match your query against all keys, and the most relevant books contribute their values to your answer — weighted by relevance.
Self-attention: When Q, K, and V all come from the same sequence, it’s called self-attention. Each word attends to every other word in the same sentence. “The cat sat on the mat because it was tired” — self-attention lets “it” attend strongly to “cat.”
Multi-Head Attention
Multiple attention patterns in parallel
Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs multiple attention operations in parallel, each with different learned W_Q, W_K, W_V projections. One head might learn syntax, another coreference, another semantic similarity.
# Multi-head attention
head₁ = Attention(X·W₁_Q, X·W₁_K, X·W₁_V)
head₂ = Attention(X·W₂_Q, X·W₂_K, X·W₂_V)
...
headₕ = Attention(X·Wₕ_Q, X·Wₕ_K, X·Wₕ_V)

MultiHead = Concat(head₁, ..., headₕ) · W_O
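A sketch of the same computation in NumPy, assuming the common "slice the model dimension" layout where each head operates on its own d_k = d_model / h slice (names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Split d_model into num_heads slices, attend per head, concat, project."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for h in range(num_heads):
        s = slice(h * d_k, (h + 1) * d_k)    # this head's slice of Q, K, V
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_O  # Concat(head₁..headₕ)·W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # 5 tokens, d_model = 16
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=4)
```

Note that the total compute matches a single head of width d_model; the heads just partition it into independent attention patterns.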
Typical Configurations
Model         d_model   heads   d_k
BERT-base     768       12      64
GPT-2         768       12      64
GPT-3         12288     96      128
GPT-4 (est.)  ~12K      ~128    ~96
Llama 3 70B   8192      64      128

# d_k = d_model / num_heads
# Each head operates on a slice
# Total compute = same as single large head
What heads learn: Research shows different heads specialize. Some track positional patterns (attend to adjacent tokens), some track syntactic relations (subject-verb), some handle coreference (“it” → “cat”). This emergent specialization is not programmed — it’s learned.
The Transformer Block
Attention + feed-forward + residuals + layer norm
One Transformer Layer
Each transformer layer has two sub-layers: multi-head self-attention and a feed-forward network (FFN). Both are wrapped with a residual connection and layer normalization. The FFN is applied independently to each position — it’s where the model stores factual knowledge.
# One transformer layer (Pre-LN variant)

1. Self-attention sub-layer:
   x = x + MultiHeadAttn(LayerNorm(x))

2. Feed-forward sub-layer:
   x = x + FFN(LayerNorm(x))

FFN(x) = GELU(x·W₁ + b₁) · W₂ + b₂

# FFN hidden dim = 4 × d_model (typical)
# e.g., d_model=768 → FFN hidden=3072
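The two sub-layers above can be sketched in NumPy. This is a toy forward pass, assuming a single-head self-attention stand-in for MultiHeadAttn and omitting the learned LayerNorm scale/shift parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (no learned scale/shift here)."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x):
    """Toy single-head self-attention, standing in for MultiHeadAttn."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def transformer_layer(x, ffn_w1, ffn_w2):
    """Pre-LN block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    x = x + self_attention(layer_norm(x))        # self-attention sub-layer
    h = gelu(layer_norm(x) @ ffn_w1)             # FFN hidden = 4 × d_model
    return x + h @ ffn_w2                        # feed-forward sub-layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, d_model = 8
w1 = rng.normal(size=(8, 32)) * 0.1              # hidden dim 32 = 4 × 8
w2 = rng.normal(size=(32, 8)) * 0.1
y = transformer_layer(x, w1, w2)
```

Because both sub-layers are residual (`x + ...`), the output keeps the input's shape, which is what allows these blocks to be stacked dozens of times.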
Stacking Layers
The original transformer used 6 layers. Modern models stack many more: BERT-large has 24, GPT-3 has 96, GPT-4 is estimated at 120+. Each layer refines the representation — early layers capture syntax, middle layers capture semantics, late layers prepare for the task.
Residual connections (x + sublayer(x)) are essential. Without them, gradients vanish in deep transformers just like in deep CNNs. Layer normalization stabilizes training by normalizing across the feature dimension. Together, they enable stacking 100+ layers.
Positional Encoding
Teaching transformers about word order
The Problem
Self-attention is permutation-invariant — it treats “cat sat mat” the same as “mat cat sat.” Unlike RNNs, which process tokens sequentially, transformers see all tokens at once. We must explicitly inject position information.
# Original: sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

# Each position gets a unique pattern
# Different frequencies for different dims
# Can generalize to unseen sequence lengths

input = token_embedding + positional_encoding
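The sinusoidal table above is easy to build directly (a sketch, assuming an even d_model):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; as `pos` grows, the low-index dimensions oscillate fast and the high-index ones slowly, giving each position a unique fingerprint.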
Modern Approaches
Absolute (original):
- Fixed sinusoidal or learned embeddings
- Added to input once

Relative (Shaw 2018, T5):
- Encode distance between tokens
- Added to attention scores

RoPE (Su 2021, used in Llama/GPT-4):
- Rotary Position Embedding
- Encodes position via rotation matrices
- Naturally handles relative distances
- Scales well to long contexts
RoPE is now the dominant approach. It rotates the query and key vectors by an angle proportional to their position. The dot product between rotated Q and K naturally depends on the relative distance between tokens — elegant and effective.
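A minimal RoPE sketch: rotate each (even, odd) pair of dimensions by an angle proportional to the token's position. The relative-distance property can then be checked numerically (names and the pairing convention here are illustrative; real implementations vary in how they pair dimensions):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dim pair of x[i] by positions[i] * theta_j."""
    n, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)       # per-pair frequencies
    angles = positions[:, None] * theta[None, :]    # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
q1, k1 = rope(np.stack([a, b]), np.array([2, 5]))   # positions 2 and 5
q2, k2 = rope(np.stack([a, b]), np.array([7, 10]))  # shifted by 5: same gap
```

Since rotations are orthogonal, `q1 @ k1` equals `q2 @ k2`: the dot product depends only on the relative distance (3 in both cases), not on absolute positions.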
Encoder vs. Decoder vs. Encoder-Decoder
Three transformer families for different tasks
Encoder-only (BERT, 2018):
- Bidirectional self-attention
- Each token sees ALL other tokens
- Task: understanding (classification, NER)
- Training: masked language modeling (MLM)

Decoder-only (GPT, 2018):
- Causal (masked) self-attention
- Each token sees only PREVIOUS tokens
- Task: generation (text, code, chat)
- Training: next-token prediction

Encoder-decoder (T5, 2019; original transformer):
- Encoder: bidirectional
- Decoder: causal + cross-attention to the encoder
- Task: translation, summarization
- Training: span corruption / denoising
The Causal Mask
Decoder models use a causal mask that prevents each position from attending to future tokens. When generating “The cat sat,” the model predicting “sat” can only see “The” and “cat” — not future words. This enables autoregressive generation.
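The causal mask is just an upper-triangular matrix of -inf added to the attention scores before the softmax. A sketch (single head, illustrative names):

```python
import numpy as np

def causal_attention_weights(X):
    """Masked self-attention weights: position i only attends to j <= i."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future tokens -> -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)          # exp(-inf) = 0 weight

w = causal_attention_weights(np.random.default_rng(0).normal(size=(4, 8)))
```

After the softmax, every entry above the diagonal is exactly zero, and the first token can only attend to itself (weight 1.0).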
The GPT family won. Decoder-only models dominate modern AI: GPT-4, Claude, Llama, Gemini, Mistral. The simplicity of next-token prediction + massive scale proved more powerful than the architectural complexity of encoder-decoder models. BERT-style models remain useful for classification and retrieval.
Efficiency & Scaling
The O(n²) problem and how to tame it
The Quadratic Cost
Self-attention computes scores between every pair of tokens: O(n²) time and memory. For a 2048-token sequence, that’s ~4 million attention scores per head. For 128K tokens (GPT-4 Turbo): ~16 billion. This is the transformer’s biggest limitation.
# Attention cost scaling
Sequence    Pairs    Memory
512         262K     ~2 MB
2,048       4.2M     ~32 MB
8,192       67M      ~512 MB
32,768      1.07B    ~8 GB
131,072     17.2B    ~128 GB

# Per head, per layer
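The quadratic growth in the "Pairs" column is easy to verify: it is simply n² score entries per head.

```python
def attention_pairs(n):
    """Number of query-key score pairs for a sequence of n tokens."""
    return n * n

for n in (512, 2048, 8192, 32768, 131072):
    print(f"{n:>7} tokens -> {attention_pairs(n):>14,} scores per head")
```

Going from 2K to 128K tokens multiplies the sequence length by 64 but the score count by 4096, which is why long-context attention is memory-bound.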
Efficiency Solutions
FlashAttention (Dao, 2022):
- Same math, better memory access
- Fuses operations, avoids materializing the full n×n attention matrix
- 2-4x faster, much less memory

KV cache (inference):
- Cache key/value from previous tokens
- Only compute attention for the new token
- Avoids recomputing the entire sequence

Grouped-Query Attention (GQA):
- Share K,V heads across query heads
- Llama 2 70B: 64 Q heads, 8 KV heads
- 8x less KV cache memory
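The KV-cache idea can be sketched in a few lines: during generation, each new token appends its key and value, and attention is computed as a single row against everything cached so far instead of a full n×n matrix. This is a toy single-head version with illustrative names:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only cache: each new token attends over all cached K, V."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)                     # store this token's key/value
        self.V.append(v)
        K = np.stack(self.K)                 # (t, d_k) keys so far
        V = np.stack(self.V)
        scores = K @ q / np.sqrt(len(q))     # one row of scores, not n×n
        return softmax(scores) @ V

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(5):                           # generate 5 tokens one at a time
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
```

Per generated token the cost is O(t) rather than O(t²); the price is the memory for the cached K and V, which is exactly what GQA's shared KV heads reduce.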
FlashAttention is now standard in all major frameworks. It doesn’t change the math — it changes the memory access pattern to exploit GPU hardware. The result: longer contexts, faster training, and lower memory usage with mathematically identical results.
Impact & Key Takeaways
Why the transformer is the most important architecture in AI history
The Transformer Revolution
Published in June 2017 by Vaswani et al. at Google Brain, “Attention Is All You Need” has become the most cited ML paper in history. The transformer replaced RNNs, CNNs (for NLP), and became the backbone of GPT, BERT, T5, Llama, Gemini, Claude, DALL-E, Stable Diffusion, AlphaFold 2, and virtually every frontier AI system.
Beyond NLP: Transformers now dominate computer vision (ViT), protein folding (AlphaFold), drug discovery, weather forecasting (GraphCast), music generation, robotics, and code generation. The architecture is domain-agnostic — any data that can be tokenized can be processed by a transformer.
Key Takeaways
1. Attention connects any two positions in O(1) steps

2. Query-Key-Value: Q asks, K matches, V provides content

3. Multi-head attention captures diverse relationships

4. Transformer block = attention + FFN + residual + LayerNorm

5. Positional encoding injects order (RoPE is now standard)

6. Decoder-only (GPT-style) dominates modern AI

7. O(n²) cost is the main limitation; FlashAttention + KV cache mitigate it
Coming up: Ch 10 explores how transformers scale into Large Language Models — pretraining, fine-tuning, RLHF, emergent abilities, and the path from GPT-1 to frontier models.