Ch 4 — Attention Weights: The Model’s Focus Mechanism

Q, K, V, O projection matrices — how the model decides what to attend to
High Level
Query (Q) → Key (K) → Value (V) → Attention → Output (O) → GQA
Four Projection Matrices Per Layer
Q, K, V, and O — the four tensors that implement attention
The Attention Blueprint
Every transformer layer has a self-attention block containing four weight matrices. These are the tensors that let the model "look at" other tokens when processing each position. Q (Query) asks "what am I looking for?", K (Key) asks "what do I contain?", V (Value) says "here's my information", and O (Output) projects the combined result back to the hidden dimension.
Tensor Names (Layer 0)
// Four attention tensors per layer:
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.o_proj.weight

// × 32 layers = 128 attention tensors total
// Layer index runs from 0 to 31
Key insight: The tensor name encodes exactly where it lives in the architecture. model.layers.15.self_attn.k_proj.weight tells you: model → layer 15 → self attention → key projection → weight matrix.
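Because the naming scheme is so regular, the full set of attention tensor names can be generated programmatically. A small illustrative sketch (the helper name is ours, not part of any library) that enumerates the names and confirms the count of 128:

```python
# Illustrative sketch: enumerate the attention tensor names for a
# 32-layer Llama-style model, following the naming scheme shown above.
def attention_tensor_names(num_layers=32):
    projections = ("q_proj", "k_proj", "v_proj", "o_proj")
    return [
        f"model.layers.{layer}.self_attn.{proj}.weight"
        for layer in range(num_layers)
        for proj in projections
    ]

names = attention_tensor_names()
print(len(names))    # 128 tensors (4 per layer × 32 layers)
print(names[0])      # model.layers.0.self_attn.q_proj.weight
print(names[-1])     # model.layers.31.self_attn.o_proj.weight
```

The same pattern works for any layer count: swap in `num_layers=80` for a 70B-class model and the arithmetic scales accordingly.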
Q and O: Full-Width Matrices
Shape [hidden_size, hidden_size] = [4096, 4096] for Llama 3.1 8B
Query Projection
The Q (query) projection transforms each token's hidden state into a "question" vector. It has shape [hidden_size, hidden_size] — [4096, 4096] for Llama 3.1 8B. This single matrix actually contains 32 heads packed together: conceptually, it's 32 separate [128, 4096] matrices stacked. Each head independently learns to ask different types of questions: one might focus on syntax, another on semantics, another on positional relationships.
Shape Math
// Q and O projection shapes:
q_proj.weight: [4096, 4096]
  // = [num_heads × head_dim, hidden_size]
  // = [32 × 128, 4096]
  // 32 heads, each 128-dimensional
o_proj.weight: [4096, 4096]
  // Projects concatenated head outputs
  // back to hidden_size

// Each is 4096 × 4096 × 2 bytes = 32 MB
Key insight: head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128. This 128-dimensional space is where each attention head operates. The full Q matrix is all 32 heads packed into one tensor for computational efficiency.
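The head_dim arithmetic is easy to verify directly. A quick check using the config values quoted above (hidden_size=4096, 32 heads, 2 bytes per BF16 value):

```python
# Verify the shape arithmetic for the Q projection (Llama 3.1 8B config).
hidden_size = 4096
num_heads = 32
head_dim = hidden_size // num_heads            # 128 dims per head
q_shape = (num_heads * head_dim, hidden_size)  # (4096, 4096)
size_bytes = q_shape[0] * q_shape[1] * 2       # 2 bytes per BF16 value

print(head_dim)                   # 128
print(q_shape)                    # (4096, 4096)
print(size_bytes // 2**20, "MB")  # 32 MB
```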
K and V: Smaller with GQA
Grouped-Query Attention shrinks K/V from [4096, 4096] to [1024, 4096]
Why K/V Are Smaller
In standard multi-head attention, K and V have the same number of heads as Q (32). But Llama 3.1 uses Grouped-Query Attention (GQA), where multiple Q heads share the same K and V heads. Llama 3.1 8B has 32 Q heads but only 8 KV heads — each KV head is shared by 4 Q heads. This means K and V projections have shape [num_kv_heads × head_dim, hidden_size] = [8 × 128, 4096] = [1024, 4096].
GQA Tensor Shapes
// Llama 3.1 8B (GQA: 32 Q heads, 8 KV heads):
q_proj.weight: [4096, 4096]  // 32 heads
k_proj.weight: [1024, 4096]  // 8 heads
v_proj.weight: [1024, 4096]  // 8 heads
o_proj.weight: [4096, 4096]  // full width

// Without GQA (standard MHA):
// K, V would be [4096, 4096] each
// GQA saves 75% of KV parameters
Key insight: GQA saves both file size (smaller K/V weight tensors) AND inference memory (smaller KV cache at runtime). This is why modern LLMs almost universally use GQA — the quality loss is minimal but the memory savings are huge.
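The sharing itself is simple to sketch: at compute time, each K (and V) head is repeated so that every group of 4 Q heads reads the same KV head. This mirrors the repeat step that implementations such as Hugging Face Transformers call `repeat_kv`; shapes here follow the 8B config, values are random placeholders:

```python
import numpy as np

# Minimal sketch of KV-head sharing in GQA: 8 K heads are repeated
# so each group of 4 Q heads attends against the same K head.
seq_len, num_q_heads, num_kv_heads, head_dim = 10, 32, 8, 128
group = num_q_heads // num_kv_heads        # 4 Q heads per KV head

k = np.random.randn(num_kv_heads, seq_len, head_dim)   # [8, 10, 128]
k_expanded = np.repeat(k, group, axis=0)               # [32, 10, 128]

print(k_expanded.shape)                                # (32, 10, 128)
# Q heads 0-3 all see the same underlying K head:
print(np.array_equal(k_expanded[0], k_expanded[3]))    # True
```

Only the 8-head tensor is ever stored on disk or in the KV cache; the repeat is a cheap runtime view, which is where the 75% savings comes from.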
Multi-Head Attention: Packed into One Matrix
32 independent heads concatenated into a single tensor
Head Packing
The [4096, 4096] Q matrix isn't one monolithic transformation. It's 32 independent [128, 4096] matrices stacked vertically. During the forward pass, the framework reshapes the output from [seq_len, 4096] to [seq_len, 32, 128] — splitting it into 32 heads of 128 dimensions each. Each head computes attention independently, then the results are concatenated and projected through O.
Reshape Visualization
// Q projection + reshape per layer:
input:   [seq_len, 4096]
//   ↓ Multiply by q_proj.weight.T
q_flat:  [seq_len, 4096]
//   ↓ Reshape: split 4096 into 32 × 128
q_heads: [seq_len, 32, 128]
//   ↓ Each head attends independently

// K uses 8 heads: [seq_len, 8, 128]
// Each K head is shared by 4 Q heads
Key insight: The reshape is free — no computation, just reinterpreting the same memory. This is why heads are packed into one matrix: it's more efficient to do one large matrix multiply than 32 small ones on GPUs.
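That "the reshape is free" claim can be demonstrated directly in numpy: the reshaped array is a view over the same memory, and each head's slice is just a contiguous run of columns in the flat projection. Random data stands in for a real projection output:

```python
import numpy as np

# The head-split reshape: [seq_len, 4096] reinterpreted as
# [seq_len, 32, 128] with no data copied.
seq_len, num_heads, head_dim = 10, 32, 128
q_flat = np.random.randn(seq_len, num_heads * head_dim)

q_heads = q_flat.reshape(seq_len, num_heads, head_dim)

print(q_heads.shape)                      # (10, 32, 128)
# No copy: both arrays share the same underlying buffer
print(np.shares_memory(q_flat, q_heads))  # True
# Head 5 of token 0 is just columns 640:768 of the flat row
print(np.array_equal(q_heads[0, 5], q_flat[0, 5*128:6*128]))  # True
```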
What Attention Actually Computes
QK^T / √d → softmax → multiply by V
The Attention Formula
After projection, each head computes: Attention(Q, K, V) = softmax(QKᵀ / √d) · V. The QKᵀ dot product produces an "attention score" between every pair of tokens. Division by √d = √128 keeps the scores from growing too large before the softmax. Softmax normalizes each row into probabilities. Multiplying by V mixes the value vectors according to these attention weights. This is the mechanism that lets token 5 "look at" token 2.
Step-by-Step
// For one head, sequence length 10:
Q: [10, 128]  // 10 query vectors
K: [10, 128]  // 10 key vectors
V: [10, 128]  // 10 value vectors

scores  = Q @ K.T          // [10, 10] attention map
scores  = scores / √128    // scale down
weights = softmax(scores)  // [10, 10] probabilities
output  = weights @ V      // [10, 128] weighted mix
Why it matters: The weight tensors (q_proj, k_proj, v_proj) store the learned projections — they determine HOW the model creates queries, keys, and values. The attention scores themselves are computed at runtime and never stored in the file.
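The step-by-step above runs as-is in numpy. A minimal single-head sketch with random stand-ins for the projected Q, K, V (a real model would also apply a causal mask, omitted here to match the steps shown):

```python
import numpy as np

# Single-head scaled dot-product attention, numpy version of the
# steps above. Random inputs stand in for real projections.
rng = np.random.default_rng(0)
seq_len, head_dim = 10, 128
Q = rng.standard_normal((seq_len, head_dim))
K = rng.standard_normal((seq_len, head_dim))
V = rng.standard_normal((seq_len, head_dim))

scores = Q @ K.T / np.sqrt(head_dim)      # [10, 10] scaled scores
# Numerically stable row-wise softmax:
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                      # [10, 128] weighted mix

print(output.shape)                            # (10, 128)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True: rows are probabilities
```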
GQA: The Memory Multiplier
Why sharing KV heads saves 75% of KV cache memory at inference
GQA vs MHA vs MQA
MHA (Multi-Head Attention): 32 Q heads, 32 KV heads. Full quality, full cost.

MQA (Multi-Query Attention): 32 Q heads, 1 KV head. Maximum savings, some quality loss.

GQA (Grouped-Query Attention): 32 Q heads, 8 KV heads. Middle ground used by Llama 3, Mistral, and most modern LLMs. The GQA paper (Ainslie et al., 2023) showed that 8 KV heads retain nearly all the quality of 32 while using only 25% of the KV memory.
Parameter Savings
// Attention params per layer (Llama 3.1 8B):

With MHA (32 KV heads):
  Q: 4096×4096 + K: 4096×4096
  V: 4096×4096 + O: 4096×4096
  = 67.1M params/layer

With GQA (8 KV heads):
  Q: 4096×4096 + K: 1024×4096
  V: 1024×4096 + O: 4096×4096
  = 41.9M params/layer

// 37% fewer attention params per layer
// × 32 layers = ~805M params saved total
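The savings arithmetic is worth checking once by hand. A short script that recomputes the per-layer figures from the tensor shapes:

```python
# Recompute the MHA-vs-GQA parameter counts from the tensor shapes.
hidden, kv_dim, layers = 4096, 1024, 32

mha = 4 * hidden * hidden                        # Q, K, V, O all full width
gqa = 2 * hidden * hidden + 2 * kv_dim * hidden  # K, V shrunk to 8 heads

print(f"MHA: {mha / 1e6:.1f}M/layer")                  # 67.1M/layer
print(f"GQA: {gqa / 1e6:.1f}M/layer")                  # 41.9M/layer
print(f"saved: {(mha - gqa) * layers / 1e6:.0f}M")     # 805M across 32 layers
```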
The Attention Norm Tensor
input_layernorm.weight — a tiny but critical tensor
RMSNorm Before Attention
Before the attention computation, the input is passed through RMS Layer Normalization. The tensor model.layers.{N}.input_layernorm.weight has shape [4096] — just 4,096 numbers. It rescales each dimension of the hidden state to stabilize training. Despite being tiny (8 KB in BF16), removing it causes training to diverge completely. It's applied element-wise: each of the 4,096 dimensions gets its own learned scale factor.
Norm Tensor Size
// Normalization tensors per layer:
input_layernorm.weight: [4096]
  // Applied BEFORE attention
  // 4096 × 2 bytes = 8 KB
post_attention_layernorm.weight: [4096]
  // Applied BEFORE the FFN
  // 4096 × 2 bytes = 8 KB

// 2 norms × 32 layers = 64 tensors
// Total: 64 × 8 KB = 512 KB
// Less than 0.003% of the model!
Key insight: Normalization tensors are the smallest weights in the file but among the most critical. They're the "guardrails" that keep values in a healthy range as data flows through 32 layers of transformations.
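RMSNorm itself is only a few lines: divide each hidden state by its root-mean-square, then scale element-wise by the learned [4096] weight vector. A minimal numpy sketch; the epsilon is illustrative (Llama configs expose it as rms_norm_eps, typically 1e-5):

```python
import numpy as np

# RMSNorm as applied before attention: normalize by root-mean-square,
# then apply the per-dimension learned scale (input_layernorm.weight).
def rms_norm(x, weight, eps=1e-5):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight   # element-wise learned scale

hidden = np.random.randn(10, 4096)   # 10 token hidden states
weight = np.ones(4096)               # learned scales (often init'd to 1)
out = rms_norm(hidden, weight)

print(out.shape)                     # (10, 4096)
```

Note there is no mean subtraction and no bias, which is what makes RMSNorm cheaper than classic LayerNorm while serving the same stabilizing role.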
Attention Parameter Budget
How attention's ~17% of the model breaks down
Per-Layer Budget
Each of the 32 layers contributes about 41.9M attention parameters (with GQA). Across 32 layers, that's approximately 1.34 billion attention parameters, roughly 17% of the 8.03B total. The Q and O projections dominate (16.8M params each), while K and V are 4.2M each thanks to GQA. Add in two norm tensors (4K params each) and the attention block totals about 42M params/layer.
Attention Cheat Sheet
// Per-layer attention tensors (Llama 3.1 8B):
q_proj: [4096, 4096] = 16.8M params (32 MB)
k_proj: [1024, 4096] =  4.2M params  (8 MB)
v_proj: [1024, 4096] =  4.2M params  (8 MB)
o_proj: [4096, 4096] = 16.8M params (32 MB)
norms:  [4096] × 2   =   8K params
────────────────────────────────
Total: ~42M params/layer (~80 MB)
× 32 layers = ~1.34B (~2.6 GB)
Key insight: Attention is about 17% of the model's parameters. The real heavyweight is the FFN (next chapter) at roughly 70%. Understanding this ratio helps you predict where quantization quality loss will hit hardest and where LoRA adapters should target.
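The cheat sheet totals can be rebuilt from first principles in a few lines:

```python
# Recompute the attention parameter budget from the tensor shapes above.
hidden, kv_dim, layers = 4096, 1024, 32

per_layer = (2 * hidden * hidden     # q_proj + o_proj
             + 2 * kv_dim * hidden   # k_proj + v_proj
             + 2 * hidden)           # two [4096] norm vectors
total = per_layer * layers

print(f"{per_layer / 1e6:.1f}M params/layer")  # 42.0M params/layer
print(f"{total / 1e9:.2f}B total")             # 1.34B total
```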