Ch 4 — Attention Weights: The Model’s Focus Mechanism

Q, K, V, O projection matrices — how the model decides what to attend to
High Level
Query (Q) → Key (K) → Value (V) → Attention → Output (O) → GQA
Four Projection Matrices Per Layer
Q, K, V, and O — the four tensors that implement attention
The Attention Blueprint
Every transformer layer has a self-attention block containing four weight matrices. These are the tensors that let the model "look at" other tokens when processing each position. Q (Query) asks "what am I looking for?", K (Key) asks "what do I contain?", V (Value) says "here's my information", and O (Output) projects the combined result back to the hidden dimension.
Tensor Names (Layer 0)
// Four attention tensors per layer:
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.o_proj.weight

// × 32 layers = 128 attention tensors total
// Layer index runs from 0 to 31
Key insight: The tensor name encodes exactly where it lives in the architecture. model.layers.15.self_attn.k_proj.weight tells you: model → layer 15 → self attention → key projection → weight matrix.
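Because the naming scheme is so regular, the full set of attention tensor names can be generated programmatically. A small illustrative sketch (the helper name is ours, not part of any library) that enumerates the names and confirms the count of 128:

```python
# Illustrative sketch: enumerate the attention tensor names for a
# 32-layer Llama-style model, following the naming scheme shown above.
def attention_tensor_names(num_layers=32):
    projections = ("q_proj", "k_proj", "v_proj", "o_proj")
    return [
        f"model.layers.{layer}.self_attn.{proj}.weight"
        for layer in range(num_layers)
        for proj in projections
    ]

names = attention_tensor_names()
print(len(names))    # 128 tensors (4 per layer × 32 layers)
print(names[0])      # model.layers.0.self_attn.q_proj.weight
print(names[-1])     # model.layers.31.self_attn.o_proj.weight
```

The same pattern works for any layer count: swap in `num_layers=80` for a 70B-class model and the arithmetic scales accordingly.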
Q and O: Full-Width Matrices
Shape [hidden_size, hidden_size] = [4096, 4096] for Llama 3.1 8B
Query Projection
The Q (query) projection transforms each token's hidden state into a "question" vector. It has shape [hidden_size, hidden_size] — [4096, 4096] for Llama 3.1 8B. This single matrix actually contains 32 heads packed together: conceptually, it's 32 separate [128, 4096] matrices stacked. Each head independently learns to ask different types of questions: one might focus on syntax, another on semantics, another on positional relationships.
Shape Math
// Q and O projection shapes:
q_proj.weight: [4096, 4096]
  // = [num_heads × head_dim, hidden_size]
  // = [32 × 128, 4096]
  // 32 heads, each 128-dimensional
o_proj.weight: [4096, 4096]
  // Projects concatenated head outputs
  // back to hidden_size

// Each is 4096 × 4096 × 2 bytes = 32 MB
Key insight: head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128. This 128-dimensional space is where each attention head operates. The full Q matrix is all 32 heads packed into one tensor for computational efficiency.
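The head_dim arithmetic is easy to verify directly. A quick check using the config values quoted above (hidden_size=4096, 32 heads, 2 bytes per BF16 value):

```python
# Verify the shape arithmetic for the Q projection (Llama 3.1 8B config).
hidden_size = 4096
num_heads = 32
head_dim = hidden_size // num_heads            # 128 dims per head
q_shape = (num_heads * head_dim, hidden_size)  # (4096, 4096)
size_bytes = q_shape[0] * q_shape[1] * 2       # 2 bytes per BF16 value

print(head_dim)                   # 128
print(q_shape)                    # (4096, 4096)
print(size_bytes // 2**20, "MB")  # 32 MB
```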
K and V: Smaller with GQA
Grouped-Query Attention shrinks K/V from [4096, 4096] to [1024, 4096]
Why K/V Are Smaller
In standard multi-head attention, K and V have the same number of heads as Q (32). But Llama 3.1 uses Grouped-Query Attention (GQA), where multiple Q heads share the same K and V heads. Llama 3.1 8B has 32 Q heads but only 8 KV heads — each KV head is shared by 4 Q heads. This means K and V projections have shape [num_kv_heads × head_dim, hidden_size] = [8 × 128, 4096] = [1024, 4096].
GQA Tensor Shapes
// Llama 3.1 8B (GQA: 32 Q heads, 8 KV heads):
q_proj.weight: [4096, 4096]  // 32 heads
k_proj.weight: [1024, 4096]  // 8 heads
v_proj.weight: [1024, 4096]  // 8 heads
o_proj.weight: [4096, 4096]  // full width

// Without GQA (standard MHA):
// K, V would be [4096, 4096] each
// GQA saves 75% of KV parameters
Key insight: GQA saves both file size (smaller K/V weight tensors) AND inference memory (smaller KV cache at runtime). This is why modern LLMs almost universally use GQA — the quality loss is minimal but the memory savings are huge.
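The sharing itself is simple to sketch: at compute time, each K (and V) head is repeated so that every group of 4 Q heads reads the same KV head. This mirrors the repeat step that implementations such as Hugging Face Transformers call `repeat_kv`; shapes here follow the 8B config, values are random placeholders:

```python
import numpy as np

# Minimal sketch of KV-head sharing in GQA: 8 K heads are repeated
# so each group of 4 Q heads attends against the same K head.
seq_len, num_q_heads, num_kv_heads, head_dim = 10, 32, 8, 128
group = num_q_heads // num_kv_heads        # 4 Q heads per KV head

k = np.random.randn(num_kv_heads, seq_len, head_dim)   # [8, 10, 128]
k_expanded = np.repeat(k, group, axis=0)               # [32, 10, 128]

print(k_expanded.shape)                                # (32, 10, 128)
# Q heads 0-3 all see the same underlying K head:
print(np.array_equal(k_expanded[0], k_expanded[3]))    # True
```

Only the 8-head tensor is ever stored on disk or in the KV cache; the repeat is a cheap runtime view, which is where the 75% savings comes from.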
Multi-Head Attention: Packed into One Matrix
32 independent heads concatenated into a single tensor
Head Packing
The [4096, 4096] Q matrix isn't one monolithic transformation. It's 32 independent [128, 4096] matrices stacked vertically. During the forward pass, the framework reshapes the output from [seq_len, 4096] to [seq_len, 32, 128] — splitting it into 32 heads of 128 dimensions each. Each head computes attention independently, then the results are concatenated and projected through O.
Reshape Visualization
// Q projection + reshape per layer:
input:   [seq_len, 4096]
//   ↓ Multiply by q_proj.weight.T
q_flat:  [seq_len, 4096]
//   ↓ Reshape: split 4096 into 32 × 128
q_heads: [seq_len, 32, 128]
//   ↓ Each head attends independently

// K uses 8 heads: [seq_len, 8, 128]
// Each K head is shared by 4 Q heads
Key insight: The reshape is free — no computation, just reinterpreting the same memory. This is why heads are packed into one matrix: it's more efficient to do one large matrix multiply than 32 small ones on GPUs.
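That "the reshape is free" claim can be demonstrated directly in numpy: the reshaped array is a view over the same memory, and each head's slice is just a contiguous run of columns in the flat projection. Random data stands in for a real projection output:

```python
import numpy as np

# The head-split reshape: [seq_len, 4096] reinterpreted as
# [seq_len, 32, 128] with no data copied.
seq_len, num_heads, head_dim = 10, 32, 128
q_flat = np.random.randn(seq_len, num_heads * head_dim)

q_heads = q_flat.reshape(seq_len, num_heads, head_dim)

print(q_heads.shape)                      # (10, 32, 128)
# No copy: both arrays share the same underlying buffer
print(np.shares_memory(q_flat, q_heads))  # True
# Head 5 of token 0 is just columns 640:768 of the flat row
print(np.array_equal(q_heads[0, 5], q_flat[0, 5*128:6*128]))  # True
```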
What Attention Actually Computes
QK^T / √d → softmax → multiply by V
The Attention Formula
After projection, each head computes: Attention(Q, K, V) = softmax(QKᵀ / √d) · V. The QKᵀ dot product produces an "attention score" between every pair of tokens. Division by √d = √128 keeps the scores from growing too large before the softmax. Softmax normalizes each row into probabilities. Multiplying by V mixes the value vectors according to these attention weights. This is the mechanism that lets token 5 "look at" token 2.
Step-by-Step
// For one head, sequence length 10:
Q: [10, 128]  // 10 query vectors
K: [10, 128]  // 10 key vectors
V: [10, 128]  // 10 value vectors

scores  = Q @ K.T          // [10, 10] attention map
scores  = scores / √128    // scale down
weights = softmax(scores)  // [10, 10] probabilities
output  = weights @ V      // [10, 128] weighted mix
Why it matters: The weight tensors (q_proj, k_proj, v_proj) store the learned projections — they determine HOW the model creates queries, keys, and values. The attention scores themselves are computed at runtime and never stored in the file.
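The step-by-step above runs as-is in numpy. A minimal single-head sketch with random stand-ins for the projected Q, K, V (a real model would also apply a causal mask, omitted here to match the steps shown):

```python
import numpy as np

# Single-head scaled dot-product attention, numpy version of the
# steps above. Random inputs stand in for real projections.
rng = np.random.default_rng(0)
seq_len, head_dim = 10, 128
Q = rng.standard_normal((seq_len, head_dim))
K = rng.standard_normal((seq_len, head_dim))
V = rng.standard_normal((seq_len, head_dim))

scores = Q @ K.T / np.sqrt(head_dim)      # [10, 10] scaled scores
# Numerically stable row-wise softmax:
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                      # [10, 128] weighted mix

print(output.shape)                            # (10, 128)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True: rows are probabilities
```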
GQA: The Memory Multiplier
Why sharing KV heads saves 75% of KV cache memory at inference
GQA vs MHA vs MQA
MHA (Multi-Head Attention): 32 Q heads, 32 KV heads. Full quality, full cost.

MQA (Multi-Query Attention): 32 Q heads, 1 KV head. Maximum savings, some quality loss.

GQA (Grouped-Query Attention): 32 Q heads, 8 KV heads. Middle ground used by Llama 3, Mistral, and most modern LLMs. The GQA paper (Ainslie et al., 2023) showed that 8 KV heads retain nearly all the quality of 32 while using only 25% of the KV memory.
Parameter Savings
// Attention params per layer (Llama 3.1 8B):

With MHA (32 KV heads):
  Q: 4096×4096 + K: 4096×4096
  V: 4096×4096 + O: 4096×4096
  = 67.1M params/layer

With GQA (8 KV heads):
  Q: 4096×4096 + K: 1024×4096
  V: 1024×4096 + O: 4096×4096
  = 41.9M params/layer

// 37% fewer attention params per layer
// × 32 layers = ~805M params saved total
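The savings arithmetic is worth checking once by hand. A short script that recomputes the per-layer figures from the tensor shapes:

```python
# Recompute the MHA-vs-GQA parameter counts from the tensor shapes.
hidden, kv_dim, layers = 4096, 1024, 32

mha = 4 * hidden * hidden                        # Q, K, V, O all full width
gqa = 2 * hidden * hidden + 2 * kv_dim * hidden  # K, V shrunk to 8 heads

print(f"MHA: {mha / 1e6:.1f}M/layer")                  # 67.1M/layer
print(f"GQA: {gqa / 1e6:.1f}M/layer")                  # 41.9M/layer
print(f"saved: {(mha - gqa) * layers / 1e6:.0f}M")     # 805M across 32 layers
```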
The Attention Norm Tensor
input_layernorm.weight — a tiny but critical tensor
RMSNorm Before Attention
Before the attention computation, the input is passed through RMS Layer Normalization. The tensor model.layers.{N}.input_layernorm.weight has shape [4096] — just 4,096 numbers. It rescales each dimension of the hidden state to stabilize training. Despite being tiny (8 KB in BF16), removing it causes training to diverge completely. It's applied element-wise: each of the 4,096 dimensions gets its own learned scale factor.
Norm Tensor Size
// Normalization tensors per layer:
input_layernorm.weight: [4096]
  // Applied BEFORE attention
  // 4096 × 2 bytes = 8 KB
post_attention_layernorm.weight: [4096]
  // Applied BEFORE the FFN
  // 4096 × 2 bytes = 8 KB

// 2 norms × 32 layers = 64 tensors
// Total: 64 × 8 KB = 512 KB
// Less than 0.003% of the model!
Key insight: Normalization tensors are the smallest weights in the file but among the most critical. They're the "guardrails" that keep values in a healthy range as data flows through 32 layers of transformations.
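RMSNorm itself is only a few lines: divide each hidden state by its root-mean-square, then scale element-wise by the learned [4096] weight vector. A minimal numpy sketch; the epsilon is illustrative (Llama configs expose it as rms_norm_eps, typically 1e-5):

```python
import numpy as np

# RMSNorm as applied before attention: normalize by root-mean-square,
# then apply the per-dimension learned scale (input_layernorm.weight).
def rms_norm(x, weight, eps=1e-5):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight   # element-wise learned scale

hidden = np.random.randn(10, 4096)   # 10 token hidden states
weight = np.ones(4096)               # learned scales (often init'd to 1)
out = rms_norm(hidden, weight)

print(out.shape)                     # (10, 4096)
```

Note there is no mean subtraction and no bias, which is what makes RMSNorm cheaper than classic LayerNorm while serving the same stabilizing role.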
Attention Parameter Budget
How attention's ~17% of the model breaks down
Per-Layer Budget
Each of the 32 layers contributes about 41.9M attention parameters (with GQA). Across 32 layers, that's approximately 1.34 billion attention parameters, roughly 17% of the 8.03B total. The Q and O projections dominate (16.8M params each), while K and V are 4.2M each thanks to GQA. Add in two norm tensors (4K params each) and the attention block totals about 42M params/layer.
Attention Cheat Sheet
// Per-layer attention tensors (Llama 3.1 8B):
q_proj: [4096, 4096] = 16.8M params (32 MB)
k_proj: [1024, 4096] =  4.2M params  (8 MB)
v_proj: [1024, 4096] =  4.2M params  (8 MB)
o_proj: [4096, 4096] = 16.8M params (32 MB)
norms:  [4096] × 2   =   8K params
────────────────────────────────
Total: ~42M params/layer (~80 MB)
× 32 layers = ~1.34B (~2.6 GB)
Key insight: Attention is about 17% of the model's parameters. The real heavyweight is the FFN (next chapter) at roughly 70%. Understanding this ratio helps you predict where quantization quality loss will hit hardest and where LoRA adapters should target.
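The cheat sheet totals can be rebuilt from first principles in a few lines:

```python
# Recompute the attention parameter budget from the tensor shapes above.
hidden, kv_dim, layers = 4096, 1024, 32

per_layer = (2 * hidden * hidden     # q_proj + o_proj
             + 2 * kv_dim * hidden   # k_proj + v_proj
             + 2 * hidden)           # two [4096] norm vectors
total = per_layer * layers

print(f"{per_layer / 1e6:.1f}M params/layer")  # 42.0M params/layer
print(f"{total / 1e9:.2f}B total")             # 1.34B total
```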