The Attention Formula
After projection, each head computes: Attention(Q, K, V) = softmax(QKᵀ / √d) · V. The QKᵀ product produces an "attention score" between every pair of tokens. Division by √d — here √128, since each head is 128-dimensional — keeps the scores from growing so large that softmax saturates. Softmax normalizes each row of scores into a probability distribution. Multiplying by V mixes the value vectors according to these attention weights. This is the mechanism that lets token 5 "look at" token 2.
Step-by-Step
// For one head, sequence length 10:
Q: [10, 128] // 10 query vectors
K: [10, 128] // 10 key vectors
V: [10, 128] // 10 value vectors
scores = Q @ K.T // [10, 10] attention map
scores = scores / √128 // scale down
weights = softmax(scores) // [10, 10] probabilities
output = weights @ V // [10, 128] weighted mix
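The steps above can be run end-to-end in NumPy. This is a minimal sketch with random stand-in matrices, assuming the same shapes as the walkthrough (sequence length 10, head dimension 128):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 10, 128  # sequence length and head dimension from the text

Q = rng.standard_normal((seq_len, d_k))  # 10 query vectors
K = rng.standard_normal((seq_len, d_k))  # 10 key vectors
V = rng.standard_normal((seq_len, d_k))  # 10 value vectors

scores = Q @ K.T / np.sqrt(d_k)          # [10, 10] scaled attention map

# Row-wise softmax: subtract the row max for numerical stability,
# then normalize so each row sums to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                     # [10, 128] weighted mix of values
print(output.shape)                      # (10, 128)
```

Row i of `weights` tells you how much token i attends to every other token; row i of `output` is the corresponding blend of value vectors.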
Why it matters: The weight tensors (q_proj, k_proj, v_proj) store the learned projections — they determine HOW the model creates queries, keys, and values. The attention scores themselves are computed at runtime and never stored in the file.
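The split between stored projections and runtime activations can be sketched as follows. The dimensions here are illustrative assumptions (a 512-dim model with 128-dim heads), and the random matrices stand in for real checkpoint weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, seq_len = 512, 128, 10  # illustrative sizes, not from the text

x = rng.standard_normal((seq_len, d_model))      # token embeddings (runtime)

# These three matrices are what the checkpoint actually stores:
q_proj = rng.standard_normal((d_model, d_head))
k_proj = rng.standard_normal((d_model, d_head))
v_proj = rng.standard_normal((d_model, d_head))

# Q, K, V are recomputed on every forward pass and never saved to the file.
Q, K, V = x @ q_proj, x @ k_proj, x @ v_proj
print(Q.shape, K.shape, V.shape)  # (10, 128) (10, 128) (10, 128)
```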