The QKV Framework
Vaswani et al. (2017) generalized attention using a database analogy. A query (Q) is what you’re looking for. Keys (K) are labels on stored items. Values (V) are the actual stored content. Attention computes the similarity between the query and each key, then returns a weighted sum of values. The similarity is computed as a scaled dot product: score = Q·Kᵀ/√d_k, where d_k is the key dimension. The scaling prevents dot products from growing too large in high dimensions, which would push softmax into regions with tiny gradients.
Scaled Dot-Product Attention
// Scaled dot-product attention
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
// Q: (seq_len, d_k) — what am I looking for?
// K: (seq_len, d_k) — what do I contain?
// V: (seq_len, d_v) — what do I return?
// Q · Kᵀ: (seq_len, seq_len) attention matrix
// Each row = attention weights for one position
// softmax makes weights sum to 1
// √d_k scaling prevents gradient saturation
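The formula above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function and variable names are my own, and the inputs are random placeholders standing in for real token representations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (seq_len, d_v) weighted sum of values

# Placeholder inputs: 4 positions, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 16))
out = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Note that the output has the sequence length of the queries and the feature dimension of the values, matching the shape comments above.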
Key insight: Q, K, V are all learned linear projections of the input. The network learns what to query for, what to advertise as keys, and what to return as values. This flexibility is what makes attention so powerful.
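That key insight can be made concrete with a short sketch. The weight matrices W_q, W_k, and W_v below are hypothetical stand-ins for the learned projections; in a trained model they would be parameters adjusted by gradient descent, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k, d_v, seq_len = 32, 8, 16, 4

# One shared input: the token representations entering the attention layer.
x = rng.standard_normal((seq_len, d_model))

# Three learned projections (random here, for illustration only).
W_q = rng.standard_normal((d_model, d_k))   # what to query for
W_k = rng.standard_normal((d_model, d_k))   # what to advertise as keys
W_v = rng.standard_normal((d_model, d_v))   # what to return as values

Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 16)
```

The point is that all three tensors are derived from the same input x; only the projection matrices differ, so the network decides how each position participates in all three roles.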