The Position Problem
Self-attention is permutation-invariant — it treats “cat sat mat” and “mat cat sat” identically because it computes only pairwise similarities between token embeddings, with no notion of order. But word order matters! Positional encodings inject position information into the input embeddings. The original Transformer used sinusoidal encodings: each position gets a unique pattern of sine and cosine values at different frequencies. Modern models take different routes: GPT-2/3 use learned absolute positional embeddings, while LLaMA uses Rotary Position Embeddings (RoPE), which encode relative positions directly in the attention computation.
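The claim above can be checked directly: permuting the input rows of a bare attention layer just permutes the output rows in the same way, so the layer itself carries no position signal. A minimal NumPy sketch (single head, no mask; all names are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional information.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(3, d))                     # embeddings for "cat sat mat"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 0, 1]                                # "mat cat sat"
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Same rows, just reordered: attention sees no positions.
assert np.allclose(out[perm], out_perm)
```

This is why some form of positional encoding must be added before (or inside) the attention computation.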
Positional Encoding
// Sinusoidal positional encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
// pos = position in sequence
// i = dimension index
// d = model dimension
// Input = token_embedding + pos_encoding
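The formulas above translate directly into a few lines of NumPy. A minimal sketch (function name and shapes are illustrative; assumes an even model dimension `d`):

```python
import numpy as np

def sinusoidal_encoding(max_len, d):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2), requires even d
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(max_len=50, d=16)
# Added to the token embeddings: x = token_embedding + pe[:seq_len]
```

Low dimensions oscillate fast, high dimensions slowly, so every position gets a distinct fingerprint without any trainable parameters.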
// Modern alternatives:
Learned: trainable embedding per position
RoPE: rotates Q,K vectors by position (relative position in attention)
ALiBi: adds linear bias to attention scores
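Of the three alternatives, ALiBi is the simplest to sketch: it adds a fixed, head-specific linear penalty proportional to the query–key distance to the pre-softmax scores. A minimal sketch (slope recipe follows the common geometric sequence for 8 heads; names are illustrative):

```python
import numpy as np

def alibi_bias(seq_len, num_heads=8):
    # One slope per head: 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = np.array([2.0 ** -(h + 1) for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]       # j - i; negative for past keys
    # Added to attention scores before softmax; a causal mask hides j > i.
    return slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(seq_len=4)                 # shape (num_heads, seq, seq)
```

More distant keys get larger penalties, which is what lets ALiBi-trained models extrapolate to longer sequences without any position embedding at all.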
Key insight: RoPE (Su et al., 2021) encodes position by rotating query and key vectors, making attention scores naturally depend on relative distance. It generalizes better to longer sequences than absolute position embeddings and is used in LLaMA, Mistral, and most modern LLMs.
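The rotation trick can be sketched in a few lines: each consecutive pair of dimensions is treated as a 2D plane and rotated by an angle proportional to the token's position, so the Q·K dot product depends only on the positional offset. A minimal NumPy sketch (function name and shapes are illustrative, not LLaMA's implementation):

```python
import numpy as np

def rope(x, pos, base=10000):
    # Rotate consecutive dimension pairs of x by a position-dependent angle.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))
# Relative-position property: shifting both positions by the same amount
# leaves the attention score unchanged (offset 4 in both cases below).
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 13) @ rope(k, 17)
assert np.isclose(s1, s2)
```

Because the score depends only on the offset, the same rotations apply at positions never seen in training, which is the basis of RoPE's length-generalization behavior.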