How RoPE Works
RoPE (Rotary Position Embedding) encodes token position by rotating the Q and K vectors in 2D planes within the head_dim space. Each pair of dimensions is rotated by an angle equal to the token's position times a per-pair frequency. Because both Q and K are rotated this way, the dot product Q·K depends only on the relative distance between the two tokens, and its magnitude tends to decay as that distance grows, giving the model a sense of distance. Unlike learned position embeddings, RoPE is computed at runtime from two config values and requires no stored tensors.
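A minimal sketch of the rotation and its relative-position property, using NumPy on a single vector of length head_dim (the function name `rope_rotate` is illustrative, not from any library):

```python
import numpy as np

def rope_rotate(x, pos, theta=10000.0):
    """Rotate vector x (length d, d even) for token position pos.
    Each pair (x[2i], x[2i+1]) is rotated by angle pos * freq_i."""
    d = x.shape[0]
    # one frequency per 2D plane: 1 / theta^(2i/d) for pair index i
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x0, x1 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x0 * cos - x1 * sin   # standard 2D rotation per plane
    out[1::2] = x0 * sin + x1 * cos
    return out

# The score depends only on the offset between positions:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope_rotate(q, 5) @ rope_rotate(k, 3)      # offset 2
b = rope_rotate(q, 105) @ rope_rotate(k, 103)  # offset 2, shifted by 100
assert np.allclose(a, b)
```

The final assertion is the point: shifting both tokens by the same amount leaves the attention score unchanged, which is why RoPE encodes relative rather than absolute position.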
Config-Driven Position Encoding
// RoPE is defined by config.json values:
"rope_theta": 500000.0
"max_position_embeddings": 131072
// At runtime, frequencies are computed:
freqs = 1.0 / (theta ^ (2i/d))
// i = dimension-pair index (0 .. d/2 - 1), d = head_dim
// Higher theta → slower frequency decay
// → better long-context performance
// Some older models store inv_freq tensors
// but modern models compute them on the fly
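The frequency computation above can be reproduced in a few lines of NumPy. The head_dim of 128 is an assumption for illustration; it comes from the model config, not from the two values shown above:

```python
import numpy as np

head_dim = 128          # assumed for illustration; read from model config
rope_theta = 500000.0   # "rope_theta" from config.json

# freqs[i] = 1 / theta^(2i/d), one frequency per 2D rotation plane
freqs = 1.0 / rope_theta ** (np.arange(0, head_dim, 2) / head_dim)

print(freqs[0])    # fastest plane: 1.0, i.e. one radian per token
print(freqs[-1])   # slowest plane: a tiny angle per token
```

The frequencies decay geometrically across the pair index, so the first planes track fine-grained local position while the last planes change slowly enough to separate positions tens of thousands of tokens apart.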
Key insight: rope_theta controls context-length capability. Llama 2 used theta = 10,000 (4K context); Llama 3.1 uses 500,000 (128K context). A higher theta stretches the frequency schedule so the rotation frequencies decay more slowly across dimension pairs, which lets the model distinguish positions across longer sequences.
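One way to see the effect of theta is to compare the wavelength (in tokens) of the slowest-rotating plane for the two values above. This is a rough sketch, assuming head_dim = 128; `slowest_wavelength` is a hypothetical helper, not a library function:

```python
import numpy as np

def slowest_wavelength(theta, d=128):
    # wavelength of the slowest plane: 2*pi / (smallest frequency)
    # smallest frequency = 1 / theta^((d-2)/d), the last pair index
    min_freq = 1.0 / theta ** ((d - 2) / d)
    return 2 * np.pi / min_freq

print(slowest_wavelength(10000.0))    # Llama 2-era theta
print(slowest_wavelength(500000.0))   # Llama 3.1 theta
```

The larger theta pushes the slowest plane's wavelength far past the old value, so positions that would alias (land on nearly the same rotation angle) at theta = 10,000 stay distinguishable across a 128K-token window.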