Ch 6 — Special Tensors: Normalization, Positional Encoding, and the Output Head

The small but critical components that don't fit neatly into attention or FFN
RMSNorm: The Stability Guardrails
Tiny 1D tensors that keep values in a healthy range
What RMSNorm Does
Root Mean Square Layer Normalization normalizes hidden states by dividing by the RMS of all values, then scaling by a learned weight vector. Unlike standard LayerNorm, RMSNorm skips the mean subtraction — it only rescales, making it simpler and ~10-15% faster. Each of the 4,096 dimensions gets its own learned scale factor. Without normalization, values would grow or shrink exponentially across 32 layers, causing numerical overflow or vanishing gradients.
All Norm Tensors
// Per-layer norms (×32 layers):
model.layers.{N}.input_layernorm.weight           // Shape: [4096] — before attention
model.layers.{N}.post_attention_layernorm.weight  // Shape: [4096] — before FFN

// Final norm (after last layer):
model.norm.weight                                 // Shape: [4096] — before lm_head

// Total: 64 per-layer + 1 final = 65 norms
// 65 × 4096 × 2 bytes = ~520 KB
// About 0.003% of the model!
Key insight: Norm tensors are the smallest weights in the file but deleting any one of them would make the model produce gibberish. They're the "guardrails" that keep the residual stream numerically stable across all 32 layers.
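The normalization described above is simple enough to sketch in a few lines. This is a minimal numpy version, with hidden_size shrunk to 4 for readability and the learned weight set to all-ones (real models learn a distinct scale per dimension):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root-mean-square over the hidden dimension, then
    # scale each dimension by its learned weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

hidden = np.array([0.1, -4.0, 250.0, 0.003])  # wildly different scales
weight = np.ones(4)                           # learned [hidden_size] vector
out = rms_norm(hidden, weight)

print(out)                           # values pulled into a comparable range
print(np.sqrt(np.mean(out * out)))  # RMS of output ≈ 1.0
```

With a weight of all-ones, the output's RMS is exactly 1 (up to eps), which is what keeps the residual stream from drifting across layers.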
RMSNorm vs. LayerNorm
Why modern LLMs dropped the mean subtraction
The Simplification
LayerNorm (BERT/GPT-2): subtract mean, divide by standard deviation, scale and shift. Requires both a weight and a bias vector (8,192 params per norm).

RMSNorm (Llama/Mistral): divide by root-mean-square only, then scale. No mean subtraction, no bias (4,096 params per norm). The RMSNorm paper (Zhang & Sennrich, 2019) showed this achieves equivalent quality with fewer operations. Most modern LLMs use RMSNorm exclusively.
Formula Comparison
// LayerNorm (legacy):
y = (x - mean(x)) / std(x) * γ + β
// γ = weight [4096], β = bias [4096]
// 8192 learnable params per norm

// RMSNorm (modern):
y = x / RMS(x) * γ
// γ = weight [4096], no bias
// 4096 learnable params per norm
// ~10-15% faster computation
Why it matters: If you see a layernorm.bias tensor in a model file, you know it's using the older LayerNorm. Llama-family models only have .weight tensors for norms — no bias. This is a quick way to identify the normalization scheme.
Rotary Position Embeddings (RoPE)
Position encoding computed on the fly — usually NOT stored as weights
How RoPE Works
RoPE encodes token position by rotating the Q and K vectors in 2D planes within the head_dim space. Each pair of dimensions is rotated by an angle proportional to the position × a per-pair frequency. Because both Q and K are rotated, the dot product Q·K depends only on the relative distance between tokens, giving the model a built-in sense of position. Unlike learned position embeddings, RoPE is computed at runtime from two config values and requires no stored tensors.
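The relative-position property is easy to verify numerically. The sketch below rotates a single (hypothetical) 2D query/key pair at different absolute positions and shows the score only depends on the gap between them:

```python
import numpy as np

def rotate(v, angle):
    # 2D rotation — the building block RoPE applies to each dimension pair
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])
freq = 0.1  # one RoPE frequency

# Rotate q at position 7 and k at position 3 ...
d1 = rotate(q, 7 * freq) @ rotate(k, 3 * freq)
# ... and at positions 107 / 103: same relative distance of 4
d2 = rotate(q, 107 * freq) @ rotate(k, 103 * freq)

print(np.isclose(d1, d2))  # True: the score depends only on the distance
```

This is why RoPE needs no position table: absolute positions cancel out of every attention score, leaving only relative distance.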
Config-Driven Position Encoding
// RoPE is defined by config.json values:
"rope_theta": 500000.0
"max_position_embeddings": 131072

// At runtime, frequencies are computed:
freqs = 1.0 / (theta ^ (2i/d))
// i = dimension pair index, d = head_dim

// Higher theta → lower frequencies → slower rotation
// → better long-context performance

// Some older models store inv_freq tensors
// but modern models compute them on the fly
Key insight: rope_theta controls context length capability. Llama 2 used 10,000 (4K context). Llama 3.1 uses 500,000 (128K context). Higher theta lowers the rotation frequencies, so positions remain distinguishable across much longer sequences before the rotations wrap around.
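The frequency formula above can be evaluated directly to see the effect of theta. A small sketch, assuming head_dim = 128 (Llama-style) and comparing the two theta values mentioned:

```python
import numpy as np

head_dim = 128
i = np.arange(0, head_dim, 2)  # one entry per dimension pair

def rope_freqs(theta):
    # freqs = 1 / theta^(2i/d), matching the config-driven formula above
    return 1.0 / (theta ** (i / head_dim))

f_llama2 = rope_freqs(10_000.0)   # Llama 2 style
f_llama3 = rope_freqs(500_000.0)  # Llama 3.1 style

# The slowest (last) frequency bounds how far apart two positions can be
# before their rotations become ambiguous:
print(1 / f_llama2[-1])  # rough unambiguous range, theta = 10,000
print(1 / f_llama3[-1])  # far larger with theta = 500,000
```

The first frequency is 1.0 in both cases; it is the low end of the spectrum that theta stretches, which is exactly what long context needs.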
The Final Norm: model.norm.weight
The last normalization before the output head
Purpose
After passing through all 32 transformer layers, the hidden states go through one final RMSNorm: model.norm.weight. This is the same [4096] shape as the per-layer norms, but it sits between the last transformer layer and the lm_head output projection. It ensures the hidden states are properly scaled before being converted into token probabilities. Without it, the lm_head would receive unnormalized inputs, producing skewed probability distributions.
Output Pipeline
// After the last transformer layer:
hidden = transformer_layers(input)            // Shape: [seq_len, 4096]
normed = rms_norm(hidden, model.norm.weight)  // Shape: [seq_len, 4096]
logits = normed @ lm_head.weight.T            // Shape: [seq_len, 128256]
probs = softmax(logits)                       // Shape: [seq_len, 128256]

// model.norm is the ONLY standalone norm
// (not inside a layer)
The Output Head: lm_head.weight
Converting hidden states back to vocabulary probabilities
The Inverse Embedding
lm_head.weight has shape [vocab_size, hidden_size] = [128256, 4096]. It's the inverse of the embedding: while embed_tokens converts token IDs to vectors, lm_head converts vectors back to a score for every token. Each row of lm_head represents "how much does this hidden state look like token X?" When the dot product of the hidden state with row 42 is high, token 42 is a likely next token.
lm_head Details
"lm_head.weight": { "dtype": "BF16", "shape": [128256, 4096], "data_offsets": [...] } // 128,256 × 4,096 × 2 = ~1 GB // Same shape as embed_tokens.weight // But NOT shared (tie_word_embeddings=false) // If tied: lm_head references embed_tokens // If untied: separate tensor, separate storage
Why it matters: The lm_head is the last matrix multiply before text generation. Its quality directly affects token prediction. Some quantization schemes keep lm_head at higher precision than the rest of the model to preserve generation quality.
MoE-Specific Tensors
Mixture of Experts: router weights and per-expert FFN matrices
How MoE Changes the Tensor Layout
Mixture of Experts models (like Mixtral 8x7B) replace the single MLP with N expert MLPs plus a router. Mixtral has 8 copies of gate/up/down projections per layer, plus a small router weight that decides which 2 experts to activate for each token. Total params: 46.7B, but only 13B active per token. The router tensor is tiny — [num_experts, hidden_size] = [8, 4096] — but critically determines expert selection.
MoE Tensor Names
// Mixtral layer tensors (MoE):
model.layers.{N}.block_sparse_moe.gate.weight
// Shape: [8, 4096] — the router

// 8 expert MLPs per layer:
...block_sparse_moe.experts.0.w1.weight
...block_sparse_moe.experts.0.w2.weight
...block_sparse_moe.experts.0.w3.weight
// ... through experts.7

// 8 experts × 3 matrices = 24 MLP tensors
// + 1 router = 25 MoE tensors per layer
Key insight: MoE is why you see models with 47B total parameters but 13B "active" — only 2 of 8 experts fire per token. The file is 47B params large, but inference only reads 13B params worth of weights per forward pass.
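The routing step itself is just a small matrix multiply plus a top-k pick. A sketch of top-2 routing for one token, with a toy hidden size and random weights standing in for the real router:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, num_experts, top_k = 8, 8, 2  # toy hidden size; Mixtral-style 8 experts, top-2

x = rng.normal(size=hidden_size)                      # one token's hidden state
router = rng.normal(size=(num_experts, hidden_size))  # block_sparse_moe.gate.weight

scores = router @ x                 # one score per expert: [num_experts]
top = np.argsort(scores)[-top_k:]   # indices of the 2 winning experts
weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners

# Only the chosen experts' gate/up/down matrices are read for this token;
# their outputs are blended with these weights.
print(sorted(top.tolist()))  # the 2 expert ids selected for this token
print(weights.sum())         # blend weights sum to 1.0
```

Every token repeats this selection independently, which is why all 8 experts must be resident in memory even though only 2 fire per token.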
Complete Special Tensor Inventory
Every tensor that isn't attention, FFN, or embedding
Full List (Llama 3.1 8B)
// Standalone special tensors:
model.embed_tokens.weight  [128256, 4096]  // ~1.0 GB
model.norm.weight          [4096]          // 8 KB
lm_head.weight             [128256, 4096]  // ~1.0 GB

// Per-layer norms (×32 layers):
input_layernorm.weight           [4096]
post_attention_layernorm.weight  [4096]

// Not stored as weights:
// - RoPE (computed from rope_theta)
// - KV cache (runtime only)
// - Attention masks (runtime only)
Size Summary
Embedding: ~1.0 GB
lm_head: ~1.0 GB
All 65 norms: ~520 KB
RoPE: 0 bytes (computed)

Total "special" tensors: ~2.0 GB, or about 12.5% of the model. The remaining 87.5% is attention (28%) + FFN (65%) across the 32 transformer layers.
Practical Takeaways
Debugging, quantization, and context length implications
Key Decisions These Tensors Encode
rope_theta: Determines max context length. If a model claims 128K context but rope_theta is 10,000, the quality will degrade past ~4K tokens.

tie_word_embeddings: If true, embed_tokens and lm_head share weights — saves ~1 GB but limits the model's ability to specialize input vs. output representations.

MoE presence: If you see block_sparse_moe or experts in tensor names, the model is MoE. Total params ≠ active params — memory estimates must account for loading all experts even if only 2 fire per token.
Quick Checks
// Debugging checklist for special tensors:
✓ model.norm.weight exists?
//   If missing → loading will crash
✓ lm_head.weight shape[0] == vocab_size?
//   Mismatch → wrong tokenizer
✓ rope_theta matches expected context?
//   10K → 4K ctx, 500K → 128K ctx
✓ Norm count == 2 × num_layers + 1?
//   Missing norm → corrupt download
✓ MoE: experts in names?
//   If yes → check total vs active params
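The checklist translates naturally into a validation function. A sketch that runs the checks against a hypothetical {tensor_name: shape} dict (such as one parsed from a safetensors header); the function name and toy shapes are illustrative, not from any library:

```python
def check_special_tensors(shapes, num_layers=32, vocab_size=128_256):
    """Run the special-tensor checklist; return a list of problems found."""
    problems = []
    if "model.norm.weight" not in shapes:
        problems.append("missing final norm (model.norm.weight)")
    head = shapes.get("lm_head.weight")
    if head is not None and head[0] != vocab_size:
        problems.append(f"lm_head rows {head[0]} != vocab_size {vocab_size}")
    norm_count = sum(1 for n in shapes if "norm" in n and n.endswith(".weight"))
    if norm_count != 2 * num_layers + 1:
        problems.append(f"expected {2 * num_layers + 1} norms, found {norm_count}")
    if any("experts" in n for n in shapes):
        problems.append("MoE detected: total params != active params")
    return problems

# Toy 1-layer example with all checks passing:
shapes = {
    "model.norm.weight": (4096,),
    "lm_head.weight": (128_256, 4096),
    "model.layers.0.input_layernorm.weight": (4096,),
    "model.layers.0.post_attention_layernorm.weight": (4096,),
}
print(check_special_tensors(shapes, num_layers=1))  # [] -> all checks pass
```

Dropping any one tensor from the dict, or changing the lm_head row count, immediately surfaces the corresponding problem string.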
Key insight: Special tensors are the "connective tissue" of the model. They're small, but each serves a unique structural role. You now know every tensor type in an LLM file. Next: the tokenizer files that convert text to numbers.