Ch 6 — Special Tensors: Normalization, Positional Encoding, and the Output Head

The small but critical components that don't fit neatly into attention or FFN
RMSNorm: The Stability Guardrails
Tiny 1D tensors that keep values in a healthy range
What RMSNorm Does
Root Mean Square Layer Normalization normalizes hidden states by dividing by the RMS of all values, then scaling by a learned weight vector. Unlike standard LayerNorm, RMSNorm skips the mean subtraction — it only rescales, making it simpler and ~10-15% faster. Each of the 4,096 dimensions gets its own learned scale factor. Without normalization, values would grow or shrink exponentially across 32 layers, causing numerical overflow or vanishing gradients.
All Norm Tensors
// Per-layer norms (×32 layers):
model.layers.{N}.input_layernorm.weight           // Shape: [4096] — before attention
model.layers.{N}.post_attention_layernorm.weight  // Shape: [4096] — before FFN

// Final norm (after last layer):
model.norm.weight                                 // Shape: [4096] — before lm_head

// Total: 64 per-layer + 1 final = 65 norms
// 65 × 4096 × 2 bytes = ~520 KB
// About 0.003% of the model!
Key insight: Norm tensors are the smallest weights in the file but deleting any one of them would make the model produce gibberish. They're the "guardrails" that keep the residual stream numerically stable across all 32 layers.
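The normalization described above is simple enough to sketch in a few lines. This is a minimal numpy version, with hidden_size shrunk to 4 for readability and the learned weight set to all-ones (real models learn a distinct scale per dimension):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root-mean-square over the hidden dimension, then
    # scale each dimension by its learned weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

hidden = np.array([0.1, -4.0, 250.0, 0.003])  # wildly different scales
weight = np.ones(4)                           # learned [hidden_size] vector
out = rms_norm(hidden, weight)

print(out)                           # values pulled into a comparable range
print(np.sqrt(np.mean(out * out)))  # RMS of output ≈ 1.0
```

With a weight of all-ones, the output's RMS is exactly 1 (up to eps), which is what keeps the residual stream from drifting across layers.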
RMSNorm vs. LayerNorm
Why modern LLMs dropped the mean subtraction
The Simplification
LayerNorm (BERT/GPT-2): subtract mean, divide by standard deviation, scale and shift. Requires both a weight and a bias vector (8,192 params per norm).

RMSNorm (Llama/Mistral): divide by root-mean-square only, then scale. No mean subtraction, no bias (4,096 params per norm). The RMSNorm paper (Zhang & Sennrich, 2019) showed this achieves equivalent quality with fewer operations. Most modern LLMs use RMSNorm exclusively.
Formula Comparison
// LayerNorm (legacy):
y = (x - mean(x)) / std(x) * γ + β
// γ = weight [4096], β = bias [4096]
// 8192 learnable params per norm

// RMSNorm (modern):
y = x / RMS(x) * γ
// γ = weight [4096], no bias
// 4096 learnable params per norm
// ~10-15% faster computation
Why it matters: If you see a layernorm.bias tensor in a model file, you know it's using the older LayerNorm. Llama-family models only have .weight tensors for norms — no bias. This is a quick way to identify the normalization scheme.
Rotary Position Embeddings (RoPE)
Position encoding computed on the fly — usually NOT stored as weights
How RoPE Works
RoPE encodes token position by rotating the Q and K vectors in 2D planes within the head_dim space. Each pair of dimensions is rotated by an angle proportional to the position × a per-pair frequency. Because both Q and K are rotated, the dot product Q·K depends only on the relative distance between tokens, giving the model a built-in sense of position. Unlike learned position embeddings, RoPE is computed at runtime from two config values and requires no stored tensors.
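The relative-position property is easy to verify numerically. The sketch below rotates a single (hypothetical) 2D query/key pair at different absolute positions and shows the score only depends on the gap between them:

```python
import numpy as np

def rotate(v, angle):
    # 2D rotation — the building block RoPE applies to each dimension pair
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])
freq = 0.1  # one RoPE frequency

# Rotate q at position 7 and k at position 3 ...
d1 = rotate(q, 7 * freq) @ rotate(k, 3 * freq)
# ... and at positions 107 / 103: same relative distance of 4
d2 = rotate(q, 107 * freq) @ rotate(k, 103 * freq)

print(np.isclose(d1, d2))  # True: the score depends only on the distance
```

This is why RoPE needs no position table: absolute positions cancel out of every attention score, leaving only relative distance.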
Config-Driven Position Encoding
// RoPE is defined by config.json values:
"rope_theta": 500000.0
"max_position_embeddings": 131072

// At runtime, frequencies are computed:
freqs = 1.0 / (theta ^ (2i/d))
// i = dimension pair index, d = head_dim

// Higher theta → lower frequencies → slower rotation
// → better long-context performance

// Some older models store inv_freq tensors
// but modern models compute them on the fly
Key insight: rope_theta controls context length capability. Llama 2 used 10,000 (4K context). Llama 3.1 uses 500,000 (128K context). Higher theta lowers the rotation frequencies, so positions remain distinguishable across much longer sequences before the rotations wrap around.
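The frequency formula above can be evaluated directly to see the effect of theta. A small sketch, assuming head_dim = 128 (Llama-style) and comparing the two theta values mentioned:

```python
import numpy as np

head_dim = 128
i = np.arange(0, head_dim, 2)  # one entry per dimension pair

def rope_freqs(theta):
    # freqs = 1 / theta^(2i/d), matching the config-driven formula above
    return 1.0 / (theta ** (i / head_dim))

f_llama2 = rope_freqs(10_000.0)   # Llama 2 style
f_llama3 = rope_freqs(500_000.0)  # Llama 3.1 style

# The slowest (last) frequency bounds how far apart two positions can be
# before their rotations become ambiguous:
print(1 / f_llama2[-1])  # rough unambiguous range, theta = 10,000
print(1 / f_llama3[-1])  # far larger with theta = 500,000
```

The first frequency is 1.0 in both cases; it is the low end of the spectrum that theta stretches, which is exactly what long context needs.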
The Final Norm: model.norm.weight
The last normalization before the output head
Purpose
After passing through all 32 transformer layers, the hidden states go through one final RMSNorm: model.norm.weight. This is the same [4096] shape as the per-layer norms, but it sits between the last transformer layer and the lm_head output projection. It ensures the hidden states are properly scaled before being converted into token probabilities. Without it, the lm_head would receive unnormalized inputs, producing skewed probability distributions.
Output Pipeline
// After the last transformer layer:
hidden = transformer_layers(input)            // Shape: [seq_len, 4096]
normed = rms_norm(hidden, model.norm.weight)  // Shape: [seq_len, 4096]
logits = normed @ lm_head.weight.T            // Shape: [seq_len, 128256]
probs = softmax(logits)                       // Shape: [seq_len, 128256]

// model.norm is the ONLY standalone norm
// (not inside a layer)
The Output Head: lm_head.weight
Converting hidden states back to vocabulary probabilities
The Inverse Embedding
lm_head.weight has shape [vocab_size, hidden_size] = [128256, 4096]. It's the inverse of the embedding: while embed_tokens converts token IDs to vectors, lm_head converts vectors back to a score for every token. Each row of lm_head represents "how much does this hidden state look like token X?" When the dot product of the hidden state with row 42 is high, token 42 is a likely next token.
lm_head Details
"lm_head.weight": { "dtype": "BF16", "shape": [128256, 4096], "data_offsets": [...] } // 128,256 × 4,096 × 2 = ~1 GB // Same shape as embed_tokens.weight // But NOT shared (tie_word_embeddings=false) // If tied: lm_head references embed_tokens // If untied: separate tensor, separate storage
Why it matters: The lm_head is the last matrix multiply before text generation. Its quality directly affects token prediction. Some quantization schemes keep lm_head at higher precision than the rest of the model to preserve generation quality.
MoE-Specific Tensors
Mixture of Experts: router weights and per-expert FFN matrices
How MoE Changes the Tensor Layout
Mixture of Experts models (like Mixtral 8x7B) replace the single MLP with N expert MLPs plus a router. Mixtral has 8 copies of gate/up/down projections per layer, plus a small router weight that decides which 2 experts to activate for each token. Total params: 46.7B, but only 13B active per token. The router tensor is tiny — [num_experts, hidden_size] = [8, 4096] — but critically determines expert selection.
MoE Tensor Names
// Mixtral layer tensors (MoE):
model.layers.{N}.block_sparse_moe.gate.weight
// Shape: [8, 4096] — the router

// 8 expert MLPs per layer:
...block_sparse_moe.experts.0.w1.weight
...block_sparse_moe.experts.0.w2.weight
...block_sparse_moe.experts.0.w3.weight
// ... through experts.7

// 8 experts × 3 matrices = 24 MLP tensors
// + 1 router = 25 MoE tensors per layer
Key insight: MoE is why you see models with 47B total parameters but 13B "active" — only 2 of 8 experts fire per token. The file is 47B params large, but inference only reads 13B params worth of weights per forward pass.
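The routing step itself is just a small matrix multiply plus a top-k pick. A sketch of top-2 routing for one token, with a toy hidden size and random weights standing in for the real router:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, num_experts, top_k = 8, 8, 2  # toy hidden size; Mixtral-style 8 experts, top-2

x = rng.normal(size=hidden_size)                      # one token's hidden state
router = rng.normal(size=(num_experts, hidden_size))  # block_sparse_moe.gate.weight

scores = router @ x                 # one score per expert: [num_experts]
top = np.argsort(scores)[-top_k:]   # indices of the 2 winning experts
weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners

# Only the chosen experts' gate/up/down matrices are read for this token;
# their outputs are blended with these weights.
print(sorted(top.tolist()))  # the 2 expert ids selected for this token
print(weights.sum())         # blend weights sum to 1.0
```

Every token repeats this selection independently, which is why all 8 experts must be resident in memory even though only 2 fire per token.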
Complete Special Tensor Inventory
Every tensor that isn't attention, FFN, or embedding
Full List (Llama 3.1 8B)
// Standalone special tensors:
model.embed_tokens.weight  [128256, 4096]  // ~1.0 GB
model.norm.weight          [4096]          // 8 KB
lm_head.weight             [128256, 4096]  // ~1.0 GB

// Per-layer norms (×32 layers):
input_layernorm.weight           [4096]
post_attention_layernorm.weight  [4096]

// Not stored as weights:
// - RoPE (computed from rope_theta)
// - KV cache (runtime only)
// - Attention masks (runtime only)
Size Summary
Embedding: ~1.0 GB
lm_head: ~1.0 GB
All 65 norms: ~520 KB
RoPE: 0 bytes (computed)

Total "special" tensors: ~2.0 GB, or about 12.5% of the model. The remaining 87.5% is attention (28%) + FFN (65%) across the 32 transformer layers.
Practical Takeaways
Debugging, quantization, and context length implications
Key Decisions These Tensors Encode
rope_theta: Determines max context length. If a model claims 128K context but rope_theta is 10,000, the quality will degrade past ~4K tokens.

tie_word_embeddings: If true, embed_tokens and lm_head share weights — saves ~1 GB but limits the model's ability to specialize input vs. output representations.

MoE presence: If you see block_sparse_moe or experts in tensor names, the model is MoE. Total params ≠ active params — memory estimates must account for loading all experts even if only 2 fire per token.
Quick Checks
// Debugging checklist for special tensors:
✓ model.norm.weight exists?
//   If missing → loading will crash
✓ lm_head.weight shape[0] == vocab_size?
//   Mismatch → wrong tokenizer
✓ rope_theta matches expected context?
//   10K → 4K ctx, 500K → 128K ctx
✓ Norm count == 2 × num_layers + 1?
//   Missing norm → corrupt download
✓ MoE: experts in names?
//   If yes → check total vs active params
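The checklist translates naturally into a validation function. A sketch that runs the checks against a hypothetical {tensor_name: shape} dict (such as one parsed from a safetensors header); the function name and toy shapes are illustrative, not from any library:

```python
def check_special_tensors(shapes, num_layers=32, vocab_size=128_256):
    """Run the special-tensor checklist; return a list of problems found."""
    problems = []
    if "model.norm.weight" not in shapes:
        problems.append("missing final norm (model.norm.weight)")
    head = shapes.get("lm_head.weight")
    if head is not None and head[0] != vocab_size:
        problems.append(f"lm_head rows {head[0]} != vocab_size {vocab_size}")
    norm_count = sum(1 for n in shapes if "norm" in n and n.endswith(".weight"))
    if norm_count != 2 * num_layers + 1:
        problems.append(f"expected {2 * num_layers + 1} norms, found {norm_count}")
    if any("experts" in n for n in shapes):
        problems.append("MoE detected: total params != active params")
    return problems

# Toy 1-layer example with all checks passing:
shapes = {
    "model.norm.weight": (4096,),
    "lm_head.weight": (128_256, 4096),
    "model.layers.0.input_layernorm.weight": (4096,),
    "model.layers.0.post_attention_layernorm.weight": (4096,),
}
print(check_special_tensors(shapes, num_layers=1))  # [] -> all checks pass
```

Dropping any one tensor from the dict, or changing the lm_head row count, immediately surfaces the corresponding problem string.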
Key insight: Special tensors are the "connective tissue" of the model. They're small, but each serves a unique structural role. You now know every tensor type in an LLM file. Next: the tokenizer files that convert text to numbers.