Glossary

Quick-reference definitions for Anatomy of an LLM File
Tensor Names
model.embed_tokens.weight
The embedding matrix. Shape: [vocab_size, hidden_size]. Converts token IDs to dense vectors via table lookup.
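The lookup is plain row indexing, no matrix multiply. A toy sketch with NumPy (sizes here are illustrative, not Llama's):

```python
import numpy as np

# toy sizes; Llama 3.1 uses a much larger vocab_size and hidden_size 4096
vocab_size, hidden_size = 8, 4
embed = np.random.default_rng(0).normal(size=(vocab_size, hidden_size))

token_ids = [3, 1, 3]        # token IDs index rows of the matrix
vectors = embed[token_ids]   # pure table lookup: shape [3, hidden_size]
```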
model.layers.{N}.self_attn.q_proj.weight
Query projection for layer N. Shape: [hidden_size, hidden_size]. Contains all attention heads packed into one matrix.
model.layers.{N}.self_attn.k_proj.weight
Key projection for layer N. Shape: [num_kv_heads × head_dim, hidden_size]. Smaller than Q due to GQA.
model.layers.{N}.self_attn.v_proj.weight
Value projection for layer N. Same shape as k_proj. Contains the data vectors that get mixed by attention.
model.layers.{N}.self_attn.o_proj.weight
Output projection for layer N. Shape: [hidden_size, hidden_size]. Projects concatenated head outputs back to hidden dimension.
model.layers.{N}.mlp.gate_proj.weight
SwiGLU gate projection. Shape: [intermediate_size, hidden_size]. Produces the gating signal (w1).
model.layers.{N}.mlp.up_proj.weight
SwiGLU up projection. Shape: [intermediate_size, hidden_size]. Carries the data path (w3).
model.layers.{N}.mlp.down_proj.weight
SwiGLU down projection. Shape: [hidden_size, intermediate_size]. Compresses back to hidden dimension (w2).
model.layers.{N}.input_layernorm.weight
RMSNorm before attention. Shape: [hidden_size]. Tiny (8 KB) but essential for training stability.
model.layers.{N}.post_attention_layernorm.weight
RMSNorm before FFN. Shape: [hidden_size]. Applied between attention output and MLP input.
model.norm.weight
Final RMSNorm after all layers. Shape: [hidden_size]. Applied before the output head (lm_head).
lm_head.weight
Output projection. Shape: [vocab_size, hidden_size]. Converts hidden states to logits over the vocabulary.
File Formats
Safetensors
Binary format by Hugging Face. 8-byte header length + JSON metadata + raw tensor data. No code execution risk. Supports zero-copy mmap loading.
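The layout above can be parsed in a few lines; a minimal header reader sketch (error handling and offset validation omitted):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse the layout: an 8-byte little-endian u64 header length, then JSON."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header  # maps tensor name -> {"dtype", "shape", "data_offsets"}
```

The raw tensor bytes follow immediately after the JSON, at the offsets listed in each entry's `data_offsets`.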
GGUF
GPT-Generated Unified Format for llama.cpp. Self-contained: tokenizer, config, and weights in one file. Magic bytes: GGUF. Native quantization support.
PyTorch .bin
Legacy format using Python pickle serialization. Can execute arbitrary code during loading. Security risk for untrusted models.
Sharding
Splitting model weights across multiple files. An index JSON maps tensor names to shard files. Used for models too large for a single file.
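A sketch of how the index is used; the filenames and total size below are illustrative, not taken from a real index file:

```python
# a hypothetical slice of model.safetensors.index.json
index = {
    "metadata": {"total_size": 16060522496},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
        "lm_head.weight": "model-00004-of-00004.safetensors",
    },
}

def shard_for(tensor_name):
    """The weight_map maps each tensor name to the shard file that holds it."""
    return index["weight_map"][tensor_name]
```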
Architecture Concepts
BF16 (Brain Float 16)
16-bit floating point format with 8 exponent bits and 7 mantissa bits. Same dynamic range as FP32 but far less precision. 2 bytes per parameter. The standard training/inference dtype.
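Because BF16 keeps FP32's exponent layout, converting is just taking the top 16 bits. A sketch (real kernels usually round rather than truncate):

```python
import struct

def f32_to_bf16_bits(x):
    """BF16 is the top 16 bits of an FP32: 1 sign, 8 exponent, 7 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # truncating conversion

hex(f32_to_bf16_bits(1.0))  # 0x3f80: sign 0, exponent 127, mantissa 0
```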
BPE (Byte Pair Encoding)
Tokenization algorithm that learns subword units by iteratively merging the most frequent character pairs. The merge list defines the tokenization.
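One training step of that merge loop can be sketched in a few lines (a toy illustration, not a production tokenizer):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """One BPE training step: count adjacent pairs, return the most common."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of the pair with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == list(pair):
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("banana")
pair = most_frequent_pair(tokens)    # ('a', 'n') appears twice
tokens = merge(tokens, pair, "an")   # ['b', 'an', 'an', 'a']
```

Applying the learned merge list in order, greedily, is what tokenizes new text.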
GQA (Grouped-Query Attention)
Multiple Q heads share fewer KV heads. Llama 3: 32 Q heads, 8 KV heads (4:1 ratio). Saves KV cache memory and weight storage with minimal quality loss.
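The shape consequences follow directly from the config values quoted above:

```python
hidden_size, num_heads, num_kv_heads = 4096, 32, 8  # Llama 3.1 8B config values
head_dim = hidden_size // num_heads                 # 128

q_rows = num_heads * head_dim       # 4096 -> q_proj is [4096, 4096]
kv_rows = num_kv_heads * head_dim   # 1024 -> k_proj/v_proj are [1024, 4096]
ratio = q_rows // kv_rows           # 4 Q heads share each KV head
```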
head_dim
Dimension per attention head. Computed as hidden_size / num_attention_heads. Llama 3.1 8B: 4096 / 32 = 128.
hidden_size
Width of the residual stream / embedding dimension. Llama 3.1 8B: 4096. Determines the width of most tensors in the model.
intermediate_size
Width of the FFN hidden layer. Llama 3.1 8B: 14336 (~3.5× hidden_size). Determines gate_proj and up_proj first dimension.
KV Cache
Runtime memory storing K and V vectors from previous tokens. Shape: [layers, 2, seq_len, kv_heads, head_dim]. Grows linearly with sequence length. Not stored in model files.
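The growth is easy to put numbers on; a sketch using the Llama 3.1 8B values given elsewhere in this glossary, assuming BF16 (2 bytes per element):

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """layers x 2 (K and V) x positions x KV heads x head_dim x dtype size."""
    return layers * 2 * seq_len * kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, BF16, 8K context
gib = kv_cache_bytes(32, 8192, 8, 128) / 2**30  # exactly 1 GiB
```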
MoE (Mixture of Experts)
Architecture where each layer has multiple expert MLPs and a router that selects a subset (e.g., 2 of 8) per token. Total params > active params.
mmap (Memory Mapping)
OS-level technique that maps a file directly into virtual memory. Allows accessing tensor data by pointer arithmetic without copying. Used by Safetensors for fast loading.
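A minimal sketch of that zero-copy access with Python's `mmap` module (the mapping stays valid after the file object closes):

```python
import mmap

def view_range(path, offset, length):
    """Map the file and return a zero-copy view of [offset, offset + length)."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mm)[offset:offset + length]  # no bytes are copied here
```

This is essentially how a safetensors loader can hand out tensor data directly from the file's `data_offsets` without reading the whole file into RAM.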
RMSNorm
Root Mean Square Layer Normalization. Simpler than LayerNorm (no mean subtraction, no bias). Formula: y = x / RMS(x) × γ. ~10-15% faster.
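The formula above in a few lines of NumPy (eps is the usual small constant for numerical stability):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """y = x / RMS(x) * gamma -- no mean subtraction, no bias term."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```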
RoPE (Rotary Position Embedding)
Position encoding that rotates Q and K vectors in 2D planes. Controlled by rope_theta in config.json. Computed at runtime, not stored as weights.
rope_theta
Base frequency for RoPE. Higher values enable longer context. Llama 2: 10,000 (4K ctx). Llama 3.1: 500,000 (128K ctx).
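A sketch of the rotation for a single head vector, with theta as the configurable base frequency (this is the standard RoPE formulation; Llama 3.1 additionally applies a frequency-scaling scheme not shown here):

```python
import numpy as np

def rope(x, pos, theta=500000.0):
    """Rotate consecutive dim pairs of one head vector by position-scaled angles."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)  # one frequency per 2D plane
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out
```

Because these are pure rotations, vector norms are preserved, and position 0 leaves the vector unchanged.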
SwiGLU
Gated FFN activation: the layer computes down_proj(swish(gate_proj(x)) × up_proj(x)), using three matrices instead of the usual two. Better quality per parameter than ReLU/GELU. Standard in Llama/Mistral.
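A sketch of the full SwiGLU FFN forward pass, with weights stored as [out, in] the way the tensors above are shaped:

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down_proj(swish(gate_proj(x)) * up_proj(x)); swish(z) = z * sigmoid(z)."""
    gate = x @ w_gate.T                         # gating signal, [intermediate_size]
    up = x @ w_up.T                             # data path, [intermediate_size]
    hidden = gate / (1.0 + np.exp(-gate)) * up  # swish-gated elementwise product
    return hidden @ w_down.T                    # compress back to [hidden_size]
```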
Weight Tying
Sharing the same matrix between embed_tokens and lm_head. Saves vocab_size × hidden_size parameters. Used in GPT-2/BERT but not in Llama/Mistral.
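The savings are easy to quantify; a worked number assuming Llama 3.1's vocabulary size of 128256:

```python
vocab_size, hidden_size = 128256, 4096  # Llama 3.1 values (untied in practice)
saved = vocab_size * hidden_size        # parameters a tied head would save
millions = saved / 1e6                  # ~525M parameters (~1 GB in BF16)
```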