Glossary

Quick-reference definitions for Anatomy of an LLM File
Tensor Names
model.embed_tokens.weight
The embedding matrix. Shape: [vocab_size, hidden_size]. Converts token IDs to dense vectors via table lookup.
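The lookup is plain row indexing, no matrix multiply. A toy sketch with NumPy (sizes here are illustrative, not Llama's):

```python
import numpy as np

# toy sizes; Llama 3.1 uses a much larger vocab_size and hidden_size 4096
vocab_size, hidden_size = 8, 4
embed = np.random.default_rng(0).normal(size=(vocab_size, hidden_size))

token_ids = [3, 1, 3]        # token IDs index rows of the matrix
vectors = embed[token_ids]   # pure table lookup: shape [3, hidden_size]
```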
model.layers.{N}.self_attn.q_proj.weight
Query projection for layer N. Shape: [hidden_size, hidden_size]. Contains all attention heads packed into one matrix.
model.layers.{N}.self_attn.k_proj.weight
Key projection for layer N. Shape: [num_kv_heads × head_dim, hidden_size]. Smaller than Q due to GQA.
model.layers.{N}.self_attn.v_proj.weight
Value projection for layer N. Same shape as k_proj. Contains the data vectors that get mixed by attention.
model.layers.{N}.self_attn.o_proj.weight
Output projection for layer N. Shape: [hidden_size, hidden_size]. Projects concatenated head outputs back to hidden dimension.
model.layers.{N}.mlp.gate_proj.weight
SwiGLU gate projection. Shape: [intermediate_size, hidden_size]. Produces the gating signal (w1).
model.layers.{N}.mlp.up_proj.weight
SwiGLU up projection. Shape: [intermediate_size, hidden_size]. Carries the data path (w3).
model.layers.{N}.mlp.down_proj.weight
SwiGLU down projection. Shape: [hidden_size, intermediate_size]. Compresses back to hidden dimension (w2).
model.layers.{N}.input_layernorm.weight
RMSNorm before attention. Shape: [hidden_size]. Tiny (8 KB) but essential for training stability.
model.layers.{N}.post_attention_layernorm.weight
RMSNorm before FFN. Shape: [hidden_size]. Applied between attention output and MLP input.
model.norm.weight
Final RMSNorm after all layers. Shape: [hidden_size]. Applied before the output head (lm_head).
lm_head.weight
Output projection. Shape: [vocab_size, hidden_size]. Converts hidden states to logits over the vocabulary.
File Formats
Safetensors
Binary format by Hugging Face. 8-byte header length + JSON metadata + raw tensor data. No code execution risk. Supports zero-copy mmap loading.
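The layout above can be parsed in a few lines; a minimal header reader sketch (error handling and offset validation omitted):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse the layout: an 8-byte little-endian u64 header length, then JSON."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header  # maps tensor name -> {"dtype", "shape", "data_offsets"}
```

The raw tensor bytes follow immediately after the JSON, at the offsets listed in each entry's `data_offsets`.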
GGUF
GPT-Generated Unified Format for llama.cpp. Self-contained: tokenizer, config, and weights in one file. Magic bytes: GGUF. Native quantization support.
PyTorch .bin
Legacy format using Python pickle serialization. Can execute arbitrary code during loading. Security risk for untrusted models.
Sharding
Splitting model weights across multiple files. An index JSON maps tensor names to shard files. Used for models too large for a single file.
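A sketch of how the index is used; the filenames and total size below are illustrative, not taken from a real index file:

```python
# a hypothetical slice of model.safetensors.index.json
index = {
    "metadata": {"total_size": 16060522496},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
        "lm_head.weight": "model-00004-of-00004.safetensors",
    },
}

def shard_for(tensor_name):
    """The weight_map maps each tensor name to the shard file that holds it."""
    return index["weight_map"][tensor_name]
```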
Architecture Concepts
BF16 (Brain Float 16)
16-bit floating point format with 8 exponent bits and 7 mantissa bits. Same dynamic range as FP32 but far less precision. 2 bytes per parameter. The standard training/inference dtype.
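Because BF16 keeps FP32's exponent layout, converting is just taking the top 16 bits. A sketch (real kernels usually round rather than truncate):

```python
import struct

def f32_to_bf16_bits(x):
    """BF16 is the top 16 bits of an FP32: 1 sign, 8 exponent, 7 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # truncating conversion

hex(f32_to_bf16_bits(1.0))  # 0x3f80: sign 0, exponent 127, mantissa 0
```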
BPE (Byte Pair Encoding)
Tokenization algorithm that learns subword units by iteratively merging the most frequent character pairs. The merge list defines the tokenization.
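One training step of that merge loop can be sketched in a few lines (a toy illustration, not a production tokenizer):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """One BPE training step: count adjacent pairs, return the most common."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of the pair with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == list(pair):
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("banana")
pair = most_frequent_pair(tokens)    # ('a', 'n') appears twice
tokens = merge(tokens, pair, "an")   # ['b', 'an', 'an', 'a']
```

Applying the learned merge list in order, greedily, is what tokenizes new text.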
GQA (Grouped-Query Attention)
Multiple Q heads share fewer KV heads. Llama 3: 32 Q heads, 8 KV heads (4:1 ratio). Saves KV cache memory and weight storage with minimal quality loss.
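The shape consequences follow directly from the config values quoted above:

```python
hidden_size, num_heads, num_kv_heads = 4096, 32, 8  # Llama 3.1 8B config values
head_dim = hidden_size // num_heads                 # 128

q_rows = num_heads * head_dim       # 4096 -> q_proj is [4096, 4096]
kv_rows = num_kv_heads * head_dim   # 1024 -> k_proj/v_proj are [1024, 4096]
ratio = q_rows // kv_rows           # 4 Q heads share each KV head
```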
head_dim
Dimension per attention head. Computed as hidden_size / num_attention_heads. Llama 3.1 8B: 4096 / 32 = 128.
hidden_size
Width of the residual stream / embedding dimension. Llama 3.1 8B: 4096. Determines the width of most tensors in the model.
intermediate_size
Width of the FFN hidden layer. Llama 3.1 8B: 14336 (~3.5× hidden_size). Determines gate_proj and up_proj first dimension.
KV Cache
Runtime memory storing K and V vectors from previous tokens. Shape: [layers, 2, seq_len, kv_heads, head_dim]. Grows linearly with sequence length. Not stored in model files.
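The growth is easy to put numbers on; a sketch using the Llama 3.1 8B values given elsewhere in this glossary, assuming BF16 (2 bytes per element):

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """layers x 2 (K and V) x positions x KV heads x head_dim x dtype size."""
    return layers * 2 * seq_len * kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, BF16, 8K context
gib = kv_cache_bytes(32, 8192, 8, 128) / 2**30  # exactly 1 GiB
```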
MoE (Mixture of Experts)
Architecture where each layer has multiple expert MLPs and a router that selects a subset (e.g., 2 of 8) per token. Total params > active params.
mmap (Memory Mapping)
OS-level technique that maps a file directly into virtual memory. Allows accessing tensor data by pointer arithmetic without copying. Used by Safetensors for fast loading.
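A minimal sketch of that zero-copy access with Python's `mmap` module (the mapping stays valid after the file object closes):

```python
import mmap

def view_range(path, offset, length):
    """Map the file and return a zero-copy view of [offset, offset + length)."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mm)[offset:offset + length]  # no bytes are copied here
```

This is essentially how a safetensors loader can hand out tensor data directly from the file's `data_offsets` without reading the whole file into RAM.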
RMSNorm
Root Mean Square Layer Normalization. Simpler than LayerNorm (no mean subtraction, no bias). Formula: y = x / RMS(x) × γ. ~10-15% faster.
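The formula above in a few lines of NumPy (eps is the usual small constant for numerical stability):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """y = x / RMS(x) * gamma -- no mean subtraction, no bias term."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```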
RoPE (Rotary Position Embedding)
Position encoding that rotates Q and K vectors in 2D planes. Controlled by rope_theta in config.json. Computed at runtime, not stored as weights.
rope_theta
Base frequency for RoPE. Higher values enable longer context. Llama 2: 10,000 (4K ctx). Llama 3.1: 500,000 (128K ctx).
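A sketch of the rotation for a single head vector, with theta as the configurable base frequency (this is the standard RoPE formulation; Llama 3.1 additionally applies a frequency-scaling scheme not shown here):

```python
import numpy as np

def rope(x, pos, theta=500000.0):
    """Rotate consecutive dim pairs of one head vector by position-scaled angles."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)  # one frequency per 2D plane
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out
```

Because these are pure rotations, vector norms are preserved, and position 0 leaves the vector unchanged.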
SwiGLU
Gated FFN activation: the layer computes down_proj(swish(gate_proj(x)) × up_proj(x)), using three matrices instead of the usual two. Better quality per parameter than ReLU/GELU. Standard in Llama/Mistral.
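A sketch of the full SwiGLU FFN forward pass, with weights stored as [out, in] the way the tensors above are shaped:

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down_proj(swish(gate_proj(x)) * up_proj(x)); swish(z) = z * sigmoid(z)."""
    gate = x @ w_gate.T                         # gating signal, [intermediate_size]
    up = x @ w_up.T                             # data path, [intermediate_size]
    hidden = gate / (1.0 + np.exp(-gate)) * up  # swish-gated elementwise product
    return hidden @ w_down.T                    # compress back to [hidden_size]
```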
Weight Tying
Sharing the same matrix between embed_tokens and lm_head. Saves vocab_size × hidden_size parameters. Used in GPT-2/BERT but not in Llama/Mistral.
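The savings are easy to quantify; a worked number assuming Llama 3.1's vocabulary size of 128256:

```python
vocab_size, hidden_size = 128256, 4096  # Llama 3.1 values (untied in practice)
saved = vocab_size * hidden_size        # parameters a tied head would save
millions = saved / 1e6                  # ~525M parameters (~1 GB in BF16)
```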