Ch 1 — What’s Actually in an LLM File?

The big picture: three components, one container, and why 99.98% is weights
High Level: The File → Metadata → Tokenizer → Weights → Tensors → Architecture
The Mental Model: A ZIP of a Brain
What you're really downloading from Hugging Face
The Big Picture
When you download an LLM from Hugging Face, you're getting a container that holds exactly three categories of data: metadata (the architecture blueprint), a tokenizer (the text-to-numbers dictionary), and weight tensors (the learned knowledge). Think of it like a ZIP of a brain — metadata is the skull shape, weights are the neurons, and the tokenizer is the language center.
A Typical Model Download
// Files you get for Llama 3.1 8B:
model-00001-of-00004.safetensors   // ─┐
model-00002-of-00004.safetensors   // ─┤ Weight tensors
model-00003-of-00004.safetensors   // ─┤ (~16 GB total)
model-00004-of-00004.safetensors   // ─┘
model.safetensors.index.json       // Index map
config.json                        // Architecture
tokenizer.json                     // Vocabulary
tokenizer_config.json              // Chat template
Key insight: The model isn't one file — it's a collection. But the weight files dwarf everything else combined.
Component 1: Metadata (The Blueprint)
config.json — ~1.4 KB that defines the entire architecture
What Metadata Contains
The metadata is the architectural blueprint of the model. It tells the inference engine how to arrange the weight tensors into a working neural network. Without it, the weights are just a massive blob of numbers with no structure. Every field in config.json maps directly to a tensor shape or model behavior.
How Much Space?
For Llama 3.1 8B: config.json is 1.4 KB. The index file (model.safetensors.index.json) that maps tensor names to shard files is about 133 KB. Together, all metadata is under 200 KB — roughly 0.001% of the total download.
Key Config Fields
"hidden_size": 4096, // Embedding dimension "num_hidden_layers": 32, // Transformer blocks "num_attention_heads": 32, // Q heads "num_key_value_heads": 8, // KV heads (GQA) "intermediate_size": 14336,// FFN width "vocab_size": 128256, // Token dictionary "max_position_embeddings": 131072
Key insight: Every number in config.json determines a tensor dimension. Change hidden_size from 4096 to 8192 and every weight tensor doubles in width. The config is tiny but it's the DNA of the model.
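The config-to-shape mapping can be sketched in a few lines of Python. This is an illustration using the Llama 3.1 8B values quoted above, not actual loader code; the `shapes` dictionary keys are abbreviated tensor names.

```python
# Sketch: how config.json fields fix the weight-tensor shapes.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 128256,
}

h = config["hidden_size"]
head_dim = h // config["num_attention_heads"]       # 4096 / 32 = 128
kv_dim = config["num_key_value_heads"] * head_dim   # 8 × 128 = 1024 (GQA)

shapes = {
    "embed_tokens.weight": (config["vocab_size"], h),          # [128256, 4096]
    "self_attn.q_proj.weight": (h, h),                         # [4096, 4096]
    "self_attn.k_proj.weight": (kv_dim, h),                    # [1024, 4096]
    "self_attn.v_proj.weight": (kv_dim, h),                    # [1024, 4096]
    "mlp.gate_proj.weight": (config["intermediate_size"], h),  # [14336, 4096]
}
print(shapes["self_attn.k_proj.weight"])  # (1024, 4096)
```

Note how `num_key_value_heads` alone explains why the K and V projections are a quarter the height of Q.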
Component 2: The Tokenizer (The Dictionary)
How text becomes numbers the model can process
What the Tokenizer Does
The tokenizer is the model's translation layer — it converts human text into sequences of integer IDs, and converts model output back into text. It contains the vocabulary (a mapping of ~128K tokens to integer IDs), merge rules (how to break unknown words into known subpieces), and special tokens like <|begin_of_text|> and <|eot_id|>.
How Much Space?
For Llama 3.1 8B: tokenizer.model is 2.1 MB, tokenizer_config.json is 38.3 KB, special_tokens_map.json is 88 bytes. Total tokenizer data: roughly 2.2 MB — about 0.01% of the total download.
Tokenizer in Action
// Input text: "Hello, world!"
// Tokenized output (Llama 3.1):
[9906, 11, 1917, 0]
// "Hello"  ","  "world"  "!"
// Each ID = a row in the embedding matrix
// Token 9906 → row 9906 of embed_tokens.weight
Why it matters: The tokenizer's vocab_size (128,256 for Llama 3.1) directly determines the first dimension of the embedding matrix. More tokens = larger embedding table = bigger file.
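A toy sketch of the text-to-IDs step, assuming a tiny made-up vocabulary (the real Llama tokenizer uses byte-level BPE over ~128K entries and learned merge rules; greedy longest-match is only a stand-in to show the vocab-lookup idea):

```python
# Hypothetical 6-entry vocabulary; IDs here are made up for illustration.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3, " wor": 4, "ld": 5}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match: take the longest vocab entry that
    prefixes the remaining text, emit its ID, repeat."""
    ids = []
    while text:
        match = max((t for t in vocab if text.startswith(t)), key=len)
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenize("Hello, world!"))  # [0, 1, 2, 3]
```

Each emitted ID would then index a row of the embedding matrix, exactly as with token 9906 above.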
Component 3: Weight Tensors (The Brain)
8 billion learned numbers that ARE the model
What Weights Are
Weight tensors are multidimensional arrays of floating-point numbers learned during training. Each tensor is a matrix (or higher-dimensional array) that transforms input data in a specific way. For Llama 3.1 8B, there are approximately 8.03 billion parameters — 8,030,261,248 individual numbers that were adjusted across trillions of training tokens.
How Much Space?
In BF16 (brain floating point, 2 bytes per number): 8.03B × 2 bytes = ~16.06 GB. In FP32 (4 bytes): ~32 GB. In 4-bit quantization: ~4.7 GB. The weights are 99.98% of the total file size.
Size Comparison
// Llama 3.1 8B file breakdown (BF16):
Weight tensors:  ~16,060 MB   // 99.98%
Tokenizer:           ~2.2 MB  // 0.014%
Metadata:            ~0.2 MB  // 0.001%
────────────────────────────
Total:           ~16,062 MB
Key insight: When you download a 16 GB model file, you're downloading 16 GB of learned knowledge and ~2 MB of instructions for how to use it. Optimizing file size means optimizing weights — which is exactly what quantization does.
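The size arithmetic is worth checking once by hand; a minimal sketch using the exact parameter count quoted earlier (note that a real Q4 file lands nearer 4.7 GB than the pure 0.5-bytes-per-param figure, because quantization formats store per-block scales alongside the 4-bit values):

```python
# Parameter count × bytes per parameter = weight-file size.
params = 8_030_261_248  # Llama 3.1 8B

def model_gb(params: int, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

print(round(model_gb(params, 2), 2))    # BF16  → 16.06 GB
print(round(model_gb(params, 4), 2))    # FP32  → 32.12 GB
print(round(model_gb(params, 0.5), 2))  # 4-bit → 4.02 GB before scale overhead
```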
What Is a Tensor, Exactly?
The building block: multidimensional arrays with a name, shape, and dtype
Tensor = Named Matrix
Each tensor in a model file has three properties: a name (like model.layers.0.self_attn.q_proj.weight), a shape (like [4096, 4096] meaning 4096 rows × 4096 columns), and a dtype (the numeric precision, like BF16 or FP32). The name follows the model's architecture hierarchy — you can read it like a file path: model → layers → layer 0 → self attention → query projection → weight.
Tensor Name Anatomy
// A single tensor entry from the file header:
"model.layers.0.self_attn.q_proj.weight": {
  "dtype": "BF16",
  "shape": [4096, 4096],
  "data_offsets": [0, 33554432]
}
// 4096 × 4096 × 2 bytes (BF16) = 33,554,432 bytes
// That's 32 MB for ONE tensor
Key insight: The tensor name IS the architecture. model.layers.31.mlp.gate_proj.weight tells you: this is layer 31's MLP gate projection. There are 32 layers (0-31), each with multiple tensors. You can reconstruct the entire model architecture just from reading the tensor names.
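The byte span in a header entry follows mechanically from shape and dtype; a small sketch (the `DTYPE_BYTES` table covers only the dtypes mentioned in this chapter):

```python
# Reproduce a tensor's byte size from its header fields.
DTYPE_BYTES = {"BF16": 2, "FP16": 2, "FP32": 4}

def tensor_nbytes(shape: list[int], dtype: str) -> int:
    """Total bytes = product of all dimensions × bytes per element."""
    n = 1
    for dim in shape:
        n *= dim
    return n * DTYPE_BYTES[dtype]

# The q_proj entry shown above:
print(tensor_nbytes([4096, 4096], "BF16"))  # 33554432 (32 MB)
```

This is also how a loader validates a file: if `data_offsets` spans don't match shape × dtype, the file is corrupt.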
The Tensor Inventory of Llama 3.1 8B
~290 named tensors, organized by layer and function
Tensor Categories
Every tensor in the file falls into one of five categories:

1. Embedding: model.embed_tokens.weight — the vocabulary lookup table
2. Attention: Q, K, V, O projection matrices per layer
3. Feed-Forward (MLP): gate_proj, up_proj, down_proj per layer
4. Normalization: RMSNorm weights per layer + final norm
5. Output head: lm_head.weight — turns hidden states back into token probabilities
Tensor Count Breakdown
// Per transformer layer (×32 layers):
self_attn.q_proj.weight           // [4096, 4096]
self_attn.k_proj.weight           // [1024, 4096]
self_attn.v_proj.weight           // [1024, 4096]
self_attn.o_proj.weight           // [4096, 4096]
mlp.gate_proj.weight              // [14336, 4096]
mlp.up_proj.weight                // [14336, 4096]
mlp.down_proj.weight              // [4096, 14336]
input_layernorm.weight            // [4096]
post_attention_layernorm.weight   // [4096]
// 9 tensors × 32 layers = 288 layer tensors
// + embed_tokens + lm_head + model.norm = ~291
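The tally above can be verified in two lines; a sketch of the count:

```python
# Tensor census for Llama 3.1 8B.
per_layer = 9        # 4 attention + 3 MLP + 2 norm tensors
layers = 32
top_level = 3        # embed_tokens, lm_head, model.norm
total_tensors = per_layer * layers + top_level
print(total_tensors)  # 291
```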
Where the Bytes Go: Parameter Budget
MLP dominates, attention is second, everything else is rounding error
Parameter Distribution
Not all tensors are created equal. The MLP (feed-forward) tensors account for roughly 70% of all parameters because each layer has three large matrices (gate, up, down) whose intermediate dimension (14,336) is 3.5× the hidden dimension (4,096). Attention accounts for about 17% — grouped-query attention shrinks the K and V projections to a quarter of the Q and O size. The embedding and output head share about 13%, and normalization is a negligible 0.003%.
Parameter Budget
// Llama 3.1 8B — approximate parameter counts
MLP (FFN):        5.64B params  // ~70%
Attention:        1.34B params  // ~17%
Embeddings+Head:  1.05B params  // ~13%
Norms:            0.27M params  // ~0.003%
────────────────────────────────
Total:            8.03B params
Key insight: If you want to make a model smaller, the MLP layers are where the most bytes are hiding. This is why techniques like MoE (Mixture of Experts) replace the single MLP with a router that selects from multiple smaller expert MLPs — you get more knowledge capacity per active parameter.
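The whole budget can be recomputed from the tensor shapes listed in the inventory; a sketch using the Llama 3.1 8B dimensions, which should land exactly on the 8,030,261,248 figure:

```python
# Rebuild the parameter budget from per-tensor shapes.
h, inter, vocab, layers = 4096, 14336, 128256, 32
kv = 1024  # GQA: 8 KV heads × 128 head_dim

attn_per_layer = h*h + kv*h + kv*h + h*h   # q, k, v, o projections
mlp_per_layer  = 3 * inter * h             # gate, up, down
norm_per_layer = 2 * h                     # input + post-attention RMSNorm

attn  = attn_per_layer * layers
mlp   = mlp_per_layer * layers
norms = norm_per_layer * layers + h        # + final model.norm
embed_head = 2 * vocab * h                 # embed_tokens + lm_head

total = attn + mlp + norms + embed_head
print(total)                  # 8030261248
print(round(mlp / total, 3))  # 0.702 — MLP dominates
```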
Why This Matters: What Knowing the Anatomy Lets You Do
From debugging to optimization — practical applications
Practical Payoffs
Estimate memory: Parameters × bytes-per-param = GPU RAM needed. 8B params in BF16 = ~16 GB, plus overhead for KV cache and activations.

Debug loading errors: "Missing key model.layers.0.self_attn.q_proj.weight" means a tensor is absent — you can check the index file to find which shard should contain it.

Choose the right format: Safetensors for GPU inference, GGUF for CPU/llama.cpp, avoid PyTorch .bin format for untrusted models (arbitrary code execution risk).

Understand quantization tradeoffs: Converting from BF16 to 4-bit reduces each parameter from 2 bytes to 0.5 bytes, cutting file size by 75%, but quality loss varies by tensor type.
Quick Memory Formula
// Back-of-envelope memory estimate:
Model RAM = params × bytes_per_param
8B × 2   (BF16) = 16 GB
8B × 4   (FP32) = 32 GB
8B × 0.5 (Q4)   =  4 GB
+ KV cache (grows with sequence length)
+ Activations (temporary, during inference)
+ Framework overhead (~1-2 GB)
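The formula above can be packaged as a hypothetical helper. This is a rough estimator, not a sizing tool: the KV-cache term uses Llama 3.1 8B's GQA dimensions (32 layers, 1024-wide K/V in 2-byte precision), the 1.5 GB overhead is an assumed middle of the 1-2 GB range, and activations are ignored.

```python
# Rough serving-memory estimate: weights + KV cache + framework overhead.
def serve_ram_gb(params_b: float, bytes_per_param: float, seq_len: int = 0,
                 layers: int = 32, kv_dim: int = 1024, kv_bytes: int = 2,
                 overhead_gb: float = 1.5) -> float:
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) × layers × tokens × width × bytes
    kv_cache = 2 * layers * seq_len * kv_dim * kv_bytes
    return (weights + kv_cache) / 1e9 + overhead_gb

print(round(serve_ram_gb(8.03, 2), 1))                # ~17.6 GB, empty context
print(round(serve_ram_gb(8.03, 2, seq_len=8192), 1))  # ~18.6 GB at 8K context
```

Note how modest the GQA cache is: 8K tokens of context costs only about 1 GB, versus roughly 4 GB if K and V were full 4096-wide MHA projections.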
Key insight: You now have the map. The rest of this course zooms into each region: file format internals (Ch 2), embeddings (Ch 3), attention weights (Ch 4), FFN layers (Ch 5), special tensors (Ch 6), tokenizer (Ch 7), and config + runtime (Ch 8).