
Key Insights

The most important takeaways from Anatomy of an LLM File
Mental Models
1
An LLM file is a ZIP of a brain
Metadata is the skull shape (~0.2 MB), the tokenizer is the language center (~2 MB), and weights are the neurons (~16 GB). 99.98% of what you download is learned numbers.
Chapter 1
2
The tensor name IS the architecture
model.layers.15.self_attn.k_proj.weight tells you: model → layer 15 → self attention → key projection. You can reconstruct the entire architecture just from reading tensor names.
Chapter 1
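The naming scheme above is mechanical enough to parse. A sketch, assuming Llama-style tensor names (the list here is illustrative, not a real file dump):

```python
# Recover layer structure purely from tensor names.
import re
from collections import defaultdict

names = [
    "model.embed_tokens.weight",
    "model.layers.15.self_attn.q_proj.weight",
    "model.layers.15.self_attn.k_proj.weight",
    "model.layers.15.mlp.gate_proj.weight",
    "lm_head.weight",
]

pattern = re.compile(r"model\.layers\.(\d+)\.(\w+)\.(\w+)\.weight")
layers = defaultdict(set)
for name in names:
    m = pattern.match(name)
    if m:
        idx, block, proj = m.groups()        # e.g. "15", "self_attn", "k_proj"
        layers[int(idx)].add(f"{block}.{proj}")

print(sorted(layers[15]))
# ['mlp.gate_proj', 'self_attn.k_proj', 'self_attn.q_proj']
```

Run this over a full tensor list and the per-layer structure falls out without reading a single weight.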
3
config.json is the DNA
Every number in config.json determines a tensor dimension. Change hidden_size from 4096 to 8192 and every weight tensor doubles in width. You can calculate total parameters without opening any weight file.
Chapters 2 & 8
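The parameter count really does follow from config.json alone. A back-of-envelope sketch using Llama-3-8B-style values (field names follow the Hugging Face config convention; untied embeddings assumed):

```python
# Count parameters from config values only -- no weight file needed.
cfg = dict(
    hidden_size=4096, intermediate_size=14336, num_hidden_layers=32,
    num_attention_heads=32, num_key_value_heads=8, vocab_size=128256,
)

h, inter = cfg["hidden_size"], cfg["intermediate_size"]
kv_dim = h * cfg["num_key_value_heads"] // cfg["num_attention_heads"]

attn = 2 * h * h + 2 * kv_dim * h    # q/o full width, k/v shrunk by GQA
mlp = 3 * h * inter                  # gate_proj, up_proj, down_proj
norms = 2 * h                        # two RMSNorms per layer
per_layer = attn + mlp + norms

embed = cfg["vocab_size"] * h        # input embeddings (lm_head same size)
total = 2 * embed + cfg["num_hidden_layers"] * per_layer + h  # + final norm
print(f"{total:,}")                  # 8,030,261,248 -- the "8B"
```

Double hidden_size to 8192 and rerun: every term containing h grows with it, no weight file required.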
File Formats
4
Safetensors for GPU, GGUF for laptop, avoid PyTorch .bin
Safetensors = safe + fast mmap loading. GGUF = self-contained with native quantization. PyTorch .bin = pickle = arbitrary code execution risk. Loading an untrusted .bin is like running an untrusted script.
Chapter 2
5
You can identify any format from its first few bytes
JSON brace after 8-byte header = Safetensors. GGUF magic = GGUF. PK or pickle header = PyTorch. Works even on misnamed files.
Chapter 2
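Those fingerprints translate directly into a sniffer. A minimal sketch (real files may need more cases, e.g. older pickle protocols):

```python
# Identify a model file format from its leading bytes.
def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(16)
    if head[:4] == b"GGUF":
        return "gguf"
    if head[:2] == b"PK":             # ZIP container, as written by torch.save
        return "pytorch-zip"
    if head[:1] == b"\x80":           # pickle protocol >= 2 opcode
        return "pytorch-pickle"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"          # little-endian u64 header length, then JSON
    return "unknown"
```

Because it reads bytes, not extensions, it works on misnamed files too.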
Weight Architecture
6
FFN owns ~65% of all parameters, attention ~28%
The MLP layers (gate/up/down_proj) are the largest tensors. Optimizing model size means optimizing the FFN. This is why MoE replaces the single MLP with a router + multiple expert MLPs.
Chapters 4 & 5
7
GQA saves 75% of KV memory with minimal quality loss
Llama 3 uses 32 Q heads but only 8 KV heads. K/V projections shrink from [4096, 4096] to [1024, 4096]. This saves both file size AND inference memory (smaller KV cache).
Chapter 4
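The 75% figure is just the head ratio. A quick check with Llama-3-8B-style numbers (head_dim = 128 assumed):

```python
# GQA shrinks K/V projections (and the KV cache) by n_kv_heads / n_q_heads.
n_q_heads, n_kv_heads, head_dim, hidden = 32, 8, 128, 4096

mha_kv = 2 * (n_q_heads * head_dim) * hidden    # K and V at full width
gqa_kv = 2 * (n_kv_heads * head_dim) * hidden   # K and V with 8 shared heads
saved = 1 - gqa_kv / mha_kv

print(f"K/V projection: [{n_kv_heads * head_dim}, {hidden}]")  # [1024, 4096]
print(f"KV memory saved: {saved:.0%}")                          # 75%
```

The same 8/32 ratio applies at inference time, which is why the KV cache shrinks by the same 75%.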
8
Norm tensors are the smallest but most critical
65 RMSNorm tensors total ~520 KB — about 0.003% of the model. But removing any one causes training to diverge. They're the guardrails that keep values stable across 32 layers.
Chapter 6
Tokenizer & Runtime
9
Wrong tokenizer = garbage, even with perfect weights
Tokenizer and weights are a matched pair — like a lock and key. Wrong vocab size → crash. Wrong merge rules → wrong embeddings. Wrong chat template → confused output. Always download both from the same repository.
Chapter 7
10
The KV cache can consume more memory than the model
Llama 3.1 8B: 128 KB per token in the KV cache. At 128K context, that's 16 GB — equaling the model weight size. Long-context serving needs significantly more GPU memory than the model file suggests.
Chapter 8
11
Memory formula: params × bytes + KV cache + overhead
Model RAM = params × bytes_per_param (BF16=2, Q4=0.5). KV cache = 2 × layers × kv_heads × head_dim × seq_len × bytes. Add ~1-2 GB for framework overhead. Total must fit in GPU VRAM.
Chapter 8
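The formula above, worked through for Llama-3.1-8B-style values in BF16 at full 128K context (a sketch; the overhead figure is an illustrative midpoint):

```python
# Total serving memory = weights + KV cache + framework overhead.
params = 8_030_000_000
bytes_per_param = 2                    # BF16 (Q4 would be 0.5)
layers, kv_heads, head_dim = 32, 8, 128
seq_len, kv_bytes = 128 * 1024, 2      # 128K context, BF16 KV cache

weights = params * bytes_per_param
kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes   # K and V
kv_cache = kv_per_token * seq_len
overhead = 1.5 * 1024**3               # ~1-2 GB framework overhead

gib = 1024**3
print(f"weights   {weights / gib:5.1f} GiB")
print(f"kv cache  {kv_cache / gib:5.1f} GiB ({kv_per_token // 1024} KB/token)")
print(f"total    ~{(weights + kv_cache + overhead) / gib:5.1f} GiB")
```

This reproduces insight 10: 128 KB per token, 16 GiB of KV cache at 128K context, so a "16 GB model" needs roughly double that VRAM to serve long contexts.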
12
rope_theta controls context length capability
rope_theta=10,000 → ~4K context. rope_theta=500,000 → 128K context. Higher theta = lower rotation frequencies = the model can distinguish positions across longer sequences. Computed at runtime, not stored as weights.
Chapter 6
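You can see the effect by computing the RoPE frequencies directly. A sketch assuming head_dim = 128; the slowest-rotating dimension pair completes one full cycle after 2π/inv_freq tokens, so a higher theta keeps rotations unique over far longer spans (the usable context is still set by training, not by this wavelength alone):

```python
# RoPE inverse frequencies: inv_freq_i = theta ** (-2i / head_dim).
import math

head_dim = 128

def rope_inv_freqs(theta: float) -> list[float]:
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

for theta in (10_000.0, 500_000.0):
    slowest = rope_inv_freqs(theta)[-1]          # last pair rotates slowest
    cycle = 2 * math.pi / slowest                # tokens per full rotation
    print(f"theta={theta:>9,.0f}  slowest cycle ~{cycle:,.0f} tokens")
```

Raising theta from 10K to 500K stretches the slowest cycle by roughly 50x, which is what lets the model tell distant positions apart.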