
Key Insights

The most important takeaways from Anatomy of an LLM File
Mental Models
1
An LLM file is a ZIP of a brain
Metadata is the skull shape (~0.2 MB), the tokenizer is the language center (~2 MB), and weights are the neurons (~16 GB). 99.98% of what you download is learned numbers.
Chapter 1
2
The tensor name IS the architecture
model.layers.15.self_attn.k_proj.weight tells you: model → layer 15 → self attention → key projection. You can reconstruct the entire architecture just from reading tensor names.
Chapter 1
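The naming scheme above is mechanical enough to parse. A sketch, assuming Llama-style tensor names (the list here is illustrative, not a real file dump):

```python
# Recover layer structure purely from tensor names.
import re
from collections import defaultdict

names = [
    "model.embed_tokens.weight",
    "model.layers.15.self_attn.q_proj.weight",
    "model.layers.15.self_attn.k_proj.weight",
    "model.layers.15.mlp.gate_proj.weight",
    "lm_head.weight",
]

pattern = re.compile(r"model\.layers\.(\d+)\.(\w+)\.(\w+)\.weight")
layers = defaultdict(set)
for name in names:
    m = pattern.match(name)
    if m:
        idx, block, proj = m.groups()        # e.g. "15", "self_attn", "k_proj"
        layers[int(idx)].add(f"{block}.{proj}")

print(sorted(layers[15]))
# ['mlp.gate_proj', 'self_attn.k_proj', 'self_attn.q_proj']
```

Run this over a full tensor list and the per-layer structure falls out without reading a single weight.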
3
config.json is the DNA
Every number in config.json determines a tensor dimension. Change hidden_size from 4096 to 8192 and every weight tensor doubles in width. You can calculate total parameters without opening any weight file.
Chapters 2 & 8
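The parameter count really does follow from config.json alone. A back-of-envelope sketch using Llama-3-8B-style values (field names follow the Hugging Face config convention; untied embeddings assumed):

```python
# Count parameters from config values only -- no weight file needed.
cfg = dict(
    hidden_size=4096, intermediate_size=14336, num_hidden_layers=32,
    num_attention_heads=32, num_key_value_heads=8, vocab_size=128256,
)

h, inter = cfg["hidden_size"], cfg["intermediate_size"]
kv_dim = h * cfg["num_key_value_heads"] // cfg["num_attention_heads"]

attn = 2 * h * h + 2 * kv_dim * h    # q/o full width, k/v shrunk by GQA
mlp = 3 * h * inter                  # gate_proj, up_proj, down_proj
norms = 2 * h                        # two RMSNorms per layer
per_layer = attn + mlp + norms

embed = cfg["vocab_size"] * h        # input embeddings (lm_head same size)
total = 2 * embed + cfg["num_hidden_layers"] * per_layer + h  # + final norm
print(f"{total:,}")                  # 8,030,261,248 -- the "8B"
```

Double hidden_size to 8192 and rerun: every term containing h grows with it, no weight file required.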
File Formats
4
Safetensors for GPU, GGUF for laptop, avoid PyTorch .bin
Safetensors = safe + fast mmap loading. GGUF = self-contained with native quantization. PyTorch .bin = pickle = arbitrary code execution risk. Loading an untrusted .bin is like running an untrusted script.
Chapter 2
5
You can identify any format from its first few bytes
JSON brace after 8-byte header = Safetensors. GGUF magic = GGUF. PK or pickle header = PyTorch. Works even on misnamed files.
Chapter 2
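Those fingerprints translate directly into a sniffer. A minimal sketch (real files may need more cases, e.g. older pickle protocols):

```python
# Identify a model file format from its leading bytes.
def sniff(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(16)
    if head[:4] == b"GGUF":
        return "gguf"
    if head[:2] == b"PK":             # ZIP container, as written by torch.save
        return "pytorch-zip"
    if head[:1] == b"\x80":           # pickle protocol >= 2 opcode
        return "pytorch-pickle"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"          # little-endian u64 header length, then JSON
    return "unknown"
```

Because it reads bytes, not extensions, it works on misnamed files too.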
Weight Architecture
6
FFN owns ~65% of all parameters, attention ~28%
The MLP layers (gate/up/down_proj) are the largest tensors. Optimizing model size means optimizing the FFN. This is why MoE replaces the single MLP with a router + multiple expert MLPs.
Chapters 4 & 5
7
GQA saves 75% of KV memory with minimal quality loss
Llama 3 uses 32 Q heads but only 8 KV heads. K/V projections shrink from [4096, 4096] to [1024, 4096]. This saves both file size AND inference memory (smaller KV cache).
Chapter 4
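The 75% figure is just the head ratio. A quick check with Llama-3-8B-style numbers (head_dim = 128 assumed):

```python
# GQA shrinks K/V projections (and the KV cache) by n_kv_heads / n_q_heads.
n_q_heads, n_kv_heads, head_dim, hidden = 32, 8, 128, 4096

mha_kv = 2 * (n_q_heads * head_dim) * hidden    # K and V at full width
gqa_kv = 2 * (n_kv_heads * head_dim) * hidden   # K and V with 8 shared heads
saved = 1 - gqa_kv / mha_kv

print(f"K/V projection: [{n_kv_heads * head_dim}, {hidden}]")  # [1024, 4096]
print(f"KV memory saved: {saved:.0%}")                          # 75%
```

The same 8/32 ratio applies at inference time, which is why the KV cache shrinks by the same 75%.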
8
Norm tensors are the smallest but most critical
65 RMSNorm tensors total ~520 KB — about 0.003% of the model. But removing any one causes training to diverge. They're the guardrails that keep values stable across 32 layers.
Chapter 6
Tokenizer & Runtime
9
Wrong tokenizer = garbage, even with perfect weights
Tokenizer and weights are a matched pair — like a lock and key. Wrong vocab size → crash. Wrong merge rules → wrong embeddings. Wrong chat template → confused output. Always download both from the same repository.
Chapter 7
10
The KV cache can consume more memory than the model
Llama 3.1 8B: 128 KB per token in the KV cache. At 128K context, that's 16 GB — equaling the model weight size. Long-context serving needs significantly more GPU memory than the model file suggests.
Chapter 8
11
Memory formula: params × bytes + KV cache + overhead
Model RAM = params × bytes_per_param (BF16=2, Q4=0.5). KV cache = 2 × layers × kv_heads × head_dim × seq_len × bytes. Add ~1-2 GB for framework overhead. Total must fit in GPU VRAM.
Chapter 8
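The formula above, worked through for Llama-3.1-8B-style values in BF16 at full 128K context (a sketch; the overhead figure is an illustrative midpoint):

```python
# Total serving memory = weights + KV cache + framework overhead.
params = 8_030_000_000
bytes_per_param = 2                    # BF16 (Q4 would be 0.5)
layers, kv_heads, head_dim = 32, 8, 128
seq_len, kv_bytes = 128 * 1024, 2      # 128K context, BF16 KV cache

weights = params * bytes_per_param
kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes   # K and V
kv_cache = kv_per_token * seq_len
overhead = 1.5 * 1024**3               # ~1-2 GB framework overhead

gib = 1024**3
print(f"weights   {weights / gib:5.1f} GiB")
print(f"kv cache  {kv_cache / gib:5.1f} GiB ({kv_per_token // 1024} KB/token)")
print(f"total    ~{(weights + kv_cache + overhead) / gib:5.1f} GiB")
```

This reproduces insight 10: 128 KB per token, 16 GiB of KV cache at 128K context, so a "16 GB model" needs roughly double that VRAM to serve long contexts.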
12
rope_theta controls context length capability
rope_theta=10,000 → ~4K context. rope_theta=500,000 → 128K context. Higher theta = lower rotation frequencies = the model can distinguish positions across longer sequences. Computed at runtime, not stored as weights.
Chapter 6
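You can see the effect by computing the RoPE frequencies directly. A sketch assuming head_dim = 128; the slowest-rotating dimension pair completes one full cycle after 2π/inv_freq tokens, so a higher theta keeps rotations unique over far longer spans (the usable context is still set by training, not by this wavelength alone):

```python
# RoPE inverse frequencies: inv_freq_i = theta ** (-2i / head_dim).
import math

head_dim = 128

def rope_inv_freqs(theta: float) -> list[float]:
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

for theta in (10_000.0, 500_000.0):
    slowest = rope_inv_freqs(theta)[-1]          # last pair rotates slowest
    cycle = 2 * math.pi / slowest                # tokens per full rotation
    print(f"theta={theta:>9,.0f}  slowest cycle ~{cycle:,.0f} tokens")
```

Raising theta from 10K to 500K stretches the slowest cycle by roughly 50x, which is what lets the model tell distant positions apart.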