Practical Payoffs
Estimate memory: Parameters × bytes-per-param gives the floor on GPU RAM needed for weights. 8B params in BF16 ≈ 16 GB, plus overhead for the KV cache and activations.
Debug loading errors: "Missing key model.layers.0.self_attn.q_proj.weight" means a tensor is absent — you can check the index file to find which shard should contain it.
Choose the right format: Safetensors for GPU inference, GGUF for CPU/llama.cpp; avoid pickle-based PyTorch .bin checkpoints for untrusted models (loading them can execute arbitrary code).
Understand quantization tradeoffs: Converting from BF16 to 4-bit reduces each parameter from 2 bytes to 0.5 bytes, cutting file size by 75%, but quality loss varies by tensor type.
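The shard lookup from the debugging bullet can be sketched in a few lines. The index filename `model.safetensors.index.json` and its `weight_map` key follow the Hugging Face sharding convention; the sample map below is illustrative, not from a real checkpoint:

```python
# A sharded checkpoint ships with an index file (Hugging Face convention:
# model.safetensors.index.json) whose "weight_map" maps tensor name -> shard file.
# Illustrative sample of that JSON structure:
sample_index = {
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    }
}

def shard_for(index: dict, tensor_name: str):
    """Return the shard file that should contain tensor_name, or None if the
    tensor is missing from the checkpoint entirely."""
    return index["weight_map"].get(tensor_name)

# For a real model you would load the dict with json.load() on the index file,
# then look up the key named in the "Missing key ..." error message.
print(shard_for(sample_index, "model.layers.0.self_attn.q_proj.weight"))
```

If `shard_for` returns a shard name but loading still fails, that shard file is likely missing or truncated; if it returns None, the checkpoint genuinely lacks the tensor.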
Quick Memory Formula
// Back-of-envelope memory estimate:
Model RAM = params × bytes_per_param
8B × 2 (BF16) = 16 GB
8B × 4 (FP32) = 32 GB
8B × 0.5 (Q4) = 4 GB
+ KV cache (grows with sequence length)
+ Activations (temporary, during inference)
+ Framework overhead (~1-2 GB)
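The back-of-envelope formula above translates directly into a small helper. This is a sketch of the weights-only term; the dtype table and the decimal-GB convention match the figures in this section, and KV cache, activations, and framework overhead still come on top:

```python
# Bytes per parameter for common storage formats (assumed typical values).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weights_gb(n_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (decimal, matching the estimates above).
    KV cache, activations, and ~1-2 GB framework overhead are not modeled."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "bf16", "q4"):
    print(f"8B params in {dtype}: {weights_gb(8e9, dtype):.1f} GB")
# prints 32.0, 16.0, and 4.0 GB respectively
```

Swapping `8e9` for another parameter count reproduces the same table for any model size.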
Key insight: You now have the map. The rest of this course zooms into each region: file format internals (Ch 2), embeddings (Ch 3), attention weights (Ch 4), FFN layers (Ch 5), special tensors (Ch 6), tokenizer (Ch 7), and config + runtime (Ch 8).