Ch 8 — config.json, generation_config.json, and Runtime Structures

The blueprint, the defaults, and the invisible memory consumers
config.json: The Architectural Blueprint
Every field maps to a tensor shape or model behavior
Reading the Blueprint
config.json is the DNA of the model. Every field directly determines a tensor dimension, a computation path, or a behavior. You can reverse-engineer the entire model architecture from this one file. It tells the inference engine how many layers to create, how wide each tensor should be, how many attention heads to use, and which activation function to apply. At ~1.4 KB, it's the smallest file but arguably the most information-dense.
Field-to-Tensor Map
// config.json → tensor shapes:
"hidden_size": 4096        → embed_tokens: [V, 4096]
                           → q_proj, o_proj: [4096, 4096]
                           → all norm weights: [4096]
"num_hidden_layers": 32    → layers.0 through layers.31
"num_attention_heads": 32  → head_dim = 4096/32 = 128
"num_key_value_heads": 8   → k_proj, v_proj: [8×128, 4096] = [1024, 4096]
"intermediate_size": 14336 → gate/up_proj: [14336, 4096]
"vocab_size": 128256       → embed_tokens: [128256, 4096]
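The field-to-shape mapping above is mechanical enough to script. Here is a minimal sketch that derives the per-tensor shapes from a config dict (the values are Llama 3.1 8B's; the `shapes` dict and variable names are illustrative, not any framework's API):

```python
# Derive tensor shapes from config.json fields (Llama 3.1 8B values).
config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 128256,
}

h = config["hidden_size"]
head_dim = h // config["num_attention_heads"]          # 4096/32 = 128
kv_dim = config["num_key_value_heads"] * head_dim      # 8×128 = 1024

shapes = {
    "embed_tokens": (config["vocab_size"], h),
    "q_proj": (h, h),
    "k_proj": (kv_dim, h),   # GQA: shrunk to 8 KV heads
    "v_proj": (kv_dim, h),
    "o_proj": (h, h),
    "gate_proj": (config["intermediate_size"], h),
    "up_proj": (config["intermediate_size"], h),
    "down_proj": (h, config["intermediate_size"]),
}

print(shapes["k_proj"])  # (1024, 4096)
```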
Reverse-Engineering from config.json
Calculate total parameters without opening the weight files
Parameter Formula
You can compute the total parameter count from config.json alone:

Embedding + lm_head: 2 × vocab × hidden (if untied)
Per-layer attention: hidden² + 2 × (kv_heads × head_dim × hidden) + hidden²
Per-layer FFN: 3 × intermediate × hidden
Per-layer norms: 2 × hidden
Final norm: hidden

Multiply the per-layer total by num_layers, then add the embeddings and the final norm. This gives you the exact parameter count and therefore the weight-file size in any precision.
Calculation
// Llama 3.1 8B from config.json alone:
Embed + head: 2 × 128256 × 4096              = 1.05B
Attn/layer:   4096² + 2×(1024×4096) + 4096²  = 41.9M
FFN/layer:    3 × 14336 × 4096               = 176.2M
Norms/layer:  2 × 4096                       = 8.2K
Layer total:  218.1M × 32                    = 6.98B
Final norm:   4096                           = 4K
─────────────────────────
Grand total:  ~8.03B parameters
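The same arithmetic packs into one small function. A sketch (the function name and config dict are illustrative; it assumes the untied-embedding, GQA, SwiGLU layout described above):

```python
def param_count(cfg, tied=False):
    """Total parameter count from config.json fields alone."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim
    embed = cfg["vocab_size"] * h * (1 if tied else 2)  # embed_tokens + lm_head
    attn = 2 * h * h + 2 * kv_dim * h                   # q/o + k/v projections
    ffn = 3 * cfg["intermediate_size"] * h              # gate, up, down
    norms = 2 * h                                       # two RMSNorms per layer
    per_layer = attn + ffn + norms
    return embed + cfg["num_hidden_layers"] * per_layer + h  # + final norm

llama_8b = {
    "hidden_size": 4096, "num_hidden_layers": 32,
    "num_attention_heads": 32, "num_key_value_heads": 8,
    "intermediate_size": 14336, "vocab_size": 128256,
}
print(param_count(llama_8b))  # 8030261248 → ~8.03B
```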
generation_config.json: Generation Defaults
Temperature, top_p, top_k, and other sampling parameters
What It Controls
generation_config.json defines default text generation parameters. These control how the model samples from its probability distribution: temperature (randomness), top_p (nucleus sampling threshold), top_k (limit to top K tokens), max_new_tokens (output length limit), and repetition_penalty. These are defaults — inference frameworks can override them per request.
Example File
// generation_config.json (~148 bytes):
{
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [128001, 128008, 128009],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
Key insight: Notice eos_token_id is an array — multiple tokens can signal "stop generating." This is how the model knows to stop at <|end_of_text|>, <|eom_id|>, or <|eot_id|>.
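A decode loop uses that array as a stop set: generation ends when the sampled token is any member, or when the length limit is hit. A minimal sketch (the `should_stop` helper is illustrative, not a framework API):

```python
# Stop check run once per generated token, using the defaults from
# generation_config.json above.
gen_config = {"eos_token_id": [128001, 128008, 128009], "max_length": 4096}
eos_ids = set(gen_config["eos_token_id"])  # any one of these ends generation

def should_stop(token_id, n_generated):
    return token_id in eos_ids or n_generated >= gen_config["max_length"]

print(should_stop(128009, 10))  # True — <|eot_id|> is one of the stop tokens
print(should_stop(42, 10))      # False — an ordinary token, keep generating
```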
The KV Cache: The Invisible Memory Consumer
Runtime structure that can consume more memory than the model itself
What the KV Cache Is
During inference, the model needs to remember the K and V vectors from all previous tokens to attend to them. Rather than recomputing these every time, they're cached in GPU memory. The KV cache shape is [num_layers, 2, seq_len, num_kv_heads, head_dim]. It doesn't exist in the model file — it's created at runtime and grows linearly with sequence length. For long conversations, the KV cache can exceed the model weight memory.
KV Cache Size Formula
// KV cache memory per token:
per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_param

// Llama 3.1 8B (BF16):
per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes = 128 KB/token

// At various sequence lengths:
4K tokens:   512 MB
32K tokens:  4 GB
128K tokens: 16 GB  // = model size!
Why it matters: At 128K context length, the KV cache for Llama 3.1 8B consumes ~16 GB — as much as the model weights themselves. This is why long-context serving requires significantly more GPU memory than the model size alone suggests.
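The formula translates directly into code. A sketch (function name is illustrative; it assumes BF16's 2 bytes per element by default):

```python
def kv_cache_bytes(cfg, seq_len, bytes_per_param=2, batch=1):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per head_dim."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    per_token = (2 * cfg["num_hidden_layers"]
                 * cfg["num_key_value_heads"] * head_dim * bytes_per_param)
    return per_token * seq_len * batch

llama_8b = {"hidden_size": 4096, "num_attention_heads": 32,
            "num_key_value_heads": 8, "num_hidden_layers": 32}

print(kv_cache_bytes(llama_8b, 1))                    # 131072 → 128 KB/token
print(kv_cache_bytes(llama_8b, 128 * 1024) / 2**30)   # 16.0 GiB at 128K context
```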
Activation Tensors and Attention Masks
Temporary memory that exists only during the forward pass
Runtime-Only Structures
Activations are the intermediate computation results at each layer — the hidden states, attention scores, FFN intermediate values. They're temporary: allocated during the forward pass, used once, then freed. Attention masks prevent the model from attending to future tokens (causal masking) or padding tokens. Both are created at runtime and never stored in the model file. Activation memory peaks during the FFN layer (14,336-dimensional intermediate).
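Causal masking is simple enough to show in a few lines. A toy sketch of the mask a framework builds at runtime (real implementations use a GPU tensor of −inf values added to attention scores, not Python lists):

```python
# Causal mask for seq_len tokens: position i may attend to positions 0..i
# only. Built at runtime, never stored in the model file.
def causal_mask(seq_len):
    # mask[i][j] is True where attention is allowed (j <= i)
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print("".join("x" if ok else "." for ok in row))
# x...
# xx..
# xxx.
# xxxx
```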
Memory Categories
// Total GPU memory during inference:
1. Model weights  // Loaded from file        ~16 GB (8B, BF16)
2. KV cache       // Grows with seq_len      128 KB/token
3. Activations    // Temporary per layer     ~200-500 MB peak
4. Framework      // CUDA, PyTorch overhead  ~1-2 GB

// Only #1 comes from the file
// #2-4 are created at runtime
Key insight: When people say "this model needs 24 GB of VRAM," they mean weights + KV cache + activations + overhead. The model file alone tells you only the weight size — you need to add runtime memory based on your expected sequence length and batch size.
The Memory Budget Calculator
How to estimate total GPU memory for any model
Step-by-Step Estimation
Step 1: Model weights = params × bytes_per_param
Step 2: KV cache = 2 × layers × kv_heads × head_dim × seq_len × bytes × batch_size
Step 3: Activations ≈ 2 × intermediate_size × seq_len × bytes (the gate and up projections peak during the FFN)
Step 4: Framework overhead ≈ 1-2 GB
Total: Sum of all four. If total > GPU VRAM, you need quantization, model parallelism, or a bigger GPU.
Example: Llama 3.1 8B, 8K Context
// Llama 3.1 8B, BF16, 8K ctx, batch=1:
Weights:     8.03B × 2    = 16.1 GB
KV cache:    128KB × 8192 =  1.0 GB
Activations:              ~ 0.4 GB
Overhead:                 ~ 1.5 GB
────────────────────────────────
Total:                    ~19 GB
// → Fits on a 24 GB GPU (A10, 4090)

// Same model, Q4 quantized:
Weights:     8.03B × 0.5  =  4.0 GB
KV cache:    (same)          1.0 GB
Total:                     ~6.9 GB
// → Fits on an 8 GB GPU
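The four-step budget can be sketched as one function. This is a rough estimator, not a framework API: the function name is illustrative, it reports decimal GB, and the activation term assumes the gate/up projections during the FFN dominate the peak:

```python
def vram_estimate_gb(params_b, cfg, seq_len, bytes_per_param=2,
                     batch=1, overhead_gb=1.5):
    """Rough VRAM estimate in decimal GB: weights + KV + activations + overhead."""
    GB = 1e9
    weights = params_b * 1e9 * bytes_per_param / GB
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    kv = (2 * cfg["num_hidden_layers"] * cfg["num_key_value_heads"]
          * head_dim * seq_len * bytes_per_param * batch) / GB
    # assumption: FFN gate+up intermediates dominate the activation peak
    acts = 2 * cfg["intermediate_size"] * seq_len * bytes_per_param / GB
    return weights + kv + acts + overhead_gb

llama_8b = {"hidden_size": 4096, "num_attention_heads": 32,
            "num_key_value_heads": 8, "num_hidden_layers": 32,
            "intermediate_size": 14336}
print(round(vram_estimate_gb(8.03, llama_8b, 8192), 1))  # ~19.1 GB (BF16, 8K ctx)
```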
Config Fields Decoded: The Complete Reference
Every important field and what it means
Architecture Fields
"architectures": ["LlamaForCausalLM"] // Which model class to instantiate "model_type": "llama" // Architecture family identifier "torch_dtype": "bfloat16" // Native precision of weights "tie_word_embeddings": false // embed_tokens and lm_head separate "rope_theta": 500000.0 // Controls context length capability "rms_norm_eps": 1e-05 // Epsilon for numerical stability
Dimension Fields
"hidden_size": 4096 // Width of the residual stream "num_hidden_layers": 32 // Depth: how many transformer blocks "num_attention_heads": 32 // Q heads → head_dim = 4096/32 = 128 "num_key_value_heads": 8 // KV heads (GQA) → K/V are [1024, 4096] "intermediate_size": 14336 // FFN width → gate/up: [14336, 4096] "vocab_size": 128256 // Embedding rows → must match tokenizer "max_position_embeddings": 131072 // Maximum supported context length
Course Complete: You Can Now Read Any LLM File
From bytes on disk to a working transformer — the full map
What You Now Know
Ch 1: An LLM file = metadata + tokenizer + weights (99.98%)
Ch 2: Three formats: Safetensors (safe, fast), GGUF (self-contained), PyTorch (legacy, risky)
Ch 3: Embedding = [vocab, hidden] lookup table, token IDs → vectors
Ch 4: Attention = Q, K, V, O projections with GQA shrinking K/V
Ch 5: FFN = SwiGLU gate/up/down, 65% of all parameters
Ch 6: Special tensors: RMSNorm, RoPE (computed), lm_head, MoE
Ch 7: Tokenizer = BPE vocab + merges + chat template contract
Ch 8: config.json is the DNA; KV cache grows with sequence length
The Complete File Map
// Every file in an LLM download:
config.json                   // DNA
generation_config.json        // Defaults
tokenizer.json                // Dictionary
tokenizer_config.json         // Chat template
model.safetensors.index.json  // Shard map
model-0000N.safetensors       // Weights

// Not in the file but in your GPU:
KV cache     // 128 KB/token
Activations  // Temporary
Key insight: You can now open any LLM file, read its config, inspect its tensors, estimate its memory footprint, verify its tokenizer compatibility, and understand exactly what every byte is doing. You've completed the anatomy course.