Ch 8 — config.json, generation_config.json, and Runtime Structures

The blueprint, the defaults, and the invisible memory consumers
config.json: The Architectural Blueprint
Every field maps to a tensor shape or model behavior
Reading the Blueprint
config.json is the DNA of the model. Every field directly determines a tensor dimension, a computation path, or a behavior. You can reverse-engineer the entire model architecture from this one file. It tells the inference engine how many layers to create, how wide each tensor should be, how many attention heads to use, and which activation function to apply. At ~1.4 KB, it's the smallest file but arguably the most information-dense.
Field-to-Tensor Map
// config.json → tensor shapes:
"hidden_size": 4096        → embed_tokens: [V, 4096]
                           → q_proj, o_proj: [4096, 4096]
                           → all norm weights: [4096]
"num_hidden_layers": 32    → layers.0 through layers.31
"num_attention_heads": 32  → head_dim = 4096/32 = 128
"num_key_value_heads": 8   → k_proj, v_proj: [8×128, 4096] = [1024, 4096]
"intermediate_size": 14336 → gate/up_proj: [14336, 4096]
"vocab_size": 128256       → embed_tokens: [128256, 4096]
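The field-to-shape mapping above is mechanical enough to script. Here is a minimal sketch that derives the per-tensor shapes from a config dict (the values are Llama 3.1 8B's; the `shapes` dict and variable names are illustrative, not any framework's API):

```python
# Derive tensor shapes from config.json fields (Llama 3.1 8B values).
config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 128256,
}

h = config["hidden_size"]
head_dim = h // config["num_attention_heads"]          # 4096/32 = 128
kv_dim = config["num_key_value_heads"] * head_dim      # 8×128 = 1024

shapes = {
    "embed_tokens": (config["vocab_size"], h),
    "q_proj": (h, h),
    "k_proj": (kv_dim, h),   # GQA: shrunk to 8 KV heads
    "v_proj": (kv_dim, h),
    "o_proj": (h, h),
    "gate_proj": (config["intermediate_size"], h),
    "up_proj": (config["intermediate_size"], h),
    "down_proj": (h, config["intermediate_size"]),
}

print(shapes["k_proj"])  # (1024, 4096)
```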
Reverse-Engineering from config.json
Calculate total parameters without opening the weight files
Parameter Formula
You can compute the total parameter count from config.json alone:

Embedding + lm_head: 2 × vocab × hidden (if untied)
Per-layer attention: hidden² + 2 × (kv_heads × head_dim × hidden) + hidden²
Per-layer FFN: 3 × intermediate × hidden
Per-layer norms: 2 × hidden
Final norm: hidden

Multiply the per-layer total by num_layers, then add the embeddings and the final norm. This gives you the exact parameter count and therefore the weight-file size in any precision.
Calculation
// Llama 3.1 8B from config.json alone:
Embed + head: 2 × 128256 × 4096              = 1.05B
Attn/layer:   4096² + 2×(1024×4096) + 4096²  = 41.9M
FFN/layer:    3 × 14336 × 4096               = 176.2M
Norms/layer:  2 × 4096                       = 8.2K
Layer total:  218.1M × 32                    = 6.98B
Final norm:   4096                           = 4K
─────────────────────────
Grand total:  ~8.03B parameters
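The same arithmetic packs into one small function. A sketch (the function name and config dict are illustrative; it assumes the untied-embedding, GQA, SwiGLU layout described above):

```python
def param_count(cfg, tied=False):
    """Total parameter count from config.json fields alone."""
    h = cfg["hidden_size"]
    head_dim = h // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim
    embed = cfg["vocab_size"] * h * (1 if tied else 2)  # embed_tokens + lm_head
    attn = 2 * h * h + 2 * kv_dim * h                   # q/o + k/v projections
    ffn = 3 * cfg["intermediate_size"] * h              # gate, up, down
    norms = 2 * h                                       # two RMSNorms per layer
    per_layer = attn + ffn + norms
    return embed + cfg["num_hidden_layers"] * per_layer + h  # + final norm

llama_8b = {
    "hidden_size": 4096, "num_hidden_layers": 32,
    "num_attention_heads": 32, "num_key_value_heads": 8,
    "intermediate_size": 14336, "vocab_size": 128256,
}
print(param_count(llama_8b))  # 8030261248 → ~8.03B
```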
generation_config.json: Generation Defaults
Temperature, top_p, top_k, and other sampling parameters
What It Controls
generation_config.json defines default text generation parameters. These control how the model samples from its probability distribution: temperature (randomness), top_p (nucleus sampling threshold), top_k (limit to top K tokens), max_new_tokens (output length limit), and repetition_penalty. These are defaults — inference frameworks can override them per request.
Example File
// generation_config.json (~148 bytes):
{
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [128001, 128008, 128009],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
Key insight: Notice eos_token_id is an array — multiple tokens can signal "stop generating." This is how the model knows to stop at <|end_of_text|>, <|eom_id|>, or <|eot_id|>.
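A decode loop uses that array as a stop set: generation ends when the sampled token is any member, or when the length limit is hit. A minimal sketch (the `should_stop` helper is illustrative, not a framework API):

```python
# Stop check run once per generated token, using the defaults from
# generation_config.json above.
gen_config = {"eos_token_id": [128001, 128008, 128009], "max_length": 4096}
eos_ids = set(gen_config["eos_token_id"])  # any one of these ends generation

def should_stop(token_id, n_generated):
    return token_id in eos_ids or n_generated >= gen_config["max_length"]

print(should_stop(128009, 10))  # True — <|eot_id|> is one of the stop tokens
print(should_stop(42, 10))      # False — an ordinary token, keep generating
```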
The KV Cache: The Invisible Memory Consumer
Runtime structure that can consume more memory than the model itself
What the KV Cache Is
During inference, the model needs to remember the K and V vectors from all previous tokens to attend to them. Rather than recomputing these every time, they're cached in GPU memory. The KV cache shape is [num_layers, 2, seq_len, num_kv_heads, head_dim]. It doesn't exist in the model file — it's created at runtime and grows linearly with sequence length. For long conversations, the KV cache can exceed the model weight memory.
KV Cache Size Formula
// KV cache memory per token:
per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_param

// Llama 3.1 8B (BF16):
per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes = 128 KB/token

// At various sequence lengths:
4K tokens:   512 MB
32K tokens:  4 GB
128K tokens: 16 GB  // = model size!
Why it matters: At 128K context length, the KV cache for Llama 3.1 8B consumes ~16 GB — as much as the model weights themselves. This is why long-context serving requires significantly more GPU memory than the model size alone suggests.
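The formula translates directly into code. A sketch (function name is illustrative; it assumes BF16's 2 bytes per element by default):

```python
def kv_cache_bytes(cfg, seq_len, bytes_per_param=2, batch=1):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per head_dim."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    per_token = (2 * cfg["num_hidden_layers"]
                 * cfg["num_key_value_heads"] * head_dim * bytes_per_param)
    return per_token * seq_len * batch

llama_8b = {"hidden_size": 4096, "num_attention_heads": 32,
            "num_key_value_heads": 8, "num_hidden_layers": 32}

print(kv_cache_bytes(llama_8b, 1))                    # 131072 → 128 KB/token
print(kv_cache_bytes(llama_8b, 128 * 1024) / 2**30)   # 16.0 GiB at 128K context
```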
Activation Tensors and Attention Masks
Temporary memory that exists only during the forward pass
Runtime-Only Structures
Activations are the intermediate computation results at each layer — the hidden states, attention scores, FFN intermediate values. They're temporary: allocated during the forward pass, used once, then freed. Attention masks prevent the model from attending to future tokens (causal masking) or padding tokens. Both are created at runtime and never stored in the model file. Activation memory peaks during the FFN layer (14,336-dimensional intermediate).
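Causal masking is simple enough to show in a few lines. A toy sketch of the mask a framework builds at runtime (real implementations use a GPU tensor of −inf values added to attention scores, not Python lists):

```python
# Causal mask for seq_len tokens: position i may attend to positions 0..i
# only. Built at runtime, never stored in the model file.
def causal_mask(seq_len):
    # mask[i][j] is True where attention is allowed (j <= i)
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print("".join("x" if ok else "." for ok in row))
# x...
# xx..
# xxx.
# xxxx
```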
Memory Categories
// Total GPU memory during inference:
1. Model weights  // Loaded from file        ~16 GB (8B, BF16)
2. KV cache       // Grows with seq_len      128 KB/token
3. Activations    // Temporary per layer     ~200-500 MB peak
4. Framework      // CUDA, PyTorch overhead  ~1-2 GB

// Only #1 comes from the file
// #2-4 are created at runtime
Key insight: When people say "this model needs 24 GB of VRAM," they mean weights + KV cache + activations + overhead. The model file alone tells you only the weight size — you need to add runtime memory based on your expected sequence length and batch size.
The Memory Budget Calculator
How to estimate total GPU memory for any model
Step-by-Step Estimation
Step 1: Model weights = params × bytes_per_param
Step 2: KV cache = 2 × layers × kv_heads × head_dim × seq_len × bytes × batch_size
Step 3: Activations ≈ 2 × intermediate_size × seq_len × bytes (the gate and up projections peak during the FFN)
Step 4: Framework overhead ≈ 1-2 GB
Total: Sum of all four. If total > GPU VRAM, you need quantization, model parallelism, or a bigger GPU.
Example: Llama 3.1 8B, 8K Context
// Llama 3.1 8B, BF16, 8K ctx, batch=1:
Weights:     8.03B × 2    = 16.1 GB
KV cache:    128KB × 8192 =  1.0 GB
Activations:              ~ 0.4 GB
Overhead:                 ~ 1.5 GB
────────────────────────────────
Total:                    ~19 GB
// → Fits on a 24 GB GPU (A10, 4090)

// Same model, Q4 quantized:
Weights:     8.03B × 0.5  =  4.0 GB
KV cache:    (same)          1.0 GB
Total:                     ~6.9 GB
// → Fits on an 8 GB GPU
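The four-step budget can be sketched as one function. This is a rough estimator, not a framework API: the function name is illustrative, it reports decimal GB, and the activation term assumes the gate/up projections during the FFN dominate the peak:

```python
def vram_estimate_gb(params_b, cfg, seq_len, bytes_per_param=2,
                     batch=1, overhead_gb=1.5):
    """Rough VRAM estimate in decimal GB: weights + KV + activations + overhead."""
    GB = 1e9
    weights = params_b * 1e9 * bytes_per_param / GB
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    kv = (2 * cfg["num_hidden_layers"] * cfg["num_key_value_heads"]
          * head_dim * seq_len * bytes_per_param * batch) / GB
    # assumption: FFN gate+up intermediates dominate the activation peak
    acts = 2 * cfg["intermediate_size"] * seq_len * bytes_per_param / GB
    return weights + kv + acts + overhead_gb

llama_8b = {"hidden_size": 4096, "num_attention_heads": 32,
            "num_key_value_heads": 8, "num_hidden_layers": 32,
            "intermediate_size": 14336}
print(round(vram_estimate_gb(8.03, llama_8b, 8192), 1))  # ~19.1 GB (BF16, 8K ctx)
```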
Config Fields Decoded: The Complete Reference
Every important field and what it means
Architecture Fields
"architectures": ["LlamaForCausalLM"] // Which model class to instantiate "model_type": "llama" // Architecture family identifier "torch_dtype": "bfloat16" // Native precision of weights "tie_word_embeddings": false // embed_tokens and lm_head separate "rope_theta": 500000.0 // Controls context length capability "rms_norm_eps": 1e-05 // Epsilon for numerical stability
Dimension Fields
"hidden_size": 4096 // Width of the residual stream "num_hidden_layers": 32 // Depth: how many transformer blocks "num_attention_heads": 32 // Q heads → head_dim = 4096/32 = 128 "num_key_value_heads": 8 // KV heads (GQA) → K/V are [1024, 4096] "intermediate_size": 14336 // FFN width → gate/up: [14336, 4096] "vocab_size": 128256 // Embedding rows → must match tokenizer "max_position_embeddings": 131072 // Maximum supported context length
Course Complete: You Can Now Read Any LLM File
From bytes on disk to a working transformer — the full map
What You Now Know
Ch 1: An LLM file = metadata + tokenizer + weights (99.98%)
Ch 2: Three formats: Safetensors (safe, fast), GGUF (self-contained), PyTorch (legacy, risky)
Ch 3: Embedding = [vocab, hidden] lookup table, token IDs → vectors
Ch 4: Attention = Q, K, V, O projections with GQA shrinking K/V
Ch 5: FFN = SwiGLU gate/up/down, 65% of all parameters
Ch 6: Special tensors: RMSNorm, RoPE (computed), lm_head, MoE
Ch 7: Tokenizer = BPE vocab + merges + chat template contract
Ch 8: config.json is the DNA; KV cache grows with sequence length
The Complete File Map
// Every file in an LLM download:
config.json                   // DNA
generation_config.json        // Defaults
tokenizer.json                // Dictionary
tokenizer_config.json         // Chat template
model.safetensors.index.json  // Shard map
model-0000N.safetensors       // Weights

// Not in the file but in your GPU:
KV cache     // 128 KB/token
Activations  // Temporary
Key insight: You can now open any LLM file, read its config, inspect its tensors, estimate its memory footprint, verify its tokenizer compatibility, and understand exactly what every byte is doing. You've completed the anatomy course.