Ch 3 — Architecture & Parameters

Sizing up a model — what the numbers actually mean for your hardware and use case
High Level: Params → Dense/MoE → Attention → Context → Config → Memory
What “7B” Actually Means
Parameter count — the number that dominates every model name
The Basics
When you see “Llama-3.1-8B,” the 8B means 8 billion parameters — 8 billion individual numbers (weights) that the model learned during training. More parameters generally means more capacity to store knowledge and handle complex reasoning. The common sizes you’ll encounter: 1B–3B (edge/mobile), 7B–8B (single GPU sweet spot), 13B–14B (strong single-GPU), 32B–70B (multi-GPU or quantized), 405B+ (cluster-scale).
The Practical Implication
Parameter count directly determines how much memory (VRAM/RAM) you need. Each parameter at full precision (FP16) takes 2 bytes. So an 8B model needs ~16GB of VRAM just for the weights, plus overhead for KV-cache and activations. A 70B model needs ~140GB at FP16 — that’s multiple high-end GPUs. This is why quantization exists: an 8B model at 4-bit precision needs only ~4GB.
Key insight: Parameter count is like engine displacement in cars. A bigger engine has more power, but it also burns more fuel and won’t fit in every chassis. A 7B model on a laptop can be more useful than a 70B model you can’t run.
Dense vs. Mixture-of-Experts (MoE)
Why “47B parameters” doesn’t always mean what you think
Dense Models
In a dense model, every parameter is used for every input token. Llama 3.1 8B is dense — all 8 billion parameters fire on every forward pass. This is simple but expensive at scale: double the parameters, double the compute.
MoE Models
In a Mixture-of-Experts model, the parameters are split into “expert” sub-networks, and a router selects only a few experts per token. Mixtral 8x7B has 46.7B total parameters but only ~12.9B are active per token (2 of 8 experts). DeepSeek-R1 has 671B total but only ~37B active. The card may report 47B total, but the model runs like a 13B.
How to Spot It on a Card
Look for keywords: “Mixture of Experts,” “MoE,” “8x7B” (the “x” is a giveaway), or in config.json you’ll see fields like num_experts and num_experts_per_tok. If the card reports both “total parameters” and “active parameters,” it’s MoE. Memory requirements are based on total parameters (all experts must be loaded), but inference speed relates to active parameters.
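The config.json check above can be sketched as a small heuristic. This is a minimal sketch, assuming Hugging Face-style field names; spellings vary between model families (Mixtral, for instance, uses num_local_experts rather than num_experts), so the function probes a few common ones:

```python
def is_moe(config: dict) -> bool:
    """Heuristic MoE check: look for expert-count fields in a config dict.

    Field names vary between model families, so probe common spellings.
    """
    expert_keys = ("num_experts", "num_local_experts", "n_routed_experts")
    return any(config.get(k, 0) > 1 for k in expert_keys)

# Illustrative fragments, not full config files:
mixtral_cfg = {"num_local_experts": 8, "num_experts_per_tok": 2}
llama_cfg = {"num_attention_heads": 32, "num_key_value_heads": 8}

print(is_moe(mixtral_cfg))  # True
print(is_moe(llama_cfg))    # False
```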
Key insight: Parameter count alone is misleading for MoE models. A 47B MoE model (Mixtral) runs at roughly the speed of a 13B dense model, because only ~12.9B parameters are active per token — while drawing on the knowledge stored across all 46.7B weights. Always ask: “Is this dense or MoE?”
Attention Types: MHA, GQA, MQA
The architectural choice that affects speed and memory during inference
What Attention Is
Every transformer model uses an attention mechanism with queries (Q), keys (K), and values (V). The attention type determines how these are organized, affecting speed and memory during long-context inference.
The Three Types
Multi-Head Attention (MHA): Each attention head has its own Q, K, V projections. The original Transformer design. Full quality but highest memory for KV-cache.

Grouped-Query Attention (GQA): Multiple query heads share a smaller set of K/V heads. Llama 3, Gemma 2, and Mistral use this. Reduces KV-cache memory significantly with minimal quality loss.

Multi-Query Attention (MQA): All query heads share a single K/V head. Maximum speed, but slightly lower quality. Used in some older models like Falcon.
Where to Find It
In config.json, compare num_attention_heads (query heads) with num_key_value_heads (KV heads). If they’re equal, it’s MHA. If KV is smaller (e.g., 32 query heads, 8 KV heads), it’s GQA. If KV is 1, it’s MQA. Most modern LLMs use GQA as the sweet spot.
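The comparison rule reduces to a three-way check. A minimal sketch, using the two config.json field names mentioned above:

```python
def attention_type(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention variant from the two head counts in config.json."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own K/V projections
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # query heads share a smaller group of K/V heads

print(attention_type(32, 32))  # MHA (original Transformer layout)
print(attention_type(32, 8))   # GQA (Llama-3-style)
print(attention_type(32, 1))   # MQA (Falcon-style)
```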
Key insight: GQA is now the de facto standard. If you see a model with GQA, it will handle long contexts more efficiently than an MHA model of the same size. This matters most when you’re processing documents, not short prompts.
Context Length
How much text the model can see at once — and why bigger isn’t always better
What It Means
Context length (or context window) is the maximum number of tokens the model can process in a single input+output. Common values: 2K–4K (older models), 8K (GPT-3.5 era), 32K–128K (modern LLMs like Llama 3.1, Gemma 2), 200K–1M+ (Gemini, Claude). A token is roughly 3/4 of a word in English, so 128K tokens ≈ 96,000 words ≈ a full novel.
The Tradeoffs
Longer context means you can feed in more documents, longer conversations, or entire codebases. But: memory usage grows with context length (the KV-cache), quality can degrade at the edges of very long contexts (the “lost in the middle” problem), and cost scales linearly with input tokens. A model that advertises 128K context may perform great at 32K but poorly at 120K. Look for the card’s stated “effective context” or needle-in-a-haystack test results.
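The KV-cache growth mentioned above can be estimated directly: the cache holds a key and a value vector per token, per layer, per KV head. A back-of-envelope sketch, assuming Llama-3.1-8B-style shapes (32 layers, 8 KV heads, head dimension 128, FP16 cache) — real runtimes add their own overhead:

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Rough KV-cache size for one sequence: a K and a V tensor per layer."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

print(round(kv_cache_gb(32, 8, 128, 8_192), 2))    # 1.0  -- ~1 GB at 8K context
print(round(kv_cache_gb(32, 8, 128, 131_072), 2))  # 16.0 -- ~16 GB at 128K
```

Note how the cost scales linearly with context length — this is why GQA's smaller KV-head count matters so much for long-context work.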
Key insight: Match the context length to your use case. Chatbot? 8K is fine. RAG with documents? 32K covers most needs. Full-codebase analysis? You need 128K+. Don’t pay the memory cost of 128K context if you only need 8K.
Reading config.json
The blueprint that defines every architectural decision
The Key Fields
```json
// From a typical LLM config.json
"hidden_size": 4096,               // Width of each layer
"num_hidden_layers": 32,           // Depth (number of layers)
"num_attention_heads": 32,         // Query heads
"num_key_value_heads": 8,          // KV heads (GQA)
"vocab_size": 128256,              // Tokenizer vocabulary
"max_position_embeddings": 131072, // Max context
"rope_theta": 500000.0,            // Rotary embedding base
"rms_norm_eps": 1e-05              // Normalization
```
What Each Field Tells You
hidden_size and num_hidden_layers together set the model’s capacity — parameter count grows roughly with num_hidden_layers × hidden_size², so deeper and wider both mean more parameters.

num_attention_heads vs num_key_value_heads = attention type (equal = MHA, different = GQA).

max_position_embeddings = the maximum context length. 131072 = 128K tokens.

rope_theta = the rotary position encoding base. Higher values (500K+) indicate models trained for long context. Lower values (10K) suggest shorter effective context.
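You can sanity-check a parameter count from these fields. A crude back-of-envelope sketch, assuming a vanilla transformer block (~12 × hidden² weights per layer for attention plus a 4×-wide MLP); real architectures (GQA, SwiGLU, untied embeddings) shift the constant somewhat:

```python
def rough_param_count(hidden_size, num_layers, vocab_size):
    """Back-of-envelope parameter estimate for a vanilla transformer.

    Assumes ~12 * hidden^2 weights per block; treat the result as a
    ballpark figure, not an exact count.
    """
    block_params = 12 * hidden_size**2 * num_layers
    embed_params = vocab_size * hidden_size
    return block_params + embed_params

# Using the config values shown above:
est = rough_param_count(hidden_size=4096, num_layers=32, vocab_size=128256)
print(f"{est / 1e9:.1f}B")  # 7.0B -- vs ~8B actual, the right ballpark
```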
Key insight: You don’t need to understand the math behind every field. The config.json is the model’s blueprint — scan it for the key numbers (layers, heads, context length, vocab size) and move on.
The Memory Estimation Rule of Thumb
Can you even run this model? A quick mental math formula
The Formula
VRAM (GB) ≈ Parameters (B) × Bytes per Parameter

FP32: 8B × 4 = 32GB
FP16/BF16: 8B × 2 = 16GB
INT8: 8B × 1 = 8GB
INT4 (Q4): 8B × 0.5 = 4GB

This is weights only. Add 20–30% for KV-cache, activations, and framework overhead. So an 8B model at FP16 realistically needs ~20GB, and at INT4 needs ~5–6GB. A 70B model at INT4 needs ~40GB — two 24GB GPUs or one 48GB GPU.
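The rule of thumb above fits in one line of code. A minimal sketch, using the 25% overhead figure from the text:

```python
def min_vram_gb(params_billion, bytes_per_param, overhead=0.25):
    """Weights-only memory plus ~25% for KV-cache, activations, framework."""
    return params_billion * bytes_per_param * (1 + overhead)

print(min_vram_gb(8, 2))     # 20.0  -- 8B at FP16
print(min_vram_gb(8, 0.5))   # 5.0   -- 8B at INT4
print(min_vram_gb(70, 0.5))  # 43.75 -- 70B at INT4, hence ~two 24GB GPUs
```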
Quick Reference
```
// Can I run it? Quick check:
7B   Q4    ~5GB     // Laptop GPU
7B   FP16  ~16GB    // RTX 4090
13B  Q4    ~8GB     // RTX 3080/4070
70B  Q4    ~40GB    // 2x RTX 4090
405B Q4    ~220GB   // Multi-node
```
Key insight: Before reading benchmarks or training data, check if you can even run the model. Parameters × bytes-per-param + 25% overhead = minimum VRAM. If it doesn’t fit, look for a quantized variant or a smaller model.
Size vs. Quality: The Frontier
When smaller models punch above their weight
The Scaling Reality
Bigger is generally better, but the relationship is logarithmic, not linear. Going from 7B to 70B (10x more parameters) doesn’t give you 10x better results — it might give you 15–25% improvement on benchmarks. Meanwhile, a well-fine-tuned 7B model can outperform a generic 70B model on a specific task. Current-generation “small” models (Llama 3.2 3B, Phi-4 14B, Gemma 3 4B) often match or beat previous-generation large models.
What to Compare
When choosing between model sizes, compare: benchmarks per GB of VRAM (efficiency), not raw benchmark scores. A 7B model scoring 75% on MMLU while needing 5GB is often more practical than a 70B model scoring 85% while needing 40GB. The question isn’t “which is smarter?” but “which gives me the best result I can actually run?”
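The efficiency comparison can be made concrete with the illustrative numbers from the text (75% MMLU at 5GB vs. 85% at 40GB — hypothetical figures, not real benchmark results):

```python
def score_per_gb(benchmark_score, vram_gb):
    """Efficiency metric: benchmark points per GB of VRAM required."""
    return benchmark_score / vram_gb

print(score_per_gb(75, 5))   # 15.0  -- 7B at Q4
print(score_per_gb(85, 40))  # 2.125 -- 70B at Q4
```

By this measure the smaller model delivers roughly 7x the benchmark value per GB — which is the point: raw scores hide the hardware cost.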
Key insight: Always ask: “Is the next size up worth 4x more VRAM and 3x slower inference?” For most production use cases, the answer is no. The sweet spot for most teams is the largest model that comfortably fits their hardware budget.
Your Architecture Checklist
The five questions to answer from the architecture section of any card
The Five Questions
1. How many parameters? Check if total vs. active differs (MoE).

2. Dense or MoE? Look for “8x7B” naming or num_experts in config.json.

3. What’s the context length? Check max_position_embeddings and match to your use case.

4. Will it fit my hardware? Params × bytes-per-param + 25% overhead.

5. What attention type? GQA = modern and efficient. MHA = older, more memory-hungry.
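The five questions above can be answered mechanically from a config.json-style dict. A minimal sketch, assuming common Hugging Face field names (some families use other spellings, e.g. num_local_experts for MoE):

```python
def architecture_summary(config: dict) -> dict:
    """Answer the checklist questions from a config.json-style dict."""
    q_heads = config.get("num_attention_heads", 0)
    kv_heads = config.get("num_key_value_heads", q_heads)
    experts = config.get("num_experts", config.get("num_local_experts", 0))
    if kv_heads == q_heads:
        attn = "MHA"
    elif kv_heads == 1:
        attn = "MQA"
    else:
        attn = "GQA"
    return {
        "dense_or_moe": "MoE" if experts > 1 else "dense",
        "attention": attn,
        "context_length": config.get("max_position_embeddings"),
    }

cfg = {"num_attention_heads": 32, "num_key_value_heads": 8,
       "max_position_embeddings": 131072}
print(architecture_summary(cfg))
# {'dense_or_moe': 'dense', 'attention': 'GQA', 'context_length': 131072}
```

Question 4 (will it fit?) still needs your hardware numbers: apply params × bytes-per-param + 25% overhead to the total parameter count.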
Connecting to Practical Needs
Architecture specs only matter in relation to your constraints. A chatbot for customer support might need a 7B model with 8K context. A code assistant might need a 32B model with 128K context. A document analysis pipeline might need a 70B model with 32K context. Let your use case drive the spec requirements, not the other way around.
Key insight: Architecture is not about finding the “best” model. It’s about finding the best model you can run for your specific task. The best model in the world is useless if it won’t fit on your hardware.