Ch 3 — Quantization: Shrinking Without Breaking

How FP32 becomes INT4, what GGUF files contain, and choosing the right quantization level
What Is Quantization?
Reducing the precision of model weights to make them smaller and faster
The Core Idea
A neural network is, at its core, millions (or billions) of numbers called weights. During training, these weights are stored as 32-bit floating-point numbers (FP32) — very precise, but very large.

Quantization converts these weights to lower precision: 16-bit, 8-bit, or even 4-bit. Less precision = smaller file = less RAM = faster inference.

Think of it like image compression: a raw photo is 25MB, a JPEG is 2MB. You lose some detail, but for most purposes, the JPEG is “good enough.”
The Math
7 billion weights × precision = file size

FP32 (32 bits): 7B × 4 bytes   = 28 GB
FP16 (16 bits): 7B × 2 bytes   = 14 GB
INT8 (8 bits):  7B × 1 byte    = 7 GB
INT4 (4 bits):  7B × 0.5 bytes = 3.5 GB

From 28 GB to 3.5 GB — an 8x reduction. That's the difference between needing a $10,000 server and a $1,000 laptop.
Key insight: Quantization is the single most important technique for running models locally. Without it, a 7B model needs 28GB of RAM. With 4-bit quantization, it needs 3.5–4GB. This is what makes local AI possible on consumer hardware.
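The arithmetic above is easy to check yourself. This small sketch reproduces the chapter's back-of-envelope sizes (the function name is ours, not from any library):

```python
# Rough file-size estimate: parameter count × bits per weight / 8.
# These are decimal GB, matching the chapter's figures.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

Note that real quantized files run slightly larger than this (K-quants store per-block scale factors alongside the weights, which is why Q4_K_M averages ~4.5 bits per weight, not 4.0).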
The Number Line Analogy
Imagine a ruler with fewer tick marks — you lose precision but keep the general shape
FP32: The Precise Ruler
Imagine a ruler from -1.0 to 1.0 with 4 billion tick marks. You can represent any number with extreme precision:

0.123456789012345...

This is FP32. Every weight in the model is stored with this precision. It’s like measuring with a micrometer when you only need a tape measure.
INT4: The Rough Ruler
Now imagine the same ruler with only 16 tick marks (-8 to +7). Every weight must snap to the nearest tick:

0.123... → rounds to 0.125

You lose the fine detail, but the overall pattern of weights is preserved. The model still “knows” the same things — it just expresses them with less precision.
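The "snap to the nearest tick" step can be written in a few lines. This is a simplified sketch of per-tensor 4-bit rounding — the 0.125 step size is a hypothetical scale chosen to match the example above, not what llama.cpp actually computes:

```python
# Snap a weight to a 16-level (4-bit) grid, as in the ruler analogy.
# scale is the distance between adjacent ticks (illustrative value).
def quantize_int4(w: float, scale: float = 0.125) -> int:
    q = round(w / scale)
    return max(-8, min(7, q))  # clamp to the 16 ticks: -8..+7

def dequantize(q: int, scale: float = 0.125) -> float:
    return q * scale

q = quantize_int4(0.123)
print(q, dequantize(q))  # 1 0.125 — the weight snapped to the nearest tick
```

The round trip 0.123 → 1 → 0.125 is exactly the rounding error the section describes: small, bounded by half a tick.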
Why It Works
Neural networks are robust to noise. During training, weights are constantly being nudged by tiny amounts. The model learns to be resilient to small perturbations.

Quantization introduces small rounding errors — but the model was already trained to handle noise. As long as the rounding errors are small relative to the weight values, the model’s behavior barely changes.

This is why a 4-bit model can retain 90–98% of the original quality: the important information is in the pattern of weights, not in the 12th decimal place.
Key insight: Quantization works because neural networks store information in patterns across millions of weights, not in individual weight precision. Rounding each weight slightly changes nothing meaningful — like how rounding every pixel in a photo by 1 shade doesn’t change what the photo looks like.
PTQ vs QAT: Two Approaches
Post-Training Quantization (quick) vs Quantization-Aware Training (better but expensive)
Post-Training Quantization (PTQ)
How it works:
1. Take a fully trained FP32 model
2. Convert weights to lower precision
3. Done — no retraining needed

Pros:
✓ Fast (minutes, not days)
✓ No training data needed
✓ Anyone can do it (just run a tool)

Cons:
✗ Slightly more quality loss
✗ Can't adapt to quantization errors

This is what llama.cpp does. This is what you'll use 99% of the time.
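The simplest form of PTQ is absmax rounding: scale the tensor so its largest magnitude fits the integer range, then round. This is a naive sketch to show the core step — real tools (llama.cpp's quantizer, GPTQ) use considerably cleverer schemes:

```python
import numpy as np

# Naive post-training quantization of one weight tensor to INT8.
def ptq_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.27, 0.005, 0.9], dtype=np.float32)
q, s = ptq_int8(w)
err = np.abs(w - dequant(q, s)).max()   # rounding error, at most scale/2
```

No training data, no gradients — just a rescale and a round. That is why PTQ takes minutes, and also why it can't compensate for the errors it introduces.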
Quantization-Aware Training (QAT)
How it works: 1. During training, simulate quantization 2. Model learns to compensate for rounding errors 3. Produces weights that quantize better Pros: ✓ Less quality loss at same bit depth ✓ Model adapts to quantization Cons: ✗ Requires full training pipeline ✗ Expensive (GPU hours) ✗ Only model creators do this
GPTQ: A Middle Ground
GPTQ is a popular PTQ method that uses a small calibration dataset to minimize quantization error. It’s more accurate than naive rounding but doesn’t require full retraining. Many models on Hugging Face are available in GPTQ format.
Key insight: For local deployment, you’ll almost always use PTQ via llama.cpp or download pre-quantized GGUF files. QAT is what model creators (Meta, Google) do during training. You benefit from their QAT work when you download their models — then apply PTQ on top for your target bit depth.
The GGUF Format: Why It Won
A single file that contains everything needed to run a quantized model
What’s Inside a GGUF File
model-name-Q4_K_M.gguf
├── Header
│   ├── Magic number (GGUF)
│   ├── Version
│   └── Tensor count
├── Metadata
│   ├── Model architecture
│   ├── Tokenizer (vocabulary)
│   ├── Context length
│   ├── Quantization type
│   └── Chat template
└── Tensor Data
    └── All quantized weights

One file. Everything included. No separate tokenizer files, no config.json, no confusion.
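You can inspect that header yourself with nothing but the standard library. This sketch follows the GGUF spec's fixed header layout (4-byte magic, little-endian uint32 version, then two uint64 counts); the file path is a placeholder:

```python
import struct

# Peek at a GGUF header: magic, version, tensor count, metadata entry count.
def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))           # format version
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, kv_count
```

Everything after those counts is self-describing metadata key-value pairs — which is exactly the property that lets Ollama or LM Studio load a downloaded file with no side-car configs.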
Why GGUF Replaced GGML
GGML (the predecessor) had problems: no metadata, no tokenizer, required separate config files. GGUF (introduced August 2023) fixed everything:

Self-contained: One file = complete model
Metadata-rich: Architecture, tokenizer, chat template all embedded
Extensible: New fields can be added without breaking old readers
Cross-platform: Same file runs on Mac, Windows, Linux
Ecosystem: Ollama, llama.cpp, LM Studio, GPT4All all support it
Key insight: GGUF is to local AI what MP3 was to music: a universal format that just works everywhere. When you see a model on Hugging Face with “GGUF” in the name, you can download that single file and run it with Ollama or llama.cpp immediately. No setup, no configuration.
Q4_K_M vs Q5_K_M vs Q8_0
The three quantization levels you’ll actually use — and when to pick each
The K-Quant Family
Q4_K_M (4-bit, medium)
Bits per weight: ~4.5
Size (7B model): ~3.80 GB
Quality loss: 3-8%
Perplexity Δ: +0.0535
→ Recommended default for most users

Q5_K_M (5-bit, medium)
Bits per weight: ~5.1
Size (7B model): ~4.45 GB
Quality loss: 2-5%
Perplexity Δ: +0.0142
→ Quality-focused, if you have RAM

Q8_0 (8-bit)
Bits per weight: ~8.5
Size (7B model): ~6.70 GB
Quality loss: 1-3%
Perplexity Δ: +0.0004
→ Near-lossless, maximum quality
What the Names Mean
Q = Quantized
4/5/8 = Target bit depth
K = K-quant method (blockwise quantization with super-blocks)
M = Medium quality (vs S=Small/faster, L=Large/better)

K-quant formats use blockwise quantization: weights are grouped into blocks, and each block gets its own scale factor. This captures both local and global weight patterns, significantly improving accuracy over naive per-tensor quantization.
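Here is the blockwise idea in miniature — each block gets its own scale, so one outlier weight can no longer wreck precision for the whole tensor. The block size of 32 is illustrative; actual K-quants layer super-blocks and per-block minimums on top of this:

```python
import numpy as np

# Blockwise 4-bit quantization: per-block scales instead of one global scale.
def quantize_blockwise(w: np.ndarray, block: int = 32):
    w = w.reshape(-1, block)                                # group into blocks
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0     # one scale per block
    scales[scales == 0] = 1.0                               # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(w - dequantize_blockwise(q, s)).max()  # bounded by max scale / 2
```

Compare this with a single per-tensor scale: there, the block with the smallest weights would be forced onto the same coarse grid as the block containing the largest outlier.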
Token Agreement with FP16
# How often the quantized model
# picks the same top token as FP16:

Q4_K_M: 88-92% agreement
Q5_K_M: 92-93% agreement
Q8_0:   99%+ agreement
Key insight: Q4_K_M is the sweet spot for 90% of use cases. The jump from Q4 to Q5 costs ~17% more RAM for ~4% less quality loss. The jump from Q5 to Q8 costs ~50% more RAM for ~2% less quality loss. Diminishing returns favor Q4_K_M unless you have RAM to spare.
Real Benchmarks: Quality vs Size vs Speed
Measured performance across quantization levels on consumer hardware
Llama-family 7B model — Quantization Comparison
Format   Size     PPL Δ     tok/s   RAM
FP16     13.5GB   base      25      16GB
Q8_0     6.7GB    +0.0004   40      8GB
Q5_K_M   4.5GB    +0.0142   52      6GB
Q4_K_M   3.8GB    +0.0535   60      5GB
Q4_K_S   3.6GB    +0.0800   62      5GB
Q3_K_M   2.9GB    +0.1500   68      4GB
Q2_K     2.1GB    +0.8000   75      3GB

PPL Δ = perplexity increase (lower = better)
tok/s = tokens/second on M2 Pro (CPU)
RAM = approximate runtime memory
The Quality Cliff
Notice the pattern: quality degrades gradually from Q8 to Q4, then falls off a cliff at Q3 and below.

Q4_K_M → Q3_K_M: Perplexity jumps 3x (0.05 → 0.15). Noticeable quality drop in generation tasks.

Q3_K_M → Q2_K: Perplexity jumps 5x (0.15 → 0.80). Model becomes unreliable. Frequent nonsense output.

Rule of thumb: Don’t go below Q4 for generation tasks. Q3 is acceptable only for classification/extraction where you’re parsing structured output.
Key insight: Quantization has a “sweet zone” between Q4 and Q8 where you get massive size reduction with minimal quality loss. Below Q4, quality degrades rapidly. Above Q8, you’re paying for precision the model doesn’t need. Stay in the sweet zone.
RAM Requirements Guide
How much memory you need for each model size and quantization level
RAM = Model Size + Context + Overhead
Model size (Q4_K_M):
1B  → ~0.8 GB
3B  → ~2.0 GB
7B  → ~3.8 GB
9B  → ~5.5 GB
14B → ~8.5 GB
24B → ~14 GB
70B → ~40 GB

Context window overhead:
4K context:   ~0.5 GB
8K context:   ~1.0 GB
32K context:  ~3.0 GB
128K context: ~10 GB

System overhead: ~0.5-1.0 GB
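The formula is simple enough to encode directly. This estimator uses the approximate figures from the tables above (all numbers are rough, and the lookup tables are ours, not from any tool):

```python
# Back-of-envelope RAM estimate: model size + context overhead + system overhead.
MODEL_GB_Q4 = {"1B": 0.8, "3B": 2.0, "7B": 3.8, "9B": 5.5,
               "14B": 8.5, "24B": 14.0, "70B": 40.0}
CONTEXT_GB = {4096: 0.5, 8192: 1.0, 32768: 3.0, 131072: 10.0}

def ram_needed_gb(model: str, context: int, overhead: float = 1.0) -> float:
    return MODEL_GB_Q4[model] + CONTEXT_GB[context] + overhead

print(ram_needed_gb("7B", 32768))  # 3.8 + 3.0 + 1.0 ≈ 7.8 GB
```

Running it for a 7B Q4 model at 32K context gives roughly 7-8 GB — the hidden context cost the next section warns about, and nearly double the 3.8 GB file size alone.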
What Fits Where
8 GB RAM (MacBook Air M2)
✓ 3B Q4 + 8K context
✓ 7B Q4 + 4K context (tight)
✗ 9B anything

16 GB RAM (MacBook Pro M2/M3)
✓ 7B Q5 + 8K context
✓ 9B Q4 + 8K context
✓ 14B Q4 + 4K context (tight)

24 GB VRAM (RTX 4090)
✓ 14B Q5 + 8K context
✓ 24B Q4 + 4K context

32 GB RAM (Mac Studio M2 Max)
✓ 24B Q5 + 8K context
✓ 14B Q8 + 32K context
Key insight: Context window size is a hidden RAM cost that catches people off guard. A 7B Q4 model is 3.8GB, but with 32K context it needs 7GB total. If you’re doing RAG with long documents, factor in context window RAM. For short tasks (classification, extraction), use a small context to save memory.
Picking Your Quantization Level
A decision tree for choosing the right format
Decision Tree
Q: How much RAM do you have?

Tight (model barely fits):
→ Q4_K_M — best quality at minimum size

Comfortable (25%+ headroom):
→ Q5_K_M — noticeable quality bump

Plenty (50%+ headroom):
→ Q8_0 — near-lossless, why not?

Q: What's the task?

Classification / extraction:
→ Q4_K_M is plenty (even Q3 works)

Summarization / chat:
→ Q4_K_M minimum, Q5_K_M preferred

Creative writing / nuanced text:
→ Q5_K_M minimum, Q8_0 preferred

Code generation:
→ Q5_K_M minimum (precision matters for syntax correctness)
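The decision tree collapses into a short function. The headroom thresholds and task labels below mirror the tree above; they are illustrative conventions, not a standard API:

```python
# Pick a quantization level from free-RAM headroom (fraction beyond the Q4
# model size) and task type. Mirrors the decision tree, not any tool's logic.
def pick_quant(headroom: float, task: str) -> str:
    quality_sensitive = task in ("creative", "code")
    if headroom >= 0.50:
        return "Q8_0"                                  # plenty: near-lossless
    if headroom >= 0.25 or quality_sensitive:
        return "Q5_K_M"                                # comfortable, or precision matters
    return "Q4_K_M"                                    # tight: best quality at min size

print(pick_quant(0.10, "chat"))      # Q4_K_M
print(pick_quant(0.30, "creative"))  # Q5_K_M
print(pick_quant(0.60, "chat"))      # Q8_0
```

Note the one wrinkle: for creative writing and code, the function upgrades to Q5_K_M even when RAM is tight, matching the tree's "Q5_K_M minimum" for those tasks.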
Quick Reference
Default choice: Q4_K_M. It’s the most popular for a reason — best balance of size, speed, and quality.

If quality matters more: Q5_K_M. ~17% more RAM for measurably better output.

If RAM is no issue: Q8_0. Near-lossless. Only 2x the size of Q4.

Never go below Q4 for generation tasks. Q3 and Q2 are only for extreme constraints (mobile, IoT) on simple tasks.
Key insight: When in doubt, start with Q4_K_M. If the output quality isn’t good enough for your task, try Q5_K_M. If it’s still not enough, the problem is probably the model size (try a larger model), not the quantization level. Going from Q4 to Q8 on a 3B model won’t make it as smart as a Q4 9B model.