Ch 3 — Quantization: Shrinking Without Breaking

How FP32 becomes INT4, what GGUF files contain, and choosing the right quantization level
What Is Quantization?
Reducing the precision of model weights to make them smaller and faster
The Core Idea
A neural network is, at its core, millions (or billions) of numbers called weights. During training, these weights are stored as 32-bit floating-point numbers (FP32) — very precise, but very large.

Quantization converts these weights to lower precision: 16-bit, 8-bit, or even 4-bit. Less precision = smaller file = less RAM = faster inference.

Think of it like image compression: a raw photo is 25MB, a JPEG is 2MB. You lose some detail, but for most purposes, the JPEG is “good enough.”
The Math
7 billion weights × precision = file size

FP32 (32 bits): 7B × 4 bytes   = 28 GB
FP16 (16 bits): 7B × 2 bytes   = 14 GB
INT8 (8 bits):  7B × 1 byte    = 7 GB
INT4 (4 bits):  7B × 0.5 bytes = 3.5 GB

From 28 GB to 3.5 GB — an 8x reduction. That's the difference between needing a $10,000 server and a $1,000 laptop.
Key insight: Quantization is the single most important technique for running models locally. Without it, a 7B model needs 28GB of RAM. With 4-bit quantization, it needs 3.5–4GB. This is what makes local AI possible on consumer hardware.
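The arithmetic above is easy to check yourself. This small sketch reproduces the chapter's back-of-envelope sizes (the function name is ours, not from any library):

```python
# Rough file-size estimate: parameter count × bits per weight / 8.
# These are decimal GB, matching the chapter's figures.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

Note that real quantized files run slightly larger than this (K-quants store per-block scale factors alongside the weights, which is why Q4_K_M averages ~4.5 bits per weight, not 4.0).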
The Number Line Analogy
Imagine a ruler with fewer tick marks — you lose precision but keep the general shape
FP32: The Precise Ruler
Imagine a ruler from -1.0 to 1.0 with 4 billion tick marks. You can represent any number with extreme precision:

0.123456789012345...

This is FP32. Every weight in the model is stored with this precision. It’s like measuring with a micrometer when you only need a tape measure.
INT4: The Rough Ruler
Now imagine the same ruler with only 16 tick marks (-8 to +7). Every weight must snap to the nearest tick:

0.123... → rounds to 0.125

You lose the fine detail, but the overall pattern of weights is preserved. The model still “knows” the same things — it just expresses them with less precision.
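The "snap to the nearest tick" step can be written in a few lines. This is a simplified sketch of per-tensor 4-bit rounding — the 0.125 step size is a hypothetical scale chosen to match the example above, not what llama.cpp actually computes:

```python
# Snap a weight to a 16-level (4-bit) grid, as in the ruler analogy.
# scale is the distance between adjacent ticks (illustrative value).
def quantize_int4(w: float, scale: float = 0.125) -> int:
    q = round(w / scale)
    return max(-8, min(7, q))  # clamp to the 16 ticks: -8..+7

def dequantize(q: int, scale: float = 0.125) -> float:
    return q * scale

q = quantize_int4(0.123)
print(q, dequantize(q))  # 1 0.125 — the weight snapped to the nearest tick
```

The round trip 0.123 → 1 → 0.125 is exactly the rounding error the section describes: small, bounded by half a tick.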
Why It Works
Neural networks are robust to noise. During training, weights are constantly being nudged by tiny amounts. The model learns to be resilient to small perturbations.

Quantization introduces small rounding errors — but the model was already trained to handle noise. As long as the rounding errors are small relative to the weight values, the model’s behavior barely changes.

This is why a 4-bit model can retain 90–98% of the original quality: the important information is in the pattern of weights, not in the 12th decimal place.
Key insight: Quantization works because neural networks store information in patterns across millions of weights, not in individual weight precision. Rounding each weight slightly changes nothing meaningful — like how rounding every pixel in a photo by 1 shade doesn’t change what the photo looks like.
PTQ vs QAT: Two Approaches
Post-Training Quantization (quick) vs Quantization-Aware Training (better but expensive)
Post-Training Quantization (PTQ)
How it works:
1. Take a fully trained FP32 model
2. Convert weights to lower precision
3. Done — no retraining needed

Pros:
✓ Fast (minutes, not days)
✓ No training data needed
✓ Anyone can do it (just run a tool)

Cons:
✗ Slightly more quality loss
✗ Can't adapt to quantization errors

This is what llama.cpp does. This is what you'll use 99% of the time.
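The simplest form of PTQ is absmax rounding: scale the tensor so its largest magnitude fits the integer range, then round. This is a naive sketch to show the core step — real tools (llama.cpp's quantizer, GPTQ) use considerably cleverer schemes:

```python
import numpy as np

# Naive post-training quantization of one weight tensor to INT8.
def ptq_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.27, 0.005, 0.9], dtype=np.float32)
q, s = ptq_int8(w)
err = np.abs(w - dequant(q, s)).max()   # rounding error, at most scale/2
```

No training data, no gradients — just a rescale and a round. That is why PTQ takes minutes, and also why it can't compensate for the errors it introduces.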
Quantization-Aware Training (QAT)
How it works: 1. During training, simulate quantization 2. Model learns to compensate for rounding errors 3. Produces weights that quantize better Pros: ✓ Less quality loss at same bit depth ✓ Model adapts to quantization Cons: ✗ Requires full training pipeline ✗ Expensive (GPU hours) ✗ Only model creators do this
GPTQ: A Middle Ground
GPTQ is a popular PTQ method that uses a small calibration dataset to minimize quantization error. It’s more accurate than naive rounding but doesn’t require full retraining. Many models on Hugging Face are available in GPTQ format.
Key insight: For local deployment, you’ll almost always use PTQ via llama.cpp or download pre-quantized GGUF files. QAT is what model creators (Meta, Google) do during training. You benefit from their QAT work when you download their models — then apply PTQ on top for your target bit depth.
The GGUF Format: Why It Won
A single file that contains everything needed to run a quantized model
What’s Inside a GGUF File
model-name-Q4_K_M.gguf
├── Header
│   ├── Magic number (GGUF)
│   ├── Version
│   └── Tensor count
├── Metadata
│   ├── Model architecture
│   ├── Tokenizer (vocabulary)
│   ├── Context length
│   ├── Quantization type
│   └── Chat template
└── Tensor Data
    └── All quantized weights

One file. Everything included. No separate tokenizer files, no config.json, no confusion.
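You can inspect that header yourself with nothing but the standard library. This sketch follows the GGUF spec's fixed header layout (4-byte magic, little-endian uint32 version, then two uint64 counts); the file path is a placeholder:

```python
import struct

# Peek at a GGUF header: magic, version, tensor count, metadata entry count.
def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))           # format version
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, kv_count
```

Everything after those counts is self-describing metadata key-value pairs — which is exactly the property that lets Ollama or LM Studio load a downloaded file with no side-car configs.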
Why GGUF Replaced GGML
GGML (the predecessor) had problems: no metadata, no tokenizer, required separate config files. GGUF (introduced August 2023) fixed everything:

Self-contained: One file = complete model
Metadata-rich: Architecture, tokenizer, chat template all embedded
Extensible: New fields can be added without breaking old readers
Cross-platform: Same file runs on Mac, Windows, Linux
Ecosystem: Ollama, llama.cpp, LM Studio, GPT4All all support it
Key insight: GGUF is to local AI what MP3 was to music: a universal format that just works everywhere. When you see a model on Hugging Face with “GGUF” in the name, you can download that single file and run it with Ollama or llama.cpp immediately. No setup, no configuration.
Q4_K_M vs Q5_K_M vs Q8_0
The three quantization levels you’ll actually use — and when to pick each
The K-Quant Family
Q4_K_M (4-bit, medium)
Bits per weight: ~4.5
Size (7B model): ~3.80 GB
Quality loss: 3-8%
Perplexity Δ: +0.0535
→ Recommended default for most users

Q5_K_M (5-bit, medium)
Bits per weight: ~5.1
Size (7B model): ~4.45 GB
Quality loss: 2-5%
Perplexity Δ: +0.0142
→ Quality-focused, if you have RAM

Q8_0 (8-bit)
Bits per weight: ~8.5
Size (7B model): ~6.70 GB
Quality loss: 1-3%
Perplexity Δ: +0.0004
→ Near-lossless, maximum quality
What the Names Mean
Q = Quantized
4/5/8 = Target bit depth
K = K-quant method (blockwise quantization with super-blocks)
M = Medium quality (vs S=Small/faster, L=Large/better)

K-quant formats use blockwise quantization: weights are grouped into blocks, and each block gets its own scale factor. This captures both local and global weight patterns, significantly improving accuracy over naive per-tensor quantization.
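Here is the blockwise idea in miniature — each block gets its own scale, so one outlier weight can no longer wreck precision for the whole tensor. The block size of 32 is illustrative; actual K-quants layer super-blocks and per-block minimums on top of this:

```python
import numpy as np

# Blockwise 4-bit quantization: per-block scales instead of one global scale.
def quantize_blockwise(w: np.ndarray, block: int = 32):
    w = w.reshape(-1, block)                                # group into blocks
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0     # one scale per block
    scales[scales == 0] = 1.0                               # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(w - dequantize_blockwise(q, s)).max()  # bounded by max scale / 2
```

Compare this with a single per-tensor scale: there, the block with the smallest weights would be forced onto the same coarse grid as the block containing the largest outlier.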
Token Agreement with FP16
# How often the quantized model
# picks the same top token as FP16:

Q4_K_M: 88-92% agreement
Q5_K_M: 92-93% agreement
Q8_0:   99%+ agreement
Key insight: Q4_K_M is the sweet spot for 90% of use cases. The jump from Q4 to Q5 costs ~17% more RAM for ~4% less quality loss. The jump from Q5 to Q8 costs ~50% more RAM for ~2% less quality loss. Diminishing returns favor Q4_K_M unless you have RAM to spare.
Real Benchmarks: Quality vs Size vs Speed
Measured performance across quantization levels on consumer hardware
Llama-family 7B model — Quantization Comparison
Format   Size     PPL Δ     tok/s   RAM
FP16     13.5GB   base      25      16GB
Q8_0     6.7GB    +0.0004   40      8GB
Q5_K_M   4.5GB    +0.0142   52      6GB
Q4_K_M   3.8GB    +0.0535   60      5GB
Q4_K_S   3.6GB    +0.0800   62      5GB
Q3_K_M   2.9GB    +0.1500   68      4GB
Q2_K     2.1GB    +0.8000   75      3GB

PPL Δ = perplexity increase (lower = better)
tok/s = tokens/second on M2 Pro (CPU)
RAM = approximate runtime memory
The Quality Cliff
Notice the pattern: quality degrades gradually from Q8 to Q4, then falls off a cliff at Q3 and below.

Q4_K_M → Q3_K_M: Perplexity jumps 3x (0.05 → 0.15). Noticeable quality drop in generation tasks.

Q3_K_M → Q2_K: Perplexity jumps 5x (0.15 → 0.80). Model becomes unreliable. Frequent nonsense output.

Rule of thumb: Don’t go below Q4 for generation tasks. Q3 is acceptable only for classification/extraction where you’re parsing structured output.
Key insight: Quantization has a “sweet zone” between Q4 and Q8 where you get massive size reduction with minimal quality loss. Below Q4, quality degrades rapidly. Above Q8, you’re paying for precision the model doesn’t need. Stay in the sweet zone.
RAM Requirements Guide
How much memory you need for each model size and quantization level
RAM = Model Size + Context + Overhead
Model size (Q4_K_M):
1B  → ~0.8 GB
3B  → ~2.0 GB
7B  → ~3.8 GB
9B  → ~5.5 GB
14B → ~8.5 GB
24B → ~14 GB
70B → ~40 GB

Context window overhead:
4K context:   ~0.5 GB
8K context:   ~1.0 GB
32K context:  ~3.0 GB
128K context: ~10 GB

System overhead: ~0.5-1.0 GB
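The formula is simple enough to encode directly. This estimator uses the approximate figures from the tables above (all numbers are rough, and the lookup tables are ours, not from any tool):

```python
# Back-of-envelope RAM estimate: model size + context overhead + system overhead.
MODEL_GB_Q4 = {"1B": 0.8, "3B": 2.0, "7B": 3.8, "9B": 5.5,
               "14B": 8.5, "24B": 14.0, "70B": 40.0}
CONTEXT_GB = {4096: 0.5, 8192: 1.0, 32768: 3.0, 131072: 10.0}

def ram_needed_gb(model: str, context: int, overhead: float = 1.0) -> float:
    return MODEL_GB_Q4[model] + CONTEXT_GB[context] + overhead

print(ram_needed_gb("7B", 32768))  # 3.8 + 3.0 + 1.0 ≈ 7.8 GB
```

Running it for a 7B Q4 model at 32K context gives roughly 7-8 GB — the hidden context cost the next section warns about, and nearly double the 3.8 GB file size alone.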
What Fits Where
8 GB RAM (MacBook Air M2)
✓ 3B Q4 + 8K context
✓ 7B Q4 + 4K context (tight)
✗ 9B anything

16 GB RAM (MacBook Pro M2/M3)
✓ 7B Q5 + 8K context
✓ 9B Q4 + 8K context
✓ 14B Q4 + 4K context (tight)

24 GB VRAM (RTX 4090)
✓ 14B Q5 + 8K context
✓ 24B Q4 + 4K context

32 GB RAM (Mac Studio M2 Max)
✓ 24B Q5 + 8K context
✓ 14B Q8 + 32K context
Key insight: Context window size is a hidden RAM cost that catches people off guard. A 7B Q4 model is 3.8GB, but with 32K context it needs 7GB total. If you’re doing RAG with long documents, factor in context window RAM. For short tasks (classification, extraction), use a small context to save memory.
Picking Your Quantization Level
A decision tree for choosing the right format
Decision Tree
Q: How much RAM do you have?

Tight (model barely fits):
→ Q4_K_M — best quality at minimum size

Comfortable (25%+ headroom):
→ Q5_K_M — noticeable quality bump

Plenty (50%+ headroom):
→ Q8_0 — near-lossless, why not?

Q: What's the task?

Classification / extraction:
→ Q4_K_M is plenty (even Q3 works)

Summarization / chat:
→ Q4_K_M minimum, Q5_K_M preferred

Creative writing / nuanced text:
→ Q5_K_M minimum, Q8_0 preferred

Code generation:
→ Q5_K_M minimum (precision matters for syntax correctness)
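The decision tree collapses into a short function. The headroom thresholds and task labels below mirror the tree above; they are illustrative conventions, not a standard API:

```python
# Pick a quantization level from free-RAM headroom (fraction beyond the Q4
# model size) and task type. Mirrors the decision tree, not any tool's logic.
def pick_quant(headroom: float, task: str) -> str:
    quality_sensitive = task in ("creative", "code")
    if headroom >= 0.50:
        return "Q8_0"                                  # plenty: near-lossless
    if headroom >= 0.25 or quality_sensitive:
        return "Q5_K_M"                                # comfortable, or precision matters
    return "Q4_K_M"                                    # tight: best quality at min size

print(pick_quant(0.10, "chat"))      # Q4_K_M
print(pick_quant(0.30, "creative"))  # Q5_K_M
print(pick_quant(0.60, "chat"))      # Q8_0
```

Note the one wrinkle: for creative writing and code, the function upgrades to Q5_K_M even when RAM is tight, matching the tree's "Q5_K_M minimum" for those tasks.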
Quick Reference
Default choice: Q4_K_M. It’s the most popular for a reason — best balance of size, speed, and quality.

If quality matters more: Q5_K_M. ~17% more RAM for measurably better output.

If RAM is no issue: Q8_0. Near-lossless. Only 2x the size of Q4.

Never go below Q4 for generation tasks. Q3 and Q2 are only for extreme constraints (mobile, IoT) on simple tasks.
Key insight: When in doubt, start with Q4_K_M. If the output quality isn’t good enough for your task, try Q5_K_M. If it’s still not enough, the problem is probably the model size (try a larger model), not the quantization level. Going from Q4 to Q8 on a 3B model won’t make it as smart as a Q4 9B model.