Ch 4 — Distillation & Pruning

How large models teach small models, and how to remove what doesn’t matter
Knowledge Distillation: Teacher-Student Training
A large model teaches a small model to mimic its behavior
The Analogy
Imagine a master chef (the teacher) training an apprentice (the student). The master doesn’t just hand over recipes — they show the apprentice how to taste, how to adjust seasoning, how to recognize when something is “almost right.”

The apprentice will never be as experienced as the master, but they learn the master’s judgment — not just the answers, but the reasoning behind them. That’s distillation.
How It Works
Teacher: GPT-4 (or any large model)
Student: a smaller model (e.g., 3B)

Step 1: Feed input to the teacher
Step 2: Teacher produces output probabilities (not just the top answer)
Step 3: Train the student to match the teacher's probability distribution
Step 4: Student learns the teacher's "reasoning patterns" — not just correct answers, but also which wrong answers are "almost right"

The student becomes a compressed version of the teacher's knowledge.
Key insight: Distillation transfers knowledge, not just data. A student model trained by distillation from GPT-4 learns more than one trained on the same data from scratch — because the teacher’s output probabilities encode nuanced understanding that raw labels don’t capture.
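The matching in Step 3 is usually done by minimizing the KL divergence between the teacher's and the student's output distributions. A minimal sketch, with hypothetical logits over five candidate tokens:

```python
import math

def softmax(logits):
    # Turn raw model scores into a probability distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # How far the student's distribution q is from the teacher's p;
    # this is the quantity distillation minimizes
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over five candidate next tokens
teacher_logits = [6.0, 2.5, 2.0, 1.0, 0.5]
student_logits = [5.0, 1.0, 3.0, 0.5, 0.2]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

loss = kl_divergence(teacher_probs, student_probs)
# Training nudges student_logits to drive this loss toward zero
```

When the student's distribution exactly matches the teacher's, the loss is zero; gradient descent pushes the student toward that point.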
Soft Labels: The Secret Sauce
Why probability distributions teach more than hard answers
Hard Labels vs Soft Labels
Hard label (traditional training):
Input: "The capital of France is ___"
Label: "Paris" (100% confidence)
The model learns: Paris = right. Everything else = wrong.

Soft label (distillation):
Input: "The capital of France is ___"
Teacher output:
  Paris: 0.92
  Lyon: 0.03
  Marseille: 0.02
  Berlin: 0.01
  London: 0.01
  ...
The student learns: Paris is right, but Lyon and Marseille are "close" (French cities), while Berlin and London are "further" (other capitals).
Why Soft Labels Are Richer
The probability distribution from the teacher contains dark knowledge — information about relationships between concepts that hard labels don’t capture:

• Lyon is more similar to Paris than Berlin is
• Marseille is a French city but not the capital
• Berlin is a capital but in a different country

This relational information helps the student model generalize better, especially on examples it hasn’t seen during training.
Temperature in Distillation
A temperature parameter (T) softens the teacher’s probabilities. Higher T makes the distribution more uniform, revealing more of the “dark knowledge.” Typical T values: 2–10. This is different from inference temperature (Ch 1 of Prompt Engineering course).
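Temperature scaling is just a division of the logits before the softmax. A minimal sketch, using made-up logits for the "capital of France" example:

```python
import math

def softmax_t(logits, T=1.0):
    # Divide logits by T before softmax; higher T flattens the distribution
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits: Paris, Lyon, Marseille, Berlin, London
logits = [9.0, 4.5, 4.0, 2.0, 1.5]

hard = softmax_t(logits, T=1.0)  # top answer dominates (~0.98 for Paris)
soft = softmax_t(logits, T=4.0)  # flatter: the "dark knowledge" becomes visible
```

At T=1 nearly all the probability mass sits on Paris; at T=4 the runner-up cities get enough mass for the student to learn from their relative ordering.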
Key insight: Soft labels are why distilled models punch above their weight. A 3B model trained with soft labels from GPT-4 learns the teacher’s “intuition” about word relationships, not just the right answers. This is how Phi-4-mini (3.8B) achieves scores that rival much larger models.
Intermediate Layer Alignment
Teaching the student to think like the teacher, not just answer like the teacher
Beyond Output Matching
Basic distillation only matches the final output. But modern techniques also align intermediate layers — the hidden representations inside the model.

Think of it this way: two people can arrive at the same answer through different reasoning. Intermediate alignment ensures the student not only gets the right answer but reasons about it the same way the teacher does.
DistillLens (2025)
Uses the Logit Lens technique to peek into intermediate layers and align them symmetrically between teacher and student. This preserves the teacher’s uncertainty profile — the student learns not just what the teacher knows, but also what the teacher is unsure about.
The Training Pipeline
Modern distillation pipeline:

1. Output distillation
   Match teacher's final probabilities
   Loss: KL divergence
2. Intermediate alignment
   Match hidden states at key layers
   Loss: MSE between representations
3. Attention transfer
   Match attention patterns
   Student learns what to "focus on"
4. Contrastive learning
   Increase teacher response likelihood, decrease student's own (incorrect) response likelihood simultaneously

Combined loss = weighted sum of all four. Each component teaches something different.
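The weighted sum can be sketched as follows. This is a toy version: the weights are illustrative, the contrastive term is omitted for brevity, and the hidden/attention vectors stand in for real tensors:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(t_logits, s_logits, t_hidden, s_hidden, t_attn, s_attn,
                  w_out=1.0, w_hid=0.5, w_attn=0.5):
    # Weighted sum of three of the four signals (contrastive term omitted);
    # the weights here are illustrative, not tuned values
    out = kl(softmax(t_logits), softmax(s_logits))   # output distillation
    hid = mse(t_hidden, s_hidden)                    # intermediate alignment
    attn = mse(t_attn, s_attn)                       # attention transfer
    return w_out * out + w_hid * hid + w_attn * attn

loss = combined_loss([4.0, 1.0, 0.5], [3.0, 1.5, 0.2],
                     [0.3, -0.1, 0.8], [0.2, 0.0, 0.7],
                     [0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
```

The loss is zero only when all three signals match, which is exactly why multi-signal distillation constrains the student more tightly than output matching alone.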
Key insight: Modern distillation is multi-signal: output matching + intermediate alignment + attention transfer + contrastive learning. Each signal teaches the student something different. This is why 2025-era distilled models are dramatically better than 2023-era ones — the distillation techniques have improved as much as the models themselves.
Pruning: Cutting What Doesn’t Matter
Many neurons and connections contribute almost nothing — remove them
The Analogy
Imagine a company with 1,000 employees. Analysis shows that 200 of them do 80% of the productive work. The other 800 contribute marginally — they attend meetings, forward emails, but don’t produce much.

Pruning is like restructuring: identify the low-contributors and remove them. The company gets smaller but barely less productive.

Neural networks are the same: many weights are near-zero and contribute almost nothing to the output. Remove them, and the model gets smaller with minimal quality loss.
How Pruning Works
Step 1: Measure importance
For each weight (or neuron, or layer), calculate how much it contributes to the model's output.
Methods:
- Magnitude: |weight| < threshold → prune
- Gradient: low gradient = low impact
- Taylor expansion: approximate the effect of removing each weight

Step 2: Remove low-importance elements
Set weights to zero (unstructured) or remove entire neurons/layers (structured)

Step 3: Fine-tune (optional)
Retrain briefly to recover any quality lost from pruning
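The magnitude criterion from Step 1 can be sketched in a few lines. This is a toy unstructured pruner over a flat weight list, with made-up weights:

```python
def magnitude_prune(weights, sparsity=0.5):
    # Unstructured pruning: zero the smallest-magnitude fraction of weights
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k]  # k-th smallest magnitude
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.5, 0.1, 0.8, 0.02, 0.7, -0.03, -0.6, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
# Half the weights become zero; the large-magnitude ones survive
```

Real pruners work per-layer on tensors and often prune gradually over several rounds, but the core idea is this threshold on |weight|.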
Key insight: Research shows that 50–90% of weights in a trained neural network can be removed with less than 5% quality loss. The “Lottery Ticket Hypothesis” (Frankle & Carbin, 2019) suggests that within every large network, there’s a small subnetwork that performs almost as well. Pruning finds that subnetwork.
Structured vs Unstructured Pruning
Removing individual weights vs removing entire neurons or layers
Unstructured Pruning
What: Set individual weights to zero
Result: Sparse matrix (lots of zeros)

Before: [0.5, 0.1, 0.8, 0.02, 0.7]
After:  [0.5, 0, 0.8, 0, 0.7]

Pros:
✓ Very fine-grained control
✓ Can remove 90%+ of weights
✓ Minimal quality loss

Cons:
✗ Sparse matrices are hard to accelerate on standard hardware
✗ Need special sparse kernels
✗ Actual speedup is often small
Structured Pruning
What: Remove entire neurons, attention heads, or transformer layers
Result: Smaller but dense model

Before: 36 transformer layers
After:  28 transformer layers

Pros:
✓ Real speedup on standard hardware
✓ Smaller model file
✓ No special sparse kernels needed

Cons:
✗ Coarser — can't be as selective
✗ More quality loss per % removed
✗ Needs careful layer importance analysis
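Structured pruning at the layer level amounts to ranking layers by importance and keeping the top ones in their original order. A minimal sketch, with stand-in importance scores:

```python
def prune_layers(importance, keep):
    # Structured pruning: keep the `keep` most important layers,
    # preserving their original order in the network
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical per-layer importance scores for a 36-layer model
importance = [(i * 37) % 100 / 100 for i in range(36)]  # stand-in scores
kept = prune_layers(importance, keep=28)
# 28 layer indices remain, in order; the 8 lowest-scoring layers are dropped
```

In practice the importance scores come from measuring each layer's effect on outputs (e.g., ablation or gradient-based estimates), not from a formula like the stand-in above.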
Key insight: For local deployment, structured pruning is more practical because it produces a smaller, dense model that runs fast on standard hardware. Unstructured pruning creates sparse models that need specialized hardware/software to actually run faster. Most consumer devices don’t have good sparse acceleration.
Real-World Examples
How today’s small models were actually built
Phi-4-mini (Microsoft)
Strategy: Distillation + Data Quality
Teacher: GPT-4 (and other large models)
Student: 3.8B parameter model

Key innovation: Instead of training on massive web crawls, Microsoft used GPT-4 to generate high-quality synthetic training data.

Result: 3.8B params, MIT license
GSM8K: 88.6% (rivals 13B models)
ARC-C: 83.7%

Lesson: Quality of training data matters more than quantity. 1M high-quality examples > 1B noisy ones.
Iterative Layer-wise Distillation (2025)
Strategy: Structured Pruning + Distillation
Model: Qwen 2.5 3B (36 layers)
Target: 28 layers (22% reduction)

Process:
1. Evaluate importance of each layer
2. Remove the 8 least important layers
3. Fine-tune with KL divergence loss + MSE loss on intermediate states
4. Quality loss: only 9.7%

Removed 22% of layers, lost only 9.7% quality. The removed layers were contributing almost nothing.
Key insight: The best small models use a combination of techniques: distillation for knowledge transfer, synthetic data for training quality, and pruning for efficiency. No single technique is enough — it’s the combination that produces models like Phi-4-mini that punch far above their weight class.
Distillation vs Quantization vs Pruning
Three techniques, different purposes, often combined
Comparison
Quantization (Ch 3)
  What: Reduce precision of weights
  When: After training (PTQ) or during
  Who: Anyone (download GGUF, done)
  Effect: Same architecture, smaller file
  Quality: 90–98% retained

Distillation
  What: Train small model from large one
  When: During training
  Who: Model creators (needs GPU cluster)
  Effect: Entirely new, smaller model
  Quality: 85–95% of teacher

Pruning
  What: Remove unimportant weights/layers
  When: After training + fine-tune
  Who: Model creators or advanced users
  Effect: Same architecture, fewer parts
  Quality: 90–95% retained
How They Combine
Typical production pipeline:

1. Distillation: Train a 3B model from a 70B teacher (done by model creator)

2. Pruning: Remove 20% of layers that contribute least (done by model creator)

3. Quantization: Convert to Q4_K_M GGUF for deployment (done by you or the community)

The result: a model that started as 70B × FP32 = 280GB, compressed to 3B × Q4 = 2GB. A 140x reduction with 85–90% of the original quality.
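The arithmetic behind that reduction is straightforward. A quick sketch, using ~5.3 effective bits per weight as a rough stand-in for Q4_K_M once block scales are included (the exact figure varies by quant):

```python
def model_size_gb(params_billions, bits_per_weight):
    # size in GB = parameters × bits per weight / 8 bits per byte
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

teacher = model_size_gb(70, 32)    # 70B teacher at FP32 → 280 GB
student = model_size_gb(3, 5.3)    # 3B student at ~5.3 effective bits
                                   # (rough figure for Q4_K_M with block scales)
reduction = teacher / student      # roughly 140x
```

The same function lets you estimate whether any quantized model will fit in your RAM or VRAM before downloading it.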
Key insight: As a local AI practitioner, quantization is the technique you’ll use directly (Ch 6 covers how). Distillation and pruning are done by model creators — but understanding them helps you appreciate why some small models are dramatically better than others, and how to evaluate which ones to use.
The Compression Decision Guide
What to use when, and what to look for in pre-compressed models
For Practitioners (You)
Your primary tool: Quantization
1. Pick a model (Ch 2 landscape)
2. Download a pre-quantized GGUF (Q4_K_M for most, Q5_K_M for quality)
3. Run with Ollama (Ch 5)
4. Done

When evaluating pre-built models, look for models that were:
✓ Distilled from a strong teacher
✓ Trained on high-quality data
✓ Available in multiple GGUF quants
✓ Benchmarked on relevant tasks

You benefit from distillation and pruning without doing it yourself. The model creator did the hard work.
For Model Builders
If you’re fine-tuning or building custom models:

1. Start with a distilled base: Fine-tune Phi-4-mini or Qwen 3.5 4B, not a random 4B model. They already carry knowledge from larger teachers.

2. Consider pruning after fine-tuning: Your fine-tuned model may have layers that are redundant for your specific task. Structured pruning can make it 20–30% smaller.

3. Quantize last: Always quantize as the final step. Quantize → fine-tune is worse than fine-tune → quantize.
Key insight: The compression pipeline is: distill → prune → quantize. Each step reduces size with some quality loss. The order matters: distillation creates the best small architecture, pruning removes redundancy, quantization reduces precision. Now that you understand the theory, Chapter 5 gets hands-on with Ollama.