Ch 4 — Distillation & Pruning

How large models teach small models, and how to remove what doesn’t matter
Knowledge Distillation: Teacher-Student Training
A large model teaches a small model to mimic its behavior
The Analogy
Imagine a master chef (the teacher) training an apprentice (the student). The master doesn’t just hand over recipes — they show the apprentice how to taste, how to adjust seasoning, how to recognize when something is “almost right.”

The apprentice will never be as experienced as the master, but they learn the master’s judgment — not just the answers, but the reasoning behind them. That’s distillation.
How It Works
Teacher: GPT-4 (or any large model)
Student: a smaller model (e.g., 3B)

Step 1: Feed input to the teacher
Step 2: Teacher produces output probabilities (not just the top answer)
Step 3: Train the student to match the teacher's probability distribution
Step 4: Student learns the teacher's "reasoning patterns" — not just correct answers, but also which wrong answers are "almost right"

The student becomes a compressed version of the teacher's knowledge.
Key insight: Distillation transfers knowledge, not just data. A student model trained by distillation from GPT-4 learns more than one trained on the same data from scratch — because the teacher’s output probabilities encode nuanced understanding that raw labels don’t capture.
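The matching in Step 3 is usually done by minimizing the KL divergence between the teacher's and the student's output distributions. A minimal sketch, with hypothetical logits over five candidate tokens:

```python
import math

def softmax(logits):
    # Turn raw model scores into a probability distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # How far the student's distribution q is from the teacher's p;
    # this is the quantity distillation minimizes
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over five candidate next tokens
teacher_logits = [6.0, 2.5, 2.0, 1.0, 0.5]
student_logits = [5.0, 1.0, 3.0, 0.5, 0.2]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

loss = kl_divergence(teacher_probs, student_probs)
# Training nudges student_logits to drive this loss toward zero
```

When the student's distribution exactly matches the teacher's, the loss is zero; gradient descent pushes the student toward that point.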
Soft Labels: The Secret Sauce
Why probability distributions teach more than hard answers
Hard Labels vs Soft Labels
Hard label (traditional training):
Input: "The capital of France is ___"
Label: "Paris" (100% confidence)
The model learns: Paris = right. Everything else = wrong.

Soft label (distillation):
Input: "The capital of France is ___"
Teacher output:
  Paris: 0.92
  Lyon: 0.03
  Marseille: 0.02
  Berlin: 0.01
  London: 0.01
  ...
The student learns: Paris is right, but Lyon and Marseille are "close" (French cities), while Berlin and London are "further" (other capitals).
Why Soft Labels Are Richer
The probability distribution from the teacher contains dark knowledge — information about relationships between concepts that hard labels don’t capture:

• Lyon is more similar to Paris than Berlin is
• Marseille is a French city but not the capital
• Berlin is a capital but in a different country

This relational information helps the student model generalize better, especially on examples it hasn’t seen during training.
Temperature in Distillation
A temperature parameter (T) softens the teacher’s probabilities. Higher T makes the distribution more uniform, revealing more of the “dark knowledge.” Typical T values: 2–10. This is different from inference temperature (Ch 1 of Prompt Engineering course).
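Temperature scaling is just a division of the logits before the softmax. A minimal sketch, using made-up logits for the "capital of France" example:

```python
import math

def softmax_t(logits, T=1.0):
    # Divide logits by T before softmax; higher T flattens the distribution
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits: Paris, Lyon, Marseille, Berlin, London
logits = [9.0, 4.5, 4.0, 2.0, 1.5]

hard = softmax_t(logits, T=1.0)  # top answer dominates (~0.98 for Paris)
soft = softmax_t(logits, T=4.0)  # flatter: the "dark knowledge" becomes visible
```

At T=1 nearly all the probability mass sits on Paris; at T=4 the runner-up cities get enough mass for the student to learn from their relative ordering.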
Key insight: Soft labels are why distilled models punch above their weight. A 3B model trained with soft labels from GPT-4 learns the teacher’s “intuition” about word relationships, not just the right answers. This is how Phi-4-mini (3.8B) achieves scores that rival much larger models.
Intermediate Layer Alignment
Teaching the student to think like the teacher, not just answer like the teacher
Beyond Output Matching
Basic distillation only matches the final output. But modern techniques also align intermediate layers — the hidden representations inside the model.

Think of it this way: two people can arrive at the same answer through different reasoning. Intermediate alignment ensures the student not only gets the right answer but reasons about it the same way the teacher does.
DistillLens (2025)
Uses the Logit Lens technique to peek into intermediate layers and align them symmetrically between teacher and student. This preserves the teacher’s uncertainty profile — the student learns not just what the teacher knows, but also what the teacher is unsure about.
The Training Pipeline
Modern distillation pipeline:

1. Output distillation
   Match teacher's final probabilities
   Loss: KL divergence
2. Intermediate alignment
   Match hidden states at key layers
   Loss: MSE between representations
3. Attention transfer
   Match attention patterns
   Student learns what to "focus on"
4. Contrastive learning
   Increase teacher response likelihood, decrease student's own (incorrect) response likelihood simultaneously

Combined loss = weighted sum of all four. Each component teaches something different.
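The weighted sum can be sketched as follows. This is a toy version: the weights are illustrative, the contrastive term is omitted for brevity, and the hidden/attention vectors stand in for real tensors:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(t_logits, s_logits, t_hidden, s_hidden, t_attn, s_attn,
                  w_out=1.0, w_hid=0.5, w_attn=0.5):
    # Weighted sum of three of the four signals (contrastive term omitted);
    # the weights here are illustrative, not tuned values
    out = kl(softmax(t_logits), softmax(s_logits))   # output distillation
    hid = mse(t_hidden, s_hidden)                    # intermediate alignment
    attn = mse(t_attn, s_attn)                       # attention transfer
    return w_out * out + w_hid * hid + w_attn * attn

loss = combined_loss([4.0, 1.0, 0.5], [3.0, 1.5, 0.2],
                     [0.3, -0.1, 0.8], [0.2, 0.0, 0.7],
                     [0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
```

The loss is zero only when all three signals match, which is exactly why multi-signal distillation constrains the student more tightly than output matching alone.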
Key insight: Modern distillation is multi-signal: output matching + intermediate alignment + attention transfer + contrastive learning. Each signal teaches the student something different. This is why 2025-era distilled models are dramatically better than 2023-era ones — the distillation techniques have improved as much as the models themselves.
Pruning: Cutting What Doesn’t Matter
Many neurons and connections contribute almost nothing — remove them
The Analogy
Imagine a company with 1,000 employees. Analysis shows that 200 of them do 80% of the productive work. The other 800 contribute marginally — they attend meetings, forward emails, but don’t produce much.

Pruning is like restructuring: identify the low-contributors and remove them. The company gets smaller but barely less productive.

Neural networks are the same: many weights are near-zero and contribute almost nothing to the output. Remove them, and the model gets smaller with minimal quality loss.
How Pruning Works
Step 1: Measure importance
For each weight (or neuron, or layer), calculate how much it contributes to the model's output.
Methods:
- Magnitude: |weight| < threshold → prune
- Gradient: low gradient = low impact
- Taylor expansion: approximate the effect of removing each weight

Step 2: Remove low-importance elements
Set weights to zero (unstructured) or remove entire neurons/layers (structured)

Step 3: Fine-tune (optional)
Retrain briefly to recover any quality lost from pruning
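The magnitude criterion from Step 1 can be sketched in a few lines. This is a toy unstructured pruner over a flat weight list, with made-up weights:

```python
def magnitude_prune(weights, sparsity=0.5):
    # Unstructured pruning: zero the smallest-magnitude fraction of weights
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k]  # k-th smallest magnitude
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.5, 0.1, 0.8, 0.02, 0.7, -0.03, -0.6, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
# Half the weights become zero; the large-magnitude ones survive
```

Real pruners work per-layer on tensors and often prune gradually over several rounds, but the core idea is this threshold on |weight|.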
Key insight: Research shows that 50–90% of weights in a trained neural network can be removed with less than 5% quality loss. The “Lottery Ticket Hypothesis” (Frankle & Carbin, 2019) suggests that within every large network, there’s a small subnetwork that performs almost as well. Pruning finds that subnetwork.
Structured vs Unstructured Pruning
Removing individual weights vs removing entire neurons or layers
Unstructured Pruning
What: Set individual weights to zero
Result: Sparse matrix (lots of zeros)

Before: [0.5, 0.1, 0.8, 0.02, 0.7]
After:  [0.5, 0, 0.8, 0, 0.7]

Pros:
✓ Very fine-grained control
✓ Can remove 90%+ of weights
✓ Minimal quality loss

Cons:
✗ Sparse matrices are hard to accelerate on standard hardware
✗ Need special sparse kernels
✗ Actual speedup is often small
Structured Pruning
What: Remove entire neurons, attention heads, or transformer layers
Result: Smaller but dense model

Before: 36 transformer layers
After:  28 transformer layers

Pros:
✓ Real speedup on standard hardware
✓ Smaller model file
✓ No special sparse kernels needed

Cons:
✗ Coarser — can't be as selective
✗ More quality loss per % removed
✗ Needs careful layer importance analysis
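Structured pruning at the layer level amounts to ranking layers by importance and keeping the top ones in their original order. A minimal sketch, with stand-in importance scores:

```python
def prune_layers(importance, keep):
    # Structured pruning: keep the `keep` most important layers,
    # preserving their original order in the network
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical per-layer importance scores for a 36-layer model
importance = [(i * 37) % 100 / 100 for i in range(36)]  # stand-in scores
kept = prune_layers(importance, keep=28)
# 28 layer indices remain, in order; the 8 lowest-scoring layers are dropped
```

In practice the importance scores come from measuring each layer's effect on outputs (e.g., ablation or gradient-based estimates), not from a formula like the stand-in above.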
Key insight: For local deployment, structured pruning is more practical because it produces a smaller, dense model that runs fast on standard hardware. Unstructured pruning creates sparse models that need specialized hardware/software to actually run faster. Most consumer devices don’t have good sparse acceleration.
Real-World Examples
How today’s small models were actually built
Phi-4-mini (Microsoft)
Strategy: Distillation + Data Quality
Teacher: GPT-4 (and other large models)
Student: 3.8B parameter model

Key innovation: Instead of training on massive web crawls, Microsoft used GPT-4 to generate high-quality synthetic training data.

Result: 3.8B params, MIT license
GSM8K: 88.6% (rivals 13B models)
ARC-C: 83.7%

Lesson: Quality of training data matters more than quantity. 1M high-quality examples > 1B noisy ones.
Iterative Layer-wise Distillation (2025)
Strategy: Structured Pruning + Distillation
Model: Qwen 2.5 3B (36 layers)
Target: 28 layers (22% reduction)

Process:
1. Evaluate importance of each layer
2. Remove the 8 least important layers
3. Fine-tune with KL divergence loss + MSE loss on intermediate states
4. Quality loss: only 9.7%

Removed 22% of layers, lost only 9.7% quality. The removed layers were contributing almost nothing.
Key insight: The best small models use a combination of techniques: distillation for knowledge transfer, synthetic data for training quality, and pruning for efficiency. No single technique is enough — it’s the combination that produces models like Phi-4-mini that punch far above their weight class.
Distillation vs Quantization vs Pruning
Three techniques, different purposes, often combined
Comparison
Quantization (Ch 3)
  What: Reduce precision of weights
  When: After training (PTQ) or during
  Who: Anyone (download GGUF, done)
  Effect: Same architecture, smaller file
  Quality: 90–98% retained

Distillation
  What: Train small model from large one
  When: During training
  Who: Model creators (needs GPU cluster)
  Effect: Entirely new, smaller model
  Quality: 85–95% of teacher

Pruning
  What: Remove unimportant weights/layers
  When: After training + fine-tune
  Who: Model creators or advanced users
  Effect: Same architecture, fewer parts
  Quality: 90–95% retained
How They Combine
Typical production pipeline:

1. Distillation: Train a 3B model from a 70B teacher (done by model creator)

2. Pruning: Remove 20% of layers that contribute least (done by model creator)

3. Quantization: Convert to Q4_K_M GGUF for deployment (done by you or the community)

The result: a model that started as 70B × FP32 = 280GB, compressed to 3B × Q4 = 2GB. A 140x reduction with 85–90% of the original quality.
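The arithmetic behind that reduction is straightforward. A quick sketch, using ~5.3 effective bits per weight as a rough stand-in for Q4_K_M once block scales are included (the exact figure varies by quant):

```python
def model_size_gb(params_billions, bits_per_weight):
    # size in GB = parameters × bits per weight / 8 bits per byte
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

teacher = model_size_gb(70, 32)    # 70B teacher at FP32 → 280 GB
student = model_size_gb(3, 5.3)    # 3B student at ~5.3 effective bits
                                   # (rough figure for Q4_K_M with block scales)
reduction = teacher / student      # roughly 140x
```

The same function lets you estimate whether any quantized model will fit in your RAM or VRAM before downloading it.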
Key insight: As a local AI practitioner, quantization is the technique you’ll use directly (Ch 6 covers how). Distillation and pruning are done by model creators — but understanding them helps you appreciate why some small models are dramatically better than others, and how to evaluate which ones to use.
The Compression Decision Guide
What to use when, and what to look for in pre-compressed models
For Practitioners (You)
Your primary tool: Quantization
1. Pick a model (Ch 2 landscape)
2. Download a pre-quantized GGUF (Q4_K_M for most, Q5_K_M for quality)
3. Run with Ollama (Ch 5)
4. Done

When evaluating pre-built models, look for models that were:
✓ Distilled from a strong teacher
✓ Trained on high-quality data
✓ Available in multiple GGUF quants
✓ Benchmarked on relevant tasks

You benefit from distillation and pruning without doing it yourself. The model creator did the hard work.
For Model Builders
If you’re fine-tuning or building custom models:

1. Start with a distilled base: Fine-tune Phi-4-mini or Qwen 3.5 4B, not a random 4B model. They already carry knowledge from larger teachers.

2. Consider pruning after fine-tuning: Your fine-tuned model may have layers that are redundant for your specific task. Structured pruning can make it 20–30% smaller.

3. Quantize last: Always quantize as the final step. Quantize → fine-tune is worse than fine-tune → quantize.
Key insight: The compression pipeline is: distill → prune → quantize. Each step reduces size with some quality loss. The order matters: distillation creates the best small architecture, pruning removes redundancy, quantization reduces precision. Now that you understand the theory, Chapter 5 gets hands-on with Ollama.