For Practitioners (You)
Your primary tool: Quantization
1. Pick a model (Ch 2 landscape)
2. Download a pre-quantized GGUF
(Q4_K_M for most uses, Q5_K_M when quality matters)
3. Run with Ollama (Ch 5)
4. Done
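The quantization step above can be sketched in miniature. This is an illustrative, simplified version of what a GGUF Q4 format does conceptually: store small integers plus a per-block scale instead of full-precision floats. The function names and block values here are made up for illustration; real Q4_K_M uses k-quant super-blocks with more elaborate scaling.

```python
# Simplified blockwise symmetric quantization: floats -> 4-bit integers + scale.
# Hypothetical helper names; not the actual llama.cpp implementation.

def quantize_block(weights, bits=4):
    """Map a block of floats to signed ints in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_block(q, scale):
    """Recover approximate floats: each value is off by at most scale/2."""
    return [qi * scale for qi in q]

block = [0.12, -0.53, 0.91, -0.07]        # made-up weight values
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
```

Each weight now costs 4 bits plus a shared scale per block instead of 16 or 32 bits, which is where the 3-4x size reduction of Q4 models comes from.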
When evaluating pre-built models, look for ones that were:
✓ Distilled from a strong teacher
✓ Trained on high-quality data
✓ Available in multiple GGUF quants
✓ Benchmarked on relevant tasks
You benefit from distillation and
pruning without doing it yourself.
The model creator did the hard work.
For Model Builders
If you’re fine-tuning or building custom models:
1. Start with a distilled base: Fine-tune Phi-4-mini or Qwen3 4B, not a random 4B model. They already carry knowledge distilled from larger teachers.
2. Consider pruning after fine-tuning: Your fine-tuned model may have layers that are redundant for your specific task. Structured pruning can make it 20–30% smaller.
3. Quantize last: Always quantize as the final step. Quantize → fine-tune is worse than fine-tune → quantize, because quantizing first bakes rounding error into the weights your training then has to fight.
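Structured pruning (step 2 above) can be sketched at the layer level: score each layer's importance for your task, then drop the least important ones. The layer names and scores below are hypothetical; a real implementation would measure importance empirically, e.g. from activation statistics on your task data.

```python
# Illustrative structured pruning: keep the top fraction of layers by an
# importance score, preserving their original order. Scores are made up.

def prune_layers(layer_scores, keep_fraction=0.75):
    """layer_scores: list of (name, importance). Returns names of kept layers."""
    n_keep = max(1, int(len(layer_scores) * keep_fraction))
    ranked = sorted(layer_scores, key=lambda kv: kv[1], reverse=True)
    keep = {name for name, _ in ranked[:n_keep]}
    # Preserve original layer order so the network structure stays intact.
    return [name for name, _ in layer_scores if name in keep]

layers = [("layer_0", 0.92), ("layer_1", 0.15), ("layer_2", 0.88), ("layer_3", 0.40)]
print(prune_layers(layers))  # drops layer_1, the lowest-scoring 25%
```

Dropping whole layers (rather than individual weights) is what makes this "structured": the resulting model is genuinely smaller and faster, with no sparse-matrix tricks needed at inference time.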
Key insight: the compression pipeline is distill → prune → quantize. Each step shrinks the model at some cost in quality, and the order matters: distillation creates the best small architecture, pruning removes redundancy, and quantization reduces numeric precision. Now that you understand the theory, Chapter 5 gets hands-on with Ollama.