Ch 10 — The Future of Small Models

Where small models are heading — and your complete local AI toolkit
The Trend: Smaller AND Smarter
Each generation of small models matches the previous generation’s large models
The Compression Timeline
2023: GPT-4 level ≈ 1.8T params (cloud only)
2024: GPT-4 level ≈ 70B params (server GPU)
2025: GPT-4 level ≈ 9-14B params (laptop)
2026: GPT-4 level ≈ 3-4B params? (phone?)

Every 12-18 months, the same capability fits in a model 5-10x smaller.

Recent milestones:
- Gemma 3n E4B: LMArena >1300 (the first sub-10B model to do so)
- Qwen 3.5 9B: MMLU-Pro 82.5 (GPT-4's level in 2023)
- Phi-4-mini 3.8B: GSM8K 88.6% (MIT license)
Why This Is Happening
1. Better training data: Synthetic data from large models, curated datasets, quality over quantity.

2. Architecture innovations: Selective parameter activation (Gemma 3n), mixture of experts, efficient attention.

3. Better distillation: Multi-signal distillation, intermediate alignment, contrastive learning (Ch 4).

4. Competition: Meta, Google, Microsoft, Alibaba, Mistral all racing to make the best small model. Each release pushes the others.
Key insight: The gap between small and large models is closing at an accelerating rate. Today’s 9B model matches 2023’s frontier. By 2027, a phone-sized model (3–4B) may match today’s frontier for most tasks. The future of AI is small, local, and everywhere.
Speculative Decoding
Small model drafts, large model verifies — 2-3x faster generation
How It Works
Traditional generation:
- The large model generates 1 token at a time
- Each token costs a full forward pass (~50ms)
- 100 tokens = 100 forward passes = 5s

Speculative decoding:
1. A small model (1B) drafts 5 tokens quickly
2. The large model (70B) verifies all 5 in a single pass
3. Correct tokens are accepted, wrong ones rejected
4. Repeat

Result: 2-3x faster generation, because the large model processes multiple tokens in parallel during verification, and the small model's drafts are right ~70-80% of the time.
Why This Matters for Local AI
For cloud: Speculative decoding makes large models faster and cheaper to serve. Providers can handle more requests per GPU.

For local: You can pair a tiny model (1B, runs on CPU) with a medium model (7B, runs on GPU). The 1B drafts tokens while the 7B verifies, giving you full 7B output quality at roughly 2-3x the 7B model's usual generation speed.

Already supported: llama.cpp has speculative decoding built in. Ollama is adding support. This is production-ready technology.
Key insight: Speculative decoding is a “free lunch” — you get the same output quality as the large model but 2–3x faster. It works because small models agree with large models on most tokens (the “easy” ones). The large model only needs to think hard about the “surprising” tokens.
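The draft/verify loop can be sketched as a toy simulation. This is not a real inference engine: the "models" below are stand-in functions over a fixed token sequence, purely to show the control flow and why the acceptance rate drives the speedup.

```python
import random

random.seed(0)

TARGET = list("the quick brown fox jumps over the lazy dog")  # the "true" output

def draft_model(pos, k):
    """Stand-in draft model: proposes the next k tokens, right ~80% of the time."""
    return [TARGET[pos + i] if random.random() < 0.8 else "?"
            for i in range(min(k, len(TARGET) - pos))]

def target_verify(pos, proposed):
    """Stand-in target model: one parallel pass checks all drafted tokens,
    accepting the longest correct prefix and supplying one corrected token."""
    accepted = []
    for i, tok in enumerate(proposed):
        if tok == TARGET[pos + i]:
            accepted.append(tok)
        else:
            accepted.append(TARGET[pos + i])  # target fixes the first wrong token
            break
    return accepted

def speculative_decode(k=5):
    out, target_passes = [], 0
    while len(out) < len(TARGET):
        proposed = draft_model(len(out), k)
        target_passes += 1  # one big-model pass covers up to k tokens
        out.extend(target_verify(len(out), proposed))
    return "".join(out), target_passes

text, passes = speculative_decode()
print(text)  # identical to plain token-by-token decoding
print(f"{passes} target passes instead of {len(TARGET)}")
```

Note the output is exactly what plain decoding would produce; only the number of expensive target-model passes shrinks. Real implementations also sample an extra token from the verification pass, which this sketch omits.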
On-Device Fine-Tuning
Personalizing models locally — the model adapts to YOU
The Vision
Today, fine-tuning requires a GPU server. But on-device fine-tuning is coming: your phone or laptop continuously adapts the model to your writing style, vocabulary, and preferences.

Imagine a keyboard that doesn’t just predict common words — it predicts YOUR words. An email assistant that writes in YOUR tone. A code assistant that knows YOUR codebase conventions.
Current State
What works today:
✓ LoRA fine-tuning on a laptop GPU
✓ QLoRA (quantized LoRA) on 8GB VRAM
✓ Apple MLX fine-tuning on M-series Macs

What's coming:
→ On-phone LoRA adaptation
→ Federated learning (learn from many users without sharing raw data)
→ Continuous learning from user interactions
LoRA on a Laptop
# Fine-tune Qwen 2.5 7B with QLoRA
# Requires: ~8GB VRAM or 16GB unified memory
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (the "Q" in QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Attach small trainable LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
# Then train with trl's SFTTrainer on your dataset.
# Trainable params: ~4M (~0.06% of total)
# Training time: ~1 hour on an M2 Pro
Key insight: LoRA fine-tuning is already practical on consumer hardware. QLoRA (4-bit base + LoRA adapters) lets you fine-tune a 7B model on a 16GB laptop. The adapter is tiny (~10MB) and can be swapped in/out. This means personalized models without personalized hardware.
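Why is the adapter so small? Instead of learning a full d×d weight delta, LoRA learns two thin matrices A (r×d) and B (d×r) and applies W' = W + (alpha/r)·B·A. A minimal numpy sketch with illustrative dimensions (d=4096, r=16, matching the config above; not a real model):

```python
import numpy as np

d, r, alpha = 4096, 16, 32  # hidden size, LoRA rank, scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))        # frozen base weight (e.g. q_proj)
A = rng.standard_normal((r, d)) * 0.01  # trainable, initialized small
B = np.zeros((d, r))                    # trainable, initialized to zero

# Effective weight at inference: because B starts at zero, W_eff == W
# before any training, so the adapter begins as a no-op.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"full delta: {full_params:,} params; "
      f"LoRA delta: {lora_params:,} params "
      f"({100 * lora_params / full_params:.2f}% per adapted matrix)")
```

For this single matrix the adapter is under 1% of the full delta; across a whole model, where only a few projections per layer get adapters, the trainable fraction drops to the ~0.06% quoted above.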
Hardware Trends: NPUs Everywhere
Every new chip has dedicated AI hardware — the infrastructure is being built
The NPU Revolution
Apple Neural Engine
  M4: 38 TOPS | A18 Pro: 35 TOPS
  In: every Mac, iPhone, and iPad since 2020

Qualcomm Hexagon NPU
  Snapdragon 8 Elite: 45 TOPS
  In: Samsung Galaxy S25, OnePlus 13

Intel NPU
  Lunar Lake: 48 TOPS
  In: latest Intel laptops

AMD XDNA
  Ryzen AI: 50 TOPS
  In: latest AMD laptops

NVIDIA (discrete GPU)
  RTX 4090: 1,321 AI TOPS
  RTX 5090: ~2,000 TOPS (FP4)
What This Means
Every device sold today has AI hardware. Phones, laptops, tablets, even some IoT devices have dedicated neural processing units. The hardware is ahead of the software.

The bottleneck is software: NPU support in inference engines (llama.cpp, ExecuTorch) is still maturing. Most local AI today runs on GPU or CPU, not NPU. As NPU support improves, local AI will get 2–5x faster on the same hardware.

Windows Copilot+ PCs: Microsoft requires 40+ TOPS NPU for the “Copilot+” label. This is pushing NPU adoption across the PC industry.
Key insight: The hardware for local AI is already in your pocket and on your desk. NPUs with 35–50 TOPS are in every flagship phone and modern laptop. As software catches up (better NPU backends in llama.cpp, ExecuTorch), local AI will get dramatically faster without any new hardware purchases.
The Convergence: Every Device Becomes an AI Device
The end state: AI is ambient, local, and always available
The Vision
2023: AI = cloud API. You send data to a server, get a response.

2025: AI = local option. You can run models on your laptop. Edge deployment is emerging.

2027+: AI = ambient. Every device has a capable model. Your phone, laptop, car, smart home, watch — all running local AI. Cloud is the fallback for hard tasks, not the default.

This isn’t science fiction. The hardware exists. The models are getting small enough. The software ecosystem is maturing. We’re in the early innings of this transition.
What Changes
For Users:
- AI works offline, instantly, privately
- No subscriptions for basic AI features
- Personalized models that know you

For Developers:
- AI is a local library, not an API call
- No rate limits, no API keys, no per-token costs
- Ship AI features without cloud infrastructure

For Businesses:
- AI processing at zero marginal cost
- Complete data sovereignty
- No vendor lock-in to AI providers

For Society:
- AI access doesn't require internet
- Developing regions get AI too
- Privacy becomes the default, not an opt-in
Key insight: The shift from cloud-first to local-first AI is as significant as the shift from mainframes to personal computers. It democratizes AI access, eliminates the cost barrier, and makes privacy the default. You’re learning these skills at exactly the right time.
Your Local AI Toolkit
Everything you learned in this course, organized for reference
Models
Tiny (1-3B) — Phone/Edge:
- Llama 3.2 1B, 3B
- Gemma 2B

Small (4-9B) — Laptop:
- Qwen 3.5 4B, 9B ← best quality/size
- Gemma 3 4B
- Phi-4-mini 3.8B ← MIT license

Medium (14-24B) — Desktop/Server:
- Qwen 3.5 14B
- Mistral Small 3.1 (24B)
Quantization
Default: Q4_K_M (best balance)
Quality: Q5_K_M (if RAM allows)
Max: Q8_0 (near-lossless)
Format: GGUF (universal)
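A quick back-of-envelope for these choices: memory ≈ parameter count × bits per weight ÷ 8, plus some allowance for the KV cache and runtime. The bits-per-weight figures below are approximate averages for each GGUF scheme (assumed values: ~4.8 for Q4_K_M, ~5.7 for Q5_K_M, 8.5 for Q8_0), so treat the results as estimates, not exact file sizes.

```python
def model_ram_gb(params_billion, bits_per_weight, overhead_gb=1.0):
    """Rough RAM estimate: quantized weights plus a flat KV-cache/runtime allowance."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Approximate average bits/weight per scheme (assumed, for illustration)
SCHEMES = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

for name, bits in SCHEMES.items():
    print(f"7B @ {name}: ~{model_ram_gb(7, bits):.1f} GB")
```

This is why a 7B model at Q4_K_M fits comfortably in 8GB of RAM while the same model at FP16 does not.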
Tools
Daily driver: Ollama
Power user: llama.cpp
Mobile: ExecuTorch
Browser: WebLLM
Cross-platform: ONNX Runtime
Vector store: ChromaDB
Framework: LangChain / LlamaIndex
Fine-tuning: QLoRA + PEFT
Architecture
Simple: App → Ollama API → Model
RAG: App → ChromaDB → Ollama
Hybrid: Router → Local / Cloud
Edge: ExecuTorch / WebLLM
Key insight: You now have a complete toolkit. The specific model names and version numbers will change — new models drop monthly. But the concepts (quantization, distillation, GGUF, Ollama, hybrid routing) are stable. You have the framework to evaluate and adopt whatever comes next.
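The simplest of those architectures (App → Ollama API → Model) is one HTTP POST. The sketch below uses only the standard library; the endpoint and field names follow Ollama's /api/generate API, and the payload construction is factored out so it can be checked without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Encode an Ollama /api/generate payload (stream=False: single JSON reply)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """POST to a local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama with the model pulled):
#   generate("qwen2.5:7b", "Summarize: local AI runs on your own hardware.")
```

That's the whole "cloud API" surface area, running on localhost with no key, no billing, and no rate limit.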
The Complete Decision Tree
From “I want to use AI” to “here’s exactly what to deploy”
Step 1: Where Does It Run?
Phone/Browser?
→ 1-3B model
→ ExecuTorch (mobile) / WebLLM (web)
→ Q4_K_M quantization

Laptop/Desktop?
→ 4-9B model (8-16GB RAM)
→ 14-24B model (24GB+ RAM)
→ Ollama + GGUF

Server?
→ 24-70B model
→ Ollama or vLLM
→ Consider cloud if volume is low
Step 2: Which Model?
Classification/Extraction:
→ Llama 3.2 1B or Phi-4-mini

Summarization/Chat:
→ Qwen 3.5 9B

Code:
→ Qwen 3.5 9B or Phi-4-mini

Best quality possible (local):
→ Mistral Small 3.1 (24B)

Need frontier reasoning:
→ Cloud (GPT-4o / Claude)
→ Or hybrid: local with cloud fallback
Key insight: This decision tree covers 95% of local AI deployment scenarios. Start with Ollama + Qwen 3.5 9B on your laptop. That’s your baseline. Adjust from there based on your specific constraints (hardware, task, privacy, volume).
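The two steps above collapse into a pair of small lookup functions. This is a sketch of the chapter's rules, not a library: the names and thresholds come straight from the tree, with "frontier reasoning" treated as a boolean flag.

```python
def choose_runtime(device: str, ram_gb: int = 16) -> str:
    """Step 1: where does it run?"""
    if device == "phone":
        return "1-3B model, ExecuTorch, Q4_K_M"
    if device == "browser":
        return "1-3B model, WebLLM, Q4_K_M"
    if device in ("laptop", "desktop"):
        size = "14-24B" if ram_gb >= 24 else "4-9B"
        return f"{size} model, Ollama + GGUF"
    if device == "server":
        return "24-70B model, Ollama or vLLM"
    raise ValueError(f"unknown device: {device!r}")

def choose_model(task: str, needs_frontier: bool = False) -> str:
    """Step 2: which model?"""
    if needs_frontier:
        return "cloud (GPT-4o / Claude), or hybrid with local fallback"
    picks = {
        "classification": "Llama 3.2 1B or Phi-4-mini",
        "extraction": "Llama 3.2 1B or Phi-4-mini",
        "summarization": "Qwen 3.5 9B",
        "chat": "Qwen 3.5 9B",
        "code": "Qwen 3.5 9B or Phi-4-mini",
    }
    return picks.get(task, "Mistral Small 3.1 (24B)")  # default: best local quality

print(choose_runtime("laptop", ram_gb=16))  # → 4-9B model, Ollama + GGUF
print(choose_model("chat"))                 # → Qwen 3.5 9B
```

Encoding the tree as code also makes it easy to revise: when next year's model releases shift the answers, you update one dictionary.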
Your Next Steps
What to do right now to start your local AI journey
Today
1. Install Ollama
   brew install ollama   # or download from ollama.com
2. Pull your first model
   ollama run qwen2.5:7b
3. Ask it something
   "Summarize this paragraph: ..."
4. Try the API
   curl localhost:11434/api/generate ...
This Week
5. Build a simple app
   Python + the ollama library
   Document summarizer or chatbot
6. Try local RAG
   ChromaDB + Ollama + your documents
7. Compare models
   Test 3-4 models on YOUR actual task
   Measure quality, speed, and RAM
This Month
8. Identify a production use case
   Classification? Extraction? Chat?
   Calculate cloud vs local cost
9. Build a proof of concept
   Hybrid architecture if needed
   Measure quality against a cloud baseline
10. Deploy
   Ollama on a Mac Mini or Linux server
   Monitor quality and performance
Course Complete
You now understand why small models matter (Ch 1), which models to choose (Ch 2), how they’re compressed (Ch 3–4), how to run them (Ch 5–6), how to build with them (Ch 7–8), and when to use them vs cloud (Ch 9). The future is local. Go build.
Key insight: The best way to learn local AI is to use it. Install Ollama today, build something this week, deploy something this month. The tools are ready, the models are good enough, and the cost is zero. The only thing missing is your first ollama run.