Ch 10 — The Future of Small Models

Where small models are heading — and your complete local AI toolkit
The Trend: Smaller AND Smarter
Each generation of small models matches the previous generation’s large models
The Compression Timeline
2023: GPT-4 level ≈ 1.8T params (cloud only)
2024: GPT-4 level ≈ 70B params (server GPU)
2025: GPT-4 level ≈ 9-14B params (laptop)
2026: GPT-4 level ≈ 3-4B params? (phone?)

Every 12-18 months, the same capability fits in a model 5-10x smaller.

Recent milestones:
- Gemma 3n E4B: LMArena >1300 (the first sub-10B model to do so)
- Qwen 3.5 9B: MMLU-Pro 82.5 (GPT-4's level in 2023)
- Phi-4-mini 3.8B: GSM8K 88.6% (MIT license)
Why This Is Happening
1. Better training data: Synthetic data from large models, curated datasets, quality over quantity.

2. Architecture innovations: Selective parameter activation (Gemma 3n), mixture of experts, efficient attention.

3. Better distillation: Multi-signal distillation, intermediate alignment, contrastive learning (Ch 4).

4. Competition: Meta, Google, Microsoft, Alibaba, Mistral all racing to make the best small model. Each release pushes the others.
Key insight: The gap between small and large models is closing at an accelerating rate. Today’s 9B model matches 2023’s frontier. By 2027, a phone-sized model (3–4B) may match today’s frontier for most tasks. The future of AI is small, local, and everywhere.
Speculative Decoding
Small model drafts, large model verifies — 2-3x faster generation
How It Works
Traditional generation:
- The large model generates 1 token at a time
- Each token costs a full forward pass (~50ms)
- 100 tokens = 100 forward passes = 5s

Speculative decoding:
1. A small model (1B) drafts 5 tokens quickly
2. The large model (70B) verifies all 5 in a single pass
3. Correct tokens are accepted, wrong ones rejected
4. Repeat

Result: 2-3x faster generation, because the large model processes multiple tokens in parallel during verification, and the small model's drafts are right ~70-80% of the time.
Why This Matters for Local AI
For cloud: Speculative decoding makes large models faster and cheaper to serve. Providers can handle more requests per GPU.

For local: You can pair a tiny model (1B, runs on CPU) with a medium model (7B, runs on GPU). The 1B drafts tokens while the 7B verifies, giving you full 7B output quality at roughly 2-3x the 7B model's usual generation speed.

Already supported: llama.cpp has speculative decoding built in. Ollama is adding support. This is production-ready technology.
Key insight: Speculative decoding is a “free lunch” — you get the same output quality as the large model but 2–3x faster. It works because small models agree with large models on most tokens (the “easy” ones). The large model only needs to think hard about the “surprising” tokens.
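The draft/verify loop can be sketched as a toy simulation. This is not a real inference engine: the "models" below are stand-in functions over a fixed token sequence, purely to show the control flow and why the acceptance rate drives the speedup.

```python
import random

random.seed(0)

TARGET = list("the quick brown fox jumps over the lazy dog")  # the "true" output

def draft_model(pos, k):
    """Stand-in draft model: proposes the next k tokens, right ~80% of the time."""
    return [TARGET[pos + i] if random.random() < 0.8 else "?"
            for i in range(min(k, len(TARGET) - pos))]

def target_verify(pos, proposed):
    """Stand-in target model: one parallel pass checks all drafted tokens,
    accepting the longest correct prefix and supplying one corrected token."""
    accepted = []
    for i, tok in enumerate(proposed):
        if tok == TARGET[pos + i]:
            accepted.append(tok)
        else:
            accepted.append(TARGET[pos + i])  # target fixes the first wrong token
            break
    return accepted

def speculative_decode(k=5):
    out, target_passes = [], 0
    while len(out) < len(TARGET):
        proposed = draft_model(len(out), k)
        target_passes += 1  # one big-model pass covers up to k tokens
        out.extend(target_verify(len(out), proposed))
    return "".join(out), target_passes

text, passes = speculative_decode()
print(text)  # identical to plain token-by-token decoding
print(f"{passes} target passes instead of {len(TARGET)}")
```

Note the output is exactly what plain decoding would produce; only the number of expensive target-model passes shrinks. Real implementations also sample an extra token from the verification pass, which this sketch omits.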
On-Device Fine-Tuning
Personalizing models locally — the model adapts to YOU
The Vision
Today, fine-tuning requires a GPU server. But on-device fine-tuning is coming: your phone or laptop continuously adapts the model to your writing style, vocabulary, and preferences.

Imagine a keyboard that doesn’t just predict common words — it predicts YOUR words. An email assistant that writes in YOUR tone. A code assistant that knows YOUR codebase conventions.
Current State
What works today:
✓ LoRA fine-tuning on a laptop GPU
✓ QLoRA (quantized LoRA) on 8GB VRAM
✓ Apple MLX fine-tuning on M-series Macs

What's coming:
→ On-phone LoRA adaptation
→ Federated learning (learn from many users without sharing raw data)
→ Continuous learning from user interactions
LoRA on a Laptop
# Fine-tune Qwen 2.5 7B with QLoRA
# Requires: ~8GB VRAM or 16GB unified memory
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (the "Q" in QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Attach small trainable LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
# Then train with trl's SFTTrainer on your dataset.
# Trainable params: ~4M (~0.06% of total)
# Training time: ~1 hour on an M2 Pro
Key insight: LoRA fine-tuning is already practical on consumer hardware. QLoRA (4-bit base + LoRA adapters) lets you fine-tune a 7B model on a 16GB laptop. The adapter is tiny (~10MB) and can be swapped in/out. This means personalized models without personalized hardware.
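Why is the adapter so small? Instead of learning a full d×d weight delta, LoRA learns two thin matrices A (r×d) and B (d×r) and applies W' = W + (alpha/r)·B·A. A minimal numpy sketch with illustrative dimensions (d=4096, r=16, matching the config above; not a real model):

```python
import numpy as np

d, r, alpha = 4096, 16, 32  # hidden size, LoRA rank, scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))        # frozen base weight (e.g. q_proj)
A = rng.standard_normal((r, d)) * 0.01  # trainable, initialized small
B = np.zeros((d, r))                    # trainable, initialized to zero

# Effective weight at inference: because B starts at zero, W_eff == W
# before any training, so the adapter begins as a no-op.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"full delta: {full_params:,} params; "
      f"LoRA delta: {lora_params:,} params "
      f"({100 * lora_params / full_params:.2f}% per adapted matrix)")
```

For this single matrix the adapter is under 1% of the full delta; across a whole model, where only a few projections per layer get adapters, the trainable fraction drops to the ~0.06% quoted above.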
Hardware Trends: NPUs Everywhere
Every new chip has dedicated AI hardware — the infrastructure is being built
The NPU Revolution
Apple Neural Engine
  M4: 38 TOPS | A18 Pro: 35 TOPS
  In: every Mac, iPhone, and iPad since 2020

Qualcomm Hexagon NPU
  Snapdragon 8 Elite: 45 TOPS
  In: Samsung Galaxy S25, OnePlus 13

Intel NPU
  Lunar Lake: 48 TOPS
  In: latest Intel laptops

AMD XDNA
  Ryzen AI: 50 TOPS
  In: latest AMD laptops

NVIDIA (discrete GPU)
  RTX 4090: 1,321 AI TOPS
  RTX 5090: ~2,000 TOPS (FP4)
What This Means
Every device sold today has AI hardware. Phones, laptops, tablets, even some IoT devices have dedicated neural processing units. The hardware is ahead of the software.

The bottleneck is software: NPU support in inference engines (llama.cpp, ExecuTorch) is still maturing. Most local AI today runs on GPU or CPU, not NPU. As NPU support improves, local AI will get 2–5x faster on the same hardware.

Windows Copilot+ PCs: Microsoft requires 40+ TOPS NPU for the “Copilot+” label. This is pushing NPU adoption across the PC industry.
Key insight: The hardware for local AI is already in your pocket and on your desk. NPUs with 35–50 TOPS are in every flagship phone and modern laptop. As software catches up (better NPU backends in llama.cpp, ExecuTorch), local AI will get dramatically faster without any new hardware purchases.
The Convergence: Every Device Becomes an AI Device
The end state: AI is ambient, local, and always available
The Vision
2023: AI = cloud API. You send data to a server, get a response.

2025: AI = local option. You can run models on your laptop. Edge deployment is emerging.

2027+: AI = ambient. Every device has a capable model. Your phone, laptop, car, smart home, watch — all running local AI. Cloud is the fallback for hard tasks, not the default.

This isn’t science fiction. The hardware exists. The models are getting small enough. The software ecosystem is maturing. We’re in the early innings of this transition.
What Changes
For Users:
- AI works offline, instantly, privately
- No subscriptions for basic AI features
- Personalized models that know you

For Developers:
- AI is a local library, not an API call
- No rate limits, no API keys, no per-token costs
- Ship AI features without cloud infrastructure

For Businesses:
- AI processing at zero marginal cost
- Complete data sovereignty
- No vendor lock-in to AI providers

For Society:
- AI access doesn't require internet
- Developing regions get AI too
- Privacy becomes the default, not an opt-in
Key insight: The shift from cloud-first to local-first AI is as significant as the shift from mainframes to personal computers. It democratizes AI access, eliminates the cost barrier, and makes privacy the default. You’re learning these skills at exactly the right time.
Your Local AI Toolkit
Everything you learned in this course, organized for reference
Models
Tiny (1-3B) — Phone/Edge:
- Llama 3.2 1B, 3B
- Gemma 2B

Small (4-9B) — Laptop:
- Qwen 3.5 4B, 9B ← best quality/size
- Gemma 3 4B
- Phi-4-mini 3.8B ← MIT license

Medium (14-24B) — Desktop/Server:
- Qwen 3.5 14B
- Mistral Small 3.1 (24B)
Quantization
Default: Q4_K_M (best balance)
Quality: Q5_K_M (if RAM allows)
Max: Q8_0 (near-lossless)
Format: GGUF (universal)
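A quick back-of-envelope for these choices: memory ≈ parameter count × bits per weight ÷ 8, plus some allowance for the KV cache and runtime. The bits-per-weight figures below are approximate averages for each GGUF scheme (assumed values: ~4.8 for Q4_K_M, ~5.7 for Q5_K_M, 8.5 for Q8_0), so treat the results as estimates, not exact file sizes.

```python
def model_ram_gb(params_billion, bits_per_weight, overhead_gb=1.0):
    """Rough RAM estimate: quantized weights plus a flat KV-cache/runtime allowance."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Approximate average bits/weight per scheme (assumed, for illustration)
SCHEMES = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

for name, bits in SCHEMES.items():
    print(f"7B @ {name}: ~{model_ram_gb(7, bits):.1f} GB")
```

This is why a 7B model at Q4_K_M fits comfortably in 8GB of RAM while the same model at FP16 does not.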
Tools
Daily driver: Ollama
Power user: llama.cpp
Mobile: ExecuTorch
Browser: WebLLM
Cross-platform: ONNX Runtime
Vector store: ChromaDB
Framework: LangChain / LlamaIndex
Fine-tuning: QLoRA + PEFT
Architecture
Simple: App → Ollama API → Model
RAG: App → ChromaDB → Ollama
Hybrid: Router → Local / Cloud
Edge: ExecuTorch / WebLLM
Key insight: You now have a complete toolkit. The specific model names and version numbers will change — new models drop monthly. But the concepts (quantization, distillation, GGUF, Ollama, hybrid routing) are stable. You have the framework to evaluate and adopt whatever comes next.
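The simplest of those architectures (App → Ollama API → Model) is one HTTP POST. The sketch below uses only the standard library; the endpoint and field names follow Ollama's /api/generate API, and the payload construction is factored out so it can be checked without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Encode an Ollama /api/generate payload (stream=False: single JSON reply)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """POST to a local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama with the model pulled):
#   generate("qwen2.5:7b", "Summarize: local AI runs on your own hardware.")
```

That's the whole "cloud API" surface area, running on localhost with no key, no billing, and no rate limit.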
The Complete Decision Tree
From “I want to use AI” to “here’s exactly what to deploy”
Step 1: Where Does It Run?
Phone/Browser?
→ 1-3B model
→ ExecuTorch (mobile) / WebLLM (web)
→ Q4_K_M quantization

Laptop/Desktop?
→ 4-9B model (8-16GB RAM)
→ 14-24B model (24GB+ RAM)
→ Ollama + GGUF

Server?
→ 24-70B model
→ Ollama or vLLM
→ Consider cloud if volume is low
Step 2: Which Model?
Classification/Extraction:
→ Llama 3.2 1B or Phi-4-mini

Summarization/Chat:
→ Qwen 3.5 9B

Code:
→ Qwen 3.5 9B or Phi-4-mini

Best quality possible (local):
→ Mistral Small 3.1 (24B)

Need frontier reasoning:
→ Cloud (GPT-4o / Claude)
→ Or hybrid: local with cloud fallback
Key insight: This decision tree covers 95% of local AI deployment scenarios. Start with Ollama + Qwen 3.5 9B on your laptop. That’s your baseline. Adjust from there based on your specific constraints (hardware, task, privacy, volume).
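The two steps above collapse into a pair of small lookup functions. This is a sketch of the chapter's rules, not a library: the names and thresholds come straight from the tree, with "frontier reasoning" treated as a boolean flag.

```python
def choose_runtime(device: str, ram_gb: int = 16) -> str:
    """Step 1: where does it run?"""
    if device == "phone":
        return "1-3B model, ExecuTorch, Q4_K_M"
    if device == "browser":
        return "1-3B model, WebLLM, Q4_K_M"
    if device in ("laptop", "desktop"):
        size = "14-24B" if ram_gb >= 24 else "4-9B"
        return f"{size} model, Ollama + GGUF"
    if device == "server":
        return "24-70B model, Ollama or vLLM"
    raise ValueError(f"unknown device: {device!r}")

def choose_model(task: str, needs_frontier: bool = False) -> str:
    """Step 2: which model?"""
    if needs_frontier:
        return "cloud (GPT-4o / Claude), or hybrid with local fallback"
    picks = {
        "classification": "Llama 3.2 1B or Phi-4-mini",
        "extraction": "Llama 3.2 1B or Phi-4-mini",
        "summarization": "Qwen 3.5 9B",
        "chat": "Qwen 3.5 9B",
        "code": "Qwen 3.5 9B or Phi-4-mini",
    }
    return picks.get(task, "Mistral Small 3.1 (24B)")  # default: best local quality

print(choose_runtime("laptop", ram_gb=16))  # → 4-9B model, Ollama + GGUF
print(choose_model("chat"))                 # → Qwen 3.5 9B
```

Encoding the tree as code also makes it easy to revise: when next year's model releases shift the answers, you update one dictionary.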
Your Next Steps
What to do right now to start your local AI journey
Today
1. Install Ollama
   brew install ollama   # or download from ollama.com
2. Pull your first model
   ollama run qwen2.5:7b
3. Ask it something
   "Summarize this paragraph: ..."
4. Try the API
   curl localhost:11434/api/generate ...
This Week
5. Build a simple app
   Python + the ollama library
   Document summarizer or chatbot
6. Try local RAG
   ChromaDB + Ollama + your documents
7. Compare models
   Test 3-4 models on YOUR actual task
   Measure quality, speed, and RAM
This Month
8. Identify a production use case
   Classification? Extraction? Chat?
   Calculate cloud vs local cost
9. Build a proof of concept
   Hybrid architecture if needed
   Measure quality against a cloud baseline
10. Deploy
   Ollama on a Mac Mini or Linux server
   Monitor quality and performance
Course Complete
You now understand why small models matter (Ch 1), which models to choose (Ch 2), how they’re compressed (Ch 3–4), how to run them (Ch 5–6), how to build with them (Ch 7–8), and when to use them vs cloud (Ch 9). The future is local. Go build.
Key insight: The best way to learn local AI is to use it. Install Ollama today, build something this week, deploy something this month. The tools are ready, the models are good enough, and the cost is zero. The only thing missing is your first ollama run.