Ch 1 — Why Small Models Matter

The cost, latency, and privacy case for running AI on your own hardware
Foundation
Roadmap: Cloud Problem → Cost Math → Latency → Privacy → Good Enough → 3B vs 70B → Landscape → When to Go Local
The Cloud API Problem
Every token you send costs money, adds latency, and leaves your control
The Three Costs of Cloud AI
Every time you call GPT-4o, Claude, or Gemini through an API, three things happen:

1. You pay per token. GPT-4o costs $2.50 per million input tokens, $10 per million output tokens. A customer support bot handling 10,000 conversations/day can cost $3,000–$10,000/month.

2. You add latency. Network round-trip (50–200ms) + queue wait (0–2s) + generation time. Your user stares at a spinner.

3. Your data leaves your network. Every prompt, every customer message, every document — sent to a third-party server. For healthcare, finance, or legal, this can be a compliance violation.
The Scale Problem
Monthly cost at scale (GPT-4o):

  1,000 requests/day      ~500K tokens/day   →  ~$50/month
  10,000 requests/day     ~5M tokens/day     →  ~$500/month
  100,000 requests/day    ~50M tokens/day    →  ~$5,000/month
  1M requests/day         ~500M tokens/day   →  ~$50,000/month

And this is just one model. Add embeddings, reranking, and evaluation — costs multiply.
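The arithmetic behind the table is simple enough to sketch. The token mix per request (400 input, 100 output) is an illustrative assumption, not a figure from the table, so the estimates land in the same order of magnitude rather than matching exactly:

```python
# Back-of-envelope monthly-cost model at GPT-4o list prices.
# The 400-in / 100-out token mix per request is an assumption.
PRICE_IN = 2.50 / 1_000_000    # $ per input token
PRICE_OUT = 10.00 / 1_000_000  # $ per output token

def monthly_cost(requests_per_day, tokens_in=400, tokens_out=100, days=30):
    """Estimated monthly bill for a given request volume."""
    per_request = tokens_in * PRICE_IN + tokens_out * PRICE_OUT
    return requests_per_day * days * per_request

for rpd in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{rpd:>9,} req/day -> ~${monthly_cost(rpd):,.0f}/month")
```

The point the sketch makes is structural: the bill scales linearly with request volume, so every 10x in traffic is a 10x in cost.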
Key insight: Cloud APIs are the easiest way to start with AI, but the hardest way to scale. The per-token cost model means your AI bill grows linearly with usage. Local models flip this: high upfront cost (hardware), near-zero marginal cost per request.
The Cost Math: Cloud vs Local
When does running your own model break even?
Cloud API Costs (2025 Pricing)
GPT-4o              Input: $2.50 / 1M tokens    Output: $10.00 / 1M tokens
GPT-4o-mini         Input: $0.15 / 1M tokens    Output: $0.60 / 1M tokens
Claude 3.5 Sonnet   Input: $3.00 / 1M tokens    Output: $15.00 / 1M tokens
Claude 3.5 Haiku    Input: $0.80 / 1M tokens    Output: $4.00 / 1M tokens
Local Model Costs
Hardware (one-time):
  MacBook M2 Pro (16GB): ~$2,000
  Gaming PC + RTX 4090: ~$2,500
  Mac Studio M2 Ultra: ~$4,000

Running cost:
  Electricity: ~$5–15/month
  Maintenance: $0
  Per-token cost: $0.00

Break-even: at $500/month cloud spend, local pays for itself in 4–5 months.
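The break-even claim can be checked with one division. This sketch assumes the local model fully replaces the cloud workload and that electricity is the only recurring cost:

```python
def breakeven_months(hardware_cost, monthly_cloud_bill, electricity=10.0):
    """Months until a one-time hardware purchase beats an ongoing API bill."""
    savings = monthly_cloud_bill - electricity
    if savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / savings

# The figures above: ~$2,500 of hardware vs a $500/month cloud bill.
print(f"break-even in {breakeven_months(2_500, 500):.1f} months")  # ~5.1 months
```

Run the same function with a $20/month bill and you get the flip side of the argument: at low volume, the API is the cheaper option.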
Key insight: If you’re spending more than $200/month on API calls for tasks that a 7B–9B model can handle (classification, extraction, summarization, simple chat), local deployment pays for itself within months. The marginal cost of each additional request is essentially zero.
Latency: Local Is Instant
No network, no queue, no cold start — tokens start flowing immediately
Cloud API Latency Breakdown
Time to first token (cloud):
  DNS + TLS handshake:    50–100ms
  Network round-trip:     30–200ms
  Queue wait (peak):      0–5,000ms
  Model loading (cold):   0–2,000ms
  Prefill (prompt proc):  100–500ms
  ─────────────────────────────
  Total TTFT:             200ms – 8s

Time to first token (local):
  Model already loaded:   0ms
  No network:             0ms
  Prefill:                50–200ms
  ─────────────────────────────
  Total TTFT:             50–200ms
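TTFT is easy to measure yourself: time how long the first streamed token takes to arrive. The sketch below uses a fake generator in place of a real streaming client (a local server such as Ollama streams tokens over HTTP in much the same way), so the delay value is simulated:

```python
import time

def time_to_first_token(token_stream):
    """Seconds from consuming the stream to the first token arriving."""
    start = time.perf_counter()
    first = next(token_stream)   # blocks through network, queue, and prefill
    return first, time.perf_counter() - start

# Stand-in for a real streaming client; the prefill delay is simulated.
def fake_stream(prefill_s=0.05):
    time.sleep(prefill_s)        # pretend prompt processing
    yield "Hello"
    yield ","

token, ttft = time_to_first_token(fake_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```

Point the same helper at a real streaming response and you can reproduce the cloud-vs-local gap in the table above on your own hardware.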
Throughput Comparison
Cloud (GPT-4o): ~80–120 tokens/sec output, but shared infrastructure means variable performance. During peak hours, you might wait seconds just to start.

Local (Llama 3.2 3B on M2 Pro): ~40–60 tokens/sec. Slower per-token, but consistent. No queue, no variability, no rate limits.

Local (Qwen 3.5 9B on RTX 4090): ~80–100 tokens/sec. Comparable to cloud, zero latency overhead.
Key insight: For interactive applications (autocomplete, real-time suggestions, IDE assistants), latency matters more than raw throughput. A local 3B model that responds in 50ms feels instant. A cloud model that responds in 2 seconds feels broken — even if the answer is better.
Privacy: Your Data Never Leaves
GDPR, HIPAA, air-gapped environments — local models solve the compliance problem
The Data Problem
When you send data to a cloud API:

1. Data in transit: Your prompt travels over the internet, encrypted but still leaving your network.

2. Data at rest: The provider may log requests for abuse monitoring, debugging, or training (opt-out varies by provider).

3. Jurisdiction: Your data may be processed in a different country. EU customer data processed on US servers? GDPR violation.

4. Third-party risk: Provider gets breached? Your data is exposed.
Industries That Need Local
Healthcare (HIPAA): patient records, clinical notes, diagnostic summaries — cannot leave the hospital network
Finance (SOC 2, PCI-DSS): transaction data, fraud detection, customer financial records
Legal (attorney-client privilege): case documents, contracts, legal analysis — privileged information
Government / Defense: classified or sensitive data on air-gapped networks — no internet
Enterprise (internal IP): source code, trade secrets, internal communications, M&A documents
Key insight: For many organizations, the question isn’t “is the cloud model better?” but “are we allowed to use it?” Local models remove the compliance question entirely. Your data stays on your hardware, in your jurisdiction, under your control.
The “Good Enough” Threshold
95% of GPT-4 quality at 5% of the cost — for the right tasks
Not All Tasks Need GPT-4
Most production AI tasks fall into a few categories. Many of them don’t need a 200B+ parameter frontier model:

Classification: “Is this email spam?” “What category is this ticket?” A 3B model handles this at 95%+ accuracy.

Extraction: “Pull the name, date, and amount from this invoice.” Structured extraction works well with small models + JSON mode.

Summarization: “Summarize this 2-page document in 3 bullets.” A 7B model produces summaries nearly indistinguishable from GPT-4.

Simple chat: FAQ bots, internal knowledge assistants, customer greeting flows.
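For the extraction case, the working pattern is to ask the model for JSON and validate strictly on your side. This sketch stubs the model call with a canned reply (the field names come from the invoice example above; the values are invented); a real version would get `raw` from a small model running in JSON mode:

```python
import json

REQUIRED_FIELDS = {"name", "date", "amount"}

def parse_invoice_reply(raw: str) -> dict:
    """Validate a model's JSON-mode reply for the invoice-extraction task.
    Raises ValueError on malformed JSON or missing fields."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"reply missing fields: {sorted(missing)}")
    return data

# Stand-in for a small model's JSON-mode reply:
reply = '{"name": "Acme Corp", "date": "2025-03-14", "amount": 1250.00}'
invoice = parse_invoice_reply(reply)
print(invoice["name"], invoice["amount"])
```

The validation step matters more with small models than with frontier ones: when the model occasionally drops a field, you want a retry, not a silent bad record.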
Where You Still Need Frontier Models
Complex reasoning: multi-step math, logic puzzles, scientific analysis — frontier models are significantly better
Creative writing: nuanced tone, style matching, long-form coherent narratives
Code generation: complex multi-file refactoring, architectural decisions (though small models handle simple code well)
Multilingual nuance: low-resource languages, cultural context, idiomatic translation
Key insight: The question isn’t “is the small model as good as GPT-4?” It’s “is the small model good enough for THIS task?” For classification, extraction, and simple generation, the answer is almost always yes. Match the model to the task, not the other way around.
When 3B Beats 70B
Task-specific small models can outperform general-purpose large models
The Specialization Advantage
A general-purpose 70B model knows a little about everything. A fine-tuned 3B model can know a LOT about one thing.

Example: A 3B model fine-tuned on 10,000 medical discharge summaries will outperform GPT-4 at writing discharge summaries — because it has seen the exact format, terminology, and patterns thousands of times.

This is the same principle as hiring a specialist vs a generalist. The generalist is more versatile, but the specialist is better at their specific job.
Real Benchmarks
Qwen 3.5 9B vs GPT-4o on benchmarks:
  MMLU-Pro:   82.5 vs 88.0  (93.7%)
  HumanEval:  ~75 vs ~90    (83.3%)
  GSM8K:      ~88 vs ~95    (92.6%)

Gemma 3 4B (just 4 billion params!):
  GSM8K:      89.2%
  HumanEval:  71.3%
  ARC-C:      ~80%

A 4B model achieving 89% on math reasoning — running on 3GB of RAM.
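The parenthesized percentages read as the small model's score divided by GPT-4o's score on the same benchmark. A quick check with the figures copied from the table (rounding may differ from the table by a tenth of a point):

```python
# Relative scores from the table: (Qwen 3.5 9B, GPT-4o) per benchmark.
scores = {
    "MMLU-Pro":  (82.5, 88.0),
    "HumanEval": (75, 90),
    "GSM8K":     (88, 95),
}
for bench, (small, large) in scores.items():
    print(f"{bench}: {small / large:.1%} of GPT-4o")
```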
Key insight: Small models in 2025–2026 are dramatically better than small models from even a year ago. Qwen 3.5 9B scores 82.5 on MMLU-Pro — that would have been frontier-level performance in 2023. The gap between small and large is closing fast.
The Small Model Landscape (Preview)
The key players you’ll meet throughout this course
Model Families
Meta — Llama 3.2 (1B, 3B params | open-weight): great for edge/mobile deployment
Google — Gemma 3 (4B params | open-weight): punches above its weight on reasoning
Microsoft — Phi-4-mini (3.8B params | open-weight): strong on math and code
Alibaba — Qwen 3.5 (4B, 9B params | open-weight): current king of the <10B leaderboard
Mistral — Mistral Small 3.1 (24B params | open-weight): best "medium" model, fits on a 16GB GPU
The Tools
Ollama: One-command model management. Pull, run, serve. The Docker of local AI. (Ch 5)

llama.cpp: The C++ inference engine under Ollama. Convert, quantize, serve models. (Ch 6)

GGUF: The file format for quantized models. Single file, runs anywhere. (Ch 3, 6)

ExecuTorch: PyTorch’s framework for deploying to phones and edge devices. (Ch 8)

WebLLM: Run models directly in the browser using WebGPU. (Ch 8)
Key insight: The small model ecosystem is mature and growing fast. You don’t need to train models — you need to choose the right one, quantize it, and deploy it. This course will teach you exactly that.
When to Go Local: The Decision Checklist
A quick framework for deciding cloud vs local
Go Local When
- Task is well-defined (classification, extraction, summarization, simple chat)
- Data is sensitive (PII, medical, financial, legal, internal IP)
- Volume is high (>1,000 requests/day and growing)
- Latency matters (real-time, interactive, autocomplete, IDE integration)
- Offline needed (air-gapped, mobile, unreliable internet)
- Cost predictability matters (fixed hardware budget vs variable API bill)
Stay on Cloud When
- Task requires frontier reasoning (complex math, multi-step logic)
- Volume is low (<100 requests/day — API is cheaper than hardware)
- You need the latest model immediately (cloud gets new models first)
- No ML engineering capacity (cloud is zero-ops, local needs some setup)
- Multi-modal workloads (vision + audio + text — local support is still catching up)
Key insight: The best strategy is often hybrid: local models for high-volume, well-defined tasks; cloud models for complex, low-volume tasks. We’ll build this routing pattern in Chapter 9. For now, remember: local isn’t about replacing cloud — it’s about using the right tool for the right job.
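The hybrid strategy can be summarized as a tiny routing function. This is a toy: the task labels and the context-size threshold are illustrative choices, not figures from the checklist, and Chapter 9 builds the real version:

```python
def route(task_type: str, prompt_tokens: int) -> str:
    """Toy cloud-vs-local router following the checklist above.
    Task labels and the 8K-token threshold are illustrative assumptions."""
    LOCAL_TASKS = {"classification", "extraction", "summarization", "simple_chat"}
    if task_type in LOCAL_TASKS and prompt_tokens < 8_000:
        return "local"   # well-defined task that fits a small model
    return "cloud"       # frontier reasoning, or input too large

print(route("classification", 500))   # -> local
print(route("reasoning", 500))        # -> cloud
```

Even this crude split captures the economics: the high-volume, well-defined traffic goes to the near-zero-marginal-cost path, and only the hard cases pay cloud prices.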