Ch 1 — Why Small Models Matter

The cost, latency, and privacy case for running AI on your own hardware
Foundation
Roadmap: Cloud Problem → Cost Math → Latency → Privacy → Good Enough → 3B vs 70B → Landscape → When to Go Local
The Cloud API Problem
Every token you send costs money, adds latency, and leaves your control
The Three Costs of Cloud AI
Every time you call GPT-4o, Claude, or Gemini through an API, three things happen:

1. You pay per token. GPT-4o costs $2.50 per million input tokens, $10 per million output tokens. A customer support bot handling 10,000 conversations/day can cost $3,000–$10,000/month.

2. You add latency. Network round-trip (50–200ms) + queue wait (0–2s) + generation time. Your user stares at a spinner.

3. Your data leaves your network. Every prompt, every customer message, every document — sent to a third-party server. For healthcare, finance, or legal, this can be a compliance violation.
The Scale Problem
Monthly cost at scale (GPT-4o):

  1,000 requests/day      ~500K tokens/day   →  ~$50/month
  10,000 requests/day     ~5M tokens/day     →  ~$500/month
  100,000 requests/day    ~50M tokens/day    →  ~$5,000/month
  1M requests/day         ~500M tokens/day   →  ~$50,000/month

And this is just one model. Add embeddings, reranking, and evaluation — costs multiply.
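The arithmetic behind the table is simple enough to sketch. The token mix per request (400 input, 100 output) is an illustrative assumption, not a figure from the table, so the estimates land in the same order of magnitude rather than matching exactly:

```python
# Back-of-envelope monthly-cost model at GPT-4o list prices.
# The 400-in / 100-out token mix per request is an assumption.
PRICE_IN = 2.50 / 1_000_000    # $ per input token
PRICE_OUT = 10.00 / 1_000_000  # $ per output token

def monthly_cost(requests_per_day, tokens_in=400, tokens_out=100, days=30):
    """Estimated monthly bill for a given request volume."""
    per_request = tokens_in * PRICE_IN + tokens_out * PRICE_OUT
    return requests_per_day * days * per_request

for rpd in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{rpd:>9,} req/day -> ~${monthly_cost(rpd):,.0f}/month")
```

The point the sketch makes is structural: the bill scales linearly with request volume, so every 10x in traffic is a 10x in cost.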
Key insight: Cloud APIs are the easiest way to start with AI, but the hardest way to scale. The per-token cost model means your AI bill grows linearly with usage. Local models flip this: high upfront cost (hardware), near-zero marginal cost per request.
The Cost Math: Cloud vs Local
When does running your own model break even?
Cloud API Costs (2025 Pricing)
GPT-4o              Input: $2.50 / 1M tokens    Output: $10.00 / 1M tokens
GPT-4o-mini         Input: $0.15 / 1M tokens    Output: $0.60 / 1M tokens
Claude 3.5 Sonnet   Input: $3.00 / 1M tokens    Output: $15.00 / 1M tokens
Claude 3.5 Haiku    Input: $0.80 / 1M tokens    Output: $4.00 / 1M tokens
Local Model Costs
Hardware (one-time):
  MacBook M2 Pro (16GB): ~$2,000
  Gaming PC + RTX 4090: ~$2,500
  Mac Studio M2 Ultra: ~$4,000

Running cost:
  Electricity: ~$5–15/month
  Maintenance: $0
  Per-token cost: $0.00

Break-even: at $500/month cloud spend, local pays for itself in 4–5 months.
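The break-even claim can be checked with one division. This sketch assumes the local model fully replaces the cloud workload and that electricity is the only recurring cost:

```python
def breakeven_months(hardware_cost, monthly_cloud_bill, electricity=10.0):
    """Months until a one-time hardware purchase beats an ongoing API bill."""
    savings = monthly_cloud_bill - electricity
    if savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / savings

# The figures above: ~$2,500 of hardware vs a $500/month cloud bill.
print(f"break-even in {breakeven_months(2_500, 500):.1f} months")  # ~5.1 months
```

Run the same function with a $20/month bill and you get the flip side of the argument: at low volume, the API is the cheaper option.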
Key insight: If you’re spending more than $200/month on API calls for tasks that a 7B–9B model can handle (classification, extraction, summarization, simple chat), local deployment pays for itself within months. The marginal cost of each additional request is essentially zero.
Latency: Local Is Instant
No network, no queue, no cold start — tokens start flowing immediately
Cloud API Latency Breakdown
Time to first token (cloud):
  DNS + TLS handshake:    50–100ms
  Network round-trip:     30–200ms
  Queue wait (peak):      0–5,000ms
  Model loading (cold):   0–2,000ms
  Prefill (prompt proc):  100–500ms
  ─────────────────────────────
  Total TTFT:             200ms – 8s

Time to first token (local):
  Model already loaded:   0ms
  No network:             0ms
  Prefill:                50–200ms
  ─────────────────────────────
  Total TTFT:             50–200ms
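TTFT is easy to measure yourself: time how long the first streamed token takes to arrive. The sketch below uses a fake generator in place of a real streaming client (a local server such as Ollama streams tokens over HTTP in much the same way), so the delay value is simulated:

```python
import time

def time_to_first_token(token_stream):
    """Seconds from consuming the stream to the first token arriving."""
    start = time.perf_counter()
    first = next(token_stream)   # blocks through network, queue, and prefill
    return first, time.perf_counter() - start

# Stand-in for a real streaming client; the prefill delay is simulated.
def fake_stream(prefill_s=0.05):
    time.sleep(prefill_s)        # pretend prompt processing
    yield "Hello"
    yield ","

token, ttft = time_to_first_token(fake_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```

Point the same helper at a real streaming response and you can reproduce the cloud-vs-local gap in the table above on your own hardware.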
Throughput Comparison
Cloud (GPT-4o): ~80–120 tokens/sec output, but shared infrastructure means variable performance. During peak hours, you might wait seconds just to start.

Local (Llama 3.2 3B on M2 Pro): ~40–60 tokens/sec. Slower per-token, but consistent. No queue, no variability, no rate limits.

Local (Qwen 3.5 9B on RTX 4090): ~80–100 tokens/sec. Comparable to cloud, zero latency overhead.
Key insight: For interactive applications (autocomplete, real-time suggestions, IDE assistants), latency matters more than raw throughput. A local 3B model that responds in 50ms feels instant. A cloud model that responds in 2 seconds feels broken — even if the answer is better.
Privacy: Your Data Never Leaves
GDPR, HIPAA, air-gapped environments — local models solve the compliance problem
The Data Problem
When you send data to a cloud API:

1. Data in transit: Your prompt travels over the internet, encrypted but still leaving your network.

2. Data at rest: The provider may log requests for abuse monitoring, debugging, or training (opt-out varies by provider).

3. Jurisdiction: Your data may be processed in a different country. EU customer data processed on US servers? GDPR violation.

4. Third-party risk: Provider gets breached? Your data is exposed.
Industries That Need Local
Healthcare (HIPAA): patient records, clinical notes, diagnostic summaries — cannot leave the hospital network
Finance (SOC 2, PCI-DSS): transaction data, fraud detection, customer financial records
Legal (attorney-client privilege): case documents, contracts, legal analysis — privileged information
Government / Defense: classified or sensitive data on air-gapped networks — no internet
Enterprise (internal IP): source code, trade secrets, internal communications, M&A documents
Key insight: For many organizations, the question isn’t “is the cloud model better?” but “are we allowed to use it?” Local models remove the compliance question entirely. Your data stays on your hardware, in your jurisdiction, under your control.
The “Good Enough” Threshold
95% of GPT-4 quality at 5% of the cost — for the right tasks
Not All Tasks Need GPT-4
Most production AI tasks fall into a few categories. Many of them don’t need a 200B+ parameter frontier model:

Classification: “Is this email spam?” “What category is this ticket?” A 3B model handles this at 95%+ accuracy.

Extraction: “Pull the name, date, and amount from this invoice.” Structured extraction works well with small models + JSON mode.

Summarization: “Summarize this 2-page document in 3 bullets.” A 7B model produces summaries nearly indistinguishable from GPT-4.

Simple chat: FAQ bots, internal knowledge assistants, customer greeting flows.
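For the extraction case, the working pattern is to ask the model for JSON and validate strictly on your side. This sketch stubs the model call with a canned reply (the field names come from the invoice example above; the values are invented); a real version would get `raw` from a small model running in JSON mode:

```python
import json

REQUIRED_FIELDS = {"name", "date", "amount"}

def parse_invoice_reply(raw: str) -> dict:
    """Validate a model's JSON-mode reply for the invoice-extraction task.
    Raises ValueError on malformed JSON or missing fields."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"reply missing fields: {sorted(missing)}")
    return data

# Stand-in for a small model's JSON-mode reply:
reply = '{"name": "Acme Corp", "date": "2025-03-14", "amount": 1250.00}'
invoice = parse_invoice_reply(reply)
print(invoice["name"], invoice["amount"])
```

The validation step matters more with small models than with frontier ones: when the model occasionally drops a field, you want a retry, not a silent bad record.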
Where You Still Need Frontier Models
Complex reasoning: multi-step math, logic puzzles, scientific analysis — frontier models are significantly better
Creative writing: nuanced tone, style matching, long-form coherent narratives
Code generation: complex multi-file refactoring, architectural decisions (though small models handle simple code well)
Multilingual nuance: low-resource languages, cultural context, idiomatic translation
Key insight: The question isn’t “is the small model as good as GPT-4?” It’s “is the small model good enough for THIS task?” For classification, extraction, and simple generation, the answer is almost always yes. Match the model to the task, not the other way around.
When 3B Beats 70B
Task-specific small models can outperform general-purpose large models
The Specialization Advantage
A general-purpose 70B model knows a little about everything. A fine-tuned 3B model can know a LOT about one thing.

Example: A 3B model fine-tuned on 10,000 medical discharge summaries will outperform GPT-4 at writing discharge summaries — because it has seen the exact format, terminology, and patterns thousands of times.

This is the same principle as hiring a specialist vs a generalist. The generalist is more versatile, but the specialist is better at their specific job.
Real Benchmarks
Qwen 3.5 9B vs GPT-4o on benchmarks:
  MMLU-Pro:   82.5 vs 88.0  (93.7%)
  HumanEval:  ~75 vs ~90    (83.3%)
  GSM8K:      ~88 vs ~95    (92.6%)

Gemma 3 4B (just 4 billion params!):
  GSM8K:      89.2%
  HumanEval:  71.3%
  ARC-C:      ~80%

A 4B model achieving 89% on math reasoning — running on 3GB of RAM.
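The parenthesized percentages read as the small model's score divided by GPT-4o's score on the same benchmark. A quick check with the figures copied from the table (rounding may differ from the table by a tenth of a point):

```python
# Relative scores from the table: (Qwen 3.5 9B, GPT-4o) per benchmark.
scores = {
    "MMLU-Pro":  (82.5, 88.0),
    "HumanEval": (75, 90),
    "GSM8K":     (88, 95),
}
for bench, (small, large) in scores.items():
    print(f"{bench}: {small / large:.1%} of GPT-4o")
```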
Key insight: Small models in 2025–2026 are dramatically better than small models from even a year ago. Qwen 3.5 9B scores 82.5 on MMLU-Pro — that would have been frontier-level performance in 2023. The gap between small and large is closing fast.
The Small Model Landscape (Preview)
The key players you’ll meet throughout this course
Model Families
Meta — Llama 3.2 (1B, 3B params | open-weight): great for edge/mobile deployment
Google — Gemma 3 (4B params | open-weight): punches above its weight on reasoning
Microsoft — Phi-4-mini (3.8B params | open-weight): strong on math and code
Alibaba — Qwen 3.5 (4B, 9B params | open-weight): current king of the <10B leaderboard
Mistral — Mistral Small 3.1 (24B params | open-weight): best "medium" model, fits on a 16GB GPU
The Tools
Ollama: One-command model management. Pull, run, serve. The Docker of local AI. (Ch 5)

llama.cpp: The C++ inference engine under Ollama. Convert, quantize, serve models. (Ch 6)

GGUF: The file format for quantized models. Single file, runs anywhere. (Ch 3, 6)

ExecuTorch: PyTorch’s framework for deploying to phones and edge devices. (Ch 8)

WebLLM: Run models directly in the browser using WebGPU. (Ch 8)
Key insight: The small model ecosystem is mature and growing fast. You don’t need to train models — you need to choose the right one, quantize it, and deploy it. This course will teach you exactly that.
When to Go Local: The Decision Checklist
A quick framework for deciding cloud vs local
Go Local When
- Task is well-defined (classification, extraction, summarization, simple chat)
- Data is sensitive (PII, medical, financial, legal, internal IP)
- Volume is high (>1,000 requests/day and growing)
- Latency matters (real-time, interactive, autocomplete, IDE integration)
- Offline needed (air-gapped, mobile, unreliable internet)
- Cost predictability matters (fixed hardware budget vs variable API bill)
Stay on Cloud When
- Task requires frontier reasoning (complex math, multi-step logic)
- Volume is low (<100 requests/day — API is cheaper than hardware)
- You need the latest model immediately (cloud gets new models first)
- No ML engineering capacity (cloud is zero-ops, local needs some setup)
- Multi-modal workloads (vision + audio + text — local support is still catching up)
Key insight: The best strategy is often hybrid: local models for high-volume, well-defined tasks; cloud models for complex, low-volume tasks. We’ll build this routing pattern in Chapter 9. For now, remember: local isn’t about replacing cloud — it’s about using the right tool for the right job.
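The hybrid strategy can be summarized as a tiny routing function. This is a toy: the task labels and the context-size threshold are illustrative choices, not figures from the checklist, and Chapter 9 builds the real version:

```python
def route(task_type: str, prompt_tokens: int) -> str:
    """Toy cloud-vs-local router following the checklist above.
    Task labels and the 8K-token threshold are illustrative assumptions."""
    LOCAL_TASKS = {"classification", "extraction", "summarization", "simple_chat"}
    if task_type in LOCAL_TASKS and prompt_tokens < 8_000:
        return "local"   # well-defined task that fits a small model
    return "cloud"       # frontier reasoning, or input too large

print(route("classification", 500))   # -> local
print(route("reasoning", 500))        # -> cloud
```

Even this crude split captures the economics: the high-volume, well-defined traffic goes to the near-zero-marginal-cost path, and only the hard cases pay cloud prices.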