Ch 9 — Local vs Cloud: The Decision Framework

Cost, latency, privacy, quality — a systematic approach to choosing
Strategy: Cost → Latency → Privacy → Quality → Hybrid → Routing → Case Study → Decision Tree
Cost Analysis: When Local Breaks Even
The math behind the cloud-vs-local decision
Break-Even Calculator
Monthly cloud cost at different volumes (GPT-4o-mini: $0.15/$0.60 per 1M tokens):
- 1K req/day × 500 tok = $12/month
- 5K req/day × 500 tok = $60/month
- 10K req/day × 500 tok = $120/month
- 50K req/day × 500 tok = $600/month

Local hardware (one-time):
- Mac Mini M2 Pro: $1,600
- RTX 4090 build: $2,500
- Electricity: ~$10/month

Break-even:
- At $120/mo cloud → local pays off in ~14 months
- At $600/mo cloud → local pays off in ~3 months
The Hidden Costs
Cloud hidden costs: Rate limits (need higher tier), token overages, embedding costs, evaluation costs, multi-model pipelines multiply everything.

Local hidden costs: Setup time (1–2 days), maintenance, hardware failures, electricity, cooling, ML engineering knowledge.

Rule of thumb: If your monthly API bill exceeds $200 and the task can be handled by a 7–9B model, local deployment will save money within 6–12 months.
Key insight: Cloud is cheaper at low volume (<1K requests/day). Local is cheaper at high volume (>5K requests/day). The crossover point depends on your hardware choice and the cloud model you’re replacing. Always do the math for your specific case.
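The break-even arithmetic above can be sketched in a few lines. The prices and hardware costs are the illustrative figures from this section, not live pricing, and the $0.80/1M blended token rate is an assumption chosen to reproduce the round numbers above:

```python
def monthly_cloud_cost(requests_per_day, tokens_per_request=500,
                       price_per_1m_tokens=0.80):
    """Estimate monthly API spend. $0.80/1M is a blended input/output
    rate (assumption) that reproduces this section's figures."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * price_per_1m_tokens

def break_even_months(hardware_cost, cloud_monthly, electricity_monthly=10):
    """Months until one-time hardware beats the recurring cloud bill."""
    savings = cloud_monthly - electricity_monthly
    if savings <= 0:
        return float("inf")  # cloud stays cheaper at this volume
    return hardware_cost / savings

print(monthly_cloud_cost(10_000))                 # → 120.0 ($/month)
print(round(break_even_months(1600, 120), 1))     # → 14.5 (months)
```

At very low volumes the function returns infinity, which is the calculator's way of saying the hardware never pays for itself.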
Latency: When Milliseconds Matter
Interactive apps need local; batch processing can use cloud
Latency Comparison
Time to first token (TTFT):
- Cloud (GPT-4o): 200ms–2s
- Cloud (GPT-4o-mini): 100–500ms
- Local (Ollama 7B): 50–200ms
- Edge (phone 1B): 20–50ms

End-to-end for a 100-token response:
- Cloud (GPT-4o): 1–3s
- Local (7B, M2 Pro): 2–3s
- Local (7B, 4090): 1–1.5s
- Edge (1B, phone): 2–3s
When Latency Drives the Decision
Autocomplete / IDE: Users expect <100ms TTFT. Only local/edge can deliver this consistently.

Real-time suggestions: Chat reply suggestions, smart compose. Local wins on consistency (no queue variability).

Batch processing: Overnight document processing, bulk classification. Latency doesn’t matter — choose by cost and quality.

User-facing chat: Cloud is fine if you can tolerate 1–2s startup. Local is better if you need consistent sub-second response.
Key insight: Local models have lower and more consistent latency. Cloud models have higher but variable latency (depends on load). For interactive applications where consistency matters more than peak speed, local wins. For batch processing, choose by cost.
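A back-of-envelope way to turn TTFT plus generation speed into the end-to-end figures above. The tokens-per-second value below is an assumed throughput for illustration, not a benchmark:

```python
def end_to_end_seconds(ttft_ms, tokens_per_second, n_tokens=100):
    """Rough end-to-end latency: time to first token, then
    generation time for the remaining tokens."""
    return ttft_ms / 1000 + (n_tokens - 1) / tokens_per_second

# Assumed: local 7B on an RTX 4090, ~90 tok/s, ~100ms TTFT
print(round(end_to_end_seconds(100, 90), 2))  # → 1.2 (s), in the 1-1.5s band
```

The same formula explains why edge devices with a 20ms TTFT can still take 2–3s end-to-end: their generation throughput is much lower.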
Privacy & Compliance
When regulations make the decision for you
Compliance Requirements
MUST be local/edge:
- HIPAA (patient data)
- Air-gapped networks (defense)
- Attorney-client privilege
- GDPR (if the provider is non-EU)
- PCI-DSS (payment card data)
- ITAR (export-controlled data)

CAN be cloud (with safeguards):
- SOC 2 compliant providers
- Data processing agreements
- EU-based providers for GDPR
- Enterprise API tiers (no training on your data)

No restriction:
- Public data processing
- Non-sensitive internal tools
- Development/testing
The Privacy Spectrum
Level 1 — Edge: Data never leaves the device. Maximum privacy. For the most sensitive use cases.

Level 2 — Local server: Data stays on your network. Good for enterprise, healthcare, legal.

Level 3 — Cloud (enterprise tier): Data sent to provider but not used for training. Contractual guarantees.

Level 4 — Cloud (standard): Data may be logged, used for improvement. Fine for non-sensitive tasks.
Key insight: For many organizations, privacy isn’t a preference — it’s a legal requirement. If your data falls under HIPAA, GDPR, or similar regulations, local deployment may be the only compliant option. This alone can justify the investment in local infrastructure.
Quality Trade-offs
When you genuinely need a frontier model — and when you don’t
Task Complexity vs Model Size
Simple (local 3–7B is fine):
- Classification, routing, tagging
- Entity extraction (name, date, amount)
- Simple summarization (1–3 bullets)
- FAQ/template-based responses
- Text formatting, cleaning

Medium (local 9–24B or cloud mini):
- Detailed summarization
- Code generation (single function)
- Conversational chat
- Document Q&A with RAG
- Translation (major languages)

Hard (cloud frontier needed):
- Multi-step reasoning chains
- Complex code (multi-file refactoring)
- Creative writing (novel quality)
- Scientific analysis
- Low-resource language translation
The 80/20 Rule
In most production systems, 80% of requests are simple tasks that a local model handles perfectly. Only 20% need frontier-level reasoning.

This means you can route 80% of traffic to a free local model and only pay for the 20% that needs cloud. This is the hybrid architecture.
Key insight: The question isn’t “is GPT-4 better than a local 9B model?” (it is, on average). The question is “is GPT-4 better enough for THIS specific task to justify the cost?” For classification and extraction, the answer is almost always no. For complex reasoning, the answer is often yes.
Hybrid Architectures
The best of both worlds — local for simple, cloud for complex
Architecture Patterns
Pattern 1: Task-Based Split
- Classification → Local (3B)
- Extraction → Local (7B)
- Summarization → Local (9B)
- Complex reasoning → Cloud (GPT-4o)

Pattern 2: Quality Cascade
- Try local first → check confidence
- If confidence ≥ 0.9 → return the local result
- If confidence < 0.9 → escalate to cloud

Pattern 3: Cost Tiering
- Free tier users → Local model
- Paid tier users → Cloud model
- Premium users → Best available
Implementation
def smart_route(task, complexity):
    if task in ["classify", "extract", "tag"]:
        return local_model("qwen2.5:7b")
    if task == "summarize" and complexity == "low":
        return local_model("qwen2.5:7b")
    if task in ["reason", "create", "analyze"]:
        return cloud_model("gpt-4o")
    # Default: try local, escalate if needed
    result = local_model("qwen2.5:7b")
    if result.confidence < 0.85:
        return cloud_model("gpt-4o-mini")
    return result
Key insight: Hybrid is the production-grade answer. Pure local or pure cloud is rarely optimal. Route simple tasks locally (free, fast, private) and complex tasks to cloud (expensive, slower, but smarter). This can reduce cloud costs by 60–80% while maintaining quality.
The Routing Pattern
A small classifier decides which model handles each request
How It Works
- Request comes in → Router (tiny local model, ~1B) classifies it as simple / medium / complex. Router latency: 10–20ms.
- Simple → Local 7B (free, fast)
- Medium → Local 24B or Cloud mini
- Complex → Cloud frontier (GPT-4o)

The router itself is a local model running classification. It adds minimal latency but saves significant cost by keeping simple tasks off the cloud.
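A rule-based router of the kind suggested at the end of this section can be just a keyword table. The patterns and tier names here are hypothetical placeholders; in production the rules would be replaced by the 1B classifier described above:

```python
import re

# Hypothetical keyword rules (assumptions for illustration)
RULES = [
    (re.compile(r"\b(classify|tag|route|which category)\b", re.I), "simple"),
    (re.compile(r"\b(summarize|translate|explain)\b", re.I), "medium"),
]

def route(request_text):
    """Return the model tier for a request: simple / medium / complex."""
    for pattern, tier in RULES:
        if pattern.search(request_text):
            return tier
    return "complex"  # unknown requests escalate to the frontier model

print(route("Please classify this ticket"))          # → simple
print(route("Write a multi-file refactoring plan"))  # → complex
```

Defaulting unknown requests to the most capable tier trades a little cost for safety; the opposite default trades quality for cost.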
Real-World Impact
Before routing (all cloud GPT-4o):
- 10,000 requests/day
- Cost: ~$500/month
- Latency: variable (200ms–5s)

After routing (hybrid):
- 7,000 simple → Local (free)
- 2,000 medium → Cloud mini (~$40)
- 1,000 complex → Cloud GPT-4o (~$100)
- Total: ~$140/month (72% savings)
- Latency: faster for 70% of requests
Key insight: The routing pattern is the most practical hybrid architecture. A tiny classifier (even rule-based) routes requests to the appropriate model tier. You get cloud quality when you need it and local speed/cost when you don’t. Start simple (keyword rules) and graduate to ML-based routing.
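The savings arithmetic above is easy to reproduce. The per-request costs here are back-calculated from this section's monthly figures, so they are assumptions rather than published prices:

```python
def monthly_routing_cost(daily_counts, per_request_cost):
    """Monthly cost summed across tiers, given requests/day per tier."""
    return sum(daily_counts[tier] * 30 * per_request_cost[tier]
               for tier in daily_counts)

daily = {"local": 7000, "cloud_mini": 2000, "cloud_frontier": 1000}
# $/request back-calculated from the ~$40 and ~$100 monthly figures
cost = {"local": 0.0,
        "cloud_mini": 40 / (2000 * 30),
        "cloud_frontier": 100 / (1000 * 30)}

total = monthly_routing_cost(daily, cost)
print(round(total))                         # → 140 ($/month)
print(round((1 - total / 500) * 100))       # → 72 (% savings vs all-cloud)
```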
Case Study: Customer Support Platform
How a real system uses hybrid local + cloud
The System
Customer support bot handling 15,000 tickets/day across 5 categories: billing, technical, account, feature, general.

Before (all cloud):
- Model: GPT-4o-mini
- Cost: $900/month
- Avg response: 1.2s

After (hybrid):
- Router: Llama 3.2 1B (local)
- Tier 1: Qwen 2.5 7B (local) → FAQ, status checks, simple billing → 65% of tickets
- Tier 2: GPT-4o-mini (cloud) → complex billing, technical issues → 30% of tickets
- Tier 3: GPT-4o (cloud) → escalations, complaints, edge cases → 5% of tickets
Results
- Cost: $900 → $280/month (69% savings)
- Speed: 1.2s → 0.4s average (~67% faster)
- Quality: 92% → 91% satisfaction (one-point drop, acceptable trade-off)
- Privacy: 65% of data never leaves the network
- Hardware investment: Mac Mini M2 Pro, $1,600 one-time; pays for itself in ~2.6 months
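The payback figure follows directly from the numbers above:

```python
monthly_savings = 900 - 280           # $/month saved after going hybrid
payback_months = 1600 / monthly_savings
print(round(payback_months, 1))       # → 2.6 (months)
```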
Key insight: The case study shows the typical hybrid outcome: 60–70% cost reduction, faster response for most requests, minimal quality impact. The one-point satisfaction drop is because the local model occasionally gives slightly less nuanced responses on simple queries — a trade-off most businesses happily accept.
The Decision Tree
A systematic framework for every AI deployment decision
The Framework
Q1: Is the data sensitive?
- YES → Local or Edge (mandatory)
- NO → Continue to Q2.

Q2: Volume > 5K requests/day?
- YES → Local saves money. Continue to Q3.
- NO → Cloud is probably cheaper.

Q3: Does the task need frontier reasoning?
- YES → Cloud (or hybrid: local for simple subtasks, cloud for hard ones)
- NO → Local handles it. Done.

Q4: Is latency critical (<100ms TTFT)?
- YES → Local or Edge (mandatory)
- NO → Cloud is acceptable.

Q5: Do you have ML engineering capacity?
- YES → Local (Ollama makes it easy)
- NO → Cloud (zero ops) or hire
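The five questions can be collapsed into a small function. This is one reasonable linearization of the tree above, a sketch to validate against your own constraints rather than a definitive policy:

```python
def recommend(sensitive_data, requests_per_day, needs_frontier,
              latency_critical, has_ml_capacity):
    """Walk a linearized version of the Q1-Q5 decision tree."""
    if sensitive_data:                       # Q1: compliance decides first
        return "local/edge"
    if latency_critical:                     # Q4: <100ms TTFT needs local
        return "local/edge"
    if needs_frontier:                       # Q3: frontier reasoning
        return "hybrid" if requests_per_day > 5000 else "cloud"
    if requests_per_day > 5000:              # Q2: volume favours local...
        return "local" if has_ml_capacity else "cloud"  # ...if Q5 allows
    return "cloud"

print(recommend(False, 10_000, True, False, True))  # → hybrid
print(recommend(True, 100, False, False, False))    # → local/edge
```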
Quick Summary
Go Local: Sensitive data, high volume, latency-critical, well-defined tasks (classification, extraction, summarization).

Go Cloud: Low volume, complex reasoning, no ML capacity, need latest models immediately.

Go Hybrid: Mixed workload, cost optimization, want best of both worlds. This is the answer for most production systems.

Go Edge: Offline required, maximum privacy, mobile/browser deployment, simple tasks only.
Key insight: There is no universal answer. The right choice depends on your data sensitivity, volume, task complexity, latency requirements, and team capacity. Use this decision tree as a starting point, then validate with a proof-of-concept. Chapter 10 looks at where all of this is heading.