The Framework
Q1: Is the data sensitive?
    YES → Local or Edge (mandatory).
    NO  → Continue to Q2.
Q2: Is volume above 5K requests/day?
    YES → Local saves money. Continue to Q3.
    NO  → Cloud is probably cheaper.
Q3: Does the task need frontier reasoning?
    YES → Cloud (or hybrid: local for simple subtasks, cloud for hard ones).
    NO  → Local handles it. Done.
Q4: Is latency critical (<100ms TTFT)?
    YES → Local or Edge (mandatory).
    NO  → Cloud is acceptable.
Q5: Do you have ML engineering capacity?
    YES → Local (Ollama makes it easy).
    NO  → Cloud (zero ops), or hire.
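The five questions can be sketched as a routing function. This is one reasonable linearization, not the only one: the mandatory checks (Q1 and Q4) are evaluated first, since they override cost and capability considerations. The `Workload` fields and the `choose_deployment` name are hypothetical; the thresholds (5K requests/day, 100ms TTFT) come from the questions above.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    sensitive_data: bool           # Q1
    requests_per_day: int          # Q2
    needs_frontier_reasoning: bool # Q3
    latency_critical: bool         # Q4: needs <100ms time-to-first-token
    has_ml_capacity: bool          # Q5

def choose_deployment(w: Workload) -> str:
    # Q1: sensitive data makes local/edge mandatory.
    if w.sensitive_data:
        return "local/edge"
    # Q4: a hard latency budget also forces local/edge.
    if w.latency_critical:
        return "local/edge"
    # Q3: frontier reasoning points to cloud; at high volume,
    # a hybrid split (simple subtasks local) can cut cost.
    if w.needs_frontier_reasoning:
        return "hybrid" if w.requests_per_day > 5_000 else "cloud"
    # Q2: below ~5K requests/day, cloud is probably cheaper.
    if w.requests_per_day <= 5_000:
        return "cloud"
    # Q5: high volume, well-defined task: go local if you can run it.
    return "local" if w.has_ml_capacity else "cloud"

# High-volume, well-defined task, team can operate it:
print(choose_deployment(Workload(False, 50_000, False, False, True)))  # → local
```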
Quick Summary
Go Local: Sensitive data, high volume, latency-critical, well-defined tasks (classification, extraction, summarization).
Go Cloud: Low volume, complex reasoning, no ML capacity, need latest models immediately.
Go Hybrid: Mixed workload, cost optimization, want the best of both worlds. This is the answer for most production systems.
Go Edge: Offline required, maximum privacy, mobile/browser deployment, simple tasks only.
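A minimal sketch of the hybrid pattern: route the well-defined subtasks (classification, extraction, summarization) to a local model and everything else to a cloud API. The task names and both `call_*` stubs are illustrative placeholders, not a real API; in practice they would wrap, e.g., an Ollama endpoint and a hosted frontier model.

```python
# Subtasks the local model handles well (hypothetical task names).
LOCAL_TASKS = {"classify", "extract", "summarize"}

def call_local(task: str, text: str) -> str:
    # Stand-in for a local model call (e.g. via Ollama).
    return f"[local:{task}] {text[:40]}"

def call_cloud(task: str, text: str) -> str:
    # Stand-in for a hosted frontier-model call.
    return f"[cloud:{task}] {text[:40]}"

def route(task: str, text: str) -> str:
    # Simple, well-defined subtasks stay local; hard ones go to the cloud.
    handler = call_local if task in LOCAL_TASKS else call_cloud
    return handler(task, text)
```

The routing key here is a task label; production routers often use heuristics such as prompt length or a small classifier instead.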
Key insight: There is no universal answer. The right choice depends on your data sensitivity, volume, task complexity, latency requirements, and team capacity. Use this decision tree as a starting point, then validate with a proof-of-concept. Chapter 10 looks at where all of this is heading.