5
“Renting an apartment (API) vs buying a house (self-hosting) — most teams should rent.”
- APIs win for 87% of use cases. Self-hosting only makes sense at >10B tokens/month or with strict regulatory requirements.
- NVIDIA B200 costs $6,400 to manufacture, sells for $40K–$50K (84% margin). Cloud pricing varies 3.2x across providers.
- Self-hosting hidden costs are 3–5x the raw GPU price. GPU utilization at 20% means 5x higher cost per token.
6
“Use less (compression), reuse (caching), switch to a cheaper source (routing).”
- Model routing is the highest-leverage optimization: 40–60% savings by sending 62% of tasks to budget models with zero quality loss.
- Prompt caching: 10 minutes to enable, 45–90% savings on cached tokens, plus 13–31% faster time-to-first-token.
- Batch APIs: 50% off for async workloads. Distillation: 5–30x cost reduction, 95–97% quality retention.
- Combined savings: routing + caching + batching achieves 47–80% total cost reduction in production.
Bottom line: APIs beat self-hosting for most teams. The optimization playbook (routing, caching, batching, compression, distillation) can cut bills by 60–70%. Start with prompt caching (10 minutes, highest ROI), then add routing, then batch eligible workloads.