The Utilization Problem
A GPU costs the same whether it's processing tokens or sitting idle. Most self-hosted deployments average only 15–30% utilization because demand is bursty: peak hours see heavy load, while nights and weekends are quiet. At 20% utilization, your effective cost per token is 5x the theoretical minimum you'd pay at full utilization.
// Utilization impact on cost per token
100% utilization: $0.10/1K tokens
50% utilization: $0.20/1K tokens (2x)
20% utilization: $0.50/1K tokens (5x)
10% utilization: $1.00/1K tokens (10x)
// At 10% utilization, APIs are almost
// always cheaper
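The relationship in the table is simple division: cost per token scales inversely with utilization. A minimal sketch, assuming the table's hypothetical $0.10/1K-token cost at full utilization:

```python
# Effective cost per 1K tokens at a given utilization.
# FULL_UTILIZATION_COST is the illustrative $0.10/1K figure from the
# table above, not a real price quote.
FULL_UTILIZATION_COST = 0.10  # $/1K tokens when the GPU is 100% busy


def effective_cost(utilization: float) -> float:
    """Cost per 1K tokens when the GPU is busy only `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return FULL_UTILIZATION_COST / utilization


for u in (1.0, 0.5, 0.2, 0.1):
    print(f"{u:>4.0%} utilization: ${effective_cost(u):.2f}/1K tokens")
```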
Improving Utilization
Techniques to push utilization higher:

- Request batching: group multiple requests into a single GPU pass.
- Multi-model serving: run different models on the same GPU based on demand.
- Spot/preemptible instances: use cheap GPU time for non-urgent batch work.
- Auto-scaling: spin GPUs up and down with demand, though this carries cold-start penalties.
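Request batching is the most broadly applicable of these. A minimal sketch of the idea, using only the standard library: collect requests until either a batch-size cap or a short wait deadline is hit, then run them as one pass. The names (`Batcher`, `run_batch`, `max_batch`, `max_wait`) are illustrative, and `run_batch` stands in for a real batched GPU forward pass:

```python
import queue
import threading
import time


def run_batch(prompts):
    # Stand-in for a single batched GPU forward pass over all prompts.
    return [f"response:{p}" for p in prompts]


class Batcher:
    def __init__(self, max_batch=8, max_wait=0.05):
        self.q = queue.Queue()
        self.max_batch = max_batch  # cap on requests per GPU pass
        self.max_wait = max_wait    # seconds to wait for more requests

    def submit(self, prompt):
        # Enqueue a request; the caller can wait on item["done"].
        item = {"prompt": prompt, "done": threading.Event(), "result": None}
        self.q.put(item)
        return item

    def serve_once(self):
        # Block for the first request, then drain more until the batch is
        # full or max_wait has elapsed, and process them in one pass.
        batch = [self.q.get()]
        deadline = time.monotonic() + self.max_wait
        while len(batch) < self.max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self.q.get(timeout=remaining))
            except queue.Empty:
                break
        for item, result in zip(batch, run_batch([b["prompt"] for b in batch])):
            item["result"] = result
            item["done"].set()
```

The trade-off is latency for throughput: `max_wait` bounds how long an early request sits waiting for batch-mates, while `max_batch` bounds GPU memory per pass.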
Key insight: Before self-hosting, honestly assess your expected utilization. If you can’t sustain >60% utilization, APIs will be cheaper. The break-even math only works when GPUs are busy most of the time.
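That assessment can be made concrete by solving for the utilization at which self-hosting matches the API price. A sketch under stated assumptions; the GPU cost, throughput, and API price below are illustrative placeholders, not quotes:

```python
# Break-even utilization: the fraction of the time a GPU must be busy for
# self-hosted cost per 1K tokens to equal a given API price.
# Self-hosted cost per 1K tokens at utilization u is:
#   gpu_cost_per_hour / (u * tokens_per_hour_at_full / 1000)
# Setting that equal to api_price_per_1k and solving for u gives:


def breakeven_utilization(gpu_cost_per_hour: float,
                          tokens_per_hour_at_full: float,
                          api_price_per_1k: float) -> float:
    full_util_cost = gpu_cost_per_hour / (tokens_per_hour_at_full / 1000)
    return full_util_cost / api_price_per_1k


# Illustrative numbers (assumptions, not real prices):
u = breakeven_utilization(gpu_cost_per_hour=2.0,
                          tokens_per_hour_at_full=10_000_000,
                          api_price_per_1k=0.0005)
print(f"break-even utilization: {u:.0%}")
```

If the computed break-even exceeds the utilization you can realistically sustain, the APIs win, which is exactly the >60% rule of thumb above.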