Cost Per Token Breakdown
The fundamental unit of inference cost is dollars per million tokens. This depends on three factors: GPU cost per hour, tokens generated per hour, and overhead (networking, storage, engineering).
For a well-optimized vLLM deployment of Llama 3 70B on 4× H100 (FP8):
GPU cost:
4× H100 on-demand: 4 × $2.50/hr = $10/hr
Reserved (1yr): 4 × $1.60/hr = $6.40/hr
Throughput (vLLM, FP8, continuous batching):
Per GPU: ~850 tok/s
4 GPUs: ~3,400 tok/s
Per hour: 3,400 × 3,600 = 12.24M tokens/hr
Cost per million tokens:
On-demand: $10 / 12.24 = $0.82/M tokens
Reserved: $6.40 / 12.24 = $0.52/M tokens
+ Overhead (~30%): $0.68-1.07/M tokens
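The arithmetic above fits in a few lines of Python. The throughput and price figures are this section's assumptions, not measured benchmarks:

```python
# Cost per million output tokens for the 4x H100 cluster described above.
# Throughput and $/hr figures are assumptions, not benchmarks.

GPU_COUNT = 4
TOKENS_PER_SEC_PER_GPU = 850       # vLLM, FP8, continuous batching

def cost_per_million(price_per_gpu_hr: float, overhead: float = 0.0) -> float:
    """Dollars per million output tokens, optionally with overhead markup."""
    hourly_cost = GPU_COUNT * price_per_gpu_hr
    tokens_per_hour = GPU_COUNT * TOKENS_PER_SEC_PER_GPU * 3600   # 12.24M
    return hourly_cost / (tokens_per_hour / 1e6) * (1 + overhead)

print(f"On-demand:           ${cost_per_million(2.50):.2f}/M")        # $0.82/M
print(f"Reserved:            ${cost_per_million(1.60):.2f}/M")        # $0.52/M
print(f"Reserved + overhead: ${cost_per_million(1.60, 0.30):.2f}/M")  # $0.68/M
```

Swapping in your own measured tok/s is the whole game: halve the throughput and the cost per token doubles.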
# Compare to API pricing:
OpenAI GPT-4o output: $10.00/M tokens
Anthropic Claude 3.5 Sonnet output: $15.00/M tokens
Self-hosted Llama 70B: $0.68-1.07/M tokens
Savings vs API: ~9–15× (GPT-4o), ~14–22× (Claude 3.5)
When Self-Hosting Makes Sense
Self-hosting breaks even once your monthly token volume is large enough that the amortized GPU cost falls below what the same tokens would cost through an API. The crossover depends heavily on utilization:
Break-even analysis (vs GPT-4o at $10/M):
Monthly GPU cost (4× H100 reserved): $6.40/hr × 720 hrs = $4,608
Monthly capacity: 12.24M tokens/hr × 720 hrs = 8.8B tokens
At 100% utilization:
Self-host: $4,608 / 8,813M = $0.52/M
Break-even: $4,608 ÷ $10/M ≈ 461M tokens/month
At 30% utilization (realistic):
Effective: $4,608 / 2,644M = $1.74/M
Break-even: 461M / 0.30 ≈ 1.5B tokens/month
At 10% utilization (low traffic):
Effective: $4,608 / 881M = $5.23/M
Break-even: 461M / 0.10 ≈ 4.6B tokens/month
# Rule of thumb: self-host when you consistently
# generate >1.5B tokens/month AND can maintain
# >30% GPU utilization.
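The utilization scaling can be sketched the same way. All figures here are this section's assumptions (4× H100 reserved at $6.40/hr total, 720 hrs/month, ~12.24M tokens/hr, GPT-4o output at $10/M):

```python
# Break-even sketch: effective $/M rises as utilization falls, and the
# minimum viable monthly volume scales by 1/utilization. All inputs are
# assumptions from the surrounding text, not measured values.

MONTHLY_GPU_COST = 6.40 * 720        # $4,608 (4x H100 reserved)
API_PRICE_PER_M = 10.0               # $/M tokens (GPT-4o output)
CAPACITY_M_TOKENS = 12.24 * 720      # ~8,813M tokens/month at 100% util

def effective_cost_per_m(utilization: float) -> float:
    """$/M tokens when only a fraction of paid-for capacity does useful work."""
    return MONTHLY_GPU_COST / (CAPACITY_M_TOKENS * utilization)

def break_even_b_tokens(utilization: float) -> float:
    """Monthly volume (billions of tokens) needed to justify the cluster."""
    return MONTHLY_GPU_COST / API_PRICE_PER_M / utilization / 1000

for u in (1.00, 0.30, 0.10):
    print(f"{u:4.0%} util: ${effective_cost_per_m(u):.2f}/M, "
          f"break-even ~{break_even_b_tokens(u):.1f}B tokens/month")
```

Note that the break-even volume is dominated by utilization, not hardware price: a cluster you cannot keep busy is expensive at any $/hr.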
Hidden Costs of Self-Hosting
Engineering: 0.5–2 FTEs to manage inference infrastructure ($100–400K/yr).
Monitoring: Prometheus, Grafana, custom dashboards for latency/throughput SLAs.
Model updates: New model versions require re-quantization, re-benchmarking, A/B testing.
Redundancy: Need 2× capacity for zero-downtime deployments during updates.
Edge cases: OOM errors, CUDA crashes, driver updates, security patches.
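A rough all-in monthly figure folds the redundancy and engineering items above into the GPU bill. The FTE fraction and salary below are illustrative mid-range picks from the estimates above, not data:

```python
# Rough all-in monthly cost of self-hosting, in dollars. The redundancy
# factor reflects the 2x capacity needed for zero-downtime deploys; the
# FTE numbers are illustrative mid-range picks, not data.

def monthly_tco(gpu_cost: float,
                redundancy: float = 2.0,         # 2x capacity for deploys
                fte_fraction: float = 0.5,       # engineers on infra (0.5-2)
                fte_annual_cost: float = 200_000) -> float:
    """GPU bill (with redundancy) plus amortized engineering cost."""
    return gpu_cost * redundancy + fte_fraction * fte_annual_cost / 12

print(f"All-in: ${monthly_tco(4_608):,.0f}/month")   # ~$17,549 vs $4,608 raw
```

Under these assumptions the all-in bill is roughly 3.8× the raw GPU cost, which pushes the break-even volume up by the same factor.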
Key insight: Self-hosting inference is like buying vs. leasing a car. Buying is cheaper per mile if you drive enough, but you’re responsible for maintenance, insurance, and depreciation. APIs are like taxis — expensive per trip but zero commitment. The crossover point (~1.5B tokens/month at realistic utilization) is lower than most people think, but the hidden costs are higher than most people budget.