Ch 4 — Real-World Bills

What AI actually costs — the electricity bill analogy
The Electricity Bill Analogy
Stop thinking per kilowatt-hour; start thinking monthly
From Per-Token to Per-Month
Nobody thinks about their electricity bill in kilowatt-hours. You think: “My bill was $180 this month.” AI costs work the same way. Per-token pricing is useful for understanding the unit economics, but what matters is the monthly bill at production volume. This chapter translates everything from Chapters 1–3 into real monthly costs for five common workloads.
The 90% Underestimation Problem
96% of enterprises report AI costs exceeding initial projections. Teams routinely underestimate by 3x or more when planning on sticker price alone. The gap comes from hidden multipliers (Chapter 3), conversation history accumulation, retry loops, and the difference between prototype volume and production volume.
Key insight: The worked examples below use March 2026 pricing. Prices change rapidly, but the methodology — calculating tokens per task, multiplying by daily volume, accounting for hidden costs — remains constant.
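That methodology can be sketched as a small helper, used throughout the examples below (a sketch: the function name and defaults are my own; prices are whatever your provider charges per million tokens):

```python
def monthly_cost(input_tokens_per_task, output_tokens_per_task,
                 tasks_per_day, price_in_per_m, price_out_per_m,
                 days=30, safety_multiplier=1.0):
    """Estimate monthly LLM spend: tokens per task x volume x per-million price."""
    tasks = tasks_per_day * days
    input_cost = tasks * input_tokens_per_task / 1e6 * price_in_per_m
    output_cost = tasks * output_tokens_per_task / 1e6 * price_out_per_m
    return (input_cost + output_cost) * safety_multiplier

# Example 1's support bot: ~22,500 input + 1,500 output tokens per conversation,
# 10,000 conversations/month, GPT-4o mini pricing ($0.15/$0.60 per M)
print(round(monthly_cost(22_500, 1_500, 10_000 / 30, 0.15, 0.60), 2))  # 42.75
```

Swap in Claude Sonnet pricing ($3.00/$15.00) and the same call returns $900, reproducing the 21x gap discussed below.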
Example 1: Customer Support Bot
10,000 conversations/month
The Workload
A customer support chatbot handling 10,000 conversations/month. Average conversation: 5 turns, 500 input tokens + 300 output tokens per turn. System prompt: 1,500 tokens (sent every turn). RAG context: 2,000 tokens per turn. Total per conversation: ~22,500 input + 1,500 output tokens (above the raw 5 × 4,000 input because prior turns are resent as conversation history).
// Monthly cost by model choice

GPT-4o mini ($0.15/$0.60)
Input:  225M tokens × $0.15/M = $33.75
Output:  15M tokens × $0.60/M =  $9.00
Total:  $42.75/month

Claude Sonnet ($3.00/$15.00)
Input:  225M × $3.00/M  = $675
Output:  15M × $15.00/M = $225
Total:  $900/month
The Model Choice Impact
Same workload, 21x cost difference. GPT-4o mini at $43/month vs Claude Sonnet at $900/month. For most customer support tasks — FAQ answers, order status, policy lookups — the budget model performs comparably. The expensive model only justifies its cost for complex troubleshooting or nuanced conversations.
Key insight: A real fintech startup budgeted $800/month for their GPT-4o support bot. The first invoice was $4,200/month because they didn’t account for conversation history accumulation, retry loops, and RAG context overhead. Always multiply your estimate by 2–3x for safety.
Example 2: Code Assistant (20 Developers)
1,000 code requests/day across the team
The Workload
20 developers, each making ~50 code requests/day (completions, explanations, refactoring, debugging). Average request: 2,000 input tokens (code context + prompt) + 1,000 output tokens (generated code). System prompt: 800 tokens. Total: 1,000 requests/day at 2,800 input + 1,000 output tokens each.
// Monthly cost (30 days)

GPT-4.1 ($2.00/$8.00)
Input:  84M tokens × $2.00/M = $168
Output: 30M tokens × $8.00/M = $240
Total:  $408/month ($20/dev/month)

GPT-4o mini ($0.15/$0.60)
Input:  84M × $0.15/M = $12.60
Output: 30M × $0.60/M = $18.00
Total:  $30.60/month ($1.53/dev/month)
The Value Calculation
At $408/month for 20 developers, that’s $20/developer/month on GPT-4.1. If each developer saves just 30 minutes/day (conservative for a good code assistant), that’s 10 hours/month of saved time. At $75/hour loaded cost, that’s $750 in saved productivity per developer — a 37x return on the $20 API cost.
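The value calculation reduces to one ratio (a sketch; the function name is my own, and the inputs are the chapter's figures):

```python
def roi_multiple(monthly_api_cost, devs, hours_saved_per_dev_month, loaded_rate):
    """Productivity value recovered per dollar of monthly API spend."""
    value = devs * hours_saved_per_dev_month * loaded_rate
    return value / monthly_api_cost

# $408/month, 20 devs, 10 hours/month saved each, $75/hour loaded cost
print(round(roi_multiple(408, 20, 10, 75), 1))  # 36.8, i.e. roughly a 37x return
```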
Key insight: Code assistance is one of the highest-ROI AI workloads because the output (code) has direct, measurable productivity value. Even at mid-tier pricing, the cost is trivial compared to developer salaries.
Example 3: RAG Document Q&A
1,000 document queries/day
The Workload
A RAG system answering questions over a corporate knowledge base. 1,000 queries/day. Each query retrieves 5 document chunks averaging 1,000 tokens each (5,000 tokens of context). System prompt: 1,000 tokens. User query: 100 tokens. Output: 500 tokens. Total per request: 6,100 input + 500 output.
// Monthly cost (30 days)

Claude Sonnet ($3.00/$15.00)
Input:  183M tokens × $3.00/M  = $549
Output:  15M tokens × $15.00/M = $225
Total:  $774/month

With prompt caching (90% discount on system prompt)
Cached: 30M tokens saved × 90% = −$81
Optimized: ~$693/month

GPT-4o mini ($0.15/$0.60)
Total:  $36.45/month
RAG Cost Drivers
RAG is input-heavy (92% input tokens), which makes it relatively cheap per request. But cost scales with retrieval depth: retrieving 10 chunks instead of 5 doubles input cost. The key optimization is retrieval quality over quantity: better chunking and reranking mean fewer, more relevant chunks, which cuts cost and improves answer quality.
Key insight: RAG systems benefit enormously from prompt caching because the system prompt and few-shot examples are identical across all requests. Enabling caching alone can save 10–15% of total RAG costs with zero effort.
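The caching savings in the worksheet above come from one multiplication (a sketch; the function name is mine, and the 90% discount mirrors the example's assumption about cache-hit pricing):

```python
def caching_savings(cached_tokens_per_req, requests, price_in_per_m, discount=0.90):
    """Dollars saved per month when cache-hit input tokens are discounted."""
    return cached_tokens_per_req * requests / 1e6 * price_in_per_m * discount

# Example 3: 1,000-token system prompt, 30,000 requests/month, Sonnet input at $3/M
print(round(caching_savings(1_000, 30_000, 3.00), 2))  # 81.0
```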
Example 4: Autonomous Agent Fleet
7 agents running 24/7
The Workload
7 AI agents handling software engineering, data analysis, and operations tasks around the clock. Each agent averages 50 tasks/day. Each task requires 5–10 LLM calls (planning, tool use, execution, verification). Average per task: 15,000 input + 8,000 output tokens. Context grows with each call in a task.
// Monthly cost (7 agents, 50 tasks/day each)

GPT-5 ($1.25/$10.00)
Tasks/month: 7 × 50 × 30 = 10,500
Input:  157.5M × $1.25/M  = $197
Output:  84M   × $10.00/M = $840
LLM cost: $1,037/month
Infrastructure/monitoring: +$2,800
Total: ~$3,837/month

Claude Opus ($5.00/$25.00)
LLM cost: $2,888/month
Infrastructure: +$2,800
Total: ~$5,688/month
The Real Cost
These estimates assume well-constrained agents. In practice, agents encounter errors, retry, and sometimes enter doom loops. A real Series B fintech startup reported their agent fleet cost $4,200/month in LLM fees + $2,800 in infrastructure. Without monitoring, teams lose $8,000–23,000/month to undetected waste.
Key insight: Each agent must generate enough value to cover its own costs. At $3,837/month for 7 agents, each agent must deliver at least $548/month in value. For software engineering agents at $5–8 per task, that’s 70–110 successful tasks/month just to break even.
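The break-even count per agent can be checked directly (a sketch; the function name is mine, figures are the chapter's):

```python
import math

def breakeven_tasks(monthly_cost_per_agent, value_per_task):
    """Successful tasks an agent must complete per month to cover its own cost."""
    return math.ceil(monthly_cost_per_agent / value_per_task)

# $3,837/month across 7 agents ≈ $548/agent; value of $5–8 per completed task
print(breakeven_tasks(3_837 / 7, 8), breakeven_tasks(3_837 / 7, 5))  # 69 110
```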
The Three-Layer Cost Stack
Infrastructure + model access + overhead
Layer 1: Infrastructure
$5–2,000/month depending on scale. Includes hosting, databases, vector stores, monitoring tools, and compute for non-LLM processing. For a simple chatbot, this might be $5–50/month on a managed platform. For an agent fleet, $500–2,000/month for orchestration infrastructure.
Layer 2: Model Access (LLM API)
$5–50,000+/month depending on volume and model choice. This is the variable cost that scales with usage — everything we’ve calculated in the examples above. For light usage, $5–50/month. For heavy production, $1,000–50,000+/month.
Layer 3: Overhead
10–30% annually on top of direct costs. Includes engineering time for prompt optimization, monitoring setup, incident response, model migration when providers change APIs, and ongoing evaluation. This is the cost most teams forget — the human labor required to keep AI systems running well.
Key insight: The LLM API cost (Layer 2) gets all the attention, but infrastructure and overhead often account for 40–60% of total spend. A $1,000/month LLM bill typically comes with $600–1,500/month in supporting costs.
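Stacking the three layers is a one-liner worth writing down (a sketch; the function name and the 20% overhead default are my own, within the chapter's 10–30% range):

```python
def total_monthly_spend(llm_api, infrastructure, overhead_rate=0.20):
    """Layers 1 + 2 as direct cost, plus Layer 3 overhead as a fraction of it."""
    direct = llm_api + infrastructure
    return direct * (1 + overhead_rate)

# A $1,000/month LLM bill with $600 of infrastructure and 20% overhead
print(round(total_monthly_spend(1_000, 600), 2))  # 1920.0
```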
Cost-Per-Task: The Real Metric
Total tokens × price / success rate
Why Per-Task Matters
Monthly totals tell you what you spent. Cost-per-task tells you whether it was worth it. The formula: (total tokens × price) / successful completions. If your agent completes 80% of tasks successfully, the cost per successful task is 25% higher than the raw per-task cost. Failed tasks still consume tokens.
// Cost-per-task with success rate
Raw cost per task: $5.00
Success rate: 80%
Cost per SUCCESS: $6.25
Failed task waste: $1.25/success

// Model choice impact on same task
Claude Opus:  $5.00/task, 92% success = $5.43
GPT-4o mini:  $0.15/task, 71% success = $0.21
// Cheaper model wins even with lower success
Break-Even Analysis
Every AI task must generate more value than it costs. A customer support resolution that saves a $15/hour support agent 10 minutes is worth $2.50. If the AI resolution costs $0.15 (budget model), that’s a 16x return. If it costs $5.00 (premium model), you’re losing $2.50 per resolution. The math changes everything.
Key insight: The cheapest model that meets your quality threshold always wins on cost-per-successful-task. A 92% success rate on a $5 model costs more per success than a 71% success rate on a $0.15 model. Run the numbers before choosing.
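Running those numbers takes a two-line helper (a sketch; the function name is mine):

```python
def cost_per_success(cost_per_task, success_rate):
    """Failed tasks still burn tokens, so divide raw cost by the success rate."""
    return cost_per_task / success_rate

# The chapter's comparison: premium vs budget model on the same task
opus = cost_per_success(5.00, 0.92)   # ≈ $5.43 per successful task
mini = cost_per_success(0.15, 0.71)   # ≈ $0.21 per successful task
print(round(opus, 2), round(mini, 2))
```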
The Monthly Bill Framework
How to estimate before you deploy
The Estimation Checklist
1. Calculate tokens per task (input + output + thinking).
2. Multiply by daily volume × 30.
3. Apply the full cost formula from Chapter 3 (including thinking tokens and surcharges).
4. Add infrastructure costs (40–60% of LLM cost).
5. Multiply the total by 2–3x for safety margin.
6. Calculate cost-per-successful-task and compare it to the value generated.
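The checklist above can be wired into one function (a sketch; names and defaults are mine, and the thinking-token and surcharge terms from Chapter 3 are folded into the per-task token counts):

```python
def estimate_monthly_bill(tokens_in, tokens_out, daily_volume,
                          price_in_per_m, price_out_per_m,
                          infra_fraction=0.50, safety=2.5):
    """Walk the checklist: tokens/task -> monthly volume -> infra -> safety margin."""
    monthly_tasks = daily_volume * 30                            # step 2
    llm = monthly_tasks * (tokens_in * price_in_per_m +          # steps 1 & 3
                           tokens_out * price_out_per_m) / 1e6
    with_infra = llm * (1 + infra_fraction)                      # step 4
    return with_infra * safety                                   # step 5

# Example 1's support bot on GPT-4o mini: a $42.75 sticker price
# becomes a ~$160/month planning number after infra and safety margin
print(round(estimate_monthly_bill(22_500, 1_500, 10_000 / 30, 0.15, 0.60), 2))
```

Step 6 is then a judgment call: divide the result by expected successful completions and compare against the value each one generates.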
Key insight: The teams that succeed with AI economics are the ones that treat cost estimation like engineering, not guessing. Prototype with real data, measure actual token consumption, and build in safety margins before committing to production volumes.
What’s Next
We’ve covered the complete token economics story: what tokens are (Ch 1), how they’re priced (Ch 2), the hidden multipliers (Ch 3), and what real bills look like (Ch 4). Chapter 5 shifts to the GPU and infrastructure layer — the real estate analogy of renting vs buying vs building.
Chapter Summary
Customer support: $43–900/month depending on model. Code assistant: $20/dev/month with 37x ROI. RAG: $36–774/month. Agent fleet: $3,800–5,700/month. The three-layer cost stack (infra + API + overhead) means LLM cost is only 40–60% of total. Cost-per-successful-task is the metric that matters. Always multiply estimates by 2–3x.