Ch 2 — The Token Price Tag

Input, output & the asymmetry — the restaurant analogy
The Restaurant Analogy
Reading the menu is cheap; having the chef cook is expensive
Input = Reading the Menu
When you send a prompt to an LLM, the model reads everything at once — your system prompt, conversation history, retrieved documents, and your question. This is like walking into a restaurant and reading the entire menu. It takes a moment, but the restaurant can serve many readers simultaneously. Input tokens are processed in parallel in a single forward pass through the model.
Output = The Chef Cooking
The model’s response is generated one token at a time, sequentially. Each token depends on every token before it. This is like a chef preparing your meal — course by course, each dish building on the last. The chef can only work on one dish at a time, and you’re paying for every minute of their attention. Output tokens require a full forward pass per token.
Key insight: This is why output tokens typically cost 3–8x more than input tokens across major providers. Reading is cheap; creating is expensive. The same principle applies to AI.
Why Output Is Fundamentally More Expensive
Three layers of cost behind every generated token
Layer 1: Compute Intensity
Input processing runs the entire prompt through the model in one parallel operation. Whether your prompt is 100 tokens or 10,000 tokens, it’s processed in a single batch. Output generation requires a complete forward pass for every single token. A 500-token response means 500 separate inference passes through the model’s billions of parameters.
Layer 2: Memory Overhead
During generation, the model must maintain a KV-cache (key-value cache) that stores attention keys and values for every token in the context: the prompt plus everything generated so far. This cache grows with each new token and occupies GPU memory for the entire generation process. Longer responses require more memory for longer periods.
Layer 3: Sequential Dependency
This is the fundamental constraint: token 47 cannot be generated until token 46 exists. The autoregressive architecture means output generation is inherently sequential and cannot be parallelized. No amount of hardware can make output generation as efficient as input processing — it’s a mathematical limitation of how transformer models work.
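The three layers above can be made concrete with a toy decoding loop. This is a sketch, not a real model: `model_step` is a hypothetical stand-in for a full forward pass, and the pass counter is the only thing being measured.

```python
def generate(model_step, prompt_tokens, n_output):
    """Toy autoregressive decoding loop.

    The prompt is consumed in one parallel pass, but every output token
    requires a fresh forward pass over all tokens produced so far.
    """
    tokens = list(prompt_tokens)
    forward_passes = 1                    # input: one parallel pass over the prompt
    for _ in range(n_output):
        next_token = model_step(tokens)   # one full forward pass per output token
        tokens.append(next_token)
        forward_passes += 1
    return tokens, forward_passes

# A dummy "model" that always emits token 0:
_, passes = generate(lambda context: 0, prompt_tokens=range(10_000), n_output=500)
print(passes)  # 501: prompt length added no passes, output length added 500
```

Whatever the prompt size, input contributes one pass; each output token adds another. That is the asymmetry in miniature.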
Key insight: The input/output price gap isn’t arbitrary markup — it reflects genuine differences in computational cost. Providers who charge the same for input and output are either subsidizing output or using less capable models.
The Asymmetry in Numbers
Output-to-input price ratios across major providers
March 2026 Price Ratios
// Output-to-input price ratio by model ($/M input / $/M output)
GPT-5 Nano       $0.05 / $0.40     8.0x
GPT-4o mini      $0.15 / $0.60     4.0x
DeepSeek V3.2    $0.27 / $0.42     1.6x
GPT-5            $1.25 / $10.00    8.0x
GPT-4.1          $2.00 / $8.00     4.0x
GPT-5.4          $2.50 / $15.00    6.0x
Claude Sonnet    $3.00 / $15.00    5.0x
Claude Opus      $5.00 / $25.00    5.0x
o1-pro           $150  / $600      4.0x
The Pattern
Most models charge 4–8x more for output than input. The outlier is DeepSeek V3.2 at only 1.6x — likely a competitive pricing strategy to gain market share. The typical ratio across the industry is 4–5x. This means a request that generates 1,000 output tokens costs the same as reading 4,000–5,000 input tokens.
Key insight: When estimating costs, don't just count total tokens. A 10,000-token request split as 9,000 input + 1,000 output costs far less than one split as 1,000 input + 9,000 output; the output-heavy request can cost 3–5x more.
Task Type Determines Your Bill
Same total tokens, wildly different costs
Summarization vs Drafting
Summarization (Cheap)
8,000 input + 500 output
Long document in, short summary out.
At GPT-5.4: $0.02 input + $0.0075 output = $0.0275
Drafting (Expensive)
500 input + 8,000 output
Short prompt in, long document out.
At GPT-5.4: $0.00125 input + $0.12 output = $0.12125
Key insight: Same total tokens (8,500), but the drafting task costs about 4.4x more. This is why understanding the input/output ratio of your workload is critical for cost estimation.
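Those two price tags can be reproduced with a small helper. This is a sketch; the $2.50 / $15.00 per-million GPT-5.4 prices are taken from the ratio table earlier in this chapter.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request at $/M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

summarization = request_cost(8_000, 500, 2.50, 15.00)   # input-heavy
drafting      = request_cost(500, 8_000, 2.50, 15.00)   # output-heavy

print(f"Summarization: ${summarization:.4f}")             # $0.0275
print(f"Drafting:      ${drafting:.5f}")                  # $0.12125
print(f"Ratio:         {drafting / summarization:.1f}x")  # 4.4x
```

Swapping the input/output split while holding total tokens fixed is all it takes to multiply the bill.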
Common Task Profiles
// Input:Output ratio by task type
Classification       95:5    (cheapest)
Extraction           90:10   (cheap)
Summarization        85:15   (cheap)
Q&A / Chat           60:40   (moderate)
Code generation      30:70   (expensive)
Long-form writing    10:90   (most expensive)
Agent planning       20:80   (expensive)
The Three Pricing Tiers
Budget, mid-tier, and flagship — a 3,000x spread
Budget Tier: $0.05–$0.30/M Input
GPT-5 Nano, GPT-4.1 Nano, GPT-4o mini, DeepSeek V3.2. These models are remarkably capable for their price. They handle classification, extraction, summarization, simple Q&A, and routing decisions. For a customer support bot handling 10,000 conversations/month, budget models cost $2–5/month. They cover 60–70% of production workloads.
Mid Tier: $1.00–$5.00/M Input
Claude Haiku, GPT-5, GPT-4.1, GPT-5.4, Claude Sonnet, Claude Opus. The workhorses for coding assistance, complex analysis, multi-step reasoning, and creative writing. A code assistant for 20 developers costs $30–50/month at this tier. These models handle the remaining 25–35% of tasks that need more sophistication.
Flagship / Reasoning: $15–$150/M Input
o1-pro, o3-pro. Reserved for hard math, science, and coding problems that require extended reasoning chains. These models generate thousands of invisible “thinking tokens” (covered in Chapter 3). A single complex reasoning task can cost $0.50–$5.00. Use sparingly and only when cheaper models genuinely can’t handle the task.
Key insight: The 3,000x price range means that choosing the right tier for each task is the single most impactful cost decision. Routing 62% of tasks to budget models (with little or no quality loss) can cut your bill by 40–60%. This is covered in Chapter 6.
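The routing arithmetic behind that claim can be checked directly. This is a sketch using illustrative prices: $0.15/M for the budget model and $2.50/M for the mid-tier model, per the tier descriptions above.

```python
def blended_price(routes):
    """Average $/M tokens for a routing mix of (traffic_share, price_per_m) pairs."""
    assert abs(sum(share for share, _ in routes) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(share * price for share, price in routes)

baseline = blended_price([(1.00, 2.50)])                 # everything on the mid-tier model
routed   = blended_price([(0.62, 0.15), (0.38, 2.50)])   # 62% rerouted to the budget model
savings  = 1 - routed / baseline
print(f"{savings:.0%}")  # 58%
```

A 58% reduction sits inside the quoted 40–60% band; the exact figure depends on which models you blend and on each route's input/output mix.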
Reading an API Pricing Page
The per-million-tokens convention and how to calculate costs
The Convention
All major providers quote prices as dollars per million tokens ($/M). When you see “$2.50 / $10.00” it means $2.50 per million input tokens and $10.00 per million output tokens. This convention exists because individual tokens cost fractions of a cent — per-million makes the numbers readable.
// Cost calculation formula
cost = (input_tokens × input_price_per_M + output_tokens × output_price_per_M) / 1,000,000

// Example: 2,000 input + 500 output on GPT-5.4 ($2.50 / $15.00 per M)
cost = (2,000 × $2.50 + 500 × $15.00) / 1,000,000
     = ($5,000 + $7,500) / 1,000,000
     = $0.0125 per request
Scaling the Math
That $0.0125 per request seems trivial. But at 10,000 requests/day, it’s $125/day or $3,750/month. At 100,000 requests/day, it’s $37,500/month. The per-request cost is always small; the monthly bill is where reality hits. Always multiply by your expected daily volume to get the real picture.
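The volume multiplication is worth making habitual. A minimal sketch, assuming a 30-day month:

```python
def monthly_cost(cost_per_request, requests_per_day, days_per_month=30):
    """Scale a per-request cost up to a monthly bill."""
    return cost_per_request * requests_per_day * days_per_month

per_request = 0.0125  # the GPT-5.4 example above
print(f"${monthly_cost(per_request, 10_000):,.0f}/month")    # $3,750/month
print(f"${monthly_cost(per_request, 100_000):,.0f}/month")   # $37,500/month
```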
Key insight: A common mistake is evaluating models by per-request cost alone. The right metric is monthly cost at production volume. A model that’s $0.001 cheaper per request saves $30,000/month at 1M requests/day.
The Pricing Traps
Common mistakes that inflate your bill
Trap 1: Ignoring System Prompts
Your system prompt is sent with every single request. A 2,000-token system prompt across 100,000 daily requests means 200M input tokens/day. At $2.50/M, that’s $500/day just for system prompts. Prompt caching (Chapter 6) can reduce this by 75–90%, but only if you know to look for it.
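The system-prompt math above is worth running yourself. A sketch using the figures from this trap; the 90% caching discount is an assumption for illustration, since the exact discount and mechanics are provider-specific.

```python
SYSTEM_PROMPT_TOKENS = 2_000
REQUESTS_PER_DAY     = 100_000
INPUT_PRICE_PER_M    = 2.50   # $/M input tokens

daily_tokens = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY        # 200,000,000 tokens/day
daily_cost   = daily_tokens * INPUT_PRICE_PER_M / 1_000_000   # dollars/day

# Assumed: prompt caching discounts cached input reads by 90%,
# so only 10% of the cost remains (provider-specific in practice).
cached_daily_cost = daily_cost / 10
print(daily_cost, cached_daily_cost)  # 500.0 50.0
```

$500/day for tokens the model has already seen thousands of times is exactly the kind of line item caching exists to shrink.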
Trap 2: Conversation History Bloat
In multi-turn conversations, the entire history is re-sent with each message. A 20-turn conversation accumulates all previous messages as input tokens. By turn 20, you might be sending 15,000+ input tokens per request — even if the user’s latest message is only 50 tokens.
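That accumulation is easy to model with a toy tally. The per-message token sizes here are illustrative assumptions chosen to land near the 15,000-token figure above, not measurements.

```python
def input_tokens_per_turn(turns, system=500, user_msg=50, assistant_msg=750):
    """Input tokens sent at each turn when the full history is re-sent.

    Per-message sizes are illustrative assumptions.
    """
    sent, history = [], system
    for _ in range(turns):
        history += user_msg          # the new user message joins the request
        sent.append(history)         # the whole history goes out as input tokens
        history += assistant_msg     # the reply becomes history for the next turn
    return sent

per_turn = input_tokens_per_turn(20)
print(per_turn[0], per_turn[-1])  # 550 15750
```

Roughly 29x more input tokens by turn 20 than at turn 1, even though each user message is still only 50 tokens.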
Trap 3: Verbose Output Instructions
Asking the model to “explain in detail” or “be thorough” can double or triple output length without improving quality. Since output tokens are 4–8x more expensive, verbose instructions have an outsized cost impact. Be specific about desired output length.
Trap 4: Wrong Model for the Job
Using Claude Opus ($5/$25) for tasks that GPT-4o mini ($0.15/$0.60) handles equally well means paying 33x more for input and 42x more for output. The quality difference for simple classification or extraction tasks is often negligible.
Key insight: Most teams overspend not because individual requests are expensive, but because they use the same expensive model for every task and never optimize system prompts or conversation history management.
The Complete Price Picture
Putting it all together before we go deeper
The Restaurant, Revisited
Here’s the full restaurant analogy: Reading the menu = input tokens (cheap, parallel). The chef cooking = output tokens (expensive, sequential). A prix fixe menu = subscription pricing (fixed cost regardless of consumption). Ordering the tasting menu = reasoning models with thinking tokens (Chapter 3). Sending food back = retry loops in agents (Chapter 7).
Key insight: The input/output asymmetry is the most important pricing concept in AI economics. Every optimization strategy in this course — caching, compression, routing, distillation — works by either reducing the number of expensive output tokens or shifting work to cheaper input processing.
What’s Next
Chapter 3 reveals the hidden multipliers that make your actual bill much higher than the sticker price suggests. Reasoning tokens, quadratic attention scaling, context surcharges, and batch discounts — the iceberg beneath the surface of API pricing.
Chapter Summary
Output tokens cost 3–8x more than input because generation is sequential while reading is parallel. Task type determines your bill — summarization (input-heavy) is cheap, drafting (output-heavy) is expensive. The 3,000x price range across models means tier selection is the highest-leverage decision. Always calculate monthly cost at production volume, not per-request cost.