Ch 2 — The Token Price Tag

Input, output & the asymmetry — the restaurant analogy
The Restaurant Analogy
Reading the menu is cheap; having the chef cook is expensive
Input = Reading the Menu
When you send a prompt to an LLM, the model reads everything at once — your system prompt, conversation history, retrieved documents, and your question. This is like walking into a restaurant and reading the entire menu. It takes a moment, but the restaurant can serve many readers simultaneously. Input tokens are processed in parallel in a single forward pass through the model.
Output = The Chef Cooking
The model’s response is generated one token at a time, sequentially. Each token depends on every token before it. This is like a chef preparing your meal — course by course, each dish building on the last. The chef can only work on one dish at a time, and you’re paying for every minute of their attention. Output tokens require a full forward pass per token.
Key insight: This is why output tokens typically cost 3–8x more than input tokens across major providers. Reading is cheap; creating is expensive. The same principle applies to AI.
Why Output Is Fundamentally More Expensive
Three layers of cost behind every generated token
Layer 1: Compute Intensity
Input processing runs the entire prompt through the model in one parallel operation. Whether your prompt is 100 tokens or 10,000 tokens, it’s processed in a single batch. Output generation requires a complete forward pass for every single token. A 500-token response means 500 separate inference passes through the model’s billions of parameters.
Layer 2: Memory Overhead
During generation, the model must maintain a KV-cache (key-value cache) that stores attention keys and values for every token in the context: the prompt plus everything generated so far. This cache grows with each new token and occupies GPU memory for the entire generation process. Longer responses require more memory for longer periods.
Layer 3: Sequential Dependency
This is the fundamental constraint: token 47 cannot be generated until token 46 exists. The autoregressive architecture means output generation is inherently sequential and cannot be parallelized. No amount of hardware can make output generation as efficient as input processing — it’s a mathematical limitation of how transformer models work.
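The three layers above can be made concrete with a toy decoding loop. This is a sketch, not a real model: `model_step` is a hypothetical stand-in for a full forward pass, and the pass counter is the only thing being measured.

```python
def generate(model_step, prompt_tokens, n_output):
    """Toy autoregressive decoding loop.

    The prompt is consumed in one parallel pass, but every output token
    requires a fresh forward pass over all tokens produced so far.
    """
    tokens = list(prompt_tokens)
    forward_passes = 1                    # input: one parallel pass over the prompt
    for _ in range(n_output):
        next_token = model_step(tokens)   # one full forward pass per output token
        tokens.append(next_token)
        forward_passes += 1
    return tokens, forward_passes

# A dummy "model" that always emits token 0:
_, passes = generate(lambda context: 0, prompt_tokens=range(10_000), n_output=500)
print(passes)  # 501: prompt length added no passes, output length added 500
```

Whatever the prompt size, input contributes one pass; each output token adds another. That is the asymmetry in miniature.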
Key insight: The input/output price gap isn’t arbitrary markup — it reflects genuine differences in computational cost. Providers who charge the same for input and output are either subsidizing output or using less capable models.
The Asymmetry in Numbers
Output-to-input price ratios across major providers
March 2026 Price Ratios
// Output-to-input price ratio by model ($/M input / $/M output)
GPT-5 Nano       $0.05 / $0.40     8.0x
GPT-4o mini      $0.15 / $0.60     4.0x
DeepSeek V3.2    $0.27 / $0.42     1.6x
GPT-5            $1.25 / $10.00    8.0x
GPT-4.1          $2.00 / $8.00     4.0x
GPT-5.4          $2.50 / $15.00    6.0x
Claude Sonnet    $3.00 / $15.00    5.0x
Claude Opus      $5.00 / $25.00    5.0x
o1-pro           $150  / $600      4.0x
The Pattern
Most models charge 4–8x more for output than input. The outlier is DeepSeek V3.2 at only 1.6x — likely a competitive pricing strategy to gain market share. The typical ratio across the industry is 4–5x. This means a request that generates 1,000 output tokens costs the same as reading 4,000–5,000 input tokens.
Key insight: When estimating costs, don't just count total tokens. A 10,000-token request split as 9,000 input + 1,000 output costs far less than one split as 1,000 input + 9,000 output; the output-heavy request can cost 3–5x more.
Task Type Determines Your Bill
Same total tokens, wildly different costs
Summarization vs Drafting
Summarization (Cheap)
8,000 input + 500 output
Long document in, short summary out.
At GPT-5.4: $0.02 input + $0.0075 output = $0.0275
Drafting (Expensive)
500 input + 8,000 output
Short prompt in, long document out.
At GPT-5.4: $0.00125 input + $0.12 output = $0.12125
Key insight: Same total tokens (8,500), but the drafting task costs about 4.4x more. This is why understanding the input/output ratio of your workload is critical for cost estimation.
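Those two price tags can be reproduced with a small helper. This is a sketch; the $2.50 / $15.00 per-million GPT-5.4 prices are taken from the ratio table earlier in this chapter.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request at $/M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

summarization = request_cost(8_000, 500, 2.50, 15.00)   # input-heavy
drafting      = request_cost(500, 8_000, 2.50, 15.00)   # output-heavy

print(f"Summarization: ${summarization:.4f}")             # $0.0275
print(f"Drafting:      ${drafting:.5f}")                  # $0.12125
print(f"Ratio:         {drafting / summarization:.1f}x")  # 4.4x
```

Swapping the input/output split while holding total tokens fixed is all it takes to multiply the bill.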
Common Task Profiles
// Input:Output ratio by task type
Classification       95:5    (cheapest)
Extraction           90:10   (cheap)
Summarization        85:15   (cheap)
Q&A / Chat           60:40   (moderate)
Code generation      30:70   (expensive)
Long-form writing    10:90   (most expensive)
Agent planning       20:80   (expensive)
The Three Pricing Tiers
Budget, mid-tier, and flagship — a 3,000x spread
Budget Tier: $0.05–$0.30/M Input
GPT-5 Nano, GPT-4.1 Nano, GPT-4o mini, DeepSeek V3.2. These models are remarkably capable for their price. They handle classification, extraction, summarization, simple Q&A, and routing decisions. For a customer support bot handling 10,000 conversations/month, budget models cost $2–5/month. They cover 60–70% of production workloads.
Mid Tier: $1.00–$5.00/M Input
Claude Haiku, GPT-5, GPT-4.1, GPT-5.4, Claude Sonnet, Claude Opus. The workhorses for coding assistance, complex analysis, multi-step reasoning, and creative writing. A code assistant for 20 developers costs $30–50/month at this tier. These models handle the remaining 25–35% of tasks that need more sophistication.
Flagship / Reasoning: $15–$150/M Input
o1-pro, o3-pro. Reserved for hard math, science, and coding problems that require extended reasoning chains. These models generate thousands of invisible “thinking tokens” (covered in Chapter 3). A single complex reasoning task can cost $0.50–$5.00. Use sparingly and only when cheaper models genuinely can’t handle the task.
Key insight: The 3,000x price range means that choosing the right tier for each task is the single most impactful cost decision. Routing 62% of tasks to budget models (with little or no quality loss) can cut your bill by 40–60%. This is covered in Chapter 6.
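The routing arithmetic behind that claim can be checked directly. This is a sketch using illustrative prices: $0.15/M for the budget model and $2.50/M for the mid-tier model, per the tier descriptions above.

```python
def blended_price(routes):
    """Average $/M tokens for a routing mix of (traffic_share, price_per_m) pairs."""
    assert abs(sum(share for share, _ in routes) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(share * price for share, price in routes)

baseline = blended_price([(1.00, 2.50)])                 # everything on the mid-tier model
routed   = blended_price([(0.62, 0.15), (0.38, 2.50)])   # 62% rerouted to the budget model
savings  = 1 - routed / baseline
print(f"{savings:.0%}")  # 58%
```

A 58% reduction sits inside the quoted 40–60% band; the exact figure depends on which models you blend and on each route's input/output mix.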
Reading an API Pricing Page
The per-million-tokens convention and how to calculate costs
The Convention
All major providers quote prices as dollars per million tokens ($/M). When you see “$2.50 / $10.00” it means $2.50 per million input tokens and $10.00 per million output tokens. This convention exists because individual tokens cost fractions of a cent — per-million makes the numbers readable.
// Cost calculation formula
cost = (input_tokens × input_price_per_M + output_tokens × output_price_per_M) / 1,000,000

// Example: 2,000 input + 500 output on GPT-5.4 ($2.50 / $15.00 per M)
cost = (2,000 × $2.50 + 500 × $15.00) / 1,000,000
     = ($5,000 + $7,500) / 1,000,000
     = $0.0125 per request
Scaling the Math
That $0.0125 per request seems trivial. But at 10,000 requests/day, it’s $125/day or $3,750/month. At 100,000 requests/day, it’s $37,500/month. The per-request cost is always small; the monthly bill is where reality hits. Always multiply by your expected daily volume to get the real picture.
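The volume multiplication is worth making habitual. A minimal sketch, assuming a 30-day month:

```python
def monthly_cost(cost_per_request, requests_per_day, days_per_month=30):
    """Scale a per-request cost up to a monthly bill."""
    return cost_per_request * requests_per_day * days_per_month

per_request = 0.0125  # the GPT-5.4 example above
print(f"${monthly_cost(per_request, 10_000):,.0f}/month")    # $3,750/month
print(f"${monthly_cost(per_request, 100_000):,.0f}/month")   # $37,500/month
```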
Key insight: A common mistake is evaluating models by per-request cost alone. The right metric is monthly cost at production volume. A model that’s $0.001 cheaper per request saves $30,000/month at 1M requests/day.
The Pricing Traps
Common mistakes that inflate your bill
Trap 1: Ignoring System Prompts
Your system prompt is sent with every single request. A 2,000-token system prompt across 100,000 daily requests means 200M input tokens/day. At $2.50/M, that’s $500/day just for system prompts. Prompt caching (Chapter 6) can reduce this by 75–90%, but only if you know to look for it.
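The system-prompt math above is worth running yourself. A sketch using the figures from this trap; the 90% caching discount is an assumption for illustration, since the exact discount and mechanics are provider-specific.

```python
SYSTEM_PROMPT_TOKENS = 2_000
REQUESTS_PER_DAY     = 100_000
INPUT_PRICE_PER_M    = 2.50   # $/M input tokens

daily_tokens = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY        # 200,000,000 tokens/day
daily_cost   = daily_tokens * INPUT_PRICE_PER_M / 1_000_000   # dollars/day

# Assumed: prompt caching discounts cached input reads by 90%,
# so only 10% of the cost remains (provider-specific in practice).
cached_daily_cost = daily_cost / 10
print(daily_cost, cached_daily_cost)  # 500.0 50.0
```

$500/day for tokens the model has already seen thousands of times is exactly the kind of line item caching exists to shrink.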
Trap 2: Conversation History Bloat
In multi-turn conversations, the entire history is re-sent with each message. A 20-turn conversation accumulates all previous messages as input tokens. By turn 20, you might be sending 15,000+ input tokens per request — even if the user’s latest message is only 50 tokens.
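That accumulation is easy to model with a toy tally. The per-message token sizes here are illustrative assumptions chosen to land near the 15,000-token figure above, not measurements.

```python
def input_tokens_per_turn(turns, system=500, user_msg=50, assistant_msg=750):
    """Input tokens sent at each turn when the full history is re-sent.

    Per-message sizes are illustrative assumptions.
    """
    sent, history = [], system
    for _ in range(turns):
        history += user_msg          # the new user message joins the request
        sent.append(history)         # the whole history goes out as input tokens
        history += assistant_msg     # the reply becomes history for the next turn
    return sent

per_turn = input_tokens_per_turn(20)
print(per_turn[0], per_turn[-1])  # 550 15750
```

Roughly 29x more input tokens by turn 20 than at turn 1, even though each user message is still only 50 tokens.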
Trap 3: Verbose Output Instructions
Asking the model to “explain in detail” or “be thorough” can double or triple output length without improving quality. Since output tokens are 4–8x more expensive, verbose instructions have an outsized cost impact. Be specific about desired output length.
Trap 4: Wrong Model for the Job
Using Claude Opus ($5/$25) for tasks that GPT-4o mini ($0.15/$0.60) handles equally well means paying 33x more for input and 42x more for output. The quality difference for simple classification or extraction tasks is often negligible.
Key insight: Most teams overspend not because individual requests are expensive, but because they use the same expensive model for every task and never optimize system prompts or conversation history management.
The Complete Price Picture
Putting it all together before we go deeper
The Restaurant, Revisited
Here’s the full restaurant analogy: Reading the menu = input tokens (cheap, parallel). The chef cooking = output tokens (expensive, sequential). A prix fixe menu = subscription pricing (fixed cost regardless of consumption). Ordering the tasting menu = reasoning models with thinking tokens (Chapter 3). Sending food back = retry loops in agents (Chapter 7).
Key insight: The input/output asymmetry is the most important pricing concept in AI economics. Every optimization strategy in this course — caching, compression, routing, distillation — works by either reducing the number of expensive output tokens or shifting work to cheaper input processing.
What’s Next
Chapter 3 reveals the hidden multipliers that make your actual bill much higher than the sticker price suggests. Reasoning tokens, quadratic attention scaling, context surcharges, and batch discounts — the iceberg beneath the surface of API pricing.
Chapter Summary
Output tokens cost 3–8x more than input because generation is sequential while reading is parallel. Task type determines your bill — summarization (input-heavy) is cheap, drafting (output-heavy) is expensive. The 3,000x price range across models means tier selection is the highest-leverage decision. Always calculate monthly cost at production volume, not per-request cost.