Ch 1 — What Is a Token?

The fundamental unit of AI economics — the taxi meter analogy
The Taxi Meter Analogy
Every AI interaction starts the meter running
The Core Idea
Think of an LLM like a taxi. The moment you get in and tell the driver where to go, the meter starts running. Every block you travel (every chunk of text processed) adds to the fare. A token is that “block” — the fundamental billing unit of AI. You pay for every token that goes into the model (your question, context, documents) and every token that comes out (the model’s response).
What a Token Actually Is
A token is not a word, a character, or a sentence. It’s a sub-word unit — roughly 3–4 characters, or about 0.75 words in English. LLMs don’t read text the way humans do. They break everything into these small chunks before processing. The word “economics” is 1 token. The word “tokenization” is 2 tokens. The phrase “AI cost engineering” is 4 tokens.
Key insight: Just like you wouldn’t take a taxi without knowing the per-mile rate, you shouldn’t use an LLM API without understanding token pricing. The meter is always running.
How BPE Breaks Text Into Tokens
Byte Pair Encoding — the algorithm behind tokenization
Byte Pair Encoding (BPE)
Most modern LLMs use Byte Pair Encoding to decide how to split text. BPE starts with individual characters and iteratively merges the most frequently occurring pairs from a massive training corpus. After training, the algorithm produces a vocabulary of 50,000–100,000 tokens. Common words like “the” or “is” become single tokens. Rare or long words get split into smaller recognized pieces.
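The merge loop described above can be sketched in a few lines of Python. This is a toy illustration only: the corpus, the starting symbols, and the number of merges are made up, and real tokenizers operate on bytes with vocabularies of tens of thousands of entries.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # merge the two symbols into one
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> frequency
corpus = {tuple("believe"): 3, tuple("unbelievable"): 1, tuple("able"): 2}
for step in range(4):
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair[0]!r} + {pair[1]!r}")
```

Each pass greedily merges the most frequent adjacent pair, which is why substrings shared by common words ("believ", "able") end up as single vocabulary entries.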
// How BPE tokenizes "Unbelievable"
"Unbelievable" → ["Un", "believ", "able"]   // 1 word → 3 tokens

// How BPE tokenizes "the cat sat"
"the cat sat" → ["the", " cat", " sat"]     // 3 words → 3 tokens (common words = 1 token each)
Different Models, Different Tokenizers
OpenAI uses Tiktoken (cl100k_base for GPT-4, o200k_base for GPT-4o and later). Anthropic uses their own BPE variant. Google uses SentencePiece. The same sentence can produce different token counts on different models — “The quick brown fox jumps over the lazy dog” might be 9 tokens on GPT-4o but 11 on another model.
Key insight: Tokenizers are trained on data, so they reflect the patterns of their training corpus. English text tokenizes efficiently because these models were trained primarily on English. The same text in Japanese or Arabic can use 2–3x more tokens.
Why Format Matters
Code, JSON, and structured data cost more than prose
Token Density by Content Type
Not all text is created equal in token economics. Common English prose averages about 1.3 tokens per word — relatively efficient. But code averages 1.5–3 tokens per word because of special characters, indentation, and syntax. JSON and YAML are the worst offenders at 2–4 tokens per word, because every brace, bracket, colon, and quotation mark consumes a token.
// Token cost by format (same information)
English prose:  ~130 tokens
Python code:    ~200 tokens (1.5x)
JSON payload:   ~350 tokens (2.7x)
XML document:   ~400 tokens (3.1x)
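The format tax can be approximated without a real tokenizer. The sketch below is a crude heuristic built from the rules of thumb in this chapter (the 1.3 tokens-per-word figure and the observation that structural characters tend to cost a token each); it is not how any actual BPE tokenizer counts.

```python
import json
import re

def rough_tokens(text: str) -> int:
    """Crude token estimate: ~1.3 tokens per word chunk, plus one token
    per structural character (braces, brackets, colons, commas, quotes)."""
    words = re.findall(r"[A-Za-z0-9_]+", text)
    structural = re.findall(r'[{}\[\]:,"]', text)
    return round(len(words) * 1.3) + len(structural)

# Same information, two formats
data = {"name": "Ada", "role": "engineer", "active": True}
print(rough_tokens(json.dumps(data)))             # JSON: punctuation dominates
print(rough_tokens("Ada is an active engineer"))  # prose: words only
```

Even on this tiny payload, the quotes, braces, and colons multiply the estimate several times over, which is the mechanism behind the 2.7x JSON figure above.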
Why This Matters for Cost
If your application sends tool schemas, API responses, or structured data to the model, you’re paying a format tax. A single MCP tool definition in JSON can consume 500+ tokens. An agent with 20 tools loaded starts every conversation having already spent 10,000+ tokens just on tool definitions — before the user says a word.
Key insight: The same information expressed as a concise natural-language summary can cost 50–70% less in tokens than its JSON equivalent. Format choice is a cost decision.
Mental Math for Tokens
Quick rules of thumb for estimating token counts
The Quick Reference
You don’t need a tokenizer to estimate costs. These rough conversions work for English text with GPT-4o’s tokenizer:
// Token estimation rules of thumb
1 token        ≈ 4 characters or 0.75 words
100 tokens     ≈ 75 words (a short paragraph)
1,000 tokens   ≈ 750 words (about 1.5 pages)
A tweet        ≈ 40–70 tokens
An email       ≈ 130–150 tokens
A 10-page doc  ≈ 3,000–4,000 tokens
1M tokens      ≈ 750,000 words ≈ 1,500 pages
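The two core conversions above reduce to one-line functions. These are estimates for English text only, and actual counts vary by tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """~4 characters per token (English-text rule of thumb)."""
    return max(1, round(len(text) / 4))

def estimate_tokens_by_words(text: str) -> int:
    """~0.75 words per token, i.e. ~1.33 tokens per word."""
    return max(1, round(len(text.split()) / 0.75))
```

When the two estimates disagree significantly (e.g. on code or JSON), the character-based one is usually closer, because the word-based rule assumes prose-like punctuation density.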
Context Window Sizes
Context windows define the maximum number of tokens a model can process in a single request (input + output combined). As of March 2026: GPT-4o supports 128K tokens (~96,000 words), Claude 3.5 Sonnet supports 200K tokens, and Gemini 2.0 supports up to 2M tokens. But bigger isn’t always better — performance degrades well before hitting the limit.
Key insight: A 128K context window at GPT-4o pricing ($2.50/M input tokens) costs $0.32 if you fill it completely. Run 10,000 full-context requests per day and you’re spending $96,000/month on input tokens alone.
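The arithmetic behind that key insight, using the prices and request volume stated above:

```python
INPUT_PRICE_PER_M = 2.50   # GPT-4o input price, $/M tokens (from the text)
CONTEXT = 128_000          # full context window, in tokens

cost_per_request = CONTEXT * INPUT_PRICE_PER_M / 1_000_000
monthly = cost_per_request * 10_000 * 30   # 10k full-context requests/day, 30 days

print(f"${cost_per_request:.2f} per request")   # $0.32
print(f"${monthly:,.0f}/month input spend")     # $96,000
```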
The 1,000x Cost Collapse
From $30/M tokens to $0.05/M in three years
The Price Trajectory
When GPT-4 launched in May 2023, input tokens cost $30 per million and output tokens cost $60 per million. By March 2026, OpenAI’s GPT-5 Nano offers input at $0.05 per million and output at $0.40 per million. That’s a 600x reduction in input cost for a model that’s arguably more capable than the original GPT-4 for most tasks.
// The cost collapse timeline
May 2023  GPT-4:         $30.00 / 1M input tokens
Nov 2023  GPT-4 Turbo:   $10.00 / 1M input tokens
May 2024  GPT-4o:        $5.00  / 1M input tokens
Jul 2024  GPT-4o mini:   $0.15  / 1M input tokens
Apr 2025  GPT-4.1 Nano:  $0.10  / 1M input tokens
Mar 2026  GPT-5 Nano:    $0.05  / 1M input tokens
Why It’s Collapsing
Three forces drive the collapse: (1) Hardware improvements — each new GPU generation (H100 → H200 → B200) delivers 2–3x more inference throughput. (2) Algorithmic efficiency — techniques like speculative decoding, quantization, and distillation make models faster and cheaper to run. (3) Competition — DeepSeek, Mistral, and open-source models force prices down across the board.
Key insight: The cost collapse doesn’t mean AI is getting cheap. It means the same capability is getting cheap. Frontier models (o1-pro at $150/M input) remain expensive. The gap between budget and premium is now 3,000x.
The March 2026 Pricing Landscape
Budget, mid-tier, and flagship — the full spectrum
Budget Tier ($0.05–$0.30/M input)
GPT-5 Nano ($0.05/$0.40), GPT-4.1 Nano ($0.10/$0.40), GPT-4o mini ($0.15/$0.60), DeepSeek V3.2 ($0.27/$0.42). These models handle classification, extraction, summarization, and simple Q&A. They cover 60–70% of production workloads at negligible cost.
Mid Tier ($1.00–$5.00/M input)
Claude 3.5 Haiku ($1.00/$5.00), GPT-5 ($1.25/$10.00), GPT-4.1 ($2.00/$8.00), GPT-5.4 ($2.50/$15.00), Claude 4.6 Sonnet ($3.00/$15.00), Claude 4.6 Opus ($5.00/$25.00). The workhorses for coding, analysis, and complex reasoning.
Flagship / Reasoning ($15–$150/M input)
o1-pro ($150/$600) sits at the extreme end. These models use extended “thinking” chains for hard math, science, and coding problems. The price reflects the enormous compute required for reasoning tokens (covered in Chapter 3).
The Pricing Convention
All API pricing is quoted as cost per million tokens ($/M). When you see “$2.50/$10.00” it means $2.50 per million input tokens and $10.00 per million output tokens. To calculate a single request cost: (input_tokens × input_price + output_tokens × output_price) / 1,000,000.
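That convention translates directly into a helper function. The example numbers below are an illustrative chat turn, not a benchmark.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are quoted per million tokens ($/M)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A typical chat turn on a "$2.50/$10.00" model:
cost = request_cost(1_200, 400, 2.50, 10.00)
print(f"${cost:.4f}")   # 0.007 dollars, i.e. less than a cent per turn
```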
Key insight: The 3,000x price range from GPT-5 Nano ($0.05/M) to o1-pro ($150/M) means model selection is the single highest-leverage cost decision you can make. Choosing the right model for each task is covered in Chapter 6.
Inference Dominates AI Spend
Training is a one-time cost; inference runs forever
Training vs Inference
Training a frontier model is a massive one-time investment (GPT-4 reportedly cost $100M+ to train). But once trained, the model serves millions of users continuously. Inference — the process of generating responses to user queries — now accounts for roughly two-thirds of all AI compute spend industry-wide. Every API call, every chatbot response, every agent action is an inference cost.
Why This Matters
Training costs are borne by the model provider (OpenAI, Anthropic, Google). Inference costs are borne by you, the developer or company using the API. As AI agents become more autonomous — making dozens of LLM calls per task — inference costs scale with usage in ways that training costs never did. Understanding token economics is understanding your operational cost structure.
Key insight: A single autonomous agent making 50 LLM calls per task, running 100 tasks per day, on a mid-tier model at $3/M input tokens, can easily consume $500–1,500/month. Multiply by a fleet of agents and you’re in five-figure territory.
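Plugging the key insight's workload into a quick sketch shows how the monthly figure arises. The per-call token counts are assumptions chosen for illustration; your agents' context sizes will differ.

```python
CALLS_PER_TASK = 50
TASKS_PER_DAY = 100
INPUT_TOKENS_PER_CALL = 2_000    # assumption: prompt + tools + context per call
OUTPUT_TOKENS_PER_CALL = 200     # assumption: short tool-use responses
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # mid-tier $/M pricing (from the text)

calls_per_day = CALLS_PER_TASK * TASKS_PER_DAY
daily = calls_per_day * (INPUT_TOKENS_PER_CALL * INPUT_PRICE
                         + OUTPUT_TOKENS_PER_CALL * OUTPUT_PRICE) / 1_000_000

print(f"${daily:.2f}/day, ${daily * 30:,.0f}/month for one agent")  # $45.00/day, $1,350/month
```

Under these assumptions a single agent lands at $1,350/month, inside the $500–1,500 range; ten such agents would already exceed $13,000/month.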
Tokens as the Currency of AI
The mental model that frames this entire course
The Taxi Meter, Revisited
Here’s the complete taxi meter analogy: Getting in the cab = loading your system prompt and context (input tokens). Telling the driver your destination = your user query (more input tokens). The drive itself = the model generating a response (output tokens, billed at a higher rate). Taking a detour = reasoning/thinking tokens you never see but still pay for (Chapter 3). Sitting in traffic = quadratic attention scaling on long contexts (Chapter 3).
Key insight: Every optimization technique in this course — caching, compression, routing, distillation — maps back to reducing the meter. Use a cheaper cab (budget model). Take a shorter route (compress context). Share the ride (batch processing). Avoid detours (constrain reasoning). The economics are that simple.
What’s Next
Now that you understand what tokens are and how they’re counted, Chapter 2 dives into why input and output tokens are priced differently — the restaurant analogy. Reading the menu is cheap; having the chef cook your meal is expensive. The asymmetry between input and output pricing is the single most important concept in token economics.
Chapter Summary
Tokens are sub-word units (~0.75 words each) created by BPE. Format matters: JSON costs 2–3x more than prose. Token prices have collapsed 600x since 2023, but the range from budget to flagship is 3,000x. Inference now dominates AI spend. Every AI cost optimization starts with understanding this fundamental unit.