Ch 1 — What Is a Token?

The fundamental unit of AI economics — the taxi meter analogy
The Taxi Meter Analogy
Every AI interaction starts the meter running
The Core Idea
Think of an LLM like a taxi. The moment you get in and tell the driver where to go, the meter starts running. Every block you travel (every chunk of text processed) adds to the fare. A token is that “block” — the fundamental billing unit of AI. You pay for every token that goes into the model (your question, context, documents) and every token that comes out (the model’s response).
What a Token Actually Is
A token is not a word, a character, or a sentence. It’s a sub-word unit — roughly 3–4 characters, or about 0.75 words in English. LLMs don’t read text the way humans do. They break everything into these small chunks before processing. The word “economics” is 1 token. The word “tokenization” is 2 tokens. The phrase “AI cost engineering” is 4 tokens.
Key insight: Just like you wouldn’t take a taxi without knowing the per-mile rate, you shouldn’t use an LLM API without understanding token pricing. The meter is always running.
How BPE Breaks Text Into Tokens
Byte Pair Encoding — the algorithm behind tokenization
Byte Pair Encoding (BPE)
Most modern LLMs use Byte Pair Encoding to decide how to split text. BPE starts with individual characters and iteratively merges the most frequently occurring pairs from a massive training corpus. After training, the algorithm produces a vocabulary of 50,000–100,000 tokens. Common words like “the” or “is” become single tokens. Rare or long words get split into smaller recognized pieces.
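The merge loop described above can be sketched in a few lines of Python. This is a toy illustration only: the corpus, the starting symbols, and the number of merges are made up, and real tokenizers operate on bytes with vocabularies of tens of thousands of entries.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # merge the two symbols into one
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> frequency
corpus = {tuple("believe"): 3, tuple("unbelievable"): 1, tuple("able"): 2}
for step in range(4):
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair[0]!r} + {pair[1]!r}")
```

Each pass greedily merges the most frequent adjacent pair, which is why substrings shared by common words ("believ", "able") end up as single vocabulary entries.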
// How BPE tokenizes "Unbelievable"
"Unbelievable" → ["Un", "believ", "able"]   // 1 word → 3 tokens

// How BPE tokenizes "the cat sat"
"the cat sat" → ["the", " cat", " sat"]     // 3 words → 3 tokens (common words = 1 token each)
Different Models, Different Tokenizers
OpenAI uses Tiktoken (cl100k_base for GPT-4, o200k_base for GPT-4o and later). Anthropic uses their own BPE variant. Google uses SentencePiece. The same sentence can produce different token counts on different models — “The quick brown fox jumps over the lazy dog” might be 9 tokens on GPT-4o but 11 on another model.
Key insight: Tokenizers are trained on data, so they reflect the patterns of their training corpus. English text tokenizes efficiently because these models were trained primarily on English. The same text in Japanese or Arabic can use 2–3x more tokens.
Why Format Matters
Code, JSON, and structured data cost more than prose
Token Density by Content Type
Not all text is created equal in token economics. Common English prose averages about 1.3 tokens per word — relatively efficient. But code averages 1.5–3 tokens per word because of special characters, indentation, and syntax. JSON and YAML are the worst offenders at 2–4 tokens per word, because every brace, bracket, colon, and quotation mark consumes a token.
// Token cost by format (same information)
English prose:  ~130 tokens
Python code:    ~200 tokens (1.5x)
JSON payload:   ~350 tokens (2.7x)
XML document:   ~400 tokens (3.1x)
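The format tax can be approximated without a real tokenizer. The sketch below is a crude heuristic built from the rules of thumb in this chapter (the 1.3 tokens-per-word figure and the observation that structural characters tend to cost a token each); it is not how any actual BPE tokenizer counts.

```python
import json
import re

def rough_tokens(text: str) -> int:
    """Crude token estimate: ~1.3 tokens per word chunk, plus one token
    per structural character (braces, brackets, colons, commas, quotes)."""
    words = re.findall(r"[A-Za-z0-9_]+", text)
    structural = re.findall(r'[{}\[\]:,"]', text)
    return round(len(words) * 1.3) + len(structural)

# Same information, two formats
data = {"name": "Ada", "role": "engineer", "active": True}
print(rough_tokens(json.dumps(data)))             # JSON: punctuation dominates
print(rough_tokens("Ada is an active engineer"))  # prose: words only
```

Even on this tiny payload, the quotes, braces, and colons multiply the estimate several times over, which is the mechanism behind the 2.7x JSON figure above.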
Why This Matters for Cost
If your application sends tool schemas, API responses, or structured data to the model, you’re paying a format tax. A single MCP tool definition in JSON can consume 500+ tokens. An agent with 20 tools loaded starts every conversation having already spent 10,000+ tokens just on tool definitions — before the user says a word.
Key insight: The same information expressed as a concise natural-language summary can cost 50–70% less in tokens than its JSON equivalent. Format choice is a cost decision.
Mental Math for Tokens
Quick rules of thumb for estimating token counts
The Quick Reference
You don’t need a tokenizer to estimate costs. These rough conversions work for English text with GPT-4o’s tokenizer:
// Token estimation rules of thumb
1 token        ≈ 4 characters or 0.75 words
100 tokens     ≈ 75 words (a short paragraph)
1,000 tokens   ≈ 750 words (about 1.5 pages)
A tweet        ≈ 40–70 tokens
An email       ≈ 130–150 tokens
A 10-page doc  ≈ 3,000–4,000 tokens
1M tokens      ≈ 750,000 words ≈ 1,500 pages
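The two core conversions above reduce to one-line functions. These are estimates for English text only, and actual counts vary by tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """~4 characters per token (English-text rule of thumb)."""
    return max(1, round(len(text) / 4))

def estimate_tokens_by_words(text: str) -> int:
    """~0.75 words per token, i.e. ~1.33 tokens per word."""
    return max(1, round(len(text.split()) / 0.75))
```

When the two estimates disagree significantly (e.g. on code or JSON), the character-based one is usually closer, because the word-based rule assumes prose-like punctuation density.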
Context Window Sizes
Context windows define the maximum number of tokens a model can process in a single request (input + output combined). As of March 2026: GPT-4o supports 128K tokens (~96,000 words), Claude 3.5 Sonnet supports 200K tokens, and Gemini 2.0 supports up to 2M tokens. But bigger isn’t always better — performance degrades well before hitting the limit.
Key insight: A 128K context window at GPT-4o pricing ($2.50/M input tokens) costs $0.32 if you fill it completely. Run 10,000 full-context requests per day and you’re spending $96,000/month on input tokens alone.
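The arithmetic behind that key insight, using the prices and request volume stated above:

```python
INPUT_PRICE_PER_M = 2.50   # GPT-4o input price, $/M tokens (from the text)
CONTEXT = 128_000          # full context window, in tokens

cost_per_request = CONTEXT * INPUT_PRICE_PER_M / 1_000_000
monthly = cost_per_request * 10_000 * 30   # 10k full-context requests/day, 30 days

print(f"${cost_per_request:.2f} per request")   # $0.32
print(f"${monthly:,.0f}/month input spend")     # $96,000
```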
The 1,000x Cost Collapse
From $30/M tokens to $0.05/M in three years
The Price Trajectory
When GPT-4 launched in May 2023, input tokens cost $30 per million and output tokens cost $60 per million. By March 2026, OpenAI’s GPT-5 Nano offers input at $0.05 per million and output at $0.40 per million. That’s a 600x reduction in input cost for a model that’s arguably more capable than the original GPT-4 for most tasks.
// The cost collapse timeline
May 2023  GPT-4:         $30.00 / 1M input tokens
Nov 2023  GPT-4 Turbo:   $10.00 / 1M input tokens
May 2024  GPT-4o:        $5.00  / 1M input tokens
Jul 2024  GPT-4o mini:   $0.15  / 1M input tokens
Apr 2025  GPT-4.1 Nano:  $0.10  / 1M input tokens
Mar 2026  GPT-5 Nano:    $0.05  / 1M input tokens
Why It’s Collapsing
Three forces drive the collapse: (1) Hardware improvements — each new GPU generation (H100 → H200 → B200) delivers 2–3x more inference throughput. (2) Algorithmic efficiency — techniques like speculative decoding, quantization, and distillation make models faster and cheaper to run. (3) Competition — DeepSeek, Mistral, and open-source models force prices down across the board.
Key insight: The cost collapse doesn’t mean AI is getting cheap. It means the same capability is getting cheap. Frontier models (o1-pro at $150/M input) remain expensive. The gap between budget and premium is now 3,000x.
The March 2026 Pricing Landscape
Budget, mid-tier, and flagship — the full spectrum
Budget Tier ($0.05–$0.30/M input)
GPT-5 Nano ($0.05/$0.40), GPT-4.1 Nano ($0.10/$0.40), GPT-4o mini ($0.15/$0.60), DeepSeek V3.2 ($0.27/$0.42). These models handle classification, extraction, summarization, and simple Q&A. They cover 60–70% of production workloads at negligible cost.
Mid Tier ($1.00–$5.00/M input)
Claude 3.5 Haiku ($1.00/$5.00), GPT-5 ($1.25/$10.00), GPT-4.1 ($2.00/$8.00), GPT-5.4 ($2.50/$15.00), Claude 4.6 Sonnet ($3.00/$15.00), Claude 4.6 Opus ($5.00/$25.00). The workhorses for coding, analysis, and complex reasoning.
Flagship / Reasoning ($15–$150/M input)
o1-pro ($150/$600) sits at the extreme end. These models use extended “thinking” chains for hard math, science, and coding problems. The price reflects the enormous compute required for reasoning tokens (covered in Chapter 3).
The Pricing Convention
All API pricing is quoted as cost per million tokens ($/M). When you see “$2.50/$10.00” it means $2.50 per million input tokens and $10.00 per million output tokens. To calculate a single request cost: (input_tokens × input_price + output_tokens × output_price) / 1,000,000.
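That convention translates directly into a helper function. The example numbers below are an illustrative chat turn, not a benchmark.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are quoted per million tokens ($/M)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A typical chat turn on a "$2.50/$10.00" model:
cost = request_cost(1_200, 400, 2.50, 10.00)
print(f"${cost:.4f}")   # 0.007 dollars, i.e. less than a cent per turn
```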
Key insight: The 3,000x price range from GPT-5 Nano ($0.05/M) to o1-pro ($150/M) means model selection is the single highest-leverage cost decision you can make. Choosing the right model for each task is covered in Chapter 6.
Inference Dominates AI Spend
Training is a one-time cost; inference runs forever
Training vs Inference
Training a frontier model is a massive one-time investment (GPT-4 reportedly cost $100M+ to train). But once trained, the model serves millions of users continuously. Inference — the process of generating responses to user queries — now accounts for roughly two-thirds of all AI compute spend industry-wide. Every API call, every chatbot response, every agent action is an inference cost.
Why This Matters
Training costs are borne by the model provider (OpenAI, Anthropic, Google). Inference costs are borne by you, the developer or company using the API. As AI agents become more autonomous — making dozens of LLM calls per task — inference costs scale with usage in ways that training costs never did. Understanding token economics is understanding your operational cost structure.
Key insight: A single autonomous agent making 50 LLM calls per task, running 100 tasks per day, on a mid-tier model at $3/M input tokens, can easily consume $500–1,500/month. Multiply by a fleet of agents and you’re in five-figure territory.
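Plugging the key insight's workload into a quick sketch shows how the monthly figure arises. The per-call token counts are assumptions chosen for illustration; your agents' context sizes will differ.

```python
CALLS_PER_TASK = 50
TASKS_PER_DAY = 100
INPUT_TOKENS_PER_CALL = 2_000    # assumption: prompt + tools + context per call
OUTPUT_TOKENS_PER_CALL = 200     # assumption: short tool-use responses
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # mid-tier $/M pricing (from the text)

calls_per_day = CALLS_PER_TASK * TASKS_PER_DAY
daily = calls_per_day * (INPUT_TOKENS_PER_CALL * INPUT_PRICE
                         + OUTPUT_TOKENS_PER_CALL * OUTPUT_PRICE) / 1_000_000

print(f"${daily:.2f}/day, ${daily * 30:,.0f}/month for one agent")  # $45.00/day, $1,350/month
```

Under these assumptions a single agent lands at $1,350/month, inside the $500–1,500 range; ten such agents would already exceed $13,000/month.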
Tokens as the Currency of AI
The mental model that frames this entire course
The Taxi Meter, Revisited
Here’s the complete taxi meter analogy: Getting in the cab = loading your system prompt and context (input tokens). Telling the driver your destination = your user query (more input tokens). The drive itself = the model generating a response (output tokens, billed at a higher rate). Taking a detour = reasoning/thinking tokens you never see but still pay for (Chapter 3). Sitting in traffic = quadratic attention scaling on long contexts (Chapter 3).
Key insight: Every optimization technique in this course — caching, compression, routing, distillation — maps back to reducing the meter. Use a cheaper cab (budget model). Take a shorter route (compress context). Share the ride (batch processing). Avoid detours (constrain reasoning). The economics are that simple.
What’s Next
Now that you understand what tokens are and how they’re counted, Chapter 2 dives into why input and output tokens are priced differently — the restaurant analogy. Reading the menu is cheap; having the chef cook your meal is expensive. The asymmetry between input and output pricing is the single most important concept in token economics.
Chapter Summary
Tokens are sub-word units (~0.75 words each) created by BPE. Format matters: JSON costs 2–3x more than prose. Token prices have collapsed 600x since 2023, but the range from budget to flagship is 3,000x. Inference now dominates AI spend. Every AI cost optimization starts with understanding this fundamental unit.