Ch 1 — Text to Tokens

How raw text becomes numbers — the very first step in every LLM
Pipeline: Raw Text → Split → BPE → Vocab → Special Tokens → Token IDs → Tokenizer Comparison → Why It Matters
The Problem: Computers Don’t Read
Why we need tokenization at all
The Analogy
Imagine you speak English and need to communicate with someone who only understands numbers. You’d need a codebook — a dictionary that maps every word (or piece of a word) to a unique number. That’s exactly what a tokenizer does. It’s the translator between human language and the numbers a neural network can process.
Key insight: An LLM never sees text. It only sees sequences of integers. “Hello world” might become [15339, 1917]. Everything the model does — understanding, reasoning, generating — happens in the world of numbers. Tokenization is the bridge.
What Happens
```
# What you type:
"Hello, how are you?"

# What the LLM actually sees:
[9906, 11, 1268, 527, 499, 30]

# Each number = one "token":
# 9906 → "Hello"
#   11 → ","
# 1268 → " how"
#  527 → " are"
#  499 → " you"
#   30 → "?"
```
Real World
A codebook that translates English words into numbered codes for a telegraph
In LLMs
A tokenizer that maps text chunks to integer IDs from a fixed vocabulary
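The codebook idea fits in a few lines of Python. This is a minimal sketch with an invented six-entry vocabulary (the IDs here are made up for illustration; real vocabularies hold tens of thousands of entries):

```python
# A toy "codebook": every known chunk gets a unique integer ID.
# (Hypothetical IDs for illustration, not any real model's vocabulary.)
vocab = {"Hello": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5}
inverse = {i: s for s, i in vocab.items()}

def encode(chunks):
    """Map text chunks to integer IDs -- what the model actually sees."""
    return [vocab[c] for c in chunks]

def decode(ids):
    """Map IDs back to text -- how model output becomes readable."""
    return "".join(inverse[i] for i in ids)

ids = encode(["Hello", ",", " how", " are", " you", "?"])
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # Hello, how are you?
```

The round trip is the whole job: everything between `encode` and `decode` happens purely in integer space.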
How to Split Text: Three Approaches
Characters, words, or something in between?
The Analogy
Imagine chopping a sentence for a Scrabble game. You could split into individual letters (tiny tiles, lots of them), whole words (big tiles, but you need millions), or common chunks like “ing”, “tion”, “un” (medium tiles, reusable). Modern LLMs use the third approach: subword tokenization.
Key insight: Word-level tokenization fails on new words (“ChatGPT” would be unknown). Character-level needs extremely long sequences (a 1000-word essay = ~5000 characters). Subword tokenization is the sweet spot — common words stay whole, rare words get split into known pieces.
The Three Approaches
```
# Character-level: tiny vocab, long sequences
"unhappiness" → ["u","n","h","a","p","p","i","n","e","s","s"]
# Vocab: ~256 (bytes). 11 tokens for 1 word!

# Word-level: huge vocab, OOV problem
"unhappiness" → ["unhappiness"]
# Vocab: 500K+ words. "ChatGPT" = unknown!

# Subword (BPE): balanced vocab, no OOV
"unhappiness" → ["un", "happiness"]
# Vocab: 32K-128K. Handles any text.
# Common words stay whole: "the" → ["the"]
# Rare words split: "ChatGPT" → ["Chat", "GPT"]
```
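The trade-offs can be made concrete with a short sketch. The word and subword vocabularies below are tiny, hand-picked toys (not any real model's vocabulary), just enough to show each strategy's behavior:

```python
# Character-level: trivial to implement, but 11 tokens for one word.
word = "unhappiness"
char_tokens = list(word)
print(len(char_tokens))  # 11

# Word-level: one token per word, but any unseen word becomes <UNK>.
word_vocab = {"unhappiness"}  # toy vocabulary
def word_tokenize(w):
    return [w] if w in word_vocab else ["<UNK>"]
print(word_tokenize("unhappiness"))  # ['unhappiness']
print(word_tokenize("ChatGPT"))      # ['<UNK>']

# Subword: greedy longest-match against a small subword list (toy example).
subwords = ["un", "happiness", "Chat", "GPT"]
def subword_tokenize(w):
    out, i = [], 0
    while i < len(w):
        # Take the longest known subword starting at position i,
        # falling back to the single character if nothing matches.
        match = max((s for s in subwords if w.startswith(s, i)),
                    key=len, default=w[i])
        out.append(match)
        i += len(match)
    return out
print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
print(subword_tokenize("ChatGPT"))      # ['Chat', 'GPT']
```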
Byte-Pair Encoding (BPE)
The algorithm behind GPT, Llama, and most modern tokenizers
The Analogy
Imagine you’re creating shorthand for note-taking. You notice you write “th” constantly, so you create a symbol for it. Then “the” appears a lot, so you merge “th”+“e” into one symbol. Then “the ” (with space), and so on. BPE does exactly this — it starts with individual bytes and repeatedly merges the most frequent pair into a new token.
Key insight: BPE was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP. OpenAI’s GPT-2 was the first major model to use byte-level BPE, and it’s been the standard ever since. The “byte” part means it starts from raw bytes (0-255), so it can handle any language, emoji, or even binary data.
BPE Step by Step
```
# Training corpus: "low lower lowest"

# Start with characters:
# l o w _ l o w e r _ l o w e s t

# Count pairs: (l,o)=3  (o,w)=3  (w,e)=2 ...
# Merge most frequent: (l,o) → "lo"
# lo w _ lo w e r _ lo w e s t

# Count again: (lo,w)=3 is most frequent
# Merge: (lo,w) → "low"
# low _ low e r _ low e s t

# Next: (low,e)=2 → "lowe"
# low _ lowe r _ lowe s t

# Continue until vocab size reached...

# Final merge rules (ordered):
# 1. l + o   → lo
# 2. lo + w  → low
# 3. low + e → lowe
# ...
```
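The whole training loop fits in a short function. This is a minimal sketch that ignores the word-boundary marker (the `_` in the walkthrough) and treats each whitespace-separated word independently:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-split corpus (minimal sketch)."""
    # Represent each word as a tuple of symbols, with a count per word.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest", 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Running it on the toy corpus reproduces the ordered merge rules from the walkthrough.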
The Vocabulary: Your Model’s Dictionary
Real vocabulary sizes from actual models
The Analogy
A vocabulary is like a phrase book for a traveler. A tiny phrase book (256 entries = raw bytes) can express anything but requires many lookups per sentence. A massive dictionary (500K words) is fast but can’t handle new slang. The sweet spot is 32K–128K subword tokens — enough to keep common words whole while splitting rare ones.
Real Model Vocabularies
```
# Actual vocabulary sizes:
# GPT-2:      50,257 tokens (BPE)
# GPT-3.5/4: 100,258 tokens (cl100k_base)
# GPT-4o:    200,019 tokens (o200k_base)
# Llama 2:    32,000 tokens (SentencePiece)
# Llama 3:   128,256 tokens (tiktoken-based)
# Claude 3:    ~100K tokens (BPE variant)
# Gemini:    256,000 tokens (SentencePiece)
```
Key insight: Larger vocabularies mean fewer tokens per sentence (faster inference, fits more text in context window) but more parameters in the embedding layer. GPT-4o doubled the vocab from 100K to 200K specifically to improve efficiency on non-English languages, where smaller vocabs waste tokens on single characters.
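The embedding-layer cost is easy to estimate: the table holds one d_model-sized vector per vocabulary entry. Using the published hidden size of 4096 for both Llama 2 7B and Llama 3 8B:

```python
# Embedding-table parameters = vocab_size x d_model (one vector per token).
def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

# Llama 2 7B: 32K vocab, hidden size 4096
print(embedding_params(32_000, 4096))   # 131072000  (~0.13B params)

# Llama 3 8B: 128K vocab, same hidden size
print(embedding_params(128_256, 4096))  # 525336576  (~0.53B params)
```

So quadrupling the vocabulary added roughly 400M embedding parameters, a price Llama 3 paid for its better compression ratio.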
Compression Ratio
```
# Average characters per token:
# GPT-2 (50K vocab):  ~4.0 chars/token (English)
# GPT-4 (100K vocab): ~4.0 chars/token (English)
# Llama 3 (128K):     ~4.0 chars/token (English)
#                     ~2.5 chars/token (Chinese)
#                     ~1.5 chars/token (Japanese)

# "Hello world" = 2 tokens (11 chars)
# → 5.5 chars/token (very efficient!)
```
Special Tokens: The Control Signals
Hidden markers that tell the model where things start, stop, and change
The Analogy
Special tokens are like punctuation marks for the model — but invisible to you. Just as a period tells you a sentence ended, <|endoftext|> tells the model a document ended. <|im_start|> marks where a chat message begins. These tokens never appear in normal text — they’re injected by the system to give the model structural cues.
Key insight: When you chat with ChatGPT, your message gets wrapped in special tokens before the model sees it. The model doesn’t see “What is Python?” — it sees something like <|im_start|>user\nWhat is Python?<|im_end|><|im_start|>assistant\n. These invisible markers are how the model knows who’s speaking and when to stop generating.
Common Special Tokens
```
# GPT-4 special tokens (cl100k_base):
<|endoftext|>        # document boundary
<|im_start|>         # chat message start
<|im_end|>           # chat message end
<|fim_prefix|>       # fill-in-the-middle
<|fim_middle|>       # (for code completion)
<|fim_suffix|>       # (for code completion)

# Llama 3 special tokens:
<|begin_of_text|>    # sequence start
<|end_of_text|>      # sequence end
<|start_header_id|>  # role marker
<|end_header_id|>    # role marker
<|eot_id|>           # end of turn
```
Chat Template Example
```
# What the model actually receives:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is Python?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
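Assembling such a prompt is plain string formatting. The sketch below mirrors the layout shown above; it is not the official chat template (in practice you would call the tokenizer's built-in `apply_chat_template`):

```python
# Minimal sketch: wrap chat messages in Llama 3-style special tokens.
def build_prompt(messages):
    lines = ["<|begin_of_text|>"]
    for msg in messages:
        lines.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>")
        lines.append(msg["content"])
        lines.append("<|eot_id|>")
    # Open the assistant turn so the model generates the reply next.
    lines.append("<|start_header_id|>assistant<|end_header_id|>")
    return "\n".join(lines)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

Note the deliberate ending: the prompt stops right after the assistant header, so the model's continuation is the assistant's reply.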
Token IDs: Try It Yourself
Real code you can run to see tokenization in action
OpenAI tiktoken
```python
import tiktoken

# GPT-4 tokenizer
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is surprisingly important!"
tokens = enc.encode(text)
# [5765, 2065, 374, 29729, 3062, 0]

# Decode back to see each token's text:
for t in tokens:
    print(f"{t:6d} → '{enc.decode([t])}'")
#   5765 → 'Token'
#   2065 → 'ization'
#    374 → ' is'
#  29729 → ' surprisingly'
#   3062 → ' important'
#      0 → '!'

print(f"Tokens: {len(tokens)}")  # 6
print(f"Chars:  {len(text)}")    # 39
print(f"Ratio:  {len(text)/len(tokens):.1f} chars/token")
# 6.5 chars/token
```
Hugging Face Transformers
```python
from transformers import AutoTokenizer

# Llama 3 tokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Tokenization is surprisingly important!"
# add_special_tokens=False skips the <|begin_of_text|> BOS token
ids = tok.encode(text, add_special_tokens=False)
tokens = tok.convert_ids_to_tokens(ids)
# ['Token', 'ization', 'Ġis', 'Ġsurprisingly',
#  'Ġimportant', '!']
# Ġ = space character (byte 0x20)

# Decode back to string:
tok.decode(ids)
# "Tokenization is surprisingly important!"
```
Why it matters: The Ġ prefix (displayed as a special character) represents a leading space. Tokens like Ġis mean “ is” (with space). This is how the tokenizer preserves whitespace — spaces aren’t separate tokens, they’re attached to the next word.
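You can verify the mapping yourself: in GPT-2-style byte-level BPE, the space byte (0x20) is shifted up by 256 into the printable range, landing on code point U+0120, which renders as Ġ:

```python
# GPT-2's byte-level BPE replaces "invisible" bytes with printable stand-ins.
# The space byte (0x20) maps to code point 0x20 + 256 = 0x120, i.e. 'Ġ'.
space_stand_in = chr(ord(" ") + 256)
print(space_stand_in)         # Ġ
print(hex(ord("Ġ")))          # 0x120

# So the token string 'Ġis' really means ' is' (space + "is"):
print("Ġis".replace("Ġ", " "))  # ' is'
```

(The full mapping assigns a visible stand-in to every non-printable byte; the simple +256 shift shown here is the specific case for the space.)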
BPE vs. WordPiece vs. SentencePiece
Different models use different tokenizers — here’s how they compare
The Three Main Algorithms
BPE (GPT, Llama 3): Merges the most frequent byte pair, bottom-up. Starts from bytes and builds up.

WordPiece (BERT, DistilBERT): Similar to BPE, but picks the merge that maximizes the likelihood of the training data, not just raw pair frequency. Uses a ## prefix for continuation pieces.

Unigram/SentencePiece (Llama 2, T5, Gemini): Starts with a large vocabulary and prunes the tokens that contribute least. Works directly on raw text (no pre-tokenization).
Key insight: The choice of tokenizer affects model performance more than most people realize. Llama 3 switched from SentencePiece (32K vocab) to a tiktoken-based BPE (128K vocab), improving compression from ~3.2 to ~4.0 chars/token on English. That means the same context window fits 25% more text.
Side-by-Side Comparison
```
# Input: "unbelievable"

# BPE (GPT-4, cl100k_base):
["un", "believ", "able"]
# 3 tokens

# WordPiece (BERT):
["un", "##bel", "##ie", "##va", "##ble"]
# 5 tokens (## = continuation)

# SentencePiece (Llama 2):
["▁un", "bel", "iev", "able"]
# 4 tokens (▁ = word boundary)
```
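WordPiece's inference step is a greedy longest-match-first loop, which is easy to sketch. The vocabulary below is a toy one, so the segmentation differs from BERT's real output shown above:

```python
# WordPiece-style greedy longest-match-first tokenization (minimal sketch;
# the vocabulary is a hand-picked toy, not BERT's real ~30K-entry vocab).
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece           # longest matching piece found
                break
            end -= 1                  # otherwise try a shorter piece
        if cur is None:
            return [unk]              # no piece matches: whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##believ", "##able", "believ"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("zzz", vocab))           # ['[UNK]']
```

Note the asymmetry with BPE: WordPiece applies no merge rules at inference time, just longest-match lookups against the finished vocabulary.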
Which Models Use What
```
# BPE (byte-level):
#   GPT-2, GPT-3, GPT-4, GPT-4o
#   Llama 3, Mistral, Falcon

# WordPiece:
#   BERT, DistilBERT, ELECTRA

# SentencePiece (Unigram or BPE):
#   Llama 2, T5, Gemini, PaLM
#   XLNet, ALBERT, mBART
```
Why Tokenization Matters More Than You Think
Tokenization quirks explain many LLM behaviors
Tokenization Explains LLM Quirks
Why LLMs struggle with counting letters: "strawberry" might be tokenized as ["str", "aw", "berry"] — the model never sees individual letters, so counting the r's is hard.

Why code is expensive: Python code uses ~2× more tokens than English prose (whitespace, symbols, variable names).

Why non-English costs more: with English-optimized tokenizers, Chinese text uses ~2-3× more tokens per character than English.
The complete picture: Tokenization is not just a preprocessing step — it fundamentally shapes what the model can and cannot do. The vocabulary defines the model’s “alphabet.” The compression ratio determines how much text fits in the context window. The token boundaries affect what patterns the model can learn. Every LLM limitation you encounter has a tokenization component.
Real-World Implications
```
# Token cost = real money cost
# GPT-4o: $2.50 per 1M input tokens
# 1M tokens ≈ 750K words ≈ 3,000 pages

# Same text, different token counts:
# "Hello" in English:       1 token
# "こんにちは" in Japanese: 3-4 tokens
# → Japanese users pay 3-4× more per word!

# Context window = token limit
# GPT-4o: 128K tokens ≈ 96K words (English)
# GPT-4o: 128K tokens ≈ 40K words (Chinese)

# Code is token-hungry:
# "for i in range(10):" = 7 tokens
# "loop ten times"      = 3 tokens
# Same meaning, 2.3× more tokens for code
```
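A back-of-the-envelope cost estimate follows directly from the figures above (the price and chars-per-token ratios are the approximations from this section, not live numbers):

```python
# Rough API-cost estimate from character count and language.
# Assumed figures (from this section, not live pricing):
PRICE_PER_M_INPUT = 2.50  # GPT-4o, USD per 1M input tokens
CHARS_PER_TOKEN = {"english": 4.0, "chinese": 2.5, "japanese": 1.5}

def estimated_cost(num_chars, language="english"):
    """Estimate input cost in USD for a document of num_chars characters."""
    tokens = num_chars / CHARS_PER_TOKEN[language]
    return tokens / 1_000_000 * PRICE_PER_M_INPUT

# The same 12,000-character document:
print(f"${estimated_cost(12_000, 'english'):.4f}")   # $0.0075
print(f"${estimated_cost(12_000, 'japanese'):.4f}")  # $0.0200
```

The per-request numbers look tiny, but the language gap compounds: at scale, the same content in Japanese costs roughly 2.7× more than in English under these assumptions.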
Real World
A translator who chunks text differently will understand and convey meaning differently
In LLMs
The tokenizer defines what “atoms” the model thinks in — it literally shapes the model’s perception of language