Ch 1 — Text to Tokens

How raw text becomes numbers — the very first step in every LLM
Pipeline: Raw Text → Split → BPE → Vocab → Special Tokens → Token IDs → Tokenizer Comparison → Why It Matters
The Problem: Computers Don’t Read
Why we need tokenization at all
The Analogy
Imagine you speak English and need to communicate with someone who only understands numbers. You’d need a codebook — a dictionary that maps every word (or piece of a word) to a unique number. That’s exactly what a tokenizer does. It’s the translator between human language and the numbers a neural network can process.
Key insight: An LLM never sees text. It only sees sequences of integers. “Hello world” might become [15339, 1917]. Everything the model does — understanding, reasoning, generating — happens in the world of numbers. Tokenization is the bridge.
What Happens
```
# What you type:
"Hello, how are you?"

# What the LLM actually sees:
[9906, 11, 1268, 527, 499, 30]

# Each number = one "token":
# 9906 → "Hello"
#   11 → ","
# 1268 → " how"
#  527 → " are"
#  499 → " you"
#   30 → "?"
```
Real World
A codebook that translates English words into numbered codes for a telegraph
In LLMs
A tokenizer that maps text chunks to integer IDs from a fixed vocabulary
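The codebook idea fits in a few lines of Python. This is a minimal sketch with an invented six-entry vocabulary (the IDs here are made up for illustration; real vocabularies hold tens of thousands of entries):

```python
# A toy "codebook": every known chunk gets a unique integer ID.
# (Hypothetical IDs for illustration, not any real model's vocabulary.)
vocab = {"Hello": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5}
inverse = {i: s for s, i in vocab.items()}

def encode(chunks):
    """Map text chunks to integer IDs -- what the model actually sees."""
    return [vocab[c] for c in chunks]

def decode(ids):
    """Map IDs back to text -- how model output becomes readable."""
    return "".join(inverse[i] for i in ids)

ids = encode(["Hello", ",", " how", " are", " you", "?"])
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # Hello, how are you?
```

The round trip is the whole job: everything between `encode` and `decode` happens purely in integer space.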
How to Split Text: Three Approaches
Characters, words, or something in between?
The Analogy
Imagine chopping a sentence for a Scrabble game. You could split into individual letters (tiny tiles, lots of them), whole words (big tiles, but you need millions), or common chunks like “ing”, “tion”, “un” (medium tiles, reusable). Modern LLMs use the third approach: subword tokenization.
Key insight: Word-level tokenization fails on new words (“ChatGPT” would be unknown). Character-level needs extremely long sequences (a 1000-word essay = ~5000 characters). Subword tokenization is the sweet spot — common words stay whole, rare words get split into known pieces.
The Three Approaches
```
# Character-level: tiny vocab, long sequences
"unhappiness" → ["u","n","h","a","p","p","i","n","e","s","s"]
# Vocab: ~256 (bytes). 11 tokens for 1 word!

# Word-level: huge vocab, OOV problem
"unhappiness" → ["unhappiness"]
# Vocab: 500K+ words. "ChatGPT" = unknown!

# Subword (BPE): balanced vocab, no OOV
"unhappiness" → ["un", "happiness"]
# Vocab: 32K-128K. Handles any text.
# Common words stay whole: "the" → ["the"]
# Rare words split: "ChatGPT" → ["Chat", "GPT"]
```
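The trade-offs can be made concrete with a short sketch. The word and subword vocabularies below are tiny, hand-picked toys (not any real model's vocabulary), just enough to show each strategy's behavior:

```python
# Character-level: trivial to implement, but 11 tokens for one word.
word = "unhappiness"
char_tokens = list(word)
print(len(char_tokens))  # 11

# Word-level: one token per word, but any unseen word becomes <UNK>.
word_vocab = {"unhappiness"}  # toy vocabulary
def word_tokenize(w):
    return [w] if w in word_vocab else ["<UNK>"]
print(word_tokenize("unhappiness"))  # ['unhappiness']
print(word_tokenize("ChatGPT"))      # ['<UNK>']

# Subword: greedy longest-match against a small subword list (toy example).
subwords = ["un", "happiness", "Chat", "GPT"]
def subword_tokenize(w):
    out, i = [], 0
    while i < len(w):
        # Take the longest known subword starting at position i,
        # falling back to the single character if nothing matches.
        match = max((s for s in subwords if w.startswith(s, i)),
                    key=len, default=w[i])
        out.append(match)
        i += len(match)
    return out
print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
print(subword_tokenize("ChatGPT"))      # ['Chat', 'GPT']
```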
Byte-Pair Encoding (BPE)
The algorithm behind GPT, Llama, and most modern tokenizers
The Analogy
Imagine you’re creating shorthand for note-taking. You notice you write “th” constantly, so you create a symbol for it. Then “the” appears a lot, so you merge “th”+“e” into one symbol. Then “the ” (with space), and so on. BPE does exactly this — it starts with individual bytes and repeatedly merges the most frequent pair into a new token.
Key insight: BPE was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP. OpenAI’s GPT-2 was the first major model to use byte-level BPE, and it’s been the standard ever since. The “byte” part means it starts from raw bytes (0-255), so it can handle any language, emoji, or even binary data.
BPE Step by Step
```
# Training corpus: "low lower lowest"

# Start with characters:
# l o w _ l o w e r _ l o w e s t

# Count pairs: (l,o)=3  (o,w)=3  (w,e)=2 ...
# Merge most frequent: (l,o) → "lo"
# lo w _ lo w e r _ lo w e s t

# Count again: (lo,w)=3 is most frequent
# Merge: (lo,w) → "low"
# low _ low e r _ low e s t

# Next: (low,e)=2 → "lowe"
# low _ lowe r _ lowe s t

# Continue until vocab size reached...

# Final merge rules (ordered):
# 1. l + o   → lo
# 2. lo + w  → low
# 3. low + e → lowe
# ...
```
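The whole training loop fits in a short function. This is a minimal sketch that ignores the word-boundary marker (the `_` in the walkthrough) and treats each whitespace-separated word independently:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a whitespace-split corpus (minimal sketch)."""
    # Represent each word as a tuple of symbols, with a count per word.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest", 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Running it on the toy corpus reproduces the ordered merge rules from the walkthrough.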
The Vocabulary: Your Model’s Dictionary
Real vocabulary sizes from actual models
The Analogy
A vocabulary is like a phrase book for a traveler. A tiny phrase book (256 entries = raw bytes) can express anything but requires many lookups per sentence. A massive dictionary (500K words) is fast but can’t handle new slang. The sweet spot is 32K–128K subword tokens — enough to keep common words whole while splitting rare ones.
Real Model Vocabularies
```
# Actual vocabulary sizes:
# GPT-2:      50,257 tokens (BPE)
# GPT-3.5/4: 100,258 tokens (cl100k_base)
# GPT-4o:    200,019 tokens (o200k_base)
# Llama 2:    32,000 tokens (SentencePiece)
# Llama 3:   128,256 tokens (tiktoken-based)
# Claude 3:    ~100K tokens (BPE variant)
# Gemini:    256,000 tokens (SentencePiece)
```
Key insight: Larger vocabularies mean fewer tokens per sentence (faster inference, fits more text in context window) but more parameters in the embedding layer. GPT-4o doubled the vocab from 100K to 200K specifically to improve efficiency on non-English languages, where smaller vocabs waste tokens on single characters.
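The embedding-layer cost is easy to estimate: the table holds one d_model-sized vector per vocabulary entry. Using the published hidden size of 4096 for both Llama 2 7B and Llama 3 8B:

```python
# Embedding-table parameters = vocab_size x d_model (one vector per token).
def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

# Llama 2 7B: 32K vocab, hidden size 4096
print(embedding_params(32_000, 4096))   # 131072000  (~0.13B params)

# Llama 3 8B: 128K vocab, same hidden size
print(embedding_params(128_256, 4096))  # 525336576  (~0.53B params)
```

So quadrupling the vocabulary added roughly 400M embedding parameters, a price Llama 3 paid for its better compression ratio.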
Compression Ratio
```
# Average characters per token:
# GPT-2 (50K vocab):  ~4.0 chars/token (English)
# GPT-4 (100K vocab): ~4.0 chars/token (English)
# Llama 3 (128K):     ~4.0 chars/token (English)
#                     ~2.5 chars/token (Chinese)
#                     ~1.5 chars/token (Japanese)

# "Hello world" = 2 tokens (11 chars)
# → 5.5 chars/token (very efficient!)
```
Special Tokens: The Control Signals
Hidden markers that tell the model where things start, stop, and change
The Analogy
Special tokens are like punctuation marks for the model — but invisible to you. Just as a period tells you a sentence ended, <|endoftext|> tells the model a document ended. <|im_start|> marks where a chat message begins. These tokens never appear in normal text — they’re injected by the system to give the model structural cues.
Key insight: When you chat with ChatGPT, your message gets wrapped in special tokens before the model sees it. The model doesn’t see “What is Python?” — it sees something like <|im_start|>user\nWhat is Python?<|im_end|><|im_start|>assistant\n. These invisible markers are how the model knows who’s speaking and when to stop generating.
Common Special Tokens
```
# GPT-4 special tokens (cl100k_base):
<|endoftext|>        # document boundary
<|im_start|>         # chat message start
<|im_end|>           # chat message end
<|fim_prefix|>       # fill-in-the-middle
<|fim_middle|>       # (for code completion)
<|fim_suffix|>       # (for code completion)

# Llama 3 special tokens:
<|begin_of_text|>    # sequence start
<|end_of_text|>      # sequence end
<|start_header_id|>  # role marker
<|end_header_id|>    # role marker
<|eot_id|>           # end of turn
```
Chat Template Example
```
# What the model actually receives:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is Python?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```
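Assembling such a prompt is plain string formatting. The sketch below mirrors the layout shown above; it is not the official chat template (in practice you would call the tokenizer's built-in `apply_chat_template`):

```python
# Minimal sketch: wrap chat messages in Llama 3-style special tokens.
def build_prompt(messages):
    lines = ["<|begin_of_text|>"]
    for msg in messages:
        lines.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>")
        lines.append(msg["content"])
        lines.append("<|eot_id|>")
    # Open the assistant turn so the model generates the reply next.
    lines.append("<|start_header_id|>assistant<|end_header_id|>")
    return "\n".join(lines)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

Note the deliberate ending: the prompt stops right after the assistant header, so the model's continuation is the assistant's reply.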
Token IDs: Try It Yourself
Real code you can run to see tokenization in action
OpenAI tiktoken
```python
import tiktoken

# GPT-4 tokenizer
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is surprisingly important!"
tokens = enc.encode(text)
# [5765, 2065, 374, 29729, 3062, 0]

# Decode back to see each token's text:
for t in tokens:
    print(f"{t:6d} → '{enc.decode([t])}'")
#   5765 → 'Token'
#   2065 → 'ization'
#    374 → ' is'
#  29729 → ' surprisingly'
#   3062 → ' important'
#      0 → '!'

print(f"Tokens: {len(tokens)}")  # 6
print(f"Chars:  {len(text)}")    # 39
print(f"Ratio:  {len(text)/len(tokens):.1f} chars/token")
# 6.5 chars/token
```
Hugging Face Transformers
```python
from transformers import AutoTokenizer

# Llama 3 tokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Tokenization is surprisingly important!"
# add_special_tokens=False skips the <|begin_of_text|> BOS token
ids = tok.encode(text, add_special_tokens=False)
tokens = tok.convert_ids_to_tokens(ids)
# ['Token', 'ization', 'Ġis', 'Ġsurprisingly',
#  'Ġimportant', '!']
# Ġ = space character (byte 0x20)

# Decode back to string:
tok.decode(ids)
# "Tokenization is surprisingly important!"
```
Why it matters: The Ġ prefix (displayed as a special character) represents a leading space. Tokens like Ġis mean “ is” (with space). This is how the tokenizer preserves whitespace — spaces aren’t separate tokens, they’re attached to the next word.
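You can verify the mapping yourself: in GPT-2-style byte-level BPE, the space byte (0x20) is shifted up by 256 into the printable range, landing on code point U+0120, which renders as Ġ:

```python
# GPT-2's byte-level BPE replaces "invisible" bytes with printable stand-ins.
# The space byte (0x20) maps to code point 0x20 + 256 = 0x120, i.e. 'Ġ'.
space_stand_in = chr(ord(" ") + 256)
print(space_stand_in)         # Ġ
print(hex(ord("Ġ")))          # 0x120

# So the token string 'Ġis' really means ' is' (space + "is"):
print("Ġis".replace("Ġ", " "))  # ' is'
```

(The full mapping assigns a visible stand-in to every non-printable byte; the simple +256 shift shown here is the specific case for the space.)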
BPE vs. WordPiece vs. SentencePiece
Different models use different tokenizers — here’s how they compare
The Three Main Algorithms
BPE (GPT, Llama 3): Merges the most frequent byte pair, bottom-up. Starts from bytes and builds up.

WordPiece (BERT, DistilBERT): Similar to BPE, but picks the merge that maximizes the likelihood of the training data, not just raw pair frequency. Uses a ## prefix for continuation pieces.

Unigram/SentencePiece (Llama 2, T5, Gemini): Starts with a large vocabulary and prunes the tokens that contribute least. Works directly on raw text (no pre-tokenization).
Key insight: The choice of tokenizer affects model performance more than most people realize. Llama 3 switched from SentencePiece (32K vocab) to a tiktoken-based BPE (128K vocab), improving compression from ~3.2 to ~4.0 chars/token on English. That means the same context window fits 25% more text.
Side-by-Side Comparison
```
# Input: "unbelievable"

# BPE (GPT-4, cl100k_base):
["un", "believ", "able"]
# 3 tokens

# WordPiece (BERT):
["un", "##bel", "##ie", "##va", "##ble"]
# 5 tokens (## = continuation)

# SentencePiece (Llama 2):
["▁un", "bel", "iev", "able"]
# 4 tokens (▁ = word boundary)
```
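WordPiece's inference step is a greedy longest-match-first loop, which is easy to sketch. The vocabulary below is a toy one, so the segmentation differs from BERT's real output shown above:

```python
# WordPiece-style greedy longest-match-first tokenization (minimal sketch;
# the vocabulary is a hand-picked toy, not BERT's real ~30K-entry vocab).
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece           # longest matching piece found
                break
            end -= 1                  # otherwise try a shorter piece
        if cur is None:
            return [unk]              # no piece matches: whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##believ", "##able", "believ"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("zzz", vocab))           # ['[UNK]']
```

Note the asymmetry with BPE: WordPiece applies no merge rules at inference time, just longest-match lookups against the finished vocabulary.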
Which Models Use What
```
# BPE (byte-level):
#   GPT-2, GPT-3, GPT-4, GPT-4o
#   Llama 3, Mistral, Falcon

# WordPiece:
#   BERT, DistilBERT, ELECTRA

# SentencePiece (Unigram or BPE):
#   Llama 2, T5, Gemini, PaLM
#   XLNet, ALBERT, mBART
```
Why Tokenization Matters More Than You Think
Tokenization quirks explain many LLM behaviors
Tokenization Explains LLM Quirks
Why LLMs struggle with counting letters: "strawberry" might be tokenized as ["str", "aw", "berry"] — the model never sees individual letters, so counting the r's is hard.

Why code is expensive: Python code uses ~2× more tokens than English prose (whitespace, symbols, variable names).

Why non-English costs more: with English-optimized tokenizers, Chinese text uses ~2-3× more tokens per character than English.
The complete picture: Tokenization is not just a preprocessing step — it fundamentally shapes what the model can and cannot do. The vocabulary defines the model’s “alphabet.” The compression ratio determines how much text fits in the context window. The token boundaries affect what patterns the model can learn. Every LLM limitation you encounter has a tokenization component.
Real-World Implications
```
# Token cost = real money cost
# GPT-4o: $2.50 per 1M input tokens
# 1M tokens ≈ 750K words ≈ 3,000 pages

# Same text, different token counts:
# "Hello" in English:       1 token
# "こんにちは" in Japanese: 3-4 tokens
# → Japanese users pay 3-4× more per word!

# Context window = token limit
# GPT-4o: 128K tokens ≈ 96K words (English)
# GPT-4o: 128K tokens ≈ 40K words (Chinese)

# Code is token-hungry:
# "for i in range(10):" = 7 tokens
# "loop ten times"      = 3 tokens
# Same meaning, 2.3× more tokens for code
```
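A back-of-the-envelope cost estimate follows directly from the figures above (the price and chars-per-token ratios are the approximations from this section, not live numbers):

```python
# Rough API-cost estimate from character count and language.
# Assumed figures (from this section, not live pricing):
PRICE_PER_M_INPUT = 2.50  # GPT-4o, USD per 1M input tokens
CHARS_PER_TOKEN = {"english": 4.0, "chinese": 2.5, "japanese": 1.5}

def estimated_cost(num_chars, language="english"):
    """Estimate input cost in USD for a document of num_chars characters."""
    tokens = num_chars / CHARS_PER_TOKEN[language]
    return tokens / 1_000_000 * PRICE_PER_M_INPUT

# The same 12,000-character document:
print(f"${estimated_cost(12_000, 'english'):.4f}")   # $0.0075
print(f"${estimated_cost(12_000, 'japanese'):.4f}")  # $0.0200
```

The per-request numbers look tiny, but the language gap compounds: at scale, the same content in Japanese costs roughly 2.7× more than in English under these assumptions.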
Real World
A translator who chunks text differently will understand and convey meaning differently
In LLMs
The tokenizer defines what “atoms” the model thinks in — it literally shapes the model’s perception of language