Ch 7 — The Tokenizer Files: Text to Token IDs and Back

Vocabulary, BPE merge rules, special tokens, and the chat template contract
High Level
Vocab → BPE Merges → Special Tokens → Chat Template → Config → Validate
The Vocabulary: Token-to-ID Mapping
128,256 entries mapping text fragments to integer IDs
What It Contains
The vocabulary is a bidirectional mapping between text fragments (tokens) and integer IDs. For Llama 3.1, this means 128,256 entries where each token — from single characters like "a" to whole words like " the" (note the leading space) to subwords like "tion" — has a unique integer. The vocabulary lives inside tokenizer.json as a JSON object under the model.vocab key. The vocab_size (128,256) directly determines the embedding matrix's first dimension.
Vocabulary Entries
// Sample from tokenizer.json vocab:
{
  "!": 0,
  "\"": 1,
  "#": 2,
  ...
  " the": 279,       // note leading space
  "Hello": 9906,
  " world": 1917,
  ...
  "<|begin_of_text|>": 128000
}
// Total: 128,256 entries
Key insight: Tokens often include leading spaces — " Hello" (with space) is a different token than "Hello" (without). This is how BPE tokenizers handle word boundaries without explicit whitespace tokens.
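The bidirectional mapping can be sketched in a few lines of Python. This is a toy using only the handful of entries from the sample above (a real Llama 3.1 vocab has 128,256); the inverted dictionary is the decode direction.

```python
# Toy slice of a tokenizer.json vocabulary (token -> ID).
# IDs follow the chapter's sample; a real vocab has 128,256 entries.
vocab = {
    "!": 0,
    " the": 279,      # leading space: a different token than "the"
    "Hello": 9906,
    " world": 1917,
    "<|begin_of_text|>": 128000,
}

# Invert once for the decode direction (ID -> text fragment).
id_to_token = {i: t for t, i in vocab.items()}

def decode(ids):
    """Concatenate the text fragments for a sequence of token IDs."""
    return "".join(id_to_token[i] for i in ids)

print(decode([9906, 1917]))  # -> "Hello world"
```

Note how the leading space of " world" reappears on decode without any extra whitespace handling: it was part of the token all along.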
BPE Merge Rules: How Unknown Words Are Split
An ordered list of character pair merges learned from the training corpus
How BPE Works
Byte Pair Encoding (BPE) starts with individual bytes and iteratively merges the most frequent adjacent pairs. The merges list in tokenizer.json records these merges in the order they were learned (most frequent first). When tokenizing new text, the algorithm first splits the text into bytes, then repeatedly finds the adjacent pair with the earliest-ranked merge rule and joins it, until no rule applies. This is why any text can be tokenized — worst case, it falls back to individual bytes.
Merge Rules Example
// BPE merges (ordered by frequency):
"merges": [
  "Ġ t",    // merge #1: space + t
  "Ġ a",    // merge #2: space + a
  "i n",    // merge #3: i + n → "in"
  "h e",    // merge #4: h + e → "he"
  "Ġt h",   // merge #5: " t" + h
  "Ġth e",  // merge #6: " th" + e
  ...       // ~280K merge rules total
]
// Ġ represents a space character
Key insight: The merge order IS the tokenization algorithm. The same text can produce different tokens if you change the merge order. This is why you must use the exact tokenizer that was trained with the model — the merge rules and the embedding matrix are linked.
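The greedy merge loop described above can be sketched in pure Python. This is a minimal sketch with four made-up merge rules, not the real Llama list; `Ġ` marks a leading space as in the merges shown.

```python
# Minimal BPE merge loop. The rules below are a toy subset; a real
# tokenizer ships ~280K of them, and their order defines the algorithm.
merges = [("Ġ", "t"), ("Ġt", "h"), ("Ġth", "e"), ("h", "e")]
rank = {pair: i for i, pair in enumerate(merges)}  # lower rank = applied first

def bpe(pre_token):
    """Split into single symbols, then merge adjacent pairs by ascending rank."""
    symbols = list(pre_token)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no merge rule applies; stop
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols

print(bpe("Ġthe"))  # -> ['Ġthe']: Ġ+t, then Ġt+h, then Ġth+e
print(bpe("Ġthx"))  # -> ['Ġth', 'x']: no rule covers the final pair
```

Reorder the `merges` list and the same input can split differently, which is the key insight above in executable form.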
Special Tokens: Control Signals
Tokens that tell the model about conversation structure
What Special Tokens Do
Special tokens are control signals that never appear in normal text. They mark conversation boundaries, speaker roles, and generation endpoints. Llama 3 uses tokens like <|begin_of_text|> (start of conversation), <|eot_id|> (end of turn), and <|start_header_id|> / <|end_header_id|> (role markers). These tokens occupy IDs at the end of the vocabulary (128000+), reserved specifically for control purposes.
Llama 3 Special Tokens
// Llama 3 special tokens:
<|begin_of_text|>    ID: 128000
<|end_of_text|>      ID: 128001
<|start_header_id|>  ID: 128006
<|end_header_id|>    ID: 128007
<|eot_id|>           ID: 128009

// These are "added_tokens" — they don't
// come from BPE merges but are explicitly
// added to the vocabulary
Key insight: The model learned the meaning of these tokens during training. <|eot_id|> means "stop generating." If your inference code doesn't include it as a stop token, the model will keep generating past its turn boundary.
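The stop-token requirement can be shown with a toy generation loop. `generate_step` here is a hypothetical stand-in for a model's next-token sampler; only the <|eot_id|> ID comes from the table above.

```python
EOT_ID = 128009  # <|eot_id|>: Llama 3's end-of-turn token

def generate(generate_step, max_new_tokens=100, stop_ids=frozenset({EOT_ID})):
    """Sample tokens until a registered stop ID appears (or the budget runs out)."""
    out = []
    for _ in range(max_new_tokens):
        tok = generate_step(out)
        if tok in stop_ids:
            break  # the model signalled the end of its turn
        out.append(tok)
    return out

# Toy "model": emits three content tokens, then end-of-turn, then junk.
script = iter([9906, 1917, 0, EOT_ID, 42, 42])
result = generate(lambda ctx: next(script))
print(result)  # -> [9906, 1917, 0]
```

With an empty `stop_ids` set, the loop would sail past <|eot_id|> and keep emitting tokens, which is exactly the runaway-generation failure described above.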
The Chat Template: Conversation Formatting
A Jinja2 template that structures multi-turn dialogue
What the Template Does
The chat template in tokenizer_config.json is a Jinja2 template that defines how multi-turn conversations are formatted before being fed to the model. It wraps each message with the correct special tokens and role markers. This template is the contract between the user and the model — using the wrong template produces gibberish even with perfect weights, because the model was trained on a specific conversation format.
Formatted Output
// What the chat template produces:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is Python?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
// ↑ Model generates from here
Why it matters: Feeding a Llama model text formatted with a ChatML template (or vice versa) makes it see unfamiliar control tokens, producing confused or degraded output. Always match the template to the model.
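What the Jinja2 template computes can be approximated as a plain Python function. This is a simplified sketch of the Llama 3 layout, not the actual template string from tokenizer_config.json, which also handles defaults and edge cases.

```python
def format_llama3(messages):
    """Wrap each message in Llama 3 control tokens, then open the assistant turn."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += m["content"] + "<|eot_id|>"
    # Leave the assistant header open so generation starts at the reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

The trailing open assistant header is the "model generates from here" point shown in the formatted output above.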
tokenizer.json: The Complete File
Vocabulary, merges, normalizer, pre-tokenizer, post-processor, decoder
File Sections
The tokenizer.json file contains six key sections: normalizer (text cleanup before tokenizing), pre_tokenizer (how to split text into pre-token chunks, e.g., by whitespace), model (the BPE vocabulary and merge rules), post_processor (adds special tokens like BOS/EOS), decoder (converts token IDs back to text), and added_tokens (special tokens injected into the vocabulary).
File Structure
// tokenizer.json top-level structure:
{
  "version": "1.0",
  "added_tokens": [...],   // special tokens
  "normalizer": null,      // text cleanup
  "pre_tokenizer": {...},  // split rules
  "model": {
    "type": "BPE",
    "vocab": {...},        // 128K entries
    "merges": [...]        // ~280K rules
  },
  "post_processor": {...}, // add BOS/EOS
  "decoder": {...}         // IDs → text
}
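Since tokenizer.json is plain JSON, the sections can be inspected with nothing but the json module. The document below is a toy stand-in; pointing the same code at a real Llama 3.1 file shows the same keys, with 128K vocab entries and ~280K merges.

```python
import json

# Toy tokenizer.json-shaped document (a real file is ~9 MB).
toy_file = json.dumps({
    "version": "1.0",
    "added_tokens": [{"id": 3, "content": "<|begin_of_text|>", "special": True}],
    "normalizer": None,
    "pre_tokenizer": {"type": "ByteLevel"},
    "model": {"type": "BPE", "vocab": {"h": 0, "e": 1, "he": 2}, "merges": ["h e"]},
    "post_processor": {"type": "TemplateProcessing"},
    "decoder": {"type": "ByteLevel"},
})

tok = json.loads(toy_file)  # with a real file: json.load(open("tokenizer.json"))
print(sorted(tok))                  # the top-level sections
print(tok["model"]["type"])         # -> BPE
print(len(tok["model"]["vocab"]))   # vocab size (128,256 for Llama 3.1)
print(len(tok["model"]["merges"]))  # number of merge rules
```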
tokenizer_config.json: Settings and Template
Configuration, special token mappings, and the chat template
Key Config Fields
tokenizer_config.json contains operational settings: which special tokens to use for BOS/EOS/PAD, whether to add BOS token automatically, the maximum sequence length, and crucially, the chat_template string (the Jinja2 template). It also specifies the tokenizer_class (e.g., "PreTrainedTokenizerFast") which tells the framework which tokenizer implementation to use.
Config Example
// tokenizer_config.json key fields:
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|end_of_text|>",
  "model_max_length": 131072,
  "tokenizer_class": "PreTrainedTokenizerFast",
  "chat_template": "{% set loop_messages... %}"
}
What Happens When the Tokenizer Is Wrong
Mismatched tokenizers produce garbage — even with perfect weights
Failure Modes
Wrong vocab size: If your tokenizer has 32K tokens but the model expects 128K, token IDs above 32K index into nonexistent embedding rows → crash.

Wrong merge rules: "Hello world" might become IDs [9906, 1917] with the correct tokenizer but [412, 1024, 888, 3055] with a different one. The model was trained on the first mapping — it has no idea what the second means.

Wrong chat template: The model sees random special tokens instead of proper role markers, producing confused or incoherent output.
Quick Validation
// Tokenizer compatibility checks:
1. tokenizer.vocab_size == config.vocab_size?
   // Must match embed_tokens shape[0]
2. tokenizer("Hello") produces expected IDs?
   // Compare with reference implementation
3. chat_template matches model's training?
   // Llama 3 ≠ ChatML ≠ Alpaca format
4. Special tokens registered as stop tokens?
   // <|eot_id|> must trigger generation stop
Rule of thumb: Always download tokenizer files from the same model repository as the weights. Never mix tokenizers from different model families. The tokenizer and weights are a matched pair — like a lock and key.
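The validation checks can be rolled into one function. This is a hedged sketch: the argument shapes are assumptions (`tok` mimics a parsed tokenizer.json, `cfg` a parsed config.json, `encode` is your tokenizer, and `ref_ids` is output captured from a trusted reference implementation).

```python
def validate_tokenizer(tok, cfg, encode, ref_text, ref_ids, stop_ids):
    """Return a list of problems; an empty list means the pair looks matched."""
    problems = []
    total = len(tok["model"]["vocab"]) + len(tok.get("added_tokens", []))
    if total != cfg["vocab_size"]:  # must equal embed_tokens.shape[0]
        problems.append(f"vocab size {total} != config vocab_size {cfg['vocab_size']}")
    if encode(ref_text) != ref_ids:  # merge rules differ from reference
        problems.append("encoding differs from reference implementation")
    if not stop_ids:  # e.g. <|eot_id|> never registered
        problems.append("no stop tokens registered")
    return problems

# Toy data that passes all three checks.
toy_tok = {"model": {"vocab": {"a": 0, "b": 1}}, "added_tokens": [{"id": 2}]}
toy_cfg = {"vocab_size": 3}
print(validate_tokenizer(toy_tok, toy_cfg, lambda s: [0, 1], "ab", [0, 1], {2}))  # -> []
```

Running this once at load time turns the failure modes above from silent gibberish into explicit errors.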
Practical Takeaways
The tokenizer cheat sheet
Tokenizer File Summary
// Tokenizer files and their roles:
tokenizer.json            // ~9 MB
  // Vocab + BPE merges + normalizer
  // The core tokenization algorithm
tokenizer_config.json     // ~38 KB
  // Settings + chat template
  // Special token mappings
tokenizer.model           // ~2.1 MB
  // SentencePiece binary (legacy)
special_tokens_map.json   // ~88 bytes
  // Quick lookup for BOS/EOS/PAD
Key Relationships
vocab_size ↔ embedding shape: Must match exactly or loading crashes.

Merge order ↔ token IDs: The same text produces different IDs with different merge rules. IDs are the input to the embedding lookup.

Chat template ↔ training format: The model was trained to recognize specific control token patterns. Wrong template = wrong patterns = bad output.
Key insight: The tokenizer is the "API contract" between human text and the model. Get it right and everything works. Get it wrong and no amount of weight quality can save you. Last chapter: config.json and the runtime structures that don't live in any file.