Ch 7 — The Tokenizer Files: Text to Token IDs and Back

Vocabulary, BPE merge rules, special tokens, and the chat template contract
High Level
Vocab → BPE Merges → Special Tokens → Chat Template → Config → Validate
The Vocabulary: Token-to-ID Mapping
128,256 entries mapping text fragments to integer IDs
What It Contains
The vocabulary is a bidirectional mapping between text fragments (tokens) and integer IDs. For Llama 3.1, this means 128,256 entries where each token — from single characters like "a" to whole words like " the" (note the leading space) to subwords like "tion" — has a unique integer. The vocabulary lives inside tokenizer.json as a JSON object under the model.vocab key. The vocab_size (128,256) directly determines the embedding matrix's first dimension.
Vocabulary Entries
// Sample from tokenizer.json vocab:
{
  "!": 0,
  "\"": 1,
  "#": 2,
  ...
  " the": 279,       // note leading space
  "Hello": 9906,
  " world": 1917,
  ...
  "<|begin_of_text|>": 128000
}
// Total: 128,256 entries
Key insight: Tokens often include leading spaces — " Hello" (with space) is a different token than "Hello" (without). This is how BPE tokenizers handle word boundaries without explicit whitespace tokens.
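The bidirectional mapping can be sketched in a few lines of Python. This is a toy using only the handful of entries from the sample above (a real Llama 3.1 vocab has 128,256); the inverted dictionary is the decode direction.

```python
# Toy slice of a tokenizer.json vocabulary (token -> ID).
# IDs follow the chapter's sample; a real vocab has 128,256 entries.
vocab = {
    "!": 0,
    " the": 279,      # leading space: a different token than "the"
    "Hello": 9906,
    " world": 1917,
    "<|begin_of_text|>": 128000,
}

# Invert once for the decode direction (ID -> text fragment).
id_to_token = {i: t for t, i in vocab.items()}

def decode(ids):
    """Concatenate the text fragments for a sequence of token IDs."""
    return "".join(id_to_token[i] for i in ids)

print(decode([9906, 1917]))  # -> "Hello world"
```

Note how the leading space of " world" reappears on decode without any extra whitespace handling: it was part of the token all along.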
BPE Merge Rules: How Unknown Words Are Split
An ordered list of character pair merges learned from the training corpus
How BPE Works
Byte Pair Encoding (BPE) starts with individual bytes and iteratively merges the most frequent adjacent pairs. The merges list in tokenizer.json records these merges in the order they were learned (most frequent first). When tokenizing new text, the algorithm first splits the text into bytes, then repeatedly finds the adjacent pair with the earliest-ranked merge rule and joins it, until no rule applies. This is why any text can be tokenized — worst case, it falls back to individual bytes.
Merge Rules Example
// BPE merges (ordered by frequency):
"merges": [
  "Ġ t",    // merge #1: space + t
  "Ġ a",    // merge #2: space + a
  "i n",    // merge #3: i + n → "in"
  "h e",    // merge #4: h + e → "he"
  "Ġt h",   // merge #5: " t" + h
  "Ġth e",  // merge #6: " th" + e
  ...       // ~280K merge rules total
]
// Ġ represents a space character
Key insight: The merge order IS the tokenization algorithm. The same text can produce different tokens if you change the merge order. This is why you must use the exact tokenizer that was trained with the model — the merge rules and the embedding matrix are linked.
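The greedy merge loop described above can be sketched in pure Python. This is a minimal sketch with four made-up merge rules, not the real Llama list; `Ġ` marks a leading space as in the merges shown.

```python
# Minimal BPE merge loop. The rules below are a toy subset; a real
# tokenizer ships ~280K of them, and their order defines the algorithm.
merges = [("Ġ", "t"), ("Ġt", "h"), ("Ġth", "e"), ("h", "e")]
rank = {pair: i for i, pair in enumerate(merges)}  # lower rank = applied first

def bpe(pre_token):
    """Split into single symbols, then merge adjacent pairs by ascending rank."""
    symbols = list(pre_token)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no merge rule applies; stop
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols

print(bpe("Ġthe"))  # -> ['Ġthe']: Ġ+t, then Ġt+h, then Ġth+e
print(bpe("Ġthx"))  # -> ['Ġth', 'x']: no rule covers the final pair
```

Reorder the `merges` list and the same input can split differently, which is the key insight above in executable form.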
Special Tokens: Control Signals
Tokens that tell the model about conversation structure
What Special Tokens Do
Special tokens are control signals that never appear in normal text. They mark conversation boundaries, speaker roles, and generation endpoints. Llama 3 uses tokens like <|begin_of_text|> (start of conversation), <|eot_id|> (end of turn), and <|start_header_id|> / <|end_header_id|> (role markers). These tokens occupy IDs at the end of the vocabulary (128000+), reserved specifically for control purposes.
Llama 3 Special Tokens
// Llama 3 special tokens:
<|begin_of_text|>    ID: 128000
<|end_of_text|>      ID: 128001
<|start_header_id|>  ID: 128006
<|end_header_id|>    ID: 128007
<|eot_id|>           ID: 128009

// These are "added_tokens" — they don't
// come from BPE merges but are explicitly
// added to the vocabulary
Key insight: The model learned the meaning of these tokens during training. <|eot_id|> means "stop generating." If your inference code doesn't include it as a stop token, the model will keep generating past its turn boundary.
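The stop-token requirement can be shown with a toy generation loop. `generate_step` here is a hypothetical stand-in for a model's next-token sampler; only the <|eot_id|> ID comes from the table above.

```python
EOT_ID = 128009  # <|eot_id|>: Llama 3's end-of-turn token

def generate(generate_step, max_new_tokens=100, stop_ids=frozenset({EOT_ID})):
    """Sample tokens until a registered stop ID appears (or the budget runs out)."""
    out = []
    for _ in range(max_new_tokens):
        tok = generate_step(out)
        if tok in stop_ids:
            break  # the model signalled the end of its turn
        out.append(tok)
    return out

# Toy "model": emits three content tokens, then end-of-turn, then junk.
script = iter([9906, 1917, 0, EOT_ID, 42, 42])
result = generate(lambda ctx: next(script))
print(result)  # -> [9906, 1917, 0]
```

With an empty `stop_ids` set, the loop would sail past <|eot_id|> and keep emitting tokens, which is exactly the runaway-generation failure described above.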
The Chat Template: Conversation Formatting
A Jinja2 template that structures multi-turn dialogue
What the Template Does
The chat template in tokenizer_config.json is a Jinja2 template that defines how multi-turn conversations are formatted before being fed to the model. It wraps each message with the correct special tokens and role markers. This template is the contract between the user and the model — using the wrong template produces gibberish even with perfect weights, because the model was trained on a specific conversation format.
Formatted Output
// What the chat template produces:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is Python?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
// ↑ Model generates from here
Why it matters: Feeding a Llama model text formatted with a ChatML template (or vice versa) makes it see unfamiliar control tokens, producing confused or degraded output. Always match the template to the model.
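What the Jinja2 template computes can be approximated as a plain Python function. This is a simplified sketch of the Llama 3 layout, not the actual template string from tokenizer_config.json, which also handles defaults and edge cases.

```python
def format_llama3(messages):
    """Wrap each message in Llama 3 control tokens, then open the assistant turn."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += m["content"] + "<|eot_id|>"
    # Leave the assistant header open so generation starts at the reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
])
print(prompt)
```

The trailing open assistant header is the "model generates from here" point shown in the formatted output above.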
tokenizer.json: The Complete File
Vocabulary, merges, normalizer, pre-tokenizer, post-processor, decoder
File Sections
The tokenizer.json file contains six key sections: normalizer (text cleanup before tokenizing), pre_tokenizer (how to split text into pre-token chunks, e.g., by whitespace), model (the BPE vocabulary and merge rules), post_processor (adds special tokens like BOS/EOS), decoder (converts token IDs back to text), and added_tokens (special tokens injected into the vocabulary).
File Structure
// tokenizer.json top-level structure:
{
  "version": "1.0",
  "added_tokens": [...],   // special tokens
  "normalizer": null,      // text cleanup
  "pre_tokenizer": {...},  // split rules
  "model": {
    "type": "BPE",
    "vocab": {...},        // 128K entries
    "merges": [...]        // ~280K rules
  },
  "post_processor": {...}, // add BOS/EOS
  "decoder": {...}         // IDs → text
}
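Since tokenizer.json is plain JSON, the sections can be inspected with nothing but the json module. The document below is a toy stand-in; pointing the same code at a real Llama 3.1 file shows the same keys, with 128K vocab entries and ~280K merges.

```python
import json

# Toy tokenizer.json-shaped document (a real file is ~9 MB).
toy_file = json.dumps({
    "version": "1.0",
    "added_tokens": [{"id": 3, "content": "<|begin_of_text|>", "special": True}],
    "normalizer": None,
    "pre_tokenizer": {"type": "ByteLevel"},
    "model": {"type": "BPE", "vocab": {"h": 0, "e": 1, "he": 2}, "merges": ["h e"]},
    "post_processor": {"type": "TemplateProcessing"},
    "decoder": {"type": "ByteLevel"},
})

tok = json.loads(toy_file)  # with a real file: json.load(open("tokenizer.json"))
print(sorted(tok))                  # the top-level sections
print(tok["model"]["type"])         # -> BPE
print(len(tok["model"]["vocab"]))   # vocab size (128,256 for Llama 3.1)
print(len(tok["model"]["merges"]))  # number of merge rules
```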
tokenizer_config.json: Settings and Template
Configuration, special token mappings, and the chat template
Key Config Fields
tokenizer_config.json contains operational settings: which special tokens to use for BOS/EOS/PAD, whether to add BOS token automatically, the maximum sequence length, and crucially, the chat_template string (the Jinja2 template). It also specifies the tokenizer_class (e.g., "PreTrainedTokenizerFast") which tells the framework which tokenizer implementation to use.
Config Example
// tokenizer_config.json key fields:
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|end_of_text|>",
  "model_max_length": 131072,
  "tokenizer_class": "PreTrainedTokenizerFast",
  "chat_template": "{% set loop_messages... %}"
}
What Happens When the Tokenizer Is Wrong
Mismatched tokenizers produce garbage — even with perfect weights
Failure Modes
Wrong vocab size: If your tokenizer has 32K tokens but the model expects 128K, token IDs above 32K index into nonexistent embedding rows → crash.

Wrong merge rules: "Hello world" might become IDs [9906, 1917] with the correct tokenizer but [412, 1024, 888, 3055] with a different one. The model was trained on the first mapping — it has no idea what the second means.

Wrong chat template: The model sees random special tokens instead of proper role markers, producing confused or incoherent output.
Quick Validation
// Tokenizer compatibility checks:
1. tokenizer.vocab_size == config.vocab_size?
   // Must match embed_tokens shape[0]
2. tokenizer("Hello") produces expected IDs?
   // Compare with reference implementation
3. chat_template matches model's training?
   // Llama 3 ≠ ChatML ≠ Alpaca format
4. Special tokens registered as stop tokens?
   // <|eot_id|> must trigger generation stop
Rule of thumb: Always download tokenizer files from the same model repository as the weights. Never mix tokenizers from different model families. The tokenizer and weights are a matched pair — like a lock and key.
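The validation checks can be rolled into one function. This is a hedged sketch: the argument shapes are assumptions (`tok` mimics a parsed tokenizer.json, `cfg` a parsed config.json, `encode` is your tokenizer, and `ref_ids` is output captured from a trusted reference implementation).

```python
def validate_tokenizer(tok, cfg, encode, ref_text, ref_ids, stop_ids):
    """Return a list of problems; an empty list means the pair looks matched."""
    problems = []
    total = len(tok["model"]["vocab"]) + len(tok.get("added_tokens", []))
    if total != cfg["vocab_size"]:  # must equal embed_tokens.shape[0]
        problems.append(f"vocab size {total} != config vocab_size {cfg['vocab_size']}")
    if encode(ref_text) != ref_ids:  # merge rules differ from reference
        problems.append("encoding differs from reference implementation")
    if not stop_ids:  # e.g. <|eot_id|> never registered
        problems.append("no stop tokens registered")
    return problems

# Toy data that passes all three checks.
toy_tok = {"model": {"vocab": {"a": 0, "b": 1}}, "added_tokens": [{"id": 2}]}
toy_cfg = {"vocab_size": 3}
print(validate_tokenizer(toy_tok, toy_cfg, lambda s: [0, 1], "ab", [0, 1], {2}))  # -> []
```

Running this once at load time turns the failure modes above from silent gibberish into explicit errors.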
Practical Takeaways
The tokenizer cheat sheet
Tokenizer File Summary
// Tokenizer files and their roles:
tokenizer.json            // ~9 MB
  // Vocab + BPE merges + normalizer
  // The core tokenization algorithm
tokenizer_config.json     // ~38 KB
  // Settings + chat template
  // Special token mappings
tokenizer.model           // ~2.1 MB
  // SentencePiece binary (legacy)
special_tokens_map.json   // ~88 bytes
  // Quick lookup for BOS/EOS/PAD
Key Relationships
vocab_size ↔ embedding shape: Must match exactly or loading crashes.

Merge order ↔ token IDs: The same text produces different IDs with different merge rules. IDs are the input to the embedding lookup.

Chat template ↔ training format: The model was trained to recognize specific control token patterns. Wrong template = wrong patterns = bad output.
Key insight: The tokenizer is the "API contract" between human text and the model. Get it right and everything works. Get it wrong and no amount of weight quality can save you. Last chapter: config.json and the runtime structures that don't live in any file.