The Analogy
Imagine you’re creating shorthand for note-taking. You notice you write “th” constantly, so you create a symbol for it. Then “the” appears a lot, so you merge “th”+“e” into one symbol. Then “the ” (with space), and so on. BPE does exactly this — it starts with individual bytes and repeatedly merges the most frequent pair into a new token.
Key insight: BPE was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP subword tokenization. OpenAI’s GPT-2 was the first major model to use byte-level BPE, and variants of it have been standard in large language models ever since. The “byte” part means the base vocabulary is raw bytes (0–255), so it can represent any language, emoji, or even binary data with no out-of-vocabulary failures.
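The byte-level starting point is easy to verify directly; this quick Python check (an illustration, not part of any particular tokenizer library) encodes a mixed string as UTF-8 and shows that everything reduces to values in 0–255:

```python
# Byte-level means the base vocabulary is the 256 possible byte values.
# Any string, in any script, decomposes into these -- no "unknown" token needed.
text = "héllo 👋"
base_tokens = list(text.encode("utf-8"))

# 'h' is 1 byte, 'é' is 2, 'llo ' is 4, and the emoji is 4: 11 bytes total.
print(base_tokens)
assert all(0 <= b < 256 for b in base_tokens)
```

Accented characters and emoji simply become multi-byte sequences, which later merges can fuse back into single tokens if they are frequent enough.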
BPE Step by Step
# Training corpus: "low lower lowest"
# Start with characters:
# l o w _ l o w e r _ l o w e s t
# Count pairs: (l,o)=3 (o,w)=3 (w,e)=2 ...
# Merge most frequent: (l,o) → "lo"
# lo w _ lo w e r _ lo w e s t
# Count again: (lo,w)=3 is most frequent
# Merge: (lo,w) → "low"
# low _ low e r _ low e s t
# Next: (low,e)=2 → "lowe"
# low _ lowe r _ lowe s t
# Continue until vocab size reached...
# Final merge rules (ordered):
# 1. l + o → lo
# 2. lo + w → low
# 3. low + e → lowe
# ...
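The trace above can be reproduced with a minimal training loop. This is a sketch of classic (word-internal) BPE, not a production tokenizer: words are split on whitespace, and ties between equally frequent pairs are broken by first appearance, which happens to match the trace here:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges

print(train_bpe("low lower lowest", 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- the same ordered merge rules as above
```

Byte-level BPE (as in GPT-2) follows the same loop but starts from byte values instead of characters and does not assume whitespace-delimited words.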