1. Tokenization
LLMs don't read words; they read numbers. Tokenization is the bridge between human language and machine math.
- Subword Tokenization: Algorithms like Byte-Pair Encoding (BPE) break rare words into smaller, reusable chunks instead of storing millions of unique whole words, trading vocabulary size against sequence length.
- The "Strawberry" Problem: LLMs struggle with spelling tasks (like counting the 'r's in "strawberry") because the tokenizer often merges the whole word into a single opaque token ID before the model ever sees individual letters.
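The merge loop above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: production BPE learns its merge table from a huge corpus and typically operates on bytes, but the core idea, repeatedly fusing the most frequent adjacent pair, looks like this:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merge(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters, then greedily merge frequent pairs.
text = "strawberry strawberry strawman"
tokens = list(text)
for _ in range(5):
    tokens = bpe_merge(tokens, most_frequent_pair(tokens))
```

After a few merges, common fragments like "str" fuse into single symbols, which is exactly why the model never "sees" the individual 'r's.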
2. Embeddings
Tokens are converted into high-dimensional coordinates where geometric proximity reflects semantic similarity.
- Vector Space: If you plot words in 4,096 dimensions, "dog" and "cat" are close together. "Dog" and "car" are far apart.
- Context is Everything: Modern embeddings are dynamic. The word "bank" gets a different vector if it's next to "river" versus next to "money".
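The "close together vs. far apart" claim is usually measured with cosine similarity. A minimal sketch, using made-up 4-dimensional vectors (real models learn thousands of dimensions, and the numbers below are invented purely for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": dog and cat point in similar directions, car doesn't.
dog = [0.9, 0.8, 0.1, 0.0]
cat = [0.8, 0.9, 0.2, 0.1]
car = [0.1, 0.0, 0.9, 0.8]

assert cosine_similarity(dog, cat) > cosine_similarity(dog, car)
```

The same measurement underlies semantic search and retrieval: you embed a query, then rank documents by their cosine similarity to it.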
3. Attention
Attention allows the model to look at the entire sentence at once and figure out which words are relevant to each other.
- Queries, Keys, and Values: Every token asks a question (Query), checks other tokens' labels (Keys), and extracts their meaning (Values) if there's a match.
- Self-Attention: The mechanism that allows the word "it" in "The animal didn't cross the street because it was too tired" to mathematically link to "animal" rather than "street".
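The Query/Key/Value mechanics above reduce to a small amount of math: score each key against the query, softmax the scores into weights, and average the values. A minimal sketch with invented 2-dimensional toy vectors (real models use learned projections and many attention heads):

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query:
    score every key, softmax into weights, blend the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Toy setup: "it" as the query; keys/values for "animal" and "street".
# The query points mostly toward "animal", so its value dominates.
it_query = [0.9, 0.1]
keys     = [[1.0, 0.0],   # "animal"
            [0.0, 1.0]]   # "street"
values   = [[1.0, 0.0],   # meaning carried by "animal"
            [0.0, 1.0]]   # meaning carried by "street"

blended = attention(it_query, keys, values)
```

Because the weights come from a softmax, "it" doesn't pick one word outright; it takes a weighted blend, leaning heavily toward "animal".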
The Bottom Line: Before any "thinking" happens, text must be chunked into tokens, mapped to mathematical vectors, and mixed together using Attention so every word understands its context.