
Key Insights — How LLMs Work

A high-level summary of the core concepts across all 14 chapters.
Part 1: From Text to Numbers (Chapters 1-3)
Chapter 1: LLMs don't read words; they read numbers. Tokenization is the bridge between human language and machine math.
  • Subword Tokenization: Systems like BPE break rare words into smaller chunks rather than storing millions of unique words, balancing vocabulary size with sequence length.
  • The "Strawberry" Problem: LLMs struggle with spelling tasks (like counting the 'r's in strawberry) because the tokenizer merges the word into a single opaque number before the model even sees it.
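To make the merging idea concrete, here is a minimal sketch of BPE-style merging on a single word. It is a toy (a real tokenizer learns its merges from a whole corpus and stores them in a fixed vocabulary), but it shows how "strawberry" collapses into opaque chunks before the model ever sees it:

```python
# Toy illustration of BPE-style subword merging (not a real tokenizer).
# Repeatedly merge the most frequent adjacent pair of symbols.
from collections import Counter

def bpe_merges(word: str, num_merges: int) -> list[str]:
    symbols = list(word)
    for _ in range(num_merges):
        if len(symbols) < 2:
            break
        pairs = Counter(zip(symbols, symbols[1:]))
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)   # fuse the winning pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merges("strawberry", 3))  # → ['stra', 'w', 'b', 'e', 'r', 'r', 'y']
```

After three merges the ten letters are already seven chunks; with enough merges the whole word becomes a single symbol, at which point the model never sees individual letters at all.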
Chapter 2: Words are converted into high-dimensional coordinates where geometric distance equals semantic similarity.
  • Vector Space: If you plot words in 4,096 dimensions, "dog" and "cat" are close together. "Dog" and "car" are far apart.
  • Context is Everything: Modern embeddings are dynamic. The word "bank" gets a different vector if it's next to "river" versus next to "money".
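Distance in that space is usually measured with cosine similarity. The vectors below are made-up 4-D stand-ins (real embeddings have thousands of dimensions), but the geometry works the same way:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-D vectors standing in for learned embeddings.
dog = [0.9, 0.8, 0.1, 0.0]
cat = [0.8, 0.9, 0.2, 0.1]
car = [0.1, 0.0, 0.9, 0.8]

assert cosine(dog, cat) > cosine(dog, car)  # "dog" sits nearer "cat" than "car"
```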
Chapter 3: Attention allows the model to look at the entire sentence at once and figure out which words are relevant to each other.
  • Queries, Keys, and Values: Every token asks a question (Query), checks other tokens' labels (Keys), and extracts their meaning (Values) if there's a match.
  • Self-Attention: The mechanism that allows the word "it" in "The animal didn't cross the street because it was too tired" to mathematically link to "animal" rather than "street".
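The Query/Key/Value matching above can be sketched as scaled dot-product attention. This is a single-head, no-batching toy in plain Python; the 2-D vectors and token labels are invented for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head: each query scores every
    key, the scores become weights, and the output is a weighted mix of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy setup: the query for "it" points toward the key for "animal".
keys = [[1.0, 0.0],    # key for "animal"
        [0.0, 1.0]]    # key for "street"
values = [[5.0],       # value carried by "animal"
          [9.0]]       # value carried by "street"
result = attention([[4.0, 0.5]], keys, values)
print(result)  # heavily weighted toward "animal"'s value of 5.0
```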
The Bottom Line: Before any "thinking" happens, text must be chunked into tokens, mapped to mathematical vectors, and mixed together using Attention so every word understands its context.
Part 2: The Transformer Architecture (Chapters 4-6)
Chapter 4: An LLM is just the same block of operations repeated dozens of times.
  • Two-Step Process: Each block does Attention (tokens talk to each other) followed by a Feed-Forward Network (tokens process what they just learned individually).
  • Residual Connections: "Bypass lanes" that allow information to skip layers, preventing the signal from dying out in very deep networks.
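A sketch of the two-step block, with layer norms omitted for brevity. The `attn` and `ffn` arguments are stand-ins for the real sublayers; the point is the residual "bypass lane":

```python
def transformer_block(x, attn, ffn):
    """One block: attention plus residual, then feed-forward plus residual."""
    x = [xi + ai for xi, ai in zip(x, attn(x))]   # tokens talk to each other
    x = [xi + fi for xi, fi in zip(x, ffn(x))]    # each token processed alone
    return x

# Even if both sublayers output nothing (all zeros), the residual
# connection carries the original signal through untouched.
zeros = lambda x: [0.0] * len(x)
identity_safe = transformer_block([1.0, 2.0], zeros, zeros)
print(identity_safe)  # → [1.0, 2.0]
```

That pass-through behavior is exactly what keeps gradients alive when dozens of these blocks are stacked.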
Chapter 5: Intelligence in LLMs is an emergent property of massive scale.
  • Scaling Laws: Model performance predictably improves as you increase three variables: parameter count, dataset size, and compute.
  • Mixture of Experts (MoE): A trick to make massive models faster by only activating a small subset of the neural network (the "experts") for any given token.
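A minimal sketch of top-k expert routing. The experts here are trivial functions and the router scores are hard-coded; in a real MoE layer both are learned, and the router's output comes from a softmax:

```python
def moe_layer(token, experts, router_scores, top_k=2):
    """Sparse MoE: run only the top_k highest-scoring experts for this token,
    then blend their outputs by normalized router weight."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_scores[i], reverse=True)[:top_k]
    total = sum(router_scores[i] for i in top)
    return sum(router_scores[i] / total * experts[i](token) for i in top)

# Four "experts" (toy functions); the router prefers experts 1 and 3,
# so experts 0 and 2 are never executed for this token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]
out = moe_layer(5.0, experts, router_scores=[0.1, 0.6, 0.1, 0.2], top_k=2)
```

With, say, 8 experts and top_k=2, only a quarter of the layer's parameters are active per token, which is where the speedup comes from.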
Chapter 6: Pre-training is the most expensive and time-consuming part of creating an LLM.
  • Next-Token Prediction: The only objective during pre-training is guessing the next word in a sequence. By doing this across trillions of words, the model learns grammar, facts, and reasoning.
  • Data Quality: The internet is full of garbage. Curating, deduplicating, and filtering the pre-training dataset is the most closely guarded secret of AI labs.
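Next-token prediction can be illustrated with the crudest possible model: a bigram table that counts which token follows which. Pre-training does conceptually the same thing, except the "table" is a neural network with billions of parameters and the corpus is trillions of tokens:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which token follows which — the simplest next-token predictor."""
    model = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

def predict_next(model, token):
    """Greedy prediction: the most frequent follower seen in training."""
    return model[token].most_common(1)[0][0]

corpus = "the cat sat on the mat because the cat was tired"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # → "cat" ("cat" follows "the" most often here)
```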
The Bottom Line: A base LLM is just a massive statistical engine trained on supercomputers for months to play the world's most advanced game of autocomplete.
Part 3: Training & Alignment (Chapters 7-8)
Chapter 7: A base model just continues text. Fine-tuning turns it into an assistant that answers questions.
  • Supervised Fine-Tuning (SFT): Showing the model thousands of examples of high-quality Q&A pairs so it learns the format of a helpful conversation.
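SFT data is usually just Q&A pairs rendered into a chat template. The special tokens below (`<|user|>`, `<|assistant|>`, `<|end|>`) are illustrative; every lab defines its own:

```python
def format_sft_example(question: str, answer: str) -> str:
    """Wrap a Q&A pair in a hypothetical chat template. The special tokens
    here are made up for illustration — real templates vary by model family."""
    return f"<|user|>\n{question}\n<|assistant|>\n{answer}<|end|>"

example = format_sft_example("What is BPE?", "A subword tokenization algorithm.")
print(example)
```

During SFT the model is trained on thousands of strings like this, which is how it learns that after `<|assistant|>` it should answer rather than ramble on as a text continuation.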
Chapter 8: Alignment teaches the model human values: be helpful, honest, and harmless.
  • RLHF: Humans rate model outputs (A is better than B). A "Reward Model" learns these preferences and is then used to train the main LLM to maximize that reward.
  • The "ChatGPT" Moment: RLHF is the specific breakthrough that made LLMs accessible and safe enough for the general public.
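The reward model at the heart of RLHF is typically trained with a Bradley-Terry style pairwise loss: it should assign a higher score to the answer the human preferred. A sketch of that loss:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the reward model
    already scores the human-preferred answer higher, large when it doesn't."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen answer's reward pulls ahead of the rejected one.
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0) < preference_loss(-1.0, 0.0)
```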
The Bottom Line: Pre-training gives the model knowledge. Fine-tuning gives it a conversational interface. Alignment gives it a personality and safety boundaries.
Part 4: Inference & Optimization (Chapters 9-11)
Chapter 9: Generation is an autoregressive loop: predict one token, add it to the input, repeat.
  • Temperature: Controls the randomness of the output. Temp 0 = always pick the most likely word (robotic/factual). Temp 1 = pick from a wider distribution (creative).
  • Top-P / Top-K: Filtering techniques to prevent the model from ever picking completely nonsensical words during generation.
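Temperature and top-k filtering can both be sketched in a few lines. This is a simplified decoder step over raw logits (top-p would filter by cumulative probability instead of rank):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    """Pick a token index from raw logits with temperature and optional top-k."""
    if temperature == 0:               # greedy: always the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k is not None:              # discard everything below the k-th logit
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    scaled = [l / temperature for l in logits]   # higher temp flattens the distribution
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # softmax numerator (-inf → weight 0)
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [3.0, 1.0, 0.5, -2.0]
print(sample_next(logits, temperature=0))            # always index 0
print(sample_next(logits, temperature=1.0, top_k=2)) # only index 0 or 1 is possible
```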
Chapter 10: LLMs have amnesia. They only "remember" what is currently inside their context window.
  • Quadratic Scaling: Doubling the context window requires 4x the compute during the attention phase, which is why infinite context is mathematically difficult.
  • KV Cache: A crucial optimization that saves the mathematical state of previous tokens so the model doesn't have to re-read the entire prompt for every new word it generates.
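A back-of-the-envelope sketch of why the KV cache matters. It counts only key/value computations (ignoring the attention dot-products themselves), comparing naive re-encoding against caching:

```python
def generate_without_cache(prompt_len, new_tokens):
    """Naive loop: each step re-encodes every token seen so far."""
    work = 0
    for step in range(new_tokens):
        work += prompt_len + step + 1   # recompute K/V for the whole sequence
    return work

def generate_with_cache(prompt_len, new_tokens):
    """KV cache: keys/values are computed once and reused."""
    work = prompt_len                   # one pass over the prompt ("prefill")
    work += new_tokens                  # then one K/V computation per new token
    return work

print(generate_without_cache(1000, 100))  # → 105050
print(generate_with_cache(1000, 100))     # → 1100
```

For a 1,000-token prompt and 100 generated tokens, the cached version does roughly 95x less K/V work; the price is the GPU memory needed to hold the cache.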
Chapter 11: Serving LLMs to millions of users requires extreme engineering to overcome the memory wall.
  • Quantization: Shrinking the precision of the model's weights (e.g., from 16-bit to 4-bit) to make it fit in smaller GPUs and run faster.
  • Continuous Batching: Dynamically swapping user requests in and out of the GPU the millisecond they finish to maximize hardware utilization.
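A sketch of symmetric 4-bit quantization with one shared scale. Production schemes (GPTQ, AWQ, and friends) add per-group scales, zero points, and calibration data, but the core idea is just this rounding:

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with a single shared scale.
    Real schemes use per-channel/per-group scales; this is the bare idea."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [qi * scale for qi in q]

weights = [0.31, -0.11, 0.70, -0.62]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)   # close to the originals, at a quarter the storage
```

Each weight now needs 4 bits instead of 16, so the same model fits in a quarter of the GPU memory, at the cost of the small rounding error visible in `restored`.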
The Bottom Line: Generating text one token at a time is incredibly inefficient. The entire field of LLM engineering is dedicated to hacking the math and memory to make it faster.
Part 5: Frontiers & Landscape (Chapters 12-14)
Chapter 12: Text is just another sequence. Modern models treat images and audio as just more tokens.
  • Vision Encoders: Models like CLIP convert images into the exact same mathematical embedding space as text, allowing the LLM to "read" the image.
Chapter 13: LLMs are brilliant pattern matchers, but they do not reason like humans.
  • In-Context Learning: The ability of a model to learn a new task just by seeing examples in the prompt, without changing its underlying weights.
  • Hallucinations: Because LLMs are probabilistic text generators, they will confidently invent facts when they don't know the answer. This is an inherent consequence of the architecture, not a bug that can simply be patched out.
Chapter 14: The market is split between massive closed models and highly capable open-weight models.
  • Closed vs Open: OpenAI and Anthropic lead the frontier via APIs, while Meta's Llama provides open-weight models that anyone can download and run locally.
The Bottom Line: LLMs are the most capable general-purpose technology since the internet, but understanding their fundamental limitations (hallucinations, lack of true reasoning) is critical to using them effectively.