The Journey
You now understand the complete LLM pipeline: Text → Tokens (Ch 1) → Embeddings (Ch 2) → Attention (Ch 3) → Transformer Blocks (Ch 4) → Scale (Ch 5) → Pretraining (Ch 6) → Fine-tuning (Ch 7) → Alignment (Ch 8) → Generation (Ch 9) → Context (Ch 10) → Optimization (Ch 11) → Multimodal (Ch 12) → Capabilities & Limits (Ch 13) → The Landscape (Ch 14). Every concept builds on the previous ones.
Key insight: The entire field rests on one elegant idea: predict the next token. Everything else — attention, scaling laws, RLHF, quantization, multimodality — is engineering to make that simple idea work better, faster, and more safely. The transformer architecture from 2017 is still the foundation. The innovation is in training recipes, data, alignment, and serving infrastructure.
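To make "predict the next token" concrete, here is a minimal sketch of the training objective in plain NumPy: cross-entropy between the model's next-token distribution and the token that actually came next. The vocabulary size and logits are made-up stand-ins for a real model's output.

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for one next-token prediction.

    logits: unnormalized scores over the vocabulary, shape [V]
    target_id: index of the token that actually came next
    """
    # Softmax turns logits into a probability distribution.
    z = logits - logits.max()            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # The loss is -log P(correct next token); lower means the model
    # assigned more probability mass to the right continuation.
    return -np.log(probs[target_id])

# Toy example: a 5-token vocabulary with random "model" scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
print(f"loss = {next_token_loss(logits, target_id=2):.3f}")
```

Everything in the pipeline below exists to drive this one number down at scale, or to shape which continuations the model prefers.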
The Full Pipeline
# The complete LLM pipeline:
#
# ARCHITECTURE (Ch 1-4):
#   text → tokenize → embed → N × (
#     norm → attention → residual →
#     norm → FFN → residual
#   ) → norm → output head → logits
#
# TRAINING (Ch 5-8):
#   Pretrain (~15T tokens, next-token prediction)
#   → SFT (~100K examples, instruction format)
#   → RLHF/DPO (~50K preference pairs, quality + safety)
#
# INFERENCE (Ch 9-11):
#   Prompt → prefill (parallel) →
#   decode (sequential, KV cache) →
#   sample (temperature + top-p) → output
#   Optimized: quantize + FlashAttention + batching
#
# FRONTIER (Ch 12-14):
#   + Vision (ViT + adapter)
#   + Audio (codec tokens)
#   + Reasoning (CoT, test-time compute)
#   + Agents (tool use, planning)
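The sampling step in the inference stage above can be sketched in a few lines of NumPy. This is a simplified illustration, not a production decoder: the logits are stand-ins for a real model's output, and the default values for `temperature` and `top_p` are arbitrary.

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Sample one token id using temperature scaling + nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    # 1. Temperature: < 1 sharpens the distribution, > 1 flattens it.
    z = logits / temperature
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # 2. Top-p: keep the smallest set of tokens whose cumulative
    #    probability reaches top_p, starting from the most likely.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1   # how many tokens to keep
    keep = order[:cutoff]
    # 3. Renormalize over the nucleus and sample.
    nucleus = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=nucleus))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
token_id = sample_token(logits, rng=np.random.default_rng(42))
print(token_id)
```

During decode this runs once per generated token, with the logits coming from the output head after each forward step through the KV-cached transformer stack.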