Ch 14 — The Math of Modern AI (Capstone)

Transformers, diffusion, RLHF — every concept from this course in action
Capstone
Map: Attention → Transformer → Diffusion → RLHF → LoRA → Scale
Self-Attention — The Core Innovation
Every word looks at every other word to understand context
The Analogy
Imagine reading a sentence and highlighting which other words each word “pays attention to.” In “The cat sat on the mat because it was tired,” the word “it” attends strongly to “cat.” Self-attention computes this for every word pair simultaneously. Each word asks: “Who should I listen to?” using three vectors: Query (what I’m looking for), Key (what I offer), Value (my actual content).
Math from this course: Attention uses dot products (Ch 1) to measure similarity between Q and K. It divides by √d_k to prevent softmax saturation (Ch 13). Softmax converts scores to probabilities using the log-sum-exp trick (Ch 13). The result is a weighted sum — a linear combination (Ch 2) of value vectors.
The Math
# Self-Attention formula:
# Attention(Q, K, V) = softmax(QK^T / √d_k) V

# Step by step:
# 1. Q, K, V = X @ W_q, X @ W_k, X @ W_v
#    (matrix multiplication — Ch 2)
# 2. scores = Q @ K.T / √d_k
#    (dot product similarity — Ch 1)
#    (scaling prevents saturation — Ch 13)
# 3. weights = softmax(scores)
#    (probability distribution — Ch 7)
#    (log-sum-exp trick — Ch 13)
# 4. output = weights @ V
#    (weighted sum — Ch 1, 2)

# Tensor shapes (Ch 12):
# Q, K, V: (batch, heads, seq, d_k)
# scores:  (batch, heads, seq, seq)
# output:  (batch, heads, seq, d_k)
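The steps above can be run end to end. Here is a minimal sketch in NumPy — single head, no batch dimension, random weights chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max before exponentiating (log-sum-exp trick, Ch 13)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # 1. project inputs into query, key, value spaces (matrix multiply, Ch 2)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # 2. dot-product similarity between every query and every key (Ch 1)
    scores = Q @ K.T / np.sqrt(d_k)
    # 3. each row becomes a probability distribution (Ch 7)
    weights = softmax(scores, axis=-1)
    # 4. weighted sum — a linear combination of value vectors (Ch 2)
    return weights @ V

rng = np.random.default_rng(0)
seq, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4) — one context-enriched vector per token
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors — exactly the "who should I listen to?" picture from the analogy.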
The Transformer Architecture
The building block behind GPT, BERT, and every modern LLM
The Analogy
A transformer is like a team of editors reviewing a document. Each editor (layer) reads the entire text, highlights important connections (attention), then rewrites each word with richer context (feed-forward). After dozens of editors (GPT-3 stacks 96 layers; GPT-4 is estimated at ~120), every word carries deep understanding of the full context. Residual connections are like keeping the original draft alongside each edit.
Math from this course: Layer normalization uses mean and variance (Ch 8) to keep activations stable (conditioning — Ch 13). Residual connections prevent vanishing gradients (Ch 5, 6). Position encodings use sinusoidal functions to encode sequence order. The feed-forward network is y = W₂ · GELU(W₁x + b₁) + b₂ — pure matrix multiplication (Ch 2) plus a nonlinear activation (GPT-style models use GELU, a smooth variant of ReLU).
Architecture
# Transformer block (repeated N times):
# 1. Multi-head self-attention
#    attn_out = MultiHeadAttention(x)
#    x = LayerNorm(x + attn_out)   # residual
# 2. Feed-forward network
#    ff_out = FFN(x)               # W₂·GELU(W₁x+b₁)+b₂
#    x = LayerNorm(x + ff_out)     # residual

# GPT-4 (estimated):
# ~120 transformer blocks
# d_model = 12288, heads = 96
# ~1.8 trillion parameters
# Trained on ~13T tokens

# Training objective:
# Minimize cross-entropy (Ch 11) on
# next-token prediction
# Loss = -Σ log P(token_t | tokens_<t)
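A single block of this architecture can be sketched in NumPy. To keep it self-contained, the attention sub-layer is stubbed with an identity function (a stand-in for the multi-head attention shown earlier), and all weights are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position to zero mean, unit variance (Ch 8; conditioning, Ch 13)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # smooth ReLU variant used by GPT-style models (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    # sub-layer 1: attention, residual connection, layer norm
    x = layer_norm(x + attn(x))
    # sub-layer 2: feed-forward W2·GELU(W1 x + b1) + b2, residual, layer norm
    return layer_norm(x + gelu(x @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(0)
seq, d_model, d_ff = 6, 16, 64
x = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

identity_attn = lambda h: h  # placeholder for MultiHeadAttention
out = transformer_block(x, identity_attn, W1, b1, W2, b2)
print(out.shape)  # (6, 16) — same shape in, same shape out
```

Because each sub-layer preserves the (seq, d_model) shape, blocks can be stacked N times — the "team of editors" from the analogy.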
Diffusion Models — Learning to Denoise
How DALL-E and Stable Diffusion generate images from noise
The Analogy
Imagine a sculptor who learns by watching statues dissolve into sand (forward process). Once they understand how things fall apart, they can reverse the process — starting from a pile of sand and sculpting it back into a statue (reverse process). Diffusion models learn to reverse the gradual addition of Gaussian noise. Start with pure noise, denoise step by step, and a beautiful image emerges.
Math from this course: The forward process adds Gaussian noise (Ch 8) at each step. The model learns the score function ∇log p(x) — the gradient (Ch 4) of the log-probability. Training minimizes MSE loss between predicted and actual noise. The reverse process uses Bayes’ theorem (Ch 7) to compute the posterior. KL divergence (Ch 11) appears in the variational bound.
The Math
# Forward process: add noise gradually
# x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε
# ε ~ N(0, I)   (Gaussian noise — Ch 8)

# Model learns: ε_θ(x_t, t) ≈ ε
# "Given noisy image x_t at step t,
#  predict the noise that was added"

# Training loss (simple version):
# L = E[||ε - ε_θ(x_t, t)||²]
# (MSE between true and predicted noise)

# Reverse process: denoise step by step
# x_{t-1} = (1/√α_t)(x_t - β_t/√(1-ᾱ_t) × ε_θ)
#           + σ_t × z

# Text conditioning (Stable Diffusion):
# ε_θ(x_t, t, text_embedding)
# Cross-attention between image and text
# (same attention mechanism as transformers!)
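The forward process and training loss above fit in a few lines. The noise predictor here is a hypothetical stand-in that guesses zero — a real ε_θ is a trained U-Net or transformer conditioned on the step t:

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    # closed-form forward process: jump straight from x_0 to step t (Gaussian, Ch 8)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps

def simple_loss(eps_true, eps_pred):
    # L = E[||ε - ε_θ(x_t, t)||²] — plain MSE on the noise
    return np.mean((eps_true - eps_pred) ** 2)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))                      # a tiny stand-in "image"
x_t, eps = forward_noise(x0, alpha_bar_t=0.5, rng=rng)

# an untrained predictor that always guesses zero noise scores roughly E[ε²] ≈ 1;
# training pushes this toward 0 as ε_θ learns to recognize the added noise
print(simple_loss(eps, np.zeros_like(eps)))
```

Note that the model never sees x_0 directly at training time — it only learns to undo one noising step, which is what makes the objective so simple.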
RLHF — Aligning AI with Human Values
How ChatGPT learned to be helpful, harmless, and honest
The Analogy
Training a base LLM is like teaching someone to speak English. RLHF is like teaching them to be a good conversationalist — polite, helpful, and honest. Step 1: humans rank model outputs (preference data). Step 2: train a reward model to predict human preferences. Step 3: use reinforcement learning (PPO) to maximize the reward while staying close to the base model.
Math from this course: The reward model is trained with cross-entropy loss (Ch 11) on pairwise comparisons. PPO optimizes a clipped objective — a constrained optimization problem (Ch 6). The KL penalty (Ch 11) prevents the model from drifting too far from the base: maximize reward − β × KL(π_new || π_ref). This is MAP estimation (Ch 9) with the base model as the prior!
The Pipeline
# RLHF Pipeline:

# Step 1: Supervised Fine-Tuning (SFT)
# Train on human-written examples
# Loss = cross-entropy (Ch 11)

# Step 2: Reward Model
# Human ranks: response_A > response_B
# Train R(x, y) to predict preferences
# Loss = -log σ(R(y_w) - R(y_l))
# (Bradley-Terry model — logistic — Ch 9)

# Step 3: PPO Optimization
# Maximize:   E[R(x, y)]
# Subject to: KL(π_θ || π_ref) < δ
# Combined: R(x,y) - β × KL(π_θ || π_ref)

# DPO (Direct Preference Optimization):
# Skip the reward model entirely!
# Directly optimize the policy from preferences
# Loss = -log σ(β(log(π(y_w)/π_ref(y_w))
#              - log(π(y_l)/π_ref(y_l))))
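The two preference losses above can be computed directly. The reward values and log-probabilities below are made-up numbers, just to show the shapes of the formulas:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def reward_model_loss(r_w, r_l):
    # Bradley-Terry pairwise loss: -log σ(R(y_w) - R(y_l))  (logistic, Ch 9; Ch 11)
    return -np.log(sigmoid(r_w - r_l))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: the implicit reward is β × the log-prob ratio vs. the frozen reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# the loss shrinks as the reward model separates winner (y_w) from loser (y_l)
print(reward_model_loss(2.0, 0.5) < reward_model_loss(0.5, 2.0))  # True

# DPO rewards raising the winner's probability relative to the reference policy
print(dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5))
```

Both are ordinary logistic losses on a scalar margin — the whole alignment pipeline reduces to math already covered in Ch 9 and 11.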
LoRA & Efficient Fine-Tuning
Update billions of parameters by training only millions
The Analogy
Imagine remodeling a house. Full fine-tuning = tearing down and rebuilding every wall. LoRA = adding small, targeted modifications (new shelves, fresh paint) that transform the space without touching the structure. LoRA freezes the original weights and adds small low-rank matrices that capture the task-specific changes. 175B parameters frozen, only ~10M trained.
Math from this course: LoRA is pure SVD (Ch 3) in action. Instead of updating the full weight matrix W (d×d), LoRA learns W + BA where B is (d×r) and A is (r×d), with rank r << d. This is a low-rank approximation (Ch 3). The insight: weight changes during fine-tuning have low intrinsic dimensionality (Ch 12) — they lie on a low-dimensional manifold.
The Math
# Standard fine-tuning:
# W_new = W + ΔW   (ΔW is d×d = huge)

# LoRA insight: ΔW has low rank!
# ΔW ≈ B × A   where B: d×r, A: r×d
# r = 4, 8, or 16 (rank)

# Example: d = 12288 (GPT-3 size)
# Full ΔW:    12288² = 151M parameters
# LoRA (r=8): 12288×8 + 8×12288 = 197K
# → 768× fewer parameters!

# Forward pass:
# h = (W + BA) × x = Wx + BAx
# W is frozen, only B and A are trained

# At inference: merge W_new = W + BA
# No extra latency!

# QLoRA: quantize W to 4-bit + LoRA
# Fine-tune a 65B model on a single GPU!
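The forward pass and the parameter arithmetic above can both be checked in code. The demo uses a small d so the matrices fit in memory; the parameter count uses the GPT-3-scale d = 12288 from the example:

```python
import numpy as np

def lora_forward(x, W, A, B):
    # h = Wx + B(Ax): the d×d update BA is never materialized
    return x @ W.T + (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d, r = 512, 8                          # small d for the demo
W = rng.normal(size=(d, d)) * 0.01     # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init

x = rng.normal(size=(d,))
h = lora_forward(x, W, A, B)
# B = 0 at the start, so the adapted model begins exactly at the frozen model
assert np.allclose(h, x @ W.T)

# parameter count at GPT-3 scale (d = 12288, r = 8):
full = 12288 * 12288        # 150,994,944 ≈ 151M
lora = 12288 * 8 + 8 * 12288  # 196,608 ≈ 197K
print(full // lora)  # 768
```

The zero initialization of B is the standard LoRA trick: training starts from the pretrained behavior and only gradually learns the low-rank update.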
Full Fine-Tune
Update all 175B params. Needs 100s of GPUs.
LoRA
Update ~10M params (0.006%). Single GPU possible.
Scaling Laws — Why Bigger Is Better
The mathematical laws that predicted GPT-4 before it was built
The Analogy
Moore’s Law predicted transistor doubling for decades. Scaling laws do the same for AI: loss decreases as a power law with more parameters, data, and compute. Double the compute → predictable improvement. This is why companies invest billions in training — the returns are mathematically predictable. OpenAI used scaling laws to predict GPT-4’s performance before training it.
Math from this course: Scaling laws are power laws: L(N) = aN^−α + L∞. Subtracting the irreducible loss L∞ and taking the log (Ch 11) gives a straight line: log(L − L∞) = −α log(N) + const. This is linear regression in log-space. The Chinchilla paper showed the optimal ratio: tokens ≈ 20 × parameters. This is an optimization problem (Ch 6) — minimize loss subject to a compute budget.
The Numbers
# Kaplan et al. (2020) scaling laws:
# L(N) = (N_c / N)^α_N   (parameters)
# L(D) = (D_c / D)^α_D   (data)
# L(C) = (C_c / C)^α_C   (compute)
# α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050
# Loss decreases as a power law with scale

# Chinchilla optimal (Hoffmann 2022):
# Tokens ≈ 20 × Parameters
# GPT-3:      175B params, 300B tokens (under-trained)
# Chinchilla:  70B params, 1.4T tokens
# → Same compute, better performance!

# Compute budget (FLOPs ≈ 6 × N × D):
# GPT-3: 6 × 175B × 300B ≈ 3.1 × 10²³ FLOPs
# GPT-4: ~2 × 10²⁵ FLOPs (estimated)
# → ~60× more compute → predictable gain
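The log-linear relationship can be checked numerically: generate losses from a Kaplan-style law, then recover the exponent with a linear fit in log-log space. The constants are the parameter-scaling values quoted above (N_c from the Kaplan et al. paper):

```python
import numpy as np

# Kaplan-style law: L(N) = (N_c / N)^α_N
alpha_N, N_c = 0.076, 8.8e13
N = np.array([1e8, 1e9, 1e10, 1e11, 1e12])   # model sizes in parameters
L = (N_c / N) ** alpha_N

# log L = -α_N log N + α_N log N_c: a straight line in log-log space (Ch 11)
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print(round(-slope, 3))  # 0.076 — linear regression recovers the exponent

# Chinchilla rule of thumb: compute-optimal tokens ≈ 20 × parameters
params = 70e9
print(f"optimal tokens ≈ {20 * params:.1e}")  # 1.4e+12, matching Chinchilla
```

This is exactly how the original papers fit their curves: loss measurements at several scales, a straight line in log-log coordinates, then extrapolation to budgets not yet trained.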
The Complete Math Map
Every concept from this course, connected
Linear Algebra (Ch 1-3)
Vectors represent data (embeddings, features). Matrices transform data (weights, attention). Eigenvalues/SVD reveal structure (PCA, LoRA). Every forward pass is matrix multiplication. Every embedding lookup is a vector operation. Every dimensionality reduction is SVD.
Calculus (Ch 4-6)
Derivatives measure sensitivity. Chain rule enables backpropagation. Gradient descent finds optimal weights. Without calculus, there is no training. Every weight update is a gradient step. Every optimizer (Adam, SGD) is a calculus algorithm.
Probability (Ch 7-9)
Probability quantifies uncertainty. Distributions model data. MLE/Bayesian inference drives learning. Every prediction is a probability. Every loss function is a likelihood. Every regularizer is a prior belief.
Advanced Topics (Ch 10-13)
Hypothesis testing validates models. Information theory defines loss functions. Tensors structure computation. Numerical stability makes it all work. Cross-entropy loss IS information theory. Tensor shapes ARE the architecture. Numerical tricks ARE the engineering.
Your Mathematical Foundation
You now have the math to understand any AI paper
What You’ve Learned
You started with vectors and ended with transformers. Along the way, you learned that dot products power attention, gradients power training, probability powers prediction, and information theory powers loss functions. These aren’t separate topics — they’re one interconnected system. Every AI breakthrough is a creative combination of these fundamentals.
The final insight: AI isn’t magic — it’s math. Transformers are matrix multiplications with softmax attention. Diffusion models are Gaussian noise with learned denoising. RLHF is optimization with a KL constraint. LoRA is low-rank matrix factorization. When you read an AI paper now, you’ll recognize the math. That’s the real superpower.
Cheat Sheet
# The Math Behind Modern AI:
#
# Attention = softmax(QK^T/√d) × V
#   → dot product, softmax, matrix multiply
#
# Training = minimize cross-entropy
#   → gradient descent via backpropagation
#
# Diffusion = learn ε_θ(x_t, t) ≈ ε
#   → Gaussian noise, score matching
#
# RLHF = max R(x,y) - β×KL(π||π_ref)
#   → reward optimization with KL constraint
#
# LoRA = W + BA (rank-r update)
#   → low-rank approximation via SVD insight
#
# Scaling = L(C) ∝ C^(-0.05)
#   → power law, log-linear relationship
#
# You now speak the language of AI. 🎓
Before This Course
“AI is a black box that somehow works.”
After This Course
“AI is math I understand: linear algebra, calculus, probability, and information theory.”