Ch 14 — The Math of Modern AI (Capstone)

Transformers, diffusion, RLHF — every concept from this course in action
Capstone
Map: Attention → Transformer → Diffusion → RLHF → LoRA → Scale
Self-Attention — The Core Innovation
Every word looks at every other word to understand context
The Analogy
Imagine reading a sentence and highlighting which other words each word “pays attention to.” In “The cat sat on the mat because it was tired,” the word “it” attends strongly to “cat.” Self-attention computes this for every word pair simultaneously. Each word asks: “Who should I listen to?” using three vectors: Query (what I’m looking for), Key (what I offer), Value (my actual content).
Math from this course: Attention uses dot products (Ch 1) to measure similarity between Q and K. It divides by √d_k to prevent softmax saturation (Ch 13). Softmax converts scores to probabilities using the log-sum-exp trick (Ch 13). The result is a weighted sum — a linear combination (Ch 2) of value vectors.
The Math
# Self-Attention formula:
# Attention(Q, K, V) = softmax(QK^T / √d_k) V

# Step by step:
# 1. Q, K, V = X @ W_q, X @ W_k, X @ W_v
#    (matrix multiplication — Ch 2)
# 2. scores = Q @ K.T / √d_k
#    (dot product similarity — Ch 1)
#    (scaling prevents saturation — Ch 13)
# 3. weights = softmax(scores)
#    (probability distribution — Ch 7)
#    (log-sum-exp trick — Ch 13)
# 4. output = weights @ V
#    (weighted sum — Ch 1, 2)

# Tensor shapes (Ch 12):
# Q, K, V: (batch, heads, seq, d_k)
# scores:  (batch, heads, seq, seq)
# output:  (batch, heads, seq, d_k)
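The steps above can be run end to end. Here is a minimal sketch in NumPy — single head, no batch dimension, random weights chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max before exponentiating (log-sum-exp trick, Ch 13)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # 1. project inputs into query, key, value spaces (matrix multiply, Ch 2)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # 2. dot-product similarity between every query and every key (Ch 1)
    scores = Q @ K.T / np.sqrt(d_k)
    # 3. each row becomes a probability distribution (Ch 7)
    weights = softmax(scores, axis=-1)
    # 4. weighted sum — a linear combination of value vectors (Ch 2)
    return weights @ V

rng = np.random.default_rng(0)
seq, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4) — one context-enriched vector per token
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors — exactly the "who should I listen to?" picture from the analogy.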
The Transformer Architecture
The building block behind GPT, BERT, and every modern LLM
The Analogy
A transformer is like a team of editors reviewing a document. Each editor (layer) reads the entire text, highlights important connections (attention), then rewrites each word with richer context (feed-forward). After dozens of editors (GPT-3 stacks 96 layers; GPT-4 is estimated at ~120), every word carries deep understanding of the full context. Residual connections are like keeping the original draft alongside each edit.
Math from this course: Layer normalization uses mean and variance (Ch 8) to keep activations stable (conditioning — Ch 13). Residual connections prevent vanishing gradients (Ch 5, 6). Position encodings use sinusoidal functions to encode sequence order. The feed-forward network is y = W₂ · GELU(W₁x + b₁) + b₂ — pure matrix multiplication (Ch 2) plus a nonlinear activation (GPT-style models use GELU, a smooth variant of ReLU).
Architecture
# Transformer block (repeated N times):
# 1. Multi-head self-attention
#    attn_out = MultiHeadAttention(x)
#    x = LayerNorm(x + attn_out)   # residual
# 2. Feed-forward network
#    ff_out = FFN(x)               # W₂·GELU(W₁x+b₁)+b₂
#    x = LayerNorm(x + ff_out)     # residual

# GPT-4 (estimated):
# ~120 transformer blocks
# d_model = 12288, heads = 96
# ~1.8 trillion parameters
# Trained on ~13T tokens

# Training objective:
# Minimize cross-entropy (Ch 11) on
# next-token prediction
# Loss = -Σ log P(token_t | tokens_<t)
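A single block of this architecture can be sketched in NumPy. To keep it self-contained, the attention sub-layer is stubbed with an identity function (a stand-in for the multi-head attention shown earlier), and all weights are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position to zero mean, unit variance (Ch 8; conditioning, Ch 13)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # smooth ReLU variant used by GPT-style models (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    # sub-layer 1: attention, residual connection, layer norm
    x = layer_norm(x + attn(x))
    # sub-layer 2: feed-forward W2·GELU(W1 x + b1) + b2, residual, layer norm
    return layer_norm(x + gelu(x @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(0)
seq, d_model, d_ff = 6, 16, 64
x = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

identity_attn = lambda h: h  # placeholder for MultiHeadAttention
out = transformer_block(x, identity_attn, W1, b1, W2, b2)
print(out.shape)  # (6, 16) — same shape in, same shape out
```

Because each sub-layer preserves the (seq, d_model) shape, blocks can be stacked N times — the "team of editors" from the analogy.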
Diffusion Models — Learning to Denoise
How DALL-E and Stable Diffusion generate images from noise
The Analogy
Imagine a sculptor who learns by watching statues dissolve into sand (forward process). Once they understand how things fall apart, they can reverse the process — starting from a pile of sand and sculpting it back into a statue (reverse process). Diffusion models learn to reverse the gradual addition of Gaussian noise. Start with pure noise, denoise step by step, and a beautiful image emerges.
Math from this course: The forward process adds Gaussian noise (Ch 8) at each step. The model learns the score function ∇log p(x) — the gradient (Ch 4) of the log-probability. Training minimizes MSE loss between predicted and actual noise. The reverse process uses Bayes’ theorem (Ch 7) to compute the posterior. KL divergence (Ch 11) appears in the variational bound.
The Math
# Forward process: add noise gradually
# x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε
# ε ~ N(0, I)   (Gaussian noise — Ch 8)

# Model learns: ε_θ(x_t, t) ≈ ε
# "Given noisy image x_t at step t,
#  predict the noise that was added"

# Training loss (simple version):
# L = E[||ε - ε_θ(x_t, t)||²]
# (MSE between true and predicted noise)

# Reverse process: denoise step by step
# x_{t-1} = (1/√α_t)(x_t - β_t/√(1-ᾱ_t) × ε_θ)
#           + σ_t × z

# Text conditioning (Stable Diffusion):
# ε_θ(x_t, t, text_embedding)
# Cross-attention between image and text
# (same attention mechanism as transformers!)
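The forward process and training loss above fit in a few lines. The noise predictor here is a hypothetical stand-in that guesses zero — a real ε_θ is a trained U-Net or transformer conditioned on the step t:

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    # closed-form forward process: jump straight from x_0 to step t (Gaussian, Ch 8)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps

def simple_loss(eps_true, eps_pred):
    # L = E[||ε - ε_θ(x_t, t)||²] — plain MSE on the noise
    return np.mean((eps_true - eps_pred) ** 2)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))                      # a tiny stand-in "image"
x_t, eps = forward_noise(x0, alpha_bar_t=0.5, rng=rng)

# an untrained predictor that always guesses zero noise scores roughly E[ε²] ≈ 1;
# training pushes this toward 0 as ε_θ learns to recognize the added noise
print(simple_loss(eps, np.zeros_like(eps)))
```

Note that the model never sees x_0 directly at training time — it only learns to undo one noising step, which is what makes the objective so simple.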
RLHF — Aligning AI with Human Values
How ChatGPT learned to be helpful, harmless, and honest
The Analogy
Training a base LLM is like teaching someone to speak English. RLHF is like teaching them to be a good conversationalist — polite, helpful, and honest. Step 1: humans rank model outputs (preference data). Step 2: train a reward model to predict human preferences. Step 3: use reinforcement learning (PPO) to maximize the reward while staying close to the base model.
Math from this course: The reward model is trained with cross-entropy loss (Ch 11) on pairwise comparisons. PPO optimizes a clipped objective — a constrained optimization problem (Ch 6). The KL penalty (Ch 11) prevents the model from drifting too far from the base: maximize reward − β × KL(π_new || π_ref). This is MAP estimation (Ch 9) with the base model as the prior!
The Pipeline
# RLHF Pipeline:

# Step 1: Supervised Fine-Tuning (SFT)
# Train on human-written examples
# Loss = cross-entropy (Ch 11)

# Step 2: Reward Model
# Human ranks: response_A > response_B
# Train R(x, y) to predict preferences
# Loss = -log σ(R(y_w) - R(y_l))
# (Bradley-Terry model — logistic — Ch 9)

# Step 3: PPO Optimization
# Maximize:   E[R(x, y)]
# Subject to: KL(π_θ || π_ref) < δ
# Combined: R(x,y) - β × KL(π_θ || π_ref)

# DPO (Direct Preference Optimization):
# Skip the reward model entirely!
# Directly optimize the policy from preferences
# Loss = -log σ(β(log(π(y_w)/π_ref(y_w))
#              - log(π(y_l)/π_ref(y_l))))
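The two preference losses above can be computed directly. The reward values and log-probabilities below are made-up numbers, just to show the shapes of the formulas:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def reward_model_loss(r_w, r_l):
    # Bradley-Terry pairwise loss: -log σ(R(y_w) - R(y_l))  (logistic, Ch 9; Ch 11)
    return -np.log(sigmoid(r_w - r_l))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: the implicit reward is β × the log-prob ratio vs. the frozen reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# the loss shrinks as the reward model separates winner (y_w) from loser (y_l)
print(reward_model_loss(2.0, 0.5) < reward_model_loss(0.5, 2.0))  # True

# DPO rewards raising the winner's probability relative to the reference policy
print(dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5))
```

Both are ordinary logistic losses on a scalar margin — the whole alignment pipeline reduces to math already covered in Ch 9 and 11.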
LoRA & Efficient Fine-Tuning
Update billions of parameters by training only millions
The Analogy
Imagine remodeling a house. Full fine-tuning = tearing down and rebuilding every wall. LoRA = adding small, targeted modifications (new shelves, fresh paint) that transform the space without touching the structure. LoRA freezes the original weights and adds small low-rank matrices that capture the task-specific changes. 175B parameters frozen, only ~10M trained.
Math from this course: LoRA is pure SVD (Ch 3) in action. Instead of updating the full weight matrix W (d×d), LoRA learns W + BA where B is (d×r) and A is (r×d), with rank r << d. This is a low-rank approximation (Ch 3). The insight: weight changes during fine-tuning have low intrinsic dimensionality (Ch 12) — they lie on a low-dimensional manifold.
The Math
# Standard fine-tuning:
# W_new = W + ΔW   (ΔW is d×d = huge)

# LoRA insight: ΔW has low rank!
# ΔW ≈ B × A   where B: d×r, A: r×d
# r = 4, 8, or 16 (rank)

# Example: d = 12288 (GPT-3 size)
# Full ΔW:    12288² = 151M parameters
# LoRA (r=8): 12288×8 + 8×12288 = 197K
# → 768× fewer parameters!

# Forward pass:
# h = (W + BA) × x = Wx + BAx
# W is frozen, only B and A are trained

# At inference: merge W_new = W + BA
# No extra latency!

# QLoRA: quantize W to 4-bit + LoRA
# Fine-tune a 65B model on a single GPU!
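The forward pass and the parameter arithmetic above can both be checked in code. The demo uses a small d so the matrices fit in memory; the parameter count uses the GPT-3-scale d = 12288 from the example:

```python
import numpy as np

def lora_forward(x, W, A, B):
    # h = Wx + B(Ax): the d×d update BA is never materialized
    return x @ W.T + (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d, r = 512, 8                          # small d for the demo
W = rng.normal(size=(d, d)) * 0.01     # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init

x = rng.normal(size=(d,))
h = lora_forward(x, W, A, B)
# B = 0 at the start, so the adapted model begins exactly at the frozen model
assert np.allclose(h, x @ W.T)

# parameter count at GPT-3 scale (d = 12288, r = 8):
full = 12288 * 12288        # 150,994,944 ≈ 151M
lora = 12288 * 8 + 8 * 12288  # 196,608 ≈ 197K
print(full // lora)  # 768
```

The zero initialization of B is the standard LoRA trick: training starts from the pretrained behavior and only gradually learns the low-rank update.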
Full Fine-Tune
Update all 175B params. Needs 100s of GPUs.
LoRA
Update ~10M params (0.006%). Single GPU possible.
Scaling Laws — Why Bigger Is Better
The mathematical laws that predicted GPT-4 before it was built
The Analogy
Moore’s Law predicted transistor doubling for decades. Scaling laws do the same for AI: loss decreases as a power law with more parameters, data, and compute. Double the compute → predictable improvement. This is why companies invest billions in training — the returns are mathematically predictable. OpenAI used scaling laws to predict GPT-4’s performance before training it.
Math from this course: Scaling laws are power laws: L(N) = aN^−α + L∞. Subtracting the irreducible loss L∞ and taking the log (Ch 11) gives a straight line: log(L − L∞) = −α log(N) + const. This is linear regression in log-space. The Chinchilla paper showed the optimal ratio: tokens ≈ 20 × parameters. This is an optimization problem (Ch 6) — minimize loss subject to a compute budget.
The Numbers
# Kaplan et al. (2020) scaling laws:
# L(N) = (N_c / N)^α_N   (parameters)
# L(D) = (D_c / D)^α_D   (data)
# L(C) = (C_c / C)^α_C   (compute)
# α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050
# Loss decreases as a power law with scale

# Chinchilla optimal (Hoffmann 2022):
# Tokens ≈ 20 × Parameters
# GPT-3:      175B params, 300B tokens (under-trained)
# Chinchilla:  70B params, 1.4T tokens
# → Same compute, better performance!

# Compute budget (FLOPs ≈ 6 × N × D):
# GPT-3: 6 × 175B × 300B ≈ 3.1 × 10²³ FLOPs
# GPT-4: ~2 × 10²⁵ FLOPs (estimated)
# → ~60× more compute → predictable gain
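The log-linear relationship can be checked numerically: generate losses from a Kaplan-style law, then recover the exponent with a linear fit in log-log space. The constants are the parameter-scaling values quoted above (N_c from the Kaplan et al. paper):

```python
import numpy as np

# Kaplan-style law: L(N) = (N_c / N)^α_N
alpha_N, N_c = 0.076, 8.8e13
N = np.array([1e8, 1e9, 1e10, 1e11, 1e12])   # model sizes in parameters
L = (N_c / N) ** alpha_N

# log L = -α_N log N + α_N log N_c: a straight line in log-log space (Ch 11)
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print(round(-slope, 3))  # 0.076 — linear regression recovers the exponent

# Chinchilla rule of thumb: compute-optimal tokens ≈ 20 × parameters
params = 70e9
print(f"optimal tokens ≈ {20 * params:.1e}")  # 1.4e+12, matching Chinchilla
```

This is exactly how the original papers fit their curves: loss measurements at several scales, a straight line in log-log coordinates, then extrapolation to budgets not yet trained.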
The Complete Math Map
Every concept from this course, connected
Linear Algebra (Ch 1-3)
Vectors represent data (embeddings, features). Matrices transform data (weights, attention). Eigenvalues/SVD reveal structure (PCA, LoRA). Every forward pass is matrix multiplication. Every embedding lookup is a vector operation. Every dimensionality reduction is SVD.
Calculus (Ch 4-6)
Derivatives measure sensitivity. Chain rule enables backpropagation. Gradient descent finds optimal weights. Without calculus, there is no training. Every weight update is a gradient step. Every optimizer (Adam, SGD) is a calculus algorithm.
Probability (Ch 7-9)
Probability quantifies uncertainty. Distributions model data. MLE/Bayesian inference drives learning. Every prediction is a probability. Every loss function is a likelihood. Every regularizer is a prior belief.
Advanced Topics (Ch 10-13)
Hypothesis testing validates models. Information theory defines loss functions. Tensors structure computation. Numerical stability makes it all work. Cross-entropy loss IS information theory. Tensor shapes ARE the architecture. Numerical tricks ARE the engineering.
Your Mathematical Foundation
You now have the math to understand any AI paper
What You’ve Learned
You started with vectors and ended with transformers. Along the way, you learned that dot products power attention, gradients power training, probability powers prediction, and information theory powers loss functions. These aren’t separate topics — they’re one interconnected system. Every AI breakthrough is a creative combination of these fundamentals.
The final insight: AI isn’t magic — it’s math. Transformers are matrix multiplications with softmax attention. Diffusion models are Gaussian noise with learned denoising. RLHF is optimization with a KL constraint. LoRA is low-rank matrix factorization. When you read an AI paper now, you’ll recognize the math. That’s the real superpower.
Cheat Sheet
# The Math Behind Modern AI:
#
# Attention = softmax(QK^T/√d) × V
#   → dot product, softmax, matrix multiply
#
# Training = minimize cross-entropy
#   → gradient descent via backpropagation
#
# Diffusion = learn ε_θ(x_t, t) ≈ ε
#   → Gaussian noise, score matching
#
# RLHF = max R(x,y) - β×KL(π||π_ref)
#   → reward optimization with KL constraint
#
# LoRA = W + BA (rank-r update)
#   → low-rank approximation via SVD insight
#
# Scaling = L(C) ∝ C^(-0.05)
#   → power law, log-linear relationship
#
# You now speak the language of AI. 🎓
Before This Course
“AI is a black box that somehow works.”
After This Course
“AI is math I understand: linear algebra, calculus, probability, and information theory.”