Ch 9 — Maximum Likelihood & Bayesian Inference

The detective finding the most likely suspect — how models learn from data
Roadmap: Probability → Likelihood → Log-Likelihood → MLE = Training → Prior → Posterior → Regularization
The Detective Analogy
Finding the most likely explanation for the evidence
The Analogy
A detective arrives at a crime scene. There are fingerprints, footprints, and a broken window. The detective asks: “Which suspect most likely produced this evidence?” That’s Maximum Likelihood Estimation (MLE). You have data (evidence) and you want to find the parameters (suspect) that make the data most probable.
Key insight: Training a neural network IS maximum likelihood estimation. When you minimize cross-entropy loss, you’re literally maximizing the likelihood of the training data under the model. Every time you call loss.backward(), you’re doing MLE.
The Core Question
# MLE asks: which parameters θ make
# the observed data most probable?
#
#   θ* = argmax_θ P(data | θ)
#
# Example: coin flip
# Data: 7 heads, 3 tails
# θ = probability of heads
# Which θ makes "7H, 3T" most likely?
# Answer: θ = 7/10 = 0.7
#
# Neural network version:
#   θ    = all weights and biases
#   data = training examples
#   θ*   = weights that maximize P(data | θ)
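The coin-flip answer above can be checked numerically with a brute-force grid search. This is a standard-library sketch; note the binomial coefficient only scales the whole curve, so it doesn't move the maximum:

```python
from math import comb

# P(7 heads, 3 tails | theta) under a binomial model
def likelihood(theta, heads=7, tails=3):
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

# Grid search: which candidate bias explains the data best?
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=likelihood)

print(best)                                # 0.7
print(likelihood(0.7) > likelihood(0.5))   # True
```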
Real World
Detective: which suspect best explains the evidence?
In AI
Training: which weights best explain the training data?
Likelihood Function
How probable is the data under different parameter values?
The Analogy
The likelihood function L(θ) answers: “If the coin had bias θ, how probable is seeing 7 heads in 10 flips?” You evaluate this for every possible θ and pick the one that gives the highest probability. It’s the same formula as probability, but viewed from the opposite direction — the data is fixed, the parameter varies.
Key insight: Likelihood is NOT the probability of the parameters. It’s the probability of the data given the parameters. P(data | θ) treats θ as the variable. This subtle distinction is the foundation of all statistical learning.
Worked Example
# Coin: 7 heads, 3 tails
# L(θ) = θ⁷ × (1-θ)³
import numpy as np

thetas = np.linspace(0, 1, 100)
L = thetas**7 * (1 - thetas)**3

thetas[np.argmax(L)]   # ≈ 0.70: maximum at θ = 0.7

# For i.i.d. data, likelihood is a product:
#   L(θ) = Π P(xᵢ | θ)
# Products of tiny numbers → underflow!
# Solution: use log-likelihood instead
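The underflow problem is easy to trigger deliberately. This standard-library sketch multiplies 10,000 made-up per-example probabilities: the raw product collapses to 0.0, while the sum of logs stays a perfectly ordinary number:

```python
import random
from math import log

random.seed(0)
probs = [random.uniform(0.1, 0.9) for _ in range(10_000)]  # fake per-example probabilities

# Naive product underflows to exactly 0.0 in float64
naive = 1.0
for p in probs:
    naive *= p
print(naive)     # 0.0

# Sum of logs stays finite and easy to work with
log_lik = sum(log(p) for p in probs)
print(log_lik)   # a large negative but finite number
```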
Log-Likelihood — Products Become Sums
The trick that makes MLE computationally practical
The Analogy
Multiplying 10,000 small probabilities gives a number so tiny your computer can’t represent it (underflow). The log trick converts products into sums: log(a × b) = log(a) + log(b). Sums of reasonable numbers are easy to compute. Since log is monotonic, maximizing log-likelihood gives the same answer as maximizing likelihood.
Key insight: The cross-entropy loss you use in PyTorch IS the negative log-likelihood. Minimizing cross-entropy = maximizing log-likelihood = doing MLE. The “loss function” and “MLE” are the same thing wearing different hats.
Worked Example
# Log-likelihood: ℓ(θ) = Σ log P(xᵢ | θ)
# Coin: 7H, 3T
#   ℓ(θ) = 7·log(θ) + 3·log(1-θ)
# Maximize: dℓ/dθ = 7/θ - 3/(1-θ) = 0
#   → 7(1-θ) = 3θ → 7 = 10θ → θ = 0.7 ✓

# Cross-entropy loss = negative log-likelihood
loss_fn = torch.nn.CrossEntropyLoss()
# loss_fn(logits, targets) = -Σ yᵢ log(ŷᵢ)
# Minimizing this = maximizing log P(data | θ)

# MSE loss = MLE under a Gaussian noise assumption:
# if errors ~ N(0, σ²), MLE → minimize Σ(y - ŷ)²
Key connection: Cross-entropy loss = MLE for categorical data. MSE loss = MLE for Gaussian data. Binary cross-entropy = MLE for Bernoulli data. Every standard loss function IS a log-likelihood.
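The MSE-to-Gaussian connection can be seen numerically. Under the model y_i = θ + noise with Gaussian noise, the log-likelihood is -Σ(y_i - θ)²/(2σ²) + const, so maximizing it is exactly minimizing squared error, and the minimizer is the sample mean. A standard-library sketch with made-up observations:

```python
# Model: y_i = θ + noise, noise ~ N(0, σ²)
# Maximizing the Gaussian log-likelihood = minimizing squared error
y = [2.0, 3.5, 4.0, 1.5, 3.0]   # hypothetical observations

def sse(theta):
    return sum((yi - theta) ** 2 for yi in y)

grid = [i / 1000 for i in range(0, 6001)]
theta_hat = min(grid, key=sse)

print(theta_hat)         # 2.8
print(sum(y) / len(y))   # 2.8, the sample mean = Gaussian MLE
```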
Training = Maximum Likelihood
Every gradient step maximizes the likelihood of your data
The Connection
When you train a neural network with cross-entropy loss, you’re finding the weights θ that maximize P(training_data | θ). The gradient ∇L points toward higher likelihood. Each optimizer step moves weights to make the training data more probable under the model. This is MLE, implemented via gradient descent.
Key insight: An LLM trained on internet text is doing MLE: finding weights that maximize the probability of all the text it saw. P(“The capital of France is Paris”) is high because that sentence, and ones like it, appeared many times in the training data. The model learned to assign high probability to text that resembles its training data, which is why it states common facts confidently and can also repeat common misconceptions.
In Practice
# LLM training = MLE on text
for batch in text_data:
    # model outputs P(next_token | context; θ)
    logits = model(batch.context)
    # negative log-likelihood of the true next token
    loss = F.cross_entropy(logits, batch.target)
    # = -log P(target | context; θ)
    loss.backward()
    optimizer.step()

# After training:
# P("Paris" | "capital of France is"; θ*) ≈ 0.85
# θ* maximizes the likelihood of the training text
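The "each gradient step raises the likelihood" claim can be demonstrated without a neural network: plain gradient ascent on the coin's log-likelihood converges to the MLE. A minimal sketch; the learning rate 0.01 and the step count are arbitrary choices:

```python
# ℓ(θ) = 7·log(θ) + 3·log(1-θ);  dℓ/dθ = 7/θ - 3/(1-θ)
theta = 0.5   # start from "fair coin"
lr = 0.01
for _ in range(2000):
    grad = 7 / theta - 3 / (1 - theta)  # gradient of the log-likelihood
    theta += lr * grad                  # ascend: make the data more probable

print(round(theta, 6))   # 0.7, the MLE
```

This is the same loop as the training code above with the model shrunk to one parameter: compute the gradient of the log-likelihood, step toward higher likelihood, repeat.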
Real World
Detective picks the suspect that best explains all evidence
In AI
Training picks the weights that best explain all training data
Bayesian Inference — Adding Prior Beliefs
What if you have prior knowledge before seeing the data?
The Analogy
MLE is a detective with no prior knowledge — they only look at evidence. Bayesian inference is a detective who also considers prior beliefs: “Most coins are fair, so even if I see 7/10 heads, I shouldn’t conclude the coin is heavily biased.” The prior pulls the estimate toward “reasonable” values. With more data, the prior matters less and the evidence dominates.
Key insight: Bayesian inference naturally prevents overfitting. The prior says “don’t trust extreme parameter values unless you have overwhelming evidence.” With little data, the prior dominates (safe defaults). With lots of data, the likelihood dominates (data speaks for itself).
Worked Example
# Bayesian: P(θ | data) ∝ P(data | θ) × P(θ)
#   posterior ∝ likelihood × prior

# Coin: 7H, 3T
# MLE: θ = 0.7 (just the data)

# Bayesian with a Beta(10, 10) prior
# (prior belief: the coin is roughly fair)
# Posterior: Beta(10+7, 10+3) = Beta(17, 13)
# MAP estimate: (17-1)/(17+13-2) ≈ 0.571
# Pulled toward 0.5 by the prior!

# With 700H, 300T (more data):
# Posterior: Beta(710, 310)
# MAP: 709/1018 ≈ 0.696 (≈ the MLE of 0.7)
# The prior barely matters with lots of data
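These posterior numbers can be reproduced directly. The sketch below uses the standard mode formula (a-1)/(a+b-2) for a Beta(a, b) distribution, which is valid when a, b > 1:

```python
def map_estimate(a, b):
    # Mode of a Beta(a, b) distribution, valid for a, b > 1
    return (a - 1) / (a + b - 2)

prior_a, prior_b = 10, 10   # Beta(10, 10): "the coin is roughly fair"

# Small data: 7 heads, 3 tails
print(round(map_estimate(prior_a + 7, prior_b + 3), 3))      # 0.571

# Lots of data: 700 heads, 300 tails
print(round(map_estimate(prior_a + 700, prior_b + 300), 3))  # 0.696
```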
MAP Estimation — The Best of Both Worlds
Maximum A Posteriori: likelihood + prior
The Analogy
MAP (Maximum A Posteriori) finds the single most probable parameter value given both the data AND the prior: θ* = argmax P(θ | data). It’s like MLE but with a “reasonableness check.” MLE says “whatever the data says.” MAP says “whatever the data says, tempered by what’s reasonable.”
Key insight: MAP with a Gaussian prior on weights IS L2 regularization (weight decay). The prior P(θ) = N(0, σ²) penalizes large weights, which is exactly what weight_decay does in AdamW. Regularization is secretly Bayesian inference!
The Connection
# MAP = argmax [log P(data | θ) + log P(θ)]
#     = argmax [log-likelihood + log-prior]

# Gaussian prior: P(θ) = N(0, σ²)
#   log P(θ) = -θ²/(2σ²) + const
# MAP = argmax [ℓ(θ) - λ‖θ‖²], where λ = 1/(2σ²)
# This IS L2 regularization!

# Laplace prior: P(θ) ∝ exp(-|θ|/b)
#   → L1 regularization (sparsity)

# PyTorch weight decay = L2 = Gaussian prior
optimizer = torch.optim.AdamW(
    params, lr=1e-3, weight_decay=0.01
)
MLE
Pure data: θ* = argmax P(data|θ)
MAP
Data + prior: θ* = argmax P(data|θ)P(θ) = regularized MLE
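The "MAP = regularized MLE" identity can be made concrete with one-parameter linear regression, where both estimates have closed forms. A standard-library sketch with made-up data; λ = 2 is an arbitrary prior strength:

```python
# Tiny 1-D regression: y ≈ w·x, only a few noisy points
x = [0.5, 1.0, 1.5, 2.0]
y = [1.2, 1.9, 3.4, 3.9]

sxy = sum(xi * yi for xi, yi in zip(x, y))   # Σ x·y
sxx = sum(xi * xi for xi in x)               # Σ x²

w_mle = sxy / sxx          # ordinary least squares: pure likelihood
lam = 2.0                  # strength of the Gaussian prior (= weight decay)
w_map = sxy / (sxx + lam)  # ridge regression: likelihood + prior

print(round(w_mle, 3))  # 2.053
print(round(w_map, 3))  # 1.621, shrunk toward 0 by the prior
```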
Regularization as Bayesian Prior
Weight decay, dropout, and data augmentation through a Bayesian lens
The Analogy
Regularization is like a skeptical advisor who says “don’t trust extreme conclusions from limited data.” L2 regularization says “weights should be small” (Gaussian prior). L1 says “most weights should be zero” (Laplace prior). Dropout says “don’t rely on any single neuron” (approximate Bayesian model averaging).
Key insight: Every regularization technique has a Bayesian interpretation. L2 = Gaussian prior. L1 = Laplace prior. Dropout ≈ approximate Bayesian inference over an ensemble of sub-networks. Early stopping = implicit regularization (limiting the effective complexity of the model).
Regularization Zoo
# L2 (weight decay) = Gaussian prior
#   Loss = NLL + λ‖w‖²
#   Shrinks all weights toward zero

# L1 (Lasso) = Laplace prior
#   Loss = NLL + λ‖w‖₁
#   Pushes many weights to exactly zero

# Dropout = approximate Bayesian ensemble
layer = nn.Dropout(p=0.1)
# Randomly zeros 10% of activations
# ≈ averaging over 2^n sub-networks

# Data augmentation = expanding the prior
# "I believe rotated cats are still cats"
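The different behavior of the two priors shows up already for a single weight, where the regularized minimizers have closed forms. A sketch; v stands for a hypothetical unregularized weight value:

```python
from math import copysign

# For a single weight w and quadratic loss (w - v)²/2:
#   L2: argmin (w-v)²/2 + λw²   = v / (1 + 2λ)          (shrinks, never zero)
#   L1: argmin (w-v)²/2 + λ|w|  = sign(v)·max(|v|-λ, 0)  (hits exactly zero)

def l2_shrink(v, lam):
    return v / (1 + 2 * lam)

def l1_shrink(v, lam):
    return copysign(max(abs(v) - lam, 0.0), v)

print(l2_shrink(0.3, 0.5))   # 0.15, smaller but nonzero
print(l1_shrink(0.3, 0.5))   # 0.0, exactly zero (sparsity)
```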
Practical rule: More data = less regularization needed. With infinite data, MLE and MAP converge. With small data, strong priors (heavy regularization) prevent overfitting.
The Big Picture — Learning = Inference
MLE, MAP, and full Bayesian — a spectrum of approaches
The Spectrum
MLE: find the single best θ. Simple and scalable, but can overfit.
MAP: find the single best θ with a prior. Adds regularization.
Full Bayesian: don’t pick one θ — maintain a distribution over all possible θ values. Most principled, but computationally expensive.
Modern AI mostly uses MLE/MAP because full Bayesian inference doesn’t scale to billions of parameters.
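For the coin example, the three approaches give three different answers to "what is P(heads)?". The full-Bayesian answer here is the posterior mean, a/(a+b) for a Beta(a, b) posterior, which averages the prediction over every candidate θ instead of committing to one. A quick sketch using the chapter's numbers:

```python
# Posterior after 7H, 3T with a Beta(10, 10) prior: Beta(17, 13)
mle = 7 / 10                        # MLE: data only
map_est = (17 - 1) / (17 + 13 - 2)  # MAP: posterior mode (data + prior)
bayes = 17 / (17 + 13)              # full Bayesian: posterior mean over all θ

print(mle)                 # 0.7
print(round(map_est, 3))   # 0.571
print(round(bayes, 3))     # 0.567
```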
Why it matters for AI: Understanding MLE explains why cross-entropy loss works, why weight decay helps, and why more data beats more parameters. Bayesian thinking explains uncertainty quantification, which is critical for AI safety — knowing when the model doesn’t know.
Summary
# MLE: θ* = argmax P(data | θ)
#   → cross-entropy loss, MSE loss
#   → no regularization

# MAP: θ* = argmax P(data | θ) P(θ)
#   → loss + weight decay
#   → Gaussian prior = L2 regularization

# Full Bayesian: P(θ | data) = full distribution
#   → uncertainty estimates
#   → computationally expensive
#   → approximations: MC dropout, ensembles

# Key insight: training = statistical inference
#   loss function = negative log-likelihood
#   regularization = prior belief
Real World
Detective with no bias (MLE) vs. detective with experience (MAP/Bayesian)
In AI
Cross-entropy (MLE) + weight decay (MAP) = standard training recipe