Ch 9 — Maximum Likelihood & Bayesian Inference

The detective finding the most likely suspect — how models learn from data
Roadmap: Probability → Likelihood → Log-Likelihood → MLE = Training → Prior → Posterior → Regularization
The Detective Analogy
Finding the most likely explanation for the evidence
The Analogy
A detective arrives at a crime scene. There are fingerprints, footprints, and a broken window. The detective asks: “Which suspect most likely produced this evidence?” That’s Maximum Likelihood Estimation (MLE). You have data (evidence) and you want to find the parameters (suspect) that make the data most probable.
Key insight: Training a neural network IS maximum likelihood estimation. When you minimize cross-entropy loss, you’re literally maximizing the likelihood of the training data under the model. Every time you call loss.backward(), you’re doing MLE.
The Core Question
# MLE asks: which parameters θ make
# the observed data most probable?
#
#   θ* = argmax_θ P(data | θ)
#
# Example: coin flip
# Data: 7 heads, 3 tails
# θ = probability of heads
# Which θ makes "7H, 3T" most likely?
# Answer: θ = 7/10 = 0.7
#
# Neural network version:
#   θ    = all weights and biases
#   data = training examples
#   θ*   = weights that maximize P(data | θ)
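The coin-flip answer above can be checked numerically with a brute-force grid search. This is a standard-library sketch; note the binomial coefficient only scales the whole curve, so it doesn't move the maximum:

```python
from math import comb

# P(7 heads, 3 tails | theta) under a binomial model
def likelihood(theta, heads=7, tails=3):
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

# Grid search: which candidate bias explains the data best?
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=likelihood)

print(best)                                # 0.7
print(likelihood(0.7) > likelihood(0.5))   # True
```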
Real World
Detective: which suspect best explains the evidence?
In AI
Training: which weights best explain the training data?
Likelihood Function
How probable is the data under different parameter values?
The Analogy
The likelihood function L(θ) answers: “If the coin had bias θ, how probable is seeing 7 heads in 10 flips?” You evaluate this for every possible θ and pick the one that gives the highest probability. It’s the same formula as probability, but viewed from the opposite direction — the data is fixed, the parameter varies.
Key insight: Likelihood is NOT the probability of the parameters. It’s the probability of the data given the parameters. P(data | θ) treats θ as the variable. This subtle distinction is the foundation of all statistical learning.
Worked Example
# Coin: 7 heads, 3 tails
# L(θ) = θ⁷ × (1-θ)³
import numpy as np

thetas = np.linspace(0, 1, 100)
L = thetas**7 * (1 - thetas)**3

thetas[np.argmax(L)]   # ≈ 0.70: maximum at θ = 0.7

# For i.i.d. data, likelihood is a product:
#   L(θ) = Π P(xᵢ | θ)
# Products of tiny numbers → underflow!
# Solution: use log-likelihood instead
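The underflow problem is easy to trigger deliberately. This standard-library sketch multiplies 10,000 made-up per-example probabilities: the raw product collapses to 0.0, while the sum of logs stays a perfectly ordinary number:

```python
import random
from math import log

random.seed(0)
probs = [random.uniform(0.1, 0.9) for _ in range(10_000)]  # fake per-example probabilities

# Naive product underflows to exactly 0.0 in float64
naive = 1.0
for p in probs:
    naive *= p
print(naive)     # 0.0

# Sum of logs stays finite and easy to work with
log_lik = sum(log(p) for p in probs)
print(log_lik)   # a large negative but finite number
```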
Log-Likelihood — Products Become Sums
The trick that makes MLE computationally practical
The Analogy
Multiplying 10,000 small probabilities gives a number so tiny your computer can’t represent it (underflow). The log trick converts products into sums: log(a × b) = log(a) + log(b). Sums of reasonable numbers are easy to compute. Since log is monotonic, maximizing log-likelihood gives the same answer as maximizing likelihood.
Key insight: The cross-entropy loss you use in PyTorch IS the negative log-likelihood. Minimizing cross-entropy = maximizing log-likelihood = doing MLE. The “loss function” and “MLE” are the same thing wearing different hats.
Worked Example
# Log-likelihood: ℓ(θ) = Σ log P(xᵢ | θ)
# Coin: 7H, 3T
#   ℓ(θ) = 7·log(θ) + 3·log(1-θ)
# Maximize: dℓ/dθ = 7/θ - 3/(1-θ) = 0
#   → 7(1-θ) = 3θ → 7 = 10θ → θ = 0.7 ✓

# Cross-entropy loss = negative log-likelihood
loss_fn = torch.nn.CrossEntropyLoss()
# loss_fn(logits, targets) = -Σ yᵢ log(ŷᵢ)
# Minimizing this = maximizing log P(data | θ)

# MSE loss = MLE under a Gaussian noise assumption:
# if errors ~ N(0, σ²), MLE → minimize Σ(y - ŷ)²
Key connection: Cross-entropy loss = MLE for categorical data. MSE loss = MLE for Gaussian data. Binary cross-entropy = MLE for Bernoulli data. Every standard loss function IS a log-likelihood.
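The MSE-to-Gaussian connection can be seen numerically. Under the model y_i = θ + noise with Gaussian noise, the log-likelihood is -Σ(y_i - θ)²/(2σ²) + const, so maximizing it is exactly minimizing squared error, and the minimizer is the sample mean. A standard-library sketch with made-up observations:

```python
# Model: y_i = θ + noise, noise ~ N(0, σ²)
# Maximizing the Gaussian log-likelihood = minimizing squared error
y = [2.0, 3.5, 4.0, 1.5, 3.0]   # hypothetical observations

def sse(theta):
    return sum((yi - theta) ** 2 for yi in y)

grid = [i / 1000 for i in range(0, 6001)]
theta_hat = min(grid, key=sse)

print(theta_hat)         # 2.8
print(sum(y) / len(y))   # 2.8, the sample mean = Gaussian MLE
```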
Training = Maximum Likelihood
Every gradient step maximizes the likelihood of your data
The Connection
When you train a neural network with cross-entropy loss, you’re finding the weights θ that maximize P(training_data | θ). The gradient ∇L points toward higher likelihood. Each optimizer step moves weights to make the training data more probable under the model. This is MLE, implemented via gradient descent.
Key insight: An LLM trained on internet text is doing MLE: finding weights that maximize the probability of all the text it saw. P(“The capital of France is Paris”) is high because that sentence, and ones like it, appeared many times in the training data. The model learned to assign high probability to text that resembles its training data, which is why it states common facts confidently and can also repeat common misconceptions.
In Practice
# LLM training = MLE on text
for batch in text_data:
    # model outputs P(next_token | context; θ)
    logits = model(batch.context)
    # negative log-likelihood of the true next token
    loss = F.cross_entropy(logits, batch.target)
    # = -log P(target | context; θ)
    loss.backward()
    optimizer.step()

# After training:
# P("Paris" | "capital of France is"; θ*) ≈ 0.85
# θ* maximizes the likelihood of the training text
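The "each gradient step raises the likelihood" claim can be demonstrated without a neural network: plain gradient ascent on the coin's log-likelihood converges to the MLE. A minimal sketch; the learning rate 0.01 and the step count are arbitrary choices:

```python
# ℓ(θ) = 7·log(θ) + 3·log(1-θ);  dℓ/dθ = 7/θ - 3/(1-θ)
theta = 0.5   # start from "fair coin"
lr = 0.01
for _ in range(2000):
    grad = 7 / theta - 3 / (1 - theta)  # gradient of the log-likelihood
    theta += lr * grad                  # ascend: make the data more probable

print(round(theta, 6))   # 0.7, the MLE
```

This is the same loop as the training code above with the model shrunk to one parameter: compute the gradient of the log-likelihood, step toward higher likelihood, repeat.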
Real World
Detective picks the suspect that best explains all evidence
In AI
Training picks the weights that best explain all training data
Bayesian Inference — Adding Prior Beliefs
What if you have prior knowledge before seeing the data?
The Analogy
MLE is a detective with no prior knowledge — they only look at evidence. Bayesian inference is a detective who also considers prior beliefs: “Most coins are fair, so even if I see 7/10 heads, I shouldn’t conclude the coin is heavily biased.” The prior pulls the estimate toward “reasonable” values. With more data, the prior matters less and the evidence dominates.
Key insight: Bayesian inference naturally prevents overfitting. The prior says “don’t trust extreme parameter values unless you have overwhelming evidence.” With little data, the prior dominates (safe defaults). With lots of data, the likelihood dominates (data speaks for itself).
Worked Example
# Bayesian: P(θ | data) ∝ P(data | θ) × P(θ)
#   posterior ∝ likelihood × prior

# Coin: 7H, 3T
# MLE: θ = 0.7 (just the data)

# Bayesian with a Beta(10, 10) prior
# (prior belief: the coin is roughly fair)
# Posterior: Beta(10+7, 10+3) = Beta(17, 13)
# MAP estimate: (17-1)/(17+13-2) ≈ 0.571
# Pulled toward 0.5 by the prior!

# With 700H, 300T (more data):
# Posterior: Beta(710, 310)
# MAP: 709/1018 ≈ 0.696 (≈ the MLE of 0.7)
# The prior barely matters with lots of data
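These posterior numbers can be reproduced directly. The sketch below uses the standard mode formula (a-1)/(a+b-2) for a Beta(a, b) distribution, which is valid when a, b > 1:

```python
def map_estimate(a, b):
    # Mode of a Beta(a, b) distribution, valid for a, b > 1
    return (a - 1) / (a + b - 2)

prior_a, prior_b = 10, 10   # Beta(10, 10): "the coin is roughly fair"

# Small data: 7 heads, 3 tails
print(round(map_estimate(prior_a + 7, prior_b + 3), 3))      # 0.571

# Lots of data: 700 heads, 300 tails
print(round(map_estimate(prior_a + 700, prior_b + 300), 3))  # 0.696
```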
MAP Estimation — The Best of Both Worlds
Maximum A Posteriori: likelihood + prior
The Analogy
MAP (Maximum A Posteriori) finds the single most probable parameter value given both the data AND the prior: θ* = argmax P(θ | data). It’s like MLE but with a “reasonableness check.” MLE says “whatever the data says.” MAP says “whatever the data says, tempered by what’s reasonable.”
Key insight: MAP with a Gaussian prior on weights IS L2 regularization (weight decay). The prior P(θ) = N(0, σ²) penalizes large weights, which is exactly what weight_decay does in AdamW. Regularization is secretly Bayesian inference!
The Connection
# MAP = argmax [log P(data | θ) + log P(θ)]
#     = argmax [log-likelihood + log-prior]

# Gaussian prior: P(θ) = N(0, σ²)
#   log P(θ) = -θ²/(2σ²) + const
# MAP = argmax [ℓ(θ) - λ‖θ‖²], where λ = 1/(2σ²)
# This IS L2 regularization!

# Laplace prior: P(θ) ∝ exp(-|θ|/b)
#   → L1 regularization (sparsity)

# PyTorch weight decay = L2 = Gaussian prior
optimizer = torch.optim.AdamW(
    params, lr=1e-3, weight_decay=0.01
)
MLE
Pure data: θ* = argmax P(data|θ)
MAP
Data + prior: θ* = argmax P(data|θ)P(θ) = regularized MLE
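The "MAP = regularized MLE" identity can be made concrete with one-parameter linear regression, where both estimates have closed forms. A standard-library sketch with made-up data; λ = 2 is an arbitrary prior strength:

```python
# Tiny 1-D regression: y ≈ w·x, only a few noisy points
x = [0.5, 1.0, 1.5, 2.0]
y = [1.2, 1.9, 3.4, 3.9]

sxy = sum(xi * yi for xi, yi in zip(x, y))   # Σ x·y
sxx = sum(xi * xi for xi in x)               # Σ x²

w_mle = sxy / sxx          # ordinary least squares: pure likelihood
lam = 2.0                  # strength of the Gaussian prior (= weight decay)
w_map = sxy / (sxx + lam)  # ridge regression: likelihood + prior

print(round(w_mle, 3))  # 2.053
print(round(w_map, 3))  # 1.621, shrunk toward 0 by the prior
```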
Regularization as Bayesian Prior
Weight decay, dropout, and data augmentation through a Bayesian lens
The Analogy
Regularization is like a skeptical advisor who says “don’t trust extreme conclusions from limited data.” L2 regularization says “weights should be small” (Gaussian prior). L1 says “most weights should be zero” (Laplace prior). Dropout says “don’t rely on any single neuron” (approximate Bayesian model averaging).
Key insight: Every regularization technique has a Bayesian interpretation. L2 = Gaussian prior. L1 = Laplace prior. Dropout ≈ approximate Bayesian inference over an ensemble of sub-networks. Early stopping = implicit regularization (limiting the effective complexity of the model).
Regularization Zoo
# L2 (weight decay) = Gaussian prior
#   Loss = NLL + λ‖w‖²
#   Shrinks all weights toward zero

# L1 (Lasso) = Laplace prior
#   Loss = NLL + λ‖w‖₁
#   Pushes many weights to exactly zero

# Dropout = approximate Bayesian ensemble
layer = nn.Dropout(p=0.1)
# Randomly zeros 10% of activations
# ≈ averaging over 2^n sub-networks

# Data augmentation = expanding the prior
# "I believe rotated cats are still cats"
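The different behavior of the two priors shows up already for a single weight, where the regularized minimizers have closed forms. A sketch; v stands for a hypothetical unregularized weight value:

```python
from math import copysign

# For a single weight w and quadratic loss (w - v)²/2:
#   L2: argmin (w-v)²/2 + λw²   = v / (1 + 2λ)          (shrinks, never zero)
#   L1: argmin (w-v)²/2 + λ|w|  = sign(v)·max(|v|-λ, 0)  (hits exactly zero)

def l2_shrink(v, lam):
    return v / (1 + 2 * lam)

def l1_shrink(v, lam):
    return copysign(max(abs(v) - lam, 0.0), v)

print(l2_shrink(0.3, 0.5))   # 0.15, smaller but nonzero
print(l1_shrink(0.3, 0.5))   # 0.0, exactly zero (sparsity)
```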
Practical rule: More data = less regularization needed. With infinite data, MLE and MAP converge. With small data, strong priors (heavy regularization) prevent overfitting.
The Big Picture — Learning = Inference
MLE, MAP, and full Bayesian — a spectrum of approaches
The Spectrum
MLE: find the single best θ. Simple and scalable, but can overfit.
MAP: find the single best θ with a prior. Adds regularization.
Full Bayesian: don’t pick one θ — maintain a distribution over all possible θ values. Most principled, but computationally expensive.
Modern AI mostly uses MLE/MAP because full Bayesian inference doesn’t scale to billions of parameters.
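For the coin example, the three approaches give three different answers to "what is P(heads)?". The full-Bayesian answer here is the posterior mean, a/(a+b) for a Beta(a, b) posterior, which averages the prediction over every candidate θ instead of committing to one. A quick sketch using the chapter's numbers:

```python
# Posterior after 7H, 3T with a Beta(10, 10) prior: Beta(17, 13)
mle = 7 / 10                        # MLE: data only
map_est = (17 - 1) / (17 + 13 - 2)  # MAP: posterior mode (data + prior)
bayes = 17 / (17 + 13)              # full Bayesian: posterior mean over all θ

print(mle)                 # 0.7
print(round(map_est, 3))   # 0.571
print(round(bayes, 3))     # 0.567
```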
Why it matters for AI: Understanding MLE explains why cross-entropy loss works, why weight decay helps, and why more data beats more parameters. Bayesian thinking explains uncertainty quantification, which is critical for AI safety — knowing when the model doesn’t know.
Summary
# MLE: θ* = argmax P(data | θ)
#   → cross-entropy loss, MSE loss
#   → no regularization

# MAP: θ* = argmax P(data | θ) P(θ)
#   → loss + weight decay
#   → Gaussian prior = L2 regularization

# Full Bayesian: P(θ | data) = full distribution
#   → uncertainty estimates
#   → computationally expensive
#   → approximations: MC dropout, ensembles

# Key insight: training = statistical inference
#   loss function = negative log-likelihood
#   regularization = prior belief
Real World
Detective with no bias (MLE) vs. detective with experience (MAP/Bayesian)
In AI
Cross-entropy (MLE) + weight decay (MAP) = standard training recipe