Ch 10 — Hypothesis Testing & Statistical Learning

The courtroom trial — innocent until proven guilty, and the bias-variance tradeoff
The Courtroom Trial
Innocent until proven guilty — the logic of hypothesis testing
The Analogy
A courtroom starts with the assumption of innocence (null hypothesis H₀). The prosecution presents evidence. The jury asks: “Is this evidence so overwhelming that we can reject innocence?” If yes → guilty (reject H₀). If not → not guilty (fail to reject H₀). Note: “not guilty” doesn’t mean innocent — it means insufficient evidence.
Key insight: In ML, hypothesis testing is how you answer: “Is Model A actually better than Model B, or did it just get lucky on this test set?” Without statistical testing, you might deploy a model that’s not actually better — just luckier on your particular evaluation data.
The Framework
# H₀ (null): no effect / no difference
# H₁ (alternative): there IS a difference
#
# Example: "Is my new model better?"
# H₀: new model = old model (no improvement)
# H₁: new model > old model (improvement)
#
# Procedure:
# 1. Assume H₀ is true
# 2. Compute test statistic from data
# 3. Ask: how likely is this statistic if H₀ is true?
# 4. If very unlikely (p < 0.05) → reject H₀
Real World
Jury: “Is the evidence strong enough to convict?”
In AI
“Is Model A’s improvement over Model B statistically significant?”
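As a concrete illustration of the four-step procedure, scipy's exact binomial test (`scipy.stats.binomtest`) can check whether a coin that landed heads 58 times in 100 flips is plausibly fair. The numbers here are invented for the sketch:

```python
from scipy import stats

# Did a coin that landed heads 58 times in 100 flips deviate from fair?
n_flips, n_heads = 100, 58

# 1. Assume H0: the coin is fair, P(heads) = 0.5
# 2-3. Compute how likely a result this extreme is under H0
result = stats.binomtest(n_heads, n_flips, p=0.5)

# 4. Compare against the significance threshold
alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.3f} < {alpha}: reject H0")
else:
    print(f"p = {result.pvalue:.3f} >= {alpha}: fail to reject H0")
```

Here the test fails to reject H₀: like a "not guilty" verdict, 58 heads is not proof the coin is fair, only insufficient evidence that it is biased.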
Type I & Type II Errors
Convicting the innocent vs. freeing the guilty
The Analogy
Type I error (false positive): convicting an innocent person. You rejected H₀ when it was actually true. Type II error (false negative): letting a guilty person go free. You failed to reject H₀ when it was actually false. There’s a tradeoff: making it harder to convict (fewer Type I) means more guilty people go free (more Type II).
Key insight: In AI, precision vs. recall IS the Type I/II tradeoff. High precision (few false positives) = strict jury. High recall (few false negatives) = lenient jury. A spam filter with high precision rarely flags good email as spam, but might miss some spam. The F1 score balances both.
Worked Example
# Type I (α): reject H₀ when H₀ is true
#   = false positive rate = 1 - specificity
#   Typically α = 0.05 (5% chance of mistake)
# Type II (β): fail to reject H₀ when H₁ is true
#   = false negative rate = 1 - recall
# Power = 1 - β (ability to detect real effects)
#
# Confusion matrix:
#              Predicted + | Predicted -
# Actually +   TP          | FN (Type II)
# Actually -   FP (Type I) | TN
#
# Precision = TP / (TP + FP)
# Recall    = TP / (TP + FN)
# F1 = 2 × (P × R) / (P + R)
Type I (False +)
Convict innocent / Flag good email as spam
Type II (False −)
Free guilty / Miss actual spam email
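The confusion-matrix formulas above take only a few lines of Python to verify. The spam-filter counts here are hypothetical, chosen to mimic a strict (high-precision) filter:

```python
# Hypothetical spam-filter counts:
# spam caught, good mail flagged, spam missed, good mail passed
tp, fp, fn, tn = 90, 5, 20, 885

precision = tp / (tp + fp)  # how often a "spam" flag is right
recall = tp / (tp + fn)     # how much real spam gets caught
f1 = 2 * precision * recall / (precision + recall)

false_positive_rate = fp / (fp + tn)  # Type I analogue: good mail flagged
false_negative_rate = fn / (fn + tp)  # Type II analogue: spam missed

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note how the strict jury shows up in the numbers: precision is high (few innocent emails convicted) while recall is lower (some guilty spam goes free).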
p-values & Significance
How surprising is the evidence?
The Analogy
The p-value answers: “If H₀ were true, how likely is it to see evidence this extreme or more extreme?” If p = 0.03, there’s only a 3% chance of seeing evidence this extreme if H₀ were true. If p < 0.05 (conventional threshold), we say the result is statistically significant — unlikely to be due to chance alone.
Key insight: A p-value of 0.03 does NOT mean “3% chance H₀ is true.” It means “3% chance of seeing this data if H₀ were true.” This subtle distinction trips up even experienced researchers. Also, statistical significance ≠ practical significance — a tiny improvement can be “significant” with enough data.
Worked Example
from scipy import stats

# Compare two models' accuracy on test set
model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.83, 0.81, 0.84, 0.82]

# Paired t-test: is the difference real?
t_stat, p_value = stats.ttest_rel(
    model_a_scores, model_b_scores
)

# p_value ≈ 0.006 < 0.05
# → Statistically significant!
# Model A is genuinely better (not luck)
Rule of thumb: p < 0.05 = significant (reject H₀). p < 0.01 = highly significant. p < 0.001 = very highly significant. But always consider effect size too — a 0.1% improvement might be “significant” but not worth deploying.
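One way to see what a p-value actually measures is a permutation test: if H₀ ("no difference") were true, the group labels would be arbitrary, so shuffling them should often reproduce the observed gap. A minimal sketch with made-up score lists:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two small (made-up) groups of scores
a = np.array([0.85, 0.87, 0.84, 0.86, 0.88])
b = np.array([0.82, 0.83, 0.81, 0.84, 0.82])
observed = a.mean() - b.mean()

# Shuffle the pooled values and count how often a random split
# produces a gap at least as large as the observed one
pooled = np.concatenate([a, b])
count = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    # small tolerance so exact ties still count as "as extreme"
    if pooled[:5].mean() - pooled[5:].mean() >= observed - 1e-12:
        count += 1

p_value = count / n_perm
print(f"p ≈ {p_value:.4f}")
```

The estimated p-value is simply the fraction of shuffles at least as extreme as the data: a direct reading of "how likely is this if H₀ were true?"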
A/B Testing in AI
The gold standard for comparing models in production
The Analogy
A/B testing is like a clinical trial for AI. Group A gets the old model, Group B gets the new model. You measure outcomes (clicks, revenue, satisfaction) and use hypothesis testing to determine if the new model is genuinely better. Random assignment ensures the groups are comparable, so any difference is due to the model, not the users.
Key insight: Every major tech company (Google, Netflix, Meta) runs thousands of A/B tests simultaneously. When Google changes its search ranking algorithm, it A/B tests on millions of users before rolling out. Statistical rigor prevents shipping “improvements” that are actually noise.
In Practice
from scipy import stats

# A/B test: new recommendation model
group_a = old_model_clicks  # e.g. [2.1, 2.3, ...]
group_b = new_model_clicks  # e.g. [2.4, 2.5, ...]

# H₀: mean(A) = mean(B) (no difference)
# H₁: mean(B) > mean(A) (new is better)
t, p = stats.ttest_ind(group_b, group_a,
                       alternative='greater')

if p < 0.05:
    print("Ship it! New model is better.")
else:
    print("Not enough evidence. Keep old model.")
Real World
Drug trial: does the new drug work better than placebo?
In AI
A/B test: does the new model get more clicks than the old one?
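Before launching an A/B test, teams usually estimate how many users each group needs to detect the effect they care about. A rough sketch using the standard normal-approximation sample-size formula for a one-sided two-sample comparison; every number here is an assumption for illustration:

```python
from scipy.stats import norm

alpha = 0.05   # Type I rate we tolerate
power = 0.80   # 1 - beta: chance of detecting a real effect
sigma = 1.0    # assumed std dev of clicks per user
delta = 0.1    # smallest lift worth detecting (clicks per user)

# n per group ≈ 2 * ((z_alpha + z_beta) * sigma / delta)^2
z_alpha = norm.ppf(1 - alpha)  # one-sided critical value
z_beta = norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

print(f"~{n_per_group:.0f} users per group")
```

The formula makes the tradeoffs explicit: halving the detectable lift `delta` quadruples the required sample size, which is why tiny improvements need enormous experiments.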
The Bias-Variance Tradeoff
Memorizing answers vs. learning principles
The Analogy
Imagine studying for an exam. High bias = you only learned the chapter summaries (too simple, misses nuances). High variance = you memorized every practice problem word-for-word (can’t handle new questions). The ideal student learns the principles — enough detail to be accurate, but general enough to handle new problems.
Key insight: Error = Bias² + Variance + Irreducible Noise. You can’t reduce all three. A simple model (linear regression) has high bias but low variance. A complex model (deep network) has low bias but high variance. The sweet spot minimizes total error.
Worked Example
# Bias: error from wrong assumptions
#   Fitting a line to curved data → high bias
# Variance: sensitivity to training data
#   Fitting a degree-100 polynomial → high variance
#
# Total error = Bias² + Variance + Noise
#
# Underfitting (high bias):
#   Train acc: 70%, Test acc: 68%
#   → Model too simple
# Overfitting (high variance):
#   Train acc: 99%, Test acc: 75%
#   → Model memorized training data
# Good fit:
#   Train acc: 92%, Test acc: 90%
#   → Model learned generalizable patterns
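The tradeoff can be reproduced with polynomial fits on noisy data. This sketch uses toy samples from a sine curve and arbitrary degrees; watch the training error collapse as model complexity grows, the signature of a model starting to chase noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples from a smooth curve; the model family is polynomials
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

def train_error(degree):
    # Least-squares polynomial fit, evaluated on its own training data
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# degree 1: a straight line can't follow the curve (high bias)
# degree 9: the fit chases the noise (high variance)
for d in (1, 3, 9):
    print(f"degree {d}: train MSE = {train_error(d):.4f}")
```

A near-zero training error on noisy data is not a success: the degree-9 fit is absorbing the noise term, which is exactly the variance half of the tradeoff.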
Overfitting & Underfitting
The most common failure modes in machine learning
The Analogy
Overfitting is like a student who memorized every answer in the textbook but can’t solve a new problem. The model fits the training data perfectly (including noise) but fails on new data. Underfitting is a student who barely studied — bad on both training and test data. The goal is to learn the signal, not the noise.
Key insight: Modern deep networks are so large they can memorize random labels (Zhang et al., 2017). A ResNet-50 can achieve 100% training accuracy on random noise. Yet with real data and proper regularization, these same models generalize beautifully. Understanding why is one of the deepest open questions in ML theory.
Detection & Fixes
# Detecting overfitting:
#   train_loss ↓↓↓ but val_loss ↑↑↑
#   Gap between train and val accuracy grows
#
# Fixes for overfitting:
#   1. More data (always helps)
#   2. Regularization (L2, dropout)
#   3. Data augmentation
#   4. Early stopping
#   5. Simpler model
#
# Fixes for underfitting:
#   1. Bigger model (more parameters)
#   2. Train longer
#   3. Better features
#   4. Less regularization
#   5. Tune the learning rate
Overfit
Train: 99%, Test: 75% — memorized noise
Good Fit
Train: 92%, Test: 90% — learned signal
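Early stopping, one of the fixes listed above, is simple enough to sketch in full: track the best validation loss and stop once it has failed to improve for a few epochs. The validation-loss curve here is invented:

```python
# Made-up validation losses, one per epoch: improve, then start rising
val_losses = [1.00, 0.80, 0.65, 0.58, 0.55, 0.56, 0.57, 0.59, 0.62]

patience = 3  # stop after 3 consecutive epochs with no improvement
best_loss = float("inf")
best_epoch = 0
bad_epochs = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, bad_epochs = loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print(f"stop at epoch {epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop the same logic wraps the optimizer step, and "restore weights" means reloading the checkpoint saved at the best epoch rather than keeping the overfit final weights.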
Cross-Validation — Honest Evaluation
Don’t grade yourself on the homework you practiced
The Analogy
Grading a student on the exact problems they practiced is meaningless. Cross-validation is like giving the student a new exam they haven’t seen. k-fold CV splits data into k parts: train on k−1, test on the held-out fold, rotate. This gives an honest estimate of how the model will perform on truly new data.
Key insight: The train/validation/test split is the most important practice in ML. Train = homework. Validation = practice exam (tune hyperparameters). Test = final exam (report results). NEVER tune on the test set — that’s cheating, and your reported accuracy will be overly optimistic.
In Practice
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
# e.g. [0.88, 0.91, 0.87, 0.90, 0.89]
# Mean: 0.89 ± 0.015

# Proper split:
#   70% train / 15% validation / 15% test
#   Train: fit model
#   Val:   tune hyperparameters
#   Test:  final evaluation (touch ONCE)
Real World
Practice exam ≠ final exam. Grade on new problems.
In AI
Train ≠ test. Cross-validate for honest performance estimates.
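The "train on k−1 folds, rotate the held-out fold" procedure can be made concrete with plain index arithmetic. This sketch builds the splits by hand for a toy dataset of 10 samples (sklearn's `KFold` does the same bookkeeping for you):

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, k)  # k roughly equal index chunks

for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th, validate on the i-th
    train_idx = np.concatenate(
        [f for j, f in enumerate(folds) if j != i]
    )
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Every sample lands in the validation set exactly once, so averaging the k fold scores uses all the data for evaluation without ever grading the model on indices it trained on.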
Statistical Learning Theory
Why generalization works — PAC learning and VC dimension
The Big Picture
Statistical learning theory asks: “Why does a model trained on a finite sample generalize to unseen data?” The answer involves the VC dimension (how complex the model is) and PAC learning (probably approximately correct). The generalization bound says: test error ≤ train error + complexity penalty. More data or simpler models = tighter bound.
Why it matters for AI: This theory explains the “unreasonable effectiveness” of deep learning. Classical bounds suggest huge networks should overfit catastrophically, but they don’t. Understanding why — implicit regularization, flat minima, the lottery ticket hypothesis — is one of the biggest open questions in ML theory.
Key Concepts
# Generalization bound (simplified):
#   Test error ≤ Train error + √(d/n)
#   d = model complexity (VC dimension)
#   n = number of training samples
#   More data (↑n) → tighter bound
#   Simpler model (↓d) → tighter bound
#
# VC dimension examples:
#   Linear classifier in 2D: VC = 3
#   (can shatter 3 points, not 4)
#   Neural network: VC ≈ O(params × layers)
#
# PAC learning: with probability ≥ 1−δ,
#   error ≤ ε, if n ≥ O(d/ε² × log(1/δ))
Real World
More practice problems + simpler study strategy = better exam performance
In AI
More data + regularization = better generalization to unseen data
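The simplified bound is easy to evaluate directly. This sketch uses an arbitrary complexity value d = 100 (real bounds carry constants and log factors that are omitted here) to show the complexity penalty shrinking as the sample size grows:

```python
import math

def complexity_penalty(d, n):
    # The √(d/n) term from the simplified generalization bound:
    # test error <= train error + sqrt(d / n)
    return math.sqrt(d / n)

d = 100  # hypothetical model complexity (VC-dimension-like)
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: penalty = {complexity_penalty(d, n):.3f}")
```

Tenfold more data shrinks the penalty by √10, matching the slide's summary: more data or a simpler model tightens the gap between train and test error.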