Ch 10 — Hypothesis Testing & Statistical Learning

The courtroom trial — innocent until proven guilty, and the bias-variance tradeoff
The Courtroom Trial
Innocent until proven guilty — the logic of hypothesis testing
The Analogy
A courtroom starts with the assumption of innocence (null hypothesis H₀). The prosecution presents evidence. The jury asks: “Is this evidence so overwhelming that we can reject innocence?” If yes → guilty (reject H₀). If not → not guilty (fail to reject H₀). Note: “not guilty” doesn’t mean innocent — it means insufficient evidence.
Key insight: In ML, hypothesis testing is how you answer: “Is Model A actually better than Model B, or did it just get lucky on this test set?” Without statistical testing, you might deploy a model that’s not actually better — just luckier on your particular evaluation data.
The Framework
# H₀ (null): no effect / no difference
# H₁ (alternative): there IS a difference
#
# Example: "Is my new model better?"
# H₀: new model = old model (no improvement)
# H₁: new model > old model (improvement)
#
# Procedure:
# 1. Assume H₀ is true
# 2. Compute test statistic from data
# 3. Ask: how likely is this statistic if H₀ is true?
# 4. If very unlikely (p < 0.05) → reject H₀
Real World
Jury: “Is the evidence strong enough to convict?”
In AI
“Is Model A’s improvement over Model B statistically significant?”
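As a concrete illustration of the four-step procedure, scipy's exact binomial test (`scipy.stats.binomtest`) can check whether a coin that landed heads 58 times in 100 flips is plausibly fair. The numbers here are invented for the sketch:

```python
from scipy import stats

# Did a coin that landed heads 58 times in 100 flips deviate from fair?
n_flips, n_heads = 100, 58

# 1. Assume H0: the coin is fair, P(heads) = 0.5
# 2-3. Compute how likely a result this extreme is under H0
result = stats.binomtest(n_heads, n_flips, p=0.5)

# 4. Compare against the significance threshold
alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.3f} < {alpha}: reject H0")
else:
    print(f"p = {result.pvalue:.3f} >= {alpha}: fail to reject H0")
```

Here the test fails to reject H₀: like a "not guilty" verdict, 58 heads is not proof the coin is fair, only insufficient evidence that it is biased.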
Type I & Type II Errors
Convicting the innocent vs. freeing the guilty
The Analogy
Type I error (false positive): convicting an innocent person. You rejected H₀ when it was actually true. Type II error (false negative): letting a guilty person go free. You failed to reject H₀ when it was actually false. There’s a tradeoff: making it harder to convict (fewer Type I) means more guilty people go free (more Type II).
Key insight: In AI, precision vs. recall IS the Type I/II tradeoff. High precision (few false positives) = strict jury. High recall (few false negatives) = lenient jury. A spam filter with high precision rarely flags good email as spam, but might miss some spam. The F1 score balances both.
Worked Example
# Type I (α): reject H₀ when H₀ is true
#   = false positive rate = 1 - specificity
#   Typically α = 0.05 (5% chance of mistake)
# Type II (β): fail to reject H₀ when H₁ is true
#   = false negative rate = 1 - recall
# Power = 1 - β (ability to detect real effects)
#
# Confusion matrix:
#              Predicted + | Predicted -
# Actually +   TP          | FN (Type II)
# Actually -   FP (Type I) | TN
#
# Precision = TP / (TP + FP)
# Recall    = TP / (TP + FN)
# F1 = 2 × (P × R) / (P + R)
Type I (False +)
Convict innocent / Flag good email as spam
Type II (False −)
Free guilty / Miss actual spam email
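The confusion-matrix formulas above take only a few lines of Python to verify. The spam-filter counts here are hypothetical, chosen to mimic a strict (high-precision) filter:

```python
# Hypothetical spam-filter counts:
# spam caught, good mail flagged, spam missed, good mail passed
tp, fp, fn, tn = 90, 5, 20, 885

precision = tp / (tp + fp)  # how often a "spam" flag is right
recall = tp / (tp + fn)     # how much real spam gets caught
f1 = 2 * precision * recall / (precision + recall)

false_positive_rate = fp / (fp + tn)  # Type I analogue: good mail flagged
false_negative_rate = fn / (fn + tp)  # Type II analogue: spam missed

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note how the strict jury shows up in the numbers: precision is high (few innocent emails convicted) while recall is lower (some guilty spam goes free).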
p-values & Significance
How surprising is the evidence?
The Analogy
The p-value answers: “If H₀ were true, how likely is it to see evidence this extreme or more extreme?” If p = 0.03, there’s only a 3% chance of seeing evidence this extreme if H₀ were true. If p < 0.05 (conventional threshold), we say the result is statistically significant — unlikely to be due to chance alone.
Key insight: A p-value of 0.03 does NOT mean “3% chance H₀ is true.” It means “3% chance of seeing this data if H₀ were true.” This subtle distinction trips up even experienced researchers. Also, statistical significance ≠ practical significance — a tiny improvement can be “significant” with enough data.
Worked Example
from scipy import stats

# Compare two models' accuracy on test set
model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.82, 0.83, 0.81, 0.84, 0.82]

# Paired t-test: is the difference real?
t_stat, p_value = stats.ttest_rel(
    model_a_scores, model_b_scores
)

# p_value ≈ 0.006 < 0.05
# → Statistically significant!
# Model A is genuinely better (not luck)
Rule of thumb: p < 0.05 = significant (reject H₀). p < 0.01 = highly significant. p < 0.001 = very highly significant. But always consider effect size too — a 0.1% improvement might be “significant” but not worth deploying.
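One way to see what a p-value actually measures is a permutation test: if H₀ ("no difference") were true, the group labels would be arbitrary, so shuffling them should often reproduce the observed gap. A minimal sketch with made-up score lists:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two small (made-up) groups of scores
a = np.array([0.85, 0.87, 0.84, 0.86, 0.88])
b = np.array([0.82, 0.83, 0.81, 0.84, 0.82])
observed = a.mean() - b.mean()

# Shuffle the pooled values and count how often a random split
# produces a gap at least as large as the observed one
pooled = np.concatenate([a, b])
count = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)
    # small tolerance so exact ties still count as "as extreme"
    if pooled[:5].mean() - pooled[5:].mean() >= observed - 1e-12:
        count += 1

p_value = count / n_perm
print(f"p ≈ {p_value:.4f}")
```

The estimated p-value is simply the fraction of shuffles at least as extreme as the data: a direct reading of "how likely is this if H₀ were true?"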
A/B Testing in AI
The gold standard for comparing models in production
The Analogy
A/B testing is like a clinical trial for AI. Group A gets the old model, Group B gets the new model. You measure outcomes (clicks, revenue, satisfaction) and use hypothesis testing to determine if the new model is genuinely better. Random assignment ensures the groups are comparable, so any difference is due to the model, not the users.
Key insight: Every major tech company (Google, Netflix, Meta) runs thousands of A/B tests simultaneously. When Google changes its search ranking algorithm, it A/B tests on millions of users before rolling out. Statistical rigor prevents shipping “improvements” that are actually noise.
In Practice
from scipy import stats

# A/B test: new recommendation model
group_a = old_model_clicks  # e.g. [2.1, 2.3, ...]
group_b = new_model_clicks  # e.g. [2.4, 2.5, ...]

# H₀: mean(A) = mean(B) (no difference)
# H₁: mean(B) > mean(A) (new is better)
t, p = stats.ttest_ind(group_b, group_a,
                       alternative='greater')

if p < 0.05:
    print("Ship it! New model is better.")
else:
    print("Not enough evidence. Keep old model.")
Real World
Drug trial: does the new drug work better than placebo?
In AI
A/B test: does the new model get more clicks than the old one?
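Before launching an A/B test, teams usually estimate how many users each group needs to detect the effect they care about. A rough sketch using the standard normal-approximation sample-size formula for a one-sided two-sample comparison; every number here is an assumption for illustration:

```python
from scipy.stats import norm

alpha = 0.05   # Type I rate we tolerate
power = 0.80   # 1 - beta: chance of detecting a real effect
sigma = 1.0    # assumed std dev of clicks per user
delta = 0.1    # smallest lift worth detecting (clicks per user)

# n per group ≈ 2 * ((z_alpha + z_beta) * sigma / delta)^2
z_alpha = norm.ppf(1 - alpha)  # one-sided critical value
z_beta = norm.ppf(power)
n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

print(f"~{n_per_group:.0f} users per group")
```

The formula makes the tradeoffs explicit: halving the detectable lift `delta` quadruples the required sample size, which is why tiny improvements need enormous experiments.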
The Bias-Variance Tradeoff
Memorizing answers vs. learning principles
The Analogy
Imagine studying for an exam. High bias = you only learned the chapter summaries (too simple, misses nuances). High variance = you memorized every practice problem word-for-word (can’t handle new questions). The ideal student learns the principles — enough detail to be accurate, but general enough to handle new problems.
Key insight: Error = Bias² + Variance + Irreducible Noise. You can’t reduce all three. A simple model (linear regression) has high bias but low variance. A complex model (deep network) has low bias but high variance. The sweet spot minimizes total error.
Worked Example
# Bias: error from wrong assumptions
#   Fitting a line to curved data → high bias
# Variance: sensitivity to training data
#   Fitting a degree-100 polynomial → high variance
#
# Total error = Bias² + Variance + Noise
#
# Underfitting (high bias):
#   Train acc: 70%, Test acc: 68%
#   → Model too simple
# Overfitting (high variance):
#   Train acc: 99%, Test acc: 75%
#   → Model memorized training data
# Good fit:
#   Train acc: 92%, Test acc: 90%
#   → Model learned generalizable patterns
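The tradeoff can be reproduced with polynomial fits on noisy data. This sketch uses toy samples from a sine curve and arbitrary degrees; watch the training error collapse as model complexity grows, the signature of a model starting to chase noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples from a smooth curve; the model family is polynomials
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

def train_error(degree):
    # Least-squares polynomial fit, evaluated on its own training data
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# degree 1: a straight line can't follow the curve (high bias)
# degree 9: the fit chases the noise (high variance)
for d in (1, 3, 9):
    print(f"degree {d}: train MSE = {train_error(d):.4f}")
```

A near-zero training error on noisy data is not a success: the degree-9 fit is absorbing the noise term, which is exactly the variance half of the tradeoff.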
Overfitting & Underfitting
The most common failure modes in machine learning
The Analogy
Overfitting is like a student who memorized every answer in the textbook but can’t solve a new problem. The model fits the training data perfectly (including noise) but fails on new data. Underfitting is a student who barely studied — bad on both training and test data. The goal is to learn the signal, not the noise.
Key insight: Modern deep networks are so large they can memorize random labels (Zhang et al., 2017). A ResNet-50 can achieve 100% training accuracy on random noise. Yet with real data and proper regularization, these same models generalize beautifully. Understanding why is one of the deepest open questions in ML theory.
Detection & Fixes
# Detecting overfitting:
#   train_loss ↓↓↓ but val_loss ↑↑↑
#   Gap between train and val accuracy grows
#
# Fixes for overfitting:
#   1. More data (always helps)
#   2. Regularization (L2, dropout)
#   3. Data augmentation
#   4. Early stopping
#   5. Simpler model
#
# Fixes for underfitting:
#   1. Bigger model (more parameters)
#   2. Train longer
#   3. Better features
#   4. Less regularization
#   5. Tune the learning rate
Overfit
Train: 99%, Test: 75% — memorized noise
Good Fit
Train: 92%, Test: 90% — learned signal
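Early stopping, one of the fixes listed above, is simple enough to sketch in full: track the best validation loss and stop once it has failed to improve for a few epochs. The validation-loss curve here is invented:

```python
# Made-up validation losses, one per epoch: improve, then start rising
val_losses = [1.00, 0.80, 0.65, 0.58, 0.55, 0.56, 0.57, 0.59, 0.62]

patience = 3  # stop after 3 consecutive epochs with no improvement
best_loss = float("inf")
best_epoch = 0
bad_epochs = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, bad_epochs = loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print(f"stop at epoch {epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop the same logic wraps the optimizer step, and "restore weights" means reloading the checkpoint saved at the best epoch rather than keeping the overfit final weights.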
Cross-Validation — Honest Evaluation
Don’t grade yourself on the homework you practiced
The Analogy
Grading a student on the exact problems they practiced is meaningless. Cross-validation is like giving the student a new exam they haven’t seen. k-fold CV splits data into k parts: train on k−1, test on the held-out fold, rotate. This gives an honest estimate of how the model will perform on truly new data.
Key insight: The train/validation/test split is the most important practice in ML. Train = homework. Validation = practice exam (tune hyperparameters). Test = final exam (report results). NEVER tune on the test set — that’s cheating, and your reported accuracy will be overly optimistic.
In Practice
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
# e.g. [0.88, 0.91, 0.87, 0.90, 0.89]
# Mean: 0.89 ± 0.015

# Proper split:
#   70% train / 15% validation / 15% test
#   Train: fit model
#   Val:   tune hyperparameters
#   Test:  final evaluation (touch ONCE)
Real World
Practice exam ≠ final exam. Grade on new problems.
In AI
Train ≠ test. Cross-validate for honest performance estimates.
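The "train on k−1 folds, rotate the held-out fold" procedure can be made concrete with plain index arithmetic. This sketch builds the splits by hand for a toy dataset of 10 samples (sklearn's `KFold` does the same bookkeeping for you):

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, k)  # k roughly equal index chunks

for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th, validate on the i-th
    train_idx = np.concatenate(
        [f for j, f in enumerate(folds) if j != i]
    )
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Every sample lands in the validation set exactly once, so averaging the k fold scores uses all the data for evaluation without ever grading the model on indices it trained on.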
Statistical Learning Theory
Why generalization works — PAC learning and VC dimension
The Big Picture
Statistical learning theory asks: “Why does a model trained on a finite sample generalize to unseen data?” The answer involves the VC dimension (how complex the model is) and PAC learning (probably approximately correct). The generalization bound says: test error ≤ train error + complexity penalty. More data or simpler models = tighter bound.
Why it matters for AI: This theory explains the “unreasonable effectiveness” of deep learning. Classical bounds suggest huge networks should overfit catastrophically, but they don’t. Understanding why — implicit regularization, flat minima, the lottery ticket hypothesis — is one of the biggest open questions in ML theory.
Key Concepts
# Generalization bound (simplified):
#   Test error ≤ Train error + √(d/n)
#   d = model complexity (VC dimension)
#   n = number of training samples
#   More data (↑n) → tighter bound
#   Simpler model (↓d) → tighter bound
#
# VC dimension examples:
#   Linear classifier in 2D: VC = 3
#   (can shatter 3 points, not 4)
#   Neural network: VC ≈ O(params × layers)
#
# PAC learning: with probability ≥ 1−δ,
#   error ≤ ε, if n ≥ O(d/ε² × log(1/δ))
Real World
More practice problems + simpler study strategy = better exam performance
In AI
More data + regularization = better generalization to unseen data
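The simplified bound is easy to evaluate directly. This sketch uses an arbitrary complexity value d = 100 (real bounds carry constants and log factors that are omitted here) to show the complexity penalty shrinking as the sample size grows:

```python
import math

def complexity_penalty(d, n):
    # The √(d/n) term from the simplified generalization bound:
    # test error <= train error + sqrt(d / n)
    return math.sqrt(d / n)

d = 100  # hypothetical model complexity (VC-dimension-like)
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: penalty = {complexity_penalty(d, n):.3f}")
```

Tenfold more data shrinks the penalty by √10, matching the slide's summary: more data or a simpler model tightens the gap between train and test error.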