Ch 9 — Model Evaluation & Selection

How to honestly measure model performance — and pick the best model
Toolkit: Split → K-Fold CV → Grid Search → Learning Curves → Accuracy Paradox → PR Trade-off → Significance → Flowchart
Train / Validation / Test Splits
Three datasets, three purposes — never peek at the test set
Why Three Splits?
Training set (~60–80%): The model learns from this data. It sees these examples during fit().

Validation set (~10–20%): Used to tune hyperparameters (learning rate, regularization, tree depth). The model never trains on this data, but you use its performance to make decisions.

Test set (~10–20%): The final, untouched evaluation. You look at this once, at the very end, to report your model’s true performance. If you tune hyperparameters based on test performance, you’re cheating — the test set becomes a second validation set.

Data leakage is the #1 mistake in ML evaluation. It happens when information from the test set leaks into training — through feature scaling, feature selection, or hyperparameter tuning on test data. Always fit preprocessing on training data only.
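The leakage warning above can be sketched in code. A minimal illustration, assuming a synthetic dataset: wrapping the scaler in a Pipeline guarantees it is refit on the training portion of each fold only, so no test-set statistics ever reach the model.

```python
# Sketch: leakage-safe preprocessing via a Pipeline.
# The dataset here is a synthetic stand-in for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)

# Wrong: StandardScaler().fit(X) on ALL data before splitting
# leaks test-set means/variances into training.
# Right: the pipeline refits the scaler inside each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```

The same pipeline object can be passed straight to GridSearchCV later, keeping preprocessing leak-free during tuning as well.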
Splitting in scikit-learn
from sklearn.model_selection import train_test_split

# Two-way split (simple)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Three-way split (with validation)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.18, stratify=y_temp, random_state=42)
# Result: ~70% train, ~15% val, ~15% test

# stratify=y ensures class ratios are preserved.
# If 30% spam in full data → 30% in each split.
Key insight: The test set is like a sealed exam. You study (train), take practice tests (validate), and adjust your strategy. But the real exam (test set) is only opened once. If you peek at the real exam while studying, your grade doesn’t reflect true knowledge — it reflects memorization of that specific exam.
K-Fold Cross-Validation
Use all data for both training and validation — rotate through folds
How K-Fold Works
A single train/val split wastes data and gives a noisy estimate. K-Fold CV fixes both:

1. Split data into K equal folds (typically K=5 or 10).
2. For each fold: use it as validation, train on the other K−1 folds.
3. Average the K validation scores.

Every sample is used for validation exactly once and for training K−1 times. With K=5, you get 5 accuracy estimates. The mean is more reliable than any single split, and the standard deviation tells you how stable the model is.

Stratified K-Fold ensures each fold has the same class distribution as the full dataset. Critical for imbalanced data.

Leave-One-Out (LOO) is K-Fold with K=n. Maximum data usage but very slow and high variance. Rarely used in practice.
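The LOO definition above can be demonstrated directly; a small sketch on a synthetic dataset (an illustrative stand-in), showing that LOO produces one score per sample, each either 0 or 1:

```python
# Sketch: Leave-One-Out CV is K-Fold with K = n.
# Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=40, random_state=0)

loo = LeaveOneOut()  # equivalent to KFold(n_splits=len(X))
scores = cross_val_score(LogisticRegression(), X, y, cv=loo)

# One score per sample: each fold holds out a single point,
# so each fold score is exactly 0.0 or 1.0 — hence the high variance.
print(f"{len(scores)} folds, mean accuracy: {scores.mean():.3f}")
```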
Cross-Validation in Code
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)

# Simple 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified K-Fold (explicit control)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Typical output for the 5-fold run:
# Fold scores: [0.93, 0.91, 0.94, 0.92, 0.90]
# Mean: 0.920 ± 0.015
# The ± tells you: "expect ~92% ± 1.5% on new data"
Key insight: Cross-validation is like a round-robin tournament. Every team plays every other team. The final ranking (mean score) is much more reliable than a single match result. The standard deviation tells you if the model is consistently good or just got lucky on one split.
GridSearchCV: Systematic Hyperparameter Tuning
Try every combination, pick the best — with honest evaluation
Grid Search + Cross-Validation
GridSearchCV combines exhaustive hyperparameter search with cross-validation:

1. Define a grid of hyperparameter values to try.
2. For each combination, run K-fold CV.
3. Pick the combination with the best mean CV score.
4. Refit the model on all training data with the best hyperparameters.

For a grid with 4 values of C and 4 values of gamma (16 combinations) with 5-fold CV, that’s 80 model fits. Can be slow for large grids.

RandomizedSearchCV samples random combinations instead of trying all. With 100 random samples from a large grid, you often find a solution within 5% of the best — at a fraction of the cost. Preferred for large search spaces.
GridSearchCV in Code
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), SVC())

param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1],
    'svc__kernel': ['rbf'],
}

grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring='accuracy', n_jobs=-1, refit=True)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
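The RandomizedSearchCV alternative described earlier looks almost identical in code. A sketch, assuming a synthetic dataset; the sampling distributions below are illustrative choices, not tuned values:

```python
# Sketch: RandomizedSearchCV samples n_iter combinations instead
# of exhaustively trying all of them. Synthetic data for illustration.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)
pipe = make_pipeline(StandardScaler(), SVC())

param_dist = {
    'svc__C': loguniform(1e-1, 1e2),      # sample C on a log scale
    'svc__gamma': loguniform(1e-3, 1e0),  # continuous range, not a fixed grid
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5,
                            random_state=42, n_jobs=-1)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

A design note: continuous distributions like loguniform let random search explore values a hand-written grid would never contain, which is part of why it often matches grid search at a fraction of the cost.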
Key insight: GridSearchCV is like a cooking competition where you try every combination of temperature and cooking time, taste-test each (cross-validate), and pick the winner. The key is that the taste-testing uses fresh food each time (different folds), so you’re not just picking the combination that got lucky on one batch.
Learning Curves: Diagnose Bias vs Variance
Plot performance vs training size — know if you need more data or a better model
Reading Learning Curves
A learning curve plots training score and validation score as a function of training set size.

High bias (underfitting): Both curves plateau at a low score, close together. More data won’t help — the model is too simple. Fix: use a more complex model, add features, reduce regularization.

High variance (overfitting): Training score is high, validation score is much lower. The gap is large. More data will help (the curves converge as n increases). Fix: more data, simpler model, more regularization, feature selection.

Good fit: Both curves converge at a high score with a small gap. The model has found the right complexity for the data.
Learning Curves in Code
from sklearn.model_selection import learning_curve
import numpy as np

sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy')

# Plot means ± std for train and val
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Diagnosis:
# train=0.99, val=0.75 → high variance (overfit)
#   → get more data, simplify model, regularize
# train=0.80, val=0.78 → high bias (underfit)
#   → more complex model, more features
# train=0.95, val=0.93 → good fit
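One way to turn those arrays into the standard band plot is sketched below, assuming matplotlib is available; the model and dataset are illustrative stand-ins:

```python
# Sketch: the standard learning-curve plot with ±1 std bands.
# Synthetic data and a small forest keep this fast to run.
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for scores, label in [(train_scores, "train"), (val_scores, "validation")]:
    mean, std = scores.mean(axis=1), scores.std(axis=1)
    plt.plot(sizes, mean, label=label)
    plt.fill_between(sizes, mean - std, mean + std, alpha=0.2)

plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("learning_curve.png")
```

Reading the plot: a wide persistent gap between the two lines signals variance; two lines flat and close at a low score signal bias.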
Key insight: Learning curves answer the most important question in ML: “Should I get more data or a better model?” If the curves are converging (high variance), more data helps. If they’re already flat and close together (high bias), more data is useless — you need a more powerful model.
The Accuracy Paradox
99% accuracy can be terrible — when class imbalance strikes
When Accuracy Lies
A fraud detection dataset: 99.5% legitimate transactions, 0.5% fraud. A model that always predicts “legitimate” gets 99.5% accuracy — while catching zero fraud.

This is the accuracy paradox: high accuracy on imbalanced data is meaningless. The model learned to predict the majority class and ignore the minority class entirely.

Better metrics for imbalanced data:
Precision/Recall/F1 (from Ch 3) — focus on the minority class
ROC-AUC — threshold-independent ranking quality
PR-AUC (Precision-Recall AUC) — better than ROC-AUC for severe imbalance
Matthews Correlation Coefficient (MCC) — balanced metric even with imbalanced classes

Techniques for imbalanced data: class weights (class_weight='balanced'), oversampling (SMOTE), undersampling, or threshold tuning.
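The MCC mentioned in the metric list makes the accuracy paradox concrete. A tiny sketch with illustrative numbers: a lazy model that always predicts the majority class scores 95% accuracy but an MCC of exactly zero.

```python
# Sketch: accuracy vs MCC on a skewed toy label set.
# The 95/5 split below is an illustrative example.
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [0] * 95 + [1] * 5   # 1 = fraud (rare)
y_lazy = [0] * 100            # always predict "legitimate"

print(f"Accuracy: {accuracy_score(y_true, y_lazy):.2f}")
print(f"MCC:      {matthews_corrcoef(y_true, y_lazy):.2f}")
```

MCC ranges from −1 to +1 and only rewards predictions that are correlated with the true labels, so a constant prediction can never score above zero no matter how imbalanced the data.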
Handling Imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score

# Without class weights: ignores the minority class
lr = LogisticRegression()
lr.fit(X_train, y_train)
# Accuracy: 99.5%, Recall(fraud): 0.10 ← terrible

# With class weights: penalizes minority-class errors more
lr_balanced = LogisticRegression(class_weight='balanced')
lr_balanced.fit(X_train, y_train)
# Accuracy: 97.0%, Recall(fraud): 0.85 ← much better

# Use PR-AUC for imbalanced evaluation
y_prob = lr_balanced.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, y_prob)
print(f"PR-AUC: {pr_auc:.3f}")
Key insight: Accuracy is like grading a multiple-choice test where 99 out of 100 answers are “A.” A student who writes “A” for everything gets 99% — without knowing anything. For imbalanced data, always use precision, recall, F1, or PR-AUC. These metrics force the model to actually learn the minority class.
Precision-Recall Trade-off & Threshold Tuning
You can’t maximize both — choose based on business cost
The Trade-off
Most classifiers output a probability. The default threshold is 0.5: predict positive if P > 0.5. But 0.5 is arbitrary — you can adjust it.

Lower threshold (e.g., 0.3): More positives predicted → higher recall (catch more fraud) but lower precision (more false alarms).

Higher threshold (e.g., 0.7): Fewer positives predicted → higher precision (fewer false alarms) but lower recall (miss more fraud).

The right threshold depends on the business cost:
• Cancer screening: low threshold (missing cancer is deadly)
• Spam filter: medium threshold (balance convenience and safety)
• Criminal sentencing: high threshold (false conviction is catastrophic)
Threshold Tuning in Code
from sklearn.metrics import precision_recall_curve

y_prob = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

# Find threshold for desired recall
target_recall = 0.90
idx = (recalls >= target_recall).sum() - 1
best_threshold = thresholds[idx]
print(f"Threshold for 90% recall: {best_threshold:.3f}")
print(f"Precision at that recall: {precisions[idx]:.3f}")

# Apply custom threshold
y_pred_custom = (y_prob >= best_threshold).astype(int)

# Typical trade-off:
# threshold=0.3: recall=0.95, precision=0.40
# threshold=0.5: recall=0.80, precision=0.70
# threshold=0.7: recall=0.60, precision=0.90
Key insight: The precision-recall trade-off is like a security checkpoint. Strict screening (high threshold) catches fewer innocent people but also misses some threats. Loose screening (low threshold) catches every threat but hassles many innocent people. The right strictness depends on whether missing a threat or hassling innocents costs more.
Statistical Significance
Is 92% really better than 91%? Or just noise?
When Differences Matter
Model A: 92.3% ± 1.5%. Model B: 91.8% ± 1.2%. Is A really better?

With overlapping confidence intervals, the difference might be noise. A paired t-test on the K-fold scores can tell you if the difference is statistically significant.

Paired t-test: Compare the K scores from model A vs model B on the same folds. If p < 0.05, the difference is significant.

Practical significance matters too. Even if 92.3% vs 91.8% is statistically significant, is 0.5% worth the extra complexity? A simpler model that’s 0.5% worse but 10x faster and more interpretable might be the better choice.

Rule of thumb: if the difference is within 1 standard deviation of the CV scores, it’s probably not meaningful. Focus on the model that’s simplest, fastest, and most interpretable among the top performers.
Statistical Test in Code
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score, StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_A = cross_val_score(model_A, X, y, cv=skf)
scores_B = cross_val_score(model_B, X, y, cv=skf)

t_stat, p_value = ttest_rel(scores_A, scores_B)
print(f"A: {scores_A.mean():.3f} ± {scores_A.std():.3f}")
print(f"B: {scores_B.mean():.3f} ± {scores_B.std():.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Significant difference")
else:
    print("No significant difference → pick simpler model")
Key insight: Statistical significance is like asking “if we ran this experiment 100 times, would model A consistently beat model B?” A single comparison can be misleading. The paired t-test on K-fold scores gives you confidence. But always weigh statistical significance against practical significance — 0.5% accuracy isn’t worth 10x complexity.
Model Selection Flowchart
Which algorithm for which problem? A practical decision guide
The Decision Flowchart
START: What type of problem?

REGRESSION (predict a number):
  Few features, linear?    → LinearRegression / Ridge
  Nonlinear?               → RandomForestRegressor
  Best accuracy?           → HistGradientBoostingRegressor
  Small data, nonlinear?   → SVR (RBF kernel)

CLASSIFICATION (predict a category):
  Need interpretability?   → LogisticRegression / DecisionTree
  Text data?               → MultinomialNB + TF-IDF
  Small data (<1K)?        → SVC (RBF) or NaiveBayes
  Medium data (1K–100K)?   → RandomForestClassifier
  Best accuracy?           → HistGradientBoostingClassifier
  Very large (>100K)?      → HistGradientBoosting / SGD

CLUSTERING (find groups):
  Know K, spherical?       → KMeans
  Unknown K, any shape?    → DBSCAN / HDBSCAN
  Need hierarchy?          → AgglomerativeClustering
The Practitioner’s Checklist
For any ML project:
  1. Start with the simplest baseline (LogReg, DecisionTree, or NaiveBayes)
  2. Evaluate with cross-validation
  3. Check learning curves (bias vs variance)
  4. Try 2–3 more complex models
  5. Tune hyperparameters with GridSearchCV
  6. Compare with a paired t-test
  7. Pick the simplest model within 1% of the best
  8. Final evaluation on the held-out test set
  9. Report the test score (never tune after this!)

# The model selection process matters more than the model itself.
# A well-evaluated simple model beats a poorly-evaluated complex
# model every time.
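Step 1 of the checklist can be made concrete with scikit-learn's DummyClassifier, which gives the floor any real model must beat. A minimal sketch, assuming a synthetic dataset as a stand-in:

```python
# Sketch: a trivial baseline vs a simple real model.
# Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=1)

for name, model in [("baseline", DummyClassifier(strategy="most_frequent")),
                    ("logreg", LogisticRegression())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

If a candidate model cannot clearly beat this baseline under cross-validation, the rest of the checklist is moot.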
Key insight: Model selection is not about finding the “best” algorithm. It’s about finding the simplest model that meets your performance requirements, evaluated honestly. Start simple, add complexity only when the learning curve demands it, and always hold out a final test set you never peek at until the very end.