Ch 4 — Decision Trees & Random Forests

From 20 Questions to ensemble wisdom — the most practical algorithms in ML
Classification
Splits → Gini/Entropy → Pruning → Bagging → Random Forest → Importance → Boosting → Full Code
Decision Trees: Recursive Binary Splitting
The 20 Questions game — ask the best question at each step
The 20 Questions Analogy
A decision tree works exactly like the game 20 Questions. At each step, it asks the single yes/no question that best separates the remaining items:

“Is income > $50K?” → Yes: go left. No: go right.
“Is age > 35?” → Yes: go left. No: go right.

Each internal node is a question (split on a feature at a threshold). Each leaf is a prediction (the majority class of samples that reached that leaf, or the mean value for regression).

The algorithm is greedy: at each node, it tries every feature and every threshold, picks the split that best separates the classes, then recurses on each child. It never backtracks — once a split is made, it’s permanent.

Trees are non-parametric: they make no assumptions about the data distribution. They can capture nonlinear relationships, interactions between features, and complex decision boundaries that linear models can’t.
A Simple Tree Example
Loan approval decision tree:

[Income > $50K?]
├── Yes: [Credit Score > 700?]
│        ├── Yes: APPROVE (95% of 200 samples)
│        └── No:  [Debt Ratio > 40%?]
│                 ├── Yes: DENY (88% of 50)
│                 └── No:  APPROVE (72% of 30)
└── No:  [Employment > 2 years?]
         ├── Yes: [Savings > $10K?]
         │        ├── Yes: APPROVE (68% of 40)
         │        └── No:  DENY (81% of 60)
         └── No:  DENY (92% of 120)

# Each leaf shows the majority class and
# the confidence (% of samples in that class).
# The tree learned: high income + good credit
# → approve. Low income + short employment
# → deny. No rules were hand-written.
Key insight: A decision tree is like a flowchart that writes itself. Instead of a human expert deciding which questions to ask, the algorithm discovers the most informative questions from the data. The result is perfectly interpretable — you can trace any prediction from root to leaf and explain exactly why.
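Tracing a tree's rules works the same way in code. A minimal sketch, using scikit-learn's `export_text` on synthetic data — the feature names (`income`, `credit_score`) and the labeling rule are illustrative, not a real dataset:

```python
# Fit a small tree on synthetic "loan-style" data and print its learned
# rules as a root-to-leaf flowchart.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))  # columns stand in for income, credit score
y = ((X[:, 0] > 0) & (X[:, 1] > -0.5)).astype(int)  # synthetic approve rule

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree, feature_names=["income", "credit_score"]))
```

Every prediction can be explained by reading one branch of the printed output top to bottom.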
Gini Impurity vs Entropy
Two ways to measure how “mixed” a node is
Gini Impurity
Gini impurity measures the probability that a randomly chosen sample would be misclassified if labeled according to the class distribution in the node:

Gini(D) = 1 − ∑ pᵢ²

where pᵢ is the proportion of class i. A pure node (all one class) has Gini = 0. A maximally impure node (50/50 split) has Gini = 0.5 for binary classification.

Entropy comes from information theory and measures the “surprise” or uncertainty:

H(D) = −∑ pᵢ · log₂(pᵢ)

Pure node: H = 0. Maximum impurity (50/50): H = 1.0 bit.

Information Gain = parent entropy − weighted average of children’s entropy. The split with the highest information gain wins.

In practice, Gini and entropy produce nearly identical trees. Gini is slightly faster (no log computation). scikit-learn defaults to Gini.
Numerical Example
Node with 100 samples: 70 cats, 30 dogs

Gini:
  p_cat = 0.7, p_dog = 0.3
  Gini = 1 - (0.7² + 0.3²) = 1 - 0.58 = 0.42

Entropy:
  H = -(0.7·log₂(0.7) + 0.3·log₂(0.3))
    = -(0.7·(-0.515) + 0.3·(-1.737))
    = 0.360 + 0.521 = 0.881 bits

Split on "weight > 10kg":
  Left:  60 cats, 5 dogs  → Gini = 0.142
  Right: 10 cats, 25 dogs → Gini = 0.408
  Weighted Gini = (65/100)·0.142 + (35/100)·0.408
                = 0.092 + 0.143 = 0.235
  Gini reduction = 0.42 - 0.235 = 0.185

# This split reduced impurity by 0.185.
# The algorithm tries all features/thresholds
# and picks the one with the biggest reduction.
Key insight: Gini and entropy both measure “how mixed is this bag of marbles?” A bag of all red marbles (pure) scores 0. A bag of half red, half blue (maximally mixed) scores the maximum. The tree picks the split that creates the purest children — like sorting marbles by asking the right question.
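The arithmetic above is easy to check by hand in Python. The two helper functions below are plain NumPy, not a library API:

```python
# Reproduce the 70-cat/30-dog example: Gini, entropy, and the weighted
# Gini after the "weight > 10kg" split.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0·log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(gini([0.7, 0.3]))                # 0.42
print(entropy([0.7, 0.3]))             # ~0.881 bits

left = gini([60 / 65, 5 / 65])         # ~0.142
right = gini([10 / 35, 25 / 35])       # ~0.408
print(0.65 * left + 0.35 * right)      # ~0.235, so reduction ~0.185
```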
Tree Depth, Pruning, and Overfitting
An unpruned tree memorizes — a pruned tree generalizes
The Overfitting Problem
An unrestricted decision tree will keep splitting until every leaf is pure — one sample per leaf. This gives 100% training accuracy but terrible test accuracy because the tree memorized every quirk in the training data.

A tree with depth 20 on 1,000 samples can have up to 2²⁰ = 1,048,576 leaves — more leaves than data points. Each leaf captures noise, not signal.

Pre-pruning (stopping early):
max_depth: Limit tree depth (e.g., 5–10)
min_samples_split: Minimum samples to split a node (e.g., 20)
min_samples_leaf: Minimum samples in each leaf (e.g., 5)
max_leaf_nodes: Cap total leaves

Post-pruning (cost-complexity pruning): Grow the full tree, then remove branches that don’t improve cross-validated performance. scikit-learn supports this via ccp_alpha.
Pruning in scikit-learn
from sklearn.tree import DecisionTreeClassifier

# Unpruned: overfits
tree_full = DecisionTreeClassifier()
tree_full.fit(X_train, y_train)
# train_acc=1.00, test_acc=0.82

# Pre-pruned: controlled depth
tree_pruned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_leaf=10,
    min_samples_split=20
)
tree_pruned.fit(X_train, y_train)
# train_acc=0.92, test_acc=0.89

# Post-pruned: cost-complexity
tree_ccp = DecisionTreeClassifier(ccp_alpha=0.01)
tree_ccp.fit(X_train, y_train)
# train_acc=0.91, test_acc=0.90

# Typical depth vs accuracy:
# depth=1:  train=0.75, test=0.74 (underfit)
# depth=5:  train=0.92, test=0.89 (good)
# depth=20: train=1.00, test=0.82 (overfit)
Key insight: An unpruned tree is like a student who memorizes every practice problem verbatim, including the typos. They ace the practice test but bomb the real exam. Pruning forces the tree to learn general rules instead of memorizing specifics — like a student who studies concepts, not answers.
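How do you pick `ccp_alpha`? scikit-learn exposes the full pruning path via `cost_complexity_pruning_path`. A sketch, assuming the breast-cancer dataset as an illustrative stand-in (in practice you would select alpha by cross-validation; a plain holdout split is used here for brevity):

```python
# Sweep the cost-complexity pruning path and keep the alpha with the
# best holdout accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_tr, y_tr)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas[:-1]:   # the last alpha prunes to a single node
    t = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    acc = t.fit(X_tr, y_tr).score(X_te, y_te)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc
print(f"best ccp_alpha={best_alpha:.4f}, holdout acc={best_acc:.3f}")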
Bagging: Bootstrap Aggregating
Train many trees on random subsets, then vote — the wisdom of crowds
The Wisdom of Crowds
A single decision tree is high variance: small changes in the training data produce very different trees. The solution? Train many trees and let them vote.

Bootstrap sampling: From n training samples, randomly draw n samples with replacement. On average, each bootstrap sample contains ~63.2% of the unique original samples (the rest are duplicates). The ~36.8% left out form the out-of-bag (OOB) set — a free validation set.

Aggregating: Train one tree on each bootstrap sample. For classification, take a majority vote. For regression, take the average.

Mathematically, if each tree has variance σ² and the trees are independent, the ensemble variance is σ²/B where B is the number of trees. More trees = lower variance. The bias stays the same (each tree is still a full decision tree), but the variance drops dramatically.
Bagging Mechanics
Original data: 1000 samples

Bootstrap sample 1: 1000 draws with replacement
  → ~632 unique samples, ~368 duplicates
  → Train tree₁
Bootstrap sample 2: another 1000 draws
  → different ~632 unique samples
  → Train tree₂
... repeat B=100 times ...

Prediction:
  tree₁ says: cat
  tree₂ says: dog
  tree₃ says: cat
  tree₄ says: cat
  ...
  67 trees say cat, 33 say dog → predict cat

Variance reduction:
  Single tree variance: σ²
  B independent trees:  σ²/B
  100 trees → 100x lower variance

# But trees trained on overlapping data aren't
# truly independent. Correlation ρ between trees
# means: Var = ρσ² + (1-ρ)σ²/B
# Random Forests reduce ρ — that's the key.
Key insight: Bagging is like asking 100 doctors for a diagnosis instead of 1. Each doctor sees a slightly different set of test results (bootstrap samples). Their individual opinions vary, but the majority vote is much more reliable than any single opinion. The key is that their errors are somewhat independent.
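The ~63.2% figure is easy to verify empirically — draw n samples with replacement and count how many distinct originals appear (a quick simulation, not part of any library API):

```python
# Simulate bootstrap sampling: fraction of unique originals per sample
# converges to 1 - 1/e ≈ 0.632.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200
fracs = [len(np.unique(rng.integers(0, n, size=n))) / n
         for _ in range(trials)]
print(np.mean(fracs))   # ≈ 0.632
```

The limit follows from P(a given sample is never drawn) = (1 − 1/n)ⁿ → 1/e ≈ 0.368.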
Random Forests: Bagging + Feature Randomization
The trick that makes trees truly independent
Why Not Just Bagging?
Plain bagging has a problem: if one feature is very strong (e.g., income for loan prediction), every tree splits on it first. The trees become correlated — they all make similar mistakes. Correlated trees don’t reduce variance much.

Random Forests add a second layer of randomness: at each split, only consider a random subset of features. If you have d features, typically try √d features for classification or d/3 for regression.

This forces trees to explore different features. Some trees might discover that “zip code” is predictive even though “income” is stronger overall. The ensemble captures diverse patterns that no single tree would find.

The result: trees are less correlated (ρ drops), so the ensemble variance drops further. Random Forests are one of the most robust, accurate, and easy-to-use algorithms in all of ML. They work well out of the box with minimal tuning.
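In scikit-learn the only knob separating "plain bagging of trees" from a random forest is `max_features`. A sketch of the comparison, using the breast-cancer dataset purely as an illustrative example — the relative scores will vary by dataset:

```python
# Same ensemble, two settings: max_features=None considers every feature
# at every split (plain bagging); max_features='sqrt' is the random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagged = RandomForestClassifier(
    n_estimators=100, max_features=None, random_state=42)
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42)

results = {}
for name, model in [("bagging (all features)", bagged),
                    ("random forest (sqrt)", forest)]:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:24s} {results[name]:.3f}")
```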
Random Forest in scikit-learn
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # 100 trees
    max_depth=None,        # grow full trees
    max_features='sqrt',   # √d features/split
    min_samples_leaf=1,    # default
    oob_score=True,        # free validation
    n_jobs=-1,             # all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

print(f"OOB Score: {rf.oob_score_:.3f}")
print(f"Test Acc:  {rf.score(X_test, y_test):.3f}")

# Typical results:
# Single tree:   test_acc = 0.82
# Bagged trees:  test_acc = 0.88
# Random Forest: test_acc = 0.91
# The feature randomization adds ~3% accuracy.
Key insight: Random Forests are like a jury where each juror is forced to focus on different evidence. One juror examines the financial records, another the witness testimony, a third the physical evidence. Because they each see different aspects, their collective verdict is far more reliable than any individual’s — even if some jurors are wrong.
Feature Importance from Trees
Which features matter most? Trees tell you directly
Two Methods
1. Mean Decrease in Impurity (MDI): For each feature, sum the Gini reduction across all splits in all trees where that feature was used. Features that create the biggest purity improvements rank highest. Fast but can be biased toward high-cardinality features (many unique values).

2. Permutation Importance: Randomly shuffle one feature’s values and measure how much the model’s accuracy drops. If shuffling “income” drops accuracy from 91% to 72%, income is very important. If shuffling “zip code” drops it to 90%, zip code barely matters. More reliable than MDI but slower (requires re-evaluation).

scikit-learn provides both: model.feature_importances_ for MDI and permutation_importance() for the permutation method. The scikit-learn docs recommend permutation importance for more reliable results.
Feature Importance in Code
from sklearn.inspection import permutation_importance
import numpy as np

# Method 1: MDI (built-in, fast)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
for i in indices[:5]:
    print(f"  {names[i]:20s} {importances[i]:.3f}")

# Method 2: Permutation (more reliable)
perm = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10, random_state=42
)
for i in perm.importances_mean.argsort()[::-1][:5]:
    print(f"  {names[i]:20s} "
          f"{perm.importances_mean[i]:.3f} "
          f"± {perm.importances_std[i]:.3f}")

# MDI can overweight features with many splits.
# Permutation importance is model-agnostic and
# works with any classifier, not just trees.
Key insight: Feature importance from trees is like tracking which ingredients a chef uses most. MDI counts how often each ingredient appears in the recipe (fast but biased toward versatile ingredients). Permutation importance removes one ingredient at a time and checks if the dish still tastes good (slower but more honest).
Gradient Boosted Trees: XGBoost & LightGBM
Sequential error correction — each tree fixes the previous tree’s mistakes
Boosting vs Bagging
Bagging trains trees in parallel on random subsets. Boosting trains trees sequentially: each new tree focuses on the mistakes of the previous ensemble.

Gradient Boosting works by fitting each new tree to the residuals (errors) of the current model. If the ensemble predicts $300K for a house worth $350K, the next tree tries to predict the $50K residual.

The ensemble prediction is a weighted sum: F(x) = ∑ η · hᵗ(x), where η is the learning rate (shrinkage, typically 0.01–0.3) and hᵗ is each small tree.

XGBoost adds L1/L2 regularization on tree weights and uses second-order gradients for better splits. LightGBM uses histogram-based splitting (bins features into 256 buckets) for 10–50x speed on large datasets. scikit-learn’s HistGradientBoostingClassifier is inspired by LightGBM.

Gradient boosted trees dominate Kaggle competitions and are the go-to for tabular data in industry.
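The residual-fitting idea is small enough to write from scratch. A sketch for regression with squared loss (where the negative gradient is exactly the residual), on synthetic data:

```python
# Hand-rolled gradient boosting for regression: start from the mean,
# then repeatedly fit a shallow tree to the current residuals and add
# a shrunken copy of its predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

eta = 0.1                              # learning rate (shrinkage)
pred = np.full(300, y.mean())          # F₀: constant baseline
trees = []
for _ in range(100):
    resid = y - pred                   # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, resid)
    pred += eta * h.predict(X)         # F ← F + η·h
    trees.append(h)

print(np.mean((y - pred) ** 2))        # training MSE approaches noise level
```

This is the core of F(x) = ∑ η·hᵗ(x); XGBoost and LightGBM add regularization, second-order gradients, and fast histogram splits on top of the same loop.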
Boosting in scikit-learn
from sklearn.ensemble import (
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)

# Classic gradient boosting
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,        # shallow trees
    random_state=42
)

# Histogram-based (LightGBM-style, much faster)
hgb = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

# Typical accuracy comparison:
# Logistic Regression:  0.85
# Decision Tree:        0.82
# Random Forest:        0.91
# Gradient Boosting:    0.93  ← usually best
# HistGradientBoosting: 0.93  (10x faster)
Key insight: Bagging is a democracy — many independent voters. Boosting is a relay race — each runner picks up where the last one stumbled. Bagging reduces variance (stabilizes wild predictions). Boosting reduces bias (fixes systematic errors). That’s why boosted trees often outperform random forests, especially on structured/tabular data.
Complete Comparison Pipeline
Tree, Forest, and Boosting head-to-head on real data
Head-to-Head: Wine Quality
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier
)

X, y = load_wine(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(
        max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, random_state=42),
    "Gradient Boost": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1,
        max_depth=3, random_state=42),
    "HistGradBoost": HistGradientBoostingClassifier(
        max_iter=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring="accuracy"
    )
    print(f"{name:18s} {scores.mean():.3f} "
          f"± {scores.std():.3f}")
Expected Results
Wine dataset (178 samples, 13 features, 3 classes):

Decision Tree     0.899 ± 0.050
Random Forest     0.978 ± 0.022
Gradient Boost    0.966 ± 0.033
HistGradBoost     0.972 ± 0.028

Key takeaways:
• Single tree: 90% — decent but unstable (±5%)
• Random Forest: 98% — best here, low variance
• Boosting: 97% — close, would win on larger data
• All ensembles crush the single tree

When to use which:
  Small data (<1K):        Random Forest (robust)
  Medium data (1K-1M):     HistGradientBoosting
  Large data (>1M):        HistGradientBoosting/XGBoost
  Need interpretability:   Single tree + pruning
  Need feature importance: Random Forest + perm
Key insight: Tree-based ensembles are the Swiss Army knife of ML. They handle mixed feature types, missing values, nonlinear relationships, and feature interactions without preprocessing. Random Forests are the safe default; gradient boosting squeezes out the last few percent of accuracy. For tabular data, they remain the state of the art — even in the age of deep learning.