Ch 3 — Logistic Regression & Classification

Drawing decision boundaries — from probabilities to predictions
From Regression to Classification: The Sigmoid
Squashing any number into a probability between 0 and 1
The Problem
Linear regression predicts any real number: −∞ to +∞. But classification needs a probability: a number between 0 and 1. “What’s the probability this email is spam?”

If we just use ŷ = wx + b, we might predict −3.2 or 47.8 — neither is a valid probability.

The sigmoid function fixes this by squashing any real number into (0, 1):

σ(z) = 1 / (1 + e⁻ᶻ)

When z is very negative, σ ≈ 0. When z is very positive, σ ≈ 1. At z = 0, σ = 0.5 exactly. The function is smooth, differentiable, and S-shaped.

Logistic regression is simply: P(y=1|x) = σ(wᵀx + b). The linear part (wᵀx + b) computes a “score,” and the sigmoid converts it to a probability.
The Sigmoid Function
σ(z) = 1 / (1 + e⁻ᶻ)

z = -10  →  σ = 0.00005  (≈ 0)
z = -2   →  σ = 0.119
z = 0    →  σ = 0.500    (exact midpoint)
z = 2    →  σ = 0.881
z = 10   →  σ = 0.99995  (≈ 1)

Properties:
• Output always in (0, 1) → valid probability
• Symmetric: σ(-z) = 1 - σ(z)
• Derivative: σ'(z) = σ(z)(1 - σ(z)) → max gradient at z = 0, vanishes at extremes

The full model:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
P(spam | email) = σ(z)
Predict spam if P > 0.5 (default threshold)
Key insight: The sigmoid is like a dimmer switch for certainty. The linear score z is how strongly the evidence points toward “yes.” The sigmoid converts that evidence into a calibrated confidence level. A score of +5 means “99.3% sure it’s spam” — not “5 spam units.”
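The numbers in the table above are easy to verify yourself. A minimal sketch using only the standard library (the `sigmoid` helper is ours, not from any chapter codebase):

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce the table: very negative -> ~0, zero -> 0.5, very positive -> ~1
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:+3d}  ->  sigma = {sigmoid(z):.5f}")
```

Note how the symmetry property σ(-z) = 1 - σ(z) falls out of the formula: multiplying numerator and denominator of σ(-z) by e^z gives e^z / (1 + e^z) = 1 - σ(z).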
Log-Odds, Logit, and Probability
The mathematical bridge between linear models and probabilities
From Probability to Log-Odds
Odds express probability as a ratio: if P(spam) = 0.8, the odds are 0.8/0.2 = 4:1 (“four times more likely spam than not”).

odds = P / (1 − P)

The log-odds (logit) is the natural log of the odds:

logit(P) = ln(P / (1 − P))

The logit maps probabilities from (0, 1) back to (−∞, +∞) — the inverse of the sigmoid. This is why logistic regression is “linear in the log-odds”:

ln(P / (1 − P)) = wᵀx + b

Each weight wᵢ has a clean interpretation: increasing xᵢ by 1 unit changes the log-odds by wᵢ. Equivalently, it multiplies the odds by e^{wᵢ}. If w = 0.7, each unit increase multiplies the odds by e^{0.7} ≈ 2.01 — roughly doubling the odds.
The Three Representations
Probability → Odds → Log-Odds

P = 0.50 → odds = 1.00 → logit = 0.00
P = 0.73 → odds = 2.70 → logit = 1.00
P = 0.88 → odds = 7.39 → logit = 2.00
P = 0.95 → odds = 19.0 → logit = 2.94
P = 0.27 → odds = 0.37 → logit = -1.00

Interpreting weights:
Model: logit(P) = 0.7·(study_hours) - 3.5

Student studies 5 hours: logit = 0.7×5 - 3.5 = 0.0
  P(pass) = σ(0.0) = 0.50 (coin flip)
Student studies 8 hours: logit = 0.7×8 - 3.5 = 2.1
  P(pass) = σ(2.1) = 0.89 (very likely)
Key insight: The logit is the “native language” of logistic regression. The model thinks in log-odds (a linear scale), and the sigmoid translates that into the probability language humans understand. Each weight tells you how much one feature shifts the log-odds — a clean, additive effect.
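The probability → odds → log-odds round trip, and the study-hours model above, fit in a few lines. A sketch (the `logit` helper is our own name for the worked example, not library code):

```python
import math

def logit(p):
    """Log-odds: the inverse of the sigmoid, maps (0, 1) -> (-inf, +inf)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The worked example: logit(P) = 0.7 * study_hours - 3.5
for hours in (5, 8):
    z = 0.7 * hours - 3.5
    print(f"{hours} hours: logit = {z:.1f}, P(pass) = {sigmoid(z):.2f}")
```

Because logit and sigmoid are inverses, `sigmoid(logit(p))` returns `p` for any probability strictly between 0 and 1.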
Binary Cross-Entropy Loss
Why log loss, not MSE — the right loss for classification
Why Not MSE?
For linear regression, MSE creates a smooth bowl with one minimum. But composed with the sigmoid, MSE becomes non-convex: flat plateaus and local minima where gradient descent can stall.

Binary cross-entropy (log loss) fixes this. For a single sample:

L = −[y·log(p) + (1−y)·log(1−p)]

where y ∈ {0, 1} is the true label and p = σ(wᵀx + b) is the predicted probability.

When y = 1: L = −log(p). If p = 0.99, loss = 0.01 (great). If p = 0.01, loss = 4.6 (terrible). The log creates an infinite penalty for confidently wrong predictions.

When y = 0: L = −log(1−p). Same logic, mirrored.

Over all n samples: L = −(1/n) ∑ [yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)]

This loss is convex with the sigmoid, guaranteeing a single global minimum.
Log Loss Behavior
True label y = 1 (actual positive):
Predict p = 0.99 → loss = -log(0.99) = 0.01  ✓
Predict p = 0.80 → loss = -log(0.80) = 0.22
Predict p = 0.50 → loss = -log(0.50) = 0.69
Predict p = 0.10 → loss = -log(0.10) = 2.30  ✗
Predict p = 0.01 → loss = -log(0.01) = 4.61  ✗✗

True label y = 0 (actual negative):
Predict p = 0.01 → loss = -log(0.99) = 0.01  ✓
Predict p = 0.50 → loss = -log(0.50) = 0.69
Predict p = 0.99 → loss = -log(0.01) = 4.61  ✗✗

# The asymmetry is the point:
# being 99% confident AND wrong costs 460x
# more than being 99% confident and right.
# This forces the model to be well-calibrated.
Key insight: Log loss is like a lie detector for confidence. If you say “I’m 99% sure this is spam” and it’s not, you get hammered. If you say “I’m 51% sure,” the penalty is mild. This forces the model to only be confident when it has strong evidence — producing well-calibrated probabilities, not just correct labels.
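The single-sample loss formula can be checked numerically. A minimal sketch (the `bce` helper is ours, chosen for illustration):

```python
import math

def bce(y, p):
    """Binary cross-entropy for one sample: y in {0, 1}, p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confidently right vs confidently wrong on a true positive (y = 1):
right = bce(1, 0.99)   # ~0.01
wrong = bce(1, 0.01)   # ~4.61
print(f"right: {right:.2f}, wrong: {wrong:.2f}, ratio: {wrong / right:.0f}x")
```

The printed ratio is where the "~460x" figure in the panel comes from: -log(0.01) / -log(0.99) ≈ 458.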
Decision Boundary Geometry
The line (or hyperplane) that separates classes in feature space
Where the Boundary Lives
The model predicts class 1 when P > 0.5, which happens when σ(z) > 0.5, which happens when z = wᵀx + b > 0.

The decision boundary is the set of points where z = 0:

w₁x₁ + w₂x₂ + b = 0

In 2D, this is a straight line. In 3D, a plane. In higher dimensions, a hyperplane. Everything on one side is class 0, the other side is class 1.

The weights determine the boundary’s orientation (the normal vector is w), and the bias b determines its position (how far from the origin).

Logistic regression can only draw linear boundaries. If the true boundary is curved (like a circle separating two classes), logistic regression will fail. That’s when you need SVMs with kernels (Ch 5) or decision trees (Ch 4).
Visualizing the Boundary
2D example (2 features):
Model: z = 2.0·x₁ + 3.0·x₂ - 6.0
Decision boundary: 2x₁ + 3x₂ - 6 = 0
Rearranged: x₂ = -(2/3)x₁ + 2
  Slope = -w₁/w₂ = -2/3
  Intercept = -b/w₂ = 2

Predictions:
Point (1, 2): z = 2 + 6 - 6 = 2   → σ = 0.88 → class 1
Point (1, 1): z = 2 + 3 - 6 = -1  → σ = 0.27 → class 0
Point (3, 0): z = 6 + 0 - 6 = 0   → σ = 0.50 → boundary!

# Distance from boundary = confidence.
# Points far from the line → high confidence.
# Points near the line → P ≈ 0.5 (uncertain).
Key insight: The decision boundary is like a fence between two properties. Logistic regression can only build straight fences. If the properties have a winding border (nonlinear data), a straight fence will misclassify points near the curves. The model’s confidence drops smoothly as you approach the fence — points right on the fence get P = 0.5.
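The 2D example above is small enough to run by hand. A sketch using the panel's weights (the `predict` helper is our own name):

```python
import math

w1, w2, b = 2.0, 3.0, -6.0  # the example model: z = 2*x1 + 3*x2 - 6

def predict(x1, x2):
    """Return (probability, hard class) for a 2D point."""
    z = w1 * x1 + w2 * x2 + b
    p = 1.0 / (1.0 + math.exp(-z))
    return p, int(p > 0.5)

for pt in [(1, 2), (1, 1), (3, 0)]:
    p, cls = predict(*pt)
    print(f"{pt}: P = {p:.2f}, class = {cls}")

# The boundary as a line: x2 = -(w1/w2)*x1 - b/w2, i.e. slope -2/3, intercept 2
```

Note that (3, 0) sits exactly on the boundary (P = 0.5); with the strict `P > 0.5` rule it lands in class 0, which is an arbitrary tie-break.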
Gradient Descent for Logistic Regression
No closed-form solution — we must iterate
Why No Normal Equation?
Unlike linear regression, the cross-entropy loss with the sigmoid has no closed-form solution. Setting the derivative to zero gives a transcendental equation that can’t be solved algebraically.

We must use iterative optimization. The gradient of the log loss is elegantly simple:

∇L = (1/n) Xᵀ(σ(Xw) − y)

This looks almost identical to the linear regression gradient! The only difference: instead of (Xw − y), we have (σ(Xw) − y) — the sigmoid wraps the predictions.

scikit-learn’s LogisticRegression uses more sophisticated solvers: LBFGS (a quasi-Newton method) by default, with SAG/SAGA recommended for large datasets. These converge much faster than vanilla gradient descent by approximating the curvature of the loss surface.
Gradient Derivation
Cross-entropy loss:
L = -(1/n) Σ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]
where pᵢ = σ(wᵀxᵢ + b)

Gradient (beautifully simple):
∂L/∂wⱼ = (1/n) Σ (pᵢ - yᵢ)·xᵢⱼ
∂L/∂b  = (1/n) Σ (pᵢ - yᵢ)

In matrix form: ∇L = (1/n) Xᵀ(σ(Xw) - y)
Update: w ← w - η · ∇L

scikit-learn solvers:
'lbfgs'     → default, quasi-Newton, fast
'saga'      → stochastic, scales to millions
'newton-cg' → exact Hessian, most precise
'liblinear' → L1 penalty, small datasets
Key insight: The gradient ∇L = (1/n)Xᵀ(p − y) has a beautiful interpretation: for each sample, the “error signal” is (pᵢ − yᵢ) — how far the predicted probability is from the truth. Positive error means the model is too confident; negative means not confident enough. The gradient points in the direction that fixes these errors.
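The matrix-form gradient and update rule above can be implemented from scratch in a few lines of NumPy. A sketch on a tiny made-up study-hours dataset (the data and variable names are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-feature dataset: pass (1) / fail (0) vs hours studied
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

w = np.zeros(2)  # [weight, bias]
eta = 0.1
for _ in range(5000):
    p = sigmoid(Xb @ w)
    grad = Xb.T @ (p - y) / len(y)  # (1/n) X^T (sigma(Xw) - y)
    w -= eta * grad                 # w <- w - eta * grad

print("learned [w, b]:", np.round(w, 2))
print("P(pass | 8 hours):", round(sigmoid(np.array([8.0, 1.0]) @ w), 3))
```

Because this toy data is perfectly separable, the weights keep growing slowly with more iterations (the loss approaches but never reaches zero), which is one reason real implementations add regularization.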
Multi-Class: Softmax and Cross-Entropy
From 2 classes to K classes — one-vs-rest and softmax
Extending to K Classes
Binary logistic regression handles 2 classes. For K > 2 classes (e.g., classifying digits 0–9), two strategies exist:

One-vs-Rest (OvR): Train K separate binary classifiers. Classifier k predicts “class k vs everything else.” At prediction time, pick the class with the highest probability. Simple but can produce inconsistent probabilities (they don’t sum to 1).

Softmax (Multinomial): One model with K output nodes. The softmax function converts K raw scores into K probabilities that sum to 1:

P(class k) = e^{z_k} / ∑ e^{z_j}

The loss generalizes to categorical cross-entropy:

L = −(1/n) ∑ ∑ yᵢₖ · log(pᵢₖ)

In recent scikit-learn versions, LogisticRegression fits the multinomial (softmax) formulation by default with the lbfgs solver; older versions selected it explicitly via multi_class='multinomial', a parameter that is now deprecated.
Softmax in Action
Raw scores (logits) for 3 classes:
z = [2.0, 1.0, 0.1]

Softmax computation:
e^z = [7.39, 2.72, 1.11]
sum = 11.22
P = [0.659, 0.242, 0.099]
sum(P) = 1.000

scikit-learn:
from sklearn.linear_model import LogisticRegression

# Binary (2 classes): uses sigmoid
clf = LogisticRegression()

# Multi-class (K classes): uses softmax
clf = LogisticRegression(
    multi_class='multinomial',  # softmax
    solver='lbfgs',
    max_iter=200
)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)  # [n, K]
Key insight: Softmax is like a competitive election. Each class campaigns with a raw score (logit). Softmax converts these into vote shares that sum to 100%. The exponentiation amplifies differences — a small lead in raw score becomes a decisive probability advantage. The winner takes the prediction, but the margins tell you how confident the model is.
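The softmax computation above is a two-liner in NumPy. A sketch, including the standard max-subtraction trick for numerical stability (which leaves the result unchanged because softmax is shift-invariant):

```python
import numpy as np

def softmax(z):
    """Convert K raw scores into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract max to avoid overflow; shifts cancel out
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # the logits from the panel's example
p = softmax(z)
print(np.round(p, 3))  # roughly [0.659, 0.242, 0.099]
print(p.sum())         # exactly 1 up to float rounding
```

Note how a raw-score lead of only 1.0 (2.0 vs 1.0) becomes a 2.7x probability advantage: exponentiation amplifies differences.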
Confusion Matrix, Precision, Recall, F1, ROC-AUC
Accuracy is not enough — the metrics that actually matter
The Confusion Matrix
A 2×2 table for binary classification:

True Positive (TP): Predicted spam, actually spam.
False Positive (FP): Predicted spam, actually not spam. (“Type I error”)
False Negative (FN): Predicted not spam, actually spam. (“Type II error”)
True Negative (TN): Predicted not spam, actually not spam.

Accuracy = (TP + TN) / (TP + FP + FN + TN). Sounds good, but fails with imbalanced data: if 99% of emails are not spam, predicting “not spam” always gives 99% accuracy while catching zero spam.

Precision = TP / (TP + FP). “Of all emails I flagged as spam, how many actually were?”
Recall = TP / (TP + FN). “Of all actual spam, how many did I catch?”
F1 = 2 · (Precision · Recall) / (Precision + Recall). Harmonic mean — balances both.
When to Use Which Metric
Spam filter (cost of FP ≈ cost of FN):
  Use F1 score — balance precision and recall

Cancer screening (missing cancer is deadly):
  Maximize Recall — catch every positive case
  Accept more false positives (further testing)

Legal document review (false accusations costly):
  Maximize Precision — every flag must be real

ROC-AUC (threshold-independent):
  Plots True Positive Rate vs False Positive Rate
  at every threshold from 0 to 1.
  AUC = 0.5  → random guessing
  AUC = 1.0  → perfect separation
  AUC = 0.85 → good classifier

# ROC-AUC answers: "How well does the model
# rank positives above negatives?" regardless
# of what threshold you choose.
Key insight: Precision and recall are like a security guard and a search party. High precision means the guard rarely raises false alarms. High recall means the search party finds everyone. You usually can’t maximize both — lowering the threshold catches more positives (higher recall) but also more false alarms (lower precision). The F1 score finds the balance.
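The three formulas are two lines each. A quick sketch with made-up spam-filter counts (the numbers are illustrative, not from the chapter):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)                          # of flagged, how many real?
    recall = tp / (tp + fn)                             # of real, how many caught?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical filter: 40 spam caught, 10 false alarms, 5 spam missed
p, r, f = prf1(tp=40, fp=10, fn=5)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f:.3f}")
```

The harmonic mean punishes imbalance: a model with precision 1.0 and recall 0.1 gets F1 ≈ 0.18, not the arithmetic mean of 0.55.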
Complete Classification Pipeline
Breast cancer detection with logistic regression — end to end
End-to-End: Breast Cancer Wisconsin
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, roc_auc_score, confusion_matrix
)

# 569 samples, 30 features, 2 classes
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

model = LogisticRegression(max_iter=500, C=1.0)
model.fit(X_tr_s, y_tr)

y_pred = model.predict(X_te_s)
y_prob = model.predict_proba(X_te_s)[:, 1]

print(classification_report(y_te, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_te, y_prob):.3f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_te, y_pred)}")
Expected Output
              precision  recall  f1-score  support
malignant        0.98     0.93     0.95       43
benign           0.96     0.99     0.97       71
accuracy                           0.96      114
macro avg        0.97     0.96     0.96      114

ROC-AUC: 0.997

Confusion Matrix:
[[40  3]   # 3 malignant missed (FN)
 [ 1 70]]  # 1 false alarm (FP)

# In cancer screening, those 3 false negatives
# (missed cancers) are the critical concern.
# Lowering the threshold from 0.5 to 0.3 would
# catch more cancers at the cost of more FPs.
Key insight: Logistic regression achieves 96% accuracy and 0.997 ROC-AUC on breast cancer detection with just 30 features and a linear boundary. It’s fast, interpretable (each coefficient tells you which features matter), and gives calibrated probabilities. For many real-world classification problems, logistic regression is the right starting point — and often the right ending point too.
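The threshold trade-off mentioned in the output comments is worth checking directly. A sketch that refits the same pipeline and moves the cutoff on P(malignant) instead of using the default predict (variable names here are ours; in this dataset class 0 is malignant):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = load_breast_cancer(return_X_y=True)  # class 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
model = LogisticRegression(max_iter=500)
model.fit(scaler.fit_transform(X_tr), y_tr)

# Column 0 of predict_proba is P(malignant); flag whenever it exceeds the cutoff.
p_malignant = model.predict_proba(scaler.transform(X_te))[:, 0]
true_malignant = (y_te == 0).astype(int)

recalls = []
for thresh in (0.5, 0.3):
    flagged = (p_malignant > thresh).astype(int)
    rec = recall_score(true_malignant, flagged)
    recalls.append(rec)
    print(f"threshold {thresh}: malignant recall = {rec:.3f}")
```

Lowering the cutoff can only grow the set of flagged cases, so malignant recall is guaranteed not to drop; what you pay is more false alarms (lower precision), which is usually the right trade in screening.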