Ch 3 — Logistic Regression & Classification

Drawing decision boundaries — from probabilities to predictions
From Regression to Classification: The Sigmoid
Squashing any number into a probability between 0 and 1
The Problem
Linear regression predicts any real number: −∞ to +∞. But classification needs a probability: a number between 0 and 1. “What’s the probability this email is spam?”

If we just use ŷ = wx + b, we might predict −3.2 or 47.8 — neither is a valid probability.

The sigmoid function fixes this by squashing any real number into (0, 1):

σ(z) = 1 / (1 + e⁻ᶻ)

When z is very negative, σ ≈ 0. When z is very positive, σ ≈ 1. At z = 0, σ = 0.5 exactly. The function is smooth, differentiable, and S-shaped.

Logistic regression is simply: P(y=1|x) = σ(wᵀx + b). The linear part (wᵀx + b) computes a “score,” and the sigmoid converts it to a probability.
The Sigmoid Function
σ(z) = 1 / (1 + e⁻ᶻ)

z = -10  →  σ = 0.00005  (≈ 0)
z = -2   →  σ = 0.119
z = 0    →  σ = 0.500    (exact midpoint)
z = 2    →  σ = 0.881
z = 10   →  σ = 0.99995  (≈ 1)

Properties:
• Output always in (0, 1) → valid probability
• Symmetric: σ(-z) = 1 - σ(z)
• Derivative: σ'(z) = σ(z)(1 - σ(z)) → max gradient at z = 0, vanishes at extremes

The full model:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
P(spam | email) = σ(z)
Predict spam if P > 0.5 (default threshold)
Key insight: The sigmoid is like a dimmer switch for certainty. The linear score z is how strongly the evidence points toward “yes.” The sigmoid converts that evidence into a calibrated confidence level. A score of +5 means “99.3% sure it’s spam” — not “5 spam units.”
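The numbers in the table above are easy to verify yourself. A minimal sketch using only the standard library (the `sigmoid` helper is ours, not from any chapter codebase):

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce the table: very negative -> ~0, zero -> 0.5, very positive -> ~1
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:+3d}  ->  sigma = {sigmoid(z):.5f}")
```

Note how the symmetry property σ(-z) = 1 - σ(z) falls out of the formula: multiplying numerator and denominator of σ(-z) by e^z gives e^z / (1 + e^z) = 1 - σ(z).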
Log-Odds, Logit, and Probability
The mathematical bridge between linear models and probabilities
From Probability to Log-Odds
Odds express probability as a ratio: if P(spam) = 0.8, the odds are 0.8/0.2 = 4:1 (“four times more likely spam than not”).

odds = P / (1 − P)

The log-odds (logit) is the natural log of the odds:

logit(P) = ln(P / (1 − P))

The logit maps probabilities from (0, 1) back to (−∞, +∞) — the inverse of the sigmoid. This is why logistic regression is “linear in the log-odds”:

ln(P / (1 − P)) = wᵀx + b

Each weight wᵢ has a clean interpretation: increasing xᵢ by 1 unit changes the log-odds by wᵢ. Equivalently, it multiplies the odds by e^{wᵢ}. If w = 0.7, each unit increase multiplies the odds by e^{0.7} ≈ 2.01 — roughly doubling the odds.
The Three Representations
Probability → Odds → Log-Odds

P = 0.50 → odds = 1.00 → logit = 0.00
P = 0.73 → odds = 2.70 → logit = 1.00
P = 0.88 → odds = 7.39 → logit = 2.00
P = 0.95 → odds = 19.0 → logit = 2.94
P = 0.27 → odds = 0.37 → logit = -1.00

Interpreting weights:
Model: logit(P) = 0.7·(study_hours) - 3.5

Student studies 5 hours: logit = 0.7×5 - 3.5 = 0.0
  P(pass) = σ(0.0) = 0.50 (coin flip)
Student studies 8 hours: logit = 0.7×8 - 3.5 = 2.1
  P(pass) = σ(2.1) = 0.89 (very likely)
Key insight: The logit is the “native language” of logistic regression. The model thinks in log-odds (a linear scale), and the sigmoid translates that into the probability language humans understand. Each weight tells you how much one feature shifts the log-odds — a clean, additive effect.
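The probability → odds → log-odds round trip, and the study-hours model above, fit in a few lines. A sketch (the `logit` helper is our own name for the worked example, not library code):

```python
import math

def logit(p):
    """Log-odds: the inverse of the sigmoid, maps (0, 1) -> (-inf, +inf)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The worked example: logit(P) = 0.7 * study_hours - 3.5
for hours in (5, 8):
    z = 0.7 * hours - 3.5
    print(f"{hours} hours: logit = {z:.1f}, P(pass) = {sigmoid(z):.2f}")
```

Because logit and sigmoid are inverses, `sigmoid(logit(p))` returns `p` for any probability strictly between 0 and 1.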
Binary Cross-Entropy Loss
Why log loss, not MSE — the right loss for classification
Why Not MSE?
For linear regression, MSE creates a smooth bowl with one minimum. But composed with the sigmoid, MSE becomes non-convex: flat plateaus and local minima where gradient descent can stall.

Binary cross-entropy (log loss) fixes this. For a single sample:

L = −[y·log(p) + (1−y)·log(1−p)]

where y ∈ {0, 1} is the true label and p = σ(wᵀx + b) is the predicted probability.

When y = 1: L = −log(p). If p = 0.99, loss = 0.01 (great). If p = 0.01, loss = 4.6 (terrible). The log creates an infinite penalty for confidently wrong predictions.

When y = 0: L = −log(1−p). Same logic, mirrored.

Over all n samples: L = −(1/n) ∑ [yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)]

This loss is convex with the sigmoid, guaranteeing a single global minimum.
Log Loss Behavior
True label y = 1 (actual positive):
Predict p = 0.99 → loss = -log(0.99) = 0.01  ✓
Predict p = 0.80 → loss = -log(0.80) = 0.22
Predict p = 0.50 → loss = -log(0.50) = 0.69
Predict p = 0.10 → loss = -log(0.10) = 2.30  ✗
Predict p = 0.01 → loss = -log(0.01) = 4.61  ✗✗

True label y = 0 (actual negative):
Predict p = 0.01 → loss = -log(0.99) = 0.01  ✓
Predict p = 0.50 → loss = -log(0.50) = 0.69
Predict p = 0.99 → loss = -log(0.01) = 4.61  ✗✗

# The asymmetry is the point:
# being 99% confident AND wrong costs 460x
# more than being 99% confident and right.
# This forces the model to be well-calibrated.
Key insight: Log loss is like a lie detector for confidence. If you say “I’m 99% sure this is spam” and it’s not, you get hammered. If you say “I’m 51% sure,” the penalty is mild. This forces the model to only be confident when it has strong evidence — producing well-calibrated probabilities, not just correct labels.
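The single-sample loss formula can be checked numerically. A minimal sketch (the `bce` helper is ours, chosen for illustration):

```python
import math

def bce(y, p):
    """Binary cross-entropy for one sample: y in {0, 1}, p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confidently right vs confidently wrong on a true positive (y = 1):
right = bce(1, 0.99)   # ~0.01
wrong = bce(1, 0.01)   # ~4.61
print(f"right: {right:.2f}, wrong: {wrong:.2f}, ratio: {wrong / right:.0f}x")
```

The printed ratio is where the "~460x" figure in the panel comes from: -log(0.01) / -log(0.99) ≈ 458.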
Decision Boundary Geometry
The line (or hyperplane) that separates classes in feature space
Where the Boundary Lives
The model predicts class 1 when P > 0.5, which happens when σ(z) > 0.5, which happens when z = wᵀx + b > 0.

The decision boundary is the set of points where z = 0:

w₁x₁ + w₂x₂ + b = 0

In 2D, this is a straight line. In 3D, a plane. In higher dimensions, a hyperplane. Everything on one side is class 0, the other side is class 1.

The weights determine the boundary’s orientation (the normal vector is w), and the bias b determines its position (how far from the origin).

Logistic regression can only draw linear boundaries. If the true boundary is curved (like a circle separating two classes), logistic regression will fail. That’s when you need SVMs with kernels (Ch 5) or decision trees (Ch 4).
Visualizing the Boundary
2D example (2 features):
Model: z = 2.0·x₁ + 3.0·x₂ - 6.0
Decision boundary: 2x₁ + 3x₂ - 6 = 0
Rearranged: x₂ = -(2/3)x₁ + 2
  Slope = -w₁/w₂ = -2/3
  Intercept = -b/w₂ = 2

Predictions:
Point (1, 2): z = 2 + 6 - 6 = 2   → σ = 0.88 → class 1
Point (1, 1): z = 2 + 3 - 6 = -1  → σ = 0.27 → class 0
Point (3, 0): z = 6 + 0 - 6 = 0   → σ = 0.50 → boundary!

# Distance from boundary = confidence.
# Points far from the line → high confidence.
# Points near the line → P ≈ 0.5 (uncertain).
Key insight: The decision boundary is like a fence between two properties. Logistic regression can only build straight fences. If the properties have a winding border (nonlinear data), a straight fence will misclassify points near the curves. The model’s confidence drops smoothly as you approach the fence — points right on the fence get P = 0.5.
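The 2D example above is small enough to run by hand. A sketch using the panel's weights (the `predict` helper is our own name):

```python
import math

w1, w2, b = 2.0, 3.0, -6.0  # the example model: z = 2*x1 + 3*x2 - 6

def predict(x1, x2):
    """Return (probability, hard class) for a 2D point."""
    z = w1 * x1 + w2 * x2 + b
    p = 1.0 / (1.0 + math.exp(-z))
    return p, int(p > 0.5)

for pt in [(1, 2), (1, 1), (3, 0)]:
    p, cls = predict(*pt)
    print(f"{pt}: P = {p:.2f}, class = {cls}")

# The boundary as a line: x2 = -(w1/w2)*x1 - b/w2, i.e. slope -2/3, intercept 2
```

Note that (3, 0) sits exactly on the boundary (P = 0.5); with the strict `P > 0.5` rule it lands in class 0, which is an arbitrary tie-break.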
Gradient Descent for Logistic Regression
No closed-form solution — we must iterate
Why No Normal Equation?
Unlike linear regression, the cross-entropy loss with the sigmoid has no closed-form solution. Setting the derivative to zero gives a transcendental equation that can’t be solved algebraically.

We must use iterative optimization. The gradient of the log loss is elegantly simple:

∇L = (1/n) Xᵀ(σ(Xw) − y)

This looks almost identical to the linear regression gradient! The only difference: instead of (Xw − y), we have (σ(Xw) − y) — the sigmoid wraps the predictions.

scikit-learn’s LogisticRegression uses more sophisticated solvers: LBFGS (a quasi-Newton method) by default, with SAG/SAGA recommended for large datasets. These converge much faster than vanilla gradient descent by approximating the curvature of the loss surface.
Gradient Derivation
Cross-entropy loss:
L = -(1/n) Σ [yᵢ·log(pᵢ) + (1-yᵢ)·log(1-pᵢ)]
where pᵢ = σ(wᵀxᵢ + b)

Gradient (beautifully simple):
∂L/∂wⱼ = (1/n) Σ (pᵢ - yᵢ)·xᵢⱼ
∂L/∂b  = (1/n) Σ (pᵢ - yᵢ)

In matrix form: ∇L = (1/n) Xᵀ(σ(Xw) - y)
Update: w ← w - η · ∇L

scikit-learn solvers:
'lbfgs'     → default, quasi-Newton, fast
'saga'      → stochastic, scales to millions
'newton-cg' → exact Hessian, most precise
'liblinear' → L1 penalty, small datasets
Key insight: The gradient ∇L = (1/n)Xᵀ(p − y) has a beautiful interpretation: for each sample, the “error signal” is (pᵢ − yᵢ) — how far the predicted probability is from the truth. Positive error means the model is too confident; negative means not confident enough. The gradient points in the direction that fixes these errors.
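The matrix-form gradient and update rule above can be implemented from scratch in a few lines of NumPy. A sketch on a tiny made-up study-hours dataset (the data and variable names are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-feature dataset: pass (1) / fail (0) vs hours studied
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

w = np.zeros(2)  # [weight, bias]
eta = 0.1
for _ in range(5000):
    p = sigmoid(Xb @ w)
    grad = Xb.T @ (p - y) / len(y)  # (1/n) X^T (sigma(Xw) - y)
    w -= eta * grad                 # w <- w - eta * grad

print("learned [w, b]:", np.round(w, 2))
print("P(pass | 8 hours):", round(sigmoid(np.array([8.0, 1.0]) @ w), 3))
```

Because this toy data is perfectly separable, the weights keep growing slowly with more iterations (the loss approaches but never reaches zero), which is one reason real implementations add regularization.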
Multi-Class: Softmax and Cross-Entropy
From 2 classes to K classes — one-vs-rest and softmax
Extending to K Classes
Binary logistic regression handles 2 classes. For K > 2 classes (e.g., classifying digits 0–9), two strategies exist:

One-vs-Rest (OvR): Train K separate binary classifiers. Classifier k predicts “class k vs everything else.” At prediction time, pick the class with the highest probability. Simple but can produce inconsistent probabilities (they don’t sum to 1).

Softmax (Multinomial): One model with K output nodes. The softmax function converts K raw scores into K probabilities that sum to 1:

P(class k) = e^{z_k} / ∑ e^{z_j}

The loss generalizes to categorical cross-entropy:

L = −(1/n) ∑ ∑ yᵢₖ · log(pᵢₖ)

In recent scikit-learn versions, LogisticRegression fits the multinomial (softmax) formulation by default with the lbfgs solver; older versions selected it explicitly via multi_class='multinomial', a parameter that is now deprecated.
Softmax in Action
Raw scores (logits) for 3 classes:
z = [2.0, 1.0, 0.1]

Softmax computation:
e^z = [7.39, 2.72, 1.11]
sum = 11.22
P = [0.659, 0.242, 0.099]
sum(P) = 1.000

scikit-learn:
from sklearn.linear_model import LogisticRegression

# Binary (2 classes): uses sigmoid
clf = LogisticRegression()

# Multi-class (K classes): uses softmax
clf = LogisticRegression(
    multi_class='multinomial',  # softmax
    solver='lbfgs',
    max_iter=200
)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)  # [n, K]
Key insight: Softmax is like a competitive election. Each class campaigns with a raw score (logit). Softmax converts these into vote shares that sum to 100%. The exponentiation amplifies differences — a small lead in raw score becomes a decisive probability advantage. The winner takes the prediction, but the margins tell you how confident the model is.
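The softmax computation above is a two-liner in NumPy. A sketch, including the standard max-subtraction trick for numerical stability (which leaves the result unchanged because softmax is shift-invariant):

```python
import numpy as np

def softmax(z):
    """Convert K raw scores into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract max to avoid overflow; shifts cancel out
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # the logits from the panel's example
p = softmax(z)
print(np.round(p, 3))  # roughly [0.659, 0.242, 0.099]
print(p.sum())         # exactly 1 up to float rounding
```

Note how a raw-score lead of only 1.0 (2.0 vs 1.0) becomes a 2.7x probability advantage: exponentiation amplifies differences.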
Confusion Matrix, Precision, Recall, F1, ROC-AUC
Accuracy is not enough — the metrics that actually matter
The Confusion Matrix
A 2×2 table for binary classification:

True Positive (TP): Predicted spam, actually spam.
False Positive (FP): Predicted spam, actually not spam. (“Type I error”)
False Negative (FN): Predicted not spam, actually spam. (“Type II error”)
True Negative (TN): Predicted not spam, actually not spam.

Accuracy = (TP + TN) / (TP + FP + FN + TN). Sounds good, but fails with imbalanced data: if 99% of emails are not spam, predicting “not spam” always gives 99% accuracy while catching zero spam.

Precision = TP / (TP + FP). “Of all emails I flagged as spam, how many actually were?”
Recall = TP / (TP + FN). “Of all actual spam, how many did I catch?”
F1 = 2 · (Precision · Recall) / (Precision + Recall). Harmonic mean — balances both.
When to Use Which Metric
Spam filter (cost of FP ≈ cost of FN):
  Use F1 score — balance precision and recall

Cancer screening (missing cancer is deadly):
  Maximize Recall — catch every positive case
  Accept more false positives (further testing)

Legal document review (false accusations costly):
  Maximize Precision — every flag must be real

ROC-AUC (threshold-independent):
  Plots True Positive Rate vs False Positive Rate
  at every threshold from 0 to 1.
  AUC = 0.5  → random guessing
  AUC = 1.0  → perfect separation
  AUC = 0.85 → good classifier

# ROC-AUC answers: "How well does the model
# rank positives above negatives?" regardless
# of what threshold you choose.
Key insight: Precision and recall are like a security guard and a search party. High precision means the guard rarely raises false alarms. High recall means the search party finds everyone. You usually can’t maximize both — lowering the threshold catches more positives (higher recall) but also more false alarms (lower precision). The F1 score finds the balance.
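The three formulas are two lines each. A quick sketch with made-up spam-filter counts (the numbers are illustrative, not from the chapter):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)                          # of flagged, how many real?
    recall = tp / (tp + fn)                             # of real, how many caught?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical filter: 40 spam caught, 10 false alarms, 5 spam missed
p, r, f = prf1(tp=40, fp=10, fn=5)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f:.3f}")
```

The harmonic mean punishes imbalance: a model with precision 1.0 and recall 0.1 gets F1 ≈ 0.18, not the arithmetic mean of 0.55.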
Complete Classification Pipeline
Breast cancer detection with logistic regression — end to end
End-to-End: Breast Cancer Wisconsin
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, roc_auc_score, confusion_matrix
)

# 569 samples, 30 features, 2 classes
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

model = LogisticRegression(max_iter=500, C=1.0)
model.fit(X_tr_s, y_tr)

y_pred = model.predict(X_te_s)
y_prob = model.predict_proba(X_te_s)[:, 1]

print(classification_report(y_te, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_te, y_prob):.3f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_te, y_pred)}")
Expected Output
              precision  recall  f1-score  support
malignant        0.98     0.93     0.95       43
benign           0.96     0.99     0.97       71
accuracy                           0.96      114
macro avg        0.97     0.96     0.96      114

ROC-AUC: 0.997

Confusion Matrix:
[[40  3]   # 3 malignant missed (FN)
 [ 1 70]]  # 1 false alarm (FP)

# In cancer screening, those 3 false negatives
# (missed cancers) are the critical concern.
# Lowering the threshold from 0.5 to 0.3 would
# catch more cancers at the cost of more FPs.
Key insight: Logistic regression achieves 96% accuracy and 0.997 ROC-AUC on breast cancer detection with just 30 features and a linear boundary. It’s fast, interpretable (each coefficient tells you which features matter), and gives calibrated probabilities. For many real-world classification problems, logistic regression is the right starting point — and often the right ending point too.
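The threshold trade-off mentioned in the output comments is worth checking directly. A sketch that refits the same pipeline and moves the cutoff on P(malignant) instead of using the default predict (variable names here are ours; in this dataset class 0 is malignant):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = load_breast_cancer(return_X_y=True)  # class 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
model = LogisticRegression(max_iter=500)
model.fit(scaler.fit_transform(X_tr), y_tr)

# Column 0 of predict_proba is P(malignant); flag whenever it exceeds the cutoff.
p_malignant = model.predict_proba(scaler.transform(X_te))[:, 0]
true_malignant = (y_te == 0).astype(int)

recalls = []
for thresh in (0.5, 0.3):
    flagged = (p_malignant > thresh).astype(int)
    rec = recall_score(true_malignant, flagged)
    recalls.append(rec)
    print(f"threshold {thresh}: malignant recall = {rec:.3f}")
```

Lowering the cutoff can only grow the set of flagged cases, so malignant recall is guaranteed not to drop; what you pay is more false alarms (lower precision), which is usually the right trade in screening.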