Ch 2 — Linear Regression

Fitting lines to data — the algorithm that launched a thousand models
Foundation: Intuition → MSE Loss → Normal Eq → Grad Descent → Polynomial → Regularize → R-Squared → Full Code
Fitting a Line to Data
y = wx + b — the simplest model that actually works
Geometric Intuition
Linear regression finds the best straight line through a scatter plot of data points. “Best” means the line that minimizes the total distance between each point and the line.

The model is: ŷ = w·x + b, where w (weight/slope) controls the steepness and b (bias/intercept) controls where the line crosses the y-axis.

With multiple features, the line becomes a hyperplane: ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b. A house price model with 3 features (sqft, bedrooms, age) fits a 3D hyperplane in 4D space.

Think of it like adjusting a ruler on a scatter plot. You tilt it (change w) and slide it up/down (change b) until it “fits” the data as closely as possible. The question is: what does “closely” mean mathematically?
A Concrete Example
House price prediction (1 feature):

  sqft (x):  800   1200  1600  2000  2400
  price (y): 150K  220K  280K  350K  410K

The model learns: ŷ = 162.5 × sqft + 22,000

Prediction for 1800 sqft:
  ŷ = 162.5 × 1800 + 22,000 = $314,500

With multiple features:
  ŷ = w₁(sqft) + w₂(beds) + w₃(age) + b
  ŷ = 95(sqft) + 15,000(beds) - 800(age) + 40,000

# Each weight tells you:
# "Holding everything else constant,
#  one more sqft adds $95 to the price."
Key insight: Linear regression is like adjusting a ruler on a scatter plot. The slope tells you “for every 1-unit increase in x, y changes by w units.” The intercept is where the story starts when x = 0. Simple, interpretable, and surprisingly powerful for many real-world problems.
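The worked example can be reproduced in a few lines of NumPy: a degree-1 `np.polyfit` is an ordinary least-squares line fit. A minimal sketch using the five sqft/price points from the table above:

```python
import numpy as np

# House data from the example above
sqft = np.array([800, 1200, 1600, 2000, 2400], dtype=float)
price = np.array([150_000, 220_000, 280_000, 350_000, 410_000], dtype=float)

# Degree-1 polyfit = ordinary least-squares line fit
w, b = np.polyfit(sqft, price, deg=1)
print(f"y-hat = {w:.1f} * sqft + {b:,.0f}")

# Predict the price of an 1800 sqft house
print(f"prediction: ${w * 1800 + b:,.0f}")
```

`np.polyfit` returns the coefficients highest-degree-first, so `w` is the slope and `b` the intercept.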
Mean Squared Error — Why Squared?
The loss function that defines “best fit”
Deriving MSE
For each data point, the residual is the gap between the actual value and the prediction: eᵢ = yᵢ − ŷᵢ. We want to minimize the total residual, but raw residuals can be positive or negative and cancel out.

Why not absolute value? |e| works but has a sharp corner at zero, making it non-differentiable — calculus-based optimization struggles with corners.

Why squared? e² is smooth, differentiable everywhere, and penalizes large errors more than small ones. An error of 10 costs 100, but an error of 20 costs 400 — four times more. This makes the model work extra hard to avoid big mistakes.

The Mean Squared Error averages these squared residuals:

MSE = (1/n) ∑ (yᵢ − ŷᵢ)²

Minimizing MSE is equivalent to maximum likelihood estimation under the assumption that errors are normally distributed — a deep connection between statistics and ML.
MSE Derivation Step by Step
Given: n data points (x₁,y₁), ..., (xₙ,yₙ)
Model: ŷᵢ = w·xᵢ + b

Step 1: Define residuals
  eᵢ = yᵢ - ŷᵢ = yᵢ - (w·xᵢ + b)

Step 2: Square to remove sign
  eᵢ² = (yᵢ - w·xᵢ - b)²

Step 3: Average over all points
  MSE(w,b) = (1/n) Σᵢ (yᵢ - w·xᵢ - b)²

Why not other options?
  |eᵢ|  → not differentiable at 0
  eᵢ    → positives cancel negatives
  eᵢ²   → smooth, differentiable, convex ✓
  eᵢ⁴   → over-penalizes outliers

# MSE creates a smooth, bowl-shaped surface
# in (w, b) space. The bottom of the bowl
# is the optimal solution.
Key insight: Squaring errors is like grading with a “big mistakes cost way more” policy. Getting one question wrong by 2 points costs 4, but getting it wrong by 10 costs 100. This forces the model to spread its errors evenly rather than nailing most points but catastrophically missing a few.
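The MSE formula translates directly into NumPy. A minimal sketch (the two small arrays are made-up values, not from the text):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    residuals = y_true - y_pred
    return np.mean(residuals ** 2)

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Residuals are 1, 0, -2 → squared: 1, 0, 4 → mean = 5/3
print(mse(y_true, y_pred))
```

Note how the lone error of 2 contributes 4 of the total 5 — the squaring is what makes large errors dominate the loss.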
The Normal Equation — Closed-Form Solution
Solve for the optimal weights in one shot using linear algebra
The Idea
Since MSE is a smooth, bowl-shaped (convex) function, there’s exactly one minimum. We can find it by setting the derivative to zero and solving directly — no iteration needed.

In matrix form, the model is ŷ = Xw where X is the [n × d] feature matrix (with a column of 1s for the intercept) and w is the [d × 1] weight vector.

The MSE becomes: L(w) = (1/n)||Xw − y||²

Taking the derivative and setting it to zero:
∇L = (2/n) Xᵀ(Xw − y) = 0

Solving: w* = (XᵀX)⁻¹ Xᵀy

This is the Normal Equation. It gives the exact optimal weights in one computation. scikit-learn’s LinearRegression uses an SVD-based variant of this for numerical stability.
Normal Equation Derivation
Matrix form:
  L(w) = (1/n)(Xw - y)ᵀ(Xw - y)
       = (1/n)(wᵀXᵀXw - 2wᵀXᵀy + yᵀy)

Take derivative, set to zero:
  ∂L/∂w = (2/n)(XᵀXw - Xᵀy) = 0
  XᵀXw = Xᵀy
  w* = (XᵀX)⁻¹ Xᵀy

Complexity:
  O(n·d²) to compute XᵀX
  O(d³) to invert XᵀX
  Total: O(n·d² + d³)

When it breaks:
  d > n             → XᵀX is singular (not invertible)
  d > ~10,000       → d³ inversion too slow
  Multicollinearity → XᵀX nearly singular

# scikit-learn uses SVD instead of direct
# inversion for better numerical stability:
#   X = UΣVᵀ → w* = VΣ⁻¹Uᵀy
Key insight: The Normal Equation is like having a GPS that tells you the exact location of the bottom of a valley. No need to walk downhill step by step — you just teleport there. The catch: computing the GPS coordinates requires inverting a matrix, which gets expensive when you have thousands of features (O(d³)).
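The Normal Equation is a one-liner in NumPy. A minimal sketch on synthetic data (the true weights and noise level are illustrative choices); note that solving the linear system XᵀXw = Xᵀy is preferred to forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X_raw @ true_w + 3.0 + 0.01 * rng.normal(size=n)

# Add a column of 1s so the intercept is learned as a weight
X = np.column_stack([np.ones(n), X_raw])

# w* = (XᵀX)⁻¹ Xᵀy, computed without an explicit inverse
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # ≈ [3.0, 2.0, -1.0, 0.5] (intercept first)
```

With only 0.01 noise, the recovered weights land within a few thousandths of the true values — the exact minimum of the MSE bowl, no iteration required.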
Gradient Descent — The Ball Rolling Downhill
When the closed-form is too expensive, iterate toward the answer
The Algorithm
Imagine you’re blindfolded on a hilly landscape and want to reach the lowest valley. You can feel the slope under your feet. Gradient descent says: take a step in the direction of steepest descent, then repeat.

1. Initialize weights randomly (or to zero).
2. Compute the gradient — the vector of partial derivatives that points “uphill.”
3. Update: w ← w − η · ∇L(w), where η is the learning rate.
4. Repeat until the loss stops decreasing (convergence).

The learning rate η is critical. Too large: you overshoot the minimum and diverge. Too small: convergence takes forever. Typical starting values: 0.01 or 0.001.

Variants: Batch GD uses all n samples per step. Stochastic GD (SGD) uses 1 sample. Mini-batch GD uses a batch of 32–256 — the practical sweet spot.
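The mini-batch variant can be sketched in a few lines. This is an illustrative implementation on synthetic data — the target y = 3x + 2, the batch size of 32, and the learning rate are assumptions for the demo, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=256)
y = 3.0 * X + 2.0 + 0.1 * rng.normal(size=256)

w, b, lr, batch = 0.0, 0.0, 0.1, 32
for _ in range(200):                       # epochs
    idx = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch):
        i = idx[start:start + batch]       # one mini-batch of indices
        y_pred = w * X[i] + b
        dw = (-2 / len(i)) * np.sum(X[i] * (y[i] - y_pred))
        db = (-2 / len(i)) * np.sum(y[i] - y_pred)
        w -= lr * dw
        b -= lr * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # close to the true 3 and 2
```

Each epoch makes 8 noisy updates instead of 1 exact one — cheaper per step, and the noise rarely hurts on a convex loss like MSE.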
Gradient Descent for Linear Regression
Gradient of MSE w.r.t. the parameters:
  ∂MSE/∂w = -(2/n) Xᵀ(y - Xw)
  ∂MSE/∂b = -(2/n) Σ(yᵢ - ŷᵢ)

Update rule:
  w ← w - η · ∂MSE/∂w
  b ← b - η · ∂MSE/∂b

Python from scratch:

import numpy as np

w, b, lr, epochs = 0.0, 0.0, 0.01, 1000
n = len(X)

for _ in range(epochs):
    y_pred = w * X + b
    dw = (-2/n) * np.sum(X * (y - y_pred))
    db = (-2/n) * np.sum(y - y_pred)
    w -= lr * dw
    b -= lr * db

# Given a suitable learning rate, after 1000 steps
# w and b converge to the same values as the
# Normal Equation.
Key insight: Gradient descent is like finding the bottom of a swimming pool while blindfolded. You feel the slope with your feet and step downhill. The learning rate is your step size — too big and you splash past the deep end, too small and you’re still wading at sunset. The Normal Equation teleports you there, but gradient descent works even when the pool is too big to map.
Polynomial Regression — The Overfitting Trap
When straight lines aren’t enough, but curves can go too far
Beyond Straight Lines
What if the relationship between x and y is curved? A salary-vs-experience plot often shows diminishing returns — your first 5 years matter more than years 25–30.

Polynomial regression adds powers of x as new features: ŷ = w₁x + w₂x² + w₃x³ + b. It’s still “linear regression” because the model is linear in the weights — we just transformed the features.

Degree 2 (quadratic) can capture U-shapes. Degree 3 can capture S-curves. Degree 10 can capture almost anything — including noise.

The trap: With 20 data points and a degree-19 polynomial, you get a curve that passes through every single point (training MSE = 0). But between the points, it oscillates wildly. This is textbook overfitting — the model memorized the data instead of learning the pattern.
Polynomial Regression in scikit-learn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Degree 2: captures curvature
poly2 = make_pipeline(
    PolynomialFeatures(degree=2),
    LinearRegression()
)
poly2.fit(X_train, y_train)
print(f"Deg 2 test MSE: "
      f"{mean_squared_error(y_test, poly2.predict(X_test)):.1f}")

# Degree 15: overfits badly
poly15 = make_pipeline(
    PolynomialFeatures(degree=15),
    LinearRegression()
)
poly15.fit(X_train, y_train)
print(f"Deg 15 test MSE: "
      f"{mean_squared_error(y_test, poly15.predict(X_test)):.1f}")

# Typical result:
#   Deg 1:  train=12.5, test=13.1  (underfit)
#   Deg 2:  train=4.2,  test=5.0   (good)
#   Deg 15: train=0.01, test=847   (overfit!)
Key insight: Polynomial regression is like fitting a flexible wire through data points. A stiff wire (low degree) misses curves. A perfectly flexible wire (high degree) passes through every point but wiggles absurdly between them. Regularization is the tool that controls the wire’s flexibility.
Regularization: Ridge (L2) and Lasso (L1)
Shrinking coefficients to fight overfitting
The Core Idea
Overfitting happens when weights grow large to fit noise. Regularization adds a penalty for large weights to the loss function, forcing the model to keep weights small.

Ridge (L2): L = MSE + α · ∑wᵢ²
Adds the sum of squared weights. Shrinks all coefficients toward zero but never exactly to zero. Like a rubber band pulling all weights toward the center — strong weights resist, weak ones get pulled in.

Lasso (L1): L = MSE + α · ∑|wᵢ|
Adds the sum of absolute weights. Can shrink coefficients exactly to zero, effectively removing features. Lasso does automatic feature selection.

The hyperparameter α controls regularization strength. α = 0 is plain linear regression. α → ∞ forces all weights to zero (predicts the mean). Cross-validation finds the sweet spot.
Ridge vs Lasso in Practice
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge: shrinks all weights, keeps all features
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge coefs: {ridge.coef_}")
# [0.45, 0.32, -0.18, 0.09, 0.03]
# All non-zero, but smaller than OLS

# Lasso: zeros out unimportant features
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso coefs: {lasso.coef_}")
# [0.52, 0.38, -0.21, 0.0, 0.0]
# Features 4 and 5 eliminated!

# ElasticNet: combines L1 + L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
# l1_ratio=0 → pure Ridge
# l1_ratio=1 → pure Lasso
Ridge (L2)
Penalty: α∑w²
Effect: shrinks all weights
Features: keeps all
Best for: many small effects, multicollinearity
Lasso (L1)
Penalty: α∑|w|
Effect: zeros out weak weights
Features: selects subset
Best for: sparse models, feature selection
Key insight: Ridge is like a volume knob that turns down all speakers equally. Lasso is like a sound engineer who mutes the instruments that aren’t contributing to the song. When you have 500 features but suspect only 20 matter, Lasso finds them for you.
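The cross-validation search for α can be sketched with scikit-learn's built-in `RidgeCV` and `LassoCV`, which try a grid of α values and keep the best. The synthetic dataset and the alpha grid below are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# 200 samples, 50 features, but only 10 are informative
X, y = make_regression(n_samples=200, n_features=50,
                       n_informative=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)
ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print(f"Best Ridge alpha: {ridge.alpha_}")
print(f"Best Lasso alpha: {lasso.alpha_}")
print(f"Features kept by Lasso: {np.sum(lasso.coef_ != 0)} / 50")
```

On data like this, Lasso should zero out most of the 40 uninformative features — exactly the sound-engineer behavior described above.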
R-Squared and Residual Analysis
How good is your model? Measuring explained variance
R-Squared (R²)
R² answers one question: “What fraction of the variance in y does my model explain?”

R² = 1 − (SSᵣₑₛ / SSₜₒₜ)

SSᵣₑₛ = ∑(yᵢ − ŷᵢ)² — residual sum of squares (what the model gets wrong).
SSₜₒₜ = ∑(yᵢ − ȳ)² — total sum of squares (total variance in y).

R² = 0: model is no better than predicting the mean.
R² = 1: model explains all variance perfectly.
R² = 0.85: model explains 85% of variance.

The trap: R² always increases when you add more features, even useless ones. Adjusted R² penalizes for extra features:

Adj R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

where n = samples, k = features. If a new feature doesn’t improve the model enough, Adjusted R² goes down.
Residual Analysis in Code
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R² and RMSE
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.3f}, RMSE: ${rmse:,.0f}")
# R²: 0.847, RMSE: $38,200

# Adjusted R²
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"Adjusted R²: {adj_r2:.3f}")
# Adjusted R²: 0.841

# Residual analysis: check for patterns
residuals = y_test - y_pred
# Good: residuals randomly scattered around 0
# Bad: residuals show a curve → nonlinearity
# Bad: residuals fan out → heteroscedasticity
Key insight: R² is like a grade on an exam. 0.85 means you got 85% right. But just like a student who memorizes answers, a model can inflate its R² by adding junk features. Adjusted R² is the professor who deducts points for padding your essay with filler.
Complete Linear Regression Pipeline
From raw data to tuned model — everything together in scikit-learn
End-to-End: California Housing
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import (
    train_test_split, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso
)
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Load: 20,640 houses, 8 features, target = median price
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Compare three models
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
}
for name, m in models.items():
    m.fit(X_tr_s, y_tr)
    y_pred = m.predict(X_te_s)
    r2 = r2_score(y_te, y_pred)
    rmse = np.sqrt(mean_squared_error(y_te, y_pred))
    cv = cross_val_score(m, X_tr_s, y_tr, cv=5, scoring="r2")
    print(f"{name:6s} R²={r2:.3f} RMSE={rmse:.3f} "
          f"CV R²={cv.mean():.3f}±{cv.std():.3f}")
Expected Output
Results (California Housing, 20,640 samples):

  OLS    R²=0.606  RMSE=0.733  CV R²=0.606±0.012
  Ridge  R²=0.606  RMSE=0.733  CV R²=0.606±0.012
  Lasso  R²=0.604  RMSE=0.735  CV R²=0.604±0.012

# With 20K samples and 8 features, regularization
# barely changes results — OLS isn't overfitting.
# Regularization shines when d >> n or features
# are correlated.

Lasso feature selection (α=0.1):
  MedInc:      0.72  ← strongest predictor
  Latitude:   -0.65
  Longitude:  -0.61
  AveOccup:   -0.04
  HouseAge:    0.03
  AveRooms:    0.00  ← zeroed out by Lasso
  AveBedrms:   0.00  ← zeroed out
  Population:  0.00  ← zeroed out
Key insight: Linear regression is the “hello world” of ML, but it’s not a toy. With R² = 0.606 on California housing, it explains 60% of price variance using just 8 features and a straight line. The remaining 40% needs nonlinear models (trees, neural nets) — but linear regression tells you which features matter most, runs in milliseconds, and is fully interpretable. Always start here.