Ch 1 — What Is Machine Learning?

Recipes vs chefs — why we let data write the rules
Learning from Data vs Explicit Programming
The recipe vs the chef — a fundamental shift in how we build software
The Core Idea
Traditional programming is like following a recipe: you write exact rules for every situation. “If the email contains ‘free money,’ mark it as spam.” You, the programmer, encode all the knowledge.

Machine learning flips this. Instead of writing rules, you give the computer thousands of examples (labeled emails) and let it figure out the rules itself. ML is the chef who tastes ingredients and invents new recipes.

Formally: traditional programming takes rules + data → answers. Machine learning takes data + answers → rules. Arthur Samuel defined it in 1959 as “the field of study that gives computers the ability to learn without being explicitly programmed.”
The Two Paradigms
Traditional Programming:
    Input:  rules + data
    Output: answers
    Example:
        if temperature > 100: alert("fever")

Machine Learning:
    Input:  data + answers (labels)
    Output: rules (a model)
    Example:
        model.fit(patient_data, diagnoses)
        model.predict(new_patient)
        # The model discovers patterns like:
        # "temperature > 100.4 AND white_blood_cells > 11k
        #  → 87% probability of infection"
        # You never wrote that rule — the data did.
Key insight: ML is like hiring a chef instead of buying a cookbook. The cookbook (traditional code) works for known recipes, but the chef (ML) can taste new ingredients and invent dishes you never imagined. The catch? The chef needs to taste a lot of food first (training data).
The Three Learning Paradigms
Supervised, unsupervised, and reinforcement learning — the ML taxonomy
Supervised Learning
You provide labeled examples — input-output pairs — and the model learns the mapping. Like a student studying with an answer key.

Regression: Predict a continuous number. “Given square footage, bedrooms, and location, predict house price.” Output: $425,000.

Classification: Predict a category. “Given an email’s text, predict spam or not-spam.” Output: spam (probability 0.94).

The large majority of real-world ML applications use supervised learning. It works whenever you have historical data with known outcomes.
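The two supervised tasks can be sketched in a few lines of scikit-learn. This is a minimal illustration with made-up numbers (the prices and word counts are invented for the example), not a realistic model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous number (price from square footage)
X_reg = np.array([[1000], [1500], [2000], [2500]])      # sq ft
y_reg = np.array([200_000, 290_000, 410_000, 500_000])  # sale price
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1800]]))  # a continuous dollar estimate

# Classification: predict a category (spam vs not-spam from 2 features)
X_clf = np.array([[0, 1], [1, 0], [5, 9], [6, 8]])  # e.g. word counts
y_clf = np.array([0, 0, 1, 1])                      # 0 = not-spam, 1 = spam
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict([[5, 8]]))        # a category
print(clf.predict_proba([[5, 8]]))  # with a probability attached
```

Both calls have the same shape — `fit(inputs, labels)` then `predict(new_input)` — only the type of the output differs.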
Unsupervised Learning
No labels — the model finds hidden structure on its own. Like sorting a pile of laundry without anyone telling you what goes where.

Clustering: Group similar customers by purchase behavior. Dimensionality reduction: Compress 1,000 features into 50 that capture 95% of the variance. Anomaly detection: Find the one transaction out of millions that looks suspicious.
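The first two unsupervised tasks above can be sketched with scikit-learn on synthetic data — two well-separated "customer" blobs invented for the illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two obvious blobs of points in 2-D feature space — no labels anywhere
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Clustering: group similar points without being told what the groups are
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])  # cluster assignments the model invented itself

# Dimensionality reduction: compress 2 features into 1 component
pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)  # variance captured by 1 component
```

Note that `y` never appears — the structure (two groups, one dominant direction) is discovered from `X` alone.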
Reinforcement Learning
Learn by trial and error — an agent takes actions in an environment and receives rewards or penalties. Like training a dog: good behavior gets treats, bad behavior gets nothing.

The agent maximizes cumulative reward over time, balancing exploration (trying new things) vs exploitation (doing what already works). Used in game AI (AlphaGo), robotics, and recommendation systems.
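The explore-vs-exploit loop can be sketched with a toy two-armed bandit using epsilon-greedy action selection — the payout probabilities here are invented for illustration, and real RL problems add states and delayed rewards on top of this:

```python
import random

random.seed(42)
true_reward = {"A": 0.3, "B": 0.7}  # hidden payout probabilities
q = {"A": 0.0, "B": 0.0}            # the agent's reward estimates
counts = {"A": 0, "B": 0}
epsilon = 0.1                       # exploration rate

for step in range(2000):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if random.random() < epsilon:
        arm = random.choice(["A", "B"])
    else:
        arm = max(q, key=q.get)
    reward = 1 if random.random() < true_reward[arm] else 0
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]  # running-average update

print(q, counts)  # estimates drift toward the true payout probabilities
```

After enough steps the agent's estimates identify arm B as better and it pulls B most of the time — exploration paid for the knowledge that exploitation now uses.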
Quick Comparison
Paradigm       | Data      | Goal
-------------- | --------- | -----------------
Supervised     | Labeled   | Predict y from X
Unsupervised   | Unlabeled | Find structure
Reinforcement  | Rewards   | Maximize reward

This course focuses on supervised & unsupervised:
Ch 2-6: Supervised (regression, classification)
Ch 7-8: Unsupervised (clustering, PCA)
Ch 9-10: Evaluation & engineering
Key insight: Supervised learning is like studying with flashcards (question on front, answer on back). Unsupervised learning is like being handed a box of unlabeled photos and asked to organize them into albums. Reinforcement learning is like learning to ride a bike — you fall, adjust, and eventually balance.
The ML Pipeline
Data → Features → Model → Loss → Optimize → Evaluate
Six Stages of Every ML Project
1. Data Collection: Gather raw data. A house-price model needs thousands of sales records with prices, sizes, locations, and dates.

2. Feature Engineering: Transform raw data into numbers the model can use. Convert “3 bedrooms, downtown, built 1990” into a numerical vector [3, 0.95, 34]. This is often the highest-ROI step.

3. Model Selection: Choose an algorithm. Linear regression for simple relationships, random forests for complex ones, SVMs for high-dimensional data. Each has different assumptions.

4. Loss Function: Define what “wrong” means mathematically. For regression, typically Mean Squared Error. For classification, cross-entropy loss. The loss function is the model’s report card.

5. Optimization: Adjust model parameters to minimize the loss. Gradient descent is the workhorse: compute the gradient, take a step downhill, repeat.

6. Evaluation: Test on data the model has never seen. If it performs well on training data but poorly on test data, you’ve overfit.
The Pipeline in scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Data: X = features, y = target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Feature engineering: scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Model selection
model = LinearRegression()

# 4-5. Loss + Optimization (fit does both)
model.fit(X_train_s, y_train)

# 6. Evaluation
y_pred = model.predict(X_test_s)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.2f}")
Key insight: The ML pipeline is like building a house. Data is the land, features are the blueprints, the model is the construction crew, the loss function is the building inspector, optimization is fixing what the inspector flags, and evaluation is the final walkthrough. Skip any step and the house falls down.
The Mathematical Setup
Hypothesis spaces, loss functions, and empirical risk minimization
Hypothesis Space
A hypothesis space H is the set of all functions your model can possibly learn. When you choose “linear regression,” your hypothesis space is all straight lines: H = {f(x) = wx + b | w, b ∈ ℝ}. When you choose a degree-5 polynomial, H becomes all curves up to degree 5.

Choosing the hypothesis space is one of the most consequential decisions in ML. Too small (a line for curved data) and you can't capture the pattern. Too large (a degree-100 polynomial for 50 data points) and you'll memorize noise.
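The effect of the hypothesis space can be seen directly by fitting two different spaces to the same curved data. A small sketch on synthetic quadratic data (the noise level and sample size are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.3, size=x.shape)  # curved ground truth + noise

# H = all lines: cannot represent a parabola, no matter how it's fit
line = np.polyfit(x, y, deg=1)
# H = polynomials up to degree 5: contains the true function
poly = np.polyfit(x, y, deg=5)

mse_line = np.mean((np.polyval(line, x) - y) ** 2)
mse_poly = np.mean((np.polyval(poly, x) - y) ** 2)
print(mse_line, mse_poly)  # the richer hypothesis space fits far better
```

No amount of optimization rescues the line — the best function simply isn't in its hypothesis space.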
Loss Functions
A loss function ℓ(f(x), y) measures how wrong a single prediction is. Common choices:

Squared error: ℓ = (f(x) − y)² — penalizes large errors quadratically. Predicting $500K when the true price is $400K costs (100K)² = 10 billion “loss units.”

Absolute error: ℓ = |f(x) − y| — linear penalty, more robust to outliers.

0-1 loss: ℓ = 1 if f(x) ≠ y, 0 otherwise — for classification. Hard to optimize (not differentiable), so we use surrogates like log loss.
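These three losses are short enough to write out directly. A minimal NumPy sketch using the house-price numbers from the text:

```python
import numpy as np

def squared_error(y_pred, y_true):
    # Quadratic penalty: large errors dominate
    return (y_pred - y_true) ** 2

def absolute_error(y_pred, y_true):
    # Linear penalty: more robust to outliers
    return np.abs(y_pred - y_true)

def zero_one_loss(y_pred, y_true):
    # 1 for a wrong class, 0 for a right one (not differentiable)
    return (y_pred != y_true).astype(float)

# Predicting $500K when the true price is $400K:
print(squared_error(500_000, 400_000))   # 10 billion "loss units"
print(absolute_error(500_000, 400_000))  # 100,000
print(zero_one_loss(np.array([1, 0, 1]), np.array([1, 1, 1])))
```

Comparing the first two outputs shows why the choice of loss matters: squared error magnifies the same $100K mistake enormously relative to absolute error.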
Empirical Risk Minimization
The true risk (population risk) is the expected loss over all possible data:

R(f) = E[ℓ(f(X), Y)]

We can’t compute this because we don’t know the true data distribution. Instead, we approximate it with the empirical risk — the average loss over our n training samples:

R̂(f) = (1/n) ∑ ℓ(f(xᵢ), yᵢ)

Empirical Risk Minimization (ERM) picks the function f* from H that minimizes R̂. This is what model.fit() does: search through H to find the f that makes the average training loss as small as possible.
ERM in Symbols
True risk (unknown):
    R(f) = E[ ℓ(f(X), Y) ]

Empirical risk (computable):
    R̂(f) = (1/n) Σᵢ ℓ(f(xᵢ), yᵢ)

ERM objective:
    f* = argmin_{f ∈ H} R̂(f)

# In scikit-learn, model.fit(X, y) solves ERM.
# LinearRegression minimizes MSE.
# LogisticRegression minimizes log loss.
# The hypothesis space H is implicit in your model choice.
Key insight: ERM is like a student studying only from past exams (training data) and hoping the real exam (unseen data) is similar. If past exams are representative, the student does well. If they’re not — or if the student memorizes answers instead of understanding concepts — they fail. That gap between training performance and real performance is the central tension in all of ML.
The Bias-Variance Trade-off
The dartboard analogy — why you can’t have it all
The Dartboard Analogy
Imagine throwing darts at a bullseye. Each dart throw is like training a model on a different random sample of data:

Low bias, low variance: Darts clustered tightly around the bullseye. Your model is both accurate and consistent. This is the dream.

High bias, low variance: Darts clustered tightly, but consistently off-center. Your model is consistent but systematically wrong — it’s too simple to capture the true pattern.

Low bias, high variance: Darts scattered widely but centered around the bullseye on average. Your model captures the right pattern on average, but any single training run gives wildly different results.

High bias, high variance: Darts scattered and off-center. The worst of both worlds.
The Math
For squared error loss, the expected test error decomposes exactly:

    E[(y − f̂(x))²] = Bias²(f̂) + Var(f̂) + σ²

Where:
    Bias²(f̂) = (E[f̂(x)] − f(x))²
        → how far the average prediction is from the truth
    Var(f̂) = E[(f̂(x) − E[f̂(x)])²]
        → how much predictions vary across datasets
    σ² = irreducible error (noise in the data)
        → cannot be reduced by any model

The trade-off:
    Simple model (e.g., linear): high bias, low variance
    Complex model (e.g., degree-20 poly): low bias, high variance
    Sweet spot: minimize Bias² + Var
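The decomposition can be checked empirically: train many models on independently resampled datasets and measure, at a single test point, how far the average prediction sits from the truth (bias²) and how much predictions scatter (variance). A small simulation with a sine-curve ground truth chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)    # true function (unknown in practice)
x0, sigma = 1.5, 0.3       # test point and noise level

def bias_variance_at(deg, reps=500, n=15):
    """Fit `reps` polynomials of degree `deg` on fresh noisy samples;
    return (bias², variance) of the predictions at x0."""
    preds = []
    for _ in range(reps):
        x = rng.uniform(0, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(np.polyval(np.polyfit(x, y, deg), x0))
    preds = np.array(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()

for deg in (1, 3, 9):
    print(deg, bias_variance_at(deg))
```

The pattern that typically emerges matches the formula: the degree-1 fit is systematically off at x0 (high bias², low variance), while the degree-9 fit is right on average but scatters widely from one training set to the next (low bias², high variance).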
Key insight: The bias-variance trade-off is like adjusting the zoom on a camera. Zoom out too far (high bias) and you miss details. Zoom in too far (high variance) and every tiny shake blurs the image. The art of ML is finding the zoom level where you capture the subject clearly without amplifying camera shake.
Overfitting and Underfitting
Training error vs test error — the curves that tell you everything
Two Failure Modes
Underfitting (high bias): Your model is too simple. A straight line trying to fit a parabola. Both training error and test error are high. The model hasn’t learned the pattern — it’s like a student who didn’t study at all.

Overfitting (high variance): Your model is too complex. A degree-15 polynomial fitting 20 data points — it passes through every point perfectly (training error ≈ 0) but oscillates wildly between them. Test error is much higher than training error. The model memorized the training data, including its noise — like a student who memorized answers without understanding concepts.

The diagnostic: Plot training error and test error vs model complexity. Underfitting: both are high. Overfitting: training error is low but test error diverges upward. The sweet spot is where test error is minimized.
Visualizing with scikit-learn
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Vary tree depth from 1 (underfit) to 20 (overfit)
depths = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(), X, y,
    param_name="max_depth", param_range=depths,
    scoring="neg_mean_squared_error", cv=5
)

# Typical result:
# depth=1:  train_MSE=25, test_MSE=28  (underfit)
# depth=5:  train_MSE=8,  test_MSE=10  (sweet spot)
# depth=20: train_MSE=0,  test_MSE=35  (overfit)
Underfitting
Train error: high
Test error: high
Gap: small
Fix: more complex model, more features, less regularization
Good Fit
Train error: low
Test error: low
Gap: small
Goal: minimize test error while keeping the gap small
Overfitting
Train error: very low
Test error: high
Gap: large
Fix: simpler model, more data, regularization, dropout
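One of the fixes listed above — regularization — can be shown in action. This sketch deliberately overfits synthetic data with a degree-15 polynomial, then refits the same model with a ridge penalty; all data and the `alpha` value are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Small noisy sample from a quadratic ground truth
x_train = rng.uniform(-3, 3, (20, 1))
y_train = x_train.ravel() ** 2 + rng.normal(0, 1, 20)
x_test = rng.uniform(-3, 3, (200, 1))
y_test = x_test.ravel() ** 2 + rng.normal(0, 1, 200)

# Same over-complex hypothesis space, with and without regularization
plain = make_pipeline(PolynomialFeatures(15), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))
plain.fit(x_train, y_train)
ridge.fit(x_train, y_train)

plain_mse = mean_squared_error(y_test, plain.predict(x_test))
ridge_mse = mean_squared_error(y_test, ridge.predict(x_test))
print(plain_mse, ridge_mse)
```

The penalty shrinks the wild high-order coefficients toward zero, so the effective model is much simpler than its nominal degree — typically cutting the test error of the unregularized fit substantially.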
A Complete ML Example in 30 Lines
From raw data to evaluated model — the full pipeline in scikit-learn
End-to-End Example: Iris Classification
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    train_test_split, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report
)

# 1. Load data (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# 2. Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserve class ratios
)

# 3. Scale features to zero mean, unit variance
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 4. Train model (ERM: minimize log loss)
model = LogisticRegression(max_iter=200)
model.fit(X_train_s, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print(classification_report(y_test, y_pred))

# 6. Cross-validation (more robust estimate)
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.1%} ± {cv_scores.std():.1%}")

# Typical output:
# Accuracy: 100.0%
# CV Accuracy: 96.7% ± 2.1%
What Each Step Does
train_test_split: Holds out 20% of data the model never sees during training. This simulates “the real world.” The stratify=y parameter ensures each class appears proportionally in both sets.

StandardScaler: Transforms features to have mean=0 and std=1. Critical because many algorithms (logistic regression, SVMs, KNN) are sensitive to feature scales. A feature ranging 0–1000 would dominate one ranging 0–1.

fit_transform vs transform: fit_transform on training data learns the mean and std. transform on test data applies the same transformation. Never fit on test data — that’s data leakage.

cross_val_score: Splits training data into 5 folds, trains on 4, tests on 1, rotates. Gives 5 accuracy estimates. The mean is more reliable than a single train/test split.
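The fit_transform/transform rule above is worth seeing in isolation. A minimal sketch on synthetic data showing that the scaler's statistics come from the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(10, 2, (100, 1))
X_test = rng.normal(10, 2, (20, 1))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_s = scaler.transform(X_test)        # reuses those same statistics

# The scaler remembers training statistics only:
print(scaler.mean_, scaler.scale_)

# Wrong: scaler.fit_transform(X_test) would re-learn statistics from the
# test set — your evaluation would then peek at test data (leakage).
```

The scaled training data has mean ≈ 0 by construction; the scaled test data does not quite, and that slight mismatch is correct — it reflects exactly what happens when genuinely new data arrives.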
Key insight: The gap between the single test accuracy (100%) and the cross-validation accuracy (96.7%) tells you something important: with only 30 test samples, you can get lucky. Cross-validation averages over multiple splits, giving you a more honest estimate of how the model will perform in the wild.
When ML Beats Rules — and When It Doesn’t
The decision framework for choosing ML vs traditional programming
ML Wins When…
1. The rules are too complex to write: Recognizing faces, understanding speech, detecting fraud. The number of rules would be in the millions and constantly changing.

2. The rules change over time: Spam filters, recommendation engines, stock trading. What worked last month doesn’t work this month. ML adapts by retraining on new data.

3. You have lots of data but no domain expertise: Genomics, climate modeling, particle physics. The patterns exist in the data but humans can’t articulate them.

4. You need personalization at scale: Netflix recommendations for 200M users. You can’t write rules for each person, but ML can learn individual preferences.
Rules Win When…
1. The logic is simple and well-defined: Tax calculations, unit conversions, sorting algorithms. A formula or if-else chain is clearer, faster, and guaranteed correct.

2. You need 100% explainability: Medical dosing, legal compliance, safety-critical systems. “The model said so” isn’t acceptable when lives are at stake.

3. You have very little data: ML typically needs hundreds to millions of examples. With 10 data points, a domain expert writing rules will usually outperform any model.

4. The cost of errors is asymmetric and extreme: Nuclear reactor control, aircraft autopilot. A 99.9% accurate model still fails 1 in 1,000 times.
The Decision Checklist
Use ML if ALL of these are true:
    ✓ Pattern exists in the data
    ✓ You can't write the rules explicitly
    ✓ You have enough labeled data (or can get it)
    ✓ The pattern is relatively stable (or you can retrain)
    ✓ Approximate answers are acceptable

Use rules/heuristics if ANY of these are true:
    ✓ Logic is simple and well-defined
    ✓ You have fewer than ~100 examples
    ✓ 100% correctness is required
    ✓ Full explainability is legally required
    ✓ The problem doesn't change over time
Key insight: ML is a power tool, not a magic wand. A chainsaw is amazing for cutting trees but terrible for brain surgery. The best engineers know when to reach for ML and when a simple if-else statement is the right answer. In practice, most production systems use both: rules for the easy cases, ML for the hard ones.