Ch 1 — What Is Machine Learning?

Recipes vs chefs — why we let data write the rules
Learning from Data vs Explicit Programming
The recipe vs the chef — a fundamental shift in how we build software
The Core Idea
Traditional programming is like following a recipe: you write exact rules for every situation. “If the email contains ‘free money,’ mark it as spam.” You, the programmer, encode all the knowledge.

Machine learning flips this. Instead of writing rules, you give the computer thousands of examples (labeled emails) and let it figure out the rules itself. ML is the chef who tastes ingredients and invents new recipes.

Formally: traditional programming takes rules + data → answers. Machine learning takes data + answers → rules. Arthur Samuel defined it in 1959 as “the field of study that gives computers the ability to learn without being explicitly programmed.”
The Two Paradigms
Traditional Programming:
    Input:  rules + data
    Output: answers
    Example:
        if temperature > 100: alert("fever")

Machine Learning:
    Input:  data + answers (labels)
    Output: rules (a model)
    Example:
        model.fit(patient_data, diagnoses)
        model.predict(new_patient)
        # The model discovers patterns like:
        # "temperature > 100.4 AND white_blood_cells > 11k
        #  → 87% probability of infection"
        # You never wrote that rule — the data did.
Key insight: ML is like hiring a chef instead of buying a cookbook. The cookbook (traditional code) works for known recipes, but the chef (ML) can taste new ingredients and invent dishes you never imagined. The catch? The chef needs to taste a lot of food first (training data).
The Three Learning Paradigms
Supervised, unsupervised, and reinforcement learning — the ML taxonomy
Supervised Learning
You provide labeled examples — input-output pairs — and the model learns the mapping. Like a student studying with an answer key.

Regression: Predict a continuous number. “Given square footage, bedrooms, and location, predict house price.” Output: $425,000.

Classification: Predict a category. “Given an email’s text, predict spam or not-spam.” Output: spam (probability 0.94).

The large majority of real-world ML applications use supervised learning. It works whenever you have historical data with known outcomes.
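The two supervised tasks can be sketched in a few lines of scikit-learn. This is a minimal illustration with made-up numbers (the prices and word counts are invented for the example), not a realistic model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous number (price from square footage)
X_reg = np.array([[1000], [1500], [2000], [2500]])      # sq ft
y_reg = np.array([200_000, 290_000, 410_000, 500_000])  # sale price
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1800]]))  # a continuous dollar estimate

# Classification: predict a category (spam vs not-spam from 2 features)
X_clf = np.array([[0, 1], [1, 0], [5, 9], [6, 8]])  # e.g. word counts
y_clf = np.array([0, 0, 1, 1])                      # 0 = not-spam, 1 = spam
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict([[5, 8]]))        # a category
print(clf.predict_proba([[5, 8]]))  # with a probability attached
```

Both calls have the same shape — `fit(inputs, labels)` then `predict(new_input)` — only the type of the output differs.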
Unsupervised Learning
No labels — the model finds hidden structure on its own. Like sorting a pile of laundry without anyone telling you what goes where.

Clustering: Group similar customers by purchase behavior. Dimensionality reduction: Compress 1,000 features into 50 that capture 95% of the variance. Anomaly detection: Find the one transaction out of millions that looks suspicious.
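The first two unsupervised tasks above can be sketched with scikit-learn on synthetic data — two well-separated "customer" blobs invented for the illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two obvious blobs of points in 2-D feature space — no labels anywhere
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Clustering: group similar points without being told what the groups are
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])  # cluster assignments the model invented itself

# Dimensionality reduction: compress 2 features into 1 component
pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)  # variance captured by 1 component
```

Note that `y` never appears — the structure (two groups, one dominant direction) is discovered from `X` alone.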
Reinforcement Learning
Learn by trial and error — an agent takes actions in an environment and receives rewards or penalties. Like training a dog: good behavior gets treats, bad behavior gets nothing.

The agent maximizes cumulative reward over time, balancing exploration (trying new things) vs exploitation (doing what already works). Used in game AI (AlphaGo), robotics, and recommendation systems.
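The explore-vs-exploit loop can be sketched with a toy two-armed bandit using epsilon-greedy action selection — the payout probabilities here are invented for illustration, and real RL problems add states and delayed rewards on top of this:

```python
import random

random.seed(42)
true_reward = {"A": 0.3, "B": 0.7}  # hidden payout probabilities
q = {"A": 0.0, "B": 0.0}            # the agent's reward estimates
counts = {"A": 0, "B": 0}
epsilon = 0.1                       # exploration rate

for step in range(2000):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if random.random() < epsilon:
        arm = random.choice(["A", "B"])
    else:
        arm = max(q, key=q.get)
    reward = 1 if random.random() < true_reward[arm] else 0
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]  # running-average update

print(q, counts)  # estimates drift toward the true payout probabilities
```

After enough steps the agent's estimates identify arm B as better and it pulls B most of the time — exploration paid for the knowledge that exploitation now uses.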
Quick Comparison
Paradigm       | Data      | Goal
-------------- | --------- | -----------------
Supervised     | Labeled   | Predict y from X
Unsupervised   | Unlabeled | Find structure
Reinforcement  | Rewards   | Maximize reward

This course focuses on supervised & unsupervised:
Ch 2-6: Supervised (regression, classification)
Ch 7-8: Unsupervised (clustering, PCA)
Ch 9-10: Evaluation & engineering
Key insight: Supervised learning is like studying with flashcards (question on front, answer on back). Unsupervised learning is like being handed a box of unlabeled photos and asked to organize them into albums. Reinforcement learning is like learning to ride a bike — you fall, adjust, and eventually balance.
The ML Pipeline
Data → Features → Model → Loss → Optimize → Evaluate
Six Stages of Every ML Project
1. Data Collection: Gather raw data. A house-price model needs thousands of sales records with prices, sizes, locations, and dates.

2. Feature Engineering: Transform raw data into numbers the model can use. Convert “3 bedrooms, downtown, built 1990” into a numerical vector [3, 0.95, 34]. This is often the highest-ROI step.

3. Model Selection: Choose an algorithm. Linear regression for simple relationships, random forests for complex ones, SVMs for high-dimensional data. Each has different assumptions.

4. Loss Function: Define what “wrong” means mathematically. For regression, typically Mean Squared Error. For classification, cross-entropy loss. The loss function is the model’s report card.

5. Optimization: Adjust model parameters to minimize the loss. Gradient descent is the workhorse: compute the gradient, take a step downhill, repeat.

6. Evaluation: Test on data the model has never seen. If it performs well on training data but poorly on test data, you’ve overfit.
The Pipeline in scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Data: X = features, y = target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Feature engineering: scale features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Model selection
model = LinearRegression()

# 4-5. Loss + Optimization (fit does both)
model.fit(X_train_s, y_train)

# 6. Evaluation
y_pred = model.predict(X_test_s)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.2f}")
Key insight: The ML pipeline is like building a house. Data is the land, features are the blueprints, the model is the construction crew, the loss function is the building inspector, optimization is fixing what the inspector flags, and evaluation is the final walkthrough. Skip any step and the house falls down.
The Mathematical Setup
Hypothesis spaces, loss functions, and empirical risk minimization
Hypothesis Space
A hypothesis space H is the set of all functions your model can possibly learn. When you choose “linear regression,” your hypothesis space is all straight lines: H = {f(x) = wx + b | w, b ∈ ℝ}. When you choose a degree-5 polynomial, H becomes all curves up to degree 5.

Choosing the hypothesis space is one of the most consequential decisions in ML. Too small (a line for curved data) and you can't capture the pattern. Too large (a degree-100 polynomial for 50 data points) and you'll memorize noise.
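The effect of the hypothesis space can be seen directly by fitting two different spaces to the same curved data. A small sketch on synthetic quadratic data (the noise level and sample size are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.3, size=x.shape)  # curved ground truth + noise

# H = all lines: cannot represent a parabola, no matter how it's fit
line = np.polyfit(x, y, deg=1)
# H = polynomials up to degree 5: contains the true function
poly = np.polyfit(x, y, deg=5)

mse_line = np.mean((np.polyval(line, x) - y) ** 2)
mse_poly = np.mean((np.polyval(poly, x) - y) ** 2)
print(mse_line, mse_poly)  # the richer hypothesis space fits far better
```

No amount of optimization rescues the line — the best function simply isn't in its hypothesis space.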
Loss Functions
A loss function ℓ(f(x), y) measures how wrong a single prediction is. Common choices:

Squared error: ℓ = (f(x) − y)² — penalizes large errors quadratically. Predicting $500K when the true price is $400K costs (100K)² = 10 billion “loss units.”

Absolute error: ℓ = |f(x) − y| — linear penalty, more robust to outliers.

0-1 loss: ℓ = 1 if f(x) ≠ y, 0 otherwise — for classification. Hard to optimize (not differentiable), so we use surrogates like log loss.
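These three losses are short enough to write out directly. A minimal NumPy sketch using the house-price numbers from the text:

```python
import numpy as np

def squared_error(y_pred, y_true):
    # Quadratic penalty: large errors dominate
    return (y_pred - y_true) ** 2

def absolute_error(y_pred, y_true):
    # Linear penalty: more robust to outliers
    return np.abs(y_pred - y_true)

def zero_one_loss(y_pred, y_true):
    # 1 for a wrong class, 0 for a right one (not differentiable)
    return (y_pred != y_true).astype(float)

# Predicting $500K when the true price is $400K:
print(squared_error(500_000, 400_000))   # 10 billion "loss units"
print(absolute_error(500_000, 400_000))  # 100,000
print(zero_one_loss(np.array([1, 0, 1]), np.array([1, 1, 1])))
```

Comparing the first two outputs shows why the choice of loss matters: squared error magnifies the same $100K mistake enormously relative to absolute error.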
Empirical Risk Minimization
The true risk (population risk) is the expected loss over all possible data:

R(f) = E[ℓ(f(X), Y)]

We can’t compute this because we don’t know the true data distribution. Instead, we approximate it with the empirical risk — the average loss over our n training samples:

R̂(f) = (1/n) ∑ ℓ(f(xᵢ), yᵢ)

Empirical Risk Minimization (ERM) picks the function f* from H that minimizes R̂. This is what model.fit() does: search through H to find the f that makes the average training loss as small as possible.
ERM in Symbols
True risk (unknown):
    R(f) = E[ ℓ(f(X), Y) ]

Empirical risk (computable):
    R̂(f) = (1/n) Σᵢ ℓ(f(xᵢ), yᵢ)

ERM objective:
    f* = argmin_{f ∈ H} R̂(f)

# In scikit-learn, model.fit(X, y) solves ERM.
# LinearRegression minimizes MSE.
# LogisticRegression minimizes log loss.
# The hypothesis space H is implicit in your model choice.
Key insight: ERM is like a student studying only from past exams (training data) and hoping the real exam (unseen data) is similar. If past exams are representative, the student does well. If they’re not — or if the student memorizes answers instead of understanding concepts — they fail. That gap between training performance and real performance is the central tension in all of ML.
The Bias-Variance Trade-off
The dartboard analogy — why you can’t have it all
The Dartboard Analogy
Imagine throwing darts at a bullseye. Each dart throw is like training a model on a different random sample of data:

Low bias, low variance: Darts clustered tightly around the bullseye. Your model is both accurate and consistent. This is the dream.

High bias, low variance: Darts clustered tightly, but consistently off-center. Your model is consistent but systematically wrong — it’s too simple to capture the true pattern.

Low bias, high variance: Darts scattered widely but centered around the bullseye on average. Your model captures the right pattern on average, but any single training run gives wildly different results.

High bias, high variance: Darts scattered and off-center. The worst of both worlds.
The Math
For squared error loss, the expected test error decomposes exactly:

    E[(y − f̂(x))²] = Bias²(f̂) + Var(f̂) + σ²

Where:
    Bias²(f̂) = (E[f̂(x)] − f(x))²
        → how far the average prediction is from the truth
    Var(f̂) = E[(f̂(x) − E[f̂(x)])²]
        → how much predictions vary across datasets
    σ² = irreducible error (noise in the data)
        → cannot be reduced by any model

The trade-off:
    Simple model (e.g., linear): high bias, low variance
    Complex model (e.g., degree-20 poly): low bias, high variance
    Sweet spot: minimize Bias² + Var
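The decomposition can be checked empirically: train many models on independently resampled datasets and measure, at a single test point, how far the average prediction sits from the truth (bias²) and how much predictions scatter (variance). A small simulation with a sine-curve ground truth chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)    # true function (unknown in practice)
x0, sigma = 1.5, 0.3       # test point and noise level

def bias_variance_at(deg, reps=500, n=15):
    """Fit `reps` polynomials of degree `deg` on fresh noisy samples;
    return (bias², variance) of the predictions at x0."""
    preds = []
    for _ in range(reps):
        x = rng.uniform(0, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(np.polyval(np.polyfit(x, y, deg), x0))
    preds = np.array(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()

for deg in (1, 3, 9):
    print(deg, bias_variance_at(deg))
```

The pattern that typically emerges matches the formula: the degree-1 fit is systematically off at x0 (high bias², low variance), while the degree-9 fit is right on average but scatters widely from one training set to the next (low bias², high variance).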
Key insight: The bias-variance trade-off is like adjusting the zoom on a camera. Zoom out too far (high bias) and you miss details. Zoom in too far (high variance) and every tiny shake blurs the image. The art of ML is finding the zoom level where you capture the subject clearly without amplifying camera shake.
Overfitting and Underfitting
Training error vs test error — the curves that tell you everything
Two Failure Modes
Underfitting (high bias): Your model is too simple. A straight line trying to fit a parabola. Both training error and test error are high. The model hasn’t learned the pattern — it’s like a student who didn’t study at all.

Overfitting (high variance): Your model is too complex. A degree-15 polynomial fitting 20 data points — it passes through every point perfectly (training error ≈ 0) but oscillates wildly between them. Test error is much higher than training error. The model memorized the training data, including its noise — like a student who memorized answers without understanding concepts.

The diagnostic: Plot training error and test error vs model complexity. Underfitting: both are high. Overfitting: training error is low but test error diverges upward. The sweet spot is where test error is minimized.
Visualizing with scikit-learn
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Vary tree depth from 1 (underfit) to 20 (overfit)
depths = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(), X, y,
    param_name="max_depth", param_range=depths,
    scoring="neg_mean_squared_error", cv=5
)

# Typical result:
# depth=1:  train_MSE=25, test_MSE=28  (underfit)
# depth=5:  train_MSE=8,  test_MSE=10  (sweet spot)
# depth=20: train_MSE=0,  test_MSE=35  (overfit)
Underfitting
Train error: high
Test error: high
Gap: small
Fix: more complex model, more features, less regularization
Good Fit
Train error: low
Test error: low
Gap: small
Goal: minimize test error while keeping the gap small
Overfitting
Train error: very low
Test error: high
Gap: large
Fix: simpler model, more data, regularization, dropout
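One of the fixes listed above — regularization — can be shown in action. This sketch deliberately overfits synthetic data with a degree-15 polynomial, then refits the same model with a ridge penalty; all data and the `alpha` value are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Small noisy sample from a quadratic ground truth
x_train = rng.uniform(-3, 3, (20, 1))
y_train = x_train.ravel() ** 2 + rng.normal(0, 1, 20)
x_test = rng.uniform(-3, 3, (200, 1))
y_test = x_test.ravel() ** 2 + rng.normal(0, 1, 200)

# Same over-complex hypothesis space, with and without regularization
plain = make_pipeline(PolynomialFeatures(15), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))
plain.fit(x_train, y_train)
ridge.fit(x_train, y_train)

plain_mse = mean_squared_error(y_test, plain.predict(x_test))
ridge_mse = mean_squared_error(y_test, ridge.predict(x_test))
print(plain_mse, ridge_mse)
```

The penalty shrinks the wild high-order coefficients toward zero, so the effective model is much simpler than its nominal degree — typically cutting the test error of the unregularized fit substantially.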
A Complete ML Example in 30 Lines
From raw data to evaluated model — the full pipeline in scikit-learn
End-to-End Example: Iris Classification
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    train_test_split, cross_val_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, classification_report
)

# 1. Load data (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# 2. Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # preserve class ratios
)

# 3. Scale features to zero mean, unit variance
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 4. Train model (ERM: minimize log loss)
model = LogisticRegression(max_iter=200)
model.fit(X_train_s, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print(classification_report(y_test, y_pred))

# 6. Cross-validation (more robust estimate)
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.1%} ± {cv_scores.std():.1%}")

# Typical output:
# Accuracy: 100.0%
# CV Accuracy: 96.7% ± 2.1%
What Each Step Does
train_test_split: Holds out 20% of data the model never sees during training. This simulates “the real world.” The stratify=y parameter ensures each class appears proportionally in both sets.

StandardScaler: Transforms features to have mean=0 and std=1. Critical because many algorithms (logistic regression, SVMs, KNN) are sensitive to feature scales. A feature ranging 0–1000 would dominate one ranging 0–1.

fit_transform vs transform: fit_transform on training data learns the mean and std. transform on test data applies the same transformation. Never fit on test data — that’s data leakage.

cross_val_score: Splits training data into 5 folds, trains on 4, tests on 1, rotates. Gives 5 accuracy estimates. The mean is more reliable than a single train/test split.
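The fit_transform/transform rule above is worth seeing in isolation. A minimal sketch on synthetic data showing that the scaler's statistics come from the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(10, 2, (100, 1))
X_test = rng.normal(10, 2, (20, 1))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_s = scaler.transform(X_test)        # reuses those same statistics

# The scaler remembers training statistics only:
print(scaler.mean_, scaler.scale_)

# Wrong: scaler.fit_transform(X_test) would re-learn statistics from the
# test set — your evaluation would then peek at test data (leakage).
```

The scaled training data has mean ≈ 0 by construction; the scaled test data does not quite, and that slight mismatch is correct — it reflects exactly what happens when genuinely new data arrives.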
Key insight: The gap between the single test accuracy (100%) and the cross-validation accuracy (96.7%) tells you something important: with only 30 test samples, you can get lucky. Cross-validation averages over multiple splits, giving you a more honest estimate of how the model will perform in the wild.
When ML Beats Rules — and When It Doesn’t
The decision framework for choosing ML vs traditional programming
ML Wins When…
1. The rules are too complex to write: Recognizing faces, understanding speech, detecting fraud. The number of rules would be in the millions and constantly changing.

2. The rules change over time: Spam filters, recommendation engines, stock trading. What worked last month doesn’t work this month. ML adapts by retraining on new data.

3. You have lots of data but no domain expertise: Genomics, climate modeling, particle physics. The patterns exist in the data but humans can’t articulate them.

4. You need personalization at scale: Netflix recommendations for 200M users. You can’t write rules for each person, but ML can learn individual preferences.
Rules Win When…
1. The logic is simple and well-defined: Tax calculations, unit conversions, sorting algorithms. A formula or if-else chain is clearer, faster, and guaranteed correct.

2. You need 100% explainability: Medical dosing, legal compliance, safety-critical systems. “The model said so” isn’t acceptable when lives are at stake.

3. You have very little data: ML typically needs hundreds to millions of examples. With 10 data points, a domain expert writing rules will usually outperform any model.

4. The cost of errors is asymmetric and extreme: Nuclear reactor control, aircraft autopilot. A 99.9% accurate model still fails 1 in 1,000 times.
The Decision Checklist
Use ML if ALL of these are true:
    ✓ Pattern exists in the data
    ✓ You can't write the rules explicitly
    ✓ You have enough labeled data (or can get it)
    ✓ The pattern is relatively stable (or you can retrain)
    ✓ Approximate answers are acceptable

Use rules/heuristics if ANY of these are true:
    ✓ Logic is simple and well-defined
    ✓ You have fewer than ~100 examples
    ✓ 100% correctness is required
    ✓ Full explainability is legally required
    ✓ The problem doesn't change over time
Key insight: ML is a power tool, not a magic wand. A chainsaw is amazing for cutting trees but terrible for brain surgery. The best engineers know when to reach for ML and when a simple if-else statement is the right answer. In practice, most production systems use both: rules for the easy cases, ML for the hard ones.