Ch 3 — Machine Learning Paradigms

Supervised, unsupervised, and reinforcement learning — the three ways machines learn from data
Three Ways Machines Learn
The fundamental paradigms of machine learning
The Core Question
Machine learning is the science of getting computers to learn from data without being explicitly programmed. But how a system learns depends on what kind of feedback it receives. This gives us three fundamental paradigms, each suited to different problems.
Supervised: “Learn from labeled examples”
Input: data + correct answers
Goal: predict answers for new data

Unsupervised: “Find hidden structure”
Input: data only (no labels)
Goal: discover patterns & groups

Reinforcement: “Learn by trial and reward”
Input: environment + reward signal
Goal: maximize cumulative reward
Analogy
Supervised: A teacher shows you flashcards with questions and answers. You learn the mapping and can answer new questions.

Unsupervised: You’re given a pile of photos with no labels. You naturally group them by similarity — landscapes, portraits, animals.

Reinforcement: You learn to ride a bike. Nobody tells you the “correct” action at each moment — you try things, fall, adjust, and gradually improve through feedback.
Most modern AI uses supervised learning (or its variants). Image classifiers, spam filters, language models, and recommendation systems all learn from labeled or structured data. But the lines are blurring — LLMs use self-supervised pretraining + RLHF.
Supervised Learning: Classification
Predicting which category something belongs to
How It Works
Given a dataset of input features (X) and correct labels (y), the algorithm learns a function f(X) → y that maps inputs to categories. Once trained, it can classify new, unseen inputs. The model learns by minimizing the difference between its predictions and the true labels.
Real-World Examples
Email spam detection: Features = word frequencies, sender, links. Labels = spam/not spam.
Medical diagnosis: Features = symptoms, test results. Labels = disease/healthy.
Image recognition: Features = pixel values. Labels = cat/dog/bird.
Fraud detection: Features = transaction data. Labels = fraudulent/legitimate.
# Classification algorithms
Logistic Regression: linear boundary, probability output. Fast, interpretable, good baseline.
Decision Trees / Random Forests: series of if/then splits. Handles mixed data types well.
Support Vector Machines (SVM): finds the optimal separating hyperplane. Effective in high dimensions.
Neural Networks: learn complex non-linear boundaries. Dominate when data is abundant.
k-Nearest Neighbors (k-NN): classifies by majority vote of neighbors. Simple but slow at scale.
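To make the supervised setup concrete, here is a minimal sketch of k-NN, the simplest algorithm in the list above. The toy "spam" features (link count, exclamation count) and labels are invented purely for illustration:

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy labeled dataset: (link count, exclamation count) per email
X_train = [(0, 1), (1, 0), (1, 1), (8, 9), (9, 8), (9, 9)]
y_train = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(X_train, y_train, (8, 8)))  # spam
print(knn_predict(X_train, y_train, (0, 0)))  # ham
```

Note how the labeled examples (X, y) play exactly the role described above: the "function" f(X) → y here is nothing more than a lookup against the training set, which is why k-NN needs no training phase but gets slow as the dataset grows.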
The bias-variance tradeoff: Simple models (logistic regression) may underfit complex data. Complex models (deep neural nets) may overfit to noise. The art of ML is finding the right balance for your data and problem.
Supervised Learning: Regression
Predicting continuous numerical values
Classification vs Regression
Classification predicts categories (spam/not spam). Regression predicts numbers (house price = $450,000). Same supervised framework — learn from labeled examples — but the output is continuous rather than discrete.
Real-World Examples
House price prediction: Features = sq ft, bedrooms, location. Output = price.
Stock forecasting: Features = historical prices, volume. Output = future price.
Weather prediction: Features = temperature, pressure, humidity. Output = tomorrow’s temperature.
Ad revenue estimation: Features = clicks, impressions. Output = revenue.
# Linear regression — the simplest model
y = w⋅x + b

# Example: predict house price
price = 200 × sqft + 50000 × bedrooms + 30000 × garage - 10000

# The model learns weights (w) and bias (b) by minimizing
# prediction error on training data (labeled examples).

# Loss function: Mean Squared Error
MSE = (1/n) × ∑(y_pred - y_true)²
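For one feature, the least-squares weights above have a closed form, which makes for a compact sketch. The synthetic "house" data is generated from a known line (price = 200·sqft + 50000) so we can check that the fit recovers it; the numbers are illustrative, not real prices:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b (closed form for a single feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # w = covariance(x, y) / variance(x); b shifts the line through the means
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Synthetic data from a known relationship: price = 200*sqft + 50000
sqft = [1000, 1500, 2000, 2500]
price = [200 * s + 50000 for s in sqft]

w, b = fit_line(sqft, price)
mse = sum((w * x + b - y) ** 2 for x, y in zip(sqft, price)) / len(sqft)
print(w, b, mse)  # recovers w ≈ 200, b ≈ 50000, MSE ≈ 0
```

On noisy real data the MSE would not reach zero; the fitted (w, b) would instead be the line that minimizes it, which is exactly the learning objective stated above.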
Beyond linear: Real relationships are rarely linear. Polynomial regression, decision tree regressors, and neural networks can model complex non-linear relationships. Deep learning excels when the mapping from input to output is highly complex (e.g., image → age estimation).
Unsupervised Learning: Clustering
Discovering natural groups in unlabeled data
No Labels, No Problem
Unsupervised learning works with unlabeled data — no correct answers are provided. The algorithm must discover structure on its own. Clustering groups similar data points together, revealing natural categories the data contains.
Real-World Examples
Customer segmentation: Group customers by purchasing behavior to target marketing.
Document clustering: Organize news articles by topic without predefined categories.
Anomaly detection: Find unusual patterns that don’t fit any cluster (fraud, network intrusion).
Gene expression: Group genes with similar expression patterns to discover biological functions.
# K-Means clustering
1. Choose K (number of clusters)
2. Randomly place K centroids
3. Assign each point to nearest centroid
4. Recalculate centroids as cluster means
5. Repeat 3-4 until convergence

# Other clustering algorithms
Hierarchical: builds a tree of nested clusters
DBSCAN: density-based, finds arbitrary shapes
Gaussian Mixture: probabilistic, soft assignments
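Steps 3-5 above can be sketched directly in Python. This toy version works on 1-D points and takes fixed starting centroids (real implementations initialize them randomly, per step 2); the data is invented so the two clusters are obvious:

```python
def kmeans_1d(points, centroids, iters=10):
    """K-Means on 1-D data: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat."""
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # A centroid with no points keeps its old position
        centroids = [sum(ps) / len(ps) if ps else centroids[c]
                     for c, ps in clusters.items()]
    return centroids

# Unlabeled data with two natural groups, around 1.0 and 8.0
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = kmeans_1d(points, [0.0, 10.0])
print(centroids)  # centroids settle near 1.0 and 8.0
```

No labels appear anywhere: the algorithm only measures distances between points, which is the defining feature of the unsupervised setting.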
The K problem: K-Means requires you to specify the number of clusters in advance. Choosing the wrong K gives meaningless results. Techniques like the elbow method and silhouette analysis help, but cluster count is often a judgment call.
Unsupervised: Dimensionality Reduction
Compressing high-dimensional data while preserving structure
The Curse of Dimensionality
Real-world data often has hundreds or thousands of features (dimensions). High-dimensional data is hard to visualize, slow to process, and prone to overfitting. Dimensionality reduction compresses data into fewer dimensions while preserving the most important information.
Key Techniques
PCA (Principal Component Analysis): Finds the directions of maximum variance and projects data onto them. For example, it can reduce 1000 features to 50 while retaining roughly 95% of the variance.

t-SNE / UMAP: Non-linear methods that preserve local structure. Excellent for visualizing high-dimensional data in 2D or 3D. Widely used to visualize embeddings and clusters.
# PCA — the idea
Original data: 1000 features per sample
After PCA: 50 features per sample
Info retained: ~95% of variance

# How it works:
1. Find direction of maximum variance (PC1)
2. Find next orthogonal direction (PC2)
3. Repeat for K components
4. Project data onto top-K components

# Use cases:
Visualization (reduce to 2D/3D)
Preprocessing (speed up downstream ML)
Noise removal (drop low-variance dims)
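Step 1 above (finding PC1) can be sketched without any linear-algebra library by running power iteration on the covariance matrix. This toy version handles 2-D points only, and the data is constructed to lie near the line y = 2x so we know PC1 should point along (1, 2):

```python
def top_component(data, iters=100):
    """Approximate PC1 (direction of maximum variance) for 2-D points
    via power iteration on the 2x2 covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Covariance matrix entries: [[cxx, cxy], [cxy, cyy]]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    vx, vy = 1.0, 1.0  # arbitrary starting vector
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalize
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = (vx * vx + vy * vy) ** 0.5
        vx, vy = vx / norm, vy / norm
    return vx, vy

# Points lying (almost) on y = 2x: PC1 should be ~(1, 2)/sqrt(5)
data = [(1, 2), (2, 4), (3, 6), (4, 8.1)]
pc1 = top_component(data)
print(pc1)
```

Projecting each centered point onto pc1 yields the 1-D compressed representation; because the points are nearly collinear, that single coordinate retains almost all of the variance.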
Autoencoders are a neural network approach to dimensionality reduction. They learn to compress data into a small “bottleneck” layer and reconstruct it. The bottleneck representation captures the most important features. Variational autoencoders (VAEs) extend this for generative modeling (Ch 11).
Reinforcement Learning
Learning by trial, error, and reward
The Agent-Environment Loop
An agent observes the current state of an environment, takes an action, receives a reward (positive or negative), and transitions to a new state. The goal: learn a policy (strategy) that maximizes cumulative reward over time. No labeled examples — the agent discovers what works through exploration.
Real-World Examples
Game playing: AlphaGo, Atari DQN, OpenAI Five (Dota 2)
Robotics: Learning to walk, grasp objects, navigate
Recommendation: Optimizing long-term user engagement
LLM alignment: RLHF trains models to be helpful and safe (Ch 12)
# The RL loop
while not done:
    state = observe(environment)
    action = policy(state)  # choose action
    reward = environment.step(action)
    update policy based on reward

# The exploration-exploitation dilemma
Explore: try new actions (might discover better)
Exploit: use best known action (safe but limited)

# Balance via epsilon-greedy:
# With probability ε, explore randomly
# Otherwise, take the best known action
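The epsilon-greedy balance above can be sketched on a multi-armed bandit, the simplest RL setting (one state, no delayed rewards). The arm values and noise level here are invented for illustration:

```python
import random

def run_bandit(true_rewards, episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a multi-armed bandit: with probability epsilon
    explore a random arm, otherwise exploit the best-known arm."""
    rng = random.Random(seed)
    q = [0.0] * len(true_rewards)      # estimated value of each arm
    counts = [0] * len(true_rewards)   # pulls per arm
    for _ in range(episodes):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rewards))       # explore
        else:
            arm = max(range(len(q)), key=q.__getitem__)  # exploit
        reward = true_rewards[arm] + rng.gauss(0, 0.1)   # noisy reward
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]        # incremental mean
    return q

q = run_bandit([0.2, 0.5, 0.8])
best = max(range(3), key=q.__getitem__)
print(best, q)  # the agent's estimates converge toward the true arm values
```

With epsilon = 0 the agent can lock onto whichever arm looked good first and never discover the best one; the occasional random pull is what keeps its estimates honest.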
Delayed rewards: In chess, you don’t know if a move was good until many moves later. RL must assign “credit” to past actions that contributed to eventual success or failure. This credit assignment problem is one of the hardest challenges in RL.
Beyond the Big Three
Self-supervised, semi-supervised, and transfer learning
Self-Supervised Learning
The model creates its own labels from the data. GPT-style pretraining hides the next word and predicts it from the words before it. BERT masks random words and fills them in. Contrastive learning (SimCLR, CLIP) learns by comparing augmented views of the same data. This is how modern foundation models are trained — no human labeling required for pretraining.
Semi-Supervised Learning
Uses a small amount of labeled data plus a large amount of unlabeled data. The model learns general patterns from unlabeled data and refines with labels. Practical when labeling is expensive (medical imaging, satellite photos).
Transfer Learning
Pretrain on a large general dataset, then fine-tune on a small task-specific dataset. ImageNet-pretrained CNNs transfer to medical imaging. GPT pretrained on internet text transfers to customer support. This is the dominant paradigm in modern AI — almost no one trains from scratch anymore.
# The modern recipe
1. Self-supervised pretrain
   Train on massive unlabeled data (GPT: predict next token)
2. Supervised fine-tune
   Adapt to specific task with labels (instruction tuning)
3. RLHF alignment
   Reinforce helpful, safe behavior (reward model from human prefs)
This is how ChatGPT was built: Self-supervised pretraining (predict next token on internet text) + supervised fine-tuning (instruction following) + RLHF (align with human preferences). All three paradigms in one system.
Choosing the Right Paradigm
A practical decision framework
Decision Guide
Do you have labeled data?
Yes → Supervised learning
  Predict category? → Classification
  Predict number? → Regression
No → Unsupervised learning
  Find groups? → Clustering
  Reduce features? → Dimensionality reduction
  Find outliers? → Anomaly detection
Is there an environment with rewards?
Yes → Reinforcement learning (sequential decisions with feedback)
Have massive unlabeled data?
Yes → Self-supervised pretraining, then fine-tune with a small labeled set
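One possible encoding of this guide as a rule chain is sketched below; the question order (environment first, then data) is a judgment call, and real projects often mix paradigms rather than picking exactly one:

```python
def choose_paradigm(labeled: bool, env_rewards: bool,
                    massive_unlabeled: bool) -> str:
    """A sketch of the decision guide above, not a rulebook."""
    if env_rewards:
        # Sequential decisions with feedback
        return "reinforcement learning"
    if labeled:
        return "supervised learning"
    if massive_unlabeled:
        return "self-supervised pretraining, then fine-tune"
    return "unsupervised learning"

print(choose_paradigm(labeled=True, env_rewards=False,
                      massive_unlabeled=False))  # supervised learning
```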
Key Takeaways
1. Supervised learning is the workhorse — most production ML is supervised

2. Unsupervised learning discovers hidden structure without labels

3. RL learns from interaction and delayed rewards

4. Self-supervised learning powers modern foundation models

5. Transfer learning means you rarely train from scratch

6. Modern systems combine multiple paradigms (pretrain + fine-tune + RLHF)
Coming up: Ch 4 covers how to prepare data for these paradigms. Ch 5–6 dive into the neural network mechanics that power supervised learning. Ch 12 goes deep on reinforcement learning and RLHF.