Ch 5 — Supervised Learning: Teaching by Example

The most widely deployed form of machine learning in the enterprise — and the foundation of most AI you already use
High Level: Label → Train → Classify / Predict → Evaluate → Deploy
The Core Idea
Learning from labeled examples
How It Works
Supervised learning is the most straightforward form of machine learning. You provide the system with inputs paired with correct outputs — labeled examples. The model studies these pairs and learns to map inputs to outputs. Then, when given a new input it hasn’t seen before, it predicts the output based on what it learned.
A Concrete Example
A bank wants to predict which loan applicants will default. It feeds the model 10 years of historical loan data: income, credit score, employment history, loan amount, and the outcome — defaulted or repaid. The model learns which combinations of factors correlate with default. When a new applicant applies, the model predicts the probability of default based on their profile.
Why “Supervised”
The term comes from the fact that the model learns under supervision — it has a teacher (the labeled data) that tells it the right answer for each example. During training, the model makes a prediction, compares it to the correct label, measures the error, and adjusts. This feedback loop is what drives learning.
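The feedback loop above — predict, compare to the label, measure the error, adjust — can be sketched in a few lines. This is a minimal illustration with a one-weight model and made-up data, not any particular production algorithm:

```python
# A minimal sketch of the supervised feedback loop: predict, compare to the
# correct label, measure the error, adjust. Toy data where output = 2 * input.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, labeled output) pairs

w = 0.0     # the model: predict = w * x; starts knowing nothing
lr = 0.05   # learning rate: how big each adjustment is

for epoch in range(200):
    for x, y in data:
        pred = w * x          # 1. the model makes a prediction
        error = pred - y      # 2. compare it to the correct label
        w -= lr * error * x   # 3. adjust to shrink the error

print(round(w, 2))  # → 2.0: the model has recovered the input-output mapping
```

Every supervised algorithm, from linear regression to deep networks, is some elaboration of this loop: the labeled data supplies the "teacher", and the adjustment rule drives the learning.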
Key insight: Supervised learning requires labeled data, and labeling is expensive. The quality and quantity of your labels directly determines the quality of the model. This is why data preparation (Chapter 4) is so critical — the labels are the supervision.
Classification: Sorting Into Categories
Is this A or B? The most common supervised learning task
What Classification Does
Classification assigns an input to one of a predefined set of categories. The output is a label, not a number. Binary classification has two categories (spam/not spam, fraud/legitimate, will churn/won’t churn). Multi-class classification has more (which product category? which customer segment? which disease diagnosis?).
Enterprise Examples
Email spam filtering — Classify each email as spam or legitimate.
Credit scoring — Classify applicants as low, medium, or high risk. The ECB reports significant growth in European banks using AI for credit scoring between 2023 and 2024.
Customer churn — Classify customers as likely to leave or likely to stay.
Medical diagnosis — Classify a scan as showing a tumor or not.
How the Model Decides
The model doesn’t output a simple yes/no. It outputs a probability — “there is an 87% chance this transaction is fraudulent.” The business then sets a threshold: flag anything above 80%? 90%? 95%? This threshold is a business decision, not a technical one. A lower threshold catches more fraud but creates more false alarms. A higher threshold misses more fraud but reduces false positives.
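The tradeoff is easy to see with numbers. Here the same model scores are judged at two different business thresholds; the scores and fraud labels are hypothetical:

```python
# Same model output, two business thresholds. Scores are the model's
# fraud probabilities; labels are the (made-up) ground truth.
scores = [0.95, 0.88, 0.85, 0.92, 0.40, 0.10]
fraud  = [True, True, False, True, False, False]

def flag_counts(threshold):
    flagged = [s >= threshold for s in scores]
    caught = sum(f and y for f, y in zip(flagged, fraud))            # true positives
    false_alarms = sum(f and not y for f, y in zip(flagged, fraud))  # false positives
    missed = sum((not f) and y for f, y in zip(flagged, fraud))      # false negatives
    return caught, false_alarms, missed

print(flag_counts(0.80))  # → (3, 1, 0): all fraud caught, one false alarm
print(flag_counts(0.90))  # → (2, 0, 1): no false alarms, one fraud missed
```

Nothing about the model changed between the two lines — only the business's tolerance for each kind of error.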
Why it matters: The threshold decision is where business judgment meets machine learning. The model provides the probability; the organization decides what to do with it. This is why AI deployment is never purely a technical exercise.
Regression: Predicting a Number
How much? How many? When?
What Regression Does
Regression predicts a continuous numerical value rather than a category. Instead of “will this customer churn?” (classification), regression answers “how much will this customer spend next quarter?” The output is a number on a scale, not a label in a bucket.
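The simplest regression is fitting a straight line to history and reading off the next point. A sketch with hypothetical monthly sales, using the ordinary least-squares formula for a single input:

```python
# Sketch: fit a line to past sales, then predict next month's number.
# Monthly unit sales (hypothetical) for months 1..6.
months = [1, 2, 3, 4, 5, 6]
units  = [110, 118, 131, 139, 152, 160]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(units) / n

# Ordinary least squares for one input: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, units))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

forecast = intercept + slope * 7   # the prediction is a number, not a label
print(round(forecast, 1))          # → 171.0
```

Modern ML regression models do the same job with many more inputs and non-linear relationships, but the output is still a number on a scale.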
Enterprise Examples
Demand forecasting — How many units will we sell next month in each region?
Price optimization — What price maximizes revenue for this product?
Revenue prediction — What will Q3 revenue be based on current pipeline?
Estimated time of arrival — When will this shipment arrive?
Customer lifetime value — How much total revenue will this customer generate?
Classification vs. Regression
The distinction is simple: classification predicts a category, regression predicts a number. Sometimes the same business question can be framed either way. “Will this machine fail?” is classification. “How many hours until this machine fails?” is regression. The framing depends on what decision the prediction needs to support.
Key insight: Regression models are everywhere in business, often without being called “AI.” Demand forecasting, pricing optimization, and financial projections have used regression techniques for decades. Modern ML makes them more accurate by handling more variables and non-linear relationships.
Decision Trees: The Most Intuitive Algorithm
A flowchart that the machine builds itself
How Decision Trees Work
A decision tree is a series of yes/no questions that progressively narrow down to a prediction. “Is income above $50K? Yes. Is credit score above 700? No. Is employment tenure above 2 years? Yes. Prediction: low default risk.” The model automatically determines which questions to ask, in what order, and where to set the thresholds — all learned from the training data.
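Written as code, the worked example above is just nested yes/no questions. In a real tree the questions, their order, and the thresholds are learned from data; here they are hard-coded purely for illustration:

```python
# The loan example as a hand-written decision tree. A trained tree has the
# same shape, but learns the questions and thresholds from data.
def default_risk(income, credit_score, tenure_years):
    if income > 50_000:                  # "Is income above $50K?"
        if credit_score > 700:           # "Is credit score above 700?"
            return "low risk"
        elif tenure_years > 2:           # "Is employment tenure above 2 years?"
            return "low risk"            # the path traced in the example
        else:
            return "high risk"
    else:
        return "high risk"

print(default_risk(income=60_000, credit_score=680, tenure_years=3))  # → low risk
```

This is also why trees are so easy to explain: every prediction is a path of answered questions that can be read straight off the flowchart.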
Why Executives Like Them
Decision trees are interpretable. You can trace exactly why the model made a specific prediction. In regulated industries like banking and healthcare, this matters — regulators may require that you explain how an automated decision was made. A decision tree can be visualized as a flowchart that anyone can follow.
The Limitation
A single decision tree tends to overfit — it memorizes the training data rather than learning generalizable patterns. It’s also sensitive to small changes in data: remove a few examples and the tree can look completely different. This instability makes single trees unreliable for high-stakes production use.
Why it matters: Decision trees are the building block for more powerful techniques. Understanding them is essential because the algorithms that dominate enterprise ML today — Random Forests and Gradient Boosting — are built by combining hundreds or thousands of decision trees.
Ensemble Methods: Strength in Numbers
Random Forests, Gradient Boosting, and XGBoost
Random Forest
Instead of one decision tree, build hundreds of trees, each trained on a slightly different random subset of the data. Each tree votes on the prediction, and the majority wins. Individual trees may be wrong, but the collective is remarkably accurate. This is the same principle behind polling: one person’s opinion may be off, but the average of a thousand opinions is usually close to the truth.
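The voting idea can be shown with stand-in "trees". In this sketch each tree is a one-question stump trained on its own bootstrap sample (a random subset drawn with replacement) — a deliberately simplified stand-in for a real decision tree:

```python
import random

# Random-forest sketch: many weak trees, each trained on a different random
# subset, each votes, majority wins. Each "tree" here is a one-question
# stump (a toy stand-in for a full decision tree).
random.seed(0)

data = [(x, x > 5) for x in range(10)]   # feature x, label: "is x above 5?"

def train_stump(sample):
    # Pick the single threshold that best splits this tree's sample.
    best_t, best_correct = 0, -1
    for t in range(10):
        correct = sum((x > t) == y for x, y in sample)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

# 101 trees, each on its own bootstrap sample of 6 points.
forest = [train_stump(random.choices(data, k=6)) for _ in range(101)]

def predict(forest, x):
    votes = sum(x > t for t in forest)   # each tree casts a yes/no vote
    return votes > len(forest) / 2       # majority wins

print(predict(forest, 8), predict(forest, 2))  # → True False
```

Individual stumps trained on lopsided samples can pick bad thresholds, but the majority vote across 101 of them recovers the right answer — the polling principle in miniature.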
Gradient Boosting & XGBoost
Rather than building trees independently, gradient boosting builds them sequentially — each new tree focuses specifically on correcting the errors of the previous trees. XGBoost (eXtreme Gradient Boosting) is an optimized implementation that has become the dominant algorithm for structured/tabular data in enterprise ML. It consistently wins Kaggle competitions and powers production systems in banking, insurance, and e-commerce.
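The sequential error-correcting idea behind boosting can be sketched with a deliberately simple weak learner: a stump with a fixed split (an arbitrary toy choice — real boosting searches for the best split) that predicts the mean residual on each side:

```python
# Boosting sketch: each new weak learner is fit to the residuals (errors)
# left by the learners before it. Weak learner: a stump with a fixed split
# at x = 3 (a toy simplification) predicting the mean residual per side.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 4, 5, 10, 11, 12]

def fit_stump(xs, residuals, split=3):
    left  = [r for x, r in zip(xs, residuals) if x <= split]
    right = [r for x, r in zip(xs, residuals) if x > split]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= split else rmean

pred = [0.0] * len(xs)
ensemble = []
for _ in range(20):
    residuals = [y - p for y, p in zip(ys, pred)]  # what is still wrong
    stump = fit_stump(xs, residuals)               # fit the remaining errors
    ensemble.append(stump)
    pred = [p + 0.5 * stump(x) for p, x in zip(pred, xs)]  # 0.5 = learning rate

print([round(p, 1) for p in pred])  # → [4.0, 4.0, 4.0, 11.0, 11.0, 11.0]
```

Each round the ensemble's remaining error shrinks, because every new learner targets exactly what the previous ones got wrong. XGBoost layers regularization, clever split-finding, and heavy engineering on top of this core loop.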
Why These Dominate Enterprise AI
For structured business data — spreadsheets, databases, transaction logs — tree-based ensemble methods outperform neural networks in most cases. They’re faster to train, require less data, handle missing values natively, and are easier to interpret. Neural networks shine on unstructured data (images, text). For the tabular data that runs most businesses, XGBoost and Random Forest are the workhorses.
Key insight: When a vendor says they use “AI” for fraud detection, credit scoring, or demand forecasting on structured data, they’re most likely using gradient boosting or random forests — not deep learning. These are mature, well-understood, and highly effective techniques. That’s a feature, not a limitation.
The Imbalanced Data Problem
When 99.9% of your examples are “normal”
The Problem
In many real-world problems, the thing you’re trying to detect is extremely rare. Only 0.1% of credit card transactions are fraudulent. Only 1% of manufactured parts are defective. Only 2% of customers churn each month. If the model simply predicts “normal” every time, it achieves 99.9% accuracy while catching nothing. This is the class imbalance problem.
Why It Matters
A model that misses all fraud is useless, even if its accuracy score looks impressive. The business impact of a false negative (missed fraud) is often far greater than a false positive (legitimate transaction flagged). The cost asymmetry between these errors is what drives the choice of evaluation metrics and threshold settings.
How It’s Addressed
Oversampling — Create synthetic examples of the rare class to balance the training data.
Undersampling — Reduce examples of the common class.
Cost-sensitive learning — Penalize the model more heavily for missing the rare class. Recent research (2025) shows cost-sensitive deep ensemble methods significantly improve fraud detection on imbalanced banking data.
Better metrics — Use precision, recall, and F1 instead of raw accuracy.
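The gap between accuracy and recall is easy to demonstrate with made-up but realistic counts — a thousand transactions, ten frauds, and a model that catches only one of them:

```python
# Why accuracy misleads on imbalanced data (counts are hypothetical):
# 1,000 transactions, 10 fraudulent, model flags just 1 of them.
actual_fraud = 10
caught = 1            # true positives
false_alarms = 0      # legitimate transactions wrongly flagged
total = 1000

true_neg = total - actual_fraud - false_alarms   # legit, correctly passed
accuracy = (caught + true_neg) / total
recall = caught / actual_fraud                   # share of real fraud caught

print(f"accuracy={accuracy:.1%}  recall={recall:.0%}")  # → accuracy=99.1%  recall=10%
```

A 99.1% accuracy figure and a model that misses nine frauds out of ten are, here, the same model.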
Critical for leaders: When someone reports “99% accuracy” on a fraud or defect detection model, ask: “What’s the recall?” That tells you what percentage of actual fraud the model catches. A model with 99% accuracy and 5% recall is catching almost nothing.
Interpretability vs. Performance
The tradeoff that shapes every deployment decision
The Tradeoff
Simpler models (linear regression, single decision trees) are easy to explain but may miss complex patterns. Complex models (deep neural networks, large ensembles) capture more nuance but are harder to interpret. This isn’t an abstract concern — it has direct business and regulatory implications.
When Interpretability Is Required
Regulated industries — Banking (credit decisions), healthcare (diagnosis), insurance (pricing). Regulators may require that you explain why a specific decision was made.
High-stakes decisions — Hiring, criminal justice, medical treatment. Affected individuals have a right to understand the basis of automated decisions.
Trust building — Internal stakeholders are more likely to adopt AI they can understand.
The Middle Ground
Modern techniques like SHAP values and LIME can explain individual predictions from complex models — showing which features contributed most to a specific decision. This allows organizations to use powerful models while still providing explanations. The EU AI Act’s transparency requirements are accelerating adoption of these explainability tools.
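The core idea behind these attributions can be shown in the one case where it has a simple closed form: a linear model, where (assuming independent features) a feature's contribution to a prediction reduces to its weight times its deviation from the average. All weights and values below are hypothetical:

```python
# Sketch of per-prediction feature attribution (the idea behind SHAP-style
# explanations), in the linear case where contribution_i = weight_i * (x_i - avg_i).
# Weights, averages, and the applicant are all made-up illustration values.
weights  = {"income": 0.4, "credit_score": 1.1, "loan_amount": -0.6}
averages = {"income": 55.0, "credit_score": 690.0, "loan_amount": 20.0}

applicant = {"income": 40.0, "credit_score": 710.0, "loan_amount": 35.0}

contributions = {f: weights[f] * (applicant[f] - averages[f]) for f in weights}
for feature, c in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{feature:>12}: {c:+.1f}")
```

The output ranks which features pushed this specific decision up or down relative to an average applicant — exactly the kind of per-decision explanation regulators and affected individuals ask for. Tools like SHAP extend this to complex, non-linear models.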
Rule of thumb: Start with an interpretable model. If it meets the performance requirements, deploy it. Only move to more complex models if the accuracy gain justifies the loss of transparency. In many enterprise use cases, the simpler model is good enough — and far easier to maintain, debug, and trust.
Where Supervised Learning Fits
The workhorse of enterprise AI
Best Suited For
Clear historical outcomes — You have data showing what happened in the past (who defaulted, what sold, which parts failed).
Structured, tabular data — Databases, spreadsheets, transaction logs.
Well-defined prediction targets — You know exactly what you want to predict.
Sufficient labeled examples — Thousands to millions of labeled records available.
Not Suited For
No historical data — New products, new markets, unprecedented events.
No clear labels — Exploratory analysis where you don’t know what you’re looking for (use unsupervised learning, Chapter 6).
Rapidly changing environments — If the patterns shift faster than you can retrain.
Unstructured data at scale — For images, text, and audio, deep learning (Act III) is typically more effective.
The Bottom Line
Supervised learning is the most mature, most deployed, and most commercially proven form of machine learning. It powers credit scoring at every major bank, fraud detection at every payment processor, demand forecasting at every retailer, and spam filtering in every email client. It’s not glamorous. It doesn’t make headlines. But it delivers measurable ROI at scale, every day.
Key insight: If your organization is early in its AI journey, supervised learning on structured data is the highest-probability path to value. The algorithms are proven, the tooling is mature, and the failure modes are well-understood. Start here before chasing generative AI.