Ch 6 — Unsupervised Learning: Finding Hidden Structure

When you don’t know what you’re looking for — and that’s the point
High-level flow: Raw Data → Cluster → Anomaly → Reduce → Patterns → Insight
No Labels, No Teacher
The fundamental difference from supervised learning
The Core Idea
In supervised learning (Chapter 5), you give the model labeled examples: “this transaction was fraud, this one wasn’t.” In unsupervised learning, there are no labels. You hand the model a dataset and say: “Find whatever structure is in here.” The model discovers patterns, groupings, and anomalies on its own — patterns that humans might never have thought to look for.
Why It Matters
Labeling data is expensive and sometimes impossible. You can’t label what you don’t know exists. A retailer doesn’t know in advance how many customer segments it has or what defines them. A security team can’t label every type of cyberattack that hasn’t been invented yet. Unsupervised learning finds the questions you didn’t know to ask.
Supervised vs. Unsupervised
Supervised: “Here are 10,000 emails labeled spam or not spam. Learn to classify new emails.” The model learns a specific task with a clear right answer.

Unsupervised: “Here are 10,000 customer records. Find meaningful groups.” The model discovers structure without being told what to look for. There’s no single “right answer” — the value lies in the insights the structure reveals.
Key insight: Supervised learning answers questions you already have. Unsupervised learning reveals questions you didn’t know to ask. Both are essential. The most sophisticated organizations use them together — unsupervised learning to explore, supervised learning to act.
Clustering: Grouping the Similar
The most common unsupervised learning technique
What Clustering Does
Clustering algorithms examine data points and group them by similarity. Points within a cluster are more similar to each other than to points in other clusters. The algorithm doesn’t know what the groups represent — it just finds them. A human then interprets what each cluster means and whether the grouping is useful.
K-Means: The Workhorse
The most widely used clustering algorithm. You specify how many clusters (K) you want, and the algorithm partitions the data into that many groups by minimizing the total squared distance between points and their cluster center. It’s fast, scalable, and works well when clusters are roughly spherical and similar in size. The challenge: choosing the right K. The “elbow method” helps by plotting within-cluster distance against different values of K and finding the point of diminishing returns.
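A minimal sketch of K-Means and the elbow method using scikit-learn on synthetic data (the dataset and all parameter values here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups; the model is not told this.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Elbow method: fit K-Means for several values of K and record inertia
# (the sum of squared distances from each point to its cluster center).
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Inertia always shrinks as K grows; the "elbow" is where the
# improvement flattens out — for this data, around K = 4.
for k, val in inertias.items():
    print(f"K={k}: inertia={val:,.0f}")
```

In practice the elbow is read off a plot rather than a printout, but the logic is the same: past the true number of groups, adding clusters buys little.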
Beyond K-Means
Hierarchical clustering builds a tree of nested groups, useful when you want to see relationships at multiple levels of granularity (e.g., broad market segments that break into sub-segments).

DBSCAN and HDBSCAN find clusters of arbitrary shapes and automatically identify outliers. Unlike K-Means, they don’t require you to specify the number of clusters in advance — they discover it from the data.
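A short sketch of the contrast with DBSCAN, again on synthetic data (the two-moons shape and the `eps`/`min_samples` settings are chosen for illustration): no K is specified, yet the algorithm recovers the two crescent-shaped groups and marks stray points as outliers.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents — a shape K-Means cannot separate cleanly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: density threshold.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks outliers; every other label is a discovered cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```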
Why it matters: Clustering is the foundation of customer segmentation, market analysis, and portfolio grouping. Research shows data-driven segmentation delivers 30–40% improvement in conversion rates compared to demographic assumptions alone. Retail companies typically identify 4–6 meaningful segments; B2B organizations work with 3–5.
Customer Segmentation in Practice
The highest-value enterprise application of clustering
How It Works
A retailer feeds clustering algorithms data on purchase frequency, average order value, product categories, browsing behavior, and recency of last purchase. The algorithm groups customers into segments that share similar patterns. One cluster might be “high-frequency, low-value buyers.” Another might be “seasonal big spenders.” A third might be “at-risk customers showing declining engagement.”
From Clusters to Action
The raw clusters are just numbers. The business value comes from interpreting and acting on them:
Targeted marketing — Different messages for different segments.
Retention campaigns — Identify at-risk segments before they churn.
Product development — Understand what each segment values.
Pricing strategy — Optimize pricing by segment willingness to pay.
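A compact sketch of the segmentation workflow above. The three behavioral profiles and the feature names (recency, frequency, average order value) are invented for illustration; the key steps — scale, cluster, then profile each cluster so a human can name it — are the real pattern.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for customer records: [recency (days), orders/yr, avg order $]
segments = [
    rng.normal([10, 40, 25], [3, 5, 5], size=(100, 3)),    # frequent, low-value
    rng.normal([30, 6, 400], [10, 2, 50], size=(100, 3)),  # seasonal big spenders
    rng.normal([120, 2, 60], [20, 1, 10], size=(100, 3)),  # declining engagement
]
X = np.vstack(segments)

# Scale first: K-Means uses distance, so raw dollar values would dominate.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Profile each cluster by its mean raw features — the step where humans attach meaning.
for c in range(3):
    rec, freq, val = X[labels == c].mean(axis=0)
    print(f"Cluster {c}: recency≈{rec:.0f}d, {freq:.0f} orders/yr, avg order≈${val:.0f}")
```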
Modern Approach: Clustering + LLMs
A 2025 trend: organizations are combining traditional clustering with large language models. The clustering algorithm finds the groups; the LLM automatically generates human-readable descriptions of each segment — turning “Cluster 3” into “Cost-sensitive shoppers who buy in bulk during promotions and respond to free-shipping offers.” This dramatically reduces the time from analysis to action.
Banking example: Financial institutions use behavioral clustering to power next-best-action models, targeted retention campaigns, and behavioral shift monitoring. When a customer’s behavior moves them from one cluster to another, it signals a life event or risk that the bank can proactively address.
Anomaly Detection: Spotting the Unusual
Finding the needle without knowing what the needle looks like
The Concept
Instead of learning what fraud looks like (supervised), anomaly detection learns what “normal” looks like and flags anything that deviates significantly. This is powerful because it can detect novel threats and patterns that have never been seen before — something supervised models, trained only on historical examples, cannot do.
Key Techniques
Isolation Forest — Isolates anomalies by randomly partitioning data. Anomalies are easier to isolate (fewer partitions needed), so the algorithm finds them efficiently. Widely used in enterprise fraud and procurement anomaly detection.

Statistical methods — Flag data points that fall outside expected distributions.

Autoencoders — Neural networks trained to reconstruct normal data. When they fail to reconstruct an input accurately, it’s likely an anomaly.
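A minimal sketch of the Isolation Forest idea, using invented transaction-like data (the features, amounts, and `contamination` setting are illustrative): the model learns what “normal” looks like and flags the points that are easiest to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 500 typical transactions: [amount in $, hour of day].
normal = rng.normal([50, 14], [15, 3], size=(500, 2))
# Two oddities: large amounts at unusual hours — never labeled as fraud.
odd = np.array([[2500.0, 3.0], [1800.0, 4.0]])
X = np.vstack([normal, odd])

# contamination is the expected share of anomalies — a tuning choice, not a label.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # -1 = anomaly, 1 = normal
print("flagged indices:", np.where(pred == -1)[0])
```

Note that the model was never told what an anomaly looks like; the two planted oddities are flagged simply because they deviate from everything else.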
Enterprise Applications
Cybersecurity — Detect unusual network traffic patterns that indicate a breach, even for attack types never seen before.
Financial fraud — Flag unusual transaction patterns in real time.
Manufacturing — Identify equipment behavior that deviates from normal operating parameters before failure occurs.
Procurement — Detect suspicious purchase patterns, duplicate invoices, or vendor collusion. A 2025 study showed hybrid approaches combining clustering with Isolation Forest significantly improve anomaly detection in enterprise purchase processes.
Critical advantage: Supervised fraud detection only catches fraud that resembles past fraud. Anomaly detection catches anything that doesn’t look normal — including entirely new attack vectors. The most effective systems use both: supervised models for known patterns, anomaly detection for unknown threats.
Dimensionality Reduction: Simplifying Complexity
Turning 500 variables into 5 without losing the story
The Problem
Real-world datasets often have hundreds or thousands of variables (dimensions). A customer record might include 200 behavioral features. A genomic dataset might have 20,000 gene expressions. Humans can’t visualize or reason about data in 200 dimensions. Many algorithms also struggle — they slow down, overfit, or produce unreliable results. This is called the “curse of dimensionality.”
PCA: Principal Component Analysis
PCA is the most widely used dimensionality reduction technique. It finds the directions of maximum variation in the data and projects everything onto those directions. If 95% of the variation in 200 variables can be captured by 10 “principal components,” you’ve reduced complexity by 95% while preserving nearly all the information. It’s often used as a preprocessing step before clustering to improve both speed and accuracy.
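The variance-capture claim above can be made concrete with a small sketch. The data below is synthetic by construction: 200 correlated variables generated from only 10 underlying signals plus noise, so 10 principal components recover nearly everything.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 records, 200 variables that secretly carry only 10 underlying signals.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 200))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 200))

pca = PCA(n_components=10).fit(X)
captured = pca.explained_variance_ratio_.sum()
print(f"10 components capture {captured:.1%} of the variance in 200 variables")
```

Real data is rarely this clean, but the same diagnostic — cumulative explained variance — tells you how many components the story actually needs.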
Why Leaders Should Care
Dimensionality reduction isn’t just a technical optimization. It has direct business implications:

Faster models — Fewer variables means faster training and inference, reducing compute costs.
Better results — Removing noise improves model accuracy. Studies show PCA combined with clustering achieves significantly better segmentation scores than clustering on raw data.
Visualization — Reduce data to 2–3 dimensions for human-interpretable visualizations that reveal structure at a glance.
Key insight: When a data science team says they’re doing “feature engineering” or “dimensionality reduction,” they’re distilling complexity into its essential signals. It’s the difference between handing an executive a 200-page report and a 5-slide summary that captures the same story. Both have the information; one is actionable.
Association Rules: What Goes With What
The “customers who bought X also bought Y” engine
The Concept
Association rule mining discovers relationships between items in large datasets. The classic example: analyzing millions of shopping baskets to find that customers who buy diapers on Friday evenings also tend to buy beer. These aren’t predictions — they’re discovered correlations that reveal hidden purchasing patterns.
How It Works
The algorithm scans transaction data for frequent item combinations and measures three things:
Support — How often the combination appears (frequency).
Confidence — How often B appears when A is present (reliability).
Lift — How much more likely B is when A is present compared to random chance (strength of the relationship).
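The three measures fall straight out of counting. A toy sketch with invented baskets, computing support, confidence, and lift for the rule “diapers → beer” directly from the definitions:

```python
# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
    {"bread", "beer"},
    {"milk"},
]
n = len(baskets)

def freq(*items):
    """Fraction of baskets containing all the given items."""
    return sum(1 for b in baskets if set(items) <= b) / n

support = freq("diapers", "beer")       # how often A and B co-occur
confidence = support / freq("diapers")  # P(beer | diapers)
lift = confidence / freq("beer")        # vs. chance of beer overall

print(support, confidence, lift)  # 0.375 0.75 1.2
```

A lift above 1.0 (here 1.2) means beer is more likely in diaper baskets than in baskets overall — the signal the rule miner is hunting for at scale.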
Enterprise Applications
Recommendation engines — “Customers who bought this also bought...” Amazon attributes up to 35% of its revenue to recommendation systems.
Cross-selling — Banks discovering that customers with a checking account and auto loan are 3x more likely to open an investment account.
Store layout — Physical retailers optimizing product placement based on co-purchase patterns.
Medical research — Discovering that certain symptom combinations predict specific conditions.
Key insight: Association rules are among the simplest unsupervised techniques, but they power some of the most commercially valuable AI applications. Recommendation systems alone represent a multi-billion-dollar industry. The technique is decades old; the scale at which it’s now applied is what changed.
Supervised + Unsupervised: Better Together
How the best organizations combine both approaches
The Combined Approach
The most effective enterprise ML systems don’t choose between supervised and unsupervised — they use both in sequence. Unsupervised learning explores and structures the data. Supervised learning then acts on that structure. This pipeline is standard practice in mature AI organizations.
Common Patterns
Cluster, then classify — Use clustering to segment customers, then build separate supervised models for each segment. A churn model trained on high-value customers performs better than a one-size-fits-all model.

Detect, then investigate — Use anomaly detection to flag unusual transactions, then use a supervised classifier to categorize the type of anomaly.

Reduce, then predict — Use PCA to reduce dimensionality, then feed the simplified data into a supervised model for faster, more accurate predictions.
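The “reduce, then predict” pattern can be sketched as a single scikit-learn pipeline. The dataset is synthetic and the component counts are illustrative; the point is the shape of the pipeline — unsupervised PCA feeding a supervised classifier.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 100 features, of which only 10 actually carry signal.
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised step (PCA) compresses; supervised step (logistic regression) predicts.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print(f"test accuracy: {pipe.score(X_te, y_te):.2f}")
```

Wrapping both stages in one pipeline also guarantees the PCA is fit only on training data, avoiding leakage into the evaluation set.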
Real-World Example
A large bank’s fraud detection pipeline:
Step 1: Anomaly detection flags transactions that deviate from normal patterns (unsupervised).
Step 2: A supervised classifier categorizes flagged transactions by fraud type (card theft, account takeover, synthetic identity).
Step 3: Clustering groups related fraudulent transactions to identify organized fraud rings.
Step 4: Association rules discover new fraud patterns that feed back into the supervised model’s training data.
Why it matters: When evaluating AI vendors or internal proposals, ask whether the approach uses supervised learning, unsupervised learning, or both. A system that only uses supervised learning is blind to novel patterns. A system that only uses unsupervised learning can find patterns but can’t act on them with precision.
The Executive Mental Model
When to reach for unsupervised learning
Use Unsupervised Learning When
You don’t have labels — No historical outcomes to learn from, or labeling would be prohibitively expensive.
You’re exploring — You want to understand the structure of your data before building predictive models.
You need to find the unknown — Novel fraud patterns, emerging customer segments, hidden operational inefficiencies.
You have too many variables — Dimensionality reduction simplifies complexity before downstream analysis.
Limitations to Know
No ground truth — Without labels, it’s harder to measure whether the model is “right.” Evaluation is more subjective.
Interpretation required — The algorithm finds groups; a human must decide if those groups are meaningful and actionable.
Sensitivity to parameters — The number of clusters, the distance metric, and other settings significantly affect results. Different settings can produce very different groupings from the same data.
The Bottom Line
Supervised learning is your precision tool — it answers specific questions with measurable accuracy. Unsupervised learning is your exploration tool — it reveals structure, segments, and anomalies you didn’t know existed. The most mature AI organizations use both, in sequence, to first understand their data and then act on it.
Three questions for your next AI review:
1. Are we only looking for patterns we already know about, or are we also discovering new ones?
2. How are we segmenting our customers — by demographic assumptions or by actual behavioral data?
3. Can our fraud/security systems detect threats they’ve never seen before, or only variations of past incidents?