Ch 6 — Unsupervised Learning: Finding Hidden Structure

When you don’t know what you’re looking for — and that’s the point
High-level flow: Raw Data → Cluster → Anomaly → Reduce → Patterns → Insight
No Labels, No Teacher
The fundamental difference from supervised learning
The Core Idea
In supervised learning (Chapter 5), you give the model labeled examples: “this transaction was fraud, this one wasn’t.” In unsupervised learning, there are no labels. You hand the model a dataset and say: “Find whatever structure is in here.” The model discovers patterns, groupings, and anomalies on its own — patterns that humans might never have thought to look for.
Why It Matters
Labeling data is expensive and sometimes impossible. You can’t label what you don’t know exists. A retailer doesn’t know in advance how many customer segments it has or what defines them. A security team can’t label every type of cyberattack that hasn’t been invented yet. Unsupervised learning finds the questions you didn’t know to ask.
Supervised vs. Unsupervised
Supervised: “Here are 10,000 emails labeled spam or not spam. Learn to classify new emails.” The model learns a specific task with a clear right answer.

Unsupervised: “Here are 10,000 customer records. Find meaningful groups.” The model discovers structure without being told what to look for. There’s no single “right answer” — the value lies in the insights the structure reveals.
Key insight: Supervised learning answers questions you already have. Unsupervised learning reveals questions you didn’t know to ask. Both are essential. The most sophisticated organizations use them together — unsupervised learning to explore, supervised learning to act.
Clustering: Grouping the Similar
The most common unsupervised learning technique
What Clustering Does
Clustering algorithms examine data points and group them by similarity. Points within a cluster are more similar to each other than to points in other clusters. The algorithm doesn’t know what the groups represent — it just finds them. A human then interprets what each cluster means and whether the grouping is useful.
K-Means: The Workhorse
The most widely used clustering algorithm. You specify how many clusters (K) you want, and the algorithm partitions the data into that many groups by minimizing the total squared distance between points and their cluster center. It’s fast, scalable, and works well when clusters are roughly spherical and similar in size. The challenge: choosing the right K. The “elbow method” helps by plotting within-cluster distance against different values of K and finding the point of diminishing returns.
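A minimal sketch of K-Means and the elbow method using scikit-learn on synthetic data (the dataset and all parameter values here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups; the model is not told this.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Elbow method: fit K-Means for several values of K and record inertia
# (the sum of squared distances from each point to its cluster center).
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Inertia always shrinks as K grows; the "elbow" is where the
# improvement flattens out — for this data, around K = 4.
for k, val in inertias.items():
    print(f"K={k}: inertia={val:,.0f}")
```

In practice the elbow is read off a plot rather than a printout, but the logic is the same: past the true number of groups, adding clusters buys little.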
Beyond K-Means
Hierarchical clustering builds a tree of nested groups, useful when you want to see relationships at multiple levels of granularity (e.g., broad market segments that break into sub-segments).

DBSCAN and HDBSCAN find clusters of arbitrary shapes and automatically identify outliers. Unlike K-Means, they don’t require you to specify the number of clusters in advance — they discover it from the data.
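A short sketch of the contrast with DBSCAN, again on synthetic data (the two-moons shape and the `eps`/`min_samples` settings are chosen for illustration): no K is specified, yet the algorithm recovers the two crescent-shaped groups and marks stray points as outliers.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents — a shape K-Means cannot separate cleanly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: density threshold.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks outliers; every other label is a discovered cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```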
Why it matters: Clustering is the foundation of customer segmentation, market analysis, and portfolio grouping. Research shows data-driven segmentation delivers 30–40% improvement in conversion rates compared to demographic assumptions alone. Retail companies typically identify 4–6 meaningful segments; B2B organizations work with 3–5.
Customer Segmentation in Practice
The highest-value enterprise application of clustering
How It Works
A retailer feeds clustering algorithms data on purchase frequency, average order value, product categories, browsing behavior, and recency of last purchase. The algorithm groups customers into segments that share similar patterns. One cluster might be “high-frequency, low-value buyers.” Another might be “seasonal big spenders.” A third might be “at-risk customers showing declining engagement.”
From Clusters to Action
The raw clusters are just numbers. The business value comes from interpreting and acting on them:
Targeted marketing — Different messages for different segments.
Retention campaigns — Identify at-risk segments before they churn.
Product development — Understand what each segment values.
Pricing strategy — Optimize pricing by segment willingness to pay.
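A compact sketch of the segmentation workflow above. The three behavioral profiles and the feature names (recency, frequency, average order value) are invented for illustration; the key steps — scale, cluster, then profile each cluster so a human can name it — are the real pattern.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for customer records: [recency (days), orders/yr, avg order $]
segments = [
    rng.normal([10, 40, 25], [3, 5, 5], size=(100, 3)),    # frequent, low-value
    rng.normal([30, 6, 400], [10, 2, 50], size=(100, 3)),  # seasonal big spenders
    rng.normal([120, 2, 60], [20, 1, 10], size=(100, 3)),  # declining engagement
]
X = np.vstack(segments)

# Scale first: K-Means uses distance, so raw dollar values would dominate.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Profile each cluster by its mean raw features — the step where humans attach meaning.
for c in range(3):
    rec, freq, val = X[labels == c].mean(axis=0)
    print(f"Cluster {c}: recency≈{rec:.0f}d, {freq:.0f} orders/yr, avg order≈${val:.0f}")
```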
Modern Approach: Clustering + LLMs
A 2025 trend: organizations are combining traditional clustering with large language models. The clustering algorithm finds the groups; the LLM automatically generates human-readable descriptions of each segment — turning “Cluster 3” into “Cost-sensitive shoppers who buy in bulk during promotions and respond to free-shipping offers.” This dramatically reduces the time from analysis to action.
Banking example: Financial institutions use behavioral clustering to power next-best-action models, targeted retention campaigns, and behavioral shift monitoring. When a customer’s behavior moves them from one cluster to another, it signals a life event or risk that the bank can proactively address.
Anomaly Detection: Spotting the Unusual
Finding the needle without knowing what the needle looks like
The Concept
Instead of learning what fraud looks like (supervised), anomaly detection learns what “normal” looks like and flags anything that deviates significantly. This is powerful because it can detect novel threats and patterns that have never been seen before — something supervised models, trained only on historical examples, cannot do.
Key Techniques
Isolation Forest — Isolates anomalies by randomly partitioning data. Anomalies are easier to isolate (fewer partitions needed), so the algorithm finds them efficiently. Widely used in enterprise fraud and procurement anomaly detection.

Statistical methods — Flag data points that fall outside expected distributions.

Autoencoders — Neural networks trained to reconstruct normal data. When they fail to reconstruct an input accurately, it’s likely an anomaly.
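A minimal sketch of the Isolation Forest idea, using invented transaction-like data (the features, amounts, and `contamination` setting are illustrative): the model learns what “normal” looks like and flags the points that are easiest to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 500 typical transactions: [amount in $, hour of day].
normal = rng.normal([50, 14], [15, 3], size=(500, 2))
# Two oddities: large amounts at unusual hours — never labeled as fraud.
odd = np.array([[2500.0, 3.0], [1800.0, 4.0]])
X = np.vstack([normal, odd])

# contamination is the expected share of anomalies — a tuning choice, not a label.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # -1 = anomaly, 1 = normal
print("flagged indices:", np.where(pred == -1)[0])
```

Note that the model was never told what an anomaly looks like; the two planted oddities are flagged simply because they deviate from everything else.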
Enterprise Applications
Cybersecurity — Detect unusual network traffic patterns that indicate a breach, even for attack types never seen before.
Financial fraud — Flag unusual transaction patterns in real time.
Manufacturing — Identify equipment behavior that deviates from normal operating parameters before failure occurs.
Procurement — Detect suspicious purchase patterns, duplicate invoices, or vendor collusion. A 2025 study showed hybrid approaches combining clustering with Isolation Forest significantly improve anomaly detection in enterprise purchase processes.
Critical advantage: Supervised fraud detection only catches fraud that resembles past fraud. Anomaly detection catches anything that doesn’t look normal — including entirely new attack vectors. The most effective systems use both: supervised models for known patterns, anomaly detection for unknown threats.
Dimensionality Reduction: Simplifying Complexity
Turning 500 variables into 5 without losing the story
The Problem
Real-world datasets often have hundreds or thousands of variables (dimensions). A customer record might include 200 behavioral features. A genomic dataset might have 20,000 gene expressions. Humans can’t visualize or reason about data in 200 dimensions. Many algorithms also struggle — they slow down, overfit, or produce unreliable results. This is called the “curse of dimensionality.”
PCA: Principal Component Analysis
PCA is the most widely used dimensionality reduction technique. It finds the directions of maximum variation in the data and projects everything onto those directions. If 95% of the variation in 200 variables can be captured by 10 “principal components,” you’ve reduced complexity by 95% while preserving nearly all the information. It’s often used as a preprocessing step before clustering to improve both speed and accuracy.
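The variance-capture claim above can be made concrete with a small sketch. The data below is synthetic by construction: 200 correlated variables generated from only 10 underlying signals plus noise, so 10 principal components recover nearly everything.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 records, 200 variables that secretly carry only 10 underlying signals.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 200))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 200))

pca = PCA(n_components=10).fit(X)
captured = pca.explained_variance_ratio_.sum()
print(f"10 components capture {captured:.1%} of the variance in 200 variables")
```

Real data is rarely this clean, but the same diagnostic — cumulative explained variance — tells you how many components the story actually needs.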
Why Leaders Should Care
Dimensionality reduction isn’t just a technical optimization. It has direct business implications:

Faster models — Fewer variables means faster training and inference, reducing compute costs.
Better results — Removing noise improves model accuracy. Studies show PCA combined with clustering achieves significantly better segmentation scores than clustering on raw data.
Visualization — Reduce data to 2–3 dimensions for human-interpretable visualizations that reveal structure at a glance.
Key insight: When a data science team says they’re doing “feature engineering” or “dimensionality reduction,” they’re distilling complexity into its essential signals. It’s the difference between handing an executive a 200-page report and a 5-slide summary that captures the same story. Both have the information; one is actionable.
Association Rules: What Goes With What
The “customers who bought X also bought Y” engine
The Concept
Association rule mining discovers relationships between items in large datasets. The classic example: analyzing millions of shopping baskets to find that customers who buy diapers on Friday evenings also tend to buy beer. These aren’t predictions — they’re discovered correlations that reveal hidden purchasing patterns.
How It Works
The algorithm scans transaction data for frequent item combinations and measures three things:
Support — How often the combination appears (frequency).
Confidence — How often B appears when A is present (reliability).
Lift — How much more likely B is when A is present compared to random chance (strength of the relationship).
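The three measures fall straight out of counting. A toy sketch with invented baskets, computing support, confidence, and lift for the rule “diapers → beer” directly from the definitions:

```python
# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
    {"bread", "beer"},
    {"milk"},
]
n = len(baskets)

def freq(*items):
    """Fraction of baskets containing all the given items."""
    return sum(1 for b in baskets if set(items) <= b) / n

support = freq("diapers", "beer")       # how often A and B co-occur
confidence = support / freq("diapers")  # P(beer | diapers)
lift = confidence / freq("beer")        # vs. chance of beer overall

print(support, confidence, lift)  # 0.375 0.75 1.2
```

A lift above 1.0 (here 1.2) means beer is more likely in diaper baskets than in baskets overall — the signal the rule miner is hunting for at scale.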
Enterprise Applications
Recommendation engines — “Customers who bought this also bought...” Amazon attributes up to 35% of its revenue to recommendation systems.
Cross-selling — Banks discovering that customers with a checking account and auto loan are 3x more likely to open an investment account.
Store layout — Physical retailers optimizing product placement based on co-purchase patterns.
Medical research — Discovering that certain symptom combinations predict specific conditions.
Key insight: Association rules are among the simplest unsupervised techniques, but they power some of the most commercially valuable AI applications. Recommendation systems alone represent a multi-billion-dollar industry. The technique is decades old; the scale at which it’s now applied is what changed.
Supervised + Unsupervised: Better Together
How the best organizations combine both approaches
The Combined Approach
The most effective enterprise ML systems don’t choose between supervised and unsupervised — they use both in sequence. Unsupervised learning explores and structures the data. Supervised learning then acts on that structure. This pipeline is standard practice in mature AI organizations.
Common Patterns
Cluster, then classify — Use clustering to segment customers, then build separate supervised models for each segment. A churn model trained on high-value customers performs better than a one-size-fits-all model.

Detect, then investigate — Use anomaly detection to flag unusual transactions, then use a supervised classifier to categorize the type of anomaly.

Reduce, then predict — Use PCA to reduce dimensionality, then feed the simplified data into a supervised model for faster, more accurate predictions.
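The “reduce, then predict” pattern can be sketched as a single scikit-learn pipeline. The dataset is synthetic and the component counts are illustrative; the point is the shape of the pipeline — unsupervised PCA feeding a supervised classifier.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 100 features, of which only 10 actually carry signal.
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised step (PCA) compresses; supervised step (logistic regression) predicts.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print(f"test accuracy: {pipe.score(X_te, y_te):.2f}")
```

Wrapping both stages in one pipeline also guarantees the PCA is fit only on training data, avoiding leakage into the evaluation set.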
Real-World Example
A large bank’s fraud detection pipeline:
Step 1: Anomaly detection flags transactions that deviate from normal patterns (unsupervised).
Step 2: A supervised classifier categorizes flagged transactions by fraud type (card theft, account takeover, synthetic identity).
Step 3: Clustering groups related fraudulent transactions to identify organized fraud rings.
Step 4: Association rules discover new fraud patterns that feed back into the supervised model’s training data.
Why it matters: When evaluating AI vendors or internal proposals, ask whether the approach uses supervised learning, unsupervised learning, or both. A system that only uses supervised learning is blind to novel patterns. A system that only uses unsupervised learning can find patterns but can’t act on them with precision.
The Executive Mental Model
When to reach for unsupervised learning
Use Unsupervised Learning When
You don’t have labels — No historical outcomes to learn from, or labeling would be prohibitively expensive.
You’re exploring — You want to understand the structure of your data before building predictive models.
You need to find the unknown — Novel fraud patterns, emerging customer segments, hidden operational inefficiencies.
You have too many variables — Dimensionality reduction simplifies complexity before downstream analysis.
Limitations to Know
No ground truth — Without labels, it’s harder to measure whether the model is “right.” Evaluation is more subjective.
Interpretation required — The algorithm finds groups; a human must decide if those groups are meaningful and actionable.
Sensitivity to parameters — The number of clusters, the distance metric, and other settings significantly affect results. Different settings can produce very different groupings from the same data.
The Bottom Line
Supervised learning is your precision tool — it answers specific questions with measurable accuracy. Unsupervised learning is your exploration tool — it reveals structure, segments, and anomalies you didn’t know existed. The most mature AI organizations use both, in sequence, to first understand their data and then act on it.
Three questions for your next AI review:
1. Are we only looking for patterns we already know about, or are we also discovering new ones?
2. How are we segmenting our customers — by demographic assumptions or by actual behavioral data?
3. Can our fraud/security systems detect threats they’ve never seen before, or only variations of past incidents?