What Is Clustering?
In supervised learning, every training example has a label. In clustering, you have data but no labels. The goal: discover natural groups (clusters) in the data.
Think of sorting a pile of laundry without anyone telling you the categories. You’d naturally group by color, fabric type, or size. Clustering algorithms do the same with data — they find groups of similar items.
Use cases: Customer segmentation (group shoppers by behavior), anomaly detection (find the one transaction that doesn’t fit any group), image compression (group similar pixels), gene expression analysis (group genes with similar activity patterns), and document organization (group similar articles).
The fundamental challenge: how do you evaluate clustering without labels? There’s no “accuracy” to compute. Instead, we use internal metrics like silhouette score and within-cluster sum of squares (WCSS).
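Both internal metrics are easy to compute in practice. Below is a minimal sketch, assuming scikit-learn is available and using synthetic data from `make_blobs` (the labels it returns are discarded, since clustering never sees them):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 natural groups; the true labels are thrown away.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# WCSS: sum of squared distances from each point to its cluster center.
wcss = km.inertia_

# Silhouette: per-point (b - a) / max(a, b), averaged; range [-1, 1], higher is better.
sil = silhouette_score(X, km.labels_)

print(f"WCSS: {wcss:.1f}, silhouette: {sil:.2f}")
```

Lower WCSS means tighter clusters; a silhouette near 1 means points sit close to their own cluster and far from the next nearest one.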
Clustering vs Classification
Classification (supervised):
Input: features + labels
Output: decision boundary
Goal: predict labels for new data
Eval: accuracy, F1, ROC-AUC
Clustering (unsupervised):
Input: features only (no labels!)
Output: group assignments
Goal: discover natural groups
Eval: silhouette, WCSS, visual inspection
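The contrast above shows up directly in code. A hedged sketch with scikit-learn: the classifier's `fit` requires labels, while the clusterer's `fit` sees features only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Classification: fit() needs features AND labels.
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)  # predictions come from the known label vocabulary

# Clustering: fit() sees features only; labels never enter.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = km.labels_  # arbitrary group ids (0/1), not "true" labels
```

Note that cluster ids are arbitrary: the group KMeans calls 0 may correspond to the class a supervised model calls 1, which is exactly why accuracy is undefined here.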
Real-world example:
You have 100,000 customers with purchase data.
No one has labeled them "budget" or "premium."
Clustering discovers:
Cluster 0: budget shoppers (low spend, buys on sale)
Cluster 1: premium buyers (high spend, brand-loyal)
Cluster 2: occasional shoppers (infrequent, seasonal purchases)
Now marketing can target each group differently.
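The segmentation above can be sketched end to end. This is a toy version under stated assumptions: the two features (annual spend and the fraction of purchases made on sale) are hypothetical, and the three customer groups are simulated rather than real purchase data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical features per customer: [annual_spend, fraction_bought_on_sale].
rng = np.random.default_rng(0)
budget   = np.column_stack([rng.normal(300, 50, 60),   rng.normal(0.8, 0.05, 60)])
premium  = np.column_stack([rng.normal(5000, 800, 60), rng.normal(0.1, 0.05, 60)])
seasonal = np.column_stack([rng.normal(900, 200, 60),  rng.normal(0.4, 0.10, 60)])
X = np.vstack([budget, premium, seasonal])  # no labels are kept

# Scale first: spend is in dollars, sale fraction is in [0, 1].
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Profile each discovered segment by its mean raw features.
for c in range(3):
    seg = X[km.labels_ == c]
    print(f"Cluster {c}: mean spend ${seg[:, 0].mean():.0f}, "
          f"sale fraction {seg[:, 1].mean():.2f}")
```

Scaling matters here: without it, dollar amounts in the thousands would dominate the distance computation and the sale-fraction feature would be ignored.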
Key insight: Clustering is like being a new student at a school cafeteria. Nobody tells you the social groups, but after a few days you notice the athletes sit together, the band kids sit together, and the gamers sit together. You discovered the structure by observing patterns — that’s exactly what clustering algorithms do with data.