Ch 7 — Clustering: K-Means, DBSCAN & Beyond

Finding structure without labels — sorting laundry in the dark
Unsupervised Learning: No Labels, Find Structure
The sorting laundry analogy — grouping without instructions
What Is Clustering?
In supervised learning, every training example has a label. In clustering, you have data but no labels. The goal: discover natural groups (clusters) in the data.

Think of sorting a pile of laundry without anyone telling you the categories. You’d naturally group by color, fabric type, or size. Clustering algorithms do the same with data — they find groups of similar items.

Use cases: Customer segmentation (group shoppers by behavior), anomaly detection (find the one transaction that doesn’t fit any group), image compression (group similar pixels), gene expression analysis (group genes with similar activity patterns), and document organization (group similar articles).

The fundamental challenge: how do you evaluate clustering without labels? There’s no “accuracy” to compute. Instead, we use internal metrics like silhouette score and within-cluster sum of squares (WCSS).
Clustering vs Classification
Classification (supervised):
Input: features + labels
Output: decision boundary
Goal: predict labels for new data
Eval: accuracy, F1, ROC-AUC

Clustering (unsupervised):
Input: features only (no labels!)
Output: group assignments
Goal: discover natural groups
Eval: silhouette, WCSS, visual inspection

Real-world example: You have 100,000 customers with purchase data. No one has labeled them "budget" or "premium." Clustering discovers:
Cluster 0: budget shoppers (low spend, buys on sale)
Cluster 1: premium buyers (high spend, brand-focused)
Cluster 2: occasional shoppers (rare, seasonal)

Now marketing can target each group differently.
Key insight: Clustering is like being a new student at a school cafeteria. Nobody tells you the social groups, but after a few days you notice the athletes sit together, the band kids sit together, and the gamers sit together. You discovered the structure by observing patterns — that’s exactly what clustering algorithms do with data.
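The customer-segmentation idea can be sketched in a few lines. The feature values and group parameters below are invented for illustration; the point is that K-Means receives no labels yet recovers the three shopper segments from structure alone:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchase features: [annual spend in $, purchases per year]
rng = np.random.default_rng(42)
budget = rng.normal([200, 30], [30, 5], size=(100, 2))      # low spend, frequent
premium = rng.normal([3000, 10], [300, 3], size=(100, 2))   # high spend
occasional = rng.normal([800, 2], [100, 1], size=(100, 2))  # rare, seasonal
X = np.vstack([budget, premium, occasional])

# No labels anywhere — K-Means discovers the groups from the data itself
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(np.bincount(km.labels_))  # three groups of roughly 100 customers each
```

Which cluster gets which integer ID is arbitrary; interpreting the clusters ("these are the budget shoppers") is still a human job.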
K-Means: The Algorithm
Assign, update, repeat — converge to cluster centers
How K-Means Works
Step 1: Initialize K cluster centers (centroids) randomly.
Step 2: Assign each data point to the nearest centroid (using Euclidean distance).
Step 3: Update each centroid to the mean of all points assigned to it.
Step 4: Repeat steps 2–3 until centroids stop moving (convergence).

Convergence is guaranteed because each step reduces the total within-cluster sum of squares (WCSS). Typically converges in 10–50 iterations.

K-Means++ improves initialization by spreading initial centroids apart. First centroid is random; each subsequent one is chosen with probability proportional to distance from existing centroids. This avoids the “all centroids start in one cluster” problem and is scikit-learn’s default (init='k-means++').
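Steps 1–4 can be written out directly in NumPy. This is a teaching sketch with plain random initialization (not K-Means++) and no multiple restarts, so real code should prefer scikit-learn's KMeans:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """The four K-Means steps as plain NumPy — a sketch, not production code."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; on harder data, a single random initialization can land in a poor local minimum, which is exactly what n_init and K-Means++ address.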
K-Means in scikit-learn
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,       # K: number of clusters
    init='k-means++',   # smart initialization
    n_init=10,          # run 10 times, pick best
    max_iter=300,       # max iterations
    random_state=42
)
km.fit(X)

labels = km.labels_              # cluster assignments
centers = km.cluster_centers_    # centroid coords
inertia = km.inertia_            # WCSS (lower = tighter)

print(f"Clusters: {np.unique(labels)}")
print(f"WCSS: {inertia:.1f}")
print(f"Sizes: {np.bincount(labels)}")

# Complexity: O(n × K × d × iterations)
# Very fast — handles millions of points.
# MiniBatchKMeans for even larger datasets.
Key insight: K-Means is like placing K magnets on a table covered in iron filings. Each filing snaps to the nearest magnet. Then you move each magnet to the center of its filings. Repeat until the magnets stop moving. The final magnet positions are your cluster centers.
K-Means Math: Within-Cluster Sum of Squares
The objective function K-Means minimizes
The Objective
K-Means minimizes the Within-Cluster Sum of Squares (WCSS), also called inertia:

WCSS = ∑ₖ ∑ᵢ∈Cₖ ||xᵢ − μₖ||²

where μₖ is the centroid of cluster k. For each point, compute its squared distance to its cluster center, then sum over all points in all clusters.

Lower WCSS = tighter clusters. But WCSS always decreases as K increases (K = n gives WCSS = 0, with each point as its own cluster). The challenge is finding the K where adding more clusters stops helping significantly.

K-Means is an NP-hard problem in general, but the iterative algorithm finds a good local minimum. Running multiple times with different initializations (n_init=10) helps avoid bad local minima.
WCSS Computation
WCSS = Σₖ Σᵢ∈Cₖ ||xᵢ − μₖ||²

Example (2D, K=2):

Cluster 0: points (1,2), (2,1), (1,1)
μ₀ = (1.33, 1.33)
WCSS₀ = (1−1.33)² + (2−1.33)²
      + (2−1.33)² + (1−1.33)²
      + (1−1.33)² + (1−1.33)²
      = 0.11 + 0.44 + 0.44 + 0.11 + 0.11 + 0.11 = 1.33

Cluster 1: points (8,9), (9,8), (8,8)
μ₁ = (8.33, 8.33)
WCSS₁ = 1.33

Total WCSS = 1.33 + 1.33 = 2.67

# Good clustering: low WCSS (tight clusters)
# Bad clustering: high WCSS (spread-out clusters)
# K-Means guarantees WCSS decreases each iteration
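The worked example can be checked in code by recomputing WCSS by hand and comparing it to scikit-learn's inertia_ (this assumes K-Means finds the obvious two-cluster split, which it does here with n_init=10):

```python
import numpy as np
from sklearn.cluster import KMeans

# The six points from the worked example
X = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [8, 8]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# WCSS by hand: squared distance of every point to its own cluster centroid
manual = sum(
    np.sum((X[km.labels_ == k] - c) ** 2)
    for k, c in enumerate(km.cluster_centers_)
)
print(f"manual WCSS = {manual:.2f}, inertia_ = {km.inertia_:.2f}")  # both 2.67
```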
Key insight: WCSS measures how “compact” your clusters are. It’s like measuring how tightly packed each group of friends is sitting in the cafeteria. A good clustering has everyone sitting close to their group’s center. WCSS is the total “spread” across all groups.
Choosing K: Elbow Method & Silhouette Score
The hardest question in clustering — how many groups?
Two Methods
Elbow Method: Plot WCSS vs K. As K increases, WCSS drops. Look for the “elbow” — the point where adding more clusters stops reducing WCSS significantly. The elbow is where the curve bends from steep to flat.

Silhouette Score: For each point, compute:
s = (b − a) / max(a, b)
where a = average distance to points in the same cluster, b = average distance to points in the nearest other cluster.

s ranges from −1 to +1. s ≈ 1: well-clustered. s ≈ 0: on the boundary. s < 0: probably in the wrong cluster. The average silhouette score across all points measures overall clustering quality.

Pick the K that maximizes the average silhouette score. This is more reliable than the elbow method, which can be ambiguous.
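The silhouette formula can be verified by hand on a tiny made-up dataset, then checked against scikit-learn's per-point silhouette_samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four points forming two obvious clusters (a tiny, invented example)
X = np.array([[0, 0], [0, 1], [5, 0], [5, 1]], dtype=float)
labels = np.array([0, 0, 1, 1])

# Silhouette of point 0 by hand:
a = np.linalg.norm(X[0] - X[1])               # mean distance within own cluster
b = np.mean([np.linalg.norm(X[0] - X[2]),     # mean distance to the nearest
             np.linalg.norm(X[0] - X[3])])    # other cluster
s = (b - a) / max(a, b)

print(f"by hand: {s:.3f}")                                 # 0.802
print(f"sklearn: {silhouette_samples(X, labels)[0]:.3f}")  # 0.802
```

Because b is much larger than a, the point sits deep inside its own cluster and the score is close to 1.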
Choosing K in Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method
wcss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
# Plot wcss vs k → look for the elbow

# Silhouette score (more reliable)
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    sil_scores.append(sil)
    print(f"K={k}: silhouette={sil:.3f}")

# Typical output:
# K=2: silhouette=0.581
# K=3: silhouette=0.553  ← if elbow at 3
# K=4: silhouette=0.498
# K=5: silhouette=0.410
# Pick the K with the highest silhouette
Key insight: Choosing K is like deciding how many drawers to use for organizing socks. Too few drawers (K=2) and you mix dress socks with gym socks. Too many drawers (K=20) and each drawer has one pair. The elbow method finds where adding drawers stops helping. The silhouette score checks if each sock is actually in the right drawer.
K-Means Limitations
Spherical clusters, sensitivity to initialization, and the curse of scale
When K-Means Fails
1. Non-spherical clusters: K-Means assumes clusters are roughly spherical (equal variance in all directions). It fails on elongated, crescent-shaped, or ring-shaped clusters because it uses Euclidean distance to the centroid.

2. Unequal cluster sizes: K-Means tends to create equal-sized clusters. If one cluster has 10,000 points and another has 100, K-Means may split the large cluster and merge the small one with a neighbor.

3. Sensitivity to outliers: A single outlier far from all clusters pulls its centroid toward it, distorting the entire cluster.

4. Must specify K in advance: You need to know (or guess) the number of clusters before running. DBSCAN doesn’t have this limitation.

5. Feature scaling: K-Means uses Euclidean distance, so features on different scales dominate. Always standardize features first.
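Point 5 is worth seeing in action. In this sketch (the income/age features are invented), the two groups differ only in age, but income's much larger scale dominates the raw Euclidean distance, so unscaled K-Means splits on the wrong feature:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: income in dollars (no group structure at all) and
# age in years (two genuine groups). Scales differ by ~3 orders of magnitude.
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, 200)
age = np.concatenate([rng.normal(25, 3, 100), rng.normal(60, 3, 100)])
X = np.column_stack([income, age])

raw = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X)
)

true = np.array([0] * 100 + [1] * 100)

def agreement(pred):
    # Fraction matching the true age split; cluster IDs are arbitrary,
    # so take the better of the two label permutations
    acc = np.mean(pred == true)
    return max(acc, 1 - acc)

print(f"unscaled: {agreement(raw):.2f}")    # near 0.5 — splits on income noise
print(f"scaled:   {agreement(scaled):.2f}") # near 1.0 — recovers the age groups
```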
K-Means Works
Spherical, well-separated clusters
Roughly equal cluster sizes
Low-dimensional data
Need for speed (O(nKd))
Known number of clusters
K-Means Fails
Elongated or ring-shaped clusters
Very unequal cluster sizes
Many outliers in the data
Unknown number of clusters
Non-Euclidean distance needed
Key insight: K-Means sees the world through circular glasses. It can only draw circular boundaries around clusters. If your data has crescent moons, concentric rings, or irregular blobs, K-Means will force them into circles — like trying to fit a square peg into a round hole. That’s when you need DBSCAN.
DBSCAN: Density-Based Clustering
Find clusters of any shape — and automatically detect outliers
How DBSCAN Works
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and marks points in low-density regions as noise.

Two parameters: ε (eps) = neighborhood radius, min_samples = minimum points to form a dense region.

Core point: Has ≥ min_samples neighbors within ε.
Border point: Within ε of a core point but has < min_samples neighbors itself.
Noise point: Neither core nor border — an outlier.

Clusters form by connecting core points that are within ε of each other. Border points are assigned to the nearest core point’s cluster. Noise points get label −1.

Advantages: No need to specify K. Finds clusters of any shape. Automatically identifies outliers. Works well with spatial data.
DBSCAN in scikit-learn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

db = DBSCAN(
    eps=0.5,          # neighborhood radius
    min_samples=5,    # min points for a core point
    metric='euclidean'
)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}")
print(f"Noise points: {n_noise}")

# Tuning eps: use a k-distance plot.
# Sort distances to the k-th nearest neighbor
# and look for the "knee" — that's a good eps.
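The k-distance heuristic for choosing eps, mentioned in the code comments, can be sketched with scikit-learn's NearestNeighbors (the k and the percentile printed here are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=500, noise=0.1, random_state=42)
X = StandardScaler().fit_transform(X)

# Distance from each point to its k-th nearest neighbor, with k = min_samples.
# (kneighbors on the training set returns each point itself at distance 0,
#  which matches sklearn's DBSCAN counting the point toward min_samples.)
k = 5
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Plot k_dist and look for the "knee"; the distance there is a good eps:
#   import matplotlib.pyplot as plt
#   plt.plot(k_dist); plt.ylabel("5-NN distance"); plt.show()
print(f"90th percentile k-distance: {np.percentile(k_dist, 90):.3f}")
```

Points inside dense regions have small k-distances; the sharp rise at the tail of the sorted curve marks the transition to sparse (noise) regions, which is why the knee is a sensible eps.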
Key insight: DBSCAN is like finding friend groups at a party. If 5+ people are standing within arm’s reach of each other, they’re a group. People on the edge of a group are “border” members. The person standing alone in the corner is noise. Unlike K-Means, DBSCAN doesn’t force everyone into a group — loners stay loners.
Hierarchical Clustering & Dendrograms
Build a tree of merges — cut at any level to get clusters
Agglomerative Clustering
Agglomerative (bottom-up) clustering starts with each point as its own cluster, then repeatedly merges the two closest clusters until only one remains.

The merge history forms a dendrogram — a tree diagram. Cut the dendrogram at any height to get a specific number of clusters. Higher cut = fewer clusters. Lower cut = more clusters.

Linkage criteria define “distance between clusters”:
Ward: Minimizes within-cluster variance. Produces compact, equal-sized clusters. Most common.
Complete: Maximum distance between any two points in different clusters.
Average: Mean distance between all pairs of points.
Single: Minimum distance. Can find elongated clusters but is sensitive to noise (chaining effect).

Complexity: O(n³) for naive, O(n² log n) optimized. Practical for up to ~10,000 points.
Hierarchical Clustering in Code
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# scikit-learn: get cluster labels
agg = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward'
)
labels = agg.fit_predict(X)

# scipy: compute linkage for the dendrogram
Z = linkage(X, method='ward')
dendrogram(Z, truncate_mode='lastp', p=20)

# The dendrogram shows the merge history.
# Tall vertical lines = big distance between
# merged clusters → natural cluster boundary.
# Cut where the tallest gap is.
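Cutting the tree at a chosen level, as described above, maps directly to scipy's fcluster. The dataset and the cut height below are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
Z = linkage(X, method='ward')

# Cut the tree so that exactly 3 clusters remain...
labels = fcluster(Z, t=3, criterion='maxclust')   # labels start at 1
print(np.bincount(labels)[1:])

# ...or cut at a fixed merge height (distance threshold) instead
labels_h = fcluster(Z, t=20.0, criterion='distance')
print(f"clusters at height 20: {len(np.unique(labels_h))}")
```

criterion='maxclust' fixes the number of clusters; criterion='distance' fixes the height, letting the data determine how many clusters fall out. One linkage computation supports any number of cuts.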
Key insight: Hierarchical clustering is like organizing a company. Start with individual employees (leaves). Merge them into teams, teams into departments, departments into divisions, divisions into the company (root). The dendrogram is the org chart. You can “cut” at any level to see the structure at that granularity.
Complete Clustering Comparison
K-Means vs DBSCAN vs Agglomerative on real data
Head-to-Head Comparison
from sklearn.datasets import make_moons, make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Dataset 1: Blobs (K-Means friendly)
X_blobs, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Dataset 2: Moons (non-convex)
X_moons, _ = make_moons(n_samples=500, noise=0.1, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

models = {
    "KMeans": KMeans(n_clusters=2, random_state=42),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Agglom": AgglomerativeClustering(n_clusters=2),
}

for name, model in models.items():
    labels = model.fit_predict(X_moons)
    mask = labels >= 0  # exclude DBSCAN noise points (label -1)
    if mask.sum() > 1 and len(set(labels[mask])) > 1:
        sil = silhouette_score(X_moons[mask], labels[mask])
    else:
        sil = 0
    print(f"{name:8s} sil={sil:.3f}")
Results and When to Use What
On blobs (spherical clusters):
KMeans: sil=0.75 ← best (designed for this)
DBSCAN: sil=0.72
Agglom: sil=0.74

On moons (crescent shapes):
KMeans: sil=0.38 ← fails (can't do crescents)
DBSCAN: sil=0.48 ← wins (density-based)
Agglom: sil=0.41

Decision guide:
Spherical, known K → KMeans (fastest)
Arbitrary shape → DBSCAN (auto K, outliers)
Need a dendrogram → Agglomerative
Very large data → MiniBatchKMeans
Mixed densities → HDBSCAN (adaptive eps)
Key insight: There’s no single best clustering algorithm. K-Means is the fastest and works great on spherical clusters. DBSCAN handles arbitrary shapes and finds outliers. Hierarchical gives you a full merge tree. The right choice depends on your data’s shape, size, and whether you know K in advance. When in doubt, try all three and compare silhouette scores.