Ch 8 — Dimensionality Reduction: PCA & t-SNE

Compressing high-dimensional data — see the forest, not every leaf
The Curse of Dimensionality
Why high dimensions break everything — distances, volumes, and intuition
What Goes Wrong
In high dimensions, distances become meaningless. As the number of dimensions grows, the ratio of the distance to the farthest point over the distance to the nearest point approaches 1. If all points are roughly equidistant, KNN, K-Means, and every other distance-based algorithm breaks down.

Volume explodes: A unit hypercube in 100D has volume 1, but the hypersphere inscribed in it has volume ≈ 10⁻⁷⁰. Almost all the volume is in the corners. Random samples concentrate in a thin shell near the surface.

Data becomes sparse: To maintain the same density of points, you need exponentially more data as dimensions grow. With 10 features and 1,000 samples, you have reasonable coverage. With 1,000 features and 1,000 samples, each point is alone in its neighborhood.

Overfitting risk: More features = more parameters = more ways to memorize noise. A model with 10,000 features and 500 samples will overfit unless you reduce dimensions or regularize heavily.
The Numbers
Distance concentration in high dimensions:
  d=2:    max_dist/min_dist ≈ 3.2  (meaningful)
  d=10:   max_dist/min_dist ≈ 1.8
  d=100:  max_dist/min_dist ≈ 1.2  (all ~same!)
  d=1000: max_dist/min_dist ≈ 1.06 (useless)

Data needed for coverage:
  d=2:   ~100 samples for reasonable coverage
  d=10:  ~10,000 samples
  d=100: ~10¹⁰ samples (impossible)

Real-world examples:
  Text (TF-IDF): 50,000+ features
  Gene expression: 20,000 genes
  Images (pixels): 784 (28×28) to millions
  Tabular data: 10-500 features

# Solution: reduce dimensions to 2-50
# while preserving the important structure.
Key insight: The curse of dimensionality is like trying to find your friend in a room. In 1D (a hallway), easy. In 2D (a floor), harder. In 100D, everyone is equally far from everyone else — “close” and “far” lose meaning. Dimensionality reduction collapses the 100D room back to a 2D map where distances matter again.
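The distance-concentration effect is easy to reproduce yourself: sample random points in a unit hypercube and compare the nearest and farthest distances from a query point. A minimal NumPy sketch (the exact ratios depend on the sample and seed):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))              # 500 uniform points in the unit hypercube
    q = rng.random(d)                     # one query point
    dists = np.linalg.norm(X - q, axis=1)
    ratios[d] = dists.max() / dists.min()
    print(f"d={d:4d}  max/min distance ratio: {ratios[d]:.2f}")
```

As d grows, the ratio drifts toward 1, which is exactly why nearest-neighbor distinctions stop being informative.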
PCA: Finding Directions of Maximum Variance
Project data onto the axes where it varies most
The Intuition
Imagine a cloud of 3D data points shaped like a flat pancake. The pancake has lots of spread in two directions but almost none in the third (thickness). PCA finds these directions and says: “Keep the two spread-out directions, drop the thin one.”

Principal Component 1 (PC1): The direction of maximum variance in the data. If you project all points onto this line, the spread is maximized.

PC2: The direction of maximum remaining variance, perpendicular to PC1.

PC3, PC4, …: Each captures the next most variance, always perpendicular to all previous components.

If the first 2 PCs capture 95% of the variance, you can represent 100D data in 2D with only 5% information loss. This is lossy compression — like JPEG for data.
PCA Step by Step
1. Center the data:
   X_centered = X - mean(X)

2. Compute covariance matrix:
   C = (1/(n-1)) × X_centeredᵀ × X_centered
   # C is d×d, symmetric, positive semi-definite

3. Find eigenvalues & eigenvectors:
   C × v = λ × v
   # λ₁ ≥ λ₂ ≥ ... ≥ λ_d (sorted)
   # v₁, v₂, ..., v_d (principal components)

4. Project data:
   X_reduced = X_centered × V_k
   # V_k = [v₁, v₂, ..., vₖ] (top k eigenvectors)

Explained variance ratio:
   EVR_i = λᵢ / Σⱼ λⱼ
   # What fraction of total variance PC_i captures

# In practice, scikit-learn uses SVD (more
# numerically stable than eigendecomposition).
Key insight: PCA is like taking a photograph of a 3D object. The camera angle (principal component direction) that shows the most detail is the best view. PC1 is the front view, PC2 is the side view. Together they capture most of the object’s shape, even though you lost the depth dimension.
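The numbered steps in "PCA Step by Step" map directly onto a few lines of NumPy. This is a toy sketch via eigendecomposition, not the SVD route scikit-learn actually takes, and the pancake-shaped test cloud is synthetic:

```python
import numpy as np

def pca_from_scratch(X, k):
    """PCA via eigendecomposition: center, covariance, eigenpairs, project."""
    X_centered = X - X.mean(axis=0)                   # 1. center
    C = X_centered.T @ X_centered / (len(X) - 1)      # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)              # 3. eigenpairs (ascending order)
    order = np.argsort(eigvals)[::-1]                 #    re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    evr = eigvals / eigvals.sum()                     # explained variance ratio
    return X_centered @ eigvecs[:, :k], evr           # 4. project onto top k PCs

# pancake-shaped cloud: wide in two directions, thin in the third
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.1])
X_2d, evr = pca_from_scratch(X, k=2)
print(X_2d.shape, evr.round(3))   # the top two PCs carry almost all the variance
```

Dropping the thin third direction loses a negligible fraction of the variance, which is the pancake intuition in code.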
PCA Math: Covariance, Eigenvalues, Eigenvectors
The linear algebra behind the magic
The Covariance Matrix
The covariance matrix C is a d×d matrix where Cᵢⱼ measures how features i and j co-vary:

Cᵢⱼ = (1/(n−1)) ∑ₖ (xₖᵢ − μᵢ)(xₖⱼ − μⱼ)

Diagonal entries Cᵢᵢ = variance of feature i. Off-diagonal Cᵢⱼ = covariance between features i and j. If Cᵢⱼ > 0, they increase together. If Cᵢⱼ < 0, one increases as the other decreases.

The eigenvectors of C are the principal component directions. The eigenvalues are the variance along each direction. PCA sorts eigenvalues from largest to smallest and keeps the top k eigenvectors.

scikit-learn uses SVD instead: X = UΣVᵀ. The columns of V are the principal components. The singular values σᵢ relate to eigenvalues by λᵢ = σᵢ²/(n−1). SVD is more numerically stable and doesn’t require computing C explicitly.
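The λᵢ = σᵢ²/(n−1) relationship between the two routes can be checked numerically on random data (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)   # both routes require centered data
n = len(Xc)

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eig_route = np.sort(np.linalg.eigvalsh(C))[::-1]        # sort descending

# Route 2: SVD of the centered data matrix itself
sigma = np.linalg.svd(Xc, compute_uv=False)             # singular values, descending
svd_route = sigma ** 2 / (n - 1)                        # λᵢ = σᵢ²/(n−1)

print(np.allclose(eig_route, svd_route))                # True: identical variances
```

Both routes give the same variances; SVD simply avoids forming C, which helps numerically when d is large.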
Numerical Example
3 features, covariance matrix:
  C = [[4.0, 2.0, 0.1],
       [2.0, 3.0, 0.2],
       [0.1, 0.2, 0.5]]

Eigenvalues: λ₁≈5.57, λ₂≈1.45, λ₃≈0.48
Total variance: 5.57 + 1.45 + 0.48 = 7.50

Explained variance ratios:
  PC1: 5.57/7.50 ≈ 74.3%
  PC2: 1.45/7.50 ≈ 19.3%
  PC3: 0.48/7.50 ≈ 6.4%

Cumulative:
  PC1:       74.3%
  PC1+PC2:   93.6%  ← keep 2 PCs, lose 6.4%
  All 3:    100%

# 2 components capture 93.6% of variance.
# Reduced from 3D to 2D with minimal loss.
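This small decomposition can be reproduced in two lines with NumPy, which is a good habit whenever you work an eigenvalue example by hand:

```python
import numpy as np

C = np.array([[4.0, 2.0, 0.1],
              [2.0, 3.0, 0.2],
              [0.1, 0.2, 0.5]])

eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # descending: ≈ [5.57, 1.45, 0.48]
evr = eigvals / eigvals.sum()

print("eigenvalues:", eigvals.round(2))
print("EVR:        ", evr.round(3))
print("cumulative: ", np.cumsum(evr).round(3))
```

Note the sanity checks: the eigenvalues must sum to the trace of C (4.0 + 3.0 + 0.5 = 7.5), and the cumulative EVR must end at 1.0.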
Key insight: Eigenvalues are like the lengths of a shadow. Shine a light on a 3D object from different angles. The angle that casts the longest shadow (most variance) is PC1. The perpendicular angle with the next longest shadow is PC2. If the third shadow is tiny, the object is essentially flat — you can safely ignore that dimension.
How Many Components to Keep
Scree plots and the 95% variance threshold
Two Methods
Scree plot: Plot eigenvalues (or explained variance ratios) in decreasing order. Look for the “elbow” where the curve flattens. Components after the elbow add little information.

Cumulative variance threshold: Keep enough components to explain a target percentage of variance (commonly 95% or 99%). If 50 components out of 1,000 explain 95% of variance, you’ve compressed 20x with 5% loss.

scikit-learn makes this easy: set n_components=0.95 and PCA automatically selects the right number of components.
Choosing Components in Code
from sklearn.decomposition import PCA
import numpy as np

# Method 1: specify exact number
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

# Method 2: specify variance threshold
pca = PCA(n_components=0.95)   # keep 95%
X_reduced = pca.fit_transform(X)
print(f"Components kept: {pca.n_components_}")

# Scree plot data
pca_full = PCA().fit(X)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
# Plot cumvar → find where it crosses 0.95

Typical results (MNIST, 784 features):
  95% variance → 154 components (5x compression)
  99% variance → 331 components (2.4x compression)
  For visualization → 2 components (~25% variance)
Key insight: Choosing components is like deciding how many pixels to keep in a photo. 95% quality looks nearly identical to the original. 50% quality is blurry but recognizable. 2 components (for visualization) is like a rough sketch — you lose detail but see the overall structure. The right choice depends on whether you’re preprocessing (95%) or visualizing (2).
PCA for Visualization, Denoising, and Preprocessing
Three practical uses beyond just compression
Three Use Cases
1. Visualization: Project 100D data to 2D for scatter plots. Color by class label to see if classes are separable. If clusters are visible in 2D PCA, a classifier will likely work well.

2. Denoising: Noise lives in the low-variance components. Keep the top k components, discard the rest, then reconstruct: X̂ = X_reduced × Vₖᵀ + μ. The reconstruction removes noise while preserving signal.

3. Preprocessing: Reduce dimensions before training a classifier. Benefits: faster training (fewer features), less overfitting (fewer parameters), and sometimes better accuracy (noise removed). Common pipeline: PCA(n_components=0.95) → LogisticRegression.

Important: Always fit PCA on training data only, then transform both train and test. Fitting on test data is data leakage.
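The denoising use case is the one not covered by the pipeline code below it, so here is a minimal sketch on synthetic data. The setup is made up for illustration: a signal that genuinely lives in a 5D subspace of 64D space, plus Gaussian noise, so keeping 5 components provably helps:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# low-rank signal: 500 samples living in a 5D subspace of a 64D space
basis = rng.normal(size=(5, 64))
signal = rng.normal(size=(500, 5)) @ basis
noisy = signal + rng.normal(scale=0.5, size=signal.shape)   # add Gaussian noise

pca = PCA(n_components=5).fit(noisy)                # keep only the top 5 components
denoised = pca.inverse_transform(pca.transform(noisy))   # X̂ = X_reduced Vₖᵀ + μ

print("noisy MSE:   ", np.mean((noisy - signal) ** 2))      # ≈ 0.25 (the noise variance)
print("denoised MSE:", np.mean((denoised - signal) ** 2))   # much smaller
```

The dropped 59 components carried almost pure noise, so discarding them removes most of the error while the 5D signal survives.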
PCA Pipeline Example
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# PCA as preprocessing
pipe = make_pipeline(
    StandardScaler(),              # always scale first!
    PCA(n_components=0.95),        # keep 95% variance
    LogisticRegression(max_iter=500)
)
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.3f}")

# Visualization (2D)
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(StandardScaler().fit_transform(X))
# Scatter plot X_2d[:, 0] vs X_2d[:, 1], color by y

# MNIST: 784D → 154D (95% var) → LogReg
# Accuracy drops from 0.92 to 0.91 (1% loss)
# Training time drops from 12s to 3s (4x faster)
Key insight: PCA is the Swiss Army knife of preprocessing. For visualization, it gives you a 2D map of high-dimensional data. For denoising, it strips away the static. For preprocessing, it speeds up training and reduces overfitting. Always scale your features first — PCA is sensitive to feature magnitudes.
t-SNE: Nonlinear Embedding for Visualization
Preserving neighborhoods — seeing clusters that PCA misses
Why t-SNE?
PCA is linear — it can only rotate and stretch data. If clusters are separated by curved boundaries in high-D, PCA’s 2D projection may overlap them.

t-SNE (t-distributed Stochastic Neighbor Embedding) is nonlinear. It preserves local neighborhoods: points that are close in high-D stay close in 2D. Points that are far apart in high-D end up in separate regions of the 2D map — though, as noted below, exactly how far apart is not meaningful.

The algorithm: (1) Compute pairwise similarities in high-D using Gaussian kernels. (2) Initialize random 2D positions. (3) Iteratively adjust 2D positions to match high-D similarities, using a Student-t distribution (heavier tails than Gaussian) to avoid the “crowding problem.”

Perplexity (typically 5–50) controls the effective number of neighbors. Low perplexity = focus on very local structure. High perplexity = consider broader neighborhoods.
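Step (1) and the perplexity idea can be sketched directly. This is an illustration, not the library's implementation: real t-SNE searches for a per-point bandwidth σᵢ so that every row of similarities hits the target perplexity, whereas this sketch fixes one global σ:

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    """Step (1): high-D affinities p(j|i) from a Gaussian kernel, fixed bandwidth."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    P = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                              # no self-similarity
    return P / P.sum(axis=1, keepdims=True)               # each row is a distribution

def perplexity(p_row):
    """Effective number of neighbors: 2^H(p), H in bits."""
    p = p_row[p_row > 0]
    return 2 ** (-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
P = gaussian_similarities(X)
print(f"perplexity of point 0: {perplexity(P[0]):.1f}")
```

A larger σ spreads each row's probability over more neighbors, raising the perplexity — which is why the perplexity parameter behaves like an "effective neighborhood size" knob.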
t-SNE in scikit-learn
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,     # always 2 or 3
    perplexity=30,      # neighbors to consider
    learning_rate=200,  # step size
    n_iter=1000,        # optimization steps (renamed max_iter in newer scikit-learn)
    random_state=42
)
X_2d = tsne.fit_transform(X)

Caveats:
• Visualization only — no transform() for new data
• Distances between clusters are meaningless
• Cluster sizes are meaningless
• Different runs give different layouts
• Slow: O(n²) — use PCA first to reduce to 50D

# Best practice: PCA to 50D first, then t-SNE
# X_50 = PCA(n_components=50).fit_transform(X)
# X_2d = TSNE().fit_transform(X_50)
Key insight: t-SNE is like a social network layout algorithm. It places friends close together and strangers far apart. The resulting map shows community structure beautifully — but the distances between communities and the sizes of communities are artifacts, not real. Use t-SNE to see clusters, not to measure them.
t-SNE vs UMAP: Speed, Global Structure, and Trade-offs
The newer alternative that’s often better
UMAP Advantages
UMAP (Uniform Manifold Approximation and Projection) is a newer nonlinear method that addresses t-SNE’s weaknesses:

1. Speed: UMAP is 10–100x faster than t-SNE on large datasets. It scales to millions of points.

2. Global structure: t-SNE preserves local neighborhoods but distorts global relationships (distant clusters may appear at arbitrary distances). UMAP better preserves the relative positions of clusters.

3. Deterministic: With a fixed random seed, UMAP gives consistent results. t-SNE can produce very different layouts across runs.

4. Supports transform(): UMAP can embed new data points without rerunning the full algorithm. t-SNE cannot.

UMAP is not in scikit-learn but is available via pip install umap-learn. For most visualization tasks in 2025, UMAP is the preferred choice over t-SNE.
PCA
Linear, fast, interpretable
Preserves global variance
Has inverse_transform
Good for preprocessing
Deterministic
t-SNE
Nonlinear, slow (O(n²))
Best local structure
No transform for new data
Visualization only
Non-deterministic
UMAP
Nonlinear, fast
Good local + global
Has transform
Viz + preprocessing
Mostly deterministic
Key insight: PCA is a telescope (sees far, misses details). t-SNE is a microscope (sees local detail, loses the big picture). UMAP is a good pair of binoculars (reasonable detail at all scales, and much faster). For preprocessing, use PCA. For visualization, use UMAP (or t-SNE if you need the absolute best local cluster separation).
Complete Dimensionality Reduction Pipeline
PCA + t-SNE on the scikit-learn digits dataset — from 64D to 2D
MNIST Visualization Pipeline
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = load_digits(return_X_y=True)
print(f"Original: {X.shape}")   # (1797, 64)

X_scaled = StandardScaler().fit_transform(X)

# PCA: 64D → 2D (for comparison)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA 2D variance: {pca.explained_variance_ratio_.sum():.1%}")

# PCA: 64D → 50D (preprocessing)
pca50 = PCA(n_components=50)
X_50 = pca50.fit_transform(X_scaled)
print(f"PCA 50D variance: {pca50.explained_variance_ratio_.sum():.1%}")

# t-SNE: 50D → 2D (best practice: PCA first)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_50)
Results
Digits dataset (1797 samples, 64 features):

PCA 2D: 28.6% variance explained
  → Digits overlap heavily in 2D PCA

PCA 50D: 98.5% variance explained
  → Great for preprocessing (64→50, minimal loss)

t-SNE 2D (from 50D):
  → 10 clear digit clusters visible!
  → 0s, 1s, 2s, ... each form distinct blobs

When to use what:
  Preprocessing:     PCA(n_components=0.95)
  Quick 2D plot:     PCA(n_components=2)
  Beautiful 2D viz:  PCA(50) → TSNE(2)
  Large data viz:    PCA(50) → UMAP(2)
  Denoising:         PCA(k) → inverse_transform
Key insight: The PCA → t-SNE pipeline is the gold standard for visualizing high-dimensional data. PCA removes noise and reduces computation (64D → 50D). t-SNE then creates a beautiful 2D map where clusters are clearly visible. On the digits dataset, you can literally see 10 distinct digit clusters — something impossible to eyeball in the raw 64D space.