Ch 3 — Eigenvalues, SVD & Decompositions

Finding the natural axes of your data — and compressing billions of parameters
Linear Algebra
Chapter roadmap: Football → Eigen → Covariance → PCA → SVD → Low-Rank → LoRA
The Football Cloud — Finding Natural Axes
Why some directions matter more than others
The Analogy
Imagine throwing 1,000 darts at a wall, but they cluster in a football (ellipse) shape — longer along one axis, shorter along another. The eigenvectors are the long and short axes of that football. The eigenvalues tell you how stretched each axis is. The long axis (big eigenvalue) is the direction where the data varies most — it carries the most information.
Key insight: Your data doesn’t care about the x-y grid you drew. It has its own natural axes — the directions where it actually spreads out. Eigenvectors find those natural axes. Eigenvalues tell you which axes matter most.
The Math
An eigenvector v of matrix A satisfies: Av = λv. The matrix only scales v (by λ), never changes its direction.
```python
import numpy as np

A = np.array([[3, 1],
              [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# eigenvalues:  [4, 2]
# eigenvectors: [[ 0.707, -0.707],
#                [ 0.707,  0.707]]

# Verify: A @ v = λ * v
v = eigenvectors[:, 0]   # first eigenvector
A @ v                    # ≈ 4 * v ✓ (scaled, not rotated)
```
Real World
Football’s long axis = direction of most spread
In AI
Largest eigenvalue = most important feature direction
Eigenvectors & Eigenvalues in Detail
The directions that survive a transformation unchanged
The Analogy
Imagine a stretchy rubber sheet. When you pull it, most points move in complicated ways. But some special arrows only get longer or shorter — they don’t rotate. Those are the eigenvectors. The stretch factor is the eigenvalue. λ = 3 means that arrow gets 3× longer. λ = 0.5 means it shrinks to half. λ = 0 means it collapses to nothing.
Key insight: A matrix might look complicated, but its eigenvectors reveal its “true personality.” In those special directions, the matrix is just scaling — no rotation, no shearing. This simplification is the key to everything that follows.
Diagonalization
If A has n independent eigenvectors, we can write A = VΛV⁻¹, where V holds eigenvectors as columns and Λ is diagonal with eigenvalues:
```python
# Diagonalization: A = V Λ V⁻¹
vals, V = np.linalg.eig(A)
Lambda = np.diag(vals)   # diagonal matrix of eigenvalues

# Reconstruct A
A_reconstructed = V @ Lambda @ np.linalg.inv(V)
# A_reconstructed ≈ A ✓

# Powers of A are easy in the eigen-basis:
# A^100 = V @ Λ^100 @ V⁻¹
# Just raise the eigenvalues to the 100th power!
```
Key property: Symmetric matrices (A = Aᵀ) always have real eigenvalues and orthogonal eigenvectors. Covariance matrices are symmetric — which is why PCA always works cleanly.
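As a quick check of this property, here is a small numpy sketch using `np.linalg.eigh`, the routine specialized for symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # make it symmetric: A == A.T

# eigh guarantees real eigenvalues (sorted ascending) and orthonormal eigenvectors
vals, V = np.linalg.eigh(A)

assert np.all(np.isreal(vals))                  # real eigenvalues
assert np.allclose(V.T @ V, np.eye(4))          # orthonormal columns
assert np.allclose(V @ np.diag(vals) @ V.T, A)  # A = V Λ Vᵀ
```

Note the bonus: for symmetric matrices V⁻¹ = Vᵀ, so diagonalization needs a transpose instead of a matrix inverse.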
The Covariance Matrix
Measuring how features move together
The Analogy
Imagine tracking ice cream sales and temperature over a year. When temperature goes up, sales go up too — they co-vary. The covariance matrix captures all these “move together” relationships between every pair of features. Diagonal entries = how much each feature varies on its own. Off-diagonal = how much pairs move together.
Key insight: The covariance matrix IS the football shape. Its eigenvectors are the football’s axes, and its eigenvalues are how stretched each axis is. PCA simply finds the eigenvectors of the covariance matrix.
Worked Example
```python
import numpy as np

# 100 data points, 3 features
X = np.random.randn(100, 3)

# Center the data (subtract the mean of each feature)
X_centered = X - X.mean(axis=0)

# Covariance matrix: (3×3)
C = (X_centered.T @ X_centered) / (len(X) - 1)
# C[i, j] = how much features i and j co-vary
# C[i, i] = variance of feature i

# Or simply:
C = np.cov(X.T)   # same result
```
Real World
Temperature ↑ and ice cream sales ↑ co-vary positively
In AI
Covariance matrix reveals which features are redundant (highly correlated)
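To connect the covariance matrix back to the football picture, here is a small sketch (the stretch factors and angle are illustrative choices): generate a 2-D cloud stretched along the 45° diagonal, then check that the top eigenvector of its covariance matrix recovers that long axis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a correlated 2-D cloud: stretch by 3 vs 0.5, then rotate 45°
z = rng.standard_normal((1000, 2))
X = z @ np.diag([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T

C = np.cov(X.T)                 # 2×2 covariance matrix
vals, vecs = np.linalg.eigh(C)  # eigenvalues in ascending order

long_axis = vecs[:, -1]         # eigenvector of the largest eigenvalue
# long_axis ≈ ±[0.707, 0.707]: the 45° diagonal we built in
# vals ≈ [0.25, 9] up to sampling noise: the squared stretch factors
```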
PCA — Choosing the Best Rulers
Reduce 1,000 features to 50 without losing much
The Analogy
Imagine photographing a 3D sculpture. From the front, you see most of the detail. From the side, you see a bit more. From the top, you see almost nothing new. PCA picks the camera angles that capture the most detail — then throws away the boring angles. If 95% of the information is in the first 50 directions (out of 1,000), you keep only those 50.
Key insight: PCA is literally “find the eigenvectors of the covariance matrix, sort by eigenvalue, keep the top k.” The eigenvalues tell you what percentage of information each direction carries. If the top 50 eigenvalues sum to 95% of the total, those 50 directions are enough.
Worked Example
```python
import numpy as np
from sklearn.decomposition import PCA

# 1000 samples, 100 features
X = np.random.randn(1000, 100)

# Keep the top 10 components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
# X_reduced.shape == (1000, 10)

# How much variance is explained?
pca.explained_variance_ratio_
# [0.023, 0.021, 0.019, ...] (random data)
# Real data: the first few components capture most of the variance,
# e.g. [0.45, 0.25, 0.12, 0.08, ...]
```
Real numbers: In face recognition, PCA on 10,000-pixel images often finds that ~100 principal components capture 95%+ of the variance. That’s a 100× compression with minimal information loss.
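The "eigenvectors of the covariance matrix, sorted by eigenvalue" recipe can also be run by hand in plain numpy. The sketch below builds data with deliberately redundant features (a hypothetical construction, not from the text) and checks that the top components capture essentially all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# 5 features, but features 3 and 4 are near-copies of feature 0
base = rng.standard_normal((300, 3))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 2],
                     base[:, 0] + 0.01 * rng.standard_normal(300),
                     base[:, 0] + 0.01 * rng.standard_normal(300)])

# PCA by hand: eigenvectors of the covariance matrix, sorted descending
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
vals, vecs = vals[::-1], vecs[:, ::-1]

explained = vals / vals.sum()
# explained[:3].sum() ≈ 1.0 — three directions carry nearly all the variance

# Keep the top k = 3 and project
k = 3
X_reduced = Xc @ vecs[:, :k]   # shape (300, 3): 5 features reduced to 3
```

The two redundant features add almost no new variance, so PCA folds all five features into three directions, exactly the compression story above.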
SVD — The Universal Decomposition
Any matrix = rotate × scale × rotate
The Analogy
SVD says every matrix transformation can be broken into three simple steps: (1) rotate the input, (2) stretch or shrink along the axes, (3) rotate again. It’s like decomposing any dance move into “turn, stretch, turn.” The stretching amounts (singular values) tell you which dimensions carry the most energy.
Key insight: Unlike eigendecomposition (which only works on square matrices), SVD works on any matrix — even rectangular ones. This makes it the Swiss Army knife of linear algebra. PCA is actually just SVD applied to the centered data matrix.
Worked Example
```python
import numpy as np

# SVD: A = U Σ Vᵀ
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])   # A is 3×2 (not square!)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
# U:  (3×2) — left singular vectors
# S:  [9.53, 0.51] — singular values
# Vt: (2×2) — right singular vectors

# Reconstruct: A ≈ U @ diag(S) @ Vt
A_recon = U @ np.diag(S) @ Vt
# A_recon ≈ A ✓
```
Formula: A = UΣVᵀ. U and V are orthogonal (rotation), Σ is diagonal (scaling). Singular values σ₁ ≥ σ₂ ≥ ... ≥ 0.
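The claim that PCA is just SVD of the centered data matrix can be checked numerically: the covariance eigenvalues λᵢ equal σᵢ² / (n − 1), where the σᵢ are the singular values of the centered matrix. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
Xc = X - X.mean(axis=0)

# SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix, sorted descending
cov_vals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

# The bridge between SVD and PCA: λᵢ = σᵢ² / (n − 1)
assert np.allclose(S**2 / (len(X) - 1), cov_vals)
# ...and the rows of Vt are the principal directions (up to sign)
```

This is also why libraries compute PCA via SVD: it avoids forming the covariance matrix explicitly, which is more numerically stable.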
Low-Rank Approximation — Compressing Photos
Keep the big singular values, drop the small ones
The Analogy
Imagine a photo is a matrix of pixel values. SVD decomposes it into layers of detail: the first layer captures the broad shapes, the second adds finer detail, the third even finer. Low-rank approximation keeps only the first k layers. With k = 50 out of 500, you get a recognizable image at 10% of the data. It’s like JPEG compression — keep the important parts, discard the noise.
Key insight: This is exactly the principle behind LoRA fine-tuning. A billion-parameter weight matrix doesn’t need a billion-parameter update. A rank-16 update (tiny!) is often enough to adapt the model to a new task. The “important changes” live in a low-dimensional subspace.
Worked Example
```python
import numpy as np

# Image compression via SVD
img = np.random.randn(500, 500)   # stand-in for a 500×500 grayscale image
U, S, Vt = np.linalg.svd(img, full_matrices=False)

# Keep only the top k = 50 singular values
k = 50
img_compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Storage:    500×500 = 250,000 values
# Compressed: 500×50 + 50 + 50×500 = 50,050
# → 5× compression!
```
Real World
JPEG keeps important visual details, drops imperceptible noise
In AI
Low-rank approximation compresses weight matrices while preserving model quality
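A useful sanity check on truncation, via the Eckart–Young theorem: the rank-k SVD truncation is the best possible rank-k approximation in the Frobenius norm, and its error is exactly the energy in the dropped singular values. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((100, 100))
U, S, Vt = np.linalg.svd(img, full_matrices=False)

k = 20
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Frobenius error of the rank-k truncation equals
# the root-sum-square of the dropped singular values
err = np.linalg.norm(img - approx, 'fro')
assert np.isclose(err, np.sqrt(np.sum(S[k:]**2)))
```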
LoRA — Fine-Tuning with Thin Slices
Adapt a billion-parameter model by changing only a sliver
The Analogy
Imagine a master chef (pre-trained model) who knows 10,000 recipes. To specialize in Italian food, you don’t retrain from scratch. You give them a thin notebook of Italian adjustments: “add more basil here, less cream there.” LoRA does exactly this — it freezes the original weights and adds a tiny low-rank “adjustment notebook” (matrices A and B where rank r ≪ d).
Key insight: LoRA decomposes the weight update ΔW into two thin matrices: ΔW = B × A, where B is (d×r) and A is (r×d). With d = 4096 and r = 16, you train 131,072 parameters instead of 16,777,216 — a 128× reduction.
Worked Example
```python
import torch

# Original weight matrix (frozen)
d = 4096
W = torch.randn(d, d)        # 16.7M params

# LoRA: low-rank update ΔW = B @ A
r = 16                       # rank (tiny!)
A = torch.randn(r, d)        # 65,536 params
B = torch.randn(d, r)        # 65,536 params
# Total trainable: 131,072 (0.78% of W!)

# Forward pass with LoRA
x = torch.randn(d)
y = (W + B @ A) @ x          # original + low-rank adjustment
```
Source: Hu et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models” showed that fine-tuning updates have low intrinsic rank, enabling efficient adaptation with r as small as 4–16.
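To make this concrete, here is a minimal, hypothetical LoRA layer in PyTorch: a sketch, not the actual `peft` implementation. The zero-initialization of B (so the adapted model starts out identical to the base model) and the α/r scaling follow the conventions described in the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, d_in, d_out, r=16, alpha=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable, small init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable, zero init
        self.scale = alpha / r                              # common LoRA scaling

    def forward(self, x):
        # y = W x + scale * B (A x); at init B = 0, so output equals the base model
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(64, 64, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# trainable == 4*64 + 64*4 = 512 params, vs 4,096 frozen in W
```

Computing B(Ax) as two thin matrix-vector products, instead of materializing ΔW = BA, is what keeps both memory and compute proportional to r rather than d.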
The Big Picture — Decompositions Everywhere
Why breaking things apart is the key to understanding them
The Analogy
Decomposition is like understanding a symphony by separating it into individual instruments. The full orchestra (matrix) is complex, but each instrument (component) is simple. Eigendecomposition, SVD, and PCA all do the same fundamental thing: break a complex object into simple, ranked pieces so you can keep the important ones and discard the rest.
Why it matters for AI: Decompositions are everywhere in modern AI: PCA for feature reduction, SVD for matrix compression, LoRA for efficient fine-tuning, spectral clustering for graph analysis, and latent factor models for recommendation systems. Mastering this chapter unlocks all of them.
Cheat Sheet
```python
# Eigendecomposition: A = V Λ V⁻¹
#   → Square matrices only
#   → Eigenvectors = natural axes
#   → Eigenvalues = stretch factors

# SVD: A = U Σ Vᵀ
#   → Any matrix (even rectangular)
#   → Singular values = importance ranking
#   → Low-rank approx = keep top k

# PCA: eigenvectors of the covariance matrix
#   → Equivalent to SVD of the centered data
#   → Reduce 1000 features to 50

# LoRA: ΔW = B @ A  (rank r ≪ d)
#   → Fine-tune with 0.1–1% of the params
```
Real World
Separate a symphony into instruments to understand each part
In AI
Decompose weight matrices to compress, fine-tune, and understand models