Ch 3 — Eigenvalues, SVD & Decompositions

Finding the natural axes of your data — and compressing billions of parameters
Linear Algebra
Chapter roadmap: Football → Eigen → Covariance → PCA → SVD → Low-Rank → LoRA
The Football Cloud — Finding Natural Axes
Why some directions matter more than others
The Analogy
Imagine throwing 1,000 darts at a wall, but they cluster in a football (ellipse) shape — longer along one axis, shorter along another. The eigenvectors are the long and short axes of that football. The eigenvalues tell you how stretched each axis is. The long axis (big eigenvalue) is the direction where the data varies most — it carries the most information.
Key insight: Your data doesn’t care about the x-y grid you drew. It has its own natural axes — the directions where it actually spreads out. Eigenvectors find those natural axes. Eigenvalues tell you which axes matter most.
The Math
An eigenvector v of matrix A satisfies: Av = λv. The matrix only scales v (by λ), never changes its direction.
```python
import numpy as np

A = np.array([[3, 1],
              [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# eigenvalues:  [4, 2]
# eigenvectors: [[ 0.707, -0.707],
#                [ 0.707,  0.707]]

# Verify: A @ v = λ * v
v = eigenvectors[:, 0]   # first eigenvector
A @ v                    # ≈ 4 * v ✓ (scaled, not rotated)
```
Real World
Football’s long axis = direction of most spread
In AI
Largest eigenvalue = most important feature direction
Eigenvectors & Eigenvalues in Detail
The directions that survive a transformation unchanged
The Analogy
Imagine a stretchy rubber sheet. When you pull it, most points move in complicated ways. But some special arrows only get longer or shorter — they don’t rotate. Those are the eigenvectors. The stretch factor is the eigenvalue. λ = 3 means that arrow gets 3× longer. λ = 0.5 means it shrinks to half. λ = 0 means it collapses to nothing.
Key insight: A matrix might look complicated, but its eigenvectors reveal its “true personality.” In those special directions, the matrix is just scaling — no rotation, no shearing. This simplification is the key to everything that follows.
Diagonalization
If A has n independent eigenvectors, we can write A = VΛV⁻¹, where V holds eigenvectors as columns and Λ is diagonal with eigenvalues:
```python
# Diagonalization: A = V Λ V⁻¹
vals, V = np.linalg.eig(A)
Lambda = np.diag(vals)   # diagonal matrix of eigenvalues

# Reconstruct A
A_reconstructed = V @ Lambda @ np.linalg.inv(V)
# A_reconstructed ≈ A ✓

# Powers of A are easy in the eigen-basis:
# A^100 = V @ Λ^100 @ V⁻¹
# Just raise the eigenvalues to the 100th power!
```
Key property: Symmetric matrices (A = Aᵀ) always have real eigenvalues and orthogonal eigenvectors. Covariance matrices are symmetric — which is why PCA always works cleanly.
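As a quick check of this property, here is a small numpy sketch using `np.linalg.eigh`, the routine specialized for symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                      # make it symmetric: A == A.T

# eigh guarantees real eigenvalues (sorted ascending) and orthonormal eigenvectors
vals, V = np.linalg.eigh(A)

assert np.all(np.isreal(vals))                  # real eigenvalues
assert np.allclose(V.T @ V, np.eye(4))          # orthonormal columns
assert np.allclose(V @ np.diag(vals) @ V.T, A)  # A = V Λ Vᵀ
```

Note the bonus: for symmetric matrices V⁻¹ = Vᵀ, so diagonalization needs a transpose instead of a matrix inverse.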
The Covariance Matrix
Measuring how features move together
The Analogy
Imagine tracking ice cream sales and temperature over a year. When temperature goes up, sales go up too — they co-vary. The covariance matrix captures all these “move together” relationships between every pair of features. Diagonal entries = how much each feature varies on its own. Off-diagonal = how much pairs move together.
Key insight: The covariance matrix IS the football shape. Its eigenvectors are the football’s axes, and its eigenvalues are how stretched each axis is. PCA simply finds the eigenvectors of the covariance matrix.
Worked Example
```python
import numpy as np

# 100 data points, 3 features
X = np.random.randn(100, 3)

# Center the data (subtract the mean of each feature)
X_centered = X - X.mean(axis=0)

# Covariance matrix: (3×3)
C = (X_centered.T @ X_centered) / (len(X) - 1)
# C[i, j] = how much features i and j co-vary
# C[i, i] = variance of feature i

# Or simply:
C = np.cov(X.T)   # same result
```
Real World
Temperature ↑ and ice cream sales ↑ co-vary positively
In AI
Covariance matrix reveals which features are redundant (highly correlated)
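To connect the covariance matrix back to the football picture, here is a small sketch (the stretch factors and angle are illustrative choices): generate a 2-D cloud stretched along the 45° diagonal, then check that the top eigenvector of its covariance matrix recovers that long axis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a correlated 2-D cloud: stretch by 3 vs 0.5, then rotate 45°
z = rng.standard_normal((1000, 2))
X = z @ np.diag([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T

C = np.cov(X.T)                 # 2×2 covariance matrix
vals, vecs = np.linalg.eigh(C)  # eigenvalues in ascending order

long_axis = vecs[:, -1]         # eigenvector of the largest eigenvalue
# long_axis ≈ ±[0.707, 0.707]: the 45° diagonal we built in
# vals ≈ [0.25, 9] up to sampling noise: the squared stretch factors
```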
PCA — Choosing the Best Rulers
Reduce 1,000 features to 50 without losing much
The Analogy
Imagine photographing a 3D sculpture. From the front, you see most of the detail. From the side, you see a bit more. From the top, you see almost nothing new. PCA picks the camera angles that capture the most detail — then throws away the boring angles. If 95% of the information is in the first 50 directions (out of 1,000), you keep only those 50.
Key insight: PCA is literally “find the eigenvectors of the covariance matrix, sort by eigenvalue, keep the top k.” The eigenvalues tell you what percentage of information each direction carries. If the top 50 eigenvalues sum to 95% of the total, those 50 directions are enough.
Worked Example
```python
import numpy as np
from sklearn.decomposition import PCA

# 1000 samples, 100 features
X = np.random.randn(1000, 100)

# Keep the top 10 components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
# X_reduced.shape == (1000, 10)

# How much variance is explained?
pca.explained_variance_ratio_
# [0.023, 0.021, 0.019, ...] (random data)
# Real data: the first few components capture most of the variance,
# e.g. [0.45, 0.25, 0.12, 0.08, ...]
```
Real numbers: In face recognition, PCA on 10,000-pixel images often finds that ~100 principal components capture 95%+ of the variance. That’s a 100× compression with minimal information loss.
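The "eigenvectors of the covariance matrix, sorted by eigenvalue" recipe can also be run by hand in plain numpy. The sketch below builds data with deliberately redundant features (a hypothetical construction, not from the text) and checks that the top components capture essentially all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# 5 features, but features 3 and 4 are near-copies of feature 0
base = rng.standard_normal((300, 3))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 2],
                     base[:, 0] + 0.01 * rng.standard_normal(300),
                     base[:, 0] + 0.01 * rng.standard_normal(300)])

# PCA by hand: eigenvectors of the covariance matrix, sorted descending
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
vals, vecs = vals[::-1], vecs[:, ::-1]

explained = vals / vals.sum()
# explained[:3].sum() ≈ 1.0 — three directions carry nearly all the variance

# Keep the top k = 3 and project
k = 3
X_reduced = Xc @ vecs[:, :k]   # shape (300, 3): 5 features reduced to 3
```

The two redundant features add almost no new variance, so PCA folds all five features into three directions, exactly the compression story above.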
SVD — The Universal Decomposition
Any matrix = rotate × scale × rotate
The Analogy
SVD says every matrix transformation can be broken into three simple steps: (1) rotate the input, (2) stretch or shrink along the axes, (3) rotate again. It’s like decomposing any dance move into “turn, stretch, turn.” The stretching amounts (singular values) tell you which dimensions carry the most energy.
Key insight: Unlike eigendecomposition (which only works on square matrices), SVD works on any matrix — even rectangular ones. This makes it the Swiss Army knife of linear algebra. PCA is actually just SVD applied to the centered data matrix.
Worked Example
```python
import numpy as np

# SVD: A = U Σ Vᵀ
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])   # A is 3×2 (not square!)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
# U:  (3×2) — left singular vectors
# S:  [9.53, 0.51] — singular values
# Vt: (2×2) — right singular vectors

# Reconstruct: A ≈ U @ diag(S) @ Vt
A_recon = U @ np.diag(S) @ Vt
# A_recon ≈ A ✓
```
Formula: A = UΣVᵀ. U and V are orthogonal (rotation), Σ is diagonal (scaling). Singular values σ₁ ≥ σ₂ ≥ ... ≥ 0.
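The claim that PCA is just SVD of the centered data matrix can be checked numerically: the covariance eigenvalues λᵢ equal σᵢ² / (n − 1), where the σᵢ are the singular values of the centered matrix. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
Xc = X - X.mean(axis=0)

# SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix, sorted descending
cov_vals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

# The bridge between SVD and PCA: λᵢ = σᵢ² / (n − 1)
assert np.allclose(S**2 / (len(X) - 1), cov_vals)
# ...and the rows of Vt are the principal directions (up to sign)
```

This is also why libraries compute PCA via SVD: it avoids forming the covariance matrix explicitly, which is more numerically stable.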
Low-Rank Approximation — Compressing Photos
Keep the big singular values, drop the small ones
The Analogy
Imagine a photo is a matrix of pixel values. SVD decomposes it into layers of detail: the first layer captures the broad shapes, the second adds finer detail, the third even finer. Low-rank approximation keeps only the first k layers. With k = 50 out of 500, you get a recognizable image at 10% of the data. It’s like JPEG compression — keep the important parts, discard the noise.
Key insight: This is exactly the principle behind LoRA fine-tuning. A billion-parameter weight matrix doesn’t need a billion-parameter update. A rank-16 update (tiny!) is often enough to adapt the model to a new task. The “important changes” live in a low-dimensional subspace.
Worked Example
```python
import numpy as np

# Image compression via SVD
img = np.random.randn(500, 500)   # stand-in for a 500×500 grayscale image
U, S, Vt = np.linalg.svd(img, full_matrices=False)

# Keep only the top k = 50 singular values
k = 50
img_compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Storage:    500×500 = 250,000 values
# Compressed: 500×50 + 50 + 50×500 = 50,050
# → 5× compression!
```
Real World
JPEG keeps important visual details, drops imperceptible noise
In AI
Low-rank approximation compresses weight matrices while preserving model quality
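A useful sanity check on truncation, via the Eckart–Young theorem: the rank-k SVD truncation is the best possible rank-k approximation in the Frobenius norm, and its error is exactly the energy in the dropped singular values. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((100, 100))
U, S, Vt = np.linalg.svd(img, full_matrices=False)

k = 20
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Frobenius error of the rank-k truncation equals
# the root-sum-square of the dropped singular values
err = np.linalg.norm(img - approx, 'fro')
assert np.isclose(err, np.sqrt(np.sum(S[k:]**2)))
```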
LoRA — Fine-Tuning with Thin Slices
Adapt a billion-parameter model by changing only a sliver
The Analogy
Imagine a master chef (pre-trained model) who knows 10,000 recipes. To specialize in Italian food, you don’t retrain from scratch. You give them a thin notebook of Italian adjustments: “add more basil here, less cream there.” LoRA does exactly this — it freezes the original weights and adds a tiny low-rank “adjustment notebook” (matrices A and B where rank r ≪ d).
Key insight: LoRA decomposes the weight update ΔW into two thin matrices: ΔW = B × A, where B is (d×r) and A is (r×d). With d = 4096 and r = 16, you train 131,072 parameters instead of 16,777,216 — a 128× reduction.
Worked Example
```python
import torch

# Original weight matrix (frozen)
d = 4096
W = torch.randn(d, d)        # 16.7M params

# LoRA: low-rank update ΔW = B @ A
r = 16                       # rank (tiny!)
A = torch.randn(r, d)        # 65,536 params
B = torch.randn(d, r)        # 65,536 params
# Total trainable: 131,072 (0.78% of W!)

# Forward pass with LoRA
x = torch.randn(d)
y = (W + B @ A) @ x          # original + low-rank adjustment
```
Source: Hu et al. (2021) “LoRA: Low-Rank Adaptation of Large Language Models” showed that fine-tuning updates have low intrinsic rank, enabling efficient adaptation with r as small as 4–16.
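To make this concrete, here is a minimal, hypothetical LoRA layer in PyTorch: a sketch, not the actual `peft` implementation. The zero-initialization of B (so the adapted model starts out identical to the base model) and the α/r scaling follow the conventions described in the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, d_in, d_out, r=16, alpha=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable, small init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable, zero init
        self.scale = alpha / r                              # common LoRA scaling

    def forward(self, x):
        # y = W x + scale * B (A x); at init B = 0, so output equals the base model
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(64, 64, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# trainable == 4*64 + 64*4 = 512 params, vs 4,096 frozen in W
```

Computing B(Ax) as two thin matrix-vector products, instead of materializing ΔW = BA, is what keeps both memory and compute proportional to r rather than d.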
The Big Picture — Decompositions Everywhere
Why breaking things apart is the key to understanding them
The Analogy
Decomposition is like understanding a symphony by separating it into individual instruments. The full orchestra (matrix) is complex, but each instrument (component) is simple. Eigendecomposition, SVD, and PCA all do the same fundamental thing: break a complex object into simple, ranked pieces so you can keep the important ones and discard the rest.
Why it matters for AI: Decompositions are everywhere in modern AI: PCA for feature reduction, SVD for matrix compression, LoRA for efficient fine-tuning, spectral clustering for graph analysis, and latent factor models for recommendation systems. Mastering this chapter unlocks all of them.
Cheat Sheet
```python
# Eigendecomposition: A = V Λ V⁻¹
#   → Square matrices only
#   → Eigenvectors = natural axes
#   → Eigenvalues = stretch factors

# SVD: A = U Σ Vᵀ
#   → Any matrix (even rectangular)
#   → Singular values = importance ranking
#   → Low-rank approx = keep top k

# PCA: eigenvectors of the covariance matrix
#   → Equivalent to SVD of the centered data
#   → Reduce 1000 features to 50

# LoRA: ΔW = B @ A  (rank r ≪ d)
#   → Fine-tune with 0.1–1% of the params
```
Real World
Separate a symphony into instruments to understand each part
In AI
Decompose weight matrices to compress, fine-tune, and understand models