The Analogy
Every matrix corresponds to a geometric transformation. A rotation matrix spins things. A scaling matrix stretches or shrinks. A shear matrix tilts (like italicizing text). The beauty: combining transformations = multiplying matrices. Rotate then scale? Multiply the scaling matrix by the rotation matrix (the transformation applied first sits rightmost in the product).
Key insight: A deep neural network with 10 layers performs 10 matrix multiplications in a row (with a nonlinearity after each; without the nonlinearities, the 10 matrices would collapse into a single one). Each layer rotates, scales, and reshapes the data in a new way. By the final layer, the data has been transformed so many times that originally tangled classes become linearly separable.
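A minimal sketch of the "layers = chained matrix multiplications" idea. The names (`layers`, `combined`) and the random 2×2 matrices are illustrative choices, not from the text; the point is that applying ten matrices one after another gives the same result as pre-multiplying them into one matrix.

```python
import numpy as np
from functools import reduce

# Ten "layers", each just a random 2x2 matrix (illustrative only).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((2, 2)) for _ in range(10)]

x = np.array([1.0, 0.0])

# Apply layer by layer...
h = x
for W in layers:
    h = W @ h

# ...or collapse all ten into one matrix first.
# reduce builds layer10 @ ... @ layer2 @ layer1.
combined = reduce(lambda A, B: B @ A, layers)

assert np.allclose(h, combined @ x)  # same result either way
```

This is exactly why real networks put a nonlinearity between layers: a purely linear 10-layer stack is equivalent to a single matrix and gains no expressive power from depth.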
Worked Example
import numpy as np

# 90° rotation matrix
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# R ≈ [[0, -1], [1, 0]]

# Apply to the point (1, 0) → (0, 1)
R @ np.array([1, 0])  # ≈ [0, 1]: rotated 90°!

# Scaling matrix: stretch x by 2, y by 3
S = np.array([[2, 0], [0, 3]])

# Combine: rotate THEN scale = S @ R (the rightmost matrix acts first)
combined = S @ R  # one matrix does both!
Key insight: Matrix multiplication is NOT commutative: in general, AB ≠ BA. Rotate-then-scale gives a different result than scale-then-rotate. Order matters in neural networks too; swapping layers changes the learned representation.
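The non-commutativity claim can be checked directly with the R and S from the worked example. Applied to the point (1, 0), the two orderings land in visibly different places:

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 90° rotation
S = np.array([[2, 0], [0, 3]])                   # stretch x by 2, y by 3

p = np.array([1.0, 0.0])

rotate_then_scale = S @ R @ p  # (1,0) → (0,1) → ≈ (0, 3)
scale_then_rotate = R @ S @ p  # (1,0) → (2,0) → ≈ (0, 2)

assert not np.allclose(rotate_then_scale, scale_then_rotate)
```

Rotating first sends the point onto the y-axis, where the ×3 stretch applies; scaling first stretches along x by 2, and the rotation then carries that to (0, 2).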