Ch 12 — Tensors & High-Dimensional Geometry

Multi-dimensional containers and the strange world beyond 3D
What Is a Tensor?
The universal container for data in deep learning
The Analogy
Think of data containers as a hierarchy: a scalar is a single number (temperature: 72°F). A vector is a list (GPS: [lat, lon]). A matrix is a spreadsheet (rows × columns). A tensor is a stack of spreadsheets — like a Rubik’s cube of numbers. An image is a 3D tensor: height × width × color channels. A batch of images is a 4D tensor.
Key insight: The name “TensorFlow” literally means “tensors flowing through a computation graph.” PyTorch’s core object is torch.Tensor. Every piece of data in deep learning — images, text, audio, video — gets converted into a tensor before the model can process it. Tensors are the universal language of AI.
The Hierarchy
import torch

# Scalar (0D tensor) — single number
scalar = torch.tensor(42.0)              # shape: ()

# Vector (1D tensor) — list of numbers
vector = torch.tensor([1, 2, 3])         # shape: (3,)

# Matrix (2D tensor) — spreadsheet
matrix = torch.randn(3, 4)               # shape: (3, 4)

# 3D tensor — e.g., single RGB image
image = torch.randn(3, 224, 224)         # (C, H, W)

# 4D tensor — batch of images
batch = torch.randn(32, 3, 224, 224)     # (batch, channels, height, width)

# 5D tensor — batch of videos
video = torch.randn(8, 3, 16, 224, 224)  # (batch, channels, frames, H, W)
Tensor Shapes & Reshaping
Shape errors are the #1 bug in deep learning code
The Analogy
A tensor’s shape is like the dimensions of a shipping container. A (32, 3, 224, 224) tensor holds 32 images, each with 3 color channels, each 224×224 pixels. Reshaping is like rearranging items in the container without changing the contents — folding a 6×4 blanket into a 2×12 roll. The total number of elements must stay the same.
Key insight: Shape mismatches cause more bugs than any other error in deep learning. “RuntimeError: mat1 and mat2 shapes cannot be multiplied” is the error you’ll see most often. Understanding shapes = understanding data flow. Always print .shape when debugging!
Common Operations
import torch

x = torch.randn(2, 3, 4)   # (2, 3, 4)

# Reshape: same data, different shape
y = x.reshape(6, 4)        # (6, 4)
z = x.reshape(2, 12)       # (2, 12)
w = x.reshape(-1)          # (24,) — flatten

# Permute: reorder dimensions
t = x.permute(0, 2, 1)     # (2, 4, 3)

# Squeeze/unsqueeze: remove/add a size-1 dim
a = torch.randn(3, 1, 4)
b = a.squeeze(1)           # (3, 4)
c = b.unsqueeze(0)         # (1, 3, 4)

# Common in transformers:
# (batch, seq, d_model) → (batch, heads, seq, d_k)
# via reshape + permute
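The reshape-plus-permute pattern mentioned in the comments above can be checked end to end. A minimal sketch, with dimensions chosen to match a BERT-base-sized model:

```python
import torch

B, S, D = 32, 128, 768     # batch, seq_len, d_model
h, d_k = 12, 64            # heads × per-head dim (h * d_k == D)

x = torch.randn(B, S, D)

# Split D into heads, then move heads in front of seq:
# (B, S, D) -> (B, S, h, d_k) -> (B, h, S, d_k)
heads = x.reshape(B, S, h, d_k).permute(0, 2, 1, 3)

# Reverse the trip: (B, h, S, d_k) -> (B, S, h, d_k) -> (B, S, D)
merged = heads.permute(0, 2, 1, 3).reshape(B, S, D)
# merged is bit-identical to x: reshaping never changes the data
```

Note that `.reshape` works even after `.permute` makes the tensor non-contiguous (it copies when it must), which is why this pattern is safe to chain.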
Broadcasting — Automatic Shape Matching
How PyTorch handles mismatched shapes elegantly
The Analogy
Imagine adding a tip percentage to every item on a restaurant bill. You have a column of prices and a single tip rate. You don’t need to write the tip rate next to every price — you just “broadcast” it across all rows. Broadcasting automatically stretches smaller tensors to match larger ones, avoiding the need for explicit copying.
Key insight: Broadcasting is why you can write x - x.mean() to center data, even though x is a matrix and x.mean() is a scalar. PyTorch automatically broadcasts the scalar across every element. Without broadcasting, you’d need to manually expand the mean to match the shape — broadcasting saves thousands of lines of code in ML.
Rules & Examples
import torch

# Broadcasting rules (right-align shapes):
# 1. Dimensions must match, OR
# 2. One of them must be 1 (gets stretched)

# Example 1: scalar + matrix
A = torch.randn(3, 4)      # (3, 4)
b = torch.tensor(5.0)      # ()     → broadcasts to (3, 4)
C = A + b                  # (3, 4) ✓

# Example 2: row + matrix
row = torch.randn(1, 4)    # (1, 4) → (3, 4)
D = A + row                # (3, 4) ✓

# Example 3: column + row = matrix!
col = torch.randn(3, 1)    # (3, 1)
row = torch.randn(1, 4)    # (1, 4)
E = col + row              # (3, 4) ✓

# Fails: incompatible shapes
# (3, 4) + (2, 4) → ERROR (3 ≠ 2)
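The x - x.mean() trick from the key insight is worth seeing with explicit shapes. A minimal sketch:

```python
import torch

x = torch.randn(4, 3)      # 4 samples, 3 features

# Scalar mean: shape () broadcasts against (4, 3)
centered = x - x.mean()

# Per-feature mean: keepdim=True keeps shape (1, 3), which
# right-aligns with (4, 3) and stretches along dim 0
col_centered = x - x.mean(dim=0, keepdim=True)
# each column of col_centered now has mean ~0
```

Without `keepdim=True` the mean has shape (3,), which still broadcasts here; keeping the dim just makes the right-alignment explicit.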
High-Dimensional Intuition
Why your 3D intuition fails in 1000 dimensions
The Analogy
In 3D, most of an orange’s volume is in the juicy interior. In 1000D, almost all the volume is in a thin shell near the surface — the interior is essentially empty. This is the concentration of measure phenomenon. High-dimensional spaces are vast, empty, and deeply counterintuitive. A word embedding lives in 768D space. GPT-4’s parameters live in billions of dimensions.
Key insight: In high dimensions, random points are almost all equidistant from each other. In 2D, two random points can be close or far. In 1000D, the distance between any two random points concentrates around the same value. This is why nearest-neighbor search gets harder in high dimensions — everything is “equally far” from everything else.
Surprising Facts
# Volume of the unit sphere in d dimensions:
# d=2:   π ≈ 3.14      (circle area)
# d=3:   4π/3 ≈ 4.19   (sphere volume)
# d=10:  ≈ 2.55
# d=100: ≈ 10^(-40)    (essentially ZERO!)
# In high-D, the sphere is mostly empty:
# almost all volume is near the surface.

# Random vectors in high-D are nearly
# orthogonal to each other:
import torch

d = 1000
a = torch.randn(d)
b = torch.randn(d)
cos_sim = torch.dot(a, b) / (a.norm() * b.norm())
# cos_sim ≈ 0.03 (nearly perpendicular!)
# In 2D this would be very unlikely
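The distance-concentration claim from the key insight can be checked directly: sample point clouds in 2D and in 1000D and compare the relative spread of pairwise distances. A minimal sketch:

```python
import torch

torch.manual_seed(0)
spreads = {}
for d in (2, 1000):
    pts = torch.randn(100, d)          # 100 random points in d dimensions
    dists = torch.cdist(pts, pts)      # (100, 100) pairwise Euclidean distances
    dists = dists[dists > 0]           # drop the zero self-distances
    spreads[d] = (dists.std() / dists.mean()).item()

print(spreads)
# relative spread collapses as d grows: in 1000D every
# pair of points is nearly the same distance apart
```

In 2D the spread is a large fraction of the mean distance; in 1000D it shrinks to a few percent, which is concentration of measure in action.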
The Curse of Dimensionality
Why more features can make things worse
The Analogy
Imagine searching for a friend in a 1D hallway (easy), a 2D field (harder), or a 3D building (much harder). In 1000D space, data points are so spread out that everything looks sparse. You need exponentially more data to fill the space. With 10 features and 10 bins each, you need 10^10 = 10 billion data points to have just one per bin. This is the curse of dimensionality.
Key insight: This is why feature selection and dimensionality reduction (PCA from Ch 3) are so important. Adding more features doesn’t always help — it can actually hurt if you don’t have enough data to fill the expanded space. The curse also explains why k-nearest neighbors works great in low dimensions but fails in high dimensions.
The Numbers
# Data needed to cover space (10 bins per axis):
#  1D: 10 points
#  2D: 100 points
#  3D: 1,000 points
# 10D: 10,000,000,000 points!

# Solutions:
# 1. PCA: reduce 1000D → 50D
# 2. Feature selection: keep top-k features
# 3. Regularization: prevent overfitting
# 4. Deep learning: learns useful subspaces

# Why deep learning helps:
# real data lies on low-D manifolds —
# 1M-pixel images of faces ≈ 50D manifold,
# and networks learn to find these manifolds.
Real World
Finding someone in a hallway (1D) vs. a skyscraper (3D) vs. a city (higher D)
In AI
With 10 features, a few hundred samples can cover the space; with 1000 features, the data needed grows exponentially.
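The claim that k-nearest neighbors breaks down in high dimensions can be demonstrated with a few lines: compare the nearest and farthest neighbor of a random query point. A minimal sketch:

```python
import torch

torch.manual_seed(0)
ratios = {}
for d in (2, 1000):
    pts = torch.randn(500, d)          # 500 candidate neighbors
    q = torch.randn(d)                 # one query point
    dists = torch.norm(pts - q, dim=1)
    ratios[d] = (dists.min() / dists.max()).item()

print(ratios)
# 2D: the nearest point is far closer than the farthest (ratio near 0)
# 1000D: nearest ≈ farthest (ratio near 1) — "nearest" barely
# distinguishes anything
```

When the min/max distance ratio approaches 1, the notion of a "nearest" neighbor carries almost no information, which is the curse in its most practical form.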
Distance & Similarity in High-D
How to measure closeness when everything is far apart
The Analogy
In a library, you can measure book similarity by shelf distance (Euclidean — how far apart physically) or by topic overlap (cosine similarity — how aligned their subjects are). In high-D, Euclidean distance becomes less meaningful (everything is equidistant), so cosine similarity (angle between vectors) becomes the go-to metric. That’s why embeddings use cosine similarity, not Euclidean distance.
Key insight: When you search in ChatGPT or use RAG (Retrieval-Augmented Generation), the system converts your query into an embedding vector and finds the most similar documents using cosine similarity. Vector databases like Pinecone, Weaviate, and FAISS are built entirely around efficient similarity search in high-dimensional spaces.
Distance Metrics
import torch
import torch.nn.functional as F

a = torch.randn(768)
b = torch.randn(768)

# Euclidean (L2): straight-line distance
# d(a,b) = √(Σ(aᵢ - bᵢ)²)
d_l2 = torch.dist(a, b, p=2)

# Manhattan (L1): grid distance
# d(a,b) = Σ|aᵢ - bᵢ|
d_l1 = torch.dist(a, b, p=1)

# Cosine similarity: angle between vectors
# cos(θ) = (a·b) / (|a| × |b|)
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

# In high-D, cosine beats Euclidean because:
# - Euclidean distances concentrate
# - cosine measures direction, not magnitude
# - embeddings care about meaning (direction),
#   not magnitude
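The retrieval step described in the key insight reduces to a few tensor operations. A minimal sketch with a hypothetical toy "vector database" of five random embeddings standing in for real documents:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
docs = torch.randn(5, 768)     # 5 document embeddings (toy stand-ins)
query = torch.randn(768)       # query embedding

# Normalizing to unit length makes the dot product equal cosine similarity
docs_n = F.normalize(docs, dim=1)
query_n = F.normalize(query, dim=0)

sims = docs_n @ query_n        # (5,) cosine scores in [-1, 1]
topk = sims.topk(k=2)          # the 2 most similar documents
print(topk.indices.tolist())
```

Real vector databases add indexing structures for speed, but the scoring itself is exactly this normalized dot product.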
Manifolds — Data Lives on Surfaces
Why real data is simpler than it looks
The Analogy
Earth’s surface is a 2D manifold embedded in 3D space. You can walk in two directions (north/south, east/west), even though you live in a 3D world. Similarly, images of human faces live on a low-dimensional manifold in pixel space. A 1M-pixel image has 1M dimensions, but faces vary along maybe ~50 meaningful dimensions (pose, lighting, expression, identity). Deep learning finds these manifolds.
Key insight: This is why GANs and diffusion models can generate realistic faces from just 512 random numbers. The latent space IS the manifold. Moving along the manifold = smooth, meaningful changes (smile more, turn head). Moving off the manifold = unrealistic garbage. The model learned the manifold of real faces.
The Manifold Hypothesis
# The Manifold Hypothesis:
# real-world data lies on low-D manifolds
# embedded in high-D space.

# Examples:
# Face images:        1M pixels → ~50D manifold
# Handwritten digits: 784 pixels → ~10D
# Natural language:   50K vocab → ~768D embeddings

# Why this matters for AI:
# 1. Autoencoders learn manifold structure
# 2. GANs generate on the manifold
# 3. t-SNE/UMAP visualize manifolds in 2D
# 4. Classification = finding boundaries
#    between manifolds of different classes

# Interpolation on the manifold (latent space):
# z1 = encode(face_A)        # smiling woman
# z2 = encode(face_B)        # neutral man
# z_mid = 0.5*z1 + 0.5*z2    # blend!
# decode(z_mid) → realistic in-between face
Real World
Earth’s surface: 2D manifold in 3D space. You walk in 2 directions.
In AI
Face images: ~50D manifold in 1M-D pixel space. GANs learn this surface.
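The manifold idea can be demonstrated in a deliberately linear toy: generate points that truly live on a 2D plane, embed them in 100D, and let SVD (the engine behind PCA) rediscover the plane. All names and sizes here are illustrative, and real data manifolds are curved rather than flat:

```python
import torch

torch.manual_seed(0)
latent = torch.randn(200, 2)       # the "real" 2D coordinates
embed = torch.randn(2, 100)        # a fixed linear embedding into 100D
X = latent @ embed                 # (200, 100) high-D observations

X = X - X.mean(dim=0)              # center before SVD/PCA
U, S, Vh = torch.linalg.svd(X, full_matrices=False)
var = S**2 / (S**2).sum()          # fraction of variance per component

print(var[:3])
# ~100% of the variance sits in the first two components:
# the data is 100-dimensional only superficially
```

Autoencoders and GANs play the same game nonlinearly: they learn a curved low-dimensional coordinate system instead of a flat one.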
Tensors & Geometry in Practice
Putting it all together for real AI systems
Tensor Operations in Transformers
A transformer processes text as tensors flowing through layers. Input: (batch, seq_len, d_model). Attention splits this into (batch, heads, seq_len, d_k). The entire self-attention mechanism is just tensor operations: reshape, transpose, matrix multiply, softmax, reshape back. Understanding tensor shapes is understanding how transformers work.
The complete picture: Tensors are the containers. Shapes define the structure. Broadcasting handles mismatches. High-D geometry explains why embeddings work. The manifold hypothesis explains why deep learning generalizes. Together, they form the geometric foundation of all modern AI.
Transformer Attention Shapes
# Self-attention tensor flow:
# Input: (B, S, D) = (32, 128, 768)
#   B=batch, S=seq_len, D=d_model

# Project Q, K, V:
#   (B, S, D) × (D, D) → (B, S, D)

# Split into heads (h=12, d_k=64):
#   (B, S, D) → (B, S, h, d_k)
#   → permute → (B, h, S, d_k)

# Attention scores:
#   Q × Kᵀ: (B,h,S,d_k) × (B,h,d_k,S)
#   → (B, h, S, S)  ← attention matrix!

# Apply to values:
#   (B,h,S,S) × (B,h,S,d_k) → (B,h,S,d_k)
#   → concat heads → (B, S, D)
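The shape flow above can be run for real. A minimal sketch of the attention math with random weights (no learned parameters, no masking or dropout — just the tensor plumbing):

```python
import torch

B, S, D = 32, 128, 768             # batch, seq_len, d_model
h = 12                             # attention heads
d_k = D // h                       # 64 dims per head

x = torch.randn(B, S, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))

def split_heads(t):
    # (B, S, D) -> (B, S, h, d_k) -> (B, h, S, d_k)
    return t.reshape(B, S, h, d_k).permute(0, 2, 1, 3)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

scores = Q @ K.transpose(-2, -1) / d_k**0.5     # (B, h, S, S)
attn = scores.softmax(dim=-1)                   # each row sums to 1
out = attn @ V                                  # (B, h, S, d_k)
out = out.permute(0, 2, 1, 3).reshape(B, S, D)  # concat heads back

print(out.shape)
```

Everything a transformer layer does to move data around is visible here: two reshapes, two permutes, and three batched matrix multiplies.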