Ch 11 — Information Theory for ML

Surprise, uncertainty, and the language of learning
Information as Surprise
The less likely an event, the more information it carries
The Analogy
Imagine a weather forecast. “It’s sunny in the Sahara” — no surprise, no information. “It’s snowing in the Sahara” — extremely surprising, lots of information! Information = surprise. The less likely an event, the more information it carries when it happens. Claude Shannon formalized this in 1948: Information(event) = −log₂(probability).
Key insight: When a language model predicts the next word, it assigns probabilities. “The cat sat on the ___” → “mat” (high probability, low surprise). If the model says “volcano” — that’s high information (surprising). Perplexity, the standard metric for language models, is literally 2 raised to the average surprise in bits.
The Math
# Information content of an event
# I(x) = -log₂(P(x)) bits
import numpy as np

# Fair coin flip (P = 0.5)
I_coin = -np.log2(0.5)       # = 1 bit

# Fair die roll (P = 1/6)
I_die = -np.log2(1/6)        # ≈ 2.58 bits

# Rare event (P = 0.001)
I_rare = -np.log2(0.001)     # ≈ 9.97 bits

# Certain event (P = 1.0)
I_certain = -np.log2(1.0)    # = 0 bits (no surprise!)
Real World
“Sun rises tomorrow” = 0 surprise. “Aliens land” = maximum surprise.
In AI
Predicting “the” after “in” = low info. Predicting “platypus” = high info.
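The perplexity connection can be sketched in a few lines. The token probabilities below are made-up illustrative values, not real model outputs:

```python
import numpy as np

# Probabilities a hypothetical model assigned to the token that
# actually occurred, at four positions (illustrative values)
p_true_tokens = np.array([0.6, 0.2, 0.9, 0.05])

# Surprise (information content) of each prediction, in bits
surprise = -np.log2(p_true_tokens)

# Perplexity = 2^(average surprise)
avg_surprise = surprise.mean()
perplexity = 2 ** avg_surprise   # ≈ 3.7 "effective choices" per token
```

The rare token (P = 0.05) contributes far more surprise than the confident prediction (P = 0.9), and the perplexity summarizes the whole sequence as one number.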
Entropy — Average Surprise
How uncertain is the entire distribution?
The Analogy
Entropy is the average surprise across all possible outcomes. A fair coin has maximum entropy (1 bit) — you’re maximally uncertain. A loaded coin (99% heads) has low entropy — you already know what’s coming. Think of entropy as a measure of chaos or unpredictability in a system.
Key insight: In decision trees, the algorithm splits on the feature that reduces entropy the most (information gain). High entropy = mixed classes = uncertain. After a good split, each branch has lower entropy = more pure = more certain. The tree literally maximizes information gained at each step!
Worked Example
# Entropy: H(X) = -Σ P(x) × log₂(P(x))
import numpy as np

# Fair coin: P(H)=0.5, P(T)=0.5
H_fair = -(0.5*np.log2(0.5) + 0.5*np.log2(0.5))
# = 1.0 bit (maximum uncertainty)

# Loaded coin: P(H)=0.99, P(T)=0.01
H_loaded = -(0.99*np.log2(0.99) + 0.01*np.log2(0.01))
# ≈ 0.08 bits (almost certain)

# Decision tree: split that reduces entropy most
# Before split: H = 1.0 (50/50 cat/dog)
# After split on "has whiskers":
#   Left:  H ≈ 0.2 (mostly cats)
#   Right: H ≈ 0.3 (mostly dogs)
# Information gain = 1.0 - weighted avg ≈ 0.75
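An information-gain computation like the one sketched above can be run end to end. The "has whiskers" feature and the class counts below are hypothetical, chosen only to make the arithmetic concrete:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; ignores zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical parent node: 10 cats, 10 dogs → 50/50 split
H_parent = entropy([0.5, 0.5])            # = 1.0 bit

# Hypothetical split on "has whiskers":
#   left branch:  9 cats, 1 dog
#   right branch: 1 cat, 9 dogs
H_left = entropy([0.9, 0.1])              # ≈ 0.47 bits
H_right = entropy([0.1, 0.9])             # ≈ 0.47 bits

# Weighted average child entropy (each branch holds 10 of 20 samples)
H_children = 0.5 * H_left + 0.5 * H_right
info_gain = H_parent - H_children         # ≈ 0.53 bits gained
```

A purer split (fewer misplaced animals per branch) would drive the child entropies toward 0 and the gain toward the full 1.0 bit.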
Cross-Entropy — The #1 Loss Function
Measuring how well your model matches reality
The Analogy
Imagine you’re packing for a trip. Entropy = the minimum luggage if you pack perfectly for the actual weather. Cross-entropy = the luggage you actually bring based on your predicted weather. If your predictions are perfect, cross-entropy = entropy. If your predictions are wrong, you overpack — cross-entropy > entropy. The gap is wasted effort.
Key insight: Cross-entropy loss is THE most common loss function in classification. When you train a neural network with nn.CrossEntropyLoss() in PyTorch, you’re literally minimizing the average surprise of the true labels under the model’s predicted distribution. Lower cross-entropy = model’s predictions match reality better.
The Math & Code
# Cross-entropy: H(P, Q) = -Σ P(x) × log Q(x)
# P = true distribution, Q = model's prediction
import torch
import torch.nn as nn

# True label: class 2 (one-hot: [0, 0, 1])
# Model predicts roughly [0.1, 0.2, 0.7]
loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[-1.0, 0.0, 1.2]])
target = torch.tensor([2])
loss = loss_fn(logits, target)
# loss ≈ 0.35 (low — good prediction!)

# Bad prediction: model says class 0
logits_bad = torch.tensor([[2.0, 0.0, -1.0]])
loss_bad = loss_fn(logits_bad, target)
# loss ≈ 3.17 (high — wrong prediction!)
Real World
Packing perfectly for actual weather = minimum luggage (entropy)
In AI
Model matching true labels = minimum cross-entropy loss
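To see that nn.CrossEntropyLoss really is "average surprise of the true label under the model", it helps to recompute the loss by hand from the softmax, using the same toy logits as in the example above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[-1.0, 0.0, 1.2]])
target = torch.tensor([2])

# Library loss
loss = nn.CrossEntropyLoss()(logits, target)

# Manual version: softmax → probability of the true class → -log
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[0, target[0]])

# The two agree: cross-entropy loss is just -log Q(true class),
# i.e. the model's surprise at the label that actually occurred
```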
KL Divergence — Distance Between Distributions
How different is your model from reality?
The Analogy
KL divergence measures the “extra cost” of using the wrong distribution. If you designed a communication system based on predicted weather Q but the actual weather follows P, KL(P||Q) is the wasted bits per message. It’s the gap between cross-entropy and entropy: KL(P||Q) = H(P,Q) − H(P). Always ≥ 0, equals 0 only when P = Q.
Key insight: KL divergence is everywhere in modern AI. VAEs minimize KL divergence to keep the latent space well-structured. Knowledge distillation uses KL to transfer knowledge from a large teacher model to a small student. RLHF uses KL to prevent the fine-tuned model from drifting too far from the base model.
Worked Example
# KL(P || Q) = Σ P(x) × log(P(x) / Q(x))
#            = H(P, Q) - H(P)   (extra cost)
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # true
Q = np.array([0.3, 0.4, 0.3])  # model
KL = np.sum(P * np.log2(P / Q))
# ≈ 0.50 bits (significant mismatch)

# Note: KL is NOT symmetric!
#   KL(P||Q) ≠ KL(Q||P)
#   KL(P||Q): "cost of using Q when truth is P"
#   KL(Q||P): "cost of using P when truth is Q"

# In a VAE:
#   Loss = Reconstruction + β × KL(q(z|x) || p(z))
#   The KL term keeps the latent space close to N(0,1)
Key insight: Minimizing cross-entropy H(P,Q) is equivalent to minimizing KL(P||Q) because H(P) is constant. That’s why cross-entropy loss works — it’s secretly minimizing the distance between your model and reality.
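Both claims — the gap identity KL(P||Q) = H(P,Q) − H(P) and the asymmetry — are easy to verify numerically with the same P and Q used in the worked example:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # true distribution
Q = np.array([0.3, 0.4, 0.3])   # model's distribution

H_P = -np.sum(P * np.log2(P))       # entropy of the truth
H_PQ = -np.sum(P * np.log2(Q))      # cross-entropy
KL_PQ = np.sum(P * np.log2(P / Q))  # forward KL
KL_QP = np.sum(Q * np.log2(Q / P))  # reverse KL

# Gap identity: KL_PQ equals H_PQ - H_P exactly
# Asymmetry: KL_PQ and KL_QP differ (here ≈ 0.497 vs ≈ 0.509 bits)
```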
Mutual Information
How much does knowing X tell you about Y?
The Analogy
Mutual information measures how much knowing one thing tells you about another. Knowing someone’s height tells you something about their weight (positive MI). Knowing their shoe size tells you almost nothing about their favorite color (near-zero MI). MI = 0 means the variables are completely independent.
Key insight: Mutual information is a more powerful version of correlation. Correlation only captures linear relationships. MI captures any statistical dependency. If X = sin(Y), correlation might be ~0, but MI is high. This makes MI invaluable for feature selection — find features that share the most information with the target, regardless of the relationship shape.
The Math
# MI(X; Y) = H(X) + H(Y) - H(X, Y)
#          = KL(P(X,Y) || P(X)P(Y))
#          = reduction in uncertainty about X
#            when you learn Y (and vice versa)
#
# Properties:
#   MI(X; Y) ≥ 0            (always)
#   MI(X; Y) = 0 iff X ⊥ Y  (independent)
#   MI(X; X) = H(X)         (max possible)
#   MI(X; Y) = MI(Y; X)     (symmetric!)
from sklearn.feature_selection import mutual_info_classif

# Select features with highest MI to the target
# (X = feature matrix, y = labels, assumed already loaded)
mi_scores = mutual_info_classif(X, y)
# e.g. [0.82, 0.01, 0.45, 0.67, ...]
# Features 0 and 3 are the most informative
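The identity MI = H(X) + H(Y) − H(X,Y) can be checked directly on a small hand-built joint distribution (the 2×2 table below is made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution over two binary variables
# rows index X, columns index Y; entries sum to 1
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

Px = joint.sum(axis=1)   # marginal of X: [0.5, 0.5]
Py = joint.sum(axis=0)   # marginal of Y: [0.5, 0.5]

def H(p):
    """Entropy in bits of a probability vector."""
    p = np.ravel(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Definition via entropies
MI = H(Px) + H(Py) - H(joint)     # ≈ 0.28 bits shared

# Equivalent form: KL between joint and product of marginals
MI_kl = np.sum(joint * np.log2(joint / np.outer(Px, Py)))
```

The two formulas agree to machine precision; if the table were the outer product of its marginals (independence), both would return 0.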
The Information Bottleneck
Compress the input, keep only what matters for the output
The Analogy
Imagine summarizing a 500-page book into a 1-page summary. You must compress (lose details) while preserving the key message. The information bottleneck theory says deep learning does exactly this: each layer compresses the input (forgets irrelevant details) while preserving information relevant to the output label.
Key insight: This explains why deep networks generalize. Early layers extract features (compress 1M pixels to 512 features). Later layers keep only what matters for the task. A cat classifier doesn’t need to remember the exact pixel values — just “has ears, whiskers, fur pattern.” The network learns to throw away noise and keep signal.
The Framework
# Information Bottleneck principle:
#   min I(X; T) - β × I(T; Y)
#
#   X = input (image pixels)
#   T = representation (hidden layer)
#   Y = output (label)
#
# Goal: minimize I(X;T) → compress input
#       maximize I(T;Y) → preserve label info

# Deep network as a compression pipeline:
#   Image (3×224×224 = 150K values)
#   → Conv layers (compress spatial info)
#   → 512-dim feature vector (bottleneck)
#   → 10-class prediction
#   150K → 512 → 10 (massive compression!)
Real World
Summarize a book: lose details, keep the core message
In AI
Neural net: compress 150K pixels to 512 features that predict the label
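The 150K → 512 → 10 funnel can be made concrete with a minimal (untrained) PyTorch model. The layer sizes match the sketch above, but the architecture itself is just a shape-checking illustration, not a real classifier:

```python
import torch
import torch.nn as nn

# Minimal illustration of the 150K → 512 → 10 funnel
# (untrained; layers chosen only to demonstrate the shapes)
model = nn.Sequential(
    nn.Flatten(),                     # 3×224×224 → 150,528 values
    nn.Linear(3 * 224 * 224, 512),    # bottleneck representation T
    nn.ReLU(),
    nn.Linear(512, 10),               # class scores for Y
)

x = torch.randn(1, 3, 224, 224)       # fake input image X
bottleneck = model[:2](x)             # representation after the squeeze
out = model(x)                        # final class scores
# bottleneck.shape == (1, 512); out.shape == (1, 10)
```

Everything the final layer knows about the image must pass through those 512 numbers — that narrow passage is the "bottleneck" the theory is named after.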
Information Theory in Modern AI
From GPT to diffusion models — information theory is everywhere
Language Models
GPT and LLMs are trained to minimize cross-entropy on next-token prediction. Perplexity = 2^(cross-entropy) measures how “confused” the model is. GPT-4 has perplexity ~10-15 on English text, meaning at each position it’s effectively choosing between 10-15 equally likely words. Temperature in sampling controls the entropy of the output distribution — low temp = low entropy = predictable; high temp = high entropy = creative.
Key connection: When you set temperature=0.7 in ChatGPT, you’re literally adjusting the entropy of the probability distribution over next tokens. Temperature=0 picks the most likely token (zero entropy). Temperature=2 makes the distribution nearly uniform (high entropy, very random).
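The temperature claim is directly verifiable: dividing logits by a temperature before the softmax lowers or raises the entropy of the resulting distribution. The logits below are arbitrary toy values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 0.5, -1.0])   # arbitrary next-token scores

H_cold = entropy_bits(softmax(logits / 0.2))  # low T: peaked, low entropy
H_mid  = entropy_bits(softmax(logits / 1.0))  # T = 1: unchanged
H_hot  = entropy_bits(softmax(logits / 5.0))  # high T: near-uniform

# H_cold < H_mid < H_hot; as T → ∞ the entropy approaches
# the uniform maximum log₂(4) = 2 bits for four tokens
```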
More Applications
# VAEs: ELBO = Reconstruction - KL
#   KL keeps the latent space structured

# Diffusion models: learn to reverse
#   information destruction (adding noise)
#   Score = ∇log p(x) (gradient of log-prob)

# RLHF: KL penalty prevents drift
#   Objective = reward - β × KL(π_new || π_ref)
#   (maximized, so the KL term acts as a penalty)

# Knowledge distillation:
#   Student minimizes KL to teacher's outputs
#   Soft labels carry more info than hard labels

# Perplexity of a language model:
#   PPL = 2^H(P, Q) where Q = model
#   GPT-4: PPL ≈ 10-15 on English text
#   Human: PPL ≈ 20 (we're less predictable!)
The Complete Picture
How all the information theory concepts connect
The Unified View
All of information theory connects through one idea: uncertainty reduction. Entropy measures total uncertainty. Cross-entropy measures how well a model captures that uncertainty. KL divergence measures the gap. Mutual information measures shared uncertainty. Every ML loss function is secretly an information-theoretic quantity.
The big picture: Shannon’s 1948 paper “A Mathematical Theory of Communication” is arguably the most important paper for modern AI. Without it, we wouldn’t have cross-entropy loss, perplexity, KL divergence for VAEs, or the information bottleneck theory. Information theory IS the language of learning.
Cheat Sheet
# Information Theory Cheat Sheet for ML:
#
#   Entropy   H(P)   → uncertainty of P
#   Cross-ent H(P,Q) → avg surprise of P under model Q
#   KL(P||Q)         → extra cost of Q vs P
#   MI(X;Y)          → shared information
#
# Key relationships:
#   H(P,Q)  = H(P) + KL(P||Q)
#   MI(X;Y) = H(X) - H(X|Y)
#   MI(X;Y) = KL(P(X,Y) || P(X)P(Y))
#
# Where you see them:
#   Classification    → cross-entropy loss
#   VAE               → KL divergence regularizer
#   LLM eval          → perplexity = 2^H(P,Q)
#   Feature selection → mutual information
#   Decision trees    → information gain
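As a closing sanity check, the cheat sheet's key relationships can all be verified with a few lines of NumPy (the distributions below are arbitrary toy values):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

H_P = -np.sum(P * np.log2(P))        # entropy
H_PQ = -np.sum(P * np.log2(Q))       # cross-entropy
KL_PQ = np.sum(P * np.log2(P / Q))   # KL divergence

# Relationship 1: H(P,Q) = H(P) + KL(P||Q)

# Relationships 2 & 3 need a joint distribution; a toy 2×2 table:
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
Px, Py = joint.sum(axis=1), joint.sum(axis=0)

def H(p):
    p = np.ravel(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# MI via entropies: H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)
MI = H(Px) + H(Py) - H(joint)

# MI via KL: distance of the joint from independence
MI_kl = np.sum(joint * np.log2(joint / np.outer(Px, Py)))
```

All three identities hold to machine precision, which is exactly why cross-entropy, KL, and mutual information are interchangeable vocabulary for the same underlying bookkeeping of uncertainty.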