Ch 11 — Information Theory for ML

Surprise, uncertainty, and the language of learning
Information as Surprise
The less likely an event, the more information it carries
The Analogy
Imagine a weather forecast. “It’s sunny in the Sahara” — no surprise, no information. “It’s snowing in the Sahara” — extremely surprising, lots of information! Information = surprise. The less likely an event, the more information it carries when it happens. Claude Shannon formalized this in 1948: Information(event) = −log₂(probability).
Key insight: When a language model predicts the next word, it assigns probabilities. “The cat sat on the ___” → “mat” (high probability, low surprise). If the model says “volcano” — that’s high information (surprising). Perplexity, the standard metric for language models, is literally 2 raised to the average surprise in bits.
The Math
# Information content of an event
# I(x) = -log₂(P(x)) bits
import numpy as np

# Fair coin flip (P = 0.5)
I_coin = -np.log2(0.5)       # = 1 bit

# Fair die roll (P = 1/6)
I_die = -np.log2(1/6)        # ≈ 2.58 bits

# Rare event (P = 0.001)
I_rare = -np.log2(0.001)     # ≈ 9.97 bits

# Certain event (P = 1.0)
I_certain = -np.log2(1.0)    # = 0 bits (no surprise!)
Real World
“Sun rises tomorrow” = 0 surprise. “Aliens land” = maximum surprise.
In AI
Predicting “the” after “in” = low info. Predicting “platypus” = high info.
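The perplexity connection can be sketched in a few lines. The token probabilities below are made-up illustrative values, not real model outputs:

```python
import numpy as np

# Probabilities a hypothetical model assigned to the token that
# actually occurred, at four positions (illustrative values)
p_true_tokens = np.array([0.6, 0.2, 0.9, 0.05])

# Surprise (information content) of each prediction, in bits
surprise = -np.log2(p_true_tokens)

# Perplexity = 2^(average surprise)
avg_surprise = surprise.mean()
perplexity = 2 ** avg_surprise   # ≈ 3.7 "effective choices" per token
```

The rare token (P = 0.05) contributes far more surprise than the confident prediction (P = 0.9), and the perplexity summarizes the whole sequence as one number.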
Entropy — Average Surprise
How uncertain is the entire distribution?
The Analogy
Entropy is the average surprise across all possible outcomes. A fair coin has maximum entropy (1 bit) — you’re maximally uncertain. A loaded coin (99% heads) has low entropy — you already know what’s coming. Think of entropy as a measure of chaos or unpredictability in a system.
Key insight: In decision trees, the algorithm splits on the feature that reduces entropy the most (information gain). High entropy = mixed classes = uncertain. After a good split, each branch has lower entropy = more pure = more certain. The tree literally maximizes information gained at each step!
Worked Example
# Entropy: H(X) = -Σ P(x) × log₂(P(x))
import numpy as np

# Fair coin: P(H)=0.5, P(T)=0.5
H_fair = -(0.5*np.log2(0.5) + 0.5*np.log2(0.5))
# = 1.0 bit (maximum uncertainty)

# Loaded coin: P(H)=0.99, P(T)=0.01
H_loaded = -(0.99*np.log2(0.99) + 0.01*np.log2(0.01))
# ≈ 0.08 bits (almost certain)

# Decision tree: split that reduces entropy most
# Before split: H = 1.0 (50/50 cat/dog)
# After split on "has whiskers":
#   Left:  H ≈ 0.2 (mostly cats)
#   Right: H ≈ 0.3 (mostly dogs)
# Information gain = 1.0 - weighted avg ≈ 0.75
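An information-gain computation like the one sketched above can be run end to end. The "has whiskers" feature and the class counts below are hypothetical, chosen only to make the arithmetic concrete:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; ignores zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical parent node: 10 cats, 10 dogs → 50/50 split
H_parent = entropy([0.5, 0.5])            # = 1.0 bit

# Hypothetical split on "has whiskers":
#   left branch:  9 cats, 1 dog
#   right branch: 1 cat, 9 dogs
H_left = entropy([0.9, 0.1])              # ≈ 0.47 bits
H_right = entropy([0.1, 0.9])             # ≈ 0.47 bits

# Weighted average child entropy (each branch holds 10 of 20 samples)
H_children = 0.5 * H_left + 0.5 * H_right
info_gain = H_parent - H_children         # ≈ 0.53 bits gained
```

A purer split (fewer misplaced animals per branch) would drive the child entropies toward 0 and the gain toward the full 1.0 bit.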
Cross-Entropy — The #1 Loss Function
Measuring how well your model matches reality
The Analogy
Imagine you’re packing for a trip. Entropy = the minimum luggage if you pack perfectly for the actual weather. Cross-entropy = the luggage you actually bring based on your predicted weather. If your predictions are perfect, cross-entropy = entropy. If your predictions are wrong, you overpack — cross-entropy > entropy. The gap is wasted effort.
Key insight: Cross-entropy loss is THE most common loss function in classification. When you train a neural network with nn.CrossEntropyLoss() in PyTorch, you’re literally minimizing the average surprise of the true labels under the model’s predicted distribution. Lower cross-entropy = model’s predictions match reality better.
The Math & Code
# Cross-entropy: H(P, Q) = -Σ P(x) × log Q(x)
# P = true distribution, Q = model's prediction
import torch
import torch.nn as nn

# True label: class 2 (one-hot: [0, 0, 1])
# Model predicts roughly [0.1, 0.2, 0.7]
loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[-1.0, 0.0, 1.2]])
target = torch.tensor([2])
loss = loss_fn(logits, target)
# loss ≈ 0.35 (low — good prediction!)

# Bad prediction: model says class 0
logits_bad = torch.tensor([[2.0, 0.0, -1.0]])
loss_bad = loss_fn(logits_bad, target)
# loss ≈ 3.17 (high — wrong prediction!)
Real World
Packing perfectly for actual weather = minimum luggage (entropy)
In AI
Model matching true labels = minimum cross-entropy loss
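To see that nn.CrossEntropyLoss really is "average surprise of the true label under the model", it helps to recompute the loss by hand from the softmax, using the same toy logits as in the example above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[-1.0, 0.0, 1.2]])
target = torch.tensor([2])

# Library loss
loss = nn.CrossEntropyLoss()(logits, target)

# Manual version: softmax → probability of the true class → -log
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[0, target[0]])

# The two agree: cross-entropy loss is just -log Q(true class),
# i.e. the model's surprise at the label that actually occurred
```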
KL Divergence — Distance Between Distributions
How different is your model from reality?
The Analogy
KL divergence measures the “extra cost” of using the wrong distribution. If you designed a communication system based on predicted weather Q but the actual weather follows P, KL(P||Q) is the wasted bits per message. It’s the gap between cross-entropy and entropy: KL(P||Q) = H(P,Q) − H(P). Always ≥ 0, equals 0 only when P = Q.
Key insight: KL divergence is everywhere in modern AI. VAEs minimize KL divergence to keep the latent space well-structured. Knowledge distillation uses KL to transfer knowledge from a large teacher model to a small student. RLHF uses KL to prevent the fine-tuned model from drifting too far from the base model.
Worked Example
# KL(P || Q) = Σ P(x) × log(P(x) / Q(x))
#            = H(P, Q) - H(P)   (extra cost)
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # true
Q = np.array([0.3, 0.4, 0.3])  # model
KL = np.sum(P * np.log2(P / Q))
# ≈ 0.50 bits (significant mismatch)

# Note: KL is NOT symmetric!
#   KL(P||Q) ≠ KL(Q||P)
#   KL(P||Q): "cost of using Q when truth is P"
#   KL(Q||P): "cost of using P when truth is Q"

# In a VAE:
#   Loss = Reconstruction + β × KL(q(z|x) || p(z))
#   The KL term keeps the latent space close to N(0,1)
Key insight: Minimizing cross-entropy H(P,Q) is equivalent to minimizing KL(P||Q) because H(P) is constant. That’s why cross-entropy loss works — it’s secretly minimizing the distance between your model and reality.
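Both claims — the gap identity KL(P||Q) = H(P,Q) − H(P) and the asymmetry — are easy to verify numerically with the same P and Q used in the worked example:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # true distribution
Q = np.array([0.3, 0.4, 0.3])   # model's distribution

H_P = -np.sum(P * np.log2(P))       # entropy of the truth
H_PQ = -np.sum(P * np.log2(Q))      # cross-entropy
KL_PQ = np.sum(P * np.log2(P / Q))  # forward KL
KL_QP = np.sum(Q * np.log2(Q / P))  # reverse KL

# Gap identity: KL_PQ equals H_PQ - H_P exactly
# Asymmetry: KL_PQ and KL_QP differ (here ≈ 0.497 vs ≈ 0.509 bits)
```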
Mutual Information
How much does knowing X tell you about Y?
The Analogy
Mutual information measures how much knowing one thing tells you about another. Knowing someone’s height tells you something about their weight (positive MI). Knowing their shoe size tells you almost nothing about their favorite color (near-zero MI). MI = 0 means the variables are completely independent.
Key insight: Mutual information is a more powerful version of correlation. Correlation only captures linear relationships. MI captures any statistical dependency. If X = sin(Y), correlation might be ~0, but MI is high. This makes MI invaluable for feature selection — find features that share the most information with the target, regardless of the relationship shape.
The Math
# MI(X; Y) = H(X) + H(Y) - H(X, Y)
#          = KL(P(X,Y) || P(X)P(Y))
#          = reduction in uncertainty about X
#            when you learn Y (and vice versa)
#
# Properties:
#   MI(X; Y) ≥ 0            (always)
#   MI(X; Y) = 0 iff X ⊥ Y  (independent)
#   MI(X; X) = H(X)         (max possible)
#   MI(X; Y) = MI(Y; X)     (symmetric!)
from sklearn.feature_selection import mutual_info_classif

# Select features with highest MI to the target
# (X = feature matrix, y = labels, assumed already loaded)
mi_scores = mutual_info_classif(X, y)
# e.g. [0.82, 0.01, 0.45, 0.67, ...]
# Features 0 and 3 are the most informative
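The identity MI = H(X) + H(Y) − H(X,Y) can be checked directly on a small hand-built joint distribution (the 2×2 table below is made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution over two binary variables
# rows index X, columns index Y; entries sum to 1
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

Px = joint.sum(axis=1)   # marginal of X: [0.5, 0.5]
Py = joint.sum(axis=0)   # marginal of Y: [0.5, 0.5]

def H(p):
    """Entropy in bits of a probability vector."""
    p = np.ravel(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Definition via entropies
MI = H(Px) + H(Py) - H(joint)     # ≈ 0.28 bits shared

# Equivalent form: KL between joint and product of marginals
MI_kl = np.sum(joint * np.log2(joint / np.outer(Px, Py)))
```

The two formulas agree to machine precision; if the table were the outer product of its marginals (independence), both would return 0.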
The Information Bottleneck
Compress the input, keep only what matters for the output
The Analogy
Imagine summarizing a 500-page book into a 1-page summary. You must compress (lose details) while preserving the key message. The information bottleneck theory says deep learning does exactly this: each layer compresses the input (forgets irrelevant details) while preserving information relevant to the output label.
Key insight: This explains why deep networks generalize. Early layers extract features (compress 1M pixels to 512 features). Later layers keep only what matters for the task. A cat classifier doesn’t need to remember the exact pixel values — just “has ears, whiskers, fur pattern.” The network learns to throw away noise and keep signal.
The Framework
# Information Bottleneck principle:
#   min I(X; T) - β × I(T; Y)
#
#   X = input (image pixels)
#   T = representation (hidden layer)
#   Y = output (label)
#
# Goal: minimize I(X;T) → compress input
#       maximize I(T;Y) → preserve label info

# Deep network as a compression pipeline:
#   Image (3×224×224 = 150K values)
#   → Conv layers (compress spatial info)
#   → 512-dim feature vector (bottleneck)
#   → 10-class prediction
#   150K → 512 → 10 (massive compression!)
Real World
Summarize a book: lose details, keep the core message
In AI
Neural net: compress 150K pixels to 512 features that predict the label
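The 150K → 512 → 10 funnel can be made concrete with a minimal (untrained) PyTorch model. The layer sizes match the sketch above, but the architecture itself is just a shape-checking illustration, not a real classifier:

```python
import torch
import torch.nn as nn

# Minimal illustration of the 150K → 512 → 10 funnel
# (untrained; layers chosen only to demonstrate the shapes)
model = nn.Sequential(
    nn.Flatten(),                     # 3×224×224 → 150,528 values
    nn.Linear(3 * 224 * 224, 512),    # bottleneck representation T
    nn.ReLU(),
    nn.Linear(512, 10),               # class scores for Y
)

x = torch.randn(1, 3, 224, 224)       # fake input image X
bottleneck = model[:2](x)             # representation after the squeeze
out = model(x)                        # final class scores
# bottleneck.shape == (1, 512); out.shape == (1, 10)
```

Everything the final layer knows about the image must pass through those 512 numbers — that narrow passage is the "bottleneck" the theory is named after.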
Information Theory in Modern AI
From GPT to diffusion models — information theory is everywhere
Language Models
GPT and LLMs are trained to minimize cross-entropy on next-token prediction. Perplexity = 2^(cross-entropy) measures how “confused” the model is. GPT-4 has perplexity ~10-15 on English text, meaning at each position it’s effectively choosing between 10-15 equally likely words. Temperature in sampling controls the entropy of the output distribution — low temp = low entropy = predictable; high temp = high entropy = creative.
Key connection: When you set temperature=0.7 in ChatGPT, you’re literally adjusting the entropy of the probability distribution over next tokens. Temperature=0 picks the most likely token (zero entropy). Temperature=2 makes the distribution nearly uniform (high entropy, very random).
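The temperature claim is directly verifiable: dividing logits by a temperature before the softmax lowers or raises the entropy of the resulting distribution. The logits below are arbitrary toy values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 0.5, -1.0])   # arbitrary next-token scores

H_cold = entropy_bits(softmax(logits / 0.2))  # low T: peaked, low entropy
H_mid  = entropy_bits(softmax(logits / 1.0))  # T = 1: unchanged
H_hot  = entropy_bits(softmax(logits / 5.0))  # high T: near-uniform

# H_cold < H_mid < H_hot; as T → ∞ the entropy approaches
# the uniform maximum log₂(4) = 2 bits for four tokens
```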
More Applications
# VAEs: ELBO = Reconstruction - KL
#   KL keeps the latent space structured

# Diffusion models: learn to reverse
#   information destruction (adding noise)
#   Score = ∇log p(x) (gradient of log-prob)

# RLHF: KL penalty prevents drift
#   Objective = reward - β × KL(π_new || π_ref)
#   (maximized, so the KL term acts as a penalty)

# Knowledge distillation:
#   Student minimizes KL to teacher's outputs
#   Soft labels carry more info than hard labels

# Perplexity of a language model:
#   PPL = 2^H(P, Q) where Q = model
#   GPT-4: PPL ≈ 10-15 on English text
#   Human: PPL ≈ 20 (we're less predictable!)
The Complete Picture
How all the information theory concepts connect
The Unified View
All of information theory connects through one idea: uncertainty reduction. Entropy measures total uncertainty. Cross-entropy measures how well a model captures that uncertainty. KL divergence measures the gap. Mutual information measures shared uncertainty. Every ML loss function is secretly an information-theoretic quantity.
The big picture: Shannon’s 1948 paper “A Mathematical Theory of Communication” is arguably the most important paper for modern AI. Without it, we wouldn’t have cross-entropy loss, perplexity, KL divergence for VAEs, or the information bottleneck theory. Information theory IS the language of learning.
Cheat Sheet
# Information Theory Cheat Sheet for ML:
#
#   Entropy   H(P)   → uncertainty of P
#   Cross-ent H(P,Q) → avg surprise of P under model Q
#   KL(P||Q)         → extra cost of Q vs P
#   MI(X;Y)          → shared information
#
# Key relationships:
#   H(P,Q)  = H(P) + KL(P||Q)
#   MI(X;Y) = H(X) - H(X|Y)
#   MI(X;Y) = KL(P(X,Y) || P(X)P(Y))
#
# Where you see them:
#   Classification    → cross-entropy loss
#   VAE               → KL divergence regularizer
#   LLM eval          → perplexity = 2^H(P,Q)
#   Feature selection → mutual information
#   Decision trees    → information gain
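As a closing sanity check, the cheat sheet's key relationships can all be verified with a few lines of NumPy (the distributions below are arbitrary toy values):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

H_P = -np.sum(P * np.log2(P))        # entropy
H_PQ = -np.sum(P * np.log2(Q))       # cross-entropy
KL_PQ = np.sum(P * np.log2(P / Q))   # KL divergence

# Relationship 1: H(P,Q) = H(P) + KL(P||Q)

# Relationships 2 & 3 need a joint distribution; a toy 2×2 table:
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
Px, Py = joint.sum(axis=1), joint.sum(axis=0)

def H(p):
    p = np.ravel(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# MI via entropies: H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)
MI = H(Px) + H(Py) - H(joint)

# MI via KL: distance of the joint from independence
MI_kl = np.sum(joint * np.log2(joint / np.outer(Px, Py)))
```

All three identities hold to machine precision, which is exactly why cross-entropy, KL, and mutual information are interchangeable vocabulary for the same underlying bookkeeping of uncertainty.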