Ch 8 — Distributions & Expectations

The bell curve is everywhere — from human heights to neural network weights
A Histogram of Heights
Distributions describe the shape of randomness
The Analogy
Measure the height of everyone in your city and plot a histogram. Most people cluster around the average (5’7”), with fewer very short or very tall people. This histogram IS a distribution — it tells you the probability of each height range. A distribution is the “shape” of randomness: where values cluster, how spread out they are, and how likely extreme values are.
Key insight: Neural network weights start as random numbers drawn from a distribution. The choice of distribution (Gaussian, uniform) and its parameters (mean, variance) determines whether the network can learn at all. Bad initialization = dead network.
Types of Distributions
# Discrete: countable outcomes
#   Coin flip: P(H) = 0.5, P(T) = 0.5
#   Die roll:  P(1) = P(2) = ... = P(6) = 1/6

# Continuous: any value in a range
#   Height: 5.0, 5.1, 5.123, ...
#   Temperature: any real number

# Common distributions in AI:
#   Bernoulli:   binary (spam/not spam)
#   Categorical: multi-class (next token)
#   Gaussian:    continuous (weight init)
#   Uniform:     equal probability everywhere
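The discrete/continuous split above is easy to see by drawing samples. A minimal sketch using NumPy (the sample sizes and the height parameters, 67 inches with a spread of 3, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete: die roll, six countable outcomes, each with probability 1/6
rolls = rng.integers(1, 7, size=10_000)
print(np.bincount(rolls)[1:] / len(rolls))  # each entry close to 1/6

# Continuous: heights, any value in a range, drawn from a Gaussian
heights = rng.normal(loc=67, scale=3, size=10_000)  # inches
print(heights.mean(), heights.std())  # close to 67 and 3
```

Note that for the continuous sample you can only count heights falling in a *range*; no two samples are likely to be exactly equal.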
Real World
Heights cluster around average, few extremes — bell curve
In AI
Weights initialized from Gaussian, predictions from categorical distribution
PMF & PDF — The Shape Functions
How to describe a distribution mathematically
The Analogy
A PMF (probability mass function) is like a bar chart for discrete outcomes — each bar’s height is the probability. A PDF (probability density function) is like a smooth curve for continuous outcomes — the area under the curve between two points gives the probability. Think of PMF as “how much probability sits on each point” and PDF as “how densely probability is spread.”
Key insight: For a continuous distribution, P(X = exactly 5.7000...) = 0. You can only ask about ranges: P(5.6 < X < 5.8). This is why we use density, not probability. The softmax output of an LLM is a PMF over the discrete vocabulary.
Worked Example
# PMF: discrete (die roll)
#   P(X=1) = 1/6, P(X=2) = 1/6, ...
#   Σ P(X=k) = 1

# PDF: continuous (Gaussian)
#   f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))
#   ∫ f(x) dx = 1

import numpy as np
from scipy import stats

# Gaussian PDF at x=0, μ=0, σ=1
stats.norm.pdf(0, loc=0, scale=1)
# 0.3989 — density, not probability!

# P(-1 < X < 1) for standard normal
stats.norm.cdf(1) - stats.norm.cdf(-1)
# 0.6827 — 68.27% within 1 std dev
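Because the Gaussian PDF formula is plain arithmetic, the "density, not probability" point can be checked without SciPy. A minimal sketch using only the standard library (the helper `norm_pdf` is ours, not a library function):

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(norm_pdf(0.0))             # 0.3989, matches stats.norm.pdf(0)
print(norm_pdf(0.0, sigma=0.1))  # 3.989: a density can exceed 1, a probability cannot
```

A narrow Gaussian (σ = 0.1) has density near 4 at its peak, yet every area under the curve is still at most 1.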
Expected Value — The Balance Point
If you repeated the experiment forever, what’s the average?
The Analogy
The expected value E[X] is the balance point of the distribution — if you cut out the histogram from cardboard, E[X] is where it would balance on a pencil. For a fair die, E[X] = 3.5 (the average of 1,2,3,4,5,6). It’s the “long-run average” if you repeated the experiment infinitely many times.
Key insight: The loss function in AI training IS an expected value: L = E[loss(x, y)] averaged over all data points. When we use mini-batches, we’re estimating this expected value with a sample average. The law of large numbers guarantees this estimate improves with more samples.
Worked Example
# E[X] = Σ xᵢ × P(xᵢ)   (discrete)
# E[X] = ∫ x × f(x) dx  (continuous)

# Fair die:
#   E[X] = 1×(1/6) + 2×(1/6) + ... + 6×(1/6)
#        = 21/6 = 3.5

# Loaded die: P(6)=0.5, others=0.1 each
#   E[X] = 1×0.1 + 2×0.1 + 3×0.1 + 4×0.1 + 5×0.1 + 6×0.5
#        = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5

# AI loss as expected value:
#   L = E[(y - ŷ)²]  (MSE loss)
#     ≈ (1/batch_size) Σ (yᵢ - ŷᵢ)²
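The "long-run average" claim can be checked by simulation: sample means drift toward E[X] as the number of rolls grows (law of large numbers). A sketch with NumPy, using the fair and loaded dice from the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair die: sample mean converges to E[X] = 3.5
rolls = rng.integers(1, 7, size=100_000)
for n in (10, 1_000, 100_000):
    print(n, "rolls, average:", rolls[:n].mean())

# Loaded die: P(6)=0.5, others 0.1 each, so E[X] = 4.5
loaded = rng.choice([1, 2, 3, 4, 5, 6], size=100_000, p=[0.1] * 5 + [0.5])
print("loaded die average:", loaded.mean())  # close to 4.5
```

This is exactly what a mini-batch does to the loss: a small sample average standing in for the full expectation.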
Real World
Fair die averages 3.5 over many rolls
In AI
Loss = expected error over all data, estimated by mini-batch average
Variance & Standard Deviation
How spread out is the distribution?
The Analogy
Two cities might have the same average temperature (70°F), but one is a desert (huge swings: 40° to 100°) and the other is coastal (steady: 65° to 75°). Variance measures this spread — how far values typically stray from the mean. Standard deviation (σ) is the square root of variance, in the same units as the data.
Key insight: Batch normalization, one of the most important techniques in deep learning, works by normalizing each layer’s activations to have mean 0 and variance 1. It literally computes E[X] and Var[X] of each mini-batch and rescales. This stabilizes training dramatically.
Worked Example
# Var[X] = E[(X - μ)²] = E[X²] - (E[X])²
# σ = √Var[X]
import numpy as np
import torch

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()  # 5.0
var = data.var()    # 4.0 (population variance)
std = data.std()    # 2.0

# Batch normalization in PyTorch:
bn = torch.nn.BatchNorm1d(256)
# For each feature: x̂ = (x - μ_batch) / σ_batch
# Then: y = γ × x̂ + β  (learnable scale/shift)
68-95-99.7 rule: For a Gaussian, 68% of data falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ. Values beyond 3σ are rare outliers.
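The 68-95-99.7 rule is easy to verify empirically. A quick sketch with NumPy (one million samples is an arbitrary choice, large enough that the fractions settle):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # standard normal: μ=0, σ=1

for k in (1, 2, 3):
    frac = np.mean(np.abs(z) < k)  # fraction of samples within k std devs
    print(f"within {k}σ: {frac:.4f}")
# close to 0.6827, 0.9545, 0.9973
```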
The Gaussian (Normal) Distribution
The bell curve that rules the universe
The Analogy
The Gaussian (bell curve) appears everywhere: human heights, measurement errors, stock returns, IQ scores. It’s defined by just two numbers: the mean μ (center) and standard deviation σ (width). The standard normal has μ = 0, σ = 1. It’s the “default” distribution of nature because of the Central Limit Theorem.
Key insight: Neural network weights are almost always initialized from a Gaussian distribution. Xavier init draws from N(0, 1/n_in) and He init from N(0, 2/n_in), where n_in is the number of inputs to the layer. The Gaussian is chosen because it's mathematically clean, symmetric, and well understood. Diffusion models generate images by adding and then removing Gaussian noise.
Worked Example
# Gaussian PDF:
#   f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
import torch

# Standard normal: μ=0, σ=1
z = torch.randn(1000)  # 1000 samples
z.mean()  # ≈ 0.0
z.std()   # ≈ 1.0

# Xavier initialization for a layer (n_in → n_out)
n_in = 512
W = torch.randn(n_in, 256) * (1 / n_in) ** 0.5
# W ~ N(0, 1/512) — keeps variance stable

# He initialization (for ReLU)
W = torch.randn(n_in, 256) * (2 / n_in) ** 0.5
# W ~ N(0, 2/512) — accounts for ReLU
The Central Limit Theorem
Why the bell curve appears everywhere
The Analogy
Roll one die: flat distribution (each number equally likely). Average 2 dice: triangular. Average 30 dice: nearly perfect bell curve. The Central Limit Theorem says: average enough random things together, and the result is always Gaussian, regardless of the original distribution. This is why the bell curve appears everywhere — most real-world measurements are averages of many small effects.
Key insight: Mini-batch gradient estimation works because of the CLT. Each mini-batch gradient is an average of individual gradients. By the CLT, this average is approximately Gaussian, and its variance shrinks as 1/batch_size. Larger batches = more accurate gradient estimates.
Worked Example
# CLT: the average of n samples → Gaussian
# as n → ∞, regardless of the original distribution
import numpy as np

# Uniform [0,1]: flat, NOT Gaussian
samples = np.random.uniform(0, 1, (10000, 30))
averages = samples.mean(axis=1)
# averages ≈ N(0.5, 1/(12×30))
# → bell curve centered at 0.5!

# Mini-batch gradient variance:
#   Var[ĝ] = Var[g] / batch_size
#   batch=32:  Var/32  (noisy)
#   batch=256: Var/256 (8× less noisy)
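The dice version of the analogy runs the same way: one die is flat, but the average of 30 dice is approximately Gaussian, with the variance shrunk by a factor of 30. A sketch with NumPy (10,000 trials is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 trials, each averaging 30 die rolls
rolls = rng.integers(1, 7, size=(10_000, 30))
avgs = rolls.mean(axis=1)

# One die: E[X] = 3.5, Var[X] = 35/12. The average of 30 dice
# keeps the mean but has variance (35/12)/30, per the CLT.
print(avgs.mean())               # close to 3.5
print(avgs.var(), 35 / 12 / 30)  # both close to 0.097
```

Plotting a histogram of `avgs` would show the triangular-then-bell progression the analogy describes.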
Real World
Average 30 dice rolls → bell curve (CLT in action)
In AI
Mini-batch gradient ≈ Gaussian by CLT; larger batch = less noise
Weight Initialization — Xavier & He
The right distribution prevents dead neurons
The Analogy
Imagine tuning 1,000 guitar strings. If you start them all at the same tension (zero init), they all sound the same — useless. If you start with random tensions that are too extreme, some strings snap (exploding gradients) and others go slack (vanishing gradients). Xavier/He initialization picks the “Goldilocks” variance so that signals neither explode nor vanish as they pass through layers.
Key insight: Xavier init sets Var(W) = 1/n_in so that the variance of activations stays constant across layers. He init doubles this to 2/n_in because ReLU kills half the signal (negative values become 0). This simple variance calculation is the difference between a network that trains and one that doesn’t.
In Practice
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)

# Xavier (Glorot) — for tanh/sigmoid
nn.init.xavier_normal_(layer.weight)
# W ~ N(0, 2/(n_in + n_out))

# He (Kaiming) — for ReLU
nn.init.kaiming_normal_(layer.weight)
# W ~ N(0, 2/n_in)

# Bad init: too large
W = torch.randn(512, 512) * 10
# Activations explode: ~10^layers

# Bad init: too small
W = torch.randn(512, 512) * 0.001
# Activations vanish: ~0.001^layers ≈ 0
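The vanish/explode claim can be demonstrated numerically without PyTorch. A minimal NumPy sketch of a stack of linear + ReLU layers (depth 20 and width 512 are arbitrary choices, and `stds` is just a local dict for the results):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x = rng.standard_normal((1, n))  # one input vector with std ~1

stds = {}
for name, scale in [("small", 0.001), ("he", (2 / n) ** 0.5)]:
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale  # W ~ N(0, scale²)
        h = np.maximum(h @ W, 0)                 # linear layer + ReLU
    stds[name] = h.std()
    print(name, "init, activation std after", depth, "layers:", stds[name])
```

With scale 0.001 the signal collapses to essentially zero; with the He scale √(2/n) the activation scale stays on the order of the input, which is exactly the "Goldilocks" property the analogy describes.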
Source: Glorot & Bengio (2010) “Understanding the difficulty of training deep feedforward neural networks” (Xavier). He et al. (2015) “Delving Deep into Rectifiers” (Kaiming/He).
Other Key Distributions in AI
Bernoulli, categorical, Poisson, and more
Distribution Zoo
Bernoulli: Binary outcomes (spam/not spam). Categorical: Multi-class (next token from vocabulary). Poisson: Count of rare events (website visits per hour). Exponential: Time between events (time until next click). Beta: Probability of a probability (uncertainty about a coin’s bias). Each distribution models a different type of randomness.
Why it matters for AI: Choosing the right distribution for your problem is critical. Binary classification uses Bernoulli (sigmoid output). Multi-class uses categorical (softmax output). Regression often assumes Gaussian errors (MSE loss). Diffusion models use Gaussian noise. VAEs use Gaussian latent spaces. The distribution IS the model.
Cheat Sheet
# Bernoulli: P(X=1) = p, P(X=0) = 1-p
#   → binary classification, dropout
# Categorical: P(X=k) = pₖ, Σ pₖ = 1
#   → LLM next-token, multi-class
# Gaussian: N(μ, σ²)
#   → weight init, noise, latent spaces
# Uniform: U(a, b), all values equally likely
#   → random sampling, some inits

# Softmax creates a categorical from logits:
import torch
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=0)
# [0.659, 0.242, 0.099] — categorical!
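Each entry in the zoo can be sampled directly; the sample means land on the textbook expected values. A sketch using NumPy's random generator (the parameters, e.g. λ = 4 and Beta(2, 5), are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

bern = rng.binomial(1, 0.3, N)   # Bernoulli(p=0.3): spam / not spam
cat = rng.choice(3, N, p=[0.659, 0.242, 0.099])  # categorical (softmax probs)
pois = rng.poisson(4.0, N)       # Poisson(λ=4): events per hour
expo = rng.exponential(2.0, N)   # Exponential(scale=2): time between events
beta = rng.beta(2, 5, N)         # Beta(2,5): uncertainty about a coin's bias

print(bern.mean())           # close to p = 0.3
print(np.bincount(cat) / N)  # close to [0.659, 0.242, 0.099]
print(pois.mean())           # close to λ = 4
print(expo.mean())           # close to scale = 2
print(beta.mean())           # close to a/(a+b) = 2/7
```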
Real World
Heights = Gaussian, coin flips = Bernoulli, bus arrivals = Poisson
In AI
Weights = Gaussian, predictions = categorical, dropout = Bernoulli