Ch 8 — Distributions & Expectations

The bell curve is everywhere — from human heights to neural network weights
A Histogram of Heights
Distributions describe the shape of randomness
The Analogy
Measure the height of everyone in your city and plot a histogram. Most people cluster around the average (5’7”), with fewer very short or very tall people. This histogram IS a distribution — it tells you the probability of each height range. A distribution is the “shape” of randomness: where values cluster, how spread out they are, and how likely extreme values are.
Key insight: Neural network weights start as random numbers drawn from a distribution. The choice of distribution (Gaussian, uniform) and its parameters (mean, variance) determines whether the network can learn at all. Bad initialization = dead network.
Types of Distributions
# Discrete: countable outcomes
#   Coin flip: P(H) = 0.5, P(T) = 0.5
#   Die roll:  P(1) = P(2) = ... = P(6) = 1/6

# Continuous: any value in a range
#   Height: 5.0, 5.1, 5.123, ...
#   Temperature: any real number

# Common distributions in AI:
#   Bernoulli:   binary (spam/not spam)
#   Categorical: multi-class (next token)
#   Gaussian:    continuous (weight init)
#   Uniform:     equal probability everywhere
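The discrete/continuous split above is easy to see by drawing samples. A minimal sketch using NumPy (the sample sizes and the height parameters, 67 inches with a spread of 3, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete: die roll, six countable outcomes, each with probability 1/6
rolls = rng.integers(1, 7, size=10_000)
print(np.bincount(rolls)[1:] / len(rolls))  # each entry close to 1/6

# Continuous: heights, any value in a range, drawn from a Gaussian
heights = rng.normal(loc=67, scale=3, size=10_000)  # inches
print(heights.mean(), heights.std())  # close to 67 and 3
```

Note that for the continuous sample you can only count heights falling in a *range*; no two samples are likely to be exactly equal.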
Real World
Heights cluster around average, few extremes — bell curve
In AI
Weights initialized from Gaussian, predictions from categorical distribution
PMF & PDF — The Shape Functions
How to describe a distribution mathematically
The Analogy
A PMF (probability mass function) is like a bar chart for discrete outcomes — each bar’s height is the probability. A PDF (probability density function) is like a smooth curve for continuous outcomes — the area under the curve between two points gives the probability. Think of PMF as “how much probability sits on each point” and PDF as “how densely probability is spread.”
Key insight: For a continuous distribution, P(X = exactly 5.7000...) = 0. You can only ask about ranges: P(5.6 < X < 5.8). This is why we use density, not probability. The softmax output of an LLM is a PMF over the discrete vocabulary.
Worked Example
# PMF: discrete (die roll)
#   P(X=1) = 1/6, P(X=2) = 1/6, ...
#   Σ P(X=k) = 1

# PDF: continuous (Gaussian)
#   f(x) = (1/√(2πσ²)) × exp(-(x-μ)²/(2σ²))
#   ∫ f(x) dx = 1

import numpy as np
from scipy import stats

# Gaussian PDF at x=0, μ=0, σ=1
stats.norm.pdf(0, loc=0, scale=1)
# 0.3989 — density, not probability!

# P(-1 < X < 1) for standard normal
stats.norm.cdf(1) - stats.norm.cdf(-1)
# 0.6827 — 68.27% within 1 std dev
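Because the Gaussian PDF formula is plain arithmetic, the "density, not probability" point can be checked without SciPy. A minimal sketch using only the standard library (the helper `norm_pdf` is ours, not a library function):

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(norm_pdf(0.0))             # 0.3989, matches stats.norm.pdf(0)
print(norm_pdf(0.0, sigma=0.1))  # 3.989: a density can exceed 1, a probability cannot
```

A narrow Gaussian (σ = 0.1) has density near 4 at its peak, yet every area under the curve is still at most 1.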
Expected Value — The Balance Point
If you repeated the experiment forever, what’s the average?
The Analogy
The expected value E[X] is the balance point of the distribution — if you cut out the histogram from cardboard, E[X] is where it would balance on a pencil. For a fair die, E[X] = 3.5 (the average of 1,2,3,4,5,6). It’s the “long-run average” if you repeated the experiment infinitely many times.
Key insight: The loss function in AI training IS an expected value: L = E[loss(x, y)] averaged over all data points. When we use mini-batches, we’re estimating this expected value with a sample average. The law of large numbers guarantees this estimate improves with more samples.
Worked Example
# E[X] = Σ xᵢ × P(xᵢ)   (discrete)
# E[X] = ∫ x × f(x) dx  (continuous)

# Fair die:
#   E[X] = 1×(1/6) + 2×(1/6) + ... + 6×(1/6)
#        = 21/6 = 3.5

# Loaded die: P(6)=0.5, others=0.1 each
#   E[X] = 1×0.1 + 2×0.1 + 3×0.1 + 4×0.1 + 5×0.1 + 6×0.5
#        = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5

# AI loss as expected value:
#   L = E[(y - ŷ)²]  (MSE loss)
#     ≈ (1/batch_size) Σ (yᵢ - ŷᵢ)²
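The "long-run average" claim can be checked by simulation: sample means drift toward E[X] as the number of rolls grows (law of large numbers). A sketch with NumPy, using the fair and loaded dice from the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair die: sample mean converges to E[X] = 3.5
rolls = rng.integers(1, 7, size=100_000)
for n in (10, 1_000, 100_000):
    print(n, "rolls, average:", rolls[:n].mean())

# Loaded die: P(6)=0.5, others 0.1 each, so E[X] = 4.5
loaded = rng.choice([1, 2, 3, 4, 5, 6], size=100_000, p=[0.1] * 5 + [0.5])
print("loaded die average:", loaded.mean())  # close to 4.5
```

This is exactly what a mini-batch does to the loss: a small sample average standing in for the full expectation.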
Real World
Fair die averages 3.5 over many rolls
In AI
Loss = expected error over all data, estimated by mini-batch average
Variance & Standard Deviation
How spread out is the distribution?
The Analogy
Two cities might have the same average temperature (70°F), but one is a desert (huge swings: 40° to 100°) and the other is coastal (steady: 65° to 75°). Variance measures this spread — how far values typically stray from the mean. Standard deviation (σ) is the square root of variance, in the same units as the data.
Key insight: Batch normalization, one of the most important techniques in deep learning, works by normalizing each layer’s activations to have mean 0 and variance 1. It literally computes E[X] and Var[X] of each mini-batch and rescales. This stabilizes training dramatically.
Worked Example
# Var[X] = E[(X - μ)²] = E[X²] - (E[X])²
# σ = √Var[X]
import numpy as np
import torch

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()  # 5.0
var = data.var()    # 4.0 (population variance)
std = data.std()    # 2.0

# Batch normalization in PyTorch:
bn = torch.nn.BatchNorm1d(256)
# For each feature: x̂ = (x - μ_batch) / σ_batch
# Then: y = γ × x̂ + β  (learnable scale/shift)
68-95-99.7 rule: For a Gaussian, 68% of data falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ. Values beyond 3σ are rare outliers.
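The 68-95-99.7 rule is easy to verify empirically. A quick sketch with NumPy (one million samples is an arbitrary choice, large enough that the fractions settle):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # standard normal: μ=0, σ=1

for k in (1, 2, 3):
    frac = np.mean(np.abs(z) < k)  # fraction of samples within k std devs
    print(f"within {k}σ: {frac:.4f}")
# close to 0.6827, 0.9545, 0.9973
```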
The Gaussian (Normal) Distribution
The bell curve that rules the universe
The Analogy
The Gaussian (bell curve) appears everywhere: human heights, measurement errors, stock returns, IQ scores. It’s defined by just two numbers: the mean μ (center) and standard deviation σ (width). The standard normal has μ = 0, σ = 1. It’s the “default” distribution of nature because of the Central Limit Theorem.
Key insight: Neural network weights are almost always initialized from a Gaussian distribution. Xavier init draws from N(0, 1/n_in) and He init from N(0, 2/n_in), where n_in is the number of inputs to the layer. The Gaussian is chosen because it's mathematically clean, symmetric, and well understood. Diffusion models generate images by adding and then removing Gaussian noise.
Worked Example
# Gaussian PDF:
#   f(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
import torch

# Standard normal: μ=0, σ=1
z = torch.randn(1000)  # 1000 samples
z.mean()  # ≈ 0.0
z.std()   # ≈ 1.0

# Xavier initialization for a layer (n_in → n_out)
n_in = 512
W = torch.randn(n_in, 256) * (1 / n_in) ** 0.5
# W ~ N(0, 1/512) — keeps variance stable

# He initialization (for ReLU)
W = torch.randn(n_in, 256) * (2 / n_in) ** 0.5
# W ~ N(0, 2/512) — accounts for ReLU
The Central Limit Theorem
Why the bell curve appears everywhere
The Analogy
Roll one die: flat distribution (each number equally likely). Average 2 dice: triangular. Average 30 dice: nearly perfect bell curve. The Central Limit Theorem says: average enough random things together, and the result is always Gaussian, regardless of the original distribution. This is why the bell curve appears everywhere — most real-world measurements are averages of many small effects.
Key insight: Mini-batch gradient estimation works because of the CLT. Each mini-batch gradient is an average of individual gradients. By the CLT, this average is approximately Gaussian, and its variance shrinks as 1/batch_size. Larger batches = more accurate gradient estimates.
Worked Example
# CLT: the average of n samples → Gaussian
# as n → ∞, regardless of the original distribution
import numpy as np

# Uniform [0,1]: flat, NOT Gaussian
samples = np.random.uniform(0, 1, (10000, 30))
averages = samples.mean(axis=1)
# averages ≈ N(0.5, 1/(12×30))
# → bell curve centered at 0.5!

# Mini-batch gradient variance:
#   Var[ĝ] = Var[g] / batch_size
#   batch=32:  Var/32  (noisy)
#   batch=256: Var/256 (8× less noisy)
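The dice version of the analogy runs the same way: one die is flat, but the average of 30 dice is approximately Gaussian, with the variance shrunk by a factor of 30. A sketch with NumPy (10,000 trials is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 trials, each averaging 30 die rolls
rolls = rng.integers(1, 7, size=(10_000, 30))
avgs = rolls.mean(axis=1)

# One die: E[X] = 3.5, Var[X] = 35/12. The average of 30 dice
# keeps the mean but has variance (35/12)/30, per the CLT.
print(avgs.mean())               # close to 3.5
print(avgs.var(), 35 / 12 / 30)  # both close to 0.097
```

Plotting a histogram of `avgs` would show the triangular-then-bell progression the analogy describes.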
Real World
Average 30 dice rolls → bell curve (CLT in action)
In AI
Mini-batch gradient ≈ Gaussian by CLT; larger batch = less noise
Weight Initialization — Xavier & He
The right distribution prevents dead neurons
The Analogy
Imagine tuning 1,000 guitar strings. If you start them all at the same tension (zero init), they all sound the same — useless. If you start with random tensions that are too extreme, some strings snap (exploding gradients) and others go slack (vanishing gradients). Xavier/He initialization picks the “Goldilocks” variance so that signals neither explode nor vanish as they pass through layers.
Key insight: Xavier init sets Var(W) = 1/n_in so that the variance of activations stays constant across layers. He init doubles this to 2/n_in because ReLU kills half the signal (negative values become 0). This simple variance calculation is the difference between a network that trains and one that doesn’t.
In Practice
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)

# Xavier (Glorot) — for tanh/sigmoid
nn.init.xavier_normal_(layer.weight)
# W ~ N(0, 2/(n_in + n_out))

# He (Kaiming) — for ReLU
nn.init.kaiming_normal_(layer.weight)
# W ~ N(0, 2/n_in)

# Bad init: too large
W = torch.randn(512, 512) * 10
# Activations explode: ~10^layers

# Bad init: too small
W = torch.randn(512, 512) * 0.001
# Activations vanish: ~0.001^layers ≈ 0
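The vanish/explode claim can be demonstrated numerically without PyTorch. A minimal NumPy sketch of a stack of linear + ReLU layers (depth 20 and width 512 are arbitrary choices, and `stds` is just a local dict for the results):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x = rng.standard_normal((1, n))  # one input vector with std ~1

stds = {}
for name, scale in [("small", 0.001), ("he", (2 / n) ** 0.5)]:
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale  # W ~ N(0, scale²)
        h = np.maximum(h @ W, 0)                 # linear layer + ReLU
    stds[name] = h.std()
    print(name, "init, activation std after", depth, "layers:", stds[name])
```

With scale 0.001 the signal collapses to essentially zero; with the He scale √(2/n) the activation scale stays on the order of the input, which is exactly the "Goldilocks" property the analogy describes.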
Source: Glorot & Bengio (2010) “Understanding the difficulty of training deep feedforward neural networks” (Xavier). He et al. (2015) “Delving Deep into Rectifiers” (Kaiming/He).
Other Key Distributions in AI
Bernoulli, categorical, Poisson, and more
Distribution Zoo
Bernoulli: Binary outcomes (spam/not spam). Categorical: Multi-class (next token from vocabulary). Poisson: Count of rare events (website visits per hour). Exponential: Time between events (time until next click). Beta: Probability of a probability (uncertainty about a coin’s bias). Each distribution models a different type of randomness.
Why it matters for AI: Choosing the right distribution for your problem is critical. Binary classification uses Bernoulli (sigmoid output). Multi-class uses categorical (softmax output). Regression often assumes Gaussian errors (MSE loss). Diffusion models use Gaussian noise. VAEs use Gaussian latent spaces. The distribution IS the model.
Cheat Sheet
# Bernoulli: P(X=1) = p, P(X=0) = 1-p
#   → binary classification, dropout
# Categorical: P(X=k) = pₖ, Σ pₖ = 1
#   → LLM next-token, multi-class
# Gaussian: N(μ, σ²)
#   → weight init, noise, latent spaces
# Uniform: U(a, b), all values equally likely
#   → random sampling, some inits

# Softmax creates a categorical from logits:
import torch
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=0)
# [0.659, 0.242, 0.099] — categorical!
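Each entry in the zoo can be sampled directly; the sample means land on the textbook expected values. A sketch using NumPy's random generator (the parameters, e.g. λ = 4 and Beta(2, 5), are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

bern = rng.binomial(1, 0.3, N)   # Bernoulli(p=0.3): spam / not spam
cat = rng.choice(3, N, p=[0.659, 0.242, 0.099])  # categorical (softmax probs)
pois = rng.poisson(4.0, N)       # Poisson(λ=4): events per hour
expo = rng.exponential(2.0, N)   # Exponential(scale=2): time between events
beta = rng.beta(2, 5, N)         # Beta(2,5): uncertainty about a coin's bias

print(bern.mean())           # close to p = 0.3
print(np.bincount(cat) / N)  # close to [0.659, 0.242, 0.099]
print(pois.mean())           # close to λ = 4
print(expo.mean())           # close to scale = 2
print(beta.mean())           # close to a/(a+b) = 2/7
```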
Real World
Heights = Gaussian, coin flips = Bernoulli, bus arrivals = Poisson
In AI
Weights = Gaussian, predictions = categorical, dropout = Bernoulli