The Analogy
Two cities might have the same average temperature (70°F), but one is a desert (huge swings: 40° to 100°) and the other is coastal (steady: 65° to 75°). Variance measures this spread — how far values typically stray from the mean. Standard deviation (σ) is the square root of variance, in the same units as the data.
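The two-city picture can be sketched numerically. The temperature readings below are made up for illustration, chosen so both cities average exactly 70°F:

```python
import numpy as np

# Hypothetical daily highs (°F): same mean, very different spread
desert = np.array([40.0, 55.0, 70.0, 85.0, 100.0])
coastal = np.array([65.0, 68.0, 70.0, 72.0, 75.0])

print(desert.mean(), coastal.mean())  # 70.0 70.0 — identical means
print(desert.var())                   # 450.0 — large variance
print(coastal.var())                  # 11.6  — small variance
print(desert.std())                   # ≈ 21.2°F typical deviation
print(coastal.std())                  # ≈ 3.4°F typical deviation
```

The means are indistinguishable; only variance (and its square root, σ) reveals which climate is the desert.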
Key insight: Batch normalization, one of the most important techniques in deep learning, works by normalizing each layer’s activations to have mean 0 and variance 1. It literally computes E[X] and Var[X] of each mini-batch and rescales. This stabilizes training dramatically.
Worked Example
import numpy as np
import torch

# Var[X] = E[(X - μ)²] = E[X²] - (E[X])²
# σ = √Var[X]
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()  # 5.0
var = data.var()    # 4.0 (population variance, ddof=0)
std = data.std()    # 2.0

# Batch normalization in PyTorch (normalizes each of 256 features):
bn = torch.nn.BatchNorm1d(256)
# For each feature: x̂ = (x - μ_batch) / σ_batch
# Then: y = γ × x̂ + β (learnable scale/shift)
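A quick check of the formulas above: in training mode, BatchNorm1d normalizes each feature using the mini-batch's own mean and variance, so even inputs far from standard scale come out with mean ≈ 0 and σ ≈ 1. The batch size, seed, and scale/shift of the input below are arbitrary choices for the demo:

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(256)

# A batch of 32 vectors, deliberately shifted and scaled away from (0, 1)
x = torch.randn(32, 256) * 5.0 + 10.0
y = bn(x)  # training mode: uses batch statistics μ_batch, σ_batch

print(x.mean().item(), x.std().item())  # ≈ 10, ≈ 5
print(y.mean().item(), y.std().item())  # ≈ 0,  ≈ 1
```

With the default γ = 1 and β = 0 at initialization, the output is just the normalized x̂; during training, γ and β are learned so the network can undo the normalization where that helps.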
68-95-99.7 rule: For a Gaussian, 68% of data falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ. Values beyond 3σ are rare outliers.
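The rule is easy to verify empirically by sampling from a standard Gaussian (the sample size and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of samples within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k}σ: {frac:.3f}")  # ≈ 0.683, 0.954, 0.997
```

This is also why |x − μ| > 3σ is a common rough cutoff for flagging outliers in roughly Gaussian data.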