The Analogy
Vanilla SGD is like a hiker who stops and re-evaluates at every step. Momentum turns the hiker into a ball rolling downhill. The ball builds speed in directions it consistently rolls (the true downhill direction) and naturally dampens oscillations (the noisy side-to-side wobble). It uses a weighted average of past gradients, not just the current one.
Key insight: Without momentum, SGD zigzags in narrow valleys (imagine a bowling alley — the gradient points sideways toward the walls, not down the alley). Momentum smooths this out by accumulating velocity in the consistent direction (down the alley), making convergence much faster.
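The bowling-alley picture can be checked numerically. Below is a minimal sketch comparing vanilla SGD against momentum on a deliberately narrow quadratic valley, f(x, y) = 0.5·(100x² + y²): steep across the valley (x), shallow along it (y). The function, the `grad` helper, and the hyperparameters are illustrative choices, not from any library.

```python
# Illustrative narrow-valley loss: f(x, y) = 0.5 * (100*x**2 + y**2)
# Steep in x (the "walls"), shallow in y (the "alley").

def grad(x, y):
    # Gradient of f: (100*x, y)
    return 100.0 * x, y

def run_sgd(beta, lr=0.009, steps=200):
    # beta = 0.0 gives vanilla SGD; beta = 0.9 gives momentum.
    x, y = 1.0, 10.0   # start up against the wall, far down the alley
    vx = vy = 0.0      # velocity accumulators
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return x, y

def loss(x, y):
    return 0.5 * (100.0 * x**2 + y**2)

plain_x, plain_y = run_sgd(beta=0.0)  # vanilla: y barely moves in 200 steps
mom_x, mom_y = run_sgd(beta=0.9)      # momentum: velocity builds along y
```

With these (assumed) settings, momentum ends far closer to the optimum: the small per-step gradients along the alley accumulate into velocity, while the sign-flipping wall gradients largely cancel.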
Worked Example
# Momentum: v = β×v + ∇L; w = w - lr×v
# β = 0.9 (typical): keep 90% of the old velocity
v = 0.0      # initial velocity
beta = 0.9   # momentum coefficient
lr = 0.01    # learning rate
for step in range(100):
    g = compute_gradient(w)  # gradient of the loss at the current weights w
    v = beta * v + g         # accumulate velocity
    w = w - lr * v           # update weights
# Equivalent in PyTorch:
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01, momentum=0.9,
)
Intuition: with β = 0.9, the velocity is an exponentially weighted sum of past gradients with an effective window of roughly 1/(1 − 0.9) = 10 recent gradients. This smooths out noise while preserving the consistent signal.
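The 1/(1 − β) window is easy to verify directly: the weights on past gradients form a geometric series, so the total weight is 1/(1 − β), and with β = 0.9 the most recent 10 gradients carry about 65% of it. A quick sanity check (variable names here are illustrative):

```python
beta = 0.9

# Total weight across all past gradients: geometric series sum 1/(1 - beta)
total_weight = 1.0 / (1.0 - beta)                  # = 10.0

# Weight carried by just the 10 most recent gradients: beta^0 + ... + beta^9
recent_weight = sum(beta**k for k in range(10))    # ≈ 6.51

fraction_recent = recent_weight / total_weight     # ≈ 0.65
```

So "roughly the last 10 gradients" is an effective-window statement: older gradients still contribute, but their influence decays geometrically.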