The Analogy
The gradient tells you the slope. The Hessian tells you the curvature — is the hill curving like a bowl (minimum), a dome (maximum), or a saddle (up in one direction, down in another)? It’s the matrix of second derivatives: how the slope itself is changing.
Key insight: In high-dimensional loss landscapes, most critical points are saddle points, not local minima. The Hessian’s eigenvalues tell you: all positive = bowl (minimum), all negative = dome (maximum), mixed = saddle. Research shows neural network loss surfaces have exponentially more saddle points than minima.
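The eigenvalue test just described can be sketched as a tiny helper. This is a minimal sketch using NumPy; `classify_critical_point` and its tolerance are names and choices of mine, not from any library:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its Hessian."""
    eigs = np.linalg.eigvalsh(H)  # symmetric Hessian -> real eigenvalues
    if np.all(eigs > tol):
        return "minimum"      # bowl: all curvature upward
    if np.all(eigs < -tol):
        return "maximum"      # dome: all curvature downward
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle"       # mixed signs: up one way, down another
    return "degenerate"       # an eigenvalue near zero: test inconclusive

classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]]))  # "saddle"
```

The `degenerate` branch matters in practice: when an eigenvalue is (numerically) zero, the second-derivative test alone can't classify the point.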
Worked Example
import numpy as np

# f(x, y) = x² - y² (saddle function)
# ∇f = [2x, -2y]
# Hessian H = [[∂²f/∂x²,  ∂²f/∂x∂y],
#              [∂²f/∂y∂x, ∂²f/∂y²]]
H_saddle = np.array([[2.0,  0.0],
                     [0.0, -2.0]])
print(np.linalg.eigvalsh(H_saddle))  # eigenvalues -2 and +2: mixed signs -> SADDLE POINT

# Bowl: f(x, y) = x² + y²
H_bowl = np.array([[2.0, 0.0],
                   [0.0, 2.0]])
print(np.linalg.eigvalsh(H_bowl))    # eigenvalues +2 and +2: all positive -> MINIMUM ✓
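As a sanity check on the hand-computed Hessians, you can approximate the Hessian numerically from f itself with central finite differences. `numerical_hessian` is a hypothetical helper written for this sketch, not a library function:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-5):
    """Approximate the Hessian of f at x by central finite differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Four evaluations of f around x give one second derivative
            x_pp = x.copy(); x_pp[i] += h; x_pp[j] += h
            x_pm = x.copy(); x_pm[i] += h; x_pm[j] -= h
            x_mp = x.copy(); x_mp[i] -= h; x_mp[j] += h
            x_mm = x.copy(); x_mm[i] -= h; x_mm[j] -= h
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h * h)
    return H

saddle = lambda x: x[0]**2 - x[1]**2
H = numerical_hessian(saddle, np.array([0.0, 0.0]))
# H is approximately [[2, 0], [0, -2]], matching the hand computation
```

Note the O(n²) function evaluations: even this crude check hints at why forming the full Hessian does not scale.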
Second-order optimizers exploit this curvature information to take better steps: Newton's method uses the full Hessian, while L-BFGS sidesteps it by building a low-rank approximation from gradient history. But the full Hessian has n² entries, which is prohibitively expensive to store for models with billions of parameters.
Computational cost: for a model with n parameters, the Hessian is n×n. GPT-3 has 175B parameters, so its Hessian would have 175B × 175B ≈ 3×10²² entries, far beyond any hardware. That's why first-order methods (SGD, Adam) dominate deep learning: they need only the gradient, never the full Hessian.
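To make that scale concrete, a quick back-of-envelope calculation (fp32, figures rounded):

```python
# Memory to store a dense Hessian vs. a gradient for GPT-3-sized n
n = 175e9                        # GPT-3 parameter count
grad_bytes = 4 * n               # one fp32 per parameter: ~0.7 TB
hess_bytes = 4 * n * n           # n x n fp32 entries: ~1.2e23 bytes
ratio = hess_bytes / grad_bytes  # the Hessian is n times larger than the gradient
```

Even the gradient alone is hundreds of gigabytes at this scale; the dense Hessian is n times that again, which is why it is never materialized.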