Hochreiter's Analysis (1991)
Sepp Hochreiter formally analyzed the vanishing gradient problem in his 1991 diploma thesis. In a vanilla RNN, the gradient at time step 1 from a loss at time step T involves multiplying the same weight matrix W_hh (and tanh derivatives) T times. If the largest eigenvalue of W_hh is less than 1, gradients shrink exponentially. If greater than 1, they explode. For a 100-step sequence, even eigenvalues of 0.9 produce gradients of 0.9¹⁰⁰ ≈ 2.7 × 10⁻⁵. The network effectively “forgets” early inputs.
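This decay is easy to check numerically. A minimal sketch (not from the thesis): take a recurrent matrix built as 0.9 times a rotation, so its spectral radius is exactly 0.9, and multiply the Jacobian factor 100 times as backpropagation through time would.

```python
import numpy as np

# 2x2 rotation matrix (orthogonal, so it doesn't change vector norms),
# scaled by 0.9 -> spectral radius of W is exactly 0.9.
theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W = 0.9 * R

# Multiply 100 Jacobian-like factors, as in BPTT over a 100-step
# sequence (the tanh' factors are ≤ 1, so they only shrink it further).
grad = np.eye(2)
for _ in range(100):
    grad = grad @ W

print(np.linalg.norm(grad, 2))  # ≈ 2.66e-5, i.e. 0.9**100
```

Because the rotation part is orthogonal, the spectral norm of the product is exactly 0.9¹⁰⁰, matching the figure quoted above.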
Critical in AI: This means a vanilla RNN processing “The cat, which sat on the mat near the window overlooking the garden, was sleeping” cannot learn that “was” depends on “cat” (singular) a dozen words earlier. This limitation motivated LSTMs and GRUs.
The Math
// Gradient through T time steps
∂h_T/∂h₁ = Πₖ₌₁ᵀ⁻¹ ∂hₖ₊₁/∂hₖ
         = Πₖ₌₁ᵀ⁻¹ diag(tanh'(zₖ₊₁)) · W_hh
// tanh derivative max = 1 (at z=0)
// If ||W_hh|| < 1: exponential decay
// If ||W_hh|| > 1: exponential growth
// Gradient clipping helps with exploding:
if ||g|| > threshold:
    g = threshold · g / ||g||
// But nothing helps vanishing in vanilla RNN
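The clipping rule above translates directly into a small NumPy helper. A sketch: the `threshold` value here (5.0) is an illustrative choice, not a prescribed constant.

```python
import numpy as np

def clip_by_norm(g, threshold=5.0):
    """If ||g|| exceeds threshold, rescale g to have norm == threshold.

    The direction of the gradient is preserved; only its magnitude
    is capped, which is why clipping tames exploding gradients
    without changing the descent direction.
    """
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = threshold * g / norm
    return g

g = np.array([30.0, 40.0])                 # norm 50 -- "exploding"
clipped = clip_by_norm(g, threshold=5.0)
print(clipped, np.linalg.norm(clipped))    # [3. 4.] 5.0
```

Note that clipping is a one-sided fix: it caps large gradients but cannot amplify vanishingly small ones, which is exactly why it leaves the vanishing-gradient problem untouched.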