Simplified Gates
Cho et al. (2014) introduced the GRU as a simpler alternative to the LSTM. It merges the cell state and hidden state into a single hidden state, and uses only two gates instead of three: a reset gate and an update gate. The result: fewer parameters, faster training, and comparable performance on most tasks.
# GRU: 2 gates instead of 3
Reset gate (r):
“How much past to forget?”
r = σ(Wᵣ · [hₜ₋₁, xₜ])
Update gate (z):
“How much to update vs keep?”
z = σ(Wᶻ · [hₜ₋₁, xₜ])
New state:
h̃ = tanh(W · [r ⊙ hₜ₋₁, xₜ])
hₜ = (1-z) ⊙ hₜ₋₁ + z ⊙ h̃
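The three equations above can be sketched as a single forward step in pure Python. This is a minimal illustration, not a production implementation: weight names (Wr, Wz, Wh) follow the formulas, biases are omitted for brevity, and each matrix maps the concatenation [hₜ₋₁, xₜ] to a hidden-size vector.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(W, v):
    # W: list of rows of length len(v); returns W · v
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_cell(x_t, h_prev, Wr, Wz, Wh):
    """One GRU time step (biases omitted for brevity)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    r = sigmoid(matvec(Wr, concat))                        # reset gate
    z = sigmoid(matvec(Wz, concat))                        # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r ⊙ h_{t-1}, x_t]
    h_tilde = [math.tanh(v) for v in matvec(Wh, gated)]    # candidate state
    # Interpolate: z decides how much of the new candidate replaces the old state
    return [(1 - zi) * hi + zi * hti
            for zi, hi, hti in zip(z, h_prev, h_tilde)]
```

Note the interpolation in the last line: because hₜ is a convex combination of hₜ₋₁ and h̃, the state changes smoothly, and a z near 0 simply copies the previous state forward.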
LSTM
3 gates (forget, input, output) + separate cell state. More parameters. Slightly better on very long sequences. Long the standard for speech recognition.
GRU
2 gates (reset, update). Fewer parameters, faster training. Comparable accuracy. Better for smaller datasets. No separate cell state.
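The "fewer parameters" claim is easy to make concrete: an LSTM has four weight matrices (forget, input, output, candidate) where a GRU has three (reset, update, candidate), each mapping the concatenated [hₜ₋₁, xₜ] to a hidden-size vector. A back-of-the-envelope count (one bias vector per matrix, ignoring framework-specific bias conventions) shows the GRU at exactly 75% of the LSTM's size for the same dimensions:

```python
def lstm_params(d, h):
    # 4 matrices of shape (h, h + d), plus 4 bias vectors of length h
    return 4 * (h * (h + d) + h)

def gru_params(d, h):
    # 3 matrices of shape (h, h + d), plus 3 bias vectors of length h
    return 3 * (h * (h + d) + h)

# Example: 256-dim input, 512-dim hidden state
# lstm_params(256, 512) == 1_574_912
# gru_params(256, 512)  == 1_181_184  (exactly 3/4 of the LSTM)
```

The example dimensions (256/512) are illustrative; the 3:4 ratio holds for any input and hidden size under this counting scheme.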
In practice: LSTM and GRU perform similarly on most tasks. LSTM is the safer default. GRU is preferred when compute is limited. Both are largely superseded by transformers (Ch 9) for most NLP tasks, but remain relevant for real-time streaming, edge devices, and time-series forecasting.