Ch 7 — LSTMs, GRUs & Sequence Models

Gating mechanisms, bidirectional RNNs, seq2seq, and encoder-decoder architectures
High Level: Cell State → Gates → GRU → Bidirectional → Seq2Seq → Legacy
LSTM: The Cell State Highway
Hochreiter & Schmidhuber’s 1997 breakthrough
The Core Innovation
The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997, solves the vanishing gradient problem with a brilliant idea: a cell state (cₜ) that runs through time like a conveyor belt. Information flows along this belt with only minor linear interactions (addition and element-wise multiplication). Three gates — forget, input, and output — control what information is removed, added, or read from the cell state. Because the cell state path is mostly linear, gradients can flow across hundreds or thousands of time steps without vanishing.
LSTM Equations
// LSTM at time step t
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)    // forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)    // input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ          // cell update
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)    // output gate
hₜ = oₜ ⊙ tanh(cₜ)                // hidden state

// σ = sigmoid (0 to 1) → gate values
// ⊙ = element-wise multiplication
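To make the update concrete, here is a minimal NumPy sketch of a single LSTM step following these equations (function and variable names are ours, not from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step. Each W_* has shape (hidden, hidden + input);
    the shared input is the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate
    c_t = f_t * c_prev + i_t * c_tilde    # cell update (mostly linear!)
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t

# Tiny smoke test: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Ws = [rng.normal(size=(n_h, n_h + n_in)) * 0.1 for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, *Ws, *bs)
```

Note how the cell update `c_t = f_t * c_prev + i_t * c_tilde` is the only line touching `c_prev` — an element-wise scale and add, with no matrix multiply or squashing nonlinearity on that path.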
The Three Gates
Forget, input, and output — controlling information flow
Forget Gate
The forget gate fₜ decides what to erase from the cell state. It outputs a value between 0 (completely forget) and 1 (completely keep) for each dimension. When processing “The cat sat. The dog ran,” the forget gate might erase “cat” information when it encounters the new subject “dog.”
Input Gate
The input gate iₜ decides what new information to store. It works with a candidate value c̃ₜ (what could be added) and scales it by how much to actually add. This two-step process lets the LSTM be selective about what enters long-term memory.
Output Gate
The output gate oₜ decides what to expose from the cell state as the hidden state hₜ. The cell state might contain information about gender, number, and tense, but only the relevant parts are output at each step.
Intuitive Example
// Processing: "The cat, which was very
// fluffy and loved to nap, was sleeping."

Step "cat":
  input gate:  store "subject=cat, singular"
  forget gate: keep everything

Step "which...nap":
  input gate:  store clause details
  forget gate: KEEP "subject=cat" ← key!

Step "was":
  output gate: read "singular" → predict

// "was" (not "were") because cat is singular
// Cell state preserved subject across 10+ words
Key insight: The cell state acts like a “highway” for information. The forget gate’s gradient is just fₜ (a sigmoid output near 1 for important information), so gradients flow almost unchanged across time — solving the vanishing gradient problem.
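A quick numeric sketch of this gradient argument (the per-step factors below are illustrative assumptions, not measurements): across T steps, the cell-state path multiplies forget-gate values near 1, while a vanilla RNN path multiplies factors of tanh derivatives and weights that are typically well below 1.

```python
import numpy as np

T = 100

# LSTM cell-state path: dc_T/dc_0 is the product of forget-gate values,
# which the network can hold near 1 for information it wants to keep
forget_gates = np.full(T, 0.98)
lstm_grad = np.prod(forget_gates)   # still a usable magnitude after 100 steps

# Vanilla RNN path: product of w * tanh'(a) factors, often around 0.5
rnn_factors = np.full(T, 0.5)
rnn_grad = np.prod(rnn_factors)     # vanishes to ~1e-30

print(lstm_grad, rnn_grad)
```

With gates at 0.98 the LSTM gradient is about 0.13 after 100 steps, while the vanilla RNN gradient is around 10⁻³⁰ — numerically zero.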
GRU: A Simpler Alternative
Cho et al. (2014) — fewer gates, similar performance
Simplifying the Gates
The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, merges the forget and input gates into a single update gate and combines the cell state and hidden state into one. It has only two gates (update and reset) instead of three, making it ~25% faster to train with fewer parameters. Empirical studies (Chung et al., 2014) showed GRUs perform comparably to LSTMs on most tasks, with neither consistently dominating.
GRU Equations
// GRU at time step t
zₜ = σ(W_z · [hₜ₋₁, xₜ])          // update gate
rₜ = σ(W_r · [hₜ₋₁, xₜ])          // reset gate
h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ])    // candidate
hₜ = (1-zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ      // interpolate

// Update gate zₜ: how much to update
// Reset gate rₜ: how much past to forget
// No separate cell state — just hₜ
// LSTM: 4 weight matrices per layer
// GRU: 3 weight matrices per layer
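As with the LSTM above, a single GRU step can be sketched in a few lines of NumPy (names are ours; biases are omitted to match the equations as written):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step with the three weight matrices W_z, W_r, W."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)                  # update gate
    r_t = sigmoid(W_r @ z_in)                  # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde  # interpolate old and new

# Smoke test: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W_z = rng.normal(size=(n_h, n_h + n_in)) * 0.1
W_r = rng.normal(size=(n_h, n_h + n_in)) * 0.1
W   = rng.normal(size=(n_h, n_h + n_in)) * 0.1
h = np.zeros(n_h)
h = gru_step(rng.normal(size=n_in), h, W_z, W_r, W)
```

Compare this with the LSTM: one fewer gate, no separate cell state, and the interpolation `(1-z)·h_prev + z·h̃` plays the role of both the forget and input gates.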
Rule of thumb: Use LSTM as the default for sequence tasks. Try GRU if you need faster training or have limited compute. For most practical purposes, the difference is negligible — both are far superior to vanilla RNNs.
Bidirectional RNNs
Looking forward and backward simultaneously
Why Bidirectional?
A standard RNN only sees past context. But in many tasks, future context matters too. In “He went to the bank to deposit money,” the word “deposit” (which comes after “bank”) disambiguates that “bank” means a financial institution. A Bidirectional RNN (BiRNN) runs two separate RNNs: one forward (left to right) and one backward (right to left). Their hidden states are concatenated at each position, giving each token access to the full sentence context.
Key insight: Bidirectional LSTMs (BiLSTMs) were the dominant architecture for NLP from ~2015 to 2018, used in named entity recognition, POS tagging, and as the backbone of ELMo (Peters et al., 2018). BERT later achieved the same bidirectional context with transformer self-attention, trained via masked language modeling.
BiLSTM in PyTorch
import torch.nn as nn

bilstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,
    bidirectional=True,  # ← key flag
    batch_first=True,
)

# Output hidden size = 256 × 2 = 512
# (forward 256 + backward 256)
# Can only be used when the full sequence
# is available (not for autoregressive gen)
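To see the concatenated hidden sizes in action, a small self-contained shape check (batch and sequence sizes here are illustrative):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(4, 10, 128)   # (batch, seq_len, input_size)
out, (h_n, c_n) = bilstm(x)

print(out.shape)   # torch.Size([4, 10, 512]) — forward 256 + backward 256
print(h_n.shape)   # torch.Size([4, 4, 256]) — num_layers × 2 directions
```

Each position in `out` holds the forward state (having read the prefix) concatenated with the backward state (having read the suffix), so every token sees the full sentence.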
Seq2Seq: Encoder-Decoder
Sutskever et al. (2014) — the architecture that enabled neural machine translation
The Encoder-Decoder Framework
In 2014, Sutskever, Vinyals, and Le at Google introduced the sequence-to-sequence (seq2seq) model. An encoder LSTM reads the entire input sequence and compresses it into a fixed-size vector (the final hidden state). A decoder LSTM then generates the output sequence one token at a time, conditioned on this vector. This framework enabled the first competitive neural machine translation systems, translating English to French with a single end-to-end model.
Seq2Seq Architecture
// Seq2Seq for translation

Encoder: "I love cats"
  → LSTM → LSTM → LSTM
               ↓
        [context vector]
               ↓
Decoder: [context] → LSTM → "J'aime"
                     LSTM → "les"
                     LSTM → "chats"
                     LSTM → <EOS>

// Teacher forcing during training:
// Feed correct previous token (not predicted)
// Speeds convergence but creates train/test gap
Key insight: The context vector bottleneck is seq2seq’s Achilles’ heel — the entire input must be compressed into a single vector. For long sentences, information is lost. This limitation directly motivated the attention mechanism (Bahdanau et al., 2015), covered in Chapter 11.
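A toy encoder-decoder with teacher forcing can be sketched as follows (module names, vocabulary sizes, and dimensions are illustrative, not from the original systems):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder compresses the whole source into its final (h, c):
        # this pair IS the fixed-size context vector
        _, (h, c) = self.encoder(self.src_emb(src))
        # Teacher forcing: feed the *correct* previous target tokens
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)   # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (8, 12))   # 8 source sentences, 12 tokens
tgt = torch.randint(0, 1200, (8, 9))    # shifted target tokens
logits = model(src, tgt)
print(logits.shape)   # torch.Size([8, 9, 1200])
```

At inference time the decoder instead feeds back its own predicted token at each step, which is where the train/test gap from teacher forcing shows up.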
Practical LSTM Tips
Training tricks that make LSTMs work in practice
Training Best Practices
1. Gradient clipping: Clip the gradient norm to 1.0–5.0 to prevent exploding gradients.
2. Forget gate bias: Initialize the forget gate bias to 1.0 (Jozefowicz et al., 2015) so the LSTM starts by remembering everything.
3. Dropout: Apply dropout between layers (not within recurrent connections). Variational dropout (Gal & Ghahramani, 2016) uses the same mask across time steps.
4. Layer normalization: Normalize within each time step for more stable training.
5. Packing sequences: Use packed sequences for variable-length batches to avoid wasting compute on padding.
PyTorch LSTM Training
import torch
from torch.nn.utils.rnn import (
    pack_padded_sequence, pad_packed_sequence
)

# Gradient clipping
torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0
)

# Packed sequences for variable lengths
packed = pack_padded_sequence(
    embeddings, lengths,
    batch_first=True, enforce_sorted=False
)
output, (h_n, c_n) = lstm(packed)
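The forget-gate bias trick (tip 2) can be applied to an `nn.LSTM` like this. PyTorch stores each layer's gate biases concatenated as [input | forget | cell | output], one chunk of `hidden_size` each; note that `bias_ih_*` and `bias_hh_*` are summed at runtime, so filling both chunks with 1.0 gives an effective forget bias of 2.0 (still in the "remember by default" regime):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.shape[0] // 4          # size of one gate's bias chunk
        with torch.no_grad():
            param[n:2 * n].fill_(1.0)    # forget-gate slice → 1.0
```

With this initialization the forget gates start near σ(2) ≈ 0.88, so early in training the cell state is mostly preserved rather than randomly erased.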
Rule of thumb: Always clip gradients when training RNNs/LSTMs. A max_norm of 1.0 is a safe default. Without clipping, a single bad batch can produce enormous gradients that destroy learned weights.
LSTM vs. GRU vs. Transformer
When to use which architecture
Comparison
LSTM: Best for tasks requiring precise memory control (e.g., code generation, music). 3 gates, separate cell state.
GRU: Faster, fewer parameters, comparable accuracy. Good default when compute is limited.
Transformer: Parallelizable (no sequential dependency), handles long-range dependencies via attention, dominates NLP since 2018.
LSTMs are still used for streaming/real-time applications where you process one token at a time and can’t see the future.
LSTM/GRU Weakness
Sequential processing — can’t parallelize across time steps. Training is slow for long sequences. O(T) computation, no shortcut.
Transformer Advantage
Fully parallel — all positions computed simultaneously. Attention provides direct connections between any two positions. O(1) path length.
Quick Reference
// Architecture comparison
//             LSTM    GRU     Transformer
// Gates:      3       2       0 (attention)
// Parallel:   No      No      Yes
// Memory:     O(1)    O(1)    O(T²)
// Speed:      Slow    Medium  Fast (GPU)
// Long-dep:   Good    Good    Excellent

// Use LSTM/GRU: streaming, real-time, small data
// Use Transformer: NLP, large data, GPU available
The Legacy of Recurrence
How LSTMs paved the way for transformers
What LSTMs Achieved
From 1997 to 2017, LSTMs powered the state of the art in machine translation (Google Translate, 2016), speech recognition (Siri, Alexa), text generation, sentiment analysis, and time series forecasting. The seq2seq framework with attention became the dominant paradigm for any sequence-to-sequence task. LSTMs proved that neural networks could handle variable-length sequential data — a capability that seemed impossible with feedforward networks.
The connection: The attention mechanism (Chapter 11) was invented as an add-on to seq2seq LSTMs. Vaswani et al. (2017) then asked: what if we use only attention, without any recurrence? The answer was the Transformer — and LSTMs became largely obsolete for NLP. But the concepts of gating, cell states, and sequence modeling live on in every modern architecture.
LSTM Applications That Changed the World
// LSTM milestones
2014: Seq2seq machine translation
2015: Google voice search (LSTM-based)
2016: Google Neural Machine Translation
      (8-layer LSTM encoder-decoder)
2016: Apple Siri (LSTM for speech)
2017: Amazon Alexa (LSTM for NLU)
2018: ELMo (BiLSTM language model)

// Then: Transformers took over (2018+)
// But LSTMs remain in edge/streaming apps