Ch 7 — LSTMs, GRUs & Sequence Models

Gating mechanisms, bidirectional RNNs, seq2seq, and encoder-decoder architectures
High Level: Cell State → Gates → GRU → Bidirectional → Seq2Seq → Legacy
LSTM: The Cell State Highway
Hochreiter & Schmidhuber’s 1997 breakthrough
The Core Innovation
The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997, solves the vanishing gradient problem with a brilliant idea: a cell state (cₜ) that runs through time like a conveyor belt. Information flows along this belt with only minor linear interactions (addition and element-wise multiplication). Three gates — forget, input, and output — control what information is removed, added, or read from the cell state. Because the cell state path is mostly linear, gradients can flow across hundreds or thousands of time steps without vanishing.
LSTM Equations
// LSTM at time step t
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)    // forget gate
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)    // input gate
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ          // cell update
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)    // output gate
hₜ = oₜ ⊙ tanh(cₜ)                // hidden state

// σ = sigmoid (0 to 1) → gate values
// ⊙ = element-wise multiplication
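To make the update concrete, here is a minimal NumPy sketch of a single LSTM step following these equations (function and variable names are ours, not from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step. Each W_* has shape (hidden, hidden + input);
    the shared input is the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate
    c_t = f_t * c_prev + i_t * c_tilde    # cell update (mostly linear!)
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t

# Tiny smoke test: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Ws = [rng.normal(size=(n_h, n_h + n_in)) * 0.1 for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, *Ws, *bs)
```

Note how the cell update `c_t = f_t * c_prev + i_t * c_tilde` is the only line touching `c_prev` — an element-wise scale and add, with no matrix multiply or squashing nonlinearity on that path.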
The Three Gates
Forget, input, and output — controlling information flow
Forget Gate
The forget gate fₜ decides what to erase from the cell state. It outputs a value between 0 (completely forget) and 1 (completely keep) for each dimension. When processing “The cat sat. The dog ran,” the forget gate might erase “cat” information when it encounters the new subject “dog.”
Input Gate
The input gate iₜ decides what new information to store. It works with a candidate value c̃ₜ (what could be added) and scales it by how much to actually add. This two-step process lets the LSTM be selective about what enters long-term memory.
Output Gate
The output gate oₜ decides what to expose from the cell state as the hidden state hₜ. The cell state might contain information about gender, number, and tense, but only the relevant parts are output at each step.
Intuitive Example
// Processing: "The cat, which was very
// fluffy and loved to nap, was sleeping."

Step "cat":
  input gate:  store "subject=cat, singular"
  forget gate: keep everything

Step "which...nap":
  input gate:  store clause details
  forget gate: KEEP "subject=cat" ← key!

Step "was":
  output gate: read "singular" → predict

// "was" (not "were") because cat is singular
// Cell state preserved subject across 10+ words
Key insight: The cell state acts like a “highway” for information. The forget gate’s gradient is just fₜ (a sigmoid output near 1 for important information), so gradients flow almost unchanged across time — solving the vanishing gradient problem.
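A quick numeric sketch of this gradient argument (the per-step factors below are illustrative assumptions, not measurements): across T steps, the cell-state path multiplies forget-gate values near 1, while a vanilla RNN path multiplies factors of tanh derivatives and weights that are typically well below 1.

```python
import numpy as np

T = 100

# LSTM cell-state path: dc_T/dc_0 is the product of forget-gate values,
# which the network can hold near 1 for information it wants to keep
forget_gates = np.full(T, 0.98)
lstm_grad = np.prod(forget_gates)   # still a usable magnitude after 100 steps

# Vanilla RNN path: product of w * tanh'(a) factors, often around 0.5
rnn_factors = np.full(T, 0.5)
rnn_grad = np.prod(rnn_factors)     # vanishes to ~1e-30

print(lstm_grad, rnn_grad)
```

With gates at 0.98 the LSTM gradient is about 0.13 after 100 steps, while the vanilla RNN gradient is around 10⁻³⁰ — numerically zero.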
GRU: A Simpler Alternative
Cho et al. (2014) — fewer gates, similar performance
Simplifying the Gates
The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, merges the forget and input gates into a single update gate and combines the cell state and hidden state into one. It has only two gates (update and reset) instead of three, making it ~25% faster to train with fewer parameters. Empirical studies (Chung et al., 2014) showed GRUs perform comparably to LSTMs on most tasks, with neither consistently dominating.
GRU Equations
// GRU at time step t
zₜ = σ(W_z · [hₜ₋₁, xₜ])          // update gate
rₜ = σ(W_r · [hₜ₋₁, xₜ])          // reset gate
h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ])    // candidate
hₜ = (1-zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ      // interpolate

// Update gate zₜ: how much to update
// Reset gate rₜ: how much past to forget
// No separate cell state — just hₜ
// LSTM: 4 weight matrices per layer
// GRU: 3 weight matrices per layer
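As with the LSTM above, a single GRU step can be sketched in a few lines of NumPy (names are ours; biases are omitted to match the equations as written):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step with the three weight matrices W_z, W_r, W."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)                  # update gate
    r_t = sigmoid(W_r @ z_in)                  # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde  # interpolate old and new

# Smoke test: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W_z = rng.normal(size=(n_h, n_h + n_in)) * 0.1
W_r = rng.normal(size=(n_h, n_h + n_in)) * 0.1
W   = rng.normal(size=(n_h, n_h + n_in)) * 0.1
h = np.zeros(n_h)
h = gru_step(rng.normal(size=n_in), h, W_z, W_r, W)
```

Compare this with the LSTM: one fewer gate, no separate cell state, and the interpolation `(1-z)·h_prev + z·h̃` plays the role of both the forget and input gates.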
Rule of thumb: Use LSTM as the default for sequence tasks. Try GRU if you need faster training or have limited compute. For most practical purposes, the difference is negligible — both are far superior to vanilla RNNs.
Bidirectional RNNs
Looking forward and backward simultaneously
Why Bidirectional?
A standard RNN only sees past context. But in many tasks, future context matters too. In “He went to the bank to deposit money,” the word “deposit” (which comes after “bank”) disambiguates that “bank” means a financial institution. A Bidirectional RNN (BiRNN) runs two separate RNNs: one forward (left to right) and one backward (right to left). Their hidden states are concatenated at each position, giving each token access to the full sentence context.
Key insight: Bidirectional LSTMs (BiLSTMs) were the dominant architecture for NLP from ~2015 to 2018, used in named entity recognition, POS tagging, and as the backbone of ELMo (Peters et al., 2018). BERT later achieved the same bidirectional context with transformer self-attention, trained via masked language modeling.
BiLSTM in PyTorch
import torch.nn as nn

bilstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=2,
    bidirectional=True,  # ← key flag
    batch_first=True,
)

# Output hidden size = 256 × 2 = 512
# (forward 256 + backward 256)
# Can only be used when the full sequence
# is available (not for autoregressive gen)
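To see the concatenated hidden sizes in action, a small self-contained shape check (batch and sequence sizes here are illustrative):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(4, 10, 128)   # (batch, seq_len, input_size)
out, (h_n, c_n) = bilstm(x)

print(out.shape)   # torch.Size([4, 10, 512]) — forward 256 + backward 256
print(h_n.shape)   # torch.Size([4, 4, 256]) — num_layers × 2 directions
```

Each position in `out` holds the forward state (having read the prefix) concatenated with the backward state (having read the suffix), so every token sees the full sentence.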
Seq2Seq: Encoder-Decoder
Sutskever et al. (2014) — the architecture that enabled neural machine translation
The Encoder-Decoder Framework
In 2014, Sutskever, Vinyals, and Le at Google introduced the sequence-to-sequence (seq2seq) model. An encoder LSTM reads the entire input sequence and compresses it into a fixed-size vector (the final hidden state). A decoder LSTM then generates the output sequence one token at a time, conditioned on this vector. This framework enabled the first competitive neural machine translation systems, translating English to French with a single end-to-end model.
Seq2Seq Architecture
// Seq2Seq for translation

Encoder: "I love cats"
  → LSTM → LSTM → LSTM
               ↓
        [context vector]
               ↓
Decoder: [context] → LSTM → "J'aime"
                     LSTM → "les"
                     LSTM → "chats"
                     LSTM → <EOS>

// Teacher forcing during training:
// Feed correct previous token (not predicted)
// Speeds convergence but creates train/test gap
Key insight: The context vector bottleneck is seq2seq’s Achilles’ heel — the entire input must be compressed into a single vector. For long sentences, information is lost. This limitation directly motivated the attention mechanism (Bahdanau et al., 2015), covered in Chapter 11.
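A toy encoder-decoder with teacher forcing can be sketched as follows (module names, vocabulary sizes, and dimensions are illustrative, not from the original systems):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder compresses the whole source into its final (h, c):
        # this pair IS the fixed-size context vector
        _, (h, c) = self.encoder(self.src_emb(src))
        # Teacher forcing: feed the *correct* previous target tokens
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)   # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (8, 12))   # 8 source sentences, 12 tokens
tgt = torch.randint(0, 1200, (8, 9))    # shifted target tokens
logits = model(src, tgt)
print(logits.shape)   # torch.Size([8, 9, 1200])
```

At inference time the decoder instead feeds back its own predicted token at each step, which is where the train/test gap from teacher forcing shows up.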
Practical LSTM Tips
Training tricks that make LSTMs work in practice
Training Best Practices
1. Gradient clipping: Clip the gradient norm to 1.0–5.0 to prevent exploding gradients.
2. Forget gate bias: Initialize the forget gate bias to 1.0 (Jozefowicz et al., 2015) so the LSTM starts by remembering everything.
3. Dropout: Apply dropout between layers (not within recurrent connections). Variational dropout (Gal & Ghahramani, 2016) uses the same mask across time steps.
4. Layer normalization: Normalize within each time step for more stable training.
5. Packing sequences: Use packed sequences for variable-length batches to avoid wasting compute on padding.
PyTorch LSTM Training
import torch
from torch.nn.utils.rnn import (
    pack_padded_sequence, pad_packed_sequence
)

# Gradient clipping
torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0
)

# Packed sequences for variable lengths
packed = pack_padded_sequence(
    embeddings, lengths,
    batch_first=True, enforce_sorted=False
)
output, (h_n, c_n) = lstm(packed)
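The forget-gate bias trick (tip 2) can be applied to an `nn.LSTM` like this. PyTorch stores each layer's gate biases concatenated as [input | forget | cell | output], one chunk of `hidden_size` each; note that `bias_ih_*` and `bias_hh_*` are summed at runtime, so filling both chunks with 1.0 gives an effective forget bias of 2.0 (still in the "remember by default" regime):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.shape[0] // 4          # size of one gate's bias chunk
        with torch.no_grad():
            param[n:2 * n].fill_(1.0)    # forget-gate slice → 1.0
```

With this initialization the forget gates start near σ(2) ≈ 0.88, so early in training the cell state is mostly preserved rather than randomly erased.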
Rule of thumb: Always clip gradients when training RNNs/LSTMs. A max_norm of 1.0 is a safe default. Without clipping, a single bad batch can produce enormous gradients that destroy learned weights.
LSTM vs. GRU vs. Transformer
When to use which architecture
Comparison
LSTM: Best for tasks requiring precise memory control (e.g., code generation, music). 3 gates, separate cell state.
GRU: Faster, fewer parameters, comparable accuracy. Good default when compute is limited.
Transformer: Parallelizable (no sequential dependency), handles long-range dependencies via attention, dominates NLP since 2018.
LSTMs are still used for streaming/real-time applications where you process one token at a time and can’t see the future.
LSTM/GRU Weakness
Sequential processing — can’t parallelize across time steps. Training is slow for long sequences. O(T) computation, no shortcut.
Transformer Advantage
Fully parallel — all positions computed simultaneously. Attention provides direct connections between any two positions. O(1) path length.
Quick Reference
// Architecture comparison
//             LSTM    GRU     Transformer
// Gates:      3       2       0 (attention)
// Parallel:   No      No      Yes
// Memory:     O(1)    O(1)    O(T²)
// Speed:      Slow    Medium  Fast (GPU)
// Long-dep:   Good    Good    Excellent

// Use LSTM/GRU: streaming, real-time, small data
// Use Transformer: NLP, large data, GPU available
The Legacy of Recurrence
How LSTMs paved the way for transformers
What LSTMs Achieved
From 1997 to 2017, LSTMs powered the state of the art in machine translation (Google Translate, 2016), speech recognition (Siri, Alexa), text generation, sentiment analysis, and time series forecasting. The seq2seq framework with attention became the dominant paradigm for any sequence-to-sequence task. LSTMs proved that neural networks could handle variable-length sequential data — a capability that seemed impossible with feedforward networks.
The connection: The attention mechanism (Chapter 11) was invented as an add-on to seq2seq LSTMs. Vaswani et al. (2017) then asked: what if we use only attention, without any recurrence? The answer was the Transformer — and LSTMs became largely obsolete for NLP. But the concepts of gating, cell states, and sequence modeling live on in every modern architecture.
LSTM Applications That Changed the World
// LSTM milestones
2014: Seq2seq machine translation
2015: Google voice search (LSTM-based)
2016: Google Neural Machine Translation
      (8-layer LSTM encoder-decoder)
2016: Apple Siri (LSTM for speech)
2017: Amazon Alexa (LSTM for NLU)
2018: ELMo (BiLSTM language model)

// Then: Transformers took over (2018+)
// But LSTMs remain in edge/streaming apps