Ch 6 — Recurrent Neural Networks

Sequence modeling, hidden state, and the vanishing gradient problem
High Level
Sequences → Recurrence → Hidden State → BPTT → Vanishing → Tasks
Why Sequences Need Special Networks
The limitation of feedforward networks for ordered data
The Problem with Fixed Inputs
Language, audio, stock prices, and DNA are all sequences — data where order matters. “The dog bit the man” and “The man bit the dog” have the same words but different meanings. Feedforward networks (MLPs, CNNs) take fixed-size inputs and have no concept of time or order. They can’t process a sentence of 5 words and a sentence of 50 words with the same architecture. We need networks with memory — networks that can process one element at a time while remembering what came before.
Sequence Tasks
// Types of sequence problems
One-to-many:  image → caption
Many-to-one:  sentence → sentiment
Many-to-many: English → French
Many-to-many: video → frame labels

// Variable-length inputs AND outputs
// Order matters: "not good" ≠ "good not"
// Context can be arbitrarily far back
Key insight: The key requirement is a network that can process inputs one at a time, maintain a running summary (hidden state) of everything seen so far, and produce outputs at any time step. This is exactly what RNNs do.
The Recurrent Neuron
A network with a loop
How Recurrence Works
An RNN processes a sequence one element at a time. At each time step t, it takes the current input xₜ and the previous hidden state hₜ₋₁, and produces a new hidden state hₜ. The hidden state is the network’s “memory” — a compressed summary of everything it has seen so far. The same weights W are shared across all time steps (weight sharing in time, just as CNNs share weights in space). This means an RNN can process sequences of any length with a fixed number of parameters.
The RNN Equations
// Vanilla RNN at time step t
hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ + b_h)
yₜ = W_hy · hₜ + b_y

// W_hh: hidden-to-hidden weights
// W_xh: input-to-hidden weights
// W_hy: hidden-to-output weights
// hₜ: hidden state (memory)
// Same W used at EVERY time step

// Unrolled view:
x₁ → [RNN] → h₁ → [RNN] → h₂ → [RNN] → h₃
              ↓            ↓            ↓
              y₁           y₂           y₃
Key insight: When “unrolled” across time, an RNN looks like a very deep feedforward network with shared weights. A 100-word sentence creates a 100-layer-deep network. This is why vanishing gradients are such a severe problem for RNNs.
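The equations above can be sketched directly in NumPy. This is a minimal forward pass, not a library implementation: the dimensions and random initialization are illustrative, and the same three weight matrices are reused at every time step, so the same function handles sequences of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3

# Shared weights, used at EVERY time step
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """xs: (T, input_dim). Returns hidden states and outputs for each step."""
    h = np.zeros(hidden_dim)                      # h₀ = 0
    hs, ys = [], []
    for x in xs:                                  # one element at a time
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)    # new hidden state hₜ
        hs.append(h)
        ys.append(W_hy @ h + b_y)                 # per-step output yₜ
    return np.array(hs), np.array(ys)

hs, ys = rnn_forward(rng.normal(size=(10, input_dim)))   # a length-10 sequence
```

The fixed parameter count is visible here: nothing in the weights depends on the sequence length, only the loop runs longer.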
Hidden State as Memory
What the hidden state encodes
The Hidden State Vector
The hidden state hₜ is a fixed-size vector (e.g., 256 or 512 dimensions) that must encode everything relevant about the sequence seen so far. After processing “The cat sat on the,” the hidden state should encode enough information to predict “mat” as a likely next word. This is a lossy compression — the hidden state is a bottleneck. Long sequences force the network to overwrite old information with new, which is why vanilla RNNs struggle with long-range dependencies.
PyTorch RNN
import torch.nn as nn

rnn = nn.RNN(
    input_size=100,    # embedding dim
    hidden_size=256,   # hidden state dim
    num_layers=2,      # stacked RNN layers
    batch_first=True,
)

# Input:  (batch, seq_len, input_size)
# Output: (batch, seq_len, hidden_size)
# h_n: final hidden state
output, h_n = rnn(x, h_0)
Key insight: The hidden state is the RNN’s “working memory.” Its fixed size means the network must learn what to remember and what to forget — a task that vanilla RNNs handle poorly, motivating the gating mechanisms in LSTMs and GRUs.
Backpropagation Through Time (BPTT)
Training RNNs by unrolling
How BPTT Works
To train an RNN, we unroll it across time steps, creating a deep feedforward graph. The loss is computed at each time step (or just the final step), and gradients flow backward through all time steps via the chain rule. This is Backpropagation Through Time (BPTT). For a sequence of length T, BPTT creates a T-layer-deep computation graph. The gradient of the loss with respect to early weights involves multiplying T Jacobian matrices — one per time step.
BPTT Gradient
// Gradient flows backward through time
∂L/∂W = Σₜ ∂Lₜ/∂W

// For each Lₜ, chain rule through time:
∂Lₜ/∂h₁ = ∂Lₜ/∂hₜ · ∂hₜ/∂hₜ₋₁ · ... · ∂h₂/∂h₁

// Each ∂hₖ/∂hₖ₋₁ involves W_hh
// Product of T matrices → vanish or explode

// Truncated BPTT: only backprop K steps
// Trades long-range learning for stability
Rule of thumb: Truncated BPTT (backpropagating through only K time steps instead of the full sequence) is a common practical compromise. It limits memory usage and gradient instability but prevents learning dependencies longer than K steps.
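The trade-off can be made concrete with a toy model: a scalar linear RNN hₜ = w·hₜ₋₁ + xₜ with loss L = h_T. Unrolling the chain rule gives ∂L/∂w = Σₜ w^(T−t)·hₜ₋₁, and truncated BPTT simply drops all but the last K terms of that sum. This is a hand-derived sketch for the toy model, not the document's vector equations verbatim:

```python
def bptt_grad(xs, w, K=None):
    """Gradient of L = h_T w.r.t. w for the scalar RNN h_t = w*h_{t-1} + x_t.
    If K is given, backpropagate through only the last K time steps."""
    hs = [0.0]                       # forward pass, storing h_0 .. h_T
    for x in xs:
        hs.append(w * hs[-1] + x)
    T = len(xs)
    start = 1 if K is None else max(1, T - K + 1)
    # each unrolled step t contributes w^(T-t) * h_{t-1}
    return sum(w ** (T - t) * hs[t - 1] for t in range(start, T + 1))

xs = [1.0] * 50
full = bptt_grad(xs, w=0.5)          # full BPTT through all 50 steps
trunc = bptt_grad(xs, w=0.5, K=5)    # truncated BPTT, last 5 steps only
```

With |w| < 1 the truncated gradient is close to the full one, because the dropped early terms had decayed anyway; the flip side is that any dependency older than K steps contributes exactly nothing, so it can never be learned.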
The Vanishing Gradient Problem in RNNs
Why vanilla RNNs can’t learn long-range dependencies
Hochreiter's Analysis (1991)
Sepp Hochreiter formally analyzed the vanishing gradient problem in his 1991 diploma thesis. In a vanilla RNN, the gradient at time step 1 from a loss at time step T involves multiplying the same weight matrix W_hh (and tanh derivatives) T times. If the largest eigenvalue of W_hh is less than 1, gradients shrink exponentially. If greater than 1, they explode. For a 100-step sequence, even eigenvalues of 0.9 produce gradients of 0.9¹⁰⁰ ≈ 2.7 × 10⁻⁵. The network effectively “forgets” early inputs.
Critical in AI: This means a vanilla RNN processing “The cat, which sat on the mat near the window overlooking the garden, was sleeping” cannot learn that “was” depends on “cat” (singular) 15 words earlier. This limitation motivated LSTMs and GRUs.
The Math
// Gradient through T time steps
∂h_T/∂h₁ = Πₖ₌₁ᵀ⁻¹ ∂hₖ₊₁/∂hₖ
         = Πₖ₌₁ᵀ⁻¹ diag(tanh'(zₖ)) · W_hh

// tanh derivative max = 1 (at z = 0)
// If ||W_hh|| < 1: exponential decay
// If ||W_hh|| > 1: exponential growth

// Gradient clipping helps with exploding:
if ||g|| > threshold:
    g = threshold · g / ||g||

// But nothing helps vanishing in a vanilla RNN
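Both the exponential decay and the clipping rule fit in a few lines of plain Python. A minimal sketch (the threshold and gradient values are illustrative):

```python
import math

# A per-step gradient factor of 0.9, compounded over 100 steps,
# leaves almost nothing of the signal from the earliest inputs:
decay = 0.9 ** 100            # ≈ 2.7e-5

def clip_gradient(g, threshold):
    """Rescale gradient vector g so its L2 norm is at most threshold.
    This tames exploding gradients; it does nothing for vanishing ones."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm > threshold:
        g = [threshold * v / norm for v in g]
    return g

clipped = clip_gradient([30.0, 40.0], threshold=5.0)   # norm 50 → norm 5
```

Note the asymmetry the slide points out: clipping caps a large gradient in one step, but no post-hoc rescaling can recover information from a gradient that has already shrunk to ~10⁻⁵ of its size.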
RNN Architectures & Tasks
Different configurations for different problems
Architecture Variants
Many-to-one: process the entire sequence, use the final hidden state for classification (sentiment analysis).
One-to-many: a single input generates a sequence (image captioning).
Many-to-many (aligned): output at each step (POS tagging, named entity recognition).
Many-to-many (unaligned): an encoder processes the input, a decoder generates the output (machine translation).
Stacked RNNs: multiple RNN layers where each layer’s output becomes the next layer’s input, learning increasingly abstract representations.
Stacked RNN
// Stacked (deep) RNN
Layer 1: h¹ₜ = tanh(W¹ · [h¹ₜ₋₁, xₜ])
Layer 2: h²ₜ = tanh(W² · [h²ₜ₋₁, h¹ₜ])
Layer 3: h³ₜ = tanh(W³ · [h³ₜ₋₁, h²ₜ])
Output:  yₜ = W_y · h³ₜ

// 2-3 layers is typical
// More layers = harder to train
// (vanishing gradients in depth AND time)
Key insight: Stacking RNN layers adds depth at each time step, while the recurrent connections add depth across time. This creates a 2D gradient flow problem — gradients must survive both vertical (layer) and horizontal (time) paths.
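The stacked forward pass can be sketched in NumPy. This is a minimal illustration with made-up dimensions: each layer keeps its own hidden state, each weight matrix acts on the concatenation [hₜ₋₁, input] as in the equations above, and the layer below feeds the layer above at every time step.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, num_layers = 4, 8, 3

# One weight matrix per layer; layer 1 reads x, layers 2+ read the layer below
in_dims = [input_dim] + [hidden_dim] * (num_layers - 1)
Ws = [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + d)) for d in in_dims]

def stacked_rnn_forward(xs):
    """xs: (T, input_dim). Returns top-layer hidden states, shape (T, hidden_dim)."""
    hs = [np.zeros(hidden_dim) for _ in range(num_layers)]   # per-layer state
    tops = []
    for x in xs:
        inp = x
        for l in range(num_layers):
            # hˡₜ = tanh(Wˡ · [hˡₜ₋₁, input from below])
            hs[l] = np.tanh(Ws[l] @ np.concatenate([hs[l], inp]))
            inp = hs[l]                                      # feed upward
        tops.append(hs[-1])
    return np.array(tops)

tops = stacked_rnn_forward(rng.normal(size=(6, input_dim)))
```

The two gradient paths are visible in the nesting: the outer loop is the horizontal (time) direction, the inner loop the vertical (layer) direction.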
Word Embeddings & Input Representation
How text becomes numbers for RNNs
From Words to Vectors
RNNs need numerical inputs. One-hot encoding represents each word as a sparse vector (vocabulary-sized, all zeros except one 1). This is wasteful and captures no semantic similarity. Word embeddings (Word2Vec, GloVe) map words to dense, low-dimensional vectors (50–300 dims) where similar words are nearby: “king” and “queen” are close, “king” and “pizza” are far. An embedding layer (a learnable lookup table) is typically the first layer of any NLP model.
Embedding Layer
# PyTorch embedding layer
vocab_size = 10000
embed_dim = 128
embed = nn.Embedding(vocab_size, embed_dim)

# Input: token IDs [4, 52, 891, 7]
# Output: (4, 128) tensor of embeddings

# Famous result (Mikolov et al., 2013):
# vec("king") - vec("man") + vec("woman") ≈ vec("queen")
Why it matters: Word embeddings were one of the first demonstrations that neural networks learn meaningful representations of language. The embedding layer is now universal — it appears in every NLP model from RNNs to GPT-4.
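Under the hood, an embedding layer is just a matrix whose row i is the vector for token ID i; "lookup" is row indexing, and similarity is usually measured with cosine similarity. A minimal NumPy sketch (the table here is random, standing in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 128
E = rng.normal(size=(vocab_size, embed_dim))   # the (learnable) lookup table

token_ids = np.array([4, 52, 891, 7])
embedded = E[token_ids]                        # row lookup → shape (4, 128)

def cosine(u, v):
    """Cosine similarity, the usual way to compare embedding vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

During training, gradients flow only into the rows that were actually looked up, which is what makes the table learnable word by word.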
Summary & What’s Next
The promise and limitation of vanilla RNNs
What We Learned
RNNs introduced recurrence — the ability to process sequences of any length by maintaining a hidden state that acts as memory. They share weights across time steps, making them parameter-efficient. But vanilla RNNs have a fatal flaw: the vanishing gradient problem prevents them from learning long-range dependencies. A 100-step sequence creates a 100-layer-deep gradient path where information decays exponentially.
The connection: The vanishing gradient problem in RNNs is the same problem that plagued deep feedforward networks before ReLU and skip connections. The solution for RNNs was analogous: gating mechanisms. The next chapter covers LSTMs and GRUs — architectures with “gates” that control information flow, enabling learning across thousands of time steps.
RNN Timeline
// RNN milestones
1986: Rumelhart — backpropagation
1990: Elman — simple recurrent networks
1991: Hochreiter — vanishing gradient analysis
1997: Hochreiter & Schmidhuber — LSTM
2013: Mikolov — Word2Vec embeddings
2014: Cho — GRU (simplified LSTM)
2014: Sutskever — seq2seq for translation
2015: Bahdanau — attention mechanism
2017: Vaswani — Transformer (RNNs obsolete)