Ch 8 — RNNs & Sequences

Processing sequential data — from vanilla RNNs to LSTMs and the road to transformers
High Level
Sequence → RNN → Vanishing → LSTM → GRU → Seq2Seq
Why Sequences Need Special Networks
Text, speech, time series — data where order matters
The Problem
MLPs and CNNs process fixed-size inputs. But many real-world problems involve sequences of variable length where the order of elements matters. “The dog bit the man” means something very different from “The man bit the dog.” We need networks that understand temporal order and context.
# Sequential data examples
Text:        "I grew up in France ... I speak ___"
Speech:      audio waveform over time
Music:       sequence of notes and chords
Time series: stock prices, sensor readings
Video:       sequence of image frames
DNA:         sequence of nucleotides (A, T, G, C)
Code:        sequence of tokens
Sequence Task Types
One-to-One:   image → label (not a sequence)
One-to-Many:  image → caption (words)
Many-to-One:  review → sentiment (pos/neg)
Many-to-Many: English → French (translation)
Many-to-Many: video → per-frame labels
The key requirement: The network must maintain a memory of what it has seen so far. When processing the word “speak” in “I grew up in France ... I speak ___”, it needs to remember “France” from many words ago to predict “French.”
The Vanilla RNN
A network with a loop — hidden state carries memory forward
How It Works
At each time step, the RNN takes two inputs: the current input (xₜ) and the previous hidden state (hₜ₋₁). It produces a new hidden state that encodes information about the entire sequence seen so far. The same weights are shared across all time steps — the network is “unrolled” through time.
# RNN computation at each time step
hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
yₜ = Wᵧ · hₜ + bᵧ
# hₜ = hidden state (memory)
# xₜ = input at time t
# yₜ = output at time t
# Wₕ, Wₓ, Wᵧ = shared weights
Unrolled View
# Processing "the cat sat"
t=1: x="the" + h₀ → h₁
t=2: x="cat" + h₁ → h₂
t=3: x="sat" + h₂ → h₃
# h₃ encodes the entire sequence
# Same Wₕ, Wₓ used at every step
# This is weight sharing through time
The hidden state is the memory. It’s a fixed-size vector (e.g., 256 or 512 dimensions) that must compress everything the network has seen. This is both the RNN’s strength (compact memory) and its weakness (limited capacity, information bottleneck).
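The recurrence above can be sketched in a few lines of NumPy. This is a minimal, untrained example — the sizes (4-dimensional inputs, 8-dimensional hidden state) and the random weights are illustrative, not part of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                       # input and hidden sizes (illustrative)
W_h = rng.normal(0, 0.1, (d_h, d_h))   # hidden-to-hidden weights
W_x = rng.normal(0, 0.1, (d_h, d_in))  # input-to-hidden weights
b = np.zeros(d_h)

def rnn_step(h_prev, x):
    """One step: h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Unroll over a 3-step sequence; the SAME weights are reused every step.
h = np.zeros(d_h)                      # h_0
for x in rng.normal(size=(3, d_in)):   # stand-ins for "the", "cat", "sat"
    h = rnn_step(h, x)                 # h now encodes the whole sequence
```

Note that the loop carries only `h` forward — a fixed-size vector is all the memory the network has, regardless of sequence length.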
The Vanishing Gradient Problem
Why vanilla RNNs can’t learn long-range dependencies
The Problem
During backpropagation through time (BPTT), gradients flow backward from the loss to early time steps. At each step, the gradient is multiplied by the weight matrix Wₕ and the tanh derivative. After many steps, these repeated multiplications cause gradients to either vanish (shrink to ~0) or explode (grow to infinity).
# Gradient after T time steps
∂L/∂h₀ = ∂L/∂hₜ × (Wₕ × tanh′)ᵀ
# If largest eigenvalue of Wₕ < 1:
#   gradient vanishes exponentially
#   0.9¹⁰⁰ = 0.0000265 ≈ 0
# If largest eigenvalue of Wₕ > 1:
#   gradient explodes exponentially
#   1.1¹⁰⁰ = 13,780 → NaN
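The exponential decay/growth is easy to verify numerically. This toy sketch uses diagonal 2×2 Jacobians (so the eigenvalue is explicit) and repeatedly applies them to a stand-in gradient, exactly as BPTT does:

```python
import numpy as np

def grad_norm_after(T, eig):
    """Norm of a gradient after backpropagating through T steps
    whose recurrent Jacobian has largest eigenvalue `eig`."""
    W = eig * np.eye(2)          # toy Jacobian with known eigenvalue
    g = np.ones(2)               # gradient arriving at the last step
    for _ in range(T):
        g = W @ g                # one step of backprop through time
    return np.linalg.norm(g)

print(grad_norm_after(100, 0.9))  # ≈ 3.8e-05: vanished
print(grad_norm_after(100, 1.1))  # ≈ 1.9e+04: exploding toward NaN
```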
Practical Impact
Vanishing: The network can’t learn dependencies beyond ~10–20 time steps. “I grew up in France ... [50 words later] ... I speak ___” — the gradient from “France” has vanished by the time it reaches the prediction.

Exploding: Training becomes unstable (loss = NaN). Fix: gradient clipping — cap gradient magnitude at a threshold (e.g., 5.0).
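Gradient clipping by global norm can be sketched in a few lines (the same idea as PyTorch's `clip_grad_norm_`; the function name and threshold here are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined
    (global) L2 norm never exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

# Exploding gradients get rescaled to global norm 5.0;
# well-behaved gradients pass through untouched.
clipped = clip_by_global_norm([np.full(3, 100.0), np.full(3, -100.0)])
```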
This is the same problem that plagued deep feedforward networks (Ch 5–6). ResNets solved it with skip connections. LSTMs solve it with a cell state — an information highway where gradients flow with minimal transformation.
LSTM: Long Short-Term Memory
Hochreiter & Schmidhuber (1997) — gates that control information flow
The Key Innovation
LSTMs add a cell state — a separate memory channel that runs parallel to the hidden state. Information flows through the cell state via simple addition and multiplication, avoiding the repeated matrix multiplications that cause vanishing gradients. Three gates control what information enters, stays, and leaves.
# LSTM gates (all sigmoid, output 0–1)
Forget gate (f): “What to erase from memory?”
  f = σ(Wᶠ · [hₜ₋₁, xₜ] + bᶠ)
Input gate (i): “What new info to store?”
  i = σ(Wᵢ · [hₜ₋₁, xₜ] + bᵢ)
Output gate (o): “What to output from memory?”
  o = σ(Wₒ · [hₜ₋₁, xₜ] + bₒ)
Cell State Update
# Cell state = long-term memory highway
Cₜ = f ⊙ Cₜ₋₁ + i ⊙ tanh(Wᶜ · [hₜ₋₁, xₜ])
#    forget old    add new
hₜ = o ⊙ tanh(Cₜ)   # output filtered memory
# ⊙ = element-wise multiplication
# Gates are 0–1: 0 = block, 1 = pass through
Why LSTMs work: The cell state update is Cₜ = f·Cₜ₋₁ + i·new. When f=1 and i=0, the cell state passes through unchanged — gradients flow perfectly. The network learns when to remember (f=1) and when to forget (f=0). This enables learning dependencies across 1000+ time steps.
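The gate equations above translate directly into code. A minimal sketch, with random untrained weights and illustrative sizes (`[h, x]` is concatenation, so each weight matrix has d_h + d_in columns):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (d_h, d_h + d_in)) for _ in range(4))
bf = bi = bo = bc = np.zeros(d_h)

def lstm_step(h_prev, C_prev, x):
    hx = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)                   # forget gate: what to erase
    i = sigmoid(Wi @ hx + bi)                   # input gate: what to store
    o = sigmoid(Wo @ hx + bo)                   # output gate: what to emit
    C = f * C_prev + i * np.tanh(Wc @ hx + bc)  # cell state update
    h = o * np.tanh(C)                          # filtered memory
    return h, C

h1, C1 = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_in))
```

Note how the only operations touching `C_prev` are an element-wise multiply and an add — no matrix multiplication, which is exactly why gradients survive the trip through many steps.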
GRU: Gated Recurrent Unit
A simpler alternative to LSTM with similar performance
Simplified Gates
Cho et al. (2014) introduced the GRU as a simpler alternative to LSTM. It merges the cell state and hidden state into one, and uses only two gates instead of three: a reset gate and an update gate. Fewer parameters, faster training, and comparable performance on most tasks.
# GRU: 2 gates instead of 3
Reset gate (r): “How much past to forget?”
  r = σ(Wᵣ · [hₜ₋₁, xₜ])
Update gate (z): “How much to update vs keep?”
  z = σ(Wᶻ · [hₜ₋₁, xₜ])
New state:
  h̃ = tanh(W · [r ⊙ hₜ₋₁, xₜ])
  hₜ = (1−z) ⊙ hₜ₋₁ + z ⊙ h̃
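For comparison with the LSTM, here is the same kind of sketch for a GRU step — untrained random weights, illustrative sizes, biases omitted as in the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wr, Wz, W = (rng.normal(0, 0.1, (d_h, d_h + d_in)) for _ in range(3))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                              # reset gate
    z = sigmoid(Wz @ hx)                              # update gate
    h_new = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_new               # interpolate old/new

h1 = gru_step(np.zeros(d_h), rng.normal(size=d_in))
```

The final line is the whole trick: the update gate `z` interpolates between keeping the old state and adopting the candidate, playing the roles of the LSTM's forget and input gates at once.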
LSTM
3 gates (forget, input, output) + cell state. More parameters. Slightly better on very long sequences. Standard for speech recognition.
GRU
2 gates (reset, update). Fewer parameters, faster training. Comparable accuracy. Better for smaller datasets. No separate cell state.
In practice: LSTM and GRU perform similarly on most tasks. LSTM is the safer default. GRU is preferred when compute is limited. Both are largely superseded by transformers (Ch 9) for most NLP tasks, but remain relevant for real-time streaming, edge devices, and time-series forecasting.
Sequence-to-Sequence & Encoder-Decoder
Translating between sequences of different lengths
The Architecture
Sutskever et al. (2014) introduced the encoder-decoder architecture for machine translation. The encoder RNN reads the input sequence and compresses it into a fixed-size context vector. The decoder RNN generates the output sequence from this context vector, one token at a time.
# Seq2Seq: English → French
Encoder: "The cat sat" → h₁ → h₂ → h₃
Context vector = h₃
Decoder: h₃ → "Le" → "chat" → "s'est" → "assis"
# The context vector must encode the
# ENTIRE input sentence in one vector
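The encoder-decoder loop can be sketched with two toy RNNs and greedy decoding. Everything here is illustrative — random untrained weights, a 5-token output "vocabulary", a zero vector standing in for the start token:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
enc_W = rng.normal(0, 0.1, (d, 2 * d))   # encoder RNN weights
dec_W = rng.normal(0, 0.1, (d, 2 * d))   # decoder RNN weights
out_W = rng.normal(0, 0.1, (5, d))       # projects h to 5 token scores
emb = rng.normal(0, 0.1, (5, d))         # embeddings for decoder inputs

def encode(xs):
    h = np.zeros(d)
    for x in xs:                          # read the whole input...
        h = np.tanh(enc_W @ np.concatenate([h, x]))
    return h                              # ...into ONE context vector

def decode(context, steps=4):
    h, y, tokens = context, np.zeros(d), []   # y = <start> embedding (zeros)
    for _ in range(steps):                    # generate one token at a time
        h = np.tanh(dec_W @ np.concatenate([h, y]))
        tok = int(np.argmax(out_W @ h))       # greedy pick of next token
        tokens.append(tok)
        y = emb[tok]                          # feed chosen token back in
    return tokens

tokens = decode(encode(rng.normal(size=(3, d))))  # 4 output token ids
```

Notice that `decode` sees the input only through `context` — the fixed-size bottleneck the next paragraph describes.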
The Bottleneck Problem
Compressing an entire sentence into a single fixed-size vector is a severe bottleneck. Long sentences lose information. The decoder has no way to “look back” at specific parts of the input. This limitation directly motivated the invention of attention (Ch 9) — allowing the decoder to focus on relevant parts of the input at each step.
Bahdanau attention (2015) was the breakthrough: instead of one context vector, let the decoder attend to all encoder hidden states at each decoding step. This dramatically improved translation quality and became the foundation for the transformer architecture (Ch 9).
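The core of attention is a weighted sum over all encoder states. This sketch simplifies Bahdanau's additive scoring to a plain dot product (the structure — score, softmax, weighted sum — is the same; the scoring function is the swapped-in simplification):

```python
import numpy as np

def attention_context(dec_h, enc_hs):
    """Attend over all encoder hidden states at one decoding step."""
    scores = enc_hs @ dec_h              # one relevance score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax → attention weights
    return weights @ enc_hs, weights     # weighted sum of encoder states

rng = np.random.default_rng(0)
enc_hs = rng.normal(size=(3, 8))         # h1..h3 from the encoder
ctx, w = attention_context(rng.normal(size=8), enc_hs)
# w sums to 1; ctx mixes the encoder states by relevance,
# recomputed fresh at every decoding step — no single bottleneck vector
```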
RNN Applications
Where recurrent networks still shine
# RNN/LSTM applications
Language modeling:       predict next word/character; foundation for text generation
Machine translation:     Seq2Seq + attention (pre-transformer)
Speech recognition:      audio waveform → text (CTC loss); LSTMs still used in streaming ASR
Time-series forecasting: stock prices, weather, sensor data; LSTMs handle irregular intervals
Music generation:        sequence of notes → new melody
Handwriting generation:  sequence of pen strokes
Bidirectional RNNs
Process the sequence in both directions — forward and backward — and concatenate the hidden states. This gives each position context from both past and future. Essential for tasks like named entity recognition where the word after a name matters as much as the word before.
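A bidirectional pass is just two ordinary RNN passes, one over the reversed sequence, with the states concatenated per position. A minimal sketch with illustrative sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
W_fwd = rng.normal(0, 0.1, (d_h, d_h + d_in))  # forward-direction weights
W_bwd = rng.normal(0, 0.1, (d_h, d_h + d_in))  # backward-direction weights

def run(xs, W):
    """Plain RNN pass; returns the hidden state at every position."""
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

xs = rng.normal(size=(5, d_in))
fwd = run(xs, W_fwd)                  # left-to-right states
bwd = run(xs[::-1], W_bwd)[::-1]      # right-to-left, re-aligned to positions
bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
# each position now carries 2*d_h features: past AND future context
```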
Stacked RNNs: Multiple RNN layers stacked on top of each other. The output of one layer becomes the input to the next. Deeper RNNs learn more abstract representations. Google’s Neural Machine Translation (2016) used 8 stacked LSTM layers with attention.
RNNs Today & Key Takeaways
The transition from recurrence to attention
The Transformer Takeover
Transformers (Ch 9) replaced RNNs for most NLP tasks by 2018. Key advantages: parallelizable (RNNs must process sequentially), better long-range dependencies (attention connects any two positions directly), and scalable (RNNs don’t benefit as much from more compute). But RNNs aren’t dead — they remain relevant for streaming, edge, and real-time applications.
State Space Models (SSMs) like Mamba (2023) are a modern alternative that combines RNN-like sequential processing with transformer-like performance. They process sequences in O(n) time vs transformers’ O(n²), making them promising for very long sequences.
Key Takeaways
1. RNNs process sequences by maintaining a hidden state (memory)

2. Vanilla RNNs suffer from vanishing/exploding gradients

3. LSTMs solve this with a cell state + 3 gates (forget, input, output)

4. GRUs are simpler (2 gates) with similar performance

5. Seq2Seq encoder-decoder enabled machine translation

6. The context vector bottleneck motivated attention

7. Transformers largely replaced RNNs, but RNNs persist for streaming and edge use cases
Coming up: Ch 9 introduces attention and the transformer — the architecture that replaced RNNs and now powers GPT, BERT, and all modern LLMs.