Ch 8 — RNNs & Sequences

Processing sequential data — from vanilla RNNs to LSTMs and the road to transformers
High Level
Sequence → RNN → Vanishing → LSTM → GRU → Seq2Seq
Why Sequences Need Special Networks
Text, speech, time series — data where order matters
The Problem
MLPs and CNNs process fixed-size inputs. But many real-world problems involve sequences of variable length where the order of elements matters. “The dog bit the man” means something very different from “The man bit the dog.” We need networks that understand temporal order and context.
# Sequential data examples
Text:        "I grew up in France ... I speak ___"
Speech:      audio waveform over time
Music:       sequence of notes and chords
Time series: stock prices, sensor readings
Video:       sequence of image frames
DNA:         sequence of nucleotides (A, T, G, C)
Code:        sequence of tokens
Sequence Task Types
One-to-One:   image → label (not a sequence)
One-to-Many:  image → caption (words)
Many-to-One:  review → sentiment (pos/neg)
Many-to-Many: English → French (translation)
Many-to-Many: video → per-frame labels
The key requirement: The network must maintain a memory of what it has seen so far. When processing the word “speak” in “I grew up in France ... I speak ___”, it needs to remember “France” from many words ago to predict “French.”
The Vanilla RNN
A network with a loop — hidden state carries memory forward
How It Works
At each time step, the RNN takes two inputs: the current input (xₜ) and the previous hidden state (hₜ₋₁). It produces a new hidden state that encodes information about the entire sequence seen so far. The same weights are shared across all time steps — the network is “unrolled” through time.
# RNN computation at each time step
hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
yₜ = Wᵧ · hₜ + bᵧ
# hₜ = hidden state (memory)
# xₜ = input at time t
# yₜ = output at time t
# Wₕ, Wₓ, Wᵧ = shared weights
Unrolled View
# Processing "the cat sat"
t=1: x="the" + h₀ → h₁
t=2: x="cat" + h₁ → h₂
t=3: x="sat" + h₂ → h₃
# h₃ encodes the entire sequence
# Same Wₕ, Wₓ used at every step
# This is weight sharing through time
The hidden state is the memory. It’s a fixed-size vector (e.g., 256 or 512 dimensions) that must compress everything the network has seen. This is both the RNN’s strength (compact memory) and its weakness (limited capacity, information bottleneck).
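The recurrence above can be sketched in a few lines of NumPy. This is a minimal, untrained example — the sizes (4-dimensional inputs, 8-dimensional hidden state) and the random weights are illustrative, not part of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                       # input and hidden sizes (illustrative)
W_h = rng.normal(0, 0.1, (d_h, d_h))   # hidden-to-hidden weights
W_x = rng.normal(0, 0.1, (d_h, d_in))  # input-to-hidden weights
b = np.zeros(d_h)

def rnn_step(h_prev, x):
    """One step: h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Unroll over a 3-step sequence; the SAME weights are reused every step.
h = np.zeros(d_h)                      # h_0
for x in rng.normal(size=(3, d_in)):   # stand-ins for "the", "cat", "sat"
    h = rnn_step(h, x)                 # h now encodes the whole sequence
```

Note that the loop carries only `h` forward — a fixed-size vector is all the memory the network has, regardless of sequence length.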
The Vanishing Gradient Problem
Why vanilla RNNs can’t learn long-range dependencies
The Problem
During backpropagation through time (BPTT), gradients flow backward from the loss to early time steps. At each step, the gradient is multiplied by the weight matrix Wₕ and the tanh derivative. After many steps, these repeated multiplications cause gradients to either vanish (shrink to ~0) or explode (grow to infinity).
# Gradient after T time steps
∂L/∂h₀ = ∂L/∂hₜ × (Wₕ × tanh′)ᵀ
# If largest eigenvalue of Wₕ < 1:
#   gradient vanishes exponentially
#   0.9¹⁰⁰ = 0.0000265 ≈ 0
# If largest eigenvalue of Wₕ > 1:
#   gradient explodes exponentially
#   1.1¹⁰⁰ = 13,780 → NaN
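The exponential decay/growth is easy to verify numerically. This toy sketch uses diagonal 2×2 Jacobians (so the eigenvalue is explicit) and repeatedly applies them to a stand-in gradient, exactly as BPTT does:

```python
import numpy as np

def grad_norm_after(T, eig):
    """Norm of a gradient after backpropagating through T steps
    whose recurrent Jacobian has largest eigenvalue `eig`."""
    W = eig * np.eye(2)          # toy Jacobian with known eigenvalue
    g = np.ones(2)               # gradient arriving at the last step
    for _ in range(T):
        g = W @ g                # one step of backprop through time
    return np.linalg.norm(g)

print(grad_norm_after(100, 0.9))  # ≈ 3.8e-05: vanished
print(grad_norm_after(100, 1.1))  # ≈ 1.9e+04: exploding toward NaN
```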
Practical Impact
Vanishing: The network can’t learn dependencies beyond ~10–20 time steps. “I grew up in France ... [50 words later] ... I speak ___” — the gradient from “France” has vanished by the time it reaches the prediction.

Exploding: Training becomes unstable (loss = NaN). Fix: gradient clipping — cap gradient magnitude at a threshold (e.g., 5.0).
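Gradient clipping by global norm can be sketched in a few lines (the same idea as PyTorch's `clip_grad_norm_`; the function name and threshold here are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined
    (global) L2 norm never exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

# Exploding gradients get rescaled to global norm 5.0;
# well-behaved gradients pass through untouched.
clipped = clip_by_global_norm([np.full(3, 100.0), np.full(3, -100.0)])
```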
This is the same problem that plagued deep feedforward networks (Ch 5–6). ResNets solved it with skip connections. LSTMs solve it with a cell state — an information highway where gradients flow with minimal transformation.
LSTM: Long Short-Term Memory
Hochreiter & Schmidhuber (1997) — gates that control information flow
The Key Innovation
LSTMs add a cell state — a separate memory channel that runs parallel to the hidden state. Information flows through the cell state via simple addition and multiplication, avoiding the repeated matrix multiplications that cause vanishing gradients. Three gates control what information enters, stays, and leaves.
# LSTM gates (all sigmoid, output 0–1)
Forget gate (f): “What to erase from memory?”
  f = σ(Wᶠ · [hₜ₋₁, xₜ] + bᶠ)
Input gate (i): “What new info to store?”
  i = σ(Wᵢ · [hₜ₋₁, xₜ] + bᵢ)
Output gate (o): “What to output from memory?”
  o = σ(Wₒ · [hₜ₋₁, xₜ] + bₒ)
Cell State Update
# Cell state = long-term memory highway
Cₜ = f ⊙ Cₜ₋₁ + i ⊙ tanh(Wᶜ · [hₜ₋₁, xₜ])
#    forget old    add new
hₜ = o ⊙ tanh(Cₜ)   # output filtered memory
# ⊙ = element-wise multiplication
# Gates are 0–1: 0 = block, 1 = pass through
Why LSTMs work: The cell state update is Cₜ = f·Cₜ₋₁ + i·new. When f=1 and i=0, the cell state passes through unchanged — gradients flow perfectly. The network learns when to remember (f=1) and when to forget (f=0). This enables learning dependencies across 1000+ time steps.
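The gate equations above translate directly into code. A minimal sketch, with random untrained weights and illustrative sizes (`[h, x]` is concatenation, so each weight matrix has d_h + d_in columns):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (d_h, d_h + d_in)) for _ in range(4))
bf = bi = bo = bc = np.zeros(d_h)

def lstm_step(h_prev, C_prev, x):
    hx = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)                   # forget gate: what to erase
    i = sigmoid(Wi @ hx + bi)                   # input gate: what to store
    o = sigmoid(Wo @ hx + bo)                   # output gate: what to emit
    C = f * C_prev + i * np.tanh(Wc @ hx + bc)  # cell state update
    h = o * np.tanh(C)                          # filtered memory
    return h, C

h1, C1 = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_in))
```

Note how the only operations touching `C_prev` are an element-wise multiply and an add — no matrix multiplication, which is exactly why gradients survive the trip through many steps.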
GRU: Gated Recurrent Unit
A simpler alternative to LSTM with similar performance
Simplified Gates
Cho et al. (2014) introduced the GRU as a simpler alternative to LSTM. It merges the cell state and hidden state into one, and uses only two gates instead of three: a reset gate and an update gate. Fewer parameters, faster training, and comparable performance on most tasks.
# GRU: 2 gates instead of 3
Reset gate (r): “How much past to forget?”
  r = σ(Wᵣ · [hₜ₋₁, xₜ])
Update gate (z): “How much to update vs keep?”
  z = σ(Wᶻ · [hₜ₋₁, xₜ])
New state:
  h̃ = tanh(W · [r ⊙ hₜ₋₁, xₜ])
  hₜ = (1−z) ⊙ hₜ₋₁ + z ⊙ h̃
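For comparison with the LSTM, here is the same kind of sketch for a GRU step — untrained random weights, illustrative sizes, biases omitted as in the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wr, Wz, W = (rng.normal(0, 0.1, (d_h, d_h + d_in)) for _ in range(3))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                              # reset gate
    z = sigmoid(Wz @ hx)                              # update gate
    h_new = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_new               # interpolate old/new

h1 = gru_step(np.zeros(d_h), rng.normal(size=d_in))
```

The final line is the whole trick: the update gate `z` interpolates between keeping the old state and adopting the candidate, playing the roles of the LSTM's forget and input gates at once.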
LSTM
3 gates (forget, input, output) + cell state. More parameters. Slightly better on very long sequences. Standard for speech recognition.
GRU
2 gates (reset, update). Fewer parameters, faster training. Comparable accuracy. Better for smaller datasets. No separate cell state.
In practice: LSTM and GRU perform similarly on most tasks. LSTM is the safer default. GRU is preferred when compute is limited. Both are largely superseded by transformers (Ch 9) for most NLP tasks, but remain relevant for real-time streaming, edge devices, and time-series forecasting.
Sequence-to-Sequence & Encoder-Decoder
Translating between sequences of different lengths
The Architecture
Sutskever et al. (2014) introduced the encoder-decoder architecture for machine translation. The encoder RNN reads the input sequence and compresses it into a fixed-size context vector. The decoder RNN generates the output sequence from this context vector, one token at a time.
# Seq2Seq: English → French
Encoder: "The cat sat" → h₁ → h₂ → h₃
Context vector = h₃
Decoder: h₃ → "Le" → "chat" → "s'est" → "assis"
# The context vector must encode the
# ENTIRE input sentence in one vector
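The encoder-decoder loop can be sketched with two toy RNNs and greedy decoding. Everything here is illustrative — random untrained weights, a 5-token output "vocabulary", a zero vector standing in for the start token:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
enc_W = rng.normal(0, 0.1, (d, 2 * d))   # encoder RNN weights
dec_W = rng.normal(0, 0.1, (d, 2 * d))   # decoder RNN weights
out_W = rng.normal(0, 0.1, (5, d))       # projects h to 5 token scores
emb = rng.normal(0, 0.1, (5, d))         # embeddings for decoder inputs

def encode(xs):
    h = np.zeros(d)
    for x in xs:                          # read the whole input...
        h = np.tanh(enc_W @ np.concatenate([h, x]))
    return h                              # ...into ONE context vector

def decode(context, steps=4):
    h, y, tokens = context, np.zeros(d), []   # y = <start> embedding (zeros)
    for _ in range(steps):                    # generate one token at a time
        h = np.tanh(dec_W @ np.concatenate([h, y]))
        tok = int(np.argmax(out_W @ h))       # greedy pick of next token
        tokens.append(tok)
        y = emb[tok]                          # feed chosen token back in
    return tokens

tokens = decode(encode(rng.normal(size=(3, d))))  # 4 output token ids
```

Notice that `decode` sees the input only through `context` — the fixed-size bottleneck the next paragraph describes.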
The Bottleneck Problem
Compressing an entire sentence into a single fixed-size vector is a severe bottleneck. Long sentences lose information. The decoder has no way to “look back” at specific parts of the input. This limitation directly motivated the invention of attention (Ch 9) — allowing the decoder to focus on relevant parts of the input at each step.
Bahdanau attention (2015) was the breakthrough: instead of one context vector, let the decoder attend to all encoder hidden states at each decoding step. This dramatically improved translation quality and became the foundation for the transformer architecture (Ch 9).
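The core of attention is a weighted sum over all encoder states. This sketch simplifies Bahdanau's additive scoring to a plain dot product (the structure — score, softmax, weighted sum — is the same; the scoring function is the swapped-in simplification):

```python
import numpy as np

def attention_context(dec_h, enc_hs):
    """Attend over all encoder hidden states at one decoding step."""
    scores = enc_hs @ dec_h              # one relevance score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax → attention weights
    return weights @ enc_hs, weights     # weighted sum of encoder states

rng = np.random.default_rng(0)
enc_hs = rng.normal(size=(3, 8))         # h1..h3 from the encoder
ctx, w = attention_context(rng.normal(size=8), enc_hs)
# w sums to 1; ctx mixes the encoder states by relevance,
# recomputed fresh at every decoding step — no single bottleneck vector
```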
RNN Applications
Where recurrent networks still shine
# RNN/LSTM applications
Language modeling:       predict next word/character; foundation for text generation
Machine translation:     Seq2Seq + attention (pre-transformer)
Speech recognition:      audio waveform → text (CTC loss); LSTMs still used in streaming ASR
Time-series forecasting: stock prices, weather, sensor data; LSTMs handle irregular intervals
Music generation:        sequence of notes → new melody
Handwriting generation:  sequence of pen strokes
Bidirectional RNNs
Process the sequence in both directions — forward and backward — and concatenate the hidden states. This gives each position context from both past and future. Essential for tasks like named entity recognition where the word after a name matters as much as the word before.
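A bidirectional pass is just two ordinary RNN passes, one over the reversed sequence, with the states concatenated per position. A minimal sketch with illustrative sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6
W_fwd = rng.normal(0, 0.1, (d_h, d_h + d_in))  # forward-direction weights
W_bwd = rng.normal(0, 0.1, (d_h, d_h + d_in))  # backward-direction weights

def run(xs, W):
    """Plain RNN pass; returns the hidden state at every position."""
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

xs = rng.normal(size=(5, d_in))
fwd = run(xs, W_fwd)                  # left-to-right states
bwd = run(xs[::-1], W_bwd)[::-1]      # right-to-left, re-aligned to positions
bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
# each position now carries 2*d_h features: past AND future context
```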
Stacked RNNs: Multiple RNN layers stacked on top of each other. The output of one layer becomes the input to the next. Deeper RNNs learn more abstract representations. Google’s Neural Machine Translation (2016) used 8 stacked LSTM layers with attention.
RNNs Today & Key Takeaways
The transition from recurrence to attention
The Transformer Takeover
Transformers (Ch 9) replaced RNNs for most NLP tasks by 2018. Key advantages: parallelizable (RNNs must process sequentially), better long-range dependencies (attention connects any two positions directly), and scalable (RNNs don’t benefit as much from more compute). But RNNs aren’t dead — they remain relevant for streaming, edge, and real-time applications.
State Space Models (SSMs) like Mamba (2023) are a modern alternative that combines RNN-like sequential processing with transformer-like performance. They process sequences in O(n) time vs transformers’ O(n²), making them promising for very long sequences.
Key Takeaways
1. RNNs process sequences by maintaining a hidden state (memory)

2. Vanilla RNNs suffer from vanishing/exploding gradients

3. LSTMs solve this with a cell state + 3 gates (forget, input, output)

4. GRUs are simpler (2 gates) with similar performance

5. Seq2Seq encoder-decoder enabled machine translation

6. The context vector bottleneck motivated attention

7. Transformers largely replaced RNNs, but RNNs persist for streaming and edge use cases
Coming up: Ch 9 introduces attention and the transformer — the architecture that replaced RNNs and now powers GPT, BERT, and all modern LLMs.