Ch 4 — The Transformer Block

The repeating unit that powers every LLM — attention, feed-forward, normalize, repeat
The Transformer Block: One Unit, Repeated
Every LLM is just this block stacked dozens or hundreds of times
The Analogy
Think of an assembly line in a factory. Each station does two things: (1) consult the team (attention — look at other tokens for context), then (2) think independently (feed-forward network — process the information). The product (token representation) gets refined at each station. GPT-3 has 96 stations. Llama 3 70B has 80. Each one makes the representation a little richer.
Key insight: The entire transformer is just this one block repeated N times. There’s no special logic between layers — the same architecture processes tokens at layer 1 and layer 96. The magic comes from stacking: early layers capture syntax and local patterns, middle layers capture semantics, and late layers capture task-specific reasoning.
The Block Structure
# One transformer block (pre-norm style):
#
#   Input x
#     │
#     ├───────────────┐
#     │               ▼
#     │         Norm → Attention
#     │               │
#     └────→ + ←──────┘      h = x + Attention(Norm(x))
#            │
#            ├───────────────┐
#            │               ▼
#            │         Norm → FFN
#            │               │
#            └────→ + ←──────┘      y = h + FFN(Norm(h))
#                   │
#                   ▼
#                Output y
#
# Layer counts in real models:
#   GPT-2:        12 layers
#   GPT-3:        96 layers
#   Llama 3 8B:   32 layers
#   Llama 3 70B:  80 layers
#   GPT-4 (est): ~120 layers
Residual Connections: The Gradient Highway
Why we add the input back to the output at every layer
The Analogy
Imagine passing a message through 96 people in a game of telephone. By the end, the message is unrecognizable. Residual connections fix this: at each step, you keep a copy of the original message and add the new information to it. So even after 96 steps, the original signal is preserved. Mathematically: output = input + transformation(input). The “+” is the residual connection.
Key insight: Without residual connections, training deep networks is nearly impossible. Gradients vanish as they flow backward through dozens of layers (the vanishing gradient problem from MathForAI Ch 5). Residual connections create a “gradient highway” — gradients can flow directly from the loss back to early layers without being multiplied by many small numbers. He et al. (2015) introduced this in ResNets for vision; transformers adopted it from day one.
Why It Works
# Without residual connections:
#   y = f(x)  →  gradient: df/dx
#   After 96 layers: dy/dx = Π (df_i/dx_i)
#   If each df/dx ≈ 0.9, after 96 layers:
#   0.9^96 ≈ 0.00004  (vanished!)
#
# With residual connections:
#   y = x + f(x)  →  gradient: 1 + df/dx
#   The "1" is an identity path: gradients always
#   have a direct route back to early layers,
#   no matter how small df/dx gets.
#
# In code:
#   x = x + attention(norm(x))   # residual
#   x = x + ffn(norm(x))         # residual
#
# The original x flows through unchanged;
# each layer only ADDS refinements.
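The arithmetic above can be verified with autograd. This is a toy sketch, assuming PyTorch: each "layer" is just scalar multiplication by 0.9, so the plain chain's gradient is 0.9^96 while the residual chain keeps an identity path at every step.

```python
import torch

depth, w = 96, 0.9

# Plain chain of 96 layers f(h) = w*h: gradient shrinks as w^depth.
x = torch.tensor(1.0, requires_grad=True)
h = x
for _ in range(depth):
    h = w * h
h.backward()
print(f"plain chain grad:    {x.grad:.1e}")   # ~4.0e-05 (vanished)

# Residual chain h = h + f(h): each step contributes (1 + w),
# so the identity branch keeps the gradient alive.
x2 = torch.tensor(1.0, requires_grad=True)
h = x2
for _ in range(depth):
    h = h + w * h
h.backward()
print(f"residual chain grad: {x2.grad:.1e}")  # (1+w)^96, enormous
```

In a real block, normalization keeps each f(x) small, so the residual sum stays well-scaled rather than exploding as in this deliberately extreme toy.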
Real World
Telephone game with a notepad: keep the original message, add corrections at each step
In LLMs
y = x + f(x): each layer adds refinements to the original representation
Normalization: Keeping Numbers Sane
LayerNorm and RMSNorm prevent activations from exploding
The Analogy
Imagine a group project where one person writes in centimeters and another in miles. Before combining their work, you need to normalize — convert everything to the same scale. Layer normalization does this for neural network activations: it rescales each token’s vector to have zero mean and unit variance. This prevents values from drifting to extreme ranges as they pass through layers.
Key insight: The original transformer (2017) used post-norm: normalize after each sublayer. Modern LLMs use pre-norm: normalize before each sublayer. Pre-norm is more stable for training very deep networks because the residual path stays clean. Additionally, Llama and most 2024+ models use RMSNorm instead of LayerNorm — it’s simpler (no mean subtraction) and ~10-15% faster.
LayerNorm vs RMSNorm
# LayerNorm (GPT-2, GPT-3, BERT):
#   1. Compute mean:     μ = mean(x)
#   2. Compute variance: σ² = var(x)
#   3. Normalize:        (x - μ) / √(σ² + ε)
#   4. Scale and shift:  γ * norm + β
#
# RMSNorm (Llama, Mistral, Qwen):
#   1. Compute RMS: rms = √(mean(x²) + ε)
#   2. Normalize:   x / rms
#   3. Scale only:  γ * norm   (no shift!)
#   → Simpler, faster, works just as well

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x / rms * self.weight
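To see the "no mean subtraction" difference concretely, here is a small sketch (assuming PyTorch) comparing nn.LayerNorm with a bare RMS normalization on activations that carry a large mean offset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
x = torch.randn(3, d) + 10.0          # activations with a big mean offset

y_ln = nn.LayerNorm(d)(x)             # subtracts mean, divides by std
rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
y_rms = x / rms                       # RMSNorm with γ = 1

print(y_ln.mean(-1))                  # ≈ 0 for every row (mean removed)
print(y_rms.mean(-1))                 # clearly non-zero (mean kept)
print(y_rms.pow(2).mean(-1))          # ≈ 1: unit RMS per row
```

Both keep the per-token scale bounded; RMSNorm simply skips the mean statistics, which is where its speed advantage comes from.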
The Feed-Forward Network: Independent Thinking
After consulting the team, each token processes information on its own
The Analogy
After a team meeting (attention), each person goes back to their desk to think independently. The feed-forward network (FFN) is this independent thinking step. It’s applied to each token separately — no communication between tokens. It’s a simple two-layer neural network: expand to a larger dimension, apply a nonlinearity, then compress back. This is where the model stores “factual knowledge.”
Key insight: Research suggests that attention handles “routing” (which tokens to combine) while FFN layers store “knowledge” (facts, patterns). Geva et al. (2021) showed that FFN layers act as key-value memories: the first layer’s rows match patterns, and the second layer’s columns store associated information. This is why larger FFN dimensions = more knowledge capacity.
The Architecture
# Classic FFN (the original Transformer used ReLU;
# GPT-2 and GPT-3 use the same shape with GELU):
#   FFN(x) = W₂ · ReLU(W₁ · x + b₁) + b₂
#   W₁: (d_model, d_ff)  — expand
#   W₂: (d_ff, d_model)  — compress
#   d_ff = 4 × d_model typically
#
# GPT-3: d_model=12288, d_ff=49152
#   That's 12288 × 49152 × 2 ≈ 1.2B params
#   per FFN layer! (96 layers ≈ 116B just FFN)
#
# Llama 3 8B: d_model=4096, d_ff=14336
#   (3.5× expansion, not 4×, due to SwiGLU)

class FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))
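A quick sanity check of the expand-then-compress shape, using toy dimensions (d_model=512, d_ff=2048 are illustrative, not from any real model):

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Classic two-layer FFN: expand → nonlinearity → compress."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

ffn = FFN(d_model=512, d_ff=2048)    # toy 4x expansion
x = torch.randn(1, 10, 512)          # (batch, seq, d_model)
print(ffn(x).shape)                  # torch.Size([1, 10, 512])

# Parameters scale as 2 × d_model × d_ff (plus biases):
n_params = sum(p.numel() for p in ffn.parameters())
print(n_params)                      # 2*512*2048 + 2048 + 512 = 2099712
```

Because nn.Linear acts only on the last dimension, the FFN treats every position independently: no information moves between tokens here, exactly as the analogy describes.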
Activation Functions: ReLU, GELU, SwiGLU
The nonlinearity that gives neural networks their power
The Analogy
Without activation functions, a neural network is just matrix multiplication — which can only learn linear relationships. An activation function is like a decision gate: it decides which signals to pass through and which to suppress. ReLU is a simple on/off switch (negative = 0, positive = pass). GELU and SwiGLU are smoother, more nuanced gates.
Key insight: SwiGLU (Shazeer, 2020) is now the standard in modern LLMs. It uses a gating mechanism: one linear layer produces the signal, another produces a gate, and they’re multiplied together. This gives the model more control over information flow. Llama, Mistral, Qwen, and most 2024+ models use SwiGLU. It achieves the same quality with ~10% fewer training tokens than ReLU.
Evolution of Activations
# ReLU (popularized 2012): max(0, x)
#   Simple, fast, but "dead neurons" problem
#   Used in: original Transformer
#
# GELU (2016): x · Φ(x)
#   Smooth approximation of ReLU
#   Used in: GPT-2, GPT-3, BERT, RoBERTa
#
# SwiGLU (2020): SiLU(xW₁) ⊙ (xW₃)
#   Gated: signal × gate
#   Used in: Llama 2/3, Mistral, Qwen, Gemma

class SwiGLU_FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # gate
        self.w3 = nn.Linear(d_model, d_ff)  # signal
        self.w2 = nn.Linear(d_ff, d_model)  # down

    def forward(self, x):
        gate = torch.nn.functional.silu(self.w1(x))
        signal = self.w3(x)
        return self.w2(gate * signal)
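The three activations can be compared directly on a few sample points; this sketch uses PyTorch's built-in functional versions. Note how ReLU zeroes negatives outright while GELU and SiLU let small negative values leak through:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU: hard on/off gate — negatives are cut to exactly zero.
print(F.relu(x))    # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# GELU: x · Φ(x) — smooth, slightly negative near zero.
print(F.gelu(x))

# SiLU: x · sigmoid(x) — the "S" inside SwiGLU, also smooth.
print(F.silu(x))
```

The smoothness matters for optimization: GELU and SiLU have nonzero gradients for small negative inputs, which avoids ReLU's "dead neuron" failure mode.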
Modern Upgrades: What Changed Since 2017
The original transformer vs. today’s LLMs
Original (2017) vs Modern (2024)
The core idea is the same, but every component has been refined. Post-norm → Pre-norm (more stable training). LayerNorm → RMSNorm (faster, simpler). ReLU → SwiGLU (better quality). Sinusoidal positions → RoPE (better length generalization). Standard attention → GQA (Grouped-Query Attention, more memory efficient).
Key insight: Llama’s success came not from a single breakthrough but from combining the best proven ideas: RMSNorm + SwiGLU + RoPE + GQA + pre-norm. This “best practices” approach achieved GPT-3 level performance with less than 10% of the parameters. The lesson: architecture matters, but careful engineering matters more.
Side-by-Side
# Original Transformer (Vaswani et al., 2017):
#   - Post-norm (norm after sublayer)
#   - LayerNorm
#   - ReLU activation
#   - Sinusoidal positional encoding
#   - Standard multi-head attention
#   - Encoder-decoder architecture
#
# Modern LLM (Llama 3, 2024):
#   - Pre-norm (norm before sublayer)
#   - RMSNorm (faster, no mean)
#   - SwiGLU activation (gated)
#   - RoPE positional encoding
#   - Grouped-Query Attention (GQA)
#   - Decoder-only architecture
#
# GQA: share K,V across head groups
#   Llama 3 8B: 32 Q heads, 8 KV heads
#   → 4× less KV cache memory
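The 4× KV-cache saving from GQA is simple arithmetic. This back-of-envelope sketch assumes fp16 storage, head_dim=128, 32 layers, and an 8192-token context (illustrative numbers chosen to match Llama 3 8B's public config):

```python
def kv_cache_bytes(n_kv_heads, head_dim=128, n_layers=32,
                   seq_len=8192, batch=1, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape
    # (batch, n_kv_heads, seq_len, head_dim).
    return (2 * n_layers * batch * n_kv_heads
            * seq_len * head_dim * dtype_bytes)

mha = kv_cache_bytes(n_kv_heads=32)  # every Q head has its own K,V
gqa = kv_cache_bytes(n_kv_heads=8)   # Llama 3 8B: 8 shared KV heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
print(mha // gqa)  # 4
```

The cache size depends only on the number of KV heads, not Q heads, which is why shrinking 32 KV heads to 8 cuts memory by exactly 4× without touching the model's query capacity.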
A Complete Transformer Block in PyTorch
The modern Llama-style block in ~25 lines
Implementation
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        # Pre-norm with RMSNorm
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        # Self-attention (from Ch 3)
        self.attn = CausalSelfAttention(d_model, n_heads)
        # Feed-forward with SwiGLU
        self.ffn = SwiGLU_FFN(d_model, d_ff)

    def forward(self, x):
        # Sublayer 1: attention + residual
        x = x + self.attn(self.norm1(x))
        # Sublayer 2: FFN + residual
        x = x + self.ffn(self.norm2(x))
        return x
That’s It. Really.
The entire transformer block is: normalize → attend → add residual → normalize → feed-forward → add residual. Two sublayers, two residual connections, two normalizations. Stack this 32-120 times and you have a modern LLM. The simplicity is the beauty — and why it scales so well.
Parameter Count
# Params per block (Llama 3 8B, with GQA):
#   d_model=4096, n_heads=32, n_kv_heads=8, d_ff=14336
#
# Attention: W_Q, W_O: 4096 × 4096 each
#            W_K, W_V: 4096 × 1024 each (8 KV heads × 128)
#   = 2 × 16.8M + 2 × 4.2M ≈ 42M params
#
# FFN (SwiGLU): W₁ + W₃ + W₂
#   = 2 × (4096 × 14336) + (14336 × 4096)
#   ≈ 176M params
#
# Norms: 2 × 4096 = 8K (negligible)
#
# Total per block: ~218M params
#   × 32 layers ≈ 7.0B
#   + input embeddings and output head
#     (2 × 128256 × 4096 ≈ 1.05B) ≈ 8B total ✓
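The budget above can be recomputed in a few lines. Shapes are taken from Llama 3 8B's public configuration (untied input embedding and output head assumed):

```python
d_model, d_ff, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 128256

attn = (d_model * n_heads * head_dim            # W_Q
        + 2 * d_model * n_kv_heads * head_dim   # W_K, W_V (GQA)
        + n_heads * head_dim * d_model)         # W_O
ffn = 3 * d_model * d_ff                        # W1, W3, W2 (SwiGLU)
norms = 2 * d_model                             # two RMSNorm weights

per_block = attn + ffn + norms
embed = vocab * d_model                         # input embedding
head = vocab * d_model                          # untied output head

total = n_layers * per_block + embed + head + d_model  # + final norm
print(f"per block: {per_block / 1e6:.0f}M")     # per block: 218M
print(f"total:     {total / 1e9:.2f}B")         # total:     8.03B
```

Note how the FFN dominates: it holds roughly 4× more parameters per block than attention, which supports the "FFN as knowledge storage" view from earlier.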
Stacking Blocks: From Shallow to Deep
What happens as tokens flow through 32, 80, or 120 layers
What Each Layer Does
Research (Tenney et al., 2019; Elhage et al., 2022) reveals a progression:
Early layers (1-10): learn syntax, part-of-speech, simple patterns.
Middle layers (10-60): learn semantics, entity relationships, factual knowledge.
Late layers (60+): learn task-specific reasoning, output formatting, next-token prediction.
The representation gets progressively more abstract and task-relevant.
The complete picture: A transformer is: embeddings → N × (norm → attention → residual → norm → FFN → residual) → final norm → output projection. That’s the entire architecture. The original 2017 design was elegant; modern refinements (RMSNorm, SwiGLU, RoPE, GQA) made it practical at scale. Every LLM you use — GPT-4, Claude, Llama, Gemini — is this block, repeated.
The Full Model
class LLM(nn.Module):
    def __init__(self, vocab, d, n_heads, d_ff, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([
            TransformerBlock(d, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(d)
        self.head = nn.Linear(d, vocab)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.head(x)
        return logits

# Llama 3 8B:
model = LLM(
    vocab=128256, d=4096, n_heads=32,
    d_ff=14336, n_layers=32
)
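For a runnable end-to-end check with toy dimensions, here is a self-contained sketch. To keep it short, LayerNorm and GELU stand in for RMSNorm and SwiGLU, and attention uses PyTorch's scaled_dot_product_attention with a causal mask; the block and model structure are otherwise the same as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Minimal pre-norm transformer block for shape checking."""
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.n_heads = n_heads
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))

    def attn(self, x):
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, d) -> (B, heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, d // self.n_heads)
                    .transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, T, d))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # sublayer 1 + residual
        return x + self.ffn(self.norm2(x)) # sublayer 2 + residual

class ToyLLM(nn.Module):
    def __init__(self, vocab=1000, d=64, n_heads=4, d_ff=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(
            Block(d, n_heads, d_ff) for _ in range(n_layers))
        self.norm = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab, bias=False)

    def forward(self, ids):
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x)
        return self.head(self.norm(x))

model = ToyLLM()
logits = model(torch.randint(0, 1000, (2, 16)))  # (batch=2, seq=16)
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The output has one logit vector per token position: each position predicts its next token, which is exactly what the training loss consumes.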