Ch 4 — The Transformer Block

The repeating unit that powers every LLM — attention, feed-forward, normalize, repeat
The Transformer Block: One Unit, Repeated
Every LLM is just this block stacked dozens or hundreds of times
The Analogy
Think of an assembly line in a factory. Each station does two things: (1) consult the team (attention — look at other tokens for context), then (2) think independently (feed-forward network — process the information). The product (token representation) gets refined at each station. GPT-3 has 96 stations. Llama 3 70B has 80. Each one makes the representation a little richer.
Key insight: The entire transformer is just this one block repeated N times. There’s no special logic between layers — the same architecture processes tokens at layer 1 and layer 96. The magic comes from stacking: early layers capture syntax and local patterns, middle layers capture semantics, and late layers capture task-specific reasoning.
The Block Structure
# One transformer block (pre-norm style):
#
#   Input x
#     │
#     ├───────────────┐
#     │               ▼
#     │         Norm → Attention
#     │               │
#     └────→ + ←──────┘      h = x + Attention(Norm(x))
#            │
#            ├───────────────┐
#            │               ▼
#            │         Norm → FFN
#            │               │
#            └────→ + ←──────┘      y = h + FFN(Norm(h))
#                   │
#                   ▼
#                Output y
#
# Layer counts in real models:
#   GPT-2:        12 layers
#   GPT-3:        96 layers
#   Llama 3 8B:   32 layers
#   Llama 3 70B:  80 layers
#   GPT-4 (est): ~120 layers
Residual Connections: The Gradient Highway
Why we add the input back to the output at every layer
The Analogy
Imagine passing a message through 96 people in a game of telephone. By the end, the message is unrecognizable. Residual connections fix this: at each step, you keep a copy of the original message and add the new information to it. So even after 96 steps, the original signal is preserved. Mathematically: output = input + transformation(input). The “+” is the residual connection.
Key insight: Without residual connections, training deep networks is nearly impossible. Gradients vanish as they flow backward through dozens of layers (the vanishing gradient problem from MathForAI Ch 5). Residual connections create a “gradient highway” — gradients can flow directly from the loss back to early layers without being multiplied by many small numbers. He et al. (2015) introduced this in ResNets for vision; transformers adopted it from day one.
Why It Works
# Without residual connections:
#   y = f(x)  →  gradient: df/dx
#   After 96 layers: dy/dx = Π (df_i/dx_i)
#   If each df/dx ≈ 0.9, after 96 layers:
#   0.9^96 ≈ 0.00004  (vanished!)
#
# With residual connections:
#   y = x + f(x)  →  gradient: 1 + df/dx
#   The "1" is an identity path: gradients always
#   have a direct route back to early layers,
#   no matter how small df/dx gets.
#
# In code:
#   x = x + attention(norm(x))   # residual
#   x = x + ffn(norm(x))         # residual
#
# The original x flows through unchanged;
# each layer only ADDS refinements.
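The arithmetic above can be verified with autograd. This is a toy sketch, assuming PyTorch: each "layer" is just scalar multiplication by 0.9, so the plain chain's gradient is 0.9^96 while the residual chain keeps an identity path at every step.

```python
import torch

depth, w = 96, 0.9

# Plain chain of 96 layers f(h) = w*h: gradient shrinks as w^depth.
x = torch.tensor(1.0, requires_grad=True)
h = x
for _ in range(depth):
    h = w * h
h.backward()
print(f"plain chain grad:    {x.grad:.1e}")   # ~4.0e-05 (vanished)

# Residual chain h = h + f(h): each step contributes (1 + w),
# so the identity branch keeps the gradient alive.
x2 = torch.tensor(1.0, requires_grad=True)
h = x2
for _ in range(depth):
    h = h + w * h
h.backward()
print(f"residual chain grad: {x2.grad:.1e}")  # (1+w)^96, enormous
```

In a real block, normalization keeps each f(x) small, so the residual sum stays well-scaled rather than exploding as in this deliberately extreme toy.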
Real World
Telephone game with a notepad: keep the original message, add corrections at each step
In LLMs
y = x + f(x): each layer adds refinements to the original representation
Normalization: Keeping Numbers Sane
LayerNorm and RMSNorm prevent activations from exploding
The Analogy
Imagine a group project where one person writes in centimeters and another in miles. Before combining their work, you need to normalize — convert everything to the same scale. Layer normalization does this for neural network activations: it rescales each token’s vector to have zero mean and unit variance. This prevents values from drifting to extreme ranges as they pass through layers.
Key insight: The original transformer (2017) used post-norm: normalize after each sublayer. Modern LLMs use pre-norm: normalize before each sublayer. Pre-norm is more stable for training very deep networks because the residual path stays clean. Additionally, Llama and most 2024+ models use RMSNorm instead of LayerNorm — it’s simpler (no mean subtraction) and ~10-15% faster.
LayerNorm vs RMSNorm
# LayerNorm (GPT-2, GPT-3, BERT):
#   1. Compute mean:     μ = mean(x)
#   2. Compute variance: σ² = var(x)
#   3. Normalize:        (x - μ) / √(σ² + ε)
#   4. Scale and shift:  γ * norm + β
#
# RMSNorm (Llama, Mistral, Qwen):
#   1. Compute RMS: rms = √(mean(x²) + ε)
#   2. Normalize:   x / rms
#   3. Scale only:  γ * norm   (no shift!)
#   → Simpler, faster, works just as well

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x / rms * self.weight
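To see the "no mean subtraction" difference concretely, here is a small sketch (assuming PyTorch) comparing nn.LayerNorm with a bare RMS normalization on activations that carry a large mean offset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
x = torch.randn(3, d) + 10.0          # activations with a big mean offset

y_ln = nn.LayerNorm(d)(x)             # subtracts mean, divides by std
rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
y_rms = x / rms                       # RMSNorm with γ = 1

print(y_ln.mean(-1))                  # ≈ 0 for every row (mean removed)
print(y_rms.mean(-1))                 # clearly non-zero (mean kept)
print(y_rms.pow(2).mean(-1))          # ≈ 1: unit RMS per row
```

Both keep the per-token scale bounded; RMSNorm simply skips the mean statistics, which is where its speed advantage comes from.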
The Feed-Forward Network: Independent Thinking
After consulting the team, each token processes information on its own
The Analogy
After a team meeting (attention), each person goes back to their desk to think independently. The feed-forward network (FFN) is this independent thinking step. It’s applied to each token separately — no communication between tokens. It’s a simple two-layer neural network: expand to a larger dimension, apply a nonlinearity, then compress back. This is where the model stores “factual knowledge.”
Key insight: Research suggests that attention handles “routing” (which tokens to combine) while FFN layers store “knowledge” (facts, patterns). Geva et al. (2021) showed that FFN layers act as key-value memories: the first layer’s rows match patterns, and the second layer’s columns store associated information. This is why larger FFN dimensions = more knowledge capacity.
The Architecture
# Classic FFN (the original Transformer used ReLU;
# GPT-2 and GPT-3 use the same shape with GELU):
#   FFN(x) = W₂ · ReLU(W₁ · x + b₁) + b₂
#   W₁: (d_model, d_ff)  — expand
#   W₂: (d_ff, d_model)  — compress
#   d_ff = 4 × d_model typically
#
# GPT-3: d_model=12288, d_ff=49152
#   That's 12288 × 49152 × 2 ≈ 1.2B params
#   per FFN layer! (96 layers ≈ 116B just FFN)
#
# Llama 3 8B: d_model=4096, d_ff=14336
#   (3.5× expansion, not 4×, due to SwiGLU)

class FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))
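A quick sanity check of the expand-then-compress shape, using toy dimensions (d_model=512, d_ff=2048 are illustrative, not from any real model):

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Classic two-layer FFN: expand → nonlinearity → compress."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

ffn = FFN(d_model=512, d_ff=2048)    # toy 4x expansion
x = torch.randn(1, 10, 512)          # (batch, seq, d_model)
print(ffn(x).shape)                  # torch.Size([1, 10, 512])

# Parameters scale as 2 × d_model × d_ff (plus biases):
n_params = sum(p.numel() for p in ffn.parameters())
print(n_params)                      # 2*512*2048 + 2048 + 512 = 2099712
```

Because nn.Linear acts only on the last dimension, the FFN treats every position independently: no information moves between tokens here, exactly as the analogy describes.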
Activation Functions: ReLU, GELU, SwiGLU
The nonlinearity that gives neural networks their power
The Analogy
Without activation functions, a neural network is just matrix multiplication — which can only learn linear relationships. An activation function is like a decision gate: it decides which signals to pass through and which to suppress. ReLU is a simple on/off switch (negative = 0, positive = pass). GELU and SwiGLU are smoother, more nuanced gates.
Key insight: SwiGLU (Shazeer, 2020) is now the standard in modern LLMs. It uses a gating mechanism: one linear layer produces the signal, another produces a gate, and they’re multiplied together. This gives the model more control over information flow. Llama, Mistral, Qwen, and most 2024+ models use SwiGLU. It achieves the same quality with ~10% fewer training tokens than ReLU.
Evolution of Activations
# ReLU (popularized 2012): max(0, x)
#   Simple, fast, but "dead neurons" problem
#   Used in: original Transformer
#
# GELU (2016): x · Φ(x)
#   Smooth approximation of ReLU
#   Used in: GPT-2, GPT-3, BERT, RoBERTa
#
# SwiGLU (2020): SiLU(xW₁) ⊙ (xW₃)
#   Gated: signal × gate
#   Used in: Llama 2/3, Mistral, Qwen, Gemma

class SwiGLU_FFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # gate
        self.w3 = nn.Linear(d_model, d_ff)  # signal
        self.w2 = nn.Linear(d_ff, d_model)  # down

    def forward(self, x):
        gate = torch.nn.functional.silu(self.w1(x))
        signal = self.w3(x)
        return self.w2(gate * signal)
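The three activations can be compared directly on a few sample points; this sketch uses PyTorch's built-in functional versions. Note how ReLU zeroes negatives outright while GELU and SiLU let small negative values leak through:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# ReLU: hard on/off gate — negatives are cut to exactly zero.
print(F.relu(x))    # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# GELU: x · Φ(x) — smooth, slightly negative near zero.
print(F.gelu(x))

# SiLU: x · sigmoid(x) — the "S" inside SwiGLU, also smooth.
print(F.silu(x))
```

The smoothness matters for optimization: GELU and SiLU have nonzero gradients for small negative inputs, which avoids ReLU's "dead neuron" failure mode.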
Modern Upgrades: What Changed Since 2017
The original transformer vs. today’s LLMs
Original (2017) vs Modern (2024)
The core idea is the same, but every component has been refined. Post-norm → Pre-norm (more stable training). LayerNorm → RMSNorm (faster, simpler). ReLU → SwiGLU (better quality). Sinusoidal positions → RoPE (better length generalization). Standard attention → GQA (Grouped-Query Attention, more memory efficient).
Key insight: Llama’s success came not from a single breakthrough but from combining the best proven ideas: RMSNorm + SwiGLU + RoPE + GQA + pre-norm. This “best practices” approach achieved GPT-3 level performance with less than 10% of the parameters. The lesson: architecture matters, but careful engineering matters more.
Side-by-Side
# Original Transformer (Vaswani et al., 2017):
#   - Post-norm (norm after sublayer)
#   - LayerNorm
#   - ReLU activation
#   - Sinusoidal positional encoding
#   - Standard multi-head attention
#   - Encoder-decoder architecture
#
# Modern LLM (Llama 3, 2024):
#   - Pre-norm (norm before sublayer)
#   - RMSNorm (faster, no mean)
#   - SwiGLU activation (gated)
#   - RoPE positional encoding
#   - Grouped-Query Attention (GQA)
#   - Decoder-only architecture
#
# GQA: share K,V across head groups
#   Llama 3 8B: 32 Q heads, 8 KV heads
#   → 4× less KV cache memory
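The 4× KV-cache saving from GQA is simple arithmetic. This back-of-envelope sketch assumes fp16 storage, head_dim=128, 32 layers, and an 8192-token context (illustrative numbers chosen to match Llama 3 8B's public config):

```python
def kv_cache_bytes(n_kv_heads, head_dim=128, n_layers=32,
                   seq_len=8192, batch=1, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape
    # (batch, n_kv_heads, seq_len, head_dim).
    return (2 * n_layers * batch * n_kv_heads
            * seq_len * head_dim * dtype_bytes)

mha = kv_cache_bytes(n_kv_heads=32)  # every Q head has its own K,V
gqa = kv_cache_bytes(n_kv_heads=8)   # Llama 3 8B: 8 shared KV heads

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
print(mha // gqa)  # 4
```

The cache size depends only on the number of KV heads, not Q heads, which is why shrinking 32 KV heads to 8 cuts memory by exactly 4× without touching the model's query capacity.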
A Complete Transformer Block in PyTorch
The modern Llama-style block in ~25 lines
Implementation
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        # Pre-norm with RMSNorm
        self.norm1 = RMSNorm(d_model)
        self.norm2 = RMSNorm(d_model)
        # Self-attention (from Ch 3)
        self.attn = CausalSelfAttention(d_model, n_heads)
        # Feed-forward with SwiGLU
        self.ffn = SwiGLU_FFN(d_model, d_ff)

    def forward(self, x):
        # Sublayer 1: attention + residual
        x = x + self.attn(self.norm1(x))
        # Sublayer 2: FFN + residual
        x = x + self.ffn(self.norm2(x))
        return x
That’s It. Really.
The entire transformer block is: normalize → attend → add residual → normalize → feed-forward → add residual. Two sublayers, two residual connections, two normalizations. Stack this 32-120 times and you have a modern LLM. The simplicity is the beauty — and why it scales so well.
Parameter Count
# Params per block (Llama 3 8B, with GQA):
#   d_model=4096, n_heads=32, n_kv_heads=8, d_ff=14336
#
# Attention: W_Q, W_O: 4096 × 4096 each
#            W_K, W_V: 4096 × 1024 each (8 KV heads × 128)
#   = 2 × 16.8M + 2 × 4.2M ≈ 42M params
#
# FFN (SwiGLU): W₁ + W₃ + W₂
#   = 2 × (4096 × 14336) + (14336 × 4096)
#   ≈ 176M params
#
# Norms: 2 × 4096 = 8K (negligible)
#
# Total per block: ~218M params
#   × 32 layers ≈ 7.0B
#   + input embeddings and output head
#     (2 × 128256 × 4096 ≈ 1.05B) ≈ 8B total ✓
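The budget above can be recomputed in a few lines. Shapes are taken from Llama 3 8B's public configuration (untied input embedding and output head assumed):

```python
d_model, d_ff, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 128256

attn = (d_model * n_heads * head_dim            # W_Q
        + 2 * d_model * n_kv_heads * head_dim   # W_K, W_V (GQA)
        + n_heads * head_dim * d_model)         # W_O
ffn = 3 * d_model * d_ff                        # W1, W3, W2 (SwiGLU)
norms = 2 * d_model                             # two RMSNorm weights

per_block = attn + ffn + norms
embed = vocab * d_model                         # input embedding
head = vocab * d_model                          # untied output head

total = n_layers * per_block + embed + head + d_model  # + final norm
print(f"per block: {per_block / 1e6:.0f}M")     # per block: 218M
print(f"total:     {total / 1e9:.2f}B")         # total:     8.03B
```

Note how the FFN dominates: it holds roughly 4× more parameters per block than attention, which supports the "FFN as knowledge storage" view from earlier.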
Stacking Blocks: From Shallow to Deep
What happens as tokens flow through 32, 80, or 120 layers
What Each Layer Does
Research (Tenney et al., 2019; Elhage et al., 2022) reveals a progression:
Early layers (1-10): learn syntax, part-of-speech, simple patterns.
Middle layers (10-60): learn semantics, entity relationships, factual knowledge.
Late layers (60+): learn task-specific reasoning, output formatting, next-token prediction.
The representation gets progressively more abstract and task-relevant.
The complete picture: A transformer is: embeddings → N × (norm → attention → residual → norm → FFN → residual) → final norm → output projection. That’s the entire architecture. The original 2017 design was elegant; modern refinements (RMSNorm, SwiGLU, RoPE, GQA) made it practical at scale. Every LLM you use — GPT-4, Claude, Llama, Gemini — is this block, repeated.
The Full Model
class LLM(nn.Module):
    def __init__(self, vocab, d, n_heads, d_ff, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([
            TransformerBlock(d, n_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(d)
        self.head = nn.Linear(d, vocab)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.head(x)
        return logits

# Llama 3 8B:
model = LLM(
    vocab=128256, d=4096, n_heads=32,
    d_ff=14336, n_layers=32
)
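For a runnable end-to-end check with toy dimensions, here is a self-contained sketch. To keep it short, LayerNorm and GELU stand in for RMSNorm and SwiGLU, and attention uses PyTorch's scaled_dot_product_attention with a causal mask; the block and model structure are otherwise the same as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Minimal pre-norm transformer block for shape checking."""
    def __init__(self, d, n_heads, d_ff):
        super().__init__()
        self.n_heads = n_heads
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))

    def attn(self, x):
        B, T, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, d) -> (B, heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, d // self.n_heads)
                    .transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, T, d))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # sublayer 1 + residual
        return x + self.ffn(self.norm2(x)) # sublayer 2 + residual

class ToyLLM(nn.Module):
    def __init__(self, vocab=1000, d=64, n_heads=4, d_ff=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(
            Block(d, n_heads, d_ff) for _ in range(n_layers))
        self.norm = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab, bias=False)

    def forward(self, ids):
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x)
        return self.head(self.norm(x))

model = ToyLLM()
logits = model(torch.randint(0, 1000, (2, 16)))  # (batch=2, seq=16)
print(logits.shape)  # torch.Size([2, 16, 1000])
```

The output has one logit vector per token position: each position predicts its next token, which is exactly what the training loss consumes.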