Ch 12 — The Transformer Architecture

Encoder-decoder, positional encoding, “Attention Is All You Need,” and the bridge to LLMs
High Level
Tokens → Position → Self-Attn → FFN → Stack N× → LLMs
“Attention Is All You Need”
Vaswani et al. (NeurIPS 2017) — the paper that changed AI
The Radical Proposal
In June 2017, eight Google researchers published a paper with a provocative title: “Attention Is All You Need.” They proposed the Transformer — a sequence-to-sequence model built entirely from attention mechanisms, with no recurrence and no convolutions. It achieved 28.4 BLEU on English-to-German translation (a new state of the art) while training in just 3.5 days on 8 GPUs — far faster than RNN-based models. The Transformer’s parallelism and scalability made it the foundation for every major AI breakthrough since: BERT, GPT, LLaMA, Gemini, Claude, and beyond.
The Original Transformer
// Transformer (Vaswani et al., 2017)
Model dim:  d_model = 512
Heads:      h = 8
Layers:     N = 6 (encoder) + 6 (decoder)
FFN dim:    d_ff = 2048
Parameters: ~65M

// Training:
// 8× NVIDIA P100 GPUs, 3.5 days
// Adam, warmup + inverse sqrt decay
// Label smoothing 0.1
// Dropout 0.1
Key insight: The Transformer’s 65M parameters seem tiny by today’s standards (GPT-4 has ~1.8 trillion). But the architecture scaled perfectly — the same design works from 65M to 1.8T parameters, which is why it became universal.
Positional Encoding
Giving the Transformer a sense of order
The Position Problem
Self-attention is permutation-invariant — it treats “cat sat mat” and “mat cat sat” identically because it only computes pairwise similarities, not positions. But word order matters! Positional encodings inject position information into the input embeddings. The original Transformer used sinusoidal encodings: each position gets a unique pattern of sine and cosine waves at different frequencies. Modern models use learned positional embeddings (GPT-2, GPT-3) or Rotary Position Embeddings (RoPE, used by LLaMA), which encode relative positions directly in the attention computation.
Positional Encoding
// Sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

// pos = position in sequence
// i   = dimension index
// d   = model dimension

// Input = token_embedding + pos_encoding

// Modern alternatives:
Learned: trainable embedding per position
RoPE:    rotates Q,K vectors by position (relative position in attention)
ALiBi:   adds linear bias to attention scores
Key insight: RoPE (Su et al., 2021) encodes position by rotating query and key vectors, making attention scores naturally depend on relative distance. It generalizes better to longer sequences than absolute position embeddings and is used in LLaMA, Mistral, and most modern LLMs.
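The sinusoidal formulas translate directly into a few lines of NumPy — a minimal sketch (the function name is ours, and it assumes an even d_model):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...)
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims get sine
    pe[:, 1::2] = np.cos(angles)             # odd dims get cosine
    return pe

# Input to the first layer: token_embedding + sinusoidal_pe(seq_len, d_model)
```

Low dimension indices oscillate quickly and high indices slowly, so each position receives a unique fingerprint while nearby positions get similar ones.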
The Transformer Block
The repeating unit: attention + feedforward + residuals
Anatomy of a Layer
Each Transformer layer (block) has two sub-layers: multi-head self-attention and a position-wise feedforward network (FFN). Each sub-layer has a residual connection (like ResNet’s skip connections) and layer normalization. The FFN is a simple 2-layer MLP applied independently to each position: Linear(d_model → d_ff) → GELU → Linear(d_ff → d_model). The FFN is where the model stores “knowledge” — factual associations learned during training. The attention layers handle relationships between positions.
Transformer Block
// Pre-norm Transformer block (modern)
def transformer_block(x):
    // Sub-layer 1: Multi-head attention
    x = x + MultiHeadAttn(LayerNorm(x))
    // Sub-layer 2: Feedforward network
    x = x + FFN(LayerNorm(x))
    return x

// FFN (position-wise)
def FFN(x):
    return Linear(GELU(Linear(x)))
    // d_model → 4×d_model → d_model

// Stack N blocks: GPT-2 has 12-48,
// GPT-3 has 96, LLaMA-70B has 80
Key insight: The residual connections are critical — they create “gradient highways” that let gradients flow through 100+ layers without vanishing. This is the same principle as ResNet’s skip connections, applied to sequence models.
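The pseudocode above can be fleshed out into a runnable NumPy sketch. This is a simplification: single-head attention instead of multi-head, ReLU instead of GELU, no learned LayerNorm gain/bias, and the weight names (`Wq`, `W1`, etc.) are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, W1, W2):
    # Position-wise 2-layer MLP (ReLU here for brevity)
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, p):
    # Pre-norm block: normalize, transform, add residual
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x
```

Note that both sub-layers map (seq_len, d_model) → (seq_len, d_model), which is what makes the blocks stackable to arbitrary depth.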
Causal Masking
Preventing the model from seeing the future
The Autoregressive Constraint
In language generation, the model predicts one token at a time, left to right. When predicting token 5, it must not see tokens 6, 7, 8... This is enforced with a causal mask (also called an attention mask): a triangular matrix that sets attention scores to -∞ for future positions, making their softmax weights zero. This ensures each position can only attend to itself and earlier positions. BERT uses no mask (bidirectional); GPT uses a causal mask (autoregressive). This is the fundamental difference between encoder-only and decoder-only architectures.
Causal Mask
// Causal attention mask
// 1 = can attend, 0 = masked (-∞)

      t₁ t₂ t₃ t₄
t₁ [  1  0  0  0 ]  // sees only t₁
t₂ [  1  1  0  0 ]  // sees t₁, t₂
t₃ [  1  1  1  0 ]  // sees t₁-t₃
t₄ [  1  1  1  1 ]  // sees all

// Applied before softmax:
// scores[mask == 0] = -∞
// softmax(-∞) = 0 → no attention
Key insight: Causal masking enables parallel training — all positions are computed simultaneously, but each only sees past context. During inference, tokens are generated one at a time (autoregressive), using KV caching to avoid recomputation.
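The mask and its effect on the softmax can be verified in a few lines of NumPy (a minimal sketch; the function names are ours):

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular boolean matrix: position t may attend to 0..t
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # Set masked-out scores to -inf so their softmax weight is exactly 0
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)
```

With uniform (all-zero) scores, row t of `masked_softmax` spreads attention equally over positions 0..t and puts exactly zero weight on the future — the triangular pattern shown above.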
Encoder-Only, Decoder-Only & Encoder-Decoder
Three Transformer variants for different tasks
Three Architectures
Encoder-only (BERT): bidirectional self-attention, no causal mask. Excels at understanding tasks (classification, NER, similarity). Decoder-only (GPT): causal self-attention, generates text left-to-right. Dominates language generation and has become the default for LLMs. Encoder-decoder (T5, original Transformer): encoder processes input bidirectionally, decoder generates output autoregressively with cross-attention to the encoder. Used for translation, summarization.
Architecture Comparison
// Transformer variants

Encoder-only (BERT, RoBERTa):
  Bidirectional attention (no mask)
  → classification, NER, embeddings

Decoder-only (GPT, LLaMA, Claude):
  Causal attention (triangular mask)
  → text generation, chat, reasoning

Encoder-decoder (T5, BART, Whisper):
  Encoder: bidirectional
  Decoder: causal + cross-attention
  → translation, summarization, ASR

// Decoder-only has won:
// GPT-4, Claude, Gemini, LLaMA are all
// decoder-only transformers
Key insight: The decoder-only architecture won because it’s simpler (one stack of layers), scales better, and can handle both understanding and generation through in-context learning. GPT showed that a single architecture trained on next-token prediction can do almost anything.
Scaling Laws
Why bigger Transformers keep getting better
The Scaling Hypothesis
Kaplan et al. (OpenAI, 2020) discovered that Transformer performance follows power laws: loss decreases predictably as you increase model size, dataset size, and compute. Double the parameters → predictable improvement. This means you can predict how well a model will perform before training it, enabling efficient resource allocation. The Chinchilla scaling laws (Hoffmann et al., 2022) refined this: for a given compute budget, there’s an optimal ratio of model size to training tokens (~20 tokens per parameter).
Transformer Scale Over Time
// Transformer model sizes
Transformer (2017):  65M params
BERT-Large (2018):   340M params
GPT-2 (2019):        1.5B params
GPT-3 (2020):        175B params
PaLM (2022):         540B params
GPT-4 (2023):        ~1.8T params (MoE)
LLaMA-3 (2024):      405B params

// ~27,000× increase in 7 years
// Same core architecture throughout
Key insight: The Transformer architecture has scaled from 65M to 1.8 trillion parameters with no fundamental changes. This remarkable scalability — combined with predictable scaling laws — is why the Transformer became the universal architecture for AI.
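The Chinchilla rule of thumb is simple enough to apply directly (the 20 tokens/parameter ratio is the approximate compute-optimal figure from Hoffmann et al., 2022; the function name is ours):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens: ~20 per parameter."""
    return 20.0 * n_params

# A 7B-parameter model is compute-optimal at roughly
# chinchilla_optimal_tokens(7e9) = 1.4e11 (~140B tokens)
```

In practice, production LLMs are often trained well past this point ("overtraining") because a smaller model trained on more tokens is cheaper to serve at inference time.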
Modern Transformer Improvements
GQA, SwiGLU, RMSNorm, and KV caching
Post-2017 Refinements
The core Transformer design has been refined significantly: Grouped-Query Attention (GQA) shares key/value heads across query heads, reducing memory during inference. SwiGLU activation (Shazeer, 2020) replaces ReLU/GELU in the FFN for better performance. RMSNorm replaces LayerNorm for efficiency. Pre-norm (normalize before attention) replaces post-norm for more stable training. KV caching stores computed key/value pairs to avoid recomputation during autoregressive generation. These are all used in LLaMA, Mistral, and most modern LLMs.
Modern LLM Architecture
// LLaMA-style Transformer (2023+)
Normalization: RMSNorm (pre-norm)
Attention:     GQA (grouped-query)
Position:      RoPE (rotary embeddings)
Activation:    SwiGLU in FFN
FFN ratio:     d_ff = 8/3 × d_model
Vocab:         BPE tokenizer (32K-128K)
Training:      AdamW, cosine decay
Inference:     KV cache, speculative decoding

// vs. Original Transformer (2017):
// LayerNorm (post), MHA, sinusoidal,
// ReLU, d_ff = 4×d_model
Key insight: These refinements are evolutionary, not revolutionary. The core idea — stacked self-attention + FFN blocks with residual connections — hasn’t changed since 2017. The improvements are about efficiency and scale, not architecture.
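As an example of how small these refinements are, SwiGLU replaces the FFN's single activation with a gated product — a minimal NumPy sketch (weight names are ours; real implementations wrap this in learned linear layers inside a full block):

```python
import numpy as np

def swish(x):
    # Swish/SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU FFN (Shazeer, 2020): elementwise product of a gated path
    # and a linear path, then a down-projection back to d_model
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down
```

Because SwiGLU uses three weight matrices instead of two, LLaMA shrinks d_ff to 8/3 × d_model so the parameter count matches a standard 4× FFN.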
The Bridge to LLMs
From deep learning fundamentals to the AI revolution
Everything Connects
This course traced a path from the 1943 McCulloch-Pitts neuron to the 2017 Transformer. Every concept builds on the last: perceptrons → backpropagation → CNNs (spatial hierarchy) → RNNs (sequence memory) → LSTMs (gating) → attention (direct connections) → Transformers (attention is all you need). The Transformer combined the best ideas from 74 years of neural network research into a single, scalable architecture. GPT, BERT, LLaMA, Claude, Gemini — they are all Transformers, trained on the same principles covered in this course.
Course complete: You now understand the foundations that power every modern AI system. To go deeper into how LLMs specifically work, explore the How LLMs Work course. For practical applications, see Prompt Engineering and AI-Assisted Coding.
The Full Journey
// Deep Learning Fundamentals → LLMs
1943: McCulloch-Pitts neuron
1958: Perceptron (learning from data)
1986: Backpropagation (training deep nets)
1989: CNNs (spatial hierarchy)
1997: LSTMs (gated memory)
2012: AlexNet (GPU + deep learning)
2015: ResNet (skip connections)
2015: Attention (direct connections)
2017: Transformer (attention only)
2018: BERT + GPT (pretrained LLMs)
2020: GPT-3 (scaling laws)
2022: ChatGPT (RLHF + scale)
2024+: Reasoning, agents, multimodal