Ch 12 — The Transformer Architecture

Encoder-decoder, positional encoding, “Attention Is All You Need,” and the bridge to LLMs
High Level
Tokens → Position → Self-Attn → FFN → Stack N× → LLMs
“Attention Is All You Need”
Vaswani et al. (NeurIPS 2017) — the paper that changed AI
The Radical Proposal
In June 2017, eight Google researchers published a paper with a provocative title: “Attention Is All You Need.” They proposed the Transformer — a sequence-to-sequence model built entirely from attention mechanisms, with no recurrence and no convolutions. It achieved 28.4 BLEU on English-to-German translation (a new state of the art) while training in just 3.5 days on 8 GPUs — far faster than RNN-based models. The Transformer’s parallelism and scalability made it the foundation for every major AI breakthrough since: BERT, GPT, LLaMA, Gemini, Claude, and beyond.
The Original Transformer
// Transformer (Vaswani et al., 2017)
Model dim:  d_model = 512
Heads:      h = 8
Layers:     N = 6 (encoder) + 6 (decoder)
FFN dim:    d_ff = 2048
Parameters: ~65M

// Training:
// 8× NVIDIA P100 GPUs, 3.5 days
// Adam, warmup + inverse sqrt decay
// Label smoothing 0.1
// Dropout 0.1
Key insight: The Transformer’s 65M parameters seem tiny by today’s standards (GPT-4 has ~1.8 trillion). But the architecture scaled perfectly — the same design works from 65M to 1.8T parameters, which is why it became universal.
Positional Encoding
Giving the Transformer a sense of order
The Position Problem
Self-attention is permutation-invariant — it treats “cat sat mat” and “mat cat sat” identically because it only computes pairwise similarities, not positions. But word order matters! Positional encodings inject position information into the input embeddings. The original Transformer used sinusoidal encodings: each position gets a unique pattern of sine and cosine waves at different frequencies. Modern models use learned positional embeddings (GPT-2, GPT-3) or Rotary Position Embeddings (RoPE, used by LLaMA), which encode relative positions directly in the attention computation.
Positional Encoding
// Sinusoidal positional encoding
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

// pos = position in sequence
// i   = dimension index
// d   = model dimension

// Input = token_embedding + pos_encoding

// Modern alternatives:
Learned: trainable embedding per position
RoPE:    rotates Q,K vectors by position (relative position in attention)
ALiBi:   adds linear bias to attention scores
Key insight: RoPE (Su et al., 2021) encodes position by rotating query and key vectors, making attention scores naturally depend on relative distance. It generalizes better to longer sequences than absolute position embeddings and is used in LLaMA, Mistral, and most modern LLMs.
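The sinusoidal formulas translate directly into a few lines of NumPy — a minimal sketch (the function name is ours, and it assumes an even d_model):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...)
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims get sine
    pe[:, 1::2] = np.cos(angles)             # odd dims get cosine
    return pe

# Input to the first layer: token_embedding + sinusoidal_pe(seq_len, d_model)
```

Low dimension indices oscillate quickly and high indices slowly, so each position receives a unique fingerprint while nearby positions get similar ones.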
The Transformer Block
The repeating unit: attention + feedforward + residuals
Anatomy of a Layer
Each Transformer layer (block) has two sub-layers: multi-head self-attention and a position-wise feedforward network (FFN). Each sub-layer has a residual connection (like ResNet’s skip connections) and layer normalization. The FFN is a simple 2-layer MLP applied independently to each position: Linear(d_model → d_ff) → GELU → Linear(d_ff → d_model). The FFN is where the model stores “knowledge” — factual associations learned during training. The attention layers handle relationships between positions.
Transformer Block
// Pre-norm Transformer block (modern)
def transformer_block(x):
    // Sub-layer 1: Multi-head attention
    x = x + MultiHeadAttn(LayerNorm(x))
    // Sub-layer 2: Feedforward network
    x = x + FFN(LayerNorm(x))
    return x

// FFN (position-wise)
def FFN(x):
    return Linear(GELU(Linear(x)))
    // d_model → 4×d_model → d_model

// Stack N blocks: GPT-2 has 12-48,
// GPT-3 has 96, LLaMA-70B has 80
Key insight: The residual connections are critical — they create “gradient highways” that let gradients flow through 100+ layers without vanishing. This is the same principle as ResNet’s skip connections, applied to sequence models.
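The pseudocode above can be fleshed out into a runnable NumPy sketch. This is a simplification: single-head attention instead of multi-head, ReLU instead of GELU, no learned LayerNorm gain/bias, and the weight names (`Wq`, `W1`, etc.) are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, W1, W2):
    # Position-wise 2-layer MLP (ReLU here for brevity)
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, p):
    # Pre-norm block: normalize, transform, add residual
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x
```

Note that both sub-layers map (seq_len, d_model) → (seq_len, d_model), which is what makes the blocks stackable to arbitrary depth.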
Causal Masking
Preventing the model from seeing the future
The Autoregressive Constraint
In language generation, the model predicts one token at a time, left to right. When predicting token 5, it must not see tokens 6, 7, 8... This is enforced with a causal mask (also called an attention mask): a triangular matrix that sets attention scores to -∞ for future positions, making their softmax weights zero. This ensures each position can only attend to itself and earlier positions. BERT uses no mask (bidirectional); GPT uses a causal mask (autoregressive). This is the fundamental difference between encoder-only and decoder-only architectures.
Causal Mask
// Causal attention mask
// 1 = can attend, 0 = masked (-∞)

      t₁ t₂ t₃ t₄
t₁ [  1  0  0  0 ]  // sees only t₁
t₂ [  1  1  0  0 ]  // sees t₁, t₂
t₃ [  1  1  1  0 ]  // sees t₁-t₃
t₄ [  1  1  1  1 ]  // sees all

// Applied before softmax:
// scores[mask == 0] = -∞
// softmax(-∞) = 0 → no attention
Key insight: Causal masking enables parallel training — all positions are computed simultaneously, but each only sees past context. During inference, tokens are generated one at a time (autoregressive), using KV caching to avoid recomputation.
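The mask and its effect on the softmax can be verified in a few lines of NumPy (a minimal sketch; the function names are ours):

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular boolean matrix: position t may attend to 0..t
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # Set masked-out scores to -inf so their softmax weight is exactly 0
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)
```

With uniform (all-zero) scores, row t of `masked_softmax` spreads attention equally over positions 0..t and puts exactly zero weight on the future — the triangular pattern shown above.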
Encoder-Only, Decoder-Only & Encoder-Decoder
Three Transformer variants for different tasks
Three Architectures
Encoder-only (BERT): bidirectional self-attention, no causal mask. Excels at understanding tasks (classification, NER, similarity). Decoder-only (GPT): causal self-attention, generates text left-to-right. Dominates language generation and has become the default for LLMs. Encoder-decoder (T5, original Transformer): encoder processes input bidirectionally, decoder generates output autoregressively with cross-attention to the encoder. Used for translation, summarization.
Architecture Comparison
// Transformer variants

Encoder-only (BERT, RoBERTa):
  Bidirectional attention (no mask)
  → classification, NER, embeddings

Decoder-only (GPT, LLaMA, Claude):
  Causal attention (triangular mask)
  → text generation, chat, reasoning

Encoder-decoder (T5, BART, Whisper):
  Encoder: bidirectional
  Decoder: causal + cross-attention
  → translation, summarization, ASR

// Decoder-only has won:
// GPT-4, Claude, Gemini, LLaMA are all
// decoder-only transformers
Key insight: The decoder-only architecture won because it’s simpler (one stack of layers), scales better, and can handle both understanding and generation through in-context learning. GPT showed that a single architecture trained on next-token prediction can do almost anything.
Scaling Laws
Why bigger Transformers keep getting better
The Scaling Hypothesis
Kaplan et al. (OpenAI, 2020) discovered that Transformer performance follows power laws: loss decreases predictably as you increase model size, dataset size, and compute. Double the parameters → predictable improvement. This means you can predict how well a model will perform before training it, enabling efficient resource allocation. The Chinchilla scaling laws (Hoffmann et al., 2022) refined this: for a given compute budget, there’s an optimal ratio of model size to training tokens (~20 tokens per parameter).
Transformer Scale Over Time
// Transformer model sizes
Transformer (2017):  65M params
BERT-Large (2018):   340M params
GPT-2 (2019):        1.5B params
GPT-3 (2020):        175B params
PaLM (2022):         540B params
GPT-4 (2023):        ~1.8T params (MoE)
LLaMA-3 (2024):      405B params

// ~27,000× increase in 7 years
// Same core architecture throughout
Key insight: The Transformer architecture has scaled from 65M to 1.8 trillion parameters with no fundamental changes. This remarkable scalability — combined with predictable scaling laws — is why the Transformer became the universal architecture for AI.
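The Chinchilla rule of thumb is simple enough to apply directly (the 20 tokens/parameter ratio is the approximate compute-optimal figure from Hoffmann et al., 2022; the function name is ours):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens: ~20 per parameter."""
    return 20.0 * n_params

# A 7B-parameter model is compute-optimal at roughly
# chinchilla_optimal_tokens(7e9) = 1.4e11 (~140B tokens)
```

In practice, production LLMs are often trained well past this point ("overtraining") because a smaller model trained on more tokens is cheaper to serve at inference time.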
Modern Transformer Improvements
GQA, SwiGLU, RMSNorm, and KV caching
Post-2017 Refinements
The core Transformer design has been refined significantly: Grouped-Query Attention (GQA) shares key/value heads across query heads, reducing memory during inference. SwiGLU activation (Shazeer, 2020) replaces ReLU/GELU in the FFN for better performance. RMSNorm replaces LayerNorm for efficiency. Pre-norm (normalize before attention) replaces post-norm for more stable training. KV caching stores computed key/value pairs to avoid recomputation during autoregressive generation. These are all used in LLaMA, Mistral, and most modern LLMs.
Modern LLM Architecture
// LLaMA-style Transformer (2023+)
Normalization: RMSNorm (pre-norm)
Attention:     GQA (grouped-query)
Position:      RoPE (rotary embeddings)
Activation:    SwiGLU in FFN
FFN ratio:     d_ff = 8/3 × d_model
Vocab:         BPE tokenizer (32K-128K)
Training:      AdamW, cosine decay
Inference:     KV cache, speculative decoding

// vs. Original Transformer (2017):
// LayerNorm (post), MHA, sinusoidal,
// ReLU, d_ff = 4×d_model
Key insight: These refinements are evolutionary, not revolutionary. The core idea — stacked self-attention + FFN blocks with residual connections — hasn’t changed since 2017. The improvements are about efficiency and scale, not architecture.
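As an example of how small these refinements are, SwiGLU replaces the FFN's single activation with a gated product — a minimal NumPy sketch (weight names are ours; real implementations wrap this in learned linear layers inside a full block):

```python
import numpy as np

def swish(x):
    # Swish/SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU FFN (Shazeer, 2020): elementwise product of a gated path
    # and a linear path, then a down-projection back to d_model
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down
```

Because SwiGLU uses three weight matrices instead of two, LLaMA shrinks d_ff to 8/3 × d_model so the parameter count matches a standard 4× FFN.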
The Bridge to LLMs
From deep learning fundamentals to the AI revolution
Everything Connects
This course traced a path from the 1943 McCulloch-Pitts neuron to the 2017 Transformer. Every concept builds on the last: perceptrons → backpropagation → CNNs (spatial hierarchy) → RNNs (sequence memory) → LSTMs (gating) → attention (direct connections) → Transformers (attention is all you need). The Transformer combined the best ideas from 74 years of neural network research into a single, scalable architecture. GPT, BERT, LLaMA, Claude, Gemini — they are all Transformers, trained on the same principles covered in this course.
Course complete: You now understand the foundations that power every modern AI system. To go deeper into how LLMs specifically work, explore the How LLMs Work course. For practical applications, see Prompt Engineering and AI-Assisted Coding.
The Full Journey
// Deep Learning Fundamentals → LLMs
1943: McCulloch-Pitts neuron
1958: Perceptron (learning from data)
1986: Backpropagation (training deep nets)
1989: CNNs (spatial hierarchy)
1997: LSTMs (gated memory)
2012: AlexNet (GPU + deep learning)
2015: ResNet (skip connections)
2015: Attention (direct connections)
2017: Transformer (attention only)
2018: BERT + GPT (pretrained LLMs)
2020: GPT-3 (scaling laws)
2022: ChatGPT (RLHF + scale)
2024+: Reasoning, agents, multimodal