Encoder-Only (BERT, 2018):
Bidirectional self-attention
Each token sees ALL other tokens
Task: understanding (classification, NER)
Training: masked language modeling (MLM)
Decoder-Only (GPT, 2018):
Causal (masked) self-attention
Each token sees only PREVIOUS tokens
Task: generation (text, code, chat)
Training: next-token prediction
Encoder-Decoder (original Transformer, 2017; T5, 2019):
Encoder: bidirectional
Decoder: causal + cross-attention to encoder
Task: translation, summarization
Training: span corruption / denoising
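The three attention patterns above differ only in which (query, key) pairs are allowed. A minimal sketch in NumPy (the function names are mine, for illustration):

```python
import numpy as np

def bidirectional_mask(n):
    # Encoder-style: every position may attend to every other position.
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    # Decoder-style: position i may attend only to positions j <= i
    # (lower-triangular matrix of allowed pairs).
    return np.tril(np.ones((n, n), dtype=bool))

# For n = 3: the encoder mask is all True; the decoder mask blocks
# the upper triangle (future positions).
print(bidirectional_mask(3).astype(int))
print(causal_mask(3).astype(int))
```

In the encoder-decoder case, the decoder's self-attention uses the causal mask while its cross-attention to the encoder output is unmasked, since the full source sequence is available.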
The Causal Mask
Decoder models use a causal mask that prevents each position from attending to future tokens. When generating “The cat sat,” the model predicting “sat” can only see “The” and “cat” — not future words. This is what enables autoregressive generation, and at training time it lets every position be predicted in parallel from a single forward pass.
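In practice the mask is applied to the raw attention scores: disallowed positions are set to negative infinity so that softmax assigns them zero weight. A hedged sketch (simplified single-head attention, no scaling or batching):

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: (n, n) raw query-key logits for one head.
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))  # True where attending is allowed
    scores = np.where(mask, scores, -np.inf)     # block future positions
    # Row-wise softmax; the -inf entries exponentiate to exactly 0.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

With uniform scores, row i spreads its weight evenly over positions 0..i and puts exactly zero weight on every future position.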
The GPT family won. Decoder-only models dominate modern AI: GPT-4, Claude, Llama, Gemini, Mistral. The simplicity of next-token prediction + massive scale proved more powerful than the architectural complexity of encoder-decoder models. BERT-style models remain useful for classification and retrieval.