The Original Design
The original 2017 Transformer had two halves:
Encoder — Reads and understands the input. Uses bidirectional attention: each token can attend to every other token, both before and after it. Produces a rich representation of the input’s meaning.
Decoder — Generates the output one token at a time. Uses masked (causal) attention: each token can attend only to the tokens before it (it can’t peek at the future). This is what makes autoregressive generation possible.
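The difference between the two attention patterns comes down to a single mask. A minimal NumPy sketch of that idea (the function name and toy inputs are invented for illustration, not from any library):

```python
import numpy as np

def attention_weights(q, k, causal=False):
    """Scaled dot-product attention weights for one sequence.

    causal=False: encoder-style, every position attends everywhere.
    causal=True:  decoder-style, position i attends only to 0..i.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # strictly-upper-triangular entries are the "future" positions
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # softmax over keys; exp(-inf) = 0, so masked positions get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8-dim vectors
enc = attention_weights(x, x)               # dense: every token sees every token
dec = attention_weights(x, x, causal=True)  # lower-triangular: no peeking ahead
```

The only structural difference between the encoder and decoder attention here is the `causal` flag; everything else is identical.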
BERT: Encoder-Only
Google’s BERT (2018) uses only the encoder. It reads text bidirectionally — understanding each word in the context of all surrounding words. This makes it excellent at understanding tasks: classification, search ranking, sentiment analysis, question answering, named entity recognition. BERT powers Google Search’s understanding of queries.
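BERT itself is a large pretrained network, but the core idea behind its masked-language-model training — guessing a hidden word from context on both sides — can be illustrated with a deliberately tiny stand-in (the corpus, counts-based "model", and function names are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy bidirectional fill-in-the-blank: predict a hidden word from BOTH
# its left and right neighbors. A made-up corpus stands in for pretraining data.
corpus = "the bank of the river the bank approved the loan".split()

# Count which words appear between each (left_neighbor, right_neighbor) pair.
ctx = defaultdict(Counter)
for i in range(1, len(corpus) - 1):
    ctx[(corpus[i - 1], corpus[i + 1])][corpus[i]] += 1

def fill_mask(left, right):
    # Most likely hidden word given the words on both sides, else None.
    candidates = ctx[(left, right)]
    return candidates.most_common(1)[0][0] if candidates else None
```

A left-to-right model seeing only "the ..." couldn't distinguish the two senses of "bank" above; using the right-hand context as well is what the bidirectional setup buys.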
GPT: Decoder-Only
OpenAI’s GPT series uses only the decoder. It reads text left-to-right and predicts the next token. This makes it excellent at generation tasks: writing, conversation, code generation, summarization, translation. GPT is the architecture behind ChatGPT, and decoder-only models now dominate the industry (Claude, Gemini, LLaMA all use this approach).
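Decoder-style generation is a loop: predict the next token from everything written so far, append it, repeat. A toy sketch of that loop using bigram counts as the "model" (real GPT models condition on the whole prefix with a neural network; the corpus here is made up):

```python
from collections import Counter, defaultdict

# Count next-token frequencies from a tiny made-up corpus.
corpus = "the model reads text and the model writes text and stops".split()
bigram = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram[prev][nxt] += 1

def generate(prompt, steps=4):
    # Greedy left-to-right decoding: always take the most frequent next token.
    tokens = prompt.split()
    for _ in range(steps):
        candidates = bigram[tokens[-1]].most_common(1)
        if not candidates:
            break  # no known continuation
        tokens.append(candidates[0][0])
    return " ".join(tokens)
```

Swapping the greedy `most_common(1)` for sampling from the full next-token distribution is, loosely, the difference between deterministic and "temperature"-style generation.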
Key insight: The encoder-decoder distinction maps to a business decision: do you need to understand existing text or generate new text?
Search, classification, and analysis → encoder models (BERT family).
Chatbots, content generation, and creative tasks → decoder models (GPT family).
Translation and summarization → full encoder-decoder models (T5, BART).
Most modern systems are decoder-only, because generation subsumes understanding: a model that can write an answer can also be prompted to classify, summarize, or translate.