The Core Idea
A language model assigns probabilities to sequences of words. Given a context, it predicts what comes next. "The cat sat on the ___": a good language model assigns high probability to "mat" and low probability to "democracy."

Formally, a language model computes P(w1, w2, ..., wn), the probability of a sequence. By the chain rule, this decomposes into P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ...

Language models are the foundation of modern NLP. Spell checkers, autocomplete, machine translation, and speech recognition all rely on them, and every LLM from GPT to Claude is a language model at its core. The entire history of NLP can be told through the evolution of language models: from counting word sequences to neural networks that generate human-quality text.
Language Model Basics
Core task: predict the next word
"The cat sat on the ___"
P(mat) = 0.15 (high)
P(floor) = 0.08
P(democracy) = 0.0001 (low)
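A minimal sketch of next-word prediction, using bigram counts over a tiny made-up corpus (the corpus and the resulting probabilities here are illustrative, not the numbers above; a real model is trained on billions of words):

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus for illustration only.
corpus = "the cat sat on the mat . the dog sat on the floor .".split()

# Count how often each word follows each context word.
follows = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    follows[prev][word] += 1

def next_word_probs(context):
    """Estimate P(word | context) from bigram counts."""
    counts = follows[context]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# After "the", the model spreads probability over words it has seen follow "the".
print(next_word_probs("the"))  # "cat", "mat", "dog", "floor": 0.25 each
```

Even this crude count-based estimator captures the core task: given a context, rank candidate next words by probability.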
Chain rule decomposition:
P("the cat sat") =
P("the") ×
P("cat" | "the") ×
P("sat" | "the cat")
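The chain rule product above can be sketched in code. Note one simplification: instead of conditioning on the full history P(wi | w1..wi-1), this sketch uses a bigram (Markov) approximation that conditions only on the previous word, which is how classic count-based models made the product tractable. The corpus is a hypothetical toy example:

```python
from collections import Counter

# Hypothetical toy corpus for illustration only.
corpus = "the cat sat on the mat . the cat sat on the floor .".split()

# Maximum-likelihood unigram and bigram counts.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sequence(words):
    """Chain rule: P(w1..wn) = P(w1) * prod_i P(wi | w(i-1)).
    The bigram approximation truncates each context to one word."""
    prob = unigrams[words[0]] / len(corpus)  # P(w1)
    for prev, word in zip(words, words[1:]):
        prob *= p_next(word, prev)
    return prob

# P("the") * P("cat" | "the") * P("sat" | "cat")
print(p_sequence(["the", "cat", "sat"]))  # 4/14 * 2/4 * 2/2 = 2/14 ≈ 0.143
```

Multiplying conditional probabilities link by link is exactly the decomposition shown above; neural language models keep the same factorization but replace the counts with learned predictors.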
Applications:
Autocomplete, spell check
Machine translation
Speech recognition
Text generation (GPT, Claude)
Every LLM is a language model
Key insight: The seemingly simple task of "predict the next word" turns out to require deep understanding of grammar, facts, reasoning, and world knowledge. This is why scaling language models produces increasingly capable AI systems.