Ch 1 — What Is NLP?

The field, its 70-year evolution, the core task taxonomy, and why human language is the hardest data type
High-level timeline: Rules → Statistics → Neural → Transformers → LLMs → Today
What Is Natural Language Processing?
Teaching machines to read, write, and understand human language
The Field
Natural Language Processing (NLP) is the branch of artificial intelligence that gives machines the ability to read, understand, and generate human language. It sits at the intersection of computer science, linguistics, and machine learning. NLP is behind every spell checker, search engine, voice assistant, and chatbot you've ever used. The field tackles a deceptively hard problem: human language is ambiguous (the same word means different things in different contexts), contextual (meaning depends on what came before), and grounded in world knowledge (understanding "the trophy doesn't fit in the suitcase because it's too big" requires knowing that trophies have physical size). These properties make language the hardest data type in AI.
Why Language Is Hard
Ambiguity: "I saw her duck"
- I saw her pet duck?
- I saw her physically duck?

Context dependence: "It's cold"
- in a room → temperature
- about food → quality
- about a person → emotion

World knowledge: "The trophy doesn't fit in the suitcase because it's too big"
- What is "it"? The trophy. How do you know? Physics.
Key insight: NLP is hard because language was designed for humans who share context, not for machines that don't. Every NLP system is fundamentally trying to reconstruct the context that speakers take for granted.
The Rule-Based Era (1950s–1980s)
Hand-crafted grammars, pattern matching, and the limits of human engineering
The Early Days
NLP began with Alan Turing's 1950 question "Can machines think?" The 1954 Georgetown-IBM experiment translated 60+ Russian sentences using a vocabulary of just 250 words and six grammar rules — and researchers predicted machine translation would be solved within 3–5 years. They were off by about 60 years. ELIZA (1966) by Joseph Weizenbaum used simple pattern matching to mimic a psychotherapist, convincing some users they were talking to a real person. SHRDLU (1970s) could understand natural language commands about blocks on a table — but only blocks on a table. Noam Chomsky's formal grammars (1957) provided the theoretical foundation: language could be described by rules. The problem was that real language has too many rules, too many exceptions, and too much ambiguity for hand-crafted systems to handle.
Key Milestones
- 1950: Turing Test proposed
- 1954: Georgetown-IBM translation — 60+ Russian sentences, six rules
- 1957: Chomsky's formal grammars
- 1966: ELIZA (pattern matching)
- 1966: ALPAC report kills MT funding
- 1970s: SHRDLU (blocks world)

The problem: hand-crafted rules don't scale
- Too many exceptions
- Too much ambiguity
- Brittle outside narrow domains

"3–5 years to solve MT" — Georgetown researchers, 1954
Key insight: The rule-based era proved that language can't be fully captured by explicit rules. There are always more exceptions than rules. This insight drove the field toward learning from data instead.
The Statistical Revolution (1980s–2010s)
Let the data decide: from hand-crafted rules to learned probabilities
The Paradigm Shift
The statistical revolution replaced hand-crafted rules with probabilities learned from data. Instead of writing grammar rules, researchers trained models on large text corpora and let the statistics emerge. Hidden Markov Models (HMMs) dominated POS tagging and speech recognition. Naive Bayes and logistic regression powered text classification. Statistical machine translation (IBM Models, phrase-based MT) replaced rule-based translation. The key insight was Frederick Jelinek's famous quote: "Every time I fire a linguist, the performance of the speech recognizer goes up." Data and statistics outperformed hand-crafted linguistic knowledge. This era also introduced rigorous evaluation — BLEU scores for translation, F1 for classification — making NLP a measurable science.
Statistical NLP
Core idea: learn probabilities from data — P(word | context) estimated from corpora

Key models:
- HMMs: POS tagging, speech
- Naive Bayes: text classification
- CRFs: sequence labeling
- Phrase-based MT: translation
- N-gram language models

Key innovation: rigorous evaluation metrics — BLEU, F1, precision, recall

"Every time I fire a linguist, the performance goes up" — Frederick Jelinek
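The n-gram language models listed above can be sketched in a few lines: a bigram model estimates P(word | previous word) by maximum-likelihood counting. The two-sentence corpus below is a toy example for illustration, not real training data.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams, then estimate P(word | prev)
    as count(prev, word) / count(prev)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if unigrams[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / unigrams[prev]

    return prob

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
p = train_bigram(corpus)
print(p("sat", "cat"))   # 1.0  — "cat" is always followed by "sat"
print(p("cat", "the"))   # 0.25 — "the" occurs 4 times, once before "cat"
```

Real systems add smoothing (e.g. add-one or Kneser-Ney) so unseen bigrams don't get probability zero; this sketch omits it for clarity.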
Key insight: The statistical revolution taught NLP that data beats rules. But statistical models still relied on hand-crafted features — someone had to decide what to count. The next revolution would learn the features too.
The Deep Learning Era (2013–2018)
Word2Vec, RNNs, and learning representations from raw text
Neural NLP
Word2Vec (2013) was the watershed moment: words could be represented as dense vectors where geometry captured meaning — "king − man + woman = queen" worked as actual vector arithmetic. This eliminated the need for hand-crafted features. Recurrent Neural Networks (RNNs) and their improved variants LSTMs and GRUs could process sequences of variable length, making them natural fits for language. Sequence-to-sequence models with attention (2014–2015) revolutionized machine translation, and ELMo (2018) introduced contextual embeddings — the same word gets different vectors depending on its context. The deep learning era automated feature engineering: instead of telling the model what to look for, you let it discover patterns in raw text.
Timeline
- 2013: Word2Vec — dense word vectors, learned; king − man + woman = queen
- 2014: GloVe (Stanford) — global + local context
- 2014: Seq2seq models — encoder-decoder for translation
- 2015: Attention mechanism — "look at relevant parts"
- 2018: ELMo — contextual embeddings; same word → different vectors

Key shift: no more hand-crafted features — the model learns what matters
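The "king − man + woman = queen" arithmetic can be demonstrated with hand-picked toy vectors. The 2-d embeddings below are hypothetical values chosen so the analogy holds exactly (dimension 0 ≈ "royalty", dimension 1 ≈ "gender"); real Word2Vec vectors are learned and typically 100–300 dimensional.

```python
import numpy as np

# Hypothetical toy embeddings, not real Word2Vec output.
emb = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([0.5,  0.1]),   # distractor word
}

def nearest(vec, exclude):
    """Return the word whose embedding has the highest cosine
    similarity to vec, skipping the query words in `exclude`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

result = nearest(emb["king"] - emb["man"] + emb["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```

Excluding the query words matters: in real embedding spaces the nearest neighbor of `king − man + woman` is often `king` itself.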
Key insight: The deep learning era's contribution wasn't just better accuracy — it was eliminating feature engineering. Instead of decades of linguistic expertise encoded as features, a neural network could discover its own representations from raw text.
The Transformer Revolution (2017–Present)
Attention is all you need — and it changed everything
Transformers
The 2017 paper "Attention Is All You Need" introduced the Transformer architecture, which replaced recurrence with self-attention — every token can attend to every other token in parallel. This unlocked massive parallelism and enabled training on unprecedented data scales. BERT (2018) used the encoder side for understanding tasks (classification, NER, question answering), achieving state-of-the-art on 11 NLP benchmarks simultaneously. GPT (2018–2020) used the decoder side for generation. T5 (2019) unified all NLP tasks as text-to-text. The transformer didn't just improve NLP — it unified it. One architecture, pre-trained on massive text, could be fine-tuned for virtually any language task.
The Transformer Family
- 2017: Transformer — self-attention, parallel processing; "Attention Is All You Need"
- 2018: BERT (encoder) — bidirectional understanding; SOTA on 11 benchmarks at once
- 2018–20: GPT-1/2/3 (decoder) — autoregressive generation; scaling unlocks new abilities
- 2019: T5 (encoder-decoder) — all tasks as text-to-text; "Translate English to German: ..."

The unification: one architecture for all NLP tasks
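The core self-attention computation is compact enough to sketch directly. This minimal version assumes identity query/key/value projections and omits multi-head splitting and masking; it shows the key property the text describes — every token attends to every other token in one parallel matrix operation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections.
    X: (n_tokens, d_model). Returns context-mixed token representations."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # (n, n) pairwise affinities
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ X                  # weighted mix of all token vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # (5, 8)
```

Unlike an RNN, nothing here is sequential: the whole `(n, n)` score matrix is computed at once, which is what unlocks the parallelism and data scale the section describes.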
Key insight: The transformer's real innovation wasn't just attention — it was enabling pre-training at scale. A model trained on billions of words develops general language understanding that transfers to any downstream task.
The NLP Tasks Taxonomy
Every NLP problem falls into one of a few fundamental task types
Task Categories
Despite the enormous variety of NLP applications, most problems reduce to a handful of fundamental task types. Text classification: assign a label to a document (sentiment, topic, spam). Sequence labeling: assign a label to each token (POS tagging, NER). Sequence-to-sequence: transform one sequence into another (translation, summarization). Text generation: produce text from a prompt or context. Information extraction: pull structured data from unstructured text (relation extraction, event detection). Semantic similarity: measure how similar two texts are (paraphrase detection, search). Understanding which task type your problem maps to is the first step in choosing the right approach.
Task Types
- Classification (document → label): sentiment, topic, spam, intent
- Sequence labeling (token → label): POS tagging, NER, chunking
- Seq-to-seq (sequence → sequence): translation, summarization
- Generation (prompt → text): completion, dialogue, creative writing
- Information extraction: relations, events, entities
- Semantic similarity: paraphrase, search, matching
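The difference between task types shows up directly in function signatures: classification maps a whole document to one label, while sequence labeling maps each token to its own label. The rules below are toy stand-ins for real models, just to make the input/output shapes concrete.

```python
def classify(doc: str) -> str:
    """Classification: one label per document (toy rule, not a real model)."""
    return "question" if doc.strip().endswith("?") else "statement"

def label_tokens(tokens: list[str]) -> list[str]:
    """Sequence labeling: one label per token
    (toy capitalization-based entity tagger)."""
    return ["ENT" if t[:1].isupper() else "O" for t in tokens]

print(classify("Where is Paris?"))                # one label for the document
print(label_tokens(["Where", "is", "Paris", "?"]))  # one label per token
```

Note the output shapes: `classify` returns a single string, `label_tokens` a list exactly as long as its input — and that shape difference is what dictates the loss function and evaluation metric, as the key insight below notes.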
Key insight: Knowing the task type determines the model architecture, the loss function, the evaluation metric, and the data format. Misidentifying the task type is the most common beginner mistake in NLP.
The NLP Pipeline
From raw text to predictions: the standard processing flow
Pipeline Architecture
A standard NLP pipeline transforms raw text into actionable predictions through a series of stages. Data acquisition: collect raw text plus metadata (source, timestamp, language). Preprocessing: clean, normalize, and tokenize the text. Representation: convert tokens into numerical vectors the model can process. Modeling: apply a machine learning or deep learning model to the vectors. Post-processing: format outputs, apply business rules, filter low-confidence predictions. Evaluation: measure performance against ground truth using task-appropriate metrics. Each stage has its own design decisions and failure modes. The pipeline is only as strong as its weakest stage — a perfect model on poorly preprocessed text will produce poor results.
Pipeline Stages
Raw text
1. Acquire: collect + metadata ↓
2. Preprocess: clean, normalize, tokenize ↓
3. Represent: tokens → vectors ↓
4. Model: vectors → predictions ↓
5. Post-process: format, filter ↓
6. Evaluate: measure vs ground truth ↓
Actionable output
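The stages above can be sketched end to end. This toy sentiment pipeline substitutes a keyword lexicon for a trained model in the modeling stage; the lexicon, scoring rule, and confidence threshold are illustrative assumptions, not a real system.

```python
import re

POSITIVE = {"great", "good", "love"}   # toy lexicon standing in for a model
NEGATIVE = {"bad", "awful", "hate"}

def preprocess(text):
    """Stage 2: clean, normalize, tokenize."""
    return re.findall(r"[a-z']+", text.lower())

def represent(tokens):
    """Stage 3: tokens -> feature vector (here, two lexicon counts)."""
    return [sum(t in POSITIVE for t in tokens),
            sum(t in NEGATIVE for t in tokens)]

def model(features):
    """Stage 4: features -> raw score (positive minus negative count)."""
    return features[0] - features[1]

def postprocess(score, threshold=1):
    """Stage 5: apply a business rule — only confident scores get a label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "uncertain"

def pipeline(text):
    return postprocess(model(represent(preprocess(text))))

print(pipeline("I love this, it's great!"))  # positive
print(pipeline("It is a chair."))            # uncertain
```

Each stage is a separate function on purpose: it makes the "weakest stage" point testable, since you can feed badly preprocessed tokens into a perfect downstream model and watch the output degrade.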
Key insight: Modern transformer-based models collapse stages 2–4 into a single end-to-end system. But understanding the pipeline is still essential — you need to know what the model is doing implicitly to debug it when it fails.
Course Roadmap
What you'll learn across 10 chapters
The Journey
This course follows the evolution of NLP from foundations to the modern landscape. Chapters 2–3 cover the pipeline's foundation: text preprocessing and representation (from bag-of-words to Word2Vec). Chapters 4–5 tackle the core tasks: text classification and sequence labeling, with both classical and neural approaches. Chapter 6 covers language models and generation — the foundation that led to GPT. Chapter 7 is the transformer revolution: BERT, GPT, and T5. Chapters 8–9 cover the modern workflow: transfer learning, fine-tuning, and evaluation metrics. Chapter 10 surveys the current landscape: instruction tuning, few-shot NLP, multilingual models, and where the field is heading next.
Chapter Map
Section 1: Text & Representation
- Ch 2: Text Preprocessing
- Ch 3: Representing Text

Section 2: Tasks & Models
- Ch 4: Text Classification
- Ch 5: Sequence Labeling
- Ch 6: Language Models & Generation
- Ch 7: The Transformer Revolution

Section 3: Modern NLP
- Ch 8: Transfer Learning & Fine-Tuning
- Ch 9: NLP Evaluation
- Ch 10: The Modern NLP Landscape
Key insight: This course focuses on NLP as a discipline — the tasks, evaluation, and evolution — not just the transformer architecture. Understanding the history and fundamentals makes you a better practitioner of modern NLP.