Ch 3 — Representing Text

From bag-of-words to Word2Vec — the journey from sparse to dense representations
High Level: One-Hot → BoW → TF-IDF → Word2Vec → GloVe → Contextual
The Representation Problem
Machines need numbers, not words — how do we bridge the gap?
Why Representation Matters
Machine learning models operate on numbers, not words. The fundamental challenge of NLP is converting text into numerical representations that preserve meaning. The simplest approach is one-hot encoding: represent each word as a vector with a 1 in its position and 0s everywhere else. With a vocabulary of 50,000 words, each word becomes a 50,000-dimensional vector with a single 1. This is wildly inefficient — 99.998% of every vector is zeros. Worse, one-hot vectors treat every word as equally different from every other word: "cat" is as far from "dog" as it is from "democracy." There's no notion of similarity. The history of text representation is the story of finding better ways to encode meaning into numbers.
One-Hot Encoding
Vocabulary: [cat, dog, fish, bird]

"cat"  → [1, 0, 0, 0]
"dog"  → [0, 1, 0, 0]
"fish" → [0, 0, 1, 0]
"bird" → [0, 0, 0, 1]

Problems:
- Dimensionality = vocabulary size (50K words = 50K dimensions)
- No similarity: dist(cat, dog) = dist(cat, fish)
- No semantic information encoded
- Extremely sparse (99.998% zeros)
Key insight: The quality of your text representation determines the ceiling of your model's performance. A great model on bad representations will always lose to a decent model on great representations.
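One-hot encoding and its "every word is equally far from every other word" problem can be shown in a few lines of Python. This is a minimal sketch with illustrative names, not a library API:

```python
def one_hot(word, vocab):
    """Return a |vocab|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def hamming(a, b):
    """Count positions where two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

vocab = ["cat", "dog", "fish", "bird"]
print(one_hot("cat", vocab))  # [1, 0, 0, 0]

# Every pair of distinct words is equally far apart — no similarity signal:
assert hamming(one_hot("cat", vocab), one_hot("dog", vocab)) == \
       hamming(one_hot("cat", vocab), one_hot("fish", vocab))
```

The distance between any two distinct one-hot vectors is always the same, which is exactly why "cat" is no closer to "dog" than to "democracy".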
Bag of Words (BoW)
Count the words, ignore the order — surprisingly effective
How It Works
Bag of Words represents a document as a vector of word counts. Each dimension corresponds to a word in the vocabulary, and the value is how many times that word appears. "The cat sat on the mat" becomes a count vector where "the" = 2, "cat" = 1, "sat" = 1, "on" = 1, "mat" = 1. The name "bag" reflects that word order is completely discarded — "dog bites man" and "man bites dog" have identical BoW representations despite opposite meanings. Despite this limitation, BoW is a strong baseline for document-level tasks like topic classification and spam detection, where the presence of certain words matters more than their arrangement. BoW vectors are sparse (mostly zeros) and high-dimensional, but they're fast to compute and easy to interpret.
Bag of Words Example
Document: "the cat sat on the mat"
Vocabulary: [cat, mat, on, sat, the]
BoW vector: [1, 1, 1, 1, 2]

Order lost (vocabulary: [bites, cat, dog, man]):
"dog bites man" → [1, 0, 1, 1]
"man bites dog" → [1, 0, 1, 1]
Identical vectors, opposite meanings!

Strengths:
- Simple, fast, interpretable
- Good baseline for classification
- Works well with Naive Bayes, SVM

Weaknesses:
- No word order
- No semantic similarity
- High dimensionality, sparse
Key insight: BoW works because for many tasks, what words appear matters more than how they're arranged. A movie review containing "terrible", "boring", "waste" is negative regardless of word order. But BoW fails when order carries meaning.
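The counting described above can be sketched with `collections.Counter`; the function name is illustrative and tokenization is simplified to whitespace splitting:

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Count how often each vocabulary word appears (word order is discarded)."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["cat", "mat", "on", "sat", "the"]
print(bow_vector("the cat sat on the mat", vocab))  # [1, 1, 1, 1, 2]

# Word order is lost — opposite meanings, identical vectors:
v = ["bites", "dog", "man"]
assert bow_vector("dog bites man", v) == bow_vector("man bites dog", v)
```

Real pipelines typically add lowercasing, punctuation handling, and stop-word removal, but the core idea is just this count lookup.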
TF-IDF
Not all words are created equal — weighting by importance
Term Frequency × Inverse Document Frequency
TF-IDF improves on raw counts by weighting words based on how discriminative they are. Term Frequency (TF) measures how often a word appears in a document. Inverse Document Frequency (IDF) measures how rare a word is across all documents — words that appear in every document (like "the") get low IDF, while words that appear in few documents (like "quantum") get high IDF. The product TF × IDF gives high scores to words that are frequent in a specific document but rare overall — exactly the words that characterize that document. TF-IDF remains a remarkably strong baseline. A large-scale study across 73 datasets found that TF-IDF outperformed neural embeddings on 61 of them for text classification, often by 3–5% in F1 score.
TF-IDF Calculation
TF(word, doc) = count(word in doc) / total words in doc
IDF(word)     = log(total docs / docs containing word)
TF-IDF        = TF × IDF

Example: "quantum" in a physics paper
TF  = 15/500 = 0.03
IDF = log(10000/50) ≈ 5.3
TF-IDF = 0.03 × 5.3 ≈ 0.159 (high)

Example: "the" in the same paper
TF  = 30/500 = 0.06
IDF = log(10000/9900) ≈ 0.01
TF-IDF = 0.06 × 0.01 ≈ 0.0006 (low)

Result: "quantum" weighted ~265x more than "the"
Key insight: TF-IDF captures a deep intuition: the most informative words are those that distinguish one document from others. This principle — weighting by discriminative power — appears throughout machine learning, not just NLP.
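The TF × IDF formula above translates directly into code. A minimal sketch on a toy three-document corpus (function names are illustrative; production code would use a library such as scikit-learn):

```python
import math

def tf(word, doc_tokens):
    """Fraction of the document's tokens that are this word."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    """Log of (total docs / docs containing the word)."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tfidf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

corpus = [
    "the quantum state collapsed".split(),
    "the cat sat on the mat".split(),
    "the market opened higher".split(),
]
print(tfidf("quantum", corpus[0], corpus))  # ≈ 0.27: frequent here, rare elsewhere
print(tfidf("the", corpus[0], corpus))      # 0.0: appears in every document
```

Note that a word appearing in every document gets IDF = log(1) = 0, so it is zeroed out entirely — the limiting case of the "the" example above.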
Word2Vec: The Dense Revolution
Words as points in space — where geometry captures meaning
The Distributional Hypothesis
Word2Vec (Mikolov et al., 2013) was the breakthrough that launched the dense embedding era. It's built on the distributional hypothesis: "You shall know a word by the company it keeps" (J.R. Firth, 1957). Words that appear in similar contexts have similar meanings, so they should have similar vectors. Word2Vec trains a shallow neural network on a simple task: given a word, predict its surrounding words (Skip-gram), or given surrounding words, predict the center word (CBOW). The hidden layer weights become the word vectors — typically 100–300 dimensions. The result: words with similar meanings cluster together, and vector arithmetic captures semantic relationships. The famous example: vec("king") − vec("man") + vec("woman") ≈ vec("queen").
Word2Vec Architecture
Skip-gram: predict context from word
  Input: "sat" → Predict: "the", "cat", "on", "the"
CBOW: predict word from context
  Input: "the", "cat", "on", "the" → Predict: "sat"

Vector arithmetic:
  king - man + woman ≈ queen
  Paris - France + Italy ≈ Rome
  bigger - big + small ≈ smaller

Properties:
- Dense: 300 dims (not 50,000)
- Similar words = nearby vectors
- Directions encode relationships
- Trained on billions of words
Key insight: Word2Vec showed that meaning can emerge from context alone. No one told the model that "king" and "queen" are related — it discovered this from patterns in billions of words. This is the distributional hypothesis in action.
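The Skip-gram training task above starts from (center, context) pairs extracted with a sliding window. A sketch of just that data-preparation step — not the network, negative sampling, or training loop (the function name is illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # every neighbor within the window, excluding the center
                pairs.append((center, tokens[j]))
    return pairs

for center, context in skipgram_pairs("the cat sat on the mat".split()):
    print(center, "→", context)
```

Each pair becomes one training example: given `center`, the model is rewarded for assigning high probability to `context`. CBOW simply reverses the roles.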
GloVe and fastText
Two improvements on Word2Vec — global statistics and subword information
Beyond Word2Vec
GloVe (Global Vectors, Stanford 2014) combines the best of both worlds: the global co-occurrence statistics of matrix-factorization methods and the local context windows of Word2Vec. It builds a word-word co-occurrence matrix from the entire corpus, then factorizes it to produce dense vectors. GloVe often produces more stable embeddings than Word2Vec, especially for less frequent words. fastText (Facebook, 2016) adds a crucial innovation: it represents each word as a bag of character n-grams. The word "where" is represented by its character n-grams "<wh", "whe", "her", "ere", "re>" plus the word itself. This means fastText can generate vectors for words it has never seen by composing their subword vectors — solving the OOV problem for morphologically rich languages.
GloVe vs fastText
GloVe:
  Build co-occurrence matrix X, where X[i][j] = how often word i
  appears near word j in the corpus
  Factorize: minimize a weighted least-squares objective
  Strength: global + local context
  Weakness: still one vector per word

fastText:
  "where" = <wh + whe + her + ere + re>
  Word vector = sum of n-gram vectors
  Strength: handles unseen words
    "unhappily" = un + nh + ha + ... + ly
    (even if "unhappily" was never seen in training)
  Strength: great for morphologically rich languages (Turkish, Finnish)
Key insight: fastText's subword approach is the conceptual ancestor of BPE tokenization used in modern transformers. The idea that word meaning can be composed from sub-word parts is fundamental to how BERT and GPT handle vocabulary.
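The n-gram decomposition and subword composition above can be sketched in Python. This toy uses only n = 3 (real fastText uses n = 3 to 6 plus the whole word) and random vectors in place of trained n-gram embeddings, purely for illustration:

```python
import random

def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' boundary markers, fastText-style."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Toy composition: a word's vector is the sum of its n-gram vectors.
random.seed(0)
ngram_vecs = {}  # random stand-ins for trained n-gram embeddings

def word_vector(word, dim=4):
    total = [0.0] * dim
    for g in char_ngrams(word):
        if g not in ngram_vecs:
            ngram_vecs[g] = [random.uniform(-1, 1) for _ in range(dim)]
        total = [t + x for t, x in zip(total, ngram_vecs[g])]
    return total

# Even a word never seen in training gets a vector from its pieces:
print(word_vector("unhappily"))
```

Because an unseen word shares n-grams ("un<", "hap", "ly>") with words that were seen, its composed vector lands near its morphological relatives — the mechanism behind fastText's OOV handling.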
Contextual Embeddings
The same word, different meanings, different vectors
The Polysemy Problem
Word2Vec, GloVe, and fastText all share a fundamental limitation: each word gets one vector regardless of context. The word "bank" has the same vector whether it means a financial institution or a river bank. ELMo (Embeddings from Language Models, 2018) solved this with contextual embeddings: it runs a bidirectional LSTM over the entire sentence and uses the hidden states as word representations. Now "bank" in "I deposited money at the bank" gets a different vector than "bank" in "We sat on the river bank." BERT (2018) took this further with transformer-based contextual embeddings, producing even richer representations. Contextual embeddings are the foundation of modern NLP — they're why BERT, GPT, and their successors understand language so much better than earlier models.
Static vs Contextual
Static embeddings (Word2Vec/GloVe):
  "bank" → [0.2, -0.1, 0.8, ...]
  Always the same vector, in any context

Contextual embeddings (ELMo/BERT):
  "I went to the bank to deposit money"
    "bank" → [0.3, 0.7, -0.2, ...]  (financial meaning)
  "We sat on the river bank"
    "bank" → [-0.1, 0.2, 0.9, ...]  (geographical meaning)

How it works:
- ELMo: BiLSTM hidden states
- BERT: Transformer self-attention
- Each token "sees" the full sentence
Key insight: Contextual embeddings represent a paradigm shift: from "one meaning per word" to "meaning emerges from context." This mirrors how humans understand language — we don't store fixed definitions, we interpret words based on their surroundings.
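The static-vs-contextual distinction can be illustrated with a deliberately tiny toy — this is not ELMo or BERT, just a sketch where a token's "contextual" vector mixes its hand-picked 2-D static vector with its neighbors', so the same word yields different vectors in different sentences:

```python
# Hand-picked 2-D static vectors (illustration only):
# dim 0 ≈ "finance-ness", dim 1 ≈ "nature-ness".
static = {
    "bank": [0.5, 0.5], "money": [1.0, 0.0], "deposit": [0.9, 0.1],
    "river": [0.0, 1.0], "sat": [0.1, 0.9], "we": [0.2, 0.2],
    "i": [0.2, 0.2], "the": [0.3, 0.3], "on": [0.3, 0.3],
}

def contextual(tokens, i, window=2):
    """Average a token's static vector with its neighbors' (a crude
    stand-in for what self-attention does with the whole sentence)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neigh = [static[t] for t in tokens[lo:hi]]
    return [sum(v[d] for v in neigh) / len(neigh) for d in range(2)]

s1 = "i deposit money the bank".split()   # simplified sentences
s2 = "we sat on the river bank".split()
print(contextual(s1, s1.index("bank")))   # pulled toward money/deposit
print(contextual(s2, s2.index("bank")))   # pulled toward river/sat
assert contextual(s1, s1.index("bank")) != contextual(s2, s2.index("bank"))
```

A static lookup (`static["bank"]`) returns the same vector for both sentences; the contextual function does not. Real models replace the simple average with learned, attention-weighted mixing over the full sentence.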
Sparse vs Dense: When to Use What
The surprising resilience of simple methods
Choosing a Representation
The choice between sparse and dense representations depends on your task, data, and constraints. Sparse methods (BoW, TF-IDF) excel when you have limited labeled data, need interpretability, or are doing keyword-based tasks like search and topic modeling. A large-scale study across 73 classification datasets found TF-IDF outperformed neural embeddings on 61 of 73 datasets by 3–5% average F1. Static dense embeddings (Word2Vec, GloVe) are best when you need semantic similarity, have moderate data, and want pre-trained representations without GPU costs. Contextual embeddings (BERT, GPT) dominate when you have sufficient compute, need to handle polysemy, or are tackling complex tasks like question answering and natural language inference. The best practitioners don't default to the newest method — they match the representation to the problem.
Decision Guide
Use TF-IDF when:
- Limited labeled data
- Need interpretability
- Keyword-based tasks (search, topics)
- Speed/cost constraints
- Outperforms neural on 61/73 datasets

Use Word2Vec/GloVe when:
- Need semantic similarity
- Moderate data, no GPU budget
- Transfer learning (pre-trained vectors)
- Analogy-style reasoning

Use BERT/contextual when:
- Complex tasks (QA, NLI, NER)
- Polysemy matters
- Sufficient compute available
- Large fine-tuning datasets
Key insight: There is no universally "best" representation. TF-IDF is still competitive for many real-world classification tasks. The best approach depends on your data size, compute budget, and whether the task requires understanding word order and context.
The Representation Timeline
From counting words to understanding meaning — 20 years of progress
Evolution Summary
The evolution of text representation follows a clear arc: from no semantics (one-hot) to statistical semantics (TF-IDF) to learned semantics (Word2Vec) to contextual semantics (BERT). Each step added a new dimension of understanding. One-hot encoding treats words as arbitrary symbols. BoW and TF-IDF capture word importance but not meaning. Word2Vec and GloVe capture meaning but not context. Contextual embeddings capture both meaning and context. The next chapter on text classification will show how these representations are used as input to models that make predictions. The representation you choose determines which patterns your model can learn — and which it can't.
The Arc of Progress
One-Hot (1950s+): words as arbitrary symbols; no similarity, no semantics
BoW / TF-IDF (1970s+): word importance from statistics; no similarity between words
Word2Vec / GloVe (2013+): semantic similarity from context; one vector per word (static)
fastText (2016): subword composition; handles unseen words
ELMo (2018): context-dependent vectors (BiLSTM)
BERT / GPT (2018+): context-dependent vectors (Transformer); pre-trained on massive corpora
Key insight: Each representation method is still useful today. TF-IDF for search, Word2Vec for lightweight similarity, BERT for complex understanding. The history isn't obsolete — it's a toolkit where each tool has its place.