Ch 4 — Text Classification

Sentiment analysis, spam detection, and the classical ML pipeline for text
High Level
Text → Features → Model → Predict → Evaluate → Deploy
What Is Text Classification?
Assigning labels to documents — the most common NLP task in production
The Task
Text classification assigns one or more labels to a piece of text. Given a document, predict its category. It's the most widely deployed NLP task in industry. Sentiment analysis: "This movie was terrible" → negative. Spam detection: "You've won $1,000,000!" → spam. Topic classification: an article about GDP growth → economics. Intent detection: "Book a flight to London" → book_flight. Classification can be binary (spam/not spam), multi-class (one of N categories), or multi-label (multiple tags per document). The task seems simple, but doing it well at scale — handling edge cases, class imbalance, domain shift, and evolving categories — is where the real engineering challenge lies.
Classification Types
Binary: exactly 2 classes
- spam / not spam
- positive / negative
- toxic / non-toxic
Multi-class: one of N classes
- sports | politics | tech | science
- anger | joy | sadness | fear | surprise
Multi-label: multiple labels per doc
- Article: [politics, economics, trade]
- Movie: [action, comedy, sci-fi]
Real-world applications:
- Email routing, content moderation
- Customer support ticket triage
- Medical document coding (ICD-10)
- Legal document categorization
Key insight: Text classification is the "hello world" of NLP — simple to define, hard to master. Most production NLP systems start with classification, and the skills transfer to every other NLP task.
Naive Bayes
The surprisingly effective probabilistic baseline
How It Works
Naive Bayes applies Bayes' theorem with a "naive" assumption: all features (words) are conditionally independent given the class. This assumption is obviously wrong — "New" and "York" are highly correlated — but the model works remarkably well in practice. For text, Multinomial Naive Bayes models word counts: P(class | document) is proportional to P(class) × ∏ P(word | class). Training is just counting: how often does each word appear in documents of each class? Prediction is fast: multiply the probabilities. Naive Bayes excels with small training sets (as few as dozens of examples), is extremely fast to train and predict, and provides a strong baseline that more complex models must beat. It's still the go-to first model for text classification prototypes.
Naive Bayes for Text
Bayes' theorem:
P(spam | "free money now") = P("free"|spam) × P("money"|spam) × P("now"|spam) × P(spam) ÷ P("free money now")
Training = counting:
- P("free" | spam) = count("free" in spam) / total words in spam
- P("free" | ham) = count("free" in ham) / total words in ham
Strengths:
- Works with tiny datasets (50+ examples)
- Trains in milliseconds
- Interpretable: which words drive predictions
- Hard to overfit
Weaknesses:
- Independence assumption is wrong
- Can't capture word interactions
- Sensitive to word frequency imbalance
Key insight: Naive Bayes is the baseline that refuses to die. Its independence assumption is wrong, but the ranking of class probabilities is often correct. Always start with Naive Bayes — if a complex model can't beat it, the problem is your data, not your model.
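The "training is just counting" idea is small enough to sketch in plain Python. This is a minimal toy implementation (the spam/ham messages are invented), with add-one (Laplace) smoothing added so an unseen word doesn't zero out the whole product:

```python
import math
from collections import Counter

# Toy training data (invented examples) — training is just counting words.
spam = ["free money now", "win free cash now", "claim your free prize"]
ham = ["meeting at noon", "lunch tomorrow", "project update attached"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(doc, counts, total, prior):
    # Sum log-probabilities instead of multiplying raw probabilities,
    # with add-one smoothing so unseen words don't yield log(0).
    score = math.log(prior)
    for w in doc.split():
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(doc):
    p_spam = log_prob(doc, spam_counts, spam_total, 0.5)
    p_ham = log_prob(doc, ham_counts, ham_total, 0.5)
    return "spam" if p_spam > p_ham else "ham"

print(classify("free money"))       # → spam
print(classify("project meeting"))  # → ham
```

Working in log space is the standard trick: multiplying many small probabilities underflows, while summing their logs is numerically stable and preserves the ranking.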
Logistic Regression
The linear model that dominates text classification
Why It Works for Text
Logistic regression learns a weighted sum of features, passed through a sigmoid function to produce a probability. For text, the features are typically TF-IDF values, and the model learns which words are most predictive of each class. It consistently achieves 83–88% accuracy on sentiment analysis benchmarks, often matching or beating more complex models. Logistic regression works so well for text because TF-IDF features are high-dimensional and sparse — exactly the regime where linear models shine. With L1 regularization (Lasso), it performs automatic feature selection, zeroing out irrelevant words. With L2 regularization (Ridge), it handles correlated features gracefully. The learned weights are directly interpretable: the highest-weighted words for "positive" might be "excellent", "amazing", "loved."
Logistic Regression for Text
Model: P(positive | doc) = sigmoid(w · x + b)
- x = TF-IDF vector of document
- w = learned word weights
Learned weights (sentiment):
- "excellent": +2.3
- "amazing": +2.1
- "terrible": -2.8
- "boring": -1.9
- "the": +0.01 (near zero)
Regularization:
- L1 (Lasso): sparse weights, feature selection
- L2 (Ridge): small weights, handles correlation
Performance:
- Sentiment: 83-88% accuracy (F1: 0.83)
- Spam: 97-99% accuracy
- Fast training, fast inference
Key insight: Logistic regression with TF-IDF is the strongest classical baseline for text classification. In comparative studies, it achieves the best F1 scores among classical models. If you can only try one model, try this one.
Support Vector Machines (SVMs)
Finding the maximum-margin decision boundary in high-dimensional space
SVMs for Text
Support Vector Machines find the hyperplane that maximizes the margin between classes. In the high-dimensional space of TF-IDF vectors, SVMs are particularly effective because they handle sparse, high-dimensional data well and are robust to overfitting through the margin constraint. Linear SVMs (LinearSVC) are the standard choice for text — kernel SVMs are usually unnecessary because the feature space is already high-dimensional enough for linear separability. SVMs were the dominant text classification method from the late 1990s through the early 2010s, winning most shared tasks and competitions. They remain competitive today, especially when combined with careful feature engineering. The key advantage over logistic regression: SVMs focus on the hardest examples (support vectors near the boundary) rather than all examples, making them more robust to noisy data.
SVM Properties
Core idea: Find the hyperplane that maximizes the margin between classes
  Positive class  + + +  ← margin → | ← margin →  - - -  Negative class
For text:
- Use Linear SVM (not kernel SVM)
- High-dimensional TF-IDF is already separable
- Kernels add cost without benefit
Strengths:
- Robust to high dimensionality
- Focuses on hard examples (support vectors)
- Strong generalization from the margin
Weaknesses:
- Slower training than Naive Bayes/LR
- No probability output (by default)
- Less interpretable than LR weights
Key insight: For text classification, linear models (LR, SVM) are hard to beat with classical methods. The high dimensionality of text features means non-linear models rarely add value — the data is already linearly separable in TF-IDF space.
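In scikit-learn the standard linear-SVM setup is `LinearSVC` over TF-IDF. A minimal sketch on invented spam/ham snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled data (invented for illustration).
docs = [
    "win a free prize now", "free cash, claim now", "you won free money",
    "meeting notes attached", "see you at lunch", "project update for the team",
]
labels = ["spam"] * 3 + ["ham"] * 3

# Linear SVM over TF-IDF — the standard classical setup for text.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["claim your free prize"])[0])  # → spam
print(model.predict(["lunch meeting notes"])[0])    # → ham
```

Note that `LinearSVC` exposes `decision_function` (signed distance to the hyperplane) rather than probabilities; wrap it in `CalibratedClassifierCV` if calibrated probabilities are needed.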
Deep Learning for Classification
CNNs, RNNs, and transformers — when neural models add value
Neural Approaches
Deep learning brought three waves of neural text classifiers. TextCNN (Kim, 2014) applies convolutional filters over word embeddings to capture local n-gram patterns — fast and effective for short texts. LSTMs and BiLSTMs process text sequentially, capturing long-range dependencies that bag-of-words models miss. Transformer-based models (BERT, RoBERTa) achieve state-of-the-art by fine-tuning pre-trained language models on classification tasks. BERT adds a classification head on top of the [CLS] token representation and fine-tunes the entire model. The accuracy gains over classical models are real but task-dependent: for simple binary sentiment, BERT might improve 2–3% over logistic regression; for nuanced multi-class tasks with subtle distinctions, the gap can be 10%+. The trade-off is always compute cost and complexity.
Neural Classification Models
TextCNN (2014):
- Convolutional filters over embeddings
- Captures local n-gram patterns
- Fast, good for short texts
BiLSTM (2015+):
- Sequential processing, both directions
- Captures long-range dependencies
- + Attention for weighted aggregation
BERT fine-tuning (2018+):
- [CLS] token → classification head
- Fine-tune entire pre-trained model
- SOTA on most benchmarks
Accuracy comparison (sentiment):
- Naive Bayes: ~82%
- LR + TF-IDF: ~85%
- TextCNN: ~87%
- BiLSTM: ~88%
- BERT: ~93%
Key insight: Deep learning models are not always worth the cost for text classification. If logistic regression gets 85% and BERT gets 87%, the 2% gain may not justify 100x the compute. But for hard tasks with subtle distinctions, transformers are transformative.
Feature Engineering for Text
The features you choose matter more than the model you pick
Beyond Bag-of-Words
For classical models, feature engineering is where the real performance gains come from. Beyond unigram TF-IDF, you can add n-gram features: bigrams ("not good", "very bad") capture negation and intensification that unigrams miss. Character n-grams capture morphological patterns and are robust to typos. Metadata features like document length, punctuation count, capitalization ratio, and emoji presence add signal. Domain-specific features like sentiment lexicon scores (AFINN, VADER) provide expert knowledge. Feature selection using chi-squared tests or mutual information removes noise features. The best classical systems combine multiple feature types: TF-IDF unigrams + bigrams + character n-grams + metadata, often matching neural models without GPU costs.
Feature Engineering Toolkit
N-gram features:
- Unigrams: "not", "good"
- Bigrams: "not good" (captures negation!)
- Trigrams: "not very good"
Character n-grams:
- "amazing" → "ama", "maz", "azi", "zin"
- Robust to typos: "amzing" still matches
Metadata features:
- Document length, sentence count
- Punctuation ratio (!!!)
- Capitalization ratio (ALL CAPS)
- Emoji count, URL count
Lexicon features:
- VADER sentiment score
- AFINN word-level scores
- Domain-specific dictionaries
Best practice: Combine TF-IDF (1,2)-grams + metadata
Key insight: Bigram features are the single most impactful addition to a unigram baseline. They capture negation ("not good"), intensification ("very bad"), and multi-word expressions ("New York") that unigrams completely miss.
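Combining feature types is straightforward in scikit-learn with `FeatureUnion`. A minimal sketch (toy documents invented) joining word (1,2)-grams with character n-grams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy documents (invented); note the deliberate typo "amzing".
docs = ["not good at all", "very bad movie", "amazing film", "amzing film"]

# Word (1,2)-grams capture negation like "not good"; character n-grams
# ("char_wb" = char n-grams within word boundaries) tolerate typos.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
])

X = features.fit_transform(docs)
print(X.shape)  # (4, total word n-gram + char n-gram features)
```

The typo "amzing" shares character n-grams such as "ing" with "amazing", so the two documents overlap in feature space even though their word features do not match.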
Common Pitfalls
Class imbalance, data leakage, and the mistakes that silently kill accuracy
What Goes Wrong
Class imbalance is the most common pitfall. If 95% of emails are not spam, a model that always predicts "not spam" gets 95% accuracy but catches zero spam. Use F1 score (not accuracy) as your primary metric, and consider oversampling, undersampling, or class weights. Data leakage happens when test data information leaks into training — fitting TF-IDF on the entire dataset before splitting means the model has seen test vocabulary statistics. Always fit preprocessing on training data only. Domain shift: a model trained on movie reviews may fail on product reviews because the vocabulary and sentiment expressions differ. Label noise: human annotators disagree 10–20% of the time on subjective tasks like sentiment, setting a ceiling on model performance. Overfitting on spurious correlations: the model learns that reviews mentioning "Oscar" are positive, not because of sentiment but because Oscar-nominated films get more positive reviews.
Pitfall Checklist
Class imbalance:
- 95% negative, 5% positive
- Accuracy = 95% by always saying "negative"
- Fix: use F1, class weights, resampling
Data leakage:
- Fitting TF-IDF on train + test
- Fix: fit_transform(train), transform(test)
Domain shift:
- Train on movie reviews, test on products
- Fix: domain adaptation, more diverse data
Label noise:
- Annotators disagree 10-20% of the time
- Fix: multiple annotators, adjudication
Spurious correlations:
- "Oscar" → positive (not causal)
- Fix: error analysis, counterfactual tests
Key insight: The difference between a demo and a production classifier is how you handle edge cases. Class imbalance, domain shift, and label noise are not exceptions — they are the norm in real-world text classification.
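The leakage and imbalance fixes can be sketched together in scikit-learn (the imbalanced spam/ham toy data is invented): split first, fit the vectorizer on training data only, weight the minority class, and score with F1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (invented): few "spam", many "ham".
docs = (["free prize now", "win free cash", "claim free money"]
        + ["meeting notes %d" % i for i in range(9)])
labels = ["spam"] * 3 + ["ham"] * 9

# Stratified split FIRST, then fit preprocessing on training data only —
# fitting TF-IDF on the full corpus would leak test vocabulary statistics.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)  # fit on train only
X_test_tfidf = vec.transform(X_test)        # reuse the train vocabulary

# class_weight="balanced" reweights examples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train_tfidf, y_train)

f1 = f1_score(y_test, clf.predict(X_test_tfidf), pos_label="spam")
print(f1)
```

Because the test matrix is built with `transform`, words that appear only in test documents are simply ignored, exactly as they would be at deployment time.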
The Classification Pipeline
Putting it all together — from raw text to deployed model
End-to-End Pipeline
A production text classification pipeline follows a structured flow. Data collection: gather labeled examples (manually annotated, crowd-sourced, or from existing systems). Preprocessing: clean and normalize text using the pipeline from Chapter 2. Feature extraction: convert text to numerical features (TF-IDF, embeddings). Model selection: start with logistic regression as a baseline, then try more complex models only if needed. Evaluation: use stratified cross-validation with F1 as the primary metric. Error analysis: examine misclassified examples to find systematic patterns. Iteration: improve features, add data for weak categories, adjust thresholds. Deployment: serve predictions via API with monitoring for drift. The most important step is error analysis — understanding why the model fails teaches you more than any hyperparameter search.
Pipeline Steps
1. Collect: labeled data (1K+ examples)
2. Split: train/val/test (stratified)
3. Preprocess: clean, normalize
4. Featurize: TF-IDF (1,2)-grams
5. Baseline: logistic regression
6. Evaluate: F1, precision, recall
7. Error analysis: why does it fail?
8. Iterate: better features, more data
9. Compare: SVM, BERT if needed
10. Deploy: API + monitoring
Model selection heuristic:
- <1K examples: Naive Bayes
- 1K-100K: Logistic Regression + TF-IDF
- 100K+: consider BERT fine-tuning
- Always start simple; add complexity only when needed
Key insight: The best text classification systems are built through iteration, not architecture. Start with the simplest model that works, do thorough error analysis, and add complexity only where the errors demand it.
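The core of the pipeline above can be sketched in a few lines of scikit-learn (toy corpus invented). Bundling featurizer and model in a `Pipeline` means cross-validation refits TF-IDF inside each fold, so the evaluation itself cannot leak:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration).
docs = [
    "excellent plot", "amazing acting", "loved this film",
    "excellent and amazing", "loved the plot", "amazing, excellent work",
    "terrible film", "boring plot", "hated this movie",
    "terrible and boring", "hated the acting", "boring, terrible mess",
]
labels = ["pos"] * 6 + ["neg"] * 6

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # step 4: (1,2)-gram features
    LogisticRegression(),                 # step 5: baseline model
)

# Steps 2 + 6: stratified folds, F1 as the primary metric.
scores = cross_val_score(pipeline, docs, labels, scoring="f1_macro",
                         cv=StratifiedKFold(n_splits=3))
print(scores.mean())
```

From here, steps 7-9 are manual: inspect the misclassified documents in each fold, then swap in `LinearSVC` or a fine-tuned transformer only if the errors justify it.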