Ch 4 — Text Classification

Sentiment analysis, spam detection, and the classical ML pipeline for text
High Level
Text → Features → Model → Predict → Evaluate → Deploy
What Is Text Classification?
Assigning labels to documents — the most common NLP task in production
The Task
Text classification assigns one or more labels to a piece of text. Given a document, predict its category. It's the most widely deployed NLP task in industry. Sentiment analysis: "This movie was terrible" → negative. Spam detection: "You've won $1,000,000!" → spam. Topic classification: an article about GDP growth → economics. Intent detection: "Book a flight to London" → book_flight. Classification can be binary (spam/not spam), multi-class (one of N categories), or multi-label (multiple tags per document). The task seems simple, but doing it well at scale — handling edge cases, class imbalance, domain shift, and evolving categories — is where the real engineering challenge lies.
Classification Types
Binary: exactly 2 classes
- spam / not spam
- positive / negative
- toxic / non-toxic
Multi-class: one of N classes
- sports | politics | tech | science
- anger | joy | sadness | fear | surprise
Multi-label: multiple labels per doc
- Article: [politics, economics, trade]
- Movie: [action, comedy, sci-fi]
Real-world applications:
- Email routing, content moderation
- Customer support ticket triage
- Medical document coding (ICD-10)
- Legal document categorization
Key insight: Text classification is the "hello world" of NLP — simple to define, hard to master. Most production NLP systems start with classification, and the skills transfer to every other NLP task.
Naive Bayes
The surprisingly effective probabilistic baseline
How It Works
Naive Bayes applies Bayes' theorem with a "naive" assumption: all features (words) are conditionally independent given the class. This assumption is obviously wrong — "New" and "York" are highly correlated — but the model works remarkably well in practice. For text, Multinomial Naive Bayes models word counts: P(class | document) is proportional to P(class) × ∏ P(word | class). Training is just counting: how often does each word appear in documents of each class? Prediction is fast: multiply the probabilities. Naive Bayes excels with small training sets (as few as dozens of examples), is extremely fast to train and predict, and provides a strong baseline that more complex models must beat. It's still the go-to first model for text classification prototypes.
Naive Bayes for Text
Bayes' theorem:
P(spam | "free money now") = P("free"|spam) × P("money"|spam) × P("now"|spam) × P(spam) ÷ P("free money now")
Training = counting:
- P("free" | spam) = count("free" in spam) / total words in spam
- P("free" | ham) = count("free" in ham) / total words in ham
Strengths:
- Works with tiny datasets (50+ examples)
- Trains in milliseconds
- Interpretable: which words drive predictions
- Hard to overfit
Weaknesses:
- Independence assumption is wrong
- Can't capture word interactions
- Sensitive to word frequency imbalance
Key insight: Naive Bayes is the baseline that refuses to die. Its independence assumption is wrong, but the ranking of class probabilities is often correct. Always start with Naive Bayes — if a complex model can't beat it, the problem is your data, not your model.
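The "training is just counting" idea is small enough to sketch in plain Python. This is a minimal toy implementation (the spam/ham messages are invented), with add-one (Laplace) smoothing added so an unseen word doesn't zero out the whole product:

```python
import math
from collections import Counter

# Toy training data (invented examples) — training is just counting words.
spam = ["free money now", "win free cash now", "claim your free prize"]
ham = ["meeting at noon", "lunch tomorrow", "project update attached"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(doc, counts, total, prior):
    # Sum log-probabilities instead of multiplying raw probabilities,
    # with add-one smoothing so unseen words don't yield log(0).
    score = math.log(prior)
    for w in doc.split():
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(doc):
    p_spam = log_prob(doc, spam_counts, spam_total, 0.5)
    p_ham = log_prob(doc, ham_counts, ham_total, 0.5)
    return "spam" if p_spam > p_ham else "ham"

print(classify("free money"))       # → spam
print(classify("project meeting"))  # → ham
```

Working in log space is the standard trick: multiplying many small probabilities underflows, while summing their logs is numerically stable and preserves the ranking.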
Logistic Regression
The linear model that dominates text classification
Why It Works for Text
Logistic regression learns a weighted sum of features, passed through a sigmoid function to produce a probability. For text, the features are typically TF-IDF values, and the model learns which words are most predictive of each class. It consistently achieves 83–88% accuracy on sentiment analysis benchmarks, often matching or beating more complex models. Logistic regression works so well for text because TF-IDF features are high-dimensional and sparse — exactly the regime where linear models shine. With L1 regularization (Lasso), it performs automatic feature selection, zeroing out irrelevant words. With L2 regularization (Ridge), it handles correlated features gracefully. The learned weights are directly interpretable: the highest-weighted words for "positive" might be "excellent", "amazing", "loved."
Logistic Regression for Text
Model: P(positive | doc) = sigmoid(w · x + b)
- x = TF-IDF vector of document
- w = learned word weights
Learned weights (sentiment):
- "excellent": +2.3
- "amazing": +2.1
- "terrible": -2.8
- "boring": -1.9
- "the": +0.01 (near zero)
Regularization:
- L1 (Lasso): sparse weights, feature selection
- L2 (Ridge): small weights, handles correlation
Performance:
- Sentiment: 83-88% accuracy (F1: 0.83)
- Spam: 97-99% accuracy
- Fast training, fast inference
Key insight: Logistic regression with TF-IDF is the strongest classical baseline for text classification. In comparative studies, it achieves the best F1 scores among classical models. If you can only try one model, try this one.
Support Vector Machines (SVMs)
Finding the maximum-margin decision boundary in high-dimensional space
SVMs for Text
Support Vector Machines find the hyperplane that maximizes the margin between classes. In the high-dimensional space of TF-IDF vectors, SVMs are particularly effective because they handle sparse, high-dimensional data well and are robust to overfitting through the margin constraint. Linear SVMs (LinearSVC) are the standard choice for text — kernel SVMs are usually unnecessary because the feature space is already high-dimensional enough for linear separability. SVMs were the dominant text classification method from the late 1990s through the early 2010s, winning most shared tasks and competitions. They remain competitive today, especially when combined with careful feature engineering. The key advantage over logistic regression: SVMs focus on the hardest examples (support vectors near the boundary) rather than all examples, making them more robust to noisy data.
SVM Properties
Core idea: Find the hyperplane that maximizes the margin between classes
  Positive class  + + +  ← margin → | ← margin →  - - -  Negative class
For text:
- Use Linear SVM (not kernel SVM)
- High-dimensional TF-IDF is already separable
- Kernels add cost without benefit
Strengths:
- Robust to high dimensionality
- Focuses on hard examples (support vectors)
- Strong generalization from the margin
Weaknesses:
- Slower training than Naive Bayes/LR
- No probability output (by default)
- Less interpretable than LR weights
Key insight: For text classification, linear models (LR, SVM) are hard to beat with classical methods. The high dimensionality of text features means non-linear models rarely add value — the data is already linearly separable in TF-IDF space.
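In scikit-learn the standard linear-SVM setup is `LinearSVC` over TF-IDF. A minimal sketch on invented spam/ham snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled data (invented for illustration).
docs = [
    "win a free prize now", "free cash, claim now", "you won free money",
    "meeting notes attached", "see you at lunch", "project update for the team",
]
labels = ["spam"] * 3 + ["ham"] * 3

# Linear SVM over TF-IDF — the standard classical setup for text.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["claim your free prize"])[0])  # → spam
print(model.predict(["lunch meeting notes"])[0])    # → ham
```

Note that `LinearSVC` exposes `decision_function` (signed distance to the hyperplane) rather than probabilities; wrap it in `CalibratedClassifierCV` if calibrated probabilities are needed.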
Deep Learning for Classification
CNNs, RNNs, and transformers — when neural models add value
Neural Approaches
Deep learning brought three waves of neural text classifiers. TextCNN (Kim, 2014) applies convolutional filters over word embeddings to capture local n-gram patterns — fast and effective for short texts. LSTMs and BiLSTMs process text sequentially, capturing long-range dependencies that bag-of-words models miss. Transformer-based models (BERT, RoBERTa) achieve state-of-the-art by fine-tuning pre-trained language models on classification tasks. BERT adds a classification head on top of the [CLS] token representation and fine-tunes the entire model. The accuracy gains over classical models are real but task-dependent: for simple binary sentiment, BERT might improve 2–3% over logistic regression; for nuanced multi-class tasks with subtle distinctions, the gap can be 10%+. The trade-off is always compute cost and complexity.
Neural Classification Models
TextCNN (2014):
- Convolutional filters over embeddings
- Captures local n-gram patterns
- Fast, good for short texts
BiLSTM (2015+):
- Sequential processing, both directions
- Captures long-range dependencies
- + Attention for weighted aggregation
BERT fine-tuning (2018+):
- [CLS] token → classification head
- Fine-tune entire pre-trained model
- SOTA on most benchmarks
Accuracy comparison (sentiment):
- Naive Bayes: ~82%
- LR + TF-IDF: ~85%
- TextCNN: ~87%
- BiLSTM: ~88%
- BERT: ~93%
Key insight: Deep learning models are not always worth the cost for text classification. If logistic regression gets 85% and BERT gets 87%, the 2% gain may not justify 100x the compute. But for hard tasks with subtle distinctions, transformers are transformative.
Feature Engineering for Text
The features you choose matter more than the model you pick
Beyond Bag-of-Words
For classical models, feature engineering is where the real performance gains come from. Beyond unigram TF-IDF, you can add n-gram features: bigrams ("not good", "very bad") capture negation and intensification that unigrams miss. Character n-grams capture morphological patterns and are robust to typos. Metadata features like document length, punctuation count, capitalization ratio, and emoji presence add signal. Domain-specific features like sentiment lexicon scores (AFINN, VADER) provide expert knowledge. Feature selection using chi-squared tests or mutual information removes noise features. The best classical systems combine multiple feature types: TF-IDF unigrams + bigrams + character n-grams + metadata, often matching neural models without GPU costs.
Feature Engineering Toolkit
N-gram features:
- Unigrams: "not", "good"
- Bigrams: "not good" (captures negation!)
- Trigrams: "not very good"
Character n-grams:
- "amazing" → "ama", "maz", "azi", "zin"
- Robust to typos: "amzing" still matches
Metadata features:
- Document length, sentence count
- Punctuation ratio (!!!)
- Capitalization ratio (ALL CAPS)
- Emoji count, URL count
Lexicon features:
- VADER sentiment score
- AFINN word-level scores
- Domain-specific dictionaries
Best practice: Combine TF-IDF (1,2)-grams + metadata
Key insight: Bigram features are the single most impactful addition to a unigram baseline. They capture negation ("not good"), intensification ("very bad"), and multi-word expressions ("New York") that unigrams completely miss.
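Combining feature types is straightforward in scikit-learn with `FeatureUnion`. A minimal sketch (toy documents invented) joining word (1,2)-grams with character n-grams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy documents (invented); note the deliberate typo "amzing".
docs = ["not good at all", "very bad movie", "amazing film", "amzing film"]

# Word (1,2)-grams capture negation like "not good"; character n-grams
# ("char_wb" = char n-grams within word boundaries) tolerate typos.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
])

X = features.fit_transform(docs)
print(X.shape)  # (4, total word n-gram + char n-gram features)
```

The typo "amzing" shares character n-grams such as "ing" with "amazing", so the two documents overlap in feature space even though their word features do not match.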
Common Pitfalls
Class imbalance, data leakage, and the mistakes that silently kill accuracy
What Goes Wrong
Class imbalance is the most common pitfall. If 95% of emails are not spam, a model that always predicts "not spam" gets 95% accuracy but catches zero spam. Use F1 score (not accuracy) as your primary metric, and consider oversampling, undersampling, or class weights. Data leakage happens when test data information leaks into training — fitting TF-IDF on the entire dataset before splitting means the model has seen test vocabulary statistics. Always fit preprocessing on training data only. Domain shift: a model trained on movie reviews may fail on product reviews because the vocabulary and sentiment expressions differ. Label noise: human annotators disagree 10–20% of the time on subjective tasks like sentiment, setting a ceiling on model performance. Overfitting on spurious correlations: the model learns that reviews mentioning "Oscar" are positive, not because of sentiment but because Oscar-nominated films get more positive reviews.
Pitfall Checklist
Class imbalance:
- 95% negative, 5% positive
- Accuracy = 95% by always saying "negative"
- Fix: use F1, class weights, resampling
Data leakage:
- Fitting TF-IDF on train + test
- Fix: fit_transform(train), transform(test)
Domain shift:
- Train on movie reviews, test on products
- Fix: domain adaptation, more diverse data
Label noise:
- Annotators disagree 10-20% of the time
- Fix: multiple annotators, adjudication
Spurious correlations:
- "Oscar" → positive (not causal)
- Fix: error analysis, counterfactual tests
Key insight: The difference between a demo and a production classifier is how you handle edge cases. Class imbalance, domain shift, and label noise are not exceptions — they are the norm in real-world text classification.
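The leakage and imbalance fixes can be sketched together in scikit-learn (the imbalanced spam/ham toy data is invented): split first, fit the vectorizer on training data only, weight the minority class, and score with F1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (invented): few "spam", many "ham".
docs = (["free prize now", "win free cash", "claim free money"]
        + ["meeting notes %d" % i for i in range(9)])
labels = ["spam"] * 3 + ["ham"] * 9

# Stratified split FIRST, then fit preprocessing on training data only —
# fitting TF-IDF on the full corpus would leak test vocabulary statistics.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)  # fit on train only
X_test_tfidf = vec.transform(X_test)        # reuse the train vocabulary

# class_weight="balanced" reweights examples inversely to class frequency.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train_tfidf, y_train)

f1 = f1_score(y_test, clf.predict(X_test_tfidf), pos_label="spam")
print(f1)
```

Because the test matrix is built with `transform`, words that appear only in test documents are simply ignored, exactly as they would be at deployment time.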
The Classification Pipeline
Putting it all together — from raw text to deployed model
End-to-End Pipeline
A production text classification pipeline follows a structured flow. Data collection: gather labeled examples (manually annotated, crowd-sourced, or from existing systems). Preprocessing: clean and normalize text using the pipeline from Chapter 2. Feature extraction: convert text to numerical features (TF-IDF, embeddings). Model selection: start with logistic regression as a baseline, then try more complex models only if needed. Evaluation: use stratified cross-validation with F1 as the primary metric. Error analysis: examine misclassified examples to find systematic patterns. Iteration: improve features, add data for weak categories, adjust thresholds. Deployment: serve predictions via API with monitoring for drift. The most important step is error analysis — understanding why the model fails teaches you more than any hyperparameter search.
Pipeline Steps
1. Collect: labeled data (1K+ examples)
2. Split: train/val/test (stratified)
3. Preprocess: clean, normalize
4. Featurize: TF-IDF (1,2)-grams
5. Baseline: logistic regression
6. Evaluate: F1, precision, recall
7. Error analysis: why does it fail?
8. Iterate: better features, more data
9. Compare: SVM, BERT if needed
10. Deploy: API + monitoring
Model selection heuristic:
- <1K examples: Naive Bayes
- 1K-100K: Logistic Regression + TF-IDF
- 100K+: consider BERT fine-tuning
- Always start simple; add complexity only when needed
Key insight: The best text classification systems are built through iteration, not architecture. Start with the simplest model that works, do thorough error analysis, and add complexity only where the errors demand it.
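The core of the pipeline above can be sketched in a few lines of scikit-learn (toy corpus invented). Bundling featurizer and model in a `Pipeline` means cross-validation refits TF-IDF inside each fold, so the evaluation itself cannot leak:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (invented for illustration).
docs = [
    "excellent plot", "amazing acting", "loved this film",
    "excellent and amazing", "loved the plot", "amazing, excellent work",
    "terrible film", "boring plot", "hated this movie",
    "terrible and boring", "hated the acting", "boring, terrible mess",
]
labels = ["pos"] * 6 + ["neg"] * 6

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # step 4: (1,2)-gram features
    LogisticRegression(),                 # step 5: baseline model
)

# Steps 2 + 6: stratified folds, F1 as the primary metric.
scores = cross_val_score(pipeline, docs, labels, scoring="f1_macro",
                         cv=StratifiedKFold(n_splits=3))
print(scores.mean())
```

From here, steps 7-9 are manual: inspect the misclassified documents in each fold, then swap in `LinearSVC` or a fine-tuned transformer only if the errors justify it.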