Ch 2 — Text Preprocessing

Cleaning, normalization, tokenization, and building a preprocessing pipeline
High Level
Raw Text → Clean → Normalize → Tokenize → Filter → Ready
Why Preprocessing Matters
Garbage in, garbage out — but for text, the garbage is subtle
The Problem
Real-world text is messy. It contains HTML tags, special characters, inconsistent casing, extra whitespace, emojis, URLs, and encoding artifacts. The same word appears in dozens of surface forms: "U.S.A.", "USA", "us", "United States." Preprocessing transforms this chaos into a consistent, clean representation that models can learn from. Without it, your model treats "Running", "running", and "RUNNING" as three completely different words. Preprocessing typically accounts for 60–70% of the effort in a classical NLP project. Even with modern transformers that handle raw text, understanding preprocessing helps you debug when models fail on edge cases.
Before vs After
Raw text:
"<p>Check out https://example.com!!! The U.S.A.'s GDP grew 2.5% in Q3... it's AMAZING!!! 🎉🎉</p>"
After preprocessing:
"check out the usa gdp grew 2.5 percent in q3 it is amazing"
What changed: HTML and URL removed; lowercased; punctuation cleaned; contractions expanded; emojis removed; whitespace normalized.
Key insight: Preprocessing decisions are task-dependent. For sentiment analysis, emojis carry signal and should be kept. For legal document search, casing matters. There is no universal "correct" preprocessing pipeline.
Text Cleaning
Removing noise while preserving signal
Cleaning Steps
Text cleaning removes elements that add noise without contributing meaning. HTML/XML stripping removes markup tags. URL removal replaces links with a placeholder or removes them entirely. Special character handling removes or normalizes characters like &, @, #. Whitespace normalization collapses multiple spaces, tabs, and newlines into single spaces. Encoding fixes repair UTF-8 artifacts, such as "don't" being rendered as the mojibake "donâ€™t". Number handling can normalize numbers to a token like <NUM> or spell them out. The key principle: remove what doesn't carry meaning for your task while preserving what does.
Common Operations
1. Strip HTML/XML: "<b>Hello</b> world" → "Hello world"
2. Remove URLs: "Visit https://x.com" → "Visit"
3. Fix encoding: "donâ€™t" → "don't"
4. Normalize whitespace: "hello \n world" → "hello world"
5. Handle numbers: "grew 2.5%" → "grew <NUM> percent"
6. Expand contractions: "it's" → "it is"; "won't" → "will not"
Key insight: Always inspect your data before and after cleaning. Automated cleaning can destroy signal — removing "#" kills hashtag semantics in social media data, and removing numbers destroys financial text.
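The cleaning steps above can be sketched with the standard-library `re` module. This is a minimal illustration, not a production cleaner; the `clean` helper and the single mojibake fix are assumptions for this example.

```python
import re

def clean(text):
    """A rough cleaning sketch: strip markup, drop URLs, fix one
    common mojibake artifact, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # 1. strip HTML/XML tags
    text = re.sub(r"https?://\S+", " ", text)  # 2. remove URLs
    text = text.replace("â€™", "'")            # 3. fix a common UTF-8 artifact
    text = re.sub(r"\s+", " ", text).strip()   # 4. normalize whitespace
    return text

print(clean("<b>Hello</b>\n  world"))          # → Hello world
print(clean("Visit https://x.com today"))      # → Visit today
```

Note the order: tags and URLs are removed before whitespace is collapsed, so the gaps they leave behind get cleaned up in the final step.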
Normalization
Reducing surface variation to canonical forms
What Normalization Does
Normalization reduces the vocabulary by mapping different surface forms to the same canonical representation. Lowercasing is the most common: "Apple", "apple", and "APPLE" become one token. Unicode normalization (NFC or NFKC) ensures that characters composed in different ways are treated identically — the accented "é" can be stored as one codepoint or two, and normalization picks one. Accent removal maps "café" to "cafe." Abbreviation expansion maps "Dr." to "Doctor." Each normalization step reduces vocabulary size but potentially loses information. Lowercasing "Apple" (the company) and "apple" (the fruit) merges two distinct concepts. The right level of normalization depends on your task and vocabulary size constraints.
Normalization Decisions
Lowercasing: "Apple" → "apple". Pro: smaller vocabulary. Con: "Apple Inc" merges with "apple fruit".
Unicode normalization (NFC): é as one codepoint (U+00E9) or as e + combining acute (U+0065 U+0301); both normalize to the single codepoint.
Accent removal: "résumé" → "resume". Danger: accents distinguish different words in some languages.
Abbreviation expansion: "Dr. Smith" → "Doctor Smith"; "U.S.A." → "USA"
Key insight: Normalization is a lossy compression of your vocabulary. Every normalization step trades information for consistency. The goal is to lose the variation that doesn't matter for your task.
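The Unicode points above can be demonstrated with the standard-library `unicodedata` module; the `strip_accents` helper name is an illustrative assumption.

```python
import unicodedata

composed = "\u00e9"        # "é" stored as a single codepoint
decomposed = "e\u0301"     # "e" + combining acute accent (two codepoints)

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

def strip_accents(text):
    """Accent removal: decompose (NFD), then drop combining marks.
    Lossy -- accents distinguish words in some languages."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )

print(strip_accents("résumé"))   # → resume
```

NFC composes characters into single codepoints where possible; NFD does the reverse, which is what makes the combining marks individually removable.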
Tokenization: Word-Level
Splitting text into meaningful units — harder than it sounds
Word Tokenization
Tokenization splits text into discrete units (tokens) that become the input to your model. The simplest approach is word-level tokenization: split on whitespace and punctuation. But even this is tricky. Should "New York" be one token or two? Is "don't" one token ("don't"), two ("do" + "n't"), or three ("do" + "not")? What about "state-of-the-art" or "3.14"? Different tokenizers make different choices. spaCy uses linguistic rules to handle contractions and special cases. NLTK provides multiple tokenizers with different trade-offs. Word-level tokenization creates a fixed vocabulary from the training data, and any word not seen during training becomes an unknown token (UNK) — a major limitation for production systems.
Word Tokenization Examples
Simple whitespace split: "I can't believe it's 3.14!" → ["I", "can't", "believe", "it's", "3.14!"]
spaCy tokenizer: → ["I", "ca", "n't", "believe", "it", "'s", "3.14", "!"]
NLTK word_tokenize: → ["I", "ca", "n't", "believe", "it", "'s", "3.14", "!"]
The OOV problem: with a training vocab of 50,000 words, a new word like "ChatGPT" becomes <UNK> — every unknown word is lost information.
Key insight: Word-level tokenization's fatal flaw is the out-of-vocabulary (OOV) problem. Any word not in the training vocabulary becomes UNK, losing all information. This limitation drove the development of subword tokenization.
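The OOV problem can be shown in a few lines. This is a rough regex sketch, not spaCy's or NLTK's rule set, and the toy vocabulary is an assumption for illustration.

```python
import re

def simple_tokenize(text):
    """Rough word-level tokenizer: lowercase, keep apostrophe words.
    (A sketch -- spaCy and NLTK apply far richer rules.)"""
    return re.findall(r"\w+(?:'\w+)?", text.lower())

# Toy training vocabulary (an assumption for illustration).
vocab = {"i", "can't", "believe", "it's", "amazing"}

def encode(text):
    """Map every token outside the training vocab to <UNK>."""
    return [tok if tok in vocab else "<UNK>" for tok in simple_tokenize(text)]

print(encode("I can't believe ChatGPT exists"))
# → ['i', "can't", 'believe', '<UNK>', '<UNK>']
```

"ChatGPT" and "exists" both collapse to the same `<UNK>` token, so the model cannot distinguish them — the information loss the key insight describes.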
Tokenization: Subword Methods
BPE, WordPiece, and SentencePiece — the modern standard
Subword Tokenization
Subword tokenization solves the OOV problem by splitting words into smaller, reusable pieces. Common words stay whole ("the", "and"), while rare words are broken into subwords ("unhappiness" → "un" + "happiness"). Byte Pair Encoding (BPE), used by GPT models, iteratively merges the most frequent character pairs until reaching a target vocabulary size. WordPiece, used by BERT, is similar but optimizes for likelihood rather than frequency. SentencePiece treats the input as a raw character stream, handling whitespace as an ordinary symbol (no pre-tokenization needed), and supports both BPE and Unigram algorithms. Typical vocabulary sizes range from 30,000 to 50,000 tokens. The key advantage: any text can be tokenized without UNK tokens, because the algorithm can always fall back to individual characters.
Subword Algorithms
BPE (GPT, RoBERTa): "unhappiness" → ["un", "happiness"]; "ChatGPT" → ["Chat", "G", "PT"]. Merges the most frequent pairs iteratively.
WordPiece (BERT): "unhappiness" → ["un", "##happiness"]. The "##" prefix marks a continuation of a word. Optimizes for likelihood, not frequency.
SentencePiece (T5, LLaMA): language-agnostic, works on raw text, needs no pre-tokenization, supports BPE and Unigram.
Typical vocab sizes: BERT 30,522 tokens; GPT-2 50,257 tokens; LLaMA 32,000 tokens.
Key insight: Subword tokenization is the universal standard in modern NLP. It balances vocabulary size against sequence length — common words are single tokens (efficient), rare words are multiple tokens (no information loss).
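The BPE merge loop can be sketched in pure Python on a tiny toy corpus. This is a pedagogical sketch of the learning phase only (real tokenizers also record the merge order for encoding new text), and the corpus is assumed for illustration.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, vocab):
    """Merge the chosen pair into a single symbol everywhere."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: each word as space-separated symbols with an end marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(3):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge(best, vocab)
    print("merge", step, best)

print(sorted(vocab))
```

After three merges the frequent suffix "est</w>" has become a single reusable symbol, while rarer words remain split into smaller pieces.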
Stemming vs Lemmatization
Reducing words to their root forms — two very different approaches
Two Approaches
Stemming chops off word endings using simple rules. The Porter Stemmer (1980) applies cascading suffix-stripping rules: "running" → "run", "studies" → "studi", "operational" → "oper". It's fast but crude — "studi" isn't a real word. Lemmatization uses vocabulary lookup and morphological analysis to find the true dictionary form (lemma): "running" → "run", "studies" → "study", "better" → "good." It's slower but linguistically correct. Lemmatization requires knowing the word's part of speech — "saw" as a verb lemmatizes to "see", but "saw" as a noun stays "saw." In practice, stemming is used when speed matters and exact forms don't (search engines), while lemmatization is preferred when linguistic accuracy matters (text analysis, chatbots).
Stemming vs Lemmatization
Stemming (Porter Stemmer): "running" → "run"; "studies" → "studi"; "operational" → "oper"; "university" → "univers"
Lemmatization (WordNet): "running" → "run"; "studies" → "study"; "operational" → "operational"; "better" → "good"
POS matters for lemmatization: "saw" (verb) → "see"; "saw" (noun) → "saw"
Key insight: Modern transformer models have largely made stemming and lemmatization unnecessary for many tasks. Subword tokenization handles morphological variation implicitly, and contextual embeddings capture word form differences. But they remain essential for classical NLP pipelines and search systems.
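The crude, rule-based flavor of stemming can be felt in a toy suffix stripper. This is emphatically not the real Porter algorithm (which applies many cascading rules with measure conditions); it is a three-rule sketch that happens to reproduce the examples above.

```python
def toy_stem(word):
    """Toy suffix stripper -- illustrates stemming's crude chop,
    NOT the real Porter algorithm."""
    if word.endswith("ies"):
        return word[:-3] + "i"          # studies -> studi (not a real word!)
    if word.endswith("ational"):
        return word[:-7]                # operational -> oper
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]            # running -> runn -> run
        return stem
    return word

for w in ["running", "studies", "operational"]:
    print(w, "->", toy_stem(w))
```

A lemmatizer would instead look each word up in a vocabulary (with part-of-speech information) to return a real dictionary form, which is why it is slower but linguistically correct.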
Stop Words and Filtering
Removing high-frequency words — when it helps and when it hurts
Stop Word Removal
Stop words are high-frequency words like "the", "is", "at", "which" that appear in almost every document. Removing them reduces vocabulary size and computational cost while (sometimes) improving signal-to-noise ratio. Standard stop word lists contain 100–300 words. But stop word removal is dangerous in many contexts. "To be or not to be" becomes meaningless without stop words. Negation words like "not", "no", "never" are often on stop word lists but carry critical semantic meaning — "not good" and "good" have opposite meanings. For bag-of-words models and TF-IDF, stop word removal often helps. For neural models and transformers, it usually hurts because these models learn to use function words as structural signals. The modern best practice is to let the model decide what to ignore through attention weights.
When to Remove Stop Words
Remove for: TF-IDF / bag-of-words models; document clustering; keyword extraction; search index optimization.
Keep for: sentiment analysis ("not good"); machine translation; question answering; any transformer-based model; any task where word order matters.
Danger example: "The movie was not good at all" → after stop word removal: "movie good". Meaning completely inverted!
Key insight: Stop word removal is a relic of the sparse-vector era when vocabulary size directly impacted computation. With modern models, it's usually better to let the model learn which words to attend to and which to ignore.
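The danger example above can be reproduced in a few lines. The toy stop word list here is an assumption, but it mirrors real published lists, many of which do include negations like "not".

```python
# Toy stop word list (an assumption) -- note it includes the
# negation "not", as many published lists do.
STOP_WORDS = {"the", "a", "an", "is", "was", "at", "all", "not"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The movie was not good at all".split()
print(remove_stop_words(tokens))   # → ['movie', 'good']  -- negation lost!
```

For a sentiment classifier this filter silently flips the label of every negated sentence, which is why the modern advice is to let the model decide what to ignore.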
Building a Preprocessing Pipeline
Putting it all together — and knowing when to skip steps
Pipeline Design
A preprocessing pipeline chains cleaning, normalization, tokenization, and filtering steps in a specific order. The order matters: you should clean HTML before tokenizing, and tokenize before removing stop words. For classical ML pipelines (Naive Bayes, SVM, logistic regression), a typical pipeline is: clean → lowercase → tokenize (word-level) → remove stop words → stem/lemmatize → vectorize (TF-IDF). For transformer-based models, preprocessing is minimal: clean → use the model's pre-trained tokenizer (BPE/WordPiece). The pre-trained tokenizer was trained on specific preprocessing, so you must match it exactly. Adding lowercasing to a cased BERT model will degrade performance. The golden rule: your preprocessing must match what the model expects.
Two Pipeline Patterns
Classical ML pipeline: raw text → strip HTML, fix encoding → lowercase → word tokenize (spaCy/NLTK) → remove stop words → lemmatize → TF-IDF vectorize → model (SVM, Naive Bayes)
Transformer pipeline: raw text → minimal cleaning (HTML only) → model's tokenizer (BPE/WordPiece) → model (BERT, GPT, T5)
Common mistakes: lowercasing input to cased BERT; adding stop word removal before BERT; custom tokenization before BPE.
Key insight: The trend in NLP is toward less preprocessing. Transformers learn their own normalization implicitly. But understanding the full pipeline is essential — you need it for classical methods, for debugging, and for knowing what the transformer is doing under the hood.
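The classical pipeline can be sketched end to end in plain Python. All helper names are illustrative assumptions; the stemming step is a crude stand-in for a real stemmer or lemmatizer, and TF-IDF vectorization and the model would follow in practice.

```python
import re

# Toy stop word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "is", "are", "was", "at", "all", "and"}

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text)

def stem(tok):
    # crude suffix strip standing in for a real stemmer/lemmatizer
    for suffix in ("ing", "s"):
        if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
            tok = tok[: -len(suffix)]
            if len(tok) > 2 and tok[-1] == tok[-2]:
                tok = tok[:-1]                 # running -> runn -> run
            break
    return tok

def preprocess(text):
    """clean -> lowercase -> tokenize -> remove stop words -> stem"""
    tokens = tokenize(clean(text).lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The cats are RUNNING!</p>"))   # → ['cat', 'run']
```

The transformer pipeline, by contrast, would stop after `clean` and hand the text to the model's own pre-trained tokenizer unchanged.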