Pipeline Design
A preprocessing pipeline chains cleaning, normalization, tokenization, and filtering steps in a specific order. Order matters: clean HTML before tokenizing, and tokenize before removing stop words.

For classical ML models (Naive Bayes, SVM, logistic regression), a typical pipeline is: clean → lowercase → tokenize (word-level) → remove stop words → stem/lemmatize → vectorize (TF-IDF). For transformer-based models, preprocessing is minimal: clean → apply the model's pre-trained tokenizer (BPE/WordPiece). That tokenizer was trained on text with specific preprocessing, so you must match it exactly: lowercasing input to a cased BERT model, for example, will degrade performance. The golden rule: your preprocessing must match what the model expects.
Two Pipeline Patterns
Classical ML pipeline:
raw text
→ strip HTML, fix encoding
→ lowercase
→ word tokenize (spaCy/NLTK)
→ remove stop words
→ lemmatize
→ TF-IDF vectorize
→ model (SVM, Naive Bayes)
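The classical pipeline above can be sketched in plain Python. This is a minimal illustration: a real project would use spaCy or NLTK for tokenization and lemmatization and scikit-learn for TF-IDF; the stop-word list and suffix-stripping rule here are toy stand-ins.

```python
import html
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in"}  # toy list

def preprocess(text):
    """Classical-ML preprocessing: clean -> lowercase -> tokenize -> stop words -> stem."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))  # strip HTML tags, decode entities
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)                # crude word-level tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [re.sub(r"(ing|ed|s)$", "", t) or t for t in tokens]  # toy suffix stemmer
    return tokens

def tfidf(docs):
    """Vectorize a corpus (list of token lists) into sparse TF-IDF dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = [preprocess("<p>The cat is chasing mice</p>"),
        preprocess("Dogs are chasing the ball")]
vectors = tfidf(docs)  # ready to feed an SVM or Naive Bayes classifier
```

Note how "chasing" appears in both documents, so its IDF (and hence its weight) drops to zero: exactly the behavior TF-IDF is designed to have for uninformative shared terms.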
Transformer pipeline:
raw text
→ minimal cleaning (HTML only)
→ model's tokenizer (BPE/WordPiece)
→ model (BERT, GPT, T5)
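Inside the transformer pipeline, the model's tokenizer does the real work. A WordPiece tokenizer splits unknown words into known subwords by greedy longest-match against its vocabulary. The sketch below uses a made-up toy vocabulary to show the mechanics; in practice you would load the model's own tokenizer (e.g. Hugging Face's AutoTokenizer) rather than write one.

```python
# Toy WordPiece tokenizer: greedy longest-match subword splitting.
# VOCAB is invented for illustration; a real BERT vocab has ~30k entries.
VOCAB = {"[UNK]", "un", "##break", "##able", "break", "##ing", "the", "wall"}

def wordpiece(word):
    """Split one word into subwords by greedy longest-match against VOCAB."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            # Continuation pieces (not at word start) carry the "##" prefix.
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no subword matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text):
    return [p for w in text.split() for p in wordpiece(w)]

print(tokenize("the unbreakable wall"))  # ['the', 'un', '##break', '##able', 'wall']
```

This is why transformers need no stop-word removal or stemming: "unbreakable" and "breaking" share the subword "break" in the model's input, so morphology is captured by the tokenizer itself.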
Common mistakes:
Lowercasing input to a cased BERT model
Removing stop words before BERT
Applying custom tokenization before BPE
Key insight: The trend in NLP is toward less preprocessing, because transformers learn their own normalization implicitly. But understanding the full pipeline remains essential: you need it for classical methods, for debugging, and for knowing what the transformer's tokenizer is doing under the hood.