Ch 8 — Transfer Learning & Fine-Tuning

Pre-train then fine-tune, feature extraction vs full fine-tuning, LoRA, and the Hugging Face ecosystem
High Level
Pre-train → Fine-tune → PEFT → Hub → Evaluate → Deploy
The Pre-Train / Fine-Tune Paradigm
Why training from scratch is almost never the right answer
The Core Idea
Transfer learning is the most important practical concept in modern NLP. Instead of training a model from scratch on your specific task, you start with a model that has already learned general language understanding from billions of words, then adapt it to your task with a small amount of labeled data. Pre-training BERT from scratch costs roughly $10,000–$50,000 in compute and requires billions of words of text. Fine-tuning BERT for your classification task costs $1–$10 and requires hundreds to thousands of labeled examples. This 1000x cost reduction is why transfer learning democratized NLP — a startup with 1,000 labeled examples can now match the performance of systems that previously required millions. The pre-trained model provides a foundation of language understanding that transfers across tasks, domains, and even languages.
Economics of Transfer Learning
Training from scratch:
- Data: billions of words
- Compute: $10,000–$50,000+
- Time: days to weeks on GPU clusters
- Result: general language model

Fine-tuning:
- Data: 1,000–10,000 labeled examples
- Compute: $1–$10
- Time: minutes to hours on one GPU
- Result: task-specific model

Cost reduction: ~1,000x · Data reduction: ~1,000x · Performance: comparable or better

Analogy: pre-training = learning to read; fine-tuning = learning a specific job.
Key insight: Transfer learning works because language understanding is general. A model that has learned grammar, semantics, and world knowledge from Wikipedia and books can apply that knowledge to classify medical documents, extract legal entities, or detect spam.
Feature Extraction vs Full Fine-Tuning
Two strategies for adapting pre-trained models
Two Approaches
Feature extraction freezes the pre-trained model and uses its output representations as fixed features for a separate classifier. You pass your text through BERT, take the [CLS] token embedding, and train a logistic regression or small neural network on top. The pre-trained weights never change. This is fast, requires minimal compute, and works well when your task is similar to the pre-training data. Full fine-tuning updates all parameters of the pre-trained model along with the task-specific head. The entire model adapts to your task, learning domain-specific patterns. Full fine-tuning typically achieves 2–5% higher accuracy than feature extraction but requires more compute and risks catastrophic forgetting — the model losing its general language knowledge while adapting to the specific task. The standard practice is to use a low learning rate (2e-5 to 5e-5) to preserve pre-trained knowledge.
Comparison
Feature extraction:
- Freeze BERT, train classifier on top
- Fast: minutes on CPU
- Low risk of catastrophic forgetting
- Works with very small datasets (100+ examples)
- Accuracy: ~88% (sentiment example)

Full fine-tuning:
- Update all BERT parameters + head
- Slower: hours on GPU
- Risk of catastrophic forgetting
- Needs more data (1,000+ examples)
- Accuracy: ~93% (sentiment example)

Best practices for full fine-tuning:
- Learning rate: 2e-5 to 5e-5
- Epochs: 2–4 (more risks overfitting)
- Warmup: 10% of training steps
- Weight decay: 0.01
Key insight: Start with feature extraction as a quick baseline. If it's not good enough, try full fine-tuning. The 2-5% accuracy gap is real but not always worth the added complexity, compute, and risk of overfitting.
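The feature-extraction baseline can be sketched without any deep-learning library: the frozen encoder is stood in for by fixed vectors (a placeholder for BERT's [CLS] embeddings; random here purely for illustration), and only a logistic-regression head is trained on top. This is a minimal sketch of the idea, not a production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen [CLS] embeddings from a pre-trained encoder.
# In practice these come from BERT; random vectors are used here for illustration.
n, d = 200, 768
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Logistic-regression head: the ONLY trainable parameters.
w = np.zeros(d)
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(100):
    p = sigmoid(X @ w + b)
    losses.append(-np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9)))
    # Gradients touch only the head; no gradient flows into the frozen features X.
    grad_w = X.T @ (p - y) / n
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b
```

Because the encoder is frozen, each training step is a cheap linear-model update, which is why this approach runs in minutes on a CPU.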
Task-Specific Heads
Adding the right output layer for your task
Architecture Patterns
Fine-tuning requires adding a task-specific head on top of the pre-trained model. The head is a small neural network that transforms the model's representations into task-appropriate outputs. For classification, the head takes the [CLS] token representation and passes it through a linear layer with softmax: 768-dim → num_classes. For NER/token classification, each token's representation gets its own linear layer: 768-dim per token → num_tags per token. For question answering, two linear layers predict the start and end positions of the answer span. For sentence similarity, both sentences are encoded separately, and their [CLS] representations are compared using cosine similarity. The head is randomly initialized and trained from scratch, while the pre-trained body is fine-tuned with a lower learning rate.
Task Heads
Classification:
- [CLS] → Linear(768, num_classes) → softmax
- Loss: cross-entropy

Token classification (NER):
- each token → Linear(768, num_tags)
- Loss: cross-entropy per token

Question answering:
- each token → Linear(768, 2)
- Output: start_score, end_score
- Answer = tokens between argmax(start) and argmax(end)

Sentence similarity:
- sent_A → [CLS_A] embedding; sent_B → [CLS_B] embedding
- similarity = cosine(CLS_A, CLS_B)
Key insight: The task head is deliberately simple — usually just a linear layer. The pre-trained model does the heavy lifting of understanding language; the head just maps that understanding to your specific output format.
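The classification head really is this small. A NumPy sketch of its forward pass, with the dimensions from the text (768-dim [CLS] vector → num_classes) and randomly initialized weights standing in for the untrained head:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim, num_classes = 768, 3

# Stand-in for the [CLS] token representation produced by the encoder.
cls_vec = rng.normal(size=hidden_dim)

# Task head: Linear(768, num_classes). Randomly initialized, trained from scratch.
W = rng.normal(scale=0.02, size=(hidden_dim, num_classes))
b = np.zeros(num_classes)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = cls_vec @ W + b
probs = softmax(logits)          # a distribution over the classes
pred = int(np.argmax(probs))     # predicted class index
```

During fine-tuning, cross-entropy loss on `probs` drives gradients through both the head and (at a lower learning rate) the encoder body.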
Parameter-Efficient Fine-Tuning (PEFT)
LoRA, adapters, and training 0.1% of parameters for 99% of the performance
The PEFT Revolution
Full fine-tuning updates all model parameters — for a 7B parameter model, that means storing optimizer states for 7 billion weights, requiring 50+ GB of GPU memory. Parameter-Efficient Fine-Tuning (PEFT) methods freeze most parameters and train only a tiny fraction. LoRA (Low-Rank Adaptation) is the most popular: instead of updating a weight matrix W directly, it learns a low-rank decomposition ΔW = A × B, where A and B are much smaller matrices. For a 7B model, LoRA typically trains 0.1–1% of parameters while achieving 95–99% of full fine-tuning performance. LoRA adapters are tiny (~6 MB vs ~14 GB for the full model), can be swapped at inference time, and multiple task-specific adapters can share one base model. This makes it practical to fine-tune large models on consumer GPUs.
LoRA Details
Full fine-tuning:
- Update W directly (all parameters)
- 7B model = 14 GB checkpoint
- Needs 50+ GB GPU memory

LoRA:
- Freeze W, learn ΔW = A × B
- A: (d, r), B: (r, d), with r ≪ d
- Typical rank r = 8–64

7B model + LoRA:
- Trainable params: ~6M (0.1%)
- Adapter size: ~6 MB
- GPU memory: ~16 GB
- Performance: 95–99% of full fine-tuning

Benefits:
- Multiple adapters share one base model
- Swap adapters at inference time
- Fine-tune on consumer GPUs (16 GB)
Key insight: LoRA proved that most of the knowledge is in the pre-trained weights. You only need to nudge a tiny fraction of parameters to adapt a model to a new task. This insight has profound implications for how we think about model adaptation.
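The arithmetic behind LoRA's parameter savings is easy to verify directly. A NumPy sketch for a single d×d weight matrix (d = 4096 and r = 8 are illustrative values, not taken from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 4096, 8  # hidden size of one layer and LoRA rank (illustrative)

W = rng.normal(size=(d, d))              # frozen pre-trained weight, never updated
A = rng.normal(scale=0.01, size=(d, r))  # trainable
B = np.zeros((r, d))                     # trainable; zero-init so ΔW starts at 0

delta_W = A @ B            # low-rank update, rank at most r
W_adapted = W + delta_W    # effective weight used at inference

full_params = W.size
lora_params = A.size + B.size
fraction = lora_params / full_params     # 2*d*r / d^2 = 2r/d ≈ 0.39% here
```

Because `B` starts at zero, the adapted model is exactly the base model at step 0, and training only nudges the low-rank correction, which is the property that makes LoRA safe to apply on top of frozen weights.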
Domain Adaptation
When general-purpose models meet specialized text
The Domain Gap
Pre-trained models learn from general text (Wikipedia, books, web pages), but many real-world applications involve specialized domains with unique vocabulary, writing styles, and knowledge. A BERT model trained on general text may not know that "MI" means "myocardial infarction" in medical text, or that "consideration" has a specific legal meaning in contracts. Domain-adaptive pre-training (DAPT) continues pre-training on domain-specific text before fine-tuning on the task. This produces domain-specific models: BioBERT (biomedical), SciBERT (scientific), LegalBERT (legal), FinBERT (financial), ClinicalBERT (clinical notes). DAPT typically improves performance by 3–8% F1 on domain-specific tasks. The two-stage approach — general pre-training → domain pre-training → task fine-tuning — is the standard recipe for production NLP in specialized domains.
Domain-Specific Models
General BERT:
- Pre-trained on Wikipedia + BookCorpus
- Good for general text tasks

Domain-adapted models:
- BioBERT: PubMed abstracts + PMC
- SciBERT: Semantic Scholar papers
- LegalBERT: legal documents, case law
- FinBERT: financial news, SEC filings
- ClinicalBERT: MIMIC clinical notes

Three-stage recipe:
1. General pre-training (BERT)
2. Domain pre-training (DAPT)
3. Task fine-tuning

Improvement from DAPT:
- Biomedical NER: +5% F1
- Legal classification: +3% F1
- Financial sentiment: +8% F1
Key insight: Domain adaptation is the single most impactful technique for production NLP in specialized fields. If your text looks different from Wikipedia, domain-adaptive pre-training will almost certainly improve your results.
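DAPT reuses the same masked-language-modeling objective as BERT's original pre-training, just on in-domain text. The masking step can be sketched as below; this is a simplification (BERT actually replaces 80% of chosen positions with [MASK], 10% with a random token, and leaves 10% unchanged), and the token ids are made-up stand-ins:

```python
import random

random.seed(0)

MASK_ID = 103     # [MASK] id in BERT's vocabulary (illustrative)
MASK_PROB = 0.15  # BERT's standard masking rate

def mask_tokens(token_ids):
    """Return (masked input, labels) for masked-language-model training.
    Labels are -100 (ignored by the loss) everywhere except masked positions."""
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)
            labels.append(tid)    # the model must predict the original token here
        else:
            inputs.append(tid)
            labels.append(-100)   # position not scored by the loss
    return inputs, labels

ids = list(range(1000, 1100))  # stand-in for a tokenized in-domain sentence
masked, labels = mask_tokens(ids)
```

Running this over a large corpus of domain text, and training the model to recover the masked tokens, is what teaches it domain vocabulary like "MI" for "myocardial infarction" before any task-specific fine-tuning happens.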
The Hugging Face Ecosystem
The platform that democratized NLP
The Ecosystem
Hugging Face has become the central platform for NLP and machine learning. The Model Hub hosts over 500,000 pre-trained models, from BERT variants to the latest LLMs, all downloadable with a single line of code. The Transformers library provides a unified API for loading, fine-tuning, and deploying models — switching from BERT to RoBERTa to DeBERTa requires changing one string. The Datasets library provides standardized access to thousands of NLP datasets. The Tokenizers library provides fast, production-ready tokenizers. The PEFT library implements LoRA and other parameter-efficient methods. The Trainer API handles training loops, evaluation, logging, and checkpointing. This ecosystem reduced the barrier to entry from "PhD in NLP" to "can write Python," enabling anyone to fine-tune state-of-the-art models.
Hugging Face Stack
Model Hub: 500K+ models

from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")

Transformers: unified API
- AutoTokenizer, AutoModelForXxx
- One API for BERT, GPT, T5, LLaMA...

Datasets: standardized data

from datasets import load_dataset
ds = load_dataset("imdb")

PEFT: parameter-efficient tuning
- LoRA, IA3, AdaLoRA

Trainer: training abstraction
- Handles loops, eval, logging, checkpoints
- Distributed training built-in
Key insight: Hugging Face didn't just build tools — it created a network effect. Researchers publish models on the Hub, practitioners download and fine-tune them, and the ecosystem grows. This is why it became the default platform for NLP.
Fine-Tuning Pitfalls
Catastrophic forgetting, overfitting, and the mistakes that waste GPU hours
What Goes Wrong
Catastrophic forgetting: the model loses general language knowledge while adapting to a narrow task. Symptoms: great performance on training data, poor generalization. Fix: lower learning rate, fewer epochs, gradual unfreezing.

Overfitting on small datasets: with only a few hundred examples, the model memorizes training data instead of learning patterns. Fix: use feature extraction instead, or apply strong regularization (dropout, weight decay).

Learning rate too high: destroys pre-trained representations in the first few steps, and the model never recovers. Fix: use warmup (10% of steps) and learning rates of 2e-5 to 5e-5.

Wrong model choice: using a 7B-parameter model when BERT-base would suffice wastes compute without improving accuracy.

Tokenizer mismatch: using a different tokenizer than the one the model was pre-trained with produces garbage representations.
Pitfall Checklist
Catastrophic forgetting:
- Cause: LR too high, too many epochs
- Fix: LR = 2e-5, epochs = 2–4, warmup

Overfitting on small data:
- Cause: <500 examples + full fine-tuning
- Fix: feature extraction, or LoRA

Learning rate too high:
- Cause: LR = 1e-3 destroys pre-trained weights
- Fix: LR = 2e-5 to 5e-5 with warmup

Wrong model size:
- Cause: 7B model for simple classification
- Fix: start with BERT-base (110M)

Tokenizer mismatch:
- Cause: using a GPT tokenizer with a BERT model
- Fix: always use the model's own tokenizer
Key insight: Most fine-tuning failures come from hyperparameters, not architecture. The learning rate alone accounts for the majority of failed fine-tuning runs. When in doubt, use the defaults from the Hugging Face examples.
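The recommended schedule from the checklist (peak LR of 2e-5, linear warmup over the first 10% of steps, then decay) can be written as a small standalone function. The specific numbers mirror the checklist above; they are typical defaults, not the only valid choice:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warmup over the first warmup_frac of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr, protecting pre-trained weights
        # from large early updates.
        return peak_lr * step / max(warmup_steps, 1)
    # Decay: ramp linearly from peak_lr back down to 0.
    remaining = total_steps - step
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

total = 1000
schedule = [lr_at_step(s, total) for s in range(total)]
```

The warmup phase is what prevents the "learning rate too high" failure mode: the first steps use a tiny LR, so the randomly initialized head adapts before the pre-trained body sees full-size updates.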
The Modern Fine-Tuning Decision Tree
Choosing the right approach for your constraints
Decision Framework
The right fine-tuning approach depends on your data size, compute budget, and task complexity. With <100 examples: use a large LLM with few-shot prompting (no fine-tuning needed). With 100–1,000 examples: use feature extraction from a pre-trained model, or LoRA fine-tuning. With 1,000–10,000 examples: full fine-tuning of BERT-base or similar. With 10,000+ examples: full fine-tuning with domain-adapted pre-training. For specialized domains: always consider domain-adaptive pre-training first. For limited GPU: LoRA or feature extraction. For production deployment: consider model distillation (DistilBERT is 60% smaller, 97% of BERT's performance). The trend is toward smaller, specialized models for production and large general models for prototyping.
Decision Tree
How much labeled data?

<100 examples:
- Few-shot prompting with an LLM
- No fine-tuning needed

100–1,000 examples:
- Feature extraction (fast, safe)
- LoRA fine-tuning (better, cheap)

1,000–10,000 examples:
- Full fine-tuning (BERT-base)
- Consider domain pre-training

10,000+ examples:
- Domain DAPT + full fine-tuning
- Consider model distillation

Production rule of thumb: prototype with a large LLM, deploy with a fine-tuned small model.
Key insight: The best practitioners prototype with large models and deploy with small ones. Use GPT-4 to validate the task is solvable, then fine-tune a BERT or DistilBERT for production. This gives you the best of both worlds: fast iteration and efficient deployment.
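The decision tree above can be collapsed into a few lines of Python. The thresholds come from the text; the function name and the exact recommendation strings are our own shorthand:

```python
def choose_approach(n_labeled, specialized_domain=False, gpu_limited=False):
    """Map labeled-data size (plus domain/compute constraints) to a
    fine-tuning recommendation, following the chapter's decision tree."""
    if n_labeled < 100:
        return "few-shot prompting with an LLM (no fine-tuning)"
    if n_labeled < 1000:
        return "LoRA fine-tuning" if gpu_limited else "feature extraction or LoRA"
    base = "full fine-tuning (BERT-base)"
    if specialized_domain or n_labeled >= 10000:
        # Specialized text or abundant data: add domain-adaptive pre-training.
        base = "domain-adaptive pre-training + " + base
    return base
```

For example, `choose_approach(50)` recommends prompting only, while `choose_approach(20000)` recommends DAPT followed by full fine-tuning.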