Ch 1 — What Is Fine-Tuning & When to Use It

Why customize LLMs, and how to decide between fine-tuning, prompting, and RAG
Why Fine-Tune an LLM?
When prompting alone isn't enough
What Fine-Tuning Is
Fine-tuning takes a pre-trained LLM (like Llama 3, Mistral, or GPT-4o) and continues training it on your own dataset. The model learns your specific patterns, terminology, style, and tasks. Think of it as taking a generally educated person and giving them specialized job training.
Pre-Training vs Fine-Tuning
Pre-training: Train from scratch on trillions of tokens from the internet. Costs millions of dollars and takes months on thousands of GPUs. Creates a "base model" that can predict the next token but doesn't follow instructions.

Fine-tuning: Continue training on thousands to millions of task-specific examples. Costs $10 to $10,000 depending on model size and method. Takes hours to days on 1-8 GPUs. Creates a specialized model that excels at your use case.
Top Reasons to Fine-Tune
1. Task specialization: Make the model excel at a specific task (code generation, medical Q&A, legal analysis, structured extraction).

2. Style and tone: Match your brand voice, writing style, or communication patterns consistently.

3. Domain knowledge: Teach specialized vocabulary and reasoning patterns (finance, healthcare, law).

4. Cost reduction: A fine-tuned 8B model can match GPT-4 quality on your specific task at 1/100th the inference cost.

5. Latency: Smaller fine-tuned models respond faster than large general-purpose ones.

6. Privacy: Run the model on your own infrastructure. No data leaves your environment.
The key insight: fine-tuning is not a reliable way to teach the model new facts. What it teaches is how to behave — what format to use, what tone to adopt, which tasks to prioritize, and how to apply its existing knowledge to your specific domain.
Fine-Tuning vs Prompting vs RAG
Three approaches to customizing LLM behavior
Prompting (In-Context Learning)
What: Write instructions and examples in the prompt. No model weights change.
Best for: General tasks, rapid iteration, when you have a strong base model (GPT-4o, Claude).
Limits: Context window constrains how much you can teach. Inconsistent on complex tasks. Expensive at scale (long prompts = more tokens).
RAG (Retrieval-Augmented Generation)
What: Retrieve relevant documents and inject them into the prompt at query time.
Best for: Grounding answers in up-to-date or proprietary data. Reducing hallucination on factual questions.
Limits: Doesn't change model behavior or style. Retrieval quality is the bottleneck. Adds latency.
Fine-Tuning
What: Train the model on your data. Model weights change permanently.
Best for: Consistent behavior, specific output formats, domain expertise, cost optimization at scale.
Limits: Requires training data and compute. Can't easily update knowledge (use RAG for that). Risk of catastrophic forgetting.
At a glance:

Approach       Change behavior   Change knowledge   Cost                   Setup
Prompting      via prompt        via prompt         per-token              minutes
RAG            limited           via retrieval      infra + per-token      days
Fine-Tuning    via weights       limited            training + inference   days-weeks

Best combo: fine-tune for behavior, RAG for knowledge, prompting for control — all three together.
Types of Fine-Tuning
The spectrum from full to parameter-efficient
Full Fine-Tuning
Update all parameters in the model. Maximum flexibility but requires the most memory and compute. For a 7B model: ~56 GB GPU memory (fp16 weights + optimizer states + gradients). Typically needs 4-8 A100 GPUs.

When to use: Large high-quality dataset (100K+ examples), sufficient compute budget, need maximum quality.
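The memory figure above can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes fp16 weights, fp16 gradients, and roughly 4 bytes per parameter of optimizer state (about 8 bytes per parameter total), and ignores activation memory — the exact breakdown varies with optimizer precision, so treat the byte counts as illustrative:

```python
def full_finetune_memory_gb(n_params: float,
                            bytes_weights: int = 2,    # fp16 weights
                            bytes_grads: int = 2,      # fp16 gradients
                            bytes_optimizer: int = 4   # optimizer state (varies)
                            ) -> float:
    """Rough GPU memory estimate for full fine-tuning, ignoring activations."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * bytes_per_param / 1e9

# A 7B model at ~8 bytes/param lands near the ~56 GB quoted above
print(round(full_finetune_memory_gb(7e9)))  # -> 56
```

A full-precision Adam setup (fp32 master weights plus two fp32 moment estimates) pushes this closer to 16 bytes per parameter, which is why real runs often need even more memory than this estimate.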
LoRA (Low-Rank Adaptation)
Freeze the base model. Train small adapter matrices (typically 0.1-1% of total parameters). Dramatically reduces memory: fine-tune a 7B model on a single GPU with 16 GB VRAM.

When to use: Most use cases. Best quality-to-cost ratio. The default recommendation for 2024-2025.
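The core LoRA idea fits in a few lines of NumPy: the pretrained weight W stays frozen, and a low-rank product B @ A (scaled by alpha/r) is added to its output. B starts at zero, so the adapted model is identical to the base model before training. The dimensions below are illustrative:

```python
import numpy as np

d, r, alpha = 4096, 8, 16               # hidden size, LoRA rank, scaling
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init -> no change at step 0

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but it is never materialized;
    # the low-rank path is computed separately and added to the frozen path.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d))
assert np.allclose(lora_forward(x), x @ W.T)  # identical to base model at init

trainable = A.size + B.size
print(f"trainable fraction: {trainable / W.size:.4%}")  # ~0.39% at rank 8
```

Only A and B receive gradients, which is where the memory savings come from: at rank 8 on a 4096-wide layer, the adapter is under half a percent of the layer's parameters, consistent with the 0.1-1% range above.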
QLoRA (Quantized LoRA)
Load the base model in 4-bit precision (NF4 quantization), then train LoRA adapters on top. This lets you fine-tune a 70B model on a single 48 GB GPU (e.g., an RTX A6000). Introduced by Dettmers et al. (2023).

When to use: Limited GPU memory, large models (30B-70B), prototyping.
Other PEFT Methods
Prefix Tuning: Prepend trainable virtual tokens to the input. Very few parameters but limited expressiveness.

Prompt Tuning: Similar to prefix tuning but only at the input layer. Google's approach for T5 models.

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations): Learns rescaling vectors instead of matrices. Even fewer parameters than LoRA.

DoRA (Weight-Decomposed Low-Rank Adaptation): Decomposes weights into magnitude and direction, applies LoRA to direction only. Often outperforms standard LoRA.
Start with QLoRA for prototyping, LoRA for production. Full fine-tuning only when you have abundant data and compute. The quality difference between LoRA and full fine-tuning is often small (within 1-2% on benchmarks), while the cost difference is 10-50x.
The LLM Training Stages
From raw text to aligned assistant
Stage 1: Pre-Training
Train on trillions of tokens from the internet. The model learns language, facts, reasoning, and code. Output: a base model (e.g., Llama 3 base). It can complete text but doesn't follow instructions. This stage costs millions of dollars and is done by labs (Meta, Google, Mistral, etc.).
Stage 2: Supervised Fine-Tuning (SFT)
Train on instruction/response pairs. The model learns to follow instructions, answer questions, and produce structured output. Output: an instruct model (e.g., Llama 3 Instruct). This is what most people mean by "fine-tuning." Costs $10-$10,000. This is the stage you'll do most often.
Stage 3: Alignment (RLHF / DPO)
Train the model to prefer helpful, harmless, and honest responses using human preference data. Methods: RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization). Output: an aligned/chat model. Makes the model safer and more useful in conversation.
The Full Pipeline
# The LLM training pipeline

Pre-Training    # Trillions of tokens, months
                # -> Base model (text completion)
SFT             # 10K-1M instruction pairs, hours-days
                # -> Instruct model (follows instructions)
Alignment       # Preference data, RLHF or DPO
                # -> Chat model (helpful, safe, aligned)
Your Fine-Tune  # Your data, your task
                # -> Your specialized model
You typically start from an instruct or chat model (not the base model). The instruction-following ability is already baked in. Your fine-tuning adds domain expertise and task specialization on top. Starting from a base model means you'd also need to teach it to follow instructions.
Data Requirements
How much data do you need?
Quality Over Quantity
The LIMA paper (Zhou et al., 2023) showed that just 1,000 carefully curated examples can produce a model competitive with GPT-3.5 on many tasks. Microsoft's Phi models demonstrated that high-quality synthetic data can outperform much larger datasets of lower quality.

The lesson: 1,000 excellent examples beat 100,000 mediocre ones.
Practical Guidelines
Minimum viable: 100-500 examples for style/format transfer.
Good baseline: 1,000-5,000 examples for task specialization.
Strong model: 10,000-50,000 examples for complex reasoning tasks.
Maximum quality: 50,000-500,000 examples for full fine-tuning on broad tasks.

More data helps, but with diminishing returns. The first 1,000 examples matter most.
Data Format
Fine-tuning data is typically instruction/response pairs:
# Alpaca format (most common)
{
  "instruction": "Summarize this contract clause...",
  "input": "The Licensee shall not...",
  "output": "This clause restricts..."
}

# Chat format (ShareGPT / multi-turn)
{
  "conversations": [
    {"from": "human", "value": "Explain LoRA..."},
    {"from": "gpt", "value": "LoRA is..."}
  ]
}
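Converting between the two formats is mechanical. The helper below is a hypothetical illustration (the function name and the choice to fold `input` into the human turn are assumptions, not a standard API), but the field names match the two formats shown above:

```python
import json

def alpaca_to_sharegpt(record: dict) -> dict:
    """Convert one Alpaca-style record to a ShareGPT-style conversation.

    The optional 'input' field is appended to the instruction so the
    human turn carries the full context.
    """
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }

example = {
    "instruction": "Summarize this contract clause.",
    "input": "The Licensee shall not...",
    "output": "This clause restricts...",
}
print(json.dumps(alpaca_to_sharegpt(example), indent=2))
```

Most training frameworks accept either format, so pick the one that matches your data (single-turn tasks suit Alpaca; dialogues need the chat format) and keep it consistent across the dataset.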
You can generate training data with a stronger model. Use GPT-4o or Claude to generate high-quality instruction/response pairs for your domain. This is how Alpaca (Stanford, 2023) was created: 52K instruction-following examples generated with OpenAI's text-davinci-003 for about $500. Validate the generated data manually before training.
Risks & Pitfalls
What can go wrong with fine-tuning
Catastrophic Forgetting
The model forgets general capabilities while learning your specific task. A model fine-tuned on legal text might lose its ability to write code or do math. Mitigation: use LoRA (base model stays frozen), keep learning rate low, mix in general-purpose data.
Overfitting
The model memorizes training examples instead of learning patterns. It performs perfectly on training data but poorly on new inputs. Mitigation: use a validation set, early stopping, regularization (weight decay, dropout), and don't train for too many epochs (1-3 is typical).
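The early-stopping mitigation above can be sketched in a few lines: track validation loss per epoch and stop once it has failed to improve for a set number of epochs (the "patience"). This is a minimal illustration, not a specific framework's API:

```python
def best_stopping_epoch(val_losses, patience=2):
    """Return the 0-indexed epoch with the best validation loss,
    scanning until the loss fails to improve `patience` times in a row."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break          # validation loss is rising: overfitting
    return best_epoch

# Loss improves for three epochs, then rises -- stop at epoch 2
print(best_stopping_epoch([1.9, 1.4, 1.2, 1.3, 1.5]))  # -> 2
```

In practice you would checkpoint at each improvement and restore the best checkpoint after stopping; trainers like Hugging Face's provide this via callbacks.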
Data Quality Issues
Garbage in, garbage out. If your training data has errors, inconsistencies, or biases, the model will learn them. Always review a random sample of your training data before training. One bad pattern repeated 100 times can dominate the model's behavior.
Safety Degradation
Fine-tuning can undo safety alignment. Research has shown that even a small number of harmful examples can remove safety guardrails from aligned models. Always evaluate safety after fine-tuning. Consider re-applying alignment (DPO) after SFT.
License & Legal Risks
Model licenses matter. Llama 3 has a community license (free for most uses, restrictions for 700M+ monthly active users). Mistral models are Apache 2.0 (fully permissive). Phi-3 is MIT licensed. GPT-4o fine-tuning is API-only (OpenAI hosts it). Always check the license before fine-tuning and deploying.
The biggest pitfall: fine-tuning when you don't need to. Try prompting first. Then try RAG. Only fine-tune when those approaches fail or are too expensive at scale. Fine-tuning adds complexity (training pipeline, evaluation, deployment, versioning) that you should only take on when the benefit is clear.
Decision Framework
Should you fine-tune?
Fine-Tune When
1. Consistent output format: You need the model to always produce JSON, XML, or a specific structure that prompting can't reliably enforce.

2. Domain-specific behavior: Medical diagnosis, legal analysis, financial modeling — tasks requiring specialized reasoning patterns.

3. Cost optimization: You're spending $10K+/month on GPT-4o API calls and a fine-tuned 8B model could handle 80% of queries at 1/100th the cost.

4. Latency requirements: You need sub-100ms responses that only a small, local model can provide.

5. Privacy: Data cannot leave your infrastructure. You need a self-hosted model.
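The cost-optimization case (point 3) comes down to simple arithmetic. The numbers below are the illustrative scenario from the list, not benchmarks:

```python
def monthly_savings(api_cost_per_month: float,
                    offload_fraction: float,
                    small_model_relative_cost: float) -> float:
    """Savings when a fine-tuned small model absorbs a fraction of
    traffic at a fraction of the per-query API cost."""
    offloaded = api_cost_per_month * offload_fraction
    return offloaded - offloaded * small_model_relative_cost

# Scenario above: $10K/month API spend, 80% of queries offloaded
# to a fine-tuned model running at 1/100th the cost
print(round(monthly_savings(10_000, 0.80, 0.01)))  # -> 7920
```

Against that recurring saving, weigh the one-time fine-tuning cost plus the ongoing cost of hosting, evaluation, and retraining; the break-even typically arrives within a month or two at this spend level.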
Don't Fine-Tune When
1. Prompting works: If few-shot prompting with GPT-4o gives you 95% accuracy, fine-tuning won't add much.

2. You need up-to-date knowledge: Fine-tuning doesn't update facts well. Use RAG instead.

3. Small dataset (<100 examples): Not enough signal for the model to learn meaningful patterns.

4. Rapidly changing requirements: If your task changes weekly, retraining is impractical. Use prompting.

5. No evaluation plan: If you can't measure whether fine-tuning improved the model, don't do it.
The recommended path: Start with prompting (GPT-4o or Claude). If quality is good but cost is too high, fine-tune a smaller model (Llama 3 8B or Mistral 7B) with LoRA. Add RAG for knowledge grounding. This combination handles 90% of production use cases.