Ch 7 — Fine-Tuning & Instruction Following

Turning a text-completion engine into a helpful assistant — SFT, LoRA, and chat templates
The Problem: Base Models Don’t Follow Instructions
Why a pretrained model is brilliant but unhelpful
The Analogy
A pretrained LLM is like a brilliant scholar who only knows how to finish sentences. Ask “What is the capital of France?” and it might continue with “What is the capital of Germany? What is the capital of Italy?” — because in its training data, questions are often followed by more questions (like a quiz). It has all the knowledge but doesn’t know it should answer the question. Fine-tuning teaches it the format of being helpful.
Key insight: Fine-tuning doesn’t add new knowledge — it unlocks existing knowledge by teaching the model how to respond. InstructGPT (Ouyang et al., 2022) showed that a 1.3B fine-tuned model was preferred by humans over a 175B base model. The knowledge was already there; it just needed to be formatted as helpful responses rather than text completions.
Base Model vs Fine-Tuned
Base Model (Llama 3 8B)
User: What is photosynthesis?
Model: What is cellular respiration? What is the difference between aerobic and anaerobic respiration? What are the stages of...
(Continues generating quiz questions)
Fine-Tuned (Llama 3 8B Instruct)
User: What is photosynthesis?
Model: Photosynthesis is the process by which plants convert sunlight, water, and CO₂ into glucose and oxygen. It occurs in chloroplasts...
(Helpful, structured answer)
The training pipeline: Pretraining (Ch 6) → SFT (this chapter) → RLHF/DPO (Ch 8). Pretraining gives knowledge. SFT teaches format. RLHF teaches quality and safety. Each stage uses less data but has outsized impact on behavior.
Supervised Fine-Tuning (SFT)
Teaching the model to respond like a helpful assistant
The Analogy
SFT is like on-the-job training. A new employee (base model) has a degree (pretraining knowledge) but doesn’t know company procedures. You show them examples: “When a customer asks X, respond with Y.” After seeing thousands of examples, they learn the pattern. SFT shows the model thousands of (instruction, response) pairs and trains it to produce the response given the instruction.
Key insight: SFT uses the same cross-entropy loss as pretraining (Ch 6), but with a crucial difference: loss is computed only on the assistant’s response tokens, not on the user’s input. This “loss masking” ensures the model learns to generate good responses without trying to memorize user questions. Typical SFT uses 10K-1M examples — tiny compared to pretraining’s trillions of tokens.
SFT Training
# SFT training example:
#   Input:  "[INST] Explain gravity [/INST]"
#   Target: "Gravity is a fundamental force..."

# Loss masking (critical!):
#   [INST] Explain gravity [/INST] Gravity is...
#   ───────── no loss ──────────── ── loss here ──
#   Only train on the response, not the prompt

# SFT hyperparameters (typical):
#   Learning rate: 2e-5 (10× lower than pretraining)
#   Epochs: 2-3 (don't overtrain!)
#   Dataset: 10K-1M examples
#   Batch size: 128
#   Duration: hours, not months

# Compare to pretraining:
#   Pretraining: 15T tokens, months, $60M+
#   SFT: ~100M tokens, hours, ~$100-1000
#   Impact on behavior: enormous
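Loss masking boils down to one detail in PyTorch: positions labeled with the `ignore_index` (conventionally -100, the default HuggingFace uses) are skipped by cross-entropy. A minimal sketch with toy tensors (the one-token shift used in real causal-LM training is omitted for clarity):

```python
import torch
import torch.nn.functional as F

# Toy setup: vocab of 10 tokens, sequence = 4 prompt tokens + 3 response tokens
logits = torch.randn(1, 7, 10)                 # (batch, seq_len, vocab)
tokens = torch.tensor([[1, 4, 2, 7, 3, 5, 9]])

# Labels start as a copy of the tokens; mask the prompt with -100,
# the ignore_index that F.cross_entropy skips entirely
labels = tokens.clone()
labels[:, :4] = -100                           # no loss on the prompt

loss = F.cross_entropy(
    logits.view(-1, 10), labels.view(-1), ignore_index=-100
)
# loss is averaged over the 3 response positions only
```

Because the ignored positions contribute nothing, the gradient only pushes the model toward generating good responses, exactly as described above.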
SFT Datasets: Where Examples Come From
Human-written, model-generated, and synthetic data
Data Sources
Early SFT datasets were human-written: OpenAI hired contractors to write ideal responses (InstructGPT used ~13K examples). This is expensive but high quality. Modern approaches use synthetic data: a strong model (GPT-4, Claude) generates training examples for a weaker model. Self-Instruct (Wang et al., 2023) showed models can even generate their own training data. The quality of SFT data matters far more than quantity.
Key insight: LIMA (Zhou et al., 2023) demonstrated that just 1,000 carefully curated examples can produce a remarkably good chat model. They called this the “Superficial Alignment Hypothesis”: a model’s knowledge comes from pretraining, and alignment is just a thin veneer that teaches format and style. Quality over quantity is the dominant lesson.
Notable Datasets
# Key SFT datasets:

# Human-written:
#   InstructGPT: ~13K examples (OpenAI, 2022)
#   LIMA: 1,000 examples (Meta, 2023)
#   Dolly: 15K examples (Databricks, 2023)
#   OpenAssistant: 161K messages (community)

# Synthetic / distilled:
#   Alpaca: 52K (GPT-3.5 generated, Stanford)
#   Magpie: 300K (Llama 3.1 generated)
#   UltraChat: 1.5M (GPT-3.5 conversations)

# Multi-task:
#   FLAN v2: 15M+ (1,800+ tasks, Google)
#   T0/P3: 12M (prompted datasets)

# Example format (JSON):
{
  "instruction": "Explain quantum entanglement",
  "input": "",
  "output": "Quantum entanglement is..."
}
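To train on records in this JSON format, each one is rendered into a single training string. A sketch of a hypothetical `format_example` helper, using illustrative `[INST]` markers (real prompt markers vary by model, as the next section explains):

```python
def format_example(rec):
    """Render an Alpaca-style record as one training string.
    The [INST] markers here are illustrative only; use the
    target model's actual chat template in practice."""
    prompt = rec["instruction"]
    if rec.get("input"):                 # optional context field
        prompt += "\n\n" + rec["input"]
    return f"[INST] {prompt} [/INST] {rec['output']}"

rec = {
    "instruction": "Explain quantum entanglement",
    "input": "",
    "output": "Quantum entanglement is...",
}
text = format_example(rec)
# → "[INST] Explain quantum entanglement [/INST] Quantum entanglement is..."
```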
Chat Templates: Structuring Conversations
How models know who’s speaking and when to respond
The Analogy
A chat template is like a screenplay format. In a movie script, you know who’s speaking because of formatting: “JOHN: Hello.” “MARY: Hi there.” Chat templates use special tokens to mark roles: system (instructions), user (questions), and assistant (responses). The model learns that after a user message, it should generate an assistant response. Different models use different templates — mixing them up causes garbled output.
Key insight: The system prompt is where you define the model’s personality and constraints. “You are a helpful coding assistant. Always provide code examples.” The model sees this at the start of every conversation. System prompts are powerful because they leverage the model’s instruction-following ability to customize behavior without any additional training.
Template Formats
# Llama 3 chat template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is gravity?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Gravity is a fundamental force...<|eot_id|>

# ChatML (OpenAI, widely adopted):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is gravity?<|im_end|>
<|im_start|>assistant
Gravity is a fundamental force...<|im_end|>

# Loss masking in chat:
#   system message → no loss
#   user message   → no loss
#   assistant msg  → COMPUTE LOSS
# Only train on what the model should say
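In practice you never hand-assemble these strings: HuggingFace tokenizers ship the correct template and apply it via `tokenizer.apply_chat_template(...)`. To show the mechanics without downloading a model, here is a minimal ChatML renderer in plain Python:

```python
def render_chatml(messages):
    """Render a message list in ChatML format. In real code, prefer
    tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True), which applies the template that
    matches the model you actually loaded."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant")   # generation prompt: cue the model to answer
    return "\n".join(parts)

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is gravity?"},
]
prompt = render_chatml(msgs)
```

The trailing `<|im_start|>assistant` is the "generation prompt": the model continues from there, which is exactly the behavior SFT trained in.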
LoRA: Fine-Tuning on a Budget
Updating 0.1% of parameters to get 99% of the benefit
The Analogy
Imagine remodeling a house. Full fine-tuning is tearing down every wall and rebuilding. LoRA (Low-Rank Adaptation) is adding small, targeted modifications — a new shelf here, a fresh coat of paint there. The key insight: when fine-tuning, the weight changes have low rank — they can be represented by two small matrices multiplied together, instead of one huge matrix. This reduces trainable parameters from billions to millions.
Key insight: LoRA (Hu et al., 2021) decomposes the weight update ΔW into two small matrices: ΔW = A × B, where A is (d × r) and B is (r × d), with rank r typically 8-64. For a 4096×4096 weight matrix, full fine-tuning needs 16.7M parameters. LoRA with r=16 needs only 2 × 4096 × 16 = 131K — a 128× reduction. The adapter weights are tiny and can be swapped in/out at inference time.
How LoRA Works
# Original weight: W (d × d)
# Full fine-tune:  W' = W + ΔW
#   ΔW has d×d = 16.7M params (for d=4096)
# LoRA: ΔW = A × B
#   A: (d × r), B: (r × d), r << d
#   Params: 2 × d × r = 2 × 4096 × 16 = 131K

import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_layer, r=16, alpha=32):
        super().__init__()
        self.base = base_layer              # frozen!
        d_in = base_layer.in_features
        d_out = base_layer.out_features
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)       # ΔW = 0 at start: model unchanged
        self.scale = alpha / r

    def forward(self, x):
        base_out = self.base(x)             # frozen path
        lora_out = self.B(self.A(x))        # tiny adapter path
        return base_out + self.scale * lora_out

# Llama 3 8B with LoRA (r=16):
#   Base: 8B params (frozen)
#   LoRA: ~20M params (trainable)
#       = 0.25% of total parameters
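The parameter arithmetic is easy to verify directly: counting the weights of the two adapter layers for a 4096×4096 matrix at r=16 reproduces the 128× reduction claimed above.

```python
import torch.nn as nn

d, r = 4096, 16
full = d * d                        # full fine-tuning: one dense update matrix
A = nn.Linear(d, r, bias=False)     # down-projection
B = nn.Linear(r, d, bias=False)     # up-projection
lora = A.weight.numel() + B.weight.numel()

print(full, lora, full // lora)     # → 16777216 131072 128
```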
QLoRA: Fine-Tuning a 70B Model on One GPU
Quantize the base model, train LoRA adapters in full precision
The Analogy
QLoRA is like storing a massive reference library in compressed format (4-bit quantization) while keeping your personal notes in full detail (16-bit LoRA adapters). The library takes up 4× less shelf space, but your notes are precise. During fine-tuning, you only write new notes (train LoRA) while referencing the compressed library (quantized base model). This lets you fine-tune a 70B model on a single 48GB GPU.
Key insight: QLoRA (Dettmers et al., 2023) introduced NF4 (Normal Float 4-bit) quantization, which is information-theoretically optimal for normally distributed weights. A 70B model at 4-bit needs only 35 GB — fits on one A100. The LoRA adapters train in BF16 for full precision. QLoRA matches full fine-tuning quality while using ~75% less memory. It democratized LLM fine-tuning.
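To make 4-bit quantization concrete, here is a toy absmax quantizer in NumPy. This is deliberately *not* the NF4 codebook (NF4 places its 16 levels at quantiles of a normal distribution, which is the paper's key trick); it only shows the round-trip of squeezing weights into 16 levels and reconstructing them:

```python
import numpy as np

def quantize_4bit(w):
    """Toy uniform absmax 4-bit quantization (NOT NF4)."""
    scale = np.abs(w).max() / 7.0                          # map into [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1024).astype(np.float32)      # weights ~ N(0, 0.02²)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()      # bounded by half a quantization step
```

Each weight now costs 4 bits instead of 16, at the price of a small reconstruction error; NF4 shrinks that error further for normally distributed weights by spacing its levels non-uniformly.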
Memory Comparison
# Fine-tuning Llama 3 70B — memory needed:

# Full fine-tuning (BF16):
#   Weights:   140 GB (70B × 2 bytes)
#   Gradients: 140 GB
#   Optimizer: 560 GB (Adam: 2 FP32 states)
#   Total:    ~840 GB → needs 12+ GPUs

# LoRA (BF16 base):
#   Weights:   140 GB (frozen)
#   LoRA params: ~100M (tiny)
#   Optimizer:  ~1 GB (only for LoRA)
#   Total:    ~145 GB → needs 2-3 GPUs

# QLoRA (4-bit base + BF16 LoRA):
#   Weights:    35 GB (4-bit quantized)
#   LoRA params: ~100M (BF16)
#   Optimizer:  ~1 GB
#   Total:     ~40 GB → fits on 1 GPU! ✓

# Cost comparison:
#   Full FT: $1000s, multi-GPU cluster
#   QLoRA:   $10-50, single GPU rental
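The headline numbers above are back-of-envelope bytes-per-parameter arithmetic (ignoring activations, framework overhead, and gradient checkpointing), which you can reproduce in a few lines:

```python
params = 70e9   # Llama 3 70B
GB = 1e9

# Full fine-tuning in BF16: 2 B weights + 2 B grads + 2 × 4 B Adam states
full_ft = params * (2 + 2 + 2 * 4)

# QLoRA: 0.5 B per 4-bit base weight + ~100M BF16 adapter params
qlora = params * 0.5 + 100e6 * 2

print(full_ft / GB, qlora / GB)   # → 840.0 35.2
```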
Fine-Tuning in Practice
Real code to fine-tune Llama with QLoRA
Using HuggingFace + PEFT
import torch
from transformers import (
    AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Load model in 4-bit (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Train with SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=2048,
    args=TrainingArguments(
        learning_rate=2e-4,
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
What Gets Fine-Tuned
In practice, LoRA adapters are applied to the attention projection matrices (Q, K, V, O) and sometimes the FFN layers. The rank r=16 is a common default; higher ranks (32, 64) give marginal improvements. The alpha/r ratio controls the learning rate scaling for LoRA weights. After training, adapters can be merged into the base weights for zero-overhead inference, or kept separate for easy swapping between tasks.
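That merge is exact, not approximate: folding the adapter into the base weight (W' = W + scale · B·A) gives bit-for-bit the same function as running the two paths separately, up to floating-point tolerance. A toy check with small matrices (PEFT exposes the real operation as `merge_and_unload()`):

```python
import torch

torch.manual_seed(0)
d, r, alpha = 64, 8, 16
scale = alpha / r

W = torch.randn(d, d)           # frozen base weight
A = torch.randn(r, d) * 0.01    # LoRA down-projection weight
B = torch.randn(d, r) * 0.01    # LoRA up-projection weight; zero-init in the
                                # paper, nonzero here as if training already ran

x = torch.randn(1, d)

# Unmerged: base path + scaled adapter path (two matmuls at inference)
y_adapter = x @ W.T + scale * ((x @ A.T) @ B.T)

# Merged: fold the adapter into the base weight once (one matmul at inference)
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T
```

The two outputs agree, which is why merged adapters have zero inference overhead; keeping them unmerged instead lets you hot-swap task-specific adapters over one shared base model.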
Key insight: The ecosystem for fine-tuning has matured rapidly. Tools like Unsloth (2× faster LoRA), Axolotl (config-driven training), and LLaMA-Factory (GUI-based) make it possible to fine-tune an 8B model in under an hour on a single GPU. Fine-tuning has been democratized — anyone with a $10 GPU rental can customize an LLM.
Base vs Instruct: The Full Picture
Understanding the complete model pipeline
The Pipeline
Every chat model you use went through this pipeline: Base model (pretrained on raw text, completes sentences) → SFT model (fine-tuned on instruction-response pairs, follows instructions) → RLHF/DPO model (aligned with human preferences, Ch 8). Meta releases both: “Llama-3.1-8B” (base) and “Llama-3.1-8B-Instruct” (SFT + RLHF). The base model is for researchers; the Instruct model is for users.
Key insight: Fine-tuning is where the “personality” of an AI assistant is shaped. The same base model can become a coding assistant, a medical advisor, a creative writer, or a customer service bot — depending on the SFT data. This is why fine-tuning is so commercially valuable: it lets companies customize a general-purpose model for their specific use case at a fraction of the pretraining cost.
Model Pipeline Summary
# The LLM training pipeline:

# Stage 1: Pretraining (Ch 6)
#   Data:    15T tokens of raw text
#   Cost:    $60M+, months
#   Result:  "Llama-3.1-8B" (base)
#   Ability: text completion

# Stage 2: SFT (this chapter)
#   Data:    ~100K instruction-response pairs
#   Cost:    $100-1000, hours
#   Result:  follows instructions
#   Ability: helpful responses

# Stage 3: RLHF / DPO (Ch 8)
#   Data:    ~50K preference comparisons
#   Cost:    $1000-10000, hours-days
#   Result:  "Llama-3.1-8B-Instruct"
#   Ability: high-quality, safe, aligned

# Each stage: less data, more impact
#   Pretraining = knowledge
#   SFT = format
#   RLHF = quality + safety