
Key Insights — LLM Fine-Tuning

A high-level summary of the core concepts across all 10 chapters.
Foundation
When to Fine-Tune & Data Prep
Chapters 1-3
Chapter 1
Fine-tuning is for teaching a model *how* to talk (form, style, behavior), while RAG is for teaching a model *what* to talk about (facts, knowledge).
  • The Hierarchy of Needs: Always try Prompt Engineering first. If that fails, try RAG. Only if you need specific tone, format, or faster/cheaper inference should you Fine-Tune.
Chapter 2
To fine-tune efficiently, you must understand what you are actually modifying.
  • Targeting Projections: Fine-tuning usually targets the Q, K, and V projection matrices inside the attention heads, as these govern how the model routes information.
  • Memory Constraints: Training requires 3-4x more memory than inference because you must store gradients and optimizer states alongside the model weights.
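The 3-4x figure can be derived directly: gradients match the weights in size, and an optimizer like Adam stores two extra buffers per parameter. A minimal sketch, assuming fp32 training and ignoring activations:

```python
def training_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    """Rough model-state memory for full fine-tuning with Adam (fp32).

    Gradients are the same size as the weights, and Adam keeps two moment
    buffers per parameter, so training needs ~4x the weight memory.
    Activations are workload-dependent and excluded here.
    """
    gb = 1024 ** 3
    weights = n_params * bytes_per_param / gb
    gradients = weights      # one gradient per trainable weight
    optimizer = 2 * weights  # Adam: first and second moment estimates
    return {
        "weights": round(weights, 1),
        "training_total": round(weights + gradients + optimizer, 1),
    }

# e.g. a 7B model: ~26 GB of weights, ~104 GB of model state for training.
print(training_memory_gb(7e9))
```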
Chapter 3
In fine-tuning, data quality drastically outweighs data quantity. 1,000 perfect examples beat 100,000 mediocre ones.
  • Formatting: Data must be structured perfectly into conversational formats (like ChatML or ShareGPT) with clear system, user, and assistant roles.
  • Synthetic Data: Using larger models (like GPT-4) to generate high-quality training data for smaller models is now the industry standard approach.
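The conversational formats above are just role-tagged records rendered into a single training string. A minimal sketch of the ChatML convention (the `<|im_start|>`/`<|im_end|>` delimiters are ChatML's; the sample record is made up):

```python
def to_chatml(example: dict) -> str:
    """Render a {system, user, assistant} record as a ChatML training string."""
    parts = []
    for role in ("system", "user", "assistant"):
        if example.get(role):
            parts.append(f"<|im_start|>{role}\n{example[role]}<|im_end|>")
    return "\n".join(parts)

sample = {
    "system": "You are a concise support agent.",
    "user": "How do I reset my password?",
    "assistant": "Open Settings > Security and choose 'Reset password'.",
}
print(to_chatml(sample))
```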
The Bottom Line: Do not fine-tune to inject new knowledge. Fine-tune to teach a smaller, cheaper model to behave exactly like a larger, more expensive model for a specific task.
Techniques
LoRA & Distributed Training
Chapters 4-5
Chapter 4
LoRA democratized fine-tuning by allowing massive models to be trained on consumer GPUs.
  • The LoRA Trick: Instead of updating all 70 billion weights, LoRA freezes the original model and trains a tiny "adapter": a pair of low-rank matrices whose product is added to the frozen weights.
  • QLoRA: Quantizes the base model to 4-bit precision while training the LoRA adapter in 16-bit, cutting memory requirements by up to 80%.
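The adapter trick can be sketched in a few lines of numpy: freeze W, learn a small pair of low-rank matrices A and B, and add their scaled product to the output (the `alpha / r` scaling follows the LoRA paper's convention; the dimensions here are illustrative):

```python
import numpy as np

d, r, alpha = 1024, 8, 16               # hidden size, LoRA rank, scaling
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pretrained projection (e.g. Q)
A = rng.standard_normal((r, d)) * 0.01   # trainable, initialized small
B = np.zeros((d, r))                     # trainable, zero-init: delta starts at 0

def lora_forward(x):
    """y = W x + (alpha / r) * B A x  -- only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# With B zero-initialized, the adapter contributes nothing at step 0:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters vs. fully fine-tuning this one matrix:
print(f"LoRA params: {A.size + B.size:,} vs full: {W.size:,}")
```

For this single 1024x1024 projection, the adapter trains 16,384 parameters instead of over a million, which is why a 70B model's adapters fit on a consumer GPU.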
Chapter 5
When LoRA isn't enough (e.g., teaching a model a new language), you must update all weights across multiple GPUs.
  • FSDP / DeepSpeed ZeRO: Techniques that shard the model weights, gradients, and optimizer states across multiple GPUs so a massive model can fit in memory.
  • Gradient Checkpointing: Trading compute for memory by recalculating activations during the backward pass instead of storing them.
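The sharding arithmetic can be made concrete with the per-parameter byte counts from the ZeRO paper: under mixed-precision Adam, each parameter costs 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights plus two moments). A sketch of the idealized per-GPU formulas; real frameworks add communication buffers on top:

```python
def zero_memory_per_gpu_gb(n_params: float, n_gpus: int) -> dict:
    """Idealized per-GPU model-state memory for DeepSpeed ZeRO stages.

    Per-parameter costs (mixed-precision Adam): fp16 weights = 2 B,
    fp16 gradients = 2 B, fp32 optimizer states = 12 B. Each ZeRO
    stage shards one more component across the GPUs.
    """
    gb = 1024 ** 3
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    return {
        "baseline (no sharding)": round((p + g + o) / gb, 1),
        "ZeRO-1 (shard optimizer)": round((p + g + o / n_gpus) / gb, 1),
        "ZeRO-2 (+ shard gradients)": round((p + (g + o) / n_gpus) / gb, 1),
        "ZeRO-3 (+ shard weights)": round((p + g + o) / n_gpus / gb, 1),
    }

# A 70B model on 8 GPUs: ~1 TB of model state per GPU unsharded,
# but only ~130 GB per GPU once everything is sharded with ZeRO-3.
print(zero_memory_per_gpu_gb(70e9, 8))
```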
The Bottom Line: Start with QLoRA. It provides 95% of the performance of full fine-tuning at a fraction of the compute cost, and allows you to hot-swap adapters at runtime.
Alignment
RLHF, DPO & Preferences
Chapters 6-7
Chapter 6
Supervised Fine-Tuning (SFT) teaches the model to talk. Alignment teaches it to be helpful and harmless.
  • The RLHF Pipeline: 1) Train a Reward Model on human preferences (A is better than B). 2) Use PPO (Reinforcement Learning) to optimize the LLM to generate responses the Reward Model scores highly.
  • The Alignment Tax: Making a model safer often makes it slightly worse at objective tasks like coding or math.
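Step 1 of the pipeline trains the Reward Model with a pairwise objective: the human-preferred response should score higher than the rejected one. A numpy sketch of that Bradley-Terry-style loss:

```python
import numpy as np

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model scores the human-preferred
    response well above the rejected one.
    """
    margin = r_chosen - r_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# A correct ranking with a wide margin gives a small loss...
print(reward_pair_loss(r_chosen=2.0, r_rejected=-1.0))  # ~0.049
# ...while a reversed ranking is penalized heavily.
print(reward_pair_loss(r_chosen=-1.0, r_rejected=2.0))  # ~3.049
```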
Chapter 7
RLHF is incredibly complex and unstable. Modern techniques achieve the same alignment objective mathematically without training a separate reward model.
  • DPO (Direct Preference Optimization): Treats the LLM itself as an implicit reward model, directly updating weights to increase the probability of the chosen answer and decrease the probability of the rejected one.
  • ORPO: Combines Supervised Fine-Tuning and Preference Optimization into a single step, drastically simplifying the training pipeline.
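The DPO update reduces to a single loss over log-probabilities from the policy and a frozen reference copy of it. A numpy sketch of the published objective, where beta controls how far the policy may drift from the reference (the log-prob values below are made up for illustration):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each 'margin' is the policy's log-prob minus the frozen reference
    model's log-prob for the same sequence -- the implicit reward.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return float(-np.log(1.0 / (1.0 + np.exp(-logits))))

# Policy already prefers the chosen answer relative to the reference:
low = dpo_loss(-5.0, -9.0, -7.0, -7.0)   # low loss
# Policy prefers the rejected answer: higher loss, stronger gradient.
high = dpo_loss(-9.0, -5.0, -7.0, -7.0)
print(low, high)
```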
The Bottom Line: The industry is rapidly moving away from complex PPO-based RLHF toward simpler, more stable direct preference methods like DPO and ORPO.
Ops
Tools, Evals & Production
Chapters 8-10
Chapter 8
You don't need to write PyTorch training loops from scratch anymore.
  • Unsloth: A highly optimized library that makes LoRA fine-tuning 2x faster and uses 70% less memory.
  • Axolotl: A configuration-driven framework that standardizes the fine-tuning process across different models and hardware setups.
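Configuration-driven means a fine-tuning run is declared rather than coded. A hypothetical sketch of what a QLoRA run might look like in Axolotl's YAML style (field names and values are illustrative, not copied from the docs; check the project's bundled example configs for the exact schema your version supports):

```yaml
# Hypothetical Axolotl-style QLoRA config -- illustrative only.
base_model: meta-llama/Llama-2-7b-hf
load_in_4bit: true
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

datasets:
  - path: my_org/support_conversations   # hypothetical dataset
    type: sharegpt

micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/support-qlora
```

The same YAML shape then works across models and hardware setups, which is the standardization the bullet above describes.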
Chapter 9
If you can't measure it, you shouldn't fine-tune it. "Vibes-based" evaluation does not scale.
  • LLM-as-a-Judge: Using GPT-4 to evaluate your fine-tuned model's outputs against a golden dataset (e.g., MT-Bench).
  • Catastrophic Forgetting: Always run standard benchmarks (MMLU, HumanEval) after fine-tuning to ensure your model didn't lose its general knowledge while learning its new specific task.
Chapter 10
A fine-tuned model is useless if it's too slow or expensive to serve.
  • Model Merging: Fusing the LoRA adapter weights permanently into the base model weights so there is zero latency penalty during inference.
  • LoRAX / Multi-LoRA Serving: Loading one base model into GPU memory, but applying different LoRA adapters on a per-request basis, allowing you to serve 100 different fine-tuned models for the cost of 1.
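Merging is just folding the adapter's scaled low-rank product back into the base matrix once, so the served model keeps the original architecture with zero extra matmuls at inference. A numpy sketch (dimensions illustrative):

```python
import numpy as np

d, r, alpha = 512, 8, 16
rng = np.random.default_rng(1)

W = rng.standard_normal((d, d))          # base weight (frozen during training)
A = rng.standard_normal((r, d)) * 0.01   # trained LoRA factors
B = rng.standard_normal((d, r)) * 0.01

# Merge: fold the scaled low-rank update into the base matrix once.
W_merged = W + (alpha / r) * (B @ A)

# A forward pass is now a single matmul, yet the output matches
# base-plus-adapter exactly -- no latency penalty remains.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

Multi-LoRA serving is the opposite trade: keep W unmerged and shared, and apply a different (B, A) pair per request.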
The Bottom Line: The ultimate superpower of PEFT/LoRA is production economics: you can train and serve dozens of highly specialized expert models using the infrastructure footprint of a single base model.