The Analogy
Imagine remodeling a house. Full fine-tuning = tearing down and rebuilding every wall. LoRA = adding small, targeted modifications (new shelves, fresh paint) that transform the space without touching the structure. LoRA freezes the original weights and adds small low-rank matrices that capture the task-specific changes. 175B parameters frozen, only ~10M trained.
Math from this course: LoRA is the low-rank approximation idea of Ch 3 in action (the same factorization that truncated SVD makes optimal). Instead of updating the full weight matrix W (d×d), LoRA learns W + BA, where B is d×r and A is r×d, with rank r << d. The insight: weight changes during fine-tuning have low intrinsic dimensionality (Ch 12); they lie near a low-dimensional manifold.
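The Ch 3 connection can be made concrete: a truncated SVD recovers exactly the kind of B, A factor pair that LoRA learns by gradient descent. A minimal numpy sketch (the rank-4 weight update here is synthetic, built just for illustration):

```python
import numpy as np

# Build a weight update that is secretly rank 4, then recover it with a
# truncated SVD -- the best rank-r approximation (Eckart-Young, Ch 3).
rng = np.random.default_rng(0)
d, true_rank = 256, 4
delta_W = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))

U, s, Vt = np.linalg.svd(delta_W)
r = 4
B = U[:, :r] * s[:r]   # d×r (singular values folded into B)
A = Vt[:r, :]          # r×d
print(np.allclose(B @ A, delta_W))  # → True: rank-4 factors capture ΔW exactly
```

LoRA never computes an SVD at training time; it simply parameterizes the update as BA and lets optimization find good factors. The SVD view is what justifies believing a small r can suffice.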
The Math
import numpy as np

# Standard fine-tuning: W_new = W + ΔW, where ΔW is d×d (huge).
# LoRA insight: ΔW has low rank!
# ΔW ≈ B @ A, where B: d×r, A: r×d, with rank r = 4, 8, or 16.

d, r = 12288, 8                    # d: GPT-3 hidden size, r: LoRA rank
full_params = d * d                # full ΔW: 12288² ≈ 151M parameters
lora_params = d * r + r * d        # LoRA (r=8): 12288×8 + 8×12288 ≈ 197K
print(full_params // lora_params)  # → 768 (768× fewer parameters!)

# Forward pass: h = (W + BA) x = Wx + BAx.
# W is frozen; only B and A are trained. (Small d so the demo runs fast.)
d = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))
B = np.zeros((d, r))               # B starts at zero, so BA = 0 at init
A = rng.standard_normal((r, d))
x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)

# At inference: merge W_new = W + BA. No extra latency!
W_merged = W + B @ A
assert np.allclose(W_merged @ x, h)

# QLoRA: quantize the frozen W to 4-bit, then add LoRA on top.
# That is what lets a 65B model be fine-tuned on a single GPU!
Full Fine-Tune
Updates all 175B params. Needs hundreds of GPUs.
LoRA
Updates ~10M params (~0.006%). A single GPU suffices.