Ch 4 — LoRA & Parameter-Efficient Fine-Tuning

Low-rank adapters, QLoRA, rank/alpha tuning, and other PEFT methods
The Core Idea Behind LoRA
Hu et al. (2021): "LoRA: Low-Rank Adaptation of Large Language Models"
The Problem
Full fine-tuning updates all parameters in the model. For a 7B model, that means storing optimizer states for 7 billion parameters: ~56 GB of GPU memory just for AdamW states. This requires multiple expensive GPUs and creates a separate full copy of the model for each task.
The Insight
Aghajanyan et al. (2020) showed that pre-trained language models have a low intrinsic dimensionality. When fine-tuning, the weight changes (ΔW) don't need the full rank of the weight matrix. The actual "useful" change lives in a much lower-dimensional subspace.

In other words: you don't need to change all 7 billion parameters. A few million carefully chosen changes are enough.
The LoRA Solution
Instead of updating the full weight matrix W (e.g., 4096 × 4096 = 16.8M params), LoRA decomposes the update into two small matrices:

ΔW = B × A

Where A is (r × 4096) and B is (4096 × r), with r being the rank (typically 8, 16, or 32). For r=16: A has 65K params, B has 65K params. Total: 131K params instead of 16.8M (128x reduction).

The base model stays completely frozen. Only A and B are trained. At inference, the adapter output is added: h = Wx + BAx.
Why "low-rank"? A matrix of rank r can be decomposed into the product of two matrices where the inner dimension is r. If the weight change ΔW has low rank (most of the information is in a few dimensions), then B×A is a good approximation. Empirically, rank 16-64 captures most of the useful adaptation for LLM fine-tuning.
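The parameter arithmetic above can be verified with a toy NumPy sketch (random values stand in for real weights; only the shapes matter):

```python
import numpy as np

d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight: 16.8M params
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection (r x d)
B = np.zeros((d, r))                    # trainable up-projection (d x r)

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)                 # adapted forward pass: h = Wx + BAx

print(W.size)                        # 16777216 full-rank params
print(A.size + B.size)               # 131072 LoRA params
print(W.size // (A.size + B.size))   # 128x reduction
```

Note that the adapter multiplies B @ (A @ x), never materializing the d × d product BA, which is what keeps the extra compute and memory small.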
LoRA Mathematics
Rank, alpha, scaling, and initialization
The Forward Pass
For a pre-trained weight matrix W0:

h = W0x + (α/r) · BAx

W0: Frozen pre-trained weights (not updated)
B: Up-projection matrix (d × r), initialized to zeros
A: Down-projection matrix (r × d), initialized with a random Gaussian
r: Rank (hyperparameter, typically 8-64)
α: Scaling factor (hyperparameter, typically 16-64)
α/r: The effective scaling applied to the adapter output
Why Initialize B to Zero?
At the start of training, BA = 0 (because B is all zeros). This means the model starts with exactly the same behavior as the pre-trained model. Training gradually learns the adaptation ΔW = BA. This is a key design choice: it ensures training starts from a known-good state.
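A toy NumPy check of this property (shapes and values are arbitrary):

```python
import numpy as np

d, r, alpha = 64, 8, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))  # frozen pre-trained weight
A = rng.standard_normal((r, d))   # Gaussian init
B = np.zeros((d, r))              # zero init, so BA = 0

x = rng.standard_normal(d)
h_base = W0 @ x
h_lora = W0 @ x + (alpha / r) * (B @ (A @ x))

print(np.allclose(h_base, h_lora))  # True: identical behavior at step 0
```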
Rank (r) and Alpha (α)
Rank (r): Controls the capacity of the adapter. Higher rank = more parameters = more expressive but more memory. Common values: 8, 16, 32, 64.

Alpha (α): Scaling factor. The adapter output is multiplied by α/r. Common practice: set α = 2×r (e.g., r=16, α=32). This keeps the effective learning rate stable across different rank values.

Rule of thumb: Start with r=16, α=32. If quality is insufficient, increase to r=32 or r=64. If you need to save memory, try r=8.
```python
# Parameter count for LoRA on Llama 3 8B
# Target: q_proj, k_proj, v_proj, o_proj (4 modules)
# Per module per layer:
#   A: (r, d_in)    B: (d_out, r)
#
# r=16, targeting Q, K, V, O across 32 layers:
#   q_proj: (16 x 4096) + (4096 x 16) = 131K
#   k_proj: (16 x 4096) + (1024 x 16) = 82K
#   v_proj: (16 x 4096) + (1024 x 16) = 82K
#   o_proj: (16 x 4096) + (4096 x 16) = 131K
#
# Per layer: 426K params
# 32 layers: 13.6M trainable params
#          = 0.17% of 8B total parameters
```
LoRA Configuration
Target modules, rank, alpha, and dropout
Target Modules
Which layers get LoRA adapters? You choose which weight matrices to target. More targets = more capacity but more memory.

Minimal (attention only):
["q_proj", "v_proj"]
The original LoRA paper found Q and V most important. ~6.8M params for r=16.

Standard (all attention):
["q_proj", "k_proj", "v_proj", "o_proj"]
Most common configuration. ~13.6M params for r=16.

Full (attention + FFN):
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Maximum capacity. ~40M params for r=16. Best quality but most memory.
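The preset sizes quoted above can be reproduced with a small counting helper (a sketch; the `lora_params` name and the shape table are assumptions derived from the public Llama 3 8B architecture):

```python
# Trainable-parameter counts for the three target-module presets, r=16.
# Shapes assume Llama 3 8B: hidden=4096, GQA key/value dim=1024,
# FFN intermediate=14336, 32 layers.
SHAPES = {  # module: (d_in, d_out)
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(targets, r=16, layers=32):
    # Per module: A is (r, d_in) and B is (d_out, r).
    return layers * sum(r * SHAPES[m][0] + SHAPES[m][1] * r for m in targets)

print(lora_params(["q_proj", "v_proj"]))                      # 6815744  (~6.8M)
print(lora_params(["q_proj", "k_proj", "v_proj", "o_proj"]))  # 13631488 (~13.6M)
print(lora_params(list(SHAPES)))                              # 41943040 (~40M)
```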
The LoraConfig
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                   # rank
    lora_alpha=32,          # scaling (2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,      # regularization
    bias="none",            # don't train biases
    task_type="CAUSAL_LM",  # for decoder models
)
```
Dropout
lora_dropout: Applied to the adapter input during training. Acts as regularization to prevent overfitting. Common values: 0.0 (no dropout) to 0.1. Use 0.05 as a default. Increase to 0.1 if you see overfitting on small datasets.
Recommended starting config: r=16, alpha=32, target all attention modules (Q, K, V, O), dropout=0.05. This gives ~13.6M trainable params (0.17% of 8B) and fits on a single 24 GB GPU. Increase rank to 32 or 64 if quality needs improvement.
QLoRA: Quantized LoRA
Dettmers et al. (2023): Fine-tune 70B on a single GPU
What QLoRA Adds
QLoRA combines LoRA with 4-bit quantization of the base model. The base model is loaded in 4-bit NF4 (Normal Float 4) precision, reducing its memory footprint by 4x. LoRA adapters are trained in bf16/fp16 on top of the quantized base.

Memory savings:
Llama 3 8B in bf16: ~16 GB
Llama 3 8B in NF4: ~4 GB
+ LoRA adapters + optimizer: ~2 GB
Total QLoRA: ~6 GB (vs ~16 GB for standard LoRA)
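The figures above as back-of-the-envelope byte counts (this ignores quantization constants, activations, and CUDA overhead, which is why the text's adapter line says ~2 GB rather than the ~0.16 GB of raw states computed here):

```python
params = 8e9          # Llama 3 8B parameter count (rounded)
adapter = 13.6e6      # r=16 adapters on Q/K/V/O

base_bf16 = params * 2 / 1e9    # 2 bytes per param
base_nf4 = params * 0.5 / 1e9   # 4 bits per param
# adapter weights (bf16) + gradients (bf16) + AdamW m and v (fp32 each):
adapter_states = adapter * (2 + 2 + 4 + 4) / 1e9

print(round(base_bf16), "GB bf16 base")                           # 16
print(round(base_nf4), "GB NF4 base")                             # 4
print(round(adapter_states, 2), "GB adapter + optimizer states")  # 0.16
```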
Key Innovations
1. NF4 (Normal Float 4-bit): Quantization format optimized for normally-distributed neural network weights. Better quality than uniform int4.

2. Double Quantization: Quantize the quantization constants themselves, saving ~0.4 GB per 7B model.

3. Paged Optimizers: Use CPU memory for optimizer state pages that don't fit in GPU memory, with automatic paging.
When to Use QLoRA vs LoRA
LoRA (bf16 base)
Better quality (no quantization loss)
Faster training (no dequantization)
Needs more GPU memory
Use for production models
QLoRA (4-bit base)
Slight quality loss (~0.5-1%)
Slower training (~10-20%)
4x less GPU memory
Use for prototyping or large models
```python
# QLoRA configuration
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```
QLoRA quality: The original QLoRA paper showed that 4-bit QLoRA matches 16-bit LoRA quality on most benchmarks, with only 0.5-1% degradation. For prototyping and experimentation, QLoRA is the recommended default. For production models where every percentage point matters, use standard LoRA in bf16.
Other PEFT Methods
DoRA, IA3, prefix tuning, and prompt tuning
DoRA (Weight-Decomposed LoRA)
Liu et al. (2024) decompose weight matrices into magnitude and direction components. LoRA is applied only to the direction, while magnitude is trained separately. DoRA often outperforms standard LoRA by 1-2% on benchmarks with the same rank, at minimal extra cost.

In PEFT: use_dora=True in LoraConfig.
IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
Liu et al. (2022) learn rescaling vectors instead of matrices. Multiplies activations by learned vectors in key, value, and FFN layers. Even fewer parameters than LoRA (just vectors, not matrices), but less expressive. Best for very small adaptation tasks.
Prefix Tuning
Li & Liang (2021) prepend trainable virtual tokens to the key and value sequences in each attention layer. The model attends to these virtual tokens as if they were part of the input. Very few parameters but limited expressiveness for complex tasks.
Prompt Tuning
Lester et al. (2021, Google) prepend trainable embeddings to the input layer only (not every layer like prefix tuning). Extremely parameter-efficient, but quality only approaches full fine-tuning at large model scales (the original study reached parity around 10B parameters), and LoRA generally beats it at equal budgets. Not recommended for most 7-70B fine-tuning.
Comparison
LoRA
Trainable params: 0.1-1%
Quality: Excellent
The default choice
DoRA
Trainable params: 0.1-1%
Quality: Slightly better
Recommended upgrade
IA3
Trainable params: 0.01%
Quality: Good
Minimal adaptation
Prefix Tuning
Trainable params: 0.1%
Quality: Moderate
Niche use cases
Practical recommendation: Use LoRA as your default. Try DoRA if you want a potential quality boost at minimal extra cost. Use IA3 only if you need the absolute minimum parameter count. Skip prefix tuning and prompt tuning for most LLM fine-tuning tasks.
Merging & Serving LoRA Adapters
From training to deployment
Merge into Base Model
After training, you can merge the LoRA adapter into the base model: W_new = W0 + (α/r) · BA. This produces a standard model with no adapter overhead. Inference speed is identical to the original model.
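A toy NumPy check (random matrices standing in for trained weights) that the merged weight reproduces the adapter path exactly:

```python
import numpy as np

d, r, alpha = 64, 8, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))        # pretend training produced a nonzero B

W_merged = W0 + (alpha / r) * (B @ A)  # fold the adapter into the base weight

x = rng.standard_normal(d)
h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))  # True: same outputs, no adapter overhead
```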

Use this when: deploying a single fine-tuned model, converting to GGUF for local inference, or sharing the model on HuggingFace Hub.
```python
# Merge adapter into base model
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(model, "./lora_adapter")
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
Serve Without Merging
You can also serve the adapter separately from the base model. Load the base model once, then load/swap adapters dynamically. This enables:

1. Multi-tenant serving: One base model + many task-specific adapters. Each user/task gets their own adapter.

2. Hot-swapping: Switch between adapters without reloading the base model. vLLM and LoRAX support this.

3. A/B testing: Serve different adapter versions to different users.
Adapter Size
A LoRA adapter (r=16, attention only) for Llama 3 8B is approximately 27 MB. Compare this to the full model at 16 GB. You can store hundreds of adapters for the cost of one full model. This makes LoRA ideal for personalization at scale.
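The size claim is simple arithmetic (13.6M adapter parameters stored in bf16 at 2 bytes each):

```python
adapter_params = 13.6e6                 # r=16, Q/K/V/O on Llama 3 8B
adapter_mb = adapter_params * 2 / 1e6   # bf16: 2 bytes per param
full_model_gb = 8e9 * 2 / 1e9

print(round(adapter_mb, 1), "MB adapter vs", round(full_model_gb), "GB full model")
print(int(full_model_gb * 1000 // adapter_mb), "adapters per full-model footprint")
```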
Deployment strategy: For a single model: merge and deploy as a standard model. For multiple tasks/customers: keep the base model loaded and swap adapters. vLLM supports serving multiple LoRA adapters simultaneously with a single base model, with minimal overhead per adapter.
LoRA Best Practices
Practical tips from the community
Hyperparameter Recommendations
Rank: Start with r=16. Increase to 32 or 64 if quality is insufficient. r=8 for very simple tasks. r=128+ rarely helps and wastes memory.

Alpha: Set to 2×r (e.g., r=16, alpha=32). Some practitioners use alpha=r. The ratio alpha/r determines the effective scaling.

Learning rate: 1e-4 to 3e-4 for LoRA (higher than full fine-tuning because fewer parameters). 2e-4 is the most common default.

Dropout: 0.05 for most cases. 0.1 for small datasets (<1000 examples). 0.0 for large datasets (>50K).

Epochs: 1-3 for SFT. Monitor validation loss to detect overfitting.
Common Mistakes
1. Wrong chat template: Always use the model's native chat template. This is the #1 cause of bad LoRA results.

2. Rank too high: r=256 doesn't help and wastes memory. The quality improvement from r=16 to r=64 is small; from r=64 to r=256 is negligible.

3. Not targeting enough modules: Targeting only Q and V (original paper) is suboptimal. Target all attention modules (Q, K, V, O) at minimum.

4. Training too long: LoRA overfits faster than full fine-tuning because fewer parameters. 1-3 epochs is usually enough.

5. Forgetting to merge: If deploying as a standalone model, always merge the adapter. Serving unmerged adds latency.
The LoRA recipe that works: Llama 3.1 8B Instruct + r=16, alpha=32, target Q/K/V/O + lr=2e-4, cosine schedule + 1-3 epochs + 1K-10K high-quality examples. This combination handles 90% of fine-tuning use cases. Adjust rank and target modules if you need more capacity.