Quality Over Quantity
The LIMA paper (Zhou et al., 2023) showed that a 65B-parameter LLaMA model fine-tuned on just 1,000 carefully curated examples can be competitive with GPT-3.5-class models in human preference evaluations. Microsoft's Phi models demonstrated that small models trained on high-quality, largely synthetic data can outperform models trained on much larger datasets of lower quality.
The lesson: 1,000 excellent examples can beat 100,000 mediocre ones.
Practical Guidelines
Minimum viable: 100-500 examples for style/format transfer.
Good baseline: 1,000-5,000 examples for task specialization.
Strong model: 10,000-50,000 examples for complex reasoning tasks.
Maximum quality: 50,000-500,000 examples for full fine-tuning on broad tasks.
More data helps, but with diminishing returns. The first 1,000 examples matter most.
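The size bands above can be encoded as a small lookup for sanity-checking a dataset before training. This is a minimal sketch: the names `TIERS` and `highest_tier_reached` are hypothetical, and treating a count that falls between two bands (for example 750) as belonging to the band below it is an assumption, since the guidelines leave those gaps undefined.

```python
# Bands copied from the guidelines: (min_examples, max_examples, label, use case).
TIERS = [
    (100, 500, "minimum viable", "style/format transfer"),
    (1_000, 5_000, "good baseline", "task specialization"),
    (10_000, 50_000, "strong model", "complex reasoning tasks"),
    (50_000, 500_000, "maximum quality", "full fine-tuning on broad tasks"),
]


def highest_tier_reached(n_examples):
    """Return (label, use_case) for the highest band whose lower bound is met.

    Returns None when the dataset is smaller than the smallest band.
    Assumption: counts that fall in a gap between bands map to the band below.
    """
    reached = None
    for low, _high, label, use_case in TIERS:
        if n_examples >= low:
            reached = (label, use_case)
    return reached


for n in (50, 750, 3_000, 20_000):
    print(n, highest_tier_reached(n))
```

Because returns diminish, the lookup is only a rough planning aid; it says nothing about the quality of the examples.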
Data Format
Fine-tuning data is typically instruction/response pairs:
# Alpaca format (most common)
{
  "instruction": "Summarize this contract clause...",
  "input": "The Licensee shall not...",
  "output": "This clause restricts..."
}
# Chat format (ShareGPT / multi-turn)
{
  "conversations": [
    {"from": "human", "value": "Explain LoRA..."},
    {"from": "gpt", "value": "LoRA is..."}
  ]
}
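The two formats carry the same information for single-turn data, so one can be converted into the other. The sketch below turns an Alpaca-style record into a ShareGPT-style one. The function name `alpaca_to_sharegpt` is hypothetical, and joining `instruction` and `input` with a blank line is an assumption; training frameworks may apply their own prompt template instead.

```python
import json


def alpaca_to_sharegpt(record):
    """Convert one Alpaca-style record into a single-turn ShareGPT-style record."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    # Assumption: a non-empty "input" is appended to the instruction,
    # separated by a blank line.
    prompt = f"{instruction}\n\n{context}" if context else instruction
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"].strip()},
        ]
    }


example = {
    "instruction": "Summarize this contract clause...",
    "input": "The Licensee shall not...",
    "output": "This clause restricts...",
}
converted = alpaca_to_sharegpt(example)
print(json.dumps(converted, indent=2))
```

Going the other way is lossy for multi-turn conversations, because the Alpaca format has room for only one instruction and one output.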
You can also generate training data with a stronger model: use GPT-4o or Claude to produce instruction/response pairs for your domain. This is how Alpaca (Stanford, 2023) was created: 52K instruction-following examples generated with OpenAI's text-davinci-003 for less than $500 in API costs. Review the generated data manually before training, since generated pairs can contain errors and near-duplicates.
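Manual review goes faster if obviously unusable records are removed first. The sketch below is one possible pre-filter for Alpaca-style records, not a replacement for reading the data: the function name `prefilter`, the length thresholds, and the exact-match duplicate check are all assumptions chosen for illustration.

```python
def prefilter(records, min_output_chars=20, max_output_chars=4000):
    """Split records into (kept, rejected) using cheap structural checks.

    rejected is a list of (record, reason) pairs so a reviewer can spot-check
    what was dropped. Thresholds are illustrative assumptions.
    """
    kept, rejected = [], []
    seen = set()
    for rec in records:
        instruction = rec.get("instruction", "").strip()
        output = rec.get("output", "").strip()
        # Two records count as duplicates when instruction and input match
        # exactly after lowercasing; near-duplicates are not detected.
        key = (instruction.lower(), rec.get("input", "").strip().lower())
        if not instruction or not output:
            rejected.append((rec, "empty field"))
        elif not (min_output_chars <= len(output) <= max_output_chars):
            rejected.append((rec, "output length out of range"))
        elif key in seen:
            rejected.append((rec, "duplicate prompt"))
        else:
            seen.add(key)
            kept.append(rec)
    return kept, rejected


records = [
    {"instruction": "Explain LoRA.", "input": "",
     "output": "LoRA trains small low-rank adapter matrices instead of all weights."},
    {"instruction": "Explain LoRA.", "input": "",
     "output": "LoRA is a parameter-efficient fine-tuning method using adapters."},
    {"instruction": "Define overfitting.", "input": "", "output": ""},
    {"instruction": "Define epoch.", "input": "", "output": "One pass."},
]
kept, rejected = prefilter(records)
print(len(kept), [reason for _, reason in rejected])
```

Records that pass these checks can still be factually wrong, so the kept set is the input to manual review rather than its result.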