Ch 3 — Dataset Preparation & Curation

Data formats, quality control, synthetic data, cleaning, and tokenization for fine-tuning
Data Formats for Fine-Tuning
Alpaca, ShareGPT, ChatML, and OpenAI formats
Alpaca Format
The simplest and most common format. Each example has an instruction, optional input, and output. Created by Stanford for the Alpaca project (2023). Best for single-turn instruction-following tasks.
// Alpaca format
{
  "instruction": "Classify the sentiment",
  "input": "This product is amazing!",
  "output": "Positive"
}
ShareGPT / Conversation Format
Multi-turn conversations. Each example is a list of messages with roles. Used by Vicuna, ShareGPT datasets, and most chat fine-tuning. Best for conversational models.
// ShareGPT format
{
  "conversations": [
    {"from": "system", "value": "You are a legal expert."},
    {"from": "human", "value": "What is force majeure?"},
    {"from": "gpt", "value": "Force majeure is..."}
  ]
}
OpenAI Messages Format
The format used by OpenAI's fine-tuning API and increasingly adopted as a standard. Each line in a JSONL file contains a messages array with role and content fields.
// OpenAI / HuggingFace messages format
{
  "messages": [
    {"role": "system", "content": "You are a legal expert."},
    {"role": "user", "content": "What is force majeure?"},
    {"role": "assistant", "content": "Force majeure is..."}
  ]
}
Recommendation: Use the OpenAI messages format for new projects. It works with HuggingFace SFTTrainer, OpenAI fine-tuning API, and most tools. SFTTrainer automatically applies the correct chat template when data is in this format. Store as JSONL (one JSON object per line).
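If your data starts in Alpaca format, converting it to messages JSONL is mechanical. A minimal sketch (the file name and record list are illustrative placeholders):

```python
# Sketch: convert Alpaca-style records to OpenAI messages JSONL.
import json

def alpaca_to_messages(ex):
    # Fold the optional "input" field into the user turn
    user = ex["instruction"]
    if ex.get("input"):
        user += "\n\n" + ex["input"]
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": ex["output"]},
    ]}

records = [{"instruction": "Classify the sentiment",
            "input": "This product is amazing!",
            "output": "Positive"}]

# JSONL: one JSON object per line
with open("train.jsonl", "w") as f:
    for ex in records:
        f.write(json.dumps(alpaca_to_messages(ex)) + "\n")
```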
Format          | Turns       | Structure                | Best for
Alpaca          | Single-turn | instruction/input/output | Simple tasks
ShareGPT        | Multi-turn  | conversations array      | Chat models
OpenAI Messages | Multi-turn  | messages array           | Universal standard
Completion      | Raw text    | prompt + completion      | Legacy / continued pretraining (CPT)
Data Quality: The #1 Factor
Why 1,000 great examples beat 100,000 mediocre ones
The LIMA Lesson
The LIMA paper (Zhou et al., 2023, Meta) demonstrated that a 65B LLaMA model fine-tuned on just 1,000 carefully curated examples could match GPT-3.5 (text-davinci-003) on many tasks. The key was extreme quality: each example was hand-selected and hand-written by the authors.

Their conclusion: "Almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output."
What Makes Data High Quality
1. Correct: The response is factually accurate and complete.
2. Consistent: All examples follow the same format and style.
3. Diverse: Cover the full range of tasks and edge cases.
4. Representative: Match the distribution of real-world queries.
5. Appropriate length: Not too short (under-specified) or too long (verbose).
6. No contradictions: Examples don't teach conflicting behaviors.
Common Quality Problems
Incorrect labels: Wrong answers teach wrong behavior. Even a 5% error rate can significantly degrade model quality.

Inconsistent formatting: Some examples use markdown, others plain text. Some use bullet points, others paragraphs. The model gets confused about what format to use.

Duplicates: Repeated examples cause the model to memorize specific responses rather than learn patterns. Deduplicate before training.

Imbalanced distribution: If 80% of examples are about topic A and 20% about topic B, the model will be much better at A. Balance your dataset or oversample underrepresented categories.
The quality checklist before training: (1) Manually review 50-100 random examples. (2) Check for formatting consistency. (3) Verify factual accuracy on a sample. (4) Look for duplicates and near-duplicates. (5) Check the distribution of topics/tasks. (6) Ensure response lengths are appropriate. If any of these fail, fix the data before training. Training on bad data wastes compute and produces a bad model.
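Parts of this checklist can be automated. The sketch below assumes records in the messages format and uses character counts as a cheap proxy for token counts; the function name and report fields are hypothetical:

```python
# Sketch: automated pre-training data audit covering checklist
# items (1) sampling, (4) duplicates, and (6) response lengths.
import hashlib
import random

def audit(dataset, sample_size=50):
    # (4) duplicates: hash the serialized example
    hashes = [hashlib.sha256(repr(ex).encode()).hexdigest() for ex in dataset]
    dupes = len(hashes) - len(set(hashes))
    # (6) response lengths (characters as a cheap token proxy)
    lengths = [len(ex["messages"][-1]["content"]) for ex in dataset]
    # (1) random sample for manual review
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    return {"n": len(dataset), "duplicates": dupes,
            "min_len": min(lengths), "max_len": max(lengths),
            "review_sample": sample}
```

The remaining items (factual accuracy, topic distribution, formatting consistency) still need human eyes or an LLM judge.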
Synthetic Data Generation
Using strong models to create training data for smaller models
The Approach
Use a strong model (GPT-4o, Claude) to generate instruction/response pairs for your domain. This is how many successful open-source models were created:

Alpaca (Stanford, 2023): 52K examples generated by text-davinci-003 (GPT-3.5) for ~$500. Used to fine-tune LLaMA 7B.
WizardLM (Microsoft, 2023): Used "Evol-Instruct" to progressively make instructions more complex.
Orca (Microsoft, 2023): Generated detailed reasoning traces from GPT-4, not just answers.
Phi models (Microsoft, 2023-2024): "Textbooks Are All You Need" approach with high-quality synthetic textbook data.
Generation Strategies
1. Seed + Expand: Start with 10-20 hand-written examples. Ask GPT-4o to generate similar examples with variations in topic, complexity, and format.

2. Document-Grounded: Feed your domain documents to GPT-4o and ask it to generate Q&A pairs based on the content. This ensures factual grounding.

3. Evol-Instruct: Take simple instructions and ask GPT-4o to make them progressively harder: add constraints, require reasoning, increase complexity.

4. Self-Instruct: Generate instructions, then generate responses, then filter for quality. The original approach from Wang et al. (2022).
Quality Control for Synthetic Data
Always validate synthetic data. LLMs can hallucinate, be inconsistent, or produce low-quality outputs. Validation steps:

1. Automated filtering: Remove examples that are too short, too long, contain refusals ("I can't help with that"), or are duplicates.

2. LLM-as-judge: Use a second model (or the same model) to rate the quality of generated examples on a 1-5 scale. Keep only examples rated 4 or 5.

3. Human review: Manually review 10-20% of generated examples. Fix or remove bad ones. This is the most important step.

4. Consistency check: Ensure all examples follow the same format, style, and quality bar.
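Step 1 (automated filtering) is easy to implement. A minimal sketch, where the length thresholds and refusal phrases are illustrative choices, not canonical values:

```python
# Sketch: automated filter for synthetic examples — drop responses
# that are too short, too long, or contain refusal boilerplate.
REFUSAL_MARKERS = ("i can't help", "i cannot help",
                   "as an ai language model")

def keep(example, min_chars=40, max_chars=8000):
    resp = example["response"].strip()
    if not (min_chars <= len(resp) <= max_chars):
        return False
    low = resp.lower()
    return not any(m in low for m in REFUSAL_MARKERS)

examples = [
    {"instruction": "q", "response": "As an AI language model, I cannot..."},
    {"instruction": "q", "response": "Force majeure is a contract clause " * 5},
]
filtered = [ex for ex in examples if keep(ex)]  # refusal is dropped
```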
# Synthetic data generation prompt
"""Generate 10 instruction/response pairs for a legal assistant chatbot.

Requirements:
- Cover contract law, IP, employment law
- Responses should be 2-3 paragraphs
- Include relevant legal citations
- Use professional but accessible language
- Vary complexity from basic to advanced

Format each as JSON:
{"instruction": "...", "response": "..."}
"""
Legal considerations: OpenAI's terms of service prohibit using model outputs to develop models that compete with OpenAI, and other providers impose similar restrictions. Always check the current terms of the generating model's provider before training on its outputs. Also, synthetic data inherits biases from the generating model. Diversify your data sources when possible.
Data Cleaning & Deduplication
Removing noise, duplicates, and problematic examples
Cleaning Steps
1. Remove empty/malformed entries: Check for missing fields, empty strings, null values, and invalid JSON.

2. Normalize whitespace: Strip leading/trailing spaces, collapse multiple newlines, fix encoding issues (UTF-8 BOM, smart quotes).

3. Remove PII: Scan for and redact email addresses, phone numbers, SSNs, credit card numbers. Use regex patterns or dedicated PII detection tools (Presidio by Microsoft).

4. Filter by length: Remove examples that are too short (likely incomplete) or too long (may exceed max_seq_length and get truncated). Typical range: 50-2000 tokens per example.

5. Language filtering: If training an English model, remove non-English examples (use langdetect or fasttext language ID).
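Step 3 (PII removal) can be started with plain regex. The patterns below are deliberately simplified illustrations; for production use, prefer a dedicated tool like Presidio:

```python
# Sketch: regex-based PII redaction. Patterns are simplified —
# real-world emails, phone numbers, and IDs need broader coverage.
import re

PII_PATTERNS = {
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text):
    # SSN runs first so the phone pattern can't partially claim it
    for token, pattern in PII_PATTERNS.items():
        text = pattern.sub(token, text)
    return text

redact("Mail jane.doe@example.com or call 555-867-5309.")
```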
Deduplication
Exact dedup: Hash each example and remove duplicates. Fast and catches copy-paste errors.

Near-dedup: Use MinHash or SimHash to find examples that are >90% similar. These are often paraphrases or slight variations that don't add training signal.

Semantic dedup: Embed all examples and cluster. Remove examples that are too close in embedding space. More expensive but catches semantically identical examples with different wording.
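Exact dedup is a few lines of hashing. For near-dedup, the sketch below uses character-shingle Jaccard similarity as a lightweight stand-in for MinHash (which approximates the same measure at scale); the 0.9 threshold mirrors the ">90% similar" rule of thumb above:

```python
# Sketch: exact dedup by hash, plus a brute-force near-dedup using
# shingle Jaccard similarity (MinHash approximates this at scale).
import hashlib

def exact_dedup(texts):
    seen, out = set(), []
    for t in texts:
        h = hashlib.sha256(t.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(t)
    return out

def jaccard(a, b, n=3):
    shingles = lambda s: {s[i:i + n] for i in range(max(1, len(s) - n + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def near_dedup(texts, threshold=0.9):
    out = []
    for t in texts:
        if all(jaccard(t, kept) < threshold for kept in out):
            out.append(t)
    return out
```

The brute-force version is O(n²), fine for a few thousand examples; switch to MinHash/LSH (e.g. the datasketch library) for larger corpora.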
Filtering Heuristics
Remove refusals: "I'm sorry, I can't help with that" or "As an AI language model..." These teach the model to refuse when it shouldn't.

Remove self-references: "As GPT-4..." or "I was trained by OpenAI..." if you're training a different model.

Remove low-effort responses: One-word answers, responses that just repeat the question, or responses that are clearly wrong.
A practical pipeline: (1) Parse and validate JSON. (2) Normalize text. (3) Filter by length. (4) Exact dedup by hash. (5) Near-dedup by MinHash. (6) Remove refusals and self-references. (7) Manual review of a sample. This typically removes 10-30% of a raw dataset.
Chat Templates & Formatting
Converting your data into the model's expected format
Why Templates Matter
Each model family expects a specific format with specific special tokens. If your training data uses the wrong format, the model sees garbled input and produces garbled output. This is the #1 cause of bad fine-tuning results.

The chat template converts your messages array into the exact token sequence the model expects. HuggingFace stores the template as a Jinja2 string in tokenizer_config.json.
Model-Specific Formats
Llama 3:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are helpful.<|eot_id|>

Mistral:
<s>[INST] Hello [/INST]

ChatML (Qwen and many community models):
<|im_start|>system
You are helpful.<|im_end|>

Gemma 2:
<start_of_turn>user
Hello<end_of_turn>
The Right Way
Never manually construct special token sequences. Always use tokenizer.apply_chat_template(). SFTTrainer does this automatically when you provide data in the messages format.

If you're using Axolotl or LLaMA-Factory, they handle template application based on your config file. Just specify the model and data format.
# The right way: let the tokenizer handle it
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"
)
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is LoRA?"},
]
formatted = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(formatted)
# Correctly formatted with all special tokens
Tip: Before training, always print a few formatted examples to visually verify the template is correct. Check that special tokens appear in the right places, system prompts are included, and the generation prompt is appended at the end.
Tokenization & Sequence Packing
Converting formatted text to training-ready tensors
Tokenization for Training
After applying the chat template, the formatted text is tokenized into input_ids (token IDs) and labels (what the model should predict). Labels are set to -100 for prompt tokens (masked from loss) and to the actual token IDs for response tokens.

max_seq_length: Sequences longer than this are truncated. Common values: 512, 1024, 2048, 4096. Longer = more memory per example, fewer examples per batch. Choose based on your data's length distribution.
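The label-masking scheme can be shown in a few lines. A minimal sketch with made-up token IDs (trainers like SFTTrainer construct this for you):

```python
# Sketch: build labels with prompt tokens masked to -100.
# -100 is the index PyTorch's cross-entropy loss ignores, so the
# model is only trained to predict the response tokens.
def build_labels(prompt_ids, response_ids):
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy example: 3 prompt tokens, 2 response tokens
input_ids, labels = build_labels([1, 2, 3], [4, 5])
# loss is computed only where labels != -100
```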
Sequence Packing
Without packing: Each example is padded to max_seq_length. If your examples average 200 tokens and max_seq_length is 2048, you waste 90% of compute on padding tokens.

With packing: Multiple short examples are concatenated into a single sequence (with separator tokens). This eliminates padding waste and can speed up training by 2-5x.

SFTTrainer supports packing with packing=True. Axolotl supports it with sample_packing: true.
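The padding savings are easy to see with a toy greedy packer. This is only an illustration of the idea; real trainers handle separator tokens and attention masking internally:

```python
# Sketch: greedy sequence packing — concatenate short examples
# until the max_seq_length budget is reached.
def pack(lengths, max_seq_length):
    bins, current = [], 0
    for n in lengths:
        if current and current + n > max_seq_length:
            bins.append(current)
            current = 0
        current += n
    if current:
        bins.append(current)
    return bins

lengths = [200] * 10          # ten short examples, 2000 tokens total
packed = pack(lengths, 2048)  # all ten fit in a single sequence
# Unpacked, the same data would occupy ten 2048-token sequences,
# with ~90% of each sequence being padding.
```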
Token Budget
Training cost is proportional to total tokens processed:

total_tokens = num_examples × avg_tokens × num_epochs

Example: 5,000 examples × 500 avg tokens × 3 epochs = 7.5M tokens. On OpenAI (gpt-4o-mini): 7.5M × $3/1M = $22.50. On your own GPU: a few hours on an A100.

Monitor your token distribution. If some examples are 10x longer than average, they dominate training time. Consider splitting or truncating them.
# Check token length distribution
import numpy as np

lengths = [len(tok.encode(ex["text"])) for ex in dataset]
print(f"Mean: {np.mean(lengths):.0f}")
print(f"Median: {np.median(lengths):.0f}")
print(f"P95: {np.percentile(lengths, 95):.0f}")
print(f"Max: {max(lengths)}")
# Set max_seq_length to P95 or P99
# This truncates only the longest 1-5% of examples
Set max_seq_length to the 95th or 99th percentile of your token length distribution. This captures most examples without wasting memory on the few very long ones. If you must keep long examples, use gradient accumulation to fit them in memory.
Validation & Train/Test Split
Ensuring your dataset is ready for training
Train/Test Split
Always hold out 10-20% of your data as a test set. Never train on the test set. The test set is used to:

1. Monitor overfitting: If training loss decreases but test loss increases, you're overfitting.
2. Compare models: Run the base model and fine-tuned model on the same test set.
3. Evaluate quality: Use the test set for LLM-as-judge evaluation after training.

For small datasets (<1,000 examples), use 80/20 split. For large datasets (>10,000), 90/10 or even 95/5 is fine.
Stratified Splitting
If your data has categories (e.g., legal, medical, financial), use stratified splitting to ensure each category is proportionally represented in both train and test sets. Random splitting might put all rare categories in the training set, leaving the test set unrepresentative.
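A stratified split can be done with scikit-learn's train_test_split(stratify=...) or, as sketched here, with the stdlib alone; the "category" field is an assumed schema for illustration:

```python
# Sketch: stratified train/test split, assuming each example
# carries a "category" field. Each category contributes the same
# fraction to the test set.
import random
from collections import defaultdict

def stratified_split(dataset, test_frac=0.2, seed=42):
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in dataset:
        by_cat[ex["category"]].append(ex)
    train, test = [], []
    for items in by_cat.values():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_frac))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

The max(1, ...) guard ensures even rare categories appear in the test set, which is exactly what a naive random split fails to guarantee.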
Final Validation Checklist
1. Format check: All examples parse correctly (valid JSON, correct fields).
2. Template check: Print 5 formatted examples and visually verify special tokens.
3. Length check: No examples exceed max_seq_length after tokenization (or you accept truncation).
4. Balance check: Category distribution is reasonable.
5. Leakage check: No test examples appear in the training set (check by hash).
6. Dedup check: No duplicates within or across splits.
7. Sanity check: Read 20 random examples end-to-end. Would you be happy if the model produced these responses?
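Checklist items 5 and 6 (leakage and cross-split duplicates) reduce to a set intersection over hashes. A minimal sketch, normalizing text before hashing so trivial whitespace/case variants are caught:

```python
# Sketch: leakage check — flag test examples whose normalized text
# also appears in the training set.
import hashlib

def hashes(texts):
    return {hashlib.sha256(t.strip().lower().encode()).hexdigest()
            for t in texts}

def leaked(train, test):
    return hashes(train) & hashes(test)

train = ["What is LoRA?", "Explain QLoRA."]
test = ["what is lora?  "]        # near-identical after normalization
overlap = leaked(train, test)     # non-empty set means leakage
```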
The "golden rule" of fine-tuning data: Every example in your training set should be an example you'd be proud to show as your model's output. If you wouldn't want the model to produce a particular response, don't include it in training. The model will learn to imitate exactly what you show it.