The Analogy
Training data is the curriculum for the model. Just as a student who only reads comic books won’t write great essays, a model trained on low-quality web scrapes won’t produce high-quality output. Modern LLM training involves massive data engineering: crawling the web, deduplicating, filtering toxic/low-quality content, mixing in curated sources (books, code, academic papers), and carefully balancing the proportions.
Key insight: Llama 3 was trained on over 15T tokens drawn from web data (Common Crawl), code (GitHub), books, academic papers, Wikipedia, and more. The data pipeline includes language detection, quality filtering (perplexity-based), deduplication (exact and MinHash), PII removal, and domain-specific heuristics. Data quality is now widely seen as a bigger differentiator between models than architecture.
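The PII-removal step mentioned above can be sketched with a couple of regexes. This is a minimal illustration, not a production scrubber: real pipelines use far more robust patterns and often named-entity models. The patterns, placeholder tokens, and function name here are all assumptions for the sketch.

```python
import re

# Illustrative patterns (assumptions, not a production PII spec):
# a loose email matcher and a North-American-style phone matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text
```

Replacing rather than deleting keeps sentence structure intact, so downstream quality filters still see fluent text.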
Data Sources & Mixing
# Typical data mix (approximate):
# Web text: ~67% (Common Crawl, filtered)
# Code: ~17% (GitHub, Stack Overflow)
# Books: ~5% (BookCorpus, etc.)
# Academic: ~5% (ArXiv, papers)
# Wikipedia: ~3% (high quality)
# Math/Science: ~3% (curated)
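One way to read the mix above: the target proportions, combined with how many tokens each source actually has, determine how many passes (epochs) the training run makes over each source. The per-source token counts below are illustrative assumptions, not real figures from any published run.

```python
# Sketch of domain mixing: turn target proportions plus available
# token counts into per-source epoch counts for a 15T-token budget.

TOKEN_BUDGET = 15e12  # total training tokens

# Illustrative cleaned-token counts per source (assumed, not real data)
available = {
    "web": 40e12, "code": 3e12, "books": 1e12,
    "academic": 1e12, "wikipedia": 0.05e12, "math": 0.5e12,
}

# Target proportions from the mix above
target = {
    "web": 0.67, "code": 0.17, "books": 0.05,
    "academic": 0.05, "wikipedia": 0.03, "math": 0.03,
}

def epochs_per_source(budget, available, target):
    """Tokens drawn from a source / tokens available = epochs over it."""
    return {s: (frac * budget) / available[s] for s, frac in target.items()}

mix = epochs_per_source(TOKEN_BUDGET, available, target)
```

Note the pattern this exposes: abundant web text is sampled at well under one epoch, while small high-quality sources like Wikipedia get repeated many times to hit their target share.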
# Data pipeline stages:
# 1. Crawl: ~100T raw tokens
# 2. Language filter: keep English + top langs
# 3. Quality filter: perplexity scoring
# 4. Dedup: exact + fuzzy (MinHash)
# 5. Safety filter: remove toxic content
# 6. PII removal: emails, phones, etc.
# 7. Domain mixing: set proportions
# Result: ~15T clean tokens
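The fuzzy half of the dedup stage (step 4) can be sketched with a toy MinHash in pure Python. Real pipelines hash word shingles into fixed-size signatures and use banded locality-sensitive hashing to avoid all-pairs comparison; this sketch keeps only the core idea, that the fraction of matching signature slots estimates Jaccard similarity. All function names and parameters here are illustrative.

```python
import hashlib

def shingles(text, k=5):
    """Set of k-word shingles; the unit of overlap for fuzzy dedup."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """One minimum per seeded hash function; equal slots ~ shingle overlap."""
    shs = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shs)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a threshold (commonly around 0.8) are treated as near-duplicates and all but one copy is dropped.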
# Key datasets:
# FineWeb (HuggingFace): 15T tokens, open
# The Pile (EleutherAI): 825 GiB (~300B tokens), open
# RedPajama: 1.2T tokens, open