The Four Stages
1. Data collection & cleaning — Scrape public repos, deduplicate, filter for quality and licenses, remove PII. Shrink 67 TB to ~32 TB of clean code.
2. Pre-training — Next-token prediction on trillions of tokens with fill-in-the-middle (FIM) augmentation. Teaches syntax, semantics, and patterns across 600+ languages. Takes weeks on thousands of GPUs.
3. Instruction tuning — SFT on (instruction, code) pairs. Teaches the model to follow natural language requests and produce structured responses.
4. Alignment — RLHF and/or execution-based RL. Optimizes for code that actually works, is safe, and follows best practices.
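To make stage 1 concrete, here is a minimal sketch of exact deduplication by content hash. Real pipelines also apply near-deduplication (e.g. MinHash over token shingles), license classifiers, and PII scrubbing, all of which are out of scope here; the function name is illustrative, not from any specific toolkit.

```python
import hashlib

def dedupe_exact(files):
    """Drop files whose content is byte-for-byte identical to one
    already seen, keeping the first occurrence. `files` is a list of
    (path, text) pairs."""
    seen, kept = set(), []
    for path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, text))
    return kept
```

Exact dedup is cheap (one hash per file) and is typically run before the more expensive near-dedup and quality-filtering passes.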
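The FIM augmentation in stage 2 can be sketched as a simple string transform: split a document at two random points into (prefix, middle, suffix), then reorder the pieces behind sentinel markers so the model learns to predict the middle from its surroundings. The `<PRE>`/`<SUF>`/`<MID>` sentinels here are illustrative placeholders; real tokenizers reserve dedicated special tokens for them.

```python
import random

def fim_transform(code: str, rng: random.Random) -> str:
    """Fill-in-the-middle augmentation: pick two random cut points,
    split the document into prefix/middle/suffix, and emit them in
    prefix-suffix-middle order behind sentinel markers."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"
```

Training on a mix of plain left-to-right and FIM-transformed documents is what lets the deployed model complete code at the cursor, with context on both sides.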
What Makes Code Models Special
Compared to general LLMs, code models have three distinct advantages during training:
• Verifiable output — code can be executed and tested, providing objective reward signals
• Structured data — code has syntax rules, type systems, and dependency graphs that provide learning signal
• Repository context — files relate to each other through imports and dependencies, teaching cross-file reasoning
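The "verifiable output" advantage is what makes execution-based RL possible: a candidate program either passes its tests or it doesn't, yielding an objective reward. Here is a minimal sketch of such a reward function, assuming Python candidates and plain `assert`-style tests; a production system would sandbox execution properly, and the function name is hypothetical.

```python
import os
import subprocess
import sys
import tempfile

def execution_reward(candidate: str, tests: str, timeout: float = 5.0) -> float:
    """Run model-generated code against unit tests in a subprocess.
    Returns 1.0 if all tests pass (exit code 0), 0.0 on failure or
    timeout. NOTE: executes untrusted code; a real pipeline would use
    a sandbox, not a bare subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)
```

A binary pass/fail signal like this is far less ambiguous than human preference ratings, which is why alignment for code models leans so heavily on execution.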
Key insight: The training pipeline explains both the strengths and weaknesses of code AI. It’s brilliant at patterns it’s seen millions of times (common algorithms, popular frameworks). It struggles with novel architectures, proprietary APIs, and code that doesn’t exist in public repos — because it literally has never seen them.