Ch 3 — Training Code Models

From raw GitHub repos to a model that writes code — the specialized training pipeline
High Level: Collect → Clean → Pre-Train → Fine-Tune → Align → Evaluate
Step 1: Collecting the World’s Code
Trillions of tokens from millions of repositories
The Data Sources
Code models are trained on massive corpora of public source code. The primary source is GitHub, the largest collection of open-source software ever assembled. OpenAI’s Codex was trained on 159 GB of Python code drawn from 54 million repos. StarCoder2 used The Stack v2, a 67.5 TB dataset built from the Software Heritage archive and covering 600+ programming languages.
Beyond Just Code
Modern code models don’t train on code alone. DeepSeek-Coder-V2 used a mix of 60% source code, 10% math, and 30% natural language. The natural language helps the model understand comments, documentation, and the intent behind code. Math improves reasoning about algorithms and logic.
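To make a mix like this concrete, the sketch below treats the ratios as sampling weights over data sources. It is an illustration only, not DeepSeek's actual pipeline: the source labels and the sample_source helper are hypothetical, and only the 60/10/30 split comes from the text above.

```python
import random
from collections import Counter

# Mixture weights from the text; the source labels are illustrative.
MIXTURE = {"source_code": 0.60, "math": 0.10, "natural_language": 0.30}

def sample_source(rng: random.Random) -> str:
    """Pick the corpus that the next training document is drawn from."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw many documents and check that the observed mix tracks the weights.
rng = random.Random(0)
NUM_DRAWS = 10_000
counts = Counter(sample_source(rng) for _ in range(NUM_DRAWS))
fractions = {name: counts[name] / NUM_DRAWS for name in MIXTURE}
print(fractions)
```

Sampling per document, rather than concatenating the corpora, keeps the ratio stable throughout training instead of showing the model all the code first and all the prose last.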
Scale Comparison
// Training data scale progression:
Codex (2021)         159 GB      54M repos
StarCoder (2023)     6.4 TB      80+ languages
StarCoder2 (2024)    67.5 TB     600+ languages
DeepSeek-Coder-V2    6T tokens   338 languages
// StarCoder2-15B alone trained on
// 4+ trillion tokens of code
Key insight: The jump from Codex (159 GB) to StarCoder2 (67.5 TB) is a more than 400x increase in training data in just three years. More data means more patterns learned: more languages, more idioms, more edge cases the model has seen before.
Step 2: Cleaning & Filtering
Not all code is good code — garbage in, garbage out
Why Filtering Matters
Raw GitHub data is noisy. It contains auto-generated files, minified JavaScript, data dumps disguised as code, license headers repeated millions of times, and student homework with bugs. Training on all of it would teach the model to generate noise. Data quality directly determines model quality.
Deduplication
Near-deduplication removes files that are almost identical (forks, copy-paste). The Stack v2 shrank from 67.5 TB to 32.1 TB after deduplication — meaning roughly half of public code is duplicated. Without dedup, the model would memorize common boilerplate rather than learning general patterns.
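A minimal sketch of near-deduplication, assuming whitespace tokens, 3-token shingles, and a Jaccard similarity threshold of 0.7. These are illustrative choices, not The Stack v2's settings. Large pipelines commonly approximate this comparison with MinHash and locality-sensitive hashing, because comparing every pair of files does not scale.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-token windows (shingles) in a file."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(files: dict[str, str], threshold: float = 0.7) -> list[str]:
    """Keep the first file of each near-duplicate group (quadratic toy version)."""
    kept_paths: list[str] = []
    kept_shingles: list[set] = []
    for path, text in files.items():
        candidate = shingles(text)
        if all(jaccard(candidate, seen) < threshold for seen in kept_shingles):
            kept_paths.append(path)
            kept_shingles.append(candidate)
    return kept_paths

original = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
files = {
    "a/utils.py": original,
    "fork/utils.py": original + "# forked copy\n",   # near-identical fork
    "c/main.py": "import sys\n\nprint(sys.argv)\n",  # unrelated file
}
kept = near_dedup(files)
print(kept)
```

The fork shares 11 of 14 distinct shingles with the original (similarity about 0.79), so it is dropped, while the unrelated file is kept.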
Filtering Heuristics
Typical filtering steps include:

License detection — keeping only permissively licensed code
Language detection — classifying files by programming language
Quality screening — using compilers, linters, and heuristic rules
PII removal — stripping emails, API keys, passwords
Malware filtering — removing known malicious code patterns
Max line length / file size — excluding auto-generated files
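Two of the heuristics above, sketched under assumed thresholds. The size limits, the email pattern, and the `<EMAIL>` placeholder are hypothetical; real pipelines tune such limits per language and use much broader secret detectors.

```python
import re
from typing import Optional

MAX_LINE_LENGTH = 1000       # assumed cutoff for minified or generated files
MAX_FILE_BYTES = 1_000_000   # assumed cutoff for data dumps

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_autogenerated(text: str) -> bool:
    """Flag files that are too large or contain an extremely long line."""
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        return True
    return any(len(line) > MAX_LINE_LENGTH for line in text.splitlines())

def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)

def clean_file(text: str) -> Optional[str]:
    """Return the cleaned file, or None if the file should be dropped."""
    if looks_autogenerated(text):
        return None
    return redact_emails(text)

minified = "x" * 2000
normal = "# contact: dev@example.com\nprint('hi')\n"
print(clean_file(minified))
print(clean_file(normal))
```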
Critical in AI: If the training data contains insecure code patterns (SQL injection, hardcoded secrets), the model learns to reproduce them. This is why AI-generated code inherits the security flaws of the open-source code it was trained on — a theme we’ll revisit in Ch 12.
Step 3: Pre-Training on Next-Token Prediction
The core objective: predict what comes next
The Training Objective
Pre-training uses the same objective as general LLMs: next-token prediction. Given a sequence of code tokens, predict the next one. The model sees “def fibonacci(n):\n    if n <= 1:\n        return” and learns to predict “n”. Repeated trillions of times across all training data, this teaches the model the statistical structure of code.
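The objective can be pictured as slicing one sequence into many (context, target) pairs. The sketch below uses hand-split word-level tokens for readability, which is an assumption; real models operate on subword token IDs and score each prediction with a cross-entropy loss.

```python
def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# The fibonacci snippet from the text, split by hand into toy tokens.
tokens = ["def", "fibonacci", "(", "n", ")", ":",
          "if", "n", "<=", "1", ":", "return", "n"]
pairs = next_token_pairs(tokens)

# The last pair is the case described above: after "return", predict "n".
context, target = pairs[-1]
print(context[-3:], "->", target)
```

A sequence of 13 tokens yields 12 training signals, one per position, which is why a single pass over the corpus provides so many learning examples.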
Repository-Level Training
DeepSeek-Coder pioneered repo-level pre-training: instead of training on isolated files, it arranges files within a repository by their dependency graph using topological sort. This teaches the model that import utils refers to a specific file, and that types defined in one file are used in another.
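The ordering step can be sketched with Python's standard-library graphlib module. The repository layout, the file contents, and the "# file:" header are hypothetical; the text only states that files are arranged in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical repo: each file maps to the set of files it imports.
deps = {
    "utils.py": set(),
    "models.py": {"utils.py"},
    "main.py": {"models.py", "utils.py"},
}
sources = {
    "utils.py": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    "models.py": "from utils import clamp\n",
    "main.py": "import models\nimport utils\n",
}

def order_repo_files(deps: dict[str, set[str]]) -> list[str]:
    """Return files so that each file appears after the files it imports."""
    return list(TopologicalSorter(deps).static_order())

def build_training_sample(deps: dict[str, set[str]], sources: dict[str, str]) -> str:
    """Concatenate files in dependency order, each under a path header."""
    parts = [f"# file: {path}\n{sources[path]}" for path in order_repo_files(deps)]
    return "\n".join(parts)

ordered = order_repo_files(deps)
sample = build_training_sample(deps, sources)
print(ordered)
```

With this order, the definition of clamp is already in the context when the model reaches the file that imports it. Circular imports would raise graphlib.CycleError here, so a real pipeline needs a fallback for cyclic dependencies.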
Fill-in-the-Middle Augmentation
During pre-training, a fraction of examples are FIM-transformed: a random span is removed from the middle and the model must regenerate it given prefix + suffix. This is a data augmentation strategy, not an architecture change. The key finding: models trained on a mix of standard and FIM-transformed data gain infilling ability without losing left-to-right performance — called the “FIM-for-free” property.
// FIM data transformation during training:
// Original code:
def add(a, b):
    return a + b
// FIM-transformed (PSM format):
<PRE>def add(a, b):\n    <SUF>\n<MID>return a + b<EOT>
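A minimal sketch of the transformation, assuming character-level cut points and a 50% FIM rate (both illustrative choices). The sentinel strings match the ones shown above; the helper names are hypothetical.

```python
import random

PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def fim_transform(code: str, rng: random.Random) -> str:
    """Cut out a random middle span and emit prefix, suffix, middle (PSM order)."""
    # Two distinct cut points split the document into three pieces.
    start, end = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"

def maybe_fim(code: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Apply FIM to a fraction of examples; leave the rest left-to-right."""
    return fim_transform(code, rng) if rng.random() < fim_rate else code

code = "def add(a, b):\n    return a + b\n"
transformed = fim_transform(code, random.Random(0))
print(transformed)
```

Because the middle span is moved to the end, the ordinary next-token objective trains the model to generate it after reading both the prefix and the suffix. No change to the model architecture is needed.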
Code-Specific Tokenizers
Why code needs its own vocabulary
The Problem with General Tokenizers
A tokenizer trained on English text wastes tokens on code. It might split getElementById into 5 tokens, or treat Python indentation (4 spaces) as 4 separate tokens. Code-specific tokenizers are trained on code corpora so that common programming patterns — keywords, operators, indentation levels — get efficient single-token representations.
Whitespace Handling
Indentation is semantically meaningful in code (especially Python, YAML). Code tokenizers include special handling for whitespace: multi-space tokens that represent 2, 4, 8, or more spaces as a single token, and tab tokens at various indentation levels. This dramatically improves token efficiency for indented code.
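The effect is easy to see with a toy token count. The two counting functions below are hypothetical stand-ins, not real tokenizers: one treats every space as its own token, the other treats a whole run of spaces as one token.

```python
import re

def count_naive(line: str) -> int:
    """Each space is its own token; each run of non-space text is one token."""
    return len(re.findall(r" |[^ ]+", line))

def count_with_space_runs(line: str) -> int:
    """A whole run of spaces is a single token."""
    return len(re.findall(r" +|[^ ]+", line))

line = " " * 8 + "return x"   # a line nested two levels deep
naive = count_naive(line)
merged = count_with_space_runs(line)
print(naive, merged)
```

For this line the naive count is 11 tokens and the merged count is 4. Deeply nested code repeats that saving on almost every line.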
Vocabulary Design
Code tokenizer vocabularies typically range from 32K to 128K tokens. They include:

Language keywords: function, class, import, async
Common identifiers: self, args, data, result
Operators: =>, ===, ::, **
Structural tokens: brackets, indentation, newlines
FIM special tokens: <PRE>, <SUF>, <MID>, <EOT>
Key insight: Tokenizer quality has an outsized impact on model performance. A better tokenizer means more code fits in the context window, which means the model sees more relevant context, which means better predictions. It’s a compounding advantage.
Step 4: Instruction Tuning for Code
Teaching the model to follow coding instructions
From Completion to Conversation
A pre-trained code model can continue code, but it can’t follow instructions like “write a function that sorts users by age.” Instruction tuning (also called supervised fine-tuning / SFT) trains the model on thousands of (instruction, code) pairs so it learns to map natural language requests to code solutions.
Training Data for SFT
Instruction tuning datasets include:

Human-written pairs — developers writing code for specific prompts
Synthetic data — using a stronger model to generate instruction/code pairs
Competitive programming — problems + solutions from Codeforces, LeetCode
Documentation examples — API docs paired with usage code
Commit messages + diffs — natural language descriptions of code changes
Chat Format
// Instruction tuning example:
<|system|>
You are a helpful coding assistant.
<|user|>
Write a Python function that checks if a string is a palindrome.
<|assistant|>
def is_palindrome(s: str) -> bool:
    cleaned = s.lower().strip()
    return cleaned == cleaned[::-1]
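A sketch of how an (instruction, code) pair might be rendered into this chat format. The build_sft_example helper is hypothetical, and returning the prompt and completion separately reflects a common practice (computing the loss only on the assistant's part), which is an assumption here rather than something the text states.

```python
def build_sft_example(
    instruction: str,
    code: str,
    system: str = "You are a helpful coding assistant.",
) -> tuple[str, str]:
    """Render one training example as (prompt, completion)."""
    prompt = (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{instruction}\n"
        f"<|assistant|>\n"
    )
    return prompt, code

solution = (
    "def is_palindrome(s: str) -> bool:\n"
    "    cleaned = s.lower().strip()\n"
    "    return cleaned == cleaned[::-1]\n"
)
prompt, completion = build_sft_example(
    "Write a Python function that checks if a string is a palindrome.",
    solution,
)
print(prompt + completion)
```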
Key insight: Instruction tuning is what makes the difference between a model that can only autocomplete and one that can have a conversation about code. It’s the step that turns a code predictor into a coding assistant.
Step 5: Alignment with Execution Feedback
RLHF meets unit tests — code has a unique advantage
Code’s Unique Advantage
Unlike prose, code can be objectively verified. You can run it. You can test it. This gives code models a training signal that general LLMs don’t have: execution feedback. Instead of relying solely on human preferences (RLHF), code models can be aligned using whether the generated code actually passes unit tests.
Reinforcement Learning from Execution
The RLEF (Reinforcement Learning from Execution Feedback) approach works in a loop: the model generates code, a sandbox executes it against test cases, pass/fail results become the reward signal, and the model is updated to favor code that passes. Because tests carry no subjective bias, this signal is more consistent than human ratings, although it is only as strong as the test suite behind it.
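The reward step of that loop can be sketched as follows. This is a simplified illustration with a hypothetical execution_reward helper: it uses exec in the current process, which is NOT a sandbox, and it has no timeout, so it must never be run on untrusted model output. The policy update itself is omitted.

```python
def execution_reward(
    candidate_src: str,
    func_name: str,
    tests: list[tuple[tuple, object]],
) -> float:
    """Binary reward: 1.0 if the candidate passes every test, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # illustration only, not sandboxed
        func = namespace[func_name]
    except Exception:
        return 0.0                       # code that fails to load earns nothing
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return 0.0
        except Exception:
            return 0.0                   # runtime errors count as failures
    return 1.0

tests = [(("level",), True), (("hello",), False)]
good = "def is_palindrome(s):\n    return s == s[::-1]\n"
wrong = "def is_palindrome(s):\n    return True\n"
broken = "def is_palindrome(s) return"

rewards = [execution_reward(src, "is_palindrome", tests) for src in (good, wrong, broken)]
print(rewards)
```

Only the correct candidate earns a reward; the candidate with a logic error and the one that does not parse both score zero, and the model is then updated to favor the rewarded behavior.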
Beyond Binary Pass/Fail
Recent work like CodeRL+ (2025) goes beyond simple pass/fail signals. It teaches models to infer variable-level execution trajectories — understanding not just whether code passes, but why it fails. This produces a 4.6% improvement in pass@1 over basic execution-based training.
Execution-Free Alternatives
CodeScaler (2026) trains reward models that predict code correctness without executing it, enabling RL training at scale without sandboxed environments. It improved Qwen3-8B by +11.72 points across five benchmarks — outperforming even binary execution-based RL by +1.82 points.
Why it matters: Execution-based alignment is a major reason code models have improved so quickly. Automatically verifying output quality creates a tighter feedback loop than human evaluation alone can provide.
Step 6: Benchmarks — Measuring Code Ability
From HumanEval to SWE-bench: “Can it code?” vs. “Can it engineer?”
HumanEval (The Classic)
HumanEval (OpenAI, 2021) contains 164 hand-crafted programming challenges — roughly equivalent to easy interview questions. It uses the pass@k metric: generate k solutions and check if at least one passes all test cases. Most frontier models now score 85%+, making it nearly saturated.
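In practice pass@k is usually estimated rather than measured by drawing exactly k samples. The paper that introduced HumanEval generates n ≥ k samples per problem, counts the c that pass, and computes 1 − C(n−c, k) / C(n, k). A small sketch of that estimator, with a function name chosen here for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass all tests."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(p1, p5)
```

The benchmark score is the average of this value over all 164 problems. Note how quickly it rises with k: the same model scores 0.3 at k=1 but about 0.92 at k=5, so scores are only comparable at the same k.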
SWE-bench (The Real Test)
SWE-bench evaluates whether AI can resolve real GitHub issues in full repositories. It requires understanding codebases, debugging across multiple files, and generating patches that pass test suites. SWE-bench Verified (500 human-reviewed problems) is the gold standard. Top score in 2026: 80.9% (Claude Opus 4.5).
The Difficulty Spectrum
// Benchmark difficulty (2026 top scores):
HumanEval             ~95%     // Saturated
MBPP                  ~90%     // Near saturated
SWE-bench Verified    80.9%    // Hard
SWE-bench Pro         46.0%    // Very hard
// The gap between HumanEval and
// SWE-bench Pro shows the distance
// between "can write a function"
// and "can engineer a solution"
The connection: HumanEval asks “Can it write a function?” SWE-bench asks “Can it fix a real bug in a real codebase?” The 95% vs 46% gap tells you exactly where AI coding stands: excellent at isolated tasks, still struggling with real-world engineering.
The Full Training Pipeline
Putting it all together: from raw repos to coding assistant
The Four Stages
1. Data collection & cleaning — Scrape public repos, deduplicate, filter for quality and licenses, remove PII. Shrink 67 TB to ~32 TB of clean code.

2. Pre-training — Next-token prediction on trillions of tokens with FIM augmentation. Teaches syntax, semantics, and patterns across 600+ languages. Takes weeks on thousands of GPUs.

3. Instruction tuning — SFT on (instruction, code) pairs. Teaches the model to follow natural language requests and produce structured responses.

4. Alignment — RLHF and/or execution-based RL. Optimizes for code that actually works, is safe, and follows best practices.
What Makes Code Models Special
Compared to general LLMs, code models have three unique advantages in training:

Verifiable output — code can be executed and tested, providing objective reward signals
Structured data — code has syntax rules, type systems, and dependency graphs that provide learning signal
Repository context — files relate to each other through imports and dependencies, teaching cross-file reasoning
Key insight: The training pipeline explains both the strengths and weaknesses of code AI. It’s brilliant at patterns it’s seen millions of times (common algorithms, popular frameworks). It struggles with novel architectures, proprietary APIs, and code that doesn’t exist in public repos — because it literally has never seen them.