Ch 3: Training Code Models

Ch 3 — Training Code Models

From raw GitHub repos to a model that writes code — the specialized training pipeline

Index

High Level

cloud_download

Collect

arrow_forward

filter_alt

Clean

arrow_forward

model_training

Pre-Train

arrow_forward

tune

Fine-Tune

arrow_forward

thumb_up

Align

arrow_forward

verified

Evaluate

Click play or press Space to begin...

Step- / 8

cloud_download

Step 1: Collecting the World’s Code

Trillions of tokens from millions of repositories

The Data Sources

Code models are trained on massive corpora of public source code. The primary source is GitHub — the largest collection of open-source software ever assembled. OpenAI’s Codex was trained on 159 GB from 54 million repos. StarCoder2 used The Stack v2, a 67.5 TB dataset from the Software Heritage archive covering 600+ programming languages.

Beyond Just Code

Modern code models don’t train on code alone. DeepSeek-Coder-V2 used a mix of 60% source code, 10% math, and 30% natural language. The natural language helps the model understand comments, documentation, and the intent behind code. Math improves reasoning about algorithms and logic.

Scale Comparison

// Training data scale progression: Codex (2021) 159 GB 54M repos StarCoder (2023) 6.4 TB 80+ languages StarCoder2 (2024) 67.5 TB 600+ languages DeepSeek-V2 6T tokens 338 languages // StarCoder2-15B alone trained on // 4+ trillion tokens of code

Key insight: The jump from Codex (159 GB) to StarCoder2 (67.5 TB) is a 400x increase in training data in just three years. More data means more patterns learned — more languages, more idioms, more edge cases the model has seen before.

filter_alt

Step 2: Cleaning & Filtering

Not all code is good code — garbage in, garbage out

Why Filtering Matters

Raw GitHub data is noisy. It contains auto-generated files, minified JavaScript, data dumps disguised as code, license headers repeated millions of times, and student homework with bugs. Training on all of it would teach the model to generate noise. Data quality directly determines model quality.

Deduplication

Near-deduplication removes files that are almost identical (forks, copy-paste). The Stack v2 shrank from 67.5 TB to 32.1 TB after deduplication — meaning roughly half of public code is duplicated. Without dedup, the model would memorize common boilerplate rather than learning general patterns.

Filtering Heuristics

Typical filtering steps include:

• License detection — keeping only permissively licensed code
• Language detection — classifying files by programming language
• Quality screening — using compilers, linters, and heuristic rules
• PII removal — stripping emails, API keys, passwords
• Malware filtering — removing known malicious code patterns
• Max line length / file size — excluding auto-generated files

Critical in AI: If the training data contains insecure code patterns (SQL injection, hardcoded secrets), the model learns to reproduce them. This is why AI-generated code inherits the security flaws of the open-source code it was trained on — a theme we’ll revisit in Ch 12.

model_training

Step 3: Pre-Training on Next-Token Prediction

The core objective: predict what comes next

The Training Objective

Pre-training uses the same objective as general LLMs: next-token prediction. Given a sequence of code tokens, predict the next one. The model sees def fibonacci(n):\n if n <= 1:\n return and learns to predict n. Repeated trillions of times across all training data, this teaches the model the statistical structure of code.

Repository-Level Training

DeepSeek-Coder pioneered repo-level pre-training: instead of training on isolated files, it arranges files within a repository by their dependency graph using topological sort. This teaches the model that import utils refers to a specific file, and that types defined in one file are used in another.

Fill-in-the-Middle Augmentation

During pre-training, a fraction of examples are FIM-transformed: a random span is removed from the middle and the model must regenerate it given prefix + suffix. This is a data augmentation strategy, not an architecture change. The key finding: models trained on a mix of standard and FIM-transformed data gain infilling ability without losing left-to-right performance — called the “FIM-for-free” property.

// FIM data transformation during training: // Original code: def add(a, b): return a + b // FIM-transformed (PSM format): <PRE>def add(a, b):\n <SUF>\n<MID>return a + b<EOT>

architecture

Code-Specific Tokenizers

Why code needs its own vocabulary

The Problem with General Tokenizers

A tokenizer trained on English text wastes tokens on code. It might split getElementById into 5 tokens, or treat Python indentation (4 spaces) as 4 separate tokens. Code-specific tokenizers are trained on code corpora so that common programming patterns — keywords, operators, indentation levels — get efficient single-token representations.

Whitespace Handling

Indentation is semantically meaningful in code (especially Python, YAML). Code tokenizers include special handling for whitespace: multi-space tokens that represent 2, 4, 8, or more spaces as a single token, and tab tokens at various indentation levels. This dramatically improves token efficiency for indented code.

Vocabulary Design

Code tokenizer vocabularies typically range from 32K to 128K tokens. They include:

• Language keywords — function, class, import, async
• Common identifiers — self, args, data, result
• Operators — =>, ===, ::, **
• Structural tokens — brackets, indentation, newlines
• FIM special tokens — <PRE>, <SUF>, <MID>, <EOT>

Key insight: Tokenizer quality has an outsized impact on model performance. A better tokenizer means more code fits in the context window, which means the model sees more relevant context, which means better predictions. It’s a compounding advantage.

tune

Step 4: Instruction Tuning for Code

Teaching the model to follow coding instructions

From Completion to Conversation

A pre-trained code model can continue code, but it can’t follow instructions like “write a function that sorts users by age.” Instruction tuning (also called supervised fine-tuning / SFT) trains the model on thousands of (instruction, code) pairs so it learns to map natural language requests to code solutions.

Training Data for SFT

Instruction tuning datasets include:

• Human-written pairs — developers writing code for specific prompts
• Synthetic data — using a stronger model to generate instruction/code pairs
• Competitive programming — problems + solutions from Codeforces, LeetCode
• Documentation examples — API docs paired with usage code
• Commit messages + diffs — natural language descriptions of code changes

Chat Format

Key insight: Instruction tuning is what makes the difference between a model that can only autocomplete and one that can have a conversation about code. It’s the step that turns a code predictor into a coding assistant.

thumb_up

Step 5: Alignment with Execution Feedback

RLHF meets unit tests — code has a unique advantage

Code’s Unique Advantage

Unlike prose, code can be objectively verified. You can run it. You can test it. This gives code models a training signal that general LLMs don’t have: execution feedback. Instead of relying solely on human preferences (RLHF), code models can be aligned using whether the generated code actually passes unit tests.

Reinforcement Learning from Execution

The RLEF (Reinforcement Learning from Execution Feedback) approach works in a loop: the model generates code, a sandbox executes it against test cases, pass/fail results become the reward signal, and the model is updated to favor code that passes. This is more reliable than human ratings because tests don’t have subjective bias.

Beyond Binary Pass/Fail

Recent work like CodeRL+ (2025) goes beyond simple pass/fail signals. It teaches models to infer variable-level execution trajectories — understanding not just whether code passes, but why it fails. This produces a 4.6% improvement in pass@1 over basic execution-based training.

Execution-Free Alternatives

CodeScaler (2026) trains reward models that predict code correctness without executing it, enabling RL training at scale without sandboxed environments. It improved Qwen3-8B by +11.72 points across five benchmarks — outperforming even binary execution-based RL by +1.82 points.

Why it matters: Execution-based alignment is why code models have improved faster than general LLMs. The ability to automatically verify output quality creates a tighter feedback loop than human evaluation ever could.

verified

Step 6: Benchmarks — Measuring Code Ability

From HumanEval to SWE-bench: “Can it code?” vs. “Can it engineer?”

HumanEval (The Classic)

HumanEval (OpenAI, 2021) contains 164 hand-crafted programming challenges — roughly equivalent to easy interview questions. It uses the pass@k metric: generate k solutions and check if at least one passes all test cases. Most frontier models now score 85%+, making it nearly saturated.

SWE-bench (The Real Test)

SWE-bench evaluates whether AI can resolve real GitHub issues in full repositories. It requires understanding codebases, debugging across multiple files, and generating patches that pass test suites. SWE-bench Verified (500 human-reviewed problems) is the gold standard. Top score in 2026: 80.9% (Claude Opus 4.5).

The Difficulty Spectrum

// Benchmark difficulty (2026 top scores): HumanEval ~95% // Saturated MBPP ~90% // Near saturated SWE-bench Verified 80.9% // Hard SWE-bench Pro 46.0% // Very hard // The gap between HumanEval and // SWE-bench Pro shows the distance // between "can write a function" // and "can engineer a solution"

The connection: HumanEval asks “Can it write a function?” SWE-bench asks “Can it fix a real bug in a real codebase?” The 95% vs 46% gap tells you exactly where AI coding stands: excellent at isolated tasks, still struggling with real-world engineering.

layers

The Full Training Pipeline

Putting it all together: from raw repos to coding assistant

The Four Stages

1. Data collection & cleaning — Scrape public repos, deduplicate, filter for quality and licenses, remove PII. Shrink 67 TB to ~32 TB of clean code.

2. Pre-training — Next-token prediction on trillions of tokens with FIM augmentation. Teaches syntax, semantics, and patterns across 600+ languages. Takes weeks on thousands of GPUs.

3. Instruction tuning — SFT on (instruction, code) pairs. Teaches the model to follow natural language requests and produce structured responses.

4. Alignment — RLHF and/or execution-based RL. Optimizes for code that actually works, is safe, and follows best practices.

What Makes Code Models Special

Compared to general LLMs, code models have three unique advantages in training:

• Verifiable output — code can be executed and tested, providing objective reward signals
• Structured data — code has syntax rules, type systems, and dependency graphs that provide learning signal
• Repository context — files relate to each other through imports and dependencies, teaching cross-file reasoning

Key insight: The training pipeline explains both the strengths and weaknesses of code AI. It’s brilliant at patterns it’s seen millions of times (common algorithms, popular frameworks). It struggles with novel architectures, proprietary APIs, and code that doesn’t exist in public repos — because it literally has never seen them.

arrow_back Ch 2: How Code LLMs Work Ch 4: The AI Coding Landscape arrow_forward