Ch 2 — How Code LLMs Work

From keystrokes to token predictions — what happens between you and the ghost text
High Level
Pipeline: Source Code → Tokenize → Embed → Transform → Sample → Output
Code Is Just Text (To a Model)
The starting point: your source code as a string
The Input
When you type in your editor, the AI coding tool captures raw text — your current file, cursor position, and surrounding context. To the model, Python, TypeScript, and Rust are all just sequences of characters. There’s no special “code mode” — the model processes code the same way it processes English, through token prediction.
What Gets Sent
The tool assembles a prompt from your code. For inline completions, this typically includes the prefix (code before your cursor) and suffix (code after your cursor). For chat, it includes your question plus relevant file context. This assembled text is the model’s entire view of your world.
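The prefix/suffix split can be sketched in a few lines; the function name here is hypothetical, not any specific tool's API, and real tools also trim to a token budget and add context from other files:

```python
def build_inline_prompt(source: str, cursor: int) -> dict:
    """Split the editor buffer at the cursor position into the
    prefix (code before) and suffix (code after)."""
    return {
        "prefix": source[:cursor],   # code before the cursor
        "suffix": source[cursor:],   # code after the cursor
    }

code = "def add(a, b):\n    \nprint(add(1, 2))\n"
cursor = code.index("\nprint")  # cursor at the end of the blank body line
prompt = build_inline_prompt(code, cursor)
```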
Why Code Is Different
Code has properties that make it distinct from prose: strict syntax (a missing bracket breaks everything), semantic indentation (Python, YAML), cross-file references (imports, types), and execution semantics (the code must actually run). The model doesn’t “understand” any of this — it learns statistical patterns that happen to capture these structures.
Key insight: A code LLM doesn’t parse your code into an AST or run a compiler. It treats def calculate_total(items): as a sequence of tokens, just like it treats “The quick brown fox” as tokens. The magic is that it’s seen so much code that its predictions respect syntax anyway.
Tokenization: Breaking Code into Pieces
BPE and why function is one token but getElementById is four
Byte Pair Encoding (BPE)
Code LLMs use BPE tokenizers (originally a 1994 compression algorithm). BPE starts with individual bytes (256 base tokens) and iteratively merges the most frequent adjacent pairs. Common words like function or return become single tokens. Rare identifiers like calculateMonthlyRevenue get split into subwords: calculate + Monthly + Revenue.
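The merge loop can be shown with a toy example. A real tokenizer starts from bytes and learns tens of thousands of merges from a huge corpus; this sketch compresses one string with three:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the single most frequent adjacent pair into one token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # replace the pair with a new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters, just as byte-level BPE starts from bytes.
tokens = list("lowlowlower")
for _ in range(3):
    tokens = bpe_merge_step(tokens)
# Three merges compress 11 characters into 4 subword tokens.
```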
Code vs. Prose Efficiency
English text averages 1–1.5 tokens per word. Code averages 1.5–4 tokens per word because of special characters, camelCase splitting, indentation, and punctuation like {, =>, ::. This means code fills up context windows faster than English — a 128K-token window holds less code than you’d expect.
Example: How Code Gets Tokenized
// Source code:
const total = items.reduce((sum, item) => sum + item.price, 0);

// Approximate tokens (GPT-4 tokenizer):
["const", " total", " =", " items", ".", "reduce", "((", "sum", ",", " item",
 ")", " =>", " sum", " +", " item", ".", "price", ",", " 0", ");"]

// ~20 tokens for one line of JS
Why it matters: Token count directly affects cost (APIs charge per token) and context limits. Understanding tokenization helps you write prompts that fit more useful code into the model’s window.
Embeddings: Tokens Become Vectors
How the model gives meaning to each token
From Integer to Vector
Each token ID is mapped to a high-dimensional vector (typically 4,096–12,288 dimensions in modern code models). This embedding captures the token’s learned “meaning” — function and def end up near each other in vector space because they appear in similar contexts across training data.
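The lookup itself is just row indexing into one big learned matrix. A toy version with an 8-token vocabulary and 4 dimensions (real models use ~100K tokens and thousands of dimensions):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 8, 4  # toy sizes for illustration

# The embedding table is a single learned matrix: one row per token ID.
embedding_table = [[random.gauss(0, 0.02) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Map each integer token ID to its embedding vector (row lookup)."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([3, 1, 4])   # a 3-token sequence becomes a 3 x 4 matrix
```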
Positional Encoding
Order matters in code — a = b is different from b = a. Positional encodings (like RoPE, used in most modern models) are added to each embedding so the model knows where each token sits in the sequence, not just what it is.
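RoPE can be sketched as rotating consecutive pairs of embedding dimensions by an angle proportional to the token's position; this shows the core idea, minus the multi-head bookkeeping of real implementations:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of `vec` by a
    position-dependent angle, as in rotary positional embeddings."""
    out = list(vec)
    dim = len(vec)
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)   # higher dims rotate more slowly
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out

v = [1.0, 0.0, 1.0, 0.0]
assert rope(v, 0) == v        # position 0: no rotation
rotated = rope(v, 5)          # position 5: same vector, rotated
```

Because rotation preserves vector length, the same token gets a distinct but comparable representation at every position.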
What the Model “Sees”
At this stage, your code is a matrix of numbers — each row is one token’s embedding vector. A 500-token code snippet becomes a 500 × 4,096 matrix. The model has no concept of “Python” or “function” — just patterns of numbers that it will process through transformer layers.
Key insight: Embeddings are why code LLMs can work across languages. for in Python, for in JavaScript, and for in Go occupy similar regions in vector space because they appear in structurally similar patterns. The model learns language-agnostic coding concepts.
The Transformer: Where Reasoning Happens
Attention layers decide which tokens matter for the next prediction
Self-Attention for Code
The transformer’s self-attention mechanism lets every token “look at” every other token in the context. When predicting the next token after items.reduce((sum, item) => sum +, the model attends heavily to item, reduce, and the variable names — learning that this pattern typically ends with a property access like item.price.
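The core computation is scaled dot-product attention: each token's query is compared against every token's key, and the resulting weights mix the value vectors. A dependency-free sketch over tiny vectors:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numeric stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a tiny sequence.
    Each output row is a weighted mix of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)     # one weight per context token
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out = attention(q, k, v)   # each token attends mostly to itself here
```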
Layers of Understanding
Modern code models have 32–80+ transformer layers. Early layers capture syntax (bracket matching, indentation). Middle layers capture semantics (variable types, function signatures). Deep layers capture high-level patterns (design patterns, algorithm structure). Each layer refines the representation.
The Context Window
The context window is the maximum number of tokens the model can process at once. GPT-4o supports 128K tokens, Claude supports 200K tokens. This determines how much of your codebase the model can “see” in a single request. Anything outside the window simply doesn’t exist to the model.
The connection: Attention is quadratic in sequence length — doubling the context window quadruples the computation. This is why larger context windows are expensive and why tools carefully select which code to include in the prompt rather than sending everything.
Sampling: Choosing the Next Token
Temperature, top-p, and why AI code isn’t deterministic
The Probability Distribution
After passing through all transformer layers, the model outputs a probability score for every token in its vocabulary (typically 32K–128K tokens). For code completion, the top candidates might be: item.price (72%), item.cost (15%), item.amount (8%), with thousands of other tokens sharing the remaining 5%.
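The final layer produces raw scores (logits) that a softmax turns into that probability distribution; a toy vocabulary makes this concrete (the candidate tokens and scores here are invented for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for a tiny vocabulary of completion candidates.
vocab = ["item.price", "item.cost", "item.amount", "item.name"]
logits = [4.0, 2.5, 1.8, 0.5]
probs = softmax(logits)
ranked = sorted(zip(vocab, probs), key=lambda p: -p[1])  # best first
```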
Temperature
Temperature controls how “creative” the model is. At temperature 0 (greedy), it always picks the highest-probability token — deterministic and safe. At temperature 0.7–1.0, lower-probability tokens get a chance, producing more varied output. For code completion, tools typically use low temperature (0–0.2) because correctness matters more than creativity.
Top-P (Nucleus Sampling)
Top-p sampling only considers tokens whose cumulative probability reaches a threshold (e.g., p=0.95). This dynamically adjusts the candidate pool — when the model is confident, only 2–3 tokens qualify. When uncertain, dozens might. It’s more adaptive than a fixed top-k cutoff.
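Both knobs can be sketched together: temperature divides the logits before the softmax, and top-p keeps only the smallest set of tokens whose cumulative probability reaches the threshold. A minimal sampler, assuming logits are already computed:

```python
import math, random

def sample(logits, temperature=0.7, top_p=0.95, rng=random):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    if temperature == 0:                  # greedy: argmax, deterministic
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the highest-probability tokens until cumulative mass >= top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw one token index.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

logits = [4.0, 2.5, 1.8, 0.5]
greedy = sample(logits, temperature=0)   # always picks index 0
```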
One Token at a Time
The model generates one token per forward pass. To produce a 10-line function, it runs hundreds of sequential predictions, each time feeding its previous output back as input. This is why generation has noticeable latency — and why speculative decoding (predicting multiple tokens at once) is a major optimization.
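The loop looks like this in outline; `predict_next` stands in for the entire tokenize-embed-transform-sample pipeline (here it is a hard-coded stub keyed on the last token, purely for illustration):

```python
def predict_next(tokens):
    """Stub model: maps the last token to a likely follower.
    A real model runs a full forward pass here."""
    table = {"return": " sum", " sum": ";", ";": "<EOT>"}
    return table.get(tokens[-1], "<EOT>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):      # one forward pass per new token
        nxt = predict_next(tokens)
        if nxt == "<EOT>":               # model signals it is done
            break
        tokens.append(nxt)               # output is fed back as input
    return tokens

out = generate(["function", " add", "(", "a", ",", " b", ")", "{", "return"])
```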
Key insight: Code completion tools use near-zero temperature because a “creative” variable name or an “imaginative” API call is a bug. Chat and refactoring tasks use slightly higher temperature to explore alternative solutions.
Fill-in-the-Middle: The Code Superpower
How models generate code between existing lines
The Problem
Standard LLMs generate text left-to-right — they can only continue from the end. But coding often requires inserting code in the middle: adding a function body between its signature and the next function, filling in a parameter list, or completing a block inside existing logic.
How FIM Works
Fill-in-the-Middle (FIM) uses special tokens to split code into prefix, middle, and suffix. The model sees the code before your cursor (prefix) and after your cursor (suffix), then generates the missing middle. This is the core technique behind inline code completion.
FIM Token Format
// Your code in the editor:
function greet(name) {
  |   <-- cursor here
}
console.log(greet("Alice"));

// What the model receives:
<PRE>function greet(name) {\n  <SUF>\n}\nconsole.log(greet("Alice"));<MID>

// Model generates the middle:
return `Hello, ${name}!`;<EOT>
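Assembling that prompt is a simple string template; the sentinel token names vary by model family, so the generic <PRE>/<SUF>/<MID> markers here are illustrative:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix around FIM sentinel tokens so the
    model generates the missing middle after <MID>."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

prompt = build_fim_prompt(
    prefix='function greet(name) {\n  ',
    suffix='\n}\nconsole.log(greet("Alice"));',
)
```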
Key insight: FIM is what makes AI code completion feel magical. The model doesn’t just continue from your cursor — it generates code that fits seamlessly between what’s above and below. Models like CodeLlama, StarCoder2, and DeepSeek-Coder are all trained with FIM.
Context Assembly: What the Model Actually Sees
The prompt is everything — and it’s carefully constructed
The Prompt Budget
Every request has a token budget. The tool must decide what to include: current file, open tabs, imported files, type definitions, recent edits, conversation history. It’s a packing problem — fitting the most useful context into a fixed window. What gets left out is invisible to the model.
Context Sources
Modern tools pull context from multiple sources:

Current file — code around the cursor (highest priority)
Open tabs — files you’re actively working with
Codebase index — semantic search across the full project
Type definitions — interfaces, schemas, API contracts
Recent edits — what you just changed (intent signal)
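The packing step above can be sketched as a greedy fill: take sources in priority order, estimate their token cost, and skip what the budget cannot hold. The four-characters-per-token estimate is a common rough heuristic, not an exact count:

```python
def pack_context(sources, budget_tokens):
    """Greedily include context sources (ordered by priority)
    until the token budget is exhausted."""
    included, used = [], 0
    for name, text in sources:
        cost = len(text) // 4 + 1    # rough heuristic: ~4 chars per token
        if used + cost > budget_tokens:
            continue                 # doesn't fit; invisible to the model
        included.append(name)
        used += cost
    return included

sources = [
    ("current_file", "x" * 400),    # highest priority, ~101 tokens
    ("open_tab",     "y" * 200),    # ~51 tokens
    ("type_defs",    "z" * 2000),   # ~501 tokens, too big for this budget
]
chosen = pack_context(sources, budget_tokens=200)
```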
The “Lost in the Middle” Problem
Research shows LLMs attend most strongly to tokens at the beginning and end of the context window, with weaker attention to the middle. This means the order of context matters — the most important code should be placed at the start or end of the prompt, not buried in the middle.
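One mitigation is to reorder the assembled snippets so the highest-priority ones land at the edges of the prompt; a sketch of that interleaving, with hypothetical priority scores:

```python
def edge_order(snippets):
    """Place snippets ranked most important at the start and end
    of the prompt, pushing the least important toward the middle."""
    ranked = sorted(snippets, key=lambda s: s[1], reverse=True)
    front, back = [], []
    for i, (name, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(name)
    return front + back[::-1]     # back half reversed: best-of-back last

snippets = [("utils.py", 0.2), ("current_file", 0.9),
            ("types.ts", 0.7), ("readme", 0.1)]
order = edge_order(snippets)
# Highest-priority snippets sit at the edges; lowest sit in the middle.
```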
Critical insight: The quality of AI code suggestions is directly proportional to the quality of context assembly. A brilliant model with bad context produces bad code. A good model with great context produces great code. This is why “context engineering” (Ch 7) is the most important skill.
The Full Pipeline in Action
Putting it all together: keystroke to ghost text in milliseconds
End-to-End Flow
1. You type a keystroke
2. The tool extracts prefix + suffix around your cursor
3. It assembles context from open files, codebase index, and type definitions
4. The prompt is tokenized via BPE into integer sequences
5. Tokens are embedded into vectors with positional encoding
6. The transformer processes all tokens through 32–80+ attention layers
7. Output probabilities are sampled at low temperature
8. Generated tokens are decoded back to text and displayed as ghost text
Speed Matters
This entire pipeline must complete in under 200ms to feel responsive. Techniques like speculative decoding (predicting multiple tokens using a smaller draft model), KV-cache reuse (not recomputing unchanged prefix tokens), and quantization (running in 4-bit or 8-bit precision) make this possible.
What the Model Doesn’t Do
The model never executes your code, never checks types, never runs tests, and never verifies imports exist. It predicts the most statistically likely next tokens based on patterns in its training data. When it suggests import pandas as pd, it’s not checking if pandas is installed — it’s predicting what usually follows the patterns it’s seen.
Rule of thumb: Think of a code LLM as the world’s best autocomplete, not a compiler or interpreter. It predicts what code probably comes next based on patterns. It’s your job to verify that the prediction is actually correct.