Ch 5 — Anatomy of AI Code Completion

What happens between your keystroke and the ghost text — in 200 milliseconds
High Level
Keystroke → Debounce → Context → Inference → Filter → Ghost Text
The Trigger: You Type a Character
Every keystroke starts a race against your next thought
The 200ms Budget
The moment you press a key, the AI coding tool has roughly 200 milliseconds to produce a useful suggestion before it feels laggy. That budget covers: detecting the keystroke, gathering context, sending to the model, generating tokens, filtering the result, and rendering ghost text. Every stage is optimized for speed.
Debounce: Not Every Keystroke Triggers
If the tool sent a request on every keystroke, it would overwhelm the model with wasted requests while you’re mid-word. Instead, a debounce timer (~75ms) waits for you to pause typing. If you type another character within that window, the timer resets. Only when you pause does the completion pipeline fire.
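The debounce logic can be modeled deterministically: given a list of keystroke timestamps, work out when the completion pipeline would actually fire. This is a sketch, not any tool's real implementation; the 75 ms window is the figure cited above, and real tools tune it.

```typescript
// Given keystroke timestamps (in ms), return the times at which the
// debounce timer expires and the completion pipeline fires.
function completionFireTimes(keystrokes: number[], windowMs = 75): number[] {
  const fires: number[] = [];
  for (let i = 0; i < keystrokes.length; i++) {
    const next = keystrokes[i + 1];
    // The timer only expires if no further keystroke lands inside the window;
    // otherwise it resets and no request is sent.
    if (next === undefined || next - keystrokes[i] >= windowMs) {
      fires.push(keystrokes[i] + windowMs);
    }
  }
  return fires;
}

// A fast burst of typing (keystrokes 40 ms apart) fires exactly once,
// 75 ms after the last key; a pause mid-word fires twice.
const burst = completionFireTimes([0, 40, 80, 120, 160]);
```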
When Completions Don’t Fire
Smart tools suppress completions in situations where they’d be annoying:

Inside strings — you’re typing a message, not code
Inside comments — you’re explaining, not implementing
Deleting code — you’re removing, not adding
Moving cursor — you’re navigating, not editing
Low confidence — the model isn’t sure enough to show anything
Key insight: Knowing when not to show a suggestion is as important as the suggestion itself. A tool that shows bad completions constantly trains you to ignore it. The best tools stay silent when uncertain — building trust through restraint.
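These suppression rules amount to a small gate in front of the pipeline. Here is a hypothetical sketch; the editor-state shape (`inString`, `inComment`, `lastAction`) is an assumption for illustration, not any particular tool's API.

```typescript
// Assumed editor-state shape for this sketch.
type EditorState = {
  inString: boolean;            // cursor is inside a string literal
  inComment: boolean;           // cursor is inside a comment
  lastAction: "insert" | "delete" | "cursorMove";
  modelConfidence: number;      // 0..1, model's own confidence estimate
};

function shouldAttemptCompletion(s: EditorState, minConfidence = 0.25): boolean {
  if (s.inString || s.inComment) return false;  // typing prose, not code
  if (s.lastAction !== "insert") return false;  // deleting or navigating
  return s.modelConfidence >= minConfidence;    // stay silent when unsure
}
```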
Context Retrieval: Gathering the Right Code
The packing problem that determines suggestion quality
What Gets Collected
The tool scans multiple sources to build context:

Current file — code around your cursor (prefix + suffix)
Open tabs — files you’re actively working with
Import graph — files referenced by imports in the current file
Same directory — sibling files likely related
Recent edits — files you changed recently (intent signal)
Configuration — rules files, instructions, type definitions
Similarity Scoring
Files are broken into 60-line sliding windows. Each window is scored for relevance to the code around your cursor using Jaccard similarity — a fast token-overlap algorithm. Only the highest-scoring window per file survives. This ensures the most relevant snippets fill the limited token budget.
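A minimal sketch of this scoring, assuming identifier-level tokenization and a one-line stride for the sliding window (real tools choose their own tokenizer and stride):

```typescript
// Tokenize a snippet into a set of identifiers (sets, because Jaccard
// similarity compares token sets, not sequences).
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z_][a-z0-9_]*/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Slide a window over the file, score each window against the code around
// the cursor, and keep only the best-scoring window for this file.
function bestWindow(fileLines: string[], cursorContext: string, windowSize = 60) {
  const query = tokens(cursorContext);
  let best = { score: 0, start: 0 };
  for (let start = 0; start < fileLines.length; start++) {
    const win = fileLines.slice(start, start + windowSize).join("\n");
    const score = jaccard(tokens(win), query);
    if (score > best.score) best = { score, start };
  }
  return best;
}
```

Jaccard is used here because it needs no model inference: a token-set intersection is cheap enough to run over every candidate window within the latency budget.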
The Prompt Assembly
// Prompt structure (simplified):

[System instructions]
[Rules file content]
[Ranked snippets from other files]
[Import definitions / types]
[Constants and config]

<PRE>  // code before cursor
...current file prefix...
<SUF>  // code after cursor
...current file suffix...
<MID>  // model generates here

// Token budget monitored continuously
// Lower-priority content trimmed first
Key insight: The prompt assembler is the unsung hero of code completion. It decides what the model sees — and what it doesn’t. A great assembler with a mediocre model often beats a great model with a naive assembler. This is why context engineering (Ch 7) matters so much.
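One way the budget-aware trimming can work, sketched with a whitespace token count standing in for a real tokenizer (the section shapes and priorities here are illustrative assumptions):

```typescript
// A prompt section with a priority: higher priority survives trimming longer.
type Section = { name: string; text: string; priority: number };

function assemblePrompt(sections: Section[], budgetTokens: number): string {
  // Crude stand-in for a real tokenizer: count whitespace-separated chunks.
  const countTokens = (t: string) => t.split(/\s+/).filter(Boolean).length;

  // Greedily admit sections from highest to lowest priority.
  const byPriority = [...sections].sort((a, b) => b.priority - a.priority);
  let used = 0;
  const included: Section[] = [];
  for (const s of byPriority) {
    const cost = countTokens(s.text);
    if (used + cost <= budgetTokens) {
      included.push(s);
      used += cost;
    }
    // else: trimmed — lower-priority content is sacrificed first
  }

  // Restore the original document order for the surviving sections.
  included.sort((a, b) => sections.indexOf(a) - sections.indexOf(b));
  return included.map((s) => s.text).join("\n");
}
```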
Inference: The Model Generates Tokens
Speculative decoding and the tricks that make it fast
The Speed Problem
A large model generating one token at a time is too slow for inline completions. A 70B-parameter model might produce 30–50 tokens/second normally. But code completion needs to feel instant — you need a multi-line suggestion in under 200ms. Three techniques solve this.
Speculative Decoding
A small, fast draft model generates several candidate tokens quickly. The large model then verifies them in parallel (a single forward pass can check multiple tokens at once). If the draft tokens match what the large model would have produced, they’re accepted instantly. This pushes effective speed to ~1,000 tokens/second for code, because most edited code is predictable.
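The accept/reject rule can be illustrated with a toy greedy version: accept draft tokens as long as they match what the target model would have produced, and on the first mismatch take the target's token and stop. (Production systems verify all positions in one batched forward pass and use a probabilistic acceptance rule for sampling; this sketch only shows the control flow.)

```typescript
// One speculative decoding step. `targetNext` simulates the large model:
// given a token prefix, it returns the target's greedy next token.
function speculativeStep(
  prefix: string[],
  draftTokens: string[],
  targetNext: (ctx: string[]) => string,
): string[] {
  const accepted: string[] = [];
  for (const t of draftTokens) {
    const want = targetNext([...prefix, ...accepted]);
    if (t === want) {
      accepted.push(t);       // draft matched: accepted "for free"
    } else {
      accepted.push(want);    // first mismatch: take the target's token, stop
      return accepted;
    }
  }
  return accepted;            // every draft token verified
}
```

Note that even a total mismatch still yields one correct token per step, so speculative decoding never produces worse output than the target model alone, only faster.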
KV-Cache Reuse
When you accept a suggestion and keep typing, most of the prompt hasn’t changed — only the last few tokens are new. KV-cache reuse stores the intermediate computations from the previous request, so the model only processes the new tokens instead of re-reading the entire context. This dramatically reduces latency for sequential completions.
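The bookkeeping behind this is a longest-common-prefix check: find how many leading tokens of the new prompt match the previous one, and only compute fresh KV entries for the tail. A sketch, with token arrays standing in for real tokenizer output:

```typescript
// Return how many leading tokens of the new prompt match the previous
// prompt. KV-cache entries for those tokens can be reused as-is; only
// the remaining tail needs a fresh forward pass.
function reusableTokens(prevPrompt: string[], newPrompt: string[]): number {
  let i = 0;
  while (
    i < prevPrompt.length &&
    i < newPrompt.length &&
    prevPrompt[i] === newPrompt[i]
  ) {
    i++;
  }
  return i;
}
```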
Quantization
Running models in 4-bit or 8-bit precision (instead of 16-bit) halves or quarters the memory footprint and speeds up computation. For code completion, the quality loss from quantization is minimal — the model still predicts the right tokens, just with slightly less numerical precision in its internal calculations.
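A minimal symmetric int8 quantization sketch shows why the loss is small: floats map to 8-bit integers through a single scale factor, and the round-trip error stays within half a quantization step. (Real inference stacks quantize per-channel or per-group and often keep activations in higher precision; this is the simplest possible version.)

```typescript
// Symmetric int8 quantization: one scale for the whole weight vector.
function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
  const maxAbs = Math.max(...weights.map(Math.abs), 1e-12);
  const scale = maxAbs / 127;                         // map [-maxAbs, maxAbs] to [-127, 127]
  const q = new Int8Array(weights.map((w) => Math.round(w / scale)));
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale);
}
```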
Key insight: Speculative decoding works especially well for code because most of an edited file stays the same. The draft model can “copy” existing code as draft tokens, and the large model verifies them in bulk. It’s like proofreading vs. writing from scratch — verification is much faster than generation.
Filtering: Should This Suggestion Be Shown?
The gatekeeper that decides between showing and staying silent
The Show/Suppress Decision
Not every generated completion should be displayed. A quality gate evaluates each suggestion before rendering it. The model estimates the probability that you’ll accept the suggestion. If the estimated acceptance probability is below a threshold (typically ~25%), the suggestion is suppressed entirely. Nothing is shown.
What Gets Filtered Out
Low-confidence completions — model isn’t sure
Repetitive suggestions — same thing you just rejected
Trivially short — completing a single character isn’t helpful
Syntax-breaking — suggestion would create invalid code
Security-flagged — known vulnerable patterns detected
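The checks above combine into a single show/suppress gate. A sketch, where the suggestion shape is an assumption and the 25% threshold follows the text; a real tool would run a parser and a security scanner where this version uses precomputed flags:

```typescript
type Suggestion = {
  text: string;
  acceptProbability: number;   // model's estimate that you'll press Tab
  recentlyRejected: boolean;   // same suggestion was just dismissed
  breaksSyntax: boolean;       // would create invalid code (parser check)
  securityFlagged: boolean;    // matches a known vulnerable pattern
};

function shouldShow(s: Suggestion, threshold = 0.25): boolean {
  if (s.acceptProbability < threshold) return false;  // low confidence
  if (s.recentlyRejected) return false;               // don't repeat a rejection
  if (s.text.trim().length <= 1) return false;        // trivially short
  if (s.breaksSyntax || s.securityFlagged) return false;
  return true;
}
```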
The Integrated Approach
Modern systems integrate the show/suppress decision into the completion model itself rather than using a separate filter. The model learns both what to suggest and when to stay silent as a single policy. This produces better results than a two-stage approach because the model can factor in context quality, cursor position, and editing patterns holistically.
Why it matters: A tool that shows fewer but better suggestions builds more trust than one that floods you with mediocre completions. The filtering stage is why some tools “feel” smarter even when using similar underlying models — they’re better at knowing when to shut up.
Ghost Text: Rendering the Suggestion
The UX of showing code that doesn’t exist yet
Ghost Text Display
The suggestion appears as dimmed, inline text at your cursor position — visible but clearly not yet part of your code. You accept with Tab, reject by continuing to type, or partially accept (word-by-word) with a modifier key. The ghost text updates in real-time as new tokens stream in from the model.
Multi-Line vs. Single-Line
Single-line completions finish the current line. Multi-line completions can suggest entire function bodies, if/else blocks, or loop implementations. The tool decides which mode to use from context: after a function signature, expect multi-line; after const x =, expect single-line.
Partial Accept
Sometimes a suggestion is 80% right. Partial accept lets you take the first word, first line, or first block without accepting everything. This is a crucial UX feature — it means you can use AI suggestions as a starting point rather than an all-or-nothing decision.
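Mechanically, partial accept just carves a prefix off the suggestion at a word, line, or block boundary. A sketch (the "block" rule here, everything up to the first blank line, is one plausible convention among several):

```typescript
function partialAccept(
  suggestion: string,
  unit: "word" | "line" | "block",
): string {
  if (unit === "word") {
    const m = suggestion.match(/^\s*\S+/);  // first word, keeping leading indent
    return m ? m[0] : "";
  }
  if (unit === "line") {
    return suggestion.split("\n")[0];       // first line only
  }
  // "block": up to the first blank line, or the whole suggestion if none.
  const idx = suggestion.indexOf("\n\n");
  return idx === -1 ? suggestion : suggestion.slice(0, idx);
}
```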
Streaming vs. Batch
Some tools show suggestions as tokens stream in (you see it building character by character). Others wait for the full suggestion before displaying (batch mode). Streaming feels faster but can be distracting. Batch feels more polished but adds perceived latency. Most tools now use a hybrid: stream internally, display when the first meaningful chunk is ready.
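The hybrid policy reduces to a buffering decision: hold streamed tokens internally and reveal nothing until a "meaningful chunk" has arrived. What counts as meaningful is a product choice; this sketch assumes a complete first line or a minimum character count:

```typescript
// Return the chunk to display, or null to keep buffering silently.
function firstDisplayableChunk(streamed: string, minChars = 12): string | null {
  const newline = streamed.indexOf("\n");
  if (newline !== -1) return streamed.slice(0, newline); // full first line arrived
  if (streamed.length >= minChars) return streamed;      // enough to be useful
  return null;                                           // too early; show nothing
}
```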
The connection: Ghost text UX directly affects acceptance rates. If suggestions appear too slowly, you’ve already typed past them. If they’re too aggressive, they interrupt your flow. The best tools feel like they’re reading your mind — appearing exactly when you pause to think.
The Feedback Loop: Learning from You
Every Tab press and every ignore trains the next model
Accept/Reject as Training Signal
Every time you accept or ignore a suggestion, that signal is logged. At scale (hundreds of millions of completions per day), this creates a massive dataset of what developers actually want. The reward structure is explicit: +0.75 for accepted, −0.25 for rejected, 0 for suppressed.
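That reward structure, written out as code, with a batch average of the kind an online RL loop can optimize:

```typescript
type Outcome = "accepted" | "rejected" | "suppressed";

// The reward values stated above: +0.75 accepted, -0.25 rejected, 0 suppressed.
function reward(outcome: Outcome): number {
  switch (outcome) {
    case "accepted":
      return 0.75;
    case "rejected":
      return -0.25;
    case "suppressed":
      return 0;
  }
}

function averageReward(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.reduce((sum, o) => sum + reward(o), 0) / outcomes.length;
}
```

Note the asymmetry: suppressing scores better than showing a suggestion that gets rejected, which is exactly the incentive that teaches the model restraint.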
Online Reinforcement Learning
Some tools use online RL to continuously improve their completion model. New checkpoints are deployed, interaction data is collected for 1.5–2 hours, and the model is updated. This rapid cycle keeps training data aligned with the current model’s behavior — a much tighter loop than traditional offline training.
Measurable Results
Online RL applied to code completion has produced striking results: 21% fewer suggestions shown (less noise) while acceptance rates improved 28% (better quality). The model learned to be more selective — showing fewer but better completions. Fewer interruptions, higher hit rate.
Key insight: Your accept/reject behavior is literally training the next version of the model. This means the tool gets better the more you use it — not just for you, but for everyone. It also means that blindly accepting bad suggestions teaches the model to produce more bad suggestions.
Completion vs. Chat vs. Agent
Three modes of AI assistance — and when to use each
Inline Completion
Trigger: Automatic, on keystroke pause
Latency: <200ms
Scope: Current line or block
Best for: Finishing the line you’re already writing, boilerplate, repetitive patterns
Model: Small, fast, specialized for code
Chat / Inline Edit
Trigger: Manual (Cmd+K, chat panel)
Latency: 1–5 seconds
Scope: Selected code or described task
Best for: Explaining code, refactoring a function, generating from a description
Model: Large, capable, general-purpose
Agent Mode
Trigger: Manual (Composer, terminal command)
Latency: 10 seconds to minutes
Scope: Entire codebase, multiple files
Best for: Feature implementation, multi-file refactoring, complex debugging
Model: Largest available, with tool use
Critical skill: Using the wrong mode wastes time. Don’t open an agent session to rename a variable (use completion). Don’t rely on inline completion to implement a new feature across 5 files (use an agent). Matching mode to task is a core skill covered in Ch 8–10.
Making Completions Work Better for You
Practical tips grounded in how the system actually works
Write Good Comments First
A comment like // Sort users by last login, most recent first before an empty function body gives the model a strong intent signal. The FIM mechanism sees your comment as prefix context, dramatically improving the quality of the generated function body. Comments are prompts.
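Concretely, the comment ends up in the prefix slot of the fill-in-the-middle prompt. A sketch using the <PRE>/<SUF>/<MID> sentinel names from the prompt structure shown earlier (real models use their own special tokens, and the function signature here is an invented example):

```typescript
// Build a fill-in-the-middle prompt: the model generates at <MID>,
// between the intent-bearing comment and the closing brace.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<PRE>${prefix}<SUF>${suffix}<MID>`;
}

const prefix = [
  "// Sort users by last login, most recent first",
  "function sortUsers(users: User[]): User[] {",
  "",
].join("\n");
const suffix = "\n}";
const prompt = buildFimPrompt(prefix, suffix);
```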
Keep Related Files Open
Open tabs are a primary context source. If you’re implementing a function that uses types from types.ts, open that file in a tab. The completion tool will include those type definitions in the prompt, producing suggestions that use the correct property names and types.
Write the Signature, Let AI Fill the Body
Type the function signature with parameter names and return type, then let the cursor sit inside the empty body. The model now has both prefix (signature) and suffix (closing brace) — the ideal FIM scenario. This consistently produces better completions than typing the first line and hoping.
Reject Deliberately
Don’t accept suggestions you don’t understand. Every acceptance is a training signal. If you accept buggy code, the model learns that pattern is desirable. Rejecting bad suggestions (by continuing to type) teaches the model what you actually want.
Key insight: Understanding the completion pipeline turns you from a passive recipient into an active collaborator. You’re not waiting for magic — you’re shaping the context, timing, and feedback that determine suggestion quality. The tool works with you, not for you.