Ch 5 — Anatomy of AI Code Completion

What happens between your keystroke and the ghost text — in 200 milliseconds
High Level
Keystroke → Debounce → Context → Inference → Filter → Ghost Text
The Trigger: You Type a Character
Every keystroke starts a race against your next thought
The 200ms Budget
The moment you press a key, the AI coding tool has roughly 200 milliseconds to produce a useful suggestion before it feels laggy. That budget covers: detecting the keystroke, gathering context, sending to the model, generating tokens, filtering the result, and rendering ghost text. Every stage is optimized for speed.
Debounce: Not Every Keystroke Triggers
If the tool sent a request on every keystroke, it would overwhelm the model with wasted requests while you’re mid-word. Instead, a debounce timer (~75ms) waits for you to pause typing. If you type another character within that window, the timer resets. Only when you pause does the completion pipeline fire.
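The debounce logic can be modeled deterministically: given a list of keystroke timestamps, work out when the completion pipeline would actually fire. This is a sketch, not any tool's real implementation; the 75 ms window is the figure cited above, and real tools tune it.

```typescript
// Given keystroke timestamps (in ms), return the times at which the
// debounce timer expires and the completion pipeline fires.
function completionFireTimes(keystrokes: number[], windowMs = 75): number[] {
  const fires: number[] = [];
  for (let i = 0; i < keystrokes.length; i++) {
    const next = keystrokes[i + 1];
    // The timer only expires if no further keystroke lands inside the window;
    // otherwise it resets and no request is sent.
    if (next === undefined || next - keystrokes[i] >= windowMs) {
      fires.push(keystrokes[i] + windowMs);
    }
  }
  return fires;
}

// A fast burst of typing (keystrokes 40 ms apart) fires exactly once,
// 75 ms after the last key; a pause mid-word fires twice.
const burst = completionFireTimes([0, 40, 80, 120, 160]);
```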
When Completions Don’t Fire
Smart tools suppress completions in situations where they’d be annoying:

Inside strings — you’re typing a message, not code
Inside comments — you’re explaining, not implementing
Deleting code — you’re removing, not adding
Moving cursor — you’re navigating, not editing
Low confidence — the model isn’t sure enough to show anything
Key insight: Knowing when not to show a suggestion is as important as the suggestion itself. A tool that shows bad completions constantly trains you to ignore it. The best tools stay silent when uncertain — building trust through restraint.
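These suppression rules amount to a small gate in front of the pipeline. Here is a hypothetical sketch; the editor-state shape (`inString`, `inComment`, `lastAction`) is an assumption for illustration, not any particular tool's API.

```typescript
// Assumed editor-state shape for this sketch.
type EditorState = {
  inString: boolean;            // cursor is inside a string literal
  inComment: boolean;           // cursor is inside a comment
  lastAction: "insert" | "delete" | "cursorMove";
  modelConfidence: number;      // 0..1, model's own confidence estimate
};

function shouldAttemptCompletion(s: EditorState, minConfidence = 0.25): boolean {
  if (s.inString || s.inComment) return false;  // typing prose, not code
  if (s.lastAction !== "insert") return false;  // deleting or navigating
  return s.modelConfidence >= minConfidence;    // stay silent when unsure
}
```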
Context Retrieval: Gathering the Right Code
The packing problem that determines suggestion quality
What Gets Collected
The tool scans multiple sources to build context:

Current file — code around your cursor (prefix + suffix)
Open tabs — files you’re actively working with
Import graph — files referenced by imports in the current file
Same directory — sibling files likely related
Recent edits — files you changed recently (intent signal)
Configuration — rules files, instructions, type definitions
Similarity Scoring
Files are broken into 60-line sliding windows. Each window is scored for relevance to the code around your cursor using Jaccard similarity — a fast token-overlap algorithm. Only the highest-scoring window per file survives. This ensures the most relevant snippets fill the limited token budget.
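A minimal sketch of this scoring, assuming identifier-level tokenization and a one-line stride for the sliding window (real tools choose their own tokenizer and stride):

```typescript
// Tokenize a snippet into a set of identifiers (sets, because Jaccard
// similarity compares token sets, not sequences).
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z_][a-z0-9_]*/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Slide a window over the file, score each window against the code around
// the cursor, and keep only the best-scoring window for this file.
function bestWindow(fileLines: string[], cursorContext: string, windowSize = 60) {
  const query = tokens(cursorContext);
  let best = { score: 0, start: 0 };
  for (let start = 0; start < fileLines.length; start++) {
    const win = fileLines.slice(start, start + windowSize).join("\n");
    const score = jaccard(tokens(win), query);
    if (score > best.score) best = { score, start };
  }
  return best;
}
```

Jaccard is used here because it needs no model inference: a token-set intersection is cheap enough to run over every candidate window within the latency budget.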
The Prompt Assembly
// Prompt structure (simplified):

[System instructions]
[Rules file content]
[Ranked snippets from other files]
[Import definitions / types]
[Constants and config]

<PRE>  // code before cursor
...current file prefix...
<SUF>  // code after cursor
...current file suffix...
<MID>  // model generates here

// Token budget monitored continuously
// Lower-priority content trimmed first
Key insight: The prompt assembler is the unsung hero of code completion. It decides what the model sees — and what it doesn’t. A great assembler with a mediocre model often beats a great model with a naive assembler. This is why context engineering (Ch 7) matters so much.
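One way the budget-aware trimming can work, sketched with a whitespace token count standing in for a real tokenizer (the section shapes and priorities here are illustrative assumptions):

```typescript
// A prompt section with a priority: higher priority survives trimming longer.
type Section = { name: string; text: string; priority: number };

function assemblePrompt(sections: Section[], budgetTokens: number): string {
  // Crude stand-in for a real tokenizer: count whitespace-separated chunks.
  const countTokens = (t: string) => t.split(/\s+/).filter(Boolean).length;

  // Greedily admit sections from highest to lowest priority.
  const byPriority = [...sections].sort((a, b) => b.priority - a.priority);
  let used = 0;
  const included: Section[] = [];
  for (const s of byPriority) {
    const cost = countTokens(s.text);
    if (used + cost <= budgetTokens) {
      included.push(s);
      used += cost;
    }
    // else: trimmed — lower-priority content is sacrificed first
  }

  // Restore the original document order for the surviving sections.
  included.sort((a, b) => sections.indexOf(a) - sections.indexOf(b));
  return included.map((s) => s.text).join("\n");
}
```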
Inference: The Model Generates Tokens
Speculative decoding and the tricks that make it fast
The Speed Problem
A large model generating one token at a time is too slow for inline completions. A 70B-parameter model might produce 30–50 tokens/second normally. But code completion needs to feel instant — you need a multi-line suggestion in under 200ms. Three techniques solve this.
Speculative Decoding
A small, fast draft model generates several candidate tokens quickly. The large model then verifies them in parallel (a single forward pass can check multiple tokens at once). If the draft tokens match what the large model would have produced, they’re accepted instantly. This pushes effective speed to ~1,000 tokens/second for code, because most edited code is predictable.
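The accept/reject rule can be illustrated with a toy greedy version: accept draft tokens as long as they match what the target model would have produced, and on the first mismatch take the target's token and stop. (Production systems verify all positions in one batched forward pass and use a probabilistic acceptance rule for sampling; this sketch only shows the control flow.)

```typescript
// One speculative decoding step. `targetNext` simulates the large model:
// given a token prefix, it returns the target's greedy next token.
function speculativeStep(
  prefix: string[],
  draftTokens: string[],
  targetNext: (ctx: string[]) => string,
): string[] {
  const accepted: string[] = [];
  for (const t of draftTokens) {
    const want = targetNext([...prefix, ...accepted]);
    if (t === want) {
      accepted.push(t);       // draft matched: accepted "for free"
    } else {
      accepted.push(want);    // first mismatch: take the target's token, stop
      return accepted;
    }
  }
  return accepted;            // every draft token verified
}
```

Note that even a total mismatch still yields one correct token per step, so speculative decoding never produces worse output than the target model alone, only faster.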
KV-Cache Reuse
When you accept a suggestion and keep typing, most of the prompt hasn’t changed — only the last few tokens are new. KV-cache reuse stores the intermediate computations from the previous request, so the model only processes the new tokens instead of re-reading the entire context. This dramatically reduces latency for sequential completions.
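The bookkeeping behind this is a longest-common-prefix check: find how many leading tokens of the new prompt match the previous one, and only compute fresh KV entries for the tail. A sketch, with token arrays standing in for real tokenizer output:

```typescript
// Return how many leading tokens of the new prompt match the previous
// prompt. KV-cache entries for those tokens can be reused as-is; only
// the remaining tail needs a fresh forward pass.
function reusableTokens(prevPrompt: string[], newPrompt: string[]): number {
  let i = 0;
  while (
    i < prevPrompt.length &&
    i < newPrompt.length &&
    prevPrompt[i] === newPrompt[i]
  ) {
    i++;
  }
  return i;
}
```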
Quantization
Running models in 4-bit or 8-bit precision (instead of 16-bit) halves or quarters the memory footprint and speeds up computation. For code completion, the quality loss from quantization is minimal — the model still predicts the right tokens, just with slightly less numerical precision in its internal calculations.
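A minimal symmetric int8 quantization sketch shows why the loss is small: floats map to 8-bit integers through a single scale factor, and the round-trip error stays within half a quantization step. (Real inference stacks quantize per-channel or per-group and often keep activations in higher precision; this is the simplest possible version.)

```typescript
// Symmetric int8 quantization: one scale for the whole weight vector.
function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
  const maxAbs = Math.max(...weights.map(Math.abs), 1e-12);
  const scale = maxAbs / 127;                         // map [-maxAbs, maxAbs] to [-127, 127]
  const q = new Int8Array(weights.map((w) => Math.round(w / scale)));
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale);
}
```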
Key insight: Speculative decoding works especially well for code because most of an edited file stays the same. The draft model can “copy” existing code as draft tokens, and the large model verifies them in bulk. It’s like proofreading vs. writing from scratch — verification is much faster than generation.
Filtering: Should This Suggestion Be Shown?
The gatekeeper that decides between showing and staying silent
The Show/Suppress Decision
Not every generated completion should be displayed. A quality gate evaluates each suggestion before rendering it. The model estimates the probability that you’ll accept the suggestion. If the estimated acceptance probability is below a threshold (typically ~25%), the suggestion is suppressed entirely. Nothing is shown.
What Gets Filtered Out
Low-confidence completions — model isn’t sure
Repetitive suggestions — same thing you just rejected
Trivially short — completing a single character isn’t helpful
Syntax-breaking — suggestion would create invalid code
Security-flagged — known vulnerable patterns detected
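The checks above combine into a single show/suppress gate. A sketch, where the suggestion shape is an assumption and the 25% threshold follows the text; a real tool would run a parser and a security scanner where this version uses precomputed flags:

```typescript
type Suggestion = {
  text: string;
  acceptProbability: number;   // model's estimate that you'll press Tab
  recentlyRejected: boolean;   // same suggestion was just dismissed
  breaksSyntax: boolean;       // would create invalid code (parser check)
  securityFlagged: boolean;    // matches a known vulnerable pattern
};

function shouldShow(s: Suggestion, threshold = 0.25): boolean {
  if (s.acceptProbability < threshold) return false;  // low confidence
  if (s.recentlyRejected) return false;               // don't repeat a rejection
  if (s.text.trim().length <= 1) return false;        // trivially short
  if (s.breaksSyntax || s.securityFlagged) return false;
  return true;
}
```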
The Integrated Approach
Modern systems integrate the show/suppress decision into the completion model itself rather than using a separate filter. The model learns both what to suggest and when to stay silent as a single policy. This produces better results than a two-stage approach because the model can factor in context quality, cursor position, and editing patterns holistically.
Why it matters: A tool that shows fewer but better suggestions builds more trust than one that floods you with mediocre completions. The filtering stage is why some tools “feel” smarter even when using similar underlying models — they’re better at knowing when to shut up.
Ghost Text: Rendering the Suggestion
The UX of showing code that doesn’t exist yet
Ghost Text Display
The suggestion appears as dimmed, inline text at your cursor position — visible but clearly not yet part of your code. You accept with Tab, reject by continuing to type, or partially accept (word-by-word) with a modifier key. The ghost text updates in real-time as new tokens stream in from the model.
Multi-Line vs. Single-Line
Single-line completions finish the current line. Multi-line completions can suggest entire function bodies, if/else blocks, or loop implementations. The tool decides which mode to use from context: after a function signature, expect multi-line; after const x =, expect single-line.
Partial Accept
Sometimes a suggestion is 80% right. Partial accept lets you take the first word, first line, or first block without accepting everything. This is a crucial UX feature — it means you can use AI suggestions as a starting point rather than an all-or-nothing decision.
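Mechanically, partial accept just carves a prefix off the suggestion at a word, line, or block boundary. A sketch (the "block" rule here, everything up to the first blank line, is one plausible convention among several):

```typescript
function partialAccept(
  suggestion: string,
  unit: "word" | "line" | "block",
): string {
  if (unit === "word") {
    const m = suggestion.match(/^\s*\S+/);  // first word, keeping leading indent
    return m ? m[0] : "";
  }
  if (unit === "line") {
    return suggestion.split("\n")[0];       // first line only
  }
  // "block": up to the first blank line, or the whole suggestion if none.
  const idx = suggestion.indexOf("\n\n");
  return idx === -1 ? suggestion : suggestion.slice(0, idx);
}
```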
Streaming vs. Batch
Some tools show suggestions as tokens stream in (you see it building character by character). Others wait for the full suggestion before displaying (batch mode). Streaming feels faster but can be distracting. Batch feels more polished but adds perceived latency. Most tools now use a hybrid: stream internally, display when the first meaningful chunk is ready.
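The hybrid policy reduces to a buffering decision: hold streamed tokens internally and reveal nothing until a "meaningful chunk" has arrived. What counts as meaningful is a product choice; this sketch assumes a complete first line or a minimum character count:

```typescript
// Return the chunk to display, or null to keep buffering silently.
function firstDisplayableChunk(streamed: string, minChars = 12): string | null {
  const newline = streamed.indexOf("\n");
  if (newline !== -1) return streamed.slice(0, newline); // full first line arrived
  if (streamed.length >= minChars) return streamed;      // enough to be useful
  return null;                                           // too early; show nothing
}
```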
The connection: Ghost text UX directly affects acceptance rates. If suggestions appear too slowly, you’ve already typed past them. If they’re too aggressive, they interrupt your flow. The best tools feel like they’re reading your mind — appearing exactly when you pause to think.
The Feedback Loop: Learning from You
Every Tab press and every ignore trains the next model
Accept/Reject as Training Signal
Every time you accept or ignore a suggestion, that signal is logged. At scale (hundreds of millions of completions per day), this creates a massive dataset of what developers actually want. The reward structure is explicit: +0.75 for accepted, −0.25 for rejected, 0 for suppressed.
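That reward structure, written out as code, with a batch average of the kind an online RL loop can optimize:

```typescript
type Outcome = "accepted" | "rejected" | "suppressed";

// The reward values stated above: +0.75 accepted, -0.25 rejected, 0 suppressed.
function reward(outcome: Outcome): number {
  switch (outcome) {
    case "accepted":
      return 0.75;
    case "rejected":
      return -0.25;
    case "suppressed":
      return 0;
  }
}

function averageReward(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.reduce((sum, o) => sum + reward(o), 0) / outcomes.length;
}
```

Note the asymmetry: suppressing scores better than showing a suggestion that gets rejected, which is exactly the incentive that teaches the model restraint.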
Online Reinforcement Learning
Some tools use online RL to continuously improve their completion model. New checkpoints are deployed, interaction data is collected for 1.5–2 hours, and the model is updated. This rapid cycle keeps training data aligned with the current model’s behavior — a much tighter loop than traditional offline training.
Measurable Results
Online RL applied to code completion has produced striking results: 21% fewer suggestions shown (less noise) while acceptance rates improved 28% (better quality). The model learned to be more selective — showing fewer but better completions. Fewer interruptions, higher hit rate.
Key insight: Your accept/reject behavior is literally training the next version of the model. This means the tool gets better the more you use it — not just for you, but for everyone. It also means that blindly accepting bad suggestions teaches the model to produce more bad suggestions.
Completion vs. Chat vs. Agent
Three modes of AI assistance — and when to use each
Inline Completion
Trigger: Automatic, on keystroke pause
Latency: <200ms
Scope: Current line or block
Best for: Finishing the line you’re already writing, boilerplate, repetitive patterns
Model: Small, fast, specialized for code
Chat / Inline Edit
Trigger: Manual (Cmd+K, chat panel)
Latency: 1–5 seconds
Scope: Selected code or described task
Best for: Explaining code, refactoring a function, generating from a description
Model: Large, capable, general-purpose
Agent Mode
Trigger: Manual (Composer, terminal command)
Latency: 10 seconds to minutes
Scope: Entire codebase, multiple files
Best for: Feature implementation, multi-file refactoring, complex debugging
Model: Largest available, with tool use
Critical skill: Using the wrong mode wastes time. Don’t open an agent session to rename a variable (use completion). Don’t rely on inline completion to implement a new feature across 5 files (use an agent). Matching mode to task is a core skill covered in Ch 8–10.
Making Completions Work Better for You
Practical tips grounded in how the system actually works
Write Good Comments First
A comment like // Sort users by last login, most recent first before an empty function body gives the model a strong intent signal. The FIM mechanism sees your comment as prefix context, dramatically improving the quality of the generated function body. Comments are prompts.
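Concretely, the comment ends up in the prefix slot of the fill-in-the-middle prompt. A sketch using the <PRE>/<SUF>/<MID> sentinel names from the prompt structure shown earlier (real models use their own special tokens, and the function signature here is an invented example):

```typescript
// Build a fill-in-the-middle prompt: the model generates at <MID>,
// between the intent-bearing comment and the closing brace.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<PRE>${prefix}<SUF>${suffix}<MID>`;
}

const prefix = [
  "// Sort users by last login, most recent first",
  "function sortUsers(users: User[]): User[] {",
  "",
].join("\n");
const suffix = "\n}";
const prompt = buildFimPrompt(prefix, suffix);
```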
Keep Related Files Open
Open tabs are a primary context source. If you’re implementing a function that uses types from types.ts, open that file in a tab. The completion tool will include those type definitions in the prompt, producing suggestions that use the correct property names and types.
Write the Signature, Let AI Fill the Body
Type the function signature with parameter names and return type, then let the cursor sit inside the empty body. The model now has both prefix (signature) and suffix (closing brace) — the ideal FIM scenario. This consistently produces better completions than typing the first line and hoping.
Reject Deliberately
Don’t accept suggestions you don’t understand. Every acceptance is a training signal. If you accept buggy code, the model learns that pattern is desirable. Rejecting bad suggestions (by continuing to type) teaches the model what you actually want.
Key insight: Understanding the completion pipeline turns you from a passive recipient into an active collaborator. You’re not waiting for magic — you’re shaping the context, timing, and feedback that determine suggestion quality. The tool works with you, not for you.