Ch 9 — Inference Infrastructure: Serving AI at Scale

Batching, KV cache, PagedAttention, inference engines, GPU sharing, and cost optimization
Training vs Inference: Different Beasts
Same GPUs, completely different bottlenecks and optimization strategies
Why Inference Is Different
Training processes massive batches of data in a predictable loop — forward pass, backward pass, gradient update. The GPU stays busy with large matrix multiplications. Inference is the opposite: unpredictable request arrivals, variable sequence lengths, and a two-phase execution pattern that fundamentally changes the bottleneck.

Prefill phase: Process the entire input prompt at once. This is compute-bound — similar to training. A 4,096-token prompt on an H100 takes ~50ms.

Decode phase: Generate tokens one at a time, each requiring a full attention pass over all previous tokens. This is memory-bandwidth-bound because you’re reading the entire KV cache for each new token but only producing a single output.

The result: during decoding, an H100’s 3,958 TFLOPS (FP8) sits mostly idle while the 3.35 TB/s memory bus becomes the bottleneck. A single request might use only 1–5% of the GPU’s compute capacity during decode.
Training vs Inference Comparison
Dimension         Training            Inference
──────────────────────────────────────────────────
Batch size        Large (millions)    Small (1–256)
Bottleneck        Compute (FLOPS)     Memory bandwidth
Latency target    Hours/days          50–500 ms
Throughput unit   Tokens/sec total    Tokens/sec/user
GPU utilization   30–50%              5–40% (naive)
Memory pattern    Predictable         Dynamic (KV cache)
Failure impact    Lose hours          Lose one request
Scaling unit      Cluster             Individual GPUs
Key insight: Training is like a factory assembly line — predictable, high-throughput, optimized for total output. Inference is like a restaurant kitchen — unpredictable orders, variable complexity, and every customer expects fast service. The same oven (GPU) needs completely different management strategies.
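The bandwidth-bound decode claim can be checked with roofline-style arithmetic. A back-of-envelope sketch (the function name is ours; the byte counts are taken from this chapter's figures, not measured):

```python
# Rough bound on single-sequence decode speed: each decode step must
# stream the model weights plus the sequence's KV cache from HBM at
# least once, so memory traffic caps the step rate.
def decode_tokens_per_sec(weight_bytes, kv_cache_bytes, hbm_bytes_per_sec):
    bytes_per_step = weight_bytes + kv_cache_bytes
    return hbm_bytes_per_sec / bytes_per_step

h100_bw = 3.35e12       # H100 HBM3 bandwidth, bytes/s
weights_fp8 = 70e9      # Llama 3 70B at FP8, ~1 byte per parameter
kv_cache_4k = 1.3e9     # KV cache for one 4K-context sequence
print(round(decode_tokens_per_sec(weights_fp8, kv_cache_4k, h100_bw)))  # ≈ 47
```

About 47 tokens/second for a lone sequence, while the FLOP budget would allow far more — which is exactly why batching many sequences per weight-read is the central serving optimization.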
The KV Cache: Memory Hog of Inference
Why attention’s memory grows linearly with every token generated
What Is the KV Cache?
In transformer attention, each token needs to attend to all previous tokens. Without caching, generating token N requires recomputing attention for all N-1 previous tokens — O(N²) total work for a sequence. The KV cache stores the Key and Value projections from previous tokens so each new token only computes its own K/V and attends to the cached values.

This trades compute for memory. Instead of recomputing, you store and read. For large models with long contexts, this cache becomes enormous.
KV Cache Size Formula
# KV cache per token, per layer (one K and one V vector):
kv_per_token_per_layer = 2 × num_kv_heads × head_dim × bytes_per_element

# Total KV cache for one sequence:
total = num_layers × seq_len × kv_per_token_per_layer

# Example: Llama 3 70B (FP16)
layers = 80, KV heads = 8, head_dim = 128
kv_per_token_per_layer = 2 × 8 × 128 × 2 bytes = 4,096 bytes
kv_per_token (all layers) = 80 × 4,096 bytes ≈ 320 KB
total (4K ctx)   = 4,096 × 320 KB   ≈ 1.3 GB
total (128K ctx) = 131,072 × 320 KB ≈ 43 GB

# With 32 concurrent users at 4K context:
cache_total = 32 × 1.3 GB ≈ 42 GB
# That's over half of an H100's 80 GB — just for cache!
The Memory Budget Problem
An H100 has 80 GB of HBM3. For Llama 3 70B in FP16, the model weights alone consume ~140 GB (needs 2 GPUs via tensor parallelism). On each GPU, roughly 10 GB is left for KV cache after weights and activations. That’s enough for only ~8 concurrent users at 4K context.

With FP8 quantization, weights shrink to ~70 GB (1 GPU), leaving ~8 GB for cache. Still only ~6 users. The KV cache is the primary factor limiting how many users a single GPU can serve simultaneously.

Long-context models make this worse: a 128K-context request consumes ~43 GB of cache — more than half of an H100's entire memory. A single long-context user can monopolize most of a GPU.
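These capacity limits are easy to recompute. A minimal sketch (helper names are ours; it assumes each user's whole cache sits on one GPU, ignoring tensor-parallel sharding):

```python
# KV cache size per token, mirroring the formula above.
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # One K and one V vector per KV head, per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# How many full-context users fit in the HBM left over after weights?
def max_concurrent_users(free_hbm_bytes, ctx_len, **model):
    return free_hbm_bytes // (ctx_len * kv_bytes_per_token(**model))

llama3_70b = dict(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)  # FP16
print(kv_bytes_per_token(**llama3_70b))                   # 327680 ≈ 320 KB/token
print(max_concurrent_users(10 * 10**9, 4096, **llama3_70b))  # 7 users in 10 GB
```

Seven-ish users per 10 GB of free HBM at 4K context — consistent with the "~8 concurrent users" figure above.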
KV Cache Memory by Model
Model           Params   KV/token   4K Cache   128K Cache
──────────────────────────────────────────────────────────
Llama 3 8B      8B       ~130 KB    ~0.5 GB    ~17 GB
Llama 3 70B     70B      ~320 KB    ~1.3 GB    ~43 GB
Llama 3 405B    405B     ~520 KB    ~2.1 GB    ~68 GB
Mixtral 8×22B   141B     ~230 KB    ~0.9 GB    ~30 GB
GPT-4 (est.)    ~1.8T    ~1.2 MB    ~5 GB      ~160 GB

(FP16 KV cache; KV/token = 2 × KV heads × head_dim × 2 bytes × num_layers)
Key insight: The KV cache is like a waiter’s notepad that grows with every course of the meal. A short conversation is a sticky note; a 128K-context chat is a full notebook. The restaurant (GPU) can only carry so many notebooks before it runs out of pocket space, no matter how fast the chef (compute cores) can cook.
PagedAttention: Virtual Memory for KV Cache
The breakthrough that made efficient LLM serving possible
The Fragmentation Problem
Before PagedAttention, inference engines pre-allocated contiguous memory blocks for each request’s KV cache based on the maximum possible sequence length. If a model supports 4,096 tokens but a request only uses 500, the remaining 3,596 tokens’ worth of memory sits wasted — reserved but empty.

Across many concurrent requests, this wastes 60–80% of GPU memory. It’s like reserving an entire row of seats at a movie theater for each person, even though most people come alone.

Additionally, contiguous allocation causes external fragmentation. Even if total free memory is sufficient, it may be scattered in non-contiguous chunks too small for any single request.
How PagedAttention Works
PagedAttention (introduced by vLLM in 2023) borrows the concept of virtual memory paging from operating systems:

1. Fixed-size blocks: KV cache is divided into pages of 16 tokens each (configurable). Each page is a small, fixed-size memory block.

2. Block table: A mapping table (like a page table in OS virtual memory) tracks which physical memory blocks belong to which sequence. Blocks don’t need to be contiguous.

3. On-demand allocation: New blocks are allocated only when a sequence actually generates tokens that need them. No pre-allocation of maximum length.

4. Copy-on-write: When multiple requests share a common prefix (e.g., system prompt), they can share the same physical KV cache blocks until they diverge.
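The block-table idea can be illustrated with a toy allocator (an illustrative sketch in the spirit of PagedAttention, not vLLM’s actual implementation; 16-token pages as described above):

```python
BLOCK_TOKENS = 16  # tokens per page, as in step 1 above

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.tables = {}                              # seq_id -> [block ids]

    def append_token(self, seq_id, pos):
        # Allocate a new physical block only when the sequence crosses a
        # page boundary -- no up-front reservation of the maximum length.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_TOKENS == 0:
            table.append(self.free.pop())
        return table[pos // BLOCK_TOKENS]  # physical block holding this token

bt = BlockTable(num_physical_blocks=1024)
for pos in range(40):              # a sequence that generates 40 tokens
    bt.append_token("req-A", pos)
print(len(bt.tables["req-A"]))     # 3 blocks for 40 tokens (ceil(40/16))
```

Memory waste is bounded by one partially filled page per sequence — at most 15 tokens here, rather than thousands under max-length pre-allocation.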
Impact by the Numbers
Without PagedAttention:
  Memory waste:          60–80%
  Max concurrent users:  ~8 (Llama 70B, H100)
  Throughput:            ~800 tok/s

With PagedAttention:
  Memory waste:          < 4%
  Max concurrent users:  ~32–48 (same setup)
  Throughput:            ~2,400–3,200 tok/s

Improvement:
  Memory utilization:    ~5× better
  Throughput:            2–4× higher
  Cost per token:        2–4× lower

# Shared-prefix example (RAG with a 2K-token system prompt, Llama 3 70B):
Without sharing:  100 users × 2,000 tok × 320 KB ≈ 64 GB
With CoW:         1 × 2,000 tok × 320 KB ≈ 640 MB (+ block-table overhead)
Savings:          ~99% of the shared-prefix cache
Key insight: PagedAttention is to GPU memory what virtual memory was to RAM in the 1960s. Before virtual memory, each program needed a contiguous block of physical RAM. After, programs got virtual addresses mapped to scattered physical pages. The same revolution — applied to KV cache — unlocked 2–4× more throughput from the same hardware.
Batching Strategies: From Static to Continuous
How to keep GPUs busy when requests arrive unpredictably
Static Batching (Naive)
The simplest approach: collect N requests, process them together, wait for all to finish, then collect the next batch.

Problem: If one request generates 500 tokens and another generates 10, the short request’s GPU slot sits idle for 490 decode steps. With variable-length outputs (common in chat), this wastes 50–70% of potential throughput.

It’s like a bus that won’t leave until every passenger reaches their final destination — even if most people want to get off at the first stop.
Continuous Batching
Continuous batching (also called iteration-level scheduling) checks for completed sequences after every decode step. When a sequence finishes, its slot is immediately filled with a new request from the queue.

Prefill scheduling: New requests can be prefilled (prompt processing) while existing requests are decoding. The prefill is chunked into smaller pieces to avoid stalling decode latency for in-flight requests.

Result: GPU utilization jumps from 30–40% (static) to 70–90% (continuous). Throughput increases 2–3× with the same hardware.
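The scheduling difference is easy to simulate. A toy iteration-level scheduler (names and the step-counting model are ours; one loop iteration = one decode step for the whole batch):

```python
from collections import deque

def serve(requests, max_batch=4):
    """Continuous batching: refill freed slots after every decode step."""
    queue = deque(requests)   # (req_id, tokens_to_generate)
    running, steps = {}, 0
    while queue or running:
        while queue and len(running) < max_batch:  # admit waiting requests
            rid, n = queue.popleft()
            running[rid] = n
        steps += 1                                 # one decode step for the batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]                   # slot freed immediately
    return steps

print(serve([("A", 500), ("B", 10), ("C", 200), ("D", 80), ("E", 120)]))  # 500
```

The same five requests under static batching would need 620 steps (500 for the first batch of four, then 120 for Req E alone); continuous batching finishes in 500 because Req E slips into the slot Req B frees at step 10.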
Batching Strategies Compared
Static Batching:
  [Req A ████████████████████] 500 tokens
  [Req B ██░░░░░░░░░░░░░░░░░░]  10 tokens (idle)
  [Req C ████████░░░░░░░░░░░░] 200 tokens (idle)
  → GPU idle ~60% of the time

Continuous Batching:
  [Req A ████████████████████] 500 tokens
  [Req B ██][Req D ██████████] slots reused!
  [Req C ████████][Req E ████] slots reused!
  → GPU idle ~10–15% of the time

Chunked Prefill:
  Decode:  [A₅₀₁][B₁₁][C₂₀₁]   existing requests
  Prefill: [D_chunk₁ ████]      new request (chunked)
  → Prefill interleaved without stalling decode
Speculative Decoding
A complementary technique: use a small “draft” model (e.g., 1B params) to generate K candidate tokens quickly, then verify all K in a single forward pass of the large model. If the draft model guesses correctly (common for predictable text), you generate K tokens for the cost of ~1 large-model step.

Typical speedup: 1.5–2.5× for code generation and structured text. Less effective for creative writing where the draft model’s predictions diverge.
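The expected gain can be estimated with a simplified acceptance model (an assumption on our part: each draft token is accepted independently with probability p; real acceptance rates vary by position and content):

```python
# Expected tokens emitted per large-model verification pass when the
# draft proposes k tokens, each accepted with probability p. The sum
# 1 + p + ... + p^k counts the accepted prefix plus the verifier's
# own bonus token at the first rejection.
def expected_tokens_per_step(p, k):
    return sum(p**i for i in range(k + 1))

print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36
```

At p = 0.8 with 4 draft tokens you emit ~3.4 tokens per large-model pass; after paying for the draft model’s own compute, that lands in the 1.5–2.5× range quoted above.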
Key insight: Static batching is like an elevator that waits for everyone to get off at the top floor before going back down. Continuous batching opens the doors at every floor. Speculative decoding is like the elevator predicting which floors people want and pre-positioning — when it guesses right, everyone arrives faster.
Inference Engines: vLLM, TensorRT-LLM, SGLang & TGI
The software that turns GPU hardware into an API endpoint
The Big Four (2025–2026)
Four engines dominate production LLM serving, accounting for ~85% of deployments:

vLLM — The flexibility champion. Open-source, hardware-agnostic (NVIDIA, AMD, TPU). PagedAttention, continuous batching, speculative decoding, LoRA hot-swapping. vLLM V1 (Jan 2025) redesigned the core for near-zero CPU overhead. Best for high-concurrency (50–100+ requests) and multi-modal workloads.

TensorRT-LLM — NVIDIA’s hardware-optimized engine. Aggressive kernel fusion, FP8/FP4 quantization, pipeline parallelism. 35–50% higher throughput than vLLM on identical NVIDIA hardware for single-model serving. Trade-off: 15–45 minute model compilation step and NVIDIA lock-in.

SGLang — Structured output specialist. RadixAttention for prefix caching, JSON schema enforcement, multi-LoRA batching. Best for DeepSeek models and workloads with high prompt overlap (RAG, chatbots).

Hugging Face TGI — Simplicity champion. Single Docker image, minimal config. Lower raw performance but fastest time-to-production.
Benchmark Comparison (8× H100, FP8)
Engine         Throughput   TTFT    TPOT    Setup
───────────────────────────────────────────────────
TensorRT-LLM   4,800 t/s    85 ms   12 ms   45 min compile
vLLM V1        3,400 t/s    62 ms   18 ms   Instant
SGLang         3,200 t/s    58 ms   19 ms   Instant
TGI            2,900 t/s    95 ms   22 ms   Instant

TTFT = Time To First Token (lower is better)
TPOT = Time Per Output Token (lower is better)
t/s  = tokens per second (higher is better)

Cost per million tokens at scale:
  TensorRT-LLM: ~$0.04
  vLLM:         ~$0.06
  TGI:          ~$0.07
  Gap: ~$0.03/M tokens → significant at >$20K/month
Selection Guide
# Quick launch with vLLM:
$ pip install vllm
$ vllm serve meta-llama/Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --quantization fp8

# Decision tree:
Need max throughput + NVIDIA only?   → TensorRT-LLM
High concurrency + flexibility?      → vLLM
Structured outputs + prefix reuse?   → SGLang
Fastest time-to-production?          → TGI
Key insight: The performance gap between engines has narrowed from 2–3× in 2024 to 35–50% in 2025. At this margin, flexibility and operational simplicity often matter more than raw speed. Choose based on your constraints (hardware vendor lock-in, team expertise, deployment complexity), not just benchmarks.
Optimization Techniques: Squeezing Every Token
Quantization, prefix caching, disaggregated serving, and model parallelism for inference
Quantization for Inference
Inference-time quantization can be more aggressive than training-time quantization because gradients are no longer needed:

FP8 (W8A8): Weights and activations in 8-bit. Near-zero accuracy loss on most models. Halves memory vs FP16, doubles throughput. The default for production H100/H200 deployments.

INT4/GPTQ/AWQ (W4A16): 4-bit weights, 16-bit activations. Cuts weight memory by 4× vs FP16 — Llama 70B weights shrink to ~35 GB and fit on a single 80 GB GPU. Small accuracy loss (~1–2% on benchmarks).

FP4 (Blackwell): NVIDIA B200 adds native FP4 tensor cores. Llama 70B weights still occupy ~35 GB at 4 bits per parameter, but the matmuls now run at full hardware speed, and the B200’s 192 GB of HBM3e leaves a large budget for KV cache on a single GPU.
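The weight-memory arithmetic behind these options is a one-liner (a sketch; decimal GB, dense models only — MoE routing tables and quantization scales add a small overhead):

```python
# Weight memory for a model at a given precision.
def weight_gb(params_billions, bits_per_param):
    # bytes per parameter = bits / 8; params in billions gives decimal GB
    return params_billions * bits_per_param / 8

for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4), ("FP4", 4)]:
    print(f"{fmt}: {weight_gb(70, bits):.1f} GB")  # 140.0, 70.0, 35.0, 35.0
```

Note that INT4 and FP4 occupy the same space; FP4’s advantage on Blackwell is native tensor-core throughput, not a smaller footprint.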
Prefix Caching
Many requests share common prefixes: system prompts, few-shot examples, RAG context. Prefix caching computes and stores the KV cache for these shared prefixes once, then reuses it across requests.

Impact: For a RAG application with a 2,000-token system prompt on Llama 3 70B, prefix caching eliminates ~50ms of prefill latency per request and lets every user share one ~640 MB prefix cache instead of holding a private copy. At 100 concurrent users that avoids ~63 GB of duplicated cache, and at 100 requests/second it saves 5 seconds of prefill GPU time every second.
Disaggregated Prefill/Decode
Prefill is compute-bound; decode is memory-bound. Running both on the same GPU means neither phase gets optimal hardware:

Disaggregated serving splits prefill and decode onto separate GPU pools. Prefill GPUs are configured for maximum compute (higher clock, less memory). Decode GPUs are configured for maximum memory bandwidth and capacity.

Benefit: 20–40% throughput improvement at the system level. The trade-off is increased complexity and network overhead for transferring KV cache between GPU pools.
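The network overhead is worth estimating before adopting disaggregation. A quick sketch (the 400 Gb/s link speed is our assumption, not from the text):

```python
# Time to ship a finished prefill's KV cache from the prefill pool
# to a decode GPU over the interconnect.
def transfer_ms(kv_gigabytes, link_gigabits_per_sec):
    return kv_gigabytes * 8 / link_gigabits_per_sec * 1000

print(round(transfer_ms(1.3, 400), 1))  # 26.0 ms for a 4K-context 70B cache
```

~26 ms is on the same order as the ~50 ms prefill itself, so disaggregation pays off mainly when the transfer is overlapped with other work rather than sitting on the critical path.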
Inference Parallelism Strategies
Tensor Parallelism (TP):
  Split model across GPUs within a node
  Latency: reduced (parallel compute)
  Best for: Latency-sensitive, single-user
  Example: 70B on 4× H100 → ~12 ms TPOT

Pipeline Parallelism (PP):
  Split model layers across GPUs
  Throughput: increased (pipeline overlap)
  Best for: Throughput-optimized batch serving
  Example: 405B on 8× H100, PP = 8

Data Parallelism (DP):
  Replicate model across GPU groups
  Scales: linearly with replicas
  Best for: High-throughput serving
  Example: 4 replicas × 2 GPUs each = 8 GPUs
Key insight: Inference optimization is like running a restaurant during rush hour. Quantization is using smaller plates (same food, less space). Prefix caching is pre-setting the table for regulars. Disaggregated serving is having separate prep and plating stations. Each trick alone helps 20–40%; combined, they can cut serving costs by 3–5×.
Multi-Tenancy and GPU Sharing
Serving multiple models and users on shared GPU infrastructure
The Utilization Problem
Most organizations don’t have enough traffic for a single model to saturate a GPU 24/7. A customer-facing chatbot might peak at 100 requests/second during business hours but drop to 5 requests/second at night. Dedicating 8× H100s ($250K+/year) to a model that’s idle 60% of the time is wasteful.

Multi-tenancy solves this by sharing GPU resources across multiple models, users, or workloads. The challenge: GPU memory isn’t easily shared, model loading takes 30–120 seconds, and latency SLAs vary by customer.
GPU Sharing Strategies
1. Time-slicing: Load/unload models based on demand. Works for infrequent models but model loading latency (30–120s) makes it impractical for interactive workloads.

2. MPS (Multi-Process Service): NVIDIA’s GPU sharing at the CUDA level. Multiple processes share a GPU with isolated memory spaces. Good for small models but no memory overcommit.

3. MIG (Multi-Instance GPU): Hardware-level GPU partitioning on A100/H100. Splits one GPU into up to 7 isolated instances. Each instance has guaranteed memory and compute. Best for hard isolation requirements.

4. LoRA multiplexing: Load one base model + many LoRA adapters (1–5% of base model size). Serve different fine-tuned variants from a single GPU. vLLM supports hot-swapping LoRA adapters with zero downtime.
LoRA Multiplexing Economics
Scenario: 10 fine-tuned Llama 70B variants

Dedicated GPUs (no sharing):
  10 models × 4 GPUs each = 40 GPUs
  Cost: 40 × $2.50/hr = $100/hr ($876K/yr)

LoRA multiplexing:
  1 base model × 4 GPUs = 4 GPUs
  10 LoRA adapters × ~700 MB each = 7 GB
  Cost: 4 × $2.50/hr = $10/hr ($87.6K/yr)

Savings: 90% ($788K/yr)

# Trade-off: LoRA adapters share the base model's
# capacity. If all 10 variants get simultaneous
# traffic, latency increases. Solution: autoscale
# base model replicas based on aggregate demand.
Autoscaling for Inference
Scale-to-zero: Unload models with no traffic. First request triggers loading (cold start: 30–120s). Acceptable for internal/batch workloads, not for customer-facing APIs.

Predictive scaling: Use historical traffic patterns to pre-scale. If traffic peaks at 9 AM, start scaling at 8:45 AM. Reduces cold starts by 80–90%.

Request-based autoscaling: Scale replicas based on queue depth or P99 latency. Target: keep P99 latency below SLA (e.g., 500ms TTFT). Kubernetes HPA with custom metrics from the inference engine.
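A request-based replica target can be reduced to one line. A sketch of the sizing rule (constants are illustrative, not from any specific autoscaler):

```python
import math

def target_replicas(arrival_tok_per_s, replica_tok_per_s, headroom=1.3, floor=1):
    # headroom > 1 keeps queues short so P99 TTFT stays under the SLA;
    # floor avoids scale-to-zero cold starts for latency-sensitive APIs.
    return max(floor, math.ceil(arrival_tok_per_s * headroom / replica_tok_per_s))

print(target_replicas(8_000, 3_400))  # 4 replicas for 8K tok/s of demand
```

In Kubernetes terms, this is what an HPA with a custom tokens-per-second metric computes; the headroom factor plays the role of the target-utilization setting.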
Key insight: Multi-tenancy for GPUs is like a coworking space vs. renting an entire office. Most startups don’t need a whole floor — they need a desk, a meeting room sometimes, and the flexibility to scale. LoRA multiplexing is the hot-desking of AI: everyone shares the building (base model) but has their own locker (adapter weights).
Inference Cost Math: Tokens, GPUs, and Dollars
How to calculate and optimize the cost of serving AI at scale
Cost Per Token Breakdown
The fundamental unit of inference cost is dollars per million tokens. This depends on three factors: GPU cost per hour, tokens generated per hour, and overhead (networking, storage, engineering).

For a well-optimized vLLM deployment of Llama 3 70B on 4× H100 (FP8):
GPU cost:
  4× H100 on-demand:  4 × $2.50/hr = $10/hr
  Reserved (1 yr):    4 × $1.60/hr = $6.40/hr

Throughput (vLLM, FP8, continuous batching):
  Per GPU:   ~850 tok/s
  4 GPUs:    ~3,400 tok/s
  Per hour:  3,400 × 3,600 = 12.24M tokens/hr

Cost per million tokens:
  On-demand: $10 / 12.24   = $0.82/M tokens
  Reserved:  $6.40 / 12.24 = $0.52/M tokens
  + Overhead (~30%):         $0.68–1.07/M tokens

# Compare to API pricing:
  OpenAI GPT-4o output:   $10.00/M tokens
  Anthropic Claude 3.5:   $15.00/M tokens
  Self-hosted Llama 70B:  $0.68–1.07/M tokens
  Savings vs API:         10–15×
When Self-Hosting Makes Sense
Self-hosting breaks even when your monthly token volume exceeds the point where GPU costs (amortized) are less than API costs. The crossover depends on utilization:
Break-even analysis (vs GPT-4o at $10/M output tokens):

  Monthly GPU cost (4× H100 reserved): $4,608
  Monthly capacity: 12.24M tok/hr × 730 hrs ≈ 8,935M tokens

  Break-even volume: $4,608 / $10 per M ≈ 461M tokens/month
  (≈ 5% utilization of this 4-GPU setup)

  Effective self-host cost by utilization:
    100%: $4,608 / 8,935M = $0.52/M (≈ 20× cheaper than the API)
    30%:  $4,608 / 2,680M = $1.72/M (still ~6× cheaper)
    10%:  $4,608 / 894M   = $5.15/M (approaching API pricing)

# Rule of thumb: self-host when you consistently
# generate hundreds of millions of tokens/month AND
# can maintain >25% GPU utilization.
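The utilization math above reduces to two small helpers (a sketch; function names are ours, figures from the worked example):

```python
# Effective $/M tokens when the cluster runs below full utilization.
def effective_cost_per_m(monthly_gpu_cost, capacity_m_tokens, utilization):
    return monthly_gpu_cost / (capacity_m_tokens * utilization)

# Monthly volume at which fixed GPU cost equals pay-per-token API cost.
def break_even_m_tokens(monthly_gpu_cost, api_price_per_m):
    return monthly_gpu_cost / api_price_per_m

print(round(effective_cost_per_m(4608, 8935, 0.30), 2))  # 1.72 $/M at 30% util
print(round(break_even_m_tokens(4608, 10.0), 1))         # 460.8M tokens/month
```

Below roughly 461M tokens/month against $10/M API pricing, this 4-GPU cluster costs more than simply calling the API; above it, every additional token widens the savings.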
Hidden Costs of Self-Hosting
Engineering: 0.5–2 FTEs to manage inference infrastructure ($100–400K/yr).
Monitoring: Prometheus, Grafana, custom dashboards for latency/throughput SLAs.
Model updates: New model versions require re-quantization, re-benchmarking, A/B testing.
Redundancy: Need 2× capacity for zero-downtime deployments during updates.
Edge cases: OOM errors, CUDA crashes, driver updates, security patches.
Key insight: Self-hosting inference is like buying vs. leasing a car. Buying is cheaper per mile if you drive enough, but you’re responsible for maintenance, insurance, and depreciation. APIs are like taxis — expensive per trip but zero commitment. The crossover comes roughly when your monthly API bill would exceed your GPU bill (about $5K/month in the example above), which arrives sooner than many teams expect, but the hidden costs are higher than most people budget.