Ch 6 — Model Serving & Inference

TorchServe, Triton, vLLM, ONNX Runtime, batching strategies, and latency vs. throughput
High level: Model → Optimize → Server → Batch → Latency → Scale
Model Serving Fundamentals
Turning a trained model into a production service
What Is Model Serving?
Model serving is the process of making a trained model available for inference — accepting input data and returning predictions. This sounds simple, but production serving must handle: low latency (respond in milliseconds), high throughput (thousands of requests per second), reliability (99.9%+ uptime), scalability (auto-scale with traffic), and efficiency (maximize GPU utilization). There are two main patterns: online serving (real-time, request-response via REST/gRPC) and batch serving (process large datasets offline, results stored for later use). Most production systems use online serving; batch is for non-time-sensitive predictions like recommendations or risk scores.
Serving Patterns
```
// Model serving patterns

Online Serving (real-time):
  Client → REST/gRPC → Model Server → Response
  Latency: < 100ms (typical SLA)
  Use: fraud detection, search ranking, chatbots, real-time recommendations

Batch Serving (offline):
  Data Lake → Spark/Airflow → Predictions → DB
  Latency: minutes to hours
  Use: email campaigns, risk scoring, nightly recommendations

Streaming Serving (near-real-time):
  Kafka → Model → Kafka (output topic)
  Latency: seconds
  Use: anomaly detection, IoT monitoring

// Most common: online serving via REST API
// POST /predict {features} → {prediction}
```
Key insight: The serving pattern should match the business requirement. Don’t build a real-time serving system if batch predictions (computed overnight) would work just as well — batch is 10x simpler and cheaper.
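To make the online pattern concrete, here is a minimal sketch of the `POST /predict {features} → {prediction}` contract in plain Python. The "model" (a weighted sum) and the JSON schema are illustrative assumptions standing in for a real trained model behind a real web framework, not any specific serving library's API.

```python
import json

# Hypothetical "model": a weighted sum standing in for real inference
WEIGHTS = [0.4, 0.3, 0.3]

def predict(features):
    """Score one feature vector (stand-in for model.forward())."""
    return sum(w * x for w, x in zip(WEIGHTS, features))

def handle_request(body: str) -> str:
    """Handle one POST /predict body: {"features": [...]} → {"prediction": ...}."""
    payload = json.loads(body)
    score = predict(payload["features"])
    return json.dumps({"prediction": round(score, 4)})

# One request/response cycle, as an online server would perform it
response = handle_request('{"features": [1.0, 0.5, 0.2]}')
```

In production this handler would sit behind a REST/gRPC server with input validation, timeouts, and metrics, but the contract stays this simple: JSON in, prediction out.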
NVIDIA Triton Inference Server
The industry standard for GPU-accelerated serving
Triton Overview
NVIDIA Triton Inference Server is the most widely used production inference server for GPU workloads. Key capabilities: multi-framework support (PyTorch, TensorFlow, ONNX, TensorRT, vLLM — all in one server), dynamic batching (automatically groups incoming requests into batches for GPU efficiency), model ensembles (chain multiple models in a pipeline), concurrent model execution (run multiple models on the same GPU), and metrics (Prometheus-compatible latency, throughput, and GPU utilization metrics). Triton is free and open-source, deployed via Docker containers. It’s the default choice for teams serving models on NVIDIA GPUs.
Triton Setup
```
# Triton model repository structure
model_repository/
├── fraud_detector/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── text_classifier/
    ├── config.pbtxt
    └── 1/
        └── model.pt

# config.pbtxt
name: "fraud_detector"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 100
}

# Launch Triton
$ docker run --gpus all \
    -v ./model_repository:/models \
    nvcr.io/nvidia/tritonserver:24.01-py3 \
    tritonserver --model-repository=/models
```
Key insight: Triton’s dynamic batching is its killer feature. Individual requests arrive at different times, but Triton groups them into batches (up to max_batch_size) before sending to the GPU, dramatically improving throughput.
vLLM: LLM Inference Engine
PagedAttention and continuous batching for LLMs
Why vLLM?
vLLM is the dominant open-source inference engine for large language models. Its key innovation is PagedAttention — inspired by virtual memory in operating systems, it manages the KV cache (key-value pairs stored during autoregressive generation) in non-contiguous memory blocks (“pages”). This eliminates memory fragmentation and waste, achieving near-zero KV cache waste compared to ~60–80% waste in naive implementations. vLLM also implements continuous batching — instead of waiting for all requests in a batch to finish, new requests are added as old ones complete, keeping the GPU busy at all times. Result: 2–4x higher throughput than HuggingFace Transformers.
vLLM Usage
```
# Start vLLM server (OpenAI-compatible)
$ vllm serve meta-llama/Llama-3-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

# OpenAI-compatible API
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3-8B-Instruct",
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 256
    }'

# Key features:
#   PagedAttention: near-zero KV cache waste
#   Continuous batching: no idle GPU time
#   Tensor parallelism: split across GPUs
#   Speculative decoding: draft + verify
```
Key insight: vLLM’s OpenAI-compatible API means you can swap between self-hosted vLLM and OpenAI’s API with a single URL change. This makes it easy to start with OpenAI and migrate to self-hosted when cost or privacy requires it.
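The KV-cache waste that PagedAttention eliminates can be checked with simple arithmetic. The sketch below uses Llama-3-8B-like dimensions (32 layers, 8 KV heads, head dim 128, FP16) as illustrative assumptions, and compares preallocating a contiguous max-length buffer per request against allocating fixed-size pages only as tokens are generated; the 16-token page size is also an assumption, not vLLM's exact internals.

```python
# Illustrative Llama-3-8B-like KV cache dimensions (assumptions)
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2  # FP16 = 2 bytes

def kv_bytes_per_token():
    # 2x because both keys AND values are cached, per layer, per KV head
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def naive_alloc(max_len):
    """Contiguous allocation: reserve the full max sequence length up front."""
    return max_len * kv_bytes_per_token()

def paged_alloc(actual_len, page_tokens=16):
    """Paged allocation: only whole pages actually touched are allocated."""
    pages = -(-actual_len // page_tokens)  # ceil division
    return pages * page_tokens * kv_bytes_per_token()

# A request reserved 8K tokens but only generated 500
max_len, actual_len = 8192, 500
naive = naive_alloc(max_len)                            # 1 GiB reserved
paged = paged_alloc(actual_len)                         # 64 MiB actually used
waste = 1 - actual_len * kv_bytes_per_token() / naive   # ~94% of the naive buffer is wasted
```

With many concurrent requests this per-request waste is what caps batch size in naive serving; paging it away is why vLLM can keep far more sequences in flight on the same GPU.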
ONNX Runtime & Model Optimization
Making models faster without changing accuracy
Optimization Techniques
ONNX Runtime (by Microsoft) is a cross-platform inference engine that runs ONNX models with hardware-specific optimizations. Convert your PyTorch/TensorFlow model to ONNX format, and ONNX Runtime applies: graph optimizations (operator fusion, constant folding), hardware acceleration (CUDA, TensorRT, DirectML, OpenVINO), and quantization (reduce precision from FP32 to INT8 for 2–4x speedup). Other optimization techniques: TensorRT (NVIDIA’s optimizer for maximum GPU performance), distillation (train a smaller model to mimic a larger one), and pruning (remove unimportant weights). For LLMs, quantization (GPTQ, AWQ, GGUF) is the primary optimization — reducing from FP16 to INT4 cuts memory by 4x with minimal quality loss.
ONNX Conversion
```python
import torch
import onnxruntime as ort

# Any torch.nn.Module works; a small example model and input
model = torch.nn.Linear(10, 2)
input_data = torch.randn(4, 10)

# Export PyTorch model to ONNX
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)

# Run inference with ONNX Runtime (falls back to CPU if no GPU)
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
result = session.run(None, {"input": input_data.numpy()})

# Typical speedups:
#   PyTorch → ONNX Runtime: 1.5-2x faster
#   ONNX → TensorRT: 2-5x faster
#   FP32 → INT8: 2-4x faster
```
Key insight: For traditional ML models (not LLMs), ONNX Runtime is often the easiest optimization win. Export to ONNX, run with ONNX Runtime, and get 1.5–2x speedup with zero accuracy loss and minimal code changes.
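The quantization claim above (FP16 → INT4 cuts memory by 4x) is just bytes-per-parameter arithmetic. A quick sketch for a hypothetical 7B-parameter model; real quantized formats (GPTQ, AWQ, GGUF) add a small overhead for scales and zero-points that this ignores:

```python
def weight_memory_gb(params, bits_per_weight):
    """Approximate weight memory; ignores quantization scale/zero-point overhead."""
    return params * bits_per_weight / 8 / 1e9

params = 7e9  # hypothetical 7B-parameter model

fp32 = weight_memory_gb(params, 32)  # 28.0 GB — won't fit a 24 GB GPU
fp16 = weight_memory_gb(params, 16)  # 14.0 GB
int8 = weight_memory_gb(params, 8)   #  7.0 GB
int4 = weight_memory_gb(params, 4)   #  3.5 GB — fits a consumer GPU
```

This is why INT4 quantization is often the difference between needing a data-center GPU and running on a single consumer card.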
Batching Strategies
The key to GPU efficiency
Why Batching Matters
GPUs are massively parallel processors — they’re designed to process many inputs simultaneously. Processing one request at a time wastes 90%+ of GPU capacity. Batching groups multiple requests together for parallel processing. Three strategies: Static batching (fixed batch size, wait until full — simple but adds latency). Dynamic batching (collect requests for a short window, batch whatever arrived — Triton’s approach). Continuous batching (for LLMs — don’t wait for all sequences to finish; add new requests as old ones complete). The trade-off is always latency vs. throughput: larger batches = higher throughput but higher latency per request.
Batching Comparison
```
// Batching strategies

No Batching (batch_size=1):
  GPU utilization: ~5-10%
  Latency: lowest per request
  Throughput: very low
  Waste: enormous

Static Batching:
  Wait for N requests → process together
  GPU utilization: ~60-80%
  Latency: variable (waiting time)
  Throughput: good

Dynamic Batching (Triton):
  Collect for max_delay μs → batch
  GPU utilization: ~70-90%
  Latency: bounded by max_delay
  Throughput: very good

Continuous Batching (vLLM):
  New requests join mid-batch
  GPU utilization: ~85-95%
  Latency: lowest for LLMs
  Throughput: best for LLMs
```
Key insight: Continuous batching (used by vLLM and TGI) was a breakthrough for LLM serving. In static batching, a batch waits for the longest sequence to finish. In continuous batching, short sequences leave and new ones join immediately, keeping the GPU saturated.
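The static-vs-continuous difference can be shown with a toy step-count simulation. This is a deliberate simplification of a real scheduler (one "step" generates one token per active sequence, no prefill cost, no memory limits), but it captures why refilling freed slots beats waiting for the longest sequence:

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its LONGEST sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished sequence's slot is refilled immediately from the queue."""
    queue = deque(lengths)
    active = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while active:
        steps += 1
        active = [n - 1 for n in active if n > 1]   # finished sequences leave
        while queue and len(active) < batch_size:   # new requests join mid-batch
            active.append(queue.popleft())
    return steps

# One long sequence (8 tokens) plus three short ones (2 tokens each), batch of 2:
# static pays for the long straggler; continuous backfills around it
lengths = [8, 2, 2, 2]
```

With these lengths, static batching takes 10 steps while continuous batching takes 8 — and the gap widens as sequence lengths get more skewed, which is exactly the regime LLM serving lives in.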
Latency vs. Throughput
The fundamental trade-off in model serving
Understanding the Trade-off
Latency is how long one request takes (measured in p50, p95, p99 percentiles). Throughput is how many requests per second the system handles. They’re inversely related: optimizing for one hurts the other. Latency-sensitive applications (fraud detection, search autocomplete): minimize batch size, use faster hardware, apply model optimization. Throughput-sensitive applications (batch scoring, content moderation): maximize batch size, use dynamic batching, scale horizontally. For LLMs, two metrics matter: Time to First Token (TTFT) (how fast the first word appears) and tokens per second (TPS) (generation speed). Users perceive TTFT as responsiveness and TPS as fluency.
Latency Metrics
```
// Key serving metrics

Latency (per request):
  p50: median response time
  p95: 95th percentile (most users)
  p99: 99th percentile (worst case)
  SLA: p99 < 50ms (typical)

Throughput:
  QPS: queries per second
  RPS: requests per second
  TPS: tokens per second (LLMs)

LLM-Specific:
  TTFT: time to first token (~200ms good)
  TPS: tokens/sec (~30-80 for 7B model)
  ITL: inter-token latency (~15-30ms)

GPU Metrics:
  Utilization: % of GPU compute used
  Memory: % of VRAM used
  Queue depth: requests waiting
```
Key insight: Always measure p99 latency, not average. If your average is 20ms but p99 is 500ms, 1% of your users have a terrible experience. SLAs should be defined on p95 or p99, never on averages.
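Percentiles are easy to compute from raw latency samples. A minimal nearest-rank sketch (production systems would read these from a monitoring backend such as Prometheus histograms rather than raw samples):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 100 simulated request latencies: 1ms, 2ms, ..., 100ms
latencies = list(range(1, 101))

p50 = percentile(latencies, 50)          # 50 ms
p95 = percentile(latencies, 95)          # 95 ms
p99 = percentile(latencies, 99)          # 99 ms
mean = sum(latencies) / len(latencies)   # 50.5 ms — says nothing about the tail
```

Note how close the mean sits to p50 while p99 is nearly double: averaging hides exactly the requests your SLA should protect.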
Scaling Model Serving
Horizontal scaling, auto-scaling, and load balancing
Scaling Strategies
Vertical scaling: bigger GPU (A100 → H100). Simple but has a ceiling. Horizontal scaling: more replicas behind a load balancer. The standard approach for production. Auto-scaling: automatically add/remove replicas based on metrics (GPU utilization, queue depth, latency). Kubernetes is the standard orchestrator — use KEDA (Kubernetes Event-Driven Autoscaling) or custom HPA (Horizontal Pod Autoscaler) metrics. For LLMs, scaling is more complex: large models may span multiple GPUs (tensor parallelism), so you scale in units of “model replicas” rather than individual pods. KV cache memory is often the bottleneck, not compute.
Auto-Scaling Config
```yaml
# Kubernetes HPA for model serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"  # scale at 70%
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
```
Key insight: Scale on queue depth, not just GPU utilization. High GPU utilization with an empty queue means you’re efficient. High GPU utilization with a growing queue means you need more replicas.
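The "queue depth, not just utilization" rule can be sketched as a scaling decision function. The shape mirrors the Kubernetes HPA formula (desired = ceil(current × metric/target), scaled by the worst metric), but the thresholds and scenario numbers below are illustrative assumptions, not a drop-in autoscaler:

```python
import math

def desired_replicas(current, gpu_util_pct, queue_depth,
                     util_target=70, queue_target=5,
                     min_replicas=2, max_replicas=10):
    """HPA-style rule: scale by the worst (largest) metric ratio, then clamp."""
    ratio = max(gpu_util_pct / util_target, queue_depth / queue_target)
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Busy but keeping up: high utilization, empty queue → hold steady at 4
steady = desired_replicas(4, gpu_util_pct=68, queue_depth=0)

# Falling behind: queue depth dominates utilization → scale out hard
behind = desired_replicas(4, gpu_util_pct=80, queue_depth=15)
```

Taking the max of the ratios is what makes the queue signal matter: a backlog triggers scale-out even when utilization alone looks acceptable.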
Choosing a Serving Framework
Decision framework for your use case
Decision Guide
Choose vLLM if: you’re serving LLMs and need maximum throughput with PagedAttention and continuous batching. Choose Triton if: you serve multiple model types (CV, NLP, tabular) on GPUs and need dynamic batching, ensembles, and multi-model support. Choose ONNX Runtime if: you need cross-platform inference (CPU, GPU, edge) with minimal dependencies. Choose BentoML if: you want a Python-first framework that packages models as Docker containers with minimal boilerplate. Choose a managed service (SageMaker Endpoints, Vertex AI Prediction, Azure ML) if: you want zero infrastructure management and are on that cloud platform.
Framework Comparison
```
// Serving framework decision guide

vLLM:
  Best for: LLM inference
  Key: PagedAttention, continuous batching
  API: OpenAI-compatible

Triton:
  Best for: multi-model GPU serving
  Key: dynamic batching, ensembles
  API: REST + gRPC

ONNX Runtime:
  Best for: cross-platform, CPU/edge
  Key: graph optimization, quantization
  API: library (no server)

BentoML:
  Best for: Python teams, quick deploy
  Key: easy packaging, Docker export
  API: REST

Managed (SageMaker/Vertex/Azure):
  Best for: zero infra management
  Key: auto-scaling, monitoring built-in
  Cost: $$$
```
Key insight: For LLMs, vLLM has become the de facto standard. For traditional ML models, start with the simplest option (BentoML or a managed service) and move to Triton when you need GPU efficiency at scale.