Ch 7 — LLMOps: Gateways & Routing

LLM gateways (LiteLLM, Portkey), model routing, fallbacks, rate limiting, and cost tracking
High Level

App → Gateway → Route → LLM → Cache → Observe
What Is an LLM Gateway?
A unified proxy between your app and LLM providers
The Problem
Modern applications use multiple LLM providers: OpenAI for GPT-4o, Anthropic for Claude, Google for Gemini, self-hosted LLaMA via vLLM. Each has a different API, different pricing, different rate limits, and different failure modes. Without a gateway, your application code is littered with provider-specific logic, retry handling, and cost tracking. An LLM gateway sits between your application and the LLM providers, providing: unified API (one interface for all providers), routing (send requests to the right model), fallbacks (switch providers on failure), rate limiting (prevent quota exhaustion), caching (avoid duplicate calls), and observability (logging, cost tracking, latency metrics).
Gateway Architecture
// Without gateway
App → OpenAI API      (custom code)
App → Anthropic API   (different code)
App → vLLM API        (yet another format)
// 3 integrations, 3 retry logics, 3 cost trackers

// With gateway
App → LLM Gateway → OpenAI
                  → Anthropic
                  → vLLM
                  → Gemini
// 1 integration, unified retry/fallback/cost

Gateway provides:
✓ Unified OpenAI-compatible API
✓ Automatic fallbacks on failure
✓ Rate limiting & quota management
✓ Semantic caching
✓ Cost tracking per request
✓ Logging & observability
Key insight: An LLM gateway is to LLMs what an API gateway (like Kong or Envoy) is to microservices. It’s the single point of control for all LLM traffic, and it becomes essential once you use more than one provider.
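The fallback behavior at the heart of a gateway can be sketched in a few lines. This is a minimal illustration, not a real gateway: the provider functions (`call_openai`, `call_anthropic`) are hypothetical stand-ins for actual SDK calls, with one simulated as failing.

```python
# Minimal sketch of gateway-style fallback dispatch.
# The provider functions are hypothetical stand-ins for real SDK calls.

def call_openai(messages):
    raise RuntimeError("429: rate limited")   # simulate a provider outage

def call_anthropic(messages):
    return "Hello from Claude"                # healthy provider

PROVIDERS = {
    "gpt-4o": call_openai,
    "claude-sonnet-4-20250514": call_anthropic,
}

def complete(model, messages, fallbacks=()):
    """Try the primary model, then each fallback in order."""
    last_error = None
    for candidate in (model, *fallbacks):
        try:
            return candidate, PROVIDERS[candidate](messages)
        except RuntimeError as exc:
            last_error = exc                  # record failure, try next model
    raise last_error

used, reply = complete(
    "gpt-4o",
    [{"role": "user", "content": "Hello!"}],
    fallbacks=["claude-sonnet-4-20250514"],
)
print(used, reply)  # the request falls through to the Anthropic stand-in
```

The application calls one function; which provider actually served the request is a gateway concern, invisible to the caller.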
LiteLLM
Open-source gateway for 100+ LLM providers
LiteLLM Overview
LiteLLM is the most popular open-source LLM gateway. It provides an OpenAI-compatible API that proxies to 100+ LLM providers (OpenAI, Anthropic, Google, Cohere, Mistral, Azure, Bedrock, self-hosted vLLM, Ollama, and more). Key features: unified interface (call any model with the same completion() function), load balancing (distribute across multiple API keys or deployments), fallbacks (automatic retry with alternative models on failure), budget management (set spend limits per user, team, or API key), and virtual keys (create API keys with custom rate limits and budgets). LiteLLM is MIT-licensed and can be self-hosted as a proxy server or used as a Python library.
LiteLLM Usage
# LiteLLM as a Python library
from litellm import completion

# Same interface, any provider
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

response = completion(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# As a proxy server (OpenAI-compatible)
# litellm --model gpt-4o
# → http://localhost:4000/v1/chat/completions

# With fallbacks
response = completion(
    model="gpt-4o",
    messages=messages,
    fallbacks=["claude-sonnet-4-20250514", "gemini-pro"]
)
Key insight: LiteLLM’s proxy mode is the most common deployment. Run it as a server, point all your apps at it, and manage routing, budgets, and fallbacks centrally — without changing any application code.
Portkey
Managed AI gateway with enterprise features
Portkey Overview
Portkey is a managed LLM gateway with enterprise-grade features. Beyond basic routing and fallbacks, Portkey offers: semantic caching (cache responses for semantically similar queries, reducing costs by up to 40%), guardrails (built-in content filtering and PII detection), detailed analytics (cost, latency, token usage dashboards per model, user, and team), request/response logging (full audit trail for compliance), and conditional routing (route based on request metadata, user tier, or content type). Portkey is SaaS-first with a self-hosted enterprise option. It’s the right choice for teams that want a managed solution with advanced observability.
Portkey Config
// Portkey gateway configuration
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "model": "gpt-4o",
      "weight": 0.7
    },
    {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514",
      "weight": 0.3,
      "on_status_codes": [429, 503]
    }
  ],
  "cache": { "mode": "semantic", "max_age": 3600 }
}
Key insight: Semantic caching is Portkey’s standout feature. Instead of exact-match caching, it recognizes that “What is the capital of France?” and “Tell me France’s capital city” should return the same cached response, dramatically reducing costs for repetitive workloads.
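The mechanics of a semantic cache are straightforward: embed the prompt, compare against cached embeddings, and return the stored response when similarity clears a threshold. The sketch below uses a bag-of-words vector as a toy stand-in for a real embedding model, so the threshold is illustrative, not Portkey's actual behavior.

```python
# Sketch of a semantic cache. A real gateway uses an embedding model;
# the bag-of-words vector here is a toy stand-in.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []   # list of (embedding, response)

    def get(self, prompt):
        vec = embed(prompt)
        for cached_vec, response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return response   # close enough: cache hit
        return None               # miss: caller goes to the LLM

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.6)
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))  # near-duplicate → hit
print(cache.get("how do transformers work"))         # unrelated → None
```

The threshold is the key tuning knob: too low and unrelated queries get wrong cached answers; too high and paraphrases miss.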
Model Routing Strategies
Sending the right request to the right model
Routing Patterns
Not every request needs GPT-4o. Smart routing saves money and reduces latency: Complexity-based routing — simple questions go to a cheap, fast model (GPT-4o-mini, Claude Haiku); complex reasoning goes to a powerful model (GPT-4o, Claude Sonnet). Cost-based routing — route to the cheapest model that meets quality requirements. Latency-based routing — route to the fastest available provider. Load-based routing — distribute across providers to stay under rate limits. Content-based routing — code questions to one model, creative writing to another. A classifier model (small, fast) can analyze the request and decide which model to route to.
Routing Examples
// Model routing strategies

Complexity-based:
  Simple Q&A → GPT-4o-mini   ($0.15/1M in)
  Reasoning  → Claude Sonnet ($3.00/1M in)
  Coding     → GPT-4o        ($2.50/1M in)
  // 70% of requests are simple → huge savings

Fallback chain:
  Try: GPT-4o
  If 429 (rate limit): → Claude Sonnet
  If 503 (down):       → Gemini Pro
  If all fail:         → queue + retry

Load balancing:
  OpenAI key 1: 40% of traffic
  OpenAI key 2: 40% of traffic
  Azure OpenAI: 20% of traffic
  // Spread across keys/regions

A/B routing:
  90% → current model (GPT-4o)
  10% → candidate (Claude Sonnet)
  // Compare quality + cost
Key insight: Complexity-based routing is the highest-ROI optimization. If 70% of your requests are simple enough for a $0.15/1M-token model instead of a $3/1M-token model, you cut costs by ~90% on those requests with minimal quality impact.
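A complexity router doesn't need to be sophisticated to pay off. The sketch below uses a keyword-and-length heuristic as the classifier; the hint lists, thresholds, and model assignments are illustrative assumptions, and production systems often replace the heuristic with a small classifier model.

```python
# Sketch of complexity-based routing with a cheap heuristic classifier.
# Hint lists, thresholds, and model choices are illustrative assumptions.

REASONING_HINTS = ("why", "prove", "explain step by step", "compare", "analyze")
CODE_HINTS = ("def ", "class ", "error", "traceback", "implement")

def route(prompt):
    text = prompt.lower()
    if any(h in text for h in CODE_HINTS):
        return "gpt-4o"                       # coding → strong general model
    if any(h in text for h in REASONING_HINTS) or len(text.split()) > 100:
        return "claude-sonnet-4-20250514"     # reasoning → powerful model
    return "gpt-4o-mini"                      # default: cheap, fast model

print(route("What time zone is Tokyo in?"))              # gpt-4o-mini
print(route("Explain step by step why the sky is blue")) # claude-sonnet-...
print(route("implement a linked list in Python"))        # gpt-4o
```

Even a crude classifier like this captures most of the savings, because the failure mode (occasionally over-routing a simple question to an expensive model) costs little.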
Rate Limiting & Quota Management
Preventing runaway costs and API exhaustion
Rate Limit Challenges
Every LLM provider enforces rate limits: RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day). Hit the limit and you get HTTP 429 errors. Without management, a single user or feature can exhaust your entire quota. The gateway handles this with: per-user rate limits (each user gets a fair share), per-team budgets (engineering gets $5K/month, marketing gets $2K/month), token-level throttling (limit tokens per minute, not just requests), queue-based smoothing (queue excess requests instead of rejecting), and multi-key distribution (spread requests across multiple API keys to increase effective limits).
Rate Limit Config
# LiteLLM budget/rate limit config (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 500       # requests/min
      tpm: 80000     # tokens/min
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2   # 2nd key
      rpm: 500

general_settings:
  max_budget: 10000   # $10K total/month

# Per-user virtual keys
# litellm /key/generate \
#   --max_budget 500 \
#   --tpm_limit 10000 \
#   --team_id engineering
Key insight: Set budget alerts at 50%, 80%, and 95% of your monthly limit. LLM costs can spike unexpectedly — a single prompt engineering experiment with long contexts can burn through hundreds of dollars in minutes.
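Behind settings like `rpm` and `tpm` sits a rate-limiting algorithm, commonly a token bucket: requests drain tokens, time refills them, and a burst beyond capacity is rejected or queued. The sketch below is a minimal single-bucket version with an explicit clock (the `now` parameter) so the behavior is deterministic; capacities and refill rates are illustrative.

```python
# Sketch of token-bucket rate limiting, the mechanism typically behind
# per-user rpm/tpm limits. Capacity and refill rate are illustrative.

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill according to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller should queue the request or return HTTP 429

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow(now=0.0) for _ in range(4)]
print(results)                 # burst of 3 allowed, 4th rejected
print(bucket.allow(now=2.0))   # tokens refilled after 2 seconds → allowed
```

Token-level throttling is the same structure with `cost` set to the request's token count instead of 1, which is why gateways can enforce TPM and RPM with one mechanism.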
Caching Strategies
Avoiding redundant LLM calls
Caching Types
LLM calls are expensive and slow. Caching avoids redundant calls: Exact-match caching — cache the response for identical prompts. Simple, effective for deterministic queries (temperature=0). Semantic caching — embed the prompt, find similar cached prompts within a threshold, return the cached response. Catches paraphrases. Prompt prefix caching — some providers (Anthropic, OpenAI) cache the KV states of long system prompts, so only the user message is processed. Reduces latency and cost for apps with long, static system prompts. RAG caching — cache retrieved documents so the same query doesn’t hit the vector database again. Layer these: prompt prefix caching (provider-side) + semantic caching (gateway-side) + application-level caching.
Caching Layers
// LLM caching strategies

1. Exact Match (gateway):
   Key: hash(model + messages + params)
   Hit rate: 10-30% (typical)
   Best for: deterministic queries

2. Semantic Cache (gateway):
   Key: embedding similarity > 0.95
   Hit rate: 20-50% (with good threshold)
   Best for: customer support, FAQ

3. Prompt Prefix Cache (provider):
   OpenAI: automatic for long prompts (cached_tokens in response)
   Anthropic: opt-in via cache_control breakpoints
   Saves: 50-90% on long system prompts

4. Application Cache (app layer):
   Cache final answers, not LLM responses
   Redis/Memcached, TTL based on data freshness

// Combined savings: 30-60% cost reduction
Key insight: Prompt prefix caching is nearly free to adopt: OpenAI applies it automatically to long prompts, while Anthropic requires explicit cache_control markers. If your system prompt is 2,000+ tokens, it’s worth enabling. On Anthropic, cached input tokens cost 90% less than uncached ones.
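The gateway-side exact-match layer can be sketched concisely: hash the full request (model + messages + parameters) into a cache key and store the response with a TTL. The sketch below uses an in-memory dict and an explicit clock for determinism; a real gateway would back this with Redis.

```python
# Sketch of exact-match caching keyed on hash(model + messages + params),
# with a TTL. An in-memory dict stands in for Redis here.
import hashlib
import json

class ExactCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}   # key → (stored_at, response)

    @staticmethod
    def key(model, messages, **params):
        # Canonical JSON so dict ordering doesn't change the key.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, k, now):
        entry = self.store.get(k)
        if entry and now - entry[0] < self.ttl:
            return entry[1]   # fresh hit
        return None           # miss or expired

    def put(self, k, response, now):
        self.store[k] = (now, response)

cache = ExactCache(ttl_seconds=3600)
k = ExactCache.key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
cache.put(k, "Hello!", now=0)
print(cache.get(k, now=10))     # within TTL → "Hello!"
print(cache.get(k, now=4000))   # past TTL → None
```

Including sampling parameters like `temperature` in the key matters: the same prompt at temperature 0 and 0.9 should not share a cache entry.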
Cost Tracking & Optimization
Understanding and controlling LLM spend
Cost Management
LLM costs are per-token, which makes them hard to predict. A gateway provides visibility: cost per request (input tokens × input price + output tokens × output price), cost per user/team (who’s spending the most?), cost per feature (which product feature drives the most LLM spend?), and cost trends (is spend growing linearly or exponentially?). Optimization levers: model selection (use cheaper models where quality allows), prompt optimization (shorter prompts = fewer tokens = lower cost), caching (avoid duplicate calls), output length limits (set max_tokens appropriately), and batch processing (some providers offer 50% discounts for async batch API calls).
Cost Breakdown
// LLM cost tracking

Per-request cost:
  Input:  1,500 tokens × $2.50/1M  = $0.00375
  Output:   500 tokens × $10.00/1M = $0.00500
  Total:  $0.00875 per request

At scale:
  100K requests/day = $875/day = $26K/month

Optimization:
  Route 70% to mini model:
    70K × $0.0002  = $14/day
    30K × $0.00875 = $263/day
    Total: $277/day = $8.3K/month
    // 68% cost reduction!
  + Caching (30% hit rate):
    $8.3K × 0.7 = $5.8K/month
    // 78% total reduction
Key insight: Output tokens are 2–4x more expensive than input tokens on most providers. Controlling output length (max_tokens) and using structured output (JSON mode) to avoid verbose responses are the easiest cost optimizations.
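The per-request arithmetic above is simple enough to compute in the gateway on every response. A minimal sketch, using the illustrative per-million-token prices from the breakdown:

```python
# Sketch of per-request cost accounting, reproducing the arithmetic above.
# Prices are illustrative per-million-token rates, not a live price list.

PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}  # $ per 1M tokens

def request_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1500, output_tokens=500)
print(f"${cost:.5f} per request")                          # $0.00875
print(f"${cost * 100_000:,.0f}/day at 100K requests")      # $875/day
```

Tagging each request with user, team, and feature metadata before summing these costs is what turns the raw numbers into the per-team and per-feature dashboards described above.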
Choosing a Gateway
LiteLLM vs. Portkey vs. building your own
Decision Framework
Choose LiteLLM if: you want open-source, self-hosted control, need to support many providers, and have engineering capacity to operate it. Choose Portkey if: you want a managed service with semantic caching, advanced analytics, and enterprise compliance features. Choose OpenRouter if: you want a simple pay-per-use proxy without running infrastructure. Build your own if: you have very specific routing logic, need deep integration with internal systems, or have strict data residency requirements. Most teams should start with LiteLLM (free, self-hosted, covers 90% of needs) and evaluate Portkey if they need advanced observability or semantic caching.
Gateway Comparison
// LLM gateway comparison

            LiteLLM     Portkey      OpenRouter
License:    MIT         Commercial   Commercial
Hosting:    Self        Cloud/Self   Cloud only
Providers:  100+        100+         50+
Fallbacks:  Yes         Yes          Limited
Caching:    Exact       Semantic     No
Analytics:  Basic       Advanced     Basic
Guardrails: No          Yes          No
Cost:       Free        $$           Per-token
Best for:   Self-host   Enterprise   Simple proxy

// Recommendation:
// Start  → LiteLLM    (free, flexible)
// Scale  → Portkey    (if you need analytics)
// Simple → OpenRouter (zero setup)
Key insight: The gateway is the control plane for your LLM operations. Once in place, you can change models, add providers, adjust routing, and control costs — all without touching application code. It’s the single highest-leverage LLMOps investment.