Ch 7 — LLMOps: Gateways & Routing

LLM gateways (LiteLLM, Portkey), model routing, fallbacks, rate limiting, and cost tracking
High Level

App → Gateway → Route → LLM → Cache → Observe
What Is an LLM Gateway?
A unified proxy between your app and LLM providers
The Problem
Modern applications use multiple LLM providers: OpenAI for GPT-4o, Anthropic for Claude, Google for Gemini, self-hosted LLaMA via vLLM. Each has a different API, different pricing, different rate limits, and different failure modes. Without a gateway, your application code is littered with provider-specific logic, retry handling, and cost tracking. An LLM gateway sits between your application and the LLM providers, providing: unified API (one interface for all providers), routing (send requests to the right model), fallbacks (switch providers on failure), rate limiting (prevent quota exhaustion), caching (avoid duplicate calls), and observability (logging, cost tracking, latency metrics).
Gateway Architecture
// Without gateway
App → OpenAI API      (custom code)
App → Anthropic API   (different code)
App → vLLM API        (yet another format)
// 3 integrations, 3 retry logics, 3 cost trackers

// With gateway
App → LLM Gateway → OpenAI
                  → Anthropic
                  → vLLM
                  → Gemini
// 1 integration, unified retry/fallback/cost

Gateway provides:
✓ Unified OpenAI-compatible API
✓ Automatic fallbacks on failure
✓ Rate limiting & quota management
✓ Semantic caching
✓ Cost tracking per request
✓ Logging & observability
Key insight: An LLM gateway is to LLMs what an API gateway (like Kong or Envoy) is to microservices. It’s the single point of control for all LLM traffic, and it becomes essential once you use more than one provider.
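The fallback behavior at the heart of a gateway can be sketched in a few lines. This is a minimal illustration, not a real gateway: the provider functions (`call_openai`, `call_anthropic`) are hypothetical stand-ins for actual SDK calls, with one simulated as failing.

```python
# Minimal sketch of gateway-style fallback dispatch.
# The provider functions are hypothetical stand-ins for real SDK calls.

def call_openai(messages):
    raise RuntimeError("429: rate limited")   # simulate a provider outage

def call_anthropic(messages):
    return "Hello from Claude"                # healthy provider

PROVIDERS = {
    "gpt-4o": call_openai,
    "claude-sonnet-4-20250514": call_anthropic,
}

def complete(model, messages, fallbacks=()):
    """Try the primary model, then each fallback in order."""
    last_error = None
    for candidate in (model, *fallbacks):
        try:
            return candidate, PROVIDERS[candidate](messages)
        except RuntimeError as exc:
            last_error = exc                  # record failure, try next model
    raise last_error

used, reply = complete(
    "gpt-4o",
    [{"role": "user", "content": "Hello!"}],
    fallbacks=["claude-sonnet-4-20250514"],
)
print(used, reply)  # the request falls through to the Anthropic stand-in
```

The application calls one function; which provider actually served the request is a gateway concern, invisible to the caller.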
LiteLLM
Open-source gateway for 100+ LLM providers
LiteLLM Overview
LiteLLM is the most popular open-source LLM gateway. It provides an OpenAI-compatible API that proxies to 100+ LLM providers (OpenAI, Anthropic, Google, Cohere, Mistral, Azure, Bedrock, self-hosted vLLM, Ollama, and more). Key features: unified interface (call any model with the same completion() function), load balancing (distribute across multiple API keys or deployments), fallbacks (automatic retry with alternative models on failure), budget management (set spend limits per user, team, or API key), and virtual keys (create API keys with custom rate limits and budgets). LiteLLM is MIT-licensed and can be self-hosted as a proxy server or used as a Python library.
LiteLLM Usage
# LiteLLM as a Python library
from litellm import completion

# Same interface, any provider
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

response = completion(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# As a proxy server (OpenAI-compatible)
# litellm --model gpt-4o
# → http://localhost:4000/v1/chat/completions

# With fallbacks
response = completion(
    model="gpt-4o",
    messages=messages,
    fallbacks=["claude-sonnet-4-20250514", "gemini-pro"]
)
Key insight: LiteLLM’s proxy mode is the most common deployment. Run it as a server, point all your apps at it, and manage routing, budgets, and fallbacks centrally — without changing any application code.
Portkey
Managed AI gateway with enterprise features
Portkey Overview
Portkey is a managed LLM gateway with enterprise-grade features. Beyond basic routing and fallbacks, Portkey offers: semantic caching (cache responses for semantically similar queries, reducing costs by up to 40%), guardrails (built-in content filtering and PII detection), detailed analytics (cost, latency, token usage dashboards per model, user, and team), request/response logging (full audit trail for compliance), and conditional routing (route based on request metadata, user tier, or content type). Portkey is SaaS-first with a self-hosted enterprise option. It’s the right choice for teams that want a managed solution with advanced observability.
Portkey Config
// Portkey gateway configuration
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "model": "gpt-4o",
      "weight": 0.7
    },
    {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514",
      "weight": 0.3,
      "on_status_codes": [429, 503]
    }
  ],
  "cache": { "mode": "semantic", "max_age": 3600 }
}
Key insight: Semantic caching is Portkey’s standout feature. Instead of exact-match caching, it recognizes that “What is the capital of France?” and “Tell me France’s capital city” should return the same cached response, dramatically reducing costs for repetitive workloads.
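The mechanics of a semantic cache are straightforward: embed the prompt, compare against cached embeddings, and return the stored response when similarity clears a threshold. The sketch below uses a bag-of-words vector as a toy stand-in for a real embedding model, so the threshold is illustrative, not Portkey's actual behavior.

```python
# Sketch of a semantic cache. A real gateway uses an embedding model;
# the bag-of-words vector here is a toy stand-in.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []   # list of (embedding, response)

    def get(self, prompt):
        vec = embed(prompt)
        for cached_vec, response in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return response   # close enough: cache hit
        return None               # miss: caller goes to the LLM

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.6)
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))  # near-duplicate → hit
print(cache.get("how do transformers work"))         # unrelated → None
```

The threshold is the key tuning knob: too low and unrelated queries get wrong cached answers; too high and paraphrases miss.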
Model Routing Strategies
Sending the right request to the right model
Routing Patterns
Not every request needs GPT-4o. Smart routing saves money and reduces latency: Complexity-based routing — simple questions go to a cheap, fast model (GPT-4o-mini, Claude Haiku); complex reasoning goes to a powerful model (GPT-4o, Claude Sonnet). Cost-based routing — route to the cheapest model that meets quality requirements. Latency-based routing — route to the fastest available provider. Load-based routing — distribute across providers to stay under rate limits. Content-based routing — code questions to one model, creative writing to another. A classifier model (small, fast) can analyze the request and decide which model to route to.
Routing Examples
// Model routing strategies

Complexity-based:
  Simple Q&A → GPT-4o-mini   ($0.15/1M in)
  Reasoning  → Claude Sonnet ($3.00/1M in)
  Coding     → GPT-4o        ($2.50/1M in)
  // 70% of requests are simple → huge savings

Fallback chain:
  Try: GPT-4o
  If 429 (rate limit): → Claude Sonnet
  If 503 (down):       → Gemini Pro
  If all fail:         → queue + retry

Load balancing:
  OpenAI key 1: 40% of traffic
  OpenAI key 2: 40% of traffic
  Azure OpenAI: 20% of traffic
  // Spread across keys/regions

A/B routing:
  90% → current model (GPT-4o)
  10% → candidate (Claude Sonnet)
  // Compare quality + cost
Key insight: Complexity-based routing is the highest-ROI optimization. If 70% of your requests are simple enough for a $0.15/1M-token model instead of a $3/1M-token model, you cut costs by ~90% on those requests with minimal quality impact.
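A complexity router doesn't need to be sophisticated to pay off. The sketch below uses a keyword-and-length heuristic as the classifier; the hint lists, thresholds, and model assignments are illustrative assumptions, and production systems often replace the heuristic with a small classifier model.

```python
# Sketch of complexity-based routing with a cheap heuristic classifier.
# Hint lists, thresholds, and model choices are illustrative assumptions.

REASONING_HINTS = ("why", "prove", "explain step by step", "compare", "analyze")
CODE_HINTS = ("def ", "class ", "error", "traceback", "implement")

def route(prompt):
    text = prompt.lower()
    if any(h in text for h in CODE_HINTS):
        return "gpt-4o"                       # coding → strong general model
    if any(h in text for h in REASONING_HINTS) or len(text.split()) > 100:
        return "claude-sonnet-4-20250514"     # reasoning → powerful model
    return "gpt-4o-mini"                      # default: cheap, fast model

print(route("What time zone is Tokyo in?"))              # gpt-4o-mini
print(route("Explain step by step why the sky is blue")) # claude-sonnet-...
print(route("implement a linked list in Python"))        # gpt-4o
```

Even a crude classifier like this captures most of the savings, because the failure mode (occasionally over-routing a simple question to an expensive model) costs little.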
Rate Limiting & Quota Management
Preventing runaway costs and API exhaustion
Rate Limit Challenges
Every LLM provider enforces rate limits: RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day). Hit the limit and you get HTTP 429 errors. Without management, a single user or feature can exhaust your entire quota. The gateway handles this with: per-user rate limits (each user gets a fair share), per-team budgets (engineering gets $5K/month, marketing gets $2K/month), token-level throttling (limit tokens per minute, not just requests), queue-based smoothing (queue excess requests instead of rejecting), and multi-key distribution (spread requests across multiple API keys to increase effective limits).
Rate Limit Config
# LiteLLM budget/rate limit config (config.yaml)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-1
      rpm: 500       # requests/min
      tpm: 80000     # tokens/min
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-key-2   # 2nd key
      rpm: 500

general_settings:
  max_budget: 10000   # $10K total/month

# Per-user virtual keys
# litellm /key/generate \
#   --max_budget 500 \
#   --tpm_limit 10000 \
#   --team_id engineering
Key insight: Set budget alerts at 50%, 80%, and 95% of your monthly limit. LLM costs can spike unexpectedly — a single prompt engineering experiment with long contexts can burn through hundreds of dollars in minutes.
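Behind settings like `rpm` and `tpm` sits a rate-limiting algorithm, commonly a token bucket: requests drain tokens, time refills them, and a burst beyond capacity is rejected or queued. The sketch below is a minimal single-bucket version with an explicit clock (the `now` parameter) so the behavior is deterministic; capacities and refill rates are illustrative.

```python
# Sketch of token-bucket rate limiting, the mechanism typically behind
# per-user rpm/tpm limits. Capacity and refill rate are illustrative.

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill according to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller should queue the request or return HTTP 429

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow(now=0.0) for _ in range(4)]
print(results)                 # burst of 3 allowed, 4th rejected
print(bucket.allow(now=2.0))   # tokens refilled after 2 seconds → allowed
```

Token-level throttling is the same structure with `cost` set to the request's token count instead of 1, which is why gateways can enforce TPM and RPM with one mechanism.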
Caching Strategies
Avoiding redundant LLM calls
Caching Types
LLM calls are expensive and slow. Caching avoids redundant calls: Exact-match caching — cache the response for identical prompts. Simple, effective for deterministic queries (temperature=0). Semantic caching — embed the prompt, find similar cached prompts within a threshold, return the cached response. Catches paraphrases. Prompt prefix caching — some providers (Anthropic, OpenAI) cache the KV states of long system prompts, so only the user message is processed. Reduces latency and cost for apps with long, static system prompts. RAG caching — cache retrieved documents so the same query doesn’t hit the vector database again. Layer these: prompt prefix caching (provider-side) + semantic caching (gateway-side) + application-level caching.
Caching Layers
// LLM caching strategies

1. Exact Match (gateway):
   Key: hash(model + messages + params)
   Hit rate: 10-30% (typical)
   Best for: deterministic queries

2. Semantic Cache (gateway):
   Key: embedding similarity > 0.95
   Hit rate: 20-50% (with good threshold)
   Best for: customer support, FAQ

3. Prompt Prefix Cache (provider):
   OpenAI: automatic for long prompts (cached_tokens in response)
   Anthropic: opt-in via cache_control breakpoints
   Saves: 50-90% on long system prompts

4. Application Cache (app layer):
   Cache final answers, not LLM responses
   Redis/Memcached, TTL based on data freshness

// Combined savings: 30-60% cost reduction
Key insight: Prompt prefix caching is nearly free to adopt: OpenAI applies it automatically to long prompts, while Anthropic requires explicit cache_control markers. If your system prompt is 2,000+ tokens, it’s worth enabling. On Anthropic, cached input tokens cost 90% less than uncached ones.
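The gateway-side exact-match layer can be sketched concisely: hash the full request (model + messages + parameters) into a cache key and store the response with a TTL. The sketch below uses an in-memory dict and an explicit clock for determinism; a real gateway would back this with Redis.

```python
# Sketch of exact-match caching keyed on hash(model + messages + params),
# with a TTL. An in-memory dict stands in for Redis here.
import hashlib
import json

class ExactCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}   # key → (stored_at, response)

    @staticmethod
    def key(model, messages, **params):
        # Canonical JSON so dict ordering doesn't change the key.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, k, now):
        entry = self.store.get(k)
        if entry and now - entry[0] < self.ttl:
            return entry[1]   # fresh hit
        return None           # miss or expired

    def put(self, k, response, now):
        self.store[k] = (now, response)

cache = ExactCache(ttl_seconds=3600)
k = ExactCache.key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
cache.put(k, "Hello!", now=0)
print(cache.get(k, now=10))     # within TTL → "Hello!"
print(cache.get(k, now=4000))   # past TTL → None
```

Including sampling parameters like `temperature` in the key matters: the same prompt at temperature 0 and 0.9 should not share a cache entry.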
Cost Tracking & Optimization
Understanding and controlling LLM spend
Cost Management
LLM costs are per-token, which makes them hard to predict. A gateway provides visibility: cost per request (input tokens × input price + output tokens × output price), cost per user/team (who’s spending the most?), cost per feature (which product feature drives the most LLM spend?), and cost trends (is spend growing linearly or exponentially?). Optimization levers: model selection (use cheaper models where quality allows), prompt optimization (shorter prompts = fewer tokens = lower cost), caching (avoid duplicate calls), output length limits (set max_tokens appropriately), and batch processing (some providers offer 50% discounts for async batch API calls).
Cost Breakdown
// LLM cost tracking

Per-request cost:
  Input:  1,500 tokens × $2.50/1M  = $0.00375
  Output:   500 tokens × $10.00/1M = $0.00500
  Total:  $0.00875 per request

At scale:
  100K requests/day = $875/day = $26K/month

Optimization:
  Route 70% to mini model:
    70K × $0.0002  = $14/day
    30K × $0.00875 = $263/day
    Total: $277/day = $8.3K/month
    // 68% cost reduction!
  + Caching (30% hit rate):
    $8.3K × 0.7 = $5.8K/month
    // 78% total reduction
Key insight: Output tokens are 2–4x more expensive than input tokens on most providers. Controlling output length (max_tokens) and using structured output (JSON mode) to avoid verbose responses are the easiest cost optimizations.
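The per-request arithmetic above is simple enough to compute in the gateway on every response. A minimal sketch, using the illustrative per-million-token prices from the breakdown:

```python
# Sketch of per-request cost accounting, reproducing the arithmetic above.
# Prices are illustrative per-million-token rates, not a live price list.

PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}  # $ per 1M tokens

def request_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("gpt-4o", input_tokens=1500, output_tokens=500)
print(f"${cost:.5f} per request")                          # $0.00875
print(f"${cost * 100_000:,.0f}/day at 100K requests")      # $875/day
```

Tagging each request with user, team, and feature metadata before summing these costs is what turns the raw numbers into the per-team and per-feature dashboards described above.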
Choosing a Gateway
LiteLLM vs. Portkey vs. building your own
Decision Framework
Choose LiteLLM if: you want open-source, self-hosted control, need to support many providers, and have engineering capacity to operate it. Choose Portkey if: you want a managed service with semantic caching, advanced analytics, and enterprise compliance features. Choose OpenRouter if: you want a simple pay-per-use proxy without running infrastructure. Build your own if: you have very specific routing logic, need deep integration with internal systems, or have strict data residency requirements. Most teams should start with LiteLLM (free, self-hosted, covers 90% of needs) and evaluate Portkey if they need advanced observability or semantic caching.
Gateway Comparison
// LLM gateway comparison

            LiteLLM     Portkey      OpenRouter
License:    MIT         Commercial   Commercial
Hosting:    Self        Cloud/Self   Cloud only
Providers:  100+        100+         50+
Fallbacks:  Yes         Yes          Limited
Caching:    Exact       Semantic     No
Analytics:  Basic       Advanced     Basic
Guardrails: No          Yes          No
Cost:       Free        $$           Per-token
Best for:   Self-host   Enterprise   Simple proxy

// Recommendation:
// Start  → LiteLLM    (free, flexible)
// Scale  → Portkey    (if you need analytics)
// Simple → OpenRouter (zero setup)
Key insight: The gateway is the control plane for your LLM operations. Once in place, you can change models, add providers, adjust routing, and control costs — all without touching application code. It’s the single highest-leverage LLMOps investment.