Ch 8 — LLMOps: Prompt Management & Evaluation

Prompt versioning, A/B testing prompts, LLM evaluation in CI, and guardrails integration
High Level
Author → Version → Evaluate → A/B Test → Guard → Deploy
Prompts Are the New Code
Why prompts need the same rigor as software
The Shift
In traditional ML, you engineer features. In LLM applications, you engineer prompts. A prompt change can completely alter your application’s behavior — just like a code change. But most teams treat prompts casually: they live in hardcoded strings, get edited in production without review, and have no version history. This is the equivalent of editing production code directly on the server. Prompt management applies software engineering discipline to prompts: version control, code review, testing, staging environments, and rollback. Teams that handle 10K+ daily LLM queries, have multiple engineers editing prompts, or run 5+ distinct prompts across features need formal prompt management.
Prompt Anti-Patterns
// Anti-patterns (how most teams start)

1. Hardcoded strings:
   prompt = "You are a helpful assistant..."
   // buried in application code
   // changes require full deploy

2. No version history:
   // "Who changed the prompt last week?"
   // "What did it say before?"
   // No one knows.

3. No testing:
   // "I tested it on 3 examples in ChatGPT"
   // Ships to production. Breaks edge cases.

4. No review:
   // One person edits, no one reviews
   // Subtle regressions go unnoticed

// The fix: treat prompts like code
// Version → Test → Review → Stage → Deploy
Key insight: A single word change in a prompt can shift model behavior dramatically. “You must always” vs. “You should try to” produces very different outputs. Prompts deserve the same change management rigor as code.
Prompt Versioning
Semantic versioning and registries for prompts
Versioning Strategy
Apply semantic versioning to prompts: MAJOR (output format changes — e.g., switching from free text to JSON), MINOR (new capabilities — e.g., adding a new instruction), PATCH (wording fixes that don’t change behavior). Store prompts in a prompt registry — a centralized store separate from application code. This decouples prompt changes from code deploys, enabling faster iteration. Tools: MLflow Prompt Registry (open-source, integrates with MLflow ecosystem), Langfuse (open-source observability + prompt management), Humanloop (managed platform with evaluation), PromptLayer (logging + versioning). Each version should have a commit message explaining what changed and why.
Prompt Registry
# MLflow Prompt Registry example
import mlflow

# Register a prompt
prompt = mlflow.register_prompt(
    name="customer-support-v1",
    template="""You are a support agent for {{company}}.
Answer the customer's question using only the provided context.
If unsure, say "I don't know."

Context: {{context}}
Question: {{question}}""",
)

# Load in application (by alias URI)
prompt = mlflow.load_prompt(
    "prompts:/customer-support-v1@production"
)

# Promote new version
client = mlflow.MlflowClient()
client.set_prompt_alias(
    name="customer-support-v1",
    alias="production",
    version=3,
)
# → No code deploy needed!
Key insight: Decoupling prompts from code means you can update a prompt in seconds (change the alias in the registry) instead of minutes or hours (full code deploy). This is critical for rapid iteration and incident response.
LLM Evaluation in CI
Automated testing for prompt changes
Evaluation Framework
Every prompt change should trigger an evaluation suite before deployment. The suite runs the new prompt against a curated test dataset (50–200 examples covering normal cases, edge cases, and known failure modes) and measures: correctness (does the output match expected answers?), format compliance (is the output valid JSON/markdown as required?), safety (does it refuse harmful requests?), consistency (does it give similar answers to similar questions?), and regression (is it at least as good as the current production prompt?). Use LLM-as-judge for subjective quality (have GPT-4o rate outputs on a 1–5 scale) and deterministic checks for format and safety.
Eval Pipeline
# prompt_eval.py (runs in CI)
import json

def evaluate_prompt(prompt_version, test_data):
    results = []
    for case in test_data:
        output = call_llm(prompt_version, case.input)
        results.append({
            "correct": judge_correctness(output, case.expected),
            "format_ok": is_valid_json(output),
            "safe": passes_safety_check(output),
            "tokens": count_tokens(output),
            "latency_ms": measure_latency(),
        })
    metrics = aggregate(results)
    assert metrics["correctness"] >= 0.85
    assert metrics["format_compliance"] >= 0.95
    assert metrics["safety_pass_rate"] == 1.0
    return metrics
Key insight: The test dataset is your most valuable asset. Curate it carefully: include real user queries that caused problems, edge cases from production logs, and adversarial examples. Update it continuously as you discover new failure modes.
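The deterministic checks referenced in the pipeline (`is_valid_json`, `passes_safety_check`) can be sketched in a few lines. This is an illustrative sketch: the blocklist below is a stand-in, not a real safety layer, which would typically add a classifier or LLM-based check.

```python
import json

def is_valid_json(output: str) -> bool:
    """Deterministic format check: the output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def passes_safety_check(output: str) -> bool:
    """Cheap deterministic safety gate: reject known-bad patterns.
    The blocklist is illustrative only; a real suite layers on
    classifier- or LLM-based checks for anything subtle."""
    blocklist = ["ssn:", "credit card number"]  # illustrative
    lowered = output.lower()
    return not any(term in lowered for term in blocklist)
```

Checks like these cost microseconds, so they can run on every test case, while LLM-as-judge scoring is reserved for the subjective metrics.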
A/B Testing Prompts
Comparing prompt versions in production
A/B Testing Workflow
Offline evaluation catches most issues, but some things can only be measured in production with real users. A/B testing prompts: assign users randomly to prompt variant A (current) or variant B (candidate), measure business metrics (not just LLM metrics), and promote the winner. Key metrics to compare: task completion rate (did the user accomplish their goal?), user satisfaction (thumbs up/down, CSAT), escalation rate (did the user need human help?), cost per interaction, and latency. Tools: Langfuse (open-source, tracks metrics per prompt variant), Humanloop (managed, with built-in A/B testing), or custom implementation with feature flags (LaunchDarkly, Statsig).
A/B Test Setup
// Prompt A/B testing with Langfuse

Variant A (production, 90% traffic):
  prompt: customer-support@production
  version: 2.1.0

Variant B (candidate, 10% traffic):
  prompt: customer-support@staging
  version: 2.2.0

Metrics tracked:
  ✓ Task completion:      78% vs 82%
  ✓ User satisfaction:    4.1 vs 4.3
  ✓ Escalation rate:      12% vs 9%
  ✓ Avg cost/interaction: $0.04 vs $0.03
  ✓ p95 latency:          1.2s vs 1.1s

Decision: Variant B wins on all metrics
→ Promote to 100% after 7 days
→ Statistical significance: p < 0.05
Key insight: A/B test on business metrics (task completion, satisfaction), not LLM metrics (perplexity, BLEU). A prompt that scores lower on automated evals but higher on user satisfaction is the better prompt.
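The random assignment step can be implemented without any external tool. A minimal sketch (the 10% rollout figure mirrors the example above): hashing the user id gives stable, deterministic bucketing, so a user always sees the same variant across requests.

```python
import hashlib

def assign_variant(user_id: str, rollout_pct: int = 10) -> str:
    """Deterministically bucket a user into prompt variant A or B.
    Hashing the user id keeps assignment stable across requests
    and sessions, with roughly rollout_pct% of users in variant B."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform in 0-99
    return "B" if bucket < rollout_pct else "A"
```

Feature-flag services (LaunchDarkly, Statsig) do essentially this, plus targeting rules and gradual ramp-up.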
LLM Observability
Logging, tracing, and debugging LLM applications
What to Observe
LLM applications are non-deterministic — the same input can produce different outputs. Observability is essential for debugging and improvement. Log: full request/response (prompt, completion, model, parameters), latency breakdown (time to first token, total generation time, retrieval time for RAG), token usage (input/output tokens, cost), user feedback (thumbs up/down, corrections), and traces (for multi-step agents: each tool call, retrieval, and LLM call in the chain). Tools: Langfuse (open-source, best for tracing), LangSmith (by LangChain, tight integration), Arize Phoenix (open-source, evaluation + tracing), Portkey (gateway + observability combined).
Langfuse Tracing
# Langfuse observability example
from langfuse import Langfuse

lf = Langfuse()

# Create a trace for each user interaction
trace = lf.trace(
    name="customer-query",
    user_id="user-123",
    metadata={"feature": "support-chat"},
)

# Log retrieval step
retrieval = trace.span(name="rag-retrieval")
docs = retrieve(query)
retrieval.end(output={"docs": len(docs)})

# Log LLM call
generation = trace.generation(
    name="answer",
    model="gpt-4o",
    input=messages,
    output=response,
    usage={"input": 1200, "output": 350},
)

# Log user feedback
trace.score(name="user-rating", value=1)
Key insight: Tracing is essential for debugging multi-step LLM applications (RAG, agents). When a user gets a bad answer, you need to see: was the retrieval wrong? Was the prompt wrong? Was the model hallucinating? Traces answer these questions.
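Once token usage is logged per trace, cost per interaction falls out of a simple lookup. A sketch, assuming illustrative per-million-token prices (placeholders; check your provider's current pricing):

```python
# Per-1M-token prices in USD. These are placeholder figures for
# illustration; real prices change and must come from your provider.
PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def interaction_cost(model: str, input_tokens: int,
                     output_tokens: int) -> float:
    """Dollar cost of one LLM call from logged token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"]
            + output_tokens * p["output"]) / 1_000_000
```

For the 1200-input / 350-output trace above, these placeholder prices give roughly $0.0065; aggregating this per user or per feature is how the "cost per interaction" A/B metric is computed.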
Guardrails
Preventing harmful, off-topic, and incorrect outputs
Guardrail Types
Guardrails are automated checks that validate LLM inputs and outputs before they reach the user. Input guardrails: detect and block prompt injection attempts, PII in user input, off-topic requests, and jailbreak attempts. Output guardrails: detect hallucinations (check claims against source documents), block toxic/harmful content, enforce format compliance (valid JSON, correct schema), and prevent data leakage (model revealing system prompt or training data). Tools: Guardrails AI (open-source, Python validators), NeMo Guardrails (NVIDIA, dialog-level safety), Lakera Guard (managed, prompt injection detection), and LLM-based validators (use a fast model to check the output of a powerful model).
Guardrails Example
# Guardrails AI example
import openai
from guardrails import Guard
from guardrails.hub import (
    ToxicLanguage,
    DetectPII,
    ValidJSON,
)

guard = Guard().use_many(
    ToxicLanguage(on_fail="fix"),
    DetectPII(
        pii_entities=["EMAIL", "PHONE", "SSN"],
        on_fail="fix",  # redact PII
    ),
    ValidJSON(on_fail="reask"),
)

result = guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4o",
    messages=messages,
)
# result.validated_output → safe, clean output
# result.validation_passed → True/False
Key insight: Guardrails add latency (50–200ms per check). Use fast, deterministic checks (regex, schema validation) for every request, and reserve expensive LLM-based checks (hallucination detection) for high-risk outputs.
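The tiering described above can be sketched directly. In this hypothetical example, a deliberately simple email regex stands in for the cheap deterministic layer, and `llm_check` is a placeholder for an expensive LLM-based validator that only runs on high-risk outputs:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_pii(text: str) -> str:
    """Fast deterministic guardrail: redact email addresses.
    Regex checks cost microseconds, so they run on every request."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def guard_output(text: str, high_risk: bool, llm_check=None) -> str:
    """Tiered guardrail: cheap checks always, expensive LLM-based
    check (injected as `llm_check`, hypothetical) only when the
    output is flagged high-risk."""
    text = redact_pii(text)
    if high_risk and llm_check is not None:
        text = llm_check(text)  # e.g. hallucination detection
    return text
```

The design choice is the same one the insight states: pay 50-200ms of judge latency only where the blast radius justifies it.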
LLM-as-Judge
Using LLMs to evaluate LLM outputs
How It Works
LLM-as-judge uses a powerful LLM (typically GPT-4o or Claude Sonnet) to evaluate the quality of outputs from another LLM. The judge receives: the original question, the generated answer, optionally a reference answer, and a rubric (scoring criteria). It returns a score (1–5) and reasoning. This is cheaper and faster than human evaluation while correlating well with human judgments (70–85% agreement). Common rubrics: correctness (factually accurate?), helpfulness (answers the question?), relevance (stays on topic?), coherence (well-structured?), and safety (no harmful content?). Limitations: judges have biases (prefer verbose answers, prefer their own outputs), so calibrate against human labels.
LLM-as-Judge Prompt
// LLM-as-Judge evaluation prompt

System: You are an expert evaluator.
Rate the answer on a scale of 1-5.

Rubric:
  5 = Perfect, complete, accurate
  4 = Good, minor issues
  3 = Acceptable, some gaps
  2 = Poor, significant errors
  1 = Unacceptable, wrong or harmful

Question: {{question}}
Answer: {{answer}}
Reference: {{reference}} (optional)

Respond in JSON:
{
  "score": <1-5>,
  "reasoning": "..."
}

// Cost: ~$0.01 per evaluation
// Speed: ~2 seconds per evaluation
// Agreement with humans: ~75-85%
Key insight: LLM-as-judge is not a replacement for human evaluation — it’s a scalable complement. Use it for automated regression testing in CI (hundreds of test cases), and reserve human evaluation for high-stakes decisions and calibrating the judge.
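A thin wrapper around that prompt might look like the sketch below. `call_llm` is the same hypothetical client helper used in the eval pipeline, injected here so the judge is testable; the message builder and JSON parsing are the parts worth getting right, since a judge that returns unparseable verdicts silently breaks CI.

```python
import json

JUDGE_SYSTEM = (
    "You are an expert evaluator. Rate the answer on a scale of 1-5.\n"
    "5 = Perfect ... 1 = Unacceptable, wrong or harmful.\n"
    'Respond in JSON: {"score": <1-5>, "reasoning": "..."}'
)

def build_judge_messages(question, answer, reference=""):
    """Assemble the chat messages for the judge call."""
    user = f"Question: {question}\nAnswer: {answer}"
    if reference:
        user += f"\nReference: {reference}"
    return [
        {"role": "system", "content": JUDGE_SYSTEM},
        {"role": "user", "content": user},
    ]

def judge_score(question, answer, call_llm, reference=""):
    """Run the judge and parse its JSON verdict.
    `call_llm` is a hypothetical function taking messages and
    returning the model's raw string response."""
    raw = call_llm(build_judge_messages(question, answer, reference))
    verdict = json.loads(raw)
    assert 1 <= verdict["score"] <= 5, "judge returned out-of-range score"
    return verdict
```

In practice you would also enable the provider's JSON output mode and retry once on a parse failure rather than crashing the eval run.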
Prompt Management Maturity
From ad-hoc to production-grade
Maturity Model
Level 0 (Ad-hoc): Prompts hardcoded in application code. No versioning, no testing. Changes require full deploy. Level 1 (Versioned): Prompts in a registry with version history. Changes decoupled from code deploys. Changelog maintained. Level 2 (Tested): Evaluation suite runs on every prompt change. Test dataset curated and maintained. LLM-as-judge for quality scoring. Level 3 (CI/CD): Automated evaluation in CI pipeline. Regression gates block bad prompts. A/B testing in production. Level 4 (Observable): Full tracing and logging. User feedback loop. Continuous improvement based on production data. Guardrails on all inputs and outputs. Most teams should aim for Level 2 within 3 months of production launch.
Maturity Levels
// Prompt management maturity

Level 0: Ad-hoc
  Hardcoded strings, no history
  "I tested it in ChatGPT"

Level 1: Versioned
  Prompt registry, version history
  Aliases (production, staging)
  Decoupled from code deploys

Level 2: Tested
  Evaluation suite (50-200 cases)
  LLM-as-judge scoring
  Regression checks vs production

Level 3: CI/CD
  Automated eval in CI pipeline
  Gates block bad prompts
  A/B testing in production

Level 4: Observable
  Full tracing (Langfuse/LangSmith)
  User feedback → dataset updates
  Guardrails on all I/O
  Continuous improvement loop
Key insight: The biggest ROI is going from Level 0 to Level 1 (add a prompt registry) and from Level 1 to Level 2 (add an evaluation suite). These two steps prevent most production incidents caused by prompt changes.