Ch 13 — Evaluation, Debugging & Pitfalls

Why prompts fail, the wording trap, systematic debugging, and LLM-as-judge evaluation
Mastery
Prompts = Code → Wording Trap → Debug Framework → Edge Cases → LLM-as-Judge → Test Suites → Top Pitfalls → Debug Playbook
Prompts Are Software — Treat Them That Way
They need testing, versioning, debugging, and iteration — just like code
The Mindset Shift
Most people treat prompts like messages: write once, send, hope for the best. But production prompts are software artifacts:

• They have inputs (user messages, context) and outputs (model responses)
• They can have bugs (wrong output for certain inputs)
• They need testing (does it work for all expected inputs?)
• They need versioning (what changed? did it break anything?)
• They need monitoring (is it still working in production?)
The Prompt Development Lifecycle
1. Draft → Write the initial prompt
2. Test → Run against 10-20 examples
3. Debug → Identify failure patterns
4. Fix → Modify the prompt
5. Test → Verify fix, check regressions
6. Ship → Deploy to production
7. Monitor → Track quality metrics
8. Iterate → Fix new failure modes
Key insight: The difference between a hobby prompt and a production prompt is the same as the difference between a script and a product: testing, error handling, and monitoring. If your prompt handles user input, it needs the same rigor as any other piece of software.
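If prompts are software artifacts, they can be modeled as one. A minimal sketch of a versioned prompt, treated like code with a changelog; `PromptVersion` and its fields are illustrative names, not a real library:

```python
from dataclasses import dataclass

@dataclass
class PromptVersion:
    """A prompt treated as a versioned artifact, not a one-off message."""
    version: str
    template: str       # may contain {placeholders} for runtime inputs
    changelog: str = ""

    def render(self, **inputs) -> str:
        # Inputs are filled in like function arguments; a missing
        # placeholder raises KeyError, surfacing the bug early.
        return self.template.format(**inputs)

v1 = PromptVersion(
    version="1.0",
    template="Summarize this article in {n} bullet points:\n{article}",
    changelog="initial draft",
)
prompt = v1.render(n=3, article="LLMs are sensitive to wording...")
```

Storing the template and version together makes "what changed? did it break anything?" answerable: diff the templates, rerun the tests.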
The Wording Trap: Same Intent, Different Output
Tiny wording changes cause dramatically different results — and this is why prompt debugging is hard
Three Prompts, Same Intent
Prompt A: "Summarize this article."
Prompt B: "Provide a brief summary of this article."
Prompt C: "Summarize this article in 3 bullet points, each under 20 words."
What You Get
Prompt A output: A 200-word paragraph that's more of a rewrite than a summary. Includes opinions the article didn't express.
Prompt B output: A 50-word paragraph. Better, but still prose. Buries the key points.
Prompt C output:
• Point 1 (18 words)
• Point 2 (16 words)
• Point 3 (19 words)
Clean, scannable, actionable.
Why This Happens
Token-level sensitivity: The model predicts the next token based on all previous tokens. “Summarize” vs “Provide a brief summary” activates different probability distributions. “Brief” nudges toward shorter output. “3 bullet points, each under 20 words” constrains the output format precisely.

Ambiguity is the enemy: “Summarize” is ambiguous — how long? what format? what to include? The model fills in the blanks with its training distribution, which may not match your intent.
The Real-World Impact
In production, the wording trap means:

• A prompt that works in testing might fail with slightly different user inputs
• A “small improvement” to the prompt can break existing behavior
• Two team members writing “the same” prompt get different results
The pattern: Every word in a prompt matters. This isn’t a flaw — it’s how language models work. The fix isn’t to find the “perfect wording” but to be explicit enough that wording variations don’t matter. Constraints beat cleverness.
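One way to make constraints beat cleverness is to encode them in code rather than relying on a lucky verb choice. A sketch of a prompt builder (the function name and defaults are made up for illustration):

```python
def constrained_summary_prompt(article: str,
                               bullets: int = 3,
                               max_words: int = 20) -> str:
    """Build a summary prompt whose format is pinned down explicitly,
    so incidental wording variations matter less."""
    return (
        f"Summarize the article below in exactly {bullets} bullet points.\n"
        f"Each bullet must be under {max_words} words.\n"
        f"Do not add opinions the article does not express.\n\n"
        f"Article:\n{article}"
    )

p = constrained_summary_prompt("Quarterly revenue rose 12%...")
```

Two team members calling this function get the same prompt; the constraints live in one reviewable place instead of in each person's phrasing.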
The 5-Step Debugging Framework
When a prompt fails, check these five things in order
The Framework
Step 1: Is the instruction clear?
Can a human follow this prompt and produce the expected output? If not, the model can't either.
Step 2: Is there enough context?
Does the prompt include all information needed to answer? Or is the model expected to "just know" something?
Step 3: Is the format specified?
Did you tell the model HOW to respond? (JSON, bullets, table, length, structure)
Step 4: Are there conflicting constraints?
"Be concise" + "Include all details". "Be creative" + "Follow this template". These cancel each other out.
Step 5: Is the task too complex?
If one prompt tries to do 3 things, split it into 3 prompts.
Applying the Framework
Failing prompt: "Analyze this customer feedback and give me insights."
Step 1: "Analyze" how? "Insights" about what? → Instruction unclear
Step 2: No context about the product, customer segment, or goals → Missing context
Step 3: No format specified → model writes a 500-word essay → No format
Fixed prompt: "Analyze these 50 customer reviews for our project management SaaS. Identify:
1. Top 3 complaints (with review count)
2. Top 3 praised features
3. 2 feature requests mentioned 5+ times
Format: JSON with arrays for each category. Each item: {topic, count, example_quote}."
Key insight: Most prompt failures fall into one of these five categories. Running through the checklist takes 30 seconds and catches 80% of issues. The most common culprit? Step 1 — the instruction isn’t as clear as you think it is.
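Some of these checks can even be approximated mechanically. A toy heuristic linter for Steps 3 and 4; the keyword lists are illustrative, not exhaustive, and a clean result does not prove the prompt is good:

```python
def lint_prompt(prompt: str) -> list[str]:
    """Flag likely issues from the 5-step framework. Heuristic only."""
    issues = []
    p = prompt.lower()
    # Step 3: is any output format mentioned at all?
    if not any(w in p for w in ("json", "bullet", "table", "words", "format")):
        issues.append("no output format specified")
    # Step 4: one common pair of conflicting constraints
    if "concise" in p and "all details" in p:
        issues.append("conflicting constraints: concise vs. all details")
    return issues

print(lint_prompt("Analyze this customer feedback and give me insights."))
# → ['no output format specified']
```

A linter like this is no substitute for reading the prompt as a skeptical human, but it catches the most mechanical omissions before a model run.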
Domain Example: The 90% Prompt
A prompt that works for most inputs but fails on edge cases — and how to debug it
The Prompt
Task: Classify support tickets into categories: BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST
Prompt: "Classify this support ticket into one category: BILLING, TECHNICAL, ACCOUNT, or FEATURE_REQUEST. Reply with ONLY the category name."
Works for 90%
"I was charged twice" → BILLING ✓
"App crashes on login" → TECHNICAL ✓
"Can you add dark mode?" → FEATURE_REQUEST ✓
Fails on 10%
"I was charged twice and now I can't log in to see my invoices" → BILLING? TECHNICAL? ACCOUNT? (Model picks randomly each time)
"Your app is garbage" → TECHNICAL? (It's actually a complaint, not a technical issue)
"Thanks, that fixed it!" → TECHNICAL? (It's a resolution, not a new ticket)
The Debugging Process
1. Collect failures: Run the prompt against 100 tickets. Find the 10 that fail.

2. Categorize failure patterns:
• Multi-category tickets (3 failures)
• Vague complaints (4 failures)
• Non-ticket messages (3 failures)

3. Fix each pattern:
The Fixed Prompt
"Classify this support ticket.
Rules:
1. If the ticket spans multiple categories, choose the PRIMARY issue (the one the customer wants resolved most urgently).
2. If the ticket is a vague complaint with no specific issue, classify as ACCOUNT.
3. If the message is not a support ticket (e.g., 'thanks', 'ok', spam), classify as NONE.
Categories: BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST, NONE
Reply with ONLY the category name."
Key insight: The 90% prompt is the most dangerous prompt — it works well enough that you ship it, but fails unpredictably in production. Always test with edge cases: multi-category inputs, vague inputs, empty inputs, adversarial inputs, and non-standard inputs.
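Whatever wording you settle on, it also pays to guard the output in code: reject anything outside the allowed label set instead of trusting the model's raw reply. A sketch, where `call_llm` is a stand-in for your real model call:

```python
ALLOWED = {"BILLING", "TECHNICAL", "ACCOUNT", "FEATURE_REQUEST", "NONE"}

def classify(ticket: str, call_llm) -> str:
    """Classify a support ticket, validating the model's raw reply.
    `call_llm(prompt) -> str` is a hypothetical model-call function."""
    raw = call_llm(f"Classify this support ticket...\n\nTicket: {ticket}")
    label = raw.strip().upper()
    # Guard: anything outside the label set becomes NONE rather than
    # silently propagating a malformed answer downstream.
    return label if label in ALLOWED else "NONE"

# Stubbed model calls for illustration
print(classify("I was charged twice", lambda p: " billing "))  # → BILLING
print(classify("Thanks!", lambda p: "You're welcome!"))        # → NONE
```

The strip/uppercase normalization also absorbs harmless formatting drift (stray whitespace, lowercase answers) that would otherwise count as failures.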
LLM-as-Judge: Automated Quality Evaluation
Use one model to evaluate another model’s output — scalable quality assurance
The Concept
Manual evaluation doesn’t scale. If your prompt processes 10,000 inputs/day, you can’t review each output. Solution: use an LLM to evaluate the outputs of another LLM.

The flow:
1. Your prompt generates an output
2. A “judge” prompt evaluates that output
3. The judge scores it on a rubric
4. Low scores trigger alerts or human review
The Judge Prompt
System: "You are a quality evaluator for a customer support chatbot. Rate each response on these criteria:
1. ACCURACY (1-5): Is the information factually correct? Does it match the provided context?
2. HELPFULNESS (1-5): Does it actually solve the customer's problem or just acknowledge it?
3. TONE (1-5): Professional, empathetic, not robotic?
4. SAFETY (pass/fail): Does it make promises the company can't keep? Does it reveal internal information?
Output format:
{
  "accuracy": 4,
  "helpfulness": 3,
  "tone": 5,
  "safety": "pass",
  "issues": ["Didn't address the customer's second question"],
  "overall": 4
}"
Making It Work
Use a stronger model as judge: If your chatbot uses GPT-4o-mini, use GPT-4o or Claude 3.5 Sonnet as the judge. The judge should be at least as capable as the model being evaluated.

Calibrate with human ratings: Have humans rate 50–100 outputs. Then run the judge on the same outputs. If the judge’s scores correlate with human scores (r > 0.7), it’s reliable.

Focus on pass/fail for safety: Numeric scores are useful for quality. Binary pass/fail is better for safety checks.
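The calibration check is just a Pearson correlation between human and judge scores. A self-contained sketch; the ratings below are fabricated for illustration:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [4, 2, 5, 3, 1, 4]   # fabricated human ratings
judge = [4, 3, 5, 3, 2, 4]   # fabricated judge ratings on the same outputs
r = pearson(human, judge)
print(f"r = {r:.2f}:", "judge is reliable" if r > 0.7 else "recalibrate")
```

In practice you would run this over the 50–100 calibration outputs; if r stays below 0.7, fix the judge rubric before trusting its scores.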
Automated Pipeline
# Evaluate every Nth response
if response_count % 10 == 0:
    score = judge(
        input=user_message,
        output=bot_response,
        context=retrieved_docs
    )
    if score["safety"] == "fail":
        alert_team(response)
    if score["overall"] < 3:
        queue_for_review(response)
Key insight: LLM-as-judge isn’t perfect, but it’s far better than no evaluation. It catches obvious failures (hallucinations, safety violations, off-topic responses) reliably. Use it as a first filter, with human review for flagged cases.
Building a Prompt Test Suite
Systematic testing that catches regressions before they reach production
Test Case Structure
test_cases = [
    {
        "id": "billing_simple",
        "input": "I was charged twice",
        "expected": "BILLING",
        "type": "exact_match"
    },
    {
        "id": "multi_category",
        "input": "Charged twice and can't log in",
        "expected": "BILLING",
        "type": "exact_match"
    },
    {
        "id": "not_a_ticket",
        "input": "Thanks!",
        "expected": "NONE",
        "type": "exact_match"
    },
    {
        "id": "summary_quality",
        "input": "[long article]",
        "expected": "3 bullet points, each <20 words",
        "type": "llm_judge"
    }
]
Test Types
Exact match: Output must equal expected value. Best for classification, extraction, structured output.

Contains: Output must contain specific strings. Good for checking that key facts are included.

Not contains: Output must NOT contain certain strings. Good for safety checks (no PII, no hallucinated URLs).

Format check: Output must parse as valid JSON/XML. Good for structured output prompts.

LLM judge: Use a judge prompt to score quality. Best for open-ended generation where exact matching is impossible.
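The first four test types map directly onto small check functions. A sketch of the dispatch; the `llm_judge` branch is omitted because it needs a model call:

```python
import json

def check(output: str, expected: str, kind: str) -> bool:
    """Evaluate one test case. `kind` matches the `type` field of the
    test-case dicts; llm_judge cases are handled elsewhere."""
    if kind == "exact_match":
        return output.strip() == expected
    if kind == "contains":
        return expected in output
    if kind == "not_contains":
        return expected not in output
    if kind == "format_check":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unknown test type: {kind}")

print(check("BILLING", "BILLING", "exact_match"))  # → True
print(check('{"a": 1}', "", "format_check"))       # → True
```

Note the `strip()` in exact_match: trailing whitespace from the model should not count as a failure, but anything more than that should.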
Run on Every Prompt Change
# Before deploying a prompt change:
results = run_test_suite(
    prompt=new_prompt,
    tests=test_cases
)
passed = sum(1 for r in results if r["passed"])
print(f"{passed}/{len(results)} passed")

# If pass rate drops, don't deploy
if passed < len(results) * 0.95:
    print("BLOCKED: regression detected")
Key insight: A test suite of 20–30 cases catches most regressions. Include: 10 happy-path cases, 5 edge cases, 5 adversarial cases, and 5–10 real production failures you’ve seen. Run the suite before every prompt change. It takes 30 seconds and saves hours of debugging in production.
The Top 7 Prompt Engineering Pitfalls
Mistakes that even experienced practitioners make
Pitfalls 1-4
1. The Kitchen Sink Prompt
Cramming 5 tasks into one prompt. The model does all 5 poorly instead of any 1 well. Split into focused prompts.
2. The Invisible Assumption
"Summarize for our audience" — what audience? The prompt assumes context the model doesn't have. Be explicit.
3. The Contradictory Constraint
"Be thorough but keep it under 50 words." Pick one. If you need both, say "Cover the 3 most important points in under 50 words."
4. The Negative-Only Instruction
"Don't be formal. Don't use jargon. Don't be too long." Tell the model what TO do, not just what to avoid: "Write in a casual, conversational tone. Use simple language. Keep it under 100 words."
Pitfalls 5-7
5. The Temperature Mismatch
Using temperature=1.0 for classification (needs determinism) or temperature=0 for creative writing (needs variety). Match temperature to the task.
6. The Untested Production Prompt
"It worked in ChatGPT, ship it." ChatGPT testing ≠ production testing. Different inputs, different scale, different failure modes.
7. The Frozen Prompt
Writing a prompt once and never updating it. Models change, user patterns change, edge cases emerge. Prompts need maintenance like code.
Key insight: These pitfalls share a common root: treating prompts as one-off messages instead of maintained software. The fix is always the same: be explicit, test systematically, and iterate based on real failures.
The Prompt Debugging Playbook
A systematic approach to finding and fixing prompt failures
When a Prompt Fails
1. Reproduce
Run the exact same input 3 times. Is the failure consistent or random?
Consistent → prompt bug
Random → temperature/sampling issue
2. Diagnose (5-step framework)
□ Instruction clear?
□ Enough context?
□ Format specified?
□ Conflicting constraints?
□ Task too complex?
3. Isolate
Remove parts of the prompt until you find the minimum that reproduces the failure. Often it's one ambiguous sentence.
4. Fix
Make the smallest change that fixes the failure. Don't rewrite the whole prompt — you'll introduce new bugs.
5. Verify
Run the fix against:
- The failing input (should pass now)
- 10 previously passing inputs (should still pass — no regression)
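The reproduce step is easy to automate: run the same input several times and see whether you get one answer or several. A sketch, where `run_prompt` is a stand-in for your real model call:

```python
from collections import Counter

def reproduce(run_prompt, user_input: str, n: int = 3) -> str:
    """Run the same input n times and diagnose the failure type.
    `run_prompt(text) -> str` is a hypothetical model-call function."""
    outputs = Counter(run_prompt(user_input) for _ in range(n))
    if len(outputs) == 1:
        return "consistent: likely a prompt bug"
    return f"random ({len(outputs)} distinct outputs): check temperature/sampling"

# Deterministic stub: always the same (wrong) answer → a prompt bug
print(reproduce(lambda x: "ACCOUNT", "I was charged twice"))
# → consistent: likely a prompt bug
```

For real runs, set temperature to your production value first; otherwise you are reproducing a different system than the one that failed.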
Quick Reference
Wrong format? → Add explicit format instructions with an example

Too long/short? → Add word/sentence count constraint

Hallucinating? → Add “Only use information from the provided context”

Inconsistent? → Lower temperature, add few-shot examples

Wrong tool called? → Improve tool descriptions (Ch 12)

Loses context? → Add summaries (Ch 11)

Works sometimes? → Run 10x, find the pattern in failures
Key insight: Prompt debugging is a learnable skill. The 5-step framework (reproduce, diagnose, isolate, fix, verify) works for every failure mode. The key discipline is making the smallest fix and verifying it doesn’t break existing behavior. Prompt engineering is iterative, not inspirational.