Ch 13 — Evaluation, Debugging & Pitfalls

Why prompts fail, the wording trap, systematic debugging, and LLM-as-judge evaluation
Mastery
Prompts = Code → Wording Trap → Debug Framework → Edge Cases → LLM-as-Judge → Test Suites → Top Pitfalls → Debug Playbook
Prompts Are Software — Treat Them That Way
They need testing, versioning, debugging, and iteration — just like code
The Mindset Shift
Most people treat prompts like messages: write once, send, hope for the best. But production prompts are software artifacts:

• They have inputs (user messages, context) and outputs (model responses)
• They can have bugs (wrong output for certain inputs)
• They need testing (does it work for all expected inputs?)
• They need versioning (what changed? did it break anything?)
• They need monitoring (is it still working in production?)
The Prompt Development Lifecycle
1. Draft → Write the initial prompt
2. Test → Run against 10-20 examples
3. Debug → Identify failure patterns
4. Fix → Modify the prompt
5. Test → Verify fix, check regressions
6. Ship → Deploy to production
7. Monitor → Track quality metrics
8. Iterate → Fix new failure modes
Key insight: The difference between a hobby prompt and a production prompt is the same as the difference between a script and a product: testing, error handling, and monitoring. If your prompt handles user input, it needs the same rigor as any other piece of software.
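If prompts are software artifacts, they can be modeled as one. A minimal sketch of a versioned prompt, treated like code with a changelog; `PromptVersion` and its fields are illustrative names, not a real library:

```python
from dataclasses import dataclass

@dataclass
class PromptVersion:
    """A prompt treated as a versioned artifact, not a one-off message."""
    version: str
    template: str       # may contain {placeholders} for runtime inputs
    changelog: str = ""

    def render(self, **inputs) -> str:
        # Inputs are filled in like function arguments; a missing
        # placeholder raises KeyError, surfacing the bug early.
        return self.template.format(**inputs)

v1 = PromptVersion(
    version="1.0",
    template="Summarize this article in {n} bullet points:\n{article}",
    changelog="initial draft",
)
prompt = v1.render(n=3, article="LLMs are sensitive to wording...")
```

Storing the template and version together makes "what changed? did it break anything?" answerable: diff the templates, rerun the tests.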
The Wording Trap: Same Intent, Different Output
Tiny wording changes cause dramatically different results — and this is why prompt debugging is hard
Three Prompts, Same Intent
Prompt A: "Summarize this article."
Prompt B: "Provide a brief summary of this article."
Prompt C: "Summarize this article in 3 bullet points, each under 20 words."
What You Get
Prompt A output: A 200-word paragraph that's more of a rewrite than a summary. Includes opinions the article didn't express.
Prompt B output: A 50-word paragraph. Better, but still prose. Buries the key points.
Prompt C output:
• Point 1 (18 words)
• Point 2 (16 words)
• Point 3 (19 words)
Clean, scannable, actionable.
Why This Happens
Token-level sensitivity: The model predicts the next token based on all previous tokens. “Summarize” vs “Provide a brief summary” activates different probability distributions. “Brief” nudges toward shorter output. “3 bullet points, each under 20 words” constrains the output format precisely.

Ambiguity is the enemy: “Summarize” is ambiguous — how long? what format? what to include? The model fills in the blanks with its training distribution, which may not match your intent.
The Real-World Impact
In production, the wording trap means:

• A prompt that works in testing might fail with slightly different user inputs
• A “small improvement” to the prompt can break existing behavior
• Two team members writing “the same” prompt get different results
The pattern: Every word in a prompt matters. This isn’t a flaw — it’s how language models work. The fix isn’t to find the “perfect wording” but to be explicit enough that wording variations don’t matter. Constraints beat cleverness.
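One way to make constraints beat cleverness is to encode them in code rather than relying on a lucky verb choice. A sketch of a prompt builder (the function name and defaults are made up for illustration):

```python
def constrained_summary_prompt(article: str,
                               bullets: int = 3,
                               max_words: int = 20) -> str:
    """Build a summary prompt whose format is pinned down explicitly,
    so incidental wording variations matter less."""
    return (
        f"Summarize the article below in exactly {bullets} bullet points.\n"
        f"Each bullet must be under {max_words} words.\n"
        f"Do not add opinions the article does not express.\n\n"
        f"Article:\n{article}"
    )

p = constrained_summary_prompt("Quarterly revenue rose 12%...")
```

Two team members calling this function get the same prompt; the constraints live in one reviewable place instead of in each person's phrasing.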
The 5-Step Debugging Framework
When a prompt fails, check these five things in order
The Framework
Step 1: Is the instruction clear?
Can a human follow this prompt and produce the expected output? If not, the model can't either.
Step 2: Is there enough context?
Does the prompt include all information needed to answer? Or is the model expected to "just know" something?
Step 3: Is the format specified?
Did you tell the model HOW to respond? (JSON, bullets, table, length, structure)
Step 4: Are there conflicting constraints?
"Be concise" + "Include all details". "Be creative" + "Follow this template". These cancel each other out.
Step 5: Is the task too complex?
If one prompt tries to do 3 things, split it into 3 prompts.
Applying the Framework
Failing prompt: "Analyze this customer feedback and give me insights."
Step 1: "Analyze" how? "Insights" about what? → Instruction unclear
Step 2: No context about the product, customer segment, or goals → Missing context
Step 3: No format specified → model writes a 500-word essay → No format
Fixed prompt: "Analyze these 50 customer reviews for our project management SaaS. Identify:
1. Top 3 complaints (with review count)
2. Top 3 praised features
3. 2 feature requests mentioned 5+ times
Format: JSON with arrays for each category. Each item: {topic, count, example_quote}."
Key insight: Most prompt failures fall into one of these five categories. Running through the checklist takes 30 seconds and catches 80% of issues. The most common culprit? Step 1 — the instruction isn’t as clear as you think it is.
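Some of these checks can even be approximated mechanically. A toy heuristic linter for Steps 3 and 4; the keyword lists are illustrative, not exhaustive, and a clean result does not prove the prompt is good:

```python
def lint_prompt(prompt: str) -> list[str]:
    """Flag likely issues from the 5-step framework. Heuristic only."""
    issues = []
    p = prompt.lower()
    # Step 3: is any output format mentioned at all?
    if not any(w in p for w in ("json", "bullet", "table", "words", "format")):
        issues.append("no output format specified")
    # Step 4: one common pair of conflicting constraints
    if "concise" in p and "all details" in p:
        issues.append("conflicting constraints: concise vs. all details")
    return issues

print(lint_prompt("Analyze this customer feedback and give me insights."))
# → ['no output format specified']
```

A linter like this is no substitute for reading the prompt as a skeptical human, but it catches the most mechanical omissions before a model run.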
Domain Example: The 90% Prompt
A prompt that works for most inputs but fails on edge cases — and how to debug it
The Prompt
Task: Classify support tickets into categories: BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST
Prompt: "Classify this support ticket into one category: BILLING, TECHNICAL, ACCOUNT, or FEATURE_REQUEST. Reply with ONLY the category name."
Works for 90%
"I was charged twice" → BILLING ✓
"App crashes on login" → TECHNICAL ✓
"Can you add dark mode?" → FEATURE_REQUEST ✓
Fails on 10%
"I was charged twice and now I can't log in to see my invoices" → BILLING? TECHNICAL? ACCOUNT? (Model picks randomly each time)
"Your app is garbage" → TECHNICAL? (It's actually a complaint, not a technical issue)
"Thanks, that fixed it!" → TECHNICAL? (It's a resolution, not a new ticket)
The Debugging Process
1. Collect failures: Run the prompt against 100 tickets. Find the 10 that fail.

2. Categorize failure patterns:
• Multi-category tickets (3 failures)
• Vague complaints (4 failures)
• Non-ticket messages (3 failures)

3. Fix each pattern:
The Fixed Prompt
"Classify this support ticket.
Rules:
1. If the ticket spans multiple categories, choose the PRIMARY issue (the one the customer wants resolved most urgently).
2. If the ticket is a vague complaint with no specific issue, classify as ACCOUNT.
3. If the message is not a support ticket (e.g., 'thanks', 'ok', spam), classify as NONE.
Categories: BILLING, TECHNICAL, ACCOUNT, FEATURE_REQUEST, NONE
Reply with ONLY the category name."
Key insight: The 90% prompt is the most dangerous prompt — it works well enough that you ship it, but fails unpredictably in production. Always test with edge cases: multi-category inputs, vague inputs, empty inputs, adversarial inputs, and non-standard inputs.
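Whatever wording you settle on, it also pays to guard the output in code: reject anything outside the allowed label set instead of trusting the model's raw reply. A sketch, where `call_llm` is a stand-in for your real model call:

```python
ALLOWED = {"BILLING", "TECHNICAL", "ACCOUNT", "FEATURE_REQUEST", "NONE"}

def classify(ticket: str, call_llm) -> str:
    """Classify a support ticket, validating the model's raw reply.
    `call_llm(prompt) -> str` is a hypothetical model-call function."""
    raw = call_llm(f"Classify this support ticket...\n\nTicket: {ticket}")
    label = raw.strip().upper()
    # Guard: anything outside the label set becomes NONE rather than
    # silently propagating a malformed answer downstream.
    return label if label in ALLOWED else "NONE"

# Stubbed model calls for illustration
print(classify("I was charged twice", lambda p: " billing "))  # → BILLING
print(classify("Thanks!", lambda p: "You're welcome!"))        # → NONE
```

The strip/uppercase normalization also absorbs harmless formatting drift (stray whitespace, lowercase answers) that would otherwise count as failures.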
LLM-as-Judge: Automated Quality Evaluation
Use one model to evaluate another model’s output — scalable quality assurance
The Concept
Manual evaluation doesn’t scale. If your prompt processes 10,000 inputs/day, you can’t review each output. Solution: use an LLM to evaluate the outputs of another LLM.

The flow:
1. Your prompt generates an output
2. A “judge” prompt evaluates that output
3. The judge scores it on a rubric
4. Low scores trigger alerts or human review
The Judge Prompt
System: "You are a quality evaluator for a customer support chatbot. Rate each response on these criteria:
1. ACCURACY (1-5): Is the information factually correct? Does it match the provided context?
2. HELPFULNESS (1-5): Does it actually solve the customer's problem or just acknowledge it?
3. TONE (1-5): Professional, empathetic, not robotic?
4. SAFETY (pass/fail): Does it make promises the company can't keep? Does it reveal internal information?
Output format:
{
  "accuracy": 4,
  "helpfulness": 3,
  "tone": 5,
  "safety": "pass",
  "issues": ["Didn't address the customer's second question"],
  "overall": 4
}"
Making It Work
Use a stronger model as judge: If your chatbot uses GPT-4o-mini, use GPT-4o or Claude 3.5 Sonnet as the judge. The judge should be at least as capable as the model being evaluated.

Calibrate with human ratings: Have humans rate 50–100 outputs. Then run the judge on the same outputs. If the judge’s scores correlate with human scores (r > 0.7), it’s reliable.

Focus on pass/fail for safety: Numeric scores are useful for quality. Binary pass/fail is better for safety checks.
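The calibration check is just a Pearson correlation between human and judge scores. A self-contained sketch; the ratings below are fabricated for illustration:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [4, 2, 5, 3, 1, 4]   # fabricated human ratings
judge = [4, 3, 5, 3, 2, 4]   # fabricated judge ratings on the same outputs
r = pearson(human, judge)
print(f"r = {r:.2f}:", "judge is reliable" if r > 0.7 else "recalibrate")
```

In practice you would run this over the 50–100 calibration outputs; if r stays below 0.7, fix the judge rubric before trusting its scores.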
Automated Pipeline
# Evaluate every Nth response
if response_count % 10 == 0:
    score = judge(
        input=user_message,
        output=bot_response,
        context=retrieved_docs
    )
    if score["safety"] == "fail":
        alert_team(response)
    if score["overall"] < 3:
        queue_for_review(response)
Key insight: LLM-as-judge isn’t perfect, but it’s far better than no evaluation. It catches obvious failures (hallucinations, safety violations, off-topic responses) reliably. Use it as a first filter, with human review for flagged cases.
Building a Prompt Test Suite
Systematic testing that catches regressions before they reach production
Test Case Structure
test_cases = [
    {
        "id": "billing_simple",
        "input": "I was charged twice",
        "expected": "BILLING",
        "type": "exact_match"
    },
    {
        "id": "multi_category",
        "input": "Charged twice and can't log in",
        "expected": "BILLING",
        "type": "exact_match"
    },
    {
        "id": "not_a_ticket",
        "input": "Thanks!",
        "expected": "NONE",
        "type": "exact_match"
    },
    {
        "id": "summary_quality",
        "input": "[long article]",
        "expected": "3 bullet points, each <20 words",
        "type": "llm_judge"
    }
]
Test Types
Exact match: Output must equal expected value. Best for classification, extraction, structured output.

Contains: Output must contain specific strings. Good for checking that key facts are included.

Not contains: Output must NOT contain certain strings. Good for safety checks (no PII, no hallucinated URLs).

Format check: Output must parse as valid JSON/XML. Good for structured output prompts.

LLM judge: Use a judge prompt to score quality. Best for open-ended generation where exact matching is impossible.
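The first four test types map directly onto small check functions. A sketch of the dispatch; the `llm_judge` branch is omitted because it needs a model call:

```python
import json

def check(output: str, expected: str, kind: str) -> bool:
    """Evaluate one test case. `kind` matches the `type` field of the
    test-case dicts; llm_judge cases are handled elsewhere."""
    if kind == "exact_match":
        return output.strip() == expected
    if kind == "contains":
        return expected in output
    if kind == "not_contains":
        return expected not in output
    if kind == "format_check":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unknown test type: {kind}")

print(check("BILLING", "BILLING", "exact_match"))  # → True
print(check('{"a": 1}', "", "format_check"))       # → True
```

Note the `strip()` in exact_match: trailing whitespace from the model should not count as a failure, but anything more than that should.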
Run on Every Prompt Change
# Before deploying a prompt change:
results = run_test_suite(
    prompt=new_prompt,
    tests=test_cases
)
passed = sum(1 for r in results if r["passed"])
print(f"{passed}/{len(results)} passed")

# If pass rate drops, don't deploy
if passed < len(results) * 0.95:
    print("BLOCKED: regression detected")
Key insight: A test suite of 20–30 cases catches most regressions. Include: 10 happy-path cases, 5 edge cases, 5 adversarial cases, and 5–10 real production failures you’ve seen. Run the suite before every prompt change. It takes 30 seconds and saves hours of debugging in production.
The Top 7 Prompt Engineering Pitfalls
Mistakes that even experienced practitioners make
Pitfalls 1-4
1. The Kitchen Sink Prompt
Cramming 5 tasks into one prompt. The model does all 5 poorly instead of any 1 well. Split into focused prompts.
2. The Invisible Assumption
"Summarize for our audience" — what audience? The prompt assumes context the model doesn't have. Be explicit.
3. The Contradictory Constraint
"Be thorough but keep it under 50 words." Pick one. If you need both, say "Cover the 3 most important points in under 50 words."
4. The Negative-Only Instruction
"Don't be formal. Don't use jargon. Don't be too long." Tell the model what TO do, not just what to avoid: "Write in a casual, conversational tone. Use simple language. Keep it under 100 words."
Pitfalls 5-7
5. The Temperature Mismatch
Using temperature=1.0 for classification (needs determinism) or temperature=0 for creative writing (needs variety). Match temperature to the task.
6. The Untested Production Prompt
"It worked in ChatGPT, ship it." ChatGPT testing ≠ production testing. Different inputs, different scale, different failure modes.
7. The Frozen Prompt
Writing a prompt once and never updating it. Models change, user patterns change, edge cases emerge. Prompts need maintenance like code.
Key insight: These pitfalls share a common root: treating prompts as one-off messages instead of maintained software. The fix is always the same: be explicit, test systematically, and iterate based on real failures.
The Prompt Debugging Playbook
A systematic approach to finding and fixing prompt failures
When a Prompt Fails
1. Reproduce
Run the exact same input 3 times. Is the failure consistent or random?
Consistent → prompt bug
Random → temperature/sampling issue
2. Diagnose (5-step framework)
□ Instruction clear?
□ Enough context?
□ Format specified?
□ Conflicting constraints?
□ Task too complex?
3. Isolate
Remove parts of the prompt until you find the minimum that reproduces the failure. Often it's one ambiguous sentence.
4. Fix
Make the smallest change that fixes the failure. Don't rewrite the whole prompt — you'll introduce new bugs.
5. Verify
Run the fix against:
- The failing input (should pass now)
- 10 previously passing inputs (should still pass — no regression)
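The reproduce step is easy to automate: run the same input several times and see whether you get one answer or several. A sketch, where `run_prompt` is a stand-in for your real model call:

```python
from collections import Counter

def reproduce(run_prompt, user_input: str, n: int = 3) -> str:
    """Run the same input n times and diagnose the failure type.
    `run_prompt(text) -> str` is a hypothetical model-call function."""
    outputs = Counter(run_prompt(user_input) for _ in range(n))
    if len(outputs) == 1:
        return "consistent: likely a prompt bug"
    return f"random ({len(outputs)} distinct outputs): check temperature/sampling"

# Deterministic stub: always the same (wrong) answer → a prompt bug
print(reproduce(lambda x: "ACCOUNT", "I was charged twice"))
# → consistent: likely a prompt bug
```

For real runs, set temperature to your production value first; otherwise you are reproducing a different system than the one that failed.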
Quick Reference
Wrong format? → Add explicit format instructions with an example

Too long/short? → Add word/sentence count constraint

Hallucinating? → Add “Only use information from the provided context”

Inconsistent? → Lower temperature, add few-shot examples

Wrong tool called? → Improve tool descriptions (Ch 12)

Loses context? → Add summaries (Ch 11)

Works sometimes? → Run 10x, find the pattern in failures
Key insight: Prompt debugging is a learnable skill. The 5-step framework (reproduce, diagnose, isolate, fix, verify) works for every failure mode. The key discipline is making the smallest fix and verifying it doesn’t break existing behavior. Prompt engineering is iterative, not inspirational.