Ch 14 — Testing AI Products

Why traditional QA breaks down for AI — and the testing strategy that replaces it.
High Level
block
Why QA Fails
arrow_forward
layers
Test Layers
arrow_forward
shield
Red Team
arrow_forward
history
Regression
arrow_forward
groups
User Test
arrow_forward
checklist
Go / No-Go
-
Click play or press Space to begin...
Step- / 8
block
Why Traditional QA Breaks Down
AI products violate every assumption that traditional testing relies on
Traditional Software Testing
Traditional QA works because software is deterministic: the same input always produces the same output. You write test cases, define expected results, and verify that actual results match. If they do, the software passes. If they don’t, there’s a bug.

This model assumes:
• Inputs and outputs are well-defined
• Behavior is reproducible
• Pass/fail is binary
• A passing test stays passing until code changes
• You can enumerate the important test cases
Why AI Breaks These Assumptions
Non-deterministic: The same input can produce different outputs. Run the same prompt 10 times and you might get 10 different responses. A test that passes today might fail tomorrow with no code changes.

No binary pass/fail: Is a response “correct”? Often there’s no single right answer. Quality exists on a spectrum — partially correct, mostly helpful, slightly off-tone.

Infinite input space: Users can type anything. You can’t enumerate all possible inputs. The long tail of unusual queries is where most failures hide.

Context-dependent: The same question in different conversation contexts can require different answers. Testing individual queries misses multi-turn interaction failures.

Continuous evolution: Model updates, data changes, and prompt modifications all shift behavior. A test suite that covered the important cases last month may miss new failure modes this month.
The mindset shift: AI testing is not about proving the system works correctly. It’s about quantifying the risk of deploying it. You’re measuring quality distributions, not checking pass/fail conditions. The question changes from “Does it work?” to “How often does it fail, and how badly?”
layers
The AI Testing Pyramid
Five layers of testing, from fastest and cheapest to slowest and most expensive
Layer 1: Unit Tests (Automated, Continuous)
Test individual components in isolation:
• Does the data pipeline produce clean, correctly formatted data?
• Does the chunking function split documents at the right boundaries?
• Does the output parser handle all expected formats?
• Do API integrations return expected schemas?

These are traditional software tests for the non-AI parts of the system. They’re fast, cheap, and should run on every code change.
Layer 2: Model Evaluation (Automated, Per Model Change)
Test the AI model’s quality on a curated evaluation dataset:
• Run 200–500 test cases covering normal, hard, and edge cases
• Measure precision, recall, F1 (classification) or faithfulness, relevance (generation)
• Compare against the previous model version
• Flag regressions automatically

This is the primary quality gate for model and prompt changes. No change ships without passing the eval suite.
Layer 3: Integration Tests (Automated, Per Release)
Test the full pipeline end-to-end:
• User query → retrieval → generation → response
• Verify latency, cost, and format requirements
• Test error handling and fallback behavior
• Verify human handoff triggers work correctly

Integration tests catch issues that unit tests and model evaluation miss — like a retrieval change that breaks generation quality.
Layer 4: Adversarial / Red Team (Periodic)
Deliberately try to break the system (covered in detail in the next step).
Layer 5: User Acceptance Testing (Pre-Launch)
Real users interact with the product and provide feedback. The most expensive but most realistic test. Covered in step 6.
The 80/20 rule: Layers 1–3 should catch 80% of issues automatically. Layers 4–5 catch the remaining 20% that require human judgment. Invest heavily in automated testing (layers 1–3) to keep the expensive human testing (layers 4–5) focused on what automation can’t catch.
shield
Red Teaming & Adversarial Testing
Systematically trying to break your AI before users and attackers do
What Red Teaming Is
Red teaming is structured adversarial testing where a dedicated team tries to make the AI behave in unintended ways. It’s now standard practice at OpenAI, Anthropic, Microsoft, and Google before any public release.

The goal is to find vulnerabilities before deployment, not after users discover them on social media.
The Five Phases
1. Reconnaissance: Understand the AI’s capabilities, constraints, and intended behavior. Map the attack surface.

2. Attack surface mapping: Identify all input channels (text, files, API parameters, conversation history) and all output channels (text, actions, data access).

3. Vulnerability testing: Systematically probe each attack vector with crafted inputs.

4. Exploitation: For each vulnerability found, determine the worst-case impact. Can the AI be made to leak data? Take harmful actions? Produce dangerous content?

5. Reporting: Document findings with severity ratings, reproduction steps, and recommended mitigations.
Key Attack Vectors to Test
Prompt injection: Can users override the system prompt? “Ignore all previous instructions and...”

Information disclosure: Can the AI be tricked into revealing the system prompt, internal data, or other users’ information?

Harmful content: Can the AI be made to produce toxic, biased, illegal, or dangerous content?

Excessive agency: Can the AI be manipulated into taking actions beyond its intended scope? (Especially critical for agentic AI.)

Hallucination exploitation: Can users craft queries that reliably trigger hallucinations, then use those hallucinations to mislead others?

Data poisoning: If the AI learns from user feedback, can adversarial feedback degrade its quality over time?
The non-determinism challenge: Unlike traditional security testing, AI vulnerabilities appear inconsistently. A prompt injection might work 5% of the time. Run each adversarial test 10–20 times. A vulnerability that succeeds even once in 20 attempts is a real vulnerability — at scale, 5% success means thousands of exploits per day.
history
Regression Testing for AI
Ensuring improvements don’t break what was already working
Why Regressions Are Common in AI
AI products regress more frequently than traditional software because changes propagate unpredictably:

Prompt changes: Improving the prompt for one type of query can degrade performance on another. Adding a new constraint might conflict with existing instructions.

Model updates: When the underlying model is updated (GPT-4 → GPT-4o, Claude 3 → Claude 3.5), behavior changes in subtle ways. Prompts optimized for one model version may underperform on the next.

Data changes: Adding new documents to a RAG knowledge base can change retrieval rankings, causing previously correct answers to become wrong.

Threshold changes: Adjusting a confidence threshold to reduce false positives might increase false negatives elsewhere.
Building a Regression Suite
Golden test set: A curated set of 200–500 input-output pairs that represent the most important behaviors. These are your “must not break” cases. Include:

• High-traffic queries (the 20% of queries that represent 80% of usage)
• Previously failed cases that were fixed (prevent re-introduction)
• Edge cases that required special handling
• Safety-critical scenarios (must never regress)

Automated comparison: On every change, run the golden test set against both the old and new versions. Flag any case where the new version scores lower.

Regression budget: Define how much regression is acceptable. “No more than 2% of golden tests may regress. Zero safety-critical regressions.” This gives the team a clear decision framework.
CI/CD Integration
Embed regression tests into the deployment pipeline. Every prompt change, model update, or data change triggers the regression suite automatically. Block deployment if regressions exceed the budget. This prevents the most common source of AI quality degradation: well-intentioned changes with unintended side effects.
The growing test set: Every production failure should become a regression test. Over time, your golden test set grows into a comprehensive quality safety net. Teams that do this consistently have 75% fewer production incidents than those that don’t.
data_check
Data Quality Testing
Shift-left: quality begins at the data layer, not at deployment
Why Data Testing Matters
The shift-left approach to AI QA means testing before the model ever sees the data. Most AI quality issues originate in the data, not the model:

• Mislabeled training data → model learns wrong patterns
• Biased data → biased predictions
• Stale knowledge base → outdated answers
• Duplicate or contradictory documents → inconsistent retrieval
• Missing data for important categories → blind spots

Catching these issues at the data layer is 10x cheaper than catching them at the model layer and 100x cheaper than catching them in production.
Data Validation Checks
Schema validation: Does the data conform to the expected format? Are required fields present? Are data types correct?

Distribution checks: Has the data distribution shifted since the last training run? Are there new categories the model hasn’t seen? Are some categories suddenly over- or under-represented?

Freshness checks: When was the data last updated? Are there documents older than the freshness threshold? Are there gaps in the update pipeline?

Consistency checks: Do multiple sources agree? Are there contradictory documents? Are deprecated documents still in the active set?

Bias checks: Does the data represent all relevant user groups? Are there systematic gaps or over-representations that could lead to biased outputs?
Automate data quality: Run data validation checks on every data pipeline execution. Alert when checks fail. Block model training or knowledge base updates when data quality drops below thresholds. Data quality monitoring is as important as model quality monitoring — and should run more frequently.
groups
User Acceptance Testing
The final gate before launch — real users, real tasks, real feedback
Why User Testing Is Non-Negotiable
Automated tests and internal reviews can’t fully predict how real users will interact with an AI product:

• Users phrase questions differently than testers expect
• Users have context and expectations that internal teams don’t anticipate
• Users discover workflows and edge cases that weren’t in the test plan
• User satisfaction depends on subjective factors (tone, speed, helpfulness) that metrics only partially capture

User acceptance testing (UAT) is the bridge between “the AI performs well on benchmarks” and “users actually find this useful.”
Running Effective UAT for AI
Recruit representative users: Include power users, new users, and users from different segments. 20–50 participants is typically sufficient for qualitative insights.

Give real tasks, not scripts: “Use the AI to resolve your actual support question” is better than “Ask the AI: What is your return policy?” Scripted tasks miss the messy reality of real usage.

Measure both satisfaction and accuracy: Users might be satisfied with a wrong answer (they don’t know it’s wrong) or dissatisfied with a correct answer (the tone was off). Measure both independently.

Capture qualitative feedback: Ask users to think aloud. What surprised them? What frustrated them? When did they lose trust? These insights are more valuable than aggregate scores.

Run for at least 1–2 weeks: Users need time to build mental models of the AI’s capabilities. First-day impressions differ from week-two impressions.
The beta program: For AI products, consider a structured beta with 100–500 users before general launch. Provide a feedback channel, monitor usage patterns, and iterate on the most common failure modes. A 4-week beta with active feedback collection can prevent the majority of launch-day issues.
sync
Continuous Testing
Testing doesn’t end at launch — it becomes a permanent part of operations
Why Continuous Testing Is Essential
Unlike traditional software, AI products can degrade without any code changes:

Data drift: User queries evolve over time. New topics emerge. Seasonal patterns shift.
Model drift: If the provider updates the underlying model, behavior changes.
Knowledge staleness: The knowledge base becomes outdated as the real world changes.
Adversarial evolution: Attackers learn new techniques to exploit the AI.

A test suite that passes today may fail next month even if nothing in your system changed. Quality must be monitored continuously, not just at deployment.
The Continuous Testing Cadence
Every change (automated):
Unit tests, model evaluation, regression suite. Blocks deployment on failure.

Daily (automated):
Run the full evaluation suite against production. Sample 100–200 real user queries and evaluate quality. Alert on metric drops.

Weekly (PM review):
Review quality dashboards. Analyze user feedback patterns. Prioritize the top failure modes for the next sprint.

Monthly (human evaluation):
Domain experts evaluate 200–500 production outputs using the scoring rubric. Compare to previous month. Identify emerging quality issues.

Quarterly (red team):
Full adversarial testing cycle. Test new attack vectors. Verify that previous vulnerabilities remain patched. Update the threat model.
The testing flywheel: Production failures → new regression tests → better automated coverage → fewer production failures. Every incident makes the test suite stronger. Teams that run this flywheel for 6+ months achieve dramatically higher quality than teams that test only at deployment.
checklist
The Pre-Launch Testing Checklist
The go/no-go decision framework for shipping an AI product
Quality Gates
□ Model evaluation passes
Primary metric meets the launch threshold. No safety-critical regressions. Guardrail metrics (latency, cost) within budget.

□ Regression suite passes
No more than N% of golden tests regressed. All previously fixed bugs remain fixed.

□ Red team complete
No critical or high-severity vulnerabilities open. Medium-severity items have documented mitigations.

□ Data quality validated
Knowledge base is current, consistent, and complete for the product’s scope.
User Readiness
□ UAT feedback addressed
Top user complaints from beta have been resolved or have documented workarounds.

□ Error handling tested
Graceful degradation, human handoff, and “I don’t know” behaviors work correctly.

□ Feedback mechanisms live
Thumbs up/down, correction, and escalation paths are functional.
Operational Readiness
□ Monitoring dashboards live
Quality, latency, cost, and safety metrics are tracked in real time.

□ Alerting configured
Automated alerts for metric degradation with defined response procedures.

□ Rollback plan tested
Can revert to the previous version within minutes. The rollback has been tested end-to-end.

□ On-call rotation established
Someone is responsible for responding to quality issues 24/7 during the launch window.

□ Continuous testing pipeline active
Daily automated evaluation, weekly PM review, and monthly human evaluation are scheduled.
The go/no-go decision: All quality gates must pass. User readiness items should be substantially complete (minor issues can be tracked post-launch). Operational readiness is non-negotiable — you must be able to detect problems and roll back quickly. If any gate fails, fix it before launching. The cost of a bad AI launch (user trust damage, brand risk) far exceeds the cost of a delayed launch.