Ch 8 — Writing AI Product Specs

What changes in a PRD when the product is probabilistic. The AI Product Requirements Canvas.
Why Traditional PRDs Fail for AI
The fundamental mismatch between deterministic specs and probabilistic systems
Traditional PRD Assumptions
A traditional PRD assumes deterministic behavior: “When the user clicks Submit, the form saves to the database and displays a confirmation.” The behavior is binary — it either works or it doesn’t. QA writes test cases with exact expected outputs. Acceptance criteria are pass/fail.

This breaks completely for AI products:

• “The system shall correctly classify all support tickets” — impossible. No AI achieves 100%.
• “The AI shall generate accurate summaries” — unmeasurable. What does “accurate” mean? By whose standard?
• “The chatbot shall answer customer questions” — unbounded. Which questions? What counts as an answer?
What AI PRDs Must Address
AI product specs need six additional sections that traditional PRDs don’t have:

1. Performance thresholds — Not “it works” but “it achieves X% precision at Y% recall”

2. Error budget & failure modes — What types of errors are acceptable? Which are catastrophic?

3. Data requirements — What data is needed for training, evaluation, and ongoing improvement?

4. Fallback behavior — What happens when the AI fails or isn’t confident?

5. Monitoring requirements — What to track after launch, not just before

6. Evaluation methodology — How will you measure whether the AI is “good enough”?
The mindset shift: A traditional PRD defines what the product does. An AI PRD defines what the product does, how well it does it, how it fails, and how you’ll know if it’s working. The spec is not complete without all four.
Performance Thresholds
Replacing “it works” with measurable, actionable quality bars
How to Write Performance Requirements
Bad: “The AI should accurately detect fraud.”
Good: “The fraud detection model shall achieve ≥95% recall (catch 95% of actual fraud) with ≤2% false positive rate (flag no more than 2% of legitimate transactions).”

Bad: “The chatbot should give helpful answers.”
Good: “The chatbot shall resolve ≥70% of billing questions without human escalation, with a user satisfaction rating ≥4.0/5.0 on resolved conversations.”

Bad: “Summaries should be accurate.”
Good: “Generated summaries shall be rated as factually accurate by domain reviewers on ≥90% of evaluations, with zero hallucinated facts in ≥95% of outputs.”
Three Types of Thresholds
Launch threshold: The minimum performance required to ship v1. Below this, the product doesn’t go live. This should be achievable and tied to user value, not perfection.

Target threshold: The performance level you’re aiming for within 3–6 months post-launch. This drives the improvement roadmap.

Guardrail threshold: The performance level that, if breached, triggers an alert or rollback. “If precision drops below 85%, page the on-call engineer.” This prevents silent degradation.
Latency & Cost Thresholds
Don’t forget non-accuracy thresholds:

Latency: “P95 response time ≤800ms” (95% of requests complete within 800ms)
Cost: “Average cost per query ≤$0.03”
Throughput: “System handles ≥500 concurrent requests”
The threshold negotiation: Setting thresholds is a negotiation between product ambition and technical reality. Set them too high and you never ship. Set them too low and users don’t trust the product. Start with the human baseline (how well do humans do this task?) and set the launch threshold at or slightly above it.
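The three-tier scheme can be made concrete as data plus a simple check. This is a minimal sketch with illustrative metric names and numbers, not values from any real spec:

```python
# Hypothetical three-tier threshold spec (illustrative values only).
THRESHOLDS = {
    "precision": {"launch": 0.90, "target": 0.95, "guardrail": 0.85},
    "p95_latency_ms": {"launch": 800, "target": 500, "guardrail": 1500},
}

def check_metric(name: str, value: float) -> str:
    """Classify a measured value against its launch/target/guardrail tiers."""
    t = THRESHOLDS[name]
    higher_is_better = name == "precision"  # for latency, lower is better
    if higher_is_better:
        if value < t["guardrail"]:
            return "breach"        # trigger alert or rollback
        if value < t["launch"]:
            return "below-launch"  # not ready to ship
        return "at-target" if value >= t["target"] else "ok"
    else:
        if value > t["guardrail"]:
            return "breach"
        if value > t["launch"]:
            return "below-launch"
        return "at-target" if value <= t["target"] else "ok"

print(check_metric("precision", 0.92))      # ok: shippable, below target
print(check_metric("precision", 0.84))      # breach: below the guardrail
print(check_metric("p95_latency_ms", 600))  # ok
```

The point of encoding thresholds this way is that the same numbers drive the launch decision, the roadmap, and the production alert.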
Error Budget & Failure Modes
Specifying how the product is allowed to fail
The Error Budget Concept
An error budget, a concept borrowed from site reliability engineering, defines how much failure is acceptable. For AI products, this means explicitly stating:

What percentage of outputs can be wrong? “Up to 10% of classifications may be incorrect.”
Which types of errors are tolerable? “False positives (flagging legitimate transactions) are acceptable up to 2%. False negatives (missing fraud) must stay below 0.5%.”
What constitutes a catastrophic error? “The AI must never generate medical advice. If a medical question is detected, it must refuse and escalate.”

The error budget makes the team’s tolerance for failure explicit rather than implicit. It prevents the endless pursuit of perfection and gives clear criteria for shipping.
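An error budget becomes operational as a running tally per error type. A sketch, with invented categories and limits:

```python
# Track error-budget consumption by error type (limits are illustrative).
BUDGET = {"false_positive": 0.02, "false_negative": 0.005}

def budget_status(counts: dict, total: int) -> dict:
    """Return each error type's observed rate and whether it is within budget."""
    return {
        err: {
            "rate": counts.get(err, 0) / total,
            "within_budget": counts.get(err, 0) / total <= limit,
        }
        for err, limit in BUDGET.items()
    }

status = budget_status({"false_positive": 150, "false_negative": 80}, total=10_000)
print(status["false_positive"])  # rate 0.015: within the 2% budget
print(status["false_negative"])  # rate 0.008: over the 0.5% budget
```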
Failure Mode Specification
For each AI feature, document the failure mode analysis:

Failure mode: What can go wrong?
User impact: What does the user experience?
Severity: Low / Medium / High / Critical
Detection: How do we know it happened?
Fallback: What does the system do instead?

Example for an AI email classifier:
Failure: Misclassifies an urgent email as low-priority
Impact: User misses a time-sensitive message
Severity: High
Detection: User manually reclassifies; monitor reclassification rate
Fallback: If confidence <70%, show in “Review” folder instead of auto-classifying
The confidence threshold pattern: The most common fallback pattern: if the model’s confidence is below a threshold, don’t act autonomously. Instead, present the output with a disclaimer, ask the user to confirm, or escalate to a human. The confidence threshold is a product decision the PM must specify in the PRD.
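The pattern reduces to a routing rule. The 0.70 threshold and the classifier interface below are assumptions for illustration, matching the email-classifier example above:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.70  # a product decision, specified in the PRD

@dataclass
class Classification:
    label: str
    confidence: float

def route_email(result: Classification) -> str:
    """Act autonomously only when the model is confident enough."""
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-file:{result.label}"  # autonomous action
    return "review-folder"                  # fallback: let the user decide

print(route_email(Classification("urgent", 0.91)))  # auto-file:urgent
print(route_email(Classification("urgent", 0.55)))  # review-folder
```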
Data & Evaluation Requirements
The sections most AI PRDs are missing
Data Requirements Section
Every AI PRD needs a data section that specifies:

Training data:
• What data is needed? (Source, format, volume)
• Is it available or does it need to be created?
• Who labels it? What are the labeling guidelines?
• What’s the timeline and cost for data preparation?

Evaluation data:
• What is the “golden set” of examples used to measure quality?
• How many examples? (Minimum 200–500 for meaningful evaluation)
• Who creates and maintains it?
• How often is it updated?

Production data:
• What data does the model need at inference time?
• What’s the latency requirement for data access?
• What happens if a data source is unavailable?
Evaluation Methodology
Specify how you’ll measure quality, not just what the target is:

Automated evaluation:
• Which metrics? (Precision, recall, F1, BLEU, ROUGE, etc.)
• Against which test set?
• How often? (Every model update, weekly, continuous)

Human evaluation:
• Who evaluates? (Domain experts, users, crowdsourced raters)
• What rubric do they use? (1–5 scale on accuracy, helpfulness, safety)
• How many evaluations per cycle?
• What’s the inter-rater agreement threshold?

User-facing evaluation:
• What feedback mechanisms exist? (Thumbs up/down, ratings, corrections)
• What A/B tests will you run?
• What business metrics indicate success? (Task completion rate, time saved, escalation rate)
The evaluation spec is the real spec: For AI products, the evaluation methodology is more important than the feature description. A well-defined evaluation framework tells the ML team exactly what to optimize for. A vague feature description with no evaluation criteria leads to a model that impresses in demos but fails in production.
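The automated-evaluation loop can be sketched in a few lines: compare predictions to the golden set and report precision, recall, and F1 for the class of interest. The tiny golden set and the "fraud" label here are invented for illustration:

```python
# Hand-labeled golden set vs. model predictions (toy data for illustration).
golden = ["fraud", "ok", "fraud", "ok", "fraud", "ok"]
preds  = ["fraud", "ok", "ok",    "ok", "fraud", "fraud"]

def prf(golden, preds, positive):
    """Precision, recall, and F1 for one positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(golden, preds))
    fp = sum(g != positive and p == positive for g, p in zip(golden, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golden, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf(golden, preds, "fraud")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A real golden set would be hundreds of examples, as the section above specifies; the mechanics are the same.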
Safety & Guardrail Requirements
What the AI must never do — the non-negotiable constraints
Hard Constraints
Every AI PRD needs a section on absolute constraints — behaviors that are never acceptable regardless of other performance metrics:

Content safety:
• “The AI must not generate hate speech, violent content, or sexually explicit material.”
• “The AI must not reveal personally identifiable information from training data.”

Scope boundaries:
• “The AI must refuse requests outside its domain. A customer service bot must not provide medical, legal, or financial advice.”
• “The AI must not impersonate a human. It must identify itself as AI when asked.”

Factual constraints:
• “The AI must not fabricate citations, statistics, or quotes.”
• “When uncertain, the AI must say ‘I don’t know’ rather than guess.”
Guardrail Metrics
Beyond hard constraints, define guardrail metrics — metrics that must not degrade even as you optimize primary metrics:

Primary metric: Task completion rate (optimize this)
Guardrail metrics:
• Safety violation rate must stay <0.1%
• Hallucination rate must stay <5%
• Average response time must stay <2s
• Cost per query must stay <$0.05

Guardrails prevent the team from over-optimizing one dimension at the expense of others. A model that achieves 95% task completion but hallucinates 20% of the time has failed its guardrails.
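"Optimize the primary metric only while guardrails hold" can be expressed as a release gate. Metric names and limits below mirror the list above but are still illustrative:

```python
# Hypothetical release gate: a candidate model replaces the current one
# only if it improves the primary metric without breaching any guardrail.
GUARDRAILS = {
    "safety_violation_rate": 0.001,  # must stay below 0.1%
    "hallucination_rate": 0.05,      # must stay below 5%
    "avg_latency_s": 2.0,
    "cost_per_query_usd": 0.05,
}

def passes_gate(candidate: dict, current_primary: float) -> bool:
    breaches = [k for k, limit in GUARDRAILS.items() if candidate[k] >= limit]
    improved = candidate["task_completion_rate"] > current_primary
    return improved and not breaches

candidate = {
    "task_completion_rate": 0.95,    # primary metric looks great...
    "safety_violation_rate": 0.0005,
    "hallucination_rate": 0.20,      # ...but this guardrail is breached
    "avg_latency_s": 1.2,
    "cost_per_query_usd": 0.03,
}
print(passes_gate(candidate, current_primary=0.90))  # False
```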
The red team section: Include a “red team” requirement in the PRD: before launch, the product must be tested by adversarial users who deliberately try to make it fail. Prompt injection, jailbreaking, edge cases, offensive inputs. If you don’t test for abuse, users will find the vulnerabilities for you — publicly.
Post-Launch Monitoring Spec
The section that turns a launch into a living product
What to Specify
Traditional PRDs end at launch. AI PRDs must specify what happens after launch:

Dashboard requirements:
• Which metrics are displayed? (Model accuracy, latency, cost, user satisfaction, error rates)
• What’s the refresh frequency? (Real-time, hourly, daily)
• Who has access?

Alert thresholds:
• “Alert if precision drops below 88% over a 24-hour window”
• “Alert if P95 latency exceeds 1.5s for more than 30 minutes”
• “Alert if cost per query exceeds $0.08”
• “Alert if safety violation rate exceeds 0.05%”
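Alert thresholds like these translate into simple rule checks over windowed metrics in the monitoring stack. A sketch, assuming the window aggregates already exist; all values are made up:

```python
# Evaluate alert rules against pre-aggregated window metrics (invented values).
ALERT_RULES = [
    ("precision_24h", "min", 0.88),
    ("p95_latency_s_30m", "max", 1.5),
    ("cost_per_query_usd", "max", 0.08),
    ("safety_violation_rate", "max", 0.0005),
]

def fired_alerts(window: dict) -> list:
    """Return a message for every rule the current window breaches."""
    alerts = []
    for metric, kind, limit in ALERT_RULES:
        value = window[metric]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            alerts.append(f"{metric}={value} breached {kind} {limit}")
    return alerts

window = {
    "precision_24h": 0.86,  # below the 0.88 floor -> alert
    "p95_latency_s_30m": 1.1,
    "cost_per_query_usd": 0.04,
    "safety_violation_rate": 0.0001,
}
for alert in fired_alerts(window):
    print(alert)
```

In practice these rules would live in the monitoring system itself, but writing them in the PRD in this testable form removes any ambiguity about what "alert if precision drops" means.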

Response procedures:
• Who is on-call for model issues?
• What’s the escalation path?
• When do you roll back vs. investigate?
Improvement Cadence
Specify the ongoing improvement cycle:

Weekly:
• Review monitoring dashboard
• Conduct error review session (PM + ML team)
• Triage user feedback

Monthly:
• Run full evaluation against golden set
• Review cost trends
• Assess whether retraining is needed

Quarterly:
• Evaluate new model versions (foundation model upgrades)
• Review and update evaluation dataset
• Assess competitive landscape (are competitors doing this better?)
Why this matters: The monitoring spec is what separates AI products that improve over time from AI products that silently degrade. If the PRD doesn’t specify monitoring, nobody builds it. If nobody builds it, nobody watches the model. If nobody watches the model, it drifts. Specify monitoring as a launch requirement, not a “nice to have.”
Common AI PRD Mistakes
What goes wrong when teams write AI specs the traditional way
Mistakes 1–4
1. Binary acceptance criteria.
“The AI correctly classifies tickets.” Correctly how often? 80%? 95%? 100%? Without a number, the spec is meaningless and the team has no target.

2. No failure specification.
The PRD describes what the AI does when it works but says nothing about what happens when it fails. Every user will eventually encounter a failure. If you haven’t designed for it, the experience is terrible.

3. Ignoring data requirements.
The PRD describes the desired AI behavior but doesn’t address where the training data comes from, how it’s labeled, or how the evaluation set is constructed. The ML team is left guessing.

4. No latency or cost constraints.
A model that takes 10 seconds to respond or costs $1 per query may be technically excellent but commercially unviable. Performance thresholds must include operational constraints.
Mistakes 5–7
5. Specifying the solution, not the problem.
“Use GPT-4 with RAG to answer questions.” This over-constrains the ML team. Instead: “Answer customer billing questions with ≥85% accuracy and ≤2s latency.” Let the team choose the approach.

6. No scope boundaries.
The PRD says what the AI should do but not what it should refuse to do. Without explicit boundaries, the AI attempts everything — including tasks it’s terrible at — and users lose trust.

7. Treating the spec as final.
AI PRDs are living documents. The initial thresholds are hypotheses. After the first round of evaluation, you’ll adjust. After launch, you’ll adjust again. Build in review cycles. A PRD that’s never updated after writing is a PRD that’s disconnected from reality.
The litmus test: Show your AI PRD to the ML lead. Ask: “Do you know exactly what to optimize for, what the quality bar is, and how we’ll evaluate success?” If the answer is no, the spec isn’t ready. The best AI PRDs are written collaboratively between PM and ML — not thrown over the wall.
The AI Product Requirements Canvas
A one-page template that captures everything an AI spec needs
Left Side: The Problem
1. Problem Statement
What user problem are we solving? What’s the current pain? (2–3 sentences, no AI jargon)

2. Success Metrics
• Primary metric: The one number that defines success
• Guardrail metrics: Numbers that must not degrade
• Business metric: Revenue, cost savings, or efficiency impact

3. Scope & Boundaries
• In scope: What the AI handles
• Out of scope: What the AI refuses
• Expansion path: What comes in v2, v3

4. Error Budget
• Acceptable error rate by type
• Catastrophic errors (never acceptable)
• Confidence threshold for autonomous action
Right Side: The Solution
5. Performance Thresholds
• Launch threshold (minimum to ship)
• Target threshold (goal for 3–6 months)
• Guardrail threshold (triggers alert/rollback)

6. Data Requirements
• Training data: source, volume, labeling plan
• Evaluation data: golden set, size, maintainer
• Production data: real-time requirements

7. Failure & Fallback
• Failure mode analysis (top 5 failure modes)
• Fallback behavior for each
• Human escalation criteria

8. Monitoring & Improvement
• Dashboard metrics and alert thresholds
• Review cadence (weekly, monthly, quarterly)
• Retraining triggers
The bottom line: Keep the AI PRD to 2–4 pages. Be opinionated about scope. Be explicit about thresholds. Be honest about what you don’t know (mark assumptions). Write it collaboratively with the ML team. Update it after every evaluation cycle. The best AI PRDs are living contracts between product and engineering — not static documents that gather dust.