| Question | What You’re Assessing | Green Light |
|---|---|---|
| 1. What decision or action does this enable? | Whether the output is actionable, not just informational | Clear action tied to a user workflow |
| 2. What happens when the AI is wrong? | Error tolerance and failure consequences | Errors are recoverable and cost is bounded |
| 3. Do we have the data? | Data availability, quality, and legal access | Sufficient labeled data exists or can be created |
| 4. Can a human do this today? | Whether a baseline exists and what “good” looks like | Human process exists but is slow, expensive, or inconsistent |
| 5. What’s the simplest approach? | Whether AI is actually needed vs. rules or heuristics | Simpler approaches have been tried or clearly won’t work |
| 6. What does “good enough” look like? | Whether success criteria can be defined and measured | Quantifiable thresholds exist for launch, target, and guardrail |

| Section | Contents |
|---|---|
| Problem Statement | User problem, current solution, why AI is the right approach, success metric |
| Performance Thresholds | Launch: minimum viable quality. Target: goal state. Guardrail: never-cross line. Include specific metrics (precision, recall, latency, etc.) |
| Error Budget & Failure Modes | Acceptable error rate, failure categories (false positive, false negative, hallucination, latency timeout), confidence thresholds, fallback behavior for each |
| Data Requirements | Training data sources, volume, quality criteria, labeling requirements, refresh cadence, legal/privacy constraints |
| Evaluation Plan | Offline metrics, online metrics, human evaluation criteria, A/B test design, golden test set definition |
| Safety & Guardrails | Content safety rules, action boundaries (for agents), red team scenarios, human escalation triggers, kill switch criteria |
| Monitoring & Iteration | Key dashboards, alerting thresholds, review cadence, improvement sprint structure, model update protocol |
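
The launch / target / guardrail thresholds from the Performance Thresholds row can be encoded as data so that every evaluation run is checked mechanically rather than by eyeball. A minimal sketch, assuming hypothetical metric names and example values (the real numbers belong in your spec):

```python
# Sketch: launch / target / guardrail thresholds encoded as data.
# Metric names and values are hypothetical examples, not recommendations.
THRESHOLDS = {
    "precision":      {"launch": 0.85, "target": 0.92, "guardrail": 0.80},
    "recall":         {"launch": 0.75, "target": 0.85, "guardrail": 0.70},
    "p95_latency_ms": {"launch": 1500, "target": 800,  "guardrail": 2500},
}

# Latency is better when lower; quality metrics are better when higher.
HIGHER_IS_BETTER = {"precision": True, "recall": True, "p95_latency_ms": False}


def check_release(metrics: dict) -> dict:
    """Classify each observed metric against its spec thresholds."""
    status = {}
    for name, observed in metrics.items():
        bounds = THRESHOLDS[name]
        higher = HIGHER_IS_BETTER[name]

        def meets(bound):
            return observed >= bound if higher else observed <= bound

        if not meets(bounds["guardrail"]):
            status[name] = "guardrail_breach"  # never-cross line: block release
        elif meets(bounds["target"]):
            status[name] = "at_target"
        elif meets(bounds["launch"]):
            status[name] = "launchable"
        else:
            status[name] = "below_launch"
    return status
```

A release gate can then require that no metric is in `guardrail_breach` and every metric is at least `launchable`.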

| Option | Speed | Cost (Initial) | Control | Differentiation | Best When |
|---|---|---|---|---|---|
| Buy SaaS | Days | Low | Minimal | None | Commodity capability, not core to product |
| Use API | Weeks | Low–Med | Low | Low | Rapid prototyping, validating demand |
| Fine-Tune | Weeks–Months | Medium | Medium | Medium | Domain-specific quality matters |
| Train Custom | Months | High | High | High | Proprietary data advantage, core IP |
| Build from Scratch | Quarters+ | Very High | Full | Maximum | Unique architecture required, massive scale |

| Layer | Metrics | Owner | Cadence |
|---|---|---|---|
| Model Metrics | Precision, recall, F1, perplexity, BLEU/ROUGE, latency, throughput | ML Engineer | Every experiment |
| Product Metrics | Task completion rate, user acceptance rate, edit distance, time-to-value, error rate | PM | Weekly |
| Business Metrics | Revenue impact, cost per resolution, conversion lift, NPS/CSAT, retention | PM + Leadership | Monthly / Quarterly |
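
Of the model-layer metrics above, precision, recall, and F1 are simple enough to compute without any ML library. A minimal sketch for a binary classification task (1 = positive class):

```python
# Sketch: precision, recall, and F1 from paired true/predicted labels.
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Product- and business-layer metrics, by contrast, come from instrumentation and analytics rather than from model outputs alone.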

| Stage | Audience | Duration | Go/No-Go Criteria |
|---|---|---|---|
| Shadow | 0% (run in parallel, no user exposure) | 1–2 weeks | Output quality matches or exceeds baseline |
| Canary | 1–5% of traffic | 1 week | No safety incidents, latency within SLA, quality stable |
| Beta | 10–20% (opt-in users) | 2–4 weeks | User satisfaction above threshold, error rate below budget |
| Ramp-Up | 20% → 50% → 100% | 2–4 weeks | Business metrics trending positive, no new failure modes |
| GA | 100% | Ongoing | Continuous monitoring, weekly quality reviews |
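
The staged percentages above are usually implemented with deterministic per-user bucketing, so a given user keeps the same experience as the rollout grows instead of flickering in and out on every request. A minimal sketch (the stage names, salt, and percentages are illustrative):

```python
# Sketch: stable hash-based bucketing for staged rollout.
# Stage percentages mirror the rollout table; salt isolates this
# feature's buckets from other experiments.
import hashlib

ROLLOUT_PCT = {"shadow": 0, "canary": 5, "beta": 20, "ramp_50": 50, "ga": 100}


def in_rollout(user_id: str, stage: str, salt: str = "ai-feature-v1") -> bool:
    """True if this user falls inside the stage's traffic percentage."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < ROLLOUT_PCT[stage]
```

Because a user's bucket is fixed, anyone included at 5% stays included at 20% and 50%, which keeps cohort analysis clean across stages. Shadow mode stays at 0% exposure: the model runs in parallel and its outputs are logged for comparison, never shown.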

| Metric | What to Watch | Alert Threshold |
|---|---|---|
| P95 Latency | Response time for 95th percentile of requests | > 2x baseline |
| Error Rate | % of requests returning errors or timeouts | > 1% (adjust per product) |
| Safety Triggers | Count of content filter or guardrail activations | Any spike > 3x daily average |
| User Feedback | Thumbs up/down ratio, explicit complaints | Negative ratio > 30% |
| Daily Cost | Total inference spend vs. budget | > 120% of daily budget |
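
Two of the alert rules above — P95 latency versus a 2x baseline, and a safety-trigger spike versus 3x the daily average — can be evaluated with a few lines over a metrics window. A minimal sketch using the nearest-rank percentile method:

```python
# Sketch: evaluating the latency and safety-trigger alert rules
# from the table over raw measurements.
import math


def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]


def latency_alert(latencies_ms, baseline_p95_ms):
    """Fire when the window's P95 exceeds 2x the baseline P95."""
    return p95(latencies_ms) > 2 * baseline_p95_ms


def safety_spike_alert(today_count, recent_daily_counts):
    """Fire when today's guardrail activations exceed 3x the daily average."""
    avg = sum(recent_daily_counts) / len(recent_daily_counts)
    return today_count > 3 * avg
```

In production these checks typically run inside a metrics platform rather than application code, but the thresholds should match the table either way.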

| Metric | What to Watch | Action |
|---|---|---|
| Quality Trend | Task completion rate, acceptance rate over 7 days | Investigate any downward trend > 5% |
| Drift Detection | Distribution shift in inputs or outputs vs. baseline | Trigger evaluation suite if detected |
| Cost per Query | Average cost trend, cost by feature/endpoint | Optimize the three most expensive endpoints |
| Error Analysis | Categorized failures from the past week | Add worst failures to golden test set |
| User Adoption | WAU, activation rate, stickiness | Investigate drops > 10% |
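
For the drift-detection row, one common lightweight check is the population stability index (PSI) between a baseline and current histogram of an input or output feature. A minimal sketch; the 0.2 threshold is a widely used rule of thumb, not a universal constant:

```python
# Sketch: population stability index (PSI) over pre-binned counts.
# Higher PSI means more distribution shift between baseline and current.
import math


def psi(baseline_counts, current_counts, eps=1e-6):
    """PSI between two histograms with matching bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    total = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_frac = max(b / b_total, eps)  # eps guards against empty bins
        c_frac = max(c / c_total, eps)
        total += (c_frac - b_frac) * math.log(c_frac / b_frac)
    return total


def drift_detected(baseline_counts, current_counts, threshold=0.2):
    """Flag drift when PSI crosses the chosen threshold."""
    return psi(baseline_counts, current_counts) > threshold
```

When this flags drift, the action from the table applies: rerun the offline evaluation suite before trusting live quality numbers.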

| Horizon | Timeframe | Confidence | Capacity | Language |
|---|---|---|---|---|
| H1 — Commit | 0–6 weeks | High | ~60% | “We will deliver...” |
| H2 — Plan | 6 weeks–3 months | Medium | ~30% | “We’re targeting...” |
| H3 — Explore | 3–6 months | Low | ~10% | “We’re investigating...” |