
AI PM Toolkit

Frameworks, checklists & templates — practical tools for every stage of AI product management
Problem Framing Canvas
Six questions to evaluate whether AI is the right solution — Ch 5
The Six Questions
| Question | What You’re Assessing | Green Light |
| --- | --- | --- |
| 1. What decision or action does this enable? | Whether the output is actionable, not just informational | Clear action tied to a user workflow |
| 2. What happens when the AI is wrong? | Error tolerance and failure consequences | Errors are recoverable and cost is bounded |
| 3. Do we have the data? | Data availability, quality, and legal access | Sufficient labeled data exists or can be created |
| 4. Can a human do this today? | Whether a baseline exists and what “good” looks like | Human process exists but is slow, expensive, or inconsistent |
| 5. What’s the simplest approach? | Whether AI is actually needed vs. rules or heuristics | Simpler approaches have been tried or clearly won’t work |
| 6. What does “good enough” look like? | Whether success criteria can be defined and measured | Quantifiable thresholds exist for launch, target, and guardrail |
Decision rule: If you can’t answer questions 2, 3, and 6 clearly, you’re not ready to build. Go back to discovery.
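As a sketch, the decision rule reduces to a hard gate over the question numbers. The helper below is illustrative, not from the chapter; the inputs are simply the numbers of the questions you can answer clearly.

```python
# Questions 2, 3, and 6 are hard gates per the decision rule:
# wrong-answer cost, data availability, and a definition of "good enough".
HARD_GATES = {2, 3, 6}

def ready_to_build(clear_answers: set[int]) -> bool:
    """Return True only if every hard-gate question has a clear answer."""
    return HARD_GATES.issubset(clear_answers)
```

Usage: `ready_to_build({1, 2, 4})` returns `False`, sending you back to discovery even though three questions are answered.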
AI Product Requirements Canvas
The spec template that replaces traditional PRDs for AI features — Ch 8
Canvas Sections
| Section | Contents |
| --- | --- |
| Problem Statement | User problem, current solution, why AI is the right approach, success metric |
| Performance Thresholds | Launch: minimum viable quality. Target: goal state. Guardrail: never-cross line. Include specific metrics (precision, recall, latency, etc.) |
| Error Budget & Failure Modes | Acceptable error rate, failure categories (false positive, false negative, hallucination, latency timeout), confidence thresholds, fallback behavior for each |
| Data Requirements | Training data sources, volume, quality criteria, labeling requirements, refresh cadence, legal/privacy constraints |
| Evaluation Plan | Offline metrics, online metrics, human evaluation criteria, A/B test design, golden test set definition |
| Safety & Guardrails | Content safety rules, action boundaries (for agents), red team scenarios, human escalation triggers, kill switch criteria |
| Monitoring & Iteration | Key dashboards, alerting thresholds, review cadence, improvement sprint structure, model update protocol |
Key difference from traditional PRDs: AI specs define acceptable ranges, not exact behaviors. Every section should answer: “What happens when the AI gets it wrong?”
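Launch/target/guardrail ranges lend themselves to a small data structure. A minimal sketch, assuming one scalar metric per threshold; the `Threshold` class and its status labels are hypothetical names, not from the canvas.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    """Launch / target / guardrail range for one metric (illustrative)."""
    metric: str
    launch: float      # minimum viable quality
    target: float      # goal state
    guardrail: float   # never-cross line

    def status(self, observed: float) -> str:
        if observed < self.guardrail:
            return "breach"        # kill-switch territory
        if observed < self.launch:
            return "below-launch"
        return "target-met" if observed >= self.target else "launch-met"
```

Usage: `Threshold("precision", launch=0.85, target=0.92, guardrail=0.75).status(0.88)` returns `"launch-met"`, an acceptable range rather than an exact behavior.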
Build vs. Buy Decision Matrix
Five-option spectrum with trade-off analysis — Ch 7
The Spectrum
| Option | Speed | Cost (Initial) | Control | Differentiation | Best When |
| --- | --- | --- | --- | --- | --- |
| Buy SaaS | Days | Low | Minimal | None | Commodity capability, not core to product |
| Use API | Weeks | Low–Med | Low | Low | Rapid prototyping, validating demand |
| Fine-Tune | Weeks–Months | Medium | Medium | Medium | Domain-specific quality matters |
| Train Custom | Months | High | High | High | Proprietary data advantage, core IP |
| Build from Scratch | Quarters+ | Very High | Full | Maximum | Unique architecture required, massive scale |
Default strategy: Start with API for speed → fine-tune where quality matters → build custom only where it creates a durable moat. Always maintain the ability to swap providers.
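“Maintain the ability to swap providers” usually means a thin interface between product code and the vendor SDK. A minimal sketch; the `Completion` protocol and `StubProvider` are illustrative names, not a real vendor API.

```python
from typing import Protocol

class Completion(Protocol):
    """Minimal provider interface; product code never imports a vendor SDK."""
    def complete(self, prompt: str) -> str: ...

class StubProvider:
    """Stand-in implementation for tests and local development."""
    def complete(self, prompt: str) -> str:
        return f"stub answer to: {prompt}"

def answer(provider: Completion, question: str) -> str:
    # Depending on the protocol, not a concrete class, makes swapping
    # API vendors (or moving to fine-tuned/custom) a wiring-time change.
    return provider.complete(question)
```

Each real provider then gets its own small adapter class, and the rest of the product stays untouched when you move along the spectrum.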
Evaluation & Metrics Framework
The metrics stack connecting model performance to business outcomes — Ch 10
The Three-Layer Metrics Stack
| Layer | Metrics | Owner | Cadence |
| --- | --- | --- | --- |
| Model Metrics | Precision, recall, F1, perplexity, BLEU/ROUGE, latency, throughput | ML Engineer | Every experiment |
| Product Metrics | Task completion rate, user acceptance rate, edit distance, time-to-value, error rate | PM | Weekly |
| Business Metrics | Revenue impact, cost per resolution, conversion lift, NPS/CSAT, retention | PM + Leadership | Monthly / Quarterly |
LLM-Specific Evaluation
Automated evaluation: LLM-as-judge scoring on rubrics (relevance, accuracy, completeness, safety). Run on every prompt change.
Human evaluation: Expert review of sampled outputs. Weekly cadence, minimum 50–100 samples per review cycle.
RAG evaluation (RAGAS): Context relevance, answer faithfulness, answer relevance, context recall.
A/B testing: Split traffic between model versions. Minimum 2-week test duration. Primary metric: task completion rate, not user preference.
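The A/B comparison on task completion rate can be checked with a standard two-proportion z-test. A minimal sketch using only the standard library; real tests should also pre-register sample size and duration.

```python
from math import sqrt, erf

def completion_rate_z(success_a: int, n_a: int,
                      success_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

For instance, 700/1000 completions on control versus 760/1000 on the new model version yields a significant lift at the usual 0.05 level.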
Golden rule: Never report a model metric without its corresponding product metric. “F1 improved by 3%” means nothing. “F1 improved by 3%, reducing false escalations by 12% and saving $40K/month” means everything.
AI Product Launch Checklist
Pre-launch gates and staged rollout protocol — Ch 14–15
Pre-Launch Gates
  • Performance thresholds met: Launch-level metrics achieved on held-out test set
  • Red team complete: Adversarial testing conducted, critical vulnerabilities addressed
  • Golden test set passing: Regression suite confirms no degradation from baseline
  • Safety guardrails active: Content filters, action boundaries, and rate limits configured
  • Kill switch tested: Verified ability to disable AI feature instantly without full deployment
  • Monitoring dashboards live: Latency, quality, cost, and safety metrics visible in real-time
  • Alerting configured: Automated alerts for quality drops, cost spikes, and safety triggers
  • Fallback behavior defined: Graceful degradation path when AI is unavailable or underperforming
  • Human escalation path tested: Users can reach a human when AI fails
  • Incident runbook written: Step-by-step response procedures for common failure scenarios
  • Privacy review complete: Data flows documented, PII handling verified, DPAs in place
  • AI disclosure in place: Users informed they’re interacting with AI where required
  • Rollout plan approved: Staged rollout with clear go/no-go criteria at each stage
Staged Rollout Protocol
| Stage | Audience | Duration | Go/No-Go Criteria |
| --- | --- | --- | --- |
| Shadow | 0% (run in parallel, no user exposure) | 1–2 weeks | Output quality matches or exceeds baseline |
| Canary | 1–5% of traffic | 1 week | No safety incidents, latency within SLA, quality stable |
| Beta | 10–20% (opt-in users) | 2–4 weeks | User satisfaction above threshold, error rate below budget |
| Ramp-Up | 20% → 50% → 100% | 2–4 weeks | Business metrics trending positive, no new failure modes |
| GA | 100% | Ongoing | Continuous monitoring, weekly quality reviews |
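A stage gate like the canary criteria can be sketched as a single boolean check. The thresholds below are assumptions for illustration, not values from the chapter.

```python
def canary_go(safety_incidents: int, p95_latency_ms: float,
              sla_ms: float, quality_delta: float) -> bool:
    """Advance past canary only if every criterion holds.

    quality_delta = canary quality minus baseline quality;
    "stable" is assumed here to mean within 1 point of baseline.
    """
    return (safety_incidents == 0
            and p95_latency_ms <= sla_ms
            and quality_delta >= -0.01)
```

Encoding each gate this way makes the go/no-go decision auditable: the rollout either satisfies the predicate or it does not.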
PM Monitoring Dashboard Template
Daily and weekly views for AI product health — Ch 16–17
Daily View (5-Minute Check)
| Metric | What to Watch | Alert Threshold |
| --- | --- | --- |
| P95 Latency | Response time for 95th percentile of requests | > 2x baseline |
| Error Rate | % of requests returning errors or timeouts | > 1% (adjust per product) |
| Safety Triggers | Count of content filter or guardrail activations | Any spike > 3x daily average |
| User Feedback | Thumbs up/down ratio, explicit complaints | Negative ratio > 30% |
| Daily Cost | Total inference spend vs. budget | > 120% of daily budget |
Weekly View (30-Minute Review)
| Metric | What to Watch | Action |
| --- | --- | --- |
| Quality Trend | Task completion rate, acceptance rate over 7 days | Investigate any downward trend > 5% |
| Drift Detection | Distribution shift in inputs or outputs vs. baseline | Trigger evaluation suite if detected |
| Cost per Query | Average cost trend, cost by feature/endpoint | Optimize top-3 most expensive endpoints |
| Error Analysis | Categorized failures from the past week | Add worst failures to golden test set |
| User Adoption | WAU, activation rate, stickiness | Investigate drops > 10% |
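Drift detection is commonly implemented with the Population Stability Index (PSI) over binned input or output distributions. A minimal sketch; PSI > 0.2 is a common rule of thumb for meaningful drift, not a value from the chapter.

```python
from math import log

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over matched histogram bin proportions."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum((a - e) * log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

Both arguments are bin proportions (summing to 1) from the baseline week and the current week; when PSI crosses the drift threshold, re-run the evaluation suite as the table directs.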
Ritual: Monday morning — review weekly dashboard. Wednesday — error analysis with ML team. Friday — update improvement backlog based on the week’s data.
AI ROI Calculation Framework
How to quantify and communicate AI product value — Ch 18
The Formula
AI ROI = (Value Created + Costs Avoided) − (Build Cost + Run Cost + Opportunity Cost)

Value Created: New revenue, conversion lift, upsell from AI features, premium pricing
Costs Avoided: Reduced headcount growth, faster resolution times, fewer errors, lower manual review
Build Cost: Engineering time, data labeling, infrastructure setup, vendor fees during development
Run Cost: Inference costs, monitoring tools, ongoing data ops, model maintenance, compliance
Opportunity Cost: What the team could have built instead
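The formula translates directly into code. A trivial sketch, useful mainly as a checklist that all five inputs have actually been estimated in the same currency and period:

```python
def ai_roi(value_created: float, costs_avoided: float,
           build_cost: float, run_cost: float,
           opportunity_cost: float) -> float:
    """Net ROI = (value created + costs avoided) - (build + run + opportunity cost)."""
    return (value_created + costs_avoided) - (build_cost + run_cost + opportunity_cost)
```

For example, $500K new revenue plus $300K avoided costs against $200K build, $150K run, and $100K opportunity cost nets $350K.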
Common ROI Pitfalls
1. Ignoring run costs: Inference, monitoring, and maintenance are ongoing — not one-time.
2. Counting displaced headcount as savings: People are usually redeployed, not eliminated.
3. Measuring too early: AI products compound value over time — month-1 ROI is misleading.
4. Forgetting opportunity cost: The team could have shipped something else.
5. Ignoring quality costs: Bad AI outputs create downstream costs (support tickets, trust erosion, rework).
Executive reporting tip: Lead with P&L impact, not model metrics. “AI reduced cost per support resolution by 34%, saving $2.1M annually” — not “We improved F1 score to 0.92.”
Three-Horizon Roadmap Template
Planning under uncertainty with confidence levels — Ch 20
Horizon Structure
| Horizon | Timeframe | Confidence | Capacity | Language |
| --- | --- | --- | --- | --- |
| H1 — Commit | 0–6 weeks | High | ~60% | “We will deliver...” |
| H2 — Plan | 6 weeks–3 months | Medium | ~30% | “We’re targeting...” |
| H3 — Explore | 3–6 months | Low | ~10% | “We’re investigating...” |
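The capacity column can be sketched as a simple allocation helper. The 60/30/10 split mirrors the table; the function name and naive rounding are illustrative.

```python
def horizon_capacity(total_points: int,
                     mix: tuple[float, float, float] = (0.6, 0.3, 0.1)) -> dict:
    """Split sprint capacity across H1/H2/H3 using the ~60/30/10 mix."""
    h1, h2, h3 = (round(total_points * share) for share in mix)
    return {"H1": h1, "H2": h2, "H3": h3}
```

With 100 points of quarterly capacity, roughly 60 go to committed work, 30 to planned bets, and 10 to exploratory spikes.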
Initiative Card Template
For each roadmap initiative, document:

Outcome goal: The user/business outcome, not the feature (e.g., “Reduce support resolution time by 35%”)
Confidence level: Committed / Planned / Exploring
Key dependencies: Data availability, model capability, regulatory approval, partner integration
Risk factors: What could go wrong and how you’d detect it early
Go/no-go criteria: Specific conditions that must be met to proceed (H2 items)
Learning objectives: What you need to learn from this spike (H3 items)
Moat contribution: How this builds a compounding advantage
Portfolio type: Optimize / Extend / Explore
Quarterly Planning Ritual
Week 1 — Landscape Scan: Review model releases, competitor moves, cost changes, regulatory updates. Update capability watch list.
Week 2 — Portfolio Review: Graduate validated H3 spikes. Promote/demote H2 items. Retire stale initiatives.
Week 3 — Prioritization: Score new opportunities (user impact, business value, technical confidence, moat contribution). Balance Optimize/Extend/Explore mix.
Week 4 — Communication: Present updated roadmap using three-layer format (outcomes, confidence, approach). Align stakeholders.
Roadmap test: For every initiative, ask: “If the underlying model changes next quarter, does this outcome still matter?” If yes, it’s a good roadmap item. If no, you’re planning around technology, not users.
Responsible AI Checklist
Ethics, safety, and compliance gates for every AI product — Ch 19
Before Building
  • Risk classification: Categorize under EU AI Act (unacceptable, high, limited, minimal risk)
  • Ethical review: Identify potential harms to users, affected communities, and society
  • Privacy assessment: Document data flows, PII handling, retention policies, legal basis
  • Fairness criteria: Define protected groups and fairness metrics to track
During Development
  • Bias testing: Evaluate model performance across demographic subgroups
  • Safety guardrails: Content filters, action boundaries, rate limits configured and tested
  • Human oversight model: Define in-the-loop, on-the-loop, or over-the-loop based on risk
  • Red teaming: Adversarial testing for harmful outputs, prompt injection, data leakage
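Bias testing across subgroups often starts with a disparity ratio: compare the quality metric's worst-performing group against its best. A sketch using the common four-fifths heuristic as the cutoff (an assumption here, not a chapter recommendation):

```python
def subgroup_disparity(rates: dict[str, float], min_ratio: float = 0.8) -> bool:
    """True if the worst subgroup's metric is within min_ratio of the best's.

    rates maps subgroup name -> quality metric (e.g. acceptance rate).
    """
    worst, best = min(rates.values()), max(rates.values())
    return best == 0 or worst / best >= min_ratio
```

A failing check does not prove unfairness on its own, but it flags which subgroup comparison to investigate and track as a fairness metric.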
At Launch & Ongoing
  • AI disclosure: Users informed they’re interacting with AI (legal requirement for chatbots in EU)
  • User controls: Opt-out, feedback mechanisms, data deletion rights
  • Continuous monitoring: Bias drift, safety trigger rates, user complaints tracked
  • Incident response: Documented procedure for AI safety incidents
  • Quarterly ethical review: Reassess risk, update documentation, review incident patterns
  • Model card maintained: Capabilities, limitations, intended use, and known biases documented