| Question | What You’re Assessing | Green Light |
|---|---|---|
| 1. What decision or action does this enable? | Whether the output is actionable, not just informational | Clear action tied to a user workflow |
| 2. What happens when the AI is wrong? | Error tolerance and failure consequences | Errors are recoverable and cost is bounded |
| 3. Do we have the data? | Data availability, quality, and legal access | Sufficient labeled data exists or can be created |
| 4. Can a human do this today? | Whether a baseline exists and what “good” looks like | Human process exists but is slow, expensive, or inconsistent |
| 5. What’s the simplest approach? | Whether AI is actually needed vs. rules or heuristics | Simpler approaches have been tried or clearly won’t work |
| 6. What does “good enough” look like? | Whether success criteria can be defined and measured | Quantifiable thresholds exist for launch, target, and guardrail |

| Section | Contents |
|---|---|
| Problem Statement | User problem, current solution, why AI is the right approach, success metric |
| Performance Thresholds | Launch: minimum viable quality. Target: goal state. Guardrail: never-cross line. Include specific metrics (precision, recall, latency, etc.) |
| Error Budget & Failure Modes | Acceptable error rate, failure categories (false positive, false negative, hallucination, latency timeout), confidence thresholds, fallback behavior for each |
| Data Requirements | Training data sources, volume, quality criteria, labeling requirements, refresh cadence, legal/privacy constraints |
| Evaluation Plan | Offline metrics, online metrics, human evaluation criteria, A/B test design, golden test set definition |
| Safety & Guardrails | Content safety rules, action boundaries (for agents), red team scenarios, human escalation triggers, kill switch criteria |
| Monitoring & Iteration | Key dashboards, alerting thresholds, review cadence, improvement sprint structure, model update protocol |
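
The launch / target / guardrail thresholds from the Performance Thresholds row can be encoded as data so that every evaluation run is checked mechanically rather than by eyeball. A minimal sketch, assuming hypothetical metric names and example values (the real numbers belong in your spec):

```python
# Sketch: launch / target / guardrail thresholds encoded as data.
# Metric names and values are hypothetical examples, not recommendations.
THRESHOLDS = {
    "precision":      {"launch": 0.85, "target": 0.92, "guardrail": 0.80},
    "recall":         {"launch": 0.75, "target": 0.85, "guardrail": 0.70},
    "p95_latency_ms": {"launch": 1500, "target": 800,  "guardrail": 2500},
}

# Latency is better when lower; quality metrics are better when higher.
HIGHER_IS_BETTER = {"precision": True, "recall": True, "p95_latency_ms": False}


def check_release(metrics: dict) -> dict:
    """Classify each observed metric against its spec thresholds."""
    status = {}
    for name, observed in metrics.items():
        bounds = THRESHOLDS[name]
        higher = HIGHER_IS_BETTER[name]

        def meets(bound):
            return observed >= bound if higher else observed <= bound

        if not meets(bounds["guardrail"]):
            status[name] = "guardrail_breach"  # never-cross line: block release
        elif meets(bounds["target"]):
            status[name] = "at_target"
        elif meets(bounds["launch"]):
            status[name] = "launchable"
        else:
            status[name] = "below_launch"
    return status
```

A release gate can then require that no metric is in `guardrail_breach` and every metric is at least `launchable`.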

| Option | Speed | Cost (Initial) | Control | Differentiation | Best When |
|---|---|---|---|---|---|
| Buy SaaS | Days | Low | Minimal | None | Commodity capability, not core to product |
| Use API | Weeks | Low–Med | Low | Low | Rapid prototyping, validating demand |
| Fine-Tune | Weeks–Months | Medium | Medium | Medium | Domain-specific quality matters |
| Train Custom | Months | High | High | High | Proprietary data advantage, core IP |
| Build from Scratch | Quarters+ | Very High | Full | Maximum | Unique architecture required, massive scale |

| Layer | Metrics | Owner | Cadence |
|---|---|---|---|
| Model Metrics | Precision, recall, F1, perplexity, BLEU/ROUGE, latency, throughput | ML Engineer | Every experiment |
| Product Metrics | Task completion rate, user acceptance rate, edit distance, time-to-value, error rate | PM | Weekly |
| Business Metrics | Revenue impact, cost per resolution, conversion lift, NPS/CSAT, retention | PM + Leadership | Monthly / Quarterly |
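
Of the model-layer metrics above, precision, recall, and F1 are simple enough to compute without any ML library. A minimal sketch for a binary classification task (1 = positive class):

```python
# Sketch: precision, recall, and F1 from paired true/predicted labels.
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Product- and business-layer metrics, by contrast, come from instrumentation and analytics rather than from model outputs alone.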

| Stage | Audience | Duration | Go/No-Go Criteria |
|---|---|---|---|
| Shadow | 0% (run in parallel, no user exposure) | 1–2 weeks | Output quality matches or exceeds baseline |
| Canary | 1–5% of traffic | 1 week | No safety incidents, latency within SLA, quality stable |
| Beta | 10–20% (opt-in users) | 2–4 weeks | User satisfaction above threshold, error rate below budget |
| Ramp-Up | 20% → 50% → 100% | 2–4 weeks | Business metrics trending positive, no new failure modes |
| GA | 100% | Ongoing | Continuous monitoring, weekly quality reviews |
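
The staged percentages above are usually implemented with deterministic per-user bucketing, so a given user keeps the same experience as the rollout grows instead of flickering in and out on every request. A minimal sketch (the stage names, salt, and percentages are illustrative):

```python
# Sketch: stable hash-based bucketing for staged rollout.
# Stage percentages mirror the rollout table; salt isolates this
# feature's buckets from other experiments.
import hashlib

ROLLOUT_PCT = {"shadow": 0, "canary": 5, "beta": 20, "ramp_50": 50, "ga": 100}


def in_rollout(user_id: str, stage: str, salt: str = "ai-feature-v1") -> bool:
    """True if this user falls inside the stage's traffic percentage."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < ROLLOUT_PCT[stage]
```

Because a user's bucket is fixed, anyone included at 5% stays included at 20% and 50%, which keeps cohort analysis clean across stages. Shadow mode stays at 0% exposure: the model runs in parallel and its outputs are logged for comparison, never shown.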

| Metric | What to Watch | Alert Threshold |
|---|---|---|
| P95 Latency | Response time for 95th percentile of requests | > 2x baseline |
| Error Rate | % of requests returning errors or timeouts | > 1% (adjust per product) |
| Safety Triggers | Count of content filter or guardrail activations | Any spike > 3x daily average |
| User Feedback | Thumbs up/down ratio, explicit complaints | Negative ratio > 30% |
| Daily Cost | Total inference spend vs. budget | > 120% of daily budget |
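
Two of the alert rules above — P95 latency versus a 2x baseline, and a safety-trigger spike versus 3x the daily average — can be evaluated with a few lines over a metrics window. A minimal sketch using the nearest-rank percentile method:

```python
# Sketch: evaluating the latency and safety-trigger alert rules
# from the table over raw measurements.
import math


def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]


def latency_alert(latencies_ms, baseline_p95_ms):
    """Fire when the window's P95 exceeds 2x the baseline P95."""
    return p95(latencies_ms) > 2 * baseline_p95_ms


def safety_spike_alert(today_count, recent_daily_counts):
    """Fire when today's guardrail activations exceed 3x the daily average."""
    avg = sum(recent_daily_counts) / len(recent_daily_counts)
    return today_count > 3 * avg
```

In production these checks typically run inside a metrics platform rather than application code, but the thresholds should match the table either way.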

| Metric | What to Watch | Action |
|---|---|---|
| Quality Trend | Task completion rate, acceptance rate over 7 days | Investigate any downward trend > 5% |
| Drift Detection | Distribution shift in inputs or outputs vs. baseline | Trigger evaluation suite if detected |
| Cost per Query | Average cost trend, cost by feature/endpoint | Optimize the three most expensive endpoints |
| Error Analysis | Categorized failures from the past week | Add worst failures to golden test set |
| User Adoption | WAU, activation rate, stickiness | Investigate drops > 10% |
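
For the drift-detection row, one common lightweight check is the population stability index (PSI) between a baseline and current histogram of an input or output feature. A minimal sketch; the 0.2 threshold is a widely used rule of thumb, not a universal constant:

```python
# Sketch: population stability index (PSI) over pre-binned counts.
# Higher PSI means more distribution shift between baseline and current.
import math


def psi(baseline_counts, current_counts, eps=1e-6):
    """PSI between two histograms with matching bins."""
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    total = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_frac = max(b / b_total, eps)  # eps guards against empty bins
        c_frac = max(c / c_total, eps)
        total += (c_frac - b_frac) * math.log(c_frac / b_frac)
    return total


def drift_detected(baseline_counts, current_counts, threshold=0.2):
    """Flag drift when PSI crosses the chosen threshold."""
    return psi(baseline_counts, current_counts) > threshold
```

When this flags drift, the action from the table applies: rerun the offline evaluation suite before trusting live quality numbers.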

| Horizon | Timeframe | Confidence | Capacity | Language |
|---|---|---|---|---|
| H1 — Commit | 0–6 weeks | High | ~60% | “We will deliver...” |
| H2 — Plan | 6 weeks–3 months | Medium | ~30% | “We’re targeting...” |
| H3 — Explore | 3–6 months | Low | ~10% | “We’re investigating...” |