Ch 15 — Launch Strategy for AI Products

Staged rollouts, shadow testing, and the launch playbook that reduces production incidents by 75%.
High-level flow: Shadow → Canary → Beta → Ramp → GA → Optimize
Why AI Launches Are Different
The unique risks that make “ship it and see” dangerous for AI products
The AI Launch Risk Profile
Traditional software launches carry known risks: bugs, performance issues, UX confusion. AI launches carry all of those plus a set of risks unique to probabilistic systems:

Unpredictable failure modes: The AI might produce harmful, embarrassing, or legally risky outputs that weren’t caught in testing. Unlike a bug that affects all users, AI failures can be triggered by specific user inputs that are impossible to enumerate in advance.

Viral failure amplification: A single bad AI response can go viral on social media within hours. The reputational damage from one screenshot of your AI saying something inappropriate can overshadow months of good performance.

Trust is fragile: Users who encounter a bad AI experience early are unlikely to return. First impressions with AI products are disproportionately important because users are already skeptical of AI reliability.

Cost unpredictability: AI inference costs scale with usage. A viral launch can generate unexpected API bills in the tens of thousands of dollars within days.
The Staged Rollout Imperative
Because of these risks, AI products should never launch to 100% of users on day one. Instead, use a staged rollout that progressively exposes the product to more users while monitoring quality at each stage.

Organizations with mature staged rollout practices achieve:
75% fewer production incidents
99.9%+ uptime with proper redundancy
50% faster time-to-market via standardized pipelines
40% cost reduction through optimization

The staged approach isn’t slower — it’s faster, because you spend less time fighting fires after launch.
The launch principle: Every AI launch is an experiment. You’re testing a hypothesis: “This AI product creates value for users at acceptable quality and cost.” Staged rollouts let you validate this hypothesis incrementally, with the ability to stop or roll back at any point.
Shadow Testing
Stage 0: Run the AI in production without showing results to users
How Shadow Testing Works
Deploy the AI alongside the existing production system. For every real user request, the AI generates a response — but the response is never shown to users. Instead, it’s logged for evaluation.

This gives you:
Real-world input distribution: See exactly what users ask, not what your test set assumes
Performance under load: Measure latency, cost, and reliability at production scale
Quality baseline: Evaluate AI responses against the existing system’s outputs (if replacing a human process) or against human-labeled ground truth
Zero user risk: If the AI produces terrible outputs, nobody sees them
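A minimal sketch of the shadow pattern in Python (the handler, model calls, and log format here are illustrative, not from the text): the production response is returned to the user immediately, while the AI candidate runs off the request path and its output goes only to a log for offline evaluation.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

shadow_pool = ThreadPoolExecutor(max_workers=4)

def production_handler(request: str) -> str:
    """Existing system; the user always sees this response."""
    return f"[existing system reply to: {request}]"

def ai_candidate(request: str) -> str:
    """New AI system under evaluation; its output is never shown to users."""
    return f"[AI reply to: {request}]"

def log_shadow_result(record: dict) -> None:
    """Append one record for offline evaluation (stdout stands in for a log store)."""
    print(json.dumps(record))

def handle_request(request: str) -> str:
    served = production_handler(request)

    def shadow() -> None:
        start = time.perf_counter()
        try:
            candidate, error = ai_candidate(request), None
        except Exception as exc:
            candidate, error = None, str(exc)
        log_shadow_result({
            "request": request,
            "served_response": served,
            "shadow_response": candidate,
            "shadow_latency_ms": round((time.perf_counter() - start) * 1000, 1),
            "shadow_error": error,
        })

    # Off the request path: a slow or failing shadow call never blocks users.
    shadow_pool.submit(shadow)
    return served
```

Because the shadow call is submitted to a background pool, a crash or timeout in the AI path degrades nothing for the user, which is exactly the "zero user risk" property shadow testing promises.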
When to Use Shadow Testing
Replacing an existing process: If the AI is replacing human agents, shadow test by running the AI on real tickets while humans continue to handle them. Compare AI responses to human responses.

High-stakes domains: Medical, legal, financial — where a bad AI response could cause real harm. Shadow testing lets you validate quality without risk.

New model deployment: Before switching from Model A to Model B in production, shadow test Model B on real traffic to compare quality.
Shadow Testing Duration
Run for 1–2 weeks minimum. You need enough data to cover the full distribution of user queries, including weekday/weekend patterns, seasonal topics, and rare edge cases. Evaluate at least 500–1,000 shadow responses before proceeding.
The shadow test decision: If shadow testing shows the AI matches or exceeds the existing system on quality metrics, proceed to canary. If it falls short, iterate before exposing any users. Shadow testing is the cheapest way to discover that your AI isn’t ready.
Canary Deployment
Stage 1: Expose 1–5% of traffic and watch the metrics closely
How Canary Deployment Works
Route a small percentage of real user traffic (typically 1–5%) to the AI product. The remaining 95–99% continues on the existing path (or sees no AI feature).

Monitor a defined set of health metrics in real time:
Quality: User satisfaction, thumbs up/down ratio, escalation rate
Performance: Latency (p50, p95, p99), error rate, timeout rate
Cost: Cost per query, total daily spend
Safety: Flagged responses, content filter triggers, hallucination rate

If any metric degrades beyond a predefined threshold, automatically roll back to the previous version. No human decision required — the system protects itself.
Canary Best Practices
Choose the canary population carefully.
Don’t canary on your most important customers first. Use a representative but lower-risk segment. Internal employees are ideal for the initial canary.

Define rollback triggers before deployment.
“If error rate exceeds 5%, or latency p95 exceeds 3 seconds, or any safety metric triggers, automatically roll back.” Write these down and automate them.

Monitor for at least 48 hours.
Some issues only appear under specific conditions (peak hours, certain user segments, particular query types). A 2-hour canary isn’t enough.

Compare against the control group.
The 95% of users not in the canary serve as your control. Compare metrics between canary and control to isolate the AI’s impact from external factors.
The canary decision: After 48+ hours, if all metrics are stable or improving: expand to 10–25%. If metrics are mixed: investigate and iterate before expanding. If any critical metric degraded: roll back, fix, and re-canary. Never skip the canary stage — it’s your safety net.
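The rollback triggers and the canary-vs-control comparison above can be automated. A sketch in Python, with the thresholds taken from the text (error rate >5%, p95 latency >3 s, any safety trigger) and the 20% control-comparison margin chosen here for illustration:

```python
# Rollback triggers from the text; any breach forces an automatic rollback.
ROLLBACK_TRIGGERS = {
    "error_rate": 0.05,     # >5% error rate
    "latency_p95_s": 3.0,   # p95 latency over 3 seconds
    "safety_triggers": 0,   # any safety metric firing at all
}

def canary_decision(canary: dict, control: dict) -> str:
    """Return 'rollback', 'hold', or 'expand' from canary vs. control metrics."""
    if (canary["error_rate"] > ROLLBACK_TRIGGERS["error_rate"]
            or canary["latency_p95_s"] > ROLLBACK_TRIGGERS["latency_p95_s"]
            or canary["safety_triggers"] > ROLLBACK_TRIGGERS["safety_triggers"]):
        return "rollback"
    # Compare against the control group to isolate the AI's impact from
    # external factors; the 20% margin is an assumed tolerance, tune per metric.
    degraded = [k for k in ("error_rate", "latency_p95_s")
                if canary[k] > control[k] * 1.2]
    if degraded:
        return "hold"
    return "expand"
```

Note the two-step logic: absolute thresholds protect users no matter what the control group looks like, while the relative comparison catches degradation that external factors (traffic spikes, upstream outages) would otherwise mask.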
Beta Programs
Stage 2: Structured feedback from engaged users who know they’re testing
Beta vs. Canary
Canary users don’t know they’re in an experiment. Beta users know they’re testing and are asked to provide feedback. This distinction matters:

Canary gives you: Unbiased usage patterns, realistic quality metrics, production-scale performance data.

Beta gives you: Detailed qualitative feedback, feature requests, usability insights, edge case discovery, and early advocates who feel invested in the product.

Run both. Canary validates metrics. Beta generates insights.
Structuring the Beta
Size: 100–500 users. Large enough for statistical significance, small enough to manage individually.

Duration: 4–6 weeks. Users need time to integrate the AI into their workflow and encounter edge cases.

Recruitment: Mix of power users (will push boundaries), average users (represent the majority), and skeptics (will find the weaknesses).

Feedback channels: In-product feedback (thumbs up/down, comments), weekly surveys, a dedicated Slack/Discord channel, and optional 1-on-1 interviews with the PM.
What to Measure in Beta
Activation rate: What % of beta users actually try the AI feature? Target: 60–80%. Below 50% signals a discoverability or value proposition problem.

Retention: Do users come back after the first session? Weekly active rate >40% is strong for beta.

Task completion: Can users accomplish their goals with the AI? Measure success rate on key tasks.

Satisfaction: NPS or CSAT specifically for the AI feature. Target NPS >30 for beta (users are more forgiving during beta).

Failure patterns: What are the top 10 queries the AI handles poorly? These become your priority fixes before GA.
The beta-to-GA decision: Begin the ramp toward GA when: activation >60%, weekly retention >30%, task completion >70%, NPS >20, and the top 10 failure patterns have been addressed. If any metric falls short, extend the beta and iterate. Beta is your last chance to fix major issues before the broader audience sees them.
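These gates are mechanical enough to encode as a checklist. A sketch, using the thresholds from the decision criteria above (the metric names and function shape are assumptions for illustration):

```python
# Gate floors from the beta-to-GA decision criteria.
GA_GATES = {
    "activation_rate": 0.60,   # % of beta users who tried the feature
    "weekly_retention": 0.30,  # weekly active rate among beta users
    "task_completion": 0.70,   # success rate on key tasks
    "nps": 20,                 # NPS for the AI feature specifically
}

def beta_to_ga(metrics: dict, failure_patterns_open: int) -> tuple[bool, list[str]]:
    """Return (ready, blockers): every gate must pass and the
    top-10 failure patterns must all be addressed."""
    blockers = [name for name, floor in GA_GATES.items()
                if metrics.get(name, 0) < floor]
    if failure_patterns_open > 0:
        blockers.append(f"{failure_patterns_open} of top-10 failure patterns unaddressed")
    return (not blockers, blockers)
```

Returning the list of blockers, not just a boolean, is the useful part: "extend the beta and iterate" needs to know exactly which gate failed.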
The Ramp-Up Playbook
Stage 3: From beta to general availability — the gradual expansion
The Ramp Schedule
After a successful beta, expand access gradually:

Week 1–2: 10% of users
First expansion beyond beta. Monitor all metrics. This is where load-related issues often surface.

Week 3–4: 25% of users
Significant scale. Cost projections become reliable. User support volume gives a preview of GA demand.

Week 5–6: 50% of users
Half the user base. If metrics hold here, GA is likely safe. Final opportunity to catch issues before full exposure.

Week 7+: 100% (GA)
Full launch. All users have access. Marketing and communications can begin in earnest.

At each stage, hold for at least one full week before expanding. Some issues only appear after sustained usage.
Feature Flags
Feature flags are the infrastructure that makes staged rollouts possible. They let you:

Control who sees the AI feature without code deployments
Instantly disable the feature if issues arise (kill switch)
Target specific segments (by geography, plan tier, user type)
Run A/B tests between AI variants
Gradually increase exposure from 1% to 100%

Modern feature flag platforms (LaunchDarkly, Statsig, Featurevisor) increasingly pair flags with experimentation tooling — automated regression detection during a rollout, and statistical analysis that updates as canary data arrives — so the rollback triggers you defined can fire without a human in the loop.
The Kill Switch
Every AI feature must have a kill switch — a feature flag that instantly disables the AI and falls back to the non-AI experience. This should be operable by the on-call engineer in under 60 seconds, without a code deployment. Test the kill switch before launch. An untested kill switch is not a kill switch.
The ramp-up rule: Never more than roughly double exposure at each step (5% → 10% → 25% → 50% → 100%). Each step gives you a clear signal: if metrics held at 25%, they’ll likely hold at 50%. If they degrade, you’ve limited the blast radius to half your users, not all of them.
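The flag mechanics behind the ramp and the kill switch fit in a few lines. A minimal sketch (the flag store, salt, and names are assumptions; real platforms provide this as a service): users are hashed into stable percentage buckets, so raising the rollout percentage only ever adds users, and the kill switch overrides everything.

```python
import hashlib

# In production this state lives in a flag service and changes at runtime,
# with no code deployment.
FLAG = {"kill_switch": False, "rollout_percent": 25}

def bucket(user_id: str, salt: str = "ai-assistant-v1") -> int:
    """Stable 0-99 bucket per user: the same user keeps the same
    assignment as the ramp widens from 5% to 100%."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def ai_enabled(user_id: str) -> bool:
    if FLAG["kill_switch"]:  # instant fallback to the non-AI experience
        return False
    return bucket(user_id) < FLAG["rollout_percent"]
```

Because bucketing is deterministic, moving `rollout_percent` from 25 to 50 enables exactly the users in buckets 25–49 and removes nobody — users never flicker in and out of the experience mid-ramp. Flipping `kill_switch` disables the feature for everyone in one config change, which is the under-60-seconds property the kill-switch rule demands.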
Launch Communications
Setting the right expectations — because AI products need different messaging
What to Communicate
1. What the AI can do.
Be specific about capabilities. “Our AI can answer questions about billing, shipping, and returns based on our knowledge base” is better than “Our AI assistant can help with anything.”

2. What the AI cannot do.
Set boundaries explicitly. “The AI cannot process refunds, access your payment information, or provide legal advice.” Users who understand the boundaries trust the product more.

3. How to get help when the AI falls short.
“If the AI can’t help, click ‘Talk to a person’ to reach our support team.” The escape hatch should be prominent in launch communications.

4. How the AI improves over time.
“Your feedback helps us improve. Use the thumbs up/down buttons to let us know how we’re doing.” This frames imperfection as a feature of the improvement process, not a product flaw.
What NOT to Communicate
Don’t overpromise.
“Our AI understands everything” sets users up for disappointment. Underpromise and overdeliver.

Don’t hide the AI.
Users should know they’re interacting with AI. Transparency builds trust. Deception destroys it when users discover the truth.

Don’t launch with a press release before the product is stable.
Media attention drives traffic spikes. If the product isn’t ready for the load (both technical and quality), the press coverage will highlight failures, not features.
Internal Communications
Support team: Train them on what the AI does, common failure modes, and how to handle escalations from AI interactions.

Sales team: Equip them with accurate capability descriptions and known limitations. Nothing kills a deal faster than a sales demo that triggers a hallucination.

Leadership: Set expectations for the ramp-up timeline, early metrics targets, and the iteration cadence. AI products improve post-launch — leadership needs to understand this is by design, not a sign of an incomplete product.
The messaging framework: “We’re launching [specific capability] powered by AI. It’s designed to [specific value]. It works best for [specific use cases]. For [out-of-scope topics], our human team is always available. Your feedback helps us improve every week.” Specific, honest, and improvement-oriented.
The Launch War Room
The first 72 hours after GA — what to monitor and how to respond
The First 72 Hours
The first three days after general availability are the highest-risk period. Establish a launch war room with the core team monitoring in real time:

Hour 0–4: Smoke test.
Is the system up? Are responses generating? Are latency and error rates within bounds? Are any safety filters triggering at unexpected rates?

Hour 4–24: Pattern detection.
What are users actually asking? Are there query patterns you didn’t anticipate? Are there spikes in negative feedback? Is cost tracking to budget?

Hour 24–48: Quality deep dive.
Sample 200+ responses and evaluate quality. Compare to beta metrics. Are there new failure modes? Is user satisfaction holding?

Hour 48–72: Stabilization.
Address the top 3–5 issues discovered. Deploy hotfixes if needed. Confirm that metrics are stable and trending in the right direction.
Incident Response Plan
Severity 1 (Critical): AI producing harmful, offensive, or legally risky content.
Response: Activate kill switch immediately. Investigate root cause. Do not re-enable until the issue is resolved and tested.

Severity 2 (High): Significant quality degradation affecting >10% of users.
Response: Roll back to previous version. Investigate. Fix and re-deploy through the canary process.

Severity 3 (Medium): Quality issues affecting specific query types or user segments.
Response: Document, prioritize, and fix in the next sprint. Monitor for escalation.

Severity 4 (Low): Minor quality issues, edge cases, cosmetic problems.
Response: Add to backlog. Include in regression test suite.
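Classification is where war-room time gets lost, so it helps to write the triage rules down executably. A sketch of the severity mapping above (the symptom inputs are simplified assumptions; real triage considers more signals):

```python
# Severity levels and their prescribed responses, from the incident plan.
INCIDENT_PLAYBOOK = {
    1: "Activate kill switch immediately; do not re-enable until fixed and tested.",
    2: "Roll back to previous version; fix; re-deploy through the canary process.",
    3: "Document, prioritize, and fix next sprint; monitor for escalation.",
    4: "Add to backlog; include in regression test suite.",
}

def classify(harmful: bool, affected_fraction: float, segment_specific: bool) -> int:
    """Map an incident's symptoms to a severity level per the plan above."""
    if harmful:                    # harmful, offensive, or legally risky output
        return 1
    if affected_fraction > 0.10:   # quality degradation affecting >10% of users
        return 2
    if segment_specific:           # specific query types or user segments
        return 3
    return 4                       # minor, edge-case, or cosmetic
```

The ordering matters: harm trumps scale, so a harmful output affecting one user is still Severity 1 and still pulls the kill switch.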
The war room rule: Staff the war room with PM, engineering lead, ML engineer, and a support representative for the full 72 hours. After 72 hours, transition to the standard on-call rotation. The war room is not optional — it’s the difference between catching a Severity 1 issue in 10 minutes vs. 10 hours.
Post-Launch Optimization
The launch is just the beginning — the real work starts now
The First 30 Days
Week 1: Stabilize.
Fix critical issues. Ensure monitoring and alerting are working. Confirm cost is tracking to budget. Publish a launch retrospective internally.

Week 2: Optimize.
Address the top 10 failure patterns from user feedback. Tune prompts and retrieval based on real-world query patterns. Optimize token usage and cost.

Week 3: Measure.
Compile the first comprehensive metrics report. Compare against launch targets. Identify the biggest gaps between expected and actual performance.

Week 4: Plan.
Based on 30 days of data, create the v1.1 roadmap. Prioritize improvements by impact (user satisfaction lift per engineering effort). Present results and plan to leadership.
Key Post-Launch Metrics
Adoption:
• Signup-to-activation rate (target: 30–50%)
• Daily/weekly active users
• Feature discovery rate

Quality:
• User satisfaction (CSAT/NPS)
• Thumbs up/down ratio
• Escalation rate to human support
• Regeneration rate (users asking for a new response)

Retention:
• Weekly retention (target: 20%+ good, 40%+ great)
• Return rate after first session
• Feature stickiness (% of sessions that use AI)

Economics:
• Cost per query / cost per resolution
• Total AI spend vs. budget
• ROI (value created vs. cost)
The bottom line: An AI product launch is not a single event — it’s a multi-week process from shadow testing through post-launch optimization. The PM who follows the staged rollout playbook (shadow → canary → beta → ramp → GA → optimize) ships with confidence, catches issues early, and builds user trust incrementally. The PM who skips stages learns the same lessons, but at the cost of user trust and brand reputation.