Ch 4 — The AI Product Lifecycle

Why AI products get better after launch — and can also get worse. The data-model-product loop.
Software Lifecycle vs. AI Lifecycle
Why the traditional build-ship-maintain model breaks down for AI products
Traditional Software Lifecycle
Traditional software follows a relatively linear lifecycle: define requirements, design, build, test, ship, maintain. After launch, the product is essentially “done” — you add features, fix bugs, and handle infrastructure. But the core product works as designed from day one.

Maintenance is about keeping things running, not fundamentally changing how the product works. A calculator app shipped in 2024 calculates the same way in 2026.
AI Product Lifecycle
AI products follow a circular lifecycle: collect data, train model, deploy, monitor, collect more data, retrain, redeploy. The product is never “done” — it’s continuously evolving based on new data and changing conditions.

This creates two properties that traditional software doesn’t have:

1. Products can improve after launch — User interactions generate data that makes the model better. A recommendation engine gets smarter with every click.

2. Products can degrade after launch — The world changes, but the model doesn’t. A fraud detection model trained on 2024 patterns misses 2026 fraud techniques. This is called model drift.
The fundamental difference: Traditional software depreciates slowly (tech debt accumulates). AI products can depreciate rapidly (the world changes and the model becomes wrong). This means AI products require ongoing investment in model quality just to maintain current performance, before you even think about improvements.
Phase 1: Data Collection & Preparation
The foundation that determines everything else — and where most projects stall
Data Collection
Every AI product starts with data. The quality and quantity of your data sets a ceiling on product quality that no amount of engineering can exceed.

Sources vary by product type:
Supervised ML: Labeled datasets (images with tags, text with categories, transactions marked as fraud/not-fraud)
LLM products: Prompt-response examples, evaluation datasets, domain-specific documents for RAG
Recommendation systems: User behavior logs (clicks, purchases, time spent, skips)

The PM’s role here is critical: define what data you need, where it comes from, and how to get it. This is often the hardest part of the entire project.
Data Preparation
Data cleaning — Remove duplicates, fix errors, handle missing values. In practice, data scientists spend 60–80% of their time on this. It’s unglamorous but determines model quality.

Data labeling — For supervised learning, humans must label the data. Is this email spam? Is this transaction fraudulent? Is this medical image showing a tumor? Labeling is expensive, slow, and error-prone.

Data versioning — Just as you version code, you must version data. Tools like DVC (Data Version Control) track which dataset version produced which model. Without this, you can’t reproduce results or debug regressions.

Feature engineering — Transforming raw data into features the model can use. Converting timestamps into “day of week” and “hour of day.” Calculating “average transaction amount over last 30 days.”
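The timestamp and rolling-average features described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the transaction schema (`timestamp`, `amount`) and the 30-day history list are invented for the example, not a real pipeline's format.

```python
from datetime import datetime
from statistics import mean

def engineer_features(txn, recent_amounts):
    """Turn a raw transaction into model-ready features.
    Schema and history list are illustrative only."""
    ts = datetime.fromisoformat(txn["timestamp"])
    return {
        "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,
        "amount": txn["amount"],
        # Average spend over the last 30 days helps separate routine
        # purchases from outliers.
        "avg_amount_30d": mean(recent_amounts) if recent_amounts else 0.0,
    }

features = engineer_features(
    {"timestamp": "2026-03-14T09:30:00", "amount": 42.50},
    recent_amounts=[10.0, 20.0, 30.0],
)
```

Each derived feature is a hypothesis about what the model needs, which is why this step benefits from product input, not just engineering.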
PM takeaway: Data preparation is the most time-consuming and least exciting phase. It’s also where most AI projects die. If your team says “we need 4 weeks for data prep,” don’t push back — push for clarity on what specifically needs to happen and what the blockers are. Cutting data prep time usually means cutting model quality.
Phase 2: Training & Experimentation
Where models are built, tested, and iterated — the experimental core
The Experimentation Loop
Model training is fundamentally experimental. The ML team tries different approaches, measures results, and iterates. This is closer to scientific research than software engineering.

A typical cycle:
1. Choose a model architecture (or select a foundation model)
2. Train on the prepared dataset (or write/refine prompts)
3. Evaluate against the test set
4. Analyze errors — what’s failing and why?
5. Hypothesize improvements (more data? different features? different architecture?)
6. Repeat

This cycle might run dozens or hundreds of times before the model reaches the performance threshold you defined.
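The six steps above can be sketched as a loop. Here `train` and `evaluate` are placeholders for whatever your stack provides; the performance threshold gate is the part the PM defines.

```python
def run_experiments(candidates, train, evaluate, threshold):
    """Try candidate configs in order; stop at the first that clears the
    performance threshold. Every attempt is logged for error analysis,
    whether or not anything passes."""
    log = []
    for config in candidates:
        model = train(config)
        score = evaluate(model)
        log.append({"config": config, "score": score})
        if score >= threshold:
            return model, log
    return None, log  # nothing cleared the bar: analyze the log and iterate

# Toy usage with stand-in train/evaluate functions.
best, history = run_experiments(
    candidates=[{"lr": 0.1}, {"lr": 0.01}],
    train=lambda cfg: cfg,
    evaluate=lambda model: 0.80 if model["lr"] == 0.1 else 0.92,
    threshold=0.90,
)
```

The log is as important as the winning model: it is the raw material for the error analyses you review weekly.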
For LLM-Based Products
The experimentation loop looks different for products built on foundation models:

Prompt iteration: Instead of training a model, you iterate on prompts. Write a system prompt, test it against evaluation examples, refine, repeat. This is faster (minutes vs. hours) but still requires systematic evaluation.

Fine-tuning: If prompting isn’t enough, fine-tune the foundation model on your domain-specific data. This is more expensive and slower, but can significantly improve quality for specialized tasks.

RAG development: Build and tune the retrieval pipeline — what documents to index, how to chunk them, what embedding model to use, how many results to retrieve.
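One of those tuning knobs, chunk size and overlap, can be illustrated with a toy chunker. Character counts are used here for simplicity; real pipelines usually count tokens and respect sentence or section boundaries.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Fixed-size chunks with overlap, so a retrieved passage keeps some
    context from across the chunk boundary."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 500)
```

Chunking choices directly affect answer quality: chunks too small lose context, chunks too large dilute the retrieval signal, so they belong in the experimentation loop like any other parameter.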
Experiment Tracking
Every experiment must be tracked: what data was used, what parameters were set, what results were achieved. Tools like MLflow, Weights & Biases, and Neptune provide this. Without tracking, you can’t reproduce your best results or understand what worked.
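A toy stand-in for what those tools provide, just to make concrete what "tracked" means: every run records its parameters, its metrics, and the dataset version that produced it. The field names here are illustrative, not any tool's real API.

```python
import time

class ExperimentLog:
    """Minimal sketch of experiment tracking (MLflow etc. do this
    properly, with UIs, artifacts, and storage)."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, data_version):
        self.runs.append({
            "time": time.time(),
            "params": params,
            "metrics": metrics,
            "data_version": data_version,  # ties the model back to its data
        })

    def best(self, metric):
        """Which run produced the best value of a metric, and with what?"""
        return max(self.runs, key=lambda run: run["metrics"][metric])

log = ExperimentLog()
log.log_run({"lr": 0.1}, {"f1": 0.81}, data_version="v3")
log.log_run({"lr": 0.01}, {"f1": 0.88}, data_version="v3")
```

The `data_version` field is the piece teams most often skip, and the one you need when debugging a regression six months later.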
PM role during training: You’re not building the model, but you’re actively involved. Review error analyses weekly. Provide product context on which errors matter most. Adjust the performance threshold based on what you learn. The PM who disappears during training and reappears at deployment is the PM whose product fails.
Phase 3: Deployment
Getting the model from the lab to production — where many projects die
The Deployment Gap
An oft-cited industry statistic claims that 87% of ML models never make it to production. Whatever the exact figure, the gap between “works in a notebook” and “works in production” is enormous.

Reasons models fail at deployment:
Latency — The model takes 5 seconds to respond. Users expect 200ms.
Scale — Works for 10 requests/second, breaks at 1,000.
Integration — The model expects clean, structured input. Real-world data is messy.
Cost — Running the model at production scale costs $50K/month. The feature generates $10K in value.
Reliability — The model works 99% of the time in the lab, but in production that 1% failure rate hits thousands of users every day.
Deployment Strategies
Shadow deployment: Run the new model alongside the existing system. The new model processes real requests but its outputs aren’t shown to users. Compare results to validate quality before switching over.

Canary deployment: Route 1–5% of traffic to the new model. Monitor closely. Gradually increase traffic if metrics hold. Roll back instantly if they don’t.

A/B testing: Split users between the old and new model. Measure business metrics (not just model metrics) to determine which is better.

Blue/green: Maintain two identical production environments. Switch traffic from old (blue) to new (green) instantly. Roll back by switching back.
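The routing decision behind a canary rollout is simple to sketch. Hashing the user ID, rather than rolling a random number per request, keeps each user on a consistent side of the split while you ramp the percentage. The function name and scheme here are illustrative.

```python
import hashlib

def route_to_canary(user_id: str, canary_pct: float) -> bool:
    """Send roughly canary_pct% of users to the new model.

    A deterministic hash of the user id means the same user always
    lands in the same bucket across requests and sessions."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_pct
```

Ramping is then just raising `canary_pct` from 1 to 5 to 25 to 100 as metrics hold, and dropping it to 0 for an instant rollback.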
PM decision: The deployment strategy depends on risk tolerance. High-stakes products (medical, financial) should use shadow + canary. Lower-stakes products (recommendations, content) can move faster with A/B tests. The PM chooses the strategy based on error cost, not engineering preference. Always have a rollback plan.
Phase 4: Monitoring & Drift Detection
The phase most teams skip — and the reason most AI products degrade
Why Monitoring Is Different for AI
Traditional software monitoring tracks uptime, latency, and error rates. If the server is up and responding, the product is working. AI products need all of that plus model-specific monitoring:

Model performance metrics — Is accuracy/precision/recall holding steady? Track against your baseline.
Input distribution — Are the inputs the model is seeing in production similar to what it was trained on? If not, predictions may be unreliable.
Output distribution — Are the model’s outputs changing? If a fraud model suddenly flags 40% of transactions instead of the usual 2%, something is wrong.
Latency per request — Not just average, but P95 and P99. A model that’s fast on average but slow for 5% of users creates a bad experience.
Cost per prediction — Especially for LLM products where token costs add up.
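A sketch of tail-latency alerting using nearest-rank percentiles. Sorting every sample is fine for an illustration; production monitoring uses streaming sketches (e.g. HDRHistogram) instead. Budgets and messages are illustrative.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

def latency_alerts(latencies_ms, p95_budget_ms, p99_budget_ms):
    """Alert on the tail, not the average: a model that is fast on
    average but slow for 5% of users still burns those users daily."""
    alerts = []
    if percentile(latencies_ms, 95) > p95_budget_ms:
        alerts.append("P95 over budget")
    if percentile(latencies_ms, 99) > p99_budget_ms:
        alerts.append("P99 over budget")
    return alerts

samples = list(range(1, 101))  # 1ms..100ms, one request each
```

Note that the mean of these samples is around 50ms, which looks healthy even when the P95 budget is already blown.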
Model Drift: The Silent Killer
Model drift is when the real world changes but the model doesn’t. It comes in two forms:

Data drift: The distribution of inputs changes. A model trained on pre-pandemic shopping data sees completely different patterns post-pandemic. The model’s inputs look different from its training data.

Concept drift: The relationship between inputs and outputs changes. What constituted “spam” in 2023 is different from what constitutes spam in 2026. The model’s learned patterns become outdated.

Drift is insidious because the model doesn’t crash — it just becomes gradually, silently wrong. Without monitoring, you won’t know until users complain or business metrics drop.
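One common way to quantify data drift is the Population Stability Index (PSI), which compares the binned distribution of production inputs against the training inputs. A simplified single-feature version, where the bin count and the thresholds are industry conventions rather than laws:

```python
import math

# PSI rule of thumb (a convention, not a law): < 0.1 stable,
# 0.1-0.25 worth investigating, > 0.25 significant drift.
def psi(expected, actual, bins=10):
    """Population Stability Index between the training-time input
    distribution ("expected") and what production sees ("actual")."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            i = min(max(int((v - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Tiny epsilon so empty bins don't blow up the log term.
        return [(c + 1e-6) / len(values) for c in counts]

    p, q = bin_fractions(expected), bin_fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

training = [float(i % 100) for i in range(1000)]
shifted = [v + 50.0 for v in training]  # simulated post-drift inputs
```

A check like this runs on a schedule against live traffic, and crossing the threshold is exactly the kind of alert the next paragraph asks the PM to define.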
PM action: Define monitoring dashboards before launch, not after. Set alert thresholds: “If precision drops below 88%, alert the team.” “If input distribution diverges by more than X from training data, investigate.” Review the monitoring dashboard weekly. The PM who doesn’t monitor model performance is the PM whose product degrades without anyone noticing.
Phase 5: The Feedback Loop
How user interactions feed back into model improvement — the AI product flywheel
Explicit Feedback
Users directly tell you whether the model’s output was good or bad:

Thumbs up/down — ChatGPT, Claude, and most AI assistants use this. Simple but effective.
Corrections — User edits the AI’s output. The edit itself is training signal (the AI said X, the user changed it to Y).
Ratings — 1–5 star ratings on AI-generated content.
Flagging — User reports harmful, incorrect, or inappropriate output.

Challenge: Only a small percentage of users provide explicit feedback (typically 1–5%). The feedback is biased toward extreme experiences — users rate when they’re very happy or very unhappy, rarely when the output is just “okay.”
Implicit Feedback
Users show you through behavior whether the output was useful:

Acceptance rate — Did the user accept the code suggestion? Use the generated text? Click the recommendation?
Dwell time — How long did the user spend reading the AI’s response? Longer usually means more useful.
Regeneration — Did the user click “regenerate”? That’s a signal the first output was unsatisfactory.
Follow-up actions — Did the user copy the output? Share it? Or immediately search for something else?
Abandonment — Did the user leave the product after seeing the AI’s output?

Implicit feedback is higher volume but noisier. A user who doesn’t click a recommendation might have loved it but was busy.
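Day-one instrumentation for these signals can be sketched as an event log. The event names and fields are invented for illustration; the point is that acceptance rate is only computable if you logged both "shown" and "accepted" from the start.

```python
import json
import time

def log_event(sink, user_id, event, payload=None):
    """Append one behavioral event as a JSON line. The sink could be a
    file, a queue, or an analytics pipeline; a list stands in here."""
    sink.append(json.dumps({
        "ts": time.time(),
        "user": user_id,
        "event": event,          # e.g. "suggestion_shown", "regenerate_clicked"
        "payload": payload or {},
    }))

def acceptance_rate(sink):
    """Accepted / shown: only computable if both events were logged."""
    events = [json.loads(line)["event"] for line in sink]
    shown = events.count("suggestion_shown")
    return events.count("suggestion_accepted") / shown if shown else 0.0

sink = []
for i in range(4):
    log_event(sink, f"u{i}", "suggestion_shown")
log_event(sink, "u0", "suggestion_accepted")
```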
PM design decision: The feedback mechanism is a product design choice. Make explicit feedback frictionless (one-click thumbs up/down, not a survey). Instrument implicit feedback from day one — you can’t retroactively collect behavioral data you didn’t log. The quality of your feedback loop determines how fast your product improves.
Retraining: When & How
The decision framework for updating your model in production
When to Retrain
Retraining is expensive (compute costs, engineering time, testing). Don’t retrain on a schedule just because you can. Retrain when there’s a reason:

Performance degradation: Monitoring shows accuracy dropping below your threshold. This is the most common trigger.

New data available: You’ve collected significantly more labeled data that could improve the model. Rule of thumb: retrain when you have 20%+ more data than the last training run.

Distribution shift: The inputs your model sees in production have changed meaningfully from the training data.

New requirements: The product needs to handle new categories, new languages, or new use cases that the current model wasn’t trained for.

Model upgrade: A better foundation model is available (e.g., GPT-4o → GPT-5) and testing shows it improves your product.
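The measurable triggers above can be encoded as an explicit retraining policy rather than a hunch. A minimal sketch; every threshold below is illustrative, not universal.

```python
def should_retrain(live_metric, metric_floor, new_rows, last_train_rows,
                   drift_score):
    """Return the list of triggers that fired (empty list: hold off)."""
    reasons = []
    if live_metric < metric_floor:
        reasons.append("performance below threshold")
    if last_train_rows and new_rows / last_train_rows >= 0.20:
        reasons.append("20%+ new labeled data since last run")
    if drift_score > 0.25:          # e.g. a PSI-style drift measure
        reasons.append("input distribution shift")
    return reasons
```

Returning the reasons rather than a bare yes/no keeps the decision auditable: the team can see which trigger fired and weigh it against retraining cost.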
Retraining Approaches
Full retraining: Train from scratch on the complete dataset (old + new data). Most thorough but most expensive. Necessary when the data distribution has fundamentally changed.

Incremental training: Continue training the existing model on new data only. Faster and cheaper but risks “catastrophic forgetting” — the model gets better on new patterns but worse on old ones.

Prompt/config update: For LLM products, update the system prompt, adjust retrieval parameters, or update the knowledge base without touching the underlying model. Fastest and cheapest.

Model swap: Replace the current model with a newer version (e.g., upgrading from Claude 3.5 to Claude 4). Requires thorough evaluation to ensure the new model doesn’t regress on your specific use case.
The retraining trap: Don’t assume retraining always helps. A retrained model can be worse than the current one if the new data is noisy, biased, or insufficient. Always evaluate the retrained model against the same test set before deploying. The rule: never deploy a retrained model without A/B testing or shadow deployment.
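That evaluation gate can be made explicit: compare candidate and current on the same test set, metric by metric, and refuse promotion if anything regresses. A minimal sketch with illustrative names.

```python
def promotion_gate(current_scores, candidate_scores):
    """Compare a retrained candidate against the live model on the SAME
    test set. Promote only if no tracked metric regresses; a candidate
    that wins on average but loses a key metric should be rejected."""
    regressions = {
        metric: (live, candidate_scores[metric])
        for metric, live in current_scores.items()
        if candidate_scores[metric] < live
    }
    return len(regressions) == 0, regressions

ok, _ = promotion_gate({"precision": 0.91, "recall": 0.80},
                       {"precision": 0.93, "recall": 0.80})
blocked, regressed = promotion_gate({"precision": 0.91, "recall": 0.80},
                                    {"precision": 0.95, "recall": 0.70})
```

Passing the gate earns the candidate a shadow or canary rollout, not a straight swap.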
The Continuous Loop
Putting it all together — the AI product lifecycle as a perpetual system
The Loop Visualized
The AI product lifecycle is not a line with a beginning and end. It’s a continuous loop:

Data → Train → Deploy → Monitor → Feedback → Data → ...

Each revolution of the loop should make the product better. Users interact, their interactions generate data, that data improves the model, the improved model serves users better, generating more interactions.

This is the AI product flywheel. When it spins well, it creates a compounding competitive advantage. When it breaks at any point — bad data collection, no monitoring, no feedback mechanism — the product stagnates or degrades.
Speed of the Loop
The speed at which you complete each revolution determines your rate of improvement:

Fast loop (hours–days): LLM products with prompt iteration. Change the prompt, evaluate, deploy. Fastest improvement cycle.
Medium loop (weeks): Products with fine-tuning or RAG updates. Collect data, retrain/update, evaluate, deploy.
Slow loop (months): Custom ML models requiring large datasets and significant compute. Collect data for months, retrain, evaluate extensively, deploy carefully.
The PM’s Lifecycle Responsibilities
Data phase: Define what data to collect, prioritize labeling, ensure data quality standards.

Training phase: Set performance thresholds, review errors weekly, adjust priorities based on what the model struggles with.

Deployment phase: Choose deployment strategy (shadow, canary, A/B), define rollback criteria, coordinate with engineering on launch.

Monitoring phase: Own the monitoring dashboard, set alert thresholds, review performance weekly, decide when drift warrants action.

Feedback phase: Design feedback mechanisms, ensure instrumentation is in place, prioritize which feedback signals to act on.
The bottom line: AI products are living systems, not shipped artifacts. They require continuous investment in data, monitoring, and improvement. The PM who treats an AI product like traditional software — build it, ship it, move on — will watch it degrade within months. The PM who embraces the continuous loop builds products that get better every day. The lifecycle never ends; it only accelerates.