Ch 9 — Model Development for PMs

What happens between spec and model. What to ask in standups. When to push back.
The Experimental Mindset
Model development is research, not construction — and that changes everything about how you manage it
Research vs. Engineering
Traditional software development is engineering: you know the solution exists, you know roughly how to build it, and progress is roughly linear. More hours produce more features.

Model development is research: you don’t know if the solution exists, you don’t know which approach will work, and progress is non-linear. The team might spend two weeks trying an approach that produces zero improvement, then stumble on a data fix that jumps accuracy by 15 points overnight.

This fundamental difference means you cannot manage model development like a sprint backlog. There are no story points for “improve accuracy from 82% to 88%.” The team doesn’t know how long it will take because nobody knows until they try.
What This Means for PMs
1. Plan in experiments, not features.
“This sprint, we’ll run 3 experiments to improve recall” is realistic. “This sprint, we’ll achieve 90% recall” is a guess.

2. Expect dead ends.
A large share of ML experiments fail, often half or more. That’s not wasted work; it’s information. A failed experiment that eliminates an approach is valuable.

3. Time-box, don’t deadline.
“Spend 2 weeks exploring whether approach X can reach 85% accuracy. If not, we pivot.” This prevents endless optimization without clear stopping criteria.

4. Celebrate learning, not just results.
“We discovered that the model fails on transactions under $10 because they’re underrepresented in training data” is a valuable output even if accuracy didn’t improve this week.
The PM trap: The biggest mistake PMs make during model development is treating it like feature development. They ask “when will it be done?” instead of “what have we learned?” They measure progress by accuracy numbers instead of by hypotheses tested. Shift your mindset from project management to experiment management.
Stage 1: Data Preparation
The stage that takes the longest and gets the least respect
What Happens
Before any model can be trained, the data must be prepared. This typically consumes 60–80% of the total project time:

Data cleaning: Removing duplicates, fixing errors, handling missing values, standardizing formats. A dataset with “New York,” “NYC,” “new york,” and “NY” needs normalization.
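That kind of normalization can be as simple as a lookup pass. A minimal sketch, with a hypothetical mapping table:

```python
# Minimal sketch of value normalization before training. The mapping table
# and city names are illustrative; real pipelines often pair a lookup like
# this with fuzzy matching for unseen variants.
CANONICAL_CITY = {
    "new york": "New York",
    "nyc": "New York",
    "ny": "New York",
}

def normalize_city(raw: str) -> str:
    """Map messy free-text city values onto one canonical form."""
    key = raw.strip().lower()
    return CANONICAL_CITY.get(key, raw.strip())

variants = ["New York", "NYC", "new york", "NY"]
assert {normalize_city(v) for v in variants} == {"New York"}
```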

Feature engineering: Transforming raw data into features the model can learn from. Raw timestamp → “day of week,” “hour of day,” “is weekend.” Raw transaction amount → “ratio to average,” “deviation from pattern.”
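Those transformations are mechanical once the signals are named. A minimal sketch, with hypothetical function and field names:

```python
# Illustrative feature engineering from the raw signals described above:
# a timestamp and a transaction amount. Names are hypothetical.
from datetime import datetime

def timestamp_features(ts: datetime) -> dict:
    return {
        "day_of_week": ts.weekday(),   # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,
    }

def amount_features(amount: float, user_avg: float) -> dict:
    return {
        "ratio_to_average": amount / user_avg if user_avg else 0.0,
        "deviation_from_average": amount - user_avg,
    }

feats = timestamp_features(datetime(2024, 6, 8, 14, 30))  # a Saturday
assert feats["is_weekend"] and feats["day_of_week"] == 5
```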

Data splitting: Dividing data into training set (model learns from), validation set (model is tuned on), and test set (final evaluation). The test set must never be seen during training — it’s the unbiased judge.

For LLM products: Curating prompt examples, building evaluation datasets, preparing documents for RAG indexing.
PM Role During Data Prep
Don’t:
• Rush this phase. Cutting data prep time directly reduces model quality.
• Ignore it because it’s “technical.” Data decisions are product decisions.
• Assume the data team knows what features matter. You understand the user problem better.

Do:
• Provide domain context. “Transactions on weekends have different fraud patterns than weekdays” — this insight drives feature engineering.
• Review the data splits. Ensure the test set is representative of real-world usage, not just a random sample.
• Own the labeling guidelines. Write clear, unambiguous instructions for human labelers. Ambiguous guidelines produce noisy labels.
• Curate the evaluation set. Select 200–500 examples that cover normal cases, edge cases, and known hard cases. This is your quality benchmark.
The feature engineering conversation: Ask the ML team: “What features are you using? What signals do you think would help but we don’t have?” Often the PM can unlock new data sources or provide domain insights that the ML team wouldn’t discover on their own. This collaboration is where the biggest accuracy gains happen.
Stage 2: Model Selection & Training
Choosing the right approach and training the first baseline
Model Selection (What PMs Should Know)
The ML team will choose between different approaches. You don’t need to make this decision, but you should understand the trade-offs:

Classical ML (Random Forest, XGBoost, Logistic Regression)
Best for: Structured data, tabular data, classification, regression
Pros: Fast, cheap, interpretable, works with smaller datasets
Cons: Can’t handle unstructured data (text, images) well

Deep Learning (Neural Networks, CNNs, RNNs)
Best for: Images, audio, complex patterns in large datasets
Pros: Handles unstructured data, learns complex patterns
Cons: Needs large datasets, expensive to train, less interpretable

Foundation Models (GPT, Claude, Llama + prompting/fine-tuning)
Best for: Text generation, understanding, reasoning, multimodal tasks
Pros: Works with minimal training data, general-purpose
Cons: Expensive per query, non-deterministic, hallucination risk
The Baseline
The first model trained is the baseline. It’s intentionally simple — not the best possible model, but the starting point against which all improvements are measured.

Why baselines matter:
• They establish whether the problem is solvable at all
• They set a reference point for measuring improvement
• They often perform surprisingly well (80% of the final model’s accuracy with 20% of the effort)

Common baselines:
• For classification: Logistic regression or a simple rule-based system
• For LLM tasks: Zero-shot prompting with a foundation model
• For recommendations: Most-popular-item or random recommendation

If the baseline already meets your launch threshold, you might not need a complex model at all. Some of the best AI products run on surprisingly simple models.
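A majority-class baseline of the sort described above takes a few lines. The labels and counts here are made up, but the pattern, and the way it can flatter accuracy, is general:

```python
# A minimal majority-class baseline for classification: the "simple
# rule-based system" flavor of baseline. Labels and counts are illustrative.
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always emits the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common

train_labels = ["not_fraud"] * 95 + ["fraud"] * 5
predict = majority_baseline(train_labels)

test_labels = ["not_fraud"] * 93 + ["fraud"] * 7
accuracy = sum(predict(None) == y for y in test_labels) / len(test_labels)
assert accuracy == 0.93  # high accuracy while catching zero fraud
```

If a complex model can’t clearly beat this number on the metric that actually matters (here, fraud recall rather than accuracy), the added complexity isn’t earning its keep.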
PM question to ask: “What’s the baseline performance? How much improvement do we expect from a more complex approach? Is the expected improvement worth the additional complexity, cost, and maintenance?” Sometimes the answer is no — and shipping the simple model is the right product decision.
Stage 3: Iteration & Optimization
The cycle of hypothesize-experiment-evaluate that drives improvement
The Improvement Levers
Once the baseline is established, the team iterates to improve performance. There are four main levers, and the PM should understand which is being pulled:

1. More/better data
Often the highest-impact lever. Adding 50% more labeled data or fixing label quality can improve accuracy more than any model change. This is where PMs add the most value — prioritizing data collection and quality.

2. Better features
Engineering new input signals. Adding “time since last purchase” to a churn model. Including “customer tenure” in a support routing model. Domain knowledge from the PM drives this.

3. Better model architecture
Switching from logistic regression to a neural network. Trying a different foundation model. This is the ML team’s domain.

4. Hyperparameter tuning
Adjusting the model’s internal settings (learning rate, number of layers, regularization). Important but usually yields smaller improvements than the other three levers.
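Lever 4 usually amounts to a search over candidate settings, each scored on the validation set. A toy grid-search sketch with hypothetical parameters and stand-in train/score functions:

```python
# Toy grid search over two hypothetical hyperparameters. Each candidate is
# scored on the validation set, never the held-out test set.
import itertools

def grid_search(train_fn, val_score_fn, grid):
    """Return the best-scoring hyperparameter setting from the grid."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = train_fn(params)
        score = val_score_fn(model)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"learning_rate": [0.01, 0.1], "regularization": [0.0, 1.0]}
# Stand-ins: "training" just returns its params, and the fake validation
# score happens to prefer the larger value of both settings.
best, score = grid_search(lambda p: p,
                          lambda m: m["learning_rate"] + m["regularization"],
                          grid)
assert best == {"learning_rate": 0.1, "regularization": 1.0}
```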
The Diminishing Returns Curve
Model improvement follows a logarithmic curve: early gains are large and fast, later gains are small and slow.

Week 1–2: Baseline achieves 75%. Quick wins from data cleaning and obvious features push to 82%.
Week 3–4: Feature engineering and model tuning push to 87%. Progress slows.
Week 5–8: Intensive optimization pushes to 89%. Each percentage point takes longer.
Week 9+: Diminishing returns. Getting from 89% to 91% might take as long as getting from 75% to 87%.

The PM’s job is to recognize where you are on this curve and decide when to stop optimizing and ship. Perfection is the enemy of shipping.
The “good enough” decision: Ask: “If we ship at current performance, what’s the user impact of the remaining errors? Is that acceptable?” If the answer is yes, ship. You can continue improving in production with real user feedback, which is more valuable than lab optimization. The model that ships at 87% and improves with user data beats the model that reaches 92% in the lab six months later.
Error Analysis: The PM’s Superpower
Understanding why the model fails is more valuable than knowing its accuracy
How to Do Error Analysis
Step 1: Collect the errors.
Pull every example where the model was wrong from the evaluation set. Group them by type.

Step 2: Categorize the failures.
Look for patterns. Common categories:
Data gap: The model hasn’t seen this type of input before
Label noise: The training data had wrong labels for this pattern
Ambiguous case: Even humans would disagree on the correct answer
Edge case: Unusual input that the model can’t generalize to
Systematic bias: The model consistently fails on a specific subgroup

Step 3: Prioritize by impact.
Not all errors are equal. An error that affects 1% of users but causes $10K in damage is more important than an error that affects 10% of users but causes mild annoyance.
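Step 3 can literally be a spreadsheet, or a few lines of code. The categories, counts, and costs below are hypothetical:

```python
# Sketch of step 3: rank error categories by total business impact
# (count x cost per error), not by raw count. All numbers are hypothetical.
error_categories = [
    {"category": "enterprise_misroute", "count": 12,  "cost_per_error": 500.0},
    {"category": "typo_in_reply",       "count": 300, "cost_per_error": 0.10},
    {"category": "missing_field",       "count": 40,  "cost_per_error": 25.0},
]

for e in error_categories:
    e["total_impact"] = e["count"] * e["cost_per_error"]

ranked = sorted(error_categories, key=lambda e: e["total_impact"], reverse=True)
assert ranked[0]["category"] == "enterprise_misroute"  # 12 x $500 = $6,000
```

The most frequent category lands last here: frequency alone is a poor proxy for what to fix first.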
The PM’s Role in Error Analysis
Error analysis is where the PM adds the most value during model development. The ML team sees the technical patterns; the PM sees the user impact.

The ML team says: “The model misclassifies 8% of inputs in category C.”
The PM adds: “Category C is our enterprise customers. Each misclassification costs us $500 in support escalation. This is our #1 priority.”

The ML team says: “We can improve overall accuracy by 2% or fix the category C errors.”
The PM decides: “Fix category C. The business impact is 10x higher even though the overall accuracy gain is smaller.”

This is product management applied to model development. You’re not choosing the technical approach — you’re choosing which errors matter most to the business.
The weekly error review: Every week, sit with the ML team and review the 20 worst errors. You provide business context. They provide technical context. Together, you prioritize what to fix next. This single ritual is the highest-ROI activity in AI product development. It keeps the model improving in the direction that matters most to users.
Train/Val/Test: The Split That Matters
Understanding why the ML team is so protective of the test set
The Three Splits
Training set (70–80%): The data the model learns from. It sees these examples repeatedly during training and adjusts its internal parameters to perform well on them.

Validation set (10–15%): Used during development to tune the model. The team checks performance on the validation set after each experiment to see if changes helped. The model doesn’t learn from this data directly, but the team’s decisions are influenced by it.

Test set (10–15%): The final, unbiased judge. Used only once, at the end, to measure true performance. The model has never seen this data. The team has never optimized for it. It’s the closest approximation to real-world performance.
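A minimal sketch of the 80/10/10 split using only the standard library; shuffling with a fixed seed keeps the split reproducible across runs:

```python
# Shuffle once, then carve off train/validation/test slices. Fractions
# mirror the 80/10/10 split described above.
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    rng = random.Random(seed)      # fixed seed -> reproducible split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # held out: evaluate once, at the end
    return train, val, test

train, val, test = split_dataset(range(1000))
assert (len(train), len(val), len(test)) == (800, 100, 100)
```

One caveat for time-dependent data: a random split can leak the future into training, so splitting by time is often the safer choice.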
Why This Matters for PMs
Overfitting: If the model performs great on training data but poorly on the test set, it has memorized the training examples instead of learning general patterns. It’s like a student who memorizes answers to practice tests but fails the real exam.

The test set is sacred. If the team repeatedly evaluates on the test set and adjusts based on results, the test set becomes contaminated — it’s no longer an unbiased measure. This is why ML teams are protective of it.

PM implication: When the ML team reports “92% accuracy,” ask: “On which set?” Training accuracy is meaningless. Validation accuracy is useful but optimistic. Test accuracy is the number that matters — and it should only be run when you’re ready to make a ship/no-ship decision.
The real-world gap: Even test set performance is optimistic compared to production. The test set was sampled from the same distribution as the training data. Real users will send inputs the model has never seen. Expect production performance to land 2–5 points below test performance. Plan for this gap in your launch threshold.
When to Push Back
Situations where the PM must challenge the ML team’s direction
Push Back When...
1. The team is optimizing the wrong metric.
“We improved overall accuracy by 3%!” But accuracy on the highest-value user segment dropped. The team optimized for the average case at the expense of the most important case. Redirect to the metric that matters.

2. The team wants more time but can’t articulate why.
“We need two more weeks.” For what specifically? What experiment will you run? What’s the hypothesis? If they can’t answer, they’re grinding without direction. Help them define the next experiment clearly.

3. The team is chasing marginal gains.
Going from 89% to 90% when the launch threshold is 85%. The last 1% of accuracy may take as long as the first 15%. Ship now, improve in production.
Push Back When... (continued)
4. The model is great in the lab but untested on real data.
“We achieved 95% on the test set!” Have you tested on data from the last month? On edge cases? On adversarial inputs? Lab performance is necessary but not sufficient.

5. The team is building infrastructure instead of solving the problem.
Spending 3 months building a custom training pipeline when a fine-tuned API would validate the concept in 2 weeks. Infrastructure should follow validation, not precede it.

6. Nobody is looking at actual model outputs.
The team is focused on metrics dashboards but hasn’t manually reviewed 50 model outputs this week. Metrics tell you how much the model fails. Looking at outputs tells you how it fails. Both are essential.
The PM’s authority: You don’t have authority over technical decisions (which model architecture, which hyperparameters). You do have authority over product decisions: which errors to prioritize, when to ship, what the quality bar is, and whether the team is solving the right problem. Exercise that authority. The ML team needs a PM who provides clear direction, not one who rubber-stamps everything.
The AI Standup Playbook
Seven questions that make AI standups productive instead of performative
Questions 1–4
1. “What experiments did you run since last standup?”
Not “what did you work on?” Experiments have hypotheses and results. Work is vague.

2. “What did the results show?”
Positive results, negative results, and inconclusive results are all valuable. A failed experiment that eliminates an approach is progress.

3. “What’s the current performance on the eval set?”
Track the primary metric and guardrail metrics. Plot them over time. Are we trending toward the launch threshold?

4. “What are the top 3 error categories right now?”
This keeps the team focused on the most impactful failures, not just overall metrics.
Questions 5–7
5. “What’s the next experiment and what’s the hypothesis?”
“We think adding customer tenure as a feature will improve recall on long-term customers by 5%.” Clear hypothesis, testable, time-boxed.

6. “What’s blocking progress?”
Often the answer is data-related: “We need 500 more labeled examples of edge case X.” The PM can unblock this faster than the ML team can.

7. “Are we still on track for the launch threshold?”
Not “are we on schedule?” (meaningless for research). Are we trending toward the performance bar we set? If not, what needs to change — the approach, the data, or the threshold itself?
The bottom line: You don’t need to understand backpropagation or gradient descent. You need to understand what the team is trying, whether it’s working, what the errors look like, and whether you’re converging on the quality bar. Manage experiments, not tasks. Prioritize errors, not features. Ship when good enough, not when perfect. That’s model development for PMs.