Ch 6 — Data Discovery & Feasibility

Before you build anything: do you have the data? The feasibility checklist every PM needs.
Data First, Always
80% of ML project failures trace back to data problems, not model problems
The Data Reality
The single most common reason AI projects fail: the data wasn’t there. Teams get excited about the model, skip the data assessment, spend months building, and discover too late that the data is insufficient, inaccessible, or unusable.

Industry research consistently shows that 80% of ML project failures trace back to data problems — not algorithm selection, not compute, not engineering talent. The data was missing, dirty, biased, or legally restricted.

This is why data feasibility is the first gate in any AI product initiative. Before you write a line of code, before you choose a model, before you estimate timelines — audit the data.
What PMs Get Wrong
“We have lots of data.”
Volume is not quality. A million rows of inconsistent, unlabeled, outdated data is worse than 10,000 clean, labeled, representative examples.

“The data team will figure it out.”
Data feasibility is a product decision, not a data engineering task. The PM must understand what data is needed, what exists, and what the gaps are.

“We’ll clean it up later.”
Data cleaning is not a one-time task. It’s an ongoing investment. If the data is fundamentally flawed — wrong labels, missing key features, biased samples — no amount of cleaning fixes it.

“We can use synthetic data.”
Synthetic data has its place, but it’s not a substitute for real-world data. Models trained on synthetic data often fail on real-world edge cases.
PM rule: Never greenlight an AI project without personally reviewing a sample of the data. Look at 50–100 examples. Can you, as a human, see the patterns the model needs to learn? If the data doesn’t make sense to you, it won’t make sense to the model.
The Data Audit
A systematic process for discovering what data exists and where the gaps are
Step 1: Inventory
Map every data source relevant to your AI product:

Internal databases — CRM, ERP, product analytics, support tickets, transaction logs
User-generated content — Reviews, messages, uploads, interactions
Third-party data — Vendor feeds, public datasets, purchased data
Unstructured sources — Documents, emails, images, audio, video

For each source, document: What is it? Where does it live? Who owns it? How much is there? How old is it? How is it accessed?
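The per-source documentation above can be captured as a structured record so the inventory stays consistent. A minimal sketch (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class DataSource:
    """One row of the data inventory (hypothetical schema)."""
    name: str           # what is it?
    location: str       # where does it live?
    owner: str          # who owns it?
    row_count: int      # how much is there?
    oldest_record: str  # how old is it? (ISO date)
    access_method: str  # how is it accessed?

crm = DataSource(
    name="CRM contacts",
    location="postgres://crm-db/contacts",
    owner="Sales Ops",
    row_count=1_200_000,
    oldest_record="2019-03-01",
    access_method="SQL, nightly replica",
)
print(asdict(crm)["owner"])  # → Sales Ops
```

Even a list of these records in a spreadsheet forces the team to answer every question for every source, which is the point of the inventory.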
Step 2: Gap Analysis
Compare what you have against what you need:

Feature gaps: Does the data contain the signals the model needs? A churn prediction model needs usage patterns, billing history, support interactions. If you only have billing data, you’re missing critical signals.
Volume gaps: Do you have enough examples? For supervised ML, typical minimums are 1,000–10,000 labeled examples per class. For rare events (fraud), you may need millions of transactions to get enough positive examples.
Temporal gaps: Do you have historical data? Most models need 12–24 months of history to capture seasonal patterns.
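The volume-gap arithmetic for rare events is worth doing explicitly before committing. A sketch (the 1,000-positive target is an assumption for illustration, not a universal rule):

```python
def rows_needed(positives_required: int, positive_rate: float) -> int:
    """Total examples needed to expect a given number of positive cases."""
    return int(positives_required / positive_rate)

# Fraud at 0.1% of transactions: to see ~1,000 fraud examples,
# you need roughly a million transactions.
print(rows_needed(1_000, 0.001))  # → 1000000
```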
Step 3: Access Assessment
Data that exists but can’t be accessed is the same as data that doesn’t exist:

Technical access: Is the data in a queryable format? Or locked in PDFs, legacy systems, or siloed databases with no API?
Organizational access: Does another team own the data? Will they share it? Is there a data governance process to request access?
Latency: Can you access the data in real-time (for inference) or only in batch (for training)? If your product needs real-time predictions, batch-only data is insufficient.
Cost: Is there a cost to access? Third-party data vendors, cloud egress fees, or compute costs for processing large datasets.
The audit deliverable: A single-page data inventory that lists every source, its quality score, gaps, access status, and owner. This becomes the foundation for every data-related decision. If you can’t fill in this page, you’re not ready to build.
The Seven Data Quality Dimensions
How to evaluate whether your data is actually usable for AI
Dimensions 1–4
1. Accuracy
Are the values factually correct? A customer database where 15% of email addresses are invalid has an accuracy problem. For AI, inaccurate data teaches the model wrong patterns. Target: <5% error rate for key fields.

2. Completeness
Are all required fields populated? A dataset with 40% missing values for a critical feature is incomplete. Missing data forces the model to guess — or forces you to drop those examples entirely. Target: <5% null rate for key features.

3. Consistency
Are formats uniform across records? “United States,” “US,” “U.S.A.,” and “America” in the same column is inconsistent. The model treats each as a different value.

4. Freshness
How recent is the data? A model trained on 2023 customer behavior may not reflect 2026 patterns. For fast-changing domains (fraud, social media), data older than 3–6 months may be stale.
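The first four dimensions can be measured with simple counts before any modeling begins. A minimal sketch in pure Python (the records and field names are hypothetical; the country column illustrates the consistency problem from dimension 3):

```python
from datetime import date

records = [
    {"email": "a@x.com", "country": "US",            "updated": date(2026, 1, 5)},
    {"email": None,      "country": "United States", "updated": date(2023, 6, 1)},
    {"email": "c@x.com", "country": "U.S.A.",        "updated": date(2026, 2, 1)},
    {"email": "d@x.com", "country": "US",            "updated": date(2025, 12, 20)},
]

# Completeness: null rate for a key field (target <5%)
null_rate = sum(r["email"] is None for r in records) / len(records)

# Consistency: distinct spellings in a column that should be uniform
spellings = {r["country"] for r in records}

# Freshness: share of records older than an assumed 6-month cutoff
cutoff = date(2025, 8, 1)
stale_share = sum(r["updated"] < cutoff for r in records) / len(records)

print(f"null_rate={null_rate:.0%} spellings={len(spellings)} stale={stale_share:.0%}")
# → null_rate=25% spellings=3 stale=25%
```

Numbers like these belong on the one-page data inventory: a 25% null rate on a key feature is a finding, not a footnote.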
Dimensions 5–7
5. Representativeness
Does the data reflect the full population the model will serve? A hiring model trained only on resumes from top universities will fail on candidates from other backgrounds. A medical AI trained only on data from one hospital may not generalize to others. Bias in data becomes bias in the model.

6. Label Quality
For supervised learning, are the labels correct and consistent? If three human labelers disagree on whether a support ticket is “urgent” or “normal,” the model learns from noise. Measure inter-annotator agreement — if humans can’t agree, the model can’t learn.

7. Provenance
Where did the data come from? Can you trace each record to its source? This matters for debugging (why did the model make this prediction?), compliance (can we prove the data was collected legally?), and reproducibility (can we recreate this dataset?).
The quality threshold: You don’t need perfect data to start. But you need to know where the quality problems are. A model trained on data with known 10% label noise can be designed to handle that. A model trained on data with unknown quality issues will fail unpredictably. Measure quality before training.
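The inter-annotator agreement mentioned under label quality can be quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal two-labeler sketch (the "urgent"/"normal" labels echo the support-ticket example above):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

a = ["urgent", "normal", "urgent", "normal", "normal", "urgent"]
b = ["urgent", "normal", "normal", "normal", "normal", "urgent"]
print(round(cohens_kappa(a, b), 2))  # → 0.67
```

A common rule of thumb treats kappa below roughly 0.6 as a warning sign: if two trained humans agree only modestly, the labels are teaching the model noise.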
The Labeling Challenge
Getting labeled data is often the hardest and most expensive part of AI
Why Labeling Is Hard
Supervised learning requires labeled examples: inputs paired with correct outputs. “This email is spam.” “This image contains a tumor.” “This transaction is fraudulent.”

Getting these labels requires humans to review each example and assign the correct answer. This is:

Expensive — Medical image labeling by radiologists costs $5–50 per image. General text labeling costs $0.05–0.50 per example. At 10,000 examples, that’s $500 to $500,000.
Slow — A labeling project for 50,000 examples typically takes 4–12 weeks.
Error-prone — Human labelers make mistakes. Ambiguous cases get inconsistent labels. Labeler fatigue degrades quality over time.
Labeling Strategies
In-house labeling: Your domain experts label the data. Highest quality but most expensive. Best for specialized domains (medical, legal, financial).

Crowdsourced labeling: Platforms like Scale AI, Labelbox, or Amazon Mechanical Turk. Cheaper and faster but lower quality. Use multiple labelers per example and take the majority vote.

Programmatic labeling: Write rules or heuristics that automatically label data. “If the transaction amount exceeds $10,000 and the country is on the watchlist, label as suspicious.” Fast and cheap but noisy — the labels are approximations.

Active learning: The model identifies the examples it’s most uncertain about and asks humans to label only those. Reduces labeling volume by 50–80% while maintaining quality.

LLM-assisted labeling: Use an LLM to generate initial labels, then have humans review and correct. Faster than manual labeling but requires careful quality checks.
PM decision: Labeling strategy is a product decision. How much can you spend? How fast do you need it? How specialized is the domain? For most products, a hybrid approach works: programmatic labeling for the easy cases, human labeling for the hard cases, active learning to minimize total human effort.
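Two of the strategies above are simple enough to sketch directly: a programmatic labeling rule (the watchlist heuristic from the text, with hypothetical country codes) and the majority vote used with crowdsourced labelers:

```python
from collections import Counter

WATCHLIST = {"XX", "YY"}  # hypothetical watchlist country codes

def rule_label(txn):
    """Programmatic labeling: the watchlist heuristic (labels are approximations)."""
    if txn["amount"] > 10_000 and txn["country"] in WATCHLIST:
        return "suspicious"
    return "ok"

def majority_vote(votes):
    """Crowdsourced labeling: keep the most common label across labelers."""
    return Counter(votes).most_common(1)[0][0]

print(rule_label({"amount": 15_000, "country": "XX"}))  # → suspicious
print(majority_vote(["spam", "spam", "not_spam"]))      # → spam
```

In a hybrid pipeline, examples the rule labels confidently skip human review; ambiguous ones go to multiple labelers and a vote.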
Legal & Privacy Constraints
Data you can access isn’t always data you can use
Privacy Regulations
Data privacy laws directly constrain what data you can use for AI:

GDPR (EU): Requires explicit consent for data processing, right to erasure (which affects training data), data minimization (collect only what’s necessary), and purpose limitation (data collected for one purpose can’t be repurposed for AI training without consent).

CCPA/CPRA (California): Right to know what data is collected, right to delete, right to opt out of data sales. Applies to AI training data.

HIPAA (US Healthcare): Protected health information requires de-identification before use in AI. Even de-identified data has re-identification risks.

Industry-specific: Financial services (SOX, PCI-DSS), education (FERPA), children’s data (COPPA) all have specific constraints.
Data Rights & Licensing
Training data rights: Do you have the right to use this data for AI training? User-generated content may be covered by your Terms of Service — or it may not. Third-party data licenses may prohibit ML training. Publicly scraped data is legally contested.

Copyright: Training on copyrighted material (books, articles, code) is the subject of active litigation. The legal landscape is evolving rapidly. Err on the side of caution.

Output ownership: Who owns AI-generated content? If your model is trained on customer data and generates outputs for other customers, there are IP implications.
Practical Steps
• Involve legal early — not after the model is built
• Document data provenance for every training dataset
• Implement consent mechanisms for user data used in training
• Build data deletion pipelines (GDPR right to erasure applies to training data)
• Anonymize or pseudonymize where possible
The legal veto: Legal constraints can kill an AI project regardless of technical feasibility. A healthcare AI that requires patient data but can’t get consent is dead on arrival. A financial AI trained on transaction data without proper anonymization is a compliance violation waiting to happen. Check legal feasibility alongside data feasibility.
The Feasibility Spike
A time-boxed experiment to prove data feasibility before committing resources
What Is a Feasibility Spike?
A feasibility spike is a 1–2 week experiment designed to answer one question: “Can we build a useful model with the data we have?”

It’s not a prototype. It’s not a v1. It’s a quick, scrappy test to validate that the data supports the AI approach before you invest months of engineering effort.

The process:
1. Take a sample of the available data (1,000–5,000 examples)
2. Clean it minimally (enough to be usable, not production-grade)
3. Train a simple baseline model (or test with an LLM)
4. Evaluate against your success criteria from the Problem Framing Canvas
5. Decide: proceed, pivot, or kill
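Step 3's "simple baseline" can be as crude as a majority-class predictor: if the candidate approach can't beat it during the spike, that's useful evidence on its own. A sketch with no ML library (the churn labels are illustrative):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

train = ["retain"] * 80 + ["churn"] * 20
test  = ["retain"] * 8  + ["churn"] * 2

print(majority_baseline(train, test))  # → 0.8
```

Any real model the spike produces should be compared against this number, not against zero: here, 80% accuracy is the floor, not an achievement.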
Interpreting Results
Green light: The baseline model achieves 70%+ of your target performance. With more data, better features, and proper engineering, you can likely reach the threshold. Proceed to full development.

Yellow light: The baseline achieves 40–70% of target. There’s signal in the data but significant gaps. Investigate: Is it a data quality issue? A feature gap? A labeling problem? Fix the root cause before proceeding.

Red light: The baseline achieves <40% of target, or performs no better than random. The data likely doesn’t contain the signal you need. Options: find new data sources, reframe the problem, or kill the project.
The spike saves months: A 2-week feasibility spike that kills a doomed project saves 6 months of wasted engineering. A spike that validates feasibility gives the team confidence to invest. Either outcome is valuable. Never skip the spike.
Data Anti-Patterns
Seven data mistakes that kill AI products
Anti-Patterns 1–4
1. Survivorship bias.
Training a loan approval model on data from approved loans only. The model never sees rejected applicants, so it can’t learn what a bad applicant looks like.

2. Label leakage.
Including information in the training data that wouldn’t be available at prediction time. A model that predicts hospital readmission using the “discharge summary” (written after readmission) achieves 99% accuracy in testing and 50% in production.

3. Temporal leakage.
Using future data to predict the past. Training a stock prediction model on data that includes tomorrow’s prices. The model looks perfect in backtesting and fails in real-time.

4. Class imbalance denial.
Fraud occurs in 0.1% of transactions. A model that predicts “not fraud” for everything achieves 99.9% accuracy. The metric looks great; the model is useless.
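Anti-pattern 4 takes three lines to demonstrate: the all-negative model's accuracy is stellar while its recall on fraud is exactly zero.

```python
n = 100_000
labels = ["fraud"] * 100 + ["ok"] * (n - 100)  # 0.1% fraud rate
preds  = ["ok"] * n                            # model that always says "not fraud"

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
fraud_caught = sum(p == y == "fraud" for p, y in zip(preds, labels))
recall = fraud_caught / 100

print(f"accuracy={accuracy:.3f} recall={recall:.1f}")  # → accuracy=0.999 recall=0.0
```

This is why imbalanced problems are evaluated with precision and recall (or precision-recall curves), never accuracy alone.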
Anti-Patterns 5–7
5. Proxy variable trap.
Using zip code as a feature in a credit model. Zip code correlates with race, so the model learns racial bias without explicitly using race as a feature. The proxy achieves the same discriminatory outcome.

6. Stale training data.
Training on data from a different era. A customer behavior model trained on pre-pandemic data makes wrong predictions in a post-pandemic world. The patterns have fundamentally changed.

7. Insufficient diversity.
A facial recognition model trained primarily on light-skinned faces fails on darker skin tones. A medical AI trained on data from one hospital fails at another. The training data doesn’t represent the deployment population.
PM responsibility: You don’t need to be a data scientist to catch these anti-patterns. Ask: “Does the training data represent the real-world population?” “Could any feature be a proxy for a protected characteristic?” “Is any information in the training data unavailable at prediction time?” “How old is this data?” These questions prevent the most common data disasters.
The Data Feasibility Checklist
The go/no-go gate before committing to an AI project
Data Availability
□ Data exists for the core features.
The signals the model needs are captured somewhere in your systems.

□ Sufficient volume.
Enough examples to train a model (thousands for ML, hundreds for LLM evaluation).

□ Historical depth.
At least 12–24 months of data to capture patterns and seasonality.

□ Accessible format.
Data can be extracted, queried, and processed without heroic engineering effort.

□ Refresh mechanism.
New data flows in continuously (or can be collected) to keep the model current.
Data Quality & Compliance
□ Quality assessed.
You’ve measured accuracy, completeness, consistency, and freshness. Key fields have <5% null rates.

□ Labels available or obtainable.
For supervised learning: labeled data exists, or you have a plan (and budget) to create it.

□ Representative of deployment population.
The training data reflects the users and scenarios the model will encounter in production.

□ Legal review completed.
Privacy, consent, licensing, and copyright have been reviewed. No legal blockers.

□ Feasibility spike completed.
A 1–2 week experiment shows the data contains enough signal to justify full development.
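The checklist can be enforced as a literal gate: any unchecked item blocks the project and is named explicitly. A sketch (the item keys are shorthand for the checklist items above; the values are a hypothetical project snapshot):

```python
CHECKLIST = {
    "data_exists": True,
    "sufficient_volume": True,
    "historical_depth": True,
    "accessible_format": True,
    "refresh_mechanism": True,
    "quality_assessed": True,
    "labels_obtainable": True,
    "representative": True,
    "legal_review": False,  # pending — blocks the gate
    "spike_completed": True,
}

def go_no_go(checklist):
    """Return GO only if every checklist item passes; otherwise name the blockers."""
    blockers = [item for item, passed in checklist.items() if not passed]
    return ("GO", []) if not blockers else ("NO-GO", blockers)

decision, blockers = go_no_go(CHECKLIST)
print(decision, blockers)  # → NO-GO ['legal_review']
```

The value is less in the code than in the discipline: nine of ten items passing is still a no-go, and the output says exactly why.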
The bottom line: Data feasibility is the most important gate in AI product development. A project that passes this gate has a fighting chance. A project that skips it is gambling with months of engineering time. Be rigorous here — it’s cheaper to kill a project at the data stage than after six months of model development.