Step 1: Inventory
Map every data source relevant to your AI product:
• Internal databases — CRM, ERP, product analytics, support tickets, transaction logs
• User-generated content — Reviews, messages, uploads, interactions
• Third-party data — Vendor feeds, public datasets, purchased data
• Unstructured sources — Documents, emails, images, audio, video
For each source, document: What is it? Where does it live? Who owns it? How much is there? How old is it? How is it accessed?
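The six questions above map naturally onto one record per source. A minimal sketch in Python (the class and field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One row in the data inventory; each field answers one audit question."""
    name: str       # what is it?
    location: str   # where does it live?
    owner: str      # who owns it?
    volume: str     # how much is there? e.g. "4.2M rows"
    freshness: str  # how old is it? e.g. "updated nightly"
    access: str     # how is it accessed? e.g. "SQL via warehouse"

# Example entry for a hypothetical CRM source
crm = DataSource(
    name="CRM contacts",
    location="Salesforce",
    owner="Sales Ops",
    volume="120k records",
    freshness="live",
    access="REST API",
)
```

Filling one record per source forces the questions to be answered explicitly rather than left as tribal knowledge.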
Step 2: Gap Analysis
Compare what you have against what you need:
• Feature gaps: Does the data contain the signals the model needs? A churn prediction model needs usage patterns, billing history, support interactions. If you only have billing data, you’re missing critical signals.
• Volume gaps: Do you have enough examples? For supervised ML, typical minimums are 1,000–10,000 labeled examples per class. For rare events (fraud), you may need millions of transactions to get enough positive examples.
• Temporal gaps: Do you have historical data? Most models need 12–24 months of history to capture seasonal patterns.
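The volume and temporal checks are mechanical enough to script. A sketch, assuming labeled examples and date bounds are already extracted (the 1,000-per-class and 12-month thresholds are the rules of thumb from above, not hard limits):

```python
from collections import Counter
from datetime import date

def volume_gaps(labels, min_per_class=1_000):
    """Return classes that fall below the minimum labeled-example count."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_per_class}

def temporal_gap(earliest: date, latest: date, min_months=12):
    """True if the history is too short to capture seasonal patterns."""
    months = (latest.year - earliest.year) * 12 + (latest.month - earliest.month)
    return months < min_months

# A churn dataset with far too few positive examples
labels = ["churned"] * 40 + ["retained"] * 5_000
print(volume_gaps(labels))                                # {'churned': 40}
print(temporal_gap(date(2024, 1, 1), date(2024, 9, 1)))   # True — only 8 months
```

Rare-event problems like fraud usually fail the volume check first: the total row count looks enormous while the positive class is still tiny.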
Step 3: Access Assessment
Data that exists but can’t be accessed is the same as data that doesn’t exist:
• Technical access: Is the data in a queryable format? Or locked in PDFs, legacy systems, or siloed databases with no API?
• Organizational access: Does another team own the data? Will they share it? Is there a data governance process to request access?
• Latency: Can you access the data in real-time (for inference) or only in batch (for training)? If your product needs real-time predictions, batch-only data is insufficient.
• Cost: Is there a cost to access? Budget for third-party data vendors, cloud egress fees, and the compute needed to process large datasets.
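The four access dimensions can be turned into a per-source checklist. An illustrative sketch (the field names, flags, and example vendor feed are assumptions for demonstration):

```python
def access_flags(source: dict, needs_realtime: bool) -> list[str]:
    """Return access concerns for a source; an empty list means clear to use."""
    flags = []
    if not source["queryable"]:               # technical access
        flags.append("not in a queryable format")
    if not source["sharing_approved"]:        # organizational access
        flags.append("owning team has not approved access")
    if needs_realtime and source["latency"] == "batch":   # latency
        flags.append("batch-only; product needs real-time")
    if source["monthly_cost_usd"] > 0:        # cost
        flags.append(f"recurring cost: ${source['monthly_cost_usd']}/mo")
    return flags

# Hypothetical third-party vendor feed
vendor_feed = {"queryable": True, "sharing_approved": True,
               "latency": "batch", "monthly_cost_usd": 2000}
print(access_flags(vendor_feed, needs_realtime=True))
```

A source that is technically reachable can still fail on latency or cost, which is why each dimension gets its own flag rather than a single pass/fail.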
The audit deliverable: A single-page data inventory that lists every source, its quality score, gaps, access status, and owner. This becomes the foundation for every data-related decision. If you can’t fill in this page, you’re not ready to build.
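The one-page deliverable itself can be generated directly from the audit records. A minimal sketch, with made-up example rows and illustrative column names:

```python
# Each row carries the deliverable's fields: source, quality score, gaps,
# access status, and owner. The entries here are hypothetical.
inventory = [
    {"source": "CRM contacts",    "quality": "B", "gaps": "no usage data",
     "access": "granted", "owner": "Sales Ops"},
    {"source": "Support tickets", "quality": "C", "gaps": "<12 mo history",
     "access": "pending", "owner": "Support Eng"},
]

# Render an aligned plain-text table, one line per source
cols = list(inventory[0])
widths = {c: max(len(c), *(len(row[c]) for row in inventory)) for c in cols}
header = " | ".join(c.ljust(widths[c]) for c in cols)
print(header)
print("-" * len(header))
for row in inventory:
    print(" | ".join(row[c].ljust(widths[c]) for c in cols))
```

If a row cannot be filled in, that blank cell is itself an audit finding.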