What Counts as Data
ML systems learn from examples. These can be structured (spreadsheets, databases, transaction logs) or unstructured (emails, images, audio, documents). A fraud detection model learns from historical transactions. A language model learns from text scraped from the internet. A medical imaging system learns from thousands of labeled X-rays.
Labels and Features
Features are the characteristics the model uses to make predictions, such as transaction amount, time of day, and merchant category. Labels are the correct answers, such as “fraud” or “legitimate.” In supervised learning (the most common type), you need both. The quality and relevance of your features often matter more than the sophistication of the algorithm.
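To make the distinction concrete, here is a minimal sketch of how a handful of labeled transactions might be separated into features and labels before training. The records, field names, and the helper split_features_and_labels are hypothetical and chosen only for illustration; a real fraud dataset would have far more rows and columns.

```python
# Hypothetical transaction records; field names and values are illustrative only.
transactions = [
    {"amount": 42.50, "hour": 14, "merchant_category": "grocery", "label": "legitimate"},
    {"amount": 1890.00, "hour": 3, "merchant_category": "electronics", "label": "fraud"},
    {"amount": 12.99, "hour": 19, "merchant_category": "restaurant", "label": "legitimate"},
]

# The columns the model is allowed to look at when making a prediction.
FEATURE_NAMES = ["amount", "hour", "merchant_category"]


def split_features_and_labels(records):
    """Separate each record into model inputs (features) and the known answer (label)."""
    features = [{name: record[name] for name in FEATURE_NAMES} for record in records]
    labels = [record["label"] for record in records]
    return features, labels


features, labels = split_features_and_labels(transactions)
```

In supervised learning, the model sees the features and is scored against the labels. If the label column is missing or unreliable, there is nothing to learn from.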
The Data Quality Problem
By commonly cited estimates, most enterprise AI projects devote 60–80% of their effort to data preparation: cleaning, formatting, deduplicating, and labeling. Raw data is messy, with missing values, inconsistent formats, duplicate records, and outdated entries. A model trained on poor data will produce poor results, regardless of how advanced the algorithm is.
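As a rough illustration of what data preparation involves, the sketch below cleans a tiny set of raw records that exhibit each problem listed above. The records, the cutoff date, the accepted date formats, and the helper names are all assumptions made for this example. Dropping incomplete rows is only one possible policy; many teams fill in missing values instead.

```python
from datetime import date, datetime

# Hypothetical raw records showing the four problems named above.
raw_records = [
    {"id": "A1", "amount": "42.50", "date": "2024-03-01"},
    {"id": "A1", "amount": "42.50", "date": "2024-03-01"},     # duplicate record
    {"id": "B2", "amount": "", "date": "03/02/2024"},          # missing value
    {"id": "C3", "amount": "1,890.00", "date": "2019-01-15"},  # outdated entry
    {"id": "D4", "amount": "1,200.00", "date": "03/05/2024"},  # inconsistent formats
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats present in the raw data
CUTOFF = date(2023, 1, 1)                # assumed threshold for "outdated"


def parse_date(text):
    """Try each known format; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(records):
    """Deduplicate by id, drop incomplete or outdated rows, normalize formats."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if record["id"] in seen_ids:
            continue  # skip duplicates of a record already seen
        seen_ids.add(record["id"])
        if not record["amount"].strip():
            continue  # drop rows with a missing amount
        parsed = parse_date(record["date"])
        if parsed is None or parsed < CUTOFF:
            continue  # drop unparseable or outdated entries
        cleaned.append({
            "id": record["id"],
            "amount": float(record["amount"].replace(",", "")),
            "date": parsed,
        })
    return cleaned


cleaned_records = clean(raw_records)
```

Even in this toy case, five raw rows shrink to two usable ones, and every rule encodes a judgment call about the business data that someone has to make and defend.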
Why it matters: When an AI project fails, the cause is rarely the algorithm. It’s almost always the data: too little of it, the wrong kind, or poor quality. This is the single most important factor executives should scrutinize in any AI initiative. Chapter 4 goes deeper.