Ch 3 — Data Readiness & Legacy Systems

Only 7% of enterprises are data-ready for AI — siloed systems, dirty data, and the cleanup nobody budgets for
High Level: Audit → Silos → Clean → Schema → Govern → Ready
The 7% Reality
Almost nobody's data is ready for AI
The Numbers
A 2026 report from Cloudera and Harvard Business Review Analytic Services found that only 7% of enterprises say their data is completely ready for AI, while 27% report their data is "not very" or "not at all" ready. Meanwhile, 73% of organizations say they should prioritize AI data quality more than they currently do. The gap between AI ambition and data reality is the single largest barrier to enterprise AI adoption — larger than model capability, talent shortage, or budget constraints. Data management and governance now rank ahead of cost and talent as the top challenge to scaling AI, according to Semarchy's 2026 State of MDM report.
Data Readiness Spectrum
Enterprise data readiness (2026):
  Completely ready:       7%  █
  Mostly ready:          28%  ███
  Somewhat ready:        38%  ████
  Not very/not at all:   27%  ███

Top barriers to scaling AI:
  #1 Data management & governance
  #2 Cost
  #3 Talent

// Source: Cloudera + HBR, March 2026
Why it matters: If 93% of enterprises aren't data-ready, then 93% of enterprise AI projects are building on a foundation that will crack under production load. Data readiness isn't a prerequisite you check off — it's the project itself.
The Silo Problem
56% of enterprises can't integrate their own data sources
Why Silos Kill AI
56% of enterprises cite siloed data and difficulty integrating data sources as their primary obstacle to AI readiness. An AI agent that needs to answer "What's the status of order #4401?" might need to check the ERP for the order, the WMS for shipping, the CRM for customer communication, and the finance system for payment status. In most enterprises, these systems were built by different vendors, at different times, with different data models. They don't share customer IDs, use different date formats, and have inconsistent field names. The agent isn't just reasoning about the question — it's translating between incompatible representations of reality across systems that were never designed to talk to each other.
One Question, Four Systems
"What's the status of order #4401?"

  ERP (SAP):        order_number = 4401
                    date format: YYYYMMDD
                    customer: CUST_ID_882
  WMS (Oracle):     shipment_ref = "PO-4401-A"
                    date format: MM/DD/YYYY
                    customer: 882
  CRM (Salesforce): opp_id = "OPP-4401"
                    date format: ISO 8601
                    account: "Acme Corp"
  Finance:          payment status — a fourth schema, not shown

No shared ID. No shared schema.
Key insight: Data silos aren't just an inconvenience — they mean the agent is working with partial, inconsistent views of the same reality. Without a unified data layer, every agent answer is a guess stitched together from fragments.
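To make the translation problem concrete, here is a minimal sketch of the glue code an agent needs just to answer that one order-status question. All record layouts and field names are hypothetical, loosely modeled on the diagram above; real SAP, Oracle, and Salesforce payloads are far messier.

```python
from datetime import datetime

# Hypothetical raw records for order 4401, one per system (illustrative only).
erp_rec = {"order_number": 4401, "order_date": "20260114", "customer": "CUST_ID_882"}
wms_rec = {"shipment_ref": "PO-4401-A", "ship_date": "01/16/2026", "customer": 882}
crm_rec = {"opp_id": "OPP-4401", "updated": "2026-01-17T09:30:00", "account": "Acme Corp"}

def normalize_date(value: str, fmt: str) -> str:
    """Parse a system-specific date string and emit ISO 8601 (date only)."""
    return datetime.strptime(value, fmt).date().isoformat()

def unify_order(erp: dict, wms: dict, crm: dict) -> dict:
    """Stitch one canonical order view out of three incompatible records."""
    order_id = str(erp["order_number"])
    # Weak join: the WMS reference merely embeds the order number ("PO-4401-A").
    assert order_id in wms["shipment_ref"], "WMS record does not match order"
    return {
        "order_id": order_id,
        "ordered": normalize_date(erp["order_date"], "%Y%m%d"),
        "shipped": normalize_date(wms["ship_date"], "%m/%d/%Y"),
        "customer_erp_id": erp["customer"],   # "CUST_ID_882"
        "customer_wms_id": wms["customer"],   # 882 -- no shared customer key
        "account_name": crm["account"],
    }

unified = unify_order(erp_rec, wms_rec, crm_rec)
# unified["ordered"] == "2026-01-14", unified["shipped"] == "2026-01-16"
```

Note that the "join" is a substring match on a reference string: exactly the kind of fragile, undocumented linkage that silo integration work is made of.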
The Cleanup Nobody Budgets For
Data quality adds 15-25% to every AI project budget
The Hidden Cost
Data quality issues are the #1 hidden cost in AI implementation, affecting 54% of projects and adding 15–25% to budgets through missing data remediation, format normalization, bias correction, and data augmentation. Overall, hidden AI implementation costs add 30–70% to project budgets, with 68% of projects exceeding initial estimates by an average of 42%. Prudential Financial spent 18 months scrubbing five years of historical data, normalizing over 600,000 uncleaned vendor entries to achieve 99% categorization of its global spend. Most enterprises don't have Prudential's resources — but they have the same data problems at smaller scale, and they budget zero time for cleanup.
Budget Reality
Typical AI project budget plan:
  Model & infrastructure: 40%
  Development:            35%
  Testing:                15%
  Data cleanup:           10%  ← too low

Actual spend:
  Data cleanup:      25-35%
  Budget overage:    42% average
  Projects affected: 54%

// Prudential: 18 months, 600K vendor
// entries normalized for AI readiness
Rule of thumb: If your AI project plan allocates less than 25% of budget and timeline to data preparation, your plan is wrong. Adjust it now or adjust it later at 3x the cost.
Legacy System Architecture
60% of AI leaders cite legacy integration as their primary barrier
The Legacy Landscape
60% of AI leaders cite legacy system integration as their primary barrier to AI success. Enterprise systems built in the 1990s and 2000s weren't designed for API-first access. Many core business systems — mainframes running COBOL, on-premise ERP installations, custom-built databases — expose data through batch exports, proprietary protocols, or screen-scraping interfaces. An AI agent that needs real-time data from these systems faces a fundamental architectural mismatch: the agent operates in milliseconds, but the data source operates in batch cycles measured in hours or days. Building a real-time integration layer on top of batch-oriented systems is one of the most expensive and error-prone undertakings in enterprise IT.
Integration Patterns
System age     Data access
Mainframe      Batch export / screen scrape
On-prem ERP    BAPI / RFC (proprietary)
Legacy DB      ODBC / stored procedures
SaaS (2010s)   REST API (rate-limited)
Modern SaaS    GraphQL / webhooks / streaming

Agent expectation: real-time, structured
Legacy reality:    batch, proprietary

// 60% cite legacy as #1 barrier
// Source: MindXO, 2026
Key insight: The question isn't "can we connect the agent to SAP?" — it's "can we get fresh, structured data from SAP in under 2 seconds?" The answer for most legacy systems is no, without significant middleware investment.
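One common middleware pattern for bridging that mismatch is a read-through cache: the agent reads from the cache within its latency budget, and slow backend fetches only happen when a cached value exceeds its staleness limit. This is a minimal sketch of the idea; `FreshnessCache` and its interface are illustrative names, not a real library.

```python
import time

class FreshnessCache:
    """Read-through cache that fronts a slow, batch-oriented backend."""

    def __init__(self, max_staleness_s: float):
        self.max_staleness_s = max_staleness_s
        self._store: dict = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        """Return a cached value if fresh enough; otherwise call fetch(key)."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] <= self.max_staleness_s:
            return entry[0], "cached"
        value = fetch(key)  # may be slow: batch export, RFC call, screen scrape
        self._store[key] = (value, now)
        return value, "fetched"

# 5-minute staleness budget; the fetch callable stands in for the legacy adapter.
cache = FreshnessCache(max_staleness_s=300)
value, source = cache.get("order:4401", lambda k: {"status": "shipped"})
```

The design choice worth noting: the staleness budget lives in the middleware, not the agent, so the "can we get fresh data in under 2 seconds?" question gets one enforced answer per source instead of being re-decided in every prompt.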
Schema Inconsistency
When "customer" means five different things across five systems
The Semantic Gap
Even when data is accessible, it's rarely consistent. A "customer" in the CRM is an account with contacts; in the ERP it's a billing entity; in the support system it's a ticket requester; in the marketing platform it's an email address. These aren't just naming differences — they represent fundamentally different data models of the same real-world entity. An AI agent asked "How many customers do we have?" will get a different answer from every system. Without Master Data Management (MDM) — a unified, authoritative record for each entity — the agent is reasoning over contradictory data. Just 50% of organizations have adopted MDM as a foundation for AI; the other half are scaling AI on fragmented data.
One Entity, Five Identities
"How many customers do we have?"

  CRM:       12,400 accounts
  ERP:        8,200 billing entities
  Support:   31,000 unique requesters
  Marketing: 85,000 email addresses
  MDM:        9,800 golden records

Without MDM, agent picks one at random

// 50% of orgs have MDM; 50% don't
// Source: Semarchy, 2026
Key insight: Schema inconsistency doesn't just produce wrong answers — it produces confidently wrong answers that look plausible because the data is real. It's just the wrong data for the question.
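The core MDM mechanic — collapsing per-system "customers" into one golden record — can be sketched in a few lines. This is a toy version: real MDM uses probabilistic matching and per-field survivorship rules, whereas here the match key is a normalized email and the survivorship rule is simply "first non-empty value wins". All field names are illustrative.

```python
def match_key(record: dict) -> str:
    """Normalize the best available identifier (email here) for matching."""
    return record.get("email", "").strip().lower()

def build_golden_records(sources: list[list[dict]]) -> dict[str, dict]:
    """Merge customer rows from several systems into golden records by match key."""
    golden: dict[str, dict] = {}
    for system in sources:
        for rec in system:
            key = match_key(rec)
            if not key:
                continue  # unmatchable rows go to manual review in practice
            merged = golden.setdefault(key, {})
            for fld, value in rec.items():
                merged.setdefault(fld, value)  # first non-empty value wins
    return golden

# Two system-local views of the same person become one golden record.
crm = [{"email": "jo@acme.com", "name": "Jo Smith", "account": "Acme Corp"}]
erp = [{"email": "JO@ACME.COM", "billing_id": "B-882"}]
golden = build_golden_records([crm, erp])
```

Run on the example data, `golden` contains a single record carrying both the CRM's account name and the ERP's billing ID — the unified view the agent should be reasoning over.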
Stale Data: The Freshness Problem
When the agent's answer was correct yesterday but wrong today
Why Freshness Matters
Enterprise data has a shelf life. Inventory levels change hourly. Customer contact information changes monthly. Pricing changes quarterly. Regulatory requirements change annually. An AI agent that retrieves data from a nightly batch sync is working with information that could be 24 hours stale. For some use cases (historical reporting), this is fine. For others (real-time inventory checks, customer-facing pricing, compliance status), stale data produces answers that are technically "from the system" but factually wrong. The challenge is that most enterprise data pipelines were designed for human consumption — daily reports, weekly dashboards — not for AI agents that need sub-second access to current state.
Freshness Requirements
Use case            Freshness needed
Inventory check     Real-time (seconds)
Order status        Near-real-time (minutes)
Customer info       Hourly
Pricing             Daily
Compliance docs     Weekly
Historical reports  Monthly

Typical batch pipeline:
  Nightly sync = 24hr max staleness
  Acceptable for reports
  Unacceptable for agent decisions
Rule of thumb: For every data source your agent uses, define a maximum acceptable staleness. If your pipeline can't meet it, either change the pipeline or change the use case.
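That rule of thumb can be made executable: declare a staleness budget per source and check every retrieval against it, so stale data is flagged rather than silently served. The budget values below are illustrative, not recommendations — set your own SLAs per source.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable staleness per data source (illustrative values,
# loosely following the freshness table above).
MAX_STALENESS = {
    "inventory": timedelta(seconds=30),
    "order_status": timedelta(minutes=5),
    "customer_info": timedelta(hours=1),
    "pricing": timedelta(days=1),
}

def is_fresh(source, last_synced, now=None):
    """True if the source's last sync is within its staleness budget."""
    now = now or datetime.now(timezone.utc)
    return now - last_synced <= MAX_STALENESS[source]

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
is_fresh("pricing", now - timedelta(hours=12), now)     # within the daily budget
is_fresh("inventory", now - timedelta(minutes=5), now)  # over the 30s budget
```

The useful part is the failure mode: when `is_fresh` returns False, the agent can decline to answer or trigger a live fetch, instead of presenting a 24-hour-old inventory count as current fact.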
Data Governance for AI
Who owns the data the agent uses, and who's responsible when it's wrong?
The Governance Gap
44% of enterprises lack a clear data strategy, and 34% face regulatory constraints on data use. When an AI agent pulls data from five systems to answer a question, the governance questions multiply: Who owns each data source? Who approved the agent's access? What happens when the data is wrong? Who is liable for decisions made on bad data? In regulated industries, these aren't philosophical questions — they have legal answers that must be documented before the agent goes live. Data governance for AI requires lineage tracking (where did this data come from?), access controls (who can the agent read from?), quality SLAs (how accurate is this source?), and retention policies (how long can the agent remember this data?).
Governance Checklist
For each data source the agent uses:
  Ownership:   Who maintains this data?
  Lineage:     Where does it originate?
  Quality SLA: What accuracy is guaranteed?
  Access:      What can the agent read/write?
  PII:         Does it contain personal data?
  Retention:   How long can agent cache it?
  Audit:       Is every access logged?

// 44% of enterprises lack a data strategy
// Source: Cloudera + HBR, 2026
Key insight: Data governance for AI isn't a new discipline — it's traditional data governance with higher stakes. The agent amplifies both the value of good data and the damage of bad data.
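The checklist above becomes enforceable once it is machine-readable: attach a policy record to each data source and block deployment when governance gaps exist. This is a hedged sketch; the field names mirror the checklist, and the specific blocking rules are assumptions you would replace with your own compliance requirements.

```python
from dataclasses import dataclass

@dataclass
class DataSourcePolicy:
    """Governance metadata for one data source the agent reads from."""
    name: str
    owner: str            # team accountable for the data
    lineage: str          # upstream origin
    quality_sla: float    # guaranteed accuracy, e.g. 0.99
    agent_access: str     # "read", "read-write", or "none"
    contains_pii: bool
    max_cache_hours: int  # retention limit for the agent
    audit_logged: bool

def go_live_blockers(policy: DataSourcePolicy) -> list[str]:
    """Return the governance gaps that should block agent deployment."""
    blockers = []
    if not policy.owner:
        blockers.append("no named owner")
    if policy.contains_pii and not policy.audit_logged:
        blockers.append("PII source without audit logging")
    if policy.agent_access == "read-write" and policy.quality_sla < 0.95:
        blockers.append("write access on a low-quality source")
    return blockers

crm = DataSourcePolicy("crm", "sales-ops", "salesforce", 0.98,
                       "read", contains_pii=True, max_cache_hours=24,
                       audit_logged=False)
# go_live_blockers(crm) flags the missing audit logging on a PII source
```

In regulated industries, this is the shape of the "legal answers that must be documented before the agent goes live": a record per source, checked automatically, with an audit trail of who signed off.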
The Data Readiness Assessment
A practical framework for evaluating your starting position
Five Dimensions
Before building any AI agent, assess your data across five dimensions. Accessibility: can the agent reach the data via API in under 2 seconds? Quality: what percentage of records are complete, accurate, and current? Consistency: do the same entities have the same IDs and schemas across systems? Governance: are ownership, access controls, and audit trails in place? Volume: is there enough data for the agent to learn patterns, and is it manageable at production scale? Score each dimension 1–5. If any dimension scores below 3, that's your first workstream — not the AI agent. The companies that succeed treat data readiness as Phase 1 of the AI project, not a prerequisite they assume is already met.
Readiness Scorecard
Dimension      Check (score 1-5)  Threshold
Accessibility  API < 2s?          ≥ 3
Quality        Complete?          ≥ 3
Consistency    Unified IDs?       ≥ 3
Governance     Owned?             ≥ 3
Volume         Enough?            ≥ 3

Any dimension < 3  = fix first
All dimensions ≥ 3 = proceed

// Data readiness is Phase 1, not Phase 0
Key insight: The readiness assessment isn't a gate to pass once — it's a living scorecard that should be re-evaluated as you add new data sources, new use cases, and new agent capabilities.
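The five-dimension assessment reduces to a tiny function you can re-run every time a data source or use case changes, which is what keeps the scorecard "living" rather than a one-time gate. A minimal sketch, with illustrative scores:

```python
THRESHOLD = 3  # per the scorecard: any dimension below 3 blocks agent work

def assess(scores: dict[str, int]) -> tuple[bool, list[str]]:
    """scores maps dimension -> 1..5. Returns (ready, dimensions to fix first)."""
    fix_first = [dim for dim, s in scores.items() if s < THRESHOLD]
    return (not fix_first, fix_first)

ready, fix = assess({
    "accessibility": 2,  # no sub-2s API access yet
    "quality": 4,
    "consistency": 3,
    "governance": 3,
    "volume": 5,
})
# ready == False; fix == ["accessibility"] -> that is workstream #1, not the agent
```

Re-running `assess` per data source, rather than once per project, matches the chapter's framing: readiness is Phase 1 of the project, re-evaluated as the project grows.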