Ch 3 — Data Readiness & Legacy Systems

Only 7% of enterprises are data-ready for AI — siloed systems, dirty data, and the cleanup nobody budgets for
High Level: Audit → Silos → Clean → Schema → Govern → Ready
The 7% Reality
Almost nobody's data is ready for AI
The Numbers
A 2026 report from Cloudera and Harvard Business Review Analytic Services found that only 7% of enterprises say their data is completely ready for AI, while 27% report their data is "not very" or "not at all" ready. Meanwhile, 73% of organizations say they should prioritize AI data quality more than they currently do. The gap between AI ambition and data reality is the single largest barrier to enterprise AI adoption — larger than model capability, talent shortage, or budget constraints. Data management and governance now rank ahead of cost and talent as the top challenge to scaling AI, according to Semarchy's 2026 State of MDM report.
Data Readiness Spectrum
Enterprise data readiness (2026):
  Completely ready:       7%  █
  Mostly ready:          28%  ███
  Somewhat ready:        38%  ████
  Not very/not at all:   27%  ███

Top barriers to scaling AI:
  #1 Data management & governance
  #2 Cost
  #3 Talent

// Source: Cloudera + HBR, March 2026
Why it matters: If 93% of enterprises aren't data-ready, then 93% of enterprise AI projects are building on a foundation that will crack under production load. Data readiness isn't a prerequisite you check off — it's the project itself.
The Silo Problem
56% of enterprises can't integrate their own data sources
Why Silos Kill AI
56% of enterprises cite siloed data and difficulty integrating data sources as their primary obstacle to AI readiness. An AI agent that needs to answer "What's the status of order #4401?" might need to check the ERP for the order, the WMS for shipping, the CRM for customer communication, and the finance system for payment status. In most enterprises, these systems were built by different vendors, at different times, with different data models. They don't share customer IDs, use different date formats, and have inconsistent field names. The agent isn't just reasoning about the question — it's translating between incompatible representations of reality across systems that were never designed to talk to each other.
One Question, Four Systems
"What's the status of order #4401?"

  ERP (SAP):        order_number = 4401
                    date format: YYYYMMDD
                    customer: CUST_ID_882
  WMS (Oracle):     shipment_ref = "PO-4401-A"
                    date format: MM/DD/YYYY
                    customer: 882
  CRM (Salesforce): opp_id = "OPP-4401"
                    date format: ISO 8601
                    account: "Acme Corp"
  Finance:          payment status — a fourth schema, not shown

No shared ID. No shared schema.
Key insight: Data silos aren't just an inconvenience — they mean the agent is working with partial, inconsistent views of the same reality. Without a unified data layer, every agent answer is a guess stitched together from fragments.
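To make the translation problem concrete, here is a minimal sketch of the glue code an agent needs just to answer that one order-status question. All record layouts and field names are hypothetical, loosely modeled on the diagram above; real SAP, Oracle, and Salesforce payloads are far messier.

```python
from datetime import datetime

# Hypothetical raw records for order 4401, one per system (illustrative only).
erp_rec = {"order_number": 4401, "order_date": "20260114", "customer": "CUST_ID_882"}
wms_rec = {"shipment_ref": "PO-4401-A", "ship_date": "01/16/2026", "customer": 882}
crm_rec = {"opp_id": "OPP-4401", "updated": "2026-01-17T09:30:00", "account": "Acme Corp"}

def normalize_date(value: str, fmt: str) -> str:
    """Parse a system-specific date string and emit ISO 8601 (date only)."""
    return datetime.strptime(value, fmt).date().isoformat()

def unify_order(erp: dict, wms: dict, crm: dict) -> dict:
    """Stitch one canonical order view out of three incompatible records."""
    order_id = str(erp["order_number"])
    # Weak join: the WMS reference merely embeds the order number ("PO-4401-A").
    assert order_id in wms["shipment_ref"], "WMS record does not match order"
    return {
        "order_id": order_id,
        "ordered": normalize_date(erp["order_date"], "%Y%m%d"),
        "shipped": normalize_date(wms["ship_date"], "%m/%d/%Y"),
        "customer_erp_id": erp["customer"],   # "CUST_ID_882"
        "customer_wms_id": wms["customer"],   # 882 -- no shared customer key
        "account_name": crm["account"],
    }

unified = unify_order(erp_rec, wms_rec, crm_rec)
# unified["ordered"] == "2026-01-14", unified["shipped"] == "2026-01-16"
```

Note that the "join" is a substring match on a reference string: exactly the kind of fragile, undocumented linkage that silo integration work is made of.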
The Cleanup Nobody Budgets For
Data quality adds 15-25% to every AI project budget
The Hidden Cost
Data quality issues are the #1 hidden cost in AI implementation, affecting 54% of projects and adding 15–25% to budgets through missing data remediation, format normalization, bias correction, and data augmentation. Overall, hidden AI implementation costs add 30–70% to project budgets, with 68% of projects exceeding initial estimates by an average of 42%. Prudential Financial spent 18 months scrubbing five years of historical data, normalizing over 600,000 uncleaned vendor entries to achieve 99% categorization of its global spend. Most enterprises don't have Prudential's resources — but they have the same data problems at smaller scale, and they budget zero time for cleanup.
Budget Reality
Typical AI project budget plan:
  Model & infrastructure: 40%
  Development:            35%
  Testing:                15%
  Data cleanup:           10%  ← too low

Actual spend:
  Data cleanup:      25-35%
  Budget overage:    42% average
  Projects affected: 54%

// Prudential: 18 months, 600K vendor
// entries normalized for AI readiness
Rule of thumb: If your AI project plan allocates less than 25% of budget and timeline to data preparation, your plan is wrong. Adjust it now or adjust it later at 3x the cost.
Legacy System Architecture
60% of AI leaders cite legacy integration as their primary barrier
The Legacy Landscape
60% of AI leaders cite legacy system integration as their primary barrier to AI success. Enterprise systems built in the 1990s and 2000s weren't designed for API-first access. Many core business systems — mainframes running COBOL, on-premise ERP installations, custom-built databases — expose data through batch exports, proprietary protocols, or screen-scraping interfaces. An AI agent that needs real-time data from these systems faces a fundamental architectural mismatch: the agent operates in milliseconds, but the data source operates in batch cycles measured in hours or days. Building a real-time integration layer on top of batch-oriented systems is one of the most expensive and error-prone undertakings in enterprise IT.
Integration Patterns
System age     Data access
Mainframe      Batch export / screen scrape
On-prem ERP    BAPI / RFC (proprietary)
Legacy DB      ODBC / stored procedures
SaaS (2010s)   REST API (rate-limited)
Modern SaaS    GraphQL / webhooks / streaming

Agent expectation: real-time, structured
Legacy reality:    batch, proprietary

// 60% cite legacy as #1 barrier
// Source: MindXO, 2026
Key insight: The question isn't "can we connect the agent to SAP?" — it's "can we get fresh, structured data from SAP in under 2 seconds?" The answer for most legacy systems is no, without significant middleware investment.
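One common middleware pattern for bridging that mismatch is a read-through cache: the agent reads from the cache within its latency budget, and slow backend fetches only happen when a cached value exceeds its staleness limit. This is a minimal sketch of the idea; `FreshnessCache` and its interface are illustrative names, not a real library.

```python
import time

class FreshnessCache:
    """Read-through cache that fronts a slow, batch-oriented backend."""

    def __init__(self, max_staleness_s: float):
        self.max_staleness_s = max_staleness_s
        self._store: dict = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        """Return a cached value if fresh enough; otherwise call fetch(key)."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] <= self.max_staleness_s:
            return entry[0], "cached"
        value = fetch(key)  # may be slow: batch export, RFC call, screen scrape
        self._store[key] = (value, now)
        return value, "fetched"

# 5-minute staleness budget; the fetch callable stands in for the legacy adapter.
cache = FreshnessCache(max_staleness_s=300)
value, source = cache.get("order:4401", lambda k: {"status": "shipped"})
```

The design choice worth noting: the staleness budget lives in the middleware, not the agent, so the "can we get fresh data in under 2 seconds?" question gets one enforced answer per source instead of being re-decided in every prompt.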
Schema Inconsistency
When "customer" means five different things across five systems
The Semantic Gap
Even when data is accessible, it's rarely consistent. A "customer" in the CRM is an account with contacts; in the ERP it's a billing entity; in the support system it's a ticket requester; in the marketing platform it's an email address. These aren't just naming differences — they represent fundamentally different data models of the same real-world entity. An AI agent asked "How many customers do we have?" will get a different answer from every system. Without Master Data Management (MDM) — a unified, authoritative record for each entity — the agent is reasoning over contradictory data. Just 50% of organizations have adopted MDM as a foundation for AI; the other half are scaling AI on fragmented data.
One Entity, Five Identities
"How many customers do we have?"

  CRM:       12,400 accounts
  ERP:        8,200 billing entities
  Support:   31,000 unique requesters
  Marketing: 85,000 email addresses
  MDM:        9,800 golden records

Without MDM, agent picks one at random

// 50% of orgs have MDM; 50% don't
// Source: Semarchy, 2026
Key insight: Schema inconsistency doesn't just produce wrong answers — it produces confidently wrong answers that look plausible because the data is real. It's just the wrong data for the question.
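The core MDM mechanic — collapsing per-system "customers" into one golden record — can be sketched in a few lines. This is a toy version: real MDM uses probabilistic matching and per-field survivorship rules, whereas here the match key is a normalized email and the survivorship rule is simply "first non-empty value wins". All field names are illustrative.

```python
def match_key(record: dict) -> str:
    """Normalize the best available identifier (email here) for matching."""
    return record.get("email", "").strip().lower()

def build_golden_records(sources: list[list[dict]]) -> dict[str, dict]:
    """Merge customer rows from several systems into golden records by match key."""
    golden: dict[str, dict] = {}
    for system in sources:
        for rec in system:
            key = match_key(rec)
            if not key:
                continue  # unmatchable rows go to manual review in practice
            merged = golden.setdefault(key, {})
            for fld, value in rec.items():
                merged.setdefault(fld, value)  # first non-empty value wins
    return golden

# Two system-local views of the same person become one golden record.
crm = [{"email": "jo@acme.com", "name": "Jo Smith", "account": "Acme Corp"}]
erp = [{"email": "JO@ACME.COM", "billing_id": "B-882"}]
golden = build_golden_records([crm, erp])
```

Run on the example data, `golden` contains a single record carrying both the CRM's account name and the ERP's billing ID — the unified view the agent should be reasoning over.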
Stale Data: The Freshness Problem
When the agent's answer was correct yesterday but wrong today
Why Freshness Matters
Enterprise data has a shelf life. Inventory levels change hourly. Customer contact information changes monthly. Pricing changes quarterly. Regulatory requirements change annually. An AI agent that retrieves data from a nightly batch sync is working with information that could be 24 hours stale. For some use cases (historical reporting), this is fine. For others (real-time inventory checks, customer-facing pricing, compliance status), stale data produces answers that are technically "from the system" but factually wrong. The challenge is that most enterprise data pipelines were designed for human consumption — daily reports, weekly dashboards — not for AI agents that need sub-second access to current state.
Freshness Requirements
Use case            Freshness needed
Inventory check     Real-time (seconds)
Order status        Near-real-time (minutes)
Customer info       Hourly
Pricing             Daily
Compliance docs     Weekly
Historical reports  Monthly

Typical batch pipeline:
  Nightly sync = 24hr max staleness
  Acceptable for reports
  Unacceptable for agent decisions
Rule of thumb: For every data source your agent uses, define a maximum acceptable staleness. If your pipeline can't meet it, either change the pipeline or change the use case.
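That rule of thumb can be made executable: declare a staleness budget per source and check every retrieval against it, so stale data is flagged rather than silently served. The budget values below are illustrative, not recommendations — set your own SLAs per source.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable staleness per data source (illustrative values,
# loosely following the freshness table above).
MAX_STALENESS = {
    "inventory": timedelta(seconds=30),
    "order_status": timedelta(minutes=5),
    "customer_info": timedelta(hours=1),
    "pricing": timedelta(days=1),
}

def is_fresh(source, last_synced, now=None):
    """True if the source's last sync is within its staleness budget."""
    now = now or datetime.now(timezone.utc)
    return now - last_synced <= MAX_STALENESS[source]

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
is_fresh("pricing", now - timedelta(hours=12), now)     # within the daily budget
is_fresh("inventory", now - timedelta(minutes=5), now)  # over the 30s budget
```

The useful part is the failure mode: when `is_fresh` returns False, the agent can decline to answer or trigger a live fetch, instead of presenting a 24-hour-old inventory count as current fact.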
Data Governance for AI
Who owns the data the agent uses, and who's responsible when it's wrong?
The Governance Gap
44% of enterprises lack a clear data strategy, and 34% face regulatory constraints on data use. When an AI agent pulls data from five systems to answer a question, the governance questions multiply: Who owns each data source? Who approved the agent's access? What happens when the data is wrong? Who is liable for decisions made on bad data? In regulated industries, these aren't philosophical questions — they have legal answers that must be documented before the agent goes live. Data governance for AI requires lineage tracking (where did this data come from?), access controls (who can the agent read from?), quality SLAs (how accurate is this source?), and retention policies (how long can the agent remember this data?).
Governance Checklist
For each data source the agent uses:
  Ownership:   Who maintains this data?
  Lineage:     Where does it originate?
  Quality SLA: What accuracy is guaranteed?
  Access:      What can the agent read/write?
  PII:         Does it contain personal data?
  Retention:   How long can agent cache it?
  Audit:       Is every access logged?

// 44% of enterprises lack a data strategy
// Source: Cloudera + HBR, 2026
Key insight: Data governance for AI isn't a new discipline — it's traditional data governance with higher stakes. The agent amplifies both the value of good data and the damage of bad data.
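The checklist above becomes enforceable once it is machine-readable: attach a policy record to each data source and block deployment when governance gaps exist. This is a hedged sketch; the field names mirror the checklist, and the specific blocking rules are assumptions you would replace with your own compliance requirements.

```python
from dataclasses import dataclass

@dataclass
class DataSourcePolicy:
    """Governance metadata for one data source the agent reads from."""
    name: str
    owner: str            # team accountable for the data
    lineage: str          # upstream origin
    quality_sla: float    # guaranteed accuracy, e.g. 0.99
    agent_access: str     # "read", "read-write", or "none"
    contains_pii: bool
    max_cache_hours: int  # retention limit for the agent
    audit_logged: bool

def go_live_blockers(policy: DataSourcePolicy) -> list[str]:
    """Return the governance gaps that should block agent deployment."""
    blockers = []
    if not policy.owner:
        blockers.append("no named owner")
    if policy.contains_pii and not policy.audit_logged:
        blockers.append("PII source without audit logging")
    if policy.agent_access == "read-write" and policy.quality_sla < 0.95:
        blockers.append("write access on a low-quality source")
    return blockers

crm = DataSourcePolicy("crm", "sales-ops", "salesforce", 0.98,
                       "read", contains_pii=True, max_cache_hours=24,
                       audit_logged=False)
# go_live_blockers(crm) flags the missing audit logging on a PII source
```

In regulated industries, this is the shape of the "legal answers that must be documented before the agent goes live": a record per source, checked automatically, with an audit trail of who signed off.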
The Data Readiness Assessment
A practical framework for evaluating your starting position
Five Dimensions
Before building any AI agent, assess your data across five dimensions. Accessibility: can the agent reach the data via API in under 2 seconds? Quality: what percentage of records are complete, accurate, and current? Consistency: do the same entities have the same IDs and schemas across systems? Governance: are ownership, access controls, and audit trails in place? Volume: is there enough data for the agent to learn patterns, and is it manageable at production scale? Score each dimension 1–5. If any dimension scores below 3, that's your first workstream — not the AI agent. The companies that succeed treat data readiness as Phase 1 of the AI project, not a prerequisite they assume is already met.
Readiness Scorecard
Dimension      Check (score 1-5)  Threshold
Accessibility  API < 2s?          ≥ 3
Quality        Complete?          ≥ 3
Consistency    Unified IDs?       ≥ 3
Governance     Owned?             ≥ 3
Volume         Enough?            ≥ 3

Any dimension < 3  = fix first
All dimensions ≥ 3 = proceed

// Data readiness is Phase 1, not Phase 0
Key insight: The readiness assessment isn't a gate to pass once — it's a living scorecard that should be re-evaluated as you add new data sources, new use cases, and new agent capabilities.
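The five-dimension assessment reduces to a tiny function you can re-run every time a data source or use case changes, which is what keeps the scorecard "living" rather than a one-time gate. A minimal sketch, with illustrative scores:

```python
THRESHOLD = 3  # per the scorecard: any dimension below 3 blocks agent work

def assess(scores: dict[str, int]) -> tuple[bool, list[str]]:
    """scores maps dimension -> 1..5. Returns (ready, dimensions to fix first)."""
    fix_first = [dim for dim, s in scores.items() if s < THRESHOLD]
    return (not fix_first, fix_first)

ready, fix = assess({
    "accessibility": 2,  # no sub-2s API access yet
    "quality": 4,
    "consistency": 3,
    "governance": 3,
    "volume": 5,
})
# ready == False; fix == ["accessibility"] -> that is workstream #1, not the agent
```

Re-running `assess` per data source, rather than once per project, matches the chapter's framing: readiness is Phase 1 of the project, re-evaluated as the project grows.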