
Before AI, Fix the Foundation: Data Quality & Governance

May 7, 2025 by Danny Sandwell


The pressure to deliver successful AI projects is at an all-time high. Yet even the biggest names in tech – from Amazon and Google to Microsoft and OpenAI – have suffered headline-grabbing AI disasters. Despite massive budgets, top-tier talent and cutting-edge infrastructure, they faced delays, escalating costs and reputational damage.

Why? Because AI is only as strong as the data that supports it. And in the AI race, many companies are cutting corners on data architecture.

Unrealistic deadlines and competitive pressures are pushing teams to prioritize speed over structure. But without a solid foundation built on data quality for AI, lineage, integrity and governance, AI models are doomed to fail before they even start. In fact, MIT estimates that 78 percent of businesses face challenges in AI adoption due to weak data foundations.

Fix the foundation first

The reality is that AI doesn’t work on messy, inconsistent or poorly governed data. If you’re a data engineer, architect or analyst, you’ve likely been asked to scale AI on a data foundation built for BI – not advanced machine learning applications. This can cause AI projects to devolve into a constant firefight, as you scramble to fix broken pipelines, debug unreliable outputs and explain erratic model behavior to frustrated executives.

Instead of treating failures as isolated incidents, it’s critical to recognize that poor governance and data quality for AI are systemic problems that require a structured, long-term solution.

Let’s take a deeper look at some high-profile failures that illustrate how improving data quality for AI could have prevented disaster. You’ll see why data quality and governance are no longer just compliance concerns – they’re crucial building blocks for AI success.

Real-world consequences of ignoring governance and data quality for AI

OpenAI’s GPT-5 development delays

In 2024, OpenAI attempted to push the boundaries of generative AI with its GPT-5 project. Despite 18 months of development and significant investment from Microsoft and other partners, the project faced repeated delays and soaring costs. The core issue? A lack of sufficient, high-quality training data. OpenAI attempted to fill the gap with synthetic AI-generated data, but this approach introduced major risks, including inaccuracies, biases and hallucinations. These problems resulted in costly retraining efforts and delayed the commercial rollout of the model.

A better approach would have been to invest in a robust data governance framework from the outset. This would have ensured that high-quality, ethically sourced and diverse training data was available, reducing reliance on synthetic data. Rigorous data validation processes should have been in place to identify and correct quality issues before they impacted model performance. Instead, OpenAI found itself repeatedly training and recalibrating its model, leading to unnecessary expenses and lost market momentum.

Amazon’s AI hiring algorithm

Amazon’s attempt to automate hiring decisions using AI exposed one of the most common pitfalls in machine learning: biased training data. The AI model was trained on a decade’s worth of past hiring data, which reflected a workforce that was overwhelmingly male. As a result, the AI systematically downgraded women’s resumes. The model wasn’t explicitly programmed to discriminate, but it learned to do so from biased historical data.

A robust data modeling process would have prevented this. Instead of training the AI on raw hiring data, Amazon’s team should have carefully curated a diverse, representative dataset to ensure the model wouldn’t reinforce past biases. Data governance measures, such as fairness audits and bias detection, should have been implemented early in the model development process. Without these safeguards, AI will inevitably replicate and amplify human biases, leading to ethical and legal challenges.

Zillow’s AI-powered home-flipping disaster

Zillow’s “iBuyer” program was supposed to revolutionize real estate by using AI to predict home values and automate house flipping. Instead, it became a cautionary tale about the dangers of relying on incomplete and inconsistent data. Zillow’s AI failed to account for rapidly changing market conditions, leading the company to overpay for thousands of homes. The result was a $500 million loss and the abrupt shutdown of the program.

This failure underscores the importance of dynamic data modeling. AI models should not rely solely on historical trends; they should also incorporate real-time market signals and economic indicators. Stronger data governance could have prevented Zillow’s algorithm from making unchecked purchase decisions. A well-structured governance framework would have included regular stress testing, scenario analysis and oversight mechanisms to catch and correct errors before they spiraled out of control.

Why AI models fail before they start

Many AI models don’t fail in production because of bad algorithms; they fail because they were built on bad data.

The most common underlying issues include:

Inconsistent or missing data: AI models require structured, complete and well-documented datasets. When key fields are missing, identifiers don’t match across sources or data formats vary, models produce unreliable results and demand constant manual intervention.

Lack of data lineage: When AI models start behaving unpredictably, you need to trace data sources, transformations and movements. Without clear lineage tracking, debugging models and trusting their outputs become nearly impossible.

Poor governance and schema sprawl: Many organizations assume that simply “connecting to the lake” is enough. That’s not architecture; it’s a blueprint for rework and reengineering. Without structured data governance, uncontrolled growth of redundant, conflicting or untracked datasets will erode performance and reliability.

Data drift: When training data for AI models shifts, whether it’s due to evolving trends, incomplete updates or hidden biases, models can make flawed decisions. Amazon’s hiring algorithm is a prime example of how historical biases baked into training data can lead to skewed outcomes, reinforcing discrimination rather than eliminating it. Toxic data, inputs that subtly poison models over time, also distort outputs in ways that aren’t always obvious until real-world failures emerge. Without proactive monitoring and course correction, once-reliable models become unpredictable liabilities.
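To make the drift problem concrete, a simple guard is to compare the distribution a model was trained on against what it sees in production. The sketch below computes a population stability index for a single hypothetical feature; the feature, the simulated values and the 0.2 alert threshold are illustrative assumptions, not details from any of the cases above.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of the same feature; a higher PSI means more drift."""
    # Bin edges come from the training-time distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip the new sample into the training range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
training_ages = rng.normal(38, 8, 10_000)   # feature distribution at training time
recent_ages = rng.normal(45, 9, 2_000)      # the population has since shifted older

psi = population_stability_index(training_ages, recent_ages)
if psi > 0.2:  # a common rule-of-thumb threshold for significant drift
    print(f"Data drift detected (PSI = {psi:.2f}); investigate before retraining or scoring")
```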

Companies that fail to address these issues will continue to see their AI initiatives stall or collapse.

Fixing the foundation: practical steps for AI-ready data

Validate your data model early

Before launching an AI initiative, you should rigorously assess whether your existing data architecture is suitable for machine learning. Many legacy BI data models are not designed to support AI workloads, which require deeper historical tracking, flexible schema evolution and complex relationship mapping. You should conduct thorough data audits to identify inconsistencies, gaps and format mismatches before these issues impact model performance.

To ensure your data model is AI-ready:

  • Use validation checks to detect schema inconsistencies and missing relationships early
  • Conduct data profiling to uncover anomalies and standardize formats before ingestion
  • Apply business rules and constraints to prevent structural issues in AI pipelines
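These checks don’t require heavyweight tooling to get started. The sketch below shows how they might be codified as pre-ingestion assertions with pandas over two hypothetical tables, customers and orders; the column names and rules are illustrative assumptions, and a dedicated validation or profiling framework would typically enforce them at scale.

```python
import pandas as pd

def validate_training_inputs(customers: pd.DataFrame, orders: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems; an empty list means the checks passed."""
    problems = []

    # Schema check: required fields must exist before anything else runs
    required = {"customer_id", "signup_date", "region"}
    missing_cols = required - set(customers.columns)
    if missing_cols:
        problems.append(f"customers is missing columns: {sorted(missing_cols)}")
        return problems  # later checks assume these columns exist

    # Completeness check: key identifiers must not be null or duplicated
    if customers["customer_id"].isna().any():
        problems.append("customers has null customer_id values")
    if customers["customer_id"].duplicated().any():
        problems.append("customers has duplicate customer_id values")

    # Referential integrity: every order must point at a known customer
    orphans = ~orders["customer_id"].isin(customers["customer_id"])
    if orphans.any():
        problems.append(f"{int(orphans.sum())} orders reference unknown customers")

    # Format check: dates must parse consistently
    if pd.to_datetime(customers["signup_date"], errors="coerce").isna().any():
        problems.append("customers has unparseable signup_date values")

    return problems
```

Run as a gate in the ingestion pipeline, a non-empty result blocks the data from reaching training rather than letting the model absorb the defects silently.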

Build reusable, well-governed data assets

A recent ESG survey revealed that 84 percent of organizations are now delivering prescriptive (well-curated and governed) data products, recognizing that AI success depends on reliable, reusable data rather than fragmented, one-off pipelines.

This approach addresses a major pitfall in AI development: treating data pipelines as temporary solutions for individual models. Instead, organizations should build certified, well-documented datasets that serve as foundational data products – assets designed for reuse across multiple AI projects. These products should have clearly defined ownership, standardized metadata and controlled access policies to ensure consistency and reliability. Versioning should be enforced to prevent silent corruption of model inputs due to schema changes.

To support AI-driven data reuse:

  • Maintain a central metadata repository to document lineage, ownership and usage history
  • Establish dataset versioning and enforce schema governance to maintain consistency
  • Define role-based access policies to secure sensitive AI training data
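To illustrate how versioning and schema governance can catch silent changes, the sketch below models a minimal dataset registry that records an owner, a version and a schema fingerprint for each certified data product, and refuses to serve a dataset whose schema no longer matches what was certified. It is an in-memory stand-in for a real metadata catalog; the dataset name, fields and versions are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    version: str
    owner: str
    schema_fingerprint: str   # hash of column names and types

@dataclass
class DatasetRegistry:
    """Minimal in-memory stand-in for a metadata catalog."""
    entries: dict[str, list[DatasetVersion]] = field(default_factory=dict)

    @staticmethod
    def fingerprint(schema: dict[str, str]) -> str:
        # Stable hash over sorted (column, dtype) pairs
        canonical = ",".join(f"{col}:{dtype}" for col, dtype in sorted(schema.items()))
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]

    def publish(self, name: str, version: str, owner: str, schema: dict[str, str]) -> None:
        self.entries.setdefault(name, []).append(
            DatasetVersion(version, owner, self.fingerprint(schema))
        )

    def verify(self, name: str, version: str, schema: dict[str, str]) -> None:
        """Raise if the schema a consumer sees differs from the certified version."""
        registered = next(v for v in self.entries[name] if v.version == version)
        if registered.schema_fingerprint != self.fingerprint(schema):
            raise ValueError(f"{name}@{version}: schema drifted from the certified version")

# Example: certify a feature table, then verify it before training
registry = DatasetRegistry()
registry.publish("customer_features", "1.2.0", owner="data-platform",
                 schema={"customer_id": "int64", "tenure_days": "int64", "region": "string"})
registry.verify("customer_features", "1.2.0",
                schema={"customer_id": "int64", "tenure_days": "int64", "region": "string"})
```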

Implement continuous data observability

Data quality for AI is not a one-time concern – it requires ongoing monitoring. AI teams should invest in observability tools that provide real-time insights into data drift, schema changes and quality anomalies. Automated alerts can flag potential issues before they degrade model performance, enabling proactive intervention, rather than reactive troubleshooting.

To maintain data quality for AI, take these proactive steps:

  • Set up real-time monitoring to detect schema drift and missing records
  • Apply anomaly detection techniques to identify unexpected patterns in training data
  • Use alerting systems and remediation workflows to prevent data issues from compounding
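As a rough sketch of what continuous checks can look like inside a pipeline, the code below compares each incoming batch against a baseline captured at training time and logs alerts on schema drift, rising null rates or unexpectedly small batches. The baseline values, column names and logging-based alert hook are placeholders; a production system would wire these checks into real monitoring and remediation workflows.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("data_observability")

# Baseline captured when the model was trained (illustrative values)
BASELINE = {
    "columns": {"customer_id": "int64", "order_total": "float64", "region": "object"},
    "max_null_rate": 0.02,      # at most 2% nulls per column
    "min_row_count": 10_000,    # expected batch size floor
}

def check_batch(batch: pd.DataFrame) -> list[str]:
    """Compare an incoming batch against the baseline and return alert messages."""
    alerts = []

    # Schema drift: added, removed or retyped columns
    current = {col: str(dtype) for col, dtype in batch.dtypes.items()}
    if current != BASELINE["columns"]:
        alerts.append(f"schema drift: expected {BASELINE['columns']}, got {current}")

    # Completeness: null rates and unexpectedly small batches
    for col in batch.columns:
        null_rate = batch[col].isna().mean()
        if null_rate > BASELINE["max_null_rate"]:
            alerts.append(f"{col}: null rate {null_rate:.1%} exceeds threshold")
    if len(batch) < BASELINE["min_row_count"]:
        alerts.append(f"batch has only {len(batch)} rows, below expected minimum")

    for message in alerts:
        log.warning(message)   # in production this would page a team or open a ticket
    return alerts
```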

An AI-ready data platform can help you implement and automate best practices, ensuring your AI systems are built on strong, scalable and governed data foundations. By integrating automated quality checks, lineage tracking, observability and governance enforcement, you can focus on innovation instead of rework.

Fix the data – then scale AI

Organizations often treat AI as a magic bullet, expecting it to deliver insights and automation without first addressing fundamental data issues. This mindset leads to expensive failures and stalled projects. Instead, initiatives should begin with a disciplined approach to governance and data quality for AI.

By investing in structured, well-governed data foundations, you’ll prevent costly AI failures and build reliable machine learning systems that drive real business value. Cutting corners will only backfire. Improving data quality and governance now will save you time and pay off in the long run. Fix the foundation, and everything else will follow.


Danny Sandwell