Why data observability is essential to AI governance
When it comes to using AI and machine learning across your organization, there are many good reasons to provide your data and analytics community with an intelligent data foundation. For instance, Large Language Models (LLMs) are known to perform better when data is structured. And because data is fluid and constantly changing, it’s easy for bias, bad data and sensitive information to creep into your AI data pipeline. But how, specifically, does delivering an intelligent data foundation improve the outcomes of your AI models?
And do you have the transparency and data observability built into your data strategy to adequately support the AI teams building them? Will the new creative, diverse and scalable data pipelines you are building also incorporate the AI governance guardrails needed to manage and limit your organizational risk?
We will tackle all these burning questions and more in this article.
Data observability supports our ability to develop and keep data AI-ready
Whether you’re scaling up an AI practice within your organization or just getting started with your data and AI strategy, monitoring and observing the data pipelines that will feed your AI models should be among your top priorities. Why?
One reason is to counteract our inherent bias as we prepare the data that feeds AI models. What we may consider “not normal” behavior in the data could very well be business as usual in the eyes of others. Observing data patterns objectively upfront helps us reduce bias as we gather and assemble the data that will be used to train AI models.
Second, organizational data is fluid, changing on a regular basis. And if it isn’t changing, it’s likely not being used within our organizations, so why would we feed stagnant data into AI? The key is understanding not IF, but HOW, our data fluctuates, and data observability can help us do just that. By establishing data integrity thresholds and drift limits for data behavior upfront, we’re in a stronger position to act immediately when monitoring shows the AI data pipeline drifting beyond acceptable boundaries.
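As a minimal sketch of that idea, assuming a simple percentage-change rule and hypothetical numbers (real observability platforms offer far richer drift metrics), a drift-limit check might look like this in Python:

```python
import statistics

def check_drift(baseline: list[float], current: list[float],
                max_drift_pct: float = 10.0) -> tuple[float, bool]:
    """Compare a metric's current window to its baseline and flag
    movement beyond an agreed drift limit (hypothetical threshold)."""
    baseline_mean = statistics.mean(baseline)
    current_mean = statistics.mean(current)
    drift_pct = abs(current_mean - baseline_mean) / abs(baseline_mean) * 100
    return drift_pct, drift_pct > max_drift_pct

# Example: weekly average order values, with the latest window spiking.
drift, breached = check_drift([102.0, 98.5, 101.2], [131.7, 128.4, 135.0])
if breached:
    print(f"Drift of {drift:.1f}% exceeds the agreed limit - investigate.")
```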
Every AI model employed will eventually experience a hallucination. Data observability provides the ability to immediately recognize, and be alerted to, the emergence of these hallucinations, and to accept or reject the changes iteratively, thereby training and validating the data. Consider an example: your AI model monitors sales data, and the data spikes for one region of the country due to a world event. You want to be able to reject this data so it doesn’t disturb the established pattern, or to accept it as a potential insight. Monitoring data drift keeps the AI model on track with its original intent. If you are not observing and reacting to the data, the model will accept every variant, and it may end up among the more than 50% of models that, according to Gartner, never make it to production because there are no clear insights and the results have nothing to do with the model’s original intent.
Data observability combines with metadata to deliver business truth
IT has long monitored databases for performance, reliability and data integrity. Alongside these efforts, we’ve also recognized that metadata, such as the schema of a database, must be continually cared for, given its centrality as the source of data truth – regardless of how our organization’s techies or business users view the data. Metadata is the basis of trust for data forensics as we answer the question of fact or fiction about the data we see.
Because AI is made up of more data than code, it is now more essential than ever to combine data with metadata in near real time. A data catalog providing automated data profiling does just this, and when tied in with data lineage, it lets your organization trace metadata’s pathway back to every source feeding your AI model. Data quality capabilities that include data observability then monitor these key pipelines for data quality and data drift.
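At its simplest, automated profiling amounts to computing per-column statistics and storing them as metadata. The Python sketch below illustrates the idea with pandas and a hypothetical sales_extract.csv; a data catalog performs this automatically and at scale:

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Gather simple statistics a catalog could store as column-level metadata."""
    stats = {
        "dtype": str(series.dtype),
        "row_count": int(series.size),
        "null_rate": round(float(series.isna().mean()), 4),
        "distinct_count": int(series.nunique(dropna=True)),
    }
    if pd.api.types.is_numeric_dtype(series):
        stats["min"], stats["max"] = float(series.min()), float(series.max())
    return stats

# Profile every column of a hypothetical sales extract; the resulting
# profile would live in the catalog alongside the dataset's lineage.
sales = pd.read_csv("sales_extract.csv")  # placeholder file name
profile = {col: profile_column(sales[col]) for col in sales.columns}
```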
Here’s a practical business example: AI sales or revenue models that give your leadership team the basis for how you plan to grow the business. Monitoring sales revenue data surfaces regional patterns and anomalies globally. As you test your sample data, you would set thresholds within your data observability platform so it alerts you when sales pricing spikes or drops beyond the boundaries set for this data. The platform then alerts you as data drift exceeds these boundaries, and a human decides whether to accept or reject the change into the model.
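In code, that human-in-the-loop step might look something like the minimal sketch below (hypothetical metric names and bounds, with a console prompt standing in for a real alerting workflow):

```python
def review_drift_alert(metric: str, observed: float,
                       lower: float, upper: float) -> str:
    """Alert when a monitored value leaves its agreed band, and let a
    human decide whether the shift is accepted into the model's data."""
    if lower <= observed <= upper:
        return "within_bounds"
    print(f"ALERT: {metric}={observed} is outside [{lower}, {upper}]")
    answer = input("Accept this shift into the model? (y/n) ")
    return "accepted" if answer.strip().lower() == "y" else "rejected"

# Example: a regional average sale price spikes past the set ceiling.
decision = review_drift_alert("avg_sale_price_emea", 182.40, 90.0, 150.0)
```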
If you’re not using observability capabilities to monitor the data, the model would take every shift into consideration, and its output could become meaningless, with no clear pattern to learn from and use for your growth plan. Worse yet would be using the model’s output to conclude something out of context that was never the model’s intent. For instance, during COVID, many pricing patterns were disrupted; while we may want to study that specific impact, we may also want to exclude this data now, to plan for growth as the world has returned to a more normal state. Deriving data patterns can take quite some time, so instead of starting over, it is more efficient to train the model to stay within acceptable boundaries and reject the irregular anomalies.
Data observability delivers transparency to keep AI models on target
Just as greater metadata transparency within data intelligence initiatives has resolved organizational conflicts, so too does transparency of data anomalies. Consider something like price: what’s considered high or low depends entirely on the person, the region, the season and the social circumstances. But if we set corporate thresholds and assign responsibility for responding to them to a job function such as a quality or business analyst, we gain visibility into discrepancies and can keep an AI model on track with its original intent.
If we’re not observing and responding to data fluctuations, we are missing the overall model insights. Other alerts that can be applied to the data pipeline fire when critical data elements, PII, PHI or regulatory data seep into the AI model. A data catalog can help you automate the tagging of these elements and tie them into your data observability efforts, so you can trace and track them through your AI data pipelines. In other words, you are automating the monitoring of the inputs and outputs to the AI model.
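A drastically simplified sketch of such an alert follows. The patterns and record shape are illustrative only; in practice the catalog’s classifications, not hand-written regexes, drive the detection:

```python
import re

# Illustrative patterns for two common sensitive-data elements; a real
# catalog supplies these classifications rather than hand-written regexes.
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_for_pii(records: list[dict]) -> list[tuple[int, str, str]]:
    """Flag any record field whose value matches a sensitive-data
    pattern before it flows into the AI model's training pipeline."""
    findings = []
    for i, record in enumerate(records):
        for field, value in record.items():
            if not isinstance(value, str):
                continue
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    findings.append((i, field, label))
    return findings
```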
Data observability interwoven with data cataloging fuels AI governance
This is where the combination of data catalog and data observability capabilities becomes even more impactful. When you understand the data lineage, or the data pipeline, from an element perspective, you can see where good, bad or otherwise non-profiled data is impacting your AI model. Within the catalog, you can visualize this lineage for data quality results and sensitive data inputs. And let’s not forget about controls: when critical data elements are mapped to your regulatory requirements, your analysts have the business rules for how your data should be used. Perhaps in the sales data example you are omitting a particular gender, or you have included an age range that doesn’t apply to the product you are selling. Can you count on one person in your IT organization to notice and react to that?
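To illustrate, the sketch below encodes two such controls as simple checks. The rule names, fields and ranges are hypothetical, standing in for the business rules a catalog would attach to critical data elements:

```python
# Hypothetical business rules for the sales-data example; the field
# names and the 18-65 range are illustrative, not a real policy.
RULES = {
    "age_matches_product_range": lambda r: 18 <= r.get("customer_age", 0) <= 65,
    "gender_not_omitted": lambda r: r.get("gender") not in (None, ""),
}

def audit(records: list[dict]) -> list[tuple[int, str]]:
    """Return (record index, rule name) pairs for every violation, so
    analysts review them instead of one person having to notice."""
    return [
        (i, name)
        for i, record in enumerate(records)
        for name, rule in RULES.items()
        if not rule(record)
    ]
```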
Data quality and observability are to your data catalog what peanut butter is to jelly: with only one applied to the data pipelines supporting AI, you are missing the whole meal. It is difficult to find and maintain trusted data, let alone the sheer volume of data needed for your AI models. Data observability can be a game changer, helping you apply automation to develop AI-ready data, ensure data reliability and deliver the AI data pipelines and transparency needed to effectively scale your AI program.
Tackle AI data readiness and governance with erwin.
Ensure reliable data and be ready for regulatory compliance.
Susan Laine
Chief Field Technologist
Quest Software
Sue is the chief field technologist and DI thought leader for erwin Data Intelligence by Quest. A thought leader in the application of technology and business process to solve real business problems, Sue has over 25 years of data management experience on the buy side as a customer and the sell side as a vendor, including implementation, data leadership, and enablement. She has worked to structure and drive enterprise data intelligence programs to deliver value. Additionally, Sue has an extensive worldwide network of data leaders that she continues to draw upon for best practices, value use cases and product innovation. She has supported a wide range of clients, including financial, insurance, healthcare, energy, manufacturing, and e-Commerce with a general need to provide data-driven business practices. Sue is responsible for launching and guiding Quest’s market-leading data intelligence and modeling solutions to deliver fresh, modern offerings with extraordinary value for today’s challenging business demands.