When working with a substantial volume of data, improving data quality should be one of your top priorities. You can only make sound judgments and conduct efficient, precise analysis when the underlying data is high quality.
Defining Data Quality
Data quality is often measured by evaluating common metrics such as fill rate, freshness, and uniqueness. However, these metrics only capture statistical signals that help us generalize about whether systems are capturing data correctly; they do not measure how well the information solves specific problems.
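The common metrics above are straightforward to compute. The sketch below, using hypothetical records and field names, shows one minimal way to calculate fill rate, uniqueness, and freshness for an email field:

```python
from datetime import datetime, timedelta

# Hypothetical sample records: an email field plus a captured-at timestamp.
records = [
    {"email": "a@example.com", "captured_at": datetime(2022, 6, 1)},
    {"email": "b@example.com", "captured_at": datetime(2022, 6, 2)},
    {"email": None,            "captured_at": datetime(2022, 5, 1)},
    {"email": "a@example.com", "captured_at": datetime(2022, 6, 3)},
]

def fill_rate(rows, field):
    """Share of rows where the field is populated."""
    return sum(1 for r in rows if r[field] is not None) / len(rows)

def uniqueness(rows, field):
    """Share of populated values that are distinct."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values)

def freshness(rows, now, max_age):
    """Share of rows captured within the allowed age window."""
    return sum(1 for r in rows if now - r["captured_at"] <= max_age) / len(rows)

now = datetime(2022, 6, 4)
print(fill_rate(records, "email"))                    # 3 of 4 populated: 0.75
print(uniqueness(records, "email"))                   # 2 distinct of 3 populated
print(freshness(records, now, timedelta(days=30)))    # 3 of 4 within 30 days
```

Note that all three scores are purely statistical: they say nothing about whether the populated, distinct, recent email addresses are actually useful to a marketing or fraud team, which is exactly the gap discussed above.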
Many of us have already developed standardized statistical metrics that capture the signal of data as it moves from one system to another, but this information does not tell us whether our data is valuable to downstream stakeholders. When you begin to consider what quality means for each particular customer, you realize there are many different ways to measure it.
In fact, data quality depends on what you want to do with the data. Take email addresses as an example: for a marketing campaign, quality might be measured by engagement with marketing material, while for fraud detection, quality might mean confidence that the customer is authentic and using the platform as intended.
In order to tell a complete story about data quality and the value that information creates, we need to start by having a better understanding of our downstream stakeholders and their intended use cases. If we are able to create feedback loops that send information about recent marketing campaigns or fraud detection systems, we can start to have much more robust conversations about how data creates value for the business.
The Importance of Real-Time
When we think about data quality, many of us have a preconceived understanding of how information is evaluated and processed as it flows to downstream stakeholders. Traditionally, the data quality process starts by evaluating a snapshot of data through a data quality tool once, making minor changes, and then loading the information into a master data management tool. After this process has been completed once, typically data quality is never evaluated again.
While this process may have worked when information was collected through only a small number of sources, today's complexity and expanding volume of data make it ineffective. The average enterprise now has over 500 connected data sources, with larger enterprises having upwards of 1,000. As the volume and complexity of information continue to grow, data stewards' domain expertise covers an ever-smaller share of it, and the need for automation and monitoring has become increasingly important.
Today, we are witnessing a new set of tools emerge that proactively monitor your data in real time, provide recommendations, and immediately alert the necessary stakeholders to concerns that need attention. These tools enable data stewards to manage an increasing number of sources while focusing only on identified problems, rather than keeping track of every data source as it changes and evolves over time. This automation solves the "needle in a haystack" problem, as stewards can now leverage intelligence-based insights to make faster and more informed decisions.
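The core of this kind of proactive monitoring can be illustrated with a small sketch. The threshold, metric history, and function names below are assumptions for illustration, not any particular vendor's implementation: a source is flagged for the data steward only when its latest metric reading drops sharply below its own recent baseline, so healthy sources generate no work at all.

```python
def should_alert(history, latest, drop_tolerance=0.05):
    """Alert when the latest metric reading falls more than
    drop_tolerance below the average of recent readings."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return latest < baseline - drop_tolerance

# Hypothetical daily fill-rate readings for one source.
daily_fill_rates = [0.97, 0.96, 0.98, 0.97]

print(should_alert(daily_fill_rates, 0.80))  # sharp drop: alert
print(should_alert(daily_fill_rates, 0.96))  # within tolerance: no alert
```

A baseline-relative rule like this scales to hundreds of sources because it needs no per-source manual configuration; each source is judged against its own history.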
Attribution of Data Quality Issues
Data quality tools should also help drive clarity when attributing ownership and defining remediation plans for upstream data sources. In traditional systems, this process is done by a data steward manually contacting an upstream stakeholder. To communicate an issue, a data steward would need to determine when the issue was first identified, what changed, where the change came from, and who needs to be contacted to make the fix. This type of analysis can take weeks and expose downstream stakeholders to bad data for extended periods of time. At scale it is unmanageable, yet many data stewards today still embark on this manual process in an attempt to extinguish an ever-growing number of fires.
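The first two questions in that analysis, when the issue appeared and who owns the source, are the easiest to automate. The sketch below is a simplified illustration with assumed source names, thresholds, and an ownership map; it scans per-source metric history for the first reading below a threshold so the right upstream owner can be contacted on day one rather than weeks later:

```python
# Assumed ownership map: source name -> upstream contact.
owners = {"crm_feed": "crm-team@example.com"}

def first_regression(history, threshold=0.9):
    """Return the index of the first reading below threshold, or None."""
    for day, value in enumerate(history):
        if value < threshold:
            return day
    return None

# Hypothetical daily quality scores per source.
metrics = {"crm_feed": [0.97, 0.96, 0.70, 0.71]}

for source, history in metrics.items():
    day = first_regression(history)
    if day is not None:
        print(f"{source}: regression first seen on day {day}, "
              f"notify {owners[source]}")
```

Pinpointing the first bad reading also narrows the "what changed" question, since only upstream changes deployed around that date need to be investigated.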
Without automation, this amount of manual communication, sharing, and training is unmanageable. The modern data quality tool should provide proactive ways to identify and attribute upstream problems with very limited manual effort from the data steward. Automated observability tools are growing in popularity across the entire data pipeline, and they are something everyone should consider when evaluating new technologies.
Data stewards are the unsung heroes of any business. Their work is often leveraged by most of the organization, yet most business stakeholders and consumers are unaware of the significant amount of time and energy put into making things “just work” behind the scenes. In most enterprises, business stakeholders only hear about the teams managing these systems if there are issues with the information. This is because there isn’t a great way to tell a story of the amount of work data stewards invest to improve the quality of information.
The modern data management suite should include the ability to track the quality of information over time, enabling the business stakeholder and data steward to communicate on a single platform that helps everyone align on a narrative around data quality and the effort invested in tracking and monetizing data. Furthermore, if we are able to track and monitor the business use cases created by downstream stakeholders, we will get much closer to measuring the value of high quality data across the organization.
© 2022 Reltio. All Rights Reserved