Automated Data Validation Checks to Ensure Data Accuracy

Organizations can get ahead of data issues they’ve observed in the past by proactively setting up automated data validation notifications.


Val Driveno, an analyst at a large retail company, is just getting back to the office after lunch. She was planning to start a big marketing analysis, but notices she’s received messages from a couple of executives who use a sales dashboard she built: “Why are last month’s sales suddenly showing a 25% decline? Yesterday, it said we were up 5%.” Val immediately checks the data tables the report is built on and confirms that they also show a 25% decline. She reaches out to the IT team that manages the data table, and they identify that some records were accidentally deleted in the source system when the database was refreshed that morning. Eventually, the data table is reloaded, the reporting is accurate again, and Val lets her stakeholders know.

Problem solved and now behind us, right? Perhaps for now. But just one week later, Val is pinged by the same stakeholders about the same dashboard being off again. Even though Val has no control over the accuracy of the data table her report uses, her stakeholders are starting to doubt the reliability of her dashboard, question her sense of ownership and attention to detail, and associate the problems they’re noticing with her performance.

For analysts today, this situation is all too real and all too common. In these cases, analysts may be tempted to play the “blame game” and simply remind stakeholders that they’re not responsible for the accuracy of the database. But right or wrong, the reality is that, to dashboard viewers, the dashboard creator is often the face of not only the visualization of the data, but the data itself. So when the data is off, many stakeholders will naturally associate the problem with the analyst’s performance.

Automated Data Validation Checks

This is just one reason why, whether it’s done by the IT team managing the database or the analysts using the data, it’s important for someone to implement automated data validation checks on the data. Automated data validation checks are simply scheduled processes that regularly scan data tables for unusual or suspicious situations and, if at least one is detected, automatically trigger an e-mail or text notification to a responsible party. It’s the “I’ve fallen and I can’t get up” of databases: if something happens to the data, rather than no one knowing until someone happens to notice, an automated alert is sent. Simple concept, right? Still, creating automated validation checks can be a daunting task, since there are so many possibilities for what could go wrong.

A great place to start is creating automated checks for data issues identified in the past. As the old saying goes, “Fool me once, shame on you; fool me twice, shame on me.” Creating an automated validation check should become one of the steps in the standard protocol for when a data issue is identified:

  1. Identify the scope of the issue
  2. Notify the responsible party
  3. Communicate the issue to impacted users
  4. Identify the root cause
  5. Find a solution
  6. Implement the fix
  7. Validate that dependent processes are accurate again
  8. Communicate the issue has been resolved
  9. Create an automated data validation check

Note that, to communicate the issue to impacted users quickly and efficiently, it’s a good idea to proactively set up and maintain distribution lists for the users of each platform that could be impacted by data issues. For instance, you might maintain a distribution list of everyone who has access to a certain reporting platform for times like these. It also helps to put a plan in place for sending that notification, including a notification template.


It can be tempting to skip the last step above and just hope that the data issue doesn’t occur again. Unfortunately, the fact that the issue happened once is likely a sign of a vulnerability in the data process that could easily happen again, even if it seems like a fix has been implemented.

Let’s go back to Val’s scenario above. How does the situation change if, after the first data issue is discovered, Val creates an automated validation check as part of the process for handling it? One week later, when the issue occurs again, Val gets notified early in the morning, right after the data refreshes. She may be able to reach out to the IT team in time for them to reload the data before it impacts her dashboard at all. Perhaps it leads to just a slight delay in reporting availability, which is communicated to users. Even if the dashboard does end up being inaccurate for the entire day, Val can at least proactively notify dashboard users in the morning, as soon as she receives the automated alert that the data is off. Now the executives know not to look at the report that day. They don’t mistakenly use it to make a decision; they maintain their confidence in the reporting, knowing that they’ll be notified if there are issues; and they commend Val for her attention to detail and ownership of the situation.

Hopefully, you can now see the importance of putting automated validation checks in place after data issues occur. But automated validation checks can also be set up proactively. While there are numerous things that could go wrong with a data table, make the most of your time by focusing mainly on creating automated validation checks around…

  • The most critical data tables, where inaccuracy carries the highest cost (because of how widely or how importantly they’re used)
  • The aspects of those tables that you foresee as most likely to have an issue (perhaps because a manual process is involved)

Keep in mind that automated validation checks can go beyond flagging data issues that are certainly wrong; they can also be set up to flag data situations that are merely suspicious. In some cases, a suspicious situation will trigger an e-mail even though the surprising data really was representative of reality, but the team will at least be notified and can investigate.

Making sure you focus on building the right automated validation checks is one aspect of making the process efficient, but another important aspect is how you build the checks. We recommend using SQL embedded in a Python script scheduled to run each morning, right after the database has been refreshed with the latest data. In some cases, a check might involve saving an archive of information from a data table today so that tomorrow’s run can assess how much the table has changed. The automated checks could follow a format similar to below:

  • Step 1: A SQL query or Python code that creates a pandas DataFrame called df, filtered down to rows that would be considered inaccurate, exceptions, or suspicious
  • Step 2: An if statement in Python that checks whether df has at least one row and, if so, sends an e-mail to the people who would look into the data issue
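To make this concrete, here’s a minimal sketch of that two-step pattern. The connection string, the table and column names (daily_sales, amount), the e-mail addresses, and the SMTP server are all placeholders to swap for your own, and the send_alert_email() helper is just one hypothetical way to send the notification:

```python
import smtplib
from email.message import EmailMessage

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- replace with your own database.
engine = create_engine("postgresql://user:password@host/db")


def send_alert_email(subject: str, body: str) -> None:
    """Notify the people who would look into the data issue. All addresses
    and the SMTP host below are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)


# Step 1: the SQL query does the filtering, returning only rows that
# would be considered inaccurate, exceptions, or suspicious -- here,
# sales records with missing or negative amounts.
df = pd.read_sql(
    "SELECT sale_id, sale_date, amount FROM daily_sales "
    "WHERE amount IS NULL OR amount < 0",
    engine,
)

# Step 2: if at least one row came back, trigger the notification.
if len(df) > 0:
    send_alert_email(
        f"Data validation alert: {len(df)} suspicious rows in daily_sales",
        df.head(20).to_string(),  # include a sample of the offending rows
    )
```

The key design choice is that the query itself does the filtering: if the table is healthy, df comes back empty and no e-mail is sent.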

As with any scheduled Python job, we’d also recommend putting try/except statements in place to trigger an automated notification e-mail if the data table unexpectedly changes in a way that would cause the validation check’s own code to throw an error.
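For example, the same check might be wrapped like the sketch below, reusing the hypothetical send_alert_email() helper and engine from the earlier example, so that a crash in the check itself (say, a renamed column) also produces a notification rather than failing silently:

```python
import traceback

# Reuses pd, engine, and send_alert_email() from the sketch above.
try:
    df = pd.read_sql(
        "SELECT sale_id FROM daily_sales WHERE amount IS NULL OR amount < 0",
        engine,
    )
    if len(df) > 0:
        send_alert_email("Suspicious rows in daily_sales", df.to_string())
except Exception:
    # The validation code itself failed -- e.g., the schema changed.
    # Notify the team rather than fail silently.
    send_alert_email("Validation check crashed", traceback.format_exc())
```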

To avoid having to schedule a separate Python job for each check, and to keep all checks in the same place, we recommend adding future automated validation checks to the same Python script so that they run in the same scheduled process, one after another. Our Efficient ETL process can make setting up this scheduled Python job, and modifying it to add more checks in the future, especially quick and easy.
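One possible way to structure that single job, again building on the placeholder names from the sketches above, is to define each check as its own function and loop through them so that one failing check doesn’t stop the rest. The second check below also illustrates the archiving idea mentioned earlier, comparing today’s row count to a count saved by yesterday’s run:

```python
import traceback

import pandas as pd

# Reuses engine and send_alert_email() from the earlier sketch.


def check_negative_amounts() -> None:
    # Hypothetical "certain" issue: sales amounts should never be negative.
    df = pd.read_sql("SELECT sale_id FROM daily_sales WHERE amount < 0", engine)
    if len(df) > 0:
        send_alert_email("Negative amounts in daily_sales", df.to_string())


def check_row_count_drop() -> None:
    # Hypothetical "suspicious" issue: a big overnight drop in row count,
    # like the deleted records in Val's scenario. Assumes yesterday's run
    # already archived its count to daily_sales_rowcount.csv.
    today = pd.read_sql("SELECT COUNT(*) AS n FROM daily_sales", engine)["n"].iloc[0]
    yesterday = pd.read_csv("daily_sales_rowcount.csv")["n"].iloc[0]
    if today < 0.9 * yesterday:
        send_alert_email(
            "daily_sales row count dropped",
            f"Yesterday: {yesterday:,} rows; today: {today:,} rows",
        )
    # Archive today's count for tomorrow's comparison.
    pd.DataFrame({"n": [today]}).to_csv("daily_sales_rowcount.csv", index=False)


# Run every check in one scheduled job; a crash in any one check is
# reported by e-mail and the remaining checks still run.
for check in [check_negative_amounts, check_row_count_drop]:
    try:
        check()
    except Exception:
        send_alert_email(f"Check {check.__name__} crashed", traceback.format_exc())
```

Adding a future check then just means writing one more function and appending it to the list.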

With automated validation checks, the IT team, analysts, and dashboard stakeholders can rest easy knowing that, should any data issues occur, they’ll be proactively notified and can avoid using the data until the issue is resolved. This builds the kind of trust in the data, and in the process around it, that executives need in order to feel comfortable making data-driven decisions. That’s why, to us, creating automated validation checks is a value-driven activity.

At Value Driven Analytics, we can help you set up automated validation checks around data issues you’ve observed in the past or around issues you anticipate in the future. We focus on creating a robust process that makes it easy to add in future checks if needed. We apply more than a dozen years of analytics experience and innovation to efficiently create your solution. Rather than sell your organization a single analytics solution for a big-ticket price, our goal is to use this project as an opportunity to illustrate the rigor, affordability, and speed of our analytics solutions in order to become your #1 choice for future analytics projects.

Learn more about how automated validation checks can help protect your organization against the risk of making decisions based on bad data.


If you have additional questions about analytics consulting, we’d love to help answer them and brainstorm analytics projects that could truly drive value for your organization.
