The (Useful) Insanity of Data Tests
All Models Are Wrong. Some Models Match the Upstream Engineering Team's.
Anyone who’s worked in software will tell you that tests are a crucial element of a reliable work product. Anyone who’s worked in data will tell you it’s not quite that simple. Tests in data-land are uniquely painful for three reasons: (1) the scope of data you’re responsible for is absurdly broad, (2) you rarely control the abstractions you work with, and (3) the exception handling options for unreliable data are all bad. Unfortunately, not testing isn’t an option, because the core product of a data team is accurate and trusted insights.
The first thing to understand about trying to enforce data quality as a data team is that it is insane. Data warehouses draw data from a bunch of different sources in a company: production systems (a relational database if you’re lucky), a CRM (e.g., Salesforce), a customer support ticket system (e.g., Zendesk), third-party data vendors, maybe Grant in Finance’s quarterly forecast Excel workbook, etc. This is part of what makes them so useful – you can pull together insights across multiple domains, and discover that most of your support volume is coming from your least lucrative customers, or that some customer acquisition channels produce customers that are much less likely to churn than others. At the same time, this means that a data team’s domain – the area you’re responsible for understanding and validating data for – is effectively all activities that produce data at a company.1 Many of these systems have multiple people responsible just for managing them. For your production system(s), there may be entire teams of engineers devoted to the code that produces data in just a few tables! No data team is staffed with as many people as the entire upstream engineering team, let alone all the people responsible for upstream systems (the idea is laughable), and thus the data team must necessarily focus on a small sliver of that upstream data.
Another reason data quality tests are hard is that even when they appear to be about universal truths, they often have hidden dependencies on things that are contingently true about reality-as-modeled. Choosing which aspects of reality to model in a database is a hard problem.2 Large parts of this modeling happen upstream of the data team, in production systems. As a result, data quality tests will often break as the upstream model changes, and you’ll find yourself constantly playing catch-up to the modeling changes made by the much larger group of people working on upstream systems. Consider, for example, a customer status column that records whether a company is a prospect, a current customer, or churned. To enforce data quality, you might want to write a test to ensure that all companies that have started spending are classified as current customers or churned, to capture the universal truth that someone who’s started spending is no longer a prospect. Suppose now that upstream decides they want to track resurrected customers separately and adds that as a new status. Suddenly your test is throwing errors, not because the upstream data is wrong, but because upstream has changed its modeling to capture a new facet of reality. Crucially, however, just because the upstream data isn’t wrong doesn’t mean it won’t wreak havoc with downstream systems – for example, a dashboard that shows, separately, spend from current and churned customers will no longer show all categories of spend – and so the test existing and throwing errors may be a good thing even though nothing upstream is actually incorrect.
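To make this concrete, here’s roughly what such a check might look like as a dbt-style singular test – a SQL query that returns the rows that violate the rule, so any returned row counts as a failure. The table and column names below are invented for illustration, not from a real schema.

```sql
-- Hypothetical singular test: return every company that has started spending
-- but is still classified as a prospect. Any rows returned count as failures.
-- Table and column names are illustrative.
select
    c.company_id,
    c.customer_status,
    s.lifetime_spend
from companies as c
join company_spend as s
    on s.company_id = c.company_id
where s.lifetime_spend > 0
  and c.customer_status = 'prospect'
```

A stricter companion test – say, an accepted-values check that customer_status is one of ‘prospect’, ‘customer’, or ‘churned’ – would start failing the moment upstream adds ‘resurrected’, which is exactly the kind of breakage described above.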
A third reason data quality tests are hard is that the exception handling options – how you handle unreliable data once you’ve found it – are all bad in different ways. In practice, you can pick at most two of (a) accuracy, (b) freshness, and (c) completeness. Consider a system that ingests new data from upstream sources daily and rebuilds the derived data warehouse tables (e.g., the default setup for a dbt-based system3). Your options are:
Fresh and complete: All data from upstream is passed through even if the numbers are weird. This is the default situation you’d have with no data tests (and is bad for obvious reasons).
Accurate and complete: If any issues are identified with upstream data, the tables aren’t rebuilt at all until the issues are resolved.
Accurate and fresh: If any issues are identified with upstream data, the tables are rebuilt excluding those rows (a sketch of this option follows the list).
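Here’s a minimal sketch of what that third option might look like inside a derived SQL model: keep rebuilding on schedule, but quarantine the rows that fail basic checks. The source table, columns, and checks are all assumptions made up for illustration.

```sql
-- Minimal sketch of the "accurate and fresh" option: the derived table keeps
-- rebuilding on schedule, but rows failing basic checks are excluded.
-- Source table, columns, and checks are all illustrative.
with source_orders as (

    select * from raw_orders

),

validated as (

    select
        *,
        (
            order_total >= 0
            and customer_id is not null
            and order_date <= current_date
        ) as passes_checks
    from source_orders

)

-- The result is fresh and accurate but quietly incomplete: the excluded rows
-- need to be surfaced somewhere, or they'll simply be forgotten.
select *
from validated
where passes_checks
```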
To make things worse, staleness of data (lack of freshness) and incompleteness of data are themselves often forms of inaccuracy. Consider a dashboard tracking lead volume for a business-to-consumer business. Here, some degree of incompleteness is probably fine – each lead is a relatively small portion of the overall volume – but lack of data for the last day (staleness) will at best make the dashboard not fit for purpose, and at worst (if the viewer doesn’t know the data is stale) make it look like something is catastrophically wrong with your lead-generation pipelines.
On the other hand, consider something like loss ratio for an insurance company. Here, you’re probably not obsessing over the last day’s numbers, so some degree of staleness is fine, but one large claim can materially impact overall numbers4, so incompleteness might produce a materially wrong figure (if the reader assumes the loss ratio already includes a claim that it doesn’t). More broadly, anywhere you’re tracking leading indicators, staleness is inaccuracy or not-fit-for-purpose-ness; anywhere you’re reporting on data with extreme outliers, incompleteness is a form of (material) inaccuracy.5
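One practical consequence is that staleness itself can be worth testing for. As an illustration (Postgres-flavored SQL, with an invented table and column), a check like the following would flag the lead dashboard’s source as failing if nothing new has landed in over a day:

```sql
-- Hypothetical freshness check for the lead-volume example: return a row
-- (i.e. fail) if no new leads have arrived in more than a day.
-- An empty table would need extra handling; this is only a sketch.
select
    max(created_at) as latest_lead_at
from leads
having max(created_at) < current_timestamp - interval '1 day'
```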
To some degree these bad choices are also a knock-on effect of not controlling the data model. In production systems, you can enforce validations at the system boundaries that actually control whether or not something happens – e.g., you don’t accept an order for an ecommerce system where the zip code is blank – whereas in data-land the order has already happened, and your only choice is whether or not (and how) to report on it.
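The contrast is easy to see in code. Upstream, a validation can actually refuse the bad order; downstream, the order already exists, and the only lever is how to report it. The constraint and column names below are illustrative – in a real system the boundary check might live in application code rather than the database.

```sql
-- Upstream (production system): a boundary validation can reject bad data
-- outright. Shown here as a database constraint; it could equally live in
-- application code. Names are illustrative.
alter table orders
    add constraint orders_zip_code_not_blank
    check (zip_code is not null and zip_code <> '');

-- Downstream (warehouse): the order already happened. The only choice is
-- whether and how to report it, e.g. flagging it rather than rejecting it.
select
    order_id,
    zip_code,
    (zip_code is null or zip_code = '') as has_missing_zip
from orders
```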
Given all these problems, why worry about data quality at all? You could instead be a data plumber, building tables and dashboards that would give the right answer assuming the inputs were right, and call it a day. There are a few reasons not to do this.
First, and most importantly, a data team’s output mainly impacts reality through helping people make better and more data-informed decisions, and this relies on those people trusting the data team’s output. Unlike other systems where failure is often immediately obvious (e.g., a program crashing), the failure mode of bad input data is (often) silently wrong output. It doesn’t matter whose fault a data issue is (or even, really, who gets blamed for it): if a dashboard or metric isn’t reliably correct6, either people notice and stop relying on it (at which point what’s the point of having a data team to build and maintain it?), or they don’t notice and make bad decisions thinking they’re following the data.
Second, it’s often hard to even tell whether your logic is correct independent of the upstream data. The basic sanity checks any good analyst instinctively does (do the totals match what I see elsewhere? Does this gross margin, say, pass the sniff test given what I know about our business? Is it even in the possible range?) break down if the input data can’t be trusted – you either stop doing the checks entirely or spend hours hunting down issues that are unrelated to the code at hand.7
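Those instinctive checks are themselves just more data tests. Something like the range check below (with invented names and thresholds) only tells you anything about your logic if you can already trust what went into it:

```sql
-- Illustrative range check: gross margin should fall between 0% and 100%.
-- If the inputs can't be trusted, a failure here says nothing about whether
-- the margin calculation itself is wrong. Names and thresholds are invented.
select
    order_month,
    gross_margin_pct
from monthly_gross_margin
where gross_margin_pct not between 0 and 100
```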
Lastly, correctness is often a moving target. Logic that’s correct for one upstream model of reality often becomes disastrously wrong as that model evolves (as in the customer status example above), so the same thing that makes data tests such a painful treadmill is what makes them crucial.
1. When someone asks “why are these numbers wonky?”, you need to be able to correctly identify, for any part of the company, whether the data’s wrong or it’s reality that’s weird.
3. dbt is a standard tool in tech startups for data modeling and for orchestrating those models.
4. I’m thinking here of something like the Capital One cyberattack, which cost hundreds of millions of dollars. Obviously, if an individual auto accident is material for an insurer’s profitability, something has gone very wrong.
5. At this point you might be thinking that the right solution is to have different rules for different source data. This is all fine and dandy until you need to join across these tables, or worse, explain to someone else when they can expect the data to be complete and accurate.
6. Sometimes it’s fine to have dashboards/metrics/etc. that are explicitly flagged as currently incorrect, too – the problem is when things are silently wrong.
7. Unit tests help here somewhat, but at some point this requires recreating the entire upstream data model in unit tests and keeping it in sync when upstream changes, at which point you might as well have the data quality tests.

