Applied Epistemology: Thoughts on Data and Data-Work

The (Useful) Insanity of Data Tests

Jacob Bishop — Sun, 08 Dec 2024 00:27:46 GMT

Anyone who’s worked in software will tell you that tests are a crucial element of a reliable work-product. Anyone who’s worked in data will tell you it’s not quite that simple. Tests in data-land are somewhat uniquely painful for three reasons: (1) the scope of data you’re responsible for is absurdly broad, (2) you rarely control the abstractions you work with, and (3) the exception handling options for unreliable data are all bad. Unfortunately, not testing isn’t an option, because the core product of a data team is accurate and trusted insights.

The first thing to understand about trying to enforce data quality as a data team is that it is insane. Data warehouses draw data from a bunch of different sources in a company: production systems (a relational database if you’re lucky), a CRM (e.g., Salesforce), a customer support ticket system (e.g., Zendesk), third-party data vendors, maybe Grant in Finance’s quarterly forecast Excel workbook, etc. This is part of what makes them so useful – you can pull together insights across multiple domains, and discover that most of your support volume is coming from your least lucrative customers, or that some customer acquisition channels produce customers that are much less likely to churn than others. At the same time, this means that a data team’s domain – the area you’re responsible for understanding and validating data for – is effectively all activities that produce data at a company.1 Many of these systems have multiple people responsible just for managing them. For your production system(s), there may be entire teams of engineers devoted to the code that produces data in just a few tables! No data team is staffed with as many people as the entire upstream engineering team, let alone all the people responsible for upstream systems (the idea is laughable), and thus the data team must necessarily focus on a small sliver of that upstream data.

Another reason data quality tests are hard is that even when they appear to be about universal truths, they often have hidden dependencies on things that are contingently true about reality-as-modeled. Choosing which aspects of reality to model in a database is a hard problem.2 Large parts of this modeling happen upstream of the data team in production systems. As a result, data quality tests will often break as the upstream model changes, and you’ll find yourself constantly playing catch-up to the modeling changes made by the much larger group of people working on upstream systems. Consider, for example, a customer status column that records whether a company is a prospect, a current customer, or churned. To enforce data quality, you might want to write a test to ensure that all companies that have started spending are classified as current customers or churned, to capture the universal truth that someone who’s started spending is no longer a prospect. Suppose now that upstream decides they want to track resurrected customers separately and adds that as a new status. Now suddenly your test is throwing errors, not because upstream data is wrong, but because they’ve changed their modeling to capture a new facet of reality. Crucially, however, just because the upstream data isn’t wrong doesn’t mean it won’t wreak havoc with downstream systems – for example, a dashboard that shows, separately, spend from current and churned customers will no longer show all categories of spend. The test existing and throwing errors may therefore be a good thing even though the upstream data isn’t wrong.
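A minimal sketch of the customer status test described above, written in plain Python rather than in any particular testing framework. The status values, the field names, and the function `spenders_with_unexpected_status` are all hypothetical and chosen only for illustration.

```python
# Hypothetical status values; a real upstream system will have its own.
ALLOWED_STATUSES_FOR_SPENDERS = {"current", "churned"}


def spenders_with_unexpected_status(companies):
    """Return companies that have spend but a status outside the allowed set."""
    return [
        c for c in companies
        if c["total_spend_usd"] > 0
        and c["status"] not in ALLOWED_STATUSES_FOR_SPENDERS
    ]


before = [
    {"name": "Acme", "status": "current", "total_spend_usd": 1200.0},
    {"name": "Globex", "status": "churned", "total_spend_usd": 300.0},
    {"name": "Initech", "status": "prospect", "total_spend_usd": 0.0},
]

# Upstream adds a "resurrected" status. The new row is not wrong, but the
# test fails because its notion of "allowed" is tied to the old model.
after = before + [
    {"name": "Hooli", "status": "resurrected", "total_spend_usd": 450.0},
]

failures_before = spenders_with_unexpected_status(before)
failures_after = spenders_with_unexpected_status(after)
```

The failure on the new row is the useful part: it is the prompt to go and check which downstream tables and dashboards also assume only two post-prospect statuses.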

A third reason data quality tests are hard is that the exception handling options for data systems – how you handle unreliable data – are all bad in different ways. When dealing with data systems, you can pick at most two of (a) accuracy, (b) freshness, and (c) completeness. Consider a system that ingests new data from upstream sources daily and rebuilds the derived data warehouse tables (e.g., the default setup for a dbt-based system3). Your options are:

Fresh and complete: All data from upstream is passed through even if the numbers are weird. This is the default situation you’d have with no data tests (and is bad for obvious reasons).
Accurate and complete: If any issues are identified with upstream data, the tables aren’t rebuilt at all until the issues are resolved.
Accurate and fresh: If any issues are identified with upstream data, the tables are rebuilt excluding those rows.
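The three options above can be sketched as a single rebuild function with a policy switch. This is an illustration under stated assumptions, not how dbt or any other tool implements it: the validity check, the policy names, and the `rebuild` function are all made up for the example.

```python
def is_valid(row):
    """Hypothetical check: an order must have a positive amount."""
    return row["amount_usd"] > 0


def rebuild(new_rows, previous_table, policy):
    """Return the table to publish under one of three failure policies."""
    bad_rows = [r for r in new_rows if not is_valid(r)]
    if policy == "fresh_and_complete":
        # Pass everything through, bad rows included.
        return new_rows
    if policy == "accurate_and_complete":
        # Refuse to rebuild if anything fails; keep serving the old table.
        return previous_table if bad_rows else new_rows
    if policy == "accurate_and_fresh":
        # Rebuild, but drop the failing rows.
        return [r for r in new_rows if is_valid(r)]
    raise ValueError(f"unknown policy: {policy}")


yesterday = [{"order_id": 1, "amount_usd": 50.0}]
today = yesterday + [
    {"order_id": 2, "amount_usd": 75.0},
    {"order_id": 3, "amount_usd": -999.0},  # fails the check
]

published = {
    policy: rebuild(today, yesterday, policy)
    for policy in (
        "fresh_and_complete",
        "accurate_and_complete",
        "accurate_and_fresh",
    )
}
```

With this toy input, the first policy publishes all three orders, the second keeps serving yesterday's single order, and the third publishes orders 1 and 2 while silently omitting order 3.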

Unfortunately, to make things worse, staleness of data (lack of freshness) and incompleteness of data are often forms of inaccuracy. Consider a dashboard tracking lead volume for a business-to-consumer business. Here, some degree of incompleteness is probably fine – each lead is a relatively small portion of the overall volume – but lack of data for the last day (staleness) is going to at best make the dashboard not fit for purpose, and at worst (if the viewer doesn’t know the data is stale) make it look like something is catastrophically wrong with your lead-generation pipelines.

On the other hand, consider something like loss ratio for an insurance company. Here, you’re probably not obsessing over the last day’s numbers, so some degree of staleness is fine, but one large claim can materially impact overall numbers4, so incompleteness might result in a materially wrong number (if the loss ratio is understood to include a claim it doesn’t). More broadly, anywhere you’re tracking leading indicators, staleness is inaccuracy or not-fit-for-purpose-ness; anywhere you’re reporting on data with extreme outliers, incompleteness is a form of (material) inaccuracy.5
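A small worked example of why incompleteness matters more when there are outliers. It assumes the simple definition of loss ratio as incurred claims divided by earned premium, and every number in it is invented for illustration.

```python
def loss_ratio(claims_usd, earned_premium_usd):
    """Loss ratio as incurred claims divided by earned premium."""
    return sum(claims_usd) / earned_premium_usd


earned_premium_usd = 10_000_000.0
small_claims = [5_000.0] * 1_000  # 1,000 claims of $5k each
large_claim = 2_000_000.0         # one outlier claim

complete = loss_ratio(small_claims + [large_claim], earned_premium_usd)
missing_one_small = loss_ratio(
    small_claims[:-1] + [large_claim], earned_premium_usd
)
missing_the_large = loss_ratio(small_claims, earned_premium_usd)
```

With these made-up numbers, dropping one small claim moves the ratio from 0.70 to about 0.6995, while dropping the single large claim moves it to 0.50.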

To some degree these bad choices are also a knock-on effect of not controlling the data model. In production systems, you can enforce validations at the system boundaries that actually control whether or not something happens - e.g., you don’t accept an order for an ecommerce system where the zip code is blank - whereas in data-land the order has already happened, and your only choice is whether or not (and how) to report on it.

Given all these problems, why worry about data quality at all? You could instead be a data plumber, building tables and dashboards that would give the right answer assuming the inputs were right, and call it a day. There are three reasons not to do this.

First, and most importantly, a data team’s output mainly impacts reality through helping people make better and more data-informed decisions, and this relies on those people trusting the data team’s output. Unlike other systems where failure is often immediately obvious (e.g., a program crashing), the failure mode of bad input data is (often) silently wrong output. It doesn’t matter whose fault a data issue is (or even, really, who gets blamed for the issue). If a dashboard or metric isn’t reliably correct6, either people notice and stop relying on it (at which point what’s the point of having a data team to build and maintain it?), or they don’t notice and make bad decisions thinking they’re following the data.

Second, it’s often hard to even tell whether logic is correct independent of the upstream data. The basic sanity checks any good analyst instinctively does (do the totals match what I see elsewhere? Does e.g. this gross margin pass the sniff test given what I know about our business? Is it even in the possible range?) break down if the input data can’t be trusted – you either stop doing the checks entirely or spend hours hunting down issues that are unrelated to the code at hand.7

Lastly, correctness is often a moving target. Logic that’s correct for one upstream model of reality often becomes disastrously wrong as that upstream model of reality evolves (as in the customer status example above), so the same thing that makes data tests such a painful treadmill makes them crucial.

1. When someone asks “why are these numbers wonky?”, you need to be able to correctly identify, for any part of the company, whether the data’s wrong or it’s reality that’s weird.

2. See longer meditation on this problem here.

3. dbt is a standard tool in tech startups for data modeling and for orchestrating those models.

4. I’m thinking here of something like the Capital One cyberattack, which cost hundreds of millions of dollars. Obviously, if an individual auto accident is material for an insurer’s profitability, something has gone very wrong.

5. At this point you might be thinking that the right solution is to have different rules for different source data. This is all fine and dandy until you need to join between these tables, or worse, explain to someone else when they can expect the data to be complete and accurate.

6. Sometimes it’s fine to have dashboards/metrics/etc. that are explicitly flagged as currently incorrect too - the problem is when things are silently wrong.

7. Unit tests help here somewhat, but at some point this requires recreating the entire upstream data model in unit tests and keeping it in sync when upstream changes, at which point you might as well have the data quality tests.

The Hardest Part of Data Analysis is Reality

Jacob Bishop — Sun, 08 Dec 2024 00:06:37 GMT

Data analysis is technically easy. Reality1 is hard. This is surprising to people who don’t work with data, because most people have no ability to write SQL and some ability to think about reality, but there’s nothing novel or interesting about the 17th “group by” operation or the 43rd bar chart. Even regressions2 or neural networks, which sound very fancy, are often 10 lines of code to implement in the toolkit of your choice.3

What, then, makes reality hard? The problem with reality is that it’s messy and our knowledge of it is incomplete. Some things (the future) are necessarily unknowable, and some things get lopped off in the procrustean process of stuffing reality in nice clean boxes (or database tables).4 As a wise man once said, “Categorization is the act of forgetting [differences].”5 This sounds like armchair philosophizing, but it turns up shockingly frequently in practice in two ways: (1) in handling the unknown and unknowable, what looks like a question of fact often turns into three layers of estimates, and (2) in the process of matching the nice tidy boxes back up with questions about the underlying reality, it’s easy to end up calculating the correct answer to a different question than the one you were actually asked (or, than the question the asker should have asked).

For the former (estimates from incomplete data), consider a set of questions like “what’s this business’s return on investment from a dollar spent on paid search ads? How has that changed as we’ve invested more in this channel?” This sounds like (and is!) a question of fact. Unfortunately, some of those facts are unknowable, either fundamentally or in practice. To be able to answer this question, you need to be able to (a) attribute customers to specific channels and (b) calculate their lifetime value to the business.6

To determine attribution among performance7 advertising categories, you need to make decisions about how to share credit for a customer between multiple channels. Consider the extreme example where 100% of your customers make the final sale by searching for your company’s name and clicking on the sponsored search result - relying on last touch attribution (i.e., the last ad clicked before the sale gets credit) would lead to some interesting investment decisions that are very different from those counseled by first-touch attribution. Unfortunately, while the marginal effect of different ads a user was exposed to during their buying decision (i.e., how to apportion credit between touches in multi-touch attribution) is a question of fact, it’s rarely one that can be answered experimentally.
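To make the difference between the two attribution rules concrete, here is a minimal sketch. The channel names, the customer journeys, and the `attribute` function are hypothetical, and real attribution pipelines have to deal with much messier touch data.

```python
from collections import Counter


def attribute(journeys, rule):
    """Credit each converting customer's sale to a single channel.

    `journeys` holds one ordered list of channel touches per customer;
    `rule` is "first_touch" or "last_touch".
    """
    if rule == "first_touch":
        index = 0
    elif rule == "last_touch":
        index = -1
    else:
        raise ValueError(f"unknown rule: {rule}")
    return Counter(touches[index] for touches in journeys if touches)


# Every customer finishes by clicking the sponsored result for the brand name.
journeys = [
    ["podcast_ad", "display_ad", "branded_search"],
    ["display_ad", "branded_search"],
    ["podcast_ad", "branded_search"],
]

first_touch = attribute(journeys, "first_touch")
last_touch = attribute(journeys, "last_touch")
```

On this toy data, last-touch gives branded search credit for all three customers, while first-touch gives it none and splits the credit between the podcast and display ads.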

To determine the lifetime value of a customer, you need to predict all their future spend through the end of time. Assuming this analysis is for a business with recurring revenue, this requires assumptions about churn rate and dollar retention. These are not impossible things to estimate, but depending on how you set up your estimates, you may significantly bias the answer to the original question. For example, if you assume there’s a single churn rate across time or across customer segments, then you might dramatically understate the difference between customer segments (if, in truth, low-spending customers also churn faster), and overstate the return on incremental spend as you ramp spend in the channel (if, in truth, the marginal customer reached by additional ads is more likely to churn). Again, this is not an argument that things are Unknowable™, but that in practice the approximations and estimates we make have far-reaching implications.
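A rough illustration of the single-churn-rate bias, using the simplest possible lifetime value model: constant monthly margin m and constant monthly churn c, no discounting, so that lifetime value is the geometric sum of m * (1 - c)^t, which equals m / c. The segments, margins, and churn rates are invented, and the pooled churn rate assumes equal numbers of customers in each segment.

```python
def lifetime_value(monthly_margin_usd, monthly_churn_rate):
    """Undiscounted LTV under constant churn: sum of m * (1 - c)**t = m / c."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("churn rate must be in (0, 1]")
    return monthly_margin_usd / monthly_churn_rate


# Hypothetical segments: low spenders also churn faster.
segments = {
    "low_spend": {"monthly_margin_usd": 20.0, "monthly_churn_rate": 0.10},
    "high_spend": {"monthly_margin_usd": 100.0, "monthly_churn_rate": 0.02},
}

# Assumption: equal customer counts per segment, so pooled churn is the mean.
pooled_churn = sum(
    s["monthly_churn_rate"] for s in segments.values()
) / len(segments)

ltv_by_segment = {
    name: lifetime_value(s["monthly_margin_usd"], s["monthly_churn_rate"])
    for name, s in segments.items()
}
ltv_pooled = {
    name: lifetime_value(s["monthly_margin_usd"], pooled_churn)
    for name, s in segments.items()
}

true_gap = ltv_by_segment["high_spend"] / ltv_by_segment["low_spend"]
pooled_gap = ltv_pooled["high_spend"] / ltv_pooled["low_spend"]
```

Here the high-spend segment is worth roughly 25 times the low-spend segment when each has its own churn rate, but only 5 times as much under the pooled rate, because pooling leaves the margin difference as the only thing separating them.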

For the latter (answering the wrong question), consider a question like “how much money are we [a business] making from our Widget product line?” In a sense, this is a question with Right Answers - after all, aren’t there Generally Accepted Accounting Principles (GAAP)?8 It sounds like a data analyst should be able to go to some sales table, subset to the widget business unit, sum amount_usd by month, and Bob’s your uncle. However, reality is rarely so cooperative. The core problem9 is that “how much money are we making” is a horribly underspecified question. Depending on what the asker really wants to know (Do we have the cashflow necessary to meet payroll at the end of the month? Does the long-run profitability of the business line justify further investment?), the right answer might be one of several different metrics. Depending on what business you’re analyzing, (accurately!) pulling the wrong money metric can be anything from no big deal to wildly misleading. For example, in almost every startup, you’re staffed for and building out systems for the much larger company you hope to become, so looking at your unit economics (whether you make money on an incremental unit sold, irrespective of fixed costs) will paint a very different picture than looking at your cashflows.10
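The naive version of this analysis, and how quickly the answer changes with the metric, can be sketched as follows. The rows, the cost figures, and the `revenue_by_month` function are hypothetical, and the "after fixed costs" number is only a crude stand-in for a real profitability or cashflow calculation.

```python
from collections import defaultdict

# Hypothetical rows from a sales table.
sales = [
    {"month": "2024-01", "business_unit": "widget", "amount_usd": 1000.0},
    {"month": "2024-01", "business_unit": "gadget", "amount_usd": 400.0},
    {"month": "2024-02", "business_unit": "widget", "amount_usd": 1500.0},
    {"month": "2024-02", "business_unit": "widget", "amount_usd": 250.0},
]


def revenue_by_month(rows, business_unit):
    """Subset to one business unit and sum amount_usd by month."""
    totals = defaultdict(float)
    for row in rows:
        if row["business_unit"] == business_unit:
            totals[row["month"]] += row["amount_usd"]
    return dict(totals)


widget_revenue = revenue_by_month(sales, "widget")

# Made-up costs for the widget line, to show two other "money" metrics.
variable_cost_usd = {"2024-01": 600.0, "2024-02": 1000.0}
fixed_cost_usd = {"2024-01": 2000.0, "2024-02": 2000.0}

contribution = {
    m: widget_revenue[m] - variable_cost_usd[m] for m in widget_revenue
}
after_fixed_costs = {
    m: contribution[m] - fixed_cost_usd[m] for m in widget_revenue
}
```

All three dictionaries are accurate answers to "how much money are we making", and on this toy data they disagree even about the sign: revenue and contribution are positive in both months, while the figure after fixed costs is negative.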

The lesson to take away from this is twofold. First, understanding the data-generating process, both in terms of how reality operates (for a business, this is the sales/manufacturing/support etc. processes) and when and how those processes are reflected in the data, is crucial for doing good data analysis. Second, the key step in translating a question into code is often the step of backing up and figuring out what decision the question-asker is trying to make. Both of these are usually far more important than the tooling (software or otherwise) used to do the analysis.

1. Or, as it’s sometimes (pejoratively) referred to, “business logic.”

2. Yes, picking the right model and interpreting the results is hard, but those are usually questions about reality!

3. This analysis is heavily informed by my experience doing data analysis in corporate environments. It’s probably less applicable to academic environments: my understanding is that while corporate environments generally have lots of greenfield data where the low-hanging fruit is not yet picked, academic work is often about using ever-more-sophisticated methods to extract additional signal from the same datasets.

4. Joel Spolsky famously talks about something similar in The Law of Leaky Abstractions.

5. I’d swear this is Nietzsche, but Google Search isn’t turning up any results for the phrase.

6. Attribution and lifetime value are both hard problems - this is just a small sampling of the decisions you have to make to estimate them. For example, it ignores issues like adblockers, discount rates, and attribution for offline brand advertising like bus stops.

7. Performance advertising is pay-per-user-action (e.g. per app download, or per click).

8. I am by no means an expert in accounting, but for those who know as little about accounting as I did when I started doing data analysis, there are worse places to start than Investopedia. Speaking from experience, I don’t recommend trying to re-derive it from first principles.

9. More accurately, a core problem. I’m ignoring for the moment a host of other problems likely to come up in an exercise like this such as having to combine several different tables from different time periods, each with a different set of business units, different product rollups (and thus cost attribution) and different currencies.

10. This isn’t just a startup problem, of course. Byrne Hobart briefly touches on another example of this in his article on Liberty Mutual, where a fast-growing company in an industry with high upfront fixed costs (Cable TV) will look non-profitable even though they could stop expanding at any point and produce a profit.