The Hardest Part of Data Analysis is Reality
The Eldritch Horror Lurking Under Every Metric
Data analysis is technically easy. Reality1 is hard. This is surprising to people who don’t work with data, because most people have no ability to write SQL and some ability to think about reality, but there’s nothing novel or interesting about the 17th “group by” operation or the 43rd bar chart. Even regressions2 or neural networks, which sound very fancy, are often 10 lines of code to implement in the toolkit of your choice.3
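To make that concrete, here’s a minimal sketch of what “fitting a regression” amounts to in practice, using scikit-learn on numbers I’ve made up purely for illustration:

```python
# A toy regression: the data is invented; the point is how little code this takes.
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # feature: dollars spent
revenue = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # target: dollars earned

model = LinearRegression().fit(ad_spend, revenue)
print(model.coef_, model.intercept_)
```

The code is the easy part; deciding whether a model like this says anything true about the underlying business is where the work actually is.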
What, then, makes reality hard? The problem with reality is that it’s messy and our knowledge of it is incomplete. Some things (the future) are necessarily unknowable, and some things get lopped off in the procrustean process of stuffing reality into nice clean boxes (or database tables).4 As a wise man once said, “Categorization is the act of forgetting [differences].”5 This sounds like armchair philosophizing, but it turns up shockingly frequently in practice in two ways: (1) in handling the unknown and unknowable, what looks like a question of fact often turns into three layers of estimates, and (2) in the process of matching the nice tidy boxes back up with questions about the underlying reality, it’s easy to end up calculating the correct answer to a different question than the one you were actually asked (or than the question the asker should have asked).
For the former (estimates from incomplete data), consider a set of questions like “what’s this business’s return on investment from a dollar spent on paid search ads? How has that changed as we’ve invested more in this channel?” This sounds like (and is!) a question of fact. Unfortunately, some of those facts are unknowable, either fundamentally or in practice. To be able to answer this question, you need to be able to (a) attribute customers to specific channels and (b) calculate their lifetime value to the business.6
To determine attribution among performance7 advertising categories, you need to make decisions about how to share credit for a customer between multiple channels. Consider the extreme example where 100% of your customers complete their purchase by searching for your company’s name and clicking on the sponsored search result - relying on last-touch attribution (i.e., the last ad clicked before the sale gets credit) would lead to some interesting investment decisions, very different from those counseled by first-touch attribution. Unfortunately, while the marginal effect of each ad a user was exposed to during their buying decision (i.e., how to apportion credit between touches in multi-touch attribution) is a question of fact, it’s rarely one that can be answered experimentally.
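To see how much the attribution rule alone changes the story, here’s a toy sketch with invented customer journeys, mirroring the extreme example above where everyone finishes on a branded search click:

```python
# Toy attribution: every (invented) journey ends with a branded-search click.
from collections import Counter

customer_journeys = [
    ["display", "email", "branded_search"],
    ["social", "branded_search"],
    ["email", "display", "branded_search"],
]

last_touch = Counter(journey[-1] for journey in customer_journeys)
first_touch = Counter(journey[0] for journey in customer_journeys)

print(last_touch)   # Counter({'branded_search': 3}) -> "pour everything into search!"
print(first_touch)  # Counter({'display': 1, 'social': 1, 'email': 1}) -> a very different story
```

Same data, same arithmetic, very different investment advice; multi-touch schemes just add more ways to split the same ambiguity.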
To determine the lifetime value of a customer, you need to predict all their future spend through the end of time. Assuming this analysis is for a business with recurring revenue, this requires assumptions about churn rate and dollar retention. These are not impossible things to estimate, but depending on how you set up your estimates, you may significantly bias the answer to the original question. For example, if you assume there’s a single churn rate across time or across customer segments, then you might dramatically understate the difference between customer segments (if, in truth, low-spending customers also churn faster), and overstate the return on incremental spend as you ramp spend in the channel (if, in truth, the marginal customer reached by additional ads is more likely to churn). Again, this is not an argument that things are Unknowable™, but that in practice the approximations and estimates we make have far-reaching implications.
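As a back-of-the-envelope illustration (all numbers invented): under a constant-churn model, lifetime value is just monthly revenue divided by monthly churn (a geometric series), so the churn assumption is doing most of the work:

```python
# Back-of-the-envelope LTV under a constant monthly churn assumption.
# With monthly revenue m and monthly churn rate c, expected LTV = m / c.
# All numbers below are invented for illustration.

def simple_ltv(monthly_revenue, monthly_churn):
    return monthly_revenue / monthly_churn

# Segment-specific churn: low spenders also churn faster.
ltv_low = simple_ltv(monthly_revenue=10, monthly_churn=0.10)   # $100
ltv_high = simple_ltv(monthly_revenue=50, monthly_churn=0.02)  # $2,500 (25x the low segment)

# A single blended churn rate applied to both segments hides most of that gap.
blended_low = simple_ltv(10, 0.06)   # ~$167
blended_high = simple_ltv(50, 0.06)  # ~$833 (only 5x the low segment)

print(ltv_low, ltv_high, blended_low, blended_high)
```

The blended-churn version doesn’t just shift the numbers; it shrinks the apparent gap between segments from 25x to 5x, which is exactly the kind of bias that quietly flows through to the original ROI question.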
For the latter (answering the wrong question), consider a question like “how much money are we [a business] making from our Widget product line?” In a sense, this is a question with Right Answers - after all, aren’t there Generally Accepted Accounting Principles (GAAP)?8 It sounds like a data analyst should be able to go to some sales table, subset to the widget business unit, sum amount_usd by month, and Bob’s your uncle. However, reality is rarely so cooperative. The core problem9 is that “how much money are we making” is a horribly underspecified question. Depending on what the asker really wants to know (Do we have the cashflow necessary to meet payroll at the end of the month? Does the long-run profitability of the business line justify further investment?), the right answer might be one of several different metrics. Depending on what business you’re analyzing, pulling the wrong money metric (accurately!) can be anything from no big deal to wildly misleading. For example, in almost every startup, you’re staffed for and building out systems for the much larger company you hope to become, so looking at your unit economics (whether you make money on an incremental unit sold, irrespective of fixed costs) will paint a very different picture than looking at your cashflows.10
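Here’s the “obvious” version of that calculation, sketched with pandas on an invented table (all names and numbers hypothetical), next to a unit-economics view of the same rows:

```python
# The naive answer to "how much money are we making from Widgets?",
# plus a unit-economics view of the same rows. Table and column names are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "business_unit":     ["widgets", "widgets", "gadgets"],
    "month":             ["2024-01", "2024-02", "2024-01"],
    "amount_usd":        [120_000, 135_000, 80_000],
    "variable_cost_usd": [70_000, 80_000, 30_000],
})

widgets = sales[sales["business_unit"] == "widgets"]

# The naive "money we're making": top-line revenue by month.
revenue_by_month = widgets.groupby("month")["amount_usd"].sum()

# Unit economics: revenue minus variable costs, still ignoring fixed costs.
contribution = widgets["amount_usd"] - widgets["variable_cost_usd"]
contribution_by_month = contribution.groupby(widgets["month"]).sum()

print(revenue_by_month)
print(contribution_by_month)
# Neither number is cash in the bank, and neither includes fixed costs like payroll.
```

Both calculations are trivially “correct”; which one answers the asker’s actual question depends entirely on the decision they’re trying to make.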
The lesson to take away from this is twofold. First, understanding the data-generating process, both in terms of how reality operates (for a business, this is the sales/manufacturing/support etc. processes) and when and how those processes are reflected in the data, is crucial for doing good data analysis. Second, the key step in translating a question into code is often the step of backing up and figuring out what decision the question-asker is trying to make. Both of these are usually far more important than the tooling (software or otherwise) used to do the analysis.
1. Or, as it’s sometimes (pejoratively) referred to, “business logic.”
2. Yes, picking the right model and interpreting the results is hard, but those are usually questions about reality!
3. This analysis is heavily informed by my experience doing data analysis in corporate environments. It’s probably less applicable to academic environments: my understanding is that while corporate environments generally have lots of greenfield data where the low-hanging fruit is not yet picked, academic work is often about using ever-more-sophisticated methods to extract additional signal from the same datasets.
4. Joel Spolsky famously talks about something similar in The Law of Leaky Abstractions.
5. I’d swear this is Nietzsche, but Google Search isn’t turning up any results for the phrase.
6. Attribution and lifetime value are both hard problems - this is just a small sampling of the decisions you have to make to estimate them. For example, it ignores issues like adblockers, discount rates, and attribution for offline brand advertising like bus stops.
7. Performance advertising is pay-per-user-action (e.g., per app download or per click).
8. I am by no means an expert in accounting, but for those who know as little about accounting as I did when I started doing data analysis, there are worse places to start than Investopedia. Speaking from experience, I don’t recommend trying to re-derive it from first principles.
9. More accurately, a core problem. I’m ignoring for the moment a host of other problems likely to come up in an exercise like this, such as having to combine several different tables from different time periods, each with a different set of business units, different product rollups (and thus cost attribution), and different currencies.
10. This isn’t just a startup problem, of course. Byrne Hobart briefly touches on another example of this in his article on Liberty Mutual, where a fast-growing company in an industry with high upfront fixed costs (cable TV) will look unprofitable even though it could stop expanding at any point and produce a profit.


