Good, Fast, Cheap: How to do Data Science with Missing Data

Matt Brems (he/him)
4 min read · Apr 15, 2019

When working on any sort of data science problem, we will inevitably run into missing data.

Let’s say we’re interviewing 100 people and are recording their answers on a piece of paper in front of us. Specifically, one of our questions asks about income. Consider a few examples of missing data:

Each NA represents a missing value… but we don’t know how the missing value occurred!
  • Someone refuses to answer our question about income. Unbeknownst to us, this person’s income is low, and they do not feel comfortable sharing it.
  • Someone else declines to answer the income question. This person is younger and perhaps young people are less likely to respond to certain questions.
  • One subject didn’t show up to the interview, so we observed no data for this person.
  • After the interviews, I accidentally spill my coffee. The coffee blurs the top of the page, rendering the first few rows of our data unreadable.

We may think we’re safe if we gather data from a computer… but not quite. What if we gather information from a sensor counting the cars passing through a toll road every hour, and the sensor breaks? What if a computer is collecting temperature data, but the temperature drops below the minimum value that the computer can measure?

In a dataset, we’d see each of these missing values as something like an NA. However, these NAs were caused by very different things! As a result, the way we analyze data containing these missing values must be different.
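To see why the cause of an NA matters, here is a minimal sketch in plain Python that compares two of the scenarios above. All the numbers are made up for illustration: the incomes are simulated, and the 30%, 60%, and 10% loss rates are assumptions, not estimates from any real survey.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical incomes for 10,000 respondents (simulated, not real data).
incomes = [rng.uniform(20_000, 100_000) for _ in range(10_000)]
true_mean = statistics.fmean(incomes)
median_income = statistics.median(incomes)

# Scenario A (the coffee spill): every answer has the same assumed
# 30% chance of being lost, no matter what the income is.
spilled = [x if rng.random() > 0.30 else None for x in incomes]

# Scenario B (the refusal): people with lower incomes withhold their
# answer more often (assumed 60% below the median vs. 10% above it).
refused = [
    x if rng.random() > (0.60 if x < median_income else 0.10) else None
    for x in incomes
]

def observed_mean(values):
    """Mean of the answers we actually recorded, skipping NAs (None)."""
    return statistics.fmean([v for v in values if v is not None])

spill_mean = observed_mean(spilled)
refusal_mean = observed_mean(refused)

print(f"true mean:                {true_mean:,.0f}")
print(f"observed after spill:     {spill_mean:,.0f}")
print(f"observed after refusals:  {refusal_mean:,.0f}")
```

In this simulation the coffee spill leaves the average roughly where it was, while the refusals push the observed average up, because the answers that went missing were disproportionately the low ones. Both datasets just show NAs, yet only one of them can be safely analyzed by ignoring the gaps.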

So how do we do data science with missing data?

Well, as I always tell my students at General Assembly: it depends.

To help us make a decision, we can use the “good, fast, cheap” diagram from project management. Even if you haven’t seen it before, the idea is pretty straightforward.

Project Management Triangle
  • You can do a project that is done fast and cheaply… but it won’t be good.
  • You can do a project that is good and is done cheaply… but it won’t be fast.
  • You can do a project that is good and is done fast… but it won’t be cheap.
  • It is basically impossible to have a project that can be done fast and cheaply and also be good.

The same idea applies to how we handle missing data!

Strategy 1: We can handle missing data by just dropping every observation that contains a missing value.

  • Our analysis is fast: In Python, it’s just one line of code!
  • Our analysis is cheap: We don’t need additional money to do this.
  • But it isn’t very good: By dropping every observation that contains a missing value, we lose data and implicitly assume that the people who didn’t answer look just like the people who did, which is a dangerous assumption. Even slightly more sophisticated techniques, like replacing missing data with the mean or the mode, can seriously distort our analysis, for example by making the data look less variable than it really is.
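As a rough illustration of Strategy 1, here is a small sketch in plain Python with a handful of invented survey rows, where None stands in for NA. The "one line" mentioned above is presumably something like pandas' `DataFrame.dropna()`; the list comprehension below is a dependency-free stand-in for it, not the author's code. The second half shows what mean imputation does to the spread of the data.

```python
import statistics

# Hypothetical survey rows (invented values); None plays the role of NA.
rows = [
    {"age": 23, "income": None},
    {"age": 31, "income": 42_000},
    {"age": 45, "income": 58_000},
    {"age": 52, "income": None},
    {"age": 38, "income": 75_000},
    {"age": 29, "income": 36_000},
]

# Strategy 1: drop every row that contains a missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Mean imputation: replace each missing income with the observed mean.
observed = [r["income"] for r in rows if r["income"] is not None]
fill_value = statistics.fmean(observed)
imputed = [fill_value if r["income"] is None else r["income"] for r in rows]

print(f"rows kept after dropping: {len(complete_rows)} of {len(rows)}")
print(f"std. dev. of observed incomes: {statistics.stdev(observed):,.0f}")
print(f"std. dev. after mean imputation: {statistics.stdev(imputed):,.0f}")
```

Dropping rows throws away a third of this tiny dataset, including the ages we did record. Mean imputation keeps every row, but each filled-in value sits exactly at the average, so the standard deviation shrinks and the data looks more certain than it is.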

Strategy 2: We can handle missing data by trying to avoid missing data up front.

  • Our analysis is fast: When it gets to analyzing our data, we don’t have to do anything special because our data is already complete. This is effectively zero lines of code!
  • Our analysis is good: We don’t have any uncertainty in our results if we truly collected 100% of the intended data.
  • But it isn’t very cheap: Spending money to collect all of our intended data can be very, very expensive.

Strategy 3: We can handle missing data by using sophisticated techniques such as the pattern submodel approach or multiple imputation.

  • Our analysis is cheap: We don’t need to spend any additional money!
  • Our analysis is good: We are properly estimating the uncertainty in our results or are forgoing imputation techniques altogether.
  • But it isn’t very fast: Our analysis will be more involved and will probably take substantially longer.
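To give a feel for why Strategy 3 takes longer, here is a deliberately simplified sketch of multiple imputation in plain Python (it does not cover the pattern submodel approach). It rests on several assumptions that are mine, not the author's: the data is simulated, income is imputed from age with a straight-line model, the observed cases are bootstrapped so each imputation reflects uncertainty in that model, and the results are pooled with Rubin's rules. The function names are hypothetical, and a real analysis would more likely use an established library.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def multiply_impute_mean(ages, incomes, m=20, seed=0):
    """Estimate mean income from m imputed datasets, pooled by Rubin's rules."""
    rng = random.Random(seed)
    observed = [(a, y) for a, y in zip(ages, incomes) if y is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Refit the model on a bootstrap sample of the observed cases.
        boot = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Fill each NA with a prediction plus random noise, not just the prediction.
        completed = [
            y if y is not None else a + b * age + rng.gauss(0, sd)
            for age, y in zip(ages, incomes)
        ]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between    # Rubin's total variance
    return pooled, total ** 0.5

# Simulated data (assumed relationship): income rises with age, and
# respondents under 35 skip the income question half the time.
rng = random.Random(7)
ages = [rng.uniform(20, 65) for _ in range(2_000)]
full_incomes = [20_000 + 1_000 * a + rng.gauss(0, 5_000) for a in ages]
incomes = [
    None if (a < 35 and rng.random() < 0.5) else y
    for a, y in zip(ages, full_incomes)
]

true_mean = statistics.fmean(full_incomes)
complete_case_mean = statistics.fmean([y for y in incomes if y is not None])
pooled_mean, pooled_se = multiply_impute_mean(ages, incomes)

print(f"mean with no missing data:   {true_mean:,.0f}")
print(f"mean after dropping NAs:     {complete_case_mean:,.0f}")
print(f"multiple-imputation mean:    {pooled_mean:,.0f} (SE {pooled_se:,.0f})")
```

Compared with the one-line drop, this needs a model, repeated imputations, and a pooling step, which is where the extra time goes. In exchange, the pooled estimate in this simulation lands closer to the full-data mean than the drop-the-NAs estimate does, and the standard error accounts for the fact that some values were filled in rather than observed.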

Which approach is right for your organization?

Well… it depends!

  • How much time do you have to do your analysis?
  • How much money do you have?
  • What are the trade-offs comparing quality, time, and money?

Interested in learning more? I’m speaking about this at the Open Data Science Conference in Boston on Tuesday, April 30 from 9:00 a.m. to 1:00 p.m.

This blog post was originally posted at OpenDataScience.com.

Thanks for reading!

Written by Matt Brems (he/him)

Chair, Executive Board @ Statistics Without Borders. Distinguished Faculty @ General Assembly. Co-Founder @ BetaVector.
