Stephen T’s Blog Spot

A blog aimed at issues only data scientists, data analysts, statisticians, evaluators, and researchers care about.

Every real dataset has holes. People skip the sensitive question, drop out of the study, or are never measured at all. A sensor fails, a record is incomplete, a field is blank. And most of the time, without quite deciding to, we handle those holes in one of two ways: we delete the rows that have them, or we fill the gaps with an average. Both feel harmless. Both can quietly wreck your conclusions. This post is about why; the next will be about what to do instead.

The key insight, formalized by the statistician Donald Rubin in 1976, is that what matters is not how much data is missing but why it is missing. He described three mechanisms, and they could not be more different in their consequences.

The first is missing completely at random, or MCAR. Here the missingness has nothing to do with anything, observed or unobserved. A tube of blood is dropped, a page is lost in the mail. The missing values are just a random subset of all the values. This is the safe case, and it is also the rarest. If your data really are MCAR, deleting the incomplete rows costs you sample size but does not bias the answer.

The second is missing at random, or MAR. It is one of the most confusing names in statistics, because the missingness is systematic rather than haphazard. MAR means the missingness depends on things you did observe, but not on the missing value itself. Suppose younger people are less likely to report income, but within any age group, whether they answer has nothing to do with how much they earn. The gap is patterned, but the pattern is explained by data you have (age), so you have a fighting chance to correct for it.

The third is the dangerous one: missing not at random, or MNAR. Here the missingness depends on the very value that is missing. The highest earners decline to state their income precisely because it is high. The sickest patients drop out of the trial precisely because they are doing badly. The information you need to fix the problem walked out the door with the people who left. Nothing in your remaining data can tell you what you lost.
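To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the population, the income formula, the response probabilities, and the helper names (`respondents`, `complete_case_mean`, `age_adjusted_mean`). It deletes nonrespondents under each mechanism, then tries the simplest possible age correction, which is to reweight the under-40 and 40-plus groups back to their share of the full sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: age is always observed, income may go missing.
N = 20000
ages = [random.uniform(20, 70) for _ in range(N)]
incomes = [20000 + 1000 * a + random.gauss(0, 15000) for a in ages]
true_mean = statistics.fmean(incomes)


def respondents(answer_prob):
    """Return one True/False flag per person: did they report income?"""
    return [random.random() < answer_prob(a, inc) for a, inc in zip(ages, incomes)]


def complete_case_mean(answered):
    """Mean income after deleting everyone who did not answer."""
    return statistics.fmean([inc for inc, ok in zip(incomes, answered) if ok])


def age_adjusted_mean(answered):
    """Reweight respondents so the under-40 / 40-plus mix matches the full sample."""
    total = 0.0
    for in_group in (lambda a: a < 40, lambda a: a >= 40):
        share = sum(in_group(a) for a in ages) / N
        group_incomes = [inc for a, inc, ok in zip(ages, incomes, answered)
                         if ok and in_group(a)]
        total += share * statistics.fmean(group_incomes)
    return total


# Assumed response probabilities, one rule per mechanism.
mechanisms = {
    "MCAR": lambda a, inc: 0.6,                           # depends on nothing
    "MAR": lambda a, inc: 0.3 if a < 40 else 0.8,         # depends on observed age
    "MNAR": lambda a, inc: 0.3 if inc > 80000 else 0.8,   # depends on income itself
}

results = {}
for name, answer_prob in mechanisms.items():
    answered = respondents(answer_prob)
    cc, adj = complete_case_mean(answered), age_adjusted_mean(answered)
    results[name] = (cc, adj)
    print(f"{name}: complete-case bias {cc - true_mean:+.0f}, "
          f"age-adjusted bias {adj - true_mean:+.0f}")
```

In this toy setup the complete-case mean should land near the truth only under MCAR. The age adjustment should pull the MAR estimate back, because within each age group answering is unrelated to income, and it should leave the MNAR estimate biased, because age does not explain who stayed silent.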

Now the trap. Both default moves assume the safe case and betray you in the others. Deleting incomplete rows, which software often does automatically, is reliably unbiased only when the data are MCAR. Under MAR or MNAR, the people who remain are generally a skewed subset, and your clean complete-case analysis can be confidently wrong. Mean imputation is worse: filling every blank with the average shrinks the natural variation, weakens real correlations, and fabricates false precision. As one summary put it, if you do not consciously choose a method, you have chosen deletion by default.
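The damage from mean imputation is easy to see in a small sketch. The data here are made up (two correlated variables, with 40% of one knocked out completely at random, the friendliest case), and `pearson` is a hand-rolled helper rather than a library call.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Plain Pearson correlation, written out to keep the example stdlib-only."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5


# Hypothetical data: y tracks x with a true correlation of about 0.8.
N = 5000
x = [random.gauss(0, 1) for _ in range(N)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]

# Delete 40% of y completely at random, then fill the blanks with the mean.
missing = [random.random() < 0.4 for _ in range(N)]
observed_y = [yi for yi, m in zip(y, missing) if not m]
fill = statistics.fmean(observed_y)
y_imputed = [fill if m else yi for yi, m in zip(y, missing)]

sd_full, sd_imputed = statistics.pstdev(y), statistics.pstdev(y_imputed)
r_full, r_imputed = pearson(x, y), pearson(x, y_imputed)

# Naive standard errors of the mean: imputation pretends we have N real values.
se_complete_case = statistics.stdev(observed_y) / len(observed_y) ** 0.5
se_imputed = statistics.stdev(y_imputed) / N ** 0.5

print(f"spread of y:       full {sd_full:.2f}, after imputation {sd_imputed:.2f}")
print(f"correlation x~y:   full {r_full:.2f}, after imputation {r_imputed:.2f}")
print(f"std. error of mean: complete-case {se_complete_case:.4f}, imputed {se_imputed:.4f}")
```

Even with MCAR data, the imputed column should show less spread, a weaker correlation with `x`, and a smaller reported standard error than the observed values alone can justify.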

Here is the part that should keep you honest. You usually cannot prove which mechanism you are facing. You can sometimes rule out MCAR by checking whether the people with missing data differ on the things you did measure: if they do, the data are not MCAR, although finding no difference does not prove that they are. But the line between MAR and MNAR depends on the values you never saw, so it cannot be settled by the data alone. It is, in the end, an assumption you must argue for, not a fact you can test.
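That check on the measured variables can be sketched in a few lines. Again the population and response rules are hypothetical, and `welch_t` and `age_gap_t` are illustrative helpers: the idea is simply to compare the ages of people who skipped the income question with the ages of people who answered it.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: age always observed, income sometimes missing.
N = 10000
ages = [random.uniform(20, 70) for _ in range(N)]
incomes = [20000 + 1000 * a + random.gauss(0, 15000) for a in ages]


def welch_t(a, b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / (va / len(a) + vb / len(b)) ** 0.5


def age_gap_t(answer_prob):
    """Compare ages of nonrespondents against respondents under one mechanism."""
    answered = [random.random() < answer_prob(a, inc) for a, inc in zip(ages, incomes)]
    age_answered = [a for a, ok in zip(ages, answered) if ok]
    age_missing = [a for a, ok in zip(ages, answered) if not ok]
    return welch_t(age_missing, age_answered)


t_mcar = age_gap_t(lambda a, inc: 0.6)                           # depends on nothing
t_mar = age_gap_t(lambda a, inc: 0.3 if a < 40 else 0.8)         # depends on age
t_mnar = age_gap_t(lambda a, inc: 0.3 if inc > 80000 else 0.8)   # depends on income

print(f"age gap t statistic: MCAR {t_mcar:+.1f}, MAR {t_mar:+.1f}, MNAR {t_mnar:+.1f}")
```

A large statistic is evidence against MCAR. Notice what the check cannot do: in this made-up population it flags both MAR and MNAR (the latter only because income happens to track age), and it gives no way to tell them apart.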

So the amount missing is the wrong thing to fixate on. A trivial fraction of MNAR can bias a result badly, while a large amount of MCAR mostly just costs precision. The first question about any hole in the data is not how big it is, but why it is there.
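A quick sketch of that contrast, under assumptions I chose to make the point visible: incomes are drawn from a skewed (lognormal) distribution, the MNAR case is the extreme one where the top 5% of earners all decline to answer, and the MCAR case throws away half the sample at random.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed incomes; the lognormal parameters are arbitrary.
N = 20000
incomes = [random.lognormvariate(11, 0.8) for _ in range(N)]
full_mean = statistics.fmean(incomes)

# Case 1: half the sample is missing, completely at random.
kept_mcar = [inc for inc in incomes if random.random() < 0.5]
mcar_mean = statistics.fmean(kept_mcar)

# Case 2: only 5% is missing, but it is exactly the top 5% of earners.
kept_mnar = sorted(incomes)[: int(0.95 * N)]
mnar_mean = statistics.fmean(kept_mnar)

rel_bias_mcar = (mcar_mean - full_mean) / full_mean
rel_bias_mnar = (mnar_mean - full_mean) / full_mean

print(f"about 50% missing, MCAR: mean off by {rel_bias_mcar:+.1%}")
print(f"5% missing, MNAR:        mean off by {rel_bias_mnar:+.1%}")
```

In this setup the small MNAR gap should move the mean far more than the large MCAR gap does, which is the whole argument for asking why before asking how much.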

So here is my question: When you hit missing data, do you stop to ask why it is missing before you decide how to handle it, and what tells you whether the gap is benign or dangerous?
