There is a comforting belief worth dismantling. When a dataset is huge (millions of records), we relax. Surely something that large is representative of the population. That belief is one of the most expensive intuitions in modern data work, because size and representativeness are not the same thing. A very large sample can be more confidently wrong than a small one.
Start with the distinction that matters. Representativeness comes from the sampling mechanism, not from the size. A probability sample, in which every member of the population has a known, nonzero chance of being selected, earns the right to generalize and to quantify its own uncertainty. A nonprobability sample (people who opted in, an online panel, whatever an administrative system happened to record) has no such mechanism. Its representativeness is a hope, not a property.
In 2018 the statistician Xiao-Li Meng made this precise, and unsettling. He showed that the error in a sample estimate is the product of three things: how variable the outcome is, how much of the population you captured, and what he named the data defect correlation, the correlation between whether someone shows up in your data and what their answer would be. The sting is what this means as the sample grows: the bias from selection does not shrink the way random noise does. With a very large sample, the random noise becomes tiny, so your confidence interval narrows to a sliver, but it narrows around the wrong number. You end up precisely, confidently mistaken. Meng put it memorably: the more the data, the surer we fool ourselves.
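For readers who want the three pieces in symbols, the result is usually written as the identity below. The notation is an assumption made here for illustration: n is the sample size, N the population size, f = n/N the fraction captured, R a 0/1 indicator of whether a person is in the data, and Y the outcome.

```latex
\bar{Y}_n - \bar{Y}_N
  \;=\;
  \underbrace{\rho_{R,Y}}_{\text{data defect correlation}}
  \times
  \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{how much of the population was missed}}
  \times
  \underbrace{\sigma_Y}_{\text{outcome variability}},
  \qquad f = \frac{n}{N}
```

Under simple random sampling the correlation term is, on average, about 1/sqrt(N) in magnitude, which is what produces the familiar 1/sqrt(n) error rate. Under self-selection it can sit at some fixed value no matter how many rows you collect, and then only the middle term responds to a larger n.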
The numbers are startling. Studying the 2016 United States election, Meng found that a correlation of only about 0.005 (half of one percent) between responding and which candidate a person favored was enough that a self-selected sample of more than two million respondents was no more accurate than a genuine simple random sample of about four hundred. The rest of that sample bought almost nothing. The pattern returned during the pandemic: a survey run through Facebook, gathering roughly two hundred fifty thousand responses a week, produced worse estimates of vaccination rates than a small probability survey of about a thousand. Bigger lost to better.
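The arithmetic behind the "about four hundred" figure can be sketched in a few lines. This is a minimal sketch, not Meng's own code: it uses the approximation that the effective sample size is f / (1 - f) divided by the squared data defect correlation, where f is the fraction of the population captured. The population and respondent counts are assumed round figures chosen to match "more than two million" at roughly one percent of eligible voters, and the function name is hypothetical.

```python
def effective_sample_size(n, population_size, ddc):
    """Approximate size of a simple random sample with the same mean
    squared error as a self-selected sample of size n.

    Uses the approximation n_eff ~= (f / (1 - f)) / ddc**2, f = n / N.
    It ignores the finite population correction, which is negligible
    when n_eff is tiny relative to the population.
    """
    if not 0 < n < population_size:
        raise ValueError("n must be strictly between 0 and population_size")
    if ddc == 0:
        raise ValueError("ddc of exactly 0 means no selection bias to convert")
    f = n / population_size
    return (f / (1.0 - f)) / ddc ** 2


# Assumed, illustrative inputs (not taken from the text above):
POPULATION = 231_000_000   # rough count of eligible voters
RESPONDENTS = 2_310_000    # about one percent of them
DDC = 0.005                # "about half of one percent"

n_eff = effective_sample_size(RESPONDENTS, POPULATION, DDC)
n_eff_doubled = effective_sample_size(2 * RESPONDENTS, POPULATION, DDC)

print(f"effective n at {RESPONDENTS:,} respondents: {n_eff:.0f}")
print(f"effective n at {2 * RESPONDENTS:,} respondents: {n_eff_doubled:.0f}")
```

With these inputs the first figure comes out near 404. Doubling the respondents to 4.6 million only lifts it to roughly 816, so the extra 2.3 million rows are worth about as much as four hundred randomly chosen people.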
This should change how we hear the phrase "big data." The flood of administrative records, app logs, opt-in panels, and scraped datasets is overwhelmingly nonprobability data. It is often enormous and almost always self-selected, which is precisely the recipe Meng's paradox warns about. Size reassures us at the exact moment we should be most suspicious.
So what do you do? When you can, use a probability design, because the known selection mechanism is what licenses the inference, and volume does not replace it. When you cannot, you must model your way out, with weighting and post-stratification, tools I have written about, which adjust the sample to resemble the population. But those corrections fix only the bias you can model with the variables you have, and they lean on assumptions you cannot fully test. They help. They do not turn a self-selected pile into a random sample.
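To make the adjustment concrete, here is a minimal sketch of post-stratification on a single variable. Everything in it is hypothetical: the age groups, the population shares, and the toy records are invented for illustration, and real work would use several variables and a survey package. Each group's sample mean is reweighted by that group's known share of the population.

```python
from collections import defaultdict


def raw_mean(records):
    """Unweighted mean of y over (group, y) records."""
    records = list(records)
    return sum(y for _, y in records) / len(records)


def poststratified_mean(records, population_shares):
    """Weight each group's sample mean by its known population share.

    records: iterable of (group, y) pairs.
    population_shares: {group: share}, with shares summing to 1.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, y in records:
        totals[group] += y
        counts[group] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        # An empty cell cannot be reweighted; it has to be modeled or merged.
        raise ValueError(f"no sample records for groups: {sorted(missing)}")
    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Hypothetical opt-in sample: older people are heavily over-represented.
records = (
    [("under_40", 1)] * 40 + [("under_40", 0)] * 160     # 200 people, mean 0.20
    + [("40_plus", 1)] * 480 + [("40_plus", 0)] * 320    # 800 people, mean 0.60
)
# Assumed known population shares (e.g. from a census).
population_shares = {"under_40": 0.5, "40_plus": 0.5}

naive = raw_mean(records)
adjusted = poststratified_mean(records, population_shares)
print(f"naive: {naive:.2f}, post-stratified: {adjusted:.2f}")
```

Here the naive mean is 0.52 and the post-stratified mean is 0.40. The correction works only because age is both observed and assumed to be the whole story. If people within each age group also selected in based on the outcome itself, the group means are still biased and no amount of reweighting on age will remove that.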
For those of us in evaluation, the lesson is practical. When someone hands you a massive dataset and offers its size as proof of its quality, ask the unglamorous question first. How did people end up in this data, and could that selection be related to what we are trying to measure? If it could, more rows will not rescue you. They may sink you faster, and with more confidence.
So here is my question: When you are offered a very large dataset, do you interrogate how people selected into it, or does the sheer size tend to quiet the doubts it should raise?