In 1936, a magazine ran the largest election poll the world had ever seen. The Literary Digest mailed out around ten million ballots and tallied roughly two and a half million that came back. On that mountain of data, it confidently predicted that Alf Landon would defeat Franklin Roosevelt. Roosevelt then won in one of the largest landslides in American history. The magazine’s reputation never recovered, and within two years it was gone.
What went wrong is one of the most important lessons in all of survey research, and it has nothing to do with sample size. The Digest drew its names from telephone directories, automobile registrations, club rosters, and its own subscriber list. In 1936, in the depths of the Depression, telephones and cars were luxuries. So the list systematically overrepresented the well off, who leaned toward Landon, and missed the poorer voters who carried Roosevelt. The poll did not have too few people. It had the wrong people, and adding more of them would not have helped.
This is coverage error, and it is distinct from the more familiar worry about who responds. Coverage error happens earlier, at the moment you choose the list you will sample from, which researchers call the sampling frame. If that frame does not match the population you actually care about, some people have no chance of being selected at all. Every later step can be flawless: the sampling random, the response rate high, the analysis clean. The result will still be biased, because the bias was baked in before the first ballot went out.
Here is the part that should unsettle anyone who works with data today. Size does not fix it. A bigger sample shrinks the margin of error, but it shrinks it around the wrong number. A biased frame sampled lightly and a biased frame sampled massively give you the same skewed answer, just with more decimal places of false confidence. The Literary Digest had two and a half million responses. That same year, George Gallup predicted the right winner from a sample a fraction of the size, because he chose it to mirror the country rather than to be large. Representativeness beat raw volume, and it was not close.
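To make that concrete, here is a minimal simulation sketch. Every number in it is hypothetical and chosen only for illustration; none of it is the 1936 data. It assumes a population where 40 percent of people appear in the frame, and where the people in the frame support a candidate at a different rate than the people outside it. It then polls the frame at growing sample sizes and compares that with a small poll of the whole population.

```python
import random

# Hypothetical numbers for illustration only; they are not the 1936 figures.
COVERED_SHARE = 0.40      # share of the population that appears in the frame
SUPPORT_COVERED = 0.35    # support for candidate A among people in the frame
SUPPORT_UNCOVERED = 0.70  # support among people the frame can never reach

# True support across the whole population.
TRUE_SUPPORT = (COVERED_SHARE * SUPPORT_COVERED
                + (1 - COVERED_SHARE) * SUPPORT_UNCOVERED)

# Expected coverage bias: the uncovered share times the gap between the groups.
COVERAGE_BIAS = (1 - COVERED_SHARE) * (SUPPORT_COVERED - SUPPORT_UNCOVERED)


def poll_from_frame(n, rng):
    """Sample n people from the frame only; the uncovered have zero chance."""
    votes = sum(rng.random() < SUPPORT_COVERED for _ in range(n))
    return votes / n


def poll_from_population(n, rng):
    """Sample n people from the whole population, in the frame or not."""
    votes = 0
    for _ in range(n):
        in_frame = rng.random() < COVERED_SHARE
        support = SUPPORT_COVERED if in_frame else SUPPORT_UNCOVERED
        votes += rng.random() < support
    return votes / n


rng = random.Random(1936)  # fixed seed so the run is repeatable
print(f"true support: {TRUE_SUPPORT:.3f}, expected bias: {COVERAGE_BIAS:+.3f}")
for n in (1_000, 50_000, 500_000):
    print(f"biased frame, n={n:>7,}: {poll_from_frame(n, rng):.3f}")
print(f"full population, n=  2,000: {poll_from_population(2_000, rng):.3f}")
```

Under these assumptions, the frame-only estimates should settle ever more tightly around 0.35 as the sample grows, while true support is 0.56. The small poll of the full population is noisier, but it is noisy around the right answer.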
That lesson lands harder now than it did then, because we are surrounded by enormous, effortless data that comes from skewed frames. An online survey reaches the people who are online and willing to click. An app’s usage data describes the people who already use the app. Social media sentiment captures the views of the people who post. Each of these can deliver millions of records and still quietly exclude whole segments of the population you mean to understand. Big data does not announce who is missing. You have to ask.
For those of us who collect and analyze data, the discipline is to start with two questions before admiring any number. Who is in the frame, and who can never appear in it no matter how much data we gather? A customer list leaves out the customers you lost. A portal survey leaves out people who cannot reach the portal. A convenience sample leaves out everyone inconvenient. None of that is cured by a bigger pull from the same source. It is cured only by fixing the frame, or by being honest about exactly whom your findings do and do not describe.
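One way to start asking those two questions is to line up the frame against an outside benchmark for the population, segment by segment. The sketch below assumes you have such a benchmark, such as census shares, which is often the hard part. The function name `coverage_report`, the segment labels, the counts, and the 0.5 tolerance are all hypothetical choices for illustration, not a standard.

```python
def coverage_report(frame_counts, population_shares, tolerance=0.5):
    """Compare each segment's share of the frame with its population share.

    frame_counts: records per segment in the frame (hypothetical input).
    population_shares: benchmark share per segment; each must be positive.
    tolerance: flag a segment when its frame share falls below this
        fraction of its population share (an arbitrary cutoff).
    """
    total = sum(frame_counts.values())
    if total == 0:
        raise ValueError("the frame contains no records")

    report = {}
    for segment, pop_share in population_shares.items():
        # A segment absent from the frame counts as zero records.
        frame_share = frame_counts.get(segment, 0) / total
        if frame_share == 0:
            status = "missing"
        elif frame_share / pop_share < tolerance:
            status = "underrepresented"
        else:
            status = "ok"
        report[segment] = (round(frame_share, 3), pop_share, status)
    return report


# Hypothetical example: a portal survey compared with a benchmark.
frame_counts = {"urban": 7000, "suburban": 2800, "rural": 200}
population_shares = {"urban": 0.50, "suburban": 0.30,
                     "rural": 0.15, "remote": 0.05}

report = coverage_report(frame_counts, population_shares)
for segment, (frame_share, pop_share, status) in report.items():
    print(f"{segment:<9} frame={frame_share:.3f} "
          f"population={pop_share:.2f} {status}")
```

The two flags call for different responses. An underrepresented segment can sometimes be reweighted, at the cost of leaning hard on a few records. A missing segment cannot, because no weight turns zero observations into information. That is the case where the frame itself has to change, or the findings have to be scoped to the people the frame could reach.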
The seductive thing about a huge dataset is that it feels authoritative simply by being huge. The Literary Digest felt that way too. The number that matters is not how many people you reached, but whether the people you could reach look like the people you are trying to learn about.
So here is my question: When you assess a data source, how do you probe who is missing from the frame entirely, and have you seen a large, impressive dataset quietly misrepresent the population it claimed to cover?