We are swimming in aggregate data. Averages by county, by ZIP code, by store, by team, by region. It is everywhere, it is cheap, and it is tempting to read it as if it described the people inside each group. That temptation has a name, the ecological fallacy, and it is one of the oldest and most stubborn errors in quantitative work.
The fallacy is drawing a conclusion about individuals from a correlation measured across groups. The two can disagree completely. The classic demonstration came from the sociologist William Robinson in 1950. Using 1930 census data, he examined the relationship between being foreign-born and being illiterate, computed two ways. Across the 48 states, the correlation was negative, about -0.53: states with more immigrants tended to have lower illiteracy. But across individuals it was positive, about 0.12: a foreign-born person was, on average, somewhat more likely to be illiterate. The two numbers did not just differ in size. They pointed in opposite directions.
How can that happen? Because the groups differ from each other in ways that have nothing to do with the individuals inside them. Immigrants in 1930 tended to settle in states that already had well-educated native-born populations. So states with many immigrants looked highly literate, not because the immigrants were literate, but because of everyone else in those states. The state average was a fact about states. It was never a fact about immigrants, and treating it as one inverted the truth.
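To make the mechanism concrete, here is a small simulation in Python. Every number in it is invented for illustration: the function name simulate, the 48 equal-sized states, the immigrant shares, and the six-point literacy gap are all my assumptions, not Robinson's data. It is a sketch of the mechanism, not a reconstruction of the 1930 census. In every simulated state an immigrant is more likely to be illiterate than a native-born neighbor, but immigrants are concentrated in the states where native-born illiteracy is lowest.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_states=48, people_per_state=10_000, seed=0):
    """Return (state-level r, individual-level r) for made-up data."""
    rng = random.Random(seed)
    state_share, state_illiteracy = [], []  # one entry per state
    foreign_born, illiterate = [], []       # one 0/1 entry per person

    for i in range(n_states):
        t = i / (n_states - 1)  # 0.0 for the first state, 1.0 for the last

        # Assumption: native-born illiteracy rises from ~2% to ~8% across states.
        native_rate = max(0.005, 0.02 + 0.06 * t + rng.gauss(0, 0.01))
        # Assumption: immigrants are 6 points more likely to be illiterate
        # than native-born residents of the same state.
        immigrant_rate = native_rate + 0.06
        # Assumption: immigrants settle mostly where native illiteracy is low.
        share = 0.25 - 0.20 * t

        n_imm = round(people_per_state * share)
        n_nat = people_per_state - n_imm
        imm_illit = round(n_imm * immigrant_rate)
        nat_illit = round(n_nat * native_rate)

        # Group-level record: what a state table would show.
        state_share.append(n_imm / people_per_state)
        state_illiteracy.append((imm_illit + nat_illit) / people_per_state)

        # Individual-level records: immigrants first, then native-born.
        foreign_born += [1] * n_imm + [0] * n_nat
        illiterate += (
            [1] * imm_illit + [0] * (n_imm - imm_illit)
            + [1] * nat_illit + [0] * (n_nat - nat_illit)
        )

    return (
        pearson(state_share, state_illiteracy),
        pearson(foreign_born, illiterate),
    )


state_r, person_r = simulate()
print(f"correlation across states:      {state_r:+.2f}")
print(f"correlation across individuals: {person_r:+.2f}")
```

Under these assumptions the state-level correlation comes out negative and the individual-level one positive, from the same simulated people. Only the unit of analysis changed.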
This is a cousin of Simpson’s paradox, which I wrote about earlier, but it is its own distinct trap. Simpson’s paradox is about a trend reversing when you pool or split the same units. The ecological fallacy is about changing the unit of analysis entirely, from the group to the person, and assuming the relationship survives the trip. It usually does not, at least not at the same strength, and sometimes not even in the same direction.
Why does this matter now? Because aggregated data is often all we can get, and it is seductive precisely because it is so available. A dashboard says regions with more of some service have better outcomes. A report shows schools with a certain feature score higher. A vendor claims customers in ZIP codes with trait X buy more of product Y. Each is a statement about groups. The instant you restate it as a claim about an individual (a resident, a student, a customer), you may have crossed into fiction, and a confident, numerical one at that.
None of this means aggregate data is worthless. Often it is the only data we have, and it can answer group-level questions perfectly well. The error is not in using it. The error is in quietly swapping the question, asking about people and answering about places, or asking about places and answering about people. The two questions have two different answers, and only sometimes do they agree.
So the discipline is to be ruthless about the unit of analysis. Before you accept any correlation, ask what was actually counted: individuals, or groups of them? If the data is aggregated and the claim is about individuals, the honest response is neither to believe it nor to dismiss it, but to treat it as a hypothesis that can only be tested with individual-level data. The map is informative, but it is not the people, and the average of a place tells you surprisingly little about anyone who lives there.
So here is my question: Where have you seen a group-level pattern confidently sold as a fact about individuals, and how do you keep the unit of analysis straight when the only data on hand is aggregated?