Stephen T’s Blog Spot

A blog aimed at issues only data scientists, data analysts, statisticians, evaluators, and researchers care about.

Last time I described selection bias: when people choose into a program, the participants and non-participants differ before anything happens, so a naive comparison measures the people, not the program. Randomization would solve it, but often we cannot randomize. So how do you rebuild a fair comparison from data where the groups were never balanced? The most popular answer is propensity-score matching.

The obvious idea would be to match each participant with a non-participant who looks just like them: same age, same prior earnings, same education, same everything, so that the only difference left is the program. With one or two characteristics that works. With twenty, it collapses: you will almost never find someone who matches on all of them at once. There are too many dimensions and not enough people.

In 1983, Paul Rosenbaum and Donald Rubin showed a way around the problem. You do not have to match on all the characteristics directly. You can collapse them into a single number, the propensity score: the estimated probability that a given person would have ended up in the program, based on their observed characteristics. You estimate it, usually with a model that predicts treatment from the covariates, then match each participant to a non-participant who had the same, or very nearly the same, probability of participating. Their visible traits may differ in detail, but on the whole they were equally likely to be in the program.
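Here is a minimal sketch of those two steps, estimating the score and then matching on it. Everything in it is my own assumption for illustration: the data are simulated, the two covariates and their coefficients are invented, the model is a plain logistic regression fitted by Newton's method, and the matching is one-to-one nearest neighbor with replacement. The function names (`fit_logistic`, `propensity_scores`, `match_nearest`) are hypothetical, not from any library.

```python
import numpy as np

def fit_logistic(X, t, n_iter=25, ridge=1e-6):
    """Fit a logistic regression by Newton's method; intercept comes first."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        w = p * (1.0 - p)
        # A tiny ridge term keeps the Hessian invertible.
        grad = Xb.T @ (t - p) - ridge * beta
        hess = (Xb * w[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def propensity_scores(X, beta):
    """Estimated probability of being treated, given the covariates."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ beta))

def match_nearest(scores, t):
    """Pair each treated unit with the control whose score is closest.

    Matching is with replacement, so one control can serve several treated units.
    """
    treated = np.flatnonzero(t == 1)
    controls = np.flatnonzero(t == 0)
    gaps = np.abs(scores[treated][:, None] - scores[controls][None, :])
    return treated, controls[gaps.argmin(axis=1)]

# Simulated (made-up) data: older people enroll more, higher earners enroll less.
rng = np.random.default_rng(0)
n = 2000
age = rng.normal(40, 10, n)
earnings = rng.normal(30, 8, n)  # prior earnings, in thousands
X = np.column_stack([(age - 40) / 10, (earnings - 30) / 8])
true_logit = -0.5 + 0.8 * X[:, 0] - 0.6 * X[:, 1]
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

beta = fit_logistic(X, t)
scores = propensity_scores(X, beta)
treated_idx, matched_control_idx = match_nearest(scores, t)
mean_gap = np.abs(scores[treated_idx] - scores[matched_control_idx]).mean()
print("coefficients:", np.round(beta, 2))
print("mean score gap within pairs:", round(float(mean_gap), 4))
```

In real work you would more likely reach for an established statistics package and add a caliper, a maximum allowed score gap, so that treated people with no comparable control are dropped rather than badly matched.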

The reason this works is quietly elegant. Among people who were equally likely to enroll, which of them actually enrolled looks, on the measured characteristics, close to random. So comparing a treated person to an untreated person with the same propensity score compares like with like, and the comparison starts to behave like the randomized experiment you could not run. You can check it by producing the table of characteristics for both groups after matching and confirming they finally look alike. The same logic supports related moves: stratifying by the score, or weighting by it.
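The balance check can be sketched with standardized mean differences, one common way to fill in that table. The assumptions here are mine: the data are simulated, the propensity score is taken as known rather than estimated (to keep the sketch short), and `smd` is a hypothetical helper name.

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference: gap in means over the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))  # two standardized covariates, say age and prior earnings
# Assumption: the true score is known here; in practice you would estimate it.
score = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.6 * X[:, 1])))
t = rng.random(n) < score

treated = np.flatnonzero(t)
controls = np.flatnonzero(~t)
# Nearest-neighbor match on the score, with replacement.
gaps = np.abs(score[treated][:, None] - score[controls][None, :])
nearest = controls[gaps.argmin(axis=1)]

before = [smd(X[treated, j], X[controls, j]) for j in range(2)]
after = [smd(X[treated, j], X[nearest, j]) for j in range(2)]
for j in range(2):
    print(f"covariate {j}: SMD before {before[j]:+.2f}, after {after[j]:+.2f}")
```

A frequently quoted rule of thumb treats an absolute standardized difference below 0.1 as acceptable balance, but that is a convention, not a guarantee.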

Now the limitation, which is the whole reason last week’s post came first. A propensity score can only be built from the characteristics you measured. It balances age, earnings, and education because you have them; it can do nothing about what you never recorded. And in a voluntary program, the biggest driver of both enrolling and succeeding is often something soft and unmeasured, like motivation. Randomization would have balanced that too, on average, silently, because the coin flip ignores everything. Propensity-score matching cannot, because it only knows what is in your data. Two groups can be perfectly balanced on every measured variable and still differ on the one that mattered most.
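A small simulation makes the limitation concrete. All of it is invented for illustration: one measured covariate (earnings), one unmeasured driver (motivation), a true program effect of 1.0, and made-up coefficients. Because the analyst's score can only use earnings, and a one-covariate logistic score rises or falls steadily with that covariate, the sketch matches on earnings directly as a stand-in for matching on the score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
earnings = rng.normal(size=n)    # measured, available to the analyst
motivation = rng.normal(size=n)  # drives enrollment and outcome, never recorded
true_effect = 1.0

# Enrollment depends on both the measured and the unmeasured variable.
enroll_prob = 1.0 / (1.0 + np.exp(-(0.7 * earnings + 1.0 * motivation)))
t = rng.random(n) < enroll_prob
outcome = true_effect * t + 0.5 * earnings + 1.5 * motivation + rng.normal(size=n)

treated = np.flatnonzero(t)
controls = np.flatnonzero(~t)
# Match each treated person to the control with the closest earnings.
gaps = np.abs(earnings[treated][:, None] - earnings[controls][None, :])
nearest = controls[gaps.argmin(axis=1)]

earnings_gap = earnings[treated].mean() - earnings[nearest].mean()
motivation_gap = motivation[treated].mean() - motivation[nearest].mean()
estimate = (outcome[treated] - outcome[nearest]).mean()

print(f"earnings gap after matching:   {earnings_gap:+.3f}")
print(f"motivation gap after matching: {motivation_gap:+.3f}")
print(f"matched estimate: {estimate:.2f} (true effect {true_effect})")
```

The measured variable balances almost perfectly, the unmeasured one does not, and the matched estimate lands well above the true effect. Nothing in the balance table would warn you.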

This is why the method rests on an assumption you cannot test: that you measured everything driving both selection and outcome. It is the same shape of problem as the missing-data assumption from earlier this month, untestable from the data alone, and it calls for the same honesty. The strongest analyses pair the matching with a sensitivity analysis, asking how strong an unmeasured confounder would have to be to overturn the result. If it would take an implausibly powerful hidden factor, the finding is robust. If a modest one would do it, be nervous.
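One concrete way to ask that question is the E-value proposed by VanderWeele and Ding (2017). To be clear, this is my choice of example, not the only option; Rosenbaum's sensitivity bounds are the classic tool for matched studies. The E-value is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to fully explain away an observed risk ratio. The function name `e_value` below is hypothetical.

```python
import math

def e_value(risk_ratio):
    """E-value for an observed risk ratio: RR + sqrt(RR * (RR - 1)).

    Ratios below 1 are inverted first, so protective and harmful
    effects of the same strength get the same E-value.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))

# Compare a modest and a strong observed association.
for rr in (1.2, 2.0, 0.5):
    print(f"observed RR {rr}: E-value {e_value(rr):.2f}")
```

An E-value close to 1 means a modest hidden factor could account for the result; a large one means the confounder would have to be strongly tied to both enrollment and outcome.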

So propensity-score matching is not a substitute for randomization, and anyone who sells it as one is overreaching. It is a disciplined, transparent way to approximate a fair comparison when randomizing was never an option, with its assumptions laid out where reviewers can see them. It converts the vague worry “these groups are different” into the sharper, more answerable question “different on what, and did we measure it?”

So here is my question: When you use matching or weighting to compare non-randomized groups, how do you argue that you captured the variables that drive selection, and do you stress-test for the ones you might have missed?
