Selection bias
Picture a voluntary job-training program. A year later, the people who enrolled are earning more than the people who did not. Success, right? Maybe. But ask a harder question first: who signs up for a training program? Often the more motivated, the more employable, the people already on their way up. If the participants were going to do better anyway, then comparing them to non-participants does not measure the program. It measures the kind of people the program attracted. This is selection bias, and it is the single biggest threat to any claim that something worked.
The core problem is comparability. To say a program caused an outcome, you need two groups that were alike in every relevant way except for the program, so that any later difference can be pinned on it. A randomized experiment delivers that by assigning people to treatment or control by chance. Randomization is powerful precisely because the coin flip does not care who you are; over enough people, the groups end up balanced, on average, on everything, the things you measured and the things you never thought to. The two groups are interchangeable at the start, so a difference at the end means something.
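If it helps to see the coin flip at work, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration: the function name simulate_randomized, the earnings equation, the true effect of 2.0, and the "motivation" trait that the analyst never gets to see. It only shows that a chance assignment leaves the two groups with nearly the same average motivation, so the simple difference in outcomes lands near the true effect.

```python
import random
import statistics


def simulate_randomized(n=20_000, true_effect=2.0, seed=0):
    """Assign treatment by coin flip and compare the two groups.

    All numbers here are invented for illustration.
    """
    rng = random.Random(seed)
    motivation = {True: [], False: []}
    earnings = {True: [], False: []}
    for _ in range(n):
        m = rng.gauss(0, 1)             # a trait we never measure
        treated = rng.random() < 0.5    # the coin flip ignores m entirely
        y = 10 + 3 * m + (true_effect if treated else 0.0) + rng.gauss(0, 1)
        motivation[treated].append(m)
        earnings[treated].append(y)
    # How different are the groups on the unmeasured trait?
    balance_gap = statistics.mean(motivation[True]) - statistics.mean(motivation[False])
    # Simple difference in average outcomes, treated minus control.
    estimate = statistics.mean(earnings[True]) - statistics.mean(earnings[False])
    return balance_gap, estimate


gap, estimate = simulate_randomized()
print(f"gap in unmeasured motivation: {gap:+.3f}")
print(f"estimated effect: {estimate:.3f} (true effect is 2.0)")
```

Try shrinking n to a few dozen and the gap in motivation can be noticeable, which is why the balance holds "over enough people" rather than in every single draw.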
Selection bias is what you get when that coin flip is missing. In the real world, people are rarely assigned to programs at random. They volunteer, they are referred, they qualify, they opt in. Each of those routes is a filter, and the filter is usually related to the very outcome you care about. The motivated enroll in the training. The health-conscious join the gym. The schools with resources adopt the new curriculum. So the treated group and the untreated group differ before the program ever starts, and that head start contaminates every comparison you make afterward.
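Here is a rough sketch of that filter in Python, under loudly stated assumptions: the function name simulate_self_selection is hypothetical, the earnings equation is invented, and the enrollment rule (more motivated people are more likely to sign up) is a simple logistic curve I chose for illustration. The program in this toy world does nothing at all, yet the naive comparison still shows participants well ahead.

```python
import math
import random
import statistics


def simulate_self_selection(n=20_000, true_effect=0.0, seed=0):
    """Let people opt in based on motivation, then compare naively.

    All numbers and the enrollment rule are invented for illustration.
    """
    rng = random.Random(seed)
    enrolled, others = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)
        # The filter: probability of signing up rises with motivation.
        p_enroll = 1 / (1 + math.exp(-2 * motivation))
        joins = rng.random() < p_enroll
        # Motivation also raises earnings, program or no program.
        earnings = 10 + 3 * motivation + (true_effect if joins else 0.0) + rng.gauss(0, 1)
        (enrolled if joins else others).append(earnings)
    # Participants minus non-participants, with no adjustment.
    return statistics.mean(enrolled) - statistics.mean(others)


naive = simulate_self_selection()
print(f"naive participant-vs-nonparticipant gap: {naive:.2f} (true effect is 0.0)")
```

The whole gap comes from who chose to enroll, because the same trait drives both enrollment and earnings.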
This is a close cousin to confounding, but it has its own character. Confounding is usually described as some background variable influencing both treatment and outcome. Selection bias is the version that operates through how people come to be in your groups at all, whether through their own choices or someone else’s sorting. And it does not only happen at enrollment. It can sneak in later, through who stays in the study, who responds, who remains in the data. Anywhere a non-random filter stands between the full population and the people you end up analyzing, selection bias can enter.
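The later-stage version can be sketched the same way. In this hypothetical Python example, assignment really is a coin flip and the program has zero effect, but treated people with low "ability" are more likely to drop out before follow-up. The function name simulate_attrition, the dropout rule, and every number are assumptions for illustration only.

```python
import math
import random
import statistics


def simulate_attrition(n=20_000, seed=0):
    """Randomize treatment, then lose people non-randomly before follow-up.

    The dropout rule and all numbers are invented for illustration.
    """
    rng = random.Random(seed)
    everyone = {True: [], False: []}
    stayed = {True: [], False: []}
    for _ in range(n):
        ability = rng.gauss(0, 1)
        treated = rng.random() < 0.5                  # a genuine coin flip
        outcome = 10 + 3 * ability + rng.gauss(0, 1)  # true effect is zero
        everyone[treated].append(outcome)
        # The late filter: struggling treated people tend to leave the study,
        # while control members drop out at random.
        p_stay = 1 / (1 + math.exp(-2 * ability)) if treated else 0.5
        if rng.random() < p_stay:
            stayed[treated].append(outcome)
    full_diff = statistics.mean(everyone[True]) - statistics.mean(everyone[False])
    observed_diff = statistics.mean(stayed[True]) - statistics.mean(stayed[False])
    return full_diff, observed_diff


full_diff, observed_diff = simulate_attrition()
print(f"difference using everyone randomized: {full_diff:+.2f}")
print(f"difference using only those who stayed: {observed_diff:+.2f}")
```

The randomization was fine. The bias entered afterward, through who remained in the data.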
What makes it so dangerous is that the data look complete and the analysis looks clean. You have a treated group, an untreated group, an outcome, and a tidy difference between them. Nothing in the spreadsheet warns you that the two groups were never alike. The bias is not a flaw in your calculation. It is baked into who is in the rows, and a more careful regression on the same biased groups will simply give you a more precise wrong answer.
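A small Python sketch of the "more precise wrong answer", with the caveat that it uses a plain difference in means and a bigger sample as a stand-in for a more careful analysis, not an actual regression. The function name naive_estimate_with_se, the enrollment rule, and the numbers are all hypothetical. The true effect is zero throughout.

```python
import math
import random
import statistics


def naive_estimate_with_se(n, seed=0):
    """Naive participant-vs-nonparticipant gap and its standard error.

    Enrollment depends on motivation; the program itself does nothing.
    All numbers are invented for illustration.
    """
    rng = random.Random(seed)
    enrolled, others = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)
        p_enroll = 1 / (1 + math.exp(-2 * motivation))
        joins = rng.random() < p_enroll
        earnings = 10 + 3 * motivation + rng.gauss(0, 1)  # true effect is zero
        (enrolled if joins else others).append(earnings)
    diff = statistics.mean(enrolled) - statistics.mean(others)
    # Standard error of a difference in means from two independent groups.
    se = math.sqrt(
        statistics.variance(enrolled) / len(enrolled)
        + statistics.variance(others) / len(others)
    )
    return diff, se


for n in (1_000, 100_000):
    diff, se = naive_estimate_with_se(n)
    print(f"n={n:>7}: naive difference {diff:.2f}, standard error {se:.3f}")
```

More rows shrink the standard error but leave the estimate sitting far from zero, because the extra rows pass through the same filter.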
The instinct to compare participants to non-participants is so natural that it is everywhere, in program reports, marketing claims, and much observational research now sailing under the banner of real-world evidence. Sometimes randomization is impossible or unethical, and an observational comparison is all we have. That is legitimate. But it obligates us to take selection seriously rather than wish it away, and to ask, every time, whether the people we are comparing were ever truly comparable.
Next time I will get to the most popular tool for fighting back, propensity-score matching, which tries to rebuild comparable groups after the fact. For now, the discipline is the question itself.
So here is my question: when you see a before-and-after or a participant-versus-nonparticipant comparison, what is the first thing you check to decide whether the two groups were genuinely alike to begin with?