Regression discontinuity design
We have been working through selection bias, and last time I covered propensity-score matching, which rebuilds comparable groups but only on the characteristics you managed to measure. There is a design that, where it applies, does something matching cannot: it balances even the things you did not measure. It is beloved in the methods literature, and it is also, as you may suspect, harder to use in practice than its fans let on. It is called regression discontinuity.
The setup is specific. A program is assigned by a strict cutoff on some continuous score. Students above a test score get the scholarship; households below an income line get the benefit; patients above a risk threshold get the drug. Regression discontinuity compares the units just above the line to those just below. The student one point over the scholarship cutoff and the student one point under it are, for practical purposes, the same student. One got the award and one did not, for no reason that matters. That arbitrary line behaves like a coin flip in a narrow band, and the jump in the outcome right at the cutoff is the estimated effect.
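To make "the jump in the outcome right at the cutoff" concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the function name `rd_estimate`, the cutoff of 60, the bandwidth of 10, and the simulated effect of 2.0. It fits a separate straight line on each side of the cutoff within the band and reads off the gap between them at the line. A real analysis would choose the bandwidth with more care and report uncertainty.

```python
import numpy as np

def rd_estimate(score, outcome, cutoff, bandwidth):
    """Sharp-cutoff sketch: fit a line on each side within the band and
    return the gap between the two fitted lines at the cutoff."""
    x = np.asarray(score, dtype=float) - cutoff   # center the score at the cutoff
    y = np.asarray(outcome, dtype=float)
    keep = np.abs(x) <= bandwidth                 # only units near the line
    x, y = x[keep], y[keep]
    treated = (x >= 0).astype(float)              # at or above the cutoff gets treated
    # Columns: intercept, jump at cutoff, slope below, change in slope above
    design = np.column_stack([np.ones_like(x), treated, x, x * treated])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Simulated data (hypothetical): outcome rises smoothly with the score,
# plus a jump of 2.0 for everyone at or above a cutoff of 60.
rng = np.random.default_rng(0)
score = rng.uniform(0, 100, 5000)
true_jump = 2.0
outcome = 0.05 * score + true_jump * (score >= 60) + rng.normal(0, 1, 5000)

est = rd_estimate(score, outcome, cutoff=60, bandwidth=10)
print(f"estimated jump at the cutoff: {est:.2f}")
```

The estimate should land near the simulated jump, though not exactly on it, because only the units inside the band contribute.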
This is why methodologists prize it. Near the threshold, the two groups are alike on everything, including the traits you never recorded, because nobody can perfectly control landing just above versus just below an arbitrary line. That is the as-if-randomness matching cannot give you. The idea goes back to a 1960 study of merit awards by Thistlethwaite and Campbell, and it is now regarded as one of the most credible ways to estimate an effect without an actual experiment.
So why the skepticism? Because the conditions that make it work are demanding, and three of them bite hard.
First, you need a real cutoff. The design only applies where treatment is genuinely assigned by a sharp threshold on a measured number. Most programs are not assigned that way, so you cannot simply decide to use regression discontinuity. You can only use it when a law or rule happens to have created a clean line for you. It is an opportunity you find, not a tool you deploy.
Second, the answer is stubbornly local. Even when it works perfectly, it tells you the effect only for the units right at the threshold: the marginal scholarship winner, the barely eligible applicant. It says little about people well above or below the line. You buy excellent internal validity at the cost of a narrow, sometimes awkwardly specific, conclusion.
Third, and most damaging in practice, people can manipulate the line, and they do so precisely where the stakes are high enough to bother. If anyone can nudge their score across the threshold, the units on either side are no longer comparable, and the magic evaporates. This is not hypothetical: researchers have documented teachers bumping student test scores over performance thresholds, and students arranging their credits to clear a scholarship cutoff. There is a standard check, the density test, which looks for a suspicious pile-up of cases on the lucky side of the line, but two-directional manipulation can slip past it.
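The intuition behind the pile-up check can be sketched in a few lines. This is a crude stand-in, not the formal density test: it simply counts cases in a narrow window on each side of the cutoff and asks whether the split is further from 50/50 than a fair coin would plausibly give. The function names, the window of 5 points, and the toy scores are all my assumptions. It also assumes the score distribution is roughly flat across the window, which real data may not satisfy.

```python
import math

def binom_two_sided_p(k, n):
    """Exact two-sided p-value for k heads in n fair coin flips."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1)
                - math.lgamma(n - i + 1) + n * math.log(0.5))
    observed = log_pmf(k)
    # Sum the probability of every outcome at least as unlikely as the observed one
    total = sum(math.exp(log_pmf(i)) for i in range(n + 1)
                if log_pmf(i) <= observed + 1e-9)
    return min(1.0, total)

def pileup_check(scores, cutoff, window):
    """Count cases just below and just above the cutoff and test the split."""
    below = sum(1 for s in scores if cutoff - window <= s < cutoff)
    above = sum(1 for s in scores if cutoff <= s < cutoff + window)
    n = below + above
    p = binom_two_sided_p(above, n) if n > 0 else 1.0
    return below, above, p

# Toy data (hypothetical): ten units at every integer score from 0 to 99.
honest = [s for s in range(100) for _ in range(10)]
# Gamed version: everyone who scored 58 or 59 gets bumped to exactly 60.
gamed = [60 if s in (58, 59) else s for s in honest]

print(pileup_check(honest, cutoff=60, window=5))  # 50 below, 50 above
print(pileup_check(gamed, cutoff=60, window=5))   # 30 below, 70 above
```

Note what this sketch cannot see: if some units are pushed up across the line while a similar number are pushed down, the counts can stay balanced and the check stays quiet.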
Add the fact that the credible comparison lives in a thin slice of data around the cutoff, so you need a large sample to have enough cases near the line, and you can see the shape of the problem. Regression discontinuity is close to a randomized experiment when the cutoff is real, ungamed, well populated with data, and you are content with a local answer. That is a lot of conditions. Its beauty is genuine, and so is its narrowness.
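The sample-size point is easy to see with a quick count. The sketch below uses the same kind of toy data (1,000 hypothetical units spread evenly over scores 0 to 99, with a cutoff of 60; the function name is mine) and shows how few cases remain as the band around the cutoff narrows.

```python
def counts_near_cutoff(scores, cutoff, bandwidth):
    """Return how many units fall just below and just above the cutoff."""
    below = sum(1 for s in scores if cutoff - bandwidth <= s < cutoff)
    above = sum(1 for s in scores if cutoff <= s < cutoff + bandwidth)
    return below, above

# 1,000 hypothetical units: ten at every integer score from 0 to 99
scores = [s for s in range(100) for _ in range(10)]

for bandwidth in (20, 10, 5, 2):
    below, above = counts_near_cutoff(scores, cutoff=60, bandwidth=bandwidth)
    print(f"bandwidth {bandwidth:>2}: {below} below, {above} above")
```

With a band of 2 points, a sample of 1,000 leaves only 20 units on each side to carry the whole comparison.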
So here is my question: have you found a setting where a real cutoff handed you a clean natural experiment, or have the thresholds in your work turned out too soft, too gamed, or too sparse to lean on?