Abstract
As a substitute for randomized controlled trials (RCTs), researchers increasingly rely on propensity score modeling (PSM) to estimate causal effects. However, some warn about the dangers of placing too much blind faith in the abilities of PSM. This study tests the reliability and validity of seven common PSM methods by assessing their ability to remove an artificial selection bias and replicate the results of several RCTs in criminal justice data. Meta-analyses reveal that the average differences between PSM and RCT estimates were relatively small. Ultimately, our findings suggest that PSM can be an effective means of simulating an RCT while also harboring reason for concern. Researchers and policy-makers should approach the use and interpretation of PSM with cautious optimism, as it appears to provide a reliable and valid estimate of the treatment effect most of the time.
Notes
The NACJD is a subsection of the Inter-university Consortium for Political and Social Research (ICPSR), which is a data-sharing, online repository. ICPSR and NACJD partner with the federal government and public funding institutions to ensure that any data collected under the auspices of such funding are publicly available.
Many of these records were duplicates given the similarities in the keyword search terms.
Although seemingly inherent in the term RCT, there were many studies identified by this keyword that did not involve true random assignment over the course of the project. For instance, some researchers may have been required to stop random assignment in the middle of an evaluation due to ethical issues, such as upon evidence that the rehabilitation program under study was effective in improving behavior. These studies may have introduced bias into the treatment effects, and therefore, we opted not to include them in the current investigation.
This number is based on what was needed to prepare the data for PSM. A power analysis indicated that to detect a true difference of a medium effect size (using Cohen’s d = .5), with approximately .80 power, required at least 65 cases per group (Cohen, 1988). To ensure there are at least twice as many comparison cases available for matching once we introduced the 50% selection bias to the treatment group, our analysis required a minimum of 130 cases per group in the original study.
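For readers who wish to reproduce the rough arithmetic, the standard normal-approximation version of this power calculation can be sketched in Python (this is our illustration, not the authors' code; Cohen's (1988) tables, which the note relies on and which correct for the t distribution, give the slightly larger 65 per group):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for detecting a standardized mean
    difference d, using the normal approximation:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = .05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for power = .80
    return 2 * ((z_alpha + z_power) / d) ** 2

# Medium effect (Cohen's d = .5) at .80 power
n = math.ceil(n_per_group(0.5))  # ~63 by this approximation
```

Doubling the per-group requirement to retain enough comparison cases after the 50% selection bias yields the 130-case minimum described in the note.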
We ran our analyses with and without this study, and there were no substantive or statistical differences in the results.
We estimated the standardized percent bias using Austin’s (2011) two formulas: for continuous measures, \(d=\frac{\bar{x}_{treatment}-\bar{x}_{control}}{\sqrt{\frac{s_{treatment}^{2}+s_{control}^{2}}{2}}}\), where \(\bar{x}\) denotes the mean of the respective group (treatment or control) and \(s^{2}\) denotes the sample variance; and for dichotomous measures, \(d=\frac{\hat{P}_{treatment}-\hat{P}_{control}}{\sqrt{\frac{\hat{P}_{treatment}\left(1-\hat{P}_{treatment}\right)+\hat{P}_{control}\left(1-\hat{P}_{control}\right)}{2}}}\), where \(\hat{P}\) denotes the proportion in the measure’s respective group.
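These two formulas translate directly into code; the following Python sketch (our illustration, with hypothetical input values) mirrors Austin's (2011) definitions:

```python
import math

def std_bias_continuous(mean_t, mean_c, var_t, var_c):
    """Austin's (2011) standardized difference for a continuous covariate."""
    return (mean_t - mean_c) / math.sqrt((var_t + var_c) / 2)

def std_bias_dichotomous(p_t, p_c):
    """Austin's (2011) standardized difference for a dichotomous covariate."""
    return (p_t - p_c) / math.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)

# Hypothetical covariate: treatment mean 3.2 (variance 1.1) vs.
# control mean 2.8 (variance 0.9) -> d = 0.40
d_cont = std_bias_continuous(3.2, 2.8, 1.1, 0.9)

# Hypothetical prevalence: 40% in treatment vs. 30% in control
d_dich = std_bias_dichotomous(0.40, 0.30)
```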
The remaining PSM studies used a different form of estimation (covariate balancing propensity scores or machine learning), and nine studies did not mention the technique used to condition the score at all.
We refer readers to Guo and Fraser (2014) for a more detailed description of these techniques and their assumptions.
The minimum number of matched controls was given by \(minimum\;n=\frac{\left(1-\frac{t}{t+c}\right)/2}{\frac{t}{t+c}}\), where t is the number of treatment cases in the biased sample and c is the number of comparison cases from which to draw matches. Similarly, the maximum number of controls to match to each treatment case was given by \(maximum\;n=\frac{2\left(1-\frac{t}{t+c}\right)}{\frac{t}{t+c}}\).
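As a sketch (our illustration, not the authors' code), both bounds reduce algebraically to simple ratios, minimum n = c/2t and maximum n = 2c/t:

```python
def matching_bounds(t, c):
    """Minimum and maximum matched controls per treated case, given
    t treated cases in the biased sample and c available controls.
    Algebraically these reduce to c / (2t) and 2c / t."""
    ratio = t / (t + c)
    minimum_n = ((1 - ratio) / 2) / ratio
    maximum_n = 2 * (1 - ratio) / ratio
    return minimum_n, maximum_n

# Hypothetical sample: 100 treated cases, 300 potential controls
lo, hi = matching_bounds(100, 300)  # 1.5 and 6.0
```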
This is an important distinction because some 1-to-many matching schemes force a matched set to achieve a certain number of cases. If the researcher determines that each treatment case should have three matched controls, then each matched set will invariably have four cases. Any treatment case that cannot achieve three matched controls is either lost (when a caliper is employed) or is forced to match with an otherwise incompatible control (when a caliper is not used; see Ming & Rosenbaum, 2000).
It is still possible that adequate matches are not found for both control and treatment cases.
As its name suggests, the IPTW approach applies a weight to control cases equal to the case’s odds of being in the treatment group (i.e., the inverse of its odds of being in the control group). This weight (\(\omega\)) for each case (\(x\)) is calculated as \(\omega\left(t,x\right)=t+\left(1-t\right)\frac{Pr}{1-Pr}\), where t is the treatment indicator (1 for treated cases, 0 for untreated) and Pr is the propensity score (see Guo & Fraser, 2014, citing Hirano & Imbens, 2001; Hirano, Imbens, & Ridder, 2003).
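As an illustration of this weighting formula (not the authors' code; the propensity scores in the example are hypothetical):

```python
def iptw_weight(t, pr):
    """Weight from the note's formula: treated cases (t = 1) keep a
    weight of 1; control cases (t = 0) are weighted by pr / (1 - pr),
    the odds of treatment given the propensity score pr."""
    return t + (1 - t) * pr / (1 - pr)

w_treated = iptw_weight(1, 0.80)   # always 1 for treated cases
w_control = iptw_weight(0, 0.25)   # 0.25 / 0.75 = 1/3
```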
After accounting for some degree of common support, the control weight is calculated as \(\omega_{s}=\frac{n_{z,s}}{n_{z',s}}\), where \(n_{z,s}\) is the number of units assigned to the treatment group within stratum \(s\) and \(n_{z',s}\) is the number of units in the control group within stratum \(s\) (see Hong, 2010, p. 519). All treatment units receive a weight of 1 (i.e., no reweighting).
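A minimal sketch of this stratum weighting (the counts in the example are hypothetical):

```python
def mmws_control_weight(n_treat_s, n_control_s):
    """Control weight within stratum s (Hong, 2010): the ratio of
    treated to control units in that stratum; treated units keep 1."""
    return n_treat_s / n_control_s

# Hypothetical stratum with 30 treated and 90 control units:
# each control counts one-third toward the weighted comparison
w_s = mmws_control_weight(30, 90)
```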
It is reasonable to expect the original RCT samples to possess relatively low AUC values (e.g., < .600), whereas the biased samples should yield much higher AUC values (e.g., > .800). The closer the AUC value of a PSM sample gets to .500, the less the propensity score can differentiate between the treatment and control cases (i.e., the more balanced the two groups are). To calculate an AUC for the unbiased, experimental data, we fit a logistic regression model to the original dataset with the same measures used in the biased samples’ propensity score. All AUC statistics were calculated using the DeLong et al. (1988) approach and compared using Hanley and McNeil’s (1982) test of significance for independent sample curves. While the AUC does not address misspecification of the logistic regression used to condition the propensity score, we assessed the fit of each logit model individually.
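The AUC point estimate has a simple rank-based interpretation: the probability that a randomly chosen treated case has a higher propensity score than a randomly chosen control. The following Python sketch (our illustration) computes that point estimate only; it does not implement the DeLong et al. (1988) variance estimator or the Hanley-McNeil comparison:

```python
def auc_from_scores(scores_treated, scores_control):
    """AUC point estimate: the probability that a randomly chosen
    treated case scores higher than a randomly chosen control
    (ties count one-half); equivalent to the Mann-Whitney U statistic."""
    wins = 0.0
    for s_t in scores_treated:
        for s_c in scores_control:
            if s_t > s_c:
                wins += 1.0
            elif s_t == s_c:
                wins += 0.5
    return wins / (len(scores_treated) * len(scores_control))

# Perfect separation -> 1.0 (heavily biased sample);
# fully overlapping scores -> 0.5 (balanced groups)
auc_biased = auc_from_scores([0.8, 0.9], [0.1, 0.2])
auc_balanced = auc_from_scores([0.2, 0.8], [0.4, 0.6])
```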
Of equal importance to the reduction of observed bias is the estimation of hidden bias. Because PSM is a quasi-experimental design, there is always the potential that an unobserved covariate would have impacted the findings had it been observed. To test for this, we use the sensitivity analysis proposed by Rosenbaum (2002, 2005), which focuses on the difference between the matched and unmatched outcomes. Specifically, we used the user-written Stata commands mhbounds for dichotomous outcomes and rbounds for continuous outcomes. These tests assess how sensitive the findings are to potential hidden bias by simulating the ability of an unobserved covariate to predict assignment to the treatment condition, expressed as gamma. Gamma is essentially a measure of the degree to which an unobserved measure would have to improve the prediction of treatment assignment relative to the current propensity score models. The larger the value of gamma at which the findings hold, the more robust they are to hidden bias.
Cohen’s d is the standardized difference between two means, and it is calculated as the difference in means between two groups divided by their pooled standard deviation.
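For concreteness, Cohen's d with a pooled standard deviation can be sketched as follows (our illustration; the group values are hypothetical):

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d: difference in group means divided by the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical groups of 65 cases each, means 10 vs. 8 with equal
# SDs of 4 -> d = .5, a medium effect by Cohen's (1988) benchmarks
d = cohens_d(10.0, 8.0, 4.0, 4.0, 65, 65)
```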
If the two 95% CIs did not overlap, we considered them statistically different from one another (Cumming & Calin-Jageman, 2016).
We interpreted r values of .1, .3, and .5 as indicative of small, medium, and large correlations (Cohen, 1988).
For this analysis, we averaged the outcomes within study to produce only one ES per unique sample. This process ensured that our meta-analysis would be weighted by sample size and not by the number of outcomes included.
The random-effects model was selected a priori on conceptual grounds because this method can be used to extend the results of the meta-analysis to a wider population of studies when it cannot be determined with any degree of certainty that the current population of studies is functionally similar (see Bornstein et al., 2009).
Although it is common for meta-analysts to test study heterogeneity with the Q test, the Q statistic only informs about the presence or absence of heterogeneity, not its extent. In contrast, the I² statistic quantifies the degree of heterogeneity between studies and is presented in easily comparable percentage terms. We interpreted I² according to Higgins and Thompson’s (2002) guidelines, where values of around 25%, 50%, and 75% are indicative of low, medium, and high levels of heterogeneity among the ESs, respectively.
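Given Q and its degrees of freedom (the number of studies minus 1), the I² computation is a one-liner; a sketch with hypothetical values (our illustration):

```python
def i_squared(q, df):
    """I-squared (Higgins & Thompson, 2002): the percentage of total
    variation across studies attributable to heterogeneity rather than
    chance, floored at zero when Q <= df."""
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q

# Hypothetical meta-analysis of 11 studies (df = 10) with Q = 20
i2 = i_squared(20.0, 10)  # 50.0, a medium level of heterogeneity
```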
We also conducted the same set of analyses by averaging the ESs within each study first and then making comparisons (n = 11). This process yielded similar findings to those presented here.
The .24 difference in d was established in the education literature and is specific to educational performance in relation to various intervening practices. That said, it is relevant to many, if not all, of the outcomes we used, in that educational measures are typically behavioral and/or scaled attitudinal tests, and some of the study samples involved educational settings for crime prevention programs.
References
Apel, R. J., & Sweeten, G. (2010). Propensity score matching in criminology and criminal justice. In A. R. Piquero & D. Weisburd (Eds.), Handbook of quantitative criminology (pp. 543–562). Springer.
Austin, P. C. (2008). A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in Medicine, 27(12), 2037–2049.
Austin, P. C. (2009). Some methods of propensity-score matching had superior performance to others: Results of an empirical investigation and Monte Carlo simulations. Biometrical Journal, 51(1), 171–184.
Austin, P. C. (2010). Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using many-to-one matching on the propensity score. American Journal of Epidemiology, 172(9), 1092–1097.
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399–424.
Austin, P. C., & Stuart, E. A. (2015). Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28), 3661–3679.
Braga, A. A., Piehl, A. M., & Hureau, D. (2009). Controlling violent offenders released to the community: An evaluation of the Boston Reentry Initiative. Journal of Research in Crime and Delinquency, 46(4), 411–436.
Bornstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. John Wiley & Sons.
Campbell, C. M., Labrecque, R. M., Mohler, M. E., & Christmann, M. J. (2022). Gender and community supervision: Examining differences in violations, sanctions, and recidivism outcomes. Crime & Delinquency, 68(2), 284–325.
Campbell, C. M., Abboud, M. J., Hamilton, Z. K., vanWormer, J., & Posey, B. (2019). Evidence-based or just promising? Lessons learned in taking inventory of state correctional programming. Justice Evaluation Journal, 1(2), 188–214.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Cole, S. R., Platt, R. W., Schisterman, E. F., Chu, H., Westreich, D., Richardson, D., & Poole, C. (2010). Illustrating bias due to conditioning on a collider. International Journal of Epidemiology, 39(2), 417–420.
Cumming, G., & Calin-Jageman, R. (2016). Introduction to the new statistics: Estimation, open science, and beyond. Routledge.
Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053–1062.
Dehejia, R. H., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84(1), 151–161.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 44(3), 837–845. https://doi.org/10.2307/2531595
Diamond, A., & Sekhon, J. S. (2012). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. The Review of Economics and Statistics, 95(3), 932–945.
Dong, N., & Lipsey, M. W. (2018). Can propensity score analysis approximate randomized experiments using pretest and demographic information in pre-k intervention research? Evaluation Review, 42, 34–70.
Freedman, D. A., & Berk, R. A. (2008). Weighting regressions by propensity scores. Evaluation Review, 32(4), 392–409.
Gaes, G. G., Bales, W. D., & Scaggs, S. J. A. (2016). The effect of imprisonment on recommitment: An analysis using exact, coarsened exact, and radius matching with the propensity score. Journal of Experimental Criminology, 12, 143–158.
Gottfredson, D. C., Cook, T. D., Gardner, F. E., Gorman-Smith, D., Howe, G. W., Sandler, I. N., & Zafft, K. M. (2015). Standards of evidence for efficacy, effectiveness, and scale-up research in prevention science: Next generation. Prevention Science, 16(7), 893–926.
Guo, S., & Fraser, M. W. (2014). Propensity score analysis: Statistical methods and applications (2nd ed.). SAGE Publications Inc.
Hamilton, Z. K., Campbell, C. M., van Wormer, J., Kigerl, A., & Posey, B. (2016). The impact of swift and certain sanctions: An evaluation of Washington State’s policy for offenders on community supervision. Criminology & Public Policy, 15(4), 1009–1072.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36. https://doi.org/10.1148/radiology.143.1.7063747
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618.
Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11), 1539–1558.
Hill, J. (2008). Discussion of research using propensity-score matching: Comments on ‘A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003’ by Peter Austin. Statistics in Medicine, 27(12), 2055–2061.
Hirano, K., & Imbens, G. W. (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes Research Methodology, 2(3–4), 259–278.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.
Hong, G. (2010). Marginal mean weighting through stratification: Adjustment for selection bias in multilevel data. Journal of Educational and Behavioral Statistics, 35(5), 499–531.
Hong, G. (2012). Marginal mean weighting through stratification: A generalized method for evaluating multivalued and multiple treatments with nonexperimental data. Psychological Methods, 17(1), 44.
Hong, H., Aaby, D. A., Siddique, J., & Stuart, E. A. (2019). Propensity score-based estimators with multiple error-prone covariates. American Journal of Epidemiology, 188(1), 222–230.
Imai, K., & Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (statistical Methodology), 76(1), 243–263.
Kim, R. H., & Clark, D. (2013). The effect of prison-based college education programs on recidivism: Propensity Score Matching approach. Journal of Criminal Justice, 41(3), 196–204.
King, G., & Nielsen, R. (2016). Why propensity scores should not be used for matching. Political Analysis, 27(4), 435–454.
Labrecque, R. M., Mears, D., & Smith, P. (2019). Gender and the effect of disciplinary segregation on prison misconduct. Advance online publication.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4), 604–620.
Loughran, T. A., Wilson, T., Nagin, D. S., & Piquero, A. R. (2015). Evolutionary regression? Assessing the problem of hidden biases in criminal justice applications using propensity scores. Journal of Experimental Criminology, 11(4), 631–652. https://doi.org/10.1007/s11292-015-9242-y
Luellen, J. K., Shadish, W. R., & Clark, M. H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review, 29(6), 530–558.
Lunt, M. (2014). Selecting an appropriate caliper can be essential for achieving good balance with propensity score matching. American Journal of Epidemiology, 179(2), 226–235.
MacDonald, J., Stokes, R. J., Ridgeway, G., & Riley, K. J. (2007). Race, neighbourhood context and perceptions of injustice by the police in Cincinnati. Urban Studies, 44(13), 2567–2585.
McCaffrey, D., Ridgeway, G., & Morral, A. (2004). Propensity score estimation with boosted regression for evaluating adolescent substance abuse treatment. Psychological Methods, 9(4), 403–425.
McNiel, D. E., & Binder, R. L. (2007). Effectiveness of a mental health court in reducing criminal recidivism and violence. American Journal of Psychiatry, 164(9), 1395–1403.
Ming, K., & Rosenbaum, P. R. (2000). Substantial gains in bias reduction from matching with a variable number of controls. Biometrics, 56(1), 118–124.
Nagin, D. S., & Sampson, R. J. (2019). The real gold standard: Measuring counterfactual worlds that matter most to social science and policy. Annual Review of Criminology, 2(1), 123–145.
Peikes, D. N., Moreno, L., & Orzol, S. M. (2008). Propensity score matching: A note of caution for evaluators of social programs. The American Statistician, 62(3), 222–231.
Ridgeway, G., & McCaffrey, D. F. (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 540–543.
Rosenbaum, P. R. (1984). From association to causation in observational studies: The role of tests of strongly ignorable treatment assignment. Journal of the American Statistical Association, 79(385), 41–48.
Rosenbaum, P. R. (2002). Observational studies. Springer.
Rosenbaum, P. R. (2005). Heterogeneity and causality. The American Statistician, 59(2), 147–152. https://doi.org/10.1198/000313005X42831
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–38.
Rubin, D. B. (2006). Matched sampling for causal effects. Cambridge University Press.
Shadish, W. R. (2013). Propensity score analysis: Promise, reality and irrational exuberance. Journal of Experimental Criminology, 9(2), 129–144.
Shadish, W. R., Clark, M. H., Steiner, P. M., & Hill, J. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484), 1334–1350.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Smith, J. A., & Todd, P. E. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125(1–2), 305–353.
Smith, J., & Todd, P. (2001). Reconciling conflicting evidence on the performance of propensity-score matching methods. American Economic Review, 91(2), 112–118.
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15(3), 250–267.
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science : A Review Journal of the Institute of Mathematical Statistics, 25(1), 1–21.
Stuart, E. A., Lee, B. K., & Leacy, F. P. (2013). Prognostic score-based balance measures can be a useful diagnostic for propensity score methods in comparative effectiveness research. Journal of Clinical Epidemiology, 66(8), S84-S90.e1.
ten Bensel, T., Gibbs, B., & Lytle, R. (2014). A propensity score approach towards assessing neighborhood risk of parole revocation. American Journal of Criminal Justice, 40(2), 377–398.
Ury, H. K. (1975). Efficiency of case-control studies with multiple controls per case: Continuous or dichotomous data. Biometrics, 31(3), 643–649.
van Wormer, J. G., & Campbell, C. (2016). Developing an alternative juvenile programming effort to reduce detention overreliance. Journal of Juvenile Justice, 5(2), 12.
Vito, G. F., Higgins, G. E., & Tewksbury, R. (2017). The effectiveness of parole supervision: Use of propensity score matching to analyze reincarceration rates in Kentucky. Criminal Justice Policy Review, 28(7), 627–640.
Wooldridge, J. M. (2005). Violating ignorability of treatment by controlling for too many factors. Econometric Theory, 21(5), 1026–1028.
Acknowledgements
The authors would like to thank Shenyang Guo, Zachary Hamilton, Stephen Vaisey, and Ozcan Tunalilar for their valuable feedback during this process.
Funding
This project was supported by a grant from the National Institute of Justice (Award #2016-R2-CX-0030). The opinions, findings, and conclusions expressed in this article are those of the authors and do not necessarily reflect those of the Department of Justice.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
ESM 1
(DOCX 15.5 kb)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Campbell, C.M., Labrecque, R.M. Panacea or poison: Assessing how well basic propensity score modeling can replicate results from randomized controlled trials in criminal justice research. J Exp Criminol 20, 229–253 (2024). https://doi.org/10.1007/s11292-022-09532-y