Empirical Benchmarks for Planning and Interpreting Causal Effects of Community College Interventions

Randomized controlled trials (RCTs) are an increasingly common research design for evaluating the effectiveness of community college (CC) interventions. However, when planning an RCT evaluation of a CC intervention, there is limited empirical information about what size effects an intervention might reasonably achieve, which can lead to under- or over-powered studies. Relatedly, when interpreting results from an evaluation of a CC intervention, there is limited empirical information with which to contextualize the magnitude of an effect estimate relative to the effects observed in past evaluations. We provide empirical benchmarks to help with the planning and interpretation of community college evaluations. To do so, we present findings across well-executed RCTs of 39 CC interventions that are part of a unique dataset known as The Higher Education Randomized Controlled Trials (THE-RCT). The analyses include 21,163 to 65,604 students (depending on outcome and semester) enrolled in 44 institutions. Outcomes include enrollment, credits earned, and credential attainment. Effect size distributions are presented by outcome and semester. For example, across the interventions examined, the mean effect on cumulative credits earned after three semesters is 1.14 credits; effects around 0.16 credits are at the 25th percentile of the distribution, and effects around 1.69 credits are at the 75th percentile. This work begins to provide empirical benchmarks for planning and interpreting effects of CC evaluations. A public database with effect sizes is available to researchers (https://www.mdrc.org/the-rct-empirical-benchmarks).

The present paper1 presents one type of empirical benchmark that can be used both for planning evaluations of CC interventions, like the comprehensive support program above, and for interpreting causal effect estimates from evaluations of CC interventions, like DPP. Conceptually, empirical benchmarks help frame the magnitude of an intervention's effects relative to something else. In educational evaluations, examples of such comparative benchmarks include: (a) the distribution of estimated effects from evaluations of other related interventions; (b) normative expectations for educational progress (e.g., typical student growth on achievement tests during the year); (c) prevailing gaps in educational outcomes (e.g., racial inequality in academic achievement); (d) policy-relevant performance thresholds (e.g., the probability of being proficient on a state test); and (e) cost-effectiveness ratios (for examples, see Baird & Pane, 2019; Hill et al., 2008; Kraft, 2020). The present paper focuses on the first type of empirical benchmark.
In doing so, we aim to support researchers planning an evaluation, and the funders who might support them, in ensuring that evaluations of CC interventions are adequately powered to detect effects that might realistically be achieved, given what has been observed in the past. We also aim to support researchers and policymakers seeking to interpret an estimated effect of a CC intervention by providing one valuable piece of context: how effective their intervention is relative to other rigorously evaluated CC interventions.

Background
CCs play a vital role in U.S. postsecondary education. In fall 2021, 4.5 million students attended public two-year colleges, representing 29% of U.S. undergraduates.2 Despite the unprecedented access these colleges provide, rates of degree attainment remain low: only 31% of first-time, full-time students seeking a degree or certificate whose first postsecondary school is a public two-year college graduate within three years (Integrated Postsecondary Education Data System [IPEDS] Trend Generator, 2021). To address these issues, policymakers, foundations, and college administrators are beginning to embrace the need for causal evidence on the effectiveness of postsecondary programs, policies, and practices.
In 2002, the U.S. Department of Education created the Institute of Education Sciences (IES) as part of the Education Sciences Reform Act, which has provided unprecedented funding for educational evaluations with strong potential to support causal conclusions. Thus began a transformation in higher education evaluation. Two decades later, MDRC alone has conducted 31 RCTs of 41 interventions in over 45 (mostly community) colleges throughout the United States, including 67,400 students, mostly from low-income backgrounds (Diamond et al., 2021). Many more RCTs in higher education have been conducted by others. For example, the What Works Clearinghouse (WWC) has published reviews of 68 large-scale (more than 350 participants) postsecondary RCTs that meet its evidence standards without reservations.3
While the number of RCTs in CCs has grown dramatically over the past 20 years, the information needed to plan a high-quality RCT and to interpret its findings in this context has not. When planning an RCT, it is important to consider the size of effect that the intervention might reasonably achieve. Researchers can use this information when setting the target sample size needed to ensure a study is adequately powered. Relatedly, when interpreting RCT findings, it is important for researchers to convey the practical significance of effect estimates to policymakers and practitioners to help inform their decision-making. In both scenarios, researchers need information that will help them consider what effect sizes are meaningful and policy relevant in the context of their study.
For many years, evaluators would plan RCTs and interpret their findings using the benchmarks proposed (as somewhat of a last resort) by Cohen for "small" (0.2), "medium" (0.5), and "large" (0.8) effect sizes. More recently in K-12 research, however, a powerful approach to characterizing the magnitude of an educational intervention's effect is to compare it with previously estimated effects of interventions in a similar context (Hill et al., 2008). Based on an analysis of prior K-12 studies, Hill et al. (2008) found that the average estimated effect size was 0.07, with a standard deviation of 0.32. Assuming approximate normality of the distribution of estimated effects, these findings indicate that, in the elementary school context on broad standardized tests, only around 1% of the examined estimated effect sizes were large by Cohen's rules of thumb (suggesting those rules of thumb are probably not appropriate in the K-12 context). The empirical findings of Hill et al. (2008) have changed the way K-12 researchers plan studies (expected effect sizes have decreased) and interpret their findings (what used to be thought of as "small" effects are now taken quite seriously).
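The "around 1%" figure follows directly from the normal approximation. The sketch below (a minimal illustration using only the Python standard library; the function name is ours, not from the paper) computes the share of a normal distribution with mean 0.07 and standard deviation 0.32 that lies at or above Cohen's "large" threshold of 0.8.

```python
import math

def normal_tail(x, mean, sd):
    """P(X >= x) for X ~ Normal(mean, sd), computed via the
    complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

# Share of K-12 effect sizes that would count as "large" (>= 0.8)
# under Cohen's rule of thumb, given the distribution reported by
# Hill et al. (2008): mean 0.07, SD 0.32.
share_large = normal_tail(0.8, 0.07, 0.32)
print(f"{share_large:.3f}")  # roughly 0.011, i.e., around 1%
```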
Based on Hill et al.'s (2008) work and numerous studies since, K-12 researchers now have access to empirical benchmarks for standardized effect size estimates based on results from real-world RCTs (Baird & Pane, 2019; Bloom et al., 2008; Hill et al., 2008; Kraft, 2020, 2023; Lipsey et al., 2012; Wolf & Harbatkin, 2022). These benchmarks can be used to situate an effect estimate within the distribution of effect estimates observed in K-12 evaluations, by subject area and grade level, and even by other factors affecting the magnitude of intervention effects, such as outcome type and study design. Although these normative benchmarks do not necessarily inform what effects are practically meaningful to decision-makers (for this purpose, Hill et al., 2008, and others have proposed other benchmarks, discussed in the conclusion), they are very useful for grounding expectations about what might realistically be attainable when conducting power calculations or interpreting the magnitude of a study's effects (Konstantopoulos & Hedges, 2008, p. 1615).
In stark contrast, such information is not available in postsecondary education. This may partially explain why many postsecondary RCTs appear to be underpowered.4 Thus, the purpose of the present paper is to provide normative empirical benchmarks for postsecondary interventions, like those that already exist in K-12 education. Our focus is on the intervention-level distribution of average effects (mean, standard deviation, and percentiles) across 39 postsecondary (mostly CC) interventions evaluated using an RCT. We examine the distribution of effects on multiple outcomes typically examined in CC studies (enrollment, credit accumulation, degree completion) and by semester (from one through six semesters after individuals joined each study). The distributions presented can be used for planning future evaluations, by helping researchers determine what sample size is needed to detect effects that might plausibly be achieved, given the intervention they intend to evaluate and the outcomes they plan to measure. They can also help inform the interpretation of impact findings from evaluations like the DPP example introduced earlier.
In addition to providing these benchmarks in a new context, our methodological approach diverges from past efforts in a way that we believe represents an important improvement. Historically, researchers have presented the distribution of estimated effects from a group of studies for benchmarking. However, due to "estimation error variance," the distribution of estimated effects is known to vary more, and sometimes much more, than the distribution of true effects (Bloom et al., 2017, p. 824). The approach we use aims to remove estimation error variance from the distribution of estimated effects, allowing us to estimate how much true effects vary among interventions and thus producing more appropriate benchmarks for planning and interpretation purposes.
With respect to planning, and more specifically power calculations, the "minimum detectable effect" (MDE) would more accurately be called the "minimum detectable true effect" (MDTE); it is not the minimum detectable estimated effect (Somers et al., 2023). Consequently, when planning a study and considering whether the MDTE is of a magnitude that might be achieved by the intervention under study, the relevant benchmark is not "where does the MDTE lie on the distribution of estimated effects from past studies?" Instead, it is "where does the MDTE lie on the estimated distribution of true effects from past studies?" Our proposed benchmarks more accurately support answering this question.
For interpretation purposes, what again seems most relevant is our best understanding of where the effect of an intervention under consideration is situated within the distribution of true effects from past evaluations, not within the wider distribution of effect estimates from past evaluations.5 Our proposed approach should mitigate some of the challenges associated with "promising trials bias" and "the winner's curse" (Simpson, 2022; Sims et al., 2022). We illustrate how the distinction between the distribution of estimated effects and the estimated distribution of true effects can affect study planning and interpretation by comparing the two approaches on one of our outcomes of interest.
In the next section, we describe the methodology used in our analysis, including data sources, measures, estimands, and statistical models. In the following section, we share results. In the last section, we discuss key implications and limitations of those findings.

Studies and Analysis Samples
Our analyses focus on evaluations of CC interventions where the identification strategy allows for an unbiased estimator of intervention effects. In doing so, we ensure that the distribution of effects is not confounded with cross-study variation in degrees of bias. In addition, we aimed to identify evaluations that are representative of the CC interventions that have been rigorously evaluated to date, to optimize the comprehensiveness and generalizability of the benchmarks.
Accordingly, the findings in this paper are based on 30 well-executed RCTs of postsecondary interventions conducted by MDRC, which represent all but one of the postsecondary RCTs that MDRC led from 2003 to 2019 (the one excluded RCT had limited follow-up). We present findings on student outcomes for the first six semesters after random assignment, a common window for assessing CC degree completion rates, thereby making it possible to examine the pattern of variation in effect sizes across semesters of follow-up as well as across interventions.
Importantly, the impact findings from these RCTs are causally robust. Twenty-seven of these RCTs have been reviewed by the U.S. Department of Education's WWC, and all have met the WWC's evidence standards without reservations. The three RCTs that have not yet been reviewed almost certainly meet the same standards, given their similar designs, analytic approaches, and attrition rates.
Furthermore, the RCTs used for this paper comprise a sizable portion of all large-scale postsecondary RCTs conducted in the United States. Only 68 large-scale (more than 350 participants) postsecondary RCTs have been reviewed and met WWC evidence standards without reservations. Thus, the findings from this paper likely provide a representative picture of the distribution of effect sizes in well-executed CC RCTs.6
Some of the 30 RCTs used in this paper are multi-arm trials. These multi-arm trials evaluate the effect of more than one intervention in a single RCT by, for example, randomly assigning students to a control group, intervention A, or intervention B. Thus, although there were 30 RCTs in our sample, they are used to estimate the distribution of effects of 39 interventions.7 The resulting full study sample includes 39 postsecondary interventions and a total of 65,604 students.
As shown in Table 1, the 39 studied interventions vary in their key components (e.g., advising, tutoring, financial supports) and duration (from one semester to three years). For more information on the key components of each individual intervention, see Appendix Tables A2 and A3. For even greater detail, Appendix Table A1 provides links to original reports about each intervention.
The eligible population in most studied interventions was students enrolled in the colleges (as opposed to prospective students or applicants); in two thirds of interventions, eligibility was limited to new or first-year students. (One intervention, a financial aid reform, did not result in any increase in the amount of aid distributed; it is therefore the only intervention with none of the seven intervention components that were coded.) Reflecting national patterns in two-year colleges, most students (77%) in the average study are younger than 25 (see Table 3). Almost two thirds of students (60%) in the average study are female; the average percentage of Black students is 25% and the average percentage of Hispanic students is 36%, both higher than in the average two-year college in the United States. There is substantial variation in the characteristics of students across the studies; for example, the percentage of female students ranges from 0% to 92%, and the percentage of White students ranges from 0% to 60%. (Note: interventions are equally weighted. Sources: MDRC calculations using data from THE-RCT and from reports and journal articles; a list of reports and articles can be found in Appendix Table A1.)
The analytic sample used to estimate the distribution of effects varies across outcomes and semesters depending on data availability, ranging from 20 to 39 interventions and from 21,163 to 65,604 students.8 Table 4 shows the percentage of individuals and interventions that are included in the analysis of effects for each outcome and semester, relative to the full study sample. In semesters 1-3, at least 82% of studied interventions and students are included in the analysis; however, in semesters 4-6, the number of studied interventions with longer-term follow-up data drops. For example, about two thirds of studies (26 studied interventions) collected enrollment data in semester 6, and half of studies collected credits and degree completion data (22 and 20 studied interventions, respectively) through semester 6. For this reason, caution is needed when interpreting results for outcomes beyond semester 3: the distribution of effects is very likely biased upward due to follow-up selection bias. That is, interventions with more promising short-term impacts were more likely to have longer-term follow-up data (see Bailey & Weiss, 2022). Because the data used for the present analysis are from actual RCTs, nearly all of which focus on students who were already enrolled and who agreed to participate in the study, the distribution of effects presented in this paper may not generalize to the effects that one would observe if these interventions were offered to all CC students at these colleges (or in the United States), nor to prospective students or applicants to these colleges. However, the findings are likely to represent the range of effect sizes that researchers will encounter for the subset of colleges and enrolled students who are interested in these types of interventions.

Data Sources and Measures

Data Sources
The data for this paper are from MDRC's The Higher Education Randomized Controlled Trials Restricted Access File (THE-RCT RAF; Diamond et al., 2021). THE-RCT RAF is a restricted-access student-level database created by MDRC and housed at the University of Michigan's Inter-university Consortium for Political and Social Research (ICPSR). The database includes all RCTs that MDRC conducted in postsecondary education from 2003 to 2019 and is available to qualified researchers with few restrictions.
The data include information about each RCT's design (e.g., study name, experimental group indicators, and random assignment block indicators) plus students' characteristics and their academic outcomes by semester (enrollment, credit accumulation, degree completion). These data were originally obtained from three sources: (a) college (or college system) records, which include demographic records, course transcripts, and degree completion; (b) the National Student Clearinghouse (NSC), which maintains information on enrollment and degree completion from nearly 3,600 colleges that combined enroll over 97% of the nation's college students (https://www.studentclearinghouse.org/about/); and (c) study-administered student surveys implemented at the time of random assignment, to collect information on student characteristics that are not available from college records.9

Outcome Measures
The outcomes explored in the present analysis are the focus of most CC interventions:
• Persistence (enrollment). To make academic progress, students must continue to enroll in college over time. We examine the percentage of students who were enrolled in postsecondary education each semester, as well as the cumulative number of semesters enrolled by a given semester.10
• Total credits accumulated. Total credits accumulated is a critical indicator of students' academic progress toward a degree, which typically requires at least 60 college-level credits (in the CC setting). Consequently, we examine total credits earned by semester and cumulatively during students' first six semesters after random assignment.11
• Degree or certificate completion. Our measure of this marker is the percentage of students who earned a postsecondary credential by four to six semesters after random assignment, common time frames for measuring CC degree completion.
Data on the number of credits earned are from college or system records. Information on enrollment and degree completion is from college or system records or from the NSC, depending on the study and the student. To maximize data availability across studies, these outcomes are derived using all data sources. When only college/system outcome data are available for a study, these measures are defined as enrollment or degree completion at the college/system of random assignment. When both college/system and NSC data are available for a study, these outcomes are defined as enrollment or degree completion at any college/university covered by the two sources. This means that enrollment and degree completion are measured somewhat differently across studies (either at nearly any college in the nation or only at the college/system where the study took place, depending on data availability for the study).12
Attrition of sample members does not present a problem in our analyses. For enrollment, credit accumulation, and degree completion, data are available for nearly every student in the study, if the relevant information (e.g., transcript records for credit accumulation) was collected for that study.13 When the college or NSC data include no records for a given student, we treat that student as not being enrolled, and therefore as earning zero credits and not earning a degree. (The sample size reductions shown in Table 3 across semesters are due to a shorter follow-up period for some studies or cohorts, rather than to sample attrition for other reasons.)
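The coding rule just described (a student absent from every data source is treated as not enrolled, earning zero credits, and not earning a degree) can be sketched as follows. The record layout, student IDs, and helper function here are invented for illustration; they are not THE-RCT's actual data structures.

```python
# Hypothetical student records keyed by ID; a student missing from both
# the college/system source and the NSC source simply has no entry.
college_records = {"s1": {"credits": 9, "enrolled": True, "degree": False}}
nsc_records = {"s2": {"enrolled": True, "degree": False}}  # NSC lacks credits
study_sample = ["s1", "s2", "s3"]  # s3 appears in neither source

def semester_outcomes(student_id):
    """Merge sources; absence from all sources is coded as zeros,
    not as missing data, so attrition does not shrink the sample."""
    rec = college_records.get(student_id) or nsc_records.get(student_id)
    if rec is None:
        return {"credits": 0, "enrolled": False, "degree": False}
    return {"credits": rec.get("credits", 0),
            "enrolled": rec["enrolled"],
            "degree": rec["degree"]}

outcomes = {sid: semester_outcomes(sid) for sid in study_sample}
print(outcomes["s3"])  # s3 is coded as not enrolled with zero credits
```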
Supplemental findings in a database associated with this paper include an examination of the distribution of effects on more narrowly defined versions of the main outcomes, including full-time enrollment, developmental credits earned, and college-level credits earned, as well as credits attempted (total, developmental, and college-level). As an additional supplemental analysis, we also examine effects on students' performance in their courses as measured by their grade point average (GPA) on a 4-point scale. Information on GPA is from college records. Impacts on GPA are challenging to evaluate in postsecondary impact studies because GPA is only defined for students who are still enrolled. This means that if an intervention has an impact on enrollment, the estimator of the effects on GPA could be biased (and, in all cases, estimated effects on GPA do not apply to unenrolled students). For the present analysis, comparing impacts on GPA across follow-up semesters is also challenging because non-enrollment increases over time. Therefore, the findings for GPA in the online database are limited to the first follow-up semesters.

Parameters
Before delving into how we estimate key parameters of interest from our data, we first define them. Let B_j be the true average effect of intervention j. Our analyses begin by estimating two parameters that summarize the cross-intervention distribution of intervention mean effects: its mean (β) and its standard deviation (τ). By definition:

β = E(B_j) (1)

τ = SD(B_j) = √Var(B_j) (2)

β and τ provide a summary of the central tendency and spread, respectively, of the distribution of average true effects across interventions.
In addition to these two primary summary statistics, we aim to characterize the distribution of true effects by identifying points on the distribution that correspond with percentiles of the distribution.

Important Context about Distributions of Estimated Effects
Much of the literature on empirical benchmarks in K-12 education (and other fields) starts with a group of studies and the estimated effects of the interventions from those studies. Empirical benchmarks often include the mean, median, standard deviation, and various percentiles of the distribution of the estimated effects from those studies (for examples, see Bloom et al., 2008; Hill et al., 2008; Kraft, 2020; Lipsey et al., 2012). Notice the emphasis on estimated.
Define B̂_j^Orig as an original estimate of the average effect of intervention j, estimated using an unbiased estimator (e.g., a simple difference-in-means estimator or a regression-based estimator). Such estimates (B̂_j^Orig) are a combination of the true effect (B_j) and random estimation error (r_j). That is:

B̂_j^Orig = B_j + r_j (3)

Consequently, the spread (or variance) of the distribution of effect estimates is a combination of variation in the true effects of the interventions (Var(B_j)) plus independent variation in estimation error (Var(r_j)). Accordingly, the distribution of effect estimates is expected to be wider (and sometimes much wider) than the distribution of true effects (for more details, see Bloom et al., 2017, and Hedges & Pigott, 2001). That is:

Var(B̂_j^Orig) = Var(B_j) + Var(r_j) (4)

which can be re-written as:

Var(B_j) = Var(B̂_j^Orig) − Var(r_j) (5)

Recall from Equation 2 that τ, the standard deviation of the cross-intervention distribution of true average effects, is a key parameter of interest. The standard deviation of B̂_j^Orig, as is commonly presented in the literature, overestimates the spread of the distribution of true effects. Relatedly, percentiles of the distribution of estimated effects present an inaccurate depiction of percentiles of the distribution of true effects. We attempt to address this issue with our chosen estimators.
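This variance decomposition is easy to see in a small simulation. In the sketch below (standard library only; the parameter values are invented for illustration, not estimates from THE-RCT), true intervention effects are drawn with standard deviation τ, independent estimation error is added, and the true spread is recovered by subtracting the average error variance from the variance of the estimates.

```python
import random
from statistics import pvariance

random.seed(7)
TAU = 0.5        # SD of true intervention effects (assumed)
ERR_SD = 0.4     # SD of estimation error in each study (assumed)
N = 200_000      # many simulated interventions, to tame Monte Carlo noise

true_effects = [random.gauss(0.0, TAU) for _ in range(N)]
estimates = [b + random.gauss(0.0, ERR_SD) for b in true_effects]

# The estimates vary more than the true effects:
var_est = pvariance(estimates)  # close to TAU**2 + ERR_SD**2 = 0.41
# ...so subtracting the error variance recovers the true spread:
tau_recovered = (var_est - ERR_SD**2) ** 0.5  # close to TAU = 0.5

print(round(var_est, 3), round(tau_recovered, 3))
```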

Estimators
To examine the distribution of true effects across studied interventions, we use the fixed-intercept, random treatment coefficient (FIRC) model described in detail by Bloom et al. (2017) for studying cross-site impact variation.14 The FIRC approach was used by Weiss et al. (2017) for their secondary analysis of data from 16 multi-site RCTs of education and training programs.
Specifically, we use the following 2-level hierarchical linear model to estimate β and τ:

Level 1 (Students): Y_ij = Σ_k α_k S_kij + B_j T_ij + e_ij

Level 2 (Studied Interventions): B_j = β + b_j

In this model, Y_ij is the value of the outcome (e.g., credits earned) for individual i in studied intervention j; the S_kij are random assignment block indicators (one for each block in each studied intervention), set equal to one if student i in studied intervention j was randomly assigned in that block and zero otherwise; and T_ij equals one if individual i in studied intervention j was assigned to treatment and zero otherwise. The blocks account for the fact that individuals were randomly assigned within blocks (e.g., colleges and cohorts) and that the proportion of sample members randomized to treatment can vary across blocks. The model does not control for students' baseline characteristics (like their gender and age). Doing so would not appreciably improve the precision of estimated effects, because available characteristics are only weakly correlated with outcomes (Somers et al., 2023), and very few baseline variables are consistently available across studies.
An important feature of the FIRC model, relevant to the purpose of this paper, is that it allows intervention-specific effect coefficients (B_j) to vary randomly across studied interventions. The B_j's are modeled as representing a cross-intervention population distribution with a mean of β and a standard deviation of τ. Hence, the intervention-level random error term, b_j, has a mean of zero and a standard deviation of τ. Critically, when estimating τ, FIRC accounts for the estimation error associated with each intervention-specific effect estimate.15 Bloom et al. (2017) provide further information about this model, and Raudenbush and Bloom (2015) explore its properties.
For the present analysis, the FIRC model is fitted to the analysis samples separately for each outcome and semester. Estimates of β (average) and τ (standard deviation) are key summaries of the findings.
To further aid with interpretation, we also provide percentiles of the distribution of intervention effects. As previously noted, because of estimation error, a key challenge with calculating percentiles is that the distribution of estimated study-specific effects (e.g., the B̂_j^Orig reported in the original studies) exaggerates the amount of true cross-study variation in effects (Bloom et al., 2017; Raudenbush & Bloom, 2015). We address this problem using a two-pronged approach proposed by Bloom et al. (2017). First, we use the results of the FIRC model to compute the empirical Bayes shrinkage impact estimate for each intervention, B̂_j^EB, which is a weighted average of the intervention-specific average impact estimate, B̂_j^Orig, and the overall average impact estimate, β̂, where the weight of the study-specific estimate is based on its reliability:

B̂_j^EB = λ_j B̂_j^Orig + (1 − λ_j) β̂, where λ_j = τ̂² / (τ̂² + SE_j²) (6)

This means that for small-sample studies, where effects are estimated less reliably, the empirical Bayes estimates will be "shrunken" toward the grand mean impact estimate. The shrinkage factor is based on an estimate of reliability. The resulting distribution of empirical Bayes effect estimates varies less than the best estimate of the variance of the distribution of true effects (Raudenbush & Bryk, 2002, p. 88).

15 The model also allows the variability of level-1 residuals to differ by treatment group: the individual-level random error term, e_ij, is assumed to have a mean of zero and a variance that can differ between treatment group members and control group members.
That is, Var(B̂_j^EB) < τ̂². Thus, the variance of the empirical Bayes estimates will typically understate the cross-study variance of true mean program effects and, by extension, be smaller than the estimated cross-study variation from the FIRC model (τ̂²). Hence, as a second step, we calculate an "adjusted" empirical Bayes estimate for each study to compensate for this over-shrinkage (Bloom et al., 2017):

B̂_j^AEB = β̂ + γ (B̂_j^EB − β̂) (7)

where γ is an adjustment factor that stretches the distance between the empirical Bayes estimates and the mean impact estimate β̂:

γ = τ̂ / SD(B̂_j^EB) (8)

This adjustment inflates the variance of the empirical Bayes estimates to be exactly equal to the estimated variance of true program effects (τ̂²) from the FIRC model. The percentiles presented in this paper are based on the adjusted empirical Bayes estimates, B̂_j^AEB. Study-specific estimates are also available in a public-use dataset created for this paper (available at https://www.mdrc.org/the-rct-empirical-benchmarks).
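The two steps can be sketched in a few lines. All numbers below are invented for illustration; β̂, τ̂, and the standard errors are not estimates from THE-RCT.

```python
from statistics import pstdev

beta_hat = 1.0   # estimated grand mean impact (assumed)
tau_hat = 0.8    # estimated SD of true impacts from the FIRC model (assumed)

# (original impact estimate, standard error) for five hypothetical studies
studies = [(2.5, 1.2), (1.4, 0.3), (0.2, 0.6), (-0.5, 0.9), (1.1, 0.2)]

# Step 1: empirical Bayes shrinkage. Each original estimate is weighted
# by its reliability tau^2 / (tau^2 + SE^2), pulling noisy estimates
# toward the grand mean.
eb = []
for b_orig, se in studies:
    reliability = tau_hat**2 / (tau_hat**2 + se**2)
    eb.append(reliability * b_orig + (1 - reliability) * beta_hat)

# Step 2: stretch the EB estimates away from the grand mean by
# gamma = tau_hat / SD(EB), so their SD equals tau_hat exactly.
gamma = tau_hat / pstdev(eb)
adjusted = [beta_hat + gamma * (e - beta_hat) for e in eb]

print(round(pstdev(adjusted), 6))  # equals tau_hat by construction
```

Note that the stretching in step 2 rescales distances uniformly, so the standard deviation of the adjusted estimates matches τ̂ regardless of where the estimates sit relative to β̂.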

Distribution of Impact Estimates-An Example Using Three Estimators
Before presenting our main results, it is useful to examine the distribution of intervention impact estimates for a single outcome at a single time point, to highlight some important points about our methodological approach. Figure 1 presents the estimated impact of each intervention on the cumulative number of semesters enrolled through two semesters after random assignment. Impact estimates are presented using three estimators: (a) original impact estimates (called B̂_j^Orig above, and "OLS" in Figure 1)17; (b) empirical Bayes impact estimates (called B̂_j^EB above); and (c) adjusted empirical Bayes impact estimates (called B̂_j^AEB above). Interventions are listed in order of the magnitude of their adjusted empirical Bayes impact estimate. The horizontal axis of the figure indicates the direction and magnitude of each impact estimate.
First, notice that the spread of the B̂_j^Orig is 16% larger than the spread of the B̂_j^AEB. The latter, by construction, is equal to the estimate of τ, the standard deviation of the intervention-level distribution of average effects, from Equation 7. As shown in Equation 5, variation in the B̂_j^Orig exceeds variation in true effects, owing to estimation error.

Next, notice that the rank order of interventions' estimated effects occasionally changes, depending on the estimator. This is especially notable in evaluations where: (a) the original impact estimate is relatively more or less precise than the impact estimates for the other interventions (usually due to a relatively small or large sample size) and (b) the original impact estimate is far from the mean (β̂) impact estimate. One notable example is the EASE Info + $ Summer '18 intervention. While its original (OLS) impact estimate is the sixth largest, its adjusted empirical Bayes impact estimate is the fourth largest. Compared to the other evaluations at the top end of the distribution, EASE Info + $ Summer '18 was a relatively large evaluation (total sample size around 3,500). Owing to the large sample size (and precise impact estimate), the difference between its original and adjusted empirical Bayes impact estimates is small, and so its adjusted empirical Bayes rank is better than its original rank. Such rank switching is probably a good thing: estimation error and related issues ought to shrink expectations about the true effects of interventions whose relatively impressive results come from trials with imprecise impact estimates. Making such adjustments when interpreting findings from a new trial is prudent. The smallest trial in THE-RCT included 444 students; all others included over 700 students. If a new trial with 150 students finds an estimated effect of 0.15 cumulative semesters enrolled through two semesters after random assignment, our best estimate probably should not be that this intervention is more effective at increasing cumulative semesters enrolled than nearly all other interventions in THE-RCT. By shrinking expectations, we can help combat "the winner's curse" and "promising trials bias" (Simpson, 2022; Sims et al., 2022). Appendix B discusses how to take the estimated effect of an intervention (and its associated standard error) and calculate an adjusted empirical Bayes impact estimate that can be located on the distribution of adjusted empirical Bayes impact estimates from this paper. An online tool associated with this paper allows users to make this adjustment easily by plugging in an effect estimate and its standard error from a new RCT (go to https://www.mdrc.org/the-rct-empirical-benchmarks).

17 To ensure estimator consistency across interventions, and because many of the original studies did not estimate impacts on this outcome at this time point, the "original" impact estimates in Figure 1 were obtained from student-level data by estimating an ordinary least squares (OLS) regression with cumulative semesters enrolled as the dependent variable and the following independent variables: (a) 0/1 indicators of students' random assignment block (typically defined by their cohort and/or college campus), and (b) interactions between a 0/1 indicator of students' treatment or control status and 0/1 identifiers of the intervention tested by the RCT that they were part of. The regression coefficients for these interaction terms are the "original" estimated effects of the interventions tested.
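The shrinkage logic behind these adjustments can be sketched with the classic empirical Bayes reliability weight, where an estimate is pulled toward the cross-intervention mean in proportion to its noise. This is an illustrative sketch only, not the paper's exact Equation B.1, and the values of `beta`, `tau`, and the standard errors below are hypothetical placeholders:

```python
# Illustrative empirical Bayes shrinkage: an estimate from a small,
# imprecise trial is pulled more strongly toward the cross-intervention
# mean than the same point estimate from a large trial.
# beta (mean of true effects) and tau (their SD) are hypothetical values.

def eb_shrink(b_ols: float, se: float, beta: float, tau: float) -> float:
    """Shrink an OLS impact estimate toward the prior mean.

    The shrinkage weight rho = tau^2 / (tau^2 + se^2) is a reliability
    ratio: precise estimates (small se) keep most of their distance from
    the mean; noisy ones are pulled in.
    """
    rho = tau**2 / (tau**2 + se**2)
    return beta + rho * (b_ols - beta)

beta, tau = 0.03, 0.05  # hypothetical distribution (semesters enrolled)

# Same point estimate, very different sample sizes / standard errors:
small_trial = eb_shrink(0.15, se=0.10, beta=beta, tau=tau)  # ~150 students
large_trial = eb_shrink(0.15, se=0.02, beta=beta, tau=tau)  # several thousand

print(round(small_trial, 3))  # pulled well toward the mean
print(round(large_trial, 3))  # stays close to 0.15
```

The small trial's impressive estimate is discounted heavily, which is exactly the guard against "the winner's curse" described above.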
Figure 1 illustrated key points about our methodological approach using one outcome and time point as an example. We now turn to a broader discussion of the findings across major outcomes and time points.

Estimated Distribution of True Impacts
Appendix Table A4 presents information on the estimated distribution of true effects (mean, standard deviation, and percentiles) across the 39 postsecondary interventions in THE-RCT, by outcome and by semester. The first two outcomes (enrollment and credits earned) are marginal, focused only on what happened in that semester. The other three outcomes (cumulative semesters enrolled, cumulative credits earned, and degree earned) are cumulative, such that earlier impacts carry forward. As discussed earlier, the distributions of effects beyond semester three or four should be interpreted with caution because they are based on fewer studies and are very likely biased upwards, since studies with more promising short-term effects are more likely to have longer-term follow-up data. Figure 2 provides a visual summary of Appendix Table A4 for four outcomes and for six semesters after random assignment.
Consider first the effect distribution on credits earned through one semester. The mean of the distribution (β) is 0.48 credits. This implies that, on average, the interventions in THE-RCT had positive impacts on students' first-semester credit accumulation. Next is the estimate of true impact variation across interventions (τ). This cross-intervention standard deviation is estimated to be 0.55 credits. This estimate of τ is larger in magnitude than the mean impact (β), indicating substantial variation in the effectiveness of the interventions under study on this outcome at this time point. Lastly are estimates of the magnitude of effects at various points in the effect distribution: the 10th, 25th, 50th, 75th, and 90th percentiles are −0.03, 0.13, 0.48, 0.60, and 1.29 credits, respectively.
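The paper estimates β and τ with a random-effects model (Equations 5 and 7, not reproduced here). As a rough illustration of the underlying idea, the sketch below uses the standard method-of-moments (DerSimonian-Laird) estimator, a common alternative to the paper's approach, on made-up study-level estimates; all numbers are hypothetical:

```python
# Method-of-moments (DerSimonian-Laird) sketch of a random-effects model:
# recover the mean (beta) and SD (tau) of the intervention-level
# distribution of true effects from noisy study-level estimates.

def dersimonian_laird(estimates, ses):
    """Return (beta, tau): random-effects mean and between-study SD."""
    w = [1 / s**2 for s in ses]                      # fixed-effect weights
    beta_fe = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - beta_fe) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c)  # truncate at zero
    # Re-estimate the mean with weights that include between-study variance.
    w_re = [1 / (s**2 + tau2) for s in ses]
    beta_re = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    return beta_re, tau2**0.5

# Hypothetical impact estimates (credits) and standard errors from six trials:
effects = [0.9, 0.1, 0.5, 1.4, -0.2, 0.6]
ses = [0.3, 0.2, 0.25, 0.4, 0.3, 0.2]
beta, tau = dersimonian_laird(effects, ses)
print(round(beta, 2), round(tau, 2))
```

The key point the sketch makes concrete is that τ measures variation in *true* effects, net of each study's sampling error, which is why it is smaller than the raw spread of the estimates.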
Cutoffs to create rules of thumb for what is small, medium, or large are arbitrary and can produce odd distinctions (e.g., an effect estimate of 0.59 credits might be considered medium and an effect estimate of 0.61 might be considered large, despite the two being indistinguishable for practical or statistical purposes). Thus, it may be preferable to simply characterize the relative magnitude of an effect estimate in terms of its approximate percentile within the distribution of effects (see Appendix Table A4 for exact values). Nevertheless, some people find rules of thumb helpful, so one might consider effects on credits earned in semester one below 0.13 credits (the 25th percentile) to be relatively small, between 0.13 and 0.60 credits (the 75th percentile) to be medium-sized, and above 0.60 credits to be relatively large.
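The rule of thumb just described, using the 25th and 75th percentile cutoffs for credits earned in semester one, can be written down directly:

```python
# Rule-of-thumb labels for effects on credits earned in semester one,
# using the 25th (0.13 credits) and 75th (0.60 credits) percentiles of
# the estimated distribution of true effects as cutoffs.

def size_label(effect_credits: float) -> str:
    if effect_credits < 0.13:
        return "relatively small"
    if effect_credits <= 0.60:
        return "medium-sized"
    return "relatively large"

print(size_label(0.05))  # relatively small
print(size_label(0.48))  # medium-sized (the mean effect)
print(size_label(0.75))  # relatively large
```

As the surrounding text cautions, such cutoffs are arbitrary; reporting the approximate percentile itself is usually more informative.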
When looking across outcomes and semesters, a few patterns emerge:

• For marginal outcomes, downward trends for means and decreasing variability over time.
The mean (across interventions) effect on marginal enrollment and credits earned decreases over time. For example, the mean effect on credits earned starts at 0.48 credits in semester 1 and decreases to 0.43, 0.25, 0.17, 0.05, and 0.02 credits in semesters 2-6, respectively. The same downward trend holds for enrollment. Similarly, the spread of the intervention-level distribution of effects is widest in semester 1 and decreases over time. These trends suggest that much of the action, with respect to intervention effects, occurs early on. Consequently, in the short term, an intervention's effect on enrollment or credits earned must be larger in absolute magnitude to be considered large relative to the effects of other interventions.
• For cumulative outcomes, upward trends and increased variability over time.
The mean (across interventions) effect on cumulative number of semesters enrolled and cumulative credits earned increases over time. For example, the mean effect on cumulative credits earned starts at 0.47 credits in semester 1 and increases to 0.89, 1.14, 1.43, 1.70, and 2.46 credits in semesters 2-6, respectively. This pattern also holds for cumulative enrollment. Similarly, the spread of the intervention-level distribution of effects is smallest in semester 1 and increases over time. These trends show that the effects of some of the more effective interventions continue to grow throughout the first three years after random assignment. Consequently, over time, an intervention's effect on cumulative enrollment or cumulative credits earned must be larger in absolute magnitude to be considered large relative to the effects of other interventions.
Because degree completion is a longer-term outcome, the number of studies with follow-up data is smaller than for other outcomes. Based on the subset of studies with available data (20 to 23 interventions, depending on the semester), the mean effect is 0.9 percentage points by semester 4, 1.6 percentage points by semester 5, and 1.7 percentage points by semester 6. As noted earlier, these average effects are likely biased upwards due to follow-up selection bias. One notable aspect of the degree findings is that two outlier interventions drive much of the action: the original effect estimates from these studies are approximately 16 and 18 percentage points, while no other study's original effect estimate is above 4 percentage points. Thus, benchmarking degree impacts based on the evaluations in THE-RCT is limited. It seems reasonable to consider any positive effect on degree completion to be an impressive feat, with effects larger than 5 percentage points being notable.

Discussion
We break our discussion into three parts: planning, interpretation, and other remarks.

Planning
The findings in this paper can be used to plan the sample size for future evaluations of CC interventions. For example, when planning a study of a lighter-touch intervention (perhaps a single-component intervention), researchers may want to choose a sample size that makes it possible to detect effects at the lower end of the distribution of effects presented in this paper (for example, 0.23 cumulative credits earned through one year). For more comprehensive interventions, which tend to have larger effects (Weiss & Bloom, 2022; Weiss et al., 2022), a larger minimum detectable true effect (MDTE) may be sufficient, depending on the goal of the study. Importantly, other design parameters are necessary to calculate the minimum detectable true effect of an intervention; for guidance specific to community college evaluations, see Somers et al. (2023).
Notably, researchers should be mindful of the outcome(s) of interest and the timing of measuring those outcomes, as this has implications for what sized effects might realistically be achieved. For example, the size of the effect an intervention might reasonably achieve on marginal credits earned in semester 2 is quite different from the size of the effect it might achieve on cumulative credits earned through semester 4. Thus, MDTE calculations need to be specific with respect to outcome and timing.
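To make the planning use concrete, the sketch below inverts the textbook minimum-detectable-effect formula for an individually randomized two-arm trial. The multiplier of about 2.8 (80% power, two-tailed 5% test), the outcome standard deviation `sigma`, and the covariate R² are illustrative assumptions a planner would replace with study-specific values (see Somers et al., 2023, for CC-specific design parameters):

```python
from math import ceil

# Approximate sample size needed to detect a target minimum detectable
# true effect (MDTE) in natural units, for an individually randomized
# two-arm trial. sigma (outcome SD in credits) and r2 (covariate R^2)
# are hypothetical placeholders.

def required_n(mdte: float, sigma: float, p_treat: float = 0.5,
               r2: float = 0.0, m: float = 2.8) -> int:
    """Invert MDTE = m * sigma * sqrt((1 - r2) / (p(1-p) n)) for n."""
    return ceil((m * sigma) ** 2 * (1 - r2) / (p_treat * (1 - p_treat) * mdte ** 2))

# Targeting an effect at the lower end of the benchmark distribution
# demands a far larger sample than targeting one near the upper end:
print(required_n(0.23, sigma=6.0, r2=0.5))  # small target effect
print(required_n(1.25, sigma=6.0, r2=0.5))  # large target effect
```

The steep dependence on the target MDTE is why the benchmark percentiles matter: choosing an unrealistically large MDTE makes a study look cheap but leaves it underpowered for the effects interventions actually achieve.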

Interpretation
The findings presented in this paper begin to provide empirical benchmarks to help researchers and policymakers interpret effect estimates from evaluations of CC interventions. Returning to the example of the DPP intervention highlighted in the introduction, recall that the estimated effect of DPP was 1.8 credits earned after two semesters (Ratledge et al., 2021, Appendix Table B3). This translates into an adjusted empirical Bayes impact estimate of 1.6 credits earned, which is about 0.70 standard deviations above the mean effect and at the 83rd percentile of the distribution of effects after two semesters.18 Thus, the effect of the DPP intervention is quite large compared to the effects of other evaluated interventions.
As noted earlier, the interventions included in our analysis vary in terms of their components and duration (see Table 1). For this reason, in theory, it may be appropriate for researchers to compare the effect found in their study to findings from prior studies of similar interventions. To this end, study-level estimates of intervention effects (OLS, empirical Bayes, and adjusted empirical Bayes) for the 39 interventions are available in a public-use dataset created for this paper (https://www.mdrc.org/the-rct-empirical-benchmarks). The dataset includes estimated effects for each of the 39 studies in the analysis, by semester, for all outcomes (including additional outcomes like credits attempted and credits earned, for total, college-level, and developmental credits), allowing researchers to look at estimated effects for narrower categories of interventions as well as additional outcomes. As more CC studies are conducted, it may become possible to look at effect sizes for different populations and a broader array of interventions.
Researchers of CC studies can also consider using other types of benchmarks to interpret their findings. One limitation of using effects from prior studies as benchmarks is that while they inform what is realistically attainable, they do not necessarily inform what effects are practically meaningful to decision-makers. In K-12 research, several other approaches have been offered for interpreting the practical meaningfulness of effect sizes. These include comparing a study's effects to "normative expectations for change or growth" (e.g., comparing a study's effect to the typical growth made by a student during the year) and to policy-relevant performance gaps (e.g., comparing a study's effect to the outcomes gap between students from families with low versus higher income; Baird & Pane, 2019; Bloom et al., 2008; Hill et al., 2008; Konstantopoulos & Hedges, 2008; Kraft, 2020; Lipsey et al., 2012; Wolf & Harbatkin, 2022).
These approaches could also be used to interpret the practical meaningfulness of findings from CC studies. With respect to normative expectations for growth, the magnitude of a program's effect on college-level credit accumulation could be characterized relative to normal academic progress towards a degree over various time periods. For example, for students who first enrolled in a public CC in 2012, average credit accumulation over one year nationally was 13.2 credits.19 Therefore, if an intervention's estimated effect on credit accumulation through two semesters is 3.0 credits, then the intervention could be said to increase credit accumulation by about 23% (3.0/13.2) of the national average.
Similarly, comparing estimated effects to inequality in academic outcomes across relevant populations could also be used to characterize impacts. For example, nationally, among students who first enrolled in a public CC in 2017, there is a 10.5 percentage point gap in three-year graduation rates between Hispanic men and White men (U.S. Department of Education, 2021). Thus, an intervention targeting Latino males with a 1.4 percentage point effect on three-year graduation rates could be characterized as reducing this racial inequality in graduation rates among Hispanic men and White men by about 13%.
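The gap-based benchmark above is a simple ratio; writing it out makes the units explicit (both effect and gap must be in percentage points on the same outcome):

```python
# Express an intervention's effect as a share of a prevailing outcome gap.
# The numbers are the ones quoted in the text: a 1.4 percentage point
# effect on three-year graduation rates against a 10.5 point gap.

def share_of_gap(effect_pp: float, gap_pp: float) -> float:
    """Fraction of the outcome gap closed by the intervention's effect."""
    return effect_pp / gap_pp

print(f"{share_of_gap(1.4, 10.5):.0%}")  # about 13% of the gap
```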
When using normative benchmarks (whether progress towards a degree or racial inequality in academic outcomes), an important consideration is what the reference population should be, which in turn depends on how the findings will be used. The previous examples were based on a national reference population, which may be relevant if the goal is to understand the potential of an intervention to address racial inequality if it were scaled to additional CCs across the country. Information about postsecondary outcomes and achievement gaps for nationally representative samples is readily available from public data sources like the Integrated Postsecondary Education Data System (IPEDS) and the Beginning Postsecondary Students (BPS) study conducted by the National Center for Education Statistics (NCES).20 However, for policymakers or practitioners working at the state or institutional level, benchmarks based on a local normative population could be more appropriate. For example, if an intervention is "home grown" by a state, then state policymakers may find it most useful to interpret the effect sizes from a study relative to the racial inequality in academic outcomes in their state. Similarly, at the colleges implementing the intervention being evaluated, administrators may want to know by how much the intervention reduces racial inequality in student outcomes at their institution, or, even more specifically, for the subset of students who were eligible to receive the intervention. Information on a college's racial inequality in academic outcomes can often be obtained directly from the institution, and gaps for participating students can be estimated using the study's control group.
For policymakers and practitioners, practical considerations related to an intervention's implementation can be as important as its effectiveness. Hence, other powerful approaches for contextualizing a study's effects are to discuss the intervention's cost-effectiveness (impact per dollar spent) and its scalability to different contexts (Kraft, 2020).

Other Remarks
Intent-to-Treat and Treatment-on-the-Treated

In most of the studies in THE-RCT, eligible students who were interested in participating in the intervention were recruited and consented to participate in the evaluation. Consequently, intervention take-up rates were high (typically over 70%), and there is limited difference between the magnitude of the intent-to-treat (ITT) effect and the effect of the treatment-on-the-treated (TOT), or the local average treatment effect (LATE; Angrist et al., 1996; Bloom, 1984). Thus, it is probably inappropriate to compare the ITT effects from an evaluation that randomizes all eligible students "behind the scenes" and yields a low take-up rate to the distribution of ITT effects presented in the present paper.
Comparing TOT or LATE estimates from a study with low take-up rates may be more appropriate; however, even this should be done with caution, owing to the often-inflated standard errors of such estimates (when take-up is low) and the fact that, despite high take-up rates, the effect estimates in the present paper are ITT estimates and would need to be inflated somewhat to make them more comparable to TOT estimates. Perhaps this is an area for future work.
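The relationship between ITT and TOT mentioned above can be sketched with the Bloom (1984) no-show adjustment: with no crossovers from the control group, the TOT effect is the ITT effect divided by the take-up rate. The numbers below are hypothetical:

```python
# Bloom (1984) no-show adjustment: with no control-group crossovers,
# TOT = ITT / take-up rate. Effect sizes here are hypothetical.

def bloom_tot(itt: float, takeup: float) -> float:
    """Scale an intent-to-treat effect up to a treatment-on-the-treated effect."""
    if not 0 < takeup <= 1:
        raise ValueError("take-up rate must be in (0, 1]")
    return itt / takeup

# With the high take-up typical of THE-RCT studies, ITT and TOT are close;
# with low take-up they diverge sharply.
print(round(bloom_tot(0.50, 0.9), 3))
print(round(bloom_tot(0.50, 0.3), 3))
```

This is why comparing a low-take-up study's ITT estimate directly to the benchmark distribution understates what the intervention did for students who actually received it.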

A Warning About Standardized Effect Sizes
The standardized mean effect size is the program-control group difference in mean outcomes divided by the standard deviation of the outcome (typically for the control group or pooled within research groups). When planning an evaluation, it is common for researchers to calculate the minimum detectable true effect size (MDTES) in standardized units, rather than the minimum detectable true effect (MDTE) in natural units (e.g., credits earned; percentage points for enrollment or degrees earned). The same is sometimes done when describing findings: intervention effect estimates may be presented in "standardized" effect size units. This is especially common when combining effect estimates for the purpose of meta-analysis. We offer caution to those tempted to do so.
First, standardized effect sizes may result in an unnecessary lack of transparency. They are common in K-12 research because test scores are often the outcome measure of interest, and test scores are on arbitrary scales, so some form of standardization is necessary for interpretation. In contrast, in CC research most outcomes have meaning in their natural units: graduation rates, enrollment rates, credits earned. Thus, a very strong reason is needed to justify converting effect estimates from a unit with natural meaning to one that is difficult to interpret and tethered to the amount of variation in the outcome among a specific sample of individuals.
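The tethering problem is easy to see numerically: the same natural-unit effect maps to different standardized values depending on the sample's outcome variation. The standard deviations below are hypothetical:

```python
# The same 3-credit effect yields different "standardized" effect sizes
# depending on the control group's outcome SD, while the natural-unit
# effect stays interpretable. SDs here are hypothetical.

def standardized_es(effect: float, control_sd: float) -> float:
    """Standardized mean effect size: effect divided by outcome SD."""
    return effect / control_sd

effect_credits = 3.0
for sd in (6.0, 9.0, 12.0):  # plausible control-group SDs of credits earned
    print(f"SD={sd:>4}: d = {standardized_es(effect_credits, sd):.2f}")
```

Two studies of the same intervention with identical 3-credit impacts could thus report standardized effects that differ by a factor of two, purely because their samples vary differently.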
Methodological Next Steps for Empirical Benchmarks

In addition to providing empirical benchmarks for planning evaluations and interpreting effect estimates from evaluations in a new context (CCs), this paper also offers some methodological advances. By using random effects models and adjusted empirical Bayes impact estimates, we move closer to providing empirical benchmarks that represent the intervention-level distribution of true effects from past evaluations, rather than the intervention-level distribution of estimated effects from past evaluations.
This advance can be further improved upon. Analyses for each outcome at each time point were conducted independently, despite known correlations among impacts over time and across outcomes. Pooling data could be beneficial, especially given the lack of precision when estimating key parameters of interest. Specifically, the kinks in Figure 2 and some oddities in Appendix Table A4 likely illustrate the challenge of separating signal from noise when estimating τ with a limited number of studies. This challenge carries through to the adjusted empirical Bayes impact estimates and thus to the percentiles of the intervention-level distribution of effects. The noise could be smoothed by pooling the data over time within outcome (at a minimum), and by imposing structure on the estimates of τ, such as assuming a plausible functional form over time (linearity or curvilinearity). Such pooling might also mitigate some of the concern about follow-up selection bias, since estimates of β and τ at later time points would be based, in part, on data from the earlier time points with more complete data.
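One minimal version of the smoothing idea is to fit a simple functional form to the per-semester τ estimates rather than treating each independently. The τ̂ values below are hypothetical stand-ins, not the values in Appendix Table A4:

```python
import numpy as np

# Smooth noisy per-semester estimates of tau by imposing a curvilinear
# (quadratic) functional form over time. The tau_hat values are
# hypothetical placeholders for the per-semester estimates.

semesters = np.array([1, 2, 3, 4, 5, 6], dtype=float)
tau_hat = np.array([0.55, 0.90, 1.10, 1.60, 1.45, 2.30])  # noisy estimates

coefs = np.polyfit(semesters, tau_hat, deg=2)   # fit tau(t) = a t^2 + b t + c
tau_smooth = np.polyval(coefs, semesters)

# The smoothed series irons out the "kinks" produced by estimation noise.
print(np.round(tau_smooth, 2))
```

A full implementation would instead fit the functional form inside the random-effects model itself, borrowing strength across semesters; this sketch only conveys the shape of the idea.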

A Plea to Postsecondary Researchers and Funders
This article is one in a series of papers that capitalize on the unique dataset known as THE-RCT (for example, see Bailey & Weiss, 2022; Somers et al., 2023; Weiss & Bloom, 2022; Weiss et al., 2021, 2022). In addition to promoting increased learning through open and transparent data sharing, THE-RCT facilitates cross-study knowledge building. This was supported by the creation of core outcome measures, available semesterly, across all studies in THE-RCT.
To the extent that postsecondary researchers and funders value this type of cross-study learning, it is imperative to the field that we: (a) agree upon core outcome measures that are examined across studies (even if the outcome is not the primary outcome of the study) and (b) present impact estimates and associated standard errors (even if only in appendices), by semester for these core outcomes.

Conclusion
This paper provides an important first step in helping planners and potential funders of evaluations of CC interventions consider whether a proposed study is adequately powered to detect realistically achievable effects, and in supporting consumers of CC research in interpreting the magnitude of effects from rigorous evaluations. We hope that others will expand upon this work to further the field.

For example, based on data from previous MDRC trials of 33 postsecondary interventions, the estimated overall mean true impact (β̂) on credits accumulated during students' first two semesters after random assignment is 0.89 credits, and the estimated standard deviation (τ̂) is 1.02 credits. Corresponding estimates of the 10th, 25th, 50th, 75th, and 90th percentile values for this distribution are −0.09, 0.23, 0.73, 1.25, and 2.02 credits, respectively.
To compare an OLS impact estimate from a current study to this distribution, the estimate must first be converted to its adjusted empirical Bayes counterpart; in the worked example described below, the converted estimate is 1.44 credits. This value lies between the 75th and 90th percentile values (1.25 and 2.02 credits, respectively) in Appendix Table A4. Hence, relative to the past postsecondary interventions used to create the benchmark distribution for our example, the impact of the current hypothetical intervention is substantially positive.
A researcher could then interpolate the percentile value (P̂_current) for the current intervention impact to determine where between the 75th and 90th percentiles it lies. For this purpose, a linear interpolation is probably a reasonable approximation. Thus, for the present example:

P̂_current = 75 + (90 − 75) × (1.44 − 1.25) / (2.02 − 1.25) ≈ 79

Hence, existing information indicates that the impact of the current intervention is comparable to the 79th percentile of the estimated distribution of previous related true impacts.
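The linear interpolation just shown is a one-line function; using the worked example's numbers (an adjusted estimate of 1.44 credits between the 75th percentile at 1.25 and the 90th at 2.02):

```python
# Linear interpolation of an impact estimate's percentile between two
# known percentile values of the benchmark distribution.

def interp_percentile(estimate: float,
                      lo_pct: float, lo_val: float,
                      hi_pct: float, hi_val: float) -> float:
    """Locate `estimate` between (lo_pct, lo_val) and (hi_pct, hi_val)."""
    return lo_pct + (hi_pct - lo_pct) * (estimate - lo_val) / (hi_val - lo_val)

# 1.44 credits sits between the 75th (1.25) and 90th (2.02) percentiles:
p = interp_percentile(1.44, 75, 1.25, 90, 2.02)
print(round(p))  # ~79th percentile
```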
An online tool associated with this paper will make these calculations for a user; they simply need to input the effect estimate and its associated standard error from their study. See https://www.mdrc.org/the-rct-empirical-benchmarks.
As shown in Equation 5, variation in the B̂_j^Orig will be larger than τ, owing to estimation error. The B̂_j^AEB aim to correct for this (and the B̂_j^EB slightly overcorrect for it, in favor of other desirable statistical properties). Thus, the B̂_j^AEB may yield the most accurate representation of the spread of the distribution of true effects. This is particularly relevant when planning a study and considering the MDTE. If a planned study has an MDTE of 0.125 semesters enrolled, examining the distribution of estimated effects (B̂_j^Orig) would show that effects of at least 0.125 semesters enrolled have been observed in 5 out of 39 interventions. While this is a relatively large effect, researchers might argue that being in the top 12% of interventions seems feasible, depending on the intervention. Examining the distribution of adjusted empirical Bayes estimates (B̂_j^AEB) might make telling that story more challenging: the adjusted empirical Bayes estimates suggest that an effect of 0.125 semesters enrolled has occurred in only 3 out of 39 interventions, implying that the intervention proposed for study would need to have a true effect in the top 6% of interventions in THE-RCT for the proposed study to be adequately powered. This could change researcher or funder decision-making.

Figure 1. Intervention Effect Estimates Using Three Estimators

Figure 2. Points on the Estimated Distribution of Intervention Effects, by Outcome and Semester

To compare an OLS impact estimate (B̂_j*^OLS) from a current study (j*) for the same student outcome to the preceding distribution, one must transform the OLS estimate to its adjusted empirical Bayes counterpart (B̂_j*^AEB). Conceptually, this transformation is a two-step process.21 To be concrete, consider the process in the context of a current study where B̂_j*^OLS equals 1.5 credits and has an estimated standard error of 0.5 credits. The first step is to convert B̂_j*^OLS to a standard empirical Bayes impact estimate (B̂_j*^EB) using Equation B.1. The second step applies an adjustment factor (γ) that accounts for the fact that the estimated variance of standard empirical Bayes impact estimates for a sample of interventions systematically understates the corresponding variance of true impacts (τ²); this adjustment is given by Equation B.4,22 whose terms include the total number of intervention impact estimates in the benchmark sample. Appendix Table A4 reports the value of γ for each postsecondary outcome measure represented. For credits accumulated during students' first two semesters after random assignment, γ equals 0…

21 Operationally, the two-step process can be represented by a single closed-form expression.
22 See Bloom et al. (2016) for a discussion of adjusted empirical Bayes impact estimates.
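The two-step conversion can be sketched as follows. Step 1 is standard empirical Bayes shrinkage toward the benchmark mean; step 2 rescales the deviation from the mean by an adjustment factor γ. The exact forms are the appendix's Equations B.1 and B.4 (not reproduced above); the variance-matching form and the value γ = 0.80 used here are illustrative assumptions, chosen so the sketch reproduces the worked example's result of roughly 1.44 credits:

```python
from math import sqrt

# Hedged sketch of the two-step OLS -> adjusted empirical Bayes conversion.
# Step 1: shrink the OLS estimate toward the benchmark mean (EB).
# Step 2: rescale the deviation from the mean by an adjustment factor gamma,
# since EB estimates' variance understates tau^2. gamma = 0.80 is a
# hypothetical placeholder, not the paper's tabulated value.

beta, tau = 0.89, 1.02          # benchmark mean and SD (two-semester credits)
b_ols, se = 1.5, 0.5            # current study's OLS estimate and standard error

rho = tau**2 / (tau**2 + se**2)          # shrinkage (reliability) weight
b_eb = beta + rho * (b_ols - beta)       # step 1: empirical Bayes estimate

gamma = 0.80                             # hypothetical adjustment factor
b_aeb = beta + (b_eb - beta) / sqrt(gamma)  # step 2: re-inflate the deviation

print(round(b_eb, 2), round(b_aeb, 2))
```

As the footnote notes, the two steps collapse into a single closed-form expression; the online tool performs the same calculation from a user's effect estimate and standard error.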

Table 3. Characteristics of Students in the Average Study in the Main Analytic Sample (continued)

Table 4. Data Availability as a Percent of the Full Sample of Individuals and Interventions

Note. The full sample includes 65,637 students and studies of 39 interventions. Sample sizes by outcome and semester are shown in Appendix Table A4.
Notably, 19 of the interventions in THE-RCT are part of MDRC's Intervention Return on Investment (ROI) Tool for Community Colleges (see https://www.mdrc.org/intervention-roi-tool), and thus estimates of the direct costs of these interventions are publicly available. Careful thought would be required to pull this information together to create cost-effectiveness benchmarks, an important area for future investigation that we hope to pursue.
Second, as shown in the Somers et al. (2023) paper, the distribution of effects across interventions can vary over time, with changing distributional means and variances. Moreover, as also shown in Somers et al. (2023), the standard deviation of the outcome changes over time, among outcomes, and across interventions. Combining these facts, it may be quite difficult to make sense of comparisons of standardized effect sizes across time, outcomes, or evaluations. The cleanest comparisons or pooling may involve the same (or very similar) outcomes measured at the same time point.