Introduction

Persons who committed sexual offenses have received considerable attention in recent years from both policymakers and researchers. Sexual crimes profoundly impact victims and the larger community (deBaca, 2015). Several treatment models for persons with sexual offense histories have been developed, modified, and refined over time (Yates, 2013). The risk-need-responsivity (RNR) model (Bonta & Andrews, 2007), perhaps the most influential treatment model in criminology, provides recommendations on how interventions should be implemented based on the risk that individuals pose, their needs that ought to be met in treatment, and the choice of factors that actually can be changed effectively. The first principle, the risk principle, states that interventions should target the individual risk level, for example, for sexual recidivism, with high-risk offenders being likely to require more intensive interventions to induce any change compared to low-risk individuals for whom complex or expensive interventions may be unreasonable. The second principle, the need principle, states that interventions should target individuals with criminogenic needs that may actually change and thus allow for reducing the rate of reoffending. The third principle, the responsivity principle, states that different treatment approaches likely also differ in their effectiveness in reducing recidivism, with behavioral and cognitive-behavioral interventions generally preferred. The treatment outcome of such interventions is commonly measured in terms of the reduction of posttreatment sexual recidivism in treated compared to untreated control groups (Tyler et al., 2021). Previous meta-analyses, however, showed that treatment evaluation is complicated by methodological concerns, mainly because of significant heterogeneity in sample characteristics, variations in treatment approaches, and deficiencies in study design (Hanson et al., 2013; Lösel, 2020; Schmucker & Lösel, 2017). Due to the significant heterogeneity, previous work did not allow drawing general conclusions about the effectiveness of treatment in persons with sexual offense histories.

The most recent meta-analysis in the field by Holper et al. (2023) provided an update on an earlier meta-analysis by Schmucker and Lösel (2017), which synthesized evidence on sexual recidivism as an indicator of treatment effectiveness. Holper et al. (2023) included 37 unique samples from 35 studies comprising 30,394 individuals published between 1983 and 2021. Overall, a small mean treatment effect on sexual recidivism was suggested with an odds ratio of OR 1.54 [95% CI 1.22, 1.95] (\(p\) < 0.001), equating to a relative reduction in sexual recidivism of 32% after treatment. Holper et al. (2023) also conducted a traditional moderator analysis according to which three predictors were significantly associated with treatment effectiveness: pre-treatment risk level, treatment specialization, and author confounding. The results thus corroborated the first and second principles of the RNR model: First, greater treatment effectiveness was present in high- and medium- compared to low-risk individuals, and secondly, in specialized compared to non-specialized treatments. Additionally, author confounding was related to treatment effectiveness in the sense that authors involved in the treatment programs reported more significant treatment effects than independent authors.

In the context of a moderator analysis, residual heterogeneity refers to the remaining variability between studies not accounted for by the moderators. Explaining heterogeneity is essential for rendering observations from a meta-analysis both easy to interpret and useful for practical recommendations. Although the findings of the meta-analysis by Holper et al. (2023) were promising, the residual heterogeneity was still large (\({I}^{2}\) = 69%), and the three predictors varied considerably in the degree to which they explained residual heterogeneity. Although pre-treatment risk level explained a large amount of heterogeneity (\({I}^{2}\) = 17%), treatment specialization (\({I}^{2}\) = 67%) and author confounding (\({I}^{2}\) = 51%) did not explain much of the residual heterogeneity. Therefore, the question arose whether spurious effects might have occurred in the meta-analysis by Holper et al. (2023). In other words, some relationships between predictors and outcomes may have been only artifacts due to a univariate framework and may therefore have led to inconclusive evidence.

Consequently, the present analysis aimed at reassessing the evidence collected by Holper et al. (2023) using an alternative method that is suitable to explain more heterogeneity by virtue of acknowledging combinations of multiple moderators. As an alternative method, we made use of multimodal inference. Multimodal inference is a natural extension of single-hypothesis testing. In essence, multimodel inference refers to the generalization across multiple single statistical models corresponding to different single hypotheses (Garamszegi & Mundry, 2014). The multimodel approach is particularly appealing when dealing with models containing many moderators that should be assessed for a given outcome and if one is interested in the varying degree of how well those predictors fit the data at hand. Some of the most widely used multimodel approaches are based on information theory. Information theory relies on information criteria, such as Akaike’s information criterion (AIC) (Burnham & Anderson, 2002; Grueber et al., 2011). Recently, recommendations have been provided for implementing such information-theoretic approaches on multimodel inference in meta-analyses (Cinar et al., 2021). Simulations suggested that these information-theoretic strategies likely outperform conventional meta-analytical methods in identifying the most relevant predictors in complex datasets with many moderators. Multimodel analysis thus provides much more robust parameter estimates than single-hypothesis testing. As information theory also provides estimates of the relative importance of the multimodel-inferred moderator effects, the heterogeneity explained by each moderator can be evaluated separately (Burnham & Anderson, 2002).

In sum, the present meta-analysis was conducted to assess whether multimodel inference would yield a better approximation to the underlying data structure compared to single-hypothesis testing, thus explaining a larger component of heterogeneity. We expected that the information-theoretic approach would provide more conclusive evidence regarding which predictors are important or less important in moderating sexual recidivism as an indicator of treatment effectiveness in persons with sexual offense histories. Considering the findings by Holper et al. (2023), we hypothesized that the pre-treatment risk level may survive multimodel inference and thus be deemed an important predictor. In contrast, the other moderators may be considered insignificant and thus be deemed less important predictors.

Methods

Study selection

Holper et al. (2023) provided a dataset of 37 unique samples from 35 studies comprising 30,394 individuals published between 1983 and 2021. Eligible studies had to (1) include males irrespective of age, (2) contain a minimum sample size of ten individuals, (3) provide official sexual recidivism rates, (4) be based on a treatment program that aimed at reducing sexual recidivism rates explicitly, and (5) fulfill at least level 3 study design status on the Maryland Scientific Methods Scale (SMS) (Farrington et al., 2002) in order to ensure equivalence between treatment and control groups. The SMS categorizes the methodological strengths and weaknesses of research studies within the criminal justice setting. It is currently the most widely utilized quality evaluation scale in this context. The main characteristics of the samples are listed in Table 1.

Table 1 Study samples

Random-effect meta-analysis

Random-effect meta-analysis was conducted in order to estimate the mean treatment effects using the metafor package (Viechtbauer, 2010), which provides a comprehensive collection of functions for fitting meta-analytic models in the R programming language (R Core Team, 2022). Sample-specific effect sizes were computed based on recidivism rates for treatment and control groups reported in the primary studies. By default, a continuity correction (0.5) was used in case of zero events. The analysis was conducted on logged odds ratios; results were presented on the odds ratio scale with 95% confidence intervals (OR [95% CIs]). To estimate the expected range of true effects in future studies, the 95% prediction interval ([95% PI]) around the mean effect was computed (Borenstein et al., 2021). The strength of the effect size was interpreted according to Cohen’s \(d\) equivalents, with OR = 1.68, OR = 3.47, and OR = 6.71 considered equivalent to Cohen’s \(d\) = 0.2 (small), \(d\) = 0.5 (moderate), and \(d\) = 0.8 (large) (Chen et al., 2010; Cohen, 1988). Heterogeneity was reported in terms of residual heterogeneity (Q) and \({I}^{2}\) (J. P. Higgins & Thompson, 2002).

Moderator coding scheme

A total of 15 publication-, sample-, treatment-, and individual-specific moderators were assessed as provided by Holper et al. (2023). Details on the coding scheme, which followed the scheme initially used by Schmucker and Lösel (2017), are provided in the online supplementary materials.

  • Publication characteristics: publication status [published, unpublished], publication year [< 2000, \(\ge\) 2000], country [Canada, USA, other], and author confounding [no, yes].

  • Sample characteristics: sample size [< 50, 51–150, 151–250, 251–500, > 500], design quality [level 3 (incidental), level 4 (matching), level 5 (randomized)], follow-up [< 5 years, \(\ge\) 5 years], and recidivism definition [arrest, charge, conviction].

  • Treatment characteristics: treatment approach [behavioral therapy, cognitive-behavioral therapy, insight-oriented therapy, multisystemic therapy, therapeutic community], treatment setting [prison, hospital, outpatient, mixed], treatment individualization [group only, group mainly, mixed, individual mainly, individual only], treatment specialization [no, yes], and aftercare [no, yes].

  • Individual characteristics: age group [adults, juveniles] and risk level [low risk, medium risk, high risk]. The risk ratings were derived as follows. If information on mean risk level was reported in the primary studies based on individual risk assessments, such as the Static-99 (Harris et al., 2003), Static-99R (Phenix et al., 2016), Risk Matrix (Ross & Loss, 1991), or BARS (Brief Actuarial Risk Scale, Olver et al., 2013), it was used in the analysis; this was possible in 12 of the samples (34%). Following the recommendation by Schmucker and Lösel (2017), when no risk assessment was reported in the studies, the mean risk level was rated using the Rapid Risk Assessment for Sex Offence Recidivism (RRASOR) (Hanson, 1997) based on aggregated risk information provided in the studies; this was possible in 18 of the samples (51%). Another five samples (14%) did not allow for obtaining any risk estimate due to insufficient information. Notably, the RRASOR was initially designed for individual risk judgments. Schmucker and Lösel (2017) suggested using the RRASOR to estimate mean risk based on treatment group statistics of relevant variables according to the items: (1) convictions for sex offenses: mean number of previous convictions for sexual offenses (max. 3); (2) age below 25: relative frequency of offenders aged < 25; (3) male victims: relative frequency of offenders with male victims; and (4) any unrelated victims: relative frequency of offenders with unrelated victims. The sum of the item scores was categorized into low risk (\(\le\) 1.5), medium risk (1.5 to 2.5), or high risk (\(\ge\) 2.5). Consequently, the mean risk level thus derived using the RRASOR represents only a rough estimate and cannot be considered equal to individual risk assessment. The RRASOR was originally only recommended for persons with sexual offense histories from 18 years upwards. These aspects should be acknowledged when interpreting the mean risk levels, particularly in the juvenile samples rated using the RRASOR in the updated meta-analysis, except one (Lab et al., 1993).

Multimodel moderator analysis

Multimodel moderator analysis was carried out following the recommendations for meta-analytic data provided by Viechtbauer and colleagues (Viechtbauer et al., 2021; Viechtbauer, 2022a) using the glmulti package (Calcagno & Mazancourt, 2010), which implements the necessary functionality for multimodel inference in combination with the metafor package (Viechtbauer, 2010). In the first step, an exhaustive model screening was conducted, considering all possible combinations between the 15 moderators. This resulted in a set of 32,768 candidate models considered in the main analysis.

In the second step, model fit was estimated by weighting candidate models based on the Akaike information criterion corrected for small sample size (AICc) (Calcagno & Mazancourt, 2010). The AICc was preferred over the traditional AIC because the latter tends to overfit small sample sizes for specific moderator subgroups, resulting in a bias for models with too many terms (Claeskens & Hjort, 2008). Heterogeneity was estimated based on the maximum-likelihood estimator (ML) (Gurka, 2006), to make the AICc values of the different candidate models directly comparable to each other. The resulting Akaike weight for a given model can be regarded as the probability that the model is the optimal model out of all candidate models (Viechtbauer, 2022a).

In the third step, the relative importance of each moderator was estimated based on the sum of the Akaike weights across all candidate models in which a given moderator appeared. Moderators appearing in many models with large weights typically receive higher values for the importance of moderators (IMs). The importance values thus reflect the overall support for each moderator across the candidate model set. A cutoff of IM = 0.8 was chosen, sometimes used to differentiate between important and less important moderators, even though this is an arbitrary division (Calcagno & Mazancourt, 2010). Changing the cutoff to lower values, such as IM = 0.6 considered elsewhere in the literature (Anderson, 2007; Calcagno & Mazancourt, 2010), did not affect our conclusions.

In the fourth step, multimodel inference was performed to compute model-averaged moderator effect sizes across all candidate models. These effect sizes represent naturally weighted averages across the candidate model set, with the weight equaling the AICc evidence for a given moderator. These are unconditional estimates as they are not contingent upon only one model but generalized across the candidate model set. The ORs with corresponding confidence intervals (95% CI) were computed based on the unconditional model-averaged standard errors representing both uncertainties within a given model and across the model set. Subgroup contrasts for each moderator were assessed based on general linear hypothesis (GLH) testing using the multcomp package (Hothorn et al., 2008). The significance of the contrasts was reported in terms of z and p values. Bonferroni correction was applied to counteract the problem of multiple comparisons.

Several sensitivity analyses were conducted to assess the robustness of the methods mentioned above, such as excluding outliers, setting constraints for the candidate model set, or assessing less conservative methods for multiple comparisons. None of the sensitivity analyses cast doubt on the robustness of the conclusions made in the main analysis. Due to constraints on word count, the results of the sensitivity analyses were not explicitly addressed in the main text but are provided within the online supplementary material.

Imputation

Multimodel approaches require all models in a candidate set to contain the same moderator data for the information criteria to be comparable across models. In case of missing moderator data, standard model fitting functions, including the ones from the metafor package, will typically automatically apply “listwise deletion” for those data containing missing values before fitting the models. In the present data, there were eight samples with missing moderator values (five samples with missing values on risk level; three samples with missing values on recidivism definition; Table 1). Hence, applying the multimodal approach to the present data would have implied that eight studies would have been removed from the analysis. This scenario corresponds to a complete case analysis. Complete case analysis (Jamshidian & Mata, 2007) or other common approaches to address missing data in a meta-analysis, such as shifting-unit-of-analysis approaches (Cooper, 1998), however, can lead to biased estimates as they ignore missing values (Cooper, 1998).

To overcome this limitation, model-based imputation methods for missing data have been suggested as more appropriate (Diaz Yanez, 2021). Therefore, in the present analysis, we therefore imputed the eight missing moderator values following the recommendations for meta-analytic data provided by Viechtbauer and colleagues (Viechtbauer, 2022b) using the mice package (Buuren & Groothuis-Oudshoorn, 2011) in combination with the metafor package (Viechtbauer, 2010). The mice package uses fully conditional specification (FCS), also known as multivariate imputation by chained equations (MICEs) (Rubin, 1987, 1996). Details on the imputation using MICE are provided in the online supplementary materials. Following the imputation, all analyses described above were computed for both the reduced dataset, including only complete samples (\(k\) = 29), and the full dataset, including imputed samples (\(k\) = 37), in the following referred to as “complete samples” and “imputed samples.”

Results

Random-effect meta-analysis

The results of the random-effect meta-analysis for sexual recidivism as an indicator of treatment effectiveness (Fig. 1) are illustrated in the first forest plot. The size of the squares representing the sample-specific effect sizes on the OR scale is proportionate to their precision. The arrows indicate that the CIs of some ORs extend beyond axis limits. An OR > 1 indicates a treatment benefit in terms of reducing sexual recidivism, while an OR < 1 indicates an increase in posttreatment sexual recidivism in treated compared to untreated individuals. Note that the results illustrated for the imputed samples are equal to the results reported by Holper et al. (2023).

Fig. 1
figure 1

Forest plot. Forest plot illustrating the 37 sample-specific odds ratios with 95% confidence intervals (OR [95% CI]s) included in the current multimodel meta-analysis for sexual recidivism as an indicator of treatment effectiveness in persons with sexual offense histories. Square size is proportionate to the precision of the sample-specific effect sizes. Arrows indicate CIs extending beyond the axis limits. The diamonds represent the mean treatment effects obtained for complete versus imputed samples; the corresponding 95% CIs are given in brackets, and the 95% prediction intervals are depicted as dotted intervals around the diamonds. Samples containing imputed moderator values are marked (*). Note that the results illustrated for the imputed samples replicate the results reported by Holper et al. (2023)

The mean treatment effects in the complete samples with an OR of 1.46 [95% CI 1.11, 1.92] (\(p\) = 0.006) and the imputed samples with an OR of 1.54 [95% CI 1.22, 1.95] (\(p\) < 0.001) were not statistically significantly different (fixed-effect meta-regression model between complete vs. imputed samples: \(z\) = 0.28, \(p\) = 0.779); the latter corresponds with the outcome reported by Holper et al. (2023) on the same data. The strength of the mean effects was considered small considering the equivalence to Cohen’s \(d\) (\(d\) > 0.2) (Chen et al., 2010; Cohen, 1988). The corresponding 95% prediction intervals around the mean effects were wide for both complete samples (95% PI 0.49, 4.37) and imputed samples (95% PI 0.57, 4.20) and included OR = 1, indicating that the expected range of true effects in future similar studies is likely imprecise. Taken together, the findings indicated that treatment effectiveness in individuals with sexual offense histories is small and of low precision.

Residual heterogeneity in both complete samples (\(Q\) [df = 28] = 129, \(p\) < 0.001, \({\text{I}}^{2}\) = 75%) and imputed samples (\(Q\) [df = 36] = 146, \(p\) < 0.001, \({\text{I}}^{2}\) = 69%) was substantial. The large heterogeneity indicated that a considerable percentage of the variance could not be attributed to sampling error but must be considered systematic differences between studies. The observed heterogeneity thus corroborated the importance of a moderator analysis that may explain variation in treatment effectiveness, as reported in the following sections.

Multimodel inference

An overview of the moderator characteristics examined in the present analysis for sexual recidivism as an indicator of treatment effectiveness is detailed in Table 2. Because details on the moderator characteristics were previously reported by Holper et al. (2023), we provide them in the online supplementary materials.

Table 2 Multimodel moderator analysis

The second forest plot illustrates the results of the multimodel inference for all moderators (Fig. 2). The corresponding multimodel-averaged moderator-specific effect sizes are detailed in Table 2. The online supplementary materials provide details on all subgroup contrasts.

Fig. 2
figure 2

Forest plot moderator-specific effects. Forest plot illustrating model-averaged moderator-specific effect sizes on the odds ratio scale (OR [95% CI]) derived from the multimodel meta-analysis for sexual recidivism as an indicator of treatment effectiveness in persons with sexual offense histories. Square size is proportionate to the precision of the moderator-specific effect sizes. Moderator subgroups between which significant contrasts were only observed for risk level (\(p\) < 0.05)

The relative importance of each moderator is illustrated in Fig. 3 and listed in Table 2. Most moderators were assigned importance values below the cutoff (IM < 0.8), indicating that most moderators were not well supported across models as predictors of treatment effectiveness. Only one predictor was assigned an importance value above cutoff, namely risk level (complete samples IM = 0.99; imputed samples IM = 1.00). This indicated that pre-treatment risk level was well supported across models as a predictor of treatment effectiveness. The corresponding model-averaged subgroup effect sizes demonstrated clear differences between the three risk levels. While high-risk samples (complete samples OR 4.78; imputed samples OR 4.71) and medium-risk samples (complete samples OR 1.83; imputed samples OR 1.90) were observed to have moderate treatment effects considering the equivalence to Cohen’s d (d > 0.5) (Chen et al., 2010; Cohen, 1988), low-risk samples appeared to have essentially no treatment benefit because the corresponding ORs approximated 1 (complete samples OR 0.94; imputed samples OR 0.96). The subgroup contrasts were all statistically significant, with the contrasts between high- and low-risk samples (complete samples \(z\) =  − 4.39, \(p\) < 0.001; imputed samples \(z\) =  − 5.87, \(p\) < 0.001) as well as between medium- and low-risk samples (complete samples \(z\) =  − 6.32, \(p\) < 0.001; imputed samples \(z\) =  − 6.85, \(p\) < 0.001) reflecting a greater difference than those between high- and medium-risk samples (complete samples \(z\) =  − 2.21, \(p\) = 0.063; imputed samples \(z\) =  − 2.87, \(p\) = 0.010). This suggested a positive relationship between greater treatment effectiveness and increasing risk levels. In other words, individuals with high and medium pre-treatment risk levels were considered less likely to recidivate following treatment than low-risk individuals. Indeed, low-risk individuals were suggested to have no treatment benefit concerning sexual recidivism.

Fig. 3
figure 3

Moderator importance. Plot illustrating model-averaged importance of each moderator for sexual recidivism as an indicator of treatment effectiveness in persons with sexual offense histories. Importance values (IMs) represent the overall support of each moderator, i.e., the sum of weights across all models in which a moderator appeared. A vertical line is drawn at IM = 0.8, sometimes used as a cutoff to differentiate between important and less important variables, even though this is an arbitrary division (Anderson, 2007; Calcagno & Mazancourt, 2010)

After accounting for the moderator risk level, the residual heterogeneity was considerably reduced (complete samples \({I}^{2}\) = 26%; imputed samples \({I}^{2}\) = 16%) compared to the mean effect obtained in the random-effect meta-analysis (complete samples \({\text{I}}^{2}\) = 75%; imputed samples \({\text{I}}^{2}\) = 69%). This indicated that if the reduction in heterogeneity was be used as an indicator of model fit, the most important moderator risk level on its own would be sufficient to reduce residual heterogeneity by 65% (complete samples) and 77% (imputed samples; percentage reduction in residual heterogeneity compared to the mean effect obtained in the random-effect meta-analysis). The latter corresponds to the outcome reported by Holper et al. (2023); the slight difference compared to Holper et al. (2023) is because the present dataset contained imputed values for the moderator risk level. Still, the similarity in the two values corroborated the plausibility of the current imputation method.

Most importantly, none of the other moderators revealed statistically significant model-averaged effect sizes or significant subgroup contrasts (Fig. 2, Table 2) and was not considered to affect treatment effectiveness. Details on all moderator results are provided in the online supplementary materials.

Discussion

The present meta-analysis applied an information-theoretic approach to multimodel inference to assess the relative importance of moderators on sexual recidivism as an indicator of treatment effectiveness in persons with sexual offense histories. Our results suggested pre-treatment risk level as the only predictor of importance for treatment effectiveness. In contrast, other moderators previously thought to be plausibly related to treatment outcomes were considered negligible. These findings strengthen the first principle of the RNR model (risk principle) and weaken the evidence for the second and third principles (need and responsivity principles). The current findings based on the information-theoretic approach may thus encourage future refinements of the treatment recommendations in persons with sexual offense histories in the criminal justice system.

The present observations are in contrast with previous meta-analyses on the topic. While previous meta-analyses highlighted pre-treatment risk level certainly as one of the most relevant predictors of treatment effectiveness, all analyses reported several other moderators to be equally related to treatment effectiveness (Holper et al., 2023; Lösel & Schmucker, 2005; Schmucker & Lösel, 2015, 2017). Thus, the present work stands out by reporting risk level as the sole moderator of importance, thus upholding our initial hypothesis.

Hence, if one follows the present observations and considers pre-treatment risk as the sole contributor to sexual recidivism, the question will arise of how to translate this interpretation into forensic practice. The answer is to follow the risk principle while maintaining the need and responsivity principles. Following the risk principle would imply that intensive treatment may be reserved for higher-risk individuals because they may be more likely to benefit from treatment, whereas low-risk individuals may not receive treatment because it is considered inefficient for them. Simply following this notion would, however, ignore valuable previous controversy regarding the practical consequences of the risk principle (Hanson et al., 2009a, b; Landenberger & Lipsey, 2005; Lovins et al., 2009; Schmucker & Lösel, 2015, 2017; Ter Beek et al., 2018; Wilson et al., 2007). Direct comparisons of the efficacy of the risk principle to the need and responsivity principles were inconclusive. Although some meta-analyses suggested the risk principle to mainly contribute to the positive effects of the RNR model (Andrews & Dowden, 2006; Dowden & Andrews, 1999, 2000), others suggested the risk principle to have the most negligible influence on treatment effectiveness (Hanson et al., 2009b). Notably, the present observations supported the former statement. More generally, although the validity and practical utility of the RNR principles have been widely demonstrated (Polaschek, 2012), the evidence of their effectiveness in reducing recidivism has been inconclusive. Earlier meta-analyses reported that treatment programs designed and delivered according to the RNR principles were associated with significantly lower recidivism rates (Andrews et al., 1990; Dowden & Andrews, 1999, 2000; Dowden et al., 2003; Looman & Abracen, 2013) and that this effect was linearly related to the number of RNR principles incorporated into a treatment program (Hanson et al., 2009b). More recent work, however, suggested that the evidence for the effectiveness of the RNR model was low (Bijlsma et al., 2022; Duan et al., 2023; Seewald et al., 2018).

There has also been debate on methodological biases in profound treatment evaluation. First, potential biases related to base rates have been put forward. For example, recidivism base rates for individuals qualifying as low-risk have been typically reported to be too small that treatment may not lead to further reductions in sexual recidivism (Austin et al., 2002; Schmucker & Lösel, 2017). On the other side, biases related to co-morbid disorders have been considered. For example, high-risk individuals with strong psychopathic traits have been reported to be more challenging to treat and less likely to complete treatment, thus, leading to higher dropout rates (Sewall & Olver, 2019). Also, biases related to treatment dosage have been suggested. For example, treatment intensity varies with risk level because higher-risk individuals may receive more treatment than individuals with lower risk scores and fewer criminogenic factors (Mailloux et al., 2003). Consequently, cost-related biases have been discussed because higher treatment intensity in higher-risk individuals may also translate into higher costs per individual, which in turn may potentially lead to economic limitations in the criminal justice system (Aos et al., 2006; Bourgon & Armstrong, 2005).

These considerations suggest that, although pre-treatment risk level was identified as the most important predictor of treatment outcome in the present analysis, several methodological biases may still be unsolved. Future research must aim to eliminate those methodological biases in order to explain the relevance of the risk principle in the RNR model. The following approach might be worthwhile pursuing from the meta-analytical perspective and based on our experience with the current dataset. One could argue that since the RNR model centers on the individual offender being treated, traditional meta-analyses based on aggregated data may not correctly capture the benefits or drawbacks of the model. Therefore, a direction for future research should be to focus on individual patient data (IPD) meta-analysis, considered the meta-analytical “gold standard” (J. Higgins et al., 2023). IPD allows for the analysis of individual characteristics, thereby improving the quality and reliability of the evidence (Stewart & Tierney, 2002). The availability of IPD data would also overcome some of the limitations inherent in the present analysis, as detailed in the following paragraphs.

Limitations

The present analysis has several methodological limitations. First, the source of risk ratings may have impacted the overall risk level. Ratings based on individual risk assessments (e.g., Static-99) should generally be preferred over aggregated risk assessments (e.g., RRASOR). Aggregated risk assessments may provide a feasible approximation only in cases without ratings based on individual risk assessments. The present analysis followed the suggestion by Schmucker and Lösel (2017) and applied the RRASOR to obtain aggregated risk assessments. Considering the high proportion of aggregated risk assessments in the present analysis, the presented risk ratings may only be regarded as a rough estimate of the mean risk level. The most problematic issue related to aggregated risk assessments is that they hide the variation in individual risk scores, which can lead to biased estimates. Therefore, individual risk scores should be preferred in future studies in the field. Since individual risk scores are based on multiple items related to an individual’s demographic or offense history, future studies might even use those individual factors to perform IPD meta-analyses.

The dataset also contained moderators with missing values. One of these moderators was the pre-treatment risk level. The fact that only risk level and another variable, recidivism definition, contained missing data may have indicated that the data were missing at random (MAR). For example, a systematic reason for the MAR in these variables might be explained by the fact that both risk level and recidivism definition require additional resources, information, or tests to be performed by the study authors to report those values; therefore, these variables might have been less frequently reported. Otherwise, however, there were no indications of systematic missing data, and one may thus argue that the data were missing completely at random (MCAR). Using the MICE imputation applied in the present work, the findings across the imputed samples were basically in line with the work by Holper et al. (2023). The imputation performed on the missing moderator values may be considered plausible overall.

The present dataset included studies conducted both in adults and juveniles. Youths have different risk factors than adults for sexual recidivism, and sexual recidivism rates among youths are especially low. Therefore, these two groups should optimally be analyzed separately. Since the inclusion criteria of the present analysis were driven by the works of Schmucker and Lösel (2017) and Holper et al. (2023), and since we aimed to provide a direct comparison to these previous works, we maintained the dataset as previously published.

The present analysis was also limited to evaluating the main moderator effects. Multimodal inference can also be applied to interaction effects between moderators. Interaction effects, however, exponentially increase the number of candidate models, and the interpretation of numerous interaction terms is complex. For these reasons, the present analysis refrained from assessing interaction terms to keep the findings comprehensible in forensic practice.

Finally, the present analysis was limited to the evaluation of sexual recidivism. Though some of the primary studies included in the dataset reported recidivism rates for general and violent crimes, their overall small number (54% and 65% of the studies, respectively) was not considered sufficient to integrate these outcomes in the meta-analytical context adequately. Still, considering the debate on “specialization versus generality” in criminology, arguing that most individuals with sexual offense histories may indeed be generalists who do not restrict themselves to sexual crimes (e.g., Herrero et al., 2016; Lussier, 2005; Sample & Bray, 2003; Simon, 1997, 2000; Smallbone et al., 2003; Soothill et al., 2000), a profound comparison of treatment effectiveness for sexual versus general or violent recidivism may be another worthwhile direction for future meta-analyses.

Taken together, the present meta-analysis applied an information-theoretic approach to multimodel inference to assess the relative importance of moderators on sexual recidivism as an indicator of treatment effectiveness. The findings support researchers and decision-makers in interpreting the current evidence on the RNR model. Our observations encourage future refinements of the treatment recommendations in persons with sexual offense histories in the criminal justice system.