Introduction

Major depressive disorder (MDD) and anxiety disorders are the most prevalent, and often co-occurring, emotional disorders in the Western world. They are frequently associated with functional impairment and carry high socio-economic costs (Whiteford et al. 2013; Wittchen et al. 2011). Although cognitive behaviour therapy (CBT) and other evidence-based treatments exist for these disorders and generally show good effects, they do not work for every individual (Dunlop et al. 2017; Hofmann et al. 2012; Loerinc et al. 2015; Springer et al. 2018).

National treatment guidelines recommend CBT as one of the treatments for anxiety and for depression (NICE 2009, 2011, 2013; Danish Health Authority 2007a, b). CBT manuals traditionally address a single disorder by targeting specific psychological mechanisms (e.g., cognitive distortions) that are believed to maintain a particular disorder. Within the CBT framework, however, the frequent occurrence of co-morbidity among emotional disorders has spurred the development of transdiagnostic CBT, i.e., treatments that apply a unified set of interventions to address several anxiety disorders, depression, and other emotional disorders (Barlow et al. 2016; McEvoy et al. 2009; Reinholt and Krogh 2014; Reinholt et al. 2017). The Unified Protocol for Transdiagnostic Treatment of Emotional Disorders (UP, Barlow et al. 2011a, b; Barlow et al. 2018) targets negative affectivity by shifting emotion regulation strategies away from avoidance and towards more adaptive strategies (Barlow et al. 2014; Sauer-Zavala et al. 2012). This builds on empirical evidence suggesting that negative affectivity is an important transdiagnostic process underlying all emotional disorders (Brown et al. 1998; Clark and Watson 1991; Krueger et al. 2005). To this end, the UP integrates standard cognitive and behavioral techniques with mindfulness-based techniques, with the main focus on learning to accept emotional experiences and to manage difficult situations despite strong emotions.

In a recent randomized controlled trial for anxiety disorders, the UP and diagnosis-specific CBT protocols showed comparable symptom reductions (Barlow et al. 2017). Other studies furthermore suggest that transdiagnostic CBT (e.g., the UP) delivered in groups for patients with anxiety disorders and/or MDD can be effective in reducing anxiety and depressive symptoms (Bullis et al. 2015; de Ornelas Maia et al. 2015; Laposa et al. 2017; Norton and Hope 2005; Norton and Barrera 2012; Osma et al. 2015; Reinholt et al. 2017; Zemestani et al. 2017). However, no randomized controlled trials comparing diagnosis-specific CBT with the UP for groups including patients with MDD have been published (Arnfred et al. 2017; Reinholt et al. 2017).

While the reported average effects of the UP and diagnosis-specific CBT are comparable, it is possible that some patients would benefit more from the broader focus on emotion regulation applied in the UP, while others would benefit more from the specific, symptom-focused approach applied in diagnosis-specific CBT. Clinical practice would therefore benefit from knowledge that can assist in matching individuals to the optimal treatment (i.e., personalized approaches to clinical decision making).

Studying individual pre-treatment characteristics that reliably predict (differential) treatment outcomes is a possible approach to the identification of patients who would benefit most from transdiagnostic and/or diagnosis-specific CBT. Such individual characteristics may be either prescriptive or prognostic. Prescriptive variables (i.e., moderators) can help identify for whom or under what conditions a treatment has a certain causal effect on outcome. Moderators are thus useful in stratifying a population into subgroups: those who would experience greater improvement from transdiagnostic CBT and those for whom diagnosis-specific CBT would be best (Kraemer 2013). Prognostic variables (i.e., predictors), by contrast, predict treatment outcomes irrespective of treatment type. Researchers have recently developed multivariable models or algorithms to predict (differential) outcomes by integrating multiple predictors and/or moderators (Cohen and DeRubeis 2018). Introduced by DeRubeis and colleagues, the Personalized Advantage Index (PAI, DeRubeis et al. 2014a, b) is a promising approach to the prediction of differential treatment effects, which has been replicated by others (Huibers et al. 2015; Keefe et al. 2018; Vittengl et al. 2017; Webb et al. 2019; Zilcha-Mano et al. 2016). This two-step approach first selects relevant predictors and moderators of treatment outcomes, then uses the variables to construct a PAI to recommend a specific treatment for each individual. The recommendation is based on quantitative estimates of the predicted advantage of the optimal treatment over the non-optimal treatment. Further examples of multivariable prediction models to guide treatment selection are provided by the “matching factor” (Barber and Muenz 1996), the “nearest-neighbours” approach (Lutz et al. 2006), and the M* approach (Kraemer 2013; Niles et al. 2017a, b; Smagula et al. 2016; Wallace et al. 2013). A number of multivariable prediction models containing only prognostic information have also been developed over the past years. Studies using the Prognostic Index (PI) are promising attempts to use a quantified estimate of the individual's prognosis to help determine the needed level of care (Lorenzo-Luaces et al. 2017; van Bronswijk et al. 2019a). Such studies would be particularly relevant with regard to group CBT, since the treatment of patients with anxiety and depression is often fraught with uncertainty regarding treatment outcomes. The evidence concerning predictors of CBT effects is mainly derived from individual treatment studies (Marker et al. 2019); whether the same predictors are relevant for group CBT is not known. However, there is evidence to suggest that:

  (a) co-morbid depression in group CBT for anxiety is a poor predictor of treatment effects (Talkovsky et al. 2016);

  (b) interpersonal problems (measured before treatment) predict a lower effect of group CBT for depression (McEvoy et al. 2013);

  (c) motivation predicts the effect of group CBT for anxiety disorders (Marker et al. 2019);

  (d) baseline severity and level of depression moderate the differential effect of group CBT and mindfulness-based stress reduction for anxiety disorders (Arch and Ayers 2013).

As the evidence cited above stems from single studies with relatively modest sample sizes, our knowledge of what can help us predict who will improve during any type of group CBT, and who will improve more from group-delivered UP than from diagnosis-specific group CBT, remains limited.

The aim of the present paper was to examine moderators of the treatment effects of the UP and diagnosis-specific group CBT, using the PAI approach in a multi-site randomized controlled non-inferiority trial. The sampled patients suffered from relatively severe and chronic symptoms, and the majority had failed to respond to either psychotherapy or medication. As this is typical of patients in Danish psychiatric outpatient settings, the search for moderators and predictors is all the more relevant.

If we failed to identify any moderators, we planned to continue with an examination of general predictors and to develop a PI providing a personalized prediction of outcome, irrespective of the treatment received. We likewise planned to gauge the site specificity of the final model to test the generalizability of the results to other contexts. We are not aware of any other studies that have examined moderators of treatment outcome in transdiagnostic versus diagnosis-specific CBT. Furthermore, as the knowledge of predictors of group CBT treatment outcomes is very limited, we used a data-driven analysis strategy rather than testing specific hypotheses. The study was thus exploratory in nature, aiming to use its findings for the design of hypothesis-driven studies.

Methods

Design and Participants

Data came from a randomized controlled trial of the efficacy of transdiagnostic and diagnosis-specific group CBT for MDD, social anxiety disorder, agoraphobia, and/or panic disorder (Arnfred et al. 2017). The trial is registered at Clinicaltrials.gov (NCT02954731); the primary results are planned for publication in a separate paper (Reinholt et al. under review).

Adult patients were recruited from three Danish psychiatric outpatient clinics, which form part of the publicly funded mental health services. The referral criterion for this secondary service was failure to respond to treatment (medication and/or psychotherapy) in primary care, i.e., with general practitioners or with private practice psychiatrists or psychologists. The patients may thus be classified as resistant to treatment. They were assessed using the full MINI Neuropsychiatric Interview, version 7.0.2 (except for module P; Sheehan et al. 1998). The trial inclusion criteria were (1) a principal DSM-5 diagnosis of MDD (single-episode or recurrent; approximately 50%), social anxiety disorder (approximately 25%), or agoraphobia/panic disorder (approximately 25%), (2) age 18–65 years, (3) no use of antidepressants, or unchanged use of antidepressants for at least four weeks before intervention onset with no anticipated change in use, and (4) sufficient knowledge of the Danish language. Exclusion criteria were (1) moderate or high risk of suicide, (2) alcohol or substance use disorders, (3) a diagnosis of cluster A or B (DSM-5) personality disorder, (4) co-morbidity of pervasive developmental disorder, psychotic disorders, eating disorders, bipolar disorder, severe physical illness, or untreated attention deficit hyperactivity disorder, (5) complex psychopharmacological treatment (receiving more than three types of medication for mental health disorders), (6) concurrent psychotherapy, and (7) declining to stop the use of anxiolytics within the first four weeks of the intervention.

After written informed consent had been secured, a total of 291 patients were randomly assigned to transdiagnostic or to diagnosis-specific CBT.

Treatments

For both versions of CBT, treatment consisted of one individual session followed by 14 weekly two-hour group sessions. A group version of David Barlow's Unified Protocol for Transdiagnostic Treatment of Emotional Disorders was used as transdiagnostic CBT manual (UP, Barlow et al. 2011a, b). The manual was modified from a Danish translation of the published UP for individual therapy, with incorporation of recommendations on group delivery by the UP Institute, Center for Anxiety and Related Disorders (CARD), Boston University (Reinholt et al. 2017). Unpublished manuals based on standard group CBT manuals were used as diagnosis-specific CBT manuals (Arendt and Rosenberg 2012; Craske and Barlow 2008; Due Madsen 2008; Turk et al. 2008), with adaptations for the Danish mental health service context.

Therapists

The 57 therapists (13 for UP and 44 for diagnosis-specific CBT) who participated in the study were licensed psychologists (n = 38), medical doctors in training as psychiatry specialists (n = 7), other professionals trained in psychotherapy (n = 7), or psychiatrists (n = 5). At study onset, the therapists held an average of 8.84 years (95% CI 6.98–10.71) of clinical experience. All groups were led by at least one experienced psychologist or psychiatrist and one co-therapist, who could be less acquainted with the manual. At least one of the two therapists had completed pre-trial training, followed by monthly supervision. Each therapist led between one and five groups (mean = 2.22). No statistically significant difference in years of experience was found between the UP therapists and the diagnosis-specific CBT therapists (two-sample Wilcoxon rank-sum test: z = 1.355, p = 0.1754).

Therapy sessions were audiotaped; 20% of the sessions were randomly selected (stratified for session number, diagnosis, and site) for assessment of therapeutic competence and protocol adherence. Competence levels were rated using the Revised Cognitive Therapy Scale (Blackburn et al. 2001; ranging from 1 to 6); while competence was considered good to excellent (mean score between 4 and 6) in both conditions, mean competence scores were statistically significantly higher among UP therapists because of a lower competence level in diagnosis-specific CBT at one site. The potential implications of this difference are addressed in the primary RCT paper (Reinholt et al. under review). Using a scale developed for the specific manuals, adherence was measured as a percentage, with a score of 80% considered "good adherence"; this was achieved in both treatment arms.

Measures

Primary Outcome

As we included patients with a range of different diagnoses, we selected well-being as a user-relevant and transdiagnostic outcome measure. This was measured using the WHO Well-being index (WHO-5), which has been shown to have high clinimetric validity and to be a valid outcome measure in clinical trials. For the Danish population in general, a mean WHO-5 score of 70 has been found; 10 points is considered the threshold for a clinically relevant change. The scale may also be used as a screening tool for depression, in which case a cut-off score of ≤ 50 is used (Topp et al. 2015). The WHO-5 was completed at 16 time points: pre-treatment, before each group session, and immediately after the final treatment session (post-treatment). To aggregate the weekly WHO-5 measures, we calculated a regression slope for each individual. This was a simple linear regression with one response variable (WHO-5) and one explanatory variable (weeks) for each patient (Triola 2010). To establish a robust estimate of the treatment effect for each patient, this model was chosen as an alternative to using end-point status or the difference between pre- and post-measures, which fails to take advantage of the multiple data points and leaves the outcome measure vulnerable to outliers. We did, however, repeat the analyses with change scores as the outcome (the difference between WHO-5 pre- and post-treatment) to allow for comparison with future studies, which may not have access to repeated measures throughout the treatment. The results of these analyses are listed in Online Appendix C.
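The per-patient aggregation of the weekly WHO-5 scores into a single slope can be sketched in R roughly as follows. This is a minimal illustration only; the data frame `who5_long` and its column names (`patient_id`, `week`, `who5`) are our assumptions, not names from the trial data set.

```r
# Minimal sketch of the outcome aggregation: one WHO-5 slope per patient,
# estimated by simple linear regression of WHO-5 on time in weeks.
slope_per_patient <- function(who5_long) {
  ids <- unique(who5_long$patient_id)
  slopes <- sapply(ids, function(id) {
    d <- who5_long[who5_long$patient_id == id, ]
    fit <- lm(who5 ~ week, data = d)   # WHO-5 regressed on week (0 = pre-treatment)
    coef(fit)[["week"]]                # weekly rate of change, used as the outcome
  })
  data.frame(patient_id = ids, slope_who5 = slopes)
}
```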

Pre-treatment Variables

Forty-two pre-treatment variables were available as candidate predictors or moderators. To reduce multicollinearity between pre-treatment variables and thereby eliminate redundant information, we examined correlations between variables corrected for attenuation, applying a cut-off of 0.7, as has previously been done in similar studies (van Bronswijk et al. 2019a, b). Removing as few variables as possible, eight variables were removed in the moderator analyses and seven in the predictor analyses (including the pre-treatment measure of WHO-5). The clinical variables included in the analyses appear in Table 1. The following scales were used: the six-item Hamilton Anxiety Rating Scale (HAM-A6) (Hamilton 1959), Beck Depression Inventory-II (BDI-II) (Beck et al. 1996), Positive and Negative Affect Schedule (PANAS) (Watson et al. 1988), Reflective Functioning Questionnaire (RFQ) (Fonagy et al. 2016), Emotion Regulation Questionnaire (ERQ) (Gross and John 2003), Emotion Regulation Skills Questionnaire (ERSQ) (Berking et al. 2008), Perseverative Thinking Questionnaire (PTQ) (Ehring et al. 2011), Personality Inventory for DSM-5—Short Form (PID-5-SF) (Krueger et al. 2012), Level of Personality Functioning Scale—Brief Form 2.0 (LPFS-BF) (Hutsebaut et al. 2016), Life Events Checklist for DSM-5 (LEC-5) (Gray et al. 2004), and Standardized Assessment of Personality—Abbreviated Scale (SAPAS) (Moran et al. 2003). More details on the scales were reported in Arnfred et al. (2017); they may also be found in Online Appendix A (supplemental material). Information on demographics and backgrounds is listed in Online Appendix B.
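A simplified sketch of this redundancy screening is shown below. The study corrected correlations for attenuation; for illustration we use raw Pearson correlations with the same 0.7 cut-off, and the data frame name `predictors` is an assumption.

```r
# Flag pairs of candidate pre-treatment variables whose correlation exceeds the cut-off,
# so that one variable of each redundant pair can be removed before the analyses.
flag_redundant_pairs <- function(predictors, cutoff = 0.7) {
  r <- abs(cor(predictors, use = "pairwise.complete.obs"))
  r[upper.tri(r, diag = TRUE)] <- NA                 # keep each variable pair only once
  idx <- which(r >= cutoff, arr.ind = TRUE)          # pairs exceeding the cut-off
  data.frame(var1 = rownames(r)[idx[, 1]],
             var2 = colnames(r)[idx[, 2]],
             r    = r[idx])
}
```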

Table 1 Clinical characteristics of groups

Statistical Analyses

Missing Data and Description of Variables

Missing values were imputed using a non-parametric random forest approach (R package "missForest"; Stekhoven and Bühlmann 2012) with all available data as input. This imputation approach has been shown to be accurate, is comparable to multiple imputation, and has been used in similar studies before (van Bronswijk et al. 2019a, b; Waljee et al. 2013). The imputation method was tested by applying the same method to the portion of the dataset with no missing values, using artificially produced missing data, and subsequently comparing the imputed values with the actual data values by estimating the normalized root mean squared error (NRMSE) for continuous data and the proportion of falsely classified entries (PFC) for categorical data, as suggested by Stekhoven and Bühlmann (2012). Since the missing values may not have been missing at random, we wanted to test whether imputing the outcome variable could lead to biased results (Sullivan et al. 2017). As a sensitivity analysis, we therefore repeated the analyses on a dataset in which the outcome variable was not imputed (see Online Appendix C).
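A minimal sketch of this imputation and its evaluation with the missForest package is given below. The data frame name `dat`, the assumption that its columns are numeric or factor, and the exact missingness rate passed to `prodNA` are ours; the paper only reports the overall procedure.

```r
library(missForest)  # Stekhoven & Bühlmann (2012)

set.seed(1)
imp <- missForest(dat)      # non-parametric random forest imputation
dat_imp <- imp$ximp         # imputed data set used in subsequent analyses

# Evaluate the method on the complete-case subset by introducing artificial missingness:
complete_cases <- dat[complete.cases(dat), ]
with_holes <- prodNA(complete_cases, noNA = 0.145)   # roughly the observed missingness rate
re_imp <- missForest(with_holes)
mixError(re_imp$ximp, with_holes, complete_cases)    # NRMSE (continuous) and PFC (categorical)
```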

In the PAI analysis, we were interested in the differential effects of the two treatments. To this end, we built a model based on individuals who had received a meaningful course of CBT. Previous research has indicated that four treatment contacts should be considered the lowest meaningful amount of psychotherapy (Delgadillo et al. 2014; Robinson et al. 2020). This four-sessions-or-more rule has been applied in a previous PAI study (Cohen et al. 2020) and was also used in the present study. For the PI analyses, we did not use the same restricted sample, as we were interested in general predictors irrespective of the amount of treatment received.

Pre-treatment Variable Transformation

Discrete variables were centred and continuous variables with a near-to-normal distribution were standardized. Continuous variables with a non-normal distribution were transformed using log transformation or a square root transformation based on visual inspection of histograms.
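A brief sketch of these transformations follows; all column names are illustrative assumptions, and which transformation a given variable received was decided in the study by visual inspection, not by the rules of thumb shown in the comments.

```r
# Sketch of the pre-treatment variable transformations.
dat$female      <- scale(dat$female, scale = FALSE)[, 1]  # discrete variable: centred only
dat$bdi         <- scale(dat$bdi)[, 1]                    # near-normal continuous: standardized
dat$duration    <- log(dat$duration + 1)                  # right-skewed: log transformation
dat$life_events <- sqrt(dat$life_events)                  # count-like: square-root transformation
```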

Pre-treatment Variable Selection

Different statistical methods have been applied to select pre-treatment variables for personalized prediction models (Cohen and DeRubeis 2018). In 2018, Keefe et al. introduced a two-step machine learning approach using a random forest method for model-based recursive partitioning followed by a stepwise AIC-penalized bootstrapped method. Since the development of prediction models in mental health is still in its infancy, there is little agreement on the most appropriate machine learning techniques. However, based on the following strengths, we chose to use the two-step method introduced by Keefe et al. (2018):

  1. Random forest: Previous research gives reason to believe that treatment response is a multifactorial process involving multiple variables, each having a small effect (Cohen and DeRubeis 2018). It is therefore important to apply a method that, while handling a large number of pre-treatment variables, prevents slightly weaker pre-treatment variables from being dominated by stronger ones. The random forest methodology for model-based recursive partitioning is well suited for this purpose.

  2. Stepwise AIC-penalized bootstrapped method: A second variable selection approach was added to ensure that the pre-treatment variables were predictive in multiple bootstrapped replications of the data. This approach also enabled us to determine the direction of the relationship between each pre-treatment variable and the outcome, based on the sign of the estimated regression coefficients.

In the first step, we applied a model-based recursive partitioning method using a random forest algorithm (R package "mobForest"; Garge et al. 2013, 2018). With this algorithm, multiple trees are created by repeatedly splitting bootstrapped samples into two subgroups. Splits are based on pre-treatment variables that lead to significantly different model behaviour on either side of the split. A regression model with the slope of the WHO-5 as the dependent variable and treatment as the independent variable (y = treatment) acted as the pre-determined model for the current analyses; a "split variable" therefore indicates an association with treatment differences (i.e., a potential moderator). A random subset of 10 variables was available to inform each split. The algorithm produced 10,000 different tree-like structures by repeatedly splitting the sample on the variable with the strongest moderator impact until a minimum subgroup size of 10 individuals was reached. By using different random subsets of variables, moderators with smaller effects were less likely to be dominated by stronger moderators (Strobl et al. 2008). An α-level of 0.10 was set for splitting. This method provides a variable importance score for each pre-treatment variable, indicating its predictive value. Variables were carried forward to the next step if their permutation accuracy importance score exceeded the threshold, defined as the absolute value of the variable importance score of the lowest-ranking variable (Garge et al. 2013). As the study was a multi-centre study, the variable "site" was included in the analyses, together with the pre-treatment variables.
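For illustration, a single model-based tree of this kind can be grown with partykit's lmtree; the study grew a forest of 10,000 such trees with the mobForest package, so this is a conceptual sketch only. The column names not reported in the paper are assumptions, and the control arguments shown are our attempt to mirror the α = 0.10 splitting level and the minimum subgroup size of 10.

```r
library(partykit)

# One model-based tree: the node model is the pre-determined regression of the WHO-5 slope
# on treatment; the variables after '|' are the candidate split (moderator) variables.
tree <- lmtree(slope_who5 ~ treatment | age + positive_affect + detachment +
                 duration + bdi + site,
               data = dat, alpha = 0.10, minsize = 10)
plot(tree)  # any split marks a pre-treatment variable associated with differential effects
```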

As a second step, we tested whether the variables identified in the first step would be selected as significant moderators in at least 60% of 1000 bootstrapped samples, using a multiple linear regression model with backward elimination (α-level of 0.05; R package "bootStepAIC"; Austin and Tu 2004; Rizopoulos and Rizopoulos 2009). The regression model was specified with the slope of the WHO-5 as the dependent variable and the first-step pre-treatment variables (selected in the model-based recursive partitioning method) as the independent variables, along with their interactions with treatment. The outcomes of this analysis were the number of times each variable was selected as a statistically significant moderator in the bootstrapped samples and whether the regression coefficient was positive or negative. A moderator was considered robust if selected in at least 60% of the samples. The 60% cut-off has previously been shown to yield a parsimonious model with a good model fit (Austin and Tu 2004), and similar studies have used it as a cut-off (Cohen et al. 2020; Keefe et al. 2018; van Bronswijk et al. 2019a, b; Zilcha-Mano et al. 2016).
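This second step might be sketched as follows. The variable names other than `slope_who5` and `treatment` are assumptions carried over from the sketch above, and the exact arguments of `boot.stepAIC` shown here are to the best of our knowledge rather than taken from the study's code.

```r
library(bootStepAIC)  # bootstrapped stepAIC procedure of Austin and Tu (2004)

# Full model: first-step variables plus their interactions with treatment.
full_model <- lm(slope_who5 ~ (age + positive_affect + detachment + duration + bdi) * treatment,
                 data = dat)
boot_sel <- boot.stepAIC(full_model, data = dat, B = 1000, direction = "backward")
boot_sel  # reports how often each term was retained and the sign of its coefficient;
          # interaction terms retained in >= 60% of samples would count as robust moderators
```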

Our results may have been substantially affected by the parameter selection, e.g., the number of randomly preselected predictor variables and the minimum required number of individuals at each node to form the tree-like structure (Strobl et al. 2009). We therefore planned to assess the potential influence of different parameter selections on the variables selected by the model-based recursive partitioning method (as reported in Online Appendix C).

Building and Evaluating the Personalized Advantage Index

If the two-step variable selection approach identified robust moderators, we planned to create a Personalized Advantage Index (PAI). The PAI combines the identified pre-treatment variables in a multiple linear regression model with the slope of the WHO-5 as the dependent variable. The independent variables are the pre-treatment variables identified as predictors, and the variables identified as moderators together with their interactions with treatment. For instance, if two general predictors and two moderators were selected in the two-step approach described above, the equation for the PAI would be: SlopeWHO-5 = β0 + β1*predictor1 + β2*predictor2 + β3*moderator1 + β4*moderator2 + β5*moderator1*treatment + β6*moderator2*treatment. Individual outcome predictions for each treatment are then calculated from this regression model using fivefold cross-validation. In such a validation, the sample is split into five equal groups, and individual outcomes are predicted using the regression model with weights based on the data of the other four groups, to which the individual patient does not belong (Picard and Cook 1984). Two separate predictions are made for each individual, one for each treatment (UP and diagnosis-specific). The difference (positive or negative) between the predicted outcomes (predicted WHO-5 slopes) constitutes the PAI, indicating whether UP or diagnosis-specific CBT should be recommended for the treatment of the given patient. A quantitative estimate of the advantage of the recommended treatment is thus provided.
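A minimal sketch of this planned procedure is given below. The predictor and moderator names, the treatment labels ("UP", "dCBT"), and the fold assignment are illustrative assumptions; unlike the equation above, the fitted model also includes a treatment main effect so that the regression is well formed.

```r
# Fivefold cross-validated counterfactual predictions and the resulting PAI.
set.seed(1)
dat$fold <- sample(rep(1:5, length.out = nrow(dat)))
dat$pred_up <- dat$pred_dcbt <- NA_real_

for (k in 1:5) {
  train <- dat[dat$fold != k, ]
  test  <- dat[dat$fold == k, ]
  fit <- lm(slope_who5 ~ predictor1 + predictor2 +
              (moderator1 + moderator2) * treatment, data = train)
  # Two predictions per held-out patient, one under each treatment:
  test_up <- test;   test_up$treatment   <- factor("UP",   levels = levels(dat$treatment))
  test_dcbt <- test; test_dcbt$treatment <- factor("dCBT", levels = levels(dat$treatment))
  dat$pred_up[dat$fold == k]   <- predict(fit, newdata = test_up)
  dat$pred_dcbt[dat$fold == k] <- predict(fit, newdata = test_dcbt)
}

# The PAI is the predicted advantage of one treatment over the other;
# its sign indicates which treatment would be recommended for that patient.
dat$pai <- dat$pred_up - dat$pred_dcbt
```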

To evaluate the PAI, we planned to compare the actual slopes for patients who had received the PAI-indicated treatment with the slopes for patients who had received their non-indicated treatment. This would be done by t-testing (DeRubeis et al. 2014a, b).
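Continuing the sketch above (with the same assumed names), this evaluation could look as follows:

```r
# Compare observed slopes between patients who happened to receive their PAI-indicated
# treatment and those who received the non-indicated treatment.
dat$indicated     <- ifelse(dat$pai > 0, "UP", "dCBT")
dat$got_indicated <- dat$indicated == as.character(dat$treatment)
t.test(slope_who5 ~ got_indicated, data = dat)
```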

Building and Evaluating the Prognostic Index (PI)

If no robust moderators were identified in the two-step variable selection approach, we planned to build a PI to predict overall treatment effects for each individual. To identify robust predictors, we planned to use a variable selection procedure similar to the one described above, but with three adjustments: (1) in the first step, we would adjust the pre-determined model to identify general predictors of outcome rather than moderators of differential outcome (y = intercept + variables); (2) we would not include interactions with treatment in the regression model of the second step; and (3) we would use the whole sample (n = 291) rather than only those patients who completed at least four sessions, since differential treatment effects were irrelevant here. After the two-step variable selection, we planned to use the identified predictors to build a multiple linear regression model with the slope of the WHO-5 as the dependent variable and the identified predictors as the independent variables (SlopeWHO-5 = β0 + β1*predictor1 + β2*predictor2 + … + βn*predictorn). Using fivefold cross-validation, individual outcome predictions could then be calculated on the basis of this regression model (Picard and Cook 1984).

To evaluate the PI, we planned to compare the predicted slope for each individual (based on the weights of the other four folds, to which the individual did not belong) with the actual "observed" slope for that individual. The difference between the two WHO-5 slopes could then be calculated, and the association between these scores examined using Pearson’s correlation analysis.
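The PI construction and evaluation might be sketched as follows. The column names are assumptions (the log transformations mirror the final model reported in the Results), and the fold scheme is ours.

```r
# Cross-validated prediction of each patient's WHO-5 slope from general predictors only
# (no treatment terms), compared with the observed slope.
set.seed(1)
dat$fold <- sample(rep(1:5, length.out = nrow(dat)))
dat$pi_pred <- NA_real_

for (k in 1:5) {
  fit <- lm(slope_who5 ~ log(positive_affect) + log(duration) + detachment + reappraisal,
            data = dat[dat$fold != k, ])
  dat$pi_pred[dat$fold == k] <- predict(fit, newdata = dat[dat$fold == k, ])
}

mean(dat$pi_pred - dat$slope_who5)     # mean difference between predicted and observed slopes
cor.test(dat$pi_pred, dat$slope_who5)  # Pearson correlation used to evaluate the PI
```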

Testing the Generalizability of Variable Selection and Model Fitting

Since we used the full sample for the two-step variable selection approach and model fitting, the PAI/PI may have been inflated due to double-dipping (Fiedler 2011; Vul et al. 2009). Therefore, as a sensitivity analysis, we planned to revisit all the steps using data combining Sites 1 and 3 (n = 168) and to compare the results with those of Site 2 (n = 123; see Online Appendix C).

Results

Variable Description and Missing Data

The PI analyses included 291 randomized patients (UP: n = 148; diagnosis-specific CBT: n = 143). The means and frequencies of the clinical characteristics of the patients are listed in Table 1. Online Appendix B lists demographic and background information. The median number of completed sessions in both groups was 10. Four or more group sessions were completed by 228 patients (UP: n = 110; diagnosis-specific CBT: n = 118), who were included in the PAI analyses.

There were no significant differences between the two treatment groups regarding demographic or clinical characteristics except that the UP group had higher levels of anxiety symptoms on HAM-A.

A total of 14.5% of the observations were missing. We tested the imputation method by artificially producing missing data in the subset of 81 patients who had no missing pre-treatment data and no missing outcome data before Session 5. With an error estimate of 0.33 for the continuous variables (NRMSE) and 0.41 for the discrete variables (PFC), the imputation method appeared satisfactory.

Variable Selection for Personalized Advantage Index

Age, positive affect (PANAS), level of the detachment personality trait, duration of disorder, and BDI-II score were identified by the model-based recursive partitioning method, with age having the largest variable importance score. None of these variables, however, were selected as moderators in at least 60% of the bootstrap samples using the backward elimination technique, indicating that no robust moderators were available for building a PAI.

Variable Selection for PI

Thirteen predictors of outcome were identified by running the model-based recursive partitioning method in the search of predictors. They are depicted in Table 2.

Table 2 Predictors selected with model-based recursive partitioning technique (N = 291)

Only level of positive affect (measured by the PANAS), duration of disorder, level of detachment personality trait, and cognitive reappraisal were selected in at least 60% of the bootstrap samples (Table 2, in bold). More than 99% of the bootstrap samples showed negative coefficients, indicating that longer duration of disorder, higher levels of positive affect, detachment, and cognitive reappraisal were associated with less improvement during treatment.

Estimating PI Scores Using Fivefold Cross Validation

The four selected predictors were combined into the following multiple linear regression model: SlopeWHO-5 = β0 + β1*log(positive affect) + β2*log(duration of disorder) + β3*(detachment) + β4*(cognitive reappraisal). Comparing the actual "observed" slope with the predicted slope (based on weights from four of the five folds), we found a mean difference of 0.01, with a 95% CI between − 0.10 and 0.13. The correlation (Pearson's r) was 0.25 (p < 0.001).

Generalizability of Variable Selection and Model Fitting

When we conducted the two variable selection steps separately in the two site-based subsamples (Site 2 versus Sites 1 and 3), we were unable to identify any predictors that were selected both in the primary PI and in the PIs based on the two separate samples (see Online Appendix C).

Sensitivity Analyses

No considerable change in results was observed when the analyses were repeated without imputing the outcome variable, except that only two variables were selected in both steps: positive affect and duration of disorder. Changing the parameters in the model-based recursive partitioning method likewise did not modify the results (details in Online Appendix C).

Discussion

We were unable to identify any robust moderators of differential treatment outcome in this study. This suggests that the pre-treatment patient variables trialled here are of no use in predicting whether patients with social anxiety disorder, agoraphobia/panic disorder, and/or MDD will improve more or less from the UP compared with diagnosis-specific group CBT. We identified four predictors of outcome: level of positive affect, duration of disorder, the detachment personality trait, and the cognitive reappraisal coping strategy. The predictors were negatively associated with outcome, indicating that higher pre-treatment levels of positive affect, i.e., feeling enthusiastic, active, and alert, predicted less improvement during treatment. This may be surprising but could result from the fact that positive affect and well-being are highly correlated and that a high level of well-being at baseline allows less room for improvement and thus a lower slope. This finding corresponds with those of a study of social phobia in which high positive affect before treatment was found to predict less improvement in quality of life during treatment (Sewart et al. 2019). Cognitive reappraisal as an emotion regulation strategy is thought to intervene early in the emotion-generative process and typically leads to more positive emotions and fewer negative emotions (Gross and John 2003). Our finding that this coping strategy predicts less improvement may therefore be surprising, whereas it may be less surprising that a longer duration of disorder and the detachment personality trait (the tendency to avoid socioemotional experience) predicted less improvement. According to the sensitivity analysis concerning site specificity, however, the PI model was not reliable across treatment sites. This indicates that further study is needed to improve the PI. Variables such as motivation and expectancy as well as interpersonal problems could be relevant for inclusion in future studies. Measures of stability/strain in everyday life should likewise be considered, since these aspects of the patient's life may interfere with homework engagement, which has been shown to mediate the effect of CBT in some studies (Cammin-Nowak et al. 2013; Westra et al. 2007).

Our results provide no evidence to support a preference for either UP or diagnosis-specific group CBT for a given patient. Accordingly, treatment selection may be based on patient preferences or logistics as well as on the variables examined. However, other variables, not addressed in the present study (e.g., interpersonal problems or prior experience with CBT), should be investigated in future studies before any conclusions can be drawn as to the existence of moderators. In some treatment settings, a logistic advantage may be obtained by using transdiagnostic group CBT rather than diagnosis-specific group CBT since patients would not have to wait until a sufficient number of patients with the same diagnosis were ready to enter their designated group. Moreover, shifting from diagnosis-specific to transdiagnostic CBT may potentially reduce the costs of training therapists, as they would need training only in one manual.

Several aspects may be considered when attempting to explain the lack of identified moderators in this study. Despite differences in the treatment targets of the UP and diagnosis-specific CBT and the heterogeneity of the UP group (only this group had patients with different diagnoses), a likely explanation is that the two therapy formats were too similar in nature and used similar or overlapping interventions. For instance, although the therapies were delivered with different rationales and goals, exposure was an important intervention in both arms of the study. This may have rendered the search for moderators less relevant here than in previous PAI studies, where psychotherapy was compared with medications (e.g., DeRubeis et al. 2014a, b; Vittengl et al. 2017) or CBT was compared with either interpersonal therapy (Huibers et al. 2015; van Bronswijk et al. 2019a, b) or psychodynamic therapy (Cohen et al. 2020).

Another important question concerns the quality of treatment. Both treatment formats led to limited improvement, as a mean WHO-5 slope of approximately 0.6 (95% CI 0.43–0.77) corresponds to an 8.4-point change (95% CI 6.02–10.78) over the 14 weeks. The mean change obtained in our study thus failed to reach the clinically significant level, which was defined as a 10-point improvement on the WHO-5 scale (Arnfred et al. 2017). However, as the current study was set in secondary services, the included patients had previously failed to respond to medication and/or psychotherapy offered by primary services (typically delivered by a general practitioner or a private practice psychiatrist or psychologist). The patients may therefore be classified as treatment resistant, with a very low level of functioning in daily life (e.g., less than 10% worked full-time at time of study enrolment; 22% were classified as students). In addition, the use of the well-being index as an outcome measure may have diminished the obtained treatment effect, since positive variables tend to be less responsive as treatment targets (Sewart et al. 2019). However, changes in well-being and symptoms were similar in magnitude, which suggests that this explanation is less plausible here (Reinholt et al., under review). Taking into account the severity and chronicity of the sample as well as the relatively short duration of the treatment, the positive nature of the outcome measure, and the fact that the rating of treatment quality and adherence was high, we find it unlikely that the modest treatment effect reflects poor treatment quality. It cannot be ruled out, however, that moderators may have been identified if we had used a less chronic sample, a longer duration of treatment, or another outcome measure, thereby increasing the potential for change during treatment. The homogeneity of the study population is another possible explanation for the failure to detect differences in outcome, since the participants were all non-responders to first-line treatment. Likewise, a high proportion of the group may have been what DeRubeis and colleagues have termed intractable, i.e., patients who would experience no improvement no matter the quality of therapy provided (DeRubeis et al. 2014a, b). However, the highly varied response renders this explanation less likely. The slope range of − 3.56 to 6.24 corresponds to a change on the WHO-5 ranging from − 49.8 to 87.36 points.

It is also possible that the samples from the three treatment sites differed since it was not possible to identify the same predictors in the three sites when conducting the analyses separately. However, treatment site was not selected as an important variable in the first variable selection step, at which the samples were pooled.

It may be speculated that we overestimated the capacity of multivariable prediction models at this stage of model development. Perhaps the difficulty of building prediction algorithms is greater than expected. Other factors may have played a role, such as the numerous interactions between variables, which may have rendered the models too complex, combined with the possibility that pre-treatment variables were less important than processes during or outside treatment (i.e., non-specific treatment factors such as alliance and group cohesion or unexpected events in the family or at work, etc., Kazantzis et al. 2018; Zilcha-Mano 2017).

Proponents of the complex network approach to psychopathology argue that symptoms are not a reflection of an underlying disease or dimension (e.g., neuroticism), but rather that they constitute the disease itself in the form of a complex network of elements interacting in ways that tend to maintain themselves (Hofmann et al. 2016). From this perspective, response to treatment is caused by critical transitions in network states; hence, pre-treatment factors do not necessarily offer important information on the likelihood of change or the resilience of a given network, which may explain the scarcity of predictors of treatment effect found in the current study. Furthermore, it is possible that the similarities between the two treatment conditions may have provided a similar push to the complex networks, which may explain the failure to find any moderator of treatment effect. In a complex network perspective, the current understanding of moderation as the effect of baseline characteristics may be too narrow, inasmuch as reassessment of the same patient may be needed to capture the dynamics of a complex idiographic network (Hofmann and Hayes 2019).

Variable selection methods provide a further point of discussion. Such approaches may be biased towards multi-category variables, which are more likely to be selected in the model (Kim and Loh 2001; Strobl et al. 2007). To reduce bias in the current study, predictors were standardized before running the model-based recursive partitioning analysis. We cannot rule out, however, a bias in the variable selection and the estimation of variable importance scores. The effect may be that less optimal predictors are carried forward to the next step. Additionally, the sample size of our study may have been too small. Power calculation is no simple task in analyses such as these, but a newly published simulation study by Luedtke and colleagues concludes that a sample size of 300 per treatment arm is required for sufficient statistical power to use multivariable prediction models with four predictors for comparison of two or more treatments (Luedtke et al. 2019). On the other hand, the current study has a larger sample size than most of the previous PAI studies.

Some strengths of the study should also be noted. To the best of our knowledge, this is the first study to examine moderators of differential treatment outcome in UP and diagnosis-specific CBT. In contrast to previous studies comparing UP and diagnosis-specific CBT, our sample of 291 patients included a subgroup with a primary diagnosis of MDD, thus allowing us to examine predictors and moderators in both anxiety disorders and MDD. We used a two-step machine learning approach to identify relevant predictors and moderators of treatment outcome rather than basing PAI treatment selection on simple linear regression models, as in earlier approaches. Our approach thus incorporated internal validation techniques such as bootstrapping to maximize the stability and generalizability of these models. The random forest approach is a comparatively stable model when n is relatively low and the number of predictors high (Bureau et al. 2005; Heidema et al. 2006). Without compromising the stability of the model, this approach also allows for the inclusion of many predictors and prevents weaker predictors from being dominated by stronger ones. In addition, the approach is capable of estimating linear and non-linear associations (Strobl et al. 2008).

Conclusion

The current study, which compared UP with diagnosis-specific group CBT for anxiety and depression, did not enable us to identify moderators of treatment effect. Although four predictors were identified, they were not sufficiently robust to be selected across treatment sites. We cannot rule out that the lack of identified moderators and robust predictors is a consequence of the factors that characterized our study: the chronicity of the sample, the chosen outcome measure, or the statistical methods used. We conclude that there is insufficient evidence to support a preference for either UP or diagnosis-specific group CBT for a given patient.