Predicting post-treatment symptom severity for adults receiving psychological therapy in routine care for generalised anxiety disorder: a machine learning approach

Approximately half of generalised anxiety disorder (GAD) patients do not recover from first-line treatments, and no validated prediction models exist to inform individuals or clinicians of potential treatment benefits. This study aimed to develop and validate an accurate and explainable prediction model of post-treatment GAD symptom severity. Data from adults receiving treatment for GAD in eight Improving Access to Psychological Therapies (IAPT) services (n = 15,859) were separated into training, validation and holdout datasets. Thirteen machine learning algorithms were compared using 10-fold cross-validation, against two simple clinically relevant comparison models. The best-performing model was tested on the holdout dataset, and model-specific explainability measures identified the most important predictors. A Bayesian Additive Regression Trees model outperformed all comparison models (MSE = 16.54 [95 % CI = 15.58; 17.51]; MAE = 3.19; R² = 0.33), including a single-predictor linear regression model (MSE = 20.70 [95 % CI = 19.58; 21.82]; MAE = 3.94; R² = 0.14). The five most important predictors were the PHQ-9 anhedonia item, the GAD-7 annoyance/irritability, restlessness and fear items, then the referral-assessment waiting time. The best-performing model accurately predicted post-treatment GAD symptom severity using only pre-treatment data, outperforming comparison models that approximated clinical judgement and remaining within the GAD-7 error of measurement and minimal clinically important differences. This model could inform treatment decision-making and provide desired information to clinicians and patients receiving treatment for GAD.


Introduction
Generalised anxiety disorder (GAD) is one of the most commonly occurring and burdensome mental disorders (Ferrari et al., 2022). Efficacious pharmacological and psychotherapeutic treatments exist (Bandelow et al., 2017), but even with the first-line recommended treatments only around half of patients in clinical trials or routine care achieve symptomatic recovery post-treatment (Clark, 2018; Loerinc et al., 2015). There is, therefore, a pressing need to improve patient treatment outcomes.
Knowledge of individual treatment prognosis can inform clinical planning, potentially improving patient outcomes and healthcare cost-effectiveness (Chekroud et al., 2021). Such knowledge is wanted by both patients and clinicians (Hayden et al., 2013), with a wider societal demand for precision medicine (Fernandes et al., 2017). Mental health resources are constrained (Clark, 2018), and so the need for accurate patient prognostic prediction models is clear. Despite this, there is a notable lack of such research with patients with GAD.
Clinical prediction models are often constrained by the limited use of routinely collected data in their development (Dwyer et al., 2018). This contributes to reduced generalisability and a lack of ecologically valid ways of explaining how predictions are obtained (Bouwmeester et al., 2012). Such models are therefore not readily implementable, do not perform well in routine care and lack transparency in how predictions are obtained, limiting their value and associated trust (Obermeyer and Emanuel, 2016). To increase their potential utility, prediction models need to be developed using naturalistic samples, leverage explainability measures (to improve model interpretability) and adhere to accepted development guidelines (Collins et al., 2015). This study aims to develop and validate an accurate and explainable prognostic prediction model for adults receiving psychological treatment for GAD, using routinely collected data.

Methods
The methods for this study were preregistered and were adhered to without deviation (see: https://osf.io/s23hc/).

Participants
Routinely collected healthcare data were analysed for adults (aged ≥18) treated for GAD in eight Improving Access to Psychological Therapies (IAPT) services (grouped together in four NHS Trusts) in the North and Central East London IAPT Service Improvement and Research Network (NCEL IAPT SIRN; see Supplementary Material 1 Table 3; Saunders et al., 2020). The national IAPT programme delivers evidence-based psychological therapies using a stepped care model for depression and anxiety disorders, including GAD (Clark, 2018). Included individuals had a primary diagnosis of GAD (referred to as a 'problem descriptor' in IAPT services; see Supplementary Material 1 Table 2 for further details), had received two or more treatment sessions and had completed treatment by August 2020.

Predictors and outcomes
IAPT services are mandated by NHS England to collect a standardised set of sociodemographic data, as well as symptom and functioning measures: the IAPT Minimum Dataset (MDS). The predictors used in this study were those available to inform clinical decision making before treatment started, all collected routinely as part of the MDS at the initial assessment: the individual items from the Generalised Anxiety Disorder Assessment (GAD-7; Spitzer et al., 2006), used to measure symptoms of generalised anxiety disorder; the Patient Health Questionnaire (PHQ-9; Kroenke et al., 2001) for depressive symptoms; the Work and Social Adjustment Scale (WSAS; Mundt et al., 2002) for functional impairment; and the IAPT Phobia Scale (Improving Access to Psychological Therapies, 2011) to screen for phobic anxiety disorders. The GAD-7 and PHQ-9 total scores were also included. Only items two to five of the WSAS were used, as the first item is not applicable for unemployed individuals. Other predictors were: age, ethnicity, employment status, gender, long-term health condition status, and psychotropic medication prescription and usage. Lower Layer Super Output Areas (LSOAs) were converted to Indices of Multiple Deprivation (IMD) as a measure of local area deprivation. Finally, the number of weeks between referral and initial assessment, and between assessment and first therapy session, were included. The primary outcome was the GAD-7 score at the last attended therapy session. The full description of baseline variables is included in Supplementary Material 1 (Tables 1 and 2).
The Transparent Reporting of a Multivariable Prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement (Collins et al., 2015) guidelines were followed to ensure model development best practice; see Supplementary Material 2.

Data processing
Analyses were undertaken in R (R Core Team, 2022). Missing baseline data were single imputed using the 'missForest' package (Stekhoven and Bühlmann, 2012), following conventions from other prediction model development and validation studies (Buckman et al., 2021a; Webb et al., 2020), with missingness varying by predictor from nil to 22.7 % (long-term health condition). Continuous variables were centred and scaled. Categorical variables were dummy encoded post-imputation, with the first dummy encoded level removed to avoid issues regarding multicollinearity.
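The centring/scaling and dummy-encoding steps described above can be sketched as follows. This is an illustrative pure-Python sketch only: the study itself used R, and the function names here are our own, not from the study's pipeline.

```python
def standardise(values):
    """Centre a numeric predictor to mean 0 and scale to unit (sample) SD."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

def dummy_encode(values):
    """Dummy encode a categorical predictor, dropping the first level
    (the reference category) to avoid perfect multicollinearity."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values]
            for level in levels[1:]}
```

For example, `standardise([1, 2, 3])` returns `[-1.0, 0.0, 1.0]`, and `dummy_encode(['employed', 'unemployed', 'employed'])` yields a single indicator column for `'unemployed'`, with `'employed'` as the dropped reference level.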
The dataset was then divided into three parts for training, validation and testing, as detailed in the study protocol. Division was determined by service locations within NHS Trusts and undertaken to create an accurate representation of the model's ability to generalise to services beyond those whose data were used to develop and validate the models (Chekroud and Koutsouleris, 2018). The variables collected and the delivery of interventions within these service locations are standardised across England (National Collaborating Centre for Mental Health, 2023). The training dataset comprised three services from Trust 1 and three from Trust 2 (n=8,064; 50.8 %). The validation dataset comprised two services from Trust 3 (n=5,021; 31.7 %). The holdout dataset comprised a single service from Trust 4 (n=2,774; 17.5 %). To identify statistically significant group differences between the three independent samples, chi-square tests were used for the categorical variables, and ANOVA with Welch t-tests for the continuous variables. Detection of such differences did not influence the submission of a given predictor to the prediction models.

Machine learning algorithms
Thirteen machine learning algorithms were selected to provide a broad comparison of a variety of modelling techniques and to be consistent with research that has used similar methods (Webb et al., 2020). To provide a benchmark against the machine learning models, an ordinary least squares (OLS) regression model with all the same available predictors (entered in the same manner as for the machine learning models) was included. The multiple OLS acts as a comparator for the penalised regression models by fitting the data without any regularisation, making it prone to overfitting relative to the regularised models. Additionally, two other comparison models akin to those used in similar research were fitted (Buckman et al., 2021a). The first was an OLS regression model with only the pre-treatment GAD-7 score included, as initial symptom severity is one of the most important factors considered by clinicians in predicting treatment outcomes (and allocating treatment) for their patients (Amati et al., 2018; Buckman et al., 2021b; O'Driscoll et al., 2021). Separately, a null model was created using the mean post-treatment GAD-7 score in the training and validation datasets as the prediction for all test dataset cases. Five models based upon elastic net regularised regression (ENR; Friedman et al., 2010) were included (alpha parameters of 0.0 [full ridge], 0.25, 0.50, 0.75 and 1.0 [full lasso]), using the ridge and least absolute shrinkage and selection operator (lasso) penalisations to reduce overfitting. Two spline regression models were used: adaptive splines (Friedman, 1991) and adaptive polynomial splines (Stone et al., 1997). In addition, two decision-tree-based algorithms were used: random forest (Breiman, 2001) and Bayesian Additive Regression Trees (BART; Chipman et al., 2010). Lastly, three support vector machines (SVM) were included, with linear, polynomial and radial kernels (Dimitriadou et al., 2009). Full details of the models are presented in Supplementary Material (Table 3).
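The role of the alpha parameter in the five ENR models can be illustrated with the glmnet-style penalty, in which alpha mixes the lasso (L1) and ridge (L2) terms. This is a sketch of the penalty function only, not of the study's fitting code, and the function name is our own:

```python
def elastic_net_penalty(coefs, lam, alpha):
    """glmnet-style elastic net penalty:
    lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).
    alpha = 0 gives full ridge; alpha = 1 gives full lasso."""
    l1 = sum(abs(c) for c in coefs)
    l2_sq = sum(c * c for c in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_sq)
```

With coefficients `[1.0, -2.0]` and `lam = 1.0`, the penalty is 3.0 at `alpha = 1.0` (pure lasso), 2.5 at `alpha = 0.0` (pure ridge), and intermediate mixtures at the three alphas in between.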
SuperLearner (van der Laan et al., 2007), the chosen ensemble modelling tool, uses cross-validation to estimate the predictive performance of individual machine learning algorithms and ultimately determines the most accurate model within the ensemble. The primary evaluation metric was the mean squared error (MSE) and the secondary metrics were variance explained (coefficient of determination; R²) and the mean absolute error (MAE; an average prediction error).
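The three evaluation metrics have standard definitions, sketched here in pure Python for clarity (the study computed them in R):

```python
def mse(y_true, y_pred):
    """Mean squared error: the primary evaluation metric."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: the average prediction error, here in GAD-7 points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Because the outcome is a GAD-7 score, MAE is directly interpretable: an MAE of 3.16 means predictions were, on average, 3.16 GAD-7 points from the true values.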
Initially, the ensemble was trained within the training dataset using 10-fold cross-validation. The validation dataset was used as an initial test of each model's ability to generalise, with the model with the lowest MSE selected (taken as an average over the ten repeats). Once the best-performing algorithm had been established within the validation dataset, the same training and validation process was repeated to tune its hyperparameters, using the grid search method of the 'caret' package (see Supplementary Material 1; Max, 2008). Hyperparameters were tuned only for the best-performing algorithm, to ensure an equal comparison (as hyperparameters vary across models) of the baseline models within the validation dataset.
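The selection logic above (partition into folds, then pick the algorithm with the lowest average MSE) can be sketched as follows. This is an illustrative pure-Python sketch with hypothetical model names and MSE values, not the study's SuperLearner/caret code:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Partition row indices 0..n-1 into k roughly equal folds
    for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_best_model(mean_cv_mse):
    """Pick the algorithm with the lowest cross-validated MSE
    (MSE was the primary evaluation metric)."""
    return min(mean_cv_mse, key=mean_cv_mse.get)
```

For example, `select_best_model({'bart': 16.7, 'ols_all': 18.2, 'null': 24.0})` returns `'bart'` (values hypothetical).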
Subsequently, only the model with tuned hyperparameters was trained on the combined training and validation dataset, then finally tested on the test dataset. The analysis pipeline is shown in Fig. 1.

Explainability
In line with the study protocol, model-specific explainability measures were preferred where the winning model was an inherently interpretable algorithm (such as a decision tree), as they leverage each model's distinct features, thereby improving fidelity (Belle and Papantonis, 2021). Additionally, model-specific measures bypass known issues (such as an inability to isolate the individual contributions of individual features) with global and local model-agnostic explainability methods that use input perturbation within highly correlated variable sets, such as SHAP or LIME (Slack et al., 2020). Where model-specific measures did not exist or were not available, accumulated local explanation plots (Apley and Zhu, 2019) were used as a global model-agnostic explainability measure, due to their ability to handle correlated features.

Ethics
NHS Ethical approval was not required for this study (confirmed by the Health Research Authority July 2020 #81/81). The IAPT services provided data as part of a wider service improvement project. This research followed procedures outlined by the respective data hosting providers and was registered with the individual NHS Trusts operating the IAPT services (project reference: 00519-IAPT).

Results
There were 30,833 patients treated for GAD. Of these, 14,974 did not meet inclusion criteria, for reasons such as having <2 treatment sessions or having missing PHQ-9 or GAD-7 item-level data (see Supplementary Material 1 Figure 1). This resulted in an analytic sample of n=15,859 participants, which was split into n=8,064 in the training dataset, n=5,021 in the validation dataset and n=2,774 in the test dataset. Of the 15,859 participants, 1,397 (8.81 %) had only two sessions and 7,959 (50.2 %) completed at least six sessions. Demographic and clinical characteristics for participants in each dataset are shown and compared in Table 1, with statistically significant differences observed on multiple variables between the datasets.

Exploratory ensemble modelling
The mean of the post-treatment GAD-7 scores in the validation sample was 7.4 (±5.2) and 37.1 % of individuals were in caseness at the end of treatment (their GAD-7 score was ≥8). Through training the ensemble using 10-fold internal cross-validation within the training sample (repeated ten times), the BART algorithm produced the lowest prediction error when generalising to the validation dataset (MSE [95 % CI]=16.72 [16.12; 17.32]; MAE=3.16; R²=0.38; see Table 2). The BART model also had the joint lowest prediction error on the secondary outcome metric (MAE), along with the adaptive polynomial spline model. Both models' MAE indicated that on average the predicted post-treatment GAD-7 scores differed from the true values by 3.16 points. The linear regression model (with all predictors) had the greatest R² but was more error prone, with higher MSE and MAE values. As MSE was the primary evaluation metric, the BART model was determined to be the winning model within the ensemble.
The mean of the post-treatment GAD-7 scores in the test sample was 7.3 (±4.9) and 36.0 % of individuals were in caseness at the end of treatment. The hyperparameter-tuned BART model was tested on the holdout dataset to predict post-treatment GAD-7 scores (MSE = 16.54 [95 % CI = 15.58; 17.51]; MAE = 3.19; R² = 0.33).

Table 1 note. Service users were asked about their 'gender' and were provided with these options; we note that this does not consider the full range of experience, but these are the only categories available within IAPT services. Welch t-tests are reported for significant ANOVA (<0.001) in Supplementary Material 1 (Table 4).

Explainability
Using the model-specific explainability measures available, the inclusion proportion for each individual feature used as a splitting rule within the BART trees was calculated (for further details see Supplementary Material 1 Table 8). The PHQ-9 item on anhedonia was the most important variable for prediction, chosen twice as often as the next most important predictor. Next, each chosen in around 3.5 % of splits, were three GAD-7 items (annoyance/irritability, restlessness and fear) and the number of weeks from referral to assessment. The remaining variables performed similarly, and those included ≥2.5 % of the time are shown in Fig. 2.
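The inclusion-proportion calculation amounts to counting how often each predictor is used as a splitting rule across the fitted trees (and posterior draws) and normalising by the total number of splits. A minimal sketch, assuming the split variables have already been extracted from a fitted BART model into a flat list (the variable names below are hypothetical, not the study's):

```python
def inclusion_proportions(split_variables):
    """Given the predictor used at every splitting rule across all trees
    of a fitted BART model, return the proportion of splits in which
    each predictor appears."""
    total = len(split_variables)
    counts = {}
    for var in split_variables:
        counts[var] = counts.get(var, 0) + 1
    return {var: count / total for var, count in counts.items()}
```

A predictor with an inclusion proportion of 0.035 was therefore used in around 3.5 % of all splitting rules, the level reported above for the three GAD-7 items and the referral-to-assessment waiting time.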

Discussion
The aim of this study was to develop and validate an accurate and explainable prognostic prediction model for patients receiving psychological treatment for GAD. The winning BART model predicted scores within the error of measurement for the GAD-7 (±4; Spitzer et al., 2006), as well as within a minimal clinically important difference (MCID; ±3.3; Bauer-Staeb et al., 2021) of the true values. The obtained accuracies were better than those of the base comparison models, which on average made predictions outside the error of measurement and the MCID from the true GAD-7 scores. The model was developed using only routinely collected pre-treatment data, highlighting its potential value. Several key variables, all with face validity, were identified, improving the explainability of the model. These variables provide clinicians and patients with some insight into the underlying behaviour of the model. Anhedonia (from the PHQ-9) was the most important predictor, followed by three GAD-7 items. This is consistent with other research showing symptoms of GAD and depression to be highly comorbid and influential in treatment outcomes (O'Driscoll et al., 2021), and anhedonia to contribute substantially to psychological therapy outcomes (Khazanov et al., 2020). Waiting time has similarly been evidenced as a key indicator of psychological treatment outcomes (Clark, 2018).

Strengths and limitations
This study used minimally restrictive inclusion criteria, a large naturalistic sample to develop the models, and routinely collected pre-treatment data, improving external validity and addressing generalisability concerns (Meehan et al., 2022). However, data came from services in similar geographical areas, so replication in other primary or community care mental health services that treat GAD, both within the UK and internationally (Cromarty et al., 2016; Knapstad et al., 2018), would provide a more thorough test of generalisability. The study compared a varied range of machine learning algorithms and presents a potential baseline for future research. In addition, the winning model is only usable in services that collect the same data and provide the same psychological therapies as those used in this study. In line with the study protocol, only the most important variables for prediction were explored for the winning model, as we considered this the model most likely to be used in clinical practice. The use of explainability measures follows recommendations for the development of machine learning prediction models (Dwyer et al., 2018) to build confidence and trust in predictions, increasing their potential utility. However, within the context of the study, the included simple comparison models were only an approximation of how clinicians might make prognostic predictions. For a more complete evaluation, a comparative test of the prospective prediction performance of the model relative to clinicians is required. It would also be valuable to explore the important variables for prediction for the other models within the ensemble. The study was conducted and reported in accordance with TRIPOD guidelines, which should help to improve replicability and interpretability (Burke et al., 2019). This is particularly important given the paucity of prognostic GAD treatment models. Separate prediction models could have been produced for those receiving low- and high-intensity treatments, given their distinction in clinical guidelines (National Institute for Health and Care Excellence, 2020). However, as patients can be stepped up or down between intensities, or receive pharmacotherapy concurrent with their psychological therapy, separating the models would not best represent the clinical settings in which they might be used. Treatment options for GAD within IAPT are limited to low- and high-intensity CBT, limiting the generalisability of the model to alternative therapies. The dataset could also have been partitioned by sociodemographic features (such as age or ethnicity), or by severity of symptoms, to provide additional tests of the robustness of the findings.
There was a large amount of unexplained variance for the individual models, although this could be partly attributed to measurement error (Bone et al., 2021). Models might have been more accurate with a broader range of predictors, such as those collected through passive measurement of disorder-specific behaviours (De Angel et al., 2022). However, this study only used routinely collected data, to ensure greater potential for clinical utility. Hyperparameters were only tuned for the winning model from the first test of generalisation (i.e. the BART model), so an alternative, potentially more accurate model could have been obtained had all models been tuned beforehand. However, this method provided an equal baseline comparison of all models within the ensemble and provides a potential basis for future research.

Implications and conclusions
The winning model could be embedded within healthcare systems to provide a more accurate treatment prognosis for individuals following an initial assessment, providing patients and clinicians with desired knowledge and informing their joint clinical decision-making. Before being embedded and used to help guide treatment decisions, a more robust test of the value of the model in clinical practice would be needed. This might include randomising clinicians and patients to receive (or not receive) predictions from the model prospectively, then investigating the effects of this on treatment outcomes. This process has been found to be successful with other data-informed treatment predictions in similar settings (Delgadillo et al., 2018). If there is utility in the routine use of the model, improved treatment outcomes and potentially reduced costs of care would be expected. Those facing poor treatment prognoses could be recommended to: start at high intensity, have more regular reviews to closely monitor progress throughout treatment, consider combined pharmacotherapy and psychotherapy, or be given augmented treatment options. Those with particularly good prognoses could be recommended to start with lower-intensity therapy, which might make more efficient use of clinical resources. Further, were causality to be demonstrated, the identified important variables might help clinicians and services deliver targeted interventions; for example, offering interventions that target anhedonia (Khazanov et al., 2020) or reducing the number of weeks waited (fast-tracking) for individuals with poor predicted prognoses. This might further improve outcomes and reduce the long-term cost of care.

Fig. 2. Percentage of times each predictor was chosen as a splitting rule within BART. Note. Only predictors with ≥2.5 % inclusion proportion shown; see Supplementary Material 1 Table 8 for the remaining predictors.

Table 1
Demographics and group comparison.

Table 2
Ensemble validation dataset results. Note. MSE = mean squared error; CI = confidence interval; SE = standard error of MSE; MAE = mean absolute error; R² = coefficient of determination; OLS = ordinary least squares; SVR = support vector regression.