The association between clinical, sociodemographic, familial, and environmental factors and treatment resistance in schizophrenia: A machine-learning-based approach

Background: Prediction of treatment resistance in schizophrenia (TRS) would be helpful to reduce the duration of ineffective treatment and avoid delays in clozapine initiation. We applied machine learning to identify clinical, sociodemographic, familial, and environmental variables that are associated with TRS and could potentially predict TRS in the future. Study design: Baseline and follow-up data on trait(-like) variables from the Genetic Risk and Outcome of Psychosis (GROUP) study were used. For the main analysis, we selected patients with non-affective psychotic disorders who met TRS (n = 200) or antipsychotic-responsive criteria (n = 423) throughout the study. For a sensitivity analysis, we only selected patients who met TRS (n = 76) or antipsychotic-responsive criteria (n = 123) at follow-up but not at baseline. Random forest models were trained to predict TRS in both datasets. SHapley Additive exPlanation values were used to examine the variables' contributions to the prediction. Study results: Premorbid functioning, age at onset, and educational degree were most consistently associated with TRS across both analyses. Marital status, current household, intelligence quotient, number of moves, and family loading score for substance abuse also consistently contributed to the prediction of TRS in the main or sensitivity analysis. The diagnostic performance of our models was modest (area under the curve: 0.66 – 0.69). Conclusions: We demonstrate that various clinical, sociodemographic, familial, and environmental variables are associated with TRS. Our models only showed modest performance in predicting TRS. Prospective large multi-centre studies are needed to validate our findings and investigate whether the model's performance can be improved by adding data from different modalities.


Introduction
In many countries, clozapine is the only approved antipsychotic for treatment-resistant schizophrenia (TRS) (Howes et al., 2017).As delays in clozapine treatment are associated with poorer clinical outcomes (Yoshimura et al., 2017), robust and reliable models, that can predict non-response to antipsychotics are highly needed.Such models could support clinical decision-making, reduce the duration of ineffective pharmacological treatment and help identify patients who would likely benefit from clozapine.
Several studies used clinical, sociodemographic, and/or genetic data to predict TRS in schizophrenia by use of ML or survival models (Ajnakina et al., 2020;Kadra-Scalzo et al., 2022;Legge et al., 2020;Smart et al., 2022).Similar to observational studies, they reported that age at illness onset is important for the prediction of TRS.In addition, poor social adjustment and low pre-morbid intelligence quotient (IQ) were relevant predictors for TRS (Legge et al., 2020).Unfortunately, the accuracy of the different models was moderate.In addition, these studies had several limitations concerning large amounts of missing data, restrictions in terms of availability of data on specific predictors, and usage of different definitions for TRS.
Although clinical, genetic, sociodemographic and environmental factors might have predictive value for TRS, it remains unclear which factors can best be used for ML-based practical tools that can adequately predict TRS on an individual level.Therefore, we applied an ML model, utilizing comprehensive data from the Genetic Risk and Outcome of Psychosis (GROUP) study, to identify variables that are associated with TRS and that could potentially predict TRS in the future.As the GROUP study offers a unique combination of clinical, familial, sociodemographic, and environmental variables, this study can provide additional insight into TRS and contribute to the development of individualized care.

Study design
GROUP is a Dutch multi-centre longitudinal cohort study, which studies genetic and non-genetic vulnerability and resilience factors involved in the development and course of non-affective psychotic disorders (NAPD) (Korver et al., 2012).Data collection took place between April 2004 and December 2013.In-and outpatients were recruited across 36 mental health institutes.Assessments took place at baseline (T1), after three (T2) and six years (T3) follow-up.Inclusion criteria were 1) 16-50 years of age, 2) a diagnosis of NAPD according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), 3) good command of the Dutch language, and 4) the mental capacity to give informed consent.Clinical, familial, sociodemographic, and environmental data were collected after obtaining informed consent.The GROUP study was approved by the Ethical Review Board of the University Medical Centre Utrecht and by the local review boards of all institutes.Patient data from database release 8.0 was used.

Participants
The GROUP study included 1136 patients with NAPD of which twothirds were within the early phase of the disease (mean illness duration of ≤5 years).TRS was not included as a variable in the dataset and therefore the use of clozapine was used as a proxy for TRS (Wimberley et al., 2016a).
For our main analysis, we selected TRS patients, i.e., patients who used clozapine (≥12.5 mg) during the study (Fig. 1).This cut-off was chosen, as Dutch guidelines recommend to start clozapine treatment with at least 12.5 mg a day for TRS (Clozapine Plus Werkgroep, 2013;Zorginstituut Nederland, 2015).It is likely that clozapine prescriptions with lower dosages were aimed at other indications.Additionally, we selected antipsychotic-responsive patients, i.e., patients who did not use clozapine but instead used at least one non-clozapine antipsychotic (>0 mg) and who met remission criteria during one of the study assessments (i.e., scores of ≤3 on the Positive and Negative Syndrome Scale items P1, P2, P3, N1, N4, N6, G5, and G9) (Andreasen et al., 2005).Maintenance of remission over six months could not be examined with the available data.We excluded patients who did not meet the remission criteria during any of the study assessments, as it is possible that these patients 1) were not using antipsychotic medication or were incompliant, 2) had an indication for clozapine, but did not receive it for some reason, or 3) were pseudo-resistant (i.e., other factors explain the unresponsiveness to antipsychotics).For some patients, the primary diagnosis changed during the study.Therefore, we selected patients with an NAPD diagnosis at T3 (i.e., schizophrenia, schizophreniform disorder, schizoaffective disorder, delusional disorder, brief psychotic disorder, or psychotic disorder not otherwise specified), as we assumed that this diagnosis was most accurate.In case of missing data, we evaluated the last available primary diagnosis of the patient.
For a sensitivity analysis, we excluded patients who met the TRS or antipsychotic-responsive criteria at T1.This was done as decision tools will likely be used when treatment needs to be initiated for patients, who do not yet show signs of response or non-response.Therefore, the sensitivity analysis included TRS and antipsychotic-responsive patients who only met the corresponding criteria at follow-up (i.e., T2 and/or T3).

Variables
We selected trait(-like) variables from the baseline and follow-up assessments of the GROUP study, which did not change or minimally changed over time (Supplementary Table S1).These included: 1) demographic variables; 2) clinical variables, e.g.DUP; 3) variables on substance use; 4) neurocognitive variables; 5) variables on premorbid functioning, e.g.Premorbid Adjustment Scale (PAS) scores (Cannon-Spoor et al., 1982); 6) familial variables, e.g.familial loading of psychotic disorders (Verdoux et al., 1996); and 7) environmental variables, e.g.urbanicity.Variables were excluded if >25 % of the data was missing.Subsequently, we excluded patients with >25 % of missing data.

Machine learning
For both analyses, the outcome for our ML-based classification was TRS versus antipsychotic-responsive.We applied an ML framework in Python 3.9 using the Scikit-learn library 1.1.0(Fig. 2) (Pedregosa et al., 2011).We used a Random Forest (RF) model (1000 decision trees), which is a frequently used non-parametric ensemble algorithm (Breiman, 2001).
To assess the model's prediction performance on unseen data, we used a stratified 5-fold cross-validation approach, i.e., the data was divided in five subsets and in each subset, the ratio of TRS to antipsychotic-responsive patients was equal.In each cross-validation fold, the RF model was trained on 80 % of data (training set) and validated on an unseen subset of 20 % of data (test set).This was repeated until every fold had served as the test set.The 5-fold cross-validation was repeated 10 times, to limit the occurrence of chance findings.
Before the RF model was trained on the training folds, multiple preprocessing steps were performed on the training sets: removal of highly correlated variables, data imputation, normalization, and dimension reduction to mitigate overfitting (Supplementary Methods S1).These steps were subsequently applied to the test set.We then used nested cross-validation with a randomized search algorithm to optimize model hyperparameters within each training set.From each cross-validation fold, the best-performing hyperparameters (based on training set performance) were used for the test data.After hyperparameter  optimization, we applied oversampling (Supplementary Methods S1).This was done as, in case of a class imbalance, a trained model might have high accuracy in classifying the majority group, while it has a poor performance in classifying the minority group.The model was also trained without oversampling.As all pre-processing and model optimization steps were performed within each training set only, leakage of test data into the trained model was prevented.

Variable importance
The interpretation of ML models is challenging and its methodology is often visualized as a black box (Linardatos et al., 2020).To overcome this, Lundberg and Lee (2017) developed Shapley Additive exPlanations (SHAP), a theoretically reliable and intuitive method to explain the outcome of ML models.SHAP uses the Shapley value of the game theory and its extension.It assigns each variable a SHAP value, which quantifies the contribution (importance) of that variable to the prediction made by the model.More specifically, the SHAP value quantifies the direction (positive or negative) and magnitude of the variable's influence on the prediction (Lundberg and Lee, 2017).By use of SHAP, we visualized variables, for the best-performing (i.e., most prognostic) RF models, that were most important for predicting TRS in both analyses.

Statistical analysis
We created a receiver operator characteristic (ROC) curve and calculated the area under the curve (AUC) to evaluate the performance of each RF model.In addition, we calculated the Brier score to assess model calibration and refinement (0.0 is optimal) (Murphy, 1973).For each score, the mean and standard deviation over the 50 repeated crossvalidation folds were calculated.The best model was selected based on the highest and lowest mean AUC and Brier scores, respectively.To evaluate whether the models performed significantly (p < 0.05) better than random guessing, we performed random permutations (Ojala and Garriga, 2010).This was done by randomly shuffling the labels (TRS/ antipsychotic-responsive) before the 10 times repeated 5-fold crossvalidation was performed, which resulted in a random guessing crossvalidated mean AUC.This procedure was repeated 100 times and resulted in a p-value, that was defined as the fraction of repeated cross-validation iterations in which the permutation mean AUC was equal to or greater than the real mean AUC.Finally, we assessed the impact of the pre-processing steps by comparing the mean AUC between models with and without a particular pre-processing step.

Data selection
One hundred and one trait(-like) variables were selected (Supplementary Table S1).We used 59 and 63 of these variables for the main and sensitivity analyses, respectively.Other variables were excluded due to >25 % of missing data.
For the main analysis, we selected 200 TRS and 423 antipsychoticresponsive patients.The selection for the sensitivity analysis resulted in 76 TRS and 123 antipsychotic-responsive patients.The characteristics of the groups are summarized in Supplementary Table S2.

Predictions
In the main analysis, a model with z-score normalization and without oversampling or variable selection resulted in the highest crossvalidation mean AUC of 0.69 ± 0.04 (p < 0.01; Fig. 3A).This model also resulted in the lowest mean Brier score of 0.20 ± 0.01 and had a sensitivity and specificity of 0.21 ± 0.06 and 0.93 ± 0.03, respectively.Noteworthy is that the five best-performing models all had, rounded to two decimals, the same mean AUC and Brier scores (all p < 0.01; Supplementary Table S3 and Fig. S1).
In the sensitivity analysis, a model with z-score normalization and oversampling (i.e., duplication of cases) but without variable selection resulted in the highest cross-validation mean AUC of 0.66 ± 0.07 and a mean Brier score of 0.22 ± 0.01 (p < 0.01; Fig. 3B).The model's sensitivity and specificity were 0.43 ± 0.11 and 0.80 ± 0.07, respectively.The six models with the highest mean AUC, all had the same mean AUC scores and Brier scores (i.e., rounded to two decimals) and comparable standard deviations of the mean AUC and Brier scores (all p < 0.01; Supplementary Table S4 and Fig. S2).The sensitivities and specificities of the six best-performing models were very similar.
Pre-processing methods had little effect on the performance of the RF models (Supplementary Results S1 and S2).

Variable importance
By use of SHAP values, we visualized the impact of individual variables on the output of the RF models.We did this for the best-performing RF model of both analyses.In Figs. 4 and 5, SHAP values are shown for the 20 most important variables across the 50 cross-validation folds of these models.The bar plots in Fig. 4 show the degree of contribution of a specific variable to the prediction made by the RF model, with higher mean absolute SHAP values implying a bigger contribution of a variable to the output of the model.The summary plots in Fig. 5 visualize how low (blue dots), high (red dots), and intermediate variables' values are related to the probability of belonging to the TRS group.Variable values with higher SHAP values indicate a higher probability of TRS, and vice versa.
In the main analysis, the mean PAS score for ages 16 to 19 was the most important variable for the prediction of TRS (Fig. 4A).Higher PAS scores (i.e., worse premorbid functioning) indicated an increased risk of TRS (i.e., positive SHAP value) (Fig. 5A).In addition, marital status (i.e., not married or married/living together), PAS overall score, and age at illness onset were among the variables with the highest contribution to the prediction of TRS.Patients who were not married had an increased risk of TRS, while patients who were married or living together had a decreased risk of TRS (i.e., negative SHAP values).Higher PAS overall scores and lower age at onset were also associated with an increased risk of TRS.
Interestingly, clear differences were visible between low and high values for the variables of sexual abuse score of the CTQ, any lifetime use of cannabis and other substances (i.e., other than cannabis, cocaine, stimulants, psychedelics, or phencyclidine), frequency of lifetime use of cocaine and other substance use, and average urbanicity between ages 5 to 9. High scores on sexual abuse, (more) substance use, and non-urban environments indicating a higher risk of TRS.When comparing the SHAP bar and summary plots of the four second-best RF models with the plots of the best-performing RF model, it can be noted that the top 20 variables of these four models overlap for 65-80 % with the top 20 of the best-performing RF model (Supplementary Figs.S3 and S4).Mean PAS scores for ages 16 to 19, PAS overall score, and age at onset are part of the top 6 variables in all of these models.In addition, the eight variables (i.e., mean PAS scores for ages 16 to 19, marital status, PAS overall score, age at onset, educational degree, current household [e.g., living alone, with parents/partner, or sheltered], IQ, and year of birth) with the highest contribution to the prediction of TRS in the best-performing RF model were also present in the top 20 of the second-best RF models.This indicates that these variables consistently contributed to the prediction of TRS.
In the sensitivity analysis, highest acquired educational degree was the most important variable for the prediction of TRS (Fig. 4B), with lower levels of education associated with a higher risk of TRS (Fig. 5B).Similar to the main analysis, mean PAS scores for ages 16 to 19 and PAS overall scores had a relatively high contribution to the prediction of TRS.Again, higher PAS scores were associated with higher risks of TRS.Moreover, the number of moves and family loading score for substance abuse (i.e., scores ≤0 indicate a negative family history for substance use disorders, while scores >0 indicate a positive family history) were among the variables with the highest contribution to the prediction of TRS.For both variables, a colour gradient is visible in the summary plot, indicating that lower scores are associated with an increased risk of TRS and vice versa.There was also a clear difference visible in SHAP values between high and low scores of the variables of any lifetime use of other substances and frequency of lifetime use of other substances, with (more) substance use indicative of a higher risk of TRS.
The top 20 variables of the best RF model of the main and sensitivity analyses overlapped for 55 %.In addition, the top 20 variables of the five second-best RF models of the sensitivity analysis overlapped for 70-100 % with the top 20 variables of the best RF model (Supplementary Figs.S5 and S6).Across the six models, highest acquired educational level was consistently one of the top 5 variables with the highest contribution to predicting TRS.Age at onset was repeatedly found in the top 11 of these models.Moreover, the four variables (i.e., educational degree, number of moves, mean PAS scores for ages 16 to 19, and family loading score for substance abuse) with the highest contribution to the prediction of TRS in the best-performing RF model were also present in the top 20 of the second-best RF models.

Discussion
Our main analysis indicates that premorbid functioning and age at illness onset are most important for the prediction of TRS in an NAPD patient cohort consisting of TRS and antipsychotic-responsive patients.Marital status, educational degree, current household, and IQ also consistently contributed to the prediction of TRS.The sensitivity analysis indicated that educational level was most important for the prediction of TRS in the subgroup of patients who did not show signs of response or TRS at baseline.Number of moves, premorbid functioning, and family loading score for substance abuse also consistently contributed to the prediction of TRS in this analysis.Although the most prognostic RF models of both analyses discriminated TRS and antipsychoticresponsive patients significantly better than chance, the model's performances were modest (AUC 0.66-0.69).
Across both analyses, premorbid functioning and age at illness onset were consistent contributors to the prediction of TRS, with worse premorbid functioning and a younger age at onset associated with an increased risk of TRS.This is in line with previous findings (Chan et al., 2014(Chan et al., , 2021;;Demjaha et al., 2017;Kim et al., 2017;Lally et al., 2016;Martin and Mowry, 2016;Meltzer et al., 1997;Smart et al., 2021;Wimberley et al., 2016a).In addition, poor premorbid functioning (Chan et al., 2014(Chan et al., , 2021)), season of birth (Kim et al., 2017;Sørensen et al., 2014), and male sex (Lally et al., 2016;Meltzer et al., 1997) were also proposed as important predictors of TRS by multiple, but not all observational studies (Demjaha et al., 2017;Wimberley et al., 2016a).Our study did not confirm an important role of sex in distinguishing TRS.We also did not find a consistent role for the DUP to first contact with mental health care or to first treatment with antipsychotics in the prediction of TRS.This might be related to the observation that TRS patients had a numerically shorter DUP compared to antipsychotic-responsive patients.Although previous studies found no significant differences between TRS and antipsychotic-responsive patients with regard to marital status (Iasevoli et al., 2016) and no association between living situation (Wimberley et al., 2016a) and TRS, our findings indicate that these factors might have predictive value for TRS.In our sensitivity analysis, educational level was an important contributor to the prediction of TRS, with lower levels of education indicating a higher risk of TRS.This is in line with earlier findings (Díaz et al., 2013;Lasalvia et al., 2017;Verma et al., 2012).Although variables related to substance use, childhood trauma, and urbanicity did not contribute most to the prediction of TRS, our results indicated that (more) substance use, a history of sexual abuse, and non-urban environments are associated with an increased risk of TRS, which is in agreement with previous findings (Ajnakina et al., 2018;Arsalan et al., 2019;Di Forti et al., 2009;Wimberley et al., 2016b).For clinicians, it might be good to be aware of these associations and to be alert for TRS in patients that may be at increased risk.
Comparable to other studies that used clinical and demographical data to predict TRS (Ajnakina et al., 2020;Kadra-Scalzo et al., 2022;Legge et al., 2020;Smart et al., 2022), our RF models attained high specificities and relatively low sensitivities, which indicates that the models are relatively good at recognizing antipsychotic-responsive patients, but not so good at recognizing TRS patients.High specificity is important, as clozapine usage is associated with serious side effects and therefore clozapine should not be initiated in antipsychotic-responsive   patients.Ajnakina et al. (2020) used regression models to predict early and late treatment resistance in first-episode patients (n = 239) by use of clinical, familial, and environmental variables.The sensitivities and specificities of their models ranged between 0-60 % and 71-100 %, respectively.AUC scores (0.74-0.77) were a bit higher compared to the AUC scores of our RF models (0.66-0.69).Legge et al. (2020) used a conditional inference RF model to evaluate the influence of demographic, premorbid, family and illness-related predictors on the risk for TRS (n = 337).The accuracy of their predictive model was 0.59.Similar to our findings, they found that a younger age at onset was the most important variable for the prediction of TRS, followed by, among others, premorbid IQ and poor premorbid social and work adjustments.Age at onset of illness and years in education were also found to be associated with increased odds of TRS by Smart et al. (2022).Their regression model predicted TRS (n = 2216) with an AUC of 0.59, specificity of 76 % and sensitivity of 48 %.Lastly, Kadra-Scalzo et al. (2022) used electronic health records of patients with a schizophrenia spectrum disorder (n = 1435) and showed that, among others, younger age at antipsychotic initiation and minor cognitive problems were predictors of TRS.Their prediction model for TRS produced a Harrel's C index of 0.60, which indicates a weak performance.Studies that used ML models based on neuroimaging data showed promising results for the prediction of response to transcranial magnetic stimulation (Koutsouleris et al., 2018), electroconvulsive treatment (Li et al., 2017), and risperidone (Cao et al., 2020) in schizophrenia.Therefore, to increase the model's performance and sensitivity, additional clinical (Samara et al., 2015), biological (Mondelli et al., 2015;Osimo et al., 2023), or neuroimaging data (Cao et al., 2016(Cao et al., , 2020;;Masychev et al., 2020;Sarpal et al., 2016;Veronese et al., 2021) might be needed.With this, the optimal balance between increasing the accuracy of ML models and the difficulty to obtain certain data needs to be considered.
This study has several limitations.First, for the majority of patients, the data was collected a couple of years after illness onset.To increase the generalizability of our findings, we chose to enter variables into the ML model that did not change or were relatively stable over time.Variables related to symptom severity were therefore not included, although they may have predictive value (Demjaha et al., 2017).Our results, therefore, need to be replicated in a cohort of patients who recently entered psychiatric services.Second, age at illness onset was significantly higher in patients not included in our main analysis compared to patients who were included (p = 0.01; Supplementary Methods S2 and Table S5).To avoid selection bias, future studies might use data from large national registries.Third, to limit bias that arises from data imputation, we had to exclude 38-42 variables from our analyses due to >25 % of missing data.Fourth, some variables might reflect vulnerability for TRS, while others might reflect a consequence of TRS (or both).Unfortunately, our analyses did not inform about the underlying causal relationships between predictors and TRS.Fifth, 12, 8, and 26 patients selected for the main analysis had a primary DSM-IV diagnosis at T3 (or T1/T2 in case of missing data) of delusional disorder, brief psychotic disorder, and schizophreniform disorder, respectively.Many of these patients received antipsychotic prescriptions at more than one study assessment, which indicates that they experienced more severe symptoms for longer periods than is suggested by their primary diagnoses.ML-based tools will especially be useful in case of an unclear clinical presentation and should be utilized in a transdiagnostic way when a patient with psychotic symptoms is presented.Sixth, due to the lack of an external dataset, we could not test the external validity of our RF models.
We demonstrate that younger age at illness onset, worse premorbid functioning, and lower educational levels are most consistently associated with TRS.Our ML models based on clinical, sociodemographic, familial, and environmental variables, however, showed modest performance in predicting TRS.Future large multi-centre studies are needed to investigate whether the model's performance can be improved by adding data from different modalities.
The work of EvdG and MvdP was funded by a Veni grant from the Netherlands Organisation for Health Research and Development (ZonMw, grant number 91618075).

CRediT authorship contribution statement
The GROUP investigators designed and conducted the GROUP study.CvH, MvdP, LdH, FS, and EvdG designed the research question for this project.BdV and MC carried out the analysis.CvH interpreted that data and wrote the original draft of the manuscript.All authors critically reviewed and approved the final version of the manuscript.

Declaration of competing interest
None.

Fig. 3 .
Fig. 3. Mean cross-validated receiver operator characteristic curves of the best-performing machine learning models for the (A) main and (B) sensitivity analyses.Abbreviations: AUC, area under the curve.
C.F.M.van Hooijdonk et al.

Fig. 4 .
Fig. 4. SHAP bar plots of the top 20 variables of the highest performing random forest model of the A) main and B) sensitivity analyses.The mean absolute SHAP value on the x-as indicates the degree of contribution of a variable to the output of the model, with higher values implying larger contributions.Abbreviations: CTQ, Childhood Trauma Questionnaire; DUP, duration untreated psychosis; SHAP, SHapley Additive exPlanations.

Fig. 5 .
Fig. 5. SHAP summary plots of the top 20 variables of the highest performing random forest model of the A) main and B) sensitivity analyses.The higher the SHAP value of a variable, the higher the probability of TRS, and vice versa.Each dot represents one patient and accumulates vertically to illustrate the density.The blue and red colours represent low and high variable values, respectively.Intermediate variable values are denoted by the blue-to-red colour bar.Abbreviations: CTQ, Childhood Trauma Questionnaire; DUP, duration untreated psychosis; SHAP, SHapley Additive exPlanations.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)