A new method for analyzing clinical trials in depression based on individual propensity to respond to placebo estimated using artificial intelligence

One of the major reasons for trial failures in major depressive disorder (MDD) is the presence of unpredictable levels of placebo response, as the individual baseline propensity to respond to placebo is not adequately controlled by current randomization and statistical methodologies. The individual propensity to respond to any treatment or intervention, assessed at baseline, was considered a major non-specific prognostic and confounding effect. The objective of this paper was to apply the propensity score methodology to control for potential imbalance at baseline in the propensity to respond to placebo in clinical trials in MDD. Individual propensity was estimated using artificial intelligence (AI) applied to observations collected on two pre-randomization occasions. Case studies are presented using data from two randomized, placebo-controlled trials that evaluated the efficacy of paroxetine in MDD. AI models were used to estimate the individual probability of showing a treatment non-specific placebo effect. The inverse of the estimated probability was used as a weight in the mixed-effects analysis to assess treatment effect. The comparison of the results obtained with and without propensity weight indicated that the weighted analysis provided an estimate of treatment effect and effect size significantly larger than the conventional analysis.


Introduction
The randomized placebo-controlled clinical trial (RCT) design is considered the "gold standard" design for investigating the efficacy of new treatments. However, accumulated evidence indicates that this design failed to assess the treatment effect (TE, defined as the baseline-corrected change from placebo) in a large number of trials conducted to investigate the efficacy of novel medications for CNS diseases. The main reason for study failure was identified as the high and unpredictable level of placebo response (Benedetti et al., 2003).
There is an extremely large body of evidence showing that the level of placebo response has a critical prognostic relevance in the assessment of TE in RCTs conducted in major depressive disorder (MDD) (Khan et al., 2003; Li et al., 2019; Papakostas et al., 2009). A meta-analysis of 81 RCTs in MDD submitted to the US Food and Drug Administration (FDA) between 1983 and 2008 showed that only 53% of the trials conducted over those 25 years were successful and that the placebo response rate was increasing over time (Khin et al., 2011; Colloca, 2019; Gopalakrishnan et al., 2020; Khan et al., 2017; Tuttle et al., 2015; Enck, 2016). As a consequence, the level of placebo response can be considered a relevant prognostic covariate that cannot be ignored in any inference, even when randomization has been deployed (Senn, 2013). Therefore, new methodological approaches for designing, conducting, and analyzing RCTs are needed to control and mitigate the increasing confounding effect of placebo response. Several methods have been proposed to address this issue, such as the identification and exclusion of placebo responders during a placebo run-in period (Faries et al., 2001; Scott et al., 2022), and the two-stage sequential parallel comparison design (Fava et al., 2003; Chen et al., 2011). In addition, the band-pass methodology was proposed to improve signal detection in antidepressant clinical trials using an enrichment window approach that identifies sites with extremely low or high mean placebo responses and excludes data from those sites from the analysis (Merlo-Pich et al., 2008; Gomeni et al., 2019). More recently, the FDA released a guidance for implementing enrichment strategies in clinical investigations to evaluate the effectiveness of new drugs, in an attempt to identify and exclude patients who improve spontaneously or have large placebo responses (FDA guidance, 2019).
In the context of the present analysis, the following definitions were used: PR = placebo response, associated with a clinical improvement in patients treated with placebo, and PE = placebo effect, associated with a clinical improvement due to expectancies of positive outcomes of a treatment irrespective of the assigned treatment (Colloca et al., 2019). PE is a common outcome in RCTs conducted in many psychiatric diseases (Palpacuer et al., 2017); it is usually associated with the patient's interactions with the clinician (Kaptchuk et al., 2015), and it was identified as a major non-specific effect affecting the individual level of PE (Salanti et al., 2018).
TE can be considered the resultant of treatment-specific and non-specific responses, and the individual propensity to respond to any treatment, assessed using pre-randomization observations, can be considered a relevant prognostic factor. The larger the propensity to respond to non-specific treatment, the lower the chance of detecting any treatment-specific effect (Ioveno et al., 2012; Katz et al., 2008).
The computation of PE is essential in RCTs for separating the specific effects of treatments from non-specific effects associated with the therapeutic intervention. Thus, the identification of placebo responders is critical for testing the efficacy of new interventions and drugs (Aslaksen, 2021).
In RCTs, subjects are randomly assigned to treatment arms to ensure comparability of the study outcomes by balancing the distribution of potential confounders across treatment arms. Despite the randomization, the groups to be compared may remain unbalanced, and thus not comparable, when relevant baseline prognostic covariates (e.g., PE propensity) not accounted for by the randomization have been ignored.
Propensity weighting is a novel statistical inference approach aimed at reducing and controlling baseline imbalances between treatment arms (Moons, 2020). This methodology was developed to mitigate the confounding bias in non-randomized comparative studies and to facilitate causal inference for TE estimation (Rosenbaum et al., 1983). The methodology was used mainly in epidemiological and social science studies until it was adopted in a regulatory setting by the FDA, where it was used in observational studies to support marketing applications for medical devices (Yue, 2007; Campbell et al., 2016; Li and Yue, 2023; Levenson et al., 2013).
This methodology is based on the calculation of the individuals' probability of showing PE using pre-randomization response (Li et al., 2020). The use of propensity weighted approach for analyzing RCTs in MDD was recently proposed and the comparative analysis of data generated in one RCT was presented as a case study to compare the performances of conventional and propensity weighted approaches (Gomeni et al., 2023).
In the present paper, we are further elaborating the propensity weighted approach using data of two additional RCTs in MDD. The estimated individual propensity to PE will be used as weight of the individual observations in the mixed-effect model for repeated measures (MMRM) conducted to assess TE.
The higher the individual PE, the lower the contribution of the subject to the TE assessment. The expected effect of the weighted analysis is to enhance signal detection and effect size through a better control of the inter-individual variability, as the contribution of subjects with excessively high/low placebo response is minimized by the weighting procedure.
The individual probability of responding to placebo was estimated using the changes from screening to baseline in the 17 individual items of the Hamilton Depression Rating Scale (HAMD-17) (Hamilton, 1960) as potential predictors of the placebo response at end of study (EOS), using the multilayer artificial neural network (ANN) method (Yu et al., 2019).
The predictive power of the model to estimate the response at EOS was assessed using an artificial intelligence (AI) approach. The ANN model, developed using the placebo data, was applied to the individual HAMD-17 item changes from screening to baseline of each subject to estimate the individual probability of PE. The inverse of this value was used as an individual weight in the MMRM analysis conducted to assess the TE.
A comparative analysis was conducted to estimate TE and effect-size with and without propensity weight in the two selected RCTs in MDD. A sensitivity analysis was also conducted to evaluate the potential risk of inconsistent assessment of TE and study failure in new trials in presence of high or low level of PE by comparing the outcomes of a propensity weighted and traditional MMRM approach.

Data
The data of two antidepressant RCTs were used. The first trial (study 449) was a double-blind, placebo-controlled trial evaluating the effects of immediate release (IR) and controlled release (CR) paroxetine in MDD using a flexible-dose design. Subjects (N = 108, 112, and 110) were randomized to either CR (25-62.5 mg/day), IR (20-50 mg/day), or placebo.
The second trial (study 874) was a randomized, double-blind, parallel-group, placebo-controlled study evaluating the effect of paroxetine CR in elderly outpatients with MDD using a fixed-dose design. Subjects (N = 168, 177, and 180) were randomized to either paroxetine CR (12.5 mg), paroxetine CR (25 mg), or placebo. The primary efficacy endpoint for the two RCTs was the change from baseline to week 8 in the HAMD-17 total score. Details on these two trials were previously reported (Merlo-Pich et al., 2010).

Model development
The data of the two trials were independently analyzed using a sequential approach:
1) ANN model development using screening and baseline observations and EOS data (i.e., visit at 8 weeks) in subjects randomized to placebo to estimate the probability of being a placebo responder at EOS.
2) ANN model validation by comparing model-predicted probability and observed placebo response and by estimating the area under the Receiver Operating Characteristic (ROC) curve.
3) Prediction of the individual probability of PE using the pre-randomization data of each subject randomized in the study using the ANN model.
4) Longitudinal MMRM analysis using the inverse individual probability as weighting factor to estimate TE.
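The weighting used in the last step can be illustrated with a minimal sketch. It assumes the propensity probabilities have already been predicted by the ANN model; the function name, the example probabilities, and the `floor` guard against near-zero probabilities are illustrative assumptions and are not part of the reported analysis.

```python
def inverse_probability_weights(probabilities, floor=0.01):
    """Return 1/p for each estimated probability of placebo effect (PE).

    `floor` is an assumption of this sketch (not taken from the paper):
    it keeps a probability near zero from producing an unbounded weight.
    """
    weights = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        weights.append(1.0 / max(p, floor))
    return weights


# Hypothetical propensity estimates for four subjects.
probs = [0.9, 0.5, 0.2, 0.05]
weights = inverse_probability_weights(probs)
for p, w in zip(probs, weights):
    print(f"P(PE) = {p:.2f} -> weight = {w:.2f}")
```

In this sketch, a subject with a high estimated propensity receives a small weight and therefore contributes less to the weighted analysis.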
The propensity to respond to placebo was defined as the probability of a clinically relevant reduction from baseline of the HAMD-17 total score at EOS. The relevant improvement in HAMD-17 was estimated by linking the change of HAMD-17 to the Clinical Global Impression-Improvement scale (CGI-I) using the equipercentile linking method (Guy et al., 1976; Kolen et al., 2014). This analysis indicated that a CGI-I score of 3 ('minimally improved') was associated with an average reduction from baseline in the total Montgomery-Åsberg Depression Rating Scale (MADRS) score (Montgomery, 1979) of 24.5%, a CGI-I score of 2 ('much improved') was associated with an average reduction of 52.5%, and a CGI-I score of 1 ('very much improved') with an average reduction of 82% (Leucht et al., 2017). A robust improvement in disease severity was estimated as a percent change from baseline in the MADRS scale of 38%: the median value between minimally and much improved CGI-I. The equivalent clinically relevant reduction in the HAMD-17 scale was estimated using the equipercentile linking method developed to estimate equivalence between MADRS and HAMD-17 assessments. A percent reduction in HAMD-17 of 41% was identified as equivalent to the percent reduction of 38% in MADRS (Leucht et al., 2018). This value was used in the ANN analysis for identifying placebo responders.
A binary score (0 or 1) was assigned to each subject for absence or presence of response at EOS (i.e., a reduction from baseline in the HAMD-17 total score ≥ 41%). The model development and validation process was based on a random split of the original data into three datasets:
1) Training set, including 75% of randomly selected data in the placebo arm, used for ANN model development.
2) Validation set, including the remaining 25% of the data in the placebo arm, used for assessing model performance by comparing the model predictions with observed data.
3) Working dataset, including the data of all subjects randomized in the RCTs, used to provide individual estimates of the propensity probability by applying the ANN model.
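The responder definition and the 75%/25% split of the placebo arm can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the random seed, and the example scores are hypothetical, and the original analysis was run in R rather than Python.

```python
import random

RESPONSE_THRESHOLD = 41.0  # percent reduction in HAMD-17 total score


def is_responder(baseline_total, eos_total, threshold=RESPONSE_THRESHOLD):
    """Return 1 if the percent reduction from baseline meets the threshold."""
    if baseline_total <= 0:
        raise ValueError("baseline total score must be positive")
    percent_reduction = 100.0 * (baseline_total - eos_total) / baseline_total
    return 1 if percent_reduction >= threshold else 0


def split_placebo_arm(subject_ids, train_fraction=0.75, seed=0):
    """Randomly split placebo-arm subjects into training and validation sets.

    The seed is an assumption of this sketch, used only for reproducibility.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Hypothetical subject: HAMD-17 total of 24 at baseline and 13 at EOS.
print(is_responder(24, 13))

# Hypothetical subject identifiers for a placebo arm of 110 subjects.
train, valid = split_placebo_arm(range(1, 111))
print(len(train), len(valid))
```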
Many potential predictors of placebo response evaluated at baseline could be considered, such as demographic data, habits and quality of life, or disease-related information, in an attempt to improve the overall predictive performance of the model. For simplicity, we limited our exploration to the 17 items of the HAMD scale, as these items are assumed to capture specific and independent symptoms of depression. The ability of the changes in these 17 items to predict the placebo response at EOS was evaluated using ANN, as this methodology was shown to provide one of the best-performing predictive tools (Hulsen, 2022). The ANN model requires the definition of the number of hidden layers and the number of nodes in each hidden layer (Rosenblatt, 1961; Rumelhart et al., 1985). In Step 1 of the analysis, a grid search was conducted to identify the optimal number of layers and the optimal number of nodes in the ANN models. In Step 2 of the analysis, the validation dataset was used to evaluate the predictive performance of the best-performing model. The criterion for model validation was the area under the ROC curve, with the associated 95% confidence interval. The ANN analysis was conducted using the 'neuralnet' library in R (R Core Team, 2023). In Step 3 of the analysis, the ANN models developed using only placebo data were used to predict the individual PE in each subject using the individual pre-randomization data.
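The validation criterion, the area under the ROC curve, can be computed from the observed responder labels and the model-predicted probabilities. The sketch below uses the rank-based (Mann-Whitney) formulation; it is an illustration only, the labels and scores are hypothetical, and it does not reproduce the confidence interval calculation, whose method is not described here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen responder receives a
    higher score than a randomly chosen non-responder (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation data: observed response and predicted probability.
observed = [1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.7, 0.6, 0.2, 0.4, 0.5]
print(round(roc_auc(observed, predicted), 3))
```

A value of 0.5 corresponds to a non-informative model and a value of 1.0 to perfect discrimination.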

Longitudinal analysis
The inverse of the individual estimated probability was used as a weight in the MMRM model for the longitudinal analysis of the HAMD-17 total score change from baseline (PROC MIXED, Version 9.4, SAS Institute, Cary, NC, USA). The analysis was conducted on changes from baseline using a random effect on the change from baseline, an unstructured covariance matrix, time as a classification variable, baseline measurement as a covariate, baseline x time interaction, and treatment x time interaction. A level of α = 0.05 was used to establish the significance of the TE. The effect size was estimated as the least squares (LS) mean active-placebo difference divided by the pooled standard deviation, which was obtained as the standard error of the LS mean difference divided by the square root of the sum of the inverse treatment group sample sizes.
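The effect-size calculation described above can be written out as a short sketch. The function name and the numerical inputs for the LS mean difference and its standard error are hypothetical; only the arm sizes are taken from the description of study 449.

```python
import math


def effect_size(ls_mean_diff, se_diff, n_active, n_placebo):
    """Effect size = LS mean difference / pooled standard deviation.

    The pooled standard deviation is recovered from the standard error of
    the LS mean difference: SD = SE / sqrt(1/n_active + 1/n_placebo).
    """
    if se_diff <= 0 or n_active <= 0 or n_placebo <= 0:
        raise ValueError("standard error and sample sizes must be positive")
    pooled_sd = se_diff / math.sqrt(1.0 / n_active + 1.0 / n_placebo)
    return ls_mean_diff / pooled_sd


# Hypothetical LS mean difference and standard error; arm sizes of 108
# (paroxetine CR) and 110 (placebo) as in study 449.
print(round(effect_size(-2.0, 0.8, 108, 110), 3))
```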

Sensitivity analysis
A sensitivity analysis was conducted to assess the impact of excessively high/low propensity to PE on the estimated TE in the analyses conducted with and without a propensity weight in three scenarios:
1. Exclude subjects with high probability of PE (PE > 0.8)
2. Exclude subjects with very low probability of PE (PE < 0.2)
3. Include all subjects
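The three analysis populations can be derived from the estimated propensities as in the sketch below. The cut-offs are passed as parameters; the defaults of 0.8 and 0.2 follow the values quoted with the sensitivity results, and the function name and subject identifiers are hypothetical.

```python
def sensitivity_populations(propensity, high=0.8, low=0.2):
    """Return the subject IDs analysed in each of the three scenarios.

    propensity: dict mapping subject ID -> estimated probability of PE.
    """
    return {
        "without_high_PE": [s for s, p in propensity.items() if p <= high],
        "without_low_PE": [s for s, p in propensity.items() if p >= low],
        "all_subjects": list(propensity),
    }


# Hypothetical propensity estimates for three subjects.
example = {"S001": 0.95, "S002": 0.50, "S003": 0.05}
for scenario, subjects in sensitivity_populations(example).items():
    print(scenario, subjects)
```

Each population would then be analysed twice, once with and once without the propensity weight.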

Results
The descriptive statistics on demographic data and on HAMD-17 total score at screening and baseline are presented in Table 1 by RCT.
The grid search analysis indicated that the optimal number of layers was 3 in both studies, with 10, 2, and 5 nodes per layer for study 449 and 10, 13, and 3 nodes per layer for study 874. The optimality criterion was the best predictive performance of the model.
The final neural network layouts of the ANN analysis, with the relative importance of the changes from screening to baseline of each individual HAMD-17 item for the prediction of placebo response at EOS, are presented in Fig. 1 by study. In the left panel plots, each column represents:
• column 1, the change from screening to baseline of the 17 HAMD individual items (dHAMD_x, with x = 1 to 17) evaluated as potential predictors of placebo response ('resp'),
• column 2, the combined items characterizing the first layer,
• column 3, the combined items defining the second layer,
• column 4, the combined items defining the final layer.
The black color indicates an increasing effect and the grey color a decreasing effect. The size of the lines determines the relative influence of information associated with the connected variables in the network.
The relative importance of each explanatory variable for the response, presented in the right panel of Fig. 1, was determined by identifying all weighted connections between the nodes of interest (Olden et al., 2004). The connections were tallied for each input node and scaled relative to all other inputs. A single value was obtained for each explanatory variable that describes its relationship with the response variable in the model. The estimated relative importance of each individual HAMD-17 item was presented as a bar plot where the size of the bar identifies the individual item weight in the prediction and the color identifies the positive (blue) or negative (red) contribution to the prediction.
Table 1. Descriptive statistics on demographic data and on the HAMD-17 total score at screening and baseline for the 449 and 874 studies.
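The connection-weight idea can be sketched as the product of the layer weight matrices, which sums the products of the weights along every path from an input to the output. The sketch below is an illustration only: the weights are hypothetical, bias terms are left out, and scaling by the largest absolute value is an assumption about how the values were scaled relative to the other inputs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def connection_weight_importance(weight_matrices):
    """Signed importance of each input from the product of layer weights.

    weight_matrices[0] is inputs x hidden-1 and the last matrix is
    hidden-n x 1 (single output). Scaling by the largest absolute value
    is an assumption of this sketch.
    """
    product = weight_matrices[0]
    for w in weight_matrices[1:]:
        product = matmul(product, w)
    raw = [row[0] for row in product]
    largest = max(abs(v) for v in raw) or 1.0
    return [v / largest for v in raw]


# Hypothetical network: 3 inputs, one hidden layer of 2 nodes, 1 output.
w_input_hidden = [[0.5, -1.0], [0.2, 0.4], [-0.3, 0.1]]
w_hidden_output = [[1.0], [2.0]]
print(connection_weight_importance([w_input_hidden, w_hidden_output]))
```

The sign of each value indicates whether the input pushes the predicted response up or down, and the magnitude indicates its relative influence.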
The results of the analysis indicated that the predictive performance of the individual HAMD-17 items evaluated in the pre-randomization period varied study by study. As a consequence, the predictive performance of the data evaluated in one study cannot be translated to the data of another study as the predictive power is specific to the individual subjects enrolled in a study.
The predictive performance of the ANN models was assessed using the area under the ROC curve (AUC). The value of the AUC was 0.923 (95% confidence interval of 0.772-1.0) and 0.881 (95% confidence interval of 0.766-0.997) for the 449 and 874 studies, respectively. The ROC AUC values were significantly greater than the non-informative threshold of 0.5. As the ANN models were considered appropriate for predicting the individual propensity probability using the pre-treatment data in the placebo arm, we assumed that the predictions of the individual propensity probability in the active treatment arms were also appropriate when the pre-treatment data were used.
The ANN models were used to estimate the individual propensity to respond to placebo for each subject included in the two RCTs. The percentage of subjects with an estimated PE to respond to non-specific treatment effects in the intervals <0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, and >0.8 is presented in Fig. 2.
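The tabulation behind such a distribution can be sketched as follows. The function name and the example probabilities are hypothetical, and the handling of values lying exactly on a cut-point is an assumption of this sketch.

```python
def propensity_distribution(probabilities):
    """Percentage of subjects per propensity interval.

    Assumption of this sketch: a value lying exactly on a cut-point is
    counted in the interval above it.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    labels = ["<0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", ">0.8"]
    cut_points = [0.2, 0.4, 0.6, 0.8]
    counts = [0] * len(labels)
    for p in probabilities:
        # The number of cut-points at or below p gives the interval index.
        counts[sum(p >= c for c in cut_points)] += 1
    total = len(probabilities)
    return {lab: 100.0 * n / total for lab, n in zip(labels, counts)}


# Hypothetical propensity estimates for eight subjects.
example = [0.05, 0.1, 0.3, 0.5, 0.7, 0.85, 0.9, 0.95]
for interval, percent in propensity_distribution(example).items():
    print(f"{interval}: {percent:.1f}%")
```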
The distribution of PE indicated that a large majority of subjects in study 449 had a high probability of PE (PE > 0.8), which is expected to negatively affect the estimated TE. In contrast, the distribution of the propensity probability in study 874 indicated that a large majority of the subjects had a low probability of PE (PE < 0.2), which is expected to inflate the estimated TE. The results of the MMRM analyses are presented in Table 2.
The analysis with and without propensity weight indicated that the weighted analysis provided an estimate of TE and effect size larger than the non-weighted analysis. As expected, the size of the TE was differently affected in the analyses with and without weight due to the different level of imbalance in the baseline PE, as shown in Fig. 2.
The plots of the longitudinal LS mean changes from baseline of the HAMD-17 total score by treatment and study resulting from the non-weighted and weighted MMRM analyses are presented in Fig. 3.
A sensitivity analysis was conducted to evaluate how the estimated values of TE and effect-size were affected by the level of PE as expected from the meta-analysis conducted to evaluate the correlation between different levels of placebo response rate and clinical trial outcome in MDD (Ioveno et al., 2012).
Three analyses were conducted. In the first analysis, the subjects with high probability of PE (PE > 0.8) were excluded; in the second analysis, the subjects with low probability of PE (PE < 0.2) were excluded; and in the final analysis, all subjects were included. The same analyses were conducted with and without propensity weight (Table 2).
The results of the analyses, presented in Fig. 4, indicated that TE increased when the subjects with high probability of PE were removed and decreased when the subjects with low probability of PE were removed. These findings are in agreement with the expected effect of low/high placebo response on the estimated TE (Ioveno et al., 2012).
The % absolute deviation between the TE value estimated in the total population and the TE values estimated in the populations without subjects with high or low probability of PE was considered a measure of the potential risk of inconsistent assessment of TE and study failure in new trials in the presence of a high or low level of PE. The estimated risk was 0.503 and 0.255 for the conventional and propensity weighted analyses in study 449, and 2.294 and 0.421 in study 874, respectively. This large difference in risk indicates that the propensity analysis is less sensitive to excessively low/high placebo responders due to the effect of the weight probability. In contrast, the estimated TE in conventional MMRM analyses was significantly influenced by the baseline distribution of the different levels of PE.

Discussion
As previously reported (Fava, 2015), one may classify treated patients in an MDD trial based on each participant's propensity to respond to a given type of treatment. The propensity weighted methodology assumes that the TE in an MDD trial can be viewed as the resultant of treatment-specific and non-specific effects. While the specific effect can be associated with the active drug, the non-specific effect, defined by the individual probability to respond to any treatment or intervention, can be estimated using the ANN model applied to pre-randomization data.
The larger the imbalance in the individual baseline propensities of subjects allocated to the different treatment arms, the lower the chance of properly estimating TE and effect size. This is because the estimated TE and effect size derived using the current statistical methodologies will not represent the 'true' properties of the treatment but a working estimate of these values strongly correlated with the level of imbalance in the individual propensity distribution (Ioveno et al., 2012).

Table 2. Sensitivity analysis results to evaluate the impact of the excessively high and excessively low propensity to a placebo effect on the estimated TE with and without a propensity weight in the MMRM analysis.
Fig. 3. Results of the non-weighted and weighted MMRM analyses with the estimation of the effect sizes. The LS mean (± standard error) of the longitudinal HAMD-17 total score changes from baseline are presented by treatment and study.
As shown in Fig. 2, the imbalance in the distribution of the individual propensity to PE varies study by study and remains an unaddressed issue for the comparability of treatments, as this imbalance is not accounted for by the randomization of the subjects included in the RCTs.
The proposed methodology assumes that the changes in the individual HAMD-17 items between screening and baseline contain relevant information on the time course of the disease, as also reported in a study in schizophrenia conducted using the PANSS score (Hopkins et al., 2022). The response to placebo was defined as a clinically relevant change from baseline at EOS in the HAMD-17 total score. The relevant change was estimated by connecting the HAMD-17 total score to the CGI-I scores using the equipercentile linking method and by selecting the percentage reduction associated with minimally and much improved CGI-I scores.
An ANN model was initially developed to estimate the PE in the placebo treated subjects as a function of the HAMD-17 individual items evaluated in two pre-randomization occasions. This model was then validated by assessing the predictive performances of the individual items on data not used for model development. Finally, this model was applied to the pre-randomization data of each subject in the RCTs to estimate the individual propensity to respond to placebo. The inverse of the estimated propensity probability was included as weight in the MMRM model used to assess the TE in order to reduce baseline imbalances between arms (Zhang et al., 2023;Austin, 2011).
A case study was presented using data of two RCTs. The ANN models performed satisfactorily in terms of predictive performance estimated by the area under the ROC curve: 0.92 (95% confidence interval of 0.77-1.0) and 0.88 (95% confidence interval of 0.77-1.0) for the 449 and 874 studies, respectively.
The results of the analysis with and without the propensity weight indicated that the weighted analysis, corrected by the different and largely unbalanced distribution in baseline propensity probability, provided a larger estimate of both TE and effect-size.
The proposed methodology can be prospectively or retrospectively applied to any RCT when: (i) the study was designed to collect screening and pre-treatment baseline data, (ii) the criteria for assessing the clinical response to placebo were pre-specified in the analysis plan, (iii) the acceptable criteria for qualifying the predictive performance of the ANN model were defined in the analysis plan specifying that the acceptable ROC AUC cut-offs should be statistically greater than 0.5.
A relevant issue associated with the proposed analysis is the generalizability of the results to a different population. We are faced with two distinct issues: (a) the generalizability of the ANN model to predict the individual probability of PE and (b) the generalizability of the outcomes (i.e., the estimated TE and effect size) of the propensity weighted analysis. Regarding point (a), the outcomes of the ANN model developed in one study cannot be used for predicting the individual propensity probability in another study, as the subjects and the study designs are study specific, as shown by the comparison of the 449 and 874 data. This is because the individual propensity to respond to placebo is associated with the expectations specific to each individual. However, the ANN modeling approach can be applied to the pre-randomization data of different RCTs to estimate the individual propensity in different trials. Regarding point (b), randomized trials remain the most accepted design for estimating the TE, but they do not necessarily answer a question of primary interest about the effectiveness and the generalizability of TE in a large-scale target population. Recent literature indicates that a promising approach for assessing the generalizability of TE size can be based on the use of propensity-score-based metrics using the TE adjusted and normalized by the study-specific levels of confounding factors. Therefore, the propensity weighting score offers a promising tool for developers, regulators, or prescribers to best identify the performance of a new treatment in a target population by accounting for the potential confounding effect of excessively low/high placebo response (Stuart et al., 2001, 2015; Loux and Huang, 2023).
Fig. 4. Sensitivity analysis. Propensity weighted and non-weighted analyses: comparison of the estimated TE in the total population (All data) and in the populations without high (Prob > 0.8) and without low (Prob < 0.2) placebo response. The dots represent the TE value estimated in the MMRM analysis, the horizontal lines represent the 95% confidence intervals (the solid lines correspond to the 12.5 mg arm and the dotted lines correspond to the 25 mg arm). The vertical blue dotted lines represent reference TE values of −4, −2, and 0.
The major difference and advantage of the propensity weighted approach with respect to the historical study designs and/or analysis procedures (Scott et al., 2022; Fava et al., 2003) is that all subjects randomized in the trial are included in the analysis, consistent with the intention-to-treat (ITT) paradigm.
The propensity weighting method provides: (i) a model-based strategy to associate with each subject a weight accounting for the potential individual confounding factor of non-specific response, (ii) an estimate of the TE adjusted for the difference in the individual propensity to respond to placebo, and (iii) a better control of the impact of subjects with low/high PE. In the absence of any propensity adjustment, the estimated TE will be conditioned by the proportion of subjects with excessively high/low PE.
A sensitivity analysis was conducted to evaluate potential risk of inconsistent assessment of TE and study failure in new trials in presence of high or low level of PE associated with the use (or not) of a propensity adjustment. This analysis indicated that the propensity methodology was associated with a reduced risk of inconsistent assessment of TE and study failure in new trials.
Among the benefits associated with the propensity score approach, recent papers advocate the use of this approach to ensure balance between groups at the time of randomization and to account for chance imbalances in the observed randomization (Travis et al., 2023). While propensity scores were originally developed to address confounding in observational studies of causal effects, recent literature has shown that they are also helpful in randomized studies (Stuart et al., 2001). Propensity scores can be used to minimize this imbalance at the randomization stage, or to adjust for between-group differences in the analysis of outcomes. Both uses of propensity scores can improve the power of RCTs, especially in small samples or in investigating subgroup effects. Propensity scores, or propensity-based tools, can also be used to account for selection bias into randomized trials in the hope of generalizing or translating evidence from a randomized trial to a broader population (Freedman and Berk, 2008; Raad et al., 2020).
Several limitations of the current investigation should be noted. The HAMD-17 rating scale was the only clinical score evaluated. Other relevant clinical scores, such as the MADRS scale, have to be analyzed in trials conducted in MDD. In addition, as the unpredictably high placebo response rate is one of the major factors associated with the failure of randomized clinical trials in a large majority of psychiatric disorders, such as bipolar disorders, schizophrenia, and anxiety, the propensity weighted approach would also need to be evaluated in trials conducted in these disorders. A further limitation of the current investigation is the restricted number of RCTs evaluated.
In conclusion, propensity score is an extensively used methodology in observational studies for improving treatment comparison by adjusting data for potentially confounding baseline factors. The results of the presented analysis indicate that this methodology can be profitably extended to deal with the control of the placebo effect in randomized placebo-controlled clinical trials.

Funding
No funding was received for this work.

Declaration of Competing Interest
The authors have no conflict of interest.