Developing a predictive signature for two trial endpoints using the cross-validated risk scores method

Summary The existing cross-validated risk scores (CVRS) design has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using high-dimensional data (such as genetic data). The design is based on computing a risk score for each patient and dividing them into clusters using a nonparametric clustering procedure. In some settings, it is desirable to consider the tradeoff between two outcomes, such as efficacy and toxicity, or cost and effectiveness. With this motivation, we extend the CVRS design (CVRS2) to consider two outcomes. The design employs bivariate risk scores that are divided into clusters. We assess the properties of the CVRS2 using simulated data and illustrate its application on a randomized psychiatry trial. We show that CVRS2 is able to reliably identify the sensitive group (the group for which the new treatment provides benefit on both outcomes) in the simulated data. We apply the CVRS2 design to a psychology clinical trial that had offender status and substance use status as two outcomes and collected a large number of baseline covariates. The CVRS2 design yields a significant treatment effect for both outcomes, while the CVRS approach identified a significant effect for the offender status only after prefiltering the covariates.


Introduction
It is common in clinical trials that only a subgroup of treated patients is likely to benefit from an experimental therapy (Rothenberg and others, 2003;Foster and others, 2011;Zhao and others, 2013;Janes and others, 2017).This creates the need to better design and analyse clinical trials so they provide more information about which patients, if any, benefit from a treatment.For this purpose, high-dimensional information that is increasingly being collected on patients can be used.There have been several approaches proposed for utilising (potentially high-dimensional) treatment by covariate interactions to stratify patients.For example, Freidlin and Simon (2005) and Freidlin and others (2010) proposed an adaptive signature design (ASD) that combines a prospective development of a sensitive patient classifier and validation of the classifier in a single trial, based on treatment-covariate interactions obtained via regression modelling.Motivated by the work of Freidlin and Simon (2005) and Freidlin and others (2010), Zhang and others (2017) focus on subgroup selection using baseline covariates by incorporating a utility function that takes into account the size of a subgroup or the possibility for an alternative treatment for the patients.
In the randomised clinical trial of evaluation of the effects of the dose of dialysis and the level of flux of the dialyser membrane on time to death from any cause (Eknoyan and others, 2002), the method identified a subgroup with a significantly greater treatment difference.Another approach to handling treatment-covariate interactions is decision tree algorithms (Su and others, 2009;Dusseldorp and others, 2010;Lipkovich and others, 2011).These algorithms recursively partition the data with splits chosen to optimize an objective function.The algorithms differ in the size of the search space (Lipkovich and others, 2011), the order of the splitting variables (Dusseldorp and others, 2010), the selection of covariates and the choice of the optimal cut-off (Loh, 2002;Loh and others, 2015), and preventing bias in the variable selection by fitting a parametric model to the data in each node of the tree in a model-based recursive partitioning approach (Hothorn and others, 2006;Zeileis and others, 2008;Seibold and others, 2016;Krzykalla and others, 2020).
"Virtual twins" (Foster and others, 2011) is another tree-based approach that involves computing the difference between predicted response probabilities for the treatment subject and its control "twin".Tian and others (2014) proposed estimating treatment by covariate interactions using a modified covariate approach without the need for modeling the main effects.Shen and He (2015) tested the existence of subgroups with differential treatment effects by utilising the association between the subgroup membership and subject-specific covariates in a logistic-normal mixture model framework.A few studies have considered the compound covariate predictor approach for the prediction of predefined tumour classes (Radmacher and others, 2002), for quantitatively estimating treatment effects and for predicting survival curves (Matsui and others, 2012;Matsui, 2006).In the compound covariate predictor approach, the compound covariate was constructed using a test statistic from treatment-covariate interactions.Zhao and others (2013) selected the subgroups via a variety of parametric scoring systems that were constructed as a function of multiple baseline covariates and were used to estimate the treatment difference.Rosenwald and others (2002) utilised gene-expression profiles to identify subgroups of patients with different survival rates, while the subgroups with distinctive gene-expression profiles were defined on the basis of hierarchical clustering.
All these methods have been proposed for a single endpoint.However, in many clinical trials, multiple outcomes are of interest.This is especially relevant for clinical trials that analyse both efficacy and toxicity of the treatment (common in oncology and in other areas where treatment can have considerable side effects), efficacy and quality of life, and cost effectiveness of new drugs and treatments in health economics.For example, Zhao and LeBlanc (2019) considered the appropriate subpopulation for a clinical trial which was defined by a single biomarker that exceeded a specified threshold value.In addition to the main outcome that represented a response to the treatment, they proposed to incorporate other outcomes such as risk or cost associated with a new treatment by optimising population impact.The cross-validated risk scores (CVRS) method (Cherlin and Wason, 2020) is based on constructing risk scores from a large number of baseline covariates.Risk scores represent a scoring system that is developed to be associated with the treatment effect and that can be used for predicting a benefit from the treatment.The CVRS design consists of two steps.In the first step, a single-response regression model is fit to every covariate and the risk scores are constructed as sums of associated covariates within each patient weighted by their estimated effects.In the second step, the risk scores are divided into two clusters that correspond to sensitive and non-sensitive groups of patients.Similarly to the cross-validated ASD (Freidlin and others, 2010), the sensitive group is the subgroup of patients predicted to have high treatment effect.The method has been implemented in an R package rapids (Cherlin and Wason, 2019).In this article, we propose the extension of the CVRS method that considers two outcomes, i.e. a design that develops a signature from a large number of covariates, considering two endpoints.The proposed CVRS2 method utilises vector generalized linear models (Yee, 2010) for constructing bivariate risk scores.The bivariate risk scores are then divided into a prespecified number of clusters using an extension of the clustering procedure that has been used for the CVRS method.This clustering procedure is computationally straightforward and can be used as a starting point (we discuss other clustering procedures in the Discussion section).We explore the operating characteristics of the CVRS2 method for simulation scenarios that assume two outcomes that have different correlation structures.We also have extended the R package rapids to incorporate the CVRS2 method.To our knowledge, CVRS2 is the first method within the adaptive signature design family that considers two outcomes.The remainder of this paper is organised as follows.The Methods section briefly describes the original CVRS design and introduces the CVRS2 design.In the Results section, we explore the operating characteristics of the CVRS2 design for various simulation scenarios.In the Real Data Example section, we illustrate the application of the CVRS2 design to a randomised psychiatry clinical trial.Finally, we summarise our conclusions in the Discussion section.

CVRS Design
The full details of the method are provided in Cherlin and Wason, 2020.Briefly, the model assumes that the response to treatment denoted by Y is influenced by a subset of K unknown covariates (the sensitive covariates) though a generalised linear model.For example, for a binary outcome, the model looks as follows: log pi where p i is the probability of the response to treatment for the ith patient; µ is the intercept; λ is the treatment main effect that all patients experience regardless of the values of the covariates; t i is the treatment that the ith patient receives (t i = 0 for the control arm and t i = 1 for the treatment arm); x i1 , . . ., x iK are the values for the K unknown sensitive covariates; α 1 , . . ., α K are main covariate effects for the K covariates; γ 1 , . . ., γ K are the treatment-covariate interaction effects for the K covariates.The model assumes that there is a subset of patients (the sensitive group) with a higher probability of response when treated with the new treatment compared with the control treatment.In the first step of the CVRS design, a signature is developed by constructing risk scores within each patient as sums of associated covariates weighted by their estimated effects.In the second step, a clustering procedure is applied to divide the risk scores into two clusters that correspond to sensitive and non-sensitive groups of patients.For constructing the risk scores, the cross-validation procedure is used, in which the model is built using the training subset and the risk scores are constructed for the patients in the test subset set as follows.For rfold cross-validation, the observed dataset D of size N is randomly divided into r non-overlapping subsets D (l) , l = 1, . . ., r, of (approximately) equal size N/r.A common choice of r is 10 which we adopt here.For each iteration of the r-fold cross-validation, data are split into test D (l) and training D (−l) (formed by removing D (l) from D) subsets and the coefficients for treatment by covariate interaction β(−l) j are estimated from log pi 1−pi = µ + λt i + α j x ij + β j t i x ij , for each covariate j using training subset alone.Then, for each test set D (l) , the risk scores are computed as RS ij is the value of the covariate j for the ith patient in the lth test set.Within each test set D (l) , the k-means procedure (Hartigan and Wong, 1979) with k = 2 is applied to classify the test scores RS (l) i , i = 1, . . ., N/r into sensitive and non-sensitive groups.Therefore, at the end of the cross-validation process each patient in the observed data D is classified either as sensitive or non-sensitive, after pooling group membership status across the r test sets.There are a few ways that one can test for the difference between the arms.One way is to consider the overall test positive if there is either a significant difference in the overall comparison between the arms or a significant difference between the arms in the sensitive group.
The test for the overall comparison between the arms could be performed using a test for the difference of two proportions, carried out at a significance level α 1 , while for the comparison between the arms within the sensitive subgroup only Fisher's exact test could be carried out at a significance level α 2 .The overall type I error is controlled at the significance level α = α 1 + α 2 .
Alternatively, one can test for the interaction effect between the treatment and the sensitivity status using a generalised linear model.Generally, it is advisable to use a permutation method to obtain a P -value for testing the interaction effect between the treatment and the sensitivity status (Simon and others, 2004) because when the sensitive group is obtained by cross-validation, the samples are not independent and therefore the standard asymptotic theory does not apply.In the permutation method, the entire cross-validation procedure is performed for every permuted data set and the corresponding test statistic is obtained.The one-sided permutation P -value is given by 1+number of elements of P * ⩽P0 1+number of permutations , where P * is the vector of the P -values for the treatmentsensitivity status interaction effect computed for a large number of permuted data sets, and P 0 is the P -value for the treatment-sensitivity status interaction effect obtained for the original (non-permuted) data.

CVRS2 Design
The CVRS2 design considers two outcomes, Y 1 and Y 2 (e.g.these could represent efficacy and toxicity in cancer clinical trials), that are influenced by a subset of K unknown covariates (the sensitive covariates) through a vector generalized linear model (Yee, 2010).For example, the relationships between two binary outcomes and sensitive covariates can be described through a bivariate odds ratio model (Yee, 2015) as follows: log = µ (1) + λ (1) t i + α (1) , where p (1) i is the probability of the first outcome (Y 1 ) for the ith patient, p (2) i is the probability of the second outcome (Y 2 ) for the ith patients, ψ is the odds ratio (ratio of odds of Y 1 = 1 when Y 2 = 1 and odds of Y 1 = 1 when Y 2 = 0).A measure that describes the association between the two proportions, ψ, is a natural measure for the association between two binary responses.It is modelled as intercept-only, which is equivalent to the assumption of constant correlation.The responses are independent if and only if ψ = 1.A more complex modelling of ψ that allows for more flexible correlation structures can be used.However in some cases these may lead to numerical problems (Yee, 2015).Here, the estimated correlation between the outcomes is not explicitly used in the analysis.However, taking the correlation into account by jointly modelling the outcomes allows us to better estimate the effects of the covariates.For constructing the risk scores, the coefficients for treatment by covariate interaction are estimated for each covariate j from a single-covariate bivariate odds ratio model: log (1) (1) (2) (2) . The cross-validation procedure is used for obtaining the bivariate risk scores which are computed for each patient i as RS ij } for each iteration of the r-fold cross-validation procedure.Here, the first score is computed with respect to the first outcome, while the second score is computed with respect to the second outcome.The bivariate risk scores can be partitioned into a number of clusters within each test set D (l) .In some settings, e.g.where a treatment could be either (i) safe and effective; (ii) not safe and not effective; (iii) safe and not effective; (iv) effective and not safe, the most natural choice for the number of clusters is four.Letting the rate of the first response represent the efficacy of the treatment and the rate of the second response represent the safety of the treatment, the four clusters would represent (i) participants that have high response rates for both outcomes; (ii) participants that have low response rates for both outcomes, (iii) and (iv) participants who have a high response rate for one of the outcomes and a low response rate for the other.This division can be accomplished by applying the k-means clustering procedure with k = 4. Here, a sensitive group can be defined as one of the clusters or a combination of the clusters.In settings where a treatment would have to be either sufficiently safe and effective or not safe and not effective (i.e. when there are only two underlying clusters of patients), it may make better sense to divide patients into two groups.Here, the first group would represent participants with high response rates for both outcomes, while the second group would represent participants with low response rates for both outcomes.In this case, the k-means is applied with k = 2.We note that k-means clustering with k = 2 may not necessarily divide patients into two clusters with this interpretation.We explore this in the results and consider alternatives in the discussion.For each outcome, the power to conclude treatment effect in the trial population and the power to conclude treatment effect in the sensitive group are computed.We also consider the overall power to reach at least one of these conclusions.For the power in the trial population, we compute the probability to detect a significant effect in the first outcome and the probability to detect a significant effect in the second outcome, i.e.P tp = {P tp i } where i = 1, 2 is the outcome.In the four cluster case, for each outcome, we compute the power for the sensitive group as a set of four elements, each element represents the probability to detect a significant difference between the treatment and the control in each one of the clusters, ı. }.To compute the cluster-wise sensitivity and specificity in the four cluster case, we assume that each cluster in turn corresponds to a sensitive group.A detailed explanation of computing sensitivity and specificity is given in the "Computing sensitivity and specificity" section of the Supplementary Materials.

Simulation study
We conducted a simulation study to evaluate the performance of the CVRS2 for four scenarios, exploring a different number of subgroups and the correlation between the covariates.For every scenario, we assumed a clinical trial with 100 covariates where K=10 of them are sensitive for each outcome with an overlap of five sensitive covariates, i.e. there were five covariates that were sensitive to both outcomes.The main effects of the covariates were assumed to be 0, and the treatmentcovariate interaction effects were assumed to be constant across the sensitive covariates, similarly to Cherlin and Wason, 2020: log (2) K t i x iK .The intercepts µ (1) and µ (2) were set to correspond to control arm response rates of 25%.We used an overall significance level α = 0.05 (two-sided) that corresponds to α 1 = 0.04 and α 2 = 0.01 significance levels for the trial population test and for the sensitive group test, respectively, as suggested in Freidlin and Simon (2005).The empirical overall power was calculated as the percentage of replications with either a positive trial population 0.04 level test or a positive 0.01 level sensitive group test.The results of the simulations were based on 1000 replications.The parameters used in the simulation study are presented in Table 1.In all of the scenarios, the response rates and the sample sizes correspond to those used in Freidlin and Simon (2005) and Cherlin and Wason (2020).A detailed description of how data were simulated is given in the "Simulation steps" section of the Supplementary Materials.To construct the risk scores, we fitted bivariate odds ratio model without the main treatment effect and the main covariate effects, to match the data generating mechanism: . We note that there is no direct comparator method that we are aware of for the two outcome case, hence we are comparing the CVRS2 method to the original CVRS method applied separately to each one of the outcomes, for both the simulations and the real data example.To assess the sensitivity of the CVRS2 method to various model misspecifications and extreme values of the parameters, we performed a sensitivity analysis (see the "Sensitivity analysis" section of the Supplementary Materials).

Scenario I
The data were simulated assuming that there was a group of patients that had high rates of both responses Y 1 and Y 2 (the sensitive group), while the rest of the patients had low rates of both responses (the non-sensitive patients).Here, by response rate we denote the probability that the response takes the value of 1.The rates of Y 1 and Y 2 were 0.7 and 0.25 for the sensitive and the non-sensitive group, respectively (the response rates are illustrated in Figure 1(a) for 10 simulations runs).The percentage of patients in the sensitive group were either 10% or 20%, the sample size was 400.The data were simulated assuming an independence between the covariates.
The results of the analysis with a model that employs k = 2 clusters (to match the data generating mechanism) are presented in Table 2.The resultant risk scores are illustrated in Figure 2(a) for 10 simulation runs.The risk scores show a perfect separation between the sensitive and the nonsensitive groups, and the response rates in both sensitive and non-sensitive groups are estimated with a high precision(estimated rates of Y 1 and Y 2 for the sensitive and non-sensitive group were 0.69 and 0.25, respectively).

Scenario II
The data were simulated assuming four clusters of patients, which represent (i) patients with low response rates for both responses (cluster 1); (ii) and (iii) patients with a high response rate for one of the responses and a low response rate for the other, and vice versa (clusters 2 and 3); and (iv) patients with high response rates for both responses (cluster 4), as explained in Section 2.2.
This scenario covers three sub-scenarios, IIa, IIb and IIc.In all three of the sub-scenarios the low response rate is 25%, while the high response rate is 80% for scenario IIa, 70% for scenario IIb and 60% for scenario IIc.The response rates are illustrated in Figure 1(b-d) for 10 simulation runs.The mean rates of Y 1 and Y 2 for cluster 1 were 25% for all three of the sub-scenarios.In Scenario IIa, the mean rate of Y 1 for clusters 3 and 4 was 80%, and the mean rate of Y 2 for clusters 2 and 4 was 80%.For Scenarios IIb and IIc, these mean response rates were 70% and 60%, respectively (see Table 1).The covariates were assumed to be independent.The results are presented in Table 3 for sample sizes 400 and 1000.For each scenario we investigated the method assuming that the sensitive group corresponds to one cluster in turn, as elaborated in Section 2.2.The risk scores for Scenario IIa are illustrated in Figure 2(b) for 10 simulation runs.The sensitivity and specificity of identifying the sensitive group are high reaching values > 0.95 in many cases.Interestingly, the response rates are better estimated for clusters 1 and 4 rather than for clusters 2 and 3.This is in line with the results for Scenario I, suggesting that the method estimates the sensitive group better for clusters where the rates of both responses are similar.
Very low power for the sensitive group test when cluster 1 corresponds to sensitive group suggests that the type I error is well controlled: considering cluster 1 as the sensitive group is equivalent to a null scenario.

Scenario III
To investigate the sensitivity of the CVRS2 method to the data generating mechanism, we simulated data assuming a moderate pairwise correlation (ρ = 0.4) between all covariates.In this scenario, the data were simulated assuming four clusters of patients, with the low response rate being 25% and the high response rate being 80%, similarly to scenario IIa.The only difference with scenario IIa is the correlation between the covariates.The response rates are illustrated in Figure 1(e) for 10 simulation runs.Clearly, there is an increased variability of the cluster-wise response rates in comparison to scenarios with independent covariates (see Figure 1(b) for comparison).The results are presented in Table 3.The risk scores are illustrated in Figure 2(c) for ρ = 0) suggest that the method would benefit from pre-filtering of the covariates based on the correlation between them.However, even without modelling the correlation between the covariates the method was still able to increase the overall power beyond that achieved for the trial population 0.04 level test.

Scenario IV
The data were simulated assuming a null scenario where the rates of both responses Y 1 and Y 2 in all patients on both arms are 25%.The results are presented in Table 3.The risk scores are illustrated in Figure 2(d) for 10 simulation runs.The results show that a type I error is well controlled at the 0.01 level for the sensitive group test and at the 0.05 level overall.The response rates are estimated with high precision, being very close to 0.25 for all clusters (see Table 3).

Comparison between CVRS2 and CVRS
In order to compare the performance of the CVRS2 with the CVRS, we applied the CVRS to each one of the responses separately (we refer to this method as marginal CVRS).For the marginal CVRS, patients that were non-sensitive to both outcomes were referred to as cluster 1, patients that where sensitive to one of the outcomes were referred to as clusters 2 and 3, and patients that where sensitive to both outcomes were referred to as cluster 4. To illustrate the difference in the results between the CVRS2 and the marginal CVRS, we used three randomly selected simulation replicates with different correlations between the response rates.The correlation between the response rate was induced by the covariates that affect both responses (the overlapping covariates).
All three of the data sets were simulated similarly to scenario IIb, i.e. the high response rate was 0.7, the sample size was 1000, the number of sensitive covariates was 10 for either response.The numbers of the overlapping covariates were zero, five and nine leading to estimated correlations of -0.003, 0.24 and 0.46, respectively.We applied the CVRS2 and the marginal CVRS to each data set.The results as shown by the risk scores coloured by the predicted and true clusters are illustrated in Figure 3(a) for zero overlapping covariates, Figure 3(b) for five overlapping covariates and Figure 3(c) for nine overlapping covariates.A detailed description on how the marginal CVRS assigns the risk scores to four cluster is given in the "Assignment of the risk scores to four clusters by marginal CVRS" section of the Supplementary Materials.In the case of zero overlapping covariates, the CVRS2 perfectly separated the clusters, while the marginal CVRS did not perform as well.In the case of five overlapping covariates, the CVRS2 had a much better ability to classify the patients in comparison to the marginal CVRS that failed to separate clusters 2, 3 and 4.However, in the case of nine overlapping covariates both designs were unable to classify the patients correctly into four clusters.This can be explained by the fact that a higher correlation between the response rates makes the data more compatible with two clusters than four, as illustrated by the risk scores.Interestingly, the marginal CVRS successfully differentiated between the two clusters of the data.When we applied the CVRS2 with k = 2 to this data set, the design was able to perfectly separate the two underlying clusters, similarly to the marginal CVRS.

Real Data Example
Previously, we have applied the CVRS method to the data from the Systematic Therapy of At Risk Teens (START) (Fonagy and others, 2018).START was a randomised controlled trial comparing the outcomes of young people and their families who were allocated to treatment as usual (control arm) and multisystemic therapy (treatment arm).The data set comprised of 669 participants (336 participants in the control arm and 333 participants in the treatment arm) and 86 covariates.Participants with one or more offences were defined as offenders (288 in total; 143 in the control arm and 145 in the treatment arm), while participants with no offences were defined as non-offenders (381 in total; 193 in the control arm and 188 in the treatment arm).No overall significant treatment effect as measured by a logistic regression was detected (P = 0.797).The CVRS method indicated the existence of a sensitive group comprised of 453 participants (222 participants in the control arm and 231 participants in the treatment arm) with no significant interaction effect between the treatment and the sensitivity status (permutationbased P = 0.122).However, a marginally significant interaction effect between the treatment and the sensitivity status (P = 0.043) was achieved after a pre-filtering of the covariates based on a P -value threshold.Here, we analyse the START data considering two outcomes: in addition to the previously analysed offender/non-offender status, we analyse a binary substance use status (the numbers of the offenders/substance users are presented in Supplementary Table 1).Due to missingness of the substance use status in above 30% of participants, the two outcome dataset comprises of 461 participants (218 participants in the control arm and 243 participants in the treatment arm).The analysis was performed without the pre-filtering of the covariates.No overall significant treatment effect as measured by a bivariate odds ratio model was detected (P = 0.993 and P = 0.181 for the offender status and the substance use status, respectively).We analysed the START data with the CVRS2 method assuming either two or four underlying clusters (see Supplementary Figure 1 for the corresponding risk scores).For constructing the risk scores, we fitted the bivariate odds ratio model that includes the main treatment effect and the main covariate effects, as described in Section 2.2.For the two-cluster analysis, the CVRS2 method found a sensitive group comprised of 283 participants (132 participants in the control arm and 151 participants in the treatment arm).The permutation-based P values for the interaction between the treatment and the sensitivity status were 0.003 and 0.129 with respect to the offender status and the substance use status, respectively (based on 2000 permutations).The permutation-based P -value was computed assuming one cluster in turn corresponds to a sensitive group.With respect to the offender status, the cluster-wise permutation P -values were 0.007, 0.147, 0.023 and 0.162, while with respect to the substance use, the cluster-wise permutation P -values were 0.038, 0.193, 0.509 and 0.299.The estimated cluster-wise rates for the offender status were 0.27, 0.36, 0.44 and 0.49, while the estimated cluster-wise rates for the substance use status were 0.23, 0.4, 0.36 and 0.43.The size of the clusters were 67, 132, 151, and 111 subjects.The mean offender rate and substance use rate in each arm within each cluster are shown in Supplementary Table 3.In this study it would make sense to assume that cluster 1 that corresponds to low offence rate and low substance use rate, represents the sensitive group.Our results indicate that there is a significant P -value of the interaction between the sensitivity status and the treatment with respect to both of the outcomes with respect to cluster 1.This suggests that while no treatment effect was indicated for the overall population of the participants, a subgroup of participants who belong in cluster 1 benefited from the treatment.In order to compare the CVRS2 with the CVRS, we applied the marginal CVRS to the two outcome dataset that comprises of 461 patients.The permutationbased P -values for the interaction effect between the treatment and the sensitivity status were P = 0.101 and P = 0.383 with respect to the offender status and the substance use status, respectively (Supplementary Figure 2).The coefficients of the covariates that contributed to the risk scores are presented in Supplementary Table 4 for the CVRS2 method and in Supplementary Table 5 for the marginal CVRS method.One could look at which covariates contribute most of the risk scores by multiplying the values of the covariates by the corresponding coefficients.

Discussion
We have proposed a method that can stratify patients into groups that have different treatment effects on more than one outcome.This method represents a modification of the CVRS design.
The new CVRS2 design considers two outcomes and utilises bivariate risk scores.The scores are constructed as sums of the covariates weighted by their coefficients, estimated with vector generalised linear models.The method assumes four pre-defined clusters, based on the rate of the outcomes (see the "Interpretation of the clusters" section of the Supplementary Materials for the interpretation of the clusters).We have investigated the performance of the CVRS2 method by applying it to various simulated scenarios.We found that when the data is clearly divided into two subgroups of samples (sensitive and non-sensitive), the method is able to perfectly separate between the groups.When the data consists of four clusters (two clusters with both high/low rates of responses, and two clusters with a high rate of one of the responses and a low rate of the other), the method is able to identify the clusters and to estimate the rates of the responses reasonably well.We showed that in the case of four clusters, increasing the sample size, as well as increasing the response rates of a cluster with the highest response rate, improves the performance of the method.We investigated the correlation between the response rates in the sensitive groups on treatment by simulating the data with different numbers of overlapping covariates (zero, five and nine).We found that in the cases of zero and five overlapping covariates, the CVRS2 design performs better than the marginal CVRS design (the original one outcome CVRS that has been applied to each outcome separately).However, when the number of the overlapping covariates is too large (nine out of ten in our case), both methods are unable to identify four clusters, because the data are more compatible with two clusters.This suggests that the method is sensitive to the prior assumption about the number of the true clusters.
To illustrate the applicability of the methods to the real data, we have applied it to the data from a START randomised controlled trial.Here we considered the offender status and the substance use status as two outcomes.We showed that the CVRS2 was able to identify a sensitive group that conferred a significant interaction effect (P = 0.003) between the treatment and the sensitivity status (with respect to the offender status), assuming the data consist of two groups of participants (interestingly, the interaction effect between the treatment and the sensitivity status with respect to substance use was not significant: P = 0.129).This is in contrast to the CVRS method that considers one outcome only, which did not identify a sensitive group that conferred a significant interaction effect between the treatment and the sensitivity status.This shows that incorporating information from an additional outcome could marginally improve the results for the first outcome, even though the results for the second outcome are not significant.
We next analysed the START data assuming there are four clusters of participants (see the "Interpretation of the clusters" section of the Supplementary Materials for the interpretation of the clusters).The results showed a significant interaction effect between the treatment and the sensitivity status with respect to the offender status when cluster 1 (a cluster of patiens with low offence rate and low substance use rate) is assumed to belong to a sensitive group (P = 0.007).The corresponding P -value for the substance use achieved a nominal significance as well (P = 0.038).The estimated response rates for the cluster with the highest response rates (0.49 and 0.42) were lower than those of the simulated data.This suggests that for the real data with higher response rates for the sensitive group, the method would achieve more significant results.
Additionally, a larger data set would achieve a higher power as illustrated for the simulated data.
We compared the results for the CVRS2 design with the results for the marginal CVRS for the same set of patients, to investigate the importance of joint modelling of the outcomes.We showed that the marginal CVRS which is equivalent to independent modelling of the outcomes, was not able to find a significant interaction effect between the treatment and the sensitivity status with respect to both offender status and substance use status.This result is in line with the result for the simulated data that showed a better classification ability for the CVRS2 in comparison to the marginal CVRS.Interestingly, the risk scores for the CVRS2 and for the marginal CVRS are driven by different covariates, as can be seen from Supplementary Tables 4 and 5.The covariates that have the largest absolute values of the coefficients for the marginal CVRS, have coefficients equal to zero for the CVRS2 (for example, a diagnosis of eating disorder).The difference in the contribution of the covariates to the risk scores in the two methods could explain the difference in the assignment to the clusters.To further investigate the differences in the results of the CVRS2 and the marginal CVRS methods, we compared the values of the measure of association, ψ, for the real data for each covariate and for the simulated data (Scenario IIb, averaged over 100 simulations, for 100 covariates).Supplementary Figure 10 shows that the values of ψ for the real data are consistently higher than one.This suggests that the two responses in the case study are not independent which could also explain the differences in the results from applying the CVRS2 and the marginal CVRS methods.
The membership in a cluster is specified by a combination of the baseline covariates.We note that while the clinical interpretation of the covariates is beyond the scope of the manuscript, it could facilitate further independent investigation.The risk scores might be utilised in designing adaptive enrichment trials where only the patients who are predicted to benefit from the treatment are enrolled into the trial.We note that CVRS method was able to achieve a significant result for the real data only after a pre-filtering of the covariates based on a P -value threshold.Here, by taking an additional outcome into account, the CVRS2 method achieved a significant result without applying any pre-filtering of the covariates, even for a smaller sample size.However, the simulation results show that when the covariates are correlated, the method might benefit from the pre-filtering of the covariates based on the correlation between them, which will be investigated in future work.An additional issue that requires further investigation is the number of clusters.
We have investigated the cases of two and four clusters.Yet, for the four-cluster case, one might want to investigate merging some of the clusters together into a sensitive group.In practice, the assumption of the number of the clusters might also depend on the nature of the study and the question to be analysed.Finding the optimal number of clusters would be a part of future research.
Similarly to the CVRS method, in this study, we have retrospectively applied the CVRS2 method to identify the sensitive group in psychiatry trial participants.In principle, the method can be used to prospectively identify whether a participant belongs to a sensitive group.To divide the patients into clusters, we utilised a computationally straightforward and scalable k-means clustering procedure that works well in practice (Jian, 2010).For example, k-means clustering has been used previously in the context of gene expressions, where it successfully separated the high-risk versus low-risk cutoff for the up-/down-regulated mean ratio of gene expressions (Shaughnessy Jr and others, 2007).However, we note that different clustering algorithms can be utilised.For instance, Rosenwald and others (2002) utilised hierarchical clustering to analyse genes whose expression was correlated with the outcomes.In this study, we applied a supervised clustering approach, i.e. the number of clusters was pre-specified.This approach requires prior knowledge about the most plausible number of clusters, as shown in Section 3.5.Future work will investigate the performance of the unsupervised clustering approaches, where the number of clusters is unknown a priori, and on the sensitivity of the method to different clustering approaches.Additional areas to explore are the parametric clustering approach in which each cluster is assumed to follow a parametric distribution and the data is modelled with a mixture model (Fraley and Raftery, 2002), and the semiparametric clustering approach that models highdensity data with a parametric density and low-density data with a non-parametric density (Pan and others, 2019).
We note that our method gives equal weight to both endpoints.However in clinical practice the outcomes could have different importance for different patients.There are a few possible ways to address this issue.First, patients could interpret their pair of risk scores for the two outcomes, and weigh which treatment is preferable according to their relative importance of the two outcomes.
This would allow for patients to apply their own judgement regarding the importance of the outcomes.Secondly, a univariate risk score could be formed that gives a pre-specified weight to each component of the bivariate risk score, i.e.RS i = w 1 RS 1i + w 2 RS 2i , where w 1 and w 2 are the weights, and RS 1i and RS 2i are the components of the bivariate risk score RS i for patient i.
This could then be included within the CVRS2 approach.Finally, a parametric type of clustering could incorporate the relative importance of each outcome into the clustering procedure.
Similarly to the CVRS method, the CVRS2 method is developed for binary outcome.It is straightforward to extend the implementation of the method to incorporate different types of outcomes, such as normally distributed or time-to-event endpoints.Also, a more flexible joint model (e.g.latent variable model) could be employed to handle mixtures of outcomes.

Software
We extended the R package rapids (Cherlin and Wason, 2019) to incorporate the CVRS2 method.

Supplementary Materials
Supplementary Materials are available online at http://biostatistics.oxfordjournals.org.Risk scores 2 Risk scores 2 Risk scores 2

True clusters
Risk scores 1 Risk scores 2

Predicted clusters
Risk scores 1 Risk scores 2 Table 1.Parameters that are used in simulation study.Each row corresponds to a simulated scenario.Column θ i , σ 2 i , ρ i corresponds to the mean, variance and correlation of multivariate normal distribution that was used to simulate sensitive covariates in the subgroups that are sensitive to outcome i = 1, 2. Column ν i , ζ 2 i , κ i corresponds to the mean, standard deviation and correlation of multivariate normal distribution that was used to simulate sensitive covariates in the subgroups that are not sensitive to outcome i = 1, 2. Column η, ξ 2 , τ corresponds to the mean, standard deviation and correlation of multivariate normal distribution that was used to simulate the rest of the covariates in all patients.For each cluster of patients, the columns RR 1 and RR 2 correspond to the rates of response 1 and response 2 red for the sensitive group on the experimental arm, respectively.Table 2. Operating characteristics of the CVRS2 methods for scenario I.The response rates on the control arm are 25%.10 covariates are sensitive to response 1, 10 covariates are sensitive to response 2 with the overlap of 5 covariates.The results are based on 1000 simulations.