Designing multi‐arm multi‐stage clinical trials using a risk–benefit criterion for treatment selection

Multi‐arm clinical trials that compare several active treatments to a common control have been proposed as an efficient means of making an informed decision about which of several treatments should be evaluated further in a confirmatory study. Additional efficiency is gained by incorporating interim analyses and, in particular, seamless Phase II/III designs have been the focus of recent research. Common to much of this work is the constraint that selection and formal testing should be based on a single efficacy endpoint, despite the fact that in practice, safety considerations will often play a central role in determining selection decisions. Here, we develop a multi‐arm multi‐stage design for a trial with an efficacy endpoint and a safety endpoint. The safety endpoint is explicitly considered in the formulation of the problem, the selection of the experimental arm and hypothesis testing. The design extends group‐sequential ideas and considers the scenario where a minimal safety requirement is to be fulfilled and the treatment yielding the best combined safety and efficacy trade‐off satisfying this constraint is selected for further testing. The treatment with the best trade‐off is selected at the first interim analysis, while the whole trial is allowed to comprise J analyses. We show that the design controls the familywise error rate in the strong sense and illustrate the method through an example and simulation. We find that the design is robust to misspecification of the correlation between the endpoints and requires similar numbers of subjects to a trial based on efficacy alone for moderately correlated endpoints. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.


Introduction
Prior to undertaking a confirmatory Phase III clinical trial, there is often uncertainty about which treatment should be selected for evaluation from a number of candidates. Here, treatments could be different doses of the same drug or different combinations of multiple drugs. Uncertainty about which treatment to select often stems from the fact that early phase trials typically evaluate medicines in different populations, using different endpoints, to those that will be the focus of confirmatory studies. The current high failure rate of Phase III trials of around 50% [1] combined with their substantial cost [2] make selecting an appropriate treatment for evaluation in Phase III of paramount importance.
As an efficient solution to this problem, designs for seamless Phase II/III multi-arm clinical trials have been proposed, which compare several active treatments with a common control group. Phase II of the study is used to learn about all treatments. At the end of this first stage, one or more of the active treatments is selected and taken forward with control for evaluation in Phase III. Data accumulated across both stages of the trial are used to test whether the selected treatment(s) is(are) superior to control at the end of the study. The simultaneous comparison of several treatments means that expected sample sizes and durations of multi-arm trials are markedly smaller than the alternative of evaluating each treatment separately. For added efficiency, solutions that incorporate a series of interim analyses to allow early stopping either for efficacy or to drop ineffective treatments have recently received attention [3][4][5][6][7]. The approaches discussed in the literature to date can be characterized by two main differences. The first is the underlying statistical framework that either generalizes group sequential designs [8,9] to accommodate multiple treatment arms [3] or makes use of p-value combination rules within closed testing procedures [10]. The second difference is the way in which treatments are selected. In [3], for example, only the best performing treatment is selected at the first interim analysis and subsequently compared with control over multiple stages, while in [6], all treatments surpassing a threshold at each stage are continued. Meanwhile, Kelly et al. [11] advocate a rule that selects all treatments close to the best performing treatment at the first interim analysis.
A further commonality of several of the approaches discussed in the literature is the assumption of normally distributed data and the fact that a single endpoint is considered. However, there are exceptions. For results for non-normal endpoints, see, for example, [12][13][14]. More generally, adaptive procedures using p-value combination rules within closed testing procedures make no assumptions about the distribution of patient responses, nor do they place any constraints on the form of the treatment selection rule: the only constraint is that p-values for testing elementary and intersection null hypotheses must follow a Uniform(0,1) (or stochastically larger) distribution under the null [5]. For procedures that consider more than one endpoint, see [15], which describes a seamless Phase II/III trial using a composite rule based on two hierarchically ordered efficacy endpoints, as well as relevant safety data, to guide treatment selection decisions; to adjust for multiple testing, pairwise comparisons of selected treatments against placebo are adjusted using a Bonferroni correction. Early phase oncology trials also often assess efficacy and monitor toxicity; see [16] for a Phase I/II trial design combining time-to-response and time-to-toxicity endpoints into a single statistic used for interim decision making, weighting pairs of outcomes according to utilities elicited from experts. In other areas, such as mental health, co-primary efficacy endpoints are measured, and no single measure is accepted as definitive. Although it is sometimes sensible to combine different endpoints into a single test statistic, substantial gains in efficiency can be achieved if they are evaluated jointly, especially when endpoints capture the effects of a treatment on different aspects of the disease. Furthermore, combining information obtained on efficacy and safety endpoints into a single test statistic will be inappropriate because good efficacy will not compensate for poor safety in practice.
Methods for two-arm group-sequential trials with multivariate normal endpoints [17,18], two binary endpoints [19] and a mixture of time-to-event and nonfailure endpoints [20] have been developed. In this article, we develop a multi-arm multi-stage (MAMS) design for a trial with an efficacy endpoint and a safety endpoint. The novelty of the proposed design is that it is based on a joint model for the efficacy and safety outcomes, while information on both endpoints is incorporated into treatment selection decisions. We consider the situation where a minimal safety requirement is to be fulfilled and the treatment with the best combined safety and efficacy trade-off satisfying this constraint is selected for further testing. Selection is made at the first interim analysis, while the whole trial is allowed to comprise J analyses. Final decisions about a selected treatment are based on tests of efficacy and safety relative to control. In Section 2, we show that the design controls the familywise error rate (FWER) in the strong sense and discuss methods for sample size calculations. In Section 3, we illustrate the method through an example and simulations based on the Telmisartan and Insulin Resistance in HIV (TAILoR) study, a multi-arm trial of treatments to reduce insulin resistance in human immunodeficiency virus-positive patients. We conclude in Section 4 with a discussion of our findings and avenues for future research.

Statistical framework
We propose MAMS designs that begin in Stage 1 by comparing K active treatments with a common control group. The overall objective of the trial is to select the 'best' of the K treatments and then make comparisons with the control. Rather than be based solely on efficacy, treatment selection decisions will often reflect a compromise between the potential benefits and side effects of a new therapy. For example, a new treatment may need to demonstrate non-inferior safety and superior efficacy to represent a clinically meaningful advantage over a well-understood control. We propose designs that explicitly account for the impact of safety considerations on decision-making. Throughout, we restrict attention to the case where a single treatment is selected at the first analysis. We begin by focusing attention on a single-stage design and discuss the natural extension to multiple stages in Section 2.4.

Treatment selection rules
Suppose the trial proceeds in Stage 1 by measuring a bivariate endpoint on each patient. Labelling control as treatment 0, let $Y_{Eik}$ and $Y_{Sik}$ represent the efficacy and safety responses, respectively, of subject $i$ on treatment $k$, which can be modelled as
$$\begin{pmatrix} Y_{Eik} \\ Y_{Sik} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_{Ek} \\ \mu_{Sk} \end{pmatrix}, \begin{pmatrix} \sigma_E^2 & \rho\,\sigma_E\sigma_S \\ \rho\,\sigma_E\sigma_S & \sigma_S^2 \end{pmatrix} \right),$$
where $\rho$ is the within-subject correlation, and we assume that the variance-covariance matrix of responses is known. Let $\theta_{Ek} = \mu_{Ek} - \mu_{E0}$ and $\theta_{Sk} = \mu_{Sk} - \mu_{S0}$ measure the advantage of treatment $k$ over control for efficacy and safety, respectively, where we will assume that increases in response are desirable for both endpoints. Thus, $\boldsymbol{\theta} = (\boldsymbol{\theta}_E, \boldsymbol{\theta}_S)$ is a vector of length $2K$ containing the efficacy and safety effects of the $K$ treatments. For each treatment $k = 1, \dots, K$, we define two hypotheses $H_{Ek}: \theta_{Ek} \le 0$ and $H_{Sk}: \theta_{Sk} \le 0$. The null hypothesis we wish to test is $H_{0k}: H_{Ek} \cup H_{Sk}$, stating that treatment $k$ is either ineffective or unsafe in comparison with control; rejecting $H_{0k}$ implies that treatment $k$ is both effective and safe. The global null hypothesis $H_0: \bigcap_{k=1}^{K} H_{0k}$ represents the case that all $K$ treatments are either unsafe or ineffective. For ease of presentation, we consider tests of superiority, although Jennison and Turnbull [18] observe that it is straightforward to accommodate tests of non-inferiority in this framework by subtracting the non-inferiority margin (for a difference in means) from patient responses on the control treatment.
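As a rough illustration of the response model, the Python sketch below simulates correlated efficacy and safety responses for a control arm and one active arm and estimates the two treatment effects as differences in sample means. The function name `simulate_arm`, the sample size of 2000 per arm and the safety effect of 0.2 are assumptions made only for this illustration; the efficacy effect of 0.545 and the correlation of 0.4 are borrowed from the example later in the paper.

```python
import math
import random
from statistics import fmean


def simulate_arm(rng, n, mu_eff, mu_saf, sigma_eff, sigma_saf, rho):
    """Draw n bivariate normal (efficacy, safety) responses with correlation rho."""
    eff, saf = [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        # b is standard normal with correlation rho to a
        b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        eff.append(mu_eff + sigma_eff * a)
        saf.append(mu_saf + sigma_saf * b)
    return eff, saf


rng = random.Random(7)  # fixed seed so the sketch is reproducible
control_eff, control_saf = simulate_arm(rng, 2000, 0.0, 0.0, 1.0, 1.0, 0.4)
active_eff, active_saf = simulate_arm(rng, 2000, 0.545, 0.2, 1.0, 1.0, 0.4)

# Estimated advantages of the active arm over control on each endpoint
theta_eff_hat = fmean(active_eff) - fmean(control_eff)
theta_saf_hat = fmean(active_saf) - fmean(control_saf)
print(round(theta_eff_hat, 3), round(theta_saf_hat, 3))
```

With 2000 patients per arm, each estimate has a standard error of about 0.03, so the printed values should lie close to the simulated effects.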
For presentational purposes, we assume a common 1:1 allocation of patients to each of the $K$ active treatments and control and denote the number of patient responses available on each arm by $n$. Thus, at the end of Stage 1, for each $k = 1, \dots, K$, Fisher's information for $\theta_{Tk}$ takes a common value denoted by $\mathcal{I}_T = n/(2\sigma_T^2)$, for $T \in \{E, S\}$. In Appendix A.1 of the Supporting Information, we outline how the procedure could be extended to accommodate a common $r:1$ allocation of patients to active treatments and control. Define $\hat{\theta}_{Tk}$ as the maximum likelihood estimator of $\theta_{Tk}$. Accumulated data on each treatment are summarized by the bivariate score statistic $(Z_{Ek}, Z_{Sk})$, where $Z_{Tk} = \mathcal{I}_T \hat{\theta}_{Tk}$. Only treatments meeting a pre-specified minimum safety requirement may be considered for selection. Let $N_{\mathcal{A}}$ denote the number of treatments eligible for selection, which are indexed by the selection set $\mathcal{A} = \{k : Z_{Sk} > c\}$. If $N_{\mathcal{A}} = 0$, the test is stopped for futility without rejecting $H_0$. Otherwise, we select from $\mathcal{A}$ the treatment maximizing the objective function $O_k$, a weighted combination of the efficacy and safety statistics of treatment $k$, where $w_E$ and $w_S$ are pre-specified non-negative weights satisfying $w_E^2 + w_S^2 = 1$.
Unplanned deviations from the pre-specified treatment selection rule could lead to inflation of the FWER above the nominal level. One of the motivations of this design is, however, to formally include safety in the decision-making so that such deviations become less frequent. Should unexpected modification nevertheless be necessary, the conditional error principle [21] can be used to maintain FWER control. It is worth pointing out that we incorporate the safety threshold because the objective function allows good efficacy to compensate for poor safety. In practice, this would only be acceptable up to a certain point, which is defined by the safety threshold. A natural choice for this threshold in our opinion is $c = 0$, that is, we only select from treatments with comparable or better safety than control in Stage 1, although in principle other values could be used instead.
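The quantities in this paragraph can be sketched in a few lines of Python. The sketch assumes the usual normal-theory score statistic, information times the estimated effect, and works from arm-level sample means; the function names and all numerical values are hypothetical and chosen only for illustration.

```python
def information(n_per_arm, sigma):
    """Fisher information for an active-vs-control mean difference under 1:1 allocation."""
    return n_per_arm / (2.0 * sigma ** 2)


def score_statistics(arm_means, control_mean, n_per_arm, sigma):
    """Score statistic (information times estimated effect) for each active arm."""
    info = information(n_per_arm, sigma)
    return [info * (mean - control_mean) for mean in arm_means]


n = 95          # hypothetical number of patients per arm
c = 0.0         # minimum safety requirement on the score-statistic scale

# Hypothetical sample means for K = 4 active arms; control mean is 0 on both endpoints
z_eff = score_statistics([0.30, 0.55, 0.10, 0.45], 0.0, n, sigma=1.0)
z_saf = score_statistics([0.20, -0.05, 0.15, 0.05], 0.0, n, sigma=1.0)

# Selection set: arms (numbered from 1) whose safety statistic exceeds c
eligible = [k + 1 for k, z in enumerate(z_saf) if z > c]
print(eligible)
```

In this made-up data set, the second arm has the largest efficacy statistic but falls below the safety threshold, so it is excluded before the objective function is ever evaluated.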
Let $i^\star$ index the treatment selected from Stage 1 on the basis of the objective function, so that $O_{i^\star} = \max_{k \in \mathcal{A}} \{O_k\}$. Because the selected treatment will only be declared preferable to control if we can reject the null hypothesis $H_{0i^\star}$, a natural choice is to weight the two endpoints equally, $w_E = w_S = \sqrt{0.5}$, as this ensures consistency between selection decisions and the final analysis of the trial.
We propose single-stage tests of $H_{01}, \dots, H_{0K}$ with stopping rules of the form (3):
If $N_{\mathcal{A}} = 0$: stop and accept $H_0$.
Otherwise: select from $\mathcal{A}$ the treatment $i^\star$ maximizing the objective function $O$ and conduct the final analysis.
At the final analysis: if $Z_{Ei^\star} > u_E$ and $Z_{Si^\star} > u_S$, reject $H_{0i^\star}$; otherwise, accept $H_0$.
At the final analysis of the proposed test, superiority can only be claimed for the selected treatment $i^\star$. Consequently, we define the FWER of the procedure as $P\{\text{Reject } H_0 \text{ in favour of a false } H_{1i^\star}; \boldsymbol{\theta}\}$. This probability depends on both the minimum safety requirement, $c$, and the stopping boundaries $(u_E, u_S)$.
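The single-stage procedure can be written as a short decision function. The Python sketch below is illustrative only: it assumes that the objective function is a weighted sum of the standardized score statistics and that the null hypothesis for the selected arm is rejected when both of its statistics exceed their critical values. The function name and the input statistics are hypothetical; the information level and critical values are those reported for the equal-weights example later in the paper.

```python
import math


def single_stage_test(z_eff, z_saf, info_eff, info_saf,
                      w_eff, w_saf, u_eff, u_saf, c=0.0):
    """Return (index of the selected arm or None, decision) for one trial."""
    # Step 1: arms meeting the minimum safety requirement
    eligible = [k for k in range(len(z_saf)) if z_saf[k] > c]
    if not eligible:
        return None, "accept H0"

    # Step 2: select the eligible arm maximizing the (assumed) objective function
    def objective(k):
        return (w_eff * z_eff[k] / math.sqrt(info_eff)
                + w_saf * z_saf[k] / math.sqrt(info_saf))

    selected = max(eligible, key=objective)

    # Step 3: final analysis for the selected arm only
    if z_eff[selected] > u_eff and z_saf[selected] > u_saf:
        return selected, "reject H0 for the selected arm"
    return selected, "accept H0"


w = math.sqrt(0.5)  # equal weights satisfying w_E^2 + w_S^2 = 1
result = single_stage_test(
    z_eff=[14.0, 26.0, 5.0, 21.0], z_saf=[9.5, -2.4, 7.1, 16.0],
    info_eff=47.148, info_saf=47.148, w_eff=w, w_saf=w,
    u_eff=14.466, u_saf=14.466)
print(result)
```

Arms are indexed from zero in the sketch. The arm with the largest efficacy statistic (index 1) is ineligible on safety grounds, so the arm at index 3 is selected and both of its statistics clear the critical values.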
Our approach is to fix $c = 0$ and find the pair of critical values maintaining strong control of the FWER at level $\alpha$. This criterion stipulates that $P\{\text{Reject } H_0 \text{ in favour of a false } H_{1i^\star}; \boldsymbol{\theta}\} \le \alpha$ for all configurations of $\boldsymbol{\theta}$ with at least one $\theta_{Tk} \le 0$, for $T \in \{E, S\}$ and $k \in \{1, \dots, K\}$. If $H_{01}, \dots, H_{0K}$ are all true, a familywise error is made if the test terminates with rejection of $H_{0i^\star}$ whatever treatment is selected, and the FWER is given by
$$P\{N_{\mathcal{A}} \ge 1,\; Z_{Ei^\star} > u_E,\; Z_{Si^\star} > u_S;\; \boldsymbol{\theta}\}.$$
In the remainder of this section, we discuss how to find $(u_E, u_S)$ maintaining strong control of the FWER.

Specification of test boundaries
We propose choosing the boundaries of test (3) to ensure the FWER is controlled at level $\alpha$ as we approach the following two 'worst-case' limiting configurations of $\boldsymbol{\theta}$: (i) $\theta_{E1} = \dots = \theta_{EK} = 0$ with $\theta_{S1} = \dots = \theta_{SK} \to \infty$; and (ii) $\theta_{S1} = \dots = \theta_{SK} = 0$ with $\theta_{E1} = \dots = \theta_{EK} \to \infty$. We claim that specifying test boundaries according to this criterion ensures strong control of the FWER and support this claim using a combination of analytical arguments and simulation. This result agrees with the findings of [18] for the case that $K = 1$. Let $\Gamma$ represent a set indexing treatments with positive efficacy and safety effects. We begin by considering a subset of the null parameter space comprising configurations of $\boldsymbol{\theta}$ such that (a) for all treatments $k$ with $k \notin \Gamma$, $\theta_{Ek} \le 0$ and $\theta_{Sk} \le 0$; or (b) for all treatments $k$ with $k \notin \Gamma$, $\theta_{Ek} \ge 0$ and $\theta_{Sk} \le 0$; or (c) for all treatments $k$ with $k \notin \Gamma$, $\theta_{Ek} \le 0$ and $\theta_{Sk} \ge 0$.
Under the global null hypothesis, $\Gamma = \emptyset$, and the constraints on configurations defined previously correspond to assuming that effects of different treatments on the same endpoint have the same sign. We claim that the FWER of procedure (3) under configurations of $\boldsymbol{\theta}$ in this restricted global null parameter space is maximized under constellations with $\boldsymbol{\theta}_E = (\theta_E, \dots, \theta_E)$ and $\boldsymbol{\theta}_S = (\theta_S, \dots, \theta_S)$, and furthermore that local maxima of the FWER are attained in the limit as $\theta_S \to \infty$ and $\theta_E = 0$, and in the limit as $\theta_E \to \infty$ and $\theta_S = 0$. To prove these claims, we begin by assuming that all treatments are always eligible for selection and consider the configuration of $\boldsymbol{\theta}$ with $\boldsymbol{\theta}_E = (\theta_E, \dots, \theta_E)$ and $\boldsymbol{\theta}_S = (\theta_S, \dots, \theta_S)$. Then, letting some elements of $\boldsymbol{\theta}_S$ fall below $\theta_S$ decreases stochastically the distribution of $(Z_{Ei^\star}, Z_{Si^\star})$, as both statistics tend to take lower values on average. To explain this, note that since all treatments remain competitive for efficacy, treatments must perform well for $Z_S$ if they are to rank highly for the objective function $O$. Thus, selection decisions are, in effect, driven primarily by safety data, so that a treatment may beat its competitors on the basis of $O$ with lower values of $Z_E$. Letting some of the $\theta_{Sk}$ drop below $\theta_S$ also decreases stochastically the distribution of $Z_{Si^\star}$: the treatment associated with the largest element of $\boldsymbol{\theta}_S$ is now 'safest' by some margin, meaning that on average, lower values of $Z_S$ will be sufficient for it to beat the weaker competition to ensure selection. Similar arguments imply that keeping $\boldsymbol{\theta}_S$ fixed at $(\theta_S, \dots, \theta_S)$ and letting some elements of $\boldsymbol{\theta}_E$ fall below $\theta_E$ decreases stochastically the distributions of $Z_{Ei^\star}$ and $Z_{Si^\star}$.
On the other hand, simultaneously forcing elements of $\boldsymbol{\theta}_E$ below $\theta_E$ and elements of $\boldsymbol{\theta}_S$ below $\theta_S$ decreases stochastically the distribution of $(Z_{Ei^\star}, Z_{Si^\star})$: systematic differences between treatments imply that it is possible for a good safety profile to compensate for poor efficacy (and vice versa), resulting in lower average values of $Z_E$ and $Z_S$ for the selected treatment.
Letting $\boldsymbol{\theta}_E = (\theta_E, \dots, \theta_E)$ and $\boldsymbol{\theta}_S = (\theta_S, \dots, \theta_S)$, increasing $\theta_E$ or $\theta_S$ increases the probability of rejecting $H_0$. Thus, looking across the restricted global null parameter space, the probability of making a familywise error is maximized at the boundaries of the space, that is, in the limit as $\theta_S \to \infty$ and $\theta_E = 0$, and in the limit as $\theta_E \to \infty$ and $\theta_S = 0$. If for some treatment $k$, $\theta_{Ek}$ and $\theta_{Sk}$ are both positive so that $\Gamma \ne \emptyset$, this treatment will be more likely to be selected, in which case we cannot commit a familywise error and the FWER decreases. Therefore, controlling the FWER for all configurations of $\boldsymbol{\theta}$ in the restricted global null parameter space ensures the FWER is controlled over the wider null parameter space defined previously.
So far, we have considered the case that all treatments are always eligible for selection, in effect setting $c = -\infty$. However, we claim that for general values of $c$, local maxima of the FWER are attained in the limit as we approach the worst-case configurations of $\boldsymbol{\theta}$ identified previously. This is because under this requirement, the expected size of $\mathcal{A}$ is determined by $\boldsymbol{\theta}_S$. Therefore, increasing $\theta_S$ increases stochastically the distribution of $(Z_{Ei^\star}, Z_{Si^\star})$ as the average number of treatments from which we can select increases. In particular, as $\theta_S$ approaches $\infty$, $\mathcal{A}$ includes all $K$ treatments almost surely and the test rejects $H_0$ if $Z_{Ei^\star} > u_E$. The probability of falsely rejecting $H_0$ is then maximized for $\theta_E = 0$. Similarly, setting $\theta_S = 0$, the FWER reaches a second local maximum as $\theta_E \to \infty$. To see this, note that for $\theta_S = 0$, inclusion of treatments in the selection set is random, so that the probability of rejection is maximized for maximal effect on efficacy.
To complete our justification for designing test (3) to control the FWER under the 'worst-case' limiting configurations of $\boldsymbol{\theta}$, we go beyond the arguments stated previously to claim that this approach ensures strong control of the FWER. In particular, tests defined in this way will control the FWER for any configuration of $\boldsymbol{\theta}$ with $\boldsymbol{\theta}_E = (\theta_{E1}, \dots, \theta_{EK})$ and $\boldsymbol{\theta}_S = (\theta_{S1}, \dots, \theta_{SK})$, where, for each $i$, one of $\theta_{Ei}$ and $\theta_{Si}$ is zero and the other is $\infty$. While we cannot prove these claims analytically, we evaluate them via simulation in Section 3.2. Assuming for now that these claims do hold, it is appropriate to choose boundaries $(u_E, u_S)$ to satisfy criteria (4) and (5), one for each worst-case limiting configuration, because $\lim_{\theta_S \to \infty} P\{N_{\mathcal{A}} \ge 1; \boldsymbol{\theta}_S = (\theta_S, \dots, \theta_S)\} = 1$ and the probability that at least one treatment meets the minimum safety criterion does not depend on $\boldsymbol{\theta}_E$. As $\theta_S$ and $\theta_E$ approach $\infty$, the bivariate probabilities on the left hand sides (LHSs) of (4) and (5) converge to univariate probabilities. We find $(u_E, u_S)$ so that the limits of these marginal rejection probabilities are equal to the values required to ensure FWER control. Limits of rejection probabilities are found by integrating the limits of the marginal conditional densities of $Z_{Ei^\star}$ and $Z_{Si^\star}$ derived in Appendix A.2 of the Supporting Information. It is important to note that these marginal densities depend on the correlation coefficient $\rho$. So far, we have assumed that this parameter is known. In Section 3.3, we explore the robustness of attained FWERs to misspecification of $\rho$. Marginal densities of test statistics depend on variances only through the information levels $\mathcal{I}_E$ and $\mathcal{I}_S$. The effect of assuming a known variance has previously been investigated in similar settings [22], and the quantile substitution approach described in [8] has been shown to work well.

Sample size calculations
We wish to calculate the sample size needed for test (3) to attain a disjunctive power, that is, probability of rejecting at least one false null hypothesis [5,23], of $1 - \beta$ under the configuration of $\boldsymbol{\theta}$ with $\boldsymbol{\theta}_E = (\delta_0, \dots, \delta_0, \delta)$ and $\boldsymbol{\theta}_S = (\theta_S, \dots, \theta_S)$ with $\theta_S > 0$. We may approximate further by letting $\theta_S \to \infty$, which can be justified by the belief that a potentially unsafe treatment is unlikely to be included in the trial. In this case, a test's power simplifies to the limiting rejection probability (6), and limiting probabilities are found by integrating the limits of the marginal conditional density of $Z_{Ei^\star}$. Using the results of Appendix A.1 and following the workings of Appendix A.2.1 of the Supporting Information, we can show that limiting rejection probability (6) is given by expression (7), which involves a random variable $U \sim N(0, \rho\, w_E w_S + 0.5)$. For computational convenience, we proceed assuming that $\mathcal{I}_E = \mathcal{I}_S = \mathcal{I}_1$ and conduct a one-dimensional search to find the common information level $\mathcal{I}_1^\star$ for which rejection probability (7) equals $1 - \beta$; at each iteration of this search, boundaries for monitoring score statistics $Z_{Ei^\star}$ and $Z_{Si^\star}$ are updated to ensure strong control of the FWER at level $\alpha$ under the proposed information level. Because information level $\mathcal{I}_1^\star$ typically corresponds to requiring fractions of subjects, in practice, we propose rounding up the total sample size to $n^\star = 2\max\{\sigma_E^2 \mathcal{I}_1^\star, \sigma_S^2 \mathcal{I}_1^\star\}$ patients per treatment arm. The test is then applied with critical values calculated for information levels $\mathcal{I}_E = n^\star/(2\sigma_E^2)$ and $\mathcal{I}_S = n^\star/(2\sigma_S^2)$. If a procedure's power is monotone increasing in $\mathcal{I}_E$ and $\mathcal{I}_S$, this sample size criterion will be conservative in the sense that attained power will exceed $1 - \beta$. In Section 3.2, we use simulation to evaluate properties of tests designed according to the proposed sample size criterion.
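The final rounding step is simple arithmetic and can be sketched as follows. The function names are hypothetical; the information level 47.148 is the value reported for the equal-weights example later in the paper, and the second call uses a made-up safety standard deviation of 1.5 to show what happens when the two variances differ.

```python
import math


def patients_per_arm(info_star, sigma_eff, sigma_saf):
    """Round up 2 * max(sigma_E^2, sigma_S^2) * I* to a whole number of patients."""
    return math.ceil(2.0 * max(sigma_eff ** 2, sigma_saf ** 2) * info_star)


def attained_information(n_per_arm, sigma):
    """Information actually attained for one endpoint: n / (2 sigma^2)."""
    return n_per_arm / (2.0 * sigma ** 2)


# Standardized endpoints (both standard deviations equal to 1)
n_equal = patients_per_arm(47.148, 1.0, 1.0)
print(n_equal, attained_information(n_equal, 1.0))

# Hypothetical case where safety is more variable than efficacy
n_unequal = patients_per_arm(47.148, 1.0, 1.5)
print(n_unequal,
      attained_information(n_unequal, 1.0),
      attained_information(n_unequal, 1.5))
```

Because the sample size is driven by the more variable endpoint, both attained information levels are at least as large as the target, which is why the criterion is conservative whenever power increases with information.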

Beyond single-stage designs
It is straightforward to extend our approach to find designs maintaining control of the FWER when multiple interim analyses are planned. Let $Z_{Tk,j}$ denote the score statistic at interim analysis $j$ for endpoint $T$ on treatment $k$. A multi-stage test of $H_{01}, \dots, H_{0K}$ has a stopping rule of the form (8). At the end of stage 1, if $Z_{S1,1}, \dots, Z_{SK,1} \le c$, the test stops and accepts $H_0$; otherwise, treatment $i^\star$ is selected and its statistics $Z_{Ei^\star,j}$ and $Z_{Si^\star,j}$ are monitored against lower and upper boundaries $(l_{Ej}, u_{Ej})$ and $(l_{Sj}, u_{Sj})$ at analyses $j = 1, \dots, J$. Multi-stage tests are defined with binding futility rules so that if either $Z_{Ei^\star,j} \le l_{Ej}$ or $Z_{Si^\star,j} \le l_{Sj}$, the procedure must stop immediately at interim analysis $j$ without declaring treatment $i^\star$ safe and effective. Criteria (4) and (5) imply that we can uncouple the searches needed to find critical values for monitoring efficacy and safety score statistics. Furthermore, for $T \in \{E, S\}$, increments $Z_{Ti^\star,2} - Z_{Ti^\star,1}, \dots, Z_{Ti^\star,J} - Z_{Ti^\star,J-1}$ are independent and follow the same distribution as increments in score statistics generated by a univariate group sequential test (GST) without selection [3]. Thus, we can find $(l_{E1}, u_{E1}), \dots, (l_{EJ}, u_{EJ})$ as the boundaries defining a one-sided univariate GST monitoring $\{Z_{Ei^\star,1}, \dots, Z_{Ei^\star,J}\}$ with limiting conditional type I error rate $\alpha$ given $N_{\mathcal{A}} \ge 1$ under $\theta_E = 0$ and letting $\theta_S \to \infty$. Following [3], we propose that an alpha-spending approach [24] be used to find the upper and lower boundaries at each stage $j = 1, \dots, J$, spending cumulative error probabilities $f_U(t_j)$ and $f_L(t_j)$, where $t_j$ is the fraction of $\mathcal{I}_{EJ}$, the maximum information level for the efficacy treatment effect, accumulated by stage $j$, and $f_U$ and $f_L$ are monotone increasing functions satisfying $f_U(0) = f_L(0) = 0$ and, for $t \ge 1$, $f_U(t) = \alpha$ and $f_L(t) = 1 - \alpha$. A similar process can be used to find the boundaries $(l_{S1}, u_{S1}), \dots, (l_{SJ}, u_{SJ})$ for monitoring $\{Z_{Si^\star,1}, \dots, Z_{Si^\star,J}\}$.
Safety boundaries are determined using $f_L$ and $f_U$ to spend error probabilities as a function of the observed information for $\theta_{E,i^\star}$; this ensures that $u_{EJ} = l_{EJ}$ and $u_{SJ} = l_{SJ}$, so that procedure (8) terminates properly at analysis $J$ with a final hypothesis decision for any choice of $\mathcal{I}_{EJ}$ even when the variances of the efficacy and safety endpoints differ. To find the required sample size, we follow Section 2.3 and search for the maximum information level $\mathcal{I}_{EJ}$ for which the test has power $1 - \beta$ according to criterion (6) under an anticipated information sequence $\mathcal{I}_{E1}, \dots, \mathcal{I}_{EJ}$, setting each $\mathcal{I}_{Sj} = \sigma_E^2 \mathcal{I}_{Ej}/\sigma_S^2$ to account for differences between the rates at which information on safety and efficacy effects accumulate. The test will recruit up to $n^\star = 2\sigma_E^2 \mathcal{I}_{EJ}$ patients on the selected treatment and control in the absence of early stopping.
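To make the error-spending bookkeeping concrete, the Python sketch below tabulates the cumulative error spent at each analysis and the implied safety information levels. The power-family spending function, the three-stage information sequence and the standard deviations are all assumptions chosen for illustration; the source does not state which spending functions were used. Computing the boundaries themselves requires numerical integration and is not attempted here.

```python
def spend(total, t, power=2.0):
    """Cumulative error spent at information fraction t (assumed power family)."""
    t = min(max(t, 0.0), 1.0)   # no further spending beyond t = 1
    return total * t ** power


alpha = 0.05
sigma_eff, sigma_saf = 1.0, 1.5            # hypothetical standard deviations
info_eff = [15.716, 31.432, 47.148]        # hypothetical efficacy information, J = 3

# Fraction of the maximum efficacy information available at each analysis
fractions = [info / info_eff[-1] for info in info_eff]

upper_spent = [spend(alpha, t) for t in fractions]          # f_U(t_j)
lower_spent = [spend(1.0 - alpha, t) for t in fractions]    # f_L(t_j)

# Safety information accumulates at a different rate when variances differ
info_saf = [sigma_eff ** 2 * info / sigma_saf ** 2 for info in info_eff]

# Maximum number of patients on the selected arm and on control
n_max = 2.0 * sigma_eff ** 2 * info_eff[-1]

for j, t in enumerate(fractions, start=1):
    print(j, round(t, 3), round(upper_spent[j - 1], 4),
          round(lower_spent[j - 1], 4), round(info_saf[j - 1], 3))
```

At the final analysis the two spending functions together account for probability one under the null configuration, which is what forces the upper and lower boundaries to meet at analysis J.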

Example
In this section, we will examine the operating characteristics of the proposed designs through a series of examples motivated by the TAILoR study, a MAMS trial comparing several doses of telmisartan with control for the reduction of insulin resistance in human immunodeficiency virus-positive patients receiving combination antiretroviral therapy [6]. The study, which is currently ongoing, uses the change in the Homeostatic Model Assessment-Insulin Resistance (HOMA-IR) index between baseline and 24 weeks as the efficacy endpoint.
In this section, we imagine how the TAILoR study might have been designed as a single-stage procedure of the form shown in (3), using the methodology described in this paper to incorporate a safety endpoint in addition to the efficacy endpoint used in the ongoing study. A plausible safety endpoint is change in systolic blood pressure from baseline because telmisartan is licensed for the treatment of hypertension. An excessive drop in blood pressure for patients without hypertension would be considered an undesirable safety risk. With the exception of this modification, we will assume the design parameters of the original TAILoR study. We therefore stipulate an FWER of 0.05 and seek designs randomizing patients equally across treatment arms with power 0.9 to correctly reject one false null hypothesis. When the TAILoR study was first designed, four doses of telmisartan were planned. For consistency with previous publications [22,25], we consider the scenario that $K = 4$ active treatments are to be compared with control, despite the ongoing study using three doses because of last-minute changes to the study. The standardized desirable effect for efficacy, $\delta$, used for sample size calculations is set as 0.545, and the minimum clinically important difference, $\delta_0$, is defined as 0.178. Under the assumption that all treatments are truly safe, we do not require specification of an effect on safety when using (7) for sample size calculations. However, if such an assumption is undesirable, Equation (6) in Appendix A.1 of the Supporting Information can be used with the anticipated safety effect. Boundary calculations and sample size determinations require us to numerically evaluate multi-dimensional integrals. For this purpose, we used the R package cubature [26] and verified solutions for the obtained boundaries using simulations with 100 000 replicates. Figure 1 shows how the required information per arm and safety/efficacy stopping boundaries vary as the weight $w_E$ changes.
The within-subject correlation of efficacy and safety responses is assumed to be 0.4. The information required is largest when selection of the treatment is based only on the safety endpoint (w E = 0), while it decreases as the weight on efficacy increases. Similarly, both the efficacy and safety boundaries decrease as the required information decreases, as expected. There is, however, an apparent difference between the efficacy and safety boundary, depending on the weight given to each endpoint. For small weights on efficacy, the efficacy boundary is smaller than the safety boundary, while this pattern reverses once more weight is attributed to efficacy for selection. For equal weights, the boundaries for efficacy and safety are identical.

Error rates
In this section, we illustrate properties of tests of the form (3) designed and conducted with equal weights $w_E = w_S$ and correlation coefficient $\rho = 0.4$. Under this setting, the information required per arm is $\mathcal{I}_1^\star = 47.148$ ($n = 94.296$), and the stopping boundaries are $u_E = u_S = 14.466$. Empirical error rates based on 10 000 simulation runs for each point on a grid of parameters are shown in Figure 2 for cases where all treatments have the same pair of effects $(\theta_E, \theta_S)$ versus control. The left-hand panel shows that the FWER of the design is maximized if one of the effects is at the boundary of the null space and the other is large. As expected, the power of the design increases as at least one of the effects increases. Figure 3 provides empirical FWERs for parameter configurations of the form $\boldsymbol{\theta}_E = (\theta_{E1}, \dots, \theta_{EK})$, $\boldsymbol{\theta}_S = (\theta_{S1}, \dots, \theta_{SK})$, where one parameter of each pair $(\theta_{Ei}, \theta_{Si})$ is large and the other zero, to evaluate the conjecture made in Section 2.2 that designs will control the FWER under these configurations at the nominal level $\alpha$. For the purpose of this evaluation, the large effect was set to 1 million, and simulations with 100 000 replicates are used. From the graph, it can be seen that the FWER is well controlled for any parameter configuration, as conjectured. Figure 4 shows how the power of the procedure changes as the safety of the experimental treatments changes. Results are presented for one to four treatments exhibiting the desired effect for efficacy of 0.545, while the remaining treatments have the minimum clinically important effect of 0.178. Power increases as the safety of the treatments increases and reaches the desired level of 0.9 for a safety effect of around 0.5. Power also increases as the number of treatments with the desired efficacy increases, although this increase diminishes somewhat with the number of efficacious treatments.
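A simulation of this kind can be sketched in Python using arm-level summaries rather than patient-level data. The sketch is not the authors' code: it assumes standardized endpoints with equal information on both, an objective function equal to the weighted sum of standardized score statistics, rejection when both statistics of the selected arm exceed their critical values, and a safety threshold of zero. The function names, seed and number of replicates are arbitrary choices.

```python
import math
import random


def draw_noise(rng, rho):
    """Correlated N(0, 1/2) noise for the efficacy and safety means of one arm."""
    a = rng.gauss(0.0, 1.0)
    b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    scale = math.sqrt(0.5)
    return scale * a, scale * b


def rejection_rate(theta_eff, theta_saf, info, rho, u_eff, u_saf,
                   w_eff, w_saf, c=0.0, n_sims=20000, seed=2015):
    """Monte Carlo estimate of the probability that the test rejects H0."""
    rng = random.Random(seed)
    root = math.sqrt(info)
    rejections = 0
    for _ in range(n_sims):
        a0, b0 = draw_noise(rng, rho)          # shared control arm
        best, best_obj = None, -math.inf
        for te, ts in zip(theta_eff, theta_saf):
            a, b = draw_noise(rng, rho)
            z_e = info * te + root * (a - a0)  # score statistic ~ N(info*theta, info)
            z_s = info * ts + root * (b - b0)
            if z_s <= c:                       # fails minimum safety requirement
                continue
            obj = (w_eff * z_e + w_saf * z_s) / root
            if obj > best_obj:
                best, best_obj = (z_e, z_s), obj
        if best is not None and best[0] > u_eff and best[1] > u_saf:
            rejections += 1
    return rejections / n_sims


w = math.sqrt(0.5)
design = dict(info=47.148, rho=0.4, u_eff=14.466, u_saf=14.466, w_eff=w, w_saf=w)

# Worst-case configuration: no efficacy benefit, very large safety benefit
fwer_worst = rejection_rate([0.0] * 4, [1e6] * 4, **design)
# Global null with no effect on either endpoint
fwer_null = rejection_rate([0.0] * 4, [0.0] * 4, **design)
print(fwer_worst, fwer_null)
```

Under these assumptions, the estimate for the worst-case configuration should lie close to the nominal 0.05, with the global-null estimate well below it. Passing a value of `rho` different from 0.4 while keeping the boundaries fixed gives a rough way to explore sensitivity to a misspecified correlation.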

Misspecification of ρ
When specifying our model, we have so far assumed that response variances and their correlation are known. In this section, we will investigate the robustness of our design to the assumption of known correlation. Figure 5 shows the simulated FWER based on 100 000 simulations of tests as the correlation between endpoints varies. Six different true parameter constellations are considered, namely, the global null hypothesis and five 'worst-case' configurations (once again using 1 million instead of infinity for simulation purposes). For all six settings, the FWER is controlled at the design value $\rho = 0.4$, and the procedure is conservative for all correlations below this. Under the global null hypothesis, only perfect correlation results in an inflation of the FWER, while it is inflated once the true correlation is above the design value for the worst-case configurations. The maximum inflation, achieved under perfect positive correlation, is, however, small at 10% of the nominal value of the FWER.
Tamhane et al. [27] observe that typically either the correlation is assumed to be known (as performed here) or a correlation of one is treated as the worst-case scenario. Given the relative conservatism of the proposed procedure at reasonable values of the parameters, we believe the former is sufficient, although the latter would certainly also be possible. A more elegant solution given in [27] overcomes this problem by estimating the correlation mid-study and uses an approach due to Berger & Boos [28] to obtain an upper bound for the FWER accounting for the sampling error of the sample correlation coefficient.

Discussion
In this paper, we have presented an approach for designing MAMS studies based on a joint model for efficacy and safety data, which considers both endpoints when selecting the most promising treatment for further investigation and tests the efficacy and safety of the selected treatment relative to control. The main challenge with obtaining the relevant distributions of the test statistics arose from the requirement to select from treatments satisfying a minimum safety requirement. We have shown that the FWER is strongly controlled under the assumption that effects of different treatments for the same endpoint have the same sign. Our simulation results show, however, that strong control of the FWER also appears to hold when this assumption is not made.
In the presentation and derivations, we have made a number of assumptions that may not be appropriate for specific settings. For example, single-stage designs are formulated assuming patient responses follow a bivariate normal distribution with a common correlation between efficacy and safety responses across the active and control treatments. In addition, calculations assume that at the end of Stage 1, there is a common information level for $\theta_{E,1}, \dots, \theta_{E,K}$ and a common information level for $\theta_{S,1}, \dots, \theta_{S,K}$. This joint distribution will not in general apply if data do not follow a normal distribution because information levels and correlation coefficients may depend on unknown parameters, such as response rates in the case of binary data (see section 9 of [27]). One potential solution would be to approximate and derive test boundaries setting the correlation coefficient and information levels to the values that would apply under $\boldsymbol{\theta}_E = \boldsymbol{\theta}_S = 0$. However, further simulations would be needed to verify whether this approach would maintain strong control of the FWER at a level close to the nominal value.
Another simplifying assumption we have made is to propose designs setting the safety threshold to be zero, so that only treatments with better safety than control can be selected. A simple shift of the safety test statistic can be used to allow for different thresholds. Similarly, it may be desirable to select treatments only based on efficacy provided that the treatment is safe enough. Simply setting the weight on safety within the objective function to zero can accommodate this situation. Finally, as outlined before, it may not be appropriate to test for superiority in terms of safety over control. Shifting the respective test statistics for safety will allow non-inferiority hypotheses to be used instead. It is also easy to envisage application of this design in other settings, such as mental health trials, where there are co-primary efficacy endpoints. In these cases, no minimal threshold would be applied to either efficacy endpoint; the ideas of this work apply to this, somewhat simpler, situation by setting the threshold $c = -\infty$.
One great benefit of multi-stage clinical trials is their reduced expected sample size compared with single-stage designs. Such gains can, however, only be realized if the primary endpoint is observed quickly relative to the recruitment time [29]. When this is not the case, it would be of interest to investigate whether methods such as the one described in [30] can be extended to make selection decisions based on intermediate endpoints in the setting discussed in this paper.
Another area for further work concerns how to calculate confidence intervals on termination of tests of the form (3). The procedure based around hypothesis testing described here allows almost formulaic decisions about the superiority of experimental treatments over control. It is essential, however, that confidence intervals should also be available to inform decision makers about the probable sizes of any efficacy and safety benefits, in order to give a complete description of the evidence supporting a selected treatment. Future work will be necessary to evaluate whether related work [14,31] can be utilized to obtain interval estimates as well.