A comparison of Bayesian adaptive randomization and multi‐stage designs for multi‐arm clinical trials

When several experimental treatments are available for testing, multi‐arm trials provide gains in efficiency over separate trials. Including interim analyses allows the investigator to effectively use the data gathered during the trial. Bayesian adaptive randomization (AR) and multi‐arm multi‐stage (MAMS) designs are two distinct methods that use patient outcomes to improve the efficiency and ethics of the trial. AR allocates a greater proportion of future patients to treatments that have performed well; MAMS designs use pre‐specified stopping boundaries to determine whether experimental treatments should be dropped. There is little consensus on which method is more suitable for clinical trials, and so in this paper, we compare the two under several simulation scenarios and in the context of a real multi‐arm phase II breast cancer trial. We compare the methods in terms of their efficiency and ethical properties. We also consider the practical problem of a delay between recruitment of patients and assessment of their treatment response. Both methods are more efficient and ethical than a multi‐arm trial without interim analyses. Delay between recruitment and response assessment attenuates this efficiency gain. We also consider futility stopping rules for response adaptive trials that add efficiency when all treatments are ineffective. Our comparisons show that AR is more efficient than MAMS designs when there is an effective experimental treatment, whereas if none of the experimental treatments is effective, then MAMS designs slightly outperform AR. © 2014 The Authors Statistics in Medicine Published by John Wiley & Sons, Ltd.


Introduction
In many disease indications, there are multiple drugs in the same phase of clinical development. Specifically, in oncology, there are over 1500 cancer therapeutics in the clinical pipeline [1]. For several malignancies, the number of putative treatments has rapidly increased because of development of molecularly targeted agents [2]. In such a context, efficient alternatives to traditional single-arm and two-arm trials are needed to maximize the number of treatments tested considering the limited number of patients and resources that are available for trials [3].
Multi-arm trials, in which several new drugs are compared with a control treatment within a single trial, are a class of designs with the potential to greatly increase the efficiency of the drug development process. There is a growing literature on multi-arm trials, with new trial designs investigated [3][4][5], compared with traditional designs [6,7] and applied in practice [8][9][10][11]. The main advantage of a multi-arm trial is that the shared control group means fewer patients are required compared with conducting separate two-arm trials of the different experimental drugs. There are also other advantages. First, patient recruitment to clinical trials can often be increased if there is a greater chance of being allocated a new treatment [12]. Second, the overall administrative burden in setting up and running a multi-arm trial can be lower than running several separate trials [13]. Third, in early phase cancer trials, most treatments fail to show superiority [14], and it is attractive to concentrate patient allocation to the most promising arms.
As patients are randomized and treated during a multi-arm trial, a considerable amount of information will be gathered on the effect of different treatments. Several designs have been proposed to use this information in order to focus resources on the most promising arms. Two of the most commonly proposed approaches to achieve this goal are response-adaptive randomized designs and multi-arm multi-stage (MAMS) designs.
Adaptive randomization (AR) allows changes to be made to the randomization probabilities during the trial. In most cases, the aim of the procedure is to allocate a greater proportion of patients to treatments that have so far demonstrated evidence of better performance than the other arms. A number of papers have discussed AR (see, for example, [15][16][17][18][19]). Here, we will consider Bayesian AR procedures. Such procedures are characterized by time-varying randomization probabilities that are defined and recursively updated through a Bayesian model for patients' outcomes. Hu and Rosenberger [20] have also developed and investigated alternative non-Bayesian adaptive procedures for multi-arm trials, with similar goals.
In the case of a single experimental treatment and a control arm, AR can be less efficient than a balanced randomized trial when the design power is considered but can also result in patients on the trial having a better average treatment response. A paper by Coad and Rosenberger [21] compared a form of AR to group-sequential designs in this situation and found the latter to be more efficient. AR is thus often argued by its proponents to be more ethical and by its detractors to be inefficient. This trade-off only applies to the two-arm setting. In the case of multi-arm trials, it has been shown that applying AR to the experimental treatments, but protecting the allocation to the control group, is more efficient than balanced randomization in many situations [7].
Multi-arm multi-stage designs, like AR, use interim data. Instead of changing the allocation proportions, stopping thresholds are set on test statistics, with arms that show insufficient evidence of effectiveness dropped. The allocation to each remaining arm is fixed in MAMS trials, although assigning more patients to the control group than to the experimental arms can provide small gains in efficiency over balanced randomization [22], and has been used in practice [13].
Although both methods have been proposed to improve efficiency compared with a multi-arm trial without interim analyses, no systematic comparison of the two methods has been undertaken. These two approaches, and the investigation of several anticancer treatments within single studies, have been identified in the literature as relevant strategies for streamlining the development of new drugs. The major aim of this paper is to rigorously compare the two methods and to make recommendations about their relative advantages and disadvantages.
We designed the study so that the overall complexity of the trial is identical under the two approaches; in particular, we compare MAMS and AR designs with identical numbers of interim analyses. The literature reports several trade-offs between the complexity of specific designs, for example, the number of interim analyses, and the resulting efficiency and operating characteristics. Higher complexity is, in most cases, associated with higher costs, such as the infrastructure required for updating randomization probabilities and coordinating patient assignment in multi-center trials. We therefore keep the trial complexity at identical levels when comparing MAMS and AR trials.
We focus on a binary outcome with a short delay between recruitment and outcome observation. This is a scenario motivated by adjuvant phase II breast cancer trials, where the primary outcome is pathological complete response (pCR), a binary outcome observed after a chemotherapy regimen, but before surgery. We discuss simulations of AR and MAMS trials under scenarios that mimic a recent multi-arm trial in breast cancer described in Gianni et al. [9]. We also consider the effect of the recruitment rate when there is a delay between recruitment and observation of the treatment outcome. An advantage in using this example for illustrating the relative merits of the two competing approaches, AR and MAMS designs, is the possibility of constructing realistic scenarios based on historical data.
The literature on Bayesian clinical trials includes relevant discussions on whether the final decisions at completion of the trial should be based on Bayesian or frequentist arguments [23]. Berry [24] has also considered some intermediate solutions. Here, we do not discuss these alternatives. Instead, we assume that the final recommendations, for each of the tested agents, have to be constrained by standard frequentist criteria. That is, either the type I error probability of each null hypothesis or the overall familywise error rate (FWER) is controlled at a pre-specified α level. Our choice should not necessarily be interpreted as a preference for the frequentist approach, which is often incompatible with the foundational likelihood principle [25] in multi-stage and adaptive experiments. This choice has two main motivations.
First, anchoring a comparative study to a single approach for the final analyses, at completion of the trial, is necessary to obtain easy to interpret results. Second, in most clinical trials, some of the stakeholders, including in several cases regulators such as the Food and Drug Administration, require final decisions based on frequentist criteria.

Multi-arm multi-stage designs for binary outcomes
Methodology for MAMS trials has mainly focused on situations where the primary endpoint is assumed normally distributed [5,26] or time-to-event [27]. In this section, we describe the procedure we used to design MAMS trials with binary outcomes.
We consider a J-stage trial with K experimental arms and a control arm. The response probability of the k-th arm is p_k. We label the control arm and the K experimental arms as k = 0 and k = 1, …, K, respectively. The trial is designed to test K null hypotheses H_0^(1), H_0^(2), …, H_0^(K), where the k-th null hypothesis is H_0^(k): p_k ≤ p_0. At each stage, n patients are recruited to the control arm, and n patients are recruited to each experimental arm that has not been previously dropped. The value of n is called the group size. We define the number of responses and the number of patients recruited to arm k by the completion of the j-th stage as Y_jk and n_jk, respectively. As suggested in Jennison and Turnbull [28], a suitable test statistic for contrasting arm k ≠ 0 with the control is

Z_jk = (Y_jk/n_jk − Y_j0/n_j0) / √{ p̄_jk (1 − p̄_jk) (1/n_jk + 1/n_j0) },   (1)

where p̄_jk = (Y_jk + Y_j0)/(n_jk + n_j0) is the pooled response-rate estimate. Asymptotically, under the hypothesis of no treatment effect, the test statistic in (1) is distributed as a standard normal random variable.
We specify futility boundaries for each stage, f = (f_1, …, f_J). At stage j < J, if Z_jk ≤ f_j, then arm k is dropped for futility. Once experimental arm k is dropped for futility, no further patients are recruited to it, and the null hypothesis H_0^(k) is not rejected. If all experimental arms are dropped for futility, the trial terminates. Otherwise, the remaining arms continue to stage j + 1. At stage J, if the test statistic Z_Jk is above f_J, then H_0^(k) is rejected. In the case of a single experimental treatment and a control arm, that is, K = 1, Jennison and Turnbull [28] provided analytical formulae and asymptotic results for the probability of rejecting the null hypothesis. The formulae involve recursive summation, over all possible pairs of the response frequencies on the two arms at each interim analysis, taking into account the futility boundaries. If the type I error rate and power of each null hypothesis are of interest, then this method can be directly used in the multi-arm case. In our study, we also consider the control of multiple-comparison operating characteristics such as the FWER. Because of the shared control group, the calculation of such quantities requires recursive summation over all possible (K + 1)-tuples of the numbers of responses on each arm. This quickly becomes computationally demanding as K increases.
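As an illustration, the test statistic in (1) and the stage-wise futility check can be sketched as follows; the standardization shown is the standard pooled two-proportion z-statistic, which we assume matches the form suggested by Jennison and Turnbull:

```python
import math

def z_statistic(y_k, n_k, y_0, n_0):
    """Pooled two-proportion z-statistic contrasting arm k with the control
    (assumed form of the statistic in (1))."""
    p_hat_k, p_hat_0 = y_k / n_k, y_0 / n_0
    p_bar = (y_k + y_0) / (n_k + n_0)          # pooled response rate
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n_k + 1 / n_0))
    return (p_hat_k - p_hat_0) / se

def surviving_arms(z_stats, f_j):
    """Arms with Z_jk > f_j continue; arms with Z_jk <= f_j are dropped."""
    return [k for k, z in enumerate(z_stats) if z > f_j]
```

For example, `z_statistic(15, 20, 10, 20)` is positive, and `surviving_arms([0.4, -1.2, 2.1], 0.0)` returns `[0, 2]`.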
For this reason, we use a Monte Carlo procedure, similar to the one used in Wason and Jaki [5], to find a design that has pre-specified operating characteristics. This procedure involves simulating a large number of independent arrays of random variables. Unless otherwise stated, we use 250,000 arrays. Each one is a J × (K + 1) matrix of independent uniform variables, U_jk ~ U(0, 1). For any fixed value of (p_0, p_1, …, p_K), the random variable U_jk can be transformed into the number of responses from the set of patients recruited to the k-th treatment at the j-th stage by applying the quantile function of the binomial distribution. This simulates the number of responses that would be observed at each stage if the treatment is not dropped. For each simulation replicate, given a futility boundary f, one can determine which (if any) null hypotheses are rejected.
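A minimal sketch of this simulation procedure follows, assuming the pooled two-proportion form of the test statistic in (1); the uniform-to-binomial transformation via the quantile function is the device that lets the same uniform arrays be reused while searching over candidate boundaries:

```python
import numpy as np
from scipy.stats import binom

def simulate_mams(p, f, n, reps=20_000, seed=0):
    """Monte Carlo estimate of per-hypothesis rejection probabilities for a
    J-stage MAMS design. p = (p_0, ..., p_K) response probabilities,
    f = (f_1, ..., f_J) futility boundaries, n = group size per arm per stage."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, float)
    K, J = len(p) - 1, len(f)
    reject = np.zeros(K)
    for _ in range(reps):
        u = rng.uniform(size=(J, K + 1))
        y = binom.ppf(u, n, p)               # responses per arm, per stage
        cum = np.cumsum(y, axis=0)           # cumulative responses
        active = np.ones(K, dtype=bool)
        for j in range(J):
            m = n * (j + 1)                  # patients per surviving arm so far
            p_bar = (cum[j, 1:] + cum[j, 0]) / (2 * m)
            with np.errstate(divide="ignore", invalid="ignore"):
                se = np.sqrt(p_bar * (1 - p_bar) * 2 / m)
                z = np.where(se > 0, (cum[j, 1:] - cum[j, 0]) / (m * se), 0.0)
            if j < J - 1:
                active &= z > f[j]           # drop arms at or below f_j
                if not active.any():
                    break
            else:
                reject += active & (z > f[j])
    return reject / reps
```

With one effective arm, e.g. `simulate_mams(p=(0.2, 0.4, 0.2), f=(0.0, 1.6449), n=40)`, the rejection probability for the effective arm is much larger than for the ineffective one.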
Throughout the paper, unless otherwise stated, we define the power under the least favorable configuration (LFC) [29]. The LFC is the configuration where one experimental arm, without loss of generality treatment 1, has response probability equal to a clinically relevant value p_1, and all other arms have some uninteresting treatment effect, which we set equal to p_0 < p_1, on the basis of historical data or investigators' expectations.
We now describe how to select futility boundaries that control the FWER, which is the probability of rejecting any true null hypothesis. The FWER is strongly controlled at level α if the maximum probability of rejecting a true null hypothesis, over (p_0, p_1, …, p_K) ∈ [0, 1]^(K+1), is less than or equal to α. Magirr, Jaki, and Whitehead [26] showed that, in the case of normally distributed outcomes, the maximum probability of rejecting a true null hypothesis is attained when all of the experimental treatments have the same effect as the control treatment. The proof also applies when the outcomes are binary. When p_0 = p_1 = … = p_K, the probability of rejecting a null hypothesis H_0^(k), on the basis of preselected thresholds f and the summary statistics in (1), depends on the common response probability. For small samples and fixed boundaries f, the type I error rate can vary considerably with the value of the common response probability. Jennison and Turnbull [28], in a similar context, suggested evaluating the type I error rate at a number of plausible values of p_0 and using the maximum to bound the type I error rate. We adopt this approach for finding MAMS designs that control the FWER.
The procedure for choosing the boundaries (f_1, …, f_J) is based on the triangular test (refer to [30] and [31]). This approach has been compared with several other types of boundaries and found to perform well, in terms of the expected number of patients recruited, for group-sequential and MAMS designs [5,32].
For a MAMS design, the boundaries can be adjusted so that the FWER is controlled. Given any approach for setting early stopping boundaries in the two-arm setting, say the triangular method or alternative methods based on power prediction [33], one can tune the parameters of the procedure so that the desired MAMS operating characteristic (in our case, the FWER) is matched. These procedures define two-arm designs with futility boundaries f on the basis of (i) a desired significance level α and (ii) a targeted power β at a reference scenario (p_0, p_1). With several experimental arms, we consider a continuum of boundaries obtained by varying only the significance level, and we select the boundary whose maximum FWER equals the pre-specified target, where the maximum is taken over a grid of values for p_0. That is, we plug in the largest significance level α that allows us to control the FWER at the desired level.
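The tuning step, finding the largest per-comparison significance level whose boundaries give the target FWER, is in essence a one-dimensional root search. A generic bisection sketch, assuming the estimated max-over-p_0 FWER is monotone increasing in α:

```python
def calibrate_alpha(fwer_at, target, lo=0.001, hi=0.5, tol=1e-4):
    """Bisection for the largest significance level alpha such that the
    (estimated, max-over-p0) FWER of the induced boundaries is <= target.
    `fwer_at` maps alpha to that FWER; it would wrap a full Monte Carlo
    evaluation of the MAMS design and is left abstract here."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fwer_at(mid) <= target:
            lo = mid          # boundaries still conservative enough
        else:
            hi = mid
    return lo
```

With the toy mapping `fwer_at = lambda a: 2 * a`, the search returns approximately 0.1 for a target of 0.2.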
In some cases, it is not of interest to control the FWER or, more generally, to correct for multiple testing. Whether the FWER is relevant in multi-arm trials has been discussed in the literature (refer to [6] and [22]). In our comparative study, we also consider settings in which the investigator only controls, for each hypothesis, the type I error rate at some α level, without constraints on the FWER.

Adaptive randomization for binary outcomes
We use a Bayesian adaptive design with overall sample size N. The hypotheses to be tested, H_0^(1), …, H_0^(K), are the same as in the previous section. The design includes J − 1 interim analyses. At each interim analysis, we update the Bayesian model used for setting the randomization probabilities. The j-th interim analysis takes place after (j/J)N patients have been randomized. If (j/J)N is non-integer, it is rounded to the nearest integer.
We use an AR procedure similar to the one used in Trippa et al. [7]; the approach is also closely related to previous proposals discussed in Thall and Wathen [15]. The procedure requires specification of a prior distribution on the response probabilities (p_0, …, p_K). At each interim analysis, we compute the conditional probabilities P(p_k > p_0 | data), that is, the posterior probability that arm k's response rate is higher than the control arm's response rate given the data so far. Using the notation from the previous section, this posterior probability is P(p_k > p_0 | Y_jk, Y_j0, n_jk, n_j0). We use these conditional probabilities to obtain dynamic randomization probabilities.
We use independent Beta(a_1, a_2) prior distributions for the response probabilities p_0, …, p_K. The parameters of the prior are chosen to be a_1 = p_0 and a_2 = 1 − p_0, so that the prior is centered around an initial guess p_0.
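Under independent beta priors, P(p_k > p_0 | data) can be computed by one-dimensional numerical integration. A sketch, with default prior parameters illustrating an initial guess of 0.2:

```python
from scipy.stats import beta
from scipy.integrate import quad

def prob_superior(y_k, n_k, y_0, n_0, a1=0.2, a2=0.8):
    """P(p_k > p_0 | Y_jk, Y_j0, n_jk, n_j0) under independent Beta(a1, a2)
    priors; (a1, a2) = (p0_guess, 1 - p0_guess) centres the prior on the
    initial guess for the control response rate."""
    post_k = beta(a1 + y_k, a2 + n_k - y_k)   # posterior for arm k
    post_0 = beta(a1 + y_0, a2 + n_0 - y_0)   # posterior for the control
    # integrate P(p_k > t) against the control-arm posterior density
    val, _ = quad(lambda t: post_0.pdf(t) * post_k.sf(t), 0.0, 1.0)
    return val
```

By symmetry, identical data on the two arms give a posterior superiority probability of exactly 0.5.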
During the (j + 1)-th stage of the trial, given the current arm-specific sample sizes n'_0, …, n'_K, with jN/J ≤ n' = Σ_{i=0}^{K} n'_i ≤ (j + 1)N/J, we set the randomization probabilities (π_0, …, π_K) proportional to the weights

w_k = P(p_k > p_0 | data)^{γ(n'/N)},  k = 1, …, K,
w_0 = (1/K) [Σ_{k=1}^{K} w_k] × exp{ η(n'/N) [max(n'_1, …, n'_K) − n'_0] },   (2)

so that π_k = w_k / Σ_{i=0}^{K} w_i. The vector of randomization probabilities is therefore a function of the posterior probabilities contrasting each experimental arm with the control. Recall that n_jk is the number of patients assigned to treatment k before the j-th interim analysis, that is, before computation of the posterior probabilities; this integer-valued variable differs from n'_k in (2), which denotes the current number of patients that have been randomized to arm k.
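One plausible instantiation of this allocation rule can be sketched as follows: experimental-arm weights are posterior superiority probabilities raised to γ(n'/N), and the control-arm weight contains an exponential term tracking max_k n'_k − n'_0. The exact functional form of the control weight is an assumption for illustration:

```python
import numpy as np

def allocation_probs(post_sup, n_arm, N, gamma, eta):
    """Randomization probabilities (pi_0, ..., pi_K).
    post_sup[k-1] = P(p_k > p_0 | data) for experimental arms k = 1..K;
    n_arm = current sample sizes (n'_0, ..., n'_K); N = total sample size;
    gamma, eta = tuning functions of the elapsed trial fraction.
    NOTE: the control-arm weight below is a plausible reconstruction,
    not necessarily the exact form used in the paper."""
    frac = sum(n_arm) / N                                  # n'/N
    w = np.asarray(post_sup, float) ** gamma(frac)         # arms 1..K
    w0 = w.mean() * np.exp(eta(frac) * (max(n_arm[1:]) - n_arm[0]))
    probs = np.concatenate([[w0], w])
    return probs / probs.sum()
```

With γ ≡ 0 and η ≡ 0 the rule reduces to equal allocation across all K + 1 arms; with γ > 0, arms with higher posterior superiority probabilities receive more patients.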
Although we do not consider the possibility further in this paper, the posterior probabilities for each experimental arm could vary across patients with different covariates. This would mean that patients with different covariate values would have different allocation probabilities. Recent examples that account for covariates in the computation of the adaptive probabilities are the ISPY2 trial [11] and the BATTLE trial [8].
If we consider a scenario with a single effective experimental treatment and a total sample size that diverges, then a suitable η function would result in approximately 50% of patients assigned to the effective treatment and approximately 50% assigned to the control arm. Also, when the overall sample size diverges, if two or more experimental treatments are effective, with identical treatment effects, then approximately 100% of the patients assigned to an experimental treatment will receive one of the best treatments. These facts follow directly from standard results in sequential analysis; see, for example, [34,35]. Note that, assuming η > 0, the dominance of the exponential term exp(·) in (2) implies that, as the overall sample size diverges, the proportions of patients allocated to the control arm and to the experimental arm with the highest final sample size become identical.
The functions γ and η govern the extent to which the allocation probabilities deviate from equal allocation. For example, if γ is identically zero, each experimental arm has identical allocation probability. In contrast, as γ increases to infinity, only randomization to the control and the most promising experimental treatment is allowed.
A good choice of the tuning function γ balances the so-called exploration-versus-exploitation trade-off [36]. A positive η function allows one to approximately match the number of patients max(n'_1, n'_2, …, n'_K) receiving the most frequently assigned experimental treatment and the current sample size n'_0 of the control arm. This characteristic of the adaptive design we consider is the only major difference with respect to AR procedures previously discussed in the literature. In our simulations, we observed negligible sensitivity of the operating characteristics illustrated in this paper to the specific choice of η. All presented results are based on a linear η function with final value η(1) = 0.25. In contrast, the γ function has to be selected carefully.
We explored parametric functions such as γ(n/N) = a(n/N)^b and γ(n/N) = a·1(n/N > b). To simplify computations, we considered only functions with two parameters. We tuned γ with Monte Carlo simulations, iteratively simulating trials at each value of the parameters over a grid. We considered relevant scenarios for selecting γ. For cancer studies, where there is a paucity of effective new treatments, we used scenarios with only one effective experimental arm, say arm 1. To choose the parameters, we focused on the final evidence in favor of a treatment effect. In particular, at each simulation, we computed the Bayes factor in favor of the hypothesis of a positive treatment effect for arm 1. The candidate γ functions are then ranked by the proportion of simulations with Bayes factor above a threshold, say 2, under each function. This optimization procedure used scenarios with p_0 uniformly distributed in (0.2, 0.8) and p_1 = p_0 + 0.15. Additional simulations suggested negligible sensitivity to the distribution of p_0 and p_1 across iterations.
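The grid search over the γ parameters can be organized as a simple ranking step. In this sketch, the scoring function (the proportion of simulated trials whose final Bayes factor exceeds the threshold) is left abstract, since it wraps the full trial simulation:

```python
import itertools

def rank_gamma_params(score, a_grid, b_grid):
    """Rank (a, b) parameter pairs for gamma(x) = a * x**b (or the indicator
    variant) by a Monte Carlo score, e.g. the proportion of simulated trials
    with Bayes factor above a threshold. `score(a, b)` encapsulates the
    trial simulation and is a placeholder here."""
    grid = list(itertools.product(a_grid, b_grid))
    return sorted(grid, key=lambda ab: score(*ab), reverse=True)
```

The best-ranked pair is the first element of the returned list; for a toy score such as `lambda a, b: a + b`, the ranking is by the sum of the parameters.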
At completion of the trial, we control the type I error rate or FWER by computing a test statistic Z_k for each experimental arm and rejecting the associated null hypotheses for those arms whose test statistic exceeds a threshold c. We obtain the threshold using a simulation algorithm aimed at approximating, for each possible value of (p_0, …, p_K) ∈ [0, 1]^(K+1), the corresponding joint distribution of Z_1, …, Z_K. Once these approximations are obtained, we select the minimum value of c that keeps the type I error rate or the FWER below the pre-specified target under all combinations (p_0, …, p_K) ∈ [0, 1]^(K+1).
The algorithm is a direct application of the importance sampling method. It involves two steps. We first iteratively simulate clinical trials, varying (p_0, …, p_K) at each iteration. The vector of response probabilities is sampled at each simulation from independent beta distributions. That is, each trial simulation T is based on random response probabilities (p_0, …, p_K) ~ g, where g is a conveniently selected distribution. Let L(T; p_0, …, p_K) be the likelihood of a trial under the adaptive scheme; this is the probability of a specific sequence of outcomes and treatment assignments at a fixed value of the vector (p_0, …, p_K). We choose the distribution g so that, for each generated trial T, the importance weights

w(T; p̃_0, …, p̃_K) = L(T; p̃_0, …, p̃_K) / ∫ L(T; p_0, …, p_K) dg   (3)

can be analytically obtained at any fixed value of p̃_0, …, p̃_K. The computation of the ratio in expression (3) does not involve considerations of the adaptive scheme: we only need to compute the ratio between likelihoods under binomial and beta-binomial models.
The importance weights, one for each simulated trial, can be used to estimate the joint distribution of Z_1, …, Z_K at any combination (p̃_0, …, p̃_K):

E_g[ w(T; p̃_0, …, p̃_K) 1{Z(T) ∈ B} ] = P( (Z_1, …, Z_K) ∈ B; p̃_0, …, p̃_K ),   (4)

where Z(T) denotes the K test statistics associated with T, and the right-hand side of the equation is the probability of the event (Z_1, …, Z_K) ∈ B at a fixed value of the response probabilities p̃_0, …, p̃_K. Standard Monte Carlo arguments allow one to approximate the probabilities of interest on the right-hand side of (4) by iteratively simulating the trial T, which in turn produces a simulation of the random variable w(T; p̃_0, …, p̃_K) 1{Z(T) ∈ B}. Each simulated trial T defines on [0, 1]^(K+1) the function (p̃_0, …, p̃_K) → w(T; p̃_0, …, p̃_K) 1{Z(T) ∈ B} and therefore contributes to the estimates of all the joint distributions of Z_1, …, Z_K, which vary across the (p̃_0, …, p̃_K) combinations. The second step consists in minimizing c under the constraints given by the estimates of the joint distributions of Z_1, …, Z_K over a grid of values for p_0, …, p_K. In other words, we select c such that the maximum FWER or type I error rate estimate, across possible values of p_0, …, p_K, is bounded by a pre-specified value. To speed up computations, one can consider an importance distribution g that generates p_0 = p_1 = … = p_K with probability 1, and a grid of (p_0, …, p_K) values with all components equal, without compromising the described procedure for bounding the type I error probabilities or the FWER.
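The likelihood ratio in (3) reduces, arm by arm, to a binomial likelihood at (p̃_0, …, p̃_K) divided by a beta-binomial marginal likelihood, because the adaptive assignment probabilities depend only on the observed history and cancel in the ratio. A sketch, with g a product of Beta(a_1, a_2) distributions:

```python
import math
from scipy.special import betaln

def importance_weight(y, n, p_tilde, a1=1.0, a2=1.0):
    """w(T; p~) for one simulated trial T with final responses y[k] and
    sample sizes n[k] on arms k = 0..K. The numerator is the binomial
    likelihood at p_tilde; the denominator integrates the likelihood over
    g = product of Beta(a1, a2) distributions (a beta-binomial marginal).
    The assignment-probability factors cancel and are omitted."""
    log_w = 0.0
    for y_k, n_k, p in zip(y, n, p_tilde):
        log_num = y_k * math.log(p) + (n_k - y_k) * math.log(1.0 - p)
        log_den = betaln(a1 + y_k, a2 + n_k - y_k) - betaln(a1, a2)
        log_w += log_num - log_den
    return math.exp(log_w)
```

As a sanity check: with a uniform g (a_1 = a_2 = 1), a single arm with one patient and one response has marginal likelihood 1/2, so the weight at p̃ = 0.5 is exactly 1.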
Most of the computing time needed to implement the importance sampling is dedicated to the simulation of adaptive trials with different scenarios θ = (p_0, …, p_K) varying across the parameter space Θ = [0, 1]^(K+1). In the subsequent optimization stage of the algorithm, with a moderate number of arms, say up to K = 5, we used a regular grid that partitioned each dimension into 35 equally spaced subintervals. When we considered designs with a larger number of arms (say 10), we first optimized over a coarse grid with up to 10^7 points. Then, after the optimum θ* over this grid was obtained, the grid was rescaled and centered at θ*.
The computational efficiency of the algorithm can be directly compared with that of a Monte Carlo procedure that estimates the percentiles of Z at fixed values of θ on a grid. We considered, for instance, a trial with K = 2 and an overall sample size equal to 150. The grid is two-dimensional. We fixed the overall number of simulated trials (10^6) used to estimate the percentiles of Z across the θ values on the grid. Then, by iterating the computational procedures, we obtained estimates of the mean squared errors. As expected, the mean squared errors of the Monte Carlo procedure depend on the grid granularity, because each estimate is based on a number of simulated trials that decreases with the number of points on the grid. If the grid has 100 points, then each estimate is based on 10^4 simulated trials. In contrast, this relation between granularity and estimation accuracy is absent under the importance sampling approach. In the example given earlier, with up to 16 points on the grid, the efficiency of the Monte Carlo procedure is comparable with that of our approach, but with as few as 100 points, the higher efficiency of importance sampling becomes apparent.

Case study
We consider the NeoSphere trial, a multi-arm trial in women with human epidermal growth factor receptor 2 (HER2)-positive breast cancer [9]. HER2 is a biomarker that defines a subgroup with notably shortened survival [37]. In NeoSphere, the four arms considered were combination therapies: trastuzumab plus docetaxel (arm A), pertuzumab plus docetaxel (arm B), pertuzumab with trastuzumab (arm C), and pertuzumab plus docetaxel plus trastuzumab (arm D). The trial was powered to compare arm A with arms B and C, and arm B against arm D. The primary endpoint was pCR, which is defined as the absence of invasive neoplastic cells at surgery.
In total, 407 patients were recruited, with an equal allocation probability to each arm. Although not explicitly mentioned in this study, we will assume that pCR was assessed 15 weeks after randomization, which allows 3 weeks between the last day of treatment and the response assessment. Recruitment lasted from December 2007 to December 2009. This corresponds to a recruitment rate of approximately four patients per week. We will investigate the effects of a faster or slower recruitment in Section 6.
The final endpoint is observed with a delay from individual enrollment. We incorporate this characteristic into our comparison of AR and MAMS designs in Section 6. Each patient is randomly assigned a recruitment time: the time (in weeks) between the recruitment of two consecutive patients is assumed to be an Exponential random variable with mean 1/m, where m is the mean number of patients recruited per week.
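This recruitment model can be simulated directly. A minimal sketch, using the 15-week assessment delay assumed for NeoSphere as the default:

```python
import numpy as np

def recruitment_schedule(n_patients, m, delay_weeks=15.0, seed=0):
    """Simulate recruitment times with exponential inter-arrival gaps of
    mean 1/m weeks (m = mean patients recruited per week), and outcome
    observation times a fixed delay_weeks later."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / m, size=n_patients)
    recruit = np.cumsum(gaps)
    return recruit, recruit + delay_weeks
```

At an interim analysis triggered by the randomization of a given patient, only outcomes whose observation times precede that patient's recruitment time are available; this is the mechanism through which the delay attenuates the benefit of adaptation.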

Comparison of adaptive randomization and multi-arm multi-stage trials
We compare MAMS and AR designs using simulated trials under a variety of scenarios when (i) the FWER or (ii) the type I error rate of each comparison is controlled. The baseline scenario is a trial with five interim analyses, four experimental arms, and a target FWER of 0.2 or a target type I error rate of 0.05, depending on which quantity is being controlled. Although a FWER of 0.2 may seem high, we consider it to be adequate for typical multi-arm phase II trials. Also, Wason, Jaki, and Stallard [38] recently gave a detailed motivation, on the basis of optimality criteria, for controlling the FWER at levels of at least 0.2. We used a target power of 0.8 when p_0 = 0.2 and p_1 = 0.4. We then varied each of these parameters and obtained the list of scenarios in Table I. The first six columns illustrate these variations; the acronyms T-FWER, T-Power, and T-TOER are used to indicate the targeted FWER, power, and type I error rate, respectively. Table I provides a description of scenarios, the resulting cutoff value c for the AR design, used at completion of the trial for reporting significant treatment effects, and relevant operating characteristics. The table has two main components: the top half shows designs when the FWER is controlled, and the bottom half shows designs when the type I error rate is controlled.
For the MAMS design, the sample size is random, so the expected sample sizes (ESS) under the global null hypothesis H_G, that all treatment effects are null, and under the LFC are shown. The expected values are indicated in Table I as ESS-H_G and ESS-LFC. For the AR designs, we report the overall (fixed) sample size N. We also report, for the MAMS trial, the group size n, which is the number of patients per arm at each stage. Table I suggests that the procedures described in Section 2 control the FWER at approximately the targeted value. We estimated the reported control of the FWER under H_G and the power under the LFC for AR and MAMS designs in Table I from independent simulations performed after the designs had been selected. These estimates are based on 100,000 simulations. Only in a few scenarios is the FWER/type I error rate slightly above the target level. For example, the scenario in which p_0 = 0.5 appears to result in a small inflation of the FWER and type I error rate under the MAMS design. We also note that the MAMS design generally seems more sensitive to the value of p_0, as the observed FWER is often noticeably lower than the target level. This is not true to the same extent with the AR design.
In all scenarios, there is a positive difference between the maximum sample size under the MAMS design and the sample size of the AR design. For instance, under the first scenario we considered, the maximum MAMS sample size is J(K + 1)n = 325, whereas the AR design sample size is N = 190. These differences are similar when we consider the control of the FWER and of the type I error rate.
In most scenarios, the expected sample size under H G for the MAMS design is smaller than the sample size required by the AR design. This suggests that the MAMS design can be more efficient when all treatments are ineffective. In contrast, when the LFC is true, that is, there is a single effective experimental treatment, the AR design sample size N is always lower than the MAMS expected sample size.
The distribution of the MAMS design sample size across simulations is summarized in Figure 1. In Figure 2, we show the distribution (over 250,000 simulation replicates) of the proportion of patients allocated to each arm when there is one effective experimental treatment and when there are two equally effective experimental treatments. When there is one effective experimental treatment, an average of 0.31 of the patients receive the best available treatment. A lower proportion of patients is allocated to the three ineffective experimental arms. When there are two effective experimental arms, the average allocation proportion of the control group is the highest across the K + 1 = 5 arms. This is because, in each simulation replicate, one of the effective treatments is assigned more frequently than the other by chance, whereas the final sample size of the control arm approximately matches the maximum of the sample sizes across experimental arms. Similar simulations for the MAMS design showed that when there was one effective experimental treatment, the median allocation to the control treatment was 0.281 (interquartile range (IQR) 0.25-0.313), the median allocation to the effective experimental treatment was 0.278 (IQR 0.238-0.313), and the median allocation to each ineffective experimental treatment was 0.150 (IQR 0.100-0.200). For two effective experimental treatments, the median allocation to the control treatment was 0.250 (IQR 0.235-0.278), the median allocation to each effective experimental treatment was 0.238 (IQR 0.217-0.263), and to each ineffective experimental treatment was 0.136 (IQR 0.077-0.167).
With both designs, we observed efficiency gains, in terms of sample size requirements, as the number of interim analyses increases. This becomes apparent by comparing designs with three and five interim analyses. As one might expect, the marginal gain of each additional interim analysis diminishes as the number of stages J grows; this trend is suggested by the trial characteristics when 3, 5, and 10 interim analyses are considered.
The expected number of treatment failures (ENF) can be used to assess trial designs when ethical issues are considered. We report in Table I the ENF under the LFC; under the H_G hypothesis, it can be obtained directly by multiplying (1 - p_0) by the expected sample size. In all scenarios, the ENF is lower using AR than using a MAMS design. This is partly expected, as the sample size used by the AR design under the LFC is lower. Table I shows scenarios with different numbers of experimental arms. As the number of experimental arms increases, the relative efficiency of the two designs shifts slightly in favor of AR. For instance, with two experimental arms, the expected sample size of the MAMS design under H_G is 15% lower than the sample size of the AR design, whereas it is 4% higher when there are six experimental arms.
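As a small illustration of how the ENF is computed (a sketch with hypothetical allocation numbers, not values from Table I): with per-arm sample sizes n_k and response probabilities p_k, the ENF is the sum of n_k(1 - p_k) over arms, which under H_G reduces to (1 - p_0) times the total sample size.

```python
def expected_failures(arm_sizes, response_probs):
    """Expected number of treatment failures: sum over arms of n_k * (1 - p_k)."""
    return sum(n * (1.0 - p) for n, p in zip(arm_sizes, response_probs))

# Under the global null H_G, every arm has the control response rate p0,
# so the ENF reduces to (1 - p0) * total sample size.
p0 = 0.2
sizes = [55, 55, 55, 55]                       # hypothetical equal allocation, N = 220
enf_null = expected_failures(sizes, [p0] * 4)  # 0.8 * 220 = 176

# Under an LFC-type scenario, one experimental arm responds at 0.4.
enf_lfc = expected_failures(sizes, [0.2, 0.4, 0.2, 0.2])
```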
We checked whether the major differences between MAMS and AR designs described in the previous paragraphs are robust to particular choices made in the comparison. First, we checked whether the results of the comparative study are confirmed when a different test statistic is used at completion of the simulated AR trials. Second, we checked whether the described differences persist when an alternative approach is used to define the futility boundaries (f_1, ..., f_J).
We substituted the test statistic used throughout the simulation study (1) with the p-values produced by the Fisher exact procedure for 2 × 2 tables. The adjective 'exact' could be misleading in our adaptive context; these p-values are only used as Z_k summary statistics. Under the considered scenarios, we observed negligible variations of the operating characteristics. We implemented, for instance, the AR procedure with N = 220, K = 4, and J = 5, using Fisher's p-values to control the type I error at 0.05. The resulting cutoff was c = 0.064, and the power at the LFC considered in the first row of Table I was 0.801. The procedure described in Section 2 with N = 220 selected h(j/J) = 13.5(j/J)^2.75. Under the LFC, the resulting average proportions, across simulations, of randomizations to the effective arm, p_1 = 0.4, and the control, p_0 = 0.2, were 33.1% and 33.3%, respectively, with the remaining average of 33.6% assigned to the three experimental arms with no treatment effect.
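To make the allocation mechanism concrete, here is a rough sketch of this kind of power-tuned adaptive randomization (our own illustration with hypothetical interim counts, not the paper's exact algorithm): with independent Beta(1, 1) priors, the posterior probability P(p_k > p_0) is estimated by Monte Carlo, and the experimental-arm randomization probabilities are made proportional to that probability raised to the power h(j/J) = 13.5(j/J)^2.75, so allocation is close to balanced early and concentrates on the leading arm later.

```python
import random

random.seed(1)

def post_prob_better(succ_k, fail_k, succ_0, fail_0, draws=20000):
    """Monte Carlo estimate of P(p_k > p_0) under independent Beta(1,1) priors."""
    wins = 0
    for _ in range(draws):
        pk = random.betavariate(1 + succ_k, 1 + fail_k)
        p0 = random.betavariate(1 + succ_0, 1 + fail_0)
        if pk > p0:
            wins += 1
    return wins / draws

def ar_probs(arms, control, j, J):
    """Randomization probabilities over experimental arms, proportional to
    P(p_k > p_0)^h with h = 13.5 * (j / J) ** 2.75: a small exponent early
    spreads allocation, a large one later concentrates it."""
    h = 13.5 * (j / J) ** 2.75
    weights = [post_prob_better(s, f, *control) ** h for s, f in arms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical interim data: (successes, failures) per experimental arm.
arms = [(12, 18), (6, 24), (7, 23)]   # arm 1 looks best
control = (6, 24)
probs = ar_probs(arms, control, j=3, J=5)
```

The Beta-posterior comparison and the specific exponent schedule are stand-ins; in the paper the tuning function is selected to meet the operating characteristics described in Section 2.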
The main differences between AR and MAMS designs were also confirmed when we substituted the triangular method with the predictive power approach [33] for selecting the MAMS boundaries (f_1, ..., f_J). When we considered, for instance, the first scenario in Table I with the type I error controlled at 0.05 and redefined the futility boundaries, the resulting average sample sizes under the LFC and H_G across simulations became 250.1 and 200.6, respectively. Also in this case, the variability of the sample size across simulations is substantial: the IQR is equal to 120 patients under the H_G hypothesis and 65 under the LFC.
Next, we explored the operating characteristics of the AR and MAMS designs in row 1 of Table I under scenarios with more than one effective treatment. We selected these designs to control the FWER. We defined random variables E and I as the numbers of H_0^(k) hypotheses that are correctly and erroneously rejected in a given trial. We estimated the probabilities that E ≥ 1 and I ≥ 1, the latter being the FWER, together with the expected value of E and the expected number of treatment failures. The results are shown in Table II. The differences between the two designs are small for all criteria considered, with the exception of the ENF. MAMS designs do perform marginally better in terms of recommending effective treatments and avoiding recommending ineffective treatments. They tend to have a higher expected number of treatment failures, although as the number of effective treatments increases, this comparison is mainly driven by the higher sample sizes of the MAMS designs. The fourth scenario, in which all experimental treatments are inferior to control, is the only one in which the MAMS design performs clearly better. The differences in Table II have an intuitive explanation: AR is conceived to sequentially allocate increasing resources to the comparison of the most promising experimental arm versus the control. When two arms have similar treatment effects, there is a trade-off between the numbers of patients randomized to each of them. In contrast, under the MAMS design, the distribution of the number of patients randomized to an experimental arm, say arm 1, is not related to the treatment effects of the other (K - 1) experimental treatments. In the last scenario, the control group response probability is 0.2 and the response probability of each experimental treatment is independently simulated from a uniform U(0.1, 0.4) distribution.
P(E ≥ 1), probability of a trial ending with recommendation of one or more effective treatments; E(E), expected number of effective treatments recommended in a trial; P(I ≥ 1), probability of a trial ending with recommendation of one or more ineffective treatments; ENF, expected number of failures; AR, adaptive randomization; MAMS, multi-arm multi-stage.

Futility stopping rules for adaptive randomization designs
The results in Section 4 show the main weakness of the AR procedure: it performs worse than MAMS designs when all experimental treatments are ineffective. Ethical concerns also arise when scenarios in which all experimental arms have detrimental effects are realistic and need to be considered, because the procedure includes no futility rule that stops the trial if all experimental arms perform poorly in comparison with the control treatment.
We consider a stopping rule that terminates the trial early when none of the experimental arms has sufficient evidence to suggest a positive treatment effect. The stopping rule directly exploits the posterior probabilities of the events {p_0 < p_k} computed at each interim analysis. If all these posterior probabilities fall below a pre-specified threshold ℓ_j at the j-th interim analysis, then the trial is stopped. We use thresholds defined by the equalities ℓ_j = λ(j/J)^γ for a suitable choice of λ and γ. Note that these additional thresholds do not add computational costs to the procedure in Section 2 for controlling either the type I error rate or the FWER. That is, the described stopping rule can be directly included in the algorithm for selecting the threshold c that is used at completion of the trial for reporting positive treatment effects. Table III lists the operating characteristics (estimated from 10,000 replicates) that result from a variety of potential stopping rules. Some values of (λ, γ), for example, (0.6, 2), result in minimal loss in power and a reduction in the expected sample size under H_G of approximately 10 patients. In no case does the stopping rule lead to an inflation of the type I error rate or FWER. We then considered again scenarios with all experimental arms having a detrimental effect and observed substantial reductions of the average sample sizes across simulations. For instance, under the fourth scenario listed in Table II, setting (λ, γ) equal to (0.6, 2), the trial is stopped at the third and fourth interim analyses with probabilities equal to 0.25 and 0.52, respectively.
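The threshold schedule is simple to compute; the following sketch (our own illustration, with the posterior probabilities taken as given inputs) evaluates ℓ_j = λ(j/J)^γ for (λ, γ) = (0.6, 2) and applies the all-arms futility check.

```python
def futility_threshold(j, J, lam=0.6, gamma=2.0):
    """Threshold l_j = lam * (j / J) ** gamma: loose at early interim
    analyses, tightening toward lam at the final one."""
    return lam * (j / J) ** gamma

def stop_for_futility(post_probs, j, J, lam=0.6, gamma=2.0):
    """Stop the whole trial if every posterior probability P(p_0 < p_k)
    falls below the threshold l_j at the j-th interim analysis."""
    lj = futility_threshold(j, J, lam, gamma)
    return all(pp < lj for pp in post_probs)

J = 5
thresholds = [futility_threshold(j, J) for j in range(1, J + 1)]
# thresholds: 0.024, 0.096, 0.216, 0.384, 0.6
```

For example, posterior probabilities of 0.10, 0.05, and 0.20 at the fourth of five analyses all fall below ℓ_4 = 0.384, so the trial would stop.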

Comparisons based on a case study
The case study that we use to compare AR and MAMS designs is the NeoSphere trial reported by Gianni et al. [9]. We consider planning a multi-arm trial similar to NeoSphere, which compares three experimental treatments to a control arm with a planned sample size of approximately 400 patients. Using the methods in Section 2, we first found AR and MAMS designs that controlled the FWER at 0.1, for J ∈ {2, 3, 5, 10} and K = 3, assuming that there was no delay between patient recruitment and response evaluation. We chose the FWER of 0.1 as it is close to the FWER of the actual trial design, which controlled the type I error rate of each comparison at the 5% level (one sided). We chose the AR design so that its sample size N was 400, whereas the MAMS design was chosen so that its expected sample size under the LFC was as close to 400 patients as possible. The AR designs included a futility stopping rule, as described in Section 5, with parameters λ = 0.6 and γ = 2. We then evaluated the operating characteristics of the designs assuming that the endpoint of pCR was observed 15 weeks after recruitment. We considered recruitment rates, in terms of mean number of patients recruited per week, equal to 2, 4, and 8.
We first investigated MAMS designs in which interim analyses took place once a pre-planned number of patients per arm had been recruited, as described in Section 2. Using the futility boundaries found in the no-delay case, the MAMS designs, under all scenarios with delayed responses, had both FWER and power considerably lower than the targeted values. For example, with J = 10, p_0 = 0.2, p_1 = 0.4, and a mean of four patients per week, the estimated FWER and power were 0.059 and 0.436, respectively, compared with planned values of 0.1 and 0.65. This indicates that ignoring the delay between recruitment and assessment leads to substantial deviations from the planned operating characteristics of MAMS trials. The decrease in power occurs because the futility boundaries are found assuming that the interim analyses are equally spaced in terms of number of patients assessed, whereas under all scenarios with delayed responses, interim analyses take place when outcome data are available only for a subset of enrolled patients. We therefore chose to focus on MAMS designs with interim analyses shifted in time by 15 weeks, exactly the time interval between each patient's enrollment and the subsequent response assessment. The operating characteristics of the AR design were less sensitive to the delay between patient accrual and response evaluation. Therefore, the interim analyses of the AR design were not shifted, and at each interim analysis, the necessary posterior probabilities of the events {p_0 < p_k} are updated using the available outcome data. Table IV shows relevant operating characteristics of the designs under different recruitment rates. This simulation study suggested that the FWER was not affected by the described delays and, in all considered scenarios, was controlled at the 10% level as planned. The recruitment rate impacts the designs in different ways.
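The arithmetic behind the missing interim data is easy to see with a deterministic sketch (our own simplification, not the paper's simulation model, which uses random accrual): with a constant accrual rate r patients per week and a 15-week assessment delay, an interim analysis triggered at the n-th recruitment can only use outcomes from patients recruited at least 15 weeks earlier.

```python
def outcomes_available(n_recruited, rate, delay=15.0):
    """Assessed outcomes at the moment the n-th patient is recruited,
    assuming deterministic accrual of `rate` patients per week: only
    patients recruited more than `delay` weeks ago have been assessed."""
    recruit_time = n_recruited / rate                    # calendar time of interim
    assessed = int(rate * max(0.0, recruit_time - delay))
    return min(assessed, n_recruited)

# With 4 patients/week, an interim after 80 recruits (week 20) has
# outcomes for only the first 20 patients; with 8 patients/week, the
# same interim (week 10) has no assessed outcomes at all.
```

This is why faster recruitment makes the delay problem worse: the interim analyses fall earlier in calendar time relative to the fixed 15-week assessment window.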
The AR design loses some power as the recruitment rate increases, with a loss in power (compared with the no-delay case) of approximately 6-8% at the high recruitment rate. If the recruitment rate is slow, that is, two patients per week, then only around 1-2% power is lost. The power of the MAMS design is not affected by the recruitment rate, but its expected sample size is affected considerably. Even with a slow recruitment rate, the expected sample size under H_G increases by approximately 15%. Table 2 of [9] shows the actual results of the trial. The estimated pCR rates were 0.29, 0.458, 0.168, and 0.240 for arms A-D, respectively. With a FWER of 0.1 and a sample size of 400 patients, the power to conclude that arm B is significantly better than arm A is 0.776 if no interim analyses are used, assuming that these pCR estimates are the truth. We simulated data under this configuration of treatment effects with the designs used in Table IV, assuming a mean recruitment rate of four patients per week. Table 2 in the Supporting information shows the results. In this case, the power to recommend arm B is higher using AR (0.881 for AR, 0.839 for MAMS with J = 5), with the difference increasing with the number of interim analyses. On the other hand, the expected sample size of the MAMS design also decreases as J increases. Both methods lead to a notable decrease in the expected number of failures in the trial (284.4 if no interim analyses are used, 269.7 and 266.3 for AR and MAMS, respectively, with J = 5).
This case study illustrates that AR and MAMS are generally more efficient than a multi-arm trial without interim analyses. This gain in efficiency depends on the recruitment rate of the trial. Under AR, there is a gain in power, and on average, more patients are assigned to arm B. Although using a MAMS design led to a loss in power compared with AR, there was also a slight drop in the expected number of patients recruited to the trial. We also considered setting the MAMS design's maximum sample size to be as close as possible to 400. Table 1 in the Supporting information shows these results. In this case, the MAMS design has considerably lower power and expected sample size compared with AR. At completion of group sequential or adaptive trials, the investigator reports the arm-specific response probabilities and the treatment effect differences between the control and the investigated agents. In most cases, maximum likelihood estimation is used, or equivalently, when the primary outcome is binary, the numbers of responses and failures are summarized in a table. It is known that the maximum likelihood estimates (MLEs) following group sequential or adaptive experiments can be biased; that is, the expected values of the estimates can be substantially different from the true parameters. Several correction strategies have been proposed but are rarely used in the medical literature. We assessed these discrepancies with 100,000 Monte Carlo iterations under the scenario tailored to the NeoSphere trial after having selected AR and MAMS designs with a power of 80% at (p_0 = 0.2, p_1 = 0.4) and FWER controlled at 0.2. The bias of the empirical estimates produced at completion of the MAMS trial is larger than the bias under the AR design. The differences (p_0 - p_j) - E(p̂_0 - p̂_j), with j = 1, 2, 3, between the true treatment effects and the estimates under the MAMS (AR) design are equal to 0.036 (0.001), 0.037 (0.011), and 0.041 (0.016), respectively.
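The mechanism behind this bias can be reproduced with a toy experiment (our own construction, unrelated to the NeoSphere numbers): in a two-stage design where an arm dropped for futility is reported with its (necessarily low) first-stage estimate, the reported MLE, averaged over trials, falls below the true response probability.

```python
import random

random.seed(7)

def simulate_estimate(p=0.3, n1=30, n2=30, futility_cutoff=6):
    """One two-stage trial for a single arm: drop the arm at the interim
    if first-stage responses <= futility_cutoff, else continue to stage 2.
    Returns the MLE of p from whatever data were collected."""
    s1 = sum(random.random() < p for _ in range(n1))
    if s1 <= futility_cutoff:            # dropped: estimate from stage 1 only
        return s1 / n1
    s2 = sum(random.random() < p for _ in range(n2))
    return (s1 + s2) / (n1 + n2)

estimates = [simulate_estimate() for _ in range(20000)]
mean_est = sum(estimates) / len(estimates)
# mean_est falls below the true p = 0.3: trials that stop contribute
# low stage-1 estimates, and continuing trials only partly offset them.
```

The cutoffs and sample sizes here are arbitrary; the point is only the direction of the effect, which matches the larger bias observed under the MAMS design, where futility stopping is the arm-dropping mechanism.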
These results suggest that AR can be considerably less affected by this problem of bias. AR designs have been shown to yield asymptotically consistent and normally distributed treatment effect MLEs under mild conditions [39]. The main results in [39] can be applied to our AR scheme. For small sample sizes, however, the bias of the MLEs is still noticeable for non-effective arms.

Discussion
When several new agents or combinations are at the same stage of clinical development, a multi-arm trial provides efficiency gains over separate trials of each experimental agent against a control treatment.
Two major classes of designs, which can add further efficiency, are AR and MAMS designs. In this paper, we discussed designing AR and MAMS trials and compared their properties. Our study compared the two approaches using easily interpretable operating characteristics such as power, overall sample size, sample size variability, and expected number of treatment failures.
Both AR and MAMS designs provide efficiency advantages over multi-arm trials without interim analyses. Efficiency can be in terms of extra power or a reduced sample size required for a fixed power. Although we did not explicitly consider the trial duration, a reduction in expected sample size will also lead to a reduction of the expected time to complete the study. The relationship depends on the accrual rate and, in some cases, is assumed linear. In practice, the recruitment rate can vary during the trial, and the proportion of time saved by AR or MAMS may slightly deviate from the expected sample size reduction. The relative advantages of both methods depend on the accrual rate and the delay between individual recruitment and outcome assessment. Parmar et al. [40] motivated the use of MAMS designs in oncology but stated that they should only be used if the final endpoint, or some suitable intermediate endpoint, is observed relatively quickly. The same is true for AR designs, and a phase II trial where the final endpoint is observed quickly is the ideal setting for using the designs discussed in this paper.
In several settings, and in particular in confirmatory studies, the investigator has to perform frequentist hypothesis testing with stringent control of the type I error rate. For group sequential studies, there are several well-established methods to do this, and the performance and robustness of these procedures are well documented in the literature. Under adaptive designs, the control of the type I error rate is, in general, more challenging. The use of standard Monte Carlo simulations under a few representative scenarios is unsatisfactory, and an attempt to control the type I error rate with simulations alone can be insufficient for regulatory agencies. Our application of the importance sampling method has the goal of controlling the probability of a type I error under the complete set of possible arm-specific response probabilities. We do not claim that this is a general solution for any adaptive trial, but in our setting, with multi-arm trials and binary outcomes, we assessed the algorithm, including through independent Monte Carlo simulations. Overall, the proposed computational procedure achieves actual control of the type I error rate at the nominal α level. More generally, the study of computational strategies applicable to the control of the type I error in adaptive trials can facilitate the use of adaptive designs.
The results in this paper show that AR designs are often more efficient than MAMS designs when there is a single effective experimental treatment. With multiple effective experimental treatments, the differences between the two approaches become less evident. Also, when there are no effective experimental treatments in the trial, the MAMS design tends to have a lower expected sample size, but including a futility rule in the AR design narrows this gap. An additional advantage of AR designs is a lower sensitivity of the operating characteristics when the number of assessed outcomes at each interim analysis is different from the originally planned number.
Adaptive randomization designs allow combining a fixed overall sample size with easy-to-interpret rules for dropping experimental arms; this may be convenient for planning the trial. The MAMS approach could be modified to have a fixed sample size by fixing the number of patients recruited at each stage (as opposed to a fixed number of patients recruited per arm per stage). Similar to the AR design, the sample size would then be fixed unless all experimental arms are dropped at some interim analysis. The stopping boundaries should be adequately chosen, possibly with different values depending on the number of treatments remaining in the trial after each interim analysis, but despite this complication, it may be a useful design to consider in future work.
Adaptive randomization has previously been criticized as being less efficient than a fixed randomization approach (see, for example, Korn and Freidlin [41]), and indeed, this is true when a control is compared with a single experimental treatment. However, the results in this paper emphasize that, by controlling the allocation to the control group and applying AR to the experimental groups, both the power and the rate of positive outcomes across the enrolled patients can be improved compared with more standard designs with fixed randomization probabilities.
We have assumed that the different arms represent separate treatments as opposed to different doses or different combinations of treatments (with overlap of treatments between arms). If the arms are different doses or combinations, then the approach we have used is valid, but not necessarily the most efficient. Regression models can be used to improve AR and MAMS designs. The AR design could, for instance, integrate a dose-response model for estimating sequentially the effectiveness of different doses. This could potentially improve the allocation procedure. Similar considerations apply when the arms are combinations of agents (with treatments in common): prior models that include the possibility of synergistic effects and agent-specific effects can be considered. Analogous remarks hold for MAMS designs; additionally, more complex decision rules on dropping arms can be used, such as only dropping a higher dose for futility if the lower dose has been dropped.
When we considered MAMS designs, we were forced to use interim analyses taking place after a prespecified number of outcomes had been assessed on each arm. This is not always a practical solution, for example, if there is substantial variability in the time intervals between individual accrual and outcome assessment. In order to obtain the same flexibility as the AR designs, one would need MAMS designs with non-deterministic futility boundaries defined by empirical spending functions (see, for example, [42]). This would allow adaptation of the futility boundary according to the number of available outcomes. To our knowledge, this approach has not been applied to trials with multiple experimental arms. Overall, we observed that, with a delay between recruitment and response, an increased recruitment rate leads to AR designs having lower power and MAMS designs having higher expected sample size.
We focused on the statistical properties of MAMS and AR designs, but there are other considerations when choosing a design. Both AR and MAMS designs will cause operational issues [43]. Examples of such issues are managing the randomization and changing the allocation ratios after an interim analysis. Both designs will cause problems with the supply of drugs/treatments (as it is not known in advance how many patients will be on each arm); AR could cause greater problems as there is scope for greater deviation from equal randomization. Blinding treatment assignment is another important aspect that needs careful consideration. When blinding is to be maintained, it is important to have properly maintained firewalls between the statistician conducting the interim analyses and other investigators.
A more complex setting in which AR designs have been proposed consists of trials evaluating the effects of several treatments on patients from multiple biomarker subgroups. Examples include the BATTLE [8] and iSPY2 [11] trials. In addition, MAMS approaches have been used to design the FOCUS-4 trial, which will test several treatments in biomarker subgroups. A comparison of AR and MAMS designs in this context would be useful and is an area for future research.