Empirical Bayes estimation of the selected treatment mean for two-stage drop-the-loser trials: a meta-analytic approach

Point estimation for the selected treatment in a two-stage drop-the-loser trial is not straightforward because a substantial bias can be induced in the standard maximum likelihood estimate (MLE) through the first stage selection process. Research has generally focused on alternative estimation strategies that apply a bias correction to the MLE; however, such estimators can have a large mean squared error. Carreras and Brannath (Stat. Med. 32:1677-90) have recently proposed using a special form of shrinkage estimation in this context. Given certain assumptions, their estimator is shown to dominate the MLE in terms of mean squared error loss, which provides a very powerful argument for its use in practice. In this paper, we suggest the use of a more general form of shrinkage estimation in drop-the-loser trials that has parallels with model fitting in the area of meta-analysis. Several estimators are identified and are shown to perform favourably to Carreras and Brannath's original estimator and the MLE. However, they necessitate either explicit estimation of an additional parameter measuring the heterogeneity between treatment effects or a quite unnatural prior distribution for the treatment effects that can only be specified after the first stage data has been observed. Shrinkage methods are a powerful tool for accurately quantifying treatment effects in multi-arm clinical trials, and further research is needed to understand how to maximise their utility.


Introduction
Two-stage drop-the-loser designs provide a framework for picking the most effective treatment out of a larger group of candidates and then testing it against a standard therapy in a confirmatory analysis. Although this design is an efficient way to discover effective treatments, the selection mechanism acts to inflate the type I error of the final test statistic [1,2] and can also induce a substantial bias into the standard maximum likelihood estimate (MLE). With regard to the former, current regulatory authority guidance (e.g. [3]) is unequivocal that the final analysis must control the type I error rate. With regard to the latter, whilst acknowledging that estimation bias is a serious issue affecting the validity of adaptive trials and that bias should be 'minimised', there is a distinct lack of guidance and consensus on how this should be achieved. Research has generally focused on estimators that apply a bias correction to the MLE. One such class of estimators, referred to as uniform minimum variance conditionally unbiased estimators (UMVCUEs), totally removes the MLE's bias [4][5][6][7]. Others have proposed iterative or likelihood-based methods that can substantially reduce the bias of the MLE, without being unbiased [7][8][9]. Unfortunately, methods that explicitly target bias correction generally lead to an estimator with a mean squared error (MSE) larger than that of the MLE.
This form (from Posch [12]) makes clear that at the trial outset, s is a random variable. By definition, Bias. Q s / = 0, and MSE. Q s / is smaller than any other s that is also unbiased. This of course does not mean that MSE. Q s / is smaller than any biased estimator; for example, it is generally true that MSE. Q s / MSE. O s / (see Section 5 for example). Furthermore, it is commonly agreed that MSE (a measure equivalent to its variance + squared bias) provides a far better summary of an estimator's worth than bias alone.

Shrinkage estimation
The standard motivation for using shrinkage methods is to provide simultaneous, accurate estimation for a group of parameters, where accuracy is defined via the combined MSE. For example, if we were interested in jointly estimating the true mean effect of all k treatments 1 ; :::; k using only stage 1 data, then it will generally be true that the combined MSE, is far smaller when i equals L L i as opposed to the MLE X i , where L L i is Lindley's estimator [13]: and Although shrinkage formula (5) was not originally proposed using a Bayesian argument, it can be easily understood and shown to be optimal within a Bayesian framework. Assume that a priori 1 ; :::; k are themselves independent and identically distributed (i.i.d) N. ; 2 / random variables and only stage 1 data are available for the k treatments. Given i , the distribution of its MLE L L i can therefore be viewed as an 'Empirical Bayes' estimate for the posterior mean of equation (7), with N X, X i and O C substituted for , O i and 2 1 2 1 C 2 , respectively. N X is clearly an unbiased estimate of , but it is perhaps less obvious that O C is an unbiased estimate for 2 1 2 1 C 2 , regardless of the true value of 2 . If 1 O C gives a negative value, it is replaced by 0 in the definition of O B C . This 'plus rule' has been shown to further reduce the MSE of the resulting estimate L L i [14].

Carreras and Brannath's approach
Hwang [15] explicitly considered estimation of a single mean parameter from a k component system, where all k components have normally distributed estimates with a common variance and the single component is identified by having the largest estimate. This is identical to estimating s using only stage 1 data in a two-stage drop-the-losers trial. He proved that when the treatment means follow a N. ; 2 / prior distribution (for any and 2 > 0), so that their posterior distributions obey equation (7) and L L s is defined by equation (5) analogously dominates O s from equation (1). Their result relies on the fact that equation (1) is equivalent to replacing L CB s in equation (8) with X s . This also occurs naturally when the shrinkage factor C is set to 0.

An alternative formulation
Carreras and Brannath's method for estimating s following a two-stage drop-the-loser trial can itself be viewed as a two-stage approach. That is, only the stage 1 data are used in the standard shrinkage estimator L L s , and then, stage 2 data on the selected treatment, Y s , are added separately in equation (8) afterwards. Although L CB s has the nice dominance property over the MLE, it is useful to consider whether it can itself be improved upon. For example, why not use all of the stages 1 and 2 data to define a single shrinkage estimator? Although this sounds straightforward, it does impose some extra complications. Despite being only truly concerned with estimation of the top-ranking treatment's mean, s , Carreras and Brannath's method is defined to find shrinkage estimates given k parameter estimates with a common variance. This means that we are assuming the posterior distribution for i implied by (7) (ignoring the stage 2 data), which enables the use of L L s from (5). However, if we use all of the data (including the stage 2 data Y s ), we may assume the following set-up: Further assuming that the i s are stochastically independent and the O i s are conditionally independent given i , then becomes the distribution with which to construct shrinkage estimates for the i s. The fact that W i is not constant makes specification of single appropriate shrinkage factor C and estimate for the grand mean far less straightforward. In Section 4, we will discuss estimation for this new setting, pointing out its connection with model fitting in the area of meta-analysis.

Direct estimation
Under the framework of equation (9), if the parameters and 2 were known, the best estimate for s is given by Furthermore, with infinite data, it is clear that directly replacing and 2 in this expression with consistent estimates, O and O 2 , would give equation (10). If one assumes that 2 is known, then the variance of the posterior distribution in (9) also becomes completely known. It is then possible to show that the MLE for is where q D 2 1 2 2 C 2 2 1 C 2 2 and N X s D 1=.k 1/ P i ¤s X i . 2 could then be estimated by maximising the posterior likelihood of (9) with respect to 2 at D O 2 . See the Appendix for further details. We will refer to the estimate for s obtained by directly plugging in the previous estimates to equation (10) as O MPL s . Although this approach is fairly crude, it will be interesting to observe the performance of O MPL s compared to several 'legitimate' shrinkage estimators that are now introduced.

Shrinkage estimation: Carter and Rolph's standard prior approach
Carter and Rolph [16] investigate shrinkage methods for jointly estimating the parameters of a k component system with unequal variances, as set up in equation (9). We will use it specifically to yield an estimate for s in the context of a two-stage drop-the-loser trial of the form so as to approximate equation (10). Although not immediately obvious, it can be shown that their suggested approach is equivalent the following procedure. First define a new weight V i D .W i C 2 /, and find the value of 2 that solves Q. 2 If Q. 2 / D k 1 for a 2 < 0, then the estimate is truncated to 0. Q. 2 / is known in the area of medical meta-analysis as the 'generalised' Q statistic [17,18], and this estimation method for is known as the Paule-Mandel method of moments algorithm [19,20]. Let the estimate for 2 derived in this manner be equal to O 2 PM . The shrinkage factor for s is then approximated by The form of O C s may appear complicated, but it has the nice property that if the W i s are all equal, then it reduces to the original O C given in equation (6). In our context, the W i s can only approach equality as 2 2 ! 1 or as the stage 2 sample size tends to zero. O C s can be inserted into equation (12) along with the MLE O s and grand mean O given by O O 2 PM , to yield a new shrinkage estimator for s . We will refer to this estimator as L 2 s -the ' 2 ' denoting that this parameter is additionally and explicitly estimated.

Shrinkage estimation: Carter and Rolph's proportional prior approach
Carter and Rolph [16] also propose an alternative method for applying shrinkage estimation within the unequal variance context, which usefully avoids an iterative estimation of 2 . It relies on the assumption of a different prior distribution for the treatment parameters, namely i N. ; W i 2 /. This asserts that the prior uncertainty around each treatment's mean is directly proportional to the variance of its estimate, O i . By replacing 2 with W i 2 in (9), it is clear that this implies the posterior distribution: the mean of which becomes an alternative target to estimate. Because this mean does not depend on W i , it suffices to calculate a single shrinkage factor O C using all of the data. This can then, in conjunction with an estimate for , be used to estimate s via equation (12). Turning first to estimation of : Under the proportional prior, the unconditional distribution of the estimates is O i N ; 2 i .1 C 2 / . Therefore, O .0/ is referred to in meta-analysis as the 'fixed-effects' estimate for , as opposed to a 'random-effects' estimate, of which O O 2 PM is an example. Of course, when 2 is estimated to be 0, they are equal. Turning now to estimation of s via O C : It can easily be shown that Carter and Rolph's approach in this context is equivalent to choosing: O C .0/ can be seen as a simple generalisation of the O C in equation (6) for the case where the W i terms are not constant. Q.0/ is known as Cochrane's heterogeneity statistic in meta-analysis and is closely related to the DerSimonian and Laird estimator for 2 , 2 DL [20]. For example, when Q.0/ > k 1, where N 2 is called the 'typical' within study variance and I 2 is a popular measure of 'inconsistency' (heterogeneity) among studies in a meta-analysis [21]. We will refer to the resulting estimator (which utilises O C .0/ and O .0/) as L 0 s .

Incorporating 'Limiting Translation'
It is well known that shrinkage methods can perform poorly with respect to specific parameter components of a larger system, when the magnitude of the specific parameters are among the most extreme. For this reason, Efron and Morris [22] suggest the use of a 'Limiting Translation' (LT) strategy that constrains one shrinkage estimator to be within a certain distance of the MLE. Efron [14] and Johnson [23] suggest a practical choice for this constraint of one unit of the MLE's standard error. The effect of applying LT to a single extreme parameter component of a larger system, say a i from 1 ; :::; k , is to dramatically reduce the i'th contribution to the overall MSE of equation (4), at the expense of increasing the total value of equation (4) by a small margin. Although from equation (3), we can see that, at the trial outset, s is not a single parameter but rather a weighted mixture of all k fixed parameter values 1 ; :::; k , the specific values of those parameters may mean that s is consistently an outlier. For example, this would certainly be the case if one treatment was far more effective than any other because it would monopolise the value of s . We can apply LT to the shrinkage estimator L 0 s by subtracting O s from equation (12), constraining the result to be 6 p W s and noting that the definition of O C .0/ in equation (16) becomes This estimator will be referred to as L 0 s .LT /. LT versions of all other estimators are clearly possible but are not considered here.

Some implications of using L 0 s
When deriving the form of L 0 s , there is no inherent mathematical difficulty in assuming an N. ; W i 2 / prior distribution (with varying W i s) for the i s because the resulting posterior distribution for i j O i remains in the normal family. However, this shrinkage approach does raise certain philosophical questions when applied in the context of a two-stage drop-the-loser trial. The primary issue is that we do not know a priori which treatment will be selected. So, assigning the N. ; W s 2 / prior to s (for general values of and 2 ) is only possible after we have observed the first stage data. This violates the principle of 'Temporal Coherency' [24] that states that the prior must be specified in advance and constant in time. Indeed, this principle is overwhelmingly adhered to by practitioners of Bayesian inference in the interest of maintaining scientific objectivity. A consequence of this temporal violation, which becomes most apparent in Section 5, is that there is no general way to simulate data consistent with the assumptions of L 0 s . To understand this, suppose we wanted to generate trial data consistent with the shrinkage estimator L 2 s instead. We simply start by simulating the i s from an N. ; 2 / density given values for and 2 , which can then be used to generate the trial data for stages one and two (X 1 ; :::; X k ; Y s ). These data can then be used to specify the distributions O i j i and i j O i from equation (9). Clearly, we can not follow an equivalent data-generating procedure when the i s come from an N. ; W i 2 / prior density because, as previously stated, the prior can only be specified after seeing the data. The single exception is when 2 =0 (implying a degenerate normal prior) in which case the i s all take the value with probability 1. This is equivalent to assuming a fixed effects model with only one unknown parameter, .
Of course, despite these philosophical concerns, we are still free and able to evaluate L 0 s in a simulation study without exactly mimicking the data-generating process it relies upon.

Simulation study
We simulate trial data under a two-stage drop-the-loser design in order to quantify the bias and  (3), which can be simply and accurately approximated by averaging over all simulations where, in each single case, a treatment T i out of k is ranked top at the end of stage 1 so that s D i . We chose four different levels of standard error associated with the stage 1 and 2 estimates, along with four different scenarios for the six unknown means: Scenario I. True means N.0; 1/; Scenario II: True means all 0; Scenario III: One true mean D 1, 5 means equal to 0; Scenario IV: One true mean D 1.5, 5 means equal to 0.
The underlying distribution of the treatment parameters is a key factor driving the Bayesian motivation of any shrinkage estimator. Scenario I is compatible with the prior assumptions of L 2 s and L CB s . Scenario II is compatible with the prior assumptions of all shrinkage estimators, despite being nonstochastic, because it is equivalent to scenario I with 2 D 0. Carreras and Brannath's dominance result for L CB s is valid for scenarios I and II . Scenarios III and IV are not compatible with any shrinkage estimator. However, all scenarios are compatible with the assumptions of the UMVCUE, in the sense that it maintains its unbiasedness for any constellation of parameter values.
We show the results for the 16 simulation scenarios in Table I in terms of p MSE in 8 out of 16 simulations but is sometimes the worst estimator by far (e.g. simulations 11 and 15). Unlike the other shrinkage estimators, it is also negatively biased in general. L 0 s and L 0 s .LT / perform similarly and are consistently the most reliable estimators across the 16 scenarios. It is perhaps surprising that their similarity extends to scenarios III and IV, where one would suspect the LT strategy would come into play. This implies that the difference between O s and L 0 s is still almost always less than p W s . Figures 1-3 show the results of three further simulation studies. In each case, 1 D 2 D 1. Figure 1 shows the scaled bias and p MSE of the estimators for a trial with k D 6 treatments, 5 of which have true mean 0 and one of which has true mean ı, as ı is varied between 0 and 5. In order to highlight the strength of the selection effect as a function of ı, we also plot the average value of s (labelled as 'EOE s '). One can see that the bias of the shrinkage estimators changes sign as ı increases whereas the bias of the MLE decreases from 0.   Figure 2 shows the results of a simulation for k D 6 treatments with mean parameters drawn from a N.0; 2 / distribution as 2 is varied between 0 and 4 (the choice of D 0 is clearly unimportant). The bias of all shrinkage estimators decreases towards 0 as 2 increases, although this happens most rapidly for L 0 s and L 0 s .LT / so that they are the least biased. For values of 2 greater than 1, O MPL s is consistently negatively biased. The p MSE of the shrinkage estimators appears to asymptote upwards towards p W s (towards 1 after scaling) as 2 increases, but remain below that of the MLE in this range.
In an effort to separate L 0 s and L 0 s .LT /, Figure 3 shows the results of a simulation assuming that the treatment mean parameters are drawn from a N.0; 1/ distribution, but the number of treatments, k, is varied between 5 and 20. As k increases, the positive bias exhibited by L 0 s decreases, quickly becoming large and negative. L CB s and L 2 s do not to suffer in the same way; their biases asymptote towards 0 as k increases. L 0 s .LT / appears to protect L 0 s well from its tendency for negative bias beyond k D 10. From the right-hand panel, one can see that the price L 0 s .LT / pays for this bias protection is an increase in p MSE. Interestingly, as k increases beyond 15, even the UMVCUE has a smaller p MSE than the MLE.

Summary of findings
Across all simulations, the performance of L 2 s is most similar to Carreras and Brannath's original estimator L CB s , but the two estimators that performed the best were L 0 s and O MPL s . However, the reasons for the latter's apparent success are now qualified. Estimation of the heterogeneity parameter 2 is  challenging when its true value is small or the number of treatments is small. When k is small or there is little apparent variation between treatment effect estimates . O 1 ; :::; O k /, the estimate for 2 is often likely to be at or close to zero, even when its true value is larger. This fact is well documented in the meta-analysis literature, where quantification of the (between trial) heterogeneity parameter is often not recommended unless the number of studies is sufficiently large (at least 10). Although poor estimation of 2 impacts L 2 s and O MPL s , the latter is most strongly affected because from equation (10) (14) is not equal to 0. This explains why O MPL s performs so well in Scenario II, Table I, and for small values of ı; 2 ; k in Figures 1-3 because in these situations, the true value of s is always close (or equal) to . Conversely, it also explains why it performs badly when s is truly very different from (e.g. Scenario IV, Table I, and large values of ı; 2 ; k in Figures 1-3).
The LT version of L 0 s only helped to improve its performance in simulations when the number of treatments rose above 10, which is unrealistically large for most clinical trials settings. It therefore appears to be an unnecessary extension in this context. However, it could potentially be implemented in a more sophisticated manner than we have here. For example, the width of the protection region around the MLE can be tuned to control its bias and MSE, rather than being fixed at a specific value as we did. Efron and Morris [22] provide the theoretical framework for doing this in a general shrinkage estimation context, but their method would need to be altered before application to drop-the-loser designs, in order to account for the selection mechanism. This is a topic for further research. One could also argue that by shrinking the prior variance for s after selection, L 0 s already contains and in-built form of LT.

Discussion
In this paper, we have reviewed the shrinkage estimation strategy of Carreras and Brannath [10] for two-stage drop-the-loser designs. As an extension, we propose that rather than using only the first stage data, the stage two data should also be used to furnish a single shrinkage equation. Although this strategy replaces two formulae with one, evaluation of this single shrinkage formula is much harder because the variance of its k estimated components are no longer equal. By incorporating the methods of Carter and Rolph [16] and Efron and Morris [22], we identified several alternative procedures. The alternative approaches necessitate either explicit estimation of an additional between treatment arm heterogeneity parameter 2 (for O MPL s and L 2 s ) or a different (and quite unnatural) prior distribution for the mean treatment effects (for L 0 s /. The new methods tend to outperform Carreras and Brannath's original estimator, but unfortunately, no equivalent dominance results could be shown. From looking at the simulations in totality, the estimator that consistently performs well when s is close to and far from the overall mean of all treatments is L 0 s . Some may object philosophically to its use in this context because of concerns over Temporal Coherency. However, as Cox [25] states: There may be certain situations where it is perfectly right to modify one's prior beliefs as more data become available. Furthermore, when one's prior uncertainty about a parameter is allowed to change over time in a manner proportional to the (increasing) size of the data sample, then the resulting Bayesian inference starts to approximate a classical significance test [26]. This is exactly what occurs in the drop-the-loser context when, at the point of selection, the variance of the prior for s shrinks from 2 1 2 to W s 2 -for example, by a factor of 2 when 2 1 D 2 2 . It is therefore pertinent to note that the formula for L 0 s can also be arrived at by applying Lindley's original equal variance shrinkage formula (5) to the standardised [16] because they are sufficient test statistics for the null hypothesis D 0. We chose to illustrate the different estimation approaches for trials involving five or more treatments. Apart from O MPL s , all of the shrinkage estimators discussed in this paper are only defined for k > 4, as indicated by the factors of .k 3/ they contain. However, we can crudely apply them for the k D 3 case by simply replacing these terms with (k 2) instead -as performed by Carreras and Brannath [10]. We repeated the simulations shown in Figures (1) and (2) for k D 3 using this crude fix to see how it affected the performance of the various estimators. The results (not shown) were qualitatively very similar.
Throughout this paper, we have attempted to stress the link between shrinkage estimation in the adaptive trial context with that of meta-analysis. We have shown that L 0 s , which incorporates the fixed effects estimate O .0/, works well as an estimator for s under drop-the-losers selection. It is therefore interesting to note the following: In meta-analyses that exhibit substantial amounts of between study heterogeneity, the random effects estimate for is known to be unreliable when the heterogeneity is thought to be driven by dissemination bias (i.e. selective reporting and publication of extreme findings) [18,27]. In order to address this, it has been advocated that the fixed effects estimate O .0/ be used as the preferred measure of overall effect instead [28,29]. Thus, despite the fact that in the adaptive trial setting, the parameter of interest is O s and in meta-analysis, it is the overall grand mean , when the data are affected by some form of selection, Carter and Rolph's proportional prior approach appears to be an effective solution to both problems.
Bowden and Glimm [30] have extended the idea of a two-stage drop-loser-trial to allow the best performing treatment to be identified over multiple stages. The motivation for adding further stages of selection is that one can markedly increase the probability of selecting the truly best treatment (and subsequently declaring it effective in a confirmatory analysis), whilst keeping trial costs to a minimum. Many other multi-arm multi-stage (MAMS) designs incorporating treatment selection rules have also been proposed with a similar motivation in mind, see for example [31,32]. Because they ignore the selection process altogether and use all of the data to define a target posterior distribution or shrinkage equation, O MPL s , L 2 s and L 0 s should be simple to apply in any of these contexts. It is not so obvious to see how Carreras and Brannath's original estimator, L CB s , would generalise to the multi-stage context or if the dominance results that make it attractive in the two-stage case would remain in intact.
A simple, straightforward translation of the shrinkage estimators proposed here to other multi-arm trial designs is only immediate if the k treatment effect estimates are independent before selection. If, as in the MAMS design of Royston et al. [33], the treatment effect summarised time-to-event data in the form of a log-hazard ratio, then the effect estimates would be intrinsically correlated across treatment arms because of their shared control group data. Extending shrinkage estimation to account for inter-dependence of this sort is another topic for further research.