Two-stage designs optimal under the alternative hypothesis for phase II cancer clinical trials

The Simon two-stage optimal design is often used for phase II cancer clinical trials. A study proceeds to the second stage unless the null hypothesis, that the true tumour response rate is below some specified value, is already accepted at the end of stage one. The conventional optimal design, for given type 1 and type 2 error rates, is the one that minimises the expected sample size under the null hypothesis. However, at least some new agents are active, and designs that explicitly address this possibility should be considered. We therefore investigate novel designs that are optimal under the alternative hypothesis, that the tumour response rate is higher than the null hypothesis value, and also designs that allow early stopping for efficacy. We make available software for identifying the corresponding optimal and minimax designs. Considerable savings in expected sample size can be achieved if the alternative hypothesis is in fact true, without sample sizes suffering too much if the null hypothesis is true. We present an example discussing the merits of different designs in a practical context. We conclude that it is relevant to consider optimal designs under a range of hypotheses about the true response rate, and that allowing early stopping for efficacy is always advantageous in terms of expected sample size.


Introduction
The main aim of phase II cancer clinical trials is to evaluate the anti-tumour effect of a treatment, screening out agents that are insufficiently active and selecting active agents for future studies [1]. Efficacy is often determined by whether a treatment causes a tumour response (or not) in a small single arm trial [2], and interest is primarily in making the decision to progress an agent to a randomised phase III comparative study. The evidence to make this decision is evaluated by testing the null hypothesis that the true response rate is less than or equal to some pre-specified value. It is desirable to achieve this goal whilst minimising the number of patients exposed to a novel agent, and this led to the development of single arm designs that allow stopping for efficacy or futility. The Fleming design [3] did this by calculating critical values for accepting and rejecting the null hypothesis using the O'Brien and Fleming multiple testing procedure [4]. This design allowed early stopping but only controlled the type 1 and type 2 errors; there was no attempt to be "optimal" in terms of minimising the expected sample size. Optimality in this sense was introduced by the Simon two-stage design [5].
The Simon two-stage design only considers stopping for futility, and the optimal design has the smallest expected sample size when the null hypothesis is true. This paper investigates a novel definition of optimality based on the expected sample size when the alternative hypothesis, that the response rate is greater than the pre-specified value, is true. The rationale for only stopping for futility is that many novel agents do not work, and we want to minimise the number of patients exposed to an inactive drug, but in reality there are sufficient active agents to consider also stopping for efficacy [2]. One of the justifications given for only allowing stopping for futility in the Simon design was that when an agent has substantial activity there is interest in studying additional patients in order to estimate the proportion, extent and durability of response [5]. However, stopping for efficacy may save drug development time, bring useful treatments into clinical practice more quickly, and should reduce costs [6].
In a recent review of study designs in cancer, over 20% of all phase II studies with a reported statistical design were Simon designs and 45% were two-stage designs [2]. The main reasons for stopping trials early were futility (69%) and efficacy (13%); other reasons were toxicity and poor accrual. Many of the designs reviewed did not include stopping for efficacy as a possibility, and the studies would have benefitted, in terms of expected sample size, had they been optimal.
There have been several extensions to the Simon two-stage design, including the optimal three-stage design [7], the optimal three-stage design stopping for efficacy [6], admissible designs that balance the optimisation criteria of expected sample size and maximum sample size [8], a predictive probability design [9], balanced two-stage designs [10], and estimating response rates after using the Simon two-stage design [11]. All of these papers only consider the optimal design under the null hypothesis. This paper explores optimal two-stage designs, with stopping for futility and efficacy, under the alternative hypothesis. These new designs retain the same hypotheses and initial specification of the type 1 and type 2 errors as the Simon two-stage design. We consider only single arm studies, not randomised phase II trials.

Simon two-stage design
The hypotheses about the true response rate ($p$) to be tested by this single arm design are

$$H_0: p \le p_0 \quad \text{versus} \quad H_1: p \ge p_1,$$

where $p_0$ is the pre-specified fixed null response probability and $p_1$ is the minimum desired response probability required to progress the treatment to a later stage trial. Typical values of $p_0$ are below 0.3, and the target improvement $p_1 - p_0$ is typically between 0.1 and 0.2 [2].
Each Simon two-stage design [5] is indexed by four numbers, $r_1$, $r$, $n_1$ and $n$. The study is stopped early for futility if there are $\le r_1$ responders out of $n_1$ participants at stage 1, and the null hypothesis is not rejected. Otherwise the study proceeds to the second stage, with a total sample size $n$, and the null hypothesis is not rejected if there are $\le r$ responders at the end of the study. It is referred to as an "$r_1/n_1\ r/n$" design. Values for $r_1$, $r$, $n_1$ and $n$ are found for fixed $p_0$, $p_1$, $\alpha$ (the type 1 error probability) and $\beta$ (the type 2 error probability). A type 1 error occurs when there are $> r_1$ responders at the end of stage 1 and $> r$ responders at the end of the study when $p = p_0$. A type 2 error occurs if there are $\le r_1$ responders in stage 1 or $\le r$ responders at the end of the study when $p = p_1$.
The probability of not rejecting the null hypothesis, $\bar{R}(p)$, is a function of the true response rate $p$. The number of responders $X$, based on a true response rate $p$ and sample size $m$, has a Binomial distribution, and we write the distribution functions as $P(X = x) = b(m, p, x)$ and $P(X \le x) = B(m, p, x)$. It follows that the probability $\bar{R}(p)$ is as follows [5]:

$$\bar{R}(p) = B(n_1, p, r_1) + \sum_{x = r_1 + 1}^{\min(n_1, r)} b(n_1, p, x)\, B(n - n_1, p, r - x). \qquad (1)$$

An acceptable design is one that satisfies the error probability constraints $\bar{R}(p_0) \ge 1 - \alpha$ and $\bar{R}(p_1) \le \beta$; let $\Omega$ be the set of all such designs. A grid search is used to go through every combination of $r_1$, $r$, $n_1$ and $n$ with an upper limit for $n$, usually between 0.85 and 1.5 times the sample size for a single stage design [12]. The probability of terminating the study early, $PET(p)$, is the first term in Eq. (1), $B(n_1, p, r_1)$, and the expected sample size for this design, $E(N \mid p)$, is $n_1\, PET(p) + n\,(1 - PET(p))$. The optimal design under $H_0$ is the one in $\Omega$ that has the smallest expected sample size $E(N \mid p_0)$, and the minimax design has the smallest $E(N \mid p_0)$ amongst those designs in $\Omega$ with the smallest $n$. This paper will refer to the former as the $H_0$-optimal and the latter as the $H_0$-minimax design; these are the traditional Simon two-stage designs [5].
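The calculation and grid search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Stata/Mata program: the function names (`prob_not_reject`, `expected_n`, `search_h0_designs`), the upper limit `n_max = 30` and the two pruning shortcuts are all choices made here for the example. The design parameters $p_0 = 0.05$, $p_1 = 0.25$, $\alpha = \beta = 0.1$ are one of the settings considered in this paper.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def b(m, p, x):
    """Binomial probability P(X = x) for X ~ Binomial(m, p)."""
    if x < 0 or x > m:
        return 0.0
    return comb(m, x) * p**x * (1 - p) ** (m - x)


@lru_cache(maxsize=None)
def B(m, p, x):
    """Binomial distribution function P(X <= x)."""
    return sum(b(m, p, k) for k in range(0, min(x, m) + 1))


def prob_not_reject(p, r1, n1, r, n):
    """Eq. (1): probability of not rejecting H0 at true response rate p."""
    total = B(n1, p, r1)  # stopped for futility at stage 1
    for x in range(r1 + 1, min(n1, r) + 1):
        total += b(n1, p, x) * B(n - n1, p, r - x)
    return total


def expected_n(p, r1, n1, n):
    """E(N|p) = n1 * PET(p) + n * (1 - PET(p)), with PET(p) = B(n1, p, r1)."""
    pet = B(n1, p, r1)
    return n1 * pet + n * (1 - pet)


def search_h0_designs(p0, p1, alpha, beta, n_max):
    """Grid search; returns (H0-optimal, H0-minimax) as tuples (r1, n1, r, n)."""
    acceptable = []
    for n in range(2, n_max + 1):
        for n1 in range(1, n):
            for r1 in range(0, n1):
                if B(n1, p1, r1) > beta:
                    break  # PET(p1) alone exceeds beta; a larger r1 is worse
                for r in range(r1 + 1, n):
                    if prob_not_reject(p1, r1, n1, r, n) > beta:
                        break  # the type 2 error only grows as r increases
                    if prob_not_reject(p0, r1, n1, r, n) >= 1 - alpha:
                        acceptable.append((r1, n1, r, n))

    def en0(d):
        return expected_n(p0, d[0], d[1], d[3])

    optimal = min(acceptable, key=en0)
    minimax = min(acceptable, key=lambda d: (d[3], en0(d)))
    return optimal, minimax


optimal, minimax = search_h0_designs(0.05, 0.25, alpha=0.1, beta=0.1, n_max=30)
print("H0-optimal (r1, n1, r, n):", optimal)
print("H0-minimax (r1, n1, r, n):", minimax)
```

Replacing `p0` by `p1` inside `en0` would turn the same search into one for designs that minimise $E(N \mid p_1)$. The pruning relies only on the facts that $B(n_1, p, r_1)$ increases with $r_1$ and $\bar{R}(p)$ increases with $r$.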
Before investigating stopping for efficacy, we consider the optimal and minimax designs that minimise $E(N \mid p_1)$ while keeping the same type 1 and type 2 errors. These designs will be labelled the $H_1$-optimal and $H_1$-minimax designs, respectively. The probability of not rejecting the null hypothesis is the same as Eq. (1) but with the argument $p = p_1$. When the true response rate is $p_1$, the probability of early termination for futility cannot exceed the type 2 error, because

$$PET(p_1) = B(n_1, p_1, r_1) \le \bar{R}(p_1) \le \beta, \qquad (2)$$

and so the expected sample size is at least $\beta n_1 + (1 - \beta) n$. For $\beta$ near 0 the expected sample size is close to its upper bound of $n$, and the optimal design will be the one with the smallest $n$. So for studies with high power (small $\beta$) the $H_1$-optimal and $H_1$-minimax designs will be very similar.
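The lower bound on the expected sample size follows in one line from the definition of $E(N \mid p)$, using only $n > n_1$ and $PET(p_1) \le \beta$:

$$
\begin{aligned}
E(N \mid p_1) &= n_1\, PET(p_1) + n\,\bigl(1 - PET(p_1)\bigr) \\
              &= n - (n - n_1)\, PET(p_1) \\
              &\ge n - (n - n_1)\,\beta \\
              &= \beta n_1 + (1 - \beta)\, n .
\end{aligned}
$$

As $\beta \to 0$ the bound tends to $n$, which is why minimising $E(N \mid p_1)$ without efficacy stopping is close to minimising $n$ itself.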

Stopping for efficacy
It is possible to reduce expected sample sizes by terminating early for efficacy as well as futility. One obvious design is to stop for efficacy at the first stage when there are more than $r$ responders, since the criterion to reject the null hypothesis at the end of the second stage has already been met. Although this would have little efficiency advantage under $H_0$ [5], because there is only a small chance of stopping for efficacy, this will not necessarily be true for designs that are optimal under $H_1$.
We consider here a less conservative design that allows early stopping for efficacy when there are $> r_2$ responders at the first stage. The full design is as follows:
• recruit $n_1$ participants,
  - stop for futility if there are $\le r_1$ responders,
  - stop for efficacy if there are $> r_2$ responders (where $r_2 \le r$ and $r_1 < r_2 \le n_1$),
• recruit a further $n_2$ participants (where $n = n_1 + n_2$),
  - do not reject $H_0$ if there are $\le r$ responders,
  - reject $H_0$ if there are $> r$ responders.
This design is now indexed by five numbers and will be referred to as a "$(r_1\ r_2)/n_1\ r/n$" design.
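The decision rule of this five-number design can be written as a small function. The sketch below is illustrative only: the function name `run_design`, its return format and the example inputs $r_1 = 0$, $r_2 = 1$, $r = 3$, $n_1 = 10$, $n = 26$ are assumptions made for the example.

```python
def run_design(x1, x2, r1, r2, r, n1, n):
    """Apply the (r1 r2)/n1 r/n rule to observed responder counts.

    x1: responders among the first n1 participants
    x2: responders among the further n - n1 participants
        (ignored when the study stops at stage 1)
    Returns (reject_h0, participants_used, reason).
    """
    if x1 <= r1:
        return False, n1, "stopped for futility"
    if x1 > r2:
        return True, n1, "stopped for efficacy"
    total = x1 + x2  # the study continued to the second stage
    return total > r, n, "completed both stages"


# Hypothetical outcomes for an illustrative (0 1)/10 3/26 design.
for x1, x2 in [(0, 0), (2, 0), (1, 2), (1, 3)]:
    print((x1, x2), run_design(x1, x2, r1=0, r2=1, r=3, n1=10, n=26))
```

With these inputs the study reaches the second stage only when exactly one responder is seen among the first 10 participants.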
Using the same notation as previously, with the suffix $E$ representing stopping for efficacy or futility, the probability of not rejecting the null hypothesis is as follows:

$$\bar{R}_E(p) = B(n_1, p, r_1) + \sum_{x = r_1 + 1}^{r_2} b(n_1, p, x)\, B(n_2, p, r - x).$$

This equation only differs from Eq. (1) in that the summation upper limit is now bounded by $r_2$. As before we require that the error probabilities are controlled by $\bar{R}_E(p_0) \ge 1 - \alpha$ and $\bar{R}_E(p_1) \le \beta$. The probability of early termination for this design is as follows:

$$PET_E(p) = B(n_1, p, r_1) + 1 - B(n_1, p, r_2).$$

For the same values of $r_1$ and $n_1$, $PET_E(p) \ge PET(p)$. The optimal and minimax designs are found using a grid search given an upper limit for $n$, and for $p_0$ or $p_1$, and these designs are labelled $H_0$-optimalE, $H_0$-minimaxE, $H_1$-optimalE and $H_1$-minimaxE.
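As a numerical illustration of these two quantities, the following Python sketch evaluates the operating characteristics of a single design with efficacy stopping. The function names are assumptions made here, and the inputs are the $(0\ 1)/10\ 3/26$ design with $p_0 = 0.05$ and $p_1 = 0.25$, used purely as example values.

```python
from math import comb


def b(m, p, x):
    """Binomial probability P(X = x) for X ~ Binomial(m, p)."""
    if x < 0 or x > m:
        return 0.0
    return comb(m, x) * p**x * (1 - p) ** (m - x)


def B(m, p, x):
    """Binomial distribution function P(X <= x)."""
    return sum(b(m, p, k) for k in range(0, min(x, m) + 1))


def prob_not_reject_e(p, r1, r2, n1, r, n):
    """Probability of not rejecting H0 when stage 1 can stop either way."""
    total = B(n1, p, r1)  # stopped for futility
    for x in range(r1 + 1, r2 + 1):  # continue only if r1 < x <= r2
        total += b(n1, p, x) * B(n - n1, p, r - x)
    return total


def pet_e(p, r1, r2, n1):
    """Probability of stopping at stage 1, for futility or efficacy."""
    return B(n1, p, r1) + 1 - B(n1, p, r2)


def expected_n_e(p, r1, r2, n1, n):
    """Expected sample size with early stopping for futility or efficacy."""
    pet = pet_e(p, r1, r2, n1)
    return n1 * pet + n * (1 - pet)


r1, r2, n1, r, n = 0, 1, 10, 3, 26
p0, p1 = 0.05, 0.25
type1 = 1 - prob_not_reject_e(p0, r1, r2, n1, r, n)
type2 = prob_not_reject_e(p1, r1, r2, n1, r, n)
print(f"type 1 error = {type1:.4f}, type 2 error = {type2:.4f}")
print(f"PET_E(p1) = {pet_e(p1, r1, r2, n1):.3f}")         # 0.812
print(f"E(N|p1)   = {expected_n_e(p1, r1, r2, n1, n):.1f}")  # 13.0
```

Both error probabilities fall just under 0.1 for this design, so it is acceptable for $(\alpha, \beta) = (0.1, 0.1)$.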

Implementation
A program has been written to find the minimax and optimal designs under either $H_0$ or $H_1$ with stopping for efficacy and futility, or with stopping only for futility, using Stata 11 [13]. This command is freely available and can be downloaded via Stata using the command ssc install simon2stage. The program uses the Mata language to do a grid search of all the possible values of $r$, $r_1$, $r_2$, $n_1$ and $n$, has options to select stopping only for futility, or stopping for futility and efficacy, and can find the optimal or minimax designs for any value of the true response probability $p$. Only results for $p = p_0$ and $p = p_1$ are used in this paper.
Table 1 displays the minimax and optimal designs under $H_0$ and $H_1$, with and without efficacy stopping, for $p_0 = 0.05$ and $p_1 = 0.25$. In this case the minimax designs are the same when considering the null or alternative response probabilities. Stopping for efficacy always gives smaller expected sample sizes. Stopping for efficacy gives the largest gain in efficiency for the $H_0$-minimax design when $(\alpha, \beta) = (0.05, 0.1)$: the expected sample size under $H_0$ improves from 20.4 to 18.5 and under $H_1$ from 24.9 to 16.7. Stopping for efficacy has a much larger impact on $E(N \mid p_1)$ than on $E(N \mid p_0)$ because $PET(p_1)$ is small for designs that do not allow stopping for efficacy and large for those that do.
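The last point can be checked numerically. The sketch below assumes that the futility-only minimax design for $(\alpha, \beta) = (0.05, 0.1)$ is 0/15 3/25; this design is not stated in the text above and is an assumption made here, chosen because it reproduces the quoted expected sample sizes of 20.4 and 24.9. The function names are also illustrative.

```python
from math import comb


def B(m, p, x):
    """Binomial distribution function P(X <= x) for X ~ Binomial(m, p)."""
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(0, min(x, m) + 1))


def pet(p, r1, n1):
    """Probability of early termination for futility, B(n1, p, r1)."""
    return B(n1, p, r1)


def expected_n(p, r1, n1, n):
    """E(N|p) for a design that stops only for futility."""
    q = pet(p, r1, n1)
    return n1 * q + n * (1 - q)


# Assumed design 0/15 3/25 (r does not enter PET or E(N) without efficacy stopping).
r1, n1, n = 0, 15, 25
for label, p in [("p0", 0.05), ("p1", 0.25)]:
    print(f"PET({label}) = {pet(p, r1, n1):.3f}, E(N|{label}) = {expected_n(p, r1, n1, n):.1f}")
```

Under this assumption the chance of stopping early is about 0.46 when $p = p_0$ but only about 0.01 when $p = p_1$, so almost every study of an active agent runs to the full 25 participants unless efficacy stopping is allowed.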
The design with the smallest $E(N \mid p_1) = 13.0$ is the $H_1$-optimalE design for $(\alpha, \beta) = (0.1, 0.1)$ and is indexed as $(0\ 1)/10\ 3/26$; the study continues to a second stage only if there is a single responder in the first 10 participants. If the true response rate is $p_1$ then the probability of stopping at the first stage is 0.812 (0.756 for efficacy and 0.056 for futility). In fact this design also has the smallest value of $E(N \mid p_0) + E(N \mid p_1)$ in Table 1 for $(\alpha, \beta) = (0.1, 0.1)$ and so performs well under both hypotheses. The corresponding $H_1$-optimal design has an expected sample size of $E(N \mid p_1) = 19.8$, so the gains can be substantial if the treatment is efficacious. The traditional Simon two-stage design, the $H_0$-optimal design, has the highest expected sample size, $E(N \mid p_1) = 22.9$, of all the designs.

Table 2 displays the minimax and optimal designs under $H_0$ and $H_1$, with and without efficacy stopping, for $p_0 = 0.1$ and $p_1 = 0.3$. The minimax designs now show some considerable differences under $H_0$ and $H_1$: the first stages are different for the minimax designs not stopping for efficacy but the second stages $r/n$ are the same. There is no clear consistency in the direction of the change in $n_1$: it is 16 for the $H_0$-minimax design and 11 for the $H_1$-minimax design when $(\alpha, \beta) = (0.1, 0.1)$. In contrast, for the other type 1 and type 2 errors, $n_1$ is bigger under $H_1$ than under $H_0$. However, the impact on the expected sample sizes is small. Stopping for efficacy in minimax designs usually results in a reduction in expected sample sizes, but this is not the case when $(\alpha, \beta) = (0.05, 0.2)$, where $E(N \mid p_0)$ increases. The $H_0$-optimal design has a smaller $E(N \mid p_0)$ than all the minimax designs but has a much higher value of $E(N \mid p_1)$ (in some cases more than 50% higher). In Table 2 the $H_1$-optimal designs are exactly the same as the $H_1$-minimax designs. The $H_0$-optimalE and $H_1$-optimalE designs give small expected sample sizes and have similar values of $E(N \mid p_0) + E(N \mid p_1)$, but the $H_1$-optimalE design has much smaller values of $n$. So optimising under $H_1$ also seems to control the maximum sample size.
Table 3 displays the minimax and optimal designs under $H_0$ and $H_1$, with and without efficacy stopping, for $p_0 = 0.3$. The $H_0$-optimal designs have much larger expected sample sizes $E(N \mid p_1)$ than the $H_0$-minimax design. Stopping for efficacy lowers $E(N \mid p_1)$ substantially, but the effects on $E(N \mid p_0)$ differ depending on the error probabilities. For $(\alpha, \beta) = (0.1, 0.1)$ the expected sample size $E(N \mid p_0)$ was 35.0 for the $H_0$-minimax design but 32.7 for the $H_0$-minimaxE design. However, when $(\alpha, \beta) = (0.05, 0.2)$ it rose from 25.7 to 30.7 because $n_1$ increased from 19 to 27 while $n$ decreased from 39 to 36. The $H_1$-optimal designs are again exactly the same as the $H_1$-minimax designs. Stopping for efficacy can also slightly increase $n$, as seen for the $H_0$-optimal design with $(\alpha, \beta) = (0.1, 0.1)$. Amongst the optimal designs, the $H_1$ designs have smaller second stages than the $H_0$ designs and bigger first stages, but this does not have a huge impact on $PET(p_1)$. Again the $H_1$-optimalE design has the smallest values of $E(N \mid p_1) + E(N \mid p_0)$, except for $(\alpha, \beta) = (0.1, 0.1)$, where the $H_0$-optimalE design is best on this metric.

Acceptable designs for a range of n
Both the $H_0$-minimax and the $H_0$-optimal designs have been widely applied in the literature, and other designs have been largely ignored. However, these two designs can have highly divergent characteristics. For example, the $H_0$-minimax design can lead to a much smaller maximum sample size than the $H_0$-optimal design [8]; given the design parameters $(p_0, p_1, \alpha, \beta) = (0.1, 0.3, 0.05, 0.15)$, the $H_0$-minimax design is 2/18 5/27 and the $H_0$-optimal design is 1/11 6/35. Using the same design parameters, Fig. 1 shows the expected sample sizes under $H_0$ and $H_1$ for each optimal design over a range of values for $n$, starting from the maximum sample size for the $H_0$-minimax design. The figure shows that the $H_0$-optimal and $H_1$-optimal designs give almost identical expected sample sizes under $H_1$ and that these increase approximately linearly with $n$. Under $H_1$, only the $H_1$-optimalE design does not have a generally increasing expected sample size. Under $H_0$, the $H_0$-optimalE designs always have the smallest expected sample sizes and the $H_1$-optimalE designs have somewhat larger expected sample sizes, but the difference is variable.
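The divergence between the two traditional designs quoted above can be made concrete by evaluating $E(N \mid p)$ for each at $p_0 = 0.1$ and $p_1 = 0.3$. This is a minimal Python sketch; the helper names and the dictionary layout are assumptions made for the example, while the two designs and the response rates are those given in the text.

```python
from math import comb


def B(m, p, x):
    """Binomial distribution function P(X <= x) for X ~ Binomial(m, p)."""
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(0, min(x, m) + 1))


def expected_n(p, r1, n1, n):
    """E(N|p) = n1 * PET(p) + n * (1 - PET(p)) for futility-only stopping."""
    pet = B(n1, p, r1)
    return n1 * pet + n * (1 - pet)


designs = {
    "H0-minimax 2/18 5/27": (2, 18, 27),  # (r1, n1, n)
    "H0-optimal 1/11 6/35": (1, 11, 35),
}
results = {}
for name, (r1, n1, n) in designs.items():
    results[name] = (expected_n(0.1, r1, n1, n), expected_n(0.3, r1, n1, n))
    print(f"{name}: E(N|p0) = {results[name][0]:.1f}, E(N|p1) = {results[name][1]:.1f}")
# H0-minimax 2/18 5/27: E(N|p0) = 20.4, E(N|p1) = 26.5
# H0-optimal 1/11 6/35: E(N|p0) = 18.3, E(N|p1) = 32.3
```

The $H_0$-optimal design saves about two patients on average when the agent is inactive but costs nearly six more when it is active, in addition to its larger maximum sample size.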

Application of the $H_1$-optimalE design
A recent paper studying the effects of Pazopanib on soft tissue sarcoma (STS) [15] reported the results from four Simon two-stage designs; the Simon two-stage designs were used independently in four different strata defined by STS type. Each design sample sizes over the four strata, the total expected sample size for the $H_0$-optimal design is 140 and for the $H_1$-optimalE design is 97.9; on average this new design would have required 42 fewer patients. A single stage design would have required a sample size of 36 per stratum, with the null hypothesis being rejected if there were $\ge 11$ responders. In total the study would have required 144 patients, slightly more than the $H_0$-optimal design.

Discussion
This article highlights that if researchers have an agent that is effective then the Simon two-stage design is not an optimal design. In the extreme case, if the true response probability is 1, the study will never stop for futility and the expected sample size is $n$. In the less extreme case, for a true response probability of $p_1$, there is a chance smaller than $\beta$ of stopping the study, so studies with high power have only a small chance of being stopped early for futility. All the $H_1$-optimal designs reduce the expected sample size under the alternative compared to the traditional designs. However, this comes at the cost of slightly larger expected sample sizes under the null. The main effect of stopping for efficacy is that the probabilities of early termination increase, under both hypotheses, leading to efficiency gains over a broader set of true response probabilities.
In designing studies there is a choice whether to minimise the expected sample size or to minimise the maximum sample size. Our results show that the minimax design is beneficial if a treatment is efficacious: for this case the $H_1$-optimal and $H_0$-minimax designs were very similar. The only differences between the $H_1$-optimal and $H_0$-minimax designs were in the first stage; optimising under $H_1$ and not stopping for efficacy has a small probability of early termination when $p = p_1$, and hence the size of $n$ dominates the expected sample size. It is clear from the results presented (Tables 1-3) that the best designs vary between situations with different design parameters; this is because of the discrete nature of the binomial distribution and the exact probability calculations involved.
Only a few papers have investigated designs optimal for alternative values of the true response. Simon [16] explored a few values that were close to $p_0$ in designs that only stopped for futility and did not minimise the expected sample size, Shuster [17] looked at minimising the maximum expected sample size over a range of response rates, and Hanfelt et al. [18] considered minimising the median sample size. Our designs allow stopping for efficacy and focus on minimising the expected sample size under the alternative hypothesis. Another paper proposed an adaptive version of the Simon two-stage design [12]. This design considers two alternative response rates, $p_1$ and $p_2$, and the size of the second stage depends on the number of responders in the first stage. One optimality criterion was to minimise the maximum of the expected sample sizes at the null and the two alternative response rates, $\max\{E(N \mid p_i);\ i = 0, 1, 2\}$. The design developed here differs in that it is specified before the study begins but can consider a range of possible true response probabilities. It is still reliant on the null and alternative values for the true probability of response. Another approach would be to elicit a prior distribution for the true response probability and find a design that minimises the expected sample size over this distribution. This is likely to be difficult in practice because the discreteness of the Binomial distribution leads to a lack of ordering of designs and might require a discrete prior to be computationally possible. The other effect of the discreteness is that no single type of design is uniformly more efficient than any other, and the best design depends on what the true response probability is.
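The prior-weighted criterion mentioned above can be sketched for a discrete prior. Everything in this example is hypothetical: the function name `prior_expected_n`, the two-point prior placing equal weight on $p = 0.1$ and $p = 0.3$, and the use of the two futility-only designs 2/18 5/27 and 1/11 6/35 as candidates. It is not a method developed in this paper.

```python
from math import comb


def B(m, p, x):
    """Binomial distribution function P(X <= x) for X ~ Binomial(m, p)."""
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(0, min(x, m) + 1))


def expected_n(p, r1, n1, n):
    """E(N|p) for a two-stage design that stops only for futility."""
    pet = B(n1, p, r1)
    return n1 * pet + n * (1 - pet)


def prior_expected_n(prior, r1, n1, n):
    """Average E(N|p) over a discrete prior given as {p: weight}."""
    total_weight = sum(prior.values())
    return sum(w * expected_n(p, r1, n1, n) for p, w in prior.items()) / total_weight


# Hypothetical prior: the agent is equally likely to be inactive or active.
prior = {0.1: 0.5, 0.3: 0.5}
candidates = {"2/18 5/27": (2, 18, 27), "1/11 6/35": (1, 11, 35)}
scores = {name: prior_expected_n(prior, *d) for name, d in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> smallest prior-averaged E(N):", best)
```

With equal weights this criterion orders designs in the same way as $E(N \mid p_0) + E(N \mid p_1)$; a full implementation would also have to check the error constraints for every candidate in the grid.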
Newer treatments, such as cytostatic drugs, may require randomised phase II studies [19], and these studies can also incorporate an interim analysis to stop for efficacy or futility. Any two-stage randomised design should consider optimality criteria with both the alternative and null values as possible true response rates. The effect of changing the optimality criteria in extensions of the Simon single arm two-stage design, such as the three-stage design, could also be assessed. It would also be possible to use other definitions of optimality, such as minimising $E(N \mid p_0) + E(N \mid p_1)$. The methods here could be further improved by using a curtailed sampling approach [14] and stopping each part as soon as the desired number of responders has been reached. If the aim of the study is to estimate the response probability then the equivalence between p-values and confidence intervals can be used to obtain interval estimates [11,20].
This article has presented a new set of designs that consider stopping for efficacy as well as futility under $H_0$ and $H_1$. All the designs can be found using freely available software implemented in the common statistical package Stata. Although our focus has been on tumour response in cancer trials, these designs could be used for any binary outcome. The issues covered here should stimulate more discussion during the design of new studies and encourage consideration of a broader range of possibilities instead of using traditional designs to fit all situations.

Table 1
Minimax and optimal designs, with and without stopping for efficacy and under both hypotheses, when $p_0 = 0.05$ and $p_1 = 0.25$.