Accounting for variation in the required sample size in the design of group-sequential trials

Introduction: Most literature on optimal group-sequential designs focuses on minimising the expected sample size. We highlight other factors for consideration. Methods: We discuss several quantities less often considered in adaptive design: the median and standard deviation of the random required sample size, and the probability of committing an interim error. We consider how the optimal timing of interim analyses changes when these quantities are accounted for. Results: Incorporating the standard deviation of the required sample size into an optimality framework, we demonstrate how and when accounting for this quantity means a group-sequential approach is no longer optimal. The optimal timing of an interim analysis is shown to be highly dependent on the pre-specified preference for minimising the expected sample size relative to its standard deviation. Conclusions: Examining multiple factors, which measure the advantages and disadvantages of group-sequential designs, helps determine the best design for a specific trial.


Introduction
Adaptive designs have received substantial recent attention, a response to escalating costs in the drug development process and to the potential of such designs to improve efficiency [1]. Amongst available types of adaptation, group-sequential (GS) designs are the most established in practice. In a GS approach, interim analyses that can terminate the trial are conducted after pre-planned numbers of participants have been observed. As a result, the average sample size required by a GS design is often lower than that of a corresponding fixed-sample design [2].
Here, we highlight the value of considering other factors in the optimality criteria that directly address the drawbacks of a GS approach. Through this, we extend previous work on optimal interim analysis timing in a way that allows examination of when prior desires indicate a fixed-sample approach should be preferred.

Group-sequential designs
Consider a balanced two-arm GS trial with a normally distributed outcome. Other settings can be treated similarly; we discuss single- and two-arm trials with Bernoulli data in the Supplementary Materials.
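Assume a maximum of $J$ analyses are allowed, with analysis $j \in \{1, \dots, J\}$ conducted using data from $t_j n$ patients on each arm, where $0 < t_1 < t_2 < \dots < t_J = 1$. Outcomes $X_{ik} \sim N(\mu_k, \sigma^2)$ are observed, where $k \in \{0, 1\}$ indicates treatment arm and $i \in \{1, \dots, n\}$ indicates participant. The variance of the outcomes, $\sigma^2$, is assumed known. Suppose we test $H_0 : \tau = \mu_1 - \mu_0 \le 0$, wishing to control the type-I error-rate to $\alpha \in (0, 1)$ when $\tau = 0$ and have power of $1 - \beta \in (0, 1)$ when $\tau = \delta > 0$. At analysis $j$, we use a Wald test-statistic,
$$T_j = \frac{1}{t_j n} \left( \sum_{i=1}^{t_j n} X_{i1} - \sum_{i=1}^{t_j n} X_{i0} \right) \sqrt{\frac{t_j n}{2\sigma^2}}.$$
Then,
$$T_j \sim N\left( \tau \sqrt{\frac{t_j n}{2\sigma^2}},\ 1 \right), \qquad \mathrm{Cov}(T_{j_1}, T_{j_2}) = \sqrt{t_{j_1} / t_{j_2}} \quad \text{for } j_1 \le j_2.$$
A design requires upper (efficacy) and lower (futility) stopping boundaries, which we denote by $e = (e_1, \dots, e_J)$ and $f = (f_1, \dots, f_J)$, with $e_j > f_j$ for $j \in \{1, \dots, J - 1\}$. We also set $e_J = f_J$, so that the trial terminates with a decision on whether to reject $H_0$ after at most $J$ stages. The decision rules at analysis $j$ are
• if $T_j > e_j$, terminate the trial, rejecting $H_0$;
• else if $T_j \le f_j$, terminate the trial, but do not reject $H_0$;
• else continue the trial to stage $j + 1$.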
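The test statistic and decision rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `wald_statistic` and `decide` are hypothetical, and the statistic is assumed to take the standard known-variance form, that is, the difference in arm means scaled by the square root of m / (2σ²) for m patients per arm.

```python
from math import sqrt
from statistics import fmean


def wald_statistic(x1, x0, sigma2):
    """Known-variance Wald statistic for H0: mu1 - mu0 <= 0.

    x1, x0: outcomes observed so far on the experimental and control arms
    (equal length, since the trial is balanced). sigma2: known variance.
    """
    if len(x1) != len(x0):
        raise ValueError("balanced design: both arms need the same size")
    m = len(x1)  # patients per arm at this analysis, i.e. t_j * n
    return (fmean(x1) - fmean(x0)) * sqrt(m / (2.0 * sigma2))


def decide(t_stat, e_j, f_j):
    """Apply the stage-j rules: efficacy stop, futility stop, or continue."""
    if t_stat > e_j:
        return "stop: reject H0"
    if t_stat <= f_j:
        return "stop: do not reject H0"
    return "continue"
```

Because the final boundaries coincide, calling `decide` with `e_j == f_j` can never return "continue", which mirrors the guarantee that the trial ends after at most J stages.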
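In total, a design is given by choices for $n$, $t = (t_1, \dots, t_J)$, $e$, and $f$. Using these, and the implied distribution of $\mathbf{T}_j = (T_1, \dots, T_j)$, we can compute the probability of stopping for efficacy ($E_j$) or futility ($F_j$) at analysis $j$ using
$$E_j(\tau \mid n, t, e, f) = \int_{f_1}^{e_1} \cdots \int_{f_{j-1}}^{e_{j-1}} \int_{e_j}^{\infty} \phi_j\{\mathbf{x}_j, \mathbb{E}(\mathbf{T}_j), \mathrm{Cov}(\mathbf{T}_j)\} \, dx_j \cdots dx_1,$$
$$F_j(\tau \mid n, t, e, f) = \int_{f_1}^{e_1} \cdots \int_{f_{j-1}}^{e_{j-1}} \int_{-\infty}^{f_j} \phi_j\{\mathbf{x}_j, \mathbb{E}(\mathbf{T}_j), \mathrm{Cov}(\mathbf{T}_j)\} \, dx_j \cdots dx_1,$$
where $\phi_j\{\mathbf{x}_j = (x_1, \dots, x_j), \mu, \Sigma\}$ is the PDF of a $j$-dimensional multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$ evaluated at $(x_1, \dots, x_j)$. Furthermore, the probability $H_0$ is rejected and the ESS for any $\tau$ can be found with
$$P(\tau \mid n, t, e, f) = \sum_{j=1}^{J} E_j(\tau \mid n, t, e, f), \qquad \mathrm{ESS}(\tau \mid n, t, e, f) = 2n \sum_{j=1}^{J} t_j \{E_j(\tau \mid n, t, e, f) + F_j(\tau \mid n, t, e, f)\}.$$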
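For $J = 2$, the stopping probabilities described above reduce to one-dimensional integrals that can be evaluated with the Python standard library. The sketch below is illustrative and is not taken from the authors' repository: the function names are hypothetical, it assumes $0 < t_1 < 1$ and finite $e_1$ and $f_1$, it uses Simpson's rule in place of a multivariate normal routine, and it reports the ESS as a total over both arms (an assumption about scale).

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def two_stage_probs(tau, n, t1, e, f, sigma2=1.0, steps=2000):
    """Return (E1, F1, E2, F2) for a two-stage design with t = (t1, 1).

    e = (e1, e2) and f = (f1, f2) with f2 == e2; requires 0 < t1 < 1.
    The stage-2 term integrates the stage-1 density over the continuation
    region (f1, e1], using T2 | T1 = x ~ N(th2 + rho (x - th1), 1 - rho^2).
    """
    th1 = tau * sqrt(t1 * n / (2.0 * sigma2))  # E[T1]
    th2 = tau * sqrt(n / (2.0 * sigma2))       # E[T2]
    rho = sqrt(t1)                             # Corr(T1, T2)
    e1_prob = 1.0 - STD.cdf(e[0] - th1)
    f1_prob = STD.cdf(f[0] - th1)

    def integrand(x):
        dens = exp(-0.5 * (x - th1) ** 2) / sqrt(2.0 * pi)
        cond_mean = th2 + rho * (x - th1)
        z = (e[1] - cond_mean) / sqrt(1.0 - rho ** 2)
        return dens * (1.0 - STD.cdf(z))

    # Composite Simpson's rule on [f1, e1]; steps must be even.
    h = (e[0] - f[0]) / steps
    total = integrand(f[0]) + integrand(e[0])
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(f[0] + k * h)
    e2_prob = total * h / 3.0
    f2_prob = 1.0 - e1_prob - f1_prob - e2_prob
    return e1_prob, f1_prob, e2_prob, f2_prob


def power_and_ess(tau, n, t1, e, f, sigma2=1.0):
    """Rejection probability and expected total (two-arm) sample size."""
    e1p, f1p, e2p, f2p = two_stage_probs(tau, n, t1, e, f, sigma2)
    ess = 2.0 * n * (t1 * (e1p + f1p) + (e2p + f2p))
    return e1p + e2p, ess
```

Setting `tau=0.0` gives the type-I error-rate of a candidate design, and `tau` equal to $\delta$ gives its power, so the same routine can be used to check both operating characteristics.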
We can also calculate several other quantities, whose importance we will examine. These are the median sample size (MSS), the standard deviation of the sample size (SDSS), and the probability of an interim error (PIE). Like the ESS, each is a function of $\tau$ given $n$, $t$, $e$, and $f$, and each can be computed from the stage-wise stopping probabilities. While the PIE arises in error-spending designs [24], the MSS and SDSS are rarely mentioned in the GS design literature.
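Since the required sample size is a discrete random variable, the ESS, MSS, and SDSS follow directly from the possible sample sizes and the probability of stopping at each. The sketch below assumes this discrete view; the function name `sample_size_summaries` is hypothetical, and the median convention used when the cumulative probability equals exactly 0.5 (averaging the two adjacent sample sizes) is an assumption rather than a definition taken from the text. The PIE is not covered here.

```python
from math import sqrt


def sample_size_summaries(sizes, probs):
    """ESS, MSS, and SDSS of a discrete sample-size distribution.

    sizes: increasing sample sizes at which the trial can stop.
    probs: probability of stopping at each size (must sum to 1).
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("stopping probabilities must sum to 1")
    ess = sum(s * p for s, p in zip(sizes, probs))
    sdss = sqrt(sum((s - ess) ** 2 * p for s, p in zip(sizes, probs)))
    # Median: the first size whose cumulative probability exceeds 0.5.
    # If the cumulative probability equals 0.5 exactly, average that size
    # with the next one (assumed convention).
    cum = 0.0
    mss = sizes[-1]
    for idx, (s, p) in enumerate(zip(sizes, probs)):
        cum += p
        if abs(cum - 0.5) < 1e-12 and idx + 1 < len(sizes):
            mss = 0.5 * (s + sizes[idx + 1])
            break
        if cum > 0.5:
            mss = s
            break
    return ess, mss, sdss
```

A fixed-sample design corresponds to a single size with probability 1, for which the SDSS is zero and the ESS and MSS coincide; this is the sense in which a GS design introduces non-zero variation in the required sample size.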

Optimal designs
Many methods are available for specifying $e$ and $f$. We consider two approaches: Wang-Tsiatis boundaries, indexed by a shape parameter $\Delta$, and designs that allow early stopping for futility only. In either case, given $t$, we identify $n$ as the solution to
$$\operatorname*{argmin}_{n'} \{ n' : P(\delta \mid n', t, e, f) \ge 1 - \beta \},$$
where $P(\delta \mid n', t, e, f)$ is the probability that $H_0$ is rejected when $\tau = \delta$; i.e., the minimal value such that the power requirement is met. Thus, we have methods for choosing $n$, $e$, and $f$, for given $t$.
As in Togo and Iwasaki [16] and Xi et al. [22], we will focus on how to optimally determine $t$. Togo and Iwasaki [16] focused on Wang-Tsiatis boundaries, choosing $t$ to minimise $\mathrm{ESS}(\delta \mid n, t, e, f)$. Xi et al. [22] considered our other setting with futility stopping only, discussing several optimality criteria consisting of weighted sums of ESSs.
We seek to demonstrate how focusing solely on minimising ESSs can neglect the drawbacks of a GS approach. Namely, any use of a GS design involves a trade-off: the ESS is reduced at particular treatment effects, but the maximal sample size increases and non-zero variation is introduced in the required sample size. Therefore, modifying previous suggestions, we consider the optimal solutions to three optimality problems. The first two are
$$\operatorname*{argmin}_{t \in [0,1]^J : 0 < t_1 < \dots < t_J = 1} \; w\,\mathrm{ESS}(\theta \mid n, t, e, f) + (1 - w)\,\mathrm{SDSS}(\theta \mid n, t, e, f),$$
$$\operatorname*{argmin}_{t \in [0,1]^J : 0 < t_1 < \dots < t_J = 1} \; w\,\mathrm{ESS}(\theta \mid n, t, e, f) + (1 - w)\,\mathrm{MSS}(\theta \mid n, t, e, f),$$
and the third combines $\mathrm{ESS}(\theta \mid n, t, e, f)$ and $\mathrm{PIE}(\theta \mid n, t, e, f)$ in the same weighted manner after rescaling. Each is solved for different values of the weight $w \in [0, 1]$, which characterises the relative desire to minimise the two factors in the objective function. We consider $\theta \in \{0, \delta\}$, as these have been common choices historically in the optimal design literature. Note that rescaling is included in the third optimality problem as the two factors exist on different scales.
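To make the first optimality problem concrete, the following sketch performs a coarse grid search over $t_1$ for a two-stage design with futility stopping only. It is a simplified illustration under stated assumptions, not the authors' method: the futility bound is held at $f_1 = 0$ rather than optimised, $n$ is treated as continuous, sample sizes are totals over both arms, the grid and all function names are hypothetical, and the values of $\alpha$, $\beta$, $\delta$, and $\sigma^2$ are those used in the Examples section.

```python
from functools import lru_cache
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()
ALPHA, BETA, DELTA, SIGMA2 = 0.025, 0.1, 0.3, 1.0


def p_continue_reject(th1, th2, rho, f1, e2, steps=400):
    """P(T1 > f1, T2 > e2) by Simpson's rule over [f1, max(f1, th1) + 8]."""
    lo, hi = f1, max(f1, th1) + 8.0
    h = (hi - lo) / steps

    def g(x):
        dens = exp(-0.5 * (x - th1) ** 2) / sqrt(2.0 * pi)
        z = (e2 - th2 - rho * (x - th1)) / sqrt(1.0 - rho ** 2)
        return dens * (1.0 - STD.cdf(z))

    total = g(lo) + g(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3.0


def bisect(func, lo, hi, target, increasing, iters=40):
    """Solve func(x) = target for a monotone func on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


@lru_cache(maxsize=None)
def futility_design(t1, f1=0.0):
    """Return (n, e2): e2 gives type-I error ALPHA, n gives power 1 - BETA."""
    rho = sqrt(t1)
    e2 = bisect(lambda c: p_continue_reject(0.0, 0.0, rho, f1, c),
                0.0, 3.0, ALPHA, increasing=False)

    def power(n):
        drift = DELTA * sqrt(n / (2.0 * SIGMA2))
        return p_continue_reject(rho * drift, drift, rho, f1, e2)

    return bisect(power, 1.0, 2000.0, 1.0 - BETA, increasing=True), e2


def summaries(t1, theta, f1=0.0):
    """ESS and SDSS (totals over both arms) at treatment effect theta."""
    n, _ = futility_design(t1, f1)
    s1 = STD.cdf(f1 - theta * sqrt(t1 * n / (2.0 * SIGMA2)))  # stop at stage 1
    sizes, probs = (2.0 * t1 * n, 2.0 * n), (s1, 1.0 - s1)
    ess = sum(s * p for s, p in zip(sizes, probs))
    sdss = sqrt(sum((s - ess) ** 2 * p for s, p in zip(sizes, probs)))
    return ess, sdss


def fixed_sample_n():
    """Per-arm sample size of the single-stage design (closed form)."""
    z = STD.inv_cdf(1.0 - ALPHA) + STD.inv_cdf(1.0 - BETA)
    return 2.0 * SIGMA2 * z ** 2 / DELTA ** 2


def optimal_t1(w, theta, grid=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Grid value of t1 minimising w*ESS + (1-w)*SDSS; 1.0 means fixed-sample."""
    cands = {1.0: (2.0 * fixed_sample_n(), 0.0)}
    for t1 in grid:
        cands[t1] = summaries(t1, theta)
    return min(cands, key=lambda t: w * cands[t][0] + (1.0 - w) * cands[t][1])
```

Calling `optimal_t1(w, 0.0)` or `optimal_t1(w, DELTA)` for a range of `w` traces how the preferred interim timing moves between a fixed-sample design and a GS design on this grid; a faithful reproduction would also optimise the futility bound and use a finer search.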

Examples
To correspond to a typical confirmatory trial, we assume $\alpha = 0.025$ and $\beta = 0.1$. The results will not depend on the values of $\delta$ and $\sigma$; we set $\delta = 0.3$ and $\sigma^2 = 1$ arbitrarily. We focus on two-stage designs ($J = 2$); some results for $J = 3$ are given in the Supplementary Materials. We allow a fixed-sample ($J = 1$) design to be identified as optimal by allowing $t_1 = t_2 = 1$. Code for reproduction of our results is available from https://github.com/mjg211/article_code.
Fig. 3. The optimal values of $t_1$ and $f_1$ in the stopping-for-futility designs, as a function of $w \in [0, 1]$, the weight given to the expected sample size in the three optimality criteria, for $\theta \in \{0, \delta\}$.
Fig. 1 shows how the ESS, SDSS, MSS, and PIE vary with $t_1$, for several $\Delta$. It depicts why focusing solely on the ESS may lead to designs with large values for the SDSS, MSS, or PIE. For example, the smallest considered value of $t_1$ produces the smallest value of $\mathrm{ESS}(\delta)$ for all examined $\Delta$, but its SDSS curve is substantially more variable.
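One way to see why the particular values of $\delta$ and $\sigma$ do not matter is through the fixed-sample design, whose per-arm sample size has the closed form $n = 2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2 / \delta^2$. The sketch below, with hypothetical function names and an arbitrary second parameter pair chosen purely for comparison, checks that the mean of the final test statistic at $\tau = \delta$ is the same whatever $\delta$ and $\sigma^2$ are; only the scale of $n$ changes.

```python
from math import sqrt
from statistics import NormalDist


def fixed_sample_n(delta, sigma2, alpha=0.025, beta=0.1):
    """Per-arm sample size of a one-sided, known-variance two-arm test."""
    z = NormalDist().inv_cdf(1.0 - alpha) + NormalDist().inv_cdf(1.0 - beta)
    return 2.0 * sigma2 * z ** 2 / delta ** 2


def drift(delta, sigma2, n):
    """Mean of the Wald statistic at tau = delta with n patients per arm."""
    return delta * sqrt(n / (2.0 * sigma2))


n_a = fixed_sample_n(0.3, 1.0)  # the values used in the text
n_b = fixed_sample_n(0.5, 4.0)  # arbitrary comparison values (assumption)
same_drift = abs(drift(0.3, 1.0, n_a) - drift(0.5, 4.0, n_b)) < 1e-9
```

With the values used in the text, `n_a` rounds up to 234 patients per arm, which serves as the fixed-sample reference against which GS designs are compared.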

Wang-Tsiatis designs
Accordingly, in Fig. 2 we present the optimal value of $t_1$ for several $\Delta$, as a function of $w$, for the three optimality criteria. We consider only $\theta = \delta$, as the symmetry of Wang-Tsiatis bounds means $\theta = 0$ will always recommend a fixed-sample design.
For the optimality criteria incorporating $\mathrm{SDSS}(\delta)$ and $\mathrm{PIE}(\delta)$, small $w$ results in a single-stage design (i.e., $t_1 = 1$) being optimal. For larger $w$, the optimal timing of the interim analysis changes rapidly with $w$, indicating a large trade-off between the two factors that make up the optimality criteria. Results for the optimality problem involving $\mathrm{MSS}(\delta)$ are different: they indicate that, regardless of $\Delta$ and $w$, a value of $t_1$ in the region of 0.5 is optimal.

Futility stopping
Next, we consider optimising $t_1$ and $f_1$ in the designs that allow early stopping for futility alone. The solutions as a function of $w$ are given in Fig. 3 for $\theta \in \{0, \delta\}$ for the three optimality criteria.
For $\theta = 0$, with all three optimality criteria, $w$ need not be large for a GS approach (i.e., $t_1 < 1$) to be optimal. However, the optimal timing of the interim analysis and the optimal futility bound again often change quickly with $w$, and vary substantially across the optimality criteria. For $\theta = \delta$, allowing early stopping for futility only means a GS design is typically optimal only for larger $w$.

Discussion
Adaptive designs are not always useful [25]. In a GS design, several issues can arise; for example, it may not be clear what will happen to trial staff if a study terminates early. Costing the trial can also be more challenging, as it may be necessary to determine costs for each possible sample size. Such issues mean that, in some settings, the advantages a GS design brings may not outweigh the drawbacks.
Therefore, we have demonstrated the utility of directly considering factors that address these disadvantages when optimising the design. The optimal design was shown to be highly sensitive to the choice of weighting parameter $w$ (Figs. 2-3). In practice, a trialist could therefore choose to compromise on the ESS for, e.g., a large reduction in SDSS. By determining what factors matter most to their study, a trialist could use our approach to determine when an interim analysis could be most effectively timed, and indeed whether one should be conducted at all.
For two-stage single-arm trials with Bernoulli data, we also demonstrate in the Supplementary Materials how a small concession on the ESS may lead to much larger gains in terms of reducing the SDSS. This echoes similar findings from Hanfelt et al. [11] for the MSS.
In conclusion, we encourage those considering utilising a GS design to explicitly evaluate the MSS, SDSS, and PIE when choosing a design. They can have a notable impact on the optimal design, and may even indicate that a GS design would not be appropriate.

Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.