Sample size re-assessment leading to a raised sample size does not inflate type I error rate under mild conditions

Background: One major concern with adaptive designs, such as sample size adjustable designs, has been the fear of inflating the type I error rate. However, it is proven in (Stat Med 23:1023-1038, 2004) that when observations follow a normal distribution and the interim result shows promise, meaning that the conditional power exceeds 50%, the type I error rate is protected. This bound and the distributional assumptions may seem to impose undesirable restrictions on the use of these designs. In (Stat Med 30:3267-3284, 2011) the possibility of going below 50% is explored, and a region that permits an increased sample size without inflation is defined in terms of the conditional power at the interim.

Methods: A criterion which is implicit in (Stat Med 30:3267-3284, 2011) is derived by elementary methods and expressed in terms of the test statistic at the interim to simplify practical use. Mathematical and computational details concerning this criterion are exhibited.

Results: Under very general conditions the type I error rate is preserved under sample size adjustable schemes that permit a raise. The main result states that for normally distributed observations, raising the sample size when the result looks promising, where the definition of promising depends on the amount of knowledge gathered so far, guarantees the protection of the type I error rate. Also, in the many situations where the test statistic approximately follows a normal law, the deviation from the main result remains negligible. This article provides details regarding the Weibull and binomial distributions and indicates how one may approach these distributions within the current setting.

Conclusions: There is thus reason to consider such designs more often, since they offer a means of adjusting an important design feature at little or no cost in terms of error rate.

http://www.biomedcentral.com/1471-2288/13/94

… accumulated knowledge at interim, amounts to at least 50%. This article will show that a less strict bound applies, in agreement with [6], exhibit the bound in terms of a test statistic, and present mathematical as well as computational aspects of it.

Assumptions
Denote the planned final sample size by $N_0$, the number of patients available at the pre-planned interim analysis by $n$, and the possible raise, determined at the interim taking conditional power into account, by $r$. Let us consider a one-sided test at level $\alpha$ based on observing $X_1, \ldots, X_{N_{\text{final}}}$. Here $N_{\text{final}} = N_0$ or $N_{\text{final}} = N = N_0 + r$, depending on a decision taken during the course of the trial. The main result assumes a normal distribution but, as will be outlined, it still holds, at least approximately, for more general distributions. Accordingly, assume the $X_i$ to be independent normal with mean $\theta$ and variance 1. The null hypothesis states that $\theta = 0$. Define the normalised test statistic by $Z(x) = \sum_{i=1}^{x} X_i/\sqrt{x}$. The test rejects if $Z(N_{\text{final}}) > z_\alpha$, where $z_\alpha$ is the $100(1-\alpha)$ percentile of the standard normal distribution: $\Phi(z_\alpha) = 1 - \alpha$, $\Phi$ being the cumulative distribution function of the standard normal distribution. The normalised test statistic $Z(n) = \sum_{i=1}^{n} X_i/\sqrt{n}$ is observed when $n$ patients have provided data, and the Data Monitoring Committee (DMC) will in part base its recommendations on the observed value. At this interim analysis an adaptation may lead to closing the study due to futility, continuing the study without changes, or raising the sample size by recruiting an extra $r$ subjects, yielding a total of $N = N_0 + r$ subjects. Closing the study due to futility can only decrease the type I error rate. So let us, for the sake of argument, disregard that possibility and show that the type I error rate still remains protected.
The study protocol will specify $n$ and $N_0$, and at the interim we will consider raising the final sample size based on the conditional power evaluated at the current parameter estimate. Since the objective is to assess whether the interim results are promising, the current estimate of the parameter of interest gives the appropriate information [6]. As pointed out by Müller and Schäfer in [7], the overall type I error can be preserved unconditionally under any general adaptive change, provided the conditional type I error that would have been obtained had there been no adaptation is preserved. This article, however, only considers the case of SSA. Unlike the situation in [8], the design does not permit sequential testing. Also, the article only considers conventional hypothesis tests and p-values without adjustments.
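We assess the conditional error rate as a function of $r$. The unconditional error rate is the expectation of the conditional error rate over the distribution of $Z(n)$, and without adaptation this expectation equals $\alpha$. Hence, by showing that the conditional type I error rate is bounded by the conditional error rate of the design without adaptation, we prove that the unconditional error rate is controlled at the pre-specified level $\alpha$.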

Derivation of the main result
We use the notation $X \sim N(\mu, \sigma^2)$ to signify that $X$ follows a normal law with mean $\mu$ and variance $\sigma^2$.
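Under the null hypothesis and conditional on $Z(n) = z$, the final statistic satisfies $Z(N) \sim N\bigl(z\sqrt{n/N},\, 1 - n/N\bigr)$, and similarly for $Z(N_0)$. The change in type I error rate conditional on a sample size increase decided at the interim therefore equals

$$G(r) = P\bigl(Z(N) > z_\alpha \mid Z(n) = z\bigr) - P\bigl(Z(N_0) > z_\alpha \mid Z(n) = z\bigr).$$

Expressing the difference in terms of normal distributions yields

$$G(r) = \Phi\!\left(\frac{z_\alpha - z\sqrt{n/N_0}}{\sqrt{1-n/N_0}}\right) - \Phi\!\left(\frac{z_\alpha - z\sqrt{n/N}}{\sqrt{1-n/N}}\right).$$

Now, in order to show this difference to be less than or equal to zero, it may, since $\Phi$ is strictly increasing, equivalently be shown that the difference of the arguments is negative (in the sense of non-positive); denote this difference by

$$H(r) = \frac{z_\alpha - z\sqrt{n/N_0}}{\sqrt{1-n/N_0}} - \frac{z_\alpha - z\sqrt{n/N}}{\sqrt{1-n/N}}.$$

Obviously $H(0) = 0$.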
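The conditional error rates and their difference $G(r)$ can be evaluated directly. The following Python sketch uses only the standard library; the function names are illustrative, not taken from the article, and it assumes known unit variance and the conditional law of $Z(m)$ given $Z(n) = z$ stated above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def conditional_error(z: float, n: int, m: int, alpha: float) -> float:
    """P(Z(m) > z_alpha | Z(n) = z) under H0 (theta = 0, unit variance)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    t = n / m  # information fraction; Z(m) | Z(n) = z ~ N(z*sqrt(t), 1 - t)
    return 1 - STD_NORMAL.cdf((z_alpha - z * sqrt(t)) / sqrt(1 - t))


def G(z: float, n: int, N0: int, r: int, alpha: float = 0.025) -> float:
    """Change in conditional type I error rate when N0 is raised to N0 + r."""
    return conditional_error(z, n, N0 + r, alpha) - conditional_error(z, n, N0, alpha)


# Illustrative values: a raise of r = 40 after n = 55 of N0 = 110 observations
print(G(0.5, 55, 110, 40))  # positive: the raise would inflate the conditional error
print(G(2.5, 55, 110, 40))  # negative: the raise lowers the conditional error
```

With $r = 0$ the two conditional error rates coincide, mirroring $H(0) = 0$.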
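To simplify notation put $q = n/(N_0 + r)$ and $V = (N_0 + r)/N_0$ for arbitrary $n$, $N_0$ and $r$ satisfying $N_0 > n > 0$ and $r > 0$. Note that $qV = n/N_0$, so that $V > 1$ and $0 < q < qV < 1$. Then we aim to show

$$H(r) = \frac{z_\alpha - z\sqrt{qV}}{\sqrt{1-qV}} - \frac{z_\alpha - z\sqrt{q}}{\sqrt{1-q}} \le 0.$$

Collecting the terms in $z_\alpha$ and $z$, this is equivalent to

$$z_\alpha\left(\frac{1}{\sqrt{1-qV}} - \frac{1}{\sqrt{1-q}}\right) \le z\left(\frac{\sqrt{qV}}{\sqrt{1-qV}} - \frac{\sqrt{q}}{\sqrt{1-q}}\right),$$

where both brackets are positive because $qV > q$ and the functions $x \mapsto 1/\sqrt{1-x}$ and $x \mapsto \sqrt{x}/\sqrt{1-x}$ are increasing on $(0, 1)$. Dividing by the multiplier of $z$, multiplying numerator and denominator by $\sqrt{1-q}\,\sqrt{1-qV}$ and cancelling out, we obtain

$$z \ge z_\alpha\,\frac{\sqrt{1-q} - \sqrt{1-qV}}{\sqrt{q}\,\bigl(\sqrt{V(1-q)} - \sqrt{1-qV}\bigr)}. \tag{1}$$

Now let us compare this bound to $z_\alpha\sqrt{n/N_0}$. Denote the bound in (1) by $z_\alpha b(q,V)$, and set out to prove $b(q,V) \le \sqrt{qV}$. By subtracting $\sqrt{qV}$ from both sides and equating denominators we have

$$b(q,V) - \sqrt{qV} = \frac{\sqrt{1-q}\,(1-qV) - \sqrt{1-qV}\,\bigl(1-q\sqrt{V}\bigr)}{\sqrt{q}\,\bigl(\sqrt{V(1-q)} - \sqrt{1-qV}\bigr)}.$$

But the denominator is positive under current assumptions, since $V(1-q) = V - qV > 1 - qV$. Thus we may disregard it, and we need to prove that the numerator is non-positive. Division by $\sqrt{1-qV}$ and moving the right term to the right-hand side of the inequality symbol gives $\sqrt{1-q}\,\sqrt{1-qV} \le 1 - q\sqrt{V}$, and squaring both sides, which are positive because $q\sqrt{V} < qV < 1$, yields

$$(1-q)(1-qV) \le \bigl(1-q\sqrt{V}\bigr)^2.$$

By expanding both products, eliminating the common terms $1$ and $q^2V$ and multiplying both sides by $-1/q$, we finally have

$$V + 1 \ge 2\sqrt{V}, \quad\text{that is,}\quad \bigl(\sqrt{V} - 1\bigr)^2 \ge 0,$$

which is true for all positive $V$. Now regard $b$ as a function of $r$ for fixed $n$ and $N_0$. One may show that $b(r) \to \sqrt{n/N_0}$ as $r \to 0$. Further, $b$ decreases as $r$ grows, at first in a close to linear fashion.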
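The bound factor in (1) is easy to tabulate. A minimal Python sketch (function names are hypothetical) that computes $b(q,V)$ and lets one check numerically that $b \le \sqrt{n/N_0}$, that $b$ approaches $\sqrt{n/N_0}$ as $r \to 0$, and that $b$ decreases in $r$:

```python
from math import sqrt


def b(q: float, V: float) -> float:
    """b(q, V) from (1): the raise is harmless iff z >= z_alpha * b(q, V)."""
    num = sqrt(1 - q) - sqrt(1 - q * V)
    den = sqrt(q) * (sqrt(V * (1 - q)) - sqrt(1 - q * V))
    return num / den


def b_of_r(n: int, N0: int, r: float) -> float:
    """b as a function of the raise r > 0 for fixed n and N0."""
    return b(n / (N0 + r), (N0 + r) / N0)


n, N0 = 55, 110
for r in (0.001, 10, 40, 110, 1000):
    # b(r) stays below sqrt(n/N0) = sqrt(0.5) and decreases as r grows
    print(r, round(b_of_r(n, N0, r), 4), round(sqrt(n / N0), 4))
```

Very small $r$ is used instead of $r = 0$ because (1) is a $0/0$ expression at $r = 0$.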
Hence $z_\alpha b(q,V) \le z_\alpha\sqrt{n/N_0}$, where the threshold $z_\alpha\sqrt{n/N_0}$ will be seen to give the conditional power 50% (the simple criterion). Consequently, this new criterion is less restrictive than the one presented in [5] and, importantly, changes with $r$. The reference [6] provides an example where the type I error remains intact although the conditional power descends to 36%.
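To obtain the conditional power, note that when $\theta$ is replaced by its current estimate $\hat\theta = z/\sqrt{n}$, the planned final statistic satisfies $Z(N_0) \mid Z(n) = z \sim N\bigl(z/\sqrt{qV},\, 1 - qV\bigr)$, so that the conditional power of the planned design equals

$$\mathrm{CP}(z) = \Phi\!\left(\frac{z/\sqrt{qV} - z_\alpha}{\sqrt{1-qV}}\right).$$

In particular, $\mathrm{CP}(z) = 50\%$ exactly when $z = z_\alpha\sqrt{qV} = z_\alpha\sqrt{n/N_0}$. The minimum of this probability over $z \ge b(q,V)z_\alpha$ equals

$$\Phi\!\left(\frac{z_\alpha\bigl(b(q,V)/\sqrt{qV} - 1\bigr)}{\sqrt{1-qV}}\right).$$

From the definition of $G(r)$ it follows that one cannot go further without increasing the conditional error rate: for $z < z_\alpha b(q,V)$ we have $H(r) > 0$ and hence $G(r) > 0$. In this sense the bound is optimal.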
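Because the unconditional type I error rate equals $\alpha$ plus the integral of $\varphi(z)\,G(r)$ over the interim values at which the raise is made, the optimality statement can be checked numerically. The Python sketch below is ours, not the article's: it uses midpoint-rule integration and truncates at $z = 8$, and it compares raising only when $z \ge z_\alpha b(q,V)$ with raising for $0 \le z < z_\alpha b(q,V)$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def b(q: float, V: float) -> float:
    """b(q, V) from (1)."""
    num = sqrt(1 - q) - sqrt(1 - q * V)
    den = sqrt(q) * (sqrt(V * (1 - q)) - sqrt(1 - q * V))
    return num / den


def G(z: float, n: int, N0: int, r: int, z_alpha: float) -> float:
    """Normal-case change in conditional type I error, given Z(n) = z."""
    def arg(m: int) -> float:
        t = n / m
        return (z_alpha - z * sqrt(t)) / sqrt(1 - t)
    return STD_NORMAL.cdf(arg(N0)) - STD_NORMAL.cdf(arg(N0 + r))


def inflation(n, N0, r, alpha, lo, hi, steps=20_000):
    """Change in the unconditional type I error rate when the raise r is made
    exactly for lo <= Z(n) < hi (midpoint rule for the integral of phi * G)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    h = (hi - lo) / steps
    return sum(
        STD_NORMAL.pdf(lo + (i + 0.5) * h) * G(lo + (i + 0.5) * h, n, N0, r, z_alpha) * h
        for i in range(steps)
    )


n, N0, r, alpha = 55, 110, 40, 0.025
cut = STD_NORMAL.inv_cdf(1 - alpha) * b(n / (N0 + r), (N0 + r) / N0)
print(inflation(n, N0, r, alpha, cut, 8.0))  # <= 0: raise only when promising
print(inflation(n, N0, r, alpha, 0.0, cut))  # > 0: raising below the bound inflates
```

The first integral is non-positive because $G(r) \le 0$ throughout its range; the second is positive because $G(r) > 0$ below the bound.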

Weibull distributed survival times
We will now study the situation where survival times follow a Weibull distribution and right censoring time points are exponentially distributed.
In [9] the details of an Edgeworth expansion of the product limit estimator are given. Thus one may, at the interim, use parameter estimates to calculate a normal approximation to the conditional power. Alternatively, one may simulate the remainder of the trial. A third option is to base the procedure on the logrank test, whose statistic converges to a normal distribution. Consider the situation where the time to some event is compared between patients in an active treatment group and those in a control group. Let $r_i$ refer to the number of patients remaining at risk at event time $i$ and $o_i$ to the number of events observed at that time. Further, let $A$ refer to the active treatment group and $C$ to the control group.
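If $T$ denotes the logrank statistic, that is, the sum over event times of the differences between observed and expected numbers of events in one group (signed so that large values favour the active treatment), and $V$ its estimated variance (not to be confused with $V = (N_0 + r)/N_0$ above), then $z = T/\sqrt{V}$ will asymptotically be standard normal under the null hypothesis, see e.g. [10]. Hence one may apply the simple criterion to $z$ observed at the interim.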
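For concreteness, here is a sketch of the standardised logrank statistic in Python. It uses the usual hypergeometric variance and signs $T$ as expected minus observed events in group $A$, so that large $z$ favours the active treatment; both choices are our assumptions about the form of $T$ and $V$ intended here. Events at identical times are grouped.

```python
from math import sqrt


def logrank_z(times, events, groups):
    """Standardised logrank statistic z = T / sqrt(V) for group "A" versus "C".

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if right-censored
    groups: "A" (active) or "C" (control)
    """
    data = list(zip(times, events, groups))
    T = V = 0.0
    for t in sorted({tt for tt, e, _ in data if e}):  # distinct event times
        at_risk = [(tt, e, g) for tt, e, g in data if tt >= t]
        r_i = len(at_risk)                                     # at risk, both groups
        r_A = sum(g == "A" for _, _, g in at_risk)             # at risk, group A
        o_i = sum(bool(e) for tt, e, _ in at_risk if tt == t)  # events at time t
        o_A = sum(bool(e) for tt, e, g in at_risk if tt == t and g == "A")
        p_A = r_A / r_i
        T += o_i * p_A - o_A                                   # expected minus observed in A
        if r_i > 1:                                            # hypergeometric variance
            V += o_i * p_A * (1 - p_A) * (r_i - o_i) / (r_i - 1)
    if V == 0:
        raise ValueError("variance is zero; z is undefined")
    return T / sqrt(V)


# Toy data: all events in A occur later than those in C, so z > 0
z = logrank_z([10, 11, 12, 1, 2, 3], [1] * 6, ["A", "A", "A", "C", "C", "C"])
print(round(z, 3))
```

The interim value of $z$ can then be compared with $z_\alpha\sqrt{n/N_0}$, with $n$ and $N_0$ read as numbers of events, in line with the simple criterion.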

Binomial proportion
For simplicity of exposition we focus on a single binomial proportion $p$ and a one-sided test at the 5% level. Let the null and alternative hypotheses be $H_0: p = p_0$ and $H_1: p > p_0$. Note that for $X_{N_0} \sim \mathrm{Bin}(N_0, p_0)$ the conditional distribution given $\{X_n = k\}$ is that of $k + Y$ with $Y \sim \mathrm{Bin}(N_0 - n, p_0)$, and similarly for $X_{N_0+r}$.
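From this it follows that we may obtain $G(r)$ exactly. In terms of R code [11], with hypothetical values in the first line for illustration only:

```r
# Illustrative (hypothetical) interim values: n, N0, r, p0 and observed successes k
n <- 20; N0 <- 40; r <- 20; p0 <- 0.5; k <- 16
# One-sided test at the 5% level: reject H0 if X exceeds the 95% percentile
qs.r <- qbinom(p = 0.95, size = N0 + r, prob = p0)  # critical value after the raise
qs   <- qbinom(p = 0.95, size = N0,     prob = p0)  # critical value as planned
# G(r): conditional rejection probability with the raise minus that without it
G.r <- pbinom(qs - k,   size = N0 - n,     prob = p0) -
       pbinom(qs.r - k, size = N0 + r - n, prob = p0)
```

However, we will look at a normal approximation. In the binomial case several test statistics with close to normal distribution exist:

1. the score test statistic: $z = \sqrt{n}\,(\hat p - p_0)/\sqrt{p_0(1-p_0)}$, $\hat p = k/n$;
2. the log-odds [12].

The simple criterion would then say that if $z$ as above exceeds $\sqrt{n/N_0}\,z_\alpha$, then the procedure protects the type I error rate (unconditionally). But we set out to find a more accurate approximation. Using the condition $\{X_n = k\}$,

$$G(r) = F_{N_0-n}\bigl(q_{\alpha,N_0} - k\bigr) - F_{N_0+r-n}\bigl(q_{\alpha,N_0+r} - k\bigr),$$

where $q_{\alpha,m}$ is the $100(1-\alpha)$ percentile of $\mathrm{Bin}(m, p_0)$ and $F_m$ denotes the cumulative distribution function of $\mathrm{Bin}(m, p_0)$. Also, the binomial distribution $X_n \sim \mathrm{Bin}(n, p)$ admits a normal approximation of the pivotal statistic $U = (X_n - E[X_n])/\mathrm{SD}(X_n)$, which coincides with the score test statistic above, such that

$$P(U \le u) \approx \Phi(u) - \varphi(u)\,\frac{\kappa_3\,(u^2 - 1)}{6}, \qquad \kappa_3 = \frac{1-2p}{\sqrt{np(1-p)}},$$

in terms of the third cumulant $\kappa_3$ of $U$, which picks up the skewness. As a rule of thumb it is often said that the normal approximation is quite accurate when $np$ and $n(1-p)$ both exceed 5, and this holds even without the correction for skewness. In this case we may approximate the difference $G(r)$ defined above by

$$G(r) \approx \Phi\!\left(\frac{q_{\alpha,N_0} - k - \mu_{N_0-n}}{\sigma_0\sqrt{N_0-n}}\right) - \Phi\!\left(\frac{q_{\alpha,N_0+r} - k - \mu_{N_0+r-n}}{\sigma_0\sqrt{N_0+r-n}}\right),$$

where we insert the percentiles obtained from the inversion of the Cornish-Fisher expansion, cf. e.g. [13]:

$$q_{\alpha,m} \approx \mu_m + \sigma_0\sqrt{m}\left(z_\alpha + \frac{\gamma_0\,(z_\alpha^2 - 1)}{6\sqrt{m}}\right),$$

denoting $\mu_m = mp_0$, $\sigma_0 = \sqrt{p_0(1-p_0)}$, and the third cumulant $(1-2p_0)/\sigma_0$ of a single observation by $\gamma_0$. This quantity will deviate less than 1 from the true percentile for $m$ from 20 to 200 and $\min\{mp_0, m(1-p_0)\} > 5$. Let us consider $G(r)$ through the pivotal quantities

$$W_m = \frac{q_{\alpha,m} - k - \mu_{m-n}}{\sigma_0\sqrt{m-n}}, \qquad m = N_0,\ N_0 + r,$$

so that $G(r) \approx \Phi(W_{N_0}) - \Phi(W_{N_0+r})$. To simplify notation denote by $n_1 = N_0 + r$ the larger of the two sample sizes and by $n_2 = N_0$ the smaller. We will be concerned with the difference $W_{n_1} - W_{n_2}$. The task is now to identify when this difference is non-negative, since then the normal approximation of $G(r)$ is non-positive.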
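After equating the denominators and noting $\mu_{n_i} - \mu_{n_i-n} = np_0$, $i = 1, 2$, the difference equals

$$W_{n_1} - W_{n_2} = \frac{\bigl(np_0 - k + c + z_\alpha\sigma_0\sqrt{n_1}\bigr)\sqrt{n_2-n} - \bigl(np_0 - k + c + z_\alpha\sigma_0\sqrt{n_2}\bigr)\sqrt{n_1-n}}{\sigma_0\sqrt{(n_1-n)(n_2-n)}}, \qquad c = \frac{\sigma_0\gamma_0\,(z_\alpha^2-1)}{6}.$$

Disregarding the positive denominator yields the condition

$$\bigl(np_0 - k + c\bigr)\bigl(\sqrt{n_2-n} - \sqrt{n_1-n}\bigr) + z_\alpha\sigma_0\bigl(\sqrt{n_1}\sqrt{n_2-n} - \sqrt{n_2}\sqrt{n_1-n}\bigr) \ge 0.$$

Since $n_1 > n_2$, the factor $\sqrt{n_2-n} - \sqrt{n_1-n}$ is negative, and some algebra will unearth the condition

$$k \ge np_0 + \frac{\sigma_0\gamma_0\,(z_\alpha^2-1)}{6} + z_\alpha\sigma_0\,\frac{\sqrt{n_2}\sqrt{n_1-n} - \sqrt{n_1}\sqrt{n_2-n}}{\sqrt{n_1-n} - \sqrt{n_2-n}}.$$

Please note that the first term is the expectation $np_0$ of the null distribution of $X_n$. Further, the second term is a skewness correction that will be negative if $p_0 > 0.5$ (at the 5% level $z_\alpha^2 > 1$), and the third term equals $\sqrt{n}\,\sigma_0 z_\alpha b(q,V)$ with $q = n/n_1$ and $V = n_1/n_2$, i.e. the normal-case bound transferred to the scale of $k$; since $b(q,V) \le \sqrt{n/N_0}$, it never exceeds the corresponding term $z_\alpha\sigma_0 n/\sqrt{N_0}$ of the simple criterion. By construction, the normal approximation of $G(r)$ is non-positive for $k$ satisfying the above condition. Finally, invoke the fast convergence of the binomial distribution towards a normal law, which means that already 20 observations will make the normal approximation quite accurate, provided $\min\{np, n(1-p)\} > 5$. Simulations indicate that this decision rule is accurate already at an interim sample size $n$ as low as 20. Although the rule does not guarantee preservation of the conditional type I error rate for all $p_0$, the conclusion is that for the binomial distribution there is no inflation of the unconditional type I error rate under the above conditions. A total of 900000 simulations with $n$ from 20 to 100, $p_0$ picked randomly in $[5/n, 1 - 5/n]$, $k$ randomly generated from $\mathrm{Bin}(n, p_0)$, $N_0 = 2n$ and $r = n$ gave a median and mean of $G(r)$ equal to −0.004762 and −0.004574, respectively, over the set defined by the inequality above. A set of similar simulations using the simple criterion, $k > n\bigl(p_0 + z_\alpha\sqrt{p_0(1-p_0)/N_0}\bigr)$, gave a median and mean equal to −0.02429 and −0.02389, respectively. Thus the simple criterion will be on the conservative side.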
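The exact $G(r)$ of the R snippet, the Cornish-Fisher percentile, the threshold for $k$ derived above and the simple-criterion threshold can be compared in a few lines. The following Python sketch uses our own standard-library binomial helpers (following the `qbinom` convention for percentiles); the design $n = 20$, $N_0 = 2n$, $r = n$ mirrors the simulation set-up described in the text, while $p_0 = 0.5$ is an arbitrary illustrative choice.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(x: int, m: int, p: float) -> float:
    """P(Y <= x) for Y ~ Bin(m, p)."""
    if x < 0:
        return 0.0
    if x >= m:
        return 1.0
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(x + 1))


def binom_quantile(prob: float, m: int, p: float) -> int:
    """Smallest x with P(Y <= x) >= prob, as in R's qbinom."""
    x = 0
    while binom_cdf(x, m, p) < prob:
        x += 1
    return x


def exact_G(k, n, N0, r, p0, alpha=0.05):
    """Exact change in conditional type I error given X_n = k (reject if X > percentile)."""
    qs_r = binom_quantile(1 - alpha, N0 + r, p0)
    qs = binom_quantile(1 - alpha, N0, p0)
    return binom_cdf(qs - k, N0 - n, p0) - binom_cdf(qs_r - k, N0 + r - n, p0)


def cornish_fisher_quantile(m, p0, alpha=0.05):
    """Skewness-corrected normal approximation to the percentile q_{alpha,m}."""
    z = NormalDist().inv_cdf(1 - alpha)
    sigma0 = sqrt(p0 * (1 - p0))
    gamma0 = (1 - 2 * p0) / sigma0
    return m * p0 + sigma0 * sqrt(m) * (z + gamma0 * (z**2 - 1) / (6 * sqrt(m)))


def k_threshold(n, N0, r, p0, alpha=0.05):
    """Lower bound on k from the condition above (normal approximation of G(r) <= 0)."""
    z = NormalDist().inv_cdf(1 - alpha)
    sigma0 = sqrt(p0 * (1 - p0))
    gamma0 = (1 - 2 * p0) / sigma0
    n1, n2 = N0 + r, N0
    ratio = (sqrt(n2) * sqrt(n1 - n) - sqrt(n1) * sqrt(n2 - n)) / (sqrt(n1 - n) - sqrt(n2 - n))
    return n * p0 + sigma0 * gamma0 * (z**2 - 1) / 6 + z * sigma0 * ratio


def simple_threshold(n, N0, p0, alpha=0.05):
    """Simple criterion: k > n * (p0 + z_alpha * sqrt(p0 * (1 - p0) / N0))."""
    z = NormalDist().inv_cdf(1 - alpha)
    return n * (p0 + z * sqrt(p0 * (1 - p0) / N0))


n, N0, r, p0 = 20, 40, 20, 0.5
print(k_threshold(n, N0, r, p0), simple_threshold(n, N0, p0))
for k in range(10, 17):
    print(k, round(exact_G(k, n, N0, r, p0), 4))
```

Scanning `exact_G` over all $k$ and $p_0$ in this way is one route to the kind of simulation summary reported above.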

Main result
In Methods the following result was derived.

A conditional power that guarantees preservation of the nominal significance level
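If the conditional power at the interim (computed for the planned sample size $N_0$ at the current estimate of $\theta$), which occurs after $n$ out of $N_0$ planned observations and leads to a raise of $r$, equals at least

$$\Phi\!\left(\frac{z_\alpha\bigl(b(q,V)/\sqrt{qV} - 1\bigr)}{\sqrt{1-qV}}\right), \tag{2}$$

where $q = n/(N_0 + r)$, $V = (N_0 + r)/N_0$, and

$$b(q,V) = \frac{\sqrt{1-q} - \sqrt{1-qV}}{\sqrt{q}\,\bigl(\sqrt{V(1-q)} - \sqrt{1-qV}\bigr)}, \tag{3}$$

then the type I error rate is preserved. The function $b$ satisfies the inequalities

$$\frac{1 - \sqrt{1 - n/N_0}}{\sqrt{n/N_0}} < b(q,V) \le \sqrt{n/N_0},$$

where the lower bound is the limit of $b$ as $r \to \infty$. A more practical criterion, or rule of thumb, may be to derive a test statistic $z$ with close to a standard normal distribution under the null hypothesis, and check whether $z > \sqrt{n/N_0}\,z_\alpha$. This will be referred to as the simple criterion, and stems from [5]. More generally, the condition $z > b(q,V)z_\alpha$ suffices (cf. equations (11) and (12) in [6]). The conditional power bound in (2) decreases as $r$ increases, but the lower bound on $b$ implies a limit.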
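The bound (2), the sufficient condition $z > b(q,V)z_\alpha$ and the limit implied by the lower bound on $b$ can be computed as follows. This is a minimal Python sketch with hypothetical function names, assuming the normal model of the Methods section.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def b(q: float, V: float) -> float:
    """b(q, V) from (3)."""
    num = sqrt(1 - q) - sqrt(1 - q * V)
    den = sqrt(q) * (sqrt(V * (1 - q)) - sqrt(1 - q * V))
    return num / den


def min_conditional_power(n: int, N0: int, r: float, alpha: float = 0.025) -> float:
    """Bound (2) on the conditional power of the planned design."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    q, V = n / (N0 + r), (N0 + r) / N0
    qV = n / N0
    return STD_NORMAL.cdf(z_alpha * (b(q, V) / sqrt(qV) - 1) / sqrt(1 - qV))


def limiting_conditional_power(n: int, N0: int, alpha: float = 0.025) -> float:
    """Limit of (2) as r -> infinity, where b tends to (1 - sqrt(1 - n/N0)) / sqrt(n/N0)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    t = n / N0
    b_inf = (1 - sqrt(1 - t)) / sqrt(t)
    return STD_NORMAL.cdf(z_alpha * (b_inf / sqrt(t) - 1) / sqrt(1 - t))


def raise_permitted(z: float, n: int, N0: int, r: float, alpha: float = 0.025) -> bool:
    """Sufficient condition z > b(q, V) * z_alpha for raising N0 to N0 + r."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return z > b(n / (N0 + r), (N0 + r) / N0) * z_alpha


for r in (10, 40, 110, 1000):
    print(r, round(min_conditional_power(55, 110, r), 3))
print("limit", round(limiting_conditional_power(55, 110), 3))
```

For $n/N_0 = 0.5$ the bound falls from 50% towards a limit of roughly 12.5% as $r$ grows.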

Example
Take the example of $n = 55$, $N_0 = 110$, $r = 40$ and $\alpha = 0.025$, $z_\alpha = 1.96$. Then the minimum conditional power given by (2) equals 43%. Thus a conditional power of considerably less than 50% is permissible from the point of view of type I error rate preservation. This may be good to know if the original sample size calculation was grossly wrong. Then recruiting more subjects than planned may resolve the issue without jeopardising the type I error rate. On the other hand, in such a situation the validity of the scientific hypotheses on which the trial design rests may be questioned, and the sponsor will have to judge whether the updated hypotheses suggest a commercially viable route. Nevertheless, in some cases raising the sample size will make sense, and may save the trial from unnecessary disaster.
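The 43% follows from (3) and (2); rounded to four decimals, the arithmetic runs:

$$
\begin{aligned}
q &= \tfrac{55}{150} \approx 0.3667, \qquad V = \tfrac{150}{110} \approx 1.3636, \qquad qV = \tfrac{55}{110} = 0.5,\\
b(q,V) &\approx \frac{0.7958 - 0.7071}{0.6055\,(0.9293 - 0.7071)} \approx 0.6593,\\
z_\alpha\, b(q,V) &\approx 1.96 \times 0.6593 \approx 1.292 \qquad \bigl(\text{versus } z_\alpha\sqrt{n/N_0} \approx 1.386\bigr),\\
\Phi\!\left(\frac{z_\alpha\bigl(b(q,V)/\sqrt{qV} - 1\bigr)}{\sqrt{1-qV}}\right) &\approx \Phi\!\left(\frac{1.96\,(0.6593/0.7071 - 1)}{0.7071}\right) \approx \Phi(-0.187) \approx 0.43.
\end{aligned}
$$

So any interim value $z \ge 1.292$ permits the raise of 40, whereas the simple criterion would demand $z \ge 1.386$.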
Above we assume the variance to be known. If it is not, we may estimate it and use, for instance, a t-test statistic, whose distribution quickly converges to the normal as the sample size increases.
Examination of the t-test has provided evidence of a small degree of inflation [14]. In [15] further details of when inflation occurs are given. However, already at a sample size of 30 the t-distribution and the normal distribution appear almost identical.

Deviations from normal distribution
If we use non-normal data, such as survival data, then it is often possible to approximate the test statistic by a normal variate. Many test statistics, e.g. those derived by the maximum likelihood method, converge quickly to a normal distribution as the sample size increases. This feature extends the relevance of the main result to measurements following distributions other than the normal.
In Methods we looked into the situation where a Kaplan-Meier (KM) estimate is used. The Edgeworth expansion of the distribution of the (standardised) KM estimator has the form $\Phi(x) - n^{-1/2}\varphi(x)\,\tilde\kappa_3\,(x^2-1)/6$, where $\tilde\kappa_3$ is specified in Methods [9,16], $\Phi$ is the cumulative distribution function of a standard normal variate and $\varphi$ its density. So if we express the change in conditional error rate ($G(r)$, defined in Methods) in terms of this expansion, the correction to the difference between normal distribution functions will approach zero as $1/\sqrt{n}$. Assuming some parametric distribution, such as the Weibull distribution, one may work out the details regarding this approximation. Or one may assess the deviation from normality through a simulation procedure.
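To see the $1/\sqrt{n}$ decay of the correction term, the sketch below evaluates the one-term Edgeworth form quoted above in Python. The value of $\tilde\kappa_3$ is a placeholder (1.0) chosen for illustration only; the KM-specific expression from [9,16] is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def edgeworth_correction(x: float, n: int, kappa3: float) -> float:
    """Correction term n^(-1/2) * phi(x) * kappa3 * (x^2 - 1) / 6."""
    return STD_NORMAL.pdf(x) * kappa3 * (x * x - 1) / (6 * sqrt(n))


def edgeworth_cdf(x: float, n: int, kappa3: float) -> float:
    """One-term Edgeworth approximation: Phi(x) minus the correction term."""
    return STD_NORMAL.cdf(x) - edgeworth_correction(x, n, kappa3)


def max_abs_correction(n: int, kappa3: float) -> float:
    """Largest |correction| over a grid of x in [-5, 5]."""
    grid = [i / 100 for i in range(-500, 501)]
    return max(abs(edgeworth_correction(x, n, kappa3)) for x in grid)


# kappa3 = 1.0 is an arbitrary illustrative value, not the KM-specific constant
for n in (25, 100, 400):
    print(n, round(max_abs_correction(n, 1.0), 5))
```

Quadrupling $n$ halves the largest deviation from $\Phi$, and the correction vanishes at $x = \pm 1$.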
In the case of a single binomial proportion $p$ and a one-sided test of the null hypothesis $H_0: p = p_0$ versus the alternative hypothesis $p > p_0$, it holds that if we at the interim observe $X_n = k$ satisfying

$$k \ge np_0 + \frac{\sigma_0\gamma_0\,(z_\alpha^2-1)}{6} + z_\alpha\sigma_0\,\frac{\sqrt{N_0}\sqrt{n_1-n} - \sqrt{n_1}\sqrt{N_0-n}}{\sqrt{n_1-n} - \sqrt{N_0-n}},$$

with $\sigma_0 = \sqrt{p_0(1-p_0)}$, $\gamma_0 = (1-2p_0)/\sigma_0$ and $n_1 = N_0 + r$, then inflation of the type I error rate will not occur. More precisely put: on average, over all possible outcomes, the procedure will preserve the type I error rate. However, the conditional error rate will not always fall below the nominal one.

Discussion
There are operational issues with adaptive designs that must be addressed during the planning stage. In order to safeguard the integrity of the trial and avoid operational bias following an unblinded interim analysis, precautions need to be put in place to limit access to both the results and, even, the analysis plans. The latter will specify the output and decision rules, but will leave open the possibility of including other information, such as external factors, in the final decision whether to stop for futility or to continue and, if so, whether or not to raise the sample size. Further, a number of concerns have been raised involving the risk of violating statistical principles or a lack of efficiency compared to group sequential designs, e.g. [17-19].
However valid these objections may be, more and more practitioners have felt that the challenges are tractable and have found SSA designs an attractive option. For small biotechnology companies this option gives the possibility of starting a trial with rather limited resources, followed by an additional investment conditional on the interim results being promising. Also, the SSA design makes a lot of sense whenever a fixed-size design would have to rely on a rather limited amount of information regarding the primary variable.
Several references have argued for the superiority of seamless phase II/III designs over traditional separate phase II and III trials. Merging the two phases saves valuable time [20] and, under reasonable conditions, sample size [21].
Earlier research has established that a conditional power at the interim analysis exceeding 50% implies that the conditional, and hence also the unconditional, type I error rate is preserved, cf. [5,7]. Further, the reference [6] builds on [8] and others to identify a more general region where this happens. The region is identified through equations (11) and (12) in [6]. The derivation of the region relies on results for Brownian motion. Together these two equations implicitly define a bound that coincides with b in (3) above.
Further, one cannot use a lower bound without risking inflation of the conditional error rate, in which case the Müller-Schäfer principle of conditional error functions [7] (the new conditional error rate must not exceed the original one) can no longer be relied upon to prove preservation of the unconditional error rate. By virtue of this principle, any interim decision rule, pre-defined or not, that does not violate this fundamental requirement will permit a redesign of the trial. So from this perspective the SSA designs described here are well behaved and offer great flexibility.

Conclusions
This article has shown that the risk of compromising the nominal significance level of a statistical test by allowing a sample size increase during the course of a trial remains low and controllable. The conditional error rate and power provide key decision tools.