Alternative Designs for Phase II Clinical Trials when Attained Sample Sizes are Different from Planned Sample Sizes

Phase II cancer clinical trials are conducted for initial evaluation of the therapeutic efficacy of a new treatment regimen, and two-stage designs are often implemented in such trials. Typically, designs of phase II trials not only satisfy predefined significance and power requirements but also have desirable features such as minimizing the total sample size or minimizing the average sample size under the null hypothesis. A frequent issue is that the attained sample sizes differ from the planned sample sizes. We propose alternative designs adjusted to the attained sample sizes when they are different from the planned sample sizes. We present extensive examples and compare the proposed designs to those of Green and Dahlberg. We apply the proposed designs to a phase II trial in non-Hodgkin's lymphoma patients. Journal of Biometrics & Biostatistics


Introduction
In a phase II cancer clinical trial, investigators seek to determine whether a new treatment has sufficient promise for further study in a large-scale phase III trial. After the appropriate dose level is determined in phase I trials, phase II clinical trials are undertaken to provide an initial assessment of treatment efficacy, typically in terms of the response rate of the treatment. Usually, the response rate of the standard treatment is known, and a treatment is considered promising if its response rate exceeds the standard level by some predefined margin. This can be set up as testing the null hypothesis

H0: p ≤ p0 versus the alternative hypothesis H1: p > p1, (1)

where p is the response rate of the experimental treatment, p0 is the response rate of the standard treatment, and p1 is the level of response at which one considers the treatment promising. For ethical and efficiency reasons, most phase II trials use sequential designs. Two-stage designs are commonly used because of their logical simplicity and the diminishing benefit of multistage trials over two stages. Many designs for phase II clinical trials have been proposed [1-7]. The proposed designs usually not only satisfy predefined significance and power requirements but also have desirable features such as minimizing the maximum total number of patients or minimizing the average number of patients under the null hypothesis.
A frequent issue is that the accrual in stages 1 and 2 may not proceed exactly as planned. As pointed out by Green and Dahlberg [8], there are logistical problems encountered in the conduct of multicenter studies, including the following:

1. Accrual cannot be suspended immediately after enrolling a specified number of patients, as institutions may be allowed to enroll patients for whom recruitment efforts have already begun.

2. Communication of information such as study closure is often slow in large bureaucracies or multicenter study groups such as a cancer cooperative clinical trial group.
In addition, investigators may find that some patients who entered the study are ineligible after accrual is suspended at either stage 1 or stage 2. Consequently, the number of evaluable patients is smaller than originally expected. As pointed out by Herndon [9], accrual suspension is infeasible in some studies because of long treatment durations and/or long time periods required for response assessment. If the decision of whether the study should be stopped is made during a meeting of the Data Safety Monitoring Board, the attained sample size will likely differ from the planned sample size. The question is how the design should be modified when the attained sample sizes are different from the planned sample sizes.
For example, suppose we want to evaluate the efficacy of lenalidomide in a phase II trial in non-Hodgkin's lymphoma patients who responded and then relapsed after the first chemotherapy. A response rate of p0 = 0.20 or lower is considered too low, and a response rate of p1 = 0.40 or higher is considered promising. Assume that investigators plan to accrue 20 patients at the first stage. If the number of responses among the 20 patients is 4 or fewer, the study is terminated and the treatment is rejected. Otherwise, an additional 20 patients enter the study at the second stage. If the number of responses among the total of 40 patients is 11 or fewer by the end of stage 2, the treatment is rejected. Otherwise, the treatment is considered promising. Assume that the attained sample sizes are 18 and 38 in stages 1 and 2, respectively. What alternative design should the investigators use? We address this problem below. Two methods were proposed by Green and Dahlberg [8] and Herndon [9]. Green and Dahlberg [8] proposed to conduct a one-sided test of the alternative hypothesis H1: p > p1 at the 0.02 level in stage 1 and a one-sided test at the 0.055 significance level in stage 2. Their simple and uniform approach is attractive, and their designs have been used until recently [10]. Herndon [9] proposed hybrid designs for phase II clinical trials in which accrual is not suspended while the response status of patients entered for the stage 1 hypothesis test is evaluated. In general, the method of error probability spending functions can be applied to group sequential testing with unpredictable sample sizes. Lan and DeMets [11] proposed a flexible sequential testing procedure in which a type I error probability spending function is selected before the study starts. Pampallona et al., in a Harvard School of Public Health technical report, and Chang et al. [12] extended the method to group sequential testing using both type I and type II error probability spending functions. Further development can be found in Hampson and Jennison [13].

In our first approach, we define type I and type II error probability spending functions by the planned design, and then conduct two-stage testing using these error probability spending functions according to the attained sample sizes. In the designs of Green and Dahlberg [8], the type II error probability spent at the first stage is uniformly set at a fixed level of 0.02. Their designs may differ considerably from the planned designs and may not retain the desirable features of the planned designs. Our proposed designs are more flexible, closer to the planned design, and preserve the desirable features of the planned design better than the designs of Green and Dahlberg [8]. Our second approach is to redesign the two-stage testing procedure after the sample size at the first stage is attained, following the same criteria as in the planned design. If the attained sample size at the second stage is different from the redesigned sample size, the threshold at the second stage is adjusted to satisfy the significance requirement. Our designs satisfy the requirement on significance level. In addition, the power of the proposed design is close to the nominal level when the total sample size is close to the planned total sample size. Other parameters of the proposed design, such as the average sample size, are close to those of the planned design as well.

In the next section, we summarize typical planned designs for two-stage phase II clinical trials. In the section after that, we introduce alternative designs for use when attained sample sizes are different from planned sample sizes. We then present numerical examples and a real example. The last section is a brief discussion.

Planned Designs
The typical objective of phase II clinical trials is to evaluate experimental treatments that might increase the response rate over a historical level. We set up the null and alternative hypotheses as in (1). A phase II clinical trial is usually carried out in two stages. We classify the most popular designs in two-stage clinical trials as category I and category II, as follows. A category I design allows early termination of the trial at the end of the first stage when the treatment is ineffective, and allows the trial to continue to the second stage when the data indicate that the treatment has a certain efficacy. A category II design allows early termination of the trial not only when the treatment is ineffective but also when the treatment is clearly effective at the first stage. We denote the sample sizes at stages 1 and 2 and the total sample size by n1, n2, and n = n1 + n2, respectively, and denote the numbers of responses at stages 1 and 2, and the cumulative number of responses across the two stages, by Y1, Y2, and Y = Y1 + Y2, respectively.

A category I design is specified by a threshold a at the first stage and a threshold c at the second stage. After n1 patients enter the study and their response statuses are evaluated, we conduct the test at the first stage as follows: the study is terminated if Y1 ≤ a, and the treatment is rejected; if Y1 > a, the testing procedure continues to the second stage. At the second stage, the treatment is rejected or considered promising when Y ≤ c or Y > c, respectively. The significance level of the test is

P(Y1 > a, Y > c | n1, n2, p0); (2)

the power is

P(Y1 > a, Y > c | n1, n2, p1); (3)

and the average sample size under the null hypothesis is

n1 + n2 P(Y1 > a | n1, p0). (4)

We denote a category I design by its parameters (n1, n2, a, c). Simon's minimax designs [6] are category I designs that minimize the total sample size n under the significance and power requirements: (2) ≤ α and (3) > 1 − β. Simon's optimal designs [6] are category I designs that minimize the average sample size in (4) under the constraints (2) ≤ α and (3) > 1 − β.

A category II design is specified by adding a threshold b at the first stage to the design parameters (n1, n2, a, c) of a category I design. After n1 patients enter the study and their response statuses are evaluated, we conduct the test at the first stage as follows: the study is terminated if Y1 ≤ a or Y1 ≥ b, and the treatment is rejected or considered promising, correspondingly; if a < Y1 < b, the testing procedure continues to the second stage. At the second stage, the treatment is rejected or considered promising when Y ≤ c or Y > c, respectively. The significance level of the test is

P(Y1 ≥ b | n1, p0) + P(a < Y1 < b, Y > c | n1, n2, p0); (5)

the power is

P(Y1 ≥ b | n1, p1) + P(a < Y1 < b, Y > c | n1, n2, p1); (6)

and the average sample size under the null hypothesis is

n1 + n2 P(a < Y1 < b | n1, p0). (7)

We denote a category II design by its parameters (n1, n2, a, b, c). The minimax design minimizes the total sample size n under the constraints (5) ≤ α and (6) > 1 − β. The optimal design minimizes the average sample size in (7) under the constraints (5) ≤ α and (6) > 1 − β.

The minimax or the optimal design can be obtained by a search program. Compared with a category II design, a category I design has a higher probability of letting the trial proceed to the second stage, and provides a more accurate estimate of the response rate and other endpoints when the treatment is effective. In contrast, a category II design allows early termination and shortens the duration of the trial when the treatment is either clearly effective or clearly ineffective. If we define the rejection region of the null hypothesis at stage 1 of a category II design to be the empty set (b = n1 + 1), the category II design becomes a category I design. Therefore, category I designs are special cases of category II designs, and a minimax or optimal category II design is at least as efficient as the corresponding category I design, since category I designs form a subset of category II designs.

A common approach is to use equal or almost equal sample sizes in the two stages (n1 = n2 or n1 = n2 + 1). Such a design has the advantage of simplicity: formal data monitoring is performed after half, or about half, of the patients enter the study. The minimax and the optimal design can be obtained by a search program under the constraint n1 = n2 or n1 = n2 + 1. Some clinicians do not pursue optimality in category I or category II designs. They intend to have the testing procedure proceed to the second stage unless the result at the first stage is extreme. They select a low value of a in a category I design, or a low value of a and a large value of b in a category II design, such that the probability of early stopping is low. In this way, the data will likely provide a more accurate estimate of the response rate and other endpoints.
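The operating characteristics (2)-(4) of a category I design can be computed exactly from binomial probabilities. Below is a minimal sketch (not from the paper; the function names are ours) that evaluates a category I design; for the planned design (17, 16, 2, 5) with p0 = 0.10 and p1 = 0.30, it reproduces the significance level 0.081, power 0.90, and average sample size 20.8 reported in the numerical examples later in the paper.

```python
from math import comb

def pmf(k, n, p):
    """Binomial probability P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def category1_properties(n1, n2, a, c, p0, p1):
    """Exact significance level (2), power (3), and average sample size
    under H0 (4) for a category I two-stage design (n1, n2, a, c)."""
    def prob_reject_h0(p):
        # P(Y1 > a, Y > c | n1, n2, p): continue past stage 1, then need Y2 > c - y1
        return sum(
            pmf(y1, n1, p) * sum(pmf(y2, n2, p) for y2 in range(max(c - y1 + 1, 0), n2 + 1))
            for y1 in range(a + 1, n1 + 1)
        )
    # P(continue to stage 2 | H0) = P(Y1 > a | n1, p0)
    p_continue = sum(pmf(y1, n1, p0) for y1 in range(a + 1, n1 + 1))
    return prob_reject_h0(p0), prob_reject_h0(p1), n1 + n2 * p_continue

alpha, power, avg_n = category1_properties(17, 16, 2, 5, 0.10, 0.30)
```

The same function can be used to check any entry of a category I design table.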

Alternative Designs
A frequent issue is that the attained sample sizes at stages 1 and 2, denoted by n1* and n2* (n* = n1* + n2*), are different from the planned sample sizes n1 and n2. In the first subsection below, we propose to conduct the testing procedure using type I and type II error probability spending functions defined by the planned design. In the second subsection, we propose to redesign the testing procedure after the sample size n1* is attained at the first stage.

Alternative designs using type I and type II error probability spending functions
For a category I design (n1, n2, a, c), no type I error probability is spent at the first stage; all type I error probability is spent at the second stage. We define the type II error probability spending function according to the planned design, and then obtain the attained design by applying the spending function to the attained sample sizes n1* and n2*. Assume the required significance level and power are α and 1 − β, respectively. The type II error probability spent at stage 1 is

β1 = P(Y1 ≤ a | n1, p1). (8)

We define the type II error probability spending function as a piecewise linear function of the sample size m:

β(m) = β1 m / n1 for 0 ≤ m ≤ n1, and β(m) = β1 + (β − β1)(m − n1) / n2 for n1 < m ≤ n. (9)

We select an integer a* as the threshold at stage 1 such that

P(Y1 ≤ a* | n1*, p1) ≈ β(n1*),

where "≈" means "is closest to." We further select the smallest integer c* as the threshold at stage 2 such that

P(Y1 > a*, Y > c* | n1*, n2*, p0) ≤ α.

We conduct the testing procedure using the design (n1*, n2*, a*, c*). The proposed design satisfies the significance requirement. If the attained sample sizes n1* and n2* are close to the planned sample sizes n1 and n2, the attained design approximately satisfies the power requirement. Since the type II error probability spending function is used, the attained design approximately retains the desirable features of the planned design.
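As an illustration, the first-approach construction for a category I design can be sketched as follows. This is our sketch, not the authors' code: it assumes the type II error spending function is piecewise linear through (0, 0), (n1, β1), and (n, β), and it searches the thresholds by direct enumeration.

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
    return sum(pmf(i, n, p) for i in range(0, k + 1))

def attained_category1(n1, n2, a, c, n1s, n2s, p0, p1, alpha, beta):
    """Choose thresholds (a*, c*) for attained sizes (n1s, n2s) using a
    type II error spending function defined by the planned (n1, n2, a, c)."""
    beta1 = cdf(a, n1, p1)  # type II error spent at stage 1 of the planned design
    def beta_spend(m):      # assumed piecewise linear through (0,0), (n1, beta1), (n, beta)
        if m <= n1:
            return beta1 * m / n1
        return beta1 + (beta - beta1) * (m - n1) / n2
    # a*: stage-1 type II error closest to beta(n1*)
    target = beta_spend(n1s)
    a_star = min(range(0, n1s + 1), key=lambda t: abs(cdf(t, n1s, p1) - target))
    # c*: smallest threshold keeping the overall significance at most alpha
    def level(c_):
        return sum(pmf(y1, n1s, p0) * (1 - cdf(c_ - y1, n2s, p0))
                   for y1 in range(a_star + 1, n1s + 1))
    c_star = next(c_ for c_ in range(0, n1s + n2s + 1) if level(c_) <= alpha)
    return a_star, c_star

# Sanity check: with attained sizes equal to the planned sizes, the
# planned thresholds are recovered.
a_star, c_star = attained_category1(17, 16, 2, 5, 17, 16, 0.10, 0.30, 0.10, 0.10)
```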
For a category II design (n1, n2, a, b, c), the type I error probability spent at stage 1 is α1 = P(Y1 ≥ b | n1, p0), and the type II error probability spent at stage 1 is β1 in (8). We define the type I error probability spending function as a piecewise linear function of the sample size m:

α(m) = α1 m / n1 for 0 ≤ m ≤ n1, and α(m) = α1 + (α − α1)(m − n1) / n2 for n1 < m ≤ n, (10)

and define the type II error probability spending function as in (9). We select integers a* and b* as the thresholds at stage 1 such that

P(Y1 ≤ a* | n1*, p1) ≈ β(n1*) and P(Y1 ≥ b* | n1*, p0) ≈ α(n1*).

We further select the smallest integer c* as the threshold at stage 2 such that the significance level of the design is at most α. We conduct the testing procedure using the design (n1*, n2*, a*, b*, c*). The proposed design satisfies the significance requirement. If the attained sample sizes n1* and n2* are close to the planned sample sizes n1 and n2, the attained design approximately satisfies the power requirement. Since type I and type II error probability spending functions are used, the attained design approximately retains the desirable features of the planned design.

Alternative designs with redesigns adjusted to the attained sample sizes
We propose in this section to redesign the testing procedure when the attained sample size n1* is different from the planned sample size n1 at the first stage. For a category I design, we determine the parameters n2*, a*, and c* conditional on n1* to satisfy the significance level and power requirements:

P(Y1 > a*, Y > c* | n1*, n2*, p0) ≤ α and P(Y1 > a*, Y > c* | n1*, n2*, p1) > 1 − β.

In addition, the design (n1*, n2*, a*, c*) follows the same criteria as the planned design; for example, if the planned design minimizes the average sample size under the null hypothesis, the redesigned parameters are chosen to minimize n1* + n2* P(Y1 > a* | n1*, p0) subject to the constraints above. If the attained sample size at stage 2, n2**, is different from the redesigned sample size n2*, the threshold at stage 2 is adjusted to the smallest integer c** that satisfies the significance requirement. We emphasize that the determination of the design parameters n2*, a*, c*, and c** does not depend on the response data at stages 1 and 2.
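A search of this kind for the category I case can be sketched as follows. This is our illustration, not the authors' program: conditional on the attained n1*, it scans (n2, a, c), keeps the designs meeting the significance and power requirements, and returns the one with the smallest average sample size under H0. The search bound `max_n2` is an assumption of the sketch.

```python
from math import comb

def pmf_list(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def suffix_sums(w):
    """s[m] = sum(w[m:]); s[len(w)] = 0, so s[m] = P(Y >= m)."""
    s = [0.0] * (len(w) + 1)
    for m in range(len(w) - 1, -1, -1):
        s[m] = s[m + 1] + w[m]
    return s

def redesign_category1(n1s, p0, p1, alpha, beta, max_n2=50):
    """Conditional on the attained stage-1 size n1s, search (n2, a, c) with
    significance <= alpha and power > 1 - beta minimizing the average sample
    size under H0; returns ((n2, a, c), avg) or (None, inf) if infeasible."""
    pmf1_0, pmf1_1 = pmf_list(n1s, p0), pmf_list(n1s, p1)
    best, best_avg = None, float("inf")
    for n2 in range(1, max_n2 + 1):
        s2_0 = suffix_sums(pmf_list(n2, p0))
        s2_1 = suffix_sums(pmf_list(n2, p1))
        def reject(a, c, pmf1, s2):
            # P(Y1 > a, Y > c): stage 2 needs Y2 >= c - y1 + 1 (clamped to [0, n2+1])
            return sum(pmf1[y1] * s2[min(max(c - y1 + 1, 0), n2 + 1)]
                       for y1 in range(a + 1, n1s + 1))
        for a in range(0, n1s):
            # smallest c meeting the significance requirement
            c = next(x for x in range(a, n1s + n2 + 1)
                     if reject(a, x, pmf1_0, s2_0) <= alpha)
            if reject(a, c, pmf1_1, s2_1) > 1 - beta:
                avg = n1s + n2 * sum(pmf1_0[a + 1:])
                if avg < best_avg:
                    best, best_avg = (n2, a, c), avg
    return best, best_avg

design, avg_n0 = redesign_category1(19, 0.10, 0.30, 0.10, 0.10)
```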
For a category II design, when the attained sample size n1* is different from the planned sample size n1 at the first stage, we determine the parameters n2*, a*, b*, and c* conditional on n1* to satisfy the significance and power requirements, following the same criteria as the planned design. If the attained sample size at stage 2, n2**, is different from n2*, the threshold at stage 2 is adjusted to the smallest integer c** that satisfies the significance requirement.

Numerical Examples
Examples of planned designs and of the proposed designs are presented in Tables 1A, 1B, 2A, and 2B. All planned designs satisfy the significance and power requirements α ≤ 0.10 and 1 − β > 0.90, subject to the constraint n1 = n2 or n1 = n2 + 1. The planned designs are optimal in the sense that each has the minimum average sample size under the null hypothesis. We also list the designs of Green and Dahlberg in the tables for comparison. The designs of Green and Dahlberg spend type I and type II error probabilities at stage 1 by a fixed amount of 0.02 (for category I designs, no type I error probability is spent at stage 1) and satisfy the overall significance requirement α ≤ 0.10.

Proposed designs using type II error probability spending function versus planned category I designs
Examples of proposed designs using a type II error probability spending function versus planned category I designs are presented in Table 1A. For example, in the 9th row entry, the testing procedure is for the null hypothesis H0: p ≤ p0 = 0.10 versus the alternative hypothesis H1: p > p1 = 0.30.

Proposed designs using type I and II error probability spending functions versus planned category II designs
Examples of proposed designs using type I and II error probability spending functions versus planned category II designs are presented in Table 1B. For example, in the 9th row entry, the testing procedure is for the null hypothesis H0: p ≤ p0 = 0.10 versus the alternative hypothesis H1: p > p1 = 0.30, and the attained sample sizes are n1* = 19 and n2* = 12 (n* = 31) at stages 1 and 2, respectively. Using the type I and II error probability spending functions defined in (9) and (10), we obtain the attained design (n1*, n2*, a*, b*, c*) = (19, 12, 1, 6, 5). This design has a significance level of 0.082, a power of 0.93, and an average sample size under the null hypothesis of 25.9. In all 32 cases we investigated (Table 1B), the average performance of the proposed designs on significance and power is almost identical to that of Green and Dahlberg (mean significance level: 0.074 vs. 0.073; mean power: 0.91 vs. 0.91). In fact, in 17 of the 32 cases, the proposed designs are the same as those of Green and Dahlberg. The agreement between the proposed designs and the designs of Green and Dahlberg is due to the fact that the type I and II error probabilities computed by (9) and (10) are close to 0.02 in many cases. In the remaining 15 cases, the proposed designs uniformly have a smaller average sample size under the null hypothesis than those of Green and Dahlberg (mean average sample size: 27.5 vs. 31.0).
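The operating characteristics quoted above can be reproduced exactly from binomial probabilities. The following sketch (ours, not the authors') evaluates a category II design, with early acceptance when Y1 ≥ b and continuation when a < Y1 < b; for the attained design (19, 12, 1, 6, 5) with p0 = 0.10 and p1 = 0.30, it reproduces the reported significance level 0.082, power 0.93, and average sample size 25.9.

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def category2_properties(n1, n2, a, b, c, p0, p1):
    """Exact significance level (5), power (6), and average sample size
    under H0 (7) for a category II design (n1, n2, a, b, c); the trial
    stops early when Y1 <= a (reject) or Y1 >= b (accept)."""
    def prob_reject_h0(p):
        early = sum(pmf(y1, n1, p) for y1 in range(b, n1 + 1))        # Y1 >= b
        late = sum(pmf(y1, n1, p) *                                   # a < Y1 < b
                   sum(pmf(y2, n2, p) for y2 in range(max(c - y1 + 1, 0), n2 + 1))
                   for y1 in range(a + 1, b))
        return early + late
    p_continue = sum(pmf(y1, n1, p0) for y1 in range(a + 1, b))
    return prob_reject_h0(p0), prob_reject_h0(p1), n1 + n2 * p_continue

alpha, power, avg_n = category2_properties(19, 12, 1, 6, 5, 0.10, 0.30)
```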

Proposed designs with adjustments by the end of stages 1 and 2 versus planned category I designs
Examples of proposed designs with adjustments by the end of stages 1 and 2 versus planned category I designs are presented in Table 2A. For example, in the 9th row entry, the testing procedure is for the null hypothesis H0: p ≤ p0 = 0.10 versus the alternative hypothesis H1: p > p1 = 0.30. The planned design is (n1, n2, a, c) = (17, 16, 2, 5), with a significance level of 0.081, a power of 0.90, and an average sample size under the null hypothesis of 20.8 (Table 1A). The attained design is (n1*, n2**, a*, c**) = (19, 9, 1, 5). This design has a significance level of 0.055, a power of 0.89, and an average sample size under the null hypothesis of 24.2. In all 32 cases we investigated (Table 2A), the average performance of the proposed designs on significance and power is similar to that of Green and Dahlberg.
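The stage-2 adjustment used above can be sketched as a search for the smallest threshold meeting the significance requirement at the attained stage-2 size. This is our illustration of the step: with n1* = 19, a* = 1, an attained n2** = 9, and p0 = 0.10, it recovers the adjusted threshold c** = 5 of the attained design (19, 9, 1, 5).

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def adjust_stage2_threshold(n1s, a_star, n2att, p0, alpha):
    """Smallest c with P(Y1 > a*, Y > c | n1*, n2**, p0) <= alpha,
    i.e., the significance requirement at the attained stage-2 size."""
    def level(c):
        # P(Y1 > a*, Y > c): continue past stage 1, then need Y2 > c - y1
        return sum(pmf(y1, n1s, p0) *
                   sum(pmf(y2, n2att, p0) for y2 in range(max(c - y1 + 1, 0), n2att + 1))
                   for y1 in range(a_star + 1, n1s + 1))
    return next(c for c in range(a_star, n1s + n2att + 1) if level(c) <= alpha)

c_adj = adjust_stage2_threshold(19, 1, 9, 0.10, 0.10)
```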

Proposed designs with adjustments by the end of stages 1 and 2 versus planned category II designs
Examples of proposed designs with adjustments by the end of stages 1 and 2 versus planned category II designs are presented in Table 2B. For example, in the 9th row entry, the testing procedure is for the null hypothesis H0: p ≤ p0 = 0.10 versus the alternative hypothesis H1: p > p1 = 0.30. The planned design is (n1, n2, a, b, c) = (17, 16, 2, 5, 5), with a significance level of 0.084, a power of 0.91, and an average sample size under the null hypothesis of 20.5 (Table 1B). The attained sample size at stage 1 is n1* = 19, and the attained sample size at stage 2 is n2** = 9 (n** = 28). For the given (n1*, n2**, a*, b*) = (19, 9, 1, 6), we found that the threshold at stage 2 should be c** = c* = 5. The attained design is (n1*, n2**, a*, b*, c**) = (19, 9, 1, 6, 5). This design has a significance level of 0.055, a power of 0.89, and an average sample size under the null hypothesis of 24.1. In all 32 cases we investigated (Table 2B), the average performance of the proposed designs on significance and power is similar to that of Green and Dahlberg (mean significance level: 0.081 vs. 0.075; mean power: 0.89 vs. 0.91). In all 32 cases, the proposed designs are different from those of Green and Dahlberg, and the proposed designs uniformly have smaller average sample sizes under the null hypothesis than those of Green and Dahlberg (mean average sample size: 26.2 vs. 30.8).

A Real Example
Investigators want to evaluate the efficacy of lenalidomide in a phase II clinical trial in non-Hodgkin's lymphoma patients who responded and then relapsed after the first chemotherapy. The null and alternative hypotheses are specified in (1) with p0 = 0.20 and p1 = 0.40. The required significance level and power are α ≤ 0.10 and 1 − β > 0.90.

Discussion
In our first approach, we conduct the testing procedure using type I and type II error probability spending functions defined by the planned design; the resulting designs are more flexible and closer to the planned design than those of Green and Dahlberg. After the planned design is set up, the error probability spending function can be specified. Therefore, both the planned design and the alternative design can be specified in the protocol before the study starts.
In our second approach, we generate a modified design using the same criteria as in the planned design, conditional on the attained sample size at the first stage. A new accrual target is set for the second stage according to the modified design. When the attained sample size at stage 2 is different from the redesigned sample size, a further adjustment of the threshold at stage 2 may be needed to satisfy the significance requirement. The dynamic redesigns depend only on the attained sample sizes and are independent of the response data. The strategy of the testing procedure can be specified in the protocol before the study starts by tabulating some possible modified designs.
In our numerical investigations, we considered planned designs that minimize the average sample size under the null hypothesis, and we found that the proposed alternative designs had smaller average sample size under the null hypothesis than those of Green and Dahlberg.