Sample size recalculation in multicenter randomized controlled clinical trials based on noncomparative data

Many late-phase clinical trials recruit subjects at multiple study sites. This introduces a hierarchical structure into the data that can result in a power loss compared to a more homogeneous single-center trial. Building on a recently proposed approach to sample size determination, we suggest a sample size recalculation procedure for multicenter trials with continuous endpoints. The procedure estimates nuisance parameters at interim from noncomparative data and recalculates the required sample size based on these estimates. In contrast to other sample size calculation methods for multicenter trials, our approach assumes a mixed effects model and does not rely on balanced data within centers. It is therefore advantageous, especially for sample size recalculation at interim. We illustrate the proposed methodology by a study evaluating a diabetes management system. Monte Carlo simulations are carried out to evaluate operating characteristics of the sample size recalculation procedure using comparative as well as noncomparative data, assessing their dependence on parameters such as between-center heterogeneity, residual variance of observations, treatment effect size and number of centers. We compare two different estimators for between-center heterogeneity, an unadjusted and a bias-adjusted estimator, both based on quadratic forms. For the proposed unadjusted estimator, the type 1 error probability as well as the statistical power are close to their nominal levels for all parameter combinations considered in our simulation study, whereas the adjusted estimator exhibits some type 1 error rate inflation. Overall, the sample size recalculation procedure can be recommended to mitigate risks arising from misspecified nuisance parameters at the planning stage.

Also, conducting a trial at multiple sites offers some form of replication within the trial, which is advantageous compared to monocenter trials to "provide a better basis for the subsequent generalization of its findings," see ICH (1998), Section 3.2. Compared to single-center trials, multicenter trials should account for potential heterogeneity between study sites. This heterogeneity can be considered as an additional nuisance parameter for planning and analysis purposes. If a continuous outcome is considered, the statistical model of such a trial can be implemented as a Gaussian linear fixed-effects or random-effects model. An advantage of a random-effects model is that only a single parameter has to be estimated, whereas a fixed-effects model or a stratified analysis will include at least one additional model parameter for each center. A fixed effects model should be used when only a few centers are included, to avoid bias in the estimation of between-center variation (Kahan & Morris, 2013).
Sample size formulas for multicenter trials have been described for both fixed and random-effects models. An overview of existing approaches and a new sample size formula are given in Harden and Friede (2018). In contrast to other approaches, this formula does not assume balanced numbers of observations across the treatment groups within centers. Since exact balance is unlikely to be observed in multicenter trials in practice, we believe that this is a useful approach to sizing multicenter trials.
Sample size calculations are typically informed by results from previous trials. Therefore, fixed study designs, for example studies with an a priori fixed sample size, are at risk of being planned with incorrect values when the new trial differs from the previous ones. Adaptive study designs have been developed to reduce the risk of false negative study results due to initial misspecification, and have been adopted for various trial designs (Wassmer & Brannath, 2016, e.g., Chapters 2, 6, 9). One option is the implementation of an internal pilot study (Wittes & Brittain, 1990). This means that the assumed values of nuisance parameters used for the initial sample size calculation are replaced by estimates from the accruing data during the course of the trial. They can be calculated in a blinded fashion using an estimator of the overall variance based on the data lumped across treatment groups, as proposed by Gould (1992), Kieser and Friede (2003), Zucker, Wittes, Schabenberger, and Brittain (1999), or based on unblinded data, for example, Wittes and Brittain (1990), Denne and Jennison (1999), Coffey and Muller (1999), Miller (2005). An overview of this kind of study design is given in Friede and Kieser (2006) and Proschan (2009). Since treatments cannot always be blinded in clinical trials but adaptations can still be based on the data pooled across the treatment groups, the terms "blinded" and "unblinded" are nowadays often replaced by "noncomparative" and "comparative," respectively, to indicate that adaptive designs are not limited to blinded trials. In the following, we will call these procedures comparative or noncomparative.
There are several guidance documents on the use of adaptive designs and sample size recalculation in particular. "Whenever possible, methods for blinded sample size reassessment that properly control the type 1 error rate should be used" as stated in EMA (2007), Section 4.2.2. It is mentioned further that "sufficient justification should be made," whenever unblinded data need to be reassessed. In a recent draft guidance on "Adaptive designs for clinical trials of drugs and biologics" by the U.S. Food and Drug Administration, it is said that "adequately prespecified adaptations based on noncomparative data have a negligible effect in the type 1 error probability. This makes them an attractive choice in many settings, particularly when uncertainty about event probabilities or endpoint variability is high," see FDA (2018), Section IV. Regarding comparative data, it is stated that "adaptations based on comparative data generally do directly increase the type 1 error probability and induce bias in treatment effect estimates. Therefore, statistical methods should take into account the adaptive trial design" (Section V, lines 414-416). The adaptation of sample sizes based on comparative data is suggested, when "there is considerable uncertainty about the true treatment effect size" (Section V.B). The use of adaptive designs is of course not limited to pharmacological interventions and specific guidance was released for example for device trials (FDA, 2016). Following these recommendations, we will present an approach for sample size recalculation based on noncomparative data to deal with uncertainties in nuisance parameters.
To the best of our knowledge, only a few approaches for sample size recalculation in multicenter trials have been suggested so far. Shih and Long (1998) consider multicenter trials with unequal variances between treatment groups but equal group sizes within centers. Jensen and Kieser (2010) presented a sample size recalculation procedure based on a sample size formula by Ruvuna (2004) for the fixed effects model. In this article, we aim to apply the fixed sample size formula for the mixed effects model described in Harden and Friede (2018) to an internal pilot study approach to allow for sample size recalculation during the trial.
The manuscript is organized as follows. We introduce the COMPETE II trial as the motivating example in Section 2. Section 3 defines the statistical model and describes the sample size formula for the fixed study design. We extend the sample size formula to a sample size recalculation procedure in Section 4. For the sake of completeness, we consider sample size recalculation based on comparative as well as noncomparative data. In Section 5, we assess operating characteristics of the new approach in a simulation study with parameters similar to the COMPETE II trial. We discuss the results and close with a conclusion in Section 6.

MOTIVATING EXAMPLE: THE COMPETE II TRIAL
Holbrook et al. (2009) investigated the benefit of an individualized electronic decision support system in adult patients diagnosed with type 2 diabetes. The primary outcome was a composite score difference compared to baseline, measuring process quality on a scale from 0 to 10 based on the following parameters: blood pressure, cholesterol, glycated haemoglobin, foot check, kidney function, weight, physical activity, and smoking behavior. The clinical targets are described in the original article. The score was assessed twice, at baseline and 6 months after randomization. For this multicenter trial, 511 patients from 46 primary care providers (here referred to as centers) were randomly assigned to intervention or control. Block randomization was stratified by study site in blocks of six, following a 1:1 allocation scheme. The numbers of patients by study center are published in another article by Chu et al. (2011) and displayed in Figure 1. At the planning stage, the investigators aimed to recruit 508 patients to achieve 80% power, assuming a difference of 1 for the primary outcome between treatment groups and using a two-sided t-test with a significance level of α = 0.05.

We will illustrate ideas and simulations using this example. Obvious parameters that can be reestimated from interim data are the residual (within-center) and the between-center variance. Additionally, information on subject recruitment can be retrieved at the interim stage. We want to explore whether this information can be used to improve sample size recalculation.

STATISTICAL MODEL AND FIXED SAMPLE SIZE CALCULATION
The statistical model is a linear mixed-effects model described as follows,

$$y_{ijk} = \mu_0 + u_j + \theta \cdot x_i + e_{ijk},$$

with fixed intercept μ0, treatment effect θ, treatment indicator x_1 = 0 and x_2 = 1, and pairwise independent u_j, e_ijk satisfying E(u_j) = 0, Var(u_j) = τ² < ∞, E(e_ijk) = 0, Var(e_ijk) = σ² < ∞ and Cov(u_j, u_j′) = 0 for j ≠ j′. Indices refer to treatment groups i = 1, 2, centers j, j′ = 1, …, c and subjects within centers and treatment groups k = 1, …, n_ij; sample sizes are defined as n_j = n_1j + n_2j, N_i = Σ_{j=1}^c n_ij and N = N_1 + N_2. Subjects within centers share a common random effect that creates a block-diagonal covariance matrix given by

$$\mathrm{Cov}(y_{111}, \dots, y_{2\,c\,n_{2c}}) = \bigoplus_{j=1}^{c} \left(\sigma^2 I_{n_j} + \tau^2 J_{n_j}\right)$$

with the n_j-dimensional identity matrix I_{n_j} and the n_j-dimensional matrix J_{n_j} consisting of ones only. The structure of the study design is shown in Table 1. In addition to the statistical model, the following assumptions are necessary to use the closed sample size formula as presented in Harden and Friede (2018): 1. block randomization with fixed block length ℓ, locally applied at each center for treatment allocation, 2. a fixed r : 1 allocation ratio for each randomization block with r ∈ ℕ, 3. the proportion of overall sample sizes between treatment groups fulfils the assumed allocation ratio, N_1 = r N_2.
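To make the model concrete, the following minimal Python sketch simulates data from y_ijk = μ0 + u_j + θ·x_i + e_ijk with balanced cells. This is an illustration of ours (the simulations reported later in the article were run in R), and all function names are our own.

```python
import random
import statistics

def simulate_trial(c, n_per_cell, theta, sigma2, tau2, mu0=0.0, rng=None):
    """Simulate y_ijk = mu0 + u_j + theta * x_i + e_ijk.

    Returns a list of (treatment i, center j, y) tuples with n_per_cell
    subjects per treatment group in every center (balanced for simplicity).
    """
    rng = rng or random.Random(1)
    data = []
    for j in range(c):
        u_j = rng.gauss(0.0, tau2 ** 0.5)      # random center effect, Var = tau2
        for i, x_i in ((1, 0.0), (2, 1.0)):    # treatment indicator x_1 = 0, x_2 = 1
            for _ in range(n_per_cell):
                e = rng.gauss(0.0, sigma2 ** 0.5)  # residual error, Var = sigma2
                data.append((i, j, mu0 + u_j + theta * x_i + e))
    return data

data = simulate_trial(c=200, n_per_cell=10, theta=1.0, sigma2=16.0, tau2=4.0)
ybar1 = statistics.fmean(y for i, _, y in data if i == 1)
ybar2 = statistics.fmean(y for i, _, y in data if i == 2)
theta_hat = ybar2 - ybar1  # unbiased for theta; center effects cancel when balanced
```

With balanced cells the random center effects cancel exactly in the difference of group means, which is why τ² does not reduce power in the balanced case discussed below.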
The unknown parameter of interest θ can be estimated in an unbiased fashion based on the overall treatment group means,

$$\hat\theta = \bar{y}_{2\cdot\cdot} - \bar{y}_{1\cdot\cdot}.$$

The variance of θ̂ has been calculated elsewhere, see Vierron and Giraudeau (2009), Appendix I, and is given by

$$\mathrm{Var}(\hat\theta) = \sigma^2\left(\frac{1}{N_1}+\frac{1}{N_2}\right) + \tau^2\sum_{j=1}^{c}\left(\frac{n_{1j}}{N_1}-\frac{n_{2j}}{N_2}\right)^2.$$

The unknown nuisance parameters σ² and τ² can be estimated by the comparative quadratic-form estimators σ̂²_comp and τ̂²_comp given in Formulas (1) and (2) (see the Appendix), resulting in the following test statistic, which asymptotically follows a normal distribution,

$$T = \frac{\hat\theta}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}}. \qquad (3)$$

Under the null hypothesis of no treatment effect H_0: θ = 0, T asymptotically follows a standard normal distribution, while it is shifted by λ = θ/√Var(θ̂) under the two-sided alternative H_A: θ ≠ 0. The null hypothesis will be rejected if |T| > z_{1−α/2} for α ∈ (0, 1), where z_γ denotes the γ-quantile of the standard normal distribution. We determined a sample size formula for this design in a recent article (Harden & Friede, 2018); with z := z_{1−α/2} + z_{1−β} it is given by

$$N_{MC} = \frac{z^2(r+1)^2}{2r{\theta^*}^2}\left(\sigma^2 + \sqrt{\sigma^4 + \frac{4{\theta^*}^2\tau^2 r^2}{z^2(r+1)^2}\sum_{j=1}^{c}\Delta_j^2}\right), \qquad (4)$$

where θ* describes the assumed treatment effect and Δ_j² := (n_1j/r − n_2j)² the deviation between the planned and the observed allocation ratio within center j. The nuisance parameters σ² and τ² as well as the deviations Δ_j² are unknown at the planning stage and must be replaced by reasonable guesses prior to the trial. For instance, Δ_j² could be replaced by expected values as shown in Harden and Friede (2018). If the sample sizes within centers match the planned allocation ratio, that is, Δ_j² = 0 for all j, between-center heterogeneity will not decrease the statistical power of the trial and (4) reduces to

$$N = \frac{z^2(r+1)^2\sigma^2}{r{\theta^*}^2}.$$

An upper boundary of the required sample size is obtained if every center contributed an incomplete randomization block with ℓr/(r+1) subjects receiving the same treatment. Here, we assume that every center can contribute at most one incomplete block.
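The closed formula (4) can be evaluated directly. The snippet below is an illustrative sketch of ours (the article's own computations were done in R, and the function name is our choice); it solves the power equation θ*² = z²·Var(θ̂) for N using the normal-approximation quantiles:

```python
from statistics import NormalDist

def n_mc(theta_star, sigma2, tau2, sum_delta2, r=1, alpha=0.05, beta=0.2):
    """Total sample size for the multicenter design (normal approximation).

    sum_delta2 is the (expected) sum of squared within-center deviations
    from the planned r:1 allocation, sum_j Delta_j^2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(1 - beta)
    a = z * z * (r + 1) ** 2
    # positive root of theta*^2 N^2 - a sigma2 N / r - a tau2 S = 0
    disc = sigma2 ** 2 + 4 * theta_star ** 2 * tau2 * r * r * sum_delta2 / a
    return a / (2 * r * theta_star ** 2) * (sigma2 + disc ** 0.5)

# With perfectly balanced centers (all Delta_j = 0) the formula reduces to
# the classical two-sample size and tau2 plays no role:
n_balanced = n_mc(theta_star=1.0, sigma2=16.0, tau2=16.0, sum_delta2=0.0)
```

For σ² = 16, θ* = 1, α = 0.05 and 80% power, the balanced case yields roughly 503 subjects, in line with the approximately 508 planned for COMPETE II.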

SAMPLE SIZE RECALCULATION
We use Formula (4) to construct a sample size recalculation procedure for multicenter trials with continuous outcomes and arbitrary sample sizes per treatment arm and center. The goal of an internal pilot study is to gain information on nuisance parameters of the trial during recruitment to improve sample size calculations.

What can be learned from noncomparative interim data?
In order to achieve the planned statistical power, we analyze what can be learned from interim data to improve nuisance parameter estimation. In the considered multicenter design, we assume values for the variability between observations, σ², and for the dependency of observations within study sites, τ². The influence of τ² on the statistical power depends on the sample size deviations within centers, Δ_j². These values require knowledge of the treatment group allocation, that is, unblinding of the data. In a sample size review based on noncomparative data, we can, however, estimate the distribution of Δ_j² at the interim stage and impute E(Δ_j²) into Formula (4) for sample size recalculation. In this article, the number of centers c is assumed to be fixed.

Nuisance parameter estimation
We suggest noncomparative quadratic forms σ̂² and τ̂² (Formulas (5) and (6); explicit expressions are given in the Appendix) to estimate the unknown nuisance parameters σ² and τ² at interim. When a comparative estimation of nuisance parameters seems appropriate, the estimators described in Formulas (1) and (2) can be applied. All these estimators do not assume any distribution and can be calculated as long as n_j > 2 for all j (n_ij > 2 for all i, j) based on noncomparative (comparative) data. Some of these estimators are biased; their expected values are derived in detail in the Appendix. Friede and Kieser (2001) and others showed that the bias of the noncomparative variance estimator σ̂² is negligible in situations typical for clinical trials. The bias of the estimators for τ² can partly be adjusted, yielding the estimators τ̃² (Formula (7), noncomparative) and τ̃²_comp (Formula (8), comparative). In the following, we will refer to τ̂² and τ̂²_comp as unadjusted estimators, while referring to τ̃² and τ̃²_comp as adjusted estimators of τ². We will assess in simulation studies how these biases affect the sample size determination process and whether the corrections in Formulas (7) and (8) improve the results.
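The paper's exact quadratic forms are given in its Appendix and are not reproduced here. As a hedged illustration of the same idea, the sketch below uses a simple ANOVA-type method-of-moments analogue on noncomparative (lumped) data: the pooled within-center variance approximates σ² (up to the small treatment-effect contamination discussed above), and the variance of the center means, minus its sampling contribution, approximates τ². Function names and the estimator form are ours, not the article's.

```python
import random
from statistics import fmean, variance

def estimate_nuisance(centers):
    """Moment-based noncomparative estimates of (sigma2, tau2).

    `centers` is a list of lists: all observations of one center, pooled
    across treatment groups.
    """
    # pooled within-center variance estimates sigma2
    sigma2_hat = (sum((len(y) - 1) * variance(y) for y in centers)
                  / sum(len(y) - 1 for y in centers))
    # Var(center mean) is approximately tau2 + sigma2 / n_j
    means = [fmean(y) for y in centers]
    tau2_hat = variance(means) - sigma2_hat * fmean(1 / len(y) for y in centers)
    return sigma2_hat, max(tau2_hat, 0.0)   # truncate at zero

# illustrative data with sigma2 = 16 and tau2 = 4, no treatment effect
rng = random.Random(7)
centers = []
for _ in range(300):
    u = rng.gauss(0.0, 2.0)                  # center effect, sd = 2
    centers.append([u + rng.gauss(0.0, 4.0) for _ in range(20)])
s2_hat, t2_hat = estimate_nuisance(centers)
```

A nonzero treatment effect would inflate the lumped within-center variance slightly, mirroring the negligible bias noted by Friede and Kieser (2001).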

Incomplete randomization blocks within centers
In addition to the nuisance parameters, the unknown dispersion by center, Δ_j², has to be determined during the sample size review. Since this quantity is defined by sample sizes that are unknown at interim, ideas similar to Harden and Friede (2018) can be used to calculate the expectation of the distribution of Δ_j², which only depends on the block length ℓ, the allocation ratio r and the number of patients in the last, possibly incomplete, randomization block of each center, all of which are available from noncomparative data. In order to investigate how well the distribution of Δ_j² can be estimated at an interim stage, we use the COMPETE II trial-based center sizes to compare the distribution of block lengths at interim stages, round{p · (n_1, …, n_c)} for p ∈ {0.2, 0.3, 0.4, 0.5}, to the ones displayed in Figure 1. The results are shown in Figure 2.
It can be seen that the distribution of block lengths at any interim stage does not resemble the final distribution after including all 511 subjects. We therefore recommend to assume, in general, a uniform distribution on the values {1, …, ℓ} for the number of subjects in the incomplete block, leading to an approximation of E(Δ_j²) that depends on ℓ and r alone; the explicit expression is given in Harden and Friede (2018).
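Under this uniform assumption, E(Δ_j²) can be computed exactly. The sketch below (ours, not the article's code) does so for the special case r = 1 and even block length ℓ by averaging the hypergeometric imbalance of an incomplete block of uniform size; the general-r expression is given in Harden and Friede (2018).

```python
from math import comb

def expected_delta2(block_len):
    """E(Delta_j^2) for a 1:1 block-randomized center whose last block is
    incomplete, with the number of subjects m in that block uniform on
    {1, ..., block_len}; block_len must be even. Complete blocks contribute
    no imbalance; within an incomplete block of size m the number on
    treatment 1 is hypergeometric."""
    l = block_len
    total = 0.0
    for m in range(1, l + 1):                    # size of the incomplete block
        for n1 in range(max(0, m - l // 2), min(m, l // 2) + 1):
            p = comb(l // 2, n1) * comb(l // 2, m - n1) / comb(l, m)
            total += (n1 - (m - n1)) ** 2 * p    # Delta_j = n_1j - n_2j for r = 1
    return total / l                             # uniform average over m

# Averaging m(l - m)/(l - 1) over m = 1, ..., l gives (l + 1)/6, e.g. 17/6 for l = 16.
```

The closed-form value (ℓ + 1)/6 follows from E(Δ_j² | m) = m(ℓ − m)/(ℓ − 1) for the hypergeometric composition of the incomplete block.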

Recalculation procedure
All parameters in Formula (4) are either known or can be estimated as described above. For sample size recalculation, noncomparative as well as comparative nuisance parameter estimators as listed in Formulas (1), (2), (5), (6), (7), and (8) can be applied. The sample size recalculation procedure is executed as follows: (i) Calculate the initial sample size N_init using formula N_MC based on initial assumptions on α, β, θ*, σ², τ², and Δ_j², and specify a number of subjects N_BSSR = π · N_init, with π ∈ (0, 1), after which the nuisance parameters will be recalculated.
(ii) Estimate the nuisance parameters σ², τ² based on the first N_BSSR subjects recruited.
(iii) Calculate the sample size N_recalc using formula N_MC based on the initial α, β, θ* and the estimated σ², τ².
(iv) Recruit additional subjects into the study until N_final = max(N_recalc; N_BSSR) is reached. If an upper limit N_max is given for recruitment, N_final = min{max(N_recalc; N_BSSR); N_max}.
(v) Perform the final analysis based on all N_final subjects.
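Once an interim estimate is available, steps (i) and (iv) reduce to simple bookkeeping. A minimal sketch, with names such as n_recalc of our choosing:

```python
def recalculated_final_n(n_init, pi, n_recalc, n_max=None):
    """Decision rule of steps (i) and (iv): the interim sample size
    N_BSSR = pi * N_init is fixed in advance; after re-estimation,
    recruitment continues to the larger of the recalculated size and
    N_BSSR, optionally capped at N_max."""
    n_bssr = round(pi * n_init)
    n_final = max(n_recalc, n_bssr)
    if n_max is not None:
        n_final = min(n_final, n_max)
    return n_bssr, n_final
```

Note that N_final never falls below N_BSSR, so subjects already recruited for the interim analysis are always part of the final analysis.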

General settings
A simulation study is carried out to assess the operating characteristics of the sample size recalculation procedure. All analyses are performed using R, version 3.6.1 (R Development Core Team, 2008). Each simulation scenario consists of n_sim = 10 000 simulation runs to estimate the type 1 and type 2 error rates. For data generation, the R package blockrand is used for block randomization (Snow, 2013). A normal distribution is assumed for both the random effects u_j and the observation errors e_ijk. Source code to reproduce the results is available as Supporting Information online at https://onlinelibrary.wiley.com/doi/10.1002/bimj.201900138.
The parameter settings of the generated data are motivated by the COMPETE II trial. The block length is chosen to be large to show the benefit of the new approach, which comes into play for unbalanced data. If not stated otherwise, we set the treatment effect to θ = 1, the nuisance parameters to σ² = τ² = 16, the block length to ℓ = 16 and the allocation ratio to r = 1. The number of centers c, the timing of the sample size recalculation π and the assumed treatment effect under the alternative for type 1 error simulations will vary. We generate unequally sized study centers by distributing a fixed overall sample size to the centers based on a multinomial distribution with random center sizes as described in Jensen and Kieser (2010), forcing n_j > 0 for every center j. The different nuisance parameter estimators given in Section 4.2 are compared, and we present the difference between sample size recalculation based on noncomparative and comparative data. We consider the different estimators for nuisance parameters listed in Table 2. Also, the test statistic described in (3) will be calculated using the adjusted as well as the unadjusted estimator of τ². We refer to the test statistic in accordance with the nuisance estimator of τ² used.
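One simple way to implement the multinomial center-size generation with the n_j > 0 constraint is sketched below. This is our own illustration (Jensen and Kieser, 2010, describe the original scheme, which may enforce the constraint differently, e.g., by redrawing): seat one subject per center first, then distribute the remainder uniformly.

```python
import random

def random_center_sizes(n_total, c, rng=None):
    """Distribute n_total subjects to c centers via a symmetric multinomial
    draw, forcing every center to contain at least one subject."""
    rng = rng or random.Random(42)
    sizes = [1] * c                       # guarantee n_j > 0
    for _ in range(n_total - c):          # Multinomial(n_total - c, 1/c, ..., 1/c)
        sizes[rng.randrange(c)] += 1
    return sizes

sizes = random_center_sizes(530, 10)      # e.g., the c = 10 scenario of Figure 6
```

The resulting sizes sum exactly to the fixed overall sample size, as required for the simulation scenarios.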

Type 1 error rate
In this section, we present simulation results for the estimated type 1 error rate. We simulate data from c = 10 and c = 20 centers with varying sample sizes determined by θ* ∈ {0.82, …, 3.32} and a varying proportion of data used for sample size recalculation, π ∈ {0.3, …, 0.8}. Initial sample sizes before sample size recalculation for varying center sizes are shown in Table 3. For the purpose of comparison, we included sample sizes for a balanced t-test in that table (c = 1). The results for a prespecified two-sided type 1 error rate of α = 0.05 are shown in Figures 3 and 4. Each subfigure contains four panels that represent results for noncomparative and comparative, as well as adjusted and unadjusted, estimators of the nuisance parameters σ² and τ² described in (1)-(8), which were applied for sample size recalculation. The unadjusted test statistic controls the type 1 error rate and shows some conservative behavior for larger treatment effects and an increasing number of centers. The use of noncomparative or comparative nuisance estimators does not seem to affect the estimated type 1 error rate in the situations considered in this simulation study. The adjusted estimator of τ² at interim does not affect the estimated type 1 error rate either.
When using the adjusted test statistic for the final analysis, some inflation of the estimated type 1 error rate can be observed for fewer centers and larger treatment effects. Again, no influence of noncomparative/comparative or adjusted/unadjusted nuisance parameters used for sample size recalculation can be seen for the simulated type 1 error. Due to the possible inflation of the type 1 error rate of the adjusted test statistic, we do not consider it for power analyses.

Power
In this section, we explore whether the sample size recalculation procedure achieves the prespecified statistical power for varying parameter settings. The simulation results refer to the parameter settings described earlier. The desired statistical power is set to 0.8. Simulation results are presented in Figure 5, which shows the behavior of the unadjusted test statistic for varying values of initially (mis)specified nuisance parameters, varying numbers of centers and treatment effects. The power simulation results show in general that an initial misspecification of nuisance parameters is corrected by the sample size recalculation. For larger treatment effects, we can see, however, that the adjusted estimator of τ² can lead to underpowered trials. The unadjusted estimator of τ² will at least lead to the planned power level. No difference between noncomparative and comparative estimators can be observed in the simulation settings considered here.
The distributions of the nuisance parameters estimated at interim and of the final sample sizes are shown in Figures 6 and 7. Here, we only present results for one scenario, in which both nuisance parameters are correctly assumed at the initial planning stage of the trial.
For a small treatment effect of θ = 1, we observe only little difference in the estimated nuisance parameters and the resulting sample sizes, as seen in Figure 5. For a larger treatment effect (θ = 2), we can see, however, that the resulting sample sizes are substantially increased for the unadjusted estimators of τ². This increase seems to be influenced by the number of centers and the overall number of subjects. Noncomparative variance estimates are slightly larger than comparative ones, but this difference does not seem to affect the sample size recalculation for the parameters considered here.

DISCUSSION AND CONCLUSIONS
The specification of nuisance parameters is crucial for sample size calculation. Today, it is standard practice to use sample size recalculation procedures based on noncomparative data to correct inappropriate initial guesses of such parameters, generally without any practically relevant inflation of the type 1 error rate.
In this article, we presented a sample size recalculation procedure based on noncomparative data for a random-effects model of multicenter trials. The underlying sample size formula was described previously and depends, in addition to treatment effect, variance, and type 1 and 2 error rates, on the number of centers and heterogeneity between study sites (Harden & Friede, 2018). Whereas other approaches assume balanced data within centers, we relax this assumption here. However, we consider block randomization to approximate the imbalance.
Based on the simulation results, we suggest using the unadjusted estimator of τ² for sample size recalculation as well as for the final analysis to control the type 1 error rate and achieve the desired power. The power simulations confirm that a recalculation of the sample size based on noncomparative data is an adequate method to correct initial misspecification of nuisance parameters. This is especially helpful in multicenter trials, since the heterogeneity between centers is often unknown at the planning stage and barely reported in publications of clinical trials. In terms of type 1 error rate and power, we cannot detect any differences between sample size recalculation based on noncomparative or comparative data. We think that this observation is mainly due to the rather large sample sizes typical for multicenter trials and therefore considered in the presented simulations.
An alternative variance estimator specifically for block-randomized trials has been suggested (Ganju & Xing, 2009; Xing & Ganju, 2005). This estimator is calculated for each randomization block, so that blindness of the data can be preserved. The estimator is unbiased if all randomization blocks are complete. This estimation technique has also been extended to the estimation of τ² in cross-over as well as cluster randomized trials (Grayling, Mander, & Wason, 2018a, 2018b). Since this estimator relies on balanced data, we did not consider it here. Also, it has been demonstrated that the estimator's variance is larger compared to the one-sample variance estimator in situations that are common in clinical trials (Friede & Kieser, 2013). In our motivating example, the COMPETE II trial, no sample size recalculation method was applied. However, we do believe that multicenter trials such as the COMPETE II trial can benefit from sample size recalculation methods. In practice, their implementation is not too different from previously proposed approaches. Data of multicenter trials are usually stored in a central trial database and all calculations can be based on a data export at interim. An ideal time point for the sample size recalculation depends on several factors, such as the pace of recruitment, the trial duration, as well as uncertainty about the initial choices of nuisance parameters (Chuang-Stein, Anderson, Gallo, & Collins, 2006). Given that not all patients enrolled at interim might have completed follow-up, approaches combining short- and long-term data have been proposed to increase the precision of the estimated variance components, which might be transferred to multicenter trials; see, for example, Asendorf, Henderson, Schmidli, and Friede (2019) for an application to longitudinal counts and Friede and Kieser (2006) for an overview. For further practical guidance, we refer to the summary by Pritchett et al. (2015). The integration of between-center heterogeneity into sample size considerations is a necessary step to control statistical power in multicenter trials. The use of a sample size recalculation procedure based on noncomparative data is a helpful tool to account for uncertainty in nuisance parameters at the planning stage of a trial.

[Figure 6: Distribution of reestimated nuisance parameters and sample sizes based on θ = 1 and varying center sizes. Results based on a block length of ℓ = 16, nuisance parameters σ² = τ² = 16, and a sample size recalculation after half of the initial sample size is recruited. The red line represents the true nuisance parameters and the sample size of 530 (554) subjects for c = 10 (c = 20) centers based on true values. Comparative is abbreviated as comp.]