Comparing different ways of calculating sample size for two independent means: A worked example.

We discuss different methods of sample size calculation for two independent means, aiming to provide insight into the calculation of sample size at the design stage of a parallel two-arm randomised controlled trial (RCT). We compare different methods for sample size calculation, using published results from a previous RCT. We use variances and correlation coefficients to compare sample sizes across two choices: (1) the choice of the primary outcome measure (post-intervention score vs. change from baseline score); and (2) the choice of statistical method (t-test without using correlation coefficients vs. analysis of covariance (ANCOVA)). We show that the required sample size will depend on whether the outcome measure is the post-intervention score or the change from baseline score, with or without the baseline score included as a covariate. We show that certain assumptions have to be met when using simplified sample size equations, and discuss their implications in sample size calculation when planning an RCT. We strongly recommend publishing the crucial result "mean change (SE, standard error)" in a study paper, because it allows (i) the calculation of the variance of the change score in each arm, and (ii) the pooling of the variances from both arms. It also enables us to calculate the correlation coefficient in each arm. This subsequently allows us to calculate sample size using the change score as the outcome measure. We use simulation to demonstrate how sample sizes by different methods are influenced by the strength of the correlation.


Background
Sample size calculations for a parallel two-arm trial with a continuous outcome measure can be undertaken based on (i) a pre-specified difference between arms at the post-intervention endpoint and (ii) an estimate of the standard deviation (SD) of the outcome measure. If the outcome variable is also measured at baseline, an alternative outcome measure is the change from baseline instead of the post-intervention measure. Use of this alternative outcome measure would result in a different power calculation from that obtained using the post-intervention measure as the outcome. It is also possible to carry out a power calculation based on analysis of covariance (ANCOVA), where the baseline measure is included as a covariate in the analysis.
Sample size calculations typically use published results from trials similar to those under consideration. We use results from a published paper for the MOSAIC trial [1] to compare different methods for sample size calculation. We examine the assumptions made by each method for calculating sample size, and discuss the implications of these assumptions when calculating the required sample size for a new RCT. We aim to provide insight into sample size calculations at the design stage of an RCT.
We introduce the notion of change scores, and show how to derive variances of these change scores along with related correlation coefficients in Section 3, using published results. We then calculate and compare sample sizes using different methods in Section 4. A description of the simulation of different strengths of the correlation is presented in Section 5, with the aim of investigating its influence on the calculation of sample sizes using different methods. Section 6 discusses simplified sample size equations when certain assumptions are met. Finally, we consider implications in sample size calculation when planning an RCT in Section 7.

Published results of the MOSAIC trial
The MOSAIC trial is an RCT using continuous positive airway pressure (CPAP) for symptomatic obstructive sleep apnoea. The trial randomised 391 patients between two treatment arms (CPAP vs. standard care). It has two primary outcomes at 6 months: change in Epworth Sleepiness Score (ESS), and change in predicted 5-year mortality using a cardiovascular risk score. The authors also reported the energy/vitality score (referred to as the "energy score" hereafter) of the 36-item short-form questionnaire (SF-36). The change in SF-36 energy score at 6 months is a secondary outcome of the MOSAIC trial, and an investigator might conduct another RCT using it as the primary outcome. The online supplement of the MOSAIC paper [1] states that all data were analysed using multiple variable regression models adjusting for the minimisation variables and the baseline value of the variable being analysed. Table 1 shows data concerning the SF-36 energy score, taken from Table 4 in the MOSAIC paper [1]. The outcome measure is the energy score in the SF-36 questionnaire, measured at baseline and at 6 months post-intervention. An increase in the energy score indicates an improvement in health status. The table shows that the adjusted treatment effect (6.6) is the same as the unadjusted treatment effect (10.8 - 4.2 = 6.6). The baseline mean scores are similar in both arms, being 49.7 and 49.8, respectively.
In the following sections, we show how to derive the variances of the change scores and correlation coefficients between baseline and 6 month measurements for both arms, using the results reported in Table 1 including "Mean change (SE)".

2.2. Deriving the sample variance of the change score $(Y_1 - Y_0)$
We use generic notation in this paper, noting that the proposed method is applicable to arbitrary continuous outcome measures. Suppose the primary continuous outcome measure is $Y$, with $Y_0$ and $Y_1$ denoting $Y$ at baseline and post-intervention, respectively. For simplicity, we will call $Y_0$ the "baseline score", $Y_1$ the "post score", and $(Y_1 - Y_0)$ the "change score". Let $s^2_{Y_0}$ denote the sample variance of the baseline score $Y_0$, $s^2_{Y_1}$ the sample variance of the post score $Y_1$, and $s^2_{(Y_1 - Y_0)}$ the sample variance of the change score. Because the standard error of the mean change in an arm of $n$ patients is $\mathrm{SE} = s_{(Y_1 - Y_0)} / \sqrt{n}$, the variance of the change score in that arm is

$$ s^2_{(Y_1 - Y_0)} = n \cdot \mathrm{SE}^2 . $$

The calculation of $s^2_{(Y_1 - Y_0)}$ above requires the knowledge of "mean change (SE)" reported in Table 1. The presence of $r$ is implicitly acknowledged in this variance, and we will use it to derive $r$.

This section shows how to use the variance sum law to derive the correlation coefficient $r$ between $Y_0$ and $Y_1$. The variance sum law states

$$ s^2_{(Y_1 - Y_0)} = s^2_{Y_0} + s^2_{Y_1} - 2 r \, s_{Y_0} s_{Y_1} . \tag{1} $$

Let $r_c$ and $r_t$ denote $r$ in the control and intervention arms, respectively. Substituting $s^2_{Y_0}$, $s^2_{Y_1}$ and $s^2_{(Y_1 - Y_0)}$ for each arm into Equation (1) gives $r \approx 0.7$ in both arms, and we use $r = 0.7$ for the sample size calculation in the following sections. We note that if $r_c \neq r_t$, the sample size method via ANCOVA in this paper will not be valid; in this example, the values of $r_c$ and $r_t$ are very close, granting the validity of using ANCOVA for sample size calculation. We will discuss the implication of different values for $r_c$ and $r_t$ in later sections.
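The two derivations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are our own, and the inputs `n = 100` and `se_change = 1.7` are hypothetical values chosen for illustration rather than figures from Table 1. The two SDs (23.1 and 21.7) are the pooled values quoted later in the text, used here only as plausible magnitudes.

```python
import math

def change_score_variance(se_change: float, n: int) -> float:
    """Sample variance of (Y1 - Y0) from the SE of the mean change: s^2 = n * SE^2."""
    return n * se_change ** 2

def correlation_from_variances(var_y0: float, var_y1: float, var_change: float) -> float:
    """Solve the variance sum law (Equation (1)) for r."""
    return (var_y0 + var_y1 - var_change) / (2.0 * math.sqrt(var_y0 * var_y1))

# Hypothetical arm: n = 100 patients, reported SE of the mean change = 1.7.
var_change = change_score_variance(se_change=1.7, n=100)

# SDs of the baseline and post scores (assumed values for this sketch).
r = correlation_from_variances(23.1 ** 2, 21.7 ** 2, var_change)
print(f"variance of change score = {var_change:.1f}, r = {r:.2f}")
```

In practice the calculation would be repeated for each arm with its own $n$, SE and SDs, and the two resulting values of $r$ compared.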

Comparing different sample size calculations
The calculation of sample size will depend on whether the outcome measure is to be the post score or the change score, with or without the baseline score included as a covariate.
3.1. Sample size: t-test on post score $Y_1$
Using $Y_1$ as the outcome measure in our example, the pooled variance of $Y_1$ is $s_p^2 = 21.7^2$ (see Appendix). For a two-sided significance level $\alpha$ at power $1 - \beta$, with pooled variance $s_p^2$, the required number of patients per arm is approximately [2]

$$ N = \frac{2 \, (z_{1-\alpha/2} + z_{1-\beta})^2 \, s_p^2}{\delta^2} , \tag{2} $$

where $\delta$ is the target mean difference between the two treatment arms, and where $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the quantiles of the standard normal distribution, $z \sim N(0,1)$. If assuming equal variance $\sigma^2$, simply replace $s_p^2$ with $\sigma^2$ in Equation (2). In the exemplar considered by this paper, we use two-sided significance level $\alpha = 0.05$ and power $1 - \beta = 0.8$, for which $z_{1-\alpha/2} = 1.96$ and $z_{1-\beta} = 0.84$, respectively. In our example, the target mean difference is set to be the reported treatment effect in Table 1, $\delta = 6.6$.

In the trial design stage, the characteristics of the planned RCT will inevitably differ from those of a previously-published trial, and it is therefore desirable to calculate sample sizes over a range of variances. For example, assuming equal variance in Equation (2) and using the variance of $Y_1$ in the control arm ($22.5^2$) and in the intervention arm ($20.9^2$) in turn, the resulting sample sizes are $N = 183$ and $N = 158$, respectively. The pooled variance produces an intermediate sample size, $N = 170$. In practice, one may choose to calculate $N$ using the most conservative (i.e., the greatest) value of the variances when designing a new RCT.
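As a quick check of the figures above, the normal-approximation formula in Equation (2) can be coded directly. The sketch below is ours (the function name `n_per_arm` is hypothetical) and uses only the Python standard library; it rounds up to the next whole patient, which we assume is the rounding used in the text.

```python
import math
from statistics import NormalDist

def n_per_arm(sd: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per arm for a two-sided comparison of two means (Equation (2))."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2
    return math.ceil(n)

# SDs of Y1 quoted in the text: control arm, intervention arm, pooled.
for label, sd in [("control", 22.5), ("intervention", 20.9), ("pooled", 21.7)]:
    print(label, n_per_arm(sd, delta=6.6))
```

With $\delta = 6.6$ this gives 183, 158 and 170 patients per arm, matching the values reported above.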

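3.2. Sample size: t-test on change score $(Y_1 - Y_0)$
We derived the variance of the change score, $s^2_{(Y_1 - Y_0)}$, in the previous section; substituting the latter into Equation (2) in place of $s_p^2$ gives the sample size for a t-test on the change score. We use the pooled $s^2_{(Y_1 - Y_0)}$ in the sample size calculation shown in Table 3. We strongly recommend publishing the resulting "mean change (SE)" in a study paper, because it allows the calculation of $s^2_{(Y_1 - Y_0)}$, which in turn enables us to calculate $r$ in each arm. This subsequently allows us to calculate sample size using the change score $(Y_1 - Y_0)$ as the outcome measure. We will use the derived $r$ to calculate $N$ via ANCOVA in the next section.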
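3.3. Sample size: assumptions of ANCOVA on $Y_1$ adjusting for $Y_0$
When using $Y_1$ as the outcome while adjusting for $Y_0$, the sample size $N$ can be calculated via ANCOVA. Let $\tau^2$ and $\sigma^2$ be the variances of $Y_0$ and $Y_1$, respectively. If $(Y_0, Y_1)$ follow a bivariate normal distribution with correlation $r$, the conditional variance of $Y_1$ given $Y_0$ is

$$ \mathrm{Var}(Y_1 \mid Y_0) = \sigma^2 (1 - r^2) , $$

as shown in the Appendix. We note that $\tau^2$, the variance of the baseline score $Y_0$, does not appear in the conditional variance of $(Y_1 \mid Y_0)$. This relationship indicates a variance deflation factor $(1 - r^2)$ that can be used for sample size calculation.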
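The deflation factor can be checked by simulation. The sketch below is our own illustration, not part of the original analysis: it draws bivariate normal pairs with zero means (an arbitrary simplification), regresses $Y_1$ on $Y_0$ by ordinary least squares, and compares the residual variance with $\sigma^2 (1 - r^2)$. The SDs 23.1 and 21.7 and $r = 0.7$ are the values quoted in the text; the function name, sample size and seed are assumptions.

```python
import math
import random

def residual_variance_after_adjustment(tau, sigma, r, n=20000, seed=1):
    """Simulate bivariate normal (Y0, Y1) and return the residual variance of Y1 regressed on Y0."""
    rng = random.Random(seed)
    y0, y1 = [], []
    for _ in range(n):
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        y0.append(tau * z0)
        # Build Y1 so that corr(Y0, Y1) = r and Var(Y1) = sigma^2.
        y1.append(sigma * (r * z0 + math.sqrt(1 - r * r) * z1))
    m0, m1 = sum(y0) / n, sum(y1) / n
    sxx = sum((a - m0) ** 2 for a in y0)
    sxy = sum((a - m0) * (b - m1) for a, b in zip(y0, y1))
    slope = sxy / sxx
    rss = sum((b - m1 - slope * (a - m0)) ** 2 for a, b in zip(y0, y1))
    return rss / (n - 2)

simulated = residual_variance_after_adjustment(tau=23.1, sigma=21.7, r=0.7)
theoretical = 21.7 ** 2 * (1 - 0.7 ** 2)
print(f"simulated = {simulated:.1f}, sigma^2 (1 - r^2) = {theoretical:.1f}")
```

Changing `tau` leaves the residual variance essentially unchanged, which illustrates why $\tau^2$ does not enter the deflation factor.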
However, this variance deflation factor is only true under the assumption of a bivariate normal distribution of $(Y_{0,i,j}, Y_{1,i,j})$. As stated above, this means that the marginal distribution of $Y_0$ is normal, and that the marginal distribution of $Y_1$ is also normal, hence the usual assumed normality for a t-test is met. However, the marginal normal distributions of $Y_1$ and $Y_0$ do not guarantee the bivariate normal distribution of $(Y_0, Y_1)$. Bivariate normality is a stronger assumption than the assumption in a t-test for sample size, and can be violated in practice. It is necessary to examine the assumption of a bivariate normal distribution of $(Y_{0,i,j}, Y_{1,i,j})$ before applying the variance deflation factor in the sample size calculation.
It is straightforward to visualise $(Y_{0,i,j}, Y_{1,i,j})$ by plotting the data from both treatment arms in a two-dimensional space, with $Y_0$ on the horizontal axis and $Y_1$ on the vertical axis. This visualisation will immediately reveal whether the assumption of a bivariate normal distribution is violated. It is possible that the data will form two clusters corresponding to the control and intervention arms, respectively, which would violate the assumption. Borm, Fransen et al. [3] used this relationship for sample size calculation via ANCOVA, but the authors did not explicitly discuss its assumption.
There are several other assumptions one must make before applying the variance deflation factor $(1 - r^2)$. In this paper, we give mathematical details in the Appendix and explicitly examine all the assumptions, summarised below:
1. $(Y_{0,i,j}, Y_{1,i,j})$, including all patients in both arms, follow a bivariate normal distribution. We recommend visualising the data to examine whether this assumption is violated, as discussed above.
2. The values of the correlation coefficient $r$ between $Y_0$ and $Y_1$ are the same in both arms. This means that there exists no interaction between baseline score and the treatment arm. This assumption is adequately met in our example, where $r \approx 0.7$ in both arms of the trial.
3. The variances of $Y_1$, denoted $\sigma^2$, are the same in both arms. We note that the variance of $Y_0$, denoted $\tau^2$, does not affect the variance deflation factor, hence it does not have to take the same value in both arms. This assumption is only mildly violated in our example: Table 2 shows that the pooled $s^2_{Y_0}$ and $s^2_{Y_1}$ are quite similar, being $23.1^2$ and $21.7^2$, respectively. The resulting sample size by ANCOVA shown in Table 3 should still be a reasonable estimate, due to these similar values of the pooled $s^2_{Y_0}$ and $s^2_{Y_1}$.

If all of the above assumptions hold, then the conditional variance of $(Y_1 \mid Y_0)$ is $\sigma^2 (1 - r^2)$. Let $N$ be the sample size (i.e., the number of patients in each arm) by a t-test on $Y_1$; then the sample size by an ANCOVA on $Y_1$ adjusting for $Y_0$ is

$$ (1 - r^2) N \tag{3} $$

while achieving the same power as a t-test on $Y_1$. Since $(1 - r^2) N \le N$, ANCOVA never requires a larger sample size than a t-test, and requires a smaller one whenever $r \neq 0$, as illustrated in the first row of Table 3.
In our example, the variance of $Y_1$ in the control and intervention arms is different ($22.5^2$ and $20.9^2$, respectively), hence it does not meet the assumption of equal variance above (#3).

[Table 3: N by equation (N by PASS)]
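Equation (3) amounts to a one-line calculation once $N$ for the t-test on $Y_1$ is known. The following sketch is ours (the function name `n_ancova` is hypothetical) and assumes rounding up to the next whole patient.

```python
import math

def n_ancova(n_ttest: int, r: float) -> int:
    """Per-arm N for ANCOVA on Y1 adjusting for Y0, given per-arm N for a t-test on Y1."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return math.ceil((1 - r ** 2) * n_ttest)

# N = 170 from the t-test on Y1 with the pooled variance, r = 0.7 as derived above.
print(n_ancova(170, 0.7))  # 87
```

The result, 87 patients per arm, agrees with the ANCOVA entry at $r = 0.7$ in Table 4.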

Comparing sample sizes using different methods
This section summarises and compares different methods for sample size calculation. We discuss the following two factors:
1. The choice of the primary outcome measure: post score $Y_1$ vs. change score $(Y_1 - Y_0)$.
2. The choice of statistical methods: t-test without using $r$ vs. ANCOVA.
In all sample size calculations in this paper (including those for which the results are shown in Table 3), we have used the target mean difference $\delta = 6.6$, two-sided $\alpha = 0.05$, allocation ratio = 1, and 80% power. All sample sizes are produced using the corresponding pooled variance derived in this paper. We used the PASS 15 system (NCSS, LLC) to validate our sample size calculation by equations; these results are shown as "(N by PASS)" in Table 3, where "N by equation" refers to our derived $N$ in previous sections. The PASS software cites Borm, Fransen et al. [3] as its reference for sample size via ANCOVA, and its results ("N by PASS") are similar to the "N by equation".
The efficiency (i.e., a smaller $N$ while maintaining the same statistical power) gained in ANCOVA by using $r$ comes from making strong assumptions. We have used Equation (3) from Section 4.3 (i.e., sample size via ANCOVA) in Table 3, but we note that its assumptions are not fully met in individual arms, and therefore one should not directly use the variance of individual arms for the sample size calculation in ANCOVA. In this instance, our approach is to use the pooled variance of both arms in the sample size equation via ANCOVA. Acknowledging its limitation in practice, one can produce sample sizes using a range of variances to gain a better sense of the required sample size.
In Table 3, we have used $r = 0.7$ for sample size via ANCOVA, as stated previously. In both the "t-test" and "ANCOVA" methods on $Y_1$, we have used the pooled variance $s_p^2 = 21.7^2$; for the t-test on $(Y_1 - Y_0)$, we have used the pooled $s^2_{(Y_1 - Y_0)}$. In the example corresponding to the results shown in Table 3, ANCOVA produces the smallest sample size, while use of a t-test on $Y_1$ produces the largest. Calculating sample size via a t-test for outcome $Y_1$ does not consider the correlation $r$ between $Y_0$ and $Y_1$, hence will yield a sample size at least as large as that obtained when using an ANCOVA (which involves the use of the value of $r$). However, $N$ via a t-test for outcome $(Y_1 - Y_0)$ is not always larger than $N$ via ANCOVA, depending on the strength of the correlation $r$ and on whether the assumptions presented earlier are met.

Simulated sample sizes at different values of r
We here simulate different values of $r$, and then compare the sample sizes calculated using different methods. The pooled variances $s^2_{Y_0}$ and $s^2_{Y_1}$ are taken from Table 2, and two options are considered for the t-test on the change score. We show the simulated sample sizes of these two options in the following sections. The simulated results using both options are shown in Table 4, and are plotted in Fig. 1 and Fig. 2. The same parameter values as presented in Table 3 are used for simulation throughout this section.

The results shown in Table 3 correspond to values of $N$ at $r = 0.7$, where the value of $N$ obtained via ANCOVA is smaller than the value of $N$ obtained via a t-test on the outcome $(Y_1 - Y_0)$.

Fig. 2 shows the resulting sample sizes obtained by the three different methods when $s^2_{(Y_1 - Y_0)}$ is computed from Equation (1), to be compared with Fig. 1. In Fig. 2, the resulting $N$ via ANCOVA remain the same as those shown in Fig. 1, whereas $N$ via a t-test on the change score differs, because $s^2_{(Y_1 - Y_0)}$ is determined by the values of $r$. Fig. 2 also provides a convenient way of assessing the assumption of equal variance required in Equation (4). If the assumption that $Y_0$ and $Y_1$ have the same variance is met, the long-dashed line in Fig. 2 (representing the value of $N$ obtained via a t-test on $Y_1$) and the short-dashed line (representing the value of $N$ obtained via a t-test on $(Y_1 - Y_0)$) will cross at $r = 0.5$. These two lines cross at $r = 0.53$ in Fig. 2, indicating this assumption is only mildly violated.
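A sweep over $r$ of the kind described above can be sketched as follows. This is our own reconstruction under stated assumptions, not the authors' simulation code: the grid $r = 0, 0.1, \ldots, 1$ is assumed, the ANCOVA row applies $(1 - r^2)$ to the rounded $N = 170$, and the change-score row uses the variance sum law with the pooled SDs 23.1 and 21.7 quoted in the text. All names are hypothetical.

```python
import math
from statistics import NormalDist

DELTA, ALPHA, POWER = 6.6, 0.05, 0.80
SD_Y0, SD_Y1 = 23.1, 21.7   # pooled SDs of baseline and post scores (from the text)
N_POST = 170                # per-arm N for the t-test on the post score

def n_from_variance(variance: float) -> int:
    """Per-arm N from the normal-approximation formula, rounded up."""
    z = NormalDist().inv_cdf(1 - ALPHA / 2) + NormalDist().inv_cdf(POWER)
    return math.ceil(2 * z ** 2 * variance / DELTA ** 2)

def change_variance(r: float) -> float:
    """Variance of (Y1 - Y0) from the variance sum law."""
    return SD_Y0 ** 2 + SD_Y1 ** 2 - 2 * r * SD_Y0 * SD_Y1

grid = [i / 10 for i in range(11)]
n_ancova = [math.ceil((1 - r * r) * N_POST) for r in grid]
n_change = [n_from_variance(change_variance(r)) for r in grid]

# The change-score and post-score curves cross where change_variance(r) = SD_Y1^2.
r_cross = SD_Y0 / (2 * SD_Y1)

for r, a, c in zip(grid, n_ancova, n_change):
    print(f"r = {r:.1f}  ANCOVA = {a:3d}  post = {N_POST}  change = {c:3d}")
print(f"crossing point: r = {r_cross:.2f}")
```

Under these assumptions the ANCOVA column reproduces the "N by ANCOVA" row of Table 4, and the crossing point evaluates to about 0.53, consistent with the value read from Fig. 2. The change-score column is illustrative only, since the corresponding published row is not available here for comparison.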

The variance sum law when assuming equal variance
Assuming $Y_0$ and $Y_1$ have the same variance $\sigma^2$, the variance sum law (Equation (1)) can be simplified to

$$ s^2_{(Y_1 - Y_0)} = 2 \sigma^2 (1 - r) . $$

This means that when $(Y_1 - Y_0)$ is the outcome measure, its variance deflation factor is $2(1 - r)$, assuming that $Y_0$ and $Y_1$ have an equal variance $\sigma^2$. This variance deflation factor gives us a simplified Equation (4) for sample size. Let $N$ be the sample size (i.e., the number of patients in each arm) obtained by a t-test on $Y_1$; then a t-test on $(Y_1 - Y_0)$ will require

$$ 2 (1 - r) N \tag{4} $$

patients to achieve the same power, assuming equal variance of $Y_0$ and $Y_1$.
Since $2(1 - r) < 1$ if $r > 0.5$, and vice versa if $r < 0.5$, Equation (4) also shows that calculating sample size using a t-test on $(Y_1 - Y_0)$ will require fewer patients than a t-test on $Y_1$ if $r > 0.5$, and more patients if $r < 0.5$. The two methods yield the same number of patients if $r = 0.5$. We emphasise that this relationship only strictly applies when $Y_0$ and $Y_1$ have equal variance $\sigma^2$. In practice, if $s^2_{Y_0}$ and $s^2_{Y_1}$ are similar, the deflation factor $2(1 - r)$ can still approximate the variance of $(Y_1 - Y_0)$, and hence give a reasonable estimate of sample size. This is further illustrated by Fig. 2, where the long-dashed and short-dashed lines cross at $r = 0.53$, a value close to 0.5, indicating a mild violation of the assumption of equal variance.
In our example, Table 2 shows that $Y_0$ and $Y_1$ do not have equal variance, hence the above formula is not directly applicable. However, under that assumption, Equations (3) and (4) give

$$ (1 - r^2) N \le 2 (1 - r) N , \tag{5} $$

where equality occurs at $r = 1$. The left hand and right hand sides of Equation (5) correspond to the sample size obtained via ANCOVA on $Y_1$ while adjusting for $Y_0$ and via a t-test on $(Y_1 - Y_0)$, respectively. In practice, we always have $r < 1$; therefore ANCOVA on $Y_1$ adjusting for $Y_0$ always yields a smaller sample size than would be obtained using a t-test on $(Y_1 - Y_0)$.

When designing a new RCT, one needs to consider whether the duration of the planned trial will differ from that of previous trials. The correlation between $Y_0$ and $Y_1$ is likely to decrease (i.e., a smaller $r$) for an increased trial period, and vice versa.
In the example used in this paper, the derived correlation coefficient $r$ is similar in both treatment arms, being approximately 0.7. If the correlation between $Y_0$ and $Y_1$ in the two treatment arms is different, one will need to consider the interaction between the treatment arm and the baseline measure.
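The relationships among the three methods under the equal-variance assumption can be tabulated numerically. The sketch below is our own illustration with hypothetical function names: it compares the deflation factors $(1 - r^2)$ for ANCOVA and $2(1 - r)$ for the t-test on the change score, each relative to a factor of 1 for the t-test on $Y_1$.

```python
def factor_ancova(r: float) -> float:
    """Deflation factor for ANCOVA on Y1 adjusting for Y0."""
    return 1 - r ** 2

def factor_change(r: float) -> float:
    """Deflation factor for a t-test on (Y1 - Y0), assuming equal variances of Y0 and Y1."""
    return 2 * (1 - r)

grid = [i / 20 for i in range(21)]  # r = 0, 0.05, ..., 1

# ANCOVA factor never exceeds the change-score factor (small tolerance for rounding error).
inequality_holds = all(factor_ancova(r) <= factor_change(r) + 1e-12 for r in grid)

for r in (0.3, 0.5, 0.7, 0.9):
    print(f"r = {r}: ANCOVA factor = {factor_ancova(r):.2f}, "
          f"change-score factor = {factor_change(r):.2f}")
print("ANCOVA factor <= change-score factor on the grid:", inequality_holds)
```

The change-score factor equals 1 at $r = 0.5$, which is where the t-tests on $Y_1$ and on $(Y_1 - Y_0)$ require the same number of patients, and the two factors coincide only at $r = 1$.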

If "mean change (SE)" is not reported
If "mean change (SE)" is not reported for a study, we can calculate a range of potential variances of $(Y_1 - Y_0)$ by setting a plausible range of values of $r$, using the variance sum law, as shown in Section 3.3. The simulation method shown in Section 5 can be used to compare sample sizes obtained using different methods at different values of $r$, providing a sense of the required sample size in the trial design stage.
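One way to act on this at the design stage is to compute $N$ over a plausible range of $r$ and plan for the largest value. The sketch below is a hedged illustration: the range 0.5 to 0.8 is a hypothetical planning choice, the SDs are the pooled values quoted in the text, and the function name is our own.

```python
import math
from statistics import NormalDist

def n_change_score(sd_y0, sd_y1, r, delta, alpha=0.05, power=0.80):
    """Per-arm N for a t-test on (Y1 - Y0), with the variance taken from the variance sum law."""
    var_change = sd_y0 ** 2 + sd_y1 ** 2 - 2 * r * sd_y0 * sd_y1
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z ** 2 * var_change / delta ** 2)

plausible_r = [0.5, 0.6, 0.7, 0.8]  # hypothetical planning range
sizes = {r: n_change_score(23.1, 21.7, r, delta=6.6) for r in plausible_r}
conservative = max(sizes.values())  # the smallest r gives the largest N

for r, n in sizes.items():
    print(f"r = {r}: N per arm = {n}")
print("conservative choice:", conservative)
```

Because the variance of the change score falls as $r$ rises, the conservative choice is driven by the lower end of the plausible range, so that bound deserves the most scrutiny.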

Future work
In this paper we have used the change score $(Y_1 - Y_0)$ as a choice of outcome measure without questioning its validity. In fact, one should be cautious of using the change score as the outcome measure, due to the well-known statistical phenomenon of "regression to the mean". This will be investigated in a future paper.

Table 4. Simulated sample sizes at different values of $r$. "N by ANCOVA" produced by option 1 (plotted in Fig. 1) are the same as those produced by option 2 (plotted in Fig. 2). "N by t-test on post score" remains at a constant value of 170 throughout. In contrast, "N by t-test on change score" by option 1 and 2 are different, and are plotted in Figs. 1 and 2.

r                                      0    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
N by ANCOVA                            170  169  164  155  143  128  109  87   62   33   0
N by t-test on post score              170  170  170  170  170  170  170  170  170  170  170
N by t-test on change score (Fig. 1)   [...]

Fig. 1. Comparing values of sample size $N$ produced using different methods at different values of $r$, using the same parameter values as are shown in Table 3.

Declarations
Ethics approval and consent to participate
N/A. Not required.

Consent for publication
Yes.

Availability of data and material
N/A. Not required.

Competing interests
None.

Funding
N/A.

Authors' contributions
LC conceived the research idea, and led the writing of the paper. JB and DC also contributed to writing the paper.

[...] $X$ and $Y$. The bivariate normal density is given by the expression [4]