Power analysis to detect treatment effects in longitudinal clinical trials for Alzheimer's disease

Introduction: Assessing cognitive and functional changes at the early stage of Alzheimer's disease (AD) and detecting treatment effects in clinical trials for early AD are challenging. Methods: Under the assumption that transformed scores on the Mini–Mental State Examination, the Clinical Dementia Rating Scale–Sum of Boxes, and the Alzheimer's Disease Assessment Scale–Cognitive Subscale follow a multivariate linear mixed-effects model, we calculated the sample sizes required to detect treatment effects on the annual rates of change in these three components in clinical trials for participants with mild cognitive impairment. Results: Our results suggest that a large number of participants would be required to detect a clinically meaningful treatment effect in a population with preclinical or prodromal Alzheimer's disease. We found that the transformed Mini–Mental State Examination is more sensitive for detecting treatment effects in early AD than the transformed Clinical Dementia Rating Scale–Sum of Boxes and Alzheimer's Disease Assessment Scale–Cognitive Subscale. The use of optimal weights to construct powerful test statistics or sensitive composite scores/endpoints can reduce the sample sizes required for clinical trials. Conclusion: Considering the multivariate/joint distribution of the component scores, rather than the distribution of a single composite score, when designing clinical trials can increase power and reduce the sample sizes needed to detect treatment effects in clinical trials for early AD.


Introduction
Much effort has been devoted to developing disease-modifying treatments that intervene in the pathobiologic processes involved in the early stage of Alzheimer's disease (AD). Any therapy that is effective at treating this early manifestation of the dementia process may provide an opportunity for managing the disease while patient function is relatively preserved [1]. Standard instruments used to quantify cognitive and functional decline in AD are relatively insensitive to the changes in early AD [2]. This raises challenges for assessing the early changes in cognition and function across the spectrum of AD [3] and makes detecting treatment effects in clinical trials for early AD even harder [2].
Power analysis is standard when designing clinical trials for detecting treatment effects. Ard et al. [4] provide a comprehensive review for clinical trials in AD. Misalignment of the power analysis can lead to errors in decisions regarding sample size. Overly large samples may waste time, resources, and money and may unnecessarily expose some participants to inferior treatment if a treatment could have been shown to be more effective with fewer participants. Substantial underestimation of the sample size may also waste time, because the trial would be unlikely to lead to conclusive findings and would therefore be unfair to all participants taking part. In this article, we are interested in the power/sample size to detect the treatment effects on the component scores in clinical trials for early AD.
1 Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
*Corresponding author. Tel.: +44 (0)1223 330300; Fax: +44 (0)1223 330365. E-mail address: robin.huang@mrc-bsu.cam.ac.uk
In the literature of early AD, many researchers have used composite scores as single endpoints for performing power analysis [4]. A composite score is typically a linear combination of the scores of sensitive instruments. It provides a univariate summary of the component scores, avoids the multiple-hypothesis testing problem when each component score is considered separately, and reduces the impact of measurement error [5]. Furthermore, it may be more sensitive to the cognitive and functional decline than its separate components [6].
The construction of a composite score involves the selection and weighting of the component scores. Typically, the selection of the component scores may be based on a broad literature review regarding sensitivity to decline of candidate components [7], with equal weighting tending to be applied, possibly naively, to the chosen components. However, more statistically driven approaches can be used to derive the weights to construct more sensitive composite scores [2,6,8-12].
We therefore classify the statistical strategies used for the construction of a composite score into two major classes. The first is focused principally on selecting the most informative composite components and using prespecified weights not derived from statistical considerations; for example, Raghavan et al. [8] identify the informative component instruments based on the standardized mean of 2-year change from baseline for a mild cognitive impairment (MCI) cohort and sum them to create a new composite score. The other is focused on "optimizing" the weights assigned to component scores based on an appropriate optimality criterion and is therefore more data driven; for example, some previous proposals find composite weights that are sensitive to clinical decline by fitting linear mixed-effects models (LMMs) to the longitudinal composite scores [2,6,9]. Xiong et al. [6] propose composite weights that maximize the probability of observing a decline in one participant over a unit interval of time. Their weights can be considered as a special case of the composite weights proposed by Ard et al., who use the power to detect the time effect in a clinical trial as their criterion and obtain the component weights by maximizing this criterion [2]. Ard et al.'s approach is applied to construct a composite atrophy index [9]. Another approach within this class is to base the estimation of the composite weights on a criterion that looks at the mean to standard deviation ratio of change over time [10,11]. Wang et al. [12] propose another composite score construct by using a linear clinical decline equation to select and reweight the component scores simultaneously.
In general, using composite scores as single endpoints may lose information for detecting the changes in components [3]; for example, a large change in one component can be masked by small changes in other component scores. Data-driven composite scores have been further criticized [7]. Firstly, they may lose clinical interpretation. It is possible that a clinically meaningful component score has a small weight in a data-driven composite score [7]. In addition, they may not be consistent across different data sets. Donohue et al. [7] apply cross-validation to quantify the out-of-sample performance of optimal composite scores and conclude that the overall performance of the optimal composite scores is worse than that of composite scores derived without optimization.
Only a limited amount of the AD literature has considered power analysis with multiple endpoints, although multiple endpoints are commonplace in AD. Under the assumption that the component scores are jointly from a multivariate linear mixed-effects model (MLMM), we compare three approaches with regard to their power to detect the treatment effects on component scores. Two of them use multiple endpoints, whereas the other uses a single composite endpoint.

MLMM for component scores
Mixed-effects models are a useful class of statistical models for analyzing longitudinal data [13]. They allow a subset of the regression parameters (random effects) to vary randomly between participants and thereby characterize the natural heterogeneity in these parameters in the target population. Fixed effects refer to the regression parameters that are fixed but unknown and need to be estimated.
Assuming that all possible covariates are balanced (as would be assumed in a clinical trial through randomization), we model the component scores using an MLMM with a random intercept, fixed time, and time by treatment interaction effects. (The addition of further covariates can be easily incorporated if deemed necessary.) Such a model is able to simultaneously characterize the correlations between the component scores at each time t and the correlations across time for each component score.
Let $Y_{ntj}$ be the $j$-th component score of the $n$-th participant at visit time $t$, where $n = 1, \ldots, N$, $t = 1, \ldots, T_n$, and $j = 1, \ldots, J$. Here, the number of visits $T_n$ is a positive integer depending on the $n$-th participant, and the number of component scores $J$ is prespecified. We use a linear function to link the component scores with the mixed effects:
$$Y_{ntj} = \beta_{j1} + \beta_{j2} \times \mathrm{Time}_{nt} + \gamma_j \times (\mathrm{Treatment}_n \times \mathrm{Time}_{nt}) + b_{nj} + \varepsilon_{ntj},$$
where $\beta_{j1}$ and $\beta_{j2}$ are the fixed intercept and fixed time effect of the $j$-th component, $\gamma_j$ is the $j$-th component treatment effect, $b_{nj}$ is the random intercept that is unique to the $j$-th component score of the $n$-th participant, and $\varepsilon_{ntj}$ is the random error of the $n$-th participant on the $j$-th component score at time $t$. For each $n$, let $b_n = (b_{n1}, \ldots, b_{nJ})^T$ independently follow a multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma_b$. Here, for any matrix or vector $A$, the matrix $A^T$ is the transpose of $A$. For each $n$ and $t$, further let $\varepsilon_{nt} = (\varepsilon_{nt1}, \ldots, \varepsilon_{ntJ})^T$ independently follow a multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma_\varepsilon$. For each $n$ and $t$, the error $\varepsilon_{nt}$ and the random effects $b_n$ are independent.
For each participant $n$ and time $t$, the covariance matrix $\Sigma_\varepsilon$ characterizes the correlation structure between the component scores $Y_{nt1}, \ldots, Y_{ntJ}$ conditional on the random effect $b_n$. For each participant $n$, the component score vectors $Y_{nt} = (Y_{nt1}, \ldots, Y_{ntJ})^T$, $t = 1, \ldots, T_n$, are independent of each other through time conditional on the random effect $b_n$, but are correlated marginally.
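As a worked consequence of the stated assumptions (a sketch that uses only the random-intercept and independent-error structure described above, written with $\Sigma_b$ and $\Sigma_\varepsilon$ for the two covariance matrices), the marginal covariance between the score vectors of participant $n$ at visits $t$ and $s$ is:

```latex
\operatorname{Cov}(Y_{nt}, Y_{ns})
  = \operatorname{Cov}(b_n + \varepsilon_{nt},\; b_n + \varepsilon_{ns})
  = \Sigma_b + \mathbb{1}\{t = s\}\,\Sigma_\varepsilon ,
\qquad
\operatorname{Cov}\!\big((Y_{n1}^T, \ldots, Y_{nT_n}^T)^T\big)
  = (\mathbf{1}_{T_n}\mathbf{1}_{T_n}^T) \otimes \Sigma_b
    + I_{T_n} \otimes \Sigma_\varepsilon .
```

Scores from different visits therefore share the covariance $\Sigma_b$ induced by the random intercepts, while scores from the same visit have the additional within-visit covariance $\Sigma_\varepsilon$.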
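We can link the LMM for the composite scores to the MLMM for the components by letting $C_{nt} = \sum_{j=1}^{J} w_j Y_{ntj}$, $\alpha_n = \sum_{j=1}^{J} w_j b_{nj}$, and $\delta_{nt} = \sum_{j=1}^{J} w_j \varepsilon_{ntj}$, where $w = (w_1, \ldots, w_J)^T$ is the vector of weights for the composite score [2]. The LMM for the composite score of the $n$-th participant at time $t$ is therefore
$$C_{nt} = \sum_{j=1}^{J} w_j \beta_{j1} + \Big(\sum_{j=1}^{J} w_j \beta_{j2}\Big) \times \mathrm{Time}_{nt} + \gamma_w \times (\mathrm{Treatment}_n \times \mathrm{Time}_{nt}) + \alpha_n + \delta_{nt},$$
where $\beta_{j1}$ and $\beta_{j2}$ are the fixed intercept and time effect of the $j$-th component in the MLMM, $\gamma_w = \sum_{j=1}^{J} w_j \gamma_j$ is the treatment effect on composite scores, and for each $n$, the random intercept $\alpha_n$ follows a normal distribution with mean $0$ and variance $\sigma_\alpha^2 = w^T \Sigma_b w$, and for each $n$ and $t$, the random error $\delta_{nt}$ follows a normal distribution with mean $0$ and variance $\sigma_\delta^2 = w^T \Sigma_\varepsilon w$.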
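The two variance formulas for the composite LMM can be illustrated numerically. The following is a minimal sketch, not the authors' code: the covariance matrices are hypothetical illustrative values (they are not the ADNI estimates), and `composite_variances` is a helper name introduced here.

```python
import numpy as np

# Hypothetical covariance matrices for J = 3 components (illustrative values only,
# not estimates from ADNI).
sigma_b = np.array([[1.00, 0.40, 0.30],
                    [0.40, 0.80, 0.25],
                    [0.30, 0.25, 0.90]])    # covariance of the random intercepts
sigma_eps = np.array([[0.50, 0.10, 0.05],
                      [0.10, 0.60, 0.08],
                      [0.05, 0.08, 0.40]])  # covariance of the errors


def composite_variances(w, sigma_b, sigma_eps):
    """Return (w' Sigma_b w, w' Sigma_eps w): the variances of the composite
    random intercept and of the composite error."""
    w = np.asarray(w, dtype=float)
    return float(w @ sigma_b @ w), float(w @ sigma_eps @ w)


w_equal = np.ones(3) / np.sqrt(3.0)  # equal weights with unit Euclidean norm
var_alpha, var_delta = composite_variances(w_equal, sigma_b, sigma_eps)
```

With equal weights, each variance is simply the sum of all entries of the corresponding covariance matrix divided by 3, which shows how positive correlations between components inflate the composite variances.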

Power analysis-hypothesis testing formulations
To detect the treatment effects on component scores, we consider three hypothesis-testing problems and their associated test statistics. Rejecting any of the null hypotheses suggests statistically significant component treatment effects.
The first hypothesis testing problem is to test the null hypothesis of no treatment effect in any of the components against the alternative that there is at least one non-zero treatment effect:
$$H_0: \gamma = 0 \quad \text{versus} \quad H_A: \gamma \neq 0,$$
where $\gamma = (\gamma_1, \ldots, \gamma_J)^T$ is the $J$-dimensional vector of treatment effects. The Wald statistic $X_J = \hat{\gamma}^T \Sigma_\gamma^{-1} \hat{\gamma}$ can be used, where $\hat{\gamma}$ is the maximum likelihood estimator (MLE) of $\gamma$ under the assumption of known covariance matrices for $b_n$ and $\varepsilon_{nt}$, and $\Sigma_\gamma$ is the covariance matrix of $\hat{\gamma}$. Under the null hypothesis of no treatment effect for any of the components, the Wald test statistic follows a $\chi^2$ distribution with $J$ degrees of freedom, $\chi^2_J$.
The second hypothesis testing problem considered is for the composite treatment effect, defined as a linear combination of the component treatment effects induced by the weights $w = (w_1, \ldots, w_J)^T$. Here, we test the null hypothesis of no composite treatment effect versus the alternative of a composite treatment effect. That is,
$$H_0': w^T \gamma = 0 \quad \text{versus} \quad H_A': w^T \gamma \neq 0.$$
The Wald statistic here is $X_{JC}(w) = (w^T \Sigma_\gamma w)^{-1} (w^T \hat{\gamma})^2$, which is distributed as $\chi^2_1$ under the null, $H_0'$.
The last hypothesis testing problem considers the case in which composite scores are used as single endpoints. It aims to test a single treatment effect on the composite scores:
$$H_0'': \gamma_w = 0 \quad \text{versus} \quad H_A'': \gamma_w \neq 0.$$
Given the variances $\sigma_\alpha^2$ and $\sigma_\delta^2$, let $\hat{\gamma}_w$ be the MLE of $\gamma_w$ and $\sigma_\gamma^2$ be its variance. We can use the Wald statistic $X_C(w) = \hat{\gamma}_w^2 / \sigma_\gamma^2$, which follows the $\chi^2_1$ distribution under $H_0''$, to test for this type of treatment effect.
The vector of weights $w$ has different meanings under the last two hypothesis testing situations. The weights $w$ are on the component treatment effects in the second, whereas the weights $w$ reweight the component scores in the third. These testing approaches are equivalent only in the very special case of a linear link function, as is assumed in our setting. Table 1 summarizes these three hypothesis-testing problem formulations.
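The three Wald statistics can be computed directly from their definitions. The sketch below is illustrative only: the function names are introduced here, and the estimate `gamma_hat` and covariance `sigma_gamma` are hypothetical numbers chosen so that the arithmetic is easy to follow.

```python
import numpy as np


def wald_joint(gamma_hat, sigma_gamma):
    """X_J = gamma_hat' Sigma_gamma^{-1} gamma_hat (chi-square with J df under H0)."""
    g = np.asarray(gamma_hat, dtype=float)
    return float(g @ np.linalg.solve(sigma_gamma, g))


def wald_composite_effect(w, gamma_hat, sigma_gamma):
    """X_JC(w) = (w' gamma_hat)^2 / (w' Sigma_gamma w) (chi-square with 1 df under H0')."""
    w = np.asarray(w, dtype=float)
    g = np.asarray(gamma_hat, dtype=float)
    return float((w @ g) ** 2 / (w @ sigma_gamma @ w))


def wald_composite_score(gamma_w_hat, var_gamma_w):
    """X_C(w) = gamma_w_hat^2 / sigma_gamma^2 (chi-square with 1 df under H0'')."""
    return float(gamma_w_hat ** 2 / var_gamma_w)


# Hypothetical estimates, for illustration only.
gamma_hat = np.array([0.30, 0.10, 0.20])
sigma_gamma = np.diag([0.01, 0.04, 0.02])
w = np.array([1.0, 0.0, 0.0])  # weight placed on the first component only

x_j = wald_joint(gamma_hat, sigma_gamma)
x_jc = wald_composite_effect(w, gamma_hat, sigma_gamma)
x_c = wald_composite_score(0.30, 0.01)
```

For any weights, `x_jc` cannot exceed `x_j` (by the Cauchy-Schwarz inequality), but the two statistics are referred to different reference distributions (1 versus J degrees of freedom), so neither test dominates the other in general.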
Under an alternative model, all the test statistics follow noncentral $\chi^2$ distributions, whose noncentrality parameters determine the power to reject the associated null hypothesis. Using less powerful test statistics will lead to larger sample sizes, which may be judged unethical. In the Supplementary document, we prove that for any given weights $w$, the test statistic $X_{JC}(w)$ is no worse with regard to power than $X_C(w)$. The test statistic $X_J$ does not uniformly outperform either $X_{JC}(w)$ or $X_C(w)$ over the range of $w$.
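For the two 1-degree-of-freedom statistics, the power implied by a given noncentrality parameter has a closed form, because a noncentral chi-square variable with 1 degree of freedom is the square of a unit-variance normal variable whose mean is the square root of the noncentrality. A minimal sketch using only the Python standard library follows; the function name `power_one_df` is introduced here for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_one_df(noncentrality, alpha=0.05):
    """Power of a two-sided 1-df Wald test with the given noncentrality.

    The statistic is (Z + shift)^2 with Z standard normal and
    shift = sqrt(noncentrality); it is rejected when |Z + shift| exceeds
    the two-sided normal critical value.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = noncentrality ** 0.5
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - shift)  # P(Z + shift >  z_crit)
    lower = _STD_NORMAL.cdf(-z_crit - shift)       # P(Z + shift < -z_crit)
    return upper + lower
```

The J-degree-of-freedom statistic has no such shortcut; its power would require a noncentral chi-square distribution function from a statistics library.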

Power analysis-deriving the parameters required from analysis of MCI participants in Alzheimer's Disease Neuroimaging Initiative
For illustration, we conduct a power analysis for a two-arm randomized AD clinical trial with equal allocation probabilities. The component scores consist of the Mini-Mental State Examination (MMSE), the Clinical Dementia Rating Scale-Sum of Boxes (CDR-SB), and the Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-11), each on a transformed scale. [...] [14] was used to compute these estimates. We consider various designs for our clinical trial based on choosing different follow-up periods (i.e., 2, 3, 4, 5, and 6 years) and assuming that it is of interest to detect minimally clinically meaningful treatment effects corresponding to 25% reductions in the annual rates of change in the MMSE, CDR-SB, and ADAS-11 (transformed). These 25% reductions also correspond approximately to 25% improvements in the treated versus control arms if the components are considered on their original scales of measurement.

Power analysis-specifying the weights
We compare various weights for $X_{JC}(w)$ and $X_C(w)$ (optimal or otherwise) that can be used when performing a power analysis for the clinical trial designs described in the preceding subsection. All the considered weight vectors are normalized so that $\sum_{j=1}^{J} w_j^2 = 1$. The following weighting strategies are considered:
1. The equal weights vector $w_Z = (3^{-1/2}, 3^{-1/2}, 3^{-1/2})^T$ assumes that the component treatment effects are equally important or that the treatment effect on the average of the component scores is of interest. This strategy may typically be adopted in practice and therefore provides a benchmark against which to compare the other weighting strategies.
2. The unit vectors $w^{(1)} = (1, 0, 0)^T$, $w^{(2)} = (0, 1, 0)^T$, and $w^{(3)} = (0, 0, 1)^T$ consider the situations in which either only one of the component treatment effects or the treatment effect on a single component is of interest.
3. The optimal weights vector for $X_{JC}(w)$, denoted by $w^*_{JC}$, is optimal in the sense that $X_{JC}(w^*_{JC})$ has the greatest power to reject $H_0'$ under a given alternative. In the Supplementary Materials, it is proven that $X_{JC}(w^*_{JC})$ is always more powerful than $X_J$ in rejecting the associated null hypothesis given the same conditions. The optimal weights $w^*_{JC}$ are the eigenvector associated with the largest eigenvalue of $\Sigma_\gamma^{-1} \gamma^* \gamma^{*T}$, which is proportional to $\Sigma_\gamma^{-1} \gamma^*$, where $\gamma^*$ is the treatment effect vector under the alternative. In Table 2, we list the optimal weights for $X_{JC}(w)$ for the different trial duration scenarios.
4. The optimal weights vector for $X_C(w)$, denoted by $w^*_C$, maximizes the power of $X_C(w)$ to detect the treatment effects under a given alternative over all possible normalized $w$; see the Supplementary document for the algorithm to calculate $w^*_C$. The composite score induced by $w^*_C$ is the most sensitive for detecting a treatment effect on the composite score. The optimal weights $w^*_C$ for different trial scenarios are listed in Table 2.
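The optimal weights for the composite treatment effect can be checked numerically. The sketch below assumes a hypothetical alternative `gamma_star` and a hypothetical covariance `sigma_gamma` (neither taken from the paper's tables); the helper names are introduced here. It computes the unit-norm vector proportional to $\Sigma_\gamma^{-1} \gamma^*$ and compares its noncentrality with that of equal weights.

```python
import numpy as np


def optimal_weights_jc(gamma_star, sigma_gamma):
    """Unit-norm weights proportional to Sigma_gamma^{-1} gamma_star."""
    direction = np.linalg.solve(sigma_gamma, np.asarray(gamma_star, dtype=float))
    return direction / np.linalg.norm(direction)


def noncentrality_jc(w, gamma_star, sigma_gamma):
    """Noncentrality of the composite-effect statistic under the alternative gamma_star."""
    w = np.asarray(w, dtype=float)
    return float((w @ gamma_star) ** 2 / (w @ sigma_gamma @ w))


# Hypothetical alternative and estimator covariance, for illustration only.
gamma_star = np.array([0.30, 0.10, 0.20])
sigma_gamma = np.array([[0.010, 0.004, 0.002],
                        [0.004, 0.040, 0.006],
                        [0.002, 0.006, 0.020]])

w_opt = optimal_weights_jc(gamma_star, sigma_gamma)
lam_opt = noncentrality_jc(w_opt, gamma_star, sigma_gamma)
lam_equal = noncentrality_jc(np.ones(3) / np.sqrt(3.0), gamma_star, sigma_gamma)
# Noncentrality of the J-df joint statistic under the same alternative.
lam_joint = float(gamma_star @ np.linalg.solve(sigma_gamma, gamma_star))
```

At the optimal weights, the 1-degree-of-freedom statistic attains the same noncentrality as the joint J-degree-of-freedom statistic (`lam_opt` equals `lam_joint`), which is consistent with the statement that it is the more powerful of the two under the assumed alternative.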
Table 3 presents the sample sizes required for each of the aforementioned weighting specifications and under the different trial duration scenarios when the statistical power is specified at 80% and the significance level is set at 5%. Also reported are the calculated sample sizes when each component is considered separately for powering the trial, and a Bonferroni correction is applied. Here, the maximum of the three calculated sample sizes based on the three components is chosen as the sample size to be specified for the trial.
[Table 2: The optimal weights for $X_{JC}(w)$ and $X_C(w)$ in each trial duration (columns: Weights, Component).]
From Table 3, we observe that the test statistic $X_{JC}(w^*_{JC})$ gives the smallest sample sizes (numbers highlighted in bold) for each of the clinical trial design scenarios considered. Moreover, we make the following points after examining Table 3.

Results
A substantial number of participants may be required when a trial for early AD lasts only 2 years, under our assumptions. We estimate that at least 17,000 participants would need to be recruited in a 2-year AD trial in an MCI population to have sufficient power (i.e., 80%) to detect a 25% reduction in the annual rate of change on each of the transformed component scores. Recruitment of such numbers may be infeasible for a 2-year clinical trial in early AD with four biannual follow-up visits, and even if feasible, failure rates could potentially be high for early AD populations. Note that the required sample sizes will decrease with increasing trial duration, assuming biannual visits.
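The way a target power translates into a required sample size for a 1-degree-of-freedom Wald test can be sketched as follows. This is a standard normal-approximation calculation, not the authors' algorithm: it assumes the noncentrality grows linearly with the number of participants, and the per-participant noncentralities used below are hypothetical values, as are the function names.

```python
import math
from statistics import NormalDist


def required_sample_size(lambda_per_participant, alpha=0.05, power=0.80):
    """Approximate total N for a two-sided 1-df Wald test.

    Assumes the noncentrality equals N * lambda_per_participant (information
    is additive across participants) and ignores the negligible probability
    of rejecting in the opposite tail.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    z_power = nd.inv_cdf(power)
    return math.ceil((z_alpha + z_power) ** 2 / lambda_per_participant)


def bonferroni_sample_size(lambdas_per_participant, alpha=0.05, power=0.80):
    """Largest N over separately powered components, each tested at alpha / J."""
    j = len(lambdas_per_participant)
    return max(required_sample_size(lam, alpha / j, power)
               for lam in lambdas_per_participant)


# Hypothetical per-participant noncentralities, for illustration only.
n_single = required_sample_size(0.001)
n_bonf = bonferroni_sample_size([0.001, 0.0005, 0.0006])
```

Because the required N is inversely proportional to the per-participant noncentrality, any design change that increases it (a longer follow-up, more sensitive components, or better weights) reduces the sample size proportionally.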
The required sample sizes to detect the treatment effect on the transformed MMSE are much smaller than the ones to detect the treatment effect on the transformed CDR-SB or ADAS-11 (comparing the $w^{(1)}$ rows to the $w^{(2)}$ and $w^{(3)}$ rows in Table 3). Let us consider a clinical trial of 3 years' duration as an example. The required sample sizes obtained by $X_{JC}(w^{(1)})$ are 55.0% of the ones obtained by $X_{JC}(w^{(2)})$ and 54.6% of the ones obtained by $X_{JC}(w^{(3)})$. This implies that the transformed MMSE is a more sensitive measure for detecting a treatment effect in early AD than the transformed CDR-SB and ADAS-11 measures [15-17].
The approaches that use the optimal weights could require at least 60% fewer participants than the ones using $w^{(2)}$ or $w^{(3)}$. In our analysis, the performances of $X_{JC}(w)$ and $X_C(w)$ with $w_Z$ are comparable to the ones using the optimal weights. This is a consequence of the estimated parameters obtained from the analysis of the ADNI data giving rise to optimal weights that are close to $w_Z$ (Table 2). Comparable performances across these three statistics will not in general be expected when using other component outcomes.
The sample sizes calculated under $X_{JC}(w)$ are always smaller than the ones calculated under $X_C(w)$ for fixed weights, although the reduction may be small; for example, there is a 3% reduction in sample sizes when $X_{JC}(w)$ is used with $w = w^*_{JC}$. Such gain in efficiency is obtained by specifying the correlation structure among the component scores in the MLMM.
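The efficiency gain from modeling the components jointly can be illustrated with generalized least squares, which coincides with maximum likelihood when the covariance matrices are known. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes a balanced design with hypothetical visit times (baseline plus four half-yearly visits), 1:1 allocation, hypothetical covariance matrices and alternative, and a helper `gamma_covariance` introduced here.

```python
import numpy as np


def gamma_covariance(times, sigma_b, sigma_eps):
    """Covariance of the GLS estimator of the treatment-by-time effects,
    scaled to one participant under 1:1 allocation (divide by N for N participants).

    Works for J components (matrix inputs) or a single composite (scalar inputs).
    """
    sigma_b = np.atleast_2d(sigma_b)
    sigma_eps = np.atleast_2d(sigma_eps)
    t = np.asarray(times, dtype=float)
    n_visits, n_comp = t.size, sigma_b.shape[0]
    # Marginal covariance of the stacked score vector (visit-major ordering).
    v = (np.kron(np.ones((n_visits, n_visits)), sigma_b)
         + np.kron(np.eye(n_visits), sigma_eps))
    info = np.zeros((3 * n_comp, 3 * n_comp))
    for treated in (0.0, 1.0):
        # Columns: intercept, time, treatment x time.
        z = np.column_stack([np.ones_like(t), t, treated * t])
        x = np.kron(z, np.eye(n_comp))
        info += 0.5 * x.T @ np.linalg.solve(v, x)  # half of participants per arm
    cov = np.linalg.inv(info)
    return cov[2 * n_comp:, 2 * n_comp:]  # block for the treatment effects


# Hypothetical inputs, for illustration only.
times = np.linspace(0.0, 2.0, 5)
sigma_b = np.array([[1.00, 0.40, 0.30], [0.40, 0.80, 0.25], [0.30, 0.25, 0.90]])
sigma_eps = np.array([[0.50, 0.10, 0.05], [0.10, 0.60, 0.08], [0.05, 0.08, 0.40]])
gamma_star = np.array([0.30, 0.10, 0.20])
w = np.ones(3) / np.sqrt(3.0)

# Joint model: weights applied to the estimated component treatment effects.
sigma_gamma = gamma_covariance(times, sigma_b, sigma_eps)
lam_jc = float((w @ gamma_star) ** 2 / (w @ sigma_gamma @ w))

# Composite model: weights applied to the scores before fitting a univariate LMM.
var_gamma_w = gamma_covariance(times, w @ sigma_b @ w, w @ sigma_eps @ w)[0, 0]
lam_c = float((w @ gamma_star) ** 2 / var_gamma_w)
```

Since the composite-based estimator is a linear unbiased estimator of the same weighted effect, the Gauss-Markov theorem implies `lam_jc >= lam_c` for any fixed weights; the size of the gap depends on how much the two covariance matrices differ in structure.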

Discussion
We have described three approaches for performing power analysis to detect treatment effects in clinical trials for early AD. From our investigations, we found that jointly modeling the component scores and then constructing sensitive test statistics or composite scores based on optimal weights will improve the efficiency of clinical trials. Under our model assumptions, testing based on the optimal composite treatment effect will lead to the smallest required sample sizes and therefore should be recommended when powering clinical trials in AD if treatment effects on multiple components are of interest.
We end the article with the following discussion points.

Model assumptions
We assume that the component scores are jointly from an MLMM. This may be too strong an assumption for analyzing some cognitive and functional scores in AD, because the component scores are usually discrete with strong ceiling or floor effects. Consider the CDR-SB as an example. The CDR-SB is the sum of six component scores: the Memory Score, the Orientation Score, the Judgement and Problem Solving Score, the Community Affairs Score, the Home and Hobbies Score, and the Personal Care Score. The component scores except the Personal Care Score have the discrete range 0, 0.5, 1, 2, and 3, whereas the Personal Care Score has the range 0, 1, 2, and 3. In the ADNI data, over 30% of individuals have 0 in each component score of the CDR-SB, which indicates strong floor effects (zero-heavy data). Therefore, it may not be appropriate to use an MLMM with CDR-SB on its original scale or even after transformation as done in this article. The use of other models that take account of zero-heavy data may be appropriate; see Farewell et al. [18] for a comprehensive review.
In our power analysis results, we took the covariance matrices of $\varepsilon_{nt}$ and $b_n$ to be known when fitting the MLMM. This allowed us to obtain explicit formulas for the MLEs and their covariance, which enabled us to compare the powers of the test statistics and calculate the optimal composite scores. In practice, these covariance matrices would need to be estimated. They may be obtained from previous investigations or through a pilot study. However, note that without considering the variability in the estimated covariance matrices, there would be a tendency to underestimate the required sample sizes. Monte Carlo studies can be applied to obtain more accurate sample sizes [19]. However, these would require intensive computational work to compute the optimal weights.
In the MLMM for component scores, it is assumed that, for each $n$, the errors $\varepsilon_{nt}$, $t = 1, \ldots, T_n$, are independent across time. This implies that the time correlation of $Y_{nt}$, $t = 1, \ldots, T_n$, is induced only through the random intercepts $b_n$. This can be generalized so as to introduce autocorrelations between the $\varepsilon_{nt}$, $t = 1, \ldots, T_n$. Such a generalization would raise computational challenges, and a bespoke program would be needed. (We were unable to find a statistical software package that would allow us to fit this more generalized model.)

Wald statistics
The considered Wald statistics are used to detect the component treatment effects, but they do not distinguish between beneficial and deleterious effects. However, because currently in early AD, there may be an expectation that any treatment brought forward for confirmatory testing in a phase III trial has undergone rigorous assessment at phase II to ensure that it does not confer harm, it may be of interest to investigate rejecting $H_0$ under the alternative that all the component treatment effects $\gamma$ are nonnegative. In this situation, the Wald statistic $X_J$ follows a mixture of $\chi^2_p$ distributions, $p = 0, \ldots, J$, where the $\chi^2_0$ distribution is the distribution with mass 1 at point 0. In general, it is challenging to calculate the weights that combine the $\chi^2_p$ distributions, $p = 0, \ldots, J$ [20].
When the weights $w$ in $X_{JC}(w)$ and $X_C(w)$ are nonnegative elementwise, we may modify the alternatives against $H_0'$ and $H_0''$ to $H_A': w^T \gamma > 0$ and $H_A'': \gamma_w > 0$, respectively. We can use the Z-statistics, $X_{JC}^{1/2}(w)$ and $X_C^{1/2}(w)$, for the one-sided tests. They follow the standard normal distribution under their associated null hypotheses. However, the elements of the optimal weights $w^*_{JC}$ and $w^*_C$ may not always be nonnegative.
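The power of such a one-sided Z-test can be sketched in a few lines. This is a minimal illustration assuming that, under an alternative with a positive effect, the signed Z-statistic is normal with unit variance and mean equal to the square root of the noncentrality parameter; the function name is introduced here.

```python
from statistics import NormalDist


def one_sided_power(noncentrality, alpha=0.05):
    """Power of the one-sided Z-test when the true (weighted) effect is positive.

    Assumes the Z-statistic is normal with mean sqrt(noncentrality) and unit
    variance under the alternative.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)
    return 1.0 - nd.cdf(z_crit - noncentrality ** 0.5)
```

At the same significance level and noncentrality, the one-sided test is more powerful than its two-sided counterpart because its critical value is smaller, which is the usual motivation for restricting the alternative when harm has been ruled out beforehand.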

Parameters necessary for powering clinical trials
It is crucial to obtain plausible values of the parameters needed for the power analysis, including the annual change rates, the covariance matrix of random effects, and the covariance matrix of errors. These parameter values can be informed by a pilot study or existing studies [21]. However, there is always the concern of whether the specified alternative truly represents the effect of interest in the clinical trial target population and of how variability in the alternatives will affect the calculated sample sizes; sensitivity analysis is therefore recommended [4]. McEvoy et al. [22] compute 95% CIs on the sample sizes through bootstrapping. We also present the 95% bootstrap CIs for the calculated sample sizes in our Supplementary document.
The effect sizes must be determined based on rationale and justification from theory and clinical experience [4]. When the effect sizes are set to be percentages of the annual rate of change, they are approximately invariant to the transformation on the component scores if the term $\gamma_j \times (\mathrm{Treatment} \times \mathrm{Time}) + \beta_{j2} \times \mathrm{Time}$ in the MLMM is around zero.
The derivation and use of optimal weights $w^*_{JC}$ and $w^*_C$ here were for the clinical purpose of powering a trial. We did not propose a new composite score to be used as an endpoint but constructed the most powerful test statistics with the optimal weights $w^*_{JC}$ and the most sensitive composite score with the weights $w^*_C$ to detect treatment effects. We further argued that no extra information or model assumption beyond what is typically needed is required to calculate them given the alternatives. Therefore, it is helpful to compute and use the optimal weights in power analysis. For other clinical purposes, the optimal weights $w$ as defined and clinically meaningful weights may conflict. In such situations, we suggest modifying the criterion for determining the optimal weights to take account of clinical meaningfulness.
[...] Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

RESEARCH IN CONTEXT
1. Systematic review: The authors reviewed the literature on constructing composite scores sensitive to the early changes in cognition and function and for detecting treatment effects in clinical trials for early AD. Under the assumption that the component scores are jointly from an MLMM, three approaches are compared with regard to their power to detect treatment effects. The authors calculate sample sizes based on these three approaches.