Introduction

Contributions by many researchers have led to gradual recognition of the importance of sample size determination in experimental research. This is a fundamental issue in the behavioral and psychological sciences, because sample size directly affects the statistical power of tests, the precision of parameter estimates, and cost efficiency. Of particular note are the pioneering works of Cohen (1962, 1988, 1994), who provided the standard framework for conducting power analysis.

However, several difficulties remain in effectively performing power analysis in actual research. Particularly for longitudinal data, which have attracted increased attention in recent years (Singer & Willett, 2003), sample size determination before starting an experiment is rarely practiced (e.g., Muthén & Curran, 1997), mainly owing to a lack of convenient procedures and the limitations of those that exist.

Here, we illustrate this problem concretely in the case of longitudinal experimental research that investigates an experimental effect by using a multilevel model, one of the most popular models for longitudinal data analysis (Hox, 2010; Raudenbush & Bryk, 2002; Singer & Willett, 2003). For simplicity of discussion, we assume just two groups (an experimental group and a control group), with individuals randomly assigned to groups. Let Y jt be the outcome for an individual j (= 1, 2, . . . , N) at time t (= 1, 2, . . . , T). Here, N and T refer to the total sample size and the number of time points, respectively. Then Y jt is expressed by the following time-level equation (level 1, Eq. 1) and individual-level equations (level 2, Eqs. 2 and 3) in the multilevel model:

$$ {Y}_{jt}={\beta}_{0j}+{\beta}_{1j}{X}_t+{e}_{jt}, $$
(1)

and

$$ {\beta}_{0j}={\beta}_{00}+{u}_{j0} $$
(2)
$$ {\beta}_{1j}={\beta}_{10}+\delta {W}_j+{u}_{j1}. $$
(3)

Here, X t is a constant that denotes time, with X t = t − 1. β 0j and β 1j are the intercept and slope parameters of the growth curve for individual j, respectively. e jt is an error term, with e jt ∼ N(0, σ 2), and errors are assumed to be independent across individuals and time points [i.e., cov(e jt , e j′t ) = cov(e jt , e jt′ ) = cov(e jt , e j′t′ ) = 0]. Here, σ 2 is the error variance, and cov(x 1, x 2) denotes the covariance between variables x 1 and x 2.

At the individual level, the intervention assignment indicator variable W j is 0.5 if individual j is assigned to the experimental group and − 0.5 for the control group. Under a balanced design (i.e., ∑ j W j = 0), β 00 and β 10 are the overall means of the intercept and slope parameters, and without loss of generality, we can restrict these parameters as β 00 = β 10 = 0. δ is a parameter that denotes the experimental effect; rejecting the null hypothesis H 0 : δ = 0 directly indicates that the slopes differ significantly between groups, and so do the outcome means at t ≥ 2. Lastly, u j0 and u j1 are residual terms, and u j = (u j0 , u j1 )′ is assumed to be multivariate normally distributed as u j ∼ N(0, T) and to be independent among individuals [i.e., cov(u j0 , u j′0 ) = cov(u j1 , u j′1 ) = cov(u j0 , u j′1 ) = 0]. Here, T is the residual variance–covariance matrix, expressed as

$$ T=\left(\begin{array}{cc}\hfill {\tau}_{00}\hfill & \hfill {\tau}_{01}\hfill \\ {}\hfill {\tau}_{10}\hfill & \hfill {\tau}_{11}\hfill \end{array}\right). $$
(4)

τ 00 and τ 11 are variances that reflect individual differences in the intercept and slope parameters, respectively, and τ 01 (= τ 10) is the corresponding covariance.

Substituting Eqs. 2 and 3 into Eq. 1 produces the combined form of the multilevel model:

$$ {Y}_{jt}=\left({\beta}_{00}+{u}_{j0}\right)+\left({\beta}_{10}+\delta {W}_j+{u}_{j1}\right){X}_t+{e}_{jt}=\left({\beta}_{00}+\left({\beta}_{10}+\delta {W}_j\right){X}_t\right)+\left({u}_{j0}+{u}_{j1}{X}_t+{e}_{jt}\right). $$
(5)

This combined form clarifies the fixed and residual parts of the multilevel model. From this new equation, it becomes clear that mean and variance–covariance structures of Y jt given the intervention assignment indicator variable W j can be expressed as

$$ E\left(\left.{Y}_{jt}\right|{W}_j\right)={\beta}_{00}+\left({\beta}_{10}+\delta {W}_j\right){X}_t $$
(6)
$$ V\left({Y}_{jt}\Big|{W}_j\right)={\tau}_{00}+2{X}_t{\tau}_{01}+{X}_t^2{\tau}_{11}+{\sigma}^2 $$
(7)
$$ cov\left({Y}_{jt},{Y}_{j{t}^{\prime }}\Big|{W}_j\right)={\tau}_{00}+\left({X}_t+{X}_{t^{\prime }}\right){\tau}_{01}+{X}_t{X}_{t^{\prime }}{\tau}_{11}. $$
(8)

Here, E(x) is the expected value of variable x, and V(x) denotes the variance of x. Note that the variance and covariance of outcomes given W j (i.e., the within-group variance) do not depend on δ or W j , so they are specified identically for the experimental and control groups in this model.

This multilevel model is mathematically equivalent to the standard latent curve model (Bollen & Curran, 2006; Meredith & Tisak, 1990) when assuming time-invariant level 1 error variances (σ 2). Similarity between the multilevel model and structural equation modeling (including latent curve models) is discussed in detail by Curran (2003). The basic procedure of power analysis using the standard latent curve model is provided in Muthén and Curran (1997).

To evaluate the power of statistical tests of the null hypothesis H 0 : δ = 0, given sample size N and significance level α, we need to specify all parameters included in the model (Eqs. 1, 2 and 3). Namely, five parameters [σ 2, δ, τ 00, τ 01 (= τ 10), and τ 11] should be specified prior to the experiment. This is unfortunately not realistic in most behavioral and psychological research, owing to a lack of prior information or related results from former research that used the same multilevel model. Additionally, these parameters do not directly reflect the features of the expected growth curves for the two groups, such as the mean structures (and mean differences) and variance–covariance structures of outcomes between groups and the reliability of measurement, leading to difficulty in conducting power analysis. Although several useful sample size determination methods exist that can be used for longitudinal data (e.g., the Monte Carlo simulation method of Muthén and Muthén (2002) and a generalized method for a multilevel model by Roy, Bhaumik, Aryal, and Gibbons (2007)), these methods require users to specify every parameter in the statistical model, leading to their limited use. Additionally, several software packages are available for conducting power analysis with multilevel data (e.g., ACluster, in Donner & Klar, 2000; Fosgate, 2007; OD, in Raudenbush, Spybrook, Congdon, Liu, & Martinez, 2011; PinT, which is based on formulas derived by Snijders & Bosker, 1993). However, specifying the necessary parameter values before the experiment has been conducted may not be realistic. Finally, although several closed-form sample size determination formulas for multilevel data (e.g., Heo & Leon, 2008; Usami, in press) and related numerical tables exist, these formulas cannot be directly applied to investigate differences in slope means between groups in longitudinal designs.
In an attempt to increase the utility of power analyses in longitudinal experimental designs, the present research provides a convenient method and numerical tables for determining the needed sample size using the multilevel model of Eqs. 1–3, by transforming its five parameters into indices that are intuitively understandable and easy to specify.

Method

The fundamental idea of this method is to transform the model parameters (σ 2, δ, τ 00, τ 01 = τ 10, and τ 11) into indices that are easy to understand and specify. Namely, these are the reliability of measurement at the first time point (ρ 1), the effect size at the last time point (Δ T ), the level 2 residual correlation (r), and the proportion of outcome variance between the first and last time points (k). These are summarized as follows by using the results of Eqs. 6, 7 and 8:

$$ {\rho}_1=\frac{V\left({u}_{0j}\right)}{V\left({Y}_{j1}\Big|{W}_j\right)}=\frac{V\left({u}_{0j}\right)}{V\left({u}_{0j}\right)+V\left({e}_{jt}\right)}=\frac{\tau_{00}}{\tau_{00}+{\sigma}^2} $$
(9)
$$ {\varDelta}_T=\frac{E\left({Y}_{jT}\Big|{W}_j=0.5\right)-E\left({Y}_{jT}\Big|{W}_j=-0.5\right)}{\sqrt{V\left({Y}_{jT}\Big|{W}_j\right)}}=\frac{\left(T-1\right)\delta }{\sqrt{\tau_{00}+2\left(T-1\right){\tau}_{01}+{\left(T-1\right)}^2{\tau}_{11}+{\sigma}^2}} $$
(10)
$$ r=\frac{ cov\left({u}_{0j},{u}_{1j}\right)}{\sqrt{V\left({u}_{0j}\right)}\sqrt{V\left({u}_{1j}\right)}}=\frac{\tau_{01}}{\sqrt{\tau_{00}}\sqrt{\tau_{11}}} $$
(11)
$$ k=\frac{V\left({Y}_{jT}\Big|{W}_j\right)}{V\left({Y}_{j1}\Big|{W}_j\right)}=\frac{\tau_{00}+2\left(T-1\right){\tau}_{01}+{\left(T-1\right)}^2{\tau}_{11}+{\sigma}^2}{\tau_{00}+{\sigma}^2}. $$
(12)

In Eq. 9, ρ 1 (0 ≤ ρ 1 ≤ 1) refers to the reliability of measurement at the first time point, defined as the variance of u j0 (i.e., individual differences in the intercepts β 0j ) relative to the within-group variance at the first time point (V(Y j1 |W j )). Larger values of ρ 1 indicate higher reliability of measurement at the first time point. In Eq. 10, Δ T refers to the effect size at the last time point T and is defined as the ratio of the expected mean difference of the outcome between groups at the last time point (i.e., E(Y jT |W j = 0.5) − E(Y jT |W j = − 0.5)) to the within-group standard deviation of the outcome at the last time point (i.e., √V(Y jT |W j )). The use of Δ T for evaluating statistical power was introduced in Muthén and Curran (1997). An effect size is traditionally known as a simple way to quantify the size of the difference between two groups. The conventional rough benchmarks commonly used in the context of t tests, such as Δ T = 0.2, 0.5, and 0.8 corresponding to small, medium, and large effect sizes, respectively (Cohen, 1988), are useful as a guide for specifying this index for power analysis.

In Eq. 11, r ( − 1 ≤  r ≤ 1) is a level 2 residual correlation, and larger values of r indicate a stronger correlation between the intercepts and slopes in growth curves. Finally, k (k ≥ 1) is the ratio of variance of outcomes between the first and the last time points within groups. Larger values of k indicate relatively larger individual differences at the last time point T.
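For concreteness, the four indices in Eqs. 9–12 can be computed mechanically from any given set of model parameters. The following is a minimal sketch in Python (the function name params_to_indices is ours, not from the original article):

```python
import math

def params_to_indices(sigma2, delta, tau00, tau01, tau11, T):
    """Compute (rho_1, Delta_T, r, k) from the model parameters via Eqs. 9-12."""
    v1 = tau00 + sigma2                                        # V(Y_j1 | W_j), Eq. 7 with X_1 = 0
    vT = tau00 + 2*(T - 1)*tau01 + (T - 1)**2*tau11 + sigma2   # V(Y_jT | W_j)
    rho1 = tau00 / v1                                          # Eq. 9
    delta_T = (T - 1) * delta / math.sqrt(vT)                  # Eq. 10
    r = tau01 / math.sqrt(tau00 * tau11)                       # Eq. 11
    k = vT / v1                                                # Eq. 12
    return rho1, delta_T, r, k
```

As a check, the parameter values from the worked example in the Examples section (τ 00 = 0.4, σ 2 = 0.6, τ 01 = 0.063, τ 11 = 0.062, δ = 0.184, T = 3) recover ρ 1 = 0.4, Δ T ≈ 0.3, r ≈ .4, and k = 1.5.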

With the additional restriction on the first-time-point variance V(Y j1 |W j ) = τ 00 + σ 2 = 1, which entails no loss of generality, the following equations for each parameter can be obtained under the constraint T ≥ 2:

$$ {\tau}_{00}={\rho}_1 $$
(13)
$$ {\sigma}^2=1-{\rho}_1 $$
(14)
$$ {\tau}_{01}=\frac{r\sqrt{\rho_1\left({r}^2{\rho}_1-{\rho}_1+k-{\sigma}^2\right)}-{r}^2{\rho}_1}{T-1} $$
(15)
$$ {\tau}_{11}=\frac{\tau_{01}^2}{r^2{\tau}_{00}}=\frac{{\displaystyle {\left(\sqrt{r^2{\rho}_1-{\rho}_1+k-{\sigma}^2}-r\sqrt{\rho_1}\right)}^2}}{{\left(T-1\right)}^2} $$
(16)
$$ \delta =\frac{\sqrt{k}{\varDelta}_T}{T-1}. $$
(17)

From these equations, we can implicitly specify the parameters by setting just four indices (ρ 1, Δ T , r, k). The merit of this transformation becomes clear by considering the usefulness of the effect size Δ T . Namely, although specifying the five parameters alone does not clearly show the magnitude of the expected effect size (i.e., the mean difference and variances) at the last time point, specifying Δ T provides this information directly. Compared with setting the respective parameters, these indices can be specified intuitively, making the proposed method more understandable and convenient.
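The inverse transformation of Eqs. 13–17 can likewise be sketched in a few lines of Python (the function name indices_to_params is ours):

```python
import math

def indices_to_params(rho1, delta_T, r, k, T):
    """Recover (sigma2, delta, tau00, tau01, tau11) from (rho_1, Delta_T, r, k)
    via Eqs. 13-17, under the restriction tau00 + sigma2 = 1 and T >= 2."""
    tau00 = rho1                                    # Eq. 13
    sigma2 = 1.0 - rho1                             # Eq. 14
    inner = r**2 * rho1 - rho1 + k - sigma2         # common radicand (= r^2*rho1 + k - 1)
    tau01 = (r * math.sqrt(rho1 * inner) - r**2 * rho1) / (T - 1)      # Eq. 15
    tau11 = (math.sqrt(inner) - r * math.sqrt(rho1))**2 / (T - 1)**2   # Eq. 16
    delta = math.sqrt(k) * delta_T / (T - 1)        # Eq. 17
    return sigma2, delta, tau00, tau01, tau11
```

With the index values used in the Examples section (ρ 1 = 0.4, Δ T = 0.3, r = .4, k = 1.5, T = 3), this yields τ 01 ≈ 0.063, τ 11 ≈ 0.062, and δ ≈ 0.184, matching the values reported there.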

The overall procedure of estimating statistical power for null hypothesis (H 0 : δ = 0) significance testing can be sketched as follows by applying the method of Satorra and Saris (1985) in Step 2–Step 4:

  1. Step 0: Prepare

    Set the number of time points T, significance level α, and sample size of the control group N C and experimental group N E (here, N E  + N C = N). Additionally, without loss of generality, set β 00 = β 10 = 0.

  2. Step 1: Set indices

    Set values of indices ρ 1, Δ T , r, and k, and calculate parameters σ 2, δ, τ 00, τ 01 = τ 10, and τ 11 using Eqs. 13, 14, 15, 16 and 17.

  3. Step 2: Calculate mean and covariance structures

    Using Eqs. 6, 7 and 8 and the parameters specified in Step 1, calculate the mean structures μ E (Θ) and μ C (Θ) for the experimental and control groups, respectively, and calculate the variance–covariance structure Σ(Θ) that is the same across groups. Here, Θ denotes all parameters in the model.

  4. Step 3: Calculate χ² values

    Assuming that the data obtained from both the experimental group (D E) and the control group (D C) have means and variance–covariances equal to the structures specified in Step 2, estimate parameters and calculate the χ² values (χ²₀, χ²₁) under both the null model (H 0 : δ = 0) and the alternative model (H 1 : δ ≠ 0) by using the following equations:

    $$ \begin{array}{c}\hfill {\chi}_0^2=\left({N}_C-1\right) \log {L}^{\ast}\left({\varTheta}_0\Big|{D}_C\right)+\left({N}_E-1\right) \log {L}^{\ast}\left({\varTheta}_0\right|{D}_E\Big)\hfill \\ {}\hfill {\chi}_1^2=\left({N}_C-1\right) \log {L}^{\ast}\left({\varTheta}_1\Big|{D}_C\right)+\left({N}_E-1\right) \log {L}^{\ast}\left({\varTheta}_1\right|{D}_E\Big).\hfill \end{array} $$
    (18)

Here, Θ 0 and Θ 1 denote the parameters included in the null and alternative models, respectively—namely, Θ 0 = (σ 2, τ 00, τ 01, τ 11) and Θ 1 = (σ 2, δ, τ 00, τ 01, τ 11). log L ∗(Θ 0|D C ) and log L ∗(Θ 1|D C ) are the log-likelihoods for the control group under the null and alternative models, and log L ∗(Θ 0|D E ) and log L ∗(Θ 1|D E ) are the log-likelihoods for the experimental group under the null and alternative models, respectively. For example, log L ∗(Θ 0|D C ) can be concretely calculated as

$$ \log {L}^{\ast}\left({\widehat{\varTheta}}_0\Big|{D}_C\right)= \mathrm{tr}\left(\varSigma {\left({\widehat{\varTheta}}_0\right)}^{-1}\varSigma \left(\varTheta \right)\right)- \log \left|\varSigma {\left({\widehat{\varTheta}}_0\right)}^{-1}\varSigma \left(\varTheta \right)\right|+{\left({\mu}_C\left(\varTheta \right)-{\mu}_C\left({\widehat{\varTheta}}_0\right)\right)}^{\prime }\varSigma {\left({\widehat{\varTheta}}_0\right)}^{-1}\left({\mu}_C\left(\varTheta \right)-{\mu}_C\left({\widehat{\varTheta}}_0\right)\right)-T. $$
(19)

Here, tr(X) and |X| denote the trace and determinant of matrix X. \( {\mu}_C\left({\widehat{\varTheta}}_0\right) \) and \( \varSigma \left({\widehat{\varTheta}}_0\right) \) are, respectively, the estimated mean and variance–covariance structures given data D C under the null model, and the other likelihoods can be expressed in a similar manner. Naturally, under the alternative model, \( {\mu}_C\left({\widehat{\varTheta}}_1\right) \) and \( \varSigma \left({\widehat{\varTheta}}_1\right) \) are equivalent to the mean structure μ C (Θ) and covariance structure Σ(Θ) specified in Step 2, and this relation holds true for the experimental group (i.e., \( {\mu}_E\left({\widehat{\varTheta}}_1\right) \)) as well. As a result, χ²₁ is expected to become 0, because the data D C and D E fit the alternative model perfectly when δ ≠ 0 and there are no missing data and no model misspecification. A key assumption described by Satorra and Saris (1985) is that we have data with the specified means and variance–covariances; this enables us to calculate the χ² values (χ²₀, χ²₁), which carry information about statistical power (i.e., a larger difference between these two chi-square values indicates a larger difference in model fit between the null and alternative models, so that the null model is rejected with higher probability).

  5. Step 4: Calculate statistical power

    Calculate the noncentrality parameter λ as

    $$ \lambda ={\chi}_0^2-{\chi}_1^2, $$
    (20)

then statistical power ϕ can be calculated using the cumulative probability of a noncentral chi-square distribution whose degree of freedom (v) equals 1 as

$$ \phi =1-P\left({\chi}_{1-\alpha}^2,v=1,\lambda \right). $$
(21)

Here, P(S, v, λ) expresses the cumulative probability of the value S under a noncentral chi-square distribution whose degrees of freedom and noncentrality parameter equal v and λ, respectively. χ²₁₋α denotes the critical value corresponding to significance level α of the central chi-square distribution.
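For v = 1, the power in Eq. 21 can be computed without special functions, because a noncentral chi-square variate with one degree of freedom and noncentrality λ is distributed as (Z + √λ)² for standard normal Z. A minimal Python sketch (the function name power_df1 is ours):

```python
import math
from statistics import NormalDist

def power_df1(lam, alpha=0.05):
    """Statistical power phi = 1 - P(chi^2_{1-alpha}, v = 1, lam), Eq. 21."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)   # sqrt of the chi-square critical value (1.96 for alpha = .05)
    s = math.sqrt(lam)
    # P(chi^2_1(lam) <= z^2) = Phi(z - sqrt(lam)) - Phi(-z - sqrt(lam))
    return 1 - (nd.cdf(z - s) - nd.cdf(-z - s))
```

With λ = 1.3887, the value derived for N = 50 in the Examples section, this returns approximately 0.218, in line with the power reported there; λ = 0 returns α, as it should under the null.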

Figure 1 illustrates calculated statistical power under different N and Δ T . To evaluate the influence of each index on statistical power more clearly, a statistical power curve based on T = 6, r = .2, k = 5, and ρ 1 = 0.5 is set as the baseline, and eight kinds of statistical power curves, each varying just one setting, are shown. In Fig. 1a, b, N and Δ T clearly have large impacts on statistical power, whereas the influence of T is not obvious, although larger T shows slightly greater statistical power. In Fig. 1c, d, large absolute values of r have positive effects on statistical power, since they implicitly lead to parameter settings with larger τ 01. In Fig. 1e, f, although smaller values of k (i.e., k = 1, implicitly indicating the constraints τ 01 = τ 11 = 0 from Eqs. 15 and 16) show positive effects, the effect of k is not obvious when k becomes larger (e.g., k = 50). Finally, in Fig. 1g, h, larger values of ρ 1 show a positive effect on statistical power, because they indicate relatively smaller level 1 error variance σ 2 at every time point.

Fig. 1
figure 1

Statistical power curves when setting T = 6, r = .2, k = 5, and ρ 1 = 0.5 as the baseline

Simulation and construction of numerical tables

Although the procedure described in the previous section is useful for conveniently and intuitively estimating statistical power for the test of H 0 : δ = 0, two problems still prevent the use of power analysis for longitudinal data analysis with a multilevel model through this method. One is that specifying parameters is still a difficult challenge for researchers. Although the four indices are much more intuitive and can be specified largely by imagining the expected growth curves in the respective groups, assuming these values with high confidence is difficult without prior information or theoretical knowledge about the data. Also problematic is the insufficient convenience of the procedure. Although t tests seem to be the most popular statistical tests for which power analysis is used, this can be attributed mainly to the fact that there are multiple convenient ways of conducting power analysis for them, such as statistical programs (e.g., G*Power 3; Faul, Erdfelder, Lang, & Buchner, 2007) or numerical tables (e.g., Cohen, 1988). Indeed, not all applied researchers can handle the calculations in Step 0–Step 4 without difficulty.

To solve these problems, this section evaluates the extent of each index's influence on statistical power through simulated data and then identifies the indices that must be specified with high precision when estimating statistical power and the needed sample size. By reference to the simulation results, numerical tables are constructed. Additionally, the uses of both the proposed procedure (Step 0–Step 4) and the numerical tables are illustrated.

Simulation study

Through the procedure (Step 0–Step 4) described in the previous section, we calculated statistical power with N = 10, 30, 50, 100, 200, 400; T = 3, 6, 9; Δ T = 0.1, 0.2, . . . , 1.0; ρ 1 = 0.1, 0.3, 0.5, 0.7, 0.9; r = −.8, −.4, .4, .8; and k = 1, 3, 5, 10, 30, 50 in an orthogonal design under α = .05. Additionally, different sample size proportions for the experimental group, P E = N E /N, were set as P E = .5, .6, .7, .8, .9. Namely, a total of 6 × 3 × 10 × 5 × 4 × 6 × 5 = 108,000 statistical powers were calculated. The values of these indices were specified by considering actual experimental designs. As for P E , for example, experimental and control groups are sometimes unbalanced owing to various practical considerations (e.g., producing experimental drugs for clinical trials is expensive, so the numbers of units assigned to experimental and control [placebo] groups are not equal). A simulation program in R is available at the author’s Web site (http://satoshiusami.com/).
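The orthogonal design just described can be enumerated directly; the following Python sketch (condition values taken from the text) reproduces the count of 108,000 conditions:

```python
from itertools import product

Ns      = [10, 30, 50, 100, 200, 400]
Ts      = [3, 6, 9]
deltaTs = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
rho1s   = [0.1, 0.3, 0.5, 0.7, 0.9]
rs      = [-0.8, -0.4, 0.4, 0.8]
ks      = [1, 3, 5, 10, 30, 50]
PEs     = [0.5, 0.6, 0.7, 0.8, 0.9]

# One power calculation per cell of the fully crossed design
conditions = list(product(Ns, Ts, deltaTs, rho1s, rs, ks, PEs))
print(len(conditions))  # 108000
```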

To intuitively evaluate the influence of these indices on statistical power, a six-way between-subjects ANOVA (factors: number of time points T, sample size proportion for the experimental group P E , residual correlation r, variance proportion k, reliability at the first time point ρ 1, and effect size at the last time point Δ T ), assuming only first-order interactions between factors, was performed. Table 1 shows the results under different conditions of total sample size N (= 50, 400). Table 1 shows that the effect size Δ T is the dominant influence on statistical power, with effect sizes (η 2) of 0.81 and 0.89 for the respective sample size conditions. A similar tendency is observed for the other conditions of total sample size N. This result is also obvious from the means and standard deviations of statistical power for the marginal data sets at the respective factor levels, shown in Fig. 2. The next most influential factor is the sample size proportion P E and its interaction with effect size Δ T (i.e., P E × Δ T ). Namely, P E shows effect sizes (η 2) of 0.08 and 0.02 for N = 50 and 400, respectively. Figure 2 shows that when P E takes an extreme value (e.g., P E = .9), statistical power rapidly becomes smaller, since the standard error of the parameter estimate for the experimental effect is smallest in a balanced design (P E = .5). Similar results regarding sample size proportions have been shown in many publications, such as Usami (2011). These results indicate that specifying appropriate Δ T and P E is the dominant key to precisely estimating statistical power and the needed sample size, and that the influences of ρ 1, r, and k can be pragmatically ignored when constructing numerical tables.

Table 1 Result of ANOVA for calculated statistical power.
Fig. 2
figure 2

Means and standard deviations of statistical power for marginal data set in respective levels of factors

Construction of numerical tables

By using the Step 0–Step 4 procedure described in the previous section, the needed total sample sizes N were estimated, and numerical tables based on differences in Δ T (= 0.1, 0.2, . . . , 2.0) and P E (= .5, .6, .7, .8, .9) were constructed. Table 2 provides three numerical tables that give lower bounds of N for obtaining statistical power exceeding ϕ = 0.80 under the conditions r = 0, k = 4, and ρ 1 = 0.5 for T = 3, 6, 9, in units of 10 for N. Although T has a relatively small effect on statistical power, T can generally be determined prior to the experiment, and so numerical tables for different values of T are provided.

Table 2 Numerical tables to evaluate lower bounds of the needed sample size to obtain statistical power of more than ϕ = 0.80 under conditions of r = 0, k = 4, and ρ 1 = 0.5

For example, when Δ T = 1.0 (with ρ 1 = 0.5, k = 4, and r = 0) is assumed in a T = 6, P E = .5 experimental design, Table 2(b) indicates that the needed total sample size to obtain at least ϕ = 0.80 is approximately N = 40 (i.e., N E = 20 and N C = 20). In a design with similar conditions except that P E = .9, the same table indicates that the needed total sample size is approximately N = 80 (i.e., N E = 72 and N C = 8). Although these numerical tables may seem rough, particularly because they are segmented into units of 10 for N, they provide pragmatically useful information to researchers, since perfectly specifying the four indices (or parameters) is nearly impossible in actual research. On this point, note that in these tables we provide the needed sample size rounded up to the nearest 10 units, meaning that the needed sample sizes shown are somewhat conservative, and underestimation of needed sample sizes should occur less frequently in actual use.

For illustrative purposes, Table 3 provides similar numerical tables based on Table 2(b) (i.e., T = 6, r = 0, k = 4, and ρ 1 = 0.5) in which just one index is changed. For example, Table 3(a), (b) shows the cases of k = 1 and 25, leaving the other index values unchanged. Table 3 shows that the needed sample size does not differ greatly from the values shown in Table 2, especially when the effect size is relatively large (Δ T > 0.5). The tendency for these indices (k, ρ 1, r) to be much less influential on the values in the numerical tables is consistent with the foregoing simulation results. Additionally, Table 2 is based on relatively moderate index settings (k = 4, ρ 1 = 0.5, and r = 0); it provides conservative needed sample sizes in comparison with Table 3(a), (c), (e), and (f), and highly similar needed sample sizes in comparison with the values shown in Table 3(b) and (d).

Table 3 Numerical tables to evaluate lower bounds of the needed sample size to obtain statistical power of more than ϕ = 0.80

Although Table 2 alone provides pragmatically useful information about needed sample sizes, there are many cases in which no prior information about the indices (and parameters) is available. In that case, a useful procedure is to evaluate the needed sample size conservatively. Namely, from the results in Table 3, numerical tables based on k = 100, r = 0, and ρ 1 = 0.1, which are shown in Table 4, provide a conservative needed sample size that is pragmatically usable. Through the use of Table 4, researchers should rarely underestimate the needed sample size.

Table 4 Numerical tables to evaluate lower bounds of the needed sample size to obtain statistical power of more than ϕ = 0.80 under conditions of r = 0, k = 100, and ρ 1 = 0.1 (a very conservative condition)

When reliable prior information or theoretical knowledge about the indices or parameters is available, or when the expected effect size is not large (Δ T ≤ 0.5), other numerical tables may be of further use. Although we omit such additional tables because of space limitations, tables based on other index conditions can be provided upon request, and R programs, as well as several numerical tables, are available on the author’s Web site.

Examples

Here we show an example using the proposed procedure and numerical tables. Several recent epidemiological studies have examined the relationship between sleep habits and mental health status in adolescents. For example, cross-sectional studies have reported that an irregular sleep schedule is associated with poor mental health (e.g., Oshima et al., 2010). Here we consider a hypothetical situation in which applied researchers are interested in how mean changes in mental health (outcome Y) among junior high school students differ between students who were exposed to a sleep-habit intervention program (experimental group, W = 0.5) and students who were not (control group, W = − 0.5). Students are assumed to be randomly assigned to each group.

In Step 0, students are assumed to fill out a survey annually for 3 years (T = 3), and the significance level α is set at .05. The proportion for the experimental group, P E = N E /N, is set at P E = .5, and five total sample sizes—N = 50, 100, 200, 300, and 400—are considered (namely, the sample sizes of the control group N C and the experimental group N E are N E = N C = 25, 50, 100, 150, and 200, respectively). In Step 1, the four indices are set as ρ 1 = 0.4 (mental health is evaluated by questionnaire, so relatively large measurement noise is expected), Δ T = 0.3 (the effect size is small), r = .4 (the level 2 residual correlation is expected to be moderately large, from previous research showing that students with poor mental health scores at entrance to junior high school were likely to show larger declines in mental health scores, whereas students with better mental health showed moderate declines), and k = 1.5 (the variance of mental health scores slightly increases over time). In Step 2, by using Eqs. 13, 14, 15, 16 and 17, the five parameters can be calculated as τ 00 = 0.4, σ 2 = 0.6, τ 01 = τ 10 = 0.063, τ 11 = 0.062, and δ = 0.184. From these parameters and Eqs. 6, 7 and 8, the mean structures μ E (Θ) and μ C (Θ) for the experimental and control groups can be calculated as μ E (Θ) = (0, 0.092, 0.184)′ and μ C (Θ) = (0, − 0.092, − 0.184)′, respectively (here, without loss of generality, β 00 = β 10 = 0). The variance–covariance matrix Σ(Θ), which is the same across groups, is calculated as

$$ \varSigma \left(\varTheta \right)=\left(\begin{array}{ccc}\hfill 1.000\hfill & \hfill 0.463\hfill & \hfill 0.526\hfill \\ {}\hfill 0.463\hfill & \hfill 1.188\hfill & \hfill 0.713\hfill \\ {}\hfill 0.526\hfill & \hfill 0.713\hfill & \hfill 1.500\hfill \end{array}\right). $$
(22)
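These moments follow mechanically from Eqs. 6–8; the following Python sketch (the function name model_moments is ours) reproduces the mean vector and variance–covariance matrix above from the Step 2 parameter values:

```python
def model_moments(tau00, tau01, tau11, sigma2, delta, T, W):
    """Mean vector (Eq. 6, with beta00 = beta10 = 0) and covariance matrix
    (Eqs. 7-8) of (Y_j1, ..., Y_jT) given W_j = W, using X_t = t - 1."""
    X = list(range(T))                   # X_t = 0, 1, ..., T-1
    mu = [delta * W * x for x in X]
    Sigma = [[tau00 + (xs + xt) * tau01 + xs * xt * tau11
              + (sigma2 if xs == xt else 0.0)   # error variance on the diagonal only
              for xt in X] for xs in X]
    return mu, Sigma
```

With τ 00 = 0.4, τ 01 = 0.063, τ 11 = 0.062, σ 2 = 0.6, δ = 0.184, T = 3, and W = 0.5, this yields μ E (Θ) = (0, 0.092, 0.184)′ and the Σ(Θ) of Eq. 22.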

In Step 3, it is assumed that both the experimental group (D E) and the control group (D C) have means [μ E (Θ) and μ C (Θ)] and variance–covariances Σ(Θ) as specified in Step 2. Parameters are then estimated in OpenMx under both the null model (H 0 : δ = 0) and the alternative model (H 1 : δ ≠ 0). From Eq. 18, χ²₀ is calculated as 1.3887, 2.7774, 5.5548, 8.3322, and 11.1096 in the N = 50, 100, 200, 300, and 400 conditions, respectively. χ²₁ is approximately 0 in every condition of N, since the alternative (true) model is expected to fit the data perfectly, resulting in λ = 1.3887, 2.7774, 5.5548, 8.3322, and 11.1096 in the N = 50, 100, 200, 300, and 400 conditions, respectively. Finally, in Step 4, by using λ, the statistical power ϕ can be calculated through the cumulative probability of a noncentral chi-square distribution as ϕ = 1 − P(χ²₁₋α, v = 1, 1.3887) = 1 − P(3.8415, v = 1, 1.3887) = 0.2181 in the N = 50 condition. Likewise, ϕ is estimated as 0.3847, 0.6543, 0.8229, and 0.9151 in the N = 100, 200, 300, and 400 conditions, respectively. Therefore, if the desired statistical power is 0.80, a sample size greater than N = 300 is required.

When applied researchers wish to estimate the needed sample size just by referring to the numerical tables, then using the information T = 3, Δ T = 0.3, and P E = .5 with Table 2(a), the requisite total sample size is estimated as N = 330. This slightly overestimates the needed total sample size, mainly because the indices ρ 1, k, and r are ignored. Additionally, this discrepancy can be attributed partly to the smaller value of Δ T = 0.3, since in this case the estimation of the needed total sample size becomes unstable. Finally, consider the situation in which applied researchers wish to use a conservative method, because no prior information about indices other than Δ T = 0.3 (and T = 3 and P E = .5) is available and the other indices (ρ 1, k, and r) may take extreme values. In this case, from Table 4(a), the needed total sample size is estimated as N = 360. Therefore, even if these unknown indices take extreme values, N = 360 is enough to achieve the desired statistical power of ϕ = 0.80.

Discussion

In this article, we proposed a convenient procedure for evaluating statistical power in tests of an experimental effect in longitudinal experimental research using multilevel models. The fundamental step of this method is to transform the model parameters (δ, σ 2, τ 00, τ 01 = τ 10, and τ 11) into indices (Δ T , ρ 1, r, and k), thereby making the application of power analysis much easier and more intuitive. Additionally, to make power analysis more convenient, numerical tables were constructed on the basis of simulation results investigating the influence of each index on statistical power.

According to the simulation results, Δ T and P E are the most important factors for precisely estimating the needed sample size and statistical power. Moreover, when P E is extreme and Δ T is small, larger sample sizes are necessary to achieve the requisite power. These results imply that less emphasis can be placed on the other indices when designing longitudinal studies if Δ T and P E can be specified with precision. Since Δ T and P E are the most influential factors, Table 2 provides researchers with pragmatically useful information about needed sample sizes. When no prior information about the indices (and parameters) is available and the effect size is not hypothesized to be large (Δ T < 0.5), a useful procedure is to conservatively estimate the needed sample size by using Table 4. Additionally, simulation code in R and programs for computing power tables are available on the author’s Web site.

Note that the calculation of statistical power relies on the asymptotic result that the difference in chi-square values under the null and alternative models follows a noncentral chi-square distribution with one degree of freedom (Eqs. 20 and 21). However, this would not be completely satisfied unless the sample size were infinite, so the robustness of the obtained results for both statistical power and needed sample size should be investigated. To that end, we calculated simulated statistical power (Muthén & Muthén, 2002). Namely, a Monte Carlo method was used to generate 5,000 sets of simulation data, and the proportion of replications that actually rejected H 0 : Δ T = 0 was calculated. For example, Fig. 3 provides both the theoretical and the simulated statistical power for T = 6, P E = .5, r = .2, k = 4, and ρ 1 = 0.5, showing no obvious differences between the two. In fact, the maximum absolute difference was less than .02 in all conditions. The needed sample sizes denoted in our numerical tables therefore seem to be sufficiently robust and pragmatic with respect to the asymptotic assumption.

Fig. 3

Theoretical and simulated statistical power curves for T = 6, P E = .5, r = .2, k = 4, and ρ 1 = 0.5 (5,000 replications)
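The simulated-power check can be sketched as follows. The parameter values below (β 00 , β 10 , δ, σ 2 , τ 00 , and τ 11 ) are illustrative assumptions rather than the values underlying Fig. 3, and for brevity a two-sample t test on per-individual OLS slopes stands in for the full maximum likelihood multilevel fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(N=200, T=6, delta=0.4, sigma2=1.0, tau00=0.5,
                    tau11=0.1, r=0.2, reps=1000, alpha=0.05):
    """Proportion of replications rejecting H0: delta = 0, with data
    generated from Eqs. 1-3 (illustrative parameter values)."""
    X = np.arange(T)                            # X_t = t - 1
    tau01 = r * np.sqrt(tau00 * tau11)          # cov(u_j0, u_j1)
    cov_u = [[tau00, tau01], [tau01, tau11]]
    W = np.repeat([0, 1], N // 2)               # P_E = .5: equal group sizes
    rejections = 0
    for _ in range(reps):
        u = rng.multivariate_normal([0.0, 0.0], cov_u, size=N)
        b0 = 1.0 + u[:, 0]                      # beta_0j (beta_00 = 1, arbitrary)
        b1 = 0.5 + delta * W + u[:, 1]          # beta_1j (beta_10 = 0.5, arbitrary)
        Y = b0[:, None] + np.outer(b1, X) + rng.normal(0.0, np.sqrt(sigma2), (N, T))
        slopes = np.polyfit(X, Y.T, 1)[0]       # per-individual OLS slopes
        p = stats.ttest_ind(slopes[W == 1], slopes[W == 0]).pvalue
        rejections += p < alpha
    return rejections / reps
```

Comparing the returned proportion against the theoretical ϕ for the same condition then gives the kind of check plotted in Fig. 3.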

For simplicity, we confined explicit development of the proposed method to investigation of a single experimental effect based on two groups. It would be straightforward to extend our methodology to an arbitrary number of groups and factors, although the relevant parameters and indices for evaluating statistical power would be more complex. Cases in which the outcome is binary or ordered or in which longitudinal data entail hierarchical structures such as individual and group structures (e.g., three-level models, as per Raudenbush & Bryk, 2002) are also intriguing topics for future research.

One possible limitation of the proposed method lies in its model assumptions. Namely, a two-level multilevel model is used that assumes a linear change of outcomes over time to test longitudinal experimental effects with two groups. Although it is straightforward to extend the proposed procedure to larger numbers of groups, the assumptions of a linear growth pattern and a group difference in the slope are not always suitable (e.g., there may be a large intervention effect immediately following the intervention that disappears as time goes on). This could result in a serious misfit of both the model means and the variance–covariances to the actual longitudinal data (e.g., Liu et al., 2012). However, it should be noted that in many cases the present multilevel models (Eqs. 1–3) and related ones are robust to violations of the linearity assumption, as long as the research interest lies in the null hypothesis H 0 : δ = 0 (i.e., investigation of the difference in the amount of change among groups over time) and Δ T can be estimated precisely. Additionally, several extensions, such as a quadratic change of outcomes, can be conducted straightforwardly.
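For instance, a quadratic extension adds a squared time term to the level-1 equation (Eq. 1); the notation β 2j , β 20 , δ 2 , and u j2 below is ours, sketched by analogy with Eqs. 2 and 3:

$$ {Y}_{jt}={\beta}_{0j}+{\beta}_{1j}{X}_t+{\beta}_{2j}{X}_t^2+{e}_{jt},\kern2em {\beta}_{2j}={\beta}_{20}+{\delta}_2{W}_j+{u}_{j2}, $$

where δ 2 captures a group difference in curvature and would be tested analogously to δ.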

Furthermore, the use of transformed indices can be extended to more complicated multilevel models and latent curve models, as Usami (2012) showed with an example of a latent change score model (McArdle, 2009) based on six indices. Therefore, the utility of the present method can promote consideration of the sample size determination problem in research that uses various multilevel models and latent curve models.

An additional consideration that warrants investigation is the case in which the assumption of equal time intervals is unrealistic. However, even in this case, the estimated statistical power seems to be very robust if Δ T is specified precisely, since Δ T so strongly dominates statistical power. Related to this property, since Δ T was not originally constructed by referring to the results of longitudinal research, caution should be taken when using this indicator. Effect sizes observed in actual longitudinal research can differ from those developed for cross-sectional research. In particular, the criteria of Δ T = 0.2, 0.5, and 0.8, corresponding to small, medium, and large effect sizes, may not be optimal, and other criteria may be proposed in the future.

The proposed method is also limited in that attrition, which is typical in longitudinal research, is not directly considered, although this limitation is common to various sample size determination methods (e.g., Heo & Leon, 2008; Usami, in press). Fortunately, the influence of attrition can be taken into consideration by combining the proposed procedures with the Monte Carlo simulation methods proposed by Muthén and Muthén (2002). Another merit of the Monte Carlo method is that the effect of nonnormal data on the estimated sample size can be easily evaluated.

In actual analyses, it is common to construct multilevel models that include predictors at both level 1 (Eq. 1) and level 2 (Eqs. 2 and 3). Generally, statistical power is expected to improve when the predictors included in a model are limited to those that have a significant effect on outcomes, since the estimated error and residual variances become smaller. In particular, as the conditional variance in Eq. 7 indicates, τ 11 must be highly related to the power to reject H 0 : δ = 0. Therefore, including predictors that can explain individual differences in β 1j (i.e., individual differences in the slopes of the growth curves) is especially useful. Note, however, that when predictors are highly correlated not only with β 1j but also with the intervention assignment indicator variable W j , estimates of the experimental effect δ may be strongly biased and more difficult to interpret.
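As a sketch, adding a level-2 predictor Z j (our notation; e.g., a baseline covariate) with coefficient γ to Eq. 3 gives

$$ {\beta}_{1j}={\beta}_{10}+\delta {W}_j+\gamma {Z}_j+{u}_{j1}, $$

so that the residual slope variance τ 11 shrinks by the portion of the individual slope differences that Z j explains, which in turn raises the power to reject H 0 : δ = 0.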

Although it may seem contradictory to the arguments presented in this research, obtaining large statistical power is not always important in practical applications. This is especially true when the precision of estimates is crucial (Hox, 2010; Maas & Hox, 2005) or when the available sample size is large and rejection of the hypothesis does not provide useful information (Usami, 2011). Power analysis is a technique for determining the sample size necessary to reject the null hypothesis on the basis of a preset probability, but considering only the significance of estimates does not speak to the meaningfulness of the effect size or the reliability of the estimates, as many researchers have repeatedly pointed out (e.g., Cohen, 1994). In such cases, the use of other indices, such as effect sizes and confidence intervals, can reinforce the interpretation of the results obtained in behavioral and psychological research, so the construction of numerical tables regarding not only statistical power but also confidence intervals is strongly desired.