Introduction

In the past half century, growth curve models (GCMs) have evolved from fitting a single curve for one individual to fitting multilevel or mixed-effects models, and from linear to nonlinear models (e.g., McArdle, 2000; McArdle & Nesselroade, 2002; Preacher et al., 2008; Tucker, 1958). The application of GCMs in social and behavioral research has grown rapidly since Meredith and Tisak (1990) showed that such models can be fitted as a restricted common factor model in the structural equation modeling framework (see also McArdle, 1986; McArdle & Epstein, 1987). Nowadays, GCMs are widely used in analyzing longitudinal data to study change (e.g., Laird & Ware, 1982; McArdle & Nesselroade, 2003; Meredith & Tisak, 1990; Zhang et al., 2007). For a more comprehensive discussion of GCMs, see McArdle and Nesselroade (2002).

To estimate GCMs, the maximum likelihood estimation (MLE) method is commonly used (e.g., Demidenko, 2004; Laird & Ware, 1982). MLE is available in both commercial software, such as SAS PROC MIXED, and free software such as the R package lme4. MLE typically assumes that the errors of GCMs are normally distributed. However, in reality, data are more likely to be non-normally than normally distributed (Micceri, 1989). The violation of the normality assumption may result in unreliable parameter estimates, incorrect standard errors and confidence intervals, and misleading statistical tests and inference (Maronna et al., 2006; Yuan et al., 2004; Zu & Yuan, 2010).

Recently, Bayesian methods have become very popular in the social and behavioral sciences, largely advanced by the work of Lee and colleagues (e.g., Lee, 2007; Song & Lee, 2012). Nowadays, Bayesian methods have also received significant attention as useful tools for estimating a variety of models including GCMs, especially complex GCMs that can be difficult or impossible to estimate in the current MLE framework or in MLE-based software (e.g., Chow et al., 2011; Muthén & Asparouhov, 2012; Song et al., 2009, 2011; Wang & McArdle, 2008; Zhang et al., 2007). For example, Song et al. (2009) proposed a novel latent curve model for studying the dynamic change of multivariate manifest and latent variables and their linear and interaction relationships. Wang and McArdle (2008) applied the Bayesian method to estimate a GCM with unknown change points. Zhang et al. (2007) discussed how to integrate information in growth curve modeling through informative priors. Bayesian methods have also been used to analyze non-normal longitudinal data with GCMs. For example, Wang et al. (2008) developed Tobit GCMs to deal with limited response data, and Zhang et al. (2013) extended GCMs to allow the errors to follow a t-distribution.

Bayesian estimation for GCMs has generally been implemented in the specialized software WinBUGS. Although WinBUGS is flexible enough to estimate a wide range of models, it is unfamiliar to many researchers and can be difficult to use. Since version 9.2, the SAS Institute has provided a Markov chain Monte Carlo (MCMC) procedure that implements a general Bayesian estimation method through the adaptive block random-walk Metropolis algorithm with normal proposal distributions (SAS, 2009). The MCMC procedure has the potential to become popular because of several important advantages, a few of which are given here. First, SAS already has a large user base who can learn the use of the MCMC procedure relatively easily. Second, the data manipulation facilities of the SAS data step can be used with the MCMC procedure seamlessly. Third, specifically for growth curve analysis, the use of the MCMC procedure is similar to that of the existing NLMIXED procedure. Fourth, it is easy to define new prior and data distributions in the MCMC procedure. Currently, the use of the MCMC procedure is still rare, let alone its application in growth curve modeling. One exception is Zhang (2013), which provides SAS code for growth curve modeling with the MCMC procedure but gives no instruction on its use.

Therefore, this study serves two main purposes. First, we will present a general framework to flexibly model the error distributions of GCMs. Second, we will demonstrate how to estimate the models using the MCMC procedure with instructions. In the following, we will first discuss how to model different types of error distributions. Specifically, we will focus on how to use Bayesian methods to estimate GCMs with the normal, t, exponential power, and skew-normal errors. Then, we will demonstrate how to practically apply the method through the analysis of data from the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K). Finally, we will evaluate the influence of distribution misspecification through a simulation study.

Growth curve models with different types of error distributions

A typical GCM can be written as a two-level model. The first level involves fitting a growth curve with random coefficients for each individual and the second level models the random coefficients. Specifically, the first level of a GCM can be written as,

$$y_{it}=f(time_{it},\mathbf{b}_{i})+e_{it},\quad t=1,\ldots,\text{T},\ i=1,\ldots,\text{N}, $$

where \(y_{it}\) is the observed response for individual \(i\) at trial or occasion \(t\), \(time_{it}\) denotes the measurement time of the \(t\)th measurement occasion for the \(i\)th individual, \(f(time_{it},\mathbf{b}_{i})\) is a function of time representing the growth curve with individual coefficients \(\mathbf{b}_{i}=(b_{i1},b_{i2},\ldots,b_{ip})'\), a \(p\times1\) vector of random coefficients, and \(e_{it}\) is the residual or error.

The function \(f(time_{it},\mathbf{b}_{i})\) can take either a linear or a nonlinear form in the time variable. The function \(f\) determines the growth trajectory representing how an individual changes over time. For example, if \(f(time_{it},\mathbf{b}_{i})=b_{i0}+b_{i1}t\) with \(time_{it}=t\) and \(\mathbf{b}_{i}=(b_{i0},b_{i1})'\), it represents a linear growth trajectory. If \(f(time_{it},\mathbf{b}_{i})=b_{i0}+b_{i1}t+b_{i2}t^{2}\) with \(time_{it}=t\) and \(\mathbf{b}_{i}=(b_{i0},b_{i1},b_{i2})'\), it represents a quadratic growth trajectory.

The second level of a GCM can be expressed as,

$$\mathbf{b}_{i}=\boldsymbol{\beta}+\mathbf{u}_{i}, $$

where \(\boldsymbol{\beta}\) is a vector of fixed effects that, without covariates in the second level, can be seen as the average of the random coefficients \(\mathbf{b}_{i}\), and \(\mathbf{u}_{i}\) represents, typically, multivariate normal errors with \(E(\mathbf{u}_{i})=\mathbf{0}_{p\times1}\) and \(Cov(\mathbf{u}_{i})=\mathbf{D}_{p\times p}\). The \(\mathbf{u}_{i}\) are also called the random effects from a mixed-effects model perspective.

For growth curve analysis, it is typically assumed that \(e_{it}\) is normally distributed with mean 0 and variance \({\sigma_{e}^{2}}\). However, the normal distribution often lacks the flexibility to deal with the complexity of real data. For example, the normal distribution is symmetric, but data in practice can be skewed. Furthermore, it has been shown that the normal distribution is not robust to outliers or data with heavy tails (Yuan & Zhang, 2012b). If a normal distribution is specified when the errors are actually non-normal, the parameter estimators will not be fully efficient.

In this study, we propose a general framework to model the distribution of \(e_{it}\) explicitly and flexibly. The \(e_{it}\) can be assumed to have the general form \(p(e_{it}|\boldsymbol{\zeta})\), where the exact form can be decided from data. The distribution parameter \(\boldsymbol{\zeta}\) can typically be estimated within a GCM. To be consistent with the growth curve analysis literature, the mean of \(e_{it}\) is assumed to be 0. To determine the distribution of \(e_{it}\), individual growth curve analysis can be utilized; to estimate the model parameters, including \(\boldsymbol{\beta}\), \(\mathbf{D}\), \({\sigma_{e}^{2}}\), and \(\boldsymbol{\zeta}\), Bayesian estimation methods can be applied.

Individual growth curve analysis

To explore the distribution of \(e_{it}\), the method based on individual growth curve analysis discussed by Tong and Zhang (2012) can be applied. Using this method, an individual growth curve

$$y_{it}=f(time_{it},\mathbf{b}_{i})+e_{it},\quad t=1,\ldots,\text{T} $$

is first fitted to the data from each individual through the least squares method. The individual coefficients \(\mathbf{b}_{i}\) and the errors \(e_{it}\) can be estimated and retained. Specifically, we have

$$\hat{e}_{it}=y_{it}-f\left( time_{it},\hat{\mathbf{b}}_{i}\right). $$

With the estimated errors, one can first conduct a D'Agostino skewness test (D'Agostino, 1970) on the errors, which evaluates whether a distribution is symmetric. If the test rejects symmetry, a skewed distribution is needed. One then conducts an Anscombe-Glynn test (Anscombe & Glynn, 1983) on the kurtosis, which evaluates whether the kurtosis of the errors differs from the normal kurtosis. If the test rejects the normal kurtosis, a distribution allowing non-normal kurtosis is then needed.

In addition to the summary statistics on skewness and kurtosis, more detailed distributional information can be obtained from the histogram or density plot of \(\hat{e}_{it}\). Based on this information, one can decide on one or more candidate error distributions. If more than one candidate distribution is plausible, the Kolmogorov–Smirnov test (e.g., Chakravarti et al., 1967) can be used to select the best one, or all candidate distributions can be tried for comparison. To ease the procedure, an online program has been developed to assist the selection of an error distribution. The URL to the program is http://psychstat.org/gcmdiag.
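As an illustration of this diagnostic procedure, the sketch below (our own Python code, not the authors' online program; the simulated data and all variable names are assumptions) fits a least-squares line to each individual, pools the residuals, and runs both tests. scipy.stats.skewtest implements the D'Agostino skewness test, and scipy.stats.kurtosistest implements the Anscombe-Glynn kurtosis test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, T = 200, 4
time = np.arange(T)

# Simulated linear growth data with heavy-tailed t(4) errors, for illustration.
b = rng.multivariate_normal([5.0, 2.0], [[1.0, 0.2], [0.2, 0.5]], size=N)
y = b[:, [0]] + b[:, [1]] * time + rng.standard_t(4, size=(N, T))

# Fit an individual linear growth curve to each person by least squares
# and retain the residuals e_it.
X = np.column_stack([np.ones(T), time])
residuals = []
for i in range(N):
    bhat, *_ = np.linalg.lstsq(X, y[i], rcond=None)
    residuals.append(y[i] - X @ bhat)
e = np.concatenate(residuals)

# D'Agostino skewness test and Anscombe-Glynn kurtosis test on pooled errors.
z_skew, p_skew = stats.skewtest(e)
z_kurt, p_kurt = stats.kurtosistest(e)
print(f"skewness test p = {p_skew:.3f}, kurtosis test p = {p_kurt:.3f}")
```

In practice, the pooled residuals would come from the observed data rather than a simulation, and the per-time tests reported later in the article apply the same two functions to the residuals at each occasion separately.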

Two issues need to be emphasized here. First, the individual analysis method assumes that the growth function is known a priori. The following strategy can assist the choice of the growth function. If a growth function is supported theoretically or empirically, it should be used. Otherwise, exploratory techniques can help select one. For example, both the individual and mean growth trajectories can be plotted from the data. Based on the plot, one can decide whether a trajectory is linear or nonlinear. For a linear trajectory, a linear function can be used; for a nonlinear trajectory, the plot may also suggest which nonlinear form fits best. In situations where several nonlinear forms could apply, one might try all of them or select the one with the simplest form. Note that in studying the trajectory, both the mean and individual trajectories should be considered, because the mean trajectory is often influenced by the non-normality of the data. Second, even when the growth function is known, the population error distribution is usually impossible to know. The individual analysis method aims to approximate the population distribution through the empirical data. In addition, one would prefer a distribution that is less affected by potential misspecification.

Bayesian estimation

A general introduction to Bayesian analysis can be found in both introductory books (e.g., Kruschke, 2011) and specialized ones (e.g., Lee, 2007; Song & Lee, 2012). For a GCM with the error distribution \(p(e_{it}|\boldsymbol{\zeta})\), the joint posterior distribution of the random effects \(\mathbf{b}_{i}\) and the model parameters based on the data augmentation algorithm (Tanner & Wong, 1987) is

$$ \begin{array}{cl} & p(\boldsymbol{\beta},\mathbf{D},\boldsymbol{\zeta},\mathbf{b}_{i}|y_{it},time_{it},i=1,\ldots,N,t=1,\ldots,T)\\ \propto & p(\boldsymbol{\beta},\mathbf{D},\boldsymbol{\zeta}){\prod}_{i=1}^{N}\left\{ p(\mathbf{b}_{i}|\boldsymbol{\beta},\mathbf{D}){\prod}_{t=1}^{T}\left[p(y_{it}|f(time_{it},\mathbf{b}_{i}),\boldsymbol{\zeta})\right]\right\} \\ = & p(\boldsymbol{\beta},\mathbf{D},\boldsymbol{\zeta}){\prod}_{i=1}^{N}\left\{ \frac{1}{2\pi|\mathbf{D}|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{b}_{i}-\boldsymbol{\beta})^{\prime}\mathbf{D}^{-1}(\mathbf{b}_{i}-\boldsymbol{\beta})\right]\right\} \\ & \times{\prod}_{i=1}^{N}{\prod}_{t=1}^{T}\left[p(y_{it}|f(time_{it},\mathbf{b}_{i}),\boldsymbol{\zeta})\right] \end{array} $$
(1)

where p(β,D,ζ) is the joint prior distribution for model parameters.

In this study, uninformative independent priors are used for \(\boldsymbol{\beta}\), \(\mathbf{D}\), and \(\boldsymbol{\zeta}\) such that \(p(\boldsymbol{\beta},\mathbf{D},\boldsymbol{\zeta})=p(\boldsymbol{\beta})p(\mathbf{D})p(\boldsymbol{\zeta})\) to minimize the influence of the priors. For \(\boldsymbol{\beta}\), a bivariate normal prior \(MN(\boldsymbol{\beta}_{0},\boldsymbol{\Sigma}_{0})\) is used with \(\boldsymbol{\beta}_{0}=(0,0)'\) and \(\boldsymbol{\Sigma}_{0}=1000\mathbf{I}_{2}\). For \(\mathbf{D}\), an inverse Wishart prior distribution \(IW(m_{0},\mathbf{V}_{0})\) is used with \(m_{0}=2\) and \(\mathbf{V}_{0}=\mathbf{I}_{2}\). Here, \(\mathbf{I}_{2}\) denotes a 2 by 2 identity matrix. These priors are chosen so that the densities are relatively flat and are often viewed as uninformative in the Bayesian literature (e.g., Congdon, 2003; Gelman et al., 2003). The prior distribution for \(\boldsymbol{\zeta}\) depends on the form of the error distribution.

Incorporating the priors above into Eq. 1, the resulting posterior is readily available but usually has a very complex form. To evaluate the posterior, Markov chain Monte Carlo (MCMC) methods can be used to generate sequences of random numbers through consecutive conditional draws, formally known as Markov chains, from the posterior distribution, and to construct parameter estimates from the average and standard deviation of the generated numbers. The SAS MCMC procedure uses the block Metropolis method, which divides the model parameters into blocks and samples each block of parameters successively. Specifically, we divide the parameters and the random effects into the following blocks: \(\boldsymbol{\beta}\), \(\mathbf{D}\), \(\boldsymbol{\zeta}\), and \(\mathbf{b}_{i}\), \(i=1,\ldots,N\). From the joint posterior distribution, we can obtain the conditional posterior distribution of each block given the remaining parameters, which is then sampled within the block. Details on such algorithms can be found in the literature (e.g., Robert & Casella, 2004).

There are two practical issues in applying MCMC methods. The first is to diagnose the convergence of the Markov chains. Convergence ensures that the empirical average from a Markov chain converges to the mean of the posterior distribution. Brooks and Roberts (1998) and Cowles and Carlin (1996) discussed many methods for testing convergence. For example, one can use the Geweke test (Geweke, 1992) and/or visually inspect the trace plots of the Markov chains. The Geweke test calculates a statistic that is asymptotically normally distributed; for a converged Markov chain, its absolute value should be smaller than 1.96 at the 0.05 alpha level. Visual inspection focuses on whether there is any unstable pattern in the plot of the generated chains. Oftentimes, the initial part of a Markov chain has not yet converged to the posterior and is discarded as the burn-in period. The second issue is to decide the length of the Markov chains. By nature, Markov chains have autocorrelation. Of two Markov chains, the one with greater autocorrelation provides less information about the posterior distribution; in other words, a longer Markov chain is needed to accurately describe a posterior if its autocorrelation is greater. To characterize the information in a Markov chain, we use the effective sample size (ESS), the equivalent sample size assuming no autocorrelation. For two Markov chains of the same length, the one with the greater ESS provides more information. A practical rule of thumb is to obtain an ESS of at least 400 (Lunn et al., 2012).
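The two diagnostics can be sketched as follows. These are simplified versions for illustration only (SAS computes the Geweke statistic with spectral density estimates of the variances rather than the plain variances used here, and the ESS truncation rule is our own assumption):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Compare the means of the first 10% and last 50% of a chain.

    Simplified: plain variance estimates instead of spectral ones.
    """
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def effective_sample_size(chain, max_lag=200):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(chain)
    c = chain - chain.mean()
    acf = np.correlate(c, c, mode="full")[n - 1:] / (c @ c)
    rho_sum = 0.0
    for k in range(1, min(max_lag, n - 1)):
        if acf[k] < 0:  # truncate at the first negative autocorrelation
            break
        rho_sum += acf[k]
    return n / (1 + 2 * rho_sum)

rng = np.random.default_rng(1)
iid = rng.normal(size=5000)            # no autocorrelation: ESS close to n
ar = np.empty(5000)                    # highly autocorrelated AR(1) chain
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
print(effective_sample_size(iid), effective_sample_size(ar))
```

The contrast between the two chains illustrates the point above: the autocorrelated chain carries far less information per iteration, so a much longer run would be needed to reach the same ESS.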

For a converged Markov chain with sufficient effective sample size, we can construct the posterior mean, posterior standard deviation, and highest posterior density (HPD; Box & Tiao, 1973) credible interval for each model parameter for inference. Let \(\theta\) represent a single parameter in the vector \(\boldsymbol{\theta}=(\boldsymbol{\beta},\mathbf{D},\boldsymbol{\zeta})\). Suppose the converged Markov chain for \(\theta\) is \(\theta_{i}, i=1,\ldots,n\), where \(n\) is the number of iterations. Then, a point estimate of \(\theta\) can be constructed by the sample mean of the Markov chain

$$\hat{\theta}=\bar{\theta}=\frac{1}{n}\sum\limits_{i=1}^{n}\theta_{i}. $$

The standard deviation, or the standard error in the frequentist view, of \(\hat{\theta}\) is given by

$$\widehat{s.d.(\hat{\theta})}=\left[\frac{1}{n-1}\sum\limits_{i=1}^{n}\left( \theta_{i}-\bar{\theta}\right)^{2}\right]^{1/2}. $$

Credible intervals for \(\theta\) can also be constructed from the Markov chain. The most widely used are the equal-tail credible interval and the HPD credible interval. A 100(1−α) % equal-tail credible interval is \([\theta_{\alpha/2},\theta_{1-\alpha/2}]\), where the lower and upper bounds are the 100(α/2)th and 100(1−α/2)th percentiles of the Markov chain, respectively. The HPD credible interval is the interval that covers 100(1−α) % of the density formed by the Markov chain while having the smallest interval width. For symmetric posteriors, the equal-tail and HPD credible intervals coincide. For asymmetric posteriors, the HPD credible interval is narrower than the equal-tail credible interval and is therefore often preferred.
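As a sketch of these summaries (our own code with hypothetical variable names, not the SAS implementation), all four quantities can be computed directly from the saved draws; the HPD interval is found as the shortest window containing 100(1 − α)% of the sorted draws:

```python
import numpy as np

def summarize(draws, alpha=0.05):
    draws = np.sort(np.asarray(draws))
    n = len(draws)
    mean = draws.mean()
    sd = draws.std(ddof=1)                       # posterior sd / "standard error"
    equal_tail = (np.quantile(draws, alpha / 2),
                  np.quantile(draws, 1 - alpha / 2))
    m = int(np.floor((1 - alpha) * n))           # draws inside the interval
    widths = draws[m:] - draws[:n - m]           # widths of all such windows
    j = int(np.argmin(widths))                   # shortest window = HPD
    hpd = (draws[j], draws[j + m])
    return mean, sd, equal_tail, hpd

rng = np.random.default_rng(2)
chain = rng.gamma(2.0, 1.0, size=20000)          # a right-skewed "posterior"
mean, sd, et, hpd = summarize(chain)
# For a skewed posterior, the HPD interval is narrower than the equal-tail one.
print(et[1] - et[0], hpd[1] - hpd[0])
```

Running this on a deliberately skewed chain shows the property stated above: the HPD width is smaller than the equal-tail width, while both cover 95 % of the draws.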

Illustration of error distributions

Potentially, the error \(e_{it}\) can follow any distribution. The normal distribution is certainly the most widely used. In addition to the normal distribution, we present three other distributions that can extend GCMs to non-normal data analysis: the t, exponential power, and skew normal distributions. The normal distribution is a special case of each of the three.

Student’s t-distribution

The t-distribution can be viewed as a generalization of the normal distribution and has a kurtosis greater than that of the normal distribution. If \(e\) follows a t-distribution \(t(e|\boldsymbol{\zeta}=(\mu,\phi,k))\), its density function can be written as

$$p_{t}(e)=\frac{\Gamma[(k+1)/2]}{\Gamma(k/2)}\left( \frac{1}{\pi k\phi}\right)^{1/2}\left[1+\frac{(e-\mu)^{2}}{k\phi}\right]^{-\frac{k+1}{2}} $$

where μ is the location parameter, ϕ is the scale parameter, and k is the degrees of freedom. The t-distribution has longer tails and is more robust to outliers than the normal distribution (e.g., Lange et al., 1989; Zhang et al., 2013). Therefore, the t-distribution is often used in robust data analysis. As the degrees of freedom k goes to infinity, the t-distribution approaches the normal distribution. The mean of the t-distribution is

$$E(e)=\mu,k>1, $$

and the variance is

$$Var(e)=\frac{k}{k-2}\phi,k>2. $$

The t-distribution is symmetric with a skewness value 0. Its kurtosis is

$$\gamma_{2}=\frac{3k-6}{k-4},k>4. $$

Note that when \(k\) goes to infinity, \(\gamma_{2}=3\), the kurtosis of the normal distribution. If the error \(e_{it}\sim t(0,\phi,k)\), then \(y_{it}-f(time_{it},\mathbf{b}_{i})\sim t(0,\phi,k)\).

In Fig. 1, we plot the density of the t-distribution with degrees of freedom \(k=1,3,5,30\), and \(\infty\), and the same location parameter μ = 0 and scale parameter ϕ = 1. When \(k=\infty\), the density is the same as that of the normal distribution. Note that with the increase of the degrees of freedom, the t-distribution moves closer to the normal distribution. It can also be observed that, in the tails, the t-distribution has greater density than the normal distribution and therefore allows data to lie far from the center. For real data, \(k\) can be estimated as an unknown parameter to quantify the deviation from normality.
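As a quick numerical check (our own sketch, not from the article), the density above is the scaled t density: with location μ and scale ϕ, \((e-\mu)/\sqrt{\phi}\) follows a standard t with k degrees of freedom, so the formula can be verified against scipy.stats.t:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def t_pdf(e, mu=0.0, phi=1.0, k=5.0):
    # Density of the t-distribution as written above, computed on the log scale.
    logc = gammaln((k + 1) / 2) - gammaln(k / 2) - 0.5 * np.log(np.pi * k * phi)
    return np.exp(logc - (k + 1) / 2 * np.log1p((e - mu) ** 2 / (k * phi)))

e = np.linspace(-4.0, 4.0, 9)
ours = t_pdf(e, mu=0.0, phi=2.0, k=5.0)
ref = stats.t.pdf(e, df=5.0, loc=0.0, scale=np.sqrt(2.0))
print(np.max(np.abs(ours - ref)))   # agreement to machine precision
```

The variance formula above can be checked the same way: for k = 5 and ϕ = 2, scipy reports Var = kϕ/(k − 2) = 10/3.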

Fig. 1

The t-distribution with the location parameter μ = 0 and scale parameter ϕ = 1. The degrees of freedom are 1, 3, 5, 30, and \(\infty \), respectively. The kurtosis for k = 1 or k = 3 does not exist, and the kurtosis is 9, 3.23, and 3 for k = 5, 30, and \(\infty \), respectively

Exponential power distribution

The t-distribution can model errors \(e_{it}\) with kurtosis larger, but not smaller, than 3. The exponential power distribution, also called the generalized error distribution or generalized normal distribution, can instead model errors with both positive and negative excess kurtosis (Box & Tiao, 1973; Nadarajah, 2005). The exponential power distribution comes in different forms; in this study, the form of Box and Tiao (1973) is used. If \(e\) follows an exponential power distribution \(ep(e|\boldsymbol{\zeta}=(\mu,\sigma,\gamma))\), the density function can be expressed as

$$ p_{ep}(e)=\omega(\gamma)\sigma^{-1}\exp\left[-c(\gamma)\left|\frac{e-\mu}{\sigma}\right|^{2/(1+\gamma)}\right] $$
(2)

where

$$ \omega(\gamma)=\frac{\{\Gamma[3(1+\gamma)/2]\}^{1/2}}{(1+\gamma)\{\Gamma[(1+\gamma)/2]\}^{3/2}} $$
(3)

and

$$ c(\gamma)=\left\{ \frac{\Gamma[3(1+\gamma)/2]}{\Gamma[(1+\gamma)/2]}\right\}^{1/(1+\gamma)}. $$
(4)

Here μ is a location parameter, and σ is a scale parameter that is also the standard deviation of the population. γ is a shape parameter that is related to the kurtosis of the distribution and characterizes the “non-normality” of the population. The mean and variance of the exponential power distribution are

$$\begin{array}{@{}rcl@{}} E(e) & =&\mu\\ Var(e) & =&\sigma^{2}. \end{array} $$

The exponential power distribution is symmetric and therefore its skewness is 0. Its kurtosis is

$$\gamma_{2}=\frac{\Gamma[5(1+\gamma)/2]{\Gamma}[(1+\gamma)/2]}{\Gamma[3(1+\gamma)/2]^{2}}. $$

From the exponential power distribution, many other distributions can be derived. For example, if γ = 0, the exponential power distribution becomes a normal distribution; if γ = 1, it becomes a double exponential distribution; and as γ approaches −1, it approaches a uniform distribution. For growth curve analysis, the errors at different times can take exponential power distributions with different γ values to account for both platykurtic and leptokurtic errors simultaneously. If the error \(e_{it}\sim ep(0,\sigma,\gamma)\), then \(y_{it}-f(time_{it},\mathbf{b}_{i})\sim ep(0,\sigma,\gamma)\).

To better illustrate the influence of γ, we plot the density function of the exponential power distribution with different γ values in Fig. 2. The five distributions in the figure share the same location parameter μ = 0 and scale parameter σ = 1, with γ taking one of the five values −.75, −.25, 0, .5, and 1. When γ = 0, the distribution is a normal distribution. When γ > 0, the exponential power distribution has fatter tails than the normal distribution; when γ < 0, it has thinner tails. Therefore, the exponential power distribution can deal with data with both fat and thin tails. For real data, γ can be estimated as an unknown parameter (Box & Tiao, 1973).
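Since SAS has no built-in exponential power distribution, it helps to see the density of Eqs. 2–4 in executable form. The sketch below (our own Python, with hypothetical names) checks two properties stated above: at γ = 0 the density reduces to the normal density, and σ is the standard deviation of the distribution:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln
from scipy.integrate import quad

def ep_pdf(e, mu=0.0, sigma=1.0, gamma=0.0):
    g = 1.0 + gamma
    # omega(gamma) and c(gamma) from Eqs. 3 and 4, computed on the log scale.
    log_omega = 0.5 * gammaln(3 * g / 2) - np.log(g) - 1.5 * gammaln(g / 2)
    c = np.exp((gammaln(3 * g / 2) - gammaln(g / 2)) / g)
    z = np.abs((np.asarray(e) - mu) / sigma)
    return np.exp(log_omega) / sigma * np.exp(-c * z ** (2.0 / g))

x = np.linspace(-3.0, 3.0, 7)
print(np.max(np.abs(ep_pdf(x, gamma=0.0) - stats.norm.pdf(x))))  # ~0

# Variance by numerical integration for the double exponential case (gamma = 1):
# should be close to sigma^2 = 4.
var, _ = quad(lambda t: t * t * ep_pdf(t, sigma=2.0, gamma=1.0),
              -60, 60, points=[0])
print(var)
```

The same function evaluated over a grid reproduces the fat- and thin-tailed shapes shown in Fig. 2 for positive and negative γ.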

Fig. 2

The exponential power distribution with μ = 0, σ = 1, and different values of γ. Each value of γ is placed on top of the corresponding density plot. The corresponding kurtosis for γ = −.75, −.25, 0, .5, 1 is 1.92, 2.55, 3, 4.22, and 6, respectively

Skew normal distribution

Both the t-distribution and the exponential power distribution are symmetric and, therefore, may not be appropriate when the error distribution is asymmetric. A skew normal distribution captures potential departure from symmetry by allowing for skewness, and can also be viewed as a generalization of the normal distribution (Azzalini, 1985). If \(e\) follows a skew normal distribution \(sn(e|\boldsymbol{\zeta}=(\mu,\omega,\alpha))\), its density function can be written as

$$p_{sn}(e)=\frac{2}{\omega}\phi\left( \frac{e-\mu}{\omega}\right){\Phi}\left( \alpha\frac{e-\mu}{\omega}\right) $$

where μ is a location parameter, ω is a scale parameter, and α is a shape parameter that determines the skewness of the distribution. ϕ and Φ are the standard normal density function and cumulative standard normal distribution function, respectively. The mean and variance of the skew normal distribution are

$$E(e)=\mu+\omega\delta\sqrt{\frac{2}{\pi}} $$

and

$$Var(e)=\omega^{2}\left( 1-\frac{2\delta^{2}}{\pi}\right) $$

where \(\delta =\frac {\alpha }{\sqrt {(1+\alpha ^{2})}}\). Furthermore, its skewness is

$$\gamma_{1}=\frac{4-\pi}{2}\frac{\left( \delta\sqrt{2/\pi}\right)^{3}}{\left( 1-2\delta^{2}/\pi\right)^{3/2}}. $$

The kurtosis of the distribution is

$$\gamma_{2}=3+2(\pi-3)\frac{\left( \delta\sqrt{2/\pi}\right)^{4}}{\left( 1-2\delta^{2}/\pi\right)^{2}} $$

and is greater than that of the normal distribution.

To show the influence of α on skewness, we plot the density function of the skew normal distribution with different α values in Fig. 3. For comparison, we maintain the same location parameter μ = 0 and scale parameter ω = 1, with the shape parameter α taking the values −4, −2, 0, 2, and 4. When α = 0, the skew normal distribution becomes a normal distribution and its density is symmetric. If α > 0, the distribution is right skewed, and if α < 0, it is left skewed. For real data, α can be estimated as an unknown parameter. If the error \(e_{it}\sim sn(\mu,\omega,\alpha)\), then \(y_{it}-f(time_{it},\mathbf{b}_{i})\sim sn(\mu,\omega,\alpha)\). Because the mean of the error is assumed to be zero, we reparameterize the location parameter so that \(\mu=-\omega\delta\sqrt{2/\pi}\).
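The density and moment formulas above can be checked against scipy.stats.skewnorm (a sketch of our own; the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

def sn_pdf(e, mu=0.0, omega=1.0, alpha=0.0):
    # Skew normal density as written above: (2/omega) * phi(z) * Phi(alpha*z).
    z = (e - mu) / omega
    return 2.0 / omega * stats.norm.pdf(z) * stats.norm.cdf(alpha * z)

mu, omega, alpha = 0.0, 1.5, 3.0
x = np.linspace(-4.0, 6.0, 11)
diff = np.max(np.abs(sn_pdf(x, mu, omega, alpha)
                     - stats.skewnorm.pdf(x, alpha, loc=mu, scale=omega)))

# Mean and variance from the formulas above.
delta = alpha / np.sqrt(1 + alpha ** 2)
mean = mu + omega * delta * np.sqrt(2 / np.pi)
var = omega ** 2 * (1 - 2 * delta ** 2 / np.pi)
print(diff, mean, var)
```

Both moments agree with scipy's own skewnorm.mean and skewnorm.var, confirming the expressions for \(E(e)\) and \(Var(e)\) given above.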

Fig. 3

The skew normal distribution with μ = 0 and ω = 1 and different values of α. The corresponding skewness for α = −4, −2, 0, 2, 4 is −.78, −.45, 0, .45, and .78, respectively

The error distributions discussed here can be plugged into the posterior in Eq. 1. The Bayesian method can then be used to obtain model parameter estimates. We now explain how to estimate GCMs with different error distributions using the MCMC procedure of SAS.

Model estimation using the MCMC procedure

GCMs with different error distributions can be estimated using the MCMC procedure of SAS. We illustrate the use of the MCMC procedure in estimating GCMs with the normal distribution, t-distribution, exponential power distribution, and skew normal distribution. The code can be found from our website at http://psychstat.org/brm2015.

Growth curve model with the normal error

The SAS code using the procedure MCMC for estimating the normal error model is given in Code Block 1. The code on the first line controls the overall data analysis. DATA=ndata specifies the data set to use. NBI=5000 tells SAS to discard the first 5,000 iterations as burn-in. NMC=40000 tells SAS to generate Markov chains with another 40,000 iterations for analysis. THIN=2 means that every other iteration will be saved for inference; therefore, combined with NMC, a total of 20,000 iterations are saved. SEED=17 specifies the seed of the random number generator. INIT=RANDOM generates random initial values for parameters that are not provided with initial values. DIAG=(ESS GEWEKE) requests the calculation of the effective sample size and the Geweke test statistic. STATISTICS(ALPHA=0.05)=(SUMMARY INTERVAL) requests the calculation of the summary and interval statistics for each parameter at the alpha level 0.05. OUTPOST=histnorm saves the generated Markov chains into a SAS data set named histnorm, which can be processed further if needed.

[Code Block 1: SAS code for the GCM with the normal error]

Line 2 uses the SAS output delivery system (ODS) to output nicely arranged tables and figures. PARAMETERS requests the output of the sampling method, prior distribution, and initial value information for each parameter. POSTSUMMARIES requests the output of the mean, standard deviation, and 25 %, 50 %, and 75 % percentiles, and POSTINTERVALS requests the output of the 95 % percentile and HPD credible intervals of model parameters. GEWEKE and ESS request the output of the Geweke test and the effective sample size. TADPANEL will generate the trace, autocorrelation and density plots in one figure for visually inspecting Markov chain convergence.

The keyword ARRAY specifies an array. For example, b[2] on Line 3 is an array with two elements and the first element is L and the second element is S. Sigma_b[2,2] on Line 5 specifies a 2 by 2 array. If the values of an array are known, they can be provided as fixed values. For example, beta0[2] on Line 6 has two elements with values 0 and V[2,2] on Line 8 is an identity matrix.

The keyword PARMS specifies each block of parameters and their initial values. For example, on Line 9, beta is treated as one block of parameters with the initial values {5,2}. Note that the initial values are enclosed by {} for vectors and arrays. The keyword PRIOR specifies the prior distribution for the model parameters. For example, a multivariate normal prior (MVN) is used for beta on Line 12. Note that beta0 and sigma0 are given in the ARRAY statement. Furthermore, Sigma_b is given an inverse Wishart prior on Line 13 and var_e is given an inverse Gamma prior on Line 14.

The random coefficients b are augmented with the model parameters in the estimation. The distribution of b is a multivariate (specifically, bivariate) normal distribution, specified as RANDOM b ∼ MVN(beta, Sigma_b) SUBJECT=id on Line 15, where id is a variable that distinguishes individuals.

Since the error \(e_{it}\) follows a normal distribution, \(y_{it}\) also follows a normal distribution with mean \(b_{i0}+b_{i1}t\) and the same variance. Therefore, the data are modeled using a normal distribution, as shown in MODEL y ∼ NORMAL(mu, var=var_e) on Line 17, with the mean calculated by mu = L + S ∗ time on Line 16.
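To make the model concrete, the following Python sketch (our own illustration, not from the article; the parameter values, variable names, and long-format layout are assumptions) simulates data from this normal-error linear GCM in the one-row-per-observation form that DATA=ndata and SUBJECT=id expect:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 4
beta = np.array([5.0, 2.0])                 # fixed effects: intercept L, slope S
D = np.array([[1.0, 0.3], [0.3, 0.5]])      # covariance of the random effects
var_e = 0.25                                # error variance

time = np.arange(T)
b = rng.multivariate_normal(beta, D, size=N)            # b_i ~ MVN(beta, D)
mu = b[:, [0]] + b[:, [1]] * time                       # mu = L + S * time
y = mu + rng.normal(0.0, np.sqrt(var_e), size=(N, T))   # y ~ NORMAL(mu, var_e)

# Long format: one row per (id, time, y) triple, as SUBJECT=id expects.
long = np.column_stack([np.repeat(np.arange(N), T),
                        np.tile(time, N),
                        y.ravel()])
print(long.shape)
```

Fitting the SAS code in Code Block 1 to data generated this way should recover beta, D, and var_e up to Monte Carlo error.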

Growth curve model with the t error

The SAS code for the linear GCM with the t error is given in Code Block 2. The code is very similar to that for the normal error except that the data y are modeled by the t-distribution y ∼ t(mu, var=var_e, df) with the unknown degrees of freedom df as shown on Line 19. Furthermore, the df is given a uniform prior UNIFORM(1,500) on Line 16.

[Code Block 2: SAS code for the GCM with the t error]

Growth curve model with the exponential power error

The SAS code for the linear GCM with the exponential power error is given in Code Block 3. The code is similar to that for the normal and t errors. In addition to the extra parameters for γ (g1, g2, g3, g4 on Lines 12-15), a notable difference is that the data are given a general distribution with argument ll on Line 26, because there is no built-in exponential power distribution in SAS. However, a new distribution can be constructed flexibly. Comparing ll on Line 24 with the exponential power density function in Eq. 2 makes the construction clear: the new distribution is defined through the logarithm of the density function. Another difference is that MONITOR=(_PARMS_ var_e) is used on Line 1. MONITOR lists the model parameters, or newly created quantities, whose Markov chains are to be saved. _PARMS_ represents all parameters directly used in the model as specified by the keyword PARMS. New quantities can also be calculated; for example, var_e is the variance calculated from the standard deviation sigma_e on Line 25.

[Code Block 3: SAS code for the GCM with the exponential power error]

Growth curve model with the skew normal error

The SAS code for the linear GCM with the skew normal error is given in Code Block 4. The difference between the current code and the previous ones is that the error e, instead of the data y, is modeled directly, as indicated by MODEL e ∼ general(ll) on Line 23. The reason to model e is that the skew normal distribution is not "centered" around μ in the sense of the normal distribution; directly modeling the data y would make the estimates of β not comparable with those from the other distributions. Therefore, we "trick" the MCMC procedure by specifying a skew normal distribution whose location parameter \(-\omega\delta\sqrt{2/\pi}\) makes the error mean zero, as indicated by xi = -sigma_e∗alpha/sqrt(1+alpha∗alpha)∗sqrt(2/3.1415926) on Line 21.

[Code Block 4: SAS code for the GCM with the skew normal error]
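The zero-mean reparameterization can be verified numerically. The sketch below (our own, using scipy.stats.skewnorm; treating sigma_e as the scale ω, mirroring the SAS code) shows that with the shifted location the skew normal mean is exactly zero:

```python
import numpy as np
from scipy import stats

sigma_e, alpha = 2.0, 4.0                       # arbitrary scale and shape
delta = alpha / np.sqrt(1 + alpha ** 2)
xi = -sigma_e * delta * np.sqrt(2 / np.pi)      # mirrors the SAS line for xi
# With this location, the skew normal mean is zero, so the error is centered.
print(stats.skewnorm.mean(alpha, loc=xi, scale=sigma_e))
```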

An example

The Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) focuses on children’s early school experiences from kindergarten to middle school. The ECLS-K began in the fall of 1998 with a nationally representative sample of 21,260 kindergartners. The ECLS-K is a longitudinal study with data collected in the fall and spring of kindergarten (1998-99), the fall and spring of 1st grade (1999-2000), the spring of 3rd grade (2002), the spring of 5th grade (2004), and the spring of 8th grade (2007). Data on children’s home environment, home educational activities, school achievement, school environment, classroom environment, classroom curriculum, and teacher qualifications are available.

Typical longitudinal studies do not have such a large sample size. Therefore, we randomly selected a sample of 563 (about 2.5 %) children with complete data on mathematical ability in the fall and spring of kindergarten and of 1st grade. Summary statistics for this data set are provided in Table 1, and the longitudinal plot for all data is presented in Fig. 4. Both the summary statistics and the longitudinal plot suggested a linear growth pattern. Based on the skewness, the data appear to be symmetric. However, based on the kurtosis, only the data at time 1 seem to be normally distributed; the data distribution is platykurtic at times 2 and 3 and leptokurtic at time 4.

Fig. 4 Longitudinal plot of the mathematical data of the ECLS-K sample

Table 1 Summary statistics of the example data

In order to diagnose the error distribution, we fitted a linear regression to the data from each individual. The residual errors from the regression analyses were pooled together for normality testing. Specifically, the D’Agostino skewness test and the Anscombe-Glynn kurtosis test were conducted (see Table 2; Anscombe & Glynn, 1983; D’Agostino, 1970). First, the skewness was -0.06 and the kurtosis was 3.77 for the overall error distribution. The D’Agostino test indicated that the error distribution was symmetric, but the Anscombe-Glynn test showed that the kurtosis of the error distribution was greater than that of the normal distribution. Then, the error distribution was checked at each measurement time. The error distribution at each time seemed to be symmetric, too. However, the distribution at time 2 was platykurtic and at time 4 was leptokurtic. Therefore, the error distribution seems to be best modeled with the exponential power distribution at each time.
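The residual-pooling diagnostic just described can be sketched as follows; scipy's skewtest and kurtosistest implement the D'Agostino skewness test and the Anscombe-Glynn kurtosis test, respectively. The data here are simulated stand-ins, not the ECLS-K sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated stand-in data: N children measured at T occasions with
# heavy-tailed (t-distributed) errors around a linear trend.
N, T = 563, 4
times = np.arange(T)
y = 2.4 + 2.4 * times + rng.standard_t(df=5, size=(N, T))

# Fit a linear regression to each individual's trajectory and pool
# the residuals across all individuals and occasions.
X = np.column_stack([np.ones(T), times])
coefs, *_ = np.linalg.lstsq(X, y.T, rcond=None)   # 2 x N coefficients
residuals = (y.T - X @ coefs).ravel()

# D'Agostino skewness test and Anscombe-Glynn kurtosis test.
_, skew_p = stats.skewtest(residuals)
_, kurt_p = stats.kurtosistest(residuals)
print(f"skewness test p = {skew_p:.3f}, kurtosis test p = {kurt_p:.3f}")
```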

Table 2 Tests of skewness and kurtosis for errors from the individual growth curve analysis

The linear GCM with the exponential power error distribution was fitted to the data. The trace plots for the model parameters indicated that the Markov chains for all parameters converged well after a burn-in period of 5,000 iterations. Furthermore, all Markov chains passed the Geweke test, as shown in Table 3. The effective sample sizes ranged from 448 to 9,791 across model parameters.

Table 3 Results from the linear growth curve model with the exponential power error distribution

Parameter estimates for the model are given in Table 3. Based on the estimates of \(\gamma_{t}\), the errors at times 1, 3, and 4 had kurtoses greater than 3 (positive \(\gamma\)) while the errors at time 2 had kurtosis smaller than 3 (negative \(\gamma\)). Furthermore, based on the credible intervals, \(\gamma_{4}\) is statistically significant, indicating that the error distribution at time 4 cannot be treated as normal. Based on the parameter estimates, the average intercept of mathematical ability was 2.354 with an average rate of growth of 2.364. There were also significant individual differences in both the intercept and the rate of growth. Furthermore, based on the covariance estimate, children with a greater intercept tended to show a lower rate of growth.

Influence of error distribution misspecification

We have demonstrated that error distributions can be modeled explicitly through Bayesian methods. If the errors are not normally distributed but a normal-based method is used, the error distribution is misspecified. Conversely, if the errors are normally distributed but a non-normal distribution is used, the error distribution is equally misspecified. In this section, we evaluate the influence of error distribution misspecification in both situations through a simulation study. In the literature, the bootstrap method is often used to obtain standard errors for model parameters when data are suspected of non-normality. Therefore, we also compare our methods using the non-normal distributions with the bootstrap method.

Data generation

Data are generated from the following linear GCM,

$$\begin{array}{@{}rcl@{}} y_{it} & =& b_{i0}+b_{i1}t+e_{it}\\ b_{i0} & =& \beta_{0}+v_{i0}\\ b_{i1} & =& \beta_{1}+v_{i1} \end{array} $$

where \(\beta_{0}=5\) and \(\beta_{1}=2\). The random effects follow a bivariate normal distribution with mean zero, \(Var(v_{i0})={\sigma _{0}^{2}}=2\), \(Var(v_{i1})={\sigma _{1}^{2}}=1\), and \(Cov(v_{i0},v_{i1})=\sigma_{01}=0\). To evaluate the influence of misspecifying a normal error as non-normal, \(e_{it}\) is set to follow a normal distribution with mean 0 and variance \({\sigma _{e}^{2}}=1\). To evaluate the influence of misspecifying a non-normal error as normal, \(e_{it}\) is set to follow a t-distribution with ζ = (μ, ϕ, k) = (0, 1/3, 3), an exponential power distribution with ζ = (μ, σ, γ) = (0, 1, γ), and a skew normal distribution with ζ = (μ, ω, α) = (−1.31, 1.64, 10), respectively. Note that the parameters of the skew normal distribution are chosen so that the mean of the distribution is 0. For the exponential power distribution, the parameter γ is set at different values at different measurement times: γ = (−0.8, −0.8, 0, 0.8, 0.8) for T = 5 and γ = (−0.8, −0.8, 0.8, 0.8) for T = 4. Three sample sizes, N = 100, 300, 500, and two numbers of measurement occasions, T = 4, 5, are investigated. Under each condition, 100 sets of data are generated and analyzed.
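As an illustration of this setup, the sketch below generates data from the linear GCM for the normal-error and t-error conditions in Python (the study's own generation code is in R; the exponential power and skew normal generators are omitted for brevity). Treating ϕ = 1/3 as a squared scale, so the t errors have variance ϕk/(k−2) = 1, is an assumption about the parameterization:

```python
import numpy as np

rng = np.random.default_rng(2015)

def generate_gcm_data(N=100, T=4, error="normal"):
    """Generate one data set from the simulation's linear GCM.

    Random effects: b_i0 ~ N(5, 2) and b_i1 ~ N(2, 1), uncorrelated.
    """
    b0 = 5.0 + rng.normal(0.0, np.sqrt(2.0), N)
    b1 = 2.0 + rng.normal(0.0, 1.0, N)
    t = np.arange(T)
    mu = b0[:, None] + b1[:, None] * t          # N x T latent trajectories
    if error == "normal":
        e = rng.normal(0.0, 1.0, (N, T))        # sigma_e^2 = 1
    elif error == "t":
        # k = 3 df; scale sqrt(1/3) gives error variance (1/3)*3/(3-2) = 1,
        # matching the normal condition (assumed parameterization).
        e = np.sqrt(1.0 / 3.0) * rng.standard_t(3, size=(N, T))
    else:
        raise ValueError(f"unknown error type: {error}")
    return mu + e

y = generate_gcm_data(N=500, T=5, error="t")
print(y.shape)  # (500, 5)
```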

To estimate the model with the normal, t, exponential power, and skew normal distributions, the Bayesian method is used and implemented in the MCMC procedure of SAS as discussed earlier. For the bootstrap method, the same data are analyzed using Mplus with 1,000 bootstrap samples. R code for data generation and running the simulation, as well as the Mplus script, can be found online at http://psychstat.org/brm2015.

Misspecifying non-normal error as normal error

For the linear GCM with the normal error, t error, exponential power error, and skew normal error, the parameters \(\beta_{0},\beta_{1},{\sigma _{0}^{2}},\sigma _{01},\) and \({\sigma _{1}^{2}}\) are comparable and of the greatest interest to users. Figure 5 displays the scatter plots of parameter estimates and their standard errors for the five parameters when the error distribution is modeled as the normal and as the non-normal distributions, with N = 500 and T = 5, across all 100 replications. The scatterplots of the parameter estimates in the left panel of Fig. 5 clearly show that when the non-normal error distributions were misspecified as a normal distribution, there was little difference in the parameter estimates: the points fall close to the diagonal line.

Fig. 5 Parameter estimates and their standard errors when the t, exponential power, and skew normal distributions are, respectively, misspecified as the normal distribution (top to bottom)

However, the right panel of Fig. 5 shows that the standard errors became larger when the non-normal error distributions were misspecified as a normal distribution: in the scatterplots, the majority of individual points fall below the diagonal line. This means that the standard errors were overestimated under the incorrect error distributions. To further quantify the overestimation, we calculate a statistic called reduced efficiency in standard error (RESE), defined as

$$RESE=100\times\left( \frac{1}{100}\sum\limits_{r=1}^{100}\frac{\widehat{se_{r}(\hat{\theta}_{w})}}{\widehat{se_{r}(\hat{\theta}_{c})}}-1\right) $$

where \(\widehat {se_{r}(\hat {\theta }_{w})}\) is the standard error estimate of the parameter estimate \(\hat {\theta }_{w}\) under the wrongly specified error distribution and \(\widehat {se_{r}(\hat {\theta }_{c})}\) is the corresponding estimate under the correctly specified error distribution for the rth replication of data. Note that \(RESE>0\) indicates a loss of efficiency; otherwise, there is a gain in efficiency. The RESE for each parameter is given in Fig. 5 for the condition N = 500 and T = 5, and all values are greater than 0, indicating a loss of efficiency when the error distribution is misspecified.
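The RESE computation can be sketched directly from the formula above; the arrays here are hypothetical standard error estimates across replications:

```python
import numpy as np

def rese(se_wrong, se_correct):
    """Reduced efficiency in standard error (RESE), in percent.

    Both arguments are arrays of standard error estimates across
    replications, from the misspecified and the correctly specified
    error distribution, respectively.
    """
    ratio = np.asarray(se_wrong, dtype=float) / np.asarray(se_correct, dtype=float)
    return 100.0 * (ratio.mean() - 1.0)

# Hypothetical example: if the misspecified model's standard errors are
# uniformly 10% larger, RESE is 10 (a loss of efficiency); identical
# standard errors give RESE = 0.
se_correct = np.array([0.10, 0.12, 0.11])
print(round(rese(1.10 * se_correct, se_correct), 6))  # 10.0
```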

Table 4 summarizes the RESE when the non-normal error distributions are misspecified as the normal distribution for different sample sizes and numbers of measurement occasions. Although the RESE varies, the overall patterns are the same. First, under all studied conditions, efficiency was reduced. Second, the random-effects parameters \({\sigma _{0}^{2}}\) and \(\sigma _{01}\) are influenced the most by the misspecification of the error distributions. Third, there is no clear relationship between RESE and sample size or the number of measurement occasions.

Table 4 RESE when the non-normal error distributions are misspecified as normal (in percentage)

Table 5 summarizes the RESE comparing the bootstrap method with the Bayesian method using the correctly specified error distributions. If the RESE is negative, the bootstrap method is more efficient; otherwise, the Bayesian method is more efficient. The results show that for β 0, \({\sigma _{0}^{2}}\), and \({\sigma _{1}^{2}}\), the Bayesian method is consistently more efficient. For β 1 and σ 01, the results are mixed. Overall, it seems that the Bayesian method is more efficient than the bootstrap method.

Table 5 RESE comparing the bootstrap standard errors with the Bayesian standard deviations using correctly specified error distributions (in percentage)

Misspecifying normal error as non-normal error

We have shown that misspecifying a non-normal error distribution as the normal distribution can reduce the efficiency of the standard error estimates. Now, we evaluate the situation where the normal error distribution is misspecified as a non-normal distribution. The normal distribution can be viewed as a special case of the t-distribution, the exponential power distribution, and the skew normal distribution. If the errors are normally distributed but one of the three non-normal distributions is used in the model, the error distribution is over-specified rather than misspecified, strictly speaking. Therefore, one would expect that this kind of misspecification would not have a large influence on the parameter estimates and their standard errors.

Figure 6 shows scatterplots of the parameter estimates and their standard errors when the error distribution is modeled as the normal and as the non-normal distributions, with N = 500 and T = 5, across all 100 replications. Clearly, the parameter estimates and their standard errors were very similar whether the error was modeled with the normal or a non-normal distribution. Table 6 summarizes the RESE for all investigated conditions. Across the table, the RESE is mostly smaller than 0.5 %. Compared to the RESE in Table 4, the loss in efficiency of the standard error estimates is negligible when the normal error distribution is misspecified as any of the three non-normal distributions discussed here.

Fig. 6 Parameter estimates and their standard errors when the normal error distribution is misspecified as the t, exponential power, and skew normal distributions, respectively (top to bottom)

Table 6 RESE when the normal error distribution is misspecified as non-normal (in percentage)

Summary

When the error distributions were misspecified, there was no noticeable influence on the parameter estimates of the GCMs. If the normal error distribution was misspecified as one of the three non-normal distributions discussed in the study, its influence on the standard error estimates was also negligible. However, if the non-normal error distributions were misspecified as normal distributions, the loss in the efficiency of the standard error estimates can be large. In addition, by correctly specifying the error distribution, one can obtain more efficient results than the bootstrap method.

Discussion

Growth curve models have been widely used in the social and behavioral sciences. However, typical GCMs assume that the errors are normally distributed. This assumption is also built into both commercial and free software for growth curve analysis, making such software less flexible for non-normal data analysis, even though non-normal data may be more common than normal data (Micceri, 1989). As the literature has shown, blindly assuming normality may lead to severe statistical inference problems (Yuan et al., 2004; Yuan and Zhang, 2012a).

In this study, we proposed a general Bayesian framework to flexibly model normal and non-normal data through the explicit specification of the error distributions. Within the framework, the error distribution is first explored through exploratory data analysis techniques. Then, a statistical distribution that best matches the error distribution is used in growth curve modeling. Bayesian methods are then used to estimate the model parameters. Although we have focused on the discussion of the t-distribution, exponential power distribution, and skew normal distribution, almost any distribution can be applied to the error in growth curve analysis.

Through a simulation study, we found that when the normal error distribution was misspecified as one of the non-normal error distributions discussed in this study, both the parameter estimates and their standard errors were very comparable. However, when the non-normal error distributions were misspecified as the normal distribution, the loss in the efficiency of the standard error estimates can be large. Therefore, moving from the normal distribution to distributions that more closely resemble the data is compelling. A similar conclusion was reached by Zhang et al. (2013).

We demonstrated how to conduct growth curve analysis with both normal and non-normal error distributions using the MCMC procedure. The normal distribution and the t-distribution are built-in distributions of the MCMC procedure. Therefore, extending the growth curve model with normal errors from the MLE framework to the Bayesian framework is almost effortless, and it is also a natural step to extend the models with normal errors to ones with t errors. Although there are no built-in functions for the exponential power and skew normal distributions, the MCMC procedure allows the specification of new distributions. As our demonstration shows, potentially any distribution with a known density function can be used to model the error of a GCM. Overall, Bayesian growth curve modeling offers great flexibility in longitudinal data analysis. Although GCMs with non-normal error distributions can potentially be estimated in the MLE framework, we are not aware of easy-to-use software to carry out such analysis.

For demonstration purposes, we have focused our discussion on linear GCMs. However, the proposed methods can be equally applied to nonlinear GCMs (e.g., Grimm et al., 2011), and nonlinear model estimation can also be conducted using the MCMC procedure.

Given that standard error estimates, not parameter estimates, are influenced by distribution misspecification, one might wonder why not use MLE with robust standard errors or bootstrap standard errors (e.g., Yuan & Hayashi, 2006). Based on likelihood theory, the MLE is most efficient when the distribution is correctly specified (e.g., Casella & Berger, 2001), and the Bayesian method is asymptotically equivalent to MLE, especially with noninformative priors (e.g., Gelman et al., 2003). Therefore, one would still expect the Bayesian method to provide more efficient standard error estimates than the robust or bootstrap method based on the misspecified normal error distribution. Our simulation study supported this: overall, modeling the error distribution was more efficient than the bootstrap method.

The current study can be extended in different ways. Firstly, in determining the error distribution, individual growth curve analysis is used. However, this method assumes that the growth function is known a priori, which might not be the case in practice. Although it is easy to distinguish linear growth from nonlinear growth, it can be difficult to choose among several potential nonlinear forms. Therefore, more rigorous methods can and should be developed to determine the best growth function. Secondly, in exploring the error distribution, it is assumed that the error follows a certain known distribution. It is possible that one cannot find a good distribution to approximate the error distribution from the individual growth analysis. One potential solution is to use a non-parametric error distribution, as discussed by Tong and Zhang (2014). Thirdly, although it is shown that when a normal distribution is misspecified as a non-normal distribution the loss of efficiency is minimal, it is not clear what happens if the error distribution is completely misspecified, for example, if a skew normal distribution is misspecified as an exponential power distribution with negative kurtosis. Fourthly, the current study did not discuss how to evaluate model fit or how to conduct model selection with the new error distributions. Evaluation of model fit can inform whether a model fits the data adequately, and model selection can help decide among competing models; in particular, model selection can also help with the choice of growth function forms and error distributions. Many criteria for model evaluation are available (e.g., Lu et al., 2015; Spiegelhalter et al., 2002) and can potentially be applied to the current study. Fifthly, the current study assumes that the errors can be non-normal but the random coefficients are still normal. There exist multivariate counterparts of the t, exponential power, and skew normal distributions; however, we are only aware of the extension using the multivariate t-distribution to model the random coefficients (Tong & Zhang, 2012). Although the above concerns are beyond the scope of the current study, whose aim is to introduce a new way to model error distributions in growth curve analysis, they will be investigated in future studies.