A Comparison of Methods for Specifying Optimal Random Effects Structures

Using Monte Carlo simulations, this study compared the performance of various approaches to the specification of random effects structures in linear mixed effects models (LMMs), including the minimal approach, the maximal approach, the forward search, the backward search, and the all-possible structures approach. The results showed that if the predictor of interest is at the within-cluster level or involves a cross-level interaction, the maximal approach, the best-path forward search, and the best-path backward search are all desirable methods. If the predictor of interest is at the cluster level, it is not essential to specify random slopes of Level-1 predictors. In addition, it is important to specify random slopes of within-cluster control variables, as they can increase the statistical power for testing the main within-cluster variables, especially when the sample size is small and the variance of the random slope of the control variable is large.

Linear mixed effects models (LMMs) are commonly used to analyze clustered data in educational, social, and behavioral sciences. The general form of LMMs is given by Y = Xβ + Zu + e, where Y represents the outcome variable, X the predictors, β the fixed effects, Z the design matrix for random effects, u the random effects, and e the errors. The random effects (u) can be associated with either a random intercept or a random slope.
The specification of LMMs poses a challenge as researchers must not only specify the fixed effects but also the random effects. A systematic review of applied multilevel modeling research in the past decade revealed that information regarding the specification of random effects in models is frequently missing or incomplete (Luo et al., 2021). Applied researchers often fail to provide clear rationales for the selection of a particular random effect structure and tend to default to using random intercept models. However, this practice is problematic as research has demonstrated that constraining a randomly varying slope of a Level-1 predictor to be constant across Level-2 clusters can lead to inflated Type I error in the test of the fixed effect (e.g., Barr et al., 2013).
One challenge in specifying random effects is the lack of sufficient details in substantive theories and previous studies to guide a priori specifications. Therefore, Barr et al. (2013) argued for the maximal model approach, which involves using the full variance-covariance structure of random effects. However, some researchers argue that the maximal approach may lead to overfitting when population variance components are close to zero, potentially resulting in reduced power, particularly with small sample sizes (Matuschek et al., 2017). Therefore, a model selection approach is recommended, where a certain criterion is used to select the random effect structure that is most supported by the data (e.g., Matuschek et al., 2017).
Given the challenges of specifying random effects in LMMs, there is a lack of methodological investigations that evaluate the different approaches and provide evidence-based guidelines for applied researchers. Hence, the purpose of this study is to compare various approaches to random effects specification in terms of their impact on the statistical inferences of fixed effects. In the remainder of the paper, we first provide a brief review of the maximal and model selection approaches. Next, we introduce the design of the simulation study and report the results. Finally, we discuss the findings and provide recommendations for applied researchers.

Literature Review
The Maximal Approach
For confirmatory hypothesis testing, Barr et al. (2013) argued that LMMs should include the maximal random effect structure that is justified by the design. Specifically, it is recommended to include random slopes for all within-cluster factors. When a within-cluster variable is involved in a cross-level interaction, a random slope should always be included for the within-cluster variable (Heisig & Schaeffer, 2019). For control variables, however, it remains unclear whether including random slopes is essential (Barr et al., 2013).
The rationales behind the maximal approach are twofold. First, random effects are crucial for capturing the measurement dependencies in the design (Bolker et al., 2009). Second, a data-driven approach could incorrectly exclude random effects due to insufficient power, which is particularly relevant for small sample sizes. Through a simulation study, Barr et al. (2013) showed that the maximal model can control the Type I error rate without a significant loss of power, whereas the model selection approach tended to be anticonservative with minimal power gains. However, it is important to note that the model used for data generation in Barr et al. (2013) was based on a two-way random effects ANOVA design (i.e., involving items and subjects) with a single within-subject treatment factor. Hence, it remains unclear whether their findings can be generalized to models with more predictors at both the within-cluster and between-cluster levels.

The Model Selection Approach
In contrast, other researchers have recommended using model selection approaches to determine an optimal random effect structure (e.g., Stroup, 2012). To investigate the potential power loss associated with the maximal model, Matuschek et al. (2017) compared the maximal approach with two model selection approaches, namely backward selection based on the likelihood ratio test (LRT) and AIC-based selection from all possible subset models. The results showed that the backward selection based on LRT can achieve higher power without inflated Type I error rates. The AIC-based all-subset model selection can achieve similar power to the LRT-based backward selection, but may suffer from slightly inflated Type I error rates in conditions with small sample sizes. However, the maximal approach leads to a significant loss of statistical power when the true value of the variance components is small.
Although Matuschek et al. (2017) used similar data generation models and sample size conditions as Barr et al. (2013), the conclusions drawn from the two studies were quite different. Two potential reasons may explain the contradictory findings. First, the relative magnitudes of the variance components in the two studies differed significantly. Barr et al. (2013) selected all variance components from a uniform distribution ranging from 0 to 3, allowing for a wide range of intraclass correlation coefficients (ICCs) from 0 to 1. In contrast, Matuschek et al. (2017) had a narrower range of ICC from 0 to 0.35. Second, the empirical power was calculated differently between the two studies. Barr et al. (2013) corrected the empirical power to account for anticonservativeness, while Matuschek et al. (2017) directly compared the power of different approaches that varied in their Type I error rates.

Model Selection Criteria
In model selection, researchers must strike a balance between goodness of fit and parsimony. Many model comparison methods aim to address the trade-off between goodness of fit and model complexity (Kuha, 2004; Vandekerckhove et al., 2015). Among these methods, the most popular model selection criteria are the Akaike information criterion (AIC; Akaike, 1973, 1974), the Bayesian information criterion (BIC; Schwarz, 1978), and likelihood ratio tests (LRTs).
Information Criterion (IC) Indexes -
AIC and BIC incorporate a goodness-of-fit term along with a penalty term for overfitting. Hence, each criterion selects the model with the best penalized log-likelihood, given by −2ln p(y|θ̂) + A_n·k. The first term, −2ln p(y|θ̂), quantifies the goodness of fit, where y is the dataset and θ̂ is the maximum likelihood estimate. The second term, A_n·k, is a penalty for model complexity, where A_n is a constant or a function of the sample size (n) and k is the number of parameters. Information criteria such as AIC and BIC differ in the choice of A_n, indicating different degrees of penalty for model complexity.
For AIC, A_n = 2, and thus the criterion is expressed as −2ln p(y|θ̂) + 2k, which estimates the information lost when the probability distribution implied by the fitted model is used to approximate the probability distribution of the true data-generating mechanism. Therefore, the model with the lowest AIC is considered most likely to provide a data-generating mechanism close to the true mechanism in the population.
For BIC, the criterion is expressed as −2ln p(y|θ̂) + k·ln(n), with A_n = ln(n), where n is the number of observations. BIC is an estimate of a function of the posterior probability of a model being true under the unit information prior, so a lower BIC means that a model is considered closer to the true model (Dziak et al., 2020). When n ≥ 8, BIC places more weight on the penalty for model complexity than AIC. There is evidence that AIC tends to select complex models that overfit the data, whereas BIC tends to select simpler models that underfit the data (Burnham & Anderson, 2002; O'Hagan & Forster, 2004). Matuschek et al. (2017) showed that the model selection approach based on AIC was slightly anticonservative, with empirical Type I error rates exceeding the nominal level. However, the performance of BIC was not investigated in that study.
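Given a fitted model's maximized log-likelihood, both criteria are straightforward to compute. A minimal sketch (the log-likelihood, k, and n below are made-up numbers for illustration only):

```python
import math

def aic(loglik, k):
    # AIC = -2 ln p(y|theta_hat) + 2k, i.e., A_n = 2
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    # BIC = -2 ln p(y|theta_hat) + k ln(n), i.e., A_n = ln(n)
    return -2 * loglik + k * math.log(n)

# Hypothetical fit: log-likelihood -512.3, k = 6 parameters, n = 500 observations
print(aic(-512.3, 6))       # ≈ 1036.6
print(bic(-512.3, 6, 500))  # ≈ 1061.89

# Once n >= 8, ln(n) > 2, so BIC penalizes each extra parameter more than AIC
assert math.log(8) > 2
```

Because ln(n) grows with the sample size while A_n = 2 stays fixed, the gap between the two criteria widens as n increases, which is why BIC favors simpler random effect structures in large samples.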

Likelihood Ratio Test (LRT) -
To select an optimal random effect structure, the LRT is another promising approach. However, the regular LRT is overconservative when testing a variance component, because the null value lies on the boundary of the parameter space (Ryoo, 2011). To address this issue, Crainiceanu and Ruppert (2004) derived a spectral representation of the exact finite-sample null distribution of the restricted likelihood ratio test (RLRT_CR). Using spectral decomposition, the null distribution of the RLRT_CR statistic can be simulated rapidly because the distribution depends only on the design matrices of the fixed and random effects. This procedure can be implemented via the RLRsim package in R (Scheipl et al., 2008).
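The conservativeness stems from the boundary problem: under the null, the LRT statistic for a single variance component asymptotically follows a 50:50 mixture of χ²(0) and χ²(1) rather than χ²(1), a standard boundary result not elaborated in the text above. A stdlib-only illustration of how much the naive χ²(1) cutoff over-shoots:

```python
from statistics import NormalDist

def chi2_1_quantile(p):
    # For 1 df, the chi-square quantile is the squared standard-normal quantile
    z = NormalDist().inv_cdf((1 + p) / 2)
    return z * z

naive = chi2_1_quantile(0.95)    # cutoff if the statistic is referred to chi2(1)
# Under the 50:50 mixture of chi2(0) and chi2(1) on the boundary,
# P(T > c) = 0.5 * P(chi2(1) > c), so the .05 cutoff is the chi2(1) quantile at 0.90
mixture = chi2_1_quantile(0.90)

print(round(naive, 2), round(mixture, 2))  # 3.84 2.71
assert mixture < naive  # the naive cutoff is too high, so the regular LRT rejects too rarely
```

Referring the statistic to χ²(1) effectively doubles the p-value, which is exactly the overconservativeness the RLRT_CR procedure corrects.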

Model Search Algorithms
In the context of selecting random effect structures in LMMs, three search algorithms have been used in prior research: forward selection, backward elimination, and all possible structures. The all-possible structures technique involves examining LMMs with every conceivable random effects structure. The best model can be determined based on AIC or BIC. However, if the number of potential random effects is large, this algorithm could rapidly become computationally expensive and difficult to evaluate. For example, when there are two within-cluster predictors, the number of all possible random effects structures would increase to 14 (see Table 1 for all the possible structures). In the forward selection method, variances and covariances of random effects are added to a model sequentially based on the statistical significance of LRTs. In the context of model selection, the significance level should be interpreted as the relative weight of goodness of fit and model complexity (Matuschek et al., 2017), similar to the trade-off between goodness of fit and model complexity with information criteria. Both Barr et al. (2013) and Matuschek et al. (2017) used a significance level of 0.2 for the LRT in model selection.
When using the forward selection method, researchers also need to determine the sequence of adding variances or covariances. For example, when there are two within-cluster predictors (X1 and X2), starting from a random intercept model, one could first add the random slope of X1 or the random slope of X2. If the random slope of X1 is included in the model, in the subsequent step, one can add the covariance between the random intercept and the random slope of X1, or add the random slope of X2. However, Barr et al. (2013) found that using an arbitrary sequence led to poor results and, therefore, suggested the "best path" algorithm, which involves testing all possible random effects that are not currently in the model and including any that pass the LRT.
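One common reading of this greedy "best path" idea can be sketched in code. The search logic below is generic; `pvalue_of_adding`, `admissible`, and the toy p-values are hypothetical stand-ins for refitting the model and running the RLRT/LRT at each step:

```python
ALPHA = 0.20  # criterion used by Barr et al. (2013) and Matuschek et al. (2017)

def best_path_forward(candidates, pvalue_of_adding, admissible):
    """Greedy 'best path' search: test every admissible addition, take the one
    with the smallest LRT p-value, and stop when none passes ALPHA."""
    included = []
    while True:
        tests = [(term, pvalue_of_adding(term, included))
                 for term in candidates
                 if term not in included and admissible(term, included)]
        if not tests:
            return included
        term, p = min(tests, key=lambda t: t[1])
        if p >= ALPHA:
            return included
        included.append(term)

# Toy example with fixed, made-up p-values; a covariance term is admissible
# only after the random slope it involves is already in the model.
pvals = {"var_X1": 0.01, "var_X2": 0.30, "cov_int_X1": 0.05}
result = best_path_forward(
    candidates=list(pvals),
    pvalue_of_adding=lambda term, included: pvals[term],
    admissible=lambda term, included: term != "cov_int_X1" or "var_X1" in included,
)
print(result)  # ['var_X1', 'cov_int_X1']
```

In this toy run the slope of X1 enters first (p = .01), its covariance with the intercept enters next (p = .05), and the slope of X2 is left out because its p-value exceeds the .20 criterion.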
Finally, in the backward elimination method, one starts with the maximal random effects structure, and sequentially eliminates variances and covariances based on LRT.
Similar to the forward selection method, there are multiple sequences that can be used in the backward elimination method. The backward "best-path" algorithm tests all random effects in the maximal model and excludes the statistically non-significant ones. Barr et al. (2013) found that, when the "best path" algorithm was used, the backward elimination method performed as well as the forward selection method; with arbitrary sequences, however, backward elimination outperformed forward selection.

The Current Study
Existing studies have only included a few approaches for comparison and have used different evaluation criteria, making it challenging to draw definitive conclusions about the best approach. In addition, previous studies have used overly simplistic data generation models, failing to consider Level-2 predictors and control variables. As a result, the generalizability of their findings to more complex models is questionable. Therefore, the current study aims to address the gap by comparing a broader range of approaches and generating data using more complex models.
First, we compare the maximal approach, forward selection, backward elimination, and the all-possible structures approach in terms of their performance in achieving optimal empirical Type I error and power when testing fixed effects associated with Level-1 predictors, Level-2 predictors, and cross-level interactions. Second, we examine whether it is essential to correctly model the random slope of a Level-1 control variable.

Method
Data Generation and Design Factors
We generated data using the following LMM to simulate cross-sectional clustering, such as students nested within schools. The model includes five predictors, consisting of two Level-1 predictors, two Level-2 predictors, and one cross-level interaction term. Additionally, multiple random effects were included to represent the complexity of models observed in empirical studies.

Level-1: Yij = β0j + β1j·X1ij + β2j·X2ij + eij

Level-2: β0j = γ00 + γ01·W1j + γ02·W2j + μ0j
         β1j = γ10 + γ11·W1j + μ1j
         β2j = γ20 + μ2j

where i and j are indexes of participants and clusters, respectively; X1 is the main Level-1 predictor, X2 is the Level-1 control variable, and W1 and W2 are the Level-2 predictors; μ0j, μ1j, and μ2j are the random effects associated with the intercept, X1, and X2 for cluster j, respectively; and eij is the within-cluster error term. X1 and X2 are generated from a multivariate normal distribution with a mean vector of zeros and variances of 1.0. The bivariate correlation between X1 and X2 is set to 0.3. The same specifications apply to the Level-2 predictors W1 and W2. The random effects μ0j, μ1j, and μ2j are assumed to follow a multivariate normal distribution with a mean vector of zeros. The variances of the random effects are denoted by τ00, τ11, and τ22, and the covariances by the off-diagonal components τ10, τ20, and τ21. The Level-1 error term eij is assumed to be independently and identically normally distributed with a mean of zero and a variance of σe².

The number of clusters (J) was set to 10, 50, or 100, and the cluster size (I) was set to 10, 25, or 50. These values represent the typical range of sample sizes at Level-1 and Level-2 found in the review of empirical studies using MLMs (Luo et al., 2021). The intercept (γ00) was fixed at 0 since it is typically not of interest. All the other fixed effects (γ10, γ20, γ01, γ02, and γ11) were assigned values of either 0 (no effect) or 0.5 (medium effect). The Level-1 error variance was set to 1, and the variance of the random intercept (τ00) was set to 0.25. We chose to fix τ00 at a specific value because the impact of misspecifying the random intercept is not the focus of this investigation, given that researchers typically include a random intercept in their model specifications (Luo et al., 2021).
On the other hand, the variance components associated with the random slopes of X1 and X2 (τ11 and τ22) were set to 0.05, 0.25, or 0.5, representing conditions where the variance component was trivial, medium, or large. The correlation among random effects was set to 0.5. These variance-covariance values were selected to ensure that the intraclass correlations ranged between 0.25 and 0.55, which Hedges and Hedberg (2007) found to be common in cross-sectional clustered studies in behavioral science. Taking into account all the design factors, a total of 162 conditions were considered. In each condition, 1000 independent data sets (i.e., replications) were simulated using R 4.0.3 (R Core Team, 2021).
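One replication under the data-generating model described above can be sketched as follows. This is a minimal illustration, not the study's actual R code; we assume medium settings (J = 50 clusters of size I = 25, τ11 = τ22 = 0.25, all non-zero fixed effects at 0.5), and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
J, I = 50, 25                      # number of clusters and cluster size
g10 = g20 = g01 = g02 = g11 = 0.5  # fixed effects (gamma_00 fixed at 0)
tau00, tau11, tau22 = 0.25, 0.25, 0.25

# Level-2 predictors W1, W2: bivariate normal, unit variances, correlation 0.3
W = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=J)

# Random effects u0, u1, u2: trivariate normal, pairwise correlations 0.5
sd = np.sqrt([tau00, tau11, tau22])
corr = np.full((3, 3), 0.5)
np.fill_diagonal(corr, 1.0)
U = rng.multivariate_normal([0, 0, 0], corr * np.outer(sd, sd), size=J)

rows = []
for j in range(J):
    X = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=I)  # X1, X2
    e = rng.normal(0, 1.0, size=I)                                     # sigma_e^2 = 1
    # Y = (g01*W1 + g02*W2 + u0j) + (g10 + g11*W1 + u1j)*X1 + (g20 + u2j)*X2 + e
    y = (g01 * W[j, 0] + g02 * W[j, 1] + U[j, 0]
         + (g10 + g11 * W[j, 0] + U[j, 1]) * X[:, 0]
         + (g20 + U[j, 2]) * X[:, 1] + e)
    rows.append(y)

Y = np.concatenate(rows)
print(Y.shape)  # (1250,)
```

The cross-level interaction enters through the γ11·W1 term in the slope of X1, so no explicit X1·W1 product column is needed at Level 1.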

Model Selection Algorithms
In the all-possible structure approach, we fitted 14 possible random effects structures as shown in Table 1, and the model with the smallest value of AIC or BIC was adopted as the final model.
For the forward selection method, we started with the random intercept model (M1) and used the "best path" algorithm (FWB) as well as 9 arbitrary sequences (FW1-FW9) to construct the model. In the backward elimination method, we started with the maximal model (M14) and used the "best path" algorithm (BWB) as well as 5 arbitrary sequences (BW1-BW5) to trim the model. The flow charts for all the arbitrary sequences in the forward selection and backward elimination methods are presented in the Supplementary Materials.
All the models were estimated with the R package lme4 (Bates et al., 2015) through restricted maximum likelihood estimation. In the forward and backward search algorithms, the RLRT_CR approach was adopted for the significance test of variance components to address the boundary issue. The test was implemented using the R package RLRsim (Scheipl et al., 2008). To test covariance components, the regular LRT was performed using the R package lmtest (Zeileis & Hothorn, 2002). The significance level for these tests was set at α = .20.
The minimal approach (i.e., the random-intercept model, M1) and the maximal approach (i.e., the full variance-covariance structure, M14) were also examined. The corresponding R code for all the models (M1 to M14), along with the best-path forward and backward searches, is provided in the Supplementary Materials.

Empirical Type I Error Rate and Power
Significance tests for all five predictors were performed using the t-test with Satterthwaite approximation (Satterthwaite, 1946) to correct the degrees of freedom, ensuring more accurate inferential results in conditions with small sample sizes. To evaluate the performance, the empirical Type I error rate and power were calculated. With 1000 replications, we expected the empirical Type I error rate to range from .037 to .063 (.05 ± 1.96 × √(.05 × .95/1000)).
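These bounds follow from the normal approximation to the Monte Carlo sampling error of a proportion; a quick stdlib-only check:

```python
import math

alpha, reps = 0.05, 1000
# Half-width of the 95% interval for an estimated proportion over `reps` replications
half_width = 1.96 * math.sqrt(alpha * (1 - alpha) / reps)
lo, hi = alpha - half_width, alpha + half_width
print(round(lo, 3), round(hi, 3))  # 0.036 0.064
```

The half-width is about 0.0135, consistent with the reported acceptance range for the empirical Type I error rate.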
Because the empirical power of an anticonservative approach is inflated, directly comparing the empirical power of approaches that differ in Type I error rate can be misleading. Hence, we adopted the method used in Barr et al. (2013) to compute the corrected empirical power.

Arbitrary Forward Search and Backward Search (FWA and BWA)
The arbitrary sequences used in the forward search approach exhibited similar empirical Type I error rates and corrected power for the significance tests of all the fixed effects. Similarly, minimal differences were found among the sequences used in the arbitrary backward search approach. Therefore, the mean Type I error rate and corrected power across the sequences within the arbitrary forward and backward search approaches were computed and used in the subsequent comparisons with the other approaches.

Empirical Type I Error Rates
Overall Performance
For X1 and X2, the MAX, FWB, BWB, and AIC approaches showed acceptable Type I error rates overall. The FWA, BWA, and BIC approaches exhibited slightly inflated Type I error rates, while the random intercept model (RI) resulted in severely inflated Type I error rates (ranging between 0.08 and 0.75). Regarding W1 and W2, all methods demonstrated slight inflation in the Type I error rate, except for RI. For X1W1, RI showed a significant inflation in the Type I error rate, while the other methods exhibited only slight inflations. The observed Type I error rates under different data generation conditions are elaborated upon and presented in Table 2, Table 3, Table 4, and Table 5.¹ More detailed results of the Type I error rate and power for each combination of the design factors are available in the Supplementary Materials.

Effect of Sample Size
For the slopes of X1 and X2, FWA, BWA, and BIC all had a slightly inflated Type I error rate when the number of clusters was 10.However, the Type I error rate became acceptable when the number of clusters increased to 25 or above.When the number of clusters increased to 50, all of the approaches (except for RI) had similar Type I error rates.
For W1 and W2, only RI had an acceptable Type I error rate across all conditions.For the other approaches, when the number of clusters was 10, they all had unacceptably high Type I error rates, especially when the approach was MAX, FWB, BWB, AIC, or BIC.However, the Type I error rate became acceptable when the number of clusters increased to 25 or above, and the differences among the approaches diminished when the number of clusters increased to 50.
For X1W1, all of the methods had unacceptably high Type I error rates when the number of clusters was 10, among which RI, AIC, and BIC were the worst.However, the Type I error rate decreased and became similar among the approaches (except for RI) when the number of clusters increased to 25 or above.
On the other hand, the cluster size (CS) had little impact on the Type I error rate of the coefficients of X1 and X2, except when the approach was RI or BIC.For RI, the Type I error rate notably increased when CS increased.For BIC, the Type I error rate was unacceptably high when CS = 10, and decreased to an acceptable range when CS increased to 50 or above.
For W1 and W2, regardless of the cluster size, RI had the best performance, followed by FWA and BWA, which had slightly inflated but acceptable Type I error rates. The other approaches all had unacceptably high Type I error rates, which tended to increase as CS increased. The only exception was BIC, which had a smaller Type I error rate for W1 and W2 when the CS increased.

1) Because the results for W1 and W2 were similar, we only presented those for W1.
For X1W1, FWA and BWA had the best performance, with an acceptable Type I error rate regardless of the cluster size.MAX, FWB, and BWB had an acceptable Type I error rate only when CS was 50 or below.AIC and BIC all had unacceptably high Type I error rates regardless of CS.

Effect of Variance Components (τ11 and τ22)
The magnitude of τ11 and τ22 had little impact on the Type I error rate across all the methods, except for BIC and RI. Specifically, BIC had an unacceptably high Type I error rate for X1 when τ11 was small (0.05), and similarly for X2 when τ22 was small. For RI, which had the highest Type I error rate among all methods, the Type I error rate for a Level-1 predictor (e.g., X1) was positively related to the magnitude of its variance component (e.g., τ11) and negatively related to the variance component of the slope of the other Level-1 predictor (e.g., τ22).
For X1W1, only FWA and BWA had an acceptable Type I error rate regardless of the magnitude of τ11 and τ22. MAX, FWB, and BWB had an acceptable Type I error rate when τ11 and τ22 were small. AIC and BIC had an unacceptably high Type I error rate, especially when τ11 was small and τ22 was large. RI had an unacceptably high Type I error rate in all conditions, which was exacerbated as τ11 increased.

[…] magnitude of τ22 was large. However, when the number of clusters increased to 25 or above, there were no noteworthy differences between the two models.
For the slope of X1W1, although the under-specification of the random effect of X2 did not negatively affect the Type I error rate, the corrected power was more than 10% lower than that under Model 14 when the number of clusters was small and the magnitude of τ22 was large.
2) The results based on Model 7 were almost identical to Model 4, hence only Model 4 results were presented in Table 6.

Table 6

Implications
First, if the predictor of interest is at the within-cluster level or involves a cross-level interaction, the maximal approach, the best-path forward search, and the best-path backward elimination are all desirable methods for specifying random effect structures. If there are convergence issues with estimating the maximal model, the best-path forward search is recommended. Second, if the predictor of interest is at the cluster level, it is not essential to specify random slopes of Level-1 predictors. Third, it is important to specify random slopes of within-cluster control variables, especially when the sample size is small and the variance of the random slope is large. Lastly, researchers should avoid using AIC- and BIC-based approaches, especially when the sample size is small.

Limitations and Future Directions
The findings of the current study should be interpreted in light of its limitations. First, our study focused on cross-sectional clustered data. It is not safe to directly apply our findings to studies with panel/longitudinal data, which may involve more complex Level-1 error structures (e.g., auto-correlated Level-1 errors). Second, we generated data in which the distributional assumptions of the random effects were satisfied. It is unknown whether the performance of these approaches would differ when the normality assumptions of the random effects are violated. These could be directions for future studies.

Table 1
Candidate Random Effect Structure

Table 2
Type I Error and Power for X1 Slope