
Improved Goodness of Fit Procedures for Structural Equation Models
Njål Foldnes (University of Stavanger), Jonas Moss (BI Norwegian Business School), and Steffen Grønneberg (BI Norwegian Business School)

ABSTRACT
We propose new ways of robustifying goodness-of-fit tests for structural equation modeling under non-normality. These test statistics have limit distributions characterized by eigenvalues whose estimates are highly unstable and biased in known directions. To take this into account, we design model-based trend predictions to approximate the population eigenvalues. We evaluate the new procedures in a large-scale simulation study with three confirmatory factor models of varying size (10, 20, or 40 manifest variables) and six non-normal data conditions. The eigenvalues in each simulated dataset are available in a database. Some of the new procedures markedly outperform presently available methods. We demonstrate how the new tests are calculated with a new R package and provide practical recommendations.

KEYWORDS
Bootstrap; covariance structure analysis; factor model; goodness-of-fit test; non-normality; weighted sum of chi-squares

Goodness-of-fit testing is central when assessing whether a proposed measurement instrument can be used to understand latent psychological traits and processes. Researchers often evaluate their instruments using factor modeling, where the trait is considered a latent variable that dictates the correlational structure among items. Model fit statistics and indices are then calculated, from which the researcher can assess whether the model is well specified. Only in a well-specified model can parameters such as factor loadings and correlations be properly interpreted to gain insight into the workings of a proposed instrument and the associations between latent traits.
In this article, we propose and study new classes of goodness-of-fit tests for structural equation models (SEMs) and confirmatory factor models under non-normality. As the sample size increases, commonly used test statistics have distributions that converge to distributions characterized by the eigenvalues of a certain matrix. Once these eigenvalues are estimated, p-values for the goodness-of-fit test can in principle be directly calculated. Unfortunately, as illustrated in a later section, empirical estimates of the eigenvalues are highly unstable and biased. We present an estimation theory for the eigenvalues and propose to stabilize and bias-correct the estimated eigenvalues using model-based trend predictions. This theory is based on population eigenvalues but allows for penalized estimation procedures where the penalization function can be chosen. Two classes of prediction models are investigated, where the trend for the eigenvalues may be piece-wise constant or linear. We design penalization functions for these classes that take into account the known systematic bias of eigenvalue estimates.
We start our article with a review of goodness-of-fit testing in SEM, including traditional and new procedures, under both normal and non-normal data. Then we present our new tests based on penalized estimation using an illustrative example. Next, we present a large-scale Monte Carlo study to evaluate the procedures in a variety of conditions with varying sample sizes, model sizes, and data distributions. This is followed by a section that summarizes the results of the Monte Carlo simulations. Afterward, we demonstrate how to perform the tests using the new R (R Core Team, 2023) package semTests (Moss, 2024). We end with a discussion of our findings, where we also outline limitations and future research ideas.
The online supplementary material contains an analytical framework for the new tests, software snippets, mathematical deductions, and further simulation results.

Goodness-of-Fit Tests in Covariance Structure Analysis
Factor and structural equation models imply structural constraints $\Sigma = \Sigma(\theta)$ on the covariance matrix $\Sigma$ of the observed variables $X = (X_1, \ldots, X_p)'$. Model parameters are contained in the $q$-dimensional vector $\theta$ and are estimated by minimizing a discrepancy function that measures the distance between the observed covariance matrix $S$ from $n$ observations and the model-implied covariance matrix $\Sigma(\theta)$. For instance, in confirmatory factor analysis, the model is specified by the equations $x = \Lambda f + \varepsilon$, where $x = (x_1, \ldots, x_p)'$ is a $p$-dimensional vector of observed variables, $f$ is a latent vector, and $\varepsilon$ is a $p$-dimensional vector of residuals, which are uncorrelated with $f$ (Bollen, 1989). The elements in $\Lambda$, some of which are constrained to zero, are referred to as factor loadings. Additional constraints on the elements of $\Lambda$, $\Phi$, and $\Psi$ are needed for model identification, where $\Phi$ and $\Psi$ are the covariance matrices of the latent and residual variables, respectively. The model implies the following covariance structure among the observed variables: $\Sigma(\theta) = \Lambda \Phi \Lambda' + \Psi$, where $\theta$ contains all the estimated parameters in $\Lambda$, $\Phi$, and $\Psi$. The most popular estimation method is normal-theory maximum likelihood (NTML), where the discrepancy function is (Bollen, 1989)
$$F_{NTML}(S, \Sigma(\theta)) = \log|\Sigma(\theta)| - \log|S| + \operatorname{tr}\{S \Sigma(\theta)^{-1}\} - p.$$
The corresponding estimator $\hat\theta_{NTML}$ is the minimizer of $F_{NTML}$ over $\theta$. We remark that this estimator is consistent even under non-normal data.

Tests for Normal Data
Most tests for correct model specification in SEM are based on some model fit test statistic $T_{NT}$, often referred to as a $\chi^2$ statistic, whose sampling distribution can be approximated by a chi-square distribution when data are multivariate normally distributed and the model specification is correct. Popular model fit indices such as the RMSEA (Steiger et al., 1985) and the CFI (Bentler, 1990) also depend on a $T_{NT}$ that is approximately chi-square distributed under normality.
The most commonly used candidate for $T_{NT}$, reported by default in most software packages, is $T_{ML} = (n-1)\, F_{NTML}(S, \Sigma(\hat\theta_{NTML}))$. Under correct model specification and normal data, $T_{ML}$ converges to a chi-square distribution with $d = p(p+1)/2 - q$ degrees of freedom, where $q$ is the number of freely estimated model parameters (Jöreskog, 1969).
Another candidate for $T_{NT}$ is the reweighted least squares (RLS) statistic
$$T_{RLS} = \frac{n-1}{2} \operatorname{tr}\left\{ \left[ (S - \Sigma(\hat\theta))\, \Sigma(\hat\theta)^{-1} \right]^2 \right\}.$$
Here $\hat\theta$ is any consistent estimator, e.g., $\hat\theta_{NTML}$. Just as $T_{ML}$, $T_{RLS}$ is asymptotically chi-square distributed with $d$ degrees of freedom under correct model specification and normal data (Browne, 1974). However, recent work by Hayakawa (2019) and Zheng and Bentler (2022) suggests that $T_{RLS}$ converges to its limiting distribution more quickly than $T_{ML}$. That is, at a given sample size with normal data, $T_{RLS}$ was found to maintain Type I error control better than $T_{ML}$.

Robustified Tests for Non-Normal Data
The chi-square sampling distribution of $T_{NT}$ is distorted when the data fail to be normal (Micceri, 1989; Cain et al., 2017). Under correct model specification, its asymptotic distribution is a weighted sum of independent chi-square variables, each with one degree of freedom:
$$T_{NT} \xrightarrow{d} \sum_{i=1}^{d} \lambda_i Z_i^2, \qquad Z_1, \ldots, Z_d \overset{iid}{\sim} N(0, 1), \qquad (1)$$
where the weights $\lambda = (\lambda_1, \ldots, \lambda_d)'$ are the non-zero eigenvalues of the matrix product $U\Gamma$. The matrix $U$ depends on model characteristics. Let $D = \partial\sigma(\theta)/\partial\theta'$, where $\sigma(\theta) = \operatorname{vech}(\Sigma(\theta))$ is the half-vectorization of $\Sigma(\theta)$, i.e., the vector obtained by stacking the columns of the square matrix $\Sigma(\theta)$ one underneath the other, after eliminating all elements above the diagonal. Then
$$U = V - V D (D' V D)^{-1} D' V, \qquad V = \tfrac{1}{2}\, D_p' (\Sigma^{-1} \otimes \Sigma^{-1}) D_p \qquad (2)$$
(Satorra and Bentler, 1994), where $D_p$ is the duplication matrix (Magnus and Neudecker, 1999). The matrix $\Gamma$ is the asymptotic covariance matrix of the sample covariances and depends solely on the data distribution.
To make use of Eq. (1), consistent estimates, i.e., estimates that converge in probability, of the quantities $U$, $\Gamma$, and $\lambda$ must be available. Since eigenvalues are the roots of a polynomial, they are continuous functions of the polynomial coefficients (Harris and Martin, 1987), and we may estimate $\lambda$ consistently by $\hat\lambda$ given as the eigenvalues of $\hat U \hat\Gamma$, provided $U$ and $\Gamma$ are consistently estimated by $\hat U$ and $\hat\Gamma$, respectively. A consistent estimator $\hat U$ of $U$ can be obtained by replacing $\theta$ with an estimate $\hat\theta$. Under standard assumptions, $\hat\theta$ will be consistent (Satorra, 1989), implying that $\hat U$ is consistent as long as the mapping $\theta \mapsto U(\theta)$ is continuous, which we will assume. We will also assume that consistent estimators of $\Gamma$ are available. A standard estimator of $\Gamma$ is the moment-based $\hat\Gamma_A$ defined in, e.g., Section 3 of Browne (1984), which is consistent as long as the observations have finite eighth-order moments.
The most well-known robustification procedure is the Satorra-Bentler (SB) scaling (Satorra and Bentler, 1988), which involves scaling $T_{NT}$ by a factor so that the asymptotic mean of the resulting statistic matches the expectation $d$ of the nominal chi-square distribution:
$$T_{SB} = \frac{d}{\operatorname{tr}(\hat U \hat\Gamma)}\, T_{NT}.$$
This results in a p-value given by $P(\chi^2_d > t)\big|_{t = T_{SB}}$, where $\chi^2_d$ is a chi-square distribution with $d$ degrees of freedom.
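The standard SB scaling depends on the estimated eigenvalues only through their sum, $\operatorname{tr}(\hat U \hat\Gamma)$. As an illustrative sketch (in Python rather than the R software used in this article; the function name is ours), the mean-scaled statistic and its chi-square reference p-value can be computed as:

```python
import numpy as np
from scipy.stats import chi2

def satorra_bentler_pvalue(t_nt, eigvals):
    """Mean-scale a normal-theory statistic and refer it to chi2(d).

    t_nt: normal-theory test statistic (e.g., T_ML or T_RLS).
    eigvals: estimated eigenvalues of U-hat Gamma-hat; their sum is tr(U-hat Gamma-hat).
    """
    d = len(eigvals)
    t_sb = d * t_nt / np.sum(eigvals)  # scaling so the asymptotic mean equals d
    return t_sb, chi2.sf(t_sb, df=d)   # p-value from the chi2(d) upper tail
```

With eigenvalues averaging 2, for example, the statistic is halved before being compared to the $\chi^2_d$ reference.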
Asparouhov and Muthén (2010) proposed to scale and shift (SS) the statistic $T_{NT}$:
$$T_{SS} = a\, T_{NT} + d - b, \qquad a = \sqrt{\frac{d}{\operatorname{tr}\{(\hat U \hat\Gamma)^2\}}}, \qquad b = \sqrt{\frac{d\, \{\operatorname{tr}(\hat U \hat\Gamma)\}^2}{\operatorname{tr}\{(\hat U \hat\Gamma)^2\}}}.$$
The statistic $T_{SS}$ has the same asymptotic mean and variance as the reference chi-square distribution. Similarly to the SB procedure, the resulting p-value is $P(\chi^2_d > t)\big|_{t = T_{SS}}$. Monte Carlo studies (e.g., Foldnes and Olsson, 2015) report that $T_{SB}$ tends to overreject and $T_{SS}$ tends to underreject correctly specified models.
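Since $\operatorname{tr}(\hat U \hat\Gamma)$ and $\operatorname{tr}\{(\hat U \hat\Gamma)^2\}$ equal the sum of the eigenvalues and of their squares, the scale-and-shift correction can also be computed directly from the estimated eigenvalues. A minimal Python sketch (our own function name, not part of any package):

```python
import numpy as np
from scipy.stats import chi2

def scaled_shifted_pvalue(t_nt, eigvals):
    """Scale-and-shift correction matching chi2(d) in mean and variance."""
    lam = np.asarray(eigvals, dtype=float)
    d = lam.size
    tr1 = lam.sum()        # tr(U-hat Gamma-hat)
    tr2 = (lam ** 2).sum() # tr((U-hat Gamma-hat)^2)
    a = np.sqrt(d / tr2)
    b = np.sqrt(d * tr1 ** 2 / tr2)
    t_ss = a * t_nt + d - b  # asymptotic mean d and variance 2d
    return t_ss, chi2.sf(t_ss, df=d)
```

When all eigenvalues equal one (the normal-data case), $a = 1$ and $b = d$, so $T_{SS}$ reduces to $T_{NT}$.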
Eigenvalue block averaging (EBA) is a recent effort to improve upon $T_{SB}$ and $T_{SS}$ by defining a flexible class of test statistics (Foldnes and Grønneberg, 2017). First, the $d$ non-zero eigenvalues of $\hat U \hat\Gamma$ are sorted in increasing order, $\hat\lambda_1 \le \hat\lambda_2 \le \cdots \le \hat\lambda_d$. These eigenvalues are then grouped into several equally sized bins, or blocks, and the block averages are calculated. Then, a vector of weights $\bar\lambda_1, \ldots, \bar\lambda_d$ is constructed by replacing the eigenvalues with their block averages. For instance, in two-block EBA, denoted EBA2, the first block contains the $\lceil d/2 \rceil$ smallest eigenvalues, where $\lceil \cdot \rceil$ denotes rounding up to the nearest integer, while the second block contains the remaining eigenvalues. The corresponding p-value for the goodness-of-fit test is then obtained as
$$p = P\left( \sum_{j=1}^{d} \bar\lambda_j Z_j^2 > T_{NT} \right) \qquad (3)$$
for independent standard normal variables $Z_1, \ldots, Z_d$. For a single block, each $\bar\lambda_j$ for $j = 1, \ldots, d$ equals the average $\bar\lambda$ of all estimated eigenvalues $\hat\lambda_1, \ldots, \hat\lambda_d$. The sum of the eigenvalues of a square matrix equals its trace. Therefore, $\bar\lambda = \operatorname{tr}(\hat U \hat\Gamma)/d$, and by Eq. (2), single-block EBA coincides with the SB procedure.
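The block-averaging step and the weighted-chi-square tail probability can be sketched as follows. This is an illustrative Python version (the article's software uses R, where the tail probability is computed exactly with CompQuadForm; here we use a plain Monte Carlo approximation instead, and the function names are ours):

```python
import numpy as np

def eba_weights(eigvals, m):
    """Sort eigenvalues and replace each by the mean of its block (m blocks).

    np.array_split puts the extra elements in the earlier blocks, so for
    m = 2 the first block holds the ceil(d/2) smallest eigenvalues.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))
    blocks = np.array_split(lam, m)
    return np.concatenate([np.full(len(b), b.mean()) for b in blocks])

def weighted_chisq_pvalue(t, weights, reps=100_000, seed=1):
    """Monte Carlo estimate of P(sum_j w_j Z_j^2 > t) for iid standard normal Z_j."""
    rng = np.random.default_rng(seed)
    z2 = rng.chisquare(1, size=(reps, len(weights)))  # Z_j^2 draws
    return np.mean(z2 @ np.asarray(weights) > t)
```

With a single block (`m=1`) every weight equals the overall mean, reproducing the SB approximation.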
All the robustified tests require an estimate $\hat\Gamma$ of the asymptotic covariance matrix $\Gamma$. Browne (1974) discussed two estimators for $\Gamma$, which we refer to as $\hat\Gamma_A$ and $\hat\Gamma_U$. The former is asymptotically consistent and is currently the default estimator used in software packages. The latter is unbiased in finite samples and asymptotically equivalent to $\hat\Gamma_A$. It has recently attracted attention (Du and Bentler, 2022) as a promising alternative to $\hat\Gamma_A$. In addition, the robustified tests require a candidate for $T_{NT}$. In the present study we consider the candidates $T_{ML}$ and $T_{RLS}$. With two candidates for $\hat\Gamma$ and two candidates for $T_{NT}$, there are four possible estimators for any quantity depending on them. So every robustified procedure considered in the present study has four versions, and all of these are included in our Monte Carlo design. We use the following notation: the $T_{NT}$ version is indicated as a subscript, and we indicate the use of $\hat\Gamma_U$ instead of $\hat\Gamma_A$ by the superscript UG. For instance, for the SB procedure we have the versions $SB_{ML}$, $SB_{RLS}$, $SB^{UG}_{ML}$, and $SB^{UG}_{RLS}$.

Asymptotically Exact Tests
We refer to test procedures whose Type I error control under correct model specification converges to the nominal level as asymptotically exact tests. The procedures SB, SS, and EBA are not in general asymptotically exact. In other words, the Type I error rate of, e.g., SB, will not necessarily approach the nominal rate of 5% even in large samples. In contrast, the three procedures discussed next are asymptotically exact. By imposing mild assumptions on the employed estimator and the rank of $D$ and $\Gamma$, Browne (1984) showed that the asymptotically distribution-free (ADF) test statistic
$$T_{ADF} = n\, (s - \sigma(\hat\theta))' \left\{ \hat\Gamma^{-1} - \hat\Gamma^{-1} D (D' \hat\Gamma^{-1} D)^{-1} D' \hat\Gamma^{-1} \right\} (s - \sigma(\hat\theta)), \qquad s = \operatorname{vech}(S),$$
asymptotically follows a chi-square distribution with $d$ degrees of freedom whenever $\hat\Gamma$ consistently estimates $\Gamma$.
Unfortunately, many studies (e.g., Curran et al., 1996; Olsson et al., 2000) report that the ADF test requires very large sample sizes to perform satisfactorily, due to the sampling variance of the fourth-order moments involved in estimating $\Gamma$.
The second asymptotically exact test is the Bollen-Stine bootstrap (Beran and Srivastava, 1985; Bollen and Stine, 1992). The procedure starts by linearly transforming the observed data so that the model fits the transformed data perfectly. Then, a p-value for the hypothesis of correct model specification is calculated by drawing bootstrap samples from the transformed data set and fitting the model to obtain a sequence of normal-theory bootstrap values $T^B_{NT}$. The p-value is the proportion of the $T^B_{NT}$ values that exceed the $T_{NT}$ value obtained in the original data sample. The number of bootstrap samples is typically at least 1000, so the bootstrap is a computationally intensive method. This likely explains the scarcity of Monte Carlo studies that evaluate the Bollen-Stine bootstrap. Also, most of these studies focus on small models with no more than 11 observed variables (Fouladi, 1998; Ichikawa and Konishi, 2001; Nevitt and Hancock, 2004; Foldnes and Grønneberg, 2019). For larger models, to the best of our knowledge, Ferraz et al. (2022) is the only available study, including up to 30 observed variables. For small models with 10 observed variables, the results of Ferraz et al. (2022) were in line with previous studies in finding that the Bollen-Stine bootstrap adequately controlled Type I error rates. However, for larger models Ferraz et al. (2022) concluded that the empirical rejection rates were too low. For instance, with 30 observed variables and the largest sample size included ($n = 1000$), the rejection rates were in the range 1.8%-2.6% at the 5% level of significance. In our Monte Carlo study, we expand the number of observed variables to 40 and employ a larger set of non-normal data conditions than previously considered, to gain further insight into the Bollen-Stine bootstrap.
The third asymptotically exact test uses the estimated eigenvalues of $U\Gamma$ directly. Since this is equivalent to block-averaging eigenvalues with blocks of size one, so that $\bar\lambda_j = \hat\lambda_j$ for $j = 1, \ldots, d$, we may consider this an EBA-type procedure, with p-value given by $1 - H(T_{NT})$, where $H$ is defined in Eq. (4). This procedure is identical to using $d$ blocks (which are then singleton sets) in EBA, and we refer to it as EBAd. Since EBAd has not yet been studied in the literature, it is included in our Monte Carlo investigations. The estimated eigenvalues will converge toward their population counterparts as the sample size increases, so EBAd is asymptotically exact. The sampling variability of estimated eigenvalues is, however, so large that impractically large sample sizes may be required to obtain acceptable Type I error control. Figure 1 illustrates the finite-sample fluctuations in the estimated eigenvalues in a ten-dimensional two-factor model with non-normal data. The model has 34 degrees of freedom, and hence 34 associated non-zero eigenvalues. In the figure, the crosses represent the population values, i.e., the eigenvalues of $U\Gamma$, in increasing order, with a range of 1.12-1.27. We simulated 200 samples of size $n = 1500$ and extracted in each sample the sorted eigenvalues of $\hat U \hat\Gamma$. For each rank $i = 1, \ldots, 34$, the corresponding estimated eigenvalues are represented by box plots. We make the following observations: (i) The estimates have high sampling variability, especially the largest eigenvalues.
(ii) The higher eigenvalues are consistently overestimated, and the lower eigenvalues are consistently underestimated. (iii) Most of the box plots do not cover their corresponding population eigenvalue. These observations suggest that directly using the estimated eigenvalues to approximate the sampling distribution of $T_{NT}$ may not work well. While both the SB and the EBA procedures attempt to handle the sampling variability of eigenvalues by averaging sets of eigenvalues, earlier literature has not addressed the problem of under- and overestimation. The new approaches proposed below take the systematic bias into account and are designed to work well when the estimated eigenvalues are related to the true eigenvalues in the same way as in Figure 1.
Before we turn to the new estimation methods, we explain why the pattern shown in Figure 1 is expected to occur also in conditions not covered by our Monte Carlo study.

Estimated Eigenvalues and the Empirical Spectral Function
The set of eigenvalues of a matrix is not ordered in and of itself, although we can naturally sort the eigenvalues in increasing order. What are the statistical consequences of this ordering?
To build intuition, let us consider a highly simplified scenario where the eigenvalue estimates are independent and normal. We observe the set $S = \{X_1, \ldots, X_d\}$, where $X_1, \ldots, X_d$ are independent, $X_1, \ldots, X_{25} \sim N(2.5, 1)$, and $X_{26}, \ldots, X_{34} \sim N(3.5, 1)$. Although there is an order to the observations in our notation, we only observe the unordered set $S$, which plays the role of the estimated eigenvalues. This emulates a situation where $U\Gamma$ has 34 eigenvalues, each equal to either 2.5 or 3.5.
If we plot the sorted eigenvalues $X_{(1)} \le X_{(2)} \le \cdots \le X_{(d)}$ against their rank, i.e., the points $(i/d, X_{(i)})$, and connect these points via straight lines, the resulting curve is the empirical quantile function of the data. This curve will approximate the population quantile function. To see why, recall first that the empirical quantile function is a generalized inverse of the empirical distribution function
$$\hat F(x) = \frac{1}{d} \sum_{i=1}^{d} I\{X_i \le x\}, \qquad (6)$$
where $I\{A\}$ is the indicator function of $A$, being 1 if $A$ is true and zero otherwise. This empirical distribution function $\hat F$ uniformly approximates $\bar F(x) = E\,\hat F(x)$ (Shorack and Wellner, 2009, Chapter 25). Under the assumed distribution for $X_1, \ldots, X_{34}$, we get $\bar F(x) = \frac{25}{34}\Phi(x - 2.5) + \frac{9}{34}\Phi(x - 3.5)$. The empirical quantile function will therefore approximate the inverse of $\bar F$, which we may denote by $Q(p)$. Therefore, a plot of $(i/d, X_{(i)})$ will be close to $(i/d, Q(i/d))$. Figure 2 is based on a single realization of $X_1, \ldots, X_{34}$. The quantile function $Q$ is plotted in red, the plot of $(i/d, X_{(i)})$ in black, and the empirical quantile function in blue. The black curve is not visible as it is overwritten by the blue curve. Figure 2 displays shapes similar to the eigenvalues plotted in Figure 3. There is systematic over- and underestimation of these values for $i/d$ near zero or one, an effect that is due solely to sorting.
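The sorting effect described above is easy to reproduce numerically. The following Python sketch (with arbitrary seed and replication count of our choosing) draws the mixture of $N(2.5, 1)$ and $N(3.5, 1)$ values many times and averages the sorted draws rank by rank; the smallest order statistics sit well below 2.5 and the largest well above 3.5, even though no individual observation is biased:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 34
# population "eigenvalues": 25 values at 2.5 and 9 values at 3.5
pop = np.concatenate([np.full(25, 2.5), np.full(9, 3.5)])

# draw each set with standard normal noise, then sort within each replication
reps = 2000
sorted_draws = np.sort(pop + rng.standard_normal((reps, d)), axis=1)
mean_sorted = sorted_draws.mean(axis=0)  # average value at each rank

# sorting alone pushes the extremes outward: the smallest ranks fall below
# min(pop) = 2.5 and the largest ranks rise above max(pop) = 3.5
low_bias = mean_sorted[0] - pop.min()    # negative
high_bias = mean_sorted[-1] - pop.max()  # positive
```

The overall mean of the draws is still close to the population mean; only the rank-wise extremes are distorted, mirroring the pattern in Figure 1.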
While the estimated eigenvalues $\hat\lambda_i$ converge to $\lambda_i$, the variation in $\hat\lambda_i$ will be considerable for realistic sample sizes. A plot of $(i/d, \hat\lambda_i)$ will have the shape of an empirical quantile curve defined by the same formula as $\hat F$ in Eq. (6), but with $\hat\lambda_i$ in place of $X_i$. Such objects are known as empirical spectral functions and play an important role in random matrix theory (Pastur and Shcherbina, 2011; Paul and Aue, 2014). We conjecture that the empirical spectral function of $\hat U \hat\Gamma$ converges to a population function as $d$ and $n$ increase, so there is a limiting curve that plays a similar role as the red curve in Figure 2. With insights into this limit curve, a principled estimation procedure for approximating $\lambda_1, \ldots, \lambda_d$ could be developed in future work.

New Goodness-of-Fit Tests Based on Penalized Estimation
In this section, we introduce and motivate new procedures for obtaining p-values based on penalization of the estimated eigenvalues. Technical arguments are deferred to the online supplementary material.
Similar to the EBA procedures, the new tests take the estimated eigenvalues $\hat\lambda_1, \ldots, \hat\lambda_d$ as input. From these values, regularized estimates $\tilde\lambda_1, \ldots, \tilde\lambda_d$ are produced as discussed next, and these are used to calculate a p-value for the goodness-of-fit test using $H$ in Eq. (4). For illustration, we continue with the eigenvalues associated with the factor model discussed in Section 1.3 (Figure 1), which has 34 degrees of freedom. Here, however, we consider a single random sample of size $n = 1500$ and the corresponding set of estimated eigenvalues, using $\hat\Gamma_A$. These estimates and their corresponding population values are plotted in Figure 3. Also, the figure depicts four sets of approximated eigenvalues: First, the SB eigenvalues are plotted, all with the same value, namely the mean eigenvalue of 1.10. Second, the EBA2 approximations are depicted, with the 17 smallest eigenvalues set at the mean value 0.79 and the 17 largest eigenvalues at the mean value 1.41. The two remaining eigenvalue sets in the figure, pEBA2 and pOLS, are obtained by a process explained in the next two subsections.

Penalized EBA
The EBA procedure may be modified naturally to counteract the bias observed in Figures 1 and 3. Figure 3 also contains a new set of eigenvalues in the intermediate positions between SB and EBA2, which we call penalized EBA2 and denote by pEBA2. The connection to penalized estimation is explained in the online supplementary material. pEBA2 consists of a two-block set of weights $\tilde\lambda_j$ equal to the average of the SB and the EBA2 weights. In Figure 3, the first block of weights contains the mean value of 1.10 and 0.79, which is 0.95. Likewise, the second block has weights equal to the mean of 1.10 and 1.41, namely 1.26.
The procedure may be performed in the same manner for any number of blocks. That is, we average the EBA weights block by block with the overall average eigenvalue and thus obtain penalized versions pEBA3, pEBA4, and so forth.
The additional averaging employed in penalized EBA attempts to counteract the systematic bias observed in Figures 1 and 3. By anchoring the EBA eigenvalues closer to the global average, the overestimation of the larger eigenvalue estimates is reduced, while still not restricting the eigenvalues to be constant. Similarly, the underestimation of the smaller eigenvalue estimates is also reduced.
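The penalized block averaging can be sketched in a few lines. This Python illustration (function name ours; the article's implementation is the R package semTests) averages each block mean with the global mean:

```python
import numpy as np

def peba_weights(eigvals, m):
    """Penalized EBA: average each of the m block means with the global mean."""
    lam = np.sort(np.asarray(eigvals, dtype=float))
    global_mean = lam.mean()  # the Satorra-Bentler weight
    out = []
    for block in np.array_split(lam, m):
        # shrink the block mean halfway toward the global mean
        out.append(np.full(len(block), (block.mean() + global_mean) / 2.0))
    return np.concatenate(out)
```

With block means 0.79 and 1.41 and global mean 1.10, this reproduces the pEBA2 weights 0.95 and 1.26 (rounded) from the illustration in Figure 3.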

Penalized OLS
The penalized OLS procedure can be motivated by a simple heuristic. Let $\lambda_i$ be the population eigenvalues and run a simple linear regression based on $(i, \lambda_i)_{i=1}^{d}$ to obtain the OLS estimates $\hat\beta_0$ and $\hat\beta_1$. The $i$th eigenvalue, which is positive since $U\Gamma$ is positive definite, can now be approximated by $\tilde\lambda_i = \max(\hat\beta_0 + \hat\beta_1 i, 0)$. In practice the regression is run on the estimated eigenvalues, and the linear approximation then inherits the systematic bias observed in Figures 1 and 3, causing the slope to be overestimated. A natural remedy to this sort of overestimation is to down-weight the regression slope using ridge regression, a well-known penalized form of OLS, which we refer to as pOLS.
The extent of down-weighting is represented by a parameter $c \ge 1$ that divides the OLS slope estimate, yielding the down-weighted slope $\hat\beta_1 / c$. The corresponding ridge regression intercept is chosen so that the mean of the fitted weights remains the mean eigenvalue $\bar\lambda$:
$$\hat\beta_0(c) = \bar\lambda - \frac{\hat\beta_1}{c} \cdot \frac{d+1}{2}.$$
The standard OLS estimates are recovered when $c = 1$. For $c \to \infty$, we obtain a zero slope and intercept $\bar\lambda$, i.e., the Satorra-Bentler weights. Simulations show that $c = 2$ works well, and we will use it in the remainder of the article. Figure 3 shows the predictions of pOLS.
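A minimal Python sketch of this slope down-weighting follows. Note that the intercept formula used here (re-centering so the fitted weights keep the mean eigenvalue $\bar\lambda$) is our reconstruction of the behavior described above, under which $c = 1$ recovers OLS and $c \to \infty$ recovers the Satorra-Bentler weights; the function name is ours:

```python
import numpy as np

def pols_weights(eigvals, c=2.0):
    """pOLS-style weights: OLS fit of sorted eigenvalues on rank, slope / c.

    Assumes the intercept is re-centered so the weights average to the mean
    eigenvalue (consistent with c -> infinity giving the SB weights).
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))
    d = lam.size
    i = np.arange(1, d + 1)
    b1, b0 = np.polyfit(i, lam, 1)                   # OLS slope and intercept
    slope = b1 / c                                   # down-weight the slope
    intercept = lam.mean() - slope * (d + 1) / 2.0   # preserve the mean
    return np.maximum(intercept + slope * i, 0.0)    # eigenvalues are nonnegative
```

As long as no weight is clipped at zero, the weights sum to $d\bar\lambda = \operatorname{tr}(\hat U \hat\Gamma)$ for every $c$, the property used in the RMSEA section below.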

RMSEA with Eigenvalue-Based Tests
The RMSEA is a popular measure of approximate fit originating from the work of Steiger (1990). Using the Satorra-Bentler method, Li and Bentler (2006) found the formula
$$\text{RMSEA} = \sqrt{ \max\left( \frac{T_{NT} - \sum_{i=1}^{d} \lambda_i}{n\, d},\; 0 \right) }.$$
Here the $\lambda_i$ are replaced by estimated values in practice. In the online supplementary material, it is shown that the proposed penalized eigenvalue-based estimators have the same sum as the Satorra-Bentler estimate, and that the above formula for the RMSEA also holds for these procedures.
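Because the robust RMSEA depends on the eigenvalues only through their sum, any weight set with the same sum (SB, pEBA, pOLS) yields the same value. A sketch, assuming a sample form in which the excess of $T_{NT}$ over the eigenvalue sum is scaled by $nd$ (software implementations differ in details such as using $n$ versus $n - 1$; the function name is ours):

```python
import numpy as np

def robust_rmsea(t_nt, eigvals, n):
    """Robust RMSEA from a normal-theory statistic and eigenvalue estimates.

    Assumed form: sqrt(max((T - sum(lam)) / (n * d), 0)). Only the eigenvalue
    sum enters, so SB, pEBA, and pOLS weights all give the same value.
    """
    lam = np.asarray(eigvals, dtype=float)
    d = lam.size
    return np.sqrt(max((t_nt - lam.sum()) / (n * d), 0.0))
```

When the statistic falls below the eigenvalue sum, the estimate is truncated at zero, indicating close fit.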

Monte Carlo Simulation
We considered a two-factor model $x = \Lambda f + \varepsilon$, where $x = (x_1, \ldots, x_p)'$ is a $p$-dimensional vector of observed variables, $f$ is a two-dimensional latent vector, and $\varepsilon$ is a $p$-dimensional vector of uncorrelated residuals, which is also uncorrelated with $f$. The model had simple structure, with $x_1, \ldots, x_{p/2}$ loading on the first factor and $x_{p/2+1}, \ldots, x_p$ loading on the second factor. We included three model sizes with $p = 10$, 20, and 40, and corresponding degrees of freedom $d = 34$, 169, and 739. This study was not preregistered. The model specifications are available at https://osf.io/6trwu/, together with a database of eigenvalues for each replicated dataset in the present study. The eigenvalues are given for both the biased and the unbiased $\Gamma$ estimators. The database also contains $T_{ML}$ and $T_{RLS}$ and may be used for fast assessment of new variants of eigenvalue-based procedures.

Population Model
To represent a realistic scenario, we used heterogeneous factor loadings, with standardized loadings uniformly drawn in the range [.3, .8]. Such loadings reflect values typically found in empirical studies (Li, 2016). The residual variances were then chosen to ensure that the observed variables had unit variance. The factor loadings were nested between models, e.g., for $p = 20$ the first five loadings for each factor were equal to the corresponding loadings in the $p = 10$ model. The interfactor correlations in all models were set to .5. For $p = 10$, the 45 correlations among the observed variables ranged from .08 to .56. For $p = 20$, the 190 correlations ranged from .08 to .64. The 780 correlations in the $p = 40$ model ranged from .045 to .64.
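A population model of this kind can be sketched as follows. This Python illustration uses an arbitrary seed rather than the study's actual loadings (which are available at the OSF repository), and builds the implied covariance matrix $\Sigma = \Lambda\Phi\Lambda' + \Psi$ with residual variances $1 - \lambda_i^2$ so that all observed variables have unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative draw, not the study's loadings
p = 10
lam = rng.uniform(0.3, 0.8, size=p)      # standardized loadings in [.3, .8]

Lam = np.zeros((p, 2))                   # simple structure, two factors
Lam[: p // 2, 0] = lam[: p // 2]
Lam[p // 2 :, 1] = lam[p // 2 :]

Phi = np.array([[1.0, 0.5],              # interfactor correlation .5
                [0.5, 1.0]])
Psi = np.diag(1.0 - lam ** 2)            # residual variances for unit variance

Sigma = Lam @ Phi @ Lam.T + Psi          # model-implied covariance matrix
```

Since each variable loads on one factor only, $\operatorname{Var}(x_i) = \lambda_i^2 + (1 - \lambda_i^2) = 1$, so $\Sigma$ is also the implied correlation matrix.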

Data Distributions
For each population model, data were drawn from seven distributions: the normal distribution and six non-normal distributions. Three of the non-normal distributions had moderate marginal skewness and kurtosis (Curran et al., 1996), taking values 3 and 7, and the other three had severe marginal skewness and kurtosis (with values 3 and 21). We crossed the two marginal non-normality levels with three data distributions: the independent generator (IG) distribution (Foldnes and Olsson, 2016), the piece-wise linear (PL) distribution (Foldnes and Grønneberg, 2021), and the well-known Vale-Maurelli (VM) distribution (Vale and Maurelli, 1983). We use the notation VM1 and VM2 for the VM distributions with the moderate and severe levels of marginal skewness and kurtosis, and similarly for the IG and PL distributions.
Including several classes of non-normal distributions was necessary for the external validity of the study, and was also required for investigating test performance while controlling for marginal skewness and kurtosis. However, note that even with the same skewness and kurtosis, the IG, PL, and VM distributions are different.

Sample Size
We generated data at sample sizes n ¼ 400, 800, 1500, and 3000, to reflect a range of sample sizes routinely used in empirical investigations.

Goodness-of-Fit Tests
All test statistics were calculated from normal-theory ML estimates. For the robustified tests and EBAd we considered four candidates, obtained by combining the base statistic ($T_{ML}$ or $T_{RLS}$) and the estimator of the asymptotic covariance matrix ($\hat\Gamma_A$ or $\hat\Gamma_U$).
A total of 43 test statistics were evaluated, including the base statistics $T_{ML}$ and $T_{RLS}$. For the robustified tests we included the traditional tests $T_{SB}$ and $T_{SS}$ (a total of 8 candidates), the EBA procedures EBA2, EBA4, and EBA6 (12 candidates), the penalized EBA procedures pEBA2, pEBA4, and pEBA6 (12 candidates), and the pOLS test (4 candidates). Among the asymptotically exact tests, we included the Bollen-Stine bootstrap based on $T_{ML}$, and the EBA procedure with singleton blocks, EBAd (4 candidates).

Data Generation and Analysis
Crossing model size, distribution, and sample size resulted in 84 ($3 \times 7 \times 4$) simulation conditions. We generated 3000 datasets for each condition. All tests except Bollen-Stine were evaluated in each condition based on 3000 replications. The computationally expensive Bollen-Stine test was computed only for $n = 800$ and $n = 3000$, and in the largest dimension ($p = 40$) the number of bootstrap replications was reduced to 1000.
All models were estimated using the maximum likelihood estimator in lavaan (Rosseel, 2012). The package covsim (Grønneberg et al., 2022) was used to simulate from the IG and PL distributions, and the package lavaan was used to simulate from the VM distributions. The goodness-of-fit p-values were calculated using the newly developed package semTests. The package CompQuadForm (Duchesne and De Micheaux, 2010) computed the p-values of the type given in Eq. (3).

Evaluation Criteria
We employed three evaluation criteria based on the observed percentage rejection rates (RR), obtained in each of the 84 conditions as the percentage of p-values below .05. Hence, we adopted the commonly used significance level of $\alpha = 5\%$. Our first criterion is the root-mean-square error (RMSE), which measures the discrepancy between the observed rejection rates $RR_c$ and the nominal 5% rejection rate: $\text{RMSE} = \sqrt{\sum_c (RR_c - 5)^2 / C}$, where $C$ denotes the number of conditions we are interested in. For instance, if we look at the smallest model size and include all distributions and sample sizes, $C = 7 \times 4 = 28$. Our second criterion, the mean absolute deviation (MAD), is also a measure of the difference between the empirical rejection rates and the nominal rejection rate, defined as $\text{MAD} = \sum_c |RR_c - 5| / C$. Our third criterion yields the percentage of acceptable rejection rates (ARR), defined as the proportion of conditions $c$ for which $2.5\% < RR_c < 7.5\%$ (Bradley, 1978).
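The three criteria are straightforward to compute from a vector of percentage rejection rates. A small Python sketch (function name ours):

```python
import numpy as np

def evaluate(rr, nominal=5.0, lo=2.5, hi=7.5):
    """RMSE, MAD, and ARR (in %) for percentage rejection rates rr."""
    rr = np.asarray(rr, dtype=float)
    rmse = np.sqrt(np.mean((rr - nominal) ** 2))      # root-mean-square error
    mad = np.mean(np.abs(rr - nominal))               # mean absolute deviation
    arr = np.mean((rr > lo) & (rr < hi)) * 100.0      # % within Bradley's band
    return rmse, mad, arr
```

For example, rejection rates of 3% and 7% give RMSE = MAD = 2 with all conditions acceptable, while 1% and 9% give RMSE = 4 with no acceptable conditions.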
Given the large number of test candidates under evaluation, in addition to reporting these three criteria, we also sort the tests according to their RMSE performance in many of our result tables. We acknowledge that the sorting shifts the order of test statistics between tables, making it more difficult to compare the performance of a given test candidate across conditions. However, the sorting greatly facilitates the identification of the best-performing tests by inspecting the upper part of the result tables.

ML, RLS and Robustified Tests
We evaluated two tests based on normality, ML and RLS, and 38 robustified tests.

Normal data
Type I error rates in the 12 conditions with normal data are presented in Table 1. For each model size, we have sorted the test statistics according to increasing RMSE values across the four sample sizes. At the smallest model size, $p = 10$, the normal-theory statistics $T_{ML}$ and $T_{RLS}$ performed well, as expected. All 40 test candidates had acceptable rejection rates, ARR = 1, at all sample sizes. The MAD ranged from 0.3% for ML to 0.733% for $\text{EBA4}^{UG}_{RLS}$. With increasing model size, test performance generally deteriorated, as expected. Especially striking was the poor performance of ML in comparison to RLS. For instance, for $p = 20$ and $p = 40$ the MAD of ML was 1.18% and 5.85%, respectively. In comparison, the MAD of RLS was negligible for $p = 20$ and $p = 40$: 0.29% and 0.51%, respectively. Also, for dimensions $p = 20$ and $p = 40$ the robustified test $SB^{UG}_{RLS}$ was a top performer. Indeed, this test was the overall winner in terms of RMSE when collapsed over all 12 conditions, with RLS as the runner-up.

Non-Normal Data
Type I error rates for all tests in the 12 conditions (3 models, 4 sample sizes) are tabulated for each non-normal distribution in the supplementary material (Tables B2–B7). In Table 2 we report results aggregated over the six non-normal distributions. Test performance was calculated for each model size, across six distributions and four sample sizes, and test candidates were ranked by increasing RMSE.
Under non-normality, the normal-theory statistics ML and RLS performed poorly. In fact, in none of the 72 non-normal conditions did these tests achieve an acceptable rejection rate.
As expected, the normal-theory tests were outperformed by the traditional robustified tests, SB and SS. Generally, SB outperformed SS, and the SB candidate with the consistently best performance was SB^UG_RLS. The standard SB test, which is based on ML and Ĉ_A, performed markedly worse than SB^UG_RLS, which is based on RLS and Ĉ_U. For instance, collapsing over all 72 non-normal conditions, the MAD of SB and SB^UG_RLS was 3.28% and 1.63%, respectively. Also, the ARR of SB was 65.3%, compared to 76.4% for SB^UG_RLS. Among the SS candidates, performance was best when based on ML and Ĉ_A. However, even this candidate, SS, had poor overall performance, especially in the large model, where ARR was zero.
Many candidates in the family of newly developed procedures (EBA, pEBA, and pOLS) outperformed the SB and SS procedures. The RMSE rank in Table 2 of the best traditional robustified test, SB^UG_RLS, was 17, 20, and 18 for dimensions 10, 20, and 40, respectively. To give a further overview of the best-performing tests, we also aggregated over model size, with the resulting ten best performers (in terms of RMSE) presented in Table 3. This table is hence based on collapsing 72 conditions (six distributions, four sample sizes, and three model sizes). The top nine performers in Table 3 all belong to the new class of penalized eigenvalue modeling. Also noteworthy, eight of the ten tests are based on RLS, and only two on ML.
To investigate in full detail the performance of some of the best tests in Table 3, we picked the top candidate from the pEBA2, pEBA4, pEBA6, and pOLS families, namely pEBA2^UG_RLS, pEBA4_RLS, pEBA6_RLS, and pOLS_RLS. The rejection rates of these four candidates in all 72 conditions are plotted in Figure 4. The figure also includes the best candidate in each of the traditional families of robustified tests: SB^UG_RLS for SB, and SS for SS. A consistent pattern is that the newly developed tests were associated with rejection rates intermediate between SS, which severely under-rejected, and SB^UG_RLS, which tended to over-reject. The figure demonstrates that goodness-of-fit testing was more challenging in larger models, while larger sample sizes were associated with better Type I error control. The distributional type also affected the test procedures. Under normality (see also Table 1), all tests performed well, except SS in the largest model. Under non-normality, performance depended on marginal kurtosis, as expected, with overall MAD (across tests, distributions, and model sizes) equal to 1.11% for the skewness = 2, kurtosis = 7 condition (IG1, PL1, VM1) and 1.88% for the skewness = 7, kurtosis = 21 condition (IG2, PL2, VM2). There was also some variation in overall test performance among the underlying distributional classes. The overall MAD for distributions of type IG, PL, and VM was 1.56%, 1.50%, and 1.43%, respectively.

Asymptotically Exact Tests
Table 4 presents the Bollen-Stine rejection rates. The underlying distribution strongly affected test performance, with severe overrejection for PL2 and partly for VM2. In the large model, for PL2 and VM2, the rejection rate was virtually 100%, in striking contrast to the finding in Ferraz et al. (2022) that the Bollen-Stine test tended to underreject in a p = 30 model. In contrast, under the normal and IG distributions, Bollen-Stine consistently underrejected, reflecting the findings in Ferraz et al. (2022). Overall, echoing the findings of Ferraz et al. (2022), Bollen-Stine performed poorly as model size increased, even for n = 3000.
Next, consider the asymptotically exact test EBAd. The differences between the four EBAd candidates were small (see Figure B1 in the supplementary material). Therefore, in Table 5 we only report results for the default version, which is based on ML and Ĉ_A. The results are aggregated over all seven distributions. The test exhibited poor Type I error control, especially at low sample sizes, with severe underrejection. The asymptotic superiority of EBAd was not yet detectable at sample size n = 3000. To further inspect the rate of convergence to the nominal 5% rejection rate, and to confirm asymptotic consistency, we simulated some very large sample-size conditions (n = 10⁴, 10⁵). Even at n = 10⁴ the tendency to underreject was still pronounced for dimensions p = 20 and p = 40. For instance, for p = 40 and n = 10⁴ the overall rejection rate across distributions was only 2.6%.

Illustration of the Package semTests
We demonstrate the use of the newly developed R package semTests (Moss, 2024) by conducting a small power study. Consider the model with p = 20 observed variables used in our Monte Carlo study (see the supplementary online material for the complete model specification). We first simulate an n = 800 non-normal dataset from this model using the VM2 distribution. Then we run the pvalues() function from semTests on the fitted model, using the default parameter values. The default p-values reported were chosen from the best-performing tests in our Monte Carlo study, in addition to the best-performing traditional test SB^UG_RLS. The reported p-values are similar, with SB^UG_RLS having the smallest p-value.
Next, we conduct a small power study with 1000 replications, where n = 800 and the data are drawn from the VM2 distribution. Model misspecification is introduced by adding a cross-loading with a standardized factor loading of 0.4.
The rejection rates are ordered in the same way as the Type I error rates in Figure 4 (see the bottom middle panel at n = 800). The highest power is achieved by SB^UG_RLS, which also had the highest Type I error rate, 7.1%, among the tests in Figure 4. Hence, in terms of power, SB^UG_RLS outperformed the other test candidates. However, Type I error control is a more fundamental requirement than adequate power, and the other test candidates outperform SB^UG_RLS in terms of Type I error control.

Discussion
We have proposed and evaluated new goodness-of-fit methods for factor analysis and structural equation modeling with non-normal data. The new methods, pEBA and pOLS, apply penalization to the estimated eigenvalues of UC. The pEBA methods were derived from the EBA approach (Foldnes and Grønneberg, 2017), and the pOLS methods are based on a linear approximation of the sorted eigenvalues, where the penalization is obtained by dampening the slope. We have provided a formal analysis of eigenvalue modeling that motivates pEBA and pOLS.
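The slope-dampening idea behind pOLS can be caricatured in a few lines. This is a schematic sketch, not the paper's implementation: the damping factor `damp`, the plain least-squares line, and the input values are illustrative assumptions.

```python
import numpy as np

def pols_eigenvalues(eigs, damp=0.5):
    """Schematic pOLS-style penalization (illustrative only):
    fit a least-squares line to the sorted eigenvalue estimates,
    then dampen its slope to shrink the spread of the fitted values
    while preserving their mean level."""
    eigs = np.sort(np.asarray(eigs, dtype=float))[::-1]  # sort decreasing
    x = np.arange(len(eigs))
    slope, intercept = np.polyfit(x, eigs, 1)
    center = x.mean()
    # Dampen the slope around the center so the mean level is preserved
    return intercept + slope * center + damp * slope * (x - center)

# Hypothetical noisy eigenvalue estimates
est = [2.3, 1.6, 1.2, 0.9, 0.5]
pred = pols_eigenvalues(est)
```

The dampened line keeps the average eigenvalue unchanged but pulls extreme estimates toward the trend, mimicking the stabilization that penalization is meant to achieve.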
Many tests have previously been proposed for goodness-of-fit testing under non-normality, and we conducted a large Monte Carlo study to compare the Type I error control of the new methods with well-known traditional tests such as the Satorra-Bentler scaled test, the scaled-and-shifted test, and the Bollen-Stine bootstrap test. Moreover, we took recent developments into account by acknowledging that the little-known normal-theory RLS test (Browne, 1974) might outperform the classical normal-theory ML test under multivariate normality, and that replacing the traditionally used asymptotically unbiased C estimator Ĉ_A by an unbiased estimator Ĉ_U might improve test performance. Therefore, the Monte Carlo study evaluated four versions (ML/RLS, Ĉ_U/Ĉ_A) of each test statistic.
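For reference, the traditional Satorra-Bentler approach scales the normal-theory statistic by the mean of the estimated eigenvalues. A minimal sketch of this standard scaling, where the statistic and eigenvalue estimates are hypothetical inputs:

```python
def satorra_bentler_scaled(T, eigs):
    """Scale a normal-theory fit statistic T by the mean of the
    estimated eigenvalues, so the scaled statistic is referred to a
    chi-square distribution with df = len(eigs) degrees of freedom."""
    c = sum(eigs) / len(eigs)  # scaling factor: mean estimated eigenvalue
    return T / c

# Hypothetical values: T = 60 with df = 20 and all eigenvalues equal to 1.5
T_sb = satorra_bentler_scaled(60.0, [1.5] * 20)
# T_sb = 40.0, compared against a chi-square with 20 degrees of freedom
```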

Practical Recommendations
For the special case of normal data, our results echoed earlier findings (e.g., Hayakawa, 2019) in demonstrating that the RLS test is far superior to the more commonly used ML test. We therefore advise using T_RLS instead of T_ML when reporting goodness of fit under normality.
For non-normal data, our recommendations are as follows. Echoing Ferraz et al. (2022), we do not recommend using the Bollen-Stine bootstrap test. This test is computationally heavy and was found to adequately control Type I error only in the smallest model, with 10 observed variables, and then only for five of the six non-normal conditions. In a model with 40 observed variables, we found the bootstrap test to perform poorly even with a sample size of 3000.
Overall, the nine best-performing tests in our Monte Carlo study were all of pEBA or pOLS type. Based on Table 3, we recommend basing goodness-of-fit testing in non-normal conditions on pEBA4_RLS or pOLS_RLS.

Limitations and Future Research
We have focused exclusively on continuous data. However, the methods we propose are naturally applicable to ordinal data analyzed using polychoric correlations. It is natural to ask whether pEBA and pOLS outperform the currently available methods for testing ordinal factor models, such as the SS test used by lavaan (Rosseel, 2012).
The idea of eigenvalue penalization has not yet been fully explored, and different variants could potentially result in better test performance. A more thorough analysis of eigenvalue modeling could shed further light on the subject. Using uneven block sizes in pEBA is another possibility. Future estimation of the eigenvalues could also be based on the limiting behavior of the spectral distribution of ÛĈ. A natural extension of the present paper is to consider power. That is, among tests that control Type I error, which candidate best detects model misspecification? We consider this a topic for a future Monte Carlo study. The results of such a study must balance the primary concern of Type I error control with the secondary concern of power.
Our Monte Carlo design is limited in several ways. We only considered factor analysis models, not more general structural equation models. Moreover, the number of observed variables ranged from 10 to 40, but models with hundreds of observed variables are not uncommon in applied research. A study with about 50–100 variables would shed more light on such situations. We modeled three types of non-normal distributions, across two conditions of marginal non-normality as measured by marginal skewness and kurtosis. But other kinds of non-normality are certainly encountered in practice, and further research could employ more flexible non-normal distributional classes, such as the flexible VITA method suggested by Grønneberg and Foldnes (2017) and implemented in the R package covsim (Grønneberg et al., 2022). Finally, we considered model fit only for non-nested models. Nested model testing is widely conducted in measurement invariance testing, and the new test procedures can naturally be applied in these situations as well.

Conclusion
We have proposed several new methods for evaluating the model fit of structural equation models and evaluated their performance in a Monte Carlo study of factor models. The new methods outperform existing methods such as Satorra-Bentler in terms of Type I error control. The overall best-performing test, pEBA4_RLS, performed adequately in 69 of 72 non-normal conditions. The new methods are available in the R package semTests.

Figure 1. Population and estimated eigenvalues for a ten-dimensional CFA with 34 degrees of freedom. The × symbols represent population eigenvalues, while the boxplots represent estimated eigenvalues across 200 replications at sample size n = 1500.

Figure 2. The sorted simulated data plotted against i/d for i = 1, 2, ..., d. The curve in red is the theoretical quantile function. The curve in blue is the empirical quantile function. The dotted black values are the levels of the observations.

Figure 4. Rejection rates in % for six selected tests. Panel columns and rows correspond to model size and distribution, respectively.

Table 1. Type I error rates, normal data.

Table 2. Test performance across six non-normal distributions and four sample sizes, ranked in increasing RMSE order.

Table 3. Top ten robustified tests according to RMSE when aggregating six non-normal distributions, four sample sizes, and three model sizes.

Table 4. Rejection rates in % for the Bollen-Stine bootstrap.

Table 5. Rejection rates in % for the EBAd test procedure, aggregated over all seven distributions.