The performance of unweighted least squares and regularized unweighted least squares in estimating factor loadings in structural equation modeling

In a confirmatory study, researchers are expected to employ covariance-based structural equation modeling (CB-SEM). A key assumption of CB-SEM is that the data are multivariate normal. Nevertheless, a perfect normal distribution is rarely observed in real-life data. To address this, the unweighted least squares (ULS) estimator was designed specifically to deal with non-normal data in SEM. However, ULS often yields improper solutions, such as negative or boundary estimates of unique variances, because it accounts for measurement errors in the observed variables. The disturbance in SEM is reflected in the unique variance: random error due to unreliability or measurement error, together with reliable variation in the item that indicates unknown latent causes. Consequently, this can bias the indicator loading estimates. To disentangle this issue, the present study proposes the implementation of a regularization parameter by adding small positive values to the variance-covariance matrix. The ratio of bias to variance in a model can thereby be improved to obtain the best estimation performance. Monte Carlo simulation was used to produce multivariate non-normal data with designated sample sizes and population characteristics. The data were generated in the R programming environment using the “psych”, “MASS”, “foreign”, “mvrnonnorm”, “purrr”, and “semTools” packages, with 1000 replications per condition. Next, the “lavaan” package was used for the SEM and regularized SEM analyses. The outcome of this study demonstrates the capability of regularized ULS to improve parameter estimation.


Introduction
Structural equation modeling (SEM) is a second-generation statistical analysis approach. It was developed to assess the interrelationships among several variables in a model (Awang, 2023; Afthanorhan et al., 2021; Ainur et al., 2017; Aimran et al., 2017; Zulkifli et al., 2022). In SEM, the unweighted least squares (ULS) estimation method is designed to work with non-normal data (Mîndrilă, 2010). According to Jung and Takane (2008), ULS often yields improper solutions, such as negative or boundary estimates of unique variances, because it accounts for measurement errors in the observed variables. Unique variance manifests as the disturbance in SEM, that is, random error due to unreliability or measurement error, together with reliable variation in the item that signifies unknown latent causes. In addition, ULS produces more biased and less precise parameter estimates (Forero et al., 2009). When the data are non-normal, the loading estimates may be biased or inefficient, which can lead to inaccurate conclusions about the relationships between the latent variables and the observed variables. Therefore, a vast literature has promoted regularization methods to overcome this matter (Jacobucci et al., 2016; Yuan & Bentler, 2017; Yuan & Chan, 2016). Arruda (2017) stated that regularization can be explained simply as the state of having been made regular. The inclusion of information to solve poorly identified problems is also typically described using the mathematical notion of regularization. These definitions are similar to the informal definition given by Bickel and Li (2006), which highlights the modification of a method to provide effective solutions in difficult conditions. The majority of applications focus on multicollinearity, overfitting, or sparsity problems. Numerous regularization techniques address the parsimony principle, such as smoothing, model selection, or methods to control model complexity. In a similar vein, regularization can be used to speed up lengthy computations or to invert matrices. Recent research has demonstrated that regularized regression techniques are beginning to be employed in covariance modelling, and several regularization methods specific to the SEM methodology have recently been applied and studied (Jacobucci et al., 2016; Yuan & Bentler, 2017; Yuan & Chan, 2016). Methods like the lasso and ridge regression, for instance, have been incorporated into SEM (Jacobucci et al., 2016; Jung, 2013). Jacobucci et al. (2016) proposed regularized structural equation modelling, or RegSEM, as a method of penalizing parameters to reduce model complexity and improve the generalizability of models. Although a regularized ULS within RegSEM has been introduced to mitigate the effect of unique variance, the method is less efficient because it regularizes the specific parameter matrix directly, which leads to over-shrinkage in the estimation problem. To address this issue, this study demonstrates the implementation of a regularization parameter applied to each element of the sample covariance matrix in the ULS estimator. This approach seeks the optimal trade-off between bias and variance by adding small positive values to the elements of the covariance matrix. By doing so, the ratio of bias to variance in a model can be optimized to yield the best estimation performance.

Simulation research model
The existing method developed by Vale & Maurelli (1983) was used in this study to produce non-normally distributed data via Markov chain Monte Carlo (MCMC) simulation techniques, with skewness and kurtosis values of 2 and 7, respectively (Pavlov et al., 2020). Three population models with different true indicator loading values were generated. Each model consisted of four latent constructs, each with four items, and the constructs correlated at 0.7. The homogeneous true indicator loading was predetermined at 0.7 (Model 1), 0.8 (Model 2), and 0.9 (Model 3). The sample sizes chosen were 50, 100, 200, and 500. In path modelling, sample sizes of 100 to 200 are often used as the starting point (Awang, 2023; Henseler & Chin, 2010), with 50 indicating a small sample and 500 a large sample. Next, to ensure the consistency of the findings, 1000 replications of each condition were performed, resulting in the generation of 3 × 4 × 1000 = 12,000 datasets. ULS and regularized ULS were employed to estimate the indicator loadings. For regularization, Arruda & Bentler (2017), Jacobucci et al. (2016), and Jung (2018) recommended that the optimal value of the regularization parameter λ be chosen based on model performance, through the smallest value of the RMSEA. The simulation process and SEM analyses were carried out in the R statistical programming environment (R Core Team, 2018). The "psych", "MASS", "foreign", "mvrnonnorm", "purrr", and "semTools" packages were applied to produce multivariate non-normal data. Next, the "lavaan" package developed by Rosseel (2012) was used for the SEM and regularized SEM analyses. The three population models are illustrated in Fig. 1, Fig. 2, and Fig. 3.

a. Unweighted Least Squares
The unweighted least squares (ULS) estimation method was used to analyze the non-normal data to assess fit and coefficients in CB-SEM. McDonald and Bollen (1990) noted that ULS minimizes the fit function

F_ULS(θ) = ½ tr[(S − Σ(θ))²],

where tr is the trace of the matrix, S is the sample covariance matrix, Σ(θ) is the model-implied covariance matrix, and θ is the (t × 1) vector of parameters. The fit function F_ULS minimizes the sum of squares of each element of the residual matrix (S − Σ(θ)). Compared to ML and WLS, ULS has the advantage of producing a consistent estimator while, unlike ML, requiring no distributional assumptions (Schermelleh-Engel, Moosbrugger & Müller, 2003).
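The ULS fit function can be sketched numerically. The following is an illustrative Python/NumPy sketch (not the R code used in the study), where `S` and `Sigma` stand for the sample and model-implied covariance matrices:

```python
import numpy as np

def uls_fit(S, Sigma):
    """ULS discrepancy: F_ULS = 1/2 * tr[(S - Sigma)^2].
    For symmetric matrices this equals half the sum of squared
    elements of the residual matrix (S - Sigma)."""
    R = S - Sigma  # residual matrix
    return 0.5 * np.trace(R @ R)

# Example: a two-indicator covariance matrix with correlation 0.49.
S = np.array([[1.0, 0.49],
              [0.49, 1.0]])
print(uls_fit(S, S))  # a perfect fit gives 0.0
```

Estimation then amounts to searching over θ for the Σ(θ) that minimizes this discrepancy.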

b. Regularized Unweighted Least Squares
In order to produce accurate estimates, particularly under non-ideal conditions and with non-normal data, a regularization method is applied to the estimation of the variance-covariance matrix in the ULS estimator. The ULS estimator is a popular method for estimating SEM parameters with non-normal data. However, it can suffer from issues such as biased estimates, unstable solutions, and improper solutions (Forero et al., 2009). Improper solutions, such as negative or boundary estimates of unique variances, can occur when there are errors in the measurement of the observed variables. The unique variance represents the minimum amount of variance that cannot be explained by the model. Regularization techniques can improve the stability of the covariance matrix estimates, which also helps to reduce the risk of negative error variances by ensuring enough variance for the latent variables. The values in the sample variance-covariance matrix play a crucial role in the estimation of model parameters in SEM (Kline, 2016). The quality and reliability of parameter estimates, such as the factor loadings, can be significantly affected by the accuracy and stability of the sample variance-covariance matrix. Stabilizing the variance-covariance matrix requires applying some form of regularization to the sample covariance matrix. Regularization achieves this by adding a regularization parameter to the sample variance-covariance matrix, which controls the amount of regularization applied. Consequently, in this study, an improved estimator S̃ is used in place of S in the ULS estimator through the addition of the regularization parameter λ. This technique seeks the ideal trade-off between bias and variance (Jacobucci et al., 2019) by balancing the fit of the model to the data against the complexity of the model, including controlling the amount of shrinkage applied to the sample covariance matrix. This helps to improve the stability of the covariance matrix, especially when the sample size is small or the number of variables is large. The improved sample variance-covariance matrix S̃, which adds the regularization parameter λ to each element of the sample covariance matrix, is

s̃_ij = s_ij + λ,

where s_ij is an element of the sample covariance matrix S and λ is the regularization parameter. The value of λ plays a crucial role in determining how much weight is assigned to the variance-covariance matrix (Arruda & Bentler, 2017). Several values of λ (λ > 0) are tested, and an optimal λ is chosen depending on the performance of the model: the λ with the smallest RMSEA is chosen for each model. The RMSEA is a fit index that measures the discrepancy between the observed and predicted covariance matrices, adjusted for the complexity of the model. The goal of using the RMSEA criterion to select the regularization parameter is to find a balance between model fit and complexity that yields the most accurate and stable estimates of the model parameters. This study varied λ across multiple values, ranging from 0 to 1 in equal increments. Hence, with the regularization parameter incorporated into the sample covariance matrix of ULS, sufficient covariance between the measured variables is provided to ensure enough variance for the latent variables. Accordingly, the improved regularized ULS fit function is

F_RULS(θ) = ½ tr[(S̃ − Σ(θ))²],

where tr is the trace of the matrix, S̃ is the regularized sample covariance matrix, Σ(θ) is the model-implied covariance matrix, and θ is the (t × 1) vector of parameters.
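The construction of the regularized covariance matrix and the resulting fit function can be sketched as follows. This is an illustrative Python/NumPy sketch, not the authors' R implementation; it assumes the elementwise addition of λ described above:

```python
import numpy as np

def regularize_cov(S, lam):
    """Improved sample covariance matrix: add the regularization
    parameter lambda to each element of S (s_ij + lambda)."""
    return S + lam  # scalar broadcasting adds lam to every element

def regularized_uls_fit(S, Sigma, lam):
    """Regularized ULS discrepancy:
    F_RULS = 1/2 * tr[(S_tilde - Sigma)^2]."""
    R = regularize_cov(S, lam) - Sigma
    return 0.5 * np.trace(R @ R)

S = np.array([[1.0, 0.49],
              [0.49, 1.0]])
# Candidate lambdas on an equal grid over [0, 1], as in the study;
# in practice the lambda giving the smallest RMSEA would be retained.
lambda_grid = np.linspace(0.0, 1.0, 21)
```

With λ = 0 the regularized fit function reduces exactly to ordinary ULS, so the grid search over λ nests the unregularized estimator as a special case.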

Comparative Bias Index (CBI)
The population data were generated using different sample sizes and the specified criteria, as previously mentioned. The actual values of the model parameters, such as the true indicator loadings, were taken as the population values. These values are required to generate the simulated data. To examine the bias of the parameter estimates produced from the simulated data, the CBI developed by Aimran et al. (2017) was calculated for comparison:

CBI = 1 − |θ̂ − θ| / |θ|,

where θ̂ represents an estimate of the model parameter and θ denotes the parameter's actual value. A CBI value of ≥ 0.8 denotes an estimate with acceptable bias, whereas a CBI value of > 0.9 denotes an estimate that is unbiased or has low bias. Otherwise, the bias of the estimate is unacceptable.
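The CBI and its cut-offs can be sketched as below. Note that the exact algebraic definition of the CBI is given in Aimran et al. (2017); the relative-bias form used here (1 minus the absolute bias relative to the true value) is an assumption for illustration, chosen because values close to 1 must indicate low bias:

```python
def cbi(estimate, true_value):
    """Comparative Bias Index (assumed relative-bias form for
    illustration; see Aimran et al., 2017 for the exact definition).
    Values close to 1 indicate low bias."""
    return 1.0 - abs(estimate - true_value) / abs(true_value)

def bias_label(c):
    """Apply the cut-offs used in the study."""
    if c > 0.9:
        return "unbiased/low bias"
    if c >= 0.8:
        return "acceptable bias"
    return "unacceptable bias"
```

For example, an estimated loading of 0.65 against a true loading of 0.7 gives a CBI above 0.9 and is classified as unbiased/low bias.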

Root Mean Square Error (RMSEA)
As shown below, the RMSEA computes the discrepancy due to approximation per degree of freedom:

RMSEA = √( F̂₀ / df ),

where F̂₀ denotes the discrepancy between the method used to generate the data and the model that was fitted, and df is the model's degrees of freedom. The acceptable cut-off value for the RMSEA is ≤ 0.08 (Awang, 2023).
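A common computational form of the RMSEA, assuming the standard chi-square-based estimate of F̂₀, can be sketched as follows (an illustrative Python sketch, not the study's R code):

```python
import math

def rmsea(chi_sq, df, n):
    """RMSEA = sqrt(F0_hat / df), where the approximation discrepancy
    is estimated as F0_hat = max((chi^2 - df) / (n - 1), 0)."""
    f0_hat = max((chi_sq - df) / (n - 1), 0.0)
    return math.sqrt(f0_hat / df)

# A model whose chi-square does not exceed its df fits closely.
print(rmsea(48.0, 50, 200))  # 0.0
```

The max(…, 0) truncation is what keeps the RMSEA at zero whenever the chi-square statistic falls below its degrees of freedom.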

Result
The CBI and RMSEA values for all indicator loadings and models are summarized in Table 1, Table 2, and Table 3, respectively. For Model 1, the loading of each item under the corresponding construct was set to 0.7. The findings for Model 1 are depicted in Table 1. They reveal that ULS outperforms regularized ULS, with no biased estimates (CBI ≥ 0.8), for a sample size of 50.
Next, for 100 samples, a number of unfavorable bias estimates (CBI < 0.8) were observed for both estimation methods. The biased estimates generated by regularized ULS are due to overestimation of the indicator loadings. Thus, this finding suggests that the proposed method generates results comparable to ULS for a sample size of 100. However, the regularized technique produces a compelling result, with no indicators exhibiting unacceptable bias estimates, when a large sample (n ≥ 200) is involved.
Conversely, there are more unacceptable bias estimates for ULS at a large sample size (n = 500). Moreover, the RMSEA values improved and achieved the acceptable threshold of 0.08 when employing regularized ULS. For the selection of the optimal regularization parameter, the small-sample model (n = 50) requires an extremely large regularization parameter value, λ = 0.75, to obtain a better-fitting model with minimal RMSEA. Nevertheless, the optimal regularization parameters for sample sizes of 100, 200, and 500 are 0.10, 0.20, and 0.40, respectively, showing no apparent difference. Next, Table 2 displays the indicator loadings evaluated by the CBI for Model 2. For Model 2, each item loading was set to 0.8. The results revealed that the regularized ULS consistently outperformed ULS across all sample sizes (n = 50, 100, 200, 500) with the lowest RMSEA value. On the other hand, ULS yields numerous indicator estimates with unacceptable bias (CBI < 0.8). A thorough review of the CBI values reveals that regularized ULS produces estimates superior to those of ULS (CBI values close to 1). Also, for the selection of the optimal regularization parameter, the small-sample model (n = 50) requires an extremely large regularization parameter value, λ = 0.80, to obtain a better-fitting model with the smallest RMSEA. Meanwhile, the optimal regularization parameters needed for sample sizes of 100, 200, and 500 are 0.10, 0.20, and 0.45, respectively, showing no apparent difference. From this finding, a pattern similar to Model 1 is observed in selecting the optimal value of λ for the regularized ULS method: a small sample (n = 50) requires a very large λ to generate a better-fitting model, hence yielding more unbiased indicator loading estimates.

Table 3
The CBI, RMSEA and optimal regularization values for Model 3

Next, every item loading under each of the corresponding constructs for Model 3 was set to 0.9. The outcome is tabulated in Table 3. Consistent with the findings for Model 2, the regularized ULS method clearly appears to be better than ULS for all sample sizes (n = 50, 100, 200, 500), with the lowest RMSEA value. On this ground, regularized ULS is indeed able to improve on the performance of the existing ULS method in estimating the parameters for non-normal data. However, the method requires a large sample (n ≥ 500) to yield more precise loading estimates, where the CBI value is > 0.9 or close to 1. Besides, the small-sample model (n = 50) requires an extremely large regularization parameter value, λ = 0.65, to obtain a better-fitting model with the smallest RMSEA, while there are no discernible differences in the optimal regularization parameter across the larger sample sizes.

Discussion
This study investigated the performance of the regularized ULS and the traditional ULS methods in terms of the CBI. The Markov chain Monte Carlo method was used to generate data of different sample sizes using a simple model that satisfies certain conditions (e.g., non-normal, complete data). Several conclusions were derived from the results. As previously mentioned, the three models' true item loadings were uniformly set at 0.7, 0.8, and 0.9, respectively. For Model 1, when n = 50, ULS outperforms regularized ULS, with no biased estimates (CBI ≥ 0.8), indicating that ULS is favorable for small sample sizes. Meanwhile, for n = 100, a number of unfavorable bias estimates (CBI < 0.8) were observed for both estimation methods. This finding suggests that the proposed method (i.e., the regularized ULS) generates results comparable to those of ULS when the sample sizes are 50 and 100. However, the regularized ULS method outperforms ULS when the sample size is sufficiently large (i.e., n ≥ 200). Moreover, the RMSEA values improved and achieved the acceptable threshold of 0.08 when employing regularized ULS. Conversely, the number of undesirable bias estimates for ULS increases as the sample size increases (n ≥ 200). Thus, it is suggested that the regularized ULS estimator be used when data with a true loading of 0.7 are to be simulated.
For Model 2 and Model 3, ULS yields several indicator estimates with unacceptable bias (CBI < 0.8). However, the regularized ULS is capable of yielding unbiased estimates across all sample sizes (n = 50, 100, 200, and 500). In light of this finding, when the true indicator loadings are high, regularized ULS is suitable for small to large samples. It can therefore be inferred that when the true indicator loadings are high, the loadings are estimated precisely and hence lie within the permissible bias range. Considering greater sample sizes (n ≥ 200), it can be deduced that regularized ULS can be utilized to generate more precise parameter estimates in simulation studies with population indicator loadings greater than 0.7. The bias of the indicator loading estimates in ULS across all sample sizes might be due to underestimation; the new regularized approach has been shown to overcome this matter.
In addition, the choice of the regularization parameter λ is crucial in yielding better estimates. A small sample (n = 50) requires a very large λ to generate a well-fitting model, hence resulting in more unbiased indicator loading estimates. The differences between the optimal regularization parameters across large sample sizes (n ≥ 100) are not noticeable. Still, the selection of the tuning parameter should depend on the model characteristics and the sample size. By incorporating a positive constant into the sample variance-covariance matrix, the stability of the covariance matrix is enhanced, thereby mitigating the risk of negative error variances. The values within the sample variance-covariance matrix play a crucial role in determining the model parameters, and the accuracy and stability of this matrix can significantly impact the estimated parameter values. Thus, selecting the optimal value of the regularization parameter is crucial for generating the best estimates (Arruda & Bentler, 2017). A greater λ value is necessary to achieve more stable and precise estimates of certain constructs, especially for small sample sizes. Therefore, it is vital to evaluate the model through the smallest value of the RMSEA and to test various choices of the regularization parameter accordingly (Jacobucci et al., 2016).

Conclusion
In summary, regularized ULS is suggested as a strong estimator when the true indicator loadings are high (e.g., ≥ 0.7). This study has shown that adding the regularization parameter to each element of the sample covariance matrix of the ULS estimator enhances the loading estimates, particularly when the data are non-normal and the indicator loadings of the simulated data are large (e.g., ≥ 0.7). Since real-life data are usually non-normal, the findings of this study can be used by policy makers and researchers to obtain more accurate estimates when analyzing the interrelationships between variables.

Table 1
The CBI, RMSEA and optimal regularization values for Model 1
Note: Bold values indicate unacceptable bias estimates

Table 2
The CBI, RMSEA and optimal regularization values for Model 2
Note: Bold values indicate unacceptable bias estimates