Classical and Bayesian Uncertainty Intervals for the Reliability of Multidimensional Scales

Abstract The reliability of a multidimensional test instrument is commonly estimated using coefficients ω_t (total) and ω_h (hierarchical). However, point estimates for the coefficients are rarely accompanied by uncertainty estimates. In this study, we compare bootstrap and normal-theory confidence intervals. In addition, we develop Bayesian versions of coefficients ω_t and ω_h by sampling from a second-order factor model. Results from a comprehensive simulation study show that the studied confidence intervals performed well when the sample size was sufficiently large (n ≥ 500). The Bayesian estimates performed well across most studied conditions. When the sample size was small and the reliability low, only the bias-corrected and accelerated bootstrap confidence interval approached satisfactory coverage among all intervals. This study provides guidance on confidence intervals for ω_t and ω_h and introduces ω_t and ω_h credible intervals that are easy to use and come with the benefits of Bayesian parameter estimation.

Many constructs in the social and behavioral sciences assume a multidimensional structure with a general factor along with more specific subdimensions called group factors. For example, the short version of the UPPS-P (Cyders et al., 2014) is a measure of impulsivity and comprises 20 items. The subscales negative and positive urgency, (lack of) premeditation, (lack of) perseverance, and sensation seeking are measured by four items each, so that the more general impulsivity domain is partitioned into five more specific subdomains. Depending on the research question and design, the reliabilities of approximately unidimensional subsets of the composite test score may be of interest. For instance, one may divide the impulsivity scale score into unidimensional subscale scores of four items each and estimate the reliability of each subscale using, for example, coefficient α (Cronbach, 1951). However, this approach is unsuited when either the total or the general factor reliability of the composite test score (all items) is of interest. Likewise, estimating coefficient α for the complete impulsivity test score underestimates the true reliability (Sijtsma, 2009). Proper coefficients for the reliability of a multidimensional scale are coefficient ω_t, indicating the total reliability of a scale, and coefficient ω_h, indicating the reliability of the general factor, that is, the extent to which all items of a scale measure a common underlying latent variable (McDonald, 1999; Zinbarg et al., 2005).¹ When reporting reliability coefficients, researchers usually present a point estimate without an uncertainty estimate such as a confidence interval. This practice disregards the sampling error made when estimating a reliability parameter from a finite sample. To quantify uncertainty, confidence intervals (in a frequentist setting) or credible intervals based on posterior distributions (in a Bayesian setting) can be used. Here, we investigate several approaches to obtain confidence intervals for coefficients ω_t and ω_h and describe a method to estimate a second-order factor model in the Bayesian framework, based on work by Lee (2007), that can be employed to obtain Bayesian versions of ω_t and ω_h and their posterior distributions.
To our knowledge, confidence intervals for coefficients ω_t and ω_h, both analytic and bootstrapped ones, have not been investigated before.² Furthermore, we are unaware of any attempts concerning Bayesian versions of coefficients ω_t and ω_h. This makes our work novel in several ways. We are the first to investigate various confidence intervals for coefficients ω_t and ω_h. Also, we are the first to develop and evaluate Bayesian versions of ω_t and ω_h by adopting methodology from Bayesian structural equation modeling (SEM).
Finally, the implementation of the developed methods in an R-package is another novel contribution that increases the impact of this work.
The outline of this paper is as follows. First, we present the factor model approach to reliability and describe how the coefficients ω_t and ω_h can be obtained from the parameters of the second-order factor model. Second, we detail how the second-order factor model can be estimated in the Bayesian framework, which yields the parameters needed to compute the posterior distributions of ω_t and ω_h. Third, we discuss uncertainty estimation in reliability analysis. Fourth, we compare the frequentist and Bayesian reliability estimates in a simulation study before showcasing a (Bayesian) reliability analysis with the impulsivity example. Finally, we discuss the results of the simulation study, practical implications, and future directions of research.

The Factor Model Approach to Reliability
Historically, reliability has been defined in the realm of classical test theory (CTT). In CTT, the test score is divided into a true score and an error score: X = T + E. The reliability is the relative amount of the test score variance that is true score variance, which equals the correlation between parallel test administrations (Lord & Novick, 1968). Parallel tests can be described as administering the same test at least twice to the same sample of participants, with the participants having no recollection of the previous administrations.
Reliability ρ_XX′ can be written as

ρ_XX′ = σ²_T / σ²_X = 1 − σ²_E / σ²_X,

with σ²_X as the test score variance, and σ²_T and σ²_E as the true score variance and the error score variance, respectively (Lord & Novick, 1968). Note that CTT makes no assumptions about the dimensionality of the data (e.g., Sijtsma & Pfadt, 2021).
Coefficients ω_t and ω_h are based on a factor analytic (FA) approach to decomposing a test score and inferring reliability. In a typical factor model, for example, the single-factor model (also unidimensional/congeneric model; Spearman, 1904), one divides the test score into a part explained by one common latent variable or factor, and a remaining error part that cannot be explained by the common factor: X = λF + E, with λ as the loadings of the item scores that comprise X on factor F. Note that, in contrast to CTT, the definition of reliability in the FA approach changes as the definition of the true score variance changes. Reliability is then defined as the relative amount of test score variance that is due to one or more factors (e.g., Bollen, 1989; Jöreskog, 1971; McDonald, 1970). In the single-factor model, there is only one common factor that explains the responses in the test; thus, the true score variance is the variance explained by the common factor. A single-factor model is at odds with assuming a multidimensional construct that incorporates a general factor along with group factors and thereby provides only a rather crude approximation of the general factor. Considering the impulsivity example with its five dimensions of impulsivity, the variance due to only a single factor falls short of capturing the true score variance explained by all facets of impulsivity; hence, a more sophisticated factor model is appropriate.

Second-Order Factor Model
We assume that the test score can be explained by several primary latent variables (group factors) that are each common to some but not all items, and that the primary latent variables are themselves explained by one second-order general factor. This model is also sometimes referred to as a "higher-order factor model." For instance, one can structure the impulsivity scale into a general second-order factor, that is, the overall impulsivity that influences the outcomes of the manifest variables through the group factors (or dimensions), specifically the subscales negative and positive urgency, (lack of) premeditation, (lack of) perseverance, and sensation seeking.
The second-order factor model allows decomposing the test score into convenient parts (see Figure 1 for an exemplary drawing of a second-order factor model of the impulsivity example). When we assume data Y with dimensions n × k (observations × items) to be multivariate normal with mean zero for all i = 1, ..., n, the second-order model with two levels and a single common factor at the second level is defined as:

y_i = Λη_i + ε_i,
η_i = βξ_i + δ_i.

Vector y_i (k × 1) contains the scores of person i on the k items. The loading matrix Λ (k × q) contains the primary loadings of the manifest variables on the group factors; q denotes the number of group factors, each common to some but not all items. In the primary loading matrix Λ, we fix the loadings of manifest variables on "outgroup" factors (factors they are not associated with) to 0. The combined residuals ε_i (k × 1) have a diagonal covariance matrix Ψ_ε (k × k) containing the k residual variances ψ_ε (k × 1). The vector η_i (q × 1) contains the group factor scores of each person. On the second level, the factor scores η_i (q × 1) are explained by a loading vector β (q × 1), one exogenous common factor ξ_i (1 × 1), and the error vector δ_i (q × 1). Combined, the δ_i have a diagonal covariance matrix Ψ_δ (q × q) containing the residual variances ψ_δ (q × 1) of the group factors.
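For readers who want to see this structure in code, a minimal sketch in lavaan syntax might look as follows; the item names i1-i20 and factor labels are hypothetical placeholders for a 20-item, five-subscale instrument like the short UPPS-P, not the actual variable names.

```r
# Sketch: second-order (higher-order) CFA in lavaan syntax. Item and
# factor names are illustrative placeholders.
library(lavaan)

model <- '
  # first level: five group factors, four items each
  negUrg   =~ i1  + i2  + i3  + i4
  posUrg   =~ i5  + i6  + i7  + i8
  premed   =~ i9  + i10 + i11 + i12
  persev   =~ i13 + i14 + i15 + i16
  sensSeek =~ i17 + i18 + i19 + i20
  # second level: one general factor explaining the group factors
  general  =~ negUrg + posUrg + premed + persev + sensSeek
'
# fit <- cfa(model, data = mydata)  # mydata: a data frame with columns i1-i20
```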
To compute coefficients ω_t and ω_h, the loadings of the manifest variables on the general factor are required, which are only indirectly obtained in the second-order model. It is thus useful to transform the second-order model into a bi-factor structure instead (see Figure S1 in the supplementary materials for an exemplary drawing of a bi-factor model for the impulsivity scale). The bi-factor model is a more general factor model that relates group factors to items similar to the second-order model, but relates the general factor directly to the manifest variables and not to the group factors. The second-order factor model is nested in the bi-factor model, and the models are equivalent when the proportionality condition is met, that is, when the ratios of the variance explained by the general factor to the variance explained by the group factor are identical for all indicators of a group factor (Bader & Moshagen, 2022; Mansolf & Reise, 2017; Mulaik & Quartetti, 1997).
Through a Schmid-Leiman transformation, the loadings of a second-order factor model can be transformed to yield the loadings of a corresponding (and thus equivalent) bi-factor model (Schmid & Leiman, 1957). The transformation works as follows: the loadings of the manifest variables on the general factor, γ (k × 1), are obtained by the equation γ = Λβ; the loadings of the items on the group factors in the bi-factor model, P (k × q), are obtained by multiplying Λ with a diagonal matrix containing the square roots of the latent residual variances, P = Λ·diag(√ψ_δ); the residual variances of the manifest variables in the bi-factor model equal the residual variances in the second-order model, ψ_ε. Now, one may compute

ω_t = [(Σ_i γ_i)² + Σ_j (Σ_i P_ij)²] / [(Σ_i γ_i)² + Σ_j (Σ_i P_ij)² + Σ_i ψ_ε,i]   (4)

and

ω_h = (Σ_i γ_i)² / [(Σ_i γ_i)² + Σ_j (Σ_i P_ij)² + Σ_i ψ_ε,i].   (5)

In the work of Zinbarg et al. (2005), the denominator in (4) and (5) equals the variance of the test score, σ²_X. We assume that the sum of all the variance parts in the denominator of Equations 4 and 5 equals the total variance of the test score when the model fits in terms of the likelihood ratio test, which represents a strictly model-based definition. When the model fit is poor, one should refrain from using the coefficients at all (McDonald, 1999).
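As a concrete illustration, the following R sketch implements the Schmid-Leiman transformation and Equations 4 and 5 under the assumption of a (by default) standardized general factor; the function and argument names are our own illustrations, not part of any package.

```r
# Sketch: Schmid-Leiman transformation and omega coefficients from
# second-order model parameters. Lambda: k x q primary loading matrix;
# beta: q second-order loadings; psi_e: k manifest residual variances;
# psi_d: q latent residual variances; phi: general factor variance.
omegas_from_second_order <- function(Lambda, beta, psi_e, psi_d, phi = 1) {
  gamma <- Lambda %*% beta * sqrt(phi)   # general-factor loadings, gamma = Lambda beta
  P     <- Lambda %*% diag(sqrt(psi_d))  # group-factor loadings of the bi-factor model
  gen   <- sum(gamma)^2                  # composite variance due to the general factor
  grp   <- sum(colSums(P)^2)             # composite variance due to the group factors
  err   <- sum(psi_e)                    # residual variance
  c(omega_t = (gen + grp) / (gen + grp + err),  # Equation 4
    omega_h = gen / (gen + grp + err))          # Equation 5
}
```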
The choice of whether coefficient ω_t or coefficient ω_h is the proper reliability coefficient depends on whether one considers the variance due to the group factors as part of the true score variance or the error score variance. Following the reasoning of Zinbarg et al. (2005) and assuming that variance due to group factors is systematic variance and thus true score variance, one arrives at coefficient ω_t as an estimate of the total reliability. If one is only interested in the general factor variance as the true score variance, one considers ω_h as a measure of the general factor reliability. Coefficient ω_h then also informs about the extent to which the test scores can be explained by a common factor in the presence of group factors, that is, the general factor saturation (e.g., Rodriguez et al., 2016). Zinbarg et al. (2005) summarize some general properties of ω_t and ω_h: Coefficient ω_t equals the true total reliability of the composite test score when no item-specific factors are present in the data; when item-specific factors are present, the coefficient is an underestimate of the reliability. Coefficient ω_t equals ω_h when the data fit a unidimensional model; when the data fit a multidimensional model, ω_h will be smaller than ω_t. Reanalyzing previous studies on the performance of various reliability coefficients, Cho (2022) found coefficient ω_t (named ρ_SOF) based on a second-order factor model to perform well with multidimensional data.

The Bayesian Approach to Coefficients ω_t and ω_h
The goal of reliability estimation in the Bayesian framework is to obtain the posterior distributions of coefficients ω_t and ω_h. As a probability distribution, the posterior represents the relative probabilities of all possible parameter values after the data have been observed. One obtains the posterior distribution by updating the prior distribution of the parameter with the knowledge gained from the observed data. The prior distribution represents the probabilities of different ranges of parameter values before the data have been observed. Thus, the posterior distribution of, for example, coefficient ω_t contains the probabilities of all possible ranges of values of ω_t and encapsulates the precision with which ω_t is estimated. The prior distribution plays an important role, as it quantifies the prior confidence in different values of a parameter and thereby affects the resulting estimates. If the evidence in the data is not very convincing, for example, because there is a lot of noise or because the sample size is relatively small, then the prior distribution has more influence on the posterior distribution. For instance, in a Bayesian reliability analysis, we could specify prior distributions that imply a prior for ω_t that resembles a normal distribution with a peak around .5. We might draw a data sample of n = 50 yielding a frequentist ω_t estimate of .85. The Bayesian point estimate for ω_t would likely be somewhere between .5 and .8, as the evidence in the data is too weak to cause a major shift of the prior confidence away from .5. The associated credible interval would be considerably wide, reflecting the large uncertainty in the estimation with such a small sample. We do not regard such a scenario as a disadvantage of Bayesian parameter estimation; rather, it encourages researchers to (1) focus on the width of the posterior distribution and the corresponding uncertainty interval instead of how close the point estimate is to a cutoff value, (2) contemplate the confidence they have in different reliability values before seeing the data, and (3) collect more data to improve the precision of the reliability estimation.
The prior distributions of coefficients ω_t and ω_h are not directly specified, as the coefficients are functions of loadings and residual variances. Instead, one specifies the prior distributions of the second-order factor model parameters to control the priors of ω_t and ω_h. The factor model prior distributions are then updated with the data by Markov chain Monte Carlo sampling (MCMC; e.g., Gilks et al., 1995), more specifically Gibbs sampling (Geman & Geman, 1984). One draws samples from the conditional posterior distributions of the factor model parameters that are needed to compute coefficients ω_t and ω_h. In this procedure, we follow Lee (2007, Chapter 4), who provides the prior and conditional posterior distributions for a Gibbs sampler of a general structural equation model. At the level of the measurement model, we adopted the standard distributions for Bayesian SEM (Lee, 2007; Muthén & Asparouhov, 2012); at the level of the structural model, we simplified the equations, since there is only a single second-order factor predicting the primary factors instead of multiple, possibly hierarchically structured latent variables. Specifically, the loadings of the primary group factors on the general factor reduce to a vector instead of the matrix representation of general SEM (see Lee, 2007); the factor scores of the second-order factor appear in vector form instead of a matrix; and the covariance matrix of the second-order factor scores simplifies to a single variance value.
We draw starting values for the Gibbs sampler from the prior distributions of the second-order model parameters. Following standard Bayesian SEM procedures (Lee, 2007; Muthén & Asparouhov, 2012), we adopt prior distributions for ψ_ε, Λ, ψ_δ, β, Φ, η, and ξ. Let n be the sample size, k the number of manifest variables, and q the number of group factors. The prior distribution of the residual variances of the manifest variables is an inverse gamma distribution with shape and scale parameters, p(ψ_ε) = IG(a₀, b₀). The prior distribution of the loadings of the manifest variables on the group factors, the non-zero elements in the loading matrix Λ, is a normal distribution with mean λ₀ and variance ψ_ε multiplied by A₀, p(Λ | ψ_ε) = N(λ₀, A₀ψ_ε). The prior distribution of the latent residual variances is an inverse gamma, p(ψ_δ) = IG(c₀, d₀), and the prior of the loadings of the group factors on the general factor is a normal distribution, p(β | ψ_δ) = N(β₀, B₀ψ_δ). We combine the scores of the primary and second-order factors, η and ξ, in Ω since they have a joint prior distribution, p(Ω) = N(0, Φ). The diagonal covariance matrix of the combined factor scores, Φ, is composed of the general factor variance φ and the residual variances ψ_δ of the group factors. The prior distribution of the general factor variance is an inverse gamma, p(φ) = IG(p₀/2, R₀/2).³ The prior distributions of the factor model parameters are conjugate, which means that the conditional posterior distributions are in the same family of distributions as the prior distributions. We insert the drawn starting values from the prior distributions into the conditional posterior distributions of the factor model parameters (cf. Lee, 2007), with A = (A₀⁻¹ + Ω₁′Ω₁)⁻¹, where Ω₁ is the subset of factor scores due to the group factors; a = A(Ω₁′Y); b = diag(b₀ + ½(Y′Y − a′A⁻¹a)); and C = (B₀⁻¹ + Ω₂′Ω₂)⁻¹, where Ω₂ is the subset of factor scores due to the general factor. The covariance matrix of the combined factor scores is given by (I − B)⁻¹Ψ_δ⁺(I − B′)⁻¹, where I is the identity matrix; B is a (q + 1)-dimensional square matrix with only zeros except for the first column, which contains (0, β) as entries; and Ψ_δ⁺ is a (q + 1)-dimensional diagonal matrix with the general factor variance, φ, set to 1, and the group factor residual variances, ψ_δ, as entries. Finally, R = Λ₀′Ψ_ε⁻¹Λ₀, where Λ₀ equals Λ with an added column of zeros at the beginning. We normalize the factors by dividing the factor scores in Ω by their standard deviation in every iteration of the Gibbs sampler to ensure identification.
In the first iteration of the Gibbs sampler, we use the drawn starting values from the prior distributions of the factor model parameters to successively draw samples from the conditional posterior distributions of the various factor model parameters (see Equations 6-11). Once we have obtained samples from each of the conditional posterior distributions in the first iteration, we plug the just-drawn values into the second iteration of the sampler. We repeat this process until sufficiently many values of each factor model parameter have been drawn, for example, 10,000. We restart the Gibbs sampler with different starting values to assess convergence to the stationary distribution (the posterior). The limiting distribution of samples obtained from an adequately constructed Gibbs sampler is the target distribution (i.e., the posterior distribution). The samples sufficiently approximate the target distribution if the Markov chains have converged. Convergence is assessed by checking traceplots and computing the R̂-statistic, which compares the within-chain variance to the between-chain variance (Gelman et al., 2014, Chapter 11.5).
In Gibbs sampling, there can be considerable autocorrelation within a posterior sample of a factor model parameter. Autocorrelation can be reduced by thinning the posterior samples. For instance, a thinning interval of 4 means that only every 4th value of the posterior sample in one chain is used. We can check the level of autocorrelation by computing the effective sample size (ESS). The ESS indicates the number of independent posterior samples that contain as much information as the complete set of autocorrelated posterior samples (Geyer, 2011; Stan Development Team, 2022).
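In practice, these diagnostics are available from standard MCMC tooling. The following R sketch uses the coda package; `chains` is an assumed placeholder for a list of numeric vectors of posterior ω_t draws, one per chain.

```r
# Sketch: convergence and autocorrelation diagnostics with coda.
library(coda)

omega_chains <- mcmc.list(lapply(chains, mcmc))
gelman.diag(omega_chains)        # R-hat: values close to 1 indicate convergence
effectiveSize(omega_chains)      # ESS: number of effectively independent draws
window(omega_chains, thin = 4)   # keep every 4th draw to reduce autocorrelation
```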
Once the MCMC samples have converged to the posterior distributions of the loadings Λ and β and the residual variances ψ_ε and ψ_δ, we use the Schmid-Leiman transformation to obtain posterior samples of the parameters γ and P. From these, we obtain posterior samples of coefficients ω_t and ω_h using Equations 4 and 5.

Uncertainty Estimation
When one estimates a parameter in a statistical analysis, it is good practice to account for the uncertainty of that parameter in the form of a standard error and/or an uncertainty interval. An interval adds an area of uncertainty to a point estimate to account for sampling error and informs about the underlying population parameter. Although one estimates the reliability of a test to account for measurement error, the reliability estimate, for example, coefficient ω_t, is affected by sampling error itself and thus necessitates an uncertainty estimate.

Confidence Intervals for ω_t and ω_h
The most popular candidate for an uncertainty interval is the frequentist confidence interval, or confidence interval for short. Generating a confidence interval follows a general procedure: obtain a measure of error from the sampling distribution of a parameter and add that error to the point estimate of the parameter to create an interval. Contrary to the common misconception that one can be 95% confident that a given 95% confidence interval contains the true parameter value, that is, ω_t (e.g., Cumming, 2014), a 95% confidence interval does, in fact, cover the underlying population parameter in 95% of the cases when one repeats the process of sampling and computing the 95% confidence interval for ω_t numerous times (Morey et al., 2016; Neyman, 1937).
A complication concerning the uncertainty estimation of ω_t and ω_h is that obtaining their limiting distribution is infeasible, so approximations are required. In the present study, we examine six different confidence intervals: five bootstrap confidence intervals and one Wald-type confidence interval. The procedure to obtain a bootstrap confidence interval for coefficients ω_t and ω_h is straightforward. One resamples the data of size n, for example, 1,000 times, each yielding n observations, and computes the coefficients ω_t and ω_h for each of the 1,000 resampled data sets using an FA method, in this study a confirmatory factor analysis (CFA). Thus, one obtains an empirical sampling distribution for each reliability coefficient (Efron, 1979). The empirical sampling distribution yields a measure of uncertainty, for example, a standard error, to construct an interval around the point estimate of ω_t or ω_h.
We consider five intervals that are constructed this way. First, for the standard error interval (SE), one constructs an interval around the point estimate of ω_t (or ω_h) by adding the z-value from a standard normal distribution multiplied by the standard error of ω_t (or ω_h). The standard error of ω_t (or ω_h) equals the standard deviation of the empirical sampling distribution. Second, one obtains the standard error bias-corrected interval (SE_bias) in the same way as the SE interval, except that one incorporates the difference between the mean of the empirical sampling distribution and the point estimate into the formation of the interval to correct for possible bias in the empirical sampling distribution (for details, see Davison & Hinkley, 1997, Chapter 5: normal approximation).⁴ Third, to obtain the standard error logistic-transformed interval (SE_ln), one generates an interval by applying the natural logistic transformation to the point estimate of ω_t (or ω_h) as well as to the standard error as obtained for the SE interval. The interval is then transformed back to the original scale to form the SE_ln confidence interval (for details, see Kelley & Pornprasertmanit, 2016). Unlike the SE and SE_bias intervals, the SE_ln interval is not symmetric. Fourth, for the percentile bootstrap interval (Perc), one computes the lower and upper quantiles of the empirical sampling distribution of ω_t (or ω_h). Fifth, one obtains the bias-corrected and accelerated bootstrap interval (BCA) by accounting for the bias and skewness of the empirical sampling distribution using the jackknife resampling technique (see DiCiccio & Efron, 1996). We chose these specific intervals because of their popularity in common R-packages, boot (Canty & Ripley, 2021) and lavaan (Rosseel, 2012), and their promising performance in Kelley and Pornprasertmanit (2016) and Padilla and Divers (2016).
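To make the bootstrap procedure concrete, the following hedged R sketch combines lavaan and boot, reusing the `model` string and the `omegas_from_second_order()` helper from the earlier sketches; `mydata` is a placeholder, and the indexing assumes lavaan orders the latent variables as they appear in the syntax, with the general factor last.

```r
# Sketch: bootstrap confidence intervals for omega_t via lavaan + boot.
library(lavaan)
library(boot)

boot_omegas <- function(data, idx) {
  fit <- cfa(model, data = data[idx, ])
  est <- lavInspect(fit, "est")   # model matrices: lambda, beta, theta, psi
  q <- 5; g <- q + 1              # five group factors; general factor last
  omegas_from_second_order(
    Lambda = est$lambda[, 1:q],
    beta   = est$beta[1:q, g],
    psi_e  = diag(est$theta),
    psi_d  = diag(est$psi)[1:q],
    phi    = est$psi[g, g]
  )
}

set.seed(1)
out <- boot(mydata, boot_omegas, R = 2000)
# "norm" is boot's bias-corrected normal approximation (an SE_bias-type
# interval); "perc" and "bca" give the Perc and BCA intervals for omega_t.
boot.ci(out, type = c("norm", "perc", "bca"), index = 1)
```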
For the Wald-type interval (Wald), one fits the second-order factor model in a CFA to multidimensional data. In that process, one can define coefficients ω_t and ω_h as parameters to be estimated in the maximum likelihood estimation and thereby obtain standard errors for the coefficients. These standard errors can then be used to construct Wald-type confidence intervals.
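A hedged sketch of this approach uses lavaan's defined parameters (`:=`), which receive delta-method standard errors and Wald confidence limits; it is shown below for a hypothetical 9-item scale with three group factors. With std.lv = TRUE, the latent (residual) variances are fixed to 1, so the Schmid-Leiman group loadings reduce to the primary loadings.

```r
# Sketch: Wald-type intervals for omega_t and omega_h via lavaan defined
# parameters. Item names y1-y9 are placeholders. If your lavaan version
# does not allow using earlier defined parameters inside later :=
# definitions, expand gen/grp/err inline.
library(lavaan)

model_wald <- '
  f1 =~ l1*y1 + l2*y2 + l3*y3
  f2 =~ l4*y4 + l5*y5 + l6*y6
  f3 =~ l7*y7 + l8*y8 + l9*y9
  g  =~ b1*f1 + b2*f2 + b3*f3
  y1 ~~ e1*y1; y2 ~~ e2*y2; y3 ~~ e3*y3
  y4 ~~ e4*y4; y5 ~~ e5*y5; y6 ~~ e6*y6
  y7 ~~ e7*y7; y8 ~~ e8*y8; y9 ~~ e9*y9
  # variance components of the composite implied by the second-order model
  gen := ((l1+l2+l3)*b1 + (l4+l5+l6)*b2 + (l7+l8+l9)*b3)^2
  grp := (l1+l2+l3)^2 + (l4+l5+l6)^2 + (l7+l8+l9)^2
  err := e1+e2+e3+e4+e5+e6+e7+e8+e9
  omega_t := (gen + grp) / (gen + grp + err)
  omega_h := gen / (gen + grp + err)
'
fit_wald <- cfa(model_wald, data = mydata, std.lv = TRUE)
parameterEstimates(fit_wald)  # omega_t / omega_h rows include se and Wald 95% CI
```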
In simulation studies for coefficient ω_u, the ω-version for unidimensional data, the SE, SE_ln, and BCA confidence intervals displayed good coverage performance; the Perc interval performed slightly worse but still satisfactorily (Kelley & Pornprasertmanit, 2016; Padilla & Divers, 2016). Kelley and Pornprasertmanit (2016) also investigated the performance of the SE, SE_ln, Perc, and BCA intervals for coefficient ω_h and found them to perform well across various conditions. Note that Kelley and Pornprasertmanit computed ω_h by fitting a unidimensional factor model to obtain single-factor loadings and dividing the squared sum of those loadings by the test score variance, σ²_X. However, with multidimensional data, coefficient ω_h from Kelley and Pornprasertmanit (2016) will not be equivalent to coefficient ω_h as defined in this study or in Zinbarg et al. (2005). To date, we are unaware of any simulation studies that examined confidence intervals for coefficients ω_t and ω_h as defined here.

Credible Intervals for ω_t and ω_h
The credible interval for either ω_t or ω_h is obtained from their respective posterior samples. We compute the X% credible intervals as the highest posterior density region (Box & Tiao, 1973) that contains X% of the posterior mass of ω_t or ω_h. The resulting intervals contain ω_t or ω_h with X% probability. To date, we are unaware of any research on credible intervals for ω_t and ω_h.
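Given posterior draws, such an interval is readily computed; the sketch below uses the coda package, with `omega_t_draws` as an assumed numeric vector of MCMC samples of ω_t.

```r
# Sketch: 95% highest-posterior-density credible interval for omega_t.
library(coda)

HPDinterval(mcmc(omega_t_draws), prob = 0.95)  # lower and upper HPD bounds
mean(omega_t_draws)  # posterior mean as the Bayesian point estimate
```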
To summarize, coefficients ω_t and ω_h are representatives of the FA approach to reliability and define the true score variance of a composite test score in different ways. Coefficient ω_t includes the true score variance that is due to one general and multiple group factors; coefficient ω_h only includes the true score variance that is due to the general factor. Although one may estimate both coefficients from familiar structural equation models, estimates of their uncertainty have been scarcely researched, let alone reported. Thus, we examine the statistical performance of six confidence intervals as well as credible intervals for ω_t and ω_h in a simulation study. The knowledge gained from the simulation study shall provide researchers with the necessary tools to report uncertainty measures in reliability analysis. We provide the Bayesian routines, as well as the Wald-type confidence interval, in the R-package Bayesrel.

Simulation Study
To compare several confidence intervals and credible intervals for coefficients ω_t and ω_h, we conducted a simulation study with multiple conditions. In the study, we relate the coefficients to their population values. Although we did not control population reliability per se, we note that the population value of ω_t equals the population reliability when one adopts the FA approach to reliability and excludes item-specific factors.

Priors
The choice of the prior hyperparameters is important in Bayesian structural equation modeling. The hyperparameters define the shape of the prior distributions of the factor model parameters. The default prior hyperparameters of software solutions for Bayesian SEM have in common that they include very large variances: in Mplus, the prior for the factor loadings is N(0, 10¹⁰), and the prior for the residual variances is IG(−1, 0) (Asparouhov & Muthén, 2021); in blavaan, an R-package for Bayesian SEM, the prior loadings are normally distributed with N(0, 10), and the prior residual variances follow an inverse gamma distribution, IG(1, 0.5) (Merkle & Rosseel, 2018). Both the Mplus and blavaan priors yield very large variances, which implies highly uninformative priors. For our Gibbs sampler, prior distributions with such large variances are inefficient. For the simulation study, we aimed to define the hyperparameters so that the prior distributions of the factor model parameters are relatively uninformative. Relatively uninformative priors compromise between efficient sampling and exploring the parameter space in its entirety. In particular, the priors for the factor model parameters include a variance that is large enough that all plausible values of the parameters are likely to be sampled, but small enough that reaching the target distribution does not take days. In the following, we describe the reasoning for the hyperparameters of the relatively uninformative setting, which are the basis of the main simulation results.
For a relatively uninformative prior on the residual variances, Lee (2007) recommends an inverse gamma distribution with shape and scale parameters as IG(6, 10), resulting in prior residual variances with a mean of 2 and a variance of 1. For the residual variances of the manifest and latent variables, we chose prior hyperparameters a₀ = 2, b₀ = 1, c₀ = 2, and d₀ = 1, so that the prior residual variances have a slightly smaller mean but much larger variances than Lee's recommendation.⁵ For the prior loadings, we defined hyperparameters λ₀ = 0, A₀ = 1, and β₀ = 0, B₀ = 2.5, resulting in loadings, both manifest and latent, with a mean of 0 and a variance determined by the residual variances multiplied by a scalar, A₀ and B₀, respectively. We chose that multiplier to be higher for the latent loadings, since the uncertainty at higher levels of the model is larger. For the hyperparameters of the general factor variance, we chose p₀ = q² − q and R₀ = k. Thus, with simulation conditions of q = 3, k = 9 and q = 5, k = 30, the prior hyperparameters for the general factor variance yield an inverse gamma prior with mean 2.25 and variance 5 1/16 for three group factors, and an inverse gamma with mean 1 2/3 and variance 25/72 for five group factors. This way, we ensure that the prior variance of the general factor scales down when the number of group factors grows. When the number of group factors is small, we assume that the posterior variance of the general factor is smaller than with more group factors.
Since the choice of prior can affect the parameter estimation (Smid & Winter, 2020), we regard transparently defining relatively uninformative priors as a constructive way to introduce a new method. Whether or not the relatively uninformative prior specification holds up in practice is tested in the simulation study. To explore different prior settings, the simulation study also includes two additional sets of prior hyperparameters: a highly uninformative set and a highly informative set. For the highly uninformative set, we defined the hyperparameters of the inverse gamma priors so as to increase the mean and variance of the prior residual variances (all other hyperparameters stay the same as above).⁶ With the highly informative set of hyperparameters, we reduced the variance of the priors. We use the population values of the manifest and latent loadings from the simulation study (see below) for the prior loading means λ₀ and β₀, and define a₀ = c₀ = 9, b₀ = d₀ = 4 (all other hyperparameters stay the same as above).⁷ It is important to keep in mind that the informativeness of the prior distributions in the simulation study only concerns the factor model parameters. In particular, uninformative prior loadings and residual variances, both manifest and latent, do not necessarily imply uninformative priors on the coefficients ω_t and ω_h, for two reasons: (a) the prior distributions of ω_t and ω_h are functions of the prior loadings and residual variances of a bi-factor model, which are in turn functions of the priors of the second-order model parameters as specified above; (b) the second-order factor loadings are linked to the magnitude of the prior residual variances, that is, larger prior residual variances increase the variance of the prior factor loadings. Researchers should be aware of this intricate link between the factor model parameters and the ω-coefficients when exploring different prior specifications in the future. Any prior adjustments to the parameters of the second-order model can be performed using the R-package Bayesrel.

Method
We performed all analyses in R (v4; R Core Team, 2021) and provide the R-code for the simulation in an online repository at https://osf.io/xyq4n/. We constructed data-generating covariance matrices based on the second-order factor model using the equation

Σ = Λ₀(I − B)⁻¹Ψ_δ⁺(I − B′)⁻¹Λ₀′ + Ψ_ε,

with Λ₀, B, and Ψ_δ⁺ as defined above. We assumed a simple structure for the group factors, that is, no cross-loadings between a group factor of one set of items and an item from another set, and an equal number of items loading on each group factor. The population values of the coefficients were computed from the parameters that constructed the data-generating covariance matrices.
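For illustration, a small R sketch of this data-generating step might look as follows; the dimensions and parameter values are illustrative and not the values used in the study.

```r
# Sketch: population covariance matrix from second-order parameters and
# simulated multivariate normal data (three group factors, three items each).
k <- 9; q <- 3
Lambda <- matrix(0, k, q)
Lambda[1:3, 1] <- Lambda[4:6, 2] <- Lambda[7:9, 3] <- 0.6  # primary loadings
beta  <- rep(0.5, q)    # second-order loadings
psi_d <- rep(0.4, q)    # latent residual variances
psi_e <- rep(0.5, k)    # manifest residual variances

Lambda0 <- cbind(0, Lambda)                     # Lambda with a leading zero column
B <- matrix(0, q + 1, q + 1); B[-1, 1] <- beta  # first column contains (0, beta)
Psi_d_plus <- diag(c(1, psi_d))                 # general factor variance fixed to 1
IB <- solve(diag(q + 1) - B)
Sigma <- Lambda0 %*% IB %*% Psi_d_plus %*% t(IB) %*% t(Lambda0) + diag(psi_e)

dat <- MASS::mvrnorm(n = 100, mu = rep(0, k), Sigma = Sigma)
```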
We manipulated model size (9 items with three group factors and 30 items with five group factors) and reliability (low and high). To obtain a low (ω_t = .6, ω_h = .4) and a high (ω_t = .9, ω_h = .7) reliability condition, we drew standardized general factor loadings and group factor loadings from uniform distributions until we reached the desired ω_t and ω_h values.⁸ We generated random multivariate normal data sets centered on 0 with the data-generating covariance matrices in different sample sizes (n = 100, 500, 2000). This setup resulted in a total of 12 conditions. Within the literature investigating the ω-coefficients in simulation studies, researchers use various study designs and factor compositions (cf. Cho, 2022). We assume our conditions imitate a variety of real-world scenarios. Within the substantive literature using CFA, the model sizes we chose reflect typical applications (DiStefano & Hess, 2005; Jackson et al., 2009).
For each condition, we generated 1,000 data sets that yielded properly converging models. In each replication, we set the Gibbs sampler to three chains with 10,000 iterations each (burn-in of 2,000) and a thinning interval of 4.⁹ For the bootstrapped intervals, we performed 2,000 bootstrap replications in each run and based the confidence intervals on the converged solutions only. For the MCMC samples, we ensured convergence by computing the R̂-statistic (Gelman et al., 2014, Chapter 11.5) for the posterior samples of the coefficients. We assumed that an R̂-value close to 1 and no larger than 1.1 indicated convergence (Gelman et al., 2014). When an MCMC run yielded a posterior sample with an R̂-value larger than 1.1, we restarted the sampling until the value was smaller than 1.1. We report the R̂ values as well as the ESS in the results section.
For the frequentist coefficients, we used the R-package lavaan (Rosseel, 2012) to fit the second-order factor model in a CFA using maximum likelihood. This way, we obtained point estimates of ω_t and ω_h and the Wald confidence interval, and, using the boot-package (Canty & Ripley, 2021), we obtained bootstrap samples of coefficients ω_t and ω_h to compute the remaining confidence intervals. The credible intervals are highest posterior density (HPD) intervals. The point estimates of the Bayesian coefficients are the means of their posterior distributions. We added the R-code to obtain Bayesian versions of ω_t and ω_h to the R-package Bayesrel. The package also allows estimating frequentist point estimates and Wald intervals for ω_t and ω_h.
To evaluate the point estimates, we computed the root mean square error (RMSE), which can be interpreted as the standard deviation of the point estimates around the population value across simulation runs.¹⁰ Since the RMSE reflects not only the bias of a coefficient but also its sampling variability, we expect a well-performing method to have values close to, but not equal to, 0. The coverage was determined as the percentage of intervals across simulation runs that included the population value of the coefficient. At a confidence level of 95%, we expected a well-performing method to cover the population value in 95% of the cases.

Results
Table 1 shows the RMSE results for the frequentist CFA and Bayesian (posterior mean) point estimates of ω_t and ω_h. In most conditions, the frequentist and Bayesian coefficients performed equally well, with the Bayesian methods displaying slightly higher RMSE values throughout most conditions. When the sample size was reasonably large, both methods achieved equally good point estimation performance. For both methods, point estimation was poorest when the reliability was low and the sample size small, which was also the condition in which the largest differences between the frameworks emerged: the Bayesian point estimation performed slightly worse than the frequentist one, an indication of the influence of the prior distribution when the evidence in the data is weak.

The coverage results of the confidence and credible intervals for ω_t and ω_h are in Tables 2 and 3, respectively. As expected, the coverage of all intervals improved with increasing sample size. When the reliability was high, all confidence intervals showed good coverage performance for both coefficients. When the reliability was low, none of the intervals except the BCA interval reached satisfactory coverage for both coefficients. The BCA interval showed the best coverage across all conditions. However, with 9 items, low reliability, and a sample size of 2000, the BCA interval could be estimated in only 185 out of 1,000 simulation runs, as the adjustment constant for the skewed bootstrap sampling distribution could not be estimated. The SE, SE_bias, and SE_ln intervals displayed similarly satisfactory coverage performance for both coefficients across almost all conditions. However, when reliability was low and the sample size small, the SE and SE_bias intervals for coefficient ω_t did not yield bounds within the 0 to 1 range in any of the simulation runs, which is visible in the 100% coverage rates (see Table 2). The issue also occurred for the medium sample size with 30 items and low reliability, where fewer than half of the simulation runs resulted in proper interval bounds for ω_t. The issue persisted for ω_h with 9 items when reliability was low and the sample size small: only about a fifth of the interval bounds fell inside the 0 to 1 range. Similar to the SE and SE_bias intervals, the SE_ln interval displayed excessive coverage in the same conditions in which the SE and SE_bias intervals failed. In these conditions, the SE_ln interval yielded interval bounds very close to 0 and 1, since it employs the same bootstrap-estimated standard error as the SE and SE_bias intervals, but its logistic transformation prevents it from leaving the 0 to 1 range. The Perc and Wald intervals for coefficient ω_t performed very similarly, displaying insufficient coverage only when sample size and reliability were low. In precisely those conditions, the coverage for ω_h of both intervals was unsatisfactory, but in opposite directions: the Perc interval covered the population value of ω_h too frequently, whereas the Wald interval covered it too rarely.

Notes to Tables 1-3: The ω_t and ω_h columns contain the population values. Except for the Wald interval, all confidence intervals are bootstrapped with a CFA; the Wald interval is based on a single CFA. The † denotes conditions in which both upper and lower interval bounds fell outside the 0-1 range in more than half of the 1,000 simulation runs. The ‡ denotes conditions in which the BCA interval estimation failed in more than half of the 1,000 simulation runs. The ρ column denotes the reliability level and represents the population values of the coefficients, low: ω_t = .6, ω_h = .4; high: ω_t = .9, ω_h = .7. The subscripts in the CFA columns denote the number of simulation runs in which the point estimates fell outside the 0-1 range; no subscript means all estimates were between 0 and 1.

¹⁰ The RMSE is defined as √((1/T) Σ_{t=1}^{T} (θ̂_t − θ)²), where T denotes the number of simulation runs (1,000), θ is the population value of the coefficient, and θ̂_t are the point estimates of the coefficient across the T simulation runs; it equals the standard deviation of the point estimates around the population value.
The Bayesian intervals performed well, except when reliability was low and the sample size small: with 9 and 30 items, the credible intervals could not reach satisfactory coverage performance, similar to most of the confidence intervals. Again, this could be due to the influence of the prior distribution when the data are not influential enough to shift confidence away from the prior. The convergence diagnostics for the Gibbs sampler yielded satisfactory values (see supplemental Table S1). The mean R̂ values across the simulation conditions ranged from 1 to 1.003 for ω_t and from 1.001 to 1.003 for ω_h. The mean ESS for ω_t ranged from 2304 (low reliability, k = 30, n = 100) to 5965 (high reliability, k = 9, n = 2000), and for ω_h from 1729 (low reliability, k = 30, n = 100) to 5507 (high reliability, k = 30, n = 2000).
Since the prior distributions affect the performance of the coefficients, at least when the sample size is small and the reliability low, we also ran the simulation study with a set of highly informative and a set of highly uninformative priors. The detailed results for these conditions are in the supplementary materials in Tables S2 and S3, respectively. With highly informative priors, the point estimation of both reliability coefficients improves, as does the coverage of the credible intervals for ω_h. The coverage of the intervals for ω_t does not change meaningfully, except when the sample size is small, in which case the coverage is slightly worse than with the original set of priors. With highly uninformative prior hyperparameters, the performance of the reliability coefficients is substantially worse than with the original set. The coefficients reach a satisfactory coverage rate only when the sample size approaches 2000.

Data Example
To illustrate a (Bayesian) reliability analysis with ω_t and ω_h, we use the UPPS-P impulsivity data set gathered by Lozano et al. (2018), containing 455 observations, of which 443 are complete. The data set is included in the R-package Bayesrel under the name "upps".¹¹ Interested readers can find the R-code for the calculations in the supplementary materials.

Frequentist analysis
First, we compute the reliability values of the various subscales of the impulsivity scale. By means of coefficient α (frequentist) we obtain: negative urgency α̂ = .665, lack of perseverance α̂ = .694, lack of premeditation α̂ = .748, sensation seeking α̂ = .754, positive urgency α̂ = .796. Further, we wish to obtain the reliability of the full scale; initially, we estimate α̂ = .809. Since we are aware of the tendency of coefficient α to underestimate the reliability when data are multidimensional (e.g., Dunn et al., 2014), we proceed and compute coefficients ω_t and ω_h. We obtain ω̂_t = .865 and ω̂_h = .644 by means of a CFA with full-information maximum likelihood, and Wald-type confidence intervals ω_t 95% CI [.847, .883] and ω_h 95% CI [.594, .693]. Because of the frequentist nature of the confidence interval, we are unable to interpret the bounds of the specific interval at hand. For the ω_t-interval of [.847, .883], we cannot conclude that the true ω_t, and thus the reliability, lies within the interval bounds with 95% probability. We can conclude that if we were to repeat the sampling procedure and the estimation of the intervals numerous times, the Wald intervals would cover the population value of ω_t in 95% of the cases. When trying to make inferences about exceeding a cutoff value, a frequentist might transform the confidence interval into a hypothesis test. In particular, the null hypothesis of the reliability being, for example, .80 is rejected at a significance level of 5%, since the 95% confidence interval does not contain the value of .80. To sum up, we learn that the reliability of the scale is higher than indicated by coefficient α, and we also notice that there is uncertainty around the point estimates.

Bayesian Analysis
For the Bayesian analysis, we specify the same prior hyperparameters as in the simulation study, which are also the default parameters in the Bayesrel-package. The data contain 13 missing values, which we address with a Bayesian imputation scheme: we treat the missing values as parameters to be estimated in the MCMC sampling process (e.g., Schafer, 1999).¹² We estimate the Bayesian coefficients ω_t and ω_h with 10,000 iterations, 3 chains, a burn-in of 2,000, and a thinning interval of 4 (the settings also used in the simulation study). The posterior distributions yield point estimates of ω̂_t = .863 and ω̂_h = .640, and credible intervals ω_t ∈ 95% CI [.844, .88] and ω_h ∈ 95% CI [.589, .688]. We are 95% certain that the true ω_t lies between .844 and .88; similarly, the probability that coefficient ω_h lies between .589 and .688 is 95%. Figure 2 shows the prior and posterior distributions of ω_t and ω_h for the impulsivity data set. The posterior plot offers a simple display of the probability of different values of coefficients ω_t and ω_h. The amount of posterior mass in the figure depicted by the grey area is proportional to the 95% probability of different reliability values.
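In code, the analysis above might look roughly as follows with the Bayesrel package; the function name bomegas and its argument names are taken from the package documentation as we recall it and should be treated as assumptions to check against your installed version.

```r
# Sketch of the Bayesian reliability analysis; verify the bomegas()
# signature against your Bayesrel version before use.
library(Bayesrel)

data("upps", package = "Bayesrel")
res <- bomegas(upps, n.factors = 5, n.iter = 10000, n.burnin = 2000,
               n.chains = 3, thin = 4, missing = "impute")
res  # posterior means and 95% credible intervals for omega_t and omega_h
```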

¹¹ The data set is also available at https://figshare.com/articles/dataset/Concordance_between_the_original_and_short_version_of_the_Impulsive_Behaviour_Scale_UPPS-P_using_an_IRT_model/6013217.
¹² The Bayesian framework allows us to deal with missing data by treating the missing values as parameters that we can estimate conditional on the remaining data and the factor model parameters; hence, we include them in the Gibbs sampling algorithm of the second-order factor model and retrieve posterior samples of the missing values.
The plot also shows that the prior distribution of ω_t is relatively uniform, whereas the prior distribution of ω_h is slightly peaked around ω_h = 0.1. Both prior distributions depend on the same prior parameters of the second-order factor model (see the section "The Bayesian Approach to Coefficients ω_t and ω_h").
Furthermore, we can compute the probability that ω_t and ω_h for the impulsivity data set exceed certain cutoff values, for example, .80 and .60, respectively. Inspecting both the prior and posterior probabilities for the cutoffs, we notice the shift in our confidence caused by observing the data: the confidence that ω_t is larger than .80 before seeing the data is obtained from the prior distribution, p(ω_t^prior > .80) = .221; similarly, p(ω_h^prior > .60) = .202. After taking the data into account, the confidence that ω_t exceeds .80 has shifted and is obtained from the posterior distribution: p(ω_t^post > .80) = 1; similarly, p(ω_h^post > .60) = .934.
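Such exceedance probabilities follow directly from the posterior draws; a minimal sketch, assuming `omega_t_draws` and `omega_h_draws` are numeric vectors of posterior samples:

```r
# Sketch: posterior probability that a coefficient exceeds a cutoff.
mean(omega_t_draws > .80)  # proportion of posterior mass above .80
mean(omega_h_draws > .60)  # proportion of posterior mass above .60
```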

Convergence
To check whether the posterior samples of ω_t and ω_h have converged to their posterior distributions, we can display the MCMC chains in a traceplot (see, e.g., Kruschke, 2015, Chapter 7.5) and assess R̂ and the ESS. The traceplots for ω_t and ω_h display good mixing, that is, they resemble a "hairy caterpillar" (see supplemental Figure S2; cf. Lee & Wagenmakers, 2014). Successful convergence is also indicated by R̂_ωt = 1.001 and R̂_ωh = 1.000, meaning the within-chain variance is close to the total variance, and continuing the sampling would not yield appreciably more information about the target distribution (Gelman et al., 2014, Chapter 11.5). The level of autocorrelation is acceptable, that is, ESS_ωt = 5654 and ESS_ωh = 5193; in other words, the posterior samples for coefficients ω_t and ω_h consist of more than 5,000 independent samples each that contain as much information as the full samples of 6,000. We also conclude that the posterior precision is satisfactory, since a high ESS indicates small posterior uncertainty (Stan Development Team, 2022).

Model Fit
In the previous paragraphs, we have treated coefficients ω_t and ω_h as measures of reliability. However, as noted in the introduction, the coefficients are only appropriate reliability coefficients when the corresponding factor model properly describes the data. For the frequentist CFA model, we obtain a root mean square error of approximation (RMSEA) value of RMSEA = .057, with p(RMSEA < .05) = .039 and RMSEA 90% CI [.051, .064], and a standardized root mean square residual (SRMR) of SRMR = .057, with SRMR 90% CI [.047, .066].¹³ In the Bayesian framework, we check whether the model fits the data in two different ways: (1) in a post-hoc graphical comparison, called a posterior predictive check (PPC), of the posterior model-implied parameters and the data parameters, and (2) with indices that quantify the fit of the factor model similar to the frequentist fit measures. For the PPC, we employ the posterior distributions of the second-order factor model parameters to simulate data, which we then compare to the actual data set. The PPC of the second-order factor model resembles a scree plot in that the eigenvalues of the covariance matrices predicted by the model are plotted against the eigenvalues of the data covariance matrix. Inspecting the PPC plot for the impulsivity example, we assess the fit as mediocre (see supplemental Figure S3).
For the fit indices, we implemented Bayesian versions of the RMSEA and the SRMR in the Bayesrel-package. Similar to the frequentist fit, we obtain BRMSEA = .057, with p(BRMSEA < .05) = 0 and BRMSEA 90% CI [.055, .058], and BSRMR = .070. Given this fit, the coefficients ω_t and ω_h should be interpreted as measures of reliability only cautiously. The structure of the data should always be assessed when conducting a reliability analysis that makes assumptions about that structure. An a priori check could assess the loading structure of the data set using an exploratory factor analysis (EFA) and adjust the factor model accordingly. For example, we could decide to allow an item to load on more than one group factor, if we view the additional loading as contributing to the true score we intend to measure.

Robustness
In the above, we used relatively uninformative priors for the second-order factor model. To demonstrate an analysis with more informative priors, we alter some of the prior parameters to reflect a scenario with larger prior loading values and smaller residual variances than the default settings used for the simulation. Specifically, we adjust the prior primary loadings mean to λ₀ = 2, the prior secondary loadings mean to β₀ = 2, the prior manifest residuals shape and scale to a₀ = 10 and b₀ = 6, respectively, and the prior latent residuals shape and scale to c₀ = 10 and d₀ = 6, respectively.¹⁴ Then, we run the Bayesian analysis with the impulsivity data set again, with all other settings unchanged. The resulting coefficients are ω̂_t = .855 and ω̂_h = .630, with credible intervals ω_t ∈ 95% CI [.836, .875] and ω_h ∈ 95% CI [.578, .680].
The results are similar to those with relatively uninformative priors, and we conclude that the posterior sampling with the given data set is robust to the prior specification.

Discussion
Ample research describes the shortcomings of coefficient α and other lower bounds for estimating reliability when data are multidimensional (e.g., Green & Yang, 2015; McNeish, 2018; Savalei & Reise, 2019; Sijtsma, 2009). Better-suited coefficients for this purpose are ω_t and ω_h, which estimate different forms of reliability (e.g., Flora, 2020; Revelle & Zinbarg, 2009; Zinbarg et al., 2005). In this study, we investigated the performance of several procedures to quantify the uncertainty in reliability estimation using confidence intervals for ω_t and ω_h. The results provide a more comprehensive picture of the confidence intervals for reliability coefficients often used in research. Furthermore, we presented versions of coefficients ω_t and ω_h in the Bayesian framework, thereby also providing Bayesian credible intervals. The credible intervals are a well-performing alternative to confidence intervals that presents an intuitive approach to quantifying uncertainty in reliability analysis.

Confidence intervals
In the simulation study, the BCA interval showed the best performance among the confidence intervals. The remaining confidence intervals performed well in most conditions, except when the sample size was small and the reliability was low. Before making any recommendations, it is important to note three issues that surfaced during the simulation study. First, all studied confidence intervals depend on a converging CFA with the original data set. Non-convergence and nonsensical parameter estimates (Heywood cases) are frequent issues with small samples and noisy data. Second, the CFA with bootstrapped data sets sometimes yielded nonsensical parameter estimates. This resulted in out-of-range interval bounds and excessive coverage rates for the SE, SE_bias, and SE_ln intervals. There are multiple possible remedies for these issues. One could use EFA to obtain the ω-coefficients (see, e.g., Revelle & Condon, 2019). Alternatively, one could use bounded estimation for the CFAs involved (see de Jonckere & Rosseel, 2022). Further, within CFA, one could reject all bootstrapped parameter estimates of ω_t and ω_h that fall outside the 0 to 1 range, since these inflate the standard error of the coefficients and lead to coverage issues. Thus, one would obtain the bootstrapped intervals based only on a "bounded" empirical sampling distribution. Third, the BCA interval performed well but is not always available, especially when the data are noisy and the sample size is close to the number of bootstrap replications. In these cases, it may help to increase the number of bootstrap samples or, if to no avail, resort to another interval. To sum up, if a confidence interval is desired, the CFA has no convergence issues, the model fits the data well, and the bootstrapped BCA interval is available, researchers may use it to accompany point estimates of ω_t and ω_h. When the reliability level is relatively high or the sample size is medium to large, all confidence intervals can be recommended. In any case, it makes sense to inspect the parameter estimates of the CFA model for any peculiarities and to compare multiple confidence interval bounds to get a better understanding of the uncertainty around the reliability coefficients.

Credible intervals
The credible intervals reached good coverage rates (close to 95%) when the sample size was at least 500. With small sample sizes and low reliability values, the coverage rates did not reach a satisfactory level, which was also true for the confidence intervals. In general, the Bayesian coefficients performed similarly to the frequentist ones and can be recommended to any researcher who prefers to do their analysis in the Bayesian framework. The posterior distributions of coefficients ω_t and ω_h allow researchers to obtain any X% interval that contains the coefficient with X% probability and to compute the probability that the coefficient is larger than any specified value. Such inferences cannot be drawn from confidence intervals.
Similar to the frequentist approach, the Bayesian versions of ω_t and ω_h require an adequately specified model structure. Often, researchers estimate reliability as a prerequisite for further analyses. The prior distribution and the need for a precise model definition demand that they give thought to the prior confidence they have in different reliability and factor model parameter values before seeing the data, and that they consider the exact relations between the latent factors and the manifest variables for their data.
14 An inverse gamma with shape and rate parameters a0 = 10 and b0 = 6 yields residual variances of 2/3 on average, since the mean of an inverse gamma distribution is b0/(a0 - 1) = 6/9.
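A quick Monte Carlo check of this footnote in R:

    # Mean of an inverse gamma(shape = 10, rate = 6) is 6 / (10 - 1) = 2/3.
    mean(1 / rgamma(1e6, shape = 10, rate = 6))  # approximately 0.667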
The prior distribution is an important part of Bayesian parameter estimation; it quantifies the researcher's knowledge before the data have been observed. In settings where the sample size is reasonably large, the influence of the prior distribution is relatively small, since the data provide enough evidence to quickly change the prior beliefs ("the data overwhelm the prior"; Wrinch & Jeffreys, 1919). However, the conditions with highly informative and uninformative priors showed that when the sample size is small and the evidence in the data is insufficient (low reliability), the prior distribution affects the performance of the Bayesian reliability coefficients. In particular, when the prior parameters are highly indicative of the true factor model parameters, the Bayesian coefficients quickly converge to their population values; when the prior is not at all informative about the true parameters, the population values of the ω-coefficients are only well approximated when the sample size is large enough to overcome the prior. This finding is in line with previous research on Bayesian SEM (MacCallum et al., 2012; Smid & Winter, 2020).
The goal of this study was to investigate relatively uninformative prior distributions for the parameters of the second-order factor model that perform well in practice. For future research on the topic, we propose three ways in which the prior distributions from this study can be altered. First, the prior distribution can incorporate prior knowledge into the estimation process. For instance, the parameters of the second-order factor model, i.e., loadings and residual variances from a previous study using the same test instrument, can be incorporated in a reliability analysis as the parameters of the prior distribution and thereby improve the precision of the Bayesian reliability estimation. Second, we computed the prior distributions of the reliability coefficients from the prior distributions of the second-order factor model parameters. We chose the same priors for the factor model parameters for both coefficients, which resulted in differently shaped prior distributions for ωt and ωh (see Figure 2). The prior for ωt distributes confidence almost evenly along the x-axis, whereas the prior for ωh places a considerable amount of confidence on values between 0 and 0.1 (see the modest peak in Figure 2). Researchers may want to tailor the prior distributions of the factor model parameters to each coefficient separately to obtain prior distributions that better reflect their prior beliefs about the coefficients. Third, the prior distribution is affected by the number of items in a test and the number of group factors specified. In the simulation study, we chose the same prior for both model sizes (k = 9 and k = 30). Researchers may investigate how priors are affected by a varying number of items.
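One way to inspect such induced priors is prior-predictive simulation: draw second-order model parameters from their priors and push them through the ω formulas. The R sketch below uses illustrative dimensions (k = 9 items, three group factors) and stand-in hyperparameters rather than the study's exact settings.

    set.seed(1)
    k <- 9; g <- 3; grp <- rep(1:g, each = k / g)  # item-to-group assignment
    draws <- replicate(5000, {
      lambda <- abs(rnorm(k))         # first-order loadings (half-normal, illustrative)
      beta   <- abs(rnorm(g))         # second-order loadings
      psi    <- 1 / rgamma(g, 2, 1)   # group-factor residual variances
      theta  <- 1 / rgamma(k, 10, 6)  # item residual variances (cf. footnote 14)
      gl <- lambda * beta[grp]        # Schmid-Leiman general-factor loadings
      sl <- lambda * sqrt(psi[grp])   # Schmid-Leiman group-factor loadings
      vt <- sum(gl)^2 + sum(tapply(sl, grp, sum)^2) + sum(theta)  # total variance
      c(omega_t = (sum(gl)^2 + sum(tapply(sl, grp, sum)^2)) / vt,
        omega_h = sum(gl)^2 / vt)
    })
    apply(draws, 1, quantile, probs = c(0.025, 0.5, 0.975))  # induced priors on omega_t, omega_h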

Practical Implications for Applied Researchers
The key message substantive researchers should take away from this study is to always account for uncertainty in their reliability analysis. When uncertainty estimation is neglected, sampling error is ignored, and one risks learning nothing about the reliability in the population. In practice, we recommend that researchers first decide on the measurement model underlying their scale. If they decide on the second-order factor model, they need to ensure that the model approximately fits their data set. When reliability point estimation is of special interest, we refer readers to a tutorial by Revelle and Condon (2019), although it is hard to imagine a scenario in which one would not want to account for sampling error. If frequentist confidence intervals are desired, a second-order CFA model and the reliability coefficients may be estimated, and, if the sample size is at least 500 or the reliability is relatively high, any of the confidence intervals investigated in the simulation study will suffice to reflect uncertainty about the reliability estimates. When the sample size is small and the reliability low, the BCA interval should be used. 15

If Bayesian statistics are favored, we recommend the R-package Bayesrel for the reliability analysis (sketched below). The package allows users to estimate the Bayesian ωt and ωh and to adjust the prior hyperparameters of the second-order factor model. Given a specific data set, the first step in a Bayesian reliability analysis should be to decide how much prior confidence one has in the parameters of the second-order factor model, and thus in the reliability values. We distinguish two scenarios: prior information about the scale in question is available, or no prior information is available. When prior information is available, researchers may incorporate it by specifying more informative priors for the second-order factor model (see subsection "Priors", pp. 14-16). The prior variances of all parameters should nonetheless be reasonably large, so that the posterior is not dominated by the prior distribution. The simulation study might suggest that when the reliability is low and the sample size small, the poor performance of the reliability coefficients can be counteracted by highly informative priors with small variances. We advise against such a procedure, since one risks learning nothing about the data beyond the information already contained in the prior. Even when prior information is available, we advise always estimating the Bayesian coefficients with at least one additional, less informative set of priors, for example, the relatively uninformative set from the simulation study. Naturally, one should present the results for both sets of priors in a transparent manner. This way, one is safeguarded against inferences about the reliability coefficients that are overly dependent on prior information.
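In code, such an analysis might start as follows. The call is a sketch based on Bayesrel's multidimensional omega function bomegas; argument names and the structure of the returned object should be verified against the documentation of the installed package version, and my_scale_data is again a placeholder.

    library(Bayesrel)
    # Second-order model with five group factors, as in the UPPS-P example;
    # the fitted object contains posterior estimates of omega_t and omega_h.
    fit <- bomegas(my_scale_data, n.factors = 5)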
When no prior information is available, it makes sense to start the reliability analysis with a relatively uninformative set of priors. Subsequently, the reliability coefficients should also be estimated with multiple sets of priors differing in informativeness (Depaoli et al., 2021; van Erp et al., 2018). The resulting coefficients and their inferences should be reported for each prior setting. Ideally, the sample size or the reliability is large enough that the prior settings have little to no influence on the reliability estimates (see the data example). In any case, a reported reliability analysis should explain the reasoning behind the choice of prior and contain results for more than one set of prior parameters.
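A prior-sensitivity report can then be as simple as tabulating interval bounds across prior settings; draws_uninformative and draws_informative are hypothetical vectors of posterior ωh samples obtained under the two settings.

    sapply(list(uninformative = draws_uninformative,
                informative   = draws_informative),
           quantile, probs = c(0.025, 0.5, 0.975))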

Limitations
We chose the second-order factor model to estimate coefficients ωt and ωh. The second-order model suits many psychological constructs that are measured by multiple group factors relating to a general factor (e.g., Moshagen, in press). For constructs that do not fit the second-order model, other factor models are appropriate. For instance, a general factor for the impulsivity scale might explain variance in the manifest variables directly rather than through the group factors; the group factors then explain the variance in the manifest variables that the general factor cannot. Such a bi-factor model is a more general factor model that includes the second-order factor model as a special case. Fitting a second-order factor model and transforming it into a bi-factor model (as done in this study) is not equivalent to fitting a bi-factor model directly, unless the second-order model fits the data. In other words, all second-order models are bi-factor models, but not all bi-factor models are second-order models. The procedure to estimate coefficients ωt and ωh from a bi-factor model is largely similar to the procedure described in this study and differs only in the way the loadings and residual variances are obtained to compute the reliability coefficients (see, e.g., Flora, 2020; Revelle & Condon, 2019).
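For readers who fit the bi-factor model directly, a sketch of that computation from a standardized lavaan solution follows; fit_bf is a hypothetical lavaan fit of an orthogonal bi-factor model with the general factor labeled "g", in the spirit of Flora (2020).

    std   <- lavaan::lavInspect(fit_bf, "std")
    L     <- std$lambda                      # standardized factor loadings
    theta <- diag(std$theta)                 # standardized residual variances
    vtot  <- sum(colSums(L)^2) + sum(theta)  # total variance (orthogonal factors)
    omega_t <- sum(colSums(L)^2) / vtot
    omega_h <- sum(L[, "g"])^2 / vtot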
The model structure we implemented in the simulation study was simple, meaning no cross-loadings or error covariances were specified. Real data may contain cross-loadings or correlated errors. In both the frequentist and the Bayesian framework, such aspects can be addressed through the model specification before running the analysis (see, e.g., Flora, 2020). Whereas the confirmatory approach is inherently subject to identification issues, an EFA can avoid these. The EFA approach to estimating coefficients ωt and ωh is well known in the frequentist framework (see Revelle, 2020) and could also be applied in the Bayesian framework by fitting a Bayesian EFA (Conti et al., 2014). Within an EFA model, all possible cross-loadings are estimated. Arguably, whether or not to model all possible loadings between group factors and items is a matter of the research question at hand: on the one hand, doing so can maximize the estimated true score variance; on the other hand, it can lead to capitalization on chance.

Conclusion
In practice, measurement instruments are often multidimensional, as are the psychological constructs they attempt to measure. The coefficients ωt and ωh estimate the total and the general factor reliability, respectively, for such instruments. We take it as undisputed that an estimate of the uncertainty around the point estimates of these coefficients is needed, yet reporting one remains a neglected practice. To provide researchers with a comprehensive picture of the available uncertainty intervals around ωt and ωh, we compared the confidence intervals for the coefficients in a simulation study. Furthermore, we described a method to obtain Bayesian versions of ωt and ωh and their posterior distributions. The information in this study allows researchers to make informed choices about the confidence intervals they wish to report for reliability coefficients, or even to go beyond standard frequentist practice by expressing their beliefs about the reliability estimates through the posterior distributions of ωt and ωh.