Bayes Factor Testing of Multiple Intraclass Correlations

The intraclass correlation plays a central role in modeling hierarchically structured data, such as educational data, panel data, or group-randomized trial data. It represents relevant information concerning the between-group and within-group variation. Methods for Bayesian hypothesis tests concerning the intraclass correlation are proposed to improve decision making in hierarchical data analysis and to assess the grouping effect across different group categories. Estimation and testing methods for the intraclass correlation coefficient are proposed under a marginal modeling framework where the random effects are integrated out. A class of stretched beta priors is proposed on the intraclass correlations, which is equivalent to shifted F priors for the between groups variances. Through a parameter expansion it is shown that this prior is conditionally conjugate under the marginal model yielding efficient posterior computation. A special improper case results in accurate coverage rates of the credible intervals even for minimal sample size and when the true intraclass correlation equals zero. Bayes factor tests are proposed for testing multiple precise and order hypotheses on intraclass correlations. These tests can be used when prior information about the intraclass correlations is available or absent. For the noninformative case, a generalized fractional Bayes approach is developed. The method enables testing the presence and strength of grouped data structures without introducing random effects. The methodology is applied to a large-scale survey study on international mathematics achievement at fourth grade to test the heterogeneity in the clustering of students in schools across countries and assessment cycles.


Introduction
The intraclass correlation plays a central role in the statistical analysis of hierarchical data. It quantifies the relative variation between groups or clusters. A large (small) intraclass correlation implies a strong (weak) degree of clustering which implies that there is much (little) variation between groups. In cluster-randomized trials, entire groups (e.g., hospitals, schools) are assigned to the same treatment or intervention. When planning a cluster-randomized experiment, the intraclass correlation is used as an indicator of the level of efficiency of a multistage sample design. Optimal sample size requirements to obtain adequate statistical power and statistical precision depend on the variation between and within groups (Hedges and Hedberg, 2007;Raudenbush, 1997;Spiegelhalter, 2001). When conducting an experiment in different regions and contexts, the statistical variation in intraclass correlations is relevant to optimally plan cluster-randomized experiments across regions and to obtain adequate statistical power in each region. Knowledge about the intraclass correlation is also important to verify that conclusions of a statistical analysis are valid. When incorrectly ignoring a grouping effect, standard errors are generally too small and conclusions about the statistical significance of a treatment effect might be incorrect (Raudenbush, 1997).
Testing intraclass correlations can reveal relevant information about the level of heterogeneity between groups and across different group types. For example, Mulder and Fox (2013) tested the intraclass correlation of Catholic schools and public schools to learn that there is more variation in performance of Catholic schools in comparison to public schools. Van Geel et al. (2017) examined differences in intraclass correlations of teacher scores nested in schools in a pretest-posttest study design. After the teachers participated in an intervention program to improve teacher performances, a decrease of the intraclass correlation was measured. It was assumed that at the posttest teachers performed less alike leading to less similarity between teachers in each school, where some teachers did improve their performances while others did not. We will propose Bayes factor tests to formally test differences between intraclass correlations to be able to make inferences about the heterogeneity in teacher improvements.
In this paper, a Bayesian approach is presented for testing multiple precise and order hypotheses on multiple intraclass correlations belonging to different group categories, ρ = (ρ 1 , . . . , ρ C ) , where ρ c is the intraclass correlation in group category c, for c = 1, . . . , C. The intraclass correlation ρ c is defined as the ratio of the betweengroups variance in group category c and the total variance in group category c. The Q hypotheses have the following general form with equality and order constraints on intraclass correlations: for q = 1, . . . , Q, where the rows of the coefficient matrices R E q and R I q are permutations of either (1, −1, 0, . . . , 0) or (±1, 0, . . . , 0), q = 1, . . . , Q. Thus, restrictions are considered where intraclass correlations are equal to, larger than, or smaller than zero, or equal to, larger than, or smaller than other intraclass correlations. This class covers the most important hypotheses on intraclass correlations in statistical practice.
A key step in our methodology is the use of a marginal modeling framework, where the random effects in the multilevel model are integrated out. In this marginal modeling framework the intraclass correlations can attain negative values (Searle et al., 1992, p. 60-61). The allowed parameter space under the marginal model is in line with the restriction following from the expression for the intraclass correlation of Harris (1913), which states that the intraclass correlation is greater than − 1 p−1 , where p equals the number of observations per group. Unlike the marginal modeling framework of Liang and Zeger (1986) using generalized estimating equations, our marginal approach connects more closely to integrated likelihood methods where the nuisance parameters are integrated out (Berger et al., 1999). In this integrated likelihood approach inferences concerning the intraclass correlations are also invariant under shifts of the random group means. In our approach, the integrated likelihood is defined for Helmert-transformed grouped observations (Lancaster, 1965). The orthonormal Helmert transformation is used to partition the integrated likelihood in a component containing the betweengroups sum of squares and a component containing the within-groups sum of squares, which are the sufficient statistics for the between-groups variance and within-groups variance, respectively .
To aid Bayesian estimation and testing a class of stretched beta priors is proposed for the intraclass correlations. This class of priors has positive support for negative intraclass correlations under the marginal model. Furthermore this class of priors is equivalent to shifted F distributions for the between-groups variances which has an additional shift parameter. To our knowledge this class of priors is novel in the Bayesian literature. Note that the F distribution is equivalent with the scaled-beta2 prior (Pérez et al., 2017) and the half-t prior (Gelman, 2006;Polson and Scott, 2012), which are becoming increasingly popular for modeling variance components (Mulder and Pericchi, 2018).
The proposed class of stretched beta priors under the marginal model has several attractive features. By allowing intraclass correlations to be negative it is possible to test the appropriateness of a random effects model using the posterior probability that an intraclass correlation is positive. Moreover using a noninformative improper prior under the marginal model, we can obtain accurate coverage rates for the credible intervals, even in the case of samples of minimal size with two groups and two observations per group for a zero intraclass correlation in the population. Note that frequentist matching priors play an important role in objective Bayesian analysis (Welch and Peers, 1963;Severini et al., 2002;Berger and Sun , 2008). Another consequence of the marginal modeling approach is that significance type tests of whether an intraclass correlation equals zero can be performed using credible intervals with accurate error rates. This is possible because testing whether ρ = 0 is not a boundary problem. Another important property of the proposed class of priors is that it can be made conditionally conjugate through a parameter expansion. As will be shown the shifted F distribution on the between-groups variance is equivalent to a gamma mixture of shifted inverse gamma distributions. These shifted inverse gamma priors are conditionally conjugate under the marginal model. This results in efficient posterior sampling with a Gibbs sampler.
For the testing problem (1), a Bayes factor testing procedure is proposed under the marginal model. This test can be applied when prior information about the intraclass correlations is available and when no prior information is available or when a default Bayesian procedure is preferred. In the informative case, proper truncated stretched beta priors are specified on the unique intraclass correlations under each constrained hypothesis H q where the hyperparameters can be elicited from prior knowledge. A special case is the uniform prior, which assumes that all intraclass correlations are equally likely a priori. In the noninformative case, truncated improper reference priors will be used in combination with a generalized fractional Bayes approach (O'Hagan, 1995;De Santis and Spezzaferri, 2001;Hoijtink et al., 2018).
The paper is organized as follows. First, the marginal model is introduced, where two parameterizations are discussed and the integrated likelihood of the Helmert-transformed observations is given. Then, two prior classes are discussed, where a stretched beta distribution and a shifted F distribution is introduced to describe the distribution of the intraclass correlation and the between-groups variance, respectively, while taking account of restrictions on the parameter space to ensure that the covariance matrix is positive definite. A Gibbs sampler is then described, and its performance is evaluated through a simulation study. Then a Bayes factor and a generalized fractional Bayes factor are proposed, and their numerical performances are evaluated. Both tests are applied to data from the Trends in International Mathematics and Science Study to evaluate hypotheses concerning the heterogeneity of the intraclass correlation across countries and assessment cycles. Finally, a discussion is given and some generalizations are presented.

The marginal model
We focus on the random intercept model, where measurement j in group (or cluster) i in group category c is distributed according to for j = 1, . . . , p measurements, i = 1, . . . , n c groups in category c, and c = 1, . . . , C categories. In (2), β is a vector of K fixed effects with covariates x cij for measurement j in group i in category c, δ ci is the random intercept of group i in category c, τ 2 c is the between-groups variance in category c, and σ 2 is the common residual variance, which can be interpreted as the within-groups variance. This random intercept model can be recognized as a two-level multiple-group model, where level-1 units j are nested in level-2 groups i for each group category c. For instance, in each country c, math scores y cij of students j nested in schools i are assumed to be independently distributed given the random school intercept δ ci . The dependencies between student scores within each school can vary across countries.
The marginal model is obtained by integrating out the random effects δ ic . The vectorized version of (2) then has a multivariate normal distribution with a covariance matrix having a compound symmetry structure, i.e., where y ci = (y ci1 , . . . , y cip ) , X ci is the p×K stacked matrix of covariates, I p is the p×p identity matrix, and J p is a p × p matrix of ones. In order for the covariance matrix Σ c to be positive definite, it must hold that τ 2 c > − σ 2 p and σ 2 > 0, and thus τ 2 c does not necessarily have to be positive as in (2). For this reason we introduce a more general marginal model with covariance matrix where η c > − σ 2 p . We shall refer to η c as the generalized between-groups variance in category c. Note that (4) is equivalent to (3) when η c > 0. Furthermore when there is support in the data that η c < 0, the multilevel model (2) may not be appropriate. We can reparameterize model (4) using different intraclass correlations for different categories, denoted by ρ c = ηc ηc+σ 2 , for c = 1, . . . , C, and the total variance in group category 1, denoted by φ 2 , such that ⎧ ⎨ ⎩ ρ c = ηc ηc+σ 2 , for c = 1, . . . , C, Note that the total variance φ 2 and the fixed effects β are considered nuisance parameters in the current paper. The intraclass correlation in group category c is defined as the ratio of the generalized between-groups variance and the total variance. Thus, ρ c quantifies how much units in the same group resemble each other in category c. If ρ c = 0, then there is no clustering and measurements y ijc are essentially randomly assigned to the groups in category c.
Using the parameterization (β, ρ, φ 2 ), the marginal model in (4) is given by with ρ c ∈ (− 1 p−1 , 1) in order for Σ c to be positive definite. Hence, the intraclass correlations can attain negative values under this generalized marginal model, which is not the case in the conditional model (2), where ρ c ∈ (0, 1), for c = 1, . . . , C. To get some intuition about the impact of a negative intraclass correlation, Figure 1 displays the sampling distribution of the between-groups sums of squares for population values of σ 2 = 1, β = 0, n 1 = 8, and p = 4, and intraclass correlations of ρ 1 = −.1, 0, or .3. As can be seen the between-groups sums of squares is generally smaller in the case ρ 1 is negative in comparison to ρ c = 0 corresponding to random group assignment. Note that the estimated intraclass correlation is negative when the mean between-groups sums of squares is smaller than the mean within-groups sums of squares (Searle et al., 1992, p. 60-62).
Due to the compound symmetry covariance structure, the orthonormal Helmert transformation is useful to obtain transformed outcomes that are independent. The p × p Helmert transformation matrix is given by (Lancaster, 1965) Subsequently, the transformed observations are independently distributed according to Figure 1: Sampling distribution of the between-groups sums of squares, s 2 B,1 = i (ȳ 1i − y 1 ) 2 , whereȳ 1i denotes the sample mean of group i andȳ 1 denotes the overall sample mean, for φ 2 = 1, β = 0, n 1 = 8, p = 4, and different values of the intraclass correlation where W ci = H p X ci , and p × p matrix D c = diag( 1+(p−1)ρc 1−ρc , 1, . . . , 1), for i = 1, . . . , n c and c = 1, . . . , C. From D c it can be seen that only the first transformed observation, z ci1 , contains information about the intraclass correlation in that group. This can be explained by the fact that z ci1 depends on the sum of y ci , which is a key quantity for the between-groups variation.
The likelihood function under the marginal model is given by where z = (z 11 , z 12 , . . . , z Cn C ), W is a stacked matrix of W ci , w cij is the j-th row of W ci , and N = C c=1 n c . Note that because the Helmert transformation is orthonormal, the likelihood of z given W is equivalent to the likelihood of the untransformed y given X. Further note that inferences are only invariant of the chosen category of η c in φ 2 = η c + σ 2 in (5) when placing a noninformative improper prior on φ 2 . This can be seen when setting the improper prior π N (φ 2 ) = φ −2 and integrating out φ 2 in the posterior. In that case each ρ c will have the same role in the posterior 1 .

Prior specification
We propose the following class of priors under the marginal model: where the stretched beta distribution with shape parameters α c and ζ c in the interval with normalizing constant Q(α c , ζ c , p) = Γ(αc+ζc)(p−1) ζc Γ(αc)Γ(ζc)p αc+ζc −1 , for α c , ζ c > 0. To our knowledge this prior is novel in the Bayesian literature. In the case of a single intraclass correlation, Spiegelhalter (2001) proposed a beta prior for ρ in the interval (0, 1) under the conditional model (2). The stretched beta prior in (10) in the interval (− 1 p−1 , 1) seems more natural however, because the prior has common factors as the likelihood function (8). As a result this class of priors is conditionally conjugate for the marginal model by applying a parameter expansion. This will be shown in the following section. Other generalizations that have been proposed for the beta distribution include Armagan et al. (2011).
Further note that the conditional prior for the nuisance parameters β is based on Zellner's (1986) g prior with prior guess β 0 and Σ N = diag(I n1 ⊗ Σ 1 , . . . , I n C ⊗ Σ C ) of dimension Np × Np, with Σ c given in (6), and X is the stacked matrix of X ci . An improper prior is set for the nuisance parameter φ 2 (similar as in the g prior). Note that by setting g = Np one would obtain a unit information prior (see also Kass and Wasserman, 1995).
If prior information is available about the relative grouping effect in the different categories, this can be translated to informative stretched beta priors using a method of moments estimator. First note that the first two moments of a stretched beta prior equal These expressions can be derived by transforming a beta(α c , ζ c ) distribution in the interval (0, 1) to a stretched beta distribution in the (− 1 p−1 , 1). Subsequently, the prior hyperparameters α p and ζ c can be obtained by setting the prior guess equal to the mean and uncertainty about the prior guess equal to the standard deviation 2 . If all values for the intraclass correlations are assumed to be equally likely a priori, the hyperparameters can be set to 1 resulting in uniform priors. Figure 2 displays a uniform prior (dashed line) and an informative prior with prior guess ρ * 1 = .4 and standard deviation s ρ * 1 = .15 (dotted line) when p = 9.
If prior information is absent or if one prefers to adopt an objective Bayesian procedure, the hyperparameters can be set to α c = ζ c = 0, for c = 1, . . . , C. The resulting noninformative improper prior is given by This is essentially Haldane's (1932) prior for ρ c in the interval (− 1 p−1 , 1). Note that (11) corresponds to (10) when defining Q(0, 0, p) = 1. In the case of a single intraclass correlation, this corresponds to the reference prior where the intraclass correlation is considered to be the most important parameter (Berger and Bernardo, 1992;Chung and Dey, 1998). This prior is equivalent to the prior considered by (Box and Tiao, 1973, p. 251). Figure 2 displays the reference prior when p = 9 (solid line).
In practice, intraclass correlations are generally expected to be positive. Such expectations can be included in the proposed prior by truncating the stretched beta priors on ρ c in the interval (0, 1). Working with this truncated prior essentially comes down to the marginal model of the random effects model in (3) and (2). Note that this truncated prior differs from a standard beta prior in the interval (0, 1) (except in the case of uniform priors). For example Chung and Dey (1998) truncated the noninformative 2 If ρ * c denotes the prior guess and s ρ * c its standard deviation, which reflects the uncertainty about the prior guess, then set αc reference prior in (11) in the interval (0, 1). Throughout this paper we shall mainly focus on non-truncated priors but we also give some results for the truncated case.

Bayesian estimation under the marginal model
A Gibbs sampler is presented for fitting the generalized marginal model using the proposed class of priors in (9). First a parameter transformation is applied to generalized between-groups variances having shifted F priors, which are novel in the Bayesian literature. Subsequently a parameter expansion is applied that results in shifted inverse gamma priors, which are conjugate under the generalized marginal model. Finally a Gibbs sampler is presented.

Lemma 1. Transforming the prior in
where the density of the shifted F distribution is given by where ν 1 is the first degrees of freedom, ν 2 is the second degrees of freedom, s 2 is a scale parameter, and μ is a shift parameter, and Proof. See Appendix A (Mulder and Fox, 2018).
Second a parameter expansion is applied to model a shifted F distribution as a gamma mixture of shifted inverse gamma distributions. (14) can be obtained by setting a gamma mixture distribution on the scale parameter of a shifted inverse gamma distribution,

Lemma 2. The shifted F distribution in
where the shifted inverse gamma distribution is given by where α is a shape parameter, ξ is a scale parameter, and μ is a shift parameter.
By applying Lemma 1 and 2, the joint prior for (β, σ 2 , η, ψ 2 ) can be written as where ψ 2 is a vector of length C of auxiliary parameters.
Subsequently by parameterizing the likelihood in (8) in terms of the generalized between-groups variances η c and within-groups variance σ 2 , it can be shown that the conditional posteriors of the parameters have known distributions from which we can sample in a Gibbs sampler (Appendix C; Mulder and Fox, 2018). This can be achieved by splitting the parameters in two blocks β and (σ 2 , η, ψ 2 ). By of dimension Np × k, and y = (y 11 , y 12 , . . . , y Cn C ) of length Np, the blocked Gibbs sampler can be written as follows 1. Set initial values for (β, η, σ 2 , ψ 2 ), or for (β, ρ, φ 2 , ψ 2 ) and apply the transformation in (5).
If truncated stretched beta priors would be used for the intraclass correlations in the interval (0, 1), this would result in shifted F distributions for a generalized betweengroups variances η c truncated in (0, ∞). Applying the same parameter expansion as above would result in a gamma mixture of truncated shifted inverse gamma priors in (0, ∞). The corresponding conditional posteriors would then also have truncated shifted inverse gamma distributions. Sampling from these truncated shifted inverse gamma distributions can be done by sampling from the nontruncated shifted inverse gamma distribution until a positive value is drawn. This will be fairly efficient because the posterior probability mass in the negative region is generally quite small.

Frequentist coverage rates
Frequentist coverage rates are useful to investigate the performance of noninformative objective priors (e.g. Stein, 1985;Ghosh and Mukerjee, 1992;Berger et al., 2006). A simulation study was conducted to investigate the coverage rates of the lower 5% and 95% posterior quantiles for ρ 1 in the marginal model with C = 1 using the reference prior (11), which should ideally be close to .05 and .95, respectively. This was done for population values of τ 2 ∈ {0, .1, .5, 1, 10} and σ 2 = 1, which correspond to intraclass correlations of ρ ∈ {0, .09, .33, .5, .91}, and μ 1 = 0, and for sample sizes of (n, p) = (2, 2), (10, 5), and (500, 10). Note that the first sample size condition corresponds to a minimal balanced design with 2 groups with 2 observations per group. For each condition 50,000 data sets were generated. The coverage rates can be found in Table 1.
As can be seen from Table 1 the coverage rates under the marginal model with the considered reference prior are very accurate, even in the minimal information case with (n, p) = (2, 2) and an extreme intraclass correlation of ρ = 0. These rates are better than previous results using a truncated reference prior under the multilevel model (2) (Berger and Bernardo, 1992;Ye, 1994;Chung and Dey, 1998, which are also presented in Table 1). This illustrates that the marginal model is superior over the multilevel model in terms of coverage rates of interval estimates for the variance components. Hence, the credible intervals can be used for significance type testing, even when testing ρ = 0. Note that this would not be possible in a multilevel model because testing ρ = 0 would be a boundary problem. Generally however we recommend using Bayes factors for testing intraclass correlations because significance tests, e.g., using interval estimates, tend to overestimate the evidence against a null hypothesis (Sellke et al., 2001;Pericchi, 2005). Bayes factor tests are proposed in the following section.  (7). The results in the last column were taken from Chung and Dey (1998).

Bayes factor testing under the marginal model
When testing statistical hypotheses using the Bayes factor, prior specification plays a more important role than in Bayesian estimation. Instead of having to formulate one prior, which may be improper in Bayesian estimation, proper priors need to be specified for all unique intraclass correlations under all Q equality and order constrained hypotheses in (1). Furthermore, unlike Bayesian estimation, the effect of the priors on the Bayes factor does not fade away as the sample size grows (Jeffreys, 1961;Berger and Pericchi, 2001;Bayarri et al., 2012). Ad hoc or arbitrary prior specification should therefore be avoided. Also note that (objective) improper priors cannot be used in Bayesian hypothesis testing because the resulting Bayes factors would depend on undefined constants (e.g. O'Hagan, 1995;Berger and Pericchi, 1996). These facts have severely complicated the development of (objective) priors in Bayesian hypothesis testing and model selection.
In this section we propose a Bayes factor testing procedure that can be used when prior information about the magnitude of the intraclass correlations under the hypotheses is available or when prior information is too limited for adequate prior specification. When prior information is available this can be translated to proper stretched beta priors for intraclass correlations in (9), similar as in the estimation problem. When prior information is absent or when a default Bayesian method is preferred a generalized fractional Bayesian procedure is proposed. These default Bayes factors are based on the improper versions of stretched beta priors. Note that more examples can be found in the literature where the same family of prior distributions is used for estimation as for hypothesis testing or model selection. For example Cauchy priors with thick tails are useful for estimation in robust Bayesian analyses (Berger, 1994) and in Bayesian regularization problems (Griffin and Brown, 2005), and Cauchy priors are also useful for Bayes factor testing to avoid the information paradox (Zellner and Siow, 1980;Liang et al., 2008). Furthermore, the (matrix) F prior is useful when estimating variance components (Gelman, 2006;Pérez et al., 2017) and for testing variances (Mulder and Pericchi, 2018).

Prior specification and marginal likelihoods
Under a constrained hypothesis H q : R E q ρ = 0, R I q ρ > 0, let the free intraclass correlations be denoted by the vectorρ of length V (the hypothesis index is omitted to simplify the notation). The inequality constraints on the free intraclass correlations can then be written asR qρ > 0. For example, when the first two intraclass correlations are assumed to be equal and larger than the third intraclass correlation, i.e., H 1 : If prior information is available under H q , this can be translated to informative truncated stretched beta priors on the free intraclass correlations, where H * q corresponds to hypothesis H q with the inequality constraints omitted, i.e., H * q : R E q ρ = 0 (see also Pericchi et al., 2008), and the prior probability that the inequality constraints hold under H * q , which serves as a normalizing constant, is given by Subsequently, priors need to be specified for the nuisance parameters β and φ 2 under all hypotheses. First note that the Bayes factor is known to be robust to the choice of the same prior for the common orthogonal nuisance parameters (in the sense of a block-diagonal expected Fisher information matrix; see Jeffreys, 1961;Kass and Vaidyanathan, 1992;Ly et al., 2016). This justifies the use of the same improper prior for the nuisance parameters. First note that the fixed effects β are orthogonal to ρ, and therefore we can use the improper prior π N q (β) = 1. Second, φ 2 is not orthogonal to ρ 3 . When setting a vague inverse-gamma prior for φ 2 however, i.e., π(φ 2 ) = IG( , ), with > 0 small, it can be shown that the resulting Bayes factor will be virtually independent of the exact choice of as long as is small enough. Due to this robustness property, we can specify the improper prior π N q (φ 2 ) = φ −2 = IG(0, 0). Hence, the joint prior under H q is given by π q (β,ρ, φ 2 |α, ζ) = φ −2 π q (ρ|α, ζ).
Under each hypothesis H q , the hyperparameters α, ζ > 0 can be specified in a similar manner as was discussed in Section 3. Proper uniform priors also fall in this category which can be specified by setting α = ζ = 1. A uniform prior for the unique intraclass correlations under hypothesis H q implies that all possible values for the intraclass correlations that satisfy the constraints of H q are equally likely a priori. Once the priors have been specified, the marginal likelihood of the transformed data z under hypothesis H q can be computed according to where f q is the likelihood under H q which is a truncation of the unconstrained likelihood f in (8) in the subspace under H q . For the above example with H 1 : ρ 1 = ρ 2 > ρ 3 , the likelihood would be equal to The computation of the marginal likelihood (19) is discussed in the following section.
Formulating informative priors for the intraclass correlations under all hypotheses can be a challenging and time-consuming endeavor (Berger, 2006). To avoid this step, a default Bayesian procedure is proposed. First truncated reference priors will be specified having truncated stretched beta distributions with hyperparameters of zero, i.e., To avoid the dependence of the marginal likelihood on the undefined constants in these improper priors, a generalized fractional Bayes procedure is considered using different fractions for different transformed observations. The motivation for using different fractions is that only the first element of the transformed observations z ci in (7) contains information about ρ c , and therefore the amount of information in the default prior for the different parameters cannot be properly controlled using one common fraction for all observations, as in the standard fractional Bayes factor (O'Hagan, 1995). Generalized fractional Bayes approaches for normal linear models were for instance considered by Berger and Pericchi (2001), De Santis and Spezzaferri (2001) (8) are raised to different fractions according to (with a slight abuse of notation) where b c is the fraction of the data of the c-th category used to identify the parameters that are specific to category c (such as ρ c , and possibly a category specific intercept), and b 0 is the fraction of the data used to identify the remaining parameters. Generally the use of small fractions is recommended (O'Hagan, 1995;Berger and Mortera, 1995). The choice of the fractions will be motivated in Section 5.3.
Subsequently the marginal likelihood under H q using the generalized fractional Bayes approach is defined by where symbolizes the marginal likelihood of a fraction b of the information in the complete dataset y, i.e., y b , using the truncated noninformative improper prior (20). Note that the numerator in (22) can be obtained by setting b = 1 in (23). Because the same noninformative improper prior is used for computing both marginal likelihoods in (22), the undefined constant in this improper prior cancels out (O'Hagan, 1995).

Computation of the marginal likelihood
In the following lemma a general result is given for the marginal likelihood for a constrained hypothesis H q when using proper truncated stretched beta priors for the unique intraclass correlations or when adopting a generalized fractional Bayes approach.
Lemma 3. Under a constrained hypothesis H q : R E q ρ = 0, R I q ρ > 0, the marginal likelihood in the informative case with α, ζ > 0 and in the noninformative case with α = ζ = 0 are given by where h (ρ, b, α,

ζ) is an analytic function of the unique intraclass correlations under H t , the fractions b, and the prior hyperparameters α and ζ.
Proof. Appendix D (Mulder and Fox, 2018).
Note that the first part of the marginal likelihood in (24) is equivalent to the marginal likelihood of H * q without the inequality constraints, while the second ratio of probabilities quantifies the support for the inequality constraints in the data within hypothesis H * q (see also, Pericchi et al., 2008;Consonni and Paroli, 2017;Gu et al., 2017).
In (24) and (25), the posterior probabilities can be computed as the proportion of draws satisfying the inequality constraints under H * q . The Gibbs sampler for obtaining draws under H * q given y b can be found in Appendix E (Mulder and Fox, 2018). The integrals in (24) and (25) can be computed using the following importance sample estimate , for t = 1 or 2, where q(ρ) is a proposal density under H * q , andρ (s) is the s-th draw from q(ρ). The proposal density is a product of stretched beta distributions, beta(α * v , ζ * v , − 1 p−1 , 1), for v = 1, . . . , V , which is tailored to h(ρ, b, α, ζ). First a posterior sample is drawn forρ under H * q (Appendix E in Mulder and Fox, 2018). Then the shape parameters of the proposal distribution are computed with a method of moments estimator using the estimated posterior mean and variance as in footnote 2. By multiplying the shape parameters of the proposal density with, say, .7, the proposal density gets heavier tails than the kernel of the posterior h which ensures a stable and consistent estimate of the integral.
In the special case where X ci is a p × c matrix with ones in column c and zeros elsewhere, which implies that fixed intercepts per group category are the only covariates (as in a standard random intercept model), the marginal likelihood based on the truncated reference prior (20) has an analytic form. The expression can be found in Appendix F (Mulder and Fox, 2018). Consequently the generalized fractional Bayes factor has an analytic solution when testing equality and/or order constraints on multiple intraclass correlations in the random intercept model.

Choice of the fractions
In the fractional Bayes factor a fraction of the data is used to implicitly construct a default prior that is concentrated around the likelihood (e.g. Gilks, 1995). This is also the case for the generalized fractional Bayes factor as can be seen below where the proper updated prior is defined by In the original papers of the fractional Bayes factor, it was argued that the choice of the fraction should depend on the uncertainty about the employed improper prior: In the case of much (little) uncertainty, a relatively large (small) fraction should be used to update the improper prior (O'Hagan, 1995(O'Hagan, , 1997Conigliani and O'Hagan, 2000). Because the improper prior seems a reasonable objective choice (Section 4.1) and because larger fractions for prior specification would result in less information for hypothesis testing, we focus on minimal fractions in this paper (see also Berger and Mortera, 1995). A minimal fraction is based on the minimal amount of observations that are needed to obtain a proper updated prior. In practice each group category often has its own fixed intercept, which implies that X ci contains a column with only ones. After the Helmert transformation in (7), this column becomes ( √ p, 0, . . . , 0) in W ci = H p X ci . Thus, only the intercept and intraclass correlation of each group category are identified by the first transformed observations, z ci1 , for c = 1, . . . , C and i = 1, . . . , n c . This implies that two observations are needed of the first transformed observations in each group, which corresponds to a minimal fraction of b c = 2 nc . The remaining K − C fixed effects (where the groups specific intercept are excluded) and the total variance parameter φ 2 are then identified by the N (p − 1) transformed observations, z cij , for c = 1, . . . , C, i = 1, . . . , n c , and j = 2, . . . , p, which implies a minimal fraction of b 0 = K−C+1 N (p−1) . To get an idea about the effect of the choice of the fractions on the proper default prior, Figure 3 displays the estimated marginal posterior densities (solid line) of the intraclass correlations (ρ 1 , ρ 2 , ρ 3 ) (left, middle, and right panel, respectively) and the estimated marginal updated prior densities based on minimal fractions (dashed line) and twice the minimal fractions (dotted line), all based on the noninformative improper prior. These densities were estimated from a randomly generated data set with ρ = (.1, .6, .8), n = (20, 25, 30), p = 8, and group type specific intercepts β = (0, 0, 0) . As can be seen the proper updated prior based on minimal fractions are very similar to the noninformative reference priors. The updated priors based on twice the minimal fractions are more concentrated around the likelihood. In the remaining part of the paper we use minimal fractions so that most information in the data is used for hypothesis testing.

Numerical performance
A multiple hypothesis test is considered on C = 2 group specific intraclass correlations.
The following hypotheses are being tested: Our interest is in the default relative evidence based on the generalized fractional Bayes factor while varying the (unconstrained) posterior modes of the intraclass correlations for different group sizes (n 1 , n 2 ). As fixed effects only group specific intercepts were included. Therefore the marginal likelihoods can be computed by simply plugging in the group specific between groups sums of squares, s 2 B,c for c = 1 and 2, and the within groups sums of squares s 2 W in (8) in Appendix F of Mulder and Fox (2018). The sums of squares were varied according to s 2 W = (p − 1)Nσ 2 and s 2 B,c = n c (τ 2 c +σ 2 /p), for c = 1, 2, whereσ 2 andτ 2 are the unconstrained posterior modes, which were varied overτ 2 = (τ 2 1 , 1−τ 2 1 ) , forτ 2 1 = 0, . . . , 1 andσ 2 = 1, so that (ρ 1 ,ρ 2 ) = (0, .5), . . . , (.5, 0). Thus, whenρ 1 ≈ (0, .5), (.5, 0), or (.25, .25), it is expected to receive most evidence for H 1 , H 2 , or H 3 , respectively, and between these regions it is expected to either receive most evidence for H 4 or H 5 . The subspaces under the hypotheses and the trajectory of unconstrained estimated intraclass correlations are displayed in Figure 4. The group size was set to p = 10, and the number of groups in each category was set to n 1 = n 2 = 30, 300, and 3000. (ρ 1 , ρ 2 ) ∈ (− 1 9 , 1)×(− 1 9 , 1) as a function of the unconstrained estimates of the intraclass correlations (ρ 1 ,ρ 2 ) for n 1 = n 2 = 30 (upper panels), n 1 = n 2 = 300 (middle panels), and n 1 = n 2 = 3,000 (lower panels). Figure 5 (right columns) displays the corresponding posterior probabilities of the hypotheses based on equal prior probabilities, which can be computed as P (H q |y) = Bqu 5 q =1 B q u , with B q u = m q (y, b min )/m u (y, b min ). The plots show desirable default behavior of the generalized fractional Bayes factors as a function of the effects and sample size: The evidence is largest for the hypothesis that is also most supported by the data and the posterior probability for the true hypothesis goes to 1 as the number of groups increases, which implies consistency. Also note that the evidence for a true precise hypothesis with equality constraints (i.e., H 1 , H 2 , and H 3 ) , and H 5 (thin solid line) against an unconstrained hypothesis, and corresponding posterior probabilities of the hypotheses (right column) as a function of estimated intraclass correlations (ρ 1 ,ρ 2 ) which varied from (0, .5) to (.5, 0) for n 1 = n 2 = 30 (upper panels), n 1 = n 2 = 300 (middle panels), and n 1 = n 2 = 3,000 (lower panels).
accumulates with a slower rate than for the other hypotheses. This is commonly observed behavior of Bayes factor methodology (e.g., Johnson and Rossell, 2010). The evidence would increase with a faster rate when testing interval hypotheses instead of precise hypotheses (see Appendix G in Mulder and Fox, 2018). Finally note that the lines for H 4 and H 5 in Figure 5 are incomplete because, in the case of misfit of the inequality constraints, the proportion of 10,000 posterior draws that satisfy the constraints that is used for estimating the posterior probabilities is equal to zero.

Testing intraclass correlations in TIMSS
The Trends in International Mathematics and Science Study (TIMSS) measures the performances of fourth and eight graders in more than 50 participating countries around the world (http://www.iea.nl/timss). TIMSS is conducted regularly on a 4-year cycle, where mathematics and science has been assessed in 1995, 1999, 2003, 2007, 2011, and 2015. The fourth grade is a reference to a year in elementary eduction, where in North America the fourth grade is the fifth school year and in The Netherlands it is called group 6. The children are usually around 9-10 years old. The assessment data of each cycle can be found in the TIMSS's International Database.
When considering the international mathematics achievement of 2015 at the fourth grade, 21 countries improved their average performance, 15 countries had the same average achievement, and 5 countries had a lower average achievement compared to the mathematics achievement of 2011. The average 4th-grade mathematics scores in 2015 were lower for Germany and the Netherlands, scoring 6 and 10 points lower on average, respectively. To provide a reference point, the TIMSS achievement scale is centered at 500 and the standard deviation is equal to 100 scale score points. The TIMSS data set has a three-level structure, where students are nested within classrooms/schools, and the classrooms/schools are nested within countries. Only one classroom is sampled per school, so it is not possible to model variability among classrooms within schools.
For the TIMSS 2011 and 2015 assessment, the changes in the mathematics achievement were investigated by examining the grouping of students in schools across countries. The object was to evaluate whether a specific selection of schools (i.e., particular subpopulation) performed less in 2015, or whether the drop in performance applied to the entire population of schools of the considered country. Therefore, changes in the country-specific intraclass correlation coefficient from 2011 to 2015, representing the heterogeneity in mathematic achievements within and between schools across years, were tested. When detecting a decrease in average performance together with an increase of the intraclass correlation, a subset of schools performed worse. For a constant intraclass correlation across years the drop in performance applied to the entire population of schools. For different countries, changes in the intraclass correlation across years were tested concurrently to examine also differences across countries.
From a sampling perspective, the size of the intraclass correlation is also of specific interest, since sampling becomes less efficient when the intraclass correlation increases. Countries with low intraclass correlations have fewer restrictions on the sample design, where countries with high intraclass correlations require more efficient sample designs, larger sample sizes, or both. Knowledge about the size of the heterogeneity provide useful information to optimize the development of a suitable sample design and to minimize the effects of high intraclass correlations.
Four countries were considered, The Netherlands (NL), Croatia (HR), Germany (DE), and Denmark (DK), where Croatia improved their average achievement and Denmark had the same average achievement. The achievement scores of overall mathematics were considered and the first plausible value was used as a measure of the mathematic achievements of the population (Olson et al., 2008). A stratified sample was drawn by country and school to obtain a balanced sample of p = 15 grade-4 students per school for each of the four countries and two measurement occasions.
The final sample consisted of C=8 group categories, by crossing the four countries with the two measurement occasions, which are referred to as group category c = 1 (NL, 11), c = 2 (NL, 15),. . . , c = 8 (DK, 15). The data was retrieved from schools from The Netherlands (n NL,11 = 93, n NL,15 = 112), Croatia (n HR,11 = 139,n HR,15 = 106), Germany (n DE,11 = 179,n DE,15 = 170),and Denmark (n DK,11 = 166,n DK,15 = 153) with the sampled number of n schools in brackets for 2011 and 2015, respectively. Although often unconditional intraclass correlations are the object of study to explore variations (Hedges and Hedberg, 2007), differences in intraclass correlations were tested conditional on several student variables (e.g., gender, student sampling weight variable). The marginal model represented in (6) was fitted to obtain the parameter estimates, where 10,000 iterations were made and a burnin period of 1,000 iterations was used.
The following hypotheses were considered in the analyses. Hypothesis H 1 represents a common positive (invariant) intraclass correlation across countries and years. Positive country-specific and time-invariant intraclass correlations are represented by hypothesis H 2 . Variation in intraclass correlation across years (i.e., a time-variant intraclass correlation) is represented by Hypothesis H 3 , while assuming a common (invariant) positive intraclass correlation across countries per year. Finally, hypothesis H 4 represents the complement of H 1 , H 2 , and H 3 with unique (variant) intraclass correlations across countries and years.
Next to the assumed heterogeneity in country-specific intraclass correlations of H 2 , an ordering in the correlations can also be hypothesized. The variance of the mean from a balanced clustered sample each of size p is larger than the variance of the mean of a simple random sample by a factor 1 + (p − 1)ρ (Kish, 1965, p. 162-163), which is known as the design effect. So, the intraclass correlation modifies the variance of the mean, given the number of schools and students per school. In the Netherlands, the variance of the average mathematic achievements of fourth graders is known to be relatively low. This can be inferred from the reported standard errors of the Netherland's average mathematics achievement during the cycles from 2003 to 2015, which were usually one of the lowest and ranged from 1.7 to 2.1. The standard errors for Denmark were much higher and ranged from 2.4 to 2.7. For Germany they ranged from 2.0 to 2.3. For the cycles in 2011 and 2015, Croatia had a standard error of 1.8 to 1.9, where the Netherlands had a standard error of 1.7 (Mullis et al., 2011, Exhibit 1.5) (http://timssandpirls.bc.edu/timss2015/international-results/timss-2015/ mathematics/student-achievement). It can be expected that the variation in scores across schools was higher for countries with higher reported standard errors of the average mathematics achievement. This implies an ordering of the country-specific intraclass correlations (from high to low) of Denmark, Germany, Croatia, and The Netherlands. Furthermore, the reported country-specific mathematics achievement distribution also revealed this ordering in the spread of student scores across countries.
The different hypotheses were formally tested using the Bayes factor with a uniform prior and the generalized fractional Bayes factor with an improper prior. In Table 2 the results of the Bayes factor based on uniform priors, referred to as BF, and the generalized fractional Bayes factor, referred to as FBF, are reported, including the posterior probability of each hypothesis. First, the invariant positive intraclass hypothesis was evaluated against the variant intraclass hypothesis. For the BF, it was concluded that   . When also including the results from the posterior probabilities of the hypotheses, it was concluded that the positive intraclass correlations differed across countries, and that an order in intraclass correlations was identified. Within each country, the intraclass correlations did not appear to differ across years.
The present analysis showed that having accurate information about the stratification can be beneficial across years, since changes in the intra-correlation coefficient were invariant over time. The intraclass correlations differed across countries, although the estimated correlations did not differ that much and varied from .08 to .22. Nevertheless, efficient sampling strategies are needed in countries with positive intraclass correlations, where countries with higher intraclass correlations will benefit more from efficient stratification strategies. Hedges and Hedberg (2007) also reported intraclass correlations for different large-scale surveys to provide information for employing randomized experiments in eduction, where schools are assigned to treatments. However, only pairs of intraclass correlations were compared using a Bonferroni adjustment, and the estimated intraclass correlations were assumed to be approximately normally distributed to evaluate the significance of a difference in correlations. These limitations do not apply to the developed generalized fraction Bayes factor and Bayes factor test for intraclass correlations.

Discussion
Currently there are two well-known approaches to model grouped data. In the population-average approach the correlation is treated as a nuisance and the marginal expectation of the outcome is modeled as a function of explanatory variables (Liang and Zeger, 1986). In the conditional or group-specific approach, the variability between groups is explicitly modeled using random effects, which measure directly the heterogeneity between groups. The marginal modeling approach outlined in this paper introduces a third approach. The random effects are integrated out in the conditional model, and the marginal mean and implied covariance structure are directly modeled to make inferences about the correlation structure. Under the integrated likelihood, a prior class is presented which can be used when prior information is available or absent. The new shifted F prior can be seen as an extension of popular priors for variance components (Gelman, 2006;Polson and Scott, 2012;Pérez et al., 2017;Mulder and Pericchi, 2018). The prior can be used when prior information is available or absent. In the latter case an improper prior is considered which results in frequentist matching credible intervals. Furthermore, posterior sampling is efficient using a Gibbs sampler via a parameter expansion. It is also straightforward to compute the probability of whether a between-groups variance parameter is less than (or greater than) zero under the proposed marginal approach. Support for a negative between-groups variance can indicate that a random effects model is not appropriate for the data or that the sample size is too small. Unlike the methodology of Kinney and Dunson (2007), no proper prior has to be specified to obtain such posterior probabilities. Finally the numerical performances of the proposed Bayes factor and generalized fractional Bayes factor showed accurate and consistent results.
Although other methods have been proposed for testing intraclass correlations, no general method has been proposed for the testing problem in (1). The classical significance tests proposed by Donner and Zou (2002) and Konishi and Gupta (1989) are limited to testing a null hypothesis of equal intraclass correlations against an unconstrained alternative. Note that significance tests in general are not designed for testing multiple hypotheses simultaneously or for testing nonnested hypotheses with order constraints on the parameters of interest (Silvapulle and Sen, 2004). Furthermore, the Bayesian information criterion (BIC; Schwarz, 1978;Raftery, 1995) is also not suitable for this testing problem because (i) the Gaussian approximation of the posterior of the intraclass correlations, ρ, would be inaccurate for small to moderate samples, (ii) the normally distributed unit-information prior may not be suitable for the bounded interval of intraclass correlations, and (iii) the number of free parameters is ill-defined for hypotheses with order constraints on the parameters (Mulder et al., 2009). Furthermore Bayes factors have been proposed for testing whether a single intraclass correlation equals zero (García-Donato and Sun, 2007;Pauler et al., 1999;Westfall and Gönen, 1996). The Bayes factor test of Mulder and Fox (2013) assumes uniform priors for the intraclass correlations which are not suitable for general use. Pauler et al. (1999) proposed a tailor-made prior, based on the unit of information, to use MCMC for calculating Bayes factors while dealing with the boundary null-hypothesis. However, this truncated-normal prior is not appropriate for order-constrained hypotheses, since the number of groups and the number of within-group observations can vary across different types of groups. This complicates the specification of a noninformative prior to evaluate inequality constrained hypotheses. Furthermore, these authors considered Bayes factors from a multilevel modeling framework. The integration of the joint posterior with respect to the random effect parameters however is computationally challenging and also requires specification of priors for the (random effect) nuisance parameters whose choice might be ambiguous (Berger et al., 1999). Therefore, a more general Bayesian testing framework was presented to make inferences when testing multiple hypotheses with equality constraints and/or order constraints on the intraclass correlations when prior information about the intraclass correlations is available, weak or completely unavailable. Thereby the paper contributes to the increasing literature on Bayes factor tests of equality and order constrained hypotheses (e.g. Hoijtink, 2011;Gu et al., 2014;Braeken et al., 2015;Mulder, 2016;Böing-Messing et al., 2017, and the references therein), which are becoming increasingly popular in the social and behavioral sciences.
The Bayes factor tests have been developed for continuous data. Future research will focus on extending the tests to categorical and count data by using an appropriate data augmentation scheme (Albert and Chib, 1993;Fox, 2010). Fox et al. (2017) proposed Bayes factor tests for the covariance parameter in a multivariate probit model, with a compound-symmetry covariance structure using data augmentation. For categorical data, the intraclass correlation is often used to determine, for instance, the test reliability of a scoring system, where the object is to obtain compatible results in different statistical trials. When the measurement error remains stationary, the intraclass correlation increases in line with increasing subject variability, which demonstrates that subjects can be better distinguished from each other.
In the psychometric application, the Bayes factor was used as a confirmatory tool to determine which hypothesis of a set of four hypotheses with competing constraints on the intraclass correlations receives most evidence from the data. The proposed Bayes factors can also be used for a more exploratory analysis to find the best fitting hypothesis of all possible equality/order constrained hypotheses, similar as in a variable selection problem. In such an exploratory approach it would be recommended to correct for multiple testing, e.g., using the work of Scott and Berger (2006). How to do this in the case of equality and order constrained models on intraclass correlations is an open topic for further research.
For unbalanced designs the number of observations can vary across groups. The distribution of the between-groups variance is then a mixture of shifted inverse-gamma distributions where the shift parameter depends on the group size. The closed-form distributions from the balanced case can be used to generate proposals for a Metropolis-Hastings algorithm. Furthermore, they can also serve as importance sampling functions to compute Bayes factors concerning hypothesis of the intraclass correlation for the unbalanced situation. More research is needed to examine the numerical performances and appropriate priors for making inferences about the intraclass correlation in an unbalanced design.

Supplementary Material
The supplementary material for "Bayes Factor Testing of Multiple Intraclass Correlations" (DOI: 10.1214/18-BA1115SUPP; .pdf). The supplementary material for "Bayes factor testing of multiple intraclass correlations" contains the proof of Lemma 1, the proof of Lemma 2, the conditional posterior distributions for the Gibbs sampler, the proof of Lemma 3, the Gibbs sampler under a constrained model, the analytic expression of the marginal likelihood (with derivation) for a standard random intercept model using fractional Bayes methodology, and a simulation study when testing interval hypotheses.