Gibbs Sampling for a Bayesian Hierarchical General Linear Model

We consider a Bayesian hierarchical version of the normal theory general linear model which is practically relevant in the sense that it is general enough to have many applications and it is not straightforward to sample directly from the corresponding posterior distribution. Thus we study a block Gibbs sampler that has the posterior as its invariant distribution. In particular, we establish that the Gibbs sampler converges at a geometric rate. This allows us to establish conditions for a central limit theorem for the ergodic averages used to estimate features of the posterior. Geometric ergodicity is also a key component for using batch means methods to consistently estimate the variance of the asymptotic normal distribution. Together, our results give practitioners the tools to be as confident in inferences based on the observations from the Gibbs sampler as they would be with inferences based on random samples from the posterior. Our theoretical results are illustrated with an application to data on the cost of health plans issued by health maintenance organizations.


Introduction
The flexibility of Bayesian hierarchical models makes them widely applicable. One of the most popular (see, e.g., Gelman, Carlin, Stern and Rubin, 2004; Spiegelhalter, Thomas, Best and Lunn, 2005) is a version of the usual normal theory general linear model. Let Y denote an N × 1 response vector and suppose β is a p × 1 vector of regression coefficients, u is a k × 1 vector of random effects, X is a known N × p design matrix having full column rank, and Z is a known N × k matrix. Then for r, s, t ∈ {1, 2, · · · }, the hierarchy is

Y | β, u, λ ~ N_N(Xβ + Zu, λ_R^{-1} I_N) ,
β ~ Σ_{i=1}^{r} η_i N_p(b_i, B^{-1}) ,    u | λ_D ~ N_k(0, λ_D^{-1} I_k) ,      (1)
λ_R ~ Σ_{j=1}^{s} φ_j Gamma(r_{j1}, r_{j2}) ,    λ_D ~ Σ_{l=1}^{t} ψ_l Gamma(d_{l1}, d_{l2}) ,

where the mixture parameters η_i, φ_j, and ψ_l are known nonnegative constants which satisfy Σ_i η_i = Σ_j φ_j = Σ_l ψ_l = 1, and we say W ~ Gamma(a, b) if it has density proportional to w^{a−1} e^{−bw} for w > 0. Further, we require β and u to be a posteriori conditionally independent given λ_R, λ_D, and y, which holds if and only if X^T Z = 0. Finally, the b_i ∈ R^p and the positive definite matrix B are known, and the hyperparameters r_{j1}, r_{j2}, d_{l1}, and d_{l2} are all assumed to be positive.
Let ξ = (u^T, β^T)^T and λ = (λ_R, λ_D)^T. Then the posterior has support X = R^{k+p} × R^2_+ and a density characterized by

π(ξ, λ | y) ∝ f(y | ξ, λ_R) f(ξ | λ_D) f(λ_R) f(λ_D) ,

where y is the observed data and f denotes a generic density. Posterior inference is often based on the expectation of a function g : X → R with respect to the posterior. For the model (1) we can only rarely calculate the expectation

E_π g(ξ, λ) := ∫_X g(ξ, λ) π(ξ, λ | y) dξ dλ ,

since it is a ratio of two potentially high-dimensional intractable integrals. Hence inference regarding the posterior may require Markov chain Monte Carlo (MCMC) methods. We consider two-component Gibbs sampling, which produces a Harris ergodic Markov chain Φ = {(ξ_0, λ_0), (ξ_1, λ_1), · · · } with invariant density π(ξ, λ | y).
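As a concrete illustration, a two-component Gibbs sampler for a simplified special case of (1) can be sketched as follows. The sketch assumes r = s = t = 1 (no mixture components), a single prior mean b for β, and X^T Z = 0, so both full conditionals are standard conjugate forms; it is a minimal sketch of the update order, not the paper's displayed sampler.

```python
import numpy as np

def gibbs(y, X, Z, b, B, r1, r2, d1, d2, n_iter, rng=None):
    """Two-component block Gibbs sampler: draw lambda = (lam_R, lam_D)
    given xi, then xi = (u, beta) given lambda, assuming X^T Z = 0 so
    beta and u are conditionally independent given lambda and y."""
    rng = np.random.default_rng(rng)
    N, p = X.shape
    k = Z.shape[1]
    beta, u = np.zeros(p), np.zeros(k)
    out = np.empty((n_iter, p + k + 2))
    for it in range(n_iter):
        # lambda-block: conjugate gamma full conditionals
        resid = y - X @ beta - Z @ u
        lam_R = rng.gamma(r1 + N / 2, 1.0 / (r2 + resid @ resid / 2))
        lam_D = rng.gamma(d1 + k / 2, 1.0 / (d2 + u @ u / 2))
        # xi-block: beta | lambda, y ~ N(A_g^{-1}(lam_R X^T y + B b), A_g^{-1})
        A_g = lam_R * X.T @ X + B
        beta = rng.multivariate_normal(
            np.linalg.solve(A_g, lam_R * X.T @ y + B @ b), np.linalg.inv(A_g))
        # u | lambda, y ~ N(lam_R A_h^{-1} Z^T y, A_h^{-1})
        A_h = lam_R * Z.T @ Z + lam_D * np.eye(k)
        u = rng.multivariate_normal(
            lam_R * np.linalg.solve(A_h, Z.T @ y), np.linalg.inv(A_h))
        out[it] = np.concatenate([u, beta, [lam_R, lam_D]])
    return out
```

The conditional mean of β used here matches the expression E(β | λ_R, y) = (λ_R X^T X + B)^{-1}(λ_R X^T y + Bb) appearing in Section 5.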
Suppose E_π|g| < ∞ and we obtain n observations from the Gibbs sampler. Then a natural estimate of E_π g is ḡ_n = n^{-1} Σ_{i=0}^{n−1} g(ξ_i, λ_i), since ḡ_n → E_π g with probability 1 as n → ∞. In other words, the longer we run the Gibbs sampler, the better our estimate is likely to be. However, this gives no indication of how large n must be to ensure the Monte Carlo error ḡ_n − E_π g is sufficiently small. The size of this error is usually judged by appealing to its approximate sampling distribution via a Markov chain central limit theorem (CLT), which in the cases of current interest takes the form

√n (ḡ_n − E_π g) → N(0, σ²_g)  in distribution as n → ∞ ,      (2)

where σ²_g ∈ (0, ∞). Due to the serial correlation in Φ, the variance σ²_g will be complicated and require specialized techniques (such as batch means or spectral methods) to estimate consistently with σ̂²_n, say. Suppose σ̂²_n → σ²_g with probability 1 as n → ∞. In this case, an asymptotically valid Monte Carlo standard error (MCSE) is given by σ̂_n/√n. In turn, this can be used to perform statistical analysis of the Monte Carlo error and to implement rigorous sequential stopping rules for determining the required simulation length (see Flegal, Haran and Jones, 2008; Jones and Hobert, 2001), so that the user will have as much confidence in the simulation results as if the observations were a random sample from the posterior; this is described in more detail in Section 4.
Unfortunately, for Harris ergodic Markov chains simple moment conditions are not sufficient to ensure an asymptotic distribution for the Monte Carlo error or that we can consistently estimate σ 2 g . In addition, we need to know that the convergence of Φ occurs rapidly. Thus, one of our goals is to establish verifiable conditions under which the Gibbs sampler is geometrically ergodic, that is, it converges to the posterior in total variation norm at a geometric rate.
We know of three papers that address geometric ergodicity of Gibbs samplers in the context of the normal theory linear model with proper priors. These are Hobert and Geyer (1998), Jones and Hobert (2004), and Papaspiliopoulos and Roberts (2008). The linear model we consider substantively differs from those in Papaspiliopoulos and Roberts (2008) in that we do not assume the variance components are known. Our model is also much more general than the one-way random effects model in Hobert and Geyer (1998) and Jones and Hobert (2004). Gibbs sampling for the balanced one-way random effects model is also considered in Rosenthal (1995) where coupling techniques were used to establish upper bounds on the total variation distance to stationarity. However, these results fall short of establishing geometric ergodicity of the associated Markov chain.
The rest of this paper is organized as follows. Gibbs sampling for the Bayesian hierarchical general linear model is discussed in Section 2 and geometric ergodicity for these Gibbs samplers is established in Section 3. Conditions for the CLT (2) are given in Section 4 along with a description of the method of batch means for estimating the variance of the asymptotic normal distribution. Finally, our results are illustrated with a numerical example in Section 5. Many technical details are deferred to the appendix.

respectively.
Because the Markov transition densities (Mtd's) are strictly positive on the state space, it is straightforward to show that Φ_1 and Φ_2 are Harris ergodic. The posterior density π(ξ, λ|y) is invariant for Φ_1 and Φ_2 by construction.
Moreover, it is easy to see that both chains are Feller. Similarly, Φ_ξ and Φ_λ are Harris ergodic and Feller with invariant densities the marginal posteriors π(ξ|y) and π(λ|y), respectively. Hence all four Markov chains converge in total variation norm to their respective invariant distributions. In the next section we establish conditions under which this convergence occurs at a geometric rate.

Establishing Geometric Ergodicity
Our main goal in this section is to establish conditions for the geometric ergodicity of Φ_1 and Φ_2. Before doing so it is useful to acquaint ourselves with a concept introduced by Roberts and Rosenthal (2001). Let X = {X_n, n ≥ 0} be a Markov chain on a space X and Y = {Y_n, n ≥ 0} a stochastic process on a possibly different space Y. Then Y is de-initializing for X if, for each n ≥ 1, conditionally on Y_n it follows that X_n is independent of X_0. Roughly speaking, Roberts and Rosenthal (2001) use this concept to show that Y controls the convergence properties of the Markov chain X.
To establish the geometric ergodicity of Φ_1 and Φ_2 it suffices to work with the marginal chains Φ_ξ and Φ_λ. First, Φ_ξ is de-initializing for Φ_1 and Φ_λ is de-initializing for Φ_2. Results in Roberts and Rosenthal (2001) then imply that Φ_1 converges at least as fast as Φ_ξ, while Φ_ξ, being a marginal sequence of Φ_1, converges at least as fast as Φ_1 (and similarly for Φ_2 and Φ_λ). Hence if one chain is geometrically ergodic, then they both are, and Lemma 3.1 follows directly.
Accordingly, we can proceed by studying the convergence behavior of the marginal chains. We establish geometric ergodicity for Φ_ξ by establishing a drift condition. That is, we need to specify a function V : R^{k+p} → R_+ and constants 0 < γ < 1 and L < ∞ such that

E[V(ξ) | ξ'] ≤ γ V(ξ') + L  for all ξ' ∈ R^{k+p} ,      (6)

where the expectation is taken with respect to the Mtd k_ξ. Let W(ξ) = 1 + V(ξ) and b = L + 1 − γ; Jones and Hobert (2004, Lemma 3.1) show that equation (6) implies

∆W(ξ') := E[W(ξ) | ξ'] − W(ξ') ≤ −(1 − γ) W(ξ') + b ,

which is negative outside the set C = {ξ' : W(ξ') ≤ b/(1 − γ)}. Here ∆W(ξ') is the drift, V (or W) is a drift function, and γ a drift rate. If ξ' ∉ C the expected change in W is negative, so Φ_ξ will tend to "drift" toward C, that is, toward where the value of W is small. Moreover, it does so in such a way that the drift toward C is faster when γ is small. On the other hand, if γ ≈ 1 the drift will be slow. Thus the value of γ is intimately connected to the convergence rate of Φ_ξ; for a thorough, accessible discussion of the connection see Jones and Hobert (2001, Section 3.3).
Hence examination of γ can give us some intuition for the convergence behavior of Φ ξ . However, drift functions are not unique so this examination generally will not lead to definitive conclusions.
for any d > 0. Note that the maximal irreducibility measure for Φ_ξ is equivalent to Lebesgue measure on R^{k+p}, so that its support certainly has a non-empty interior. The sufficiency of drift for geometric ergodicity now follows easily from Lemma 15.2.8 and Theorems 6.0.1 and 15.0.1 of Meyn and Tweedie (1993) and Lemma 3.1.
Proposition 3.1. Suppose (6) holds for a drift function that is unbounded off compact sets. Then Φ_ξ is geometrically ergodic, and so are Φ_1 and Φ_2.
In Section 3.2 we develop conditions on our Bayesian model (1) which are sufficient for the conditions of Proposition 3.1.

Drift for Φ ξ
For all j ∈ {1, · · · , s} and l ∈ {1, · · · , t}, define constants δ_{j1}, δ_{j3}, δ_{l2}, and δ_{l4}. Also, let x_i and z_i denote the ith rows of the matrices X and Z, respectively, and let y_i and u_i denote the ith elements of the vectors y and u, respectively. Next, for i ∈ {1, · · · , r} define the functions G_i(λ), where E_i denotes expectation with respect to the N_{k+p}(m_i, Σ^{-1}) distribution.
1. If Z^T Z is nonsingular, d_{l1} > 1 for all l ∈ {1, · · · , t}, and
2. If for all j ∈ {1, · · · , s} and l ∈ {1, · · · , t}

Notice that the formulations of γ given by Proposition 3.2 depend on the Bayesian model setting through δ_{j1}, δ_{l2}, δ_{j3}, and δ_{l4}. Therefore, the drift and convergence rates of the marginal chain Φ_ξ (and hence of the Gibbs samplers) may be sensitive to changes in the dimension k of u, the total number of observations N, or the hyperparameter setting. It is interesting, however, that the dimension p of β has only an indirect impact on this result. Specifically, when Z^T Z is nonsingular the value of p has no impact; that is, the drift rate is unaffected by changes in p. Of course, changing p does mean that X changes, which may impact δ_{j3}, which in turn can change the permissible hyperparameters r_{j1} and the drift rate when Z^T Z is singular.
Example 3.1. Consider the balanced random intercept model derived from (1) for k subjects with m observations each. In this case, Z = I_k ⊗ 1_m, where ⊗ denotes the Kronecker product and 1_m represents a vector of ones of length m. Hence Z^T Z = mI_k is nonsingular. Define the constant M_{N,k}. If d_{l1} > 1 for all l, Condition 1 of Proposition 3.2 establishes the drift rate M_{N,k} ≤ γ < 1. Notice that M_{N,k} → 1 as k → ∞, and hence γ → 1 as well. This supports our intuition that the Gibbs sampler should converge more slowly as its dimension increases. On the other hand, if k is held constant but m increases so that N = km → ∞, then M_{N,k} remains bounded away from 1. Thus increasing the number of observations per subject does not have the same negative qualitative impact as increasing the number of subjects. Finally, M_{N,k} → 1 (hence γ → 1) when k is held constant and d_{l1} → 1 for any l.
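The structure of Z in this example is easy to check numerically; the following NumPy sketch (with arbitrary illustrative values of k and m) verifies that Z = I_k ⊗ 1_m gives Z^T Z = mI_k.

```python
import numpy as np

k, m = 4, 3
Z = np.kron(np.eye(k), np.ones((m, 1)))     # Z = I_k ⊗ 1_m, shape (k*m, k)
print(np.allclose(Z.T @ Z, m * np.eye(k)))  # Z^T Z = m I_k, hence nonsingular
```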
Consider the condition that G_i(λ) ≤ K for all λ ∈ R^2_+ and i ∈ {1, · · · , r}. The following result establishes this condition for an important special case of (1). In our experience it is often straightforward to show that G_i is bounded and, if desired, numerical optimization methods yield an appropriate K.
We are now in a position to state conditions on (1) guaranteeing geometric ergodicity of the Gibbs samplers Φ_1 and Φ_2. This follows easily from Propositions 3.1 and 3.2 if the drift function V, where d > 0, is unbounded off compact sets. Notice that V is continuous, so it is sufficient to show that, on S, |β_i| is bounded for i ∈ {1, 2, . . . , p} and |u_j| is bounded for j ∈ {1, 2, . . . , k}. Clearly, S ⊂ S_2 = {ξ : u^T u ≤ d} and it is obvious that each |u_j| is bounded on S_2, hence also on S. Moreover, note that v_2 → ∞ as |u_j| → ∞. Given that the |u_j| are bounded, it is easy to see that v_1 → ∞ as |β_i| → ∞. Putting this together, we see that V is unbounded off compact sets. The main result of this section follows.

Interval Estimation
Suppose we want to estimate an expectation E_π g := ∫_X g(ξ, λ) π(ξ, λ|y) dξ dλ, where g is real-valued and π-integrable. It is straightforward to estimate E_π g with ḡ_n := n^{-1} Σ_{i=0}^{n−1} g(ξ_i, λ_i). A key step in the statistical analysis of ḡ_n is the assessment of the Monte Carlo error ḡ_n − E_π g through its approximate sampling distribution.
Theorem 4.1. Assume the conditions of Theorem 3.1. If E_π|g|^{2+ε} < ∞ for some ε > 0, then there is a constant σ²_g ∈ (0, ∞) such that the CLT (2) holds for any initial distribution. The proof of this theorem follows easily from Theorem 3.1, Theorem 2 of Chan and Geyer (1994) and Section 1 of Flegal and Jones (2009). Roughly speaking, results in Hobert, Jones, Presnell and Rosenthal (2002), Jones, Haran, Caffo and Neath (2006) and Bednorz and Latuszynski (2007) show that, under conditions comparable to those required for Theorem 4.1, techniques such as regenerative simulation and batch means can be used to construct an estimator of σ²_g, say σ̂²_n, such that σ̂²_n → σ²_g as n → ∞ almost surely. See Flegal and Jones (2009) for the conditions required to ensure consistency of overlapping batch means and spectral estimators of σ²_g. Before giving a precise discussion of the conditions for consistency, we need a preliminary definition and result. Let X ⊆ R^d for d ≥ 1 and let k : X × X → [0, ∞) be an Mtd with respect to Lebesgue measure.
Suppose there exists a function s : X → [0, 1) and a density q such that for all x, x' ∈ X,

k(x | x') ≥ s(x') q(x) .
Then we say there is a minorization condition for k.

Proof. The proof follows a technique first introduced by Mykland, Tierney and Yu (1995). Fix x* ∈ X.
Then for all Let x_m be the point where the infimum is achieved. Then the minorization follows by setting q(x) = c^{-1} k(x | x*) I(x ∈ C) and The conditions of Lemma 4.1 are not the weakest that ensure the existence of a minorization condition, but they suffice for our purposes. In particular, it is straightforward to use Lemma 4.1 to see that there exists a minorization condition for both k_1 and k_2, the Mtd's for Φ_1 and Φ_2, respectively.
Also, Hobert, Jones and Robert (2006) derived an explicit closed-form expression for a minorization condition for a Markov chain of which Φ_2 is a special case.
The consistency results for σ̂²_n in Flegal and Jones (2009) and Bednorz and Latuszynski (2007) all require that a minorization condition hold. The efficacy of regenerative simulation depends entirely upon the minorization, while the minorization is irrelevant to the implementation of batch means and spectral methods. That is, in the latter case the minorization is purely a technical device used in the proofs of consistency for batch means and spectral estimators.
We use the method of batch means in Section 5 to estimate σ²_g. Let n be the simulation length, b_n = ⌊n^a⌋, and a_n = ⌊n/b_n⌋. Now define

Ȳ_j := (1/b_n) Σ_{i=(j−1)b_n}^{j b_n − 1} g(ξ_i, λ_i)  for j = 1, . . . , a_n ,

and the batch means estimate of σ²_g,

σ̂²_n = (b_n/(a_n − 1)) Σ_{j=1}^{a_n} (Ȳ_j − ḡ_n)² .      (8)
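A minimal NumPy implementation of nonoverlapping batch means along these lines might look as follows; the function name and default a = 0.501 (the choice used in Section 5) are illustrative.

```python
import numpy as np

def batch_means(g_vals, a=0.501):
    """Nonoverlapping batch means estimate of sigma^2_g in the CLT, from
    the sequence g(xi_0, lambda_0), ..., g(xi_{n-1}, lambda_{n-1}).
    Returns (gbar, sigma2_hat, a_n)."""
    g_vals = np.asarray(g_vals, dtype=float)
    n = g_vals.size
    b_n = int(np.floor(n ** a))        # batch length b_n = floor(n^a)
    a_n = n // b_n                     # number of batches a_n = floor(n/b_n)
    g_used = g_vals[: a_n * b_n]       # drop the incomplete final batch
    Y = g_used.reshape(a_n, b_n).mean(axis=1)   # batch means Y_1, ..., Y_{a_n}
    gbar = g_used.mean()
    sigma2_hat = b_n / (a_n - 1) * np.sum((Y - gbar) ** 2)
    return gbar, sigma2_hat, a_n
```

For independent draws, σ²_g reduces to the ordinary variance of g, which gives an easy sanity check of the estimator.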
Putting together our Theorem 3.1 and Lemma 4.1 with results in Jones et al. (2006) and Bednorz and Latuszynski (2007) we have the following consistency result.
Using Theorems 4.1 and 4.2, we can use (8) to form an asymptotically valid confidence interval for E_π g in the usual way,

ḡ_n ± t_{a_n−1} σ̂_n/√n ,      (9)

where t_{a_n−1} is a quantile from a Student's t distribution with a_n − 1 degrees of freedom. Moreover, we can use batch means to implement the fixed-width methods of Jones et al. (2006) to determine how long to run the simulation. Following Flegal et al. (2008), let ε be the desired half-width of the interval in (9) and n* be a minimum simulation size specified by the user. Then we can terminate the simulation the first time

t_{a_n−1} σ̂_n/√n + 1/n ≤ ε  with n ≥ n* .

The final interval estimate will be asymptotically valid in the sense that the interval will have the desired coverage probability for sufficiently small ε; see also Flegal et al. (2008), Flegal and Jones (2009), Glynn and Whitt (1992) and Jones et al. (2006).
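The fixed-width stopping rule can be sketched as below. Everything here is an assumption-laden simplification: `draw_g` stands in for one Gibbs update followed by evaluation of g, the rule is checked only every 100 iterations, and the t quantile is replaced by the normal quantile 1.96 (a_n is large by the time the rule can fire).

```python
import numpy as np

def fixed_width_run(draw_g, eps, n_star, a=0.501, max_iter=200_000):
    """Simulate until the half-width t*sigma_hat/sqrt(n) + 1/n of the
    batch-means interval falls below eps, subject to a user-specified
    minimum simulation size n_star."""
    g = []
    while len(g) < max_iter:
        g.append(draw_g())
        n = len(g)
        if n < n_star or n % 100:      # check the rule every 100 iterations
            continue
        b_n = int(n ** a)
        a_n = n // b_n
        if a_n < 2:
            continue
        vals = np.asarray(g[: a_n * b_n])
        Y = vals.reshape(a_n, b_n).mean(axis=1)
        gbar = vals.mean()
        se = np.sqrt(b_n / (a_n - 1) * np.sum((Y - gbar) ** 2) / n)
        half = 1.96 * se + 1.0 / n     # normal approximation to t_{a_n - 1}
        if half <= eps:
            return gbar, half, n
    raise RuntimeError("failed to meet the half-width criterion")
```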

A Numerical Example
We consider data on the monthly premiums of N = 341 health plans issued by health maintenance organizations (HMOs), previously analyzed by Hodges (1998). Let y_i denote the individual monthly premium of the ith HMO plan. To analyze these data, Hodges (1998) considered a Bayesian version of the following frequentist model:

y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ε_i ,      (10)

where the ε_i are iid N(0, λ_R^{-1}), x_{i1} denotes the centered and scaled average expenses per admission in the state in which the ith HMO operates, and x_{i2} is an indicator for New England. The x_{i1} values were centered and scaled to avoid collinearity. Specifically, if x̃_{i1} is the raw average expense per admission and x̄_1 is the overall average expense per admission, then x_{i1} is obtained by centering x̃_{i1} at x̄_1 and rescaling. We perform a Bayesian regression analysis based on the following hierarchical version of (10):

y | β, λ_R ~ N_N(Xβ, λ_R^{-1} I_N) ,  β ~ N_3(b, B^{-1}) ,  λ_R ~ Gamma(r_1, r_2) ,      (11)

where N = 341, y is the N × 1 vector of individual premiums, β = (β_0, β_1, β_2)^T is the vector of regression parameters, and X is the N × 3 data matrix. Complete specification of the model requires values for the hyperparameters (b, B, r_1, r_2). We use an approach which is empirical Bayesian in spirit. To this end, we fit (10) using least squares regression.
The results are summarized in Table 1.
Accordingly, we chose the prior mean and covariance matrix for β based on the least squares estimates in Table 1. Next, we set the prior mean and variance for λ_R to

E(λ_R) = r_1/r_2 = 1/MSE = 0.00177  and  Var(λ_R) = r_1/r_2² = 1 ,

where MSE is the least squares estimate of λ_R^{-1} given in Table 1. Solving for r_1 and r_2 gives r_1 = 3.122 × 10^{-6} and r_2 = 0.00177. Since (11) does not contain any random effects, it follows from Theorem 3.1 that the Gibbs sampler for π(β, λ_R|y) is geometrically ergodic, since for any λ_R ∈ R_+ the function G(λ_R) is bounded (recall Proposition 3.3), where

Σ_{i=1}^{N} [E(y_i − x_i β | λ_R, y)]² = (y − XE(β | λ_R, y))^T (y − XE(β | λ_R, y))

and E(β | λ_R, y) = (λ_R X^T X + B)^{-1} (λ_R X^T y + Bb).
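The moment-matching step for (r_1, r_2) is simple arithmetic: given a target prior mean μ = 1/MSE and prior variance v for λ_R ~ Gamma(r_1, r_2), solve μ = r_1/r_2 and v = r_1/r_2². A quick check with the values from the text (the prior variance of 1 is inferred from the reported solution, and the small discrepancy with the reported 3.122 × 10^{-6} reflects rounding of MSE):

```python
def gamma_hyperparams(mean, var):
    """Solve mean = r1/r2 and var = r1/r2**2 for (r1, r2)."""
    r2 = mean / var
    r1 = mean * r2          # equivalently, mean**2 / var
    return r1, r2

r1, r2 = gamma_hyperparams(mean=0.00177, var=1.0)
# r1 ≈ 3.13e-6 and r2 = 0.00177, consistent with the values reported above
```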
Consider estimating the posterior means of β_0, β_1, and β_2. By Lemma A.6, the fourth posterior moments of these parameters are finite. Thus Theorems 4.1 and 4.2, in conjunction with geometric ergodicity, guarantee the existence of CLTs and consistent estimators of the asymptotic variance via batch means with b_n = ⌊n^{0.501}⌋, which was chosen based on recommendations in Jones et al. (2006).
To begin our analysis of the posterior means, we ran independent Gibbs samplers from a variety of starting values, updating λ_R followed by β in each iteration. For each chain, we required a minimum simulation length of 1000. At each successive iteration, we calculated the approximate half-widths of the Bonferroni-corrected 95% intervals for the posterior means of β_0, β_1, and β_2,

t_{a_n−1, 0.025/3} σ̂_n/√n + 1/n .
Simulation continued until the half-widths for β_0, β_1, and β_2 were below 0.10, 0.02, and 0.10, respectively. The results were consistent across starting values. That is, Gibbs samplers with different starting values produced similar estimates and required similar simulation effort to meet the above specifications. Here, we present the results for the chain started from the prior means of β and λ_R, (b, r_1/r_2). Under this setting, the interval half-width thresholds were met after 16831 iterations. The corresponding estimates of the posterior means are reported in Table 2 along with their standard errors.

A Appendix
A.1 Proof of Proposition 3.2

We will require the following general results in our proof. A proof of Lemma A.1 is given in Henderson and Searle (1981), and Lemma A.2 follows from the convexity of the inverse function.
Lemma A.1. Let A be a nonsingular n × n matrix, B a nonsingular s × s matrix, U an n × s matrix, and V an s × n matrix. Then

(A + UBV)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1} .

When V = U^T this implies a corresponding expression for x^T (A + UBU^T)^{-1} x for any n × 1 vector x.
We must show that for all ξ' ∈ R^{k+p} the drift condition (6) holds, where the constants γ and L are given in the statement of Proposition 3.2. Let E_i and Var_i denote expectation and variance with respect to the N_{k+p}(m_i, Σ^{-1}) distribution. Similarly, let E_{jl} and Var_{jl} denote expectation and variance with respect to the corresponding density, where the first equality holds by the construction of Φ_ξ. Thus we focus on E_{jl}[E_i(V(ξ) | λ) | ξ'] in the next three lemmas.
Proof. Consider the inner expectation E_i(v_1(ξ) | λ). For any i we have and by Lemma A.1. It follows that for any i, j, l we have Combining this with the fact that

Proof. Notice that for any i, j, l where, from Lemmas A.1 and A.2, Also, by (3) we have The result holds by combining (15) and (16).
Proof. First, for any i, j, l Let e_m denote the k × 1 vector with mth element equal to 1 and all other elements equal to 0.
Thus by Lemma A.1, Also, Putting (18)-(20) together gives

We are now ready to finish the proof of Proposition 3.2. We consider separately the case in which Z^T Z is nonsingular and the case in which no restrictions are placed on Z.
1. Case 1: Z^T Z nonsingular. Notice that L_1 + L_3 ≤ L for L_1 and L_3 given by (13) and (17), respectively. Then by Lemmas A.3 and A.5 we have that for any i, j, l Combining (12) and (21) establishes the drift condition.
2. Case 2: Z^T Z possibly singular. Observe that L_2 + L_3 ≤ L for L_2 and L_3 given by (14) and (17), respectively. Further, it follows from Lemmas A.4 and A.5 that for any i, j, l Hence the result holds by combining (12) and (22).

A.2 Proof of Proposition 3.3
By the assumption that b_i = 0 for all i, ξ | λ, y ~ N_{k+p}(m_0, Σ^{-1}). Define A_g := λ_R X^T X + B and A_h := λ_R Z^T Z + λ_D I_k. Then E(β | λ) = λ_R A_g^{-1} X^T y and E(u | λ) = λ_R A_h^{-1} Z^T y. We must establish that there exists K for which the claim holds, and note that the claim will be proven if we can show that f(λ) ≤ K for all λ. To this end, define the functions g and h as

g(λ) = (y − XE(β | λ))^T (y − XE(β | λ)) ,
h(λ) = E(u | λ)^T Z^T Z E(u | λ) + E(u | λ)^T E(u | λ) − 2 y^T Z E(u | λ) .
Since the conditional independence of β and u given λ implies X^T Z = 0, a little algebra shows that f(λ) = g(λ) + h(λ). Thus, it suffices to find K_g and K_h such that g(λ) ≤ K_g² and h(λ) ≤ K_h² for all λ. First,

g(λ) = y^T y + E(β | λ)^T X^T X E(β | λ) − 2 y^T X E(β | λ) ≤ y^T y := K_g²

by the positive definiteness of B and A_g^{-1}. Notice that this also proves part 1 of the claim. Next, since A_h^{-1} and A_h^{-2} are positive semidefinite we have the analogous bound on h(λ), where the last inequality holds by Lemma A.1. The result now follows by setting K² = K_g² + K_h².