Estimating the spectral gap of a trace-class Markov operator

The utility of a Markov chain Monte Carlo algorithm is, in large part, determined by the size of the spectral gap of the corresponding Markov operator. However, calculating (and even approximating) the spectral gaps of practical Monte Carlo Markov chains in statistics has proven to be an extremely difficult and often insurmountable task, especially when these chains move on continuous state spaces. In this paper, a method for accurate estimation of the spectral gap is developed for general state space Markov chains whose operators are non-negative and trace-class. The method is based on the fact that the second largest eigenvalue (and hence the spectral gap) of such operators can be bounded above and below by simple functions of the power sums of the eigenvalues. These power sums often have nice integral representations. A classical Monte Carlo method is proposed to estimate these integrals, and a simple sufficient condition for finite variance is provided. This leads to asymptotically valid confidence intervals for the second largest eigenvalue (and the spectral gap) of the Markov operator. The efficiency of the method is studied. For illustration, the method is applied to Albert and Chib's (1993) data augmentation (DA) algorithm for Bayesian probit regression, and also to a DA algorithm for Bayesian linear regression with non-Gaussian errors (Liu, 1996).


Introduction
Markov chain Monte Carlo (MCMC) is widely used to estimate intractable integrals that represent expectations with respect to complicated probability distributions. Let π : S → [0, ∞) be a probability density function (pdf) with respect to a σ-finite measure µ, where (S, U , µ) is some measure space. Suppose we second largest eigenvalue. We propose a classical Monte Carlo estimator of 1 − δ for DA Markov operators that are trace-class, i.e. compact with summable eigenvalues. While compact operators were once thought to be rare in MCMC problems with uncountable state spaces (Chan and Geyer, 1994), a string of recent results suggests that trace-class DA Markov operators are not at all rare (see e.g. Qin and Hobert, 2016;Chakraborty and Khare, 2017;Choi and Román, 2017;Pal et al., 2017). Furthermore, by exploiting a simple trick, we are able to broaden the applicability of our method well beyond DA algorithms. Indeed, if a reversible Monte Carlo Markov chain has a Markov transition density (Mtd), and the corresponding Markov operator is Hilbert-Schmidt, then our method can be utilized to estimate its spectral gap. This is because the square of such a Markov operator can be represented as a trace-class DA Markov operator. A detailed explanation is provided in Section 4.
Of course, there is a large literature devoted to developing theoretical bounds on the second largest eigenvalue of a Markov operator (see e.g. Lawler and Sokal, 1988;Sinclair and Jerrum, 1989;Diaconis and Stroock, 1991). However, these results are typically not useful in situations where the state space, S, is uncountable and multi-dimensional, which is our main focus. There also exist computational methods for approximating the eigenvalues of a Hilbert-Schmidt operator (see e.g. Koltchinskii and Giné, 2000;Ahues et al., 2001, §4.2). Unfortunately, these methods require a closed form kernel function, which is typically not available in the MCMC context. There are still other methods based on simulation. Most notably, Garren and Smith (2000) used simulations of a reversible chain to estimate the second largest eigenvalue of its operator (assuming it's Hilbert-Schmidt). Their approach is reminiscent of the so-called power method from computer science, and we use these ideas as well. The main difference between their method and ours is that we exploit the specific structure of the Mtd associated with the DA algorithm. This makes our method much simpler to implement computationally, and our results easier to interpret. The power of our method comes at the price of being computationally intensive, especially when the target posterior is based on a large sample.
The rest of the paper is organized as follows. The notion of Markov operator is formalized in Section 2.
In Section 3, it is shown that the second largest eigenvalue of a non-negative trace-class operator can be bounded above and below by functions of the power sums of the operator's eigenvalues. In Section 4, DA Markov operators are formally defined, and the sum of the kth (k ∈ N) power of the eigenvalues of a trace-class DA Markov operator is related to a functional of its Mtd. This functional is usually a multidimensional integral, and a classical Monte Carlo estimator of it is developed in Section 5. The efficiency of the Monte Carlo estimator is studied in Section 6. Finally, in Section 7 we apply our methods to a few well-known MCMC algorithms. Our examples include Albert and Chib's (1993) DA algorithm for Bayesian probit regression, and a DA algorithm for Bayesian linear regression with non-Gaussian errors (Liu, 1996).

Markov operators
Assume that the Markov chain Φ has a Markov transition density, p(u, ·), u ∈ S, such that, for any measurable A ⊂ S and u ∈ S, is the k-step Mtd corresponding to p(u, ·). We will assume throughout that Φ is Harris ergodic, i.e. irreducible, aperiodic and Harris recurrent. Define a Hilbert space consisting of complex valued functions on S that are square integrable with respect to π(·), namely For f, g ∈ L 2 (π), their inner product is given by We assume that U is countably generated, which implies that L 2 (π) is separable and admits a countable orthonormal basis (see e.g. Billingsley, 1995, Theorem 19.2). The transition density p(u, ·), u ∈ S defines the following linear operator P. For any f ∈ L 2 (π), The spectrum of a linear operator L is defined to be where I is the identity operator. It is well-known that σ(P ) is a closed subset of the unit disk in C. Let f 0 ∈ L 2 (π) be the normalized constant function, i.e. f 0 (u) ≡ 1, then P f 0 = f 0 . (This is just a fancy way of saying that 1 is an eigenvalue with constant eigenfunction, which is true of all Markov operators defined by ergodic chains.) Denote by P 0 the operator such that P 0 f = P f − f, f 0 π f 0 for all f ∈ L 2 (π). Then the spectral gap of P is defined as For the remainder of this section, we assume that P is non-negative (and thus self-adjoint) and compact.
This implies that σ(P ) ⊂ [0, 1], and that any non-vanishing element of σ(P ) is necessarily an eigenvalue of P . Furthermore, there are at most countably many eigenvalues, and they can accumulate only at the origin. Let λ 0 , λ 1 , . . . , λ κ be the decreasingly ordered strictly positive eigenvalues of P taking into account multiplicity, where 0 ≤ κ ≤ ∞. Then λ 0 = 1 and λ 1 is what we previously referred to as the "second largest eigenvalue" of the Markov operator. If κ = 0, we set λ 1 = 0 (which corresponds to the trivial case where Since Φ is Harris ergodic, λ 1 must be strictly less than 1. Also, the compactness of P implies that of P 0 , and it's easy to show that σ(P 0 ) = σ(P )\{1}. Hence, Φ is geometrically ergodic and the spectral gap is For further background on the spectrum of a linear operator, see e.g. Helmberg (2014) or Ahues et al. (2001).

Power sums of eigenvalues
We now develop some results relating λ 1 to the power sum of P 's eigenvalues. We assume throughout this section that P is non-negative and trace-class (compact with summable eigenvalues). For any positive integer k, let and define s 0 to be infinity. The first power sum, s 1 , is the trace norm of P (see e.g. Conway, 1990Conway, , 2000, while √ s 2 is the Hilbert-Schmidt norm of P. That P is trace-class implies s 1 < ∞, and it's clear that s k is decreasing in k. The magnitude of s k is directly related to the convergence behavior of the chain. For instance, suppose that the chain starts at a point mass Φ 0 = u, then the chi-square distance between the distribution of Φ k and the stationary distribution is given by (see e.g. Diaconis et al., 2008) where f i : S U → C is the normalized eigenfunction corresponding to λ i . It follows that which is the average of χ 2 k (u) under π. More importantly, one can use functions of s k to bound λ 1 , and thus the spectral gap.
Observe that, Moreover, if κ ≥ 1, then it's easy to show that We now show that, in fact, these bounds are monotone in k and converge to λ 1 .
and if furthermore κ ≥ 1, Proof. We begin with (1). When κ = 0, s k ≡ 1 and the conclusion follows. Suppose κ ≥ 1, and that the second largest eigenvalue is of multiplicity m, i.e.
If κ = m, then s k − 1 = mλ k 1 for all k ≥ 1 and the proof is trivial. Suppose for the rest of the proof that κ ≥ m + 1. For positive integer k, let r k = κ i=m+1 λ k i < ∞. Then r k > 0, and Hence, It follows that Finally, and (1) follows. Now onto (2). We have already shown that Thus, To show that l k is increasing in k, which would complete the proof, we only need note that Suppose now that we are interested in the convergence behavior of a particular Markov operator that is known to be non-negative and trace-class. If it is possible to estimate s k , then Proposition 1 provides a method of getting approximate bounds on λ 1 . When a DA Markov operator is trace-class, there is a nice integral representation of s k that leads to a simple, classical Monte Carlo estimator of s k . In the following section, we describe some theory for DA Markov operators, and in Section 5, we develop a classical Monte Carlo estimator of s k .

Data augmentation operators and an integral representation of s k
In order to formally define DA, we require a second measure space. Let (S V , V, ν) be a σ-finite measure space such that V is countably generated. Also, rename S and π, S U and π U , respectively. Consider the random element (U, V ) taking values in S U × S V with joint pdf π U,V (·, ·). Suppose the marginal pdf of U is the target, π U (·), and denote the marginal pdf of V by π V (·). We further assume that the conditional densities Recall that Φ is a Markov chain on the state space S U with Mtd p(u, ·), u ∈ S U . We call Φ a DA chain, and accordingly, P a DA operator, if p(u, ·) can be expressed as Such a chain is necessarily reversible with respect to π U (·). To simulate it, in each iteration, one first draws the latent element V using π V |U (·|u), where u ∈ S U is the current state, and then given V = v, one updates the current state according to π U |V (·|v). A DA operator is non-negative, and thus possesses a positive spectrum (Liu et al., 1994).
Assume that (3) holds. Given k ∈ N, the power sum of P 's eigenvalues, s k , if well defined, is closely related to the diagonal components of p (k) (·, ·). Just as we can calculate the sum of the eigenvalues of a matrix by summing its diagonals, we can obtain s k by evaluating S U p (k) (u, u) µ(du). Here is a formal statement.
Theorem 1. The DA operator P is trace-class if and only if If (4) holds, then for any positive integer k, Theorem 1 is a combination of a few standard results in classical functional analysis. It is fairly wellknown, but we were unable to find a complete proof in the literature. An elementary proof is given in the appendix for completeness. For a more modern version of the theorem, see Brislawn (1988).
It is often possible to exploit Theorem 1 even when Φ is not a DA Markov chain. Indeed, suppose that Φ is reversible, but is not a DA chain. Then P is not a DA operator, but P 2 , in fact, is. (Just take If, in addition, P is Hilbert-Schmidt, which is equivalent to then by a simple spectral decomposition (see e.g. Helmberg, 2014, §28 Corollary 2.1) one can show that P 2 is trace-class, and its eigenvalues are precisely the squares of the eigenvalues of P . In this case, the spectral gap of P can be expressed as 1 minus the square root of P 2 's second largest eigenvalue. Moreover, by Theorem 1, for k ∈ N, the sum of the kth power of P 2 's eigenvalues is equal to S U p (2k) (u, u) µ(du) < ∞.
We now briefly describe the so-called sandwich algorithm, which is a variant of DA that involves an extra step sandwiched between the two conditional draws of DA (Liu and Wu, 1999;Hobert and Marchev, 2008).
is an Mtd with invariant density π U (·). This Mtd defines a new Markov chain, call itΦ, which we refer to as a sandwich version of the original DA chain, Φ. To simulateΦ, in each iteration, the latent element is first drawn from π V |U (·|u), and then updated using s(v, ·) before the current state is updated according to π U |V (·|v ′ ). Sandwich chains often converge much faster than their parent DA chains (see e.g. Khare and Hobert, 2011).
Of course,p(u, ·) defines a Markov operator on L 2 (π U ), which we refer to asP . It is easy to see that, if the Markov chain corresponding to s(v, ·) is reversible with respect to π V (·), thenp(u, ·) is reversible with respect to π U (·). Thus, when s(v, ·) is reversible,P 2 is a DA operator. Interestingly, it turns out that p(u, ·) can often be re-expressed as the Mtd of a DA chain, in which caseP itself is a DA operator. Indeed, a sandwich Mtdp(u, ·) is said to be "representable" if there exists a random elementṼ in S V such that where π U |Ṽ (u ′ |v) and πṼ |U (v|u) have the apparent meanings (see, e.g. Hobert, 2011). It is shown in Proposition 2 in Section 5 that when P is trace-class andp(u, ·) is representable,P is also trace-class. In this case, let {λ i }κ i=0 be the decreasingly ordered positive eigenvalues ofP taking into account multiplicity, where 0 ≤κ ≤ ∞. Thenλ 0 = 1, andλ 1 ≤ λ 1 < 1 (Hobert and Marchev, 2008). For a positive integer k, we will denote κ i=0λ k i bys k . Henceforth, we assume thatp(u, ·) is representable and we treatP as a DA operator. It follows from Theorem 1 that in order to find s k ors k , all we need to do is evaluate is the k-step Mtd of the sandwich chain. Of course, calculating these integrals (in non-toy problems) is nearly always impossible, even for k = 1. In the next section, we introduce a method of estimating these two integrals using classical Monte Carlo.
Throughout the remainder of the paper, we assume that P is a DA operator with Mtd given by (3), and that (4) holds.

Classical Monte Carlo
Consider the Mtd given by where We will show in this section that this form has utility beyond constructing sandwich algorithms. Indeed, the k-step Mtd of a DA algorithm (or a sandwich algorithm) can be re-expressed in the form (8). This motivates the development of a general method for estimating the integral S U a(u, u) µ(du), which is the main topic of this section.
We begin by showing how p (k) (u, ·), u ∈ S U can be written in the form (8). The case k = 1 is trivial.
Indeed, if r(v, ·) is taken to be the kernel of the identity operator, then a(u, ·) = p(u, ·). Define an Mtd and let q (k) (v, ·), k ≥ 1 denote the corresponding k-step Mtd. If we let and denote the corresponding k-step transition function byq (k) (v, ·). Then taking The following proposition shows that, when P is trace-class, S U a(u, u) µ(du) is finite.
Combining Proposition 2 and Theorem 1 leads to the following result: If P is trace-class andp(u, ·) is representable, thenP is also trace-class. This is a generalization of Khare and Hobert's (2011) Theorem 1, which states that, under a condition on s(v, dv ′ ) that implies representability, the trace-class-ness of P implies that ofP .
We now develop a classical Monte Carlo estimator of S U a(u, u) µ(du). Let ω : S V → [0, ∞) be a pdf that is almost everywhere positive. We will exploit the following representation of the integral of interest: Clearly, is a strongly consistent and unbiased estimator of S U a(u, u) µ(du). This is the Monte Carlo formula that is central to our discussion.
Of course, we are mainly interested in the cases a(u, ·) = p (k) (u, ·) or a(u, ·) =p (k) (u, ·). We now develop algorithms for drawing from η(·, ·) in these two situations. First, assume a(u, ·) = p (k) (u, ·). If k = 1, then r(u, ·) is the kernel of the identity operator, and Thus, when k ≥ 2, we can draw from η(u, v) as follows: , and return (u * , v * ). Of course, we can draw from p (k−1) (u ′ , ·) by simply running k − 1 iterations of the original DA algorithm from starting value u ′ . We formalize all of this in Algorithm 1.

Given
Similar arguments lead to the following algorithm for the sandwich algorithm

Given
It is important to note that we do not need to know the representing conditionals π U |Ṽ (·|v) and πṼ |U (·|u) from (7) in order to run Algorithm 1S.
As with all classical Monte Carlo techniques, the key to successful implementation is a finite variance. Define The following theorem provides a sufficient condition for finite variance.
Theorem 2 implies that an ω(·) with heavy tails is more likely to result in finite variance (which is not surprising). It might seem natural to take ω(·) = π V (·). However, in practice, we are never able to draw from π V (·). (If we could do that, we would not need MCMC.) Moreover, in Section 7 we provide an example where taking ω(·) to be π V (·) leads to an infinite variance, whereas a heavier-tailed alternative leads to a finite variance. On the other hand, it can be beneficial to use ω(·)s resembling π V (·), as we argue in Section 6.
Let ψ : S U → [0, ∞) be a pdf that is positive almost everywhere. The following dual of (12) may also be used to represent S U a(u, u) µ(du): The analogue of (13) is the following classical Monte Carlo estimator of S U a(u, u) µ(du): We now state the obvious analogues of Algorithms 1 and 1S.

Given
Let D ′2 be the variance of π U |V (U * |V * )/ψ(U * ) under ζ. To ensure that it's finite, we only need The following result is the analogue of Theorem 2.
Corollary 1. The variance, D ′2 , is finite if Proof. Note that the left hand side of (19) is equal to Apply the Cauchy-Schwarz inequality, then utilize Jensen's inequality to get rid of r(v ′ , dv), and finally make use of (20) and the fact that P is trace-class.
Suppose that we have obtained estimates of s k and s k−1 based on (13) or (18), call them s * k and s * k−1 . Then u * k = (s * k − 1) 1/k and l * k = (s * k − 1)/(s * k−1 − 1) serve as point estimates of u k and l k , respectively. When our estimators have finite variances, we can acquire, via the delta method, confidence intervals for u k and l k . Assume that a confidence interval for l k is (a k , b k ) and a confidence interval for u k is (c k , d k ), then (a k , d k ) is an interval estimate for λ 1 . Interval estimates ofλ 1 can be derived in a similar fashion.

Efficiency of the algorithm
To obtain an interval estimate of λ 1 based on (13) or (18), one needs to run N iterations of Algorithm 1 or 2. If the time needed to simulate one step of the DA chain is τ , then the time needed to run N iterations of Algorithm 1 or 2 is approximately kN τ . Given k and N , the accuracy of the estimate depends on two factors: 1. The distance between l k and u k , and 2. The errors in the estimates, l * k and u * k . We now briefly analyze these two factors, and give some additional guidelines regarding the choice of ω(·) and ψ(·).
As before, suppose that Hence, as k → ∞, Depending on whether m = 1 or not, u k − l k decreases at either a geometric or polynomial rate as k grows.
The errors of l * k and u * k arise from those of s * k and s * k−1 . We now consider the estimator (18) for estimating s k . Its variance is given by Note that gives the k-step Mtd of a Gibbs chain whose stationary pdf is π U,V (·, ·). Thus, under suitable conditions, for almost any u ∈ S U , As k → ∞, we expect Suppose that ψ(u) ≈ π U (u), then heuristically, Thus, if the sum of P 's eigenvalues, s 1 , is relatively small, we recommend picking ψ(·)s that resemble π U (·), with possibly heavier tails (to ensure that the moment condition (20) holds). By a similar argument, when using the estimator (13), picking ω(·)s that resemble π V (·) is likely to control D 2 around s 1 − 1 for large ks.
While (under suitable conditions) the variance of s * k converges to a constant as k → ∞, this is not the case for u * k and l * k (because u k and l k are non-linear in s k and s k−1 ). In fact, using the delta method, one can show that these variances are unbounded. Thus, there's a trade-off between decreasing u k − l k (by increasing k) and controlling the errors of u * k and l * k . We do not recommend increasing k indefinitely. As long as k is large enough so that s k − 1 is significantly smaller than 1, u k serves as a non-trivial (and often decent) upper bound for λ 1 .

Examples
In this section, we apply our Monte Carlo technique to several common Markov operators. In particular, we examine one toy Markov chain, and two practically relevant Monte Carlo Markov chains. In the two real examples, we are able to take advantage of existing trace-class proofs to establish that (15) (or (20)) hold for suitable ω(·) (or ψ(·)).

Gaussian chain
We begin with a toy example. Let S U = S V = R, π U (u) ∝ exp(−u 2 ), and This leads to one of the simplest DA chains known. Indeed, the Mtd, can be evaluated in closed form, and turns out to be a normal pdf. The spectrum of the corresponding Markov operator, P , has been studied thoroughly (see e.g. Diaconis et al., 2008). It is easy to verify that (4) holds, so P is trace-class. In fact, κ = ∞, and for any non-negative integer i, λ i = 1/2 i . Thus, the second largest eigenvalue, λ 1 , and the spectral gap, δ, are both equal to 1/2. Moreover, for any positive integer k, We now pretend to be unaware of this spectral information, and use (13) to estimate {s k , l k , u k } 4 k=1 . Recall that l k and u k are lower and upper bounds for λ 1 , respectively. Note that It follows that, if we take ω(v) ∝ exp(−v 2 /2), then (15) holds, and our estimator of s k has finite variance.
We used a Monte Carlo sample size of N = 1 × 10 5 to form our estimates, and the results are shown in Table 1. Note that the estimates of the s k s are quite good. We constructed 95% confidence intervals (CIs) for l 4 and u 4 via the delta method, and the results were (0.442, 0.522) and (0.498, 0.524), respectively.
Remark 1. It is interesting that, if ω(·) is set to be π V (·), which seems natural, then (15) fails to hold. In fact, the estimator of s 1 has infinite variance in this case. Indeed, recall that the estimator given in (13) has the form where (U * i , V * i ) are iid, and (U * 1 , V * 1 ) has pdf given by Hence, the variance of the estimator, D 2 , is finite if and only if If ω(·) = π V (·) ∝ exp(−2v 2 ), then the left hand side of (21) becomes proportional to which is infinite.

Bayesian probit regression
Let Y 1 , Y 2 , . . . , Y n be independent Bernoulli random variables with P(Y 1 = 1|β) = Φ(x T i β), where x i , β ∈ R p , and Φ(·) is the cumulative distribution function of the standard normal distribution. Take the prior on β to be N p (Q −1 w, Q −1 ), where w ∈ R p and Q is positive definite. The resulting posterior distribution is intractable, but Albert and Chib (1993) devised a DA algorithm to sample from it. Let z = (z 1 , z 2 , . . . , z n ) T be a vector of latent variables, and let X be the design matrix whose ith row is x T i . The Mtd of the Albert and Chib (AC) chain, p(β, ·), β ∈ R p , is characterized by The first conditional density, π U |V (·|z), is a multivariate normal density, and the second conditional density, π V |U (·|β), is a product of univariate truncated normal pdfs.
A sandwich step can be added to facilitate the convergence of the AC chain. Chakraborty and Khare (2017) constructed a Haar PX-DA variant of the chain, which is a sandwich chain with transition density of the form (6) (see also Roy and Hobert (2007)). The sandwich step s(z, dz ′ ) is equivalent to the following update: z → z ′ = gz, where the scalar g is drawn from the following density: Note that this pdf is particularly easy to sample from when w = 0.
Chakraborty and Khare showed that P is trace-class when one uses a concentrated prior (corresponding to Q having large eigenvalues). In fact, the following is shown to hold in their proof.
Proposition 3. Suppose that X is full rank. If all the eigenvalues of Q −1/2 X T XQ −1/2 are less than 7/2, then for any polynomial function t : We will use the estimator (18). The following proposition provides a class of ψ(·)s that lead to estimators with finite variance.

Proposition 4. Suppose the hypothesis in Proposition 3 holds. If ψ(·) is the pdf of a p-variate t-distribution,
i.e.
for some b ∈ R p , positive definite matrix Σ ∈ R p×p , and positive integer a, then the estimator (18) has finite variance.
Proof. Note that for every β and z where C is a constant. Hence, by Proposition 3, for any polynomial function t : R p → R, Since ψ −2 (·) is a polynomial function on R p , the moment condition (20) holds. The result follows from Corollary 1.
As a numerical illustration, we apply our method to the Markov operator associated with the AC chain corresponding to the famous "lupus data" of van Dyk and Meng (2001). In this data set, n = 55 and p = 3. (2017), we will let w = 0 and Q = X T X/g, where g = 3.499999. It can be easily shown that the assumptions in Proposition 3 are met. Chakraborty and Khare compared the AC chain, Φ, and its Haar PX-DA variant,Φ, defined a few paragraphs ago. This comparison was done using estimated autocorrelations. Their results suggest thatΦ outperforms Φ when estimating a certain test function. We go further and estimate the second largest eigenvalue of each operator.

As in Chakraborty and Khare
It can be shown that the posterior pdf, π U (·), is log-concave, and thus possess a unique mode. Letβ be the posterior mode, andΣ the estimated variance of the MLE. We pick ψ(·) to be the pdf of t 30 (β, (Σ −1 +Q) −1 ). This is to say, for any β ∈ R p , By Proposition 4, this choice of ψ(·) guarantees finite variance. When n is large, ψ(·) is expected to resemble π U (·). The performance of our method seems insensitive to the degrees of freedom of the t-distribution (which is set at 30 for illustration).
We used a Monte Carlo sample size of N = 4 × 10 5 to form our estimates for the DA operator, and the results are shown in Table 2. Asymptotic 95% CIs for l 5 and u 5 are (0.397, 0.545) and (0.573, 0.595), respectively. Using a Bonferroni argument, we can state that asymptotically, with at least 95% confidence, λ 1 ∈ (0.397, 0.595).  We now consider the sandwich chain,Φ. One can show that the Mtd of any Haar PX-DA chain is representable (Hobert and Marchev, 2008). Hence,P is indeed a DA operator. Recall that {λ i }κ i=0 , 0 ≤ κ ≤ ∞, denote the decreasingly ordered positive eigenvalues ofP . It was shown in Khare and Hobert (2011) thatλ i ≤ λ i for i ∈ N with at least one strict inequality. For a positive integer k, κ i=0λ k i is denoted bys k . Letũ k andl k , be the respective counterparts of u k and l k . Estimates ofs k , k = 1, 2, · · · , 5 using 4 × 10 5 Monte Carlo samples are given in Table 3. Our estimate ofs 1 − 1 is less than half of s 1 − 1, implying that, in an average sense, the sandwich version of the AC chain reduces the nontrivial eigenvalues of P by more than half. Asymptotic 95% CIs for l 5 and u 5 are (0.321, 0.518) and (0.456, 0.503). Thus, asymptotically, with at least 95% confidence,λ 1 ∈ (0.321, 0.503). The method does not detect a significant difference between λ 1 andλ 1 .

Bayesian linear regression model with non-Gaussian errors
Let Y 1 , Y 2 , . . . , Y n be independent d-dimensional random vectors from the linear regression model where x i ∈ R p is known, while β ∈ R p×d and the d × d positive definite matrix Σ are to be estimated. The iid errors, ε 1 , ε 2 , . . . , ε n , are assumed to have a pdf that is a scaled mixture of Gaussian densities: where h(·) is a pdf with positive support, and R + := (0, ∞). For instance, if d = 1 and h(u) ∝ u −2 e −1/(8u) , then ε 1 has pdf proportional to e −|ε|/2 .
To perform a Bayesian analysis, we require a prior on the unknown parameter, (β, Σ). We adopt the (improper) Jeffreys prior, given by 1/|Σ| (d+1)/2 . Let y represent the n × d matrix whose ith row is the observed value of Y i . The following four conditions, which are sufficient for the resulting posterior to be proper (Qin and Hobert, 2016;Fernandez and Steel, 1999), will be assumed to hold: The posterior density is highly intractable, but there is a well-known DA algorithm to sample from it (Liu, 1996). Under our framework, the DA chain Φ is characterized by the Mtd p (β, Σ), (·, ·) = R n + π U |V (·, ·|z)π V |U (z|β, Σ) dz, where z = (z 1 , z 2 , . . . , z n ) T , The first conditional density, π U |V (·, ·|z), characterizes a multivariate normal distribution on top of an inverse Wishart distribution, i.e. β|Σ, z is multivariate normal, and Σ|z is inverse Wishart. The second conditional density, π V |U (·|β, Σ), is a product of n univariate densities. Moreover, when h(·) is a standard pdf on R + , these univariate densities are often members of a standard parametric family. The following proposition about the resulting DA operator is proved in Qin and Hobert (2016).
Proposition 5. Suppose h(·) is strictly positive in a neighborhood of the origin. If there exists ξ ∈ (1, 2) and δ > 0 such that then P is trace-class.
When P is trace-class, we can pick an ω(·) and try to make use of (13). A sufficient condition for the estimator's variance, D 2 , to be finite is stated in the following proposition, whose proof is given in the appendix.
For illustration, take d = 1 and h(u) ∝ u −2 e −1/(8u) . Then ε 1 follows a scaled Laplace distribution, and the model can be viewed as a median regression model with variance Σ unknown. It's easy to show that h(·) satisfies the assumptions in Proposition 5, so the resultant DA operator is trace-class. (This result was actually first proven by Choi and Hobert (2013).) Now let The following result shows that this will lead to an estimator with finite variance.
Proof. In light of Proposition 6, we only need to show that (22) holds for some ξ ∈ (1, 4/3). For any ξ > 0, making use of the fact that (by monotone convergence theorem) one can easily show for any δ > 0, On the other hand, using L'Hôpital's rule, we can see where R(u) is a function that is either bounded near the origin or goes to ∞ at the rate of some power function as u → 0. It follows that for ξ ∈ ((1 − 16γ/3) −1 , 4/3) and small enough δ, δ 0 u 3/2 h 3 (u) Combining (23) and (24) yields (22). The result then follows.
We now test the efficiency of the Monte Carlo estimator (13) using some simulated data with d = 1.
Here are our simulated X and y: The simulation was was based on a linear model containing an intercept, one continuous covariate, a single factor with three levels, and an interaction between the two. The elements in the second column of X were independently generated from N (0, 100). Once X was simulated, we generated the data according to where β * = (1, 0, −0.5, 0.5, 0, −1) T , Σ * = 1, {ǫ i } 10 i=1 are iid, and ǫ 1 has pdf given by f h (·) when h(u) ∝ u −2 e −1/(8u) . That is, the errors have a scaled Laplace distribution. Then the DA chain Φ lives in S U = R 6 × R + , and S V is R 10 + . We used a Monte Carlo sample size of N = 2 × 10 6 to form estimates of {s k , l k , u k } 4 k=1 , and the results are shown in Table 4.
If (4) holds, then for any positive integer k, Proof. Note that P is self-adjoint and non-negative. Let {g i } ∞ i=0 be an orthonormal basis of L 2 (π U ). The operator P is defined to be trace-class if (see e.g. Conway, 2000) ∞ i=0 P g i , g i π U < ∞.
This condition is equivalent to P being compact with summable eigenvalues. To show that P being traceclass is equivalent to (4), we will prove a stronger result, namely We begin by defining two new Hilbert spaces. Let L 2 (π V ) be the Hilbert space consisting of functions that are square integrable with respect to the weight function π V (·). For f, g ∈ L 2 (π V ), their inner product is defined, as usual, by Let L 2 (π U × π V ) be the Hilbert space of functions on S U × S V that are square integrable with respect to the weight function π U (·)π V (·). For f, g ∈ L 2 (π U × π V ), their inner product is Note that L 2 (π V ) is separable. Let {h j } ∞ j=0 be an orthonormal basis of L 2 (π V ). It can be shown that {g i h j } (i,j)∈Z 2 + is an orthonormal basis of L 2 (π U × π V ). Of course, g i h j denotes the function given by The inequality (4) is equivalent to which holds if and only if the function ϕ : S U × S V → R given by . Suppose (4) holds. Then by Parseval's identity, Again by Parseval's identity, this time applied to the function on S V (and in fact, in L 2 (π V ) by Jensen's inequality) given by Note that the use of Fubini's theorem in the last equality can be easily justified by noting that g i ∈ L 2 (π U ), and making use of Jensen's inequality. But the right hand side of (27) is precisely ∞ i=0 P g i , g i π U . Hence, (26) holds when S U p(u, u) µ(du) is finite.
For the rest of the proof, assume that P is trace-class. This implies that P is compact, and thus admits the spectral decomposition (see e.g. Helmberg, 2014, §28 Corollary 2.1) given by where f i , i = 0, 1, . . . , κ, is the normalized eigenfunction corresponding to λ i . By Parseval's identity, This equality is in fact a trivial case of Lidskii's theorem (see e.g. Erdös, 1974;Gohberg et al., 2012). It follows that (5) holds for k = 1.
We now consider the case where k ≥ 2. By (29) and a simple induction, we have the following decomposition for P k .
Hence P k is trace-class with ordered positive eigenvalues {λ k i } κ i=0 . Note that P k is a Markov operator whose Mtd is p (k) (u, ·), u ∈ S U . Thus, in order to show that (5) holds for k ≥ 2, it suffices to verify P k is a DA operator, for then we can treat P k as P and repeat our argument for the k = 1 case. To be specific, we'll show that there exists a random variableṼ taking values on SṼ , where (SṼ ,Ṽ,ν) is a σ-finite measure space andṼ is countably generated, such that for u ∈ S U , p (k) (u, ·) = SṼ π U |Ṽ (·|v)πṼ |U (v|u)ν(dv), where πṼ (·), π U |Ṽ (·|·), and πṼ |U (·|·) have the apparent meanings.

B Proof of Proposition 6
Proposition 6. Suppose that h(·) is strictly positive in a neighborhood of the origin. If ω(z) can be written as n i=1 ω i (z i ), and there exists ξ ∈ (1, 4/3) such that for all i ∈ {1, 2, . . . , n}, then (15) holds, and thus by Theorem 2, second moment exists for the estimator (13).