Trace class Markov chains for the Normal-Gamma Bayesian shrinkage model

High-dimensional data, where the number of variables exceeds or is comparable to the sample size, is now pervasive in many scientific applications. In recent years, Bayesian shrinkage models have been developed as effective and computationally feasible tools to analyze such data, especially in the context of linear regression. In this paper, we focus on the Normal-Gamma shrinkage model developed by Griffin and Brown. This model subsumes the popular Bayesian lasso model, and a three-block Gibbs sampling algorithm to sample from the resulting intractable posterior distribution has been developed by Griffin and Brown. We consider an alternative two-block Gibbs sampling algorithm and rigorously demonstrate its advantage over the three-block sampler by comparing specific spectral properties. In particular, we show that the Markov operator corresponding to the two-block sampler is trace class (and hence Hilbert-Schmidt), whereas the operator corresponding to the three-block sampler is not even Hilbert-Schmidt. The trace class property for the two-block sampler implies geometric convergence for the associated Markov chain, which justifies the use of Markov chain CLT's to obtain practical error bounds for MCMC based estimates. Additionally, it facilitates theoretical comparisons of the two-block sampler with sandwich algorithms which aim to improve performance by inserting inexpensive extra steps in between the two conditional draws of the two-block sampler.


Introduction
In recent years, the explosion of data, due to advances in science and information technology, has left almost no field untouched.The availability of high-throughput data from genomic, finance, environmental, marketing (among other) applications has created an urgent need for methodology and tools for analyzing high-dimensional data.In particular, consider the linear model Y = Xβ +σ , where Y is an n × 1 real valued response variable, X is a known n × p matrix, β is an unknown p × 1 vector of regression coefficients, σ is an unknown scale parameter and the entries of are independent standard normals.In the high-dimensional datasets mentioned above, often n < p, 1 and classical least squares methods fail.The lasso [22] was developed to provide sparse estimates of the regression coefficient vector β in these sample-starved settings (several adaptations/alternatives have been proposed since then).It was observed in [22] that the lasso estimate is the posterior mode obtained when one puts i.i.d Laplace priors on the elements of β (conditional on σ).This observation has led to a flurry of recent research concerning the development of prior distributions for (β, σ) that yield posterior distributions with high (posterior) probability around sparse values of β, i.e., values of β that have many entries equal to 0. Such prior distributions are referred to as "continuous shrinkage priors " and the corresponding models are referred to as "Bayesian shrinkage models".Bayesian shrinkage methods have gained popularity and have been extensively used in a variety of applications including ecology, finance, image processing and neuroscience (see, for example, [24? , 3? ???, 19,20]).
In this paper, we focus on the well-known Normal-Gamma shrinkage model introduced in Griffin and Brown [5].The model is specified as follows: Inverse − Gamma α, ξ allow for impropriety via α = 0 or ξ = 0 ∼ Gamma(a, b) for j = 1, 2, ..., p, (1) where N p denotes the p-variate normal density, and D τ is a diagonal matrix with diagonal entries given by {τ j } p j=1 .Also, Inverse-Gamma(α, ξ) and Gamma(a, b) denote the Inverse-Gamma and Gamma densities with shape parameters α and a, and rate parameters ξ and b respectively.The marginal density of β given σ 2 in the above model is given by where K a is the modified Bessel function of the second kind.The popular Bayesian lasso model of Park and Casella [17] is a special case of the Normal-Gamma model above with a = 1, where the marginal density of β simplifies to In this case, the marginal density for each β j (given σ 2 ) is the double exponential density.The Normal-Gamma family offers a wider choice for the tail behavior (as a decreases, the marginal distribution becomes more peaked at zero, but has heavier tails), and thereby a more flexible mechanism for model shrinkage.
The posterior density of (β, σ 2 ) for the Normal-Gamma model is intractable in the sense that closed form computation or direct sampling is not feasible.Griffin and Brown [5] note that the full conditional densities of β, σ 2 and τ 2 are easy to sample from, and develop a three-block Gibbs sampling Markov chain to generate samples from the desired posterior density.This Markov chain, denoted by Φ := {( βm , σ2 m )} ∞ m=0 (on the state space R p × R + ), is driven by the Markov transition density (Mtd) Here π(• | •) denotes the conditional density of the first group of arguments given the second group of arguments.The one-step dynamics of this Markov chain to move from the current state, βm , σ2 m , to the next state, βm+1 , σ2 m+1 can be described as follows: m+1 from π(• | βm+1 , τ , Y).In [15], the authors show that the distribution of the Markov chain Φ converges to the desired posterior distribution at a geometric rate (as the number of steps converges to ∞).
As mentioned previously, the Bayesian lasso Markov chain of [17] is a special case of the Normal-Gamma Markov chain when a = 1.In recent work [21], it was shown that a two-block version of the Bayesian lasso chain (and a variety of other chains for Bayesian regression) can be developed.The authors in [21] then focus their theoretical investigations on the Bayesian lasso, and show that the two-block Bayesian lasso chain has a better behaved spectrum than the original three-block Bayesian lasso chain in the following sense: the Markov operator corresponding to the two-block Bayesian lasso chain is trace class (eigenvalues are countable and summable, and hence in particular square-summable), while the Markov operator corresponding to the original three-block Bayesian lasso chain is not Hilbert-Schmidt (the corresponding absolute value operator either does not have a countable spectrum, or has a countable set of eigenvalues that are not square-summable).
Based on the method outlined in [21], a two-block version of the three-block Normal-Gamma Markov chain Φ can be constructed as follows.The two-block Markov chain, denoted by Φ = {(β m , σ 2 m )} ∞ m=0 (on the state space R p × R + ), is driven by the Markov transition density (Mtd) The one-step dynamics of this Markov chain to move from the current state, β m , σ 2 m , to the next state, β m+1 , σ 2 m+1 can be described as follows: ).The goal of this paper is to investigate whether the theoretical results for the Bayesian lasso in [21] hold for the more general and complex setting of the Normal-Gamma model.In particular, we establish that the Markov operator corresponding to the two-block chain Φ is trace class when a > 1 2 (Theorem 1).On the other hand, the Markov operator corresponding to the three-block chain Φ is not Hilbert-Schmidt for all values of a (Theorem 2).These results hold for all values of the sample size n and the number of independent variables p.Since the Bayesian lasso is a special case with a = 1, our results subsume the spectral results in [21].We note that establishing the results in the Normal-Gamma setup is much harder than the Bayesian lasso setting.This is in part due to the fact that the modified Bessel function K a does not in general have a closed form when a = 1, and a heavy dose of various identities and bounds for these Bessel functions is needed to analyze the appropriate integrals in the proofs of Theorem 1 and Theorem 2.
We now discuss further some of the implications of establishing the trace-class property for Φ.Note that the Markov chain Φ arises from a two-block Data Augmentation (DA) algorithm, with (β, σ 2 ) as the parameter block of interest and τ as the augmented parameter block.Hence the corresponding Markov operator, denoted by K, is a positive, self-adjoint operator (see [6]).Establishing that a positive self-adjoint operator is trace class implies that it has a discrete spectrum, and that (countably many, non-negative) eigenvalues are summable.The trace class property implies compactness, which further implies geometric ergodicity of the underlying Markov chain (see [16,Section 2], for example).Geometric ergodicity, in turn, facilitates use of Markov chain central limit theorems to provide error bounds for Markov chain based estimates of relevant posterior expectations.The DA interpretation of Φ also enables us to use the Haar PX-DA technique from [6] and construct a "sandwich" Markov chain by adding an inexpensive extra step in between the two conditional draws involved in one step of Φ (see Section 5 for details).The trace class property for Φ, along with results in [9], implies that the sandwich chain is also trace class, and that each ordered eigenvalue of the sandwich chain is dominated by the corresponding ordered eigenvalue of Φ (with at least one strict domination).Recent work in [2] provides a rigorous approach to approximate eigenvalues of trace class Markov chains whose Mtd is available in closed form.These results are not applicable to the two-block sampler as its is not available in closed form, and extending results in [2] to such settings is a topic of ongoing research.
The rest of the paper is organized as follows.In Section 2, we provide the form of the relevant conditional densities for the Markov chains Φ and Φ.In Section 3, we establish the trace class property for the two-block Markov chain Φ.In Section 4, we show that the three-block Markov chain is not Hilbert-Schmidt.In Section 5, we derive the Haar PX-DA sandwich chain corresponding to the two-block DA chain.Finally, in Section 6 we compare the performance of the two-block, three-block and the Haar PX-DA based chains on simulated and real datasets.

Form of relevant densities
In this section, we present expressions for various densities corresponding to the Normal-Gamma model in (1).These densities appear in the Mtd for the Markov chains Φ and Φ.
The joint density for the parameter vector (β, τ , σ 2 ) conditioned on the data vector y is given by the following: Based on the joint density in (4), the following conditional distributions can be derived in a straightforward fashion.
In particular, In particular, for In particular, for σ 2 ∈ R + .

Properties of the two-block Gibbs sampler
In this section, we show that the operator associated with the two-block Gibbs sampler Φ, with Markov transition density k specified in (3) is trace class when a > 1 2 and is not trace class when 0 < a ≤ 1 2 .Theorem 1.For all values of n and p, the Markov operator corresponding to the two-block Markov chain Φ is trace class (and hence Hilbert-Schmidt) when a > 1  2 and is not trace class when 0 < a ≤ 1 2 .Proof In the current setting, the trace class property is equivalent to the finiteness of the integral (see [16,Section 2], for example) We will consider five separate cases: a > 1, 3/4 ≤ a ≤ 1, 1/2 < a < 3/4, 0 < a < 1/2 and a = 1/2.In the first three cases, we will show that the integral in ( 9) is finite, and in the last two cases we will show that the integral in ( 9) is infinite.The proof is a lengthy and intricate algebraic exercise involving careful upper/lower bounds for modified Bessel functions and conditional densities, and we will try to provide a road-map/explanation whenever possible.We will start with the case a > 1.
Case 1: a > 1 By the definition of k, we have As a first step, we will gather all the terms with τ , and then focus on finding an upper bound for the inner integral with respect to τ .Using ( 5), ( 7) and ( 8), we get, where We now focus on the inner integral in (11) defined by Let λ denote the largest eigenvalue of X T X.Using the definition of A τ , it follows that where We now examine a generic term of the sum in (13).Note that c j and cj √ τj are both (unnormalized) GIG densities.Hence, for any subset First, by [11, Page 266], we get that for all a > 1 2 .Next, using the fact that if x > 0, then ν → K ν (x) is an increasing function for ν > 0 (again, see [11, Page 266]), we get Hence, from ( 14), we get that It follows from ( 12), ( 13) and ( 15) that By ( 11) and ( 12), the trace class property will be established if we show that for every L ⊆ {1, 2, • • • , p}, the integral is finite.We proceed to show this by first simplifying the inner integral with respect to β.Using the form of the Gamma density, we get It follows by ( 16) that As discussed above, this establishes the trace class property in the case a > 1.
By (14), for any subset It follows from ( 12) that By (11), the trace class property will be established if we show that for every L ⊆ {1, 2, • • • , p}, the integral is finite.We proceed to show this by first integrating out β. Using the form of the Gamma density, we get where
By [1, Page 375], if ν > 0, then Kν (x) as y → 0. Hence there exists 1 > 0 such that we have Since K ν (x) > 0 for positive ν and x, we have It follows from ( 5), ( 7) and ( 8) that Furthermore, we have and If we denote the entries of X T X and X T Y by a ij , b i separately.It's easy to see there is at least i such that a ii > 0 (if not, a ii = 0 for all i, indicating X is exactly 0.) Without loss of generality, we assume a 11 > 0, so where It follows from ( 24), (25), ( 26) and (27) that By (23), the inner integral can be bounded below as It follows from ( 28) and (29) that where . However, we note that where the last step follows from Propositon A1.By (30), it follows that the operator corresponding to the Markov transition density k is not trace class when 0 < a < 1/2.
Hence there exists 2 ∈ (0, 1) such that ) , where . We use this to get a lower bound for the inner integral with respect to τ in (28).In particular, we note that ) Using (28), it follows that ) ) ) It follows that the operator corresponding to the Markov transition density k is not trace class when a = 1 2 .

Properties of the three-block Gibbs sampler
In this section, we show that when a > 0, the Markov operator corresponding to the three-block Gibbs sampler Φ, with Markov transition density k specified in (1), is not Hilbert-Schmidt.Let K be the Markov operator corresponding to Φ.We prove the following result.
Theorem 2. For all a > 0, the Markov operator K is not Hilbert-Schmidt for all possible values of p and n.
Proof Note that the Markov operator K corresponding to the density k is Hilbert-Schmidt if and only if K * K is trace class (see [8], for example).Here K * denotes the adjoint of K.It follows that K is Hilbert-Schmidt if and only if I < ∞, where By ( 2), a straightforward manipulation of conditional densities, and Fubini's theorem, we obtain For convenience, we introduce and use the following notation in the subsequent proof.
We first show I = ∞ for the simpler case with a > 1 2 and then consider the significantly more complicated case 0 < a ≤ 1 2 .
Case 2: 0 < a ≤ 1/2 By the integral formula (see [1], Page 376) Since the hyperbolic function cosh is strictly decreasing on interval (−∞, 0], for every x > 0, K ν (x) is strictly decreasing as ν increases on the interval (−∞, 0].Note that when 0 ) for all x > 0.Moreover, when ν < 0 and x > 0, 2K ν (x) ≤ x ν Γ(−ν)2 −ν (see Proposition A7 of [15]), which implies Similarly, we get Using ( 5), ( 6) and ( 8), along with (37) and (38), we obtain where and the last inequality follows by It follows by (39) and the form of the Inverse-Gamma density that where We now establish some inequalities which will help converting the lower bound in (40) into a simpler form.By (33), it follows that and where )νj and λ is the greatest eigenvalue of matrix X T X.By Proposition A4, we have Hence, it follows from (41) and (42) that From (43), we obtain

Construction of the sandwich Markov chain
The two-block Markov chain Φ can be interpreted as a Data Augmentation (DA) algorithm, with (β, σ 2 ) as the parameter block of interest, and τ as the augmented block.The DA algorithm can suffer from slow convergence (just like the EM algorithm, its analogous version in likelihood maximization).The sandwich algorithm, introduced in [12,6], aims to improve the convergence and efficiency of the DA algorithm by adding an inexpensive extra step in between the two conditional draws of the DA algorithm.In fact, there are many DA chains (see [12,14,13,7,16], for example) where sandwich chains have been constructed and shown to be significantly more efficient with roughly the same computational effort per iteration.In this section, we will focus on deriving the Haar PX-DA sandwich algorithm in the Bayesian lasso setting.The Haar PX-DA algorithm has been shown in [6] to be the best among a class of sandwich algorithms in terms of efficiency and operator norm.
A key ingredient in constructing the Haar PX-DA algorithm is a unimodular group which acts on the augmented variable space (R p in our case).We consider the multiplicative group G of positive real numbers, which acts on an element of R p through scalar multiplication.In particular, if g ∈ G and τ ∈ R p , then the result of the action of g on τ is given by gτ = (gτ 1 , gτ 2 , • • • , gτ p ).Another choice that we need to make is the choice of the multiplier function χ : G → R + , which satisfies χ(g 1 g 2 ) = χ(g 1 )χ(g 2 ) for any pair g 1 , g 2 ∈ G, and for any g ∈ G and any function φ : R p → R. In this setting, the function χ(g) = g p serves as a valid multiplier function.Also, the unimodular group G has a Haar measure H(dg) = dg g .With these ingredients in hand, we define the density f G on G (with respect to the Haar measure) by where m(τ ) = G π(gτ )χ(g)H(dg) is the normalizing constant.From (4), it follows that .
Altohugh f G is not a standard density, samples from this univariate density can be easily generated using a rejection sampling algorithm.Using f G , we can now define the Haar PX-DA sandwich Markov chain, denoted by Φ * = {(β m , σ 2 m )} ∞ m=0 , whose one step transition from (β m , σ 2 m ) to (β m+1 , σ 2 m+1 ) can be described as follow.The lemma below, regarding spectral properties of the Haar PX-DA chain, follows by combining Theorem 1 and results from [9].
Lemma 5.1.The operator corresponding to the Haar PX-DA Markov chain Φ * is trace class.Also, if {λ * i } ∞ i=1 and {λ i } ∞ i=1 are the ordered eigenvalues corresponding to Φ * and Φ respectively, then λ * i ≤ λ i for i ≥ 1 with a strict inequality holding for at least one i.

Examples
In this section, we consider two simulated data examples (one each for n > p and n < p) and a real data example to compare the performance the three block, two block and Haar PX-DA sandwich chains.

Simulation
We consider a setting with n = 10 < p = 15 for the first simulation, and n = 15 > p = 10 for the second simulation.for both cases,the elements of the design matrix X and response y were chosen by generating i.i.d.N (0, 1) random variables.We fit the Normal-Gamma model in (1) with hyper parameters a = 0.75, b = 2, ξ = 100, α = 0. To compare the efficiency performance of the Markov chains, we compute the autocorrelations (up to lag 10) for all the Markov chains for the function Y − Xβ T Y − Xβ + σ 2 .The results are summarized in Table 1 and Figure 1 for the first simulation, and in Table 2 and Figure 2 for the second simulation.We can clearly see that for both datasets, the two block Gibbs sampler has significantly lower autocorrelations than the three block Gibbs sampler, and that the magnitude of the autocorrelations for the sandwich Markov chain is lowest.

Real data example
We consider the wheat data set from Perez and de los Campos [18], which is available in the R package BGLR.The data was obtained from numerous international trials across a wide variety of wheat-producing environments.For our analysis, we consider the average grain yield for a particular environmental condition (there are four to choose from) as the response variable, and 20 binary variables containing genotypic information as the predictors.We fit the Normal-Gamma model in (1)

Discussion of numerical results
For both the simulated and real data settings, the two-block chain clearly has a significantly better performance than the three-block chain.For example, in all the settings the Lag 1 autocorrelation drops by 80% or more when we compare the three-block and the two-block chains.These findings support the theoretical results (Theorem 1 and Theorem 2) in the paper.Since the two-block chain and the three-block chain require the same computational effort, our theoretical and experimental results, support the overall conclusion that a practitioner should prefer the two-block chain over the three-block chain.
The comparison between the two-block and the Haar PX-DA chain is not as decisive.We elaborate on this below.Clearly, in all the experimental settings, the sandwich algorithm performs better than the two-block chain, and substantially so for the simulated datasets.These findings support the theoretical results (Lemma 5.1).However, note that the Haar PX-DA chain requires an extra computational step (sampling from f G ) as compared to the two-block chain.This extra step requires a univariate rejection sampler.We have found that the performance of this rejection sampler varies with the choice of a, b, ξ and α.In the best case, it takes twice as much time as the other steps, and in the worst case, can take more than 10 times as much time as the other steps.Of course, the performance gains can sometimes easily offset this computational overhead, but a practitioner should make a careful determination in their specific setting.x dx = ∞.
with a = 0.75, b = 0.2, ξ = 1, α = 0 and compute autocorrelations for the function Y − Xβ T Y − Xβ + σ 2 for the three block, two block and Haar PX-DA sandwich chains.The results are shown in Table3and Figure3.As in the simulated data examples, the two-block chain has lower autocorrelations than the three-block chain, and the Haar PX-DA sandwich chain is the most efficient among all three Markov chains.

Table 1
First ten autocorrelations for simulated data with n < p

Table 2
First ten autocorrelations for simulated data with n > p

Table 3
First ten autocorrelations with wheat data