Geometric ergodicity for Bayesian shrinkage models

Abstract: In recent years, a large variety of continuous shrinkage priors have been developed for a Bayesian analysis of the standard linear regression model in high-dimensional settings. We consider two such priors: the Dirichlet-Laplace prior (developed in Bhattacharya et al. (2013)) and the Normal-Gamma prior (developed in Griffin and Brown (2010)). For both the Dirichlet-Laplace and Normal-Gamma priors, Gibbs sampling Markov chains have been developed to generate approximate samples from the corresponding posterior distributions. We show, by using a drift and minorization based analysis, that the Gibbs sampling Markov chains corresponding to the aforementioned models are geometrically ergodic. Establishing geometric ergodicity of these Markov chains is crucial, as it provides theoretical justification for the use of the Markov chain CLT, which can then be used to obtain asymptotic standard errors for Markov chain based estimates of posterior quantities. Both Gibbs samplers in the paper use the Generalized Inverse Gaussian (GIG) distribution as one of the conditional distributions. A novel contribution of our convergence analysis is the use of drift functions which include terms that are negative fractional powers of normal random variables, to tackle the presence of the GIG distribution.


Introduction
In recent years, there has been a huge influx of high-dimensional datasets from various fields such as genomics, environmental sciences, finance and the social sciences. Classical statistical methods are inadequate to analyze these high-dimensional datasets. Hence, development of new techniques for analyzing these datasets has been a major focus of statistical research in the last decade, and continues to generate much interest. In this context, analysis of the standard linear regression model in a high-dimensional setting is a well-studied and popular research topic.
In particular, consider the model y = Xβ + σε, where y = (y_i)_{i=1}^n ∈ R^n is the vector of observations, X is the (known) design matrix, β ∈ R^p is the (unknown) vector of regression coefficients, the components of ε are iid standard normal errors, and σ² is the (unknown) variance parameter. The objective is to estimate the unknown parameters β and σ². For many modern datasets, the number of regression parameters (p, the dimension of β) is much larger than the number of observations (n). Classical statistical methods are not applicable for analyzing such datasets. A natural approach in these situations is to look for sparse estimates of β, i.e., estimates where many entries of β are exactly equal to zero. The lasso (introduced in Tibshirani (1994)) is a popular approach to deal with this problem. In this approach, the estimate of β is given by

β̂_lasso = argmin_{β ∈ R^p} { ‖y − Xβ‖² + λ Σ_{j=1}^p |β_j| },

where λ is a user-specified tuning parameter. The estimate β̂_lasso also has a Bayesian statistical interpretation. It is well known that β̂_lasso is the mode of the conditional density of β given σ² and y, assuming that the components of β (conditioned on σ²) are independently distributed, with each component having a Laplace prior distribution. Based on this observation, several authors have proposed a Bayesian analysis of the linear regression model in high-dimensional settings by using various adaptations/generalizations of the Laplace prior distribution for the entries of β (see Bhattacharya et al. (2013); Carvalho et al. (2010); Figueiredo (2003); Griffin and Brown (2005, 2010); Johnstone and Silverman (2002); Park and Casella (2008); Polson and Scott (2010); Tipping and Smola (2001)). These prior distributions are referred to as "continuous shrinkage priors" and the corresponding statistical models are referred to as "Bayesian shrinkage models". In recent work, Bhattacharya et al. (2013) unify many of the standard Bayesian shrinkage models under a common framework, and study their theoretical properties.
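Since the lasso objective and the negative log conditional density of β (under independent Laplace priors) differ only by a positive multiplicative factor and an additive constant, they share the same minimizer, which is why β̂_lasso is the posterior mode. The following sketch checks this numerically on simulated toy data; the design matrix, response, σ² and λ below are arbitrary illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (purely illustrative; X, y, sigma2, lam are not from the paper).
n, p = 20, 5
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.standard_normal(n)
sigma2, lam = 1.5, 3.0

def lasso_objective(beta):
    """||y - X beta||^2 + lam * sum_j |beta_j|."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def neg_log_posterior(beta):
    """-log conditional density of beta given sigma^2 and y (up to a constant),
    under independent Laplace priors with rate lam/(2*sigma2) on each beta_j."""
    return np.sum((y - X @ beta) ** 2) / (2 * sigma2) \
        + (lam / (2 * sigma2)) * np.sum(np.abs(beta))

# The two objectives differ only by the positive factor 2*sigma2, so they are
# minimized by the same beta: the lasso estimate is the posterior mode.
betas = rng.standard_normal((100, p))
ratios = [lasso_objective(b) / (2 * sigma2 * neg_log_posterior(b)) for b in betas]
print(max(ratios) - min(ratios))  # 0 up to floating point
```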
They show that frequentist optimality of these Bayesian shrinkage methods depends heavily on prior concentration around sparse vectors, i.e., whether the corresponding priors place sufficient mass around sparse β values (especially in high-dimensional settings). They further demonstrate that some popular choices of shrinkage models, including the Bayesian lasso (introduced in Park and Casella (2008)), have sub-optimal prior concentration properties, and develop a new class of continuous shrinkage priors called the Dirichlet-Laplace priors, which have optimal prior concentration properties.
The Normal-Gamma model developed by Griffin and Brown (2010) is another commonly used Bayesian shrinkage model, and contains the Bayesian lasso as a special case. In fact, the Bayesian lasso model corresponds to a specific choice of a hyperparameter in the Normal-Gamma model. Bhattacharya et al. (2013) argue that for an appropriate range of hyperparameter values (not including the Bayesian lasso case) the Normal-Gamma model shares the optimal prior concentration properties of the Dirichlet-Laplace model.
For both the Dirichlet-Laplace and Normal-Gamma model, the posterior density is intractable in the sense that desired expected values cannot be computed in closed form, and it is not possible to directly sample from the posterior density.
As a result, for both these models, Markov chains have been devised to generate approximate samples from the corresponding posterior density. A crucial, and often challenging, task in this context is to investigate whether these Markov chains are geometrically ergodic, i.e., whether the distribution of each of these Markov chains converges to the corresponding posterior distribution at a geometric rate. As we discuss below, establishing geometric ergodicity is important for a meaningful statistical analysis of the corresponding models. A proof of geometric ergodicity of the Markov chain corresponding to the Bayesian lasso model is provided in Khare and Hobert (2013). However, an investigation of geometric ergodicity for the Markov chains corresponding to the general Normal-Gamma model and the Dirichlet-Laplace model has not been undertaken. Indeed, the general Normal-Gamma model requires a much more complicated and nuanced analysis as compared to the Bayesian lasso model (see Remark 1).
3. Draw β_{m+1} from π(· | τ_{m+1}, σ²_{m+1}, y), i.e., from the conditional density of β given (τ_{m+1}, σ²_{m+1}, y).

As shown in Section 3, it can be established in a straightforward way that the Markov chain described above is Harris ergodic, and its stationary density is given by the density in (1.6). Harris ergodicity can be used to construct strongly consistent estimators of intractable posterior expectations. In particular, if a real-valued measurable function h satisfies E_π[|h|] < ∞, then irrespective of how the chain is started, the estimator

h̄_m := (1/m) Σ_{i=1}^m h(β_i, τ_i, σ²_i)

is strongly consistent for E_π h. However, this estimator is useful only if an estimate of the associated standard error can be provided. All known methods to compute asymptotically consistent estimates of standard errors for h̄_m require the existence of a Markov chain central limit theorem (CLT). In particular, we need to establish that

√m (h̄_m − E_π h) → N(0, c²) in distribution as m → ∞,

where c² is a finite positive constant. In general, the only standard method available to prove a Markov chain CLT, and to obtain consistent estimates of the asymptotic variance c², is to prove that the underlying Markov chain is geometrically ergodic (see Chan and Geyer (1994); Flegal and Jones (2010); Mykland, Tierney and Yu (1995); Robert (1995)). In the current context, the Markov chain {(β_m, τ_m, σ²_m)}_{m≥0} is defined to be geometrically ergodic if there exists a positive real-valued function M and a constant γ ∈ [0, 1) such that, for every (β_0, τ_0, σ²_0) and r ∈ N,

‖K^r − Π‖_TV ≤ M(β_0, τ_0, σ²_0) γ^r.

Here K^r denotes the distribution of the Markov chain started at (β_0, τ_0, σ²_0) after r steps, Π denotes the stationary distribution (corresponding to the density in (1.6)), and ‖ · ‖_TV denotes the total variation norm.
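Once a Markov chain CLT is available, the asymptotic variance c² (and hence a standard error for h̄_m) can be estimated by standard methods such as batch means. As a hedged illustration, the sketch below applies the batch means method to a toy AR(1) chain standing in for a geometrically ergodic Markov chain; the chain, batch count and constants are illustrative assumptions, not part of the analysis in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def batch_means_se(draws, n_batches=30):
    """Batch-means estimate of the standard error of the MCMC average h_bar_m,
    valid when a Markov chain CLT holds for the chain producing `draws`."""
    m = len(draws) // n_batches * n_batches   # drop the remainder
    batches = draws[:m].reshape(n_batches, -1).mean(axis=1)
    # Var(h_bar_m) is estimated by Var(batch means) / n_batches.
    return batches.std(ddof=1) / np.sqrt(n_batches)

# Stand-in for a geometrically ergodic chain: a stationary AR(1) with mean 0.
m, rho = 50_000, 0.5
x = np.empty(m)
x[0] = 0.0
for t in range(1, m):
    x[t] = rho * x[t - 1] + rng.standard_normal()

h_bar = x.mean()
se = batch_means_se(x)
print(h_bar, se)  # h_bar should be within a few standard errors of 0
```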
If the state space of an ergodic Markov chain is finite, then geometric ergodicity is immediate. However, establishing geometric ergodicity is a much more challenging task for a Markov chain with an infinite state space. A significant proportion of Markov chains arising in statistical applications (including the Markov chains considered in this paper) have continuous state spaces. Despite some success stories, for a large majority of these Markov chains, a proof of geometric ergodicity is not available.
For the Normal-Gamma Markov chain described above, we prove the following.
A similar result (Theorem 4.1) is established for the Dirichlet-Laplace Markov chain in Section 4. The analysis is based on techniques developed in Meyn and Tweedie (2009), Rosenthal (1995) and Roberts and Tweedie (1996) (see Jones and Hobert (2001) for an extensive review). This method of establishing geometric ergodicity of general state space Markov chains has been previously used by several authors, including Hobert (2001); Hobert and Geyer (1998); Jarner and Hansen (2000); Hobert (2001, 2004); Marchev and Hobert (2004); Mengersen and Tweedie (1996); Mira and Tierney (1997); Rosenthal (1998b, 1999); Roman and Hobert (2012); Rosenthal (1996); Roy and Hobert (2007); Tan and Hobert (2009); Tan, Jones and Hobert (2012). In order to use this technique, one needs to construct a drift function, and then use it to establish appropriate drift and minorization conditions (see, for example, Proposition 3.1 and Proposition 3.2). However, there is no general recipe or set of guidelines for constructing the drift function and the corresponding drift and minorization conditions. Constructing these in any specific application is currently a matter of art.
Both the Gibbs sampling Markov chains analyzed in this paper use the Generalized Inverse Gaussian (GIG) distribution as one of the conditional distributions. To the best of the authors' knowledge, a drift and minorization based analysis for Gibbs samplers involving the GIG distribution has not been undertaken in the previous literature. Gibbs sampling Markov chains in Khare and Hobert (2012), Khare and Hobert (2013) and Choi and Hobert (2013) involve the Inverse Gaussian distribution (which is a special case of the GIG distribution), and require a much simpler analysis as compared to the analysis in Section 3 and Section 4 (see Remark 1). We construct a new class of drift functions, which include terms that are negative fractional powers of normal random variables, to tackle the presence of the GIG distribution. The details can be found at the beginning of Section 3.1.
The paper is organized as follows. In Section 2, we establish notation for standard densities that will be used in our analysis. Section 3 provides drift and minorization conditions for the Normal-Gamma model. Section 4 provides drift and minorization conditions for the Dirichlet-Laplace model. Section 5 contains a discussion of future research directions. The appendix contains mathematical results, including some new identities related to modified Bessel functions of the second kind. These mathematical results play an important role in the analysis in Section 3 and Section 4.

Notation
We establish notation for some standard densities that will be used in the analysis.
• Inverse-Gamma(α, ξ) denotes the inverse-gamma density with shape parameter α > 0 and rate parameter ξ > 0.
• GIG(ν, α, ξ) denotes the generalized inverse Gaussian density with parameters ν ∈ R, α > 0, ξ > 0, and corresponds to the density function

f_GIG(u; ν, α, ξ) = ((α/ξ)^{ν/2} / (2 K_ν(√(αξ)))) u^{ν−1} exp(−(αu + ξ/u)/2), u > 0.

Here K_ν(·) denotes the modified Bessel function of the second kind. Since f_GIG is a density function, it follows immediately that

∫_0^∞ u^{ν−1} exp(−(αu + ξ/u)/2) du = 2 (ξ/α)^{ν/2} K_ν(√(αξ)). (2.1)

Note that for any x > 0, K_ν(x) can be obtained from (2.1) by setting α = x and ξ = x, i.e., K_ν(x) = (1/2) ∫_0^∞ u^{ν−1} exp(−x(u + 1/u)/2) du.
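The identity obtained from (2.1) by setting α = ξ = x can be checked numerically. The sketch below assumes the GIG parameterization with density proportional to u^{ν−1} exp(−(αu + ξ/u)/2) (our reading of the text above) and compares the resulting integral representation of K_ν against scipy's implementation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv

def kv_via_gig_integral(nu, x):
    """K_nu(x) recovered from the GIG normalizing constant (2.1) with
    alpha = xi = x:  K_nu(x) = (1/2) * int_0^inf u^{nu-1} e^{-x(u+1/u)/2} du."""
    val, _ = quad(lambda u: 0.5 * u ** (nu - 1.0)
                  * np.exp(-x * (u + 1.0 / u) / 2.0), 0.0, np.inf)
    return val

for nu, x in [(0.5, 1.0), (1.7, 2.3)]:
    print(kv_via_gig_integral(nu, x), kv(nu, x))
```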
The appendix contains several new mathematical identities (along with proofs) for modified Bessel functions of the second kind. These results are useful in the subsequent analysis.

The Normal-Gamma model
In this section, we prove that the Gibbs sampling Markov chain in Griffin and Brown (2010) for drawing approximate samples from the posterior density in (1.6) is geometrically ergodic. We first resolve a minor technical issue, and then proceed to provide the details of the Gibbs sampling algorithm in Griffin and Brown (2010). Consider the Normal-Gamma model specified in (1.2)-(1.5). It can be shown that if a is chosen to be less than 1/2, then the marginal density of β converges to infinity if one or more entries of β converge to zero. Such a choice is intentional: the objective is that the posterior density of (β, τ, σ²) will concentrate on neighborhoods of sparse values of β. To avoid technical complications in defining the conditional densities for the Gibbs sampling algorithm in Griffin and Brown (2010), it is assumed that the parameter β takes values in R̄^p, where R̄ := R \ {0}. Since the set R^p \ R̄^p has Lebesgue measure (on R^p) equal to zero, expectations with respect to the density π(β, τ, σ² | y) remain the same whether we restrict β to R̄^p or not. Hence, it is enough to be able to generate samples from the density π(β, τ, σ² | y) restricted to the space R̄^p × R^p_+ × R_+. Griffin and Brown (2010) show that the conditional densities corresponding to the density in (1.6) are given as follows.
It is well known, and can be easily verified, that the joint density of (β, τ, σ²) (conditioned on y) defined in (1.6) is invariant for the Gibbs transition density k_NG defined above. Since k_NG is strictly positive, it follows that the corresponding Markov chain is irreducible with respect to the Lebesgue measure on R̄^p × R^p_+ × R_+ (see (Meyn and Tweedie, 2009, Page 87)), and aperiodic. Results in Asmussen and Glynn (2011) can now be invoked to conclude that the Markov chain is positive Harris recurrent. We now proceed to prove geometric ergodicity by establishing a geometric drift condition and an associated minorization condition for the Gibbs transition density k_NG. Using results from Rosenthal (1995), this will immediately establish geometric ergodicity of the corresponding Markov chain.
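Each scan of the Gibbs sampler requires draws from GIG conditionals. Under the GIG(ν, α, ξ) parameterization of Section 2 (density proportional to x^{ν−1} exp(−(αx + ξ/x)/2)), such draws can be obtained from scipy's geninvgauss by matching parameters; the mapping below is our own derivation, and the specific (ν, α, ξ) values are illustrative rather than the conditional parameters of the model.

```python
import numpy as np
from scipy.stats import geninvgauss
from scipy.special import kv

def rvs_gig(nu, alpha, xi, size=1, rng=None):
    """Draw from GIG(nu, alpha, xi), density proportional to
    x^{nu-1} exp(-(alpha*x + xi/x)/2), using scipy's geninvgauss, whose
    density under scale s is proportional to x^{p-1} exp(-b*(x/s + s/x)/2).
    Matching exponents gives p = nu, b = sqrt(alpha*xi), s = sqrt(xi/alpha)."""
    return geninvgauss.rvs(nu, np.sqrt(alpha * xi),
                           scale=np.sqrt(xi / alpha), size=size,
                           random_state=rng)

# Sanity check against the known GIG mean:
# E[X] = sqrt(xi/alpha) * K_{nu+1}(sqrt(alpha*xi)) / K_nu(sqrt(alpha*xi)).
nu, alpha, xi = 0.8, 2.0, 3.0
draws = rvs_gig(nu, alpha, xi, size=200_000, rng=np.random.default_rng(2))
theoretical = (np.sqrt(xi / alpha)
               * kv(nu + 1, np.sqrt(alpha * xi)) / kv(nu, np.sqrt(alpha * xi)))
print(draws.mean(), theoretical)
```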

Drift condition
Consider the following function. This function will be used to establish a geometric drift condition for the Normal-Gamma Markov chain. The first two terms in V(β, τ, σ²) are quadratic forms in β. The choice of quadratic terms like these is fairly standard. However, the third term, Σ_{j=1}^p 1/|β_j|^δ, is a sum of negative fractional powers of the entries of β (whose conditional distribution given τ, σ² and y is multivariate normal, with a non-zero mean vector in general), and is the new feature of this drift function. We contrast this choice with the drift function in Tan and Hobert (2009), which consists of positive fractional powers of quadratic forms of variables whose conditional distribution given the other parameters and the data is multivariate normal.
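Terms of the form 1/|β_j|^δ are usable in a drift function because negative fractional moments of a normal random variable are finite for 0 < δ < 1, and (by the rearrangement argument of Proposition A1 in the appendix) are maximized in the mean-zero case. The sketch below checks both facts numerically; μ, γ and δ are arbitrary illustrative values, and the closed form for the centered case is our own computation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma
from scipy.stats import norm

def neg_frac_moment(mu, gam, delta):
    """E[1/|V|^delta] for V ~ N(mu, gam^2), by numerical integration.
    Finite whenever 0 < delta < 1, despite the singularity at 0."""
    f = lambda v: np.abs(v) ** (-delta) * norm.pdf(v, loc=mu, scale=gam)
    total = 0.0
    # Split at the singularity (0) and at +/-1 to help the quadrature.
    for lo, hi in [(-np.inf, -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, np.inf)]:
        val, _ = quad(f, lo, hi)
        total += val
    return total

mu, gam, delta = 1.5, 2.0, 0.6
# Closed form for the centered case V0 ~ N(0, gam^2):
# E[1/|V0|^delta] = Gamma((1-delta)/2) / (sqrt(pi) * (sqrt(2)*gam)^delta).
centered = Gamma((1 - delta) / 2) / (np.sqrt(np.pi) * (np.sqrt(2) * gam) ** delta)
print(neg_frac_moment(mu, gam, delta), centered)
# The mean-zero case maximizes the expectation (rearrangement inequality),
# so the mu = 1.5 value should not exceed the centered closed form.
```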
Let E_{k_NG}[· | β_0, τ_0, σ²_0] denote the expectation with respect to one step of the Markov chain with transition density k_NG, starting at (β_0, τ_0, σ²_0). The following proposition establishes a geometric drift condition for the transition density k_NG.
It follows by the definition of the Markov transition density k_NG that, where E[· | τ, σ², y] denotes the expectation with respect to π(· | τ, σ², y) (specified in (3.1)), E[· | β_0, σ², y] denotes the expectation with respect to π(· | β_0, σ², y) (specified in (3.2)), and E[· | β_0, τ_0, y] denotes the expectation with respect to π(· | β_0, τ_0, y) (specified in (3.3)). We evaluate the three conditional expectations one step at a time. Note that

Let e_j denote the j-th unit vector in R^p. By (3.1), we get that the conditional density of β_j given (τ, σ², y) is normal. Let λ_X be the largest eigenvalue of X^T X. It follows that for every j = 1, 2, . . . , p,

Here (a) follows from the fact that λ_X I_p − X^T X is a non-negative definite matrix, and (a′) follows from the fact that if u, v ≥ 0 and 0 < δ < 1, then (u + v)^δ ≤ u^δ + v^δ. Using Proposition A1 (see appendix) with β_j playing the role of V, and utilizing the inequality in (3.7), we obtain, for every j = 1, 2, . . . , p,

It follows by summing over all j = 1, 2, . . . , p that (3.9) holds. It follows by (3.6) and (3.9) that the innermost expectation in (3.5) can be bounded as follows.

For the next step, we consider the middle conditional expectation in (3.5). Here, where C_1 is an arbitrary positive constant (we will make an appropriate choice of C_1 later), (b) follows from (Segura, 2011, Theorem 2), (b′) follows from the fact that √(x² + y²) ≤ |x| + |y|, and (b″) follows from the fact that |xy| ≤ (x² + y²)/2, where (c) follows by (3.14) and Proposition A4 (see appendix) with ν := 1/2 − a.
Remark 1. The Bayesian lasso model in Park and Casella (2008) arises as a special case of the Normal-Gamma model, when the hyperparameter a in the Normal-Gamma model is chosen to be 1. In this case, the conditional distribution of 1/τ_j given (β, σ², y) reduces to an Inverse Gaussian distribution for each 1 ≤ j ≤ p. In Khare and Hobert (2013), a proof of geometric ergodicity of the Bayesian lasso Markov chain is provided by using a drift function consisting only of quadratic terms. Hence, a term of the form Σ_{j=1}^p 1/|β_j|^δ is not required. This significantly simplifies the drift and minorization analysis in the Bayesian lasso case.
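The reduction mentioned in Remark 1 reflects a standard distributional fact: a GIG distribution with index ν = −1/2 is exactly an inverse Gaussian distribution. The sketch below verifies this numerically via scipy; the parameter matching between geninvgauss and invgauss is our own derivation, and the constants are illustrative.

```python
import numpy as np
from scipy.stats import geninvgauss, invgauss

# GIG(-1/2, a, b), density proportional to x^{-3/2} exp(-(a*x + b/x)/2),
# is an inverse Gaussian distribution.  In scipy terms (our derivation):
#   geninvgauss(p=-1/2, b=sqrt(a*b), scale=sqrt(b/a))
#       == invgauss(mu=1/sqrt(a*b), scale=b).
a, b = 2.0, 3.0
xs = np.array([0.3, 0.7, 1.2, 2.5])
pdf_gig = geninvgauss.pdf(xs, -0.5, np.sqrt(a * b), scale=np.sqrt(b / a))
pdf_ig = invgauss.pdf(xs, 1.0 / np.sqrt(a * b), scale=b)
print(np.max(np.abs(pdf_gig - pdf_ig)))  # agree up to floating point
```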

Minorization condition
For every d > 0, let The following proposition establishes an associated minorization condition to the geometric drift condition in Proposition 3.1.
The drift and minorization conditions in Proposition 3.1 and Proposition 3.2 can be combined with (Rosenthal, 1995, Theorem 12) to establish geometric ergodicity of the Normal-Gamma Markov chain with transition density k_NG. Theorem 1.1 is a straightforward corollary of Theorem 3.1. Note that the following theorem not only provides a proof of geometric ergodicity, but also provides quantitative convergence bounds, which may be used to compute the number of iterations required for the Markov chain distribution to get sufficiently close to the stationary distribution. Let U = 1 + 2(γd + ϕ). Then for any (β_0, τ_0, σ²_0) ∈ R̄^p × R^p_+ × R_+, r ∈ N and 0 < s < 1, where K^r_{NG,(β_0,τ_0,σ²_0)} denotes the distribution of the Normal-Gamma Markov chain started at (β_0, τ_0, σ²_0) after r steps, and Π_NG denotes the joint distribution of (β, τ, σ²) (conditioned on y) in the Normal-Gamma model.
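Given a quantitative bound of the form ‖K^r − Π‖_TV ≤ M γ^r, the number of iterations needed to reach a prescribed total variation tolerance follows by solving M γ^r ≤ tol for r. A minimal sketch, with made-up constants M and γ rather than values derived from Theorem 3.1:

```python
import math

def iterations_needed(M, gamma, tol):
    """Smallest r with M * gamma**r <= tol, from a geometric ergodicity bound
    ||K^r - Pi||_TV <= M * gamma**r.  (M and gamma are placeholders here; in
    practice they would come from a quantitative bound such as Theorem 3.1.)"""
    if M <= tol:
        return 0
    return math.ceil(math.log(tol / M) / math.log(gamma))

# Example with made-up constants:
r = iterations_needed(M=50.0, gamma=0.9, tol=0.01)
print(r, 50.0 * 0.9 ** r)
```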

The Dirichlet-Laplace model
In this section we analyze the Dirichlet-Laplace Bayesian shrinkage model introduced by Bhattacharya et al. (2013). As in Section 3, let y = (y_i)_{i=1}^n denote the data vector and X denote the known design matrix. The Dirichlet-Laplace model in Bhattacharya et al. (2013) can be specified as follows, where η = (ψ, φ, θ), D_η denotes a diagonal matrix with diagonal elements (ψ_j φ²_j θ²)_{j=1}^p, and a is a known positive constant. Let

S. Pal and K. Khare
Then, the parameter η = (ψ, φ, θ) takes values in the space Δ_p := R^p_+ × S_p × R_+. It can be shown that if a is chosen to be less than 1 in the Dirichlet-Laplace model, then the marginal density of β converges to infinity as one or more entries of β converge to zero. For the same reasons as those discussed at the beginning of Section 3, we assume that the parameter β takes values in R̄^p.
The joint density of (β, η, σ²) (conditioned on y) with respect to the Lebesgue measure on R̄^p × Δ_p × R_+ is given in (4.2). To perform Bayesian statistical inference for this model, we need to compute expectations with respect to the above posterior density. However, these expectations are not available in closed form, and there is no direct method to generate samples from the density in (4.2). Bhattacharya et al. (2013) derive the following expressions for the conditional densities associated with the density in (4.2).
• The conditional density of β given (η, σ², y) is given by (4.3), for β ∈ R̄^p.
• The conditional density of σ² given (β, η, y) is given by (4.4), for σ² ∈ R_+.
3. Draw β_{m+1} from π(· | η_{m+1}, σ²_{m+1}, y) (as described in (4.3)).

The transition density of the Markov chain defined above (with respect to the Lebesgue measure on R̄^p × Δ_p × R_+) is given by

× π(φ, θ | β, σ², y) π(σ² | β, η, y). (4.11)

It is well known, and can be easily verified, that the joint density of (β, η, σ²) (conditioned on y) defined in (4.2) is invariant for the Gibbs transition density k_DL defined above. Since the Markov transition density k_DL is strictly positive, by exactly the same argument as in Section 3, it can be shown that the corresponding Markov chain is Harris recurrent. We now prove geometric ergodicity by establishing a geometric drift condition and an associated minorization condition for the Gibbs transition density k_DL.
Remark 2. Note that the conditional densities of β and σ 2 (in (4.3) and (4.4)) are very similar to the conditional densities of β and σ 2 in the Normal-Gamma model (see (3.1) and (3.2)). Hence, for the steps in the analysis involving the conditional densities of β and σ 2 , we heavily use the results established for analogous steps in Section 3. However, the conditional density of the parameter η in the Dirichlet-Laplace model is significantly different and much more complicated than the conditional density of the parameter τ in the Normal-Gamma model. Hence, steps involving components of η require a different and more involved analysis than in Section 3.
Let E_{k_DL}[· | β_0, η_0, σ²_0] denote the expectation with respect to one step of the Markov chain with transition density k_DL, starting at (β_0, η_0, σ²_0). The following proposition establishes a geometric drift condition for the transition density k_DL.
Proposition 4.1. If n ≥ 3 and a > a* (as defined in Proposition A8 (see appendix)), there exist constants 0 ≤ γ < 1 and ϕ ≥ 0 such that for every (β_0, η_0, σ²_0) ∈ R̄^p × Δ_p × R_+.

Proof. It follows by the definition of the Markov transition density k_DL that (4.13) holds. As in the proof of Proposition 3.1, we evaluate the three conditional expectations one step at a time. Note that the innermost conditional expectation is with respect to the conditional distribution of β, by (4.3). Hence, the method for simplifying the innermost expectation here is very similar to the method for simplifying the innermost expectation in the proof of Proposition 3.1. By exactly the same argument as the one leading to (3.6), it follows that (4.14) holds. Let κ(a) be defined as above for a > 0. By a similar argument as the one leading to (3.7) and (3.8), it follows that for every j = 1, 2, . . . , p, β_j ∼ N(μ_j, σ²_j), where μ_j = e_j^T (X^T X + D_η^{−1})^{−1} X^T y and σ²_j = σ² e_j^T (X^T X + D_η^{−1})^{−1} e_j. It follows from (4.14) and (4.15) that (4.16) holds. We now analyze the middle conditional expectation in (4.13), which requires obtaining an upper bound involving κ(a). Note that by (4.9) and (4.10), φ_j θ | β_0, σ², y ∼ GIG(a − 1, 1, 2|β_{0j}|/σ). Hence, where (g) follows from (Laforgia and Natalini, 2009, Theorem 1.2). Furthermore, where (h) follows from (Laforgia and Natalini, 2009, eq. (1.9)) and (i) follows from (Laforgia and Natalini, 2009, Theorem 1.2). By (4.6), we get the following. Let C_1, C_2 > 0 be arbitrary (we will make appropriate choices for C_1 and C_2 at the end of the proof). It follows by (4.17), (4.18) and (4.19) that for every j = 1, 2, . . . , p, where (j) and (j′) follow from elementary inequalities. Since 0 < δ(a) < 1, the map x → x^{δ(a)} is a concave function for x > 0. By Jensen's inequality and (4.7), we obtain
In the argument leading to (3.16), it was established that ν → K_ν(x) is an increasing function for ν ≥ 0. It follows from (4.21), and the fact established above, that (4.25) holds. Case 3: a = 1.
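The monotonicity of ν → K_ν(x) used above follows from the integral representation K_ν(x) = ∫_0^∞ e^{−x cosh t} cosh(νt) dt, in which cosh(νt) is increasing in ν ≥ 0. A quick numerical check (the values of x and ν below are arbitrary):

```python
import numpy as np
from scipy.special import kv

# For fixed x > 0, nu -> K_nu(x) is increasing on nu >= 0
# (K_nu is symmetric in nu, with a minimum at nu = 0).
for x in [0.5, 1.0, 4.0]:
    nus = np.array([0.0, 0.4, 1.1, 2.5])
    vals = kv(nus, x)
    print(x, np.all(np.diff(vals) > 0))
```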

Minorization condition
For every d > 0, let The following proposition establishes an associated minorization condition to the geometric drift condition in Proposition 4.1.
The drift and minorization conditions in Proposition 4.1 and Proposition 4.2 can be combined with (Rosenthal, 1995, Theorem 12) to establish geometric ergodicity of the Dirichlet-Laplace Markov chain with transition density k DL .

Discussion
In recent years, a number of Bayesian shrinkage models have been developed to tackle high-dimensional data sets. As mentioned earlier, most of these models use Markov chains to sample from the intractable posterior distributions of the parameters. However, important theoretical properties, such as geometric ergodicity of these Markov chains, have not been previously investigated. In this paper, geometric ergodicity of Gibbs sampling Markov chains corresponding to the Bayesian shrinkage models in Griffin and Brown (2010) and Bhattacharya et al. (2013) has been established. The new and interesting feature of these Markov chains, especially from the point of view of establishing geometric ergodicity, is the presence of the Generalized Inverse Gaussian distribution as one of the conditional distributions in the Gibbs sampler. To tackle this, we introduced novel drift functions which include the term Σ_{j=1}^p 1/|β_j|^δ (recall that for both models, the conditional distribution of β given the other parameters and the data is multivariate normal). The GIG distribution involves modified Bessel functions of the second kind, and the properties of these functions are invoked frequently in the intricate drift and minorization analysis presented in Section 3 and Section 4.
Note that the results in Section 3 (for the Normal-Gamma model) hold for every p, X, α, ξ, a and b. The results in Section 4 (for the Dirichlet-Laplace model) hold for all values of p, X, α, ξ, but require that a > a*. As mentioned earlier, the structure of the conditional distribution of the parameter η in the Dirichlet-Laplace model is more complicated than that of the conditional distribution of the parameter τ in the Normal-Gamma model. Hence, it is not entirely surprising that we were able to prove a stronger result for the Normal-Gamma model. Using the insights obtained from the analysis in this paper, in future research we plan to investigate geometric ergodicity (or the lack of it) for the Dirichlet-Laplace models when a ≤ a*, and for Markov chains corresponding to some of the other Bayesian shrinkage models proposed in the literature.

Appendix

For every t > 0, define A_t := {x : f(x) > t}, and let A*_t be the "symmetric rearrangement" of A_t, i.e., the interval centered at the origin with the same Lebesgue measure as A_t. Let f* be the "symmetric decreasing rearrangement" of f, defined by f*(x) := ∫_0^∞ 1_{A*_t}(x) dt. Note that the function g(x) = 1/|x|^δ, where δ ∈ (0, 1), is a decreasing function of |x| and symmetric around 0. It can be easily proved that the symmetric decreasing rearrangement of g is g itself, i.e., g*(x) = 1/|x|^δ. By the Hardy-Littlewood inequality (Lieb and Loss, 2001, Page 82), we get that if V ∼ N(μ, γ²) and V_0 ∼ N(0, γ²), then E[1/|V|^δ] ≤ E[1/|V_0|^δ]. It follows that

E[1/|V_0|^δ] = (2/√(2πγ²)) ∫_0^∞ x^{−δ} exp(−x²/(2γ²)) dx.

By a change of variable with u = x²/γ², we get that

E[1/|V_0|^δ] = 2^{−δ/2} Γ((1 − δ)/2) / (√π γ^δ) < ∞.

Proposition A2. Let ν_1, ν_2 > 0. Then for arbitrary ε* > 0 there exists ε > 0 such that for all 0 < x < ε (obviously ε will depend on ε*, ν_1, ν_2).