Geometric ergodicity of the Bayesian lasso

Consider the standard linear model y = Xβ + σǫ, where the components of ǫ are iid standard normal errors. Park and Casella [14] consider a Bayesian treatment of this model with a Laplace/Inverse-Gamma prior on (β, σ²). They introduce a Data Augmentation approach that can be used to explore the resulting intractable posterior density, and call it the Bayesian lasso algorithm. In this paper, the Markov chain underlying the Bayesian lasso algorithm is shown to be geometrically ergodic, for arbitrary values of the sample size n and the number of variables p. This is important, as geometric ergodicity provides theoretical justification for the use of the Markov chain CLT, which can then be used to obtain asymptotic standard errors for Markov chain based estimates of posterior quantities. Kyung et al. [12] provide a proof of geometric ergodicity for the restricted case n ≥ p, but as we explain in this paper, their proof is incorrect. Our approach is different and more direct, and enables us to establish geometric ergodicity for arbitrary n and p.


Introduction
Consider the standard linear model y = Xβ + σǫ, where y = (y_1, …, y_n)^T ∈ R^n is the vector of observations, X is the (known) n × p design matrix, β ∈ R^p is the (unknown) vector of regression coefficients, the components of ǫ are iid standard normal errors, and σ² is the (unknown) variance parameter. The objective is to estimate (β, σ²). In a variety of modern datasets from genetics, finance, and environmental sciences, the dimension p of β is much larger than the sample size n. An extremely popular approach to handle these kinds of datasets is the lasso, which was introduced in Tibshirani [19]. In this approach, the estimate of β is obtained as

β̂_lasso = argmin_{β ∈ R^p} (y − Xβ)^T (y − Xβ) + λ ∑_{j=1}^p |β_j|,   (1.1)

where λ is a user-specified tuning parameter. Note that the estimate in (1.1) can be regarded as the posterior mode of β (conditional on σ²) if one puts independent Laplace priors on the entries of β. Based on this observation, several authors proposed a Bayesian analysis using a Laplace-like prior for β (see for example [3,7,20]). Park and Casella [14] construct the following hierarchical Bayesian model as a Bayesian parallel/alternative to the lasso:

y | β, τ², σ² ∼ N_n(Xβ, σ² I_n),   (1.2)
β | τ², σ² ∼ N_p(0_p, σ² D_τ), where D_τ = diag(τ²_1, …, τ²_p),   (1.3)
τ²_1, …, τ²_p iid ∼ Exponential(λ²/2), independent of σ²,   (1.4)
σ² ∼ Inverse-Gamma(α, ξ).   (1.5)
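As a quick numerical companion to (1.1): when X has orthonormal columns, the lasso minimizer is available in closed form as coordinate-wise soft-thresholding of X^T y at λ/2. The sketch below is our own illustration (the function names and the simulated data are not from the paper); it computes this closed form and the objective value in (1.1).

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """Closed-form lasso solution of (1.1) when X has orthonormal columns.

    With X^T X = I_p the objective separates coordinate-wise, and the
    minimizer is soft-thresholding of z = X^T y at lam / 2.
    """
    z = X.T @ y
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def lasso_objective(X, y, beta, lam):
    """The objective function in (1.1)."""
    r = y - X @ beta
    return float(r @ r + lam * np.abs(beta).sum())

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 3)))  # orthonormal design
beta_true = np.array([2.0, 0.0, -1.5])
y = Q @ beta_true + 0.1 * rng.standard_normal(20)

beta_hat = lasso_orthonormal(Q, y, lam=1.0)
obj_hat = lasso_objective(Q, y, beta_hat, lam=1.0)
```

For general X no closed form exists and the minimizer is found by coordinate descent, but the orthonormal case already shows the characteristic shrinkage-to-zero behavior of (1.1).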
In Section 2, the Markov transition density (Mtd) of the resulting Gibbs Markov chain, {(β_m, τ²_m, σ²_m)}_{m=0}^∞, is defined and then used to establish that the chain is well behaved (i.e., Harris ergodic) and converges to the target posterior distribution. Thus, we can use this chain to construct strongly consistent estimators of intractable posterior expectations. To be specific, for q > 0, let L_q(π) denote the set of functions g with E_π|g|^q < ∞, where π(β, τ², σ² | y) represents the joint posterior density evaluated at (β, τ², σ²). Harris ergodicity implies that, if g ∈ L_1(π), then the ergodic average

ḡ_m = (1/m) ∑_{i=0}^{m−1} g(β_i, τ²_i, σ²_i)

is strongly consistent for E_π g, no matter how the chain is started. Of course, in practice, an estimator is only useful if it is possible to compute an associated standard error. All available methods of computing a valid asymptotic standard error for ḡ_m are based on the existence of a central limit theorem (CLT) for ḡ_m; that is, we require that

√m (ḡ_m − E_π g) → N(0, φ²) in distribution,

for some positive, finite φ². Unfortunately, even if g ∈ L_q(π) for all q > 0, Harris ergodicity is not enough to guarantee the existence of such a CLT (see for example [16,17]). The standard method of establishing the existence of CLTs is to prove that the underlying Markov chain converges at a geometric rate. Let B(X) denote the Borel sets in X := R^p × R^p_+ × R_+, and let K^m : X × B(X) → [0, 1] denote the m-step Markov transition function of the Gibbs Markov chain. That is, K^m((β_0, τ²_0, σ²_0), A) is the probability that (β_m, τ²_m, σ²_m) ∈ A, given that the chain is started at (β_0, τ²_0, σ²_0). Also, let Π(·) denote the joint posterior distribution. The chain is called geometrically ergodic if there exist a function M : X → [0, ∞) and a constant γ ∈ [0, 1) such that, for all (β, τ², σ²) ∈ X and all m = 0, 1, …, we have

‖K^m((β, τ², σ²), ·) − Π(·)‖_TV ≤ M(β, τ², σ²) γ^m,

where ‖·‖_TV denotes the total variation norm.
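To illustrate how a Markov chain CLT is used to produce asymptotic standard errors in practice, the following sketch applies the classical batch-means estimator to a toy geometrically ergodic chain (a stationary AR(1) process). This is our own illustration, not the paper's chain; the function name and tuning constants are arbitrary choices.

```python
import numpy as np

def batch_means_se(draws, n_batches=30):
    """Batch-means estimate of the Monte Carlo standard error of the
    ergodic average, valid when a Markov chain CLT holds."""
    m = len(draws) // n_batches
    means = draws[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    # The sample variance of the batch means estimates phi^2 / m.
    phi2_hat = m * means.var(ddof=1)
    return np.sqrt(phi2_hat / (m * n_batches))

# Toy geometrically ergodic chain: a stationary AR(1) process with mean 0.
rng = np.random.default_rng(1)
rho, n = 0.5, 200_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.standard_normal()

est = x.mean()          # strongly consistent for the stationary mean 0
se = batch_means_se(x)  # asymptotic standard error justified by the CLT
```

Without geometric ergodicity (or some comparable condition guaranteeing a CLT), the number `se` computed above would have no asymptotic justification; this is precisely the gap that the main result of the paper closes for the Bayesian lasso chain.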
The relationship between geometric convergence and CLTs is simple: if the chain is geometrically ergodic and E_π|g|^{2+δ} < ∞ for some δ > 0, then ḡ_m satisfies a CLT. Moreover, because the Mtd is strictly positive on X (see Section 2), the same 2 + δ moment condition implies that the usual estimators of the asymptotic variance, φ², are consistent [2,8,9,11]. Our main result, which is proven in Section 3 using a geometric drift condition and an associated minorization condition, is the following.

Proposition 1. The Bayesian lasso Gibbs Markov chain is geometrically ergodic for n ≥ 3 and arbitrary p, X, λ.
Hence, the Markov chain CLT holds and can be used to obtain asymptotic standard errors of posterior estimates.
In the restricted case when n ≥ p, Kyung et al. [12] contains, among other results, a proof of geometric ergodicity of the Bayesian lasso Gibbs Markov chain. However, there are serious errors in this proof, which essentially arise from their assertion that the Bayesian lasso model in (1.2)-(1.5) can be obtained as a marginal model of a hierarchical random effects model. A detailed explanation of the problems with the proof is provided in the appendix.

K. Khare and J. P. Hobert
Let η denote the Lebesgue measure on R^p × R^p_+ × R_+. The Bayesian lasso Gibbs Markov chain has a Mtd (with respect to η) given by

k((β_0, τ²_0, σ²_0), (β, τ², σ²)) = π(β | τ², σ², y) π(τ² | β_0, σ², y) π(σ² | β_0, τ²_0, y),   (2.1)

where the π(· | ·, y) denote the corresponding full conditional densities of the posterior. It is well known, and can be verified by a straightforward calculation, that the joint posterior density of (β, τ², σ²) is invariant for the Gibbs transition density k defined above. The Mtd is strictly positive, which implies that the chain is aperiodic and η-irreducible [13, Page 87]. Moreover, the existence of an invariant probability density together with η-irreducibility imply that the chain is positive Harris recurrent (see for example [1]). Note also that η is equivalent to the maximal irreducibility measure. We now prove geometric ergodicity by establishing a geometric drift condition and an associated minorization condition for the Gibbs transition density k.
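For concreteness, a single scan of a Gibbs sampler with this structure can be sketched as follows, using the standard Park–Casella-style full conditionals: Inverse-Gamma for σ², Inverse-Gaussian for the reciprocals 1/τ²_j, and multivariate normal for β. This is an illustrative sketch under assumed hyperparameters (λ, α, ξ) and simulated data, not a verbatim transcription of (2.1).

```python
import numpy as np

def gibbs_scan(beta, tau2, sig2, X, y, lam, alpha, xi, rng):
    """One scan of a Bayesian-lasso-style Gibbs sampler (illustrative sketch).

    Full conditionals follow the standard Park-Casella-type derivation:
      sigma^2 | beta, tau^2      ~ Inverse-Gamma,
      1/tau_j^2 | beta, sigma^2  ~ Inverse-Gaussian,
      beta | tau^2, sigma^2      ~ multivariate normal.
    """
    n, p = X.shape
    # sigma^2 | beta, tau2: Inverse-Gamma((n + p + 2*alpha)/2, rate/2).
    resid = y - X @ beta
    rate = resid @ resid + beta @ (beta / tau2) + 2.0 * xi
    sig2 = (rate / 2.0) / rng.gamma((n + p + 2.0 * alpha) / 2.0)
    # 1/tau_j^2 | beta, sig2: Inverse-Gaussian(sqrt(lam^2 sig2 / beta_j^2), lam^2).
    mu = np.sqrt(lam**2 * sig2 / beta**2)
    tau2 = 1.0 / rng.wald(mu, lam**2)
    # beta | tau2, sig2: N(A^{-1} X^T y, sig2 * A^{-1}), A = X^T X + D_tau^{-1}.
    A = X.T @ X + np.diag(1.0 / tau2)
    A_inv = np.linalg.inv(A)
    cov = sig2 * (A_inv + A_inv.T) / 2.0  # symmetrize for numerical safety
    beta = rng.multivariate_normal(A_inv @ (X.T @ y), cov)
    return beta, tau2, sig2

rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.0, 0.0, 2.0]) + rng.standard_normal(n)
beta, tau2, sig2 = np.ones(p), np.ones(p), 1.0
for _ in range(200):
    beta, tau2, sig2 = gibbs_scan(beta, tau2, sig2, X, y,
                                  lam=1.0, alpha=1.0, xi=1.0, rng=rng)
```

Iterating `gibbs_scan` produces the Markov chain {(β_m, τ²_m, σ²_m)} whose convergence behavior is the subject of this paper.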

Drift condition
Consider the drift function V(β, τ², σ²) mapping R^p × R^p_+ × R_+ into [0, ∞). Let E_k[· | β_0, τ²_0, σ²_0] represent the expectation with respect to one step of the Markov chain with transition density k, starting at (β_0, τ²_0, σ²_0). The following proposition establishes a geometric drift condition for the transition density k.
Proposition 2. There exist a constant γ ∈ [0, 1) and a finite constant b such that, for every (β_0, τ²_0, σ²_0) ∈ R^p × R^p_+ × R_+,

E_k[V(β_1, τ²_1, σ²_1) | β_0, τ²_0, σ²_0] ≤ γ V(β_0, τ²_0, σ²_0) + b.

Proof. It follows from the definition of the Markov transition density k that the left-hand side can be written as an iterated expectation,

E_k[V(β_1, τ²_1, σ²_1) | β_0, τ²_0, σ²_0] = E_{σ²_1}[ E_{τ²_1}[ E_{β_1}[ V(β_1, τ²_1, σ²_1) ] ] ],   (3.2)

where the innermost expectation is over β_1 given (τ²_1, σ²_1), the middle one over τ²_1 given (β_0, σ²_1), and the outermost one over σ²_1 given (β_0, τ²_0). We evaluate the three conditional expectations one at a time, starting with the innermost conditional expectation in (3.2).
For the next step, we analyze terms related to the middle conditional expectation in (3.2). Note that if Z ∼ Inverse-Gaussian(µ′, λ′), then E[Z] = µ′ and E[1/Z] = 1/µ′ + 1/λ′. Hence, the required bound on the middle expectation follows from (1.7). Finally, we analyze terms related to the outermost conditional expectation in (3.2). Combining (3.2)-(3.6), we obtain the bound in (3.7). It follows from (3.7) that

E_k[V(β_1, τ²_1, σ²_1) | β_0, τ²_0, σ²_0] ≤ γ V(β_0, τ²_0, σ²_0) + b,   (3.8)

where γ ∈ [0, 1) is the constant given in (3.9) and

b = y^T y + p(n + 2p + 2α)/(2λ²) + p/λ² + 2pξ/(n + p + 2α − 2).   (3.10)

Hence, the required geometric drift condition is established.

Minorization condition
Recall the drift function V(β, τ², σ²) and the constant d from the drift condition. The following proposition establishes a minorization condition associated with the geometric drift condition established in Proposition 2.

Proposition 3. There exist a constant 0 < ǫ = ǫ(V, d) ≤ 1 and a probability density function f̃ on R^p × R^p_+ × R_+ such that

k((β_0, τ²_0, σ²_0), (β, τ², σ²)) ≥ ǫ f̃(β, τ², σ²) whenever V(β_0, τ²_0, σ²_0) ≤ d.

In the proof, the full conditional density of τ² is bounded below in (3.12) using g_1(τ²_j | σ²), the density of the reciprocal of an Inverse-Gaussian random variable with parameters √(λ²σ²/d²) and λ². Let us now consider the full conditional density of σ² given (β_0, τ²_0, y). For (β_0, τ²_0) in the small set, the lower bound in (3.14) holds; the last inequality follows since τ²_{0j} ≤ d for every 1 ≤ j ≤ p. Note that I_n − X(X^T X + (1/d) I_p)^{−1} X^T is a positive definite matrix. Hence, it follows from (3.13) and (3.14) that, for (β_0, τ²_0) in the small set, the full conditional density of σ² is bounded below by a positive multiple of g_2(σ²), the density of the Inverse-Gamma distribution with parameters (n + p + 2α)/2 and (d + 2ξ + λ²d²)/2 (3.15). It then follows from (2.1), (3.12) and (3.15) that the displayed minorization holds with an explicit ǫ and with f̃ a probability density on R^p × R^p_+ × R_+ built from g_1, g_2, and the full conditional density of β. Hence, the required minorization condition for k has been established.
The drift and minorization conditions in Proposition 2 and Proposition 3 can be combined with Theorem 12 of Rosenthal [18] to establish geometric ergodicity of the Bayesian lasso Gibbs Markov chain: an explicit total variation bound holds for any (β_0, τ²_0, σ²_0) ∈ R^p × R^p_+ × R_+, m ∈ N, and 0 < r < 1. Proposition 1 is an immediate corollary of the above result.
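In generic form (up to notational differences), the bound delivered by Theorem 12 of Rosenthal [18], given a drift condition E_k[V | ·] ≤ γV + b and a minorization with constant ǫ on the small set {V ≤ d} with d > 2b/(1 − γ), reads, for any 0 < r < 1:

```latex
% Rosenthal-type total variation bound (schematic form):
\bigl\| K^m\bigl((\beta_0,\tau^2_0,\sigma^2_0),\cdot\bigr) - \Pi(\cdot) \bigr\|_{TV}
 \;\le\; (1-\epsilon)^{rm}
 \;+\; \bigl( \alpha^{-(1-r)} A^{r} \bigr)^{m}
        \Bigl( 1 + \frac{b}{1-\gamma} + V(\beta_0,\tau^2_0,\sigma^2_0) \Bigr),
\qquad
\alpha^{-1} = \frac{1 + 2b + \gamma d}{1 + d},
\quad
A = 1 + 2(\gamma d + b).
```

The first term decays through the minorization constant ǫ, the second through the drift rate γ; geometric ergodicity follows because both factors are strictly less than one for suitable r.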

Discussion
The Bayesian lasso is a widely used algorithm for sparse Bayesian estimation in linear regression. In this paper, we have established geometric ergodicity of the Markov chain corresponding to the Bayesian lasso algorithm.
As discussed previously, this proves the existence of the Markov chain CLT in this context, thereby allowing the user to obtain valid asymptotic standard errors for Markov chain based estimates of posterior quantities. The Laplace prior for β used in the Bayesian lasso algorithm is an example of the so-called "continuous shrinkage priors" for sparse Bayesian estimation. In recent years, many other continuous shrinkage priors have been introduced and studied in the literature (see [4,5,6,10,15] and the references therein), and sampling from the resulting posterior distribution is often achieved by using MCMC. To the best of our knowledge, geometric ergodicity of the corresponding Markov chains has not been investigated. Using the insights obtained from the analysis in this paper, we are currently investigating geometric ergodicity for Markov chains corresponding to some of the other continuous shrinkage priors proposed in the literature.

Appendix
We point out the errors in the proof of geometric ergodicity of the Bayesian lasso Gibbs Markov chain in Kyung et al. [12] (referred to henceforth as KGGC). The authors in KGGC only consider the case n ≥ p. The authors assert that the Bayesian lasso model can be obtained by appropriately marginalizing a hierarchical random effects model. They then prove the geometric ergodicity of the Markov chain corresponding to this hierarchical random effects model, and argue that this implies geometric ergodicity of the Bayesian lasso model. We start by reproducing (verbatim) the hierarchical random effects model from KGGC (Page 406, appendix) below.
where Σ(τ) is a p × p matrix whose form is specified in (5.6) of KGGC. We refer to this hierarchical random effects model as the 'RE model'.
• There is a contradiction between (5.5) and (5.6) regarding the prior distribution of τ²_j. In (5.5), τ²_j is assumed to be Inverse-Gamma, whereas in (5.6), τ²_j is assumed to be Exponential. The authors proceed to prove geometric ergodicity of the RE model under the Inverse-Gamma assumption in (5.5). Since the Bayesian lasso model assumes the prior for τ²_j to be Exponential, it clearly cannot be obtained by marginalizing the RE model with the Inverse-Gamma assumption in (5.5). Hence, the proof of geometric ergodicity for the RE model with the Inverse-Gamma assumption has no bearing on the geometric ergodicity of the Bayesian lasso model.
• y is assumed to be p-dimensional in (5.1). In the Bayesian lasso model, the data vector y is n-dimensional. Since the authors are considering the case n > p, this clearly rules out the possibility of marginalizing the RE model to get the Bayesian lasso model. If this is a typo, and y is n-dimensional, then θ is n-dimensional. Hence, Σ(τ) is an n × n matrix. However, Σ(τ) is assumed to be a p × p matrix in (5.6).
• There is another error in the marginalizing argument provided in KGGC.
On Page 381 (Section 4), the authors consider the model (5.7), in which y | θ, σ² ∼ N_n(θ, σ² I_n) and θ = µ1_n + Xβ + X̃η (with 1_n, X, X̃ mutually orthogonal). Here 1_n is the n-dimensional vector of all ones, and J is the n × n matrix of all ones. They claim that marginalizing over η in (5.7) gives the marginal model (5.8). Firstly, there is a small issue in terms of µ. Note that, by the mutual orthogonality of 1_n, X, X̃,

1_n^T θ / n = 1_n^T (µ1_n + Xβ + X̃η) / n = µ.
Since θ ∼ N_n(µ1_n, Σ_θ), this implies 1_n^T θ / n ∼ N(µ, 1_n^T Σ_θ 1_n / n²), which is paradoxical: it implies both that µ is a random variable with nonzero variance (Σ_θ is a full rank matrix) and that µ is a constant. Let us get rid of this discrepancy by assuming µ = 0 and

Σ_θ^{−1} = (λ/σ²) X (X^T X)^{−1} D_τ^{−1} (X^T X)^{−1} X^T + X̃ X̃^T,

and choosing the transformation θ = Xβ + X̃η, where η is now (n − p)-dimensional, and X̃^T X = 0. Let us assume without loss of generality that X̃^T X̃ = I_{n−p} (the choice of X̃ is completely in our hands anyway). Still, the marginal model after integrating out η is not the same as (5.8). An immediate way to check this is that V(y) = σ² I_n + Σ_θ under (5.7), while V(y) = σ² I_n + (σ²/λ) X D_τ X^T under (5.8). We now derive the correct marginal model. The joint density of (y, θ) is available in closed form. By making the linear transformation (y, θ) → (y, β, η), completing the squares, and using X̃^T X = 0, it follows that the marginal model is given by

y | β, σ² ∼ N_n(Xβ, σ² I_n + X̃ X̃^T) and β | σ² ∼ N_p(0_p, (σ²/λ) D_τ).
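The variance discrepancy above is easy to verify numerically. The sketch below is our own construction (X, X̃, and D_τ are arbitrary choices satisfying the stated orthogonality): it checks that Σ_θ^{−1} has the displayed form, and that σ² I_n + Σ_θ differs from the (5.8) variance σ² I_n + (σ²/λ) X D_τ X^T.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 2
sig2, lam = 1.3, 0.7

# Split an orthonormal basis of R^n into col(X) and col(Xtilde).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = Q[:, :p] @ np.diag([2.0, 3.0])   # X with X^T Xtilde = 0 (not orthonormal)
Xt = Q[:, p:]                        # Xtilde: (n - p) orthonormal columns
D = np.diag([0.5, 1.5])              # D_tau

# Sigma_theta = Var(X beta + Xtilde eta), beta ~ N(0, (sig2/lam) D), eta ~ N(0, I).
Sigma = (sig2 / lam) * X @ D @ X.T + Xt @ Xt.T

# Claimed inverse: (lam/sig2) X (X^T X)^{-1} D^{-1} (X^T X)^{-1} X^T + Xt Xt^T.
XtX_inv = np.linalg.inv(X.T @ X)
Sigma_inv = ((lam / sig2) * X @ XtX_inv @ np.linalg.inv(D) @ XtX_inv @ X.T
             + Xt @ Xt.T)

check = Sigma @ Sigma_inv  # should be the n x n identity
# Largest entry of the difference between the (5.7) and (5.8) variances of y;
# the gap is exactly Xtilde Xtilde^T, which is nonzero.
gap = np.abs((sig2 * np.eye(n) + Sigma)
             - (sig2 * np.eye(n) + (sig2 / lam) * X @ D @ X.T)).max()
```

The nonzero `gap` is exactly the extra term X̃ X̃^T identified in the text, confirming that marginalizing (5.7) cannot produce (5.8).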
This is different from the Bayesian lasso model, as the conditional variance of y given (β, σ²) is σ² I_n + X̃ X̃^T (as opposed to σ² I_n in the Bayesian lasso model).