Geometric Ergodicity of Gibbs Samplers in Bayesian Penalized Regression Models

We consider three Bayesian penalized regression models and show that the respective deterministic scan Gibbs samplers are geometrically ergodic regardless of the dimension of the regression problem. We prove geometric ergodicity of the Gibbs samplers for the Bayesian fused lasso, the Bayesian group lasso, and the Bayesian sparse group lasso. Geometric ergodicity along with a moment condition results in the existence of a Markov chain central limit theorem for Monte Carlo averages and ensures reliable output analysis. Our results of geometric ergodicity allow us to also provide default starting values for the Gibbs samplers.


Introduction
Let y ∈ R n be the observed realization of the response Y , X be the n × p model matrix, and β ∈ R p be the regression coefficient vector.The goal, generally, is to identify important predictors amongst the p covariates and estimate the corresponding coefficients in β.However, in many problems, like genetics, image processing, chemometrics, economics, the number of covariates, p can be much larger than n, making it difficult to use classical regression techniques.Bayesian and frequentist penalization methods have been found to be very useful in such situations.Consider the Bayesian regression model of the form where α, ξ ≥ 0 are assumed known, Σ η is a p × p covariance matrix determined by η ∈ R s + , and p(η) is a proper prior on η.Many Bayesian penalized regression and variable selection models can be presented in this framework (see for example Guan and Stephens (2011); Kyung et al. (2010); Park and Casella (2008); Yang et al. (2016)).The resulting posteriors are often intractable and Markov chain Monte Carlo (MCMC) is used to estimate model parameters.
Consider the Bayesian fused lasso, the Bayesian group lasso Kyung et al. (2010), and the Bayesian sparse group lasso Xu and Ghosh (2015), all three of which belong to the family of models in (1).These models have been used in a variety of problems.The Bayesian group lasso and the Bayesian sparse group lasso find use in medical research Fan et al. (2017); Gu et al. (2013); Nathoo et al. (2016); Raman et al. (2010).The Bayesian fused lasso has been used in breast cancer research Zhang et al. (2014).Given the use of these models in medical research, reliable inference is essential.
Reliable estimation from MCMC output rests heavily on the rate of convergence of the Markov chain.In particular, a geometric rate of convergence lets users appeal to the Markov chain central limit theorem (CLT), allowing for the estimation of Monte Carlo error in posterior estimates and consistent estimation of effective sample size.We show that the MCMC samplers used in the three models converge to their respective stationary distribution at a geometric rate.That is, we show that the Gibbs samplers are geometrically ergodic (formal definitions are in Section 2).
In the models we study, the full conditionals for β, η and σ 2 are available in closed form so that it is straightforward to draw samples from f (β | η, σ 2 , y), f (η | β, σ 2 , y), and f (σ 2 | β, η, y).As a consequence, a three variable deterministic scan Gibbs sampler is implemented to draw approximate samples from the intractable posterior distribution and inference is done using sample statistics.The quality of estimation is affected not only by the size of the Monte Carlo sample, but also by the rate of convergence of the Gibbs sampler.We show that all three Gibbs samplers converge to their respective stationary distribution at a geometric rate under reasonable conditions.Specifically, we only require the number of observations, n, to be larger than three and require no assumptions on the number of covariates, p or the model matrix X.This geometric rate of convergence allows for reliable estimation of posterior quantities in the following way.
However, in finite samples there is typically a non-zero Monte Carlo error θ N − θ and an approximate sampling distribution of this error maybe available via a Markov chain CLT.
Let • denote the Euclidean norm.If the deterministic scan Gibbs sampler is geometrically ergodic and X g(β, η, σ 2 ) 2+δ f (β, η, σ 2 | y)dβ dη dσ 2 < ∞ , then a Markov chain CLT holds as below: where Σ is the d × d asymptotic covariance matrix that is difficult to calculate due to the serial correlation in the Markov chain.However, if the process is geometrically ergodic, then Vats et al. (2015a) and Vats et al. (2015b) provide strongly consistent estimators of Σ.
This leads to the construction of asymptotically valid confidence ellipsoids around θ N and consistent estimation of effective sample size Vats et al. (2015a).Under the assumption of geometric ergodicity, the diagonals of Σ were estimated by Flegal and Gong (2015), Flegal and Jones (2010), Gong and Flegal (2016), Hobert et al. (2002), andJones et al. (2006) leading to reliable univariate analysis of MCMC output.For estimating quantiles, Doss et al. (2014) show that geometric ergodicity leads to strongly consistent estimators of the Monte Carlo error.
There has been a considerable amount of work done in establishing geometric ergodicity of Gibbs samplers; many of which are two variable Gibbs samplers.Two variable Gibbs samplers are special because the marginal process for each variable is a Markov chain with the same rate of convergence as the joint chain.Thus, it is sufficient to study the marginal chains to ascertain the properties of the joint chain.Higher variable Gibbs samplers do not benefit from this property and thus studying their rate of convergence is often more challenging.Geometric ergodicity of the three variable Gibbs samplers in the Bayesian lasso and the Bayesian elastic net were shown by Khare and Hobert (2013) and Roy and Chakraborty (2017), respectively; Pal and Khare (2014) proved geometric ergodicity of the three variable Gibbs sampler for the normal-gamma model of Griffin and Brown (2010); Khare and Hobert (2012) demonstrated geometric ergodicity of the three variable Gibbs sampler in Bayesian quantile regression, and Doss and Hobert (2010) and Jones and Hobert (2004) demonstrated geometric ergodicity of the three variable Gibbs sampler in hierarchical random effects models.Recently, Johnson and Jones (2015) established geometric ergodicity of a four variable random scan Gibbs sampler for a hierarchical random effects model.
The rest of the paper is organized as follows.In Section 2 we present important definitions and some relevant Markov chain background.In Section 3, Section 4, and Section 5 we present the models and main results for the Bayesian fused lasso, Bayesian group lasso, and the Bayesian sparse group lasso.We finish with a discussion in Section 6.All proofs are deferred to the appendices.

Markov Chain Background
Recall that F denotes the posterior distribution of (β, η, σ 2 ) obtained from (1) and is the associated density.Also recall that X = R p × R s + × R + is the support of the posterior and let B(X) denote the Borel σ-algebra.Let f (β | η, σ 2 , y) be the density of the full conditional distribution of β and similarly denote the densities of the conditional distributions of η and σ 2 with f (η | β, σ 2 , y) and f (σ 2 | β, η, y), respectively.Let (β (0) , η (0) , σ 2(0) ) be the starting value for the Gibbs sampler and define the Markov chain transition density (MTD) for the deterministic scan Gibbs sampler as Then, the one-step transition kernel P : X × B(X) → [0, 1] is such that for any A ∈ B(X), Similarly, the t-step Markov chain transition kernel for the deterministic scan Gibbs sampler is P t : X × B(X) → [0, 1] such that for all A ∈ B(X), Let • T V denote total variation norm.If the Markov chain is aperiodic, irreducible, and Harris recurrent (see Meyn and Tweedie (2009) for definitions), then for all (β (0) , η (0) , σ 2(0) ) ∈ X However, convergence of the transition kernel to the invariant distribution is not sufficient to ensure reliable inference and a geometric rate of convergence is often required.The Gibbs sampler is geometrically ergodic if there exists a function M : X → [0, ∞) and 0 ≤ ρ < 1 such that for all (β (0) , η (0) , σ 2(0) ) ∈ X, (3) Since ρ < 1, the upper bound in (3) decreases at a geometric rate as a function of t.We will show that the three Gibbs samplers are geometrically ergodic by establishing a drift condition and an associated minorization condition.In effect, we will determine M up to a proportionality constant and minimize this quantity to arrive at default starting values for the Gibbs samplers.Our results can also be used to obtain quantitative upper bounds for (3) using the results of Rosenthal (1995); we do not explore that here.
Geometric ergodicity is often demonstrated by establishing a drift condition and an associated minorization condition.A drift condition is said to hold if there exists a function V : X → [0, ∞), and constants 0 < φ < 1 and L < ∞ such that for all (β 0 , η 0 , σ In ( 4), the expectation is with respect to the MTD for the Gibbs sampler.
Consider for d > 0, the set tion holds if there exists an ǫ > 0 and a distribution Q such that for all (β 0 , η 0 , σ 2 0 ) ∈ C d It is well know that both (4) and ( 5) together imply geometric ergodicity (see Jones and Hobert (2001) and Meyn and Tweedie (2009)).The drift rate φ determines how fast the Markov chain drifts back to the small set C d .A drift rate close to one signifies slower convergence and a smaller value indicates faster convergence.See Jones and Hobert (2001) for a heuristic explanation.
When a drift condition holds, Meyn and Tweedie (2009), Roberts and Rosenthal (1997), and (Roberts and Rosenthal, 2004, Fact 10) explain that the function M is proportional to the drift function V up to an unknown constant.Thus, minimizing V over the state space leads to the tightest bound in (3) for our choice of V .This will lead us to default starting values for the three Gibbs sampler.
3 Bayesian Fused Lasso Recall that y ∈ R n is the observed realization of the response Y , X is the n × p model matrix, and β ∈ R p is the regression coefficient vector.Tibshirani et al. (2005) proposed the fused lasso in an effort account for ordering in the predictors.In addition to penalizing the L 1 norm of the coefficients, the fused lasso also penalizes pairwise differences.That is, for tuning parameters λ 1 , λ 2 > 0, the fused lasso estimate is, βfused = arg max A Bayesian formulation of the fused lasso requires a prior on β so that the resulting posterior mode is the βfused .Kyung et al. (2010) present the following Bayesian formulation of the fused lasso.Let Let τ 2 = (τ 2 1 , . . ., τ 2 p ) and w 2 = (w 2 1 , . . ., w 2 p−1 ).Kyung et al. (2010) state that the priors in (7) lead to the following marginal prior on β given σ 2 .
However, this is not the case and in particular, the independent exponential priors on τ 2 and w 2 do not lead to the marginal prior in (9).Instead, our proposed prior is In Appendix B.1, we show that the prior on (τ 2 , w 2 ) in ( 10) is proper and in Appendix B.2 we demonstrate that the marginal prior on β given σ 2 is the appropriate prior in (9).
Thus, our model formulation is a valid Bayesian fused lasso model.

Gibbs Sampler for the Bayesian Fused Lasso
The resulting full conditionals from the model in (7) with prior (10) are, Here the Inverse-Gaussian(a, b) density is f (x) ∝ x −3/2 exp(−b(x − a) 2 /2a 2 x) and the density of an Inverse-Gamma(a, b) distribution is f (x) ∝ x −a−1 exp(−b/x).Notice that the full conditionals for τ 2 and w 2 are independent and thus can be updated in one block.This reduces the four variable Gibbs sampler to a three variable Gibbs sampler.
) is the current state of the Gibbs sampler the (t + 1)th state is obtained as follows.
This three variable deterministic scan Gibbs sampler has MTD, First we note that the full conditional distribution of 1/τ 2 i is an Inverse-Gaussian with mean parameter λ 2 1 σ 2 /β 2 i .If the starting value for any β i is zero, this Inverse-Gaussian is still well defined as it is an Inverse-Gamma distribution with shape parameter 1/2 and rate parameter λ 2 1 /2.The same is true for the full conditional of 1/w 2 i .Thus, the MTD is strictly positive and well defined which implies the Markov chain is aperiodic, irreducible almost everywhere, and Harris recurrent.
We define the drift function The following theorem is proved by establishing ( 4) and ( 5) for the drift function V BF L .
Theorem 1.If n ≥ 3, the three variable Gibbs sampler for the Bayesian fused lasso is geometrically ergodic.
Proof.See Appendix C.
Remark 1.In Appendix C.1, we arrive at the drift rate Thus, φ BF L is no better than 1/2 and as p increases, the drift rate approaches one.Thus, convergence may be slower for problems with large p.

Bayesian Group Lasso
Knowledge of correlation among predictors is ignored by the usual lasso.The group lasso of Yuan and Lin (2006) imposes sparsity across grouped predictors.For a fixed K, partition Let X G k denote the matrix of predictors for group k.The group lasso estimate for tuning Kyung et al. ( 2010) present the following Bayesian analog of the group lasso.Let where λ > 0 is fixed, α, ξ ≥ 0 are known, and the density of a Gamma(a, b) is f (x) ∝ x a−1 e −bx .

Gibbs Sampler for Bayesian Group Lasso
Let ) .
The Bayesian group lasso in (15) leads to the following full conditionals for β, τ 2 and σ 2 : These full conditionals lead to a three variable Gibbs sampler where the variables are β, τ 2 , and σ 2 .
Remark 3. Kyung et al. (2010) propose a K + 2 variable Gibbs sampler where the variables and σ 2 .For this sampler, the full conditionals for σ 2 and τ 2 are the same as above, but the full conditional for each Kyung et al. ( 2010) had an error in their full conditional where they had The motivation for using the K + 2 sampler is to avoid the p × p matrix inversion of (X T X + D −1 τ ), and instead do K matrix inversions each of size m k × m k .This reduces the computational cost from O(p 3 ) to O( K k=1 m 3 k ).Such a technique was also discussed in Ishwaran and Rao (2005).However, it is known that a blocked Gibbs sampler mixes as well as or better than a full Gibbs sampler (see Liu et al. (1994)).In addition, Bhattacharya et al. (2016) recently proposed a linear time sampling algorithm to sample from high-dimensional normal distributions of the form in (16).Using their method, the computational cost of drawing from the full conditional of β is O(n 2 p), and thus the K + 2 variable Gibbs sampler is not required.
We will study the rate of convergence of the three variable Gibbs sampler.If (β (t) , τ 2 (t) , σ 2 (t) ) is the current state of the Gibbs sampler, the (t + 1)th state is obtained as follows.
The MTD for the above three variable deterministic scan Gibbs sampler is As in the Bayesian fused lasso, the MTD is well defined and strictly positive leading to an aperiodic, irreducible almost everywhere, and Harris recurrent Markov chain.
Define the drift function Theorem 2. If n ≥ 3, the three variable Gibbs sampler for the Bayesian group lasso is geometrically ergodic.
Proof.See Appendix D.
Remark 4. As in the Bayesian fused lasso Gibbs sampler, the drift rate, is no better than 1/2 and approaches 1 as p increases.
Remark 5. Minimizing V BGL yields default starting values for the Markov chain as β 0 being the frequentist group lasso estimate and τ 2 0,k = 2 β T 0,G k β 0,G k /λ.See Appendix D.3 for details.Remark 6.Since for K = p, the Bayesian group lasso is the Bayesian lasso, our result of geometric ergodicity holds for the Bayesian lasso as well.Geometric ergodicity of the Bayesian lasso was demonstrated by Khare and Hobert (2013) under exactly the same conditions.Our result on the starting values in Remark 5 also holds for the Bayesian lasso Gibbs sampler.

Bayesian Sparse Group Lasso
The group lasso induces sparsity across groups but does not induce sparsity within a group.Simon et al. (2013) added an L 1 penalty on the individual coefficients to the group lasso to arrive at the sparse group lasso.As before, for a fixed K, partition β in K groups each of size m 1 , m 2 , . . ., m K , the groups being denoted by β G 1 , β G 2 , . . ., β G K .For tuning parameters λ 1 > 0 and λ 2 > 0, the sparse group lasso estimate is where • 1 is the L 1 norm.The Bayesian sparse group lasso was introduced by Xu and Ghosh (2015).Before presenting the model, we give some definitions.Let and τ 2 1 , . . ., τ 2 p be variables defined on the positive reals.For each group k define, The notation γ 2 k,j is purely for convenience and can easily be replaced with The Bayesian sparse group lasso model formulated by Xu and Ghosh (2015) is where α, ξ ≥ 0 are fixed and the independent prior on each Here λ 1 , λ 2 > 0 are fixed.Xu and Ghosh (2015) show that the prior in ( 21) is proper with the normalizing constant being a function of λ 1 and λ 2 .

Gibbs Sampler for Bayesian Sparse Group Lasso
Define V τ,γ to be the diagonal matrix with diagonals being that of V 1 , . . ., V K in that sequence.In addition, let β k,j , refer to the jth coefficient in the kth group.The Bayesian sparse group lasso model in (20) leads to the following full conditionals for β, τ 2 , γ 2 and σ 2 : Notice that the full conditionals for τ 2 and γ 2 are independent and thus can be updated in one block leading to a three variable Gibbs sampler.If (β (t) , τ 2 (t) , γ 2 (t) , σ 2 (t) ) is the current state of the Gibbs sampler, the (t + 1)th state is obtained as follows.
The MTD for the three variable Gibbs sampler is As in the Bayesian group lasso Gibbs sampler, the MTD is strictly positive and thus aperiodic, irreducible almost everywhere, and the chain is Harris recurrent.We will prove geometric ergodicity by establishing a drift and an associated minorization condition.
Define the drift function Theorem 3. If n ≥ 3, the three variable Gibbs sampler for the Bayesian sparse group lasso is geometrically ergodic.
Proof.See Appendix E.
Remark 7. Define M = max k m k .In Appendix E.1 the drift rate is determined to be Unlike the drift rate in the previous two models, the drift rate here can be lower than 1/2.
However, it is likely that p is large enough so that φ BSGL is determined by the first term p/(n + p + 2α − 2).In this case again, the drift rate will tend to 1 as p increases and thus convergence may be slower for large p problems.
Remark 8.A reasonable starting value for this Markov chain is β 0 being the sparse group lasso estimate,

Discussion
As discussed in Section 1, reliable estimation from MCMC output rests heavily on the rate of convergence of the Markov chain.Our geometric ergodicity results immediately implies the existence of a Markov chain CLT and strong consistency of some estimators of the asymptotic covariance matrix in this CLT.As a consequence, practitioners can use tools such that effective sample size to understand the quality of the Monte Carlo estimates.
Our results of geometric ergodicity hold under reasonable conditions.We require no conditions on p, and only need n to be larger than 3.However, our results suggest that it may be possible for the Gibbs samplers to converge at a slower rate if p ≫ n.This agrees with the results in Rajaratnam and Sparks (2015).Users might then be inclined to first use Bayesian variable selection alternatives to these models.For example, Xu and Ghosh (2015) introduced the Bayesian variable selection alternatives to the group and the sparse group lasso by using spike-and-slab type priors.A natural direction for future research would be to investigate the convergence rate for the Gibbs samplers in these Bayesian variable selection models.

A Preliminaries
In general, E (k) represents expectation with respect to the MTD being studied in the section.Expectations with respect to a full conditional is denoted by E • .The index 0 on variables denotes starting values for the Markov chain.
Below are some properties of known distributions that will be used often.

A.1 Useful Lemmas
We present some results that will used in the proofs of geometric ergodicity for all three samplers.Most of the results are generalizations of the results in Khare and Hobert (2013) and the proofs are presented here for completeness.
Lemma 1.Let y, X, and β be the observed n × 1 response, the n × p matrix of covariates and the p × 1 vector of regression coefficients.Let Σ be the p × p positive definite matrix such that Lemma 2. For α = (α 1 , . . ., α p ) ∈ R p and δ = (δ 1 , . . ., δ p ) such that δ i = 0, Proof.Using the fact that the square of a number is non-negative, Lemma 3.For λ 2 , a 2 , σ 2 > 0, if X has a probability density function f (x) such that then 1/X ∼ Inverse-Gaussian distribution with mean parameter λ 2 σ 2 /a 2 and scale parameter λ 2 .
Lemma 4. If 1/X ∼ Inverse-Gaussian with mean parameter λ 2 σ 2 /a 2 and scale parameter λ 2 and a 2 ≤ d 2 for some d 2 > 0, then where f (x) is the pdf of X and q(x) is the pdf of the reciprocal of the Inverse-Gaussian distribution with mean parameter λ 2 σ 2 /d 2 and scale parameter λ 2 .
Proof.By Lemma 3, we have Lemma 5. Let y, X, and β be the observed n × 1 response, the n × p matrix of covariates and the p × 1 vector of regression coefficients respectively.Let Σ be a p × p positive definite matrix.Then, Proof.The proof mainly requires completing the square in the following way, B Bayesian Fused Lasso Prior

B.1 Propriety of the Prior
First note that det (Σ τ,w ) = det Σ −1 τ,w −1 .We decompose Σ −1 τ,w into where The diagonal matrix L 1 is clearly positive definite.The tridiagonal matrix L 2 is also positive definite since L 2 is real symmetric, has positive diagonals, and is strictly diagonally dominant (Andelić and Da Fonseca, 2011, Theorem 1.2).Here the condition of strict diagonal dominance is satisfied since Thus, the joint prior on (τ 2 , w 2 ) satisfies, . This is the product of p exponentials densities and p − 1 Gamma densities.Thus, the prior is proper.

B.2 Validity of the Prior
In this section we demonstrate that our choice of prior in the Bayesian fused lasso leads to the Laplace prior in ( 9).First we expand β T Σ −1 τ,w β in the following way: Using ( 26), where the last equality is due to the integrands being the densities of the reciprocal of Inverse-Gaussian distributions; see Lemma 3.

C Proof of Geometric Ergodicity in Bayesian Fused Lasso
We will establish geometric ergodicity of the three variable Gibbs sampler for the Bayesian fused lasso by establishing a drift condition and an associated minorization condition.

C.1 Drift Condition
Consider the drift function To establish the drift condition we need to show that there exists a 0 < φ BF L < 1 and L BF L > 0 such that, The left hand side is the expectation with respect to the MTD, that is, We will evaluate these sequentially, starting with the innermost expectation.By Lemma 1, Next we move on to the expectation with respect to the full conditional of τ 2 , w 2 .Note that using the properties of the Inverse-Gaussian distribution mentioned in Appendix A. Since Finally, the last expectation, Using ( 26), Using ( 30) in ( 29), where
D Proof of Geometric Ergodicity in the Bayesian Group Lasso

D.1 Drift Condition
Consider the drift function For the drift condition we need to show that there exists a 0 < φ BGL < 1 and L BGL > 0 such that, Just as in the proof for BFL, We will evaluate the expectations sequentially, starting with the innermost expectation.
By Lemma 1 and following the steps as before ( 28), .
For the last expectation, using steps as before (29), we get Recall that, Let the diagonals of D τ 0 be τ 2 0, * i for i = 1, . . ., p. Then where

D.3 Starting Values
As before, we first differentiate with respect to τ 2 and then with respect to β.Note that Thus, the β 0 that minimizes V BGL is then, which equivalent to the group lasso solution.Thus a reasonable starting value for the Markov chain is β 0 being the group lasso estimate and

E Proof of Geometric Ergodicity in the Bayesian Sparse
Group Lasso

E.1 Drift Condition
Consider the drift function By Lemma 1 and following the steps as before ( 28) Define M = max{m 1 , . . ., m K }.In addition, define Then, For the last expectation, using steps as before (29), we get Let v 0,i denote the diagonals of V τ 0 ,γ 0 .Then by Lemma 2, and the fact that the harmonic mean of positive numbers is less than their arithmetic mean, Using ( 50) in ( 49),