Geometric ergodicity of Polya-Gamma Gibbs sampler for Bayesian logistic regression with a flat prior

The logistic regression model is the most popular model for analyzing binary data. In the absence of any prior information, an improper flat prior is often used for the regression coefficients in Bayesian logistic regression models. The resulting intractable posterior density can be explored by running Polson et al.'s (2013) data augmentation (DA) algorithm. In this paper, we establish that the Markov chain underlying Polson et al.'s (2013) DA algorithm is geometrically ergodic. Proving this theoretical result is practically important as it ensures the existence of central limit theorems (CLTs) for sample averages under a finite second moment condition. The CLT in turn allows users of the DA algorithm to calculate standard errors for posterior estimates.


Introduction
Let (Y 1 , Y 2 , . . . , Y n ) denote the vector of Bernoulli random variables and x i be the p × 1 vector of known covariates associated with the ith observation for i = 1, . . . , n. Let β ∈ R p be the unknown vector of regression coefficients. A generalized linear model can be built (McCulloch et al., 2011) with a link function that connects the expectation of Y i with the covariate x i . One popular link function is the logit link function, F −1 (·), where F is the cumulative distribution function of the standard logistic random variable, that is F (t) ≡ e t /(1 + e t ) for t ∈ R. The logit link function leads to the logistic regression model, The popularity of the logistic regression model is due to the fact that P (Y i = 1) has a closed form expression of x T i β, and it is easy to interpret β in terms of odds ratio.
Let y = (y 1 , y 2 , . . . , y n ) T be the vector of observed Bernoulli response variables. The likelihood function for β is In a Bayesian framework, when there is no prior information available about the parameters, noninformative priors are generally used. A popular method of analyzing binary data is by fitting a Bayesian logistic regression model with a flat prior on β. If the prior density of β, π (β) ∝ 1, the posterior density of β is provided the marginal density Chen and Shao (2001) discuss the necessary and sufficient conditions for propriety of the posterior density (1), that is, when c(y) < ∞. These conditions are given in A.1. Throughout this paper, we assume that the posterior density (1) is proper. From (1), we know that the posterior density of β, π(β|y), is intractable in the sense that means with respect to this density are not available in closed form. Markov chain Monte Carlo (MCMC) algorithms are generally used for exploring this posterior density. The data augmentation (DA) algorithm proposed in Albert and Chib (1993) for the Bayesian probit regression model is widely used. For the logistic regression model, there have been many attempts to produce such a DA algorithm (Holmes and Held, 2006;Frühwirth-Schnatter and Frühwirth, 2010). Recently, Polson et al. (2013) (denoted as PS&W hereafter) have proposed a new DA algorithm for the logistic regression model based on latent variables following the Pólya-Gamma (PG) distribution. As mentioned in Choi and Hobert (2013), PS&W's algorithm is the first DA algorithm for the logistic regression that is truly analogous to Albert and Chib's (1993) DA algorithm. PS&W's DA algorithm, like Albert and Chib's (1993) DA for the probit model, in every iteration makes two draws -one draw from a p−dimensional normal distribution for β and the other draw for the latent variables. We now describe these two steps.
Let X denote the n × p design matrix with ith row x T i . Let R + = (0, ∞) and for (ω 1 , ω 2 , . . . , ω n ) ∈ R n + , define Ω to be the n×n diagonal matrix with ith diagonal element ω i . Finally let P G(1, b) denote the Pólya-Gamma distribution defined in Section 2 with parameters 1 and b. A single iteration of PS&W's algorithm uses the following two steps to move from β ′ to β.

PS&W's algorithm:
PS&W provided an efficient method for sampling from the Pólya-Gamma distribution. It can be shown that the transition density of the Markov chain underlying the above DA algorithm is strictly positive everywhere, which implies the chain is Harris ergodic (Asmussen and Glynn, 2011). Thus the sample averages based on the DA chain can be used to consistently estimate posterior means. However, in order to provide standard errors for these estimates one needs to show the existence of Markov chain CLTs for these estimators. A standard method of establishing Markov chain CLT is by proving the chain to be geometrically ergodic (Roberts and Rosenthal, 1997). Geometric ergodicity also allows consistent estimation of asymptotic variances in Markov chain CLTs by batch means or spectral variance methods (Flegal and Jones, 2010). This in turn allows the MCMC users to decide how long to run MCMC simulations (Jones and Hobert, 2001). Thus proving geometric ergodicity has important practical benefits. In this paper, we prove that the Markov chain underlying PS&W's DA algorithm is geometrically ergodic. Choi and Hobert (2013) considered normal priors on the regression parameters and proved uniform ergodicity of the corresponding Pólya-Gamma DA Markov chain by establishing a minorization condition. Choi and Román (2017) considered the one-way logistic ANOVA model under a flat prior on group (treatment) main effects and showed that the Markov operator corresponding to Pólya-Gamma sampler is trace-class. The assumption of the one-way logistic ANOVA model is restrictive and has limited applications. Here, we analyze the convergence rate of PS&W's DA algorithm for Bayesian logistic regression models with a general form of the design matrix under a flat prior on regression coefficients. In particular, we establish that PS&W's DA algorithm for the Bayesian logistic regression model under the improper flat prior is always geometrically ergodic. The conditions we need are only the conditions of Proposition 1 in A.1, which guarantee the posterior propriety. Since we use drift condition to prove geometric ergodicity of the DA algorithm and hence CLTs for sample average estimators, the techniques used here are different from that of Choi and Hobert (2013) and Choi and Román (2017).
The rest of the paper is organized as follows. In Section 2, we describe PS&W's Gibbs sampler. Section 3 contains a brief discussion on geometric rate of convergence for Markov chains and a proof of geometric ergodicity of PS&W's Gibbs sampler. Some concluding remarks are given in Section 4. Finally, the appendix contains some technical results.

PS&W's Gibbs sampler
In PS&W's DA algorithm, latent variables with the Pólya-Gamma distribution are introduced. The probability density function for a Pólya-Gamma random variable with parameters a > 0 and b ≥ 0 is, We write W ∼ P G(a, b). (Recall that the hyperbolic cosine function cosh is defined as cosh(t) = e t + e −t /2.) Choi and Hobert (2013) developed a new way to formulate PS&W's DA algorithm, which we briefly describe now. Let ω = (ω 1 , . . . , ω n ) T be the latent variables. Assume that, conditional on β, . . , n} be n independent pairs. Then the complete posterior density of β and ω is Clearly from (1) we see that, that is, the β marginal density of the augmented posterior density π(β, ω|y) is our target posterior density π(β|y). Let p (ω i ) be the probability function of P G(1, 0) and κ i = y i −1/2, as defined before. It can be checked that, PS&W's DA algorithm is simply a two-variable Gibbs sampler that, in each iteration, alternates draws from the two conditional distributions of π(β, ω|y). Below we present the conditional densities of ω given β, y and β given ω, y.
From (3) we see that that is, the conditional distribution of ω given β, y is independent of y. Thus the conditional density of ω given β, y is From (4), it is easy to see that the conditional density of β is where κ = (κ 1 , . . . , κ n ) T . Thus the conditional distribution of β is multivariate normal. In particular,

Geometric ergodicity of Pólya-Gamma Gibbs sampler
Let {β (m) , ω (m) } ∞ m=0 denote the Markov chain associated with PS&W's DA algorithm. In Bayesian logistic regression models, inferences on β are made based on the {β (m) } ∞ m=0 sub-chain. As mentioned in the introduction, the DA Markov chain is Harris ergodic. Let h : R p → R be a real valued function of β with R p |h(β)|π(β|y)dβ < ∞, then the posterior mean E(h(β)|y) can be consistently estimated byh m = m−1 i=0 h(β (i) )/m for any starting value β (0) (see A.1 for a discussion on the existence of finite moments for (1)). We can build a CLT forh m if there exists a constant σ 2 Mere Harris ergodicity of the Markov chain {β (m) } ∞ m=0 does not ensure that the CLT in (9) holds. It turns out that the geometric rate of convergence defined below guarantees the CLT under a finite second moment condition (Roberts and Rosenthal, 1997). Also it turns out that all the three Markov chains have the same rate of convergence (Roberts and Rosenthal, 2001). Thus geometric ergodicity is a solidarity property of these three Markov chains. In this article we analyze the Let ω ′ be the current state and ω be the next state, then the Markov transition density (Mtd) of Ψ is where π(·|·, y)'s are the conditional densities defined in (6) and (7). Note that the Mtd of the {β (m) } ∞ m=0 sub-chain (that is, when ω is updated first) is similarly given bỹ Let B denote the Borel σ-algebra of R n + and K(·, ·) be the Markov transition function corresponding to the Mtd k(·|·) in (10), that is, for any set A ∈ B, ω ′ ∈ R n + and any j = 0, 1, . . . , Then the m-step Markov transition function is K m (ω ′ , A) = Pr(ω (m+j) ∈ A|ω (j) = ω ′ ). Let Π(·|y) be the probability measure with density π(ω|y), where π(ω|y) = R p π(β, ω|y)dβ and π(β, ω|y) is the joint density defined in (3). The Markov chain Ψ is geometrically ergodic if there exists a constant 0 < t < 1 and a function H : R n + → [0, ∞) such that for any ω ∈ R n + , and m ≥ 1, Harris ergodicity of Ψ implies that ||K m (ω, ·) − Π(·|y)|| TV ↓ 0 as m → ∞, while (12) guarantees its exponential rate of convergence. Since the Markov chains {β (m) } ∞ m=0 and {ω (m) } ∞ m=0 have the same rate of convergence, (12) implies {β (m) } ∞ m=0 is geometrically ergodic. Roberts and Rosenthal (1997) show that since {β (m) } ∞ m=0 is reversible, if (12) holds then there exists a CLT, that is, for any h : R p → R with E[h(β) 2 |y] < ∞, (9) holds. Also, under (12) a consistent estimator of σ 2 h can be found by batch means or spectral variance methods (Flegal and Jones, 2010). The following theorem shows that the Markov chain Ψ converges at a geometric rate. Proof of Theorem 1. We prove geometric ergodicity of Ψ by establishing a drift condition. In particular, we consider the drift function where α is a positive constant and show that for any ω, ω ′ ∈ R n + , there exist some constants ρ ∈ (0, 1) and L 0 > 0 such that In (14) the expectation is with respect to the Mtd k(ω|ω ′ ) defined in (10). Note that V (ω) is unbounded off compact sets, that is, for any a > 0, the set {ω : V (ω) ≤ a} is compact. We now show that ω-chain is a Feller chain, which means K (ω, O) is a lower semi-continuous function on R n + for each fixed open set O. Consider a sequence ω m with ω m → ω as m → ∞. Note that, where the inequality follows from Fatou's lemma. Since π(β|ω, y) is a continuous function in ω and ω m → ω, Thus by Meyn and Tweedie (1993)(chap. 15), (14) implies that the Markov chain Ψ is geometrically ergodic. Now we establish (14). From the definition of the Mtd of Ψ in (10), it follows that where E [·|β, y] denotes the expectation with respect to π(·|β, y) given in (6) and E {·|ω ′ , y} denotes the expectation with respect to π(·|ω ′ , y) given in (7) . We first evaluate the inner expectation in (15), that is the expectation of V (ω) with respect to π (ω|β, y). From (5), we know that ω i |β, y ∼ P G 1, |x T i β| . Thus by Lemma 1 and Lemma 2 given in A.2, we have where L 1 ≡ L(1), L 2 ≡ L(1/2) and L(·) is a function defined in Lemma 2. Then Now we consider the outer expectation in (15), that is, the expectation with respect to π(β|ω ′ , y). Let where Ω ′ is the diagonal matrix with elements ω ′ i 's. From (8) we know that x T i β|ω ′ , y ∼ N µ i , σ 2 i . Then x T i β has a folded normal distribution. Let G(·) denote the cumulative distribution function of the standard normal random variable. So By the inequality in Roy and Hobert (2010) [Lemma 3], Now where the inequality follows from the fact that I − (Ω ′ ) 1/2 X X T Ω ′ X −1 X T (Ω ′ ) 1/2 is a positive semidefinite matrix.

Summary
In this article, we prove the geometric rate of convergence for the Polson et al.'s (2013) Pólya-Gamma Gibbs sampler for the Bayesian logistic regression with a flat prior on the regression coefficients β. The conditions for geometric ergodicity are the same as the necessary and sufficient conditions for posterior propriety. That means, the Gibbs sampler is always geometrically ergodic if the posterior distribution is proper. If the posterior is improper, the Gibbs sampler is not even positive recurrent and the usual sample average estimator is inconsistent for the posterior mean (Athreya and Roy, 2014). Thus our result guarantees availability of a CLT for the time average estimator as long as it is consistent. Roy and Hobert (2007) established a similar result for Albert and Chib's (1993) DA algorithm for the Bayesian probit regression model with a flat prior on β. The latent variables in Albert and Chib's (1993) DA algorithm are normal random variables and their conditional (posterior) distributions are truncated normal. Since the latent variables in Polson et al.'s (2013) DA algorithm have the non-standard PG distribution, it turns out the drift function, inequalities, techniques used in our proof are quite different from those of Roy and Hobert (2007). One potential future work is to study the convergence properties of the Pólya-Gamma Gibbs sampler for Bayesian logistic mixed models under improper priors for both regression coefficients and variance components.

A.1 Chen and Shao's (2001) conditions for posterior propriety
Let X denote the n × p design matrix with the ith row x T i and Z be the n × p matrix with the ith row z T i = c i x T i , where c i = 1 if y i = 0 and c i = −1 if y i = 1 for i = 1, . . . , n. The following proposition gives the necessary and sufficient conditions for propriety of the posterior density (1).

Proposition 1. (Chen and Shao, 2001). The marginal density c(y) is finite if and only if
1. X is a full rank matrix; 2. There exists a vector e = (e 1 , . . . , e n ) T with strictly positive components such that Z T e = 0 p . (2007) provide a method for checking the second condition in Proposition 1. This method can be easily implemented using publicly available software packages.

Remark 2. Roy and Hobert
Remark 3. Since the moment generating function of the logistic distribution exists from Chen and Shao (2001) [Theorem 2.3], it follows that under the two conditions of Proposition 1, R p e δ β π(β|y)dβ < ∞ for some δ > 0 and R p β r π(β|y)dy < ∞ for all r ≥ 0.
Proof. From (2), the probability density function of PG (1, b) is, We consider the two cases, b = 0 and b = 0 separately.
where Φ(·) is the Lerch transcendent function. The inequality in (26) follows from the fact that 1 + e −b−t ≥ 1. Thus we have, For fixed s > 0, let Using the Dominated Convergence Theorem (DCT), we can show that f (b) is a continuous function of b. DCT can also be used to show that lim b→∞ f (b) = 0 and f (0) = 0. So |f (b)| can be bounded by a positive constant value f 0 . Thus we have Combining the two cases b = 0 and b = 0, we have where L (s) = max 2 s+1 Γ(2s+1) Γ(s+1) + f 0 , 8C + 1 .
The set S * is a compact set. Note that sup ω∈R n Now we study the supremum of 1 T Z Z T Λ −2 Z −1 Z T 1 over λ ∈ S. We know that Z T 1 is a continuous function of λ in S. For λ ∈ S * \S, there exists a Since each element of P m is a bounded, continuous function of λ m over S, its limit as m → ∞ exists and is bounded. Thus, lim m→∞Zm Z T mZ m −1Z T m exists, and we denote it as P . For a matrix A, define A 2 = sup x: x =1 Ax . Since P m 2 ≤ P 2 + P m − P 2 and P 2 ≤ P m 2 + P m − P 2 , we have | P m 2 − P 2 | ≤ P m − P 2 .