The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic

One of the most widely used data augmentation algorithms is Albert and Chib's (1993) algorithm for Bayesian probit regression. Polson, Scott and Windle (2013) recently introduced an analogous algorithm for Bayesian logistic regression. The main difference between the two is that Albert and Chib's (1993) truncated normals are replaced by so-called Polya-Gamma random variables. In this note, we establish that the Markov chain underlying Polson et al.'s (2013) algorithm is uniformly ergodic. This theoretical result has important practical benefits. In particular, it guarantees the existence of central limit theorems that can be used to make an informed decision about how long the simulation should be run.


Introduction
Consider a binary regression set-up in which Y_1, ..., Y_n are independent Bernoulli random variables such that Pr(Y_i = 1 | β) = F(x_i^T β), where x_i is a p × 1 vector of known covariates associated with Y_i, β is a p × 1 vector of unknown regression coefficients, and F : R → (0, 1) is a distribution function. Two important special cases are probit regression, where F is the standard normal distribution function, and logistic regression, where F is the standard logistic distribution function, that is, F(t) = e^t/(1 + e^t). In general, the joint mass function of Y_1, ..., Y_n is given by

  π(y | β) = ∏_{i=1}^n [F(x_i^T β)]^{y_i} [1 − F(x_i^T β)]^{1 − y_i} .

A Bayesian version of the model requires a prior distribution for the unknown regression parameter, β. If π(β) is the prior density for β, then the posterior density of β given the data, y = (y_1, ..., y_n)^T, is defined as

  π(β | y) = π(y | β) π(β) / c(y) ,

where c(y) is the normalizing constant, that is,

  c(y) = ∫_{R^p} π(y | β) π(β) dβ .

Regardless of the choice of F, the posterior density is intractable in the sense that expectations with respect to π(β | y), which are required for Bayesian inference, cannot be computed in closed form. Moreover, classical Monte Carlo methods based on independent and identically distributed (iid) samples from π(β | y) are problematic when the dimension, p, is large. These difficulties have spurred the development of many Markov chain Monte Carlo methods for exploring π(β | y).
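For concreteness, the posterior just described can be evaluated up to the unknown constant c(y). The following Python sketch (the function and variable names are ours, not from the paper) computes the unnormalized log posterior for the logistic case with a N_p(b, B) prior.

```python
import numpy as np

def log_unnorm_posterior(beta, X, y, b, B_inv):
    """Unnormalized log posterior log pi(y|beta) + log pi(beta) for
    logistic regression with a N_p(b, B) prior (B_inv = B^{-1})."""
    eta = X @ beta                      # linear predictors x_i^T beta
    # Bernoulli log-likelihood with F(t) = e^t / (1 + e^t):
    # sum_i [ y_i * eta_i - log(1 + e^{eta_i}) ], computed stably.
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * (beta - b) @ B_inv @ (beta - b)
    return loglik + logprior
```

At β = 0 with b = 0 the prior term vanishes and each observation contributes −log 2, which gives a quick correctness check.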
One of the most popular is a simple data augmentation (DA) algorithm for the probit regression case that was developed by Albert and Chib (1993). Each iteration of this algorithm has two steps: the first entails the simulation of n independent univariate truncated normals, and the second requires a draw from a p-variate normal distribution. A convergence rate analysis of the underlying Markov chain can be found in Roy and Hobert (2007).
Since the publication of Albert and Chib (1993), the search has been on for an analogous algorithm for Bayesian logistic regression. Attempts to develop such an analogue by mimicking the missing data argument of Albert and Chib (1993) in the logistic case have not been entirely successful. In particular, the resulting algorithms are much more complicated than that of Albert and Chib (1993), and simplified versions of them are inexact (see, e.g., Holmes and Held, 2006; Frühwirth-Schnatter and Frühwirth, 2010; Marchev, 2011). Using a different approach, Polson, Scott, and Windle (2013) (hereafter PS&W) have developed a real analogue of Albert and Chib's (1993) algorithm for Bayesian logistic regression. The main difference between the two algorithms is that the truncated normals in Albert and Chib's (1993) algorithm are replaced by Polya-Gamma random variables in PS&W's algorithm. We now describe the new algorithm.
In the remainder of this paper, we restrict attention to the logistic regression model with a proper N_p(b, B) prior on β. As usual, let X denote the n × p matrix whose ith row is x_i^T, and let R_+ = (0, ∞). For fixed w ∈ R_+^n, define

  Σ(w) = (X^T Ω(w) X + B^{−1})^{−1}  and  m(w) = Σ(w) (X^T (y − (1/2) 1_n) + B^{−1} b) ,

where Ω(w) is the n × n diagonal matrix whose ith diagonal element is w_i, and 1_n is an n × 1 vector of 1s. Finally, let PG(1, c) denote the Polya-Gamma distribution with parameters 1 and c, which is carefully defined in Section 2. The dynamics of PS&W's Markov chain are defined (implicitly) through the following two-step procedure for moving from the current state, β^(m) = β, to β^(m+1):

1. Draw W_1, ..., W_n independently with W_i ∼ PG(1, |x_i^T β|), and call the result w = (w_1, ..., w_n)^T.
2. Draw β^(m+1) ∼ N_p(m(w), Σ(w)).
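The two-step procedure can be sketched in code. The following Python sketch uses a truncated version of the infinite-sum representation of the Polya-Gamma distribution (see Section 2) as a stand-in for an exact PG(1, c) sampler; exact samplers exist (e.g., in the `polyagamma` package), and all function names below are ours, not from the paper.

```python
import numpy as np

def sample_pg_approx(c, trunc=200, rng=None):
    """Approximate PG(1, c) draw via the (truncated) weighted sum of
    iid Exp(1) variables,
        W = (1 / (2 pi^2)) * sum_k E_k / ((k - 1/2)^2 + c^2 / (4 pi^2)).
    This truncation is an illustration, not an exact sampler."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    e = rng.exponential(1.0, size=trunc)
    return np.sum(e / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def pg_gibbs_step(beta, X, y, b, B_inv, rng):
    """One iteration of PS&W's two-step procedure: w | beta, then beta | w."""
    # Step 1: draw w_i ~ PG(1, |x_i^T beta|), independently.
    w = np.array([sample_pg_approx(abs(c), rng=rng) for c in X @ beta])
    # Step 2: draw beta ~ N_p(m(w), Sigma(w)), where
    # Sigma(w) = (X^T Omega(w) X + B^{-1})^{-1} and
    # m(w) = Sigma(w) (X^T (y - 1/2 1_n) + B^{-1} b).
    Sigma = np.linalg.inv(X.T @ (w[:, None] * X) + B_inv)
    m = Sigma @ (X.T @ (y - 0.5) + B_inv @ b)
    return rng.multivariate_normal(m, Sigma)
```

For c = 0 the representation reduces to the random variable W defined in Section 2, whose mean is 1/4, which gives a simple sanity check on the approximate sampler.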
In this paper, we prove that PS&W's Markov chain is uniformly ergodic. This theoretical convergence rate result is extremely important when it comes to using PS&W's algorithm in practice to estimate intractable posterior expectations. Indeed, uniform ergodicity guarantees the existence of central limit theorems for averages such as m^{−1} Σ_{i=0}^{m−1} g(β^(i)), where g : R^p → R is square integrable with respect to the target posterior, π(β | y) (see, e.g., Tierney, 1994; Roberts and Rosenthal, 2004). It also allows for the computation of a consistent estimator of the associated asymptotic variance, which can be used to make an informed decision about how long the simulation should be run (see, e.g., Flegal, Haran, and Jones, 2008). We also establish the existence of the moment generating function (mgf) of the posterior distribution. It follows that, if g(β_1, ..., β_p) = β_i^a for some i ∈ {1, 2, ..., p} and a > 0, then g is square integrable with respect to the posterior.
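The batch-means method is one such consistent estimator of the asymptotic variance. A minimal sketch (our own, generic, not specific to the Polya-Gamma chain):

```python
import numpy as np

def batch_means_se(g_vals, n_batches=30):
    """Batch-means estimate of the Monte Carlo standard error of the
    average of g(beta^(0)), ..., g(beta^(m-1)) along a Markov chain.
    A CLT (here guaranteed by uniform ergodicity) justifies the
    normal-theory interpretation of the resulting error bar."""
    m = len(g_vals)
    batch_len = m // n_batches
    means = np.array([np.mean(g_vals[i * batch_len:(i + 1) * batch_len])
                      for i in range(n_batches)])
    # Estimated asymptotic variance: batch length times the sample
    # variance of the batch means.
    var_hat = batch_len * np.var(means, ddof=1)
    return np.sqrt(var_hat / m)
```

One can stop the simulation once the estimated standard error of the quantity of interest falls below a prescribed tolerance.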
The remainder of this paper is organized as follows. Section 2 contains a brief, but careful development of PS&W's DA algorithm. Section 3 contains the proof that the Markov chain underlying this algorithm is uniformly ergodic, as well as the proof that the posterior density, π(β|y), has an mgf.

Polson, Scott and Windle's algorithm
PS&W's DA algorithm is based on a latent data representation of the posterior distribution. To describe it, we need to introduce what PS&W call the Polya-Gamma distribution. Let {E_k}_{k=1}^∞ be a sequence of iid Exp(1) random variables and define

  W = (1/(2π^2)) Σ_{k=1}^∞ E_k / (k − 1/2)^2 .

It is well known (see, e.g., Biane, Pitman and Yor, 2001) that the random variable W has density

  g(x) = Σ_{k=0}^∞ (−1)^k ((2k + 1)/√(2π x^3)) e^{−(2k+1)^2/(8x)} ,  x > 0 ,   (2.1)

and that its Laplace transform is given by

  E(e^{−tW}) = 1/cosh(√(t/2)) .

(Recall that cosh(z) = (e^z + e^{−z})/2.) PS&W create the Polya-Gamma family of densities through an exponential tilting of the density g. Indeed, consider a parametric family of densities, indexed by c ≥ 0, that takes the form

  f(x; c) = cosh(c/2) e^{−c^2 x/2} g(x) ,  x > 0 .

Of course, when c = 0, we recover the original density. A random variable with density f(x; c) is said to have a PG(1, c) distribution. We now describe a latent data formulation that leads to PS&W's DA algorithm. Our development is different from, and we believe somewhat more transparent than, that given by PS&W.
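The tilting constant cosh(c/2) is exactly the reciprocal of the Laplace transform of g at t = c^2/2, so f(x; c) integrates to one. This is easy to confirm numerically; the check below (our own, truncating the alternating series for g at 60 terms and using scipy for the quadrature) is only a sanity check, not part of the paper's argument.

```python
import numpy as np
from scipy.integrate import quad

def g_pg10(x, n_terms=60):
    """Truncated alternating series for the PG(1, 0) density
    g(x) = sum_{k>=0} (-1)^k (2k+1)/sqrt(2 pi x^3) exp(-(2k+1)^2/(8x))."""
    k = np.arange(n_terms)
    return np.sum((-1.0) ** k * (2 * k + 1) / np.sqrt(2 * np.pi * x ** 3)
                  * np.exp(-(2 * k + 1) ** 2 / (8.0 * x)))

# g itself should integrate to one (mass beyond x = 30 is negligible).
total_g, _ = quad(g_pg10, 0.0, 30.0, limit=200)

# The tilted density f(x; c) = cosh(c/2) exp(-c^2 x / 2) g(x) should also
# integrate to one, since E exp(-c^2 W / 2) = 1/cosh(c/2).
c = 2.0
total_f, _ = quad(lambda x: np.cosh(c / 2) * np.exp(-c ** 2 * x / 2) * g_pg10(x),
                  0.0, 30.0, limit=200)
```

Both integrals come out equal to one up to quadrature and truncation error.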
Conditional on β, let W_1, ..., W_n be independent random variables with W_i ∼ PG(1, |x_i^T β|). Set W = (W_1, ..., W_n)^T and denote its density by f(w|β). Combining this latent data model with the prior, π(β), yields the augmented posterior density defined as

  π(β, w | y) = [∏_{i=1}^n Pr(Y_i = y_i | β) f(w_i | β)] π(β) / c(y) .

Clearly,

  ∫_{R_+^n} π(β, w | y) dw = π(β | y) ,

which is our target posterior density. PS&W's DA algorithm alternates between draws from π(β|w, y) and π(w|β, y). The conditional independence of Y_i and W_i implies that π(w|β, y) = f(w|β). Thus, we can draw from π(w|β, y) by making n independent draws from the Polya-Gamma distribution (as in the first step of the two-step procedure described in the Introduction). The other conditional density is multivariate normal. To see this, note that

  Pr(Y_i = y_i | β) f(w_i | β) = [(e^{x_i^T β})^{y_i}/(1 + e^{x_i^T β})] cosh(|x_i^T β|/2) e^{−w_i (x_i^T β)^2/2} g(w_i) = (1/2) e^{(y_i − 1/2) x_i^T β} e^{−w_i (x_i^T β)^2/2} g(w_i) ,

where the last equality follows from the fact that cosh(z) = (1 + e^{2z})/(2e^z). A routine Bayesian regression-type calculation then reveals that β|w, y ∼ N_p(m(w), Σ(w)), where m(w) and Σ(w) are defined in the Introduction. The Markov transition density (Mtd) of the DA Markov chain, Φ = {β^(m)}_{m=0}^∞, is given by

  k(β | β′) = ∫_{R_+^n} π(β | w, y) π(w | β′, y) dw .   (2.2)

Of course, we do not have to perform the integration in (2.2), but we do need to be able to simulate random vectors with density k(· | β′), and this is exactly what the two-step procedure described in the Introduction does. Note that k : R^p × R^p → (0, ∞); that is, k is strictly positive. It follows immediately that the Markov chain Φ is irreducible, aperiodic and Harris recurrent (see, e.g., Hobert, 2011).

Uniform ergodicity
For m ∈ {1, 2, 3, ...}, let k_m : R^p × R^p → (0, ∞) denote the m-step Mtd of PS&W's Markov chain, where k_1 ≡ k. These densities are defined inductively as follows:

  k_m(β | β′) = ∫_{R^p} k(β | β″) k_{m−1}(β″ | β′) dβ″ .

The conditional density of β^(m) given β^(0) = β′ is precisely k_m(· | β′). The Markov chain Φ is geometrically ergodic if there exist M : R^p → [0, ∞) and ρ ∈ [0, 1) such that, for all m,

  (1/2) ∫_{R^p} | k_m(β | β′) − π(β | y) | dβ ≤ M(β′) ρ^m .   (3.1)

The quantity on the left-hand side of (3.1) is the total variation distance between the posterior distribution and the distribution of β^(m) (given β^(0) = β′). If the function M(·) is bounded above, then the chain is uniformly ergodic. One method of establishing uniform ergodicity is to construct a minorization condition that holds on the entire state space. In particular, if we can find a δ > 0 and a density function h : R^p → [0, ∞) such that, for all β, β′ ∈ R^p,

  k(β | β′) ≥ δ h(β) ,   (3.2)

then the chain is uniformly ergodic. In fact, if (3.2) holds, then (3.1) holds with M ≡ 1 and ρ = 1 − δ (see, e.g., Jones and Hobert, 2001, Section 3.2).
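The implication from a whole-space minorization to a uniform total variation bound can be illustrated on a toy finite state space; the three-state chain below is our own example, unrelated to the logistic model.

```python
import numpy as np

# A toy 3-state transition matrix (rows are K(x, .)).
K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Whole-space minorization: K(x, j) >= delta * h(j), where
# delta = sum_j min_x K(x, j) and h is proportional to the column minima.
delta = K.min(axis=0).sum()

# Stationary distribution: left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(K.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Worst-case total variation distance after m steps vs the (1 - delta)^m bound.
tv, bound = [], []
Km = np.eye(3)
for m in range(1, 11):
    Km = Km @ K
    tv.append(0.5 * np.abs(Km - pi).sum(axis=1).max())
    bound.append((1.0 - delta) ** m)
```

For this kernel the minorization constant is δ = 0.7, and the computed total variation distances sit below (1 − δ)^m at every step, as the coupling argument guarantees.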
In order to state our main result, we need a bit more notation. Let φ(z; µ, V) denote the multivariate normal density with mean µ and covariance matrix V evaluated at the point z. Recall that the prior on β is N_p(b, B), and define

  v = X^T (y − (1/2) 1_n) + B^{−1} b ,  s = B^{1/2} v ,  Λ = B^{−1} + (1/2) X^T X ,

and

  δ = 2^{−n} e^{−n/4} (|B| |Λ|)^{−1/2} exp{ (1/2) v^T Λ^{−1} v − (1/2) s^T s } .

Here is our main result.

Proposition 3.1. The Mtd of Φ satisfies the following minorization condition: for all β, β′ ∈ R^p,

  k(β | β′) ≥ δ φ(β; Λ^{−1} v, Λ^{−1}) .

Hence, PS&W's Markov chain is uniformly ergodic.
Remark 3.1. As discussed in the Introduction, uniform ergodicity guarantees the existence of central limit theorems and consistent estimators of the associated asymptotic variances, and these can be used to make an informed decision about how long the simulation should be run. While it is true that, in many cases, δ will be so close to zero that the bound (3.1) (with M ≡ 1 and ρ = 1 − δ) will not be useful for getting a handle on the total variation distance, this does not detract from the usefulness of the result. Moreover, it is certainly possible that a different analysis could yield a different minorization condition with a larger δ.
The following easily established lemmas will be used in the proof of Proposition 3.1.

Lemma 3.1. If A is a symmetric nonnegative definite matrix, then all of the eigenvalues of (I + A)^{−1} are in (0, 1], and I − (I + A)^{−1} is also nonnegative definite.

Lemma 3.2. For all a, b ≥ 0, cosh(a + b) ≤ 2 cosh(a) cosh(b).
Proof of Proposition 3.1. Recall that Σ = Σ(w) = (X^T Ω(w) X + B^{−1})^{−1} and m = m(w) = Σ(w) v, where v = X^T (y − (1/2) 1_n) + B^{−1} b, and note that

  Σ = B^{1/2} (I + B^{1/2} X^T Ω X B^{1/2})^{−1} B^{1/2} .

Now, since B^{1/2} X^T Ω X B^{1/2} is nonnegative definite, Lemma 3.1 implies that |Σ| ≤ |B|. Next, we show that m^T Σ^{−1} m ≤ s^T s, where s = B^{1/2} v. Letting l = y − (1/2) 1_n, we have

  m^T Σ^{−1} m = v^T Σ v = s^T (I + B^{1/2} X^T Ω X B^{1/2})^{−1} s ≤ s^T s ,

where the inequality follows from Lemma 3.1. Using these two inequalities, together with the facts that Σ^{−1} m = v and β^T Σ^{−1} β = Σ_{i=1}^n w_i (x_i^T β)^2 + β^T B^{−1} β, we have

  π(β | w, y) = φ(β; m, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) β^T Σ^{−1} β + β^T Σ^{−1} m − (1/2) m^T Σ^{−1} m }
    ≥ (2π)^{−p/2} |B|^{−1/2} e^{−s^T s/2} exp{ −(1/2) β^T B^{−1} β + β^T v } exp{ −(1/2) Σ_{i=1}^n w_i (x_i^T β)^2 } .

Recall that k(β | β′) = ∫_{R_+^n} π(β | w, y) π(w | β′, y) dw, so it remains to bound ∫_{R_+^n} exp{ −(1/2) Σ_i w_i (x_i^T β)^2 } π(w | β′, y) dw from below. To this end, note that, since the W_i are conditionally independent with W_i | β′ ∼ PG(1, |x_i^T β′|), the Laplace transform of the PG(1, c) distribution yields

  ∫_{R_+} e^{−w_i (x_i^T β)^2/2} π(w_i | β′, y) dw_i = cosh(|x_i^T β′|/2) / cosh( (1/2) √((x_i^T β)^2 + (x_i^T β′)^2) ) ,

and

  cosh( (1/2) √((x_i^T β)^2 + (x_i^T β′)^2) ) ≤ cosh( (|x_i^T β| + |x_i^T β′|)/2 ) ≤ 2 cosh(|x_i^T β|/2) cosh(|x_i^T β′|/2) ,

where the first inequality is due to the fact that √(a + b) ≤ √a + √b for nonnegative a, b, and the second inequality is by Lemma 3.2. It follows that

  ∫_{R_+^n} exp{ −(1/2) Σ_i w_i (x_i^T β)^2 } π(w | β′, y) dw ≥ ∏_{i=1}^n [2 cosh(|x_i^T β|/2)]^{−1} ≥ 2^{−n} exp{ −(1/2) Σ_i |x_i^T β| } ≥ 2^{−n} e^{−n/4} exp{ −(1/4) β^T X^T X β } ,

where the second inequality holds because cosh(z) ≤ e^{|z|}, and the third because |a| ≤ (a^2 + 1)/2 for any real a. Putting all of this together, we have

  k(β | β′) = ∫_{R_+^n} π(β | w, y) π(w | β′, y) dw ≥ 2^{−n} e^{−n/4} (2π)^{−p/2} |B|^{−1/2} e^{−s^T s/2} exp{ −(1/2) β^T Λ β + β^T v } = δ φ(β; Λ^{−1} v, Λ^{−1}) ,

where Λ = B^{−1} + (1/2) X^T X, and the proof is complete.
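The matrix inequalities used above hold for every w ∈ R_+^n and are easy to spot-check numerically. The script below (our own construction, with a randomly generated design) verifies |Σ| ≤ |B|, m^T Σ^{−1} m ≤ s^T s, and the cosh inequality of Lemma 3.2 on random instances.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)
b = rng.normal(size=p)
B = 2.0 * np.eye(p)
B_inv = np.linalg.inv(B)
v = X.T @ (y - 0.5) + B_inv @ b
s_sq = v @ B @ v                      # s^T s with s = B^{1/2} v

det_ok, quad_ok = True, True
for _ in range(200):
    # w ranges over R_+^n; try draws of widely varying magnitude.
    w = rng.exponential(size=n) * 10.0 ** rng.integers(-2, 3)
    Sigma = np.linalg.inv(X.T @ (w[:, None] * X) + B_inv)
    m = Sigma @ v
    det_ok = det_ok and np.linalg.det(Sigma) <= np.linalg.det(B) + 1e-9
    quad_ok = quad_ok and m @ np.linalg.solve(Sigma, m) <= s_sq + 1e-9

# Lemma 3.2: cosh(a + b) <= 2 cosh(a) cosh(b) for a, b >= 0.
a1 = np.abs(rng.normal(size=1000))
a2 = np.abs(rng.normal(size=1000))
cosh_ok = bool(np.all(np.cosh(a1 + a2) <= 2.0 * np.cosh(a1) * np.cosh(a2)))
```

All three checks pass for every draw, as the Lemmas predict.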
Remark 3.2. It is worth pointing out that our proof circumvents the need to deal directly with the unwieldy density of the PG(1, 0) distribution, which is given in (2.1).
The utility of the Polya-Gamma latent data strategy extends far beyond the basic logistic regression model. Indeed, PS&W use this technique to develop useful Gibbs samplers for a variety of Bayesian hierarchical models in which a logit link is employed. These include Bayesian logistic regression with random effects and negative-binomial regression. Of course, given the results herein, it is natural to ask whether these other Gibbs samplers are also uniformly ergodic, and to what extent our methods could be used to answer this question. We note, however, that the Mtd of the Polya-Gamma Gibbs sampler is relatively simple. For example, the Gibbs sampler for the mixed effects model has more than two steps, so its Mtd is significantly more complex than that of the Polya-Gamma Gibbs sampler. Thus, the development of (uniform) minorization conditions for the other Gibbs samplers would likely entail more than just a straightforward extension of the proof of Proposition 3.1. On the other hand, it is possible that some of the inequalities in that proof could be recycled.
Finally, we establish that the posterior distribution of β given the data has a moment generating function.
Proposition 3.2. For every fixed t ∈ R^p,

  ∫_{R^p} e^{β^T t} π(β | y) dβ < ∞ .

Hence, the moment generating function of the posterior distribution exists.
Remark 3.3. Chen and Shao (2000) develop results for Bayesian logistic regression with a flat (improper) prior on β. In particular, these authors provide conditions on X and y that guarantee the existence of the mgf of the posterior of β given the data. Note that our result holds for all X and y.

Proof. Recall that ∫_{R_+^n} π(β, w | y) dw = π(β | y), where

  π(β, w | y) = [∏_{i=1}^n Pr(Y_i = y_i | β)] f(w | β) π(β) / c(y) .
Hence, it suffices to show that

  ∫_{R_+^n} ∫_{R^p} e^{β^T t} π(β, w | y) dβ dw < ∞ .
Straightforward calculations similar to those done in Section 2 show that

  π(β, w | y) = φ(β; m, Σ) (|Σ|/|B|)^{1/2} (2^n c(y))^{−1} exp{ (1/2) m^T Σ^{−1} m − (1/2) b^T B^{−1} b } ∏_{i=1}^n g(w_i) .

Then, since |Σ| ≤ |B| and m^T Σ^{−1} m ≤ s^T s, we have

  π(β, w | y) ≤ φ(β; m, Σ) (2^n c(y))^{−1} exp{ (1/2) s^T s − (1/2) b^T B^{−1} b } ∏_{i=1}^n g(w_i) .

Now, using the formula for the multivariate normal mgf, that is,

  ∫_{R^p} e^{β^T t} φ(β; m, Σ) dβ = exp{ m^T t + (1/2) t^T Σ t } ,

and the fact that ∏_{i=1}^n g(w_i) integrates to one over R_+^n, it suffices to show that m^T t + (1/2) t^T Σ t is uniformly bounded in w. We establish this below.
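The boundedness claim is also easy to probe numerically; the sketch below (our own, with random data) evaluates m^T t + (1/2) t^T Σ t over widely varying w and compares it with the deterministic bound ‖B‖ ‖v‖ ‖t‖ + (1/2) ‖B‖ ‖t‖^2, which follows from ‖Σ‖ ≤ ‖B‖, ‖m‖ ≤ ‖Σ‖ ‖v‖, and the Cauchy-Schwarz inequality (here v = X^T (y − (1/2) 1_n) + B^{−1} b).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n).astype(float)
b = rng.normal(size=p)
B = 2.0 * np.eye(p)
B_inv = np.linalg.inv(B)
t = rng.normal(size=p)
v = X.T @ (y - 0.5) + B_inv @ b

# Deterministic bound, free of w: ||B|| ||v|| ||t|| + (1/2) ||B|| ||t||^2.
B_norm = np.linalg.norm(B, 2)
bound = (B_norm * np.linalg.norm(v) * np.linalg.norm(t)
         + 0.5 * B_norm * np.linalg.norm(t) ** 2)

vals = []
for _ in range(200):
    # w ranges over R_+^n; try draws of wildly different magnitudes.
    w = rng.exponential(size=n) * 10.0 ** rng.integers(-3, 4)
    Sigma = np.linalg.inv(X.T @ (w[:, None] * X) + B_inv)
    m = Sigma @ v
    vals.append(m @ t + 0.5 * t @ Sigma @ t)
```

Every evaluated value stays below the fixed bound, consistent with the uniform boundedness established in the proof.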
For a matrix A, define ‖A‖ = sup_{‖x‖=1} ‖Ax‖. Of course, if A is nonnegative definite, then ‖A‖ is equal to the largest eigenvalue of A. Now, using the Cauchy-Schwarz inequality and properties of the norm, we have