Convergence properties of Gibbs samplers for Bayesian probit regression with proper priors

The Bayesian probit regression model (Albert and Chib, 1993) is popular and widely used for binary regression. While the improper flat prior for the regression coefficients is an appropriate choice in the absence of any prior information, a proper normal prior is desirable when prior information is available or in modern high dimensional settings where the number of coefficients ($p$) is greater than the sample size ($n$). For both choices of priors, the resulting posterior density is intractable, and a Data Augmentation (DA) Markov chain is used to generate approximate samples from the posterior distribution. Establishing geometric ergodicity for this DA Markov chain is important, as it provides theoretical guarantees for constructing standard errors for Markov chain based estimates of posterior quantities. In this paper, we first show that in the case of proper normal priors, the DA Markov chain is geometrically ergodic *for all* choices of the design matrix $X$, $n$ and $p$ (unlike the improper prior case, where $n \geq p$ and another condition on $X$ are required for posterior propriety itself). We also derive sufficient conditions under which the DA Markov chain is trace-class, i.e., the eigenvalues of the corresponding operator are summable. In particular, this allows us to conclude that the Haar PX-DA sandwich algorithm (obtained by inserting an inexpensive extra step between the two steps of the DA algorithm) is strictly better than the DA algorithm in an appropriate sense.


Introduction
Let $Y_1, \ldots, Y_n$ be independent Bernoulli random variables with $P(Y_i = 1 \mid \beta) = \Phi(x_i^T \beta)$, where $x_i \in \mathbb{R}^p$ is the vector of known covariates corresponding to the $i$th observation $Y_i$ for $i = 1, \ldots, n$, $\beta \in \mathbb{R}^p$ is a vector of unknown regression coefficients, and $\Phi(\cdot)$ denotes the standard normal distribution function. For $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, the likelihood is given by
$$L(\beta \mid y) = \prod_{i=1}^n \Phi(x_i^T \beta)^{y_i} \left(1 - \Phi(x_i^T \beta)\right)^{1 - y_i}.$$
Our objective is to make inferences about $\beta$, and we intend to adopt a Bayesian approach as proposed in Albert and Chib [1]. In particular, we specify the prior density $\pi(\beta)$ to be a $N_p(Q^{-1} v, Q^{-1})$ density, with a positive definite matrix $Q$ and $v \in \mathbb{R}^p$. Note that any vector $\mu \in \mathbb{R}^p$ can be written as $\mu = Q^{-1} Q \mu = Q^{-1} v$ with $v = Q\mu$; hence the assumption that the prior mean is of the form $Q^{-1} v$ is not restrictive. Albert and Chib [1] consider an improper flat prior for $\beta$, which can be obtained as a limiting case of this (proper) prior, e.g., by taking $v = 0$ and $Q$ approaching the matrix of all zeros. Let $y = (y_1, \ldots, y_n)^T$ denote the observed values of the random sample $Y = (Y_1, \ldots, Y_n)^T$, and let
$$m(y) = \int_{\mathbb{R}^p} \pi(\beta) \prod_{i=1}^n \Phi(x_i^T \beta)^{y_i} \left(1 - \Phi(x_i^T \beta)\right)^{1 - y_i} \, d\beta$$
denote the marginal density of $y$. Then the posterior density of $\beta$ given $Y = y$ is given by
$$\pi(\beta \mid y) = \frac{\pi(\beta) \prod_{i=1}^n \Phi(x_i^T \beta)^{y_i} \left(1 - \Phi(x_i^T \beta)\right)^{1 - y_i}}{m(y)}.$$
Note that the posterior density $\pi(\beta \mid y)$ does not have a closed form. It is highly intractable in the sense that computing expectations with respect to this density is not feasible in closed form, by numerical methods (for even moderate $p$), or by Monte Carlo methods (for large $p$). Albert and Chib [1] proposed a data augmentation MCMC algorithm (henceforth called the AC-DA algorithm) for this problem. As shown below, each iteration of this algorithm involves sampling from $(n + p)$ standard univariate densities. Consider the latent variables $z_1, \ldots, z_n$, where $z_i \mid \beta \sim N(x_i^T \beta, 1)$ independently, with $y_i = 1(z_i > 0)$ for $i = 1, \ldots, n$.
Further, let $X$ denote the $n \times p$ design matrix with $i$th row $x_i^T$. Simple calculations show that the joint conditional density $\pi(\beta, z \mid y)$ of $\beta$ and $z = (z_1, \ldots, z_n)^T$ given the data $y$ satisfies the following:
$$\pi(\beta, z \mid y) \propto \pi(\beta) \prod_{i=1}^n \left\{ 1(y_i = 1)\, 1(z_i > 0) + 1(y_i = 0)\, 1(z_i \leq 0) \right\} \exp\left( -\frac{(z_i - x_i^T \beta)^2}{2} \right). \quad (1)$$
It follows from (1) that the full conditional density of $\beta$ given $z, y$ satisfies
$$\pi(\beta \mid z, y) \propto \exp\left( -\frac{1}{2} \left( \beta^T (X^T X + Q)\, \beta - 2 \beta^T (v + X^T z) \right) \right). \quad (2)$$
From (2) we can immediately conclude that, conditional on $(z, y)$, $\beta$ is normally distributed with mean vector $(X^T X + Q)^{-1}(v + X^T z)$ and covariance matrix $(X^T X + Q)^{-1}$, i.e.,
$$\beta \mid z, y \sim N_p\left( (X^T X + Q)^{-1}(v + X^T z),\; (X^T X + Q)^{-1} \right). \quad (3)$$
Again from (1), it is easy to see that the posterior density of $z$ given $\beta, y$ factors into a product of univariate truncated normal densities. It follows that, for $i = 1, \ldots, n$,
$$z_i \mid \beta, y \;\overset{\text{indep}}{\sim}\; \text{TN}(x_i^T \beta, 1, y_i),$$
where $\text{TN}(\mu, \sigma^2, \omega)$ denotes the distribution of a normal variable with mean $\mu$ and variance $\sigma^2$ which is truncated to be positive if $\omega = 1$ and negative if $\omega = 0$.
Using the standard densities above, Albert and Chib [1] construct a data augmentation Markov chain $\Psi = (\beta_m)_{m \geq 0}$. The transition of this Markov chain from $\beta_m$ to $\beta_{m+1}$ is given as follows.

1. Draw $z_1, \ldots, z_n$ independently with $z_i \sim \text{TN}(x_i^T \beta_m, 1, y_i)$, and set $z = (z_1, \ldots, z_n)^T$.
2. Draw $\beta_{m+1} \sim N_p\left( (X^T X + Q)^{-1}(v + X^T z),\; (X^T X + Q)^{-1} \right)$.
The above conditional densities have standard forms; therefore, observations from the Markov chain $\Psi$ can be easily generated using any standard statistical software, such as R ([17]; [25]). It can be easily shown that the transition density for $\Psi$ is strictly positive everywhere, which implies that $\Psi$ is Harris ergodic (see Asmussen and Glynn [2]). It follows that cumulative averages based on the above Markov chain can be used to consistently estimate corresponding posterior expectations. However, providing standard errors for these estimates requires the existence of a Markov chain CLT (which is much more challenging to establish than the usual CLT in the IID setup). A standard method for proving a Markov chain CLT is to show that the underlying Markov chain is geometrically ergodic (Chan and Geyer [5]; Flegal and Jones [7]; Mykland, Tierney and Yu [15]; Robert [18]). See Section 2 for more details.
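To make the two conditional draws concrete, here is a minimal sketch of the AC-DA sampler in base R. The simulated data, prior parameters ($Q = I_p$, $v = 0$) and all object names are our own illustrative choices, not taken from the paper; the truncated normal draws use a simple inverse-CDF construction, which can be numerically fragile far in the tails.

```r
## Minimal AC-DA sketch (illustrative; toy data and names are our own choices)
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # toy design matrix
y <- rbinom(n, 1, pnorm(X %*% c(-0.5, 1, 0.25)))      # simulated responses

Q <- diag(p)       # prior precision (prior is N_p(Q^{-1} v, Q^{-1}))
v <- rep(0, p)
M <- solve(t(X) %*% X + Q)   # posterior covariance of beta given z
R_chol <- chol(M)            # M = t(R_chol) %*% R_chol

ac_da <- function(m, beta0) {
  beta <- beta0
  out <- matrix(NA_real_, m, p)
  for (iter in 1:m) {
    ## Step 1: z_i | beta, y ~ TN(x_i^T beta, 1, y_i), via inverse CDF
    mu <- as.vector(X %*% beta)
    lo <- ifelse(y == 1, pnorm(0, mu, 1), 0)   # CDF value at lower endpoint
    hi <- ifelse(y == 1, 1, pnorm(0, mu, 1))   # CDF value at upper endpoint
    z <- qnorm(lo + runif(n) * (hi - lo), mu, 1)
    ## Step 2: beta | z, y ~ N_p(M (v + X^T z), M)
    beta <- as.vector(M %*% (v + t(X) %*% z) + t(R_chol) %*% rnorm(p))
    out[iter, ] <- beta
  }
  out
}

draws <- ac_da(5000, beta0 = rep(0, p))
colMeans(draws)   # cumulative-average estimates of posterior means
```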
The first contribution of this paper is a proof of geometric ergodicity of $\Psi$ for all choices of the design matrix $X$, sample size $n$ and number of predictors $p$. It is to be noted that the original Albert and Chib [1] paper has been cited over 2450 times, indicating the wide range of applications and studies based on this data augmentation (DA) algorithm. This highlights the importance of having consistent standard error estimates for quantities based on the DA Markov chain. As we explain in Section 2, geometric ergodicity is an important ingredient for obtaining a theoretical guarantee for the validity of CLT based standard error estimates used by practitioners.
Data Augmentation (DA) algorithms are typically slow-mixing and can take a long time to converge. However, there exist sandwich algorithms (Meng and van Dyk [13]; Liu and Wu [12]; Hobert and Marchev [8]) which can significantly improve on the DA algorithm by adding just one computationally inexpensive intermediate step "sandwiched" between the two steps of the DA algorithm. These sandwich algorithms are theoretically proven to be at least as good as the original DA algorithm in terms of the operator norm (see Section 3). However, to show that the sandwich algorithm is strictly better, one needs to establish some additional properties of the DA Markov chain.
As a second major contribution of this paper, we show in Section 3 that the DA Markov chain $\Psi$ is trace-class under some sufficient conditions. The derivation is quite lengthy and involved (see Section 3). The fact that a DA Markov chain is trace-class ensures that one can construct strictly better sandwich algorithms (e.g., Haar PX-DA algorithms; see Section 4), in the sense that the (countable) spectrum of the DA algorithm pointwise dominates the (countable) spectrum of the sandwich algorithm, with at least one strict inequality (Khare and Hobert [11]). We would like to point out that no results regarding trace-class properties in the improper flat prior case are available in the literature. It is to be noted that our trace-class results hold both when $n \geq p$ and when $n < p$, although some sufficient conditions on $X$ and $Q$ need to be satisfied (see Theorem 2). Roy and Hobert [22] prove the geometric ergodicity of the resulting algorithm when an improper flat prior (instead of a proper normal prior) on $\beta$ is considered, and also derive the Haar PX-DA sandwich algorithm in this setting. Unlike our paper, these authors construct minorization conditions that allow them to use regeneration techniques for the consistent estimation of asymptotic variances.
On the other hand, the trace-class property under the improper prior is not investigated in [22]. It is important to note that an improper flat prior on $\beta$ leads to a proper posterior only under the following conditions, derived in Chen and Shao [6]:

1. $n \geq p$ and the design matrix $X$ has full column rank;
2. there exists a vector $a = (a_1, \ldots, a_n)^T$ with strictly positive components such that $W^T a = 0$, where $W$ is the $n \times p$ matrix whose $i$th row is $x_i^T$ or $-x_i^T$ according as $y_i$ is 0 or 1.

Roy and Hobert [22] show that the above conditions are sufficient for geometric ergodicity as well. However, these conditions clearly exclude modern high dimensional problems, where the dimension $p$ can be much larger than the sample size $n$. Hence, if $p > n$, one needs to work with a proper normal prior. In fact, we show that when a proper normal prior is assumed, no assumption on $n$, $p$ and $X$ is necessary for geometric ergodicity. If $n \geq p$, an improper flat prior is useful in the absence of any prior information or for objective Bayesian inference, while the proper prior is useful in the presence of prior information. To the best of our knowledge, there are no general technical results comparing the efficiency or behavior of the AC-DA algorithm in the proper/improper settings when $n \geq p$.
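For a given dataset, the second condition above can be checked numerically by casting it as a linear feasibility problem: since the constraint $W^T a = 0$ is invariant to positive scaling of $a$, a strictly positive solution exists if and only if one exists with every $a_i \geq 1$. The sketch below, which assumes the lpSolve package and reuses the toy objects X, y from the sampler sketch above, is our own illustration of this check.

```r
## Check the second Chen-Shao condition: does a > 0 exist with W^T a = 0?
library(lpSolve)

chen_shao_overlap <- function(X, y) {
  W <- X * ifelse(y == 0, 1, -1)       # ith row: x_i^T (y_i = 0) or -x_i^T (y_i = 1)
  n <- nrow(W); p <- ncol(W)
  sol <- lp("min",
            objective.in = rep(0, n),                  # pure feasibility problem
            const.mat    = rbind(t(W), diag(n)),       # W^T a = 0 and a_i >= 1
            const.dir    = c(rep("=", p), rep(">=", n)),
            const.rhs    = c(rep(0, p), rep(1, n)))
  sol$status == 0                      # TRUE if and only if a feasible a exists
}

chen_shao_overlap(X, y)
```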
The remainder of this article is organized as follows. In Section 2, we formally define geometric ergodicity and prove that Ψ is geometrically ergodic by establishing an appropriate drift condition. In Section 3 we review the notions of trace-class Markov chains and prove that under some easily verifiable sufficient conditions Ψ is trace-class. In Section 4, we briefly review the concepts of sandwich algorithms and derive the form of one such algorithm, namely the Haar PX-DA algorithm, corresponding to the AC-DA algorithm. In Section 5 we provide an illustration based on a real dataset to exhibit the improvements that can be achieved by using the Haar PX-DA algorithm over the AC-DA algorithm. In Appendix A, proofs of some relevant mathematical results are provided. A method for sampling from a density that appears in the Haar PX-DA algorithm is described in Appendix B.

Geometric ergodicity for the AC-DA chain
In this section, we first formally define the notion of geometric ergodicity for a Markov chain, and then show that the AC-DA chain $\Psi$ is geometrically ergodic. Let $k(\cdot, \cdot)$ denote the Markov transition density associated with $\Psi$, with corresponding Markov transition function $K(\cdot, \cdot)$. In particular, for $\beta \in \mathbb{R}^p$ and a measurable set $A \in \mathcal{B}$ (the Borel $\sigma$-field on $\mathbb{R}^p$),
$$K(\beta, A) = \int_A k(\beta, \beta')\, d\beta'.$$
For $m \geq 1$, the corresponding $m$-step Markov transition function is defined inductively by
$$K^m(\beta, A) = \int_{\mathbb{R}^p} K^{m-1}(\beta', A)\, k(\beta, \beta')\, d\beta', \qquad \text{with } K^1 \equiv K.$$
The chain $\Psi$ is geometrically ergodic if there exist a constant $\eta \in [0, 1)$ and a function $M : \mathbb{R}^p \to [0, \infty)$ such that for any $\beta \in \mathbb{R}^p$ and any $m \in \mathbb{N}$,
$$\left\| K^m(\beta, \cdot) - \Pi(\cdot \mid y) \right\|_{TV} \leq M(\beta)\, \eta^m,$$
where $\|\cdot\|_{TV}$ denotes the total variation norm and $\Pi(\cdot \mid y)$ is the posterior probability measure corresponding to $\pi(\beta \mid y)$, the $\beta$-marginal of the joint density $\pi(\beta, z \mid y)$. As mentioned in the introduction, geometric ergodicity implies the existence of a CLT for Markov chain based cumulative averages. In particular, let $g \in L^2(\pi(\beta \mid y))$, and let $(\beta_N)_{N=1}^m$ denote the observations generated by the DA algorithm. Define $\bar{g}_m := m^{-1} \sum_{N=1}^m g(\beta_N)$. If the (reversible) DA Markov chain is geometrically ergodic, then there exists $\sigma_g^2 \in (0, \infty)$ such that, as $m \to \infty$,
$$\sqrt{m}\, \left( \bar{g}_m - E_\pi g \right) \overset{d}{\to} N(0, \sigma_g^2).$$
Several methods for obtaining consistent estimators of $\sigma_g^2$ are available in the literature; see for example [9, 7]. These methods typically require additional moment assumptions on $g$ along with other mild regularity assumptions.
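As a concrete illustration of one such method, the sketch below implements a plain batch-means estimator of $\sigma_g^2$ in R, with the common square-root batch length; `draws` refers to output such as that produced by the AC-DA sketch in the Introduction, and $g$ is taken to be the first coordinate. (Packages such as mcmcse provide more refined implementations.)

```r
## Batch-means estimate of the asymptotic variance sigma_g^2 and the
## corresponding CLT-based Monte Carlo standard error of the cumulative average.
batch_means_se <- function(x) {
  m <- length(x)
  b <- floor(sqrt(m))                      # batch length
  a <- floor(m / b)                        # number of batches
  batch_avgs <- colMeans(matrix(x[1:(a * b)], nrow = b))
  sigma2_hat <- b * var(batch_avgs)        # estimates sigma_g^2
  sqrt(sigma2_hat / m)                     # standard error of the cumulative average
}

batch_means_se(draws[, 1])
```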
The following theorem establishes geometric ergodicity of $\Psi$ via a (geometric) drift condition based on the (drift) function
$$\nu(\beta) = \beta^T (X^T X + Q)\, \beta.$$
The fact that $(X^T X + Q)$ is positive definite ensures that $\nu(\beta)$ is unbounded off compact sets as a function of $\beta$, i.e., for each $\alpha > 0$, the level set $\{\beta : \beta^T (X^T X + Q)\, \beta \leq \alpha\}$ is compact.

Theorem 1. Let $k(\cdot, \cdot)$ denote the transition density corresponding to the Markov chain $\Psi$. Then there exist $\rho \in (0, 1)$ and $L \in \mathbb{R}$ such that, for any current state $\beta \in \mathbb{R}^p$,
$$\int_{\mathbb{R}^p} \nu(\beta')\, k(\beta, \beta')\, d\beta' \leq \rho\, \nu(\beta) + L.$$

Proof. At the outset, note that the transition density corresponding to the AC-DA Markov chain is given by
$$k(\beta, \beta') = \int_{\mathcal{Z}} \pi(\beta' \mid z, y)\, \pi(z \mid \beta, y)\, dz,$$
where $\mathcal{Z}$ denotes the space where the random vector $z$ lives, i.e., $\mathcal{Z}$ is the Cartesian product of $n$ half lines $\mathbb{R}_+ = (0, \infty)$ or $\mathbb{R}_- = (-\infty, 0]$ according as $y_i = 1$ or $0$. Therefore, by Fubini's theorem, we get
$$\int_{\mathbb{R}^p} \nu(\beta')\, k(\beta, \beta')\, d\beta' = \int_{\mathcal{Z}} \left[ \int_{\mathbb{R}^p} \nu(\beta')\, \pi(\beta' \mid z, y)\, d\beta' \right] \pi(z \mid \beta, y)\, dz. \quad (5)$$
The inner integral on the right hand side of (5) is the expectation of $\beta'^T (X^T X + Q)\, \beta'$ under the normal density in (3), and is hence given by
$$p + (v + X^T z)^T (X^T X + Q)^{-1} (v + X^T z). \quad (6)$$
Note that for any $a, b \in \mathbb{R}^p$ and any $c > 0$, we have $2\, |a^T (X^T X + Q)^{-1} b| \leq c^{-1}\, a^T (X^T X + Q)^{-1} a + c\, b^T (X^T X + Q)^{-1} b$, which yields the following upper bound for (6):
$$p + (1 + c^{-1})\, v^T (X^T X + Q)^{-1} v + (1 + c)\, z^T X (X^T X + Q)^{-1} X^T z. \quad (7)$$
Hence, from (5), (6) and (7), we can write, for any $c > 0$,
$$\int_{\mathbb{R}^p} \nu(\beta')\, k(\beta, \beta')\, d\beta' \leq p + (1 + c^{-1})\, v^T (X^T X + Q)^{-1} v + (1 + c)\, E\left[ z^T X (X^T X + Q)^{-1} X^T z \mid \beta, y \right]. \quad (8)$$
Now, note that since $Q$ is positive definite, the largest eigenvalue $\lambda_{\max}$ of $X (X^T X + Q)^{-1} X^T$ is strictly less than one, so that
$$E\left[ z^T X (X^T X + Q)^{-1} X^T z \mid \beta, y \right] \leq \lambda_{\max} \sum_{i=1}^n E\left[ z_i^2 \mid \beta, y \right]. \quad (9)$$
Standard results from the theory of truncated normal distributions (see Roy and Hobert [22]) provide bounds for the conditional second moments $E[z_i^2 \mid \beta, y]$.
A more compact way of expressing these moment bounds is as follows. Therefore, from (9) we can write a bound of the form (10), where $A_2 = n\, \lambda_{\max} (1 + \Lambda) < \infty$, with $\Lambda$ a finite constant arising from the truncated normal moment bounds, and where the inequality in the second last line follows from the fact that $Q$ is positive definite. Finally, combining (8) and (10), we get the drift inequality in the statement of the theorem. This completes the proof.

Remark 2.1. As mentioned earlier, since $(X^T X + Q)$ is always positive definite, $\nu(\beta)$ is unbounded off compact sets for any design matrix $X$. Therefore, from Meyn and Tweedie [14, Lemma 15.2.8] and Theorem 1, it follows that for any $X$, $n$ and $p$, the AC-DA Markov chain $\Psi$ is geometrically ergodic.
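The drift inequality in Theorem 1 can also be examined numerically. The sketch below, reusing X, Q and the ac_da function from the AC-DA sketch in the Introduction, estimates the one-step expectation $E[\nu(\beta_1) \mid \beta_0 = \beta]$ by Monte Carlo from a deliberately remote starting point; the starting value and replication count are arbitrary illustrative choices.

```r
## Monte Carlo look at the drift of nu(beta) = beta^T (X^T X + Q) beta:
## starting far out, the one-step expected drift function value should be
## much smaller than its value at the starting point.
nu <- function(b) as.numeric(t(b) %*% (t(X) %*% X + Q) %*% b)

beta0 <- rep(20, p)                                   # remote starting state
one_step <- replicate(2000, nu(ac_da(1, beta0)[1, ]))
c(nu_at_start = nu(beta0), mean_one_step = mean(one_step))
```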

Trace-class property for the AC-DA chain
Recall that the AC-DA Markov chain $\Psi$ has associated transition density given by
$$k(\beta', \beta) = \int_{\mathcal{Z}} \pi(\beta \mid z, y)\, \pi(z \mid \beta', y)\, dz. \quad (11)$$
Let $L_0^2(\pi(\cdot \mid y))$ denote the space of square-integrable functions with mean zero (with respect to the posterior density $\pi(\beta \mid y)$). Let $K$ denote the Markov operator on $L_0^2(\pi(\cdot \mid y))$ associated with the transition density $k$. Note that the Markov transition density $k$ is reversible with respect to its invariant distribution, and $K$ is a positive, self-adjoint operator. The operator $K$ is trace-class (see Jörgens [10]) if and only if
$$\int_{\mathbb{R}^p} k(\beta, \beta)\, d\beta < \infty. \quad (12)$$
If the trace-class property holds, then $K$ is compact, and its eigenvalues are summable (stronger than square summable), which in particular also implies that the associated Markov chain is geometrically ergodic. The trace-class property for a DA Markov chain has another important implication. Hobert and Marchev [8] define a class of sandwich algorithms called Haar PX-DA algorithms, which they show to be optimal in an appropriate sense. If a DA algorithm is trace-class, then so is the Haar PX-DA algorithm. Furthermore, the Haar PX-DA operator is strictly better than the DA operator, in the sense that the (countable) spectrum of the Haar PX-DA algorithm is dominated pointwise by the spectrum of the DA algorithm, with at least one strict domination (Khare and Hobert [11]). See Section 4 for more details.
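By (11), the diagonal of the transition density satisfies $k(\beta, \beta) = E[\pi(\beta \mid z, y)]$ with $z \sim \pi(z \mid \beta, y)$, so the integrand in (12) can be evaluated pointwise by plain Monte Carlo. The sketch below does this at a single point in R, reusing the objects from the AC-DA sketch in the Introduction and the dmvnorm density from the mvtnorm package; it merely illustrates the quantity being integrated and plays no role in the proof.

```r
## Monte Carlo evaluation of k(beta, beta), the integrand of the
## trace-class criterion, at a given point beta.
library(mvtnorm)

k_diag <- function(beta, nmc = 5000) {
  mu <- as.vector(X %*% beta)
  lo <- ifelse(y == 1, pnorm(0, mu, 1), 0)
  hi <- ifelse(y == 1, 1, pnorm(0, mu, 1))
  vals <- replicate(nmc, {
    z <- qnorm(lo + runif(n) * (hi - lo), mu, 1)   # z ~ pi(z | beta, y)
    dmvnorm(beta, mean = as.vector(M %*% (v + t(X) %*% z)), sigma = M)
  })
  mean(vals)                                       # estimates k(beta, beta)
}

k_diag(rep(0, p))
```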
The following theorem provides sufficient conditions under which the Markov operator $K$ corresponding to the AC-DA algorithm is trace-class.
Proof. We shall show that (12) holds for the Markov chain Ψ if either (A) or (B) holds. Conditions (A) and (B) will not play a role at all in the first half of this proof, but will be needed to show the positivity of an appropriate function in the second half of the proof.
First, note that (3) gives an explicit expression for the full conditional density $\pi(\beta \mid z, y)$. Now, for $i = 1, \ldots, n$, let $w_i$ be $x_i$ or $-x_i$, and $t_i$ be $z_i$ or $-z_i$, according as $y_i$ is 0 or 1; let $W$ denote the $n \times p$ matrix with $i$th row $w_i^T$, and let $t = (t_1, \ldots, t_n)^T$. Then $X^T X = W^T W$ and $X^T z = W^T t$, and the absolute value of the Jacobian of the transformation $z \to t$ is one. So the conditional density $\pi(t \mid \beta, y)$ of $t$ given $\beta, y$ has the corresponding product-of-truncated-normals form, which follows from simple calculations on (1). Therefore, using (11), (13) and (14), we get the form (15) for the integral in (12) in the current setting.
Here $C_0$ denotes the product of all constant terms (independent of $\beta$ and $t$) appearing in the full conditional densities $\pi(\beta \mid t, y)$ and $\pi(t \mid \beta, y)$. After an appropriate change of variables, the right hand side of (15) is proportional to the expression in (16). Now consider the partition
$$\mathbb{R}^p = \bigcup_{\zeta \subseteq \{1, \ldots, n\}} A_\zeta, \qquad A_\zeta = \left\{ \theta \in \mathbb{R}^p : w_i^T \theta > 0 \text{ for } i \in \zeta, \text{ and } w_i^T \theta \leq 0 \text{ for } i \notin \zeta \right\}.$$
The above partition is essentially obtained by using the $n$ hyperplanes defined by $w_i^T \theta = 0$ for $1 \leq i \leq n$. This partition has also been used in [22] for proving geometric ergodicity of the DA Markov chain corresponding to an improper flat prior on $\beta$. The right hand side of (16) can now be written as a sum of integrals over the sets $A_\zeta$. Therefore, to prove (12), it is enough to show, as in (18), that the integral over $A_\zeta$ is finite for any $\zeta \subseteq \{1, \ldots, n\}$. Fix an arbitrary $\zeta \subseteq \{1, \ldots, n\}$. Define $a_{1i} = (w_i^T \theta)\, 1(i \in \zeta)$ and $a_{2i} = (w_i^T \theta)\, 1(i \notin \zeta)$ for $i = 1, \ldots, n$, and $a_j = (a_{ji})_{1 \leq i \leq n}$ for $j = 1, 2$. This means $a_1 + a_2 = W\theta$ and $a_{1i} a_{2i} = 0$ for all $i$; in particular, $a_1^T a_2 = 0$. Then, for $i \notin \zeta$, the relevant factors can be bounded in terms of $q(x) = (x + \sqrt{4 + x^2})/2$, where $\phi(\cdot)$ denotes the standard normal density function. Thus, for any $i = 1, \ldots, n$, one obtains a bound of the form (19), where $Q(a_2) = \prod_{i=1}^n \bar{q}(a_{2i})$.
We now derive an upper bound for the inner integral in (17). Let $\epsilon \in (0, 1)$ be arbitrary. Then, using the fact that $2\, |a_1^T a_2| \leq \epsilon\, a_1^T a_1 + \epsilon^{-1}\, a_2^T a_2$, one obtains the bound in (20), where the last inequality follows from the fact that the integrand is a normal density. Therefore, from (17), (19) and (20), we get the bound in (21), with the last equality following from the fact that $a_1^T a_2 = 0$. Therefore, to prove (18), it is sufficient to show that, for some $\epsilon \in (0, 1)$, the expression in (23) is integrable on $A_\zeta$. This holds when $G(\theta, \epsilon)$ is a positive definite quadratic form in $\theta$ on $A_\zeta$, as then, for some sufficiently large $C_2 > 0$, $G(\theta, \epsilon) - 2\theta^T v + C_2$ is also positive definite, making the exponential term in (23) a constant multiple of an appropriate multivariate normal density; the integrability of (23) then follows from the existence of (positive) moments of any multivariate normal distribution. Therefore, our objective is to show that there exists an $\epsilon \in (0, 1)$ for which $G(\theta, \epsilon)$ is a positive definite quadratic form (on $A_\zeta$) in $\theta$ when at least one of (A) and (B) holds. Note that on $A_\zeta$, each entry of $a_1$ and $a_2$ is a linear function of $\theta$. It follows from (22) that on $A_\zeta$, $G(\theta, \epsilon)$ is a quadratic form in $\theta$ for every $\epsilon > 0$. Since $\zeta$ is arbitrarily chosen, to achieve our objective it is enough to show that, for some $\epsilon \in (0, 1)$, $G(\theta, \epsilon) > 0$ for every $\theta \in \mathbb{R}^p$. Since $X$ (and hence $W$) is assumed to have full column rank if $n \geq p$ and full row rank if $n < p$, it follows that either $W^T W$ (when $n \geq p$) or $W W^T$ (when $n < p$) is invertible. Therefore, both when $n \geq p$ and when $n < p$, one obtains a lower bound in terms of a projection matrix $P_B$, where for any matrix $B$, $P_B$ denotes the orthogonal projection (matrix) onto the column space of $B$, and the inequality follows from the fact that $x^T x \geq x^T P_B x$ for any $x \in \mathbb{R}^k$, $k$ being the number of rows of $B$. Hence, from (22), we obtain the bound in (24), where $a^T = (a_1^T, a_2^T)$. From (24), it follows that in order to prove $G(\theta, \epsilon) > 0$ for all $\theta \in \mathbb{R}^p$, it is enough to show that $H(a, \epsilon) > 0$ for all $a \in \mathbb{R}^{2n}$. Now, fixing an appropriate $\epsilon$ and writing $H^*(a)$ for the resulting quadratic form, we shall prove this fact by considering the cases $n \geq p$ and $n < p$ separately.

Case I: n ≥ p
Consider the singular value decomposition $W = U D V^T$, where $V \in \mathbb{R}^{p \times p}$ is orthogonal, $D = \mathrm{diag}(d_1, \ldots, d_p)$ with no $d_i$ equal to zero (as $W$ has full column rank in this case), and $U \in \mathbb{R}^{n \times p}$ is a matrix with orthonormal columns.

Then, writing $\bar{a}_j = U^T a_j$ for $j = 1, 2$, the quadratic form $H^*(a)$ decomposes as in (26). Note that, under (A), the matrix inequality in (27) holds, where for two symmetric matrices $A$ and $B$ of the same order, $A > B$ means that $A - B$ is positive definite. This shows that the first two terms in (26) are strictly positive.

Now, under (B), the cross product term in (26) can be simplified using the fact that $a_{1i} a_{2i} = 0$ for all $i = 1, \ldots, n$. In particular, an appropriate choice of the diagonal matrix $D^*$ makes the cross product term in (26) equal to zero, which means that, under (B), $H^*(a)$ is a sum of two positive quantities, and hence is strictly positive.

Case II: n < p
We slightly abuse our notation by considering the singular value decomposition $W^T = U D V^T$, where (as before, but now with different dimensions) $V \in \mathbb{R}^{n \times n}$ is orthogonal, $D = \mathrm{diag}(d_1, \ldots, d_n)$ with no $d_i$ equal to zero, and $U \in \mathbb{R}^{p \times n}$ is a matrix with orthonormal columns, which can be extended to an orthogonal matrix in $\mathbb{R}^{p \times p}$. Writing $\bar{a}_j = V^T a_j$ for $j = 1, 2$, and using the fact that $a_1^T V V^T a_2 = a_1^T a_2 = 0$, one obtains the decomposition (31) of $H^*(a)$. Notice the similarities between (31) and (26), and note that the relevant matrices have the same non-zero eigenvalues, namely $d_1^2, \ldots, d_n^2$. Therefore, by arguments exactly similar to those provided in the previous case, it follows that in this case also $H^*(a)$ is positive definite in $a$ if either (A) or (B) holds.
Thus, both when $n \geq p$ and when $n < p$, if at least one of (A) and (B) holds, $H^*(a)$ is positive definite in $a$. As mentioned previously, this ensures the integrability of $I_{A_\zeta}$ as given in (17). Since $\zeta \subseteq \{1, 2, \ldots, n\}$ is chosen arbitrarily, it follows that $\Psi$ has the trace-class property.

For ease of reference, we restate the eigenvalue condition:

(A1) All eigenvalues (or all non-zero eigenvalues, if $n \geq p$) of $XQ^{-1}X^T$ are less than $7/2$.

Remark 3.2.
The prior considered in this paper approaches a flat prior when $Q$ is "small" (approaching the zero matrix). However, when $Q$ is "small", $Q^{-1}$ is "large", which makes the (positive) eigenvalues of $XQ^{-1}X^T$ large. It follows that when $Q$ is so small that at least one eigenvalue of $XQ^{-1}X^T$ is greater than or equal to $7/2$, condition (A1) is violated, and Theorem 2 can no longer be applied.

Remark 3.3.
When $n \geq p$ and $X$ has full column rank, Zellner [26] specifies a Gaussian prior distribution for $\beta$ with the prior covariance matrix having the form $Q^{-1} = g (X^T X)^{-1}$, where $g$ is a positive scaling constant. This prior is commonly referred to as Zellner's g-prior. Under this prior, $XQ^{-1}X^T = g\, X (X^T X)^{-1} X^T$ has the non-zero eigenvalue $g$ with multiplicity $p$. Hence condition (A1) is satisfied as long as $g < 7/2$, or equivalently $g^{-1} > 2/7$. Thus, under this prior, a sufficient condition for the AC-DA Markov chain $\Psi$ to be trace-class is $g < 7/2$.
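This remark is easy to check numerically: under the g-prior, $XQ^{-1}X^T$ is $g$ times a rank-$p$ projection matrix. The sketch below reuses the toy design matrix X from the AC-DA sketch in the Introduction, with an arbitrary value of $g$ just below the bound.

```r
## Eigenvalues of X Q^{-1} X^T under Zellner's g-prior, Q = (X^T X)/g:
## the non-zero eigenvalues all equal g, so condition (A1) holds iff g < 7/2.
g <- 3.4999
Q_g <- t(X) %*% X / g
eigs <- eigen(X %*% solve(Q_g) %*% t(X), symmetric = TRUE)$values
range(eigs[eigs > 1e-8])   # the p non-zero eigenvalues (all equal to g)
all(eigs < 7/2)            # eigenvalue condition satisfied
```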

Sandwich algorithms
As mentioned previously, one common problem with DA algorithms is that they are slow to converge. However, significant improvements over the convergence rate of a two-block DA Markov chain can be achieved by using a so-called sandwich algorithm, where one simple and computationally inexpensive intermediate step is "sandwiched" between the two steps of the DA algorithm at each iteration (see, e.g., Liu and Wu [12]; Meng and van Dyk [13]; Hobert and Marchev [8]). Consider our AC-DA Markov chain $\Psi$ once again, and let $\beta$ be its current state. One iteration of a sandwich algorithm corresponding to $\Psi$ comprises the following three (instead of two, as in the AC-DA) steps. The first step is similar to AC-DA in the sense that a (latent) random variable $z \sim \pi(z \mid \beta, y)$ is generated. Next, a Markov transition function $R$ that is reversible with respect to $\pi(z \mid y)\, dz$ (i.e., $R(z, dz')\, \pi(z \mid y)\, dz = R(z', dz)\, \pi(z' \mid y)\, dz'$) is considered, where $\pi(z \mid y)$ denotes the $z$-marginal of $\pi(\beta, z \mid y)$ in (1). The intermediate second step of the sandwich algorithm then amounts to generating a random variable $z'$ from the measure $R(z, \cdot)$. The third and final step of the sandwich algorithm is again similar to the last step of AC-DA, except that $z'$ is used instead of $z$; that is, the third step entails generating the next state from $\pi(\beta \mid z', y)$. The intermediate step involving the generation of $z'$ from $z$ is typically carried out with the help of a low (generally one or two) dimensional random variable, making the DA and the sandwich algorithms comparable in terms of computational effort per iteration. In order to make precise comparisons between the DA and the sandwich algorithms, we first need to introduce some notation. Let $\widetilde{\Psi}$ be the Markov chain obtained by the sandwich algorithm. Analogous to the notation used in Sections 2 and 3, let $\widetilde{K}$ denote the Markov operator associated with the sandwich algorithm, i.e., for all $h \in L_0^2(\pi)$, $\widetilde{K}$ maps $h$ to
$$(\widetilde{K} h)(\beta) = \int_{\mathbb{R}^p} h(\beta')\, \tilde{k}(\beta, \beta')\, d\beta',$$
where $\tilde{k}$, the Markov transition density of $\widetilde{\Psi}$, is defined as follows:
$$\tilde{k}(\beta, \beta') = \int_{\mathcal{Z}} \int_{\mathcal{Z}} \pi(\beta' \mid z', y)\, R(z, dz')\, \pi(z \mid \beta, y)\, dz.$$
A sandwich algorithm is always at least as good as the DA algorithm in the sense of having a smaller operator norm; that is, we always have $\|\widetilde{K}\| \leq \|K\|$, though a strict inequality may not hold in general. Here $K$ denotes the Markov operator associated with the corresponding DA Markov chain. Note that if a DA Markov chain is geometrically ergodic, then so is the sandwich Markov chain, due to the relationship $\|\widetilde{K}\| \leq \|K\| < 1$. (Recall that a reversible Markov chain is geometrically ergodic if and only if the corresponding operator $K$ satisfies $\|K\| < 1$ (Roberts and Rosenthal [19]).) Thus, as long as the original DA algorithm is geometrically ergodic, a CLT holds for the sandwich algorithm as well. In particular, let $g \in L^2(\pi(\beta \mid y))$, and let $(\beta_N)_{N=1}^m$ and $(\widetilde{\beta}_N)_{N=1}^m$ respectively denote the observations generated by the DA and the sandwich algorithms. Define $\bar{g}_m := m^{-1} \sum_{N=1}^m g(\beta_N)$ and $\tilde{g}_m := m^{-1} \sum_{N=1}^m g(\widetilde{\beta}_N)$. Then there exist positive, finite quantities $\sigma_g^2$ and $\tilde{\sigma}_g^2$ such that, as $m \to \infty$,
$$\sqrt{m}\, (\bar{g}_m - E_\pi g) \overset{d}{\to} N(0, \sigma_g^2) \quad \text{and} \quad \sqrt{m}\, (\tilde{g}_m - E_\pi g) \overset{d}{\to} N(0, \tilde{\sigma}_g^2).$$
Moreover, Hobert and Marchev [8, Theorem 4] show that $\tilde{\sigma}_g^2 \leq \sigma_g^2$; that is, by using a sandwich algorithm, one gets an asymptotic variance for $\tilde{g}_m$ no larger (and possibly smaller) than that of $\bar{g}_m$.
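Structurally, a sandwich iteration differs from an AC-DA iteration only by the extra middle step. The sketch below makes this explicit in R, reusing the objects from the AC-DA sketch in the Introduction; the argument `r_step` is a hypothetical placeholder for any transition that is reversible with respect to $\pi(z \mid y)$ (the Haar PX-DA move described next is one concrete choice). With `r_step = identity`, the function reduces to a plain AC-DA step.

```r
## One iteration of a generic sandwich algorithm for the probit DA chain.
## `r_step` stands for the sandwiched move z -> z'; it must be reversible
## with respect to pi(z | y) for the theory above to apply.
sandwich_step <- function(beta, r_step) {
  mu <- as.vector(X %*% beta)
  lo <- ifelse(y == 1, pnorm(0, mu, 1), 0)
  hi <- ifelse(y == 1, 1, pnorm(0, mu, 1))
  z <- qnorm(lo + runif(n) * (hi - lo), mu, 1)   # step 1: z ~ pi(z | beta, y)
  z_new <- r_step(z)                             # step 2: z' ~ R(z, .)
  as.vector(M %*% (v + t(X) %*% z_new) +         # step 3: beta ~ pi(beta | z', y)
              t(R_chol) %*% rnorm(p))
}

sandwich_step(rep(0, p), r_step = identity)      # degenerate R recovers AC-DA
```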
One class of sandwich algorithms, the so-called Parameter Expanded Data Augmentation (PX-DA) algorithms (Liu and Wu [12]; Meng and van Dyk [13]), uses a proper probability measure for the Markov transition function $R$ in the intermediate step. While all PX-DA algorithms are aimed at improving the original DA algorithm, following Hobert and Marchev [8], one can get a sandwich algorithm that is uniformly better than all PX-DA algorithms, as long as a certain group structure is present in the problem. This "best" PX-DA algorithm, while technically not a PX-DA algorithm itself, as it uses a Haar measure rather than a proper probability measure for $R$, is called the Haar PX-DA algorithm. We now describe the form of the Haar PX-DA algorithm corresponding to the AC-DA algorithm.
Using notation similar to that in Hobert and Marchev [8], let $G$ be the multiplicative group $(\mathbb{R}_+, \circ)$, where the group composition $\circ$ is defined as multiplication, i.e., $g_1 \circ g_2 = g_1 g_2$ for all $g_1, g_2 \in G$. $G$ has $e = 1$ as its identity element, and $g^{-1} = 1/g$. The multiplicative group $(\mathbb{R}_+, \circ)$ is unimodular with Haar measure $\mu_l(dg) = dg/g$, $dg$ being the usual Lebesgue measure on $\mathbb{R}_+$. Recall that $\mathcal{Z}$ denotes the support of the conditional density $\pi(z \mid y)$ of $z$ given $y$. (In particular, $\mathcal{Z}$ is the Cartesian product of $n$ half lines $\mathbb{R}_+$ or $\mathbb{R}_-$ according as $y_i = 1$ or $0$.) Let us define a (left) group action of $G$ on $\mathcal{Z}$ through component-wise multiplication, i.e., for $g \in G$ and $z = (z_1, \ldots, z_n)^T \in \mathcal{Z}$, $gz = (g z_1, \ldots, g z_n)^T$. With this (left) group action, the Lebesgue measure on $\mathbb{R}^n$ is relatively left invariant with multiplier $\chi(g) = g^n$; i.e., for all $g \in G$ and all integrable functions $h : \mathbb{R}^n \to \mathbb{R}$,
$$\int_{\mathbb{R}^n} h(gz)\, \chi(g)\, dz = \int_{\mathbb{R}^n} h(z)\, dz.$$
The intermediate step of the Haar PX-DA algorithm (that involves drawing $z'$ from $z$ using some Markov transition function $R$) then amounts to generating a random variable $g$ from the density proportional to
$$\chi(g)\, \pi(gz \mid y)\, \mu_l(dg) = g^{n-1}\, \pi(gz \mid y)\, dg =: w(g)\, dg,$$
and defining $z' = gz = (g z_1, \ldots, g z_n)^T$. Straightforward calculations based on the $z$-marginal of the joint density in (1) yield an explicit form for $w(g)$; a method for sampling from this density is described in Appendix B.

An illustration

Figures 1(a) and 2(a) show that the autocorrelations for the two chains based on the proper prior decay much faster than those for the two chains based on the improper flat prior. For instance, for $\beta_1$ in Figure 1(a), note that all autocorrelations are less than 0.5 for the two proper prior chains, while it takes 17 lags for the improper Haar PX-DA chain to achieve such an autocorrelation (and the improper AC-DA chain never reaches that value in the first fifty lags). The autocorrelation plots for $\beta_2$ show similar patterns in Figure 2(a). In both Figures 1(a) and 2(a), the autocorrelations for the AC-DA proper prior chain almost coincide with those for the Haar PX-DA proper prior chain. Again, observe the noticeably better performance, in terms of stability of running means, of the chains based on the proper prior in Figures 1(b) and 2(b). On the scales used in those two plots, the proper prior chains appear almost as coinciding horizontal straight lines. In contrast, on the same scales, the improper Haar PX-DA chain shows moderate fluctuations, and the improper AC-DA chain significant fluctuations, until about 300,000 and 700,000 iterations respectively, for both $\beta_1$ and $\beta_2$. These metrics indicate the noteworthy superiority (in terms of efficiency as well as convergence) of the chains based on the proper prior over those based on the improper flat prior in the current setting.
Because the performances of the two proper prior chains are almost indistinguishable at the scales used in Figures 1(a), 1(b), 2(a) and 2(b), we take a closer look at these two chains to facilitate comparison. In particular, Figures 3(a) and 4(a) display the autocorrelations, and Figures 3(b) and 4(b) the running means, for $\beta_1$ and $\beta_2$ respectively, on appropriately chosen scales for the proper prior chains. Note that the autocorrelations are almost identical, and the running means show very similar patterns in terms of stability, for $\beta_2$ in Figures 4(a) and 4(b) (even on the adjusted scale). On the other hand, Figures 3(a) and 3(b) demonstrate a slightly more pronounced dominance of the Haar PX-DA chain over the AC-DA chain for $\beta_1$.
Thus, to summarize, in terms of convergence the proper Haar PX-DA chain is the best among the four. Taking into account the practically insignificant amount of time needed to run the extra step, the Haar PX-DA algorithm with the proper g-prior ($g = 3.499999$) is therefore the best choice among all four algorithms considered in the current setting (the AC-DA algorithm based on the same prior being a close competitor).

Appendix A

Proposition A.1. For any $B \in \mathbb{R}^{n \times p}$ and $\tau > 0$, the eigenvalues of $B (B^T B + \tau I_p)^{-1} B^T$ all lie in $[0, 1)$.

Proof. We shall consider the cases $n \geq p$ and $n < p$ separately.

Case I: n ≥ p
Consider the singular value decomposition $B = U D V^T$, where $V \in \mathbb{R}^{p \times p}$ is orthogonal, $D = \mathrm{diag}(d_1, \ldots, d_p)$ is diagonal, where one or more (but not all) $d_i$'s may be equal to zero, and $U \in \mathbb{R}^{n \times p}$ is a matrix with orthonormal columns, which can be extended to an orthogonal matrix in $\mathbb{R}^{n \times n}$. Then
$$B (B^T B + \tau I_p)^{-1} B^T = U D (D^2 + \tau I_p)^{-1} D\, U^T = U \left( D^2 / (D^2 + \tau I_p) \right) U^T,$$
where, for diagonal matrices $M_{k \times k} = \mathrm{diag}(m_1, \ldots, m_k)$ and $N_{k \times k} = \mathrm{diag}(n_1, \ldots, n_k)$ with $n_i \neq 0$ for all $i = 1, \ldots, k$, we define $M/N := \mathrm{diag}(m_1/n_1, \ldots, m_k/n_k)$. Therefore, the eigenvalues of $B (B^T B + \tau I_p)^{-1} B^T$ are $d_i^2/(d_i^2 + \tau)$ for $i = 1, \ldots, p$, together with $n - p$ zeros, and each of these lies in $[0, 1)$.

Case II: n < p
For this case, consider the singular value decomposition $B^T = U D V^T$, where (as before, but now with different dimensions) $V \in \mathbb{R}^{n \times n}$ is orthogonal, $D = \mathrm{diag}(d_1, \ldots, d_n)$ is diagonal, where one or more (but not all) $d_i$'s may be equal to zero, and $U \in \mathbb{R}^{p \times n}$ is a matrix with orthonormal columns, which can be extended to an orthogonal matrix in $\mathbb{R}^{p \times p}$. Then, arguing exactly as in Case I,
$$B (B^T B + \tau I_p)^{-1} B^T = V D (D^2 + \tau I_n)^{-1} D\, V^T,$$
whose eigenvalues are $d_i^2/(d_i^2 + \tau)$ for $i = 1, \ldots, n$, each of which lies in $[0, 1)$.

Proposition A.2. For any $B \in \mathbb{R}^{n \times p}$ and $\tau > 0$, the matrix $I_n - B (B^T B + \tau I_p)^{-1} B^T$ is positive definite, with all eigenvalues in $(0, 1]$.

Proof. Note that the result is trivially true if $B = 0_{n \times p}$. So, without loss of generality, we assume $B \neq 0_{n \times p}$. Let $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of $M := B (B^T B + \tau I_p)^{-1} B^T$. Then there exists an orthogonal matrix $U \in \mathbb{R}^{n \times n}$ such that $M = U \Lambda U^T$, i.e., $U^T M U = \Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Note that
$$U^T (I_n - M)\, U = U^T U - U^T M U = I_n - \Lambda = \mathrm{diag}(1 - \lambda_1, \ldots, 1 - \lambda_n),$$
which implies that the eigenvalues of $I_n - M$ are $1 - \lambda_1, \ldots, 1 - \lambda_n$. This completes the proof, since it follows from Proposition A.1 that $\lambda_i \in [0, 1)$, and hence $1 - \lambda_i \in (0, 1]$, for all $i = 1, \ldots, n$.

Remark A.1. Propositions A.1 and A.2 are essentially generalizations of (the first halves of) Lemma 4 and Lemma 5 in Roman and Hobert [20], where, by exhibiting explicit forms for the eigenvalues of a matrix of the form $B (\kappa B^T B + \Sigma^{-1})^{-1} B^T$, with $\kappa > 0$, $B \in \mathbb{R}^{n \times p}$, and $\Sigma \in \mathbb{R}^{p \times p}$ positive definite, the authors ultimately prove the positive definiteness of $I_n - \kappa B (\kappa B^T B + \Sigma^{-1})^{-1} B^T$. It is to be noted that these results in Roman and Hobert [20] are derived under the assumption that $n \geq p$ and $\mathrm{rank}(B) = p$, whereas Propositions A.1 and A.2 hold for any $n$, $p$ and $B \in \mathbb{R}^{n \times p}$.
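Propositions A.1 and A.2 are also easy to verify numerically; the following R sketch does so for an arbitrary random $B$ and $\tau$ (our own illustrative choices), with a small tolerance to absorb floating point error.

```r
## Numerical check of Propositions A.1 and A.2 for a random rectangular B.
set.seed(2)
nn <- 5; pp <- 8; tau <- 0.7                 # arbitrary dimensions (nn < pp) and tau
B <- matrix(rnorm(nn * pp), nn, pp)
M_B <- B %*% solve(t(B) %*% B + tau * diag(pp)) %*% t(B)
ev <- eigen(M_B, symmetric = TRUE)$values
all(ev > -1e-10 & ev < 1)                    # Proposition A.1: eigenvalues in [0, 1)
all(1 - ev > 0 & 1 - ev <= 1 + 1e-10)        # Proposition A.2: eigenvalues of I - M_B in (0, 1]
```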