A comparison theorem for data augmentation algorithms with applications

Abstract: The data augmentation (DA) algorithm is a useful Markov chain Monte Carlo algorithm that sometimes suffers from slow convergence. It is often possible to convert a DA algorithm into a sandwich algorithm that is computationally equivalent to the DA algorithm, but converges much faster. Theoretically, the reversible Markov chain that drives the sandwich algorithm is at least as good as the corresponding DA chain in terms of performance in the central limit theorem and in the operator norm sense. In this paper, we use the sandwich machinery to compare two DA algorithms. In particular, we provide conditions under which one DA chain can be represented as a sandwich version of the other. Our results are used to extend Hobert and Marchev’s (2008) results on the Haar PX-DA algorithm and to improve the collapsing theorem of Liu et al. (1994) and Liu (1994). We also illustrate our results using Brownlee’s (1965) stack loss data.


Introduction
Suppose $f_X$ is an intractable density on $\mathsf X$ that we would like to explore. Consider a data augmentation (DA) algorithm (Tanner and Wong, 1987) based on the joint density $f(x, y)$ on $\mathsf X \times \mathsf Y$, which must satisfy $\int_{\mathsf Y} f(x, y)\, dy = f_X(x)$. The Markov chain, $\Phi = \{\Phi_m\}_{m=0}^{\infty}$, underlying the DA algorithm has Markov transition density (Mtd) given by
\[
k(x' \mid x) = \int_{\mathsf Y} f_{X|Y}(x' \mid y)\, f_{Y|X}(y \mid x)\, dy .
\]
In other words, $k(\cdot \mid x)$ is the density of $\Phi_{m+1}$, given that $\Phi_m = x$. It is well-known and easy to see that the Markov chain driven by the DA algorithm is reversible with respect to $f_X$, and this of course implies that $f_X$ is an invariant density. We assume throughout this section that all Markov chains on $\mathsf X$ are Harris ergodic (see Section 2 for the definition). The DA chain can be simulated by drawing alternately from the two conditional densities defined by $f(x, y)$. If the current state is $\Phi_m = x$, then $\Phi_{m+1}$ is simulated in two steps: draw $Y \sim f_{Y|X}(\cdot \mid x)$, call the result $y$, and then draw $\Phi_{m+1} \sim f_{X|Y}(\cdot \mid y)$.
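The two-step update is easy to express in code. The following sketch (illustrative only; the two sampler functions are user-supplied placeholders, not part of the paper) shows one iteration of a generic DA chain.

```python
def da_step(x, sample_y_given_x, sample_x_given_y, rng):
    """One update of a generic DA chain: x -> x'.

    sample_y_given_x(x, rng) draws Y ~ f_{Y|X}(. | x);
    sample_x_given_y(y, rng) draws X ~ f_{X|Y}(. | y).
    """
    y = sample_y_given_x(x, rng)        # step 1: impute the latent data
    return sample_x_given_y(y, rng)     # step 2: draw the new state
```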
The DA algorithm is a useful algorithm that sometimes suffers from slow convergence. It is often possible to convert a DA algorithm into a sandwich algorithm that is computationally equivalent to the DA algorithm, but converges much faster (see Khare and Hobert (2011), and the references therein). Let $f_Y$ denote the $y$-marginal density of $f(x, y)$. The Mtd of the sandwich algorithm is given by
\[
k_Q(x' \mid x) = \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, Q(y, dy')\, f_{Y|X}(y \mid x)\, dy ,
\]
where $Q(y, dy')$ is a Markov transition function (Mtf) on $\mathsf Y$ that is reversible with respect to $f_Y$. It is easy to see that $k_Q(x' \mid x)\, f_X(x)$ is symmetric in $(x, x')$, so the Markov chain, $\tilde\Phi = \{\tilde\Phi_m\}_{m=0}^{\infty}$, underlying the sandwich algorithm is reversible with respect to $f_X$. If the current state of the sandwich chain is $\tilde\Phi_m = x$, then $\tilde\Phi_{m+1}$ can be simulated as follows. Draw $Y \sim f_{Y|X}(\cdot \mid x)$, call the observed value $y$, then draw $Y' \sim Q(y, \cdot)$, call the result $y'$, and finally draw $\tilde\Phi_{m+1} \sim f_{X|Y}(\cdot \mid y')$. The first and third steps are exactly the two steps used to simulate the DA algorithm, and the name "sandwich algorithm", which was coined by Yu and Meng (2011), is based on the fact that the extra draw from $Q(y, \cdot)$ is sandwiched between the draws from the two conditional densities.
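Analogously, one iteration of the sandwich algorithm adds only a single extra move on the $y$-space. The sketch below (again illustrative; `q_move` is a user-supplied draw from $Q(y, \cdot)$) extends the DA step above.

```python
def sandwich_step(x, sample_y_given_x, q_move, sample_x_given_y, rng):
    """One update of a generic sandwich chain: x -> x'.

    q_move(y, rng) draws Y' ~ Q(y, .), where Q is reversible w.r.t. f_Y.
    """
    y = sample_y_given_x(x, rng)        # step 1: same as the DA algorithm
    y_new = q_move(y, rng)              # step 2: extra move on the y-space
    return sample_x_given_y(y_new, rng) # step 3: same as the DA algorithm
```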
It is known that the sandwich chain always converges at least as fast as the DA chain in the operator norm sense. Indeed, Hobert and Román (2011) show that Yu and Meng's (2011) Theorem 1 can be used to establish that
\[
\|K_Q\| \le \|Q\|\, \|K\| \le \|K\| ,
\]
where $K$, $K_Q$ and $Q$ denote the usual Markov operators defined by $k$, $k_Q$ and $Q(y, dy')$, respectively, and $\|\cdot\|$ denotes the operator norm. (The operator norm will be formally defined in Section 2, but for now it suffices to note that the norm is between 0 and 1, and smaller is better.) Moreover, it follows from Hobert and Marchev's (2008) Corollary 1 that $k_Q$ is at least as good as $k$ in the efficiency ordering of Mira and Geyer (1999), which concerns performance in the central limit theorem (CLT).
While the sandwich machinery was designed to improve a given DA algorithm, it can also be used to compare two DA chains. In particular, suppose that, in addition to the DA algorithm based on $f(x, y)$, we have a second DA algorithm based on another joint density $\tilde f(x, y)$ on $\mathsf X \times \mathsf Y$ such that its $x$-marginal density is $f_X$. Suppose also that we have
\[
\tilde k(x' \mid x) := \int_{\mathsf Y} \tilde f_{X|Y}(x' \mid y)\, \tilde f_{Y|X}(y \mid x)\, dy
= \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, Q(y, dy')\, f_{Y|X}(y \mid x)\, dy , \tag{1.1}
\]
where $\tilde f_{X|Y}$ and $\tilde f_{Y|X}$ are the conditional densities associated with $\tilde f(x, y)$, and $Q$ is a Mtf on $\mathsf Y$ that is reversible with respect to $f_Y$. Then the results described above imply that $\tilde k$ is at least as good as $k$ in the efficiency ordering and in the operator norm sense. The main result of this paper provides conditions under which $\tilde k$ admits this sandwich representation (1.1). We now provide an overview of our main result in the special case where $\mathsf X$ and $\mathsf Y$ are Euclidean spaces, and $f(x, y)$ and $\tilde f(x, y)$ are densities with respect to Lebesgue measure. Let $\tilde f_Y$ denote the $y$-marginal density of $\tilde f(x, y)$. Suppose that there exists a Mtf $R$ on $\mathsf Y$ satisfying
\[
\tilde f_{X|Y}(x \mid y) = \int_{\mathsf Y} f_{X|Y}(x \mid y')\, R(y, dy')
\qquad \text{and} \qquad
f_Y(y')\, dy' = \int_{\mathsf Y} R(y, dy')\, \tilde f_Y(y)\, dy .
\]
Then (1.1) holds with $Q$ equal to the Mtf corresponding to the DA algorithm for $f_Y$ based on the joint distribution on $\mathsf Y \times \mathsf Y$ given by $R(y, dy')\, \tilde f_Y(y)\, dy$.
We now illustrate the use of our results with several applications. Our main application involves generalizing the results of Hobert and Marchev (2008) (hereafter, H&M), who themselves generalized results of Liu and Wu (1999) (hereafter, L&W). L&W developed the PX-DA algorithm. The basic idea is to use $f(x, y)$ to create an entire family of joint densities on $\mathsf X \times \mathsf Y$ such that the $x$-marginal density of each member is $f_X$. This allows for the construction of a class of viable DA algorithms. To be specific, consider a class of functions $t_g : \mathsf Y \to \mathsf Y$ for $g \in G$ such that, for each fixed $g$, $t_g(y)$ is one-to-one and differentiable in $y$. We are assuming here that $G$ is a group with identity element $e$. Assume further that (a) $t_e(y) = y$ for all $y \in \mathsf Y$ and (b) $t_{g_1 g_2}(y) = t_{g_1}(t_{g_2}(y))$ for all $g_1, g_2 \in G$ and all $y \in \mathsf Y$. Suppose that $\nu : G \times \mathsf X \to [0, \infty)$ is a conditional probability density with respect to (unimodular) Haar measure on $G$ (see Section 3 for the definition). Now, define a probability density
\[
\tilde f^{(\nu)}(x, y) = \int_G f\big(x, t_g(y)\big)\, |J_g(y)|\, \nu(g \mid x)\, dg ,
\]
where $|J_g(y)|$ is the Jacobian of the transformation $y \mapsto t_g(y)$ and $dg$ denotes Haar measure on $G$. Clearly, the $x$-marginal of $\tilde f^{(\nu)}(x, y)$ is the target, $f_X$. Thus, each conditional density $\nu(g \mid x)$ leads to a new DA algorithm. L&W also propose the Haar PX-DA algorithm, which is a popular sandwich algorithm where $Q(y, dy')$ corresponds to the move $y \to y' = t_g(y)$ with $g$ (on $G$) drawn from the density (with respect to Haar measure) proportional to $f_Y(t_g(y))\, |J_g(y)|$.
L&W establish that the Haar PX-DA algorithm is at least as good in the operator norm sense as every PX-DA algorithm in the special case where X, Y and G are Euclidean spaces and the group G is unimodular. H&M provide extensions and generalizations of L&W's results in the special case where ν(· | x) does not depend on x. In particular, H&M show that L&W's results hold on more general spaces, and that Haar PX-DA is also at least as good as PX-DA in the efficiency ordering. Moreover, H&M are able to remove a key regularity condition that is required by L&W. In our main application, we show that all of H&M's results continue to hold in the more general case where ν(· | x) does depend on x.
We also apply our results to improve the collapsing theorem (Liu et al., 1994; Liu, 1994). To be specific, suppose there exists a joint density $f(x, y, z)$ on $\mathsf X \times \mathsf Y \times \mathsf Z$ such that $\int_{\mathsf Z} f(x, y, z)\, dz = f(x, y)$. Liu et al. (1994) refer to the DA algorithm that iterates between drawing from $f_{Y,Z|X}$ and drawing from $f_{X|Y,Z}$ as "grouping" and to the DA algorithm based on $f(x, y)$ as "collapsing". The collapsing theorem implies that the collapsing DA chain converges at least as fast as the grouping DA chain in the operator norm sense. We show that the collapsing DA chain is also at least as good as the grouping DA chain in the efficiency ordering.
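Written out using the generic DA Mtd from above (the labels $k_{\mathrm{group}}$ and $k_{\mathrm{collapse}}$ are ours, introduced only for this display), the two chains have Mtds
\[
k_{\mathrm{group}}(x' \mid x) = \int_{\mathsf Z}\int_{\mathsf Y} f_{X|Y,Z}(x' \mid y, z)\, f_{Y,Z|X}(y, z \mid x)\, dy\, dz
\qquad \text{and} \qquad
k_{\mathrm{collapse}}(x' \mid x) = \int_{\mathsf Y} f_{X|Y}(x' \mid y)\, f_{Y|X}(y \mid x)\, dy ,
\]
so the two algorithms differ only in whether $Z$ is imputed alongside $Y$ or integrated out before the algorithm is run.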
The remainder of this paper is organized as follows. Section 2 contains some results from general state space Markov chain theory, our main result, and a toy example. In Section 3, we apply the main result to extend H&M's results on the Haar PX-DA. We also illustrate our results using a PX-DA algorithm and Choi and Hobert's (2013) Haar PX-DA algorithm for Bayesian linear regression with Laplace errors. Finally, our main result is used to improve the collapsing theorem in Section 4.

Markov chain background
Let $P(x, dx')$ be a Mtf on a topological space $\mathsf X$ equipped with its Borel $\sigma$-algebra $\mathcal B_{\mathsf X}$. Suppose $P(x, dx')$ is reversible with respect to a probability measure $\pi$. Denote the Markov chain defined by $P(x, dx')$ as $\Phi = \{\Phi_m\}_{m=0}^{\infty}$. Assume that $\Phi$ is Harris ergodic (i.e., irreducible, aperiodic, and positive Harris recurrent). As usual, let $L^2(\pi)$ be the vector space of real-valued, measurable functions on $\mathsf X$ that are square-integrable with respect to $\pi$, and let $L_0^2(\pi)$ be the subspace of mean-zero functions. The latter is a Hilbert space in which the inner product of $g, h \in L_0^2(\pi)$ is defined as
\[
\langle g, h \rangle = \int_{\mathsf X} g(x)\, h(x)\, \pi(dx) ,
\]
and the corresponding norm is, of course, given by $\|g\| = \langle g, g \rangle^{1/2}$. The Mtf $P(x, dx')$ defines an operator $P$ on $L_0^2(\pi)$ that maps $g \in L_0^2(\pi)$ to
\[
(Pg)(x) = \int_{\mathsf X} g(x')\, P(x, dx') .
\]
It is easy to see, using reversibility, that $P$ is self-adjoint; that is, for all $g, h \in L_0^2(\pi)$, $\langle Pg, h \rangle = \langle g, Ph \rangle$. The operator norm of $P$ is defined as
\[
\|P\| = \sup_{g \in L_0^2(\pi),\, \|g\| = 1} \|Pg\| .
\]
A simple application of Jensen's inequality shows that $\|P\| \in [0, 1]$. In fact, $\|P\|$ provides a great deal of information about the convergence behavior of the corresponding Markov chain $\Phi$. For instance, $\Phi$ is geometrically ergodic if and only if $\|P\| < 1$ (Roberts and Rosenthal, 1997). Moreover, results in Liu et al. (1995) show that the smaller the norm, the faster the chain converges.
Assume that Markov chain Monte Carlo will be used to estimate the finite, intractable expectation $E_\pi g = \int_{\mathsf X} g(x)\, \pi(dx)$. Assume further that there exists a CLT for the ergodic average $\bar g_m = \frac{1}{m} \sum_{i=0}^{m-1} g(\Phi_i)$; that is, as $m \to \infty$, $\sqrt{m}\,(\bar g_m - E_\pi g)$ converges in distribution to a normal random variable with mean zero and variance $\sigma_g^2$. (If there is no CLT, then we simply write $\sigma_g^2 = \infty$.) Suppose we have two Harris ergodic Mtfs $P$ and $Q$ that have $\pi$ as an invariant probability measure. Denote $\sigma_g^2$ for the two Mtfs by $\sigma_g^2(P)$ and $\sigma_g^2(Q)$. We say $P$ is at least as good as $Q$ in the efficiency ordering, written $P \succeq_E Q$, if $\sigma_g^2(P) \le \sigma_g^2(Q)$ for every $g \in L^2(\pi)$ (Mira and Geyer, 1999).
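For a reversible chain, the asymptotic variance can be written in terms of the spectral measure $E_{g,g}$ of the self-adjoint operator $P$ associated with $g \in L_0^2(\pi)$ (a standard representation; see, e.g., Mira and Geyer, 1999, and the references therein):
\[
\sigma_g^2(P) = \int_{[-1, 1]} \frac{1 + \lambda}{1 - \lambda}\, E_{g,g}(d\lambda) ,
\]
which makes it clear that pushing the spectrum of $P$ toward smaller values reduces the asymptotic variance for every $g$ simultaneously.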

The main result
Let $\mathsf X$ and $\mathsf Y$ be separable metric spaces equipped with their Borel $\sigma$-algebras, and let $\mu_X$ and $\mu_Y$ be $\sigma$-finite measures on $\mathsf X$ and $\mathsf Y$, respectively. We will refer to such a space $\mathsf Y$ as a sub-Cauchy space if there exists a complete separable metric space $\bar{\mathsf Y}$ such that $\mathsf Y$ is a Borel subset of $\bar{\mathsf Y}$. We assume that $\mathsf Y$ is sub-Cauchy. This is a weak assumption; for example, with the Euclidean metric, every Borel subset of $\mathbb R^n$ is sub-Cauchy. Let $f(x, y)$ and $\tilde f(x, y)$ be joint densities on $\mathsf X \times \mathsf Y$ with respect to $\mu_X \times \mu_Y$, each having $x$-marginal density $f_X$. In this context, the Mtd of the DA chain based on the joint density $f(x, y)$ is
\[
k(x' \mid x) = \int_{\mathsf Y} f_{X|Y}(x' \mid y)\, f_{Y|X}(y \mid x)\, \mu_Y(dy) ,
\]
where $f_{X|Y}$ and $f_{Y|X}$ are the conditional densities associated with $f(x, y)$. Analogously, let $\tilde k$ be the Mtd of the DA chain for $f_X$ based on the joint density $\tilde f(x, y)$. As usual, let $f_Y$ denote the $y$-marginal density of $f(x, y)$, and let $\tilde f_Y$ and $\tilde f_{X|Y}$ be the marginal and conditional densities defined by $\tilde f(x, y)$. The following result allows us to compare the two DA chains.

Theorem 2.1. Suppose there exists a Mtf $R(y, dy')$ on $\mathsf Y$ such that
1. $\tilde f_{X|Y}(x \mid y) = \int_{\mathsf Y} f_{X|Y}(x \mid y')\, R(y, dy')$ for all $x \in \mathsf X$ and $y \in \mathsf Y$, and
2. $\int_{\mathsf Y} R(y, A)\, \tilde f_Y(y)\, \mu_Y(dy) = \int_A f_Y(y')\, \mu_Y(dy')$ for all measurable $A \subseteq \mathsf Y$.
If the Markov chains driven by $k$ and $\tilde k$ are Harris ergodic, then $\tilde k$ admits the sandwich representation (1.1), and consequently $\tilde k \succeq_E k$ and $\|\tilde K\| \le \gamma^2 \|K\|$, where $K$ and $\tilde K$ are the operators on $L_0^2(f_X)$ defined by $k$ and $\tilde k$, and $\gamma$ is the maximal correlation of the random pair $(Y, Y')$ whose joint distribution is $R(y, dy')\, \tilde f_Y(y)\, \mu_Y(dy)$.
Proof. Our assumptions about the space $\mathsf Y$ imply that every probability measure on $\mathsf Y$ is tight (see Parthasarathy, 1967, Theorem 3.2). It then follows from Theorem 6 of Faden (1985) and Theorem 2.4.1 of Ramachandran (1979) that there exists a Mtf $R^*$ on $\mathsf Y$ such that, for all $y, y' \in \mathsf Y$,
\[
R^*(y', dy)\, f_Y(y')\, \mu_Y(dy') = R(y, dy')\, \tilde f_Y(y)\, \mu_Y(dy) \tag{2.1}
\]
as measures on $\mathsf Y \times \mathsf Y$. We now show that $\tilde k$ can be written as
\[
\tilde k(x' \mid x) = \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, Q(y, dy')\, f_{Y|X}(y \mid x)\, \mu_Y(dy) ,
\qquad \text{where} \qquad
Q(y, dy') = \int_{w \in \mathsf Y} R(w, dy')\, R^*(y, dw) . \tag{2.2}
\]
Indeed,
\[
\begin{aligned}
\tilde k(x' \mid x)
&= \int_{\mathsf Y} \tilde f_{X|Y}(x' \mid w)\, \tilde f_{Y|X}(w \mid x)\, \mu_Y(dw) \\
&= \frac{1}{f_X(x)} \int_{\mathsf Y} \tilde f_{X|Y}(x' \mid w)\, \tilde f_{X|Y}(x \mid w)\, \tilde f_Y(w)\, \mu_Y(dw) \\
&= \frac{1}{f_X(x)} \int_{\mathsf Y} \left[ \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, R(w, dy') \right] \left[ \int_{\mathsf Y} f_{X|Y}(x \mid y)\, R(w, dy) \right] \tilde f_Y(w)\, \mu_Y(dw) \\
&= \frac{1}{f_X(x)} \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y') \left[ \int_{w \in \mathsf Y} R(w, dy')\, R(w, dy)\, \tilde f_Y(w)\, \mu_Y(dw) \right] f_{X|Y}(x \mid y) \\
&= \frac{1}{f_X(x)} \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y') \left[ \int_{w \in \mathsf Y} R(w, dy')\, R^*(y, dw) \right] f_{X|Y}(x \mid y)\, f_Y(y)\, \mu_Y(dy) \\
&= \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, Q(y, dy')\, f_{Y|X}(y \mid x)\, \mu_Y(dy) ,
\end{aligned}
\]
where the third equality follows from the first condition of the theorem, and the penultimate equality follows from (2.1). Here, the Mtf $Q(y, dy')$ is reversible with respect to $f_Y$ since (2.2) indicates that the Markov chain defined by $Q(y, dy')$ is a DA chain for $f_Y$ based on the joint distribution (2.1). An application of Hobert and Marchev's (2008) Corollary 1 implies $\tilde k \succeq_E k$. Moreover, it follows from Hobert and Román (2011) that $\|\tilde K\| \le \|Q\|\, \|K\|$, where $Q$ is the operator on $L_0^2(f_Y)$ corresponding to $Q(y, dy')$. By Liu et al.'s (1994) Theorem 3.2, $\|Q\| = \gamma^2$. The proof is complete.
Remark 2.1. We note that, under the conditions of Theorem 2.1, $\tilde k$ is actually the Mtd of a GIS (global interweaving strategy) algorithm based on $f$, $\tilde f$ and $R^*$ (see Yu and Meng, 2011; Hobert and Román, 2011). Indeed, two applications of (2.1) show that one iteration of this GIS algorithm has Mtd $\tilde k$. This GIS representation suggests that an inappropriately designed GIS algorithm may simply reduce to one of the original DA algorithms.
Remark 2.2. When comparing DA algorithms, computing time and simulation effort should be taken into account in addition to efficiency and speed of convergence, but we do not pursue that issue here.

A toy example
Consider the following well-known toy example involving a simple two-level normal hierarchical linear model (see, e.g., Liu and Wu, 1999; Yu and Meng, 2011):
\[
Y \mid \theta, Z \sim \mathrm N(\theta + Z, 1) , \qquad Z \mid \theta \sim \mathrm N(0, D) . \tag{2.3}
\]
Here, $\theta$ is the parameter, $Y$ is the observed data, $Z$ is the latent variable, and $D$ is a known positive constant. We assume that $D \ne 1$. With a flat improper prior on $\theta$, the posterior is $\theta \mid Y \sim \mathrm N(Y, 1 + D)$, which is our target density. Treating $Z$ as the latent variable, the standard DA algorithm is simulated by drawing alternately from the following two conditional distributions:
\[
Z \mid \theta, Y \sim \mathrm N\!\left( \frac{D (Y - \theta)}{1 + D}, \frac{D}{1 + D} \right)
\qquad \text{and} \qquad
\theta \mid Z, Y \sim \mathrm N(Y - Z, 1) .
\]
On the other hand, if we let $\tilde Z = Z + \theta$, and treat $\tilde Z$ as the latent data, then the model can be rewritten as
\[
Y \mid \theta, \tilde Z \sim \mathrm N(\tilde Z, 1) , \qquad \tilde Z \mid \theta \sim \mathrm N(\theta, D) .
\]
This is called the centered parametrization (CP), whereas model (2.3) is called the non-centered parametrization (NCP). If we put the same flat prior on $\theta$, then this model leads to a different DA algorithm, which iterates between drawing $\tilde Z \mid (\theta, Y)$ and drawing $\theta \mid (\tilde Z, Y)$:
\[
\tilde Z \mid \theta, Y \sim \mathrm N\!\left( \frac{D Y + \theta}{1 + D}, \frac{D}{1 + D} \right)
\qquad \text{and} \qquad
\theta \mid \tilde Z, Y \sim \mathrm N(\tilde Z, D) .
\]
Though both DA algorithms have the same target distribution, they have completely different convergence behavior. Let $k$ and $\tilde k$ denote the Mtds of the NCP and CP DA chains, and let $K$ and $\tilde K$ denote the operators associated with $k$ and $\tilde k$. It is known that $\|K\| = \frac{D}{1+D}$ and $\|\tilde K\| = \frac{1}{1+D}$. Therefore, when $D > 1$, the CP DA algorithm dominates the NCP DA algorithm in the operator norm sense. On the other hand, when $D < 1$, the operator norm ordering is reversed. Using Theorem 2.1, we can show that similar ordering results hold in terms of efficiency. Here is the result.

Proposition 2.1. If $D > 1$, then $\tilde k \succeq_E k$, and if $D < 1$, then $k \succeq_E \tilde k$.
Proof. Let $f(\theta, z \mid y)$ denote the density of $(\theta, Z) \mid Y$ in the above NCP model, and let $f_{\theta|Z,Y}$ and $f_{Z|Y}$ be the conditional and marginal densities defined by $f(\theta, z \mid y)$. Similarly, denote the density of $(\theta, \tilde Z) \mid Y$ in the CP model as $\tilde f(\theta, \tilde z \mid y)$, and let $\tilde f_{\theta|\tilde Z,Y}$ and $\tilde f_{\tilde Z|Y}$ be the associated conditional and marginal densities. It is easy to see that $f_{Z|Y}(\cdot \mid y) \sim \mathrm N(0, D)$ and $\tilde f_{\tilde Z|Y}(\cdot \mid y) \sim \mathrm N(y, 1)$. We begin with the case where $D > 1$. It suffices to establish the two conditions of Theorem 2.1. Let $r(w \mid \tilde z, y)$ denote the $\mathrm N(y - \tilde z, D - 1)$ density evaluated at $w$. It is easy to show that
\[
\tilde f_{\theta|\tilde Z,Y}(\theta \mid \tilde z, y) = \int_{\mathbb R} f_{\theta|Z,Y}(\theta \mid z, y)\, r(z \mid \tilde z, y)\, dz
\qquad \text{and} \qquad
f_{Z|Y}(z \mid y) = \int_{\mathbb R} r(z \mid \tilde z, y)\, \tilde f_{\tilde Z|Y}(\tilde z \mid y)\, d\tilde z .
\]
An application of Theorem 2.1 yields the result.
We now prove the case where $D < 1$; again, we establish the two conditions of Theorem 2.1, this time with the roles of the two parametrizations reversed. Let $\tilde r(w \mid z, y)$ denote the $\mathrm N(y - z, 1 - D)$ density evaluated at $w$. Then, we have
\[
f_{\theta|Z,Y}(\theta \mid z, y) = \int_{\mathbb R} \tilde f_{\theta|\tilde Z,Y}(\theta \mid \tilde z, y)\, \tilde r(\tilde z \mid z, y)\, d\tilde z
\qquad \text{and} \qquad
\tilde f_{\tilde Z|Y}(\tilde z \mid y) = \int_{\mathbb R} \tilde r(\tilde z \mid z, y)\, f_{Z|Y}(z \mid y)\, dz .
\]
Another application of Theorem 2.1 completes the proof.
It is interesting to compare the exact value of $\|\tilde K\|$ with the upper bound $\gamma^2 \|K\|$ from Theorem 2.1. Consider the case where $D > 1$. We know from the proof of Proposition 2.1 that $\gamma$ is the maximal correlation of a random pair $(\tilde Z, Z)$ with joint density $r(z \mid \tilde z, y)\, \tilde f_{\tilde Z|Y}(\tilde z \mid y)$. This joint density is bivariate normal, and it follows from Gebelein (1941) and Lancaster (1957) that the maximal correlation of a bivariate normal pair equals the absolute value of its ordinary correlation coefficient, which here is $1/\sqrt D$. Therefore,
\[
\gamma^2 \|K\| = \frac{1}{D} \cdot \frac{D}{1+D} = \frac{1}{1+D} = \|\tilde K\| .
\]
This example shows that the bound in Theorem 2.1 is tight.
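To make the comparison concrete, here is a small simulation sketch (ours, not from the paper) of the NCP and CP DA chains. For these Gaussian autoregressive chains, the lag-one autocorrelation of $\theta$ equals the operator norm, so the estimates should be close to $D/(1+D)$ and $1/(1+D)$, respectively.

```python
import numpy as np

def ncp_da_step(theta, y, D, rng):
    """Non-centered DA update: draw Z | theta, Y, then theta | Z, Y."""
    z = rng.normal(D * (y - theta) / (1 + D), np.sqrt(D / (1 + D)))
    return rng.normal(y - z, 1.0)

def cp_da_step(theta, y, D, rng):
    """Centered DA update: draw Ztilde | theta, Y, then theta | Ztilde, Y."""
    zt = rng.normal((D * y + theta) / (1 + D), np.sqrt(D / (1 + D)))
    return rng.normal(zt, np.sqrt(D))

def lag1_autocorr(step, y, D, n_iter=200_000, seed=1):
    """Estimate the lag-one autocorrelation of theta under stationarity."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(y, np.sqrt(1 + D))   # start from the target N(y, 1+D)
    draws = np.empty(n_iter)
    for m in range(n_iter):
        theta = step(theta, y, D, rng)
        draws[m] = theta
    x = draws - draws.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

if __name__ == "__main__":
    y, D = 1.0, 4.0   # D > 1: the CP chain should mix better
    print("NCP lag-1 autocorr:", lag1_autocorr(ncp_da_step, y, D))  # approx D/(1+D) = 0.8
    print("CP  lag-1 autocorr:", lag1_autocorr(cp_da_step, y, D))   # approx 1/(1+D) = 0.2
```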

Extending Hobert and Marchev's results on Haar PX-DA
In this section, we use Theorem 2.1 to show that H&M's extensions of L&W's results continue to hold when ν does depend on x. We also illustrate our results using Brownlee's (1965) stack loss data.

Hobert and Marchev's group structure
Let $\mathsf X$, $\mathsf Y$, $\mu_X$ and $\mu_Y$ be as in Section 2.2, and let $G$ be a group with identity element $e$. Allow the group $G$ to act topologically on the left of $\mathsf Y$; that is, there is a continuous function $F : G \times \mathsf Y \to \mathsf Y$ such that $F(e, y) = y$ for all $y \in \mathsf Y$ and $F(g_1 g_2, y) = F(g_1, F(g_2, y))$ for all $g_1, g_2 \in G$ and all $y \in \mathsf Y$. As is typically done, we will denote the value of $F$ at $(g, y)$ by $gy$, so, in this notation, the two conditions are written $ey = y$ and $(g_1 g_2) y = g_1 (g_2 y)$.
Assume that there exists a function $j : G \times \mathsf Y \to (0, \infty)$ such that:
1. $j(g^{-1}, y) = \frac{1}{j(g, y)}$ for all $g \in G$ and all $y \in \mathsf Y$,
2. $j(g_1 g_2, y) = j(g_1, g_2 y)\, j(g_2, y)$ for all $g_1, g_2 \in G$ and all $y \in \mathsf Y$, and
3. for all $g \in G$ and all integrable functions $h : \mathsf Y \to \mathbb R$,
\[
\int_{\mathsf Y} h(g y)\, j(g, y)\, \mu_Y(dy) = \int_{\mathsf Y} h(y)\, \mu_Y(dy) .
\]
As in L&W, suppose that $\mathsf Y \subseteq \mathbb R^n$, $\mu_Y$ is Lebesgue measure on $\mathsf Y$, and, for each fixed $g \in G$, $F(g, \cdot) : \mathsf Y \to \mathsf Y$ is differentiable. Then, if we take $j(g, y)$ to be the Jacobian of the transformation $y \mapsto F(g, y)$, the three properties listed above can be easily verified from calculus; a concrete instance is written out below.
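For example (this is the instance used in Section 3.4, spelled out here for convenience), take $G = \mathbb R_+$ acting on $\mathsf Y = \mathbb R_+^n$ by scalar multiplication, $F(g, y) = g y$. The Jacobian of $y \mapsto g y$ is $j(g, y) = g^n$, and
\[
j(g^{-1}, y) = g^{-n} = \frac{1}{j(g, y)} , \qquad
j(g_1 g_2, y) = (g_1 g_2)^n = j(g_1, g_2 y)\, j(g_2, y) , \qquad
\int_{\mathbb R_+^n} h(g y)\, g^n\, dy = \int_{\mathbb R_+^n} h(y)\, dy ,
\]
the last equality being the usual change of variables $y' = g y$.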

Constructing general PX-DA and Haar PX-DA algorithms
As before, let $f(x, y)$ be a probability density on $\mathsf X \times \mathsf Y$ with respect to $\mu_X \times \mu_Y$ whose $x$-marginal density is $f_X$. We construct a general PX-DA algorithm for $f_X$. The idea is to use the group structure described above to build a joint density that is a general version of the density $\tilde f^{(\nu)}(x, y)$ of L&W's PX-DA described in the Introduction. This leads to a new DA algorithm. To be specific, define a probability density on $\mathsf X \times \mathsf Y$ as follows:
\[
\tilde f^{(\nu)}(x, y) = \int_G f(x, g y)\, j(g, y)\, \nu(x, dg) , \tag{3.1}
\]
where $\nu(x, \cdot)$ is a conditional probability measure on $G$ given $x \in \mathsf X$. It is easy to see that the $x$-marginal density of $\tilde f^{(\nu)}(x, y)$ is $f_X$, and the $y$-marginal density is
\[
m_\nu(y) = \int_{\mathsf X} \int_G f(x, g y)\, j(g, y)\, \nu(x, dg)\, \mu_X(dx) ,
\]
where we assume $m_\nu(y)$ is positive and finite for all $y \in \mathsf Y$. (As in H&M, it is possible to handle cases where $m_\nu(y) = \infty$ on a set of $\mathsf Y$ that has $\mu_Y$-measure zero, but we will not go into that here.) The associated conditional densities are
\[
\tilde f^{(\nu)}_{X|Y}(x \mid y) = \frac{1}{m_\nu(y)} \int_G f(x, g y)\, j(g, y)\, \nu(x, dg)
\qquad \text{and} \qquad
\tilde f^{(\nu)}_{Y|X}(y \mid x) = \frac{1}{f_X(x)} \int_G f(x, g y)\, j(g, y)\, \nu(x, dg) .
\]
Our general PX-DA algorithm is the DA algorithm with Mtd given by
\[
k_\nu(x' \mid x) = \int_{\mathsf Y} \tilde f^{(\nu)}_{X|Y}(x' \mid y)\, \tilde f^{(\nu)}_{Y|X}(y \mid x)\, \mu_Y(dy) .
\]
We note that if $\nu(x, \cdot)$ is free of $x$, then we recover H&M's general PX-DA chain.
We now describe H&M's general Haar PX-DA algorithm. H&M use the group structure to construct a sandwich step that behaves like a generalized version of the sandwich step of L&W's Haar PX-DA described in Section 1. Under the assumptions in the previous section, there exists a left-Haar measure, $\varrho_l$, on $G$, which is a nontrivial measure satisfying
\[
\int_G h(\tilde g g)\, \varrho_l(dg) = \int_G h(g)\, \varrho_l(dg) \tag{3.2}
\]
for all $\tilde g \in G$ and all integrable functions $h : G \to \mathbb R$. This measure is unique up to a multiplicative constant. Moreover, there exists a multiplier, $\Delta$, called the (right) modular function of the group, which relates the left-Haar and right-Haar measures, $\varrho_l$ and $\varrho_r$, on $G$ through $\varrho_r(dg) = \Delta(g^{-1})\, \varrho_l(dg)$. (A function $\chi : G \to (0, \infty)$ is called a multiplier if $\chi$ is continuous and $\chi(g_1 g_2) = \chi(g_1)\, \chi(g_2)$ for all $g_1, g_2 \in G$.) Here, the right-Haar measure satisfies the obvious analogue of (3.2). Groups for which $\Delta \equiv 1$, that is, for which the right-Haar and left-Haar measures agree, are called unimodular. We now state two useful formulas from H&M that will be used later. If $\tilde g \in G$ and $h : G \to \mathbb R$ is an integrable function, then
\[
\int_G h(g \tilde g)\, \varrho_l(dg) = \Delta(\tilde g^{-1}) \int_G h(g)\, \varrho_l(dg) \tag{3.3}
\]
and
\[
\int_G h(g^{-1})\, \Delta(g^{-1})\, \varrho_l(dg) = \int_G h(g)\, \varrho_l(dg) . \tag{3.4}
\]
As before, let $f_Y$ be the $y$-marginal density of $f(x, y)$, and assume without loss of generality that
\[
m(y) = \int_G f_Y(g y)\, j(g, y)\, \varrho_l(dg)
\]
is positive and finite for all $y \in \mathsf Y$. A straightforward application of (3.3) shows that, for $y \in \mathsf Y$,
\[
m(g y) = j(g^{-1}, y)\, \Delta(g^{-1})\, m(y) . \tag{3.5}
\]
We now describe a recipe for using the group structure to build a Mtf that is reversible with respect to $f_Y$. Let $R$ be the operator on $L_0^2(f_Y)$ defined by
\[
(R h)(y) = \frac{1}{m(y)} \int_G h(g y)\, f_Y(g y)\, j(g, y)\, \varrho_l(dg) . \tag{3.6}
\]
Then the corresponding Markov chain on $\mathsf Y$ evolves as follows. If the current state is $y$, then the distribution of the next state is that of $g y$, where $g$ is a random element drawn from the density (with respect to $\varrho_l$) $f_Y(g y)\, j(g, y)/m(y)$. We denote its Mtf by $R(y, dy')$. It is shown in H&M's Propositions 3 and 4 that the operator $R$ on $L_0^2(f_Y)$ is self-adjoint (with respect to $f_Y$) and idempotent. The Mtd of H&M's general Haar PX-DA algorithm is
\[
k^*(x' \mid x) = \int_{\mathsf Y} \int_{\mathsf Y} f_{X|Y}(x' \mid y')\, R(y, dy')\, f_{Y|X}(y \mid x)\, \mu_Y(dy) .
\]
Together, H&M's Proposition 1 and Theorem 4 imply that $k^*$ is itself the Mtd of a DA algorithm. Precisely, $k^*$ is the Mtd of the DA algorithm based on the joint density
\[
f^*(x, y) = \frac{f_Y(y)}{m(y)} \int_G f(x, g y)\, j(g, y)\, \varrho_l(dg) .
\]
That is, $k^*$ can be written as
\[
k^*(x' \mid x) = \int_{\mathsf Y} f^*_{X|Y}(x' \mid y)\, f^*_{Y|X}(y \mid x)\, \mu_Y(dy) ,
\]
where $f^*_{X|Y}$ and $f^*_{Y|X}$ are the conditional densities associated with $f^*(x, y)$.

Comparing general PX-DA and Haar PX-DA algorithms
In this section, we establish that $k^*$ is at least as good as $k_\nu$ in the efficiency ordering and in the operator norm sense. In fact, H&M's Theorem 4 implies that their general Haar PX-DA algorithm is at least as good as their general PX-DA algorithm in the efficiency ordering and in the operator norm sense. Since H&M's general PX-DA is a special case of our general PX-DA, our result improves upon theirs. Here is our result.
Proposition 3.1. Let $\nu(x, \cdot)$ be a conditional probability measure on $G$ given $x \in \mathsf X$. Assume that $m_\nu(y)$ and $m(y)$ are positive and finite for all $y \in \mathsf Y$. If the Markov chains driven by $k_\nu$ and $k^*$ are Harris ergodic, then $k^* \succeq_E k_\nu$ and $\|K^*\| \le \|K_\nu\|$, where $K_\nu$ and $K^*$ are the operators on $L_0^2(f_X)$ defined by $k_\nu$ and $k^*$.
Proof. Recall that $k_\nu$ is the DA Mtd based on the joint density
\[
\tilde f^{(\nu)}(x, y) = \int_G f(x, g y)\, j(g, y)\, \nu(x, dg) ,
\]
and that the $y$-marginal density of $\tilde f^{(\nu)}(x, y)$ is $m_\nu(y)$. Let $\tilde R(y, dy')$ be the Mtf on $\mathsf Y$ with invariant density $m_\nu(y)$ that is constructed according to the recipe in (3.6); that is, $\tilde R$ is what we would have ended up with had we used $m_\nu(y)$ in place of $f_Y(y)$. If we substitute $m_\nu(y)$ for $f_Y(y)$ in the definition of $m(y)$, we have
\[
\int_G m_\nu(g y)\, j(g, y)\, \varrho_l(dg)
= \int_{\mathsf X} \int_G \int_G f(x, u g y)\, j(u g, y)\, \varrho_l(dg)\, \nu(x, du)\, \mu_X(dx)
= \int_G f_Y(g y)\, j(g, y)\, \varrho_l(dg)
= m(y) ,
\]
where the first equality uses property 2 of $j$ and Fubini's theorem, and the second uses the left invariance of $\varrho_l$. Hence, the function $m(y)$ is the same whether we use $f_Y$ or $m_\nu$. Recall also that $k^*$ is the DA Mtd based on the joint density
\[
f^*(x, y) = \frac{f_Y(y)}{m(y)} \int_G f(x, g y)\, j(g, y)\, \varrho_l(dg) ,
\]
whose $y$-marginal density is $f_Y(y)$.
We now establish the two conditions of Theorem 2.1, with $\tilde f^{(\nu)}$ and $f^*$ playing the roles of $f$ and $\tilde f$ there, and with $\tilde R$ playing the role of the Mtf $R$ in the theorem. We will first show that
\[
f^*_{X|Y}(x \mid y) = \int_{\mathsf Y} \tilde f^{(\nu)}_{X|Y}(x \mid y')\, \tilde R(y, dy') .
\]
Indeed, using the definition of $\tilde R$ and the calculation above, we have
\[
\int_{\mathsf Y} \tilde f^{(\nu)}_{X|Y}(x \mid y')\, \tilde R(y, dy')
= \int_G \left[ \frac{1}{m_\nu(g y)} \int_G f(x, u g y)\, j(u, g y)\, \nu(x, du) \right] \frac{m_\nu(g y)\, j(g, y)}{m(y)}\, \varrho_l(dg)
= \frac{1}{m(y)} \int_G f(x, g y)\, j(g, y)\, \varrho_l(dg)
= f^*_{X|Y}(x \mid y) ,
\]
where the second equality follows from property 2 of $j$ and the left invariance of $\varrho_l$. We proceed by demonstrating that, for $y, y' \in \mathsf Y$,
\[
\tilde R(y, dy')\, f_Y(y)\, \mu_Y(dy) = R(y', dy)\, m_\nu(y')\, \mu_Y(dy') \tag{3.7}
\]
as measures on $\mathsf Y \times \mathsf Y$; in particular, (3.7) implies the second condition of Theorem 2.1. It suffices to show that, for bounded functions $h_1, h_2$ on $\mathsf Y$,
\[
\begin{aligned}
\int_{\mathsf Y} \int_{\mathsf Y} h_1(y)\, h_2(y')\, \tilde R(y, dy')\, f_Y(y)\, \mu_Y(dy)
&= \int_{\mathsf Y} \int_G h_1(y)\, h_2(g y)\, \frac{m_\nu(g y)\, j(g, y)}{m(y)}\, \varrho_l(dg)\, f_Y(y)\, \mu_Y(dy) \\
&= \int_G \int_{\mathsf Y} h_1(g^{-1} y')\, h_2(y')\, \frac{m_\nu(y')\, f_Y(g^{-1} y')}{m(g^{-1} y')}\, \mu_Y(dy')\, \varrho_l(dg) \\
&= \int_{\mathsf Y} \int_G h_1(g^{-1} y')\, h_2(y')\, \frac{m_\nu(y')}{m(y')}\, f_Y(g^{-1} y')\, j(g^{-1}, y')\, \Delta(g^{-1})\, \varrho_l(dg)\, \mu_Y(dy') \\
&= \int_{\mathsf Y} \int_G h_1(g y')\, h_2(y')\, \frac{f_Y(g y')\, j(g, y')}{m(y')}\, \varrho_l(dg)\, m_\nu(y')\, \mu_Y(dy') \\
&= \int_{\mathsf Y} \int_{\mathsf Y} h_1(y)\, h_2(y')\, R(y', dy)\, m_\nu(y')\, \mu_Y(dy') ,
\end{aligned}
\]
where the second equality follows from property 3 of $j$, the third equality follows from (3.5), and the penultimate equality is due to (3.4). An application of Theorem 2.1 implies $k^* \succeq_E k_\nu$ and $\|K^*\| \le \gamma^2 \|K_\nu\|$, where $\gamma$ is the maximal correlation of the pair $(Y, Y')$ whose joint distribution is (3.7). We now show that $\gamma = 1$. As pointed out in the proof of Theorem 2.1, $\gamma^2$ is the norm of the operator $Q$ on $L_0^2(m_\nu)$ associated with the Mtf given by
\[
Q(y, dy') = \int_{y'' \in \mathsf Y} \tilde R(y'', dy')\, R(y, dy'') .
\]
Remark 3.1. Choi (2014) contains an alternative but more complicated proof of Proposition 3.1.

An illustration using Brownlee's stack loss data
We end this section with an illustration of the efficiency part of Proposition 3.1 using a PX-DA algorithm and Choi and Hobert's (2013) Haar PX-DA algorithm for Bayesian linear regression with Laplace errors. To be specific, we will develop a PX-DA algorithm using the joint density upon which Choi and Hobert's (2013) DA algorithm is based and the group structure under which the Haar PX-DA algorithm is derived. We then compare the efficiency of the PX-DA and Haar PX-DA algorithms on Brownlee's (1965) stack loss dataset. The Bayesian linear model with Laplace errors is formulated as follows. Let $\{Y_i\}_{i=1}^n$ be independent random variables such that
\[
Y_i = x_i^T \beta + \sigma\, \varepsilon_i , \qquad i = 1, \dots, n ,
\]
where $x_i \in \mathbb R^p$ is a vector of known covariates associated with $Y_i$, $\beta \in \mathbb R^p$ is a vector of unknown regression coefficients, and $\sigma \in \mathbb R_+ := (0, \infty)$ is an unknown scale parameter. The errors, $\varepsilon_1, \dots, \varepsilon_n$, are assumed to be iid from the Laplace density with scale equal to two, so the common density is $d(\varepsilon) := e^{-|\varepsilon|/2}/4$. The standard default prior for $(\beta, \sigma^2)$ is an improper prior that takes the form $\pi(\beta, \sigma^2) = (\sigma^2)^{-1} I_{\mathbb R_+}(\sigma^2)$. For inferential purposes, we would like to sample from the posterior density of $(\beta, \sigma^2)$. Let $y = (y_1, \dots, y_n)^T$ be the vector of observed responses. The posterior density of $(\beta, \sigma^2)$ is given by
\[
\pi(\beta, \sigma^2 \mid y) = \frac{1}{c(y)}\, (\sigma^2)^{-1} I_{\mathbb R_+}(\sigma^2) \prod_{i=1}^n \frac{1}{\sigma}\, d\!\left( \frac{y_i - x_i^T \beta}{\sigma} \right) , \tag{3.8}
\]
where $c(y)$ is the normalizing constant. As usual, let $X$ denote the $n \times p$ matrix whose $i$th row is equal to $x_i^T$, and let $C(X)$ denote the column space of $X$. We assume throughout that $X$ has full rank and $y \notin C(X)$, since these are necessary and sufficient conditions for $c(y) < \infty$ (Choi and Hobert, 2013, Proposition 1). A DA algorithm and the Haar PX-DA algorithm for exploring this intractable posterior are described in Choi and Hobert (2013) (hereafter, C&H). The DA algorithm is based on introducing a latent variable $Z = \{Z_i\}_{i=1}^n$, so the joint posterior density of $\beta$, $\sigma^2$ and $Z$, say $f(\beta, \sigma^2, z \mid y)$, is given by
\[
f(\beta, \sigma^2, z \mid y) = \frac{1}{c(y)}\, (\sigma^2)^{-1} I_{\mathbb R_+}(\sigma^2) \prod_{i=1}^n \left[ \frac{z_i^{1/2}}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{z_i (y_i - x_i^T \beta)^2}{2\sigma^2} \right) \frac{1}{8}\, z_i^{-2} \exp\!\left( -\frac{1}{8 z_i} \right) \right] . \tag{3.9}
\]
A straightforward calculation using the well-known normal/inverse gamma representation of the Laplace density shows that $\int_{\mathbb R_+^n} f(\beta, \sigma^2, z \mid y)\, dz = \pi(\beta, \sigma^2 \mid y)$, so the $(\beta, \sigma^2)$-marginal of (3.9) is the target posterior. C&H's DA algorithm simply iterates between draws from the associated conditional densities, $f_{\beta, \sigma^2 | Z, Y}(\beta, \sigma^2 \mid z, y)$ and $f_{Z | \beta, \sigma^2, Y}(z \mid \beta, \sigma^2, y)$, in the usual way. C&H show that the two conditionals can be specified as follows.
• The conditional distribution of $(\beta, \sigma^2)$ given $(z, y)$ is described by
\[
\sigma^2 \mid z, y \sim \mathrm{IG}\!\left( \frac{n - p}{2},\; \frac{(y - X\theta)^T D^{-1} (y - X\theta)}{2} \right)
\qquad \text{and} \qquad
\beta \mid \sigma^2, z, y \sim \mathrm N_p\!\left( \theta,\; \sigma^2 (X^T D^{-1} X)^{-1} \right) ,
\]
where $\theta = (X^T D^{-1} X)^{-1} X^T D^{-1} y$ and $D$ is a diagonal matrix whose $i$th diagonal element is $z_i^{-1}$. Also, when we write $W \sim \mathrm{IG}(\alpha, \gamma)$, we mean that $W$ has density proportional to $w^{-\alpha - 1} e^{-\gamma / w}\, I_{\mathbb R_+}(w)$, where $\alpha$ and $\gamma$ are strictly positive.
• $Z_1, \dots, Z_n$ are conditionally independent given $(\beta, \sigma^2, y)$, with
\[
Z_i \mid \beta, \sigma^2, y \sim \mathrm{Inv\,Gau}\!\left( \frac{\sigma}{2\, |y_i - x_i^T \beta|},\; \frac{1}{4} \right) , \qquad i = 1, \dots, n .
\]
Here, when we write $W \sim \mathrm{Inv\,Gau}(\mu, \lambda)$, we mean that $W$ has density given by
\[
\sqrt{\frac{\lambda}{2\pi w^3}} \exp\!\left\{ -\frac{\lambda (w - \mu)^2}{2 \mu^2 w} \right\} I_{\mathbb R_+}(w) ,
\]
where $\mu$ and $\lambda$ are strictly positive.
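As a concrete sketch, one DA iteration can be coded directly from the two bullets above (this is our illustrative transcription, not C&H's code; the function and variable names are ours).

```python
import numpy as np

def ch_da_step(beta, sigma2, X, y, rng):
    """One iteration of the DA algorithm, using the conditional
    distributions displayed in the two bullets above."""
    n, p = X.shape
    # Draw Z | beta, sigma^2, y: independent inverse Gaussians.
    # numpy's wald(mean, scale) is the inverse Gaussian distribution.
    resid = np.abs(y - X @ beta)
    z = rng.wald(np.sqrt(sigma2) / (2.0 * resid), 0.25)
    # Draw (beta, sigma^2) | z, y: inverse gamma for sigma^2, then normal for beta.
    W = X.T * z                              # equals X^T D^{-1}, since D^{-1} = diag(z)
    XtDinvX = W @ X
    theta = np.linalg.solve(XtDinvX, W @ y)
    ssr = (y - X @ theta) @ (z * (y - X @ theta))
    sigma2 = (ssr / 2.0) / rng.gamma((n - p) / 2.0)   # IG(a, b) drawn as b / Gamma(a)
    beta = rng.multivariate_normal(theta, sigma2 * np.linalg.inv(XtDinvX))
    return beta, sigma2, z
```

The sandwich algorithms discussed next differ only in adding a single draw of a positive scalar $g$ between these two steps and rescaling $z$ accordingly.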
C&H's Haar PX-DA algorithm is derived under the group $G = \mathbb R_+$, which acts on $\mathbb R_+^n$ (the space of the latent variable $Z$) through scalar multiplication. In particular, the sandwich step is formed using the recipe described in (3.6) with $j(g, z) = g^n$ and $\varrho_l(dg) = dg/g$, and with $f_Y$ equal to the $z$-marginal density of (3.9). It is shown that the sandwich step corresponds to the move $z \to z' = g z$ with $g$ drawn from the $\mathrm{IG}\big(n, \sum_{i=1}^n \frac{1}{8 z_i}\big)$ distribution. We now develop a PX-DA algorithm using (3.9) and the group structure on $G$. Consider a conditional probability measure, $I(\sigma^2, dg)$, on $G$ given $(\beta, \sigma^2)$ that depends on $\sigma^2$ but not on $\beta$, and that places a point mass at $\sigma^2$. (Note that $\sigma^2$ lives on $\mathbb R_+$, so $I(\sigma^2, dg)$ is legitimate.) As described in (3.1), define a probability density such that
\[
\tilde f^{(I)}(\beta, \sigma^2, z \mid y) = \int_G f(\beta, \sigma^2, g z \mid y)\, g^n\, I(\sigma^2, dg) = f(\beta, \sigma^2, \sigma^2 z \mid y)\, (\sigma^2)^n .
\]
A straightforward manipulation reveals that $\tilde f^{(I)}(\beta, \sigma^2, z \mid y)$ is equal to
\[
\frac{1}{c(y)}\, (\sigma^2)^{-n-1} I_{\mathbb R_+}(\sigma^2) \prod_{i=1}^n \left[ \frac{z_i^{1/2}}{\sqrt{2\pi}} \exp\!\left( -\frac{z_i (y_i - x_i^T \beta)^2}{2} \right) \frac{1}{8}\, z_i^{-2} \exp\!\left( -\frac{1}{8 \sigma^2 z_i} \right) \right] .
\]
By construction, the $(\beta, \sigma^2)$-marginal density of $\tilde f^{(I)}(\beta, \sigma^2, z \mid y)$ is the target (3.8), and the DA algorithm based on the new joint density $\tilde f^{(I)}(\beta, \sigma^2, z \mid y)$ is a PX-DA algorithm. The associated conditional densities, $\tilde f^{(I)}_{\beta, \sigma^2 | Z, Y}(\beta, \sigma^2 \mid z, y)$ and $\tilde f^{(I)}_{Z | \beta, \sigma^2, Y}(z \mid \beta, \sigma^2, y)$, needed to simulate the PX-DA algorithm can be derived using arguments similar to those in Section 2 of C&H, along with the conditional independence of $\beta$ and $\sigma^2$ given $(z, y)$, as follows.
• $\beta$ and $\sigma^2$ are conditionally independent given $(z, y)$, with
\[
\beta \mid z, y \sim \mathrm N_p\!\left( \theta,\; (X^T D^{-1} X)^{-1} \right)
\qquad \text{and} \qquad
\sigma^2 \mid z, y \sim \mathrm{IG}\!\left( n,\; \sum_{i=1}^n \frac{1}{8 z_i} \right) .
\]
• $Z_1, \dots, Z_n$ are conditionally independent given $(\beta, \sigma^2, y)$, with
\[
Z_i \mid \beta, \sigma^2, y \sim \mathrm{Inv\,Gau}\!\left( \frac{1}{2 \sigma\, |y_i - x_i^T \beta|},\; \frac{1}{4 \sigma^2} \right) .
\]
We implement the PX-DA algorithm and C&H's Haar PX-DA algorithm on Brownlee's (1965) stack loss data. The data are from the operation of a plant for the oxidation of ammonia to nitric acid, measured on 21 consecutive days. The dataset consists of the pairs $\{(x_{i1}, y_i)\}_{i=1}^{21}$, where $x_{i1}$ is a covariate indicating the air flow to the plant, and $y_i$ is the percentage of ammonia lost (times 10). We regress $y_i$ on $x_{i1}$, assuming that the $\varepsilon_i$'s are iid with the common Laplace density $d(\varepsilon)$, and we use the standard default prior $\pi(\beta, \sigma^2)$. It can be easily verified (using C&H's necessary and sufficient conditions for posterior propriety described above) that the posterior density is proper. For all Markov chains, we choose the initial value of $\sigma^2$ to be 1 and the initial value of $\beta$ to be the maximum likelihood estimate under the standard linear model (with Gaussian errors). We run the PX-DA and C&H's Haar PX-DA algorithms for a burn-in period of $4 \times 10^5$ iterations. The next $10^5$ iterations are used to estimate the posterior expectation with each of the two Markov chains, and we adopt the batch means method to estimate the asymptotic variances. (See Jones et al. (2006) for the precise formula and theoretical properties of the batch means estimator; a sketch of the estimator is given below.) For this particular example, we are interested in estimating the posterior expectation of $h(\beta, \sigma^2) = \sigma^2$, as Yu and Moyeed (2001) fit similar Bayesian regression models with fixed scale parameter ($\sigma^2 = 1$). Also, arguments similar to the proof of C&H's Proposition 1 imply that, in the current setting, $|h(\beta, \sigma^2)|^3$ is integrable with respect to the posterior. Table 1 provides the simulation results. Note that the estimated asymptotic standard error for the PX-DA algorithm is 2.09 times as large as the corresponding value for C&H's Haar PX-DA algorithm. This suggests that, in this particular example, the PX-DA algorithm requires about $2.09^2 \approx 4.35$ times as many iterations as C&H's Haar PX-DA algorithm to achieve the same level of precision.
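For completeness, the non-overlapping batch means estimator can be sketched as follows (our illustrative code; see Jones et al. (2006) for the precise definition and theory).

```python
import numpy as np

def batch_means_se(draws, n_batches=100):
    """Batch means estimate of the asymptotic standard error of the MCMC
    average of `draws` (a 1-D array of post-burn-in values of h)."""
    n = len(draws) - (len(draws) % n_batches)          # drop a few draws if needed
    batch_size = n // n_batches
    batch_avgs = draws[:n].reshape(n_batches, batch_size).mean(axis=1)
    sigma2_hat = batch_size * batch_avgs.var(ddof=1)   # estimates the CLT variance
    return np.sqrt(sigma2_hat / n)

# Usage: compute batch_means_se(sigma2_draws) for each chain; the ratio of the
# two standard errors is the quantity reported in Table 1.
```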

Improving the collapsing theorem
Let $\mathsf X$, $\mathsf Y$, $\mu_X$ and $\mu_Y$ be as in Section 2.2. Let $\mathsf Z$ be a separable metric space equipped with its Borel $\sigma$-algebra and a $\sigma$-finite measure $\mu_Z$, and assume that $\mathsf Z$ is sub-Cauchy. Assume that there exists a joint density $f(x, y, z)$ on $\mathsf X \times \mathsf Y \times \mathsf Z$ (with respect to $\mu_X \times \mu_Y \times \mu_Z$) such that $\int_{\mathsf Z} f(x, y, z)\, \mu_Z(dz) = f(x, y)$. Let $k$ denote the Mtd of the grouping DA chain, which treats $(Y, Z)$ as the latent variable, and let $\tilde k$ denote the Mtd of the collapsing DA chain based on $f(x, y)$, for which the latent variable is $Y$ alone. The following result improves the collapsing theorem.

Theorem 4.1. If the Markov chains driven by $k$ and $\tilde k$ are Harris ergodic, then $\tilde k \succeq_E k$ and $\|\tilde K\| \le \|K\|$, where $K$ and $\tilde K$ are the operators on $L_0^2(f_X)$ defined by $k$ and $\tilde k$.

Proof. Fix a probability density $\pi$ on $\mathsf Z$ with respect to $\mu_Z$, and define
\[
f^*(x, y, z) = f_{X|Y}(x \mid y)\, f_Y(y)\, \pi(z)
\]
on $\mathsf X \times \mathsf Y \times \mathsf Z$. Since the $z$-coordinate is independent of $(x, y)$ under $f^*$, the DA chain on $\mathsf X$ based on $f^*$, with latent variable $(Y, Z)$, has Mtd $\tilde k$. We apply Theorem 2.1 with the latent space taken to be $\mathsf Y \times \mathsf Z$, with $f(x, y, z)$ and $f^*(x, y, z)$ playing the roles of $f$ and $\tilde f$, and with the Mtf
\[
R\big((y, z), d(y', z')\big) = \delta_y(dy')\, f_{Z|Y}(z' \mid y)\, \mu_Z(dz') ,
\]
where $\delta_y$ denotes the point mass at $y$. Since $f^*_{Y,Z}(y, z) = \int_{\mathsf X} f^*(x, y, z)\, \mu_X(dx) = f_Y(y)\, \pi(z)$, we have, for all $x \in \mathsf X$ and $(y, z) \in \mathsf Y \times \mathsf Z$,
\[
\int_{y' \in \mathsf Y} \int_{z' \in \mathsf Z} f_{X|Y,Z}(x \mid y', z')\, R\big((y, z), d(y', z')\big)
= \int_{\mathsf Z} f_{X|Y,Z}(x \mid y, z')\, f_{Z|Y}(z' \mid y)\, \mu_Z(dz')
= f_{X|Y}(x \mid y) = f^*_{X|Y,Z}(x \mid y, z) ,
\]
and, for every measurable $A \subseteq \mathsf Y \times \mathsf Z$,
\[
\int_{y \in \mathsf Y} \int_{z \in \mathsf Z} R\big((y, z), A\big)\, f^*_{Y,Z}(y, z)\, \mu_Y(dy)\, \mu_Z(dz)
= \int_A f_{Y,Z}(y', z')\, \mu_Y(dy')\, \mu_Z(dz') ,
\]
so the two conditions of Theorem 2.1 hold. An application of Theorem 2.1 implies $\tilde k \succeq_E k$ and $\|\tilde K\| \le \gamma^2 \|K\|$, where $\gamma$ is the maximal correlation of the random pair $((Y, Z), (Y', Z'))$ whose joint distribution is
\[
R\big((y, z), d(y', z')\big)\, f^*_{Y,Z}(y, z)\, \mu_Y(dy)\, \mu_Z(dz) .
\]
We now show that $\gamma = 1$. It suffices to establish that, for some function $g(y, z)$ on $\mathsf Y \times \mathsf Z$ such that $0 < \mathrm{Var}\{g(Y', Z')\} < \infty$, $\mathrm{Corr}\{g(Y, Z), g(Y', Z')\} = 1$. Take an arbitrary nonzero function $h \in L_0^2(f_Y)$, and define a function $g(y, z)$ on $\mathsf Y \times \mathsf Z$ as $g(y, z) = h(y)$. It is easy to see that $0 < \mathrm{Var}\{g(Y', Z')\} < \infty$ and that, for all $(y, z) \in \mathsf Y \times \mathsf Z$,
\[
\mathrm E\big[ g(Y', Z') \,\big|\, (Y, Z) = (y, z) \big]
= \int_{\mathsf Z} h(y)\, f_{Z|Y}(z' \mid y)\, \mu_Z(dz') = h(y) = g(y, z) .
\]

So, we have
\[
\mathrm{Corr}\{ g(Y, Z),\, g(Y', Z') \} = 1 ,
\]
and hence $\gamma = 1$,
which completes the proof.