FILL’S ALGORITHM FOR ABSOLUTELY CONTINUOUS STOCHASTICALLY MONOTONE KERNELS

coupling, absolutely continuous Markov regularity Abstract Fill, Machida, Murdoch, and Rosenthal (2000) presented their algorithm and its variants to extend the perfect sampling algorithm of Fill (1998) to chains on continuous state spaces. We consider their algorithm for absolutely continuous stochastically monotone kernels, and show the correctness of the algorithm under a set of certain regularity conditions. These conditions succeed in relaxing the previously known hypotheses su(cid:14)cient for their algorithm to apply.


An extension of Fill's algorithm
In Markov chain Monte Carlo (MCMC) a Markov kernel K is devised in an attempt to sample approximately from its stationary distribution π. Then the rate of convergence to stationarity could determine the Markov chain steps t to gain a desired approximation; however, such analysis is often found too difficult in practice. A perfect sampling algorithm, if applicable, not only eliminates the necessity of such analysis but also enables us to simulate exactly from π. The recent paper by Fill, Machida, Murdoch, and Rosenthal [3] laid down a framework for an extension of Fill's perfect sampling algorithm [1] to generic chains on general state spaces. We will call the extension of Fill's algorithm (and its variants) in [3] FMMR algorithms. The original form of this algorithm was proposed by Fill [1] for finite-state stochastically monotone chains, and was later extended by Thönnes [12] to the penetrable spheres mixture models, and by Møller and Schladitz [8] to general repulsive models. Their extensions were then considered in the context of FMMR algorithms (see Remark 7.4 in [3]). In this paper we discuss the FMMR algorithm when a monotone case obtains, that is, when (a) a state space X is partially ordered with minimum element0 and maximum element1 (i.e.,0 ≤ x ≤1 for every x ∈ X ), (b) the Markov kernel K is stochastically monotone (see Section 2.1 for a definition), and (c) there exists a collection {M x,y : x ≤ y} of upward Markov kernels satisfying for every x ≤ y, K(y, ·) = K(x, dx )M x,y (x , ·). (1.1) [We call a kernel L an upward kernel if the probability measure L(x , ·) is, for every x ∈ X , supported on the set {y : y ≥ x }.] Given a collection of such upward kernels, the FMMR algorithm is a form of rejection sampling to simulate exactly from π. Algorithm 1.1. In the first phase, we use the kernelK (called the time-reversed kernel) satisfying π(dx)K(x, dy) = π(dy)K(y, dx). Fix a positive integer t, and run a chain for t steps from the time-reversed kernelK, starting from X t =0, and obtaining X t−1 , . . . , X 0 in succession. In the second phase, generate a chain (Y 0 , . . . , Y t ), starting from Y 0 =1, and obtaining Y s inductively for s = 1, . . . , t from the distribution M xs−1,ys−1 (x s , ·) given X s−1 = x s−1 , Y s−1 = y s−1 and X s = x s . Finally check whether Y t =0. If so, the value X 0 is accepted as an observation from the stationary distribution π. If not, we start the routine again with an independent simulation (and possibly with a fresh choice of t).
The framework of FMMR algorithm indicates that Algorithm 1.1 runs correctly if the stationary distribution π is atomic at0 [i.e., π({0}) > 0]; see Section 7 of [3]. In Section 1.2 we will present a simple example, and demonstrate that Algorithm 1.1 can fail without this "atomicity" condition, but runs correctly when different upward kernels are chosen. This illustrates our motive for studying the FMMR algorithm when a Markov kernel K is absolutely continuous, that is, when it has a kernel density k and a stationary distribution π (which is now a density). After making a short discussion on the issue of technical assumptions in Section 2. In the rest of Section 2 (that is, Sections 2.3-2.5), we will present several key properties of the kernel density k and the upward kernel M x,y , which leads to our proof of Theorem 1.2 in Section 3.1. In Section 3.2 we will discuss how we can construct a particular upward kernel having the continuity property in (R3). Finally in Section 3.3 we will introduce a quasimonotone case toward possible extensions of Theorem 1.2, and discuss the applicability of FMMR algorithm for this quasi-monotone case. The present paper is what has become of the reference listed as [31] in [3].

A simple example
We present a simple example constructed on the state space X := [0, 1] with the usual Euclidean metric and with the usual linear order having the minimum element 0 and the maximum element 1. Let ν denote the Lebesgue measure on [0, 1], and let δ x denote the Dirac measure at x. Then we can define a probability measure π α (·) on [0, 1] with parameter α ∈ [0, 1/2) by π α (·) := αδ 0 (·) + αδ 1 (·) + (1 − 2α)ν(·), and a Markov kernel K (with parameter α) by It can be easily checked that the Markov kernel K is stochastically monotone, and satisfies π α (dx)K(x, dy) = π α (dy)K(y, dx). Thus, K is a time-reversible Markov kernel with stationary distribution π α . Let U be a uniform random variable on [0, 1], and let be the inverse probability transformation of K(x, ·). Then we call φ a transition rule for K because of the property that K(x, ·) = P (φ(x, U ) ∈ ·) for every x ∈ X . Furthermore, φ is said to be monotone in the sense that φ(x, U ) ≤ φ(y, U ) whenever x ≤ y. Therefore, we find that the conditional probability yields a desired upward kernel satisfying (1.1) for every x ≤ y. Now that the monotone case has obtained, we can apply Algorithm 1.1 with the choice of t = 1. In the first phase we generate a sample X 0 = z from the distributionK(0, ·) which equals K(0, ·) by reversibility, and in the second phase we accept the value z with probability M z,1 (0, {0}). We first assume that α > 0. Then we can easily compute and P (X 0 ∈ dz | accept) = K(0, dz)M z,1 (0, {0})/(1/2) = π α (dz). Thus, we have seen that Algorithm 1.1 works as desired when π α is atomic at 0. We now suppose, however, that α = 0. This is the case when K has the kernel density Here we consider the multigamma coupler, introduced by Murdoch and Green [9], and construct different upward kernels particularly for α = 0. Let U ≡ (V, W ) be a random vector with a Bernoulli random variable V having success probability 1/2 and, independently, with a uniform random variable W on [0, 1]. Then we can define a monotone transition rule φ 0 for K by i fV = 1 and x < 1/2; (1 + W )/2 if V = 1 and x ≥ 1/2.

Regularity conditions and their properties 2.1 The monotone case
In what follows we assume that the state space X is a Polish space (i.e., a complete separable metric space), and is equipped with a closed partial ordering [i.e., {(x, y) : x ≤ y} is closed in X × X ]. The Polish space assumption provides a sufficient condition for existence of regular conditional probabilities (see, e.g., Dudley [4]), and was used implicitly in Section 1.2. In a typical application the state space can be chosen as a connected region of some Euclidean space R n with the usual (coordinate-wise) partial order. Additional technical assumptions will be made about the monotone case throughout this paper, as described below. Measure space. A subset G of X is called a down-set if it implies y ∈ G whenever y ≤ x for some x ∈ G, and D will denote the collection of non-empty open down-sets. Then we assume that D is a base of neighborhoods of0, that is, that We note that (2.2) is equivalently characterized by every nonnegative measurable function g for which g(x) ν(dx) > 0 whenever g is strictly positive and continuous at0. The measure ν will be used as a "reference" measure to introduce density functions (i.e., Radon-Nykodym derivatives with respect to ν). For this reason we will simply write dz for ν(dz) as a shorthand notation. Lebesgue measure is the usual choice for reference measure ν when X is a connected region of some Euclidean space. But note that itself is a neighborhood of0, and that this is clearly the case when X is discrete, or0 is an isolated point in X . Monotone coupling. The notion of stochastic monotonicity is closely related to Nachbin-Strassen theorem on stochastic ordering, introduced by Kamae, Krengel, and O'Brien [6]. Among several equivalent definitions, stochastic ordering can be defined in terms of the collection D of non-empty open down-sets. Let P 1 and P 2 be probability measures on (X , B). Then P 1 is said to be stochastically smaller than P 2 , denoted by P 1 P 2 , if P 1 (G) ≥ P 2 (G) for all G ∈ D. The Nachbin-Strassen theorem (Theorem 1 in [6]) shows that P 1 P 2 if and only if there exists an upward kernel L satisfying P 2 (·) = P 1 (dx)L(x, ·). (See [6] for other characterizations of stochastic ordering.) A Markov kernel K is said to be stochastically monotone if K(x, ·) K(y, ·) whenever x ≤ y. By the Nachbin-Strassen theorem the stochastic monotonicity of K is equivalent to existence of a collection {M x,y : x ≤ y} of upward kernels satisfying (1.1). Let V denote the closed subset {(x, y) : x ≤ y} of X ×X . Then we will assume that the function M x,y (x , A) of (x, y, x ) is jointly measurable on V × X for every A ∈ B, and call M x,y (x , ·) a monotone coupling. As we have demonstrated in Section 1.2, it is often the case that a monotone transition rule φ exists and that a monotone coupling M x,y (x , ·) can be constructed via (1.3). We note, however, that such monotone transition rule φ does not necessarily exist for stochastically monotone kernel K when X is not a linearly ordered state space. (For the precise statement in the case of a finite state space, consult [2].)

Regularity conditions
We define the essential supremum for a measurable function f on a measurable set G by We will say that a sequence n=1 of open balls with lim n→∞ r n = 0 such that G n ⊆ B(0, r n ). Now assume that the Markov kernel K has a kernel density k and a stationary density π. Then we can introduce the following conditions (R1)-(R3) for the monotone case, and call them collectively regularity conditions: (R1) The stationary density π is continuous at0, and satisfies π(x) > 0 for all x ∈ X .
(R2) The kernel density k satisfies for every sequence {y n } ∞ n=1 converging to0, Recall the example discussed in Section 1.2. Here we examine the regularity conditions when the Markov kernel K is absolutely continuous, that is, when α = 0. Clearly the stationary density π 0 and the kernel density k satisfy (R1) and (R2), respectively. But the upward kernel (1.3) with transition rule (1.2) does not satisfy (R3). Indeed one can show that the limit in (R3) becomes unity by choosing x < 1/2 ≤ y and the shrinking sequence {G n } ∞ n=1 with G n = [0, 1/n). In Section 1.2 we then obtained a successful monotone coupling by using another transition rule (1.5). We will see in Remark 2.1(b) that (R3) holds for this monotone coupling. In the sense of (2.3) we will say that the functions f (x, ·), x ∈ X , are equicontinuous at0.  1, 2, . . ., by (2.1) we can find some G n ∈ D such that G n ⊆ B(0, 1/n). Then the resulting sequence {G n } ∞ n=1 is clearly shrinking.

Kernel density and ergodicity
A kernel density k t exists for the t-step transition kernel K t for t = 1, 2, . . ., and is constructed recursively via where k 1 ≡ k. From the integral representation (2.5) one easily conclude that the equicontinuity property in (R2) is maintained for the t-step kernel density k t . Recalling Remark 2.1(a), for any ε > 0 we can find some neighborhood G of0 satisfying (2.3), and apply it to obtain for any y ∈ G, and for every x ∈ X and every t = 1, 2, . . .. Thus, the choice of neighborhood G of0 does not depend on t. From this observation in (2.6) we derive the following theorem in connection with ergodicity of the Markov kernel K.
Proof. By (R1) and (2.6), for any ε > 0 we can find a neighborhood G of0 such that for all z ∈ G and all t ≥ 1. By combining these two inequalities, we can obtain By integrating the above inequality over the neighborhood G, we derive where we are using the symbol π for both density and distribution. By ergodicity K t (x, G) converges to π(G) as t → ∞, and This completes the proof since ε > 0 is arbitrary.

Pointwise derivative and upward kernel
Since k t (x, ·) is continuous at0, we can consider the pointwise derivative of K t (y, ·) with respect to K t (x, ·) at0 (see, e.g., [5]). Assuming k t (x,0) > 0, one can easily show that for any shrinking sequence {G n } ∞ n=1 of neighborhoods of0, An immediate application of (2.7) comes from the fact that stochastic monotonicity for the kernel K is preserved for the t-step transition kernel K t (see, e.g., [7]). Recall in Remark 2.1(c) that we can choose a shrinking sequence {G n } ∞ n=1 of non-empty open down-sets. Since K t (x, G n ) ≥ K t (y, G n ) for every n ≥ 1, by (2.7) we conclude the following corollary.
In the next proposition we will see that the pointwise derivative is also related to the upward kernel. Then, by combining Proposition 2.4 and (2.7), one can see that for each (x, y) ∈ V, Proof. Since M x,y is upward and G n is down-set, we have M x,y (x , G n ) = 0 for every x ∈ G n . By (1.1) we obtain Therefore, we can derive Then the limit of the essential supremum must be zero by (R3).

The t-fold monotone coupling
In this subsection we discuss the monotone coupling and introduce its t-fold extension under the regularity conditions. For each fixed (x, y) ∈ V, we first define a measure on V bŷ From a simple constructive argument of Lebesgue integration, one can see that (1.1) equivalently implies that g(y ) k(y, y ) dy = g(y )K((x, y), dx × dy ) (2.9) for every measurable function g on X , if it is integrable. By the upwardness and the joint measurability of M x,y (x , ·),K is actually a Markov kernel on the state space V. Then we can establish the following lemma.

Lemma 2.5. Let g(u, v, x ) be a jointly measurable function on V × X . Then
Proof. We can prove it by monotone class argument (see, e.g., Dudley [4]). First let g(u, v, x ) = 1 A×B (u, v, x ) be an indicator function on A × B with measurable sets A ⊆ V and B ⊆ X . Then clearlyĝ(x, y, x ) =K((x, y), A)1 B (x ) is jointly measurable. We can then pass to finite disjoint unions of them, and then to any measurable sets on V × X by monotone class theorem. Now g can be replaced by simple functions, nonnegative and general jointly measurable functions.
Now we define the collection {M t x,y : (x, y) ∈ V} of t-fold upward kernels as follows. We set M 1 x,y (x , ·) := M x,y (x , ·), and for t ≥ 2, we can construct M t x,y (x , ·) recursively via if k t (x, x ) > 0; otherwise, we set M t x,y (x , ·) to be an arbitrary upward kernel, say M t x,y (x , ·) := δ x (·). Then it is an immediate corollary of Lemma 2.5 that the function M t x,y (x , ·) of (x, y, x ) is jointly measurable. Furthermore, it satisfies (2.11) which is verified by the following straightforward induction argument: By applying Fubini's theorem and then (2.9), one can show that The next proposition is the result of another straightforward induction argument, and will be the most important one in the proof of Theorem 1.2.
Proposition 2.6. If x ≤ y and k t (x,0) > 0, then we have Proof. Note that (2.8) is a special case of (2.12) for t = 1. For t ≥ 2, by applying the induction hypothesis and then (2.9), we derive In the sense of (2.11), we will call M t x,y (x , ·) a t-fold monotone coupling. Then the last theorem of this subsection shows that the continuity property of (R3) is maintained for the t-fold extension of M x,y (x , ·). This in turn can be used to provide an alternative proof of Proposition 2.6 which will be exactly the same as that of Proposition 2.4.
Proof. For each n ≥ 1 define where a n := ess.sup x ∈Gn |M x,y (x , G n ) − M x,y (0, {0})|. By definition of the essential supremum, we have ν(B n ) = 0 for every n ≥ 1, and therefore, ν ( then the proof will be completed. We will show (2.13) by induction. By (R3) we can see that (2.13) clearly holds for t = 1. Since is equicontinuous at0, we can find some integer n 0 and positive value c > 0 satisfying for all (u, v) ∈ V and all n ≥ n 0 . By applying the induction hypothesis and then dominated convergence theorem, we derive They together implies (2.13).

Proof of Theorem 1.2
We now complete the proof of Theorem 1.2 by introducing a rejection sampling method (see, e.g., Chapter 10 of [11]), which essentially parallels Section 7.3 of [1]. Under the regularity conditions we can form a time-reversal kernel densityk bỹ k(y, x) = π(x)k(x, y) π(y) , (3.1) and obtain the t-step time-reversed kernel densityk t via (2.5) withk substituted for k. Suppose that there exists a constant c satisfying Then one can construct an algorithm composed of the following two steps: Step 1. Generate a sample value X 0 = z from the densityk t (0, ·).
According to the rejection sampling principle, this algorithm generates a sample exactly from the density π, and will yield an accepted value every iteration with success probability c −1 . And it will perform best when the least upper bound c * in (3.2) is chosen. By applying (2.5) with (3.1) inductively, we obtain π(x)k t (x, y) ≡ π(y)k t (y, x). Then, together with Corollary 2.3 we derive the least upper bound (1,0) , (3.3) and the acceptance probability By recursively applying (2.10), the t-fold upward kernel M t z,1 (0, {0}) has also the closed from Sincek one can see that the acceptance probability in Algorithm 1.1 is exactly equal to the probability M t z,1 (0, {0}). Furthermore, by Proposition 2.6 the second phase of Algorithm 1.1 works precisely as Step 2, and generates the same acceptance probability (3.4). Therefore, we have shown the correctness of Algorithm 1.1.

Gamma coupling
Let W be a measurable set on X × X . By W x := {y : (x, y) ∈ W} we denote the section determined by the first coordinate x. In what follows we will assume that W x is nonempty for every x ∈ X . Then a collection {M x,y : (x, y) ∈ W} of Markov kernels is called a coupling, if (i) M x,y (x , ·) is supported on the section W x for every (x, y, x ) ∈ W × X, and (ii) the function M x,y (x , A) of (x, y, x ) is jointly measurable on W × X for each A ∈ B. We simply write M x,y (x , ·) for a coupling if the underlying set W has been clearly specified. Furthermore, given a Markov kernel K, we call M x,y (x , ·) a coupling for K if it is a coupling satisfying (1.1) for every (x, y) ∈ W. We note that a monotone coupling is a special case of coupling for K where the set W is particularly chosen to be the closed set V = {(x, y) : x ≤ y}. One can generate a bivariate chain (X t , Y t ) t=0,1,... on the state space W from the bivariate Markov kernelK ((x, y), dx × dy ) := M x,y (x , dy )K(x, dx ), (3.5) and the initial distribution (X 0 , Y 0 ). Then the marginal chain (X t ) t=0,1,... is clearly a Markov chain generated by K, and (1.1) becomes a necessary and sufficient condition for the marginal chain (Y t ) t=0,1,... to be Markovian, having the same Markov kernel K. When W = X × X , we can always construct the "trivial" coupling M x,y (x , ·) := K(y, ·) for a Markov kernel K, by which one can run independent marginal chains (X t ) t=0,1,... and (Y t ) t=0,1,... . In the rest of this subsection we further assume that W contains the diagonal set ∆ := {(x, x) : x ∈ X }, and that the Markov kernel K has a kernel density k. Given a coupling L x,y (x , ·), we can construct another coupling M x,y (x , ·) by Let (x, y) ∈ W be fixed, and let γ x,y := k(x, z)∧k(y, z) dz where a∧b := min(a, b). Assuming γ x,y < 1, we can introduce the probability measures If L x,y (x , ·) satisfies for every (x, y) ∈ W, then we will see in the next lemma that (3.6) is a coupling for K, and call it a γ-coupling. Note that if γ x,y = 1 then k(x, x ) = k(y, x ) and M x,y (x , ·) = δ x (·) for a.e. x ∈ X , which trivially satisfies (1.1).
which equals K(y, A) if and only if (3.7) holds.
The notion of γ-coupling appeared at several different places (see the notes in Chapter 1 of [7]), and was introduced by Lindvall [7] for a coupling of two probability measures P 1 and P 2 to achieve a maximal coupling. This particular coupling was a bivariate probability measure Q on a product space whose marginals are consistent with P 1 and P 2 , and its mass Q(∆) on the diagonal set ∆ achieves its maximum 1 − 1 2 P 1 − P 2 , where · denotes the total variation norm. By employing the same argument as in [7], one can show that γ x,y = 1 − 1 2 K(x, ·) − K(y, ·) , and that our γ-coupling in (3.6) attainsK((x, y), ∆) = γ x,y . But in the next proposition we justify our notion of the γ-coupling in connection with the regularity conditions. Proposition 3.2. If the kernel density k satisfies (R2), then (R3) holds for a γ-coupling M x,y (x , ·) of the form (3.6).
Proof. Let ε > 0 be arbitrary. By Remark 2.1(b), (R3) holds if we can find a neighborhood G of0 so that (2.4) holds for any neighborhood G of0 satisfying G ⊆ G. (i) Suppose that 0 < k(x,0) < k(y,0). Then by (R2) we can find some neighborhood G of0 such that 0 < k(x, x ) < k(y, x ) for all x ∈ G. Thus, we obtain for every x ∈ G and for every measurable subset G of G, as desired. (ii) Suppose that 0 < k(x,0) = k(y,0). Similarly we can find some neighborhood G of0 such that 0 < k(x, x ) and |k(x, x ) − k(y, x )|/ k(x, x ) < ε/ 2 for all x ∈ G. Thus, we obtain for every x ∈ G and for every measurable subset G of G, as desired. Finally, (iii) suppose that k(x,0) > k(y,0). By (R2) we can find some neighborhood G of 0 such that k(x, x ) − k(y, x ) ≥ α := (k(x,0) − k(y,0)) /2 and k(y,x ) k(x,x ) − k(y,0) k(x,0) < ε for all x ∈ G. Then, by (3.7) we obtain for every x ∈ G and for every measurable subset G of G. Since Q (x,y) (G ) = 0 in (3.8) implies that L x,y (x , G ) = 0 for a.e. x ∈ X , we derive k(x,0) < ε for a.e. x ∈ G , as desired.
Now we assume that the Markov kernel K is absolutely continuous and stochastically monotone, and choose W := {(x, y) : x ≤ y}. Since K(x, ·) K(y, ·) implies Q (x,y) (·) Q (x,y) (·), by Nachbin-Strassen theorem there exists an upward kernel L x,y satisfying (3.7). If we can find further that L x,y (x , A) is jointly measurable for each A ∈ B, then the resulting γcoupling M x,y (x , ·) is monotone and also satisfies (R3) by Proposition 3.2.

Quasi-monotone case
Given a measurable subset W of X × X , we will say that a quasi-monotone case obtains, if (a) there exists a particular element in X , say1, satisfying (x,1) ∈ W for all x ∈ X , (b) there exists a particular element in X , say0, satisfying k(x,0) ≥ k(y,0) for every (x, y) ∈ W, and