Construction of weakly CUD sequences for MCMC sampling

In Markov chain Monte Carlo (MCMC) sampling, considerable thought goes into constructing random transitions. But those transitions are almost always driven by a simulated IID sequence. Recently it has been shown that replacing an IID sequence by a weakly completely uniformly distributed (WCUD) sequence leads to consistent estimation in finite state spaces. Unfortunately, few WCUD sequences are known. This paper gives general methods for proving that a sequence is WCUD, shows that some specific sequences are WCUD, and shows that certain operations on WCUD sequences yield new WCUD sequences. A numerical example on a 42-dimensional continuous Gibbs sampler found that some WCUD input sequences produced variance reductions ranging from tens to hundreds for posterior means of the parameters, compared to IID inputs.


Introduction
In Markov chain Monte Carlo (MCMC) sampling, random inputs are used to determine a sequence of proposals and, in some versions, their acceptance or rejection. Considerable effort and creativity have been applied to devising good proposal mechanisms. Many examples can be found in recent books such as Newman and Barkema (1999), Liu (2001), Robert and Casella (2004), and Landau and Binder (2005).
With very few exceptions, described below, the proposals and acceptance-rejection decisions are all sampled the same way. A sequence of points u_i ∈ (0, 1) that simulates independent draws from the U [0, 1] distribution is used to drive the process. By replacing IID draws with more balanced samples, we can hope to improve the accuracy of MCMC.
MCMC sampling is subtle and modifying the IID driving sequence without theoretical support is risky. To do so is to simulate a Markov chain using random variables that do not have the Markov property. Caution is in order and few have tried it. According to Charles Geyer, writing in 2003, (http://www.stat.umn.edu/geyer/mcmc/talk/mcmc.pdf) "Every MCMC-like method is either a special case of the Metropolis-Hastings-Green algorithm, or is bogus". Our objective in this paper is to show that a variety of alternative sampling schemes avoid the latter category. They yield consistent MCMC estimates whose accuracy we investigate empirically.
Two exceptions to IID driving sequences that we know of are Liao (1998) and Chaudary (2004). The first proposed using randomly reordered quasi-Monte Carlo points in the Gibbs sampler. The second proposed strategic sampling of proposals with weighting of rejected proposals. Both found empirical evidence of a modest improvement, but neither gave any theory to support his method.
We will follow up on Liao's proposal, which is illustrated in Figure 1. The quasi-Monte Carlo points, which are rows of the matrix A, are very uniformly distributed through [0, 1]^s. They get reordered into a matrix X that then has its rows concatenated into a long vector u that gets used to drive the MCMC computation. We refer to both u and X as the driving sequence.
Recently Owen and Tribble (2005) proved consistency for Metropolis-Hastings algorithms, including Gibbs sampling as a special case, when the driving sequence u i are completely uniformly distributed (CUD), or even weakly CUD (WCUD), as defined below. That work built on Chentsov (1967) who gave conditions under which CUD sequences lead to consistency when Markov chains are sampled by inversion.
We give formal definitions of CUD and WCUD sequences below. For now, note that a CUD sequence is one that can be broken into overlapping d-tuples that pass a uniformity test, and that this holds for all d. Replacing IID points by CUD points is similar to using the entire period of a random number generator. When the points of the CUD sequence are well balanced, the effect is to get a quasi-Monte Carlo (QMC) version of MCMC. One then has to design or choose random number generators with good uniformity that are small enough to use in their entirety.
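To make the overlapping d-tuples concrete, here is a minimal Python sketch that runs a tiny multiplicative congruential generator over its full period and forms all overlapping pairs. The modulus 509 and multiplier 229 are hypothetical illustration values, not a recommended generator.

```python
import numpy as np

def overlapping_tuples(u, d):
    """Form the n - d + 1 overlapping d-tuples (u_i, ..., u_{i+d-1})."""
    n = len(u)
    return np.array([u[i:i + d] for i in range(n - d + 1)])

# Toy driving sequence: one full period of a small multiplicative
# congruential generator (hypothetical parameters, illustration only).
N, a = 509, 229
x, u = 1, []
for _ in range(N - 1):
    x = (a * x) % N
    u.append(x / N)
u = np.array(u)

pairs = overlapping_tuples(u, 2)   # the overlapping 2-tuples (u_i, u_{i+1})
print(pairs.shape)
```

A CUD sequence is one for which the empirical distribution of such d-tuples approaches uniformity on [0, 1]^d for every d; using the generator's full period corresponds to the "entire period of a random number generator" remark above.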
The theory in Owen and Tribble (2005) applied to infinite sequences, but the examples there used finite sequences that were not yet known to be WCUD. This paper establishes that several classes of constructions give WCUD sequences and shows how certain natural operations on WCUD sequences yield other WCUD sequences. We then illustrate the use of WCUD sampling on a 42 parameter Gibbs sampling problem and find that the posterior means are estimated with variance reductions ranging from tens to hundreds. A detailed outline of this paper follows.
Section 2 provides background on MCMC and quasi-Monte Carlo sampling. Section 3 defines a triangular array version of (W)CUD sequences. Theorem 2 shows that triangular array (W)CUD sequences lead to consistent MCMC estimates. Sometimes it is simpler to prove a pointwise version of the (W)CUD property instead of the uniform one required. Lemma 1 shows that pointwise (W)CUD sequences are necessarily (W)CUD. (W)CUD sequences are defined through overlapping blocks of consecutive values, but some QMC constructions make it easier to work with non-overlapping blocks. Theorem 3 shows that a (W)CUD-like property defined over non-overlapping blocks suffices to prove the original (W)CUD property. That theorem also shows that one can restrict attention to any convenient infinite set of dimensions.
Next we turn to specific constructions. In Section 4, Theorem 4 shows that a lattice construction of Niederreiter (1977) leads to a CUD sequence. Cranley-Patterson rotations are commonly used to randomize lattice rules. Lemma 2 shows that Cranley-Patterson rotations of Niederreiter's sequence, or indeed of any (W)CUD sequence, are WCUD. Section 5 investigates Liao's proposal for randomly reordering the points of a QMC sequence. Lemma 3 gives a nonasymptotic bound for the discrepancy of the reordered sequence in terms of the discrepancy of the original one. Then Theorem 5 gives conditions for the reordered points to be WCUD. Liao's proposal simply shuffles a QMC sequence. Theorem 6 shows that some generalizations that rearrange the QMC points are also WCUD. Section 6 shows that we can mix IID U [0, 1] sequences into WCUD sequences in a certain way, and end up with a WCUD sequence. This then proves, via a coupling argument, that Liao's proposal for handling acceptance-rejection sampling leads to consistent estimates. Section 7 presents an example, with a probit model on some data of Finney (1947), using QMC-MCMC to drive the Gibbs sampler from Albert and Chib (1993). In this example, the parameter vector has 42 dimensions. The variance reductions attained are typically in the range from 20 to 40, but some improvements of over 500-fold were attained. Section 8 summarizes our findings and discusses rates of convergence and extensions to continuous state spaces.
To conclude this introduction, we mention some related work in the literature that merges QMC and MCMC ideas. L'Ecuyer et al. (2005) have developed an array-RQMC method for simulating Markov chains on an ordered state space. Craiu and Lemieux (2007) combine antithetic and stratified sampling in the multiple-try Metropolis algorithm of Liu et al. (2000). James Propp and co-workers have been applying QMC ideas to derandomize some randomized automata. At present the best way to find this work is via Internet search using the term "rotor-router". Lemieux and Sidorsky (2005) use quasi-Monte Carlo sampling to drive the exact sampling of Propp and Wilson (1996).

Background
This section sets out the notation for MCMC, assuming some familiarity with Markov chain Monte Carlo methods, as described for example in Liu (2001), Robert and Casella (2004) or Gilks et al. (1996). Our version works on a finite state space with Metropolis-Hastings type proposals.
Then we describe quasi-Monte Carlo sampling (QMC) and use it to define CUD and weakly CUD sequences. Finally we show how CUD and weakly CUD sequences can be used as driving sequences for Metropolis-Hastings sampling.
We use integers m, s, and d to describe dimensions in this paper. One Metropolis-Hastings step typically consumes m uniform random numbers. A quasi-Monte Carlo point set is typically constructed for a finite dimension s. It is most natural to arrange matters so that m = s, but we need two distinct integers because m ≠ s is also workable. Finally we will need some equidistribution properties to hold for all d in an infinite set of dimensions.
Points in [0, 1]^m, [0, 1]^s, or [0, 1]^d are denoted by letters such as x and z, or when there are many such points, by x^(i) or z^(k) for integer indices. Components of such tuples are denoted as x_j or z^(k)_j. For j ≤ k, the notation x_{j:k} denotes the (k − j + 1)-tuple taken from components j through k inclusive of x.

Markov chain Monte Carlo
We describe MCMC for sampling from a distribution π on a discrete state space Ω = {ω 1 , . . . , ω K } for K < ∞. We suppose as is usual that the ratio π(ω)/π(ω ′ ) can be easily obtained for any pair ω, ω ′ ∈ Ω, or at least for any pair that can arise as two consecutive samples. Starting with some point ω (0) ∈ Ω, the MCMC simulation generates a Markov chain ω (i) ∈ Ω for i ≥ 1 whose stationary distribution is π.
Common usage of Metropolis-Hastings sampling makes a proposal ω̃^(i+1) based on ω^(i) and m − 1 consecutive u_j from the driving sequence:

ω̃^(i+1) = Ψ(ω^(i); u_{mi+1}, . . . , u_{mi+m−1}).    (1)

Acceptance or rejection of the proposal is based on one more member of the driving sequence as follows:

ω^(i+1) = ω̃^(i+1) if u_{mi+m} ≤ A, and ω^(i+1) = ω^(i) otherwise,    (2)

using the Metropolis-Hastings acceptance probability

A = min( 1, [π(ω̃^(i+1)) p(ω̃^(i+1) → ω^(i))] / [π(ω^(i)) p(ω^(i) → ω̃^(i+1))] ).    (3)

Here p(ω → ω̃) denotes the probability of proposing a transition from ω to ω̃. The mechanism described by equations (1), (2), and (3) is less general than the usual MCMC. It leaves out the case where acceptance-rejection sampling is used to generate one or more of the components of the proposal. Section 6 shows how one can splice in IID elements of the driving sequence for that case.
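As an illustrative sketch (not the paper's exact mechanism), the following Python code drives a Metropolis sampler on a small finite state space with a driving sequence u, consuming m = 2 members per transition: one to generate an independence proposal by inversion and one for the accept/reject decision. The target distribution pi and all parameters here are hypothetical.

```python
import numpy as np

def mh_discrete(pi, u, omega0=0):
    """Metropolis sampler on states {0, ..., K-1} with a uniform
    independence proposal.  Each transition consumes m = 2 members of
    the driving sequence u: u[2i] picks the proposal by inversion and
    u[2i+1] makes the accept/reject decision.  Returns \\hat{pi}_n."""
    K = len(pi)
    omega = omega0
    visits = np.zeros(K)
    n = len(u) // 2
    for i in range(n):
        prop = min(int(u[2 * i] * K), K - 1)      # proposal via inversion
        # uniform proposal is symmetric, so A = min(1, pi(prop)/pi(omega))
        if u[2 * i + 1] < min(1.0, pi[prop] / pi[omega]):
            omega = prop
        visits[omega] += 1
    return visits / n

rng = np.random.default_rng(1)
pi = np.array([0.1, 0.2, 0.3, 0.4])               # hypothetical target
pi_hat = mh_discrete(pi, rng.random(2 * 200_000))  # IID driving sequence
print(np.round(pi_hat, 2))
```

Replacing `rng.random(...)` by a (W)CUD driving sequence is precisely the substitution studied in this paper.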
The function Ψ that constructs proposals may involve inversion of CDFs or other similar transformations for generating random variables, as given by Devroye (1986). We will however need a regularity condition (Jordan measurability) for the proposals. This very mild condition rules out pathological constructions.
Definition 1 (Regular proposals). The proposals are regular if for all ω, ω̃ ∈ Ω the set S_{ω→ω̃} = {u ∈ [0, 1]^{m−1} : Ψ(ω; u) = ω̃} is Jordan measurable. Jordan measurability of S_{ω→ω̃} means that it has a Riemann integrable indicator function. The boundary of such a set has volume zero. Proposals that do one thing for u_i with rational components and another for irrational u_i would be ruled out. Practically useful proposal mechanisms are regular.
By stage n the fraction of time spent in state ω is

π̂_n(ω) = (1/n) Σ_{i=1}^{n} 1_{ω^(i) = ω}.    (4)

Consistency of π̂_n is defined differently for random and nonrandom u_i, as follows. For nonrandom u_i we require

lim_{n→∞} π̂_n(ω) = π(ω) for all ω ∈ Ω,    (5)

while for random u_i we require weak consistency,

lim_{n→∞} Pr(|π̂_n(ω) − π(ω)| > ǫ) = 0 for all ω ∈ Ω and all ǫ > 0.    (6)

Quasi-Monte Carlo
In quasi-Monte Carlo sampling, one does not simulate randomness. Instead one picks points that are more uniform than random points would be. Here we sketch QMC sampling. The reader seeking more information may turn to Niederreiter (1992). For x ∈ [0, 1]^s, let Vol([0, x]) = Π_{j=1}^{s} x_j denote the volume of the box [0, x] = Π_{j=1}^{s} [0, x_j], and define the local discrepancy function

δ(x) = (1/n) Σ_{i=1}^{n} 1_{x^(i) ∈ [0,x]} − Vol([0, x]).

We suppress the dependence of δ on s and the x^(i) to keep notation uncluttered. The star discrepancy of x^(1), . . . , x^(n) ∈ [0, 1]^s is

D*^s_n = sup_{x ∈ [0,1]^s} |δ(x)|.

The star discrepancy is an s-dimensional generalization of the Kolmogorov-Smirnov distance between U [0, 1]^s and the empirical distribution of the points x^(i).
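A minimal numerical sketch of these definitions: the local discrepancy at an anchor z compares the fraction of points in [0, z] with Vol([0, z]), and the star discrepancy is the supremum over z. The grid search below gives only a lower bound on D*_n; the grid resolution is an arbitrary choice.

```python
import numpy as np

def local_discrepancy(points, z):
    """delta(z): fraction of points in [0, z] minus Vol([0, z])."""
    inside = np.all(points <= z, axis=1)
    return inside.mean() - np.prod(z)

def star_discrepancy_lb(points, grid=20):
    """Lower bound on D*_n, maximizing |delta| over a finite grid of
    anchors z; the true supremum runs over all z in [0, 1]^s."""
    s = points.shape[1]
    axes = [np.linspace(0, 1, grid + 1)[1:]] * s
    best = 0.0
    for z in np.stack(np.meshgrid(*axes), -1).reshape(-1, s):
        best = max(best, abs(local_discrepancy(points, z)))
    return best

rng = np.random.default_rng(0)
iid = rng.random((128, 2))          # 128 IID points in [0, 1]^2
d_lb = star_discrepancy_lb(iid)
print(d_lb)
```

For IID points this lower bound decays at roughly the n^{-1/2} Monte Carlo rate, whereas low discrepancy constructions achieve nearly n^{-1}.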
Let the function f : [0, 1]^s → R have total variation ‖f‖_HK in the sense of Hardy and Krause. For s = 1, ‖f‖_HK reduces to the usual concept of the total variation of a function. For s > 1 one must take care to use a measure of variation that does not vanish when f depends only on a subset of its arguments. Variation in the sense of Hardy and Krause does so, as described in Owen (2005). Then the Koksma-Hlawka inequality

| (1/n) Σ_{i=1}^{n} f(x^(i)) − ∫_{[0,1]^s} f(x) dx | ≤ D*^s_n ‖f‖_HK

shows the advantage of low discrepancy for integration. There are QMC constructions for which D*^s_n = O(n^{−1+ǫ}) holds for any ǫ > 0. Then functions f of finite variation can be integrated with an error rate that is superior to the familiar O(n^{−1/2}) root mean square error rate of Monte Carlo. The asymptotics can be slow to set in, but in practice the accuracy of QMC ranges from comparable with MC to far superior to MC.

CUD and weakly CUD sequences
Quasi-Monte Carlo points will not always lead to the right answer when used to drive MCMC sampling. The QMC constructions that can be made to work are the ones that are completely uniformly distributed (CUD) and weakly CUD (WCUD). A sequence u_1, u_2, . . . ∈ [0, 1] is CUD if, for every integer d ≥ 1, the overlapping d-tuples x^(i) = (u_i, . . . , u_{i+d−1}) for i = 1, . . . , n − d + 1 have star discrepancy D*^d_{n−d+1} → 0 as n → ∞. Random u_i are weakly CUD if Pr(D*^d_{n−d+1} > ǫ) → 0 for every integer d ≥ 1 and every ǫ > 0.
Theorem 1 (Owen and Tribble, 2005). Let ω̃^(i+1) be a regular proposal generated from (u_{mi+1}, . . . , u_{mi+m−1}) and ω^(i) via equation (1), and let ω^(i+1) be determined from u_{mi+m} by equation (2). Assume that weak consistency (6) holds for all ω, ω^(0) ∈ Ω and all ǫ > 0 when the u_i are independent U [0, 1] random variables. If the IID u_i are replaced by weakly CUD u_i, then the weak consistency result (6) still holds. If the IID u_i are replaced by CUD u_i, then the stronger consistency result (5) holds.
Theorem 1 does not explicitly make any assumptions about whether the chain is ergodic or even whether it is irreducible. Those considerations are very important, but they are buried in the weak consistency assumption (6). If a proposal mechanism is known to be consistent for an IID driving sequence, then it remains so for driving sequences that are CUD or WCUD. Therefore MCMC sampling separates into two problems. One is selecting a good proposal mechanism. The other is selecting a weakly CUD driving sequence, where of course IID points constitute the usual choice.

Extensions of WCUD
This section presents some basic results for (W)CUD sequences that we use later to show that specific constructions give rise to consistent MCMC sampling. We define a triangular array notion of (W)CUD sequences for use with finite driving sequences.
Theorem 1, taken from Owen and Tribble (2005), applies to infinite sequences u_i. In practice one uses a finite sequence of length N. A theory for large N has to account for the fact that the constructions are not always nested: for N_1 < N_2 the sequence of length N_2 might not be an extension of the sequence of N_1 points.
To formulate our limit, we take a triangular array in which the row indexed by N has points u_{N,1}, . . . , u_{N,N} ∈ [0, 1] that we use to compute π̂. The value N belongs to an infinitely large nonrandom set N of positive integers. In terms of Figure 1, the bottom row has N = ns numbers in [0, 1]. Asymptotics for Liao's construction involve a limit of such figures indexed by N. Other constructions similarly involve limits of finite sequences.
The number n of transitions made depends on N. When each transition consumes m points of the driving sequence, then n(N) = ⌊N/m⌋. The estimate using u_{N,1}, . . . , u_{N,N} is π̂_{n(N)}(ω). Consistency or weak consistency holds when, for all ω, π̂_{n(N)}(ω) converges, or converges weakly, to π(ω) as N → ∞. Limits as N → ∞ are always understood to be through the values N ∈ N.
Definition 6 (Triangular array (W)CUD). Let u_{N,1}, . . . , u_{N,N} ∈ [0, 1] for an infinite set of positive integers N ∈ N. Let D*^d_{N−d+1} denote the star discrepancy of the overlapping d-tuples (u_{N,i}, . . . , u_{N,i+d−1}) for i = 1, . . . , N − d + 1. Suppose that, as N → ∞ through the values in N,

D*^d_{N−d+1} → 0

holds for any integer d ≥ 1. Then the triangular array (u_{N,i}) is CUD. If the u_{N,i} are random and, as N → ∞ through values in N,

Pr(D*^d_{N−d+1} > ǫ) → 0

holds for all integers d ≥ 1 and all ǫ > 0, then the triangular array (u_{N,i}) is weakly CUD.
If an infinite sequence u 1 , u 2 , . . . is CUD (or WCUD) then the triangular array of prefixes taking u N,i = u i , for all N ≥ 1 is also CUD (respectively WCUD). This means that the triangular array definitions are broader than the original ones.
Theorem 2. Suppose that the transitions are as described in Theorem 1, including weak consistency (6) when the u_i are independent U [0, 1] random variables. Let each transition consume m of the u_i. If the u_i are replaced by elements u_{N,i} of a CUD triangular array then

lim_{N→∞} π̂_{n(N)}(ω) = π(ω) for all ω ∈ Ω.

If the u_i are replaced by elements u_{N,i} of a WCUD triangular array then

lim_{N→∞} Pr(|π̂_{n(N)}(ω) − π(ω)| > ǫ) = 0 for all ω ∈ Ω and all ǫ > 0.

Proof. The proof is similar to that of Theorem 3 in Owen and Tribble (2005), and so we only sketch it. Fix ǫ > 0 and identify the set T_r(ǫ) ⊂ [0, 1]^{rm} for which Σ_{ω∈Ω} Pr(|π̂_r(ω) − π(ω)| > ǫ) > ǫ holds when (u_1, . . . , u_{rm}) ∈ T_r(ǫ). For large enough r the set T_r(ǫ) has volume no more than ǫ. Then apply Definition 6 using d = rm. The rest of the proof follows as in Owen and Tribble (2005).
In proving that D*^d_n converges to zero, the natural first step is to show that the local discrepancy δ^d_n(z) tends to zero at each z. It turns out that such a pointwise (W)CUD property implies the (W)CUD property.
Lemma 1. Let u_{N,i} be a triangular array as in Definition 6. Call the array pointwise CUD if δ^d_{N−d+1}(z) → 0 for every integer d ≥ 1 and every z ∈ [0, 1]^d, and pointwise WCUD if Pr(|δ^d_{N−d+1}(z)| > ǫ) → 0 for every such d and z and every ǫ > 0. If u_{N,i} are pointwise CUD then they are CUD. If u_{N,i} are pointwise WCUD then they are WCUD.
Proof. The final two statements follow from the first two which we prove here. Pick ǫ > 0 and then choose a positive integer M > 1/ǫ. Next let L be the lattice of points in [0, 1] d whose coordinates are integer multiples of 1/(2dM ).
The (W)CUD properties are defined in terms of consecutive blocks of observations that overlap, at least when d > 1. It is often useful to consider non-overlapping blocks of points such as (u 1 , . . . , u d ), (u d+1 , . . . , u 2d ), . . . . For infinite deterministic sequences it is known (Knuth, 1998, page 155) that if the discrepancy of overlapping sequences tends to zero for all d then the discrepancy of non-overlapping sequences also tends to zero for all d. The converse (with 'for all d' in both clauses) also holds. See Chentsov (1967).
We need sufficiency of non-overlapping block results for the random case as well. Also it is helpful to be able to work with only a convenient subset of dimensions d. Theorem 3 below shows that such special cases are sufficient to prove the WCUD property and hence consistency.
Theorem 3. Let u_{N,1}, . . . , u_{N,N} ∈ [0, 1] be random, and let D be an infinite set of positive integers. Suppose that for each d ∈ D and each ǫ > 0 the non-overlapping d-tuples (u_{N,(i−1)d+1}, . . . , u_{N,id}), i = 1, . . . , ⌊N/d⌋, satisfy Pr(D*^d_{⌊N/d⌋} > ǫ) → 0 as N → ∞. Then Pr(D*^d_{N−d+1} > ǫ) → 0 holds for all ǫ > 0 and all integers d ≥ 1, so that the u_{N,i} are WCUD.
Proof. Let ǫ > 0 and η > 0, let d be a positive integer, and choose z ∈ [0, 1]^d. Approximate the indicator of [0, z] by a function f of the non-overlapping blocks; the function f is piecewise constant within a finite set of axis-parallel hyperrectangular regions in [0, 1]^d. It follows that for all large enough N we have Pr(|δ^d_{N−d+1}(z)| > ǫ) < η. Because z, ǫ, and η are arbitrary we have shown that the u_{N,i} are pointwise weakly CUD. To complete the proof we apply Lemma 1.

Niederreiter (1977) gives a result that shows how lattice rules may be used to construct a triangular array that is CUD. Let N be a prime number. Let u_0 = 1/N and for i ≥ 1 let u_i = a u_{i−1} mod 1, where a is a primitive root modulo N. For integer dimensions s ≥ 1, there are N − 1 distinct consecutive s-tuples in this sequence; call them x^(i) = (u_i, . . . , u_{i+s−1}). Niederreiter (1977) shows that for well chosen a = a(N) the star discrepancy bound (18) holds.
The totient function φ(n) counts the number of positive integers less than or equal to n that are relatively prime to n. The totient function grows quickly enough for our purposes because

φ(n) > n / (e^γ log log n + 3/log log n), n ≥ 3,

where γ ≐ 0.5772 is the Euler-Mascheroni constant. As a result there exist an A < ∞ and an N_0 < ∞ such that the bound (19) holds uniformly in s ≥ 1 and N ≥ N_0. The constant A in (19) does not have to grow exponentially with s because the factor 2/π in equation (18) is smaller than 1.

Theorem 4. Let u_{N,i} be given by Niederreiter's construction above, with s = s(N) → ∞ as N → ∞ through primes, slowly enough that the right side of (19) tends to zero. Then the triangular array (u_{N,i}) is CUD.

Proof. For any d ≥ 1 choose N_d with s(N_d) ≥ d. Then for all N ≥ N_d we have D*^d_{N−1} smaller than the right side of (19). Now the growth condition on s makes D*^d_{N−1} → 0.

Owen and Tribble (2005) employed a method of running through the lattice rule more than once, so as to use all N − 1 of the s-tuples in it exactly once. That work also prepends s zeros to the sequence. The result is that n(N) = N, and the set of s-tuples used to drive the MCMC forms a lattice rule (Sloan and Joe, 1994) in [0, 1]^s. The lattice rule structure is much more balanced than random u_i would be, and this accounts for most of the improved accuracy seen there. Prepending one single s-tuple will not affect the CUD property of overlapping d-tuples for any d ≥ 1. Similarly, shifting the points from u_1, . . . , u_N to u_{k+1}, . . . , u_N, u_1, . . . , u_{k−1} for any finite k does not destroy the CUD property (Chentsov, 1967). Finally, concatenating a finite number of CUD triangular arrays with the same N → ∞ yields a CUD result.
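The construction above is easy to reproduce. The following Python sketch generates u_0 = 1/N, u_i = a u_{i−1} mod 1 for a small prime; here N = 11 with primitive root a = 2, chosen only so that the full period is visible by eye.

```python
def niederreiter_sequence(N, a):
    """u_0 = 1/N, u_i = a*u_{i-1} mod 1: the full period of a
    multiplicative congruential generator with prime modulus N and
    primitive root a, rescaled into (0, 1)."""
    u, x = [], 1
    for _ in range(N - 1):
        u.append(x / N)
        x = (a * x) % N
    return u

u = niederreiter_sequence(11, 2)   # 2 is a primitive root modulo 11
print(u)                           # a permutation of {1/11, ..., 10/11}
```

Because a is a primitive root, the N − 1 values visit every multiple k/N for k = 1, . . . , N − 1 exactly once; the overlapping s-tuples x^(i) = (u_i, . . . , u_{i+s−1}) are the lattice points discussed above.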
Lattice rules are commonly randomized via a rotation due to Cranley and Patterson (1976). Let a ∈ [0, 1]^s and suppose that U ∼ U [0, 1]^s. Then the Cranley-Patterson rotation of a is the point x = a + U mod 1 under componentwise arithmetic. Whatever value a has, the point x is uniformly distributed on [0, 1]^s. A Cranley-Patterson rotation applied to all points in an s-dimensional low discrepancy lattice yields a shifted lattice with points that are individually U [0, 1]^s while collectively having low discrepancy.
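A Cranley-Patterson rotation is a one-line operation. The sketch below applies a single random shift to a toy rank-1 lattice (generator vector (1, 5) modulo 8, a hypothetical example rather than a recommended rule):

```python
import numpy as np

def cranley_patterson(points, rng):
    """Apply one common random shift U ~ U[0,1]^s to every row, mod 1.
    Each rotated point is individually U[0,1]^s while the set keeps
    its lattice structure."""
    U = rng.random(points.shape[1])
    return (points + U) % 1.0

rng = np.random.default_rng(42)
lattice = (np.arange(8)[:, None] * np.array([1, 5]) % 8) / 8  # toy rank-1 lattice
shifted = cranley_patterson(lattice, rng)
print(shifted.min(), shifted.max())
```

Note that the same shift U is used for every row; shifting rows independently would destroy the lattice structure that gives low discrepancy.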
In terms of Figure 1, this proposal starts with the matrix X, skipping the randomization of A. The matrix X has 0s in its first row and then all other s-tuples of the lattice obtained in order with possibly multiple passes. Then a Cranley-Patterson rotation is applied to each row of the matrix. Finally, the concatenation into a single u vector works as in Figure 1. Owen and Tribble (2005) applied a single Cranley-Patterson rotation to all of the s-tuples (u_{rs−s+1}, . . . , u_{rs}) for r = 1, . . . , N in the MCMC driving sequence. As we show next, applying a Cranley-Patterson rotation to any CUD or WCUD triangular array yields a triangular array that is WCUD.

Lemma 2. Let u_{N,i} be a CUD or WCUD triangular array and let v_{N,i} be obtained from it by applying a Cranley-Patterson rotation to consecutive s-tuples. Then the v_{N,i} are WCUD.
. . . , x^(N−d+1)) for some K < ∞. It follows that Pr(δ^d_{N−d+1}(z) > ǫ) → 0 for any z and any ǫ > 0. Therefore Pr(D*^d_{N−d+1} > ǫ) → 0 for any d that is a multiple of m. Therefore v_{N,i} is WCUD. If the u_{N,i} are CUD they are also WCUD, and so then the v_{N,i} are still WCUD.

Consistency of Liao's proposal
Let a^(1), . . . , a^(n) be points in [0, 1]^s. In Liao's proposal these points have low s-dimensional discrepancy. Let x^(i) = a^(τ(i)) where τ is a uniform random permutation of 1, . . . , n. Liao's proposal is to concatenate the points x^(i) into a driving sequence for MCMC. We need to show that those points are WCUD. It is very natural to make the dimension s of the reordered QMC points equal the number m of driving points consumed by one transition of the chain. It is not however necessary to use m = s, as we show below.
Let u_1, . . . , u_{sn} ∈ [0, 1] be the components of the points x^(i) concatenated. The point u_i comes from the ⌈i/s⌉'th point in the sequence; specifically

u_i = x^(⌈i/s⌉)_{i − s(⌈i/s⌉ − 1)}.

Table 1
A sequence of 4 dimensional points x (i) , represented in the top row, is concatenated into a sequence of scalars u i , represented in the middle row. Those scalars are then regrouped into the 7 dimensional points z (i) as shown in the bottom row.
Now let z^(1), . . . , z^(m) be the points u_i regrouped into m = ⌊sn/d⌋ consecutive batches of length d, leaving out the last sn − dm of the u_i if m < sn/d. That is,

z^(i) = (u_{(i−1)d+1}, . . . , u_{id}), i = 1, . . . , m.    (20)

The situation is illustrated in Table 1 for the case with s = 4 dimensional points x^(i) regrouped into d = 7 dimensional points z^(i). The middle row in Table 1 corresponds to the last row in Figure 1, and the bottom row of Table 1 shows the vectors we need to analyze the d-dimensional discrepancy of the driving sequence for d ≠ s. We will avoid working directly with the indices in (20) by breaking the x^(i) into chunks. To illustrate, consider the point z^(2) in the example of Table 1 and let z ∈ [0, 1]^7. Then z^(2) ∈ [0, z] if and only if

x^(2)_4 ∈ [0, z_1], x^(3) ∈ [0, z_{2:5}], and x^(4)_{1:2} ∈ [0, z_{6:7}]

all hold. The next lemma takes care of the main details needed to get a bound for the discrepancy.
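The reordering and regrouping are straightforward to sketch in Python; the 16 x 4 matrix A below is a toy stand-in for a QMC point set, not a genuine low discrepancy construction.

```python
import numpy as np

def liao_driving_sequence(A, rng):
    """Randomly reorder the rows of an n x s QMC matrix A and
    concatenate them into one driving vector u (Liao's proposal)."""
    X = A[rng.permutation(A.shape[0])]
    return X.reshape(-1)

def regroup(u, d):
    """Regroup u into m = floor(len(u)/d) non-overlapping d-tuples
    z^(i), dropping the trailing remainder as in equation (20)."""
    m = len(u) // d
    return u[:m * d].reshape(m, d)

rng = np.random.default_rng(7)
A = (np.arange(16)[:, None] * np.array([1, 7, 11, 13]) % 16) / 16  # toy 16 x 4 matrix
u = liao_driving_sequence(A, rng)   # length 64, as in the middle row of Table 1
z = regroup(u, 7)                   # nine 7-dimensional points, bottom row of Table 1
print(u.shape, z.shape)
```

With s = 4 and d = 7 each z^(i) draws its chunks from at most C = ⌈(d − 1)/s⌉ + 1 = 3 consecutive points x^(i), which is the chunk decomposition the lemma below exploits.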
Lemma 3. For i = 1, . . . , n, let a^(i) be points in [0, 1]^s with star discrepancy at most D. Let x^(i) = a^(τ(i)) where τ is a uniformly distributed random permutation of 1, . . . , n. For i = 1, . . . , ⌊ns/d⌋ let z^(i) be obtained as in equation (20). Assume that D < 1/3 and that n > 3⌈(d − 1)/s⌉. Then for any z ∈ [0, 1]^d,

|Pr(z^(i) ∈ [0, z]) − Vol([0, z])| ≤ K_{11} D + K_{12}/n,

where K_{11}, K_{12} < ∞ depend only on the number C of chunks, and K_{12} = 0 when d = 1.

Proof. Because the x^(i) are a permuted version of the a^(i) they have the same star discrepancy, which is at most D. Furthermore, for 1 ≤ j ≤ k ≤ s, the (k − j + 1)-dimensional star discrepancy of the x^(i)_{j:k} is also at most D. For chunks c = 1, . . . , C let L(c) and U(c) be the first and last indices in z^(i) that are taken from chunk c. Let j(i) be the integer such that the first component z^(i)_1 is taken from x^(j(i)). Then chunk c of z^(i) is taken from x^(j(i)+c−1) for c = 1, . . . , C. Let ℓ(c) and u(c) be the first and last indices in x^(j(i)+c−1) that are used to form the c'th chunk of z^(i), and let N_c count the number of a^(k) whose ℓ(c):u(c) subcomponents are in the box [0, z_{ℓ(c):u(c)}].
To streamline notation we write x_c for the c'th chunk x^(j(i)+c−1)_{ℓ(c):u(c)}. What is random in (23) is the selection, by simple random sampling, of which a^(k) will become x^(j(i)+c−1). The conditional probability for chunk c is between max((N_c − c + 1)/(n − c + 1), 0) and min(N_c/(n − c + 1), 1), depending on how many of the suitable a^(k) were 'used up' for chunks 1, . . . , c − 1. Clipping these bounds to 0 and 1 is only necessary to handle some extreme cases. We use the notation y_+ to denote max(y, 0). The last step in the lower bound (24) is quite conservative. It follows from an expansion of the previous line into 2^C terms, of which one is Vol([0, z]) and the others have alternating signs and are all of smaller magnitude than D + (C − 1)/n. The upper bound on D and the lower bound on n suffice to give D + (C − 1)/n < 1 so that the largest terms are not (D + (C − 1)/n)^C. When D + (C − 1)/n is very small then the quantity 2^C − 1 can be replaced by one almost as small as C.
Theorem 5. For i = 1, . . . , n let a^(i) ∈ [0, 1]^s have star discrepancy D*_n → 0 as n → ∞. Let x^(i) = a^(τ(i)) where τ is a uniform random permutation of 1, . . . , n, and let the z^(i) be formed as in equation (20). Then for every integer d ≥ 1, every z ∈ [0, 1]^d, and every ǫ > 0, Pr(|δ^d_n(z)| > ǫ) → 0 as n → ∞; that is, the regrouped points are pointwise WCUD.

Proof. Let Y_i = 1 if z^(i) ∈ [0, z] and Y_i = 0 otherwise, and let v = Vol([0, z]). For large enough n both D*_n < 1/3 and n ≥ 3d hold. Using Lemma 3 we find that |E(Y_i) − v| ≤ K_{11} D*_n + K_{12}/n holds for large enough n, where the K_{1ℓ} < ∞ are constants from Lemma 3. Also, K_{12} = 0 when d = 1. Now suppose that Y_i and Y_j are well separated in that |i − j| ≥ S ≡ 1 + ⌈(d − 1)/s⌉. Then none of the points x^(k) contribute chunks to both z^(i) and z^(j), and the same argument used in Lemma 3 can be adapted to show that, for large enough n, E(Y_i Y_j) is within K_{21} D*_n + K_{22}/n of v^2 for constants K_{2ℓ} < ∞. The argument simply requires studying the combined set of chunks for z^(i) and z^(j). Then n^2 E(δ^d_n(z)^2) may be expanded and bounded, giving (26) and hence also (27) by Markov's inequality. Finally (28) follows by counting the number of z^(i) ≤ z.

Corollary 1. Under the conditions of Theorem 5, Liao's driving sequence is weakly consistent for Metropolis-Hastings sampling.
Proof. Liao's proposal generates pointwise WCUD points by Theorem 5 and hence WCUD points by Lemma 1. They are then weakly consistent by Theorem 1.
The proposal of Liao (1998) leads to local discrepancies δ^d_n(z) that vanish, but are not particularly small except for d = 1. Liao's motivating application was Gibbs sampling, where the number m of variates required for one cycle matches the dimension s of the quasi-Monte Carlo points.
That proposal gets better than Monte Carlo stratification for the u_i individually and for consecutive q-tuples such as (u_{km+r_1}, u_{km+r_2}, . . . , u_{km+r_q}) for k ≥ 0 that nest within consecutive m = s tuples. The proposal does not get particularly good discrepancy even for consecutive pairs (u_i, u_{i+1}), because such a pair straddles the boundary between two of the underlying QMC points when i is a multiple of s.
Suppose that we have a problem for which we want to stratify successive updates to the j'th component of ω. We might want roughly the right number of low and high proposals for that component and roughly the right probability for consecutive pairs or triples of proposals. In such a case, we might wish to arrange that s consecutive proposals for the j'th component of ω are generated from one of the original s dimensional QMC points.
Similarly we might have a problem in which we wish to treat the acceptance-rejection step of Metropolis-Hastings specially. We could then take s = m − 1 and use one QMC point for each proposal we need and one QMC point for each set of s consecutive acceptance-rejection decisions.
Whether such alternative schemes work well depends of course on how well the scheme matches the problem. But such schemes can be used to generate consistent samples. Each scheme takes the points of a driving sequence, such as Liao's proposal, groups them into consecutive blocks of r = ms points, and applies a fixed permutation within each block. That permutation operation preserves the WCUD property:

Theorem 6. Let a^(i) ∈ [0, 1]^s for i = 1, . . . , n have star discrepancy D*^s_n → 0 as n → ∞. Let x^(i) = a^(τ(i)) where τ is a random permutation of {1, . . . , n}. Let v_1, . . . , v_{sn} be the components of the x^(i) concatenated, and let u_1, . . . , u_{sn} be obtained from the v_i by applying a fixed permutation within each consecutive block of r points. Then the u_i are WCUD.
Proof. The v i are the driving sequence proposed by Liao (1998). The u i are a permutation of them in which no element is moved by more than r positions. The arguments in Lemma 3 and Theorem 5 go through as before. All that has changed is the number and identity of chunks contributing to a consecutive d-tuple of u i 's.
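The block permutation covered by Theorem 6 can be sketched as follows: the driving sequence is cut into consecutive blocks of r points and one fixed permutation is applied inside every block, with any trailing remainder left in order.

```python
import numpy as np

def permute_within_blocks(u, perm):
    """Apply one fixed permutation 'perm' inside each consecutive block
    of r = len(perm) points of the driving sequence u; the trailing
    remainder (fewer than r points) is kept in its original order."""
    r = len(perm)
    n = len(u) // r
    out = u.copy()
    out[:n * r] = u[:n * r].reshape(n, r)[:, perm].reshape(-1)
    return out

u = np.arange(10) / 10.0
print(permute_within_blocks(u, np.array([3, 0, 1, 2])))
```

No element moves by more than r positions, which is exactly the property the proof above relies on.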

Acceptance-rejection sampling
It is often impractical, if not impossible, to generate a transition using an a priori fixed number of members u i of the driving sequence. The primary example of such a method is acceptance-rejection sampling, which we sketch here to fix ideas.
To sample a real valued y from a probability mass or density function f we begin by sampling y from g instead where f (y) ≤ cg(y) holds for all y ∈ R for some constant c ∈ [1, ∞). Then we accept y with probability f (y)/(cg(y)). If y is not accepted then we keep on sampling from g until a point is accepted.
For an illustration of acceptance-rejection sampling suppose that g has CDF G with an efficiently computable inverse G^{−1}. Then to get a sample from f, let v_j ∼ U [0, 1], IID, for j ≥ 1, put y_j = G^{−1}(v_{2j−1}), let j* be the smallest j with v_{2j} ≤ f(y_j)/(c g(y_j)), and then deliver y = y_{j*}. To use this method one needs to be able to compute the functions G^{−1} and f /(cg). More elaborate versions use k ≥ 1 uniformly distributed v_j to produce each proposal. Liao (1998) proposes to handle acceptance-rejection sampling by using the QMC points to make the first two draws from g and the first two acceptance-rejection decisions. If the first two points are both rejected then he suggests drawing from an IID U [0, 1] sequence until a point is accepted before switching back to the QMC points. To simplify matters we'll suppose that only one acceptance-rejection step is tried with the QMC points, and that we switch to IID points if that one is rejected. Derandomized adaptive rejection sampling (Hörmann et al., 2004) can be used to construct proposals that are accepted with probability arbitrarily close to unity, under a generalized concavity assumption on the density.
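The inversion-based acceptance-rejection scheme sketched above can be written as follows; the target f(y) = 2y on [0, 1] with g = U [0, 1] and c = 2 is a toy example chosen so that f /(cg) is simply y.

```python
import numpy as np

def accept_reject(f_over_cg, G_inv, v):
    """Acceptance-rejection consuming uniforms v in pairs: v[2j] is
    inverted through G to propose y_j, and v[2j+1] accepts y_j with
    probability f(y_j)/(c g(y_j)).  Returns the first accepted y and
    the number of uniforms consumed."""
    for j in range(0, len(v) - 1, 2):
        y = G_inv(v[j])
        if v[j + 1] <= f_over_cg(y):
            return y, j + 2
    raise RuntimeError("ran out of driving uniforms")

# Toy target: f(y) = 2y on [0, 1], envelope g = U[0, 1], c = 2,
# so f/(cg) = y and G_inv is the identity.
rng = np.random.default_rng(3)
y, used = accept_reject(lambda y: y, lambda v: v, rng.random(100))
print(y, used)
```

The variable `used` shows why the number of driving points consumed per transition is random, which is exactly the difficulty addressed below.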
To continue the illustration, suppose that the proposal ω̃ has 3 components. The first two are generated by inversion, each using one point from [0, 1], while the third is done by acceptance-rejection sampling. Then the driving sequence for the first n transitions can be represented in a tableau in which row i contains u_{5i−4}, . . . , u_{5i} followed by v_{i1}, v_{i2}, . . . . The points u_i are WCUD and the points v_{ij} are independent U [0, 1]. The i'th row of the table drives the transition from ω^(i) to ω^(i+1). For the proposal ω̃^(i+1), u_{5i−4} generates the first component, u_{5i−3} generates the second component, u_{5i−2} makes the first draw from g for the third component, u_{5i−1} is used to accept or reject that draw, and u_{5i} is used to accept or reject the entire proposal ω̃^(i+1) as ω^(i+1). In the event that u_{5i−1} leads to rejection of the third component's first draw, the infinite sequence v_{i1}, v_{i2}, . . . is used to continue acceptance-rejection until a third component is generated for the i'th proposal.
The difficulty with acceptance-rejection sampling is that the set of driving points for which a transition from ω to ω ′ is proposed does not have a fixed finite dimension. It is a union of regions whose dimensions depend on the number of proposals rejected during the course of acceptance-rejection sampling. This variable dimension complicates discrepancy based methods for studying the driving sequence.
The tabulation above suggests a coupling argument. We replace the sequence v_{i1}, v_{i2}, v_{i3}, . . . by a single point v_i ∈ [0, 1]. The values v_i are independent U [0, 1] random variables such that the random variable finally generated by acceptance-rejection would also have been generated by inversion through v_i. If the component is continuously distributed with CDF H, then v_i = H(y_i) for the delivered value y_i is U [0, 1], independent of all other driving variables. That is, using acceptance-rejection on the third component can be coupled with the use of inversion for the third component, based on a driving sequence whose i'th row is u_{5i−4}, . . . , u_{5i}, v_i. Liao's padding proposal for acceptance-rejection sampling will work so long as inserting an IID U [0, 1] sequence into his permuted points at regular intervals, as illustrated above, preserves the WCUD property. This holds more generally: if we insert an IID U [0, 1] sequence at regular intervals into a CUD or an independent WCUD sequence, then the result is a WCUD sequence. Inserting IID points increases the length of a finite sequence, so that row N of the original sequence becomes a longer row of the new sequence.
Theorem 7. For N ≥ 1 let (v_{N,i})_{i≥1} be a triangular array of points in [0, 1], let (w_i)_{i≥1} be IID U[0, 1], and for an integer m ≥ 1 let (u_{N,i})_{i≥1} be formed by inserting one point of w after every m consecutive points of v_{N,·}. If v_{N,i} are WCUD and independent of w_i, then u_{N,i} are WCUD.
Proof. Let d = r(m + 1) for integer r ≥ 1 and choose z ∈ [0, 1]^d. For k = 1, ..., ⌊N/d⌋, the d-tuple x^{(k)} = (u_{N,(k-1)d+1}, ..., u_{N,kd}) has r components from w and rm components from v_{N,i}. Let A denote the components from w and B the components from v_{N,i}. The components in A are IID U[0, 1] and independent of those in B, which inherit the WCUD property. It follows that Pr(δ^d_{⌊N/d⌋}(z) > 0) → 0 as N → ∞. Invoking Lemma 1 and Theorem 3 completes the proof.
If some fixed number k ≥ 1 of components are to be sampled by acceptance-rejection at each step, then we simply apply Theorem 7 k times, inserting k independent streams of IID U[0, 1] random variables.
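The insertion operation in Theorem 7 is simple to implement. The following sketch (function and variable names are ours) pads a row of the array with one fresh IID point after every m of its points:

```python
import random

def insert_iid(v_row, m, rng):
    """Insert one fresh IID U[0, 1] point after every m points of v_row,
    as in the padding construction: the result has one extra coordinate
    per complete block of m."""
    out = []
    for start in range(0, len(v_row), m):
        block = v_row[start:start + m]
        out.extend(block)
        if len(block) == m:  # only pad complete blocks
            out.append(rng.random())
    return out

rng = random.Random(0)
v_row = [i / 10 for i in range(1, 10)]  # stand-in for one row of a WCUD array
u_row = insert_iid(v_row, 3, rng)
print(len(u_row))  # 9 v-points plus 3 inserted points -> 12
```

The original points keep their relative order; only every (m+1)'st coordinate of the new row is an IID padding point.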

Remark 1. It is important that one of the streams in Theorem 7 be IID.
Merging two independent WCUD streams does not necessarily produce a WCUD result. For example, if the two WCUD sequences u_i and v_i are independent Cranley-Patterson rotations of the same underlying deterministic sequence, then u_1, v_1, u_2, v_2, ... will not be WCUD, because every pair of the form (u_i, v_i) lies on one of at most two parallel line segments in [0, 1]^2.
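This failure mode is easy to see numerically. The sketch below (our own construction, using the golden-ratio Kronecker sequence as a stand-in for the underlying deterministic sequence) applies two independent Cranley-Patterson rotations to the same points; every pair (u_i, v_i) then has the same difference modulo 1, so the pairs are confined to at most two line segments in the unit square.

```python
import math
import random

rng = random.Random(42)
r1, r2 = rng.random(), rng.random()  # two independent rotation points

phi = (math.sqrt(5) - 1) / 2
base = [(i * phi) % 1.0 for i in range(1, 1001)]  # deterministic base sequence

u = [(b + r1) % 1.0 for b in base]  # first Cranley-Patterson rotation
v = [(b + r2) % 1.0 for b in base]  # second, independent rotation

# Every pair (u_i, v_i) satisfies v_i - u_i = r2 - r1 (mod 1), so the pairs
# fall on at most two parallel line segments -- far from uniform in [0,1]^2.
diffs = {round((vi - ui) % 1.0, 12) for ui, vi in zip(u, v)}
print(len(diffs))  # 1, up to floating-point rounding
```

A genuinely uniform point set in [0, 1]^2 would show essentially all distinct differences; here the merged stream carries no two-dimensional balance at all.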

Example: Probit regression
In this section we apply a Gibbs sampling scheme developed by Albert and Chib (1993) to a probit regression example of Finney (1947). For i = 1, ..., 39, the response Y_i ∈ {0, 1} is 1 if the subject exhibited vasoconstriction and 0 otherwise. The predictors are X_i = (X_{i1}, X_{i2}), where the first component is the volume of air inspired and the second is the rate at which air is inspired. The probit model has Y_i = 1_{Z_i > 0} where the Z_i ∼ N(β_0 + β_1 X_{i1} + β_2 X_{i2}, 1) are conditionally independent given X_1, ..., X_n and β = (β_0, β_1, β_2)′. The data are shown in Figure 2.
Taking a non-informative prior for β, the full conditional distribution of β given Z_1, ..., Z_n is N((X′X)^{-1}X′Z, (X′X)^{-1}), where X is the n by 3 matrix with i'th row x_i′ = (1, X_{i1}, X_{i2}) and Z is the column vector of Z_i values. If Y_i = 1 then the full conditional distribution of Z_i given β and Y is the N(x_i′β, 1) distribution truncated to (0, ∞); if Y_i = 0 it is that distribution truncated to (−∞, 0]. To run the Gibbs sampler we need only invert the normal CDF to obtain the normal and truncated normal full conditionals. We used the SSJ package of L'Ecuyer (2008), which includes a high accuracy inverse normal CDF based on Blair et al. (1976).
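As a minimal sketch of the truncated-normal inversion step, the code below samples the full conditional of Z_i by inverting the normal CDF, driven by a single point u. Python's statistics.NormalDist.inv_cdf stands in here for the Blair et al. (1976) routine in SSJ, and the mean value 0.7 is an arbitrary illustration, not a fitted quantity.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal; inv_cdf is a high-accuracy inverse CDF

def truncated_normal(mu, y, u):
    """Inversion sampling for Z ~ N(mu, 1) truncated to (0, inf) when
    y == 1 and to (-inf, 0] when y == 0, driven by one point u in (0, 1)."""
    p0 = N01.cdf(-mu)  # P(Z <= 0) for the untruncated normal
    if y == 1:
        return mu + N01.inv_cdf(p0 + u * (1.0 - p0))
    return mu + N01.inv_cdf(u * p0)

rng = random.Random(3)
draws = [truncated_normal(0.7, 1, rng.random()) for _ in range(5000)]
print(min(draws) > 0.0)  # True: every draw respects the truncation
```

Because each Z_i update consumes exactly one driving point, the Gibbs sampler fits the fixed-dimension framework analyzed above, with no acceptance-rejection needed.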
The dimension of this simulation is m = 42 variables per step. In our lattice sampling we let N be a prime number and choose a to be a primitive root modulo N. Suppose at first that m and N − 1 are relatively prime. Then the driving sequence we use is obtained by scanning the following matrix from left to right and top to bottom,

    0             0             ...   0
    a^1           a^2           ...   a^m
    a^{m+1}       a^{m+2}       ...   a^{2m}         (mod N)
    ...                               ...
    a^{(N-2)m+1}  a^{(N-2)m+2}  ...   a^{(N-1)m}

and then dividing by N and applying the same Cranley-Patterson rotation to each m = 42 dimensional row. Starting in the second row, the matrix contains the successive outputs of a linear congruential generator (LCG) x_{k+1} = a x_k mod N with initial seed 1. Unlike LCGs, integration lattices contain a point at the origin, introduced here via the first row. When m and N − 1 are not relatively prime, the simple scheme above does not utilize all N − 1 m-tuples of the LCG. For the general case let g = GCD(m, N − 1), so that g = 1 when m and N − 1 are relatively prime. Then the table above contains the 0 vector followed by g identical blocks of b = (N − 1)/g rows. We multiply the k'th such block by a^{k-1} (modulo N) in order to get all m-tuples of the LCG into the simulation.
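The general construction can be sketched as follows (function name is ours): build the zero row, one block of b rows of consecutive LCG outputs, and then the g multiplied copies of that block. The small case N = 11, a = 2, m = 4 has g = GCD(4, 10) = 2 and b = 5, and the check confirms that all N − 1 = 10 distinct m-tuples of the LCG appear.

```python
from math import gcd

def lcg_rows(N, a, m):
    """Rows of the driving table: the zero row, then g blocks of
    b = (N-1)/g rows of m consecutive LCG outputs x -> a*x mod N
    (seed 1); the k'th block is multiplied by a^(k-1) mod N so that
    all N-1 m-tuples of the LCG appear."""
    g = gcd(m, N - 1)
    b = (N - 1) // g
    rows = [[0] * m]
    x = 1
    block = []
    for _ in range(b):
        row = []
        for _ in range(m):
            x = (a * x) % N
            row.append(x)
        block.append(row)
    mult = 1
    for _ in range(g):
        rows.extend([[(mult * e) % N for e in row] for row in block])
        mult = (mult * a) % N
    return rows  # divide entries by N and rotate each row to get the points

rows = lcg_rows(11, 2, 4)
tuples = {tuple(r) for r in rows[1:]}
print(len(tuples))  # 10: all N - 1 m-tuples of the LCG are covered
```

Multiplying a block by a shifts each m-tuple one step along the LCG cycle, which is why the g scaled blocks together cover every starting position.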
Ideally a should be a good multiplier for a linear congruential generator, so that consecutive tuples are nearly equidistributed. If possible, a^m should also be a good multiplier, so that consecutive updates to a given parameter are also well equidistributed.
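Whether a candidate a is a primitive root modulo the prime N is easy to verify: a generates the full multiplicative group exactly when a^{(N−1)/p} ≢ 1 (mod N) for every prime p dividing N − 1. A small self-contained sketch (our own helper, not from the paper):

```python
def is_primitive_root(a, N):
    """Check that a generates the full multiplicative group modulo the
    prime N, i.e. a^((N-1)/p) != 1 (mod N) for every prime p | N - 1."""
    n = N - 1
    primes = []
    d, t = 2, n
    while d * d <= t:  # trial-divide n to find its prime factors
        if t % d == 0:
            primes.append(d)
            while t % d == 0:
                t //= d
        d += 1
    if t > 1:
        primes.append(t)
    return all(pow(a, n // p, N) != 1 for p in primes)

# 2 is a primitive root mod 11; 3 is not (3^5 = 243 = 1 mod 11).
print(is_primitive_root(2, 11), is_primitive_root(3, 11))  # True False
```

The same check applied to pow(a, m, N) tells whether a^m also generates the group, which is possible only in the g = 1 case since the order of a^m is (N − 1)/GCD(m, N − 1).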
We applied our technique with the values of N and a shown in Table 2; these are from L'Ecuyer (1999). We also applied Liao's method to the same problem. Each method was repeated independently 300 times in order to estimate the sampling variance. Table 3 shows estimated variance reduction factors for the posterior means of β_j. The 0.975 point of the F_{299,299} distribution is 1.25, so individual ratios between 0.8 and 1.25 should not be considered statistically significant.

Table 2. Shown are the parameters of the LCGs used to drive the Gibbs sampler for the Finney (1947) probit model. Five prime numbers N near powers of two are shown, along with five corresponding primitive roots a from L'Ecuyer (1999). In each case g is the greatest common divisor of m and N − 1, and the simulation goes through g blocks of b = (N − 1)/g m-tuples (m = 42) from the LCG.

Some trends are clear in Table 3. The variance reductions for all three regression coefficients track each other very closely. Liao's method typically gives a variance reduction of about 20 fold. The LCG method gives a variance reduction that tends to increase with sample size but is not perfectly monotone; the quality of the underlying lattices may not be monotone in N. For larger N the LCG approach performed better than Liao's. For smaller N, Liao's approach did slightly better, though perhaps not significantly so.
The QMC-MCMC methods also reduce the variance of the estimated posterior means for the latent parameters Z_1, ..., Z_{39}, sometimes by very large amounts. When |Z_i| is large then the variance reduction for it is nearly the same as we see for the coefficients β_j. When |Z_i| is small, corresponding to cases near the borderline, then variance reductions of several hundred fold are attained. Figure 3 plots the variance reductions for latent parameters versus the estimated values of those latent parameters. The curves corresponding to the largest and smallest sample sizes are shown. The curves for the other sample sizes are qualitatively similar. The LCG version attains some much larger variance reductions, sometimes over 500-fold, for the Z_i near 0. Table 4 shows summary statistics of the variance reductions.
In MCMC sampling there is usually a bias because the chain only approaches its stationary distribution asymptotically. Variance reductions are most meaningful when the biases of two methods are comparable and small. Because the sample values of all 42 parameters averaged over 300 replications are essentially identical for all 3 methods at every N , we are sure that the biases of all of these methods are nearly identical here. Table 5 summarizes the evidence on bias.

Discussion
This paper has produced some specific constructions of WCUD sequences, has given general methods that convert WCUD sequences into other WCUD sequences, and has found conditions that simplify the task of proving that a sequence is WCUD. Some further results appear in the thesis of Tribble (2007). In particular, Tribble (2007) establishes results parallel to the ones here, for methods that use small feedback shift register generators instead of small congruential generators. Tribble (2007) also introduces a skipping method that simplifies the process of running through all s-tuples of a small random number generator.
Our motivating application for studying (W)CUD sequences is MCMC, especially in continuous state spaces. The sequences we construct take values in a continuous space, and the transformations we apply are those for continuous random variables. For technical reasons, we have analyzed the methods for discrete state spaces. We expect that some further, though hopefully mild, regularity conditions will be required in the continuous case. For now, it is encouraging that nothing seemed to go awry in the continuous example that we ran.
Our analysis of Liao's shuffling proposal shows that it only improves the rate of convergence for one dimensional discrepancies. This fact suggests that his proposal will affect the constant, but not ordinarily the rate, in MCMC convergence. The numerical results appear to bear this out. It is however surprising to see as much as a 20 fold variance reduction from a method that only improves certain one dimensional histograms of the input sequence. This must mean that those one dimensional aspects are relatively important compared to higher dimensional and more subtle features. Such a phenomenon has been seen before in finite dimensional applications of QMC: the effective dimension can be much smaller than the nominal one, as described in Caflisch et al. (1997). Of course not all integrands have low effective dimension and we would not expect large variance reductions every time MCMC was applied. Furthermore the posterior moments we studied are smooth functions of their arguments, and this plays to a strength of QMC.
The LCG scheme by contrast shows steady improvement with increasing sample size, though we have no theory that applies to the rate of convergence, and no reason to expect that a rate better than the Monte Carlo rate can be attained for MCMC problems. Against the possibility of better LCG performance there is a tradeoff: Liao's method is very simple to use, while LCGs require parameter searches.