Discrepancy estimates for variance bounding Markov chain quasi-Monte Carlo

Markov chain Monte Carlo (MCMC) simulations are modeled as driven by true random numbers. We consider variance bounding Markov chains driven by a deterministic sequence of numbers. The star-discrepancy provides a measure of efficiency of such Markov chain quasi-Monte Carlo methods. We define a push-back discrepancy of the driver sequence and state a close relation to the star-discrepancy of the Markov chain quasi-Monte Carlo samples. We prove that there exists a deterministic driver sequence such that the discrepancies decrease almost with the Monte Carlo rate $n^{-1/2}$. As for MCMC simulations, a burn-in period can also be taken into account for Markov chain quasi-Monte Carlo to reduce the influence of the initial state. In particular, our discrepancy bound leads to an estimate of the error for the computation of expectations. To illustrate our theory we provide an example for the Metropolis algorithm based on a ball walk. Furthermore, under additional assumptions we prove the existence of a driver sequence such that the discrepancy of the corresponding deterministic Markov chain sample decreases with order $n^{-1+\delta}$ for every $\delta>0$.


Introduction
Markov chain Monte Carlo (MCMC) simulations are used in different branches of statistics and science to estimate an expected value with respect to a probability measure, say $\pi$, by the sample average of the Markov chain. This procedure is of advantage if random numbers with distribution $\pi$ are difficult to construct.
When sampling the Markov chain the transitions are usually modeled as driven by i.i.d. $\mathcal{U}(0,1)^s$ random variables for some $s \ge 1$. But in simulations the driver sequences are pseudo-random numbers. In many applications, if one uses a carefully constructed random number generator, this works well. Instead of modeling the Markov chain with random numbers, or imitating random numbers, the idea of Markov chain quasi-Monte Carlo is to construct, for all $n \in \mathbb{N}$, a finite deterministic sequence of numbers $(u_i)_{0 \le i \le n}$ in $[0,1]^s$, to generate a deterministic Markov chain sample and to use it to estimate the desired mean.
The motivation of this conceptual change is that carefully constructed sequences may lead to more accurate sample averages. For example, quasi-Monte Carlo (QMC) points lead to a higher order of convergence compared to plain Monte Carlo, which is a special case of MCMC. Numerical experiments for QMC versions of MCMC also show promising results [LS06,Lia98,OT05,Sob74,Tri07]. In particular, Owen and Tribble [OT05] and Tribble [Tri07] report an improvement by a factor of up to $10^3$ and a better convergence rate for a Gibbs sampler problem.
In the work of Chen, Dick and Owen [CDO11] and Chen [Che11] the first theoretical justification for Markov chain quasi-Monte Carlo on continuous state spaces is provided. The authors show a consistency result if a contraction assumption is satisfied and the random sequence is substituted by a deterministic 'completely uniformly distributed' sequence, see [CDO11,CMNO12,TO08]. Thus the sample average converges to the expected value but we do not know how fast this convergence takes place.
Recently, in [DRZ13] another idea appears. Namely, the question is considered whether there exists a good driver sequence such that an explicit error bound is satisfied. It is shown that if the Markov chain is uniformly ergodic, then for any initial state a deterministic sequence exists such that the sample average converges to the mean almost with the Monte Carlo rate.
However, in [CDO11] and [DRZ13] rather strong conditions, the contraction assumption and uniform ergodicity, are imposed on the Markov chain.
We substantially extend the results of [DRZ13] to Markov chains which satisfy a much weaker convergence condition. Namely, we consider variance bounding Markov chains, introduced by Roberts and Rosenthal in [RR08], and show existence results of good driver sequences. In the following we describe the setting in detail and explain our main contributions.

Main results
The MCMC sampling can be represented via $X_{i+1} = \varphi(X_i, U_i)$ for $i \ge 1$, with $X_1 = \psi(U_0)$, where the $U_i \sim \mathcal{U}(0,1)^s$ are i.i.d. The state $X_i$ is an element of $G \subseteq \mathbb{R}^d$, the function $\varphi : G \times [0,1]^s \to G$ is called update function and $\psi : [0,1]^s \to G$ is called generator function. The update function corresponds to a transition kernel, say $K$. For $f : G \to \mathbb{R}$ let $E_\pi(f) = \int_G f(x)\,\pi(\mathrm{d}x)$ be the desired mean and $Pf(x) = \int_G f(y)\,K(x,\mathrm{d}y)$ be the Markov operator induced by the transition kernel $K$. We assume that the transition kernel is reversible with respect to the distribution $\pi$ and that it is variance bounding, see [RR08]. Roughly, a Markov chain is variance bounding if the asymptotic variances for functionals with unit stationary variance are uniformly bounded. Equivalent to this is the assumption that $\Lambda < 1$ with $\Lambda = \sup\{\lambda \in \sigma(P - E_\pi \mid L_2)\}$ (1), where $\sigma(P - E_\pi \mid L_2)$ denotes the spectrum of $P - E_\pi$ on $L_2$. For example let us consider the two state Markov chain which always jumps from one state to the other one. It is periodic and satisfies $\Lambda = -1$, thus it is variance bounding. With this toy example in mind let us point out that the Markov chain does not need to be uniformly or geometrically ergodic, it might even be periodic, and the distribution of $X_i$, for $i$ arbitrarily large, is not necessarily close to $\pi$.
By a deterministic sequence $(u_i)_{i \ge 0}$ we generate the deterministic Markov chain $(x_i)_{i \ge 1}$ with $x_1 = \psi(u_0)$ and $x_{i+1} = \varphi(x_i, u_i)$ for $i \ge 1$. The efficiency of this procedure is measured by the star-discrepancy, a generalized Kolmogorov-Smirnov test, between the stationary measure $\pi$ and the empirical distribution $\widehat{\pi}_n(A) = \frac{1}{n}\sum_{i=1}^{n} 1_{x_i \in A}$, that is, by $D^*_{\mathscr{A},\pi}(S_n) = \sup_{A \in \mathscr{A}} |\widehat{\pi}_n(A) - \pi(A)|$ with $S_n = \{x_1,\dots,x_n\}$, where $\mathscr{A}$ denotes a certain set of subsets of $G$. By inverting the iterates of the update function we also define a push-back discrepancy of the driver sequence (the test sets are pushed back). We show that for large $n \in \mathbb{N}$ both discrepancies are close to each other.
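The two state toy example can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names and the test function `f` are illustrative assumptions. It uses the fact that a 2x2 stochastic matrix has the eigenvalues 1 and trace minus 1, and it shows that the sample average along the periodic flip chain still approximates $E_\pi(f)$ with an error of order $1/n$.

```python
def second_eigenvalue(P):
    """Second eigenvalue of a 2x2 stochastic matrix, which equals trace(P) - 1."""
    return P[0][0] + P[1][1] - 1.0


def flip_chain_average(f, n, start=0):
    """Sample average of f along the chain that always jumps to the other state."""
    x, total = start, 0.0
    for _ in range(n):
        total += f[x]
        x = 1 - x
    return total / n


P_flip = [[0.0, 1.0], [1.0, 0.0]]
f = [3.0, 7.0]                      # arbitrary test function on the states {0, 1}
mean_pi = 0.5 * (f[0] + f[1])       # the stationary distribution is uniform

lam = second_eigenvalue(P_flip)     # spectrum on mean-zero functions
err_even = abs(flip_chain_average(f, 1000) - mean_pi)
err_odd = abs(flip_chain_average(f, 1001) - mean_pi)
```

The chain never converges in distribution from a fixed start, yet the averaging over the trajectory is what matters for the estimate of the mean.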
The main result, in a general setting, is an estimate of $D^*_{\mathscr{A},\pi}(S_n)$ (Theorem 2) under the assumption that we have an approximation of $\mathscr{A}$, for any $\delta > 0$, given by a so-called $\delta$-cover $\Gamma_\delta$ of $\mathscr{A}$ with respect to $\pi$ (Definition 5). The proof of the main result is based on a Hoeffding inequality for Markov chains. After that we prove that a sufficiently good $\delta$-cover exists if $\pi$ is absolutely continuous with respect to the Lebesgue measure and the set of test sets is the set of open boxes restricted to $G$ anchored at $-\infty$, i.e. we consider the set of test sets $\mathscr{B} = \{(-\infty, x) \cap G \mid x \in \mathbb{R}^d\}$. For this set of test sets the integration error of a function $f$ is bounded in terms of the star-discrepancy and the norm $\|f\|_{H_1}$ defined in (24). Thus a bound on the discrepancy leads to an error bound for the approximation of $E_\pi(f)$. We show for all $n \ge 16$ that there exists a driver sequence $u_0, \dots, u_{n-1} \in [0,1]^s$ such that $S_n = \{x_1, \dots, x_n\}$, given by $x_1 = \psi(u_0)$ and $x_{i+1} = \varphi(x_i; u_i)$, satisfies an explicit bound on $D^*_{\mathscr{B},\pi}(S_n)$ in terms of $\frac{d\nu}{d\pi}$ and $\Lambda_0$, where $\frac{d\nu}{d\pi}$ is the density of $\nu = P_\psi$ (the probability measure induced by $\psi$) with respect to $\pi$ and $\Lambda_0 = \max\{\Lambda, 0\}$ with $\Lambda$ defined in (1). For the details we refer to Corollary 4 below. This implies, by the Koksma-Hlawka inequality, that the sample average converges to the mean with $O(n^{-1/2} (\log n)^{1/2})$.
Additionally we might take a burn-in period of $n_0$ steps into account to reduce the dependence on the initial state in the discrepancy bound. Roughly, the idea is to generate a sequence $x_1, \dots, x_{n_0+n}$ by the Markov chain quasi-Monte Carlo procedure and to consider the discrepancy of the last $n$ states, i.e. of $S_{[n_0,n]} = \{x_{n_0+1}, \dots, x_{n_0+n}\}$. Under suitable convergence conditions of the Markov chain, for example the existence of an absolute $L_2$-spectral gap (see Definition 1), the density $\frac{d(\nu P^{n_0})}{d\pi}$ is close to 1, see Subsection 4.3. If we further assume that one can reach every state from every other state within one step of the Markov chain, then we prove that there exists a driver sequence such that the discrepancy converges with $O(n^{-1} (\log_2 n)^{(3d+1)/2})$. We call the additional assumption 'anywhere-to-anywhere' condition. The result shows that in principle a higher order of convergence for Markov chain quasi-Monte Carlo is possible. Note that many well-studied Markov chains satisfy such a condition, for example the hit-and-run algorithm, the independent Metropolis sampler or the slice sampler, see for example [Liu08]. However, it is not clear how to obtain suitable driver sequences which yield such an improvement. We provide an outline of our work in the following.

Outline
In the next section the necessary background information on Markov chains is stated. Section 3 is devoted to the study of the relation of the discrepancies. The Monte Carlo rate of convergence for deterministic MCMC is shown in Section 4. There we also provide results for the case when a burn-in period is taken into account. Section 5 deals with the set of test sets which consists of axis parallel boxes, see B above. We show the existence of a good δ-cover and how the discrepancy bounds can be used to obtain bounds on the error for the computation of expected values of smooth functions. This yields a Koksma-Hlawka inequality for Markov chains. To illustrate our results, we provide an example of a Metropolis algorithm with ball walk proposal on the Euclidean unit ball. A special situation arises when the update function of the Markov chain has an 'anywhere-to-anywhere' property, see Section 6. In this situation we show that a convergence rate of order almost n −1 can be obtained.

Background on Markov chains
Let $G \subseteq \mathbb{R}^d$ and let $\mathcal{B}(G)$ denote the Borel $\sigma$-algebra of $G$. In the following we provide a brief introduction to Markov chains on $(G, \mathcal{B}(G))$. We assume that $K : G \times \mathcal{B}(G) \to [0,1]$ is a transition kernel on $(G, \mathcal{B}(G))$, i.e. for each $x \in G$ the mapping $A \in \mathcal{B}(G) \mapsto K(x,A)$ is a probability measure and for each $A \in \mathcal{B}(G)$ the mapping $x \in G \mapsto K(x,A)$ is a $\mathcal{B}(G)$-measurable real-valued function. Further let $\nu$ be a probability measure on $(G, \mathcal{B}(G))$.
Then let (X n ) n∈N , with X n mapping from some probability space into (G, B(G)), be a Markov chain with transition kernel K and initial distribution ν. This might be interpreted as follows: Let X 1 = x 1 ∈ G be chosen with ν on (G, B(G)) and let i ∈ N. Then for a given X i = x i , the random variable X i+1 has distribution K(x i , ·), that is, for all A ∈ B(G), the probability that X i+1 ∈ A is given by K(x i , A).
Let $\pi$ be a probability measure on $(G, \mathcal{B}(G))$. We assume that the transition kernel $K$ is reversible with respect to $\pi$, i.e. for all $A, B \in \mathcal{B}(G)$ holds $\int_A K(x,B)\,\pi(\mathrm{d}x) = \int_B K(x,A)\,\pi(\mathrm{d}x)$. (2) This implies that $\pi$ is a stationary distribution of the transition kernel $K$, i.e. for all $A \in \mathcal{B}(G)$ holds $\int_G K(x,A)\,\pi(\mathrm{d}x) = \pi(A)$. (3) We assume that the stationary distribution $\pi$ is unique. Let $L_2 = L_2(\pi)$ be the set of all functions $f : G \to \mathbb{R}$ with $\|f\|_2 = \left(\int_G |f(x)|^2\,\pi(\mathrm{d}x)\right)^{1/2} < \infty$. The transition kernel $K$ induces an operator acting on functions and an operator acting on measures. For $x \in G$ and $A \in \mathcal{B}(G)$ the operators are given by $Pf(x) = \int_G f(y)\,K(x,\mathrm{d}y)$ and $\nu P(A) = \int_G K(x,A)\,\nu(\mathrm{d}x)$, where $f \in L_2$ and $\nu$ is a signed measure on $(G, \mathcal{B}(G))$ with a density $\frac{d\nu}{d\pi} \in L_2$. By the reversibility with respect to $\pi$ we have that $P : L_2 \to L_2$ is self-adjoint and $\pi$-almost everywhere holds $P\left(\frac{d\nu}{d\pi}\right)(x) = \frac{d(\nu P)}{d\pi}(x)$. For details we refer to [Rud12].
In the following we introduce two convergence properties of transition kernels. Let the expectation with respect to $\pi$ be denoted by $E_\pi(f) = \int_G f(x)\,\pi(\mathrm{d}x)$. Let $L_2^0 = \{f \in L_2 : E_\pi(f) = 0\}$ and note that $L_2^0$ is a closed subspace of $L_2$. We have $\|P - E_\pi\|_{L_2 \to L_2} = \|P\|_{L_2^0 \to L_2^0}$, for details see [Rud12, Lemma 3.16, p. 44].
Definition 1 (absolute $L_2$-spectral gap) We say that a transition kernel $K$ and its corresponding Markov operator $P$ have an absolute $L_2$-spectral gap if $\beta = \|P\|_{L_2^0 \to L_2^0} < 1$, and the absolute spectral gap is $1 - \beta$.
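The step from reversibility to stationarity can be made explicit. Assuming reversibility is meant in the usual form $\int_A K(x,B)\,\pi(\mathrm{d}x) = \int_B K(x,A)\,\pi(\mathrm{d}x)$ for all $A, B \in \mathcal{B}(G)$, one applies it with the sets $G$ and $A$ and uses $K(x,G) = 1$:

```latex
\int_G K(x,A)\,\pi(\mathrm{d}x)
  = \int_A K(x,G)\,\pi(\mathrm{d}x)
  = \int_A 1\,\pi(\mathrm{d}x)
  = \pi(A), \qquad A \in \mathcal{B}(G).
```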
Let us introduce the total variation distance of two probability measures ν 1 , ν 2 on (G, B(G)) by Note that for a Markov chain (X n ) n∈N with transition kernel K and initial distribution ν holds P ν,K (X n ∈ A) = νP n−1 (A), where ν and K in P ν,K indicate the initial distribution and transition kernel. Then we obtain the following relation between the absolute L 2 -spectral gap and the total variation distance. The result is an application of [Rud12, Corollary 3.15 and Lemma 3.21].
Proposition 1 Let ν be a distribution on (G, B(G)) and assume that there exists a density dν dπ ∈ L 2 . Then The next convergence property is weaker than the existence of an absolute spectral gap.
Definition 2 (Variance bounding or $L_2$-spectral gap) We say that a reversible transition kernel $K$ and its corresponding Markov operator $P$ is variance bounding or has an $L_2$-spectral gap if $\Lambda = \sup\{\lambda : \lambda \in \operatorname{spec}(P \mid L_2^0)\} < 1$, (4) where $\operatorname{spec}(P \mid L_2^0)$ denotes the spectrum of $P : L_2^0 \to L_2^0$.
For a motivation of the term variance bounding and a general treatment we refer to [RR08]. In particular by [RR08, Theorem 14] under the assumption of reversibility our definition is equivalent to the one stated by Roberts and Rosenthal. Note that the existence of an absolute $L_2$-spectral gap implies variance bounding, since $\Lambda \le \|P\|_{L_2^0 \to L_2^0} = \beta$. We have the following relation between variance bounding and the total variation distance.
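Lemma 1 Let the transition kernel $K$ be reversible with respect to $\pi$ and let $n \in \mathbb{N}$ with $n \ge 2$. Further, let $P$ be variance bounding. Then the Markov operator $P_n = \frac{1}{n}\sum_{j=0}^{n-1} P^j$ has an absolute $L_2$-spectral gap. In particular, if $\nu$ is a distribution on $(G, \mathcal{B}(G))$ with $\frac{d\nu}{d\pi} \in L_2$, then a corresponding estimate of the total variation distance between $\nu P_n$ and $\pi$ holds.
Proof. By the spectral theorem for bounded self-adjoint operators we have for a polynomial $F : \operatorname{spec}(P \mid L_2^0) \to \mathbb{R}$ that $\|F(P)\|_{L_2^0 \to L_2^0} = \max_{\lambda \in \operatorname{spec}(P \mid L_2^0)} |F(\lambda)|$. For details see for example [Rud91] or [Kre89, Theorem 9.9-2]. In our case $F(\lambda) = \frac{1}{n}\sum_{j=0}^{n-1} \lambda^j$, so that $\|P_n\|_{L_2^0 \to L_2^0} = \max_{\lambda \in \operatorname{spec}(P \mid L_2^0)} \left|\frac{1-\lambda^n}{n(1-\lambda)}\right| \le \frac{1-\Lambda_0^n}{n(1-\Lambda_0)}$ with $\Lambda_0 = \max\{\Lambda, 0\}$, and the right-hand side is smaller than 1 for $n \ge 2$. The inequality is proven by $\operatorname{spec}(P \mid L_2^0) \subseteq [-1, 1]$ and the following facts: For $\lambda \in [-1, 0]$ holds $\left|\frac{1-\lambda^n}{n(1-\lambda)}\right| \le \frac{1}{n}$ and for $\lambda \in [0, 1)$ the function $\frac{1-\lambda^n}{n(1-\lambda)} = \frac{1}{n}\sum_{j=0}^{n-1} \lambda^j$ is increasing. The estimate of the total variation distance follows by Proposition 1. ✷
The next part deals with an update function, say $\varphi$, of a given transition kernel $K$. We state the crucial properties of the transition kernel in terms of an update function. This is partially based on [DRZ13].
Let $\lambda_s$ denote the Lebesgue measure on $\mathbb{R}^s$. Then the function $\varphi$ is an update function for the transition kernel $K$ if and only if $\mathbb{P}(\varphi(x;U) \in A) = \lambda_s(B(x,A)) = K(x,A)$ for all $x \in G$ and $A \in \mathcal{B}(G)$, where $B(x,A) = \{u \in [0,1]^s \mid \varphi(x;u) \in A\}$ and $\mathbb{P}$ is the probability measure for the uniform distribution in $[0,1]^s$, i.e. $U \sim \mathcal{U}(0,1)^s$.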
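The two elementary facts used in the proof of Lemma 1 can be checked numerically. The following is a small sketch under stated assumptions: the value `Lambda = 0.9`, the grid size and all function names are illustrative choices, and the comparison value $1/(n(1-\Lambda_0))$ is simply the geometric-sum bound, not necessarily the exact constant of the lemma.

```python
def averaged_power(lam, n):
    """F_n(lam) = (1/n) * sum_{j=0}^{n-1} lam**j, the spectral image of P_n."""
    return sum(lam ** j for j in range(n)) / n


def sup_abs_on_spectrum(Lambda, n, grid=2001):
    """Approximate the supremum of |F_n| over [-1, Lambda] on an equispaced grid."""
    pts = [-1.0 + (Lambda + 1.0) * k / (grid - 1) for k in range(grid)]
    return max(abs(averaged_power(lam, n)) for lam in pts)


Lambda, n = 0.9, 50                 # illustrative spectral bound and sample size
Lambda0 = max(Lambda, 0.0)

observed = sup_abs_on_spectrum(Lambda, n)        # sup over [-1, Lambda]
negative_part = sup_abs_on_spectrum(0.0, n)      # sup over [-1, 0]
bound = 1.0 / (n * (1.0 - Lambda0))              # geometric-sum bound
```

On the negative part of the spectrum the averaged operator contracts by at least the factor $1/n$, so only the part of the spectrum close to $\Lambda_0$ matters.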
Note that for any transition kernel on (G, B(G)) there exists an update function, see for example [Kal02, Lemma 2.22, p. 34]. For x ∈ G and A ∈ B(G) the set B(x, A) is the set of all random numbers u ∈ [0, 1] s which take x into the set A using the update function ϕ with arguments x and u.
We consider the iterated application of an update function. Let $\varphi_1(x;u) = \varphi(x;u)$ and for $i > 1$ with $i \in \mathbb{N}$ let $\varphi_i(x; u_1, \dots, u_i) = \varphi(\varphi_{i-1}(x; u_1, \dots, u_{i-1}); u_i)$. Thus, $x_{i+1} = \varphi_i(x; u_1, u_2, \dots, u_i) \in G$ is the point obtained via $i$ updates using $u_1, u_2, \dots, u_i \in [0,1]^s$, where the starting point is $x \in G$.
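The iterated update function and the resulting deterministic chain can be sketched in a few lines. This is only an illustration: the lazy chain on $G = [0,1]$ with $s = 2$ (stay with probability 1/2, otherwise jump to a uniform point) and the names `psi`, `phi`, `iterate_update` and `chain_from_driver` are assumptions made for the example.

```python
def iterate_update(phi, x, us):
    """Return phi_i(x; u_1, ..., u_i): apply phi successively with u_1, ..., u_i."""
    for u in us:
        x = phi(x, u)
    return x


def chain_from_driver(psi, phi, driver):
    """Deterministic chain x_1 = psi(u_0), x_{i+1} = phi(x_i; u_i), i = 1, ..., n-1."""
    xs = [psi(driver[0])]
    for u in driver[1:]:
        xs.append(phi(xs[-1], u))
    return xs


# Hypothetical example: lazy chain on G = [0, 1] with s = 2.
def psi(u):
    return u[1]                      # generator of the uniform initial distribution


def phi(x, u):
    return x if u[0] < 0.5 else u[1]  # stay, or jump to the second coordinate


driver = [(0.1, 0.3), (0.7, 0.8), (0.2, 0.9), (0.6, 0.4)]
chain = chain_from_driver(psi, phi, driver)       # [0.3, 0.8, 0.8, 0.4]
```

Here `iterate_update(phi, chain[0], driver[1:])` returns the last state of the chain, which is the identity $x_{i+1} = \varphi_i(x_1; u_1, \dots, u_i)$.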
Proof. The proof follows by induction on i. ✷ is the set of all random numbers u 1 , u 2 , . . . , u i ∈ [0, 1] s which take x into the set A using the ith iteration of the update function ϕ, i.e. ϕ i with arguments x and u 1 , u 2 , . . . , u i .
In [DRZ13] we considered the case where the initial state is deterministically chosen. The following assumption is useful to work with general initial distributions.
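Assumption 1 For a probability measure $\nu$ on $(G, \mathcal{B}(G))$ we assume that $\psi : [0,1]^s \to G$ is a generator function, i.e. $\psi$ satisfies $\nu(A) = \mathbb{P}(\psi(U) \in A)$ for all $A \in \mathcal{B}(G)$, where $U$ is uniformly distributed in $[0,1]^s$.
For a probability measure $\nu$ on $(G, \mathcal{B}(G))$ let Assumption 1 be satisfied. Then, for $i \in \mathbb{N} \cup \{0\}$ and $A \in \mathcal{B}(G)$, the set $C_{i,\psi}(A) = \{(u_0, u_1, \dots, u_i) \in [0,1]^{(i+1)s} \mid \varphi_i(\psi(u_0); u_1, \dots, u_i) \in A\}$ (7), with the convention $\varphi_0(x) = x$, is the set of possible sequences to get into the set $A$ with starting point $\psi(u_0)$ and $i$ updates of the update function. The next lemma is important to understand the relation between the update function, generator function, transition kernel and initial distribution.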
Lemma 3 Let K be a transition kernel and ν a distribution on (G, B(G)). Let ϕ be an update function for the transition kernel K. Let (X n ) n∈N be a Markov chain with transition kernel K and initial distribution ν. Further, let Assumption 1 for ν be satisfied. Let i ∈ N and F : G i → R. The expectation of F with respect to the joint distribution of X 1 , . . . , X i is given by whenever one of the integrals exist.
Proof. By Assumption 1 we have and by Lemma 2 we obtain By iterating the application of Lemma 2 the assertion is proven. ✷
Note that the right-hand side of (8) is the expectation with respect to the uniform distribution in $[0,1]^{is}$.
Proof. By Lemma 3 we have which completes the proof. ✷

On the push-back discrepancy
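Let $\mathscr{A} \subseteq \mathcal{B}(G)$ be a set of test sets. Then the star-discrepancy of a point set $S_n = \{x_1, \dots, x_n\} \subseteq G$ with respect to the distribution $\pi$ is given by $D^*_{\mathscr{A},\pi}(S_n) = \sup_{A \in \mathscr{A}} \left| \frac{1}{n}\sum_{i=1}^{n} 1_{x_i \in A} - \pi(A) \right|$. Assume that $u_0, u_1, \dots, u_{n-1} \in [0,1]^s$ is a finite deterministic sequence. We call this finite sequence driver sequence. Further, let $\varphi : G \times [0,1]^s \to G$ and $\psi : [0,1]^s \to G$ be measurable functions. Then let the set $S_n = \{x_1, \dots, x_n\} \subseteq G$ be given by $x_{i+1} = \varphi(x_i; u_i) = \varphi_i(x_1; u_1, \dots, u_i)$, $i = 1, \dots, n-1$, (10) where $x_1 = \psi(u_0)$. Note that $\psi$ might be considered as a generator function and $\varphi$ might be considered as an update function. We now define a discrepancy measure on the driver sequence. We call it push-back discrepancy. Below we show how this push-back discrepancy is related to the star-discrepancy of $S_n$.
Definition 4 (Push-back discrepancy) Let $U_n = \{u_0, u_1, \dots, u_{n-1}\} \subset [0,1]^s$ and let $C_{i,\psi}(A)$ for $A \in \mathcal{B}(G)$ and $i \in \mathbb{N} \cup \{0\}$ be defined as in (7). Define the local discrepancy function by $\Delta^{\mathrm{loc}}_{n,A,\psi,\varphi}(U_n) = \frac{1}{n}\sum_{i=0}^{n-1}\left[ 1_{(u_0,\dots,u_i) \in C_{i,\psi}(A)} - \lambda_{(i+1)s}(C_{i,\psi}(A)) \right]$. Let $\mathscr{A} \subseteq \mathcal{B}(G)$ be a set of test sets. Then we define the discrepancy of the driver sequence by $D^*_{\mathscr{A},\psi,\varphi}(U_n) = \sup_{A \in \mathscr{A}} \left| \Delta^{\mathrm{loc}}_{n,A,\psi,\varphi}(U_n) \right|$. The discrepancy of the driver sequence $D^*_{\mathscr{A},\psi,\varphi}(U_n)$ is a 'push-back discrepancy' since the test sets $C_{i,\psi}(A)$ are derived from the test sets $A \in \mathscr{A}$ from the star-discrepancy $D^*_{\mathscr{A},\pi}(S_n)$ via inverting the update function and the generator.
The following theorem provides a relation between the star-discrepancy of $S_n$ and the push-back discrepancy of $U_n$; this is similar to [DRZ13, Theorem 1].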
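For a one-dimensional state space the star-discrepancy can be computed exactly. The sketch below assumes that the test sets are the anchored intervals $(-\infty, x)$, that $\pi$ has a continuous distribution function `cdf`, and that the star-discrepancy is the supremum of $|\frac{1}{n}\#\{i : x_i \in A\} - \pi(A)|$ over the test sets; in this case it coincides with the Kolmogorov-Smirnov statistic. The function name is illustrative.

```python
def star_discrepancy_1d(points, cdf):
    """sup_x |#{i : x_i < x}/n - cdf(x)| for a continuous distribution function."""
    xs = sorted(points)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        F = cdf(x)
        # just right of x the empirical value is i/n, just left of x it is (i-1)/n
        d = max(d, i / n - F, F - (i - 1) / n)
    return d


def uniform_cdf(x):
    """Distribution function of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


midpoints = [0.125, 0.375, 0.625, 0.875]
disc = star_discrepancy_1d(midpoints, uniform_cdf)   # 0.125 = 1/(2n)
```

The midpoint set attains the smallest possible value $1/(2n)$ for $n$ points with respect to the uniform distribution.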
Theorem 1 Let K be a transition kernel and ν be a distribution on (G, B(G)). Let ϕ be an update function for K and let us assume that ν satisfies Assumption 1 with generator function ψ. Further, let U n = {u 0 , u 1 , . . . , u n−1 } ⊂ [0, 1] s be the driver sequence, such that S n is given by (10). Let A ⊆ B(G) be a set of test sets. Then Proof. For any A ∈ A we have by (9) that λ (i+1)s (C i,ψ (A)) = νP i (A). Thus The inequality follows by the same arguments. ✷ Corollary 2 Assume that the conditions of Theorem 1 are satisfied. By P denote the Markov operator of K. Further, let K be reversible with respect to π, let P be variance bounding and let dν dπ ∈ L 2 . Then where Λ 0 = max{0, Λ} and Λ is defined in (4).
Proof. With $P_n = \frac{1}{n}\sum_{j=0}^{n-1} P^j$ the assertion follows by Lemma 1 and Theorem 1. ✷
Remark 1 For the moment let us assume that we can sample with respect to $\pi$. For any initial distribution $\nu$ with $\frac{d\nu}{d\pi} \in L_2$, for all $x \in G$ and $A \in \mathcal{B}(G)$ we set $K(x,A) = \pi(A)$, hence $\Lambda = 0$. Note that the discrepancies $D^*_{\mathscr{A},\pi}(S_n)$ and $D^*_{\mathscr{A},\psi,\varphi}(U_n)$ do not coincide in general. The reason for this is that the initial state is taken into account in the average computation. However, if $\nu = \pi$, then for any reversible transition kernel with respect to $\pi$ we obtain $D^*_{\mathscr{A},\pi}(S_n) = D^*_{\mathscr{A},\psi,\varphi}(U_n)$.

Monte Carlo rate of convergence
In this section we show the existence of finite sequences $U_n = \{u_0, u_1, \dots, u_{n-1}\} \subset [0,1]^s$, which define $S_n$ by (10), such that $D^*_{\mathscr{A},\pi}(S_n)$ and $D^*_{\mathscr{A},\psi,\varphi}(U_n)$ converge to 0 approximately with order $n^{-1/2}$ if the transition kernel or the corresponding Markov operator is variance bounding. The main result is proven for $D^*_{\mathscr{A},\pi}(S_n)$. The result with respect to $D^*_{\mathscr{A},\psi,\varphi}(U_n)$ holds by Theorem 1.

Useful tools: delta-cover and Hoeffding inequality
The concept of a $\delta$-cover will be useful (cf. [Gne08] for a discussion of $\delta$-covers, bracketing numbers and Vapnik-Červonenkis dimension).
The following result is well known for the uniform distribution, see [HNWW01, Section 2.1] (see also [DRZ13, Remark 3] for the particular case below).
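Proposition 2 Let $\Gamma_\delta$ be a $\delta$-cover of $\mathscr{A}$ with respect to $\pi$. Then, for any $Z_n = \{z_1, \dots, z_n\}$, holds $D^*_{\mathscr{A},\pi}(Z_n) \le \max_{C \in \Gamma_\delta} \left| \frac{1}{n}\sum_{i=1}^{n} 1_{z_i \in C} - \pi(C) \right| + \delta$.
Instead of considering the supremum over the possibly infinite set of test sets $\mathscr{A}$ in the star-discrepancy we use a finite set $\Gamma_\delta$ and take the maximum over $C \in \Gamma_\delta$ by paying the price of adding $\delta$.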
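The trade-off behind a $\delta$-cover can be illustrated for the uniform distribution on $[0,1]$ and the test sets $[0,t)$. The sketch assumes the usual notion of a $\delta$-cover (every test set $A$ is sandwiched as $C \subseteq A \subseteq D$ with $C, D \in \Gamma_\delta$ and $\pi(D \setminus C) \le \delta$), for which the intervals $[0, k/m)$, $k = 0, \dots, m$, form a cover with $\delta = 1/m$. The point set and all names are illustrative.

```python
def local_discrepancy(points, t):
    """|#{z_i < t}/n - t| for the test set [0, t) under the uniform distribution."""
    n = len(points)
    return abs(sum(1 for z in points if z < t) / n - t)


def max_over_cover(points, m):
    """Maximum local discrepancy over the finite cover {[0, k/m) : k = 0, ..., m}."""
    return max(local_discrepancy(points, k / m) for k in range(m + 1))


def star_discrepancy_uniform(points):
    """Exact supremum of the local discrepancy over all intervals [0, t)."""
    xs = sorted(points)
    n = len(xs)
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


points = [0.11, 0.35, 0.52, 0.9, 0.93]   # arbitrary illustrative point set
m = 10
delta = 1.0 / m

exact = star_discrepancy_uniform(points)
via_cover = max_over_cover(points, m) + delta   # finite maximum plus delta
```

The finite maximum never exceeds the exact star-discrepancy, and adding $\delta$ turns it into an upper bound, so a finer cover gives a tighter estimate at the cost of more sets.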
For variance bounding Markov chains on discrete state spaces, i.e. the second largest eigenvalue of the transition matrix is less than 1, in [LP04] a Hoeffding inequality is proven. In [Mia12] this is extended to non-reversible Markov chains on general state spaces. The following Hoeffding inequality for reversible, variance bounding Markov chains follows by [Mia12, Theorem 3.3 and the remark after (3.4)].
Proposition 3 (Hoeffding inequality for Markov chains) Let $K$ be a reversible transition kernel with respect to $\pi$ and let $\nu$ be a distribution on $(G, \mathcal{B}(G))$ with $\frac{d\nu}{d\pi} \in L_2$. Let us assume that the Markov operator of $K$ is variance bounding. Further, let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K$ and initial distribution $\nu$. Then, for any $A \in \mathcal{B}(G)$ and $c > 0$, we obtain with $\Lambda_0 = \max\{0, \Lambda\}$ and where $\Lambda$ is defined in (4).
We provide a lemma to state the Hoeffding inequality for Markov chains in terms of the driver sequence. We need the following notation. Let Lemma 4 Let K be a transition kernel and ν be a distribution on (G, B(G)). Let ϕ be an update function of K and let us assume that ν satisfies Assumption 1 with generator function ψ. Further, let (X n ) n∈N be a Markov chain with transition kernel K and initial distribution ν. Then, for any A ∈ B(G) and c > 0, holds where P denotes the uniform distribution in [0, 1] ns and P ν,K denotes the joint distribution of X 1 , . . . , X n .

Discrepancy bounds
We show that for any s ∈ N, for any update function of the transition kernel K, for every initial distribution ν with dν dπ ∈ L 2 and every n there exists a finite sequence u 0 , u 1 , . . . , u n−1 ∈ [0, 1] s such that the star-discrepancy of S n , given by (10), converges approximately with order n −1/2 . The main idea to prove the existence result is to use probabilistic arguments. We apply a Hoeffding inequality for variance bounding Markov chains and show that for a fixed test set the probability of point sets with small ∆ n,A,ϕ,ψ , see (12), is large. We then extend this result to all sets in the δ-cover using the union bound and finally to all test sets. The result shows that if the finite driver sequence is chosen at random from the uniform distribution, most choices satisfy the Monte Carlo rate of convergence of the discrepancy for the induced point set S n .
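The probabilistic idea can be observed in a small simulation: a driver sequence drawn at random from the uniform distribution typically yields a star-discrepancy of order $n^{-1/2}$. The sketch below is an illustration only and not part of the proof. It assumes a lazy chain on $G = [0,1]$ (stay with probability 1/2, otherwise jump to a uniform point), which is reversible with respect to the uniform distribution with $\Lambda = 1/2$; the seed and all names are arbitrary choices.

```python
import math
import random


def lazy_uniform_update(x, u):
    """Update function on G = [0, 1]: stay if v1 < 1/2, otherwise jump to v2."""
    v1, v2 = u
    return x if v1 < 0.5 else v2


def chain_star_discrepancy(n, rng):
    """Star-discrepancy w.r.t. the uniform distribution of a chain with random driver."""
    x = rng.random()                     # x_1 = psi(u_0) with a uniform generator
    xs = [x]
    for _ in range(n - 1):
        x = lazy_uniform_update(x, (rng.random(), rng.random()))
        xs.append(x)
    xs.sort()
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


rng = random.Random(0)                   # fixed seed for reproducibility
results = {n: chain_star_discrepancy(n, rng) for n in (100, 1600, 25600)}
# multiply by sqrt(n): the scaled values should stay of moderate size
scaled = {n: d * math.sqrt(n) for n, d in results.items()}
```

The scaled values stay bounded while `n` grows by a factor of 256, which is the behaviour the existence result formalizes.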
Theorem 2 Let K be a reversible transition kernel with respect to π and ν be a distribution on (G, B(G)) with dν dπ ∈ L 2 . Assume that P , the Markov operator of K, is variance bounding and that ν satisfies Assumption 1 with generator ψ. Let A ⊆ B(G) be a set of test sets and for every δ > 0 assume that there exists a set Γ δ ⊆ B(G) with |Γ δ | < ∞ such that Γ δ is a δ-cover of A with respect to π. Further, let ϕ be an update function for K.
Theorem 3 Let K be a reversible transition kernel with respect to π and ν be a distribution on (G, B(G)) with dν dπ ∈ L 2 . Assume that P , the Markov operator of K, is variance bounding and that ν satisfies Assumption 1 with generator ψ. Let A ⊆ B(G) be a set of test sets and for every δ > 0 assume that there exists a set Γ δ ⊆ B(G) with |Γ δ | < ∞ such that Γ δ is a δ-cover of A with respect to π. Further, let ϕ be an update function for K.
We refer to Remark 2 and Lemma 6 for a relation between δ and |Γ δ |. Thus, we showed the existence of a driver sequence with small push-back discrepancy. Note that by using Corollary 2 one could also argue the other way around: If one can construct a sequence with small push-back discrepancy then the star-discrepancy of S n is also small.
Remark 3 Let us consider a special case of Theorem 2 and Theorem 3. Namely, let us assume that we can sample with respect to $\pi$. Thus, we set $\nu = \pi$ and $K(x,A) = \pi(A)$ for any $x \in G$, $A \in \mathcal{B}(G)$. Then since $\Lambda_0 = \Lambda = 0$. This is essentially the same as Theorem 1 in [HNWW01] in their setting. However, it is not as elaborate as Theorem 4 in [HNWW01], which is based on results by Talagrand [Tal94] and Haussler [Hau95]. We do not know a version of these results which applies to Markov chains (such a result could yield an improvement of Theorems 2 and 3).

Burn-in period
For Markov chain Monte Carlo a burn-in period is used to reduce the bias of the initial distribution. We show how a burn-in changes the discrepancy bound of Theorem 3. Let us introduce the following notation. Let $\varphi : G \times [0,1]^s \to G$ and $\psi : [0,1]^s \to G$ be measurable functions. Let $n_0, n \in \mathbb{N}$, let $U_{n_0,n} = \{u_0, \dots, u_{n_0}, u_{n_0+1}, \dots, u_{n_0+n-1}\} \subset [0,1]^s$ and assume that $S_{[n_0,n]} = \{x_{n_0+1}, \dots, x_{n_0+n}\} \subseteq G$ is given by (10), i.e. $x_{i+1} = \varphi(x_i; u_i) = \varphi_i(x_1; u_1, \dots, u_i)$, $i = 1, \dots, n_0+n-1$, where $x_1 = \psi(u_0)$. As before $\psi$ might be considered as a generator function and $\varphi$ might be considered as an update function. We now define a discrepancy measure on the driver sequence where the burn-in period is taken into account. We call it push-back discrepancy with burn-in.
Definition 6 (Push-back discrepancy with burn-in) Let $C_{i,\psi}(A)$ for $A \in \mathcal{B}(G)$ and $i \in \mathbb{N} \cup \{0\}$ be defined as in (7). Define the local discrepancy function with burn-in by $\Delta^{\mathrm{loc}}_{n_0,n,A,\psi,\varphi}(U_{n_0,n}) = \frac{1}{n}\sum_{i=n_0}^{n_0+n-1}\left[ 1_{(u_0,\dots,u_i) \in C_{i,\psi}(A)} - \lambda_{(i+1)s}(C_{i,\psi}(A)) \right]$. Let $\mathscr{A} \subseteq \mathcal{B}(G)$ be a set of test sets. Then we define the discrepancy of the driver sequence by $D^*_{n_0,\mathscr{A},\psi,\varphi}(U_{n_0,n}) = \sup_{A \in \mathscr{A}} \left| \Delta^{\mathrm{loc}}_{n_0,n,A,\psi,\varphi}(U_{n_0,n}) \right|$.
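The construction of the sample with burn-in amounts to running the deterministic chain for $n_0 + n$ states and keeping the last $n$ of them. A minimal sketch follows; the contraction-type update `phi`, the generator `psi` and the driver values are hypothetical and chosen only so that the decay of the initial state is visible.

```python
def burn_in_sample(psi, phi, driver, n0):
    """Return S_[n0,n] = [x_{n0+1}, ..., x_{n0+n}] for a driver u_0, ..., u_{n0+n-1}."""
    x = psi(driver[0])               # x_1 = psi(u_0)
    xs = [x]
    for u in driver[1:]:             # x_{i+1} = phi(x_i; u_i)
        x = phi(x, u)
        xs.append(x)
    return xs[n0:]                   # drop the first n0 states


# Hypothetical example on G = [0, 8]: the start is far away from the bulk.
def psi(u):
    return 8.0 * u


def phi(x, u):
    return 0.5 * x + 0.5 * u


driver = [1.0, 0.0, 0.0, 0.0, 0.0]   # n0 + n = 5 driver points
sample = burn_in_sample(psi, phi, driver, n0=2)
```

With `n0 = 2` the states 8.0 and 4.0, which are dominated by the initial state, are discarded and only the last three states enter the sample average.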
We call D * n 0 ,A ,ψ,ϕ (U n 0 ,n ) push-back discrepancy with burn-in of U n 0 ,n .
By adapting Proposition 3 and Lemma 4 to the setting with burn-in we obtain, by the same steps as in the proof of Theorem 2, a bound on the star-discrepancy for S [n 0 ,n] . Further, adapting Theorem 1 and Corollary 2 to the burn-in leads to a bound on D * n 0 ,A ,ψ,ϕ (U n 0 ,n ) for a certain set U n 0 ,n .
Theorem 4 Let K be a reversible transition kernel with respect to π and let ν be a distribution with dν dπ ∈ L 2 . Assume that P , the Markov operator of K, is variance bounding and that ν satisfies Assumption 1 with generator ψ. Let A ⊆ B(G) be a set of test sets and for every δ > 0 assume that there exists a set Γ δ ⊆ B(G) with |Γ δ | < ∞ such that Γ δ is a δ-cover of A with respect to π. Further, let ϕ be an update function for K.
Then there exists a driver sequence U n 0 ,n = {u 0 , u 1 , . . . , u n 0 +n−1 } ⊂ [0, 1] s such that with Λ 0 = max{0, Λ} and Λ defined in (4). If P has an absolute L 2 -spectral gap we have with β = P L 0 2 →L 0 2 , see Definition 1. In particular, by Λ ≤ Λ 0 ≤ β < 1 and |Λ| ≤ β, we deduce Equations (18) and (19) reveal that the burn-in n 0 can eliminate the influence of the initial state induced by ψ under the assumption that there exists an absolute L 2 -spectral gap. A variance bounding transition kernel is not enough, since it could be periodic and then νP n 0 would not converge to π at all.

Application
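We consider the set of test sets $\mathscr{B}$ which consists of all axis parallel boxes anchored at $-\infty$ restricted to $G \subseteq \mathbb{R}^d$, i.e. $\mathscr{B} = \{(-\infty, x) \cap G \mid x \in \mathbb{R}^d\}$.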
In the following we study the size of δ-covers with respect to such rectangular boxes.
We then focus on the application of Theorem 2 and state the relation between the discrepancy and the error of the computation of expectations. The Metropolis algorithm with ball walk proposal provides an example where one can see that the existence result shows an error bound which depends polynomially on the dimension d.

Delta-cover with respect to distributions
We now use an explicit version of a result due to Beck [Bec84], for a proof and further details we refer to [AD, Theorem 1]. We state it as a lemma. Then for any r ∈ N there exists a set Z r = {z 1 , . . . , z r } with z 1 , . . . , z r ∈ suppµ such that Note that log 2 denotes the dyadic and log the natural logarithm.
Proof. The assertion follows by [AD,Theorem 3] This implies a version of [AD, Corollary 1], thus a version of [AD, Theorem 1], with x 1 , . . . , x N ∈ suppµ. ✷ By a linear transformation we extend the result to general, bounded state spaces G ⊂ R d .
Corollary 3 Let G ⊂ R d be a bounded, measurable set and let (G, B(G), π) be a probability space. Let the set of test sets Then for any r ∈ N there exists a set S r = {x 1 , . . . , x r } ⊆ G such that By Lemma 5 we have that there exists a set Z r = {z 1 , . . . , z r } ⊆ supp µ such that (20) is satisfied. Let x i = T −1 (z i ) for i = 1, . . . , r and for z ∈ [0, 1] d let x = T −1 (z). Then Since z 1 , . . . , z r ∈ suppµ ⊂ T (G) and By taking the supremum over the test sets on the right-hand side and using (20) the assertion follows. ✷ As in [DRZ13, Lemma 4] a point set which satisfies a discrepancy bound can be used to construct a δ-cover. The idea is to define for each subset of the point set a minimal and maximal set for the δ-cover, see [DRZ13,Lemma 4]. To simplify the bound of Corollary 3, for any r ∈ N and 0 < ε < 1 we have With this notation we obtain the following result.
Lemma 6 Let $G \subset \mathbb{R}^d$ be a bounded measurable set and let $\pi$ be a probability measure on $(G, \mathcal{B}(G))$ which is absolutely continuous with respect to the Lebesgue measure. For the test set $\mathscr{B} = \{(-\infty, x) \cap G \mid x \in \mathbb{R}^d\}$, any $0 < \delta \le 1$ and $0 < \varepsilon < 1$, there is a $\delta$-cover $\Gamma_\delta$ of $\mathscr{B}$ with respect to $\pi$ with where $C_{\varepsilon,d}$ is given by (21).
Proof. The proof of the assertion follows essentially by the same steps as the proof of [DRZ13, Lemma 4]. The only difference is that we use the discrepancy bound of Corollary 3 instead of [HNWW01, Theorem 4]. ✷
The dependence of the size of the $\delta$-cover on $\delta$ is arbitrarily close to order $\delta^{-d}$ in Lemma 6, whereas in [DRZ13, Lemma 4] it is of order $\delta^{-2d}$. Furthermore, the constant in Lemma 6 is fully explicit (one can choose $0 < \varepsilon < 1$ to obtain the best bound on the size of the $\delta$-cover).
By Theorem 2 and Lemma 6 we obtain the following result.
Corollary 4 Let $G \subset \mathbb{R}^d$ be a bounded set. Let $K$ be a reversible transition kernel with respect to $\pi$ and $\nu$ be a distribution on $(G, \mathcal{B}(G))$ with $\frac{d\nu}{d\pi} \in L_2$. Assume that $P$, the Markov operator of $K$, is variance bounding and that $\nu$ satisfies Assumption 1 with generator $\psi$. Let $\mathscr{B} = \{(-\infty, x) \cap G \mid x \in \mathbb{R}^d\}$ be the set of test sets and $\varphi$ be an update function of $K$.

Integration error
In this section we state a relation between a reproducing kernel Hilbert space and the star-discrepancy. As in [DRZ13, Appendix B] we define a reproducing kernel Q by The function Q uniquely defines a reproducing kernel Hilbert space H 2 = H 2 (Q) of functions defined on R d . Reproducing kernel Hilbert spaces were studied in detail in [Aro50]. It is also known that the functions f in H 2 permit the representation for some f 0 ∈ C and function f ∈ L 2 (R d , ρ), see for instance [SC08, Theorem 4.21, p. 121] or follow the same arguments as in [BD14, Appendix A]. The inner product in H 2 is given by With these definitions we have the reproducing property For 1 ≤ q ≤ ∞ we also define the space H q of functions of the form (23) for which f ∈ L q (G, ρ), with finite norm The following result concerning the integration error in H q is proven in [DRZ13, Theorem 3].
Theorem 5 Let $G \subseteq \mathbb{R}^d$ and $\pi$ be a probability measure on $G$. Further let We assume that $1 \le p, q \le \infty$ with $1/p + 1/q = 1$. Then for $Z_n = \{z_1, z_2, \dots, z_n\} \subseteq G$ and for all $f \in H_q$ we have , for functions $f : B_d \to \mathbb{R}$ which are integrable with respect to $\pi_\rho$.
Note that for an approximation of $E_{\pi_\rho}(f)$ the functions $f$ and $\rho$ are part of the input of a possible approximation scheme. We assume that sampling directly with respect to $\pi_\rho$ is not feasible. We consider the Metropolis algorithm with ball walk proposal for the approximate sampling of $\pi_\rho$. Let $\gamma > 0$, $x \in B_d$ and $C \in \mathcal{B}(B_d)$, then the transition kernel of the $\gamma$ ball walk is where $\lambda_d$ denotes the $d$-dimensional Lebesgue measure and $D_\gamma(x) = \{y \in \mathbb{R}^d \mid \|x - y\| \le \gamma\}$ denotes the Euclidean ball with radius $\gamma$ around $x \in \mathbb{R}^d$. The transition kernel of the Metropolis algorithm with ball walk proposal is where $\theta(x,y) = \min\{1, \rho(y)/\rho(x)\}$ is the so-called acceptance probability. The transition kernel $M_{\rho,\gamma}$ is reversible with respect to $\pi_\rho$.
Now we provide update functions of the ball walk and the Metropolis algorithm with ball walk proposal. Let $S^{d-1} = \{x \in \mathbb{R}^d \mid \|x\| = 1\}$ be the unit sphere in $\mathbb{R}^d$. Let $\psi : [0,1]^{d-1} \to S^{d-1}$ be a generator for the uniform distribution on the sphere, see for instance [FW94]. Then, $\psi_\gamma : [0,1]^d \to D_\gamma(0)$ given by with $\bar{u} = (v_1, \dots, v_d) \in [0,1]^d$, is a generator for the uniform distribution in $D_\gamma(0)$ (the Euclidean ball with radius $\gamma$ around 0). Thus, an update function This leads to an update function $\varphi_{M,\gamma,\rho} : B_d \times [0,1]^{d+1} \to B_d$ of the Metropolis algorithm with ball walk proposal. Let $A(x;\bar{u}) = \min\{1, \rho(\varphi_{W,\gamma}(x,\bar{u}))/\rho(x)\}$ then an update function for the Metropolis algorithm with ball walk proposal is where $u = (v_1, \dots, v_{d+1}) \in [0,1]^{d+1}$ and $x \in B_d$. Thus the algorithms are given by the update functions above. We assume that the functions $f : B_d \to \mathbb{R}$ and $\rho : B_d \to (0, \infty)$ have some additional structure.
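Let $f \in H_1$ with $\|f\|_{H_1} \le 1$, where $H_1$ is defined in Subsection 5.2. For $\alpha > 0$ let $\rho \in R_{\alpha,d}$ if the following conditions are satisfied: (i) $\rho$ is log-concave, i.e. for all $\lambda \in (0, 1)$ and for all $x, y \in B_d$ it holds that $\rho(\lambda x + (1-\lambda) y) \ge \rho(x)^{\lambda} \rho(y)^{1-\lambda}$; (ii) $\rho$ is log-Lipschitz with constant $\alpha$, i.e. for all $x, y \in B_d$ it holds that $|\log \rho(x) - \log \rho(y)| \le \alpha \|x - y\|$. (27)

Next we provide a lower bound for $\Lambda_{\gamma,\rho}$, defined as in (4) for the transition kernel $M_{\gamma,\rho}$, where the density $\rho$ is log-concave and log-Lipschitz. The result follows from [MN07, Corollary 1, Lemma 13].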
Proposition 4 Let us assume that $\rho \in R_{\alpha,d}$. Further let

The combination of Proposition 4, Theorem 5, Lemma 6 and Corollary 4 leads to the following error bound for the computation of $E_{\pi_\rho}(f)$ for $f \in H_1$ and $\rho \in R_{\alpha,d}$.
Thus by Corollary 4 and Theorem 5 the assertion follows. ✷ Let us emphasize that the theorem shows that for any $\rho \in R_{\alpha,d}$ there exists a deterministic algorithm whose error depends only polynomially on the dimension $d$ and the log-Lipschitz constant $\alpha$.

Beyond the Monte Carlo rate
In the previous sections we have seen that there exist deterministic driver sequences which yield almost the Monte Carlo rate of convergence of $n^{-1/2}$. Roughly speaking, the proof of Theorem 2 reveals that, if the driver sequence is chosen at random from the uniform distribution, the discrepancy bound of (14) is satisfied with high probability. In this section we use a stronger assumption to achieve a better rate of convergence. Again, this is an existence result. We want to point out that its proof does not reveal any information on how to find driver sequences which lead to good discrepancy bounds. The proof is based on the 'anywhere-to-anywhere' condition and Corollary 3.
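The statement that a randomly chosen driver sequence behaves well with high probability can be explored numerically. The sketch below is a toy illustration and not the setting of Theorem 2: it assumes a one-dimensional ball walk on $[0, 1]$ whose stationary distribution is uniform, a hypothetical update function `phi_walk_1d`, an identity start map $x_1 = u_0$, and a pseudo-random driver. It computes the exact one-dimensional star-discrepancy of the chain sample and prints it next to $(\log n)^{1/2} n^{-1/2}$ for comparison.

```python
import math
import random

def star_discrepancy_1d(points):
    """Exact star-discrepancy of points in [0, 1] with respect to the uniform distribution."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(abs(x - (2 * i + 1) / (2 * n)) for i, x in enumerate(xs))

def phi_walk_1d(x, u, gamma):
    """Toy update function: propose x + gamma * (2u - 1), stay put if the proposal leaves [0, 1]."""
    y = x + gamma * (2.0 * u - 1.0)
    return y if 0.0 <= y <= 1.0 else x

def chain_discrepancy(n, gamma=0.5, seed=0):
    """Star-discrepancy of the first n chain states for a pseudo-random driver."""
    rng = random.Random(seed)
    x = rng.random()  # x_1 = u_0 (identity start map, an assumption of this sketch)
    sample = [x]
    for _ in range(n - 1):
        x = phi_walk_1d(x, rng.random(), gamma)
        sample.append(x)
    return star_discrepancy_1d(sample)

for n in (100, 1000, 10000):
    # Columns: n, observed star-discrepancy, reference value (log n)^(1/2) n^(-1/2).
    print(n, round(chain_discrepancy(n), 4), round(math.sqrt(math.log(n) / n), 4))
```

Changing `seed` gives a different driver sequence, which makes it easy to see how much the observed discrepancy varies from one random driver to another.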
Definition 7 Let $\varphi : G \times [0, 1]^s \to G$ be an update function. We say that $\varphi$ satisfies the 'anywhere-to-anywhere' condition if for all $x, y \in G$ there exists a $u \in [0, 1]^s$ such that $\varphi(x; u) = y$.
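To make the condition concrete, consider a toy example that is not taken from the paper: on $G = [0, 1)$ the hypothetical update function $\varphi(x; u) = (x + u) \bmod 1$ satisfies the 'anywhere-to-anywhere' condition, since $u = (y - x) \bmod 1$ maps $x$ to $y$. The sketch below uses this to solve for a driver sequence that steers the chain through a prescribed point set, here the first 16 points of the van der Corput sequence in base 2. The start map $x_1 = u_0$ and all function names are assumptions of the sketch.

```python
def van_der_corput(i, base=2):
    """i-th point of the van der Corput sequence (radical inverse of i)."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value

def phi(x, u):
    """Toy update function on G = [0, 1): shift by u modulo 1."""
    return (x + u) % 1.0

def solve_driver(targets):
    """Find u_0, ..., u_{n-1} such that x_1 = u_0 and x_{i+1} = phi(x_i, u_i) hit the targets."""
    driver = [targets[0]]  # identity start map, so x_1 = u_0
    for prev, nxt in zip(targets, targets[1:]):
        driver.append((nxt - prev) % 1.0)  # solves phi(prev, u) = nxt
    return driver

def run_chain(driver):
    """Generate the chain sample from a driver sequence."""
    x = driver[0] % 1.0
    sample = [x]
    for u in driver[1:]:
        x = phi(x, u)
        sample.append(x)
    return sample

targets = [van_der_corput(i) for i in range(16)]
driver = solve_driver(targets)
sample = run_chain(driver)  # reproduces the target point set
```

For this update function the driver can be computed in closed form. For a general update function the condition only guarantees that a suitable $u$ exists, not that it is easy to find.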
Now we use the 'anywhere-to-anywhere' condition to reformulate Corollary 3. We obtain a bound on the star-discrepancy for the Markov chain quasi-Monte Carlo construction.
Corollary 6 Let $G \subset \mathbb{R}^d$ be a bounded, measurable set and let $(G, \mathcal{B}(G), \pi)$ be a probability space. Let the set of test sets $\mathscr{B} = \{(-\infty, x) \cap G \mid x \in \mathbb{R}^d\}$ be the set of anchored boxes intersected with $G$. Let $\varphi$ be an update function and assume that $\varphi$ satisfies the 'anywhere-to-anywhere' condition. Let $\psi : [0, 1]^s \to G$ be an arbitrary surjective measurable function. Then for any $n \in \mathbb{N}$ there exist $u_0, u_1, \ldots, u_{n-1} \in [0, 1]^s$ such that $S_n = \{x_1, \ldots, x_n\}$ given by $x_1 = \psi(u_0)$ and

The corollary states that if the 'anywhere-to-anywhere' condition is satisfied then, in principle, we can get the same discrepancy for the Markov chain quasi-Monte Carlo construction as without using any Markov chain. If the update function and the underlying Markov operator $P$ satisfy the conditions of Corollary 2, then a discrepancy bound similar to the one in Corollary 6 also holds for the driver sequence $U_n = \{u_0, u_1, \ldots, u_{n-1}\}$. Namely

Concluding remarks
Let us point out that the discrepancy results of Subsection 4.2 and Subsection 4.3, in particular, also hold for local Markov chains which do not satisfy the 'anywhere-to-anywhere' condition. The proof of this bound reveals that a uniformly distributed i.i.d. driver sequence satisfies the discrepancy estimate with high probability. In other words, there are many driver sequences which satisfy the discrepancy bound of order $(\log n)^{1/2} n^{-1/2}$. On the other hand, the choice of the driver sequence depends on the initial distribution $\nu$ and the transition kernel. It would be interesting to prove the existence of a universal driver sequence, which yields Monte Carlo type behavior for a class of initial distributions and transition kernels. (For a finite set of initial distributions and transition kernels such a result can be obtained from our results, since for any given initial distribution and transition kernel we can show the existence of good driver sequences with high probability.) Another open problem is the explicit construction of suitable driver sequences. The results in this paper do not give any indication of how such a construction could be obtained. However, we do obtain that the pull-back discrepancy is the relevant criterion for constructing driver sequences.