Multivariate approximation in total variation using local dependence

We establish two theorems for assessing the accuracy in total variation of multivariate discrete normal approximation to the distribution of an integer valued random vector $W$. The first is for sums of random vectors whose dependence structure is local. The second applies to random vectors~$W$ resulting from integrating the $\mathbb{Z}^d$-valued marks of a marked point process with respect to its ground process. The error bounds are of magnitude comparable to those given in Rinott and Rotar (1996), but now with respect to the stronger total variation distance. Instead of requiring the summands to be bounded, we make third moment assumptions. We demonstrate the use of the theorems in four applications: monochrome edges in vertex coloured graphs, induced triangles and $2$-stars in random geometric graphs, the times spent in different states by an irreducible and aperiodic finite Markov chain, and the maximal points in different regions of a homogeneous Poisson point process.


Introduction
In this paper, we prove a general theorem that can be used to give bounds in total variation on the accuracy of multivariate discrete normal approximation to the distribution of a random vector $W$ in $\mathbb{Z}^d$, when $W$ is a sum of $n$ random vectors whose dependence structure is local. Our setting is rather similar to that in Rinott & Rotar (1996). In their paper, Stein's method is used to derive the accuracy, in terms of the convex sets metric, of multivariate normal approximation to suitably normalized sums of bounded random vectors; under reasonable conditions, error bounds of order $O(n^{-1/2}\log n)$ are obtained. Fang (2014) improves the order of the error to $O(n^{-1/2})$, using slightly different conditions, and also obtains optimal dependence on the dimension $d$. Here, we are interested in total variation distance bounds, so as to be able to approximate the probabilities of arbitrary sets. For random elements of $\mathbb{Z}^d$, this necessitates replacing the multivariate normal distribution by a discretized version. We use the $d$-dimensional discrete normal distribution $\mathrm{DN}_d(nc, n\Sigma)$ that is obtained from the multivariate normal distribution $N_d(nc, n\Sigma)$ by assigning to each integer vector $(i_1, \ldots, i_d)^T \in \mathbb{Z}^d$ the probability of the unit $d$-box centred at it. This family of distributions is a natural choice when approximating a discrete random vector in a central limit setting. We are able to establish discrete normal approximation under conditions broadly analogous to those of Rinott & Rotar (1996) and Fang (2014), with an error of order $O(n^{-1/2}\log n)$, but without their boundedness assumption; a suitable third moment condition is all that is needed.
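As a concrete illustration (our own sketch, not part of the paper), the discretization can be computed directly in the simple case of a diagonal covariance matrix, where the probability of each unit box factorizes into one-dimensional normal increments; the function name `dn_pmf` and the restriction to independent components are our assumptions.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dn_pmf(i, mu, sigma):
    """P[W = i] for a discrete normal with independent components:
    the probability that N(mu_l, sigma_l^2) falls in the unit interval
    (i_l - 1/2, i_l + 1/2], multiplied over the coordinates l."""
    p = 1.0
    for il, ml, sl in zip(i, mu, sigma):
        p *= norm_cdf((il + 0.5 - ml) / sl) - norm_cdf((il - 0.5 - ml) / sl)
    return p
```

Summing `dn_pmf` over the whole lattice recovers total mass one, so the recipe genuinely defines a probability distribution on $\mathbb{Z}^d$.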
The unspecified constants can in principle be deduced from the more detailed information in Barbour, Luczak & Xia (2018a,b).
Applying the theorem in practice may not be easy. Condition (b) is much like the sort of condition that has to be checked to prove multivariate normal approximation using Stein's method (Chen, Goldstein & Shao (2011, p. 337)), with differences and derivatives exchanged, except for the indicator $I[|W - mc| \le m\delta]$, which truncates $W$ to the ball $B_{m\delta}(mc)$. The truncation has both good and bad consequences. It introduces an awkward discontinuity inside the expectation, which needs careful treatment in the arguments that follow. On the other hand, it ensures that all the expectations to be considered are finite, and that the function $h$ only has to be evaluated within certain closed balls around $mc$; this latter feature is important, because the solutions to the Stein equation for this problem may grow large as the distance from $mc$ increases. Condition (a) imposes a certain smoothness on the distribution of $W$.
In Section 2, we prove a multivariate approximation theorem, Theorem 2.1, with error bounds in the total variation distance, that is much simpler to use than Theorem 1.1. The setting is one of predominantly local dependence. The basic elements making up the error bounds are sums of third moments, similar to those that would be expected to quantify the error in the CLT for dissociated summands, together with dependence coefficients analogous to those in Rinott & Rotar (1996). However, there is an extra quantity $\varepsilon_W$ appearing in the bound, which quantifies the smoothness of the distribution of $W$, and which is not as simple to express in concrete terms. We also consider a more general setting, in which $W$ arises from integrating the marks of a marked point process with respect to its ground process on a suitable metric space. For integrals of functionals of a Poisson process, Schulte & Yukich (2018a,b) have recently established an order $O(n^{-1/2})$ rate of multivariate approximation with respect to the convex sets metric, using the Malliavin--Stein approach and second order Poincaré inequalities. They require somewhat stronger moment assumptions than ours, but, as in the theorems of Rinott & Rotar (1996) and of Fang (2014), there is no need to bound an analogue of $\varepsilon_W$.
In Section 3, we introduce a stronger notion of local dependence, that is convenient for many applications. It enables us to give rather simple error bounds, in Corollary 3.1, expressed in terms of an upper bound for the maximum of the third moments of the $|X^{(\alpha)}|$ and the sizes of the neighbourhoods in the dependency graph, both being quantities that typically appear in error bounds in the CLT. It also enables us to give a general result, Theorem 3.2, that is helpful for bounding $\varepsilon_W$. The effectiveness of our bounds is illustrated in a number of examples in Section 4. These also give some insight into why, in addition to the sort of moment conditions that suffice for approximation in metrics weaker than total variation, some smoothness condition is needed.
For ease of use, we present our main results on the accuracy of multivariate discrete normal approximation in two distinct but related settings. We postpone the proofs of the main theorems to Section 5.
In the first setting, we suppose that $W = \sum_{j=1}^n X^{(j)}$ is a sum of $n$ random vectors in $\mathbb{Z}^d$. We assume that there are decompositions of the following form: for each $1 \le j \le n$, we can write $W = W^{(j)} + Z^{(j)}$, where $Z^{(j)} = \sum_{k=1}^{n_j} X^{(j,k)}$ with $X^{(j,k)} \in \mathbb{Z}^d$, and then, for each $1 \le k \le n_j$, we can write $W^{(j)} = W^{(j,k)} + Z^{(j,k)}$, where $W^{(j,k)} \in \mathbb{Z}^d$ is only weakly dependent on $(X^{(j)}, X^{(j,k)})$.
Because of the restriction to $\mathbb{Z}^d$, centring on the mean is not possible in these decompositions, but it could, for instance, be arranged that each component of $Z^{(j)}$, $X^{(j,k)}$ and $Z^{(j,k)}$ has mean with modulus at most $1$. This makes no difference to the arguments that follow, but the moment sums $H_1$ and $H_2$ that appear in the error bounds might otherwise be larger than necessary.
Weak dependence is expressed by the smallness of dependence coefficients analogous to those in Rinott & Rotar (1996). With $\mu^{(j)} := E X^{(j)}$, we begin by defining the coefficients $\chi_{11}$, $\chi_{12}$ and $\chi_{13}$ in (2.1), and then set the coefficients $\chi_2$ and $\chi_3$. We then write $\chi_1 := \max_{1 \le l \le 3} \chi_{1l}$. Note that the $m$-factors defined in Theorem 1.1 are not present in the quantities in Rinott & Rotar (1996) that are directly analogous to $\chi_{11}$, $\chi_2$ and $\chi_3$. This is because, in their formulation, the random variables corresponding to the $X^{(j)}$ are normalized to make $\mathrm{Cov}(W)$ close to the identity matrix. Since our sum $W$ is not normalized, to keep its values in $\mathbb{Z}^d$, the elements of its covariance matrix typically grow with $n$. The quantities $\chi_{12}$ and $\chi_{13}$ have no direct analogue in Rinott & Rotar (1996), and appear only in dealing with the truncation to $B_{n\delta}(\mu)$, something that is not needed in their arguments. Assuming that $E|X^{(j)}|^3 < \infty$ for each $j \in [n]$, we define $m$ and $\Sigma := m^{-1}V$, so that $\mathrm{Tr}(\Sigma) = m^{-1}\,\mathrm{Tr}\,V \le d$; this makes $m$ the analogue of the variance in the one dimensional context. We then introduce some moment sums $H_0$, $H_1$ and $H_2$, used in the error estimates. We also assume that (2.7) holds. In view of the definitions of $m$ and $V$, $H_0$ and $H_1$ can be expected to be of moderate size in many applications, $H_2$ can be expected to grow with the size of a typical neighbourhood of a vertex $j$, and the assumption (2.7) can be expected to be satisfied. The various $d$-factors are designed to offset any automatic dimension dependence in the corresponding quantities, but their choice plays no essential part in the bounds given below. We now make a smoothness assumption on the distributions of $W^{(j)}$ and $W^{(j,k)}$ that is key for approximation in total variation. We assume that, for each $1 \le j \le n$, $1 \le k \le n_j$ and $1 \le i \le d$, the bound (2.8) holds for some $\varepsilon_W < 1$. Of course, for the bounds that we shall prove, we shall want $\varepsilon_W$ to be suitably small. This assumption is clearly useful in establishing Condition (a) of Theorem 1.1, but is also used throughout the treatment of $|E[\{A_m h(W)\}\, I[|W - \mu| \le m\delta]]|$.
Theorem 2.1 Let $W := \sum_{j=1}^n X^{(j)}$ be decomposed as above, with (2.7) satisfied, and suppose that $V$ is positive definite. Then there exist constants $C_{2.1}$ and $n_{2.1}$, depending continuously on the condition number $\rho(V)$, such that the total variation error bound holds for all $m \ge n_{2.1}$.
Our second setting is somewhat more general. We suppose that $W$ results from integrating the marks of a marked point process with respect to its ground process. We assume that the carrier space $\Gamma$ of the ground point process $\Xi$ is a locally compact second countable Hausdorff topological space (Kallenberg (1983, p. 11)), with Borel $\sigma$-field $\mathcal{B}(\Gamma)$. Let $G := \Gamma \times \mathbb{Z}^d$, and equip it with the product Borel $\sigma$-field $\mathcal{B}(G) = \mathcal{B}(\Gamma) \times \mathcal{B}(\mathbb{Z}^d)$.
Our goal is to establish the accuracy of discrete normal approximation to $W = \int_\Gamma X^{(\alpha)}\, \Xi(d\alpha)$. When $\Gamma = \{1, \ldots, n\}$ and $\Xi$ is the counting measure on $\Gamma$, $W$ reduces to the sum in the previous setting, so the bound in Theorem 2.1 is a corollary of that in Theorem 2.2. However, if there is dependence between $\Xi$ and $X$, then there is a significant difference between the two settings. In the latter setting, it is necessary to introduce extra machinery, including the first and second order Palm distributions (Kallenberg (1983, p. 83 and p. 103)), to tackle the problem.
Next, we set $\mu^{(\alpha)} := E_\alpha X^{(\alpha)}$ and $\mu := \int_\Gamma \mu^{(\alpha)}\, \nu(d\alpha)$, and define the dependence coefficients in (2.10). Note that the quantities $\chi_{2\alpha\beta}$ and $\chi_{3\alpha\beta}$ are more complicated than their counterparts $\chi_{2jk}$ in the earlier setting, to allow for possible dependence between the ground process and the marks. Then let the coefficients in (2.11) be defined accordingly. As in the discrete sum setting, we assume that $E|X^{(\alpha)}|^3 < \infty$ $\nu$-a.s., and we next introduce some moment sums. As a consequence of the dependence between the marks and the ground process, the analogue of (2.7) is more involved: we need to assume that there exists a constant $C \ge 1$ such that (2.15) holds. The analogue of (2.8) is even more involved. First, for $1 \le i \le d$, $\nu$-a.s. in $\alpha$ and $\nu^2$-a.s. in $(\alpha, \beta)$, we need to find $\varepsilon_W < 1$ such that (2.16) holds. We then also need to find $\varepsilon'_W < 1$ satisfying the corresponding smoothness bound. Finally, we need a bound controlling the difference between some conditional and unconditional expectations: we need to find $\varepsilon''_W < 1$ bounding this difference. Fortunately, under many circumstances (see Barbour & Xia (2006)), both $\varepsilon'_W$ and $\varepsilon''_W$ can be reduced to $0$, as is the case in Example 4.4.
Theorem 2.2 Let $W := \int_\Gamma X^{(\alpha)}\, \Xi(d\alpha)$ be decomposed as above, such that (2.15) is satisfied, and suppose that $V$ is positive definite. Then there exist constants $C_{2.2}$ and $n_{2.2}$, depending continuously on $\rho(V)$, such that the total variation error bound holds for all $m \ge n_{2.2}$.

Intersection graph dependence
In this section, we consider sums $W := \sum_{j=1}^n X^{(j)}$ of random vectors $X^{(j)}$ that are determined by the values of an underlying collection of independent random elements $(Y_l,\, l \in [M])$: each $X^{(j)}$ is a function of $(Y_l,\, l \in M_j)$, for some subset $M_j \subset [M]$. The subsets $M_j$ induce an intersection graph $G$ on $[n]$, in which there is an edge between $j$ and $k \neq j$, written $j \sim k$, exactly when $M_j \cap M_k \neq \emptyset$; we denote by $N_j$ the set of neighbours of $j$ in $G$, and the graph $G$ is a dependency graph in the sense of Baldi & Rinott (1989).
In this setting, there is a natural way to define $W^{(j)}$ and $W^{(j,k)}$. For each $j \in [n]$, we define $Z^{(j)} := X^{(j)} + \sum_{k \sim j} X^{(k)}$ and $W^{(j)} := W - Z^{(j)}$, noting that $W^{(j)}$ and $X^{(j)}$ are independent, so that $X^{(j,k)} = X^{(k)}$ for $k \in N_j \cup \{j\}$, and $\chi_1 = \chi_3 = 0$. Then, for $k = j$, we take $W^{(j,k)} = W^{(j)}$ and $Z^{(j,k)} = 0$; otherwise, for $j \neq k \in [n]$ such that $j \sim k$, we define $W^{(j,k)} := \sum_{l \notin N_j \cup N_k} X^{(l)}$ and $Z^{(j,k)} := W^{(j)} - W^{(j,k)}$; note that $W^{(j,k)}$ and the pair $(X^{(j)}, X^{(k)})$ are independent, so that $\chi_2 = 0$ also. If we also impose some uniformity, by supposing that suitable uniform moment bounds hold, then we have the following corollary of Theorem 2.1.
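The construction above can be sketched in a few lines; this sketch and its helper names are ours, with `M` the list of index sets $M_j$. The index lists returned for $W^{(j,k)}$ avoid both neighbourhoods, so the summands they index are independent of $(X^{(j)}, X^{(k)})$ whenever the underlying $Y_l$ are independent.

```python
def intersection_graph(M):
    """Neighbour sets N_j of the intersection graph: j ~ k exactly when
    M_j and M_k share an underlying index (and j != k)."""
    n = len(M)
    return [{k for k in range(n) if k != j and M[j] & M[k]} for j in range(n)]

def decomposition_indices(M):
    """For each j and each k ~ j, the index set over which W^{(j,k)} sums:
    all l outside N_j and N_k (which automatically excludes j and k,
    since k lies in N_j and j lies in N_k)."""
    n = len(M)
    N = intersection_graph(M)
    out = {}
    for j in range(n):
        for k in N[j]:
            out[(j, k)] = [l for l in range(n)
                           if l not in N[j] and l not in N[k]]
    return out
```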
Corollary 3.1 Suppose that the above assumptions are satisfied, and define the corresponding quantities. Then the stated total variation bound holds for all $n \ge n_{2.1}$.
Proof: All that is needed is to observe that the quantities in Theorem 2.1 simplify as indicated.

The main difficulty in applying the bounds in Theorem 2.1 and Corollary 3.1 is putting a value to $\varepsilon_W$. This can nonetheless often be dealt with, provided that enough of the underlying random variables $(Y_l,\, l \in [M])$ each influence rather few of the $X^{(j)}$. The next theorem gives a way of exploiting this. Given $j \sim k$, find $l_1 < l_2 < \cdots < l_s \in [M] \setminus M^{(j,k)}$ such that $L_{l_r} \cap L_{l_{r'}} = \emptyset$ for all $1 \le r < r' \le s$. Then the vectors $S_{l_1}, \ldots, S_{l_s}$ are conditionally independent, given the remaining underlying variables.

Theorem 3.2 Suppose that, for $j \neq k \in [n]$ such that $j \sim k$, we can find $s$ and $l_1 < l_2 < \cdots < l_s \in [M] \setminus M^{(j,k)}$ such that the sets $L_{l_r}$, $1 \le r \le s$, are disjoint, and such that the corresponding conditions hold; then $\varepsilon_W$ satisfies the stated bound.

Now, by the Mineka coupling argument (Lindvall, 2002, Section II.14), the required smoothness bound follows. Then it is easy to compute the covariance matrix $V$; thus $\mathrm{Tr}\,V = nd\{c_1 + Dc_2\}$, where $c_1$ and $c_2$ are the corresponding constants, so that we take $m := n(c_1 + Dc_2)$ in Corollary 3.1. We can clearly take $\gamma = 1$ also, and, for fixed $d$ and $\pi_1, \ldots, \pi_d$, this yields a bound which relies on having a reasonable bound for $\varepsilon_W$.

In order to apply Theorem 3.2, for each $j \sim k \in E$, we want first to find $s$ and $l_1 < \cdots < l_s$. Define $L_l := \{\{l, l'\} : \{l, l'\} \in E\}$, so that $|L_l| = \delta_l$, and $L_l \cap L_{l'} \neq \emptyset$ exactly when $\{l, l'\} \in E$. Thus we need to find a set of vertices $l_1, \ldots, l_s$ subtending no edges of $G$ (independent in the graph theoretical sense). Letting $\delta^* := \max_l \delta_l$, we note the corresponding lower bound on $s$. The next step is to bound the total variation distance, for each $1 \le i \le d$ and for any $l$. To do so, let $R_{il} := \sum_{l' : \{l, l'\} \in E} I[Y_{l'} = i]$ be the number of neighbours of $l$ in $G$ that have colour $i$. Then $S_l$ takes one of the values $R_{1l} e^{(1)}, \ldots, R_{dl} e^{(d)} \in \mathbb{Z}^d$, with conditional probabilities $\pi_1, \ldots, \pi_d$. Hence, if $R_{il} = 1$ and $R_{i'l} = 0$ for some $i' \neq i$, then $S_l = e^{(i)}$ with conditional probability $\pi_i$, and $S_l = 0$ with conditional probability at least $\pi_{i'}$, giving the required bound on the total variation distance.
Then, since the events $\{R_{i,l_r} = 1, R_{i',l_r} = 0\}$ and $\{R_{i,l_{r'}} = 1, R_{i',l_{r'}} = 0\}$ are independent unless there is a path of length $2$ connecting $l_r$ and $l_{r'}$, we have the required variance bound; hence, by Chebyshev's inequality, the claimed estimate follows.
Thus we can take these choices in Theorem 3.2. If, as $M \to \infty$, $n \ge cM$ for some $c > 0$ and $\delta^*$ remains bounded, with the colour probabilities remaining constant, this gives $\varepsilon_W = O(M^{-1/2})$, and hence the corresponding total variation bound. The order in $M$ is the same as is obtained, in the context of $\delta^*$-regular graphs and using the convex sets metric, by Rinott & Rotar (1996).
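To experiment with this example numerically, the vector of monochrome edge counts can be simulated directly; this simulation sketch is ours, with `edges` the edge list, `M` the number of vertices, and `pi` the colour distribution.

```python
import random

def monochrome_edge_counts(edges, M, pi, rng):
    """Colour each of M vertices independently, with P[colour = i] = pi[i],
    then count, for each colour i, the edges whose two endpoints both
    receive colour i.  Returns the vector W = (W_1, ..., W_d)."""
    d = len(pi)
    colours = rng.choices(range(d), weights=pi, k=M)
    W = [0] * d
    for u, v in edges:
        if colours[u] == colours[v]:
            W[colours[u]] += 1
    return W
```

Repeating the simulation gives an empirical picture of the joint distribution of $W$ that can be compared with the discrete normal approximation.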
Note that, if most of the degrees in $G$ become large as $M$ increases, $h_{\min}(i, i')$ may well converge to zero too fast for the bound on $\varepsilon_W$ to be useful, and more sophisticated arguments would be needed. Note also that, for $d = 2$, $h(t, i, i') = h(t, 1, 2) = 0$ for all $t \neq 1$, because $\pi_1 + \pi_2 = 1$, and we obtain no bound on $\varepsilon_W$ in this way. Indeed, if $G$ is a $\delta^*$-regular graph and $d = 2$, $\varepsilon_W$ is not small, since the distribution of $W$ is concentrated on a sub-lattice of $\mathbb{Z}^2$ if $\delta^* \ge 2$, and $\mathcal{L}(W)$ is no longer close to $\mathrm{DN}_d(\mu, V)$ in total variation; see Barbour, Luczak & Xia (2018b, Section 4.2.1).
The problem can be modified, by only counting a random subset of monochrome edges. Let $(\tilde Y_j,\, j \in E)$ be independent $\mathrm{Be}(p)$ random variables, and define the thinned counts $\tilde W_i$ accordingly. As before, for fixed $d$ and $\pi_1, \ldots, \pi_d$, this yields a bound of the same form. However, the quantity $\tilde\varepsilon_n$ is rather easier to bound than $\varepsilon_W$, since we can take the independent random variables $(\tilde Y_j,\, j \in E)$ to use in Theorem 3.2, each of which influences only the corresponding $X^{(j)}$, and these are conditionally independent given the colours. The apparent order in $D$ is misleading here. If $D$ is large, the covariance matrix $V$ is ill conditioned, since $\mathrm{Tr}\,V \asymp nD \asymp MD^2$, whereas the variance of suitable linear combinations of the $\tilde W_i$ is of smaller order. Thus the condition number $\rho(V)$ grows like $D$, and $\rho(V)$ enters the constant $C_{2.1}$ implied in the order symbol in (4.1). However, if only the joint distribution of, say, $(\tilde W_1, \ldots, \tilde W_{d-1})^T$ is of interest, the corresponding covariance matrix has condition number that is bounded in $D$, for fixed $d$ and $\pi_1, \ldots, \pi_d > 0$, and the orders in both $M$ and $D$ are as in (4.1).
A more general modification, in the same spirit, is to choose the $(\tilde Y^{(i)}_j)$ to be any independent integer valued random variables, with distributions depending only on $i$, and to set the $\tilde W_i$ accordingly. Then, if the mean and variance of $\tilde Y^{(i)}_1$ are $\tilde m^{(i)}$ and $\tilde v^{(i)}$, the corresponding mean and covariance follow, from which the corresponding value of $m$ can be deduced. As above, it is not difficult to show that $\tilde\varepsilon_n = O((MD)^{-1/2})$, and hence the corresponding bound follows from Corollary 3.1.

Random geometric graphs
Let $M := n^2$ points be distributed uniformly and independently over the torus $T_n := [0, n] \times [0, n]$. For some fixed $r$, join all pairs of points whose distance apart is less than or equal to $r$. This yields a particular example of a random geometric graph; see the book by Penrose (2003). Here, $W = (W_1, W_2)^T$ counts the numbers of induced triangles and of induced $2$-stars. The quantities $G_j$ and $G_k$ are independent unless $j$ and $k$ have at least two of their vertices in common, $G_j$ is independent of the set $(G_k : j \cap k = \emptyset)$, and the pair $(G_j, G_{j'})$ is independent of the set $(G_k : (j \cup j') \cap k = \emptyset)$. Using these facts, we can make some computations: the matrix $\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}$ is non-singular, with values involving the geometry of intersections of discs in $\mathbb{R}^2$. Thus we can take $m = cn^2$ for some $c > 0$. The quantities $H_0$ and $H_2$ are then easily bounded, giving $d_{TV}(\mathcal{L}(W), \mathrm{DN}_2(\mu, V)) = O((n^{-1} + \varepsilon_W)\log n)$.
It thus remains to bound $\varepsilon_W$. To do so, break up $[0, n]^2$ into $9\lceil n/3r \rceil^2$ non-overlapping $r \times r$ squares, denoted by $Q_{l,l'} := [(l-1)r, lr) \times [(l'-1)r, l'r)$. Then there can be no triangles or $2$-stars with points in two of the squares in $\mathcal{Q} := (Q_{3l,3l'},\, 1 \le l, l' \le \lfloor n/3r \rfloor)$, because points in two of them are more than $2r$ apart. Consider evaluating $d_{TV}(\mathcal{L}(W + e^{(i)} \mid \mathcal{G}), \mathcal{L}(W \mid \mathcal{G}))$, much as for Theorem 3.2, where $\mathcal{G}$ consists of the positions of all points not in members of $\mathcal{Q}$, together with the numbers of points falling in each member of $\mathcal{Q}$. If $S_{l,l'}$ denotes the contribution resulting from assigning positions to the points in $Q_{3l,3l'}$, then the random variables $(S_{l,l'},\, 1 \le l, l' \le \lfloor n/3r \rfloor)$ are conditionally independent, given $\mathcal{G}$. Let $N(A)$ denote the number of points falling in the set $A \subset T_n$. Then the event $E_{l,l'}$ that $N(Q_{3l,3l'}) = 1$, that the rectangle $[(3l + 0.25)r, (3l + 0.5)r) \times [(3l'-1)r, 3l'r)$ contains two points at a distance between $r/2$ and $r$ from one another, and that $N(U_{l,l'}) = 3$, where $U_{l,l'}$ is the union of $(Q_{r,s},\, 3l-2 \le r \le 3l+2,\, 3l'-2 \le s \le 3l'+2)$, has probability bounded away from $0$ as $n \to \infty$, the same for all $l, l'$; indeed, we have $P[E_{l,l'}] \ge \chi$ for a constant $\chi > 0$ that is independent of $n$ also. Conditional on $E_{l,l'}$, the required smoothness bound holds for $u_1, u_2 > 0$. Now the events $(E_{l,l'},\, 1 \le l, l' \le \lfloor n/3r \rfloor)$ are not independent, but, except for neighbouring pairs of indices, they are only weakly dependent: the dependence is negligible for $r, r'$ such that $\max\{|r - l|, |r' - l'|\} \ge 2$. The asymptotics of $\varepsilon_W$ are, however, sensitive to the choice of $r$: if $r = r_n \to \infty$, even logarithmically in $n$, $P[N(U_{1,1}) = 3]$ becomes very small, and the bound on $\varepsilon_W$ derived in this way is no longer useful.
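For small configurations, the statistics in this example are easy to compute by brute force; the following sketch (ours) counts induced triangles and induced $2$-stars, taking an induced $2$-star to be a triple of points spanning exactly two edges of the geometric graph.

```python
import itertools
import math

def torus_dist(p, q, n):
    """Euclidean distance on the torus [0, n] x [0, n]."""
    dx = min(abs(p[0] - q[0]), n - abs(p[0] - q[0]))
    dy = min(abs(p[1] - q[1]), n - abs(p[1] - q[1]))
    return math.hypot(dx, dy)

def triangle_and_2star_counts(points, r, n):
    """W = (induced triangles, induced 2-stars) of the geometric graph that
    joins pairs of points at toroidal distance <= r."""
    tri = star = 0
    for a, b, c in itertools.combinations(points, 3):
        edges = sum(torus_dist(p, q, n) <= r
                    for p, q in [(a, b), (b, c), (a, c)])
        if edges == 3:
            tri += 1
        elif edges == 2:
            star += 1
    return tri, star
```

The triple loop is $O(M^3)$ and only suitable for checking small cases; it is meant as a reference implementation, not an efficient one.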

Finite Markov chains
Let $(Z_j,\, j \ge 0)$ be an irreducible, aperiodic Markov chain on the finite state space $\{0, 1, \ldots, d\}$, and let $W_n := (W_{1n}, \ldots, W_{dn})^T$, where $W_{in}$ counts the times $1 \le j \le n$ at which $Z_j = i$. We are interested in the accuracy of approximating the distribution of $W_n$ by $\mathrm{DN}_d(\mu_n, V_n)$, where $\mu_n := E W_n$ and $V_n := \mathrm{Cov}\, W_n$; translated Poisson approximation for each component $W_{in}$ separately can be shown to be accurate to order $O(n^{-1/2})$ using the results of Barbour & Lindvall (2006). In a Markov chain, the dependence between the states at different times never completely disappears, so we shall need to make use of the dependence coefficients $\chi_l$, $1 \le l \le 3$. We make a simplifying assumption, under which the decomposition is likely to be effective, if $m_n$ is suitably chosen. Because a finite state irreducible aperiodic Markov chain is geometrically ergodic, there exist $0 < \rho < 1$ and $C < \infty$ such that, for all $0 \le i, r \le d$, we have $|P^{(k)}_{ir} - \pi_r| \le C \rho^k$, where $P^{(k)}_{ir} := P[Z_k = r \mid Z_0 = i]$, and $\pi_r := \lim_{n \to \infty} P[Z_n = r \mid Z_0 = i]$. It is then easy to deduce that, as $n \to \infty$, $n^{-1}\mu_n \sim (\pi_1, \ldots, \pi_d)^T =: \pi$, together with the corresponding asymptotics (4.4) for $1 \le i, r \le d$. Now, from (4.3), the mixing bounds hold uniformly in $i, r, s, q$, so that all the dependence coefficients $\chi_l$, $1 \le l \le 3$, in (2.3) are of order $O(\rho^{m_n})$. It is also immediate, because indicators are bounded random variables, that $H_0 = O(1)$, $H_1 = O(m_n)$ and $H_2 = O(m_n^2)$. It thus remains to consider the quantity $\varepsilon_W$ of (2.8). Assuming that $2m_n \le n/4$, it is enough to bound the total variation distance in (4.5), for any $l \ge n/4$ and $0 \le r \le d$, where $\mathcal{L}_r$ stands for the distribution given that the initial state of the Markov chain is $r$. This is because, for $1 \le j \le n/2$ and $k \le j + m_n$, conditioning on the values of $Z$ up to time $j + k + m_n$ and using the Markov property, the quantity $\varepsilon_W$ in (2.8) is no bigger than any bound for the distance in (4.5), for $l = n - j - k - m_n \ge n/2 - 2m_n \ge n/4$, that is uniform in the initial state $r$. For $j > n/2$, we note that the same argument works, using the reversed Markov chain. We establish (4.5) by using coupling. Let $Z$ and $Z'$ be two copies of the Markov chain, both starting in $r$.
We couple them in such a way that the sequence of transitions in the first is the same as that in the second, except that the holding times in $0$ and $i$ are allowed to be different. Initially, if $(N_{0l},\, l \ge 1)$ and $(N'_{0l},\, l \ge 1)$ denote the sequences of successive holding times in $0$ of the two chains, then the pair $(N_{0l}, N'_{0l})$ is chosen independently of the past according to the Mineka coupling (Lindvall 2002, Section II.14), so that $(N_{0l} - N'_{0l},\, l \ge 1)$ are the increments of a lazy symmetric random walk with steps in $\{-1, 0, 1\}$. After the first occasion $L_0$ at which this walk reaches the required level, the values of $N_{0l}$ and $N'_{0l}$ are chosen to be identical. The same strategy is applied to the holding times $N_{il}$ and $N'_{il}$, except that they are chosen to be identical after the corresponding first occasion $L_i$. Let $M_{0i}$ denote the first time in the underlying Markov chains $Z$ and $Z'$ at which both of these occasions have occurred. At this point, both chains have made the same number of steps, because their paths differ only through differences in the partial sums $\sum_l \{N_{0l} + N_{il}\}$ and $\sum_l \{N'_{0l} + N'_{il}\}$, and these are equal at all times after $M_{0i}$. However, at this point, both have spent the same amount of time in states other than $i$ and $0$, but one chain has spent one step less in $i$ than the other. By the usual coupling argument, the difference of the two probabilities of any set $A \subset \mathbb{Z}^d$ is controlled by the tail of the coupling time, and the distance in (4.5) is correspondingly bounded. Now we have $P[L_0 > l] = O(l^{-1/2})$ and $P[L_i > l] = O(l^{-1/2})$, by Lindvall (2002, Section II.14). Also, because $Z$ has finite state space, the times between visits to $0$ and between visits to $i$ have means $\gamma_0$ and $\gamma_i$ and finite variances $v_0$ and $v_i$. So, if $\tau_{0l}$ denotes the time at which $Z$ completes its $l$-th visit to $0$, the required tail bounds follow, where the order for the first pair of terms follows as above, and the second pair are of order $O(n^{-1})$. This shows that $\varepsilon_W = O(n^{-1/2})$, and hence the discrete normal approximation holds, where $\pi$ and $V$ are as given in (4.4).
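The occupation-time vector in this example is straightforward to simulate; in this sketch (ours), `P` is the transition matrix and `W[i]` records the time spent in state `i` during steps `1..n`, which can be compared with $n\pi$.

```python
import random

def occupation_times(P, n, start, rng):
    """Occupation-time vector of a finite Markov chain with transition
    matrix P: W[i] counts the steps 1..n at which the chain is in state i."""
    d = len(P)
    W = [0] * d
    state = start
    for _ in range(n):
        state = rng.choices(range(d), weights=P[state])[0]
        W[state] += 1
    return W
```

Averaging `W[i] / n` over long runs recovers the stationary probabilities, in line with the asymptotics $n^{-1}\mu_n \sim \pi$ above.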

Maximal points
Given a configuration $\Xi$ of points in $\mathbb{R}^2$, a point $\boldsymbol{\alpha} = (\alpha_1, \alpha_2)^T \in \Xi$ is called maximal if there is no other point $\boldsymbol{\beta} = (\beta_1, \beta_2)^T \in \Xi$ such that $\beta_i \ge \alpha_i$ for $i = 1, 2$. In this example, we take $\Xi$ to be a realisation of a Poisson point process with intensity $\lambda$ on a triangle $\Gamma$; the maximal points then lie in a region parallel to the hypotenuse of $\Gamma$ and close to it. Writing $\Upsilon$ for the point process of maximal points and taking regions $E_i$, we define $Y_i := \Upsilon(E_i)$. Our interest is in the approximate joint distribution of $(Y_1, Y_2)^T$.
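The definition of maximality is easy to check directly for a finite configuration; this helper (ours) returns the maximal points of a list of planar points, comparing each point against every other one.

```python
def maximal_points(pts):
    """Points alpha of the configuration with no *other* point beta
    satisfying beta_1 >= alpha_1 and beta_2 >= alpha_2
    (coordinatewise domination)."""
    return [a for i, a in enumerate(pts)
            if not any(j != i and b[0] >= a[0] and b[1] >= a[1]
                       for j, b in enumerate(pts))]
```

Applied to a simulated Poisson sample on the triangle, the returned points cluster near the hypotenuse, as described above.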

The proofs of Theorems 2.1 and 2.2
Before proving our main theorems, we establish an auxiliary lemma. It is useful in what follows to be able to extend the definition of a function $h$ from the ball $B_{m\delta}(mc) \cap \mathbb{Z}^d$ to the whole of $\mathbb{Z}^d$ in such a way that $\|\Delta h\|_\infty$ can be bounded in terms of $\|\Delta h\|_{3m\delta/2,\infty}$. That this can be done, if $m\delta \ge 2\sqrt{d}$, is proved using the following lemma.
Lemma 5.1 Let $h : \mathbb{Z}^d \to \mathbb{R}$ be given. Then, for any $x \in \mathbb{R}^d$ and $r > 0$, it is possible to modify $h$ outside the set $\mathbb{Z}^d \cap B_r(x)$ in such a way that the resulting function $\tilde h$ satisfies the stated Lipschitz bound.

Proof: First, for all $y = (y_1, \ldots, y_d) \in B_r(x)$, the values of $h$ at the integer points $Z(y)$ of the unit cube containing $y$ differ from one another by at most the stated amount, because, for each $z \in Z(y)$, $|z - y| \le \sqrt{d}$. Extend the definition of $h$ to all $y \in B_r(x)$ by averaging over the values at the points $Z(y)$, with weights determined by the fractional parts $\{y_i\} := y_i - \lfloor y_i \rfloor$. It is immediate that $h$ is continuous in $B_r(x)$, and that its gradient is correspondingly bounded for $y$ in the interior of any unit cube. Hence it follows that $|h(y) - h(y')| \le \sqrt{d}\, \|\Delta h\|_{r+\sqrt{d},\infty}\, |y - y'|$ for any $y, y' \in B_r(x)$. Now define $\tilde h$ on $\mathbb{R}^d$ by setting $\tilde h(y) = h(y)$ on $B_r(x)$, and $\tilde h(y) = h(\pi_x y)$ for $y \notin B_r(x)$, where $\pi_x y := x + r(y - x)/|y - x|$ is the projection of $y$ onto the surface of $B_r(x)$. Then, since this projection does not increase distances, the claimed bound for $\tilde h$ follows. We are now in a position to prove our main theorems.
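The radial-projection extension just constructed can be written out directly; this sketch (ours, with the function name `extend_outside_ball` an assumption) implements $\tilde h(y) = h(\pi_x y)$ for a function `h` already defined on the ball, relying on the fact that projection onto a closed ball does not increase distances.

```python
import math

def extend_outside_ball(h, x, r):
    """Given h defined (at least) on the closed ball B_r(x), return htilde
    with htilde(y) = h(y) inside the ball, and h(pi_x(y)) outside, where
    pi_x(y) = x + r (y - x) / |y - x| projects y onto the sphere."""
    def htilde(y):
        diff = [yi - xi for yi, xi in zip(y, x)]
        norm = math.sqrt(sum(t * t for t in diff))
        if norm <= r:
            return h(tuple(y))
        return h(tuple(xi + r * t / norm for xi, t in zip(x, diff)))
    return htilde
```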
Proof of Theorems 2.1 and 2.2. We first prove Theorem 2.2. Condition (a) of Theorem 1.1 follows directly from (2.16), with $\varepsilon_W$ in place of $\varepsilon_1$. We thus turn to Condition (b), using the Stein operator $A_m$, as in (1.1), with $m$ as defined in (2.12).