Stein's method for nonconventional sums

We obtain almost optimal convergence rate in the central limit theorem for"nonconevntional"sums of the form $S_N=N^{-\frac12}\sum_{n=1}^N (F(\xi_n,\xi_{2n},...,\xi_{\ell n})-\bar F)$.


Introduction
Let Φ be the standard normal distribution function and let X 1 , X 2 , X 3 ... be a sequence of independent and identically distributed random variables such that EX 1 = 0 and 0 < EX 2 1 = σ 2 < ∞. The classical Berry-Esseen theorem provides a uniform approximation of the error term in the central limit theorem (CLT) for the sumsŜ n = 1 √ nσ n k=1 X k , stating that for any n ∈ N, where F n is the distribution function ofŜ n (see Section 6 of Ch. III in [17]) and C > 0 is an absolute constant which by efforts of many researchers was optimized by now to a number a bit less than 1/2. During the last 50 years there were several extensions of the CLT for sums of weakly dependent random variables and for martingales, including many estimates of error terms. Among the most used methods in the case of weak dependence are Gordin's method for martingale approximation (see [5], [12] and [6]) and Stein's method (see [15]). While Stein's method can yield close to optimal convergence rate (see [15] and [13]), martingale approximation method can not, since Berry-Esseen type estimates for martingales do not yield (in general) optimal convergence rates even for sums of independent random variables (see, for instance [6] and [1]).
Partially motivated by the research on nonconventional ergodic theorems (the term "nonconventional" comes from [4]), probabilistic limit theorems for sums of the form S N = N n=1 F (ξ q1(n) , ξ q2(n) , ..., ξ q ℓ (n) ) have become a well studied topic. Here {ξ n , n ≥ 0} is a sufficiently fast mixing processes with some stationarity properties and F is a function satisfying some regularity conditions. The summands here are nonstationary and long range dependent which makes it difficult to apply standard methods. This line of research started in [9], where the author proved a functional CLT for the normalized sums N − 1 2 S [N t] taking the characteristic function approximation approach. In [11] the authors established a functional CLT for more general q i 's than in [9], showing that the martingale approximation approach is applicable. Their results included the case when q i (n) = in which was the original motivation for the study of nonconventional averages (see [4]). In [7] the authors estimated the convergence rate of Z N = N − 1 2 S N in the Kolmogorov (uniform) metric towards its weak limit under the assumptions of [11]. The proof relied on Berry-Esseen type results for martingales, which led to estimates of order N − 1 10 ln(N + 1), which is far from optimal. In the special case when ξ n 's are independent the authors provided optimal rate of order N − 1 2 relying on Stein's method for sums of locally dependent random variables (see [3]).
The goal of this paper is to show that Stein's method is applicable for nonconventional sums when ξ n 's are weakly dependent, and to significantly improve the rates obtained in [7]. We first consider the case when F is a bounded Hölder continuous function and q i (n) = in for any 1 ≤ i ≤ ℓ and n ∈ N, and (in the self normalized case) provide almost optimal upper bound of the form We also obtain rates of the form where ǫ > 0 is an arbitrary positive constant and C ǫ is a constant which in general depends on ǫ. When {ξ n : n ≥ 0} forms a stationary and exponentially fast φ-mixing sequence then, in fact, we show that (1.2) and (1.3) hold true for any bounded function F which is not necessarily continuous. Convergence rates for more general functions and more general indexes q i (n)'s will be discussed, as well. As in [7], our results hold true when, for instance, ξ n = T n f where f = (f 1 , ..., f d ), T is a mixing subshift of finite type, a hyperbolic diffeomorphism or an expanding transformation taken with a Gibbs invariant measure, as well, as in the case when ξ n = f (Υ n ), f = (f 1 , ..., f d ) where Υ n is a Markov chain satisfying the Doeblin condition considered as a stationary process with respect to its invariant measure. In fact, any stationary and exponentially fast φ-mixing sequence {ξ n } can be considered. In the dynamical systems case each f i should be either Hölder continuous or piecewise constant on elements of Markov partitions. As an application we can consider ξ n = ((ξ n ) 1 , ..., (ξ n ) ℓ ), (ξ n ) j = I Aj (T n x) in the dynamical systems case and (ξ n ) j = I Aj (Υ n ) in the Markov chain case where I A is the indicator of j ) be a bounded Hölder continuous function which identifies with the function G(x 1 , ..., on the cube ([0, 1] ℘ ) ℓ . Let N (n) be the number of l's between 0 and n for which T qj(l) x ∈ A j for j = 0, 1, ..., ℓ (or Υ qj (l) ∈ A j in the Markov chains case), where we set q 0 = 0, namely the number of ℓ−tuples of return times to A j 's (either by T qj(l) or by Υ qj (l) ). Then our results yield a central limit theorem with almost optimal convergence rate for the numbers N (n).

Acknowledgement
This paper is a part of the author's PhD thesis conducted at the Hebrew university of Jerusalem. I would like to thank my advisor Professor Yuri Kifer for suggesting to me the problem studied in this paper and for many helpful discussions.

Preliminaries and main results
Our setup consists of a ℘-dimensional stochastic process {ξ n , n ≥ 0} on a probability space (Ω, F , P ) and a family of sub σ−algebras We will impose restrictions of the mixing coefficients where we recall that for any two sub σ−algebras G, H ⊂ F , In order to ensure some applications, in particular, to dynamical systems we will not assume that ξ n is measurable with respect to F n,n but instead impose conditions on the approximation rates where X L ∞ denotes the essential supremum of the absolute value of a random variable X. We do not require stationarity of the process {ξ n , n ≥ 0}, assuming only that the distribution of ξ n does not depend on n and that the joint distribution of (ξ n , ξ m ) depends only on n − m which we write for further reference by (2.3) ξ n ∼ µ and ξ n , ξ m ∼ µ m−n where Y ∼ µ means that Y has µ for its distribution. Let F = F (x 1 , ..., x ℓ ) : (R ℘ ) ℓ → R, ℓ ≥ 1 be a bounded Hölder function and let M > 0 and κ ∈ (0, 1] be such that for any x = (x 1 , ..., x ℓ ) and y = (y 1 , ..., y ℓ ) in (R ℘ ) ℓ . To simplify formulas we assume the centering condition which is not really a restriction since we can always replace F by F −F . The main goal of this paper is to prove a central limit theorem with convergence rate for the normalized sums (c N ) −1 S N , where F (ξ n , ξ 2n , ..., ξ ℓn ) and either c N = N − 1 2 or c N = ES 2 N . 2.1. Assumption. There exist d > 0 and c ∈ (0, 1) such that The following theorem is a consequence of the arguments in [11], [10] and [7] and is formulated here for readers' convenience.

2.2.
Theorem. Suppose that Assumption (2.1) is satisfied. Then the limit D 2 = lim N →∞ N −1 ES 2 N exists and there exists C 1 > 0 which depends only on ℓ, c and d such that Next, recall that the Kolmogorov (uniform) metric is defined for each pairs of distributions L 1 and L 2 on R with distribution functions G 1 and G 2 by For any random variable X we denote its law by L(X). Our main result is the following theorem.

Theorem. Suppose that Assumption (2.1) holds true and that
Let N (0, 1) be the zero mean normal distribution with variance 1. Then there exists a constant C > 0 which depends only on ℓ, d and c such that Moreover, for any ǫ > 0 there exists a constant c ǫ > 0 which depends only on ǫ, c, d and ℓ so that for any N ≥ 1, where N (0, D 2 ) is the zero mean normal distribution with variance D 2 . When β ∞ (r 0 ) = 0 for some r 0 then (2.9) and (2.10) hold true with constants C and c ǫ which depend also on r 0 , assuming only that F is a bounded function satisfying (2.4).
The outline of the proof goes as follows. Relying on [13], Stein's method becomes effective for the sum S N when {F n : 1 ≤ n ≤ N }, F n = F (ξ n , ξ 2n , ..., ξ ℓn ) are locally weak dependent in the sense that there exist sets A n and nonnegative integers d n,m , 1 ≤ n, m ≤ N such that n ∈ A n a n = |A n | and b n (k) = |{1 ≤ m ≤ N : d n,m = k}|, k ≥ 0 are small relatively to N , F n and {F s : s ∈ A n } are weakly dependent and the random vectors F n = {F k : k ∈ A n } and F m = {F s : s ∈ A m } are weakly dependent when d n,m is sufficiently large. We first reduce the problem of approximation of the left hand side of (2.9) to the case when ξ = {ξ n : n ≥ 0} forms a sufficiently fast φ-mixing process. Then we consider the sets |in − jm| ≤ l} and the numbers d n,m = min{|ia− jb| : a ∈ A n , b ∈ A m , 1 ≤ i, j ≤ ℓ} and show that a n and b n (k) defined above are of order l. In Section 3 we provide estimates which will show that the required type of the above weak dependence is satisfied, and then we take l of the form l = A ln(N + 1) to complete the proof. In fact, existing estimates on the left hand side of (2.9) using Stein's method become effective only after using the expectation estimates obtained in Section 3 even for "conventional" sums of φ-mixing sequences (i.e. in the case ℓ = 1), which is a particular case of our setup, and so, in particular, we show that Stein's method is effective for such sums and yields almost optimal convergence rate.

Auxiliary results
The following result will be used.
3.1. Lemma. Let X and Y be two random variables defined on the same probability space. Let Z be a random variable with density ρ bounded from above by some constant c > 0. Then, The second inequality is proved in Lemma 3.3 in [8], while the proof of the first inequality goes in the same way as the proof of that Lemma 3.3, taking in Next, we recall that (see [2], Ch. 4) for any two sub σ−algebras G, H ⊂ F , The following lemma does not seem to be new but for reader's convenience and completeness we will sketch its proof here.
denote by κ the distribution of the random vector (V 1 , V 2 ) and consider the where {A i } is a measurable partition of the support of µ 1 . Any uniformly continuous function H is a uniform limit of functions of the above form, which implies that (3.2) holds true for uniformly continuous functions. Finally, by Lusin's theorem (see [14]), any function H ∈ L ∞ (R d , B, ν) is an L 1 (and a.s.) limit of a sequence .., s and H ∞ stands for the supremum of H. Namely, let U (i) (C i ) be independent copies of the processes U (C i ), i = 1, ..., s. Then where j i satisfies that i ∈ C ji , for any 1 ≤ i ≤ k.
Proof. Denote by ν i the distribution of U i , i = 1, .., k. We first prove by induction on k that for any choice of H and U i 's with the required properties, which means that (3.4) holds true when k = 2. Now, suppose that (3.4) holds true for any k ≤ j − 1, U 1 , ..., U k with the required properties and any bounded Borel function H : R e1+...+e k−1 → R, where e 1 , ..., e k−1 ∈ N. In order to deduce (3.4) for Applying the induction hypothesis with the function h completes the proof of (3.4), since h ∞ ≤ H ∞ . Next, we prove by induction on s that for any choice of k, H, U i 's with the required properties and C 1 , ..., C s , For s = 1 this is just (3.4). Now suppose that (3.5) holds true for any s ≤ j − 1, and any real valued bounded Borel function H defined on R d1+...+d k , where k and d 1 , ..., d k are some natural numbers. In order to prove (3.5) for s = j, set u (I) = (u (C1) , u (C2) , ..., u (Cs−1) ) and let the function I be defined by Then H(u 1 , u 2 , ..., u k )dν 1 (u 1 )dν 2 (u 2 )...dν k (u k ) = I(u (I) ) j ∈Cs dν j (u j ).
It is clear that J ∞ ≤ H ∞ . Applying the induction hypothesis with the function J (considered as a function of the variable u) and taking into account (3.7) and (3.9) we obtain (3.5) with s = j. We complete the induction. Inequality In general we can replace H ∞ appearing in the right hand side of (3.3) by some essential supremum norm of H with respect to some measure which has a similar but more complicated form as κ defined in Lemma 3.2.

Nonconventional CLT with almost optimal convergence rate via Stein's method
First, the proof of Theorem 2.2 follows from arguments in [11], [10], and [7]. Indeed, relying on (2.25) in [11], the conditions of [7] and [10] hold true in our circumstances. Existence of D 2 follows from Theorem 2.2 in [11], inequality (2.8) follows from the arguments in [10] (first by considering the case when M = 1) and the condition for positivity follows from Theorem 2.3 in [7].
Before proving Theorem 2.3 we introduce the following notations. For any a, b ∈ R set Finally, for any A 1 , A 2 , ..., A L ⊂ R, we will write A 1 < A 2 < ... < A L if a 1 < a 2 < ... < a L for any a i ∈ A i , i = 1, 2, ..., L.
Observe now that where we used that S N,r L ∞ ≤ 2N (recall our assumption that M = 1). Next, using (4.1), (4.5) and that ln(N + 1) ≤ N 1 2 for any N ≥ 1 we derive that min(s 2 N ,s 2 N,r ) ≥ 1 4 D 2 N when 3N 1 2 D 2 ≥ 8(C 1 + c 2 ) and in this case where c 4 = C 4 (1 + c 0 + c 2 + A), C 4 > 1 is some absolute constant and we also used that N 1+A ln c < 1 . Next, let N be sufficiently large so that 3N . Then by (2.4) and the above lower bound of (s N,r ) 2 , We claim that there exist constants K 1 and K 2 which depend only on ℓ such that for any n = 1, 2, ..., N and k ≥ 0, |A n | ≤ K 1 l and |A n (k)| ≤ K 2 l. (4.9) Indeed, since A n is contained in a union of at most ℓ 2 intervals whose lengths do not exceed 2l + 1 we have |A n | ≤ ℓ 2 (2l + 1) and since l ≥ 1 we can take K 1 = 3ℓ 2 . To prove the second inequality, let n and m be such that d ℓ (A n , A m ) = k ≥ 0. Then there exist 1 ≤ i s , j s ≤ ℓ, s = 1, 2, 3 and 1 ≤ u, v ≤ N such that |i 3 u − j 3 v| = k, |i 1 n − j 1 u| ≤ l and |i 2 m − j 2 v| ≤ l. When j 3 v − i 3 u 3 = k, we deduce from the last two inequalities that and similar inequality holds when j 3 v − i 3 u 3 = −k. Thus, when n and k are fixed the set A n (k) is contained in a union of 2ℓ 6 intervals whose lengths do not exceed 2(ℓ 2 + 1)l, and the choice of K 2 = 4ℓ 6 · (ℓ 2 + 2) is sufficient. Now, set δ = δ l,N = N n=1 m∈An EY n Y m . Then where γ = γ l,N = N n=1 m∈{1,...,N }\An EY n Y m . Let 1 ≤ n, m ≤ N be such that m ∈ A n . Consider the sets of indexes Γ k = {jn : 1 ≤ j ≤ ℓ} where k = n, m and set Γ n,m = Γ n ∪ Γ m . By the definition of the set A n we have dist(Γ n , Γ m ) > l. Therefore, the set Γ n,m can be represented in the form where L ≤ 2ℓ, each B t is either a subset of Γ n or of Γ m and dist(B t , B t−1 ) > l. Set U t = {ξ s,r : s ∈ B t }, t = 1, ..., L.
Since r ≤ l 4 , there exist numbers n t and m t , t = 1, ..., L such that n t−1 < m t ≤ n t < m t+1 + l 2 , where n 0 = −∞, m L+1 := ∞, and each U t is measurable with respect to F mt,nt . Set {1, 2, ..., L}, Y n is measurable with respect to σ{U t : t ∈ C 1 } and Y m is measurable with respect to σ{U t : t ∈ C 2 }. Therefore, by (3.10) and (4.8) and since EY n = 0, We assume now, in addition to the previous restriction on N , that 64dℓD −2 N − 1 2 < 1 2 . Then δ > 1 2 and so we can set σ = √ δ and W =Z N,r σ . Then σ 2 ≥ 1 2 and using (4.11) we obtain where we also used thats N,r ≥ 1 2 DN 1 2 . Since A| ln c| > 1 the above right hand side does not exceed 16D −3 N − 1 2 which together with (4.7) and Lemma 3.1 yields that (4.13) where c 5 = 16c 4 . In order to approximate d K (L(W ), N (0, 1)), set X n = σ −1 Y n , n = 1, 2, ..., N . Then W = N n=1 X n and by (4.8) Applying Theorem 2.1 in [13], using the equality (15) from there and taking into account (4.9) we obtain that N n=1 X n m∈An X m 2 2 W 2 + 5 and X q q = E|X| q = X q L q for any random variable X. Now we estimate R 1 , R 2 and R 3 . Set T n = m∈An (X n X m − EX n X m ), n = 1, ..., N . Then Let n 1 and n 2 be such that d ℓ (A n1 , A n2 ) = k > 2r. Consider the sets Γ s = {jm : m ∈ A ns , 1 ≤ j ≤ ℓ}, s = 1, 2. Then dist(Γ 1 , Then there exist numbers m t , n t , t = 1, 2, ..., L such that .., L} and T ns , s = 1, 2 is measurable with respect to σ{U t : t ∈ C s }. Since X n L ∞ ≤ R we have T n L ∞ ≤ 2K 1 lR 2 (recalling (4.9)) and thus by (3.10), where we used that ET n = 0. Given n 1 and k > 2r, the number of n 2 's satisfying d ℓ (A n1 , A n2 ) = k is at most K 2 l (recalling (4.9)), while for any other n 2 and k we can use the trivial upper bound |ET n1 T n2 | ≤ T n1 L ∞ T n2 L ∞ ≤ 4K 2 1 l 2 R 4 . Therefore, by the definitions of R and r, where C 0 is a constant which depends only on c and d and ℓ. In order to approximate R 2 , let 1 ≤ n ≤ N and set X n = {X m : m / ∈ A n }. Then, . Consider the sets τ 1 = {n, 2n, ..., ℓn} and τ 2 = {jm : m ∈ A n , 1 ≤ j ≤ ℓ}. Then by the definition of A n we have dist(τ 1 , τ 2 ) > l. Thus, the union τ 1 ∪ τ 2 can be written as a union of at most 2ℓ+1 disjoint sets B 1 , B 2 , ..., B L such that B 1 < B 2 < ... < B L , dist(B t , B t+1 ) > l and each B t is either a subset of τ 1 of a subset of τ 2 . Consider the random vectors U t = {ξ s,r : s ∈ B t }, t = 1, ..., L and the partition of {1, 2, ..., L} into the sets Then X n is measurable with respect to σ{U t : t ∈ C 1 } and E[X n |X n ] is measurable with respect to σ{U t : t ∈ C 2 }. Therefore by (3.10) and (2.7), where we used that r ≤ l 4 , EX n = 0 and that E[X n |X n ] L ∞ ≤ X n L ∞ ≤ R. We conclude from (4.14) and (4.15) that there exists a constant C ′ 0 which depends only on ℓ such that To estimate R 3 , first observe that by the definition of W and by (4.10) we have W 2 2 = δ −1 Z N,r 2 2 = 1 + δ −1 γ and therefore W 2 ≤ 2, since |γ| < 1 2 and δ > 1 2 . The first factor in the definition of R 3 is clearly bounded from above by 2N K 2 1 l 2 R 3 and we conclude that for some constant C 4 which depends only on ℓ. The estimate (2.9) in Theorem 2.3 follows now by taking any A > max(1, 2| ln c| −1 ), using (4.13) and the above estimates of R i 's. Note that all the approximations in this section hold true only for N 's satisfying 3N ≥ 8D −2 (C 1 + c 2 ) and 64dℓD −2 N − 1 2 < 1 2 , but inequality (2.9) follows for all other N 's from the basic estimate d K (L(Z N ), N (0, 1)) ≤ 1. We also remark that when β ∞ (r 0 ) = 0 for some r 0 then taking r ≥ r 0 we get S N,r = S N and so there is no need for (2.5) to hold true. Now we derive (2.10) where again it is sufficient to consider the case when M = 1. Let 0 < ǫ < 1 4 . First for any b > 1, where in the second equality we used that |x −1 − y −1 | = |x 2 − y 2 |(xy 2 + yx 2 ) −1 for any x, y > 0. By Lemma 5.2 in [7] for any b > 1 there exits Γ b which depends only on c, d, b and ℓ so that Using the previous estimates, for any N so that 3N where and we also used (2.8). Applying the second statement of Lemma 3.1 with b = 1 2ǫ − 1 and using (2.9) completes the proof of (2.10).
Nonlinear indexes. Let q i , i = 1, ..., ℓ be strictly increasing functions satisfying q i (N) ⊂ N which are ordered so that q 1 (n) < q 2 (n) < ... < q ℓ (n) for any sufficiently large n. The proof of Theorem 2.3 will proceed similarly for the sums S N if we show that limit D 2 = lim N →∞ ES 2 N exists, obtain convergence rate towards it and upper bounds similar to the ones in (4.9). Suppose that q 1 , ..., q k are linear, for some k < ℓ and that q j , j ≥ k are not. When all q i 's are polynomials, existence of D 2 is proved in [8]. Though the limit D 2 does not exist in general, if q j+1 grows faster then q j for j > k in the sense of (2.11) in [11], then existence of D 2 follows from Theorem 2.3 in [11]. Convergence rate towards D 2 when q i 's are polynomials can be obtained by proceeding similarly to the proof of Proposition 5.3 in [8]. If, instead, q j+1 (n α ) − q j (n) converges to ∞ as n → ∞ for some 0 < α < 1 and all j ≥ k, then convergence rate towards D 2 with some dependence on α follows from the arguments in [11].
Each q i (n) grows at least as fast as linearly which implies that |A n | is of order l. When all q i 's are polynomials of the same degree then the limit lim n→∞ q −1 i (q j (n))/n exists for any 1 ≤ i, j ≤ ℓ and therefore the proof of the second upper bound in (4.9) proceeds in a similar way but withd ℓ (a, b) = min 1≤i,j≤ℓ |q i (a) − q j (b)| in place of d ℓ (a, b). When q i 's do not necessarily have the same degree then beginning the summation in the definition of S N from cN γ for appropriate γ < 1 and c > 0, guarantees that |q i (n) − q j (m)| > CN when deg q i = deg q j and cN γ ≤ n, m ≤ N . Similar to the latter inequality is satisfied when max(i, j) > k and q s grows faster than q s−1 for s = k + 1, ..., ℓ and so an appropriate version of (4.9) follows in this situation, as well.