Berry-Esseen Bounds for Typical Weighted Sums

Under correlation-type conditions, we derive upper bounds of order $\frac{1}{\sqrt{n}}$ for the Kolmogorov distance between the distributions of weighted sums of dependent summands and the normal law.


1. Introduction
Given a random vector $X = (X_1, \dots, X_n)$ in $\mathbb{R}^n$ ($n \geq 2$), we consider the weighted sums $S_\theta = \theta_1 X_1 + \cdots + \theta_n X_n$, $\theta = (\theta_1, \dots, \theta_n) \in S^{n-1}$, parameterized by points of the unit sphere $S^{n-1} = \{\theta \in \mathbb{R}^n : \theta_1^2 + \cdots + \theta_n^2 = 1\}$, which carries the normalized Lebesgue measure $\mu_{n-1}$. According to the celebrated result by Sudakov [S], if $n$ is large and the covariance matrix of $X$ has bounded spectral radius, the distribution functions $F_\theta(x) = \mathbb{P}\{S_\theta \leq x\}$ concentrate around the typical distribution function given by the mean
$$F(x) = \mathbb{E}_\theta F_\theta(x) = \int_{S^{n-1}} F_\theta(x)\, d\mu_{n-1}(\theta), \qquad x \in \mathbb{R}. \tag{1.1}$$
Moreover, if we want to approximate most of the $F_\theta$'s by the standard normal distribution function $\Phi$, we are led to another concentration problem, namely rates for the distance $\rho(F, \Phi)$. To this aim, let us rewrite the definition (1.1) as $F(x) = \mathbb{P}\{r Z_n \leq x\}$ with
$$r^2 = \frac{|X|^2}{n} = \frac{X_1^2 + \cdots + X_n^2}{n} \qquad (r \geq 0),$$
where the random variable $Z_n$ is independent of $r$ and has the same distribution as $\sqrt{n}\, \theta_1$ under $\mu_{n-1}$. Since $Z_n$ is close to being standard normal, $F$ itself is approximately normal if and only if $r^2$ is nearly constant, which translates into a weak law of large numbers for the sequence $X_k^2$. This property, that the distribution of $r^2$ is concentrated around a point, may be quantified by the variance-type functionals $\sigma_{2p}$, which are expected to be of order 1 in reasonable situations (at least, they are finite as long as $M_{2p} < \infty$). For example, if $|X|^2 = n$ a.s., we have $\sigma_{2p} = 0$. If the components $X_k$ are pairwise independent, identically distributed, with $\mathbb{E} X_1^2 = 1$, then $\sigma_4^2 = \frac{1}{n} \operatorname{Var}(|X|^2) = \operatorname{Var}(X_1^2)$. It turns out that control of the two functionals $M_3$ and $\sigma_3$ is sufficient to guarantee a Berry-Esseen type rate of normal approximation for $F_\theta$ on average, in analogy with the Berry-Esseen theorem for independent identically distributed random variables. Since the second moment of the typical distribution $F$ equals $\mathbb{E} r^2$, a normalization condition on this moment is desirable.
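As a quick numerical sanity check of the representation $F(x) = \mathbb{P}\{rZ_n \leq x\}$, the following sketch (not part of the original text) takes a standard Gaussian $X$, an assumed isotropic test case, and compares the empirical law of $S_\theta$ with that of $rZ_n$; for this choice both coincide with $\Phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 20000

# X standard Gaussian (an assumed isotropic example), theta uniform on S^{n-1}.
X = rng.standard_normal((trials, n))
theta = rng.standard_normal((trials, n))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

# Samples of S_theta with theta ~ mu_{n-1}: their law is the typical F.
S = np.sum(X * theta, axis=1)

# Alternative representation: r = |X|/sqrt(n), Z_n = sqrt(n) * theta_1;
# r and Z_n are independent since X and theta are drawn independently.
r = np.linalg.norm(X, axis=1) / np.sqrt(n)
Zn = np.sqrt(n) * theta[:, 0]
T = r * Zn

# Compare the two empirical distribution functions on a grid.
grid = np.linspace(-3.0, 3.0, 61)
F1 = np.mean(S[:, None] <= grid, axis=0)
F2 = np.mean(T[:, None] <= grid, axis=0)
gap = float(np.max(np.abs(F1 - F2)))   # empirical Kolmogorov-type distance
```

With 20000 samples the two empirical distribution functions agree up to Monte Carlo noise of a few percent.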
Theorem 1.1. If $\mathbb{E}|X|^2 = n$, then with some absolute constant $c$
$$\mathbb{E}_\theta\, \rho(F_\theta, \Phi) \leq c\, \frac{M_3^3 + \sigma_3^3}{\sqrt{n}}. \tag{1.2}$$
In the case of non-correlated random variables $X_k$ with mean zero and variance one, all $S_\theta$ also have mean zero and variance one, so that $M_2 = 1$. In many interesting examples, $M_3$ is known to be of the same order as $M_2$ (in particular, when Khinchine-type inequalities are available for linear functionals of $X$). In some other examples, however, the magnitude of $M_3$ is much larger, and there control via $M_2$ is preferable, as the following assertion shows.
Theorem 1.2. If $\mathbb{E}|X|^2 = n$, then for some absolute constant $c$
$$\mathbb{E}_\theta\, \rho(F_\theta, \Phi) \leq c\, (M_2 + \sigma_2)\, \frac{\log n}{\sqrt{n}}. \tag{1.3}$$
Thus, modulo an additional logarithmic factor, a Berry-Esseen type rate holds for this average under a second moment assumption only.
For an illustration, consider the trigonometric system $X = (X_1, \dots, X_n)$ with components
$$X_{2k-1}(\omega) = \sqrt{2}\, \cos(k\omega), \qquad X_{2k}(\omega) = \sqrt{2}\, \sin(k\omega), \qquad -\pi < \omega < \pi, \ k = 1, \dots, n/2,$$
assuming that $n$ is even. They may be treated as random variables on the probability space $\Omega = (-\pi, \pi)$ equipped with the normalized Lebesgue measure $\mathbb{P}$, so that the linear forms $S_\theta$ represent trigonometric polynomials of degree at most $n/2$. The normalization $\sqrt{2}$ is chosen for convenience only, since then $X$ is isotropic, so that $M_2 = 1$. Since also $\sigma_2 = 0$, by Theorem 1.2, most of the distributions $F_\theta$ of $S_\theta$ are approximately standard normal, and we have an upper bound
$$\mathbb{E}_\theta\, \rho(F_\theta, \Phi) \leq c\, \frac{\log n}{\sqrt{n}}. \tag{1.4}$$
The study of asymptotic normality for trigonometric polynomials has a long history, starting with results on lacunary systems due to Kac [Ka], Salem and Zygmund [S-Z1-2], and Gaposhkin [G]; see also [B-G1-2], [A-B], [F], [A-E]. As we see, normality with an almost Berry-Esseen type rate remains valid for most choices of coefficients even without an assumption of lacunarity. One can show that the inequality (1.4) still holds for many other functional orthogonal systems as well, including, for instance, the Chebyshev polynomials on the interval $\Omega = (-1, 1)$ and the Walsh system on the Boolean cube $\{-1, 1\}^n$. It holds as well for any system of functions of the form $X_k(\omega_1, \omega_2) = f(k\omega_1 + \omega_2)$, $\omega_1, \omega_2 \in (0, 1)$, where $f$ is 1-periodic and belongs to $L^4(0, 1)$ (this is a strictly stationary sequence of pairwise independent random variables). A common feature of all the listed examples is that (1.4) may actually be reversed modulo a factor of $(\log n)^s$ with some $s > 0$. (However, we do not derive lower bounds here, referring the interested reader to [B-C-G2].)
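The two properties of the trigonometric system used above, isotropy ($M_2 = 1$) and $|X|^2 = n$ pointwise (so $\sigma_2 = 0$), are easy to verify numerically; the Monte Carlo sampling of $\omega$ below is an illustration assumed here, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 200000                  # n must be even
w = rng.uniform(-np.pi, np.pi, size=m)

k = np.arange(1, n // 2 + 1)
X = np.empty((m, n))
X[:, 0::2] = np.sqrt(2) * np.cos(np.outer(w, k))   # X_{2k-1}
X[:, 1::2] = np.sqrt(2) * np.sin(np.outer(w, k))   # X_{2k}

# Isotropy: E <X,a>^2 = |a|^2 means the second-moment matrix is the identity.
R = X.T @ X / m
iso_err = float(np.max(np.abs(R - np.eye(n))))

# |X|^2 = 2 * sum_k (cos^2(kw) + sin^2(kw)) = n pointwise, hence sigma_2 = 0.
norm_err = float(np.max(np.abs(np.sum(X**2, axis=1) - n)))
```

The orthonormality relations of sines and cosines make `R` the identity up to sampling noise, while `norm_err` vanishes up to floating-point error.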
The conditions of Theorem 1.2 may be further relaxed in order to eliminate the dependence on $\sigma_2^2$. This can be achieved by replacing it by the requirement that the probabilities $\mathbb{P}\{|X - Y|^2 \leq n/4\}$ be small, where $Y$ is an independent copy of $X$; cf. Theorem 6.3 below. This extends the applicability of our results to further groups of examples, while replacing $\Phi$ by a certain mixture of centered Gaussian measures. More precisely, define $G$ to be the law of $rZ$, where $Z \sim N(0,1)$ is independent of $r = \frac{1}{\sqrt{n}}\, |X|$. In particular, we have:

Theorem 1.3. If the components $X_k$ of the random vector $X$ in $\mathbb{R}^n$ are independent, identically distributed, have mean zero and finite second moment, then
$$\mathbb{E}_\theta\, \rho(F_\theta, G) \leq c\, \frac{\log n}{\sqrt{n}},$$
where the constant $c$ depends on the distribution of $X_1$ only.
At first sight it seems surprising that an approximate Berry-Esseen type rate holds under no additional assumption beyond the finiteness of the second moment. Indeed, in the classical situation of equal coefficients, and when $\mathbb{E} X_1 = 0$, $\mathbb{E} X_1^2 = 1$, the distributions $F_n$ of the normalized sums $S_n = (X_1 + \cdots + X_n)/\sqrt{n}$ may approach the standard normal law at an arbitrarily slow rate: for any sequence $\varepsilon_n \to 0^+$, one may choose the distribution of $X_1$ such that $\rho(F_n, \Phi) \geq \varepsilon_n$ for all $n$ large enough (cf. [M]). This shows that for typical coefficients, the distributions $F_\theta$ behave in a more stable way than $F_n$. This interesting phenomenon has been studied before. For example, Klartag and Sodin [K-S] have shown, in the i.i.d. case and under a 4-th moment assumption, that the average distance $\mathbb{E}_\theta\, \rho(F_\theta, \Phi)$ is at most of order $1/n$, thus essentially improving the standard rate in the Berry-Esseen theorem (see also [Kl]). The paper is organized as follows. We start with comments on general properties of the moment and variance-type functionals. Then we turn to the normal approximation for distributions of the first coordinate on the sphere (with rate of order $1/n$), which is used in Section 4 to describe proper bounds on the distance from the typical distributions to the standard normal law. The proofs of both Theorems 1.1 and 1.2 rely upon the spherical Poincaré inequality and Berry-Esseen-type estimates in terms of characteristic functions. The characteristic functions of the weighted sums are discussed separately in Section 5. Their properties are used in Section 6 to complete the proof of Theorem 1.2 (in a more general form). Theorem 1.1 is proved in Section 7, and in the last section we turn to the i.i.d. case and Theorem 1.3.

2. Moment and variance-type functionals
First let us describe some basic properties of the functionals $M_p = M_p(X)$ and $\sigma_{2p} = \sigma_{2p}(X)$. We shall also introduce a few additional functionals. Define
$$M_p = \sup_{\theta \in S^{n-1}} \big(\mathbb{E}\, |\langle X, \theta\rangle|^p\big)^{1/p}, \qquad m_p = \frac{1}{\sqrt{n}}\, \big(\mathbb{E}\, |\langle X, Y\rangle|^p\big)^{1/p},$$
where $Y$ is an independent copy of $X$. None of these quantities depends on the choice of coordinates, that is, $m_p(UX) = m_p(X)$ and $M_p(UX) = M_p(X)$ for any orthogonal linear map $U : \mathbb{R}^n \to \mathbb{R}^n$.
We call $M_p$ the $p$-th moment of $X$. In case $M_2$ is finite, one may consider the covariance operator (matrix) $R$ of $X$, defined by the equality
$$\mathbb{E}\, \langle X, a\rangle^2 = \langle Ra, a\rangle, \qquad a \in \mathbb{R}^n.$$
It is symmetric and positive semi-definite, with non-negative eigenvalues $\lambda_i$ ($1 \leq i \leq n$).
Choosing a system of coordinates in which $R$ is diagonal, with entries $\lambda_i$, we see that
$$M_2^2 = \max_i \lambda_i, \qquad m_2^2 = \frac{1}{n} \sum_{i=1}^n \lambda_i^2. \tag{2.2}$$
The random vector $X$ is called isotropic (or is said to have an isotropic distribution) if the covariance matrix of $X$ is the identity, i.e., $\mathbb{E}\, \langle X, a\rangle^2 = |a|^2$ for all $a \in \mathbb{R}^n$.
In this case, $m_2 = M_2 = 1$ and $\mathbb{E}|X|^2 = n$. The class of isotropic distributions is invariant under orthogonal transformations of the space. Applying Cauchy's inequality in (2.2), we immediately obtain:

Proposition 2.1. For any random vector $X$ in $\mathbb{R}^n$ with $\mathbb{E}|X|^2 = n$, we have $m_2 \geq 1$, where equality is attained if and only if $X$ is isotropic.
The p-th moments of X may easily be related to the moments of |X|.
Proof. By the rotational invariance of the uniform distribution on $S^{n-1}$, we have, for any fixed $a \in \mathbb{R}^n$,
$$|a|^p\ \mathbb{E}_\theta\, |\theta_1|^p = \mathbb{E}_\theta\, |\langle a, \theta\rangle|^p,$$
where $\mathbb{E}_\theta$ denotes the integral over the uniform measure $\nu_{n-1}$. Inserting here $a = X$, we get $|X|^p\, \mathbb{E}_\theta |\theta_1|^p = \mathbb{E}_\theta\, |\langle X, \theta\rangle|^p$. Next, take the expectation with respect to $X$ and use $\mathbb{E}\, |\langle X, \theta\rangle|^p \leq M_p^p$ to arrive at the upper bound. Indeed, let $Y$ be an independent copy of the random vector $X$. By the very definition, for any particular value of $Y$, we have $\mathbb{E}_X\, |\langle X, Y\rangle|^p \leq M_p^p\, |Y|^p$. It remains to take the expectation with respect to $Y$.
In particular, $m_2 \leq M_2^2$, as can also be seen from (2.2). The identities in (2.2) also show that, in the general non-isotropic case, $M_2^2$ may be larger than $m_2$. Let us now turn to the functionals
$$\sigma_{2p}^2 = \frac{1}{n}\, \Big(\mathbb{E}\, \big|\, |X|^2 - n \big|^p\Big)^{2/p}, \qquad p \geq 1,$$
where it is natural to assume that $\mathbb{E}|X|^2 = n$. Note that $\sigma_{2p}$ represents a non-decreasing function of $p$, which attains its minimum at $p = 1$ with value $\sigma_2^2 = \frac{1}{n}\, \big(\mathbb{E}\, |\, |X|^2 - n|\big)^2$. Another important value is $\sigma_4^2 = \frac{1}{n} \operatorname{Var}(|X|^2)$. These functionals may be related to the variance of the Euclidean norm.
Proof. Put $\xi = \frac{1}{\sqrt{n}}\, |X|$ and $a = \mathbb{E}\xi^2$. Then, since $\xi \geq 0$, we obtain $\mathbb{E}\xi^2\, \operatorname{Var}(\xi) \leq \operatorname{Var}(\xi^2)$, which is exactly the first required relation. Now, rewriting the second relation in terms of $\xi$, we find that $\sigma_2^2 \leq 4 \operatorname{Var}(|X|)$. The last inequality of the proposition may be rewritten as $1 - (\mathbb{E}\xi)^2 \leq \mathbb{E}\, |1 - \xi^2|$. If $(\Omega, \mathbb{P})$ is the underlying probability space, define the corresponding probability measure $Q$ and write $\mathbb{E}_Q$ for the expectation with respect to it; the required inequality then follows. The functionals $\sigma_{2p}^2$ and $m_p$ are useful in the problem of estimating "small ball" probabilities.
Proposition 2.5. Let $Y$ be an independent copy of a random vector $X$ in $\mathbb{R}^n$ such that $\mathbb{E}|X|^2 = n$, and let $p, q \geq 1$.

Proof. According to the definition of $\sigma_{2p}$, for any $\lambda \in (0, 1)$, Chebyshev's inequality provides an upper bound on $\mathbb{P}\{|X|^2 \leq \lambda n\}$. In particular, choosing $\lambda = 3/4$, we get a bound on $\mathbb{P}\{|X|^2 \leq \frac{3}{4}\, n\}$. On the other hand, Markov's inequality bounds $\mathbb{P}\{|\langle X, Y\rangle| \geq \frac{1}{4}\, n\}$, and we split the event $\{|X - Y|^2 \leq \frac{1}{4}\, n\}$ into the case $|\langle X, Y\rangle| \geq \frac{1}{4}\, n$ and the case of the opposite inequality. In view of the set inclusion
$$\Big\{|X - Y|^2 \leq \tfrac{1}{4}\, n,\ |\langle X, Y\rangle| < \tfrac{1}{4}\, n\Big\} \subset \Big\{|X|^2 + |Y|^2 \leq \tfrac{3}{4}\, n\Big\},$$
the proposition follows.
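Before moving on, the identity $\sigma_4^2 = \frac{1}{n}\operatorname{Var}(|X|^2) = \operatorname{Var}(X_1^2)$ for i.i.d. coordinates with $\mathbb{E}X_1^2 = 1$ (mentioned in the introduction) is easy to confirm numerically. The uniform law on $(-\sqrt{3}, \sqrt{3})$ is an assumed test distribution, for which $\mathbb{E}X_1^2 = 1$ and $\operatorname{Var}(X_1^2) = \mathbb{E}X_1^4 - 1 = 9/5 - 1 = 4/5$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 200000
# i.i.d. coordinates, uniform on (-sqrt(3), sqrt(3)): E X_1^2 = 1.
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(m, n))

sq_norm = np.sum(X**2, axis=1)
sigma4_sq = float(np.var(sq_norm) / n)      # sigma_4^2 = Var(|X|^2)/n
var_x1_sq = float(np.var(X[:, 0]**2))       # Var(X_1^2), should match
```

Both estimates converge to the exact value $4/5$ as the sample size grows.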

3. Linear functionals on the sphere
The aim of this section is to quantify the asymptotic normality of distributions of linear functionals with respect to the normalized Lebesgue measure $\mu_{n-1}$ on the unit sphere $S^{n-1} \subset \mathbb{R}^n$ ($n \geq 2$). By the rotational invariance of this measure, all linear functionals $f(\theta) = \langle\theta, v\rangle$ with $|v| = 1$ have equal distributions, and it is sufficient to focus on the first coordinate $\theta_1$ of the vector $\theta \in S^{n-1}$. As a random variable on the probability space $(S^{n-1}, \mu_{n-1})$, it has density
$$c_n\, (1 - x^2)_+^{\frac{n-3}{2}}, \qquad -1 < x < 1,$$
where $c_n = \frac{\Gamma(n/2)}{\sqrt{\pi}\, \Gamma(\frac{n-1}{2})}$ is a normalizing constant. Let us denote by $\varphi_n$ the density of the normalized first coordinate $Z_n = \sqrt{n}\, \theta_1$ under the measure $\mu_{n-1}$, i.e.,
$$\varphi_n(x) = c_n'\, \Big(1 - \frac{x^2}{n}\Big)_+^{\frac{n-3}{2}}, \qquad c_n' = \frac{c_n}{\sqrt{n}}.$$
Clearly, $\varphi_n(x) \to \varphi(x)$ as $n \to \infty$, where $\varphi$ denotes the standard normal density, and one can show that $c_n' < \frac{1}{\sqrt{2\pi}}$ for all $n \geq 2$. We are interested in non-uniform deviation bounds of $\varphi_n(x)$ from $\varphi(x)$.
First we consider the asymptotic behavior of the functions $p_n(x)$, for which there is a uniform bound. To study the rate of convergence of $p_n(x)$, assume that $|x| \leq \frac{1}{2}\sqrt{n}$. Taylor's expansion then gives an upper estimate, which together with the lower bound on $\delta$ yields a two-sided control. Combining the resulting inequality with (3.2), we also get a non-uniform bound on the whole real line, with an absolute constant $C$. Let us integrate this inequality over $x$. We then arrive at the conclusion (3.1) for the densities $\varphi_n$ for $n \geq 4$ as well.
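The closeness of $\varphi_n$ to $\varphi$ is easy to inspect numerically. The sketch below uses the standard beta-type formula for the density of $Z_n = \sqrt{n}\,\theta_1$, namely $\varphi_n(x) = c_n' (1 - x^2/n)_+^{(n-3)/2}$ with $c_n' = \Gamma(n/2)/(\sqrt{\pi n}\,\Gamma(\frac{n-1}{2}))$; this explicit formula is an assumption of the sketch, not a quotation from the text.

```python
import math
import numpy as np

n = 30
# Normalizing constant c'_n = Gamma(n/2) / (sqrt(pi n) Gamma((n-1)/2)).
cprime = math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2)) / math.sqrt(math.pi * n)

x = np.linspace(-6.0, 6.0, 2401)
dx = x[1] - x[0]
phi_n = cprime * np.clip(1 - x**2 / n, 0, None) ** ((n - 3) / 2)
phi = np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

total = float(np.sum(phi_n) * dx)             # ~1: support lies inside [-6, 6]
sup_gap = float(np.max(np.abs(phi_n - phi)))  # uniform gap, small for large n
below = cprime < 1 / math.sqrt(2 * math.pi)   # c'_n < 1/sqrt(2 pi)
```

For $n = 30$ the uniform gap is already below $10^{-2}$, consistent with a decay of order $1/n$.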
In the sequel we denote by $J_n$ the characteristic function of the first coordinate $\theta_1$ of a random vector $\theta$ which is uniformly distributed on the unit sphere $S^{n-1}$. In a more explicit form, for any $t \in \mathbb{R}$,
$$J_n(t) = \mathbb{E}_\theta\, e^{it\theta_1} = c_n \int_{-1}^{1} e^{itx}\, (1 - x^2)^{\frac{n-3}{2}}\, dx.$$
Note that this integral is closely related to the classical Bessel function of the first kind with index $\nu$ ([Ba], p. 81); namely,
$$J_n(t) = \Gamma\Big(\frac{n}{2}\Big)\, \Big(\frac{2}{t}\Big)^{\frac{n}{2}-1}\, J_{\frac{n}{2}-1}(t), \qquad t > 0.$$
However, this relationship will not be used in the sequel. Thus, the characteristic function of $Z_n = \sqrt{n}\, \theta_1$ is given by
$$\widehat{\varphi}_n(t) = J_n(t\sqrt{n}),$$
which is the Fourier transform of the probability density $\varphi_n$. One immediate consequence of Proposition 3.1 is the following:

Corollary 3.2. For all $t \in \mathbb{R}$,
$$\big|J_n(t\sqrt{n}) - e^{-t^2/2}\big| \leq \frac{C}{n},$$
where $C$ is an absolute constant.
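The behavior of $J_n$ can be explored numerically by quadrature against the density $c_n(1-x^2)^{(n-3)/2}$ of $\theta_1$. The sketch below checks two facts quoted in this paper: the Gaussian-type envelope $|J_n(t)| \leq 4.1\, e^{-t^2/2n} + 4\, e^{-n/12}$ (used in Section 5) and the closeness of $J_n(t\sqrt{n})$ to $e^{-t^2/2}$; the quadrature itself is an assumed numerical device.

```python
import math
import numpy as np

def J(n, ts):
    # J_n(t) = integral of cos(t x) against c_n (1 - x^2)^{(n-3)/2} on (-1, 1).
    x = np.linspace(-1.0, 1.0, 20001)[1:-1]
    dx = x[1] - x[0]
    c = math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2)) / math.sqrt(math.pi)
    dens = c * (1 - x**2) ** ((n - 3) / 2)
    return np.array([np.sum(dens * np.cos(t * x)) * dx for t in np.atleast_1d(ts)])

n = 20
t = np.linspace(0.0, 15.0, 46)

# Gaussian-type envelope: |J_n(t)| <= 4.1 exp(-t^2/(2n)) + 4 exp(-n/12).
bound_ok = bool(np.all(np.abs(J(n, t)) <= 4.1 * np.exp(-t**2 / (2 * n)) + 4 * math.exp(-n / 12)))

# J_n(t sqrt(n)) is the characteristic function of Z_n, close to exp(-t^2/2).
s = np.linspace(0.0, 3.0, 31)
gauss_gap = float(np.max(np.abs(J(n, s * math.sqrt(n)) - np.exp(-s**2 / 2))))
```

Already for $n = 20$ the rescaled characteristic function agrees with the Gaussian one to within a few percent.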
For large t, this bound may be improved by virtue of the following upper bound.
Proof. One may assume that $n \geq 4$ (indeed, $4\, e^{-n/12} > 1$ for $n = 2$ and $n = 3$, while $|J_n| \leq 1$). For the approximation we shall use an approach based on contour integration in complex analysis.
The function $z \mapsto e^{itz}\, (1 - z^2)^{\frac{n-3}{2}}$ is analytic in the whole complex plane when $n$ is odd, and in the strip $z = x + iy$, $|x| < 1$, when $n$ is even. Therefore, integrating along the boundary of the rectangle $C = [-1, 1] \times [0, y]$ with $y > 0$ (slightly modifying the contour in a standard way near the points $-1$ and $1$), we have
$$\int_C e^{itz}\, (1 - z^2)^{\frac{n-3}{2}}\, dz = 0.$$

4. Typical Distributions and Mixtures of Gaussian Measures
The asymptotic normality of the typical distribution $F$ in Sudakov's theorem, defined in (1.1), is described by the next assertion, proved in [B-C-G1].
Proposition 4.1. Given a random vector $X$ in $\mathbb{R}^n$, suppose that $\mathbb{E}|X|^2 = n$. With some absolute constant $c > 0$ we have
$$\int_{-\infty}^{\infty} (1 + x^2)\ |F - \Phi|(dx) \leq c\, \Big(\mathbb{E}\, |r^2 - 1| + \frac{1}{n}\Big), \tag{4.1}$$
where $r = \frac{1}{\sqrt{n}}\, |X|$.
Here, $|F - \Phi|$ denotes the variation of the signed measure $F - \Phi$ in the sense of measure theory, and the integral on the left represents a weighted total variation norm of $F - \Phi$. In particular, we have a similar bound for the usual total variation distance between $F$ and $\Phi$, as well as for the Kolmogorov distance $\rho(F, \Phi)$. Applying Proposition 2.4, the latter may be related to the variance-type functionals $\sigma_{2p}$ (cf. also [M-M]).
Corollary 4.2. In particular (under the same conditions),
$$\rho(F, \Phi) \leq \frac{c}{\sqrt{n}}\, (1 + \sigma_2).$$
The proof of Proposition 4.1 is based on the following observation about general mixtures of centered Gaussian measures on the real line. Given a random variable $r \geq 0$, let us denote by $\Phi_r$ the distribution function of the random variable $rZ$, where $Z \sim N(0,1)$ is independent of $r$; that is,
$$\Phi_r(x) = \mathbb{E}\, \Phi(x/r), \qquad x \in \mathbb{R}.$$
As shown in [B-C-G1], if $\mathbb{E} r^2 = 1$, then with some absolute constant $c$ we have
$$\int_{-\infty}^{\infty} (1 + x^2)\ |\Phi_r - \Phi|(dx) \leq c\ \mathbb{E}\, |r^2 - 1|. \tag{4.2}$$
To explain the transition from (4.2) to (4.1), assume that $n \geq 3$. Let $\Phi_n$ and $\varphi_n$ denote respectively the distribution function and the density of $Z_n = \sqrt{n}\, \theta_1$, where $\theta_1$ is the first coordinate of a random point $\theta$ uniformly distributed on $S^{n-1}$. If $r^2 = \frac{1}{n}\, |X|^2$ is viewed as independent of $Z_n$ ($r \geq 0$), then, by the definition of the typical distribution, $F(x) = \mathbb{E}\, \Phi_n(x/r)$. But, for any fixed value of $r$, the function $\Phi_n(x/r)$ differs from $\Phi(x/r)$ by at most $\sup_x |\Phi_n(x) - \Phi(x)|$; hence, by (4.2), taking the expectation with respect to $r$ and using Jensen's inequality, we obtain the desired comparison. It remains to apply (3.1), which yields (4.1) with some universal constant $C$.

5. Characteristic Functions of Weighted Sums
As before, let $X = (X_1, \dots, X_n)$ denote a random vector in $\mathbb{R}^n$, $n \geq 2$. The concentration problems for distributions of the weighted sums $S_\theta = \langle X, \theta\rangle$ may be studied by means of their characteristic functions
$$f_\theta(t) = \mathbb{E}\, e^{it\langle X, \theta\rangle}, \qquad t \in \mathbb{R}. \tag{5.1}$$
In particular, we intend to quantify the concentration of $f_\theta$ around the characteristic function $f$ of the typical distribution $F$ on average over the directions $\theta$, in terms of correlation-type functionals. Note that the characteristic function of $F$ is given by
$$f(t) = \mathbb{E}_\theta f_\theta(t) = \mathbb{E}\, J_n(t\, |X|),$$
where $J_n$ is the characteristic function of the first coordinate $\theta_1$ under the uniform measure $\mu_{n-1}$ on the unit sphere $S^{n-1}$. First let us describe the decay of $t \to |f_\theta(t)|$ at infinity on average with respect to $\theta$. Starting from (5.1), write
$$\mathbb{E}_\theta\, |f_\theta(t)|^2 = \mathbb{E}_\theta\, \mathbb{E}\ e^{it\langle X - Y, \theta\rangle} = \mathbb{E}\, J_n(t\, |X - Y|),$$
where $Y$ is an independent copy of $X$. To proceed, let us rewrite the Gaussian-type bound (3.3) of Proposition 3.3 as
$$|J_n(t)| \leq 4.1\, e^{-t^2/2n} + 4\, e^{-n/12}, \tag{5.2}$$
which gives
$$\mathbb{E}_\theta\, |f_\theta(t)|^2 \leq 4.1\ \mathbb{E}\, e^{-t^2 |X - Y|^2/2n} + 4\, e^{-n/12}.$$
Splitting the latter expectation into the event $A = \{|X - Y|^2 \leq \lambda n\}$ and its complement, we get the following general bound.
Lemma 5.1. For all $t \in \mathbb{R}$ and $\lambda > 0$, the characteristic functions $f_\theta$ satisfy
$$\mathbb{E}_\theta\, |f_\theta(t)|^2 \leq 4.1\, \Big(e^{-\lambda t^2/2} + \mathbb{P}\big\{|X - Y|^2 \leq \lambda n\big\}\Big) + 4\, e^{-n/12},$$
where $Y$ is an independent copy of $X$.
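The identity $\mathbb{E}_\theta\, |f_\theta(t)|^2 = \mathbb{E}\, J_n(t\,|X-Y|)$ behind Lemma 5.1 can be checked by simulation. A standard Gaussian $X$ is an assumed test case here; for it $f_\theta(t) = e^{-t^2/2}$ for every $\theta$, so both sides must equal $e^{-t^2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, t = 15, 100000, 2.0

X = rng.standard_normal((m, n))
Y = rng.standard_normal((m, n))
theta = rng.standard_normal((m, n))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

# Left side: E_theta |f_theta(t)|^2, sampled jointly over (X, Y, theta).
lhs = float(np.mean(np.cos(t * np.sum((X - Y) * theta, axis=1))))

# Right side: E J_n(t |X - Y|), with J_n(s) = E cos(s * theta_1) sampled
# from a fresh uniform direction.
theta2 = rng.standard_normal((m, n))
theta1 = theta2[:, 0] / np.linalg.norm(theta2, axis=1)
rhs = float(np.mean(np.cos(t * np.linalg.norm(X - Y, axis=1) * theta1)))

exact = float(np.exp(-t**2))   # |f_theta(t)|^2 for standard Gaussian X
```

All three quantities agree up to Monte Carlo noise of order $10^{-2}$.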
In case $\mathbb{E}|X|^2 = n$, the right-hand side of this bound can be further quantified by using the moment and variance-type functionals discussed before. Note that both $m_p$ and $\sigma_p$ are non-decreasing functions of $p \geq 1$. In order to estimate the probability of the event $B = \{|X - Y|^2 \leq \frac{1}{4}\, n\}$, we shall use Proposition 2.5, which gives
$$\mathbb{P}\Big\{|X - Y|^2 \leq \tfrac{1}{4}\, n\Big\} \leq \frac{C}{n^p}, \qquad C = 4^{2p}\, \big(m_{2p}^{2p} + \sigma_{2p}^{2p}\big).$$
Hence, from Lemma 5.1 and using $m_{2p} \geq m_2 \geq 1$ (cf. Proposition 2.1), we deduce:

Lemma 5.2. Suppose that $\mathbb{E}|X|^2 = n$. If the moment $m_{2p}$ is finite for some $p \geq 1$, then with some constant $c_p > 0$ depending on $p$,
$$\mathbb{E}_\theta\, |f_\theta(t)|^2 \leq c_p\, \Big(e^{-t^2/8} + e^{-n/12} + \frac{m_{2p}^{2p} + \sigma_{2p}^{2p}}{n^p}\Big).$$
By the triangle inequality, $|f(t)| \leq \mathbb{E}_\theta\, |f_\theta(t)|$, so the characteristic function of the typical distribution shares the same bounds. In fact, here the parameter $m_{2p}$ is not needed. Indeed, as was shown in the proof of Proposition 2.5 with $\lambda = \frac{1}{2}$, we have
$$\mathbb{P}\Big\{|X|^2 \leq \tfrac{1}{2}\, n\Big\} \leq \frac{2^p\, \sigma_{2p}^p}{n^{p/2}}.$$
Hence, by (5.2),
$$|f(t)| \leq \frac{2^p\, \sigma_{2p}^p}{n^{p/2}} + C\, \big(e^{-t^2/4} + e^{-n/12}\big).$$
Thus, we get:

Lemma 5.3. Suppose that $\mathbb{E}|X|^2 = n$. Then with some constant $c_p > 0$ depending on $p \geq 1$, for all $t \in \mathbb{R}$,
$$|f(t)| \leq c_p\, \Big(\frac{\sigma_{2p}^p}{n^{p/2}} + e^{-t^2/4} + e^{-n/12}\Big),$$
and therefore, for all $T \geq T_0 > 0$,
$$\int_{T_0}^{T} \frac{|f(t)|}{t}\ dt \leq c_p\, \Big(\Big(\frac{\sigma_{2p}^p}{n^{p/2}} + e^{-n/12}\Big) \log\frac{T}{T_0} + e^{-T_0^2/4}\Big).$$
We now study the concentration properties of $f_\theta(t)$ as functions of $\theta$ on the sphere, with fixed $t \in \mathbb{R}$ (rather than directly for the distributions $F_\theta$). This can be done in terms of the moment functionals $M_p$. Our basic tool is the well-known spherical Poincaré inequality
$$\int \Big|u - \int u\, d\mu_{n-1}\Big|^2\, d\mu_{n-1} \leq \frac{1}{n-1} \int |\nabla u|^2\, d\mu_{n-1}. \tag{5.3}$$
It holds true for any complex-valued function $u$ which is defined and smooth in a neighborhood of the sphere (cf. [L]). According to (5.1), the function $\theta \to f_\theta(t)$ is smooth on the whole space $\mathbb{R}^n$ and has partial derivatives $\frac{\partial}{\partial \theta_k} f_\theta(t) = it\ \mathbb{E}\, X_k\, e^{it\langle X, \theta\rangle}$, or in vector form $\nabla f_\theta(t) = it\ \mathbb{E}\, X\, e^{it\langle X, \theta\rangle}$. Taking the supremum of $|\langle \nabla f_\theta(t), v\rangle|$ over all $v \in S^{n-1}$, we obtain a uniform bound on the modulus of the gradient, namely $|\nabla f_\theta(t)| \leq M_1\, |t|$. A similar bound also holds on average. To this aim, let us square the vector representation and write
$$\langle \nabla f_\theta(t), v\rangle\ \overline{\langle \nabla f_\theta(t), v\rangle} = t^2\ \mathbb{E}\, \langle X, v\rangle \langle Y, v\rangle\, e^{it\langle X - Y, \theta\rangle},$$
where $Y$ is an independent copy of $X$.
Integrating over $v$ with respect to $\mu_{n-1}$ and using $\int \langle X, v\rangle \langle Y, v\rangle\, d\mu_{n-1}(v) = \frac{1}{n}\, \langle X, Y\rangle$, we get the representation
$$|\nabla f_\theta(t)|^2 = t^2\ \mathbb{E}\, \langle X, Y\rangle\, e^{it\langle X - Y, \theta\rangle},$$
so that
$$\mathbb{E}_\theta\, |\nabla f_\theta(t)|^2 = t^2\ \mathbb{E}\, \langle X, Y\rangle\, J_n(t\, |X - Y|)$$
(where $\mathbb{E}_\theta$ refers to integration over $\mu_{n-1}$). Applying (5.3), one can summarize.
Lemma 5.4. Given a random vector $X$ in $\mathbb{R}^n$ with finite moment $M_1$, for all $t \in \mathbb{R}$,
$$\mathbb{E}_\theta\, |f_\theta(t) - f(t)|^2 \leq \frac{M_1^2\, t^2}{n-1}.$$
In addition,
$$\mathbb{E}_\theta\, |f_\theta(t) - f(t)|^2 \leq \frac{t^2}{n-1}\ \mathbb{E}\, |\langle X, Y\rangle|,$$
where $Y$ is an independent copy of $X$.

6. Berry-Esseen Bounds. Theorem 1.2 and its Generalization
Fourier analysis provides a well-established tool for proving Berry-Esseen-type bounds for the Kolmogorov distance
$$\rho(F_\theta, F) = \sup_{x \in \mathbb{R}}\, |F_\theta(x) - F(x)|.$$
To study the average behavior of this distance with respect to $\theta$ under the uniform measure $\mu_{n-1}$ on the unit sphere, let us first introduce, as a preliminary step, two auxiliary bounds.
Lemma 6.1. Let $X$ be a random vector in $\mathbb{R}^n$. With some absolute constant $c > 0$, for all $0 < T_0 \leq T$,
$$c\ \mathbb{E}_\theta\, \rho(F_\theta, F) \leq \int_0^{T_0} \frac{\mathbb{E}_\theta\, |f_\theta(t) - f(t)|}{t}\ dt + \int_{T_0}^{T} \frac{\mathbb{E}_\theta\, |f_\theta(t)|}{t}\ dt + \int_{T_0}^{T} \frac{|f(t)|}{t}\ dt + \frac{1}{T}. \tag{6.1}$$
As before, here $F_\theta$ denotes the distribution function of the weighted sum $S_\theta = \langle X, \theta\rangle$ with characteristic function $f_\theta(t)$, $t \in \mathbb{R}$, $\theta \in S^{n-1}$, and $F = \mathbb{E}_\theta F_\theta$ is the typical distribution function with characteristic function $f(t) = \mathbb{E}_\theta f_\theta(t)$. For the estimation of the Kolmogorov distance, the following general Berry-Esseen bound will be convenient:
$$\rho(U, V) \leq c\, \Big(\int_0^{T} \frac{|u(t) - v(t)|}{t}\ dt + \frac{1}{T}\Big). \tag{6.2}$$
Here $U$ and $V$ may be arbitrary distribution functions on the line with characteristic functions $u$ and $v$, respectively, $V$ has a bounded density, and $c > 0$ is a constant (cf. e.g. [P1-2], [B3]). In our situation, we take $U = F_\theta$ and $V = F$. In order to estimate the integral in (6.2), we split the integration into the two intervals $[0, T_0]$ (the interval of moderate values of $t$), where it is easier to control the closeness of the two characteristic functions, and the long interval $[T_0, T]$, where both characteristic functions can be shown to be sufficiently small. Note that, by the triangle inequality, we have $|f(t)| \leq \mathbb{E}_\theta\, |f_\theta(t)|$, which implies $\mathbb{E}_\theta\, |f_\theta(t) - f(t)| \leq 2\ \mathbb{E}_\theta\, |f_\theta(t)|$. Using this on the long interval, we arrive at the more specific variant of (6.2), namely (6.1).
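Smoothing inequalities of this type are straightforward to evaluate numerically. The sketch below uses the classical Esseen form $\rho(U, V) \leq \frac{1}{\pi}\int_{-T}^{T} \frac{|u(t)-v(t)|}{|t|}\, dt + \frac{24 \sup|V'|}{\pi T}$, with the textbook constants rather than those of the paper (an assumption of the sketch), on the explicit pair $U = N(0, s^2)$, $V = N(0, 1)$.

```python
import math
import numpy as np

s, T = 1.2, 40.0
t = np.linspace(1e-6, T, 400000)
dt = t[1] - t[0]
u = np.exp(-(s * t) ** 2 / 2)          # characteristic function of N(0, s^2)
v = np.exp(-t**2 / 2)                  # characteristic function of N(0, 1)

# The integrand is even in t, so integrate over (0, T) and double.
integral = 2 * np.sum(np.abs(u - v) / t) * dt / math.pi
bound = integral + 24 / (math.pi * math.sqrt(2 * math.pi) * T)

# Direct evaluation of rho(U, V) = sup_x |Phi(x/s) - Phi(x)| on a grid.
def Phi(z):
    return 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))

x = np.linspace(0.01, 4.0, 400)
rho_true = float(np.max(np.abs(Phi(x / s) - Phi(x))))
```

The computed Kolmogorov distance (about 0.044 for $s = 1.2$) sits comfortably below the smoothing bound (about 0.19), illustrating that the inequality is valid but not tight.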
The estimation of the integrals in (6.1) will be done in terms of the functionals m p = m p (X), M p = M p (X) and σ 2p = σ 2p (X).
Lemma 6.2. Suppose that $X$ has a finite moment of order $2p$ ($p \geq 1$), and that $\mathbb{E}|X|^2 = n$. Then with some constant $c_p$ depending on $p$ only, for all $T \geq T_0 > 0$,
$$\int_{T_0}^{T} \frac{\mathbb{E}_\theta\, |f_\theta(t)| + |f(t)|}{t}\ dt \leq c_p\, \Big(\Big(\frac{m_{2p}^{p} + \sigma_{2p}^{p}}{n^{p/2}} + e^{-n/24}\Big) \log\frac{T}{T_0} + e^{-T_0^2/16}\Big).$$
Proof. By the second inequality of Lemma 5.3 (at this step we use the assumption $\mathbb{E}|X|^2 = n$), the integral involving $|f|$ admits the required bound, while Lemma 5.2, together with $\mathbb{E}_\theta\, |f_\theta(t)| \leq (\mathbb{E}_\theta\, |f_\theta(t)|^2)^{1/2}$, yields the bound on the integral involving $\mathbb{E}_\theta\, |f_\theta|$. This allows us to estimate the last two integrals in (6.1).
We are now prepared to establish Theorem 1.2, in fact in a somewhat more general form which requires the first moment only.

Theorem 6.3. If the random vector $X$ in $\mathbb{R}^n$ has finite first moment $M_1$, then
$$c\ \mathbb{E}_\theta\, \rho(F_\theta, F) \leq \frac{M_1 \sqrt{\log n}}{\sqrt{n}} + \mathbb{P}\Big\{|X - Y|^2 \leq \tfrac{1}{4}\, n\Big\}^{1/2} \log n + \frac{1}{n}, \tag{6.3}$$
where $c > 0$ is an absolute constant, and $Y$ is an independent copy of $X$. As a consequence, if $X$ has finite 2-nd moment $M_2$ and $\mathbb{E}|X|^2 = n$, then
$$\mathbb{E}_\theta\, \rho(F_\theta, F) \leq c\, (M_2 + \sigma_2)\, \frac{\log n}{\sqrt{n}}. \tag{6.4}$$
A similar bound also holds for the normal distribution function $\Phi$ in place of $F$.
Proof. We apply Lemma 6.1 with $T_0 = 5\sqrt{\log n}$ and $T = 5n$. The first integral in (6.1) can be bounded by virtue of the spherical Poincaré-type inequality, i.e., using the first bound of Lemma 5.4. It gives
$$\mathbb{E}_\theta\, |f_\theta(t) - f(t)| \leq \big(\mathbb{E}_\theta\, |f_\theta(t) - f(t)|^2\big)^{1/2} \leq \frac{M_1\, |t|}{\sqrt{n-1}},$$
and hence
$$\int_0^{T_0} \frac{\mathbb{E}_\theta\, |f_\theta(t) - f(t)|}{t}\ dt \leq \frac{M_1\, T_0}{\sqrt{n-1}} \leq \frac{c\, M_1 \sqrt{\log n}}{\sqrt{n}}.$$
Next, we apply Lemma 5.1 with $\lambda = 1/4$, which bounds the integral involving $\mathbb{E}_\theta\, |f_\theta(t)|$ over $[T_0, T]$ with some absolute constant $c > 0$. Similarly, one bounds the integral involving $|f(t)|$. These bounds prove the first assertion of the theorem. For the second assertion, it remains to recall that, by Proposition 2.5,
$$\mathbb{P}\Big\{|X - Y|^2 \leq \tfrac{1}{4}\, n\Big\} \leq \frac{16\, (m_2^2 + \sigma_2^2)}{n},$$
so that from (6.3) we get
$$c\ \mathbb{E}_\theta\, \rho(F_\theta, F) \leq \frac{M_1 \sqrt{\log n}}{\sqrt{n}} + \frac{4\, (m_2 + \sigma_2)}{\sqrt{n}}\, \log n + \frac{1}{n}. \tag{6.5}$$
Here, the last term $1/n$ is dominated by $m_2/n$. This leads to the bound (6.4), in which $F$ may be replaced with the standard normal distribution function $\Phi$ due to the estimate $\rho(F, \Phi) \leq \frac{C}{\sqrt{n}}\, (1 + \sigma_2)$, cf. Corollary 4.2.
Remark. Working with the Lévy distance $L$, which in general is weaker than the Kolmogorov distance $\rho$, one can obtain guaranteed rates with respect to $n$ for $\mathbb{E}_\theta\, L(F_\theta, F)$ in terms of $M_1$ or $M_2$. In particular, if $X$ is isotropic, a known deviation bound yields such a rate with some absolute constant $C$ ([B1]). See also [B2] for similar results on the Kantorovich distance.
7. Proof of Theorem 1.1

In order to get rid of the logarithmic term in the bounds of Theorems 1.2/6.3, one may involve third moment assumptions in terms of the moment and variance-type functionals $m_p$ and $\sigma_p$ of index $p = 3$, namely
$$m_3 = \frac{1}{\sqrt{n}}\, \big(\mathbb{E}\, |\langle X, Y\rangle|^3\big)^{1/3}, \qquad \sigma_3^2 = \frac{1}{n}\, \Big(\mathbb{E}\, \big|\, |X|^2 - n \big|^{3/2}\Big)^{4/3},$$
where $Y$ is an independent copy of $X$. Let us recall that $m_3 \leq M_3^2$. Hence, Theorem 1.1 will follow from the following, slightly sharpened assertion.
Theorem 7.1. Let $X$ be a random vector in $\mathbb{R}^n$ with finite 3-rd moment, and such that $\mathbb{E}|X|^2 = n$. Then with some absolute constant $c$
$$\mathbb{E}_\theta\, \rho(F_\theta, \Phi) \leq c\ \frac{m_3^{3/2} + \sigma_3^3}{\sqrt{n}}.$$
Proof. We apply Lemma 6.2, choosing there $p = 3/2$, $T = 4n$ and $T_0 = 4\sqrt{\log n}$. Since necessarily $m_3 \geq 1$, the last term $e^{-T_0^2/16} = 1/n$ is negligible, and we get a bound with some absolute constant $c > 0$. To analyze the remaining integral over the interval $[0, T_0]$, we apply Lemma 5.4. Next, let us apply the bound of Corollary 3.2, $|J_n(t\sqrt{n}) - e^{-t^2/2}| \leq \frac{C}{n}$, which allows one to replace the $J_n$-term with $e^{-t^2 |X - Y|^2/2n}$ at the expense of an error of order $1/n$, where we used the inequality $m_3 \geq m_2 \geq 1$. As a result, the bound (7.2) may be simplified, with the main contribution expressed via
$$I(t) = \mathbb{E}\ e^{-t^2 |X - Y|^2/2n}.$$
Note that $I(t) \geq 0$, which also follows from the representation $I(t) = \mathbb{E}_Z\, \big|\mathbb{E}_X\, e^{it\langle X, Z\rangle/\sqrt{n}}\big|^2$, where the random vector $Z$ is independent of $X$ and has a standard normal distribution on $\mathbb{R}^n$. Now, focusing on $I(t)$, consider the events $A = \{|X - Y|^2 \leq \frac{1}{4}\, n\}$ and $B = \{|X - Y|^2 > \frac{1}{4}\, n\}$. We split the expectation in the definition of $I(t)$ into the sets $A$ and $B$, so that $I(t) = I_1(t) + I_2(t)$. As we know (cf. Proposition 2.5), $\mathbb{P}(A)$ is at most of order $(m_3^3 + \sigma_3^3)/n^{3/2}$. Hence, applying Hölder's inequality, we obtain a bound on $I_1(t)$, where we used that $m_3 \geq 1$. Now, we represent the second expectation $I_2(t)$ in a suitable form. Here the last expectation has already been bounded by $\frac{32}{\sqrt{n}}\, (m_3^3 + \sigma_3^3)$. To estimate the first one, we use the elementary inequality $|e^{-u} - e^{-v}| \leq |u - v|\, e^{-\min(u, v)}$ ($u, v \geq 0$). Since on the set $B$ there is the uniform bound $t^2\, \big(\frac{|X - Y|^2}{2n} - 1\big) \geq -\frac{7}{8}\, t^2$, we conclude by virtue of Hölder's inequality that the resulting terms are controlled. The first expectation on the right-hand side is $\mathbb{E}\, |\langle X, Y\rangle|^3 = (m_3 \sqrt{n})^3$. By Jensen's inequality, the remaining expectation is bounded by $4 t^2\, e^{7t^2/8}\, (m_3^2 + \sigma_3^2)$, and, as a result, we obtain an estimate in which the factor $e^{-t^2}$ in the first term can be removed without loss of strength.
Together with the estimate on $I_1(t)$, we get the required bound with some absolute constant $C$.
8. The i.i.d. Case

Theorem 1.3 follows from Theorem 6.3 by taking into account the following elementary statement (variants of which, under higher moment assumptions, are well known).
Proof of Theorem 1.3. First let us derive the inequality
$$\mathbb{E}_\theta\, \rho(F_\theta, F) \leq c\, \frac{\log n}{\sqrt{n}} \tag{8.2}$$
with the typical distribution $F$ in place of $G$. Let $Y = (Y_1, \dots, Y_n)$ be an independent copy of $X$. Since the Kolmogorov distance is scale invariant, without loss of generality one may assume that $\mathbb{E}(X_1 - Y_1)^2 = 1$. But then, by Lemma 8.1, applied to the random variables $\xi_i = (X_i - Y_i)^2$, we have $\mathbb{P}\{|X - Y|^2 \leq n/4\} \leq e^{-cn}$ with some constant $c > 0$ depending on the distribution of $X_1$ only. In addition, $M_1 \leq M_2 = \frac{1}{\sqrt{2}}$. As a result, Theorem 6.3 yields (8.2). Now, in order to replace $F$ with $G$ in (8.2), one may apply Proposition 3.1. Indeed, $F$ represents the distribution function of $rZ_n$, where $r = \frac{1}{\sqrt{n}}\, |X|$ and $Z_n = \sqrt{n}\, \theta_1$ is independent of $r$. Similarly, $G$ is the distribution function of $rZ$, where $Z \sim N(0,1)$ is independent of $r$. Let $\Phi_n$ denote the distribution function of $Z_n$ and $\varphi_n$ its density.
Since $F(x) = \mathbb{E}\, \Phi_n(x/r)$ and $G(x) = \mathbb{E}\, \Phi(x/r)$, we conclude, by the triangle inequality, that
$$\rho(F, G) \leq \sup_x\, |\Phi_n(x) - \Phi(x)| \leq \frac{C}{n},$$
which is dominated by the right-hand side of (8.2). This completes the proof.
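The exponential small-ball estimate used in this proof can be observed directly in simulation. Standard Gaussian coordinates are an assumed test case (any i.i.d. law with the stated normalization would do); the probability $\mathbb{P}\{|X - Y|^2 \leq n/4\}$ collapses rapidly as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50000

def small_ball(n):
    X = rng.standard_normal((m, n))
    Y = rng.standard_normal((m, n))
    # Normalize so that E (X_i - Y_i)^2 = 1, as in the proof above.
    D = (X - Y) / np.sqrt(2.0)
    return float(np.mean(np.sum(D**2, axis=1) <= n / 4))

p10, p40 = small_ball(10), small_ball(40)   # drops sharply with n
```

Already between $n = 10$ and $n = 40$ the empirical probability falls from roughly $10^{-2}$ to essentially zero, consistent with the $e^{-cn}$ bound.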