Optimal-order bounds on the rate of convergence to normality in the multivariate delta method

Uniform and nonuniform Berry--Esseen (BE) bounds of optimal orders on the closeness to normality for general abstract nonlinear statistics are given, which are then used to obtain optimal bounds on the rate of convergence in the delta method for vector statistics. Specific applications to Pearson's, non-central Student's and Hotelling's statistics, sphericity test statistics, a regularized canonical correlation, and maximum likelihood estimators (MLEs) are given; all these uniform and nonuniform BE bounds appear to be the first known results of these kinds, except for uniform BE bounds for MLEs. When applied to the well-studied case of the central Student statistic, our general results compare well with known ones in that case, obtained previously by specialized methods. The proofs use a Stein-type method developed by Chen and Shao, a Cramér-type tilt transform, exponential and Rosenthal-type inequalities for sums of random vectors established by Pinelis, Sakhanenko, and Utev, as well as a number of other, quite recent results motivated by this study. The method allows one to obtain bounds with explicit and rather moderate-size constants, at least as far as the uniform bounds are concerned. For instance, one has the uniform BE bound $3.61\mathbb{E}(Y_1^6+Z_1^6)\,(1+\sigma^{-3})/\sqrt n$ for the Pearson sample correlation coefficient based on independent identically distributed random pairs $(Y_1,Z_1),\dots,(Y_n,Z_n)$ with $\mathbb{E} Y_1=\mathbb{E} Z_1=\mathbb{E} Y_1Z_1=0$ and $\mathbb{E} Y_1^2=\mathbb{E} Z_1^2=1$, where $\sigma:=\sqrt{\mathbb{E} Y_1^2Z_1^2}$.

The necessary and sufficient condition, in the i.i.d. case, for the Student statistic to be asymptotically standard normal was established only in 1997 by Giné, Götze and Mason [13]. For more recent developments concerning the Student statistic, see e.g. the 2005 paper by Shao [43].
Employing such simple and standard tools as linearization together with the Chebyshev and Rosenthal inequalities, we quickly obtained (in the i.i.d. case) a uniform bound of the form $O(n^{-1/3})$ for the Pearson statistic. Indeed, Pearson's $R$ can be expressed as $f(\overline V)$, a smooth nonlinear function of the sample mean $\overline V = \frac1n\sum_{i=1}^n V_i$, where the $V_i$'s are independent zero-mean random vectors constructed based on the observations of a random sample; cf. (4.8). A natural approximation to $f(\overline V)$ is the linear statistic $L(\overline V) = \sum_{i=1}^n L(\frac1n V_i)$, where $L$ is the linear functional that is the first derivative of $f$ at the origin. Since BE bounds for linear statistics are a well-studied subject, we are left with estimating the closeness between $f(\overline V)$ and $L(\overline V)$. Assuming $f$ is smooth enough, one will have $|f(\overline V) - L(\overline V)|$ on the order of $\|\overline V\|^2$, and so, demonstrating the smallness of this remainder term becomes the main problem.
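Schematically (in our notation: $R_n$ for the remainder and $\delta>0$ for a free threshold, neither taken from the paper), the linearization argument rests on the elementary inclusion-based bounds:

```latex
% Smoothness of f near the origin controls the remainder:
%   R_n := f(\bar V) - L(\bar V), \qquad
%   |R_n| \le \tfrac{M}{2}\,\|\bar V\|^2 \ \text{ on } \{\|\bar V\|\le\epsilon\},
% and then, for any threshold \delta > 0,
\mathbb P\bigl(L(\bar V)\le z-\delta\bigr)-\mathbb P\bigl(|R_n|>\delta\bigr)
\;\le\;
\mathbb P\bigl(f(\bar V)\le z\bigr)
\;\le\;
\mathbb P\bigl(L(\bar V)\le z+\delta\bigr)+\mathbb P\bigl(|R_n|>\delta\bigr).
```

Thus a BE bound for the linear part, a concentration bound for $L(\bar V)$ over an interval of length $\delta$, and a tail bound for $\|\bar V\|^2$ (via Chebyshev or Rosenthal) together yield a BE-type bound for $f(\bar V)$.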
Using (instead of the mentioned Rosenthal inequality) exponential inequalities for sums of random vectors due to Pinelis and Sakhanenko [40] or Pinelis [34,35], for each $p \in (2, 3)$, under the assumption of the finiteness of the $p$th moment of the norm of the $V_i$'s, one can obtain a uniform bound of the form $O(1/n^{p/2-1})$, which is similar to the BE bound for a linear statistic with a comparable moment restriction. However, the corresponding constant factor in the $O(1/n^{p/2-1})$ will then explode to infinity as $p \uparrow 3$. As for $p \ge 3$, this method produces bounds of order $O((\ln n)^{3/2}/\sqrt n)$ (for $p = 3$) and $O((\ln n)/\sqrt n)$ (for $p > 3$), with extra logarithmic factors. Concerning this method and the corresponding results, see Proposition 3.9 in the present paper.
While any of these bounds would have sufficed as far as the ARE is concerned, we became interested in obtaining an optimal-rate BE bound for the Pearson statistic. Soon after that, we came across the recent remarkable paper by Chen and Shao [9]. Suppose that $T$ is any nonlinear statistic and $W$ is any linear one, and let $\Delta := T - W$; then make the simple observation that $-\,\mathbb P(z - |\Delta| \le W \le z) \le \mathbb P(T \le z) - \mathbb P(W \le z) \le \mathbb P(z \le W \le z + |\Delta|)$ for all $z \in \mathbb R$. Chen and Shao [9] offer a Stein-type method to provide relatively simple bounds on the two concentration probabilities in the above inequality, hence bounding the distance between $T$ and $W$; the reader is referred e.g. to [1] for illustrations of the elegance and power of Stein's method in a wide array of problems. Chen and Shao provided a number of applications of their general results.
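The simple observation can be verified in one line: since $\Delta = T - W$, the event $\{T \le z\}$ is contained in $\{W \le z + |\Delta|\}$, and $\{W \le z - |\Delta|\}$ is contained in $\{T \le z\}$, whence

```latex
-\,\mathbb P\bigl(z-|\Delta|\le W\le z\bigr)
\;\le\;
\mathbb P(T\le z)-\mathbb P(W\le z)
\;\le\;
\mathbb P\bigl(z\le W\le z+|\Delta|\bigr),
\qquad z\in\mathbb R.
```

Everything therefore reduces to bounding how much mass $W$ can place on the (random-length) intervals $[z-|\Delta|,z]$ and $[z,z+|\Delta|]$.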
However, in the applications that we desired, such as to Pearson's $R$, it was difficult to deal with $\Delta = T - W$, as defined above. The simple cure applied here was to allow for any $\Delta \ge |T - W|$, so that, for $T = f(\overline V)$, $W = L(\overline V)$, and smooth enough $f$, the random variable $\Delta$ could be taken to be $\|\overline V\|^2$ (up to a multiplicative constant). This allowed for a BE bound of order $O(1/\sqrt n)$, though under the excessive moment restriction that $\mathbb E\|V_i\|^4 < \infty$. To obtain a BE bound of the "optimal" order $O(1/\sqrt n)$ using only the assumption $\mathbb E\|V_i\|^3 < \infty$, we combine the Chen-Shao technique with a Cramér-type tilt transform, which appears to be the most important and novel modification of the Stein-type method given in the present paper. Yet another modification was made by introducing a second level of truncation, to obtain a bound of order $O(1/n^{p/2-1})$ in the case when $\mathbb E\|V_i\|^p < \infty$ for $p \in [2, 3)$. As for the requirement that the observations be identically distributed, it may (and will) be dispensed with in general; that is, $\overline V$ will in general be replaced by a sum $S$ of independent but not necessarily identically distributed random vectors.
There are two main groups of results in this paper. One is represented by Theorem 2.3, which provides a "non-uniform" upper bound on $|\mathbb P(T \le z) - \mathbb P(W \le z)|$ (that is, an upper bound which decreases to 0 as $|z|$ grows), for a general nonlinear statistic $T$ and a general linear statistic $W$; a "uniform" bound on $|\mathbb P(T \le z) - \mathbb P(W \le z)|$ is given by Theorem 2.1. The other kind of main results, obtained based on Theorems 2.1 and 2.3, is represented by Theorem 3.5, which provides a non-uniform upper bound on $|\mathbb P(f(S) \le z) - \mathbb P(L(S) \le z)|$; it is the latter bound that took more of our time and effort. Once such a bound is established, it becomes rather straightforward to obtain the desired BE bound for the Pearson statistic, as well as for other similar statistics, including the non-central Student and Hotelling ones.
The paper is organized as follows.
- In Section 2, we state and discuss the mentioned upper bounds on $|\mathbb P(T \le z) - \mathbb P(W \le z)|$ for general $T$ and $W$, as well as certain other related results; in particular, in this section we provide an improvement (Proposition 2.5) of a non-uniform BE bound by Osipov and Petrov for linear statistics.
- In Section 3, the mentioned Theorem 3.5 and other results are stated, providing bounds on $|\mathbb P(f(S) \le z) - \mathbb P(L(S) \le z)|$; a certain optimality and other nice properties of these bounds are presented and discussed there.
- Applications to several commonly used statistics, namely the non-central Student $T$, the Pearson $R$, and the non-central Hotelling $T^2$, are stated in Section 4. The resulting BE bounds for these statistics appear to be new to the literature.
- All proofs are deferred to Section 5.

Approximation of the distributions of nonlinear statistics by the distributions of linear ones
Let $X_1, \dots, X_n$ be independent r.v.'s with values in some measurable space $X$, and let $T \colon X^n \to \mathbb R$ be a statistic of the random sample $(X_i)_{i=1}^n$. Further, define the r.v.'s $\xi_i$ and $\eta_i$ by (2.1), and consider the linear statistic $W$ given by (2.2). Further let $\delta$ be any real number satisfying (2.3); note that such a number always exists (because the limit of the left-hand side of (2.3) as $\delta \uparrow \infty$ is 1). Necessarily, $\delta > 0$. Also consider the sum of the mixed second-third moments given by (2.4).

Theorem 2.1. Let $\Delta$ be any r.v. such that $|\Delta| \ge |T - W|$. For each $i = 1, \dots, n$, let $\Delta_i$ be any r.v. such that $X_i$ and $(\Delta_i, W - \xi_i)$ are independent. Then (2.5) and (2.6) hold for all $z \in \mathbb R$, where $\bar\Delta$ is any r.v. satisfying (2.7). Further, (2.8) holds for all $z \in \mathbb R$, where $\Phi(z)$ is the standard normal distribution function.
Remark 2.2. Inequalities (2.5), (2.6), and (2.8) are the same as the ones found in Chen and Shao's paper [9, Theorem 2.1], with two exceptions. In the first place, there they defined $\Delta$ to be equal to $T - W$. The second generalization comes from an added truncation level via the inclusion of $\bar\Delta$ and the subsequent addition of the term $\mathbb P(\max_i |\eta_i| > 1)$. As bounding $\mathbb E\,|\xi_i (T - W - \Delta_i)|$ may be rather cumbersome depending on the form of $T - W$, the first generalization allows one to choose a possibly larger $\Delta$ which would be more amenable to analysis. However, if that $\Delta$ should happen to be "too large" (i.e., if it violates some moment assumptions), the second generalization allows one to truncate $\Delta$ to within acceptable constraints. This will prove useful in the construction of the Berry-Esséen type bounds of Section 3, when $p \in [2, 3)$, though it should be noted that a choice of $h_i = 0$ (say) and $\Delta = T - W$ returns us to the original bounds in [9]. These two generalizations are also used in the non-uniform bounds of Theorem 2.3 below.
Before stating the "non-uniform" extension of Theorem 2.1, let us introduce some notation. For where X p := E 1/p |X| p for any real-valued r.v. X. Further, for any n-tuple (ζ 1 , . . . , ζ n ) of realvalued r.v.'s, let for arbitrary z 0, where the subscript ζ refers to the ζ i 's. Let A denote positive absolute constants, possibly different in different instances. Similarly, let A(p) denote positive expressions depending only on p, also possibly different in different instances. Additionally let where a, b are nonnegative expressions; the use of this simplifying notation may sometimes result in a loss of information, though the information could be regained by reworking the arguments.
Theorem 2.3. Let $\Delta$ be any r.v. such that $|\Delta| \ge |T - W|$. For each $i = 1, \dots, n$, let $\Delta_i$ be any r.v. such that $X_i$ and $(\Delta_i, (X_j \colon j \ne i))$ are independent, and assume that the mentioned Borel-measurable functions $g_i$ and $h_i$ are such that $|h_i| \le |g_i|$, so that $|\xi_i| \le |\eta_i|$ almost surely (a.s.). Take any $p \ge 2$ and let $q := \frac p{p-1}$, so that $\frac1p + \frac1q = 1$. Then (2.10) holds for all $z \in \mathbb R$, where $\bar\Delta$ is any r.v. satisfying (2.7).

Remark 2.4. As will be made clear in the proof, $\tau$ in (2.12) could be replaced by the corresponding expression for two different sets of conjugate exponents $(p_1, q_1)$ and $(p_2, q_2)$, with $p_1, p_2 \ge 2$ and $p_1 \ne p_2$ a distinct possibility; $A(p)$ (suppressed by the "$\lesssim_p$" notation) would then be replaced by $A(p_1, p_2)$, depending only on $p_1$ and $p_2$.
For $p = 2$ (and with $h_i = g_i$ and $\Delta = \bar\Delta = T - W$), Theorem 2.3 was obtained by Chen and Shao [9, Theorem 2.2]. The more general form of the bound given by (2.10) allows one to lessen moment restrictions. Indeed, in applications of Theorem 2.3 given in this paper, such as Theorem 3.5, one will have $|\Delta|$ on the order of $\|S\|^2$ and $|\Delta - \Delta_i|$ on the order of $\|X_i\|^2 + \|X_i\|\,\|S - X_i\|$, where $S := \sum_{i=1}^n X_i$ and the $X_i$'s are independent random vectors. So, using Theorem 2.3 with $p = 3$ (and hence $q = \frac32$) in order to obtain a bound of the classical form $O\bigl(\frac1{\sqrt n\,(|z|+1)^3}\bigr)$, one will need only the 3rd moments of $\|X_i\|$ to be finite. On the other hand, using (2.12) with $p = 2$ to get the same kind of bound would require the finiteness of the 4th moments of $\|X_i\|$. Bound (2.10) on the closeness of the distribution of the linear approximation $W$ to that of the original statistic $T$ can be complemented by the following bounds on the closeness of the distribution of the linear statistic $W$ to the standard normal distribution.
Proposition 2.5. Let $p \ge 2$. Then for $W$ as in (2.2), $\xi_i$ and $\eta_i$ as in (2.1) with $|\xi_i| \le |\eta_i|$ a.s. for $i = 1, \dots, n$, the bound (2.13) holds for all $z \in \mathbb R$, where $B_1(z)$ is given by (2.14).

Note that the bound $B_1(z)$ in (2.14) was obtained in a more general form by Bikelis [6, Theorem 4] (see also [33, Chapter V, Supplement 24]), and also in its present form by Chen and Shao [7, Theorem 2.2]. The more classical non-uniform version of the Berry-Esséen inequality is implied by (2.13). This was also stated, for $p = 3$, in [8]; the case when $p = 3$ and the $\xi_i$'s are i.i.d. is due to Nagaev [26]. Similarly, when $g_i = h_i$ (and hence $\xi_i = \eta_i$) for $i = 1, \dots, n$, (2.13) and Chebyshev's inequality imply (2.15), which is a generalization and improvement of the known Osipov-Petrov theorem (see [33, Theorem 13 of Chapter V] and also Osipov [32]); that theorem was given for $p \ge 3$, i.i.d. $\xi_i$'s, and with $(|z| + 1)^p$ in place of $e^{|z|/2}$. While this latter bound may appear more familiar, the accuracy provided by the sum of the tail probabilities $G_\eta$ in (2.15) (rather than the sum of the absolute moments given by $\sigma_p^p$) shall prove useful. In the remainder of the paper, uniform and non-uniform bounds on the distance between the distributions of the nonlinear statistic $T$ and its linear approximation $W$ shall be stated, with the acknowledgement that Proposition 2.5 may be used to place a bound on the distance between the distribution of $T$ and the standard normal distribution. Further, non-uniform bounds shall be stated for $z$ sufficiently far away from the origin, with the understanding that the accompanying uniform bound may be used for small $|z|$. In anticipation of the results of the next section, let us also state

Corollary 2.6. If the conditions of Theorem 2.3 are satisfied, then (2.16) holds for all $z \in \mathbb R$ such that $|z| \ge 1$.

Berry-Esséen bounds for nonlinear functions of sums of independent random vectors
In this section, we shall use results of Section 2. Assume from here on that $(X, \|\cdot\|)$ is a separable Banach space of type 2; for a definition and properties of such spaces, see e.g. [17,39]. Let $X_1, \dots, X_n$ be independent random vectors in $X$, with $\mathbb E X_i = 0$ and $\mathbb E\|X_i\|^2 < \infty$ for $i = 1, \dots, n$, and also introduce the corresponding quantities (such as $s_p$ and $\lambda_p$) for any $p \ge 1$ and $z \ge 0$. Under this notation, note that the assumption that $X$ is of type 2 implies the existence of a constant $D := D(X) > 0$ such that (3.1) $\mathbb E\,\bigl\|\sum_{i=1}^n X_i\bigr\|^2 \le D^2 \sum_{i=1}^n \mathbb E\,\|X_i\|^2$. We shall assume that $D$ is chosen to be minimal with respect to this property; note that $D = 1$ (and there is equality in (3.1)) whenever $X$ is a Hilbert space.
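For orientation (a standard computation, not taken from the paper): when $X$ is a Hilbert space, independence and the zero means kill the cross terms, which is why one may take $D = 1$ there:

```latex
\mathbb E\Bigl\|\sum_{i=1}^n X_i\Bigr\|^2
=\sum_{i=1}^n\sum_{j=1}^n\mathbb E\langle X_i, X_j\rangle
=\sum_{i=1}^n\mathbb E\|X_i\|^2,
\qquad\text{since }\;
\mathbb E\langle X_i, X_j\rangle
=\langle\mathbb E X_i,\mathbb E X_j\rangle=0
\;\text{ for } i\ne j.
```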
Remark 3.1. The results of this section hold for vector martingales taking values in a 2-smooth separable Banach space; in such a case, one can apply results of [34] instead of the ones of [40] used in the present paper. By [17,34], every 2-smooth Banach space is of type 2. It is known that $L^p$ spaces are 2-smooth, and hence of type 2, for $p \ge 2$ [34, Proposition 2.1]. In particular, any separable Hilbert space is of type 2.
Let next $f \colon X \to \mathbb R$ be a functional with $f(0) = 0$, satisfying the following smoothness condition: there exist $\epsilon > 0$, $M > 0$, and nonzero $L \in X^*$ such that (3.2) holds. Thus, the continuous linear functional $L$ necessarily coincides with the first Fréchet derivative, $f'(0)$, of the function $f$ at 0. Moreover, for the smoothness condition (3.2) to hold, it is enough that the second derivative $f''(x)$ exist and be bounded (in the operator norm) by $M$ over all $x \in X$ with $\|x\| \le \epsilon$. If $X$ is a finite-dimensional Euclidean space, the latter condition means that the largest singular value of the Hessian matrix of $f$ be bounded by $M$ over all $x \in X$ with $\|x\| \le \epsilon$. Then we have the following uniform Berry-Esséen type bound on $f(S)$, Theorem 3.2, whose bound (3.4) holds for all $p \ge 2$ and $z \in \mathbb R$, with $q := \frac p{p-1}$.

Remark 3.3. The term $\mathbb P(\|S\| > \epsilon)$ in (3.4) can be bounded in a variety of ways, for instance, using Chebyshev's inequality and (3.1).

Remark 3.4. Note that $u < \infty$ whenever $s_p < \infty$ (or hence $\lambda_p < \infty$), whether $p \ge 3$ or $2 \le p < 3$, while $\lambda_{2q}$ may be infinite for $p \in [2, 3)$ even when $s_p < \infty$. It is the additional truncation, with $\bar\Delta$ instead of $\Delta$, in the bounds of Section 2 that allows one to use $u$ instead of $\lambda_{2q}$.
The main result of this section is the following non-uniform bound.

Theorem 3.5. If the conditions of Theorem 3.2 are satisfied, then for all $p > 2$ and all $z \in \mathbb R$ satisfying (3.10), the bound (3.11) holds.

The following corollary of Proposition 2.5 is to be used together with Theorems 3.2 and 3.5.
Corollary 3.7. If the conditions of Theorem 3.2 are satisfied, then the stated bound holds for all $p \ge 2$ and $z \in \mathbb R$.

While the expressions for the upper bounds given in Theorems 3.2 and 3.5 are quite explicit, they may seem complicated (as compared with the classical uniform and non-uniform Berry-Esséen bounds). However, one should realize that here there is a whole host of players: $\|L\|$, $M$, $\epsilon$, and $D$ (besides such more traditional terms as $s_p$ and $\sigma$), each with a significant and rather circumscribed role to play.
One way to see this is as follows. Think of the coordinates of the random vectors $X_i$ (in a given basis) as measurements in certain units, say centimeters (cm). Suppose then that the statistic $f(S)$ has the dimensions of $\mathrm{cm}^d$ for some $d \in \mathbb R$; that is, $f(S)$ is measured using a unit equal to $\mathrm{cm}^d$; let us write this as $f(S) \sim \mathrm{cm}^d$. (In the applications given later in this paper one will have $d = 0$, which makes sense, as one does not want the result of a statistical test to depend on the choice of the units of measurement.) Then each ingredient of the bounds carries a definite dimension for all $p$, and so, the upper bounds in (3.4) and (3.11) are unit-free, $\sim \mathrm{cm}^0$.
Another nice feature of these bounds is that they do not depend on the dimension of the type 2 space $X$ (which may even be infinite-dimensional), but only on its "smoothness" constant $D$.
It is yet another nice feature that the bounds in (3.4) and (3.11) do not explicitly depend on $n$. Indeed, $n$ is irrelevant when the $X_i$'s are not identically distributed (because one could e.g. introduce any number of extra zero summands $X_i$). In fact, the bounds in (3.4) and (3.11) remain valid when $S$ is the sum of an infinite series of independent zero-mean r.v.'s, i.e. $S = \sum_{i=1}^\infty X_i$, provided that the series converges in an appropriate sense; see e.g. Jain and Marcus [21].
On the other hand, for i.i.d. r.v.'s $X_i$ our bounds have the correct order of magnitude in $n$. Indeed, let the sample mean $\overline V := \frac1n \sum_{i=1}^n V_i$ be used in place of $S$ (and hence $\frac1n V_i$ in place of $X_i$). Then we have the following Corollary 3.8, stated under condition (3.2) and the appropriate moment assumptions. In the i.i.d. setting, one has a uniform bound of the form $O(1/n^{(p \wedge 3 - 2)/2})$ on the distance to normality of the statistic $\sqrt n\, f(\overline V)/\sigma_1$ (again, the constant in the $O(\cdot)$ notation will depend on $f$ and the distribution of $V$, and also assumes $\mathbb E\|V\|^p < \infty$). For $p \in (2, 3)$, the following proposition provides uniform bounds of the same order, though the corresponding constant factor explodes to infinity as $p \uparrow 3$. For $p = 3$ the bound is of order $O((\ln n)^{3/2}/\sqrt n)$ and for $p > 3$ it is of order $O((\ln n)/\sqrt n)$. While these rates are suboptimal for $p \ge 3$, for moderate values of $n$ the bound given in Proposition 3.9 may prove to be better than the uniform bound given in Corollary 3.8, since the methods used in Proposition 3.9 are less complicated and thus may result in smaller constants.
where the constant A depends on p, D, f , and the distribution of V .
The following proposition shows that the upper bound on $z$ in (3.16), and hence in (3.10), is in general optimal up to a constant factor.
Proposition 3.10. Suppose that the density of $V$ is given by the indicated formula for all $|v| \ge v_0$, where the real number $v_0$ and the density on $(-v_0, v_0)$ are chosen so that $v_0 > 1$, $\|V\|_2 = 1$, and $\|V\|_p < \infty$. Then there is no sequence $(z(n))$ such that $z(n)/\sqrt n \to \infty$ and (3.17) holds for all $n$ and $z = z(n)$.
In the proof of Proposition 3.10 we shall use the following proposition. While the inequalities in (3.19) are probably well known, we shall provide a proof of Proposition 3.11 in Section 5.
Proposition 3.11. Let $(X, \|\cdot\|)$ be any (not necessarily type 2) separable Banach space. Let $X_1, \dots, X_n$ be independent symmetric r.v.'s in $X$. Then the inequalities in (3.19) hold for all real $x$.
When the sum of the tails, $\sum_i \mathbb P(\|X_i\| > x)$, is subexponential (as it is in Proposition 3.10), one actually has, in contrast with the inequalities in (3.19), the asymptotic equivalence $\mathbb P(\|S\| > x) \sim \sum_i \mathbb P(\|X_i\| > x)$ for $x$ in an appropriate zone; here the symmetry of the $X_i$'s is not needed. See [38] or [36] and the bibliography there, or [37].
Remark 3.12. Note that, in applications to problems of the asymptotic relative efficiency of statistical tests, usually it is the closeness of the distribution of the test statistic to a normal distribution (in R) that is needed or most convenient; in fact, as mentioned before, obtaining uniform bounds on such closeness was our original motivation for this work.
On the other hand, there have been a number of deep results on the closeness of the distribution of $f(S)$, not to the standard normal distribution, but to that of $f(N)$, where $N$ is a normal random vector with the mean and covariance matching those of $S$. In particular, Götze [15] provided an upper bound of the order $O(1/\sqrt n)$ on the uniform distance between the d.f.'s of the r.v.'s $f(S)$ and $f(N)$ under comparatively mild restrictions on the smoothness of $f$; however, the bound increases linearly with the dimension $k$ of the space $X$ (which is $\mathbb R^k$ therein).
One should also note here such results as the ones obtained by Götze [14] (uniform bounds) and Zalesskiȋ [47,48] (non-uniform bounds), also on the closeness of the distribution of $f(S)$ to that of $f(N)$. There (in an i.i.d. case), $X$ can be any type 2 Banach space, but $f$ is required to be at least thrice differentiable, with certain conditions on the derivatives. Moreover, Bentkus and Götze [3] provide several examples showing that, in an infinite-dimensional space $X$, the existence of the first three derivatives (and the associated smoothness conditions on such derivatives) cannot be relaxed in general.

Applications
To illustrate the use of the Berry-Esséen bounds of Section 3, we present some bounds on the rate of convergence to normality for some commonly used statistics. For the sake of simplicity and brevity, we consider only the special case where $p = 3$ and the r.v.'s are i.i.d., with the understanding that the reader may apply the results of Section 3 in the general non-i.i.d. and/or $p > 2$ setting. To this end, let us give the following corollary, which entails some loss of accuracy but is perhaps somewhat easier to parse than Corollary 3.8.

Corollary 4.1. Let $f$ satisfy (3.2), let $X$ be a Hilbert space, and let $V, V_1, \dots, V_n$ be i.i.d. random vectors in $X$. Then the stated uniform bound holds for all $z \in \mathbb R$, and the stated non-uniform bound holds for all $z \in \mathbb R$ satisfying (3.16), where $C_1$ is defined in (3.7).
In what follows, $\mathbb R^k$ is equipped with the Euclidean norm $\|\cdot\|$, a vector $x \in \mathbb R^k$ is treated as a $k \times 1$ column matrix, and a linear operator $B \colon \mathbb R^k \to \mathbb R^k$ is treated as a $k \times k$ matrix. Two matrix norms are considered, namely the Frobenius norm and the spectral norm.

4.1. Student's $T$. Consider the statistic commonly referred to as Student's $T$ (or simply $T$): $T := \sqrt n\,\overline X/S$, where $S := \sqrt{\overline{X^2} - \overline X{}^2}$; let $T$ take arbitrary values when $\overline{X^2} = \overline X{}^2$, just so that $T$ remain a statistic (i.e. measurable), and call $T$ "central" when $\mu = 0$ and "non-central" when $\mu \ne 0$. Note that $S$ is defined here as the empirical standard deviation of the sample $(X_i)_{i=1}^n$, rather than the sample standard deviation $\sqrt{\frac n{n-1}\,(\overline{X^2} - \overline X{}^2)}$.
Note that $T$ is invariant under the transformation $X_i \to aX_i$ for arbitrary $a > 0$; so, let us assume w.l.o.g. that $\sigma = 1$.
Thus, if $X$ follows a normal distribution, we see that $\sqrt{\frac{n-1}n}\,T$ follows Student's non-central $t$ distribution with $n - 1$ degrees of freedom and non-centrality parameter $\sqrt n\,\mu$. Of course, we do not limit ourselves to this specific case, but rather allow $X$ to have any distribution subject to the moment assumptions given in Theorem 4.2.
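As a quick numerical sanity check (our own sketch, not from the paper; the helper names `student_T` and `classical_t` and the sample values are ours), assuming $T = \sqrt n\,\overline X/S$ with $S$ the empirical standard deviation as described above, one can verify the algebraic relation between $T$ and the classical one-sample $t$ statistic, which uses the sample standard deviation:

```python
import math

# T = sqrt(n) * mean(X) / S, where S is the *empirical* standard deviation
# sqrt(mean(X^2) - mean(X)^2), i.e. with n (not n - 1) in the denominator.
def student_T(xs):
    n = len(xs)
    m = sum(xs) / n
    s_emp = math.sqrt(sum(x * x for x in xs) / n - m * m)
    return math.sqrt(n) * m / s_emp

def classical_t(xs):
    # The classical one-sample t statistic uses the sample standard
    # deviation, with n - 1 in the denominator.
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * m / math.sqrt(s2)

xs = [0.3, -1.1, 2.4, 0.9, 1.7]
n = len(xs)
# sqrt((n - 1)/n) * T coincides with the classical t statistic,
# matching the distributional statement above.
assert abs(math.sqrt((n - 1) / n) * student_T(xs) - classical_t(xs)) < 1e-12
```

The identity holds for any sample, since $S = \sqrt{\frac{n-1}n}\,s_{n-1}$ where $s_{n-1}$ is the sample standard deviation.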
Much work has been done rather recently concerning the distribution of the central $T$. Bentkus and Götze [4] proved a uniform Berry-Esséen bound on the distribution of $T$ when $\mu = 0$, for $p \in [2,3]$ and all $z \in \mathbb R$; Nagaev [27] provided explicit constants for this bound when $p = 3$. Bentkus et al. [2] proved a uniform Berry-Esséen bound when the $X_i$'s are not necessarily i.i.d., and Shao [43] provided explicit constants for this bound. See also Hall [16] concerning the Edgeworth expansion of the distribution of $T$, Novak [30,31] concerning Berry-Esséen bounds on the self-normalized sum, Chistyakov and Götze [10,11] for probabilities of moderate deviations, Shao [41,42] and Nagaev [28] for probabilities of large deviations, or Wang and Jing [46] and Jing et al. [22] for non-uniform Berry-Esséen bounds. This is of course but a sampling of the recent work done concerning asymptotic properties of the central $T$; for work done even earlier, the reader is referred especially to the bibliography in [4].
We contribute to this work by applying the results of Section 3 to $T$ (regardless of the value of $\mu$).

Theorem 4.2. Suppose that $\|X\|_6 < \infty$ and also that $\sigma_1 > 0$. Then (4.6) holds for all $z \in \mathbb R$, and (4.7) holds for all $z \in \mathbb R$ satisfying (3.16) with (say) $\epsilon = \frac12$.

Remark 4.3. That is, $\sigma_1 = 0$ if and only if $X$ equals in distribution $\mu + B_p$, where $B_p$ is a standardized Bernoulli($p$) r.v. with $p \in (0, 1) \setminus \{\frac12\}$. Bentkus et al. [5] recently showed that if $\|X\|_4 < \infty$, then (after some standardization) $T$ has a limit distribution which is either the standard normal distribution or the $\chi^2$ distribution with one degree of freedom; the latter will be the case if and only if $X$ has the two-point distribution described above.
Bounds (4.6) and (4.7) appear to be new for the non-central $T$. Bentkus et al. [5] provide a sufficient condition for $(T - \sqrt n\,\mu)/\sigma_1$ to converge in distribution to a standard normal r.v.; namely, that $\|X\|_4 < \infty$ and $\sigma_1 \ne 0$ (see the previous Remark 4.3 concerning the degeneracy condition $\sigma_1 = 0$). Note that the condition $\|X\|_4 < \infty$ is equivalent to $\|V\|_2 < \infty$, which is what we use to derive Theorem 4.2 from Corollary 4.1. Therefore, it seems rather natural to require that $\|V\|_3 < \infty$ or, equivalently, $\|X\|_6 < \infty$ to obtain a bound of order $O(1/\sqrt n)$; cf. the classical Berry-Esséen bound for linear statistics, where the finiteness of the third moment of the summand r.v.'s is usually imposed to achieve a bound of order $O(1/\sqrt n)$.

4.2. Pearson's $R$. Let $(X, Y), (X_1, Y_1), \dots, (X_n, Y_n)$ be a sequence of i.i.d. random points in $\mathbb R^2$, with $\mathbb E(X^2 + Y^2) < \infty$, $\operatorname{Var} X > 0$, and $\operatorname{Var} Y > 0$. Recall the definition (4.8) of Pearson's product-moment correlation coefficient $R$; let us allow $R$ to take arbitrary values if the denominator in (4.8) is 0, as long as $R$ remains a statistic. Note that $R$ is invariant under all linear transformations of the form $X_i \to a + bX_i$ and $Y_i \to c + dY_i$ with positive $b$ and $d$, so in what follows we may (and shall) assume w.l.o.g. that the r.v.'s $X$ and $Y$ are standardized, so that $\mathbb E X = \mathbb E Y = 0$ and $\mathbb E X^2 = \mathbb E Y^2 = 1$. We then have the following non-uniform bound on the rate of convergence of the statistic $R$ to normality.

Theorem 4.4. Under the corresponding moment and non-degeneracy conditions, (4.9) holds for all $z \in \mathbb R$, and (4.10) holds for all $z \in \mathbb R$ satisfying (3.16) (with $\epsilon = \frac12$), where $A_1$ and $A_2$ are defined in (4.3) and (4.4) (again with $\epsilon = \frac12$), $M < \infty$ is a constant dependent only on $\rho$, $\|L\| = \sqrt{1 + \frac{\rho^2}2}$, and $\sigma_1$ is given by the corresponding expression.

Remark 4.5. Note that the degeneracy condition $\sigma_1 = 0$ is equivalent to the following: there exists some $\kappa \in \mathbb R$ such that the random point $(X, Y)$ lies a.s. on the union of the two straight lines through the origin with slopes $\kappa$ and $1/\kappa$ (for $\kappa = 0$, these two lines should be understood as the two coordinate axes in the plane $\mathbb R^2$). Indeed, if $\sigma_1 = 0$, then $XY - \frac\rho2\,(X^2 + Y^2) = 0$ a.s.; solving this equation for the slope $Y/X$, one obtains two roots, whose product is 1. Vice versa, if $(X, Y)$ lies a.s. on the union of the two lines through the origin with slopes $\kappa$ and $1/\kappa$, then $XY = \frac r2\,(X^2 + Y^2)$ a.s. for $r := 2\kappa/(\kappa^2 + 1)$ and, moreover, $r = \mathbb E\,\frac r2\,(X^2 + Y^2) = \mathbb E\,XY = \rho$. For example, let the random point $(X, Y)$ equal $(cx, \kappa cx)$, $(-cx, -\kappa cx)$, $(\kappa cy, cy)$, $(-\kappa cy, -cy)$ with probabilities $\frac p2, \frac p2, \frac q2, \frac q2$, respectively, where $x \ne 0$, $y \ne 0$, $\kappa \in \mathbb R$, $c := \sqrt{\frac{x^{-2} + y^{-2}}{\kappa^2 + 1}}$, $p := \frac{y^2}{x^2 + y^2}$, and $q := 1 - p$; then $\sigma_1 = 0$ (and the r.v.'s $X$ and $Y$ are standardized). In particular, one can take here $x = y = 1$, so that $p = q = \frac12$. The bounds in (4.9) and (4.10) appear to be new.
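The four-point construction in Remark 4.5 can be checked numerically. The sketch below is ours (the choice $\kappa = 0.7$ and $x = y = 1$ is arbitrary); it verifies that the resulting $(X, Y)$ is standardized, that $\mathbb E\,XY = 2\kappa/(\kappa^2+1)$, and that $XY - \frac\rho2(X^2 + Y^2)$ vanishes on every atom, i.e. that the degeneracy condition $\sigma_1 = 0$ holds:

```python
import math

# Remark 4.5 with x = y = 1 (so p = q = 1/2); kappa = 0.7 is our arbitrary pick.
kappa = 0.7
c = math.sqrt(2.0 / (kappa ** 2 + 1.0))  # c = sqrt((x^-2 + y^-2)/(kappa^2 + 1))

# The four atoms of (X, Y) and their probabilities (p/2, p/2, q/2, q/2).
atoms = [(c, kappa * c), (-c, -kappa * c), (kappa * c, c), (-kappa * c, -c)]
probs = [0.25, 0.25, 0.25, 0.25]

E = lambda g: sum(pr * g(x, y) for (x, y), pr in zip(atoms, probs))

rho = E(lambda x, y: x * y)  # correlation of the standardized pair
assert abs(rho - 2 * kappa / (kappa ** 2 + 1)) < 1e-12

# Standardization: zero means and unit second moments.
assert abs(E(lambda x, y: x)) < 1e-12 and abs(E(lambda x, y: y)) < 1e-12
assert abs(E(lambda x, y: x * x) - 1) < 1e-12
assert abs(E(lambda x, y: y * y) - 1) < 1e-12

# Degeneracy: XY - (rho/2)(X^2 + Y^2) = 0 on every atom, hence sigma_1 = 0.
assert all(abs(x * y - 0.5 * rho * (x * x + y * y)) < 1e-12 for x, y in atoms)
```

The two lines through the origin with slopes $\kappa$ and $1/\kappa$ indeed carry all four atoms, in agreement with the geometric description in the remark.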
In fact, we have not been able to find in the literature any uniform (or non-uniform) bound on the closeness of the distribution of $R$ to normality. Note that such bounds are important in considerations of the asymptotic relative efficiency of statistical tests; see e.g. Noether [29]. Shen [44] recently provided results concerning probabilities of large deviation for $R$ in the special case when $(X, Y)$ is a bivariate normal r.v. Formal asymptotic expansions for the density of $R$ follow from the paper by Kollo and Ruul [23].

4.3. Hotelling's $T^2$. Consider the non-central Hotelling statistic $T^2$ as defined in (4.11); note that the generalized inverse is often used in place of the inverse in (4.11), though here we may allow $T^2$ to take any value whenever $S^2$ is singular, as long as $T^2$ remains a statistic. Note that $S^2$ is defined as the empirical covariance matrix of the sample $(X_i)_{i=1}^n$, rather than the sample covariance matrix $\frac n{n-1}\,S^2$. Call $T^2$ "central" when $\mu = 0$ and "non-central" otherwise. For any nonsingular matrix $B$, $T^2$ is invariant under the invertible transformation $X_i \to BX_i$; in particular, letting $B = \Sigma^{-1/2}$ allows us to assume w.l.o.g. that $\operatorname{Cov} X = I$, the $k \times k$ identity matrix.
The form of $T^2$ in (4.11) is easily seen to yield to a Berry-Esséen type bound via Corollary 4.1, being a function of the two sums of random vectors $\overline X$ and $\overline{XX^T}$.

Theorem 4.6. Assume that $\|X\|_6 < \infty$ and $\sigma_1 > 0$. Then (4.12) holds for all $z \in \mathbb R$, and (4.13) holds for $z \in \mathbb R$ satisfying (3.16) (with $\epsilon = \frac12$, say), where $A_1$ and $A_2$ are defined in (4.3) and (4.4) (again, with $\epsilon = \frac12$), $M < \infty$ is a constant dependent only on $\|\mu\|$, $\|L\| = \sqrt{\|\mu\|^4 + \|\mu\|^2}$, and $\sigma_1$ is given by the corresponding expression.

Remark 4.7. The non-degeneracy condition $\sigma_1 > 0$ immediately implies that $\mu \ne 0$, so that Theorem 4.6 is applicable only to the non-central $T^2$. If $\mu \ne 0$, then $\sigma_1 = 0$ if and only if $(X - \mu)^T\mu = 1 \pm \sqrt{1 + \|\mu\|^2}$ a.s., that is, if and only if $\mathbb P(X^T\mu = x_1) = 1 - \mathbb P(X^T\mu = x_2) = p$, with $x_1$, $x_2$, and $p$ given by the corresponding expressions; in other words, $\sigma_1 = 0$ if and only if $X$ lies a.s. in the two hyperplanes defined by $X^T\mu = x_1$ or $X^T\mu = x_2$. Note the similarity to the degeneracy condition of $T$ described in Remark 4.3. Recalling the conditions $\mathbb E X = \mu$ and $\operatorname{Cov} X = I$, we have $\sigma_1 = 0$ if and only if $X$ admits the corresponding representation, where $\tilde X$ is a random vector in $\mathbb R^k$ such that $\mathbb E\tilde X = 0$, $\mathbb E\,\xi\tilde X = 0$, $\tilde X^T\mu = 0$ a.s., and $\operatorname{Cov}\tilde X$ is the orthoprojector onto the hyperplane $\{\mu\}^\perp := \{x \in \mathbb R^k : x^T\mu = 0\}$.
Again, the bounds in (4.12) and (4.13) appear to be new; indeed, we have found no mention of Berry-Esséen bounds for $T^2$ in the literature. Probabilities of moderate and large deviations for the central Hotelling $T^2$ statistic (when $\mu = 0$) are considered by Dembo and Shao [12]. Asymptotic expansions for the generalized $T^2$ distribution for normal populations were given by Itô [19] (for $\mu = 0$), and by Itô [20], Siotani [45], and Muirhead [25] (for any $\mu$).

Proofs
Proofs of all theorems, propositions and corollaries stated in the previous sections are provided here.

Proofs of results from Section 2.
Proof of Theorem 2.1.
Fix arbitrary $z \in \mathbb R$. Replace every instance of $\Delta$ in the proof of [9, Theorem 2.1] (from [9, (5.2)] and thereafter) with $\bar\Delta$; this action proves the corresponding intermediate bound. Recalling the condition (2.7) on $\bar\Delta$, one bounds the resulting terms; then $\mathbb P(z - |\Delta| \le W \le z)$ is bounded in a similar fashion, using $z - |\Delta|$ in place of $z$. Inequality (2.5) then follows. Recalling that $|\xi_i| \le |\eta_i|$ a.s. and also the condition (2.7), one obtains the further estimates. Next, replace every instance of $\Delta$ in the statement and proof of [9, Lemma 5.2], as well as in [9, (2.8)], with $\bar\Delta$. After making this replacement, there are two inequalities which need modification. First, [9, (5.21)] is modified to (5.5); the last step there comes from [9, (5.15)]. The final change is to [9, (5.22)], yielding (5.6). Chen and Shao [9] were able to bound $\mathbb E\,W^2 e^W$ (corresponding to the case $p = 2$) by an absolute constant; here, more work is required to bound the last term in (5.6) for general $p$. Specifically, we apply Cramér's tilt to the $\xi_i$'s. For any $c > 0$, let $\xi := (\xi_1, \dots, \xi_n)$, and let $\tilde\xi = (\tilde\xi_1, \dots, \tilde\xi_n)$ be a random vector such that $\mathbb P(\tilde\xi \in A) = \frac{\mathbb E\,e^{cW}\,I\{\xi \in A\}}{\mathbb E\,e^{cW}}$ for all Borel sets $A \subseteq \mathbb R^n$; note that the $\tilde\xi_i$'s are independent r.v.'s. Further, if $f \colon \mathbb R^n \to \mathbb R$ is any nonnegative Borel function, then (5.7) holds. Next, Jensen's inequality yields $\mathbb E\,e^{c\xi_i} \ge e^{c\,\mathbb E\xi_i} \ge e^{-c}$, and further $\sum_i \mathbb E\,\tilde\xi_i^2 \le e^{2c}$ as a consequence. Also, $|e^{c\xi_i} - 1| \le c\,|\xi_i|\,e^{c|\xi_i|} \le c\,|\xi_i|\,e^c$, yielding (5.8). Letting $f(x_1, \dots, x_n) = |\sum_i x_i|^p$ in (5.7) and using [9, (5.15)] again, one has (5.9). Hence, combining (5.6), (5.9) and (5.8), we have the bound displayed there; in a similar fashion, one shows $\mathbb P(z \le W \le z + |\Delta|) \lesssim_p \gamma_z + \tau e^{-z/3}$. Referring back to (5.1) finishes the proof.
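For orientation, the defining property of the Cramér tilt used above can be restated as follows (our restatement; $f$ here is any nonnegative Borel function on $\mathbb R^n$):

```latex
\mathbb E\,f(\tilde\xi)
=\frac{\mathbb E\,e^{cW} f(\xi)}{\mathbb E\,e^{cW}},
\qquad
e^{cW}=\prod_{i=1}^n e^{c\xi_i},
```

so the tilted joint distribution factorizes over the coordinates, which is exactly why the $\tilde\xi_i$'s remain independent.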
We shall prove Proposition 2.5 by using a result of [8], based on an appropriate modification of Stein's method, along with the following corollary of an exponential inequality due to Hoeffding, valid for all $z \ge 0$ and $t > 0$; this corollary can be easily obtained by truncation from e.g. [34, Theorem 8.2] (recall that $\sigma^2 = 1$).
Finally, by (3.2) and (3.7), and with all of this notation in mind, note that the assumptions of Theorems 2.1 and 2.3 are satisfied for the nonlinear statistic $\tilde T$ (in place of $T$) and its linear approximation $W$; in particular, $E\xi_i = 0$, $\operatorname{Var} W = 1$, $|\Delta| \le |\tilde T - W|$, $|\xi_i| \le \eta_i$, $\Delta$ satisfies (2.7), and each $\Delta_i$ satisfies the condition that $X_i$ and $(\Delta_i, (X_j : j \ne i))$ are independent (which further implies that $X_i$ and $(\Delta_i, W - \xi_i)$ are independent).
Lemma 5.1. If the conditions of Theorem 3.2 are satisfied, then

Lemma 5.2. If the conditions of Theorem 3.2 are satisfied, then

The proofs of these lemmata are deferred to the end of this subsection.
Proof of Proposition 3.9. Assume w.l.o.g. that $z \ge 0$. Let $\Delta := f(V) - L(V)$, and let $\delta > 0$ be a quantity depending on $n$ (and perhaps on other constants associated with the distribution of $V$), unspecified for the moment. Then (cf. (5.1)) the second inequality follows from the classical Berry–Esseen bound (or from (2.14)). Similarly one bounds $P\big(f(V)/(\sigma_1/\sqrt n) \le z\big)$ from below. So, it remains to bound $P(\sqrt n\,|\Delta| > \sigma_1 \delta)$ for an appropriate choice of $\delta$. Let $S := \sum_{i=1}^n V_i$; then (3.2) yields the next displayed bound. Next, take any $y > 0$ and let $V_{i,y} := V_i\, I\{\|V_i\| \le y\}$ and $S_y := \sum_{i=1}^n V_{i,y}$; the last inequality here will hold by an appropriate choice of $\delta$ and $y$ (or, equivalently, of $x$ and $y$), to be made at the end of this proof. Using the exponential inequality [40, Corollary 2] along with Chebyshev's inequality, one has (5.35).
Combining (5.32), (5.33) and (5.35), and solving for $\delta$ in terms of $x$, one obtains the desired bound. If $p \ge 3$, let $y := e\|V\|_2\, n/\ln n$ and choose $\delta$ so that $x = 2e\|V\|_2\sqrt{n \ln n}$. If $p \in (2, 3)$, take $x = 2e\|V\|_2\, n^{(5-p)/4}$ and $y = e\|V\|_2\sqrt n$. Then, for large enough $n$, one has the last inequality in (5.34), as well as (3.18); if $n$ is not large, then (3.18) is trivial.
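The truncation step in the proof above — replacing $V_i$ by $V_{i,y} = V_i I\{\|V_i\| \le y\}$ at a union-bound cost of $n\,P(\|V_1\| > y)$ — rests on the elementary inequality $P(\|S\| > x) \le P(\|S_y\| > x) + n\,P(\|V_1\| > y)$, since $S = S_y$ unless some $\|V_i\|$ exceeds $y$. The Python sketch below checks this exactly, by full enumeration, on a small one-dimensional i.i.d. model (the law, $n$, $x$, $y$ are illustrative values of ours, not from the paper).

```python
from itertools import product

def truncation_probs(pmf, n, x, y):
    """Return P(|S| > x), P(|S_y| > x) and P(|V_1| > y) by exact enumeration,
    where S = V_1 + ... + V_n for i.i.d. V_i ~ pmf and S_y sums only the
    terms with |V_i| <= y (the truncated sum)."""
    p_s = p_sy = 0.0
    for combo in product(list(pmf.items()), repeat=n):
        pr = 1.0
        for _, p in combo:
            pr *= p
        vals = [v for v, _ in combo]
        if abs(sum(vals)) > x:
            p_s += pr
        if abs(sum(v for v in vals if abs(v) <= y)) > x:
            p_sy += pr
    p_exceed = sum(p for v, p in pmf.items() if abs(v) > y)
    return p_s, p_sy, p_exceed

pmf = {-1.0: 0.45, 1.0: 0.45, 10.0: 0.1}  # a heavy atom plays the "tail"
n, x, y = 4, 3.0, 2.0
p_s, p_sy, p_exceed = truncation_probs(pmf, n, x, y)
```

The truncated tail $P(|S_y| > x)$ is then amenable to exponential bounds (bounded summands), which is exactly how [40, Corollary 2] enters.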
If $f = f(n) > 0$ and $g = g(n) > 0$ are sequences of real numbers, let us use the standard definitions: $f = O(g)$ if $\limsup_n f/g < \infty$, $f = o(g)$ if $f/g \to 0$, and $f \asymp g$ if both $f = O(g)$ and $g = O(f)$.

Proof of Proposition 3.10. Let $S := V$, $T := f(S)/\sigma = \sqrt n\,(S + S^2)$, $W := L(S)/\sigma = \sqrt n\,S$ and $\Delta := T - W = \sqrt n\,S^2$ (note that $C_1$ may be taken to be $1$ by choosing $\epsilon \le 1$, so that (5.18) holds). Throughout this proof, let $C$ stand for various positive constants which do not depend on $n$.
To obtain a contradiction, assume that (3.17) holds for some sequence $z = z(n) \ge 1$ satisfying the displayed condition; further, let
(5.37) $w := n^{3/4} z^{1/2} = \kappa^{1/2} n$,
so that $w/n = \kappa^{1/2} \to \infty$. Note that, for $v > v_0$, the tail probability of $V$ obeys the displayed asymptotics, which follow by l'Hospital's rule. Consider the various terms on the right-hand side of inequality (3.17). Now (5.36)–(5.38) imply
$\frac{n}{(\sqrt n\, z)^p}\,\ln^2 n = \frac{n}{w^p \kappa^{p/2}}\,\ln^2 n = o\Big(\frac{n}{w^p \ln^2 \kappa}\,\ln^2 n\Big) = o\Big(\frac{n}{w^p \ln^2(n\kappa)}\Big) = o\big(n\,P(V > w)\big).$
Proof of Proposition 3.11. Introduce the r.v.'s $T_j := S - 2X_j$ (obtained from $S$ by flipping the sign of $X_j$) and the disjoint events $A_j := \{\|X_1\| \le x, \dots, \|X_{j-1}\| \le x, \|X_j\| > x\}$ for $j = 1, \dots, n$. Then $X_j = \frac12 S - \frac12 T_j$, and so $\|X_j\| \le \frac12\|S\| + \frac12\|T_j\|$. Hence, the occurrence of the event $A_j$ implies that either $\|S\| > x$ or $\|T_j\| > x$. It follows that $P(A_j) \le P(A_j;\, \|S\| > x) + P(A_j;\, \|T_j\| > x) = 2\,P(A_j;\, \|S\| > x)$, by the symmetry. Summing now in $j$, one has $P(\max_i \|X_i\| > x) = \sum_j P(A_j) \le 2\,P(\|S\| > x)$, so that the first inequality in (3.19) is proved.
To prove the second one, observe the displayed identity, from which the second inequality in (3.19) follows.
The lemma is thus proved for p ∈ [2, 3) as well.
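The reflection argument in the proof of Proposition 3.11 — on the first-exceedance event $A_j$, either $\|S\| > x$ or $\|T_j\| > x$, and $T_j$ has the same law as $S$ by symmetry — yields $P(\max_j \|X_j\| > x) \le 2\,P(\|S\| > x)$ for independent symmetric summands. The Python sketch below (a one-dimensional illustrative model of ours) verifies this inequality exactly by enumeration.

```python
from itertools import product

def reflection_check(pmf, n, x):
    """Return P(max_j |X_j| > x) and P(|S| > x), S = X_1 + ... + X_n,
    for i.i.d. X_j ~ pmf, by exact enumeration of all outcomes."""
    p_max = p_s = 0.0
    for combo in product(list(pmf.items()), repeat=n):
        pr = 1.0
        for _, p in combo:
            pr *= p
        vals = [v for v, _ in combo]
        if max(abs(v) for v in vals) > x:
            p_max += pr
        if abs(sum(vals)) > x:
            p_s += pr
    return p_max, p_s

# symmetric three-point law, as the symmetry hypothesis requires
pmf = {-3.0: 0.2, 0.0: 0.6, 3.0: 0.2}
p_max, p_s = reflection_check(pmf, 3, 2.5)
```

Symmetry is essential here: the step $P(A_j;\, \|T_j\| > x) = P(A_j;\, \|S\| > x)$ uses that flipping the sign of $X_j$ leaves the joint law invariant.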

Proofs of results from Section 4.
Proof of Corollary 4.1. Recall the various notational simplifications in the i.i.d. case (see the proof of Corollary 3.8). Then the first bound follows by (3.14), using Chebyshev's inequality on the tail probabilities $G_V$, and also recalling that $\|L\| \le C_1 \epsilon$. Similarly, recalling that $D = 1$ (as $X$ is a Hilbert space) and using Lyapounov's inequality $\|V\|_\alpha \le \|V\|_\beta$ whenever $0 < \alpha \le \beta$, the right-hand side of (3.17) is bounded accordingly. The result (4.2) now follows upon combining (3.17) and (2.13). Concerning (4.1), use Theorem 3.2 and similar arguments as above; note that $P(\|S\| > \epsilon) \le \|V\|_2^2/(n\epsilon^2)$ as per Remark 3.3.
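Lyapounov's inequality invoked above, $\|V\|_\alpha \le \|V\|_\beta$ for $0 < \alpha \le \beta$, is a consequence of Jensen's inequality applied to the convex map $t \mapsto t^{\beta/\alpha}$. A quick exact Python check on an arbitrary discrete law (our choice, purely illustrative):

```python
def lp_norm(pmf, r):
    """||V||_r = (E |V|^r)^(1/r) for a discrete law {value: prob}."""
    return sum(p * abs(v) ** r for v, p in pmf.items()) ** (1.0 / r)

pmf = {-2.0: 0.25, 0.5: 0.5, 3.0: 0.25}
orders = [0.5, 1, 2, 3, 4]
norms = [lp_norm(pmf, r) for r in orders]  # nondecreasing in r
```

This monotonicity is what lets the proof trade lower moments for the higher ones already assumed finite.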
Proof of Theorem 4.2. Let $X = \mathbb{R}^2$, and consider the $X$-valued r.v.'s defined in the display. Next, let $f \colon X \to \mathbb{R}$ be defined accordingly; one then checks that $f''(x)$ is uniformly bounded over all $x \in X$ such that $\|x\| \le \frac12$, which is obvious.
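The role of the bounded second derivative is that the linearization error $|f(v) - L(v)|$ is then of order $\|v\|^2$ near the origin, which is the mechanism driving the whole delta-method analysis. Since the specific $f$ of Theorem 4.2 is garbled in this copy, the Python sketch below uses an illustrative stand-in $f(a,b,c) = a/\sqrt{(1+b)(1+c)}$ (our choice, with $f(0)=0$, gradient $(1,0,0)$ at the origin, and bounded $f''$ near $0$) and checks that the remainder shrinks quadratically along a ray.

```python
from math import sqrt

def f(v):
    """Illustrative smooth stand-in, NOT the f of Theorem 4.2."""
    a, b, c = v
    return a / sqrt((1.0 + b) * (1.0 + c))

def L(v):
    """Linear term: the derivative of f at the origin applied to v."""
    return v[0]

v = (0.3, -0.2, 0.1)
# |f(tv) - L(tv)| should decay like t^2 as t -> 0
errs = [abs(f(tuple(t * u for u in v)) - L(tuple(t * u for u in v)))
        for t in (1.0, 0.5, 0.25)]
```

Halving $t$ cuts the remainder by roughly a factor of four, the quadratic rate that makes $\sqrt n\,|\Delta|$ small when $\|V\| = O_P(1/\sqrt n)$.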
Proof of Theorem 4.4. Let $X = \mathbb{R}^5$, and define the $X$-valued r.v.'s