An asymptotically Gaussian bound on the Rademacher tails

An explicit upper bound on the tail probabilities for normalized Rademacher sums is given. This bound, which is best possible in a certain sense, is asymptotically equivalent to the corresponding tail probability of the standard normal distribution, thus affirming a longstanding conjecture by Efron. Applications to sums of general centered uniformly bounded independent random variables and to Student's test are presented.

Upper bounds on the tail probabilities P(S_n ≥ x) have been of interest in combinatorics/optimization/operations research; see e.g. [3,4,17,18,28,34] and the bibliography therein. Other authors, including Bennett [5], Hoeffding [32], and Efron [23], were mainly interested in applications in statistics. The present paper, too, was motivated in part by statistical applications in [75].
A particular case of a well-known result by Hoeffding [32] is the inequality (1.2): P(S_n ≥ x) ≤ e^{-x^2/2} for all x ≥ 0. Obviously related to this is Khinchin's inequality; see e.g. the survey [55]; for other developments, including more recent ones, see e.g. [39,44,54,86]. Papers [60,68] contain multidimensional analogues of an exact version of Khinchin's inequality, whereas [67] presents their extensions to multiaffine forms in ε_1, . . . , ε_n (also known as Rademacher chaoses) with values in a vector space. Latała [43] gave bounds on moments and tails of Gaussian chaoses; Berry-Esseen-type bounds for general chaoses were recently obtained by Mossel, O'Donnell, and Oleszkiewicz [51]. For other kinds of improvements/generalizations of the inequality (1.2) see the recent paper [1] and the bibliography there. While easy to state and prove, bound (1.2) is, as noted by Efron [23], "not sharp enough to be useful in practice". Exponential inequalities such as (1.2) are obtained by finding a suitable upper bound (say E(t)) on the exponential moments E e^{tS_n} and then minimizing the Markov bound e^{-tx} E(t) on P(S_n ≥ x) in t ≥ 0. The best exponential bound of this kind on the standard normal tail probability P(Z ≥ x) is inf_{t ≥ 0} e^{-tx} E e^{tZ} = e^{-x^2/2}, for any x ≥ 0. Thus, a factor of the order of magnitude of 1/x is "missing" in this bound, compared with the asymptotics P(Z ≥ x) ∼ (1/x) φ(x) as x → ∞; cf. the result by Talagrand [81]. Now it should be clear that any exponential upper bound on the tail probabilities for sums of independent random variables must be missing the 1/x factor. The problem here is that the class of exponential moment functions is too small.
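As a quick numerical illustration of this point (a sketch, not part of the paper's argument; the helper names are ours), one can compare the optimized exponential bound e^{-x^2/2} with the true normal tail P(Z ≥ x) and with its asymptotic equivalent φ(x)/x:

```python
import math

def normal_tail(x):
    """P(Z >= x) for standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def chernoff_bound(x):
    """inf_{t>=0} e^{-t x} E e^{t Z} = e^{-x^2/2}, attained at t = x."""
    return math.exp(-x * x / 2)

for x in [2.0, 4.0, 6.0]:
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    # ratio of the exponential bound to the tail grows like x * sqrt(2*pi),
    # while phi(x)/x over the tail tends to 1:
    print(x, chernoff_bound(x) / normal_tail(x), (phi / x) / normal_tail(x))
```

The first ratio grows without bound (the "missing" 1/x factor), while the second tends to 1, matching the asymptotics P(Z ≥ x) ∼ φ(x)/x.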
Eaton [19] obtained the moment comparison E f(S_n) ≤ E f(Z) for a much richer class of moment functions f, which enabled him [20] to obtain an upper bound on P(S_n ≥ x) that is asymptotic to c_3 P(Z ≥ x) as x → ∞, where c_3 := 2e^3/9 = 4.4634 . . . .
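A numerical sketch (our own helper names, not the paper's notation) of where Eaton's bound c_3 P(Z ≥ x) starts to beat the Hoeffding bound e^{-x^2/2}:

```python
import math

def normal_tail(x):
    """P(Z >= x) via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

c3 = 2 * math.exp(3) / 9  # Eaton's constant, = 4.4634...

for x in [1.0, 1.5, 2.0, 3.0]:
    # Eaton-type bound vs. Hoeffding's exponential bound
    print(x, c3 * normal_tail(x), math.exp(-x * x / 2))
```

In this sketch the crossover occurs near x ≈ 1.3; for larger x, Eaton's bound is smaller by a factor of order x, in line with the missing-1/x discussion above.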
Clearly, as pointed out e.g. in [10], the constant c in (1.3) cannot be less than c_* = 3.17 . . . , which may be compared with c_3. Bobkov, Götze, and Houdré (BGH) [12] gave a simple proof of (1.3) with a constant factor c ≈ 12.01. Their method was based on the Chapman-Kolmogorov identity for the Markov chain (S_n). Such an identity was used, e.g., in [63] concerning a conjecture by Graversen and Peškir [25] on max_{k ≤ n} |S_k|. Pinelis [73] showed that a modification of the BGH method can be used to obtain inequality (1.3) with a constant factor c ≈ 1.01 c_*. Bentkus and Dzindzalieta [11] recently closed the gap by proving that c_* is indeed the best possible constant factor c in (1.3); they used the Chapman-Kolmogorov identity together with the Berry-Esseen bound and a new extension of the Markov inequality. Bentkus and Dzindzalieta [11] also obtained the inequality (1.5), whereas Holzman and Kleitman [34] proved that P(S_n > 1) ≤ 5/16. We should also like to mention another kind of result, due to Montgomery-Smith [50], who obtained an upper bound on ln P(S_n ≥ x) and a matching lower bound on ln P(S_n ≥ Cx) for some absolute constant C > 0; these bounds depend on x > 0 and on the sequence (a_1, . . . , a_n) and differ from each other by no more than an absolute constant factor; the constants were improved by Hitczenko and Kwapień [29]. The result of [50] was inspired by upper and lower bounds on the L_p-norm of sums of general independent zero-mean r.v.'s obtained by Latała [42] and was extended to such general sums in [31]. The proof in [50] was in part based on an extension of the improvement of Hoffmann-Jørgensen's inequality [33] found by Klass and Nowicki [36]. More recent developments in this direction are given in [37,38].
In the aforementioned paper [23], Efron conjectured that there exists an upper bound on the tail probability P(S_n ≥ x) which behaves as the corresponding standard normal tail P(Z ≥ x), and he presented certain facts in favor of this conjecture. Efron's conjecture suggests that even the best possible constant factor c = c_* = 3.17 . . . in (1.3) is excessive for large x; rather, for such x the ratio of a good bound on P(S_n ≥ x) to P(Z ≥ x) should be close to 1. Theorem 1.1 below provides such a bound, of a simple and explicit form.
Another well-known conjecture, apparently due to Edelman [22,78], is inequality (1.6) for all x ≥ 0; that is, the conjecture is that the supremum of P(S_n ≥ x) over all finite sequences (a_1, . . . , a_n) satisfying condition (1.1) is the same as that over all such (a_1, . . . , a_n) with equal a_i's. Certain parts of the proof of Theorem 1.1 may be seen as providing additional credence to this conjecture. On the other hand, if (1.6) were known to be true, it would to a certain extent simplify the proof of Theorem 1.1. Also, it is noted in [11] that (1.6), used together with the Berry-Esseen bound, would imply another known conjecture [4,28,34]: that P(S_n > 1) ≤ 1/4. Yet another interesting conjecture [14,30,53,85] states that P(S_n ≥ 1) ≥ 7/64. The main result of the present paper is Theorem 1.1. The constant factor C there is the best possible in the sense that the first inequality in (1.7) turns into an equality when x = n = 1. It would be desirable to find the optimal constant C if the constant 9 in the denominator in (1.7) were replaced by a smaller positive value, for then the bound Q(x) would decrease somewhat faster; however, such a quest appears to entail significant technical complications. Using e.g. part (II) of Proposition 3.1, it is easy to see that the ratio of the bound Q(x) in (1.7) to P(Z > x) increases from ≈ 2.25 to ≈ 3.61 and then decreases to 1 as x increases from 0 to ≈ 2.46 and then to ∞, respectively. Figure 1 presents a graphical comparison of this ratio, Q(x)/P(Z > x), with (i) the best possible constant factor c = c_* ≈ 3.18 in (1.3); (ii) the level 1, which is asymptotic (as x → ∞) to the ratio of either one of the two bounds in (1.7) to P(Z > x), and hence, by the central limit theorem, is also asymptotic to the ratio of the supremum of P(S_n ≥ x) (over all normalized Rademacher sums S_n) to P(Z > x); (iii) the ratio of Hoeffding's bound e^{-x^2/2} to P(Z > x).
In Figure 1, the graph of the latter ratio looks like a steep straight line (and, asymptotically for large x, it is a straight line), most of which is outside the vertical range of the picture, thus showing how much the bounds c_* P(Z ≥ x) and Q(x) improve on the Hoeffding bound e^{-x^2/2}.
In view of the main result of Bentkus [6], one immediately obtains the following corollary of Theorem 1.1.
for all real x ≥ 0, where Q̃_n is the linear interpolation of the restriction of the function Q to the set (2/√n)(n/2 − ⌊n/2⌋ + ℤ). Here we shall present just one more application of Theorem 1.1, to the self-normalized sums V_n := (X_1 + · · · + X_n)/√(X_1^2 + · · · + X_n^2), where, following Efron [23], we assume that the X_i's satisfy the so-called orthant symmetry condition: the joint distribution of s_1 X_1, . . . , s_n X_n is the same for any choice of signs s_1, . . . , s_n ∈ {1, −1}, so that, in particular, each X_i is symmetrically distributed. It suffices that the X_i's be independent and symmetrically (but not necessarily identically) distributed. In particular, V_n = S_n if X_i = a_i ε_i for all i. It was noted by Efron that (i) Student's statistic T_n is a monotonic function of the so-called self-normalized sum: T_n = √((n−1)/n) · V_n/√(1 − V_n^2/n), and (ii) orthant symmetry implies in general that the distribution of V_n is a mixture of the distributions of normalized Rademacher sums S_n. Thus, one obtains Corollary 1.4: Theorem 1.1 holds with V_n in place of S_n.
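The monotone relation between T_n and V_n is elementary to verify numerically; the following sketch (our own helper names; the simulated symmetric distribution is arbitrary) checks the displayed identity on random data:

```python
import math
import random

def student_t(xs):
    """Student's statistic T_n = sqrt(n) * mean / (sample standard deviation)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * m / math.sqrt(s2)

def self_normalized(xs):
    """Self-normalized sum V_n = (X_1 + ... + X_n) / sqrt(X_1^2 + ... + X_n^2)."""
    return sum(xs) / math.sqrt(sum(x * x for x in xs))

random.seed(0)
# symmetrically distributed (hence orthant-symmetric) sample
xs = [random.choice([-1, 1]) * random.expovariate(1.0) for _ in range(20)]
n, v = len(xs), self_normalized(xs)
t_from_v = math.sqrt((n - 1) / n) * v / math.sqrt(1 - v * v / n)
assert abs(student_t(xs) - t_from_v) < 1e-10
```

The identity is exact algebra (not just approximate), so the assertion holds for any nondegenerate sample.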
Note that many of the most significant advances concerning self-normalized sums are rather recent; e.g., a necessary and sufficient condition for their asymptotic normality was obtained only in 1997 by Giné, Götze, and Mason [24].
It appears natural to compare the probability inequalities given in Theorem 1.1 with limit theorems for large deviation probabilities. Most such theorems, referred to as large deviation principles (LDPs), deal with logarithmic asymptotics, that is, asymptotics of the logarithm of small probabilities; see e.g. [16]. As far as the logarithmic asymptotics is concerned, the mentioned bounds c_* P(Z ≥ x) and Q(x) and the Hoeffding bound e^{-x^2/2} are all the same: ln(c_* P(Z ≥ x)) ∼ ln Q(x) ∼ ln e^{-x^2/2} = −x^2/2 as x → ∞; yet, as we have seen, at least the first two of these bounds are vastly different from the Hoeffding bound, especially from the perspective of statistical practice. Results on the so-called exact asymptotics for large deviations (that is, asymptotics for the small probabilities themselves, rather than for their logarithms) are much fewer; see e.g. [16, Theorem 3.7.4] and [56, Ch. VIII]. Note that the inequalities in (1.7) hold for all x > 0 and that, a priori, the summands a_i ε_i do not have to be identically or nearly identically distributed; cf. conjecture (1.6). In contrast, almost all limit theorems for large deviations in the literature, whether with exact or logarithmic asymptotics, hold only for x = O(√n), with n being the number of identically or quasi-identically distributed (usually independent or nearly independent) random summands; the few exceptions here include results of the papers [52,64,76,77,87] and references therein, where this restriction is not imposed and x is allowed to be arbitrarily large. In general, observe that a limit theorem is a statement on the existence of an inequality, not yet fully specified, as e.g. in "there exists some n_0 such that |x_n − x| < ε for all n ≥ n_0"; as such, a limit theorem cannot provide a specific bound. Of course, being less specific, limit theorems are applicable to objects of much greater variety and complexity, and they usually provide valuable initial insight.
Yet, it seems natural to suppose that the tendency, say in the study of large deviation probabilities, will be to proceed from logarithmic asymptotics to asymptotics of the probabilities themselves and then on to exact inequalities. We appear to be largely at the beginning of this process, still struggling even with such comparatively simple objects as the Rademacher sums, whose simplicity is only comparative, as the discussion around Figure 1 in [73] suggests. However, a number of big strides have already been made in this direction. For instance, Boucheron, Bousquet, Lugosi, and Massart [13] obtained explicit bounds on moments of general functions of independent r.v.'s; their approach was based on a generalization of Ledoux's entropy method [46,47], using a generalized tensorization inequality due to Latała and Oleszkiewicz [45]. More recently, Tropp [83] provided noncommutative generalizations of the Bennett, Bernstein, Chernoff, and Hoeffding bounds, even with explicit and optimal constants; as pointed out in [83], "[a]symptotic theory is less relevant in practice". Yet, as stated above, in the case of Rademacher sums and other related cases, significantly more precise bounds can be obtained.

Proof of Theorem 1.1: outline
Let us begin the proof with several introductory remarks.
In this section, a number of lemmas will be stated, from which Theorem 1.1 will easily follow. Most of these lemmas will be proved in Section 3 -with the exception of Lemmas 2.3 and 2.7, whose proofs are more complicated and will each be presented in a separate section. Each of these two more complicated lemmas is based on a number of sublemmas -which are stated in the corresponding section and used there to prove the lemma; each of these two sections is then completed by proving the sublemmas. This tree-like structure appears suitable for presentation: first the general scheme of the proof and then gradually down to the finer details.
There are many symbols used in the proof. Therefore, let us assume a localization principle for notations: any notations introduced in a section or in a proof of a lemma/sublemma supersede those introduced in preceding sections or proofs. For example, the meaning of the X i 's introduced later in this section differs from that in Section 1.
Without loss of generality (w.l.o.g.), assume that 0 ≤ a_1 ≤ . . . ≤ a_n =: a, (2.1) so that a = max_i a_i. Introduce the numbers u_i := x a_i. The proof of Theorem 1.1 is to a large extent based on a careful analysis of the Esscher tilt transform of the r.v. S_n. In introducing and using this transform, Esscher and then Cramér were motivated by applications in actuarial science. Closely related to the Esscher transform is the saddle-point approximation; for a recent development in this area, see [57]. The Esscher tilt has been used extensively in limit theorems for large deviation probabilities, but much less commonly in connection with explicit probability inequalities; two cases of the latter kind, rather different in character, are represented by Raič [79] and Pinelis and Molzon [75]. One may also note that, in deriving LDPs, the tilt is usually employed to get a lower bound on the probability; in contrast, in this paper the tilt is used to obtain an upper bound.
The main idea of the proof is to reduce the problem from one on the vector (a_1, . . . , a_n) of an unbounded dimension n to a set of low-dimensional extremal problems involving sums of the form Σ_i g(u_i). The first step here is to represent such sums as x^2 ∫ g̃ dν, where g̃(u) := g(u)/u^2 (for u ≠ 0), ν := x^{-2} Σ_i u_i^2 δ_{u_i}, and δ_t denotes the Dirac probability measure at point t, so that ν is a probability measure on the interval [0, xa]. This step turns the initial finite-dimensional problem into an infinite-dimensional one, involving the measure ν. However, the well-known Carathéodory principle then allows one to reduce the dimension to (at most) k − 1, where k is the total number of the integrals (with respect to the measure ν) involved in the extremal problem at hand. Moreover, it turns out that the systems of integrands one has to deal with in the proof of Theorem 1.1 enjoy the so-called Tchebycheff and, even, Markov properties; therefore, one can reduce the dimension even further, to about k/2, which allows for effective analyses. It should also be noted that the verification of the Markov property of a finite sequence of functions largely reduces to checking the positivity of several functions of only one variable. Major expositions of the theory of Tchebycheff-Markov systems and its applications are given in the monographs by Karlin and Studden [35] and Kreĭn and Nudel′man [40]; closely related to this theory are certain results in real algebraic geometry, whereby polynomials are "certified" to be positive on a semialgebraic domain by means of an explicit representation, say in terms of sums of squares of polynomials; see e.g. [41,49]. A brief review of the Tchebycheff and Markov systems of functions, which contains all the definitions and facts necessary for the applications in the present paper, is given in [59].
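The first reduction step can be sketched numerically as follows (illustrative code with our own names; g = tanh is an arbitrary test integrand): the measure ν puts mass u_i^2/x^2 at each nonzero u_i, is indeed a probability measure, and x^2 ∫ g̃ dν recovers Σ_i g(u_i).

```python
import math

def as_measure(us):
    """Represent (u_i) as the probability measure nu putting mass
    u_i^2 / x^2 at the point u_i, where x^2 := sum of u_i^2.
    Atoms at 0 carry zero mass and are dropped."""
    x2 = sum(u * u for u in us)
    return [(u, u * u / x2) for u in us if u != 0], math.sqrt(x2)

def integral(meas, f):
    """Integral of f with respect to a discrete measure [(point, mass), ...]."""
    return sum(w * f(u) for u, w in meas)

g = math.tanh  # an arbitrary test integrand with g(0) = 0

def g_tilde(u):
    """g~(u) := g(u)/u^2 for u != 0."""
    return g(u) / (u * u)

us = [0.3, 0.4, 1.2, 0.0, 0.7]
meas, x = as_measure(us)
lhs = sum(g(u) for u in us)                # sum_i g(u_i)
rhs = x * x * integral(meas, g_tilde)      # x^2 * integral of g~ d(nu)
assert abs(lhs - rhs) < 1e-12
assert abs(sum(w for _, w in meas) - 1) < 1e-12   # nu is a probability measure
```

This only restates a finite sum as an integral, but it is exactly the step that opens the door to the Carathéodory and Tchebycheff-Markov reductions described above.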
Even after the reductions in dimension just described, the proof of Theorem 1.1 entails extensive calculations, both symbolic and numeric, which we did using Mathematica; other advanced computer algebra systems should be able to do the job. A well-known result by Tarski [15,48,82], which can be viewed as a far-reaching development of Sturm's theorem on the real roots of a polynomial, implies that systems of algebraic equations/inequalities can be solved in a completely algorithmic manner. Similar results hold for algebraic-hyperbolic polynomials (that is, polynomials in x, e^x, e^{-x}), as well as for certain other expressions involving inverse-trigonometric and inverse-hyperbolic functions (including the logarithmic function), whose derivatives are algebraic. However, it was only a few years ago that Tarski's algorithm and its further developments were implemented in widely used computer software. In Mathematica, this is done via Reduce and other related commands, such as Maximize and Minimize. In particular, the command Reduce[{cond1, cond2, . . . }, {var1, var2, . . . }, Reals] returns a simplified form of the given system (of equations and/or inequalities) cond1, cond2, . . . over the real variables var1, var2, . . . . However, the execution of such a command may take a very long time (or require too much computer memory) if the given system is more than a little complicated; in such cases, Mathematica can use some human help. As for the commands Maximize and Minimize, whenever possible they return the exact global maximum/minimum subject to the given restrictions; otherwise, these commands return a statement implying that Mathematica cannot do the requested exact optimization. Alternatively, such calculations, say for piecewise smooth functions of a finite number of variables, can be done, also quite rigorously, using interval arithmetic; see e.g. [27]; again, the only limitation here is computer power.
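For readers without Mathematica, the Sturm-counting idea underlying such decision procedures can be sketched in a few lines of standard Python (floating-point, hence only illustrative; real quantifier-elimination implementations work in exact arithmetic):

```python
# Sturm's theorem: the number of distinct real roots of a polynomial p
# in the interval (a, b] equals the drop in the number of sign changes
# of the Sturm chain between a and b. Polynomials are coefficient lists,
# highest degree first.

def poly_eval(p, x):
    r = 0.0
    for c in p:
        r = r * x + c
    return r

def poly_deriv(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]

def poly_rem(a, b):
    """Remainder of polynomial division a mod b (floating point)."""
    a = a[:]
    while len(a) >= len(b) and any(abs(c) > 1e-12 for c in a):
        f = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= f * b[i]
        a = a[1:]
    return a or [0.0]

def sturm_chain(p):
    chain = [p, poly_deriv(p)]
    while len(chain[-1]) > 1:
        r = [-c for c in poly_rem(chain[-2], chain[-1])]
        if all(abs(c) < 1e-12 for c in r):
            break
        chain.append(r)
    return chain

def sign_changes(chain, x):
    signs = [poly_eval(q, x) for q in chain]
    signs = [s for s in signs if abs(s) > 1e-9]
    return sum(1 for s, t in zip(signs, signs[1:]) if s * t < 0)

def count_real_roots(p, a, b):
    chain = sturm_chain(p)
    return sign_changes(chain, a) - sign_changes(chain, b)

# x^3 - x has the three real roots -1, 0, 1 in (-2, 2]:
print(count_real_roots([1.0, 0.0, -1.0, 0.0], -2.0, 2.0))  # 3
```

Tarski's result extends this kind of root counting to full quantifier elimination over the reals; the sketch above conveys only the one-variable core of that machinery.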
It should be quite clear that all such calculations done with the aid of a computer are no less reliable or rigorous than similar, or even less involved, calculations done by hand.
Next, let X̃_1, . . . , X̃_n be any r.v.'s such that condition (2.3) holds for all Borel-measurable functions g : ℝ^n → ℝ. Equivalently, one may require condition (2.3) only for Borel-measurable indicator functions g; clearly, such r.v.'s X̃_i do exist. It is also clear that the r.v.'s X̃_i are independent. Moreover, for each i the distribution of X̃_i is (e^{u_i} δ_{a_i} + e^{-u_i} δ_{-a_i})/(e^{u_i} + e^{-u_i}). Formula (2.3) presents the mentioned Esscher tilt transform, with the tilting parameter (TP) the same as the x in (1.7); that is, we choose the TP to be the minimizer of e^{-tx} E e^{tZ} = e^{-tx+t^2/2} in t ≥ 0, rather than the minimizer of e^{-tx} E e^{tS_n}, which is usually taken as the TP in limit theorems for large deviations and can thus be expressed only via an implicit function. Our choice of the TP appears to simplify the proof greatly.
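A small numerical sketch (our own notation, not part of the proof) of how this tilt recenters the sum: each tilted summand has mean a_i th(x a_i), so for equal weights a_i = 1/√n the tilted sum has mean √n th(x/√n), which approaches the tilting parameter x.

```python
import math

def tilted_mean(a, x):
    """Mean of the tilted r.v. with distribution
    (e^u d_a + e^(-u) d_(-a)) / (e^u + e^(-u)), u = x*a;
    the weighted average below equals a * tanh(u)."""
    u = x * a
    w_plus = math.exp(u) / (math.exp(u) + math.exp(-u))
    return a * w_plus - a * (1 - w_plus)

n, x = 400, 2.0
a = 1 / math.sqrt(n)  # equal weights, sum of a_i^2 = 1
mean_tilted_sum = n * tilted_mean(a, x)
print(mean_tilted_sum)  # close to the tilting parameter x = 2.0
```

This is precisely what makes the tilt useful: the event {S_n ≥ x}, rare under the original measure, becomes typical under the tilted one.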
In terms of the tilted r.v.'s X̃_1, . . . , X̃_n, introduce now the relevant quantities, where ch := cosh, sh := sinh, th := tanh, and arcch := arccosh, with arcch z ≥ 0 for all z ∈ [1, ∞); thus, for each z ∈ [1, ∞), arcch z is the unique solution y ≥ 0 to the equation ch y = z. Let F_n and Φ denote, respectively, the tail function of X̃_1 + · · · + X̃_n and the standard normal tail function, so that F_n(z) = P(X̃_1 + · · · + X̃_n ≥ z) and Φ(z) = P(Z ≥ z) for all real z. Also, let c_BE denote the least possible constant in the Berry-Esseen inequality; by Shevtsova [80], c_BE ≤ 56/100; a slightly worse bound, c_BE ≤ 0.5606, is due to Tyurin [84].
with a as in (2.1).
Now and later in the paper, we need the following special l'Hospital-type rule for monotonicity.
Proposition 3.1. Let f and g be differentiable functions on an interval (a, b). It is assumed that g and g′ do not take on the zero value and do not change their respective signs on (a, b).
(I) If f(a+) = g(a+) = 0 or f(b−) = g(b−) = 0, and if the ratio f′/g′ is strictly increasing/decreasing on (a, b), then (respectively) (f/g)′ is strictly positive/negative and hence the ratio f/g is strictly increasing/decreasing on (a, b).
(II) If f(a+) = g(a+) = 0 and if the ratio f′/g′ switches its monotonicity pattern at most once on (a, b), and only from increase to decrease, then the ratio f/g does so as well. Similar statements, under the condition f(b−) = g(b−) = 0 and/or for a switch from decrease to increase, are true as well.
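A numerical illustration of part (I) (not a proof; the example functions are ours): take f(u) = e^u − 1 and g(u) = u on (0, ∞), so that f(0+) = g(0+) = 0 and f′/g′ = e^u is strictly increasing; the rule then says that f/g is strictly increasing.

```python
import math

def ratio(u):
    """f/g for f(u) = e^u - 1, g(u) = u."""
    return (math.exp(u) - 1) / u

grid = [0.1 * k for k in range(1, 51)]
vals = [ratio(u) for u in grid]
# f'/g' = e^u increases and f(0+) = g(0+) = 0, so f/g must increase:
assert all(a < b for a, b in zip(vals, vals[1:]))
```

Note that the usual l'Hospital rule only gives the limit of f/g at 0; the monotonicity rule gives the behavior of f/g on the whole interval.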

Proof of Lemma 2.3
We begin with a technical sublemma, used in the proof of Sublemma 4.2. Sublemma 4.1. The function h_a defined in (4.1) is concave on (0, 1].
It remains, in this section, to prove Sublemmas 4.1-4.6.
Proof of Sublemma 4.1. Since h_a(v) is affine in a, w.l.o.g. a ∈ {0, 1}. Consider first the case a = 0. Observe that the derivative in question switches its sign from + to − at t = 3/2 as t increases from 1 to ∞. Hence, the maximum of h″_0(v) over v ∈ (0, 1] is attained at v = 2/3, and this maximum is easily seen to be negative, which proves the case a = 0.
The case a = 1 is considered similarly. Observe that the derivative d/dt in question is a certain polynomial in t (of degree 13), which switches its sign from + to − at a certain algebraic number t_1 as t increases from 1 to ∞. Hence, the corresponding maximum over t ≥ 1 is attained at t = t_1, and h″_1(t_1^{-2}) can be seen to be negative (using e.g. the Reduce command again), which proves the case a = 1 as well. Sublemma 4.1 is now completely proved.
Proof of Sublemma 4.2. Observe that L_x = (x s_x)^{-3} Σ_i u_i^3 (1 − th^4 u_i). So, inequality (4.2) means exactly that (4.9) holds for all u_i's in the interval [0, u_*] such that Σ_i u_i^2 = x^2 and Σ_i u_i^2/ch^2 u_i = x^2 s_x^2. Next, the object Σ_i u_i^2 g(u_i) in (4.9), with the restrictions Σ_i u_i^2 = x^2 and Σ_i u_i^2/ch^2 u_i = x^2 s_x^2, can be rewritten as x^2 E h(Y), where h(·) := h_a(·) (as in (4.1)) with a = s_x^3 and Y is a r.v. with the distribution ν := x^{-2} Σ_i u_i^2 δ_{v_i}, with v_i := 1/ch^2 u_i; note that one always has s_x ∈ (0, 1] and that ν is indeed a probability measure, due to the restriction Σ_i u_i^2 = x^2. So, by Sublemma 4.1 and Jensen's inequality, E h(Y) ≤ h(E Y) = h(s_x^2) = 0, which proves the inequality in (4.9) and hence that in (4.2).
To avoid these long execution times, one can, alternatively, proceed as follows. Let, for a moment, W stand for either W_{2,3} or W_{4,2}. As mentioned above, W can be represented in the form (4.11). Now expand ch(2ju) and sh(2ju) into Maclaurin series and collect the terms, so that each coefficient c_{m,k} is of the form Σ_{j=1}^4 (a_k + b_k m) j^{2m} for some real numbers a_k and b_k, depending only on k ∈ {0, 1, 2, 3}.
For W = W_{2,3}, using the command Reduce (or manually), one can quickly verify that for all m ≥ 4 the inequalities c_{m,1} < 0, c_{m,2} > 0, and c_{m,3} < 0 hold, which implies that for all u ∈ (0, u_*) one has w̃_m(u) > c_{m,0} + c_{m,1} u_*^2 + c_{m,3} u_*^6 > 0, the latter inequality being quickly checked (say) by another Reduce. Thus, for all u ∈ (0, u_*) one has w_m(u) > 0 for all m ≥ 4. For W = W_{4,2}, one proceeds similarly; here p_j(u^2) and q_j(u^2) are polynomials in u^2 (of degree 2), and each coefficient c_{m,k} is of the form Σ_{j=1}^3 (a_k + b_k m) j^{2m} for some real numbers a_k and b_k, depending only on k ∈ {0, 1, 2}. Using the command Reduce, one can verify that for all m ≥ 9 and k ∈ {0, 1, 2} the inequalities c_{m,k} < 0 hold, which implies that for all u > 0 one has r_m(u) < 0 and hence Σ_{m=9}^∞ r_m(u) < 0; the verification of the inequality c_{m,2} < 0 for m ≥ 9 takes about 18 sec, while that of each of the inequalities c_{m,0} < 0 and c_{m,1} < 0 for m ≥ 9 takes just a fraction of a second. On the other hand, Σ_{m=0}^8 r_m(u) is a polynomial in u, which can be quickly checked by yet another Reduce to be negative for all u ∈ (0, u_*). This proves that DDρ_1 < 0 on (0, u_*), so that Dρ_1 is strictly decreasing there; hence, the monotonicity pattern of ρ_1 can change at most once on (0, u_*), and only from increase to decrease. Note also that k_1(0) = ℓ_1(0) = 0. So, by part (II) of Proposition 3.1, the monotonicity pattern of ρ can change at most once on (0, u_*), and only from increase to decrease. However, ρ′(u_*) = 0.017 . . . > 0. Thus, ρ is strictly increasing on (0, u_*) and hence, by continuity, on [0, u_*].

Proof of Lemma 2.7
This proof could be simplified using the mentioned result (1.5); however, we decided to present an independent proof here, which is not much more complicated. We have to show that ∆ ≥ 0 for all pairs (x, u) in the set P, the condition u < x corresponding to the condition a = a_n < 1. Introduce the set P̃, where p_1(x, u) := x^2(11 + x^2) − (10u^2 + 2ux^2) and p_2(x, u) := x^2(9 + x^2) − (8u^2 + 2ux^2); the set P̃ is slightly larger than P. Using e.g. the Mathematica command Reduce, one can see that on the set P̃ the polynomials p_1, p_2, and p_3 are positive, and so ∆_1(x, u) and ∆_2(x, u) are equal in sign to (u − 1) ∂∆/∂u (x, u) and ∂∆_1/∂u (x, u), respectively, for all points (x, u) ∈ P with u ≠ 1; note here that u < x < x^2 for all (x, u) ∈ P̃.