Optimal-order uniform and nonuniform bounds on the rate of convergence to normality for maximum likelihood estimators

Abstract: It is well known that, under general regularity conditions, the distribution of the maximum likelihood estimator (MLE) is asymptotically normal. Very recently, bounds of the optimal order O(1/√n) on the closeness of the distribution of the MLE to normality in the so-called bounded Wasserstein distance were obtained [2, 1], where n is the sample size. However, the corresponding bounds on the Kolmogorov distance were only of the order O(1/n^{1/4}). In this paper, bounds of the optimal order O(1/√n) on the closeness of the distribution of the MLE to normality in the Kolmogorov distance are given, as well as their nonuniform counterparts, which work better in tail zones of the distribution of the MLE. These results are based in part on previously obtained general optimal-order bounds on the rate of convergence to normality in the multivariate delta method. The crucial observation is that, under natural conditions, the MLE can be tightly enough bracketed between two smooth enough functions of the sum of independent random vectors, which makes the delta method applicable. It appears that the nonuniform bounds for MLEs in general have no precedents in the existing literature; a special case was recently treated by Pinelis and Molzon [20]. The results can be extended to M-estimators.


Introduction
Let us begin with the following 1968 quote from Kiefer [9]: "A second area of what seem to me important problems to work on has to do with the fact that we do have, in many settings, quite a good large sample theory, but we don't know how large the sample sizes have to be for that theory to take hold. Now, I'm sure most of you are familiar with the error estimate one can give for the classical central-limit theorem, which goes by the name of the Berry-Esseen estimate, and which tells you that under certain assumptions one can actually give an explicit bound on the departure from the normal distribution of the sample mean for a given sample size, the error term being of order 1/√n. For most other statistical problems, in fact for almost anything other than the use of the sample mean, we have nothing. The most obvious example of this (and this is not original with me; many people have been concerned with this), is the maximum likelihood estimator in the case of regular estimation. We all know what the asymptotic distribution is. Can you give explicitly some useful bound on the departure from the asymptotic normal distribution as a function of the sample size n? It seems to be a terrifically difficult problem."
Since then, there has been significant progress in this direction, especially rather recently. For instance, Berry-Esseen-type bounds of order 1/√n were obtained for U-statistics (see e.g. [10]); for the Student statistic [4,3]; and, even more recently, for rather broad classes of other statistics that depend on the observations in a nonlinear fashion [6,20].
As Kiefer pointed out, it is well known that, under general regularity conditions, the distribution of the maximum likelihood estimator (MLE) is asymptotically normal. In this paper, we shall consider Berry-Esseen-type bounds of order 1/√n for the MLE. The first such bounds were apparently obtained in the paper [14], followed by [16,17]. Very recently, bounds on the closeness of the distribution of the MLE to normality in the so-called bounded Wasserstein distance, d_bW, were obtained in [2]. In the rather common special case when the MLE θ̂ is expressible as a smooth enough function of a linear statistic of independent identically distributed (i.i.d.) observations, the bounds obtained in [2] were sharpened and simplified in [1] by using a version of the delta method. More specifically, it was assumed in [1] that

(1.1) q(θ̂) = (1/n) ∑_{i=1}^n g(X_i),

where q : Θ → R is a twice continuously differentiable one-to-one mapping, g : R → R is a Borel-measurable function, and the X_i's are i.i.d. real-valued r.v.'s.
It was noted in [2, Proposition 2.1] that for any r.v. Y and a standard normal r.v. Z one has d_Ko(Y, Z) ≤ 2 √(d_bW(Y, Z)), where d_Ko denotes the Kolmogorov distance. This bound on d_Ko in terms of d_bW is the best possible one, up to a constant factor, as shown in [20]. Therefore, even though the bounds on the bounded Wasserstein distance d_bW obtained in [2,1] are of the optimal order O(1/√n), the resulting bounds on the Kolmogorov distance are only of the order O(1/n^{1/4}). (That the order O(1/√n) is optimal for MLEs is well known; for instance, see the example of the Bernoulli family of distributions given in [14].) In [20], optimal-order bounds of the form O(1/√n) on the rate of convergence to normality in the general multivariate delta method were given. Those results are applicable when the statistic of interest can be expressed as a smooth enough function of the sum of independent random vectors. Accordingly, various kinds of applications were presented in [20]. In particular, uniform and nonuniform bounds of the optimal order on the closeness of the distribution of the MLE to normality were obtained in [20] under conditions similar to the mentioned conditions assumed in [1].
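To see the loss of order explicitly: plugging a bound of the form d_bW(Y, Z) ≤ c/√n (with a generic constant c) into the inequality d_Ko(Y, Z) ≤ 2 √(d_bW(Y, Z)) gives

```latex
d_{\mathrm{Ko}}(Y,Z) \;\le\; 2\,\sqrt{d_{\mathrm{bW}}(Y,Z)}
 \;\le\; 2\,\sqrt{\frac{c}{\sqrt n}} \;=\; \frac{2\sqrt{c}}{n^{1/4}},
```

so an optimal-order bound on d_bW yields only an O(1/n^{1/4}) bound on d_Ko; this is why Kolmogorov-distance bounds of the order O(1/√n) require a separate argument.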
In this paper we present a way to extend those results in [20] to the general case, without an assumption of the form (1.1), made in [1,20]. Of course, in general the MLE cannot be represented as a function of the sum of independent random vectors (see the Appendix in the arXiv version [19] of this paper for details). However, the crucial observation here is that, under natural conditions, the MLE can be tightly enough bracketed between two such smooth enough functions, which makes the delta method applicable. Thus, the present paper is methodologically different from the preceding work on Berry-Esseen-type bounds for the MLE, in that it relies on the general result developed in [20], rather than on methods specially designed to deal with the MLE.
Perhaps more importantly, the new method yields not only uniform bounds (that is, in the Kolmogorov metric) of the optimal order O(1/√n) on the closeness of the distribution of the MLE to normality but also their so-called nonuniform counterparts, which work much better for large deviations, that is, in tail zones of the distribution of the MLE, which are usually of foremost interest in statistical tests. Such nonuniform bounds for MLEs in general appear to have no precedents in the existing literature (except that, as stated above, a special case of nonuniform bounds for MLEs was recently treated in [20]).
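As a purely numerical illustration (not part of the paper's argument), one can estimate by Monte Carlo the Kolmogorov distance between the standardized MLE and the standard normal law in the Bernoulli model mentioned above, for which the MLE is the sample mean and the rate O(1/√n) is known to be optimal. The function name and parameter values below are ours.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def kolmogorov_distance_mle(n, theta0=0.3, reps=100_000):
    """Monte Carlo estimate of the Kolmogorov distance between the
    standardized Bernoulli MLE and the standard normal distribution.

    In the Bernoulli(theta) model the MLE is the sample mean and the
    Fisher information is I(theta) = 1/(theta*(1-theta)), so the
    standardized MLE is sqrt(n*I(theta0)) * (theta_hat - theta0).
    """
    theta_hat = rng.binomial(n, theta0, size=reps) / n   # MLE = sample mean
    i0 = 1.0 / (theta0 * (1.0 - theta0))                 # Fisher information
    t = np.sort(sqrt(n * i0) * (theta_hat - theta0))     # standardized MLE, sorted
    phi = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in t])  # Phi at sample points
    # The sup of |F_emp - Phi| is attained just before or just after a sample point.
    hi = np.arange(1, reps + 1) / reps
    lo = np.arange(0, reps) / reps
    return max(np.max(np.abs(hi - phi)), np.max(np.abs(lo - phi)))

for n in (25, 100, 400):
    d = kolmogorov_distance_mle(n)
    # If the rate is O(1/sqrt(n)), then sqrt(n)*d should stay roughly constant.
    print(n, round(d, 4), round(sqrt(n) * d, 3))
```

Here the distance does not vanish faster than 1/√n because the binomial distribution is a lattice distribution, so the jumps of its d.f. alone are of the order 1/√n.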
The paper is organized as follows. The general setting of the problem is described in Section 2. The key step of tight enough bracketing of the MLE between two functions of the sum of independent random vectors is made in Section 3. General uniform and nonuniform optimal-order bounds from [20] on the convergence rate in the multivariate delta method are presented in Section 4. In Section 5, we make the bracketing work by applying the general bounds in the multivariate delta method. Yet, this leaves out the problem of bounding a remainder, which is a probability of large deviations of the MLE from the true value of the parameter. It is shown in Section 6 that under natural conditions this remainder decreases exponentially fast in n and is thus asymptotically negligible as compared to the main term of the order 1/√n. All these findings are summarized in Section 7, where the main result of this paper is presented, along with a corresponding discussion.

General setting
Let X, X_1, X_2, . . . be random variables (r.v.'s) mapping a measurable space (Ω, A) to another measurable space (X, B), and let (P_θ)_{θ∈Θ} be a parametric family of probability measures on (Ω, A) such that the r.v.'s X, X_1, X_2, . . . are i.i.d. with respect to each of the probability measures P_θ with θ ∈ Θ; here the parameter space Θ is assumed to be a subset of the real line R. As usual, let E_θ denote the expectation with respect to the probability measure P_θ. Suppose that for each θ ∈ Θ the distribution P_θ X^{−1} of X has a density p_θ with respect to a measure μ on B. Because the extended real line [−∞, ∞] is compact, for each n ∈ N and each point x = x_n = (x_1, . . . , x_n) ∈ X^n the likelihood function Θ ∋ θ ↦ L_x(θ) := ∏_{i=1}^n p_θ(x_i) has at least one generalized maximizer θ̂_n(x) in the closure of the set Θ in [−∞, ∞], in the sense that sup_{θ∈Θ} L_x(θ) = lim sup_{θ→θ̂_n(x)} L_x(θ). Picking, for each x = (x_1, . . . , x_n) ∈ X^n, any one of such generalized maximizers θ̂_n(x), one obtains a map Ω ∋ ω ↦ θ̂_n(X(ω)), where X := X_n := (X_1, . . . , X_n); any such map will be denoted here by θ̂_n(X) (or simply by θ̂_n or θ̂) and referred to as a maximum likelihood estimator (MLE) of θ. This is a somewhat more general definition of the MLE than usual, and in general an MLE θ̂ will not have to be a r.v.; that is, it can be non-measurable with respect to the sigma-algebra A. However, to simplify the presentation, we shall still refer to sets of the form {θ̂ ∈ J} := {ω ∈ Ω: θ̂_n(X(ω)) ∈ J} for Borel sets J ⊆ Θ as events and write P_θ(θ̂ ∈ J), implying that the latter expression may and should be understood as either one of the expressions (P_θ)^*(θ̂ ∈ J) or (P_θ)_*(θ̂ ∈ J), where ^* and _* stand for the corresponding outer and inner measures. Of course, when the map θ̂ is measurable, one can use the bona fide expressions of the mentioned form P_θ(θ̂ ∈ J).
Let θ_0 ∈ Θ be the "true" value of the unknown parameter θ, such that, for some real δ > 0, [θ_0 − δ, θ_0 + δ] ⊆ Θ°, where Θ° denotes the interior of the subset Θ of R. For brevity, let P := P_{θ_0} and E := E_{θ_0}.
(II) Standard regularity conditions hold, so that E ℓ'_X(θ_0) = 0 and E ℓ'_X(θ_0)² = −E ℓ''_X(θ_0) = I(θ_0) ∈ (0, ∞).

Remark 2.1. The introduction of the set X_{>0} in condition (I) is needed even for a careful definition of the log-likelihood. The expectation E ℓ'_X(θ_0), mentioned in condition (II), may be understood as ∫_{X_{>0}} ℓ'_x(θ_0) p_x(θ_0) μ(dx), where p_x(θ) := p_θ(x); similarly for the other expectations mentioned in conditions (II)-(IV).
Of course, all the derivatives at this point are with respect to θ.
Concerning the "standard regularity conditions" mentioned in condition (II), it will be enough to assume that P(∂p_θ(X)/∂θ|_{θ=θ_0} ≠ 0) > 0 and that the derivatives of p_θ(x) in θ near θ_0 are dominated by a μ-integrable measurable function g, so that differentiation under the integral sign is justified. Conditions (I)-(IV) are rather similar to regularity conditions used in the related literature; see Remark 7.4 on page 1177 for details. It appears that these conditions will be generally satisfied provided that ℓ_x(θ) is smooth enough in θ.
For instance, let us briefly consider the case when the family of densities (p_θ) is a location family, so that ℓ_x(θ) = λ(x − θ) for all x and θ, where λ is a smooth enough function. If the densities p_θ have power-like tails, then for some positive real constants c_+ and c_− one has λ(x) ∼ −c_± ln |x| as x → ±∞, in which case typically the derivatives λ^{(k)}(x) are of the order |x|^{−k} for k = 1, 2, . . . as x → ±∞; so, conditions (III) and (IV) will hold. If the tails of the densities p_θ are lighter than power-like, so that (say) λ(x) ∼ −c_±|x|^α for some real α > 0 as x → ±∞, then typically λ^{(k)}(x) is of the order |x|^{α−k} for k = 1, 2, . . . as x → ±∞, so that conditions (III) and (IV) will again hold.
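A concrete instance of the power-like-tail case (our illustration) is the Cauchy location density p(x) = 1/(π(1 + x²)), for which

```latex
\lambda(x) \;=\; \ln p(x) \;=\; -\ln\pi \;-\; \ln(1+x^2) \;\sim\; -2\ln|x|,
\qquad
\lambda'(x) \;=\; -\frac{2x}{1+x^2} \;\sim\; -\frac{2}{x}
\qquad (x\to\pm\infty),
```

so here c_+ = c_− = 2, and the successive derivatives of λ decay at the power rates just described.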
The case of a scale family is quite similar to that of a location family. Alternatively, the "scale" case can be reduced to the "location" one by logarithmic rescaling in both x and θ.
At this point, consider also the case when the family of densities (p_θ) is an exponential family, so that ℓ_x(θ) = w(θ)T(x) + d(θ) for some functions w, T, and d and for all (x, θ) ∈ X × Θ = R², where the functions w and d are smooth enough, with w'(θ_0) ≠ 0. Then conditions (I)-(IV) reduce to moment conditions on T(X) (for any given real α > 0 and any given nonzero real h), and the conditions θ_0 ∈ Θ° and w'(θ_0) ≠ 0 imply the nondegeneracy needed in condition (II). Define now the log-likelihood of the sample X = (X_1, . . . , X_n):

(2.2) ℓ_X(θ) := ∑_{i=1}^n ℓ_{X_i}(θ) for θ ∈ Θ.
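For instance, the Bernoulli family mentioned in the Introduction is of this exponential-family form (with X = {0, 1} and Θ = (0, 1) in place of R²):

```latex
\ell_x(\theta) \;=\; x\ln\theta+(1-x)\ln(1-\theta)
 \;=\; \underbrace{\ln\frac{\theta}{1-\theta}}_{w(\theta)}\;
       \underbrace{x\vphantom{\frac{1}{1}}}_{T(x)}
 \;+\; \underbrace{\ln(1-\theta)\vphantom{\frac{1}{1}}}_{d(\theta)},
\qquad
w'(\theta)\;=\;\frac{1}{\theta(1-\theta)}\;>\;0,
```

so the condition w'(θ_0) ≠ 0 holds automatically for every θ_0 ∈ (0, 1).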

Tight bracketing of the MLE between two functions of the sum of independent random vectors
Without loss of generality (w.l.o.g.), X_{>0} = X. Note that the Z_i's are i.i.d. r.v.'s, and so are the U_i's and the R*_i's (but not necessarily the R_i's).
Equalities (3.2) and (3.3) provide a quadratic equation for θ̂; so, on the event G, one obtains the representation (3.5) of θ̂ − θ_0. By definitions (3.4) and conditions (II), (III), and (IV), one has E R*_1 < ∞. So, w.l.o.g. one may choose δ > 0 small enough so that, by (3.7), Markov's inequality, and a Rosenthal-type inequality (see e.g. [18, Theorem 1.5]), the probability of the event B_1 is at most C/√n. Next, the occurrence of B_2 implies the occurrence of at least one of the events in (3.10); in view of (3.8), the probability of each of those events can be bounded similarly. Thus, by (3.6), (3.9), and (3.10),

(3.11) P(G ∩ B) ≤ C/√n,

where C depends on the likelihood function, the measure μ, and the choice of θ_0, but not on n.
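Schematically (the notation here is our guess at the shape of (3.2)-(3.5), with Z_i, U_i, R_i as above), the quadratic equation arises from a second-order Taylor expansion of the score equation ℓ'_X(θ̂) = 0 around θ_0:

```latex
0 \;=\; \ell'_X(\hat\theta)
 \;=\; \sum_{i=1}^n Z_i \;-\; (\hat\theta-\theta_0)\sum_{i=1}^n U_i
 \;+\; \tfrac12\,(\hat\theta-\theta_0)^2 \sum_{i=1}^n R_i,
\qquad Z_i:=\ell'_{X_i}(\theta_0),\quad U_i:=-\ell''_{X_i}(\theta_0),
```

with the R_i's controlled via i.i.d. majorants R*_i coming from a bound on the third derivative of ℓ on [θ_0 − δ, θ_0 + δ]; solving the quadratic for θ̂ − θ_0 and replacing the R_i's by ±R*_i then yields the two bracketing roots T_∓.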
On the other hand, if R = 0 and U > 0, then θ̂ − θ_0 equals a ratio with denominator U; here, the condition U > 0 was used only to ensure that the denominator of the latter ratio is nonzero. Hence, on the event G \ B one has

(3.12) T_− ≤ θ̂ − θ_0 ≤ T_+,

where one may note that, when R = 0 and U > 0, the expression of θ̂ − θ_0 in (3.12) is in agreement with the corresponding expression in (3.5). Now that the desired bracketing of θ̂ − θ_0 between T_− and T_+ is obtained in (3.12), we are ready to apply some of the mentioned general results of [20], presented in the next section.

General uniform and nonuniform bounds from [20] on the rate of convergence to normality for smooth nonlinear functions of sums of independent random vectors
The standard normal distribution function (d.f.) will be denoted by Φ. For any R^d-valued random vector ζ, we use the norm notation ∥ζ∥_p := (E ∥ζ∥^p)^{1/p} for real p > 0, where ∥·∥ denotes the Euclidean norm on R^d. Take any Borel-measurable functional f : R^d → R satisfying the following smoothness condition: there exist ε ∈ (0, ∞), M ∈ (0, ∞), and a linear functional L : R^d → R such that

(4.1) |f(x) − L(x)| ≤ (M/2) ∥x∥² whenever ∥x∥ ≤ ε.

Thus, f(0) = 0 and L necessarily coincides with the first Fréchet derivative, f'(0), of the function f at 0. Moreover, for the smoothness condition (4.1) to hold, it is enough that f be twice continuously differentiable in a neighborhood of 0 with f(0) = 0; it is not necessary that f be twice differentiable at 0.
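A simple one-dimensional illustration of the last point (this example is ours): take

```latex
f(x) \;=\; x + x^2\sin(1/x)\ \ (x\neq0), \qquad f(0):=0.
```

Then |f(x) − x| ≤ x² for all x, so the smoothness condition (4.1) holds with L(x) = x, M = 2, and any ε > 0; yet f'(x) = 1 + 2x sin(1/x) − cos(1/x) for x ≠ 0 has no limit as x → 0, so f is not twice differentiable at 0.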

For all real z one then has the uniform bound (4.4), where C is a finite positive expression that depends only on the function f (through (4.1)) and the moments σ, ς_3, and v_3. Moreover, for any ω ∈ (0, ∞) and for all

(4.5) z ∈ [0, ω√n],

one has the nonuniform bound (4.6), where C_ω is a finite positive expression that depends only on the function f (through (4.1)), the moments σ, ς_3, and v_3, and also on ω.
The restriction (4.5) cannot be relaxed in general; see [20].
To simplify the presentation, in what follows we let C stand for various finite positive expressions whose values do not depend on n or z; that is, C will denote various positive real constants with respect to n and z. However, C may depend on other attributes of the setting, including the model (P_θ)_{θ∈Θ} under consideration, the P_{θ_0}-distribution of X_1, the measure μ, and the values of parameters freely chosen in a given range (such as ω in (4.5) and ε in (4.1)).

Making the bracketing work: Applying the general bounds of [20]
Now let d = 3, and let the summands of the relevant sum of independent random vectors be the R³-valued vectors whose coordinates are built from the Z_i's, U_i's, and R*_i's of Section 3.

By (3.4) and conditions (II) and (IV), the origin 0 belongs to the (open) set D. So, for some real ε > 0, the set D contains the ε-neighborhood of the origin 0 of R³.
Define functions f_± : R³ → R by formula (5.1). In accordance with (4.2), the smoothness condition (4.1) then holds for some ε and M in (0, ∞), because, as was noted above, the denominator of the ratio in (5.1) is bounded away from 0 for x = (x_1, x_2, x_3) in a neighborhood of 0.
Next, applying (4.4) to f_+ and, quite similarly, to f_−, one obtains uniform bounds on the closeness of the distributions of T_+ and T_− to normality, for all real z. Note that P(G^c ∪ B) = P(G^c) + P(G ∩ B). It follows now by (3.1) and (3.11) that, for all real z, the distance from the d.f. of the appropriately standardized θ̂ − θ_0 to Φ(z) is at most C/√n + P(|θ̂ − θ_0| > δ). Quite similarly, but using (4.6) instead of (4.4), one obtains the nonuniform counterpart for z as in (4.5). Typically, given rather standard regularity conditions, the remainder term P(|θ̂ − θ_0| > δ) decreases exponentially fast in n and is thus negligible as compared with the "error" term C/√n, and even with the "error" term C/(z³√n) under condition (4.5). Some details on this can be found in the following section.
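The transfer from T_± to θ̂ rests on an elementary sandwich: since T_− ≤ θ̂ − θ_0 ≤ T_+ on G \ B by (3.12), for every real z

```latex
\Pr(T_+\le z)\;-\;\Pr(G^c\cup B)
\;\le\;\Pr(\hat\theta-\theta_0\le z)
\;\le\;\Pr(T_-\le z)\;+\;\Pr(G^c\cup B),
```

so any Berry-Esseen-type bounds for T_+ and T_− translate into bounds for θ̂ − θ_0, at the cost of the additive term P(G^c ∪ B).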

Bounding the remainder: General case
Upper bounds on the large-deviation probability P(|θ − θ 0 | > δ) that are exponentially decreasing in n without the assumption of the concavity of the log-likelihood function were presented e.g. in [22,21,15,5,13]. However, the parameter space Θ was assumed in [22,21,5] to be bounded, whereas in [13] the distributions P θ were assumed to be subgaussian (cf. Theorems 2.1, 2.2, and 3.3 in [13]). Conditions in [15] appear to be difficult to verify, including the strict positivity of the infimum of the rate function, needed for an actual exponential decrease.
Related is the work [7], containing a result on so-called moderate deviation probabilities for MLEs, which decrease more slowly than exponentially but still faster than any power of 1/n. So, such a result would be enough for our conclusions in Theorem 7.1 in the next section (cf. Remark 7.2 there), were it not assumed in [7] (as in [22,21,5]) that Θ is bounded.
Here we modify the method of [5] to get rid of the condition that Θ is bounded. Consider the (squared) Hellinger distance H(θ, θ_0) between the probability measures P_θ and P_{θ_0}. Assume now the following conditions:
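For completeness, recall the standard definition of the squared Hellinger distance (consistent with the bound H ≤ 2 used below):

```latex
H(\theta,\theta_0)\;:=\;\int_{\mathcal X}\bigl(\sqrt{p_\theta}-\sqrt{p_{\theta_0}}\bigr)^2\,d\mu
\;=\;2\Bigl(1-\int_{\mathcal X}\sqrt{p_\theta\,p_{\theta_0}}\,d\mu\Bigr)\;\in\;[0,2].
```

In particular, H(θ, θ_0) = 2 exactly when the distributions P_θ X^{−1} and P_{θ_0} X^{−1} are mutually singular, which is, approximately, the situation condition (D_1) requires for θ far from θ_0.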

(B) The set Θ is a (possibly infinite) interval, and the Fisher information I(θ) is well defined and satisfies the corresponding boundedness condition, with some positive real constants c_1, c_2, α, for all θ ∈ Θ. (If a point θ in Θ is an endpoint of the interval Θ, then I(θ) is naturally understood in terms of the corresponding one-sided derivative of p_θ(x) in θ.)

(D_0) For some real constant c_0 > 0 and some neighborhood U of θ_0, the lower bound (6.4) on H(θ, θ_0) holds over all θ ∈ U.

(D_1) For some real constant γ > 0 and some bounded neighborhood V of θ_0, condition (6.5) holds.

Here and in the sequel, for any two expressions E_1 > 0 and E_2 ≥ 0 whose values depend on some variables, the relation E_1 ≳ E_2 and its equivalent E_2 ≲ E_1 mean that sup(E_2/E_1) < ∞, where the supremum is taken over the corresponding specified range of values of the variables. Conditions (D_0) and (D_1) may be referred to as distinguishability conditions: (D_0) means that the probability measures P_θ and P_{θ_0} are not too close to each other for θ in a punctured neighborhood of θ_0, whereas (D_1) implies that, for θ far away from θ_0, the probability measures P_θ and P_{θ_0} are almost mutually singular, and thus easily distinguishable, at least in principle.

Remark 6.1. In the particular case when the parameter space Θ is compact (or just bounded), condition (D_1) trivially holds. Moreover, as shown in [5, Section 31], if Θ is compact and the Fisher information I(θ) is continuous in θ ∈ Θ and strictly positive for θ ∈ Θ, then (6.4) holds over all θ ∈ Θ. So, condition (D_0) holds (whether the set Θ is bounded or not) whenever the Fisher information I(·) is continuous and strictly positive on Θ.
However, since H(θ, θ_0) is always bounded from above by 2, it is clear that condition (6.4) cannot possibly hold over all θ ∈ Θ if the parameter space Θ is unbounded. In such a case, we need to complement condition (D_0) by condition (D_1), which appears to be natural and is indeed commonly satisfied. In particular, conditions (D_0) and (D_1) (as well as regularity conditions (I)-(IV)) hold if p_θ is the density belonging to any one of a number of standard families of probability distributions, including the normal location family and families whose densities are expressed via the Beta function B(·, ·). The first of these, concerning the normal location family, can be quite broadly generalized:

Proposition 6.2. Suppose that (p_θ)_{θ∈Θ} is a location family over R, so that p_θ(x) = p(x − θ) for all x ∈ R and θ ∈ Θ, where p is a pdf (with respect to the Lebesgue measure over R). Suppose also that

(6.6) p(u) ≲ (1 + |u|)^{−α}

for some real α > 1 and all real u. Then condition (D_1) holds.
Note that the restriction α > 1, together with (6.6), implies the integrability of the nonnegative function p.
Now we are well prepared to state the main result of this subsection: under the conditions stated above,

(6.11) P(|θ̂ − θ_0| > δ) ≤ c λ^n

for some real constants c > 0 and λ ∈ [0, 1) (depending on γ, c_0, α, c_1, c_2) and all natural n; cf. (6.1).
(i) It is assumed in [5] that Θ is compact, in addition to the assumption that I(θ) is continuous in θ ∈ Θ and strictly positive for θ ∈ Θ. Under these assumptions, condition (D_0) is not assumed but derived in [5]. As noted above, if the parameter space Θ is compact, then condition (D_1) is trivial. (ii) As we do not assume that Θ is compact (or even bounded), we need to control the behavior of the log-likelihood ℓ_X(θ) for θ far from θ_0. This is done using condition (D_1). (iii) In [5], instead of condition (B) above, it is assumed that the Fisher information I(θ) is just bounded over all θ ∈ Θ. However, mainly following the lines of the proof in [5], one can see that the more general condition (B) suffices, given conditions (D_0) and (D_1).

For the readers' convenience, here is a proof of Proposition 6.4.
Proof of Proposition 6.4. Consider the likelihood ratio p_{θ_0+u}(X)/p_{θ_0}(X), where p_θ(X) := ∏_{i=1}^n p_θ(X_i) = exp ℓ_X(θ) and ℓ_X is the log-likelihood function, as defined in (2.2); here and subsequently in this proof, u is a real number such that θ_0 + u ∈ Θ.

Conclusion
Inequalities (7.1) and (7.2) of Theorem 7.1 hold for all real z and for z as in (4.5), respectively. Here, as before, each of the two instances of the symbol C stands for a finite positive expression whose values do not depend on n or z, in accordance with the last paragraph of Section 4.
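Schematically (our paraphrase of the shape of the two bounds; the precise statement is Theorem 7.1), the conclusion is of the form

```latex
\sup_{z\in\mathbb R}\,\Bigl|\Pr\bigl(\sqrt{n\,I(\theta_0)}\,(\hat\theta-\theta_0)\le z\bigr)-\Phi(z)\Bigr|
\;\le\;\frac{C}{\sqrt n},
\qquad
\Bigl|\Pr\bigl(\sqrt{n\,I(\theta_0)}\,(\hat\theta-\theta_0)\le z\bigr)-\Phi(z)\Bigr|
\;\le\;\frac{C}{(1+|z|)^3\sqrt n}\ \ \text{for }z\text{ as in (4.5)},
```

where the standardization by √(n I(θ_0)) is the usual one in the asymptotic normality of the MLE, and the nonuniform factor matches the error term C/(z³√n) mentioned in Section 5.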

Remark 7.2.
It should be clear that the conditions assumed in the second sentence of Theorem 7.1 can be replaced by any other conditions that imply (6.11) for some real constants c > 0 and λ ∈ [0, 1) not depending on n. Actually, a much weaker bound, of the form c/n², instead of the exponentially fast decreasing upper bound cλ^n in (6.11), will already suffice. For example, Theorem 7.1 applies to the Cauchy location family and to the family of densities p_θ(x) proportional to exp{−(x − θ)⁴}, for all real x and θ. (One may note that, whereas the tails in the first example here, of the Cauchy family, are very heavy, they are very light in the second example.) An additional advantage of Theorem 7.1 of the present paper over [20, Theorem 3.16] is that now one does not have to check a special, restrictive condition of the form (1.1) even when it holds.
Theorem 7.1 can be extended to the more general case of M-estimators. Indeed, the condition that p_θ is a pdf for θ = θ_0 is used in our proofs only in order to state that E_θ ℓ'_X(θ) = 0 and E_θ ℓ'_X(θ)² = −E_θ ℓ''_X(θ) = I(θ) ∈ (0, ∞). In the case of M-estimators, the corresponding conditions will have to be just assumed, with some other expressions in place of the Fisher information I(θ), as is done e.g. in [16,17], where uniform bounds of the optimal order O(1/√n) for M-estimators were obtained; M-estimators were referred to as minimum contrast estimates in [14,16,17]. We have chosen to restrict the consideration here to MLEs in order not to obscure the novel elements in our result.
The most significant novelty in our Theorem 7.1, as compared with the results of [14,16,17], is that, in addition to the uniform bound in (7.1), inequality (7.2) in Theorem 7.1 also provides a nonuniform Berry-Esseen-type bound for MLEs in general, which appears to be the first such result in the literature, except for the already mentioned special case considered recently in [20]. On the other hand, paper [17] treats the case of a multidimensional parameter θ. The uniform bound in [14] was of the form O(√(ln n)/√n), rather than of the optimal order O(1/√n). Another notable distinction is that condition [14, (1)] (the same as the corresponding conditions on page 73 in [16] and on page 173 in [17]) effectively reduces the consideration to the case when the parameter space Θ is compact in [−∞, ∞]. This obviates the need for a condition such as (D_1), which is there to control the behavior of the likelihood ℓ_X(θ) for large |θ|. However, as pointed out in [14, page 75] concerning the main result there, the nonconstructive compactification condition used in [14,16,17] "gives no method for determining [the] value [of the constant in the Berry-Esseen-type bound] for a given family of probability measures." The problem of controlling the likelihood over far-away zones of a noncompact parameter space Θ was illustrated in Example 6.3, where the "bad" situation was excluded by condition (6.5). That same situation, with f_θ = −ln p_θ for θ ∈ Θ = [−1, ∞], mean μ(∞) := lim_{θ→∞} μ(θ) = 0 = μ(0), and variance σ²(∞) := lim_{θ→∞} σ²(θ) = 1 = σ²(0), was also excluded by the mentioned compactification condition in [14,16,17].
As was pointed out, the method of the present paper is based on the general Berry-Esseen bounds for the multivariate delta method obtained in [20], which were applied here via the bracketing argument delineated in Section 3. As such, this method is quite different from the methods in [14,16,17], specialized to deal with MLEs. Partly because of this difference in the methods, there are many differences between the conditions in [14,16,17] and those in the present paper. Most of these differences, apart from the ones discussed above, are rather minor. Since the result of [16] is apparently the closest to ours in the literature, let us further discuss the regularity conditions in [16], in comparison with ours, in some detail: Remark 7.4. Condition (I) in the present paper can be replaced by the condition that p_θ > 0 everywhere on X. The latter condition is necessary in order for ℓ_x(θ) = ln p_θ(x) to be defined for all x ∈ X; cf. the first paragraph on page 83 in [16].
Next, condition (III) follows from [16, (vi)]. Here and in the rest of this remark, the lower-case Roman numerals and letters in parentheses refer to the regularity conditions on pages 83-84 in [16], again for f_θ := −ℓ_θ.
Condition (IV) is, in the main, a bit stronger than [16, (viii)]. Of course, condition (IV) can be relaxed, at the price of making it more complicated.
By Remark 6.1, our condition (D 0 ) will hold if the Fisher information I(·) is continuous and strictly positive on Θ, for which conditions (ix) and (v)(a), respectively, in [16] will be more than enough.
Next, our condition (D 1 ), to control the behavior of the likelihood X (θ) for large |θ|, was already discussed at length, versus the compactification condition used in [14,16,17].
In the case when Θ is compact, for our condition (B) to hold, either one of regularity conditions (vi)(a) or (vi)(b) in [16] will be more than enough. More generally, condition (B) together with condition (D 1 ) replace the just mentioned compactification condition in [14,16,17].
So, quite predictably, our conditions neither imply those in [14,16,17] nor follow from them. However, our conditions appear to be a bit simpler and more explicit overall than those in [14,16,17]. It should also be mentioned that in [14,16] both the relevant conditions and the corresponding results are stated uniformly over compact subsets of Θ. Of course, a similar modification of our conditions and results can be done.