Assessing the multivariate normal approximation of the maximum likelihood estimator from high-dimensional, heterogeneous data

The asymptotic normality of the maximum likelihood estimator (MLE) under regularity conditions is a cornerstone of statistical theory. In this paper, we give explicit upper bounds on the distributional distance between the distribution of the MLE of a vector parameter and the multivariate normal distribution. We work with possibly high-dimensional, independent but not necessarily identically distributed random vectors. In addition, we obtain explicit upper bounds even in cases where the MLE cannot be expressed analytically.


Introduction
In this paper, we give explicit upper bounds on the distributional distance between the distribution of a vector MLE and the multivariate normal, which under specific regularity conditions is the MLE's limiting distribution. We focus on independent but not necessarily identically distributed random vectors. The quantitative statement obtained from our bounds can help assess whether using the limiting distribution of the MLE is an acceptable approximation. From the opposite point of view, the results presented in this paper can save both money and time by giving a good indication of whether a larger sample size is indeed necessary for a good approximation to hold. The wide applicability of the maximum likelihood estimation method adds to the importance of our results. Among other settings, the MLE is used in ordinary and generalised linear models, in time series analysis, and in a large number of situations related to hypothesis testing and confidence intervals. These applications span a broad range of fields, such as econometrics, computational biology, and data modelling in physics and psychology.
The notation used throughout the paper is as follows. The parameter space is Θ ⊂ R^d, equipped with the Euclidean norm. Let θ = (θ_1, θ_2, …, θ_d) be a parameter from the parameter space, while θ_0 = (θ_{0,1}, θ_{0,2}, …, θ_{0,d}) denotes the true, but unknown, value of the parameter. The probability density (or probability mass) function is denoted by f(x|θ), where x = (x_1, x_2, …, x_n). The likelihood function is L(θ; x) = f(x|θ), and its natural logarithm, called the log-likelihood function, is denoted by l(θ; x). A maximum likelihood estimate (not seen as a random vector) is a value of the parameter which maximises the likelihood function. For many models the maximum likelihood estimator as a random vector exists and is also unique, in which case it is denoted by θ̂_n(X); this is known as the 'regular' case. Existence and uniqueness of the MLE cannot be taken for granted; see e.g. Billingsley (1961) for an example of non-uniqueness.
In order to secure existence and uniqueness in the case where the likelihood function L(θ; x) is twice continuously differentiable on an open parameter space Θ ⊂ R^d, we make the following assumptions from Makelainen et al. (1981): (A1) the log-likelihood l(θ; x) tends to −∞ as θ approaches the boundary of Θ; (A2) the Hessian matrix of second partial derivatives is negative definite at every point θ ∈ Θ at which the gradient vector of l(θ; x) vanishes. The interest is in assessing the quality of the asymptotic normality of the MLE, and the approach we follow is partly based on Stein's method in a multivariate setting. Let
H = {h : R^d → R : h is three times differentiable with bounded derivatives} (1)
be the class of test functions we use in this paper; for h ∈ H we abbreviate ‖h‖_1 := sup of the absolute first-order partial derivatives of h, and similarly for higher orders. The quantity of interest is
|E[h(√n [Ī_n(θ_0)]^{1/2} (θ̂_n(X) − θ_0))] − E[h(Z)]|, (2)
where Z ∼ N_d(0, I_{d×d}) and Ī_n(θ_0) is defined in (3). The bounds are explicit in terms of the sample size and θ_0. The two main results of the paper are given in Theorems 2.2 and 3.1. Theorem 2.2 gives a general upper bound on (2) which holds under the usual, sufficient regularity conditions for the asymptotic normality of the MLE. The generality of the bound adds to its importance, as it can be applied in a variety of settings; we have chosen the class of linear regression models to serve as an illustration of our results. Theorem 3.1 is also substantial since, under further assumptions, we obtain upper bounds related to the asymptotic normality of the MLE even when the MLE is not known analytically.
The paper is organised as follows. Section 2 first treats the case of independent but not necessarily identically distributed (i.n.i.d.) random vectors. The upper bound on the distributional distance between the distribution of the vector MLE and the multivariate normal distribution is presented. Special attention is given to linear regression models, with an application to the simplest case of the straight-line model. Furthermore, under weaker regularity conditions, we explain how the bound can be simplified for the case of i.i.d. random vectors.
Specific results for independent random variables that follow the normal distribution with unknown mean and variance are also given. Section 3 contains an upper bound on the aforementioned distributional distance, which holds even in cases where no analytic expression of the vector MLE is available. We illustrate the results through the Beta distribution with both shape parameters unknown. To keep the paper easily readable, we provide only an outline of the proofs of our main Theorems 2.2 and 3.1 in the main text; the complete proofs are given in Section 4. In addition, some technical results and proofs of corollaries that are not essential for following the paper's developments are deferred to the Appendix.

Bounds for multi-parameter distributions
In this section we examine the case of i.n.i.d. t-dimensional random vectors, for t ∈ Z + . Apart from the assumptions (A1) and (A2) for the existence and uniqueness of the MLE, we use some regularity conditions, first stated in Hoadley (1971), in order to establish the asymptotic normality of the MLE. We give an upper bound on the distributional distance between the distribution of the MLE and the multivariate normal and then we focus on the specific case of linear models. The last subsection covers, under weaker regularity conditions, the case of i.i.d. random vectors and an example from the normal distribution with unknown mean and variance serves as an illustration of our results.
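Before stating the bound, it may help to see the object being controlled. The distance (2) between the law of W = √n [Ī_n(θ_0)]^{1/2}(θ̂_n(X) − θ_0) and that of Z ∼ N_d(0, I) can be estimated by simulation; the sketch below uses the one-parameter exponential rate MLE and the test function h = sin, both of which are illustrative choices of ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def mle_distance_estimate(n, theta0=2.0, reps=20000):
    """Monte Carlo estimate of |E h(W) - E h(Z)| for the Exp(theta0) rate MLE,
    with W = sqrt(n) * I(theta0)^{1/2} * (theta_hat - theta0) and h = sin,
    a smooth test function with bounded derivatives (so h is in the class H)."""
    X = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    theta_hat = 1.0 / X.mean(axis=1)                 # MLE of the exponential rate
    W = np.sqrt(n) * (theta_hat - theta0) / theta0   # I(theta) = 1/theta^2
    # For Z ~ N(0,1), E sin(Z) = 0 by symmetry, so the distance is |E sin(W)|.
    return abs(np.sin(W).mean())

d50, d500 = mle_distance_estimate(50), mle_distance_estimate(500)
print(d50, d500)  # both small for moderate n
```

Such a simulation only probes one test function at a time; the explicit bounds below control the supremum over the whole class H.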
In terms of the dimensionality d of the parameter, K_1(θ_0) = O(d^4), K_2(θ_0) = O(d^4) and K_3(θ_0) = O(d^8), as can be deduced from (7), (8) and (9), respectively. The last term of the bound in (6) is of order d in the dimensionality of the parameter. Thus, for d ≫ n the bound does not behave well, but d may grow moderately with n: for example, d = o(n^α) with 0 < α < 1/16 still yields a bound which goes to zero as n goes to infinity.
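The restriction d = o(n^α) with α < 1/16 can be made concrete by tracking a proxy for the dominant K_3(θ_0) = O(d^8) term; the proxy below ignores all constants and is only a sketch of the scaling argument.

```python
import numpy as np

def bound_proxy(n, alpha):
    """Proxy d^8 / sqrt(n) for the dominant term of the bound, with d = n^alpha.
    This equals n^{8*alpha - 1/2}, which vanishes if and only if alpha < 1/16."""
    d = n ** alpha
    return d ** 8 / np.sqrt(n)

for alpha in (0.03, 1 / 16, 0.10):
    # Below the threshold the proxy shrinks with n; above it, it grows.
    print(alpha, [bound_proxy(n, alpha) for n in (1e4, 1e8, 1e12)])
```

At α = 1/16 exactly the proxy is constant in n, which is why the threshold is strict.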

Linear regression
This subsection calculates the bound in (6) for linear regression models. The asymptotic normality of the MLE in linear regression models was proven in Fahrmeir and Kaufmann (1985). We give the example of a straight-line regression, for which the bound turns out to be, as expected, of order O(1/√n), where n is the sample size. The following notation is used throughout this subsection. The vector Y = (Y_1, Y_2, …, Y_n)⊺ ∈ R^{n×1} denotes the response variable for the linear regression, β = (β_1, β_2, …, β_d)⊺ ∈ R^{d×1} is the vector of the d parameters, and ǫ = (ǫ_1, ǫ_2, …, ǫ_n)⊺ ∈ R^{n×1} is the vector of the error terms, which are i.i.d. random variables with ǫ_i ∼ N(0, σ^2) for all i ∈ {1, 2, …, n}. The true value of the unknown parameter β is denoted by β_0 = (β_{0,1}, β_{0,2}, …, β_{0,d})⊺ ∈ R^{d×1}, and X ∈ R^{n×d} is the design matrix. For the model Y = Xβ + ǫ the aim is to find bounds on the distributional distance between the distribution of the MLE, β̂, and the normal distribution. The probability density function for Y_i is that of N(X_{[i]}β, σ^2), where X_{[i]} denotes the i-th row of the design matrix. The parameter space Θ = R^d is open and, if X⊺X is of full rank, then X⊺X is invertible and the vector MLE is β̂ = (X⊺X)^{−1}X⊺Y. We now bound the corresponding distributional distance.
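The MLE of β in the Gaussian linear model has the closed form β̂ = (X⊺X)^{−1}X⊺Y, which can be checked numerically; in this sketch the design, dimensions and parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate Y = X beta0 + eps with i.i.d. N(0, sigma^2) errors and recover
# the MLE beta_hat = (X^T X)^{-1} X^T Y, assuming X^T X has full rank.
n, d, sigma = 200, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta0 = np.array([1.0, -2.0, 0.5])
Y = X @ beta0 + rng.normal(scale=sigma, size=n)

# Solve the normal equations instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to beta0 for moderate n
```

Solving the normal equations via `np.linalg.solve` is numerically preferable to inverting X⊺X directly.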
Using (17), the Hessian matrix of the log-likelihood function is −(1/σ^2)X⊺X. The expression in (20) is the same as W in (37), and therefore the quantity of interest is equal to (11), with (12) being equal to zero for this specific case of the linear regression model. Thus, using (44) in Theorem 2.2 yields the result of the corollary.

Example: The simple linear model (d=2)
Here, we apply the results of (19) to the case of a straight-line regression with two unknown parameters. The model is Y_i = β_1 + β_2 x_i + ǫ_i, i ∈ {1, 2, …, n}. The unknown parameters β_1 and β_2 are the intercept and the slope of the regression, respectively. As before, the ǫ_i ∼ N(0, σ^2) are i.i.d. for all i ∈ {1, 2, …, n}. The MLE exists, is unique, and is given by β̂ = (X⊺X)^{−1}X⊺Y.
Proof. The first term of (19) follows by direct calculation. For an upper bound on the second term of (19), since d = 2 the only relevant indices are k = 1, j = 2. The final term of (19) is bounded similarly. Summarising, in the case of Y_1, Y_2, …, Y_n being independent random variables with Y_i ∼ N(β_{0,1} + β_{0,2}x_i, σ^2), we apply to (19) the results of (22), (23) and (25) to obtain the assertion of the corollary.
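For the straight-line model with normal errors, β̂ is in fact exactly multivariate normal with covariance σ^2(X⊺X)^{−1}, so a simulation should reproduce that covariance; the design points and parameter values below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Straight-line model Y_i = beta_1 + beta_2 * x_i + eps_i with d = 2.
n, sigma = 50, 1.0
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([0.3, 2.0])
exact_cov = sigma ** 2 * np.linalg.inv(X.T @ X)

# Monte Carlo: refit the MLE on many simulated data sets.
reps = 20000
Y = X @ beta0 + rng.normal(scale=sigma, size=(reps, n))
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y.T).T   # one (intercept, slope) per row
mc_cov = np.cov(beta_hat.T)
print(exact_cov)
print(mc_cov)  # should agree up to Monte Carlo error
```

The exact normality here is special to Gaussian errors; for other error laws the bound of this section quantifies the approximation error instead.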

Special case: Identically distributed random vectors
In this subsection we use weaker regularity conditions than (N1)-(N8) in order to find an upper bound in the case of independent and identically distributed random vectors. Following Davison (2008), we make the following assumptions: (R.C.1) the densities defined by any two different values of θ are distinct; (R.C.2) the log-likelihood function is three times differentiable with respect to the unknown vector parameter θ, and the third-order partial derivatives are continuous in θ; (R.C.3) the expected Fisher information matrix for one random vector, I(θ), is finite, symmetric and positive definite; (R.C.4) integration with respect to x and differentiation with respect to θ can be interchanged where required.
These regularity conditions for the multi-parameter case resemble those in Anastasiou and Reinert (2015), where the parameter is assumed to be scalar. From now on, unless otherwise stated, the notation I(θ) stands for the expected Fisher information matrix for one random vector. Under (R.C.1)-(R.C.4), Davison (2008, p.118) shows that √n[I(θ_0)]^{1/2}(θ̂_n(X) − θ_0) converges in distribution to Z ∼ N_d(0, I_{d×d}). The upper bound on the distributional distance between the distribution of a vector MLE and the multivariate normal in the case of i.i.d. random vectors is the same as the bound in Theorem 2.2 and thus is not given again. The bound simplifies because, in the i.i.d. case, Ī_n(θ_0) = I(θ_0).
In the next example, of independent random variables from the normal distribution with both mean and variance unknown, the bound can be easily calculated and it is, as expected, of order O(1/√n).

Example: The normal distribution
Here, we apply Theorem 2.2 in the case of X_1, X_2, …, X_n independent and identically distributed random variables from N(µ, σ^2) with θ_0 = (µ, σ^2). It is well known that the MLE exists and is unique, with θ̂_n(X) = (µ̂, σ̂^2), where µ̂ = X̄ and σ̂^2 = (1/n)Σ_i(X_i − X̄)^2; see Davison (2008, p.116). In addition, the regularity conditions (R.C.1)-(R.C.4) are satisfied. The proof of the following corollary is given in the Appendix.
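For this model the MLE and the standardisation √n[I(θ_0)]^{1/2}(θ̂_n(X) − θ_0) are simple to compute; the sketch below uses the standard Fisher information I(θ_0) = diag(1/σ^2, 1/(2σ^4)) for θ = (µ, σ^2), with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)

# MLE for N(mu, sigma^2): mu_hat = sample mean, sigma2_hat = (1/n) sum (X_i - Xbar)^2.
mu, sigma2, n = 1.5, 4.0, 1000
X = rng.normal(mu, np.sqrt(sigma2), size=n)
mu_hat = X.mean()
sigma2_hat = X.var()   # ddof=0 gives the MLE, not the unbiased variance estimator

# Standardise with the square root of the expected Fisher information matrix,
# which is diagonal for (mu, sigma^2) in the normal model.
I_sqrt = np.diag([1.0 / np.sqrt(sigma2), 1.0 / np.sqrt(2.0 * sigma2 ** 2)])
W = np.sqrt(n) * I_sqrt @ np.array([mu_hat - mu, sigma2_hat - sigma2])
print(mu_hat, sigma2_hat, W)  # W behaves like a draw from N(0, I_2)
```

Note the deliberate use of the biased variance estimator: it, not the ddof=1 version, is the MLE.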
(1) The rate of convergence of the upper bound in (26) is O(1/√n).
(2) There might be cases where the parameters depend on the sample size, so that µ = µ(n) and σ^2 = σ^2(n). The bound in (26) does not depend on µ(n) and goes to zero as long as (i) lim_{n→∞} nσ^2(n) = ∞ and (ii) lim_{n→∞} σ^2(n)/√n = 0 are both satisfied. From (i), the order of σ^2(n) should not be less than or equal to 1/n, while from (ii) we see that σ^2(n) should be of order smaller than √n. For instance, σ^2(n) = c n^{1/4}, where c ∈ R is a constant, satisfies the above limits, and for such a choice the bound in (26) still vanishes as n → ∞.

Bounds when the MLE is not known explicitly

Anastasiou and Reinert (2015) give an upper bound for the mean squared error (MSE) of the MLE and use it to obtain upper bounds on the distributional distance of interest, which can then be applied when the MLE is not expressible in closed form. In this section, we give similar bounds for the multi-parameter case with multivariate i.i.d. random vectors. We make some extra assumptions:
(Con.1) for every j ∈ {1, 2, …, t}, the support S_j of X_{ij} is a bounded interval in R; let s_j := sup_{x∈S_j} |x| and s := max{s_1, s_2, …, s_t};
(Con.2) for all θ_0 ∈ Θ, where Θ is the open parameter space, there exists an ǫ_0 = ǫ_0(θ_0) > 0 such that the third-order partial derivatives of the log-likelihood satisfy
sup_{θ: |θ_q − θ_{0,q}| < ǫ_0 ∀q ∈ {1, 2, …, d}} |∂^3 l(θ; x)/(∂θ_k ∂θ_j ∂θ_i)| ≤ M_{kji},
where M_{kji} = M_{kji}(θ_0) is a constant that may depend only on θ_0;
(Con.3) a condition on the sample size n, which in particular ensures that if (Con.1)-(Con.3) hold, then M_{kim} − nǫ_0^2 < 0, with ǫ_0 as in (Con.2).
Section 2 gave an upper bound on the distributional distance between the distribution of the MLE and the multivariate normal distribution. As explained in the outline of the proof of Theorem 2.2, the bound in (6) can be split into terms coming from Stein's method and terms due to Taylor expansions and conditional expectations; for ease of presentation, we abbreviate these as in (27) and (28). In order to give an upper bound when θ̂_n(X) is not known explicitly, we bound E[Σ_{j=1}^d (θ̂_n(X)_j − θ_{0,j})^2] by a quantity which does not require knowledge of the MLE.
The result is given in Theorem 3.1 below, followed by a brief explanation of the idea of the proof. The complete proof is given in Section 4.

Example: The Beta distribution
Here, we find an upper bound for the specific example of i.i.d. random variables from the Beta distribution with both shape parameters unknown. An analytic expression for the MLE is not available. Applying the result of (29) to bound E[Σ_{j=1}^2 (θ̂_n(X)_j − θ_{0,j})^2] gives an upper bound for the distributional distance of interest. Some useful notation is now presented. Firstly, Ψ_j(·) is the j-th derivative of the digamma function Ψ, where Ψ(z) = Γ′(z)/Γ(z), z > 0. The function Ψ_j(z) can be defined through a sum, with Ψ_j(z) = (−1)^{j+1} j! Σ_{k=0}^∞ (z + k)^{−(j+1)}. For α, β, x, y > 0 and 0 < ǫ < min{x, y}, we define constants C_2(α, β), C_2(β, α) and δ_I, which enter the bound through the quantity nδ_I (C_2(α, β) + C_2(β, α)).
Corollary 3.1 below gives the upper bound related to the Beta distribution. The proof is given in the Appendix.
Corollary 3.1. Let X_1, X_2, …, X_n be i.i.d. random variables from the Beta(α, β) distribution with θ_0 = (α, β). Let m = min{α, β} and ǫ = m/2 > 0. Then, a) when n satisfies (35), E[Σ_{j=1}^2 (θ̂_n(X)_j − θ_{0,j})^2] admits an explicit upper bound.
Remark 3.2. Using the notation in (34), it is straightforward that the first three terms of the bound are O(1/√n). In addition, since γ_B and ω_B are O(1), the fourth and fifth terms of the bound are of order 1/n and 1/√n, respectively. Combining these results for the order of each of the terms, the order of the bound (36) is 1/√n.
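Although the Beta MLE has no closed form, it is easy to obtain numerically by maximising the log-likelihood, which is exactly the situation the bounds of this section address; the optimiser, starting point and parameter values below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(4)

# Simulate from Beta(alpha0, beta0) and maximise the log-likelihood numerically.
alpha0, beta0, n = 2.0, 5.0, 2000
X = rng.beta(alpha0, beta0, size=n)

def neg_loglik(theta):
    a, b = theta
    if a <= 0 or b <= 0:           # stay inside the open parameter space
        return np.inf
    return -(n * (gammaln(a + b) - gammaln(a) - gammaln(b))
             + (a - 1) * np.log(X).sum() + (b - 1) * np.log1p(-X).sum())

res = minimize(neg_loglik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)  # numerical MLE (alpha_hat, beta_hat), close to (2, 5)
```

A derivative-free method such as Nelder-Mead suffices here; the point of the corollary is that the bound itself never requires this optimisation step.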

Proofs of Theorems 2.2 and 3.1
In this section the complete steps of the proofs of the two main theorems of our paper are given.
The following lemma (a special case of Chebyshev's 'other' inequality) is useful for bounding conditional expectations, which can sometimes be difficult to derive. The proof is given in the Appendix.
Proof of Theorem 2.2. It has already been shown in the outline of the proof on p.6 that the triangle inequality yields the decomposition into the terms (11) + (12).
Step 3: The MSE test function. To this purpose, define the test function h as in (30). Denoting by B_h the bound in (57) for this choice of h, we get, using (31), a bound involving the MSE. For the calculation of B_h, note that s is as in (Con.1) and M as in (Con.3). For U := E[Σ_{j=1}^d (θ̂_n(X)_j − θ_{0,j})^2], the results in (27), (57) and (59) yield an inequality which, combined with (58) and (60), gives the quadratic inequality (61) in U, with γ as in (28). Solving (61) for the unknown U and using the condition (Con.3) on the sample size n, we obtain, for v and ω as in (28), the bound (62), proving the result of the theorem.
Since f(m) is an increasing function, applying this monotonicity to (63) and then performing a simple iteration over k gives the result of the lemma.
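The monotonicity result invoked above, Chebyshev's 'other' inequality (E[f(X)g(X)] ≥ E[f(X)]E[g(X)] when f and g are both increasing), can be sanity-checked by simulation; the functions f, g and the law of X below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Both f(x) = tanh(x) and g(x) = x^3 are increasing, so Cov(f(X), g(X)) >= 0,
# i.e. E[f(X) g(X)] >= E[f(X)] E[g(X)].
X = rng.normal(size=200_000)
f, g = np.tanh(X), X ** 3
lhs = (f * g).mean()
rhs = f.mean() * g.mean()
print(lhs, rhs)  # lhs exceeds rhs
```

The same comparison fails in general when one of the two functions is decreasing, which is why the monotonicity hypothesis in the lemma matters.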