Uniformly Valid Inference Based on the Lasso in Linear Mixed Models

Linear mixed models (LMMs) are suitable for clustered data and are common in biometrics, medicine, survey statistics and many other fields. In these applications, it is essential to carry out valid inference after selecting a subset of the available variables. We construct confidence sets for the fixed effects in Gaussian LMMs that are based on Lasso-type estimators. Aside from providing confidence regions, this also allows one to quantify the joint uncertainty of both variable selection and parameter estimation in the procedure. To show that the resulting confidence sets for the fixed effects are uniformly valid over the parameter spaces of both the regression coefficients and the covariance parameters, we also prove a novel result on uniform Cramér consistency of the restricted maximum likelihood (REML) estimators of the covariance parameters. The superiority of the constructed confidence sets over naive post-selection procedures is validated in simulations and illustrated with a study of the acid neutralization capacity of lakes in the United States.


Introduction
Linear mixed models (LMMs) are regression models that incorporate dependency structures that arise particularly in clustered data. They are widely applied in many empirical sciences, ranging from genetics [Henderson, 1950] to survey statistics [Pfeffermann, 2013] and many more. Comprehensive reviews are given by Demidenko [2004] and Pinheiro and Bates [2000]. The classical LMM can be written as

y_i = X_i β_0 + Z_i v_i + ε_i,

with observations y_i ∈ R^{n_i}, known fixed covariates X_i ∈ R^{n_i×p}, Z_i ∈ R^{n_i×q}, and v_i ∈ R^q, ε_i ∈ R^{n_i} being independent and each independently distributed for all clusters i = 1, …, m. The term X_i β_0 is referred to as the 'fixed effects', and Z_i v_i as the 'random effects'. While all observations share the same regression coefficients in the fixed effects, only the n_i observations in each cluster share the same realization of v_i in the random effects. The latter can be associated with subject effects in medical trials or group effects in longitudinal studies. Consistency of parameter estimators is ensured if the number of clusters m → ∞, whereas the sample size per cluster n_i may stay bounded or grow as well. The coefficients β_0 ∈ R^p and covariance parameters θ_0 ∈ Θ ⊂ R^r_{>0} are unknown and to be estimated. In practice, typically both model selection and estimation need to be performed. Two-stage methods for fitting regression models involve variable selection first, e.g. based on information criteria, and thereafter estimation of the parameters in the selected model. However, classical inferential theory does not consider the selection procedure as stochastic. Accounting for the additional uncertainty of variable selection is a difficult task that has recently received attention as post-selection inference [Berk et al., 2013]. A single-stage approach is given by the least absolute shrinkage and selection operator (Lasso) [Tibshirani, 1996], which performs selection and estimation simultaneously.
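To make the clustered structure concrete, the following sketch simulates data from a random-intercept special case of the classical LMM above; all dimensions and parameter values are illustrative choices of ours, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: m clusters of n_i observations each, p fixed effects;
# Z_i = 1_{n_i} gives one random intercept v_i shared within each cluster.
m, n_i, p = 20, 5, 3
beta0 = np.array([1.0, 0.0, -2.0])   # true fixed effects (second one inactive)
sigma_v, sigma_e = 2.0, 1.0          # covariance parameters theta_0

ys, Xs = [], []
for i in range(m):
    X_i = rng.normal(size=(n_i, p))
    v_i = rng.normal(scale=sigma_v)               # one draw per cluster
    eps_i = rng.normal(scale=sigma_e, size=n_i)
    ys.append(X_i @ beta0 + v_i + eps_i)          # y_i = X_i b0 + Z_i v_i + eps_i
    Xs.append(X_i)
y, X = np.concatenate(ys), np.vstack(Xs)          # stacked model, n = m * n_i
```

All n_i observations in cluster i share the same realization v_i, which induces the within-cluster dependence the model is designed for.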
In the context of mixed models, the Lasso can be applied to both the fixed and the random effects. For example, Ibrahim et al. [2011] and Bondell et al. [2010] penalize both β_0 and θ_0, whereas Peng and Lu [2012] penalize the random effects directly. A review of these methods is given by Müller et al. [2013]. Even when only the fixed effects are penalized, the joint estimation of β_0 and θ_0 in LMMs raises computational difficulties [Schelldorfer et al., 2011, Pan and Shang, 2019]. The Lasso's usefulness and wide application have sparked interest in how to construct confidence intervals based on it. The general difficulty is that its finite-sample and asymptotic distributions depend on the unknown coefficient parameters β_0 in an intricate manner [Pötscher and Leeb, 2009]. This is problematic, since honest confidence sets in the sense of Li [1989] must attain nominal coverage over the whole parameter space. For a low-dimensional framework ('p < n'), Ewald and Schneider [2018] propose limiting versions of the objective function in order to obtain uniformly valid confidence sets. This contribution builds on Ewald and Schneider's results and extends them to LMMs. The fundamental problem is that, since the covariance parameters θ_0 enter model (2) in a non-linear way, they cannot simply be concentrated out as in a standard linear model. For the same reason, the arguments that establish uniform coverage for the coefficient parameters do not apply to the covariance parameters. Instead, our key idea is that the estimation of θ_0 via restricted maximum likelihood (REML) in LMMs is separated from that of β_0; see Section 3. Thus, while it is shown that the results on Lasso-type penalized estimators for the fixed effects carry over to LMMs, the covariance parameters can be treated separately. Eventually, the main result establishes confidence sets that are uniformly valid over the space of coefficient and covariance parameters jointly. The rest of the article is structured as follows.
First, in Section 2 we specify the setting and regularity conditions and state an estimation procedure that avoids a non-convex optimization problem over both β_0 and θ_0. Next, in Section 3, the estimation of the covariance parameters θ_0 is discussed and their uniform consistency established. In Section 4, the main results are presented. Their usefulness and limitations, and, in particular, their superiority to a naïve approach based on least squares, are demonstrated in a simulation study in Section 5. A real data example is provided in Section 6. The article concludes with a discussion in Section 7.

Setting and Regularity Conditions
Consider a linear model with n dependent observations and model equation

y_i = X_i β_0 + Z_i v_i + ε_i,  v_i ∼ N_q(0_q, Ψ(θ_0)),  ε_i ∼ N_{n_i}(0_{n_i}, Ω_i(θ_0)),  i = 1, …, m. (1)

The covariance matrix V(·) ∈ R^{n×n} models the dependency amongst the observations. The vectors β_0 ∈ R^p as well as θ_0 ∈ Θ ⊂ R^r_{>0} are unknown and remain to be estimated. Model (1) can be represented as

y ∼ N_n(Xβ_0, V(θ_0)),  V(θ_0) = Σ_{k=1}^r θ_{0,k} H_k, (2)

with known symmetric matrices H_1, …, H_r ∈ R^{n×n}. The objective function for Lasso-penalized fixed effects and given tuning parameters λ_1, …, λ_p is

Q(β, θ) = ‖V(θ)^{-1/2}(y − Xβ)‖² + 2 Σ_{j=1}^p λ_j |β_j|,

where here and throughout the article the Frobenius norm is used, that is, for A ∈ R^{n×m} we have ‖A‖² = tr(A^t A). The joint minimization over both β and θ is a non-convex optimization problem [Schelldorfer et al., 2011]. In order to establish uniform coverage over both parameter spaces by adapting the approach of Ewald and Schneider [2018], we consider the Lasso estimator for β_0, for a given estimator θ̂ of θ_0, to be

β̂_L = argmin_{β ∈ R^p} Q(β, θ̂).

We aim to find a set M ⊂ R^p such that inf_{β_0, θ_0} P_{β_0, θ_0}{√n(β̂_L − β_0) ∈ M} ≈ 1 − α for some nominal level α ∈ (0, 1), which is attained uniformly over β_0 ∈ R^p and θ_0 ∈ Θ. We borrow and slightly adapt the notation of Ewald and Schneider [2018]. The key idea is, instead of treating β̂_L with its intractable distribution directly, to consider the surrogates û_d, Ĉ and ŵ, indexed by a sign vector d ∈ {−1, 1}^p; the last of these plays the role of adjusting the signs of the coefficients. Further, denote u_d, C and w analogously with θ̂ replaced by θ_0. As the distribution of û_d is not analytically available, the proofs exploit that u_d ∼ N_p(−C^{-1}Λd, C^{-1}), where Λ = diag(λ_1, …, λ_p). We only consider confidence sets of ellipsoidal shape, that is, M will be set to E(Ĉ, t) = {z ∈ R^p : z^t Ĉ z ≤ t}, t > 0. Their specific form is used in Lemma 3 to establish the minimal coverage probability over θ_0. The following regularity conditions are imposed.
(C) rk(X) = p < n and, for all θ_0 ∈ Θ, Ĉ → C pointwise for a finite and positive definite matrix C as n → ∞.
First, (A) imposes restrictions on θ_0. It encompasses the conditions of positive covariance parameters and non-degeneracy as introduced by Jiang [1996]. The ratio allows all components of θ_0 to be either arbitrarily small or arbitrarily large, but not both. Condition (B) ensures that the covariance matrix is not singular. Next, (C) ensures that C can be inverted and has bounded entries, which is the key advantage of the low-dimensional setting. (D) ensures that the normalizing sequence in Lemma 1 below is correctly specified in order to estimate θ_0 uniformly consistently. Finally, (E) relates to the estimator θ̂ of the covariance parameters. It encompasses the condition that H_k ∈ R^{n×n}, k = 1, …, r, are linearly independent, which corresponds to the condition of identifiability of the covariance parameters from Jiang [1996]. Since all involved matrices are known, the last two conditions are easy to verify.

Example. Consider a linear mixed model (1) with X_i = Z_i = 1_{n_i}, Ψ(θ_0) = σ_v² and Ω_i(θ_0) = σ_e² I_{n_i}, so that V(θ_0) = σ_v² H_1 + σ_e² H_2, with H_1 = blockdiag(1_{n_1} 1_{n_1}^t, …, 1_{n_m} 1_{n_m}^t), m > 1, H_2 = I_n and θ_0 = (σ_v², σ_e²). It is well known that the ratio of the entries of θ_0 is influential for its estimation [Kramlinger et al., 2020]. In particular, in the considered LMM the ratio γ_i = σ_v² / (σ_v² + σ_e²/n_i) measures the class variability relative to the total variability. If γ_i ≈ 0, σ_v² cannot be estimated reliably. Condition (A) translates into a positive lower bound for γ_i. Further, H_1 ≥ 0 and H_2 > 0, which corresponds to condition (B). Condition (C) is trivially satisfied. Next, (D) and (E) can be verified directly, since all quantities involved are known. For e.g. c = 1 and n_1 = … = n_m = 2, this gives ω = 1/3 and ω̄ = 2/3, for which e.g. b = 1/5.
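The example's variance structure can be checked numerically. The sketch below builds H_1 and H_2 for three balanced clusters of size two, verifies the positivity requirements of condition (B), and evaluates the ratio γ_i = σ_v²/(σ_v² + σ_e²/n_i); the variance values are made up for illustration.

```python
import numpy as np

def gamma(s_v2, s_e2, n_i):
    """Class variability relative to total variability in cluster i."""
    return s_v2 / (s_v2 + s_e2 / n_i)

# H1 = blockdiag(1 1^t) for m = 3 clusters of size n_i = 2; H2 = identity.
H1 = np.kron(np.eye(3), np.ones((2, 2)))
H2 = np.eye(6)

s_v2, s_e2 = 1.0, 1.0                 # made-up entries of theta_0
V = s_v2 * H1 + s_e2 * H2             # V(theta_0) from the example

# Condition (B): H1 positive semi-definite, H2 positive definite.
assert np.linalg.eigvalsh(H1).min() >= -1e-12
assert np.linalg.eigvalsh(H2).min() > 0
```

With σ_v² = σ_e² = 1 and n_i = 2, the ratio is γ_i = 1/(1 + 1/2) = 2/3, comfortably away from zero, so the random-intercept variance is well identified here.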
The preceding example shows that condition (D) includes cases in which the n_i are bounded as well as cases in which they are unbounded, i = 1, …, m. This has implications for the rate at which θ_0, and in turn β_0, is estimated. For instance, the rate at which σ_v² is estimated via REML is governed by m, rather than by the total number of observations n = Σ_{i=1}^m n_i. Clearly, if all n_i, i = 1, …, m, are bounded, then m = O(n). Since (D) holds for both asymptotic scenarios and for balanced (n_1 = … = n_m) as well as unbalanced (n_i ≠ n_j) LMMs, the condition is sufficiently general for most applications. The reasoning in this contribution, however, is not limited to LMMs and allows for different dependency structures in V(θ_0) from (2), as long as θ_0 can be estimated with restricted maximum likelihood and the regularity conditions (A)-(E) are satisfied.
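Given an estimate θ̂, the criterion Q(β, θ̂) reduces, after whitening with V(θ̂)^{-1/2}, to an ordinary weighted Lasso. The following proximal-gradient sketch illustrates this two-step structure; it is our own minimal solver under that whitening interpretation, not the authors' implementation.

```python
import numpy as np

def lasso_given_theta(y, X, V, lam, n_iter=20000):
    """Minimize ||V^{-1/2}(y - X b)||^2 + 2 * sum_j lam_j * |b_j| by whitening
    followed by proximal gradient descent (ISTA). Equivalent, up to a factor
    of 2, to 0.5 * ||X~ b - y~||^2 + sum_j lam_j * |b_j| on whitened data."""
    w, U = np.linalg.eigh(V)                      # V symmetric positive definite
    W = U @ np.diag(w ** -0.5) @ U.T              # V^{-1/2}
    yt, Xt = W @ y, W @ X
    lam = np.broadcast_to(np.asarray(lam, float), (X.shape[1],))
    step = 1.0 / np.linalg.norm(Xt, 2) ** 2       # 1/L, L = largest eigenvalue of Xt'Xt
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * Xt.T @ (Xt @ b - yt)       # gradient step on the quadratic part
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.0]) + rng.normal(size=30)
b_wls = lasso_given_theta(y, X, np.eye(30), 0.0)   # no penalty: plain (W)LS fit
b_null = lasso_given_theta(y, X, np.eye(30), 1e6)  # huge penalty: all coefficients zero
```

With λ = 0 the routine recovers the weighted least-squares solution, and an overwhelming penalty shrinks every coefficient to zero, mirroring the selection behaviour discussed in the text.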

Estimation of the Covariance Parameters
The key idea in deriving uniformly valid confidence sets in LMMs is that the estimation of β_0 and θ_0 is separated. The virtue of restricted maximum likelihood (REML) lies in estimating the covariance parameters while accounting for the loss of degrees of freedom from the estimation of β_0 [Searle et al., 1992]. Instead of treating y directly, only the projection of the data onto the space orthogonal to the columns of X is considered. Then, the REML estimator for θ_0, with P(θ) = P(θ, 1), is a maximizer of the restricted log-likelihood R(θ) in (6); see LaMotte [2007]. By construction, R is constant with respect to β_0, as P(θ)Xβ_0 = 0_n. As interest lies in valid inference, uniform consistency of the REML estimator is required. Consistency of ML estimators was first shown by Wald [1949], and uniform MLE consistency by Moran [1970]. Both required the parameter space to be compact, as well as independently drawn observations. A consistency result that omits those assumptions was first given by Weiss [1971, 1973]. Miller [1977] applied this to MLEs in LMMs, while Das [1979] and Cressie and Lahiri [1993] did so for REML estimators. Jiang [1996] considered the REML estimator under non-normal LMMs and for unbounded p.
Another difference between the two approaches is that the former, which assumes a compact parameter space, shows that the REML estimator is the unique maximizer of (6). The latter approach only shows that there exists a sequence of local maximizers that is consistent for θ_0, but this sequence does not have to be unique. This is referred to as Cramér-type consistency. In general, for unbalanced panels in LMMs, that is, n_i ≠ n_j for some i ≠ j in (1), it cannot be guaranteed that a maximizer of (6) is unique [Jiang, 1997]. Moreover, none of the latter authors considering REML estimation explicitly investigated uniform consistency over Θ.
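The invariance of the restricted likelihood in β_0 noted above is easy to verify numerically. Below is a sketch of the standard REML criterion for V(θ) = Σ_k θ_k H_k, up to an additive constant; the function name and the toy setup are ours.

```python
import numpy as np

def restricted_loglik(theta, y, X, H):
    """REML criterion -0.5 * (log|V| + log|X' V^-1 X| + y' P y), where
    P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1 and V = sum_k theta_k H_k."""
    V = sum(t * Hk for t, Hk in zip(theta, H))
    Vinv = np.linalg.inv(V)
    XtVX = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(XtVX, X.T @ Vinv)   # P(theta), kills X beta
    return -0.5 * (np.linalg.slogdet(V)[1]
                   + np.linalg.slogdet(XtVX)[1]
                   + y @ P @ y)

# Random-intercept toy example: V = s_v^2 H1 + s_e^2 H2, m = 3 clusters of 2.
rng = np.random.default_rng(0)
H = [np.kron(np.eye(3), np.ones((2, 2))), np.eye(6)]
X = np.column_stack([np.ones(6), rng.normal(size=6)])
y = rng.normal(size=6)
r1 = restricted_loglik([1.0, 1.0], y, X, H)
r2 = restricted_loglik([1.0, 1.0], y + X @ np.array([5.0, -3.0]), X, H)
# r1 == r2 up to rounding: shifting y by any X beta leaves R unchanged,
# since P(theta) X = 0.
```

The equality of r1 and r2 is exactly the statement that R is constant with respect to β_0, which is what allows the covariance parameters to be estimated separately from the fixed effects.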

Theorem 1. Let model (2) and conditions (A)-(E) hold, and let ν_k(θ_0), k = 1, …, r, denote the normalizing sequences from (15). Then, there exists a sequence θ̂ of local maximizers of (6) that is consistent for θ_0 at the rate governed by ν_k(θ_0), uniformly over θ_0 ∈ Θ.
Here, the notation Z_n = O_P(a) for a sequence of random variables Z_n, n ∈ N, means that for all ε > 0 there exist constants M_ε, N_ε such that for all n > N_ε: P(|Z_n| > M_ε a) < ε. Regarding the normalizing sequence ν_k(θ_0), see (15) for details. The proof of Theorem 1 follows the lines of Moran [1970] and Weiss [1971]. We extend their reasoning to account for a uniformity argument over Θ by construction of the normalizing sequence ν_k(θ_0). If θ_0 were fixed, the normalizing sequence would reduce to rk(H_k), which corresponds to the classical case of √n in linear models. However, the specific choice of ν_k(θ_0) allows us to prove uniform consistency with the help of the auxiliary result given in Lemma 1.

Main Results
For linear models it is known that the infimum coverage for Lasso-based confidence sets over the space of coefficient parameters can be expressed in terms of the limiting distribution, depending only on the sign of the entries of β 0 . This finding carries over to infimum coverage over the space of coefficient and covariance parameters, as the latter, when estimated with REML, do not depend on the fixed effects.
Proposition 1. Let model (2) and (A)-(E) hold. Then there exists a sequence θ̂ of local maximizers of (6) such that the infimum coverage probability over β_0 reduces to a minimum over the sign vectors d ∈ {−1, 1}^p. The proof follows from Ewald and Schneider [2018, Theorem 1], as θ̂ is independent of β_0 by construction. Instead of evaluating the infimum over β_0 directly, the result states that a much simpler minimization over a discrete set suffices to evaluate the probability. In the context of LMMs, the key consequence of Proposition 1 is that, due to the orthogonality of θ̂ and β_0, the uniform coverage probability can be evaluated separately for β_0 and θ_0. For the latter, Theorem 1 allows us to replace θ̂ by θ_0 at the cost of an error term with a tractable dependency on θ_0. Eventually, this leads to the following result.
Theorem 2. Let model (2) and (A)-(E) hold. Then, there exists a sequence θ̂ of local maximizers of (6) such that the confidence set M = E(Ĉ, t) attains coverage 1 − α up to a uniformly vanishing error term. The result differs from Proposition 1 in two aspects. First, the minimization over d ∈ {−1, 1}^p has been shifted to the choice of the non-centrality parameter of the χ²-distribution, based on Propositions 4 and 5 from Ewald and Schneider [2018]. Second, the estimation of θ_0 only contributes an error term of uniformly vanishing, parametric order m^{-1/2}. The rate is given by the m observations that contribute to the estimation of θ_{0,k}, k = argmin_k rk(H_k). The critical finding is that the error terms within the probability in Theorem 2, which depend on θ_0, eventually result in an error term uniformly independent of θ_0. As a consequence of this result, the Lasso-based confidence ellipsoid attains nominal coverage uniformly up to an error of parametric rate. By separating the estimation of β_0, with penalization, from that of θ_0, without, the result allows inference about the fixed effects for data with complex dependency structures. From the orthogonality argument it is clear that the penalization of β_0 does not interfere with inference based on θ̂, for which existing theory can be applied. Since the REML estimates themselves do not follow an analytically tractable distribution, Theorem 2 states that M is only valid asymptotically. Although this implies that the resulting testing procedure is not of nominal level 1 − α, as discussed in Leeb and Pötscher [2017], the discussion in Section 5 suggests that this error term has little influence in finite samples. Most importantly, however, the error term depends on neither β_0 nor θ_0, and the stated coverage probability thus holds uniformly over both the space of regression coefficients and the space of covariance parameters. When compared to standard confidence sets based on estimation procedures without variable selection, it is well known that M is larger [Pötscher, 2009].
If smaller models are preferred, however, β̂_L and inference based upon it offer an alternative, despite the Lasso's difficult distributional properties. In particular, it is clear from Lemma 1 that the coverage probability of M heavily depends on the signs of the entries of β_0. This dependency and the behaviour of M over the space of coefficient parameters are illustrated in the following section.
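In practice, the confidence set is used by checking whether √n(β̂_L − β_0) falls inside the ellipsoid E(Ĉ, t), where the radius t is a quantile of a (possibly non-central) χ²_p law. A minimal sketch follows; the non-centrality value is a user input determined by Theorem 2 and is not computed here.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def radius(p, alpha, nc=0.0):
    """(1 - alpha) quantile of chi^2_p, non-central if nc > 0."""
    return ncx2.ppf(1 - alpha, df=p, nc=nc) if nc > 0 else chi2.ppf(1 - alpha, df=p)

def covers(beta0, beta_hat, C_hat, n, t):
    """Is beta0 inside {b : n * (bh - b)' C_hat (bh - b) <= t}?"""
    z = np.sqrt(n) * (np.asarray(beta_hat) - np.asarray(beta0))
    return float(z @ C_hat @ z) <= t

t95 = radius(p=2, alpha=0.05)   # central chi^2_2 quantile, about 5.99
```

For a point far from the estimate, the quadratic form exceeds the radius and the point is excluded; the empirical coverage of such sets is what Section 5 examines.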

Simulations
The derived confidence sets are uniformly valid over the whole space of regression coefficients and covariance parameters. For the former, however, the specific coverage probability depends on the signs of the entries of β_0. As a consequence of Lemma 4, accurate coverage up to an error term is attained in two orthants only, whereas overcoverage up to an error term is exhibited in all other orthants. In order to visualize this effect, we use the following simulation design for two coefficient parameters. Consider the 'random intercept model', a special case of model (1). We fix the parameters to m = 20, n = 400, n_i = 20, σ_u = σ_v = 4. For visualization purposes, we restrict the simulation to p = 2 parameters. The tuning parameters are chosen as λ_i = n^{1/2}/2 for i = 1, 2, which corresponds to a conservative tuning regime. The entries of the matrix of covariates are independently drawn from N(0, 4), so that the fixed and random effects are of comparable magnitude. Hence, for β_0 ∈ [−2, 2] × [−2, 2], the empirical coverage probability is computed by checking whether β_0 ∈ M, for M with α = 0.05 from (7). For each β_0, 3,000 simulations were carried out. Figure 1 shows the results. The probabilities shown are average empirical coverages for all configurations of β_0.

Figure 1: Empirical coverage over β_0 ∈ [−2, 2]² for four procedures: β̂_WLS original (bottom left), naïve WLS set for β̂_WLS after AIC selection (bottom right), β̂_L as in (7) (upper left), and β̂_L with a WLS-set (upper right). The confidence sets based on (7) or the WLS estimator (left column) achieve nominal level over the whole parameter space (yellow); the former is conservative (green and white) around the origin and axes. Naïvely applying WLS sets to the Lasso or to the WLS estimator after AIC variable selection (right column) yields undercoverage (dark).

First consider the classical confidence sets based on the weighted least-squares (WLS) estimator β̂_WLS
, with no variable selection involved. As the distribution of β̂_WLS − β_0 is independent of β_0, the coverage of its confidence set is attained uniformly over the space of coefficient parameters. One finds that no deviations from the nominal level of 95% (yellow) can be observed in the simulation. Next, consider the confidence set based on the Lasso as given in (7) (upper left). Nominal coverage is only attained up to a small error in two orthants, namely for sign(β_1) = sign(β_2). The other orthants exhibit a slight overcoverage (green), whereas a significant overcoverage (white) occurs at the axes and around the origin. The latter effects arise in the event of variable selection, when a component of β̂_L is set to zero.
Hence, at the axes, a coverage close to 1 is achieved, and the 95%-confidence sets prove to be too wide. These findings are in line with the example of the linear regression model from Ewald and Schneider [2018, Fig. 4]. Although additional uncertainty is present due to the estimation of the random effects, the additional error term does not appear too influential. Hence, although Theorem 2 postulates that nominal coverage is only achieved up to an additional term of vanishing order, the experiment indicates that the confidence sets still prove to be adequately close to the nominal level, with coverages of 94.3% and 95.8% in the two orthants.
In contrast to the methods in the left column of Figure 1, which meet the nominal level either exactly but without variable selection (bottom left), or conservatively but with selection (upper left), two additional naïve approaches are displayed in the right column. First, one observes that naïvely applying classical WLS confidence sets around a WLS estimator β̂_WLS after performing variable selection with AIC (bottom right) yields inconsistent coverage over the parameter space. Type-I-error inflation [Berk et al., 2013] occurs in regions with undercoverage (purple, dark) in the event of variable selection at the axes. Another approach is applying a WLS confidence set around the estimator β̂_L (upper right). Again, these sets are not theoretically justified and, indeed, overcoverage occurs at the origin, whereas severe undercoverage occurs over the rest of the coefficient parameter space. Both naïve approaches thus yield misleading confidence sets, and their use is inadvisable.
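The uniform 95% coverage of the WLS baseline (bottom left panel of Figure 1) can be reproduced in miniature. The sketch below simulates the random-intercept model with known θ_0, computes β̂_WLS via the Woodbury form of V_i^{-1}, and checks the exact χ²_2 ellipsoid; this is a scaled-down design of ours, not the paper's full grid over β_0.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_i, p = 20, 20, 2
sigma_v, sigma_e = 4.0, 4.0
beta0 = np.array([1.0, -1.0])
chi2_95 = 5.991                          # 95% quantile of chi^2_2

def wls_stat(y, X):
    """(bh - b0)' (X'V^-1 X) (bh - b0) using the Woodbury identity
    V_i^-1 = sigma_e^-2 * (I - c * 1 1') per cluster."""
    c = sigma_v**2 / (sigma_e**2 + n_i * sigma_v**2)
    A, b = np.zeros((p, p)), np.zeros(p)
    for i in range(m):
        Xi, yi = X[i*n_i:(i+1)*n_i], y[i*n_i:(i+1)*n_i]
        WXi = (Xi - c * np.tile(Xi.sum(0), (n_i, 1))) / sigma_e**2   # V_i^-1 X_i
        A += Xi.T @ WXi                   # accumulates X' V^-1 X
        b += WXi.T @ yi                   # accumulates X' V^-1 y
    bh = np.linalg.solve(A, b)
    d = bh - beta0
    return d @ A @ d                      # exactly chi^2_2 under the model

hits, reps = 0, 400
for _ in range(reps):
    X = rng.normal(scale=2.0, size=(m * n_i, p))
    v = np.repeat(rng.normal(scale=sigma_v, size=m), n_i)
    y = X @ beta0 + v + rng.normal(scale=sigma_e, size=m * n_i)
    hits += wls_stat(y, X) <= chi2_95
coverage = hits / reps                    # should be close to 0.95
```

Because no selection is involved and θ_0 is known here, the quadratic form is exactly χ²_2-distributed and the empirical coverage sits at the nominal level, independently of β_0.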

Application
We demonstrate the performance of the confidence sets based on the Lasso and on WLS with no variable selection on a data set on the ecological status of lakes in the north-eastern United States, collected by the US Environmental Protection Agency [1996]. The data set has been thoroughly investigated in previous studies [Opsomer et al., 2009, Breidt et al., 2007]. The model of interest is given by

y = 1_n β_1 + x β_2 + Zv + Du + ε,

with v ∼ N(0_m, σ_v² I_m), u ∼ N(0_K, σ_u² I_K) and ε ∼ N(0_n, σ_e² I_n).
The response variable of interest is the acid neutralizing capacity, which serves as an indicator of the environmental state of a lake. It is related to a design matrix which includes an intercept and the respective elevation. In total, the n = 551 lakes are partitioned via m = 19 hydrological unit codes with 1 ≤ n_i ≤ 13, which enter the model as the random intercept Zv. Eventually, a penalized spline with K = 80 knots accounts for spatial proximity effects. It is included as a second random effect Du, where D is a transformed radial basis as described by Opsomer et al. [2009]. The penalization parameters for the Lasso are set to λ_1 = λ_2 = 0.0235 by 10-fold cross-validation obtained with the R function glmnet. The fitted model does not indicate any deviation from the regularity conditions (A)-(E). In particular, max(θ̂)/min(θ̂) ≈ 3.8. The estimated fixed effects are given as β̂_L = (0, −1.12)^t and β̂_WLS = (483, −1.20)^t.
In previous studies, the intercept has been deemed insignificant [Opsomer et al., 2009]. Indeed, the estimated standard deviations are given by s(β̂_WLS,1) = 1718 and s(β̂_WLS,2) = 0.14. Consequently, the Lasso estimator sets the intercept equal to zero. This corresponds to a selection event on the axes in Figure 1 (upper left), which suffer from severe overcoverage. The Lasso-based set M_L is given by M from (7), with τ = χ²_{2,0.99}(1634) ≈ 1829. The WLS-based set M_WLS is given as M, but with β̂_L replaced by β̂_WLS and τ by χ²_{2,0.99} ≈ 9. A 99%-confidence set constructed as in Theorem 2 thus gives that the Lasso-based confidence ellipse is about 200 times larger than the WLS-based confidence ellipse. This does not render the former useless, however: much of their additional volume is distributed along the axis of the non-included intercept. The additional uncertainty induced by the Lasso in the case of no selection event can be assessed by performing inference on β_2 only. This gives confidence intervals for the elevation coefficient, rounded to the second decimal place. Both sets are constructed similarly. While for M_L the corresponding quantile is from the non-central χ²_1-distribution with a non-centrality parameter close to zero, M_WLS is based upon a central χ²_1-quantile. The respective interval lengths are therefore almost equal, with M_L being narrowly wider by the sixth decimal place. However, the two intervals differ in location. The Lasso-based M_L is centered around β̂_L,2 and thus covers a different part of the real line than the WLS-based M_WLS. Consequently, although both intervals are of almost equal length, inference based on the Lasso indeed differs from WLS-based inference, even in the case of no selection event. Altogether, this case example confirms that if interest lies in narrow confidence sets, WLS sets are smaller and should be employed. Simultaneously performing variable selection with the Lasso always induces additional uncertainty.
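The factor of about 200 between the two 99% ellipses follows from the two radii alone: for ellipsoids {z : zᵀĈz ≤ t} with the same shape matrix Ĉ, the volume scales as t^{p/2}, which for p = 2 is linear in t. A quick check with the quantiles from the text:

```python
from scipy.stats import chi2, ncx2

tau_L = ncx2.ppf(0.99, df=2, nc=1634)      # about 1829, as in the text
tau_WLS = chi2.ppf(0.99, df=2)             # about 9.21
area_ratio = (tau_L / tau_WLS) ** (2 / 2)  # p = 2: ratio of ellipse areas
```

The resulting area_ratio is roughly 200, matching the statement that the Lasso-based ellipse is about 200 times larger than the WLS-based one.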
Nevertheless, confidence sets based on the Lasso are not always much larger than their WLS counterparts. When smaller models are preferable, those sets can be useful for assessing uncertainty and hypothesis testing.

Discussion
This contribution presents a solution for inference in a low-dimensional LMM in which the fixed effects are estimated with a Lasso penalization. Our aim is to construct uniformly valid confidence sets for the fixed effects. We suggest a two-stage estimation procedure, in which the covariance parameters are estimated via the REML estimator θ̂ first, and the coefficient parameters β_0, with θ̂ plugged in, second. This design simplifies the minimization of Q(β, θ) by avoiding a non-convex optimization problem. Moreover, it allows us to treat coefficient and covariance parameters separately. Using this approach, we decisively simplify the verification of the theoretical properties of the parameter estimators. Eventually, it can be shown that the resulting confidence sets are uniformly valid over both the coefficient and the covariance parameter spaces. It is well known that standard WLS-based confidence sets without model selection will always be smaller than post-selection confidence sets [Pötscher, 2009]. However, selecting parsimonious models is often a key concern for practitioners.
Although the resulting post-selection confidence sets include additional uncertainty due to variable selection, Sections 5 and 6 indicate that the derived confidence sets are useful. To the best of our knowledge, this work is the first to consider inference based on the Lasso under an elaborate covariance structure, as given in LMMs. We expect that this approach can serve as a basis for proper inference for estimation procedures that penalize both fixed and random effects.

Proofs
To prove Theorem 1, we first derive the following lemma.
Lemma 1. Let conditions (A)-(E) hold, let m = min{rk(H_1), …, rk(H_r)} and let K be a finite set drawn with replacement from {1, …, r} with |K| > 1. Then, for any θ ∈ Θ, the stated trace bound holds. The result is shown using the fact that the eigenvalues of the matrices of interest are smaller than or equal to one, and by subsequently employing regularity condition (D).

Proof. Consider the symmetric and positive semi-definite matrices A_k and B, so that tr{∏_{k∈K} P(θ)H_k} = tr{∏_{k∈K} A_k B}. Since B is a projection matrix with eigenvalues zero or one, proceeding as in the proof of Theorem 1.1 (Hölder inequality) of Manjegani [2007] gives a bound in terms of tr(A_k^{|K|}), which we evaluate next. First note that conditions (A) and (D) imply a uniform eigenvalue bound. For k = r, let η_i = η_i(H_r^{-1} Σ_{l=1}^{r−1} (θ_l/θ_r) H_l) ≥ 0, so that the above inequality yields the corresponding bound; argue similarly for k = 1, …, r − 1. Altogether, for k = 1, …, r, this gives a bound which, plugged into equation (11), yields the claim.
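A related classical fact invoked below, the reverse trace inequality of Horn and Johnson, states that tr(Z)·tr(Z^{-1}) ≥ n² for any positive definite Z ∈ R^{n×n}, with equality for multiples of the identity; it follows from Cauchy-Schwarz applied to the eigenvalues and is easy to confirm numerically.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 6
A = rng.normal(size=(n, n))
Z = A @ A.T + np.eye(n)                  # positive definite by construction
# Cauchy-Schwarz on eigenvalues l_1, ..., l_n:
# n^2 = (sum sqrt(l_i) * 1/sqrt(l_i))^2 <= (sum l_i) * (sum 1/l_i)
lhs = np.trace(Z) * np.trace(np.linalg.inv(Z))
identity_case = np.trace(np.eye(n)) * np.trace(np.linalg.inv(np.eye(n)))
```

For the random Z above the inequality is strict, while the identity matrix attains equality, illustrating both regimes.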
Note that under the conditions of Lemma 1 but with |K| = 1, (11) implies a corresponding bound. Before turning to the proof of Theorem 1, we write from now on Q_i(θ) = P(θ)H_i, Q_ij(θ) = ∏_{l∈{i,j}} P(θ)H_l and Q_ijk(θ) = ∏_{l∈{i,j,k}} P(θ)H_l to facilitate notation. In particular, Lemma 1 implies corresponding bounds for these quantities. It is straightforward to adapt Lemma 1 to the case in which P(θ) is replaced by V(θ)^{-1}:

Corollary 1. Let model (2) and (A)-(E) hold and let K be a set drawn with replacement from {1, …, r} with |K| > 1. Then, for θ ∈ Θ, the analogous bound holds.

The above results are related to the reverse inequality of Horn and Johnson [2013, p. 232, Problem 4.1.P13], namely that for a positive definite matrix Z ∈ R^{n×n}, n ∈ N, it holds that tr(Z) tr(Z^{-1}) ≥ n².

Proof of Theorem 1. This proof is adapted from Lemma 7.2 (i) of Jiang [1996] to account for uniformity with respect to θ_0, and it closely follows the lines of Weiss [1971]. The four boundedness properties (i)-(iv) below on the derivatives of the log-likelihood are surrogates for the boundedness conditions imposed on the log-likelihood directly by Wald [1949] and Moran [1970]. By a Taylor expansion, the conditions below include boundedness of the log-likelihood as well. They are further required to omit the compactness condition on Θ, by constructing a compact set Θ_q; see the details below. For all i, j = 1, …, r, properties (i)-(iv) hold for any θ_0, where q = m^{1/(6+δ)} for some δ > 0 and m = min_{k=1,…,r} rk(H_k). For readability, we suppress the dependency on θ when the argument is clear from the context. Now, (i)-(iv) are shown.
(i) Note that ∂P/∂θ_i = −P H_i P and P V P = P. The claim now follows after taking the expectation.
(ii) As E(∂R/∂θ_i | θ_0) = 0 by (i), Chebyshev's inequality can be applied uniformly and gives that for any ε > 0 there exists k_ε > 0 such that the desired bound holds, where the last equality follows from the variance expression in (15).
(iii) By a similar argument, the corresponding bound holds for a ∈ R^r with a^t a = 1 an eigenvector corresponding to η_r{J(θ_0)}.

(iv) First, note that for any θ ∈ Θ_q a Taylor expansion around θ_0 applies, where the second equality is due to Corollary 1. Now, by (19), it follows for θ ∈ Θ_q, with Lemma 1, that the bound holds for m = min_{k=1,…,r} rk(H_k) and all θ ∈ Θ_q. Similarly, now dropping the argument of all quantities depending on θ ∈ Θ_q for readability, i.e. ν_i = ν_i(θ), an analogous bound follows. Putting the previous two results together and proceeding as in (ii) with Chebyshev's inequality gives that, for any θ_0, the bound holds for θ ∈ Θ_q. In order to prove (iv), this must hold with θ_0 on the right-hand side, which follows from a Taylor expansion of tr{Q_ij(θ)} around tr{Q_ij(θ_0)}. The second part of the proof mimics the reasoning of Weiss [1971].
We will show that there exists a θ̂ such that R attains a local maximum close to θ_0. By (ii) and (iii), the last term in the corresponding expansion is asymptotically negligible. This is shown by proving that a maximum of (6) is attained in a region close to θ_0. Now consider the set Θ_{ε_n} = {θ : ν_i(θ_0)|θ_i − θ_{0,i}| < ε_n, i = 1, …, r} with its boundary ∂Θ_{ε_n}, and let Θ̄_{ε_n} = Θ_{ε_n} ∪ ∂Θ_{ε_n}. For any θ ∈ Θ̄_{ε_n} with ε_n → 0, for all ε > 0 there exist a constant C and an N_ε such that the required inclusion holds for all n ≥ N_ε. Together, for any θ ∈ Θ̄_{ε_n} and any ε > 0 there exists N ∈ N such that for all n ≥ N: inf_{θ_0 ∈ Θ} P_{θ_0}(Θ̄_{ε_n} ⊆ Θ_q) → 1. Second-to-last, consider the remainder bounded via (iii). Note that δ(ε_n) is not stochastic, increasing in ε_n, and δ(0) = 0. Further, by (iv), 2 sup_{θ ∈ Θ_q} |R(θ, θ)| = O_P(q³/√m), i.e. for all ε > 0 there exist a constant C and an N_ε such that for all n ≥ N_ε: 1 − ε ≤ inf_{θ_0 ∈ Θ} P_{θ_0}{2 sup_{θ ∈ Θ_q} |R(θ, θ)| < C q³/√m}. Now let ε_n → 0 slowly enough that there exists N > N_ε such that for all n > N: ε_n² r c/2 > C q³/√m. Finally, this implies, for θ ∈ Θ_q and by (20), that with probability going to one, R attains a local maximum in Θ_{ε_n}. Denote by θ̂ a local maximum of R in Θ_{ε_n} if one exists, and set θ̂ ∈ Θ̄_{ε_n} otherwise. Then, the above statement gives that for all ε > 0 there exists N_ε such that for all n ≥ N_ε: inf_{θ_0 ∈ Θ} P_{θ_0}(θ̂ ∈ Θ_{ε_n}) ≥ 1 − ε. Now, for any constant C > 0 and any ε > 0, let N be such that C > N and N > N_ε. In order to address the infimum over Θ, we use the following result.
Lemma 2. For n ∈ N, let (X_n) and (Y_n) be sequences of real-valued continuous random variables with finite variances, with conditional density p_{X_n|Y_n=y}(x) = O(1) uniformly bounded in x and y, and with X_n = O_P(a_n) and Y_n = O_P(a_n b_n) for sequences (a_n), (b_n) ∈ R with b_n = o(1). Then, as n → ∞,

P(X_n + Y_n ≤ a_n) = P(X_n ≤ a_n) + O(b_n).
The asymptotic statement is clear, as convergence in probability implies convergence in distribution; the result above further specifies the rate of convergence.
Proof of Lemma 2. First, let φ(s, t) = P(X_n + s ≤ a_n | Y_n = t) and consider a Taylor expansion of f(x) = P(X_n/a_n + x ≤ 1 | Y_n = t) around x = 0, which gives φ(s, t) = φ(0, t) + O(s/a_n). This implies

∫_{a_n − t}^{a_n} p_{X_n|Y_n=t}(z) dz = φ(0, t) − φ(t, t) = O(t/a_n).

Using the convolution formula, we obtain

P(X_n + Y_n ≤ a_n) = ∫_{−∞}^{a_n} p_{X_n+Y_n}(z) dz,

which gives the claim.
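Lemma 2 can be illustrated by simulation: adding a perturbation Y_n of order a_n·b_n moves P(X_n + Y_n ≤ a_n) away from P(X_n ≤ a_n) by at most order b_n. The Gaussian toy variables below are our own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
reps, a_n, b_n = 200_000, 1.0, 0.05
X = a_n * rng.normal(size=reps)          # X_n = O_P(a_n)
Y = a_n * b_n * rng.normal(size=reps)    # Y_n = O_P(a_n * b_n)
p_sum = np.mean(X + Y <= a_n)
p_x = np.mean(X <= a_n)
gap = abs(p_sum - p_x)                   # should be O(b_n)
```

Here p_x is close to Φ(1) ≈ 0.84, and the gap between the two probabilities is far below b_n, consistent with the O(b_n) rate stated in the lemma.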
The next two results are helpful to prove Theorem 2.