Critical dimension in profile semiparametric estimation

This paper revisits the classical inference results for profile quasi maximum likelihood estimators (profile MLE) in semiparametric estimation. We mainly focus on two prominent theorems, the Wilks phenomenon and the Fisher expansion for the profile MLE, which are stated in a new fashion allowing finite samples and model misspecification. The method of study also differs essentially from the usual analysis of the semiparametric problem based on the notion of the hardest parametric submodel. Instead, we derive finite sample deviation bounds for the linear approximation error of the gradient of the log-likelihood. This novel approach in particular allows us to address the important issue of the effective target and nuisance dimension. The obtained nonasymptotic results are surprisingly sharp and yield the classical asymptotic statements, including the asymptotic normality and efficiency of the profile MLE. The general results are specialized to the important case of an i.i.d. sample, and the analysis is exemplified with a single-index model.


Introduction
Many statistical tasks can be viewed as problems of semiparametric estimation: the unknown data distribution is described by a high or infinite dimensional parameter, while the target is of low dimension. Typical examples are functional estimation, estimation of a function at a point, or simply estimating a given subvector of the parameter vector. The classical statistical theory provides a general solution to this problem: estimate the full parameter vector by the maximum likelihood method and project the obtained estimate onto the target subspace. This approach is known as profile maximum likelihood, and it is semiparametrically efficient under some mild regularity conditions. We refer to the papers Van der Vaart (2000, 1999) and the book Kosorok (2005) for a detailed presentation of the modern state of the theory and further references. The famous Wilks result claims that the likelihood ratio test statistic in the semiparametric testing problem is nearly chi-square with p degrees of freedom, where p is the dimension of the target parameter. Various extensions of this result can be found e.g. in Fan et al. (2001); Fan and Huang (2005); Boucheron and Massart (2011); see also the references therein.
This study revisits the problem of profile semiparametric estimation and addresses some new issues. The most important difference between our approach and the classical theory is the nonasymptotic character of our study. A finite sample analysis is particularly challenging because most of the notions, methods and tools of the classical theory are formulated in the asymptotic setup with growing sample size. Only a few general finite sample results are available; see e.g. the recent paper Boucheron and Massart (2011).
The results of this paper explicitly describe all "small" terms in the expansion of the log-likelihood. This helps to treat carefully the question of the applicability of the approach in different situations. A particularly important question concerns the critical dimension: for which target dimension p and full parameter dimension p* do the main results remain accurate? Another issue addressed in this paper is model misspecification. In many practical problems it is unrealistic to expect that the model assumptions are exactly fulfilled, even if rich nonparametric models are used. This means that the true data distribution IP does not belong to the considered parametric family, and the applicability of the general semiparametric theory in such cases is questionable. An important feature of the new approach of Spokoiny (2012) is that it applies equally well under a possible model misspecification.
The mentioned issues, especially the nonasymptotic character of the study, dictate an essential change of the tools and methods of analysis. We apply the recent bracketing approach of Spokoiny (2012) and demonstrate its power in the considered case of semiparametric estimation. Let Y denote the observed random data and IP the data distribution. The parametric statistical model assumes that the unknown data distribution IP belongs to a given parametric family (IP_υ, υ ∈ Υ), where Υ is some high dimensional or even infinite dimensional parameter space. This paper concentrates on a finite dimensional setting; an extension to a functional space is feasible and will be considered elsewhere. The maximum likelihood approach in parametric estimation suggests to estimate the whole parameter vector υ by maximizing the corresponding log-likelihood L(υ) = log (dIP_υ/dµ₀)(Y) for some dominating measure µ₀. Our study admits a model misspecification IP ∉ (IP_υ, υ ∈ Υ). Equivalently, one can say that L(υ) is a quasi log-likelihood function on Υ. The "target" value υ* of the parameter υ can be defined as the maximizer of the expected log-likelihood, υ* = argmax_{υ∈Υ} IE L(υ). Under model misspecification, υ* defines the best parametric fit to IP by the considered family.
In the semiparametric framework, the target of analysis is only a low dimensional component θ of the whole parameter υ. This means that the target of estimation is θ* = Π₀ υ* for some mapping Π₀ : Υ → IR^p, where p ∈ N stands for the dimension of the target.
The profile maximum likelihood approach defines the estimator of θ* by projecting the obtained MLE υ onto the target space: θ = Π₀ υ. The Gauss-Markov theorem claims the efficiency of such procedures for linear Gaussian models and a linear mapping Π₀, and the famous Fisher result extends it in the asymptotic sense to the general situation under some regularity conditions. The Wilks phenomenon describes the limiting distribution of the likelihood ratio test statistic T = 2{ max_{υ∈Υ} L(υ) − max_{υ: Π₀υ=θ*} L(υ) }. (1.1) It appears that the distribution of this test statistic is nearly chi-square χ²_p as the sample size grows; see Wilks (1938). In particular, this limiting behavior does not depend on the particular model structure or on the full dimension of the parameter υ; only the dimension of the target matters. The full parameter dimension can even be infinite under some upper bounds on its total entropy. Below we consider a slightly different presentation of this estimator based on partial optimization of the objective function L(υ) for a fixed θ. Namely, define L̆(θ) = max_{υ: Π₀υ=θ} L(υ). (1.2) Then the profile MLE can be defined as the point of maximum of L̆(θ). The test statistic T from (1.1) is also called the semiparametric excess and it can be written as T = 2{ L̆(θ) − L̆(θ*) }, and the Wilks result can be rewritten as the statement that T is nearly χ²_p. The local asymptotic normality (LAN) approach of Le Cam leads to the most general setup in which Wilks type results can be established. However, the classical theory of semiparametric estimation faces serious difficulties when the dimension of the nuisance parameter becomes large or infinite. The LAN property yields a local approximation of the log-likelihood of the full model by the log-likelihood of a linear Gaussian model, and this property is only validated in a root-n neighborhood of the true point. The non- and semiparametric cases require to consider larger neighborhoods where the LAN approach is not applicable any more.
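As a numerical aside (not part of the paper's argument), the dimension-free nature of the Wilks statement can be checked directly in a linear Gaussian model with known unit noise variance, where a statistic of type (1.1) is exactly chi-square with p degrees of freedom whatever the full dimension p*. The following Python sketch is purely illustrative; the design, dimensions, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_full, p = 200, 10, 2           # sample size, full and target dimensions (arbitrary)
ups_true = rng.normal(size=p_full)  # hypothetical true parameter upsilon*
X = rng.normal(size=(n, p_full))    # fixed design

T = []
for _ in range(2000):
    y = X @ ups_true + rng.normal(size=n)
    # full MLE = least squares; with unit noise variance, 2 L(ups) = -RSS(ups) + const
    rss_full = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    # constrained fit: the target block theta (first p coordinates) fixed at the truth
    y_res = y - X[:, :p] @ ups_true[:p]
    Xn = X[:, p:]
    rss_con = np.sum((y_res - Xn @ np.linalg.lstsq(Xn, y_res, rcond=None)[0]) ** 2)
    T.append(rss_con - rss_full)    # likelihood ratio statistic of type (1.1)

print(np.mean(T))  # close to p = 2: the chi-square_p mean, regardless of p_full
```

Repeating the experiment with a larger p_full leaves the distribution of T essentially unchanged, in line with the claim that only the target dimension matters.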
A proper extension of the Wilks result to the case of a growing or infinite nuisance dimension is quite challenging and involves special constructions like a consistent pilot estimator of the target and a hardest parametric submodel, as well as some powerful tools of empirical process theory; see Murphy and Van der Vaart (2000) or Kosorok (2005) for a comprehensive presentation.
The recent paper Spokoiny (2012) offers a new look at the classical LAN theory. The basic idea is to replace the local approximation by local bracketing. Instead of one approximating Gaussian log-likelihood, one builds two different quadratic processes such that the original log-likelihood can be sandwiched between them up to a small error.
It appears that the bracketing device can be applied for much larger neighborhoods than in the LAN approach. In this paper we show that the local bracketing approach of Spokoiny (2012) can be used for obtaining a version of the Wilks Theorem in a quite general semiparametric setup avoiding any special construction like "the hardest parametric submodel".
Another important issue is that the new approach does not rely on any pilot estimator of the target. The usual assumption that a consistent pilot estimator is available can even be misleading in our setup because it separates local and global considerations. This paper attempts to give a list of conditions ensuring global concentration and local expansion at the same time. This in particular allows us to address the crucial question of the largest dimension of the nuisance parameter for which the Wilks result still holds.
It appears that the profile semiparametric approach is validated under the constraint (p*)³ ≪ n, where p* is the full parameter dimension. It applies even if the dimension p of the target grows with the sample size under the mentioned constraint. The important identifiability issue is also addressed in a more careful way for the considered finite sample case.
For the further presentation we briefly outline the basic results from Spokoiny (2012). Introduce the log-likelihood ratio process L(υ, υ*) = L(υ) − L(υ*). The key bracketing result of Spokoiny (2012) claims that L(υ, υ*) can be sandwiched on a local elliptic set Υ∘(r) around υ* between two quadratic in υ processes L_ǫ(υ, υ*) and L_{−ǫ}(υ, υ*), up to small terms ♦_ǫ(r) > 0 and ♦_{−ǫ}(r) > 0. (1.3) The value r here can be viewed as the radius of the set Υ∘(r) in the intrinsic semimetric corresponding to the process L(υ). See Section B for a precise formulation. This local result is accompanied by a deviation bound ensuring υ ∈ Υ∘(r) with probability at least 1 − e^{−x}, where x grows almost linearly with r. The bracketing result (1.3) yields a number of important and informative corollaries. One of them shows that the excess L(υ, υ*) can be approximated by the quadratic form ‖ξ‖²/2, where ξ def= D₀⁻¹∇L(υ*) is the normalized score, while D₀² approximates the total Fisher information matrix. Another important corollary of (1.3) is an expansion of the quasi MLE υ. The mentioned results can be written in the form of the Wilks approximation 2L(υ, υ*) ≈ ‖ξ‖² and the Fisher expansion D₀(υ − υ*) ≈ ξ, both valid up to a random term ∆_ǫ called the spread, which is small with large probability. (1.4) In a typical situation with a correctly specified model, ξ is nearly standard normal and hence 2L(υ, υ*) is nearly χ²_{p*}, where p* is the full parameter dimension, while the MLE υ is asymptotically normal and efficient. The expansion (1.4) helps to build likelihood-based confidence sets for the true parameter υ*. Let χ_α be the (1 − α)-quantile of the chi-square distribution with p* degrees of freedom. Set E(α) = {υ : 2{max_{υ′∈Υ} L(υ′) − L(υ)} ≤ χ_α}. Then (1.4) ensures that the non-coverage probability IP(υ* ∉ E(α)) is close to α provided that ∆_ǫ is sufficiently small.
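The construction of E(α) can be examined by simulation. The Python sketch below is a hypothetical illustration in a Gaussian linear model, where the excess is exactly chi-square with p* degrees of freedom and the spread vanishes; the chi-square quantile is estimated by Monte Carlo to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_full, alpha = 100, 5, 0.05
X = rng.normal(size=(n, p_full))
ups_true = rng.normal(size=p_full)

# (1 - alpha)-quantile chi_alpha of chi-square_{p*}, estimated by Monte Carlo
chi_alpha = np.quantile(rng.chisquare(p_full, size=200_000), 1 - alpha)

trials, cover = 2000, 0
for _ in range(trials):
    y = X @ ups_true + rng.normal(size=n)
    ups_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # 2 {L(ups_hat) - L(ups*)} = RSS(ups*) - RSS(ups_hat) for unit noise variance
    excess = np.sum((y - X @ ups_true) ** 2) - np.sum((y - X @ ups_hat) ** 2)
    cover += excess <= chi_alpha    # ups* falls inside E(alpha)

print(cover / trials)  # close to 1 - alpha = 0.95
```

In models with a nonzero spread the simulated coverage would deviate from 1 − α by an amount controlled by ∆_ǫ, which is the point of the finite sample analysis.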
This paper aims at establishing similar statements for the process L̆(θ) from (1.2).
In particular, the Wilks result can be written as 2{L̆(θ) − L̆(θ*)} ≈ ‖ξ̆‖², where the random p-vector ξ̆ satisfies IE ξ̆ = 0 and IE‖ξ̆‖² ≅ p. The deviation properties of ‖ξ̆‖² resemble those of a chi-square random variable with p degrees of freedom, just as in the Wilks phenomenon. The expansion of the profile MLE θ reads as D̆₀(θ − θ*) ≈ ξ̆. The symmetric matrix D̆₀² ∈ IR^{p×p} is usually called the influence matrix and it is the covariance of the efficient influence function; see Kosorok (2005).
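In the linear Gaussian model the Fisher expansion of the profile MLE is exact, which makes it a convenient sanity check for the block algebra behind D̆₀² and ∇̆_θ. The sketch below is illustrative only: the model is an arbitrary choice, and the Schur-complement formulas used for the efficient information and efficient score are standard linear algebra rather than a quotation of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_full, p = 50, 6, 2
X = rng.normal(size=(n, p_full))
ups_true = rng.normal(size=p_full)
y = X @ ups_true + rng.normal(size=n)

D2 = X.T @ X                        # total information matrix D_0^2
grad = X.T @ (y - X @ ups_true)     # score nabla L(ups*)

# block representation for the (theta, eta)-setup
Dtt, Dte = D2[:p, :p], D2[:p, p:]
Det, Dee = D2[p:, :p], D2[p:, p:]
S = Dtt - Dte @ np.linalg.solve(Dee, Det)                   # efficient information
grad_eff = grad[:p] - Dte @ np.linalg.solve(Dee, grad[p:])  # efficient score

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0][:p]        # profile MLE of theta
err = np.max(np.abs(theta_hat - ups_true[:p] - np.linalg.solve(S, grad_eff)))
print(err)  # numerically zero: the expansion is exact in this model
```

The error printed is of the order of machine precision, reflecting that in the linear Gaussian case the spread vanishes identically.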
Usually in the classical semiparametric setup, the vector υ is represented as υ = (θ, η), where θ is the target of analysis while η is the nuisance parameter. We refer to this situation as the (θ, η)-setup, and our presentation follows this setting. An extension to the υ-setup with θ = Π₀υ is straightforward. Also, for simplicity, we only develop our results for the case where the full parameter space Υ is a subset of the Euclidean space of dimension p*. An extension to an infinite dimensional parameter space is possible but involves a range of technical issues that have to be addressed elsewhere.
Section 2 introduces the objects and tools of the analysis and collects the main results including an extension of the Wilks Theorem, concentration properties of the profile estimator and the construction of confidence sets for the "true" parameter θ * . The concentration properties of the profile MLE are discussed in Section D.1. The appendix collects the conditions and the proofs of the main results.

Main results
This section presents our main results on the semiparametric profile estimator, including the Wilks expansion of the profile maximum likelihood and the Fisher expansion of the profile MLE θ. All the results are stated under the same list of conditions, which can be found in Section A of the appendix. As already mentioned, our setup follows Spokoiny (2012). However, at one point there is an essential difference: the results of Spokoiny (2012) are stated for one fixed finite sample, and the same is true for the results below, but we are also interested in understanding what happens when the full dimension p* becomes large. For this we consider an asymptotic setup with p* = p_n, where n denotes the asymptotic parameter. It can be viewed as the sample size with n → ∞. We assume that all considered objects depend on n, including the likelihood function, the full parameter set Υ and its dimension p*, and the constants in our conditions. The primary goal of our study is to identify necessary and sufficient conditions on the growth of p_n with n ensuring the Wilks and Fisher results.
Our results apply even if the target parameter θ is of growing dimension; p can even be of order p*. The case of a full dimensional target and a low dimensional nuisance is also included.

The Wilks and Fisher expansion
This section states the key results in the semiparametric framework which heavily use the local bracketing idea of Spokoiny (2012). First we introduce the main elements of the bracketing device. These include two p* × p* matrices V₀² and D₀² and a pair of constants ǫ = (δ, ̺). The matrix V₀² describes the variability of the process L(υ) around the true point υ*: V₀² def= Var{∇L(υ*)}. (2.1) The matrix D₀² is defined similarly to the Fisher information matrix: D₀² def= −∇² IE L(υ*). (2.2) Here and in what follows we implicitly assume that the log-likelihood function L(υ) is sufficiently smooth in υ; ∇L(υ) stands for the gradient and ∇² IE L(υ) for the Hessian of the expectation IE L at υ. It is worth mentioning that the matrices D₀² and V₀² coincide if the model Y ~ IP_{υ*} ∈ (IP_υ) is correctly specified and sufficiently regular; see e.g. Ibragimov and Khas'minskij (1981).
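The distinction between V₀² and D₀² can be made concrete numerically. The sketch below is a hypothetical quasi-likelihood example, not taken from the paper: a Gaussian location model with unit variance is fitted to data whose true variance is σ². Then D₀² = n regardless of σ, while V₀² = nσ², so the two matrices coincide exactly when the model is correctly specified (σ = 1).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50

def score_var(sigma, reps=20_000):
    # model: Y_i ~ N(ups, 1), quasi log-likelihood L(ups) = -0.5 * sum_i (Y_i - ups)^2,
    # so nabla L(ups*) = sum_i Y_i at ups* = 0; true data: Y_i ~ N(0, sigma^2)
    Y = rng.normal(scale=sigma, size=(reps, n))
    return np.var(Y.sum(axis=1))    # Monte Carlo estimate of V_0^2 = Var{nabla L(ups*)}

D2 = n                              # D_0^2 = -(d/dups)^2 IE L(ups*) = n, for any sigma
print(score_var(1.0) / D2)   # ~ 1: correct specification, V_0^2 = D_0^2
print(score_var(2.0) / D2)   # ~ 4: misspecification, V_0^2 = n sigma^2 != D_0^2
```

The ratio V₀²/D₀² thus acts as a simple diagnostic for the "sandwich" effect of misspecification in this toy model.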
Now we switch to the (θ, η)-setup. Consider the block representation of the vector ∇ def= ∇L(υ*) and of the matrices V₀² from (2.1) and D₀² from (2.2): ∇ = (∇_θ⊤, ∇_η⊤)⊤ and D₀² = ( D_θθ² A₀ ; A₀⊤ H₀² ), with an analogous block form for V₀². Define also the p × p matrix D̆₀² and the p-vectors ∇̆_θ and ξ̆ as D̆₀² = D_θθ² − A₀H₀⁻²A₀⊤, ∇̆_θ = ∇_θ − A₀H₀⁻²∇_η, ξ̆ = D̆₀⁻¹∇̆_θ. In what follows, C denotes a generic fixed constant. For all results presented below we assume a sufficiently large value x to be fixed. It determines our level of overwhelming probability: a generic random set Ω(x) has probability at least 1 − Ce^{−x}. In the asymptotic setup with a growing sample size n the value x grows as well, x = x_n → ∞. Similarly to p*, the value x may depend on the asymptotic parameter n and grow to infinity with n. A particularly relevant choice is x = x_n = C log n for a fixed C > 0. We only require that x_n is not too large; more precisely, x ≤ x_c; see (C.2) in Section C. In the i.i.d. setup x_c is of order n^{1/2}.
The other important value to be fixed is r₀. It determines the frontier between the local and global considerations. In the local vicinity Υ∘(r₀) of radius r₀ we apply a very accurate local quadratic approximation of the log-likelihood process, while outside of this vicinity a much rougher upper function device can be used; see Section B for more details. The general rule for the choice of r₀ is given by the condition r₀² ≥ C₀(p* + x) for some specific constant C₀. The quality of the local quadratic approximation is measured by two functions δ(r) and ω(r) from the local conditions (ED₁) and (L₀) of Section A. More exactly, it can be described by the quantity τ_ǫ = δ + a⁻²̺ with δ = δ(r₀) and ̺ = 3ν₀ω(r₀), where the constants ν₀ and a are from conditions (ED₁) and (I) in Section A. The sub-index ǫ stands for the pair (δ(r₀), ω(r₀)). Our results implicitly assume that τ_ǫ is small. We comment on the typical behavior of τ_ǫ in Section 2.2 in the context of i.i.d. models.
The first result can be viewed as an extension of the Wilks Theorem.
In the next section the result (2.5) is used to show asymptotic normality and efficiency of the profile estimator in the i.i.d. setting and under the correct model specification.

The i.i.d. case and asymptotic efficiency
Here we briefly discuss the implications of our general results for the case Y = (Y₁, ..., Y_n)⊤, where the observations Y_i are i.i.d. from a measure P. The parametric assumption means P = P_{υ*} ∈ (P_υ, υ ∈ Υ) for a given parametric family (P_υ), where Υ is a subset of the Euclidean space IR^{p*}. We assume that (P_υ) obeys the regularity conditions listed in Section 5.1 of Spokoiny (2012). By ℓ(y, υ) we denote the log-density of P_υ w.r.t. some dominating measure µ₀. For ease of comparison with the classical results we do not discuss the model misspecification issue, i.e. the parametric assumption is taken to be correct; an extension to the case of a misspecified model is straightforward. We utilize that V₀² = D₀² = nF, ω(r) = ω* r/n^{1/2}, δ(r) = δ* r/n^{1/2}, and g = g₁ n^{1/2}; see Lemma 5.1 in Spokoiny (2012). Here F is the Fisher information matrix of the family (P_υ) at the point υ*, and ω*, δ*, and g₁ are some positive constants.
It is shown in Spokoiny (2012) that the full parameter υ* can be well estimated provided that p*/n is sufficiently small. More precisely, the concentration property for the set Υ∘(r) requires r² ≥ Cp* for a fixed C, while the local bracketing device is validated up to the spread ∆_ǫ(r), which for r² ≍ p* is of order p* δ(r) ≍ p* r/n^{1/2} ≍ (p*)^{3/2}/n^{1/2}.
The range of applicability of the proposed approach can be informally defined by the rule "the spread is smaller than the value of the problem", where the value of the problem is understood as the expected excess. If the full parameter υ is estimated, the value of the problem is of order p*, leading to the constraint "p*/n is small". If the target parameter is of dimension p, then the value of the problem is also of order p, leading to the constraint "(p*)^{3/2}/(n^{1/2} p) is small". Now we specify the results in the (θ, η) semiparametric setup. To state the result we only need a version of the identifiability condition (I) on the marginal distribution. Let F be the Fisher information matrix of the family (P_υ) at the true point υ*. Consider its block representation F = ( F_θθ F_θη ; F_ηθ F_ηη ). The required identifiability condition reads as follows: (ι) There is a constant ρ < 1 such that ‖F_ηη^{−1/2} F_ηθ F_θθ^{−1/2}‖ ≤ ρ. The presented result admits that the full dimension p* grows with the sample size, but slower than n^{1/3}. The result is applicable even when the target dimension also depends on the sample size.
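The quantity ρ in (ι) can be read as the largest cosine of a principal angle between the target and nuisance score subspaces; this numerical reading, together with the Schur complement used for the efficient information, is a standard linear algebra interpretation and an assumption of this sketch, not a quotation of the paper. The example below uses an invented positive definite matrix.

```python
import numpy as np
from numpy.linalg import cholesky, inv, norm, eigvalsh

rng = np.random.default_rng(4)
p, q = 2, 4                          # target and nuisance dimensions (arbitrary)
A = rng.normal(size=(p + q, p + q))
F = A @ A.T + 0.5 * np.eye(p + q)    # a generic positive definite "Fisher" matrix
Ftt, Fte, Fee = F[:p, :p], F[:p, p:], F[p:, p:]

# rho: spectral norm of F_ee^{-1/2} F_et F_tt^{-1/2}; square roots taken via Cholesky
# (singular values are invariant under the choice of matrix square root)
rho = norm(inv(cholesky(Fee)) @ Fte.T @ inv(cholesky(Ftt).T), 2)

F_eff = Ftt - Fte @ inv(Fee) @ Fte.T   # efficient information: Schur complement

print(rho < 1.0)                       # True: rho < 1 whenever F is positive definite
print(np.all(eigvalsh(F_eff) > 0))     # True: hence the efficient information is PD
```

When ρ approaches 1 the Schur complement degenerates, which matches the interpretation of (ι) as a non-degeneracy requirement on the angle between the two subspaces.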

Critical dimension
This section discusses the issue of a critical dimension. Namely, we assume that the full dimension p* grows with the sample size n and write p* = p_n. Theorem 2.3 requires that p_n = o(n^{1/3}). Here we show that this condition is critical for the class of models satisfying the conditions of Section A. Namely, we present an example in which the behavior of the profile MLE θ heavily depends on the value β_n = p_n³/n. If β_n → 0, then the conditions of Section A are satisfied, yielding asymptotic efficiency of θ. At the same time, if β_n ≥ β > 0, then the MLE θ is no longer root-n consistent.
Assume that p_n/√n → 0. Let the random vector X ∈ IR^{p_n} follow X ~ N(υ*, n⁻¹ I_{p_n}).
It is easy to see that all conditions from Section A are satisfied with τ_ǫ p_n ≅ β_n^{1/2} and D₀² = V₀² = n I_{p_n}.
Therefore, the results from Section 2.1 yield efficiency of the profile MLE θ if p_n³/n → 0. Moreover, it is straightforward to see that the corresponding expansions can be computed explicitly in this example. It follows, similarly to Theorem 2.1, that if β_n = p_n³/n → 0, then the Wilks and Fisher expansions hold. The next result shows that if β_n = p_n³/n is not small, the profile MLE θ is not root-n consistent.
There exists a positive constant α such that, with probability exceeding α, the estimation error of the profile MLE exceeds the root-n rate. If β_n → ∞, then the normalized error diverges, where IP−→ means convergence in probability.
The appendix collects our conditions and proofs of the main results.
We adopt the conditions from Section 2 of Spokoiny (2012) with the obvious change of notations. The local conditions only describe the properties of the process L(υ) for υ ∈ Υ • (r 0 ) with some fixed value r 0 . The global conditions have to be fulfilled on the whole Υ . We start with the local conditions.
(ED₀) There exist a constant ν₀ > 0, a positive symmetric p* × p* matrix V₀² satisfying Var{∇ζ(υ*)} ≤ V₀², and a constant g > 0 such that for all |µ| ≤ g sup_{γ∈IR^{p*}} log IE exp{ µ γ⊤∇ζ(υ*) / ‖V₀γ‖ } ≤ ν₀²µ²/2. (ED₁) For all 0 < r < r₀ there exists a constant ω(r) ≤ 1/2 such that for all υ ∈ Υ∘(r) and |µ| ≤ g the gradient increment ∇ζ(υ) − ∇ζ(υ*) satisfies an analogous exponential moment bound with the additional normalization ω(r). (L₀) There exists a symmetric p* × p*-matrix D₀² such that it holds on the set Υ∘(r₀) for all r ≤ r₀ ‖D₀⁻¹ D²(υ) D₀⁻¹ − I_{p*}‖ ≤ δ(r), where D²(υ) def= −∇²IEL(υ). This condition together with the identity ∇IEL(υ*) = 0 implies |2IEL(υ, υ*) + ‖D₀(υ − υ*)‖²| ≤ δ(r)‖D₀(υ − υ*)‖² on Υ∘(r). The global conditions are: (Lr) For any r > r₀ there exists a value b(r) > 0 such that the expected excess decays, −IEL(υ, υ*) ≥ b(r) r², for all υ with ‖V₀(υ − υ*)‖ = r. (Er) For any r ≥ r₀ there exist a constant ν₀ > 0 and a constant g(r) > 0 such that the exponential moment condition of (ED₀) type holds uniformly over ‖V₀(υ − υ*)‖ = r. Our results are stated for g(r) ≡ g > 0; however, an extension to the case g(r) → 0 can be made similarly to Spokoiny (2012).
Finally we specify the regularity conditions. We begin by representing the information and the covariance matrices in block form corresponding to the decomposition υ = (θ, η). The identifiability conditions in Spokoiny (2012) ensure that the matrix D₀ is positive definite and satisfies a²D₀² ≥ V₀² for some a > 0. Here we restate these conditions in the special block form specific to the (θ, η)-setup.
(I) There are constants a > 0 and ρ < 1 such that the stated block bounds hold. The quantity ρ bounds the angle between the target and nuisance subspaces in the tangent space. The regularity condition (I) ensures that this angle is not too small and hence the target and nuisance parameters are identifiable. In particular, the matrix D̆₀² is well posed under (I).
The bounds in (A.1) are given with the same constant a only to simplify the notation. One can show that the last bound on D₀² follows from the first two and (A.2), with another constant a′ depending only on a and ρ.

B Bracketing and upper function devices
For ǫ = (δ, ̺), define the bracketing quadratic processes L_ǫ(υ, υ*) and L_{−ǫ}(υ, υ*), the latter corresponding to −ǫ = (−δ, −̺). The next result restates the local bracketing bound of Spokoiny (2012) in the semiparametric framework. The imposed conditions and the involved constants ν₀, δ(r), and ω(r) are explained in Section A. The presented results implicitly assume that p* is large and that x is large as well, so that e^{−x} is negligible. A proper choice is x = Cp* for a fixed C.
Theorem B.1 (Spokoiny (2012), Theorem 3.1). Assume (ED₁) and (L₀). Let, for some r, the values ̺ ≥ 3ν₀ω(r) and δ ≥ δ(r) be chosen so that the bracketing bound holds, where the random variables ♦_ǫ(r) and ♦_{−ǫ}(r) fulfill the stated bound on a random set Ω(x) of dominating probability. In fact, Theorem 3.1 of Spokoiny (2012) states this bound with Q = 2.4 p*. Under the assumption that g is sufficiently large, that is, g/ν₀ ≫ p*, we can apply it, and the result of Theorem B.1 follows. The bracketing result of Theorem B.1 is local in the sense that it only applies for υ ∈ Υ∘(r). Following the general approach of Spokoiny (2012), we accompany it with a large deviation bound ensuring the concentration υ ∈ Υ∘(r₀) with dominating probability when the radius r₀ is sufficiently large, namely r₀² ≥ Cp*. We adopt the upper function approach from Spokoiny (2012); cf. Corollary 4.4 therein.
Again, the constants g(r) and b(r) are introduced in Section A. If for all r ≥ r₀ the stated conditions are fulfilled, then υ ∈ Υ∘(r₀) on a random set Ω(x) of dominating probability. The same bound holds for the probability of the event {υ_{θ*} ∈ Υ∘(r₀)}, where υ_{θ*} maximizes L(υ, υ*) subject to Π₀υ = θ*. Remark B.1. The condition (B.3) helps to understand which r₀ ensures the prescribed concentration properties of υ and υ_{θ*}. Namely, if g(r) is large enough, then (B.3) follows from the bound

C Deviation bounds for quadratic forms
The following general result from Spokoiny (2013) helps to control the deviations of quadratic forms of the type ‖IBξ‖² for a given positive matrix IB and a random vector ξ.
It will be used several times in our proofs. Suppose that ξ satisfies the exponential moment condition (C.1). For a symmetric matrix IB, define p = tr(IB²) and v² = 2 tr(IB⁴), and let λ* denote the largest eigenvalue of IB². We suppose that λ* ≤ 1; otherwise one should replace IB everywhere with IB/λ*.
Let g be the constant from (C.1). Define ω_c by the stated equation, and define also µ_c = ω_c²/(1 + ω_c²) ∧ 2/3; note that ω_c² ≥ 2 implies µ_c = 2/3. Theorem C.1 (Spokoiny (2013)). Let ξ fulfill (C.1) with g² ≥ 2p. Then we have for x ≤ x_c with x_c from (C.2) the stated deviation bound. It appears that the bound is slightly different in the two zones separated by the specific value x_c from (C.2). In typical situations x_c is large, x_c ≅ g (of order √n in the i.i.d. case). For x ≤ x_c we obtain the same type of bounds as in the Gaussian case; for x > x_c they are somewhat worse.
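The first two moments entering the bound, IE‖IBξ‖² = p = tr(IB²) and, in the Gaussian case, Var‖IBξ‖² = v² = 2 tr(IB⁴), can be confirmed by simulation. The Python sketch below is illustrative; the matrix and sample sizes are arbitrary, and a standard normal ξ is used since it satisfies (C.1).

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
A = rng.normal(size=(d, d))
B = (A + A.T) / 4                      # a symmetric matrix IB

xi = rng.normal(size=(200_000, d))     # standard normal xi satisfies (C.1)
q = np.sum((xi @ B) ** 2, axis=1)      # ||IB xi||^2 (valid since B is symmetric)

print(np.mean(q), np.trace(B @ B))                            # mean vs p = tr(IB^2)
print(np.var(q), 2 * np.trace(np.linalg.matrix_power(B, 4)))  # var vs v^2 = 2 tr(IB^4)
```

The simulated mean and variance match tr(IB²) and 2 tr(IB⁴) up to Monte Carlo error, consistent with the Gaussian benchmark against which the two deviation zones are compared.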

D Proofs
This section collects the proofs of the results in the order of their appearance.

D.1 Proof of Theorem 2.1
Define the m × m matrices H²_ǫ and H²_{−ǫ} by analogy with (B.1). Below we fix a constant r which is assumed to be large enough to ensure a dominating probability for the concentration event C_ǫ(r) defined via the concentration of the unconstrained and constrained estimators. Note that the conditions ‖V₀(υ − υ*)‖ ≤ r and ‖V₀(υ_{θ*} − υ*)‖ ≤ r can be represented as {υ ∈ Υ∘(r)} and {υ_{θ*} ∈ Υ∘(r)}, and a similar representation holds for the remaining constraints. Later we show that a proper choice of r ensures a dominating probability for the random set C_ǫ(r); see the end of this section.
We first show that the bound (2.4) is fulfilled on the set C_ǫ(r) from (D.1). In analogy with Spokoiny (2012), the quantity ∆_ǫ(r) can be called the semiparametric spread. It can be seen as the payment for the bracketing device. Below we show that ∆_ǫ(r) ≤ C τ_ǫ(p* + x) with dominating probability.
Lemma D.1. The identity (D.5) holds, where sup_υ means the maximum over all υ ∈ IR^{p*}; moreover, a similar identity holds on the random set C_ǫ(r). Proof. The identity (D.5) follows directly by maximizing the quadratic expression, with the maximum attained at υ = υ* + D_ǫ⁻²∇. Similarly, the maximum of L_{−ǫ}(υ, υ*) is achieved at υ = υ* + D_{−ǫ}⁻²∇, which lies within Υ∘(r) on C_ǫ(r). This yields the claim.
The next lemma states similar results for the constrained maxima of L_ǫ and L_{−ǫ} subject to Π₀υ = θ*. The proof is the same as for Lemma D.1. Recall the notation for the blocks of D₀², and recall the definition of ∇̆_θ and D̆₀². Lemma D.3. The identities (D.6) and (D.7) hold on the indicated random set. Proof. First consider the adaptive case with A₀ = 0, yielding D̆₀² = D_θθ² and ∇̆_θ = ∇_θ. Then the process L(υ, υ*) can be decomposed into the θ- and η-parts, and partial optimization subject to θ = θ* yields the results (D.6) and (D.7). Note that the constrained maximum is attained at η = η* + H₀⁻²∇_η. The general case can be reduced to the adaptive one by a change of variables which corresponds to the decomposition in the adaptive case.
By Lemmas D.1 and D.2, on the random set C_ǫ(r), one can replace the supremum of L_ǫ(υ, υ*) over Υ∘(r) by the supremum over the whole space IR^{p*}. Putting the obtained bounds together yields (D.10). Lemma D.1 implies the first of the required inequalities. Define now τ_ǫ = δ + ̺a⁻², so that the quantities α_ǫ and α_{−ǫ} satisfy the corresponding bounds. This yields the first claim, and similarly, using the result of Lemma D.2, the second one. Further, (D.10) and (D.8) yield the conclusion. The proof of (D.2) and (D.3) is complete.
The next step is to bound the spread ∆_ǫ(r) from (D.4). The error terms ♦_ǫ(r) and ♦_{−ǫ}(r) obey the bound (B.2) of Theorem B.1 and are of order ̺(p* + x). Further, we have to show that τ_ǫ‖D₀⁻¹∇‖² is small relative to ‖ξ‖², and similarly for τ_ǫ‖H₀⁻¹∇_η‖².
Theorem C.1 provides a general deviation probability bound for such quadratic forms.
In particular, for the corresponding matrix IB, it holds with probability at least 1 − e^{−x} that ‖D₀⁻¹∇‖² ≤ z(x, IB), where z(x, IB) ≤ tr(IB) + 6x and the constant x_c is large; see Section C for a precise formulation. Under the regularity condition (I) it holds tr(IB) ≤ a²p*. A similar bound holds for ‖H₀⁻¹∇_η‖². We conclude that the spread ∆_ǫ(r) can be bounded, with probability of order 1 − e^{−x}, by C τ_ǫ(p* + x) for a fixed constant C.
To control the probability IP(‖V₀D_ǫ⁻²∇‖ > r) we apply Corollary C.1 with the corresponding matrix. With the definitions from Section C, the bound holds provided that r² > a⁴(1 − τ_ǫ)⁻²(p* + 6x) and x ≤ x_c. The constrained case follows by similar arguments. Putting the obtained bounds together shows that for x ≤ x_c and r₀² ≥ C₁(p* + x) it holds 1 − IP(C_ǫ(r₀)) ≤ C₂e^{−x} for some fixed constants C₁ and C₂ depending only on τ_ǫ and a. This completes the proof.
Also the i.i.d. structure of the data yields D₀² = nF.