On the asymptotic variance of the debiased Lasso

Abstract: We consider the high-dimensional linear regression model Y = Xβ_0 + ε with Gaussian noise ε and Gaussian random design X. We assume that Σ := E[XᵀX]/n is non-singular and write its inverse as Θ := Σ^{−1}. The parameter of interest is the first component β_{0,1} of β_0. We show that in the high-dimensional case the asymptotic variance of a debiased Lasso estimator can be smaller than Θ_{1,1}. For some special such cases we establish asymptotic efficiency. The conditions include β_0 being sparse and the first column Θ_1 of Θ not being sparse. These sparsity conditions depend on whether Σ is known or not.


Introduction
Let Y be an n-vector of observations and X ∈ R^{n×p} an input matrix. The linear model is

Y = Xβ_0 + ε,

where β_0 ∈ R^p is a vector of unknown coefficients and ε ∈ Rⁿ is unobservable noise. We examine the high-dimensional case with p ≫ n. The parameter of interest in this paper is a component of β_0, say the first component β_{0,1}. We consider the asymptotic properties of debiased estimators of the one-dimensional parameter β_{0,1} under scenarios where certain sparsity assumptions fail to hold. The paper shows that the asymptotic variance of the debiased estimator can be smaller than the "usual" value for the low-dimensional case. For simplicity we examine Gaussian data: the rows of (X, Y) ∈ R^{n×(p+1)} are i.i.d. copies of a zero-mean Gaussian row vector (x, y) ∈ R^{p+1}, where x = (x_1, …, x_p) has covariance matrix Σ := E[xᵀx]. We assume the inverse of Σ exists and write it as Θ := Σ^{−1}. The vector β_0 of regression coefficients is β_0 = Θ E[xᵀy]. We denote the first column of Θ by Θ_1 ∈ R^p and the first element of this vector by Θ_{1,1}. Our main aim is to present examples where lack of sparsity in Θ_1 can result in a small asymptotic variance of a suitably debiased estimator. In particular, the asymptotic variance can be smaller than Θ_{1,1}. For the case of known Σ, this means that applying, for instance, a (noiseless) node-wise Lasso instead of an exact orthogonalization of the first variable with respect to the others may reduce the asymptotic variance (as follows from combining Theorem 2.1 with Lemma 3.2). If Σ is unknown, the high dimensionality of the problem already excludes exact empirical projections for orthogonalization. The (noisy) Lasso is designed to deal with approximate orthogonalization in the high-dimensional case. Using the node-wise Lasso, we find that one may again profit from non-sparsity of the (now unknown) vector Θ_1 (see Theorem 4.1).
We look at specific examples or constructions of covariance matrices Σ. The results illustrate that asymptotic efficiency claims require some caution. The high-dimensional situation exhibits new phenomena that do not occur in low dimensions.
Throughout, the minimal eigenvalue of Σ, denoted by Λ²_min, is required to stay away from zero, i.e. 1/Λ²_min = O(1). We further consider only matrices Σ with all 1's on the diagonal and assume for simplicity that σ² := E(y − xβ_0)² is known and equal to σ² = 1.
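As a small numerical illustration of these standing assumptions (our own sketch, not from the paper), one can build an equicorrelated Σ with unit diagonal and check that its smallest eigenvalue stays away from zero while computing Θ_{1,1}:

```python
import numpy as np

# Toy illustration of the standing assumptions: an equicorrelated Sigma with
# unit diagonal, its smallest eigenvalue Lambda_min^2, and Theta = Sigma^{-1}.
# The dimension p and correlation rho are arbitrary illustrative choices.
p, rho = 50, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

lambda_min_sq = np.linalg.eigvalsh(Sigma)[0]  # smallest eigenvalue, here 1 - rho
Theta = np.linalg.inv(Sigma)                  # precision matrix
theta_11 = Theta[0, 0]                        # the "usual" asymptotic variance
```

With unit diagonal one always has Θ_{1,1} ≥ 1, and here 1/Λ²_min = 1/(1 − ρ) stays bounded as long as ρ stays away from 1.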
Let a given subset B of R^p be the model class for β_0. An interesting research goal is to construct for the model B a regular estimator of β_{0,1} whose asymptotic variance achieves the asymptotic Cramér Rao lower bound (given here in Proposition 1.1). One then needs to decide which model class B one considers as relevant. In high-dimensional statistics it is commonly assumed that β_0 is sparse in some sense. Let 0 < r ≤ 1, define for a vector b ∈ R^p its ℓ_r-"norm" ‖b‖_r by ‖b‖_r^r := Σ_{j=1}^p |b_j|^r, and let ‖b‖_0 be its number of non-zero entries. A sparse model is for example

B = {b ∈ R^p : ‖b‖_0 ≤ s}    (1.1)

for some ("small") s ∈ N. Alternatively one may believe only in ℓ_1-sparsity:

B = {b ∈ R^p : ‖b‖_1 ≤ √s}    (1.2)

for some s > 0. These are the two extremes of weakly sparse models of the form

B = {b ∈ R^p : ‖b‖_r^r ≤ s^{1−r/2}}    (1.3)

for some s > 0 and 0 ≤ r ≤ 1. Throughout, the value of s is allowed to depend on n, but r is fixed for all n.
Constructing estimators that achieve the asymptotic Cramér Rao lower bound for model (1.1), (1.2), (1.3) or some other sparse model is to our understanding quite ambitious, especially if one wants to do this for all possible covariance matrices Σ. See e.g. Example 1.1 for some details concerning model (1.1). However, for special cases of Σ the problem can be solved. One such special case is where the first column Θ_1 of Θ is sparse in an appropriate sense. This is the situation considered in previous work such as [26] and [25], where Σ is unknown. In this paper we consider both known and unknown Σ and in both cases do not require sparsity of Θ_1. The paper [11] also does not require sparsity of Θ_1 when Σ is known, and it turns out that for certain non-sparse vectors Θ_1 their estimator is not asymptotically efficient, for example under the model (1.2) with s = o(n/log p) and with a matrix Σ of a certain form (see Theorem 2.1 or Remark 2.6 following this theorem).
The debiased Lasso defined in this paper in equation (1.5) below is based on a direction Θ̂_1 ∈ R^p, where Θ̂_1 is thought of as some estimate of Θ_1. As we do not assume sparsity of Θ_1, a reliable estimator of Θ_1 may not be available. Nevertheless, we show that this does not rule out good theoretical performance. We present a class of covariance matrices Σ for which a debiased Lasso has asymptotic variance smaller than Θ_{1,1}. This phenomenon is tied to the high-dimensional situation, see Remark 2.3. For special cases, we establish that an asymptotic Cramér Rao bound smaller than Θ_{1,1} can be achieved. In other words, there exist cases where a debiased Lasso profits from sparsity of β_0 combined with non-sparsity of Θ_1. This is good news: the asymptotic variance can be small for two reasons. Either Θ_1 is sparse, in which case the asymptotic variance Θ_{1,1} is the inverse of the residual variance of the regression of the first variable on only a few of the other variables, or Θ_1 is not sparse, but then the asymptotic variance can be smaller than Θ_{1,1}. This paper presents cases where the latter situation indeed occurs.

The Lasso, debiased Lasso and sparsity assumptions
The Lasso ([23]) is defined as

β̂ := arg min_{b ∈ R^p} { ‖Y − Xb‖²₂/n + λ‖b‖₁ }    (1.4)

with λ > 0 a tuning parameter. (We will throughout take λ of order √(log p/n), but not too small.) A debiased Lasso is given by

b̂₁ := β̂₁ + Θ̂₁ᵀXᵀ(Y − Xβ̂)/n.    (1.5)

The p-dimensional vector Θ̂₁ is some estimate of the first column Θ_1 of the precision matrix Θ, but in our case it will rather be estimating a sparse approximation. We refer to Θ̂₁ as a direction. The estimator β̂ is commonly taken to be the Lasso given in (1.4), although this is not a must. The debiased Lasso (1.5) was introduced in [26] and further developed in, for example, [9] and [25]. Related work is [1] and [2].
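A minimal simulation sketch of (1.4) and (1.5), assuming an identity covariance (so Θ_1 = e_1) and using scikit-learn's Lasso; the toy design, seed and all names are our own illustration, not the paper's construction:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hedged toy sketch of the Lasso (1.4) and the debiased Lasso (1.5).
rng = np.random.default_rng(0)
n, p, s = 200, 300, 5
beta0 = np.zeros(p)
beta0[:s] = 1.0                          # sparse beta_0 with beta_{0,1} = 1
X = rng.standard_normal((n, p))          # Gaussian design, Sigma = I
Y = X @ beta0 + rng.standard_normal(n)   # noise variance sigma^2 = 1

lam = np.sqrt(np.log(p) / n)             # tuning of order sqrt(log p / n)
# sklearn minimizes ||Y - Xb||_2^2/(2n) + alpha*||b||_1; the constants
# differ from (1.4) by a factor 2, which does not affect the order of lam.
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_

Theta1 = np.zeros(p); Theta1[0] = 1.0    # first column of Theta = Sigma^{-1}
b1 = beta_hat[0] + Theta1 @ X.T @ (Y - X @ beta_hat) / n   # debiasing step (1.5)
```

The correction term adds back an estimate of the shrinkage bias of β̂₁; under the conditions discussed below, √n(b̂₁ − β_{0,1}) is asymptotically normal.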
Which sparsity variant is needed to establish asymptotic normality of the debiased Lasso (1.5) depends to a large extent on whether Σ is known or not. In [10], [6], [20], [11] one can find refined results on this issue. The case of known Σ is treated in Section 2. We introduce and apply there the concept of an eligible pair, see Definition 2.1. An eligible pair is a sparse approximation of Θ_1 together with a parameter describing the order of approximation and sparsity. We allow for sparsity variant (i) in model (1.1) as in [11], see Example 2.1. Sparsity variant (i) will also be allowed for the models (1.2) and (1.3), see Examples 2.2 and 2.3.
Eligible pairs will also play a crucial role in Section 4, where Σ is unknown. Let us discuss some of the literature for this case and for the sparsity model (1.1). From the papers [6] and [20] we know that for the minimax bias of an estimator of β_{0,1} to be of order 1/√n, the assumption s = O(√n/log p) is necessary. Thus, up to log-terms this needs the second sparsity variant. When considering asymptotic Cramér Rao lower bounds, one also needs to restrict oneself to a certain class of estimators, for instance estimators with bias of small order 1/√n, or asymptotically linear estimators. In [8] such restrictions are studied. One can show asymptotic linearity of the debiased Lasso under model (1.1) with sparsity variant (ii) and in addition ‖Θ_1‖_0 = o(√n/log p). If Θ_1 is not sparse nor can be approximated by a sparse vector, then it is unclear whether an asymptotically linear estimator exists. We refer to Remark 4.4 for more details. In summary, modulo log-terms, sparsity variant (ii) cannot be relaxed as far as minimax rates for the bias are concerned, and sparsity variant (ii) with in addition sparsity of order o(√n/log p) for Θ_1 or its sparse approximation appears to be needed for establishing asymptotic linearity. We note that the paper [11] establishes asymptotic normality under (among others) the assumption (1.6); bias and asymptotic linearity are not considered (these issues are not within the scope of that paper). In our setup however, Θ_1 is not sparse at all, so variant (ii) is in line with (1.6). Tables 2 and 3, presented in Subsection 1.4, summarize the sparsity conditions applied in this paper. One sees that models (1.1) and (1.2) are special cases of model (1.3), with r = 0 and r = 1 respectively. However, when r = 0 the asymptotic efficiency depends on β_0, and also quite severely on the value of s. For the case of unknown Σ, model (1.2) is too large.

The asymptotic Cramér Rao lower bound
We briefly review the Cramér Rao lower bound and refer to [8] for details. Let the model be β_0 ∈ B, where B is a given class of regression coefficients. Let

H_{β_0} := {h ∈ R^p : β_0 + h/√n ∈ B}.

We call H_{β_0} the set of model directions. An estimator T (or actually: sequence of estimators) is called regular at β_0 if for all fixed ρ > 0 and R > 0 not depending on n, and all sequences h ∈ H_{β_0} with |h_1| ≥ ρ and hᵀΣh ≤ R², it holds that under β_0 + h/√n

√n (T − β_{0,1} − h_1/√n) converges weakly to N(0, V²_{β_0}),

where V²_{β_0} is some constant (depending on n and possibly on β_0, but not depending on ρ, R or h), called the asymptotic variance (it is defined up to smaller order terms). Regularity is important in practice: it means that the asymptotics is not just pointwise but holds uniformly over neighbourhoods.

Proposition 1.1. Assume 1/Λ²_min = O(1). Assume the Lindeberg condition. Assume further that T is regular at β_0. Then for all fixed ρ > 0 and R > 0,

V²_{β_0} ≥ (1 − o(1)) max { h_1²/(hᵀΣh) : h ∈ H_{β_0}, |h_1| ≥ ρ, hᵀΣh ≤ R² }.

This proposition is as Theorem 9 in [8] but tailored to the particular situation. A proof is given in Section 6. We remark that such results are not a direct consequence of the Le Cam theory, as we are dealing with triangular arrays.
The next corollary is our main tool to arrive at asymptotic efficiency for some special Σ's.

Corollary 1.1. Assume the conditions of Proposition 1.1 and that for some fixed ρ > 0 and R > 0 and some sequence h ∈ H_{β_0}, with |h_1| ≥ ρ and hᵀΣh ≤ R², it is true that

V²_{β_0} ≤ (1 + o(1)) h_1²/(hᵀΣh).

Then T is asymptotically efficient.
The restriction to directions in H_{β_0} means that the Cramér Rao lower bound for the asymptotic variance V²_{β_0} can be orders of magnitude smaller than Θ_{1,1}.
Note that the condition on Θ_1 depends on β_0 (via s_0). b) Suppose 1 ∈ S_0, s = s_0 and that the following "betamin" condition holds: |β_{0,j}| > m_n/√n for all j ∈ S_0, where m_n is some sequence satisfying m_n → ∞. The lower bound is then determined by Σ_{S_0,S_0}, the matrix of covariances of the variables in S_0, and corresponds to the case where S_0 is known. The bound could be achieved if one had a consistent estimate Ŝ of S_0. For this one needs betamin conditions in order to have no false negatives. However, applying least squares with the variables in Ŝ, where Ŝ is an estimator of S_0, results in an estimator of β_{0,1} which is not regular. There is a series of papers on this issue, e.g. [15], [14], [17], [18], [19]. Imposing further conditions beyond model (1.1), for example betamin conditions, will diminish the lower bound. c) More generally, if 1 ∈ S_0 and |β_{0,j}| > m_n/√n for all j ∈ S_0 and some sequence m_n → ∞, then the lower bound corresponds to knowing the set S_0 up to s − s_0 additional variables. d) Suppose that β_0 is an interior point in the sense that it stays away from the boundary: for some fixed 0 < η < 1 not depending on n, it holds that s_0 ≤ (1−η)s (so that 1 − s_0/s stays away from zero). By a rescaling argument the dependence on η plays no role in the lower bound, which holds for any fixed constant M > 0 (not depending on n).
It is clear, and illustrated by Examples 1.1, 1.2 and 1.3, that the lower bound of Proposition 1.1 depends on the model B. The sparse model (1.1) is perhaps too stringent. One may want to take the model B as the largest set for which a regular estimator exists. This points in the direction of model (1.2). We will see that when Σ is known this model is indeed useful but when Σ is unknown it is too large.

Notations and definitions
We consider an asymptotic framework with triangular arrays of observations. Thus, unless explicitly stated otherwise, all quantities depend on n although we do not (always) express this in our notation.
The order symbols refer to asymptotics with sample size n → ∞. Thus, for sequences {a_n} and {b_n} of positive numbers, the notation a_n = O(b_n) means that lim sup_{n→∞} a_n/b_n < ∞ and a_n = o(b_n) means that lim_{n→∞} a_n/b_n = 0. Moreover, a_n ≍ b_n means that both a_n = O(b_n) and b_n = O(a_n) hold. Finally, a_n ≫ 0 means that the sequence {a_n} stays away from zero, i.e. that 1/a_n = O(1).

Let x_1 be the first entry of x and x_{−1} := (x_2, …, x_p) be this vector with the first entry excluded, so that x = (x_1, x_{−1}). For vectors b ∈ R^p we use a similar notation: b_1 ∈ R is the first coefficient and b_{−1} ∈ R^{p−1} collects the rest of the coefficients. Apart from the regression (projection) xβ_0 of y on x, we consider the regression of x_1 on x_{−1}, with vector of coefficients γ⁰ ∈ R^{p−1}. We define when possible an approximation γ̃ of γ⁰ which is accompanied by a parameter λ̃ to form an "eligible pair" (γ̃, λ̃), see Definition 2.1. When Σ is known we can invoke the noiseless Lasso γ_Lasso with tuning parameter λ_Lasso (an approximate projection of x_1 on x_{−1}) to approximate γ⁰; see (3.1) for its definition. For the case where Σ is unknown we apply the notation Σ̂ := XᵀX/n. We let X_1 ∈ Rⁿ be the first column of X and X_{−1} ∈ R^{n×(p−1)} the remaining columns, and we write Σ̂_{−1,−1} := X_{−1}ᵀX_{−1}/n. We do an approximate regression of X_1 on X_{−1} invoking the noisy Lasso γ̂ with tuning parameter λ_Lasso as given in (4.1). The various vectors of coefficients and their "lambda parameter" are summarized in Table 1. Here we also add the Lasso β̂ for the estimation of β_0, as defined in (1.4) with tuning parameter λ.

Table 1. The various coefficients and lambda parameters.
We further let, for S ⊂ {1, …, p} with cardinality s, the matrix Σ_{S,S} := E[x_Sᵀx_S] ∈ R^{s×s} be the covariance sub-matrix formed by the variables in S, and Σ_{−S,−S} the covariance matrix of the variables not in S. Also the various other "gamma" parameters γ will be indexed by {2, …, p}; it should be clear from the context when this indexing applies. For c = (c_2, …, c_p) and S ⊂ {2, …, p} we write c_S := {c_j : j ∈ S} and c_{−S} := {c_j : j ∉ S, j ≠ 1}. We use the same notation for the (p−1)-dimensional vector c_S which has the entries not in S set to zero. For a positive semi-definite matrix A we let Λ²_min(A) be its smallest eigenvalue and Λ²_max(A) its largest eigenvalue. The smallest eigenvalue of Σ is written shorthand as Λ²_min := Λ²_min(Σ). Recall we assume throughout that Λ²_min stays away from zero: 1/Λ²_min = O(1). We use the shorthand notation "≫ 0" for "strictly positive and staying away from zero". Thus throughout we assume Λ²_min ≫ 0. In order to be able to construct confidence intervals one needs some uniformity in unknown parameters. We give the following definition (see also [3], Definition 1 on page 18).

Organization of the rest of the paper
Section 2 contains the results for the case of Σ known and applying a debiased Lasso using sample splitting. Here we also introduce the concept of an eligible pair (γ , λ ) in Definition 2.1. Section 3 contains results and constructions for eligible pairs. Section 4 considers the case Σ unknown and a debiased Lasso (without sample splitting). Section 5 concludes and Section 6 collects the proofs. Section 7 (included for completeness) contains some elementary probability inequalities for products of Gaussians, which are applied in Section 4.
In Tables 2 and 3 we summarize the (sparsity) conditions we use, see Examples 2.1, 2.2 and 2.3 for the case of known Σ and Examples 4.1 and 4.2 for the case of unknown Σ. The particular cases r = 0 and r = 1 follow from the general case 0 ≤ r ≤ 1 when Σ is known. When Σ is unknown the case r = 0 also follows from the general case 0 ≤ r < 1. With r = 1 the model is then too large. We have displayed the extreme cases separately so that the conditions for these can be read off directly. In particular for r = 0 one sees the standard sparsity conditions known from the literature. For r = 1 (Σ known) one sees that, unlike in the other cases, there is no logarithmic gap between the conditions for asymptotic normality and those for asymptotic efficiency.

Table 2. The conditions used to prove asymptotic normality, linearity and efficiency when Σ is known. Throughout, (γ̃, λ̃) is required to be an eligible pair (see Definition 2.1).

The case of Σ known
Before presenting "eligible pairs" in Definition 2.1, we provide the motivation that led us to this concept.
Recall the debiased Lasso given in (1.5). If Σ is known we choose the direction Θ̂₁ = Θ̃₁, where Θ̃₁ is a sparse surrogate for Θ_1 defined in (2.5) below.

Table 3. The conditions used to prove asymptotic normality, linearity and efficiency when Σ is unknown. Throughout, (γ̃, λ̃) is required to be an eligible pair (see Definition 2.1), i.e. ‖Σ_{−1,1} − Σ_{−1,−1}γ̃‖_∞ ≤ λ̃ and λ̃‖γ̃‖₁ → 0. The value of r̃ may be different from r; it is assumed to be fixed and 0 ≤ r̃ ≤ 1. Asymptotic efficiency is established when β_0 stays away from the boundary of B. In the case B = {b : ‖b‖_0 ≤ s} the conditions on γ̃ for asymptotic efficiency depend on β_0.
The remainder is a sum of two terms (i) and (ii). The first term (i) can be handled assuming Θ̃₁ᵀΣΘ̃₁ = O(1) and ‖Σ^{1/2}(β̂ − β_0)‖₂ = o_P(1). This goes along the lines of techniques as in [11], applying the conditions used there. One then arrives at (i) = o_P(1/√n). (We will however alternatively use a sample splitting technique later on, in Theorem 2.1, to simplify the derivations.) The second term (ii) is additional bias and will be our major concern. If Θ̃₁ = Θ_1 this term vanishes. However, as we will see, it is useful to apply instead of Θ_1 some sparse approximation of Θ_1. In fact, we aim at a sparse approximation Θ̃₁ with Θ̃_{1,1} smaller than Θ_{1,1} and their difference not vanishing.
We will assume conditions that ensure the additional bias is negligible and invoke the dual norm inequality

|(ΣΘ̃₁ − e_1)ᵀ(β̂ − β_0)| ≤ ‖ΣΘ̃₁ − e_1‖_∞ ‖β̂ − β_0‖₁    (2.1)

(recall that by the definition of Θ_1 it is true that e_1 = ΣΘ_1).

Remark 2.1. One may think of applying instead the Cauchy-Schwarz inequality
This leads to requiring a bound in terms of conjugate exponents p and q. Choosing p ≤ 2 here again works against our aim to improve the asymptotic variance. Thus we need to choose p > 2 (and therefore q < 2). This certifies the choice p = ∞ as quite natural.
In other words, we can only improve the asymptotic variance in the high-dimensional case.
Taking the dual norm inequality (2.1) as starting point, we now need ‖ΣΘ̃₁ − e_1‖_∞ to be small. With the above considerations as motivation, we concentrate on an ℓ_∞-condition as given in inequality (2.2): we settle for some λ̃ and construct vectors Θ̃₁ for which

‖ΣΘ̃₁ − e_1‖_∞ ≤ λ̃    (2.2)

holds. The construction is based on replacing the vector of coefficients γ⁰ of the regression of x_1 on x_{−1} by a sparse approximation γ̃.
Clearly, for any vector γ̃ ∈ R^{p−1} one may take λ̃ the smallest value such that (2.3) and (2.4) are met, giving an eligible pair. However, as we will see in the last statement of the next lemma, we aim in Definition 2.1 at eligible pairs (γ̃, λ̃) with λ̃ a large value (instead of the smallest value) such that (2.3) and (2.4) are met. The conditions in Definition 2.1 will allow us to arrive at (2.2), as is shown in the next lemma. Finally, in order to have a non-vanishing improvement of Θ̃_{1,1} over Θ_{1,1}, it must be true that γ⁰ is not sparse, in the sense that λ̃‖γ⁰‖₁ ≫ 0.

Remark 2.4. The first condition (2.3) of Definition 2.1 can be rewritten as ‖E[x_{−1}ᵀ(x_1 − x_{−1}γ̃)]‖_∞ ≤ λ̃.
The second condition (2.4) in this definition can be thought of as a sparsity condition on γ̃. The two conditions together require that the regression of x_1 on x_{−1} is sparse when one relaxes the orthogonality condition of the residuals to approximate orthogonality. One may think of γ⁰ as a "least squares estimate" of γ̃ in a noisy regression model. This leads to a very natural interpretation of eligible pairs. We refer to Subsection 3.5.2 for details. Further, for Θ̃₁ defined in (2.5), improving over Θ_{1,1} corresponds to non-sparsity of γ⁰: we see from the above lemma that we aim at a situation where γ⁰, and hence Θ_1, is not sparse, but where γ⁰ can be replaced by a sparse vector γ̃. For some special Σ's, we give examples of eligible pairs in Section 3. That section also discusses, for a given λ̃, uniqueness of the vectors γ̃ for which the pair (γ̃, λ̃) is eligible. Moreover, we show cases where x_{−1}γ̃ is an approximation of the projection of x_1 on a subset x_S of the other variables for some S ⊂ {2, …, p}, see Lemma 3.5. This is why the Cramér Rao lower bound can be achieved in those cases. Lemma 2.1 has all the ingredients to prove asymptotic normality of the debiased Lasso (1.5) with direction Θ̂₁ = Θ̃₁ and Θ̃₁ given in (2.5) of this lemma. It can be done along the lines of Theorem 3.8 in [11], assuming the conditions stated there. However, as the authors point out, when using instead the sample splitting approach their Assumption (iii) is not needed. It is also mathematically less involved in the present context. Sample splitting techniques date back at least to [22]. We use the following.
Assume the sample size n is even. Define the matrices X_I ∈ R^{(n/2)×p}, the first n/2 rows of X, and X_II ∈ R^{(n/2)×p}, the last n/2 rows, with Y_I and Y_II the corresponding halves of the response vector. Let β̂_I be an estimator of β_0 based on the first half (X_I, Y_I) of the sample, for instance the Lasso estimator arg min_b { ‖Y_I − X_I b‖²₂/n + λ‖b‖₁ }. Similarly, let β̂_II be an estimator of β_0 based on the second half (X_II, Y_II). Let (γ̃, λ̃) be an eligible pair. We then define the two debiased estimators

b̂_{I,1} := β̂_{II,1} + 2Θ̃₁ᵀX_Iᵀ(Y_I − X_Iβ̂_II)/n,
b̂_{II,1} := β̂_{I,1} + 2Θ̃₁ᵀX_IIᵀ(Y_II − X_IIβ̂_I)/n,

where Θ̃₁ is given in (2.5) in Lemma 2.1. The final estimator b̂₁ is obtained by averaging these two:

b̂₁ := (b̂_{I,1} + b̂_{II,1})/2.    (2.6)

Let now B be a given model class for the unknown vector of regression coefficients β_0.

Theorem 2.1. Let (γ̃, λ̃) be an eligible pair, Θ̃₁ be given in (2.5) and b̂₁ be given in (2.6), with b̂_{I,1} and b̂_{II,1} the debiased estimators based on Θ̃₁ using the split sample. Suppose the stated conditions hold uniformly in β_0 ∈ B. This follows from e.g. Theorem 6.1 in [5], together with results for Gaussian quadratic forms as given in (4.3). The estimator of [11], which uses Θ̂₁ = Θ_1, is in certain cases asymptotically inefficient, as a choice Θ̂₁ = Θ̃₁ ≠ Θ_1 can give an improvement in the asymptotic variance (and is then efficient for certain such cases). We see this happening in the next example (Example 2.2), where the model is (1.2): B := {b ∈ R^p : ‖b‖₁ ≤ √s} with 0 < s = o(n/log p) and the matrix Σ is constructed following one of the Lemmas 3.7, 3.8 or 3.10.
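The splitting scheme can be sketched as follows (a minimal illustration under our own naming; the direction Θ̃₁ is passed in as a fixed vector and the Lasso tuning is shared across the two halves):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of the sample-splitting debiased estimator (2.6): fit the Lasso on
# one half of the sample, debias on the other half with the fixed direction
# Theta1_tilde, then average.  Function and variable names are ours.
def split_debiased_first_coef(X, Y, Theta1_tilde, lam):
    n = X.shape[0]
    half = n // 2                      # n is assumed even
    XI, YI = X[:half], Y[:half]
    XII, YII = X[half:], Y[half:]
    betaI = Lasso(alpha=lam, fit_intercept=False).fit(XI, YI).coef_
    betaII = Lasso(alpha=lam, fit_intercept=False).fit(XII, YII).coef_
    # each half is debiased using the Lasso fitted on the *other* half
    bI = betaII[0] + Theta1_tilde @ XI.T @ (YI - XI @ betaII) / half
    bII = betaI[0] + Theta1_tilde @ XII.T @ (YII - XII @ betaI) / half
    return 0.5 * (bI + bII)
```

Conditioning on the half that produced the Lasso fit is what makes the linear term easy to analyze, as in the proof of Theorem 2.1.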

Example 2.2. In this example we take the model (1.2) with 0 < s = o(n/log p) (sparsity variant (i)). Let β̂ again be the Lasso estimator given in (1.4) with an appropriate choice of the tuning parameter λ ≍ √(log p/n). One may use a "slow rates" result: uniformly in β_0 ∈ B it is true that ‖Σ^{1/2}(β̂ − β_0)‖₂ = o_P(1); see for example [5], Theorem 6.3, and combine this with results for quadratic forms as given in (4.3). (The arguments for establishing these "slow rates" are in fact as in Lemma

The requirement (2.8) on λ is thus
If ‖γ̃‖_{r̃}^{r̃} = O(√(n^{r̃} s^{2−r̃})) then (γ̃, λ̃) is an eligible pair and the Cramér Rao lower bound is achieved whenever β_0 stays away from the boundary. In order to be able to improve over Θ_{1,1} we now need ‖γ⁰‖_{r̃}^{r̃} of larger order than √(n^{r̃} s^{2−r̃}), by Corollary 1.1. Remark 2.5 can be taken into the considerations here too.

Finding eligible pairs
The main results of this section can be found in Subsection 3.5 where for any λ we construct eligible pairs (γ , λ ) by choosing γ 0 appropriately. These results can be seen as existence proofs. Before doing these constructions we discuss uniqueness in Subsection 3.1, in Subsection 3.2 the noiseless Lasso as a practical method for improving over Θ 1,1 and in Subsection 3.3 we examine whether or not projections on a subset of the variables can lead to eligible pairs. For the latter we impose rather stringent conditions. We show in Subsection 3.4 that eligible pairs are more flexible than projections. Nevertheless in the final part of this section we return to projections as they come up naturally when imposing non-sparsity constraints on γ 0 .

Using the Lasso
Consider the noiseless Lasso with tuning parameter λ_Lasso:

γ_Lasso := arg min_{c ∈ R^{p−1}} { E(x_1 − x_{−1}c)² + 2λ_Lasso‖c‖₁ }.    (3.1)

One may verify that (γ_Lasso, λ_Lasso) is an eligible pair if λ_Lasso‖γ⁰‖₁ → 0. But the latter is exactly what we want to avoid. If (γ̃, λ̃) is an eligible pair, then the noiseless Lasso can find it if one chooses λ_Lasso of order λ̃ but larger than λ̃, as follows from the next lemma. It says that given λ̃ one may use the noiseless Lasso for constructing a direction, Θ̃_{1,Lasso} say, with which one has the same improvement over Θ_{1,1} as with Θ̃₁.

Lemma 3.2. Let (γ̃, λ̃) be an eligible pair. Let λ_Lasso > λ̃ and λ_Lasso‖γ̃‖₁ → 0. Let γ_Lasso be the noiseless Lasso defined in (3.1). Then (γ_Lasso, λ_Lasso) is also an eligible pair and Θ̃_{1,Lasso} gives the same improvement over Θ_{1,1} as Θ̃₁.
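Since the noiseless Lasso depends on the data only through Σ, it can be computed directly from Σ. A coordinate-descent sketch (the solver choice and all names are our own, not the paper's):

```python
import numpy as np

# Coordinate-descent sketch of the noiseless Lasso (3.1).  It uses only
# Sigma, not data: minimize over gamma the population criterion
#   E(x_1 - x_{-1} gamma)^2 + 2*lam*||gamma||_1.
def noiseless_lasso(Sigma, lam, n_sweeps=200):
    S = Sigma[1:, 1:]              # Sigma_{-1,-1}
    s1 = Sigma[1:, 0]              # covariances of x_{-1} with x_1
    gamma = np.zeros(len(s1))
    for _ in range(n_sweeps):
        for j in range(len(gamma)):
            # partial covariance of x_1 with x_j, other coordinates removed
            r = s1[j] - S[j] @ gamma + S[j, j] * gamma[j]
            gamma[j] = np.sign(r) * max(abs(r) - lam, 0.0) / S[j, j]
    return gamma
```

At the optimum the KKT conditions give ‖Σ_{−1,1} − Σ_{−1,−1}γ_Lasso‖_∞ ≤ λ_Lasso, the kind of ℓ_∞-bound required of an eligible pair.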

Using projections
In this subsection we investigate (rather straightforwardly) conditions under which the coefficients of a projection can be joined with a λ̃ to form an eligible pair. Consider some set S ⊂ {2, …, p} with cardinality s̃. The value of s̃ need not be s, where s is the sparsity used in the model class B. Let γ^S be the vector of coefficients of the projection of x_1 on x_S; if λ̃‖γ^S‖₁ → 0, we have for γ^S the sparsity condition (2.4) of Definition 2.1. Further, let v_{−S} be the vector of covariances of x_{−S} with the anti-projection of x_1 on x_S. We check whether the pair (γ^S, λ̃) is an eligible pair, which is the case if λ̃√s̃ = o(1) and ‖v_{−S}‖_∞ ≤ λ̃. We briefly discuss some conditions that may help ensuring the latter.
Let |||A|||₁ := max_j Σ_k |a_{j,k}| be the ℓ₁-operator norm of the matrix A.
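For concreteness, with this definition |||A|||₁ is the maximum absolute row sum (a two-line helper of our own):

```python
import numpy as np

# The l1-operator norm |||A|||_1 = max_j sum_k |a_{j,k}| as defined in the
# text: the maximum absolute row sum.  Helper name is ours.
def l1_operator_norm(A):
    return np.max(np.sum(np.abs(A), axis=1))

A = np.array([[1.0, -2.0], [0.5, 0.25]])
norm_A = l1_operator_norm(A)   # row sums are 3.0 and 0.75, so the norm is 3.0
```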
We therefore examine situations where the coefficients in γ⁰ decrease at a rate quicker than 1/√s̃. The following definition is analogous to the definition of "effective sparsity" as given in [16].

Definition 3.1. Let N be some integer and let 0 ≠ v ∈ R^N be a vector. We call such a vector asymptotically sparse or asymptotically non-sparse according to the decay of its ordered coefficients. An asymptotically sparse vector can have some relatively large coefficients, but it cannot have too many of these. If in addition ‖v‖₁ is large it cannot have many zeroes either. Asymptotic non-sparseness of the vector v with the large coefficients removed means that there are many very small non-zero coefficients.

Approximate projections
Recall that the first condition (2.3) of Definition 2.1 can be written as ‖E[x_{−1}ᵀ(x_1 − x_{−1}γ̃)]‖_∞ ≤ λ̃. Denote the active set of γ̃ by S := {j ∈ {2, …, p} : γ̃_j ≠ 0} and its cardinality by s̃ := |S|. Then λ̃√s̃ → 0 implies that 1/Θ̃_{1,1} is asymptotically the squared residual of the projection of x_1 on x_S, as is shown in the next lemma. In that sense, eligible pairs are more flexible than projections.
If s is small enough, then for model (

Reverse engineering
In this subsection we fix λ̃ and then construct vectors γ⁰ ∈ R^{p−1} such that there is an eligible pair (γ̃, λ̃). These constructions are in a sense equivalent but approach the problem from different angles. In these constructions the vector γ̃ has active set S = {j : γ̃_j ≠ 0} with cardinality s̃ := |S|. The sparsity of γ̃ is then measured in terms of the value of s̃. More general constructions are possible, but in this way we can apply the results to any of the models (1.1), (1.2) or (1.3). In view of Lemma 3.5 this means that the constructions correspond to an approximate projection on x_S. Throughout this subsection, the matrix Σ_{−1,−1} is assumed to have 1's on the diagonal and smallest eigenvalue Λ²_min(Σ_{−1,−1}) ≫ 0.

Which γ 0 's are allowed?
We let for

Regression: γ 0 as least squares estimate of γ
In this subsection we create γ⁰ using random noise. We then arrive at an eligible pair "with high probability". Let N ∈ N be a given sequence with N > p. Take a synthetic design with N observations and let γ⁰ be the least squares estimator of γ̃. Finally, let λ̃ ≍ √(log p/N) be appropriately chosen. Then (γ̃, λ̃) is with high probability an eligible pair. Indeed, for appropriate λ̃ ≍ √(log p/N) the first condition (2.3) of Definition 2.1 holds with high probability. The second condition (2.4) of Definition 2.1 follows from the condition ‖γ̃‖₁ = o(√(log p/N)), so λ̃‖γ̃‖₁ = o(1). We also see that if p/N ≫ 0 then with high probability Θ̃_{1,1} as given in (2.5) is an improvement over Θ_{1,1}: the improvement involves χ²_{p−1}/N, where χ²_{p−1} has the chi-squared distribution with p − 1 degrees of freedom. With high probability this stays away from zero, so that also λ̃‖γ⁰‖₁ ≫ 0. For appropriate N with 1 − p/N ≫ 0, the vector γ⁰ is with high probability also allowed by Lemma 3.6. With this choice of N we have λ̃ ≍ √(log p/p). Recall that according to Remark 2.3 it must be true that pλ̃² ≫ 0. In the present context we in fact have pλ̃²/log p ≫ 0.

Creating γ 0 directly
We first recall the form of Θ̃₁ given in (2.5); it will be used in the constructions of this subsection. Consider some set S ⊂ {2, …, p} with cardinality s̃ and some λ̃. We will assume λ̃√s̃ → 0.
Let γ̃ be a vector in R^{p−1} with γ̃_{−S} = 0 (i.e. γ̃ = γ̃_S). Defining γ⁰ as in the previous subsection, one arrives at eligible pairs "with high probability".
We now examine the following question: can one choose γ̃_S in Lemma 3.7 equal to γ⁰_S? As we will see, this is only possible if a form of the irrepresentable condition holds. The "usual" irrepresentable condition (which implies the absence of false positives of the Lasso, see [27]) involves the coefficients of the projection of the "large" collection x_{−S} on the "small" collection x_S. In our case, we reverse the roles of S and −S.
We say that the reversed irrepresentable condition holds for (S, z_{−S}) if the corresponding inequality is satisfied. Lemma 3.8. a) Assume the reversed irrepresentable condition holds for (S, z_{−S}). Then, under the additional stated condition, (γ⁰_S, λ̃) is an eligible pair, γ⁰ is eventually allowed and λ̃‖γ⁰‖₁ → 0. b) Conversely, if for some γ⁰ and for γ̃ := γ⁰_S the pair (γ̃, λ̃) is an eligible pair, then the reversed irrepresentable condition holds for (S, z_{−S}) with appropriate z_{−S} satisfying ‖z_{−S}‖_∞ ≤ 1.

Creating γ 0 using a non-sparsity restriction
Consider some set S ⊂ {2, …, p} with cardinality s̃ and some λ̃ > 0. Let w ∈ R^{p−1} be a vector of strictly positive weights with ‖w‖_∞ ≤ 1 and define the matrix W as the diagonal matrix with w on its diagonal. Lemma 3.10. Under the stated conditions, let γ̃ be a vector satisfying γ̃_{−S} = 0. Then the pair (γ̃, λ̃) is eligible, γ⁰ is allowed and λ̃‖γ⁰‖₁ → 0.

Remark 4.4. Assume the conditions of Theorem 4.1, and that in fact
Then also √(log p) ‖Θ̂₁ − Θ̃₁‖₁ = o_P(1), which implies asymptotic linearity: the estimator b̂₁ is asymptotically linear, uniformly in β_0 ∈ B. The uniform asymptotic linearity of b̂₁ implies in turn that the Cramér Rao lower bound of Subsection 1.2 applies.
The results of the following two examples are summarized in Table 3. Suppose now, as in Theorem 4.1, that λ̃ = O(√(log p/n)) and √(log p/n) ‖γ̃‖₁ = o(1). In fact, assume that ‖γ̃‖₀ is small enough so that Θ̃₁ is a model direction. Then one obtains the same conclusion by the same arguments, if λ_Lasso ≍ √(log p/n) is suitably chosen.

Thus then (4.5) is met so that we have asymptotic linearity. It means that the Cramér Rao lower bound applies and is achieved.
The model (1.2) is too large for our methods to apply when Σ is unknown. We now turn to the model (1.3).

Conclusion
This paper illustrates that Θ_{1,1} can be larger than the asymptotic Cramér Rao lower bound, that for certain Σ the asymptotic variance of a debiased Lasso is smaller than Θ_{1,1}, and that in special such cases the asymptotic Cramér Rao lower bound is achieved. In Examples 1.2 and 1.3 we showed that if β_0 stays away from the boundary, then the asymptotic Cramér Rao lower bound is a minimum over vectors c satisfying a bound of the form ‖c‖_r^r ≤ M s^{1−r/2}, where M is any fixed value not depending on n. When Σ is known, Theorem 2.1 shows that up to log-terms this lower bound is achieved as soon as there exists an eligible pair (γ̃, λ̃) with ‖γ̃‖_r^r = O(n^{r/2} s^{(2−r)/2}). When Σ is unknown the situation is more involved, and in particular for model (1.1) sparsity variant (i) is replaced by the stronger variant (ii). Model (1.2) is too large for the case of unknown Σ, and model (1.3) requires more sparsity than model (1.1): if r is larger, the sparsity s is required to be smaller. Model (1.1), however, appears too stringent for both known and unknown Σ, as results depend on the exact value of s, not only on its order. Model (1.2) (with Σ known) or more generally model (1.3) (with 0 < r < 1 if Σ is unknown) do not suffer from such a dependence as long as β_0 stays away from the boundary.

Proof for Section 1
The proof of Proposition 1.1 relies on the results in [8], which allow the arguments to follow those of the low-dimensional case. These arguments are then rather standard.
By the Lindeberg condition, we can apply Lindeberg's central limit theorem to conclude that for any sequence a: Therefore, by the Cramér–Wold device
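The Cramér–Wold step is the standard reduction of joint convergence to convergence of all one-dimensional projections; schematically (Z n and V are placeholders for the bivariate quantities in the omitted display):

```latex
% Cramér–Wold device: if for every fixed a \in \mathbb{R}^2
a^T Z_n \xrightarrow{\;d\;} \mathcal{N}\bigl(0, a^T V a\bigr),
% then jointly
Z_n \xrightarrow{\;d\;} \mathcal{N}(0, V).
```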

S. van de Geer
We now apply a slight modification of Lemmas 16 and 23 in [8], in which we drop the assumption of bounded eigenvalues of Σ (this is possible because we have (1)). The asymptotic linearity of T and the 2-dimensional central limit theorem just obtained imply that at the alternative β 0 + h/√n it holds that As T is assumed to be regular at β 0 , we conclude that

But by the Cauchy-Schwarz inequality
Moreover, so that we obtain where in the last step we used h ᵀ Σh ≥ ‖h‖ 2 2 /Λ 2 min ≥ ρ 2 , so that 1/(h ᵀ Σh) = O(1). Since the result is true for all h ∈ H β 0 with |h 1 | ≥ ρ and h ᵀ Σh ≤ R 2 , we may maximize the right-hand side of (6.1) over all such h.
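The Cauchy–Schwarz step presumably takes the following form (a sketch, using only ΣΘ 1 = e 1 and the symmetry of Σ):

```latex
% Since \Sigma\Theta_1 = e_1, we can write h_1 = e_1^T h = \Theta_1^T \Sigma h.
% Cauchy--Schwarz in the inner product induced by \Sigma then gives
h_1^2 = \bigl(\Theta_1^T \Sigma h\bigr)^2
      \le \bigl(\Theta_1^T \Sigma \Theta_1\bigr)\bigl(h^T \Sigma h\bigr)
      = \Theta_{1,1}\, h^T \Sigma h .
```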

Proofs for Section 2
Proof of Lemma 2.1. Because x −1 γ 0 is the projection of x 1 on x −1 , we know that Thus We now rewrite where in the second equality we used that x 1 − x −1 γ 0 is the anti-projection of x 1 on x −1 and hence orthogonal to x −1 γ. For the cross-product we have, by the two conditions on the pair (γ, λ),
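In display form, the orthogonality relations used here read (a sketch; expectations are over the Gaussian row x):

```latex
% x_{-1}\gamma^0 is the projection of x_1 on x_{-1}, so the
% anti-projection x_1 - x_{-1}\gamma^0 is orthogonal to x_{-1}:
\mathbb{E}\,\bigl(x_1 - x_{-1}\gamma^0\bigr)\,x_{-1} = 0 \in \mathbb{R}^{1\times(p-1)},
% and hence for any \gamma \in \mathbb{R}^{p-1}
\mathbb{E}\,\bigl(x_1 - x_{-1}\gamma^0\bigr)\bigl(x_{-1}\gamma\bigr) = 0 .
```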
Combining this with inequality (6.2) proves the first result of the lemma. The second result follows trivially from this. For the third result, we compute and re-use the already obtained results: To show the final statement of the lemma, assume on the contrary that γ 0 is sparse: λ ‖γ 0 ‖ 1 → 0. Then

Proof of Theorem 2.1. We use the decomposition from the beginning of this section, applied to b̂ I,1 .
Here ε I := Y I − X I β 0 and Σ̂ I := 2X I ᵀ X I /n. But, given (X II , Y II ), this is the average of n/2 i.i.d. random variables, each the product of a random variable with the N(0, Θ 1 ᵀ ΣΘ 1 )-distribution and a N(0, ‖Σ 1/2 (β̂ II − β 0 )‖ 2 2 )-distributed random variable. Since the variances satisfy Θ 1 ᵀ ΣΘ 1 = O(1) and ‖Σ 1/2 (β̂ II − β 0 )‖ 2 2 = o P (1), the term is negligible uniformly in β 0 ∈ B. For the term (ii) we use the assumption on λ ‖β̂ II − β 0 ‖ 1 . In the same way one derives the corresponding statement uniformly in β 0 ∈ B, with ε II := Y II − X II β 0 . Since b̂ 1 = (b̂ I,1 + b̂ II,1 )/2 is the average of the two, this proves the asymptotic linearity. Further, var(Θ 1 ᵀ X ᵀ ε/√n) = Θ 1 ᵀ ΣΘ 1 = Θ 1,1 + o(1) by Lemma 2.1. The central limit theorem completes the proof.
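The final variance computation rests on an exact population identity (the o(1) accounts for the surrogate direction of Lemma 2.1 rather than Θ 1 itself):

```latex
% \Sigma\Theta_1 = e_1 (the first unit vector), hence exactly
\Theta_1^T \Sigma \Theta_1 = \Theta_1^T e_1 = \Theta_{1,1} .
% For the (approximating) direction used in the proof this value
% becomes \Theta_{1,1} + o(1), by Lemma 2.1.
```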

Proofs for Section 3
Proof of Lemma 3.1. We have

Proof of Lemma 3.2. By the KKT conditions Therefore, where we used that λ Lasso > λ and λ Lasso ‖γ‖ 1 → 0. We also know by the KKT conditions that If λ Lasso ≥ 2λ, we obtain from the above So (γ Lasso , λ Lasso ) is an eligible pair.
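For reference, a sketch of the KKT conditions invoked here, assuming the noiseless node-wise Lasso minimizes γ ↦ IE(x 1 − x −1 γ)² + 2λ Lasso ‖γ‖ 1 (the exact objective is defined in Section 3):

```latex
% Stationarity and dual feasibility for the noiseless node-wise Lasso:
\Sigma_{-1,1} - \Sigma_{-1,-1}\,\gamma^{\mathrm{Lasso}}
  = \lambda^{\mathrm{Lasso}} z^{\mathrm{Lasso}},
\qquad \bigl\|z^{\mathrm{Lasso}}\bigr\|_\infty \le 1,
% with z_j^{\mathrm{Lasso}} = \mathrm{sign}\bigl(\gamma_j^{\mathrm{Lasso}}\bigr)
% whenever \gamma_j^{\mathrm{Lasso}} \neq 0.
```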

Proof of Lemma 3.3. Note that
Proof of Lemma 3.4. We have Thus (γ S , λ S ) is an eligible pair. Finally

Proof of Lemma 3.5. Write It holds that

Proof of Lemma 3.6. For a ∈ R and c ∈ R p−1 satisfying a 2 + Σ But then Hence Λ 2 min (Σ(γ 0 )) is positive and It further holds for all j ∈ {2, . . . , p} that

Proof of Lemma 3.7. By definition Moreover, λ ‖γ‖ 1 ≤ λ √s ‖γ‖ 2 → 0, since So (γ, λ) is an eligible pair. We further have where the positivity holds for large enough n. Therefore, by Lemma 3.6, γ 0 is eventually allowed. Finally, So, since λ ‖γ‖ 1 → 0, it must be true that λ ‖γ 0 ‖ 1 → 0. We in fact have

holds. By assumption ‖z −S ‖ ∞ ≤ 1 and, by the reversed irrepresentable condition, also ‖z S ‖ ∞ ≤ 1. Thus To see that γ 0 is allowed, we bound γ 0 ᵀ Σ −1,−1 γ 0 : Therefore, since λ ‖γ 0 S ‖ 1 → 0, and in view of Lemma 3.6, the vector γ 0 is allowed for large enough n. Finally, we have In fact, since (γ, λ) is an eligible pair, we have Define now c = γ 0 − γ and z = Σ −1,−1 c/λ. Then

Proof of Lemma 3.9. One readily verifies that all c 0 j with j ∉ S are non-zero. One thus has the Lagrangian

Proof of Lemma 3.10. It holds that Thus (γ, λ) is an eligible pair. Furthermore

Probability inequalities
In this section we present some probability inequalities for products of Gaussians. Such results are known (for example as Hanson–Wright inequalities for sub-Gaussians; see [21]) and are presented here only for completeness.
Write V i = λU i + W i , where W i is a zero-mean Gaussian random variable independent of U i . It follows that U ᵀ V/n = λ ‖U‖ 2 2 /n + U ᵀ W/n.
Since var(W i ) ≤ σ 2 for all i, we see from Lemma 7.2 that IP( U ᵀ W/n ≥ √2 σ √(t/n) + σ t/n ) ≤ exp[−t].
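For comparison, the bound has the same √(t/n) + t/n shape as the Laurent–Massart inequality for chi-square variables (a known benchmark, not taken from this paper):

```latex
% Laurent--Massart (2000): for \chi^2_n with n degrees of freedom, t > 0,
\mathbb{P}\Bigl(\chi^2_n/n - 1 \ge 2\sqrt{t/n} + 2t/n\Bigr) \le e^{-t},
\qquad
\mathbb{P}\Bigl(\chi^2_n/n - 1 \le -2\sqrt{t/n}\Bigr) \le e^{-t}.
```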