The Adaptive and the Thresholded Lasso for Potentially Misspecified Models (and a Lower Bound for the Lasso)

We revisit the adaptive Lasso as well as the thresholded Lasso with refitting, in a high-dimensional linear model, and study prediction error, ℓq-error (q ∈ {1, 2}), and number of false positive selections. Our theoretical results for the two methods are, at a rather fine scale, comparable. The differences only show up in terms of the (minimal) restricted and sparse eigenvalues, favoring thresholding over the adaptive Lasso. As regards prediction and estimation, the difference is virtually negligible, but our bound for the number of false positives is larger for the adaptive Lasso than for thresholding. We also study the adaptive Lasso under beta-min conditions, which are conditions on the size of the coefficients. We show that for exact variable selection, the adaptive Lasso generally needs more severe beta-min conditions than thresholding. Both two-stage methods add value to the one-stage Lasso in the sense that, under appropriate restricted and sparse eigenvalue conditions, they have prediction and estimation error similar to that of the one-stage Lasso but substantially fewer false positives. Regarding the latter, we provide a lower bound for the Lasso with respect to false positive selections.


Introduction
Consider the linear model
$$Y = X\beta + \epsilon,$$
where $\beta \in \mathbb{R}^p$ is a vector of coefficients, $X$ is an $(n \times p)$-design matrix, and $Y$ is an $n$-vector of noisy observations, $\epsilon$ being the noise term. We examine the case $p \ge n$, i.e., a high-dimensional situation. The design matrix $X$ is treated as fixed, and the Gram matrix is denoted by $\hat\Sigma := X^T X/n$. Throughout, we assume the normalization $\hat\Sigma_{j,j} = 1$ for all $j \in \{1, \ldots, p\}$.
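As a minimal sketch of this normalization (plain Python on a made-up 2 × 3 design; the helper names `normalize_columns` and `gram` are ours and do not come from the paper), each column is rescaled so that the diagonal of the empirical Gram matrix equals one. The sketch assumes that no column is identically zero.

```python
import math

def normalize_columns(X):
    """Rescale each column j of X (a list of rows) so that sum_i X[i][j]**2 / n == 1."""
    n, p = len(X), len(X[0])
    scales = [math.sqrt(sum(row[j] ** 2 for row in X) / n) for j in range(p)]
    return [[row[j] / scales[j] for j in range(p)] for row in X]

def gram(X):
    """Empirical Gram matrix X^T X / n."""
    n, p = len(X), len(X[0])
    return [[sum(row[j] * row[k] for row in X) / n for k in range(p)]
            for j in range(p)]

# Hypothetical design with n = 2 observations and p = 3 covariables (p >= n).
X_raw = [[1.0, 2.0, 0.5],
         [3.0, -1.0, 0.5]]
Sigma_hat = gram(normalize_columns(X_raw))
```

After this rescaling the off-diagonal entries of `Sigma_hat` are empirical correlations, so they lie in [-1, 1].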
This paper presents a theoretical comparison between the thresholded Lasso with refitting and the adaptive Lasso. Both methods are very popular in practical applications for reducing the number of active variables.
We emphasize here and describe later that we allow for model misspecification where the true regression function may be non-linear in the covariates. For such cases, we can consider the projection onto the linear span of the covariates. The (projected or true) linear model does not need to be sparse nor do we require that the non-zero regression coefficients (from a sparse approximation) are "sufficiently large". As for the latter, we will show in Lemma 3.3 how this can be invoked to improve the result. Furthermore, we also do not require the stringent irrepresentable conditions or incoherence assumptions on the design matrix X but only some weaker restricted or sparse eigenvalue conditions.
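Regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso (Tibshirani [1996]), refers to the following convex optimization problem:
$$\hat\beta := \arg\min_{\beta} \Big\{ \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \Big\}, \qquad (1)$$
where $\lambda > 0$ is a penalization parameter.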
Regularization with ℓ 1 -penalization in high-dimensional scenarios has become extremely popular. The methods are easy to use, due to recent progress in specifically tailored convex optimization (Meier et al. [2008], Friedman et al. [2010]).
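A two-stage version of the Lasso is the so-called adaptive Lasso
$$\hat\beta_{\rm adap} := \arg\min_{\beta} \Big\{ \|Y - X\beta\|_2^2/n + \lambda_{\rm init} \lambda_{\rm adap} \sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta_{j,{\rm init}}|} \Big\}.$$
Here, $\hat\beta_{\rm init}$ is the one-stage Lasso defined in (1), with initial tuning parameter $\lambda = \lambda_{\rm init}$, and $\lambda_{\rm adap} > 0$ is the tuning parameter for the second stage. Note that when $|\hat\beta_{j,{\rm init}}| = 0$, we exclude variable $j$ in the second stage. The adaptive Lasso was originally proposed by Zou [2006].
Another possibility is the thresholded Lasso with refitting. Define
$$\hat S_{\rm thres} = \{ j : |\hat\beta_{j,{\rm init}}| > \lambda_{\rm thres} \},$$
which is the set of variables having estimated coefficients larger than some given threshold $\lambda_{\rm thres}$. The refitting is then done by ordinary least squares:
$$\hat b_{\rm thres} = \arg\min_{\beta = \beta_{\hat S_{\rm thres}}} \|Y - X\beta_{\hat S_{\rm thres}}\|_2^2/n,$$
where, for a set $S \subset \{1, \ldots, p\}$, $\beta_S$ has coefficients different from zero at the components in $S$ only.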
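The two two-stage procedures can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used for the paper: the solver is a simple cyclic coordinate descent of our own choosing, the adaptive penalty is assumed to be $\lambda_{\rm init}\lambda_{\rm adap}|\beta_j|/|\hat\beta_{j,{\rm init}}|$ (with an infinite penalty when the initial coefficient is zero), and the data are a made-up noiseless toy example with an orthogonal design and p < n, chosen only so that the result can be checked by hand. All function names are hypothetical.

```python
import math

def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return max(z - t, 0.0) - max(-z - t, 0.0)

def lasso_cd(X, y, pen, sweeps=200):
    """Cyclic coordinate descent for ||y - Xb||_2^2 / n + sum_j pen[j] * |b_j|."""
    n, p = len(X), len(X[0])
    b, r = [0.0] * p, list(y)                    # r holds the residual y - Xb
    sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Partial correlation of column j with the residual that ignores b_j.
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + sq[j] * b[j]
            new = soft(rho, pen[j] / 2.0) / sq[j]
            for i in range(n):
                r[i] -= X[i][j] * (new - b[j])
            b[j] = new
    return b

def ols_refit(X, y, S):
    """Least squares on the columns in S (Gauss-Jordan on the normal equations)."""
    n, k, p = len(X), len(S), len(X[0])
    A = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in S]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in S]
    for c in range(k):
        piv = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[piv] = A[piv], A[c]
        for row in range(k):
            if row != c:
                f = A[row][c] / A[c][c]
                A[row] = [u - f * v for u, v in zip(A[row], A[c])]
    b = [0.0] * p
    for idx, j in enumerate(S):
        b[j] = A[idx][k] / A[idx][idx]
    return b

# Toy data: orthogonal columns with X^T X / n = I, and y = 2 * X_1 + 0.3 * X_2.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]]
y = [2.3, 1.7, 2.3, 1.7]
p = 3
lam_init, lam_thres, lam_adap = 0.4, 0.5, 0.5

# Stage 1: initial Lasso.
b_init = lasso_cd(X, y, [lam_init] * p)

# Stage 2a: threshold the initial estimate and refit by least squares.
S_thres = [j for j in range(p) if abs(b_init[j]) > lam_thres]
b_thres = ols_refit(X, y, S_thres)

# Stage 2b: adaptive Lasso; variables with zero initial estimate get infinite penalty.
pen_adap = [lam_init * lam_adap / abs(bj) if bj != 0.0 else math.inf for bj in b_init]
b_adap = lasso_cd(X, y, pen_adap)
```

In this toy example the initial Lasso keeps the weak second variable with a shrunken coefficient, while both second stages drop it; the refit removes the shrinkage on the first coefficient entirely, whereas the adaptive Lasso only reduces it.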
We will present bounds for the prediction error, the $\ell_q$-error (q ∈ {1, 2}), and the number of false positives. The bounds for the two methods are qualitatively the same. A difference is that our variable selection results for the adaptive Lasso depend on its prediction error, whereas for the thresholded Lasso, variable selection can be studied without reference to its prediction error. In our analysis this leads to a bound for the number of false positives of the thresholded Lasso that is smaller than the one for the adaptive Lasso, when restricted or sparse minimal eigenvalues are small and/or sparse maximal eigenvalues are large.
Of course, such comparisons depend on how the tuning parameters are chosen. Choosing these by cross validation is in our view the most appropriate, but it is beyond the scope of this paper to present a mathematically rigorous theory for the cross validation scheme for the adaptive and/or thresholded Lasso (see Arlot and Celisse [2010] for a recent survey on cross validation).

Related work
Consistency results for the prediction error of the Lasso can be found in Greenshtein and Ritov [2004]. The prediction error is asymptotically oracle optimal under certain conditions on the design matrix X, see e.g. Bunea et al. [2006, 2007a], van de Geer [2008], Bickel et al. [2009], Koltchinskii [2009a,b], where also estimation in terms of the $\ell_1$- or $\ell_2$-loss is considered. The "restricted eigenvalue condition" of Bickel et al. [2009] (see also Koltchinskii [2009a,b]) plays a key role here. Restricted eigenvalue conditions are implied by, but generally much weaker than, "incoherence" conditions, which exclude high correlations between co-variables. Also Candès and Plan [2009] allow for a major relaxation of incoherence conditions, using assumptions on the set of true coefficients.
There is however a bias problem with $\ell_1$-penalization, due to the shrinking of the estimates which correspond to true signal variables. A discussion can be found in Zou [2006] and Meinshausen [2007]. Moreover, for consistent variable selection with the Lasso, it is known that the so-called "neighborhood stability condition" (Meinshausen and Bühlmann [2006]) for the design matrix, which has been re-formulated in a nicer form as the "irrepresentable condition" (Zhao and Yu [2006]), is sufficient and essentially necessary. Wainwright [2007, 2009] analyzes the smallest sample size needed to recover a sparse signal under certain incoherence conditions. Because irrepresentable or incoherence conditions are restrictive and much stronger than restricted eigenvalue conditions (see van de Geer and Bühlmann [2009] for a comparison), we conclude that the Lasso for exact variable selection only works in a rather narrow range of problems, excluding for example some cases where the design exhibits strong (empirical) correlations.
Regularization with the $\ell_q$-"norm" with q < 1 will mitigate some of the bias problems, see Zhang [2010]. Related are multi-step procedures where each of the steps involves a convex optimization only. A prime example is the adaptive Lasso, which is a two-step algorithm and whose repeated application corresponds in some "loose" sense to a non-convex penalization scheme (Zou and Li [2008]). Zou [2006] analyzed the adaptive Lasso in an asymptotic setup for the case where p is fixed. Further progress has been achieved in the high-dimensional scenario: under a rather strong mutual incoherence condition between every pair of relevant and irrelevant covariables, it has been proved that the adaptive Lasso recovers the correct model and has an oracle property. As we will explain in Subsection 6.5, the adaptive Lasso indeed essentially needs a (still quite restrictive) weighted version of the irrepresentable condition in order to be able to correctly estimate the support of the coefficients. Meinshausen and Yu [2009] examine the thresholding procedure, assuming all non-zero components are large enough, an assumption we will avoid. Thresholding and multistage procedures are also considered in Candès et al. [2006], Candès et al. [2008]. In Zhou [2010], it is shown that a multi-step thresholding procedure can accurately estimate a sparse vector $\beta \in \mathbb{R}^p$ under the restricted eigenvalue condition of Bickel et al. [2009]. The two-stage procedure in Zhang [2009] applies "selective penalization" in the second stage. This procedure is studied assuming incoherence conditions. A more general framework for multi-stage variable selection was studied by Wasserman and Roeder [2009]. Their approach controls the probability of false positives (type I error) but pays a price in terms of false negatives (type II error). The main contribution of this paper is that we provide bounds for the adaptive Lasso that are comparable to the bounds for the Lasso followed by a thresholding procedure.
Because the true regression itself, or its linear projection, is perhaps not sparse, we moreover consider a sparse approximation of the truth.

Organization of the paper
The next section introduces the sparse oracle approximation, with which we compare the initial and adaptive Lasso. In Section 3, we present the main results. Eigenvalues and their restricted and sparse counterparts are defined in Section 4. Some conclusions are presented in Section 5.
The rest of the paper presents intermediate results and complements for establishing the main results of Section 3. In Section 6, we consider the noiseless case, i.e., the case where $\epsilon = 0$. The reason is that many of the theoretical issues involved concern the approximation properties of the two-stage procedure, and not so much the fact that there is noise. By studying the noiseless case first, we separate the approximation problem from the stochastic problem.
Both initial and adaptive Lasso are special cases of a weighted Lasso. We discuss prediction error, $\ell_q$-error (q ∈ {1, 2}) and variable selection with the weighted Lasso in Subsection 6.1. Theorem 6.1 in this subsection is the core of the present work, as regards prediction and estimation. Lemma 6.1 in this subsection is the main result as regards variable selection. The results for the noiseless initial and adaptive Lasso are simple corollaries of Theorem 6.1 and Lemma 6.1. We give in Subsection 6.2 the resulting bounds for the initial Lasso and discuss in Subsection 6.3 its thresholded version. In Subsection 6.4 we derive results for the adaptive Lasso by comparing it with a thresholded initial Lasso. Moreover, Subsection 6.5 briefly discusses the weighted irrepresentable condition, to show that even the adaptive Lasso needs strong conditions on the design for exact variable selection. This subsection is linked to Corollary 3.2, where it is proved that the false positives of the adaptive Lasso vanish if the coefficients of the oracle are sufficiently large.
Section 7 studies the noisy case. It is an easy extension of the results of Sections 6.1, 6.2, 6.3 and 6.4. We do however need to further specify the choice of the tuning parameters λ init and λ adap . After explaining the notation, we present the bounds for the prediction error, estimation error and for the number of false positives, of the weighted Lasso. This then provides us with the tools to prove the main results.
All proofs are in Section 8. Here, we also present explicit constants in the bounds to highlight the non-asymptotic character of the results.

Model misspecification, weak variables and the oracle
Let $Y = f^0 + \epsilon$, where $f^0$ is the regression function. First, we note that without loss of generality, we can assume that $f^0$ is linear. If $f^0$ is non-linear in the covariates, we consider its projection $X\beta_{\rm true}$ onto the linear space $\{X\beta : \beta \in \mathbb{R}^p\}$, i.e.,
$$X\beta_{\rm true} := \arg\min_{X\beta} \|f^0 - X\beta\|_2.$$
It is not difficult to see that all our results still hold if $f^0$ is replaced by its projection $X\beta_{\rm true}$. The statistical implication is very relevant. The mathematical argument is the orthogonality
$$X^T (X\beta_{\rm true} - f^0) = 0.$$
For ease of notation, we therefore assume from now on that $f^0$ is indeed linear:
$$f^0 := X\beta_{\rm true}.$$
Nevertheless, $\beta_{\rm true}$ itself may not be sparse. Denote the active set of $\beta_{\rm true}$ by
$$S_{\rm true} := \{ j : \beta_{j,{\rm true}} \neq 0 \},$$
which has cardinality $s_{\rm true} := |S_{\rm true}|$. It may well be that $s_{\rm true}$ is quite large, but that there are many weak variables, that is, many very small non-zero coefficients in $\beta_{\rm true}$. Therefore, the sparse object we aim to recover may not be the "true" unknown parameter $\beta_{\rm true} \in \mathbb{R}^p$ of the linear regression, but rather a sparse approximation. We believe that an extension to the case where $f^0$ is only "approximately" sparse better reflects the true state of nature. We emphasize however that throughout the paper, it is allowed to replace the oracle approximation $b^0$ given below by $\beta_{\rm true}$. This would simplify the theory. However, we have chosen not to follow this route because it generally leads to a large price to pay in the bounds.
The sparse approximation of $f^0$ that we consider is defined as follows. For a set of indices $S \subset \{1, \ldots, p\}$ and for $\beta \in \mathbb{R}^p$, we let
$$\beta_{j,S} := \beta_j 1\{j \in S\}, \quad j = 1, \ldots, p.$$
Given a set $S$, the best approximation of $f^0$ using only variables in $S$ is
$$f_S := X b^S, \quad b^S := \arg\min_{\beta = \beta_S} \|X\beta - f^0\|_2.$$
Thus, $f_S$ is the projection of $f^0$ on the linear span of the variables in $S$. Our target is now the projection $f_{S_0}$, where $S_0 \subset S_{\rm true}$ minimizes the sum of the approximation error $\|f_S - f^0\|_2^2/n$ and a penalty proportional to $\lambda_{\rm init}^2 |S|/\phi^2(6, S)$. Here, $|S|$ denotes the size of $S$. Moreover, $\phi^2(6, S)$ is a "restricted eigenvalue" (see Section 4 for its definition), which depends on the Gram matrix $\hat\Sigma$ and on the set $S$. The constants are chosen in relation with the oracle result (see Corollary 8.3). In other words, $f_{S_0}$ is the optimal $\ell_0$-penalized approximation, albeit that it is discounted by the restricted eigenvalue $\phi^2(6, S_0)$. To facilitate the interpretation, we require $S_0$ to be a subset of $S_{\rm true}$, so that the oracle is not allowed to trade irrelevant coefficients against restricted eigenvalues. With $S_0 \subset S_{\rm true}$, any false positive selection with respect to $S_{\rm true}$ is also a false positive for $S_0$.
We refer to $f_{S_0}$ as the "oracle". The set $S_0$ is called the oracle active set, and $b^0 = b^{S_0}$ are the oracle coefficients, i.e., $f_{S_0} = X b^0$. We write $s_0 = |S_0|$.
Inferring the sparsity pattern, i.e. variable selection, refers to the task of estimating the set of non-zero coefficients, that is, to have a limited number of false positives (type I errors) and false negatives (type II errors). It can be verified that under reasonable conditions with suitably chosen tuning parameter $\lambda$, the "ideal" estimator has $O(\lambda^2 s_0)$ prediction error and $O(s_0)$ false positives (see for instance Barron et al. [1999] and van de Geer [2001]). With this in mind, we generally aim at $O(s_0)$ false positives (see also Zhou [2010]), yet keeping the prediction error as small as possible (see Corollary 3.1).
As regards false negative selections, we refer to Subsection 3.5, where we derive bounds based on the ℓ q -error.

Main conditions
The behavior of the thresholded Lasso and adaptive Lasso depends on the tuning parameters, on the design, as well as on the true $f^0$, and actually on the interplay between these quantities. To keep the exposition clear, we will use order symbols. Our expressions are functions of $n$, $p$, $X$, and $f^0$, and also of the tuning parameters $\lambda_{\rm init}$, $\lambda_{\rm thres}$, and $\lambda_{\rm adap}$. For positive functions $g$ and $h$, we say that $g = O(h)$ if $\|g/h\|_\infty$ is bounded, and $g \asymp h$ if in addition $\|h/g\|_\infty$ is bounded. Moreover, we say that $g = O_{\rm suff}(h)$ if $\|g/h\|_\infty$ is not larger than a suitably chosen sufficiently small constant, and $g \asymp_{\rm suff} h$ if in addition $\|h/g\|_\infty$ is bounded.
Our results depend on restricted eigenvalues $\phi(L, S, N)$, minimal restricted eigenvalues $\phi_{\min}(L, S, N)$, and minimal sparse eigenvalues $\phi_{\rm sparse}(S, N)$ (which we generally think of as being not too small), as well as on maximal sparse eigenvalues $\Lambda_{\rm sparse}(s)$ (which we generally think of as being not too large). The exact definition of these constants is given in Section 4.
To simplify the expressions, we assume throughout that
$$\|f_{S_0} - f^0\|_2^2/n = O\big( \lambda_{\rm init}^2 s_0/\phi^2(6, S_0) \big) \qquad (4)$$
(where $\phi(6, S_0) = \phi(6, S_0, s_0)$), which roughly says that the oracle "squared bias" term is not substantially larger than the oracle "variance" term. For example, in the case of orthogonal design, this condition holds if the small non-zero coefficients are small enough, or if there are not too many of them. We stress that (4) is merely to write order bounds for the oracle, bounds with which we compare the ones for the various Lasso versions. If actually the "squared bias" term is the dominating term, this mathematically does not alter the theory but makes the result more difficult to interpret.
We will furthermore discuss the results on the set
$$T := \Big\{ 4 \max_{1 \le j \le p} |\epsilon^T X_j|/n \le \lambda_{\rm init} \Big\},$$
where $X_j$ is the $j$-th column of the matrix $X$. For an appropriate choice of $\lambda_{\rm init}$, depending on the distribution of $\epsilon$, the set $T$ has large probability. Typically, $\lambda_{\rm init}$ can be taken of order $\sqrt{\log p/n}$. The next lemma serves as an example, but the results can clearly be extended to other distributions.
Lemma 3.1 Suppose that $\epsilon \sim N(0, \sigma^2 I)$. Take for a given $t > 0$,
$$\lambda_{\rm init} = 4\sigma \sqrt{\frac{2t + 2\log p}{n}}.$$
Then
$$\mathbb{P}(T) \ge 1 - 2\exp[-t].$$
The following conditions play an important role. Conditions A and AA for thresholding are similar to those in Zhou [2010] (Theorems 1.2, 1.3 and 1.4).
Condition A For the thresholded Lasso, the threshold level $\lambda_{\rm thres}$ is chosen sufficiently large, in such a way that
Condition AA For the thresholded Lasso, the threshold level $\lambda_{\rm thres}$ is chosen sufficiently large, but such that
Condition B For the adaptive Lasso, the tuning parameter $\lambda_{\rm adap}$ is chosen sufficiently large, in such a way that
Condition BB For the adaptive Lasso, the tuning parameter $\lambda_{\rm adap}$ is chosen sufficiently large, but such that
The above conditions can be considered with a zoomed-out look, neglecting the expressions in the square brackets ([· · ·]), and a zoomed-in look, taking into account what is inside the square brackets. One may think of $\lambda_{\rm init}$ as the noise level (see e.g. Lemma 3.1, with the log p-term the price for not knowing the relevant coefficients a priori). Zooming out, Conditions A and B say that the threshold level $\lambda_{\rm thres}$ and the tuning parameter $\lambda_{\rm adap}$ are required to be at least of the same order as $\lambda_{\rm init}$, i.e., they should not drop below the noise level. Conditions AA and BB put these parameters exactly at the noise level, i.e., at the smallest value we allow. The reason to do this is that one then can have good prediction and estimation bounds. If we zoom in, we see in the square brackets the role played by the various eigenvalues. As they are defined only later in Section 4, it is at first reading perhaps easiest to remember that the φ's can be small and the Λ's can be large, but one hopes they behave well, in the sense that the values in the square brackets are not too large.
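To get a feel for the noise level $\lambda_{\rm init}$ of Lemma 3.1, the following small sketch evaluates it for a few made-up values of $\sigma$, $t$, $p$ and $n$. The function names are ours. The lower bound $1 - 2\exp[-t]$ on the probability of $T$ is what a union bound over the $p$ columns combined with the Gaussian tail inequality gives when $T$ is taken to be $\{4 \max_j |\epsilon^T X_j|/n \le \lambda_{\rm init}\}$; that form of $T$ is an assumption of this sketch.

```python
import math

def lambda_init(sigma, t, p, n):
    """Noise level 4 * sigma * sqrt((2t + 2 log p) / n) from Lemma 3.1."""
    return 4.0 * sigma * math.sqrt((2.0 * t + 2.0 * math.log(p)) / n)

def prob_T_lower_bound(t):
    """Assumed lower bound 1 - 2 exp(-t) on the probability of the set T."""
    return 1.0 - 2.0 * math.exp(-t)

# Hypothetical settings: same noise and sample size, different dimensions p.
lam_small_p = lambda_init(sigma=1.0, t=2.0, p=10, n=100)
lam_large_p = lambda_init(sigma=1.0, t=2.0, p=10_000, n=100)
coverage = prob_T_lower_bound(2.0)
```

The dimension enters only through $\log p$, so going from p = 10 to p = 10,000 raises the noise level by a moderate factor rather than by orders of magnitude.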

The results
The next three theorems contain the main ingredients of the present work. Theorem 3.1 is not new (see e.g. Bunea et al. [2006, 2007a], Bickel et al. [2009], Koltchinskii [2009a]), albeit that we replace the perhaps non-sparse $\beta_{\rm true}$ by the sparser $b^0$ (see also van de Geer [2008]). Recall that the latter replacement is done because it generally yields an improvement of the bounds.
Theorem 3.1 For the initial Lasso $\hat\beta_{\rm init} = \hat\beta$ defined in (1), we have on $T$,
The next theorem discusses thresholding. The results correspond to those in Zhou [2010], and will be invoked to prove similar bounds for the adaptive Lasso, as presented in Theorem 3.3.
Theorem 3.2 Suppose Condition A holds. Then on T ,

Theorem 3.3 Suppose Condition B holds. Then on T ,
We did not present a bound for the number of false positives of the initial Lasso: it can be quite large depending on further conditions as given in Lemma 7.1. A rough bound is presented in Lemma 3.2.
Theorems 3.2 and 3.3 show how the results depend on the choice of the tuning parameters $\lambda_{\rm thres}$ and $\lambda_{\rm adap}$. The following corollary takes the choices of Conditions AA and BB, as these choices give the smallest prediction and estimation error.
Corollary 3.1 Suppose we are on $T$. Then, under Condition AA, Similarly, under Condition BB, and
Remark 3.1 Note that our conditions on $\lambda_{\rm thres}$ and $\lambda_{\rm adap}$ depend on the φ's and Λ's, which are unknown. Indeed, our study is of theoretical nature, revealing common features of thresholding and the adaptive Lasso. Furthermore, it is possible to remove the dependence on the φ's and Λ's when one imposes stronger sparse eigenvalue conditions. In practice, the tuning parameters are generally chosen by cross validation.

Comparison with the Lasso
At the zoomed-out level, where all φ's and Λ's are neglected, we see that the thresholded Lasso (under Condition AA) and the adaptive Lasso (under Condition BB) achieve the same order of magnitude for the prediction error as the initial, one-stage Lasso discussed in Theorem 3.1. The same is true for their estimation errors. Zooming in on the φ's and the Λ's, their error bounds are generally larger than for the initial Lasso.
For comparison in terms of false positives, we need a corresponding bound for the initial Lasso. In the literature, one can find results that ensure that also for the initial Lasso, modulo φ's and Λ's, the number of false positives is of order $s_0$. However, this result requires rather involved conditions which also improve the bounds for the adaptive and thresholded Lasso. We briefly address this refinement in Subsection 7.3, imposing a condition of a similar nature. Also under these stronger conditions, the general message remains that thresholding and the adaptive Lasso can have similar prediction and estimation error as the initial Lasso, and are often far better as regards variable selection. In this section, we confine ourselves to the following lemma. Here, $\Lambda_{\max}^2$ is the largest eigenvalue of $\hat\Sigma$, which can generally be quite large.

Comparison between adaptive and thresholded Lasso
When zooming out, we see that the adaptive and thresholded Lasso have bounds of the same order of magnitude, for prediction, estimation and variable selection.
At the zoomed-in level, the adaptive and thresholded Lasso also have very similar bounds for the prediction error (compare (5) with (7)) in terms of the φ's and Λ's. A similar conclusion holds for their estimation error. We remark that our choice of Conditions AA and BB for the tuning parameters is motivated by the fact that according to our theory, these give the smallest prediction and estimation errors. It then turns out that the "optimal" errors of the two methods match at a quite detailed level. However, if we zoom in even further and look at the definition of $\phi_{\rm sparse}$, $\phi$, and $\phi_{\min}$ in Section 4, it will show up that the bounds for the adaptive Lasso prediction and estimation error are (slightly) larger.
Regarding variable selection, at zoomed-out level the results are also comparable (see (6) and (8)). Zooming in on the φ's and Λ's, the adaptive Lasso may have more false positives than the thresholded version.
A conclusion is that at the zoomed-in level, the adaptive Lasso has less favorable bounds than the refitted thresholded Lasso. However, these are still only bounds, which are based on focussing on a direct comparison between the two methods, and we may have lost the finer properties of the adaptive Lasso. Indeed, the non-explicitness of the adaptive Lasso makes its analysis a non-trivial task. The adaptive Lasso is a quite popular practical method, and we certainly do not advocate that it should always be replaced by thresholding and refitting.

Bounds for the number of false negatives
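The $\ell_q$-error has immediate consequences for the number of false negatives: if for some estimator $\hat\beta$, some target $b^0$, and some constant $\delta_q^{\rm upper}$ one has
$$\|\hat\beta - b^0\|_q \le \delta_q^{\rm upper},$$
then the number of undetected yet large coefficients cannot be very large, in the sense that, for any $\delta > 0$,
$$|\{ j : \hat\beta_j = 0, \ |b_j^0| > \delta \}|^{1/q} \le \delta_q^{\rm upper}/\delta.$$
Therefore, on $T$, the $\ell_q$-bounds of Theorem 3.1 translate directly into bounds on the number of false negatives of the initial Lasso. Similar bounds hold for the thresholded and the adaptive Lasso (considering now, in terms of the φ's and Λ's, somewhat larger $|b_j^0|$). One may argue that one should not aim at detecting variables that the oracle considers as irrelevant. Nevertheless, given an estimator $\hat\beta$, it is straightforward to bound $\|\hat\beta - \beta_{\rm true}\|_q$ in terms of $\|\hat\beta - b^0\|_q$: apply the triangle inequality
$$\|\hat\beta - \beta_{\rm true}\|_q \le \|\hat\beta - b^0\|_q + \|b^0 - \beta_{\rm true}\|_q.$$
Moreover, for q = 2, one has the inequality
$$\|b^0 - \beta_{\rm true}\|_2^2 \le \frac{\|f_{S_0} - f^0\|_2^2/n}{\Lambda_{\min}^2(S_{\rm true})},$$
where $\Lambda_{\min}^2(S)$ is the smallest eigenvalue of the Gram matrix corresponding to the variables in $S$. One may verify that $\phi(6, S_{\rm true}) \le \Lambda_{\min}(S_{\rm true})$. In other words, choosing $\beta_{\rm true}$ as target instead of $b^0$ does not, in our approach, lead to an improvement in the bounds for $\|\hat\beta - \beta_{\rm true}\|_2$.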
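The counting argument behind the false negative bound is elementary: every index $j$ with $\hat\beta_j = 0$ and $|b_j^0| > \delta$ contributes more than $\delta^q$ to $\|\hat\beta - b^0\|_q^q$. The sketch below checks the resulting inequality, count $\le (\|\hat\beta - b^0\|_q/\delta)^q$, on made-up vectors. The function names and the numbers are hypothetical.

```python
def lq_error(beta_hat, b0, q):
    """l_q distance between an estimate and a target vector."""
    return sum(abs(bh - b) ** q for bh, b in zip(beta_hat, b0)) ** (1.0 / q)

def undetected_large(beta_hat, b0, delta):
    """Number of coefficients with |b0_j| > delta that the estimate sets to zero."""
    return sum(1 for bh, b in zip(beta_hat, b0) if bh == 0.0 and abs(b) > delta)

# Hypothetical estimate and oracle target.
beta_hat = [1.9, 0.0, 0.0, 0.4]
b0 = [2.0, 0.5, 0.05, 0.0]
delta = 0.3

n_false_neg = undetected_large(beta_hat, b0, delta)
bound_q1 = (lq_error(beta_hat, b0, 1) / delta) ** 1
bound_q2 = (lq_error(beta_hat, b0, 2) / delta) ** 2
```

The bound holds for any pair of vectors, so it can be far from sharp in a particular example; its value lies in the fact that only the $\ell_q$-error is needed to control the false negatives.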

Having large coefficients
Let us have a closer look at what conditions on the size of the coefficients can bring us. We only discuss the adaptive Lasso (thresholding again giving similar results, see also Zhou [2010]).
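We define
$$|b^0|_{\min} := \min_{j \in S_0} |b_j^0|.$$
Moreover, we let
$$|b^0|_{\rm harm}^2 := \Big( \frac{1}{s_0} \sum_{j \in S_0} \frac{1}{|b_j^0|^2} \Big)^{-1}$$
be the harmonic mean of the squared coefficients.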
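For illustration, the harmonic mean of the squared coefficients can be computed as follows. The sketch assumes the mean is taken over the oracle active set $S_0$ (on which all coefficients are non-zero), and the coefficient values are made up.

```python
def harmonic_mean_sq(b0, S0):
    """Harmonic mean of the squared coefficients b0[j]**2 over j in S0."""
    return len(S0) / sum(1.0 / b0[j] ** 2 for j in S0)

def arithmetic_mean_sq(b0, S0):
    """Arithmetic mean of the squared coefficients, for comparison."""
    return sum(b0[j] ** 2 for j in S0) / len(S0)

# Hypothetical oracle coefficients with one strong and one weak variable.
b0 = [2.0, 0.5, 0.0, 0.0]
S0 = [0, 1]

harm = harmonic_mean_sq(b0, S0)
arith = arithmetic_mean_sq(b0, S0)
min_sq = min(b0[j] ** 2 for j in S0)
```

The harmonic mean always lies between the smallest squared coefficient and the arithmetic mean, and it is dominated by the small coefficients. A condition that requires it to be large is therefore a beta-min type condition, but one that is less demanding than requiring every single coefficient to be large.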
Condition C For the adaptive Lasso, take $\lambda_{\rm adap}$ sufficiently large, such that
Condition CC For the adaptive Lasso, take $\lambda_{\rm adap}$ sufficiently large, but such that
Then under Condition C,
It is clear that by Theorem 3.1, This can be improved under coherence conditions on the Gram matrix. To simplify the exposition, we will not discuss such improvements in detail (see Lounici [2008]).
Under Condition CC, the bound for the prediction error and estimation error is again the smallest. We moreover have the following corollary for the number of false positives.
Corollary 3.2 Assume the conditions of Lemma 3.3 and
Then on $T$,
By assuming that $|b^0|_{\rm harm}$ is sufficiently large, that is, one can bring $|\hat S_{\rm adap} \setminus S_0|$ down to zero, i.e., no false positives. One may verify that this boils down to a situation where the weighted irrepresentable condition holds: see Example 6.1 in Subsection 6.5.
As discussed in Section 3.5, large non-zero coefficients also lead to a small number or eventually zero false negative selections. Therefore, the adaptive and thresholded Lasso are recovering the support of S 0 if all of its non-zero coefficients are sufficiently large (in absolute value), assuming much weaker conditions on the design than the (unweighted) irrepresentable condition, which is necessary for the Lasso.

Notation and definition of generalized eigenvalues
We reformulate the problem in $L_2(Q)$, where $Q$ is a generic probability measure on some space $\mathcal{X}$. (This is somewhat more natural in the noiseless case, which we will consider in Section 6.) Let $\{\psi_j\}_{j=1}^p \subset L_2(Q)$ be a given dictionary. For $j = 1, \ldots, p$, the function $\psi_j$ will play the role of the $j$-th co-variable. The Gram matrix is
$$\Sigma := \int \psi^T \psi \, dQ, \quad \psi := (\psi_1, \ldots, \psi_p).$$
We assume that $\Sigma$ is normalized, i.e., that $\int \psi_j^2 \, dQ = 1$ for all $j$. In our final results, we will actually take $\Sigma = \hat\Sigma$, the (empirical) Gram matrix corresponding to fixed design.
Write a linear function of the $\psi_j$ with coefficients $\beta \in \mathbb{R}^p$ as
$$f_\beta := \sum_{j=1}^p \psi_j \beta_j.$$
The $L_2(Q)$-norm is denoted by $\|\cdot\|$, so that
$$\|f_\beta\|^2 = \beta^T \Sigma \beta.$$
Recall that for an arbitrary $\beta \in \mathbb{R}^p$, and an arbitrary index set $S$, we use the notation $\beta_{j,S} = \beta_j 1\{j \in S\}$.
We now present our notation for eigenvalues. We also introduce restricted eigenvalues and sparse eigenvalues.

Eigenvalues
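The largest eigenvalue of $\Sigma$ is denoted by $\Lambda_{\max}^2$, i.e.,
$$\Lambda_{\max}^2 := \max_{\|\beta\|_2 = 1} \beta^T \Sigma \beta.$$
We will also need the largest eigenvalue of a submatrix containing the inner products of variables in $S$:
$$\Lambda_{\max}^2(S) := \max_{\|\beta_S\|_2 = 1} \beta_S^T \Sigma \beta_S.$$
Its minimal eigenvalue is
$$\Lambda_{\min}^2(S) := \min_{\|\beta_S\|_2 = 1} \beta_S^T \Sigma \beta_S.$$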
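These quantities can be approximated numerically for a small Gram matrix. The sketch below uses plain power iteration, which is our choice for a dependency-free illustration and not a method from the paper; the helper names and the 3 × 3 matrix are made up. It assumes the usual variational characterization of the largest and smallest eigenvalue of a symmetric positive semi-definite matrix. Power iteration can stall if the starting vector happens to be orthogonal to the leading eigenvector, and it converges slowly when the two leading eigenvalues are close.

```python
import math

def max_eig(A, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix A by power iteration."""
    m = len(A)
    v = [1.0 / (i + 1) for i in range(m)]        # deliberately asymmetric start
    for _ in range(iters):
        w = [sum(A[i][k] * v[k] for k in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit-norm) final iterate.
    return sum(v[i] * A[i][k] * v[k] for i in range(m) for k in range(m))

def min_eig(A, iters=500):
    """Smallest eigenvalue of a symmetric PSD matrix via the shift c*I - A."""
    m = len(A)
    c = sum(A[i][i] for i in range(m))           # trace >= largest eigenvalue
    shifted = [[(c if i == k else 0.0) - A[i][k] for k in range(m)]
               for i in range(m)]
    return c - max_eig(shifted, iters)

def submatrix(A, S):
    """Rows and columns of A indexed by S."""
    return [[A[j][k] for k in S] for j in S]

# Hypothetical normalized Gram matrix: variables 0 and 1 are correlated.
Sigma = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
S = [0, 1]

Lambda2_max = max_eig(Sigma)
Lambda2_max_S = max_eig(submatrix(Sigma, S))
Lambda2_min_S = min_eig(submatrix(Sigma, S))
```

In this example the correlation of 0.5 between the first two variables pushes the largest eigenvalue up to 1.5 and the smallest eigenvalue on S down to 0.5, which illustrates why strongly correlated designs lead to large Λ's and small φ's.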

Restricted eigenvalues
A restricted eigenvalue is of similar nature as the minimal eigenvalue of Σ, but with the coefficients β restricted to certain subsets of R p . The restricted eigenvalue condition we impose corresponds to the so-called adaptive version as introduced in van de Geer and Bühlmann [2009]. It differs from the restricted eigenvalue condition in Bickel et al. [2009] or Koltchinskii [2009a,b]. This is due to the fact that we want to mimic the oracle f S 0 , that is, do not choose f 0 as target, so that we have to deal with a bias term f S 0 − f 0 . For a given S, our restricted eigenvalue condition is stronger than the one in Bickel et al. [2009] or Koltchinskii [2009a,b]. On the other hand, we apply it to the smaller set S 0 instead of to S true .
Define for an index set S ⊂ {1, . . . , p}, and for a set N ⊃ S and constant L > 0, the sets of restrictions It is easy to see that φ min (L, S, N ) ≤ φ(L, S, N ) ≤ φ(L, S) ≤ Λ min (S) for all L > 0. It can moreover be shown that

Sparse eigenvalues
The fact that we also need sparse eigenvalues is in line with the sparse Riesz condition from the literature. One easily verifies that for any set $N$ with $|N| = ks$, $k \in \mathbb{N}$, Moreover, for all $L \ge 0$, $\phi_{\rm sparse}(S, N) = \phi(0, S, N) \ge \phi(L, S, N)$.

Conclusions
We have presented comparable bounds for the adaptive Lasso and the thresholded Lasso with refitting, and we have also compared them to the ordinary Lasso. The framework of our analysis allows for misspecified linear models whose best linear projection is not necessarily sparse and with possibly small non-zero regression coefficients, i.e., many weak variables. This setting is much more realistic than the usual high-dimensional framework where the model is true with only a few but strong variables.
Estimating the support S 0 of the non-zero coefficients is a hard statistical problem. The irrepresentable condition, which is essentially a necessary condition for exact recovery of the non-zero coefficients by the one-step Lasso, is much too restrictive in many cases. In this paper, our main focus is on having O(s 0 ) false positives while achieving good prediction and estimation. This is inspired by the behavior of the "ideal" ℓ 0 -penalized estimator.
We have examined thresholding the Lasso with least squares refitting and the adaptive Lasso. Our main conclusion is that both methods can have about the same prediction and estimation error as the one-stage ordinary Lasso, and that both gain over the one-stage Lasso in the sense of having fewer false positives. Moreover, according to our theory (and not exploiting the fact that the adaptive Lasso mimics thresholding and refitting using an "oracle" threshold), thresholding with least squares refitting and the adaptive Lasso perform equally well, even when considered at a rather fine scale. Our bounds for the adaptive Lasso are more sensitive to small (minimal) restricted eigenvalues or small minimal sparse eigenvalues, or large sparse maximal eigenvalues. Both thresholded and adaptive Lasso benefit from a situation with large non-zero coefficients of the oracle.
We do not give an account of the tightness of our bounds. The thresholded Lasso allows a rather direct analysis, and we believe there is little room for improvement of the bounds for this method. The analysis of the adaptive Lasso is more involved. Our comparison to thresholding might not do justice to the adaptive Lasso. Indeed, we have not fully exploited the finer oracle properties of the adaptive Lasso.
In practice the tuning parameters are often chosen by cross validation, which may correspond to a choice giving the smallest prediction error. It is not within the scope of this paper to prove that with cross validation, thresholding and the adaptive Lasso again have comparable theoretical performance, although we do believe this to be the case. As for the computational aspect, we observe the following. For the solution path for all $\lambda_{\rm adap}$, the adaptive Lasso needs $O(n |\hat S_{\rm init}| \min(n, |\hat S_{\rm init}|))$ essential operations. The same order of operations is needed when computing the thresholded Lasso for the whole solution path over all $\lambda_{\rm thres}$. Therefore, the two methods are also computationally comparable.

The noiseless case
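Consider a fixed target $f^0 = f_{\beta_{\rm true}} \in L_2(Q)$. Let $S \subset \{1, \ldots, p\}$ and let
$$f_S := \arg\min_{f = f_{\beta_S}} \|f_{\beta_S} - f^0\|$$
be the projection of $f^0$ on the $|S|$-dimensional linear space spanned by the variables $\{\psi_j\}_{j \in S}$. We denote the coefficients of $f_S$ by $b^S$, i.e., $f_S = f_{b^S}$. The oracle set $S_0$ is defined by trading off dimension against fit, namely by minimizing over $S$ the sum of the approximation error $\|f_S - f^0\|^2$ and a penalty proportional to $\lambda_{\rm init}^2 |S|$, discounted by a restricted eigenvalue, where the constants are now from Theorem 6.1 (or its Corollary 8.1). We call $f_{S_0}$ the oracle, and we let $b^0 := b^{S_0}$, i.e., $f_{S_0} = f_{b^0}$.
For simplicity, we assume throughout that the approximation error $\|f_{S_0} - f^0\|^2$ is at most of the order of this penalty term, which roughly says that the approximation error does not overrule the penalty term.
The initial Lasso is
$$\beta_{\rm init} := \arg\min_{\beta} \Big\{ \|f_\beta - f^0\|^2 + \lambda_{\rm init} \|\beta\|_1 \Big\}.$$
We assume that the tuning parameter $\lambda_{\rm init}$ is set at some fixed value. Of course, in the noiseless case, the optimal value for $\lambda_{\rm init}$, in terms of prediction error, is $\lambda_{\rm init} = 0$. However, in the noisy case, a strictly positive lower bound for $\lambda_{\rm init}$ is dictated by the noise level. Write
$$f_{\rm init} := f_{\beta_{\rm init}}, \quad S_{\rm init} := \{ j : \beta_{j,{\rm init}} \neq 0 \}, \quad \delta_{\rm init} := \|f_{\rm init} - f^0\|. \qquad (10)$$
Let, for $\delta > 0$, $S_{\rm init}^\delta := \{ j : |\beta_{j,{\rm init}}| > \delta \}$, and let $f_{\rm init}^\delta := f_{S_{\rm init}^\delta}$ be the refitted Lasso after thresholding at $\delta$. Note that we express explicitly the dependence of the thresholded estimator on the threshold level, which we now call $\delta$ (instead of $\lambda_{\rm thres}$ as we did in the introduction). The reason for this is that the analysis of the adaptive Lasso will go via the thresholded Lasso with a choice of the threshold $\delta$ that trades off prediction error against estimation error (see (18) in the proof of Theorem 6.4).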

The adaptive Lasso is
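The second-stage tuning parameter $\lambda_{\mathrm{adap}}$ is again assumed to be strictly positive. We denote the resulting adaptive variants of (10) by $f_{\mathrm{adap}} := f_{\beta_{\mathrm{adap}}}$, $S_{\mathrm{adap}} := \{j : \beta_{j,\mathrm{adap}} \neq 0\}$, $\delta_{\mathrm{adap}} := \| f_{\mathrm{adap}} - f^0 \|$.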
As the initial and adaptive Lasso are special cases of the weighted Lasso, many of the results in Subsections 6.2, 6.3 and 6.4 are consequences of those for the weighted Lasso as studied in Subsection 6.1. The weighted Lasso is where the $\{w_j\}_{j=1}^p$ are non-negative weights. We set $f_{\mathrm{weight}} := f_{\beta_{\mathrm{weight}}}$, $S_{\mathrm{weight}} := \{j : \beta_{j,\mathrm{weight}} \neq 0\}$. Moreover, we define By the reparametrization $\beta \mapsto \gamma := W\beta$, where $W = \mathrm{diag}(w_1, \ldots, w_p)$, one sees that the weighted Lasso is a standard Lasso with Gram matrix $\Sigma_{\mathrm{weight}} := W^{-1} \Sigma W^{-1}$. We emphasize however that $\Sigma_{\mathrm{weight}}$ is generally not normalized, i.e., generally $\mathrm{diag}(\Sigma_{\mathrm{weight}}) \neq I$.
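The reparametrization argument can be checked numerically. The sketch below is an illustration under assumptions, not the paper's procedure: it takes the weighted Lasso objective to be $\|Y - X\beta\|_2^2/n + \lambda \sum_j w_j |\beta_j|$ with strictly positive weights (so that $W$ is invertible), solves it by a plain coordinate descent (the helper `weighted_lasso` and the simulated data are hypothetical), and compares the result with a standard Lasso on the rescaled design $XW^{-1}$, mapped back through $\beta = W^{-1}\gamma$.

```python
import numpy as np

def weighted_lasso(X, y, lam, w, n_sweeps=500):
    """Coordinate descent for (1/n)||y - X b||_2^2 + lam * sum_j w_j |b_j| (assumed objective)."""
    n, p = X.shape
    b = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()             # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]            # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam * w[j] / 2.0, 0.0) / col_norm2[j]
            r -= X[:, j] * b[j]            # put the updated coordinate back
    return b

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))        # normalize so that diag(X^T X / n) = 1
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
w = rng.uniform(0.5, 3.0, size=p)          # strictly positive weights
lam = 0.3

# Weighted Lasso solved directly in beta.
beta_w = weighted_lasso(X, y, lam, w)

# Standard Lasso in gamma = W beta on the rescaled design X W^{-1}.
X_tilde = X / w                            # divides column j by w_j
gamma = weighted_lasso(X_tilde, y, lam, np.ones(p))
beta_from_gamma = gamma / w                # back-transform beta = W^{-1} gamma

Sigma = X.T @ X / n
Sigma_weight = X_tilde.T @ X_tilde / n     # equals W^{-1} Sigma W^{-1}
```

In this example `np.diag(Sigma)` is a vector of ones while `np.diag(Sigma_weight)` equals $1/w_j^2$, which illustrates why the rescaled Gram matrix is generally not normalized.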

The weighted Lasso
We first present a bound for the prediction and estimation error and then consider variable selection.
Theorem 6.1 Let S be an index set with cardinality s := |S|, satisfying for some constants M ≥ 0 and L > 0, Then for all β, we have Moreover, for all β, we have Finally, it holds for all β, that We will apply the above theorem with S the set of the smaller weights.
Corollary 6.1 Fix some arbitrary δ > 0, and let The indices j with w j = 1/δ can be put in either S δ weight or in its complement. Suppose that for some α ≥ 0, |S δ weight \S 0 | ≤ αs 0 .
Taking S = S δ weight , L = 1 and M = 1/δ in Theorem 6.1, we get that for all β, .

The initial Lasso
Recall that Theorem 6.2 The prediction error of the initial Lasso has and its estimation error has The initial estimator has number of false positives Considering the variable selection result, it is clear that Λ 2 max (S init \S 0 ) ≤ Λ 2 max . Without further conditions, this cannot be refined, and the eigenvalue Λ 2 max can be quite large (yet having the minimal eigenvalue of Σ bounded away from zero). Therefore, the result of Theorem 6.2 needs further conditions for good variable selection properties of the initial Lasso.

Thresholding the initial estimator
Variable selection results by thresholding are not difficult to obtain: Hence, for δ ≥ δ 1 /s 0 ∧ δ 2 / √ s 0 , we get for q ∈ {1, 2}, If the coefficients of the oracle are sufficiently large, thresholding will improve the prediction and estimation error. Here, we do not impose such minimal size conditions. The estimation error of the thresholded Lasso is then still easy to assess. Our bound for the prediction error, however, now depends on maximal sparse eigenvalues.
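The counting argument behind variable selection by thresholding can be illustrated with a short Python check. It assumes, as a labelled assumption, that the estimation errors are measured as $\|\beta_{\mathrm{init}} - b^0\|_q$ for $q \in \{1, 2\}$: every index $j \notin S_0$ that survives the threshold has $b^0_j = 0$ and $|\beta_{j,\mathrm{init}}| > \delta$, so it contributes more than $\delta^q$ to $\|\beta_{\mathrm{init}} - b^0\|_q^q$, and hence the number of false positives is at most $\|\beta_{\mathrm{init}} - b^0\|_q^q / \delta^q$. The helper `false_positives` and the perturbed vector `beta_init` are hypothetical and only serve the illustration.

```python
import numpy as np

def false_positives(beta_init, b0, delta):
    """Number of indices j with |beta_init_j| > delta although b0_j = 0."""
    selected = np.abs(beta_init) > delta
    return int(np.sum(selected & (b0 == 0)))

# Hypothetical oracle coefficients and a perturbed "initial estimate".
rng = np.random.default_rng(0)
p, s0 = 200, 5
b0 = np.zeros(p)
b0[:s0] = 1.0
beta_init = b0 + 0.1 * rng.standard_normal(p)

err_l1 = np.abs(beta_init - b0).sum()           # l1 estimation error
err_l2_sq = ((beta_init - b0) ** 2).sum()       # squared l2 estimation error

checks = []
for delta in (0.05, 0.1, 0.2):
    fp = false_positives(beta_init, b0, delta)
    # Record the count next to the two counting bounds.
    checks.append((delta, fp, err_l1 / delta, err_l2_sq / delta ** 2))
```

Both bounds hold for any vectors, since they rely only on the counting argument and not on how `beta_init` was obtained.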
At this stage, we invoke the noiseless counterparts of Conditions A and AA.
Theorem 6.3 Assume Condition a. Then The expressions for the prediction and estimation error lead to favoring the choice λ init /φ 2 (2, S 0 , 2s 0 ) ≍ suff δ of Condition aa, which yields

The adaptive Lasso
Observe that the adaptive Lasso is somewhat more reluctant than thresholding and refitting: the latter ruthlessly disregards all coefficients with |β j,init | ≤ δ (i.e., these coefficients get penalty ∞), and puts zero penalty on coefficients with |β j,init | > δ. The adaptive Lasso gives the coefficients with |β j,init | ≤ δ a penalty of at least λ init (λ adap /δ) and those with |β j,init | > δ a penalty of at most λ init (λ adap /δ). (Looking ahead, we will actually need to choose λ adap ≥ δ in the noisy case, see Theorem 3.3.) Recall The noiseless versions of Conditions B and BB are:

Condition bb We have
Note the slight discrepancy with the noisy versions: the noiseless versions are somewhat better. This is due to the fact that we also will need to choose λ adap large enough to handle the noise.
Theorem 6.4 Assume Condition b. Then Considering the bounds for the prediction and estimation error leads to favoring the choice of Condition bb, giving O(s 0 ).

The weighted irrepresentable condition
This subsection will show that, even in the noiseless case, exact variable selection needs rather strong conditions. It serves as a motivation for the perhaps more moderate aim of having O(s 0 ) (≤ O(s true )) false positives and detecting only the larger coefficients. Moreover, we illustrate in Example 6.1 of this subsection that the lower bound on the non-zero coefficients as given in Corollary 3.2 is tight.
It is known that the initial Lasso essentially needs the irrepresentable condition in order to have no false positives (Zhao and Yu [2006]). Similar statements can be made for the weighted Lasso.
Definition We say that the weighted irrepresentable condition holds for $S$ if for all vectors $\tau_S \in \mathbb{R}^{|S|}$ with $\|\tau_S\|_\infty \le 1$, one has The reparametrization $\beta \mapsto \gamma := W^{-1}\beta$ leads to the following lemma, which is the weighted variant of the first part of Lemma 6.2 in van de Geer and Bühlmann [2009]. Here, we actually take $f^0$ as target, instead of its $\ell_0$-sparse approximation $f_{S_0}$. Recall $S_{\mathrm{true}} := \{j : \beta_{j,\mathrm{true}} \neq 0\}$.
Lemma 6.2 Suppose the weighted irrepresentable condition is met for S true . Then S weight ⊂ S true .
We now consider conditions for the weighted irrepresentable condition to hold.
Lemma 6.3 Suppose that Then the weighted irrepresentable condition holds for S.
The next example shows that the result of Lemma 6.3 cannot be improved (essentially, up to the strict inequality) without assuming further conditions.
We now take a special choice for $\Sigma$, which is perhaps not very representative when $\Sigma$ is an empirical Gram matrix $\hat\Sigma$, but it is legitimate for a worst case analysis (as we study here). We suppose that $\Sigma_{1,1} := I$ is the $(s \times s)$-identity matrix, and $\Sigma_{2,1} := \rho (c_2 c_1^T)$, with $0 \le \rho < 1$, and with $c_1$ an $s$-vector and $c_2$ a $(p - s)$-vector, satisfying $\|c_1\|_2 = \|c_2\|_2 = 1$. Moreover, we suppose $\Sigma_{2,2}$ is the $((p - s) \times (p - s))$-identity matrix. Then $\Lambda_{\min}(S_{\mathrm{true}}) = 1$ and the smallest eigenvalue of $\Sigma$ is $1 - \rho$. Its largest eigenvalue is $1 + \rho$. Take $c_1 = w_{S_{\mathrm{true}}} / \|w_{S_{\mathrm{true}}}\|_2$, and $c_2 = (0, \ldots, 1, 0, \ldots)^T$, where the 1 is placed at $\arg\min_{j \in S_{\mathrm{true}}^c} w_j$. Then As a special case, suppose $c_1 = (1, 1, \ldots, 1)^T / \sqrt{s}$, and $\rho = 1/2$. The adaptive

Adding noise
After introducing the notation for the noisy case (Subsection 7.1), we will give the extension of the results for the weighted Lasso to the noisy case$^1$ (see Theorem 7.1). Once this is done, results for the initial Lasso, its thresholded version, and for the adaptive Lasso, follow in the same way as in Subsections 6.2, 6.3 and 6.4. The new point is to take care that the tuning parameters are chosen in such a way that the noisy part due to variables in $S_0^c$ is overruled by the penalty term. In our situation, this can be done by taking $\lambda_{\mathrm{init}}$, as well as $\lambda_{\mathrm{adap}} \ge \lambda_{\mathrm{init}}$ sufficiently large.
1 Of separate interest is a direct comparison of the noisy initial Lasso with the noisy $\ell_0$-penalized estimator. Replacing $f^0$ by $Y$ in Corollary 8.1 (and dropping the requirement

We provide the result for the noisy weighted Lasso in Subsection 7.2. Theorems 3.1, 3.2 and 3.3 follow from this and from some further results for the noisy case (their proofs are in Subsection 8.3). In Section 7.3, we look at more restrictive sparse eigenvalue conditions in the spirit of .

Notation for the noisy case
Consider an $n$-dimensional vector of observations $Y = f^0 + \epsilon$, where $f^0 := (f^0(X_1), \ldots, f^0(X_n))^T$, with $X_1, \ldots, X_n$ co-variables in some space $\mathcal{X}$. Let $\{\psi_j\}_{j=1}^p$ be a given dictionary. The regression $f^0$, the dictionary $\{\psi_j\}$, and $f_\beta := \sum_{j=1}^p \psi_j \beta_j$ are now considered as vectors in $\mathbb{R}^n$. The norm we use is the normalized Euclidean norm induced by the inner product In other words, the probability measure $Q$ is now $Q := Q_n = \sum_{i=1}^n \delta_{X_i}/n$, the empirical measure of the co-variables $X_1, \ldots, X_n$. With some abuse of notation, we also write $\|Y - f\|_n^2 := \|Y - f\|_2^2/n$, and The design matrix $X$ is $X = (\psi_1, \ldots, \psi_p)$.
We write the eigenvalues involved as before, e.g., $\Lambda_{\max}$ is the largest eigenvalue of the empirical Gram matrix $\hat\Sigma := X^T X/n$, and $\phi^2(L, S, N)$ is the $(L, S, N)$-restricted eigenvalue of $\hat\Sigma$. The projections in $L_2(Q_n)$ are also written as before, i.e. $f_S := X b^S := \arg\min$ The $\ell_0$-sparse projection $f_{S_0} = \sum_{j \in S_0} \psi_j b_j^0$ is now defined with a larger constant (7 instead of 3) in front of the penalty term, and a larger constant ($L = 6$ instead of $L = 2$) in the restrictions of the restricted eigenvalue condition: (9)).
The initial and adaptive Lasso are defined as in Section 1. We write $\hat f_{\mathrm{init}} := f_{\hat\beta_{\mathrm{init}}}$ and $\hat f_{\mathrm{adap}} := f_{\hat\beta_{\mathrm{adap}}}$, with active sets $\hat S_{\mathrm{init}} := \{j : \hat\beta_{j,\mathrm{init}} \neq 0\}$ and $\hat S_{\mathrm{adap}} := \{j : \hat\beta_{j,\mathrm{adap}} \neq 0\}$, respectively. Let be the prediction error of the initial Lasso, and and, for $q \ge 1$, be its $\ell_q$-error. Denote the prediction error of the adaptive Lasso by The least squares estimator using only variables in $S$ is also written with a "hat": $\hat f_S = f_{\hat b^S} := \arg\min$ A threshold level will be denoted by $\delta$, instead of $\lambda_{\mathrm{thres}}$ as we do in Section 1. The reason is again that we need to explicitly express dependence on the threshold level. With $\lambda_{\mathrm{thres}}$ the notation will be too complicated. We define, for any threshold $\delta > 0$, $\hat S^\delta_{\mathrm{init}} := \{j : |\hat\beta_{j,\mathrm{init}}| > \delta\}$.
The refitted version after thresholding, based on the data $Y$, is $\hat f_{\hat S^\delta_{\mathrm{init}}}$.
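The two second-stage procedures can be sketched side by side in Python. The sketch rests on assumptions that should be read as such: the initial Lasso is taken to minimize $\|Y - X\beta\|_2^2/n + \lambda_{\mathrm{init}} \|\beta\|_1$, the adaptive Lasso is taken to use the penalty $\lambda_{\mathrm{init}} \lambda_{\mathrm{adap}} \sum_j |\beta_j| / |\hat\beta_{j,\mathrm{init}}|$ (variables with $\hat\beta_{j,\mathrm{init}} = 0$ are excluded, corresponding to an infinite penalty), and the solver `weighted_lasso`, the tuning values and the simulated data are hypothetical choices for illustration.

```python
import numpy as np

def weighted_lasso(X, y, lam, w, n_sweeps=300):
    """Coordinate descent for (1/n)||y - X b||_2^2 + lam * sum_j w_j |b_j| (assumed objective)."""
    n, p = X.shape
    b = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()             # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]            # remove coordinate j from the fit
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam * w[j] / 2.0, 0.0) / col_norm2[j]
            r -= X[:, j] * b[j]            # put the updated coordinate back
    return b

# Hypothetical high-dimensional data (p >= n), for illustration only.
rng = np.random.default_rng(0)
n, p, s0 = 60, 120, 4
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))        # normalize so that diag(X^T X / n) = 1
beta_true = np.zeros(p)
beta_true[:s0] = [3.0, -2.0, 2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam_init, lam_adap, delta = 0.5, 0.5, 0.5  # hypothetical tuning values

# Stage 1: initial Lasso (all weights equal to one).
beta_init = weighted_lasso(X, y, lam_init, np.ones(p))
S_init = np.flatnonzero(beta_init != 0)

# Stage 2a: threshold at delta, then refit by least squares on the survivors.
S_delta = np.flatnonzero(np.abs(beta_init) > delta)
beta_thres = np.zeros(p)
if S_delta.size > 0:
    beta_thres[S_delta] = np.linalg.lstsq(X[:, S_delta], y, rcond=None)[0]

# Stage 2b: adaptive Lasso with weights 1 / |beta_init_j| on the initial active set.
beta_adap = np.zeros(p)
if S_init.size > 0:
    w = 1.0 / np.abs(beta_init[S_init])
    beta_adap[S_init] = weighted_lasso(X[:, S_init], y, lam_init * lam_adap, w)

S_adap = np.flatnonzero(beta_adap != 0)
```

By construction both second-stage active sets are contained in the initial active set; comparing `S_delta` and `S_adap` with the first `s0` indices gives the number of false positives of each method on this simulated example.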
To handle the (random) noise, we define the set This is the set where the (empirical) correlations between noise and design are "small".
Here λ init is chosen in such a way that is the confidence we want to achieve.
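As an illustration of how the probability of such a set can be controlled, the following Python sketch runs a small Monte Carlo experiment. It is a sketch under explicit assumptions that are not taken from the text above: the noise is i.i.d. Gaussian with standard deviation `sigma`, the columns of the design are normalized, and the set is taken to be of the form $\{\max_j |\epsilon^T \psi_j|/n \le \lambda_0\}$, where `lam0` is a hypothetical level (the constant linking it to $\lambda_{\mathrm{init}}$ is not reproduced here). Under these assumptions a union bound with the Gaussian tail inequality gives probability at least $1 - e^{-t}$ for $\lambda_0 = \sigma \sqrt{2(t + \log(2p))/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500
sigma, t = 1.0, 1.0                          # assumed noise level and confidence parameter

X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))          # normalize so that diag(X^T X / n) = 1

# Hypothetical noise level from the union bound over the p columns.
lam0 = sigma * np.sqrt(2.0 * (t + np.log(2 * p)) / n)

# Monte Carlo estimate of P(max_j |eps^T psi_j| / n <= lam0).
n_rep = 2000
eps = sigma * rng.standard_normal((n_rep, n))   # one noise vector per row
max_corr = np.abs(eps @ X / n).max(axis=1)
freq_on_T = float(np.mean(max_corr <= lam0))

lower_bound = 1.0 - np.exp(-t)               # guarantee from the union bound
```

The empirical frequency `freq_on_T` is typically well above `lower_bound`, reflecting that the union bound is conservative.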

The noisy weighted Lasso
Theorem 7.1 Suppose we are on T . Let S be a set with cardinality s = |S|, which satisfies for some positive L and M Then for all β, Moreover, under the condition λ weight w min S c ≥ 1,

Another look at the number of false positives
Here, we discuss a refinement, assuming a condition corresponding to the one used in .
Moreover, under Condition B, Under Condition BB, this becomes Under Condition D, the first term in the right hand side of (15) is generally the leading term. We thus see the adaptive Lasso replaces the potentially very large constant in the bound for the number of false positives of the initial Lasso by φ 2 min (6, S 0 , 2s 0 )φ 2 (6, S 0 ) φ 4 (6, S 0 , 2s 0 ) 1/2 , a constant which is close to 1 if the φ's do not differ too much.
Admittedly, Condition D is difficult to interpret. On the one hand, it wants s * to be large, but on the other hand, a large s * also can render Λ sparse (s * ) large. We refer to  for examples where Condition D is met.

Proofs
We present three subsections, containing respectively the proofs for Section 6, Section 7, and finally Section 3.
Proofs for Section 6: the noiseless case

Proofs for Subsection 6.1: the noiseless weighted Lasso
Proof of Theorem 6.1. Take We have It follows that But then, by the definition of restricted eigenvalue, and invoking the triangle inequality, N ). N ) .
Case ii) If The first result of the theorem now follows from taking $N = S$.
For the second result, we add in Case i), λ init λ weight M √ N (β weight ) N − β S 2 to the left and right hand side of (16): The same arguments now give .
In Case ii), we have

So then
Taking N = S gives the second result.
For the third result, we let $N$ be the set $S$, complemented with the $s_0$ largest (in absolute value) coefficients of $(\beta_{\mathrm{weight}})_{S^c}$. Then $\phi(2L, N) \le \phi(2, S, s + s_0)$. Moreover, $|N| \ge s_0$. Thus, from the second result, we get Moreover, as is shown in Lemma 2.2 in van de Geer and Bühlmann [2009] (with original reference Candès and Tao [2005], and Candès and Tao [2007]),

□
We now turn to the proof of Lemma 6.1. An important characterization of the solution β weight can be derived from the Karush-Kuhn-Tucker (KKT) conditions (see Bertsimas and Tsitsiklis [1997]).

Proofs for Subsection 6.2: the noiseless initial Lasso
We first present the corollaries of Theorem 6.1 and Lemma 6.1 when we apply them to the case where all the weights are equal to one.
Corollary 8.1 For the initial Lasso, $w_j = 1$ for all $j$, so we can apply Corollary 6.1 with $\delta = 1$ and $S^\delta_{\mathrm{weight}} = S_0$. Let We have The estimation error can be bounded as follows: Moreover, application of Lemma 6.1 bounds the number of false positives:
Proof of Theorem 6.2. This is now a direct consequence of Corollary 8.1. □

Proofs for Subsection 6.3: the noiseless thresholded Lasso
We first provide some explicit bounds.
Lemma 8.1 We have
Proof of Lemma 8.1. To obtain the first result, we use The $\ell_2$-error of the second result follows by the same arguments.
The first inequality of the third result follows from the definition of $f_{S^\delta_{\mathrm{init}}}$ as projection, and the second follows from the triangle inequality, where we invoke that The final result follows from

□
thresholds that allow a comparison with the results for the thresholded initial Lasso. This means that we might lose here some further favorable properties of the adaptive Lasso.

Proof of Lemma 7.1 with the more involved conditions
To prove this lemma, we actually need some results from Section 3 and an intermediate result in their proof. One may skip the present proof at first reading and first consult the next subsection (Subsection 8.3).
Also, on T ,

Proof of Theorem 3.2: the noisy thresholded Lasso
The least squares estimator $\hat f_{\hat S^\delta_{\mathrm{init}}}$ using only variables in $\hat S^\delta_{\mathrm{init}}$ (i.e., the projection of $Y = f^0 + \epsilon$ on the linear space spanned by $\{\psi_j\}_{j \in \hat S^\delta_{\mathrm{init}}}$) has similar prediction properties as $f_{\hat S^\delta_{\mathrm{init}}}$ (the projection of $f^0$ on the same linear space). This is because, as is shown in the next lemma, their difference is small.
Assumption A together with Lemma 8.2 complete the proof for the bounds for prediction and estimation error, with the $\ell_1$-bound being a simple consequence of the $\ell_2$-bound. Also, the variable selection result follows from $|\hat S^\delta_{\mathrm{init}} \setminus S_0| \le \hat\delta_2^2 / \delta^2$, and Assumption A. □

Proof of Theorem 3.3: the noisy adaptive Lasso
We first apply Theorem 7.1 to the adaptive Lasso.
and $\|(\hat\beta_{\mathrm{init}})_{\hat S^\delta_{\mathrm{init}}} - b^0\|_2 \le 3\delta\sqrt{s_0}$. The prediction and estimation results now follow from Corollary 8.4 combined with Condition B.