The Lasso as an ℓ1-ball model selection procedure

While many efforts have been made to prove that the Lasso behaves like a variable selection procedure at the price of strong (though unavoidable) assumptions on the geometric structure of these variables, much less attention has been paid to oracle inequalities for the Lasso involving the ℓ1-norm of the target vector. Such inequalities proved in the literature show that, provided that the regularization parameter is properly chosen, the Lasso approximately mimics the deterministic Lasso. Some of them do not require any assumption at all, neither on the structure of the variables nor on the regression function. Our first purpose here is to provide a conceptually very simple result in this direction in the framework of Gaussian models with non-random regressors. Our second purpose is to propose a new estimator particularly adapted to deal with infinite countable dictionaries. This estimator is constructed as an ℓ0-penalized estimator among a sequence of Lasso estimators associated to a dyadic sequence of growing truncated dictionaries. The selection procedure automatically chooses the best level of truncation of the dictionary so as to make the best tradeoff between approximation, ℓ1-regularization and sparsity. From a theoretical point of view, we shall provide an oracle inequality satisfied by this selected Lasso estimator. The oracle inequalities presented in this paper are obtained via the application of a general theorem of model selection among a collection of nonlinear models which is a direct consequence of the Gaussian concentration inequality. The key idea that enables us to apply this general theorem is to see ℓ1-regularization as a model selection procedure among ℓ1-balls.

Keywords: ℓ1-oracle inequalities, model selection by penalization, ℓ1-balls, generalized linear Gaussian model.


Introduction
We consider the problem of estimating a regression function f belonging to a Hilbert space H in a fairly general Gaussian framework which includes the fixed design regression or the white noise frameworks. Given a dictionary D = {φ_j}_j of functions in H, we aim at constructing an estimator f̂ = θ̂.φ := ∑_j θ̂_j φ_j of f which enjoys both good statistical properties and computational performance, even for large or infinite dictionaries.
For high-dimensional dictionaries, direct minimization of the empirical risk can lead to overfitting, and we need to add a penalty to avoid it. One appropriate choice would be to use an ℓ0-penalty, penalizing the number of non-zero coefficients θ̂_j of f̂ (see [5] for instance) so as to produce sparse estimators and interpretable models, but this minimization problem is non-convex and thus computationally unfeasible when the size of the dictionary becomes too large. On the contrary, ℓ1-penalization leads to convex optimization, so there are efficient algorithms to approximate the solution of this problem even for high-dimensional data (see [12] for instance). Besides, by running these algorithms, one can notice that the ℓ1-penalty tends to produce sparse solutions and thus to behave like an ℓ0-penalty. This phenomenon is due to the geometric properties of the ℓ1-norm. For these reasons, ℓ1-penalization and its associated solution, the so-called Lasso, have been widely used in recent years as a surrogate for ℓ0-penalization.
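The convexity argument above can be illustrated with a short numerical sketch, not taken from the paper: a plain coordinate-descent solver for the finite-sample Lasso criterion (1/2)‖y − Xθ‖² + λ‖θ‖₁. All names and constants here are illustrative; the point is only that the ℓ1-penalty is handled by a simple convex update and tends to return exact zeros.

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): the scalar l1-proximal map
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)*||y - X @ theta||^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # squared column norms
    for _ in range(n_iter):
        for j in range(p):
            # partial residual, excluding the contribution of coordinate j
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return theta
```

On a toy design whose response depends on only two columns, this solver typically sets the remaining coordinates to exact zero, mimicking the ℓ0-behavior described above.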
In this paper, we look at the Lasso as an ℓ1-regularization algorithm rather than as a variable selection procedure. We analyze its performance by providing an ℓ1-oracle inequality (see Theorem 3.1). It shows that, provided that the regularization parameter is properly chosen, the Lasso performs almost as well as the deterministic Lasso, and what is noticeable is that this ℓ1-result requires no assumption, neither on the unknown target function nor on the variables φ_j of the dictionary (except simple normalization, which we can always assume by considering φ_j/‖φ_j‖ instead of φ_j), contrary to the usual ℓ0-oracle inequalities in the literature, which are valid only under restrictive conditions.
The establishment of ℓ1-oracle inequalities for the Lasso is not entirely new. In fact, a few authors such as Barron et al. [15], Bartlett et al. [2] or Rigollet and Tsybakov [20] have recently studied such ℓ1-bounds, but there are some differences between their results and ours (see page 673 for more details). By stating Theorem 3.1, our aim is to add to the existing literature results on the ℓ1-performance of the Lasso in simple, yet important, cases such as the fixed design Gaussian regression or the white noise models. We shall establish both a bound in probability and a bound in expectation, and our results shall be valid with no assumption, neither on the target function nor on the variables of the dictionary (except simple normalization). Besides, we propose a method of analysis which is quite different from the methods used in the papers mentioned above. We shall derive our results from a fairly general model selection theorem for nonlinear models, interpreting ℓ1-regularization as an ℓ1-ball model selection criterion (see Appendix A). This approach will allow us to go one step further than the analysis of the Lasso for finite dictionaries and to deal with infinite dictionaries in various situations.
In the second part of this paper, we shall thus focus on infinite countable dictionaries. While infinite dictionaries are used more and more in many applications such as micro-array data analysis or signal reconstruction, it proves difficult to calibrate the regularization parameter of the Lasso and thus to establish good theoretical results on the performance of this estimator for such dictionaries. To solve this problem, we propose a procedure that provides an optimal level of truncation of the whole infinite dictionary as well as an efficient estimator in the linear span of this optimal finite subdictionary. For orthonormal dictionaries, this estimator is nothing other than a soft-thresholding estimator with an adaptive threshold.
The article is organized as follows. The framework and statistical problem are introduced in Section 2. In Section 3, we study the case of finite dictionaries and analyze the performance of the Lasso as an ℓ1-regularization algorithm by providing an ℓ1-oracle inequality showing that the Lasso estimator works almost as well as the deterministic Lasso provided that the regularization parameter is chosen large enough. In Section 4, we look at the case of infinite countable dictionaries and introduce a procedure, based on Lasso-type penalization combined with an additional complexity penalty, that produces an efficient selected Lasso estimator. The explanation of the key idea that enables us to derive all our oracle inequalities from a single general model selection theorem is postponed until Appendix A. Finally, the oracle inequalities are proved in Appendix B.

General framework and statistical problem
Let H be a separable Hilbert space equipped with a scalar product ⟨·, ·⟩ and its associated norm ‖·‖. The statistical problem we consider is to estimate an unknown target function f in H when observing a process (Y(h))_{h∈H} defined by

Y(h) = ⟨f, h⟩ + ε W(h),   h ∈ H,   (2.1)

where ε > 0 is a fixed parameter and (W(h))_{h∈H} is an isonormal process, that is to say a centered Gaussian process with covariance given by E[W(g)W(h)] = ⟨g, h⟩ for all g, h ∈ H. This framework is convenient to cover both finite-dimensional models, such as the classical fixed design Gaussian regression model, and infinite-dimensional models, such as the Gaussian white noise model (see [17] for details on these models).
To solve the statistical problem (2.1), we shall introduce a dictionary D, i.e. a given finite or infinite set of functions φ_j ∈ H that arise as candidate basis functions for estimating the target function f, and consider estimators f̂ = θ̂.φ := ∑_{j: φ_j∈D} θ̂_j φ_j in the linear span of D. The whole issue is to choose a "good" linear combination, in the following sense. It makes sense to aim at constructing an estimator as the best approximating point of f, by minimizing ‖f − h‖² or, equivalently, −2⟨f, h⟩ + ‖h‖². However, f is unknown, so one may instead minimize the empirical least squares criterion

γ(h) := −2 Y(h) + ‖h‖².   (2.2)

But since we are mainly interested in very large dictionaries, direct minimization of the empirical least squares criterion can lead to overfitting. To avoid it, we shall rather consider estimators solution of a penalized risk minimization problem,

f̂ ∈ arg min_{h ∈ L₁(D)} γ(h) + pen(h),   (2.3)

where pen(h) is a positive penalty to be chosen. Since the estimator f̂ depends on the observations, its quality will be measured by its quadratic risk E[‖f − f̂‖²]. In Sections 3 and 4.3, we shall consider ℓ1-penalization, that is to say pen(h) ∝ inf{‖θ‖₁ = ∑_{j: φ_j∈D} |θ_j| such that h = θ.φ}, while in Section 4 we shall suggest a penalty pen(h) combining an ℓ1-penalty and a complexity penalty.

The Lasso for finite dictionaries
In this section, we provide an ℓ1-oracle inequality satisfied by the Lasso in the case of finite dictionaries.

Definition of the Lasso estimator
We consider the generalized linear Gaussian model and the statistical problem (2.1) introduced in the last section. Throughout this section, we assume that D_p = {φ_1, . . . , φ_p} is a finite dictionary of size p. In this case, any h in the linear span of D_p has finite ℓ1-norm and thus belongs to L₁(D_p). We propose to estimate f by a penalized least squares estimator as introduced at (2.3), with a penalty pen(h) proportional to ‖h‖_{L₁(D_p)}. This estimator is the so-called Lasso estimator f̂_p defined by

f̂_p ∈ arg min_{h ∈ L₁(D_p)} γ(h) + λ_p ‖h‖_{L₁(D_p)},

where λ_p > 0 is some regularization parameter and γ(h) is defined by (2.2).

An ℓ 1 -oracle inequality
Let us now state the main result of this section. This ℓ1-oracle inequality highlights the fact that, provided that the regularization parameter λ_p is properly chosen, the Lasso, which is the solution of the ℓ1-penalized empirical risk minimization problem, behaves almost as well as the deterministic Lasso, that is to say the solution of the ℓ1-penalized true risk minimization problem, up to an error term of order O(ε²), where O(·) depends on the complexity of the dictionary.
Then, there exists an absolute positive constant C such that, for all z > 0, with probability larger than 1 − 3.4 e^{−z}, the Lasso satisfies the ℓ1-oracle inequality (3.4). Integrating (3.4) with respect to z leads to the ℓ1-oracle inequality in expectation (3.5).

Remark 3.2. These last years, the Lasso has essentially been developed as an approach to sparse recovery based on convex optimization, and thus the main focus on this estimator has been on the establishment of ℓ0-oracle inequalities so as to study its performance as a variable selection procedure. Here, Theorem 3.1 does not take sparsity into account and rather provides information about the performance of the Lasso as an ℓ1-regularization algorithm, by providing ℓ1-oracle inequalities satisfied by this estimator. Notice that the ℓ1-oracle inequalities of Theorem 3.1 are valid for regularization parameters of the same order (3.3) as the usual regularization parameters considered for the establishment of ℓ0-oracle inequalities (see [4] among others). Let us also stress that, contrary to the ℓ0-results, which require some restrictive assumptions on the dictionary and which are interesting only if the target function can be well approximated by a sparse function in the linear span of the dictionary, the ℓ1-oracle inequalities (3.4) and (3.5) are established with no assumption, neither on the target function nor on the structure of the variables φ_j of the dictionary D_p, except simple normalization, which we can always assume by considering φ_j/‖φ_j‖ instead of φ_j. This shows that, whereas one cannot be sure whether the conditions for the Lasso to be a good variable selection procedure are fulfilled or not, one is always guaranteed that the Lasso achieves high performance as regards ℓ1-regularization.
In fact, ℓ1-oracle inequalities of the same type as (3.4) or (3.5) have already been studied by a few authors such as Barron et al. [15], Bartlett et al. [2] or Rigollet and Tsybakov [20], but all these results present dissimilarities with Theorem 3.1. Let us have a look at these differences.
In [20], Rigollet and Tsybakov propose an oracle inequality for the Lasso similar to (3.4) which is valid under the same assumption as that of Theorem 3.1, i.e. simple normalization of the variables of the dictionary, but their bound in probability cannot be integrated to get a bound in expectation like the one we propose at (3.5). Indeed, the constant measuring the level of confidence of their risk bound appears inside the infimum term as a multiplicative factor of the ℓ1-norm, whereas the constant z measuring the level of confidence of our risk bound (3.4) appears as an additive constant outside the infimum term, so that the bound in probability (3.4) can easily be integrated with respect to z, which leads to the bound in expectation (3.5). Besides, the lower bound of the regularization parameter λ_p proposed by Rigollet and Tsybakov (λ_p ≥ 8(1 + z/ln p) ε √ln p) depends on the level of confidence z, with the consequence that their choice of the Lasso estimator f̂_p = f̂_p(λ_p) also depends on this level of confidence. On the contrary, our lower bound λ_p ≥ 4ε(√ln p + 1) does not depend on z, so that we are able to get the result (3.4) satisfied with high probability by an estimator f̂_p = f̂_p(λ_p) independent of the level of confidence of this probability.
As regards Bartlett et al. [2], they have obtained an oracle inequality for the Lasso of the same type as (3.4) in the context of linear regression. Nonetheless, they consider the case of random design (X, Y) ∈ R^d × R rather than our setting with fixed design and Gaussian noise. Therefore, they have to overcome rather substantial (and interesting) difficulties in the analysis of the empirical processes involved in the problem. In particular, their method of analysis requires a uniform concentration phenomenon that forces them to make strong assumptions, namely that both X and Y are bounded almost surely by a constant independent of n. Moreover, they get a lower bound on the regularization parameter with an extra ln-factor compared to (3.3).
However, the oracle inequality of Theorem 3.1 is proved with an undetermined constant C, whereas the ℓ1-oracle inequalities in both [20] and [2] are sharp, i.e. hold with C = 1.
Barron et al. [15] have also studied risk bounds for ℓ1-penalized estimators in the case of random design. Rather than assuming that Y is bounded, as Bartlett et al. do, they make the assumption that the errors satisfy some Bernstein moment condition; on the other hand, they assume that the target function is bounded by a constant, and the risk bound they provide is not satisfied by the Lasso itself but only by a truncated Lasso estimator.
The proof of Theorem 3.1 is detailed in Appendix B, and we refer the reader to Appendix A for the description of the key observation that has enabled us to establish it. In a nutshell, the basic idea is to view the Lasso as the solution of a penalized least squares model selection procedure over a countable collection of models consisting of ℓ1-balls. Inequalities (3.4) and (3.5) are then deduced from a general model selection theorem stated as Theorem A.1 in Appendix A. Let us point out that this approach will allow us to go one step further than the analysis of the Lasso for finite dictionaries and to deal with infinite dictionaries, as we shall see in Section 4.

A selected Lasso estimator for infinite countable dictionaries
In many applications such as micro-array data analysis or signal reconstruction, we are now faced with situations in which the number of variables of the dictionary keeps increasing and can even be infinite. So it is desirable to find competitive estimators for such infinite-dimensional problems, but (except in rare situations where the variables have a specific structure; see Remark 4.3 on neural networks) it proves very difficult to establish good theoretical results on the performance of the Lasso solution over an infinite dictionary. Indeed, when considering a finite dictionary of size p, Theorem 3.1 guarantees that, for a regularization parameter greater than a certain quantity depending on the size p, the corresponding Lasso estimator achieves good performance, but for an infinite dictionary there is no size p and thus no lower bound on the regularization parameter to guarantee good performance of the corresponding Lasso estimator. Our goal here is to propose a procedure to calibrate the regularization parameter by providing an optimal size p̂, in a sense described below.
In order to deal with an infinite countable dictionary D, one may order the variables of the dictionary, write the dictionary D = {φ_j}_{j∈N*} = {φ_1, φ_2, . . . } according to this order, then truncate D at a given level p to get a finite subdictionary {φ_1, . . . , φ_p}, and finally estimate the target function by the Lasso estimator f̂_p over this subdictionary. This procedure raises two difficulties. First, one has to put an order on the variables of the dictionary; then the whole issue is to decide at which level one should truncate the dictionary to make the best tradeoff between approximation and complexity. Here, our purpose is to resolve this last dilemma by proposing a selected Lasso estimator based on an algorithm that automatically chooses the best level of truncation of the dictionary once the variables have been ordered. Of course, the algorithm, and thus the estimation of the target function, will depend on the order in which the variables have been classified beforehand. Notice that the classification of the variables can prove to be more or less difficult according to the problem under consideration. Nonetheless, there are a few applications where there may be an obvious order for the variables, for instance in the important case of dictionaries of wavelets.
For the particular case of an orthonormal dictionary, where the truncated Lasso estimators are nothing other than soft-thresholding estimators with a fixed threshold, the selected Lasso estimator is a soft-thresholding estimator with an adaptive threshold which is automatically chosen by the algorithm constructing this estimator. So, our procedure provides a new contribution to the crucial problem of choosing the threshold when working with soft-thresholding estimators.
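To make the soft-thresholding connection concrete, here is a minimal sketch (not from the paper) for the orthonormal case: with the criterion γ(h) = ‖h‖² − 2Y(h) of Section 2, the ℓ1-penalized problem decouples coordinate-wise into minimizing θ_j² − 2θ_j y_j + λ|θ_j|, where y_j = Y(φ_j) are the empirical coefficients, and the solution is soft-thresholding at λ/2. The function name and inputs are illustrative.

```python
import numpy as np

def lasso_orthonormal(y_coeffs, lam):
    """Lasso over an orthonormal dictionary.

    Minimizing theta_j**2 - 2*theta_j*y_j + lam*|theta_j| coordinate-wise
    gives theta_j = sign(y_j) * max(|y_j| - lam/2, 0), i.e. soft-thresholding
    of the empirical coefficients at threshold lam/2.
    """
    y = np.asarray(y_coeffs, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)
```

The selected Lasso estimator of Section 4 then amounts to choosing λ (hence the threshold) adaptively rather than fixing it in advance.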

Definition of the selected Lasso estimator
We still consider the generalized linear Gaussian model and the statistical problem (2.1) introduced in Section 2. To solve this problem, we recall that we use a dictionary D = {φ_j}_j and seek an estimator f̂ = θ̂.φ = ∑_{j: φ_j∈D} θ̂_j φ_j solution of the penalized empirical risk minimization problem (4.1), where pen(h) is a suitable positive penalty. Here, we assume that the dictionary is infinite countable and that it is ordered, D = {φ_j}_{j∈N*} = {φ_1, φ_2, . . . }. Given this order, we can consider the sequence of truncated dictionaries (D_p)_{p∈N*}, where D_p := {φ_1, . . . , φ_p} corresponds to the subdictionary of D truncated at level p, and the associated sequence of Lasso estimators (f̂_p)_{p∈N*} defined in Section 3.1, where (λ_p)_{p∈N*} is a sequence of regularization parameters whose values will be specified below. Now, we shall choose a final estimator as an ℓ0-penalized estimator among a subsequence of the Lasso estimators (f̂_p)_{p∈N*}. More precisely, let us denote by Λ the set of dyadic integers, Λ := {2^J, J ∈ N*}, and define

p̂ ∈ arg min_{p∈Λ} [ γ(f̂_p) + λ_p ‖f̂_p‖_{L₁(D_p)} + pen(p) ],   (4.3)

where pen(p) penalizes the size p of the truncated dictionary D_p for all p ∈ Λ. Then, the final estimator we consider is the selected Lasso estimator f̂_p̂. From (4.3) and the fact that L₁(D) = ∪_{p∈Λ} L₁(D_p), we see that this selected Lasso estimator f̂_p̂ is a penalized least squares estimator solution of (4.1) where, for all p ∈ Λ and h ∈ L₁(D_p), pen(h) = λ_p ‖h‖_{L₁(D_p)} + pen(p) is a combination of both ℓ1-regularization and complexity penalization. We also see from (4.3) that the algorithm automatically chooses the rank p̂ so that f̂_p̂ makes the best tradeoff between approximation, ℓ1-regularization and sparsity.
Remark 4.1. From a theoretical point of view, one could have defined f̂_p̂ as an ℓ0-penalized estimator among the whole sequence of Lasso estimators (f̂_p)_{p∈N*} (or more generally among any subsequence of (f̂_p)_{p∈N*}) instead of (f̂_p)_{p∈Λ}. Nonetheless, to compute f̂_p̂ efficiently, it is interesting to limit the number of computations of the sequence of Lasso estimators f̂_p, especially if we choose a complexity penalty pen(p) that does not grow too fast with p.
In the sequel, we shall consider a penalty pen(p) ∝ ln p. So, taking a dyadic truncation D_p := {φ_1, . . . , φ_p} = {φ_1, . . . , φ_{2^J}} of the dictionary D enables us to get a complexity penalty pen(p) ∝ ln p = J ln 2, which grows linearly with the step J of the algorithm, thus leading to a more efficient algorithm. That is why we have chosen to work with a dyadic subsequence of dictionaries rather than with other subsequences.
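The dyadic selection procedure can be sketched in a finite-sample analogue (illustrative only: the empirical criterion, the constants in λ_p, and `pen_const` below are placeholders for the paper's γ(·), the lower bound (3.3), and pen(p) ∝ ln p):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # coordinate descent for (1/2)*||y - X @ theta||^2 + lam*||theta||_1
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return theta

def selected_lasso(X, y, eps=1.0, c=4.0, pen_const=2.0):
    """Fit Lasso estimators on dyadic truncations p = 2, 4, 8, ... of the
    column dictionary, with lam_p = c*eps*(sqrt(ln p) + 1) as in (3.3), and
    keep the level minimizing an (illustrative) penalized criterion
    ||y - X_p @ theta||^2 + lam_p*||theta||_1 + pen_const*ln(p)."""
    best_crit, best_p, best_theta = np.inf, None, None
    p = 2
    while p <= X.shape[1]:
        lam_p = c * eps * (np.sqrt(np.log(p)) + 1.0)
        theta = lasso_cd(X[:, :p], y, lam_p)
        crit = (np.sum((y - X[:, :p] @ theta) ** 2)
                + lam_p * np.sum(np.abs(theta))
                + pen_const * np.log(p))
        if crit < best_crit:
            best_crit, best_p, best_theta = crit, p, theta
        p *= 2
    return best_p, best_theta
```

The returned truncation level plays the role of p̂; with pen(p) ∝ ln p, moving from level 2^J to 2^{J+1} only adds ln 2 to the complexity term, as noted above.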
Although our primary motivation for introducing the selected Lasso estimator described above was to construct an estimator adapted from the Lasso and fitted to solve estimation problems dealing with infinite dictionaries, notice that this estimator remains well-defined and can also be interesting for estimation in the case of large finite dictionaries. Indeed, if we consider a finite dictionary of size p_0, then it can be advantageous to work with the selected Lasso estimator f̂_p̂ rather than with the Lasso estimator f̂_{p_0}, since the definition of f̂_p̂ guarantees that f̂_p̂ always makes a better tradeoff between approximation, ℓ1-regularization and sparsity than f̂_{p_0}. Besides, f̂_p̂ is always sparser than f̂_{p_0} since p̂ ≤ p_0. In particular, notice that f̂_p̂ and f̂_{p_0} coincide when p̂ = p_0.

An oracle inequality
By applying the same general model selection theorem (Theorem A.1) as for the establishment of Theorem 3.1, we can provide a risk bound satisfied by the estimator f̂_p̂ with properly chosen penalties λ_p and pen(p) for all p ∈ Λ. The sequence of ℓ1-regularization parameters (λ_p)_{p∈Λ} is simply chosen from the lower bound given by (3.3), while a convenient choice for the complexity penalty will be pen(p) ∝ ln p.

The Lasso for particular infinite uncountable dictionaries
As explained at the beginning of this section, it is generally very difficult to establish good theoretical results on the performance of the Lasso for infinite dictionaries. Yet, let us just point out here that it can be easier to prove such results for some particular infinite dictionaries whose structure is nice enough. For example, this is the case for neural networks in the fixed design Gaussian regression models. Recall that a neural network is a real-valued function defined on R^d belonging to the linear span of the dictionary D = {φ_{a,b}, a ∈ R^d, b ∈ R} of neurons φ_{a,b} defined by (4.6). Despite the fact that the dictionary D is infinite uncountable, we are able to establish an ℓ1-oracle inequality satisfied by the Lasso which is similar to the one provided in Theorem 3.1 in the case of a finite dictionary. This is due to the very particular structure of the dictionary D, which is only composed of functions derived from the Heaviside function. This property enables us to achieve theoretical results without truncating the whole dictionary into finite subdictionaries, contrary to the study developed above where we considered arbitrary infinite countable dictionaries. The following ℓ1-oracle inequality is once again a direct application of the general model selection Theorem A.1 already used to prove both Theorem 3.1 and Theorem 4.2. Assume that the regularization parameter λ > 0 is taken large enough, for some absolute constant κ > 0 large enough, and consider the corresponding Lasso estimator f̂ defined by (4.7).
Then, there exists an absolute constant C > 0 such that the Lasso f̂ satisfies an ℓ1-oracle inequality similar to (3.4).

Appendix A: A model selection theorem

Let us end this paper by describing the main idea that has enabled us to establish all the oracle inequalities of Theorem 3.1, Theorem 4.2 and Theorem 4.3 as an application of a single general model selection theorem, and by stating and proving this general theorem. We keep the notations introduced in Section 2. The basic idea is to view the Lasso estimator as the solution of a penalized least squares model selection procedure over a properly defined countable collection of models with ℓ1-penalty. The key observation that enables one to make this connection is the simple fact that

L₁(D) = ∪_{R>0} {h ∈ L₁(D), ‖h‖_{L₁(D)} ≤ R},

so that, for any finite or infinite given dictionary D, the Lasso f̂ satisfies

f̂ ∈ arg min_{R>0} [ inf_{h ∈ L₁(D), ‖h‖_{L₁(D)} ≤ R} γ(h) + λR ].

Then, to obtain a countable collection of models, we just discretize the family of ℓ1-balls {h ∈ L₁(D), ‖h‖_{L₁(D)} ≤ R} by setting, for any integer m ≥ 1,

S_m := {h ∈ L₁(D), ‖h‖_{L₁(D)} ≤ mε},

and define m̂ as the smallest integer such that f̂ belongs to S_m̂, i.e. m̂ := min{m ∈ N*, f̂ ∈ S_m}.
It is now easy to derive, from the definitions of m̂ and f̂, that f̂ is a ρ-approximate penalized least squares estimator over the sequence of models given by the collection of ℓ1-balls {S_m, m ≥ 1}, with pen(m) = λmε and ρ = λε. This property will enable us to derive ℓ1-oracle inequalities by applying a general model selection theorem that guarantees such inequalities provided that the penalty pen(m) is large enough. This general theorem, stated below as Theorem A.1, is a restricted version of an even more general model selection theorem that the interested reader can find in [17], Theorem 4.18. Then, there is a positive constant C(K) such that, for all f ∈ H and z > 0, with probability larger than 1 − Σ e^{−z}, the stated oracle inequality holds. Integrating this inequality with respect to z leads to the corresponding risk bound in expectation.

B.1. Proof of Theorem 3.1

We have noticed at (A.2) that the Lasso estimator f̂_p is a ρ-approximate penalized least squares estimator over the sequence {S_m, m ≥ 1} for pen(m) = λ_p mε and ρ = λ_p ε. So, it only remains to determine a lower bound on λ_p that guarantees that pen(m) satisfies condition (A.4).
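As a tiny numerical companion to the discretization step above (assuming, as the penalty pen(m) = λmε suggests, that S_m is the ℓ1-ball of radius mε), the index m̂ of the smallest discretized ball containing a fitted function is simply:

```python
import math

def smallest_ball_index(l1_norm, eps):
    """Smallest integer m with l1_norm <= m * eps, i.e. the index m_hat of
    the first discretized ball S_m containing the fitted function."""
    return max(1, math.ceil(l1_norm / eps))
```

This is the only data-dependent quantity in the reduction: once m̂ is known, the Lasso is treated as a penalized least squares estimator over the fixed countable collection {S_m, m ≥ 1}.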

B.2. Proof of Theorem 4.2
Let M = N* × Λ and consider the set of ℓ1-balls S_{m,p} := {h ∈ L₁(D_p), ‖h‖_{L₁(D_p)} ≤ mε} for all (m, p) ∈ M. Define m̂ as the smallest integer such that f̂_p̂ belongs to S_{m̂,p̂}, i.e.
From (B.6) and (4.4), using the fact that for all p ∈ Λ, √ln p ≤ (ln p)/√ln 2, the definitions of α and f̂_p̂, and the fact that L₁(D_p) = ∪_{m∈N*} S_{m,p}, we get that, with pen(m, p) := λ_p mε + α pen(p) and ρ_p := (1 − α) pen(p) + c_1 ε² (notice that, thanks to the assumption c_2 > c_1/√ln 2, we have α ∈ ]0, 1[, so pen(m, p) > 0 and ρ_p > 0), f̂_p̂ is equivalent to a ρ_p-approximate penalized least squares estimator over the sequence of models {S_{m,p}, (m, p) ∈ M}. By applying Theorem A.1, this property will enable us to derive a performance bound satisfied by f̂_p̂, provided that pen(m, p) is large enough.
Lemma B.1. Let t > 0 and let D = {φ_{a,b} : a ∈ R^d, b ∈ R} be a dictionary of neurons, where φ_{a,b} is defined by (4.6). Then the following bound holds, where C > 0 is an absolute constant (C ≥ 22 is convenient).
Proof. The result just comes from the fact that D is a subset of the Boolean n-cube with Vapnik-Chervonenkis dimension d + 1. Indeed, for all a ∈ R^d and b ∈ R, let us denote by A_{a,b} the affine half-space of R^d defined by A_{a,b} = {x ∈ R^d : ⟨a, x⟩ + b > 0} and consider the associated VC class A = {A_{a,b}, a ∈ R^d, b ∈ R}, which is of dimension d + 1. Also introduce