On the asymptotic properties of the group lasso estimator for linear models

We establish estimation and model selection consistency, prediction and estimation bounds and persistence for the group-lasso estimator and model selector proposed by Yuan and Lin (2006) for least squares problems when the covariates have a natural grouping structure. We consider the case of a fixed-dimensional parameter space with increasing sample size and the double asymptotic scenario where the model complexity changes with the sample size.


Introduction
In recent years there has been rapidly growing interest in penalized least squares problems via $\ell_1$ regularization, especially in high-dimensional settings where the model complexity is comparable to or even larger than the sample size. The lasso, originally put forward by Tibshirani (1996) for linear regression models, is a regularization procedure in which the penalty for model complexity is the $\ell_1$ norm of the estimated coefficients. It has the crucial advantages of being a convex problem, thus computationally feasible even when the number of predictors is larger than the sample size, and of producing solutions that are sparse, i.e. containing zero components. These two key properties make the lasso simultaneously a shrinkage estimation and a model selection procedure that is viable in high-dimensional problems where traditional model selection criteria are not feasible. Furthermore, the lasso has been shown to have optimal theoretical properties: model selection consistency, also called sign consistency or sparsistency (see, e.g., Meinshausen and Bühlmann, 2006; Wainwright, 2006, 2007; Zhao and Yu, 2006), consistency and oracle properties (see, e.g., Meinshausen and Yu, 2006; Bickel et al., 2007; Bunea et al., 2007a,b; Zhang, 2007; Koltchinskii, 2005; Zhang and Huang, 2007), and persistence (Greenshtein and Ritov, 2006; Greenshtein, 2006).
Researchers have also devised a few extensions of the lasso that are suited to regression problems in which the explanatory variables are grouped or organized in a hierarchical manner and that, at the same time, retain the computational ease and the shrinkage properties of the lasso. We mention, in particular, the group-lasso procedure by Yuan and Lin (2006) and its extension by Kim et al. (2006), the elastic net regularization by Zou and Hastie (2005), the hierarchical lasso by Zhou and Zhu (2007), regularization methods based on the $\ell_\infty$ penalty by Gilbert et al. (2005) and the very general CAP penalties by Zhao et al. (2007). Most of these procedures essentially comprise a penalty for the model complexity that results from a composition of the $\ell_1$ norm with some other norm computed over each group of parameters, thus exhibiting a behavior that, at the group level, resembles that of the lasso solution. Besides ANOVA models, the group-lasso penalty has been applied to generalized linear models in Dahinden et al. (2006), Meier et al. (2006) and Nardi and Rinaldo (2007) and to non-parametric problems in Bach (2007) and Ravikumar et al. (2007).
The general purpose of this paper is to prove for the group-lasso estimator described in Yuan and Lin (2006) the same type of optimality properties that have been established for the lasso estimator. In particular, we will derive conditions ensuring estimation and model selection consistency, prediction and estimation consistency, oracle properties and persistence. For the case of a fixed-dimensional parameter space, Bach (2007) derives some conditions for estimation and model selection consistency. For the double-asymptotic scenario in which the dimension of the parameter space grows with the sample size, a rigorous study of the performance of the group-lasso seems to be missing in the statistical literature. Our contributions include novel consistency and asymptotic normality results for the fixed-dimensional parameter space, model selection consistency when the number of predictors is larger than the sample size, oracle inequalities and persistence properties. Our methods of proof are based on non-trivial extensions and generalizations of conditions and results for the lasso procedure already in existence in the literature.
The paper is organized as follows. Section 2 introduces the group-lasso settings for least squares problems. In section 3 we establish estimation and model selection consistency and asymptotic normality under the traditional scenario of increasing sample size and fixed parameter space. The conditions we impose are of a different nature than the ones introduced in Bach (2007) and the results we obtain complement that analysis. In section 4 we investigate the properties of the group-lasso solution under the more complex, double-asymptotic scenario in which both the sample size and the model complexity grow simultaneously. In section 4.1 we provide a sufficient condition guaranteeing uniqueness of the group-lasso solution when the number of covariates is larger than the sample size. In section 4.2 we provide conditions for model selection consistency that hold even when the number of covariates grows at a larger rate than the sample size, and in section 4.3 we derive finite sample bounds that can be used to establish consistency for estimation and prediction. Finally, in section 4.4 we derive two persistence properties. All the proofs are gathered in section 5 and in the Appendix.

The group-Lasso settings
Let $H$ be an index set representing a class of linear subspaces of $\mathbb{R}^n$, each subspace being spanned by the columns of an $n \times d_h$ matrix $X_h$, where $h$ ranges over $H$. We will assume henceforth that the set $H$ is known and has been assigned a total ordering, and we will always use this ordering. Let $X$ be the $n \times d$ design matrix formed by concatenating the design matrices $X_h$, $h \in H$, with $d = \sum_h d_h$. While we allow for non-zero correlations among groups, namely $X_h^\top X_{h'} \neq 0$ for $h \neq h'$, we will make the simplifying assumption $\frac{1}{n} X_h^\top X_h = I_{d_h}$ for each $h \in H$, where $I_{d_h}$ denotes the $d_h$-dimensional identity matrix, a condition that can be enforced via the Gram-Schmidt orthogonalization procedure.
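As an illustration of how the within-group orthonormality condition might be enforced in practice, the following sketch replaces each block $X_h$ by an orthonormal basis of its column span. It uses a reduced QR factorization in place of classical Gram-Schmidt (the two yield the same column span), and it assumes that $n \geq d_h$, that each block has full column rank, and that the normalization is $\frac{1}{n} X_h^\top X_h = I_{d_h}$. The function name `orthonormalize_groups` and the argument `group_sizes` are hypothetical.

```python
import numpy as np

def orthonormalize_groups(X, group_sizes):
    """Orthonormalize the columns of each block X_h separately.

    Assumes each block has full column rank and n >= d_h. After the call,
    every block satisfies X_h^T X_h / n = I (an assumed normalization).
    Correlations between different blocks are left untouched.
    """
    n = X.shape[0]
    blocks, start = [], 0
    for d_h in group_sizes:
        X_h = X[:, start:start + d_h]
        # Reduced QR: Q_h is n x d_h with orthonormal columns spanning col(X_h).
        Q_h, _ = np.linalg.qr(X_h)
        blocks.append(np.sqrt(n) * Q_h)
        start += d_h
    return np.hstack(blocks)
```

Note that the coefficients of the transformed design differ from those of the original one by an invertible block-wise linear map, so the set of non-zero blocks is unchanged.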
For a subset $H' \subset H$, we will write […] for the $(\sum_{h \in H_1} d_h) \times (\sum_{h \in H_2} d_h)$ block matrix, with blocks indexed by the subsets $H_1$ and $H_2$. In particular, if $H_1 = H_2$, we will simply write $M_{H_1}$.
We assume that the $n$-dimensional observed vector $Y$ satisfies the linear model

$$Y = X\beta^0 + \epsilon, \qquad (1)$$

where $\epsilon$ is an $n$-dimensional vector of iid errors, with distributional properties to be specified below, and $\beta^0$ is the unknown $d$-dimensional vector of true coefficients. The vector $\beta^0$ can then be represented as $\mathrm{vec}\{\beta^0_h, h \in H\}$, the concatenation of $|H|$ vectors, where $\beta^0_h \in \mathbb{R}^{d_h}$ for each $h \in H$. Our crucial modeling assumption is that some of the subvectors of $\beta^0$ are zero, and we will denote by $H_0 = \{h : \beta^0_h \neq 0\}$ the unknown index set of non-zero subvectors of $\beta^0$. The true model complexity is then given by $d_0 = \sum_{h \in H_0} d_h$.

We consider the problem of estimating both $\beta^0$ and $H_0$ in the non-trivial situation in which the number $|H_0|$ of subspaces spanning the true mean vector of the response variable $Y$ is smaller than the total number $|H|$ of candidate subspaces. In essence, the estimation of the true underlying model $H_0$ requires identifying, based on $Y$, the zero subvectors of $\beta^0$ and removing the blocks indexed by $H_0^c$. This may be naturally formulated as a penalized least squares problem with an $\ell_0$ penalty on the number of subspaces included. Effectively, this entails considering all possible subsets of $H$, an NP-hard task that is computationally infeasible when $|H|$ (and therefore $d$) is large. Instead, Yuan and Lin (2006) propose to use the group-lasso penalty, which is a convex relaxation of the $\ell_0$ penalty based on the combination of the $\ell_1$ penalty over the number of subspaces with the $\ell_2$ penalty on the estimated coefficients of each subspace. The resulting group-lasso estimator is obtained as the solution to the convex problem (2), where $\lambda$ and $\{\lambda_h, h \in H\}$ are tuning parameters that depend on the sample size $n$. A reasonable choice for $\lambda_h$ is $\sqrt{d_h}$, so that larger subspaces are penalized more heavily. The group-lasso regularization is an extension of the lasso, or $\ell_1$ penalty function, and consists of applying first the $\ell_2$ penalty to individual blocks, to promote non-sparsity within each block, and then the $\ell_1$ norm to the resulting block norms, to promote block sparsity. Notice also that the group-lasso problem (2) includes as special cases the lasso and adaptive lasso (see Zou, 2006) problems, in which $|H| = d$ and each $h$ corresponds to the 1-dimensional subspace of $\mathbb{R}^n$ spanned by the corresponding column of the design matrix $X$.
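The composition of norms just described can be made concrete with a short sketch. It evaluates a penalty of the form $\lambda \sum_h \lambda_h \|\beta_h\|_2$ and a penalized least squares criterion. The $\frac{1}{2n}$ scaling of the squared error is an assumption made here for illustration only, and the function names are hypothetical.

```python
import numpy as np

def group_norms(beta, group_sizes):
    """Euclidean norm of each sub-vector beta_h, in the given group order."""
    edges = np.cumsum([0] + list(group_sizes))
    return np.array([np.linalg.norm(beta[a:b])
                     for a, b in zip(edges[:-1], edges[1:])])

def group_lasso_penalty(beta, group_sizes, lam, weights=None):
    """lam * sum_h weights[h] * ||beta_h||_2, with weights[h] = sqrt(d_h) by default."""
    if weights is None:
        weights = np.sqrt(np.asarray(group_sizes, dtype=float))
    return lam * float(np.dot(weights, group_norms(beta, group_sizes)))

def group_lasso_objective(beta, X, y, group_sizes, lam, weights=None):
    """Squared error scaled by 1/(2n) (assumed scaling) plus the group penalty."""
    n = X.shape[0]
    resid = y - X @ beta
    return float(resid @ resid) / (2 * n) + group_lasso_penalty(
        beta, group_sizes, lam, weights)
```

When every group has size one, the default weights all equal 1 and the penalty reduces to $\lambda \|\beta\|_1$, which matches the remark that the lasso is a special case.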
Equation (2) is the Lagrangian function (with Lagrange multipliers $\{\lambda\lambda_h, h \in H\}$) of the equivalent convex problem (3), where $\{t_h, h \in H\}$ are non-negative constants. In fact, there exists a correspondence between the coefficients $\{t_h, h \in H\}$ and $\{\lambda\lambda_h, h \in H\}$ of (3) and (2), respectively. In this article, we will mostly focus on the more popular, unconstrained formulation (2), which has, in particular, the advantage of letting one choose the regularization parameters in a more direct way. The constrained formulation (3) will be used in Section 4.4 to establish persistence properties of the group-lasso.
In our analysis, we will study the asymptotic properties of the group-lasso estimator $\hat\beta$, defined as a minimizer of (2), and of the associated group-lasso model selector

$$\hat H = \{h : \hat\beta_h \neq 0\}. \qquad (4)$$

We will consider two asymptotic regimes. In the simpler, traditional scenario, we assume that the model complexity is fixed and that only the sample size $n$ increases. In the second, more modern, scenario, the model complexity increases with the sample size and we will then study the group-lasso solutions to a sequence of linear models in which $|H|$ and $\{d_h, h \in H\}$ grow with $n$. In fact, we allow $d$ to grow at a faster rate than $n$. For ease of readability, we will not make the dependence of $H$, $\{d_h, h \in H\}$, $\lambda$, $X$, $\epsilon$ and $\{\lambda_h, h \in H\}$ on $n$ explicit, although it will be apparent that all those quantities may change with $n$.
We conclude this section with some computational remarks. The subgradient conditions for the problem (2) are given in (5), where the $z_h$ are generic vectors such that $\|z_h\|_2 \le 1$ for all $h$. Because the objective function in (2) is convex on $\mathbb{R}^d$, the first-order conditions obtained by solving the subgradient equations produce the solutions to the group-lasso problem. By inspecting the subgradient conditions (5), Yuan and Lin (2006) devise a modification of the LARS algorithm by Efron et al. (2004) that accounts for the block structure of the penalty function and can be used to solve (2) numerically. Dahinden et al. (2006) improve on this method and develop a different computational strategy based on a block-coordinate gradient descent method in the context of logistic regression, which can be adapted to the present settings. See Dahinden et al. (2006) and, in particular, Zhao et al. (2007) for further details and some discussion on the computational aspects of the group-lasso estimator and on the choice of the regularization parameters.
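For readers who want to experiment numerically, a minimal proximal-gradient sketch is given below. It is neither the LARS modification of Yuan and Lin (2006) nor the block-coordinate method of Dahinden et al. (2006). It is a generic substitute that exploits the same block structure through block soft-thresholding, and it assumes the $\frac{1}{2n}$ scaling of the squared error. The function names, the default weights $\sqrt{d_h}$ and the fixed iteration count are illustrative choices.

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: returns 0 if ||v||_2 <= t, else shrinks v."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def group_lasso_prox_grad(X, y, group_sizes, lam, weights=None, n_iter=2000):
    """Proximal gradient iterations for
    (1/(2n)) * ||y - X b||^2 + lam * sum_h weights[h] * ||b_h||_2  (assumed scaling).
    """
    n, d = X.shape
    if weights is None:
        weights = np.sqrt(np.asarray(group_sizes, dtype=float))
    edges = np.cumsum([0] + list(group_sizes))
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        z = beta - step * grad
        for h, (a, b) in enumerate(zip(edges[:-1], edges[1:])):
            beta[a:b] = block_soft_threshold(z[a:b], step * lam * weights[h])
    return beta
```

A block is set exactly to zero whenever its gradient step falls inside the ball of radius `step * lam * weights[h]`, which is how the block sparsity of the solution arises in this scheme.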

Example: ANOVA models
Consider an ANOVA design, arising from the cross-classification of $K$ categorical variables, each taking values in a finite set $\mathcal{I}_k = \{1, \ldots, I_k\}$, for $k = 1, \ldots, K$.
Let $\mathcal{I} = \mathcal{I}_1 \times \cdots \times \mathcal{I}_K$ be the set of cells and $I = \prod_k I_k$ the total number of cells. Also, for each $i \in \mathcal{I}$, let $n_i$ be the total number of observations in cell $i$. Then $H = 2^{\mathcal{K}}$, the power set of $\mathcal{K} = \{1, \ldots, K\}$. Each $h \in H_0$ represents an effect. For example, $h = \emptyset$ corresponds to the grand mean, a subset $h$ with $|h| = 1$ to a main effect and, more generally, a subset $h$ to an interaction effect among the variables indexed by $h$. The true model can be represented as $H_0 \subseteq 2^{\mathcal{K}}$.
As $h$ ranges over $H$, $\mathbb{R}^I$ can be decomposed into the direct sum of orthogonal subspaces indexed by $h$, each with dimension $\prod_{j \in h} (I_j - 1)$ (see, e.g., Rinaldo, 2006). Let $U_h$ be a matrix of full column rank spanning the subspace indexed by $h \subset \mathcal{K}$. Then, the columns of $\bigoplus_{h \in 2^{\mathcal{K}}} U_h$ form a basis for $\mathbb{R}^I$. Next, let $T$ be an $n \times I$ matrix of the form […], where each $1_i$ is an $n_i$-dimensional vector of ones, and has full column rank and its columns span a […] $X_h^\top X_{h'} \neq 0$ when the cell counts $n_i$ differ, i.e. when the model is unbalanced or when some cells are empty. It is clear that the group-lasso settings described above include as special cases unbalanced and empty-cells ANOVA models, for which the usual decomposition of sums of squares does not hold.
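The dimension count in this decomposition can be checked with a small sketch: it enumerates all subsets $h$ of the $K$ factors, assigns each the dimension $\prod_{j \in h}(I_j - 1)$, and the dimensions then add up to the number of cells $I = \prod_k I_k$. The function name `anova_effect_dims` is hypothetical, and factors are indexed from 0 for convenience.

```python
from itertools import combinations
from math import prod

def anova_effect_dims(levels):
    """Map each subset h of factor indices to prod_{j in h} (I_j - 1).

    `levels` lists the number of levels I_k of each factor. The empty subset
    (grand mean) gets dimension 1, since an empty product equals 1.
    """
    K = len(levels)
    dims = {}
    for size in range(K + 1):
        for h in combinations(range(K), size):
            dims[h] = prod(levels[j] - 1 for j in h)
    return dims
```

For three factors with 2, 3 and 4 levels, the eight effects have dimensions summing to $2 \cdot 3 \cdot 4 = 24$, in agreement with the identity $\prod_k \big(1 + (I_k - 1)\big) = \prod_k I_k$.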

Fixed-d asymptotics
In this section we derive conditions for model selection and estimation consistency when the sample size $n$ increases, while the parameter space remains fixed. Our consistency results rely on different assumptions and slightly different settings than the analogous results in Bach (2007), and our analysis provides novel insights into this problem. Furthermore, we obtain rates of consistency and asymptotic normality.
We will use the classical assumptions for consistency of the ordinary least squares solutions: (F1) $\frac{1}{n} X^\top X \to M$, where $M$ is a positive definite matrix; (F2) the errors $\epsilon_i$ are iid with mean zero and finite second moment $\sigma^2$.
To motivate our analysis, we first consider the necessary conditions for the group-lasso procedure to be model selection consistent, namely

$$\mathbb{P}(\hat H = H_0) \to 1, \qquad (6)$$

where $\hat H$ is defined in (4). An adaptation of Theorem 3 in Bach (2007) yields that, under assumptions (F1) and (F2), (6) holds only if the weak irreducibility condition (7) is verified, where $B_{H_0}$ denotes the $d_0$-dimensional block-diagonal matrix with blocks […]. We remark that (7) generalizes an analogous necessary condition for model selection consistency of the lasso (see Zou, 2006; Zhao and Yu, 2006; Yuan and Lin, 2006). Below, we derive a different necessary condition for model selection consistency, which provides a rationale for the results we derive in the remainder of this section.
Proposition 3.1. Under assumptions (F1) and (F2), the model selection consistency property (6) holds only if […].

Using the previous condition, it seems natural to consider sequences of penalty parameters such that […], which will also satisfy the weak irreducibility condition (7). Implicitly, this idea is behind both Theorem 3.2 and Theorem 3.3. The weak irreducibility condition and the other necessary condition of Proposition 3.1 both have the undesirable feature of depending on the unknown index set $H_0$ of non-zero blocks. To remedy this problem, in the following result we describe an oracle procedure which automatically yields model selection consistency without knowing $H_0$. This estimator is obtained as a direct generalization to the group-lasso framework of the adaptive lasso penalty put forward by Zou (2006). We write $\hat\beta_{OLS} = (X^\top X)^{-1} X^\top Y$ for the least squares estimate of $\beta$. In the proof we essentially follow Zou (2006) and Knight and Fu (2000) and generalize their results to our settings.
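To make the adaptive idea concrete, the sketch below computes data-driven block weights from the least squares fit. It assumes weights of the form $\lambda_h = \|\hat\beta_{OLS,h}\|_2^{-\gamma}$, chosen by analogy with the adaptive lasso of Zou (2006). The exponent `gamma`, the function name and the use of a least squares solver in place of the explicit inverse are illustrative assumptions.

```python
import numpy as np

def adaptive_group_weights(X, y, group_sizes, gamma=1.0):
    """Block weights ||beta_OLS_h||_2 ** (-gamma), an assumed adaptive form.

    Blocks whose least squares estimate is close to zero receive a large
    weight and are therefore penalized heavily.
    """
    # Least squares fit; equals (X^T X)^{-1} X^T y when X has full column rank.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    edges = np.cumsum([0] + list(group_sizes))
    norms = np.array([np.linalg.norm(beta_ols[a:b])
                      for a, b in zip(edges[:-1], edges[1:])])
    return norms ** (-gamma)
```

Because a $\sqrt{n}$-consistent estimate of a zero block shrinks toward zero, weights of this form grow without bound on the zero blocks, while they stay bounded on the non-zero blocks.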
Theorem 3.2. Assume (F1) and (F2) and let […] $\sqrt{n}\lambda \to 0$, the model selection consistency property (6) is satisfied and, furthermore, […] (8), where […] and $Z_{H_0^c} = 0$.

Remark. The only property of the ordinary least squares estimate $\hat\beta_{OLS}$ that was used in the proof is its $\sqrt{n}$-consistency. This is enough to guarantee that the penalty parameters $\{\lambda_h, h \in H_0^c\}$ corresponding to the index set of the zero subvectors of $\beta^0$ are very large, with high probability for all $n$ big enough. More generally, the Theorem remains true also when $\hat\beta_{OLS}$ is replaced by any $a_n$-consistent estimator, where […].

We conclude this section with one final consistency result for the group-lasso estimator, which demands the knowledge of $H_0$. Unlike the consistency results derived in Bach (2007, Section 2), the weak irreducibility condition (7) is replaced by conditions on the asymptotic behavior of $\sqrt{n}\lambda\lambda_h$, $h \in H$. Despite its reduced practical value, this result has the merit of showing explicitly that the penalty terms for the zero and non-zero blocks need to have a different asymptotic behavior.

Theorem 3.3. Assume (F1), (F2) and further assume that the (possibly random) sequence $\{a_n\}$, with […], then the conclusions of Theorem 3.2 still hold.
Remark. The previous Theorem covers cases in which estimation consistency may hold (at a suboptimal rate) but not model selection consistency.
Theorem 3.2 and Theorem 3.3 both establish that the group-lasso estimator is asymptotically optimal, namely unbiased and efficient, and, therefore, offers the same asymptotic guarantees as the ordinary, unpenalized, least squares estimator. However, unlike ordinary least squares, the group-lasso solution comes equipped with a built-in penalty for sparsity, so that some of its block components will be zero. In fact, and this is key, as $n$ increases, these zero components will be the same zero components of the true vector of coefficients $\beta^0$, with probability tending to 1. In contrast, the components of the ordinary least squares solution are all non-zero, which makes it much less effective at recovering $H_0$.

Double asymptotics
We turn now to the study of the double-asymptotic scenario in which $|H| \to \infty$ and the block dimensions $\{d_h, h \in H\}$ are allowed to change with $n$. In particular, this includes situations in which $d \gg n$, i.e. $d$ grows faster than $n$.
To simplify our derivations, we will enforce a normality assumption on the vector $\epsilon$ of errors: (N) $\epsilon \sim N_n(0, \sigma^2 I_n)$. Specific cases in which this assumption can be relaxed are discussed as we proceed with our analysis.

Uniqueness of the group-Lasso solution
When $d > n$, there is a $(d - n)$-dimensional affine space of vectors satisfying the model equation (1). As a result, the solution to (2) need not be unique and, therefore, it may no longer make sense to refer to "the" group-lasso estimator or model selector. To overcome this problem, we may want to impose the following condition, which is enough to guarantee uniqueness of the model representation (1) and, therefore, of the group-lasso solution:

(U(c)) […], for some constants $c > 0$ and $\delta > \lambda^2_{\max}$,

where, for an $m \times p$ matrix $A$, $\|A\|_2$ denotes the operator norm with respect to the Euclidean metric. In stating the assumption, we make explicit only the dependence on the more relevant constant $c$.

Proposition 4.1. Under assumption (U(c)), if $\beta_1$ and $\beta_2$ satisfy (1) with $|\{h :$ […]

Remarks.

1. Assumption (U(c)) is the group-lasso equivalent of Assumption 2 in Lounici (2008) on the maximal mutual coherence between different columns of the design matrix $X$, which is […], where $\rho > 1$ and $c > 0$. We point out that uniqueness of the representation (1) follows also from this mutual coherence condition. However, assumption (U(c)) is more naturally tailored to the problem at hand and, furthermore, implies the important (RE($|H_0|$, c)) condition (see Proposition 4.4 below), which is essential to establish the bounds derived in Section 4.3.

2. Alternatively, one may consider investigating conditions guaranteeing uniqueness of the group-lasso solution (2) directly, rather than of the model representation, following the arguments used in Osborne et al. (2000) for the lasso problem. Although it is apparent from their analysis that $|\hat H| \le n$, i.e. the number of non-zero blocks is no larger than the sample size, extending the polyhedral arguments of Osborne et al. (2000, Section 3.1) to the group-lasso penalty appears problematic.
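The quantities entering coherence-type conditions are straightforward to compute for a given design, as the following sketch suggests. The first function returns the maximal absolute off-diagonal entry of $\frac{1}{n} X^\top X$, a common definition of mutual coherence between columns. The second is a hypothetical block-level analogue based on the operator norm of $\frac{1}{n} X_h^\top X_{h'}$ for $h \neq h'$. Both the $\frac{1}{n}$ scaling and the block-level quantity are assumptions and are not claimed to reproduce condition (U(c)) exactly.

```python
import numpy as np

def max_mutual_coherence(X):
    """Largest absolute off-diagonal entry of X^T X / n (assumed scaling)."""
    n = X.shape[0]
    gram = X.T @ X / n
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.max(np.abs(off_diag)))

def max_cross_block_norm(X, group_sizes):
    """Largest operator norm of X_h^T X_k / n over pairs of distinct blocks.

    Hypothetical block-level analogue of mutual coherence.
    """
    n = X.shape[0]
    edges = np.cumsum([0] + list(group_sizes))
    blocks = [X[:, a:b] for a, b in zip(edges[:-1], edges[1:])]
    worst = 0.0
    for i, X_h in enumerate(blocks):
        for j, X_k in enumerate(blocks):
            if i != j:
                # Matrix 2-norm = largest singular value of the cross-Gram block.
                worst = max(worst, float(np.linalg.norm(X_h.T @ X_k / n, 2)))
    return worst
```

Both quantities vanish when all columns of $X$ are mutually orthogonal and grow as distinct groups become more correlated.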

Sparsistency
In this section, we provide conditions for the model selection consistency (6), or sparsistency, of the group-lasso model selector under the double asymptotic settings.
To this end, let $O$ be the event that there exists a solution $\hat\beta$ to (2) such that $\|\hat\beta_h\|_2 > 0$ for all $h \in H_0$, and $\hat\beta_h = 0$ for all $h \in H_0^c$. Then, the sparsistency property is

$$\mathbb{P}(O) \to 1. \qquad (9)$$

We will make the following assumptions:

(S1) the smallest eigenvalue of $\frac{1}{n} X_{H_0}^\top X_{H_0}$ is bounded below by a constant […];

(S3) for some $0 < \epsilon < 1$ and every $h \in H_0^c$, […].

Theorem 4.2. Under the assumptions (N) and (S1)-(S4), the sparsistency property (9) holds.
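In simulations, whether a particular fitted vector realizes the event $O$ can be checked directly by comparing block supports, as in the sketch below. The function names and the optional tolerance `tol`, which treats numerically tiny blocks as zero, are illustrative assumptions.

```python
import numpy as np

def estimated_support(beta, group_sizes, tol=0.0):
    """Indices h of the blocks with ||beta_h||_2 > tol."""
    edges = np.cumsum([0] + list(group_sizes))
    return {h for h, (a, b) in enumerate(zip(edges[:-1], edges[1:]))
            if np.linalg.norm(beta[a:b]) > tol}

def recovers_support(beta_hat, beta_true, group_sizes, tol=0.0):
    """True when the non-zero blocks of beta_hat coincide with those of beta_true."""
    return (estimated_support(beta_hat, group_sizes, tol)
            == estimated_support(beta_true, group_sizes))
```

Averaging the outcome of `recovers_support` over repeated samples gives a Monte Carlo estimate of the probability of exact block recovery for the particular solution returned by a solver.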
Remarks. The conditions of Theorem 4.2 deserve a few comments.
1. From the proof, it can be seen that we can combine (S1) and (S2) into one assumption, thus allowing the minimal eigenvalue of $\frac{1}{n} X_{H_0}^\top X_{H_0}$ to vanish at a rate slower than $\frac{1}{\alpha^2} \frac{\log d_0}{n}$.
2. The normality assumption (N) is by no means crucial. In fact, it is enough to require the errors to be independent, sub-Gaussian random variables, with second moments bounded uniformly in $n$. Then, by applying, for example, Lemma 2.3 in Massart (2007), the conclusions of the Theorem would hold unchanged.

3. If $\lambda$ […], which is the same rate appearing in Equation 15 b) in Wainwright (2006) for the simpler lasso penalty.
4. It is apparent from condition (S4) that not only can $d$ be much bigger than $n$, but it can in fact grow at a faster rate than $n$. In particular, condition (S4) formalizes quite explicitly the notion that the true model should be sparse in order for the group-lasso model selector to be successful.
5. Because the group-lasso solution may not be unique, Theorem 4.2 only implies the existence of a sequence of solutions guaranteeing sparsistency. In order to obtain a more satisfactory result, one may want to enforce also the uniqueness condition (U(c)), for some $c > 0$.

Inequalities for prediction and estimation
We now derive oracle inequalities for the prediction and estimation loss of the group-lasso estimator.
As a main technical step in our derivations (which generalizes standard arguments found, for example, in Bunea et al., 2007a,b; Bickel et al., 2007), the prediction and estimation bounds we establish hold on the event […]. Therefore, we must impose conditions implying that $A$ occurs with probability tending to 1, as both $n$ and the model complexity increase. To that end, we formulate the asymptotic condition (A) […], which will guarantee that the inequalities given below are meaningful for $n$ large enough and also offers some characterizations of the rates of growth of the regularization parameters.

Remarks.
1. Assumption (A) provides general guidelines for choosing the tuning parameters $\lambda$ and $\{\lambda_h, h \in H\}$. In particular, if $\lambda_h = \sqrt{d_h}$ for each $h$, the condition reduces to […], where $d_{\min} = \min_h d_h$. For such a choice of $\lambda_h$, for example, we can use $\lambda = \sigma\sqrt{C_n/n}$, where $C_n$ is such that […]. Since $d_{\min} \ge 1$, we can set […] for some $A > 1$.
2. Alternatively, and in less generality, if again $\lambda_h = \sqrt{d_h}$ for each $h$, we could consider the event […], where $X_i$ denotes the $i$-th column of the matrix $X$. Then, for […] with $A \ge 2$, a standard Gaussian tail bound (see, e.g., van de Geer, 2007, Lemma 3.8) yields […], which vanishes provided $d \to \infty$. Notice that this case is covered by assumption (A). Then, using the event $A'$ and the Cauchy-Schwarz inequality in equation (39) in the proof of Theorem 4.6, it is easy to see that the results of this section would hold with $A$ replaced by $A'$.
3. It appears that the Gaussianity assumption (N) is quite important in this context, as it is used in a fundamental way to establish condition (A). If, instead of the event $A$, one considers the event $A'$ (with the additional constraints $\lambda_h = \sqrt{d_h}$ for each $h$), then Gaussianity is not necessary and, […] for some $\eta > 0$, one can still guarantee a vanishing probability for $A'$ under the slightly stronger requirement $(\log d)^{1+\delta} = o(n)$ and some additional mild constraints. See Lounici (2008, Theorem 3) for a formal argument.
Another key assumption for our results is given below, where $s$ is an integer and $c$ a positive number: (RE(s, c)) […]. Here $\Lambda$ is the $d \times d$ matrix with diagonal $\mathrm{vec}\{1_{d_h} \lambda_h, h \in H\}$ and $1_{d_h}$ denotes the $d_h$-dimensional vector with entries all equal to 1. This assumption specializes the restricted eigenvalue assumption introduced by Bickel et al. (2007) to analyze the $L_2$ consistency property of the lasso procedure.
In particular, in the special case in which s = |H 0 |, the (RE(s, c)) assumption is implied by the uniqueness assumption (U(c)), as demonstrated in the next proposition.
Our first result provides finite sample bounds for the prediction and estimation loss and for the number of non-zero blocks of the group-lasso estimator under the linear model (1), with unknown block-support set $H_0$.
Theorem 4.5. Assume (N) and (RE($|H_0|$, 3)). On the event $A$, […], where […] and where $C_{\max}$ is the largest eigenvalue of $\frac{1}{n} X_{H_0}^\top X_{H_0}$ and $\kappa_0 = \kappa(|H_0|, 3)$.

Next, we establish a more general oracle inequality for the prediction loss of the group-lasso estimator which covers the case of a misspecified model. Specifically, rather than assuming that the true model is linear, we consider the more general model $Y = f_0(X) + \epsilon$, for some unknown, possibly non-linear, function $f_0$ of the covariates.

Remarks.
1. Recall that, under our assumption (A), the event $A^c$ has vanishing probability, so the bounds we obtain hold with large probability, for $n$ big enough.
2. In both Theorems (4.6) and (13), we do not enforce the uniqueness condition (U(c)), and, therefore, the conclusions hold for any solution to (2). In fact, because of Proposition 4.4, we can replace the RE(s, c) conditions in both Theorems (4.6) and (13) by the appropriate U(c) conditions, which would guarantee the same results and also uniqueness of the group-lasso estimator.
3. The inequalities derived above directly generalize the corresponding bounds established by Bunea et al. (2007a) and Bickel et al. (2007) for the lasso problem.
4. From both Theorems, it is possible to get rates of prediction and estimation consistency of the group-lasso. These rates depend crucially on the choice of the tuning parameters compatible with assumption (A), in particular of $\lambda$. See Remark 1 after Lemma 4.3 for some comments on the possible values for $\lambda$. In particular, for $\lambda_h = \sqrt{d_h}$ and $\lambda = A\sigma\sqrt{\frac{\log |H|}{n}}$, for some $A > 1$, we obtain rates that are comparable to lasso rates, with the number of parameters replaced by the number of blocks. This is due to the nature of our assumption (RE(s, c)).
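The tuning choice mentioned in the last remark is simple to compute. The sketch below returns $\lambda = A\sigma\sqrt{\log|H|/n}$ together with the block weights $\lambda_h = \sqrt{d_h}$. Reading the rate with a square root over $\log|H|/n$ is an assumption, as are the function name and the default value of `A`. In practice $\sigma$ is unknown and would have to be estimated.

```python
import math

def tuning_parameters(group_sizes, n, sigma, A=1.5):
    """Return (lam, weights) with lam = A * sigma * sqrt(log|H| / n)
    and weights[h] = sqrt(d_h). Requires A > 1, as in the remark.
    """
    if A <= 1:
        raise ValueError("A must be larger than 1")
    num_groups = len(group_sizes)
    lam = A * sigma * math.sqrt(math.log(num_groups) / n)
    weights = [math.sqrt(d_h) for d_h in group_sizes]
    return lam, weights
```

Under this choice the penalty level depends on the design only through the number of blocks $|H|$ and not through the total number of coefficients $d$.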

Persistence
In this final section, we change our settings and adopt the double-asymptotic framework of Greenshtein and Ritov (2006) and Greenshtein (2006). Our goal is to study the risk consistency of the group-lasso solutions under a triangular array framework for the random vector $Q = (Y, X)$, where $Y$ is the response variable and $X = (X_1, \ldots, X_d)$ the vector of covariates. We are concerned with the predictive risk $R(\beta) = E(Y - X\beta)^2$, where the expectation is with respect to the joint distribution $P_{(X,Y)}$ of $Y$ and $X$.
Specifically, let $\hat\beta_n$ be an estimator based on an iid sample $(Q_1, \ldots, Q_n)$ of size $n$ from $P_{(X,Y)}$ and let $R(\hat\beta_n) = E\big((Y - X\hat\beta_n)^2 \mid Q_1, \ldots, Q_n\big)$, for a new iid observation $(Y, X) \sim P_{(X,Y)}$. Just like above, we allow $d$ to grow unbounded with $n$. Let $\{S_n\}$ be a sequence of sets of increasing dimensions. A sequence of estimators $\{\hat\beta_n\}$ is said to be persistent with respect to $\{S_n\}$ if […]. Notice that, in order for persistence to hold, it is not necessary for the best predictor of $Y$ based on $X$ to be linear.
We assume that the random covariates $X$ have a grouping structure, which we represent using the same notation and conventions of Section 2. Accordingly, we consider the following two sequences of sets, each of them providing a different form of group penalty: […], for some sequences of numbers $\{b_n\}$ and $\{c_n\}$ to be determined. Letting $\gamma = (-1, \beta_1, \ldots, \beta_p)$, we can write $R(\beta) = \gamma^\top \Sigma \gamma$, where $\Sigma = E\,QQ^\top$. The empirical equivalent of this quantity is $\hat R(\beta) = \gamma^\top \hat\Sigma \gamma$, where […]. In these new settings, the group-lasso estimator $\hat\beta$ with respect to the sequence $\{S_n\}$ of sets of potential coefficients, which can be $\{B_n\}$ or $\{C_n\}$, is computed as in (14). Following Zhou et al. (2007), we impose the conditions (P1) and (P2) […], where […] and $A$, $B$ and $\alpha$ are some positive constants with $0 < \alpha \le 1$.
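The quadratic-form representation of the risk is easy to verify numerically. The sketch below forms $\gamma = (-1, \beta)$, stacks the observations as $Q_i = (Y_i, X_i)$ and evaluates $\gamma^\top \hat\Sigma \gamma$, assuming that $\hat\Sigma$ is the sample second-moment matrix $\frac{1}{n}\sum_i Q_i Q_i^\top$. That form of $\hat\Sigma$ and the function name are assumptions.

```python
import numpy as np

def empirical_risk_quadratic(X, y, beta):
    """Empirical risk written as gamma^T Sigma_hat gamma.

    Sigma_hat is taken to be (1/n) * sum_i Q_i Q_i^T with Q_i = (y_i, x_i)
    (assumed form), and gamma = (-1, beta).
    """
    n = X.shape[0]
    Q = np.column_stack([y, X])           # n x (d + 1) matrix with rows Q_i
    sigma_hat = Q.T @ Q / n               # sample second-moment matrix
    gamma = np.concatenate([[-1.0], beta])
    return float(gamma @ sigma_hat @ gamma)
```

Since $\gamma^\top Q_i = X_i\beta - Y_i$, the quadratic form coincides with the average squared residual $\frac{1}{n}\sum_i (Y_i - X_i\beta)^2$, which is the empirical counterpart of $R(\beta)$.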
Theorem 4.7. Under the assumptions (P1) and (P2), the group lasso estimator defined in (14) is persistent with respect to $\{B_n\}$ if (15) holds. It is persistent with respect to $\{C_n\}$ if (16) holds and the minimal eigenvalue of the covariance matrix of the predictors is positive.

Remarks.
1. Notice that (15) is implied by the stronger condition […], which is of the same form as (16).
2. The definition of the set sequence $\{B_n\}_n$ can be generalized to […] and the results of Theorem 4.7 would remain true provided $\max_h \frac{\sqrt{d_h}}{\lambda_h} = O(1)$.
3. Conditions (15) and (16) are easy generalizations of their lasso equivalents derived in Greenshtein and Ritov (2006) and Greenshtein (2006), the only difference being the additional term $d_{\max}$. For the choice $\lambda_h = \sqrt{d_h}$, for each $h$, this is precisely the extra term appearing also in Theorem 4.6.
4. Assumptions (P1) and (P2) are not the only options. See Greenshtein and Ritov (2006) and Greenshtein (2006) for alternative assumptions and derivations.

Proofs
Proof of Proposition 3.1. For every $h \in H$, let $\lim$ […]. Then, by the same arguments used in the proof of Theorem 3.2 below and by equation (18), […], where […] with $W \sim N_d(0, \sigma^2 M)$ (see also Knight and Fu, 2000; Zou, 2006).
We will prove the claim by showing that if $c_{h'} = \infty$ for some $h' \in H_0$, then […], which will contradict the assumed model selection consistency (6). The optimal solution $u^*$ must satisfy the first order optimality conditions […], which together imply […]. Then, since $c_{h'} < \infty$, […].

Proof of Theorem 3.2. We first show (8). Letting $\beta_n = \beta^0 + \frac{u}{\sqrt{n}}$, where $u \in \mathbb{R}^d$, the objective function (2) (multiplied by $n$) can be written as a function of $u$ as […], where $\hat\beta$ is the minimizer of (2). Next, write […]. Note that $D_n$ is strictly convex. If […], and, therefore, $I_{2,n,h}$ converges in probability to 0 by Slutsky's theorem and the assumption […], where the second assumption in the statement was used. Because […]. The unique minimizer of $D(u)$ is $(M^{-1}_{H_0} W, 0)^\top$. By the argmax theorem in van der Vaart and Wellner (1998, Corollary 3.2.3) (or, alternatively, the results in Geyer, 1994), […] and (8) is verified.
Next, we prove model selection consistency (6). Since $\hat\beta$ is $\sqrt{n}$-consistent, for each $h \in H_0$, $\hat\beta_h \neq 0$ with arbitrarily high probability for sufficiently large $n$. Thus, we only need to show that, for each $h \in H_0^c$, $\hat\beta_h = 0$ with arbitrarily high probability for sufficiently large $n$. Model selection consistency will then follow from the finiteness of $|H_0|$. Suppose that, for some $h \in H_0^c$, $\hat\beta_h \neq 0$. Then, from the subgradient conditions (5), […] (20). Because $\sqrt{n}(\beta^0 - \hat\beta)$ is asymptotically normally distributed, and using our assumption on the design matrix, […]. Furthermore, by the same arguments leading to (19), […]. Then, the norms of the terms on the two sides of equation (20) have different orders of magnitude as $n \to \infty$, which implies that, with increasing probability, $\hat\beta_h$ does not satisfy the first order condition for being non-zero, and therefore $\hat\beta_h = 0$ with probability tending to 1.
Proof of Theorem 3.3. In the first part of the proof, we follow Fan and Li (2001). Let $\alpha_n = \frac{1}{\sqrt{n}} + \lambda a_n$ and let $p_h : \mathbb{R}^{d_h} \to \mathbb{R}$ be a random function given by […] and […] for $u \in \mathbb{R}^d$. We will show that, for each $\epsilon$, there exists a constant $C$ such that, for large enough $n$, […] (21), which implies the existence of a local minimizer inside the ball $\{\beta^0 + \alpha_n u : \|u\|_2 \le C\}$ and therefore a solution […]. The first two terms in (21) can be written as […] (22), from which it follows easily that they are of order […]. The last term on the right hand side of (21) can be bounded as follows: […]. Combining the previous display with (21) and (22), one can conclude that, for sufficiently large $C = \|u\|_2$, the positive term $O(\alpha_n^2 n \|u\|_2^2)$ dominates all the others.
[…] which gives a contradiction, since $\delta > \lambda^2_{\max}$.

Proof of Theorem 4.2. The proof is an adaptation to the present settings of arguments used in Wainwright (2006). Let $\hat H = \{h : \hat\beta_h \neq 0\}$ and set […], where $\|z_h\|_2 \le 1$. Using the subgradient conditions, the event $O$ holds if and only if […] (23) and […] (24). We will use equations (23) and (24) to show […] (25) and […] (26), respectively, where we recall that $\alpha = \min_{h \in H_0} \|\beta^0_h\|_\infty$. In turn, (25) and (26) imply […], as claimed. We begin with (25). Write, for simplicity, $\Sigma_0 = \frac{1}{n} X_{H_0}^\top X_{H_0}$ and consider the $d_0$-dimensional vector $Z$ […] for each coordinate $i$ of $Z$. Using standard results on the maximum of a Gaussian vector (see, e.g., Ledoux and Talagrand, 1991), […] (27). As for the second term on the right hand side of (23), we obtain […] (28), where in the last inequality we use the bounds […]. By Markov's inequality, and using (27) and (28), […], which goes to zero under (S2), thus establishing (25).
Next, we show (26). Rewrite (24) as […], where […]. Then, for any $h \in H_0^c$, […] (29). We bound the first term in the previous equation as follows, […], with the last inequality stemming from assumption (S3). As for the second term in (29), notice that $E W_h = 0$ and […]. By the same arguments used above, […]; hence, by virtue of Markov's inequality, […]. Therefore, using assumption (S4), […], which gives (26). The proof is now complete.
Proof of Theorem 4.6. The proof closely follows Bickel et al. (2007, Theorem 5.1) and is essentially based on Lemma 6.1 in the Appendix. Let $\beta \in \mathbb{R}^d$ be arbitrary, with $H(\beta) \le s$. On the event $A$, if the claim holds trivially from the first inequality in (37). Consider instead the complementary case On the event $A \cap A_1$, from the first inequality in (37), we get Using the assumption RE$(s, 3 + 4/\epsilon)$, we obtain, still on Thus, by the second inequality in (37), on This expression is of the same form as inequality (A.3) in Bunea et al. (2007a).
Following their arguments, we get that, for any $a > 1$, and (13) is established by setting $\epsilon = \frac{2}{a-1}$. Proof of Proposition 4.4. We adapt the arguments used in Lounici (2008, Lemma 2) where assumption (U) is used in the second inequality. Denoting by $X_{H'}$ the submatrix of $X$ formed by $\{X_h, h \in H'\}$, the last inequality yields where we have used the Cauchy-Schwarz inequality in the third and fourth lines and assumption (U) in the third line. Since $\delta > \lambda_{\max}^2$ by assumption, we obtain $\kappa(s, c) > 0$.
In order to show (12), we first show that From the subgradient conditions, we get, for each $h$, where $z_h = \hat\beta_h / \|\hat\beta_h\|_2$ if $\hat\beta_h \neq 0$ and $z_h$ is any vector with $\ell_2$ norm bounded by 1 if $\hat\beta_h = 0$. On the other hand, since where the last inequality follows from the fact that $\frac{1}{n} X^\top X$ and $\frac{1}{n} X X^\top$ have the same maximal eigenvalue. Combining (35) and (36), which is (34). Inserting equation (11) in (34), we obtain (12).
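The eigenvalue fact used in the last inequality, namely that $\frac{1}{n} X^\top X$ and $\frac{1}{n} X X^\top$ share the same maximal eigenvalue, can be checked numerically. The following is a minimal sketch on a small made-up $3 \times 2$ matrix, using plain power iteration; the matrix `X` and the helper names `transpose`, `matmul`, `scale` and `max_eigenvalue` are illustrative assumptions and do not come from the paper.

```python
import math


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def scale(A, c):
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]


def max_eigenvalue(S, iters=200):
    """Largest eigenvalue of a symmetric positive semi-definite matrix.

    Plain power iteration started from the all-ones vector; assumes that
    this start vector is not orthogonal to the top eigenvector.
    """
    v = [1.0] * len(S)
    lam = 0.0
    for _ in range(iters):
        w = [sum(s * x for s, x in zip(row, v)) for row in S]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam


# Hypothetical design matrix with n = 3 observations and d = 2 columns.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n = len(X)

gram_small = scale(matmul(transpose(X), X), 1.0 / n)  # d x d matrix
gram_large = scale(matmul(X, transpose(X)), 1.0 / n)  # n x n matrix

lam_small = max_eigenvalue(gram_small)
lam_large = max_eigenvalue(gram_large)
print(lam_small, lam_large)
```

Both values equal the squared largest singular value of $X$ divided by $n$, which is why the bound can be stated in terms of either matrix.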
Proof of Theorem 4.7. Following the results of section A, part IV of Zhou et al. (2007), assumptions (P1) and (P2) coupled with Bernstein's inequality yield where, in the second inequality, we used the bound $\|\gamma\|_1 \le 1 + \sum_h \sqrt{d_h} \|\beta_h\|_2$ and the last step follows from (15). Therefore, The second part of the statement follows from the simple chain of inequalities where $\|\beta\|_2 \le C$ holds uniformly over $n$ for some constant $C$ by virtue of (P1) and the assumed positivity of the minimal eigenvalue of the covariance matrix of the predictors. Under (16), this implies $C_n \subset B_n$ for each $n$ and thus persistency with respect to $\{C_n\}_n$.

Appendix
Proof of Lemma 4.3. Let $V_h = \frac{1}{\sqrt{n}\sigma} X_h^\top \epsilon$, so that $V_h \sim N_{d_h}(0, I)$ and $\|V_h\|_2^2 \sim \chi^2_{d_h}$. By the union bound, For large enough $n$, we can apply the tail bound inequality for a variable distributed like $\chi^2_{d_h}$ (see, e.g., Cavalier et al., 2002), yielding Because of (A), for large enough $n$, from which it follows, once again using (A), that This concludes the proof.
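The union bound step in the proof of Lemma 4.3 can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the group sizes `[2, 4, 4]`, the threshold `t = 12.0`, the number of replications and the independence of the blocks are hypothetical choices, and the exact tail formula used here, $P(\chi^2_{2k} > t) = e^{-t/2} \sum_{j<k} (t/2)^j / j!$, holds for even degrees of freedom only. It is not the tail inequality of Cavalier et al. (2002) cited in the proof.

```python
import math
import random


def chi2_tail_even(df, t):
    """Exact P(chi2_df > t) for an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here needs even, positive df")
    half = t / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


def union_bound(group_sizes, t):
    """Upper bound on P(max_h ||V_h||_2^2 > t) given by the union bound."""
    return sum(chi2_tail_even(d, t) for d in group_sizes)


def simulate_max_exceedance(group_sizes, t, reps, seed=0):
    """Monte Carlo frequency of the event max_h ||V_h||_2^2 > t.

    Each V_h is drawn as a standard Gaussian vector of length d_h;
    the blocks are drawn independently, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    hits = 0
    for _rep in range(reps):
        exceeded = False
        for d in group_sizes:
            sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _k in range(d))
            if sq_norm > t:
                exceeded = True
                break
        hits += exceeded
    return hits / reps


group_sizes = [2, 4, 4]  # hypothetical group sizes d_h
t = 12.0                 # hypothetical threshold
bound = union_bound(group_sizes, t)
freq = simulate_max_exceedance(group_sizes, t, reps=20000)
print("union bound:", bound, "simulated frequency:", freq)
```

The simulated frequency should sit at or slightly below the bound, up to Monte Carlo error; the union bound does not require independence across groups, which is why it suits the correlated blocks $X_h^\top \epsilon$ of the proof.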
Lemma 6.1. Let $\mathbb{E} Y = f_0(X)$, for some function $f_0$, and assume (N). On the event $A$, for any $\beta \in \mathbb{R}^d$ with block support set $H' = \{h : \beta_h \neq 0\}$, Proof of Lemma 6.1. Following the derivation in Bunea et al. (2007a), for an arbitrary $\beta \in \mathbb{R}^d$ with block support set $H'$, it holds that where $W_h = \frac{1}{n} X_h^\top \epsilon$. By the Cauchy-Schwarz inequality, on the event $A$, Using the last display, and adding and subtracting $\frac{1}{2} \sum_h \lambda \lambda_h \|\hat\beta_h - \beta_h\|_2$ to both sides of (38), the term which, in turn, is no larger than all the above inequalities being valid on $A$. Then, from (38), and applying the triangle inequality to the last display, we obtain, still on $A$, where the second inequality stems from the Cauchy-Schwarz inequality. The last expression, multiplied by 2, is (37).

Acknowledgments
We thank the anonymous referees and, in particular, the associate editor for detailed and constructive comments that led to a much improved presentation.
the model selection consistency (6) follows from the same arguments used at the end of the proof of Theorem 3.2. Since the event $\{\hat H \neq H\}$ has vanishing probability, asymptotic normality (8) is easily proved by restricting to the complementary event $\{\hat H = H\}$ and applying the central limit theorem and Slutsky's theorem to equation (23) below, taking into account the fact that $\lambda a_n \to 0$. Proof of Proposition 4.1. Let $\beta = \beta_1 - \beta_2$. Then $X\beta = 0$. Assume that $\beta \neq 0$. Using the same notation as in Proposition 4.4 with $s = 2|H_0|$, we get, by equation (30), Xβ 2 2 max j,k Σ j,k − Σ j,k = O P log n k − Σ j,k (1 + b n ) 2= o P (1).
$(H')^c = H \setminus H'$ and, if $x \in \mathbb{R}^d$, we will use the notation $x_{H'} = \mathrm{vec}\{x_h, h \in H'\}$ for the $d'$-dimensional subvector formed by the blocks of $x$ indexed by $H'$, where $d' = \sum_{h \in H'} d_h$. Similarly, if $H_1$ and $H_2$ are two subsets of $H$, and $M$ a $d \times d$ matrix, we will write