Theoretical Properties of the Overlapping Groups Lasso

We present two sets of theoretical results on the grouped lasso with overlap of Jacob, Obozinski and Vert (2009) in the linear regression setting. This method allows joint selection of predictors in sparse regression, with complex structured sparsity over the predictors encoded as a set of groups. The flexibility of this framework suggests that arbitrarily complex structures can be encoded with an intricate set of groups. Our results show that this strategy has unexpected theoretical consequences for the procedure. In particular, we give two sets of results: (1) finite sample bounds on prediction and estimation, and (2) asymptotic distribution and selection. Both sets of results give insight into the consequences of choosing an increasingly complex set of groups for the procedure, as well as what happens when the set of groups cannot recover the true sparsity pattern. Additionally, these results demonstrate the differences and similarities between the grouped lasso procedure with and without overlapping groups. Our analysis shows that the set of groups must be chosen with caution: an overly complex set of groups will damage the analysis.


Introduction
In this paper, we consider the linear regression model:

y = Xβ^0 + ǫ,

where X is an n × p real valued data matrix, y ∈ R^n is a vector of responses, β^0 ∈ R^p is a vector of linear weights, and ǫ is an error vector. Much work focuses on estimating a sparse β, where many of the entries are equal to zero, effectively excluding many of the dimensions of X (the candidate predictors) from the model. Recent work adds the notion of structure to this setting. That is, we desire the set of nonzero entries in β to follow some predefined structure over the candidate predictors. There are now many methods tailored to a diverse collection of such structures, but the standard grouped lasso does not perform well with overlapping groups. The goal of this paper is to expose exactly how introducing the possibility of overlapping groups impacts the grouped lasso. Towards this goal, we demonstrate some theoretical properties of the overlapping grouped lasso, with a focus on the consequences of the number and complexity of the groups of predictors. We give a finite sample and an asymptotic result. In particular, we make the following contributions:
1. We show that both the finite sample and asymptotic performance of the overlapping grouped lasso suffer as the number and complexity of the groups grow.
2. In the finite sample case, we show that the assumptions on the design matrix X become more restrictive as the complexity of the groups grows.
3. In our asymptotic analysis, we introduce the adaptive overlapping grouped lasso, and give an adaptive weighting scheme with asymptotic selection guarantees similar to the adaptive lasso of Zou (2006) (see also the adaptive grouped lasso results in Nardi and Rinaldo (2008)).
Overall, we conclude that the overlapping grouped lasso enjoys many of the same theoretical guarantees as the grouped lasso, provided that the set of groups is not too complex or large. We therefore recommend that the procedure be used with a set of groups that is neither overly complex nor nested.
The paper is organized as follows: we first introduce notation for the overlapping grouped lasso. We also reproduce some basic theoretical properties of the procedure and the associated overlapping grouped norm. We next give our finite sample results, and then our asymptotic results. We then present a simulation study to support our theoretical results. Proofs of the main results along with supporting lemmas appear in the appendix.

Notation
We adopt a combination of the notation of Jacob, Obozinski and Vert (2009) and Lounici et al. (2009). Recall our basic setting, the linear model:

y = Xβ^0 + ǫ.

Here, X is an n × p data matrix, y ∈ R^n is the response, β^0 ∈ R^p is a vector of true linear coefficients, and ǫ is a stochastic error term. Our goal is to estimate a sparse β, such that the nonzero entries follow some structure which we assume to be known a priori. In particular, we consider structures defined in terms of groups of predictors, which we define as subsets of the set of candidate predictor indices: I = {1, 2, . . . , p}. We denote a collection of groups as G with elements g such that each g ⊆ I. Let |G| = M, and assume ∪_{g∈G} g = I. For coefficient vectors β, we define β_g ∈ R^{|g|} as the sub vector consisting of the entries corresponding to the indices in g. Define the support of a vector as:

supp(β) = {j ∈ I : β_j ≠ 0}.

imsart-ejs ver. 2009/08/13 file: percival-overlap-theory-rev1.tex date: January 16, 2013

We now give a framework, proposed by Jacob, Obozinski and Vert (2009), to measure the structured sparsity of vectors in R^p. We define the following convention: for vectors denoted v_g ∈ R^p, we have that supp(v_g) ⊆ g. We define a decomposition of β ∈ R^p with respect to G as:

V_G(β) = { {v_g}_{g∈G} : Σ_{g∈G} v_g = β, supp(v_g) ⊆ g for all g ∈ G }.

That is, each decomposition in V_G(β) is a collection of M vectors in R^p, each satisfying supp(v_g) ⊆ g for a different g ∈ G. From now on, we suppress the G in the notation for decompositions and write V(β). V(β) is not unique in general. We define the following norms:

||β||_{2,1,G} = min_{V(β)} Σ_{g∈G} ||v_g||,

||β||_{2,1,G,λ} = min_{V(β)} Σ_{g∈G} λ_g ||v_g||,  with fixed weights λ_g > 0.

Here, ||·|| denotes the Euclidean or ℓ_2 norm. The above two equations are norms by arguments presented in Jacob, Obozinski and Vert (2009). Note that the notation min_{V(β)} indicates the minimum over all possible decompositions. Note that the decomposition that minimizes these norms is not necessarily unique, as we state in the following lemma.
Lemma 1. (Corollary 1 from Jacob, Obozinski and Vert (2009)) For any collections {v_g}, {v'_g} minimizing the norm ||·||_{2,1,G}, we have, ∀g ∈ G:

The above lemma implies that in some cases the collection of groups used in the decomposition, that is, {g ∈ G s.t. v_g ≠ 0}, is not unique. Finally,

J_v(β) = {g ∈ G : v_g ≠ 0}

is the set of groups used to decompose β for a particular decomposition, and M_v(β) = |J_v(β)| is thus a measure of the structured sparsity of β with respect to that decomposition. Let M(β) = min_v M_v(β), where this minimum is taken over the set of decompositions minimizing the norm ||·||_{2,1,G}. Thus, M(β) measures the overall structured sparsity of β, with respect to the groups G.
Here is a simple example to illustrate the setting. Let p = 3, and consider the following groups:

g_1 = {1, 2}, g_2 = {2, 3}, so that G = {g_1, g_2}.

For any α ∈ R, we have the following possible decomposition of β = [a, b, c]:

v_{g_1} = [a, α, 0],  v_{g_2} = [0, b − α, c].

Thus, the norm ||β||_{2,1,G} can be expressed as:

||β||_{2,1,G} = min_{α∈R} ( (a² + α²)^{1/2} + ((b − α)² + c²)^{1/2} ).

Finally, it is clear that for a, c ≠ 0, M(β) = 2, and M(β) = 1 otherwise.
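The single-variable minimization above is easy to evaluate numerically. The helper below is our own illustration (not from the paper): it computes ||β||_{2,1,G} for the three-dimensional example with G = {{1,2},{2,3}} by a dense grid search over α, which suffices because the objective is convex in α and its minimizer lies between 0 and b.

```python
import numpy as np

def overlap_norm_example(a, b, c, num=200001):
    # ||beta||_{2,1,G} for beta = [a, b, c] and G = {{1,2},{2,3}}:
    #   min over alpha of sqrt(a^2 + alpha^2) + sqrt((b - alpha)^2 + c^2).
    # The objective is convex in alpha with a minimizer in [min(0,b), max(0,b)],
    # so a dense grid over that interval gives a numerical sketch of the norm.
    lo, hi = min(0.0, b), max(0.0, b)
    alpha = np.linspace(lo, hi, num)
    vals = np.sqrt(a**2 + alpha**2) + np.sqrt((b - alpha)**2 + c**2)
    return float(vals.min())
```

For instance, β = [1, 0, 1] requires both groups (α = 0 is optimal), while β = [3, 4, 0] is covered by the first group alone, so the norm collapses to the ordinary ℓ_2 norm of [3, 4].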

Overlapping Grouped Lasso
Recall our goal: under the model of Equation 1, we estimate the target β^0 with a sparse β, that is, many entries of β are set to zero. Additionally, we know these nonzero entries occur in a structured pattern, as given by G. We evaluate the fit with the usual quadratic loss:

(1/n) ||y − Xβ||².

The overlapping grouped lasso solves the following optimization problem:

β̂ = argmin_{β ∈ R^p} (1/n) ||y − Xβ||² + λ ||β||_{2,1,G}.

Here λ > 0 is a tuning parameter controlling the amount of regularization. If the elements of G are restricted to be pairwise disjoint, then the norm ||·||_{2,1,G} reduces to the grouped ℓ_1 norm. We then recover the original formulation of the grouped lasso. In the special case where the groups are all singletons, G = {{i} : i ∈ I}, we recover the familiar lasso of Tibshirani (1996). If we allow G to be any collection, allowing for the possibility of overlap between groups, then the minimum over V(β) in the norm now plays a role, since the decomposition of β is no longer unique in general. This setting gives us the overlapping grouped lasso. For each of these problems, the key fact is that the support of β̂ will be a union of members of a subset of G. Finally, we also introduce the adaptive overlapping grouped lasso:

β̂ = argmin_{β ∈ R^p} (1/n) ||y − Xβ||² + λ min_{V(β)} Σ_{g∈G} λ_g ||v_g||.

As previous work and theory has suggested (Nardi and Rinaldo (2008); Zou (2006)), the choice of weights λ_g = 1/||β^OLS_g||^γ, where β^OLS = (X^T X)^{−1} X^T y and γ > 0, gives good asymptotic guarantees. In Section 5, we show that a different choice is needed in our setting to give similar asymptotic guarantees.
Finally, as noted by Jacob, Obozinski and Vert (2009), the overlapping grouped lasso method is simple to implement. In the case where G consists of non-overlapping groups, there are several efficient algorithms available. In the overlapping case, no new specialized algorithm is required. Write X_g as the submatrix of X with only the columns of X indexed by the elements of g. Now define X̃ = [X_g]_{g∈G}, an n × Σ_g |g| matrix formed by concatenating the columns of X corresponding to each group in G. We can then solve the optimization problem with a new, non-overlapping, set of groups G̃ defined on the appropriate columns of X̃. Since G̃ is a non-overlapping set of groups for X̃, we can simply apply existing algorithms for the grouped lasso.
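The column-duplication construction above can be sketched in a few lines. The helpers below are our own (function names and 0-based indexing are assumptions; the paper indexes predictors from 1): `expand_design` builds X̃ together with the induced non-overlapping groups, and `collapse` maps a coefficient vector on the expanded design back to R^p by summing the latent components v_g.

```python
import numpy as np

def expand_design(X, groups):
    # Build X_tilde = [X_g]_{g in G} by copying each group's columns, along
    # with the induced non-overlapping group index sets over X_tilde's columns.
    cols, new_groups, start = [], [], 0
    for g in groups:
        cols.append(X[:, g])
        new_groups.append(list(range(start, start + len(g))))
        start += len(g)
    return np.hstack(cols), new_groups

def collapse(beta_tilde, groups, p):
    # Map expanded coefficients back to R^p: beta = sum_g v_g, where v_g is
    # supported on group g and filled with g's block of beta_tilde.
    beta = np.zeros(p)
    start = 0
    for g in groups:
        beta[g] += beta_tilde[start:start + len(g)]
        start += len(g)
    return beta
```

Any solver for the non-overlapping grouped lasso can then be run on (X̃, G̃); by construction X̃ β̃ = X collapse(β̃), so fitted values are preserved.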

Finite Sample Bounds
We now give a sparsity oracle inequality for the overlapping grouped lasso. This finite sample result is an extension of a result on multitask regression due to Lounici et al. (2009), which is in turn built on results from Bickel, Ritov and Tsybakov (2009). We first state and discuss our main assumption, which is an adaptation of the restricted eigenvalue condition of Bickel, Ritov and Tsybakov (2009) to the overlapping grouped lasso setting.

Assumption 1. (Restricted eigenvalue) For an integer s with 1 ≤ s ≤ M, there exists a constant κ(s) > 0 such that

min { ||X∆|| / (√n Σ_{g∈J} ||v^∆_g||) } ≥ κ(s),

where the minimum is over all subsets J ⊆ G with |J| ≤ s and all ∆ ≠ 0 whose norm minimizing decomposition {v^∆_g} of ||∆||_{2,1,G} satisfies Σ_{g∉J} ||v^∆_g|| ≤ 3 Σ_{g∈J} ||v^∆_g||.
In the subsequent results, the integer s measures the structured sparsity of the target. There are two key differences between this assumption and other restricted eigenvalue conditions. First, it relies on norms of the decompositions of vectors, rather than norms of the vector or appropriate sub-vectors. Note that the decomposition of ∆ must be a decomposition minimizing the ||·||_{2,1,G} norm. As we will discuss later, this condition grows more restrictive as G becomes more complex. The second key difference in the assumption lies in the denominator term Σ_{g∈J} ||v^∆_g||, which appears instead of the directly analogous ||Σ_{g∈J} v^∆_g||. We know by the triangle inequality that ||Σ_{g∈J} v^∆_g|| ≤ Σ_{g∈J} ||v^∆_g||, and so our κ is less than or equal to a κ' obtained under the analogous assumption. In the case of non-overlapping groups, this is an equality, and the two assumptions are identical.
We now examine some sufficient conditions for the existence of κ(s). Examining the numerator of the main quantity defining κ(s), we see that ∆^T X^T X ∆/n ≥ ρ_X ||∆||², where ρ_X is the minimal eigenvalue of X^T X/n. Examining the denominator, we can make the following bounds:

Σ_{g∈J} ||v^∆_g|| ≤ √M ( Σ_{g∈G} ||v^∆_g||² )^{1/2} ≤ (M G_overlap)^{1/2} ||∆||.

Here, G_overlap := max_{j∈I} Σ_{g∈G} 1{j ∈ g} is the maximal number of times a candidate predictor appears in the groups of the collection G. Thus, as long as X^T X has a nonzero minimal eigenvalue, we are guaranteed that Assumption 1 holds with κ(s) = (ρ_X/(M G_overlap))^{1/2}. In particular, for κ(s) to exist, it is sufficient for X^T X to be positive definite. We now state our main result.
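The quantities in the sufficient condition above are easy to compute for a given design and collection of groups. The helpers below are our own sketch (0-based group indices assumed): `g_overlap` counts the maximal number of groups covering any predictor, and `kappa_guarantee` evaluates the guaranteed value (ρ_X/(M G_overlap))^{1/2}.

```python
import numpy as np
from collections import Counter

def g_overlap(groups, p):
    # G_overlap = max over predictors j of the number of groups containing j.
    counts = Counter(j for g in groups for j in g)
    return max(counts[j] for j in range(p))

def kappa_guarantee(X, groups):
    # Guaranteed restricted-eigenvalue value (rho_X / (M * G_overlap))^(1/2),
    # where rho_X is the minimal eigenvalue of X^T X / n; this is positive
    # exactly when X^T X is positive definite.
    n, p = X.shape
    rho_X = np.linalg.eigvalsh(X.T @ X / n)[0]  # eigenvalues in ascending order
    return float(np.sqrt(max(rho_X, 0.0) / (len(groups) * g_overlap(groups, p))))
```

For the lasso (singleton groups) G_overlap = 1, while a heavily overlapping collection drives G_overlap, and hence the denominator, up.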
Theorem 1. Consider the model in Equation 1. Suppose |G| = M ≥ 2, and n ≥ 1. Assume that the entries of ǫ are i.i.d. Gaussian with mean 0 and variance σ². Let X be normalized so that the diagonal entries of X^T X/n are all equal to 1. Assume M(β^0) ≤ s, the maximum number of nonzero groups in decompositions V(β^0) of β^0. Let Assumption 1 hold with κ = κ(s). Let:

Here, A > 8. Define q = min_g(ρ_g^{−2}) min( A min_g |g|/8, 8 log M ), where ρ_g is the maximal absolute eigenvalue of a Cholesky factor of X_g^T X_g, and X_g is the sub matrix of X corresponding to the columns indexed by the group g. Then, with probability at least 1 − M^{1−q}, for any solution β̂ to Equation 14, for all β^0 ∈ R^p, the following inequalities hold:

The proof for this result is given in appendix A.1.2. The proof relies on Lemma 4 given in appendix A.1.1. We now discuss the result.
1. As the set of groups grows, the finite sample guarantees degrade. In Theorem 1, the prediction and estimation bounds both get coarser as the number of groups increases. Note that the set of groups can grow not only as the dimension of the problem grows, but also if we encode complex structures over the predictors using G. Thus, even for a problem of fixed dimension p, there is a consequence to choosing an arbitrarily complex set of groups. To make this result clear, let the groups be maximally complex: G = 2^I, the power set of the set of predictors. Now, as the dimension of the problem grows, the prediction bound grows at rate O(p^{3/2}), and the estimation bound at rate O(p). If |G| is instead of the same order as p and the maximum group size is constant, these rates are instead both O(log p). This shows that grouped sparsity achieves the tightest upper bounds if both the maximum group size and the number of groups grow at slower rates than p. Note the contrast here to the results of Lounici et al. (2009) in the multi-task setting, where a growing number of tasks benefits the procedure. In the multi-task setting, however, the number of observations necessarily grows with the number of tasks, contrary to our setting.

2. As the complexity of the groups grows, Assumption 1 becomes more restrictive. Since κ appears in the denominator of both the prediction and estimation bounds, the bounds become less tight as κ decreases. Consider the condition in Assumption 1 that Σ_{g∉J} ||v^∆_g|| is bounded by a constant multiple of Σ_{g∈J} ||v^∆_g||, and recall that J is a cardinality s set of groups. For fixed s, as the complexity of G grows, the flexibility of the decompositions grows, and more vectors ∆ satisfy this condition. This makes κ a decreasing function of |G|. We also recall that when X^T X has a nonzero minimal eigenvalue, Assumption 1 is guaranteed to hold with κ = (ρ_X/(M G_overlap))^{1/2}. As noted earlier, as the complexity of the groups grows, G_overlap increases as well, leading to a smaller guaranteed κ and in turn inferior prediction and estimation bounds.
If G_overlap is on the same order as the number of predictors, then κ(s) is of order 1/M rather than 1/√M. This dependence shows that our bounds depend equally on the dimension of the problem M and the group complexity as measured by G_overlap. In the case of the lasso or group lasso, G_overlap = 1, giving us no dependence on group complexity, as expected.

3. The results show that the procedure enjoys an advantage over non-structured procedures when β^0 is structured sparse. For example, in the finite sample case, none of our bounds depends explicitly on the dimension of the problem p. Thus, we can adopt a similar argument to those of Lounici et al. (2009) to show that, compared to the lasso, the overlapping grouped lasso gives superior results in the case where β^0 is structured sparse. That is, from Bickel, Ritov and Tsybakov (2009), for the lasso with an appropriate choice of tuning parameter and A > 2√2, the corresponding bounds hold with probability at least 1 − p^{1−A²/8}. Thus, if max_g |g| log M + max_g |g| is of smaller order than log p, the procedure has a predictive advantage. Since κ depends on the structured sparsity of the target, this result holds only for structured sparse targets β^0 which give sufficiently large values of κ under our assumption.

4. In the non-overlapping case, we can recover many results available in the literature. Here we have G_overlap = 1. We adjust our assumption to match the literature, replacing the denominator of the quantity defining κ(s) with the directly analogous ||Σ_{g∈J} v^∆_g||. Combining this with an application of the Cauchy-Schwarz inequality in the last steps of the proofs, we can recover the results of Lounici et al. (2009) in the multi-task case. In the case of the grouped lasso, we can recover the result from Nardi and Rinaldo (2008). The dependence on the eigenvalues of the Cholesky factor of each X_g^T X_g is related to the conditions given in Huang and Zhang (2010).
In the settings of Lounici et al. (2009), ρ_g = 1 for all g.

5. We can show a similar result solely in terms of max_g |g|. In particular, for an alternate choice of λ, the same results hold with probability 1 − M^{1−q}, for q = min_g(ρ_g^{−2}) min( A/8, 8 log M / max_g |g| ). This result is a consequence of a simple adjustment for this choice of λ in the proof of Lemma 4 from the appendix. This alternate result shows that as the maximum group size grows, the estimation and prediction bounds become less tight, and the probability that they hold falls.

6. The result does not depend on any uniqueness assumptions on the decomposition of β^0. The consistency result for the overlapping grouped lasso in Jacob, Obozinski and Vert (2009) assumes that the decomposition of β^0 that minimizes the ||·||_{2,1,G} norm is unique. Our result, in contrast, depends only on the maximal structured sparsity of such decompositions. Thus, in the case where β^0 does not have a unique decomposition minimizing the ||·||_{2,1,G} norm, our results still hold. This is a contrast to the asymptotic results of the next section.

Asymptotic Results
In this section, we consider fixed dimension asymptotics for the adaptive overlapping grouped lasso as described in Equation 15. These results extend those on the grouped lasso found in Nardi and Rinaldo (2008) to the case of overlapping groups.
To begin, define the sets of indices of the true linear coefficient vector β^0 which are nonzero and zero, respectively, as:

H = {j ∈ I : β^0_j ≠ 0},    H^c = {j ∈ I : β^0_j = 0}.

Accordingly, we define X_H as the sub matrix of X containing the columns with indices in the set H. Similarly, for a p-vector x, let x_H be the sub vector containing the entries with indices in the set H. Clearly, H ∪ H^c = I. Note, however, that H and H^c are not necessarily the unions of the members of J(β^0) and of J(β^0)^c, respectively. We next define the following three subsets of G related to H and H^c:

G_H = {g ∈ G : g ⊆ H},  G_{Hc} = {g ∈ G : g ⊆ H^c},  G_{Ho} = {g ∈ G : g ∩ H ≠ ∅ and g ∩ H^c ≠ ∅}.

These are, respectively, the set of groups in which the indices are all nonzero in β^0, all zero in β^0, and a mix of zero and nonzero in β^0. For this setting, we now make the following assumptions:

Assumption 2. The matrix X^T X/n converges to a positive definite matrix M as n → ∞.

Assumption 3. The entries of ǫ are i.i.d. with mean 0 and finite variance σ².

Assumption 4. There exists a neighborhood in R^p around β^0 such that any vector b in the neighborhood has a unique decomposition {v^b_g} minimizing the norm ||b||_{2,1,G}. In particular, the decomposition {v^0_g} minimizing the norm ||β^0||_{2,1,G} is unique. Further, this decomposition is such that v^0_g = 0 for all g ∈ G_{Ho}.

Assumptions 2 and 3 are directly taken from the grouped lasso setting. Assumption 4 is another such condition adapted to our setting. A direct adaptation would be that there exists some G' ⊆ G such that ∪_{g∈G'} g = supp(β^0). This property is implied by Assumption 4. Note that these three assumptions are analogous to those needed for the consistency result given in Jacob, Obozinski and Vert (2009). Assumption 4 also addresses indirectly the issue of identifiability of the groups. For example, for p = 3 and G = {{1, 2}, {2, 3}, {1, 3}}, the target β^0 = (a, a, a) does not admit a unique, norm minimizing decomposition within any neighborhood. Similarly, we can create the set {1, 2, 3} in four possible ways from unions of members of G. Thus, this particular G does not satisfy Assumption 4 for some targets.
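The partition of G into the three subsets above (groups inside the support, inside its complement, and straddling both) is mechanical to compute. The helper below is our own illustration, using 0-based indices:

```python
def classify_groups(beta0, groups):
    # Split G into (G_H, G_Hc, G_Ho): groups fully inside the support H of
    # beta0, groups fully inside H^c, and groups containing a mix of zero
    # and nonzero coefficients.
    H = {j for j, b in enumerate(beta0) if b != 0}
    G_H, G_Hc, G_Ho = [], [], []
    for g in groups:
        s = set(g)
        if s <= H:
            G_H.append(g)
        elif s.isdisjoint(H):
            G_Hc.append(g)
        else:
            G_Ho.append(g)
    return G_H, G_Hc, G_Ho
```

A nonempty G_Ho is exactly the situation the last part of Assumption 4 constrains: straddling groups must receive zero components in the norm minimizing decomposition.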
In the following result, we consider the adaptive overlapping grouped lasso of Equation 15. We now propose a set of weights {λ_g} for the adaptive overlapping grouped lasso. Let β^OLS = (X^T X)^{−1} X^T y, and let {v^OLS_g} = V(β^OLS) be any decomposition minimizing the norm ||β^OLS||_{2,1,G}. Then, let λ_g = 1/||v^OLS_g||^γ for some γ > 0. This choice of weights gives us our main result:

Theorem 2. Consider the adaptive overlapping grouped lasso. Suppose Assumptions 2, 3, and 4 hold. Let β^OLS = (X^T X)^{−1} X^T y, let {v^OLS_g} = V(β^OLS) be any decomposition minimizing the norm ||β^OLS||_{2,1,G}, and set λ_g = 1/||v^OLS_g||^γ. Then:

√n (β̂ − β^0) →_d Z,

where the above is convergence in distribution. The vector Z has entries Z_{H^c} = 0 and Z_H ∼ N(0, σ² M_H^{−1}), where M_H is the sub-matrix of M consisting of the entries with row and column indices in H.
We now make some comments on the result.
1. In the non-overlapping case, our result reduces to previous results from Nardi and Rinaldo (2008). In particular, the weights are then simply λ_g = 1/||β^OLS_g||^γ. Given this, we could ask: what is the consequence of simply choosing λ_g = 1/||β^OLS_g||^γ for the adaptive weights in the overlapping case? In the proof of the result, the impact appears in the case when g ∈ G_{Ho}. In summary, the term n^{γ/2} ||β^OLS_g||^γ is no longer O_p(1), since ||β^0_g|| > 0, and the resulting asymptotic distribution is nonzero with positive probability in coordinates that are zero in β^0. In this situation, the problem can be remedied by assuming that G_{Ho} is empty, that is:

Assumption 5. (Separation of support) ∃ G' ⊆ G such that ∪_{g∈G'} g = H and ∪_{g∉G'} g = H^c.

For many settings with overlap, this is an overly restrictive assumption. Note that this assumption corresponds to assuming the groups are correct in the non-overlapping grouped lasso. If the groups are incorrect, this remark gives us some insight as to what goes wrong asymptotically.

2. The result gives a consequence of having an "incorrect" set of groups, relative to the support of β^0. When the condition of Assumption 4 that v^0_g = 0 for all g ∈ G_{Ho} is violated, we have that n^{γ/2} ||β^OLS_g||^γ is no longer O_p(1) for g ∈ G_{Ho}, and the consequence is similar to the previous remark. Again, we get the wrong asymptotic mean, and the estimator does not have good selection properties. Such a violation of Assumption 4 implies that the structure implied by G is not sufficient to capture the structure in β^0.
3. These results exclude some types of structures, in particular nested groups in G. The uniqueness requirement implies that we cannot use a G which contains nested groups: given such a set of groups, the uniqueness condition of Assumption 4 is violated for some β^0. For example, suppose p = 5 and

G = { {1, 2}, {3, 4}, {1, 2, 3, 4}, {5} }.

Then, for β^0 = [a, a, 0, 0, c] with a, c ≠ 0, there are an infinite number of decompositions minimizing the ||·||_{2,1,G} norm. In particular, for any α strictly between 0 and a, the following decomposition minimizes the norm:

v_{1,2} = [α, α, 0, 0, 0],  v_{3,4} = 0,  v_{1,2,3,4} = [a − α, a − α, 0, 0, 0],  v_{5} = [0, 0, 0, 0, c].

Then, consider the weights λ_g = 1/||v^OLS_g||. In almost all data applications we have supp(β^OLS) ⊇ {1, 2, 3, 4}. The minimizing decomposition of ||β^OLS||_{2,1,G} will then have v^OLS_{1,2} = v^OLS_{3,4} = 0. This effectively excludes the first two groups, and we will be unable to detect all possible sparsity patterns. More generally, using the same argument as the example, we can state that in the case where groups are nested, there exist some β^0 which cannot be uniquely decomposed to minimize the ||·||_{2,1,G} norm. Thus, using nested groups degrades the asymptotic guarantees of the overlapping grouped lasso. This property precludes using a complex nested set of groups to encode multiple structures.
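The non-uniqueness in the nested-group example can be checked numerically: every member of the α-indexed family of decompositions attains the same norm value √2·a + |c|, so no single minimizer exists. The helper below is our own reconstruction of the example's components:

```python
import numpy as np

def alpha_decomposition_norm(a, c, alpha):
    # Norm of the alpha-indexed decomposition of beta0 = [a, a, 0, 0, c]
    # over G = {{1,2}, {3,4}, {1,2,3,4}, {5}} (1-based indices, as in the text):
    #   v_{1,2}     = [alpha, alpha, 0, 0, 0]
    #   v_{3,4}     = 0
    #   v_{1,2,3,4} = [a - alpha, a - alpha, 0, 0, 0]
    #   v_{5}       = [0, 0, 0, 0, c]
    return (np.hypot(alpha, alpha)          # sqrt(2) * |alpha|
            + np.hypot(a - alpha, a - alpha)  # sqrt(2) * |a - alpha|
            + abs(c))
```

For α ∈ (0, a) with a > 0 the two hypot terms sum to √2·a regardless of α, which is the flatness that breaks Assumption 4.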

Simulation Study
We now present the results of a simulation study to illuminate and support our earlier theoretical claims. For ease of comparison, we imitate the setting of Huang and Zhang (2010). Here, we explore issues most pertinent to the overlapping grouped lasso, leaving aside some of the issues addressed by the simulation study in Huang and Zhang (2010). We generate an n × p design matrix X with i.i.d. standard normal entries, with each row scaled so it has unit magnitude. We next generate a structured sparse β^0 vector with the nonzero entries defined as the union of the first k groups from our set of groups G. We choose the first k groups to achieve a consistent amount of overlap in β^0 with respect to G between trials. We define k, G, n, and p separately in each experiment. After constructing our response from X and β^0, we add zero mean Gaussian noise with standard deviation σ = 0.01. We compare the standard lasso against the overlapping grouped lasso with set of groups G. As in Huang and Zhang (2010), we adopt the recovery error ||β̂ − β^0||/||β^0|| to evaluate the performance of both estimators. We conduct the following pair of experiments:

1. Study on the effect of overlap. Here, we simulate a problem that has nearly constant difficulty for the ordinary, un-grouped, lasso, but increasing difficulty for the grouped lasso. We set p = 512, and set each group so that it consists of 8 consecutive (by index) predictors. We then vary G_overlap ∈ {1, 2, . . . , 8}. For example, with G_overlap = 1, our first two groups are g_1 = {1, 2, . . . , 8}, g_2 = {9, 10, . . . , 16}, and with G_overlap = 2, g_1 = {1, 2, . . . , 8}, g_2 = {8, 9, . . . , 15}, and so forth. We select k = ⌈(64 − 8)/(8 + G_overlap)⌉ + 1 groups to be nonzero in β^0, and set n = 192.

2. Study on the effect of sample size. We adopt a setting similar to the first experiment.
We set G_overlap = 4, and set G, p, and k in a similar manner to the first experiment. We consider n satisfying log_2(n/48) ∈ {0, 1, 2, 3, 4}.
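The simulation design above can be sketched as follows. The generator is our own helper (names and 0-based indexing are assumptions): consecutive groups of 8 predictors shifted so that adjacent groups share G_overlap − 1 indices (so G_overlap = 1 gives disjoint groups, matching the example in the text), with k taken from the formula stated above.

```python
import numpy as np

def make_overlap_design(p=512, group_size=8, g_over=1, n=192, sigma=0.01, seed=0):
    # Consecutive groups of `group_size` predictors; adjacent groups share
    # g_over - 1 indices (g_over = 1 means disjoint groups, as in the text).
    rng = np.random.default_rng(seed)
    shift = group_size + 1 - g_over
    groups = [list(range(s, s + group_size))
              for s in range(0, p - group_size + 1, shift)]
    # Number of nonzero groups, as stated in the text:
    k = int(np.ceil((64 - group_size) / (group_size + g_over))) + 1
    support = sorted({j for g in groups[:k] for j in g})
    beta0 = np.zeros(p)
    beta0[support] = 1.0
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-magnitude rows
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0, groups
```

With g_over = 1 this yields 64 disjoint groups and a support of exactly 64 predictors; increasing g_over densifies the group collection while the lasso's view of the problem stays essentially unchanged.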
The purpose of the first experiment is to study the effect of increasing complexity of G on estimation performance. For G_overlap ∈ {1, 2, 3, 4}, we see that as the degree of overlap increases, the estimator performance degrades, though not dramatically in these settings. For G_overlap = 5, with groups of size 8, we can see that due to the consecutive placement of the signal, about half of the groups may be dropped without degradation in performance, and we return to the setting and performance of G_overlap = 1. For G_overlap ∈ {6, 7, 8}, the estimator again does worse than in the case of no overlap, but no worse than G_overlap = 4. This result supports the discussion surrounding Assumption 1 and Theorem 1, but still indicates that the procedure is more robust to overlap than postulated in Huang and Zhang (2010).
In the sample size study, we see that for a reasonable (G_overlap = 4) set of groups, the estimator outperforms the lasso: it is able to achieve a limiting level of recovery error for lower sample sizes than the lasso. This supports the conclusions of Theorem 1, as well as the conclusions from the literature about the grouped lasso, e.g. Huang and Zhang (2010) and Lounici et al. (2009). We thus see that even in the overlap case, the procedure still enjoys a benefit due to group sparsity.

Discussion and Conclusions
In the previous two sections, we have given results on the performance of the overlapping grouped lasso in both the finite sample and asymptotic setting. One of the basic steps in practical applications of this procedure is the choice of the collection of groups G. In both cases, we showed that an overly complex choice of G degrades the theoretical guarantees on the performance of the estimator. In the case where the dimension of the problem is fixed, increasing the number of groups leads to less tight upper bounds on both prediction and estimation in the finite sample case. In the asymptotic setting, nested groups lead to inconsistent selection of the true sparsity pattern. Nonetheless, when G is suitably chosen, we still see that the procedure retains the theoretical benefits of the grouped lasso demonstrated in previous literature.
In summary, we find that the overlapping grouped lasso is a useful extension of the grouped lasso that must be used with caution. The flexibility allowed by overlapping groups is valuable in many applications, and can encode a wide variety of structures as collections of groups. We have shown that allowing for overlap preserves many of the theoretical properties and benefits proven for the lasso and grouped lasso. However, while the flexible nature of the procedure suggests that the analyst may encode many structures simultaneously, this approach is not supported by the results in this paper.
A.1. Finite Sample Setting

Assume that the entries of ǫ are i.i.d. Gaussian with mean 0 and variance σ². Let X be normalized so that the diagonal entries of X^T X/n are all equal to 1. Let {v^{β̂−β}_g} denote a decomposition of β̂ − β minimizing the ||·||_{2,1,G} norm. Let J = J(β^0) = {g : v^0_g ≠ 0} be the set of groups that are nonzero in the norm minimizing decomposition of β^0. Let:

Here, A > 8. Define q = min( A min_g |g|/8, 8 log M ). Then, with probability at least 1 − M^{1−q}, for any solution β̂ to Equation 14, for all β ∈ R^p, the following inequality holds:

Proof. We follow the proof strategy of Lounici et al. (2009). For all β ∈ R^p, we have:

Let y = Xβ^0 + ǫ to obtain:

We now examine the second term on the right hand side, applying our version of Hölder's inequality (Lemma 3). We now consider the event:

Note that the random variables V_{g(j)} = (1/(σ√n)) Σ_{i=1}^n X_{ij} ǫ_i, where g(j) denotes the jth element of g ∈ G, are standard Gaussian random variables. Within a group, they have a multivariate normal distribution with covariance matrix X_g^T X_g/n, where X_g denotes the sub matrix of X consisting of the columns indexed by the group g. It then follows, provided X_g^T X_g admits a Cholesky decomposition, that (X_g^T X_g/n)^{−1/2} X_g^T ǫ/(σ√n) is a vector of i.i.d. standard normal random variables. Thus, letting ρ_g denote the maximal absolute eigenvalue of (X_g^T X_g/n)^{1/2}, we have ||X_g^T ǫ/(σ√n)|| ≤ ρ_g ||(X_g^T X_g/n)^{−1/2} X_g^T ǫ/(σ√n)|| by properties of the operator norm. Now, for any g ∈ G define:

Note, ∀g ∈ G, γ_g ≤ λ. Now:

≤ P( χ²_{|g|} ≥ γ_g² n / (4σ² ρ_g² G_overlap) ) = P( χ²_{|g|} ≥ ρ_g^{−2}(|g| + A |g| log M) )

This corresponds to the result in Equation 22. Equation 23 follows from an analogous chain, beginning with the inequality 69.

A.2. Asymptotic Setting
Before we prove the main result, we give the following lemma.