Within Group Variable Selection through the Exclusive Lasso

Many data sets consist of variables with an inherent group structure. The problem of group selection has been well studied, but in this paper, we seek to do the opposite: our goal is to select at least one variable from each group in the context of predictive regression modeling. This problem is NP-hard, but we study the tightest convex relaxation: a composite penalty that is a combination of the $\ell_1$ and $\ell_2$ norms. Our so-called Exclusive Lasso method performs structured variable selection by ensuring that at least one variable is selected from each group. We study our method's statistical properties and develop computationally scalable algorithms for fitting the Exclusive Lasso. We study the effectiveness of our method via simulations as well as using NMR spectroscopy data. Here, we use the Exclusive Lasso to select the appropriate chemical shift from a dictionary of possible chemical shifts for each molecule in the biological sample.


Introduction
In regression problems with a predefined group structure, we seek to accurately predict the response using a subset of variables composed of at least one variable from each predefined group. We can phrase this structured variable selection problem as a constrained optimization problem where we minimize a regression loss function subject to a constraint that ensures sparsity and selects at least one variable from every predefined group. This problem has potential applications in many areas including genetics, chemistry, computer science, and proteomics. Consider a motivating example from finance. In portfolio selection, the variance of the portfolio is just as important as the expected performance of the returns (Markowitz, 1952). Suppose we want to select an index fund comprised of a diverse set of 50 stocks whose performance approximates the performance of the S&P 500. We can ensure that we are selecting a diversified portfolio by requiring that we select at least one stock from every financial sector; selecting securities from different sectors diversifies the index fund and effectively lowers the variance of the return of our portfolio. We can phrase this strategy as a structured variable selection problem where we minimize the difference in performance between the S&P 500 and our portfolio subject to selecting a small set of securities that is comprised of at least one security from each predefined financial sector.
Even though this problem is known to be NP-hard, a popular approach in the literature uses convex penalties to relax similar combinatorial problems into tractable convex problems. While the Lasso (Tibshirani, 1996) is the most well known of these convex relaxations, there are several frameworks specifically designed to find convex alternatives to complicated structured combinatorial problems (Obozinski and Bach, 2012; Halabi and Cevher, 2014). These frameworks lead to convex penalties like the Group Lasso (Yuan and Lin, 2006), Composite Absolute Penalties (Zhao et al., 2009), and the Exclusive Lasso (Zhou et al., 2010), the subject of this paper. Zhou et al. (2010) first used the Exclusive Lasso penalty in the context of multitask learning, and Obozinski and Bach (2012) and Halabi and Cevher (2014) relate the penalty to their frameworks for relaxing combinatorial problems. The Exclusive Lasso penalty has not yet been explored statistically or developed into a method that can be used for sparse regression and within group variable selection. We will develop the Exclusive Lasso method and study its statistical properties in this paper.
To motivate our statistical investigation of the Exclusive Lasso for sparse regression further, consider the problem of selecting one variable per group using existing techniques such as the Lasso or Marginal Regression. If the Lasso's incoherence condition and beta-min condition are satisfied and Marginal Regression's faithfulness assumption is satisfied, then both methods recover the correct variables without any knowledge of the group structure (Genovese et al., 2012; Wainwright, 2009). However, data rarely satisfy these assumptions. Consider that if two variables are correlated with each other, the Lasso often selects one instead of both variables. When whole groups are correlated, the Lasso may only select variables in one group as opposed to variables across multiple groups. Similarly, if the variables most correlated with the response are in the same group, Marginal Regression will ignore the true variables in other groups. If we recall the portfolio selection example, we group variables together because they are correlated. In these situations, the fact that the Lasso and Marginal Regression are agnostic to the group structure hurts their ability to select a reasonable set of variables across all predefined groups. If we know that this group structure is inherent to our problem, then complex real world correlated data motivate the development of new structured variable selection methods that directly enforce the desired selection across groups.
In this paper, we investigate the statistical properties of the Exclusive Lasso for sparse, within group variable selection in regression problems. Specifically, our novel contributions beyond the existing literature (Zhou et al., 2010; Obozinski and Bach, 2012; Halabi and Cevher, 2014) include: characterizing the Exclusive Lasso solution and relating this solution to the existing statistics literature on penalized regression (Section 2); proving consistency and prediction consistency (Section 3); developing a fast algorithm with convergence guarantees for estimation (Section 4); deriving the degrees of freedom that can be used for model selection (Section 5); and investigating the empirical performance of our method through simulations (Sections 6 and 7).

The Exclusive Lasso
Consider the linear model where the response is a linear combination of the variables subject to Gaussian noise: y = Xβ* + ε where ε is i.i.d. Gaussian. For notational convenience, we assume the response is centered to eliminate an intercept term. We assume β* is structured such that its indices are divided into non-overlapping, predefined groups and that the support of β* is distributed across all groups. We allow the support set within a group to be as small as one element and as large as the entire group. We can write this as two structural assumptions: (1) there exists a collection of non-overlapping predefined groups, denoted G, such that ∪_{g∈G} g = {1, . . ., p} and g ∩ h = ∅ for all distinct g, h ∈ G; and (2) the support set S of the true parameter β* is non-empty in each group, so that for all g ∈ G we have S ∩ g ≠ ∅ and β*_i ≠ 0 for all i ∈ S. Let C = {β ∈ R^p : supp(β) ∩ g ≠ ∅ for all g ∈ G} be the set of all parameters that satisfy our structural assumptions.
Our goal is to find the element in C that best represents y using the optimization problem: β̂ = argmin_{β∈C} ||y − Xβ||_2^2. Our constraint set makes this a combinatorial problem that is generally NP-hard. Instead of considering the problem as stated, we study its convex relaxation by replacing the combinatorial constraint with the convex penalty P(β) = (1/2) Σ_{g∈G} ||β_g||_1^2, first proposed in the context of document classification and multitask learning (Zhou et al., 2010). Obozinski and Bach (2012) showed that the Exclusive Lasso penalty is in fact the tightest convex relaxation of the combinatorial constraint requiring the solution to contain exactly one variable from each group.
In this paper we propose to study the Exclusive Lasso penalty in the context of penalized regression, looking at both the constrained version

β̂ = argmin_β ||y − Xβ||_2^2 subject to P(β) ≤ τ, (1)

where τ is some positive constant, and its Lagrangian form

β̂ = argmin_β (1/2)||y − Xβ||_2^2 + λ P(β). (2)

We predominantly work with the Lagrangian form; because the problem is convex, the two formulations are equivalent. Now let us understand the penalty better. For each group g, the penalty takes the ℓ1-norm of the parameter vector restricted to the group g, β_g, and then takes the ℓ2-norm of the vector of group norms. If each element is its own group, the penalty is equivalent to ridge regression. If all elements are in the same group, the penalty is equivalent to squaring the ℓ1-norm penalty. Loosely, the penalty performs selection within group by applying separate lasso penalties to each group. At the group level, the penalty acts as a ridge penalty, preventing entire groups from going to zero. In every case, the group structure informs the type of regularization because this is a composite penalty, utilizing the ℓ1 and ℓ2 norms within and between groups respectively.
As an illustration, consider the following toy example. Let β* = (β*_{1,1}, β*_{1,2}, β*_{2,1}) be our parameter, where the first index denotes group membership and the second denotes the element within group. Evaluating the penalty at this parameter gives P(β*) = (1/2)[(|β*_{1,1}| + |β*_{1,2}|)^2 + (β*_{2,1})^2]. We can visualize this example using the Exclusive Lasso's unit ball as shown in Figure 1. Restricting our attention to variables in the same group, β*_{1,1} and β*_{1,2}, and setting β*_{2,1} = 0 yields a unit ball equivalent to the ball generated by the ℓ1-norm. Alternatively, if we restrict our attention to variables in different groups, β*_{1,1} and β*_{2,1}, and set β*_{1,2} = 0, the unit ball is equivalent to the ball generated by the ℓ2-norm. The geometry of simple convex penalties dictates the structure of the estimate in constrained least squares problems (Chandrasekaran et al., 2012), suggesting that since the ℓ1-norm enforces sparsity in its estimate and the ℓ2-norm enforces density, we can expect the Exclusive Lasso to send either β*_{1,1} or β*_{1,2} to zero while never sending β*_{2,1} to zero. Like the Group Lasso, studied by Yuan and Lin (2006), the Exclusive Lasso assumes the variables have an inherent group structure. However, the Group Lasso also assumes that only a small number of groups represent the response y. Consequently, the Group Lasso penalty performs selection at the group level, sending entire groups to zero. Despite their differences, both the Exclusive Lasso and the Group Lasso are examples of a broader class of composite penalties studied by Zhao et al. (2009): one norm is applied within each group to achieve the desired structure within group, and a second norm is applied at the group level to the vector of group norms.
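As a concrete sketch of the toy example above (our own code, not part of the original method; the function name is ours), the penalty P(β) = (1/2) Σ_g ||β_g||_1^2 can be computed as:

```python
import numpy as np

def exclusive_lasso_penalty(beta, groups):
    """Exclusive Lasso penalty P(beta) = 1/2 * sum_g (||beta_g||_1)^2."""
    return 0.5 * sum(np.sum(np.abs(beta[g])) ** 2 for g in groups)

beta = np.array([3.0, -1.0, 2.0])   # (beta_{1,1}, beta_{1,2}, beta_{2,1})
groups = [[0, 1], [2]]              # the first group has two variables
# 0.5 * ((|3| + |-1|)^2 + |2|^2) = 0.5 * (16 + 4) = 10
print(exclusive_lasso_penalty(beta, groups))  # 10.0
```

With singleton groups the same function returns the ridge penalty (1/2)||β||_2^2, and with one all-encompassing group it returns the squared ℓ1-norm, matching the two special cases described above.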
This yields the desired structure between groups. In a sense, the Exclusive Lasso is the opposite of the Group Lasso. Where the Exclusive Lasso employs an ℓ1-norm within group and an ℓ2-norm between groups, the Group Lasso uses an ℓ2-norm within group and an ℓ1-norm between groups. Several authors have investigated some of the well known composite penalties. Nardi et al. (2008) study the conditions under which the Group Lasso identifies the correct support. Negahban and Wainwright (2008) study the theoretical properties of the ℓ1/ℓ∞ norm penalty, a penalty similar to the Group Lasso. Despite the work on other composite penalties, the statistical properties of the Exclusive Lasso have not yet been studied.

Optimality Conditions
We use the first order optimality conditions to characterize the active set and derive two expressions for the Exclusive Lasso estimate β.Each of these expressions offers insight into either the behavior of the estimate or its statistical properties.
Because the Lagrangian problem (2) is convex, an optimal point satisfies −X^T(y − Xβ̂) + λz = 0, where z is an element of the subgradient of P at β̂: for i ∈ g, z_i = sign(β̂_i)||β̂_g||_1 if β̂_i ≠ 0, and z_i ∈ [−||β̂_g||_1, ||β̂_g||_1] otherwise. Alternatively, we can express the subgradient as the product of a matrix and a vector. If we let M_g = sign(β̂_{S∩g}) sign(β̂_{S∩g})^T and let M_S be a block diagonal matrix with the matrices M_g on the diagonal, then the subgradient restricted to the support set S of β̂ is z_S = M_S β̂_S.
Note that the matrix M S depends on the support set as the block diagonal matrices are defined by the nonzero elements of β in each group.
Proposition 1. If S is the support set of β̂, we can express β̂ in terms of the support set:

β̂_S = (X_S^T X_S + λM_S)† X_S^T y and β̂_{S^c} = 0. (4)

The matrix M_S distinguishes the Exclusive Lasso from similar estimates like Ridge Regression. It is a block diagonal matrix that is equivalent to the identity matrix only when there is exactly one nonzero variable in each group. In that case, the Exclusive Lasso behaves like a Ridge Regression estimate on the nonzero indices that it has selected.
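A small sketch (our own code; the function and variable names are ours) of how M_S and the closed form β̂_S = (X_S^T X_S + λM_S)† X_S^T y can be assembled, illustrating that M_S reduces to the identity when each group contributes exactly one nonzero variable:

```python
import numpy as np

def M_S_matrix(beta_hat, support, groups):
    """Block diagonal M_S with blocks sign(beta_{S∩g}) sign(beta_{S∩g})^T,
    ordered consistently with the columns of X_S."""
    support = list(support)
    M = np.zeros((len(support), len(support)))
    for g in groups:
        idx = [k for k, j in enumerate(support) if j in g]
        if idx:
            s = np.sign(beta_hat[[support[k] for k in idx]])
            M[np.ix_(idx, idx)] = np.outer(s, s)
    return M

# With exactly one nonzero variable per group, every block is 1x1 and
# M_S is the identity, so the closed form reduces to ridge regression
# on the selected columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
groups = [[0, 1], [2, 3], [4, 5]]
beta_hat = np.array([1.5, 0.0, 0.0, -0.7, 0.2, 0.0])
S = [0, 3, 4]
M = M_S_matrix(beta_hat, S, groups)
print(np.allclose(M, np.eye(3)))  # True

lam, y = 0.5, rng.standard_normal(20)
X_S = X[:, S]
beta_S = np.linalg.pinv(X_S.T @ X_S + lam * M) @ X_S.T @ y  # closed form
```

Note that this only evaluates the closed-form expression for a given support; the support itself must come from solving the optimization problem.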
Note that this characterization describes the behavior of the nonzero variables, but it does not describe the behavior of the entire active set as we vary λ. To derive a second characterization of β̂, we note that the optimality conditions imply that every nonzero variable in the same group has an equal weighted correlation with the residual, X_i^T(y − Xβ̂). This allows us to determine when variables enter and exit the active set. Recall that there is always at least one nonzero variable in each group. Another variable only enters the active set once its weighted correlation with the residual equals the value shared by the other nonzero variables in the same group. We call the set E = {i : |X_i^T(y − Xβ̂)| / ||β̂_g||_1 = λ, where g is the group containing i} the "weighted equicorrelation set" because of its resemblance to the equicorrelation set described in Efron et al. (2004).
We can use this set to derive an explicit formula for β.
Proposition 2. If E is the weighted equicorrelation set and, for i in group g, γ_i = Σ_{j∈(E∩g)\{i}} |β̂_j| is the ℓ1-norm of the other coefficients in the group, then

β̂_E = (X_E^T X_E + λI)^{-1}(X_E^T y − λ γ ∘ s) and β̂_{E^c} = 0,

where s ∈ {−1, 1}^{|E|} is a vector of signs that satisfies the optimality conditions, ∘ denotes the elementwise product, and E^c is the complement of the set E.
The expression points to the general behavior of the penalty. For the nonzero indices, the first term is a ridge regression estimate (X_E^T X_E + λI)^{-1} X_E^T y. The second term, (X_E^T X_E + λI)^{-1} λγ ∘ s, adaptively shrinks the variables toward zero. In the case where all groups have exactly one nonzero element, the Exclusive Lasso estimate is a ridge regression estimate, ensuring that there is at least one nonzero element in each group.
This characterization also helps us see that our method is not guaranteed to estimate exactly one nonzero element in each group. Selecting exactly one element from each group depends on the response y and the design matrix X. We believe that the degree of correlation between the columns of the design matrix impacts the probability of selecting more than one element per group. In comparison to other methods, we recover the correct structure at much higher rates, but it is possible to construct examples that prevent the Exclusive Lasso from estimating the correct structure. See the appendix for more details.
Before proceeding, we use a small simulated example to compare the behavior of the Lasso to the behavior of the Exclusive Lasso. We let y = Xβ* + ε where ε ∼ N(0, 1). The design matrix X ∈ R^{20×30} is multivariate normal with covariance that encourages correlation between groups and within groups. The incoherence condition is not satisfied, with ||X_{S^c}^T X_S (X_S^T X_S)^{-1}||_∞ = 2.603. There are five groups and β* is nonzero for one variable in each group. In Figure 2, we show the Exclusive Lasso and Lasso regularization paths for this example. In the figure, the solid lines are the truly nonzero variables and each color represents a different group. The Exclusive Lasso sends variables to zero until there is exactly one nonzero variable in each group, whereas the Lasso eventually sends all variables to zero. Further, notice that the Lasso does not enforce the proper structure. The first five variables to enter the regularization path only represent three of the five groups. Because of this, the Lasso misses several true variables. The regularization path also highlights the Exclusive Lasso's connection to Ridge Regression; five variables will never go to zero.

Statistical Theory
The Exclusive Lasso is prediction consistent under weak assumptions. These assumptions are relatively easy to satisfy in practice compared to the assumptions typically associated with sparsistency results or consistency in the ℓ2-norm. Throughout the rest of this section, we use the following notation: as before, X ∈ R^{n×p} denotes the design matrix and β* ∈ R^p is the true parameter. We let G be a collection of non-overlapping groups such that ∪_{g∈G} g = {1, 2, . . ., p} and for all distinct g, h ∈ G, g ∩ h = ∅. We let S denote the support set of β*, meaning that for all i ∈ S, β*_i ≠ 0. We denote elements of X as X_ij and we index the columns of X by group so that X_g are the columns corresponding to group g. Let Y* = Xβ* and Ŷ = Xβ̂, where the vector β̂ is the estimate produced by minimizing squared error loss subject to P(β̂) ≤ K for some constant K. We write ||x||_Σ^2 = x^T Σ x, where Σ is the covariance matrix of X. Later this allows us to compare and bound the ℓ2-norm coefficient error by the mean squared prediction error.
In order to prove prediction consistency we need three assumptions:

Assumption (1): The data X is generated by a probability distribution such that the columns {X_1, . . ., X_p} have covariance Σ and the entries of X are bounded: max_{i,j} |X_ij| < ∞.

Assumption (2): The value of the penalty evaluated at the true parameter is bounded, so that (1/2)P(β*) ≤ K.

Assumption (3): The noise ε is i.i.d. mean-zero Gaussian with finite variance σ^2.
Using assumptions (1) − (3) we show that the Exclusive Lasso is prediction consistent.
Our assumptions are similar to those of the Lasso. Authors have shown that prediction consistency for the Lasso requires assumptions that are much easier to satisfy than the assumptions for other consistency results like sparsistency (Greenshtein et al., 2004). Like the Lasso's prediction consistency assumptions, many data sets will satisfy assumption (1). If we believe the data truly arise from a linear model, then assumptions (2) and (3) will be satisfied as well.
Theorem 1 shows that the Exclusive Lasso is consistent in terms of the norm ||x||_Σ.
The result differs from the prediction consistency result in (Chatterjee, 2013) by one term. The group structure in the penalty appears in the bound as the cardinality of the collection of groups. This suggests that we can allow n, p, and the number of groups to scale together and still ensure that the estimate is prediction consistent.
We use this result to justify using the Exclusive Lasso for prediction when a small number of variables are desired in each group.
We can also bound the estimated mean squared prediction error.
Theorem 2. Under assumptions (1), (2) and (3), the estimated mean squared prediction error of β̂ is bounded by a quantity that goes to 0 as n → ∞.
Similar to Theorem 1, the Exclusive Lasso is consistent in terms of the norm ||x||_Σ under weak assumptions. If we add a further assumption, we can show that the Exclusive Lasso is consistent in the ℓ2-norm.
Corollary 1. If the smallest eigenvalue of the covariance matrix Σ is bounded below by c > 0, then the Exclusive Lasso estimate is consistent in the ℓ2-norm.

We add another assumption to establish consistency in the ℓ2-norm. This requires the covariance matrix to be strictly positive definite, which is much more restrictive than our previous assumptions on Σ. In general, our results for the Exclusive Lasso are comparable to the consistency results for the Lasso but differ to account for the additional structure in the penalty.

Estimation
Many types of algorithms exist to fit sparse penalized regression models including coordinate descent, proximal gradient descent, and Alternating Direction Method of Multipliers (ADMM).We develop our Exclusive Lasso Algorithm based on proximal gradient descent because it is well studied and known to be computationally efficient.
Roughly, this type of algorithm, popularized by Beck and Teboulle (2009), proceeds by moving in the negative gradient direction of the smooth loss projected onto the set defined by the non-smooth penalty.These algorithms are easy to implement for simple penalties, because simple penalties typically have closed form proximal operators.
In our case, the proximal operator associated with the Exclusive Lasso penalty is a major challenge as there is no analytical solution. The proximal operator for the Exclusive Lasso is defined as

prox_P(z) = argmin_β (1/2)||z − β||_2^2 + (λ/2) Σ_{g∈G} ||β_g||_1^2.

We propose an iterative algorithm to compute the proximal operator of the Exclusive Lasso penalty, prove that this algorithm converges, and prove that the proximal gradient descent algorithm based on this iterative approach converges to the global solution of the Exclusive Lasso problem.
First, we propose an algorithm to compute the proximal operator.
Lemma 1. For the proximal operator prox_P(z), where P is our Exclusive Lasso penalty, if S(z, λ) = sign(z)(|z| − λ)_+ is the soft-thresholding operator and β_g^{−i} denotes the coefficients of group g excluding coordinate i (with already-updated coordinates at their current values), then the coordinate-wise updates are:

β_i ← S(z_i, λ||β_g^{−i}||_1) / (1 + λ) for i ∈ g.

Notice that each coordinate update depends only on the other coordinates in the same group. Because of this, we can implement the updates in parallel over the groups. At each step, instead of cyclically updating all of the coordinates, we update each group in parallel by cyclically updating each coordinate within a group. If there are a large number of groups or the data is very large, this can help speed up the calculation of the proximal operator. This is important in the context of our proximal gradient descent algorithm because the proximal operator is calculated at each step of the proximal gradient descent method. Empirically, we have observed that coordinate descent is an efficient way to calculate the proximal operator. However, we still need to prove that our algorithm converges to the correct solution.
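A minimal sketch of the coordinate descent evaluation of the proximal operator (our own code, assuming the update β_i ← S(z_i, λ||β_g^{−i}||_1)/(1 + λ) stated in Lemma 1; the function names and the simple sweep-change stopping rule are ours):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_exclusive_lasso(z, groups, lam, tol=1e-12, max_iter=1000):
    """Coordinate descent for
    argmin_beta 0.5 * ||z - beta||_2^2 + (lam/2) * sum_g ||beta_g||_1^2."""
    beta = z.astype(float).copy()
    for _ in range(max_iter):
        beta_old = beta.copy()
        for g in groups:                 # groups are independent: parallelizable
            for i in g:                  # cyclic updates within a group
                rest = np.sum(np.abs(beta[g])) - abs(beta[i])
                beta[i] = soft_threshold(z[i], lam * rest) / (1.0 + lam)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Singleton groups: rest = 0, so the update is plain ridge shrinkage z/(1+lam).
print(prox_exclusive_lasso(np.array([2.0, -1.0]), [[0], [1]], lam=1.0))
# [ 1.  -0.5]
```

With both coordinates in one group and the same inputs, the iteration instead drives the smaller coordinate to zero, illustrating the within-group selection behavior of the penalty.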
Note that because our penalty is non-separable in β, we cannot invoke standard convergence guarantees for coordinate descent schemes without additional investigation. Nevertheless, we can guarantee that our algorithm converges and defer the proof to the appendix: Theorem 3. The coordinate descent algorithm converges to the global minimum of the proximal operator optimization problem given in equation (9).
We are now ready to derive a proximal gradient descent algorithm to estimate the Exclusive Lasso using the coordinate descent algorithm described above. As the negative gradient of our ℓ2 regression loss is −X^T(y − Xβ), our proximal gradient descent update is

β^{k+1} = prox_{(1/L)λP}(β^k + (1/L) X^T(y − Xβ^k)),

with Lipschitz constant L = σ_max(X)^2 (see appendix). Note that this step and Lipschitz constant are the same for all regression problems that use an ℓ2-norm loss function. Putting everything together, we give an algorithm outline for our Exclusive Lasso estimation algorithm in Algorithm 1.
[Algorithm 1: Exclusive Lasso proximal gradient descent algorithm; the proximal operator is computed in parallel for each group g.]

Next, we prove convergence of Algorithm 1. Note that we never calculate the proximal operator exactly. Our coordinate descent algorithm solves the proximal operator optimization problem to within an arbitrarily small error. We need to ensure that the proximal gradient descent algorithm converges despite this sequence of errors. We can show that as long as the sequence of errors converges to zero, the proximal gradient descent algorithm will converge. Overall, this particular algorithm compares well to ISTA, the proximal gradient descent algorithm for the Lasso (Beck and Teboulle, 2009). Although computing the proximal operator is more complicated due to the structure of the penalty, the convergence rate is of the same order as the convergence rate for ISTA. The fact that the iterates are easy to compute and the convergence results are competitive reinforces our empirical observations: despite the additional structure, the Exclusive Lasso Algorithm compares well to first order methods for the Lasso and other penalized regression problems.
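Putting the gradient step and the proximal operator together, the overall estimation procedure can be sketched as follows (our own code, not the paper's reference implementation; function names, iteration caps, and tolerances are ours):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_el(z, groups, lam, tol=1e-12, max_iter=500):
    """Coordinate descent evaluation of the Exclusive Lasso proximal operator."""
    beta = z.astype(float).copy()
    for _ in range(max_iter):
        beta_old = beta.copy()
        for g in groups:
            for i in g:
                rest = np.sum(np.abs(beta[g])) - abs(beta[i])
                beta[i] = soft_threshold(z[i], lam * rest) / (1.0 + lam)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def exclusive_lasso(X, y, groups, lam, max_iter=2000, tol=1e-12):
    """Proximal gradient descent for
    argmin_beta 0.5 * ||y - X beta||_2^2 + (lam/2) * sum_g ||beta_g||_1^2."""
    L = np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)    # gradient of the squared-error loss
        beta_new = prox_el(beta - grad / L, groups, lam / L)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

As a sanity check, with every variable in its own group the penalty reduces to a ridge penalty, and the iterates converge to the closed-form ridge solution (X^T X + λI)^{-1} X^T y.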

Model Selection
In practice, we need a data-driven method to select λ and regulate the amount of sparsity within group. To this end, we provide an estimate of the degrees of freedom that allows us to use BIC and EBIC approaches for model selection. Note that while other general model selection procedures like cross validation and stability selection can be employed, these do not perform well for the Exclusive Lasso. Like the Lasso, cross validation tends to overselect variables. Similarly, we observe that stability selection overselects variables, possibly because the Exclusive Lasso always selects at least one variable per group. If a true variable is not in the model, it is necessarily replaced by a false variable, leading to artificially high probabilities of inclusion and stability scores for false variables.
The BIC formula relies on an unbiased estimate for the degrees of freedom for the Exclusive Lasso.We leverage techniques used by Stein (1981) and Tibshirani et al. (2012) to calculate the degrees of freedom, but defer the proof to the appendix.
Our formula leads to an unbiased estimate for the degrees of freedom that we use for both the BIC and the EBIC. Recall that the matrix M_S is a block diagonal matrix where each nonzero block M_g is the outer product of the sign vector of the estimate, M_g = sign(β̂_{S∩g}) sign(β̂_{S∩g})^T. This leads to our statement of the degrees of freedom for ŷ:

Theorem 5. For any design matrix X and regularization parameter λ ≥ 0, if y is normally distributed, then the degrees of freedom for Xβ̂ is df(ŷ) = E[trace(X_S(X_S^T X_S + λM_S)† X_S^T)].
An unbiased estimate of the degrees of freedom is then

df̂(ŷ) = trace(X_S(X_S^T X_S + λM_S)† X_S^T).

To verify this result, we compare our unbiased estimate of the degrees of freedom to simulated degrees of freedom following the setup outlined in Efron et al. (2004) and Zou et al. (2007). Recall that for Gaussian y, the degrees of freedom can be stated as df(ŷ) = Σ_{i=1}^n cov(ŷ_i, y_i)/σ^2. This formula points to a convenient way to simulate the degrees of freedom. We let β* be the true parameter and we simulate y B times such that y^b = Xβ* + ε^b where ε^b iid∼ N(0, 1). We then calculate an estimate for the covariance. Because the noise is standard Gaussian with σ^2 = 1, the simulated degrees of freedom is df(ŷ) = Σ_{i=1}^n cov(ŷ_i, y_i)/σ^2, where each covariance is estimated by averaging over the B replicates. Note that H^b is the hat matrix for the estimate ŷ^b; in other words, E[ŷ^b] = X_S(X_S^T X_S + λM_S)† X_S^T Xβ* = H^b Xβ* (where S here depends on the estimate at iteration b). In our simulations, we set B = 2000 and found that empirically, our unbiased estimate of the degrees of freedom closely matches the simulated degrees of freedom (Figure 3).

We can now use our unbiased estimate of the degrees of freedom to develop a model selection method for the Exclusive Lasso based on the Bayesian Information Criterion (BIC) (Schwarz et al., 1978) and the Extended Bayesian Information Criterion (EBIC) (Chen and Chen, 2008). Recall that while the BIC provides a convenient and principled method for variable selection, it can be too liberal in a high dimensional setting and is known to select too many spurious variables. Chen and Chen (2008) address this with the EBIC approach. Hence, we present both the BIC and EBIC for our method, noting that the latter is preferable in high-dimensional settings. If we assume the variance of y is unknown, the respective formulas for the BIC and the EBIC are

BIC(λ) = n log(||y − Xβ̂_λ||_2^2 / n) + log(n) · df̂(ŷ_λ)

and

EBIC(λ) = BIC(λ) + 2γ log( p choose df̂(ŷ_λ) ).

These formulas for the BIC and the EBIC can be used to select λ for the Exclusive Lasso in practice. Usually, we can select λ sufficiently large
to select exactly one variable per group. In cases where the design matrix does not permit selecting one variable per group (as discussed in Sections 6 and 7), we suggest using the BIC or EBIC to select λ and then thresholding the estimate within each group so that there is only one variable per group. We call this group-wise thresholding.
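A sketch of the unbiased degrees-of-freedom estimate and a BIC built on it (our own code; we assume the standard unknown-variance BIC form n log(RSS/n) + log(n)·df, and the function names are ours):

```python
import numpy as np

def df_estimate(X, beta_hat, groups, lam):
    """Unbiased df estimate from Theorem 5:
    trace(X_S (X_S^T X_S + lam * M_S)^+ X_S^T) for the current support S."""
    S = [int(j) for j in np.flatnonzero(beta_hat)]
    if not S:
        return 0.0
    M = np.zeros((len(S), len(S)))          # block diagonal M_S
    for g in groups:
        idx = [k for k, j in enumerate(S) if j in g]
        if idx:
            s = np.sign(beta_hat[[S[k] for k in idx]])
            M[np.ix_(idx, idx)] = np.outer(s, s)
    X_S = X[:, S]
    H = X_S @ np.linalg.pinv(X_S.T @ X_S + lam * M) @ X_S.T
    return float(np.trace(H))

def bic(X, y, beta_hat, groups, lam):
    """BIC with unknown variance (standard form, assumed here)."""
    n = len(y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return n * np.log(rss / n) + np.log(n) * df_estimate(X, beta_hat, groups, lam)
```

In practice one would compute `bic` over a grid of λ values (refitting β̂ at each) and keep the minimizer; when exactly one variable is nonzero per group, M_S is the identity and `df_estimate` returns the trace of the ridge hat matrix on the selected columns.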

Simulation Study
We study the empirical performance of our Exclusive Lasso through two sets of simulation studies: first, for selecting one variable per group and second, for selecting a small number of variables per group. We examine three situations with moderate to large amounts of correlation between groups and within groups. We omit the low correlation setting from the simulations because it corresponds to design matrices that are nearly orthogonal, satisfying both the Incoherence condition and the Faithfulness condition. This is not representative of the types of real data for which we would need to use the Exclusive Lasso and is uninteresting because all methods perform perfectly, selecting all of the truly nonzero variables and none of the false variables.
In the first simulations, we simulate data using the model y = Xβ* + ε where ε iid∼ N(0, 1) and β* is the true parameter. The variables are divided into five equal sized groups and the true parameter is nonzero at one index in each group and zero otherwise. We use three design matrices, each with n = 100 observations and p = 100 variables, to test the robustness of the Exclusive Lasso to within group correlation and between group correlation. All three matrices are drawn from a multivariate normal distribution with a Toeplitz covariance matrix with entries Σ_ij = w^{|i−j|} for variables in the same group, and Σ_ij = b^{|i−j|} for variables in different groups. The first covariance matrix uses constants b = .9 and w = .9 to simulate high correlation within groups and high correlation between groups. The second covariance matrix uses b = .6 and w = .9 so that the correlation between groups is lower than the correlation within groups, resulting in high correlation within group and medium correlation between groups. The third covariance matrix uses constants w = .6 and b = .6 so that there is medium correlation both between group and within group.
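The simulation design above can be sketched as follows (our own code; the eigenvalue clipping is our safeguard, since the mixed within/between construction need not be exactly positive definite for w ≠ b):

```python
import numpy as np

def grouped_toeplitz_cov(p, groups, w, b):
    """Sigma_ij = w^{|i-j|} for i, j in the same group, b^{|i-j|} otherwise."""
    group_of = np.empty(p, dtype=int)
    for k, g in enumerate(groups):
        group_of[g] = k
    i = np.arange(p)
    same = group_of[:, None] == group_of[None, :]
    dist = np.abs(i[:, None] - i[None, :]).astype(float)
    return np.where(same, w, b) ** dist

p = 100
groups = [list(range(20 * k, 20 * (k + 1))) for k in range(5)]
Sigma = grouped_toeplitz_cov(p, groups, w=0.9, b=0.6)

# Draw X ~ N(0, Sigma): clip any tiny negative eigenvalues, then use the
# resulting matrix square root to transform standard normal draws.
evals, evecs = np.linalg.eigh(Sigma)
A = evecs * np.sqrt(np.clip(evals, 0.0, None))
rng = np.random.default_rng(1)
X = rng.standard_normal((100, p)) @ A.T
```

The response is then generated as y = Xβ* + ε with one nonzero coefficient per group, matching the description above.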
We compare two versions of our Exclusive Lasso as described in the previous section. First, we use a regularization parameter λ large enough to ensure that the method selects exactly one element per group. In these simulations, λ = max_i |X_i^T y| was large enough to ensure the correct structure was estimated; we refer to this as the Exclusive Lasso. The second estimate, the Thresholded Exclusive Lasso, chooses the regularization parameter λ that minimizes the BIC and then thresholds in each group, keeping the index with the largest magnitude. We also compare our method to competitors and logical extensions of competitors in the literature. We base three comparison methods on the Lasso: first, we take the largest regularization parameter that yields exactly five nonzero coefficients (Lasso); second, we take the largest λ that has nonzero indices in each group and then threshold group-wise to keep the coefficient in each group with the largest magnitude (Thresholded Lasso); third, we take the first coefficient along the Lasso regularization path to enter the active set from each group (Thresholded Regularization Path). Our final two comparison methods use Marginal Regression: first, we take the five indices that maximize |X_i^T y| (Marginal Regression); second, we take the one coefficient in each group that maximizes |X_i^T y| for i ∈ g (Group-wise Marginal Regression). For all methods we select a set of variables S, and then use the data matrix restricted to this set, X_S, to calculate an Ordinary Least Squares estimate β̂_S. The prediction error is calculated using β̂_S. Results in terms of prediction error and variable selection recovery are given in Table 1.
The thresholded version of the Exclusive Lasso outperforms all other methods at all levels of correlation, likely because it selects more variables that are truly nonzero.
We observe that the thresholded estimators generally perform better than the non-thresholded estimators. Among non-thresholded estimators, the Exclusive Lasso also performs the best at all levels of correlation. These simulations highlight the Exclusive Lasso's robustness to moderate and large amounts of correlation, which is important considering we expect variables in the same group to be similar and possibly highly correlated with each other.

Table 1: Here, there is one nonzero coefficient in each of the five groups, n = 100 and p = 100, and we vary the amount of between (b) and within (w) group correlation of the design matrix with Toeplitz covariance. The Thresholded Exclusive Lasso outperforms all of the competing methods in both the recovery of truly nonzero variables and prediction error.
In the second set of simulations, we also simulate data using the model y = Xβ* + ε where ε ∼ N(0, 1) and β* is the true parameter for n = p = 100. In these simulations the variables are divided into the same five equal-sized groups, but the true parameter can be nonzero at more than one index in each group. Specifically, there are seven nonzero coefficients distributed so that three groups have exactly one nonzero index and two groups have two nonzero indices each. We simulate the design matrices in the same way as in the first set of simulations, with varying levels of between and within group correlation.
We compare three methods: the Exclusive Lasso, the Lasso, and the Lasso applied independently to each group.For all methods, we use the BIC to select the regularization parameter.When we apply the Lasso separately to each group we use separate regularization parameters as well.
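The BIC selection step used by all three methods can be sketched as follows; `bic_select` is a hypothetical helper of ours, and the degrees-of-freedom values `dfs` are assumed to come from a plug-in estimate (for the Exclusive Lasso, the trace formula derived in the appendix):

```python
import numpy as np

def bic_select(y, X, betas, dfs):
    """Pick the index along a regularization path minimizing
    BIC = n * log(RSS / n) + log(n) * df."""
    n = len(y)
    bic = [n * np.log(np.sum((y - X @ b) ** 2) / n) + np.log(n) * df
           for b, df in zip(betas, dfs)]
    return int(np.argmin(bic)), bic

# Toy usage: the true one-variable model should beat the empty model.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = X @ np.array([3.0, 0.0]) + 0.1 * rng.standard_normal(50)
idx, _ = bic_select(y, X, [np.array([3.0, 0.0]), np.zeros(2)], [1, 0])
```

When the Lasso is run separately on each group, this selection is simply applied once per group with its own path.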
Results in terms of prediction error and variable selection are given in Table 2.
Table 2: We compare the Exclusive Lasso to the Lasso and the Group-wise Lasso, with BIC model selection, for the second simulation scenario in which we have five groups with either one or two true variables per group, for a total of seven true variables. Again, n = 100 and p = 100, and the amount of between and within group correlation of the design matrix is varied. The Exclusive Lasso performs best in terms of variable selection and prediction error.
The Exclusive Lasso has the best prediction error across all three simulations.
The Exclusive Lasso selects fewer false variables than the Lasso and selects more true variables than the Group-wise Lasso. These simulations also suggest the Exclusive Lasso is more robust to high levels of correlation. Overall, our results suggest that the Exclusive Lasso performs best at within group variable selection when we have a known group structure with relatively large amounts of correlation within and/or between groups.

NMR Spectroscopy Study
Finally, we illustrate an application of the Exclusive Lasso for selecting the chemical shift of molecules in Nuclear Magnetic Resonance (NMR) spectroscopy. NMR spectroscopy is a high-throughput technology used to study the complete metabolic profile of a biological sample by measuring a molecule's interaction with an external magnetic field (De Graaf, 2013; Cavanagh et al., 1995). This technology produces a spectrum in which the chemical components of each molecule resonate at a particular frequency, measured in parts per million (ppm); see Figure 4.b for an example. A central analysis goal of NMR spectroscopy is identifying and quantifying the molecules in a given biological sample. This is challenging for numerous reasons, discussed in Ebbels et al. (2011); Weljie et al. (2006); Zhang et al. (2009). We seek to use the Exclusive Lasso to solve one of the major analysis challenges with NMR spectroscopy: accounting for positional uncertainty when quantifying relative concentrations of known molecules in a sample. Known as "chemical shifts", every molecule's chemical signature is subject to a random translation in ppm (Figure 4.a) due to the external physical environment of the sample (De Graaf, 2013). One way to model this positional uncertainty is to create an expanded dictionary of shifted molecules to use for quantification. With this expanded dictionary, we can consider each molecule and its shifts as a group, and use the Exclusive Lasso to select the best shift of each molecule for quantification.
We choose not to use real NMR spectroscopy data, as the true molecules and true concentrations are often unknown. Instead, we create a simulation based on real NMR molecule spectra in order to test our method for the purpose of NMR quantification. In our application, we simulate an NMR signal using a dictionary, X, of reference measurements for thirty-three unique molecules. The simulated NMR signal, y, is a linear combination of the molecules in the dictionary, with values chosen so that the signal has several properties that we observe in real data. For example, real NMR data can contain several unique molecules, many of which resonate at similar frequencies, causing peaks to overlap (De Graaf, 2013). Informally, this crowding yields signals that appear smoother, with less pronounced peaks. With thirty-three molecules we can recreate this effect in the region between .5 and 0 ppm (see Figure 5.b). We then simulate our signal as y = Xβ* + ε, where ε is the absolute value of Gaussian noise; this is done because real NMR spectra are non-negative.
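Building the expanded dictionary can be sketched as follows, assuming shifts are modeled as integer translations of each reference spectrum with zero padding; the function and its arguments are illustrative, not the paper's exact construction:

```python
import numpy as np

def expand_dictionary(spectra, n_shifts):
    """Build an expanded dictionary of shifted reference spectra.
    spectra: (n_points, n_molecules) array of reference spectra.
    Returns an (n_points, n_molecules * n_shifts) matrix and a list of
    column-index groups, one group per molecule (its shifted copies)."""
    n, m = spectra.shape
    cols, groups = [], []
    for j in range(m):
        start = len(cols)
        for s in range(n_shifts):             # shift by 0 .. n_shifts-1 points
            shifted = np.zeros(n)
            shifted[s:] = spectra[:n - s, j] if s > 0 else spectra[:, j]
            cols.append(shifted)              # translated in ppm, zero-padded
        groups.append(list(range(start, start + n_shifts)))
    return np.column_stack(cols), groups
```

Each group then feeds directly into the Exclusive Lasso, which picks one shifted copy per molecule.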
We then use each method (the Exclusive Lasso, the Lasso, and the Group-wise Lasso) to select a set of variables Ŝ consisting of one shift from each molecule's group of chemical shifts. Where applicable we use the thresholded versions of the estimates, selecting λ using the BIC and thresholding group-wise so that there is only one nonzero variable in each group. Finally, we compare these methods to an ordinary least squares estimate that uses the original un-expanded dictionary without modeling the positional shifts. In Table 3, we report the prediction error and the mean squared error of the coefficient estimates, MSE(β̂), computed from the OLS fit on the selected variables so that we can accurately compare the methods as variable selection procedures. This measure eliminates the shrinkage that occurs with penalized regression methods and allows us to focus on how accurately we recover the concentrations of each molecule. Among all methods, the Exclusive Lasso performs best at quantifying molecule concentrations under positional uncertainty. This case study highlights a real example where there is high correlation both within and between pre-defined groups.
Consistent with our simulation studies, the Exclusive Lasso performs best in these situations.

Discussion
Although others have introduced the Exclusive Lasso penalty, we are the first to investigate the method's statistical properties in the context of sparse regression for within group variable selection. We propose two new characterizations of the Exclusive Lasso in an effort to understand the estimate. The first characterization is an explicit definition of β̂ in terms of the support set that allows us to derive the degrees of freedom; this expression is similar to that of the ridge regression estimate, especially when there is exactly one nonzero variable in each group. The second characterization allows us to explore the properties of the active set. We then prove that the Exclusive Lasso is prediction consistent under weak assumptions, the first such result. Additionally, we develop a new algorithm for fitting the Exclusive Lasso based on proximal gradient descent, and we derive the degrees of freedom so that we can use the BIC or EBIC formulas for model selection.
Overall, we find that the Exclusive Lasso compares favorably to existing methods.
Even though the Exclusive Lasso is a more complex composite penalty, convergence results for the Exclusive Lasso algorithm are comparable to convergence rates for standard first order methods for computing the Lasso. Additionally, through several simulations, we find that the Exclusive Lasso not only selects at least one variable per group better than any existing method, but it also performs better when there is strong correlation both within groups and between groups.
In this work, we focus on statistical questions important to the practitioner, but there are several directions for future work.Investigating variable selection consistency, overlapping or hierarchical group structures, and inference are important open questions.One could also use the Exclusive Lasso penalty with other loss functions such as that of generalized linear models.Additionally, there are many possible applications of our method besides NMR spectroscopy such as creating index funds in finance, and selecting genes from functional groups or pathways, among others.
Overall, the Exclusive Lasso is an effective method for within group variable selection in sparse regression; an R package will be made available for others to utilize our method.

9 Appendix

Proof of theorems 1 and 2

The proof of theorems 1 and 2 follows the proof technique presented in Chatterjee (2013). There are several differences due to the structure of our penalty; however, the assumptions are the same. We assume that the columns of the design matrix {X_1, . . . , X_p} are possibly dependent random variables with covariance matrix Σ. We assume the entries of X are bounded, |X_{i,j}| ≤ M, and that the observed data (Y_1, X_1), . . . , (Y_n, X_n) are independent and identically distributed. We also assume the value of the penalty evaluated at the true parameter is bounded, P(β*) ≤ K, and that the response is generated by the linear model Y = Xβ* + ε, where ε ∼ N(0, σ²). Let G be a collection of predefined non-overlapping groups such that ∪_{g∈G} g = {1, . . . , p}. Instead of the Exclusive Lasso penalty, we work with the equivalent constrained optimization problem β̂ = argmin_{β : P(β) ≤ K} ‖Y − Xβ‖²₂. Let C = {Xβ : P(β) ≤ K}; by definition, Ŷ is the projection of Y onto the set C. For constrained optimization problems, first order necessary conditions for an optimal solution state that for all d in the linear tangent cone, a solution x* necessarily satisfies f'(x*; d) ≥ 0. Our assumption P(β*) ≤ K and the definition of β̂ allow us to bound the relevant quantities (see Lemma 3 of Chatterjee (2013) for a proof of the bound), which yields theorem 2. We use this result to prove theorem 1. By the independence of the data (Y, X) and β̂, combining the resulting expressions and applying a version of Hoeffding's inequality that is rather uncommon (we refer the interested reader to the appendix of Chatterjee (2013) for its derivation) yields theorem 1: the estimation error goes to 0 as MSPE(β̂) goes to 0.

Proof of theorem 3
Our coordinate descent algorithm calculates the proximal operator by solving the optimization problem prox_P(y) = argmin_x ½‖y − x‖²₂ + λP(x). We show that the assumptions of theorem 4.1 from Tseng (2001) hold for this problem. For a function of the form f = g + h, where g is convex and differentiable and h is convex but not necessarily differentiable, verifying the assumptions involves showing that:
1. The differentiable part g satisfies assumption (A1) from Tseng (2001): the domain of g is open and g is Gateaux differentiable.
2. The function f is a regular function.
3. The level set X⁰ = {x : f(x) ≤ f(x⁰)} is compact and f is continuous on X⁰.
4. For every pair i, k ∈ {1, . . . , p}, f is jointly pseudoconvex in x_i and x_k.
First we state several definitions.
A direction d is a vector in Rⁿ. We let d_k denote the scalar in the k-th position of the vector (0, . . . , 0, d_k, 0, . . . , 0); abusing notation when the meaning is unambiguous, we also let d_k denote the entire vector with 0s in all positions except the k-th. It is typical to define first order optimality conditions in terms of the Gateaux derivative. We, however, use the more general forward variation, defined as follows.

Definition 1. For a function f, the forward variation in direction d at x is f⁺(x; d) = lim_{t↓0} [f(x + td) − f(x)]/t.

The Gateaux derivative exists if both the forward and backward variations exist and are equal. Tseng uses the Gateaux derivative to define his optimality conditions, but for our unconstrained convex non-differentiable problem it is necessary and sufficient for a minimizer of f to satisfy f⁺(x; d) ≥ 0 for all d ∈ Rⁿ. We also use a notion called regularity; note that this is the same definition of regularity given in Tseng (2001), communicated here for convenience. Throughout the rest of the paper we use the forward variation and the directional derivative interchangeably.

Assumption 1: the differentiable part g satisfies assumption (A1) from Tseng (2001). Proof. The function g is differentiable, so its forward variation exists as t ↓ 0, and a similar argument holds as t ↑ 0.

Assumption 2: the function f is a regular function. Proof. Our goal is to show that if we have a point x that minimizes f pointwise, i.e.,
that f⁺(x; d_k) ≥ 0 for all d_k, then x minimizes f and satisfies the standard first order necessary and sufficient optimality condition f⁺(x; d) ≥ 0 for all d ∈ Rⁿ. We know that g(x) = ½‖y − x‖²₂ is Gateaux differentiable on Rⁿ. Next we show that the entire function f(x) = g(x) + h(x) is regular, assuming that the point x minimizes f pointwise and therefore satisfies the coordinate-wise conditions above.

Assumption 3: the level set X⁰ = {x : f(x) ≤ f(x⁰)} is compact and f is continuous on X⁰. Proof. We show that f is continuous by showing that the penalty is continuous and that the differentiable part of the objective function is continuous. Let x, y ∈ X⁰. By the reverse triangle inequality, for any i ∈ g and any ε > 0 we can define δ such that the difference in penalty values is below ε whenever the points are within δ, which shows that the penalty is continuous on the set.
To show that the term ‖y − x‖²₂ is continuous, consider two points x, z ∈ X⁰ within δ of each other; for δ small enough the difference |‖y − x‖²₂ − ‖y − z‖²₂| can be made arbitrarily small, so ‖y − x‖²₂ is continuous. Therefore f is continuous, because the sum of continuous functions is a continuous function. Using theorem 1.6 of Rockafellar and Wets (2009), continuity implies that the level sets are closed. By the Heine-Borel theorem, since X⁰ is a closed and bounded subset of Rⁿ, it is compact.

Assumption 4: for every pair i, k ∈ {1, . . . , p}, f is jointly pseudoconvex in x_i and x_k.
Proof. For any pair of indices i, k ∈ {1, . . . , p}, the function is jointly convex in x_i and x_k. Suppose indices i and k are in the same group. We can rewrite the objective function, up to terms c₀, c₁, c₂ constant in x_i and x_k, in terms of y_{i,k} = (y_i, y_k) and x_{i,k} = (x_i, x_k), the vectors restricted to indices i and k. Both the ℓ₂ norm and the affine function of x_{i,k} are convex, and the remaining function f₁(x_i, x_k) has a positive semidefinite Hessian, so it is also convex. If i and k are in different groups, we rewrite the objective function analogously with a function f₂, which also has a positive semidefinite Hessian and is therefore convex. Hence f is convex in every pair of indices, which implies that it is pseudoconvex in every pair of indices.
Given that the objective function satisfies all of the assumptions of theorem 4.1 from Tseng (2001), our coordinate descent algorithm converges to a stationary point, and because our function is convex, the stationary point is a global minimum. Note also that f(x) = ‖x‖₁² is convex, and the convexity of P(β) follows from the fact that the sum of convex functions is also convex.
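For concreteness, each coordinate update of this proximal problem has a closed form. The sketch below assumes the penalty is scaled as P(x) = ½ Σ_g ‖x_g‖₁² (the scaling consistent with the closed-form expressions elsewhere in the paper); minimizing ½(x_i − y_i)² + (λ/2)(|x_i| + c_i)² over x_i, with c_i the ℓ₁ norm of the other coordinates in i's group, gives a soft-thresholding step:

```python
import numpy as np

def prox_exclusive(y, groups, lam, n_iter=200):
    """Coordinate descent for prox of lam*P at y, with P(x) = 0.5*sum_g ||x_g||_1^2.
    Closed-form coordinate update:
      x_i = sign(y_i) * max(|y_i| - lam * c_i, 0) / (1 + lam),
    where c_i = sum_{j in group(i), j != i} |x_j|."""
    x = y.astype(float).copy()
    for _ in range(n_iter):
        for g in groups:
            for i in g:
                c = sum(abs(x[j]) for j in g if j != i)
                x[i] = np.sign(y[i]) * max(abs(y[i]) - lam * c, 0.0) / (1 + lam)
    return x
```

For a single group with y = (1, 1) and λ = 1, the symmetric fixed point t = (1 − t)/2 gives x = (1/3, 1/3), matching the stationarity condition directly.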
We have already shown that both g and h are continuous, so their sum is continuous. Therefore, because the level sets of our function f are bounded and f is continuous and proper, by theorem 1.9 of Rockafellar and Wets (2009) there exists a minimum of our objective function f. Assumption 4 holds by theorem 3.
Therefore by proposition 1 from Schmidt et al. (2011) the Exclusive Lasso algorithm converges at a rate of O(1/k).

Proof of theorem 5
We first note that there exists a regularization parameter such that the necessary conditions for the Exclusive Lasso problem are exactly the same as the necessary conditions for the optimization problem that uses the square root of the Exclusive Lasso penalty, P̃ = √P: if we let α = 2λP̃(β̂), the two subgradient conditions coincide. We then solve for β̂_S using z_S = M_S β̂_S, yielding β̂_S = (X_S^T X_S + λM_S)† X_S^T y. Note that we rely on the fact that we have already proved the existence of a solution to the optimization problem in the proof of theorem 4. This gives the estimate ŷ = X_S(X_S^T X_S + λM_S)† X_S^T y. The divergence is therefore the trace of X_S(X_S^T X_S + λM_S)† X_S^T, which is equal to the sum of its eigenvalues.
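A sketch of this degrees-of-freedom computation follows; the function name is ours, and the trace is computed directly rather than via the eigenvalues:

```python
import numpy as np

def exclusive_lasso_df(X, beta, groups, lam):
    """Degrees-of-freedom estimate from the divergence of
    y_hat = X_S (X_S^T X_S + lam * M_S)^+ X_S^T y, where M_S is block
    diagonal with blocks s_g s_g^T (signs of active coefficients per group)."""
    S = np.flatnonzero(beta)                 # support set
    Xs = X[:, S]
    pos = {j: k for k, j in enumerate(S)}    # column position within support
    M = np.zeros((len(S), len(S)))
    for g in groups:
        act = [pos[j] for j in g if j in pos]
        if act:
            s = np.sign(beta[S[act]])
            M[np.ix_(act, act)] = np.outer(s, s)
    H = Xs @ np.linalg.pinv(Xs.T @ Xs + lam * M) @ Xs.T
    return np.trace(H)
```

As a sanity check, at λ = 0 with a full-rank active design this reduces to the trace of a projection, i.e., |S|, and it shrinks below |S| as λ grows.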

Penalty
For specific values of X and y, the Exclusive Lasso will select more than one variable per group for all values of the regularization parameter λ. This means that although the Exclusive Lasso is designed to select exactly one element per group, we cannot guarantee that it will enforce the correct structure. Consider an example.
Suppose we characterize the Exclusive Lasso estimate using the equicorrelation set E. Let s be the vector with s_i = sign(β̂_i) for i ∈ E, and let γ be the vector with γ_i = ‖β̂_{g_i}‖₁ − |β̂_i|, where g_i is the group containing index i ∈ E. Then we can solve for β̂ from the stationarity condition
X_E^T(y − X_E β̂_E) = λ(γ ∘ s) + λβ̂_E,
so that
β̂_E = (X_E^T X_E + λI)^{−1} [X_E^T y − λ(γ ∘ s)].
Let X = I₂ and y^T = (1, 1). Because X is orthonormal, the estimate simplifies to β̂_E = y/(1 + λ) − λ(γ ∘ s)/(1 + λ). In this case β̂₁ = β̂₂, so the term λ(γ ∘ s)/(1 + λ) shrinks both indices equally for all λ. This prevents the estimate from selecting exactly one element in each group.
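A quick numerical check of this example, assuming the scaling P(β) = ½Σ_g‖β_g‖₁² used in our closed-form expressions: with X = I₂, y = (1, 1), and both indices in one group, the stationarity condition t − 1 + 2λt = 0 gives β̂₁ = β̂₂ = 1/(1 + 2λ), strictly positive for every λ, so neither coordinate is ever zeroed out:

```python
# Verify the orthonormal two-variable, one-group example above:
# beta_1 = beta_2 = t solves t - y_i + lam * (|t| + |t|) = 0 with y_i = 1,
# i.e. t = 1 / (1 + 2*lam) > 0 for all lam >= 0.
for lam in [0.1, 1.0, 10.0, 1000.0]:
    t = 1.0 / (1.0 + 2.0 * lam)
    assert abs((t - 1.0) + lam * 2.0 * t) < 1e-12  # stationarity holds
    assert t > 0                                   # both coordinates stay nonzero
```

So for this X and y, no value of λ yields exactly one nonzero variable in the group.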
We conjecture that conditions on X and y under which this occurs can be formalized, but this is beyond the scope of this work. Intuitively, this behavior occurs when two or more variables are shrunken equally; as such, it is relatively rare in practice. If it does occur and one variable per group is desired, we propose using the BIC to select λ and applying group-wise thresholding.
(a) The unit ball is equivalent to the ℓ₂ ball between groups, enforcing density. (b) The unit ball is equivalent to the ℓ₁ ball within group, enforcing sparsity. (c) The Exclusive Lasso unit ball.

Figure 1 :
Figure 1: The unit ball for the Exclusive Lasso penalty. The ball has properties of both the ℓ₁ unit ball and the ℓ₂ unit ball. Let β* = (β*_{1,1}, β*_{1,2}, β*_{2,1}) be a parameter with two groups, where the first index denotes the group and the second enumerates the elements within a group. Considering the perspective where β*_{2,1} = 0 yields a ball equivalent to the ℓ₁ ball (b). Considering a perspective where either β*_{1,1} = 0 or β*_{1,2} = 0 yields a ball equivalent to the ℓ₂ ball (a).

Figure 2 :
Figure 2: A toy simulation with n = 20 and p = 30, consisting of five groups with one true variable per group. The coefficient paths of the true variables are solid lines and non-true variables are dashed lines; each color represents a different group. (a) Regularization path for the Exclusive Lasso. The Exclusive Lasso behaves like an adaptively regularized ridge regression estimate, sending variables to zero until only one variable from each group is nonzero; at this point it behaves like a ridge regression estimate. (b) Regularization path for the Lasso. The Lasso sends variables to zero without considering the group structure. Note that the first five variables to enter the model for the Lasso represent only groups 3, 4 and 5, whereas the Exclusive Lasso has five variables, at least one from each group, in the model for all λ.

Theorem 4 .
Given the objective function f(β) = ½‖y − Xβ‖²₂ + λP(β), the sequence of iterates {β^k} generated by our proximal gradient descent algorithm converges in objective function value at a rate of at least O(1/k) when the error sequences {ε_k} and {√ε_k} are summable.
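A minimal sketch of the proximal gradient loop referenced in Theorem 4, assuming step size 1/L with L = λ_max(XᵀX) and the penalty scaling P(x) = ½Σ_g‖x_g‖₁²; the helper names are ours:

```python
import numpy as np

def prox_p(y, groups, lam, sweeps=100):
    # Coordinate-wise prox of lam*P with P(x) = 0.5 * sum_g ||x_g||_1^2.
    x = y.astype(float).copy()
    for _ in range(sweeps):
        for g in groups:
            for i in g:
                c = sum(abs(x[j]) for j in g if j != i)
                x[i] = np.sign(y[i]) * max(abs(y[i]) - lam * c, 0.0) / (1 + lam)
    return x

def exclusive_lasso_pg(X, y, groups, lam, n_iter=500):
    """Proximal gradient sketch for 0.5*||y - X b||^2 + lam * P(b).
    The O(1/k) rate referenced above holds when each prox is computed
    (near-)exactly, per inexact proximal gradient analyses."""
    L = np.linalg.eigvalsh(X.T @ X).max()  # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = prox_p(b - grad / L, groups, lam / L)
    return b
```

With an orthonormal design the loop reaches the prox of y in a single iteration, which makes the fixed point easy to verify by hand.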

Figure 3 :
Figure 3: Comparison of our estimate of the degrees of freedom to the simulated degrees of freedom. The simulated degrees of freedom matches the estimated degrees of freedom very closely.


Figure 4 :
Figure 4: (a) Positional uncertainty of the chemical shift for the molecule carnosine. All NMR spectroscopy signals are subject to random translations in ppm due to the chemical environment of the sample. (b) NMR spectra of a neuron cell sample; such samples can be composed of up to 5,000 unique molecules. NMR spectroscopy measures concentrations of all molecules in a sample, and the observed signal is a linear combination of its unobserved component molecules' chemical signatures.
Figure 5: (a) The covariance matrix for the expanded dictionary of molecules. We simulate chemical shifts by generating 10 lagged variables for each of the 33 molecules (blocks on the diagonal); a molecule and its 10 shifts comprise a group in which each variable is highly correlated with every other member, and the molecules are also correlated with each other because we include chemically similar molecules. (b) The simulated NMR signal and the signal estimated using the Exclusive Lasso. The estimate recovers most of the peaks, suggesting it selects a useful set of shifts, and zeros out most of the noise in the simulated signal.

Definition 2 .
A function f is regular at x if f'(x; d) ≥ 0 for all d such that f'(x; d_k) ≥ 0 for each coordinate direction d_k. Regularity ensures that if we have a point that minimizes f coordinate-wise, then that point minimizes the function f.

Definition 3. A function f is pseudoconvex if f(x + d) ≥ f(x) whenever x ∈ dom(f) and f'(x; d) ≥ 0.

Assumption 1: the differentiable part g satisfies assumption (A1) from Tseng (2001).

then λ∂P(β̂) = α∂P̃(β̂). This implies that β̂ necessarily satisfies −X^T(y − Xβ̂) + α∂P̃(β̂) ∋ 0. Taking the inner product with β̂ yields (Xβ̂)^T(y − Xβ̂) = αP̃(β̂). Line 5 follows for the set C = {u ∈ Rⁿ : P̃*(X^T u) ≤ α}, proving that y − Xβ̂ is equal to the projection of y onto the set C; this implies that Xβ̂ = (I − P_C)y. Combining Lemmas 1 and 2 yields that the Exclusive Lasso estimate is continuous and almost differentiable. Next we define β̂ in terms of the support set S. First recall the KKT conditions −X^T(y − Xβ̂) + λz = 0 with z ∈ ∂P(β̂). We can rewrite the subgradient for the indices i ∈ g ∩ S: letting s_{g∩S} = sign(β̂_{g∩S}), we have z_{g∩S} = s_{g∩S} s_{g∩S}^T β̂_{g∩S}. We can therefore write the subgradient over the indices of the support as z_S = M_S β̂_S, where M_S is the block diagonal matrix with the matrices {s_{g∩S} s_{g∩S}^T : g ∈ G} on the diagonal. Rewriting the KKT conditions with respect to the support set S gives −X_S^T y + X_S^T X_S β̂_S + λz_S = 0 and −X_{S^c}^T y + X_{S^c}^T X_S β̂_S + λz_{S^c} = 0.

Table 1 :
We compare the Exclusive Lasso and a thresholded version of the Exclusive Lasso to alternative variable selection methods as described in the Simulation section.

Table 3 :
In our simulation using NMR spectroscopy data, we seek to quantify concentrations of molecules in a sample (see MSE(β̂)) under positional uncertainty in chemical shifts. Here, OLS regression quantifies concentrations without accounting for positional uncertainty, whereas the Exclusive Lasso, Marginal Regression, and the Lasso account for positional uncertainty by selecting one chemical shift for each molecule from an expanded dictionary. Given the selected variables Ŝ, these methods use OLS estimates on X_Ŝ to estimate β and quantify concentrations.