Minimum Distance Lasso for robust high-dimensional regression

We propose a minimum distance estimation method for robust regression in sparse high-dimensional settings. Likelihood-based estimators lack resilience against outliers and model misspecification, a critical issue when dealing with high-dimensional noisy data. Our method, Minimum Distance Lasso (MD-Lasso), combines minimum distance functionals customarily used in nonparametric estimation for robustness, with 1-regularization. MD-Lasso is governed by a scaling parameter capping the influence of outliers: the loss is locally convex and close to quadratic for small squared residuals, and flattens for squared residuals larger than the scaling parameter. As the parameter approaches infinity the estimator becomes equivalent to least-squares Lasso. MD-Lasso is able to maintain the robustness of minimum distance functionals in sparse high-dimensional regression. The estimator achieves maximum breakdown point and enjoys consistency with fast convergence rates under mild conditions on the model error distribution. These hold for any solution in a convexity region around the true parameter and in certain cases for every solution. We provide an alternative set of results that do not require the solutions to lie within the convexity region but where the 2-norm of the feasible solutions is constrained within a safety radius. Thanks to this constraint, a first-order optimization method is able to produce local optima that are consistent. A connection is established with re-weighted least-squares that intuitively explains MD-Lasso robustness. The merits of our method are demonstrated through simulation and eQTL analysis.


Introduction
We address the problem of robust sparse estimation in high-dimensional regression.Sparse linear models allow for simultaneous model estimation and variable selection.They have become very popular tools to analyze the high-dimensional data that is prevalent in many domains such as genomics [45], neuroimaging [21,42], and economics [16].A widely used approach to sparse learning is via sparsity-inducing regularization.A well known example is the Lasso [38], which employs 1 -penalized least-squares to identify a parsimonious subset of predictors.Beyond Lasso, various structured penalties have been proposed that reflect the underlying structural information among the predictors.For instance the Group Lasso [46] enforces group sparsity via the 1 / q norm (q > 1), the Path Coding Penalties [26] and Graph Lasso [20] deal with applications where the variables reside in a graph, the Fused Lasso [39] enforces sparsity in both the coefficients and their successive differences for settings where variables are ordered in some meaningful way and a locally constant coefficient profile is desirable.Much attention has been devoted recently to the study of these structured norms and their theoretical properties [29,30], and to devising efficient algorithms for large scale problems [6].
The issue of robustness, however, has been largely overlooked in the sparse learning literature, while this aspect is critical when dealing with high dimensional noisy data.Traditional likelihood-based estimators (including Lasso and variants) are known to lack resilience to outliers and model misspecification.Despite this fact, there has been limited focus on robust sparse learning methods in high-dimensional modeling.Relevant penalized regression methods include the "extended" Lasso formulation [32] which employs the traditional squared error but incorporates an additional sparse error vector into the model so as to account for corrupted observations, and the LAD-Lasso [43], which uses the least absolute deviation combined with an 1 penalty.Note that the least absolute deviations estimate also arises as a maximum likelihood estimate if the errors have a Laplace distribution.Hence the aforementioned approaches can still be viewed as likelihood-based, and they share the deficiencies of maximumlikelihood estimators for sparse estimation in high-dimensional regression.In particular their performance drops significantly if the model is mis-specified or outliers are present: a single outlier can make their estimates entirely unreliable [1].
Departing from likelihood-based methods, we propose a penalized minimum distance criterion for robust and consistent estimation of sparse high dimensional regression.Our approach is motivated by minimum distance estimators [44], which are popular in nonparametric methods and have been shown to exhibit excellent robustness and efficiency properties [10,15].Their use for parametric estimation has been discussed in Basu et al. [8], Scott [36] and investigated by Chi and Scott [13] for sparse logistic regression.However, the robustness properties of minimum distance estimators have not been formally established in the high-dimensional regression setting.We propose the Minimum Distance Lasso (MD-Lasso) estimator, which is derived from the integrated squared error distance between the model and the "true" distribution, and imposes sparse model structure via 1 penalty.
The MD-Lasso loss, taken as a function of a single observation, acts similarly to the squared-loss if the residual squared-error of that observation is small, while the loss becomes flat as the squared-error becomes large.This ensures that the contributions of large outliers to the overall loss are capped.Overall the MD-Lasso loss is invex1 and locally convex.The extent of the local convexity region and the capping of outliers are both governed by a scaling parameter of our estimator, against which the residual squared error of each observation is being compared.In the extreme case where the scaling parameter goes to zero, only the most "trusted" observation is taken into account, while as the scaling parameter goes to infinity, the estimator becomes equivalent to the traditional Lasso estimator with the same amount of regularization on the 1 penalty.Our analysis shows that the tradeoff between convexity and robustness, as controlled by the scaling parameter, is, understandably, essential in securing both robustness and consistency of the estimator.
Our results demonstrate that the MD-Lasso enjoys fast convergence rates in high dimensional settings under mild conditions on the model error distribution in relation to the scaling parameter.These conditions are much less restrictive than the traditional sub-gaussian assumption, and cover a broad class of heavy-tailed distributions.We present two sets of consistency results.One holds for any of the solutions in the local convexity region around the true parameter (and in certain cases these are the only existing solutions globally).One does not necessitate that the solutions lie within the local convexity region but requires adding a constraint to the MD-Lasso problem to bound the 2 -norm of the parameters β considered, and assuming that the true parameter vector β * is feasible.The latter set of results has practical consequences as they allow us to show that a simple incremental algorithm yields consistent estimates.
We also show that MD-Lasso achieves a maximum breakdown point [19] for any finite value of the scaling parameter c, namely MD-Lasso is able to tolerate the maximum percentage of arbitrarily corrupted observations achievable by any method.In contrast the least squares Lasso and LAD-Lasso have a vanishing breakdown point, namely a single corruption can already drastically affect these estimators.We shed further light onto the robustness of MD-Lasso by establishing its connection with a form of iteratively re-weighted 1 -penalized least-squares regression (namely the traditional Lasso) where the weights assigned to the observations can be interpreted in terms of their likelihood.
The performance of our estimator is demonstrated on simulation data under various error distributions, in comparison to the traditional Lasso, LAD-Lasso, and Extended Lasso.This study also confirms that outliers and/or heavy-tailed noise can severely influence the variable selection accuracy of existing sparse learning methods.Experiments on real eQTL data further illustrate the usefulness of our approach.
The manuscript is organized as follows.Section 2 is devoted to the MD-Lasso estimator, its derivation from a minimum distance criterion, its geometry, and the analysis of its breakdown point.The statistical consistency results and convergence rates are shown in Section 3. The incremental method for efficient and scalable optimization is presented in Section 4. Empirical results are described in Sections 5.All proofs are collected in the Appendix.

Problem formulation and notation
Let X ∈ R n×p denote the predictor matrix, whose rows are p-dimensional variable vectors observed for n training examples.Denote by X i ∈ R p the vector formed by i th observation across all variables.
Denote by X j ∈ R n the vector formed by the observations for the j th variable.Denote by X j i ∈ R the entry in matrix X corresponding to the i th observation for the j th variable.Similarly let Y ∈ R n denote the response vector, and Y i it's i-th observation.
Consider the general regression model: where β ∈ R p is the coefficient vector one wishes to estimate, and η ∈ R n is the error term, and for simplicity we assume that the data have been standardized so that we need not consider intercept terms.We address the sparse estimation of coefficient vector β via 1 -penalized loss minimization.Specifically, we consider estimators of the form where the loss function L measures the goodness-of-fit on the response and λ n is the regularization parameter for the 1 penalty.However our framework is readily applicable to other sparsity-inducing penalties such as the group or fused Lasso.Using likelihood-based loss functions such as squared loss is a common practice in estimating and exploring the sparsity structure of the unknown parameters for model (1), whereby L(β) is derived from a product of probability density functions (p.d.f.).However, the likelihood-based estimators are known to lack resilience to outliers and model misspecification.In contrast, the minimum distance estimators [44] often used in nonparametric function estimation show excellent robustness properties [10,15].This motivates our proposed MD-Lasso estimator.
We begin this section by presenting how the MD-Lasso objective can be rigorously derived from a minimum distance criterion.

Minimum distance estimation
Here we treat response Y and predictors X as random variables, where Y ∈ R and X ∈ R p .
We first apply the Integrated Squared Error to the conditional distribution of response Y given the predictors X.This leads to an 2 distance between the true conditional distribution f (Y |X) and the parametric distribution function f (Y |X; β) as follows We remark that minimum distance estimators originally involved distances between cumulative distribution functions [44], but the notion was subsequently broadened to encompass distances between probability density functions [10,15,36].We consider the latter, which is easier to work with, and is also more natural in the context of linear regression.
Note that we assume a parametric family for the model while using a nonparametric criterion (the Integrated Squared Error) to measure goodness of fit.From the perspective of the loss function, the Integrated Squared Error is a more robust measure of the goodness-of-fit compared to likelihood-based loss functions.It can match the model with the largest portion of the data because the integration in (3) accounts for the whole range of the squared loss function.
To derive our estimator, we assume that f (Y |X; β) is the p.d.f. of multivariate normal N (X β, σ 2 ).However it is important to note that our methodology and theoretical results go well beyond the normal assumption for the error.
Such an approximation technique has also been used for Gaussian mixture density estimation [36].Disregarding the terms that are independent of β we can write the resulting empirical criterion as Rather than directly minimizing d n we will aim to minimize, equivalently, where C is a function of σ and n but is independent of β.As σ is unknown we consider instead where c is a scaling parameter.Plugging the resulting loss in (2) yields the MD-Lasso problem: Remarks.We can already gain some intuition on the robustness of MD-Lasso by considering the ratio between data and model probability density functions: An outlier in the data may drive this ratio to infinity, in which case the log-likelihood becomes infinite as well.In contrast, the differ- 3) is always bounded.This property makes the 2 -distance a favourable choice when dealing with outliers.A similar argument can be applied to the problem of density estimation and explains why the 2 -distance is also very well suited for this problem (e.g.see Sugiyama et al. [37]).A more heuristical intuition comes from noting that in (5) the logarithm is applied to a sum of probability density functions, in contrast to the likelihoodbased estimators which involve a product: the sum should be more robust to noise and outliers, as often encountered in high-dimensional data.

The geometry of the MD-Lasso estimator
The geometry of MD-Lasso is worth examining, as it provides some key insights on the estimator's robustness and the theoretical conditions required for fast convergence rates.The MD-Lasso loss, taken as a function of the residual error for a single observation with the contributions from the other observations fixed, is depicted in Figure 1, for various values of the scaling parameter c, along with the squared loss and the absolute loss.In the figure, the MD-Loss has been translated w.r.t. the y-axis for ease of comparison.From Figure 1 we can see that the MD-Lasso loss acts similarly to the squared-loss if the residual squarederror of that observation is small, while the loss becomes flat as the squared-error becomes large.This insures that the contributions of large outliers to the overall loss are capped.The range of the similarity to the squared loss is governed by the scaling parameter c of the MD-Lasso estimator, against which the residual squared error of each observation is being compared.Intuitively, the scaling parameter can thus be interpreted as a cut-off on what is an acceptable range for the error.The MD-Lasso loss over all observations is depicted in Figure 2 as a function of the regression parameter vector β, for an illustrative examples with dimensionality p = 2 and just a single relevant predictor, namely β = (β 1 , 0).As shall be formally discussed in Section 3, the MD-Lasso loss is invex and locally convex, yet it is globally non-convex.The extent of the local convexity region is controlled by the scaling parameter c.As the parameter increases, the convexity region becomes larger, and so does the proportion of observations whose squared-error are below the scaling parameter.The robustness, however, becomes weaker, as instances with larger error are allowed to significantly contribute to the overall loss.If the scaling parameter becomes too small, the proportion of observations with squared error below the scaling parameter becomes too small and compromises the convexity of the overall loss with respect to the model coefficients, a property that is needed to yield fast convergence rates.

Limit cases.
As the parameter c goes to infinity the MD-Lasso estimator becomes equivalent to the traditional Lasso estimator with the same amount of regularisation λ n .Indeed as x → 0, we have exp(x) ∼ 1 + x and log(1 + x) ∼ x.Therefore as c → ∞, the MD-Lasso estimator is identical the minimizer of On the other extreme, for c → 0, the MD-Lasso is equivalent to the minimizer of min In the latter case, only the observation with smallest residual error is taken into account, while the other observations are being discarded.This setting can thus be viewed as an extreme case of trimmed Lasso regression, where all but one observation are trimmed out.
Non-convexity and robustness.To illustrate the limitations of convex loss functions and the appropriateness of non-convex loss functions with respect to robustness, it is worthwhile to recall the notion of influence function from robust statistics [18].Consider the loss L as a function of the residual r i of a single sample.The influence function represents the rate of change in L upon a small amount of contamination on r i , and thus measures the effect of the size of a residual on the loss.Specifically, for loss functions induced by log-concave densities (e.g. the squared and absolute losses are induced by the Gaussian and Laplace distributions respectively) the influence function is identical to the derivative of the loss with respect to the residual.For squared error loss, the influence function is given for example by I(r i ) = r i and for the absolute loss it is I(r i ) = sign(r i ).In our case the influence function can be written as , where d > 0 is a constant due to the contribution from other observations.We can see that in contrast to the case of log-concave densities, the influence function of the MD-Lasso loss is redescending as r i becomes large, which signifies that large residuals are basically ignored.The so-called redescending behavior is a desirable property for robustness, which clearly cannot be achieved by convex loss functions.We refer the reader to Hampel et al. [18] for a review of influence-function approaches to robust statistics, including redescending influence functions.Note that the negative log-likelihood functions of heavy tailed distributions (e.g.Student's t and Cauchy) are non-convex.For instance for a Student's t error model with ν degrees of freedom, the loss becomes L(r i ) = log(1 + r 2 i /ν).It thus appears that non-convexity assists in accommodating large outliers or significantly noisy data.See Aravkin et al. [3] for additional pertinent points elucidating the need for non-convex loss functions to achieve robustness.

Breakdown point of the MD-Lasso estimator
Intuitively, the breakdown point [27] is the proportion of arbitrarily corrupted observations an estimator can tolerate before giving an arbitrarily large result.Thus a high breakdown point reflects a good resistance to corruptions and hence can be considered as a measure of robustness.We recall the definition of replacement finite-sample breakdown point [27].
In order to compare different estimators, one usually considers the asymptotic behaviour of * n ( β; X, Y ) : * ( β; X, Y ) = lim n→∞ * n ( β, X, Y ).The following theorem shows that the MD-Lasso estimator achieves the maximum breakdown point.
Theorem 1.Let Q c (β) denote the objective MD-Lasso seeks to minimize: For any finite choice of the scaling parameter c, consider the non-empty set For every α ∈ (0, 1), the breakdown point of any solution in B c is at least α.Namely the MD-Lasso can tolerate at least αn arbitrarily corrupted observations and still produce estimates with bounded 2 -norm.
Theorem 1 indicates that even if a majority of observations are corrupted, the estimated regression coefficients will remain bounded.Naturally, if more than 50% of observations are arbitrarily corrupted, it makes no sense to trust any model and thus the breakdown point is typically capped at 50%.We can not make a statement about the breakdown point of all local solutions but can show a high breakdown point for the best local solutions, in the sense that the local solutions in a specific levelset of the objective function have a high breakdown point.Recall that as c → 0 the MD-Lasso yields a special case of sparse trimmed least squares regression, where all but the most trusted observation are disregarded.Our results for this limit case are consistent with those on sparse trimmed regression [1].As c → ∞, MD-Lasso is equivalent to the Lasso estimator.If one sets c → ∞ in the proof of theorem 1, the reasoning guaranteeing high breakdown for MD-Lasso is no longer valid.This is to be expected: for the traditional Lasso estimator it was shown in Alfons et al. [1] that only one outlier can already send the estimates to infinity and the breakdown point is 1/n.Finally, we note that the LAD Lasso estimator also shares the poor breakdown point property of Lasso; the absolute loss does not help in improving the breakdown point.
The results of Theorem 1 pertain to arbitrary corruptions in both the data matrix X and the response vector Y .In the next section we further characterize the robustness of MD-Lasso from a different standpoint.Specifically, we consider errors in Y that are incurred via the error term η, and show that under mild assumptions on the distribution of the error term η, MD-Lasso achieves consistency with fast convergence rates.

Main results
In this section we establish the conditions for consistency and fast convergence rates of the MD-Lasso estimator under the high-dimensional setting (p n).The proofs of our results are all relegated to the Appendix.Consistency and fast convergence rates can be secured thanks to two key properties: (i) the restricted strong convexity of the loss L in the neighborhood of the true model parameter vector and (ii) the gradient boundedness at the true model parameter vector.The importance of these two properties was first identified in [30].Before defining and establishing them, we introduce some notation and state the assumptions required by our analysis.
Notation.Define t γ to be the cumulative distribution function of |η i | such that for all γ ≥ 0, Let S denote the set of indices corresponding to the support of the true coefficient vector β .Writing Δ S for the projection of a vector Δ ∈ R p onto indices S, and Δ S C for the projection onto the complement of S, define the cone Assumptions.We make the following assumptions throughout.
[A2] The error terms (η i ) n i=1 form a sequence of independent and identically distributed random variables, with zero-mean or, if the mean is undefined, a probability density function symmetric around zero.
[A3] The design matrix X satisfies the following Restricted Eigenvalue condition with constant κ RE > 0.
[A4] For every X i=1,...,n , the variable v, X i is sub-Gaussian with parameter at most κ 2 u v 2 2 .
Assumptions [A1] and [A4] could be relaxed in some ways but we are mainly interested here in robustness with respect to outliers in the target.The second assumption [A2] is weak since it allows arbitrarily heavy tails in the error distribution, while the last assumption [A3] is standard, see, for example, Bickel et al. [12].

Gradient boundedness property
The following results provide upper-bounds on the ∞ -norm of the gradient of the MD-Lasso loss evaluated at β .These bounds are the most important part when establishing rates of convergence.

Lemma 1. Under Assumptions [A1] and [A2] for any
Then, for some positive constants α 1 , α 2 , A proof is given in the Appendix.

Lemma 2. Under Assumptions
Then, for some positive constants α 1 , α 2 , α 3 , A proof is given in the Appendix.
Remarks.We note that Lemma 1 rests on establishing the bounded differences property of the gradient coordinates, based on whether or not the amplitude of the error η i exceeds √ c.Lemma 2 employs Bernstein's inequality [23], noting that the variance of η i exp(−η 2 i /(2c)) is always well-defined regardless of whether or not the variance of η i exists.
As c → ∞ the bound in Lemma 1 becomes vacuous: it essentially scales with √ c.For heavy-tailed distributions, this is an accurate indication that large values of c are not an option, as this would essentially mean giving up on the robustness property of the estimator.For lighter-tailed distributions for which the variance of η i exists (and is finite), Lemma 2 is preferred.Since for c → ∞ (by the monotone convergence theorem), Lemma 2 yields finite upperbounds if c → ∞ but with a rate depending on n such that c log p/n → 0. We present below some specific examples illustrating this interesting fact.If the variance of η i is undefined Lemma 1 yields tighter bounds for large values of c.
Examples.Gaussian Errors.If the error terms η i , i = 1, . . ., n follow a Gaussian distribution N (0, σ2 ) and c is finite, then Lemma 2 implies that with high probability If c → ∞ while c log p/n → 0, we recover the condition for the traditional Lasso (up to a constant factor) namely: This is consistent with the fact that the MD-Lasso estimator yields the traditional Lasso estimator as c → ∞.Laplace-distributed Errors.If the error terms η i , i = 1, . . ., n follow a Laplace distribution with scale parameter b, and c is finite, then Lemma 2 implies that with high probability where F (•) denotes the tail probability function of the standard normal distribution.
If c → ∞ while (c log p)/n → 0, Lemma 2 together with the monotone convergence theorem yields the condition We will use these gradient bounds to show fast convergence rates of the MD-loss under potentially heavy-tailed distributions.

Restricted strong convexity
The following lemma states conditions that guarantee the restricted strong convexity of the MD-Lasso loss in a restricted neighborhood of the true model coefficients β .

Lemma 3. Under Assumptions
If the model error distribution satisfies the tail condition: then for all Δ ∈ K(S, μ) it holds that with probability at least A proof is given in the Appendix.
Remarks.Noting that for any as long as n > 64(κ 2 /κ 1 ) 2 s log p. Lemma 3 indicates that the region of restricted strong convexity is controlled by parameter c via the condition Δ 2 ≤ μ where μ < √ c/(4κ u √ log n).For fixed c, we remark that when n is increasing the region of restricted strong convexity (or μ) will shrink due to the log n dependency.However, as will be clarified in the next section, this does not compromise convergence rates and consistency of the estimator: when n is increasing, the region within which restricted strong convexity is required to hold is also shrinking with a rate that is faster.
The convexity of the loss rests on a condition related to the tail of the error distribution, which is required so that κ 1 > 0. We consider a specific example to illustrate that the requirements on the tail of the error distribution are very mild. Example.
These conditions are quite similar.Nevertheless, except for the Laplace distribution, the heavier the tail, the larger the lower bound on c needs to be in order to secure restricted strong convexity, which makes sense as the number of large outliers is expected to increase.It is important to note, however, that while a large value of c extends the convexity region, it reduces the resilience to outliers (via the gradient bound).Thus the choice of c is key in guaranteeing both fast convergence rates and robustness, as shall be made explicit in the next section.We conclude this section by noting that under similar tail conditions on the error, the MD-Lasso loss function is (simply) convex with asymptotic probability 1 in the set The proof is similar to that of Lemma 3 and is thus omitted.

Consistency results
We now state the results on the consistency and convergence rates for the MD-Lasso estimator.These results leverage the restricted strong convexity and gradient boundedness properties.

Consistency in the local convexity region
In this section we provide error bounds for the solutions of MD-Lasso which reside within the local convexity region.The following theorem builds on the gradient bound of Lemma 1, and is thus preferred if the variance of the error is undefined.In the Appendix, we also provide a second theorem (Theorem 4) that uses the gradient bound of Lemma 2 and is thus preferred for errors with finite variance.

given the MD-Lasso estimator (5) with scaling parameter c and regularization parameter λ
c , any of the solutions in H c (there is at least one such solution) satisfies with probability at least 1 − α 1 exp(−α 2 nλ 2 n ), for any n > max ξc,γ A proof is given in the Appendix.Both Theorem 2 and Theorem 4 demonstrate that MD-Lasso is robust with respect to errors in Y .The results rest on mild assumptions on the quantiles of the error distribution η.
The bounds on βλn − β 2 in Theorem 2 and Theorem 4 scale inversely with the restricted strong convexity constant of Lemma 3.This makes sense as the constant reflects the curvature of the loss function L in a restricted set of directions around the true solution β : the higher the curvature the faster the convergence.On the other hand, the convergence rates and the regularization parameter λ n are proportional to the gradient bound of Lemmas 1 and 2. While restricted strong convexity favors large values of c, the gradient bound favors small values, hence the "tension" between the two that we now elaborate upon.
Figure 3 depicts the impact of c on the scaling factor in the convergence rates of ( 9) and (48) for various error distributions.The values of the y-axis do not reflect the multiplicative factors that do not depend on c and the error distribution.Note that to generate the figure, γ was varied, and t γ determined numerically given the error distribution, then the value with the 'best' γ was selected.The convergence rates along with Figure 3 suggest the following points: • Regardless of the error distribution, one should not set c to values of much below a value of, say, 5.This makes sense intuitively, since a small c means that we are not using many observations and operating on too small a sample size.Recall that as c → 0 the estimator acts as a trimmed 1 − penalized Least Squares estimator where all but one observations are trimmed out.• As c grows beyond an "optimum value" 2 , the heavier the tail, the faster the multiplicative factor grows.This is aligned with the intuition that one should be more conservative the heavier the tail, and thus not set c too large.Recalling that the MD-Lasso estimator is equivalent to the traditional Lasso estimator as c → ∞, our results also corroborate the fact that the performance of the traditional Lasso estimator degrades dramatically in the presence of heavy-tailed noise.• Interestingly, for lighter-tailed distribution (e.g.Laplace and Gaussian) the multiplicative factor flattens out and converges to a finite value as c → ∞, provided that c grows in a sample size dependent fashion so that c log p/n → 0. In particular, for sub-gaussian tails one recovers the results of the traditional Lasso estimator (up to a constant factor) as c → ∞. • Robustness does not cause significant loss of estimation efficiency in the absence of outliers.This will be confirmed in the simulations of Section 5, e.g.under the Gaussian setting.
We also want to make a few remarks about global and local solutions.Recall that the MD-Lasso estimator is invex and locally convex but not globally convex.If the constraint region induced by the 1 -penalty resides within the local convexity region, the invexity of the loss is not compromised, every local minimum lies within the local convexity region and is also a global minimum.The results of Theorem 2 and Theorem 4 therefore apply to any solution of the MD-Lasso estimator.A sufficient condition for this case to hold is that where λ is the parameter corresponding to parameter λ in the constrained version of the MD-Lasso problem3 : minimize L(β) If the constraint region induced by the 1 -penalty merely intersects (or even resides outside of the local convexity region), local minima may exist outside of the convexity region.

Consistency within a safety radius
The next set of results does not require the solutions to lie within the convexity region.Here the original MD-Lasso estimator is slightly modified to introduce a "safety" radius for β.Namely we consider βλn = arg min where the safety radius b 0 is chosen such that β is feasible.A key benefit of the following results is their practical impact: in Section 4 we will see that a simple incremental algorithm yields consistent estimates.Since the solutions need not belong to the local convexity region of MD-Lasso, we do not make use of Assumption [A3].Instead we introduce the following assumption: [A3 ] The rows of the design matrix X are independently drawn from N (0, Σ) where Σ has the minimum eigenvalue λ min (Σ).
The following theorem guarantees that any of the local solutions are consistent.The theorem is obtained by adapting the work of [24], noting that global convexity is actually not required for 2 consistency (see the Appendix for more details).(1) and assume that the support of the true model coefficients β has cardinality s.Let β λn be any local optima of (10) under the assumptions of Theorem 2. Suppose that the scale parameter c and the radius parameter b 0 of (5) are selected so that (i) the model error distribution can satisfy the tail condition: t √ c/2 < (1+((64/21)e −3/2 ) −1 < 0.6, and (ii)

Theorem 3. Consider the linear regression model
4 λ min (Σ) max Σ jj , and ξ 2 c,γ , C are as defined in Theorem 2.Then, with probability at least 1 − α 1 exp(−α 2 nλ 2 n ), for some universal positive constants α 1 and α 2 , the local optimal error is guaranteed to be consistent: A proof is given in the appendix.We note that in order to have a valid selection for b 0 from the condition (ii) in the theorem, √ c should at least scale with β 2 √ log n.This is the cost of the non-convexity.However, the cost is mitigated when s and β ∞ are bounded since

Optimization and parameter tuning
We first describe an incremental method.A connection with re-weighted least squares follows thereafter.We conclude this section by describing a procedure for parameter tuning.

Incremental algorithm
We first show an incremental method.The need for solving very large problems has led to a recent resurgence of interest in first-order optimization methods, such as the composite gradient method of Nesterov [31] and the incremental methods of Bertsekas [11] (adopted e.g. in [35,28]).We focus on the MD-Lasso objective with "safety" radius of (10).
The algorithm proceeds with the following updates where Π { β 2 ≤b0} denotes the projection onto the 2 ball of radius b 0 .It is important to note that the algorithm is guaranteed to converge to a local minimum even if the loss L is not convex [11] and regardless of the initialization.Hence, the algorithm can be readily instantiated for MD-Lasso.
Instantiation of the incremental algorithm for MD-Lasso.Denote by S the soft-thresholding operator defined as where all operations are applied element-wise on a vector u.Each incremental algorithm update can be computed by (at most) two simple operations.The first operation consists in computing p ρ (β (t) ) = S λn/ρ (β (t) − 1 ρ ∇L(β (t) )).Let r i = y i − X i β (t) denote the residual for the i th sample, and define Let R = (r 1 , . . ., rn ) where ri = r i w i .Then ∇L(β (t) ) = −X R, and R can be interpreted as a generalized residual.The thresholding operation boils down to the following simple step: If the 2 -norm of the projection exceeds the safety radius, a second operation has to be carried out to project onto the the 2 ball of radius b 0 .The overall procedure is summarized by Algorithm 1.

Consistency of the local optima found by the incremental algorithm.
The incremental algorithm applied to the objective of ( 10) is guaranteed to converge to a local optimum [11].Hence theorem 3 shows that any local optimum obtained by the incremental algorithm is consistent, regardless of the initialization.

Re-weighted penalized least squares
Some interesting insights on the robustness of MD-Lasso can be gained by examining the descent direction in the incremental procedure, as explicated in (14).Given an initial solution β, under the traditional squared loss one would get For the MD-Lasso we have where the weights w i are given in (13).Hence the descent direction for MD-Lasso can be seen as a "weighted version" of the direction for usual squared loss, where the weights w i can be interpreted as being proportional to the likelihood functions of individual data points, i.e., , where L denotes the likelihood function under Gaussian assumption.Thus data with high likelihood values are given more weights in the computation of the descent direction.Conversely, data with low likelihood values, which are more likely to be outliers, contribute less.The connection between the likelihood functions and weights provides an intuitive insight on the resilience of the original MD-Lasso to outliers.
We remark that a similar conclusion can be obtained by considering a first order approximation of the "log-sum-exp" term in (5) around an initial solution.This yields an approximate iterative procedure where given initial estimates, data are first re-weighted by w i in (13) and then passed to a traditional Lasso solver to provide new estimates, and the procedure is repeated until convergence.The following algorithm summarizes the procedure.
While the weighted least squares formulation illustrates most intuitively the robustness of the MD-Lasso loss, it requires running several individual Lasso problems.Even though the procedure can benefit from a warm start in each iteration, it is is computationally more intensive than running the incremental approach.We therefore do not adopt it in practice and instead prefer the incremental approach.Intuitively, the incremental approach can be interpreted as a "lazy update" version of the iteratively reweighted least squares.

Parameter tuning
In practice, the scaling parameter c is unknown and rather than guessing its value, one might wish to automatically select it, along with the regularization parameter λ.The challenge is that MD-Lasso (unpenalized) losses for different values of c, denoted by L c , are not directly comparable.This is because c 1 < c 2 implies L c1 (β) > L c2 (β).Hence the selection criterion should penalize large values of c somehow.A rigorous criterion can be found by inspecting (3) and (4), and reintroducing the term f 2 (Y |X; β)dY = 1/(2π 1/2 σ) to obtain Based on this consideration, we propose as selection criterion the minimization of the following loss on evaluation data where n eval is the evaluation sample size.
The overall tuning procedure can then be summarized as follows.For each c in a grid of candidate scaling parameters and each λ in a grid of candidate tuning parameters, solve the MD-Lasso problem on training data and obtain the estimate βc,λ .Compute Leval ( βc,λ ).Pick the pair (c, λ) with smallest loss Leval .The procedure can be naturally extended to cross-validation.

Numerical results
We compare the proposed MD-Lasso estimator with the LAD-Lasso [43], the Extended Lasso [32] and the traditional Lasso [38].We also present a strategy to automatically select the MD-Lasso scaling parameter c.
In each simulation study, we consider both n = 200 and n = 1000 observations, and p = 1000 predictors.The entries of true model coefficient vector β are set to be 0 everywhere, except for a randomly chosen subset of s coefficients, which are chosen independently and uniformly in (1,3).The size s of the set of non-zero coefficients is randomly set between 3 and 10.
Parameter tuning.We consider holdout-validated estimates, which are obtained by selecting the tuning parameter λ that minimizes the average loss on a validation set.In a first set of experiments, we hold the parameter c fixed at various values so as to examine the impact of c for various error distributions.In a second set of experiments, c is selected automatically along with λ as described in Section 4.3.
Performance metrics.To measure the estimation accuracy, we report the model error defined as To measure variable selection accuracy, we use the F 1 score [40] defined by F 1 = 2P R/(P + R), where P is precision (fraction of correctly selected variables among selected variables) and R is recall (fraction of correctly selected variables among true relevant variables).

Results.
For each setting, we present the average of the performance measure based on 100 simulations.Figure 4 and Figure 5 provide boxplots for the Model Error and the variable selection accuracy, respectively of MD-Lasso.In the figures MD-x denotes MD-Lasso with c = x and x = 1, 2, 5, 10, 25, 50, 100, Lasso denotes the Least Squares Lasso, which is the limiting case of MD-Lasso for c → ∞.
From the figures we can see that the simulations results are in agreement with the theoretical results of Section 3. Specifically if the scaling parameter is too small, the performance of the MD-Lasso method degrades, as the restricted strong convexity property is violated.As expected, the performance of MD-Lasso gets closer to that of Lasso as c becomes large.For light tail distributions (e.g.Gaussian and Laplace) we see that as long as c is larger or equal to the minimum value required for restricted strong convexity, the performance of MD-Lasso is quite insensitive to the choice of c, while the sensitivity increases for heavy tailed distributions (e.g.Student's t and Cauchy).It is intriguing to note that in many cases the variable selection accuracy of MD-Lasso decreases monotonically with the scaling parameter, suggesting that the restricted strong convexity of the loss might be more influential on the model error than on the variable selection accuracy.
Figure 6 and Figure 7 provide boxplots for the Model Error and the variable selection accuracy, respectively of MD-Lasso with automatic selection of the scaling parameter c (denoted by MD in the figures), and comparison methods.Lasso denotes the Least Squares Lasso, LAD denotes the Least Absolute Deviation Lasso, and ExLasso denotes the Extended Lasso.
For Laplace distributed errors, LAD-Lasso performs the best in terms of model error.This can be explained by the fact that the noise distribution matches the LAD-Lasso loss exactly.However, MD-Lasso achieves higher variable selection accuracy.Even for Gaussian errors, MD-Lasso is often able to outperform the Least Squares Lasso in terms of variable selection accuracy.This might be partly attributable to a better parameter selection by MD-Lasso.Indeed we noticed that MD-Lasso tends to select fewer variables than standard Lasso.For instance, under Gaussian errors, Toeplitz design, with n = 100, and p = 1000, MD-Lasso selects on average 8 variables while Lasso selected over 11 variables.
The results for Cauchy distributed errors underscore the need for a nonconvex loss function as offered by MD-Lasso, and the limited ability of convex loss functions (including LAD-Lasso) in dealing with very large outliers.

Application to eQTL mapping
We apply MD-Lasso and other methods for comparison to the task of expression quantitative trait locus (eQTL) mapping.The main goal of eQTL studies is to identify the genetic variants (SNPs) that are associated with gene expression traits.In our analysis we use data on Alzheimer's disease (AD) generated by Harvard Brain Tissue Resource Center and Merck Research Laboratories4 .The dataset concerns n = 206 AD cases with SNPs and expression levels in the visual cortex.We study the associations between p = 18137 candidate SNPs and the expression levels of APOE gene, which is a key Alzheimer's gene [41].Specifically, persons having an APOE 4 allele have an increased chance of developing the disease; those who inherit two copies of the allele are at even greater risk.
The tuning parameters for all methods were chosen using a five-fold cross validation.To start, we investigated the Normal QQ-plots of the residuals from different regression methods and saw that the residuals from the fitted regressions have very heavy right tails.As an example the plot of the MD-Lasso is shown in Figure 8(a); the plots for the competing methods look similar.This suggests that for this eQTL data analysis it might not be judicious to use methods that lack robustness to noise and model misspecification.
For ease of comparison, we first focus on a cis-eQTL analysis, namely we look into the subsets of SNPs within chromosome 19 (where gene APOE is located).
To get a measure of confidence in the associations identified, we apply the bootstrap procedure (see Davison and Hinkley [14] for a review) as follows.Given the original data, we randomly draw 100 datasets by sampling with replacement the rows of the original data, so that each dataset has the same number of rows as the original data.We then apply the comparison methods to each of the 100 bootstrap datasets.For each SNP selected using the original dataset, we count Table 1 Top 20 selected SNPs on chromosome 19 by comparison methods for the cis-eQTL study and their "confidence score" (the number of times the SNP is selected in 100 bootstrap samples).the number of times it appears in the bootstrap datasets.There is a sharp contrast among methods with respect to the number of selected SNPs.MD-Lasso and LAD-Lasso select fewer SNPs than Extended Lasso and Lasso (19 selected coefficients for MD-Lasso, 20 for LAD-Lasso, and over a 100 for Lasso and Extended Lasso).For each method, the top 20 selected SNPs according to the amplitude of their regression coefficients are listed in Table 1 along with their "confidence score".

MD-Lasso
From Table 1 we can see that MD-Lasso and LAD-Lasso share 9 common SNPs.In contrast Extended Lasso share no common SNPs with MD-Lasso or LAD-Lasso, and Lasso shares at most 2 common SNPs with other methods.The SNPs identified by MD-Lasso are selected in average 71% of the time in the bootstrap datasets, those by LAD-Lasso 65% of the time, those by Extended Lasso 35% of the time, and those by Lasso 37% of the time.While a low variability in the selection process is not a guarantee that the selection includes the SNPs of interest, a high variability in the selection results (and a corresponding low confidence score like for Lasso and Extended Lasso) makes a consistent selection of interesting SNPs less plausible.We thus hypothesize that non-robust methods may select too many spurious associations due to their inability to cope with outliers and heavy-tailed errors.
We now focus on the results obtained by the comparison methods on the full set of chromosomes.As the genetics of Alzheimer's disease are not yet fully understood, the variable selection results can only lead to qualitative statements about the performance of each method.To provide a more quantitative assessment, we evaluate the predictive accuracy of the various methods by randomly partitioning the data into training and test sets, using 150 observations for training and the remainder for testing.To get a sense of how a robust criterion performs, we also tested trimmed Lasso regression, removing the worst 10% observations according to the absolute residuals.We computed both the absolute prediction error and squared prediction error for the testing set for the model estimated using the training set.We repeated this process 20 times (using 20 random partitions).The results are presented in Table 2. Overall the predictive performance of MD-Lasso is superior to the other methods.We can also see that trimming is not as beneficial as using MD-Lasso or LAD-Lasso.
To conclude the eQTL analysis, we discuss some biologically interesting SNPs selected by various methods.These are depicted in Figure 8(b), which shows the chromosomes, highlights the position of the target APOE gene and the selected SNPs along with their closest gene.To facilitate the following discussion, we refer to the genes close to the corresponding SNPs.We first describe results pertaining to MD-Lasso, which are not found by other methods.Gene DLG2 is a memory-associated protein known to be associated to Schizophrenia.However, a recent study showed that conservation of DLG2 functions could potentially reduce the symptoms of Alzheimer's [22].It has been shown that the inactivation of the gene GRIK2 can cause severe learning disabilities.Gene GNA14 has been identified by several studies as linked to Alzheimer's disease progression (see, e.g.Arefin et al. [4]).It has been indicated that SNPs in gene GAB2 can modify the risk of late-onset Alzheimer's disease in APOE 4 carriers and plays an important role in Alzheimer's pathogenesis (see, e.g.Reiman et al. [34]).Remarkably, MD-Lasso was the only method to select SNPs in the coding region of GAB2.
We also checked for interesting SNPs selected by other methods.Most of them were also selected by MD-Lasso.For instance, SNP rs17138233, located within gene SNX13, was selected by both MD-Lasso and LAD-Lasso.The carboxyl terminal fragment of SNX13 was reported to associate with activated H-Ras [17], which has been implicated in the process of neurodegeneration in Alzheimer's disease [5].Our finding is quite intriguing as the functional consequence of the interaction between SNX13 and H-Ras is not fully understood.An example among the few interesting SNPs discarded by MD-Lasso is rs7156281, located near gene LINGO-1 , which was identified by Lasso only.LINGO-1 is known to be involved in neurodegenerative processes including Alzheimer's disease [25].
Overall, our results suggest that the MD-Lasso method achieves greater predictive accuracy and stability than other methods, and is successful in identifying plausible and relevant SNPs in eQTL mapping.

Concluding remarks
We have shown that by combining minimum distance estimation with 1 penalization the robustness of minimum distance estimation can be preserved in the sparse high-dimensional regression setting.Our theoretical results indicate that the proposed MD-Lasso estimator can achieve optimal convergence rates even under heavy-tailed error distributions.These results hinge on the selection of a scaling parameter of MD-Lasso.If the scaling parameter is very large, MD-Lasso is identical to standard least-squares Lasso.Combining robustness with fast convergence rates requires non-convexity of the loss function, and the objective function can have multiple minima as a consequence.One set of results holds for all local minima within a local convexity region around the desired solution (and we have provided reasonable conditions under which these are the only existing local minima of the objective function).Another set holds beyond the local convexity region but requires constraining the 2 -norm of the feasible solutions within a safety radius.This guarantees the convergence of a simple first-order optimization method to consistent solutions regardless of the initialization.These desirable properties were confirmed by numerical examples.The MD-Lasso framework should prove equally useful in other statistical models such as generalized linear models, which will be investigated in a future study.Another pertinent direction for future work is to consider minimum-distance loss functions beyond those stemming from likelihood-based models.

A.1. Breakdown point analysis: Proof of Theorem 1
The proof technique is similar to Alfons et al. [1].Assume that m = n−l observations are corrupted, and l observations are kept intact.For convenience assume that the uncorrupted observations are placed at the beginning of the sample, so one actually observes {(X 1 , Y 1 ), . . ., (X l , Y l ), (X l+1 , Y l+1 ), . . ., (X n , Y n ), } where (X i , Y i ) denote the corrupted observations.Let K = max i=1,...,l |Y i |, where Y i are the uncorrupted responses.We have For any β we have 16) and (15) imply By contraposition we thus obtain that Recall that the MD-Lasso loss is non-convex, hence multiple local minima may exist.However, there is at least one minimizer, β, such that Q c ( β) ≤ Q(0).Using (17), we get that for such β we have β 2 ≤ K .
) is independent of the corrupted sample.Hence for any finite c we can tolerate at least m = αn corruptions where α is arbitrarily close to 1, as in such cases K < ∞ even as n → ∞ and regardless of the nature of the corruptions.
Since the predictors are bounded, we have by Hölder's inequality Now where b = j =i exp(− ).We have where the last equation comes from the fact that μ is symmetrical.Now we have (20) and (21) we conclude that E[f j ] = 0.
McDiarmid's inequality.We have to find bounds δ i such that for all η 1 , . . ., η n , ηi and all i, Once we have these, since E[f ] = 0, by McDiarmid's inequality we have for all t > 0 If the bounds δ i in (22) hold only with probability 1 − α, we have Bounds.Now, for a given η 1 , . . ., η i−1 , η i+1 , . . ., η n , we have, if η i can take any real value, max ) is attained at x = √ c and the maximal value is c/e.Hence the lhs in ( 22) is bounded by max On the other hand, if |η| is constrained to be within λ, and λ ≤ √ c then the equivalent bound becomes 2 max Fixing a particular value of λ, we have by (19) that with probability at most exp(−2nt 2 λ ) a proportion of at least 2t λ of all samples have value larger in absolute value than λ.Hence the sum of all δ 2 i is bounded with probability 1 − exp(−2nt 2 λ ) by We have E[ j =i 1{|η j | ≥ λ }] = (n − 1)t λ .By Hoeffding's inequality, and hence, since exp(−x) is decreasing, For γ = 2 and λ = 1 it follows that Hence, with probability at least 1 In summary, we have from (24) that where b λ,c = with λ ≤ √ c.By a union bound over the predictors we obtain We can set t 2 ≥ b −1 λ,c log p/n and λ = √ c.We remark that it seems safe to assume that log p/n → 0. However, if we choose c as a function of n we still need to make sure that t λ = ω(1/ √ n), where λ < √ c and ω denotes the "Small Omega".
Bernstein's inequality.The random variable z i = η i exp(−η 2 i /(2c)) is a mean-zero random variable.Furthermore it is bounded in absolute value by c/e and hence its variance is guaranteed to exist and is at most c/e (regardless of whether or not the variance of η i is well-defined).We can thus apply Bernstein's inequality as follows. .
We now specialize the above bounds in cases where the variance of z i is computable in closed form.

Gaussian errors. If (η i
) is a zero-mean gaussian sequence with variance σ 2 , let d σ,c = (c/(2σ 2 + c)) Error distributions with undefined variances (e.g.Cauchy).If the error distribution has an undefined variance, there is no hope of getting a gradient condition which would guarantee that the gradient is still finite as c → ∞.While Bernstein's inequality is still applicable, as we know that E[z 2  i ] ≤ c e , McDiarmid's inequality yields tighter bounds in this case.
Preliminary lemma.The following technical lemma establishes the conditions required on the Hessian to guarantee restricted strong convexity of the loss L in a neighborhood of the true parameter β .

Lemma 4.
Let A ⊂ R p be star-shaped with respect to β ∈ R p , namely for any β ∈ A and t ∈ Proof of Lemma 4: Consider the function g(t) . This implies that g is positive on (0, 1] and g(1) > 0.
Lemma 4 indicates that it suffices to focus on the Hessian of L and establish that a lower bound of the form ∇ 2 L(β + tΔ)(Δ, Δ) ≥ 2κ 1 Δ 2 ( Δ 2 − κ 2 Δ 1 ) holds for all Δ in K(S, μ) and t ∈ (0, 1]. Condition on the Hessian of L. We first provide the expression for ∇ 2 L(β + tΔ)(Δ, Δ), for which we then provide a lower-bound.Let Δ ∈ R p .The gradient of L evaluated at a vector β = β +tΔ, 0 ≤ t ≤ 1 can be expressed as Differentiating a second time we obtain where s i = X i Δ.
Applying an existing bound on the expectation of sub-Gaussian maxima (e.g., see Ledoux and Talagrand [23]) we get Hence we conclude where Step 4: Tail bound on Z(ν).In view of (36), McDiarmid's inequality implies that for any t > 0 we have where C a = (1 − a)
To prove Theorem 2 and Theorem 4 is thus suffices to establish a lower-bound on F (Δ) over K(δ, S) for the specific values of δ in the theorem statements.For any Δ ∈ K(δ, S) due to restricted strong convexity and decomposability of the where S is true support set of β as defined earlier, the inequalities (i) and (ii) hold by respectively the convexity and the triangular inequality of 1 norm.Now, combining two ingredients in (52) and (54), we obtain Since the theorem assumes max ∇L(β ) ∞ , 2κ 2 b 0 log p n ≤ λn 4 , we can conclude that where we have already shown how the term ∇L(β ) ∞ can be upper bounded in Theorem 2. As a result, we can finally have an 2 error bound as follows: implying that At the same time we can also derive 1 error bound using the inequality by (55): Hence, combining with 2 error bound, we obtain which completes the proof.

Fig 1 .
Fig 1.The MD-Lasso loss, squared loss and absolute loss, as a function of the residual error of a single observation.

Fig 2 .
Fig 2. Contour plot and graph of the MD-Lasso loss for illustrative examples where the dimensionality p = 2, and sample size n > p (two plots on the left), and n = p (two plots on the right).The x-axis and the y-axis in the contour plots and graphs correspond to the coordinates of the parameter vector β.In the graphs, the z-axis corresponds to the loss.

Definition 1 .
Consider any sample of n points (X, Y ) and let β be a regression estimator.Then consider all possible corrupted samples (X , Y ) that are obtained by replacing m of the original points by arbitrary values.The breakdown point of the estimator β at the sample (X, Y ) is defined as * n ( β; X, Y ) = min m n : sup

Fig 3 .
Fig 3. Impact of the error distribution on the scaling of the convergence rate for β − β 2 as a function of c.The values of the y-axis correspond to the contributions from the error term; multiplicative factors which are independent of c are disregarded.

Fig 4 .
Fig 4. Influence of the scaling parameter c on the performance of MD-Lasso (Lasso corresponds to c → ∞).Model error (the lower the better), from top to bottom row, errors with a Gaussian, Laplace, Gaussian Mixture, Student t (4 df ) and Cauchy distribution.TM=Toeplitz Model, FM=Factor Model.

Fig 5 .
Fig 5.  Influence of the scaling parameter c on the performance of MD-Lasso (Lasso corresponds to c → ∞).Variable selection accuracy (F1 score, the higher the better), from top to bottom row, errors with a Gaussian, Laplace, Gaussian Mixture, Student t (4 df ) and Cauchy distribution.TM=Toeplitz Model, FM=Factor Model.

Fig 6 .
Fig 6.Model error (the lower the better) for the comparison methods and, from top to bottom row, errors with a Gaussian, Laplace, Gaussian Mixture, Student t (4 df ) and Cauchy distribution.TM=Toeplitz Model, FM=Factor Model.

Fig 7 .
Fig 7.  Variable selection accuracy (F1 score, the higher the better) for the comparison methods and, from top to bottom row, errors with a Gaussian, Laplace, Gaussian Mixture, Student t (4 df ) and Cauchy distribution.TM=Toeplitz Model, FM=Factor Model.

Fig 8 .
Fig 8. (a) Normal QQ-plots of residuals for MD-Lasso in the eQTL study.(b)"Circle" graph of the chromosomes highlighting the location of the target APOE gene, a subset of SNPs selected by MD-Lasso only (unmarked), by both MD-Lasso and LAD-Lasso (marked as * ), and by Lasso only (marked as * * ) for the trans-eQTL study.

Table 2
Average test absolute error (MAE) and square error (MSE) with standard deviation for the models output by MD-Lasso and representative comparison methods on the eQTL dataset.(Smallervalues indicate higher predictive accuracy).