A general family of trimmed estimators for robust high-dimensional data analysis

Abstract: We consider the problem of robustifying high-dimensional structured estimation. Robust techniques are key in real-world applications which often involve outliers and data corruption. We focus on trimmed versions of structurally regularized M-estimators in the high-dimensional setting, including the popular Least Trimmed Squares estimator, as well as analogous estimators for generalized linear models and graphical models, using convex and non-convex loss functions. We present a general analysis of their statistical convergence rates and consistency, and then take a closer look at the trimmed versions of the Lasso and Graphical Lasso estimators as special cases. On the optimization side, we show how to extend algorithms for M-estimators to fit trimmed variants and provide guarantees on their numerical convergence. The generality and competitive performance of high-dimensional trimmed estimators are illustrated numerically on both simulated and real-world genomics data.


Introduction
We consider the problem of high-dimensional estimation, where the number of variables p is much higher than the number of observations n. These problems arise in a variety of domains, including signal processing, computational biology and finance. The development and statistical analysis of structurally constrained estimators for high-dimensional estimation have recently attracted considerable attention. These estimators seek to minimize the sum of a loss function and a weighted regularizer. The most popular example is that of the Lasso [43], which solves an ℓ₁-regularized (or equivalently ℓ₁-constrained) least squares problem. Under sub-Gaussian errors, the Lasso has been shown to have strong statistical guarantees [44,46]. Regularized maximum likelihood estimators (MLEs) have been developed for sparsity-structured Generalized Linear Models (GLMs), with theoretical guarantees such as ℓ₁- and ℓ₂-consistency [29] and model selection consistency [10]. For matrix-structured regression problems, estimators using nuclear-norm regularization have been studied e.g. by [36]. Another prime example is that of sparse inverse covariance estimation for graphical model selection [35].
In practice, however, the desirable theoretical properties of such regularized M-estimators can be compromised, since outliers and corruptions are often present in high-dimensional data problems. These challenges motivate the development of robust structured learning methods that can cope with observations deviating from the model assumptions. The problem of reliable high-dimensional estimation under possibly gross error has gained increasing attention. Relevant prior work includes the "extended" Lasso formulation [31,53], which incorporates an additional sparse error vector into the original Lasso problem so as to account for corrupted observations; the LAD-Lasso [47], which uses the least absolute deviation combined with an ℓ₁ penalty; the "matrix uncertainty selectors" [38], which consider the case where the predictor matrix is observed with additive error; and the Robust Matching Pursuit method of Chen et al. [12], which performs feature selection based on a trimmed inner product of the features with the residuals to mitigate the impact of corrupted observations. Extending M-estimators beyond the least squares case along these directions is challenging. For example, Yang et al. [51] and Tibshirani and Manning [42] extend the strategy of Nguyen and Tran [31] to generalized linear models in two ways: (1) modeling errors in the input space, using a convex objective with stringent conditions for consistency; and (2) modeling errors in the output space, breaking convexity but yielding milder conditions. Recent works [33,24] consider estimation frameworks based on robust mean estimation; however, they are not able to guarantee consistent solutions (the estimation upper bound depends on the standard deviation of dense white noise, σ, which does not converge to zero even as n → ∞).
Another line of work aims at trimming outliers by minimizing the residuals over a selected subset of observations. A key motivation for trimmed approaches is that convex loss functions with linear tail growth (such as the ℓ₁-norm and Huber loss) are not robust enough. As Alfons et al. [1] point out, these approaches have a breakdown point of 0, since even a single gross contamination can arbitrarily distort the regression coefficients. Remarkably, minimizing the median of the squared residuals, as originally proposed by Rousseeuw [39], avoids this problem, reaching a breakdown point of nearly 50%; the approach is equivalent to 'trimming' a portion of the largest residuals. This leads to the sparse Least Trimmed Squares (sparse LTS) estimator for robust high-dimensional linear models of Alfons et al. [1], who look for approximate solutions using alternating minimization schemes, since the trimming problem is combinatorial and computationally intractable as the number of features grows. While Alfons et al. [1] established the high breakdown point property of sparse LTS, its statistical convergence has not been analyzed.
Statistical consistency under arbitrary corruption is certainly not a given. Consider a case of adversarial corruption, where some measurements are replaced by very high values, all exactly consistent with an estimator that is arbitrarily far away from the truth. Any likelihood formulation can easily be fooled and led to the 'wrong' estimator. Some additional assumptions are thus needed as a safeguard in the statistical analysis. In practice, extreme deviations might be easily detected in specific datasets; it is more difficult to catch outliers which masquerade as real data.
In this paper, we present a unified framework and statistical analysis for trimmed regularized M-estimators, generalizing the sparse least trimmed squares (sparse LTS) estimator [1] to allow for a wide class of (possibly non-convex) loss functions as well as structured regularization. Using our analysis, we derive error bounds for the sparse LTS estimator. These require less stringent conditions for estimation consistency than those of extended Lasso. [6] and follow-up work of Zhang et al. [54] consider this high-dimensional linear model and propose a hard thresholding-based algorithm, which can be seen as a special case of the partial trimmed optimization discussed in Section 4. The analysis provided in [6] allows for a linear fraction of corruptions but again involves the standard deviation of dense white noise σ in the estimation upper bound, which our analysis avoids. We also derive error bounds for sparse Gaussian graphical models (GGMs) as a specific example. In contrast, existing approaches for robust sparse GGM estimation lack statistical guarantees.
We also show how to extend optimization algorithms for M-estimators to trimmed formulations, using partial minimization over auxiliary variables. An important example of the approach is a modified proximal gradient method. For convex M-estimators, we show that under moderate assumptions, the 'trimming' is completed in finitely many steps, and thereafter the method reduces to a descent method for a convex problem over a fixed set of identified 'inliers'. We use simulated data to compare with competing methods, and then apply our approach to real genomics datasets.
The manuscript is organized as follows. In Section 2 we introduce the general setup and present the family of High-Dimensional Trimmed estimators. Results on convergence and consistency of these estimators are given in Section 3, along with corollaries for linear models and Gaussian graphical models. Algorithms are developed in Section 4. Empirical results are presented in Section 5 for simulated data and Section 6 for the analysis of genomics datasets. All proofs are in the Appendix.

A general framework for high-dimensional trimmed estimators
Motivating example 1: Linear regression We start with high-dimensional linear regression. The real-valued observation yᵢ ∈ ℝ comes from the linear model

yᵢ = ⟨xᵢ, θ*⟩ + δᵢ + εᵢ,    (1)

where xᵢ ∈ ℝᵖ is a covariate, θ* = (θ*₁, ..., θ*ₚ) ∈ ℝᵖ is the true regression parameter vector, δᵢ captures possible (gross) corruption of the i-th response and is zero for uncorrupted samples, and εᵢ is the observation noise. We assume without loss of generality that p is substantially larger than n; moreover, outliers are commonly present in such high-dimensional data problems.
Let G be the set of "good" samples, and B the set of "bad" samples, corrupted arbitrarily. To deal with observations that deviate from the true model, Alfons et al. [1] proposed sparse LTS, an ℓ₁-penalized version of the classical least trimmed squares (LTS) estimator [39], which solves

min_θ Σ_{i=1}^{h} r²₍ᵢ₎(θ) + λ‖θ‖₁,    (2)

where r²₍₁₎(θ) ≤ r²₍₂₎(θ) ≤ ⋯ ≤ r²₍ₙ₎(θ) are the order statistics of the squared residuals rᵢ²(θ) = (yᵢ − ⟨xᵢ, θ⟩)². Alfons et al. [1] established the breakdown point of the resulting sparse LTS estimator and proposed an iterative algorithm for its computation. At iteration t, the algorithm computes the Lasso solution based on the current subset Hₜ of observations, with pre-defined size h = |Hₜ|, and constructs the next subset Hₜ₊₁ from the observations corresponding to the h smallest squared residuals.
Our starting point is the following reformulation of the regularized LTS problem:

min_{θ ∈ ρ𝔹₁, w ∈ Δ_h} (1/h) Σ_{i=1}^{n} wᵢ (yᵢ − ⟨xᵢ, θ⟩)² + λ‖θ‖₁,    (3)

where Δ_h := {w : w ∈ [0, 1]ⁿ, 1ᵀw = h} is the h-scaled capped unit simplex, 𝔹₁ is the ℓ₁-norm ball, and the constraint θ ∈ ρ𝔹₁ (or equivalently ‖θ‖₁ ≤ ρ) ensures that the optimum of the non-convex problem (3) exists, as discussed in Loh and Wainwright [25]. This constraint on θ is a theoretical safeguard, since problem (3) is equivalent to problem (2) when ρ is large enough.
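To see the equivalence concretely, here is a minimal numerical check (in Python, with helper names of our own) that minimizing the weighted loss over the capped simplex Δ_h in (3) recovers the sum of the h smallest squared residuals from (2):

```python
# Minimal check: the optimum over w in Delta_h places weight 1 on the
# h smallest squared residuals, matching the order-statistic form (2).
import numpy as np

def trimmed_loss_order_stats(r, h):
    """Sum of the h smallest squared residuals, as in (2)."""
    return np.sort(r ** 2)[:h].sum()

def trimmed_loss_simplex(r, h):
    """Minimize sum_i w_i r_i^2 over w in [0,1]^n with 1^T w = h."""
    w = np.zeros_like(r)
    w[np.argsort(r ** 2)[:h]] = 1.0          # optimal vertex of Delta_h
    return (w * r ** 2).sum(), w

rng = np.random.default_rng(0)
r = rng.normal(size=20)
val, w_opt = trimmed_loss_simplex(r, h=15)
assert np.isclose(trimmed_loss_order_stats(r, 15), val)
```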
A family of trimmed estimators Based on the reformulation (3), we propose the family of trimmed estimators for general high-dimensional problems: given a collection of arbitrarily corrupted samples Z₁ⁿ = {Z₁, ..., Zₙ} and a differentiable (possibly non-convex) loss function L, we solve

min_{θ ∈ ρ𝔹_R, w ∈ Δ_h} f(θ, w) := (1/h) Σ_{i=1}^{n} wᵢ L(θ; Zᵢ) + λ R(θ),    (4)

where R(·) is a decomposable norm used as a regularizer [29] to encourage particular low-dimensional structure in the estimator, and 𝔹_R is the unit ball of R(·) (in other words, the constraint θ ∈ ρ𝔹_R ensures R(θ) ≤ ρ). The parameter h selects the number of samples (or the total weight) used in training. Ideally, h is the number of uncorrupted samples in G, but we can tune h by cross-validation along with the parameter λ. Cross-validation is a popular choice in practice and has recently been analyzed for the vanilla Lasso estimator [13]. Though beyond the scope of this paper, it would be interesting to explore the applicability of other selection strategies, including self-tuning approaches which are pivotal with respect to the penalty parameters [5,4], and approaches estimating h by treating it as a concomitant variable, similarly to [22]. Having a general choice of loss function L in (4) is particularly important, given the wide variety of losses used by practitioners in statistics and machine learning.
Motivating example 2: Graphical models Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables [23]. These models use undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in a variety of domains, including computational biology [32], natural language processing [27], image processing [48,19,14], statistical physics [20], and spatial statistics [37].
In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator, the Graphical Lasso, minimizes the negative Gaussian log-likelihood regularized by the ℓ₁ norm of the entries (or the off-diagonal entries) of the precision matrix (see [52,17,3]). This estimator has strong statistical guarantees (see e.g. [35]). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods [7] or by coordinate descent algorithms [17,3]. Alternatively, neighborhood selection [28,50] can be used to estimate conditional independence relationships separately for each node in the graph, via Lasso linear regression [43]. Under certain assumptions, the sparse GGM structure can still be recovered even in high-dimensional settings.
The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. [15] for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumptions. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Relevant work includes [16], which leverages multivariate t-distributions for robustified inference via the EM algorithm. They also propose an alternative t-model which adds flexibility to the classical t-distribution but requires Monte Carlo EM or a variational approximation, as the likelihood function is not available explicitly. Another pertinent work is that of [41], which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation, where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting, so as to guarantee positive definiteness of the final estimate.
A special case of the proposed family is that of the Trimmed Graphical Lasso for robust estimation of sparse GGMs:

min_{Θ ≻ 0, ‖Θ‖₁ ≤ ρ, w ∈ Δ_h} (1/h) Σ_{i=1}^{n} wᵢ (⟨Θ, xᵢxᵢᵀ⟩ − log det Θ) + λ‖Θ‖₁,off.    (5)

Here, for matrices U ∈ ℝ^{p×p} and V ∈ ℝ^{p×p}, ⟨U, V⟩ denotes the trace inner product tr(UVᵀ). For a matrix U ∈ ℝ^{p×p} and parameter a ∈ [1, ∞], ‖U‖_a denotes the element-wise ℓ_a norm, and ‖U‖_{a,off} the element-wise ℓ_a norm restricted to off-diagonal entries. For example, ‖U‖₁,off := Σ_{i≠j} |Uᵢⱼ|. We provide statistical guarantees on the consistency of this estimator. To the best of our knowledge, this is in stark contrast with prior work on robust sparse GGM estimation (e.g. [16,41]), which lacks statistical guarantees.
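For concreteness, a small sketch of the per-sample losses and the w-step for (5) (function names are ours; the per-sample loss mirrors the Gaussian negative log-likelihood up to constants):

```python
# Sketch of the Trimmed Graphical Lasso objective pieces in (5):
# per-sample losses L(Theta; x_i) = <Theta, x_i x_i^T> - log det Theta,
# and the w-step, which weights the h smallest losses.
import numpy as np

def per_sample_nll(Theta, X):
    """Per-sample Gaussian loss for each row x_i of X."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    quad = np.einsum("ij,jk,ik->i", X, Theta, X)   # x_i^T Theta x_i
    return quad - logdet

def trimmed_glasso_objective(Theta, X, w, lam):
    h = w.sum()
    loss = (w * per_sample_nll(Theta, X)).sum() / h
    off = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()  # ||Theta||_{1,off}
    return loss + lam * off

def w_step(Theta, X, h):
    """Global minimum in w for fixed Theta: weight 1 on h smallest losses."""
    w = np.zeros(X.shape[0])
    w[np.argsort(per_sample_nll(Theta, X))[:int(h)]] = 1.0
    return w
```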

Statistical guarantees of trimmed estimators
In this section, we provide a statistical analysis of the family of structurally regularized estimators (4). In order to simplify the notation in our theorem and its corollaries, we assume without loss of generality that the number of good samples is known a priori and that the tuning parameter h in (4) is set exactly to the number of genuine samples, |G|. This assumption is admittedly unrealistic, but as long as we set h smaller than |G|, the statements in the main theorem and its corollaries apply.
Noting that the optimization problem (4) is non-convex, estimators returned by iterative methods for (4) will be stationary points. We call (θ̂, ŵ) a local minimum of (4) when: (1) θ̂ is a local minimum of g₁(θ) := f(θ, ŵ), and (2) ŵ is a global minimum of g₂(w) := f(θ̂, w).
These are precisely the points that are found by the algorithms developed in Section 4. In this section, we give statistical error bounds for any such points.
Consider any such local minimum (θ̂, ŵ). While we are mainly interested in the error bounds of our estimator for the target parameter θ* (that is, in θ̂ − θ*), we first define w* as follows: for an index i ∈ G, w*ᵢ is simply set to ŵᵢ, so that w*ᵢ − ŵᵢ = 0; otherwise, for an index i ∈ B, we set w*ᵢ = 0. Note that while θ* is fixed unconditionally, w* depends on ŵ. However, w* is fixed given ŵ.
In order to guarantee bounded errors, we first assume that, given (θ̂, ŵ), the following restricted strong convexity condition holds: (C-1) (Restricted strong convexity (RSC) on θ) We overload notation and use L(θ, w) to denote (1/h) Σ_{i=1}^{n} wᵢ L(θ; Zᵢ). Then, for any possible Δ := θ̂ − θ*, the differentiable loss function L satisfies

⟨∇_θ L(θ* + Δ, w*) − ∇_θ L(θ*, w*), Δ⟩ ≥ κ_l ‖Δ‖₂² − τ₁(n, p) R(Δ)²,

where κ_l is a curvature parameter, and τ₁(n, p) is a tolerance function on n and p.
Note that this condition is slightly different from the standard restricted strong convexity condition because of the dependency on w*, and therefore on ŵ: each local optimum has its own restricted strong convexity condition. In the case of no corruption, with w*ᵢ = 1 for all i, this condition reduces trivially to the standard RSC condition, under which the standard general M-estimator has been analyzed (see Negahban et al. [29] for details).
We additionally require the following condition for a successful estimation with (4) on corrupted samples: (C-2) (Structural incoherence between θ and w) The cross term induced by the weight error is controlled:

|⟨∇_θ L(θ*, ŵ) − ∇_θ L(θ*, w*), Δ⟩| ≤ τ₂(n, p) ‖Δ‖₂ + τ₃(n, p) R(Δ),

for tolerance functions τ₂(n, p) and τ₃(n, p). (C-2) can be understood as a structural incoherence condition between θ and w. This type of condition is also needed for the guarantees of extended Lasso [31] and other dirty statistical models with more than a single parameter [49]. Due to the dependence on ŵ, each local optimum has its own conditions (C-1) and (C-2). We will see later in this section that these two conditions are mild enough to be satisfied by popular estimators (such as linear models and GGMs).
Armed with these conditions, we state the main theorem on the error bounds of (4):

Theorem 1.
Consider an M-estimator from (4) with any local minimum (θ̂, ŵ), and suppose that it satisfies the conditions (C-1) and (C-2). Suppose also that the regularization parameter λ in (4) is set as

λ ≥ 2 ( R*(∇_θ L(θ*, w*)) + τ₂(n, p) + τ₃(n, p) ).    (6)

Then the following error bound for θ̂ is guaranteed for a given model space M:

‖θ̂ − θ*‖₂ ≤ c λ Ψ(M) / κ_l

for a universal constant c, where Ψ(M) := sup_{u ∈ M\{0}} R(u)/‖u‖₂ measures the compatibility between the R(·) and ℓ₂ norms.
For sparse vectors, Ψ := sup_{u ∈ M\{0}} ‖u‖₁/‖u‖₂ = √k, where k is the sparsity of the true parameter θ*, and M is the space of vectors with the correct support set [29].
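The value √k is a direct consequence of the Cauchy–Schwarz inequality; a one-line derivation:

```latex
% Why \Psi(\mathcal{M}) = \sqrt{k} for k-sparse vectors.
% By Cauchy--Schwarz, for any u supported on S with |S| = k:
\|u\|_1 = \sum_{j \in S} |u_j| \cdot 1
        \le \Big(\sum_{j \in S} u_j^2\Big)^{1/2} \Big(\sum_{j \in S} 1\Big)^{1/2}
        = \sqrt{k}\,\|u\|_2 ,
% with equality when all k nonzero entries share the same magnitude, hence
\Psi = \sup_{u \in \mathcal{M} \setminus \{0\}} \frac{\|u\|_1}{\|u\|_2} = \sqrt{k}.
```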
The statement in Theorem 1 is applicable to any local minimum of (4), and it holds deterministically. Probabilistic statements come in when verifying that the condition on λ specified in Theorem 1 is satisfied. In (6), λ is chosen based on R*(∇_θ L(θ*, w*)), similarly to Negahban et al. [29]. We shall see that the remaining terms with tolerance functions τ in (6) have the same order as R*(∇_θ L(θ*, w*)) for the specific cases of linear models and GGMs developed in the next sections.

Statistical guarantees of high-dimensional least trimmed squares
We now focus on the special case of high-dimensional linear regression and apply Theorem 1 to problem (3). In particular, if i ∈ G, then yᵢ = ⟨xᵢ, θ*⟩ + εᵢ, where the observation noise εᵢ is zero-mean with sub-Gaussian tails.
While Theorem 1 holds for any setting under (C-1) and (C-2), we consider the following natural setting with dependent design matrices. This setting has been widely studied in past work on conventional (outlier-free) high-dimensional linear models (e.g. [29]):

(LTS1) (Gaussian design) The uncorrupted covariates xᵢ, i ∈ G, are i.i.d. samples from N(0, Σ).

(LTS2) (Sub-Gaussian noise)
The noise vector ε ∈ ℝⁿ is zero-mean and has sub-Gaussian tails: for any fixed vector v with ‖v‖₂ = 1,

P(|⟨v, ε⟩| > t) ≤ 2 exp(−t²/(2σ²))  for all t > 0.

The sub-Gaussian class is quite wide, containing the Gaussian family as well as all bounded random variables.
(LTS3) (Column normalization) Let X ∈ ℝⁿˣᵖ be the design matrix whose i-th row is the covariate of the i-th sample, xᵢ, and let Xⱼ ∈ ℝⁿ be the j-th column vector of X. Then ‖Xⱼ‖₂/√h ≤ 1. As pointed out in Negahban et al. [29], we can always rescale linear models, without loss of generality, to satisfy this condition.
In addition to the standard conditions above, we further assume the following conditions, which will guarantee structural incoherence (C-2): (C-h) Let h be the number of good samples, |G| = h, and hence |B| = n − h. We assume that at least half of the samples are genuine and uncorrupted, so that (|G| − |B|)/|G| ≥ α for some 0 < α ≤ 1. For instance, if we assume that 40% of the samples are corrupted, then α = 1/3.

(LTS4) We set the tuning parameter ρ in (3) as ρ ≤ (C₁/2)√(h / log p) for some constant C₁. This setting requires the number of good samples h to be at least (2k‖θ*‖∞/C₁)² log p, so that the true regression parameter θ* is feasible for the objective.
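The feasibility requirement is a one-line computation: θ* lies in the constraint set ρ𝔹₁ exactly when its ℓ₁ norm fits under ρ.

```latex
% Feasibility of \theta^* under (LTS4): \theta^* has k nonzero entries, so
\|\theta^*\|_1 \le k \, \|\theta^*\|_\infty .
% Hence \theta^* \in \rho \mathbb{B}_1 holds whenever
k \, \|\theta^*\|_\infty \le \rho = \frac{C_1}{2} \sqrt{\frac{h}{\log p}}
\quad\Longleftrightarrow\quad
h \ge \Big( \frac{2 k \|\theta^*\|_\infty}{C_1} \Big)^{2} \log p .
```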
Under these conditions, we can recover the following error bounds for high-dimensional LTS (3), as a corollary of Theorem 1:

Corollary 1. Consider the corrupted linear model under conditions (LTS1)–(LTS4) and (C-h). Suppose that we compute a local minimum (θ̂, ŵ) of (3) with regularization parameter λ = c√(log p / h), where c is some constant dependent on Σ, σ and the upper bound on (maxᵢ δᵢ²)|B|/h. Then (θ̂, ŵ) is guaranteed to satisfy (C-1) and (C-2) for the specific case of (3), and to have the error bound

‖θ̂ − θ*‖₂ ≤ c′ (√k + √|B|) √(log p / h)

with probability at least 1 − c₁ exp(−c₁′ h λ²), for some constant c′ depending on c, Σ and the fraction of genuine samples α in (C-h), and for some universal positive constants c₁ and c₁′.
Note that Corollary 1 concerns any single local minimum. For guarantees on multiple local optima simultaneously, we may combine the corollary with a union bound.
Remarks While the error bounds in Corollary 1 hold under the condition |B| < |G| ((C-h)), in order to have meaningful and consistent bounds, α in (C-h) should be properly lower bounded so that |B| log p / h converges to zero as n and p increase. When the fraction of corruptions is linearly proportional to the overall number of samples n, the parameters are estimated within some bounded error involving the logarithm of n. This suggests that our method is more resilient to massive outliers than other existing works such as [12], which incurs an additive error bound of up to √p under a linear fraction of corruptions. The same behavior appears in the ℓ₂ error bound of the extended Lasso analyzed in Nguyen and Tran [31], which has a similar term with asymptotically the same rate. Specifically, the extended Lasso estimator solves

min_{θ, e} (1/2n) ‖y − Xθ − e‖₂² + λ_θ ‖θ‖₁ + λ_e ‖e‖₁,

where λ_e is the regularization parameter for the parameter e capturing corruptions; e is encouraged to be sparse, reflecting the fact that only a fraction of samples is corrupted. Under the standard Gaussian design setting (LTS1), its ℓ₂-norm error rate is asymptotically the same as that in Corollary 1 (since 1/h ≤ 2/n under (C-h)).

From a theoretical standpoint, it is impossible to recover the true signal by trimming when the amount of corruption exceeds that of normal observation errors: an adversarial 'corruption' could be constructed to point to a wrong estimator and be a local minimum of the formulation. The required assumption for success is that the squared loss for corrupted samples be less than or equal to the squared loss for good samples: (maxᵢ δᵢ²)|B| ≤ C₂² h for some constant C₂. In practice, large deviations that are not adversarial (e.g. random) are easily detected by the approach, even when their magnitude far exceeds normal observation errors.
However, it is important to revisit the conditions required for the statistical guarantees of extended Lasso. Besides an extended version of the restricted eigenvalue condition, Nguyen and Tran [31] assume a mutual incoherence condition involving |||Σ|||₂, k, |B| and n, which must stay below a fixed threshold (1/16) for some large and fixed constant c. Provided that k and |B| are fixed, the inequality can hold for a large enough sample size n. However, when |B| grows with n, this condition will be violated: for example, if (i) a square-root fraction of samples is corrupted (|B| = α√n) for fixed k, or (ii) a linear fraction of n is corrupted (|B| = αn), then the term involving c|||Σ|||₂ can easily exceed 1/16. Our experimental results of Section 5 confirm this observation: as the fraction of corruptions increases, the performance of extended Lasso deteriorates compared to that of our estimator (3).
Statistical guarantees when covariates are corrupted In the linear model (1), corruption is considered in the space of the response variable yᵢ ∈ ℝ: namely, an additional random variable δᵢ ∈ ℝ is used to model corruption in the response space. Even in the case where we have outliers with corrupted covariates xᵢ + δ̃ᵢ ∈ ℝᵖ, δᵢ can be understood as the mean-shift variable modeling ⟨δ̃ᵢ, θ*⟩. For linear models, modeling outliers in the covariate space or modeling them in the output space is thus equivalent. (In contrast, for more general GLM settings the link function is not the identity and the two approaches are distinct; see e.g. [51].) Nevertheless, when outliers stem from corrupted covariates, condition (LTS1) might be violated. For this setting, we introduce an alternative condition (LTS5), which bounds the corrupted rows of the design matrix. Under condition (LTS5) we recover results similar to Corollary 1, with probability at least 1 − c₁ exp(−c₁′ h λ²) for some universal positive constants c₁ and c₁′.

Statistical guarantees of trimmed graphical Lasso
We now focus on Gaussian graphical models and provide the statistical guarantees of our Trimmed Graphical Lasso estimator as presented in Section 2 (Motivating example 2). Our theory in this section provides statistical error bounds for any local minimum of (5). We use ‖U‖_F and |||U|||₂ to denote the Frobenius and spectral norms, respectively. Let X = (X₁, X₂, ..., X_p) be a zero-mean Gaussian random field parameterized by the p × p concentration matrix Θ*:

P(X; Θ*) = exp{ −½ ⟨Θ*, XXᵀ⟩ − A(Θ*) },    (7)

where A(Θ*) is the log-partition function of the Gaussian random field. The probability density function in (7) is that of the p-variate Gaussian distribution N(0, Σ*), where Σ* = (Θ*)⁻¹.
We consider the case where the number of random variables p may be substantially larger than the sample size n; however, the concentration matrix of the underlying distribution is sparse, so that the number of non-zero off-diagonal entries of Θ* is at most k: |{(i, j) : Θ*ᵢⱼ ≠ 0 for i ≠ j}| ≤ k. We now investigate how easily we can satisfy the conditions in Theorem 1. Intuitively, it is impossible to recover the true parameter by a weighting approach as in (5) when the amount of corruption exceeds that of normal observation errors.
To this end, suppose that we have some upper bound on the corruptions, expressed in terms of the sub-design matrix X_B ∈ ℝ^{|B|×p} corresponding to the outliers (TGL1). Under this assumption, we can recover the following error bounds for the Trimmed Graphical Lasso (5), as a new corollary of Theorem 1:

Corollary 3. Consider corrupted Gaussian graphical models with conditions (C-h) and (TGL1). Suppose that we compute a local optimum (Θ̂, ŵ) of (5) choosing

λ = c √(log p / h).    (8)

Then (Θ̂, ŵ) is guaranteed to satisfy (C-1) and (C-2) for the specific case of (5) and to have the error bound

‖Θ̂ − Θ*‖_F ≤ c′ √(k + p) √(log p / h)

with probability at least 1 − c₂ exp(−c₂′ h λ²) for some universal positive constants c₂ and c₂′.
In Corollary 3, the term √(k + p) captures the relation between the element-wise ℓ₁ norm and the error norm ‖·‖_F, including diagonal entries.
If we further assume that the number of corrupted samples scales at most with the square root of the sample size, (TGL2): |B| = O(√n), then we can derive the following result as another corollary of Theorem 1:

Corollary 4. Suppose that the conditions (C-h), (TGL1) and (TGL2) hold, and that the sample size n is lower bounded as n ≥ c₀ (k + p) log p for a constant c₀. Then (Θ̂, ŵ) is guaranteed to satisfy (C-1) and (C-2) for the specific case of (5) and to have the following error bound:

‖Θ̂ − Θ*‖_F ≤ c √( (k + p) log p / n ),    (9)

with probability at least 1 − c₁ exp(−c₁′ h λ²) for some universal positive constants c₁ and c₁′.
Note that an ‖·‖₁,off-norm error bound can also be easily derived using the selection of λ from (8).

Remarks Corollary 4 reveals an interesting result: even when O(√n) samples out of the total n samples are corrupted, our estimator (5) can successfully recover the true parameter with the guaranteed error in (9). To the best of our knowledge, these are the first statistical error bounds available in the literature on parameter estimation for Gaussian graphical models with outliers.

Optimization for trimmed estimators
While the objective function f(w, θ) in (4) is non-convex in (w, θ), it simplifies when either block w or θ is held fixed. Perhaps for this reason, prior algorithms for trimmed approaches [39,1] alternated between solving for θ and for w. These approaches are inefficient, since each solve in θ is as expensive as finding the original (untrimmed) estimator. We take advantage of the computational complexity gap between the subproblems in θ and in w. With w fixed, the problem in θ is equivalent to classic high-dimensional problems, e.g. the Lasso. In contrast, the problem in w for fixed θ is a simple linear program, with all dependence on the predictors captured by the current losses L(θ; Zᵢ); the solution is obtained by setting wᵢ = 1 for the h smallest values of L(θ; Zᵢ) and setting the remaining wᵢ to 0. We exploit this structure using partial minimization. Similar ideas have been used for optimizing a range of nonlinear least squares problems [18], as well as more general problems involving nuisance parameters [2].

Rather than an alternating scheme (similar to that of [1] for least squares) where we solve multiple 'weighted' regularized problems to completion, we can rewrite the problem as follows:

L(θ) := min_{w ∈ Δ_h} (1/h) Σ_{i=1}^{n} wᵢ L(θ; Zᵢ),    (10)

min_{θ ∈ ρ𝔹_R} L(θ) + λ R(θ).    (11)

Problem (11) is equivalent to (4). The reader can verify that L(θ) is non-smooth and non-convex. However, partial minimization provides a way to modify any descent method for fitting an M-estimator so that it finds the corresponding trimmed estimator (11). Algorithm 1 gives a description of the steps involved for the specific case of proximal gradient descent. The algorithm uses the proximal mapping, which for the case of ℓ₁ regularization is the soft-thresholding operator defined as [S_ν(u)]ᵢ = sign(uᵢ) max(|uᵢ| − ν, 0). We assume that we pick ρ sufficiently large, so one does not need to enforce the constraint R(θ) ≤ ρ explicitly. When the loss L is convex and smooth with Lipschitz continuous gradient, the proximal gradient method has a global convergence theory (see e.g. [30]). Convergence of the extended Algorithm 1 is analyzed in the proposition below.

Algorithm 1 Partial Minimization using Proximal Gradient Descent for (11)
  Initialize θ⁽⁰⁾, t = 0
  repeat
    Compute w⁽ᵗ⁾ given θ⁽ᵗ⁾ as the global minimum of (10)
    Given w⁽ᵗ⁾, compute θ⁽ᵗ⁺¹⁾ = S_{λη⁽ᵗ⁾}(θ⁽ᵗ⁾ − η⁽ᵗ⁾ ∇_θ L(θ⁽ᵗ⁾, w⁽ᵗ⁾)), with η⁽ᵗ⁾ selected using line search; t ← t + 1
  until stopping criterion is satisfied
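A runnable sketch of Algorithm 1 specialized to sparse LTS (squared loss with an ℓ₁ penalty). The paper specifies the w-step (10), the proximal gradient step and a line search; the function names and backtracking constants below are our own.

```python
# Sketch of Algorithm 1 for sparse LTS: alternate a closed-form w-step with
# one backtracked proximal gradient step on the weighted least squares loss.
import numpy as np

def soft_threshold(u, nu):
    """Proximal mapping of nu * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - nu, 0.0)

def trimmed_lasso(X, y, h, lam, max_iter=500, tol=1e-8):
    n, p = X.shape
    theta = np.zeros(p)
    obj_prev = np.inf
    for _ in range(max_iter):
        # w-step (10): weight 1 on the h smallest squared residuals.
        r2 = (y - X @ theta) ** 2
        w = np.zeros(n)
        w[np.argsort(r2)[:h]] = 1.0
        obj = (w * r2).sum() / h + lam * np.abs(theta).sum()
        # theta-step: proximal gradient with backtracking (keeps monotonicity).
        grad = -(2.0 / h) * (X.T @ (w * (y - X @ theta)))
        eta = 1.0
        while True:
            theta_new = soft_threshold(theta - eta * grad, eta * lam)
            obj_new = ((w * (y - X @ theta_new) ** 2).sum() / h
                       + lam * np.abs(theta_new).sum())
            if obj_new <= obj or eta < 1e-12:
                break
            eta *= 0.5
        theta = theta_new
        if abs(obj_prev - obj_new) < tol:
            break
        obj_prev = obj_new
    return theta, w
```

Consistent with Proposition 1 below, once the support of w stops changing, the loop is exactly proximal gradient descent for a Lasso problem restricted to the selected samples.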

Proposition 1. Consider any monotonic algorithm A such that (i) the objective values f(w⁽ᵗ⁾, θ⁽ᵗ⁾) are non-increasing along the iterates, and (ii) for any fixed w ∈ Δ_h, A produces a converging sequence {θ⁽ᵗ⁾} when solving argmin_θ F(θ; w) := f(w, θ). If A is extended to solve (11) using partial minimization (10), the monotonic property is preserved, at least one limit point exists, and every limit point of the sequence {(θ⁽ᵗ⁾, w⁽ᵗ⁾)} is a stationary point of (4). Moreover, if F is convex, and the estimators over each feasible data selection have distinct optimal values, then w⁽ᵗ⁾ converges in finitely many steps, and the extended algorithm converges to a local minimum of (4).
Finite convergence of the weights w (t) is an important point for practical implementation, since once the weights converge, one is essentially solving a single estimation problem, rather than a sequence of such problems. In particular, after finitely many steps, the extended algorithm inherits all properties of the original algorithm A for the M-estimator over the selected data.

Simulated data experiments
We illustrate the generality of our approach by considering sparse logistic regression, trace-norm regularized multi-response regression and sparse GGMs. (For experiments with sparse linear models, see Alfons et al. [1].)

Simulations for sparse logistic regression
We begin with sparse logistic regression. We adopt an experimental protocol similar to [51]. We consider p = 200 features. The parameter vector has k = √p non-zero entries sampled i.i.d. from N(0, 1). The data matrix X is such that each of its n observations is sampled from a standard Normal distribution N(0, I_p). Given each observation, we draw a true class label from {0, 1} following the logistic regression model. We show two scenarios, selecting either √n or 0.1n samples with the highest amplitude of ⟨θ*, xᵢ⟩ and flipping their labels. We compare the ℓ₂ errors over 100 simulation runs of the new estimator with those of the vanilla Lasso for logistic regression, and with the two extended Lasso methods for logistic regression of [51] ("error in parameter" and "error in output") as the sample size n increases. The tuning parameters for each method are set via holdout, where we use a separate dataset of size n with the same type and amount of corruption. This tuning also applies to the trimming parameter h: specifically, we set a grid for the trimming parameter ranging from 5% to 45% of the sample size with an increment of 5%. Figure 1 shows that the trimmed approach has both better performance (achieves lower errors) and is faster, matching the computational efficiency of the vanilla Lasso method. This result is anticipated by Proposition 1: the weights w⁽ᵗ⁾ converge in finitely many steps, and thereafter we are essentially solving the Lasso with a fixed weight set.
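For reproducibility, a sketch of this corruption protocol (variable names are ours):

```python
# Corrupted logistic regression data: draw labels from the logistic model,
# then flip the labels of the samples with the largest |<theta*, x_i>|.
import numpy as np

def make_corrupted_logistic(n, p=200, frac_corrupt=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = int(np.sqrt(p))
    theta_star = np.zeros(p)
    theta_star[rng.choice(p, k, replace=False)] = rng.normal(size=k)
    X = rng.normal(size=(n, p))
    margins = X @ theta_star
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-margins))).astype(int)
    n_bad = int(frac_corrupt * n)                # or int(np.sqrt(n))
    bad = np.argsort(-np.abs(margins))[:n_bad]   # highest-amplitude samples
    y[bad] = 1 - y[bad]                          # flip their labels
    return X, y, theta_star, bad
```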

Simulations for trace-norm regularized regression
Beyond the ℓ₁ penalty, we consider trace-norm regularized multi-response regression. We set R(Θ) = ‖Θ‖₊, the trace norm, for Θ ∈ ℝ^{p×q}. We consider n = 50 samples, p = 300 covariates, and q = 10 responses. Each entry of X is generated independently from N(0, 1). To generate the true low-rank weights, we first sample a p × q matrix of coefficients, with each coefficient sampled independently from N(0, 1); we then set the true parameter matrix to the best rank-3 approximation of this sample, obtained using an SVD. For clean samples in G, we set the error term εᵢ ∼ N(0, 0.01). The contaminated samples receive an error term δᵢ ∼ N(2, 1). We consider varying corruption levels ranging from 5% to 30%. The parameters are tuned via holdout validation as in the previous section, and we report the average ℓ₂ error over 100 simulation runs. Figure 2 further illustrates the computational advantage of the partial minimization scheme described in Section 4 for general structures.
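A sketch of this data-generating process (names are ours):

```python
# Low-rank multi-response regression data: the true parameter is the best
# rank-3 approximation (via SVD) of a Gaussian matrix; a fraction of the
# responses receive mean-shifted noise.
import numpy as np

def make_multiresponse_data(n=50, p=300, q=10, rank=3, frac_corrupt=0.1, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(p, q))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Theta_star = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-3 approx
    X = rng.normal(size=(n, p))
    E = rng.normal(0.0, 0.1, size=(n, q))               # clean noise N(0, 0.01)
    n_bad = int(frac_corrupt * n)
    E[:n_bad] = rng.normal(2.0, 1.0, size=(n_bad, q))   # contaminated rows
    return X, X @ Theta_star + E, Theta_star
```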

Simulations for Gaussian graphical models
We compare the Trimmed Graphical Lasso (trim-glasso) algorithm against the vanilla Graphical Lasso (glasso) [17], the t-lasso and t*-lasso methods [16], and robust-LL, the robustified-likelihood approach of [41]. Our simulation setup is similar to [41] and is akin to gene regulatory networks. Namely, we consider four different scenarios (M1–M4) where the outliers are generated from models with different graphical structures. Specifically, each sample is drawn from a two-component mixture: with high probability from the target model N(0, Θ⁻¹), and otherwise from a contaminating Gaussian whose precision matrix Θ̃ encodes a different graphical structure. For each simulation run, Θ is a randomly generated precision matrix corresponding to a network with 9 hub nodes, simulated as follows. Let A be the adjacency matrix of the network. For all i < j we set Aᵢⱼ = 1 with probability 0.03, and zero otherwise; we set Aⱼᵢ = Aᵢⱼ. We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of A to one with probability 0.4 and zero otherwise. Using A, the non-zero coefficients of the precision matrix are sampled as follows: we first create a matrix E of signed coefficients supported on A, and then shift E by a multiple of the identity determined by Λ_min(E), the smallest eigenvalue of E, so that the resulting matrix Θ is positive definite. Θ̃ is a randomly generated precision matrix generated in the same way as Θ.
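A sketch of the hub-network generator (the exact coefficient range and the diagonal shift enforcing positive definiteness are not fully specified above, so the uniform range and the 0.1 offset below are our assumptions):

```python
# Hub-network precision matrix: sparse random adjacency, densified hub
# rows/columns, signed coefficients, and a diagonal shift by |lambda_min|+0.1
# (our assumed offset) to make the matrix positive definite.
import numpy as np

def hub_precision(p=100, n_hubs=9, p_edge=0.03, p_hub=0.4, seed=0):
    rng = np.random.default_rng(seed)
    A = (rng.random((p, p)) < p_edge).astype(float)
    A = np.triu(A, 1); A = A + A.T                      # symmetric adjacency
    for i in rng.choice(p, n_hubs, replace=False):      # densify hub rows/cols
        mask = rng.random(p) < p_hub
        A[i, mask] = A[mask, i] = 1.0
    np.fill_diagonal(A, 0.0)
    E = A * rng.uniform(0.5, 1.0, (p, p)) * rng.choice([-1, 1], (p, p))
    E = np.triu(E, 1); E = E + E.T                      # symmetric coefficients
    lam_min = np.linalg.eigvalsh(E).min()
    Theta = E + (abs(lam_min) + 0.1) * np.eye(p)        # positive definite
    return Theta, A
```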
For the robustness parameter β of the robust-LL method, we consider β ∈ {0.005, 0.01, 0.02, 0.03}, as recommended in [41]. For the trim-glasso method we consider 100h/n ∈ {90, 85, 80}. Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including I_p, (S + λI_p)⁻¹ and the estimate from glasso; we did not observe any noticeable impact on the results. Figure 3 presents the average ROC curves of the comparison methods over 100 simulated data sets for scenarios M1–M4, as the tuning parameter λ varies. In the figure, for the robust-LL and trim-glasso methods, we depict the best curves with respect to the parameters β and h, respectively. Detailed results for all the values of β and h considered are provided in the appendix.
From the ROC curves we can see that our proposed approach is competitive with the alternative robust approaches t-lasso, t*-lasso and robust-LL. The edge over glasso is even more pronounced for scenarios M2 and M4. Surprisingly, trim-glasso with h/n = 80% achieves superior sensitivity for nearly any specificity.
Computationally, the trim-glasso method is also competitive with the alternatives. The average run-time over the path of tuning parameters λ is 45.78s for t-lasso, 22.14s for t*-lasso, 11.06s for robust-LL, 1.58s for trim-glasso, and 1.04s for glasso. Experiments were run in R on a single computing node with an Intel Core i5 2.5GHz CPU and 8GB memory. For t-lasso, t*-lasso and robust-LL we used the R implementations provided by the methods' authors. For glasso we used the glassopath package.

Analysis of Yeast genotype and expression data
We apply Sparse-LTS, Extended Lasso [31], LAD Lasso [47], the standard Least Squares Lasso estimator [43], and ROMP [12] to the analysis of yeast genotype and gene expression data. We employ the "yeast" dataset from [9]. The data set concerns n = 112 F1 segregants from a yeast genetic cross between two strains, BY and RM. For each of these 112 samples, we observe p = 3244 SNPs (these genotype data are our predictors x) and focus on the expression of gene GPA1 (our response y), which is involved in pheromone response [9]. For both Sparse-LTS-Ada and Sparse-LTS, allowing a total of |B| = 11 contaminated observations led to the best predictive performance on the uncontaminated data. In addition, the QQ-plots of the fitted residuals from the various comparison methods indicated heavy left tails (see Figure 4), suggesting that it is advisable to use robust methods. We compare the trimmed mean squared error (T-MSE) computed from 10-fold cross-validation, where for each method we exclude the 11 observations with the largest absolute residual error. From Table 2 we can see that Sparse-LTS exhibits the smallest T-MSE.
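Concretely, the T-MSE can be computed with a small helper (the exclusion count of 11 matches the choice above):

```python
# Trimmed mean squared error: mean of the squared residuals after dropping
# the n_excl largest (in absolute value) residuals.
import numpy as np

def trimmed_mse(residuals, n_excl=11):
    r2 = np.sort(np.asarray(residuals) ** 2)
    return r2[:len(r2) - n_excl].mean()
```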
We conclude by examining the SNPs selected by the methods achieving the lowest T-MSE: Sparse-LTS and LAD Lasso. Out of p = 3244 SNPs, Sparse-LTS selected 30 SNPs and LAD Lasso chose 61. Table 3 provides a list of the SNPs selected on chromosome 8, which is where gene GPA1 resides; the dataset contains a total of 166 SNPs on chromosome 8. From the table we can see that there is some overlap in the selected SNPs across the various methods. Sparse-LTS tends to select a larger number of SNPs on chromosome 8 even though it selects fewer SNPs in total (within and beyond chromosome 8). Five of these are very close to GPA1, which is consistent with the fact that GPA1 can directly inhibit the mating signal by binding to its own subunit [40].

Application to the analysis of Yeast gene expression data
We analyze a yeast microarray dataset generated by [8]. The dataset concerns n = 112 yeast segregants (instances). We focus on p = 126 genes (variables) belonging to the cell-cycle pathway as provided by the KEGG database [21]. For each of these genes we standardize the gene expression data to zero mean and unit standard deviation. We observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers; for example, the histogram of gene ORC3 is presented in Figure 5(a). For the robust-LL method we set β = 0.05, and for trim-glasso we use h/n = 80%. We use 5-fold cross-validation to choose the tuning parameters for each method. After λ is chosen for each method, we rerun the methods on the full dataset to obtain the final precision matrix estimates. Figure 5(b) shows the cell-cycle pathway estimated by our proposed method. For comparison, the cell-cycle pathway from KEGG [21] is provided in Figure 6. It is important to note that the KEGG graph corresponds to what is currently known about the pathway; it should not be treated as the ground truth. Certain discrepancies between the KEGG and estimated graphs may also be caused by inherent limitations of the dataset used for modeling. For instance, some edges in the cell-cycle pathway may not be observable from gene expression data, and the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links.
glasso tends to estimate more links than the robust methods. We postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. Robust methods tend to estimate networks that are more consistent with the KEGG pathway (F₁-score of 0.23 for glasso, 0.37 for t*-lasso, 0.39 for robust-LL and 0.41 for trim-glasso, where the F₁-score is the harmonic mean of precision and recall). Our approach recovers several characteristics of the KEGG pathway. For instance, genes CDC6 (a key regulator of DNA replication, playing important roles in the activation and maintenance of the checkpoint mechanisms coordinating S phase and mitosis) and PDS1 (an essential gene for meiotic progression and mitotic cell cycle arrest) are identified as hub genes, while genes CLB3, BRN1 and YCG1 are unconnected to any other genes.
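The F₁-score above is computed on the off-diagonal support of the estimates against the KEGG reference edges; a sketch:

```python
# F1 of edge recovery: precision and recall of the estimated edge set
# (off-diagonal support of Theta_hat) against a reference adjacency A_ref.
import numpy as np

def edge_f1(Theta_hat, A_ref, tol=1e-8):
    off = ~np.eye(Theta_hat.shape[0], dtype=bool)
    est = (np.abs(Theta_hat) > tol) & off
    ref = A_ref.astype(bool) & off
    tp = (est & ref).sum()
    prec = tp / max(est.sum(), 1)
    rec = tp / max(ref.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)
```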

Concluding remarks
We presented a family of trimmed estimators for a wide class of structured high-dimensional problems. We provided general results on their statistical convergence rates and consistency. In particular, our results for sparse linear regression and Gaussian graphical models allow us to precisely characterize the impact of corruptions on the statistical performance of the resulting estimators, while recovering the rates of their 'untrimmed' counterparts under clean data. We showed how to efficiently adapt existing optimization algorithms to solve the modified trimmed problems. Relevant directions for future work include specializing our theoretical analysis to generalized linear models, applying and analyzing trimmed approaches for more general structural regularizations, and studying the concomitant selection of the amount of trimming.

Appendix A: Proof of Theorem 1
We use the shorthand Δ̂ := θ̂ − θ* and Γ̂ := ŵ − w* for the local optimal error vectors, where (θ̂, ŵ) is an arbitrary local minimum of the M-estimator (4). Our proof mainly uses the fact that (θ̂, ŵ) is a local minimum of (4), and hence satisfies the first-order stationarity condition

⟨∇_θ L(θ̂, ŵ) + λ ẑ, θ − θ̂⟩ ≥ 0  for all feasible θ, where ẑ ∈ ∂R(θ̂).

This inequality comes from the first-order stationary condition (see Loh and Wainwright [25] for details) in terms of θ only, fixing w at ŵ. In order to provide the complete proof of the theorem, we define the notation for the model space, perturbation space and corresponding projections following Negahban et al. [29]. The sparse LTS (3) is a typical example of (4), and such notation can be naturally defined based on the true support set S. In this proof, we specifically focus on the case R(·) := ‖·‖₁ for notational simplicity, but the statements extend seamlessly to a general regularizer R(·) and appropriately defined model/perturbation spaces. If we take θ = θ* above, we have

⟨∇_θ L(θ̂, ŵ), Δ̂⟩ ≤ λ ⟨ẑ, θ* − θ̂⟩ ≤⁽ⁱ⁾ λ (‖θ*‖₁ − ‖θ̂‖₁) ≤⁽ⁱⁱ⁾ λ (‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁),

where S is the true support set of θ*, and the inequalities (i) and (ii) hold respectively by the convexity and the triangle inequality of the ℓ₁ norm. Now, by the RSC condition in (C-1), the left-hand side is lower bounded by κ_l‖Δ̂‖₂² − τ₁(n, p)‖Δ̂‖₁², up to the cross terms controlled by (C-2); combining these inequalities with the choice of λ in (6) yields the stated bounds.

Appendix B: Proof of Corollary 1

In order to directly utilize Theorem 1 for linear models, we only need to show that (15) (for the condition (C-1)) and (16) (for (C-2)) hold. Throughout the proof, we use the fact that all elements of Γ̂ corresponding to G (the set of good samples) are zero: Γ̂_G = 0 by construction.
First, consider the condition (C-1) in (15). Recall that we constructed w* as follows: w*ᵢ is simply set to ŵᵢ if i ∈ G, and w*ᵢ = 0 for i ∈ B. Hence, by construction, Σ_{i∈G} w*ᵢ = Σ_{i∈G} ŵᵢ ≥ h − (n − h): since the weights sum to h and at most |B| = n − h of them can fall on corrupted samples, at least h − (n − h) good samples receive weight one. Let G̃ ⊆ G be the set of such samples.
The weighted quadratic form (1/h) Σ_{i∈G̃} ⟨xᵢ, Δ̂⟩² can be lower bounded as follows. Noting that all xᵢ with i ∈ G̃ are uncorrupted and i.i.d. sampled from N(0, Σ_G), we can appeal to the result in Raskutti et al. [34]:

‖X_{G̃} Δ̂‖₂ / √|G̃| ≥ κ₁ ‖Δ̂‖₂ − κ₂ √(log p / |G̃|) ‖Δ̂‖₁  for all Δ̂ ∈ ℝᵖ,

with probability at least 1 − c₁ exp(−c₂ |G̃|), where κ₁ and κ₂ are strictly positive constants depending only on Σ_G. Therefore (15) holds, with curvature κ_l bounded below by a constant multiple of κ₁² and tolerance τ₁(n, p) of the order of (log p)/h, since |G̃| ≥ h − (n − h) as discussed. Now, we consider the condition (C-2) in (16).
In bounding the relevant cross term, the first inequality comes from the model (1) and from the fact that Γ̂ᵢ is always non-negative: if i ∈ G, Γ̂ᵢ = 0, and if i ∈ B, Γ̂ᵢ := ŵᵢ − w*ᵢ ≥ 0 since ŵᵢ ≥ 0 and w*ᵢ = 0. Now, we follow a similar strategy to Nguyen and Tran [31]: given Δ̂, we divide the indices of Δ̂ into the disjoint exhaustive subsets S₁, S₂, ..., S_q of size |B|, such that S₁ contains the |B| largest absolute elements of Δ̂, and so on. Using the fact that Γ̂ᵢ = 0 if i ∈ G together with the Cauchy–Schwarz inequality applied blockwise, where X_{B,S_j} denotes the |B| × |S_j| sub-matrix of X_B ∈ ℝ^{|B|×p} corresponding only to the indices in S_j, and combining all pieces together, we can guarantee (16) with tolerance functions τ₂(n, p) of the order of √(|B| log p)/h and τ₃(n, p) of the order of √(log p / h). To complete the proof, we need to control the quantity ‖(1/h) Σᵢ w*ᵢ εᵢ xᵢ‖∞ for the appropriate choice of λ as stated in Theorem 1. By the sub-Gaussian property of the noise vector in (LTS2), for any fixed vector v with ‖v‖₂ = 1, P(|⟨v, ε⟩| > t) ≤ 2 exp(−t²/(2σ²)). Given the vector xᵢ, let xᵢʲ denote its j-th element. Using the column normalization condition (LTS3) with 0 ≤ w*ᵢ ≤ 1, we have, for all j = 1, ..., p,

P( |(1/h) Σ_{i∈G} w*ᵢ εᵢ xᵢʲ| > t ) ≤ 2 exp(−h t² / (2σ²))  for all t > 0,

and consequently, by the union bound over the p columns and setting t² = 4σ² log p / h, we obtain

‖(1/h) Σ_{i∈G} w*ᵢ εᵢ xᵢ‖∞ ≤ 2σ √(log p / h)

with probability at least 1 − c exp(−c′ h λ²). Now we have all the pieces to utilize Theorem 1. The assumption on choosing λ in the statement is satisfied with λ = c √(log p / h), where c depends on σ and on C₂, a constant satisfying C₂² ≥ (maxᵢ δᵢ²)|B|/h; here we use the condition (LTS4). Finally, the RSC constant in (18) can be simply lower bounded using the assumption (C-h) (the fraction of genuine samples is at least α); hence we obtain the bounds as stated.
Appendix C: Results for trimmed graphical Lasso

C.1. Useful lemma(s)
Lemma 1 (Lemma 1 of Ravikumar et al. [35]). Suppose that {X⁽ⁱ⁾}ᵢ₌₁ⁿ are i.i.d. samples from N(0, Σ) with n ≥ 40 maxᵢ Σᵢᵢ. Let A be the event that

‖Σ̂ − Σ‖∞ ≤ 8 (maxᵢ Σᵢᵢ) √(10 τ log p / n),

where Σ̂ is the sample covariance and τ is any constant greater than 2. Then the probability of the event A occurring is at least 1 − 4/p^{τ−2}.

C.2. Proof of Corollary 3
Although Theorem 1 can be seamlessly applied to the Trimmed Graphical Lasso as well, we need to restrict our attention to the case ‖Δ̂‖_F ≤ 1 in order to guarantee the (vanilla) restricted strong convexity in Lemma 2 (this is the standard technique even in the case without outliers, as developed in Loh and Wainwright [26]). Toward this, we first show that ‖Δ̂‖_F ≤ 1 actually holds under the stated conditions:

Lemma 3. Suppose the conditions of Corollary 3 hold. Then, for (Θ̂, ŵ), ‖Δ̂‖_F ≤ 1.
Proof. Lemma 3 can be proved using the fact that −log det Θ is a convex function. Hence, the function f : [0, 1] → ℝ given by f(t; Θ*, Δ) := −log det(Θ* + tΔ) is also convex in t, and −⟨(Θ* + Δ)⁻¹, Δ⟩ ≥ −⟨(Θ* + tΔ)⁻¹, Δ⟩ for t ∈ [0, 1] (see Loh and Wainwright [26] for details).

It remains to verify the condition (C-2) with tolerance functions τ₂(n, p) and τ₃(n, p). Toward this, we follow a similar strategy to Nguyen and Tran [31]: given Δ̂, we divide the indices of Δ̂ into the disjoint exhaustive subsets S₁, S₂, ..., S_q of size |B|, such that S₁ contains the |B| largest absolute elements of Δ̂, and so on. Let D_Γ be the |B| × |B| diagonal matrix whose i-th diagonal entry is [D_Γ]ᵢᵢ := Γ̂ᵢ, let X_B be the |B| × p design matrix for the samples in the set B, and let X_{B,S_j} denote the |B| × |S_j| sub-matrix of X_B whose columns are indexed by S_j. Applying the Cauchy–Schwarz inequality blockwise over Σ_{j=1}^{q} Σ_{i∈B} and combining all pieces together, we can guarantee the condition (C-2) with functions τ₂(n, p) = f(X_B) √(|B| log p) / h and τ₃(n, p) = f(X_B) √(log p / h), where f(X_B) is a factor depending on the magnitude of the corrupted sub-design X_B (cf. (TGL1)).