Robust Mean Estimation in High Dimensions: An Outlier-Fraction Agnostic and Efficient Algorithm

The problem of robust mean estimation in high dimensions is studied, in which a certain fraction (less than half) of the datapoints can be arbitrarily corrupted. Motivated by compressive sensing, the robust mean estimation problem is formulated as the minimization of the ℓ_0-'norm' of an outlier indicator vector, under a second moment constraint on the datapoints. The ℓ_0-'norm' is then relaxed to the ℓ_p-norm (0 < p ≤ 1) in the objective, and it is shown that the global minima for each of these objectives are order-optimal and have the optimal breakdown point for the robust mean estimation problem. Furthermore, a computationally tractable iterative ℓ_p-minimization and hard thresholding algorithm is proposed that outputs an order-optimal robust estimate of the population mean. The proposed algorithm (with breakdown point ≈ 0.3) does not require prior knowledge of the fraction of outliers, in contrast with most existing algorithms, and for p = 1 it has near-linear time complexity. Both synthetic and real data experiments demonstrate that the proposed algorithm outperforms state-of-the-art robust mean estimation methods.

the high dimensional regime. A notable exception is Tukey's Median [9], which has an error bound that is independent of the dimension when the fraction of outliers is less than a threshold [10], [11]. However, the computational complexity of Tukey's Median algorithm is exponential in the dimension. A number of recent papers have proposed polynomial-time algorithms that have dimension-independent error bounds under certain distributional assumptions (e.g., bounded covariance or concentration properties). For a recent comprehensive survey on robust mean estimation, we refer the interested reader to [12]. One of the first such algorithms is Iterative Filtering [13], [14], [15], in which one finds the top eigenvector of the sample covariance matrix, removes (or down-weights) the points with large projection scores on that eigenvector, and then repeats this procedure on the rest of the points until the top eigenvalue is small. However, as discussed in [4], the drawback of this approach is that it only looks at one direction/eigenvector at a time, and the outliers may not exhibit unusual bias in only one direction or lie in a single cluster. Figure 1 illustrates an example for which Iterative Filtering might have poor empirical performance. In this figure, the inlier datapoints in blue are randomly generated from the standard Gaussian distribution in (high) dimension d, and therefore their ℓ_2-distances to the origin are roughly √d. The Iterative Filtering algorithm has similarities to the greedy Matching Pursuit algorithm from compressive sensing [17]: in the latter, one finds the single column of the sensing matrix A that has the largest correlation with the measurements b, removes that column and its contribution from b, and repeats this procedure on the remaining columns of A. Dong et al.
[4] proposed a new scoring criterion for finding outliers, in which one looks at multiple directions associated with large eigenvalues of the sample covariance matrix in every iteration of the algorithm. Interestingly, this multi-directional approach is conceptually similar to iterative thresholding techniques in compressive sensing (e.g., Iterative Hard Thresholding [18] or Hard Thresholding Pursuit [19]), in which one simultaneously finds multiple columns of the matrix A that are more likely to contribute to b. Although iterative thresholding techniques are also greedy, they are more accurate than the Matching Pursuit technique in practice [20], [21].
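To make the filtering idea above concrete, the sketch below runs a stripped-down version of a single-direction filter on synthetic data: it repeatedly projects onto the top eigenvector of the sample covariance and drops the points with the largest projection scores until the top eigenvalue is small. The fixed 5% removal fraction, the stopping threshold, and the function name are our own illustrative choices, not the published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_out = 50, 1000, 100
Y = rng.standard_normal((n, d))                      # inliers ~ N(0, I), true mean 0
centre = np.zeros(d); centre[0] = np.sqrt(d)         # outlier cluster centre, ||.||_2 = sqrt(d)
Y[:n_out] = centre + 0.1 * rng.standard_normal((n_out, d))

def filter_round(Y, remove_frac=0.05):
    """One simplified round of Iterative Filtering: project onto the top
    eigenvector of the sample covariance and drop the points with the most
    extreme projection scores.  (Illustrative sketch only; the published
    algorithms use data-dependent thresholds or down-weighting.)"""
    mu = Y.mean(axis=0)
    cov = (Y - mu).T @ (Y - mu) / len(Y)
    w, V = np.linalg.eigh(cov)                       # eigenvalues in ascending order
    scores = ((Y - mu) @ V[:, -1]) ** 2              # squared projections on top direction
    keep = scores <= np.quantile(scores, 1 - remove_frac)
    return Y[keep], w[-1]

for _ in range(20):
    Y, top_eig = filter_round(Y)
    if top_eig < 1.7:        # stop once the top eigenvalue is small (~(1+sqrt(d/n))^2 + slack)
        break
print(np.linalg.norm(Y.mean(axis=0)))                # mean of surviving points is near 0
```

On this example the filter removes the planted cluster in a few rounds; as the text notes, however, such single-direction filtering can struggle when the outliers are not biased along one direction or do not lie in a single cluster.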
A common assumption in the robust mean estimation problem is that the fraction of corrupted datapoints is small. In this paper, we explicitly use this information through the introduction of an outlier indicator vector whose ℓ_0-'norm' we minimize under a second moment constraint on the datapoints. This is partially motivated by compressive sensing and shares the principle of 'fitting the majority of the data' that is common in robust statistics. This new formulation not only enables us to leverage advanced compressive sensing techniques to solve the robust mean estimation problem, but also allows us to design algorithms that do not require prior knowledge of the fraction of outliers. There are some works in sparse recovery (see, e.g., [22], [23]) in which ℓ_0/ℓ_p minimization is used to remove outliers in data. In these works, a linear model y = Ax + e is considered, wherein y denotes the measurements, the matrix A is known, and the unknown sparse vector e models the potential outlier corruption on each datapoint. Consequently, the analyses in these works on sparse recovery rely heavily on the assumption that the underlying model is linear (e.g., some works exploit the range-space/null-space properties of the matrix A). In robust mean estimation, on the other hand, a general (not necessarily linear) observation model is considered. In light of this, the analyses in the works on sparse recovery cannot be transferred in an obvious way to the robust mean estimation problem.
We consider the setting in which the distribution of the datapoints before corruption has bounded covariance, as is commonly assumed in many recent works (e.g., [4], [14], [24], [25]). In particular, in [24], the authors propose to minimize the spectral norm of the weighted sample covariance matrix and use the knowledge of the outlier fraction ϵ to constrain the weights. Along this line, two very recent works [26], [27] show that any approximate stationary point of the objective in [24] gives a near-optimal solution. In contrast, our objective is designed to minimize the sparsity of an outlier indicator vector, and we show that any sparse enough solution is nearly optimal.
Contributions:
• At a fundamental level, a contribution of this paper is the formulation of the robust mean estimation problem as minimizing the ℓ_0-'norm' of the proposed outlier indicator vector, under a second moment constraint on the datapoints. In addition, order-optimal estimation error guarantees and the optimal breakdown point (ϵ < 1/2) are shown for this objective. We relax the ℓ_0 objective to ℓ_p (0 < p ≤ 1) as in compressive sensing, and establish corresponding order-optimal estimation error guarantees. The guarantees are order-optimal with respect to the number of datapoints (n), the dimension of the data (d), and the fraction of corrupted datapoints (ϵ). Henceforth we use the term 'order-optimal' in this sense.
• Motivated by the proposed ℓ_0 and ℓ_p objectives and their theoretical justifications, we propose a computationally tractable iterative ℓ_p (0 < p ≤ 1) minimization and hard thresholding algorithm, and establish the order optimality of the algorithm. Empirical studies show that the proposed algorithm significantly outperforms state-of-the-art methods in robust mean estimation.
• The proposed algorithm (with maximal breakdown point of 1 − 1/√2) does not require knowledge of the fraction of outliers (in contrast to most existing algorithms).
For p = 1, the algorithm has near-linear time complexity.

II. PROPOSED OPTIMIZATION PROBLEMS
We begin by defining what we mean by a corrupted sample of datapoints.
Definition 1: (ϵ-corrupted sample [4]) Let P be a distribution on R^d with unknown mean µ, and let ỹ_1, …, ỹ_n be drawn independent and identically distributed (i.i.d.) from P. These datapoints are then modified by an adversary who can inspect all the datapoints, remove ϵn of them, and replace them with arbitrary vectors in R^d. We then obtain an ϵ-corrupted sample, denoted y_1, …, y_n.
Throughout the rest of the paper, we adhere to the notation given above: we represent a datapoint before corruption as ỹ_i, and after corruption as y_i. Given a set of datapoints {x_i, i = 1, …, n}, we term the following the sample covariance matrix around z: (1/n) Σ_{i=1}^n (x_i − z)(x_i − z)^⊤. There are other types of contamination one can consider, e.g., Huber's ϵ-contamination model [28]. The contamination model described in Definition 1 is the strongest in the sense that the adversary is not oblivious to the original datapoints, and can replace any subset of ϵn datapoints with any vectors in R^d. We refer the reader to [12] for a more detailed discussion of contamination models.
Our primary goal is to robustly estimate the true population mean, given an ϵ-corrupted sample. We assume that the underlying distribution has bounded second moment. A key insight exploited in previous work on the problem is that if the outliers in an ϵ-corrupted sample (of large size) shift the average of the datapoints before corruption by Ω(ξ) in a direction ν, then the variance of the projected sample along ν increases by Ω(ξ^2/ϵ). Thus, intuitively, it suffices to find a large subset of the ϵ-corrupted sample whose sample covariance matrix is close to the covariance matrix of the underlying distribution. In order for such a subset to exist, and for the mean of this large subset to be close to the true mean, we need some form of concentration of the datapoints (before corruption) around the mean of their distribution. A constrained second moment condition is sufficient to guarantee this, and such an assumption is also used in previous works. In the following, we provide a brief high-level explanation (details can be found in the Appendix). Suppose we are given a sufficiently large sample of datapoints of size n, generated from a distribution with mean µ and spectral norm of the covariance matrix bounded by σ^2. Then, with high probability, there exists a large subset of the sample with the spectral norm of its sample covariance matrix around µ bounded by O(σ^2). Hence, after corruption, with high probability there still exists a sufficiently large subset, say G*, of the resulting ϵ-corrupted sample, of size (1 − ϵ′)n (where ϵ′ → ϵ as n → ∞), such that the spectral norm of its sample covariance matrix around µ is bounded by O(σ^2). Utilizing this, the concentration of the sample before corruption around µ, and a fundamental result [27, Lemma C.2] about the closeness of the population mean and a conditional mean, it can be shown that the distance between µ and the sample average of G* is O(σ√(ϵ′)).
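The stated variance-inflation phenomenon is easy to verify numerically. The following one-dimensional sketch (all parameter choices are illustrative) plants an ϵ-fraction of outliers so as to shift the mean by roughly ξ, and observes a variance increase of roughly ξ^2/ϵ along that direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, xi = 100_000, 0.05, 0.5
x = rng.standard_normal(n)   # inliers projected on a direction nu: mean 0, variance 1
k = int(eps * n)
x[:k] = xi / eps             # outliers placed so the overall mean shifts by ~xi
print(x.mean())              # close to xi = 0.5
print(x.var())               # close to (1-eps) + xi**2/eps - xi**2, inflated by ~xi^2/eps
```

Here the exact population variance of the contaminated sample is (1 − ϵ)·1 + ξ^2/ϵ − ξ^2 ≈ 5.7, compared with 1 before corruption, illustrating the Ω(ξ^2/ϵ) inflation.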
Based on this motivation, we propose an ℓ_0-minimization problem to find the largest subset whose sample covariance matrix exhibits bounded spectral norm. We first introduce an outlier indicator vector h: for the i-th datapoint, h_i indicates whether it is an outlier (h_i = 1) or not (h_i = 0). Given an ϵ-corrupted sample of size n, we propose the following optimization problem, for which the solution in x should yield a robust estimate of the mean:

min_{h ∈ {0,1}^n, x ∈ R^d} ∥h∥_0 subject to ∥Σ_{i=1}^n (1 − h_i)(y_i − x)(y_i − x)^⊤∥_2 ≤ c_1^2 σ^2 n, (2)

where c_1 is a constant that controls the inflation of the constraint with respect to the bound (σ^2) on the spectral norm of the covariance matrix of the underlying distribution. We further relax the problem to the following, in which the entries of h may take fractional values:

min_{h ∈ [0,1]^n, x ∈ R^d} ∥h∥_0 subject to ∥Σ_{i=1}^n (1 − h_i)(y_i − x)(y_i − x)^⊤∥_2 ≤ c_1^2 σ^2 n. (3)

Note that any globally optimal solution of (3) yields a globally optimal solution of (2). To see this, let h̃ be a global optimum of (3). Let h′ be the vector obtained after setting the non-zero values of h̃ to 1. Note that h′ has the same ℓ_0-'norm' as h̃, and is also a feasible point of (2). Since the constraint set of (3) is larger than that of (2), the optimum value of (2) must be greater than or equal to the optimum value of (3). This implies that h′ is a global optimum of (2). Hence, the claim holds. We show in Theorem 1 that any sparse enough feasible pair, including the global optimum of (3), achieves order-optimality in terms of the error in estimating the mean. However, minimizing the above ℓ_0 objective is not computationally tractable. Motivated by compressive sensing, we further propose to relax the ℓ_0-'norm' to the ℓ_p-norm (0 < p ≤ 1), which leads to the following optimization problem:

min_{h ∈ [0,1]^n, x ∈ R^d} ∥h∥_p^p subject to ∥Σ_{i=1}^n (1 − h_i)(y_i − x)(y_i − x)^⊤∥_2 ≤ c_1^2 σ^2 n. (4)

We show in Theorem 2 that even in this case any 'good' feasible pair, including the global optimum, is order-optimal in terms of the error in estimating the mean.
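Since the display equations in this copy are garbled, the following sketch spells out our reading of the constraint shared by the formulations above: the spectral norm of the weighted scatter matrix Σ_i (1 − h_i)(y_i − x)(y_i − x)^⊤ must not exceed c_1^2 σ^2 n. The helper name and the toy data are illustrative only.

```python
import numpy as np

def feasible(h, x, Y, c1, sigma, p=1.0):
    """Check the spectral-norm constraint of the proposed problems and
    return the objective ||h||_p^p.  (Our transcription of the constraint;
    the exact display equations are garbled in this copy.)"""
    R = Y - x
    M = (R * (1.0 - h)[:, None]).T @ R        # sum_i (1-h_i)(y_i-x)(y_i-x)^T
    spec = np.linalg.eigvalsh(M)[-1]          # spectral norm (M is PSD for h in [0,1]^n)
    return spec <= c1**2 * sigma**2 * len(Y), np.sum(h**p)

rng = np.random.default_rng(2)
n, d, eps = 500, 20, 0.1
Y = rng.standard_normal((n, d))
Y[: int(eps * n)] += 10.0                     # corrupt a small fraction of the points
h_good = np.zeros(n); h_good[: int(eps * n)] = 1.0   # flag exactly the outliers
ok, obj = feasible(h_good, Y[int(eps * n):].mean(axis=0), Y, c1=1.5, sigma=1.0)
print(ok, obj)   # True 50.0: the inliers alone satisfy the bound; objective = #flagged
```

Flagging exactly the corrupted points gives a feasible pair whose objective equals the number of outliers, which is what the ℓ_0 objective seeks to minimize.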
In the approaches taken in prior works (see, e.g., [27]), the robust mean estimation problem is cast as the following feasibility problem: find (h, x), h ∈ [0, 1]^n, x ∈ R^d, such that Σ_i h_i ≤ ϵn and ∥Σ_{i=1}^n (1 − h_i)(y_i − x)(y_i − x)^⊤∥_2 ≤ c_1^2 σ^2 n. (5) Most works (see, e.g., [4], [24]) consider the following problem (or a variant of it), which is obtained by changing the feasibility problem into an optimization problem: min_h ∥Σ_{i=1}^n (1 − h_i)(y_i − x)(y_i − x)^⊤∥_2 subject to h ∈ [0, 1]^n, Σ_i h_i ≤ ϵn, (7) where x is either fixed or is the weighted average of the y_i's with weights 1 − h_i. Landscape results related to the optimization problem (7) were obtained in [24] and [27]. Our formulation (4), for the special case of p = 1, corresponds to minimizing the feasibility condition related to the sum of "weights" in (5). We provide landscape results for the optimization problem given in (4) (Theorems 1 and 2). An advantage of our formulation, which we will exploit in Algorithm 1, is that it does not require knowledge of the fraction of outliers ϵ. We now provide theoretical guarantees for the estimator given by the solution of the optimization problem (3). We show that, given an ϵ-corrupted sample of sufficiently large size, with high probability the ℓ_2-norm of the estimator's error is O(σ√(ϵ + α)), where α → 0 as n → ∞. We formalize this in the following theorem. It is well known that an information-theoretic lower bound on the ℓ_2-norm of any estimator's error ∥x − µ∥_2 is Ω(σ√(ϵ/(1 − 2ϵ))) (see [27]). Thus, the estimator is order-optimal in terms of the error as α → 0 and n → ∞.
Theorem 1: Let P be a distribution on R^d with unknown mean µ and unknown covariance matrix Σ ⪯ σ^2 I. Let δ ∈ (0, 1/4) and c_1 > 1 be fixed. Let c′_1 = min{c_1^2 log c_1^2 + 1 − c_1^2, 1}, n > (2e/(c′_1 δ^2)) d log(d/δ), and α = e d log(d/δ)/(c′_1 δ^2 n). Let ϵ ∈ (0, 1/2 − α) and ϵ′ = ϵ + α. Given an ϵ-fraction corrupted set of n datapoints from P, let S be the set of pairs (ĥ, x̂) such that ĥ is feasible for (3) with ∥ĥ∥_0 ≤ ϵ′n and x̂ = (Σ_{i: ĥ_i = 0} y_i)/|{i : ĥ_i = 0}|. Then the following holds with probability at least 1 − 4δ: 1) Any feasible pair (ĥ, x̂) for the optimization problem (3) such that (ĥ, x̂) ∈ S satisfies ∥x̂ − µ∥_2 = O(σ√(ϵ′)). 2) A global optimum (h_opt, x_opt) of (3) lies in S with ∥h_opt∥_0 ≤ ϵ′n. The proof is deferred to the Appendix. A high-level sketch of the proof of Theorem 1 is as follows. We use the idea in [27, Lemma 2.2] stated in Lemma 2. Informally, if two probability distributions on a set of datapoints are close in total variation distance, then the weighted means of the distributions are close. Consider the uniform distribution on the set {y_i : ĥ_i = 0} (say P_1). Note that the estimator x̂ in Theorem 1 is the mean of P_1. We show that the total variation distance between P_1 and the uniform distribution (say P′) on the set of inlier datapoints (those within a distance of σ√(d/(αδ)) from µ) is small. Therefore one can show that the distance between x̂ and the mean of P′ is small. Using Lemma 2, we show that the distance between the mean of P′ and µ is O(σ√(ϵ′)). Using the triangle inequality, it follows that the distance between x̂ and µ is O(σ√(ϵ′)). Remark 1: Theorem 1 shows that, as long as we find a feasible point ĥ that is sparse enough, i.e., ∥ĥ∥_0 ≤ (ϵ + α)n, the average of the estimated inliers, (Σ_{i: ĥ_i = 0} y_i)/|{i : ĥ_i = 0}|, is close to the true mean in the optimal sense. It is not necessary to reach the global optimum of the objective (3).
We now provide a similar order-optimal error guarantee for the solution of the optimization problem in (4).
Theorem 2: Let P be a distribution on R^d with unknown mean µ and unknown covariance matrix Σ ⪯ σ^2 I. Let δ, c_1, c′_1, n, α, ϵ and ϵ′ be as in Theorem 1, and let p ∈ (0, 1] be fixed. Given an ϵ-fraction corrupted set of n datapoints from P, let S′ be the set of pairs (ĥ, x̂) such that ĥ is feasible for (4) with ∥ĥ∥_p^p ≤ ϵ′n and x̂ = (Σ_i (1 − ĥ_i) y_i)/(Σ_i (1 − ĥ_i)). (11) Then the following holds with probability at least 1 − 4δ: 1) Any feasible pair (ĥ, x̂) of (4) such that (ĥ, x̂) ∈ S′ satisfies ∥x̂ − µ∥_2 = O(σ√(ϵ′)). (12) 2) A global optimum (h_opt, x_opt) of (4) lies in S′ with ∥h_opt∥_p^p ≤ ϵ′n. The proof is deferred to the Appendix. The high-level idea is similar to that of the proof of Theorem 1. We consider the distribution on the corrupted samples with (normalized) probability weights 1 − ĥ_i (say P_2). Note that the estimator x̂ in Theorem 2 is the mean of P_2. We show that the total variation distance between P_2 and the uniform distribution (say P′) on the set of inlier datapoints (those within a distance of σ√(d/(αδ)) from µ) is small. Therefore one can show that the distance between x̂ and the mean of P′ is small. Using Lemma 2, we show that the distance between the mean of P′ and µ is O(σ√(ϵ′)). Using the triangle inequality, it follows that the distance between x̂ and µ is O(σ√(ϵ′)). Remark 2: The breakdown point of the estimators in Theorems 1 and 2 is nearly the maximal possible 1/2 (as α → 0 and n → ∞); that is, the estimator can tolerate any corruption level ϵ < 1/2, assuming that the number of samples n satisfies the lower bound.
Remark 3: From Lemma 6 in the Appendix, we know that given any feasible pair of (4) with ∥ĥ∥_p ≤ (ϵ′n)^{1/p}, the pair (ĥ, x_ĥ) with x_ĥ = (Σ_i (1 − ĥ_i) y_i)/(Σ_i (1 − ĥ_i)) is also a feasible pair, and therefore it lies in the set S′ defined in (11). Theorem 2 further shows that this weighted average of the datapoints is close to the true mean. Again, we note that it is not necessary to reach the global optimum of the objective (4); we only need to find a feasible point h of (4) whose ℓ_p-norm is small enough.

III. ALGORITHM A. ℓ p Minimization and Thresholding
Motivated by the ℓ_p objective and its theoretical guarantee, we propose an iterative ℓ_p minimization algorithm. The algorithm, detailed in Algorithm 1, alternates between updating the outlier indicator vector h via minimizing its ℓ_p-norm and updating the estimated mean x. To describe Algorithm 1, let H denote the constraint set for h used in Step 1. When updating the estimated mean x in Step 2 of Algorithm 1, we add an option to threshold the h_i by τ.

Algorithm 1 Robust Mean Estimation via ℓ p Minimization and Thresholding
Inputs: 1) An ϵ-corrupted set of datapoints {y_i}_{i=1}^n ⊂ R^d generated by a distribution whose covariance matrix satisfies Σ ⪯ σ^2 I. 2) An upper bound ε̄ on the corruption level. 3) An upper bound σ^2 on the spectral norm of Σ.
Initialize: x^(0) = coordinate-wise median of {y_i}; c_2^(0) = 3√d + 2c_1.
Step 1: Given x^(t), update h^(t) by minimizing ∥h∥_p^p over h ∈ H, subject to the spectral-norm constraint on the weighted covariance matrix around x^(t).
Step 2: Given h^(t), update x^(t+1) as the (optionally thresholded) weighted average of the datapoints, where γ and β are defined in (16) and (17).
One can thus use the weighted average of the estimated 'reliable' datapoints (i.e., those for which h_i ≈ 0) to estimate x. This is motivated by the analysis of the original ℓ_0 objective in Theorem 1, where the average of the estimated 'reliable' datapoints, (Σ_{i: ĥ_i = 0} y_i)/|{i : ĥ_i = 0}|, is close to the true mean as long as the outlier indicator vector ĥ is sparse enough. The breakdown point of Algorithm 1 depends on the threshold τ and is given by f(τ) (see (15)). The maximal breakdown point corresponds to no thresholding, i.e., f(1) = 1 − 1/√2. Algorithm 1 requires an upper bound ε̄ on the true fraction of outliers. This upper bound can be set arbitrarily close to (but less than) the breakdown point.
With this intuitive updating rule in Step 2, Algorithm 1 has the following order-optimal guarantee.
Theorem 3: Let P be a distribution on R^d with unknown mean µ and unknown covariance matrix Σ ⪯ σ^2 I. Let δ ∈ (0, 1/5), c_1 > 1 and p ∈ (0, 1] be fixed, and let c′_1, n, α, ϵ and ϵ′ be as in Theorem 1. Given an ϵ-fraction corrupted set of n datapoints from P, with probability at least 1 − 5δ, all the iterates of Algorithm 1 (for t ≥ 1) satisfy ∥x^(t) − µ∥_2 ≤ γ(ε̄)^t c_2^(0) σ + (β(ε̄)/(1 − γ(ε̄))) σ, where c_2^(0) is given in Algorithm 1, and γ(ε̄) ∈ (0, 1) and β(ε̄) are defined in (16) and (17). The output of Algorithm 1 at the end of T = O(log d / log(1/γ(ε̄))) iterations is order-optimal: ∥x^(T) − µ∥_2 = O(σ√(ϵ′)). The proof is deferred to the Appendix, but we briefly discuss the design of the algorithm and the high-level approach. Let x* be the average of the set of inlier datapoints that are within a distance of σ√(d/(αδ)) from µ. We use induction to show that ∥x^(t) − µ∥_2 ≤ c_2^(t) σ, where c_2^(t+1) = γ(ε̄) c_2^(t) + β(ε̄). We show in the Appendix that the coordinate-wise median satisfies ∥x^(0) − µ∥_2 ≤ c_2^(0) σ with high probability. Firstly, observe that in Step 1 of Algorithm 1, the constraint on the spectral norm of the weighted covariance matrix around x^(t) is (c_1 + c_2^(t))^2 σ^2 n instead of c_1^2 σ^2 n as in (4). This ensures, with high probability, that the optimization problem in Step 1 has a feasible point, and that the optimum solution satisfies ∥h^(t)∥_p ≤ (ϵ′n)^{1/p}. Secondly, we exploit the boundedness of ∥h^(t)∥_p and the fact that the spectral norm of the weighted covariance matrix around x^(t) is bounded (similar to the idea used in Theorem 2), along with some concentration bounds, to show that in each iteration the iterate x^(t+1) in Step 2 moves closer to µ than x^(t). From the proof we can see that it is not necessary to reach the global optimum in Step 1; we only need to find a feasible point whose ℓ_p-norm is small enough.
Remark 4: The results of Theorems 1, 2 and 3 can be easily extended to establish the estimators' closeness to the average of the datapoints before corruption, µ̃ = (1/n) Σ_{i=1}^n ỹ_i, using the fact that µ̃ is close to µ, which is shown in the Appendix (see (36)). We obtain extensions of the above theorems, with the same probability guarantees, in which µ is replaced by µ̃. Moreover, it can also be shown that the estimators are close to the average of the inliers that are at most a distance of σ√(d/(αδ)) = σ√(nδc′_1/(e log(d/δ))) from µ. Remark 5: The initialization c_2^(0) = 3√d + 2c_1 can be replaced by a smaller value as long as it is possible to guarantee ∥x^(0) − µ∥_2 ≤ c_2^(0) σ with high probability. An important aspect of the proposed algorithm is that it does not require the true fraction of outliers ϵ and is still order-optimal. To the best of our knowledge, no other algorithm for our corruption model has this property.
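To illustrate the alternation in Algorithm 1, here is a heavily simplified sketch for p = 1. The paper solves Step 1 as a packing SDP; in this sketch we substitute a greedy heuristic (repeatedly flag the points with the largest projection scores on the current top eigenvector until the spectral-norm constraint holds), so it should be read as a stand-in for the real Step 1, not as the algorithm itself. Step 2 and the coordinate-wise-median initialization follow the text.

```python
import numpy as np

def robust_mean_sketch(Y, sigma=1.0, c1=1.5, tau=0.6, iters=5):
    """Simplified sketch of Algorithm 1 (p = 1).  Step 1's SDP is replaced
    by a greedy heuristic; Step 2 thresholds h at tau and averages the
    estimated inliers, as in the paper."""
    n, d = Y.shape
    x = np.median(Y, axis=0)                     # coordinate-wise median initialisation
    for _ in range(iters):
        h = np.zeros(n)
        while True:                              # heuristic stand-in for Step 1
            R = Y - x
            M = (R * (1.0 - h)[:, None]).T @ R   # weighted scatter matrix around x
            w, V = np.linalg.eigh(M)
            if w[-1] <= (c1 * sigma) ** 2 * n:   # spectral-norm constraint met
                break
            scores = (R @ V[:, -1]) ** 2 * (1.0 - h)
            h[np.argsort(scores)[-max(1, n // 100):]] = 1.0  # flag the worst 1%
        x = Y[h < tau].mean(axis=0)              # Step 2: hard threshold, then average
    return x

rng = np.random.default_rng(3)
d, n, eps = 100, 1000, 0.2
Y = rng.standard_normal((n, d))                  # inliers ~ N(0, I), true mean 0
out = np.zeros(d); out[:2] = np.sqrt(d / 2)      # outlier cluster at distance sqrt(d)
Y[: int(eps * n)] = out
print(np.linalg.norm(robust_mean_sketch(Y)))     # small robust error
print(np.linalg.norm(Y.mean(axis=0)))            # naive sample mean is off by ~2
```

We use a slightly larger c_1 here than in the experiments section so that the greedy step does not over-strip inliers; this is a property of our heuristic, not of the paper's SDP-based Step 1.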

B. Solving Step 1 of Algorithm 1
When we set p = 1 in the objective ∥h∥_p in Step 1 of Algorithm 1, the resulting problem is convex and can be reformulated as a packing SDP [29] with w_i := 1 − h_i, and e_i being the i-th standard basis vector in R^n. The details can be found in the Appendix.
When 0 < p < 1, the equivalent objective function ∥h∥_p^p = Σ_i h_i^p is concave, not convex, so it may be difficult to find its global minimum. Nevertheless, we can iteratively construct and minimize a tight upper bound on this objective function via iterative re-weighted ℓ_2 [30], [31] or ℓ_1 techniques [32] from compressive sensing. It is well known in compressive sensing that such iterative re-weighted approaches often perform better than plain ℓ_1 minimization [30], [32].
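The re-weighting step can be viewed as a majorize-minimize scheme: since t ↦ t^p is concave for 0 < p < 1, its first-order expansion at the current iterate h̄ gives a weighted-ℓ_1 upper bound on ∥h∥_p^p that is tight at h̄. A minimal numerical check (the smoothing constant delta is the usual device to avoid the pole at zero):

```python
import numpy as np

p, delta = 0.5, 1e-6                      # smoothing delta avoids the pole at 0
h_bar = np.array([0.9, 0.3, 0.05])        # current iterate
w = p * (h_bar + delta) ** (p - 1)        # reweighted-l1 weights from linearising h^p

def lp_obj(h):      return np.sum(h ** p)
def surrogate(h):   return lp_obj(h_bar) + w @ (h - h_bar)   # first-order upper bound

h = np.array([0.7, 0.4, 0.0])             # any other point in [0, 1]^n
print(lp_obj(h) <= surrogate(h) + 1e-9)   # concavity => surrogate majorises: True
print(abs(surrogate(h_bar) - lp_obj(h_bar)) < 1e-9)          # tight at h_bar: True
```

Minimizing the surrogate (a weighted ℓ_1 objective) therefore never increases the ℓ_p objective, which is the standard justification for the re-weighted iterations.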

C. Complexity Analysis
Theorem 3 guarantees that the total number of iterations of Algorithm 1 required to achieve optimality is upper bounded by O(log d / log(1/γ(ε̄))). In each iteration, the computational complexity of Step 2 is O(nd). It follows easily from the proof of Theorem 3 that it suffices to solve the SDP in Step 1 of Algorithm 1 (with p = 1) to a constant precision. As a result, the error is affected only by a constant factor and thus remains order-optimal, and the time complexity is Õ(nd) parallelizable work using positive SDP solvers [33] (the notation Õ(m) hides poly-logarithmic factors: Õ(m) = O(m · polylog(m))). A comparison of our theoretical results with those in state-of-the-art works is given in Table I.
If we use ℓ_p with 0 < p < 1 in Step 1, we iteratively construct and minimize a tight upper bound on the ℓ_p objective via iterative re-weighted ℓ_2 [30], [31] or iterative re-weighted ℓ_1 techniques [32]. Minimizing the resulting weighted ℓ_1 objective can also be done very efficiently to a constant precision by formulating it as a packing SDP (see the Appendix), with computational complexity Õ(nd) [33]. If we use iterative re-weighted ℓ_2, minimizing the resulting weighted ℓ_2 objective is an SDP-constrained least squares problem, whose computational complexity is in general polynomial in both d and n. We will explore more efficient solutions for this objective in future work.

IV. EMPIRICAL STUDIES
In this section, we present empirical results on the performance of Algorithm 1 and compare with the following state-of-the-art high-dimensional robust mean estimation methods: Iterative Filtering (IF) [14], Generalized Filtering (GF) [27, Algorithm 2], the method proposed in [8] (denoted LRV), the method for bounded covariance distributions in [24] (denoted CDG), and Quantum Entropy Scoring (QUE) [4], which scores the outliers based on multiple directions. We briefly discuss the implementation details of these algorithms. We implemented the QUE method using the code provided in [4]. In [24], the authors provide a way to implement the CDG method approximately; we provide results for an exact implementation of the CDG method. Since the number of datapoints considered in the following simulations is less than the minimum requirement, a reasonable approach to compare the performance of the algorithms is to tune the hyper-parameters of all algorithms to get the best possible error. For example, the hyper-parameter c_4 that appears in the CDG method [24] is set to 1.05, which produced the smallest empirical error. The value of σ provided to the algorithms is not the theoretical value, but the empirical one (precisely, derived from the spectral norm of the sample covariance matrix of G* (see Section II)). For evaluation purposes, we report the recovery error, which we define as the ℓ_2 distance of the estimated mean to the oracle solution, i.e., the average of the datapoints that remain uncorrupted.

A. Synthetic Data
We consider two experimental settings. For the first setting we follow [4]. The dimension of the data is d, and the number of datapoints is n. The inlier datapoints are generated i.i.d. according to the standard Gaussian distribution with zero mean. A randomly (uniformly) chosen ϵ fraction of the datapoints is replaced by outliers. For the outliers, half of them are set to (√(d/2), √(d/2), 0, …, 0)^⊤, and the other half are set to (√(d/2), −√(d/2), 0, …, 0)^⊤, so that their ℓ_2 distances to the population mean (0, …, 0)^⊤ are all √d, similar to those of the inlier points; this creates two clusters of outliers. In Algorithm 1, we set the threshold τ = 0.6, c_1 = 1.1, and we initialize c_2^(0) as the ℓ_2 error of the coordinate-wise median relative to the true mean. We implemented the IF method for sub-Gaussian parameters [14, Theorem 3.1]. We vary the total fraction ϵ of the outliers and report the average recovery error of each method over 10 trials in Table II with d = 100, n = 1000. The proposed ℓ_1 and ℓ_0.5 methods show significant improvements over the competing methods, and the ℓ_0.5 method performs the best.
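The first synthetic setting above can be reproduced in a few lines; the random seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, eps = 100, 1000, 0.2
Y = rng.standard_normal((n, d))                       # inliers ~ N(0, I_d)
k = int(eps * n)
idx = rng.permutation(n)[:k]                          # uniformly chosen points to corrupt
out1 = np.zeros(d); out1[:2] = [np.sqrt(d / 2),  np.sqrt(d / 2)]
out2 = np.zeros(d); out2[:2] = [np.sqrt(d / 2), -np.sqrt(d / 2)]
Y[idx[: k // 2]] = out1                               # two outlier clusters, each at
Y[idx[k // 2:]] = out2                                # l2-distance sqrt(d) from the mean
print(np.allclose(np.linalg.norm(Y[idx], axis=1), np.sqrt(d)))   # True
```

Since each outlier has exactly two nonzero coordinates of magnitude √(d/2), its ℓ_2 norm equals √d, matching the typical norm of the inliers, which is what makes these outliers hard to detect by norm alone.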
We also tested the performance of each method for different numbers of datapoints. The dimension of the data is fixed at 100. The fraction of corrupted points is fixed at 20%. We vary the number of datapoints from 100 to 1000, and report the average recovery error for each method over 50 trials in Table III. We can see that the performance of all methods improves as the number of datapoints increases. Again, our proposed methods consistently perform better than the other methods.
The second experimental setting is as follows. The dimension of the data is d, and the number of datapoints is n. The inlier datapoints are generated i.i.d. such that each coordinate follows the Pareto distribution with scale parameter 1 and shape parameter 2.5. This implies that each coordinate has a bounded second moment but an unbounded fourth moment. All outliers are set to be the same vector v, which is chosen as follows. Let g be the average of the ℓ_2 norms of the datapoints. The vector v is set to (2 + g/d, 2 + g/d, …, 2 + g/d)^⊤. A randomly (uniformly) chosen ϵ fraction of the datapoints is replaced by outliers. The LRV method is applicable only to distributions with bounded fourth moment, and hence is not applicable to this setting. The CDG method was not implemented due to the high computational complexity of its implementation. We implemented the IF method for bounded second moment parameters [14, Theorem 3.2]. We implemented Algorithm 1 with p = 1, τ = 1 (no thresholding during the iterations), c_1 = 1 and c_2^(0) = 3√d + 2c_1. However, we threshold the last iterate h^(T) with threshold 0.6. We vary the total fraction ϵ of the outliers and report the average recovery error of each method over 100 trials in Table IV with d = 1000, n = 100000. The proposed ℓ_1 method shows significant improvement over the competing methods. We also tested the performance of each method for two different numbers of datapoints. The dimension of the data is fixed at 1000. The fraction of corrupted points is fixed at 20%. We consider 10000 and 100000 datapoints, and report the average recovery error for each method over 100 trials in Table V. Again, the proposed method consistently performs better than the other methods.

B. Corrupted Image Dataset
Here we use a dataset of real face images to test the effectiveness of the robust mean estimation methods. The average face of particular regions or certain groups of people is useful for many social and psychological studies [34]. We use 100 frontal human face images from the Brazilian face database as inliers. For the outliers, we choose 15 images of cats and dogs from the CIFAR10 [35] database. In order to be able to run the CDG method [24], we scale the images to 18 × 15 pixels, so the dimension of each datapoint is 270. The oracle solution is the average of the 100 human faces. Table VI reports the recovery error, i.e., the ℓ_2 distance of the estimated mean to the oracle solution, for each method. The proposed methods achieve smaller recovery error than the state-of-the-art methods. Sample inlier and outlier images, as well as the estimated mean for each method, can be found in the Appendix.

V. CONCLUSION
We formulated the robust mean estimation problem as the minimization of the ℓ_0-'norm' of the introduced outlier indicator vector, under a second moment constraint on the datapoints. We further relaxed the ℓ_0 objective to an ℓ_p (0 < p ≤ 1) objective, and theoretically justified the new objective. The proposed ℓ_0 and ℓ_p optimization problems do not need knowledge of ϵ, and still achieve information-theoretically order-optimal error bounds with optimal breakdown points. We then proposed a computationally tractable iterative ℓ_p (0 < p ≤ 1) minimization and hard thresholding algorithm, which significantly outperforms state-of-the-art robust mean estimation methods, and is order-optimal. In the empirical studies, we observed strong numerical evidence that using the ℓ_p (0 < p ≤ 1) norm in the optimization leads to sparse solutions; theoretically justifying this phenomenon is of interest. It is worth noting that almost all previous polynomial-time methods (with dimension-independent error bounds) need to know ϵ, while our Algorithm 1 does not. It has a maximal breakdown point of 1 − 1/√2, and has near-linear time complexity for p = 1.

A. Technical Preliminaries
We introduce the following parameters that control the minimum number of datapoints needed, the estimation error, and the confidence level.
Let S = {ỹ_1, …, ỹ_n} be a set of n datapoints drawn from a distribution P with mean µ and covariance matrix Σ ⪯ σ^2 I. We now define G as the set of datapoints which are less than σ√(d/(αδ)) = σ√(nδc′_1/(e log(d/δ))) distance away from µ: G := {ỹ_i : ∥ỹ_i − µ∥_2 ≤ σ√(d/(αδ))}, with index set I := {i : ỹ_i ∈ G}. It follows from Lemma 4 that the event E_1 := {|G| ≥ (1 − α)n} (see (24)) satisfies P(E_1) ≥ 1 − δ. Let E_2 be the event {∥Σ_{i∈I} (ỹ_i − µ)(ỹ_i − µ)^⊤∥_2 ≤ c_1^2 σ^2 n}. It follows from Lemma 5 that P(E_2) ≥ 1 − δ. Thus, we have that P(E_1 ∩ E_2) ≥ 1 − 2δ. For analysis purposes, we consider the far-away uncorrupted datapoints S \ G as outliers also. Let {y_1, …, y_n} be an ϵ-corrupted version of the set S. Let h* be such that h*_i = 1 for the outliers (both far-away uncorrupted datapoints and corrupted datapoints), and h*_i = 0 for the rest of the uncorrupted datapoints. Let the set of inliers be given by G* := {y_i : h*_i = 0}, with index set I* := {i : h*_i = 0}. Note that I* ⊆ I and G* ⊆ G. Since (ỹ_i − µ)(ỹ_i − µ)^⊤ is positive semi-definite (PSD), we must have Σ_{i∈I*} (ỹ_i − µ)(ỹ_i − µ)^⊤ ⪯ Σ_{i∈I} (ỹ_i − µ)(ỹ_i − µ)^⊤. This implies that, on the event E_2, ∥Σ_{i∈I*} (y_i − µ)(y_i − µ)^⊤∥_2 ≤ c_1^2 σ^2 n. Our intended solution is to have h_i = 0 for the inlier points and h_i = 1 for the outlier points. Let x̄ and x̄* be the averages of the datapoints in G and G*, respectively; applying Lemma C.2 from [27], both are close to µ. We now introduce some more events, E_3 and E_4 (cf. [14, Lemma A.18]), defined in terms of z_i = (ỹ_i − µ) 1{∥ỹ_i − µ∥_2 > σ√(d/(αδ))}. From Lemma 4, we get that P(E_3) ≥ 1 − δ and P(E_4) ≥ 1 − δ. Let E be the event given by E := E_1 ∩ E_2 ∩ E_3 ∩ E_4. Let ∆_{n,ξ} be the set of probability vectors given by ∆_{n,ξ} := {w ∈ R^n : 0 ≤ w_i ≤ 1/((1 − ξ)n), Σ_i w_i = 1}. Let TV(·, ·) denote the total variation distance between probability measures.

B. Technical Lemmas
Lemma 1 (Lemma 2.2 [27]): For a finite set of datapoints {y_1, . . . , y_n}, let x_w = Σ_{i∈[n]} w_i y_i and Σ_w = Σ_{i∈[n]} w_i (y_i − x_w)(y_i − x_w)⊤ be the weighted average and weighted covariance with respect to a probability weight vector w. Let w_1 and w_2 be two probability weight vectors such that TV(w_1, w_2) ≤ ζ. Then: Lemma 2 (Lemma 2.3 [27]): Let w_1 ∈ ∆_{n,ϵ_1} and w_2 ∈ ∆_{n,ϵ_2}. Then: Lemma 3: Let P be a distribution on R^d with mean µ and covariance matrix Σ ⪯ σ²I. Let ϵ ≤ 1/3. Given an ϵ-fraction corrupted set of n datapoints from P, the coordinate-wise median x̂ of the corrupted set satisfies, with probability at least 1 − d exp(−n/90): Proof: We first show that, with high probability, the error in each dimension is bounded by 3σ. Fix a coordinate, and let ỹ_i, y_i, µ_i and x̂_i denote the components of ỹ_i, y_i, µ and x̂, respectively, in that coordinate. By Markov's inequality applied to (ỹ_i − µ_i)², we have P(|ỹ_i − µ_i| > 3σ) ≤ E[(ỹ_i − µ_i)²]/(9σ²) ≤ 1/9. Thus, with high probability, more than five-sixths of the datapoints satisfy |ỹ_i − µ_i| ≤ 3σ, which implies that even if an ϵ ≤ 1/3 fraction of the datapoints is corrupted, we would have |x̂_i − µ_i| ≤ 3σ. Applying the union bound, we get that, with probability at least 1 − d exp(−n/90), the error in each dimension is bounded by 3σ, and hence ∥x̂ − µ∥_2 ≤ 3σ√d holds. Lemma 4: Let 0 < δ ≤ 1. Let E_1, E_3 and E_4 be the events described in (24), (36) and (37). Then: Proof: By Markov's inequality, we have: Applying Markov's inequality again, we have: Thus, we get: This proves the result for E_1. Applying Markov's inequality again, we obtain: This proves the result for E_3. By similar reasoning, the result for E_4 follows.
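Lemma 3's guarantee for the coordinate-wise median is easy to check numerically. The sketch below uses illustrative parameter choices, and its corruption pattern (all corrupted points moved far away in every coordinate) is an assumption for the demonstration, not the paper's adversary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, eps = 600, 40, 1.0, 0.3

# inliers from N(mu, sigma^2 I); an eps-fraction is replaced by
# adversarial points placed far away in every coordinate
mu = rng.normal(size=d)
y = mu + sigma * rng.normal(size=(n, d))
y[: int(eps * n)] = mu + 50.0 * sigma  # corrupted points

x_med = np.median(y, axis=0)  # coordinate-wise median of corrupted set
err = np.linalg.norm(x_med - mu)
print(err <= 3 * sigma * np.sqrt(d))  # the Lemma 3 bound holds
```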
Lemma 5: Let E_2 be the event described in (26). Then: Proof: We adopt the approach of [14, Lemma A.18 (iv)]. Lemma A.19 from [14] states the following: and, for any η > 0: We apply this result by assigning: We consider two mutually exclusive cases: 1) Suppose that M < e⁻¹δc_1²σ²n. Applying (60) with θ = 1, we obtain: Applying Markov's inequality, we obtain: The inequality in (64) follows from the assumption that M < e⁻¹δc_1²σ²n, and the inequality in (65) follows from the definition of α. 2) Suppose that M ≥ e⁻¹δc_1²σ²n. The inequality in (67) follows from the fact that M ≤ nσ², and the inequality in (69) follows from the fact that e^α < (1 + α)^(1+α) for any α > 0, and the fact that α = ed log(d/δ)/(n min{c_1² log c_1² + 1 − c_1², 1}). Lemma 6: Given a set of points y_i ∈ R^d, i = 1, . . . , n, and any w ∈ R^n, let x_w := Σ_{i=1}^n w_i y_i/∥w∥_1. Then: Proof: The equality (73) follows from the fact that the minimum on the RHS of (72) is attained at x_w = Σ_{i=1}^n w_i y_i/∥w∥_1. Consequently, (70) holds.
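Lemma 6's identity, that the weighted average x_w minimizes the weighted sum of squared distances, can be sanity-checked numerically. The check below uses arbitrary random data and positive weights and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
y = rng.normal(size=(n, d))
w = rng.random(n) + 0.1  # positive weights

# Lemma 6 candidate: x_w = sum_i w_i y_i / ||w||_1
x_w = (w[:, None] * y).sum(axis=0) / w.sum()

def obj(x):
    # weighted sum of squared distances: sum_i w_i ||y_i - x||^2
    return float((w * ((y - x) ** 2).sum(axis=1)).sum())

# the objective is strictly convex, so any perturbation of x_w increases it
worse = [obj(x_w + 0.1 * rng.normal(size=d)) for _ in range(200)]
print(all(v >= obj(x_w) for v in worse))
```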
Lemma 7: Let n > ed log(d/δ) as in (31). Then, on the event E_2 defined in (26), h* satisfies: Proof: Let I and I* be the sets defined in (22) and (30). We have: The last inequality follows from the definition of E_2 in (26) and Lemma 6. Lemma 8: Let ỹ_1, . . . , ỹ_n be i.i.d. datapoints drawn from a distribution with mean µ and covariance matrix Σ ⪯ σ²I. Let G be the set defined in (23), and let x̄ be the average of the datapoints in G. Then the following holds on the event E_1 ∩ E_3 ∩ E_4, where the events are defined in (24), (36) and (37): where: The last term is upper bounded as follows: The inequality (a) follows from the Cauchy–Schwarz inequality, and (b) follows from Markov's inequality. From (86), (25), (38), and (93), we get that, on that event: Lemma 9: Let 0 < τ ≤ 1. Suppose h ∈ R^n is such that 0 ≤ h_i ≤ 1 for all i, and ∥h∥_1 ≤ ϵn for some ϵ ∈ [0, 1). Then: Proof: We first show that: Hence, we have: Consequently, we obtain:

C. Proof of Theorem 1
Proof: Let (ĥ, x̂) be a feasible pair for (3) lying in S. Note that we obtain a corresponding feasible pair lying in S by setting each nonzero ĥ_i to 1. With a slight abuse of notation, let (ĥ, x̂) denote this feasible pair.

D. Proof of Theorem 2
Proof: Let (ĥ, x̂) ∈ S′ be a feasible pair for (4) with some 0 < p ≤ 1. We have: Since 0 ≤ ĥ_i ≤ 1 for all i, we have: This implies the following: Let ŵ = (1 − ĥ)/∥1 − ĥ∥_1 and β = ∥ĥ∥_p^p/n. Note that ŵ ∈ ∆_{n,β}. Consider h* as defined in (29). Let x̄* be the average of the datapoints in the set G* defined in (31), and let w* = (1 − h*)/(n − ∥h*∥_0). Observe that on the event E_1, w* ∈ ∆_{n,ϵ′}. As a consequence of Lemma 6, on the event E_1 ∩ E_2, we have: and similarly: From Lemma 2, we obtain: Consider the case β ≤ ϵ′ < 1/2. This implies: Consider the case ϵ′ ≤ β < 1 − ϵ′. This implies: Consequently, on the event E defined in (39), using Lemma 8, (35) and applying the triangle inequality, we obtain that, with probability at least 1 − 4δ: Let (h_opt, x_opt) be an optimal solution to (4). From Lemma 6, we have that the pair obtained by replacing x_opt with the corresponding weighted average of the datapoints is also an optimal solution. Note that on the event E, (h*, µ) is a feasible pair for (4). Hence: This implies:

E. Proof of Theorem 3
Proof: We prove the result by induction. Let x^(0) be the coordinate-wise median of the corrupted sample. It is easy to check that, under the conditions stated in Theorem 3, c_3 ≤ c_1 and ϵ ≤ (1 − α)(1 − ϵ). Note that if n ≥ 90 log(d/δ), then by Lemma 3, Lemma 8, (35) and the triangle inequality, the following holds with probability at least 1 − δ: Let E′ be the event: All of the following statements hold on the event E ∩ E′, where E is defined in (39). Also note that P(E ∩ E′) ≥ 1 − 5δ when n ≥ max{90, 2e…}. From Lemma 7, we have that h* is a feasible point for the above optimization problem. Hence: Since: This implies: Let w be such that: By Lemma 9, we have that w ∈ ∆_{n,ϵ′/τ}. Now we follow the proof of Theorem 2. Let x^(t+1) = Σ_{i=1}^n w_i y_i. Observe that w* ∈ ∆_{n,ϵ′}. As a consequence of Lemma 6, we have: and similarly: From Lemma 2, we obtain: From Lemma 1, we get
∥x^(t+1) − x̄*∥_2 ≤ (√λ_max(Σ_w) + √λ_max(Σ_w*)) · TV(w, w*)/(1 − TV(w, w*)). (144)
We established that ∥x^(t+1) − x̄*∥_2 ≤ σc_2^(t+1) and ∥h^(t)∥_p^p ≤ ϵ′n. Hence, by the principle of mathematical induction, the result follows. It is easy to check that γ(ε) < 1 holds if and only if ε < f(τ). Furthermore, ε < f(τ) implies ϵ̌ < τ. Thus, we have that ∥x^(t) − x̄*∥_2 ≤ σ γ(ϵ′) c. Consequently, using Lemma 8, (35) and applying the triangle inequality, we obtain that, with probability at least 1 − 5δ, ∥x^(t) − µ∥_2 is upper bounded by: It is easy to see that for T = 1 + log(c_2^(0))/|log γ(ε)|, we have:

G. Solving ℓ_1 Objective via Packing SDP
Define the vector w with w_i: Therefore, solving (154) is equivalent to solving the following: Then, we rewrite the constraints 0 ≤ w_i ≤ 1, ∀i, as 0 ≤ w_i and w_i e_i e_i⊤ ⪯ I_{n×n}, where e_i is the i-th standard basis vector in R^n. This establishes the equivalence between (155) and (21).
H. Solving Weighted ℓ_1 Objective via Packing SDP
Consider ℓ_p (0 < p < 1) in Step 1 of Algorithm 1 (see objective (156)). If we employ the iterative re-weighted ℓ_1 approach [30], [32], we need to solve the following problem: where u_i is the weight on the corresponding h_i. Define the vector w with w_i: Solving (162) is equivalent to solving the following: Then, we rewrite the constraints 0 ≤ w_i ≤ 1, ∀i, as 0 ≤ w_i and w_i e_i e_i⊤ ⪯ I_{n×n}, where e_i is the i-th standard basis vector in R^n. Finally, we can turn (163) into the following Packing SDP: (164)
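As a concrete instance of the re-weighting in [30], [32], one standard rule sets u_i from a smoothed derivative of |h_i|^p. The helper below uses u_i = p(|h_i| + δ)^(p−1); this particular rule and its smoothing constant δ are illustrative assumptions, shown only to make the weighted ℓ_1 step concrete.

```python
import numpy as np

def irl1_weights(h, p=0.5, delta=1e-3):
    """One standard re-weighting rule for minimizing ||h||_p^p (0 < p < 1)
    via successive weighted ell_1 problems: u_i = p * (|h_i| + delta)**(p-1).
    Near-zero coordinates receive large weights, which drives them further
    toward zero and promotes a sparse outlier indicator vector."""
    return p * (np.abs(h) + delta) ** (p - 1)

h = np.array([0.0, 0.01, 0.5, 1.0])
u = irl1_weights(h)
print(bool(u[0] > u[1] > u[2] > u[3]))  # smaller |h_i| => larger weight
```

Each weighted ℓ_1 subproblem produced this way is then solved via the Packing SDP (164) above.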

I. Corrupted Image Dataset
We use real face images to test the effectiveness of the robust mean estimation methods. The average face of particular regions or certain groups of people is useful for many social and psychological studies [34]. Here we use 100 frontal human face images from the Brazilian face database⁴ as inliers.
For the outliers, we choose 15 images of cat and dog faces from CIFAR10 [35]. In order to run the CDG method [24], we scale the images to 18 × 15 pixels, so the dimension of each datapoint is 270. Fig. 2 and Fig. 3 show sample inlier and outlier images. Fig. 4 shows the oracle solution (the average of the 100 inlier human faces) and the mean estimated by each method, together with its ℓ_2 distance to the oracle solution. The proposed ℓ_1 and ℓ_p methods achieve smaller recovery error than the state-of-the-art methods. The mean faces estimated by the proposed methods also look visually similar to the oracle solution, which illustrates the efficacy of the proposed ℓ_1 and ℓ_p methods.
⁴https://fei.edu.br/~cet/facedatabase.html
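The structure of this experiment — an oracle inlier mean, a corrupted sample mean, and a robust estimate compared by ℓ_2 distance — can be mimicked on synthetic stand-ins. The sketch below uses random 270-dimensional vectors in place of the actual face and CIFAR10 images, and the coordinate-wise median as a simple robust baseline in place of the proposed ℓ_1/ℓ_p estimators; all of these substitutions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 18 * 15  # flattened 18 x 15 grayscale images

# synthetic stand-ins: 100 "face" vectors near a common template and
# 15 "outlier" vectors far from it (the real experiment uses the
# Brazilian face database and CIFAR10 images instead)
template = rng.random(d)
inliers = template + 0.05 * rng.normal(size=(100, d))
outliers = rng.random((15, d)) + 2.0
data = np.vstack([inliers, outliers])

oracle = inliers.mean(axis=0)      # average of the inliers only
naive = data.mean(axis=0)          # corrupted sample mean
robust = np.median(data, axis=0)   # simple robust baseline

print(np.linalg.norm(robust - oracle) < np.linalg.norm(naive - oracle))
```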