Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms

We consider the problem of robust mean and location estimation with respect to any pseudo-norm of the form x ∈ R^d ↦ ‖x‖_S = sup_{v∈S} ⟨v, x⟩, where S is any symmetric subset of R^d. We show that the deviation-optimal minimax sub-Gaussian rate for confidence 1 − δ is

max( ℓ*(Σ^{1/2}S)/√N , sup_{v∈S} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N) ),

where ℓ*(Σ^{1/2}S) is the Gaussian mean width of Σ^{1/2}S and Σ the covariance of the data. This improves the entropic minimax lower bound from Lugosi and Mendelson (Probab Theory Relat Fields 175(3–4):957–973, 2019) and closes the gap, characterized by Sudakov's inequality, between the entropy and the Gaussian mean width for this problem. This shows that the right statistical complexity measure for the mean estimation problem is the Gaussian mean width. We also show that this rate can be achieved by a solution to a convex optimization problem in the adversarial and L_2 heavy-tailed setup, by considering minima of Fenchel-Legendre transforms constructed using the median-of-means principle. We finally show that this rate may also be achieved in situations where there is not even a first moment but a location parameter exists.


Introduction
We consider the problem of robust (to adversarial corruption and heavy-tailed data) multivariate mean and location estimation with respect to any pseudo-norm x ∈ R^d ↦ ‖x‖_S = sup_{v∈S} ⟨v, x⟩, where S is any symmetric subset of R^d (i.e. if x ∈ S then −x ∈ S). This problem has been extensively studied during the last decade for S = B^d_2, the unit Euclidean ball [34,8,15,7,11,31,6,13,28,14,9,10,26]. Little is known for general symmetric sets S, and we will mainly refer to [30], where this problem has been handled for S equal to the unit dual ball B° of a norm ‖·‖ (so that ‖·‖_S = ‖·‖).
In [30], the authors introduced the problem of estimation, robust to heavy-tailed data, of a mean vector w.r.t. any norm. The problem can be stated as follows: given N i.i.d. random vectors X_1, …, X_N in R^d with mean µ* and covariance matrix Σ, a norm ‖·‖ on R^d and a confidence parameter δ ∈ (0, 1), find an estimator μ̂_N(δ) and the best possible accuracy r*(N, δ) such that, with probability at least 1 − δ, ‖μ̂_N(δ) − µ*‖ ≤ r*(N, δ). In [30], the authors use the median-of-means principle [35,17,1] to construct an estimator satisfying the following result.
Theorem 1. [Theorem 2 in [30]] There exists an absolute constant c such that the following holds. Given a norm ‖·‖ on R^d and a confidence δ ∈ (0, 1), one can construct μ̂_N(δ) such that, with probability at least 1 − δ,

‖μ̂_N(δ) − µ*‖ ≤ (c/√N) ( E‖N^{−1/2} ∑_{i=1}^N ε_i(X_i − µ*)‖ + E‖Σ^{1/2}G‖ + sup_{v∈B°} ‖Σ^{1/2}v‖_2 √(log(1/δ)) ),

where B° is the unit dual ball associated with ‖·‖, (ε_i) are i.i.d. Rademacher variables independent of the X_i's, and G ∼ N(0, I_d).
The construction of μ̂_N(δ) is rather involved and it seems hard to design an algorithm out of this procedure. In particular, μ̂_N(δ) has not been proved to be the solution to a convex optimization problem. Theorem 1's main interest is thus theoretical, while robust multivariate mean estimation can also be interesting from a practical point of view [12].
The rate obtained in Theorem 1 can be decomposed into two terms: a deviation term sup_{v∈B°} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N), where sup_{v∈B°} ‖Σ^{1/2}v‖_2 is a weak variance term, and a complexity term which is the sum of a Rademacher complexity E‖N^{−1/2} ∑_{i=1}^N ε_i(X_i − µ*)‖ and a Gaussian mean width E‖Σ^{1/2}G‖, both divided by √N. The intuition behind this rate is explained in [30], in particular in Question 1. We will however show that this rate is not the right one and that the Gaussian mean width term is actually not necessary. Moreover, we will show in Section 3 that the improved rate can be achieved by an estimator that solves a convex optimization problem, that this holds even in the adversarial corruption model (see Assumption 1 in Section 3 below for a formal definition), and even in some situations where there is not even a first moment; in that case, µ* is a location parameter and Σ a scatter parameter.
The question of the optimality of the rate in Theorem 1 was raised in [30]. The classical approach to answering this type of question is to consider the Gaussian case, that is, when the data X_i, i ∈ [N], are i.i.d. N(µ*, Σ). This is also the strategy used in [30] to obtain the following deviation-minimax lower bound result¹.
Theorem 2. [Theorem 3 and first paragraph on p. 962 in [30]] There exists an absolute constant c > 0 such that the following holds. If μ̂ : R^{Nd} → R^d is an estimator such that for all µ* ∈ R^d and all δ ∈ (0, 1/4), with P^N_{µ*}-probability at least 1 − δ, ‖μ̂ − µ*‖ ≤ r(δ), where P^N_{µ*} is the probability distribution of (X_i)_{i∈[N]} when the X_i are i.i.d. N(µ*, Σ), then

r(δ) ≥ c max( sup_{η>0} (η/√N) √(log N(Σ^{1/2}B°, ηB^d_2)) , sup_{v∈B°} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N) ),

where N(Σ^{1/2}B°, ηB^d_2) is the minimal number of translates of ηB^d_2 needed to cover Σ^{1/2}B°.
The term sup_{v∈B°} ‖Σ^{1/2}v‖_2 √(log(1/δ)) in the lower bound from Theorem 2 is obtained in [30] from Proposition 6.1 in [6], which is a deviation-minimax lower bound holding in the one-dimensional case and relying on the fact that the empirical mean is a sufficient statistic in the Gaussian shift theorem².
The complexity term sup_{η>0} η √(log N(Σ^{1/2}B°, ηB^d_2)) obtained in Theorem 2 follows from the duality theorem of metric entropy from [2] and a volumetric argument in the Gauss space similar to the one used to prove the dual Sudakov inequality on pp. 82–83 in [25], which has also been used to obtain minimax lower bounds based on the entropy in [22] and [32].

¹ The result from [30] is proved for Σ = I_d; it is however straightforward to extend it to the general case.
² The argument used in [30] goes from the one-dimensional case studied in [6] to the d-dimensional case. It is given in a non-formal way and may require some extra argument to hold. Indeed, the estimator x*(Ψ̂_N) in [30] is constructed using the d-dimensional data X_1, …, X_N and not one-dimensional data such as x*(X_1), …, x*(X_N). However, the result from [6] holds for estimators of a one-dimensional mean using one-dimensional data and not d-dimensional ones. Nevertheless, Olivier Catoni showed us how to adapt the proof of Proposition 6.1 in [6], by using the sufficiency of the empirical mean in the Gaussian shift model in R^d, to get this deviation-dependent lower bound term.
In general, there is a gap between the upper bound from Theorem 1 and the lower bound from Theorem 2, even in the Gaussian case. This gap is characterized by Sudakov's inequality (see Theorem 3.18 in [25] or Theorem 5.6 in [36]):

sup_{η>0} η √(log N(Σ^{1/2}B°, ηB^d_2)) ≤ c E‖Σ^{1/2}G‖,   (1)

where G ∼ N(0, I_d). Indeed, in the Gaussian case, the complexity term of the rate obtained in Theorem 1 is the Gaussian mean width, that is, the right-hand term of (1), whereas the complexity term from Theorem 2 is the entropy, that is, the left-hand term in (1).
As mentioned in Remark 3 in [30], when Sudakov's inequality (1) is sharp, the upper and lower bounds from Theorems 1 and 2 match in the Gaussian case (in that case the Rademacher complexity equals the Gaussian mean width in Theorem 1). Sharpness in Sudakov's inequality is however not the typical situation. In particular, for ellipsoids, Sudakov's bound (1) is not sharp in general, and therefore the lower bound from Theorem 2 fails to recover the classical subgaussian rate for the standard Euclidean norm case (that is, for S = B^d_2), which is given in [31] by

√(Tr(Σ)/N) + √(‖Σ‖_op log(1/δ)/N).   (2)

Indeed,

sup_{η>0} η √(log N(Σ^{1/2}B^d_2, ηB^d_2)) ≤ c sup_{n≥1} √n e_{n+1}(Σ^{1/2}),   (3)

where (e_{n+1}(Σ^{1/2}))_n are the entropy numbers of Σ^{1/2} : ℓ^d_2 → ℓ^d_2 (see page 62 in [36] for a definition) and λ_1 ≥ … ≥ λ_d are the singular values of Σ. In particular, when λ_j = 1/j, the entropy bound (3) is of the order of a constant whereas the Gaussian mean width is of the order of √(log d). We will fill this gap in Section 2 by showing a lower bound where the entropy is replaced by the (larger) Gaussian mean width. We will therefore obtain matching upper and lower bounds, revealing that the Gaussian mean width is the right way to measure the statistical complexity for the mean estimation problem w.r.t. any ‖·‖_S.
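To make the ellipsoid example concrete, here is a small numerical sketch (our own illustration, not code from the paper): when λ_j = 1/j, the squared Gaussian mean width of Σ^{1/2}B^d_2 behaves like Tr(Σ) = ∑_j 1/j, which grows like log d, while the entropy term in (3) stays of constant order.

```python
import math

def mean_width_proxy(d):
    # For Sigma with eigenvalues lambda_j = 1/j, the Gaussian mean width of
    # Sigma^{1/2} B_2^d is E||Sigma^{1/2} G||_2, which is of the order of
    # sqrt(Tr(Sigma)) = sqrt(sum_j 1/j) ~ sqrt(log d).
    return math.sqrt(sum(1.0 / j for j in range(1, d + 1)))

for d in (10, 1000, 100000):
    # the ratio Tr(Sigma) / log d tends to 1 as d grows (harmonic sum)
    print(d, round(mean_width_proxy(d) ** 2 / math.log(d), 3))
```

The harmonic sum ∑_{j≤d} 1/j = log d + O(1), so the printed ratios decrease toward 1, confirming the √(log d) growth of the mean width in this example.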
The paper is organized as follows. In the next section, we obtain the deviation-minimax optimal rate in the i.i.d. Gaussian case. In Section 3, we show that the rate from Theorem 1 can be improved and that it can be achieved by a solution to a convex program in the adversarial contamination model, under weak or no moment assumptions. All the proofs have been gathered in Section 4.
2 Deviation minimax rates in the Gaussian case: benchmark subgaussian rates for mean estimation w.r.t. ‖·‖_S

In this section, we obtain the optimal deviation-minimax rates for the estimation of a mean vector µ* when we are given N i.i.d. vectors X_1, …, X_N distributed like N(µ*, Σ), where Σ ⪰ 0 is some unknown covariance matrix.
In the following, P N µ * denotes the probability distribution of (X 1 , . . ., X N ); it is a Gaussian measure on R N d with mean ((µ * ) ⊤ , . . ., (µ * ) ⊤ ) and a block (N d) × (N d) covariance matrix with d × d diagonal blocks given by Σ repeated N times and 0 outside of these blocks.
Unlike classical minimax results holding in expectation or with constant probability (see Chapter 2 in [38]) we want, in this section, the deviation parameter δ to appear explicitly in the minimax lower bound.
Moreover, this dependency of the convergence rate on δ should be of the right order, given by the subgaussian √(log(1/δ)) rate, and not another polynomial dependency such as √(1/δ), as one gets for the empirical mean for L_2 variables (see Proposition 6.2 in [6]). This subtle behavior of the rate in terms of δ cannot be seen in expectation or constant-deviation minimax lower bounds. In particular, this makes such results (like Theorem 3 or 4 below) unachievable via classical information-theoretic arguments as in Chapter 2 in [38].
Fortunately, in [22], a minimax lower bound was proved thanks to the Gaussian shift theorem, which makes the deviation parameter δ appear explicitly in the minimax lower bound. We use the same strategy here to prove our main result, Theorem 3 below, and its corollary, Theorem 4, in the classical Euclidean case S = B^d_2. We consider the general problem of estimating µ* w.r.t. ‖·‖_S. Let S ⊂ R^d be a symmetric set. We first obtain an upper bound result revealing the subgaussian rate. We use the empirical mean X̄_N = N^{−1} ∑_i X_i as an estimator of µ*. Using the Borell-TIS inequality (Theorem 7.1 in [24] or pages 56–57 in [37]) we get: for all 0 < δ < 1, with probability at least 1 − δ,

‖X̄_N − µ*‖_S ≤ E‖X̄_N − µ*‖_S + σ_S √(2 log(1/δ)),

where σ_S = sup_{v∈S} (E⟨v, X̄_N − µ*⟩²)^{1/2} is called the weak variance. It follows that, with probability at least 1 − δ,

‖X̄_N − µ*‖_S ≤ ℓ*(Σ^{1/2}S)/√N + sup_{v∈S} ‖Σ^{1/2}v‖_2 √(2 log(1/δ)/N),   (4)

where ℓ*(Σ^{1/2}S) = E sup_{v∈S} ⟨G, Σ^{1/2}v⟩, with G ∼ N(0, I_d), is the Gaussian mean width of the set Σ^{1/2}S. In particular, in the case where S = B^d_2, we recover the subgaussian rate (2) in (4). Our aim is now to show that the rate in (4) is deviation-minimax optimal. This is what is obtained in the next result.
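The complexity term ℓ*(Σ^{1/2}S)/√N in (4) is easy to approximate numerically. The following sketch (our own illustration; the function name is ours) estimates the Gaussian mean width by Monte Carlo for the simple symmetric set S = {±e_1, …, ±e_d}, for which ‖·‖_S is the sup-norm and, with Σ = I_d, ℓ*(S) = E max_j |g_j|.

```python
import math
import random

def mean_width_sup(d, trials=20000, seed=0):
    """Monte Carlo estimate of l*(S) = E sup_{v in S} <G, v> = E max_j |g_j|
    for S = {±e_1, ..., ±e_d} (so ||.||_S is the sup-norm and Sigma = I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(d))
    return total / trials

# The complexity part of the rate (4) is then mean_width_sup(d) / sqrt(N).
w1, w100 = mean_width_sup(1), mean_width_sup(100, trials=2000)
print(w1, w100)  # w1 is close to E|g| = sqrt(2/pi); w100 grows like sqrt(2 log d)
```

For d = 1 the estimate approaches E|g| = √(2/π) ≈ 0.798, and for larger d it grows at the familiar √(2 log d) pace of a Gaussian maximum.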
Theorem 3. Let S be a symmetric subset of R^d such that span(S) = R^d. If μ̂ : R^{Nd} → R^d is an estimator such that for all µ* ∈ R^d and all δ ∈ (0, 1/4], with P^N_{µ*}-probability at least 1 − δ, ‖μ̂ − µ*‖_S ≤ r(δ), then

r(δ) ≥ c max( ℓ*(Σ^{1/2}S)/√N , sup_{v∈S} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N) ).

It follows from the upper bound (4) and the deviation-minimax lower bound from Theorem 3 that we now know exactly (up to absolute constants) the subgaussian rate for the problem of mean estimation in R^d w.r.t. ‖·‖_S; it is given by

max( ℓ*(Σ^{1/2}S)/√N , sup_{v∈S} ‖Σ^{1/2}v‖_2 √(log(1/δ)/N) ).   (5)

We may identify the complexity and deviation terms in this rate. In particular, the complexity term is measured here via the Gaussian mean width of the set Σ^{1/2}S and not via its entropy, as was previously suggested by Theorem 2. Theorem 3 together with (4) shows that the right way to measure the statistical complexity in the problem of mean estimation in R^d w.r.t. any ‖·‖_S is via the Gaussian mean width. This differs from other statistical problems, such as the regression model with random design, where the entropy has been proved to be the right statistical complexity in several examples [32,22]. In view of the latter results in the regression model, Theorem 3 is a bit unexpected, because one might think that by taking an ERM over an ε-net of R^d, for the right choice of ε, one could obtain a better rate than the one driven by the Gaussian mean width in (5); indeed, for this type of procedure, one might expect a rate depending on the (smaller) entropy instead of the (larger) Gaussian mean width. Theorem 3 shows that this is not the case: even a discretized ERM cannot achieve a better rate than the one driven by the Gaussian mean width in the mean estimation problem. An important consequence of Theorem 3 is obtained when S = B^d_2, that is, for the problem of multivariate mean estimation w.r.t. the ℓ^d_2-norm, which is the problem that has been extensively considered during the last decade. In the following result, we recover the well-known subgaussian rate (2), showing that all the upper bound results where this rate has been proved to be achieved are actually deviation-minimax optimal and therefore could not have been improved uniformly over all µ* ∈ R^d.
Given that the empirical mean X̄_N is such that, for all µ ∈ R^d, with P^N_µ-probability at least 1 − δ,

‖X̄_N − µ‖_2 ≤ c( √(Tr(Σ)/N) + √(‖Σ‖_op log(1/δ)/N) ),

we conclude from Theorem 4 that the subgaussian rate (2) is the deviation-minimax rate of convergence for the multivariate mean estimation problem w.r.t. ℓ^d_2 and that it is achieved by the empirical mean. In particular, there is no statistical procedure that can do better than the empirical mean uniformly over all mean vectors µ* ∈ R^d up to constants; this includes in particular all discretized versions of X̄_N.

Convex programs
In this section, we introduce statistical procedures which are solutions to convex programs and which achieve the rate from Theorem 1 without the unnecessary Gaussian mean width term E‖Σ^{1/2}G‖. We also show that these procedures handle adversarial corruption and may still perform optimally in some situations where there is not even a first moment.
3.1 Construction of the Fenchel-Legendre minimum estimators.
For our purpose, the main property of a Fenchel-Legendre transform we will use is that it is a convex function, as it is the supremum of the family of affine functions µ ↦ ⟨µ, v⟩ − h(v), v ∈ S. We now define two examples of functions such that taking the minimum of their Fenchel-Legendre transform over S leads to optimal estimators of µ* w.r.t. ‖·‖_S. The construction of these two functions is based on the median-of-means principle: the dataset {X_1, …, X_N} is split into K equal-size blocks B_1, …, B_K, and on each block we compute a bucketed mean X̄_k = |B_k|^{−1} ∑_{i∈B_k} X_i. The two functions we consider use the K bucketed means (X̄_k)_k and are defined, for all v ∈ R^d, by

f(v) = (1/|IQ|) ∑_{k∈IQ} a_{(k)}(v)  and  g(v) = Med( a_1(v), …, a_K(v) ),   (6)

where a_k(v) = ⟨X̄_k, v⟩, (a_{(k)}(v))_k is the nondecreasing rearrangement of (a_k(v))_k (this is the rearrangement of the values a_k themselves and not of their absolute values) and IQ = {(K+1)/4, …, 3(K+1)/4} is the inter-quartile set of indices; w.l.o.g. we assume that K + 1 can be divided by 4. In other words, f(v) is the average of the inter-quartile values of the vector (⟨X̄_k, v⟩)_{k∈[K]} and g(v) is the median of this vector. Note that both functions f and g are homogeneous, i.e. f(θv) = θf(v) and g(θv) = θg(v) for every v ∈ R^d and θ ∈ R, and in particular they are odd functions; two facts we will use later.
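The bucketed means and the two functions in (6) are straightforward to implement. The sketch below (our own illustration; it rounds the quartile indices instead of assuming K + 1 divisible by 4) computes X̄_k, the inter-quartile mean f and the median g.

```python
import statistics

def bucketed_means(X, K):
    """Split the N data points into K equal-size blocks and average each block
    (the median-of-means buckets; assumes K divides N for simplicity)."""
    N, d = len(X), len(X[0])
    m = N // K
    return [[sum(x[j] for x in X[k*m:(k+1)*m]) / m for j in range(d)]
            for k in range(K)]

def g(v, buckets):
    """g(v) = median of <bucket mean, v> over the K buckets."""
    return statistics.median(sum(bj * vj for bj, vj in zip(b, v)) for b in buckets)

def f(v, buckets):
    """f(v) = average of the inter-quartile values of (<X̄_k, v>)_k,
    with rounded quartile indices."""
    a = sorted(sum(bj * vj for bj, vj in zip(b, v)) for b in buckets)
    K = len(a)
    lo, hi = K // 4, 3 * K // 4
    return sum(a[lo:hi + 1]) / (hi - lo + 1)
```

On constant data both functions return the common projection value, and homogeneity g(θv) = θg(v) can be checked directly.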
We now consider the Fenchel-Legendre transforms of the functions f and g over a symmetric set S: for all µ ∈ R^d,

f*_S(µ) = sup_{v∈S} ( ⟨µ, v⟩ − f(v) )  and  g*_S(µ) = sup_{v∈S} ( ⟨µ, v⟩ − g(v) ).   (7)

As mentioned previously, the two functions f*_S and g*_S are convex. We now use them to define convex programs whose solutions will be proved to be robust and subgaussian estimators of the mean/location vector µ* w.r.t. ‖·‖_S:

μ̂^f_S ∈ argmin_{µ∈R^d} f*_S(µ)  and  μ̂^g_S ∈ argmin_{µ∈R^d} g*_S(µ).
For some special choices of S, the Fenchel-Legendre minimization estimator μ̂^g_S coincides with some classical procedures. This is for instance the case when S = B^d_1: denoting by (e_j)^d_{j=1} the canonical basis of R^d, g*_S = g*_{conv(S)}, where conv(S) is the convex hull of S, so one may just take S = {±e_j : j ∈ [d]}; in that case, μ̂^g_S is the coordinate-wise median-of-means:

μ̂^g_S = ( Med( ⟨X̄_k, e_j⟩ : k ∈ [K] ) )_{j∈[d]}.   (9)

It is therefore possible to derive deviation-minimax optimal bounds for the coordinate-wise median-of-means w.r.t. the ℓ^d_∞-norm from general upper bounds on μ̂^g_S. In the case S = B^d_2 (that is, for the mean/location estimation problem w.r.t. ℓ^d_2), the Fenchel-Legendre minimum estimator μ̂^g_S is a minmax MOM estimator [23]. This connection allows one to write μ̂^g_S (as well as μ̂^f_S) as an unconstrained estimator; it also shows that this minmax MOM estimator is actually the solution to a convex optimization problem, and how minmax MOM estimators can be generalized to other estimation risks.
Minmax MOM estimators were introduced as a systematic way to construct robust and subgaussian estimators in [23]. They have been proved to be deviation-minimax optimal for the mean estimation problem w.r.t. ‖·‖_2 in [28]. Their definition only requires a loss function; here we take, for all µ ∈ R^d and x ∈ R^d, ℓ_µ(x) = ‖x − µ‖²_2, and the minmax MOM estimator is then defined as

μ̂ ∈ argmin_{µ∈R^d} sup_{ν∈R^d} Med( P_{B_k}ℓ_µ − P_{B_k}ℓ_ν : k ∈ [K] ),   (10)

where P_{B_k} is the empirical measure on the data in block B_k. The minmax MOM estimator μ̂ was proved in [28] to achieve the subgaussian rate (2) with confidence 1 − δ when the number of blocks is K ∼ log(1/δ) and K ≳ |O|.
Even though the minmax formulation of μ̂ suggests a robust version of a descent/ascent gradient method over the median block (see [23,28] for more details), no proof of convergence of this algorithm is known so far. Moreover, the main drawback of the minmax MOM estimator seems to be that it is the solution of a non-convex optimization problem and may therefore be rather difficult to compute in practice. In the next result, we show that this is not the case, since the minmax MOM estimator (10) is in fact equal to μ̂^g_S for S = B^d_2 and is therefore the solution to a convex optimization problem.
Proposition 1. The minmax MOM estimator μ̂ defined in (10) is equal to the Fenchel-Legendre minimum estimator μ̂^g_S for S = B^d_2. The minmax MOM estimator is therefore the solution to a convex optimization problem.
Proof. We show that μ̂ ∈ argmin_{µ∈R^d} sup_{‖v‖_2=1} Med( ⟨X̄_k − µ, v⟩ : k ∈ [K] ). We consider the quadratic/multiplier decomposition of the difference of loss functions: for all µ, ν ∈ R^d and x ∈ R^d, we have

ℓ_µ(x) − ℓ_ν(x) = ‖ν − µ‖²_2 + 2⟨x − ν, ν − µ⟩.

Writing ν = µ + 2tv with ‖v‖_2 = 1 and t ≥ 0, this gives P_{B_k}(ℓ_µ − ℓ_ν) = 4t⟨X̄_k − µ, v⟩ − 4t², and so Med_k P_{B_k}(ℓ_µ − ℓ_ν) = 4t Med_k⟨X̄_k − µ, v⟩ − 4t². Hence, optimizing over t ≥ 0 and over v, for all µ ∈ R^d we have

sup_{ν∈R^d} Med_k( P_{B_k}ℓ_µ − P_{B_k}ℓ_ν ) = ( sup_{‖v‖_2=1} Med_k⟨X̄_k − µ, v⟩ )²_+.

We conclude since argmin_{µ∈R^d} sup_{‖v‖_2=1} Med_k⟨X̄_k − µ, v⟩ = argmin_{µ∈R^d} g*_{B^d_2}(µ). It follows from Proposition 1 that the minmax MOM estimator μ̂ is the solution to a convex optimization problem, a fact that is far from obvious given the definition of μ̂ in (10).
Proposition 1 suggests a new formulation for μ̂^g_S and μ̂^f_S. It is indeed possible to write these estimators as regularized estimators instead of their original constrained formulation (note that the Fenchel-Legendre transforms in (7) are suprema over S and are therefore constrained optimization problems). We now show that we may write them as suprema over all of R^d if we add an ad hoc regularization function.
Let us introduce the two following functions, which may be seen as regularized versions of the two functions f and g from (6): for all ν ∈ R^d,

f̄(ν) = f(ν) + (1/4)‖ν‖²_{conv(S)}  and  ḡ(ν) = g(ν) + (1/4)‖ν‖²_{conv(S)},

where ‖·‖_{conv(S)} denotes the Minkowski gauge of conv(S). We also consider their Fenchel-Legendre transforms over the entire set R^d: for all µ ∈ R^d,

f̄*(µ) = sup_{ν∈R^d} ( ⟨µ, ν⟩ − f̄(ν) )  and  ḡ*(µ) = sup_{ν∈R^d} ( ⟨µ, ν⟩ − ḡ(ν) ).

The next result shows that the latter two Fenchel-Legendre transforms can be used to define the two estimators μ̂^f_S and μ̂^g_S. The proof of Proposition 2 is similar to that of Proposition 1, where the ℓ_2-norm is replaced by ‖·‖_S, and is therefore omitted.
As a consequence of Proposition 2, one can write the two estimators μ̂^f_S and μ̂^g_S as solutions to unconstrained minmax optimization problems, like the minmax MOM estimator (10); in particular, one may design an alternating ascent/descent sub-gradient algorithm similar to the one from [23]. We expect the one associated with μ̂^f_S, which uses half of the dataset at each iteration, to be more efficient than the one associated with μ̂^g_S, which uses only the N/K data of the median block at each iteration. That is the reason why we provide this algorithm in Figure 1 only for μ̂^f_S.

input: the data X_1, …, X_N, a number K of blocks, two step-size sequences (η_t)_t, (θ_t)_t and a stopping parameter ε > 0
output: a robust estimator of the mean µ*
1. Construct an equipartition of [N] into K blocks.
2. Compute μ̂^{(0)}, the coordinate-wise median-of-means, and put µ^{(0)} = μ̂^{(0)} and ν^{(0)} = μ̂^{(0)}.
3. At step t, construct g^{(t)}, a subgradient of ‖·‖_S at ν^{(t)}, and the corresponding ascent direction; make one ascent step in ν with step size θ_t and one descent step in µ with step size η_t, with randomly re-chosen blocks.
4. Stop when the increments are smaller than ε.

Algorithm 1: An alternating ascent/descent algorithm for the robust mean estimation problem w.r.t. ‖·‖_S with randomly chosen blocks of data at each step.
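For the S = B^d_2 case, the alternating scheme can be sketched as follows (our own simplified variant, not the paper's exact Algorithm 1: it works on the g formulation argmin_µ sup_{‖v‖_2=1} Med_k⟨X̄_k − µ, v⟩, and the step sizes, scaling of the descent step and initialization are our choices).

```python
import random

def alternating_mom(X, K, steps=200, eta=0.1, seed=0):
    """Alternating ascent (on v) / descent (on mu) sketch for
    argmin_mu sup_{||v||_2 = 1} Med_k <X̄_k - mu, v>   (the S = B_2^d case)."""
    rng = random.Random(seed)
    N, d = len(X), len(X[0])
    m = N // K
    idx = list(range(N))
    rng.shuffle(idx)                                   # random equipartition
    buckets = [[sum(X[i][j] for i in idx[k*m:(k+1)*m]) / m for j in range(d)]
               for k in range(K)]
    # initialize at the coordinate-wise median-of-means
    mu = [sorted(b[j] for b in buckets)[K // 2] for j in range(d)]
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(steps):
        vals = sorted((sum((b[j] - mu[j]) * v[j] for j in range(d)), k)
                      for k, b in enumerate(buckets))
        phi, kmed = vals[K // 2]                       # median value and bucket
        bmed = buckets[kmed]
        # ascent step on v, renormalized to the unit sphere
        v = [v[j] + eta * (bmed[j] - mu[j]) for j in range(d)]
        nv = sum(x * x for x in v) ** 0.5 or 1.0
        v = [x / nv for x in v]
        # descent step on mu along v, scaled by the current objective value
        mu = [mu[j] + eta * max(phi, 0.0) * v[j] for j in range(d)]
    return mu
```

Scaling the descent step by the current objective value makes any point where the median projection vanishes in every direction a fixed point of the iteration.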

3.2 The adversarial corruption model and two models for inliers.
In this section, we introduce the assumptions under which we will obtain statistical upper bounds for the Fenchel-Legendre minimum estimators introduced above. We consider two types of assumptions: one for the outliers, which will be the adversarial corruption model, and one for the inliers, which will be either the existence of a second moment or a regularity assumption on a family of cdfs around 0. We start with the adversarial corruption model.
Assumption 1. There exist N independent random vectors (X̃_i)^N_{i=1} in R^d. These vectors are first given to an "adversary" who is allowed to modify up to |O| of them. This modification does not have to follow any rule. Then, the adversary gives the modified dataset (X_i)^N_{i=1} to the statistician. Hence, the statistician receives an "adversarially" contaminated dataset of N vectors in R^d which can be partitioned into two groups: the modified data (X_i)_{i∈O}, which can be seen as outliers, and the "good data" or inliers (X_i)_{i∈I}, such that for all i ∈ I, X_i = X̃_i. Of course, the statistician does not know which data have been modified, so the partition O ∪ I = {1, …, N} is unknown to the statistician.
In the adversarial contamination model from Assumption 1, the set O ⊂ [N] can depend arbitrarily on the initial data (X̃_i)^N_{i=1}; the corrupted data (X_i)_{i∈O} can have any arbitrary dependence structure; and the informative data (X_i)_{i∈I} may also be correlated (this is, for instance, in general the case when the |O| data X̃_i with largest ℓ^d_2-norm are modified by the adversary). The adversarial corruption model covers the Huber ε-contamination model [16] and also the O ∪ I framework from [21,23,27]. Assumption 1 does not grant any property of the inlier data (X̃_i)_{i∈[N]} except that they are independent. We will obtain a general result under Assumption 1 alone in Section 4. However, to recover convergence rates similar to the one in Theorem 1 or the subgaussian rate in (5), we also need some assumptions on the inliers. We now consider two such assumptions, which are of a different nature.
The two assumptions on the inliers we now consider are related to a subtle property of the median-of-means (MOM) principle, which somehow benefits from its two components: the empirical median and the empirical mean. Indeed, MOM is an empirical median of empirical means, and if we refer to the classical asymptotic normality (a.n.) results for the empirical mean and the empirical median, the first holds under the existence of a second moment and the second under the assumption that the cdf is differentiable at the median with a positive derivative there (see Corollary 21.5 in [39]). We therefore recover these two types of assumptions when we work with estimators based on the MOM principle. A nice feature of MOM-based estimators is that their estimation results hold under either one of the two conditions and do not require both assumptions to hold simultaneously. We can therefore consider the two assumptions independently and obtain two estimation results for the Fenchel-Legendre minimum estimators introduced above (which are based on the MOM principle). We start with the moment assumption.
Assumption 2. The N independent random vectors (X̃_i)^N_{i=1} have mean µ* and there exists a positive semi-definite matrix Σ such that for all i ∈ [N] and all v ∈ R^d, E⟨X̃_i − µ*, v⟩² ≤ ‖Σ^{1/2}v‖²_2.

Most of the statistical bounds obtained for MOM-based estimators have focused on the heavy-tailed setup and have therefore considered Assumption 2 as their main assumption. This is the 'empirical mean component' of the MOM principle, which has been the most exploited so far. It is however also possible to use the 'empirical median component' of the MOM principle to obtain statistical bounds even in cases where a first moment does not exist. In that case, µ* is called a location parameter and Σ a scale parameter. A natural assumption is then similar to the one used to get the a.n. of the empirical median, that is, an assumption on the cdf at the median, adapted to the multidimensional and non-asymptotic setup. We now introduce such an assumption.

Assumption 3. The inlier data (X̃_i)^N_{i=1} are i.i.d. There exist µ* ∈ R^d and two absolute constants c_0 > 0 and c_1 > 0 such that the following holds: for all v ∈ S and all 0 < r ≤ c_0, H_{N,K,v}(r) ≤ 1/2 − c_1 r, where H_{N,K,v}(r) = P(⟨X̄_1 − µ*, v⟩ ≥ r) and X̄_1 is the bucketed mean on a block of N/K inliers.

A typical example where Assumption 3 holds is when S = S^{d−1}_2 (that is, for the location estimation problem w.r.t. the Euclidean ℓ^d_2 norm) and the X̃_i's are rotationally invariant, that is, when for all v ∈ S^{d−1}_2, ⟨X̃_1 − µ*, v⟩ has the same distribution as ⟨X̃_1 − µ*, e_1⟩, where e_1 = (1, 0, …, 0) ∈ R^d. In that case, X̃_1 has the same distribution as µ* + RU, where R is a real-valued random variable on R_+ independent of U, a random vector uniformly distributed over S^{d−1}_2. In that case, and for K = N, for all v ∈ S^{d−1}_2 and all r ∈ R, the function H(r) = P(⟨X̃_1 − µ*, v⟩ ≥ r) can be written as an integral against P_R, the probability distribution of R, involving the density f of ⟨X̃_1 − µ*, v⟩ and a normalization constant C_d which can be proved to satisfy √d ≤ C_d ≤ 6√d (see, for instance, Chapter 4 in [5]). In particular, it follows from the mean value theorem that for all r ≥ 0, H(r) ≤ H(0) − min_{0≤x≤r} f(x) · r = 1/2 − f(r) r. Therefore, Assumption 3 holds in that case when there exist constants c_0, c_1 > 0 such that f(c_0) ≥ c_1. This is the case when R is of order √d with constant probability, for instance when X̃_1 is Gaussian (by the Borell-TIS inequality), but also when R is the positive part of a Cauchy variable. As a consequence, Assumption 3 has nothing to do with the existence of any moment, and it may hold even when there is no first moment and even for K = N.
Another example where Assumption 3 holds, which we will use below to obtain statistical bounds for the coordinate-wise median-of-means in the location problem, is when S = {±e_j : j ∈ [d]} and X̃_1 = µ* + Z, where Z = (z_j)^d_{j=1} is a random vector in R^d with coordinates z_1, …, z_d having a Cauchy distribution symmetric around 0. In that case, X̃_1 does not have a first moment and µ* is a location parameter, as the center of symmetry of the distribution of X̃_1. For all j ∈ [d] and all 0 < r ≤ 1 we have

P( ⟨X̃_1 − µ*, ±e_j⟩ ≥ r ) = 1/2 − arctan(r)/π ≤ 1/2 − r/4,

since arctan(r) ≥ (π/4)r on [0, 1]. Therefore, Assumption 3 holds in that case as well.
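The Cauchy computation above is easy to verify numerically: the survival function of a standard symmetric Cauchy variable is 1/2 − arctan(r)/π, and the concavity of arctan together with arctan(1) = π/4 gives the linear bound of Assumption 3 with c_0 = 1 and c_1 = 1/4. A quick check (our own illustration):

```python
import math

def H(r):
    # Survival function of a standard symmetric Cauchy variable:
    # P(z >= r) = 1/2 - arctan(r)/pi.
    return 0.5 - math.atan(r) / math.pi

# Assumption 3 with c0 = 1, c1 = 1/4: H(r) <= 1/2 - r/4 on (0, 1],
# since arctan is concave with arctan(0) = 0 and arctan(1) = pi/4.
ok = all(H(k / 100) <= 0.5 - (k / 100) / 4 + 1e-12 for k in range(1, 101))
print("Assumption 3 holds for the Cauchy example on (0, 1]:", ok)
```

At r = 1 the bound is tight: H(1) = 1/2 − (π/4)/π = 1/4 = 1/2 − 1/4.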

3.3 Estimation bounds w.r.t. ‖·‖_S

In this section, we obtain estimation bounds w.r.t. ‖·‖_S for μ̂^f_S and μ̂^g_S in the adversarial contamination model, with either the L_2-moment Assumption 2 or the regularity-at-0 Assumption 3.

Estimation properties of μ̂^f_S and μ̂^g_S under Assumption 2. In this section, we obtain high-probability estimation upper bounds satisfied by μ̂^f_S and μ̂^g_S w.r.t. ‖·‖_S in the adversarial contamination and heavy-tailed inliers model. The rate of convergence is given by the quantity

r*_S = max( E sup_{v∈S} ⟨v, N^{−1} ∑_{i=1}^N ε_i(X̃_i − µ*)⟩ , sup_{v∈S} ‖Σ^{1/2}v‖_2 √(K/N) ).   (13)

The key metric property satisfied by the two Fenchel-Legendre transforms f*_S and g*_S in the adversarial contamination and heavy-tailed inliers model is the following isomorphic result.

Lemma 1. Grant Assumption 1 and Assumption 2. Let S be a symmetric subset of R^d and let r*_S be defined in (13). For all K > 16|O|, with probability at least 1 − exp(−K/512), for all µ ∈ R^d,

‖µ − µ*‖_S − r*_S ≤ f*_S(µ) ≤ ‖µ − µ*‖_S + r*_S,

and the same holds for g*_S. Such an isomorphic result may be used to upper bound the ‖·‖_S distance between μ̂^f_S, a minimizer of f*_S, and µ*, the minimizer of µ ↦ ‖µ − µ*‖_S. This leads to the following result.
Theorem 5. Grant Assumption 1 and Assumption 2. Let S be a symmetric subset of R^d and let r*_S be defined in (13). For all K > 16|O|, with probability at least 1 − exp(−K/512),

‖μ̂^f_S − µ*‖_S ≤ 2r*_S  and  ‖μ̂^g_S − µ*‖_S ≤ 2r*_S.

The rate r*_S obtained in Theorem 5 can be split into two terms: a complexity term given by the Rademacher complexity and a deviation term exhibiting the weak variance term, as in the Gaussian case. Compared with Theorem 1 from [30], this result shows that the Gaussian mean width term appearing in Theorem 1 is actually not necessary; it also shows that this improved rate can be obtained by a procedure that solves a convex program and that it can handle adversarial corruption. When S = B^d_2, we recover the classical subgaussian rate because, in that case, the Rademacher complexity term in r*_S is at most √(Tr(Σ)/N) [19]. In particular, since μ̂^g_S is the minmax MOM estimator in that case, we recover the main result from [28].

Estimation properties of μ̂^g_S under Assumption 3. In this section, we consider cases where a first moment may not exist; in that case, µ* is a location parameter for which Assumption 3 holds. The rate of convergence we obtain in that case is the quantity r⋄ defined in (14), where c_1 is the absolute constant from Assumption 3, C_0 the absolute constant from (28) and u > 0 a confidence parameter.
The following result is an isomorphic result satisfied by the Fenchel-Legendre transform g*_S under Assumption 3. It is similar to Lemma 1, but with the rate r⋄.

Lemma 2. Let S be a symmetric subset of R^d. Grant Assumption 1 and Assumption 3 for some K ∈ [N]. Then, with probability at least 1 − exp(−u), for all µ ∈ R^d, ‖µ − µ*‖_S − r⋄ ≤ g*_S(µ) ≤ ‖µ − µ*‖_S + r⋄.

As explained below Lemma 1, a result such as Lemma 2 may be used to upper bound the ‖·‖_S distance between μ̂^g_S, a minimizer of g*_S, and µ*, the minimizer of µ ↦ ‖µ − µ*‖_S. This yields the following result.

Theorem 6. Under the assumptions of Lemma 2, with probability at least 1 − exp(−u), ‖μ̂^g_S − µ*‖_S ≤ 2r⋄, where r⋄ is defined in (14).

Unlike Theorem 5, Theorem 6 may hold even when there is no first moment. The result of Theorem 6 holds for all 0 < u ≲ K, whereas Theorem 5 holds only for u = K (even though one may use a Lepski-type adaptive scheme to choose K adaptively). The price for adversarial corruption in (14) is between |O|/N (for K ∼ N) and √(|O|/N) (for K ∼ |O|). It therefore depends on the choice of K for which Assumption 3 holds. As shown after Assumption 3, for spherically symmetric random variables one can take K = N, so the best possible price |O|/N for adversarial corruption may be achieved even when a first moment does not exist. If some averaging effect is needed for Assumption 3 to hold, then one should take K as small as possible, that is, K ∼ |O|, and then √(|O|/N) will be the price for adversarial corruption, as in the L_2 case described in Theorem 5.
Subgaussian rates under weak or no moment assumptions. It is possible to recover (up to absolute constants) the subgaussian rate (5) in Theorem 5 for K ∼ log(1/δ) when the Rademacher complexity term from (13) and the Gaussian mean width from (5) satisfy

E sup_{v∈S} ⟨v, N^{−1} ∑_{i=1}^N ε_i(X̃_i − µ*)⟩ ≤ c ℓ*(Σ^{1/2}S)/√N.   (15)

Such a result (i.e. the Rademacher complexity is smaller than the Gaussian mean width up to a constant) depends on the set S, the number of moments granted on the X̃_i's, and the sample size. It obviously holds when the X̃_i's are i.i.d. N(µ*, Σ), so that we recover the deviation-minimax optimal subgaussian rate (5) in that case. It is also true when the X̃_i's are subgaussian vectors. There are other situations, under weaker moment assumptions, where (15) holds. For instance, when S = B^d_2, (15) holds under only an L_2-moment assumption (see [19]). It also holds for S = B^d_1 when the X̃_i's are isotropic with coordinates having log d subgaussian moments (i.e. ‖⟨X̃_i, e_j⟩‖_{L_p} ≤ L√p for all 1 ≤ p ≤ log d and every coordinate j ∈ [d]) and N ≳ log d. Together with (9) and Theorem 5, this implies that the coordinate-wise MOM is a subgaussian estimator of the mean under a log d subgaussian moment assumption. Upper bounds such as (15) have been extended in [33] to general unconditional norms.
It is also possible to recover the subgaussian rate (5) in situations where there is not even a first moment, thanks to Theorem 6. Indeed, for the case S = B_1^d and X_1 = µ* + Z, where Z = (z_j)_{j=1}^d has coordinates that are symmetric around 0 and Cauchy distributed, we showed that Assumption 3 holds for K = N and that μ_gS is the coordinate-wise median (here K = N) in (9). It follows from Theorem 6 that, when d ≲ N and |O| ≲ N, then for all d ≤ u ≲ N, with probability at least 1 − exp(−u), (16) holds, which is the deviation-minimax optimal subgaussian rate (5) we would have gotten if the X_i were i.i.d. isotropic Gaussian vectors centered at µ* corrupted by |O| adversarial outliers (up to absolute constants). But here, (16) is obtained without the existence of a first moment. Moreover, in (16), the number of outliers is allowed to be proportional to N, and the price for adversarial corruption is of the order of |O|/N, which is the same price we have to pay when the inliers have a Gaussian distribution; this differs from the √(|O|/N) information-theoretic lower bound that has been obtained for some non-symmetric inliers. Furthermore, the computational cost of the coordinate-wise MOM is O(Nd): the cost of computing the bucketed means is O(Nd) and that of finding the median of K numbers is O(K) [3], so the total cost is the same as that of the empirical mean. It is therefore possible to achieve the same computational and statistical properties as the empirical mean in a setup where a first moment does not even exist.
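A minimal NumPy sketch of the coordinate-wise median-of-means estimator discussed above (the function name and the equal-size random bucketing are illustrative implementation choices):

```python
import numpy as np

def coordinatewise_mom(X, K):
    """Coordinate-wise median-of-means: split the N rows of X into K
    equal-size buckets, average within each bucket (cost O(Nd)), then
    take the median of the K bucketed means coordinate by coordinate
    (a median of K numbers can be found in O(K) time), for a total
    cost of O(Nd), as for the empirical mean."""
    N, _ = X.shape
    buckets = np.array_split(np.random.permutation(N), K)
    bucket_means = np.stack([X[b].mean(axis=0) for b in buckets])  # (K, d)
    return np.median(bucket_means, axis=0)
```

With K = N each bucket is a single sample and the estimator reduces to the coordinate-wise median, which remains consistent for Cauchy coordinates even though no first moment exists.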

Proofs
Proof of Theorem 3. The minimax lower bound rate r* exhibits two quantities: a complexity term depending on the Gaussian mean width of Σ^{1/2}S and a deviation term depending on δ. The two terms come from two different arguments. We start with the deviation term. Let v_1 ∈ R^d be such that ‖v_1‖_S = 1. We consider two Gaussian measures on R^{dN}: P_0 = N(0, Σ)^{⊗N} and P_1 = N(3r*v_1, Σ)^{⊗N}. They are the distributions of a sample of N i.i.d. Gaussian vectors in R^d with the same covariance matrix Σ, the first one with mean 0 and the second one with mean 3r*v_1. The key ingredient for the deviation lower bound term is a slight generalization of Lemma 3.3 in [22], which is based on a version of the Gaussian shift theorem from [29].

Lemma 3. Let t ↦ Φ(t) = P(g ≤ t) be the cumulative distribution function of a standard Gaussian random variable on R. Here, the matrix appearing in the statement is the square root of the pseudo-inverse of Σ_0.
Proof of Lemma 3. When Σ_0 = I_{Nd}, Lemma 3 is exactly Lemma 3.3 in [22] for σ = 1. To prove Lemma 3, we observe that G is a standard Gaussian vector in Im(Σ_0). Hence, it follows from Lemma 3.3 in [22] that the claimed bound holds, which is exactly (17).
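The two-point reduction above can be checked numerically in the simplest case d = 1, Σ = 1: the best test between N(0,1)^{⊗N} and N(θ,1)^{⊗N} compares the sample mean to θ/2 and errs with probability Φ(−√N θ/2), which is why a deviation term of order √(log(1/δ)/N) cannot be avoided. The function names below are illustrative:

```python
import math
import numpy as np

def gauss_cdf(t):
    """Standard normal CDF Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def lr_test_error(N, theta, trials=100_000, seed=0):
    """Monte Carlo type-I error, under H0 (mean 0), of the likelihood-
    ratio test between N(0,1)^N and N(theta,1)^N, which rejects H0 when
    the sample mean exceeds theta/2; it matches Phi(-sqrt(N)*theta/2)."""
    rng = np.random.default_rng(seed)
    sample_means = rng.standard_normal((trials, N)).mean(axis=1)
    return float((sample_means > theta / 2.0).mean())
```

Doubling √N (here, taking N four times larger) strictly shrinks the achievable error, matching the Φ(−√N θ/2) formula.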
Lemma 4. Let A ∈ R^{d×d} be a symmetric and invertible matrix. Let ‖·‖ be a norm on R^d and ‖·‖_* its dual norm. Let S be a symmetric subset of R^d such that span(S) = R^d.

Proof of Lemma 4. Let v be such that ‖v‖_S = 1 and let w ∈ S. We have |⟨v, w⟩| ≤ 1, and so the claimed inequality follows. The latter holds for all v such that ‖v‖_S = 1, and {A^{-1}v/‖A^{-1}v‖ : ‖v‖_S = 1} is the unit sphere of ‖·‖. Hence, we conclude by taking the supremum over v such that ‖v‖_S = 1 and w ∈ S.
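As a toy instance of the norm duality manipulated in Lemma 4: for ‖·‖ = ‖·‖_1, the supremum of ⟨v, w⟩ over the unit ball is attained at a vertex ±e_j of the cross-polytope and equals the dual norm ‖w‖_∞. A sketch with illustrative names (not the paper's notation):

```python
import numpy as np

def sup_over_l1_ball(w):
    """sup{<v, w> : ||v||_1 <= 1}: by convexity the sup over the l1
    ball is attained at one of the 2d vertices +/- e_j, so it equals
    the dual norm ||w||_inf."""
    w = np.asarray(w, dtype=float)
    vertices = np.vstack([np.eye(len(w)), -np.eye(len(w))])
    return float((vertices @ w).max())
```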
The deviation part of the lower bound then follows from (20) and Lemma 4. Let us now turn to the second part of the lower bound, the one coming from the complexity of the problem (here, the Gaussian mean width of Σ^{1/2}S). We know that μ̂ is an estimator such that (22) holds for all µ, where we set φ : t ∈ R ↦ I(t > 1) and E_µ^N is the expectation with respect to X_1, ..., X_N i.i.d.
∼ N(µ, Σ). Next, we consider a Gaussian distribution γ over the set of parameters µ ∈ R^d: for s > 0, we assume that µ ∼ N(0, sΣ). It then follows from (22) that (23) holds. In other words, we lower bound the minimax risk by a Bayesian risk. We now use Anderson's lemma to lower bound the Bayesian risk appearing in (23). We first recall Anderson's lemma.
Theorem 7 (Anderson's Lemma). Let Γ be a positive semi-definite d × d matrix and Z ∼ N(0, Γ). Let w : R^d → R be such that all its level sets (i.e., {x ∈ R^d : w(x) ≤ c} for c ∈ R) are convex and symmetric around the origin. Then, for all x ∈ R^d, Ew(Z + x) ≥ Ew(Z).
Proof of Lemma 1. We first prove the result for the function g*_S. The proof for f*_S is similar up to constants and will be sketched afterwards. The proof of Lemma 1 for g*_S is a corollary of a general fact which holds under Assumption 1 only. Let u > 0 be a confidence parameter and define R*_S as in (24). Let us show that, with large probability, |g*_S(µ) − ‖µ − µ*‖_S| ≤ R*_S for all µ. For all µ ∈ R^d, we bound |g*_S(µ) − ‖µ − µ*‖_S| using that S is symmetric and g is odd. It only remains to show that g*_S(µ*) ≤ R*_S with large probability. To that end, it is enough to prove that, with large probability, the required bound holds for all v ∈ S. We use the notation introduced in Assumption 1 and consider X̄_k = |B_k|^{-1} Σ_{i∈B_k} X_i for k ∈ [K], the K bucketed means constructed from the N independent vectors X_i, i ∈ [N], before contamination (as opposed to the bucketed means constructed after contamination). We also set 𝒦 = {k ∈ [K] : B_k ∩ O = ∅}, the set of indices of the non-corrupted blocks. It only remains to show that, with probability at least 1 − exp(−u), the centered indicator sum is small for all v ∈ S. We have I(t ≥ 1) ≤ φ(t) ≤ I(t ≥ 1/2) for all t ∈ R. Next, we use several tools from empirical process theory; in particular, for a symmetrization argument, we consider a family of N independent Rademacher variables (ε_i)_{i=1}^N independent of the (X_i)_{i=1}^N. In (bdi) below, we use the bounded difference inequality (Theorem 6.2 in [4]). In (sa-cp), we use the symmetrization argument and the contraction principle (Chapter 4 in [25]); we refer to the supplementary material of [27] for more details. We obtain the stated control with probability at least 1 − exp(−u). We therefore showed that, under Assumption 1, with probability at least 1 − exp(−u), |g*_S(µ) − ‖µ − µ*‖_S| ≤ R*_S for all µ ∈ R^d. Now, if Assumption 2 holds, then for all v ∈ S, Markov's inequality yields the required tail bound, and therefore (24) holds for R*_S = r*_S when |O| < K/8 and u = K/128. This proves the result of Lemma 1 for g*_S under Assumption 2.
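The key high-probability event in the proof of Lemma 1, that fewer than half of the bucketed means deviate from µ* by more than the threshold in any fixed direction v, can be probed numerically; the sketch below uses Student-t data and arbitrary illustrative parameters:

```python
import numpy as np

def fraction_deviating(X, v, mu_star, K, r, seed=0):
    """Fraction of the K bucketed means whose margin <Xbar_k - mu*, v>
    exceeds r; the proof needs this fraction to stay below 1/2 with
    probability at least 1 - exp(-u)."""
    rng = np.random.default_rng(seed)
    buckets = np.array_split(rng.permutation(X.shape[0]), K)
    margins = np.array([(X[b].mean(axis=0) - mu_star) @ v for b in buckets])
    return float((margins > r).mean())
```

With heavy-tailed but finite-variance data and a threshold r of the order of a few bucket standard deviations, the deviating fraction stays well below 1/2.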
Finally, for the function f*_S, one needs to control the average of the K/2 middle (inter-quartile) bucketed means. One way to do so is to control the value of all the elements ⟨X̄_k − µ*, v⟩ lying in the inter-quartile interval. This can be done by defining an R*_S similar to the one in (24), but where the right-hand side value 1/2 is replaced by 1/4. This only modifies the absolute constants, which are the ones used in Lemma 1.
Proof of Lemma 2. Unlike in Lemma 1, where we used Rademacher complexities as a complexity measure, in this proof the complexity measure we use is the Vapnik-Chervonenkis (VC) dimension [41, 42] of a class F of Boolean functions, i.e., of functions from R^d to {0, 1} in our case. We recall that the VC dimension of F, denoted by VC(F), is the maximal integer n such that there exist x_1, ..., x_n ∈ R^d for which the set {(f(x_1), ..., f(x_n)) : f ∈ F} is of maximal cardinality, that is, of size 2^n. The VC dimension of the set of all indicators of affine half-spaces in R^d is d + 1 (see Example 2.6.1 in [40]). We also know (see, for instance, Chapter 3 in [18]) the following concentration bound: let Y_1, ..., Y_n be independent random vectors in R^d; there exists an absolute constant C_0 such that for all u > 0, with probability at least 1 − exp(−u), (28) holds. Lemma 2 is a corollary of a general result which holds under Assumption 1 only. This general result says that for all u > 0, with probability at least 1 − exp(−u), for all µ ∈ R^d, |g*_S(µ) − ‖µ − µ*‖_S| ≤ R⋄, where R⋄ is any point such that (29) holds, where C_0 is the constant from (28). In particular, when Assumption 3 holds, one can check that (29) holds for R⋄ = r⋄ when r⋄ ≤ c_0, proving the result of Lemma 2. It only remains to show the general result.
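The shattering definition can be illustrated by a tiny brute-force check in d = 1, where affine half-spaces are the indicators I(ax + b > 0) and the VC dimension is d + 1 = 2; the grid of (a, b) values below is an arbitrary illustrative choice:

```python
def shatters(points, classifiers):
    """Brute-force check that the family of classifiers realizes all
    2^n labelings of the given points, i.e. shatters them."""
    achieved = {tuple(f(x) for x in points) for f in classifiers}
    return len(achieved) == 2 ** len(points)

# Indicators of affine half-spaces in R^1: x -> I(a*x + b > 0),
# sampled on a small grid of (a, b); enough for the point sets below.
halfspaces_1d = [
    (lambda x, a=a, b=b: int(a * x + b > 0))
    for a in (-2.0, -1.0, 1.0, 2.0)
    for b in (-3.0, -1.5, 0.0, 1.5, 3.0)
]
```

Two points on the line are shattered, but no three points are: a·x + b is monotone in x, so only prefix/suffix labelings are reachable and, e.g., (1, 0, 1) is not, consistent with VC dimension d + 1 = 2.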
To that end, we follow the same strategy as in the proof of Lemma 1 up to (27) (with R*_S replaced by R⋄). From that point on, we use (28) and the VC dimension of the set of affine half-spaces to get that, with probability at least 1 − exp(−u), the empirical process is uniformly controlled over v ∈ S, and so, by definition of R⋄, on the same event, for all v ∈ S, K^{-1} Σ_{k∈[K]} I(⟨X̄_k − µ*, v⟩ > R⋄) < 1/2. This concludes the proof.

Theorem 6. Let S be a symmetric subset of R^d. Grant Assumption 1 and Assumption 3 for some K ∈ [N]. Let u > 0 and assume that C_0√((d + 1)/K) + √(u/K) + |O|/K ≤ c_0c_1. Then, with probability at least 1 − exp(−u), ‖μ_gS − µ*‖_S ≤ 2r⋄.

From Lemma 1, |g*_S(µ) − ‖µ − µ*‖_S| ≤ r*_S for all µ ∈ R^d, and the same holds for f*_S. It means that both g*_S and f*_S are convex functions equivalent (up to absolute constants) to µ ↦ ‖µ − µ*‖_S on R^d \ (2r*_S)B_S, where B_S is the unit ball associated with ‖·‖_S and, on (2r*_S)B_S, they are both smaller than 2r*_S. Hence, both g*_S(· − µ*) and f*_S(· − µ*) provide a good approximation of the metric space (R^d, ‖·‖_S). In particular, any minimizer of g*_S or f*_S will be close (up to r*_S) to the minimizer of µ ↦ ‖µ − µ*‖_S, which is µ*. This explains the statistical properties of μ_fS and μ_gS: from Lemma 1,