Confidence regions and minimax rates in outlier-robust estimation on the probability simplex

We consider the problem of estimating the mean of a distribution supported by the $k$-dimensional probability simplex in the setting where an $\varepsilon$ fraction of observations are subject to adversarial corruption. A simple particular example is the problem of estimating the distribution of a discrete random variable. Assuming that the discrete variable takes $k$ values, the unknown parameter $\boldsymbol \theta$ is a $k$-dimensional vector belonging to the probability simplex. We first describe various settings of contamination and discuss the relations between these settings. We then establish minimax rates when the quality of estimation is measured by the total-variation distance, the Hellinger distance, or the $\mathbb L^2$-distance between two probability measures. We also provide confidence regions for the unknown mean that shrink at the minimax rate. Our analysis reveals that the minimax rates associated with these three distances are all different, but they are all attained by the sample average. Furthermore, we show that the latter is adaptive to the possible sparsity of the unknown vector. Some numerical experiments illustrating our theoretical findings are reported.


Introduction
Assume X_1, ..., X_n are n independent random variables taking values in the k-dimensional probability simplex ∆^{k−1} = {v ∈ R^k_+ : v_1 + ... + v_k = 1}. Our goal is to estimate the unknown vector θ = E[X_i] in the case where the observations are contaminated by outliers. In this introduction, to convey the main messages, we limit ourselves to Huber's contamination model, although our results apply to the more general adversarial contamination. Huber's contamination model assumes that there are two probability measures P, Q on ∆^{k−1} and a real ε ∈ [0, 1/2) such that each X_i is drawn from the mixture distribution (1 − ε)P + εQ. In general, all three parameters P, Q and ε are unknown. The parameter of interest is some functional (such as the mean, the standard deviation, etc.) of the reference distribution P, whereas Q and ε play the role of nuisance parameters.
When the unknown parameter lives on the probability simplex, there are many appealing ways of defining the risk. We focus on the following three metrics: the total-variation distance d_TV(θ, θ̄) = (1/2) Σ_j |θ_j − θ̄_j|, the Hellinger distance d_H(θ, θ̄) = ((1/2) Σ_j (√θ_j − √θ̄_j)²)^{1/2} and the L²-distance d_{L²}(θ, θ̄) = (Σ_j (θ_j − θ̄_j)²)^{1/2}. The Hellinger distance above is well defined when the estimator θ̄ is non-negative, which will be the case throughout this work. We will further assume that the dimension k may be large, but the vector θ is s-sparse, for some s ≤ k, i.e. #{j : θ_j ≠ 0} ≤ s. Our main interest is in constructing confidence regions and evaluating the minimax risk, where the inf is over all estimators θ̂_n built upon the observations X_1, ..., X_n iid ∼ (1 − ε)P + εQ and the sup is over all distributions P, Q on the probability simplex such that the mean θ of P is s-sparse. The subscript of R above refers to the distance used in the risk, so that it is TV, H, or L².
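For concreteness, the three distances above can be computed as follows. This is a minimal sketch (the function names are ours, and the 1/2 normalization of the Hellinger distance shown here is one common convention; the paper's normalization may differ by a constant factor):

```python
import numpy as np

def tv_distance(p, q):
    """Total-variation distance: half the L1 distance between the vectors."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def hellinger_distance(p, q):
    """Hellinger distance with the convention d_H^2 = (1/2) * sum_j (sqrt(p_j) - sqrt(q_j))^2.
    Requires non-negative entries, as assumed in the text."""
    return np.sqrt(0.5) * np.linalg.norm(np.sqrt(np.asarray(p)) - np.sqrt(np.asarray(q)))

def l2_distance(p, q):
    """Euclidean (L2) distance between the two probability vectors."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q))

theta = np.array([0.5, 0.3, 0.2])
theta_bar = np.array([0.4, 0.4, 0.2])
```

With this normalization, the classical Le Cam inequality d_H² ≤ d_TV holds for any pair of probability vectors.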
The problem described above arises in many practical situations. One example is an election poll: each participant expresses his intention to vote for one of k candidates. Thus, each θ_j is the true proportion of electors of candidate j. The results of the poll contain outliers, since some participants of the poll prefer to hide their true opinion. Another example, still related to elections, is the problem of counting votes across all constituencies. Each constituency communicates a vector of proportions to a central office, which is in charge of computing the overall proportions. However, in some constituencies (hopefully a small fraction only) the results are rigged. Therefore, the set of observed vectors contains some outliers.
We intend to provide non-asymptotic upper and lower bounds on the minimax risk that match up to numerical constants. In addition, we will provide confidence regions of the form B(θ̂_n, r_{n,ε,δ}) = {θ : d(θ̂_n, θ) ≤ r_{n,ε,δ}} containing the true parameter with probability at least 1 − δ and such that the radius r_{n,ε,δ} goes to zero at the same rate as the corresponding minimax risk.
When there is no outlier, i.e., ε = 0, it is well known that the sample mean is minimax-rate-optimal and that the rates corresponding to the distances d_TV, d_H and d_{L²} are, respectively, of order (s/n)^{1/2}, (s/n)^{1/2} and (1/n)^{1/2}. This raises several questions in the setting where the data contain outliers. In particular, the following three questions will be answered in this work: Q1. How do the risks R depend on ε? What is the largest proportion of outliers for which the minimax rate is the same as in the outlier-free case? Q2. Does the sample mean remain optimal in the contaminated setting? Q3. What happens if the unknown parameter θ is s-sparse?
imsart-ejs ver. 2014/10/16 file: main.tex date: February 4, 2020

The most important step for answering these questions is to show that R_TV ≍ (s/n)^{1/2} + ε, R_H ≍ (s/n)^{1/2} + ε^{1/2} and R_{L²} ≍ (1/n)^{1/2} + ε. It is surprising to see that all three rates are different, leading to important discrepancies in the answers to the second part of question Q1 for the different distances. Indeed, it turns out that the minimax rate does not deteriorate if the proportion of outliers is smaller than (s/n)^{1/2} for the TV-distance, s/n for the Hellinger distance and (1/n)^{1/2} for the L²-distance. Furthermore, we prove that the sample mean is minimax rate optimal. Thus, even when the proportion of outliers ε and the sparsity s are known, it is not possible to improve upon the sample mean. In addition, we show that all these claims hold true under adversarial contamination, and we provide the corresponding confidence regions.
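The interplay between the three rates and the corresponding outlier tolerances can be made concrete with a small helper evaluating the rate formulas up to unspecified constant factors (the function names are ours):

```python
import numpy as np

def minimax_rates(n, s, eps):
    """Minimax rates (up to constants) for the three distances:
    TV: sqrt(s/n) + eps, Hellinger: sqrt(s/n) + sqrt(eps), L2: sqrt(1/n) + eps."""
    return {
        "TV": np.sqrt(s / n) + eps,
        "H": np.sqrt(s / n) + np.sqrt(eps),
        "L2": np.sqrt(1 / n) + eps,
    }

def outlier_tolerance(n, s):
    """Largest contamination rate that leaves each rate unchanged up to constants."""
    return {"TV": np.sqrt(s / n), "H": s / n, "L2": np.sqrt(1 / n)}
```

For instance, with n = 10^4 and s = 100, the TV rate tolerates ε up to 0.1, while the L² rate only tolerates ε up to 0.01, illustrating how strongly the answer depends on the chosen distance.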
The rest of the paper is organized as follows. Section 2 introduces different possible ways of modeling data sets contaminated by outliers. Pointers to relevant prior work are given in Section 3. Main theoretical results and their numerical illustration are reported in Section 4 and Section 5, respectively. Section 6 contains a brief summary of the obtained results and their consequences, whereas the proofs are postponed to the appendix.

Various models of contamination
Different mathematical frameworks have been used in the literature to model the outliers. We present here five of them, from the most restrictive one to the most general one, and describe their relationships. We present these frameworks in the general setting where the goal is to estimate the parameter θ* of a reference distribution P_θ* when a proportion ε of the observations are outliers.

Huber's contamination
The most popular framework for studying robust estimation methods is perhaps that of Huber's contamination. In this framework, there is a distribution Q, defined on the same space as the reference distribution P_θ*, such that all the observations X_1, ..., X_n are independent and drawn from the mixture distribution P_{ε,θ*,Q} := (1 − ε)P_θ* + εQ. This corresponds to the following mechanism: one decides with probabilities (1 − ε, ε) whether a given observation is an inlier or an outlier. If the decision is made in favor of being an inlier, the observation is drawn from P_θ*; otherwise, it is drawn from Q. More formally, if we denote by O the random set of outliers, then, conditionally on O = O, the outliers {X_i : i ∈ O} are iid with distribution Q and independent of the inliers {X_i : i ∈ O^c}, which are iid with distribution P_θ*, for every O ⊂ {1, ..., n}. Furthermore, every subset O of the observations is the set of outliers with probability P(O = O) = ε^{|O|}(1 − ε)^{n−|O|}. We denote by M^HC_n(ε, θ*) the set of joint probability distributions P_n of the random variables X_1, ..., X_n satisfying the foregoing condition.
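The two-step mechanism just described can be sketched as follows for categorical inlier and outlier distributions on the simplex vertices (a sketch; the function name and the one-hot encoding are ours):

```python
import numpy as np

def sample_huber(n, theta, q, eps, rng=None):
    """Draw n observations from the mixture (1 - eps) * P_theta + eps * Q,
    where P_theta and Q are categorical laws on the vertices e_1, ..., e_k
    of the simplex. Returns the one-hot observations and the outlier flags."""
    rng = np.random.default_rng(rng)
    k = len(theta)
    is_outlier = rng.random(n) < eps                 # Bernoulli(eps) decision per observation
    inlier_draws = rng.choice(k, size=n, p=theta)    # draws from the reference P_theta
    outlier_draws = rng.choice(k, size=n, p=q)       # draws from the contamination Q
    labels = np.where(is_outlier, outlier_draws, inlier_draws)
    return np.eye(k)[labels], is_outlier
```

Note that both the set of outliers and its cardinality are random here, which is exactly the feature motivating the deterministic variant introduced next.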

Huber's deterministic contamination
The set of outliers, as well as the number of outliers, in Huber's model of contamination is random. This makes it difficult to compare this model to the others that will be described later in this section.

To cope with this, we define here another model, termed Huber's deterministic contamination. As its name indicates, this new model has the advantage of containing a deterministic number of outliers, while at the same time being equivalent to Huber's contamination in a sense that will be made precise below.
We say that the distribution P_n of X_1, ..., X_n belongs to Huber's deterministic contamination model, denoted by M^HDC_n(ε, θ*), if there are a set O ⊂ {1, ..., n} of cardinality at most nε and a distribution Q such that (2) is true. The apparent similarity of the models M^HC_n(ε, θ*) and M^HDC_n(ε, θ*) can also be formalized mathematically in terms of the orders of magnitude of minimax risks. To ease notation, we let R_d(n, ε, Θ, θ̂) be the worst-case risk of an estimator θ̂, where the superscript is either HC or HDC. This definition assumes that the parameter space Θ is endowed with a pseudo-metric d.

Proposition 1 Let θ̂_n be an arbitrary estimator of θ*. For any ε ∈ (0, 1/2), the worst-case risk of θ̂_n over M^HC_n(ε, θ*) is bounded by its worst-case risk over M^HDC_n(2ε, θ*), up to an additive remainder term that decays exponentially in nε.

Proof in the appendix, page 11

Denote by D_Θ the diameter of Θ, D_Θ := max_{θ,θ'} d(θ, θ'). Proposition 2 implies that, when Θ is bounded, the last term is typically of smaller order than the minimax risk over M^HDC_n(2ε, Θ). Therefore, the minimax rate of estimation in Huber's model is not slower than the minimax rate of estimation in Huber's deterministic contamination model. This entails that a lower bound on the minimax risk established in the HC-model furnishes a lower bound in the HDC-model.
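A sampler for this model differs from the previous one only in that the outlier set is fixed in advance and has deterministic cardinality (a sketch; the function name is ours):

```python
import numpy as np

def sample_huber_deterministic(n, theta, q, eps, rng=None):
    """Huber's deterministic contamination: a set O of cardinality at most
    n*eps is chosen in advance; inliers are iid P_theta, outliers are iid Q.
    Both laws are categorical on the simplex vertices, as in the text."""
    rng = np.random.default_rng(rng)
    k, o = len(theta), int(n * eps)
    outlier_set = rng.choice(n, size=o, replace=False)  # fixed set, deterministic size o
    labels = rng.choice(k, size=n, p=theta)             # start with all inliers
    labels[outlier_set] = rng.choice(k, size=o, p=q)    # overwrite the outlier positions
    return np.eye(k)[labels], outlier_set
```

Unlike in the HC sampler, the number of contaminated observations here is exactly ⌊nε⌋, never more.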

Oblivious contamination
A third model of contamination that can be of interest is oblivious contamination. In this model, it is assumed that the set O, of cardinality o, and the joint distribution Q_O of the outliers are determined in advance, possibly based on the knowledge of the reference distribution P_θ*. Then, the outliers {X_i : i ∈ O} are drawn from Q_O, independently of the inliers. The set of all the joint distributions P_n of random variables X_1, ..., X_n generated by such a mechanism will be denoted by M^OC_n(ε, θ*). The model of oblivious contamination is strictly more general than that of Huber's deterministic contamination, since it does not assume that the outliers are iid. Therefore, the minimax risk over M^OC_n(ε, Θ) is larger than the minimax risk over M^HDC_n(ε, Θ). This inequality holds true for any set Θ, any contamination level ε ∈ (0, 1) and any sample size n.

Parameter contamination
In the three models considered above, the contamination acts on the observations. One can also consider the case where the parameters of the distributions of some observations are contaminated. More precisely, for some set O ⊂ {1, ..., n} selected in advance (but unobserved), the outliers {X_i : i ∈ O} are independent and independent of the inliers {X_i : i ∈ O^c}. Furthermore, each outlier X_i is drawn from a distribution Q_i = P_{θ_i} belonging to the same family as the reference distribution, but corresponding to a contaminated parameter θ_i ≠ θ*. Thus, the joint distribution of the observations can be written as (⊗_{i∈O^c} P_θ*) ⊗ (⊗_{i∈O} P_{θ_i}). The set of all such distributions P_n will be denoted by M^PC_n(ε, θ*), where PC refers to "parameter contamination".

Adversarial contamination
The last model of contamination we describe in this work, adversarial contamination, is the most general one. In this model, the data are generated by a two-stage mechanism: first, n iid random variables are drawn from the reference distribution P_θ*; then, an adversary, having access to these values, is allowed to replace at most nε of them by arbitrary values. We denote by M^AC_n(ε, θ*) the set of all the joint distributions P_n of the sequences X_1, ..., X_n generated by this mechanism. The set M^AC_n(ε, θ*) is larger than all four sets of contamination introduced in this section. Therefore, for any n, ε, Θ and any distance d, the minimax risk over M^AC_n(ε, Θ) is an upper bound on the minimax risks over all the other contamination models.

Minimax risk "in expectation" versus "in deviation"
Most prior work on robust estimation focused on establishing upper bounds on the minimax risk in deviation, as opposed to the minimax risk in expectation defined by (1). One of the reasons for dealing with the deviation is that it makes the minimax risk meaningful for models having a random number of outliers and an unbounded parameter space Θ. The formal justification of this claim is provided by the following result.

Proof in the appendix, page 12
This result shows, in particular, that the last term in (5), involving the diameter of Θ, is unavoidable. Such an explosion of the minimax risk occurs because Huber's model allows the number of outliers to be as large as n/2 with a strictly positive probability. One approach to overcoming this shortcoming is to use the minimax risk in deviation. Another approach is to limit theoretical developments to the models HDC, PC, OC or AC, in which the number of outliers is deterministic.

Prior work
Robust estimation has been an area of active research in statistics for at least five decades (Huber, 1964; Tukey, 1975; Donoho and Huber, 1983; Donoho and Gasko, 1992; Rousseeuw and Hubert, 1999). Until very recently, theoretical guarantees were almost exclusively formulated in terms of the notions of breakdown point, sensitivity curve, influence function, etc. These notions are well suited for accounting for gross outliers, observations that deviate significantly from the data points representative of an important fraction of the data set.
More recently, various authors investigated (Nguyen and Tran, 2013; Dalalyan and Chen, 2012; Chen et al., 2013) the behavior of the risk of robust estimators as a function of the rate of contamination ε. A general methodology for parametric models subject to Huber's contamination was developed in Chen et al. (2016, 2018). This methodology allowed for determining the rate of convergence of the minimax risk as a function of the sample size n, the dimension k and the rate of contamination ε. An interesting phenomenon was discovered: in the problem of robust estimation of the Gaussian mean, classic robust estimators such as the coordinatewise median or the geometric median do not attain the optimal rate (k/n)^{1/2} + ε. This rate is provably attained by Tukey's median, the computation of which is costly in high-dimensional settings.
In the model analyzed in this paper, we find the same minimax rate, (k/n)^{1/2} + ε, only when the total-variation distance is considered. A striking difference is that this rate is attained by the sample mean, which is efficiently computable in any dimension. This property is to some extent similar to the problem of robust density estimation (Liu and Gao, 2017), in which the standard kernel estimators are minimax optimal in the contaminated setting.
The computational intractability of Tukey's median motivated a large number of studies aimed at designing computationally tractable methods with nearly optimal statistical guarantees. Many of these works went beyond Huber's contamination by considering parameter contamination models (Bhatia et al., 2017; Collier and Dalalyan, 2017; Carpentier et al., 2018), oblivious contamination (Feng et al., 2014; Lai et al., 2016) or adversarial contamination (Diakonikolas et al., 2016; Balakrishnan et al., 2017; Diakonikolas et al., 2017, 2018). Interestingly, in the problem of estimating the Gaussian mean, it was proven that the minimax rates under adversarial contamination are within a factor at most logarithmic in n and k of the minimax rates under Huber's contamination. While each of the aforementioned papers clearly introduced the conditions on the contamination, to our knowledge, none of them described the different possible models and the relationships between them.
Another line of the growing literature on robust estimation aims at robustifying estimators and prediction methods against heavy-tailed distributions, see (Audibert and Catoni, 2011; Minsker, 2015; Donoho and Montanari, 2016; Devroye et al., 2016; Joly et al., 2017; Minsker, 2018; Lugosi and Mendelson, 2019; Lecué and Lerasle, 2017; Chinot et al., 2018). The results of those papers are of a different nature compared to the present work, not only in terms of the goals, but also in terms of the mathematical and algorithmic tools.

Minimax rates on the "sparse" simplex and confidence regions
We now specialize the general setting of Section 2 to a reference distribution P, with expectation θ*, defined on the simplex ∆^{k−1}. Along with this reference model describing the distribution of inliers, we will use different models of contamination. More precisely, we will establish upper bounds on the worst-case risks of the sample mean in the most general, adversarial, contamination setting. Then, matching lower bounds will be provided for minimax risks under Huber's contamination.

Upper bounds: worst-case risk of the sample mean
We denote by ∆^{k−1}_s the set of all v ∈ ∆^{k−1} having at most s non-zero entries.

Proof in the appendix, page 13
An unexpected and curious phenomenon unveiled by this theorem is that all three rates are different. As a consequence, the answer to the question "what is the largest possible number of outliers, o*_d(n, s), that does not impact the minimax rate of estimation of θ*?" crucially depends on the considered distance d. Taking into account the relation ε = o/n, we get o*_TV(n, s) ≍ (sn)^{1/2}, o*_H(n, s) ≍ s and o*_{L²}(n, s) ≍ n^{1/2}. Furthermore, all the claims concerning the total-variation distance, in the considered model, yield corresponding claims for the Wasserstein distances W_q, for every q ≥ 1. Indeed, one can see an element θ ∈ ∆^{k−1} as the probability distribution of a random vector X taking values in the finite set A = {e_1, ..., e_k} of vectors of the canonical basis of R^k. Since these vectors satisfy ‖e_j − e_{j'}‖²₂ = 2·1(j ≠ j'), we have W_q(θ, θ')^q = inf_Γ E_Γ[‖X − X'‖^q₂] = 2^{q/2} d_TV(θ, θ'), where the inf is over all joint distributions Γ on A × A having marginal distributions θ and θ'. This implies that the upper bound obtained for the TV-distance carries over to every Wasserstein distance W_q, after raising it to the power 1/q and multiplying by √2. In addition, since the L²-norm is an upper bound on the L∞-norm, we have R^AC_{L∞}(n, ε, ∆^{k−1}) ≤ (1/n)^{1/2} + √2 ε. Thus, we have obtained upper bounds on the risk of the sample mean for all commonly used distances on the space of probability measures.
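The identity W_q^q = 2^{q/2} d_TV can be checked constructively: for a ground cost that is √2 between any two distinct vertices, the optimal coupling keeps min(θ_j, θ'_j) mass in place and moves the excess, and every moved unit of mass costs the same. A sketch (function name ours):

```python
import numpy as np

def coupling_cost(theta, theta_p, q=1):
    """Build the coupling that keeps min(theta_j, theta_p_j) mass on the
    diagonal and moves the excess, then evaluate its W_q cost under the
    ground distance ||e_i - e_j||_2 = sqrt(2) * 1(i != j)."""
    theta, theta_p = np.asarray(theta, float), np.asarray(theta_p, float)
    stay = np.minimum(theta, theta_p)
    surplus, deficit = theta - stay, theta_p - stay   # mass to move out / in
    gamma = np.diag(stay)
    if surplus.sum() > 0:
        # any plan moving the excess has the same cost: each moved unit pays sqrt(2)**q
        gamma += np.outer(surplus, deficit) / surplus.sum()
    cost = (np.sqrt(2) ** q) * (gamma.sum() - np.trace(gamma))  # off-diagonal mass only
    return cost ** (1 / q)
```

Since the off-diagonal mass of this coupling equals d_TV(θ, θ'), the returned value is √2 · d_TV(θ, θ')^{1/q}, matching the identity in the text.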

Lower bounds on the minimax risk
A natural question, answered in the next theorem, is how tight the upper bounds obtained in the last theorem are. More importantly, one can wonder whether there is an estimator whose worst-case risk is of smaller order than that of the sample mean.
Theorem 2 There are universal constants c > 0 and n_0 such that, for any integers k ≥ 3, s ≤ k ∧ n, n ≥ n_0 and for any ε ∈ [0, 1], we have lower bounds matching the rates of Theorem 1 up to the constant c, where inf_{θ̂_n} stands for the infimum over all measurable functions θ̂_n from (∆^{k−1})^n to ∆^{k−1}.

Proof in the appendix, page 15
The main consequence of this theorem is that, whatever the contamination model (among those described in Section 2), the rates obtained for the MLE in Section 4.1 are minimax optimal. Indeed, Theorem 2 yields this claim for Huber's contamination. For Huber's deterministic contamination and the TV-distance, on the one hand, we have a chain of inequalities in which step (1) uses the fact that for ε = 0 all the sets M_n(ε, θ*) are equal, while step (2) follows from the last theorem. On the other hand, in view of Proposition 1, a complementary bound holds for ε ≥ (6/n) log(8n/c) (implying that 2e^{−nε/6} ≤ (c/4)ε). Combining these two inequalities, for n ≥ (10 + 2 log(1/c))², we get the desired lower bound. The same argument can be used to show that all the inequalities in Theorem 2 are valid for Huber's deterministic contamination model as well. Since the inclusions M^HDC_n(ε, θ*) ⊂ M^OC_n(ε, θ*) ⊂ M^AC_n(ε, θ*) hold true, we conclude that the lower bounds obtained for HC remain valid for all the other contamination models and are minimax optimal.
The main tool in the proof of Theorem 2 is the following result (Chen et al., 2018, Theorem 5.1): there is a universal constant c_1 > 0 such that, for every ε ∈ [0, 1), the minimax risk is lower bounded by c_1 w_d(ε, ∆), where w_d(ε, ∆) is the modulus of continuity defined by w_d(ε, ∆) = sup{d(θ, θ') : θ, θ' ∈ ∆, d_TV(P_θ, P_θ') ≤ ε/(1 − ε)}. Choosing θ and θ' to differ in only two coordinates, one can check that this modulus is of the order of the contamination term in the corresponding rate, for each of the three distances. Combining with the lower bounds in the non-contaminated setting, this result yields the claims of Theorem 2. In addition, (6) combined with the results of this section implies that the rate in (7) is minimax optimal.

Confidence regions
So far, we have established bounds on the expected value of the estimation error. The aim of this section is to present bounds on the estimation error of the sample mean holding with high probability. This also leads to confidence regions for the parameter vector θ*. To this end, the contamination rate ε and the sparsity s are assumed to be known. It is an interesting open question whether one can construct optimally shrinking confidence regions for unknown ε and s.
Theorem 3 Let δ ∈ (0, 1) be the tolerance level. If θ* ∈ ∆^{k−1}_s, then, under any contamination model, the regions of ∆^{k−1} defined by each of the following inequalities contain θ* with probability at least 1 − δ. To illustrate the shapes of these confidence regions, we depict them in Figure 2 for a three-dimensional example, projected onto the plane containing the probability simplex. The sample mean in this example is equal to (1/3, 1/2, 1/6).

Illustration on a numerical example
We provide some numerical experiments which illustrate the theoretical results of Section 4. The data set is the collection of 38 books written by Alexandre Dumas (1802-1870) and 38 books written by Emile Zola (1840-1902). To each author, we assign a parameter vector corresponding to the distribution of the number of words contained in the sentences used in the author's books. To be more precise, a sentence containing l words is represented by the vector e_l, and if the parameter vector of an author is (θ_1, ..., θ_k), then a sentence used by the author is of size l ∈ {1, ..., k} with probability θ_l. We carried out synthetic experiments in which the reference parameter to estimate is the probability vector of Dumas, while the distribution of outliers is determined by the probability vector of Zola. Ground truths for these parameters are computed from the aforementioned large corpus of their works. Only the dense case s = k was considered. For various values of ε and n, a contaminated sample was generated by randomly choosing n sentences either from Dumas' works (with probability 1 − ε) or from Zola's works (with probability ε). The sample mean was computed for this corrupted sample, and the error with respect to Dumas' parameter vector was measured by the three distances TV, L² and Hellinger. This experiment was repeated 10^4 times for each setting to obtain information on the error's distribution. Furthermore, by grouping nearby outcomes we created samples of different dimensions to illustrate the behavior of the error as a function of k.
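A single run of this experiment can be reproduced in miniature. The real Dumas/Zola corpora are not reproduced here, so the sketch below substitutes two synthetic bell-shaped sentence-length profiles for the two authors' parameter vectors; everything else follows the protocol described above:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, eps = 100, 10_000, 0.2

# Synthetic stand-ins for the two sentence-length distributions on {1, ..., k}
# (hypothetical profiles; the actual corpora-based vectors are not available here)
grid = np.arange(1, k + 1)
theta_ref = np.exp(-0.5 * ((grid - 15) / 6.0) ** 2)   # plays the role of Dumas
theta_ref /= theta_ref.sum()
theta_out = np.exp(-0.5 * ((grid - 25) / 8.0) ** 2)   # plays the role of Zola
theta_out /= theta_out.sum()

# Huber-contaminated sample: each sentence length comes from the reference
# distribution with probability 1 - eps and from the outlier one with probability eps
labels = np.where(rng.random(n) < eps,
                  rng.choice(k, size=n, p=theta_out),
                  rng.choice(k, size=n, p=theta_ref))
x_bar = np.bincount(labels, minlength=k) / n          # sample mean of one-hot vectors

# Errors of the sample mean with respect to the reference parameter
tv = 0.5 * np.abs(x_bar - theta_ref).sum()
l2 = np.linalg.norm(x_bar - theta_ref)
hell = np.sqrt(0.5) * np.linalg.norm(np.sqrt(x_bar) - np.sqrt(theta_ref))
```

Repeating this over many draws of the sample (and over grids of n, k and ε) yields the error curves discussed next.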
The error of X̄_n as a function of the sample size n, the dimension k, and the contamination rate ε is plotted in Figures 3 and 4. These plots conform to the theoretical results. Indeed, the first plot in Figure 3 shows that the errors for the three distances are decreasing with respect to n. Furthermore, we see that up to some level of n this decay is of order n^{−1/2}. The second plot in Figure 3 confirms that the risk grows with the dimension k for the TV and Hellinger distances, while it is constant for the L² error. The left panel of Figure 4 suggests that the error grows linearly in terms of the contamination rate. This conforms to our results for the TV and L² errors. But it might seem that there is a disagreement with the result for the Hellinger distance, for which the risk is shown to increase at the rate ε^{1/2} and not ε. This is explained by the fact that the rate ε^{1/2} corresponds to the worst-case risk, whereas the setting under experiment does not necessarily represent the worst case. When the parameter vectors of the reference and contamination distributions, respectively, are e_i and e_j with i ≠ j (i.e., when these two distributions are at the largest possible distance, which we call an extreme case), the graph of the error as a function of ε (right panel of Figure 4) is similar to that of the square-root function.

Summary and conclusion
We have analyzed the problem of robust estimation of the mean of a random vector belonging to the probability simplex. Four measures of accuracy have been considered: the total-variation, Hellinger, Euclidean and Wasserstein distances. In each case, we have established the minimax rates of the expected estimation error under the sparsity assumption. In addition, confidence regions shrinking at the minimax rate have been proposed.
An intriguing observation is that the choice of the distance has a much stronger impact on the rate than the nature of the contamination. Indeed, while the rates for the aforementioned distances are all different, the rate corresponding to one particular distance is not sensitive to the nature of the outliers (ranging from Huber's contamination to the adversarial one). While the rate obtained for the TV-distance coincides with the previously known rate for robustly estimating a Gaussian mean, the rates we have established for the Hellinger and Euclidean distances appear to be new. Interestingly, when the error is measured by the Euclidean distance, the quality of estimation does not deteriorate as the dimension increases.
Using the same argument as for (8), for any O of cardinality o < 2nε, we get the analogous bound on the supremum. This completes the proof of (3).
One can use the same arguments along with the Chebyshev inequality to establish (4). Indeed, for every O of cardinality o ≤ 2εn, we have the corresponding bound on the supremum. Summing the obtained inequality over all sets O of cardinality o ≤ 2εn, we get an upper bound on the supremum. On the other hand, a matching bound holds, and the claim of the proposition follows.
Proof of Proposition 2 on page 6. Let θ_1 and θ_2 be two points in Θ. To ease writing, assume that n is an even number. Let O be any fixed set of cardinality n/2. It is clear that the set of outliers O satisfies the displayed bound, where in the last step we have used the triangle inequality. The obtained inequality being true for every θ_1, θ_2 ∈ Θ, we can take the supremum. This completes the proof.

Appendix B: Upper bounds on the minimax risk over the sparse simplex
This section is devoted to the proof of the upper bounds on minimax risks in the discrete model with respect to various distances.
Proof of Theorem 1 on page 7. To ease notation, we set X̄_n = n^{-1} Σ_{i=1}^n X_i. In the adversarial model, we have the corresponding error decomposition, and for a fixed θ* it is well known that the stochastic term is of the displayed order. This gives the bound on the supremum. In addition, a chain of inequalities holds for every θ*, where in (1) we have used the notation J = {j : θ*_j ≠ 0} and in (2) we have used the Cauchy-Schwarz inequality. This leads to the claimed bound. Finally, for the Hellinger distance, we have the analogous decomposition; hence, by Jensen's inequality, E[d_H(Ȳ_n, θ*)] ≤ (s/n)^{1/2}. Therefore, we infer the stated inequality, and the last claim of the theorem follows.
For the Hellinger distance, the computations are more tedious. We have to separate the case of small θ*_j. To this end, let J = {j : 0 < θ

Fig 1 :
Fig 1: Visual representation of the hierarchy between the various contamination models. Note that the inclusion of M^HC_n(ε, θ*) in M^HDC_n(2ε, θ*) is somewhat heuristic, based on the relation between the worst-case risks reported in Proposition 1.

Fig 3 :
Fig 3: Estimation error of X̄_n measured by the total-variation, Hellinger, and L² distances as a function of (left panel) the number of observations, with contamination rate 0.2 and dimension 10², and (right panel) the dimension, with contamination rate 0.2 and 10⁴ samples. The interval between the 5th and 95th quantiles of the error, obtained from 10⁴ repetitions, is also depicted for every graph.

Fig 4 :
Fig 4: The error of X̄_n measured by the total-variation, Hellinger, and L² distances in terms of the contamination rate (with dimension 10² and 10⁴ samples). On the right, we plot the error with respect to the contamination rate for an extreme case, where the reference and contamination distributions are at the largest possible distance. The interval between the 5th and 95th quantiles of the error, obtained from 10⁴ trials, is also depicted.