Distributed Statistical Estimation and Rates of Convergence in Normal Approximation

This paper presents new algorithms for distributed statistical estimation that can take advantage of the divide-and-conquer approach. We show that one of the key benefits attained by an appropriate divide-and-conquer strategy is robustness, an important characteristic of large distributed systems. We introduce a class of algorithms that are based on the properties of the geometric median, establish connections between the performance of these distributed algorithms and rates of convergence in normal approximation, and provide tight deviation guarantees for the resulting estimators in the form of exponential concentration inequalities. Our techniques are illustrated through several examples; in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for robust distributed maximum likelihood estimation.


Introduction
This paper introduces new statistical estimation methods that exhibit scalability, a necessary characteristic of modern methods designed to perform statistical analysis of large datasets, as well as robustness, which guarantees stable performance of distributed systems when some of the nodes exhibit abnormal behavior. The computational power of a single computer is often insufficient to store and process modern data sets; instead, the data is stored and analyzed in a distributed way by a cluster consisting of several machines. We consider a distributed estimation framework wherein data is assumed to be randomly assigned to computational nodes that produce intermediate results. We assume that no communication between the nodes is allowed at this first stage. At the second stage, these intermediate results are used to compute some statistic on the whole dataset; see figure 1 for a graphical illustration. Often, such a distributed setting is unavoidable in applications, whence interactions between subsamples stored on different machines are inevitably lost. Most previous research focused on the following question: how significantly does this loss affect the quality of statistical estimation when compared to an "oracle" that has access to the whole sample?

Fig 1: Distributed estimation protocol where data is randomly distributed across nodes to obtain "local" estimates that are aggregated to compute a "global" estimate.
The question that we ask in this paper is different: what can be gained from randomly splitting the data across several subsamples? What are the statistical advantages of the divide-and-conquer framework? Our work indicates that one of the key benefits of an appropriate merging strategy is robustness. In particular, the quality of estimation attained by the distributed estimation algorithm is preserved even if a subset of machines stops working properly. At the same time, the resulting estimators admit tight probabilistic guarantees (expressed in the form of exponential concentration inequalities) even when the distribution of the data has heavy tails, a viable model of real-world samples contaminated by outliers.
We establish connections between a class of randomized divide-and-conquer strategies and the rates of convergence in normal approximation. Using these connections, we provide a new analysis of the "median-of-means" estimator which often yields significant improvements over the previously available results. We further illustrate the implications of our results by constructing novel algorithms for distributed maximum likelihood estimation that admit strong performance guarantees under weak assumptions on the underlying distribution.

Background and related work
Let us introduce a simple model for distributed statistical estimation. Assume that X_1, . . . , X_N is a sequence of independent random variables with values in a measurable space (S, S) that represent the data available to a statistician. We will assume that N is large, and that the sample X = (X_1, . . . , X_N) is partitioned into k disjoint subsets G_1, . . . , G_k of cardinalities n_j := card(G_j) respectively, where the partitioning scheme is independent of the data. Let P_j be the distribution of X_j, j = 1, . . . , N. The goal is to estimate an unknown parameter θ_* = θ_*(P_j), j = 1, . . . , N, shared by P_1, . . . , P_N and taking values in a separable Hilbert space $(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$; for example, if $S = \mathbb{H}$, θ_* could be the common mean of X_1, . . . , X_N. The distributed estimation protocol proceeds by performing "local" computations on each subset G_j, j ≤ k; the local estimators $\hat\theta_j := \hat\theta_j(G_j)$, j ≤ k, are then pieced together to produce the final "global" estimator $\hat\theta^{(k)} = \hat\theta^{(k)}(\hat\theta_1, . . . , \hat\theta_k)$. We are interested in the statistical properties of such distributed estimation protocols, and our main focus is on the final step that combines the local estimators. Let us mention that the condition requiring the sets G_j, 1 ≤ j ≤ k, to be disjoint can be relaxed; we discuss the extensions related to U-quantiles in section 2.6 below. The problem of distributed and communication-efficient statistical estimation has recently received significant attention from the research community. While our review provides only a subsample of the abundant literature in this field, it is important to acknowledge the works by Mcdonald et al. (2009); Zhang, Wainwright and Duchi (2012); Fan, Han and Liu (2014); Shafieezadeh-Abadeh, Esfahani and Kuhn (2015); Battey et al. (2015); Duchi et al. (2014); Lee et al. (2015); Cheng and Shang (2015); Rosenblatt and Nadler (2016); Zinkevich et al. (2010). Li, Srivastava and Dunson (2016); Scott et al. (2016); Shang and Cheng (2015); Minsker et al. (2014) have investigated closely related problems for distributed Bayesian inference. Applications to important algorithms such as Principal Component Analysis were investigated in Fan et al. (2017); Liang et al. (2014), among others. In Jordan (2013), the author provides an overview of recent trends at the intersection of the statistics and computer science communities, describes popular existing strategies such as the "bag of little bootstraps", as well as successful applications of the divide-and-conquer paradigm to problems such as matrix factorization.
The majority of the aforementioned works propose averaging of the local estimators as the final merging step. Indeed, averaging reduces variance; hence, if the bias of each local estimator is sufficiently small, their average often attains optimal rates of convergence to the unknown parameter θ_*. For example, when θ_*(P) = E_P X is the mean of X and $\hat\theta_j$ is the sample mean evaluated over the subsample G_j, j = 1, . . . , k, then the average of the local estimators $\bar\theta = \frac{1}{k}\sum_{j=1}^k \hat\theta_j$ is just the empirical mean evaluated over the whole sample. More generally, it has been shown by Battey et al. (2015); Zhang, Duchi and Wainwright (2013) that in many problems (for instance, linear regression), k can be taken as large as $O(\sqrt{N})$ without negatively affecting the estimation rates; similar guarantees hold for a variety of M-estimators (see Rosenblatt and Nadler, 2016). However, if the number of nodes k itself is large (the case we are mainly interested in), then the averaging scheme has a drawback: if one or more of the local estimators $\hat\theta_j$ is anomalous (for example, due to data corruption or a computer system malfunctioning), then the statistical properties of the average will be negatively affected as well. For large distributed systems, this drawback can be costly.
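To make this fragility concrete, here is a minimal sketch (our own illustration, not from the paper) in which a single anomalous local estimate destroys the average merge but not the median merge:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1000                                   # number of computational nodes
theta_j = rng.normal(0.0, 1.0, size=k)     # "local" estimates of theta* = 0
theta_j[0] = 1e6                           # a single malfunctioning node

print(np.mean(theta_j))                    # ~1000: ruined by one bad node
print(np.median(theta_j))                  # still close to theta* = 0
```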
One way to address this issue is to replace averaging by a more robust procedure, such as the median or a robust M-estimator; this approach is investigated in the present work. In the univariate case (θ_* ∈ ℝ), the merging strategies we study can be described as solutions of the optimization problem
$$\hat\theta^{(k)} = \operatorname*{argmin}_{z\in\mathbb{R}} \sum_{j=1}^k \rho\left(z - \hat\theta_j\right) \qquad (1)$$
for an appropriately defined convex function ρ; we investigate this class of estimators in detail. A natural extension to the case θ_* ∈ ℝ^m is to consider
$$\hat\theta^{(k)} = \operatorname*{argmin}_{z\in\mathbb{R}^m} \sum_{j=1}^k \rho\left(\left\|z - \hat\theta_j\right\|_\circ\right)$$
for some convex function ρ and norm $\|\cdot\|_\circ$. For example, if ρ(x) = x, then $\hat\theta^{(k)}$ becomes the spatial (also known as geometric, or Haldane's) median (Haldane, 1948; Small, 1990) of $\hat\theta_1, \ldots, \hat\theta_k$. Since the median remains stable as long as at least half of the nodes in the system perform as expected, such a model for distributed estimation is robust. The merging approach based on various notions of the multivariate median has been previously considered by Minsker (2015) and Hsu and Sabato (2016); here, we analyze the setting where ρ(x) = x and $\|\cdot\|_\circ$ is the $L_1$-norm using a novel approach.
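For intuition, the geometric median of the local estimates can be computed by Weiszfeld's iteratively re-weighted averaging; the sketch below is our own illustration (the iteration count and tolerance are arbitrary choices, not the paper's algorithm):

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    """Weiszfeld's algorithm: iteratively re-weighted average, where each
    point is weighted by the inverse of its distance to the current iterate."""
    z = points.mean(axis=0)                  # initialize at the coordinatewise mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - z, axis=1), eps)
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

# merge k = 50 "local" estimates in R^2, 5 of which come from corrupted nodes
rng = np.random.default_rng(1)
local_estimates = rng.normal(0.0, 1.0, size=(50, 2))
local_estimates[:5] += 100.0
print(geometric_median(local_estimates))     # remains near the true value (0, 0)
```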
Existing results for the median-based merging strategies have several pitfalls related to the deviation rates, and in most cases the known guarantees are suboptimal. In particular, these guarantees suggest that estimators obtained via the median-based approach are very sensitive to the choice of k, the number of partitions. For instance, consider the problem of univariate mean estimation, where X_1, . . . , X_N are i.i.d. copies of X ∈ ℝ, and θ_* = EX is the expectation of X. Assume that card(G_j) ≥ n := N/k for all j, let $\hat\theta_j = \frac{1}{|G_j|}\sum_{i:\, X_i\in G_j} X_i$ be the empirical mean evaluated over the subsample G_j, and define the "median-of-means" estimator via
$$\hat\theta^{(k)} := \operatorname{med}\left(\hat\theta_1, \ldots, \hat\theta_k\right), \qquad (2)$$
where med(·) is the usual univariate median. This estimator was introduced by Nemirovski and Yudin (1983) in the context of stochastic optimization, and later appeared in Jerrum, Valiant and Vazirani (1986) and Alon, Matias and Szegedy (1996). If Var(X) = σ² < ∞, it has been shown (for example, by Lerasle and Oliveira, 2011) that the median-of-means estimator $\hat\theta^{(k)}$ satisfies
$$\left|\hat\theta^{(k)} - \theta_*\right| \leq C\,\sigma\sqrt{\frac{\log(1/\alpha)}{N}} \qquad (3)$$
with probability ≥ 1 − α if k = ⌊log(1/α)⌋ + 1. However, this bound does not provide insight into what happens at confidence levels other than 1 − α. For example, if k = √N, the only conclusion we can make is that $|\hat\theta^{(k)} - \theta_*| \lesssim N^{-1/4}$ with high probability, which is far from the "parametric" rate $N^{-1/2}$. And if we want the bound to hold with confidence 99% instead of $1 - e^{-\sqrt{N}}$, then, according to (3), we should take k = ⌊log 100⌋ + 1 = 5, in which case the beneficial effect of parallel computation is very limited. The natural question to ask is the following: is it possible to "decouple" the parameter k, the number of subgroups, from α, which controls the deviation probability? Is the median-based merging step suboptimal for large values of k (e.g., k = √N), or is the problem related to the suboptimality of the existing bounds? We claim that in many situations the latter is the case, and that previously known results can be strengthened: for instance, the statement of Corollary 1 below implies that whenever E|X − θ_*|³ < ∞, the median-of-means estimator satisfies
$$\left|\hat\theta^{(k)} - \theta_*\right| \leq C\left(\sigma\sqrt{\frac{s}{N}} + \frac{\mathbb{E}|X - \theta_*|^3}{\sigma^2}\cdot\frac{k}{N}\right) \qquad (4)$$
with probability ≥ 1 − 4e^{−2s}, for all $s \lesssim k$ simultaneously. Inequality (4) shows that the estimator (2) has "typical" deviations of order $N^{-1/2}$ whenever k = O(√N), hence the "statistical cost" of employing a large number of computational nodes is minor. Moreover, we will prove that $\sqrt{N}(\hat\theta^{(k)} - \theta_*)$ is asymptotically normal when k and n grow appropriately. It will also be demonstrated that improved bounds hold in other important scenarios, such as maximum likelihood estimation, even when the subgroups have different sizes and the observations are not identically distributed.
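A minimal implementation of the estimator (2) in code (our own sketch; the random partition mimics the protocol described above):

```python
import numpy as np

def median_of_means(x, k, rng=None):
    """Split the sample into k groups at random, compute the mean of each
    group, and return the median of the k group means."""
    rng = np.random.default_rng(rng)
    groups = np.array_split(rng.permutation(x), k)
    return np.median([g.mean() for g in groups])

# heavy-tailed example: Pareto data possessing only moments of order < 2.1
rng = np.random.default_rng(2)
x = rng.pareto(2.1, size=100_000)
print(x.mean(), median_of_means(x, k=300))   # both near E[X], MoM more stable
```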

Organization of the paper
Section 1.3 describes notation used throughout the paper. Sections 2 and 3 present main results and examples for the cases of univariate and multivariate parameter respectively. Outcomes of numerical simulation are discussed in section 4, and proofs of the main results are contained in section 5.

Notation
Everywhere below, $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$ stand for the $L_1$, $L_2$ and $L_\infty$ norms of a vector.
Given a probability measure P, $\mathbb{E}_P(\cdot)$ will stand for the expectation with respect to P, and we will write $\mathbb{E}(\cdot)$ when P is clear from the context. Convergence in distribution will be denoted by $\xrightarrow{d}$. For two sequences $\{a_j\}_{j\geq 1} \subset \mathbb{R}$ and $\{b_j\}_{j\geq 1} \subset \mathbb{R}$, the expression $a_j \lesssim b_j$ means that there exists a constant c > 0 such that $a_j \leq c\, b_j$ for all j ∈ ℕ. Absolute constants will be denoted c, C, c_1, etc., and may take different values in different parts of the paper. For a function f : ℝ → ℝ, $f'_+(x)$ and $f'_-(x)$ will denote the right and left derivatives of f respectively (whenever these limits exist). Additional notation and auxiliary results are introduced on demand for the proofs in section 5.

Main results
As we have argued above, existing guarantees for the estimator (2) are sensitive to the choice of k, the number of partitions. In the following sections, we demonstrate that these bounds are often suboptimal, and show that large values of k often do not have a significant negative effect on the statistical performance of resulting algorithms.
The key observation underlying the subsequent exposition is the following: assume that the "local estimators" $\hat\theta_j$, 1 ≤ j ≤ k, are asymptotically normal with asymptotic mean equal to θ_*. In particular, the distributions of the $\hat\theta_j$'s are approximately symmetric, with θ_* being the center of symmetry. The location parameter of a symmetric distribution admits many robust estimators of the form (1), the sample median being a notable example.
This intuition allows us to establish a parallel between the non-asymptotic deviation guarantees for distributed estimation procedures of the form (1) and the degree of symmetry of "local" estimators quantified by the rates of convergence to normal approximation. Results for the univariate case are presented in section 2, and extensions to the multivariate case are presented in section 3.

The univariate case
We assume that X_1, . . . , X_N is a collection of independent (but not necessarily identically distributed) S-valued random variables with distributions P_1, . . . , P_N respectively. The data are partitioned into disjoint groups G_1, . . . , G_k of cardinalities n_j := card(G_j) such that $\sum_{j=1}^k n_j = N$. Let $\hat\theta_j := \hat\theta_j(G_j)$, 1 ≤ j ≤ k, be a sequence of independent estimators of the parameter θ_* ∈ ℝ shared by P_1, . . . , P_N. Our main assumption will be that $\hat\theta_1, . . . , \hat\theta_k$ are asymptotically normal, as quantified by the following condition.

Assumption 1. Let Φ(t) be the cumulative distribution function of the standard normal random variable Z ∼ N(0, 1). For each j = 1, . . . , k,
$$\sup_{t\in\mathbb{R}}\left| P\left( \frac{\hat\theta_j - \theta_*}{\sigma^{(j)}_{n_j}} \leq t \right) - \Phi(t) \right| \leq g_j(n_j)$$
for some function $g_j$. In what follows, we set $\sigma^{(j)}_{n_j} := \left(\operatorname{Var}\hat\theta_j\right)^{1/2}$.

Merging procedure based on the median
In this subsection, we establish guarantees for the merging procedure based on the sample median, namely
$$\hat\theta^{(k)} := \operatorname{med}\left(\hat\theta_1, \ldots, \hat\theta_k\right).$$
This case is treated separately due to its practical importance, the fact that we can obtain better numerical constants, and a conceptually simpler proof.
Theorem 1. Assume that s > 0 and n j = card(G j ), j = 1, . . . , k are such that Moreover, let assumption 1 be satisfied, and let ζ j (n j , s) solve the equation Then for all s satisfying (5), with probability at least 1 − 4e −2s .
The following lemma yields a more explicit form of the bound and numerical constants.
Proof. This is an immediate consequence of Lemma 4 in the Appendix.
for any integer 1 ≤ m ≤ k; hence, the deviations of $\hat\theta^{(k)}$ are controlled by the smallest elements among $\sigma^{(j)}_{n_j}$, 1 ≤ j ≤ k.

Example: new bounds for the median-of-means estimator
The univariate mean estimation problem is pervasive in statistics, and serves as a building block of more advanced methods such as empirical risk minimization. Early works on robust mean estimation include Tukey's "trimmed mean" (Tukey and Harris, 1946), as well as the "winsorized mean" (Bickel et al., 1965); also see the discussion in Bubeck, Cesa-Bianchi and Lugosi (2013). These techniques often produce estimators with significant bias. A different approach based on M-estimation was suggested by O. Catoni (Catoni, 2012); Catoni's estimator yields almost optimal constants, but its construction requires additional information about the variance or the kurtosis of the underlying distribution; moreover, its computation is not easily parallelizable, so this technique cannot be easily employed in the distributed setting.
Here, we will focus on a fruitful idea commonly referred to as the "median-of-means" estimator, formally defined in equation (2). Advantages of this method include the fact that it can be implemented in parallel and does not require prior knowledge of any parameters of the distribution (e.g., its variance). The following result for the median-of-means estimator is a corollary of Theorem 1; for brevity, we treat only the case of identically distributed observations. Recall that n = N/k and card(G_j) ≥ n, j = 1, . . . , k.

Corollary 1. Let X_1, . . . , X_N be a sequence of i.i.d. copies of a random variable X ∈ ℝ such that EX = θ_*, Var(X) = σ², and E|X − θ_*|³ < ∞. Then for all s > 0 and k such that
$$0.4748\,\frac{\mathbb{E}|X - \theta_*|^3}{\sigma^3\sqrt{n}} + \frac{s}{k} \leq c_1,$$
the following inequality holds with probability at least 1 − 4e^{−2s}:
$$\left|\hat\theta^{(k)} - \theta_*\right| \leq C_2\,\sigma\sqrt{\frac{s}{N}} + 1.43\,\sigma\,\frac{\mathbb{E}|X - \theta_*|^3/\sigma^3}{n}.$$

Remark 2. The term $1.43\,\sigma\,\frac{\mathbb{E}|X-\theta_*|^3/\sigma^3}{n}$ can be thought of as the "bias" due to the asymmetry of the distribution of the sample mean. Note that whenever $k \lesssim \sqrt{N}$ (so that $n \gtrsim \sqrt{N}$), the right-hand side of the inequality above is of order $N^{-1/2}$. In general, the bias term cannot be avoided. Indeed, assume that X has exponential distribution E(1). Then the sum $\sum_{j=1}^n X_j$ has Gamma distribution Γ(n, 1) with mean equal to n. Moreover, it is known (Choi, 1994) that for large n, the median M of Γ(n, 1) satisfies $n - 1/3 < M < n - 1/3 + \frac{1}{2n}$, hence one easily checks that the median $M_n$ of the law of $\frac{1}{n}\sum_{j=1}^n X_j - 1$ satisfies $|M_n| \geq \frac{1}{4n} \propto \frac{k}{N}$ for n large enough. On the other hand, when k → ∞ while n remains fixed, $\hat\theta^{(k)} - \theta_* \to M_n$ almost surely.
Proof. It follows from the Berry-Esseen theorem (Fact 1 in section 5.1) that assumption 1 is satisfied with $\sigma_n = \sigma/\sqrt{n}$ and $g(n) = 0.4748\,\frac{\mathbb{E}|X - \theta_*|^3}{\sigma^3\sqrt{n}}$, and the claim follows from Theorem 1.
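The exponential example from Remark 2 is easy to check numerically; the sketch below (our own illustration using scipy's Gamma quantile function) confirms that the median bias of the block mean decays at the rate 1/n:

```python
from scipy.stats import gamma

# Median bias of the block mean for Exp(1) data: the sum of n Exp(1)
# variables is Gamma(n, 1), so the median of the block mean is
# gamma.ppf(0.5, n) / n, while the true mean equals 1.
for n in [10, 100, 1000]:
    M_n = gamma.ppf(0.5, n) / n - 1.0    # median of (1/n) sum X_j - 1
    print(n, M_n, -1 / (3 * n))          # bias is of order -1/(3n)
```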
Results similar to Corollary 1 can be obtained under weaker moment assumptions as well.
Corollary 2. Let X_1, . . . , X_N be i.i.d. copies of a random variable X ∈ ℝ with EX = θ_*, Var(X) = σ², and $\mathbb{E}|X - \theta_*|^{2+\delta} < \infty$ for some δ ∈ (0, 1]. Then there exist absolute constants c_1, C_2 > 0 such that for all s > 0 and k satisfying $\frac{\mathbb{E}|X-\theta_*|^{2+\delta}}{\sigma^{2+\delta}\, n^{\delta/2}} + \frac{s}{k} \leq c_1$, the following inequality holds with probability at least 1 − 4e^{−2s}:
$$\left|\hat\theta^{(k)} - \theta_*\right| \leq C_2\left(\sigma\sqrt{\frac{s}{N}} + \sigma\,\frac{\mathbb{E}|X - \theta_*|^{2+\delta}}{\sigma^{2+\delta}}\cdot\frac{1}{n^{(1+\delta)/2}}\right).$$
In this case, typical deviations of $\hat\theta^{(k)}$ are still of order $N^{-1/2}$ as long as $k \lesssim N^{\delta/(1+\delta)}$. The proof of this result again follows from Theorem 1 and a version of the Berry-Esseen inequality stated in section 5.1. Finally, we remark that under stronger assumptions on the distribution of X, the "bias term" can be improved.
Corollary 3. Assume in addition that $\limsup_{t\to\infty}|\varphi_X(t)| < 1$, where $\varphi_X$ denotes the characteristic function of X. Then there exist constants that depend on the distribution of X such that, for all s > 0 and k satisfying the appropriate analogue of the condition above, the bias term in the deviation bound improves to the order 1/n. Proof. It follows from Theorem 2 in Ibragimov (1967) that, under the stated assumptions, there exists $C_X > 0$ depending on the distribution of X such that the required approximation bound holds for all j. The claim now follows from Lemma 1 and Theorem 1.
We remark that the requirement $\limsup_{t\to\infty} |\varphi_X(t)| < 1$ implies that the distribution of X is not concentrated on a lattice.

Example: distributed maximum likelihood estimation
Assume that for each θ ∈ Θ, $P_\theta$ is absolutely continuous with respect to a σ-finite measure μ, and let $p_\theta = \frac{dP_\theta}{d\mu}$ be the corresponding density. In this section, we state sufficient conditions for assumption 1 to be satisfied when $\hat\theta_1, . . . , \hat\theta_k$ are the maximum likelihood estimators (MLE) of θ_*. All derivatives below (denoted by ′) are taken with respect to θ, unless noted otherwise. Pinelis (2016) proved that the following conditions suffice to guarantee that the rate of convergence of the distribution of the MLE to the normal law is $n^{-1/2}$. Assume that the log-likelihood function $\ell_x(\theta) = \log p_\theta(x)$ is such that: (2) "standard regularity conditions" that allow differentiation under the expectation hold, namely $\mathbb{E}\,\ell'_X(\theta_*) = 0$, and the Fisher information $I(\theta_*) = \mathbb{E}\left(\ell'_X(\theta_*)\right)^2$ is finite and strictly positive; (5) $P\left(\left|\hat\theta_1 - \theta_*\right| \geq \delta\right) \leq c\,\gamma^n$ for some positive constants c and γ ∈ [0, 1).
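As a concrete illustration (our own hypothetical example, not one from the paper), consider i.i.d. Exponential(θ) data: the per-node MLE has a closed form (the reciprocal of the node's sample mean), and the merge step is the median-based rule studied above:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star = 2.0                          # true rate of an Exponential(theta) model
N, k = 100_000, 200
X = rng.exponential(1 / theta_star, N)

# local step: MLE of the rate on each node; for Exponential(theta),
# the MLE is 1 / (sample mean)
groups = np.array_split(rng.permutation(X), k)
local_mle = np.array([1.0 / g.mean() for g in groups])

# merge step: median of the local MLEs
print(np.median(local_mle))               # close to theta_star = 2.0
```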

Merging procedures based on robust M-estimators
In this section, we establish performance guarantees for distributed algorithms based on robust M-estimators. Let ρ be a convex, even function such that ρ(z) → ∞ as |z| → ∞ and $\|\rho'_+\|_\infty < \infty$. Moreover, some additional regularity of ρ will be assumed; for example, Huber's losses
$$\rho_M(z) = \begin{cases} z^2/2, & |z| \leq M, \\ M|z| - M^2/2, & |z| > M, \end{cases} \qquad (6)$$
where M ≥ 1, satisfy these assumptions. We study the family of merging procedures based on the M-estimators
$$\hat\theta^{(k)}_\rho := \operatorname*{argmin}_{z\in\mathbb{R}} \sum_{j=1}^k \gamma_j\, \rho\!\left(\frac{z - \hat\theta_j}{\tau_j}\right),$$
where γ_j, j = 1, . . . , k, are nonnegative weights and τ_j, j = 1, . . . , k, are nonnegative "scaling factors." The sample median $\operatorname{med}(\hat\theta_1, . . . , \hat\theta_k)$ corresponds to the choice ρ(x) = |x|, equal weights γ_j = 1 and τ_j = 1, j = 1, . . . , k.
Results below demonstrate that a different choice of weights leads to potentially better bounds. We will also assume that for all 1 ≤ j ≤ k, the rate of decay of $\sigma^{(j)}_{n_j}$ is known, namely $\sigma^{(j)}_{n_j} \asymp n_j^{-\beta_j}$ for some known constants β_1, . . . , β_k > 0. In this case, we will set $V_j := n_j^{\beta_j}\,\sigma^{(j)}_{n_j}$. The following result quantifies the non-asymptotic performance of the estimator $\hat\theta^{(k)}_\rho$. Theorem 2. Let assumption 1 be satisfied, and suppose that s > 0 and n_1, . . . , n_k are such that
To understand the implications of this technical bound, we consider a special case in which the expressions simplify significantly. Let ρ(z) = |z|, and assume that β_1 = . . . = β_k = 1/2 and $g_j(n) \leq C_X\, n^{-1/2}$ for all j and some $C_X > 0$ that depends on the distribution of X. Moreover, let $\bar H_k$ be the weighted harmonic mean of V_1, . . . , V_k, and set $\hat\alpha_j = \bar H_k / V_j$. Corollary 5. There exist positive constants c_1, C_2 such that for all s > 0 and n_1, . . . , n_k satisfying (8), the following inequality holds with probability at least 1 − 2e^{−s}: Proof. Observe that for ρ(z) = |z|, the estimator $\hat\theta^{(k)}_\rho$ with weights $\gamma_j = \hat\alpha_j$ coincides with the weighted median of $\hat\theta_1, \ldots, \hat\theta_k$, hence (11) implies (7). It is also straightforward to check that (8) implies (10) for an appropriate choice of the constant C_2.
Let us compare the previous bound with the result of Theorem 1 when the observations are i.i.d. with variance σ². By the inequality between the harmonic mean and the arithmetic mean, the second bound is stronger than the first.

Remark 4. Assume that ρ is Huber's loss defined in (6) with M = 1, that the data are i.i.d., and that |G_j| = n for all j. The bound of Theorem 2 implies that one should pick a scaling factor Δ that is not too large, as the quantity max(Δ, V_n) controls the estimation error, where $V_n = n^\beta \sigma_n$. On the other hand, it will be shown in section 2.5 that to get an estimator with small asymptotic variance, one should choose Δ that is not too small; the "optimal" choice is Δ = V_n. While V_n is typically unknown, it can be estimated from the data. Indeed, since the $\hat\theta_j$'s are approximately normal, their standard deviation can be estimated by the median absolute deviation as
$$\hat\sigma_n := \frac{\operatorname{med}\left(\left|\hat\theta_j - \operatorname{med}(\hat\theta_1, \ldots, \hat\theta_k)\right|,\ j = 1, \ldots, k\right)}{\Phi^{-1}(0.75)},$$
where the factor $1/\Phi^{-1}(0.75)$ is introduced to make the estimator consistent (Hampel et al., 2011). At the same time, when ρ(x) = |x|, the estimator is invariant with respect to Δ, but its asymptotic variance is larger than the variance of the estimator based on Huber's loss with an optimally chosen scale parameter.
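A sketch of the resulting merging rule, with the scale estimated by the median absolute deviation as above (our own illustration; the one-dimensional convex problem is solved with scipy.optimize.minimize_scalar, and the bracket choice is an arbitrary heuristic):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def huber(z, M=1.0):
    """Huber's loss with threshold M."""
    return np.where(np.abs(z) <= M, 0.5 * z**2, M * np.abs(z) - 0.5 * M**2)

def merge_huber(local_estimates):
    """Merge local estimates by an M-estimator with Huber's loss,
    using the MAD-based scale so the procedure is scale-equivariant."""
    theta = np.asarray(local_estimates, dtype=float)
    med = np.median(theta)
    scale = np.median(np.abs(theta - med)) / norm.ppf(0.75)  # consistent scale
    obj = lambda z: huber((z - theta) / scale).sum()
    return minimize_scalar(obj, bracket=(med - scale, med + scale)).x

rng = np.random.default_rng(4)
est = rng.normal(0.0, 1.0, size=100)
est[:5] = 50.0                            # a few anomalous nodes
print(merge_huber(est))                   # remains near 0
```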

Adversarial contamination
One of the advantages of allowing the number of subgroups k to be large is improved robustness with respect to adversarial contamination. Assume that the initial sample X_1, . . . , X_N is merged with a set of O < N outliers that are generated by an adversary who has an opportunity to inspect the data in advance; the combined dataset of cardinality Ñ = N + O is then presented to a statistician. We would like to understand the performance of the proposed estimators in this framework. To highlight the dependence of the estimation error on the number O of outliers, we consider only the simplest scenario of i.i.d. data and equal group sizes satisfying card(G_j) = n ≥ Ñ/k. Moreover, suppose that β_1 = . . . = β_k = 1/2 and that $g(n) \leq C_X\, n^{-1/2}$. In this case, the estimator we are interested in is the M-estimator $\hat\theta^{(k)}_\rho$ of section 2.4 with equal weights and a common scaling factor Δ. In what follows, we will also assume that k > 2O. The following result holds.
The display above implies that, if k ≥ C · O for a sufficiently large constant C, the error $|\hat\theta^{(k)}_\rho - \theta_*|$ behaves like the maximum of two terms: the first term is the error bound for the case O = 0, and the second term is of order $\Delta\sqrt{n}\,\frac{O}{N}$. The dependence on O in Theorem 3 cannot be improved in general: indeed, if θ_* is the mean and X has 3 finite moments, it is known (Steinhardt, Charikar and Valiant, 2017) that this dependence on the number of outliers is unavoidable.

Remark 5. An important characteristic of Theorem 3 is that its guarantees still hold uniformly over all $0 < s \lesssim k$, as long as O/k is not too large. Another method for obtaining bounds that hold uniformly over a wide range of confidence parameters s was suggested by Devroye et al. (2016), and is based on the ability to construct, for each 0 < s ≤ c_1 N, a "sub-Gaussian" confidence interval with coverage probability at least 1 − e^{−s}. While our bounds rely on stronger moment assumptions to obtain uniformity over a wider range of confidence parameters, they have two important advantages in the context of median-of-means estimators: first, they do not require prior information about the variance that is needed to construct the confidence intervals. Second, in the framework of adversarial contamination, our bounds are uniform over all $0 < s \lesssim k$, while the guarantees obtained in Devroye et al. (2016) can only be made uniform over the range $O \lesssim s \lesssim N$; when O is relatively large, this difference becomes noticeable.
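A quick simulation of this contamination model (our own simplified sketch, with O outliers placed adversarially far from θ_*):

```python
import numpy as np

rng = np.random.default_rng(9)
N, k, O = 10_000, 200, 30                 # note k > 2*O, as assumed above
x = rng.normal(0.0, 1.0, size=N)          # clean sample, theta* = 0
x_adv = np.concatenate([x, np.full(O, 1e4)])   # adversarial outliers

groups = np.array_split(rng.permutation(x_adv), k)
mom = np.median([g.mean() for g in groups])
print(abs(np.mean(x_adv)), abs(mom))      # the mean is dragged away; MoM is not
```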

Asymptotic results
In this section, we complement the previously discussed non-asymptotic deviation bounds for $\hat\theta^{(k)}_\rho$ with asymptotic results that partially explain the difference that the choice of the function ρ makes. For the sake of clarity, we make several simplifying assumptions; the complete list is presented below: (1) X_1, . . . , X_N are i.i.d., n = N/k and card(G_j) = n, j = 1, . . . , k; the result for subgroups of different sizes is presented in Appendix 5.8; (2) assumption 1 is satisfied for some function g(n) (note that there is no dependence on the index j due to the i.i.d. assumption); (3) k and n are such that k → ∞ and $\sqrt{k}\cdot g(n) \to 0$ as N → ∞; (4) ρ is a convex, even function such that ρ(z) → ∞ as |z| → ∞ and $\|\rho'_+\|_\infty < \infty$; (5) the function $L(z) := \mathbb{E}\,\rho'(z + Z)$, where Z ∼ N(0, 1), is differentiable at 0 with L′(0) > 0. Note that, since ρ is differentiable almost everywhere, ρ′ := (ρ′_+ + ρ′_−)/2 is well defined.
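To illustrate what the limit looks like, the following short calculation (a sketch based on the definitions above and standard M-estimation asymptotics; the general statement is Theorem 4) specializes the asymptotic variance to the median case ρ(x) = |x|:

```latex
% For \rho(x) = |x| we have \rho'(z) = \operatorname{sign}(z), hence
% L(z) = \mathbb{E}\operatorname{sign}(z + Z) = 2\Phi(z) - 1 and
% L'(0) = 2\varphi(0) = \sqrt{2/\pi}, while \mathbb{E}[\rho'(Z)]^2 = 1.
% The limiting variance of \sqrt{k}\,(\hat\theta^{(k)}_\rho - \theta_*)/\sigma_n
% is therefore
\[
\frac{\mathbb{E}\left[\rho'(Z)\right]^2}{\left(L'(0)\right)^2}
  = \frac{1}{2/\pi} = \frac{\pi}{2},
\]
% i.e., for sample means (\sigma_n^2 = \sigma^2/n, \ nk = N) the merged
% estimator has asymptotic variance \pi\sigma^2/(2N): the classical
% \pi/2 efficiency loss of the median.
```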

Remark 6.
The key assumptions in the list (1)-(5) governing the regime of growth of k and n are (2) and (3). For instance, if the random variables possess finite moments of order 2 + δ for some δ ∈ (0, 1], then it follows from the Berry-Esseen bound (Fact 1 in section 5.1) that $g(n) \lesssim n^{-\delta/2}$, hence $\sqrt{k}\, g(n) \to 0$ as N → ∞ whenever $k = o\!\left(N^{\delta/(1+\delta)}\right)$.

Connection to U-quantiles
In this section, we discuss connections of the proposed algorithms to U-quantiles, and revisit the assumption requiring the groups G_1, . . . , G_k to be disjoint. We assume that the data X_1, . . . , X_N are i.i.d. with common distribution P, and let θ_* = θ_*(P) ∈ ℝ be a real-valued parameter of interest. It is clear that the estimators produced by the distributed algorithms considered above depend on the random partition of the sample. A natural way to avoid such dependence is to consider the U-quantile (in this case, the median)
$$\tilde\theta^{(k)} := \operatorname{med}\left(\hat\theta_J,\ J \in A^{(n)}_N\right),$$
where $A^{(n)}_N$ denotes the collection of all subsets J ⊆ {1, . . . , N} of cardinality n, and $\hat\theta_J$ is the estimator evaluated over the subsample {X_j, j ∈ J}. For example, when n = 2 and $\hat\theta_J = \frac{1}{2}\sum_{j\in J} X_j$, $\tilde\theta^{(k)}$ is the well-known Hodges-Lehmann estimator of the location parameter, see Hodges and Lehmann (1963); Lehmann and D'Abrera (2006); for a comprehensive study of U-quantiles, see Arcones (1996). The main result of this section is an analogue of Theorem 1 for the estimator $\tilde\theta^{(k)}$; it implies that theoretical guarantees for the performance of $\tilde\theta^{(k)}$ are at least as good as those for the estimator $\hat\theta^{(k)}$. Since the data are i.i.d., it is enough to impose assumption 1 on $\hat\theta(X_1, . . . , X_n)$ only, hence we drop the index j and denote the normalizing sequence by $\{\sigma_n\}_{n\in\mathbb{N}}$ and the corresponding error function by g(n).
Theorem 5. Assume that s > 0 and n = N/k are such that Moreover, let assumption 1 be satisfied, and let ζ(n, s) solve the equation Then for any s satisfying (12), Proof. See section 5.6. As before, a more explicit form of the bound immediately follows from Lemma 1.
A drawback of the estimator $\tilde\theta^{(k)}$ is the fact that its exact computation requires evaluating $\binom{N}{n}$ estimators $\hat\theta_J$ over the subsamples {X_j, j ∈ J}, J ∈ $A^{(n)}_N$. For large N and n, this task becomes intractable. However, an approximate result can be obtained by choosing subsets J_1, . . . , J_ℓ from $A^{(n)}_N$ uniformly at random, and setting $\tilde\theta^{(k)} := \operatorname{med}\left(\hat\theta_{J_1}, \ldots, \hat\theta_{J_\ell}\right)$.
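A sketch of this randomized approximation (our own illustration; the number ℓ of random subsets is a tuning parameter, and n = 2 with the averaging kernel gives an approximate Hodges-Lehmann estimator):

```python
import numpy as np

def randomized_u_median(x, n, n_subsets=1000, rng=None):
    """Approximate the U-quantile med(theta_J, J in A) by drawing
    subsets J_1, ..., J_l of size n uniformly at random."""
    rng = np.random.default_rng(rng)
    stats = [x[rng.choice(len(x), size=n, replace=False)].mean()
             for _ in range(n_subsets)]
    return np.median(stats)

rng = np.random.default_rng(5)
x = rng.standard_t(3, size=2000) + 1.0    # heavy-tailed data, location 1
print(randomized_u_median(x, n=2, rng=6)) # close to 1
```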
We note that Theorem 2 admits a similar extension for the analogous M-estimator in which the disjoint groups are replaced by all subsamples {X_j, j ∈ J}, J ∈ $A^{(n)}_N$. Namely, if the data are i.i.d., then under the assumptions on ρ made in section 2.4, the resulting deviation bound (13) holds with probability at least 1 − 2e^{−s} whenever s > 0 and n = N/k satisfy the appropriate analogue of the condition of Theorem 2, for some positive constants C_1(ρ), c_2(ρ). We omit the proof of (13) since the required modifications in the argument of Theorem 2 are exactly the same as those explained in the proof of Theorem 5.

Estimation in higher dimensions
The results presented above admit a natural extension to higher dimensions. In this section, it will be assumed that θ_* = (θ_{*,1}, . . . , θ_{*,m}) ∈ ℝ^m, m ≥ 2, is a vector-valued parameter of interest. Let X_1, . . . , X_N be i.i.d. random variables that are randomly partitioned into disjoint groups G_1, . . . , G_k with cardinalities n_1, . . . , n_k, and let $\hat\theta_j := \hat\theta_j(G_j) \in \mathbb{R}^m$, 1 ≤ j ≤ k, be a sequence of estimators of θ_*, the common parameter of the distributions of the X_j's. Define $V^{(i)}_j := n_j^{1/2}\left(\operatorname{Var}\hat\theta_{j,i}\right)^{1/2}$. Moreover, we will assume that for all 1 ≤ i ≤ m and 1 ≤ j ≤ k, the coordinate-wise analogue of assumption 1 holds. We will be interested in the estimator given by the weighted $L_1$ median
$$\hat\theta^{(k)} := \operatorname*{argmin}_{z\in\mathbb{R}^m} \sum_{j=1}^k \frac{n_j}{N}\left\|z - \hat\theta_j\right\|_1.$$
Theorem 6. There exist absolute constants c_1, C_2 > 0 such that for all s > 0 and all n_1, . . . , n_k satisfying (14), the following inequality holds with probability at least 1 − 2e^{−s}: Proof. See section 5.7.
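Since the weighted $L_1$ median decouples into m univariate weighted medians, it is cheap to compute; a minimal sketch (our own illustration):

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimizer of sum_j w_j * |z - v_j| (a weighted median)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / w.sum()
    return v[np.searchsorted(cdf, 0.5)]

def l1_median_merge(theta, group_sizes):
    """Weighted L1 median of the rows of theta (k x m): the problem
    decouples across coordinates into m univariate weighted medians."""
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(group_sizes, dtype=float)
    return np.array([weighted_median(theta[:, i], w)
                     for i in range(theta.shape[1])])

rng = np.random.default_rng(7)
theta = rng.normal(0.0, 1.0, size=(50, 3))   # 50 local estimates in R^3
sizes = rng.integers(90, 110, size=50)       # unequal group sizes n_j
print(l1_median_merge(theta, sizes))         # close to the true value 0
```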
Similar results can be established under the more general setting of Theorem 2, albeit at the cost of bulkier statements.

Example: multivariate median-of-means estimator
Consider the special case of Theorem 6 when θ_* = EX is the mean of X ∈ ℝ^m and $\hat\theta_j := \frac{1}{|G_j|}\sum_{X_i\in G_j} X_i$ are the sample means evaluated over the subsamples indexed by G_1, . . . , G_k. The problem of finding a mean estimator that admits sub-Gaussian concentration around EX under weak moment assumptions on the underlying distribution has recently been investigated in several works. For instance, Joly, Lugosi and Oliveira (2016) construct an estimator that admits "almost optimal" behavior under the assumption that the entries of X possess 4 moments. Recently, Mendelson (2017, 2018) proposed new estimators that attain optimal bounds and require the existence of only 2 moments. More specifically, the aforementioned papers show that, for any s such that $\frac{2}{N} < e^{-s} < 1$, there exists an estimator $\hat\theta(s)$ such that
$$\left\|\hat\theta(s) - \mathbb{E}X\right\|_2 \leq C\left(\sqrt{\frac{\operatorname{tr}(\Sigma)}{N}} + \sqrt{\frac{\lambda_{\max}(\Sigma)\, s}{N}}\right)$$
with probability at least 1 − c e^{−s}, where c, C > 0 are numerical constants, Σ is the covariance matrix of X, tr(Σ) is its trace and $\lambda_{\max}(\Sigma)$ is its largest eigenvalue. However, the construction of these estimators explicitly depends on the desired confidence level s, and (more importantly) they are numerically difficult to compute. On the other hand, Theorem 6 demonstrates that the performance of the multivariate median-of-means estimator is robust with respect to the choice of the number of subgroups k, and the resulting deviation bounds hold simultaneously over a range of the confidence parameter s under mild assumptions, for example when the coordinates of X possess 2 + δ moments for some δ > 0. The following corollary summarizes these claims.
Corollary 6. Assume that the coordinates of X possess 2 + δ moments for some δ ∈ (0, 1]. Moreover, assume that n_j ≥ n := N/k for all 1 ≤ j ≤ k. Then there exist absolute constants c_1, C_2 > 0 such that for all s > 0 and k satisfying a condition analogous to (14), the following inequality holds with probability at least 1 − 2e^{−s}: Proof. The result follows immediately from Theorem 6 and Fact 1 in section 5.1.

Remark 7.
Let us compare the bound achieved in Corollary 6 with the deviation guarantees for the sample mean of Gaussian random vectors. It follows from the general deviation inequalities for suprema of Gaussian processes (Ledoux and Talagrand, 1991) that the sample mean of N i.i.d. Gaussian vectors satisfies a bound of the same form. Hence, whenever the "bias" term is of order $o(N^{-1/2})$, the deviation inequality of Corollary 6 provides sub-Gaussian guarantees in the range $0 < s \lesssim k$.

Simulation results
We illustrate the results of the previous sections with numerical simulations that compare the performance of the median-of-means estimator with the usual sample mean; see figure 2 below. Moreover, we compared the theoretical guarantees for the median-of-means estimator (described in section 2.2) against the empirical outcomes for the Lomax distribution with shape parameter α = 4 and scale parameter λ = 1; the corresponding probability density function is
$$p(x) = \frac{\alpha\,\lambda^{\alpha}}{(x + \lambda)^{\alpha + 1}},\quad x \geq 0.$$
In particular, the Lomax distribution with α = 4 and λ = 1 has mean 1/3 and median $\sqrt[4]{2} - 1 \approx 0.1892$. Since the mean and the median do not coincide, the error of the median-of-means estimator has a significant bias component for large values of k. Figure 3 depicts the impact of the bias beyond $k = \sqrt{N}$ (equivalently, $\log_N k = 1/2$), and also the fact that the median error is mostly flat for $k < \sqrt{N}$. Finally, we assessed the empirical coverage of the confidence intervals constructed using Theorem 4 and centered at the median-of-means estimator; the results are presented in figure 4. The sample of size N = 10^5 was generated from the half-t distribution with 3 degrees of freedom; recall that a random variable ξ has the half-t distribution with ν degrees of freedom if $\xi \overset{d}{=} |\eta|$, where η has the usual t-distribution with ν degrees of freedom. It is clear that the half-t distribution is both asymmetric and heavy-tailed. Each sample was further corrupted by outliers sampled from the normal distribution with mean 0 and standard deviation 10^5; the number of outliers ranged from 0 to √N = 100 with increments of 20. The median-of-means estimator was constructed for k = √N = 100. For comparison, we present the empirical coverage levels attained by the sample mean in the same framework.
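A condensed version of the Lomax experiment (a sketch, not the paper's exact simulation code; scipy's lomax uses shape c with a separate scale parameter):

```python
import numpy as np
from scipy.stats import lomax

rng = np.random.default_rng(8)
N = 10_000
alpha, lam = 4.0, 1.0                     # Lomax: mean = lam/(alpha-1) = 1/3
true_mean = lam / (alpha - 1)

for k in [10, 100, 1000]:                 # number of groups
    errs = []
    for _ in range(200):
        x = lomax.rvs(alpha, scale=lam, size=N, random_state=rng)
        groups = np.array_split(x, k)
        mom = np.median([g.mean() for g in groups])
        errs.append(abs(mom - true_mean))
    print(k, np.median(errs))             # error stays flat below k = sqrt(N),
                                          # then the bias kicks in
```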

Proofs
In this section, we present the proofs of the main results.

Preliminaries
We recall several well-known facts that are used in the proofs below. The following generalization of the Berry-Esseen bound (Berry, 1941; Esseen, 1942) is due to Petrov (1995).

Fig 2: Comparison of errors corresponding to the median-of-means and sample mean estimators over 256 runs of the experiment. In (a), the sample of size N = 10^6 consists of i.i.d. random vectors in ℝ² with independent Pareto-distributed entries possessing only 2.1 moments. Each run computes the median-of-means estimator using a partition into k = 1000 groups, as well as the usual sample mean. In (b), the ordered differences between the error of the sample mean and the median-of-means over all 256 runs illustrate robustness. Positive error differences in (b) indicate lower error for the median-of-means, and negative error differences occur when the sample mean provided a better estimate. Images (c) and (d) illustrate a similar experiment performed for two-dimensional random vectors with independent entries with Student's t-distribution with 2 degrees of freedom. In this case, the sample size is N = 100 and the number of groups is k = 10.
Fact 1 (Berry-Esseen bound). Assume that Y_1, . . . , Y_n is a sequence of i.i.d. copies of a random variable Y with mean μ, variance σ², and such that $\mathbb{E}|Y - \mu|^{2+\delta} < \infty$ for some δ ∈ (0, 1]. Then there exists an absolute constant A > 0 such that
$$\sup_{t\in\mathbb{R}}\left| P\left( \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (Y_i - \mu) \leq t \right) - \Phi(t) \right| \leq A\,\frac{\mathbb{E}|Y - \mu|^{2+\delta}}{\sigma^{2+\delta}\, n^{\delta/2}}.$$
Moreover, for δ = 1, A ≤ 0.4748.

Fig 3: The x-axis is $\log_N k$ taken from a uniform partition of (0, 1), and the y-axis indicates the median error of the median-of-means estimator over 2^16 runs of the experiment. For each value of N and k, we run 2^16 simulations by drawing N i.i.d. random variables with the Lomax distribution with shape parameter α = 4 and scale parameter λ = 1, splitting them into k groups, and then computing the median of the means of those groups. From the 2^16 simulations, we display (on a logarithmic scale) the median of the absolute differences between the true mean 1/3 and the median-of-means estimators, producing the dashed lines in the figure. The solid and dotted lines are our theoretical bounds with 4e^{−2s} = 1/2 (that is, the probability that the solid and dotted bounds hold is guaranteed to be at least 1/2).
The upper bound on A in the case when E|X| 3 < ∞ is due to Shevtsova (2011).

Fig 4: Empirical coverage of the confidence intervals, as a function of the nominal confidence level and the fraction of outliers.

Fact 2 (Bounded difference inequality). Let f : S^k → ℝ be such that changing any single argument x_j changes the value of f(x_1, . . . , x_k) by at most c_j. If Z_1, . . . , Z_k are independent S-valued random variables, then for all t > 0,
$$P\left( f(Z_1, \ldots, Z_k) - \mathbb{E} f(Z_1, \ldots, Z_k) \geq t \right) \leq \exp\left( -\frac{2t^2}{\sum_{j=1}^k c_j^2} \right).$$

A U-statistic of order n with kernel h based on the i.i.d. sample X_1, . . . , X_N is defined as (Hoeffding, 1948)
$$U_N(h) := \binom{N}{n}^{-1}\sum_{J\in A^{(n)}_N} h\left(X_i,\ i \in J\right).$$
Clearly, $\mathbb{E}\,U_N(h) = \mathbb{E}\,h(X_1, . . . , X_n)$; moreover, $U_N(h)$ has the smallest variance among all unbiased estimators of $\mathbb{E}\,h(X_1, . . . , X_n)$. The following analogue of fact 2 holds for U-statistics:

Fact 3 (Concentration inequality for U-statistics; Hoeffding, 1963).

Proof of Theorem 2
We will use notation as in the proof of Theorem 1. Let Suppose z 1 , z 2 are such that F + (z 1 ) > 0 and F − (z 2 ) < 0. Since F + , F − are increasing, it is easy to see that θ (k) ρ ∈ (z 2 , z 1 ). To find such z 1 and z 2 , we proceed in 3 steps; we provide a bound on z 1 , and the bound on z 2 follows in a similar manner.

Observe that, for j = 1, . . . , k: (a) First, note that the bounded difference inequality (fact 2) implies that, for any fixed z ∈ ℝ, the corresponding concentration bound holds with probability at least 1 − e^{−s}. (b) Next, we will find an upper bound for the expectation. Note that for any bounded non-negative function f : ℝ → ℝ_+ and a signed measure Q, the total variation bound applies. Since any bounded function f can be written as f = max(f, 0) − max(−f, 0), we deduce the required estimate.

(c) It remains to find z_1 satisfying the required inequality. The following bound yields the desired inequality.
and set Proof. The proof is relatively long and is presented in section 5.8.
Finally, set $\varepsilon := \|\rho'_+\|_\infty\left(\sqrt{2s\sum_{j=1}^k \frac{n_j^2}{N^2}} + 2\sum_{j=1}^k \frac{n_j}{N}\, g_j(n_j)\right)$. If the conditions of Lemma 2 are satisfied, the result follows. The estimate for z_2 follows the same pattern and yields that one can choose z_2 = −z_1, implying the claim.

Proof of Theorem 3
Let J ⊂ {1, . . . , k}, of cardinality |J| ≥ k − O, be the subset containing all j such that the subsample {X_i, i ∈ G_j} does not include any of the O outliers. Clearly, {X_i : i ∈ G_j, j ∈ J} are still i.i.d., as the partitioning scheme is independent of the data. Set $N_J := \sum_{j\in J} |G_j|$. All the probabilities below are evaluated conditionally on N_J. The proof closely follows the steps of the proof of Theorem 2.

As $0 \in \partial F(\hat\theta^{(k)}_\rho)$, we would like to find z_1, z_2 such that F_+(z_1) > 0 and F_−(z_2) < 0. Since F_+, F_− are increasing, it is easy to see that $\hat\theta^{(k)}_\rho \in (z_2, z_1)$ in this case. The second sum can be estimated as above; hence, to guarantee that F_+(z_1) > 0, it suffices to find a suitable z_1. By the definition of the set J, the remaining sum is over the subgroups that do not include the adversarial contamination, hence it can be processed in exactly the same way as in the proof of Theorem 2, and the desired inequality follows.

Proof of Theorem 4
Recall that $L(z) = \mathbb{E}\,\rho'(z + Z)$ for Z ∼ N(0, 1), and note that under our assumptions, the equation L(z) = 0 has a unique solution z = 0 (even if ρ is not strictly convex). Next, observe that it suffices to show that both the left-hand side and the right-hand side of the inequality above converge to 1 − Φ(t) for all t. We will outline the argument for the left-hand side; the remaining part is proven in a similar fashion. Lemma 3. Under the assumptions of Theorem 4, $\sqrt{k}\,\mathbb{E}Y_{n,1} \to t\,\Omega\, L'(0)$ and $k\operatorname{Var}(Y_{n,1}) \to \mathbb{E}\left(\rho'(Z)\right)^2$, where Z ∼ N(0, 1).
Proof of Lemma 3. Let Z ∼ N(0, 1). Since ρ is convex, its derivative ρ′ := (ρ′_+ + ρ′_−)/2 is monotone and continuous almost everywhere (with respect to the Lebesgue measure). Together with the assumption that $\|\rho'\|_\infty < \infty$, the Lebesgue dominated convergence theorem implies the required convergence. Next, we will prove the assertion that $\sqrt{k}\,\mathbb{E}Y_{n,1} \to t\,\Omega\,L'(0)$. Reasoning as in the proof of Theorem 2 (see step (b) in section 5.3), we deduce the corresponding bound, where g(n) is the function from assumption 1. Hence, recalling that $g(n)\sqrt{k} \to 0$ as N → ∞, we obtain the claim. On the other hand, it follows from (18) that for t ≠ 0

For t = 0, it is also clear that $\mathbb{E}\,\rho'(Z) = 0$. To establish the second assertion of the lemma, note that the weak convergence of $\frac{\hat\theta_1 - \theta_*}{\sigma_n}$ to the normal law (assumption 1), together with the Lebesgue dominated convergence theorem, implies the claim. Since L′(0) > 0, it remains to apply the central limit theorem (Serfling, 1981, Theorem 1.9.3) to the $Y_{n,j}$'s to deduce the result from equation (17). To this end, we only need to verify the Lindeberg condition (19) requiring that it holds for any ε > 0. However, since ρ′(·) (and hence $Y_{n,1}$) is bounded, (19) easily follows.

Proof of Theorem 5
The argument is similar to the proof of Theorem 1. Let $\Phi^{(n)}(\cdot)$ be the distribution function of $\frac{\hat\theta_1 - \theta_*}{\sigma_n}$, and let $\widehat\Phi^{(n)}(\cdot)$ be the empirical distribution function corresponding to the sample $W_J = \frac{\hat\theta_J - \theta_*}{\sigma_n}$, $J \in A^{(n)}_N$, of size $\binom{N}{n}$. Suppose that z ∈ ℝ is fixed, and note that $\widehat\Phi^{(n)}(z)$ is a U-statistic with mean $\Phi^{(n)}(z)$. We will apply the concentration inequality for U-statistics (fact 3) with M = 1 to get that
$$\left|\widehat\Phi^{(n)}(z) - \Phi^{(n)}(z)\right| \leq \sqrt{\frac{s}{k}} \qquad (20)$$
with probability ≥ 1 − 2e^{−2s}; here, we also used the fact that n = N/k. Let z_1 ≥ z_2 be such that $\Phi^{(n)}(z_1) \geq \frac{1}{2} + \sqrt{\frac{s}{k}}$ and $\Phi^{(n)}(z_2) \leq \frac{1}{2} - \sqrt{\frac{s}{k}}$. Applying (20) for z = z_1 and z = z_2 together with the union bound, we see that the bound holds for j = 1, 2 on an event E of probability ≥ 1 − 4e^{−2s}. It follows that on E, $\operatorname{med}\left(W_J,\ J \in A^{(n)}_N\right) \in [z_2, z_1]$. The rest of the proof repeats the argument of section 5.2.

Proof of Theorem 6
Set $F(z) := \sum_{j=1}^k \frac{n_j}{N}\left\|z - \hat\theta_j\right\|_1$. Then $\hat\theta^{(k)} = \operatorname{argmin}_{z\in\mathbb{R}^m} F(z)$ by definition. Since F(z) is convex, a sufficient and necessary condition for $\hat\theta^{(k)}$ to be its minimizer is that $0 \in \partial F(\hat\theta^{(k)})$, the subdifferential of F at the point z = (z_1, . . . , z_m). It is easy to see that the subdifferential is determined coordinatewise by ρ(x) = |x|, with $\rho'_{+,i}(z) = I\{z_i \geq \hat\theta_{j,i}\} - I\{z_i < \hat\theta_{j,i}\}$ and $\rho'_{-,i}(z) = I\{z_i > \hat\theta_{j,i}\} - I\{z_i \leq \hat\theta_{j,i}\}$ being the right and left derivatives of ρ. Since the subdifferential is convex, it suffices to find points $z_{i,1}, z_{i,2}$, i = 1, . . . , m, such that (21) holds for all i. This task has already been accomplished in the proof of Theorem 2: in particular, the argument presented in section 5.3 yields that, on an event of probability at least 1 − 2e^{−s}, inequalities (21) hold, assuming that condition (14) is satisfied. The union bound implies that (21) holds for i = 1, . . . , m simultaneously with probability ≥ 1 − 2me^{−s}. The result follows by taking the maximum over i on both sides of (22).
Observe that h(x) ≥ 0 for x ≥ 0 by the assumptions on ρ; hence the corresponding bound holds for any j, where we used the fact that both terms are nonnegative. Next, we will find lower bounds for each of the terms in the maximum above, starting with the first.
(1) Consider two possibilities: (a) Δ < V_j and (b) Δ ≥ V_j. In the first case, we will use the trivial lower bound on the expectation. A direct computation shows that identity (25) holds for any a ∈ ℝ, t > 0. Take $a = n_j^{\beta_j} z_1 / V_j$ and $t = 2\Delta/V_j$, and observe that the assumptions of the theorem imply the inequality |a| ≤ t/4. The minimum of the function $a \mapsto a^2 + t^2 - 2|a|t$ over the set 0 ≤ a ≤ t/4 is attained at a = t/4, implying that $a^2 + t^2 - 2|a|t \geq \frac{9}{16}\,t^2 > \frac{t^2}{2}$. Combining this with (25), we deduce the required bound. Moreover, since $|z_1| \leq \frac{1}{2}\,\Delta\, n_j^{-\beta_j}$ by the assumptions of the lemma, the second term is bounded as well. Together with (23) and (24), the last display yields the claim. Proof. The result is often known as the Cameron-Martin inequality; we give a short proof for the reader's convenience. Observe that the claim follows by a direct computation.