1 Introduction

The article presents a class of estimators and algorithms for the classification problem that are robust to outliers and heavy-tailed data. Consider the classical binary classification problem: let \(\mathcal{F}\) denote a class of functions from \(\mathcal{X}\) to \(\{\pm 1\}\). The empirical risk minimizer (ERM) is defined by

$$\begin{aligned} \widehat{f}_{\text {ERM}} \in \mathop {\mathrm{argmin}}\limits _{f\in \mathcal{F}}\frac{1}{N}\sum _{i=1}^NI{\{Y_i\ne f(X_i)\}} \end{aligned}$$
(1)

where \(I{\{Y_i\ne f(X_i)\}} = 1\) if \(Y_i\ne f(X_i)\) and 0 otherwise. In this paper, we are interested in the case where the random variables \(f(X_i)\) only satisfy a second moment assumption and where the dataset \(\{(X_i,Y_i)_{i\in \{1,\ldots ,N\}}\}\) may contain outliers. The ERM behaves well under these assumptions (see Theorem 1 below). The reason is that the \(0-1\) loss \(\ell ^{0-1}_f(x,y)=I_{\{y\ne f(x)\}}\) is bounded, which grants concentration whatever the distribution of X, and a small number of data points cannot really impact the empirical mean. However, it is well known that ERM is a theoretical estimator that, in most situations, can only be approximated by efficient algorithms. Indeed, the minimization problem (1) is NP-hard even for classes \(\mathcal{F}\) of half-space indicators (Guruswami and Raghavendra 2009; Feldman et al. 2012). One of the most classical ways to approximate ERM is to choose a convex relaxation of problem (1) and design an algorithm solving the associated convex problem. The drawback of these approaches, in the setting of this paper, is that the relaxed criteria are unbounded and therefore much more sensitive to outliers or heavy-tailed inputs. This results in poor performance of the algorithms on corrupted and/or heavy-tailed data. Figure 1 illustrates this problem on a toy example where most data would be well separated by a linear classifier such as the Perceptron (Rosenblatt 1958) or the logistic classifier, but a few anomalies derail these algorithms.

Fig. 1
figure 1

Scatter plot of the toy dataset, the color of the points gives their class. The background color gives the linear separation provided by the perceptron (left) and the logistic regression (right) trained on this corrupted dataset (Color figure online)

The example in Fig. 1 is representative of a general problem that this paper intends to study. Robust learning has received particular attention in recent years from practitioners working on large datasets, which are particularly sensitive to data corruption. Challenges recently posted on “kaggle”, the most popular data science competition platform, have put forward this topic (see, for instance, the 1.5-million-dollar “Passenger Screening Algorithm Challenge”, which involves the discovery of terrorist activity from 3D images, or the challenge named “NIPS 2017: Defense Against Adversarial Attack”, which consists in building algorithms robust to adversarial data). Robust algorithms have also been studied theoretically, both in the statistics and computer science communities. In statistics, robust results usually deal with issues arising when data have heavy-tailed distributions (Lugosi and Mendelson 2017; Minsker 2015; Chen et al. 2018; Fan and Kim 2018). In computer science, most works deal with corrupted datasets, in particular when this corruption arises from adversarial outliers (Diakonikolas et al. 2016, 2017; Cheng et al. 2019). Only a few papers consider both problems simultaneously (Lecué and Lerasle 2017, 2019).

In learning theory, most alternatives to ERM handle outliers and heavy-tailed distributions for the outputs only. These solutions are based on the pioneering work of Tukey (1960, 1962), Huber (1964, 1967) and Hampel (1971, 1974), replacing the square loss by a robust alternative such as Huber's loss or Tukey's biweight loss. These methods do not cover the case where the inputs have heavy tails or are corrupted, a classical problem in robust statistics also known as the “leverage point problem”, see Huber and Ronchetti (2009).

In this article, we address this question by considering an alternative to M-estimators, called median-of-means (MOM) minimizers. Several estimators based on MOM have recently been proposed in the literature: Minsker (2015), Lecué and Lerasle (2017, 2019), Lugosi and Mendelson (2017, 2019a, b) and Mendelson (2017). To our knowledge, these articles use the small ball hypothesis (Koltchinskii and Mendelson 2015; Mendelson 2014) to treat problems of least squares regression or regression with Lipschitz losses. This assumption is restrictive in some classical functional regression frameworks (Saumard 2018; Han and Wellner 2017) or for problems such as the construction of recommendation systems, where inputs are sampled in the canonical basis and therefore do not satisfy a small ball condition.

We construct a natural estimator based on the MOM principle, called the MOM minimizer. This estimator is studied here without the small ball hypothesis. Instead, we assume an a priori bound on the \(L^2\)-norm of the learning functions. One can identify mainly two streams of hypotheses in learning theory: 1) boundedness with respect to some norm of the class F of functions and of the output Y, the typical example being the boundedness in \(L^\infty \) assumption, or 2) a norm equivalence assumption over the class F (or, more precisely, over the shifted class \(F-f^* = \{f-f^*:f\in F\}\) where \(f^*\) is the oracle in F, i.e. the minimizer of the theoretical risk among the functions in F) and Y, the typical example being the subgaussian assumption, i.e. \(\left\| f-f^*\right\| _{\psi _2}\le L \left\| f-f^*\right\| _{L^2}, \forall f\in F\), where for \(g \in F\), \(\Vert g\Vert _{\psi _2}=\inf \{t>0:\mathbb {E}[\exp (g(X)^2/t^2)]\le 2\}\). The small ball assumption is a norm equivalence assumption between the \(L^1\) and \(L^2\) norms and thus belongs to the second type of assumptions. Our approach here deals with the first type of assumption. As we only assume boundedness in \(L^2\)-norm, this can be seen as a significant relaxation of the \(L^\infty \) boundedness assumption. It turns out that, in this relaxed setting, MOM minimizers achieve the minimax rates of convergence (Devroye et al. 1997) that hold, in the absence of a margin condition (Mammen and Tsybakov 1999), even under an \(L^\infty \) assumption.

The estimation of the expectation of a univariate variable by median-of-means (MOM) (Alon et al. 1999; Jerrum et al. 1986; Nemirovsky and Yudin 1983) is done as follows: given a partition of the dataset into blocks of the same size, an empirical mean is constructed on each block and the MOM estimator is obtained by taking the median of these empirical means (see Sect. 2.2 for details). These estimators are naturally resistant to the presence of a few outliers in the dataset: if the number of these outliers does not exceed half the number of blocks, more than half of these blocks are made of “clean” data and the median is a reliable estimator.
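As an illustration, here is a minimal sketch of the MOM estimator of a univariate mean (in Python with NumPy; the function name and the heavy-tailed example are ours, chosen only for illustration):

```python
import numpy as np

def mom_mean(x, K, seed=None):
    """Median-of-means estimate of the mean of the sample x using K blocks:
    shuffle the sample, split it into K blocks of (almost) equal size and
    return the median of the K block means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(rng.permutation(len(x)), K)
    return np.median([x[b].mean() for b in blocks])

# Heavy-tailed sample (finite second moment) with a few aggressive outliers.
rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=1000)  # true mean is 0
x[:10] += 1e3                        # 10 corrupted data points
print(np.mean(x), mom_mean(x, K=41, seed=0))  # empirical mean vs MOM
```

With 10 outliers and \(K=41\) blocks, more than half of the blocks are outlier-free, so the median of the block means stays close to the true mean while the global empirical mean is ruined.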

On the practical side, we introduce algorithms inspired by the MOM minimizers. In these algorithms, the MOM principle is used within algorithms originally intended for the evaluation of ERM estimators associated to convex loss functions. In Sect. 4, we present a “MOM version” of gradient descent algorithms following this approach. The general principle of this iterative algorithm is as follows: at iteration t, a dataset equipartition \(B_1,\ldots ,B_K\) is selected uniformly at random and the most central block \(B_{\text {med}}\) is determined according to the following formula

$$\begin{aligned} \sum _{i\in B_{\text {med}}}\ell _{f_t}(X_i,Y_i)=\text {median}\left( \sum _{i\in B_{k}}\ell _{f_t}(X_i,Y_i):\quad k=1,\ldots ,K\right) = \text {MOM}_{K}\big (\ell _{f_t}\big ) \end{aligned}$$
(2)

where \(\ell _{f_t}(X_i,Y_i)=\ell (f_t(X_i),Y_i)\) is the loss of the prediction \(f_t(X_i)\) of the label \(Y_i\). The next iterate \(f_{t+1}\) is then produced by taking a step from \(f_t\) in the direction opposite to the gradient of \(f \rightarrow \sum _{i\in B_{\text {med}}}\ell _{f}(X_i,Y_i)\) at \(f_t\), cf. Algorithm 1. The underlying heuristic is that the data in the selected block \(B_{\text {med}}\) are safe for estimating the risk of \(f_t\), in the sense that the empirical risk \(|B_{\mathrm{med}}|^{-1}\sum _{i\in B_{\text {med}}}\ell _{f_t}(X_i,Y_i)\) is a subgaussian estimator of \(\mathbb {E}\ell _{f_t}(X,Y)\), cf. Devroye et al. (2016a), and that data indexed by \(B_{\text {med}}\) should not be outliers. The differentiation properties of \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) are studied in Sect. 4.2. One additional advantage of our algorithm is that it is based on a simple idea: select a “good” block of data in such a way that it does not contain outliers and yields a subgaussian estimator of the risk. As a result, it requires only minor modifications of existing gradient-descent-based algorithms to make them robust to outliers and heavy-tailed data. As a proof of concept, in this article, we perform this “MOM modification” of the Logistic Regression, Perceptron and SVM-like algorithms.
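For concreteness, the block-selection step (2) can be sketched as follows (Python/NumPy; names are ours): given the pointwise losses of the current iterate \(f_t\) and a random equipartition into K blocks, it returns the indices of the median block \(B_{\text {med}}\), on which the gradient step is then computed.

```python
import numpy as np

def median_block(losses, K, rng):
    """Return the indices of the block whose summed loss is the median of the
    K block losses, as in Eq. (2); K is assumed to be odd."""
    blocks = np.array_split(rng.permutation(len(losses)), K)
    block_losses = np.array([losses[b].sum() for b in blocks])
    k_med = np.argsort(block_losses)[K // 2]   # index of the median block
    return blocks[k_med]
```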

In Sect. 5, the practical performances of these algorithms are illustrated on several simulations, involving in particular different loss functions. These simulations illustrate, not surprisingly, the gain in robustness obtained by using these algorithms in their MOM version rather than in their traditional version, as can for example be appreciated on the toy example of Fig. 1 (see also Fig. 4 below). MOM estimators are also compared to different learning algorithms on real datasets that can be modeled by heavy-tailed data, obtaining in each case performances comparable to the best of these benchmarks.

Another advantage of our procedure is that it works on blocks of data. This can improve speed of execution and reduce memory requirements, which can be decisive on massive datasets and/or when one wishes to use non-linear algorithms as in Sect. 4.3. This principle of dividing the dataset to calculate estimators more quickly and then aggregating them is a powerful tool in statistics and machine learning (Jordan 2013). Among others, one can mention bagging methods (Breiman 1996) or subagging—a variant of bagging where the bootstrap is replaced by subsampling—(Bühlmann and Bin 2002). These methods are considered difficult to study theoretically in general and their analysis is often limited to obtaining asymptotic guarantees. By contrast, the theoretical tools for non-asymptotic risk analysis of MOM minimizers have already essentially been developed. Finally, subsampling by the central block \(B_{\mathrm{med}}\) ensures robustness properties that cannot be guaranteed by traditional alternatives.

Moreover, the algorithm provides an empirical notion of data depth: data providing good estimates of the risk \(f\rightarrow \mathbb {E}\ell _f(X,Y)\) are likely to be selected many times in the central block \(B_{\text {med}}\) along the descent, while outliers will be systematically ignored. This notion of depth, based on the risk function, is very natural for prediction problems. It is complemented by an outlier detection procedure: data that are selected a number of times below a predetermined threshold are classified as outliers. This procedure is evaluated on our toy example of Fig. 1—for this example, data represented by the dots in the top right corner (the outliers) all end with a null score (see Fig. 7 below). The procedure is then tested on a real dataset, on which the conclusions are more interesting. In this experiment, in accordance with the theoretical upper bounds in Theorem 2, the prediction quality of the MOM minimizer deteriorates for large values of K, and this is verified in some practical cases, cf. Fig. 10. On the other hand, when there are enough data and when the data are not too heavy tailed (finite third moment of the \(f(X_i)\)), the article Minsker (2019) decouples K and N in the risk bound and finds an optimal scaling \(K\asymp \sqrt{N}\); one might think that this decoupling ought to be possible also in our context. In contrast, outlier detection is best when the number of blocks is large, cf. Fig. 8. Outlier resistance and anomaly detection tasks can therefore both be handled using the MOM principle, but the main hyper-parameter K—the number of blocks of data—must be chosen carefully according to the objective. A number of blocks as small as possible (about twice the number of outliers) will give the best predictions, while large values of this number of blocks will accelerate the detection of anomalies. Note that it is essential for outlier detection to use different (for instance, random) partitions at each step of the descent, to avoid giving the same score to an outlier and to all the data in the block containing it.

Outlier detection is usually performed in machine learning via some unsupervised preprocessing algorithm that detects outliers lying outside a bulk of data, see for example Hubert and Van Driessen (2004), He and Fung (2000), Christophe and Catherine (2001), Gunduz and Fokoué (2015), or other algorithms like DBSCAN (Birant and Kut 2007) or isolation forest (Liu et al. 2008). These algorithms assume elliptical symmetry of the data; a solution for skewed data can be found in Hubert and Van Der Veeken (2010). Such unsupervised preprocessing removes outliers in advance, i.e. before starting any learning task. As expected, these strategies work well on the toy example from Fig. 1. There are several cases where they fail though. First, as explained in Huber and Ronchetti (2009), since this strategy classifies data independently of the risk, it is likely to remove from the dataset outliers coming from a heavy-tailed distribution, yielding biased estimators. Moreover, a small group of misclassified data inside the bulk won't be detected. Our notion of depth, based on the risk, seems more adapted to the learning task than any preprocessing procedure blind to the risk.

The paper is organized as follows. Section 2 presents the classification problem, the ERM and its MOM versions, and gathers the assumptions granted for the main results. Section 3 presents theoretical risk bounds for the ERM estimator and MOM minimizers on corrupted datasets. Section 4 deals with theoretical results on the algorithm computing MOM minimizers: we present the algorithm, study the differentiation property of the objective function \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) and provide theoretical bounds on its complexity. Section 5 shows the empirical performance of our estimators on both simulated and real datasets. Proofs of the main results are postponed to Sect. 6, where we also add heuristics on the practical choice of the hyper-parameters.

2 Setting

2.1 Empirical risk minimization for binary classification

Consider the supervised binary classification problem, where one observes a sample \((X_1,Y_1)\), \(\ldots \), \((X_N,Y_N)\) taking values in \(\mathcal{X}\times \mathcal{Y}\). The set \(\mathcal{X}\) is a measurable space and \(\mathcal {Y} = \{-1,1\}\). The goal is to build a classifier—that is, a measurable map \(f:\mathcal{X}\rightarrow \mathcal{Y}\)—such that, for any new observation (X, Y), f(X) is a good prediction for Y. For any classifier f, let

$$\begin{aligned} \ell ^{0-1}_f(x,y)=I\{y\ne f(x)\},\qquad R^{0-1}(f) = P\ell ^{0-1}_f={\mathbb P}_{(X,Y)\sim P} \bigl ( Y\ne f(X) \bigr ). \end{aligned}$$

The \(0-1\) risk \(R^{0-1}(\cdot )\) is a standard measure of the quality of a classifier. Following Chervonenkis and Vapnik (2000), a popular way to build estimators is to replace the unknown measure P in the definition of the risk by the empirical measure \(P_N\) defined for any real valued function g by \(P_Ng=N^{-1}\sum _{i=1}^Ng(X_i,Y_i)\) and minimize the empirical risk. The empirical risk minimizer for the \(0-1\) loss on a class \(\mathcal{F}\) of classifiers is \(\widehat{f}_{\text {ERM}}^{0-1}\in \mathop {\mathrm{argmin}}\nolimits _{f\in \mathcal{F}}\{P_N\ell ^{0-1}_f\}\).

The main issue with \(\widehat{f}_{\text {ERM}}^{0-1}\) is that it cannot be computed efficiently in general. One source of computational complexity is that both \(\mathcal{F}\) and the \(0-1\) loss function are non-convex. This is why various convex relaxations of the \(0-1\) loss have been introduced in statistical learning theory. These proceed in two steps. First, \(\mathcal{F}\) is replaced by a convex set F of functions taking values in \(\mathbb {R}\). Then one builds an alternative loss function \(\ell \) to \(\ell ^{0-1}\), defined for all \(f\in F\). The new function \(\ell \) should be convex and should penalize less those \(f\in F\) such that \(f(X_i)Y_i>0\); such loss functions are commonly called “classification-calibrated losses” in the literature. Classical examples include the hinge loss \(\ell ^{\text {hinge}}_f(x,y)=(1-yf(x))_+\) and the logistic loss \(\ell ^{\text {logistic}}_f(x,y)=\log (1+e^{-yf(x)})\). A couple \((F,\ell )\) such that F is a convex set of real valued functions and \(\ell \) is a convex function (i.e. for all \(y\in \{-1, 1\}\) and \(x\in \mathcal{X}\), \(f\in F\rightarrow \ell _f(x, y)\) is convex) satisfying \(\ell _f(x,y)<\ell _f(x,-y)\) whenever \(yf(x)>0\) will be called a convex relaxation of \((\mathcal{F},\ell ^{0-1})\). Given a convex relaxation \((F,\ell )\) of \((\mathcal{F},\ell ^{0-1})\), one can define the associated empirical risk minimizer by

$$\begin{aligned} \widehat{f}_{\text {ERM}}\in \mathop {\mathrm{argmin}}\limits _{f\in F} P_N\ell _f. \end{aligned}$$
(3)

Note that \(\widehat{f}_{\text {ERM}}\) is not itself a classifier. To deduce a classification rule from \(\widehat{f}_{\text {ERM}}\), one can simply consider its sign, defined for all \(x\in \mathcal{X}\) by \(\mathrm{sign}(\widehat{f}_{\text {ERM}}(x)) = 2(I\{\widehat{f}_{\text {ERM}}(x)\ge 0\}-1/2)\). The procedure \( \widehat{f}_{\text {ERM}}\) is the solution of a convex optimization problem and can therefore be approximated using a descent algorithm. We refer for example to Bubeck (2015) for a recent overview of this topic and to Sect. 4 for more examples.
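As a small illustration (a sketch in Python/NumPy, not the authors' code), the relaxed empirical risk \(P_N\ell _f\) of (3) for a linear function \(f_w(x)=\bigl < w, x \bigr>\) and the sign rule above can be written as follows:

```python
import numpy as np

def hinge_loss(margin):                 # margin = y * f(x)
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    return np.logaddexp(0.0, -margin)   # log(1 + exp(-margin)), numerically stable

def empirical_risk(w, X, y, loss=logistic_loss):
    """P_N ell_f for the linear function f_w(x) = <w, x>."""
    return loss(y * (X @ w)).mean()

def classify(w, X):
    """Sign rule turning the real-valued f_w into a {-1, +1} classifier."""
    return np.where(X @ w >= 0, 1, -1)
```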

2.2 Corrupted datasets

In this paper, we consider a framework where the dataset may have been corrupted by outliers (or anomalies). There are several definitions of outliers in the literature; here, we assume that the dataset is divided into two parts. The first part is the set of inliers, indexed by \(\mathcal{I}\): data \((X_i,Y_i)_{i\in \mathcal{I}}\) are hereafter always assumed to be independent and identically distributed (i.i.d.) with common distribution P. The second one is the set of outliers, indexed by \(\mathcal{O}\subset [N]\), which has cardinality \(|\mathcal{O}|\). Nothing is assumed on these data, which may not be independent, may have distributions \(P_i\) totally different from P, satisfying \(P_i|f|^\alpha =\infty \) for any \(\alpha >0\), etc. Making no hypothesis on the outliers is common in machine learning with adversarial examples, see Gao et al. (2018) and Diakonikolas et al. (2017) for examples of such applications. In particular, this framework is sufficiently general to cover the case where outliers are i.i.d. with distribution \(Q\ne P\), as in the \(\epsilon \)-contamination model (Huber and Ronchetti 2009; Chen et al. 2017; Gao 2017; Donoho and Montanari 2015).

Our first result shows that the rate of convergence of \(\widehat{f}_{\text {ERM}}^{0-1}\) is not affected by this corruption as long as \(|\mathcal{O}|\) does not exceed \(N\times (\text {rate of convergence})\), see Theorem 1 and the remark afterward. However, when the number N of data is finite, as is always the case in practice, even one aggressive outlier may yield a disastrous breakdown of the statistical performance of an empirical mean. Consequently, even if \(\widehat{f}_{\text {ERM}}^{0-1}\) behaves correctly, its proxy \(\widehat{f}_{\text {ERM}}\) defined in (3) for a convex relaxation \((F, \ell )\) can have disastrous statistical performance, particularly when F and \(\ell \) are unbounded, cf. Fig. 4 for an illustration.

To bypass this problem, we consider in this paper an alternative to the empirical mean called median-of-means (Alon et al. 1999; Jerrum et al. 1986; Nemirovsky and Yudin 1983). Let \(K\le N\) denote an integer and let \(B_1,\ldots ,B_K\) denote a partition of \(\{1,\ldots ,N\}\) into bins \(B_k\) of equal size \(|B_k|=N/K\). If K does not divide N, one can always drop a few data points. For any function \(f:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb {R}\) and any non-empty subset \(B\subset \{1,\ldots ,N\}\), define the empirical mean on B by \(P_Bf=|B|^{-1}\sum _{i\in B}f(X_i,Y_i)\). The median-of-means (MOM) estimator of Pf is defined as the empirical median of the empirical means on the blocks \(B_k\):

$$\begin{aligned} \text {MOM}_{K}\big (f\big )=\text {median}\left\{ P_{B_k}f:\;k=1,\ldots ,K\right\} . \end{aligned}$$

As the classical Huber’s estimator (Huber 1964), MOM estimators interpolate between the unbiased but non robust empirical mean (obtained for \(K=1\)) and the robust but biased median (obtained for \(K=N\)). In particular, when applied to loss functions, these new estimators of the risk \(P\ell _f, f\in F\) suggest to define the following alternative to Chervonenkis and Vapnik’s ERM estimator, called MOM minimizers

$$\begin{aligned} \widehat{f}_{\text {MOM},K}\in \mathop {\mathrm{argmin}}\limits _{f\in F} \text {MOM}_{K}\big (\ell _f\big ). \end{aligned}$$
(4)

From a theoretical point of view, we will prove that, when the number \(|\mathcal{O}|\) of outliers is smaller than \(N\times (\text {rate of convergence})\), \(\widehat{f}_{\text {MOM},K}\) performs well under a second moment assumption on F and \(\ell \). To illustrate our main assumptions and theoretical results, we will regularly use the following classical example.

Example 1

(Linear classification) Let \(\mathcal{X}=\mathbb {R}^p\) and let \(\left\| \cdot \right\| _2\) denote the classical Euclidean norm on \(\mathbb {R}^p\). Let F denote a set of linear functions

$$\begin{aligned} F=\{f_t:x\mapsto \bigl < x, t \bigr >: \left\| t\right\| _{2}\le \Gamma \}. \end{aligned}$$

Let \(\ell \) denote either the hinge loss or the logistic loss defined respectively for any \((x,y)\in \mathcal{X}\times \mathcal{Y}\) and \(f\in F\) by

$$\begin{aligned} \ell ^{\text {hinge}}_f(x,y)=(1-yf(x))_+,\qquad \ell ^{\text {logistic}}_f(x,y)=\log (1+e^{-yf(x)}). \end{aligned}$$

Remark that the case with an intercept is included in this linear case by adding an artificial \((p+1)\)th dimension: we consider \(x'=(x_1,\dots ,x_p,1)\) where \(x_1,\dots ,x_p\) are the coordinates of x, and then \(\bigl< x', (t_1,\dots ,t_p,t_{p+1}) \bigr>=\bigl < x, (t_1,\dots ,t_p) \bigr >+t_{p+1}\). In practice this corresponds to adding a column of ones at the end of the design matrix.
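In code, this intercept trick simply appends a constant column to the design matrix (NumPy sketch, ours):

```python
import numpy as np

def add_intercept(X):
    """Append a column of ones so that <x', (t, b)> = <x, t> + b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])
```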

2.3 Main assumptions

As already mentioned, data are divided into two groups: a subset \(\{(X_i,Y_i) : i\in \mathcal{O}\}\) made of outliers (on which we will make no assumption) and the remaining data \(\{(X_i,Y_i): i\in \mathcal{I}\}\), which contain all data that bring information on the target/oracle

$$\begin{aligned} f^{*}\in \mathop {\mathrm{argmin}}\limits _{f\in F}P\ell _f. \end{aligned}$$

Data indexed by \(\mathcal{I}\) are therefore called inliers or informative data. To keep the presentation as simple as possible, inliers are assumed to be i.i.d. with distribution P, although this assumption could be relaxed as in Lecué and Lerasle (2017, 2019). Finally, note that the partition \(\mathcal{O}\cup \mathcal{I}= \{1, \ldots , N\}\) of the dataset is of course unknown to the statistician. Moreover, since no assumption is granted on the set of data indexed by \(\mathcal{O}\), this setup covers the framework of adversarial attacks, where one may imagine that the data indexed by \(\mathcal{O}\) have been changed in the worst possible way by some malicious adversary.

Let us now turn to the set of assumptions we will use to study MOM minimization procedures. For any measure Q and any function f for which it makes sense, denote by \(Qf=\int f dQ\). Denote also, for all \(q\ge 1\), by \(L^q\) the set of real valued functions f such that \(\int |f|^qdP<\infty \) and, for any \(f\in L^q\), by

$$\begin{aligned} \left\| f\right\| _{L^q}=\left( \int |f|^qdP\right) ^{1/q}. \end{aligned}$$

Our first assumption is an \(L^2\)-assumption on the functions in F.

Assumption 1

For all \(f \in F\), we have \(\left\| f\right\| _{L^2}\le \theta _{2}\).

Of course, Assumption 1 is granted if F is a set of classifiers. It also holds for the linear class of functions from Example 1 as long as \(P\left\| X\right\| _2^2<\infty \), with \(\theta _{2}=\Gamma (P\left\| X\right\| _2^2)^{1/2}\). As announced in the introduction, it is a boundedness assumption (w.r.t. the \(L_2\)-norm) and not a norm equivalence assumption. For instance, it covers cases that cannot be handled via norm equivalence. A typical example is matrix completion, where X is uniformly distributed over the canonical basis \((E_{pq}:p\in [m], q\in [T])\) of the linear space \({\mathbb R}^{m\times T}\) of \(m\times T\) matrices. One has, for \(f(\cdot )=\bigl < \cdot , E_{11} \bigr>\) and any \(r\ge 1\), \(\left\| f\right\| _{L_r} = (\mathbb {E}|f(X)|^r)^{1/r}=(1/(mT))^{1/r}\). Hence, any norm equivalence assumption on the class \(F=\{f_A = \bigl < \cdot ,A \bigr >: \left\| f_A\right\| _{L^2}\le \theta _{2}\}\) will depend on the dimension mT of the problem, resulting either in degraded rates of convergence or in assumptions on the number of data. Our approach does not use any norm equivalence assumption, so that our rates of convergence do not depend on dimension-dependent ratios. Rates depend only on the \(L_2\) radius \(\theta _2\) of F from Assumption 1.
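For completeness, the value of \(\theta _2\) claimed above for the linear class of Example 1 follows from a one-line application of the Cauchy–Schwarz inequality: for every \(f_t\in F\),

$$\begin{aligned} \left\| f_t\right\| _{L^2}^2 = \mathbb {E}\bigl < X, t \bigr >^2 \le \left\| t\right\| _{2}^2\, \mathbb {E}\left\| X\right\| _2^2 \le \Gamma ^2\, P\left\| X\right\| _2^2, \quad \text {so one may take } \theta _{2}=\Gamma \bigl (P\left\| X\right\| _2^2\bigr )^{1/2}. \end{aligned}$$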

The second assumption deals with the complexity of the class F. This complexity appears in the upper bound of the risk. It is defined using only informative data. Let

$$\begin{aligned} \mathcal{K}=\{k\in \{1,\ldots ,K\} : B_k\cap \mathcal{O}=\emptyset \} \quad { \text{ and } }\quad \mathcal{J}=\cup _{k\in \mathcal{K}}B_k. \end{aligned}$$

Definition 1

Let \(\mathcal{G}\) denote a class of functions \(f:\mathcal{X}\rightarrow \mathbb {R}\) and let \((\epsilon _i)_{i\in \mathcal{I}}\) denote i.i.d. Rademacher random variables independent from \((X_i, Y_i)_{i\in \mathcal{I}}\). The Rademacher complexity of \(\mathcal{G}\) is defined by

$$\begin{aligned} \mathcal{R}(\mathcal{G})=\max _{A\in \{\mathcal{I},\mathcal{J}\}}\mathbb {E}\left[ \sup _{f\in \mathcal{G}}\sum _{i\in A}\epsilon _if(X_i)\right] . \end{aligned}$$

The Rademacher complexity is a standard measure of complexity in classification problems (Bartlett and Mendelson 2002). The normalized quantity \(\mathcal{R}(\mathcal{G})/N\) appearing in the risk bounds below can be upper bounded by \(\mathrm{comp}/\sqrt{N}\), where \(\mathrm{comp}\) is a measure of complexity such as the square root of the VC dimension, Dudley's entropy integral or the Gaussian mean width of the class F; see for example Boucheron et al. (2005, 2013), Koltchinskii (2008), Bartlett and Mendelson (2002) and Devroye et al. (1997) for a presentation of these classical bounds. Our second assumption is simply that the Rademacher complexity of the class F is finite.

Assumption 2

The Rademacher complexity of F is finite, \(\mathcal{R}(F)<\infty \).

Assumption 2 holds in the linear classification example under Assumption 1, since it follows from the Cauchy-Schwarz inequality that \(\mathcal{R}(F)\le \theta _{2}\sqrt{|\mathcal{I}|p}\). Finally, our last assumption is that the loss function \(\ell \) considered is Lipschitz in the following sense.

Assumption 3

The loss function \(\ell \) satisfies for all \((x,y)\in \mathcal{X}\times \mathcal{Y}\) and all \(f,f^\prime \in F\),

$$\begin{aligned} |\ell _f(x,y)-\ell _{f'}(x,y)|\le L|f(x)-f'(x)| . \end{aligned}$$

Assumption 3 holds for classical convex relaxations of the \(0-1\) loss such as the hinge loss \(\ell ^{\text {hinge}}\) or the logistic loss \(\ell ^{\text {logistic}}\) from Example 1. In these examples, the constant L can be chosen equal to 1. It also covers non-convex loss functions such as the ones in Baraud et al. (2017), Catoni (2012) and Audibert and Catoni (2011), or sigmoid loss functions such as those used in deep learning. In particular, our results do not follow from other works on MOM estimators using convex loss functions such as Chinot et al. (2019).
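Before moving to the theoretical guarantees, here is a small numerical illustration of Assumption 2 for the linear class of Example 1 (a sketch of ours, conditionally on the data and for a fixed index set A): the supremum in Definition 1 has the closed form \(\sup _{\left\| t\right\| _2\le \Gamma }\sum _{i\in A}\epsilon _i\bigl < X_i, t \bigr > = \Gamma \bigl \Vert \sum _{i\in A}\epsilon _i X_i\bigr \Vert _2\), so the expectation over the Rademacher signs can be approximated by Monte Carlo.

```python
import numpy as np

def empirical_rademacher_linear(X, Gamma, n_draws=1000, seed=None):
    """Monte Carlo estimate of E_eps sup_{||t||_2 <= Gamma} sum_i eps_i <X_i, t>
    for a fixed design X (rows X_i); the supremum equals Gamma * ||sum_i eps_i X_i||_2."""
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=(n_draws, X.shape[0]))
    return Gamma * np.linalg.norm(eps @ X, axis=1).mean()
```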

3 Theoretical guarantees

Our first result follows Vapnik–Chervonenkis's original risk bound for the ERM and shows that \(\widehat{f}_{\text {ERM}}^{0-1}\) is insensitive to the presence of outliers in the dataset. Moreover, it quantifies this robustness property: Vapnik–Chervonenkis's rate of convergence is still achieved by \(\widehat{f}_{\text {ERM}}^{0-1}\) when there are fewer than (number of observations) times (Vapnik's rate of convergence) outliers.

Theorem 1

Let \(\mathcal{F}\) denote a collection of classifiers. Let \(\mathcal{L}_\mathcal{F}^{0-1}=\{\ell _f^{0-1}-\ell _{f^*}^{0-1} : f \in \mathcal{F}\}\) be the family of excess loss functions indexed by \(\mathcal{F}\) where \(f^*\in \mathop {\mathrm{argmin}}\nolimits _{f\in \mathcal{F}} R^{0-1}(f)\). For all \(K>0\), with probability at least \(1-e^{-K}\), we have

$$\begin{aligned} R^{0-1}(\widehat{f}_{\text {ERM}}^{0-1})-\inf _{f\in \mathcal{F}}R^{0-1}(f)\le \frac{2\mathcal{R}(\mathcal{L}_\mathcal{F}^{0-1})}{ N}+\sqrt{\frac{K}{2|\mathcal{I}|}}+\frac{2|\mathcal{O}|}{N}. \end{aligned}$$

Theorem 1 is proved in Sect. 6.1. It is an adaptation of Vapnik–Chervonenkis’s proof of the excess risk bounds satisfied by \(\widehat{f}_{\text {ERM}}^{0-1}\) in the presence of outliers.

Remark 1

In the last result, one can easily bound the excess risk using \(\mathcal{R}(\mathcal {F})\) instead of \(\mathcal{R}(\mathcal{L}_\mathcal{F}^{0-1})\) since

$$\begin{aligned} \mathcal{R}(\mathcal{L}_\mathcal{F}^{0-1}) = \max _{A\in \{\mathcal{I},\mathcal{J}\}}\mathbb {E}\left[ \sup _{f\in \mathcal{F}}\sum _{i\in A}\epsilon _i(f(X_i)-f^*(X_i))\right] = \mathcal{R}(\mathcal {F}) . \end{aligned}$$

The final bound is of similar flavor: for all \(K>0\), with probability at least \(1-e^{-K}\), we have

$$\begin{aligned} R^{0-1}({\hat{f}}_{\text {ERM}}^{0-1})-\inf _{f\in \mathcal{F}}R^{0-1}(f)\lesssim \max \left( \frac{\mathcal{R}(\mathcal{F})}{N},\sqrt{\frac{K}{|\mathcal{I}|}}, \frac{|\mathcal{O}|}{N}\right) . \end{aligned}$$
(5)

Remark 2

When \(\mathcal{F}\) is the class of all linear classifiers, that is when \(\mathcal{F}=\{\mathrm{sgn}(\bigl < t,\cdot \bigr >): t\in \mathbb {R}^p\}\), one has \(\mathcal{R}(\mathcal{F})\le \sqrt{|\mathcal{I}| p}\) [see Theorem 3.4 in Boucheron et al. 2005]. Therefore, when \(|\mathcal{I}|\ge N/2\), Theorem 1 implies that for all \(1\le K\le p\), with probability at least \(1-\exp (-K)\),

$$\begin{aligned} R^{0-1}({\hat{f}}_{\text {ERM}}^{0-1})-\inf _{f\in \mathcal{F}}R^{0-1}(f)\lesssim \max (\sqrt{p/N}, |\mathcal{O}|/N). \end{aligned}$$

As a consequence, when the number of outliers is such that \(|\mathcal{O}|\lesssim N\times \sqrt{p/N}\), Vapnik–Chervonenkis's classical “slow” rate of convergence \(\sqrt{p/N}\) is still achieved by the ERM estimator even if \(|\mathcal{O}|\) outliers have polluted the dataset. The interested reader can also check that “fast rates” p/N could be achieved by the ERM estimator in the presence of outliers if \(|\mathcal{O}|\lesssim p\) and when the so-called strong margin assumption holds (see Boucheron et al. 2005). Note also that the previous remark holds whenever F is a class with VC dimension p, beyond the case of indicators of half-spaces.
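To fix ideas with purely illustrative numbers (not taken from any experiment in this paper): with \(N=10^4\) data and \(p=100\),

$$\begin{aligned} \sqrt{p/N}=\sqrt{100/10^4}=0.1 \quad \text {and}\quad N\sqrt{p/N}=\sqrt{pN}=10^3, \end{aligned}$$

so the slow rate 0.1 is preserved as long as at most of the order of a thousand data points are corrupted.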

The conclusion of Theorem 1 can be misleading in practice. Indeed, the theoretical performance of the ERM estimator for the \(0-1\) loss function is not downgraded by outliers, but its proxies based on a convex relaxation \((F,\ell )\) of \((\mathcal{F},\ell ^{0-1})\) are. This can be seen on the toy example in Fig. 1 and in Fig. 4 from Sect. 5. In this work, we propose a robust surrogate for the natural empirical estimation of the risk, based on the MOM estimators of the risk appearing in (2), which works for unbounded loss functions. In the next result, we prove that the MOM minimizer \({\hat{f}}_{\text {MOM},K}\) defined as

$$\begin{aligned} {\hat{f}}_{\text {MOM},K}\in \mathop {\mathrm{argmin}}\limits _{f\in F} \text {MOM}_{K}\big (\ell _f\big ) \end{aligned}$$
(6)

satisfies an excess risk bound under weak assumptions introduced in Sect. 2.

Theorem 2

Grant Assumptions 1, 2 and 3. Assume that \(N>K>4|\mathcal{O}|\) and let \(\Delta =1/4-|\mathcal{O}|/K\). Then, with probability larger than \(1-2\exp \left( -2\Delta ^2K\right) ,\) we have

$$\begin{aligned} R({\hat{f}}_{\text {MOM},K})\le \inf _{f\in F}R(f) + 4L\max \left( \frac{4\mathcal{R}(F)}{ N}, 2\theta _{2}\sqrt{\frac{K}{ N}}\right) . \end{aligned}$$

Theorem 2 is proved in Sect. 6.2. Compared to Theorem 1, \({\hat{f}}_{\text {MOM},K}\) achieves the same rate \((\mathcal{R}(F)/ N)\vee (\sqrt{K/ N})\), under the same conditions on the number of outliers and with the same exponential control of the probability, as the ERM estimator \(f_{\text {ERM}}^{0-1}\). The main difference is that the loss function may be unbounded, which is often the case in practice. Moreover, unlike classical analyses of ERM based on the minimization of an empirical risk associated with a convex surrogate loss function, we only need a second moment assumption on the class F.

These theoretical improvements have already been noticed in previous works (Minsker 2015; Devroye et al. 2016b; Lugosi and Mendelson 2017, 2019a, b; Lecué and Lerasle 2017, 2019; Mendelson 2017). Contrary to the tournaments of Lugosi and Mendelson (2017), the Le Cam MOM estimators of Lecué and Lerasle (2019) or the minmax MOM estimators of Lecué and Lerasle (2019), Theorem 2 does not require the small ball assumption on F, but it only shows “slow rates” of convergence. These slow rates are minimax optimal in the absence of a margin or Bernstein assumption (Bartlett and Mendelson 2006; Mammen and Tsybakov 1999). Removing the small ball assumption may be useful in some examples. As an illustration, consider the toy example where the design

$$\begin{aligned} X= \begin{bmatrix} \mathbf 1 _{W\in I_1}\\ \vdots \\ \mathbf 1 _{W\in I_d} \end{bmatrix} \end{aligned}$$

where \(I_1,\ldots ,I_d\) is a partition of a measurable set \({\mathbb W}\) into subsets such that \(\mathbb {P}(W\in I_i)=1/d\) for each \(i\in \{1,\ldots ,d\}\). Then \({\mathbb X}=[0,1]^d\) and one can consider the set F of linear functions \(f(X)=\bigl < t,X \bigr>\), where the Euclidean norm of t satisfies \(\Vert t\Vert \leqslant B\sqrt{d}\). Then, as \(\Vert \bigl < t,\cdot \bigr >\Vert _{L^2}^2=\sum _{i=1}^dt_i^2\mathbb {P}(W\in I_i)=\Vert t\Vert ^2/d\), Assumption 1 holds with \(\theta _2=B\). In this example, Assumption 2 holds with \(\mathcal{R}(F)\le \theta _{2}\sqrt{|\mathcal{I}| d}\leqslant B\sqrt{N d}\). It follows from Theorem 2 that the remainder term in this example is bounded from above by

$$\begin{aligned} 4LB\max \left( 4\sqrt{\frac{d}{ N}}, 2\theta _{2}\sqrt{\frac{K}{ N}}\right) . \end{aligned}$$

In particular, it converges to 0 if \(d\vee K\ll N\). By comparison, in the same example, it is shown in Chinot et al. (2019) that the remainder term converges to 0 only if \(d\lesssim \sqrt{N}\).

The proof of Theorem 2 does not yield fast rates. Indeed, the non-linearity of the median excludes the possibility of using the localization techniques leading to these fast rates. However, we show in the simulation study (cf. left picture of Fig. 12) that fast rates seem to be reached by the MOM minimizer.

Remark 3

The MOM principle has been used together with Lipschitz loss functions recently in Chinot et al. (2019). In that paper, a minmax MOM estimator is constructed which can achieve fast rates of convergence under a margin condition. The argument from Chinot et al. (2019) relies heavily on the convexity of the loss—an assumption we do not have here. The reason why the convexity of the loss is so important in Chinot et al. (2019) is that it allows one to exclude (as potential minmax MOM estimators) all the functions in F outside an \(L_2\)-ball centered at \(f^*\) with radius r as soon as all the functions in F on the sphere \(f^*+r\mathcal{S}_2\) are excluded. Therefore, thanks to convexity, this “homogeneity argument” reduces the problem to the study of the sub-model \(F\cap (f^*+r \mathcal{S}_2)\) (which is bounded in \(L_2\) with the right radius r). Here, no such homogeneity argument can be used because we did not assume the loss to be convex. Nevertheless, if we assume that the loss is convex, then we may still apply Theorem 4 in Chinot et al. (2019) and replace all the localized sets by the entire set F and the variance term by the \(L_2\) uniform bound \(\theta _2\) coming from Assumption 1 to obtain a result similar to Theorem 2 for a minmax MOM estimator. These stronger results require the convexity of the loss and a Bernstein assumption that may be satisfied only under strong assumptions, as discussed in the toy example.

Finally, the main advantage of our approach is its simplicity: we just have to replace empirical means by their MOM alternatives in the definition of the ERM estimator. Moreover, as expected, this simple alternative to ERM estimators yields a systematic way to modify algorithms designed for approximating the ERM estimator. The resulting “MOM versions” of these algorithms are both faster and more robust than their original “ERM versions”. Before illustrating these facts on simulations, let us describe algorithms approximating MOM minimizers.

4 Computation of MOM minimizers

In this section, we present a generic algorithm producing a MOM version of descent algorithms. We study the differentiation property of the objective function \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\). Then we check on simulated and real databases the robustness and outlier detection properties of these MOM algorithms.

4.1 MOM algorithms

The general idea is that any descent algorithm such as gradient descent, Newton's method, alternating gradient descent, etc. (cf. Moulines and Bach 2011; Bubeck 2015; Boyd and Vandenberghe 2004; Bach et al. 2012) can easily be turned into a robust MOM version. To illustrate this idea, a basic gradient descent is analyzed in the sequel. We start with the block splitting policy of the database.

The choice of blocks greatly influences the practical performance of the algorithm. In particular, a recurring flaw is that iterations tend to get stuck in local minima, which greatly slows down the convergence of the algorithm. To overcome this drawback and improve the stability of the procedure, a new partition is constructed at each iteration by drawing it uniformly at random, cf. step 2 of Algorithm 1.

Let \(\mathcal{S}_N\) denote the set of permutations of \(\{1,\ldots ,N\}\). For each \(\sigma \in \mathcal{S}_N\), let \(B_0(\sigma )\cup \cdots \cup B_{K-1}(\sigma )=\{1, \ldots , N\}\) denote an equipartition of \(\{1, \ldots , N\}\) defined for all \( j \in \llbracket 0,K-1\rrbracket \) by

$$\begin{aligned} B_j(\sigma ) =\left\{ \sigma \left( jN/K+1\right) ,\dots ,\sigma \left( (j+1)N/K\right) \right\} = \sigma \left( \{jN/K+1, \ldots , (j+1)N/K\}\right) . \end{aligned}$$

To simplify the presentation, let us assume that the class F is parametrized: \(F=\{ f_u : u \in \mathbb {R}^p\}\) for some \(p\in \mathbb {N}^*\). Let us assume that the function \(u\mapsto f_u\) is as regular as needed and convex (a typical example is \(f_u(x) = \bigl < u, x \bigr>\) for all \(x\in \mathbb {R}^p\)). Denote by \(\nabla _u\ell _{f_u}\) the gradient or a subgradient of \(u\mapsto \ell _{f_u}\) at \(u\in \mathbb {R}^p\). The step-size sequence is denoted by \((\eta _t)_{t\ge 0}\) and satisfies the classical conditions \(\sum _{t=1}^{\infty }\eta _t = \infty \) and \(\sum _{t=1}^{\infty }\eta _t^2 < \infty \). Iterations go on until a stopping time \(T\in \mathbb {N}^*\) is reached. With these notations, a generic MOM version of a gradient descent algorithm (with random choice of blocks) is detailed in Algorithm 1 below.

figure a
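Since Algorithm 1 is displayed only as a figure, here is a minimal self-contained sketch of the loop (Python/NumPy, linear class and logistic loss; all names, the step-size choice and the stopping rule are ours and purely illustrative, not the authors' reference implementation):

```python
import numpy as np

def mom_gradient_descent(X, y, K, T, step=lambda t: 1.0 / (1.0 + t), seed=None):
    """MOM gradient descent for f_u(x) = <u, x> with the logistic loss.

    Each iteration roughly follows steps 2-5 of Algorithm 1: draw a random
    equipartition into K blocks, select the block whose empirical risk is the
    median of the K block risks (Eq. (2)), and take a gradient step computed
    on that block only. K is assumed odd."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    u = np.zeros(p)
    for t in range(T):
        blocks = np.array_split(rng.permutation(n), K)          # random equipartition
        risks = np.array([np.logaddexp(0.0, -y[b] * (X[b] @ u)).mean()
                          for b in blocks])                      # block empirical risks
        b_med = blocks[np.argsort(risks)[K // 2]]                # median block
        margins = y[b_med] * (X[b_med] @ u)                      # gradient on that block
        grad = -(y[b_med] / (1.0 + np.exp(margins))) @ X[b_med] / len(b_med)
        u = u - step(t) * grad
    return u
```

A classifier is then obtained from the output u through the sign rule of Sect. 2.1; the default step size satisfies the conditions \(\sum _t\eta _t=\infty \) and \(\sum _t\eta _t^2<\infty \) recalled above.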

Remark 4

(MOM gradient descent algorithm and stochastic block gradient descent) Algorithm 1 can be seen as a stochastic block gradient descent (SBGD) algorithm minimizing the function \(t\rightarrow \mathbb {E}\ell _t(X, Y)\) using a given dataset. The main difference with the classical SBGD is that the block along which the gradient step is performed is chosen according to a centrality measure, computed thanks to the median operator in step 4 of Algorithm 1.

In Sect. 5, we use the MOM principle (as in the generic Algorithm 1) to construct MOM versions for various classical algorithms such as Perceptron, Logistic Regression, Kernel Logistic Regression, SGD Classifiers or Multi-layer Perceptron.

4.2 Differentiation properties of \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\), random partition and local minima

Let us try to explain the choice of the descent direction \(\nabla _t\) in step 5 of Algorithm 1. In the previous sections, we introduced and studied MOM minimization procedures which are minimizers of \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) over F. The optimization problem that needs to be solved to construct a MOM minimizer is not convex, in general. It therefore raises difficulties since classical tools and algorithms from the convex optimization toolbox cannot be used a priori. Nevertheless, one may still try to do a gradient descent algorithm for this (non-convex) optimization problem with objective function given by \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\). To do so, we first need to check the differentiation properties of \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) over F.

First observe that the descent direction \(\nabla _t\) is the gradient of the empirical risk constructed on the median block of data \(B_{k_{med}(t)}(\sigma _t)\) at \(f_{u_t}\) (recall that F is parametrized as \(\{f_u:u\in {\mathbb R}^p\}\)). A classical gradient descent algorithm on \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) starting from \(f_{u_t}\) would use a gradient of the objective function at \(f_{u_t}\). Let us first identify situations where this is indeed the case, i.e. when \(\nabla _t\) is the gradient of \(f\rightarrow \text {MOM}_{K}\big (\ell _f\big )\) at \(f_{u_t}\).

Assumption 4

For almost all datasets \(\mathcal{D}_N=\{(X_i,Y_i):i=1, \ldots , N\}\) and Lebesgue-almost all \(u \in \mathbb {R}^p\), there exists an open convex set B containing u such that for any equipartition of \(\{1, \ldots , N\}\) into K blocks \(B_1, \ldots , B_K\) there exists \(k_{med}\in \{1,\cdots , K \}\) such that for all \(v\in B\), \(P_{B_{k_{med}}}(\ell _{f_{v}})= \text {MOM}_{K}\big (\ell _{f_{v}}\big )\).

In other words, under Assumption 4, for almost all \(u_0\in {\mathbb R}^p\), the median block \(B_{k_{med}}\) achieving \(\text {MOM}_{K}\big (\ell _{f_{u_0}}\big )\) is the same as the one achieving \(\text {MOM}_{K}\big (\ell _{f_{u}}\big )\) for all u in an open and convex neighborhood B of \(u_0\). It means that, on B, the objective function \(u\rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\) is equal to the empirical risk function over the same block of data \(B_{k_{med}}\), namely \(u\rightarrow P_{B_{k_{med}}} \ell _{f_u}\). Since B is an open set and \(u\rightarrow P_{B_{k_{med}}} \ell _{f_u}\) is differentiable at \(u_0\), the objective function \(u\rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\) is also differentiable at \(u_0\) and the two gradients coincide:

$$\begin{aligned} \nabla \left( u \rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\right) _{|u_0} = \nabla \left( u \rightarrow P_{B_{k_{med}}} \ell _{f_u} \right) _{|u_0}. \end{aligned}$$
(7)

Under Assumption 4, Algorithm 1 is indeed a gradient descent algorithm performed on the objective function \(u\in {\mathbb R}^p\rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\).

Let us give an example where Assumption 4 is satisfied. Let \(B_1\cup \cdots \cup B_K=\{1, \ldots , N\}\) be an equipartition and let \(\psi \) be defined for all \(x=(x_i)_{i=1}^N\in \mathbb {R}^N\) and \(u\in {\mathbb R}^p\) by,

$$\begin{aligned} \psi _u(x)=\text {MOM}_{K}\big (f_u(x)\big )=\text {median}\left( \frac{K}{N}\sum _{i\in B_k}f_u(x_i), 1\le k\le K \right) = P_{B(\lceil K/2 \rceil )(u)}(f_u), \end{aligned}$$

where for all blocks \(B\subset \{1, \ldots , N\}, P_Bf_u = |B|^{-1}\sum _{i\in B} f_u(x_i)\) and the blocks \(B(k)(u),k=1, \ldots , K\) are rearranged blocks defined such that \(P_{B(1)(u)}(f_u)\ge \dots \ge P_{B(K)(u)}(f_u)\). Proposition 1 below shows that Assumption 4 is satisfied in several situations. Its proof can be found in Section 6.

Proposition 1

Let \(X_1, \ldots , X_N\) be N real-valued random variables, suppose K is odd and N is a multiple of K. Let \((f_u)_{u \in \mathbb {R}^d}\) be a family of real-valued functions. Assume that, for all \(x\in \mathbb {R}\), the function \(u\mapsto f_u(x)\) is Lipschitz and that the distribution of \(f_u(X_1)\) is absolutely continuous with respect to the Lebesgue measure. Then, with probability 1, Assumption 4 is satisfied; in particular, the partial derivative of \(u\mapsto \psi _{u}((X_i)_{i=1}^N)=\text {MOM}_{K}\big (f_u((X_i)_{i=1}^N)\big )\) with respect to the jth coordinate is given, for almost all \(X_1, \ldots , X_N\), by

$$\begin{aligned} \partial _j\psi _{u}((X_i)_{i=1}^N)= \frac{K}{N}\sum _{i\in B(\lceil K/2 \rceil )(u)}\partial _jf_u(X_i) \end{aligned}$$

where \(\partial _j\) denotes the derivative with respect to the jth coordinate of u.

Under Assumption 4, the picture of the MOM gradient descent algorithm is pretty simple and is depicted in Fig. 2. At every step t, the median operator induces a partition of \(\mathbb {R}^p\) into K cells \(\mathcal{C}_k(t) = \{u\in \mathbb {R}^p: \text {MOM}_{K}\big (\ell _{f_u}\big ) = P_{B_k}\ell _{f_u}\}\) for \(k=1, \ldots , K\)—this partition changes at every step because the blocks \(B_1, \ldots , B_K\) are chosen randomly at the beginning of every step according to the random permutation \(\sigma _t\). We want every iterate \(u_t\) of the MOM algorithm to be in the interior of a cell and not on a frontier, in order to differentiate the objective function \(u\rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\) at \(u_t\). This is indeed the case under Assumption 4, given that, in that case, there is an open neighborhood B of \(u_t\) such that for all \(v\in B\), \(\text {MOM}_{K}\big (\ell _{f_v}\big ) = P_{B_k}\ell _{f_v}\) where the index \(k=k_{med}\) of the block is common to every \(v\in B\). Therefore, to differentiate the objective function \(u\rightarrow \text {MOM}_{K}\big (\ell _{f_u}\big )\) at \(u_t\), one just needs to differentiate \(u\rightarrow P_{B_k}\ell _{f_u}\) at \(u_t\). The objective function to minimize is differentiable almost everywhere under Assumption 4 and a gradient of the objective function is given by \(\nabla (u\rightarrow P_{B_k}\ell _{f_u})_{|u=u_t}\), that is \(\nabla _t\) from step 5 of Algorithm 1.

Fig. 2
figure 2

Partition of \(\mathbb {R}^p\) at step t by the median operator and iteration number \(t-2, t-1, t\) and \(t+1\) of the MOM gradient descent algorithm. Under Assumption 4, there is a natural descent direction given at step t by \(-\nabla _{u}(u\rightarrow P_{B_{k_{med}(t)}(\sigma _t)}\left( \ell _{f_u}\right) ))_{|u=u_t}\)

Under Assumption 4, the importance of partitioning the dataset anew at each iteration is more transparent. Indeed, if we were to perform the MOM gradient descent as in Algorithm 1 but without a new partition at each step, then local minima of the K empirical risks \(u\rightarrow P_{B_k}\ell _{f_u}, k\in [K]\), may mislead the descent algorithm. Indeed, if a minimum of \(u\rightarrow P_{B_k}\ell _{f_u}\) for some \(k\in [K]\) is in the cell \(\mathcal{C}_k\), then the algorithm will reach this minimum without noticing that a “better” minimum lies in another cell. That is why re-partitioning the dataset at every iteration avoids this effect and speeds up the convergence (see Lecué and Lerasle 2017 for experiments).

4.3 Complexity of MOM risk minimization algorithms

In this section, we compute the computational cost of several MOM versions of some classical algorithms. Let C(m) be the computational complexity of a single standard gradient descent update step on a dataset of size m and let L(m) be the computational complexity of the evaluation of the empirical risk \((1/m)\sum _{i\in B}\ell _f(X_i, Y_i)\) of some \(f\in F\) on a dataset B containing m data. Here the computational complexity is simply the number of basic operations needed to perform a task (Arora and Barak 2009).

For each epoch, we begin by computing the “MOM empirical risk”. We perform K times N/K evaluations of the loss function, then we sort the K block means of the loss to finally get the median. The complexity of this step is then \(O(KL(N/K)+K\ln (K))\), assuming that the sorting algorithm is in \(O(K\ln (K))\) (like quicksort (Hoare 1962)). Then we do the gradient step on a sample of size N/K. Hence, the time complexity of this algorithm is

$$\begin{aligned} O(T(KL(N/K)+K\ln (K)+C(N/K))). \end{aligned}$$

Example 2

(Linear complexity “ERM version” algorithms) For example, if the standard gradient step and the loss function evaluation have linear complexity—as for the Perceptron or Logistic Regression—the complexity of the MOM algorithm is \(O(T(N+K\log (K)))\) against O(TN) for the ERM algorithm. Therefore, the two complexities are of the same order and the only advantage of MOM algorithms lies in their robustness to outliers and heavy-tailed data.

Example 3

(Super-linear complexity “ERM version” algorithms) If, on the other hand, the complexity is more than linear, as for Kernel Logistic Regression (KLR), taking into account the matrix multiplications whose complexity can be found in Le Gall (2014), the complexity of the MOM version of KLR, due to the additional need to compute the kernel matrix, is \(O(N^2+T(N^2/K+K\log (K)+(N/K)^{2.373}))\) against \(O(TN^{2.373})\) for the standard “ERM version”. MOM versions of KLR are therefore faster than the classical version of KLR, on top of being more robust. This advantage comes from the fact that MOM algorithms work on blocks of data instead of on the entire dataset at every step. More information about Kernel Logistic Regression can be found in Roth (2001) for example.

In this last example, the complexity comes in part from the evaluation of the kernel matrix, which can be computationally expensive. Following the idea that MOM algorithms perform an ERM algorithm restricted to a wisely chosen block of data, one can modify our generic strategy in this particular example to drastically reduce its complexity. The idea here is that we only need to construct the kernel matrix on the median block. The resulting algorithm, called Fast KLR-MOM, is described in Algorithm 2.

In Algorithm 2, we compute only the block kernel matrices, denoted by \(N^1,\dots ,N^K\), where \(N^k\) is constructed from the samples in the block \(B_k\). We also denote by \(N_i^k\) the ith row of \(N^k\).

figure b

There are several drawbacks to the approach of Algorithm 2. First, the blocks are fixed at the beginning of the algorithm; therefore the algorithm needs a bigger dataset to work well and it may converge to a local minimum. Nonetheless, from the complexity point of view, this algorithm will be much faster than both the classical KLR and MOM KLR (see below for a computation of its complexity), which is important given the growing use of kernel methods on very large databases, for example in biology. The choice of K should ultimately realize a trade-off between complexity and performance (in terms of accuracy for example) when dealing with big databases containing few outliers.

Example 4

(Complexity of Fast KLR-MOM algorithm) The complexity of Fast KLR-MOM is \(O(N^2/K+T(N^2/K+K\log (K)+(N/K)^{2.373}))\) against \(O(TN^{2.373})\) for the ERM version.

5 Implementation and simulations

5.1 Basic results on a toy dataset

The toy model we consider models outliers due to human or machine errors that we would like to ignore in our learning process. It is also a dataset corrupted in order to make linear classifiers fail. The dataset is a 2D dataset made of three labeled Gaussian distributions: two informative Gaussians \(\mathcal {N}((-1,-1),1.4I_2)\) and \(\mathcal {N}((1,1),1.4I_2)\) with labels 1 and \(-1\) respectively, and one outlier Gaussian \(\mathcal {N}((24,8),0.1I_2)\) with label 1. In other words, the distribution of informative data is given by \(\mathcal{L}(X|Y=1) = \mathcal{N}((-1,-1),1.4I_2)\), \(\mathcal{L}(X|Y=-1) = \mathcal{N}((1,1),1.4I_2)\) and \({\mathbb P}(Y=1)={\mathbb P}(Y=-1)=1/2\). Outlier data have distribution given by \(Y=1\) a.s. and \(X\sim \mathcal {N}((24,8),0.1I_2)\).
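A sketch of a generator for this toy distribution (ours; we read \(1.4I_2\) and \(0.1I_2\) as covariance matrices, hence standard deviations \(\sqrt{1.4}\) and \(\sqrt{0.1}\); seeds and exact sample sizes used for the figures may differ):

```python
import numpy as np

def toy_dataset(n_inliers=600, n_outliers=30, seed=None):
    """Two informative Gaussian classes plus one small outlier cloud labelled +1."""
    rng = np.random.default_rng(seed)
    n1 = n_inliers // 2
    X_pos = rng.normal((-1, -1), np.sqrt(1.4), size=(n1, 2))               # label +1
    X_neg = rng.normal((1, 1), np.sqrt(1.4), size=(n_inliers - n1, 2))     # label -1
    X_out = rng.normal((24, 8), np.sqrt(0.1), size=(n_outliers, 2))        # outliers, label +1
    X = np.vstack([X_pos, X_neg, X_out])
    y = np.concatenate([np.ones(n1), -np.ones(n_inliers - n1), np.ones(n_outliers)])
    return X, y
```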

Fig. 3
figure 3

Scatter plot of 630 samples from the training dataset (600 informative data, 30 outliers), the color of the points corresponds to their labels (Color figure online)

The algorithms we study are the MOM adaptations of Perceptron, Logistic Regression and Kernel Logistic Regression.

Based on our theoretical results, we know that the number of blocks K has to be larger than 4 times the number of outliers for our procedure to be on the safe side. The value \(K=120\) is therefore used in all subsequent applications of MOM algorithms on the toy dataset, except when told otherwise. To quantify performance, we compute the misclassification error on a clean dataset made of data distributed like the informative data.

For Kernel Logistic Regression, we study here a linear kernel because the outliers in this dataset are clearly adversarial when dealing with linear classifiers; the algorithm can also use more sophisticated kernels. A comparison of the MOM algorithms with similar ERM algorithms is represented in Fig. 4; the ERM algorithms are taken from the Python library scikit-learn (Pedregosa et al. 2011) with their default parameters.

Figure 4 illustrates the resistance to outliers of the MOM algorithms compared to their classical versions.

Fig. 4
figure 4

Scatter plot of 500 samples from the test dataset (500 informative data), the color of the points corresponds to their labels and the background color corresponds to the prediction. The score in the title of each subfigure is the accuracy of the algorithm (Color figure online)

These first results are completed in Fig. 5, where we computed the accuracy over several runs of the algorithms. These results confirm the visual impression of our first experiment.

Fig. 5
figure 5

Comparison of the MOM algorithms and their counterparts with the boxplots of the accuracy on the test dataset from 50 runs of the algorithms on 50 samples of the training/test toy dataset (one run for each dataset sampled)

Finally, we illustrate our results regarding the complexity of the algorithms on a simulated example. MOM algorithms have been run together with state-of-the-art algorithms from scikit-learn (Pedregosa et al. 2011) (we use Random Forest, an SVM classifier as well as an SGD classifier optimizing the Huber loss, which entails robustness in Y but not in X, see Huber and Ronchetti 2009, Chapter 7) on a simulated dataset composed of two Gaussian blobs \(\mathcal {N}((-1,-1),1.4I_2)\) and \(\mathcal {N}((1,1),1.4I_2)\) with labels 1 and \(-1\) respectively. We sample 20000 points for the training dataset and 20000 for the test dataset. The parameters used in the algorithms are those for which we obtained the optimal accuracy (this accuracy is illustrated in the next section). Training time plus evaluation time on the test dataset are gathered in Table 1.

Table 1 Time of different algorithms on a simulated dataset

Not surprisingly, the very efficient implementations of linear algorithms from Python's library are extremely fast (in some experiments, results were provided before we had even loaded the dataset). The performance of our algorithms is nevertheless acceptable in general (around 5 times longer than random forest for example). The important fact here is that non-linear algorithms such as SVM take much more time to provide a result. Fast KLR-MOM is able to substantially reduce the execution time of SVM with comparable predictive performance.

5.2 Applications on real datasets

We used the HTRU2 dataset, also studied in Lyon et al. (2015), which is provided by the UCI Machine Learning Repository. The goal is to detect pulsars (a rare type of neutron star) based on radio emissions detectable on earth, from which features are extracted to give us this dataset. The problem is that most of the signal comes from noise and not from pulsars; the goal is then to classify pulsars against noise, using the 17,898 points in the dataset.

The accuracy of the different algorithms is obtained from several runs, each using 4/5 of the dataset for training and the remaining 1/5 for testing. Boxplots of the performance of the various algorithms are displayed in Fig. 6. To improve performance, an RBF kernel was used for both KLR MOM and Fast KLR MOM.
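This evaluation protocol can be summarized by the following sketch, where X and y are assumed to contain the 17,898 HTRU2 samples and their labels, and make_classifier is a placeholder for any of the compared methods (e.g. an RBF-kernel KLR MOM implementation).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def repeated_split_accuracy(make_classifier, X, y, n_runs=50, seed=0):
    """Accuracy over repeated random 4/5 - 1/5 splits of the dataset."""
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=rng.randint(10**6))
        clf = make_classifier().fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.array(accs)   # feed these to a boxplot, as in Fig. 6
```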

Fig. 6
figure 6

Comparison of the MOM algorithms and common algorithms with the boxplots and the medians of the accuracy \(\frac{1}{n}\sum _{i=1}^n \mathbb {1}\{{\hat{f}}(x_i)=y_i\} \) on the test dataset from 50 runs of the algorithms on 100 samples of a 4/5 cut of the HTRU2 dataset (each run is trained on a sample of 4/5 of the dataset and tested on the remaining 1/5)

5.3 Outlier detection with MOM algorithms

When we run the MOM version of a descent algorithm, we select at each step the block of data points realizing the median of the set of “local/block empirical risks” at the current iterate. The number of times a point is selected by the algorithm can be used as a depth function measuring the reliability of that data point. Note that this notion of depth has the advantage of taking into account the learning task we want to solve, that is, the loss \(\ell \) and the class F: outliers are defined with respect to the problem we want to solve, and not with respect to some a priori notion of centrality of points in \(\mathbb {R}^d\) unrelated to this problem.
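A possible implementation of this depth function is sketched below (an illustration of the idea, not the authors' code): a MOM gradient descent is run for T iterations and we count how many times each point falls in the selected median block. The callables block_risk and gradient_on_block are placeholders for the empirical risk and its gradient on a block, e.g. the logistic loss of a linear model.

```python
import numpy as np

def mom_selection_counts(X, y, w0, block_risk, gradient_on_block,
                         K=120, T=2000, step=0.1, seed=0):
    """Count how often each data point belongs to the selected median block."""
    rng = np.random.RandomState(seed)
    N = len(X)
    counts = np.zeros(N, dtype=int)
    w = w0.copy()                                 # w0: initial parameter vector (numpy array)
    for _ in range(T):
        blocks = np.array_split(rng.permutation(N), K)      # random equal-size blocks
        risks = [block_risk(w, X[b], y[b]) for b in blocks]
        median_block = blocks[np.argsort(risks)[K // 2]]    # block realizing the median risk
        counts[median_block] += 1                           # depth: number of times selected
        w -= step * gradient_on_block(w, X[median_block], y[median_block])
    return counts   # points with the smallest counts are flagged as outliers
```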

We apply this idea to the toy dataset with the Logistic Regression MOM algorithm. The results are gathered in a sorted histogram in Fig. 7, where red bars represent the outliers of the original dataset.

Fig. 7
figure 7

Sorted histogram of the score (number of times a data point belongs to the selected median block) of each point for the Logistic Regression MOM algorithm on the toy dataset. Red indicates an outlier and blue an informative sample. \(K=120\) and \(T=2000\) (Color figure online)

Quite remarkably, the outliers are exactly the data points that have been selected the smallest number of times. The method targets a very specific type of outliers, namely those disturbing the classification task at hand: a point lying very far from the bulk of the data but in the half-space of its label would not be detected.

This detection procedure does not scale well as the dataset grows, since a large number of iterations is needed for each point to be selected a fair number of times. For bigger datasets, we suggest adapting standard outlier detection algorithms (Aggarwal 2013). We emphasize that clustering techniques such as K-Means are rather easy to turn into MOM algorithms able to detect points far from the bulk of the data; this might also greatly improve the usual K-Means, as MOM K-Means is more robust.
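As an illustration of this suggestion, here is one possible (and deliberately naive) way to turn Lloyd's iterations into a MOM algorithm; this is a sketch under the assumption that the block criterion is the within-cluster inertia, not an implementation used in the paper.

```python
import numpy as np

def mom_kmeans(X, n_clusters, n_blocks=30, n_iter=100, seed=0):
    """MOM version of Lloyd's algorithm: each centroid update is computed
    on the block of data achieving the median within-cluster inertia."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        blocks = np.array_split(rng.permutation(len(X)), n_blocks)
        # inertia of the current centroids on each block
        inertias = []
        for b in blocks:
            d = np.linalg.norm(X[b, None, :] - centroids[None, :, :], axis=2)
            inertias.append(np.sum(d.min(axis=1) ** 2))
        b_med = blocks[np.argsort(inertias)[n_blocks // 2]]   # median block
        # standard Lloyd update, restricted to the median block
        d = np.linalg.norm(X[b_med, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = X[b_med][labels == c].mean(axis=0)
    return centroids
```

Points lying far from the resulting centroids can then be flagged as lying far from the bulk of the data.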

Let us now analyze the effect of K on the outlier detection task. The histograms of the 1000 smallest counts of points of the HTRU2 dataset, as K gets bigger, are plotted in Fig. 8.

Fig. 8
figure 8

Sorted histogram of the score (number of times a data point is selected in a median block) of each point for the Logistic Regression MOM algorithm on the pulsar dataset, for various values of K and \(T=20 \times K\) (only the 1000 smallest counts among the 17,898 samples of the pulsar dataset are represented)

It appears from Fig. 8 that K measures the sensitivity of the algorithm. Severe outliers (as in the toy example) are detected for small K while mild outliers are only discovered as K gets bigger.

The optimal choice of K in MOM therefore seems to depend on the task one is interested in. For classification, K should be as small as possible to get better risk bounds (while still being larger than the number of outliers), whereas for outlier detection K may have to be much larger for any outlier to be detected at all (while remaining small enough for the underlying classification to perform correctly). As a proof of concept, on the pulsar dataset we obtained optimal classification results with \(K=10\), whereas a significant number of outliers is only detected when K is around 1000.
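Reusing the hypothetical mom_selection_counts sketch from the beginning of this subsection, the sensitivity experiment of Fig. 8 corresponds in spirit to the following loop (X, y, w0 and the block risk/gradient callables are again placeholders):

```python
import numpy as np

for K in (10, 100, 500, 1000):
    counts = mom_selection_counts(X, y, w0, block_risk, gradient_on_block,
                                  K=K, T=20 * K)
    smallest = np.sort(counts)[:1000]   # the 1000 smallest counts, as in Fig. 8
    print(f"K = {K}: 10 smallest counts = {smallest[:10]}")
```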

6 Proofs

6.1 Proof of Theorem 1

We adapt the classical Vapnik–Chervonenkis analysis (Vapnik 1998) of the excess risk of ERM to a dataset corrupted by outliers. We first recall that \(f^*\in \mathop {\mathrm{argmin}}\nolimits _{f\in \mathcal{F}} R^{0-1}(f)\) and that, for all \(f\in \mathcal{F}\), the excess loss function of f is \(\mathcal{L}_f^{0-1}=\ell ^{0-1}_f-\ell ^{0-1}_{f^*}\). For simplicity we write \({\hat{f}} = \widehat{f}_{\text {ERM}}^{0-1}\) and, for all \(f\in \mathcal{F}\), \(\mathcal{L}_f^{0-1}=\mathcal{L}_f\) and \(R(f) = R^{0-1}(f)\).

It follows from the definition of the ERM estimator that \(P_N\mathcal{L}_{{\hat{f}}}\le 0\). Therefore, if we denote by \(P_\mathcal{I}\) (resp. \(P_\mathcal{O}\)) the empirical measure supported on \(\{(X_i, Y_i): i\in \mathcal{I}\}\) (resp. \(\{(X_i, Y_i): i\in \mathcal{O}\}\)), we have

$$\begin{aligned} R({\hat{f}})-R(f^{*})&= (P-P_N)\mathcal{L}_{{\hat{f}}}+P_N \mathcal{L}_{{\hat{f}}}\le (P-P_N)\mathcal{L}_{{\hat{f}}}\\&= \frac{|\mathcal{I}|}{N}(P-P_\mathcal{I})\mathcal{L}_{{\hat{f}}} + \frac{|\mathcal{O}|}{N}(P-P_\mathcal{O})\mathcal{L}_{{\hat{f}}}\\&\le \frac{|\mathcal{I}|}{N}\sup _{f\in \mathcal{F}}(P-P_{\mathcal{I}})\mathcal{L}_f+\frac{2|\mathcal{O}|}{N} \end{aligned}$$

because \(|\mathcal{L}_{{\hat{f}}}|\le 1\) a.s. Then, by the bounded difference inequality (Boucheron et al. 2013, Theorem 6.2), since every \(f\in \mathcal{F}\) satisfies \(-1\le \mathcal{L}_f\le 1\), one has, for any \(x>0\),

$$\begin{aligned} {\mathbb P}\left( \sup _{f\in \mathcal{F}}(P-P_{\mathcal{I}})\mathcal{L}_f\ge \mathbb {E}[\sup _{f\in \mathcal{F}}(P-P_{\mathcal{I}})\mathcal{L}_f]+x\right) \le e^{-2|\mathcal{I}|x^2}. \end{aligned}$$

Furthermore, by the symmetrization argument (cf. Chapter 4 in Ledoux and Talagrand 2011),

$$\begin{aligned} \mathbb {E}\left[ \sup _{f\in \mathcal{F}}(P-P_{\mathcal{I}})\mathcal{L}_f\right] \le 2\frac{\mathcal{R}(\mathcal{L}_\mathcal{F})}{|\mathcal{I}|}. \end{aligned}$$

Therefore, for any \(x>0\), with probability larger than \(1-e^{-2|\mathcal{I}|x^2}\),

$$\begin{aligned} R({\hat{f}})-R(f^{*})\le \frac{2\mathcal{R}(\mathcal{L}_\mathcal{F})}{N}+x+\frac{2|\mathcal{O}|}{N}. \end{aligned}$$

The proof is completed by choosing \(x=\sqrt{K/(2|\mathcal{I}|)}\).

6.2 Proof of Theorem 2

Let \(f^{*}\in \mathop {\mathrm{argmin}}\nolimits _{f\in F}P\ell _f\). By definition, one has \(\text {MOM}_{K}\big (\ell _{\widehat{f}_{\text {MOM},K}}\big )\le \text {MOM}_{K}\big (\ell _{f^*}\big ) \), therefore,

$$\begin{aligned} R(\widehat{f}_{\text {MOM},K})-R(f^{*})\le P\ell _{\widehat{f}_{\text {MOM},K}}-\text {MOM}_{K}\big (\ell _{\widehat{f}_{\text {MOM},K}}\big ) - \left( P\ell _{f^*}-\text {MOM}_{K}\big (\ell _{f^*}\big )\right) . \end{aligned}$$
(8)

Let us now control the two expressions in the right-hand side of (8). Let \(x>0\). We have

$$\begin{aligned} {\mathbb P}\left[ P \ell _{f^*} - \text {MOM}_{K}\big (\ell _{f^*}\big )> x\right]= & {} {\mathbb P}\left[ \sum _{k=1}^K I(P\ell _{f^*} - P_{B_k}\ell _{f^*}>x)\ge \frac{K}{2} \right] \\= & {} \sum _{k=K/2}^K \left( {\begin{array}{c}K\\ k\end{array}}\right) p^k (1-p)^{K-k}\le p^{K/2} 2^K \end{aligned}$$

where \(p={\mathbb P}[P\ell _{f^*} - P_{B_k}\ell _{f^*}>x]\). Using Markov's inequality together with \(\mathrm{var}(\ell _{f^*})\le 2 L^2 \mathbb {E}[(f^*(X))^2]\le 2L^2 \theta _2^2\), we obtain

$$\begin{aligned} {\mathbb P}\left[ P \ell _{f^*} - \text {MOM}_{K}\big (\ell _{f^*}\big )> x\right] \le \left( \frac{4 \mathrm{var}(\ell _{f^*})K}{N x^2}\right) ^{K/2}\le \left( \frac{8L^2\theta _2^2K}{N x^2}\right) ^{K/2}=\exp (-K/2) \end{aligned}$$

when \(x=2L\theta _2\sqrt{2eK/N}\).
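For completeness, one can check this choice of x directly:

$$\begin{aligned} x=2L\theta _2\sqrt{\frac{2eK}{N}} \quad \Longrightarrow \quad x^2=\frac{8eL^2\theta _2^2K}{N} \quad \Longrightarrow \quad \frac{8L^2\theta _2^2K}{N x^2}=\frac{1}{e}, \end{aligned}$$

so that the last term above is indeed \(e^{-K/2}=\exp (-K/2)\).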

Now, for any \(x>0\), one has \(\sup _{f\in F}\text {MOM}_{K}\big (P\ell _f-\ell _f\big )>x\) iff

$$\begin{aligned} \sup _{f\in F}\sum _{k=1}^KI\left\{ (P-P_{B_k})\ell _f>x\right\} \ge \frac{K}{2} . \end{aligned}$$
(9)

Let us now control the probability that (9) holds via an adaptation of the small ball method (Koltchinskii and Mendelson 2015; Mendelson 2015). Let \(x>0\) and let \(\phi (t)=(t-1)I\{1\le t\le 2\}+I\{t\ge 2\}\) be defined for all \(t\in \mathbb {R}\). As \(\phi (t)\ge I\{t\ge 2\}\), one has

$$\begin{aligned}&\sup _{f\in F}\sum _{k=1}^KI\left\{ (P-P_{B_k})\ell _f>x\right\} \\&\quad \le \sup _{f\in F}\sum _{k\in \mathcal{K}}\mathbb {E}\left[ \phi \left( 2(P-P_{B_k})\ell _f/x\right) \right] +|\mathcal{O}|\\&\quad +\sup _{f\in F}\sum _{k\in \mathcal{K}}\left( \phi \left( 2(P-P_{B_k})\ell _f/x\right) -\mathbb {E}\left[ \phi \left( 2(P-P_{B_k})\ell _f/x\right) \right] \right) \end{aligned}$$

where we recall that \(\mathcal{K}=\{k\in \{1, \cdots , K\}: B_k\cap \mathcal{O}= \emptyset \}\).

Since \(\phi (t)\le I\{t\ge 1\}\) and, for all \(f\in F\), \(\text {Var}(\ell _f)\le 2L^2\mathbb {E}[f(X)^2]\le 2L^2\theta _2^2\), we have for all \(f\in F\) and \(k\in \mathcal{K}\),

$$\begin{aligned} \mathbb {E}\left[ \phi \left( 2(P-P_{B_k})\ell _f/x\right) \right] \le {\mathbb P}\left( (P-P_{B_k})\ell _f\ge \frac{x}{2}\right) \le \frac{4\text {Var}(\ell _f)}{x^2|B_k|}\le \frac{8L^2\theta _{2}^2K}{x^2N}. \end{aligned}$$

One has therefore

$$\begin{aligned}&\sup _{f\in F}\sum _{k=1}^KI\left\{ (P-P_{B_k})\ell _f>x\right\} \\&\quad \le K\left( \frac{8L^2\theta _{2}^2K}{x^2N}+\frac{|\mathcal{O}|}{K}+\sup _{f\in F}\frac{1}{K}\sum _{k\in \mathcal{K}}\left( \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right. \right. \\&\qquad \left. \left. -\mathbb {E}\left[ \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] \right) \right) . \end{aligned}$$

As \(0\le \phi (\cdot )\le 1\), by the bounded-difference inequality, for any \(y>0\), with probability larger than \(1-e^{-2y^2K}\),

$$\begin{aligned} \sup _{f\in F} \frac{1}{K}\sum _{k\in \mathcal{K}}&\left( \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) -\mathbb {E}\left[ \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] \right) \\&\le \mathbb {E}\left[ \sup _{f\in F}\frac{1}{K}\sum _{k\in \mathcal{K}}\left( \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) -\mathbb {E}\left[ \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] \right) \right] +y. \end{aligned}$$

Now, by the symmetrization inequality,

$$\begin{aligned}&\mathbb {E}\left[ \sup _{f\in F} \frac{1}{K}\sum _{k\in \mathcal{K}}\left( \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) -\mathbb {E}\left[ \phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] \right) \right] \\&\quad \le 2\mathbb {E}\left[ \sup _{f\in F}\frac{1}{K}\sum _{k\in \mathcal{K}}\epsilon _k\phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] . \end{aligned}$$

Since \(\phi \) is 1-Lipschitz and \(\phi (0)=0\), by the contraction principle (see Ledoux and Talagrand 2011, Chapter 4 or more precisely equation (2.1) in Koltchinskii 2011),

$$\begin{aligned} \mathbb {E}\left[ \sup _{f\in F}\frac{1}{K}\sum _{k\in \mathcal{K}}\epsilon _k\phi \left( \frac{2(P-P_{B_k})\ell _f}{x}\right) \right] \le \mathbb {E}\left[ \sup _{f\in F}\frac{2}{xK}\sum _{k\in \mathcal{K}}\epsilon _k(P-P_{B_k})\ell _f\right] . \end{aligned}$$

By the symmetrization principle,

$$\begin{aligned} \mathbb {E}\left[ \sup _{f\in F}\frac{2}{xK}\sum _{k\in \mathcal{K}}\epsilon _k(P-P_{B_k})\ell _f\right] \le \frac{2}{xN}\mathbb {E}\left[ \sup _{f\in F}\sum _{i\in \mathcal{J}}\epsilon _i\ell _f(X_i,Y_i)\right] . \end{aligned}$$

Finally, since \(\ell \) is L-Lipschitz, by the contraction principle (see equation (2.1) in Koltchinskii 2011),

$$\begin{aligned} \mathbb {E}\left[ \sup _{f\in F}\sum _{i\in \mathcal{J}}\epsilon _i\ell _f(X_i,Y_i)\right] \le 2L\mathcal{R}(F). \end{aligned}$$

Thus, for any \(y>0\), with probability larger than \(1-\exp (-2y^2K)\),

$$\begin{aligned} \sup _{f\in F}\sum _{k=1}^KI\left\{ (P-P_{B_k})\ell _f>x\right\} \le K\left( \frac{8L^2\theta _{2}^2K}{x^2N}+\frac{|\mathcal{O}|}{K}+y+\frac{4L\mathcal{R}(F)}{xN}\right) . \end{aligned}$$

Let \(\Delta =1/4-|\mathcal{O}|/K\), \(y=\Delta \) and \(x = 8L\max \left( \theta _{2}\sqrt{K/N}, 4\mathcal{R}(F)/ N\right) \), so that

$$\begin{aligned} {\mathbb P}\left( \sup _{f\in F}\sum _{k=1}^KI\left\{ (P-P_{B_k})\ell _f>x\right\} < \frac{K}{2}\right) \ge 1-e^{-\Delta ^2K/8}. \end{aligned}$$

Going back to (9), this means that

$$\begin{aligned} {\mathbb P}\left( \sup _{f\in F}\text {MOM}_{K}\big (\ell _f-P\ell _f\big )\le 4L\max \left( \theta _{2}\sqrt{\frac{K}{N}}, \frac{4\mathcal{R}(F)}{N}\right) \right) \ge 1-\exp (-2\Delta ^2K). \end{aligned}$$
(10)

Plugging this result in (8) concludes the proof of the theorem.

6.3 Proof of Proposition 1

We denote by \(B(1)(u),\dots ,B(K)(u) \) the blocks such that the corresponding empirical means \(P_{B(k)(u)}(f_u(X_1^N))\), \(k=1, \ldots , K\), are sorted: \(P_{B(1)(u)}(f_u(X_1^N))\ge \dots \ge P_{B(K)(u)}(f_u(X_1^N))\). Let \(J\in \mathbb {N}\) be such that \(K=2J+1\).

The goal is to show that \(u\mapsto \psi _{f_u}((X_i)_{i=1}^N)=MOM_K(f_u((X_i)_{i=1}^N))\) is differentiable and to compute its partial derivatives. To that end, it suffices to show that, for all \(\varepsilon \) with \(\Vert \varepsilon \Vert _2\) sufficiently small, we have \(B{(J)}(u)=B{(J)}(u+t\varepsilon )\) for all \(t\in [0,1]\); for this, it is sufficient to check that the order of the K empirical means is preserved for all \(f_{u+t\varepsilon }\):

$$\begin{aligned} \forall 1\le k\le K-1, \forall t \in [0,1],\quad P_{B(k)(u)}(f_{u+t\varepsilon })-P_{B(k+1)(u)}(f_{u+t\varepsilon })>0. \end{aligned}$$
(11)

We decompose this difference into three parts,

$$\begin{aligned} P_{B(k)(u)}(f_{u+t\varepsilon })-P_{B(k+1)(u)}(f_{u+t\varepsilon })\ge&P_{B(k)(u)}(f_u)-P_{B(k+1)(u)}(f_u)\\&-\left| P_{B(k)(u)}(f_u)-P_{B(k)(u)}(f_{u+t\varepsilon })\right| \\&-\left| P_{B(k+1)(u)}(f_{u+t\varepsilon })-P_{B(k+1)(u)}(f_u)\right| . \end{aligned}$$

The last two terms are controlled by the Lipschitz property of \(u\mapsto f_u\),

$$\begin{aligned}&\forall t\in [0,1],\quad P_{B(k)(u)}(f_{u+t\varepsilon })-P_{B(k+1)(u)}(f_{u+t\varepsilon })\\&\quad \ge P_{B(k)(u)}(f_u)-P_{B(k+1)(u)}(f_u)-2tL\Vert \varepsilon \Vert _2. \end{aligned}$$

For all \(1\le k\le K-1\), we define

$$\begin{aligned} h_k(\Vert \varepsilon \Vert _2)=\mathbb {P}\left( \forall t\in [0,1],\quad P_{B(k)(u)}(f_u)-P_{B(k+1)(u)}(f_u)-2tL\Vert \varepsilon \Vert _2\ge 0\right) \end{aligned}$$

Each \(h_k\) is a non-increasing function of \(\Vert \varepsilon \Vert _2\). Because, for all \(1\le k\le K\), \(P_{B(k)(u)}(f_u)\) has an absolutely continuous law with respect to the Lebesgue measure (its density is a convolution of several copies of the density of \(f_u(X)\)), its c.d.f. has no jump and \(h_k\) therefore satisfies

$$\begin{aligned} h_k(\Vert \varepsilon \Vert _2)\xrightarrow [\Vert \varepsilon \Vert _2\rightarrow 0]{} 1. \end{aligned}$$

Again, because for all \(1\le k\le K\), \(P_{B(k)(u)}(f_u)\) has an absolutely continuous law with respect to the Lebesgue measure, we also have that

$$\begin{aligned} h_k(\Vert \varepsilon \Vert _2)=\mathbb {P}\left( \forall t\in [0,1],\quad P_{B(k)(u)}(f_u)-P_{B(k+1)(u)}(f_u)-2tL\Vert \varepsilon \Vert _2> 0\right) . \end{aligned}$$

Then, taking a union bound over \(1\le k\le K-1\),

$$\begin{aligned} h(\Vert \varepsilon \Vert _2)&:=\mathbb {P}\left( \forall 1\le k\le K-1, \forall t\in [0,1],\quad P_{B(k)(u)}(f_u)-P_{B(k+1)(u)}(f_u)-2tL\Vert \varepsilon \Vert _2> 0\right) \\&\ge 1-\sum _{k=1}^{K-1} \left( 1-h_k(\Vert \varepsilon \Vert _2)\right) . \end{aligned}$$

Moreover, by the reasoning leading to Eq. (11), the event defining h implies that the blocks do not change, hence

$$\begin{aligned} h(\Vert \varepsilon \Vert _2)&\le \mathbb {P}\left( \forall 1\le k\le K-1, \forall t \in [0,1],\quad B{(k)}(u)=B{(k)}(u+t\varepsilon )\right) \\&\le \mathbb {P}\left( \forall t \in [0,1],\quad B{(J)}(u)=B{(J)}(u+t\varepsilon )\right) . \end{aligned}$$

We now compute the partial derivatives of the median of means \(\psi _{f_u}\). Let \(e_1,\dots ,e_p \in \mathbb {R}^p\) be the canonical basis of \(\mathbb {R}^p\). For all \(m \in \mathbb {N}\), we define \(\varepsilon _{m}^j=\delta _m e_j\), where \((\delta _m)_m\) is a decreasing sequence of \({\mathbb R}_+^*\) such that \(h_k(\delta _m)\ge 1-2^{-m}\) for all \(1\le k\le K-1\); such a sequence exists because \(h_k(\delta )\rightarrow 1\) as \(\delta \rightarrow 0\). Then,

$$\begin{aligned} h(\Vert \varepsilon _m^j\Vert _2)\ge 1-K2^{-m}. \end{aligned}$$
(12)

We denote by \(A_m^j\) the event \(A_m^j:=\left\{ \forall t \in [0,1], \quad B{(J)}(u)=B{(J)}(u+t\varepsilon _{m}^j) \right\} \) and we study the limiting event \(\Omega ^j=\overline{\lim }_{m\rightarrow \infty }A_m^j\).

First, let us note that for all \(1\le j\le p\), the sequence of sets \((A_m^j)_m\) is non-decreasing, hence

$$\begin{aligned} \Omega ^j=\overline{\lim }_{m\rightarrow \infty }A_m^j=\underline{\lim }_{m\rightarrow \infty }A_m^j=(\overline{\lim }_{m\rightarrow \infty }(A_m^j)^c)^c, \end{aligned}$$

then, for all \(1\le j\le p\), we can study \(\overline{\lim }_{m\rightarrow \infty }(A_m^j)^c\) with the Borel-Cantelli lemma. Indeed, from Eq. (12), \(\mathbb {P}((A_m^j)^c)\le K 2^{-m}\). Hence, the series \(\sum _m \mathbb {P}((A_m^j)^c)\) converges and, by the Borel-Cantelli lemma, \(\mathbb {P}(\overline{\lim }_{m\rightarrow \infty }(A_m^j)^c)=0\), so that \(\mathbb {P}(\Omega ^j)=1\) for all \(1\le j\le p\). In other words, for all \(\omega \in \Omega ^j\), there exists \(m \ge 1\) such that \(\omega \in A_m^j\), that is, there exists \(m \ge 1\) such that \(B{(J)}(u)=B{(J)}(u+t\varepsilon _m^j)\) for all \(t \in [0,1]\), which implies that for all \(1\le j\le p\),

$$\begin{aligned} \partial _j\psi _{f_u}(X)&=\lim _{t\rightarrow 0}\frac{\psi _{f_{u+t\epsilon _m^j}}(X)-\psi _{f_{u}}(X)}{t}=\lim _{t\rightarrow 0}\frac{P_{B(J)(u)}(f_{u+t\varepsilon _m^j})-P_{B(J)(u)}(f_{u})}{t}\\&=\frac{1}{N/K}\lim _{t\rightarrow 0}\sum _{i \in B(J)(u)} \frac{f_{u+t\varepsilon _m^j}(X_i)-f_{u}(X_i)}{t}=\frac{1}{N/K}\sum _{i \in B(J)(u)} \partial _jf_{u}(X_i). \end{aligned}$$