Upper bounds and aggregation in bipartite ranking

One main focus of learning theory is to find optimal rates of convergence. In classification, it is possible to obtain optimal fast rates (faster than n^{−1/2}) in a minimax sense and, using an aggregation procedure, algorithms that are adaptive to the parameters of the class of distributions. Here, we investigate this issue in the bipartite ranking framework. We design a ranking rule by aggregating estimators of the regression function, using exponential weights based on the empirical ranking risk. Under several assumptions on the class of distributions, we show that this procedure is adaptive to the margin and smoothness parameters and achieves the same rates as in the classification framework. Moreover, we state a minimax lower bound that establishes the optimality of the aggregation procedure in a specific case.


Introduction
The design of estimators that achieve optimal rates of convergence is a major topic in statistical learning. It has been investigated in many settings, such as regression, density estimation and classification. The rates depend on the properties of the considered class of distributions; classical conditions bear on the distribution of the observations and on the regularity of the regression function. In that case, the best rates are slower than n^{−1/2} and the estimators depend on the regularity of the regression function. Adaptive estimators exist that get rid of the knowledge of this parameter (see [21,17,28,3,19,4,27]). In classification, when adding an assumption on the distribution of the regression function, rates faster than n^{−1/2}, and even faster than n^{−1}, are achieved. These rates were obtained for plug-in classification rules in two papers. In [3], the authors estimate the regression function using a locally polynomial estimator. Moreover, the optimal rates are achieved without knowing the regularity and margin parameters by aggregating the plug-in rules (see [18]). More recently, the local multi-resolution estimation method (see [24]), combined with Lepski's method (see [20]), achieves the optimal and adaptive minimax rates. Both approaches first estimate the regression function and then threshold the estimated function at level 1/2.
In the last decade, the bipartite ranking problem, a supervised learning task, has received much attention from the statistical learning community (see [15,26,10] for instance). Its probabilistic framework is the same as that of classification, but the task is of a really different nature: to solve it, one has to order all the observations and thus understand the whole feature space. This task matters for many applications, such as anomaly detection in signal processing, information retrieval, the design of diagnosis tools in medicine and credit scoring in finance. The problem can be formulated as a pairwise classification problem (see [8]) where the goal is to minimize a loss based on pairs of observations, called the ranking risk. In that paper, the authors show that, under a low noise assumption, the rates of convergence of the excess ranking risk can be very close to n^{−1}. To this end, they use a procedure based on the minimization of the empirical ranking risk over a class of candidate ranking rules. The main drawback of their setup is that the target function has to belong to the class of ranking rules. In [9], minimax rates faster than n^{−1/2} are achieved over classes of distributions controlled by a smoothness parameter and a margin parameter. The authors use the same estimator of the regression function as in classification, but this estimator requires knowledge of the regularity parameter.
Here, we investigate the performance of aggregation with exponential weights in the bipartite ranking framework, in order to obtain a method that is adaptive to these parameters. The main result is that this procedure satisfies an oracle inequality. We then study the impact of this inequality in two settings: one with a mild density assumption on the marginal distribution of the observations, the other with a strong density assumption (see [3]). When adding assumptions on the regression function, we obtain a new adaptive upper bound under the mild density assumption. Moreover, when aggregating the plug-in estimators of [9], the procedure is adaptive to the parameters of the class of distributions under the strong density assumption.
The rest of the paper is organized as follows. In Section 2, we introduce the notation and the bipartite ranking task; we define the ranking risk and a convexified version of it based on the hinge loss, and present several margin assumptions along with equivalence links between them. In Section 3, we describe the aggregation estimator built on the convexified ranking risk and show the oracle inequalities satisfied by the aggregation procedure. In Section 4, we present two adaptive minimax upper bounds for the excess ranking risk using the aggregated estimator. Finally, we extend the minimax lower bound obtained in [9] to all dimensions. The proofs are deferred to the Appendix.

Theoretical background
Here, we introduce the main assumptions involved in the formulation of the bipartite ranking problem and recall the important results used in the subsequent analysis, giving an idea of the nature of the ranking problem.

Probabilistic setup and first notations
Here and throughout, (X, Y) denotes a pair of random variables taking its values in the product space X × {−1, +1}, where X is typically a subset of a Euclidean space of (very) large dimension d ≥ 1, say R^d. The r.v. X is a vector of features for predicting the binary label Y. Let p = P{Y = +1} be the rate of positive instances. The joint distribution of (X, Y) is denoted by P, X's marginal distribution by µ and the posterior probability by η(x) = P{Y = +1 | X = x}, x ∈ X. For simplicity and without loss of generality, we assume that X coincides with the support of µ(dx). Additionally, the r.v. η(X) is supposed to be absolutely continuous w.r.t. the Lebesgue measure.
The indicator function of any event E is denoted by I{E} and the range of any mapping Φ by Im(Φ). We also denote by B(x, r) the closed Euclidean ball in R^d centered at x ∈ R^d and of radius r > 0. For any multi-index s = (s_1, …, s_d) ∈ N^d and any x = (x_1, …, x_d) ∈ R^d, we set |s| = s_1 + ⋯ + s_d, s! = s_1! ⋯ s_d!, x^s = x_1^{s_1} ⋯ x_d^{s_d}, and denote by D^s the differential operator ∂^{|s|}/(∂x_1^{s_1} ⋯ ∂x_d^{s_d}); ⌊β⌋ stands for the largest integer strictly less than β ∈ R. For any x ∈ R^d and any ⌊β⌋-times continuously differentiable real-valued function g on R^d, we denote by g_x its Taylor polynomial expansion of degree ⌊β⌋ at point x: g_x(x′) = Σ_{|s| ≤ ⌊β⌋} ((x′ − x)^s / s!) D^s g(x).

Bipartite ranking
The bipartite ranking task consists in learning how to order the observations according to the label Y. Specifically, from a sample D = {(X_1, Y_1), …, (X_n, Y_n)} with distribution P, we want to learn a scoring function s : X → R such that the order induced by s is the same as the order induced by η. In this case, the observations with label "+1" should receive large scores whereas the observations with label "−1" should receive small ones. The most popular tool to evaluate the accuracy of a scoring function is the ROC curve [13]: the plot of the true positive rate against the false positive rate of all the classifiers one can create by thresholding the scoring function s.
Pairwise classification. The ROC curve, however, is a functional criterion and is therefore complex to optimize, both theoretically and computationally. For this reason, several authors have reformulated the problem as a pairwise classification problem (see [15,1,8]). In this setup, given two independent random pairs (X, Y) and (X′, Y′) with distribution P, the goal is to determine whether Y > Y′ or not. The predictor then takes the form of a ranking rule, namely a (measurable) function r : X² → {−1, +1} such that r(x, x′) = 1 when x′ is ranked higher than x: the more pertinent a ranking rule r, the smaller the probability that it incorrectly ranks two instances drawn independently at random. Formally, optimal ranking rules are those that minimize the ranking risk L(r) = P{(Y − Y′) · r(X, X′) < 0}. A ranking rule r is said to be transitive iff ∀(x, x′, x′′) ∈ X³: "r(x, x′) = +1 and r(x′, x′′) = +1" ⇒ "r(x, x′′) = +1". Observe that, by standard quotient set arguments, transitive ranking rules are exactly those induced by scoring functions: r_s(x, x′) = 2 · I{s(x′) ≥ s(x)} − 1 with s : X → R measurable. With a slight abuse of notation, we set L(r_s) = L(s) for ranking rules defined through a scoring function s.
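As a concrete illustration, the empirical counterpart of the ranking risk, i.e. the proportion of mis-ranked ordered pairs in a sample, can be computed directly from scores and labels. A minimal sketch, in which the function name and the handling of score ties are our own choices:

```python
def empirical_ranking_risk(scores, labels):
    """U-statistic estimate of the ranking risk of the scoring function
    inducing `scores`: the proportion, among all ordered pairs (i, j)
    with i != j, of mis-ranked pairs.  Pairs with equal labels
    (Z_ij = 0) never count as errors; score ties are not counted
    as errors either (an arbitrary convention for this sketch)."""
    n = len(scores)
    errors = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            z = (labels[i] - labels[j]) / 2  # Z_ij in {-1, 0, +1}
            # mis-ranking: the order of the scores disagrees with the
            # order of the labels
            if z * (scores[i] - scores[j]) < 0:
                errors += 1
    return errors / (n * (n - 1))
```

For instance, a scoring function that puts every positive instance above every negative one has empirical ranking risk 0.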
Optimality. It is easy to see that an optimal ranking rule can be defined through the regression function η; see Example 1 in [8] for further details. Additionally, one may derive a closed analytical form for the excess of ranking risk E(r) = L(r) − L*, where L* = L(r*). For clarity, we recall the following result.
Lemma 1 (Ranking risk excess - [8]). For any ranking rule r, we have: E(r) = E[|η(X) − η(X′)| · I{r(X, X′) ≠ r*(X, X′)}]. The accuracy of a ranking rule is here characterized by the excess of ranking risk E(r), the challenge from a statistical learning perspective being to build a ranking rule, based on a training sample (X_1, Y_1), …, (X_n, Y_n) of i.i.d. copies of the pair (X, Y), with asymptotically small excess of ranking risk for large n.
We highlight the fact that, using a basic conditioning argument, the minimum ranking risk L* can be expressed in terms of the Gini mean difference of η(X) (where p = P{Y = +1}): L* = p(1 − p) − (1/2) E[|η(X) − η(X′)|]. (2.3) In binary classification, it is folklore that the learning problem is all the easier when η(X) is bounded away from 1/2. In bipartite ranking, Eq. (2.3) roughly says that the more spread out the r.v. η(X) is, the easier the optimal ranking of the elements of X. Hence, the two problems are very different from this perspective.
A continuum of classification problems. In addition, we emphasize that the optimal ranking rule r*(x, x′) can be seen as a (nested) collection of optimal cost-sensitive classifiers, involving the level sets G*_t = {x ∈ X : η(x) ≥ t}; see Proposition 15 in [11] for instance. Hence, while binary classification only aims at recovering the single level set G*_{1/2}, which is easier when η(X) is far from 1/2 with large probability (see [23] or [28]), the ranking task consists in finding the whole collection {G*_t : t ∈ Im(η(X))}. Though of disarming simplicity, this observation captures the main barrier to extending fast-rate analysis to the ranking setup: the random variable η(X) cannot be far, with arbitrarily high probability, from all elements of its range.
Convexification of the ranking risk. From a practical angle, optimizing the ranking risk is difficult because the involved loss is not convex. In the classification framework, convex surrogates are widely used for practical purposes and have also served theoretical ends (see [5,29] and [18] for instance). Here, we propose to convexify the pairwise loss and use the resulting loss in our aggregation procedure (see Section 3); note that the minimization of a convexified pairwise loss was already studied in [8]. We call any measurable function f : X × X → [−1, 1] a decision rule and we set the random variable Z = (Y − Y′)/2. With this notation, we now present the convexification of the ranking risk that we use in this paper.
Definition 2 (Hinge ranking risk). For any decision function f, the hinge ranking risk is defined by A(f) = E[φ(−Z f(X, X′))], where φ(x) = max(0, 1 + x).
Notice that a ranking rule is a particular kind of decision rule. The next proposition justifies strategies based on the minimization of the hinge ranking risk as a way to obtain accurate ranking rules.
Proposition 3. The minimizer r* of the ranking risk is also a minimizer of the hinge ranking risk A. We set A* = A(r*).
As for the ranking risk, there exists a closed analytical form for the excess of hinge ranking risk. This is the purpose of the next result.
Lemma 4 (Hinge ranking risk excess). For any decision rule f : X × X → [−1, 1], we have: The specific use of this surrogate is not fortuitous and is due to its linearity. Using this property, we see that, for any ranking rule r : X × X → {−1, 1}, we have A(r) − A* = 2(L(r) − L*). (2.5) By thresholding a decision function, we can obtain a ranking rule: for any decision rule f, we set r_f = sign(f). We now link the excess of hinge ranking risk of a decision function f with the excess of ranking risk of its associated ranking rule. Using this definition, one can easily show that, for any decision rule f : X × X → [−1, 1], E(r_f) ≤ A(f) − A*. (2.6) Thus, the minimization of the excess of hinge ranking risk provides a reasonable alternative to the minimization of the excess of ranking risk.
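The linearity property invoked above can be made explicit. The following short derivation is our reconstruction, under the convention A(r) = E[max(0, 1 − Z·r(X, X′))] matching the empirical hinge risk of Section 3, for a ranking rule r valued in {−1, +1}:

```latex
\begin{aligned}
\max\bigl(0,\,1 - Z\,r(X,X')\bigr)
  &= 2\,\mathbb{I}\{Z\,r(X,X') < 0\} + \mathbb{I}\{Z = 0\},\\
\text{hence}\quad A(r)
  &= 2\,\mathbb{P}\{Z\,r(X,X') < 0\} + \mathbb{P}\{Y = Y'\}
   = 2\,L(r) + \mathbb{P}\{Y = Y'\},\\
\text{and thus}\quad A(r) - A^*
  &= 2\,\bigl(L(r) - L^*\bigr).
\end{aligned}
```

The first line follows by checking the three possible values of Z·r(X, X′) ∈ {−1, 0, +1}; the additive term P{Y = Y′} does not depend on r, which is why the hinge and ranking excess risks are proportional for ranking rules.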
Plug-in ranking rules. Given the form of the Bayes ranking rule r*(X, X′), it is natural to consider plug-in ranking rules, that is to say ranking rules obtained by "plugging" a nonparametric estimator η̂_n(x) of the regression function η, based on a data sample (X_1, Y_1), …, (X_n, Y_n), into Eq. (2.2) instead of η(x): r̂_n(x, x′) = 2 · I{η̂_n(x′) ≥ η̂_n(x)} − 1. The performance of predictive rules built via the plug-in principle has been extensively studied in the classification/regression context, under mild assumptions on the behavior of η(X) in the vicinity of 1/2 (see the references in [3] for instance) and, in particular, on η's smoothness. Similarly, in the ranking situation, since one obtains as an immediate corollary of Lemma 1 that E(r̂_n) is bounded by E[|η̂_n(X) − η(X)|], one should investigate under which conditions nonparametric estimators η̂_n lead to ranking rules with fast rates of convergence of E(r̂_n) as the training sample size n increases to infinity.
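To make the plug-in principle concrete, here is a minimal sketch in which η is estimated by a naive k-nearest-neighbour average. This estimator, the function names and the parameter k are illustrative stand-ins of our own, not the locally polynomial construction used in [3] and [9]:

```python
import numpy as np

def knn_eta(x, X_train, y_train, k=3):
    """Naive k-NN estimate of eta(x) = P{Y = +1 | X = x}: the fraction
    of positive labels among the k training points closest to x."""
    dist = np.linalg.norm(X_train - np.asarray(x), axis=1)
    nearest = np.argsort(dist)[:k]
    return float(np.mean(y_train[nearest] == 1))

def plugin_ranking_rule(x, x_prime, X_train, y_train, k=3):
    """Plug-in rule r(x, x') = 2 * I{eta_hat(x') >= eta_hat(x)} - 1:
    rank x' higher than x iff the estimated eta is larger at x'."""
    return 2 * int(knn_eta(x_prime, X_train, y_train, k)
                   >= knn_eta(x, X_train, y_train, k)) - 1
```

Any consistent estimator of η can be substituted for `knn_eta`; the ranking rule only uses the ordering it induces, not its actual values.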

Additional assumptions
Optimal ranking rules can be defined as those with the best possible rate of convergence of E(r̂_n) towards 0 as n → +∞; this rate naturally depends on the distribution of (X, Y). Following in the footsteps of [3], we embrace the minimax point of view, which consists in considering a specific class P of joint distributions P of (X, Y) and declaring r̂_n optimal if it achieves the best minimax rate of convergence over this class, inf_{r̄_n} sup_{P ∈ P} E_P[E(r̄_n)], where the infimum is taken over all possible ranking rules r̄_n depending on (X_1, Y_1), …, (X_n, Y_n). To carry out such a study, three main types of hypotheses shall be used: smoothness conditions on the real-valued function η : X ⊂ R^d → (0, 1), regularity conditions on the marginal µ(dx), and assumptions that we shall interpret as "spread" conditions on the distribution of η(X).
Complexity assumption. In the plug-in approach, the goal is to relate the closeness of η̂_n(x) to η(x) to the rate at which E(r̂_n) vanishes. Complexity assumptions for the regression function (CAR), stipulating a certain degree of smoothness for η, are thus quite tailored to such a study. Here, focus is on regression functions η(x) that belong to the (β, L, R^d)-Hölder class of functions, denoted Σ(β, L, R^d), with β > 0 and 0 < L < ∞. The latter is defined as the set of functions g : R^d → R that are ⌊β⌋ times continuously differentiable and satisfy, for any x, x′ ∈ R^d, |g(x′) − g_x(x′)| ≤ L ‖x − x′‖^β. Remark 1 (Alternative assumptions). We point out that more general CAR assumptions could be considered (see [14] for instance), involving metric entropies or combinatorial quantities such as the VC dimension, more adapted to the study of the performance of empirical risk minimizers. Owing to space limitations, the analysis is here restricted to the Hölder assumption.
Marginal density assumption. In this paper, we use the same terminology as in [3] for the assumption on the density of the marginal distribution of X. Let strictly positive constants c_0 and r_0 be fixed. Recall first that a Lebesgue measurable set A ⊂ R^d is said to be (c_0, r_0)-regular iff ∀r ∈ ]0, r_0[, ∀x ∈ A: λ(A ∩ B(x, r)) ≥ c_0 λ(B(x, r)), where λ(B) denotes the Lebesgue measure of any Lebesgue measurable set B ⊂ R^d. The following assumption on the marginal distribution µ will be used in the sequel. Fix constants c_0, r_0 > 0 and 0 < µ_min < µ_max < ∞, and suppose that a compact set C ⊂ R^d is given.
The strong density assumption is said to be satisfied if the marginal distribution µ(dx) is supported on a compact, (c_0, r_0)-regular set A ⊂ C and has a density f (w.r.t. the Lebesgue measure) bounded away from zero and infinity on A: µ_min ≤ f(x) ≤ µ_max for all x ∈ A. The mild density assumption is said to be satisfied if the marginal distribution µ(dx) is supported on a compact, (c_0, r_0)-regular set A ⊂ C and has a density f (w.r.t. the Lebesgue measure) bounded away from infinity on A: f(x) ≤ µ_max for all x ∈ A.
Global low noise assumption. We now introduce an additional assumption on the function η. In classification, to obtain rates faster than n^{−1/2}, one has to assume that the regression function η satisfies a low noise assumption in addition to the classical properties of the class of distributions. In ranking, such an assumption was used in [8] and [9]. We introduce two margin assumptions in the context of bipartite ranking and relate them to the assumptions previously made. Let α ∈ [0, 1]. The following conditions describe the behavior of the r.v. η(X).
Assumption MAK(α). The distribution P verifies the margin assumption introduced in [8]. These conditions are introduced to control the variance of the pairwise loss I{Z · r(X, X′) < 0}; in particular, we use this control to establish the oracle inequality of Theorem 8. This type of condition has been studied in classification in order to obtain fast rates (see [6] for further details).
In the bipartite ranking framework, the following condition was introduced in [9]. Assumption NA(α). There exists a constant C > 0 such that, for all x ∈ X and all t ≥ 0, P{|η(X) − η(x)| ≤ t} ≤ C t^α. (2.9) Equipped with these notations, we can state the link between these assumptions.
The theoretical results of this paper are always stated using condition NA(α); this is why we do not need the converse statement. Since such conditions are equivalent in classification, the same may hold in ranking. Condition (2.9) above is void for α = 0 and becomes more and more restrictive as α grows. It clearly echoes Tsybakov's noise condition, introduced in [28], which boils down to (2.9) with 1/2 in place of η(x). Whereas Tsybakov's noise condition concerns the behavior of η(X) near the level 1/2, condition (2.9) implies global properties of the distribution of η(X), as shown by the following result.
It is important to note that, in ranking, Assumption NA(α) can be fulfilled only for α ≤ 1 (see the proof in [9]), whereas in classification the α in Tsybakov's noise condition can be arbitrarily large, up to +∞, recovering in the limit Massart's margin condition [21]. Indeed, as a careful examination of the proof of Lemma 6 shows, bound (2.9) with α > 1 implies F′(η(x)) = 0, where F denotes the cdf of η(X); but the (probability) density of the r.v. η(X) obviously cannot vanish on its whole range Im(η) = {η(x), x ∈ X}.
In the context of binary classification, by combining the CAR assumptions described above with Tsybakov's noise condition, optimal rates of convergence were obtained in [3] and adaptive optimal rates in [19]. In particular, it was shown that, under the additional assumption that µ(dx) satisfies the mild density assumption, the minimax rate of convergence is n^{−β(1+α)/(d+β(2+α))}, which can be faster than n^{−1/2} and even very close to n^{−1}, depending on the values of the parameters α and β. Under the additional assumption that µ(dx) satisfies the strong density assumption, the minimax rate of convergence is n^{−β(1+α)/(2β+d)}, which can be faster than n^{−1/2} and even than n^{−1}. We shall now attempt to determine whether similar results hold in the ranking setup.
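For intuition, the two rate exponents above can be compared numerically. A small sketch (the function names are ours; only the two formulas come from the text):

```python
def mild_exponent(alpha, beta, d):
    """Exponent e of the minimax rate n^{-e} under the mild density
    assumption: beta(1 + alpha) / (d + beta(2 + alpha))."""
    return beta * (1 + alpha) / (d + beta * (2 + alpha))

def strong_exponent(alpha, beta, d):
    """Exponent e of the minimax rate n^{-e} under the strong density
    assumption: beta(1 + alpha) / (2 beta + d)."""
    return beta * (1 + alpha) / (2 * beta + d)
```

For example, with α = 1, β = 2 and d = 1 the strong-case exponent is 4/5 (faster than n^{−1/2}) while the mild-case exponent is only 4/7; the strong density assumption always yields the faster rate, since its denominator 2β + d is smaller than d + β(2 + α) whenever αβ > 0.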

Oracle inequalities for the aggregation procedure
In this section, we describe how to aggregate ranking rules into an accurate decision rule for the hinge ranking risk. We propose a procedure that uses exponential weights. This kind of procedure is very popular in machine learning and has been studied in many contexts, such as regression (see [25,12] and [2]), aggregation of experts (see [7] for instance) and classification (see [18]). We show that the obtained decision rule satisfies an oracle inequality which can be used to derive minimax upper bounds (see Section 4). The proof of the theorem adapts the argument of [18] to the ranking case.

Aggregation via exponential weights
The ranking rules r_1, …, r_M are given, and the goal of the aggregation method is to mimic the performance of the best of them with respect to the excess risk, under the low noise assumption. We define the exponential aggregate decision rule as f̃_n = Σ_{m=1}^M w^{(n)}_m r_m, (3.1) where the weights w^{(n)}_m are w^{(n)}_m = exp(−n A_n(r_m)) / Σ_{k=1}^M exp(−n A_n(r_k)), (3.2) with A_n(r_m) = (1/(n(n−1))) Σ_{i≠j} max(0, 1 − Z_ij r_m(X_i, X_j)) the empirical hinge ranking risk of the ranking rule r_m. Notice that we call f̃_n a decision rule because it takes its values in [−1, 1], the functions r_1, …, r_M taking their values in {−1, 1}. Using the equality (2.5), the weights can be rewritten in terms of the empirical ranking risks of the r_m's. We call this procedure aggregation with exponential weights (AEW). The idea behind this procedure is to give more weight to the ranking rules with smaller empirical risk, in order to mimic the accuracy of the empirical (hinge ranking) risk minimizer (ERM). The next result states that the AEW procedure has performance similar to the ERM estimator, up to a (log M)/n term.
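The weighting step itself is straightforward to implement. A minimal sketch, assuming weights of the form w_m ∝ exp(−n · A_n(r_m)); the temperature n and the function names are our own choices:

```python
import math

def aew_weights(empirical_risks, n):
    """Exponential weights w_m proportional to exp(-n * A_n(r_m)),
    where A_n(r_m) is the empirical hinge ranking risk of rule r_m:
    rules with smaller empirical risk get exponentially more weight."""
    a_min = min(empirical_risks)          # shift for numerical stability
    raw = [math.exp(-n * (a - a_min)) for a in empirical_risks]
    total = sum(raw)
    return [r / total for r in raw]

def aew_decision_rule(rules, weights):
    """Aggregate f_n(x, x') = sum_m w_m * r_m(x, x'); since the weights
    sum to 1 and each r_m is {-1, +1}-valued, f_n lies in [-1, 1]."""
    return lambda x, xp: sum(w * r(x, xp) for w, r in zip(weights, rules))
```

For instance, with empirical risks [0.1, 0.5] and n = 10, almost all the weight goes to the first rule, so the aggregate behaves nearly like the empirical risk minimizer while remaining a smooth mixture.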

Proposition 7. The aggregate decision rule f̃_n defined in (3.1) with weights (3.2) satisfies A_n(f̃_n) ≤ min_{1≤m≤M} A_n(r_m) + (log M)/n.
The main benefits of the AEW procedure are that it does not require a minimization algorithm and that it is less sensitive to overfitting, because the output decision rule is a mixture of several ranking rules whereas the ERM output involves only one ranking rule.

An oracle inequality
We now provide the main tool of this paper: an oracle inequality for the excess of hinge ranking risk. The goal of an oracle inequality is to show that an estimator is nearly as good as the best one in a given collection (see [22] for an example in model selection). Here, the oracle inequality states that the AEW procedure (3.1) has asymptotically the same performance as the best element of the convex hull of a finite set of decision rules.
Theorem 8 (Oracle inequality for the hinge ranking risk). Let α ∈ (0, 1], M ≥ 3 and let {r_1, …, r_M} be a finite set of ranking rules. Assume that NA(α) holds and let f̃_n be the aggregate defined in (3.1). Then, for any n ≥ 1 and any a > 0, E[A(f̃_n)] − A* ≤ (1 + a) (min_{f ∈ Conv(r_1, …, r_M)} A(f) − A*) + C ((log M)/n)^{(1+α)/(2+α)}, where C > 0 is a constant depending only on a.
In [18], the author shows that the rate ((log M)/n)^{(α+1)/(α+2)} is optimal in a minimax sense. For the moment, we do not have such an optimality result here; however, the rate in our oracle inequality is the same. Using this tool, we can state an oracle inequality for the excess of ranking risk.
Corollary 9 (Oracle inequality for the ranking risk). Let α ∈ (0, 1], M ≥ 3 and {r_1, …, r_M} be a finite set of prediction rules. We assume that NA(α) holds. Let f̃_n be the aggregate estimator introduced in (3.1). Then, for any integers M ≥ 3, n ≥ 1 and any a > 0, f̃_n satisfies the inequality E[L(sign(f̃_n))] − L* ≤ 2(1 + a) min_{1≤m≤M} (L(r_m) − L*) + C ((log M)/n)^{(1+α)/(2+α)}, where C > 0 is a constant depending only on a.
Proof. Using the equality (2.5) and the inequality (2.6) combined with Theorem 8, we immediately get the desired result.
This oracle inequality is the main tool used to obtain the minimax rates of Theorems 11, 12 and 15 via estimators based on the AEW procedure.

Minimax rates
Here, we present the adaptive minimax upper bounds in bipartite ranking in two cases, namely under the mild and the strong density assumptions. The estimators of the regression function used are the same as in classification (see [18] and [3]).

The "mild" case
In this section, we assume that the regression function η belongs to a Hölder class of functions. An important result from [16] on the complexity of Hölder classes states that H(ε, Σ(β, L, [0, 1]^d), L_∞) ≤ C ε^{−d/β}, where the left-hand side is the ε-entropy of the (β, L, [0, 1]^d)-Hölder class w.r.t. the L_∞([0, 1]^d) norm and C is a constant depending only on β and d. We now introduce the first class of distributions for the random pair (X, Y).
Definition 10. Let α ≤ 1 and let β and L be strictly positive constants. The collection of probability distributions P(dx, dy) such that 1. the marginal µ(dx) = Σ_y P(dx, y) satisfies the mild density assumption with µ_max, 2. the global noise assumption NA(α) holds, 3. the regression function belongs to the Hölder class Σ(β, L, R^d), is denoted by P_{α,β,µ_max} (omitting to index it by the constants involved in the noise assumption for notational simplicity).
Theorem 11 (Upper bound: mild case). There exists a constant C > 0, depending on d, β and α, such that for all n ≥ 1, the maximum expected excess of ranking risk of the aggregation rule defined in (4.1) with ε_n = n^{−αβ/(d+β(2+α))} is bounded as follows: sup_{P ∈ P_{α,β,µ_max}} E[L(r̃_n) − L*] ≤ C n^{−β(1+α)/(d+β(2+α))}.
To obtain an estimator adaptive to the smoothness and margin coefficients, we aggregate the rules r̃^{(ε,β)}_n for (ε, β) in a finite grid. We split the sample into two sets: the first is used to build the estimators and the second to compute the aggregation weights. The cardinality of Σ_{ε_n} is exponential in n, so the estimators r̃^{(ε_n,β)}_n, for a given (ε_n, β), are not easily implementable. However, the procedure is very interesting from a theoretical standpoint, since it is adaptive to the parameters and achieves fast rates when αβ > d, i.e. when the regression function is very smooth.

The "strong" case
Now, we introduce the second case, namely the strong density assumption. The class of distributions is given in the next definition.
Definition 13. Let α ≤ 1 and let β and L be strictly positive constants. The collection of probability distributions P(dx, dy) such that 1. the marginal µ(dx) = Σ_y P(dx, y) satisfies the strong density assumption with µ_min and µ_max, 2. the global noise assumption NA(α) holds, 3. the regression function belongs to the Hölder class Σ(β, L, R^d), is denoted by P_{α,β,µ_max,µ_min} (omitting to index it by the constants involved in the noise assumption for notational simplicity).
We recall the non-adaptive upper bound for the excess of ranking risk.

Theorem 14 ([9]). There exists a constant C > 0 such that for all n ≥ 1, the maximum expected excess of ranking risk of the plug-in rule r̃^{(β)}_n(x, x′) = 2 · I{η̂_{n,h_n}(x′) > η̂_{n,h_n}(x)} − 1, with h_n = n^{−1/(2β+d)} and l = ⌊β⌋, is bounded as follows: sup_{P ∈ P_{α,β,µ_max,µ_min}} E[L(r̃^{(β)}_n) − L*] ≤ C n^{−β(1+α)/(2β+d)}. The plug-in estimator defined in this theorem depends only on β. To obtain an estimator adaptive to the smoothness coefficient, we aggregate the rules r̃^{(β)}_n for β in a finite grid. We split the sample into two sets: the first, of size m = n − ⌊n/ln n⌋, is used to build the plug-in rules, and the second, of size l = ⌊n/ln n⌋, to compute the weights. We define the set F of plug-in rules using the training sample D¹_m = (X_i, Y_i)_{1≤i≤m}. Using the validation sample D²_l = (X_i, Y_i)_{m+1≤i≤n}, we build the weights w^{(l)}_n(r) for all r ∈ F. Finally, our ranking rule is r̃_adp = sign(f̃_adp), where f̃_adp = Σ_{r∈F} w^{(l)}_n(r) r.
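The split-and-aggregate scheme just described can be sketched as follows. The helper names `build_plugin` and `empirical_hinge_risk` are hypothetical stand-ins for the constructions of [9]; only the split sizes m = n − ⌊n/ln n⌋ and l = ⌊n/ln n⌋ come from the text, and the exponential-weight temperature l is our own choice:

```python
import math

def adaptive_split(n):
    """Split sizes: m = n - floor(n / ln n) training points for the
    plug-in rules, l = floor(n / ln n) validation points for weights."""
    l = int(n / math.log(n))
    return n - l, l

def adaptive_aggregate(build_plugin, beta_grid, train, validation,
                       empirical_hinge_risk):
    """Build one plug-in ranking rule per candidate smoothness beta on
    the training sample, weight them exponentially by their empirical
    hinge risk on the validation sample, and return the sign of the
    aggregated decision function as the final ranking rule."""
    rules = [build_plugin(beta, train) for beta in beta_grid]
    l = len(validation)
    risks = [empirical_hinge_risk(r, validation) for r in rules]
    a_min = min(risks)                    # shift for numerical stability
    raw = [math.exp(-l * (a - a_min)) for a in risks]
    total = sum(raw)
    weights = [w / total for w in raw]
    f_adp = lambda x, xp: sum(w * r(x, xp)
                              for w, r in zip(weights, rules))
    return lambda x, xp: 1 if f_adp(x, xp) >= 0 else -1
```

The resulting rule depends on no single β, which is exactly what makes the procedure adaptive to the unknown smoothness.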

Lower bounds
For completeness, we now state a lower bound on the minimax rate of the expected excess of ranking risk in the strong density case. It holds in a specific situation, namely when αβ ≤ 1. For d = 1, the result can be found in [9].
Theorem 16 (A minimax lower bound). Let (α, β) ∈ ]0, 1] × R*_+ be such that αβ ≤ 1. There exists a constant C > 0 such that, for any ranking rule r̄_n based on n independent copies of the pair (X, Y), we have, for all n ≥ 1: sup_{P ∈ P_{α,β,µ_max,µ_min}} E[L(r̄_n) − L*] ≥ C n^{−β(1+α)/(2β+d)}. When d ≥ 2, this rate of convergence is always slower than n^{−1/2}, which means that we are not able to prove optimal fast rates for the excess ranking risk. In classification, the limitation is αβ ≤ d, so optimal fast rates can be achieved in that situation (though not hyper-fast ones).
For the mild case and for the oracle inequality, mimicking the proof of Theorem 16 does not give the same rates as the upper bounds. An explanation of the difficulties, as well as the achievable rates, is given in Appendix B.

Conclusion
In this paper, we investigate the aggregation with exponential weights of ranking rules. In order to aggregate, we convexify the ranking loss using the hinge loss. We state an oracle inequality for the aggregation procedure under a low noise assumption, achieving the same rate as in classification; this is the crucial point for obtaining the adaptive upper bounds on the excess of ranking risk. In the mild density case, we establish a new upper bound that is adaptive to the margin and regularity parameters, with the same rates as in classification. In the strong density case, we aggregate plug-in rules to obtain minimax adaptive rates of convergence, under a restrictive assumption on the parameters, in all dimensions. Moreover, in dimension 1, the aggregation procedure attains minimax adaptive fast rates. These results are in the continuity of [9], and many issues remain open, in particular the lower bounds, which require a better understanding of the nature of the bipartite ranking problem.

Proof of Proposition 3
Proof.

Proof of Lemma 4
Proof. Since f takes its values in [−1, 1], the result follows from the definition of f*(X, X′).

Proof of Theorem 8
Proof. Let a > 0. Adding and subtracting (1 + a)(A_n(f̃_n) − A_n(f*)) to A(f̃_n) − A*, and then using Proposition 7, we obtain a bound valid for any f ∈ F. Taking the expectation, the goal is then to control the expectation appearing in the right-hand side, which we do via Bernstein's inequality. First, notice that, using the linearity of the hinge loss on [−1, 1] and the union bound, we deduce a deviation bound valid for all δ ∈ ]0, 4 + 2a[. The quantity inside the exponential is, for all δ ∈ ]0, 4 + 2a[ and f ∈ F, smaller than −c δ^{(2+α)/(1+α)}, where c depends only on a. Combining this with the inequality obtained above and optimizing the right-hand side in t, we obtain the desired result.

Proof of Theorem 16
Proof. The proof is classically based on Assouad's lemma. For q ≥ 1, consider the regular grid on [0, 1]^d (when the assignment to the grid is not unique, define η_q(x) through the point that is, moreover, closest to 0). Consider the partition X′_1, …, X′_{q^d} of [0, 1]^d canonically defined by the grid G^{(q)} (x and y belong to the same subset iff η_q(x) = η_q(y)). Let u_1 : R_+ → R_+ be a non-increasing, infinitely differentiable function as in [3], and let u_2 : R_+ → R_+ be an infinitely differentiable bump function such that u_2 = 1 on [1/12, 1/6]. Let φ_1, φ_2 : R^d → R_+ be functions defined from u_1 and u_2, where the positive constant C_φ is taken small enough. Now we define the hypercube H. For this purpose, we merge intervals into groups G_1, …, G_H, where K is the number of intervals brought together along the first coordinate (it will play a role in the proof), m = K q^{d−1} is the number of cells in a group and H = ⌊q/K⌋. Define the hypercube H = {P_σ, σ ∈ S^H_m}, where S_m is the symmetric group of order m, of probability distributions P_σ of (X, Y) as follows.
where h̄ is a function of q and k(x) = ⌊xK/q⌋. We now check the assumptions. Because of the design, the Hölder condition holds for x, x′ ∈ X_i ([3]). In contrast with the classification situation, we must also check whether the Hölder condition holds for x ∈ X_i, x′ ∈ X_j when i ≠ j belong to the same group G_k. One can see that it holds as soon as m h̄ ≤ L q^{−β} (i.e. K h̄ ≤ L q^{1−d−β}). Consider now the margin assumption. For t = O(h̄), the margin condition implies W ≤ C h̄^α. A constraint on K is also induced by the margin assumption: restricted to a group, the range of η has a measure of order q^{−β} (because of the Hölder assumption). Hence, the margin assumption is satisfied if mW = O(q^{−αβ}); because of the strong density assumption, W ≥ C/q^d. Coupling the two last inequalities leads to αβ ≤ 1, guaranteeing K ≥ 2. A straightforward calculation shows that H²(P_σ, P_{τ_{i,j}σ}) ≤ 4W(1 − √(1 − q^{−2β})) ≤ 4W q^{−2β}. Using arguments from [4], and since the number of σ ∈ S^H_m such that σ(i) − σ(j) > m/2 is greater than m!^H/8, we finally obtain a lower bound on inf_{r̄_n} sup_{P ∈ P_{α,β,µ_min,µ_max}} of the expected excess risk. Taking q = C_1 n^{1/(2β+d)}, W = C_2 q^{−d} and m = C_3 q^{d−αβ} for some positive constants C_1, C_2, C_3 concludes the proof.

Appendix B: Lower bounds
Basically, the idea of the proof of Theorem 16 is the following: first fix a part X_1 ⊂ X of the space, then create a classification problem, as in [3], around X_1. Doing so gives the rates of convergence of classification multiplied by the measure of X_1. The next step is then to create classification problems on a union of parts of the space whose total measure is independent of n. For the mild case in classification (see the proof in [3]), the classification problem uses the whole space X, i.e. all the important parts of the space have η close to 1/2 and a density close to zero. With our strategy, we therefore obtain the classification rates multiplied by the measure of X_1 (i.e. W in the previous proof), which is not independent of n. For information only, we give the lower bounds achievable with this strategy; since we believe they are not optimal, we do not give the proofs.
Oracle inequality. Adapting the proof of Theorem 3 in [19], one can obtain the next proposition. Let P_α be the set of all probability distributions such that NA(α) holds. Notice that the power of n is half the power of n in the upper bound. Moreover, a log log(M) term appears, coming from the fact that we use permutations instead of the hypercube {−1, +1}^{log(M)}.
Mild assumption. In that case, directly following the proof of the strong case with the parameter choices of [3], one can prove the following proposition.
Notice that the only change here is the factor 2 in front of α.