Improved Classification Rates under Refined Margin Conditions

In this paper we present a simple partitioning-based technique to refine the statistical analysis of classification algorithms. The core idea is to divide the input space into two parts such that the first part contains a suitable vicinity of the decision boundary, while the second part is sufficiently far away from the decision boundary. Using a set of margin conditions we are then able to control the classification error on both parts separately, and by balancing the two error terms we obtain a refined error analysis in a final step. We apply this general idea to the histogram rule and show that even for this simple method we obtain, under certain assumptions, better rates than the ones known for support vector machines, for certain plug-in classifiers, and for a recently analyzed tree-based adaptive-partitioning ansatz. Moreover, we show that a margin condition which relates the location of critical noise to the decision boundary makes it possible to improve the optimal rates proven for distributions without this margin condition.


Introduction
Given a dataset D := ((x_1, y_1), …, (x_n, y_n)) of observations drawn in an i.i.d. fashion from a probability measure P on X × Y, where X ⊂ R^d and Y := {−1, 1}, the learning goal of binary classification is to find a decision function f_D : X → {−1, 1} such that for new data (x, y) we have f_D(x) = y with high probability.
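This goal can be summarized by the misclassification probability of f_D; the display below is a standard reformulation rather than a definition taken from this paper, and it is made precise via the classification loss in Section 2:

```latex
% Learning goal: the misclassification probability of f_D should be close to the
% smallest value achievable by any measurable decision function (the Bayes risk).
\[
  P\bigl(\{(x, y) \in X \times Y : f_D(x) \neq y\}\bigr)
  \;\approx\;
  \inf_{f : X \to \{-1, 1\} \text{ measurable}}
  P\bigl(\{(x, y) \in X \times Y : f(x) \neq y\}\bigr).
\]
```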
The problem of classification is, apart from regression, one of the most studied problems in learning theory, and many classical learning methods have been presented in the literature, such as histogram rules, nearest neighbor methods, or moving window rules; a general reference for these methods is [4]. Several more recent methods use trees to build a classifier. For example, the random forest algorithm, introduced in [3], makes a prediction by a majority vote over a collection of random trees. Another example is the tree-based adaptive-partitioning algorithm presented in [2], where a classifier is picked by empirical risk minimization over a nested sequence (S_m)_{m≥1} of families of sets which is based on dyadic or decorated tree partitions. Examples of non-tree-based algorithms are described in [1] and [7]; there, the final classifier is found by empirical risk minimization over a suitable grid of plug-in rules or is derived by plug-in kernel, partitioning, or nearest neighbor classification rules. Another non-tree-based algorithm is the support vector machine (SVM), which solves a regularized empirical risk minimization problem over a reproducing kernel Hilbert space H. For more details on statistical properties of SVMs for classification we refer the reader to [10, Chapter 8].
In this paper we discuss a partitioning-based technique to analyse the statistical properties of classification algorithms. In particular, we show for the histogram rule that under certain assumptions this technique leads to rates that are faster than the rates obtained in [1, 2, 7], and [10]. To be more precise, we divide the input space X into two overlapping regions that are adjustable by a parameter r in such a way that one set, which we denote by N_r, contains points near the decision boundary, whereas the other set F_r contains those points that are sufficiently far away from the decision boundary. We examine the excess risks over these two sets separately by applying an oracle inequality for empirical risk minimizers on both parts. It turns out that we have no approximation error on F_r and that we obtain, under a suitable assumption which relates critical noise to the decision boundary, an optimal variance bound on F_r, which in turn leads to an O(n^{−1}) behavior of the excess risk on F_r. However, this bound still depends on the parameter r: it increases for r → 0. In contrast, our bound on the risk on N_r decreases for r → 0. By balancing these two risks with respect to r we obtain a refined bound on X under additional assumptions describing the concentration of mass around the decision boundary.
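Schematically, and anticipating Lemma 3.2 and the notation introduced in Sections 2 and 3, the splitting and the opposing behavior of the two error terms can be summarized as follows; the exact constants and the balancing choice of r are given in Section 3:

```latex
% Error splitting over the near set N_r and the far set F_r; cf. Lemma 3.2.
\[
  \mathcal{R}_{L,P}(h_{D,s}) - \mathcal{R}^*_{L,P}
  \;\le\;
  \underbrace{\bigl(\mathcal{R}_{L_{N_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{N_r},P}\bigr)}_{\text{decreases as } r \to 0}
  \;+\;
  \underbrace{\bigl(\mathcal{R}_{L_{F_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{F_r},P}\bigr)}_{\mathcal{O}(n^{-1}) \text{ for fixed } r,\ \text{increases as } r \to 0}.
\]
```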
A more detailed discussion of this technique and of the statistical results, which include rate adaptivity, is presented in Section 3. Moreover, a comparison of the resulting learning rates to the ones known for the SVM, for certain plug-in classification rules, and for the tree-based adaptive-partitioning algorithm described in [2] can be found at the end of Section 4. In particular, we show that the above-mentioned assumption relating the location of critical noise to the decision boundary has an essential influence on our learning rates, so that, under a common set of assumptions, we outperform the optimal rates obtained for the classifier in [1]. Furthermore, we show that if we omit the latter assumption, we obtain exactly the optimal rate of [1]. We note that all proofs are deferred to Section 5.

General assumptions
To describe our learning goal we consider in the following the classification loss L := L_class : Y × R → [0, ∞), defined by L(y, t) := 1_{(−∞,0]}(y · sign t) for y ∈ Y, t ∈ R, where 1_{(−∞,0]} denotes the indicator function of (−∞, 0]. We define the risk of a measurable estimator f : X → R by
\[
  \mathcal{R}_{L,P}(f) := \int_{X \times Y} L\bigl(y, f(x)\bigr)\, dP(x, y)
\]
and the empirical risk by
\[
  \mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr).
\]
The smallest possible risk
\[
  \mathcal{R}^*_{L,P} := \inf\bigl\{ \mathcal{R}_{L,P}(f) : f \colon X \to \mathbb{R} \text{ measurable} \bigr\}
\]
is called the Bayes risk, and a measurable function f*_{L,P} : X → R such that R_{L,P}(f*_{L,P}) = R*_{L,P} holds is called a Bayes decision function. Recall that the Bayes decision function f*_{L,P} for the classification loss is given by sign(2P(y = 1|x) − 1) for x ∈ X, where P(·|x) is a regular conditional probability on Y given x. Let us now briefly describe a particular histogram rule. To this end, let A = (A_j)_{j≥1} be a partition of R^d into cubes of side length s ∈ (0, 1], which in particular covers X. For x ∈ X we denote by A(x) the cell A_j containing x and define the histogram decision function by
\[
  f_{D,s}(x) := \sum_{i=1}^{n} \mathbf{1}_{A(x)}(x_i)\, y_i, \qquad x \in X. \tag{1}
\]

Thus, the empirical histogram rule is defined by h_{D,s} := sign f_{D,s}, i.e., h_{D,s} decides on each cell by a majority vote over the training labels falling into that cell. We define the set
\[
  \mathcal{F} := \Bigl\{ f \colon X \to \mathbb{R} \;:\; f = \sum_{j \ge 1} c_j\, \mathbf{1}_{A_j \cap X} \text{ with } c_j \in \{-1, 1\} \Bigr\},
\]
that is, the set of all functions with values in {−1, 1} that are constant on every cell of the partition. Then, it is easy to show that the empirical histogram rule h_{D,s} is an empirical risk minimizer over F for the classification loss, that means
\[
  \mathcal{R}_{L,D}(h_{D,s}) = \min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f).
\]
Since we aim in a further step to examine the risk on subsets of X consisting of cells, we have to specify the loss on those subsets. Therefore, we define for an arbitrary index set J ⊂ {1, …, m} the set
\[
  T_J := \bigcup_{j \in J} A_j \cap X \tag{2}
\]
and the related loss
\[
  L_{T_J}(x, y, t) := \mathbf{1}_{T_J}(x)\, L(y, t), \qquad (x, y, t) \in X \times Y \times \mathbb{R}. \tag{3}
\]
Furthermore, we define the risk over T_J by R_{L_{T_J},P}(f) := ∫_{X×Y} L_{T_J}(x, y, f(x)) dP(x, y) and use the shortcut L_{T_J} ∘ f := L_{T_J}(x, y, f(x)). We denote by P^n the n-fold product measure of the probability measure P.

As mentioned in the introduction, we have to make assumptions on P in order to obtain rates. Therefore, we recall some notions from [10, Chapter 8] which describe the behavior of P in the vicinity of the decision boundary. To this end, let η : X → [0, 1], defined by η(x) := P(y = 1|x) for x ∈ X, be a version of the posterior probability of P, that is, the probability measures P(·|x) form a regular conditional probability of P. Clearly, if we have η(x) = 0 resp. η(x) = 1 for some x ∈ X, we observe the label y = −1 resp. y = 1 with probability 1. Otherwise, if, e.g., η(x) ∈ [1/2, 1), we observe the label y = −1 with probability 1 − η(x) ∈ (0, 1/2], and we call the latter probability noise. Obviously, in the worst case this probability equals 1/2, and we define the set containing those x ∈ X by X_0 := { x ∈ X : η(x) = 1/2 }. Furthermore, we write X_1 := { x ∈ X : η(x) > 1/2 } and X_{−1} := { x ∈ X : η(x) < 1/2 }, and the function
\[
  \Delta_\eta(x) :=
  \begin{cases}
    d(x, X_1), & x \in X_{-1},\\
    d(x, X_{-1}), & x \in X_{1},\\
    0, & \text{otherwise},
  \end{cases} \tag{4}
\]
where d(x, A) := inf_{x′∈A} d(x, x′), is called the distance to the decision boundary. This helps us to describe the mass of the marginal distribution P_X of P around the decision boundary by the following exponents.

We say that P has margin exponent (ME) α ∈ (0, ∞] if there exists a constant c_ME > 0 such that
\[
  P_X\bigl(\{ x \in X : \Delta_\eta(x) < t \}\bigr) \le c_{\mathrm{ME}}\, t^{\alpha}
\]
for all t > 0. Descriptively, the ME α measures the amount of mass close to the decision boundary. Therefore, large values of α are better, since they reflect a low concentration of mass in this region, which makes the classification easier. Furthermore, we say that P has margin-noise exponent (MNE) β ∈ (0, ∞] if there exists a constant c_MNE > 0 such that
\[
  \int_{\{ \Delta_\eta(x) < t \}} |2\eta(x) - 1| \, dP_X(x) \le c_{\mathrm{MNE}}\, t^{\beta}
\]
for all t > 0. The MNE β measures the mass and the noise, that is, the amount of points x ∈ X with η(x) ≈ 1/2, around the decision boundary. That is, we have a high MNE β if we have low mass and/or high noise around the decision boundary. Next, we say that the distance to the decision boundary Δ_η controls the noise from below by the exponent γ if there exist a γ ∈ [0, ∞) and a constant c_LC > 0 with
\[
  \Delta_\eta(x)^{\gamma} \le c_{\mathrm{LC}}\, |2\eta(x) - 1| \tag{5}
\]
for P_X-almost all x ∈ X. That means, if η(x) is close to 1/2 for some x ∈ X, then this x is close to the decision boundary. For examples of typical values of these exponents and relations between them we refer the reader to [10, Chapter 8].
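Before turning to the geometric description of the decision boundary, the following minimal sketch makes the empirical histogram rule h_{D,s} = sign f_{D,s} concrete; the NumPy-based representation, the function names, and the tie-breaking convention sign(0) := +1 are illustrative choices and not taken from the paper.

```python
import numpy as np

def cell_index(x, s):
    """Index of the cube of side length s that contains the point x."""
    return tuple(np.floor(np.asarray(x) / s).astype(int))

def fit_histogram(X_train, y_train, s):
    """Empirical histogram rule: store the sum of the labels y_i in {-1, +1} per cell."""
    label_sums = {}
    for x, y in zip(X_train, y_train):
        j = cell_index(x, s)
        label_sums[j] = label_sums.get(j, 0) + y
    return label_sums

def predict_histogram(label_sums, X_test, s):
    """h_{D,s}(x): sign of the label sum in the cell containing x (majority vote).
    Empty cells (label sum 0) are assigned +1 by the convention sign(0) := +1."""
    preds = []
    for x in X_test:
        f_value = label_sums.get(cell_index(x, s), 0)
        preds.append(1 if f_value >= 0 else -1)
    return np.array(preds)

# Toy usage on [0, 1]^2 with decision boundary {x_1 = 1/2} and noiseless labels.
rng = np.random.default_rng(0)
X_train = rng.random((2000, 2))
y_train = np.where(X_train[:, 0] > 0.5, 1, -1)
s = 0.1                                   # cell side length
model = fit_histogram(X_train, y_train, s)
print(predict_histogram(model, rng.random((5, 2)), s))
```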
Finally, in order to describe the region of the decision boundary in a more geometrical way, we say, according to [6, 3.2.14(1)], that a general set T ⊂ X is m-rectifiable for an integer m > 0 if there exists a Lipschitzian function mapping some bounded subset of R^m onto T. Furthermore, we denote by ∂_X T the relative boundary of T in X. Moreover, we denote by H^{d−1} the (d − 1)-dimensional Hausdorff measure on R^d, see [6, Introduction]. The following lemma, which is based on [9, Lemma A.10.4], describes the Lebesgue measure of the decision boundary in terms of the Hausdorff measure. Its result will be necessary for the analysis of the main theorem in Section 3.

Oracle inequality and learning rates
Our goal is to find an upper bound for the excess risk R_{L,P}(h_{D,s}) − R*_{L,P}. The idea is to split X into two overlapping sets and to bound the risks over these sets by using information on P. To this end, we denote the set of indices of cubes that intersect X by J := { j ≥ 1 : A_j ∩ X ≠ ∅ }. Next, we split this set into indices of cubes that lie near the decision boundary and indices of cubes that are bounded away from the decision boundary. To be more precise, we define, for r > 0 and a version η for which the assumptions at the end of Section 2 hold, the set of indices of cubes near the decision boundary by J_N^r and the set of indices of cubes that are sufficiently bounded away by J_F^r := { j ∈ J : Δ_η(x) ≥ r for all x ∈ A_j ∩ X }. Moreover, we write
\[
  N_r := \bigcup_{j \in J_N^r} A_j \cap X \tag{6}
\]
and
\[
  F_r := \bigcup_{j \in J_F^r} A_j \cap X. \tag{7}
\]
The next lemma shows that we are able to assign all x ∈ A_j with j ∈ J_F^r either to the class X_{−1} or to the class X_1. Furthermore, we need geometric requirements on X to ensure that X ⊂ N_r ∪ F_r.

Lemma 3.1. Let r ≥ s/2 and define the sets N_r and F_r by (6) and (7). Then, the following statements are true: i) for every j ∈ J_F^r we have either A_j ∩ X ⊂ X_1 or A_j ∩ X ⊂ X_{−1}; ii) X ⊂ N_r ∪ F_r.
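As an illustration of this splitting of the input space, the following sketch marks the cells of a grid on [0, 1]^d as "near" or "far" for a toy distribution whose decision boundary is {x_1 = 1/2}, so that Δ_η(x) = |x_1 − 1/2| is available in closed form; the probing-based test and all names are illustrative and are not meant to reproduce the exact definitions (6) and (7).

```python
import numpy as np

def delta(x):
    """Distance to the decision boundary for the toy posterior whose boundary is {x_1 = 1/2}."""
    return abs(x[0] - 0.5)

def split_cells(s, r, d=2, n_probe=200, seed=0):
    """Mark every cell of the side-length-s grid on [0,1]^d as 'far' (all probed points
    have distance at least r to the boundary) or as 'near' (some probed point is closer)."""
    rng = np.random.default_rng(seed)
    n_cells = int(np.ceil(1.0 / s))
    near, far = [], []
    for j in np.ndindex(*([n_cells] * d)):
        corner = np.array(j) * s
        probes = corner + s * rng.random((n_probe, d))   # random points inside the cell
        if min(delta(p) for p in probes) >= r:
            far.append(j)
        else:
            near.append(j)
    return near, far

near, far = split_cells(s=0.1, r=0.1)
print(len(near), "near cells,", len(far), "far cells")
```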

Lemma 3.1 ii) leads to a helpful splitting of the excess risk, as the following lemma shows.

Lemma 3.2. Under the assumptions of Lemma 3.1 ii) we have
\[
  \mathcal{R}_{L,P}(h_{D,s}) - \mathcal{R}^*_{L,P}
  \le
  \bigl(\mathcal{R}_{L_{N_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{N_r},P}\bigr)
  +
  \bigl(\mathcal{R}_{L_{F_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{F_r},P}\bigr).
\]
That means, we can bound the excess risk R_{L,P}(h_{D,s}) − R*_{L,P} if we find bounds on the excess risks over the sets N_r and F_r. For that purpose, we use an oracle inequality for empirical risk minimizers, see [10, Theorem 7.2], separately on both error terms. This is possible, since the following lemma shows that, considering the loss L_{T_J} for any set T_J constructed as in (2), the empirical histogram rule h_{D,s} is still an empirical risk minimizer over F.

Lemma 3.3. Let J ⊂ {1, …, m} be an arbitrary index set and let T_J and L_{T_J} be defined by (2) and (3). Then, the empirical histogram rule h_{D,s} is an empirical risk minimizer over F for the loss L_{T_J}, that means
\[
  \mathcal{R}_{L_{T_J},D}(h_{D,s}) = \min_{f \in \mathcal{F}} \mathcal{R}_{L_{T_J},D}(f).
\]

Before we state our oracle inequality, we discuss in more detail the improvement gained by the separation technique described above. First, we make no approximation error on the set F_r, which consists of cells that are sufficiently bounded away from the decision boundary. This follows from the fact that the infinite-sample histogram rule h_{P,s} ∈ F (the analogue of h_{D,s} with D replaced by P) classifies those cells correctly; we refer the reader to Part 1 of the proof of Theorem 3.5 for details. Second, the main refinement arises from the fact that we achieve, under the condition that the decision boundary controls the noise from below, a bound on F_r of the form
\[
  \mathbb{E}_P\bigl(L_{F_r} \circ f - L_{F_r} \circ f^*_{L,P}\bigr)^2
  \le V \cdot \Bigl(\mathbb{E}_P\bigl(L_{F_r} \circ f - L_{F_r} \circ f^*_{L,P}\bigr)\Bigr)^{\theta}
\]
with the best possible exponent θ = 1. Here, V is a positive constant. The latter bound is known in the literature as a variance bound. This bound plays an important part in the analysis of the risk terms, since we have a small variance if the right-hand side of the latter inequality is small. This relation is shown in detail in the next lemma.

Lemma 3.4. Assume that the associated distance to the decision boundary Δ_η controls the noise from below by the exponent γ ∈ [0, ∞) and consider, for some fixed r > 0, the set F_r defined in (7). Furthermore, let L := L_class be the classification loss and let f*_{L,P} be a fixed Bayes decision function. Then, for all measurable f : X → R we have
\[
  \mathbb{E}_P\bigl(L_{F_r} \circ f - L_{F_r} \circ f^*_{L,P}\bigr)^2
  \le c_{\mathrm{LC}}\, r^{-\gamma}\, \mathbb{E}_P\bigl(L_{F_r} \circ f - L_{F_r} \circ f^*_{L,P}\bigr).
\]

We remark that the right-hand side of the variance bound on F_r depends on the separation parameter r. This dependence is also reflected in the risk term on F_r. In particular, we show in Part 1 of the proof of Theorem 3.5, by applying [10, Theorem 7.2] to the risk term on the set F_r, that the improvements mentioned above lead to an excess-risk bound over F_r that, for fixed r, decays like n^{−1}; in this bound, τ ≥ 1 is a confidence parameter and c̃ a positive constant. Whereas this error term increases for r → 0, the error term on the set N_r behaves in exactly the opposite way, that is, it decreases for r → 0. In fact, bounding the risk on N_r requires additional knowledge of the behavior of P in the vicinity of the decision boundary. By applying [10, Theorem 7.2] to the risk on the set N_r we show in Part 2 of the proof of Theorem 3.5, under the assumption that P has ME α and MNE β, a corresponding bound on the excess risk over N_r; in this bound, c̃ is a positive constant, τ ≥ 1, and V is the prefactor of the variance bound on N_r derived in the second part of the proof. We refer the reader to the proof of Theorem 3.5 for exact constants. If we balance the obtained risk terms over N_r and F_r with respect to r, we obtain the oracle inequality presented in the following theorem. For this purpose, we define in (8) a positive constant c̃_{α,γ,d}, which depends on α, γ, and d.

Theorem 3.5. Let X ⊂ [0, 1]^d and let P be a probability measure on X × {−1, 1} with a fixed version η : X → [0, 1] of its posterior probability. Assume that the associated distance to the decision boundary Δ_η controls the noise from below by the exponent γ ∈ [0, ∞), and assume as well that P has MNE β ∈ (0, ∞] and ME α ∈ (0, ∞].
Furthermore, let L be the classification loss and let, for fixed n ≥ 1 and τ ≥ 1, the bounds (9) and (10) be satisfied, where the constant c̃_{α,γ,d} is defined by (8). Then the resulting oracle inequality for the excess risk R_{L,P}(h_{D,s}) − R*_{L,P}, obtained by combining the error bounds over N_r and F_r, holds with probability P^n ≥ 1 − 2e^{−τ}.

The proof shows that the constant c_{α,γ,d} is given by (12). By choosing an appropriate sequence of s_n in dependence on the data length n and setting a constraint on the MNE β, we state learning rates in the next theorem. Prior to that, we define, with κ := (1 + γ)(α + γ), a constant c_{α,β,γ,τ,d} that depends on α, β, γ, τ, and d and in which c_{α,γ,d} is the constant from (12).

Theorem 3.6. Assume that X and P satisfy the assumptions of Theorem 3.5 for β ≤ γ^{−1}κ, where κ := (1 + γ)(α + γ). In addition, assume that the side length s_n in Theorem 3.5 is chosen in dependence on n such that the two error terms in Theorem 3.5 are balanced. Then there is an n_0 ≥ 1 such that for all n ≥ n_0 the bound
\[
  \mathcal{R}_{L,P}(h_{D,s_n}) - \mathcal{R}^*_{L,P}
  \le c_{\alpha,\beta,\gamma,\tau,d}\; n^{-\frac{\beta\kappa}{\beta(\kappa+\gamma^2)+d\kappa}}
\]
holds with probability P^n ≥ 1 − 2e^{−τ}, where n_0 and the constant c_{α,β,γ,τ,d} only depend on τ, α, β, γ, and d.
The proof of the latter theorem provides an explicit expression for the constant c_{α,β,γ,τ,d}. Furthermore, we remark that the constraint on the MNE β in Theorem 3.6 is imposed to ensure that the chosen side length s_n fulfils assumption (9). If we omit this constraint, we have to choose another s_n, and for this s_n we would not be able to balance the two terms on the right-hand side of the excess risk bound in Theorem 3.5. Since our examples in Section 4 fulfil this constraint, we did not consider other choices of s_n. To obtain the rates we have to know the parameters describing P. However, it is also possible to obtain the rates in Theorem 3.6 by the following data splitting ansatz, a hold-out procedure whose concept is similar to the one described in [10, Chapter 6.5]. Let (S_n) be a sequence of finite subsets S_n ⊂ (0, 1]. For a dataset D := ((x_1, y_1), …, (x_n, y_n)) we define the sets
\[
  D_1 := \bigl((x_1, y_1), \ldots, (x_k, y_k)\bigr)
  \qquad \text{and} \qquad
  D_2 := \bigl((x_{k+1}, y_{k+1}), \ldots, (x_n, y_n)\bigr),
\]
where k := ⌊n/2⌋ + 1 and n ≥ 4. Then, we use D_1 as a training set, compute h_{D_1,s} for every s ∈ S_n, and use D_2 to determine an s*_{D_2} ∈ S_n such that
\[
  \mathcal{R}_{L,D_2}\bigl(h_{D_1, s^*_{D_2}}\bigr) = \min_{s \in S_n} \mathcal{R}_{L,D_2}\bigl(h_{D_1, s}\bigr),
\]
and a learning method producing the decision function h_{D_1, s*_{D_2}} is called a training validation histogram rule (TV-HR). The following theorem shows that the TV-HR learns with the same rate as in Theorem 3.6 without knowing the parameters describing P.

Theorem 3.7. Assume that X and P satisfy the assumptions of Theorem 3.5 for β ≤ γ^{−1}κ, where κ := (1 + γ)(α + γ). Let S_n be a finite subset of (0, 1] such that S_n is an n^{−1/d}-net of (0, 1], and assume that the cardinality of S_n grows at most polynomially in n. Then, the TV-HR learns with rate
\[
  n^{-\frac{\beta\kappa}{\beta(\kappa+\gamma^2)+d\kappa}}.
\]
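The hold-out selection behind the TV-HR can be sketched as follows; it reuses the illustrative fit_histogram/predict_histogram helpers from the sketch in Section 2, and everything apart from the split point k = ⌊n/2⌋ + 1 and the idea of minimizing the validation error over the grid S_n is an illustrative choice rather than part of the paper.

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Empirical classification error."""
    return float(np.mean(y_true != y_pred))

def train_validation_histogram_rule(X, y, S_n):
    """TV-HR sketch: train h_{D_1, s} on D_1 for every s in the grid S_n and pick
    the cell width with the smallest empirical risk on the validation part D_2."""
    n = len(y)
    k = n // 2 + 1                                  # split point k = floor(n/2) + 1, n >= 4
    X1, y1, X2, y2 = X[:k], y[:k], X[k:], y[k:]
    best_s, best_risk, best_model = None, np.inf, None
    for s in S_n:
        model = fit_histogram(X1, y1, s)            # helpers from the histogram sketch above
        risk = empirical_risk(y2, predict_histogram(model, X2, s))
        if risk < best_risk:
            best_s, best_risk, best_model = s, risk, model
    return best_s, best_model

# Usage (X, y as in the histogram sketch, n = len(y), d = X.shape[1]):
#   S_n = np.arange(n ** (-1.0 / d), 1.0, n ** (-1.0 / d))   # finite grid in (0, 1]
#   s_star, model = train_validation_histogram_rule(X, y, S_n)
```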

Comparison of rates
In order to compare our rate obtained in Theorem 3.6 to the ones known from [1, 2, 7] and [10], we formulate in the following reasonable sets of common assumptions. Besides our geometric assumption (i) on X, we make the following two assumptions on P: (ii) P has ME α ∈ (0, ∞]; (iii) there exist a γ ∈ [0, ∞) and constants c_LC, c_UC > 0 such that for all x ∈ X we have
\[
  \text{(iii)}_a:\; \Delta_\eta(x)^{\gamma} \le c_{\mathrm{LC}}\, |2\eta(x) - 1|
  \qquad \text{and} \qquad
  \text{(iii)}_b:\; |2\eta(x) - 1| \le c_{\mathrm{UC}}\, \Delta_\eta(x)^{\gamma}.
\]
Here, assumption (iii)_a coincides with the definition in (5). Furthermore, assumption (iii)_b provides an upper control of the noise by Δ_η, which is, up to a constant, a kind of inverse to (iii)_a. Then, [10, Lemma 8.17] shows under the assumptions (ii) and (iii)_b that P has MNE β = α + γ. Hence, we find by Theorem 3.6, with κ := (1 + γ)(α + γ) and a suitable cell width s_n, that h_{D,s_n} learns with a rate with exponent
\[
  \frac{\beta\kappa}{\beta(\kappa+\gamma^2)+d\kappa}, \qquad \beta = \alpha + \gamma. \tag{13}
\]
A simple transformation shows that this exponent equals
\[
  \frac{\alpha+\gamma}{\alpha+2\gamma+d-\frac{\gamma}{1+\gamma}}.
\]
First, we compare the rate with exponent (13) to the rate achieved by support vector machines (SVMs) for the hinge loss, assuming that (i), (ii), and (iii) hold. For this purpose, [10, Chapter 8.3, (8.18)] provides the best possible rate for SVMs using Gaussian kernels, which contains an arbitrarily small ρ > 0 in its exponent. Compared to this rate, our rate in (13) is better by −γ/(1+γ) in the denominator. For the typical value of γ = 1, indicating a moderate control of the noise by the decision boundary, our rate is better by −1/2 in the denominator.
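To get a feeling for the size of this improvement, the following small calculation evaluates the closed form of exponent (13) given above together with the same expression without the γ/(1+γ) correction in the denominator; since that closed form was reconstructed in this section, the numbers are purely illustrative.

```python
def exponent_ours(alpha, gamma, d):
    """Exponent of our rate n^{-e} in (13): (alpha+gamma)/(alpha+2*gamma+d - gamma/(1+gamma))."""
    return (alpha + gamma) / (alpha + 2 * gamma + d - gamma / (1 + gamma))

def exponent_without_refinement(alpha, gamma, d):
    """Same exponent without the gamma/(1+gamma) improvement in the denominator."""
    return (alpha + gamma) / (alpha + 2 * gamma + d)

for alpha, gamma, d in [(1.0, 1.0, 2), (1.0, 1.0, 5), (2.0, 0.5, 2)]:
    e_ours = exponent_ours(alpha, gamma, d)
    e_base = exponent_without_refinement(alpha, gamma, d)
    print(f"alpha={alpha}, gamma={gamma}, d={d}: ours n^-{e_ours:.3f} vs baseline n^-{e_base:.3f}"
          f" (denominator improvement {gamma / (1 + gamma):.2f})")
```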
Second, we compare our rates to the ones for certain plug-in classifiers, see [1, 7], and to the rates obtained by the classification algorithms described in [2]. In the cases of [1] and [2] the authors assume that P has a noise exponent (NE) q ∈ [0, ∞], that is, that there exists a constant c_NE > 0 such that
\[
  P_X\bigl(\{ x \in X : |2\eta(x) - 1| < \varepsilon \}\bigr) \le c_{\mathrm{NE}}\, \varepsilon^{q} \tag{14}
\]
for all ε > 0, cf. [10, Definition 8.22]. Since (14) measures the amount of critical noise and does not locate the noise, we call this exponent noise exponent, in contrast to [8] and the mentioned authors, who call it margin exponent. The authors of [7] assume a weaker version of (14) on P; however, (14) implies this weaker version, see [5, Section 2]. We compare our rates under a different assumption set than in the first comparison to SVMs. To this end, we impose, in addition to (i), (ii), and (iii)_a, that (iv) η is Hölder continuous for some exponent γ ∈ (0, 1]. Then, we find under condition (iv) with Lemma A.2 that assumption (iii)_b is fulfilled with exponent γ, and thus we assume in the following that (iii)_a holds for the same γ. Note that in (iv) we have γ ∈ (0, 1], whereas in the case of (iii) we have γ ∈ [0, ∞). Moreover, under assumptions (ii) and (iii)_a we find with [10, Exercise 8.5] that the noise exponent in (14) holds with
\[
  q = \frac{\alpha}{\gamma}. \tag{15}
\]
By assuming (i), (ii), (iii)_a, and (iv), our rate has the same exponent as in (13), that is,
\[
  \frac{\alpha+\gamma}{\alpha+2\gamma+d-\frac{\gamma}{1+\gamma}}. \tag{16}
\]
Furthermore, the plug-in classifiers based on kernel, partitioning, or nearest neighbor regression estimates shown in [7, Theorems 1, 3, and 5] yield under these assumptions, and thus in particular with (15), the rate given in (17), such that our rate is better by −γ(2+γ)/(1+γ) in the denominator. The authors were able to improve the rate given in (17) by making, in addition, the assumption that P_X has a density with respect to the Lebesgue measure which is bounded away from zero, see [7, Theorems 2, 4, and 6]. Under this condition and (i), (ii), (iii)_a, and (iv), these classifiers yield an improved rate; hence, our rate with exponent (13) is better if our margin exponent α fulfils α < γ/(1+γ). We have a small margin exponent α, for example, if we have much mass around the decision boundary, that is, if the density is unbounded in this region. We remark that the authors obtained rates under the Hölder assumption (iv) and a weak margin assumption, and improved them as discussed above by making the assumption that P_X has a density which is bounded away from zero.
Next, we compare our rates to the ones obtained by the classifier resulting from the classification method given in [2, Section 5]. To this end, we consider, in addition to (i), (iii)_a, and (iv), for example that (v) P_X is the uniform distribution.
Under the condition that (i) and (v) hold, we find with Lemma 2.1 that assumption (ii) is fulfilled for α = 1. Then, we obtain in (15) that q = 1/γ. Again, we find with Lemma A.2 that assumption (iii)_b is fulfilled with exponent γ and assume again that (iii)_a holds for the same γ. Hence, the conditions (i), (iii)_a, (iv), and (v) yield in (16) a rate with exponent
\[
  \frac{1+\gamma}{1+2\gamma+d-\frac{\gamma}{1+\gamma}}; \tag{18}
\]
compared with the rate given in (19), our rate is worse by 1/(1+γ). However, a comparison is also possible under a more generic assumption set in which we do not fix an example of P_X. Indeed, if we assume the conditions (i), (ii), (iii)_a, and (iv), then our rate with exponent (16) holds, and [2, Corollary 5.2(i)] shows that their classifier obtains the rate
\[
  \Bigl(\frac{\log n}{n}\Bigr)^{\frac{\alpha+\gamma}{\alpha+2\gamma+d}}. \tag{20}
\]
Thus, our rate with exponent (16) is again better by −γ/(1+γ) in the denominator. Finally, we compare our rates to the ones obtained for the plug-in classifier defined by [1, (4.1) with p = ∞] under the conditions (i), (iii)_a, (iv), and (vi) P_X has a uniformly bounded density.
Analogously to the above, one can show with (i) and (vi) and Lemma 2.1 that we have ME α = 1. Under these conditions our rate with exponent (18) holds. The classifier in [1, (4.1) with p = ∞] achieves the rate given in (21) in expectation, and we find that our rate is better by −γ/(1+γ) in the denominator. We remark at this point that [1, Theorems 4.1 and 4.3] prove that the classifier achieves this rate under a different assumption set, namely under (iv), (vi), and the assumption that P has NE q ∈ [0, ∞]; the classifier then achieves the rate given in (22), and for this set of assumptions the rate is optimal (in a minimax sense). Our assumptions, namely (i), (iii)_a, (iv), and (vi), imply the assumptions of [1], but this is not a contradiction, since the class of distributions satisfying our assumptions is a subset of the class satisfying the assumptions of [1].
Our improvement arises from assumption (iii)_a, since it forces critical noise (η ≈ 1/2) to be located close to the decision boundary, and, as we will see below, this assumption has an essential influence on the NE q. To be more precise, there are two sources of slow learning rates: the first one is the approximation error around the decision boundary, the second one is the existence of critical noise. Assumption (iii)_a forces both to occur in the same region, so that both effects cannot appear independently of each other, which in turn leads to better rates compared to [1]. In other words, with assumption (iii)_a we exclude distributions that have regions of critical noise far away from the decision boundary. Be aware that this does not mean that we consider only distributions without noisy regions bounded away from the decision boundary. In Fig. 1 we present two examples which make this situation clearer. Areas of noise that are, for example, located in the set X_1 in Fig. 1(a) resp. in the set X_{−1} in Fig. 1(b) are still allowed under (iii)_a, whereas areas of critical noise that are bounded away from the decision boundary are excluded.
To make this heuristic argument more precise, we take a look at Theorem 3.5 and its proof and show that if we omit assumption (iii)_a, and thus consider the assumptions taken in [1], we match the optimal rate in (22). To this end, we consider, in addition to (i), the above-mentioned assumptions of [1], that is, (iv), (vi), and the NE q. Since we do not assume (iii)_a, we cannot use this assumption to obtain a variance bound on the set F_r, which is bounded away from the decision boundary (cf. Lemma 3.4). Hence, the separation technique used in the proof would no longer be meaningful, but we are able to bound the excess risk on the whole set X as in Part 2 of the proof of Theorem 3.5, where we bounded the excess risk on the set N_r close to the decision boundary. This situation corresponds to the case that our set F_r is empty (letting r → ∞). Two points change in Part 2 of the proof. First, we have no variance bound as in (38), but we can apply [10, Theorem 8.24], a general variance bound, and obtain
\[
  \mathbb{E}_P\bigl(L \circ f - L \circ f^*_{L,P}\bigr)^2
  \le \tilde{V} \cdot \Bigl(\mathbb{E}_P\bigl(L \circ f - L \circ f^*_{L,P}\bigr)\Bigr)^{\theta},
\]
where θ := q/(q+1) and Ṽ is a positive constant. Second, we can bound the cardinality |F| in (40) as in Part 1 and, with some calculations as in Part 3, obtain a bound on the overall excess risk in which a constant c > 0 and τ ≥ 1 appear. Then, minimizing over s yields for our learning method a rate that matches the optimal rate (22) proven in [1, Theorems 4.1 and 4.3]. We further remark that, instead of the Hölder assumption (iv), the weaker assumption (iii)_b is sufficient for Theorem 3.5 and for the modified version above. If we now consider, in addition to the assumptions in [1], that (i) and (iii)_a hold, we find that our rate improves immediately, since we can directly apply Theorem 3.5 and thus obtain exactly the rate with exponent (18), which is better by −γ/(1+γ) in the denominator. In summary, taking assumption (iii)_a into consideration influences the noise exponent in a favourable way, since we exclude distributions that have critical noise far away from the decision boundary. This leads to better learning rates.
Finally, we remark that for our results, as well as for the results from [1, 2, 7] and [10], fewer assumptions are sufficient; in the comparisons above we have tried to formulate reasonable sets of common assumptions.

Proofs
Proof of Lemma 2.1: For a set T ⊂ X and δ > 0 we define, as in [9], the sets appearing in (23) and (24) below. Next, we show that (24) holds.

For this purpose, we remark that according to (4) we have the required inclusion. Having shown (24), we combine it with the corresponding bound on the Lebesgue measure λ^d. Finally, with (23) and the fact that X_0 = ∂_X X_1, we obtain the assertion.

Proof of Lemma 3.1:
i) Assume that for some A_j with j ∈ J_F^r we have both an x_1 ∈ A_j ∩ X_1 ≠ ∅ and an x_{−1} ∈ A_j ∩ X_{−1} ≠ ∅. Then, the connecting line segment from x_{−1} to x_1 is contained in A_j, since A_j is convex, and we have ‖x_{−1} − x_1‖_∞ ≤ s. Moreover, since Δ_η(x) ≥ r for all x ∈ F_r, every point x of this segment satisfies x ∈ X_1 ∪ X_{−1}. Next, pick an m > 1 and divide the segment into m pieces of equal length. Since one endpoint lies in X_1 and the other endpoint lies in X_{−1}, there exist consecutive division points x_i ∈ X_1 and x_{i+1} ∈ X_{−1}, and we find that
\[
  \Delta_\eta(x_i) \le \| x_i - x_{i+1} \|_\infty \le \frac{s}{m} \le \frac{2r}{m}.
\]
On the other hand, Δ_η(x_i) ≥ r, such that r ≤ 2r/m, which is not true for m ≥ 3. Hence, we cannot have A_j ∩ X_1 ≠ ∅ and A_j ∩ X_{−1} ≠ ∅ at the same time.

ii) We define the set of indices J_C^r := { j ∈ J : there exists an x ∈ A_j ∩ X with Δ_η(x) < r } and the set C_r := ⋃_{j ∈ J_C^r} A_j ∩ X. Since X ⊂ F_r ∪ C_r, it suffices to show that C_r ⊂ N_r. To show the latter, we fix an x ∈ C_r. If x ∈ X_0 we immediately have Δ_η(x) = 0 < 3r; hence we assume w.l.o.g. that x ∈ X_1. Then, there exists a j ∈ J_C^r such that x ∈ A_j. Furthermore, there exists an x* ∈ A_j with Δ_η(x*) < r, and we find
\[
  \Delta_\eta(x) \le \Delta_\eta(x^*) + \| x - x^* \|_\infty \le r + s,
\]
where ‖·‖_∞ is the supremum norm on R^d. Since s ≤ 2r, it follows that Δ_η(x) ≤ 3r and therefore x ∈ N_r.

Proof of Lemma 3.2:
Under the assumptions of Lemma 3.1 ii) we find that X ⊂ N_r ∪ F_r. Since the excess risk is non-negative, we then have
\[
  \mathcal{R}_{L,P}(h_{D,s}) - \mathcal{R}^*_{L,P}
  \le
  \bigl(\mathcal{R}_{L_{N_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{N_r},P}\bigr)
  +
  \bigl(\mathcal{R}_{L_{F_r},P}(h_{D,s}) - \mathcal{R}^*_{L_{F_r},P}\bigr).
\]

Proof of Lemma 3.3: For f ∈ F we have
\[
  \mathcal{R}_{L_{T_J},D}(f)
  = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{T_J}(x_i)\, L\bigl(y_i, f(x_i)\bigr)
  = \frac{1}{n} \sum_{j \in J} \sum_{i:\, x_i \in A_j} L\bigl(y_i, f(x_i)\bigr).
\]
Next, we take a closer look at the risk on a single cell A_j for j ∈ J. That is,
\[
  \frac{1}{n} \sum_{i:\, x_i \in A_j} L\bigl(y_i, f(x_i)\bigr)
  = \frac{1}{n}\, \bigl|\{\, i : x_i \in A_j,\; y_i \neq c_j \,\}\bigr|,
\]
where c_j ∈ {−1, 1} is the label of the cell A_j. The risk on a cell is the smaller the less often we have y_i ≠ c_j, such that the best classifier on a cell is the one which decides by majority. This is true for the histogram rule by definition. Since the risk is zero on every A_j with j ∉ J, the histogram rule minimizes the risk with respect to L_{T_J}.

Proof of Lemma 3.4:
For x ∈ F_r we have Δ_η(x) ≥ r, and thus we find with our lower-control assumption (5) that
\[
  r^{\gamma} \le \Delta_\eta(x)^{\gamma} \le c_{\mathrm{LC}}\, |2\eta(x) - 1|,
\]
and therefore |2η(x) − 1| ≥ c_LC^{−1} r^{γ} for P_X-almost all x ∈ F_r. By using the indicator 1_{F_r} together with the symmetric difference △, defined by C △ D := (C \ D) ∪ (D \ C) for sets C, D ⊂ X, and by using Lemma A.1, we obtain the variance bound asserted in Lemma 3.4.

Proof of Theorem 3.5:
We define the sets of cubes N_r and F_r as in (6) and (7) for the choice of r given in (25). With (25) we find that s ≤ r; to see the latter, we rewrite the exponent in (25) in terms of (1 − θ) and conclude by replacing θ according to (26). The rest of the proof is structured in three parts: we establish error bounds on N_r and F_r in the first two parts and combine the results in the third and last part. In the following we write N := N_r and F := F_r and keep in mind that these sets depend on the parameter r. Furthermore, we write h_D := h_{D,s}.

Part 1: In the first part we establish an oracle inequality for R_{L_F,P}(h_D) − R*_{L_F,P}. First note that the loss L_F is bounded by 1, so that L_F ∘ f ≤ 1 for all f ∈ F. Furthermore, with Lemma 3.4 we obtain a variance bound on F whose constant involves c_1 := max{c_LC, 2^γ}. We observe that r^γ ≤ c_1, since assumption (10), together with a rewriting of its exponent, yields r ≤ 2 and therefore r^γ ≤ 2^γ ≤ c_1. As we conclude from Lemma 3.3 that h_D is an empirical risk minimizer over F for the loss L_F, we are able to use [10, Theorem 7.2], an improved oracle inequality for ERM. We obtain for all fixed τ ≥ 1 and n ≥ 1 an oracle inequality bounding R_{L_F,P}(h_D) − R*_{L_F,P} in terms of the approximation error R*_{L_F,P,F} − R*_{L_F,P}, the variance constant, the cardinality |F|, τ, and n. Next, we refine the right-hand side of this oracle inequality. Obviously we have |F| ≤ 2^{|J|}, and we bound the cardinality |J| by a volume comparison argument: since X ⊂ [0, 1]^d and the cells have side length s, the number of cells intersecting X is at most of order s^{−d}. Thus, the refined oracle inequality on F holds with probability P^n ≥ 1 − e^{−τ}.
Finally, we have to bound the approximation error R*_{L_F,P,F} − R*_{L_F,P}. We find with h_{P,s} ∈ F and Lemma A.1 that this approximation error vanishes, since P_X((X_1 △ {h_{P,s} ≥ 0}) ∩ A_j) = 0 for each j ∈ J_F^r. To see the latter, we first remark that the latter set contains those x ∈ A_j for which either h_{P,s}(x) ≥ 0 and η(x) ≤ 1/2, or h_{P,s}(x) < 0 and η(x) > 1/2. Since we have A_j ⊂ X_{−1} ∪ X_1, we can ignore the case η(x) = 1/2. Furthermore, we know by Lemma 3.1 i) that either A_j ∩ X_{−1} = ∅ or A_j ∩ X_1 = ∅. Let us first consider the case A_j ∩ X_{−1} = ∅ and thus A_j ⊂ X_1. According to the definition of the histogram rule, cf. (1), we find for all x ∈ A_j that h_{P,s}(x) = 1, since the infinite-sample decision function f_{P,s} is non-negative on A_j. Hence, we have η(x) ≥ 1/2 and h_{P,s}(x) = 1 for all x ∈ A_j. Analogously, we can show for cells with A_j ∩ X_1 = ∅ and j ∈ J_F^r that η(x) ≤ 1/2 and h_{P,s}(x) = −1 for all x ∈ A_j. Hence, P_X((X_1 △ {h_{P,s} ≥ 0}) ∩ A_j) = 0 for all j ∈ J_F^r, and the approximation error vanishes on the set F.
Altogether, for the oracle inequality on F we obtain with (29) and (30) the bound (31), which holds with probability P^n ≥ 1 − e^{−τ}.

Part 2: In the second part we establish an oracle inequality for R_{L_N,P}(h_D) − R*_{L_N,P}, again by using [10, Theorem 7.2]. Analogously to Part 1 we define the quantities in (32) for all t > 0. We turn our attention to the minimum and note that by the definition of N we have (33). For x ∈ X with |2η(x) − 1| < t, the definition of the lower control yields (34). Then, we find by (33), (34), and the definition of the margin exponent the estimate (35). Combining (35) with (32) we obtain (36). Minimizing the right-hand side of (36) yields (37) with the constant
\[
  c_2 := \frac{\alpha+\gamma}{\gamma}\, c_{\mathrm{ME}}^{\frac{\alpha\gamma}{\alpha+\gamma}} \Bigl(\frac{\gamma\, c_{\mathrm{LC}}}{\alpha}\Bigr)^{\frac{\alpha}{\alpha+\gamma}},
\]
such that together with (26) we obtain the variance bound (38) with prefactor V. Note that the definition of V yields V^{1/(2−θ)} ≥ 1. Since h_D is an ERM over F for the loss L_N due to Lemma 3.3, by using [10, Theorem 7.2] we obtain for fixed τ ≥ 1 and n ≥ 1 that the oracle inequality (39) holds with probability P^n ≥ 1 − e^{−τ}. In order to refine the right-hand side in (39), we establish a bound on the cardinality |F| = 2^{|J_N^r|} and on the approximation error. To bound the mentioned cardinality we use the fact that N is contained in a tube around the decision boundary, that is, ⋃_{j∈J_N^r} A_j ⊂ {Δ_η(x) ≤ 3r}, see (6). We remark that 3r ≤ δ* holds, where δ* is the constant from Lemma 2.1, because of assumption (10). Then, with Lemma 2.1 we find a bound on the Lebesgue measure of this tube, and a volume comparison yields the bound (40) on |J_N^r|, where c_3 := 12 H^{d−1}({η = 1/2}). By r ≥ s ≥ s^d we hence conclude the refined cardinality bound with the constant c_4 := 2 max{12 H^{d−1}({η = 1/2}), 1}. Thus, (39) changes to (41). Finally, we have to bound the approximation error R*_{L_N,P,F} − R*_{L_N,P} in (41). For f_0 = h_{P,s} we have with Lemma A.1 a representation of this approximation error as a sum over the cells with indices in J_N^r. We split J_N^r into the indices of cells that do not intersect the decision boundary and those that do, denoted by J_{N_1}^r and J_{N_2}^r, respectively. We notice that, as in the calculation of the approximation error in Part 1, the first sum vanishes, since P_X((X_1 △ {h_{P,s} ≥ 0}) ∩ A_j) = 0 for all j ∈ J_{N_1}^r. Moreover, we remark that J_{N_2}^r only contains cells of width s that intersect the decision boundary. Hence, by using the margin-noise assumption we find (42). Altogether, for the oracle inequality on N we find with (41) that (43) holds with probability P^n ≥ 1 − e^{−τ}.
Part 3: In the last part we combine the results obtained in Part 1, the oracle inequality on F, and in Part 2, the oracle inequality on N. That means, with the separation in (27), we obtain with (31) and (43) the oracle inequality on X. Next, for P_X-almost all x ∈ A we have
\[
  \mathbf{1}_{(-\infty,0]}\bigl((2\eta(x) - 1)\,\mathrm{sign}\, f(x)\bigr) = 1
  \iff
  (2\eta(x) - 1)\,\mathrm{sign}\, f(x) \le 0.
\]