Penalized empirical risk minimization over Besov spaces

Kernel methods are closely related to the notion of reproducing kernel Hilbert space (RKHS). A kernel machine is based on the minimization of an empirical cost plus a stabilizer (usually the norm in the RKHS). In this paper we propose to use Besov spaces as alternative hypothesis spaces. We study the statistical performance of a penalized empirical risk minimization for classification where the stabilizer is a Besov norm. More precisely, we state fast rates of convergence to the Bayes rule. These rates are adaptive with respect to the regularity of the Bayes rule.


Classification framework
We consider the binary classification setting. Let (X, Y) be a random pair with unknown probability distribution P over X × {−1, +1}. X ∈ X is called the input variable. It is a feature vector, whereas Y ∈ {−1, 1} is the corresponding class or label. The goal of classification is to predict the class Y when only X is observed. In other words, a classification algorithm builds a decision rule from X to {−1, 1}. A classifier is a function f : X → R where the sign of f(x) determines the class of an input x. The performance of a classifier is measured by the generalization error, given by: R(f) := P(sign(f(X)) ≠ Y).
If the joint distribution P were known, the best classifier would be

$$f^*(x) = \mathrm{sign}\big(2\eta(x) - 1\big), \qquad (1)$$

where η(x) := P(Y = 1 | X = x). Classifier (1) is called the Bayes rule. It is easy to see that it minimizes the generalization error. Unfortunately, in practice η is unknown and f^* is therefore not available. A natural way to overcome this difficulty is to build an empirical classifier from training data. Suppose we have at our disposal a training set D_n = {(X_i, Y_i), i = 1, …, n} made of i.i.d. realizations of the random pair (X, Y) with law P. Classification can then be seen as a standard estimation problem where we have to estimate f^* from i.i.d. observations. The efficiency of an empirical classifier f̂_n is measured via its excess risk

$$R(\hat f_n) - R(f^*), \qquad (2)$$

where R(f̂_n) := P(sign(f̂_n(X)) ≠ Y | D_n). Here we are interested in consistent classifiers f̂_n, i.e. such that (2) tends to zero as n → ∞. Finally, a classifier f̂_n learns with rate (ψ_n)_{n∈N^*} if there exists an absolute constant C > 0 such that for every integer n,

$$\mathbb{E}\, R(\hat f_n) - R(f^*) \le C\,\psi_n, \qquad (3)$$

where E is the expectation with respect to the training set. Without any assumption on the joint distribution P, [11] gives arbitrarily slow rates. However, several authors have proposed rates under restrictions on the class of distributions P. The pioneering works of Vapnik [30,31] investigate the statistical performance of Empirical Risk Minimization (ERM). The idea is very simple: we look at the minimizer of the empirical risk

$$\hat f_{ERM} \in \arg\min_{f \in \mathcal{F}}\ \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{\mathrm{sign}(f(X_i)) \ne Y_i\}. \qquad (4)$$

If the class of possible Bayes rules has finite VC dimension, ERM reaches the parametric rate n^{-1/2} in (3). Moreover, if P is noise-free (i.e. R(f^*) = 0), the rate becomes n^{-1}. This is a fast rate. More recently, [29] and [19] describe intermediate situations using margin assumptions. These assumptions add a control on the behaviour of the conditional probability function η at the level 1/2.
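To fix ideas, the following small numerical sketch illustrates ERM. It is an illustration only, not the paper's procedure: the data-generating model (a conditional probability η jumping at 0) and the finite class of threshold classifiers are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d data: eta(x) = P(Y = 1 | X = x) jumps at 0, so the Bayes rule is sign(x).
n = 500
X = rng.uniform(-1, 1, n)
Y = np.where(rng.uniform(size=n) < np.where(X > 0, 0.9, 0.1), 1, -1)

# Finite class of threshold classifiers f_t(x) = sign(x - t); ERM selects the
# threshold minimizing the empirical 0-1 risk (1/n) sum 1{sign(f(X_i)) != Y_i}.
thresholds = np.linspace(-1, 1, 201)
emp_risk = [np.mean(np.sign(X - t) != Y) for t in thresholds]
t_hat = thresholds[int(np.argmin(emp_risk))]
print(t_hat)
```

With such a small class, the empirical minimizer concentrates around the Bayes threshold, in line with the parametric behaviour discussed above.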
Under this condition, they get minimax fast rates of convergence between n^{-1/2} and n^{-1} for ERM estimators in classification. At the present time, there exists a vast literature about the fast rates phenomenon. Fast rates have been obtained for different procedures such as Boosting ([7]), plug-in rules ([1]), SVM ([26]), or dyadic decision trees ([15]). In this work we state fast rates of convergence for a penalized empirical risk minimization using the hinge loss.

SVM regularization
Support Vector Machines (SVM) were first proposed by Boser, Guyon and Vapnik ([8]) for pattern recognition. Given a training set D_n, the SVM classifier (without offset) f̂_n solves the minimization

$$\hat f_n \in \arg\min_{f \in H_K} \Big\{ \frac{1}{n}\sum_{i=1}^n l(Y_i, f(X_i)) + \alpha_n \|f\|_K^2 \Big\}, \qquad (5)$$

where H_K denotes the reproducing kernel Hilbert space (RKHS) associated with the kernel K. The first term in (5) is an empirical cost using the hinge loss l(y, f(x)) := (1 − y f(x))_+. This term makes the solution fit the data. The second term regularizes the solution with ‖f‖²_K, the squared norm of f in the RKHS. The parameter α_n, called the smoothing parameter, has to be tuned to make the trade-off between these two terms. To derive the statistical performance of the method, we need to take a closer look at the pair (H_K, ‖·‖_K).
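The trade-off in (5) can be illustrated in the simplest case (linear kernel, no offset), where the RKHS norm is the Euclidean norm of the weight vector. The following sketch minimizes the empirical hinge cost plus a squared-norm penalty by subgradient descent; the toy data, step size and number of iterations are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable toy data in R^2 (labels given by a fixed hyperplane).
n = 200
X = rng.normal(size=(n, 2))
Y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Linear-kernel instance of (5): minimize the empirical hinge cost
# (1/n) sum (1 - Y_i <w, X_i>)_+  plus  alpha * ||w||^2  by subgradient descent.
alpha, w, lr = 0.01, np.zeros(2), 0.1
for _ in range(2000):
    active = Y * (X @ w) < 1                      # points with positive hinge loss
    grad = -(Y[active, None] * X[active]).sum(axis=0) / n + 2 * alpha * w
    w -= lr * grad

train_err = np.mean(np.sign(X @ w) != Y)
print(train_err, w)
```

A larger α shrinks w (more regularization), while a smaller α lets the empirical cost dominate, exactly the balance that α_n must strike in (5).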
For any Borel measure µ over X, consider the integral operator L_K : L²(µ) → L²(µ) defined by

$$L_K f(x) := \int_{\mathcal{X}} K(x, t)\, f(t)\, d\mu(t).$$

This operator is closely related to the kernel K. If X is compact and K is continuous (K is then called a Mercer kernel), L_K is compact. By the spectral theorem, there exists an orthonormal basis (φ_k)_{k≥1} of L²(µ) made of eigenfunctions of L_K, with corresponding eigenvalues (λ_k)_{k≥1}. This yields a representation of H_K in a sequence space:

$$H_K = \Big\{ f = \sum_{k\ge1} a_k \varphi_k \in L^2(\mu) \;:\; \sum_{k\ge1} \frac{a_k^2}{\lambda_k} < \infty \Big\}. \qquad (6)$$

In this case, the regularization in (5) can be written

$$\|f\|_K^2 = \sum_{k\ge1} \frac{a_k^2}{\lambda_k}, \qquad (7)$$

where (a_k)_{k≥1} gives the representation of f in the basis (φ_k)_{k≥1}. For instance, consider a convolution kernel K(x, y) = Φ(x − y). Then in (7), the coefficients (a_k)_{k≥1} are the Fourier coefficients of f, whereas the (λ_k)_{k≥1} are the Fourier coefficients of Φ. Representation (6) holds for Mercer kernels. One can generalize this fact to X = R^d in the following case. Suppose K(x, y) = Φ(x − y) is a convolution kernel. If Φ has some mild properties, the RKHS associated to K can be written

$$H_K = \Big\{ f \in L^2(\mathbb{R}^d) \;:\; \int_{\mathbb{R}^d} \frac{|\hat f(\omega)|^2}{\hat\Phi(\omega)}\, d\omega < \infty \Big\}.$$

With this choice of hypothesis space, [16] obtains learning rates of the form n^{−β}, where β is a function of:
• q, the margin parameter,
• r, the regularity parameter.
Parameter r describes the regularity of f^* in the Besov space B_{r2∞}(R^d). This assumption is strongly related to the use of Sobolev spaces as hypothesis spaces. In these functional spaces, smoothness is related to the asymptotic behaviour of the Fourier transform. This criterion depends on the variations of the function over the whole of R^d: it is a global criterion. For instance, differentiability can be expressed in terms of the Fourier transform. For a given f such that f̂ ∈ L¹(R), the elementary inequalities

$$|f(x+h) - f(x)| \le \frac{1}{2\pi} \int_{\mathbb{R}} \big| e^{i\omega h} - 1 \big|\, |\hat f(\omega)|\, d\omega \le \frac{|h|}{2\pi} \int_{\mathbb{R}} |\omega|\, |\hat f(\omega)|\, d\omega$$

show that f is Lipschitz (hence differentiable almost everywhere) as soon as ω ↦ ω f̂(ω) is integrable. Unfortunately, the Bayes rule is not continuous. By the previous remark, its Fourier transform decreases slowly (in O(|ω|^{-1})). From this point of view, Sobolev spaces are not really adapted to the shape of f^*. As a result, f^* ∈ B_{r2∞}(R^d) holds only for small values of r (namely r < d/2). That is why fast rates are not reached in [16].
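The slow Fourier decay of discontinuous rules can be checked numerically. The sketch below (our own illustration) compares the decay of the Fourier coefficients of a ±1-valued square wave, a caricature of a Bayes rule, with those of a smooth function.

```python
import numpy as np

# Fourier coefficients of a discontinuous +/-1-valued function decay like
# O(1/|k|), while those of a smooth function decay much faster.
N = 4096
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
square = np.sign(np.sin(x))        # discontinuous, +/-1 valued (like a Bayes rule)
smooth = np.exp(np.cos(x))         # analytic, rapidly decreasing coefficients

c_sq = np.abs(np.fft.fft(square)) / N
c_sm = np.abs(np.fft.fft(smooth)) / N

k = 101                            # for the square wave, |c_k| = 2/(pi*k) at odd k
print(c_sq[k] * k)                 # stays of order 2/pi ~ 0.64: decay is O(1/k)
print(c_sm[k])                     # negligible: fast decay
```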
An alternative is to take into account the regularity of f * : it is piecewise constant with local discontinuities. This can be done using a multiresolution analysis and considering Besov spaces as hypothesis spaces.

Besov regularization
It seems interesting to consider minimization (5) with more general hypothesis spaces. We propose to use Besov spaces and study the minimization procedure

$$\hat f_n \in \arg\min_{f \in B_{spq}(\mathbb{R}^d)} \Big\{ \frac{1}{n}\sum_{i=1}^n l(Y_i, f(X_i)) + \alpha_n\, \phi\big(\|f\|_{spq}\big) \Big\}, \qquad (10)$$

where ‖·‖_{spq} denotes the norm in B_{spq}(R^d) and φ is a non-decreasing penalty function (such as φ(x) = x or φ(x) = 2x², see Section 3). We thus replace H_K by Besov spaces.
The advantage of Besov spaces compared to Sobolev spaces is that they give a more general description of the smoothness properties of functions. An explicit description of B_{spq}(R^d) and ‖·‖_{spq} is given in Section 2. There exist several motivations to introduce Besov spaces in (5):
• For p = q = 2, B_{s22}(R^d) is a Sobolev space of order s. In this case, (10) corresponds to the standard SVM using Sobolev spaces as RKHS (see [16]). Then (10) generalizes the Sobolev case.
• The use of the hinge loss l(y, f(x)) = (1 − y f(x))_+ in (10) is related to the SVM algorithm. The statistical consequences of minimizing such a loss are well treated in [4]. The principal advantage when one expects rates of convergence is the control of the excess risk (2) by the excess risk associated with the hinge loss. This is what yields fast rates of convergence in this paper. Another characteristic of the hinge loss concerns the regularity of the minimizer of the associated risk. We have

$$\arg\min_{f} \mathbb{E}\, l(Y, f(X)) = \mathrm{sign}(2\eta - 1) = f^*,$$

which is a non-continuous function with values +1 or −1. For this reason, fast rates cannot be reached in [16] for SVM with Sobolev spaces. B_{spq}(R^d) with p < 2 gives more flexibility. It contains for instance piecewise regular functions. In this case it is easier to approximate the Bayes classifier, which leads to better rates of convergence.
• There is a large theory around Besov spaces, including a characterization in terms of wavelet coefficients. It gives a representation of the norm in (10) in a sequence space as follows:

$$\|f\|_{spq} \approx \Big(\sum_{k} |\alpha_k|^p\Big)^{1/p} + \bigg(\sum_{j\ge0} \Big( 2^{\,j\left(s + d\left(\frac12 - \frac1p\right)\right)} \Big(\sum_{k} |\beta_{jk}|^p\Big)^{1/p} \Big)^{q} \bigg)^{1/q}, \qquad (11)$$

where (α_k) and (β_{jk}) are the wavelet coefficients of f.
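A sequence-space norm of the type (11) can be illustrated with the Haar basis. The sketch below is schematic: it uses the one-dimensional exponent s + 1/2 − 1/p, ignores constants, and the function names are ours.

```python
import numpy as np

def haar_coeffs(f):
    """Orthonormal discrete Haar transform: scaling coefficient plus detail
    coefficients per level, ordered coarse -> fine."""
    c, details = f.astype(float), []
    while len(c) > 1:
        even, odd = c[0::2], c[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail (wavelet) coefficients
        c = (even + odd) / np.sqrt(2)               # approximation coefficients
    return c[0], details[::-1]

def besov_seq_norm(f, s, p, q):
    """Sequence-space norm of the type (11), dimension 1, constants ignored."""
    alpha, details = haar_coeffs(f)
    terms = [(2 ** (j * (s + 0.5 - 1 / p)) *
              np.sum(np.abs(b) ** p) ** (1 / p)) ** q
             for j, b in enumerate(details)]
    return abs(alpha) + sum(terms) ** (1 / q)

x = np.linspace(0, 1, 1024, endpoint=False)
step = np.sign(x - 0.5)                 # piecewise constant, one discontinuity
smooth = np.sin(2 * np.pi * x)

# For p = 1 the detail coefficients of the step function are sparse: only the
# wavelets straddling the jump contribute.
n_step = besov_seq_norm(step, s=1.0, p=1, q=1)
n_smooth = besov_seq_norm(smooth, s=1.0, p=1, q=1)
print(n_step, n_smooth)
```

This sparsity of wavelet coefficients of piecewise constant functions is precisely what makes p < 2 attractive in (11).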
This representation can be compared to the sequence space representation (7) of an RKHS norm. In the standard SVM case, the regularization can be expressed with respect to the spectrum of L_K. This makes it possible to control the complexity of the RKHS: [20] and [6] control the local Rademacher average of balls in the RKHS in this sequence space. The control depends on the asymptotic behaviour of the sequence (λ_k)_{k≥1} and governs the statistical performance of the method. In this paper we point out that similar facts can be derived for Besov spaces using a wavelet analysis. Minimization (10) is strongly related to the SVM minimization. However, as a kernel method, SVM uses an RKHS norm as regularization, which allows one to define SVM as a large margin hyperplane in some Hilbert feature space. Here the hypothesis space is a Besov space. Besov spaces are not Hilbertian in general, and hence cannot be represented as RKHS. This penalized empirical risk minimization is therefore not an SVM minimization. An interesting open problem is nevertheless to express (10) as a kernel method. This problem is connected to recent developments in the theory of reproducing kernels. In this direction, a short discussion is proposed at the end of this work.
The remainder of this paper is organized as follows. In Section 2, we introduce the wavelet theory and characterize Besov spaces in terms of wavelet coefficients. This reduces the control of the Rademacher average to a problem in a sequence space, leading to very natural proofs. An oracle inequality is deduced in Section 3, as a direct application of a general model selection theorem due to [6]. We finally control the approximation power of Besov balls to state fast learning rates for the procedure (10). The solution is adaptive with respect to the regularity of the Bayes rule. We conclude in Section 4 with a discussion. Section 5 is dedicated to the proofs of the main results.

Wavelet framework
For the mathematical aspects of wavelets, we refer for example to [21], while [17] proposes comprehensive expositions for signal processing. Wavelet applications in statistical settings are given for instance in [13]. For a complete study of minimax rates of convergence for density estimation by wavelet thresholding, we refer to [12].
Here we recall some definitions and notation for wavelets and Besov spaces. Going back to statistical learning theory, we then propose a control of the local Rademacher average of Besov balls.

Besov spaces and wavelets
For the one-dimensional case, we refer for instance to [12]. To introduce the d-dimensional case, we begin with an example in dimension 2 using the tensor product. Let (V¹_j)_{j∈Z} be a multiresolution analysis (MRA for short in the sequel) of L²(R) generated by φ.
Then we have for j = 0:

$$V_0^2 = V_0^1 \otimes V_0^1.$$

More generally, for all j ∈ Z:

$$V_j^2 = V_j^1 \otimes V_j^1, \qquad W_j^2 = (V_j^1 \otimes W_j^1) \oplus (W_j^1 \otimes V_j^1) \oplus (W_j^1 \otimes W_j^1),$$

where W¹_j (resp. W²_j) denotes the orthogonal complement of V¹_j in V¹_{j+1} (resp. of V²_j in V²_{j+1}). Then the two-dimensional mother wavelets are 2^j φ(2^j x − k)ψ(2^j y − l), 2^j ψ(2^j x − k)φ(2^j y − l) and 2^j ψ(2^j x − k)ψ(2^j y − l) for (k, l) ∈ Z². This means that there are three wavelets in the two-dimensional case. This fact is illustrated in [21] from a geometrical point of view. We can generalize this result to higher dimensions with the following lemma.
As a result, the system given by

$$\big\{ 2^{jd/2}\, \psi^e(2^j x - k) \;:\; j \in \mathbb{Z},\ k \in \mathbb{Z}^d,\ e \in \{0,1\}^d \setminus \{0\} \big\}$$

is an orthonormal basis of L²(R^d).
This lemma generalizes the one-dimensional case. From an r-regular, rapidly decreasing scaling function generating an MRA, we can construct 2^d − 1 mother wavelets with the same regularity. The existence of such a wavelet basis is proved in [21].
As a consequence, any f ∈ L²(R^d) can be decomposed as

$$f = \sum_{k\in\mathbb{Z}^d} \alpha_k\, \phi_{0k} + \sum_{j\ge0} \sum_{k\in\mathbb{Z}^d} \sum_{e\in E} \beta^e_{jk}\, \psi^e_{jk}, \qquad (12)$$

where α_k = ⟨f, φ_{0k}⟩ and β^e_{jk} = ⟨f, ψ^e_{jk}⟩. In the case of the tensor product, we have for all k ∈ Z^d and x ∈ R^d:

$$\psi^e_{jk}(x) = 2^{jd/2} \prod_{i=1}^d \psi^{e_i}(2^j x_i - k_i),$$

for e ∈ {0, 1}^d \ 0_{R^d} =: E, where we write for simplicity ψ⁰ = φ and ψ¹ = ψ.
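The tensor-product construction is easy to check numerically in dimension 2: starting from the discrete Haar pair (φ, ψ), the 2² − 1 = 3 mother wavelets together with the scaling function form an orthonormal system. This is an illustration with the simplest (Haar) filters, not part of the paper's construction.

```python
import numpy as np

# Discrete Haar scaling (phi) and wavelet (psi) vectors on two points.
phi = np.array([1.0, 1.0]) / np.sqrt(2)
psi = np.array([1.0, -1.0]) / np.sqrt(2)

# In dimension 2 there are 2^2 - 1 = 3 mother wavelets (phi x psi, psi x phi,
# psi x psi) in addition to the scaling function phi x phi.
atoms = [np.outer(a, b).ravel() for a in (phi, psi) for b in (phi, psi)]

G = np.array([[u @ v for v in atoms] for u in atoms])   # Gram matrix
print(np.allclose(G, np.eye(4)))                        # True: orthonormal system
```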
Here we are interested in compactly supported wavelet bases. [10] has shown that in dimension d = 1, there exists an orthonormal basis of compactly supported wavelets satisfying conditions of Lemma 1, for any integer r ≥ 1 (for r = 0, it corresponds to the Haar basis). Using the tensor product, this result gives a compactly supported d-dimensional wavelet basis of L 2 (R d ) (see [21] for details).

Besov spaces
Besov spaces were introduced by O.V. Besov in the 1960s. Here we propose to characterize Besov spaces B_{spq}(R^d) in terms of wavelet coefficients.
Recall that P_j denotes the orthogonal projection onto V_j, and set D_j(f) := P_{j+1}(f) − P_j(f). A function f ∈ L^p(R^d) belongs to B_{spq}(R^d) if and only if there exists a positive sequence (ǫ_j)_{j∈N} ∈ l^q such that:

$$\|D_j(f)\|_p = \epsilon_j\, 2^{-js}, \qquad \forall j \in \mathbb{N}. \qquad (13)$$

To express the L^p-norm of D_j(f) in terms of the MRA of L²(R^d), we need the following lemma.
Lemma 2. Let g_1, …, g_L be compactly supported functions on R^d satisfying assumptions 1. and 2. of Lemma 1. Then there exist 0 < c_1 < c_2 such that for all 1 ≤ p ≤ ∞ and every sequence (λ_{lk}):

$$c_1 \Big(\sum_{l=1}^L \sum_{k\in\mathbb{Z}^d} |\lambda_{lk}|^p\Big)^{1/p} \le \Big\| \sum_{l=1}^L \sum_{k\in\mathbb{Z}^d} \lambda_{lk}\, g_l(\cdot - k) \Big\|_p \le c_2 \Big(\sum_{l=1}^L \sum_{k\in\mathbb{Z}^d} |\lambda_{lk}|^p\Big)^{1/p}.$$

Lemma 2 is a direct consequence of [21,Lemma 8], using the d-dimensional change of variables formula.
Gathering this with (13), we arrive at the following characterization of Besov spaces:

$$\|f\|_{spq} \approx \Big(\sum_{k\in\mathbb{Z}^d} |\alpha_k|^p\Big)^{1/p} + \bigg(\sum_{j\ge0} \Big( 2^{\,j\left(s + d\left(\frac12-\frac1p\right)\right)} \Big(\sum_{k,e} |\beta^e_{jk}|^p\Big)^{1/p} \Big)^{q}\bigg)^{1/q}. \qquad (14)$$

The first term in (14) corresponds to the L^p-norm of P_0(f), whereas the second term corresponds to the l^q-norm of the sequence (2^{js}‖D_j(f)‖_p)_{j≥0}.
This characterization of Besov spaces will be useful to control the complexity of B spq (R d ) in this sequence space. For other characterizations, we refer to [22] or [28], including the most usual definition in terms of modulus of continuity.

Local complexity of Besov balls
First error bounds for empirical risk minimization go back to Vapnik (see [31]). Considering an ERM estimator f̂_ERM over a collection of classifiers F, [31] states that:

$$R(\hat f_{ERM}) - \inf_{f\in\mathcal F} R(f) \le 2 \sup_{f \in \mathcal{F}} \big| R_n(f) - R(f) \big|, \qquad (15)$$

where R_n denotes the empirical risk. This leads to the study of the supremum of an empirical process. With concentration inequalities, this random supremum can be controlled by its expectation, up to residual terms. The behaviour of the maximum of the empirical process gives rise to a specific notion of size for the class F, called global size. This measure is related to the worst deviation of the empirical error from the true error, and the resulting bounds might be loose.
Recently, sharp bounds have been established using different localized versions of (15). It is now common to use localized averages. Considering the penalized empirical minimization (10) with the hinge loss l(y, f(x)) = (1 − y f(x))_+, we are interested in

$$\mathbb{E} \sup_{f \in B(R):\ \mathbb{E} f^2 \le r} \Big| \frac{1}{n} \sum_{i=1}^n \big( l(Y_i, f(X_i)) - \mathbb{E}\, l(Y, f(X)) \big) \Big|, \qquad (16)$$

where in the sequel B(R) := {f ∈ B_{spq}(R^d) : ‖f‖_{spq} ≤ R}. The parameter r allows us to identify locally the scale of richness of the function class. It measures the magnitude of the error deviation of functions with small variance, which are the ones that are likely to be picked by the learning algorithm. From the Lipschitz property of the hinge loss, gathering with the well-known symmetrization device (originally in [30]), it turns out that there is a tight connection between such a quantity and the expectation

$$\mathbb{E} \sup_{f \in B(R):\ \mathbb{E} f^2 \le r} \Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \Big|, \qquad (17)$$

where the ε_i are i.i.d. signs, independent of the data, with P(ε_i = +1) = P(ε_i = −1) = 1/2. The ε_i are called Rademacher variables, and (17) is called the local Rademacher average of B(R). The use of Rademacher averages in classification goes back to [14] (see also [5,3,2]). [20] has proved that the local Rademacher average of a kernel class is determined by the spectrum of its integral operator (see also [6]). Under assumptions on the law of X, we propose the same type of result for Besov classes. The following theorem is the key ingredient to deduce the statistical performance of minimization (10). It allows us to control the local average (16) and to obtain an oracle inequality (Proposition 1).

Theorem 1. Suppose P_X admits a density ρ with compact support P such that:

$$0 < a \le \rho(x) \le A < \infty, \qquad \forall x \in \mathcal{P}.$$

Then if s > d/p and 1 ≤ p ≤ 2, there exists a constant c depending on a and A such that:

A detailed proof is presented in Section 5. As mentioned in the introduction, we use the wavelet theory presented above. More precisely, Lemma 3 allows us to control (17), exactly as in the RKHS case. In the sequel we consider Besov spaces with such a restriction (s > d/p, 1 ≤ p ≤ 2).
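The local Rademacher average (17) can be approximated by Monte Carlo for simple classes. The sketch below uses a small finite class of indicator functions as a stand-in for the Besov ball B(R); the functions, radii and sample sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small finite class of indicator functions on [0, 1], standing in for B(R).
funcs = [lambda x, a=a, b=b: ((a <= x) & (x < b)).astype(float)
         for a, b in [(0, .25), (.25, .5), (0, .5), (.5, 1), (0, 1)]]

def local_rademacher(r, n=400, reps=200):
    """Monte Carlo estimate of (17): the sup is restricted to Ef^2 <= r."""
    grid = rng.uniform(size=10_000)
    active = [f for f in funcs if np.mean(f(grid) ** 2) <= r]
    if not active:
        return 0.0
    total = 0.0
    for _ in range(reps):
        X = rng.uniform(size=n)                     # X ~ U[0, 1]
        eps = rng.choice([-1.0, 1.0], size=n)       # Rademacher variables
        total += max(abs(np.mean(eps * f(X))) for f in active)
    return total / reps

# Smaller r restricts the sup to fewer, smaller-variance functions.
r_small, r_full = local_rademacher(0.3), local_rademacher(1.0)
print(r_small, r_full)
```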

Statistical performances
To state learning rates to the Bayes rule, we proceed in two steps. The first step is to state an oracle inequality of the form:

$$\mathbb{E}\, R(\hat f_n) - R(f^*) \le C \inf_{f} \big( R(f) - R(f^*) + \mathrm{pen}(f) \big) + \delta_n.$$

The statistical meaning of this inequality is rather transparent. It ensures that the classifier f̂_n has performance comparable with the best classifier, called the oracle (which minimizes the true risk), up to a residual term δ_n such that δ_n → 0 as n → ∞. The constant C has to be close to 1. It then remains to control the right-hand side of the oracle inequality. The main term of this bound is called the approximation function, defined in this case as

$$a(\alpha_n) := \inf_{f \in B_{spq}(\mathbb{R}^d)} \big( R_l(f, f^*) + \alpha_n \|f\|_{spq} \big),$$

where R_l(f, f^*) denotes the excess risk associated with the hinge loss. Following [16], we use the theory of interpolation spaces to control this function under an assumption on the Bayes rule f^*. Finally, to get rates of convergence such as (3), it remains to note that

$$R(f) - R(f^*) \le R_l(f, f^*)$$

for every f (see [4]).

Oracle inequality
To obtain good statistical properties, we need to restrict the class of considered distributions P. A standard way is to impose a margin hypothesis on the conditional probability function η. In this work we assume that there exist η_0, η_1 > 0 such that:

$$|2\eta(x) - 1| \ge \eta_0 \quad \text{and} \quad \eta_1 \le \eta(x) \le 1 - \eta_1, \qquad \forall x. \qquad (18)$$

This assumption is closely related to the margin assumption originally due to [29]. The first part ensures a jump of the probability η at the level 1/2. The second part is not natural: it rules out the noise-free case where η(x) ∈ {0, 1}. It appears for technical reasons discussed in Section 5 (see also [6]).
Proposition 1 (Oracle inequality). Let P be a joint distribution whose marginal on X satisfies the assumptions of Theorem 1. Suppose (18) holds for some η_0, η_1 > 0. Consider a non-decreasing function φ on R_+ such that φ(0) = 0 and φ(x) ≥ x for x ≥ 1/2. Given (X_i, Y_i), i = 1, …, n i.i.d. from P, we define

$$\hat g_n \in \arg\min_{g \in B_{spq}(\mathbb{R}^d)} \Big\{ \frac{1}{n}\sum_{i=1}^n l(Y_i, g(X_i)) + \alpha_n\, \phi\big(\|g\|_{spq}\big) \Big\}, \qquad (19)$$

where s > d/p and 1 ≤ p ≤ 2. If we choose α_n such that:

then the estimator ĝ_n satisfies:

where c, c_1, c_2, c_3 and c_4 are absolute constants and u = s + d(1/2 − 1/p).

Remark 3. The result holds for any φ with φ(0) = 0 and φ(x) ≥ x for x ≥ 1/2. From the model selection approach, the minimum required regularization is of order ‖f‖_{spq}. In the standard SVM, a regularization of order ‖f‖²_{spq} is used. Thus we only consider in Corollary 1 the two cases φ(x) = x and φ(x) = 2x². These two orders of regularization lead to different statistical performances.

Remark 4.
This inequality is independent of the approximation term. The choice of α n in (20) only depends on the hypothesis set we consider. A control of the approximation power of Besov spaces will give adaptive learning rates.

Rates of convergence
The last step is to control the approximation term in the oracle inequality of Proposition 1. The theory of interpolation spaces allows us to measure how well the models approximate the target function f^*. We finally get the following rates of convergence.
Corollary 1 (Rates of convergence). Let P satisfy the assumptions of Proposition 1. For any 1 ≤ p ≤ 2 and s > d/p, define the estimators f̂_n and ĝ_n as in (19), with φ(x) = x and φ(x) = 2x² respectively. Suppose that f^* ∈ B_{rp∞}(R^d) for some r > 0. Then there exist absolute constants C, C′ > 0 such that:

$$\mathbb{E}\, R(\hat f_n) - R(f^*) \le C\, n^{-\frac{r}{s}\cdot\frac{2u}{2u+d}} \qquad (21)$$

and

$$\mathbb{E}\, R(\hat g_n) - R(f^*) \le C'\, n^{-\frac{r}{2s-r}\cdot\frac{2u}{2u+d}}, \qquad (22)$$

where u = s + d(1/2 − 1/p) and where we choose α_n such that equality holds in (20).
Remark 5. We consider two special cases for the function φ of Proposition 1. The estimator f̂_n is the penalized empirical minimizer using the weakest regularization (linear with respect to the norm), whereas ĝ_n uses the standard SVM penalization (of order ‖f‖²). Coarsely, the rate of f̂_n outperforms that of ĝ_n, since s > r implies r/s > r/(2s − r). With this approach, a lighter regularization results in a better bound.

Remark 6. The construction of these estimators does not depend on the regularity of the Bayes rule. The smoothing parameter α_n is chosen independently of the parameter r appearing in the assumption f^* ∈ B_{rp∞}(R^d). As a result, the estimators f̂_n and ĝ_n are called adaptive: they adapt to the regularity of the Bayes rule.
Remark 7. [16] gives learning rates for SVM using Sobolev spaces. In particular, under a strong margin assumption, one obtains the rate n^{−2rs/(2rs+d(2s−r))}. We can compare this bound with (22) for p = q = 2, where our rate is n^{−\frac{r}{2s−r}\cdot\frac{2s}{2s+d}}. This rate is clearly slower than n^{−2rs/(2rs+d(2s−r))} since s > r. However, the two coincide as s → r.
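A quick numerical check of Remark 7 (in its notation, r < s) confirms both the ordering of the two exponents and their coincidence as s → r; the sample values of r, s, d are arbitrary.

```python
# Exponent comparison from Remark 7: Sobolev-SVM exponent 2rs/(2rs + d(2s-r))
# versus the exponent of (22) at p = q = 2, namely (r/(2s-r)) * 2s/(2s+d).
def besov_exp(r, s, d):
    return (r / (2 * s - r)) * (2 * s / (2 * s + d))

def sobolev_exp(r, s, d):
    return 2 * r * s / (2 * r * s + d * (2 * s - r))

r, d = 1.0, 2
for s in (1.001, 1.5, 3.0):
    print(s, besov_exp(r, s, d), sobolev_exp(r, s, d))
```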

Fast rates and optimality
Consider the one-dimensional case where X = R. Suppose f^* is such that:

$$f^* = \sum_{i=1}^{N} a_i\, \mathbf{1}_{[t_i, t_{i+1})}, \qquad a_i \in \{-1, +1\},\ N < \infty. \qquad (23)$$

This means that the Bayes rule changes sign only a finite number of times over the real line. Under this assumption, the SVM algorithm using Sobolev spaces cannot reach fast rates (see [16]). In this paper, Besov spaces allow us to consider values p < 2. With (23), if 1 ≤ p = q < 2, we have using [17] that f^* belongs to B_{r11}(R) ⊂ B_{r1∞}(R) for r = 3/2. Substituting into (22), the rate becomes n^{−(6s−3)/(2s(4s−3))}, which is a fast rate for s small enough. This example illustrates the importance of considering Besov spaces with p < 2 as hypothesis spaces. For p < 2, these spaces contain piecewise regular functions with local discontinuities. This yields fast rates of convergence.
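One can check numerically for which s the exponent in this example exceeds 1/2, i.e. when the rate beats n^{-1/2}; the grid of values of s is ours.

```python
# Exponent of the rate n^{-(6s-3)/(2s(4s-3))} obtained for r = 3/2, p = q = 1,
# d = 1: the rate is "fast" when the exponent exceeds 1/2, which happens for
# s close to r = 3/2 and fails for large s.
def exponent(s):
    return (6 * s - 3) / (2 * s * (4 * s - 3))

for s in (1.6, 2.0, 3.0):
    print(s, exponent(s), exponent(s) > 0.5)
```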
An interesting question, though it is out of the scope of this paper, is the optimality of Corollary 1. To answer this question, one should link the regularity assumption on f^* in the d-dimensional case with more standard complexity assumptions for classification. For example, a more classical model for the possible Bayes rules is the Hölder boundary fragment assumption on the decision boundary ([29]). Using the characterization of Besov norms with wavelet coefficients, it would be interesting to link Corollary 1 to this framework. It may give a direction to deduce minimax statements in our framework, using for instance the lower bounds of [24].

Conclusion
We have studied a new procedure of penalized empirical risk minimization using Besov spaces. This method generalizes the SVM algorithm using Sobolev spaces as RKHS. The introduction of Besov spaces gives more flexibility to study the approximation power of the procedure. For the estimation part of the analysis, we adopt the model selection approach of [6]. We propose a control of the local Rademacher average of Besov balls, and hence obtain fast learning rates to the Bayes rule. Moreover, the construction of these estimators does not depend on the regularity of the Bayes rule: they are adaptive with respect to the regularity of f^*.
From a technical point of view, this paper generalizes the control of Rademacher averages to a non-Hilbertian functional space. It is well known that local Rademacher averages of RKHS balls can be controlled using the RKHS formalism. Here we use a wavelet analysis to get a similar result for Besov spaces: a compactly supported wavelet basis allows us to work in a sequence space.
This contribution can be compared with another introduction of wavelet theory in classification. [15] studies the statistical performance of the LASSO estimator, solving a penalized empirical minimization over the hypothesis space F_d made of piecewise constant classifiers on a regular dyadic grid of [0, 1]^d. Each classifier can then be decomposed into a fundamental system of indicators of dyadic sets of [0, 1]^d. This system is closely related to the tensor product of the Haar wavelet basis. As a consequence, similarities with the techniques used in the wavelet literature appear throughout the proofs. From this point of view, the present work can be compared to [15].
Unfortunately, from a practical point of view, the presence of Besov norms in our procedure leads to computational problems. Besov spaces are not Hilbert spaces. As a result, our method cannot be embedded into a kernel method and computed as the SVM algorithm is: the feature space is not an RKHS in this case.
Recently, several authors have investigated learning algorithms with non-Hilbertian hypothesis spaces. [9] underlines the main principles required of a hypothesis space in a learning problem. The hypothesis set must be composed of pointwise defined functions, and the evaluation functional δ_x : f ↦ f(x) must be continuous. By the embedding theorem, Besov spaces B_{spq}(R^d) with s > d/p have this property. In the RKHS case, it corresponds to the reproducing property, which gives a reproducing kernel lying in the RKHS. However, the Hilbertian structure is not necessary. To generalize the notion of RKHS to RKS (Reproducing Kernel Space), we need a bilinear form playing the role of the scalar product of an RKHS. This can be done with the duality map, considering a duality pair (H, H^*). [18] establishes an equivalence between particular dualities called evaluation subdualities and a set of weakly continuous mappings called reproducing kernels. [9] also provides an explicit construction of subdualities together with the associated reproducing kernel. It is a generalization of the construction of RKHS using Carleman operators. The construction is based on the duality map.
Finally, we know from [21] that (B_{spq}(R^d), B_{−sp′q′}(R^d)) are in duality through the duality map

$$\langle f, g \rangle = \sum_{k} \alpha_k(f)\,\alpha_k(g) + \sum_{j,k} \beta_{jk}(f)\,\beta_{jk}(g),$$

where B_{−sp′q′}(R^d) is the space of distributions such that (13) holds with −s < 0. As a result, it would be interesting in this direction to find a kernel generating Besov spaces as RKS. A last step would be to implement our procedure with such a kernel.

Proofs
This section is dedicated to the proofs of the main results of this paper. Throughout, c denotes a constant that may vary from line to line. For p, q > 0, we write p′, q′ for the conjugate exponents, with 1/p + 1/p′ = 1/q + 1/q′ = 1. Finally, with some abuse of notation, we write Ef for E_{P_X} f(X) and El(f) for E_P l(Y, f(X)).

Proof of Theorem 1
Since the marginal of X admits a bounded density ρ with compact support P, we have for any f:

$$\mathbb{E}\, f(X)^2 = \int_{\mathcal{P}} f^2 \rho \le A \int_{\mathcal{P}} f^2.$$

Moreover, X_1, …, X_n are i.i.d. with density ρ. Then:

We then have to bound the right-hand side of (24). Let us begin with the one-dimensional case, i.e. when the input domain X ⊂ R. From the wavelet decomposition, we can write f ∈ L²(R) as

$$f = \sum_{k\in\mathbb{Z}} \alpha_k\, \phi_{0k} + \sum_{j\ge0} \sum_{k\in\mathbb{Z}} \beta_{jk}\, \psi_{jk}. \qquad (25)$$

The description of Besov spaces using wavelets leads to the equivalent norm

$$\|f\|_{spq} \approx \Big(\sum_k |\alpha_k|^p\Big)^{1/p} + \bigg(\sum_{j\ge0} \Big( 2^{\,j\left(s+\frac12-\frac1p\right)} \Big(\sum_k |\beta_{jk}|^p\Big)^{1/p} \Big)^{q}\bigg)^{1/q}.$$

Moreover, from Lemma 2,

$$\Big\| \sum_k \beta_{jk}\, \psi_{jk} \Big\|_p \approx 2^{\,j\left(\frac12 - \frac1p\right)} \Big(\sum_k |\beta_{jk}|^p\Big)^{1/p},$$

where x ≈ y means that there exist c, C > 0 such that cy ≤ x ≤ Cy.
For f ∈ B_{spq}(R) with s > 1/p, the wavelet expansion (25) holds pointwise, since the evaluation functionals are continuous on these spaces (see Remark 2). Thus we obtain

$$\mathbb{E} \sup_{f\in B(R):\ \mathbb{E}f^2\le r} \Big|\frac1n\sum_{i=1}^n \epsilon_i f(X_i)\Big| = \mathbb{E} \sup_{(\alpha,\beta)\in\Gamma(R,r)} \Big|\frac1n\sum_{i=1}^n \epsilon_i f_{\alpha,\beta}(X_i)\Big|,$$

where f_{α,β} is defined in (25) and Γ(R, r) denotes the set of coefficient sequences (α, β) such that ‖f_{α,β}‖_{spq} ≤ R and E f_{α,β}(X)² ≤ r. Hence we get for any integer d′:

$$\mathbb{E} \sup_{(\alpha,\beta)\in\Gamma(R,r)} \Big|\frac1n\sum_{i=1}^n \epsilon_i f_{\alpha,\beta}(X_i)\Big| \le T_1 + T_2,$$

where T_1 gathers the levels j ≤ d′ and T_2 the levels j > d′. To prove the inequality, we bound these two terms separately.
We begin by applying the Hölder (twice) and Jensen inequalities to T_2:

The definition of Γ(R, r) leads to:

where 1/p + 1/p′ = 1/q + 1/q′ = 1. The next step is to control, for all j > d′, the series:

Lemma 4. Let Y_1, …, Y_n be i.i.d. with zero mean and variance σ². Then for all u ≥ 2, there exists c_u > 0 such that:

$$\mathbb{E}\Big|\sum_{i=1}^n Y_i\Big|^u \le c_u \Big( (n\sigma^2)^{u/2} + n\, \mathbb{E}|Y_1|^u \Big).$$

This concentration inequality is due to Rosenthal ([23]). Putting Y_i = ε_i ψ_jk(X_i), we have with Lemma 2, gathering with the conditions on the density ρ:

for an absolute constant c depending on A. As a result, applying Lemma 4 with u = p′ ≥ 2, we obtain:

Now it is worth noticing that, since ρ and the wavelet functions ψ_jk are compactly supported, the quantity:

is zero whenever k ∈ S^C(j) := {k ∈ Z : supp ψ_jk ∩ P = ∅}. We know from [21] that there exists a constant c > 0 such that #S(j) ≤ c 2^j. Then:

Gathering with (26), we obtain:

where the convergence of the geometric series comes from the condition s > 1/p. Moreover we have p ≤ 2. Then s − 1/2 ≥ s − 1/p and 1 − p′/2 ≤ 0. We obtain:

The last step is to control T_1. We put β_{−1,k} = α_{0k} and ψ_{−1k} = φ_{0k}. Thus we have, applying successively the Cauchy–Schwarz and Jensen inequalities,

Besides, Eε_iε_j = 0 for all i ≠ j. Then:

We have to control, for j ∈ {−1, …, d′}, the series Σ_k Eψ_jk(X_i)².
As above, since ρ and the mother wavelet ψ are compactly supported,

where #S(j) ≤ c 2^j. Finally we obtain:

where the last line comes from the definition of Γ(R, r).
Then there exists a constant c > 0 depending on a and A such that:

Optimizing with respect to d′, we obtain the desired upper bound in dimension 1. We now turn to the d-dimensional case. The principle of the proof follows the one-dimensional case. From (12) we have, for any f ∈ B_{spq}(R^d):

where the equality holds pointwise since s > d/p. Then we can write:

where now:

We proceed as in dimension 1. For any integer d′:

We apply the Hölder (twice) and Jensen inequalities to T_4 to get:

The next step is to control, for all j > d′, the series:

We have with Lemma 2:

since ψ is compactly supported. As a result, applying the Rosenthal inequality, we obtain:

Now it is worth noticing that, since ρ and the wavelets ψ_jkl are compactly supported, the quantity

is zero outside a set S_d(j) of indices, and there exists an absolute constant c > 0, depending only on d, such that #S_d(j) ≤ c 2^{dj}. As a result,

With the previous inequality, we hence have:

where the condition s > d/p ensures the convergence of the geometric series. Moreover we have p ≤ 2. Then, as above:

It remains to control T_3. For brevity, we put β_{−1k1} = α_k, ψ_{−1k1} = φ_k and, for any l > 1, β_{−1kl} = 0, ψ_{−1kl} = 0. Applying successively Cauchy–Schwarz and Jensen, one has:

With the same argument as in dimension one, we obtain:

We have to control, for j ∈ {−1, …, d′}, the series Σ_{k,l} Eψ_jkl(X_i)².
Since ρ and the mother wavelet ψ are compactly supported,

where #S_d(j) ≤ c 2^{dj}. Finally we get:

from the definition of Γ_d(R, r). The control of the two terms then entails:

Optimizing with respect to d′ leads to the conclusion.

Proof of Proposition 1
To prove the oracle inequality, we use the following model selection approach. From [6], minimization (10) can be rewritten as f̂_n = f̂_R̂, where

$$\hat f_R \in \arg\min_{\|f\|_{spq} \le R}\ \frac{1}{n}\sum_{i=1}^n l(Y_i, f(X_i)) \quad \text{and} \quad \hat R \in \arg\min_{R \ge 0} \Big\{ \frac{1}{n}\sum_{i=1}^n l(Y_i, \hat f_R(X_i)) + \alpha_n\, \phi(R) \Big\}.$$

This gives a model selection interpretation of the classifier f̂_n, where the models are balls in B_{spq}(R^d). We can then apply the following general model selection theorem (Theorem 5 in [6]), which we recall for completeness.
Theorem 2. Let l be a loss function such that g^* ∈ arg min_{f∈L²(P_X)} El(Y, f(X)). Let (G_m)_{m∈M} be a countable collection of models with G_m ⊂ L²(P_X) for all m ∈ M.
(H4) ∀m ∈ M, ∀g_0 ∈ G_m, ∀r ≥ r*_m:

Let (x_m)_{m∈M} be a real sequence such that Σ_{m∈M} e^{−x_m} ≤ 1 and:

Let (ρ_m)_{m∈M} be a family of positive numbers, and let g̃ be such that there exists m̃ ∈ M with g̃ ∈ G_m̃ and:

Then if m ↦ pen(m) satisfies, for any m ∈ M:

where B_m = 75KC_m + 28b_m, we obtain:

Theorem 2 is rather general. It can be applied to a wide variety of situations related to many statistical models. In particular, it can be used to propose adaptive estimators in non-parametric regression, density estimation or classification. It gives the minimum required penalty to get an oracle inequality for a penalized empirical cost minimizer.
In our setup, we have to find constants b_R, C_R, a sub-root function φ_R and a distance d on L²(P_X) such that:
(H1) ∀R ∈ R_+, ∀g ∈ B(R), ‖l(g)‖_∞ ≤ b_R;
(H2) ∀g, g′ ∈ L²(P_X), Var(l(g) − l(g′)) ≤ d²(g, g′);
(H3) ∀R ∈ R_+, ∀g ∈ B(R), d²(g, f^*) ≤ C_R E(l(g) − l(f^*));
(H4) ∀R ∈ R_+, ∀r > 0, we have

Once assumptions (H1)-(H4) are granted, the next step is to discretize the continuous family of models (B(R))_{R∈R_+} over a certain family of values of the radii. Following [6], we consider the set of discretized radii:

To apply the second part of Theorem 5 in [6], the penalty function should satisfy:

where c_1 is a suitable constant. It can be checked that condition (20) on α_n ensures such an inequality for:

The last step is to go back and forth between the discretized framework and the continuous framework. We follow exactly [6] to write ĝ_n defined in (19) as an approximate penalized minimum empirical risk estimator of Theorem 2 over the family (B(R))_{R∈R}. It only remains to prove (H1)-(H4).

Proof of (H1)
We only consider in Proposition 1 a parameter range of Besov spaces B_{spq}(R^d) such that s > d/p. As a result, from the continuous embedding of B_{spq}(R^d) into C(R^d) for s > d/p, one gets for any f ∈ B_{spq}(R^d): ‖f‖_∞ ≤ c‖f‖_{spq}.

Proof of (H2)-(H3)
To check these assumptions, we have to choose a distance d on L²(P_X). This choice was made implicitly in Theorem 1: this theorem proves (H4) with the usual distance d(g, g′) = (E(g − g′)²)^{1/2}, for any g, g′ ∈ L²(P_X). It comes from Section 3.2.1, which allows us to write the L²-norm of a function in B_{spq}(R^d) using the wavelet decomposition. We then consider the same distance to check (H2) and (H3).
(H2) is trivially satisfied because the hinge loss l is a Lipschitz function. Moreover, by Lemma 11 of [6], hypothesis (18) ensures (H3) with constant C_R = 2M_R/η_1 + 1/η_0. The choice of distance above corresponds to the setting (S1) in [6]. That is why the second part of (18) is necessary in our context.

Proof of (H4)
The proof of (H4) has been done in Section 3.2.2.

Proof of Corollary 1
We only treat the particular case φ(x) = x. From Proposition 1, we have in this case:

$$\mathbb{E}\, R_l(\hat f_n, f^*) \le 2 \inf_{f\in B_{spq}(\mathbb{R}^d)} \big( R_l(f, f^*) + \alpha_n \|f\|_{spq} \big) + 4\alpha_n \Big(4 + c\,\frac{\eta_1}{\eta_0}\Big) + \frac{2}{n}, \qquad (29)$$

which, since l is the hinge loss, gives a control on the excess risk of f̂_n. To get Corollary 1, it remains to control the right-hand side of (29), whose main term is the approximation function

$$a(\alpha_n) := \inf_{f\in B_{spq}(\mathbb{R}^d)} \big( R_l(f, f^*) + \alpha_n \|f\|_{spq} \big).$$

By the Lipschitz property of the hinge loss, gathering with the assumptions on the marginal of X, we have for any p ≥ 1:

$$R_l(f, f^*) \le \mathbb{E}\,|f(X) - f^*(X)| \le c\, \|f - f^*\|_{L^p(\mathcal{P})},$$

where c depends on A.
To control the first term above, we use the following result.
Lemma 5. For any r < s, if f^* ∈ B_{rp∞}(R^d), there exists c > 0 such that for all R > 0:

$$\inf_{\|f\|_{spq} \le R} \|f - f^*\|_{L^p} \le c\, R^{-\frac{r}{s-r}}.$$

Proof. The cornerstone of the proof is the use of interpolation spaces. Given two Banach spaces B and B′, θ ∈ ]0, 1[ and q ∈ [0, ∞], the interpolation space (B, B′)_{θ,q} between B and B′ consists of all f ∈ B such that

$$\|f\|_{\theta,q} := \Big( \int_0^\infty \big( t^{-\theta} K(f, t) \big)^q\, \frac{dt}{t} \Big)^{1/q}$$

is finite, where K(f, t) is the Peetre K-functional (see [27] for a definition).

Here we are interested in the case q = ∞ and in the following geometric interpretation of interpolation spaces [25, Theorem 3.1]:

$$f \in (B, B')_{\theta,\infty} \iff \exists c > 0,\ \forall R > 0:\ \mathrm{dist}_B\big(f, B_{B'}(R)\big) \le c\, R^{-\frac{\theta}{1-\theta}},$$

where B_{B′}(R) := {f ∈ B′ : ‖f‖_{B′} ≤ R}. It means that the distance of any function of (B, B′)_{θ,∞} to the ball B_{B′}(R) tends to zero with a given rate of convergence. This approximation problem arose from the study of the approximation error in learning theory, where usually B = L² and B′ = H_K, a reproducing kernel Hilbert space ([25]). Here we propose a generalization to the Banach case with Besov spaces. We use in particular the following stability of Besov spaces under interpolation:

$$\big(L^p(\mathbb{R}^d), B_{spq}(\mathbb{R}^d)\big)_{\theta,\infty} = B_{\gamma p\infty}(\mathbb{R}^d), \qquad \gamma = \theta s. \qquad (30)$$

From (30) with θ = r/s, we conclude the proof of Lemma 5. Using this lemma and optimizing with respect to R leads to

$$a(\alpha_n) \le c\, \alpha_n^{r/s}.$$

Finally, going back to (29), we arrive at

$$\mathbb{E}\, R_l(\hat f_n, f^*) \le 2c\, \alpha_n^{r/s} + 4\alpha_n \Big(4 + c\,\frac{\eta_1}{\eta_0}\Big) + \frac{2}{n}.$$
Choosing α n such that an equality holds in (20) concludes the proof.