Classification via local multi-resolution projections

We focus on the supervised binary classification problem, which consists in guessing the label $Y$ associated to a co-variate $X \in \R^d$, given a set of $n$ independent and identically distributed co-variates and associated labels $(X_i,Y_i)$. We assume that the law of the random vector $(X,Y)$ is unknown and the marginal law of $X$ admits a density supported on a set $\A$. In the particular case of plug-in classifiers, solving the classification problem boils down to the estimation of the regression function $\eta(X) = \Exp[Y|X]$. Assuming first $\A$ to be known, we show how it is possible to construct an estimator of $\eta$ by localized projections onto a multi-resolution analysis (MRA). In a second step, we show how this estimation procedure generalizes to the case where $\A$ is unknown. Interestingly, this novel estimation procedure presents similar theoretical performances as the celebrated local-polynomial estimator (LPE). In addition, it benefits from the lattice structure of the underlying MRA and thus outperforms the LPE from a computational standpoint, which turns out to be a crucial feature in many practical applications. Finally, we prove that the associated plug-in classifier can reach super-fast rates under a margin assumption.


Setting
The supervised binary classification problem is directly related to a wide range of applications such as spam detection or assisted medical diagnosis (see [25, chap. 1] for more details). It can be described as follows.
The supervised binary classification problem. Let E stand for a subset of R d and write Y = {0, 1}. Assume we observe n co-variates X i ∈ E and associated labels Y i ∈ Y such that the elements of D n = {(X i , Y i ), i = 1, . . . , n} are n independent realizations of the random vector (X, Y ) ∈ E × Y of unknown law P X,Y . Given D n and a new co-variate X n+1 , we want to predict the associated label Y n+1 so as to minimize the probability of making a mistake.
In other words, we want to build a classifier h n : E → Y upon the data D n , which minimizes P(h n (X) ≠ Y |D n ). It is well known that the Bayes classifier h * (τ ) := 1 {η(τ )≥1/2} , where η(τ ) := E[Y |X = τ ] = P(Y = 1|X = τ ) (unknown in practice), is optimal among all classifiers since, for any other classifier h n , we have ℓ(h n , h * ) := P(h n (X) ≠ Y |D n ) − P(h * (X) ≠ Y ) ≥ 0 (see [12]). As a consequence, we measure the classification risk T (h n ) associated to a classifier h n as its average relative performance over all data sets D n , T (h n ) = E ⊗n ℓ(h n , h * ). As described in [12,Chap. 7], there is no classifier h n such that T (h n ) goes to zero with n at a specified rate for all distributions P X,Y . We therefore make the assumption that P X,Y belongs to a class of distributions P (as large as possible) and aim at constructing a classifier h n that satisfies eq. (1), where the infimum is taken over all measurable maps θ n from E into Y and ≲ means less than or equal up to a multiplicative constant factor independent of n. Any classifier h n verifying eq. (1) will be said to be (nearly) minimax optimal when δ = 0 (δ > 0). P will stand for the set of all distributions such that the marginal law P X of X admits a density µ on E and η belongs to a given smoothness class. Throughout the paper, we will denote by µ the density of P X .
Many classifiers have been suggested in the literature, such as k-nearest neighbors, neural networks, support vector machines (SVM) or decision trees (see [12,25]). In this paper, we will exclusively focus on plug-in classifiers h n (τ ) := 1 {ηn(τ )≥1/2} , where η n stands for an estimator of η. With such classifiers, it is shown in [48] that eq. (2) holds, where the term on the rhs is known as the regression loss (of the estimator η n of η) in L 1 (E, µ)-norm. Eq. (2) shows in particular that rates of convergence on the classification risk of a plug-in classifier h n can be readily derived from rates of convergence on the regression loss of η n . This prompts us to focus on the regression problem, which can be stated in full generality as follows.
The regression on a random design problem. Let E, Y stand for subsets of R d and R, respectively. Assume we dispose of n co-variates X i ∈ E and associated observations Y i ∈ Y such that the elements of D n = {(X i , Y i ), i = 1, . . . , n} are n independent realizations of the random vector (X, Y ) ∈ E × Y of unknown law P X,Y . We define ξ := Y − η(X), where η(τ ) := E[Y |X = τ ], so that by construction E[ξ|X] = 0. Given D n and under the assumption that P X,Y belongs to a large class of distributions P, we want to come up with an estimator η n of η, which is as accurate as possible for the wide range of losses S p (η n ) = E ⊗n E|η n (X) − η(X)| p , p ≥ 1.
As described previously, in the particular case where Y = {0, 1}, we fall back on the regression problem associated to the classification problem with plug-in classifiers. In this case, ξ is bounded such that |ξ| ≤ 1. Notice however that the regression on a random design problem stated above allows Y to be any subset of R (including R itself). To be more precise, and by analogy with eq. (1), our aim is to build an estimator η n of η satisfying, for all p ≥ 1, the counterpart of eq. (1) for the loss S p (η n ), where the infimum is taken over all measurable maps θ n from E into Y. As before, η n will be said to be (nearly) minimax optimal when δ = 0 (δ > 0).
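The link between the plug-in rule and the regression loss can be checked numerically. The sketch below is illustrative only: it works conditionally on the data (η n is treated as a fixed function evaluated on a finite grid), the function names are hypothetical, and it relies on two standard facts that are not spelled out in the text above, namely that the excess risk of a classifier h equals E[|2η(X) − 1| 1{h(X) ≠ h*(X)}] and that, for a plug-in rule, this quantity is at most twice the L 1 (E, µ) regression loss.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Plug-in rule: predict label 1 wherever the estimated regression function is >= 1/2."""
    return (np.asarray(eta_hat, dtype=float) >= 0.5).astype(int)

def excess_risk(eta, eta_hat, weights):
    """Excess risk of the plug-in rule over the Bayes rule, conditionally on the data.

    Uses the standard identity E[|2*eta(X) - 1| * 1{h_n(X) != h*(X)}], with the law
    of X discretized as `weights` on a finite grid (an assumption of this sketch).
    """
    eta = np.asarray(eta, dtype=float)
    disagree = plug_in_classifier(eta_hat) != plug_in_classifier(eta)
    return float(np.sum(weights * np.abs(2.0 * eta - 1.0) * disagree))

def l1_regression_loss(eta, eta_hat, weights):
    """L1 regression loss E|eta_hat(X) - eta(X)| on the same discretized design."""
    diff = np.asarray(eta_hat, dtype=float) - np.asarray(eta, dtype=float)
    return float(np.sum(weights * np.abs(diff)))
```

On any grid, `excess_risk(eta, eta_hat, w)` should not exceed `2 * l1_regression_loss(eta, eta_hat, w)`, which is why rates on the regression loss transfer to the classification risk.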

Motivations
Many estimators η n of η have been suggested in the literature to solve the regression on a random design problem. Among them, the celebrated local polynomial estimator (LPE) has been praised for its flexibility and strong theoretical performances (see [45,46]). As is well known, the LPE is minimax optimal in any dimension d ∈ N and for any S p -loss, p ∈ (0, ∞], over the set of laws P such that (i) µ is bounded from above and below on its support A := Suppµ = {τ : µ(τ ) > 0}, (ii) η belongs to a Hölder ball C s (E, M) of radius M and (iii) ξ has sub-Gaussian tails. As a drawback, the LPE is computationally expensive since it requires performing a new regression at every single point x ∈ A where we want to estimate η. Computational efficiency is however of primary importance in many practical applications.
In this paper, we show that it is possible to construct a novel estimator η n of η by localized projections onto a multi-resolution analysis (MRA) of L 2 (R d , λ) (where λ stands for the Lebesgue measure on E), which presents similar theoretical performances and is computationally more efficient than the LPE.

The hypotheses
In this section, we summarize the assumptions on µ, A, η and ξ that will be used throughout the paper.
Assumption on µ. Let us denote by µ min , µ max two real numbers such that 0 < µ min ≤ µ max < ∞. As is standard in the regression on a random design setting, we assume that the density µ is bounded above and below on its support A, that is, (D1) µ min ≤ µ(x) ≤ µ max for all x ∈ A.
This guarantees that we have enough information at each point x ∈ A in order to estimate η with best accuracy. For a study with weaker assumptions on µ, the reader is referred to [17,19], for example, and the references therein.
Assumption on A. We first assume that, (S1) A = E = [0, 1] d . Therefore A is known under (S1). We will deal with the case where A is unknown in Section 9.
Assumption on η. Fix r ∈ N. In the sequel, we will assume that, (H r s ) The regression function η belongs to the generalized Lipschitz ball L s (E, M) of radius M, for some s ∈ (0, r). Unless otherwise stated, s is unknown but belongs to the interval (0, r), where r is known. For a detailed review of generalized Lipschitz classes, the reader is referred to the Appendix below.
Assumptions on the noise ξ. We will consider the two following assumptions, (N1) Conditionally on X, the noise ξ is uniformly bounded, meaning that there exists an absolute constant K > 0 such that |ξ| ≤ K. (N2) The noise ξ is independent of X and normally distributed with mean zero and variance σ 2 , which we will denote by ξ ∼ Φ(0, σ 2 ). Assumption (N1) is adapted to the supervised binary classification setting, where Y = {0, 1}, while (N2) is more common in the regression on a random design setting, where Y = R.
Combination of assumptions. In the sequel, we will conveniently refer by (CS1) to the set of assumptions (D1), (S1), (N1) or (N2). As detailed below in Section 3, configuration (CS1) is comparable to what is customary in the regression on a random design setting.

Our results
Assuming at first A to be known, we introduce a novel nonparametric estimator η @ of η built upon local regressions against a multi-resolution analysis (MRA) of L 2 (R d , λ) and show that, under (CS1), it is adaptive nearly minimax optimal over a wide generalized Lipschitz scale and across the wide range of losses L p (E, µ), p ∈ [1, ∞). We subsequently show that these results generalize to the case where A is unknown but belongs to a large class of (possibly disconnected) subsets of R d , provided we modify the estimator η @ accordingly. We denote by η this latter estimator and prove that η can be used to build an adaptive nearly minimax optimal plug-in classifier, which can reach super-fast rates under a margin assumption. The above results essentially hinge on an exponential upper-bound on the probability of deviation of η @ from η at a point, as detailed in Theorem 7.1. These results either improve on the current literature or are interesting in their own right for the following reasons.
1) They show that it is possible to use MRAs to construct an adaptive nearly minimax optimal estimator η @ of η under the sole set of assumptions (CS1). More precisely, our results (i) hold in any dimension d; (ii) over the wide range of L p (E, µ)-losses, p ∈ [1, ∞); (iii) and a large Lipschitz scale; (iv) and do not require any assumption on µ beyond (D1). It is noteworthy that, contrary to most alternative MRA-based estimation methods, no smoothness assumption on µ is needed.
2) From a computational perspective, η @ outperforms other estimators of η under (D1) since it takes full advantage of the lattice structure of the underlying MRA. In particular, it requires at most as many regressions as there are data points to be computed everywhere on E, while alternative kernel estimators must be recomputed at each single point of E. We illustrate this latter feature through simulation.
3) Furthermore, and contrary to alternative MRA-based estimators, the local nature of η @ makes it possible to relax the assumption that A is known. This latter configuration allows µ to vanish on E as long as it remains bounded on its support A, which is particularly appropriate to the supervised binary classification problem under a margin assumption.
4) In the regression on a random design setting, η @ in fact bridges the gap between usual linear wavelet estimators and alternative kernel estimators, such as the LPE. On the one hand, η @ inherits its computational efficiency from the lattice structure of the underlying MRA. On the other hand, it features similar theoretical performances as the LPE in the random design setting. In particular, it remains a (locally) linear estimator of the data (modulo a spectral thresholding of the local regression matrix), and cannot discriminate finer smoothness than the one described by (generalized) Lipschitz spaces.
Here is the paper layout. We start with a literature review in Section 3. We give a hand-waving introduction to the main ideas that underpin the local multi-resolution estimation procedure in Section 4. We define notations that will be used throughout the paper and introduce MRAs in Section 5. Our actual estimation procedure is described in Section 6 and the results are detailed in Section 7. We show how these results can be fine-tuned under additional assumptions in Section 8. Assumption (S1) is relaxed and the properties of η are detailed in Section 9. We show how these latter results carry over to the classification setting in Section 10. Results of a simulation study with η @ under (CS1) are given in Section 11. Proofs of the regression results can be found in Section 12. The proofs of the classification results are simple modifications of the proofs given in [4] and can be found in [39]. In addition, the Appendix contains a detailed review of generalized Lipschitz spaces and MRAs.

Literature review
Both the regression on a random design problem and the classification problem have a longstanding history in nonparametric statistics. We will therefore limit ourselves to a brief account of the corresponding literature that is relevant to the present paper.

Classification with plug-in classifiers
Let us start with a review of some of the classification literature dedicated to plug-in classifiers. The seminal work [37] showed that plug-in rules are asymptotically optimal. It has been subsequently pointed out in [36] that the classification problem is in fact only sensitive to the behavior of P X,Y near the boundary line M := {τ ∈ E : η(τ ) = 1/2}. So that assumptions on the behavior of P X,Y away from this boundary are in fact unnecessary. Subsequent works such as [3] have shown that convex combinations of plug-in classifiers can reach fast rates (meaning faster than n −1/2 , and thus faster than nonparametric estimation rates). More recently, it has been shown in [4] that plug-in classifiers can reach super fast rates (that is faster than n −1 ) under suitable conditions. All these results are derived under some sort of smoothness assumption on the regression function η (see [50]) and a margin assumption (MA) (see Section 10 for details). This latter assumption clarifies the behavior of P X,Y in a neighborhood of M and kicks in naturally through the computation where δ is chosen such that it balances the two terms on the rhs. Finally, [4] exhibited optimal convergence rates under smoothness and margin assumptions and showed that they are attained with plug-in classifiers. Let us now turn to the regression on a random design problem.

Regression on a random design with wavelets
First results on multi-resolution analysis (MRA) and wavelet bases (see [34,38]) emerged in the nonparametric statistics literature in the early 1990's (see [27,14,13,15,16]). It has been proved that, under (CS1) and in the particular case where µ is the uniform distribution on E, thresholded wavelet estimators of η are nearly minimax optimal over a wide Besov scale and range of L p (E, µ)-losses (see [10]). In order to leverage on the power of MRAs and associated wavelet bases, several authors attempted to transpose these latter results to more general design densities µ. This, however, led to a considerable amount of difficulties. The literature relative to the study of wavelet estimators on an unknown random design breaks down into two main streams. (i) The first one aims at constructing new wavelet bases adapted to the (empirical) measure of the design (see [29,30,9,47]). (ii) The second one aims at coming up with new algorithms to estimate the coefficients of the expansion of η on traditional wavelet bases (see [2,23,31,41,44]). The present paper belongs to this second line of research. As described in [23], the success of the LPE on a random design results from the fact that it is built as a "ratio", which cancels out most of the influence of the design. In a wavelet context, a first suggestion has therefore been to use the ratio estimator of η (see [1,42], for example), well known from the statistics literature on orthogonal series decomposition (see [20,21] and [12,Chap. 17] and the references therein). Roughly speaking, the ratio estimator is the wavelet equivalent of the Nadaraya-Watson estimator (see [40,49]). It is elaborated on the simple observation that η(x) = η(x)µ(x)/µ(x) for all x ∈ A, where both g(.) = η(.)µ(.) and µ(.) are easily estimated via traditional wavelet methods. The ratio estimator relies thus unfortunately on the estimation of µ itself and must therefore assume as much smoothness on µ as on η. 
To address that issue, another approach has been introduced in [6,28]. They work with d = 1 and take E to be the unit interval [0, 1]. Their approach relies on the wavelet estimation of η •G −1 , where G stands for the cumulative distribution of the design and G −1 for its generalized inverse. Results are therefore stated in terms of the regularity of η •G −1 . Unfortunately, this method does not readily generalize to the multi-dimensional case, where G admits no inverse.
Finally, [5] obtains adaptive near-minimax optimal wavelet estimators over a wide Besov scale under (CS1) by means of model selection techniques. His results are hence valid for the L 2 (E, µ)-loss only. Other relevant references that proceed with hybrid estimators (LPE and kernel estimator or LPE and wavelet estimator) are [18] and [51]. They both work under (CS1), with d = 1 and assume that µ is at least continuous.

A primer on local multi-resolution estimation under (CS1)
In order to fix the ideas, let us now give a hand-waving introduction to the local multi-resolution estimation method. Throughout the paper, we will work with r-MRAs of L 2 (R d , λ), for some r ∈ N, consisting of nested approximation spaces V j ⊂ V j+1 built upon compactly supported scaling functions (see Section 5.2 and Appendix). Under the assumption that η belongs to the generalized Lipschitz ball L s (E, M) of radius M, the essential supremum of the remainder of the orthogonal projection P j η of η onto V j decreases like 2 −js (see Appendix). The regression function η can therefore be legitimately approximated by P j η. As an element of V j , P j η may be written as an infinite linear combination of scaling functions at level j. In particular, there exists a partition F j of E into hypercubes of edge-length 2 −j such that, for all H ∈ F j and all x ∈ H, we can write P j η(x) = Σ k∈S j (H) α j,k ϕ j,k (x), where S j (H) stands for a finite subset of Z d (see Figure 1). This leaves us in turn with the estimation of coefficients (α j,k ) k∈S j (H) for all H ∈ F j , which is achieved by least-squares and provides us with the estimator η @ j of η on H. It is noteworthy that the local estimator η @ j of η is exclusively built upon scaling functions and does not require the estimation of wavelet coefficients. In particular, it does not involve any sort of wavelet coefficient thresholding. To the best of the author's knowledge, this is the first time that this local estimation procedure is proposed and studied from both a theoretical and computational perspective. In addition, we show that Lepski's method (see [32], for example) can be used to adaptively choose the resolution level j. Notice that Lepski's method has already been used in an MRA setting in [43]. In what follows, we detail the local multi-resolution estimation method and establish the near minimax optimality of η @ .

Preliminary notations
In the sequel, we will denote by B p (z, ρ) the closed ℓ p -ball of R d of center z and radius ρ.
More generally, we adopt the following notations: for any subset S of a topological space E, Closure(S) will stand for its closure and S c for its complement in E. For any subset S of R d , z ∈ R d and τ ∈ R + , we will write z + S and τ S to mean the sets {z + u : u ∈ S} and {τ u : u ∈ S}, respectively. Finally, given a set (of functions) R, SpanR will denote the set of finite linear combinations of elements of R. For any p ∈ N, vectors v of R p will be seen as elements of M p,1 , that is, matrices with p rows and one column. For any two u, v ∈ R p , ⟨u, v⟩ will denote their Euclidean scalar product. In addition, for any p, q ∈ N and M ∈ M p,q , M t will stand for the transpose of M. For any two matrices M, P , M · P will denote their matrix product when it makes sense.
[M] k,ℓ and [M] k,• will respectively stand for the element of M located at row k, column ℓ and the k th row of M. Finally, ‖M‖ S will denote the spectral norm of M (see [26, §5.6.6]). We denote by ⌊z⌋ the integer part of z ∈ R defined as max{a ∈ Z : a ≤ z}. More generally, given z ∈ R d , we write ⌊z⌋ the integer part of z, meant in a coordinate-wise sense. In the same way, we denote by ⌈z⌉ the smallest integer greater than z (in a coordinate-wise sense). We write rhs (resp. lhs) to mean right-(resp. left-) hand-side and sometimes write := to mean equal by definition. Throughout the paper, we will refer to constants independent of n as absolute constants and c, C will stand for absolute constants whose value may vary from line to line. For any two sequences a n , b n indexed by n, we will write a n ≲ b n to mean a n ≤ Cb n for some absolute constant C and a n ≈ b n to mean that there exist two constants c, C independent of n such that cb n ≤ a n ≤ Cb n .

The polynomial reproduction property
In what follows, we will exclusively consider MRAs built upon Daubechies' scaling functions ϕ j,k (see Appendix and [8,35,7,24]). Given a natural integer r, we will call r-MRA an MRA whose nested approximation spaces V j reproduce polynomials up to order r − 1.
Daubechies' scaling functions ϕ j,k are appealing in the estimation framework since they are compactly supported and have minimal volume supports among scaling functions that give rise to r-MRAs. Recall finally that an r-MRA can explain Lipschitz smoothness s for any s ∈ (0, r).

General notations
Consider the Daubechies' r-MRA of L 2 (R d , λ) built upon Daubechies' scaling function ϕ, as described in the Appendix. We will denote by Suppϕ j, To alleviate notations, we will write ϕ k in place of ϕ 0,k and ϕ j in place of ϕ j,0 . Notice that Closure(Suppϕ j,k ) is in fact a closed hyper-cube of R d whose corners lie on the lattice 2 −j Z d . For any x ∈ A, we write It defines a partition of E into 2 jd hypercubes of edge length 2 −j , modulo a λ-null set. For the sake of concision, we write R = 2r − 1 in the sequel. We have the following proposition, whose proof is straightforward and thus left to the reader.
Proposition 5.1. S j verifies the following properties, 1. S j is constant on each element H ∈ F j . We will denote by S j (H) its value on H.

Moreover, for any two
It is a direct consequence of Proposition 5.1 that in the case where r = 1, we have #S j (H) = 1 for all H ∈ F j . We denote its single element by ν(H). It is in fact easy to show that ν(H) = ⌊2 j x⌋ for any x ∈ H. For any H ∈ F j , we write

Construction of the local estimator η @
Assume we are under (CS1) and work with the Daubechies' r-MRA of L 2 (R d , λ). The estimation procedure is local, so that we start by selecting a point x ∈ A. By construction, there exists H ∈ F j such that x ∈ H. We want to estimate η at point x. As detailed in the Appendix, an estimator of η can be reduced to an estimator of the orthogonal projection P j η of η onto V j , modulo an error R j η, such that |R j η| ≤ M2 −js when η belongs to the generalized Lipschitz ball L s (E, M) of radius M. Now, we can write This leaves us with exactly R d coefficients α j,ν , ν ∈ S j (H) to estimate, which are valid for any x ∈ H. We evaluate these coefficients by least-squares. Denote by B H ∈ M n,R d the matrix whose rows are the vectors ϕ H (X i ) t for 1 ≤ i ≤ n. Let us denote by k 1 , . . . , k R d the elements of S j (H). Then we choose where we set α ⋄ H = 0 if the arg min above contains more than one element. Let us write As is well known, when Q H is invertible, the arg min on the rhs of eq. (4) admits one single element which writes as follows, Naturally, we will denote the corresponding estimator of P j η at point . We now introduce a thresholded version of η ⋄ H based on the spectral thresholding of Q H . We denote by λ min (Q H ) the smallest eigenvalue of Q H in the case where r ≥ 2, when Q H is actually a matrix, and Q H itself in the case where r = 1, when it is a real number. Furthermore, we define where π n is a tuning parameter. In practice, and unless otherwise stated, we choose π n = log n.
Moreover, we assume throughout the paper that n is large enough so that π −1 n ≤ min( g min 2 , 1), where, for reasons that will be clarified later, we have denoted, and c min stands for the strictly positive constant defined in the proof of Proposition 12.4. Ultimately, the estimator η @ j of P j η is defined as,
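To make the construction concrete, here is a minimal sketch of the local least-squares step in the simplest configuration r = 1 and d = 1, where the scaling function is the Haar function, each cell H carries the single index ν(H) = ⌊2 j x⌋ and Q H is a real number. Several ingredients are assumptions of this sketch rather than statements of the paper: the function name, the 1/n normalization of Q H , and the thresholding rule (the estimate is set to zero when Q H falls below 1/π n ), since the corresponding display equations are not reproduced above.

```python
import numpy as np

def haar_local_estimate(x, X, Y, j, pi_n=None):
    """Local least-squares estimate of eta at x on the dyadic cell of level j containing x.

    Haar case (r = 1, d = 1) only. The normalization Q_H = (1/n) * sum_i phi(X_i)^2 and
    the rule "return 0 when Q_H < 1/pi_n" are assumptions made for illustration.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.size
    if pi_n is None:
        pi_n = np.log(n)                      # default tuning choice pi_n = log n
    k = np.floor(2.0 ** j * x)                # index nu(H) = floor(2^j x) of the cell H
    in_cell = np.floor(2.0 ** j * X) == k     # design points falling in H
    phi = 2.0 ** (j / 2.0) * in_cell          # phi_{j,k}(X_i) for the Haar scaling function
    Q = np.sum(phi ** 2) / n                  # scalar local regression "matrix" Q_H
    if Q < 1.0 / pi_n:                        # spectral thresholding (assumed form)
        return 0.0
    alpha = (np.sum(Y * phi) / n) / Q         # least-squares coefficient on H
    return float(alpha * 2.0 ** (j / 2.0))    # alpha * phi_{j,k}(x), constant on H
```

In this special case the estimate reduces to the average of the responses whose design points fall in H, and a single regression per cell yields the estimator at every point of that cell.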

The results
Let r be a natural integer, denote by P the set of all distributions on E × Y and write P(CS1, H r s ) := {P ∈ P : (CS1) and (H r s ) hold true}.
Furthermore, we define j r , j s , J and t(n) such that, where κ is a positive real number to be chosen later. In addition, we write J n = {j r , j r + 1, . . . , J − 1, J}. Notice that j s strikes the balance between bias and variance in the sense that, for log n ≥ (2s + d) log 2 and s ∈ (0, r), one has Throughout the sequel, we assume that n is large enough so that the latter inequalities hold true. Our first result gives an upper bound on the probability of deviation of η @ j from η at a point x ∈ A.
Theorem 7.1. Fix r ∈ N and assume we are under (CS1) and (H r where Λ is defined as follows, As a consequence of the above theorem, we can deduce the (near) minimax optimality of η @ js over generalized Lipschitz balls.
Corollary 7.1. Fix r ∈ N and assume we are under (CS1) and (H r s ). Then, for any p ∈ [1, ∞) and j ∈ J n , one has got where η @ j and C(p) are defined in eq. (8) and Proposition 12.1 below, respectively. A fortiori, when s is known, we can choose j = j s and apply eq. (10a) and eq. (10c) above to obtain This, together with the lower-bound of Theorem 7.3, proves that η @ js is (nearly) minimax optimal over the generalized Lipschitz ball L s (E, M) of radius M.
The next Theorem shows that the approximation level j can be determined from the data D n so that we obtain adaptation over a wide generalized Lipschitz scale.
Theorem 7.2. Fix r ∈ N and assume we are under (CS1) and (H r s ). We define Finally, we prove that η @ is indeed (nearly) minimax optimal by giving the corresponding lower-bound result.
Theorem 7.3. Assume we are under (CS1) and (H r s ). We write inf θn the infimum over all estimators θ n of η, that is all measurable functions of the data D n . Then, for d ≥ 1, s > 0, we have, for all 1 ≤ p < ∞, The next section shows how these results can be improved in the case where we benefit from additional information on µ or η.

Refinement of the results
As can be seen from Corollary 7.1 and Theorem 7.2 above, π n appears as a multiplicative factor in the upper-bounds and thus deteriorates them by a multiplicative log n term. However, this need not be the case, and under appropriate additional assumptions, π n can be chosen to be a constant. Consider indeed the following two assumptions.
Under (O1), we know a lower bound µ * min of µ min , and therefore a lower bound g * min of g min (see eq. (7)). Under (O1), we will thus choose π −1 n = min( g * min 2 , 1). It is straightforward to show that Theorem 7.1 is still valid with this new value of π n (see Remark 12.1 in the proof of Theorem 7.1), and thus all the subsequent results follow as well. Under (O2), we know an upper bound M of the essential supremum of η on E. In that case, we redefine where, for any z ∈ R, we have written T M (z) = z1 {|z|≤M } + Msign(z)1 {|z|>M } . Once again, it is straightforward to show that Theorem 7.1 is now valid with π −1 n = min( g min 2 , 1) and 2M in place of M in the indicator function on the rhs of eq. (11) (see Remark 12.1 in the proof of Theorem 7.1), and thus all the subsequent results follow as well. Notice that π n is an absolute constant under (O1) and (O2), while it is an increasing sequence of n to be fine-tuned by the statistician otherwise. Hence π n appears to be the price to pay for not knowing a lower bound of µ min or an upper bound of the essential supremum of η on E.
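The truncation operator T M used under (O2) simply clips a value to the interval [−M, M]. A minimal sketch, with a hypothetical function name, is:

```python
import numpy as np

def truncate(z, M):
    """T_M(z) = z * 1{|z| <= M} + M * sign(z) * 1{|z| > M}, applied elementwise."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= M, z, M * np.sign(z))
```

For M ≥ 0 this coincides with `np.clip(z, -M, M)`; it is written out here to mirror the formula term by term.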

The problem
Now, we would like to relax assumption (S1) and allow for A to be an unknown subset of E, possibly disconnected. Under (CS1), the success of η @ stems from the fact that it is constructed upon an approximation grid of the form 2 −j Z d ∩ [0, 1] d , whose edges coincide exactly with the boundary of A. In the case where A is unknown, some cells of the lattice might straddle the boundary of A and thus require a new treatment. In order to handle this new configuration, we will need to make a smoothness assumption on the boundary of A and allow for the estimation cells to move with the point at which we want to estimate η. Ultimately, we devise a new estimator η of η which is built upon a moving approximation grid. In fact, this new estimation method ensures that the point x at which we want to estimate η always belongs to a cell H of F j at resolution level j, whose center belongs to A. This will ensure that local regressions performed on cells that straddle the boundary of A are still meaningful. The smoothness assumption we will make on A might be compared to the support assumption made in [4, eq. (2.1)] in the classification context. In substance, it is assumed in [4] that A is locally ball-shaped to be compatible with the ball-shaped support of the LPE kernel, which they use to estimate η. In our case, we perform estimation with multi-dimensional scaling functions whose supports are cube-shaped and will thus assume that A is locally cube-shaped.

Smoothness assumption on A
Let us now make these informal arguments more precise. To that end we introduce assumption (S2) as an alternative to (S1) above. Fix an absolute constant m 0 ∈ (0, 1) and recall that 2^{j_s} = ⌊n^{1/(2s+d)}⌋. With these notations, (S2) goes as follows,

Figure 2: (S2) allows for A to be non-convex and possibly disconnected.

In words, (S2) means that if we zoom close enough to any x ∈ A, we can find a hypercube B ∞ (z x , m) that contains x and is a subset of A. Notice readily that for all j 1 ≥ j 2 , the component of 2 j 2 (A − x) that contains 0 is a subset of the component of 2 j 1 (A − x) that contains 0, so that A j 2 ⊂ A j 1 . Therefore A js grows with n and shrinks with s. Of course, (S1) is a particular case of (S2). Setting (S2) allows A to be unknown and belong to a wide class of subsets of R d , possibly disconnected (see Figure 2). In the sequel, we will conveniently refer by (CS2) to the set of assumptions (D1), (S2), (N1) or (N2).

Moving local estimation under (CS2)
As detailed above, η is obtained by local regression on a moving approximation grid. Let us describe the construction of η more precisely. First of all, we split the sample into two pieces. For simplicity, let us assume that we have 2n data points. The first half of the sample points, which we denote by D ′ n = {(X ′ i , Y ′ i ), i = 1, . . . , n}, will be used to identify the support A of µ, while the second half, which we denote by D n = {(X i , Y i ), i = 1, . . . , n}, will be used to estimate the scaling function coefficients by local regressions. Let us denote by H 0 the cell 2 −j [0, 1] d of the lattice 2 −j Z d at resolution j, and by H 0 (x) the same cell centered at x, that is H 0 (x) = x − 2 −j−1 + 2 −j [0, 1] d . Then, the construction of η j (x) at a point x ∈ R d goes as follows.
(i) If none of the design points (X ′ i ) of the sample D ′ n lie in H 0 (x), then take η (x) = 0.
(ii) If one or more design points of the sample D ′ n lie in H 0 (x), we select one of them and denote it by X ′ ix (the selection procedure is of no importance beyond computational considerations). By construction, x belongs to the cell H 0 (X ′ ix ) centered at X ′ ix ∈ A. Since X ′ ix belongs to A, it makes sense to perform a local regression on H 0 (X ′ ix ) with the sample points D n , which gives rise to an estimator η of η valid at any point of H 0 (X ′ ix ) ∩ A.
It is noteworthy that this procedure uses the sample D ′ n to identify the support A of µ. Interestingly, the above estimation procedure requires at most as many regressions as there are data points in D ′ n to return an estimator η of η at every single point x ∈ A. It is therefore computationally more efficient than kernel estimators such as the LPE, which must be recomputed at each point. The computational performance of η can in fact be further improved in the sense that the local regression on the cell H 0 (X ′ i ) can be omitted if the cell H 0 (X ′ i ) is itself included in the union of cells centered at other design points of D ′ n .
In particular, we can choose $X'_{i_x}$ to be a design point $X'_i$ of $D'_n$ that belongs to $H_0(x)$ and for which a local regression has already been performed, if such a point exists, or any one of the $X'_i$ that belong to $H_0(x)$ otherwise. Intuitively, the computational efficiency of the estimator stems from the fact that the design points $(X'_i)$ provide valuable information on the unknown support $A$ of $\mu$, which can be exploited under (CS2). In particular, and as we will see below, (D1) guarantees that the design points of $D'_n$ populate $A$ densely enough so that, as long as $j \le J$, the cells $H_0(X'_i)$, $1 \le i \le n$, form a cover of $A$, modulo a set whose $\mu$-measure decreases almost exponentially fast toward zero with $n$.
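The selection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `in_cell`, `select_anchor` and `fitted` are hypothetical and do not come from the paper, the cell $H_0(x)$ is taken to be the closed cube of side $2^{-j}$ centered at $x$, and ties between admissible design points are broken by taking the first one found.

```python
import numpy as np

def in_cell(center, points, j):
    """Mask of the points lying in the cube of side 2^-j centered at `center`."""
    half = 2.0 ** (-j - 1)
    return np.all(np.abs(points - center) <= half, axis=-1)

def select_anchor(x, design, j, fitted):
    """Index i_x of a design point of D'_n lying in H_0(x), or None if there is none.

    `fitted` is a set holding the indices of the design points whose local
    regression is already available; such points are preferred so that the
    corresponding regression can be reused.
    """
    candidates = [int(i) for i in np.flatnonzero(in_cell(x, design, j))]
    if not candidates:
        return None              # case (i): the estimator is set to 0 at x
    for i in candidates:
        if i in fitted:
            return i             # reuse an existing regression
    fitted.add(candidates[0])    # otherwise a new regression is needed
    return candidates[0]
```

Sweeping `select_anchor` over a set of evaluation points with a shared `fitted` set never triggers more regressions than there are design points, which is the counting argument made in the text.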

Construction of the local estimator η
Assume we are under (S2) and work with the Daubechies $r$-MRA of $L^2(\mathbb{R}^d, \lambda)$. Obviously, shifting the approximation grid is equivalent to shifting the data points $(X_i)$ of $D_n$ and keeping the lattice fixed. For ease of notation and clarity, we adopt this second point of view. In order to compute the estimator at a point $x \in H_0(X'_{i_x}) \cap A$, we want to shift the design points in such a way that $X'_{i_x}$ falls right in the middle of $H_0$. In other words, we want $X'_{i_x}$ to be shifted to the point $2^{-j-1} \in \mathbb{R}^d$ (all of whose coordinates are equal to $2^{-j-1} \in \mathbb{R}$). This corresponds to the change of variable $\tilde X_i = X_i - (X'_{i_x} - 2^{-j-1})$, where we have denoted by $X_i$ and $\tilde X_i$ the representations of the same data point in the canonical and shifted coordinate systems of $\mathbb{R}^d$, respectively. In order to compute the estimator at a point $x \in H_0(X'_{i_x}) \cap A$, it is therefore enough to perform a local regression on $H_0$ against the shifted data points. For the sake of concision, we will denote by $\tilde u = u - (X'_{i_x} - 2^{-j-1})$ the coordinate representation of a point $u$ in the shifted coordinate system of $\mathbb{R}^d$. Let us denote by $k_1, \ldots, k_{R^d}$ the elements of $S_j(H_0)$. With these notations, eq. (4) must be corrected and written as where we set $\alpha^\diamond_{H_0} = 0$ if the arg min above contains more than one element.
The notations introduced in Section 5.3 can be updated to this new setting as follows. $B_{H_0}$ now stands for the random matrix of $\mathcal{M}_{n,R^d}$ whose rows are the $\varphi_{H_0}(\tilde X_i)^t$, $i = 1, \ldots, n$. In addition, we recall that we have defined Its coefficients write thus as Notice here that $S_j(H_0) = \{\nu \in \mathbb{Z}^d : 2^{-1} \in \mathrm{Supp}\,\varphi_\nu\}$, which depends neither on $j$ nor on $x$. Therefore, and for later reference, we denote In addition, if we write $Y_{H_0} = (Y_i \mathbf{1}_{H_0}(\tilde X_i))_{1 \le i \le n}$, then eq. (5) still holds true when the solution to eq. (14) is unique. So that, for all $x \in H_0(X'_{i_x}) \cap A$, we can write $\eta^\diamond_{H_0}(\tilde x) = \langle \alpha^\diamond_{H_0}, \varphi_{H_0}(\tilde x) \rangle$.
Finally, eq. (6) remains valid with $X_i$ replaced by $\tilde X_i$ and $H$ by $H_0$, $\eta^@_{H_0}$ redefined as $\eta_{H_0}$ and $g_{\min}$ redefined as where $c_{\min}$ is the strictly positive constant defined in Lemma 12.1 below. So that ultimately, the estimator $\eta_j$ of $P_j\eta$ at a point $x \in \mathbb{R}^d$ writes as Notice that, by contrast with eq. (8) above, the sum over the hypercubes of $F_j$ has disappeared. This is due to the fact that the approximation grid moves with $x$, so that we end up virtually always performing estimation on the same hypercube $H_0$.
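To fix ideas, the shift-then-regress step can be sketched as follows. This is a rough illustration rather than the construction of the paper: `affine_basis` is a hypothetical stand-in for the vector of scaling functions $\varphi_{H_0}$ (it is not a Daubechies basis), the empirical Gram matrix is assumed to be normalised as $n^{-1}B^tB$, and `threshold` plays the role of the lower bound imposed on its smallest eigenvalue.

```python
import numpy as np

def affine_basis(j):
    """Hypothetical stand-in for phi_{H_0} in dimension d = 1 (not a wavelet basis)."""
    scale = 2.0 ** (j / 2)
    return lambda U: scale * np.column_stack(
        [np.ones(len(U)), 2.0 ** j * U[:, 0] - 0.5])

def local_estimate(x, anchor, X, Y, j, basis, threshold):
    """Thresholded least-squares fit on H_0 = 2^-j [0, 1]^d after shifting the data."""
    shift = anchor - 2.0 ** (-j - 1)        # sends the anchor to the center of H_0
    Xs, xs = X - shift, x - shift           # shifted design points and target point
    inside = np.all((Xs >= 0.0) & (Xs <= 2.0 ** (-j)), axis=1)
    B = basis(Xs) * inside[:, None]         # rows of points outside H_0 are zeroed
    Q = B.T @ B / len(X)                    # empirical Gram matrix (assumed normalisation)
    if np.linalg.eigvalsh(Q).min() < threshold:
        return 0.0                          # ill-conditioned cell: estimator set to 0
    alpha = np.linalg.solve(B.T @ B, B.T @ (Y * inside))
    return float(basis(xs[None, :])[0] @ alpha)
```

With noiseless affine responses, the sketch recovers the regression function exactly on the cell, since an affine function belongs to the span of the stand-in basis.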

The results
Interestingly, the estimator still satisfies results similar to those described in Section 7. To be more precise, recall that we work with a sample of size $2n$ split into two pieces $D_n$ and $D'_n$ of size $n$. Let us redefine $J_n$ so that $J_n = \{j_s, j_s + 1, \ldots, J - 1, J\}$, where $2^{j_s} = \lfloor n^{\frac{1}{2s+d}} \rfloor$. Then, we obtain the following result in place of Theorem 7.1.
Theorem 9.1. Fix $r \in \mathbb{N}$ and assume we are under (CS2) and $(H^r_s)$. Recall that $\eta_j$ is defined in eq. (17). Then, for all $j \in J_n$, all $\delta > 2M2^{-js}\max(1, 3\pi_n R^d \mu_{\max})$ and all $x \in A$, we have where $\Lambda$ has been defined in Theorem 7.1.
Leaving aside the fact that the estimator is constructed upon a sample of size $2n$, the sole difference with the result of Theorem 7.1 is that the leading constant in front of the exponential on the second line has changed from $2R^d$ to $3R^d$. Furthermore, it is straightforward to deduce from Theorem 9.1 results similar to Corollary 7.1, Theorem 7.2 and Theorem 7.3, and a fortiori the refined results obtained in Section 8, under (CS2). The proofs of these results under the set of assumptions (CS2) follow, for the most part, exactly the same lines as the proofs given for $\eta^@$ under (CS1). Details can be found in Section 12.2.

Classification via local multi-resolution projections
Recall from [4] that the margin assumption can be written as follows. (MA) There exist constants $C_* > 0$ and $\vartheta \ge 0$ such that $P(0 < |\eta(X) - 1/2| \le t) \le C_* t^{\vartheta}$ for all $t > 0$.
The binary classification setting corresponds to (CS2), under assumptions (N1) and (O2). Notice besides that we have $K = 1$ in (N1) and $M = 1$ in $(H^r_s)$. Since we are under (O2), it follows from Section 8 that $\pi_n = \pi_0 = \min(1, g_{\min}/2)$ is independent of $n$ and the estimator is capped at $M = 1$ as in eq. (13). For the sake of coherence, we denote by $j$ the adaptive resolution level built upon the estimator, as described in Theorem 7.2, and define $\mathcal{P}(\mathrm{CS2}, H^r_s)$ by analogy with eq. (9) above. Finally, we recall that the estimator is built upon a sample of size $2n$ split into two sub-samples $D_n$ and $D'_n$ of size $n$. As a consequence of Theorem 9.1, we can use the plug-in classifier built upon the estimator to obtain results similar to those given in [4, Lemma 3.1] for LPE-based plug-in classifiers.
where the classification risk $T(\cdot)$ has been defined in Section 1 and the constants $C_0$, $C_1$, $C_2$ are made explicit in [39] and only depend on $\mu_{\max}$, $\mu_{\min}$, $r$, $d$ and $\vartheta$.
In fact, it can be shown that the classifiers h defined in Corollary 10.1 are (nearly) minimax optimal. Proofs of Corollary 10.1 and the associated lower-bound can be found in [39].
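For concreteness, the plug-in rule itself is a simple thresholding of the regression estimate at $1/2$. The sketch below assumes that an estimate of $\eta$ is available as a vectorised Python callable; the names `plug_in_classifier` and `eta_hat` are hypothetical, and assigning the label 1 in case of a tie at $1/2$ is a convention chosen here, not one taken from the paper.

```python
import numpy as np

def plug_in_classifier(eta_hat):
    """Turn an estimate of eta(x) = E[Y | X = x] into a {0, 1}-valued classifier."""
    def classify(x):
        # Predict label 1 wherever the estimated regression function is at least 1/2.
        return (np.asarray(eta_hat(x)) >= 0.5).astype(int)
    return classify
```

Capping the estimate at $M = 1$ leaves the decision unchanged, since only the position of the estimate relative to $1/2$ matters.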

Simulation study
In order to illustrate the performance of $\eta^@_{j^@}$, we have carried out a simulation study in the regression setting in the one-dimensional case, that is with $d = 1$. As detailed earlier, the sole purpose of this simulation is (1) to show that $\eta^@$ can be easily implemented and is computationally efficient, (2) to show that $\eta^@$ works well in practice in the case where the density $\mu$ of the design is discontinuous, and (3) to give an intuitive visual feel for $\eta^@$, which is built upon the juxtaposition of local regressions against a set of scaling functions. In particular, we run our simulations against benchmark signals, which allows us to compare our results with those reported in the literature for alternative kernel estimators (see the simulation study in [32], for example). We have run them under (CS1), which corresponds to the case where $\eta^@_j$ can be completely computed with exactly $2^j$ regressions. We have in particular $E = [0, 1] = A$. We focus on the functions $\eta$ introduced in [14] and used as a benchmark in numerous subsequent simulation studies. They are made available through the Wavelab850 library, freely available at http://www-stat.stanford.edu/~wavelab/. In addition, we have chosen the noise $\xi$ to be standard normal, that is we are working under (N2) with $\sigma = 1$. In all cases, we have chosen the signal-to-noise ratio (SNR) to be equal to 7. To be more specific, we are working on a dyadic grid $G$ of $[0, 1]$ of resolution $2^{-15}$. We compute the root-mean-squared error (RMSE) of both the signal and the noise on that grid and rescale the signal so that its RMSE is seven times bigger than that of the noise.
Let us now give details about the simulation of the sample points and the computation of the estimator. We divide the unit interval into ten sub-segments $A_k := 10^{-1}[k, k + 1]$ for $k = 0, \ldots, 9$. We define the density of $X$ as follows, $\mu(x) = \sum_{k=0}^{9} 10\,p_k \mathbf{1}_{A_k}(x)$.
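The sampling of the design can be sketched as follows, including the random choice of the weights $p_k$ and the snapping of the points to the grid that are described next. The sketch assumes that the density is piecewise constant and equal to $10p_k$ on $A_k$, which is consistent with the lower bound on $\mu$ stated in the text. The function name `sample_design` and its parameters are hypothetical, and NumPy is used here in place of the Wavelab850 routines.

```python
import numpy as np

def sample_design(n, rng, grid_bits=15):
    """Draw design points from a random piecewise-constant density on [0, 1]."""
    u = rng.uniform(0.25, 1.0, size=10)
    p = u / u.sum()                            # probabilities p_k of the segments A_k
    cells = rng.choice(10, size=n, p=p)        # pick A_k with probability p_k
    x = (cells + rng.uniform(size=n)) / 10.0   # uniform position inside A_k
    step = 2.0 ** (-grid_bits)
    x = np.unique(np.round(x / step) * step)   # nearest grid node, one point per node
    return x, p
```

Calling `sample_design(3000, np.random.default_rng(seed))` returns the sorted, de-duplicated design together with the weights, so the number of surviving points can be read off the length of the first output.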
We choose the $p_k$'s at random. To that end, we denote by $(u_k)_{0 \le k \le 9}$ ten realizations of the uniform random variable on $[0.25, 1]$, write $v = u_0 + \ldots + u_9$ and set $p_k = u_k v^{-1}$. Notice that this guarantees that $\mu \ge \min_{0 \le k \le 9} 10p_k \ge \mu_{\min} = 0.25$ on $[0, 1]$. We then simulate 3000 sample points $X_i$ according to $\mu$. Finally, we bring the points back onto the grid $G$ by assimilating them to their nearest grid node. Since the $X_i$'s are supposed to be drawn from a law that is absolutely continuous with respect to the Lebesgue measure on $[0, 1]$, we must keep only one data point per grid node. This reduces the number of data points from 3000 to the number that is reported on top of each of the histograms.
In order to compute the adaptive estimator at sample points $X_i$, we use the boundary-corrected scaling functions coded into Wavelab850 for $r = 3$ and for which we must have $j \ge 3$. We set $J = \lceil \log(n/\log n)/\log 2 \rceil$. The elimination of redundant sample points on the grid removes on average 150 points so that we obtain $J = 10$. We therefore have $J_n = \{3, 4, \ldots, 10\}$. Notice interestingly that the computation of $\eta^@_3$ requires only 8 regressions and that of $\eta^@_{10}$ requires 1,024 of them. This is far fewer than for the LPE, whose computation necessitates as many regressions as there are sample points at each resolution level. In practice, we compute the minimum eigenvalues of all regression matrices across partitions and resolution levels and choose $\pi_n^{-1}$ to be the first decile of this set of values. When proving theoretical results, we have chosen $\eta^@_j$ to be zero on the small probability event where the minimum eigenvalue of the regression matrix is smaller than $\pi_n^{-1}$. In practice, we can choose it to be an average value of the nearby cells in order to get an estimator that is overall more appealing to the eye. In our simulation, we in fact do not use that modification. Instead, we modify $j^@$ to be the highest $j \in \{3, \ldots, j^@\}$ such that $\eta^@_{j^@}$ has been computed from a valid regression matrix, meaning a regression matrix whose smallest eigenvalue is greater than the threshold $\pi_n^{-1}$.
In practice, for a given signal, we generate $\mu$ at random and compute $\eta^@_{j^@}$ for 100 samples drawn from $\mu$. We quantify the performance of $\eta^@_{j^@}$ by its relative RMSE, meaning its RMSE computed at sample points $X_i$ divided by the amplitude of the true signal, that is its maximal absolute value on the underlying dyadic grid. We display results for "Doppler", "HeaviSine", "Bumps" and "Blocks" corresponding to the median performance among the 100 trials. Each figure displays four graphs. Clockwise from the top left corner, they display in turn a histogram of sample points $X_i$; the adaptive level $j^@$ at sample points $X_i$; the true signal (black dots) and the estimator $\eta^@_{j^@}$ at sample points $X_i$ (solid blue line), with its corresponding relative RMSE in the title; and finally the original signal (solid blue line) with its noisy version at sample points $X_i$ (red dots).
Consider the term Now, apply Proposition 12.1 and notice that $\int_A \mu(x)\,dx = 1$ to show that $I$ is upper-bounded by the term that appears on the rhs of eq. (12) stated in Corollary 7.1. In particular, for all $1 \le p < \infty$, we obtain $I \le C(p)\pi_n^p t(n)^{-p} \le C(p) < \infty$. This in turn proves that we can apply the Fubini-Tonelli theorem to get and concludes the proof.

Proof of Theorem 7.1
Let $x \in A$ and $j \in J_n$. There exists $H \in F_j$ such that $x \in H$. Let us work on the set $\{\lambda_{\min}(Q_H) \ge \pi_n^{-1}\}$ on which $Q_H$ is invertible. On that set, we can write

Now, notice that for all
. Then, we have, Thus, a direct application of Proposition 12.5 allows to write, for δ > 2M2 −js max(1, 3π n R d µ max ), By definition, we have η @ j (x) = η @ H (x), so that we have By construction, η @ H (x) = η ⋄ H (x) on the event {λ min (Q H ) ≥ π −1 n } and η @ H (x) = 0 on its complement. So that we obtain |η(x) − η @ H (x)| = |η(x)| ≤ M on the rhs of eq. (20). Notice in addition that M2 −js ≥ |R j η(x)| under (H r s ) (see Appendix). Finally, we obtain, for where we have writtenM = M. The term on the lhs has been dealt with above. The term on the rhs is tackled using Proposition 12.3. This concludes the proof.
Write J n (j) = {k ∈ J n : k > j}. Notice first that we have the following inclusions Therefore, we can write Now, we notice that So that So that a direct application of the Cauchy-Schwarz inequality leads to Now, a direct application of Proposition 12.1 for j s ≤ j ≤ J gets us Besides, notice that for j s ≤ j < k ≤ J, we can apply Proposition 12.2 with κ ≥ p 2 C −1 To conclude the proof, it remains to notice that #J n ≤ log n and remark that the multiplicative constant in the upper-bound of Theorem 7.2 is indeed smaller than, say, 5 for n large enough.

A few useful Propositions and Lemmas
Proposition 12.1. Fix $r \in \mathbb{N}$ and assume we are under (CS1) and $(H^r_s)$. Then, for any $x \in A$ and $j \in J_n$, one has and $C_5$ is made explicit in the proof at eq. (21).
Proof. For any x ∈ A, take δ = 3M2 −js max(1, 3π n R d µ max ). Notice first that max(1, 3π n R d µ max ) ≤ π n max(1, 3R d µ max ) since, by construction, π −1 n ≤ 1 in any case. Now, write As δ has been fixed, we only need to tackle the rhs above, which we will denote by II. Using Theorem 7.1, we can write Denote by II 1 and II 2 the lhs and rhs terms above, respectively. Now, recall that j ≤ J, where 2 Jd ≤ nt(n) −2 and t(n) 2 = κπ 2 n log n. Therefore, as soon as Let us now turn to II 2 . Assume first that we are working under the bounded noise assumption, (N1). In that case, we have where the last inequality results from the change of variable u = √ n2 −j d 2 π −1 n t together with the fact that 2 jd ≤ n and we have written Assume now that we are working under the Gaussian noise assumption (N2). In that case, we have Denote by II 3 and II 4 the first and second term, respectively. They can both be handled in the exact same way as II 2 , which leads to where we have written and where we have written To conclude, let us write C 5 (r, d, p, µ max ; K, σ) = C 2 (r, d, p, µ max , K) under (N1) C 3 (r, d, p, µ max , σ) + C 4 (r, d, p, µ max ) under (N2) Therefore, we ultimately obtain which concludes the proof.
Proposition 12.2. Fix r in N and assume we are under (CS1) and (H r s ). This means in particular that s ∈ (0, r). Let j be such that j s ≤ j ≤ J. Let t(n) 2 = κπ 2 n log n, and define C 9 (r, d, µ max , π n ; K, σ) := C 6 (r, d, µ max , K, π n ), under (N1) where C 6 is defined in eq. (22) below. Then we have, for n large enough, Proof. The proof relies on a direct application of Theorem 7.1. Write C 0 = 2M max(1, 3π n R d µ max ) and notice indeed that the theorem applies since for j ≥ j s , we get 2 j d 2 n − 1 2 ≥ 2 −(r+ d 2 ) 2 −jss (see eq. (10b)) and, as soon as n is large enough, we have t(n) ≥ 2 r+ d 2 C 0 . This leads us to Let us denote the first term by I and the second one by II. I is easily tackled noticing that for j ≤ J, n2 −jd ≥ n2 −Jd ≥ t(n) 2 = κπ 2 n log n. So that, we obtain I ≤ 2R 2d n −κC 6 , where we have written Let us now turn to II. Assume first we work under (N1). Then we can write Notice first that 2 j d 2 t(n) ≤ √ n. Therefore, we obtain II ≤ 2R d n −κC 6 . Assume now that we work under (N2). In that case, we obtain We proceed exactly as under (N1). So that we obtain II ≤ C 7 n −κC 8 , where So that C 7 ≤ 3R d for n large enough. Notice finally that C 8 (r, d, µ max , t, π n ) ≥ C 6 (r, d, µ max , t, π n ). This concludes the proof.
Proposition 12.3. Fix an integer $r \ge 1$ and assume we are under (CS1). Let $x \in A$ and $j \in J_n$. By construction, there exists $H \in F_j$ such that $x \in H$. Recall besides that $\#S_j(H) = R^d$, where $R = 2r - 1$ is independent of both $x$ and $j$. Write $\|\cdot\| = \|\cdot\|_{\ell_2(\mathbb{R}^{R^d})}$ and assume there exists a strictly positive constant $g_{\min}$ independent of $x$ and $j$ such that Then, for any real number $t$ such that $0 < t \le g_{\min}/2$, we have
Proof. Under the assumption described in eq. (23), we get , so that $\mathbb{E}T_i = 0$, $\mathrm{Var}\,T_i \le \mu_{\max}2^{jd}$ and $|T_i| \le 2^{jd+1}$. A direct application of the Bernstein inequality for any $\delta > 0$ leads to To conclude, we write
Proposition 12.4. Fix an integer $r \ge 1$ and assume we are under (CS1). For any $x \in A$ and $j \in J_n$, we denote by $H$ the unique hypercube of $F_j$ such that $x \in H$. Then, there exists a strictly positive absolute constant $g_{\min}$ independent of both $x$ and $j$ such that, for all $j \in J_n$ and all $x \in A$, we have $\lambda_{\min}(\mathbb{E}Q_H) \ge g_{\min} > 0$.
Proof. For any u ∈ R R d such that u ℓ 2 (R R d ) = 1, we can write = µ min where S has been defined in eq. (15) and the last equality results from the fact that the value of the integral on the rhs of eq. (24) is invariant with H. Let us denote by S R d −1 the unit-sphere of R R d . As detailed in [39], the map is absolutely continuous with respect to u on the compact subset S R d −1 of R R d . It therefore reaches its minimum at some point u * ∈ S R d −1 . It is a direct consequence of the local linear independence property of the scaling functions (ϕ k ) (see Proposition 12.7) that where c min is a constant that is both independent from x and j. This concludes the proof with g min = µ min c min .
Proposition 12.5. Let (X i ) i=1,...,n and (ξ i ) i=1,...,n be sequences of independent random variables such that E(ξ|X) = 0. Take any j ≥ j r . Moreover, assume we are given a function R j (.) such that R j (.) L∞(E,λ) ≤ M2 −js , a subset H of E and a scaling function ϕ j,k . Write and define Then, for all δ > 3µ max M2 −j(s+ d 2 ) , we have Proof. Notice indeed that So that we can write Now it is enough to notice that So that P(III ≥ δ/3) = 0 as soon as δ > 3µ max M2 −j(s+ d 2 ) . Now, turn to II and write Obviously . So that we can apply Bernstein inequality to get And finally, turn to III. Assume first that the noise ξ is bounded by K. We have obviously Eϕ j,k (X i )ξ i 1 H (X i ) = 0, Var(ϕ j,k (X i )ξ i 1 H (X i )) ≤ K 2 µ max and |ϕ j,k (X i )ξ i 1 H (X i )| ≤ K2 j d 2 +1 , so that Now, it is enough to notice that for all s > 0 and j such that j ≥ 1 s log 2 M K (which becomes a constraint for K < M only), which concludes the proof under (N1). When j ≥ 1 s log 2 3M, the conclusion under (N2) is a direct consequence of Proposition 12.6. Proposition 12.6. Let ϕ j,k be a scaling function and H a subset of E. Define Assume now that the noise ξ is conditionally Gaussian, that is we are under (N2). Then, we notice that, conditionally on X 1 , . . . , X n , I ∼ Φ(0, σρ j,k / √ n), where ρ 2 j,k = n −1 n i=1 ϕ j,k (X i ) 2 1 H (X i ).
Then, for all δ > 0, one can write Proof. For any δ > 0, we write Notice first that So that .
The first term is handled thanks to a regular Gaussian tail inequality. Notice indeed that P(|I| ≥ δ|X 1 , . . . , X n )1 In addition, notice that Eϕ j,k (X) 4 1 H (X) ≤ µ max 2 jd and |ϕ j,k (X) 2 1 H (X i )−Eϕ j,k (X) 2 1 H (X)| ≤ 2 jd+1 , so that a direct application of Bernstein inequality leads to which concludes the proof. Proof. This result is derived from [33] and its proof can be found in [39].

Proof of the upper-bound results under (CS2)
Recall that under (CS2), we work with a sample of size $2n$ split into two sub-samples denoted by $D_n$ and $D'_n$. As detailed previously, results similar to those described in Section 7, Section 8 and Section 12.1.4 are still valid under (CS2). They in fact all stem from Theorem 9.1. The proofs remain for the most part unchanged, with $J_n$ redefined as $J_n = \{j_s, j_s + 1, \ldots, J - 1, J\}$ where $2^{j_s} = \lfloor n^{\frac{1}{2s+d}} \rfloor$, the estimator built under (CS2) in place of $\eta^@$, $\tilde X_i$ in place of $X_i$ (where we have written $\tilde u = u - X'_{i_x} + 2^{-j-1}$), and $H_0$ in place of $H$. The sole differences appear in the proofs of Theorem 9.1 and Proposition 12.4. Let us start with the proof of Theorem 9.1.
Proof of Theorem 9.1. Assume we are under (CS2) and want to control the probability of deviation of $\eta_j(x)$ from $\eta(x)$ at a point $x \in A$, for some $j \in J_n$. Recall that $H_0(x)$ stands for the cell $H_0 = 2^{-j}[0,1]^d$ centered at $x$ at level $j$, that is $H_0(x) = x - 2^{-j-1} + 2^{-j}[0,1]^d$, and denote by $O_x$ the event $O_x = \{\exists\, i \in \{1, \ldots, n\} : X'_i \in H_0(x)\}$, so that $P(|\eta(x) - \eta_j(x)| \ge \delta) = P(|\eta(x) - \eta_j(x)| \ge \delta, O_x) + P(|\eta(x) - \eta_j(x)| \ge \delta, O_x^c)$. Focus first on what happens on the event $O_x^c$. The last term can be controlled easily since the probability that no single design point $X'_i$ of $D'_n$ belongs to $H_0(x)$ decreases exponentially fast with $n$. Notice indeed that, under (CS2), $P(O_x^c) = (P(X'_1 \notin H_0(x)))^n = (1 - P(X'_1 \in H_0(x)))^n \le (1 - \mu_{\min}\min(2m_0, 2^{-1})^d 2^{-jd})^n \le \exp(-\mu_{\min}\min(2m_0, 2^{-1})^d n2^{-jd})$, where the next-to-last inequality is a direct consequence of (S2) and the last one comes from the fact that for any $x \in [0, 1)$, $\ln(1 - x) \le -x$. Now, recall that $\eta_j(x) = 0$ on $O_x^c$ and $|\eta(x)| \le M$ since $\eta \in L^s(\mathbb{R}^d, M)$. So that we obtain $P(|\eta(x) - \eta_j(x)| \ge \delta, O_x^c) \le \exp(-\mu_{\min}\min(2m_0, 2^{-1})^d n2^{-jd})\mathbf{1}_{\{\delta \le M\}}$, which is smaller than the first term in the upper-bound of Theorem 9.1.
Now focus on what happens on the event $O_x$. We can write Therefore, it is enough to control the probability of deviation of $\eta_j(x)$ from $\eta(x)$ on $O_x$, conditionally on $X'_{i_x}$. It is controlled in exactly the same way as the probability of deviation of $\eta^@_j(x)$ from $\eta(x)$ under (CS1), except that we now work with conditional probabilities and expectations with respect to $X'_{i_x}$. Interestingly, the random variable $X'_{i_x}$ is independent of the points of $D_n$ since it is built upon the design points $(X'_i)$ of $D'_n$, which are themselves independent of the points of $D_n$. This is a key feature that makes theoretical computations tractable under (CS2) and allows us to handle the estimator in a similar way as $\eta^@$ under (CS1). As announced above, Proposition 12.4 is the sole result that is not obviously true under (CS2). However, it can be extended to setting (CS2) without much trouble (see below).
Ultimately, this proves that, on the event $O_x$ and conditionally on $X'_{i_x}$, the probability of deviation of $\eta_j(x)$ from $\eta(x)$ satisfies the bound of Theorem 7.1. So that finally, it remains to put everything together to obtain the results announced in Theorem 9.1, which concludes the proof.
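As a quick numerical illustration of the bound on the probability of an empty cell used in the proof above, the snippet below compares $(1 - p)^n$ with $\exp(-np)$ for $p = \mu_{\min}\min(2m_0, 2^{-1})^d 2^{-jd}$. The function name `empty_cell_bound` is hypothetical and the parameter values used in any call are illustrative only; they are not taken from the paper.

```python
import math

def empty_cell_bound(n, j, d, mu_min, m0):
    """Return ((1 - p)^n, exp(-n p)), with p the lower bound on P(X'_1 in H_0(x))."""
    p = mu_min * min(2.0 * m0, 0.5) ** d * 2.0 ** (-j * d)
    return (1.0 - p) ** n, math.exp(-n * p)
```

The first output never exceeds the second, which is the inequality $\ln(1 - x) \le -x$ at work, and both decay quickly as soon as $n2^{-jd}$ is large.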
As detailed in [39], the proof of Proposition 12.4 can be extended to setting (CS2), thanks to the local linear independence property of the scaling functions (see Proposition 12.7) and a compactness argument. In particular, we obtain the following result, which is proved in [39].
Let $s > 0$ and write $r = \lfloor s \rfloor + 1$. The Besov space $B^s_{\infty,\infty}$ on $\mathbb{R}^d$, also known as the generalized Lipschitz space $L^s(\mathbb{R}^d)$, is the collection of all functions $f \in C(\mathbb{R}^d) \cap L^\infty(\mathbb{R}^d, \lambda)$ such that the semi-norm $|f|_{L^s(\mathbb{R}^d)} := \sup_{t>0} t^{-s}\omega_r(f, t)_\infty$ is finite. The norm for $L^s(\mathbb{R}^d)$ is subsequently defined as $\|f\|_{L^s(\mathbb{R}^d)} := \|f\|_{L^\infty(\mathbb{R}^d, \lambda)} + |f|_{L^s(\mathbb{R}^d)}$. Fix a real number $M > 0$. Throughout the paper, $L^s(\mathbb{R}^d, M)$ refers to the ball of $L^s(\mathbb{R}^d)$ of radius $M$. Obviously, the elements of $L^s(\mathbb{R}^d, M)$ are $\lambda$-a.e. uniformly bounded by $M$ on $\mathbb{R}^d$. As described in [11,7], there exists an alternative definition of Lipschitz spaces $C^s(\mathbb{R}^d)$, also known as Hölder spaces, which goes as follows. For any integer $d$, multi-index $q = (q_1, \ldots, q_d) \in \mathbb{N}^d$ and $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, we define the differential operator $\partial^q$ as usual by $\partial^q := \frac{\partial^{q_1 + \ldots + q_d}}{\partial^{q_1}x_1 \ldots \partial^{q_d}x_d}$. For any positive integer $s$, $C^s(\mathbb{R}^d)$ consists of the functions $f$ on $\mathbb{R}^d$ such that $\partial^q f$ is bounded and absolutely continuous on $\mathbb{R}^d$, for all $q \in \mathbb{N}^d$ such that $|q|_1 := q_1 + \ldots + q_d \le s$. This definition is extended to non-integer $s$ as follows, $C^s(\mathbb{R}^d) = \{f \in C^m(\mathbb{R}^d) : \partial^q f \in C^{s-m}(\mathbb{R}^d), |q|_1 = m\}$, $m < s < m + 1$, $m \in \mathbb{N}$.
It can be shown that, for all non-integer $s > 0$, $C^s(\mathbb{R}^d) = L^s(\mathbb{R}^d)$, while $C^s(\mathbb{R}^d)$ is a strict subset of $L^s(\mathbb{R}^d)$ when $s \in \mathbb{N}$ (see [11, p. 52] for examples of functions that belong to $L^1([0, 1])$ but not to $C^1([0, 1])$ in the particular case where $d = 1$). Furthermore, we define these function spaces on the subset $E$ of $\mathbb{R}^d$ as the restriction of their elements to $E$. As explained in [7, Remark 3.2.4], function spaces on $E$ can be defined by restriction or, alternatively, in an intrinsic way, and both definitions coincide for fairly general domains $E$ of $\mathbb{R}^d$. Looking at function spaces on $E$ as function spaces on $\mathbb{R}^d$ restricted to $E$ justifies the use of MRAs of $L^2(\mathbb{R}^d, \lambda)$ in our local analysis.

MRAs and smoothness analysis
Multivariate MRAs will always be assumed to be obtained from a tensor product of one-dimensional MRAs, as described in [7, §1.4, eq. (1.4.10)]. We will denote by $\varphi_{j,k}(\cdot) = 2^{jd/2}\varphi(2^j \cdot - k)$ the translated and dilated version of $\varphi$ with $k \in \mathbb{Z}^d$. As usual, we write $V_j$ to mean $\mathrm{Closure}(\mathrm{Span}\{\varphi_{j,k}, k \in \mathbb{Z}^d\})$, so that $\mathrm{Closure}(\cup_{j \ge 0} V_j) = L^2(\mathbb{R}^d, \lambda)$ (where the closures are taken with respect to the $L^2(\mathbb{R}^d, \lambda)$-metric).
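The dilation and translation convention above can be checked numerically in dimension $d = 1$. The sketch below uses the Haar scaling function (the indicator of $[0, 1)$) as a simple stand-in for $\varphi$, whereas the paper works with general Daubechies scaling functions. The names `haar_phi`, `phi_jk` and `inner_product` are hypothetical, and the inner product is approximated by a midpoint rule.

```python
import numpy as np

def haar_phi(x):
    """Haar scaling function: indicator of [0, 1)."""
    return ((x >= 0.0) & (x < 1.0)).astype(float)

def phi_jk(x, j, k):
    """Dilated and translated version 2^{j/2} phi(2^j x - k), for d = 1."""
    return 2.0 ** (j / 2) * haar_phi(2.0 ** j * x - k)

def inner_product(f, g, lo=-2.0, hi=2.0, m=2 ** 16):
    """Midpoint-rule approximation of the L2 inner product of f and g on [lo, hi]."""
    h = (hi - lo) / m
    x = lo + h * (np.arange(m) + 0.5)
    return float(np.sum(f(x) * g(x)) * h)
```

At a fixed level $j$, the functions `phi_jk(., j, k)` have unit norm and are mutually orthogonal for distinct $k$, which is what the normalisation $2^{jd/2}$ is designed to ensure.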
The r-MRAs defined in Section 5.2 are intimately connected with generalized Lipschitz spaces.