Multiscale regression on unknown manifolds

We consider the regression problem of estimating functions on $\mathbb{R}^D$ but supported on a $d$-dimensional manifold $ \mathcal{M} \subset \mathbb{R}^D $ with $ d \ll D $. Drawing ideas from multi-resolution analysis and nonlinear approximation, we construct low-dimensional coordinates on $\mathcal{M}$ at multiple scales, and perform multiscale regression by local polynomial fitting. We propose a data-driven wavelet thresholding scheme that automatically adapts to the unknown regularity of the function, allowing for efficient estimation of functions exhibiting nonuniform regularity at different locations and scales. We analyze the generalization error of our method by proving finite sample bounds in high probability on rich classes of priors. Our estimator attains optimal learning rates (up to logarithmic factors) as if the function was defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. The implemented algorithm has quasilinear complexity in the sample size, with constants linear in $D$ and exponential in $d$. Our work therefore establishes a new framework for regression on low-dimensional sets embedded in high dimensions, with fast implementation and strong theoretical guarantees.


Introduction
High-dimensional data challenge classical statistical models and require new understanding of tradeoffs in accuracy and efficiency. The seemingly quantitative fact of the increase of dimension has qualitative consequences in both methodology and implementation, demanding new ways to break what has been called the curse of dimensionality. On the other hand, the presence of inherent nonuniform structure in the data calls into question linear dimension reduction techniques, and motivates a search for intrinsic learning models. In this paper we explore the idea of learning and exploiting the intrinsic geometry and regularity of the data in the context of regression analysis. Our goal is to build low-dimensional representations of high dimensional functions, while ensuring good generalization properties and fast implementation. In view of the complexity of the data, we allow interesting features to change from scale to scale and from location to location. Hence, we will develop multiscale methods, extending classical ideas of multi-resolution analysis beyond regular domains and to the random sample regime.
In regression, the problem is to estimate a function from a finite set of random samples. The minimax mean squared error (MSE) for estimating functions in the Hölder space $\mathcal{C}^s([0,1]^D)$, $s > 0$, is $O(n^{-2s/(2s+D)})$, where $n$ is the number of samples. The exponential dependence of the minimax rate on $D$ manifests the curse of dimensionality in statistical learning, as $n = O(\varepsilon^{-(2s+D)/s})$ points are generally needed to achieve accuracy $\varepsilon$ (for instance, with $s = 2$ and $D = 100$, accuracy $\varepsilon = 0.1$ already calls for on the order of $10^{52}$ samples). This rate is optimal (in the minimax sense), unless further structural assumptions are made. For example, if the samples concentrate near a $d$-dimensional set with $d \ll D$, and the function belongs to a nonuniform smoothness space $\mathcal{B}_S$, with $S > s$, we may hope to find estimators converging in $O(n^{-2S/(2S+d)})$. In this quantified sense, we may break the curse of dimensionality by adapting to the intrinsic dimension and regularity of the problem.
A possible approach to this problem is based on first performing dimension reduction, and then regression in the reduced space. Linear dimension reduction methods include principal component analysis (PCA) (Pearson, 1901; Hotelling, 1933, 1936), for data concentrating on a single subspace, or subspace clustering (Chen and Lerman, 2009; Chen and Maggioni, 2011; Vidal et al., 2005; Elhamifar and Vidal, 2009; Liu et al., 2010), for a union of subspaces. Going beyond linear models, we encounter isomap (Tenenbaum et al., 2000), locally linear embedding (Roweis and Saul, 2000), local tangent space alignment (Zhang and Zha, 2002), Laplacian eigenmaps (Belkin and Niyogi, 2003), Hessian eigenmaps (Donoho and Grimes, 2003) and diffusion maps (Coifman et al., 2005a). Besides the classical principal component regression (Jolliffe, 1982), in Lee and Izbicki (2016) diffusion maps are used for nonparametric regression, expanding the unknown function over the eigenfunctions of a kernel-based operator. It is proved that, when data lie on a $d$-dimensional manifold, the MSE converges in $O(n^{-1/O(d^2)})$. This rate depends only on the intrinsic dimension, but does not match the minimax rate in the Euclidean space. If infinitely many unlabeled points are sampled, so that the eigenfunctions are exactly computed, the MSE can achieve optimal rates for Sobolev functions with smoothness parameter at least 1. Similar results hold for regression with Laplacian eigenmaps (Zhou and Srebro, 2011). Some regression methods have been shown to automatically adapt to the intrinsic dimension and perform as well as if the intrinsic domain were known. Results in this direction have been established for local linear regression (Bickel and Li, 2007), $k$-nearest neighbors (Kpotufe, 2011), and kernel regression (Kpotufe and Garg, 2013), where optimal rates depending on the intrinsic dimension were proved for functions in $\mathcal{C}^2$, $\mathcal{C}^1$, and $\mathcal{C}^s$ with $s \le 1$, respectively. Kernel methods such as kernel ridge regression are also known to adapt to the intrinsic dimension (Ye and Zhou, 2008; Steinwart et al., 2009), while suitable variants of regression trees have been proved to attain intrinsic yet suboptimal learning rates (Kpotufe and Dasgupta, 2012). On the other hand, dyadic partitioning estimates with piecewise polynomial regression can cover the whole scale of spaces $\mathcal{C}^s$, $s > 0$ (Györfi et al., 2002), and be combined with wavelet thresholding techniques to optimally adapt to broader classes of nonuniform regularity (Binev et al., 2005, 2007). However, such estimators are cursed by the ambient dimension $D$, due to the exponential cardinality of a dyadic partition of the $D$-dimensional hypercube.
This paper aims at generalizing dyadic partitioning estimates (Binev et al., 2005, 2007) to predict functions supported on low-dimensional sets, with optimal performance guarantees and low computational cost. We tie together ideas in classical statistical learning (Friedman et al., 2001; Györfi et al., 2002; Tsybakov, 2009), multi-resolution analysis (Daubechies, 1992; Mallat, 1999; Coifman et al., 2005b), and nonlinear approximation (Johnstone, 1994, 1995; Cohen et al., 2002). Our main tool is geometric multi-resolution analysis (GMRA) (Allard et al., 2012; Maggioni et al., 2016; Liao and Maggioni, 2019), which is a multiscale geometric approximation scheme for point clouds in high dimensions concentrating near low-dimensional sets. Using GMRA we learn low-dimensional local coordinates at multiple scales, on which we perform a multiscale regression estimate by fitting local polynomials. Inspired by wavelet thresholding techniques (Cohen et al., 2002; Binev et al., 2005, 2007), we then compute differences between estimators at adjacent scales, and retain the locations where such differences are large enough. This empirically reveals where higher resolution is required to attain a good approximation, generating a data-driven partition which adapts to the local regularity of the function. Our approach has several distinctive features: (i) it is multiscale, and is therefore well-suited for data sets containing variable structural information at different scales; (ii) it is adaptive, allowing the function to have localized singularities or variable regularity; (iii) it is entirely data-driven, that is, it does not require a priori knowledge about the regularity of the function, and rather learns it automatically from the data; (iv) it is provable, with strong theoretical guarantees of optimal performance on large classes of priors; (v) it is efficient, having straightforward implementation, minor parameter tuning, and low computational cost. We will prove that, for functions supported on a $d$-dimensional manifold and belonging to a rich model class characterized by a smoothness parameter $S$, the MSE of our estimator converges at rate $O((\log n/n)^{2S/(2S+d)})$. This model class contains classical Hölder continuous functions, but further accounts for potential nonuniform regularity. Our result shows that, up to a logarithmic factor, we attain the same optimal learning rate as if the function were defined on a known Euclidean domain of dimension $d$, instead of an unknown manifold embedded in $\mathbb{R}^D$. In particular, the rate of convergence depends on the intrinsic dimension $d$ and not on the ambient dimension $D$. In terms of computation, all the constructions above can be realized by algorithms of complexity $O(n \log n)$, with constants linear in the ambient dimension $D$ and exponential in the intrinsic dimension $d$.
The remainder of this paper is organized as follows. We conclude this section by defining some general notation and formalizing the problem setup. In Section 2 we review geometric multi-resolution analysis. In Section 3 we introduce our multiscale regression methods and establish the performance guarantees. We discuss the computational complexity of our algorithms in Section 4. The proofs of our results are collected in Section 5.
Notation. $f \lesssim g$ and $f \gtrsim g$ mean that there exists a positive constant $C$, independent of any variable upon which $f$ and $g$ depend, such that $f \le C g$ and $f \ge C g$, respectively. $f \asymp g$ means that both $f \lesssim g$ and $f \gtrsim g$ hold. The cardinality of a set $A$ is denoted by $\#A$. For $x \in \mathbb{R}^D$, $\|x\|$ denotes the Euclidean norm and $B_r(x)$ denotes the Euclidean ball of radius $r$ centered at $x$. Given a subspace $V \subset \mathbb{R}^D$, we denote its dimension by $\dim(V)$ and the orthogonal projection onto $V$ by $\mathrm{Proj}_V$. Let $f, g : \mathcal{M} \to \mathbb{R}$ be two functions, and let $\rho$ be a probability measure supported on $\mathcal{M}$. We define the inner product of $f$ and $g$ with respect to $\rho$ as $\langle f, g \rangle := \int_{\mathcal{M}} f g \, d\rho$, and write $\|f\| := \langle f, f \rangle^{1/2}$ for the corresponding $L^2(\rho)$ norm. The $L^\infty$ norm of $f$ is $\|f\|_\infty := \mathrm{ess\,sup}\, |f|$. We denote probability and expectation by $\mathbb{P}$ and $\mathbb{E}$, respectively. For a fixed $M > 0$, $T_M$ is the truncation operator defined by $T_M(x) := \min(|x|, M)\,\mathrm{sign}(x)$. We denote by $\mathbb{1}_{j,k}$ the indicator function of an indexed set $C_{j,k}$ (i.e., $\mathbb{1}_{j,k}(x) = 1$ if $x \in C_{j,k}$, and $0$ otherwise).

Setup
We consider the problem of estimating a function $f : \mathcal{M} \to \mathbb{R}$ given $n$ samples $\{(x_i, y_i)\}_{i=1}^n$, where:
• $\rho$ is an unknown probability measure supported on $\mathcal{M}$;
• $\{x_i\}_{i=1}^n$ are independently drawn from $\rho$;
• $y_i$ is a noisy observation of $f(x_i)$.
We wish to construct an estimator $\widehat f$ of $f$ minimizing the mean squared error $\mathbb{E}\|f - \widehat f\|^2$.
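For concreteness, the following minimal sketch generates a synthetic instance of this setup. The choice of manifold (a 2-sphere isometrically embedded in $\mathbb{R}^D$), the test function, and the additive Gaussian noise are illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np

# Synthetic instance of the setup: x_i drawn from a measure rho supported on a
# d-dimensional manifold M embedded in R^D, with labels y_i = f(x_i) + noise.
# The sphere, the function f, and the Gaussian noise are illustrative assumptions.
rng = np.random.default_rng(0)
n, D, d = 2000, 100, 2

# Sample uniformly on the unit 2-sphere in R^3, then embed isometrically in R^D.
u = rng.standard_normal((n, d + 1))
sphere = u / np.linalg.norm(u, axis=1, keepdims=True)       # points on S^2 in R^3
embed = np.zeros((D, d + 1))
embed[:d + 1, :] = np.eye(d + 1)
x = sphere @ embed.T                                         # n points in R^D, supported on M

def f(points):
    # A smooth test function defined through the first three ambient coordinates.
    return np.sin(3 * points[:, 0]) + points[:, 1] * points[:, 2]

y = f(x) + 0.1 * rng.standard_normal(n)                      # noisy labels
```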

Geometric multi-resolution analysis
Geometric multi-resolution analysis (GMRA) is an efficient tool to build low-dimensional representations of data concentrating on or near a low-dimensional set embedded in high dimensions. To keep the presentation self-contained, we summarize here the main ideas, and refer the reader to Allard et al. (2012); Maggioni et al. (2016); Liao and Maggioni (2019) for further details. Given a probability measure $\rho$ supported on a $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$, GMRA performs the following steps:
1. Construct a multiscale tree decomposition $\mathcal{T}$ of $\mathcal{M}$ into nested cells $\mathcal{T} := \{C_{j,k}\}_{k \in \mathcal{K}_j,\, j \in \mathbb{Z}}$, where $j$ represents the scale and $k$ the location. Here $\mathcal{K}_j$ is a location index set.
2. Compute a local principal component analysis on each $C_{j,k}$. Let $c_{j,k}$ be the mean of $x$ on $C_{j,k}$, and $V_{j,k}$ the $d$-dimensional principal subspace of $C_{j,k}$. Define the affine projector $\mathbb{P}_{j,k}(x) := c_{j,k} + \mathrm{Proj}_{V_{j,k}}(x - c_{j,k})$.
An ideal multiscale tree decomposition should satisfy assumptions (A1)–(A5) below for all integers $j \ge j_{\min}$:
(A1) For every $k \in \mathcal{K}_j$ and $k' \in \mathcal{K}_{j+1}$, either $C_{j+1,k'} \subseteq C_{j,k}$ or $\rho(C_{j+1,k'} \cap C_{j,k}) = 0$.
The children of $C_{j,k}$ are the cells $C_{j+1,k'}$ such that $C_{j+1,k'} \subseteq C_{j,k}$. We assume that $1 \le a_{\min} \le \#\{C_{j+1,k'} : C_{j+1,k'} \subseteq C_{j,k}\} \le a_{\max}$ for all $k \in \mathcal{K}_j$ and $j \ge j_{\min}$. Also, for every $C_{j,k}$, there exists a unique $k' \in \mathcal{K}_{j-1}$ such that $C_{j,k} \subseteq C_{j-1,k'}$. We call $C_{j-1,k'}$ the parent of $C_{j,k}$.
(A2) $\rho\big(\mathcal{M} \setminus \bigcup_{k \in \mathcal{K}_j} C_{j,k}\big) = 0$, i.e. $\Lambda_j := \{C_{j,k}\}_{k \in \mathcal{K}_j}$ is a partition of $\mathcal{M}$, up to negligible sets.
(A3) There exists $\theta_1 > 0$ such that $\rho(C_{j,k}) \ge \theta_1 2^{-jd}$ for every $k \in \mathcal{K}_j$.
(A4) There exists $\theta_2 > 0$ such that, if $x$ is drawn from $\rho$ conditioned on $C_{j,k}$, then $\|x - c_{j,k}\| \le \theta_2 2^{-j}$ almost surely.
These are natural properties for multiscale partitions generalizing dyadic partitions to non-Euclidean domains (see Christ, 1990). (A1) establishes that the cells constitute a tree structure. (A2) says that the cells at scale $j$ form a partition. (A3) guarantees that there are at most $2^{jd}/\theta_1$ cells at scale $j$. (A4) ensures that the diameter of all cells at scale $j$ is bounded by $2^{-j}$, up to a uniform constant. (A5)(i) assumes that the best rank-$d$ approximation to the covariance of a cell is close to the covariance matrix of a $d$-dimensional Euclidean ball, while (A5)(ii) assumes that the cell has significantly larger variance in $d$ directions than in all the remaining ones.
Since all cells at scale $j$ have similar diameter, $\Lambda_j$ is called a uniform partition. A master tree $\mathcal{T}$ is a tree satisfying the properties above. A proper subtree $\widetilde{\mathcal{T}}$ of $\mathcal{T}$ is a collection of nodes of $\mathcal{T}$ with the following properties: the root node is in $\widetilde{\mathcal{T}}$; if a node is in $\widetilde{\mathcal{T}}$, then its parent is also in $\widetilde{\mathcal{T}}$. Any finite proper subtree $\widetilde{\mathcal{T}}$ is associated with a unique partition $\Lambda = \Lambda(\widetilde{\mathcal{T}})$ consisting of its outer leaves, by which we mean those nodes that are not in $\widetilde{\mathcal{T}}$, but whose parent is.
In practice, the master tree $\mathcal{T}$ is not given. We will construct one by an application of the cover tree algorithm (Beygelzimer et al., 2006) (see Liao and Maggioni, 2019, Algorithm 3). In order to make the samples for tree construction and function estimation independent from each other, we split the data in half and use one subset to construct the tree and the other one for local PCA and regression. From now on we index the data used for local PCA and regression as $\{(x_i, y_i)\}_{i=1}^{n}$ and the data used for tree construction as $\{x_i\}_{i=n+1}^{2n}$. The following result guarantees that the constructed cells satisfy the assumptions above; in particular, if $\mathcal{M}$ is a smooth closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then (A5) is satisfied as well:
Proposition 1 (Proposition 14 in Liao and Maggioni (2019)) Assume $\rho$ is a doubling probability measure on $\mathcal{M}$ with doubling constant $C_1$. Then the $C_{j,k}$'s constructed in (Liao and Maggioni, 2019, Algorithm 3) satisfy: (a1) (A1) with $a_{\max} = C_1^2 (24)^d$ and $a_{\min} = 1$; (a2) letting $\widehat{\mathcal{M}} = \bigcup_{j=j_{\min}}^{j_{\max}} \bigcup_{k \in \mathcal{K}_j} C_{j,k}$, for any $\nu > 0$ the measure of $\mathcal{M} \setminus \widehat{\mathcal{M}}$ is controlled with high probability. If additionally $\mathcal{M}$ is a $\mathcal{C}^s$, $s \in (1, \infty)$, $d$-dimensional closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and $\rho$ is the volume measure on $\mathcal{M}$, then (a5) (A5) is satisfied when $j$ is sufficiently large.
Since there are finitely many training points, the constructed master tree has a finite number of nodes. We first build a tree whose leaves contain a single point, and then prune it to the largest subtree whose leaves contain at least $d$ training points. This pruned tree associated with the $C_{j,k}$'s is called the data master tree, and denoted by $\mathcal{T}_n$. The $C_{j,k}$'s cover $\widehat{\mathcal{M}}$, which represents the part of $\mathcal{M}$ that has been explored by the data. Even though assumption (A2) is not exactly satisfied, we claim that (a2) is sufficient for our performance guarantees, for example in the case that $\|f\|_\infty \le M$. Indeed, it suffices to estimate $f$ on $\mathcal{M} \setminus \widehat{\mathcal{M}}$ by $0$: in view of the bound in (a2), the rate of convergence on $\mathcal{M} \setminus \widehat{\mathcal{M}}$ is faster than the ones we will obtain on $\widehat{\mathcal{M}}$. We will therefore assume (A2), thanks to (a2). Also, it may happen that conditions (A3)–(A5) are satisfied at the coarsest scales with very poor constants $\theta$.
Nonetheless, it will be clear that in all that follows we may discard a few coarse scales, and only work at scales that are fine enough and for which (A3)–(A5) truly capture in a quantitative way the local geometry of $\mathcal{M}$. Since regression is performed on an independent subset of data, we can assume, by conditioning, that the $C_{j,k}$'s are given and satisfy the required assumptions. To keep the notation simple, from now on we will write $C_{j,k}$ for the empirically constructed cells and $\mathcal{M}$ in place of $\widehat{\mathcal{M}}$, with a slight abuse of notation. Besides the cover tree, other methods can be applied in practice to obtain multiscale partitions, such as METIS (Karypis and Kumar, 1999), used in Allard et al. (2012), iterated PCA (Szlam, 2009), and iterated $k$-means. These methods can be computationally more efficient than the cover tree, but lead to partitions where properties (A1)–(A5) are not guaranteed to hold.
After constructing the multiscale tree $\mathcal{T}$, GMRA computes a collection of affine projectors $\{\mathbb{P}_j : \mathbb{R}^D \to \mathbb{R}^D\}_{j \ge j_{\min}}$. The main objects of GMRA in their population and sample version are summarized in Table 1. Given a suitable partition $\Lambda \subset \mathcal{T}$, $\mathcal{M}$ can be approximated by the piecewise linear set $\bigcup_{C_{j,k} \in \Lambda} \mathbb{P}_{j,k}(C_{j,k})$.
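As an illustration of the local PCA step, the following sketch computes, for the points falling in a single cell $C_{j,k}$, the empirical center $c_{j,k}$, a basis of the $d$-dimensional principal subspace $V_{j,k}$, and the associated affine projector $\mathbb{P}_{j,k}$. The function name and the assumption that the cell membership has already been determined (e.g., by a cover tree) are illustrative.

```python
import numpy as np

def local_pca(points_in_cell, d):
    """Empirical GMRA quantities on one cell C_{j,k} (a sketch).

    points_in_cell : (m, D) array of the training points lying in C_{j,k}
                     (assumed given, e.g. from a cover-tree cell);
    d              : intrinsic dimension.
    Returns the center c_{j,k}, an orthonormal basis V_{j,k} (D x d) of the
    principal subspace, and the affine projector P_{j,k}(x) = c + Proj_V(x - c).
    """
    c = points_in_cell.mean(axis=0)
    centered = points_in_cell - c
    # Top-d right singular vectors of the centered data = principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    V = vt[:d].T                                   # (D, d), orthonormal columns

    def project(x):
        x = np.atleast_2d(x)
        return c + (x - c) @ V @ V.T               # affine projection onto c + V

    return c, V, project
```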

Multiscale polynomial regression
Given a multiscale tree decomposition $\{C_{j,k}\}_{j,k}$ and training samples $\{(x_i, y_i)\}_{i=1}^n$, we construct a family $\{\widehat f_{j,k}\}_{j,k}$ of local estimates of $f$ in two stages: first we compute local coordinates on $C_{j,k}$ using the GMRA outlined above, and then we estimate $f|_{C_{j,k}}$ by fitting a polynomial of order $\ell$ on such coordinates. A global estimator $\widehat f^\ell_\Lambda$ is finally obtained by summing the local estimates over a suitable partition $\Lambda$. Our regression method is detailed in Algorithm 1. In this section we assume that $f$ is bounded, with $\|f\|_\infty \le M$.
In order to analyze the performance of our method, we introduce the oracle estimator $f^\ell_\Lambda$ based on the distribution $\rho$, obtained by replacing the empirical quantities in Algorithm 1 with their population counterparts, and split the MSE into a bias and a variance term:
$$\mathbb{E}\|f - \widehat f^\ell_\Lambda\|^2 \le 2\,\|f - f^\ell_\Lambda\|^2 + 2\,\mathbb{E}\|f^\ell_\Lambda - \widehat f^\ell_\Lambda\|^2. \qquad (2)$$
The bias term is a deterministic approximation error, and will be handled by assuming suitable regularity models for $\rho$ and $f$ (see Definitions 2 and 5). The variance term quantifies the stochastic error arising from finite-sample estimation, and will be bounded using concentration inequalities (see Proposition 10). The role of $\Lambda$, encoded in its size $\#\Lambda$, is crucial to balance (2). We will discuss two possible choices: uniform partitions in Section 3.1, and adaptive partitions at multiple scales in Section 3.2.

Algorithm 1
Input: data $\{(x_i, y_i)\}_{i=1}^{n}$ and $\{x_i\}_{i=n+1}^{2n}$, intrinsic dimension $d$, bound $M$, polynomial order $\ell$, approximation type (uniform or adaptive).
Output: multiscale tree decomposition $\mathcal{T}_n$, partition $\Lambda$, piecewise $\ell$-order polynomial estimator $\widehat f^\ell_\Lambda$.
1: construct a multiscale tree $\mathcal{T}_n$ by (Liao and Maggioni, 2019, Algorithm 3) on $\{x_i\}_{i=n+1}^{2n}$;
2: compute centers $c_{j,k}$ and subspaces $V_{j,k}$ by empirical GMRA on $\{x_i\}_{i=1}^{n}$;
3: define coordinates $\pi_{j,k}(x) := \mathrm{Proj}_{V_{j,k}}(x - c_{j,k})$ on $C_{j,k}$;
4: compute local estimators $\widehat g^\ell_{j,k}$ by solving the following least squares problems over the space $\mathcal{P}^\ell$ of polynomials of degree $\le \ell$:
$$\widehat g^\ell_{j,k} := \operatorname*{argmin}_{g \in \mathcal{P}^\ell} \sum_{x_i \in C_{j,k}} \big(y_i - g(\pi_{j,k}(x_i))\big)^2; \qquad (1)$$
5: define the local estimators $\widehat f_{j,k} := T_M(\widehat g^\ell_{j,k} \circ \pi_{j,k})$ on $C_{j,k}$;
6: construct a uniform (see Section 3.1) or adaptive (see Section 3.2) partition $\Lambda$;
7: define the global estimator $\widehat f^\ell_\Lambda$ by summing the local estimators over the partition $\Lambda$:
$$\widehat f^\ell_\Lambda := \sum_{C_{j,k} \in \Lambda} \widehat f_{j,k}\, \mathbb{1}_{j,k}.$$
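The core of the local fit (steps 3–5) on a single cell can be sketched as follows; the polynomial feature map uses scikit-learn's PolynomialFeatures, the truncation follows the operator $T_M$ from the notation section, and the function and variable names are illustrative placeholders.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def fit_local_polynomial(x_cell, y_cell, c, V, ell, M):
    """Local estimator on one cell C_{j,k} (a sketch of Algorithm 1, steps 3-5).

    x_cell : (m, D) points in C_{j,k};  y_cell : (m,) responses;
    c, V   : GMRA center and (D, d) basis of the cell;
    ell    : polynomial order;  M : truncation level (assumes ||f||_inf <= M).
    Returns a function mapping new ambient points to truncated predictions.
    """
    pi = (x_cell - c) @ V                              # local d-dimensional coordinates
    feats = PolynomialFeatures(degree=ell)             # monomials of degree <= ell
    A = feats.fit_transform(pi)
    coef, *_ = np.linalg.lstsq(A, y_cell, rcond=None)  # least squares polynomial fit

    def predict(x_new):
        z = feats.transform((np.atleast_2d(x_new) - c) @ V)
        g = z @ coef
        return np.clip(g, -M, M)                       # truncation operator T_M

    return predict
```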

Uniform partitions
A first natural choice for $\Lambda$ is a uniform partition $\Lambda_j = \{C_{j,k}\}_{k \in \mathcal{K}_j}$. The bias $\|f - f^\ell_{\Lambda_j}\|$ decays at a rate depending on the regularity of $f$, which can be quantified as follows.
Definition 2 For $s > 0$, we say that $f \in \mathcal{A}^\ell_s$ if there exists a constant $|f|_{\mathcal{A}^\ell_s} > 0$ such that $\|f - f^\ell_{\Lambda_j}\| \le |f|_{\mathcal{A}^\ell_s}\, 2^{-js}$ for every $j \ge j_{\min}$ and every $\mathcal{T}$, where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
We capture the case where the bias is roughly the same on every cell with the following definition.
Definition 3 For $s > 0$, we say that $f \in \mathcal{A}^{\ell,\infty}_s$ if there exists a constant $|f|_{\mathcal{A}^{\ell,\infty}_s} > 0$ such that $\|(f - f^\ell_{\Lambda_j})\,\mathbb{1}_{j,k}\| \le |f|_{\mathcal{A}^{\ell,\infty}_s}\, 2^{-js}\, \sqrt{\rho(C_{j,k})}$ for every $j \ge j_{\min}$, $k \in \mathcal{K}_j$, and every $\mathcal{T}$, where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
Clearly $\mathcal{A}^{\ell,\infty}_s \subset \mathcal{A}^\ell_s$. These classes contain uniformly regular functions on manifolds, such as Hölder functions.
Example 1 Let $\mathcal{M}$ be a closed smooth $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. Consider a function $f : \mathcal{M} \to \mathbb{R}$. Hölder functions $\mathcal{C}^{\ell,\alpha}$ on $\mathcal{M}$, with $\ell \in \mathbb{N}$ and $\alpha \in (0,1]$, are defined as follows: $f \in \mathcal{C}^{\ell,\alpha}$ if, in every smooth chart, the $\ell$-order derivatives of $f$ exist and are Hölder continuous of order $\alpha$, i.e. they vary by at most a constant times $d(x,z)^\alpha$, $d(x,z)$ being the geodesic distance between $x$ and $z$. We will always assume to work at sufficiently fine scales, at which $d(x,z) \asymp \|z - x\|_{\mathbb{R}^D}$. Note that $\mathcal{C}^{\ell,1}$ is the space of $\ell$-times continuously differentiable functions on $\mathcal{M}$ with Lipschitz $\ell$-order derivatives. Functions of this type belong to the classes introduced above, with smoothness parameter determined by $\ell$ and $\alpha$. For functions in $\mathcal{A}^\ell_s$, Theorem 4 shows that the estimator built on the uniform partition $\Lambda_{j_\star}$, with $2^{-j_\star} \asymp (\log n / n)^{1/(2s+d)}$, attains an MSE of order $(\log n / n)^{2s/(2s+d)}$.
Example 2 Let $\mathcal{M}$ be a smooth closed Riemannian manifold isometrically embedded in $\mathbb{R}^D$, and let $\rho$ be the volume measure on $\mathcal{M}$. One may consider a function $g$ which is regular away from a singular set $\Gamma \subset \mathcal{M}$ of dimension $d_\Gamma < d$ (see Example 3 below).

Adaptive partitions
Theorem 4 is not fully satisfactory for two reasons: (i) the choice of the optimal scale requires knowledge of the regularity of the unknown function; (ii) no uniform scale can be optimal if the regularity of the function varies at different locations and scales. Inspired by Binev et al. (2005, 2007), we thus propose an adaptive estimator which learns near-optimal partitions from data, without knowing the possibly nonuniform regularity of the function. Adaptive partitions are selected by a criterion that determines whether or not a cell should be refined. The quantities involved in this selection are summarized in Table 2, along with their empirical counterparts. $\Delta^\ell_{j,k}$ measures the local difference in approximation between two consecutive scales: a large $\Delta^\ell_{j,k}$ suggests a significant reduction of error if we refine $C_{j,k}$ to its children. Intuitively, we should truncate the master tree to the subtree including the nodes where this quantity is large. However, if too few samples fall in a node, then the empirical counterpart $\widehat\Delta^\ell_{j,k}$ cannot be trusted. We thus proceed as follows. We set a threshold $\tau_n$ decreasing in $n$, and let $\mathcal{T}_n(\tau_n)$ be the smallest proper subtree of $\mathcal{T}_n$ containing all $C_{j,k}$'s for which $\widehat\Delta^\ell_{j,k} \ge \tau_n$. Crucially, $\tau_n$ may be chosen independently of the regularity of $f$ (see Theorem 9). We finally define our adaptive partition $\Lambda_n(\tau_n)$ as the partition associated with the outer leaves of $\mathcal{T}_n(\tau_n)$. The procedure is summarized in Algorithm 2.
Algorithm 2
1: compute the approximation difference $\widehat\Delta^\ell_{j,k}$ on every node $C_{j,k} \in \mathcal{T}_n$;
2: set the threshold $\tau_n := \kappa \sqrt{(\log n)/n}$;
3: select the smallest proper subtree $\mathcal{T}_n(\tau_n)$ of $\mathcal{T}_n$ containing all $C_{j,k}$'s with $\widehat\Delta^\ell_{j,k} \ge \tau_n$;
4: define the adaptive partition $\Lambda_n(\tau_n)$ associated with the outer leaves of $\mathcal{T}_n(\tau_n)$.
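A minimal sketch of Algorithm 2 on an explicit tree of cells is given below. The node structure, the way the $\widehat\Delta^\ell_{j,k}$ values are stored, and the helper names are assumptions made for illustration; the key steps are the threshold $\tau_n = \kappa\sqrt{\log n / n}$, the smallest proper subtree containing all nodes above threshold (every ancestor of a selected node is kept), and its outer leaves.

```python
import numpy as np

class Cell:
    """A node C_{j,k} of the data master tree T_n (illustrative structure)."""
    def __init__(self, delta, children=None):
        self.delta = delta                 # empirical refinement gain for this cell
        self.children = children or []
        self.parent = None
        for ch in self.children:
            ch.parent = self

def adaptive_partition(root, n, kappa):
    """Return the adaptive partition Lambda_n(tau_n) as a list of cells (Algorithm 2 sketch)."""
    tau_n = kappa * np.sqrt(np.log(n) / n)            # data-driven threshold

    # Smallest proper subtree T_n(tau_n): every node with delta >= tau_n plus its ancestors.
    kept = set()
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.delta >= tau_n:
            cur = node
            while cur is not None and cur not in kept:
                kept.add(cur)
                cur = cur.parent
    kept.add(root)                                     # the root always belongs to the subtree

    # Outer leaves: nodes not in the subtree whose parent is in the subtree.
    partition = [ch for node in kept for ch in node.children if ch not in kept]
    if not partition:                                  # the whole tree was kept: use its leaves
        partition = [node for node in kept if not node.children]
    return partition
```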
To provide performance guarantees for our adaptive estimator, we need to define a proper model class based on oracles. Given any master tree $\mathcal{T}$ satisfying assumptions (A1)–(A5) and a threshold $\tau > 0$, we let $\mathcal{T}(\tau)$ be the smallest proper subtree of $\mathcal{T}$ containing all the cells $C_{j,k}$ with $\Delta^\ell_{j,k} \ge \tau$. The partition made of the outer leaves of $\mathcal{T}(\tau)$ is denoted by $\Lambda(\tau)$.
Definition 5 For $s > 0$, let $p = 2d/(2s+d)$. We say that $f \in \mathcal{B}^\ell_s$ if $|f|_{\mathcal{B}^\ell_s} := \sup_{\mathcal{T}} \sup_{\tau > 0} \tau\, (\#\mathcal{T}(\tau))^{1/p} < \infty$, where $\mathcal{T}$ ranges over the set, assumed non-empty, of multiscale tree decompositions satisfying assumptions (A1)–(A5).
In general, the truncated tree $\mathcal{T}(\tau)$ grows as the threshold $\tau$ decreases. For elements of $\mathcal{B}^\ell_s$, we have control on the growth rate, namely $\#\mathcal{T}(\tau) \le |f|^p_{\mathcal{B}^\ell_s}\, \tau^{-p}$. The class $\mathcal{B}^\ell_s$ is indeed rich, and contains in particular $\mathcal{A}^{\ell,\infty}_s$, while additionally capturing functions of nonuniform regularity.

The proof is given in Appendix A.
Example 3 Let $g$ be the function in Example 2. Then $g \in \mathcal{B}^\ell_{d(d-d_\Gamma)/(2d_\Gamma)}$ for every $\ell = 0, 1, 2, \ldots$. Notice that $g \in \mathcal{A}^\ell_{(d-d_\Gamma)/2}$, so $g$ has a larger regularity parameter $s$ in the $\mathcal{B}^\ell_s$ model than in the $\mathcal{A}^\ell_s$ model.
We will also need a quasi-orthogonality condition ensuring that the functions $W^\ell_{j,k}$, representing the approximation difference between two scales, are almost orthogonal across scales.

Definition 7
We say that $f$ satisfies quasi-orthogonality of order $\ell$ with respect to the measure $\rho$ if there exists a constant $B_0 > 0$ such that, for any proper subtree $\mathcal{S}$ of any tree $\mathcal{T}$ satisfying assumptions (A1)–(A5),
$$\Big\|\sum_{C_{j,k} \in \mathcal{S}} W^\ell_{j,k}\Big\|^2 \le B_0 \sum_{C_{j,k} \in \mathcal{S}} \big\|W^\ell_{j,k}\big\|^2.$$
The following lemma shows that $f \in \mathcal{B}^\ell_s$, together with quasi-orthogonality, implies a certain approximation rate of $f$ by $f^\ell_{\Lambda(\tau)}$ as $\tau \to 0$. The proof is given in Appendix A.
Lemma 8 If $f \in \mathcal{B}^\ell_s \cap (L^\infty \cup \mathcal{A}^\ell_t)$ for some $s, t > 0$, and $f$ satisfies quasi-orthogonality of order $\ell$, then
$$\|f - f^\ell_{\Lambda(\tau)}\|^2 \lesssim |f|^p_{\mathcal{B}^\ell_s}\, \tau^{2-p} \lesssim |f|^2_{\mathcal{B}^\ell_s}\, (\#\Lambda(\tau))^{-2s/d}, \qquad p = \frac{2d}{2s+d}.$$
The main result of this paper is the following performance analysis of adaptive estimators, which is proved in Section 5: for functions in this model class, the adaptive estimator of Algorithm 2 attains a MSE of order $(\log n / n)^{2s/(2s+d)}$ with high probability (Theorem 9).
For a given accuracy $\varepsilon$, in order to achieve MSE $\varepsilon^2$, the number of samples we need is $n_\varepsilon \gtrsim (1/\varepsilon)^{\frac{2s+d}{s}} \log(1/\varepsilon)$. When $s$ is unknown, we can estimate it as follows: we fix a small $n_0$, and run Algorithm 2 with $2n_0, 4n_0, \ldots, 2^j n_0, \ldots$ samples. For each sample size, we evenly split the data into a training set used to build the adaptive estimator, and a test set used to evaluate the MSE. According to Theorem 9, the MSE scales as $(\log n / n)^{\frac{2s}{2s+d}}$. Therefore, the slope in the log-log plot of the MSE versus $n$ gives an approximation of $-2s/(2s+d)$. This could be formalized by a suitable adaptation of Lepski's method.
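The slope-based estimate of $s$ described above can be carried out with a simple least-squares fit in log-log coordinates; the sample sizes, MSE values, and helper name below are hypothetical placeholders.

```python
import numpy as np

def estimate_intrinsic_smoothness(sample_sizes, mse_values, d):
    """Estimate s from empirical MSEs at increasing sample sizes (a sketch).

    If MSE(n) ~ (log n / n)^(2s/(2s+d)), the slope of log MSE against
    log(log n / n) approximates 2s/(2s+d), from which s can be recovered.
    """
    ns = np.asarray(sample_sizes, dtype=float)
    xs = np.log(np.log(ns) / ns)
    ys = np.log(np.asarray(mse_values, dtype=float))
    slope, _ = np.polyfit(xs, ys, 1)                  # slope ~ 2s / (2s + d)
    return slope * d / (2 * (1 - slope))              # invert slope = 2s/(2s+d) for s

# Hypothetical usage: doubling sample sizes and their test MSEs.
# s_hat = estimate_intrinsic_smoothness([1000, 2000, 4000, 8000],
#                                       [0.030, 0.019, 0.012, 0.008], d=2)
```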

Computational considerations
The computational cost of Algorithms 1 and 2 may be split as follows:
Tree construction. The cover tree is an online algorithm in which a single-point insertion or removal costs at most $O(\log n)$. The total computational cost of the cover tree algorithm is $C^d D n \log n$, where $C > 0$ is a constant (Beygelzimer et al., 2006).
Local PCA. At every scale $j$, we perform local PCA on the training data restricted to $C_{j,k}$, for every $k \in \mathcal{K}_j$, using the randomized PCA algorithm (Halko et al., 2011). Recall that $n_{j,k}$ denotes the number of training points in $C_{j,k}$. The cost of local PCA at scale $j$ is of the order of $\sum_{k \in \mathcal{K}_j} D d\, n_{j,k} = D d n$, and there are at most $c \log n$ scales, where $c > 0$ is a constant, which gives a total cost of $c D d n \log n$.
Multiscale regression. Given $n_{j,k}$ training points on $C_{j,k}$, computing the low-dimensional coordinates $\pi_{j,k}(x_i)$ for all $x_i \in C_{j,k}$ costs $D d\, n_{j,k}$, and solving the linear least squares problem (1), where the matrix is of size $n_{j,k} \times d^\ell$, costs at most $n_{j,k} d^{2\ell}$. Hence, constructing the $\ell$-order polynomials at scale $j$ takes $\sum_{k \in \mathcal{K}_j} (D d\, n_{j,k} + d^{2\ell} n_{j,k}) = (Dd + d^{2\ell}) n$, and there are at most $c \log n$ scales, which sums up to $c(Dd + d^{2\ell}) n \log n$.
Adaptive approximation. We need to compute the coefficients $\Delta^\ell_{j,k}$ for every $C_{j,k}$, which costs $2(Dd + d^\ell)\, n_{j,k}$ on $C_{j,k}$, and $2c(Dd + d^\ell) n \log n$ for the whole tree.
In summary, the total cost of constructing GMRA adaptive estimators of order $\ell$ is $C^d D n \log n + (4Dd + d^{2\ell} + d^\ell)\, c\, n \log n$, which scales linearly with the number of samples $n$, up to a logarithmic factor.

Proofs
We analyze the error of our estimator by a bias-variance decomposition as in (2). We present the variance estimate in Section 5.1, the proofs for uniform approximations in Section 5.2, and for adaptive approximations in Section 5.3.

Variance estimate
The main quantities involved in the 0-order (piecewise constant) and the 1st-order (piecewise linear) estimators are summarized in Table 3.
Proposition 10 Suppose $\|f\|_\infty \le M$ and let $\ell \in \{0, 1\}$. For any partition $\Lambda$, let $f^\ell_\Lambda$ and $\widehat f^\ell_\Lambda$ be the optimal approximation and the empirical estimator of order $\ell$ on $\Lambda$, respectively. Then, for every $\eta > 0$, the probability $\mathbb{P}\{\|f^\ell_\Lambda - \widehat f^\ell_\Lambda\| > \eta\}$ admits an exponential bound in terms of $n$, $\eta$, and $\#\Lambda$, and therefore so does the variance term in (2), for some absolute constants $c_0, C_0$ and some $c_1, C_1$ depending on $\theta_2, \theta_3$.
Table 3: Local constant ($\ell = 0$) and linear ($\ell = 1$) estimators on $C_{j,k}$: oracles, estimators, and empirical counterparts. The truncation in $[\Lambda_{j,k}]_d$ has the effect of regularizing the least squares problem, which is ill-posed due to the small eigenvalues $\{\lambda^{j,k}_l\}_{l=d+1}^D$.

Proof of Theorem 4
Notice that $\#\Lambda_j \le 2^{jd}/\theta_1$ by (A3). We balance the bias and variance terms by choosing $j_\star$ such that $2^{-j_\star} = \mu \left(\frac{\log n}{n}\right)^{\frac{1}{2s+d}}$ for a suitable constant $\mu$. The probability estimate in Theorem 4 (a) then follows by combining the bias bound from Definition 2 with Proposition 10 applied to $\Lambda = \Lambda_{j_\star}$.

Proof of Theorem 9
We begin by defining several objects of interest:
• $\mathcal{T}_n$: the data master tree, whose leaves contain at least $d$ points of training data. It can be viewed as the part of a multiscale tree that our training data have explored.
• $\mathcal{T}$: a complete multiscale tree containing $\mathcal{T}_n$. $\mathcal{T}$ can be viewed as the union of $\mathcal{T}_n$ and some empty cells, mostly at fine scales with high probability, that our data have not explored.
• Suppose $\mathcal{T}_0$ and $\mathcal{T}_1$ are two subtrees of $\mathcal{T}$. If $\Lambda_0$ and $\Lambda_1$ are the two adaptive partitions associated with $\mathcal{T}_0$ and $\mathcal{T}_1$ respectively, we denote by $\Lambda_0 \vee \Lambda_1$ and $\Lambda_0 \wedge \Lambda_1$ the partitions associated with the trees $\mathcal{T}_0 \cup \mathcal{T}_1$ and $\mathcal{T}_0 \cap \mathcal{T}_1$, respectively.
• Let $b = 2a_{\max} + 5$, where $a_{\max}$ is the maximal number of children that a node has in $\mathcal{T}_n$.
Inspired by the analysis of wavelet thresholding procedures (Binev et al., 2005, 2007), we split the error into four terms $e_1, e_2, e_3, e_4$. The goal of the splitting is to handle the bias and variance separately, as well as to deal with the fact that the partition built from those $C_{j,k}$ such that $\widehat\Delta^\ell_{j,k} \ge \tau_n$ does not coincide with the partition which would be chosen by an oracle based on those $C_{j,k}$ such that $\Delta^\ell_{j,k} \ge \tau_n$. This is accounted for by the terms $e_2$ and $e_4$, which correspond to those $C_{j,k}$ for which $\widehat\Delta^\ell_{j,k}$ is significantly larger, respectively smaller, than $\Delta^\ell_{j,k}$, and which will be proved to be small in probability. The terms $e_1$ and $e_3$ correspond to the bias and variance of oracle estimators based on partitions obtained by thresholding the unknown oracle change in approximation $\Delta^\ell_{j,k}$. Since $\Lambda_n(\tau_n) \vee \Lambda_n(b\tau_n)$ is a finer partition than $\Lambda_n(b\tau_n)$, the term $e_1$ can be bounded by the corresponding approximation error on $\Lambda_n(b\tau_n)$. The term $e_{12}$ accounts for the error on the cells that have not been explored by our training data, and is small. According to (6), we have $(\Delta^\ell_{j,k})^2 \le 4\|f\|_\infty^2\, \rho(C_{j,k})$. Then every $C_{j,k}$ with $\Delta^\ell_{j,k} \ge b\tau_n$ satisfies $\rho(C_{j,k}) \gtrsim b^2 \kappa^2 \|f\|_\infty^{-2} (\log n / n)$. Hence, provided that $n$ satisfies $b^2 \kappa^2 \|f\|_\infty^{-2} \log n \ge 2d$, the probability $\mathbb{P}\{\Delta^\ell_{j,k} \ge b\tau_n \text{ and } \rho(C_{j,k}) < d/n\}$ can be bounded using Lemma 11(b). Therefore, by Definition 5 we obtain the corresponding bound on the bias term. To estimate $\mathbb{E} e_{12}^2$, we observe that, thanks to Lemma 14, the contribution of each unexplored cell is uniformly controlled; hence, by choosing $\nu = 1 > \frac{2s}{2s+d}$ we get the desired bound. The term $e_3$ is the variance term, which can be estimated by Proposition 10 with $\Lambda = \Lambda_n(\tau_n) \wedge \Lambda_n(\tau_n/b)$. We plug in $\eta = r(\log n / n)^{\frac{s}{2s+d}}$. Bounding $\#\Lambda$ by $\#\Lambda_n(\tau_n/b) \le \#\Lambda_n \le n/d$ (as our data master tree has at least $d$ points in each leaf) outside the exponential, and by $\#\Lambda_n(\tau_n/b) \le \#\Lambda(\tau_n/b) \le |f|^p_{\mathcal{B}^\ell_s}(\tau_n/b)^{-p}$ inside the exponential, we get an estimate for $\mathbb{P}\big\{e_3 > r(\log n / n)^{\frac{s}{2s+d}}\big\}$ involving constants $C_0 = C_0(\theta_2, \theta_3, a_{\max}, d, s, |f|_{\mathcal{B}^\ell_s}, \kappa)$ and $C_1 = C_1(\theta_2, \theta_3, a_{\max}, d, s, |f|_{\mathcal{B}^\ell_s}, \kappa)$. We obtain $\mathbb{P}\{e_3 > r(\log n/n)^{\frac{s}{2s+d}}\} \le C n^{-\nu}$ as long as $r$ is chosen large enough to make the exponent smaller than $-\nu$.
Next we estimate $e_2$ and $e_4$. Since $\mathcal{T}_n(\tau_n) \cap \mathcal{T}_n(\tau_n/b) \subseteq \mathcal{T}_n(\tau_n) \cup \mathcal{T}_n(b\tau_n)$ and $\mathcal{T}_n(b\tau_n) \subseteq \mathcal{T}_n(\tau_n/b)$, we have $e_2 > 0$ if and only if there is a $C_{j,k} \in \mathcal{T}_n$ such that either $C_{j,k}$ is in $\mathcal{T}_n(\tau_n)$ but not in $\mathcal{T}_n(\tau_n/b)$, or $C_{j,k}$ is in $\mathcal{T}_n(b\tau_n)$ but not in $\mathcal{T}_n(\tau_n)$. This means that either $\widehat\Delta^\ell_{j,k} \ge \tau_n$ but $\Delta^\ell_{j,k} < \tau_n/b$, or $\Delta^\ell_{j,k} \ge b\tau_n$ but $\widehat\Delta^\ell_{j,k} < \tau_n$. As a consequence,
$$\mathbb{P}\{e_2 > 0\} \le \sum_{C_{j,k} \in \mathcal{T}_n} \mathbb{P}\big\{\widehat\Delta^\ell_{j,k} \ge \tau_n \text{ and } \Delta^\ell_{j,k} < \tau_n/b\big\} + \sum_{C_{j,k} \in \mathcal{T}_n} \mathbb{P}\big\{\Delta^\ell_{j,k} \ge b\tau_n \text{ and } \widehat\Delta^\ell_{j,k} < \tau_n\big\},$$
We can now apply Lemma 15: we use (b) with $\eta = \tau_n/b$, and (a) with $\eta = \tau_n$. We obtain $\mathbb{P}\{e_2 > 0\} + \mathbb{P}\{e_4 > 0\} \le C n^{-\nu}$, provided that $\kappa$ is chosen such that the exponents are smaller than $-\nu$.
We are left to deal with the expectations. As for $e_2$, Lemma 14 implies $e_2 \lesssim M$, which gives, for $\nu = 1 > \frac{2s}{2s+d}$, the desired bound on $\mathbb{E} e_2^2$. The same bound holds for $e_4$, which concludes the proof of Theorem 9.

Basic concentration inequalities
This section contains the main concentration inequalities relating the empirical quantities to their oracles. For piecewise linear estimators, some quantities used in Lemma 13 are decomposed in Table 4. All proofs are collected in Appendix A.
Lemma 11 For every t > 0 we have:

Lemma 12
We have: where c is an absolute constant.
Lemma 14 Suppose $f \in L^\infty$. For every $C_{j,k} \in \mathcal{T}$ and $C_{j',k'} \subset C_{j,k}$,
Lemma 15 Suppose $f \in L^\infty$. For every $\eta > 0$ and any $\gamma > 1$, we have
Here $C_0, c_0$ depend on $a_{\max}$; $c'_0$ depends on $a_{\max}, \theta_2$; $C_1, c_1$ depend on $a_{\max}, \theta_2, \theta_3$.
Table 4: oracle estimators and their empirical counterparts.
Proof [Lemma 8] For any partition $\Lambda \subset \mathcal{T}$, denote by $\Lambda^l$ the $l$-th generation partition, such that $\Lambda^0 = \Lambda$ and $\Lambda^{l+1}$ consists of the children of $\Lambda^l$. We first prove that $\lim_{l \to \infty} f^\ell_{\Lambda^l} = f$ in $L^2(\mathcal{M})$. Suppose $f \in L^\infty$. Notice that $\|f^\ell_{\Lambda^l} - f\| \le \|f^0_{\Lambda^l} - f\|$. As a result of the Lebesgue differentiation theorem, $f^0_{\Lambda^l} \to f$ almost everywhere. Since $f$ is bounded, $f^0_{\Lambda^l}$ is uniformly bounded, hence $f^0_{\Lambda^l} \to f$ in $L^2(\mathcal{M})$ by the dominated convergence theorem. In the case where $f \in \mathcal{A}^\ell_t$, taking the uniform partition $\Lambda_{j(l)}$ at the coarsest scale of $\Lambda^l$, denoted by $j(l)$, the definition of $\mathcal{A}^\ell_t$ gives an error of order $2^{-j(l)t}$, and therefore $f^\ell_{\Lambda^l} \to f$ in $L^2(\mathcal{M})$. Now, setting $\Lambda = \Lambda(\tau)$ and $\mathcal{S} := \mathcal{T}(\tau) \setminus \Lambda$, by Definitions 5 and 7 we obtain the first inequality in Lemma 8. The second inequality follows by observing that $2 - p = \frac{2s}{d}\, p$ and, by Definition 5, $\#\Lambda(\tau) \lesssim |f|^p_{\mathcal{B}^\ell_s}\, \tau^{-p}$, so that $|f|^p_{\mathcal{B}^\ell_s}\, \tau^{2-p} \lesssim |f|^2_{\mathcal{B}^\ell_s}\, (\#\Lambda(\tau))^{-2s/d}$.