Adaptive and minimax estimation of the cumulative distribution function given a functional covariate

We consider the nonparametric kernel estimation of the conditional cumulative distribution function given a functional covariate. Given the bias-variance trade-off of the risk, we first propose a totally data-driven bandwidth selection device in the spirit of the recent Goldenshluger-Lepski method and of model selection tools. The resulting estimator is shown to be adaptive and minimax optimal: we establish nonasymptotic risk bounds and compute rates of convergence under various assumptions on the decay of the small ball probability of the functional variable. We also prove lower bounds. Both pointwise and integrated criteria are considered. Finally, the choice of the norm or semi-norm involved in the definition of the estimator is also discussed, as well as the projection of the data on finite dimensional subspaces. Numerical results illustrate the method.


Introduction
The aim of Functional Data Analysis (FDA) is to analyse information on curves or functions. This field has attracted a lot of attention over the past decades, thanks to its numerous applications. We refer to Ramsay and Silverman (2005) for case studies and Ferraty and Romain (2011) for a recent overview. In many situations, it is of interest to understand the link between a scalar quantity Y and a functional random variable X. For instance, it has been pointed out (Glaser et al., 2001, 2013) that the probability of occurrence of cerebral edema in children treated for diabetic ketoacidosis depends on the evolution over time of the drugs and fluids administered (and not only on their quantity). Another example of application consists in explaining the link between the chemical composition of a sample of material and its spectrometric curve (such a problem, related to the food industry, is illustrated in the numerical study).
We suppose that the random variable X takes values in a separable infinite-dimensional Hilbert space $(H, \langle\cdot,\cdot\rangle, \|\cdot\|)$. The latter can be $L^2(I)$, the set of square-integrable functions on a subset I of $\mathbb{R}$, or a Sobolev space. The link between the predictor X and the response Y is classically described by regression analysis. However, this can also be achieved by estimating the entire conditional distribution of the variable Y given X. The target function we want to recover is the conditional cumulative distribution function F (conditional c.d.f. in the sequel) of Y given X, defined by
(1) $F^x(y) := P(Y \le y \mid X = x), \quad (x, y) \in H \times \mathbb{R}.$
To estimate it, we have access to a data sample $\{(X_i, Y_i), i = 1, \dots, n\}$ distributed like the couple (X, Y).
In the sequel, we consider kernel estimators similar to the ones defined by Ferraty et al. (2006, 2010), for which we provide a detailed nonasymptotic adaptive and minimax study. The pioneering works on conditional distribution estimation when the covariate is functional are those of Ferraty and Vieu (2002), completed by Ferraty et al. (2010). Kernel estimators, which depend on a smoothing parameter, the so-called bandwidth, are built to address several estimation problems: regression function, conditional c.d.f., conditional density and its derivatives, conditional hazard rate, conditional mode and quantiles. A lot of research has then been carried out to extend or adapt the previous procedures to various statistical models. For instance, the estimation of the regression function is studied by Dabo-Niang and Rhomari (2009), among others. The case of dependent data is the subject of the works of Masry (2005); Aspirot et al. (2009); Laib and Louani (2010); Dabo-Niang et al. (2012) under several assumptions (α-mixing, ergodic or non-stationary processes). Demongeot et al. (2010) consider local-linear estimators of the conditional density and conditional mode. Robust versions of the previous strategies are proposed by Crambes et al. (2008); Azzedine et al. (2008); Gheriballah et al. (2013). Gijbels et al. (2012) investigate the estimation of the dependence between two variables conditionally on a functional covariate through copula modelling. Most of this literature focuses on asymptotic results (almost-complete convergence, asymptotic normality, ...), and bias-variance decompositions are provided. Few papers tackle the problem of bandwidth selection: Benhenni et al. (2007), among others, have studied global or local cross-validation procedures which are shown to be asymptotically optimal in regression contexts. Recently, a Bayesian criterion has been investigated from a numerical point of view by Shang (2013).
To our knowledge, adaptive estimation procedures in a nonasymptotic framework can only be found in conditional distribution estimation with real or multivariate covariates. We refer to Brunel et al. (2010) and Plancade (2013) for c.d.f estimation with a real covariate and to Akakpo and Lacour (2011) and references therein for conditional density estimation with a multivariate covariate. Nevertheless, these works are based on projection estimators which cannot be extended directly to a functional framework in a nonparametric setting.
In keeping with the studies of functional conditional distribution, we investigate the properties of the nonparametric Nadaraya-Watson-type estimators studied in the works cited above, but with a new perspective, only used so far for real and multivariate covariates. To estimate the c.d.f. defined by (1), we consider
(2) $\widehat F_h^x(y) := \sum_{i=1}^n W_h^{(i)}(x)\, \mathbf{1}_{\{Y_i \le y\}}, \quad \text{where } W_h^{(i)}(x) := \frac{K_h(d(X_i, x))}{\sum_{j=1}^n K_h(d(X_j, x))},$
for any $(x, y) \in H \times \mathbb{R}$, with d a general semi-metric on the Hilbert space H, $K_h : t \mapsto K(t/h)/h$, for K a kernel function (that is, $\int_{\mathbb{R}} K(t)\,dt = 1$) and h a parameter to be chosen, the so-called bandwidth. We choose the kernel K of type I: its support is in [0, 1] and there exist two constants $c_K, C_K > 0$ such that $c_K \mathbf{1}_{[0,1]} \le K \le C_K \mathbf{1}_{[0,1]}$. We focus on the metric associated with the norm of the Hilbert space,
(3) $d(x, x') := \|x - x'\|, \quad x, x' \in H.$
The main goal is to define a fully data-driven selection rule for the bandwidth h, which satisfies nonasymptotic adaptive results. The criterion we propose draws inspiration from both the so-called Lepski method (see the recent paper of Goldenshluger and Lepski 2011) and model selection tools. We show that the bias-variance trade-off is realized and that the selected estimator automatically adapts to the unknown regularity of the target function. As usual, the variance term of the risk depends on asymptotic properties of the small ball probability $\varphi(h) = P(d(X, 0) \le h)$ as h → 0. The behaviour of the small ball probability is a difficult problem which is still the subject of research. We compute precise rates for our estimator under several assumptions on the distribution of the process X, fulfilled e.g. by a large class of Gaussian processes. Consistently with previous works, the rates we obtain are quite slow. However, we prove that they are minimax optimal. The results are also shown to be coherent with the lower bounds computed by Mas (2012) for the estimation of the regression function.
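To make the definition concrete, here is a minimal Python sketch of a Nadaraya-Watson-type estimate of $y \mapsto F^{x_0}(y)$ for curves observed on a common grid. It is an illustration under stated assumptions, not the authors' implementation: the function names, the trapezoidal approximation of the $L^2$ norm and the choice of the uniform kernel are all hypothetical choices made for this example.

```python
import numpy as np

def l2_distance(x1, x2, grid):
    """L2 distance between two curves sampled on a common grid (trapezoidal rule)."""
    diff2 = (x1 - x2) ** 2
    return np.sqrt(np.sum((diff2[1:] + diff2[:-1]) * np.diff(grid)) / 2.0)

def uniform_kernel(t):
    """A type I kernel: the indicator of [0, 1] (so c_K = C_K = 1)."""
    return ((t >= 0.0) & (t <= 1.0)).astype(float)

def conditional_cdf(x0, y_grid, X, Y, grid, h):
    """Kernel estimate of y -> F^{x0}(y) with bandwidth h.

    X: (n, T) array of discretized curves, Y: (n,) responses.
    Returns NaN values when no curve lies within distance h of x0.
    """
    dists = np.array([l2_distance(xi, x0, grid) for xi in X])
    k = uniform_kernel(dists / h) / h               # K_h(d(X_i, x0))
    total = k.sum()
    if total == 0.0:
        return np.full(len(y_grid), np.nan)
    weights = k / total                             # normalized kernel weights
    indicators = (Y[:, None] <= y_grid[None, :]).astype(float)
    return weights @ indicators
```

With the uniform kernel, the estimate is simply the empirical c.d.f. of the responses whose curves fall in the ball of radius h around $x_0$, which is why the small ball probability drives the variance.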
To bypass the difficulties inherent to the infinite dimensional nature of the data, some researchers (see e.g. Masry 2005; Geenens 2011) have suggested replacing the norm $\|\cdot\|$ in the definition of the estimator (2) by a semi-norm. The case of projection semi-norms has received particular attention. In that case the estimator can be redefined as
(4) $\widehat F_{h,p}^x(y) := \sum_{i=1}^n W_{h,p}^{(i)}(x)\, \mathbf{1}_{\{Y_i \le y\}}, \quad \text{where } W_{h,p}^{(i)}(x) := \frac{K_h(d_p(X_i, x))}{\sum_{j=1}^n K_h(d_p(X_j, x))},$
$d_p(x, x') := \sqrt{\sum_{j=1}^p \langle x - x', e_j\rangle^2}$ and $(e_j)_{j \ge 1}$ is a basis of H. Defining this estimator amounts to projecting the data onto a p-dimensional space. We show that it does not improve the convergence rates of the Nadaraya-Watson estimator since the lower bounds are still valid. In order to understand what is going on, we briefly study a bias-variance decomposition of the risk of this estimator.
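The projection semi-metric can be sketched in a few lines of Python. This is only an illustration: the sine basis below is the one used later in the numerical study, while the function names and the trapezoidal quadrature are assumptions of this example.

```python
import numpy as np

def sine_basis(p, grid):
    """First p functions sqrt(2) sin(pi (j - 0.5) t), orthonormal in L2([0, 1])."""
    j = np.arange(1, p + 1)[:, None]
    return np.sqrt(2.0) * np.sin(np.pi * (j - 0.5) * grid[None, :])

def projection_scores(x, basis, grid):
    """Inner products <x, e_j>, j = 1..p, approximated by the trapezoidal rule."""
    prod = basis * x[None, :]
    return np.sum((prod[:, 1:] + prod[:, :-1]) * np.diff(grid)[None, :], axis=1) / 2.0

def d_p(x1, x2, basis, grid):
    """Projection semi-metric: Euclidean distance between the first p scores."""
    return float(np.linalg.norm(projection_scores(x1 - x2, basis, grid)))
```

Two distinct curves that differ only along directions $e_j$ with $j > p$ are at distance zero for $d_p$, which is the source of the additional bias discussed later for the projected estimator.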
The paper is organized as follows: in Section 2, we provide a bias-variance decomposition of the estimator (2) in terms of two criteria, a pointwise and an integrated risk. The bandwidth h is shown to influence significantly the quality of estimation. In Section 3, we define a bandwidth selection criterion achieving the best bias-variance trade-off. Rates of convergence of the resulting estimator are computed in Section 4. To ensure that these rates are optimal, we also prove lower bounds. Properties of the estimator defined with a projection semi-metric are investigated in Section 5. The results are illustrated via simulations and examples in Section 6. Finally, the proofs are gathered in Section 7, and some details are postponed to the Appendix (Section A).
Integrated and pointwise risk of an estimator with fixed bandwidth
2.1. Considered risks. We consider two types of risks for the estimation of $(x, y) \mapsto F^x(y)$. Both are mean integrated squared errors with respect to the response variable y.
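The first criterion is a pointwise risk in x, integrated in y: for a fixed $x_0 \in H$ and D a compact subset of $\mathbb{R}$, we consider $E[\|\widehat F_h^{x_0} - F^{x_0}\|_D^2]$ where, keeping in mind that the Hilbert norm of H is $\|\cdot\|$, we set $\|f\|_D^2 := \int_D f(t)^2\,dt$. We also denote by $|D| := \int_D dt$ the Lebesgue measure of the set D.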
Next, we introduce a second criterion, which is an integrated risk with respect to the product of the Lebesgue measure on $\mathbb{R}$ and the probability measure $P_X$ of X, defined by
(5) $E[\|\widehat F_h^{X'} - F^{X'}\|_D^2\, \mathbf{1}_B(X')],$
where X′ is a copy of X independent of the data sample and B is a subset of H. The motivation for studying the two risks is twofold. First, in practice, we can either be interested in the estimation of $F^{X_{n+1}}$, where $X_{n+1}$ is a copy of X independent of the sample, or in estimating the c.d.f. conditionally on $X = x_0$, where $x_0$ is a point chosen in advance. Such an approach is rather classical in functional linear regression (Ramsay and Silverman 2005; Cardot et al. 1999) where either the prediction error on random curves (Crambes et al., 2009) or the prediction error over a fixed curve (Cai and Hall 2006) is considered. Second, integrated risks have been relatively unexplored in nonparametric functional data analysis. Indeed, there is no universally accepted reference measure playing the role of the Lebesgue measure in the finite-dimensional setting (see e.g. Delaigle and Hall 2010; Dabo-Niang and Yao 2013). The only measure at hand is the probability measure of X.

2.2. Assumptions. Hereafter, we denote by $\varphi^x$ the shifted small ball probability: $\varphi^x(h) := P(\|X - x\| \le h)$ for $h > 0$ and $x \in H$, so that $\varphi = \varphi^0$. Hereafter the notation $P_{X'}$ (resp. $E_{X'}$, $\mathrm{Var}_{X'}$) stands for the conditional probability (resp. expectation, variance) given X′. For simplicity, we assume that the curve X is centred. We also consider the following assumptions. The first one is related to the choice of the kernel, the two following are regularity assumptions for the function to estimate and the process X.
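H_F There exists β ∈ (0, 1) such that F belongs to the functional space $\mathcal{F}_\beta$, the class of the maps $(x, y) \in H \times \mathbb{R} \mapsto F^x(y)$ such that: there exists a constant $C_D > 0$ such that, for all $x, x' \in H$, $\|F^x - F^{x'}\|_D \le C_D \|x - x'\|^\beta$.
H_ϕ There exist two constants $c_\varphi, C_\varphi > 0$ such that, for all h > 0, $c_\varphi\, \varphi(h) \mathbf{1}_B(X') \le \varphi^{X'}(h) \mathbf{1}_B(X') \le C_\varphi\, \varphi(h) \mathbf{1}_B(X')$ almost surely, where X′ is an independent copy of X.
Assumption H_F is a Hölder-type regularity condition on the map $x \mapsto F^x$. This type of condition is natural in kernel estimation. It is very similar to Assumption (H2) of Ferraty et al. (2006) or Assumption (H2') of Ferraty et al. (2010). Note, however, that no regularity condition on the map $y \mapsto F^x(y)$ is required here. A similar phenomenon appears for the estimation of the c.d.f. when the covariate is real: for instance, the convergence rate given by Brunel et al. (2010, Corollary 1) only depends on the regularity of F with respect to x.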
Assumption H_ϕ is very similar to assumptions made by Burba et al. (2009) and Ferraty et al. (2010). This condition is reasonable, since the class of Gaussian processes fulfills it provided that B is a bounded subset of H. Indeed, the upper bound is verified with $C_\varphi = 1$ thanks to Anderson's Inequality (see Anderson 1955 and also Li and Shao 2001), and from Hoffmann-Jørgensen et al. (1979, Theorem 2.1, p.322) we know that the lower bound is verified with $c_\varphi := e^{-R^2/2}$ where $R := \max\{\|x\|, x \in B\}$.
2.3. Upper bound. Under the assumptions above we are able to obtain a nonasymptotic upper bound for the risk, proved in Section 7.2.
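Theorem 1. Suppose that Assumption H_F is fulfilled and let h > 0 be fixed.
(i) For all $x_0 \in H$,
(6) $E[\|\widehat F_h^{x_0} - F^{x_0}\|_D^2] \le C \left( h^{2\beta} + \frac{1}{n \varphi^{x_0}(h)} \right),$
where C > 0 only depends on $c_K$, $C_K$, |D| and $C_D$.
(ii) If, in addition, Assumption H_ϕ is fulfilled,
(7) $E[\|\widehat F_h^{X'} - F^{X'}\|_D^2\, \mathbf{1}_B(X')] \le C \left( h^{2\beta} + \frac{1}{n \varphi(h)} \right),$
where C > 0 only depends on $c_K$, $C_K$, $c_\varphi$, $C_\varphi$, |D| and $C_D$.
The first term of the right-hand-side of inequalities (6) and (7) corresponds to a bias term, and the second is a variance term, which increases when h goes to 0 (since $\varphi^{x_0}(h)$ and $\varphi(h)$ decrease to 0 when h → 0). Note that the upper bounds are very similar to the results of Ferraty et al. (2006, Theorem 3.1) and Ferraty et al. (2010, Corollary 3). However, we do not have an extra ln n factor in the variance term.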
We deduce from Theorem 1 that the usual bias-variance trade-off must be made if one wants to choose h in a family of possible bandwidths. The ideal compromise h* is called the oracle, and is defined by
(8) $h^* := \arg\min_h \left\{ h^{2\beta} + \frac{1}{n \varphi(h)} \right\}.$
It cannot be used as an estimator since it depends both on the unknown regularity index β of F and on the rate of decrease of the small ball probability ϕ(h) of X to 0. The challenge is to propose a fully data-driven method to perform the trade-off.
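As a worked illustration of this trade-off, consider the hypothetical situation where the small ball probability is exactly polynomial, $\varphi(h) = c\,h^{\gamma}$ with $c, \gamma > 0$, and where the risk bound is the sum of a bias term $h^{2\beta}$ and a variance term $1/(n\varphi(h))$. Both the polynomial form and the continuous minimization over all $h > 0$ are simplifying assumptions of this sketch.

```latex
\[
g(h) := h^{2\beta} + \frac{1}{c\,n\,h^{\gamma}}, \qquad
g'(h) = 2\beta h^{2\beta-1} - \frac{\gamma}{c\,n\,h^{\gamma+1}} = 0
\iff h^{2\beta+\gamma} = \frac{\gamma}{2\beta c\,n},
\]
\[
h^{*} = \Big(\frac{\gamma}{2\beta c\,n}\Big)^{1/(2\beta+\gamma)}, \qquad
g(h^{*}) \asymp n^{-2\beta/(2\beta+\gamma)} .
\]
```

The minimizer depends on both β and γ, which is exactly why the oracle is not available in practice. When ϕ(h) decays faster than any power of h, the same balance forces a much larger bandwidth and hence a much slower rate.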

Adaptive estimation
In this section, we focus on the integrated risk. We refer to Remark 1 below for the extension of the results for the pointwise criterion.
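3.1. Bandwidth selection. We have at our disposal the estimators $\widehat F_h$ defined by (2) for any h > 0. Let $H_n$ be a finite collection of bandwidths, with cardinality depending on n and properties specified below. For any $h \in H_n$, an empirical version of the small ball probability $\varphi(h) = P(\|X\| \le h)$ is
(9) $\widehat\varphi(h) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{\|X_i\| \le h\}}.$
For any $h \in H_n$, we define (10) two quantities: A(h), which compares the estimators with bandwidths h′ and h ∨ h′ over $h' \in H_n$, and $\widehat V(h) := \kappa \ln(n)/(n \widehat\varphi(h))$, where κ is a constant specified in the proofs which depends neither on h, nor on n, nor on $F^{X'}$. The quantity $\widehat V(h)$ is an estimator of the upper bound for the variance term (see (7)) and A(h) is proved to be an approximation of the bias term (see Lemma 5). This motivates the following choice of the bandwidth:
(11) $\widehat h := \arg\min_{h \in H_n} \{ A(h) + \widehat V(h) \}.$
The selected estimator is $\widehat F_{\widehat h}$.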
This selection rule is inspired both by the recent version of the so-called Lepski method (see Goldenshluger and Lepski 2011) and by model selection tools. The main idea is to estimate the bias term by looking at several estimators. Goldenshluger and Lepski (2011) use auxiliary estimators, based on a convolution product of the kernel with the estimators with fixed bandwidths. However, this can only be done when the bias of the estimator is written as the convolution product of the kernel with the target function. Since this is not the case in our problem, we perform the bandwidth selection with (10). This is analogous to the procedure proposed by Chagny (2013a) or Comte and Johannes (2012) for model selection purposes. Thus, $\widehat V(h)$ can also be seen as a penalty term. We also refer to the PhD thesis of Chagny (2013b, p.170) for technical details leading to this choice. Finally, let us notice that a criterion based on the maximum h ∨ h′ also appears in Kerkyacharian et al. (2001), and more recently, similar ideas are used in Goldenshluger and Lepski (2013).
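The selection rule can be sketched as follows in Python. This is a hedged illustration, not the authors' code: the bias proxy A(h) below uses the usual Goldenshluger-Lepski form, the positive part of the maximum over h′ of the squared distance between the estimators with bandwidths h′ and h ∨ h′ minus the penalty at h′, which is an assumption of this example; the function names, the evaluation at a single point and the trapezoidal quadrature are hypothetical too.

```python
import numpy as np

def sq_norm(f, y_grid):
    """Squared L2 norm over D of a function sampled on y_grid (trapezoidal rule)."""
    f2 = np.asarray(f, dtype=float) ** 2
    return float(np.sum((f2[1:] + f2[:-1]) * np.diff(y_grid)) / 2.0)

def empirical_small_ball(norms, bandwidths):
    """phi_hat(h): proportion of curves with ||X_i|| <= h, for each h."""
    return np.array([np.mean(norms <= h) for h in bandwidths])

def select_bandwidth(estimates, phi_hat, bandwidths, y_grid, n, kappa):
    """Return (h_hat, A, V) with h_hat minimizing A(h) + V(h) over the collection.

    estimates[k] holds the estimated c.d.f. on y_grid for bandwidths[k];
    phi_hat must be positive for every bandwidth of the collection.
    """
    bandwidths = np.asarray(bandwidths, dtype=float)
    V = kappa * np.log(n) / (n * phi_hat)            # penalty (variance proxy)
    A = np.zeros(len(bandwidths))                    # starting at 0 = positive part
    for k, h in enumerate(bandwidths):
        for kp, hp in enumerate(bandwidths):
            m = kp if hp >= h else k                 # index of max(h, h')
            gap = sq_norm(estimates[kp] - estimates[m], y_grid) - V[kp]
            A[k] = max(A[k], gap)
    return bandwidths[int(np.argmin(A + V))], A, V
```

In this sketch a penalty constant equal to zero always selects the smallest bandwidth, while identical estimators (no visible bias) lead to the bandwidth with the smallest penalty, that is the largest one.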
3.2. Theoretical results. To prove our main results, we consider the following hypothesis, in addition to the assumptions defined in Section 2.2.
H_b The collection $H_n$ of bandwidths is such that:
H_b1 its cardinality is bounded by n,
H_b2 for any $h \in H_n$, $\varphi(h) \ge C_0 \ln(n)/n$, where $C_0 > 0$ is a purely numerical constant (specified in the proofs).

Remark 1.
• Assumption H_b1 fixes the size of the bandwidth collection: compared to the assumptions of Goldenshluger and Lepski (2011), we consider a discrete set and not an interval, which allows us to use the classical tools of model selection theory in the proofs.
• In practice, it is impossible to verify Assumption H_b2 since the function ϕ is unknown. However, this difficulty can be circumvented by introducing a random collection of bandwidths $\widehat H_n$ verifying, for all $h \in \widehat H_n$, $\widehat\varphi(h) \ge 2C_0 \ln(n)/n$, where $\widehat\varphi$ is an estimator of ϕ (see Equation (9)). Since this does not add significant difficulty (see Comte and Johannes 2012; Brunel et al. 2013) but would make the proofs harder to read, we choose to keep Assumption H_b2.
We now state the following nonasymptotic bound for the maximal risk over the class F β .
Theorem 2. Assume H_ϕ, H_b and that n ≥ 3. There exist two constants c, C > 0 depending on $c_K$, $C_K$, $c_\varphi$, $C_\varphi$, |D|, $C_D$ and κ such that
(12) $\sup_{F \in \mathcal{F}_\beta} E[\|\widehat F_{\widehat h}^{X'} - F^{X'}\|_D^2\, \mathbf{1}_B(X')] \le c \min_{h \in H_n} \left\{ h^{2\beta} + \frac{\ln(n)}{n \varphi(h)} \right\} + \frac{C}{n}.$
The optimal bias-variance compromise is reached by the estimator, which is thus adaptive with respect to the unknown smoothness of the target function F. The selected bandwidth $\widehat h$ performs as well as the unknown oracle h* defined in (8), up to the multiplicative constant c, up to a remainder term of order 1/n which is negligible, and up to the ln(n) factor. This extra quantity also appears in the term $\widehat V(h)$. The loss is due to adaptation. In Section 4, we prove that it does not affect the convergence rates of the estimator, which is nevertheless optimal in the minimax sense in most of the cases.
The proof of Theorem 2 is mainly based on model selection tools, specifically concentration inequalities (see Section 7.3). A specific difficulty comes from the fact that the variance term in (7) depends on the unknown distribution of X, through its small ball probability. Thus, the penalty term $V(h) = \kappa \ln(n)/(n\varphi(h))$, which would classically be used, cannot be computed in practice. This explains why we plug (9), an estimator of ϕ(h), into $\widehat V(h)$. However, for the sake of clarity, we begin the proof by establishing the result with $\widehat V(h)$ replaced by its theoretical counterpart $V(h) = \kappa \ln(n)/(n\varphi(h))$. Notice also that we could build an adaptive estimator for the pointwise risk. To do so, replace

Minimax rates
In this section, we compute the convergence rate of the oracle $\widehat F_{h^*}$ with h* defined by (8), the rate of the selected estimator $\widehat F_{\widehat h}$, and prove lower bounds for the conditional c.d.f. estimation problem under various assumptions on the rate of decrease of the small ball probability of the covariate X.
4.1. Small ball probabilities. The computation of the oracle h*, as well as the computation of the minimum in the right-hand-side of (12), requires fixing conditions on the rate of decrease of the small ball probability ϕ(h). The choice of the assumptions is crucial and determines the rates of convergence of our estimators. Small ball problems have attracted considerable interest and attention in the past decades. Many studies compute lower and upper bounds for ϕ(h) for particular types of process X. While much attention has been given to Gaussian processes (see for example the clear account provided by Li and Shao 2001), systematic studies have also been undertaken to handle the general case of (infinite) sums of independent random variables (Lifshits 1997; Dunker et al. 1998; Mas 2012). We consider in the sequel one of the following three hypotheses, which are frequently used in the literature. This allows us to understand how the decay of the small ball probability influences the rates (see Section 4.2). We describe below large classes of processes for which they are fulfilled.
Such inequalities are heavily connected with the rate of decrease of the eigenvalues of the covariance operator $\Gamma : f \in H \mapsto \Gamma f \in H$ with $\Gamma f(s) = \langle f(\cdot), \mathrm{Cov}(X(\cdot), X_s)\rangle$. Recall the Karhunen-Loève decomposition of the process X, which can be written
(13) $X = \sum_{j \ge 1} \sqrt{\lambda_j}\, \eta_j\, \psi_j,$
where $(\eta_j)_{j \ge 1}$ are uncorrelated real random variables, $(\lambda_j)_{j \ge 1}$ is a non-increasing sequence of positive numbers (the eigenvalues of Γ) and $(\psi_j)_{j \ge 1}$ an orthonormal basis of H. When X really lies in an infinite dimensional space, the set $\{j \ge 1, \lambda_j > 0\}$ is infinite, and under mild assumptions on the distribution of X, it is known that ϕ(h) decreases faster than any polynomial of h (see e.g. Mas 2012, Corollary 1, p.10). This is the case in Assumptions H_{X,L} and H_{X,M}. Moreover, the faster the decay of the eigenvalues is, the more the data are concentrated close to a finite dimensional space, and the slower ϕ(h) decreases.
Finally, it also follows from the above considerations that H_{X,F} only covers the case of finite dimensional processes (the set $\{j, \lambda_j > 0\}$ is finite, that is, the operator Γ has finite rank). This is the extreme case of H_{X,M} (with α = 1, $\gamma_1 = \gamma_2 = 0$). Nevertheless, even if our main purpose is to study functional data, the motivation to keep this case is twofold. First, we show below that our estimation method allows us to recover the classical rates (upper and lower bounds) obtained for c.d.f. estimation with multivariate covariates. Second, processes which fulfill H_{X,F} can still be considered as functional data since the finite-dimensional space to which X belongs is unknown to the statistician.

4.2. Convergence rates of kernel estimators. Under the previous regularity assumptions, we compute the upper bounds for the pointwise and integrated risks of the estimators. The proofs are deferred to the Appendix (Section A).
Proposition 1. (a) Under the assumptions of Theorem 1, and if H_F is fulfilled, the convergence rates of the pointwise risk are given in Table 1, line (a). (b) Under the assumptions of Theorem 2, the convergence rates of the integrated risk are given in Table 1, line (b). In both cases, the upper bounds are given up to a multiplicative constant, and for the different cases H_{X,L}, H_{X,M}, and H_{X,F}.
Let us comment on the results. The faster the small ball probability decreases (that is, the less concentrated the measure of X is), the slower the rate of convergence of the estimator is. In the generic case of a process X which satisfies H_{X,L}, the rates are logarithmic, which is not surprising. It reflects the "curse of dimensionality" which affects functional data. Similar rates have been obtained in the same framework, and by Mas (2012) for regression estimation (Section 2.3.1). However, we show that the results can be improved when the process X is more regular, although still infinite dimensional.
Table 1. Rates of convergence of the oracle estimator (line (a)) and the adaptive estimator (line (b)). Minimax lower bounds (line (c)).
Under Assumption H_{X,M}, the rates we compute decrease faster than any logarithmic function. Assumption H_{X,F} is the only one which yields the fastest rate, that is the polynomial one.
We have already noticed that our adaptive procedure leads to the loss of a logarithmic factor (see the comments following Theorem 2). Nevertheless, by comparing line (a) to line (b) in Table 1, we obtain that the adaptive estimator still achieves the oracle rate if H_{X,L} or H_{X,M} is fulfilled. The loss is actually negligible with respect to the rates.
4.3. Lower bounds. We now establish lower bounds for the risks under mild additional assumptions, showing that the estimators suggested above attain the optimal rates of convergence in a minimax sense over the class of conditional c.d.f. $\mathcal{F}_\beta$ (defined in Section 2.2). The results for the integrated risk are obtained through non-straightforward extensions of the pointwise case.
Theorem 3. Suppose that H_X is fulfilled, and that n ≥ 3.
(i) The minimax pointwise risk $\inf_{\widetilde F} \sup_{F \in \mathcal{F}_\beta} E_F[\|\widetilde F^{x_0} - F^{x_0}\|_D^2]$ is lower bounded by a quantity proportional to the ones in line (c) in Table 1.
(ii) Assume in addition that … constant to be specified in the proof, and that there exist two constants $c_2, C_2 > 0$ such that, for all h > 0, for all x ∈ B, $c_2 \varphi(h) \le \varphi^x(h) \le C_2 \varphi(h)$. Then the minimax integrated risk $\inf_{\widetilde F} \sup_{F \in \mathcal{F}_\beta} E_F[\|\widetilde F^{X'} - F^{X'}\|_D^2\, \mathbf{1}_B(X')]$ is also lower bounded by a quantity proportional to the ones in line (c) in Table 1.
In both cases, the infimum is taken over all possible estimators obtained with the data sample $(X_i, Y_i)_{i=1,\dots,n}$. In (i), $E_F$ is the expectation with respect to the law of $\{(X_i, Y_i), i = 1, \dots, n\}$ and in (ii), $E_F$ is the expectation with respect to the law of $\{\{(X_i, Y_i), i = 1, \dots, n\}, X'\}$ when, for all i = 1, ..., n, for all x ∈ H, the conditional c.d.f. of $Y_i$ given $X_i = x$ is $F^x$.
Theorem 3, which is proved in Section 7.4, shows that the upper bounds of Proposition 1 cannot be improved, not only among kernel estimators but also among all estimators, under assumptions H_{X,L} and H_{X,M}. The estimator $\widehat F_{\widehat h}$ is thus adaptive and optimal in both the oracle and the minimax senses.
The computations are new for conditional c.d.f. estimation with a functional covariate. Under H X,F , with γ = 1, the lower bounds we obtain are consistent with Theorem 2 of Brunel et al. (2010) or Proposition 4.1 of Plancade (2013) for c.d.f. estimation with a one-dimensional covariate, over Besov balls. In the functional framework, the results can only be brought close to those of Mas (2012) (Theorem 3) for regression estimation.
Remark 2. The rates (both lower and upper bounds) depend on the smoothness of the target function F (assumption H_F, $F \in \mathcal{F}_\beta$, to control the bias term) and on the smoothness of the process X, through the rate of decay of the small ball probabilities (larger variance for smaller small ball probabilities). The broad range of rates we establish in Table 1 (from polynomial to logarithmic rates, with intermediate cases) can be compared to the rates which are computed for the density estimation of a variable Z in a deconvolution model W = Z + ε, from a sample drawn as W, and with known distribution of the noise ε. Actually, the standard rates in this setting depend on smoothness assumptions both on the density to recover and on the noise density. Similarly to our framework, the rates are known to be slow (logarithmic) under classical smoothness assumptions (a "supersmooth" noise, and an "ordinary" smooth signal) but they can be improved by considering different assumptions. See e.g. Comte et al. (2006) and Lacour (2006).

Impact of the projection of the data onto finite-dimensional spaces
We have seen in Section 4.2 that, when X lies in an infinite dimensional space (assumptions H_{X,M} and H_{X,L}), the rates of convergence are slow. This "curse of dimensionality" phenomenon is well known in kernel estimation for high or infinite dimensional datasets. The introduction of the projection semi-metrics $d_p$, leading to the estimators (4), has thus been proposed in order to circumvent this problem. Defining such estimators amounts to projecting the data onto a p-dimensional space. Indeed, this addresses the problem of variance reduction since $\varphi_p(h) := P(d_p(X, 0) \le h) \sim_{h \to 0} C(p) h^p$, and then the variance is of order $1/(nh^p)$. Notice that, even if the variance orders of magnitude are the same, the situation here is different from Assumption H_{X,F} with γ = p: H_{X,F} amounts to supposing that the curve X lies a.s. in an unknown finite-dimensional space (see Section 4.1) whereas, here, the data are projected onto a finite-dimensional space but may lie in an infinite-dimensional space.
A first thing we can say is that, under our regularity assumption H_F, Theorem 3 remains true and the convergence rate of the risk of $\widehat F_{h,p}$ cannot be better than the lower bounds given in Table 1, line (c). This implies that, in our setting, the estimator $\widehat F_{h,p}$ cannot converge at significantly better rates than our adaptive estimator $\widehat F_{\widehat h}$ even if the couple of parameters (p, h) is well chosen. Precisely, as shown in Proposition 2 below, projecting the data also adds a bias term which compensates for the decrease of the variance.
5.1. Assumptions. In order to state the result, we need the following assumptions.
H′_ϕ There exist two constants $c_\varphi, C_\varphi > 0$ such that for all $h \in \mathbb{R}$, for all $p \in \mathbb{N}^*$, … .
H_ξ Let $\xi_j := \langle X, e_j\rangle/\sigma_j$ where $\sigma_j^2 := \mathrm{Var}(\langle X, e_j\rangle)$. One of the two following assumptions is verified:
H_ξ^ind the sequence of random variables $(\xi_j)_{j \ge 1}$ is independent and there exists a constant $C_\xi$ such that, for all j ≥ 1, … .
Remark that Assumption H′_ϕ is the equivalent of Assumption H_ϕ replacing d by $d_p$. If X is a Gaussian process, the vector $(\langle X, e_1\rangle, \dots, \langle X, e_p\rangle)$ is a Gaussian vector and Assumption H′_ϕ is also verified provided that B is bounded. Assumption H_ξ^ind is true if X is a Gaussian process and $(e_j)_{j \ge 1}$ is the Karhunen-Loève basis of X (see (13) above, and also Ash and Gardner 1975). Assumption H_ξ^b is equivalent to supposing that X is bounded a.s. We are aware that both assumptions H_ξ^ind and H_ξ^b are strong since in most cases the Karhunen-Loève basis is unknown. We give Proposition 2 below with the sole aim of better understanding the bias-variance decomposition of the risk when the data are projected. A further study would be needed to obtain weaker assumptions but this is beyond the scope of this paper.

5.2. Upper bound.
Proposition 2. Suppose assumptions H_F and H_ξ are fulfilled. Let h > 0 and $p \in \mathbb{N}^*$ be fixed. …
We have additional bias terms compared to Ferraty et al. (2006, 2010). This is due to the fact that our regularity assumption H_F (see Section 2.2) is different from Assumption (H2) of Ferraty et al. (2006) or Assumption (H2') of Ferraty et al. (2010). Our assumption is expressed with the norm of H whereas their assumptions are expressed with the semi-norm used in the definition of the estimator (here $d_p$). Remark that, with projection semi-norms, the assumptions of Ferraty et al. (2006, 2010) imply that the function $F^x$ only depends on $(\langle x, e_j\rangle)_{1 \le j \le p}$. Indeed, if we take x and x′ such that $\langle x, e_j\rangle = \langle x', e_j\rangle$ for j = 1, ..., p (but $\langle x, e_j\rangle \ne \langle x', e_j\rangle$ for some j > p), both (H2) and (H2') imply that $F^x(y) = F^{x'}(y)$ for all y. Our assumption is thus less restrictive.
Remark 3. Notice that the estimator (4) is not consistent when p is fixed. This is also noted by Mas (2012) in a regression setting (see Remark 2, p.4). It is coherent with the fact that we lose information when we project the data. Indeed, suppose that the signal X lies a.s. in (span{e_1, ..., … ⟨x, e_j⟩² = 0 and 0 otherwise. The bias of such an estimator is then constant and non-null as soon as there exists x such that $F^x(y) \ne P(Y \le y)$ on a subset of D of positive Lebesgue measure. Hence, in order to obtain a consistent estimator in the case where $\sigma_j > 0$ for all j, we have to impose that p depends on n and $\lim_{n \to +\infty} p = +\infty$.
5.3. Discussion. The rates obtained can be compared to the lower bounds given in Table 1 in the Gaussian case under assumptions H_{X,F} and H_{X,M}.
5.3.1. Comparison with the rates obtained under Assumption H_{X,F}. We start from the Karhunen-Loève decomposition of X defined in (13). For a Gaussian process, the variables $\eta_j$ are independent standard normal, $(\lambda_j)_{j \ge 1}$ is a non-increasing sequence of positive numbers and $(\psi_j)_{j \ge 1}$ a basis of H. If $\lambda_{\gamma+1} = 0$ and $\lambda_\gamma > 0$ and if the law of $(\eta_1, \dots, \eta_\gamma)$ is non-degenerate then Assumption H_{X,F} is fulfilled. Two cases may then occur.

5.3.2. Comparison with the rates obtained under Assumption H_{X,M}. Thanks to Proposition 2, we are able to obtain the rate of convergence for the estimator.
Corollary 1. Suppose that the assumptions of Proposition 2 are fulfilled and that (ξ 1 , . . . , ξ p ) admits a density f p with respect to the Lebesgue measure on R p such that there exists a constant c f verifying , for a well-chosen bandwidth h and a good choice of p, and where C > 0 is a numerical constant.
, for a well-chosen bandwidth h and a good choice of p, and where C > 0 is a numerical constant.
If $c\,j^{-2a} \le \lambda_j \le C\,j^{-2a}$, for two constants c, C > 0, then Assumption H_{X,M} is fulfilled with α = 1/(a − 1/2), and the estimator converges at the minimax rate if δ = a (adding the condition δ′ ≥ a for the pointwise risk). The conclusion is similar to Paragraph 5.3.1: if $e_j = \pm\psi_j$ for all j ≥ 1 (recall that this condition is unrealistic since in most cases the basis $(\psi_j)_{j \ge 1}$ is unknown) then we can choose p and h such that the minimax rate is achieved, up to a logarithmic factor, for the integrated risk, and for the pointwise risk under an additional condition on $x_0$. Otherwise, we do not know if the minimax rate can be achieved.

Numerical study
We briefly present an illustration of the theoretical results. We first describe the parameters we use and the way the risks are computed. Then we illustrate the performances of our method with some figures and tables, for both simulated samples and a real dataset.
6.1. Implementation of the estimators. We implement the two types of estimators (2) and (4) with the uniform kernel $K = \mathbf{1}_{(0,1)}$. We choose the bandwidth h in a collection … Thanks to the definition of the kernel K, it is useless to consider bandwidths h which are larger than max{ … To choose the other parameter $k_{\max}$, keep in mind that $\widehat\varphi(h)$ should not be too small: the aim is to have an empirical counterpart of Assumption H_b2 and also to avoid instabilities in the calculation of $\widehat V(h)$. We propose to ensure that $\widehat\varphi(h) \ge \ln(n)/n$. This is the case when $C/k_{\max}$ is the quantile of order ln(n)/n of the sample … For the collection of estimators $(\widehat F_h)_{h \in H_n}$ defined in (2), we implement the selection algorithm which leads to $\widehat F_{\widehat h}$. The method to compute it is entirely described in Section 3.1. The $L^2$ norm involved in the definition of A(h) (see (10)) is approximated by the trapezoidal rule over $D = (\min_{1 \le i \le n} Y_i, \max_{1 \le i \le n} Y_i)$. The only parameter to be adjusted is the constant κ of the penalty term $\widehat V(h)$. The simulations allow us to tune it (see Section 6.2 below).
This calibration, as well as the simulation study of the performances of the estimators, requires risk computations. For an estimator $\widehat F$ of the conditional c.d.f. $F$, the integrated risk (5) is approximated by the mean of N = 50 Monte-Carlo replications of the random variable $\|\widehat F^{X'} - F^{X'}\|_D^2$. The interval D is chosen as explained above. The set B involved in (5) is necessary only for the theoretical study and can be chosen as large as possible. It is ignored here.
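The risk approximation described above can be sketched as follows. This is a hedged illustration only: the helper names integrated_squared_error and monte_carlo_risk are hypothetical, and draw_loss stands for any user-supplied routine that simulates one sample, computes the estimator at a fresh copy X', and returns the corresponding loss.

```python
import numpy as np

def integrated_squared_error(F_hat, F_true, y_grid):
    """Trapezoidal approximation of the squared L2(D) distance between two functions on y_grid."""
    diff2 = (np.asarray(F_hat) - np.asarray(F_true)) ** 2
    return float(np.sum((diff2[1:] + diff2[:-1]) * np.diff(y_grid)) / 2.0)

def monte_carlo_risk(draw_loss, n_rep=50, seed=0):
    """Mean of n_rep independent replications of the loss returned by draw_loss(rng)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([draw_loss(rng) for _ in range(n_rep)]))
```

Passing the random generator to draw_loss keeps the replications reproducible for a given seed.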
6.2.1. Simulation of the data-samples. Our procedure is applied to different simulated samples (X i , Y i ) i∈{1,...,n} , with n = 500, drawn as (X, Y ). The functional covariate X is simulated in the following way: where (λ j ) j≥1 is a sequence of positive real numbers such that j≥1 λ j < +∞, (ξ j ) j≥0 is a sequence of i.i.d. standard normal random variables and ψ j (t) := √ 2 sin(π(j − 0.5)t), t ∈ R. Remark that the simulated X is a finite-dimensional Gaussian process. Hence, Assumption H X,F (see Section 4.1) is satisfied, for γ = J + 1. However, if J is sufficiently large, we will consider and X has thus a behaviour very similar to the one of an infinite-dimensional process. In the simulation study, only a few coefficients are estimated, since the sample size is finite, evidently. The three following choices are considered, to illustrate the three regularity conditions studied above. The processes are plotted in Figure 1.
(a) $\lambda_j = j^{-2}$, $J = 150$, corresponding to a Brownian motion, which verifies Assumption $H_{X,L}$.
(b) $\lambda_j = e^{-2j}/j$, $J = 150$, which corresponds to a process X verifying $H_{X,M}$.
(c) $\lambda_j = j^{-2}$, $J = 2$, which leads to a process X satisfying $H_{X,F}$ (with $\gamma = 3$).
Once X is simulated, we obtain Y as follows.
Example 1 (Regression model): $Y = \langle\beta, X\rangle^2 + \varepsilon$ with $\beta(t) = \sin(4\pi t)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, where $\varepsilon$ is independent of X and $\sigma^2 = 0.1$. With this model, the cumulative distribution function of Y given X = x is $F_1^x(y) := \Phi((y - \langle\beta, x\rangle^2)/\sigma)$, where $\Phi(z) = P(Z \le z)$ for $Z \sim \mathcal{N}(0, 1)$.
Example 2 (Gaussian mixture model): The conditional distribution of Y given X = x is $0.5\,\mathcal{N}(8 - 4\|x\|, 1) + 0.5\,\mathcal{N}(8 + 4\|x\|, 1)$. Then, the cumulative distribution function of Y given X = x is $F_2^x(y) := 0.5\,\Phi(y - 8 + 4\|x\|) + 0.5\,\Phi(y - 8 - 4\|x\|)$.
6.2.2. Calibration of the penalty constant. The question that needs to be answered now is: how to choose the constant κ involved in the penalty term V(h) (in (10))? Its choice is crucial for the quality of adaptive estimation. Keep in mind that $\hat h$ is chosen as the bandwidth which realizes the best compromise between two terms, the estimator of the bias term, namely A(h), and the variance term $V(h) = \kappa \ln(n)/(n\hat\varphi(h))$. From (11), heuristically,
• if κ is small, then A(h) is the most influential term. Since this term tends to decrease when h decreases, we select small bandwidths;
• if κ is large, the reverse occurs: we select large bandwidths.
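The simulation design of Section 6.2.1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code. In particular, the expansion X(t) = ξ_0 + Σ_{j=1}^{J} √λ_j ξ_j ψ_j(t) is an assumption suggested by the ingredients listed in the text (the sequence (ξ_j) indexed from 0, the basis ψ_j, and the dimension γ = J + 1); the grid t, the trapezoidal inner product, and all function names are likewise hypothetical. The mixture c.d.f. is written as the c.d.f. of the stated two-component Gaussian mixture.

```python
import math
import numpy as np

def simulate_X(n, lambdas, t, rng):
    """Draw n curves on grid t from the ASSUMED expansion xi_0 + sum_j sqrt(lambda_j) xi_j psi_j."""
    J = len(lambdas)
    j = np.arange(1, J + 1)
    psi = np.sqrt(2.0) * np.sin(np.pi * (j[:, None] - 0.5) * t[None, :])  # shape (J, len(t))
    xi0 = rng.standard_normal((n, 1))
    xi = rng.standard_normal((n, J))
    return xi0 + (xi * np.sqrt(lambdas)) @ psi

def inner(f, g, t):
    """Trapezoidal approximation of the L2([0, 1]) inner product (vectorized over rows)."""
    fg = f * g
    return np.sum((fg[..., 1:] + fg[..., :-1]) * np.diff(t), axis=-1) / 2.0

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_Y_regression(X, t, rng, sigma2=0.1):
    """Example 1: Y = <beta, X>^2 + eps, beta(t) = sin(4 pi t), eps ~ N(0, sigma2)."""
    beta = np.sin(4 * np.pi * t)
    return inner(X, beta, t) ** 2 + math.sqrt(sigma2) * rng.standard_normal(X.shape[0])

def simulate_Y_mixture(X, t, rng):
    """Example 2: Y | X = x ~ 0.5 N(8 - 4||x||, 1) + 0.5 N(8 + 4||x||, 1)."""
    norm_x = np.sqrt(inner(X, X, t))
    sign = rng.choice([-1.0, 1.0], size=X.shape[0])
    return 8.0 + sign * 4.0 * norm_x + rng.standard_normal(X.shape[0])

def cdf_regression(y, x, t, sigma2=0.1):
    """True conditional c.d.f. in Example 1 at a single curve x."""
    beta = np.sin(4 * np.pi * t)
    return Phi((y - inner(x, beta, t) ** 2) / math.sqrt(sigma2))

def cdf_mixture(y, x, t):
    """True conditional c.d.f. in Example 2 at a single curve x."""
    norm_x = math.sqrt(inner(x, x, t))
    return 0.5 * Phi(y - 8.0 + 4.0 * norm_x) + 0.5 * Phi(y - 8.0 - 4.0 * norm_x)
```

For instance, setting (a) would correspond to lambdas = np.arange(1, 151, dtype=float) ** -2.0, and setting (c) to the first two terms of the same sequence.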
To illustrate this phenomenon, we plot the risk of the estimator F h as a function of the parameter κ, see Figure 2. In the light of the results, as expected, κ should be chosen neither too small (otherwise the risk blows up) nor too large. However, a choice of κ larger than the minimizer does not make the risk blow up (similar results are obtained in Bertin et al. 2014). We thus fix κ = 4. Another value around 4 just improves the results for some c.d.f. and deteriorates them for some others. Data-driven calibration tools developed in model selection contexts, such as the slope heuristic, could be very useful here. However, this kind of tool does not exist in our bandwidth selection context, and this is beyond the scope of the paper.
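The influence of κ on the selected bandwidth can be sketched with a small Python example. This is only an illustration under assumptions: the criterion (10) is not reproduced in this section, so the code assumes the usual Goldenshluger-Lepski form A(h) = max over h' of (‖F_{h'} − F_{h∨h'}‖²_D − V(h'))_+, with V(h) = κ ln(n)/(n φ̂(h)), and selects the minimizer of A(h) + V(h). The function names and the toy inputs are hypothetical.

```python
import numpy as np

def sq_norm(f, y_grid):
    """Trapezoidal approximation of the squared L2(D) norm of f given on y_grid."""
    f2 = np.asarray(f, dtype=float) ** 2
    return float(np.sum((f2[1:] + f2[:-1]) * np.diff(y_grid)) / 2.0)

def select_bandwidth(F_by_h, phi_hat, y_grid, n, kappa=4.0):
    """Return (index of selected bandwidth, A, V) for estimators sorted by increasing bandwidth.

    F_by_h[k] holds the estimator with the k-th bandwidth evaluated on y_grid,
    phi_hat[k] the corresponding estimated small ball probability.
    """
    m = len(phi_hat)
    V = kappa * np.log(n) / (n * np.asarray(phi_hat, dtype=float))
    A = np.zeros(m)  # starting from 0 implements the positive part
    for i in range(m):
        for k in range(m):
            # Bandwidths are sorted, so the index of max(h_i, h_k) is max(i, k).
            gap = sq_norm(F_by_h[k] - F_by_h[max(i, k)], y_grid) - V[k]
            A[i] = max(A[i], gap)
    return int(np.argmin(A + V)), A, V

# Toy inputs (hypothetical): three constant "estimators" on D = [0, 1].
y_grid = np.linspace(0.0, 1.0, 11)
F_by_h = np.array([np.full(11, 0.0), np.full(11, 0.3), np.full(11, 1.0)])
phi_hat = [0.1, 0.5, 1.0]
idx_small_kappa, _, _ = select_bandwidth(F_by_h, phi_hat, y_grid, n=100, kappa=0.01)
idx_large_kappa, _, _ = select_bandwidth(F_by_h, phi_hat, y_grid, n=100, kappa=4.0)
```

On this toy input, the small value of κ leads to the smallest bandwidth and κ = 4 to the largest one, in line with the heuristic described in Section 6.2.2.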
6.2.3. Simulation results. Beams of estimators are presented in Figure 3. The estimation is quite stable for the regression model (Example 1). However, the estimation is harder for the Gaussian mixture model in the case H X,L : a more detailed study shows that our criterion selects too small a bandwidth h. The bandwidth selection criterion behaves better in the cases H X,F and H X,M .
To study the impact of the projection of the data from a practical point of view, we also compare the estimator $\widehat F_{h,p}$ (see (2)) with $\widehat F_h$ (see (4)), which can be seen as $\widehat F_{h,p}$ for $p = \infty$. Since we have not developed adaptive estimation for $\widehat F_{h,p}$ with a finite value of p, we compare the corresponding oracle estimators $\widehat F_{h^*,p}$ with $h^*$ defined by (8) (just replace $\widehat F_h$ in the r.h.s. of (8) by $\widehat F_{h,p}$). As expected from the theory (see Proposition 2), the smaller p, the more biased the estimators: as an example, the oracle estimates in Example 1 (b) (regression model with process X satisfying $H_{X,M}$) are plotted in Figure 4. The boxplots for the corresponding risks (over 50 estimators) are also plotted (outliers are not drawn). The mean integrated risks for all the studied models are presented in Table 2. Both the boxplots of Figure 4 and the risks of Table 2 confirm that the risks generally become smaller when p increases.
6.3. Application to a spectrometric dataset. We propose to study a classical dataset, widely studied in regression contexts (see Ferraty and Vieu (2002, 2006)). The data are available online and have been recorded on a Tecator Infratec Food and Feed Analyzer. For each unit i (i = 1, . . . , n, with n = 215), we observe one spectrometric curve X i which corresponds to the absorbances of a piece of chopped meat measured at 100 wavelengths ranging between 850 and 1050 nm. The aim is to describe the link between the spectrometric curve of a piece of chopped meat and its fat content. For each curve X i , we denote by Y i the corresponding fat content. The sample curves are plotted in Figure 5, first column.
Table 2. Values of the mean integrated risks ×10 averaged over 50 samples, for the estimators $\widehat F^{x_0}_{h^*,p}$, with $x_0$ a copy of X, and for p ∈ {1, 2, 3, 4, 5, ∞}.
In a first step, the data are centred to match our assumptions. Then we apply our global adaptive estimation procedure. We choose ten curves x  h is strongly increasing in the interval [0, 10], which indicates that our estimator detects that Y must be small with large probability. Conversely, when Y i (j) 0 is large (red curves), the estimated conditional distribution function increases faster in the interval [30, 50]. These results indicate that our estimators are able to capture the distribution of Y given X = x 0 when x 0 is taken in the sample.

Concluding remarks
• The estimation procedure we propose is not restricted to the case where the covariate X is functional. Indeed, the adaptive estimator $\widehat F_{\hat h}$ can be calculated as soon as the covariate X takes values in a general Hilbert space $(H, \|\cdot\|)$. The results can be applied to a function space such as $L^2(I)$ ($I \subset \mathbb{R}$), $L^2(\mathbb{R}^d)$ or a Sobolev space, but also to $\mathbb{R}^d$, $\mathbb{C}^d$, $\ell^2(\mathbb{N})$, etc. The results given in Sections 2 and 3 remain valid. For instance, in the case where $X \in \mathbb{R}^d$, an immediate consequence of Theorem 1 is that both the pointwise and integrated risks of $\widehat F_{\hat h}$ converge to 0 at the rate $(n/\ln(n))^{-2\beta/(2\beta+d)}$.
• Is there a solution to the "curse of dimensionality"? We prove that, under our assumptions, the classical Nadaraya-Watson estimator (2) with $d(x, x') = \|x - x'\|$ attains the minimax rate of convergence. Hence, in our setting, even if these rates are slow, they cannot be significantly improved by changing the semi-norm d in the kernel. Work is in progress to determine whether the rates can be improved by considering functions F more regular than those of the class $\mathcal{F}_\beta$, for instance by taking into account the derivatives of the covariate X in the spirit of Ferraty and Vieu (2002). Another approach may be to reduce the structural complexity of the model by considering e.g. single or multiple-index models (Chen et al. 2011; Ait-Saïdi et al. 2008).
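To make the finite-dimensional rate quoted above concrete, the following short Python sketch evaluates $(n/\ln(n))^{-2\beta/(2\beta+d)}$ for increasing dimensions d. The function name rate and the chosen values of n, β and d are illustrative assumptions only.

```python
import math

def rate(n, beta, d):
    """Evaluate (n / ln n)^(-2 beta / (2 beta + d)), the rate quoted for X in R^d."""
    return (n / math.log(n)) ** (-2.0 * beta / (2.0 * beta + d))

# For fixed n and beta, the rate deteriorates (gets closer to 1) as d grows.
for d in (1, 5, 20, 100):
    print(d, rate(10_000, 1.0, d))
```

The exponent 2β/(2β + d) tends to 0 as d grows, which is the quantitative form of the curse of dimensionality discussed in the second remark.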
7. Proofs
7.1. Preliminary notation and results. We will mainly focus on the proof of the results for the integrated risk (since it is the one for which adaptation results are provided), and only highlight the differences when choosing the pointwise criterion. Some technical proofs are postponed to the Appendix (Section A). We denote by $\mathbb{E}_{X'}$ (resp. $\mathbb{P}_{X'}$, $\mathrm{Var}_{X'}$) the conditional expectation (resp. probability, variance) given $X'$. We also introduce the classical norm $\|\cdot\|_{L^q(\mathbb{R})}$ of the space $L^q(\mathbb{R})$ of integrable functions (the notation will be used with q = 2 and q = ∞).
Recall that K h (x) := h −1 K(h −1 x). Assumptions on the kernel and H ϕ imply that, for all l ≥ 1, where m l := c l K c ϕ and M l := C l K C ϕ . These inequalities are useful in the sequel. One of the key arguments in the proofs of Theorems 1, 2, and Proposition 2 is the control of the deviations (in probability and expectation) of the process R x h , for x ∈ H, defined by The following lemma, proved in Section A, establishes the result which is useful to control the integrated risk of the estimators. The proof can be found below.

7.2.1. Main part of the proof of Inequality (7). Following Ferraty et al. (2010), we define We also have R (18)). First, notice that since F X ′ h ≤ 1 and F X ′ ≤ 1 a.s., and with Lemma 1 we get . The last inequality comes from the bound xe −x ≤ e −1 , The first and third terms are variance terms, bounded by Lemmas 2 and 3 proved below. The second one is a bias term, controlled by Lemma 4.
This ends the proof of Inequality (7). The scheme can easily be adapted to prove (6).

Proof of Lemmas 2 and 3 (upper bounds for the variance terms). Proof of Lemma 2. By Fubini's Theorem
Since, for all y ∈ D, F X ′ h (y) is a mean of independent and identically distributed random variables (conditionally to X ′ ), we have, on the set {X ′ ∈ B}, where the last inequality comes from Inequality (17).
and the result comes also from Inequality (17).
7.2.3. Proof of Lemma 4 (upper bound for the bias term). First remark that, for y ∈ D, a.s.
. Then, thanks to the generalized Minkowski Inequality (see Lemma A.1 of Tsybakov 2009), which can be easily extended to the integral over Hilbert spaces h (X ′ ) = 1/n, which ends the proof:

Proof of Theorem 2.
7.3.1. Main part of the proof of Theorem 2. Let Λ be the set We split the loss function of the estimator We will argue as follows, in two steps. First, on the set Λ, $\hat\varphi(h)$ is close to $\varphi(h)$, and the proof comes down to the control of the estimator of the bias approximation of the criterion (10). Second, the probability of the set $\Lambda^c$ is shown to be negligible. Let us prove these two claims.
Let h ∈ H n be fixed. We start with the following decomposition for the loss of the estimator F X ′ h : The definitions of A(h), A( h) and then the one of h enable to write Thus, We now split A(h) to introduce the following intermediary Compared to the data-driven criterion (10), the variance term V (h) is deterministic here. Note that We obtain the decomposition .

But on the set Λ, for any
Moreover, on Λ, we also have, for $h \in \mathcal{H}_n$, $\varphi(h) - \hat\varphi(h) < \varphi(h)/2$, that is $2/\varphi(h) > 1/\hat\varphi(h)$. Thus, Gathering the two bounds in (26) leads to Besides, the quantity $\|\widehat F_h^{X'} - F^{X'}\|_D^2$ is the loss of an estimator with fixed bandwidth h and has already been bounded (see Theorem 1, Inequality (7)). Hence we obtain where C is the constant of Theorem 1 (Inequality (7)). The remaining part of the step follows from the lemma hereafter, the proof of which is postponed to the following section.
Lemma 5. Let h ∈ H n be fixed. Under the assumptions of Theorem 2, there exist two constants C 1 and C 2 such that, The constant C 2 depends on C 0 , |D|, M 2 , m 1 and C K and the constant C 1 only depends on C D .

Proof of Lemma 5 (Upper bound for
Recall that we write the estimator (22) and R X ′ h by (18). We split again Thus, by subtracting V (h ′ ) and taking the maximum over h ′ ∈ H n , we obtain We have not subtracted V (h ′ ) from two of the above terms: we show below that we can bound these terms directly. We now deal with each of the terms involved in (31) on the set {X ′ ∈ B}.
• Upper bound for the term depending on We then use the definition of the set Ω h,h ′ , and split the term to obtain the bias terms: thanks to Lemma 4.
• Upper bound for the term depending on 1 Ω c h,h ′ . It is the second term which does not depend on V (h ′ ): . Thus by applying Inequality (19) of Lemma 1, with η = 1/2: Recall now that thanks to H b2 , ϕ(h) ≥ C 0 ln(n)/n for all h ∈ H n , with C 0 > 16(M 2 /m 2 1 + C K /2m 1 ). Use also H b1 to deduce Thus, we have proved that • Upper bound for the term depending on T h,h ′ . The definition of this term implies that using that E F X ′ h∨h ′ ≤ 1. We roughly bound the supremum over h ′ ∈ H n by a sum over h ′ and use the last inequality: Then, Inequality (20) of Lemma 1 (with α = 2) proves that, on the set {X ′ ∈ B}, a.s., , the other term is bounded as follows and same computations allow to deal with it. We thus deduce that • Upper bound for the terms depending on T a h ′ or T b h ′ . First, by definition of Furthermore, noticing that F X ′ h ′ belongs to L 2 (D), the following equality is classical: whereS D (0, 1) is a dense countable subset of the sphere S(0, 1) = {t ∈ L 2 (D), t D = 1}.
, for any t ∈ L 2 (D), the Cauchy-Schwarz Inequality leads to sup t∈S(0,1) v, t 2 D ≤ v D , with equality when t = v/ v . We can replace S(0, 1) byS D (0, 1) thanks to the separability of L 2 (D), which gives (35). Moreover, we write the scalar product Consequently, We use the following lemma, which permits to control the empirical process defined by (36). Its proof can be found in Section A.
Lemma 6. Under the assumptions of Theorem 2, for δ 0 > max(3528C 2 K |D|/M 2 C 0 , 12), there exists a constant C 3 > 0 (depending only on m 1 , M 2 , δ 0 , C 0 and |D|) such that Choosing 2κ/3 > 288δ 0 |D|M 2 /m 2 1 in the definition of V (h ′ ) (see (25)) leads to V (h ′ )/48 ≥ 6δ 0 |(D|M 2 /m 2 1 ) ln(n)/(nϕ(h)). This proves that , similar computations allow to also obtain the same bound for Gathering Inequalities (32) Sketch of the proof . The proof is based on the general reduction scheme described in Section 2.2 of Tsybakov (2009). We only describe in this section the main steps, whose proofs are given in Section A.2. Let x 0 ∈ H be fixed and r n = (ln(n)) −β/α the rate of convergence. We define two functions F 0 and F 1 , called hypotheses, such that (A) F l belongs to F β , for l = 0, 1, 1 2 D ≥ cr n for a constant c > 0, (C) K(P ⊗n 1 , P ⊗n 0 ) ≤ α for a real number α < ∞ (which does not depend on x 0 ), where P ⊗n 0 (resp. P ⊗n 1 ) is the probability distribution of a sample (X 0,i , Y 0,i ) i=1,...,n (resp. (X 1,i , Y 1,i ) i=1,...,n ) for which the conditional c.d.f. of Y 0,i ∈ R given X 0,i ∈ H (resp. of Y 1,i given X 1,i ) is F 0 (resp. F 1 ). K(P, Q) is the Kullback distance between two probability distributions P and Q: K(P, Q) = ln(dP/dQ)dP if P << Q, and K(P, Q) = +∞ otherwise. Then, thanks to Theorem 2.2 in Tsybakov (2009) (p.90), the results hold for any x 0 . Moreover, the multiplicative constant c ′ involved in Theorem 2.2 is clearly independent on x 0 . In the sequel, we define F 0 and F 1 . The three conditions are checked in Section A.2.
Construction of F 0 and F 1 and of the associated samples. For (x, y) ∈ H × R, let F x 0 be the c.d.f. of the uniform distribution on D, that is F x 0 (y) = y |D| 1 y∈D + 1 y>sup D . Choose a real random variable Y 0 with a uniform distribution P U D on the compact set D, and take any process X 0 on H, independent of Y 0 , with distribution P X verifying H X,L . For the second function, set for two constants c (B) > 0 and c (C) > 0. From H X,L , a positive number η n for which the properties above hold is given by (39) $\eta_n = \left(\frac{\ln n - ((2\beta+\gamma)/\alpha)\ln\ln n}{C_1}\right)^{-1/\alpha}$. We also choose a variable Y 1 such that, for any x ∈ H, the conditional distribution of Y 1 given X 0 = x is characterized by the c.d.f. F x 1 . We denote by P 1 the distribution of (X 0 , Y 1 ).

7.4.2. Proof of (i), under Assumption H X,M or H X,F . The proofs follow exactly the same scheme as for (i) under H X,L . The only difference is the choice of the sequence (η n ) n (see (39)).
7.4.3. Proof of (ii). The risk (5) is an integral w.r.t the measure P X ⊗ P U D where P U D is the uniform distribution on the set D. The tools defined to prove (i) are useful and we refer to it. But it cannot be straightforwardly adapted, since for an integrated criterion, two hypotheses are not sufficient. We focus on the case of Assumption H X,L (the switch to Assumption H X,M and H X,F is the same as in (i)). Denote by r n = (ln(n)) −2β/α the rate of convergence again. We must build a set of functions (F ω ) ω∈Ωn where Ω n is a non-empty subset of {0, 1} mn and m n is a positive integer which will be precised later, such that, For all ω ∈ Ω n , P ω is absolutely continuous with respect to P 0 and 1 Card(Ω n ) ω∈Ωn K(P ⊗n ω , P ⊗n 0 ) ≤ ζ ln(Card(Ω n )) for a real number ζ ∈]0, 1/8[, where P ⊗n ω is the probability distribution of a sample (X ω,i , Y ω,i ) i=1,...,n for which the conditional c.d.f. of Y ω,i given X ω,i is given by F ω . Then the result comes from Theorem 2.5 of Tsybakov (2009) (p.85-86). We follow the same steps as previously: below, we define the set of hypotheses (F ω ) ω∈Ωn , and in Section A.2, we give some additional material to check conditions (A'), (B') and (C'). Construction of the set of hypotheses F ω and of the associated samples. The first function (x, y) → F x 0 (y) is defined as in the proof of (i). For all ω = (ω 1 , . . . , ω mn ) ∈ {0, 1} mn , let where ψ, H, L, and (η n ) n are introduced in the proof of (i) (a good choice of η n is (39)), and with x j = √ 2 sup n∈N * {η n }e j , for all j ≥ 1, where (e j ) j≥1 is an orthonormal basis of L 2 ([0, 1]). Moreover ψ verifies the following condition: ψ 2 L 2 (R) < ln(2)/(64C 2 |D| H L 2 (R) c (C) ) (where C 2 appears in Assumption (14), p.9 and c (C) in (39)).
We also choose a variable Y ω , such that, for any x ∈ H, the conditional distribution of Y ω given X = x is characterized by the c.d.f. F x ω . The notation P ω is the distribution of (X, Y ω ).
Remark that the definition of $(x_j)_{j=1,\dots,m_n}$ implies that $H(\|x - x_j\|/\eta_n)\,H(\|x - x_k\|/\eta_n) = 0$ for all $x \in H$, as soon as $j \neq k$.
Indeed, suppose that $H(\|x - x_k\|/\eta_n) \neq 0$; since H has support [0, 1], we have $\|x - x_k\| \le \eta_n$. Now remark that, as $(e_j)_{j\ge 1}$ is an orthonormal basis, for all $j \neq k$, we obtain the splitting (23). Lemmas 2 and 3 remain valid (by replacing again d by d p in every term, and by using H ′ ϕ instead of H ϕ ). This first part is also easily adapted to the proof of Inequality (i).
The difference lies in the control of the bias term. We replace Lemma 4 with the following result, the proof of which can be found below. This ends the proof.
Lemma 7. Suppose that Assumptions H F and H ξ are fulfilled. Then where C > 0 only depends on C D , β, and C ξ .
7.5.2. Proof of Lemma 7. Let us begin with the first inequality (integrated risk). As in the proof of Lemma 4, we also obtain (24). Then, where $\xi^{(1)}_j := (\langle X_1, e_j\rangle - \mu_j)/\sigma_j$ and $\xi'_j := (\langle X', e_j\rangle - \mu_j)/\sigma_j$ are the standardized versions of $\langle X_1, e_j\rangle$ and $\langle X', e_j\rangle$. The same arguments as in Lemma 4 lead to Now, firstly, for all a, b > 0, $(a + b)^{\beta/2} \le (2\max\{a, b\})^{\beta/2} \le 2^{\beta/2}(a^{\beta/2} + b^{\beta/2})$ and secondly h (X ′ ) = 1/n. We thus obtain Under Assumption H b ξ , the result comes from the following bound Now, we use the following lemma, proved in Section A.
Lemma 8. Let (η j ) j≥1 be a sequence of real random variables and (σ j ) j≥1 a sequence of real numbers verifying, for β > 0, and the result comes from Inequality (41). The proof of the first inequality of Lemma 7 is completed.
For the second inequality (pointwise risk), the only difference is that, from (24), we rather use The final bound then follows similarly.

Choosing leads to what we want to prove, that is When assuming $H_{X,F}$, the optimal h can be computed: the one which minimizes $R(h) = h^{2\beta} + n^{-1}h^{-\gamma}$ has the order $n^{-1/(2\beta+\gamma)}$ and immediately gives $R(h) \le C n^{-2\beta/(2\beta+\gamma)}$.
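For completeness, the bandwidth computation under $H_{X,F}$ can be worked out explicitly. The derivation below is a sketch which assumes that the risk bound takes the form $h^{2\beta} + 1/(n\varphi(h))$ with $\varphi(h)$ of order $h^{\gamma}$, so that, up to constants, $R(h) = h^{2\beta} + n^{-1}h^{-\gamma}$. Setting the derivative to zero gives

\[
R'(h) = 2\beta h^{2\beta-1} - \gamma n^{-1} h^{-\gamma-1} = 0
\iff h^{2\beta+\gamma} = \frac{\gamma}{2\beta n}
\iff h^{*} = \Big(\frac{\gamma}{2\beta}\Big)^{1/(2\beta+\gamma)} n^{-1/(2\beta+\gamma)},
\]

and plugging $h^{*}$ back into $R$ yields

\[
R(h^{*}) = \Big[\Big(\frac{\gamma}{2\beta}\Big)^{2\beta/(2\beta+\gamma)} + \Big(\frac{\gamma}{2\beta}\Big)^{-\gamma/(2\beta+\gamma)}\Big]\, n^{-2\beta/(2\beta+\gamma)},
\]

so that both the squared bias and the variance terms are of order $n^{-2\beta/(2\beta+\gamma)}$ at the optimum.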
Proof of (b). The proof comes down to the proof of (a) since Theorem 1 gives a bound for the risks of F h of the form min h R(h) (with R(h) defined in the proof of (a)). The computation of a bound for this minimum has thus been done in the previous section.
A.1.2. Proof of Corollary 1. The proof is based on the same ideas as the ones used to prove Proposition 1 in Section A.1.1. We begin with the result (ii) (integrated risk). Proof of (ii). Thanks to Proposition 2 (ii), the risk of the estimator is bounded by h 2β + ( j>p σ 2 j ) β + n −1 ϕ −1 p (h), up to a multiplicative constant. Remark that where f p is the density of (ξ 1 , . . . , ξ p ). By noticing that where c only depends on j≥1 σ 2 j and c f . With the assumption on σ j , we thus obtain the following upper bound for the risk, up to a constant R(h, p) := h 2β + p β(1−2δ) + c −p n −1 h −p .

A.2. Supplementary details for the proof of Theorem 3.
A.2.1. Additional material for the proof of (i), under Assumption H X,L . All that remains to be done is to check that the hypotheses F 0 and F 1 defined in Section 7.4.1 satisfy the three conditions (A), (B) and (C) (also defined in Section 7.4.1).
Check (A). Belonging to the space F β . For any x ∈ H, the function F x 0 is a c.d.f. by construction (it does not depend on x and is simply the c.d.f. of the uniform distribution on D), and F . Thus, F 0 belongs to F β . Let x ∈ H be fixed. The function y → F x 1 (y) is continuous, with limit 0 when y goes to −∞ (recall that D is a bounded set), and 1 when y goes to +∞ (since R ψ(t)dt = 0). If y / ∈D, (F x 1 ) ′ (y) = 0 (the support of ψ is included in D) and if y ∈D, thanks to the definition of L above. Thus F x 1 is increasing, and F x 1 is a conditional distribution function. Moreover, for any x, x ′ ∈ H, denoting by thanks to the regularity property of the function H. Therefore, F 1 also belongs to F β .

Check (B). Condition on the loss
We have, thanks to the lower bound for η n ,
Check (C). Upper bound for the Kullback divergence K(P ⊗n 1 , P ⊗n 0 ). In a first step, we prove that the measure P 1 is absolutely continuous with respect to P 0 , and compute the Radon-Nikodym derivative. First, notice that Therefore, keeping in mind that $\int_{\mathbb{R}} \psi(t)\,dt = \int_D \psi(t)\,dt = 0$, the conditional distribution of Y 1 given X 0 = x admits a density with respect to the Lebesgue measure on D given by We can thus compute the distribution P 1 of the random couple (X 0 , Y 1 ). By definition, we see that $P_1 \ll P_0$ and $dP_1/dP_0(x, y) = |D| f_1^x(y)$. This enables us to compute the Kullback distance $K(P_1, P_0) = \int \ln\big(\frac{dP_1}{dP_0}\big)\,dP_1 = \int_{H\times\mathbb{R}} \ln(|D| f_1^x(y))\, f_1^x(y)\,dy\,dP_{X_0}(x)$. Noting that ln(1 + u) ≤ u for every u > −1, we obtain by using successively that $\int_{\mathbb{R}} \psi(y)\,dy = 0$ and that the support of H is [0; 1].
A.2.2. Additional material for the proof of (ii). Similarly, it remains to check that the set of hypotheses (F ω ) ω∈Ωn satisfies the three conditions (A'), (B') and (C') defined in Section 7.4.3.

Check (A'). We have already checked that F 0 belongs to F β . Let ω ∈ {0, 1} mn be fixed. To prove that F x ω is nondecreasing (x ∈ H fixed), as for F x 1 , we bound, for y ∈ D, thanks to Property (40) and the definition of L above. Thus, like F 1 in the proof of (i), F ω is a conditional distribution function, and we similarly obtain F ω ∈ F β .
A.3. Proofs of some technical lemmas. Here we prove the technical results which are used in Section 7.
A.3.1. Proof of Lemma 1. To prove Inequality (19), the guideline is to apply Bernstein's Inequality (see Birgé and Massart 1998), for the conditional probability P X ′ .
Lemma 9. Let T 1 , T 2 , . . . , T n be independent random variables and $S_n(T) = \sum_{i=1}^{n} (T_i - \mathbb{E}[T_i])$. Assume that Var(T 1 ) ≤ v 2 and ∀l ≥ 2, 1 n Then, for η > 0, Here, T i = K h (d(X i , X ′ ))/E X ′ [K h (d(X i , X ′ ))], and R X ′ h − 1 = S n (T )/n (recall that we consider here conditional expectation and probability with respect to X ′ ). Let us compute the quantities v and b 0 involved in the inequality. First, on the set {X ′ ∈ B}, Inequality (17) implies that Similarly, for l ≥ 2, By splitting M l = C l K C ϕ = M 2 C l−2 K , the last upper bound can be written with b 0 = C K /(m 1 ϕ(h)). We now apply the first inequality of Lemma 9, which completes the proof of Inequality (19). The proof may be adapted easily to demonstrate Inequality (21). For Inequality (20), we follow the same strategy as Comte and Genon-Catalot (2012), pages 20-21. First thanks to Inequality (44). Now, Since, by Assumption H b2 , ϕ(h) ≥ C 0 ln(n)/n, we obtain ∞ 0 exp − n(u + V R (h)) 4v 2 du ≤ 4M 2 C 0 m 2 1 1 ln(n)n κ R m 2 1 /4M 2 , and the last upper bound is smaller than (4M 2 /C 0 m 2 1 )/n α as soon as κ R > 4M 2 α/m 2 1 . For the other integral, we begin with a lower bound for n u + V R (h)/4b 0 , by using ϕ(h) ≥ C 0 ln(n)/n once again. Thus, as soon as κ R > 32C 2 K α 2 /m 2 1 C 0 . This ends the proof of Lemma 1.
A.3.2. Proof of Lemma 6 (concentration of the empirical process). The aim is to control the deviations of the supremum of the empirical process ν n,h defined by (36). Since it is centred and bounded, the guiding idea is to apply the following concentration inequality.
Lemma 10. [Talagrand's Inequality] Let ξ 1 , . . . , ξ n be i.i.d. random variables, and define ν n (r) = 1 n n i=1 r(ξ i ) − E[r(ξ i )], for r belonging to a countable class R of real-valued measurable functions. Then, for δ > 0, there exists a universal constant C such that and, by a classical generalization of Hölder's Inequality . Now suppose that β ∈ Q∩]0, +∞[, we can write without loss of generality that β/2 = p/q with p ∈ N * and q > 1 (if q = 1, β/2 ∈ N * ). Then the function x → x 1/q is concave and by Jensen's Inequality: The case β > 0 follows immediately from the density of Q into R.