Multivariate adaptive warped kernel estimation

We deal with the problem of nonparametric estimation of a multivariate regression function without any assumption on the compactness of the support of the random design. To tackle the problem, we propose to extend a "warping" device to the multivariate framework. An adaptive warped kernel estimator is first defined in the case of known design distribution and proved to be optimal in the oracle sense. Then, a general procedure is carried out: the marginal distributions of the design are estimated by the empirical cumulative distribution functions, and the dependence structure is built using a kernel estimation of the copula density. The copula density estimator is also studied and proved to be optimal in the oracle and in the minimax sense. The plug-in of this estimator in the regression function estimator provides a fully data-driven procedure. A numerical study illustrates the theoretical results.


Introduction
Let (X, Y) be a couple of random variables taking values in R^d × R such that

Y = r(X) + ε,    (1)

with ε a centered real random variable with finite variance, independent of X = (X_1, ..., X_d). Assume that we have an independent identically distributed (i.i.d. in the sequel) sample (X_i, Y_i)_{i=1,...,n} distributed as (X, Y). The subject of the paper is the estimation of the multivariate regression function r(x) = E[Y | X = x] on a subset A ⊂ R^d, not necessarily bounded, with a warping device described below, which also requires the estimation of the dependence structure between the coordinates of X. Regression estimation is a classical problem in statistics, addressed in a significant number of research works, frequently based on nonparametric methods such as kernel estimators (Nadaraya, 1964; Watson, 1964), local polynomial estimators (Fan and Gijbels, 1996), orthogonal series or spline estimators (Golubev and Nussbaum, 1992; Antoniadis et al., 1997; Efromovich, 1999; Baraud, 2002) and nearest-neighbour-type estimators (Stute, 1984; Guyader and Hengartner, 2013). Among kernel methods, the most popular estimator is the well-known Nadaraya-Watson estimator, defined for model (1) by

r̂_h^{NW}(x) = ( Σ_{i=1}^n Y_i K_h(x − X_i) ) / ( Σ_{i=1}^n K_h(x − X_i) ),    (2)

where h = (h_1, ..., h_d) is the so-called bandwidth of the kernel K, K_h(x) = K_{1,h_1}(x_1) K_{2,h_2}(x_2) ... K_{d,h_d}(x_d), with K_{l,h_l}(x) = K_l(x/h_l)/h_l for h_l > 0, and K_l : R → R such that ∫_R K_l(x) dx = 1, l = 1, ..., d.
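As an illustration, the ratio form of the multivariate Nadaraya-Watson estimator with a product kernel can be sketched as follows. This is a minimal sketch in Python rather than in R (which the paper uses for its simulations); the Gaussian kernel and all function names are our own choices.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Multivariate Nadaraya-Watson estimate at a point x (sketch).

    x : (d,) evaluation point; X : (n, d) design; Y : (n,) responses;
    h : (d,) bandwidth vector.  Uses the product kernel
    K_h(u) = prod_l K(u_l / h_l) / h_l with a Gaussian K (illustrative).
    """
    U = (x - X) / h                                   # (n, d) scaled differences
    Kh = np.exp(-0.5 * U**2).prod(axis=1)             # product kernel, unnormalized
    Kh /= np.sqrt(2 * np.pi) ** len(h) * np.prod(h)   # normalization of K_h
    denom = Kh.sum()
    # ratio form: one smoothing in the numerator, one in the denominator
    return np.dot(Kh, Y) / denom if denom > 0 else 0.0
```

Note the ratio structure: both the numerator and the denominator are smoothed, which is precisely why this estimator requires two smoothing choices in adaptive procedures, as discussed below.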
A commonly shared assumption in regression analysis is that the support of X is a compact subset of R^d (Györfi et al., 2002; Guyader and Hengartner, 2013; Furer and Kohler, 2015). This can be very restrictive in some situations, such as the estimation of the regression function on the level sets of the cumulative distribution function (c.d.f.) (Di Bernardino et al., 2015). Stone (1982) first conjectured that this assumption could be weakened. To our knowledge, Kohler et al. (2009) are the first to propose theoretical results with no boundedness assumption on the support of the design. The price to pay is a moment assumption on the design X (see Assumption (A4) in Kohler et al. 2009). In parallel, "warped" estimators have been developed (Yang, 1981; Kerkyacharian and Picard, 2004) and require very few assumptions on the support of X. If we assume, for the sake of clarity, that d = 1, the warped method is based on the introduction of the auxiliary function g = r ∘ F_X^{−1}, where F_X : x ∈ R → P(X ≤ x) is the c.d.f. of the design X. First, an estimator ĝ is proposed for g, and then the regression function r is estimated by ĝ ∘ F̂, where F̂ is the empirical c.d.f. of X. This strategy has already been applied in the regression setting using projection methods (Kerkyacharian and Picard, 2004; Pham Ngoc, 2009; Chagny, 2013), but also for other estimation problems (conditional density estimation, hazard rate estimation based on randomly right-censored data, and c.d.f. estimation from current-status data, see e.g. Chesneau and Willer 2015; Chagny 2015). While the warping device weakens the assumptions on the design support, the warped estimator also depends on a single bandwidth when d = 1, whereas the ratio form of the kernel estimator (2) requires the selection of two smoothing parameters (one for the numerator, one for the denominator). In return, the c.d.f. F_X of X has to be estimated, but this can simply be done using its empirical counterpart.
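For d = 1, the warped strategy just described fits in a few lines. The sketch below is illustrative Python (the Gaussian kernel and the naming are our choices, not the paper's); note the absence of a denominator, in contrast to the Nadaraya-Watson ratio form.

```python
import numpy as np

def warped_regression_1d(x, X, Y, h):
    """Univariate warped kernel regression (sketch).

    Builds g_hat(u) = (1/n) * sum_i Y_i K_h(u - F_hat(X_i)) on [0, 1],
    then returns r_hat(x) = g_hat(F_hat(x)), with F_hat the empirical c.d.f.
    """
    Xs = np.sort(X)
    n = len(X)
    Fhat = lambda t: np.searchsorted(Xs, t, side="right") / n
    u = Fhat(x)                                   # warp the evaluation point
    Ui = Fhat(X)                                  # warped (pseudo-uniform) data
    weights = np.exp(-0.5 * ((u - Ui) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return np.mean(Y * weights)                   # single bandwidth, no denominator
```

Because F̂(X_i) is approximately uniform on [0, 1], the design density is flattened by the warping, which is exactly what removes the denominator.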
This does not deteriorate the optimal convergence rate, since the empirical c.d.f. converges at a parametric rate. A data-driven selection of the single bandwidth involved in the resulting warped kernel estimator, in the spirit of Goldenshluger and Lepski (2011), leads to nonasymptotic risk bounds when d = 1 (Chagny, 2015). To our knowledge, such adaptive estimation has never been carried out for a ratio regression estimator, the only reference on this subject being Ngoc Bien (2014), who assumes that the design X has a known uniform distribution.
Nevertheless, the extension of the warped strategy to the multivariate framework is not trivial, and we propose to deal with this problem here. The key question is how to take into account the dependence between the components of each X_i. We propose to tackle this problem using copulas, which describe the dependence structure between random variables (Sklar, 1959; Jaworski et al., 2010). The price to pay is the additional estimation of the copula density of the design: the complete strategy requires plugging such an estimate into the final warped regression estimator. The results are obtained for design distributions with possibly unbounded support, as in Kohler et al. (2009). However, we will see that the assumptions are not exactly the same: in particular, the warping device avoids the moment conditions on X. Moreover, our results take place in the field of nonasymptotic adaptive estimation, and the bandwidth of the kernel estimator we propose does not depend on the smoothness index of the target function, contrary to that of Kohler et al. (2009).
The paper is thus organized as follows. We explain in Section 2 how to extend the Yang (1981) estimator to the multivariate design framework. For the sake of clarity, we first concentrate on the simple toy case of known design distribution (Section 3): under mild assumptions, we derive (i) a nonasymptotic oracle-type inequality for an integrated criterion for a warped kernel estimator with a data-driven bandwidth selected by a Lepski-type method, and (ii) an optimal convergence rate over possibly anisotropic functional classes (Neumann, 2000; Kerkyacharian et al., 2001; Bertin, 2005). Then, a kernel copula estimator that also adapts automatically to the unknown smoothness of the design is exhibited and studied in Section 4. An oracle-type inequality is also proved. Finally, warped regression estimation with unknown copula density is the subject of Section 5: as expected, the risk of the final estimate depends on the risks of both the copula estimator and the regression estimator with known design density. A simulation study is carried out in Section 6. Concluding remarks as well as perspectives for future work are given in Section 7, and all the proofs are gathered in Section 8. Throughout the paper, we pay special attention to comparing our assumptions, methodology and results to those of Kohler et al. (2009).

Multivariate warping strategy
If d = 1, the warping device is based on the transformation F_X(X_i) of the data X_i, i = 1, ..., n. For d > 1, a natural extension is to use F_l(X_{l,i}), for l = 1, ..., d and i = 1, ..., n, where F_l is the marginal c.d.f. of X_l. Let us introduce

F_X : x = (x_1, ..., x_d) ∈ R^d → (F_1(x_1), ..., F_d(x_d)),

in such a way that r = g ∘ F_X. If we consider that the marginal variables X_l of X are independent, the estimator of Yang (1981) can immediately be adapted to the multivariate setting: we set

ĝ_h(u) = (1/n) Σ_{i=1}^n Y_i K_h(u − F_X(X_i)),  u ∈ [0, 1]^d,    (3)

to estimate g, and it remains to compose with the empirical counterpart of F_X to estimate r. However, a dependence between the coordinates X_{l,i} of X_i generally appears. The usual model for this dependence, using a copula C and the joint c.d.f. of X, can be written

P(X_1 ≤ x_1, ..., X_d ≤ x_d) = C(F_1(x_1), ..., F_d(x_d)).    (4)

Denoting the copula density by c = ∂^d C / (∂u_1 ... ∂u_d), the density f_X of X can be expressed as

f_X(x) = c(F_1(x_1), ..., F_d(x_d)) Π_{l=1}^d f_l(x_l),

where (f_l)_{l=1,...,d} are the marginal densities of X = (X_1, ..., X_d). It can then be proved that the previous estimator (3) estimates cg and not g (see the computation (8) below). As a consequence, denoting by F̂_X : R^d → [0, 1]^d the empirical multivariate marginal c.d.f.,

F̂_X(x) = (F̂_1(x_1), ..., F̂_d(x_d)),  with F̂_l(x) = (1/n) Σ_{i=1}^n 1_{X_{l,i} ≤ x},    (5)

we propose to set, as an estimator for g,

ĝ_{h,b}(u) = (1/(n ĉ_b(u))) Σ_{i=1}^n Y_i K_h(u − F̂_X(X_i)),    (6)

where ĉ_b is a kernel estimator of the copula density that will be defined later (see Section 4), and finally set, for x ∈ A,

r̂_{h,b}(x) = ĝ_{h,b}(F̂_X(x)),    (7)

to rebuild our target function r from the data. In the sequel, we denote by ‖·‖_{L^p(Θ)} the classical L^p-norm on a set Θ.
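The empirical multivariate marginal c.d.f. F̂_X used in the warping is straightforward to implement componentwise. A minimal Python sketch (our own naming) follows:

```python
import numpy as np

def empirical_marginal_cdf(X):
    """Componentwise empirical c.d.f. F_hat_X : R^d -> [0, 1]^d (sketch).

    X : (n, d) design sample.  Returns a function mapping an (m, d) array
    of points to the (m, d) array of marginal empirical c.d.f. values.
    """
    Xs = np.sort(X, axis=0)          # sort each coordinate separately
    n, d = X.shape

    def F(x):
        x = np.atleast_2d(x)
        out = np.empty_like(x, dtype=float)
        for l in range(d):
            # F_hat_l(t) = #{i : X_{l,i} <= t} / n
            out[:, l] = np.searchsorted(Xs[:, l], x[:, l], side="right") / n
        return out

    return F
```

The map acts coordinate by coordinate: it captures the marginal distributions only, which is precisely why the copula density is needed to account for the dependence.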

Collection of kernel estimators
For the sake of clarity, we first consider the regression estimation problem with a known design distribution. In this section, the copula density c and the marginal c.d.f. F_X are consequently considered to be known. Thus, (6) becomes

ĝ_h(u) = (1/(n c(u))) Σ_{i=1}^n Y_i K_h(u − F_X(X_i)),  and  r̂_h(x) = ĝ_h(F_X(x)),  x ∈ A.

The following computation enlightens the definitions (6) and (7) above. For any u ∈ F_X(A),

E[ (1/n) Σ_{i=1}^n Y_i K_h(u − F_X(X_i)) ] = ∫_A r(x) K_h(u − F_X(x)) f_X(x) dx = ∫_{F_X(A)} K_h(u − v) (cg)(v) dv = ( K_h ⋆ (cg) 1_{F_X(A)} )(u),    (8)

where ⋆ is the convolution product (the second equality follows from the change of variables v = F_X(x) and the factorization of f_X). For small h, the convolution product K_h ⋆ (cg) 1_{[0,1]^d} is expected to be close to cg: this justifies that ĝ_h is suitable to estimate g, and that r̂_h suits well to recover the target r.

Risk of the estimator with fixed bandwidth
As in Kohler et al. (2009), we consider a global weighted integrated risk criterion to study the properties of our estimator. Let ‖·‖_{f_X} be the classical L^2-norm on the space of squared-integrable functions with respect to the Lebesgue measure weighted by f_X on A: for any function t in this space,

‖t‖_{f_X}^2 = ∫_A t^2(x) f_X(x) dx.

The mean integrated squared risk of the estimator r̂_h can thus be written

E[ ‖r̂_h − r‖_{f_X}^2 ] = E[ ∫_{F_X(A)} (ĝ_h − g)^2(u) c(u) du ],

and, using a classical bias-variance decomposition with g_h = E[ĝ_h], we have

E[ ‖r̂_h − r‖_{f_X}^2 ] = ∫_{F_X(A)} (g_h − g)^2(u) c(u) du + E[ ∫_{F_X(A)} (ĝ_h − g_h)^2(u) c(u) du ] =: B(h) + V(h).

To obtain upper bounds for these two terms, we introduce the following assumptions.
(H_{cg,β}) The function (cg) 1_{F_X(A)} belongs to an anisotropic Nikol'skiĭ ball N_2(β, L), with L > 0 and β = (β_1, ..., β_d) ∈ (R_+^*)^d (Nikol'skiĭ, 1975). This is the set of functions f : R^d → R such that f admits derivatives with respect to x_l up to the order ⌊β_l⌋ (where ⌊β_l⌋ denotes the largest integer less than β_l), and

‖ ∂^{⌊β_l⌋} f / ∂x_l^{⌊β_l⌋} (· + y e_l) − ∂^{⌊β_l⌋} f / ∂x_l^{⌊β_l⌋} (·) ‖_{L^2(R^d)} ≤ L |y|^{β_l − ⌊β_l⌋},  for all y ∈ R, l = 1, ..., d,

where e_l is the l-th canonical basis vector of R^d.

(H_{K,ℓ}) There exists ℓ = (ℓ_1, ..., ℓ_d) such that each K_l is a bounded kernel of order ℓ_l (that is, ∫ x^k K_l(x) dx = 0 for k = 1, ..., ℓ_l, with ∫ |x|^{ℓ_l} |K_l(x)| dx < ∞).

Assumptions (H_{cg,β}) and (H_{K,ℓ}) are classical for nonparametric multivariate kernel estimation (Goldenshluger and Lepski, 2011; Comte and Lacour, 2013) and permit to control the bias term B(h) of the risk. Assumption (H_{K,ℓ}) is not restrictive since a wide range of kernels can be chosen, and an assumption on the support and the bounds of the kernel is also necessary in Kohler et al. (2009) (see equation (7) of their paper). In (H_{cg,β}), the index β measures the smoothness of the function cg, and allows us to deal with possibly anisotropic regression functions (different smoothness in the different directions can be considered), like assumption (A2) in Kohler et al. (2009). However, since they consider a pointwise criterion (local bandwidth choice), they rather choose Hölder spaces instead of Nikol'skiĭ spaces, which are designed for integrated risks (like our L^2-risk, see e.g. Tsybakov 2009) and global bandwidth selection purposes. A second difference lies in the use of this assumption. Although we assume that (H_{cg,β}) holds in the sequel, we will not assume the smoothness index β to be known. Its value is not required to compute our selected bandwidth (see Section 3.3), while Kohler et al. (2009) use it to choose the bandwidth of their kernel estimate (see equation (8) in their paper). The subtlety of (H_{cg,β}) is that the smoothness assumption is made directly on cg, and not on the targeted function r. It is for example satisfied if the two functions c and g separately belong to N_2(β, L') (L' > 0), for β such that each β_l ≤ 1, l ∈ {1, ..., d}.
The fact that the assumption is carried by the auxiliary function g and not by r is classical in warped methods (Pham Ngoc, 2009; Chagny, 2015). Another solution is to consider weighted spaces: many details can be found in Kerkyacharian and Picard (2004).

(H_{c,low}) There exists m_c > 0 such that c(u) ≥ m_c for all u ∈ F_X(A).

Assumption (H_{c,low}) is specific to the warped method, which makes the copula density appear in the formula of the estimator. It is replaced by other assumptions in Kohler et al. (2009); see the comments following the corollary below. It is verified, for example, for the Farlie-Gumbel-Morgenstern copula, for the Ali-Mikhail-Haq copula with a parameter θ ∈ ]−1, 1[ (Balakrishnan and Lai, 2009), or for the copula density of a design with independent marginals. For other copulas, it is possible to restrict the estimation set A so as to exclude problematic points: for example, for d = 2, the points (0, 1), (0, 0) and (1, 0) are generally the ones which make (H_{c,low}) fail. The choice A = ]ε, +∞[^d, for a fixed ε > 0 (although still non-compact), sometimes permits to avoid the problem, and thus to consider other copula densities (the example A ⊂ (R_+)^d is related to the application of our method to level set estimation, see Di Bernardino et al. 2015). For example, for the Gumbel Type I bivariate copula with a parameter θ = 1, or for the Frank copula, it is possible to choose A = ]ε, +∞[^2 in the case of nonnegative variables (X_i)_i. The proof of the following result can be found in Section 8.1.
This is a nonasymptotic bias-variance upper bound for the quadratic risk. The first term of the right-hand side of the inequality of Proposition 3.1 is an upper bound for the bias term B(h). The second one bounds the variance term. Another choice would have been to keep B(h) in the inequality (in this case, Assumptions (H_{cg,β}) and (H_{K,ℓ}) are not required). Our choice permits to immediately deduce the following convergence rate, by computing the bandwidth that minimizes the right-hand side of the inequality of Proposition 3.1, over all possible bandwidths h ∈ (R_+^*)^d (see a brief proof in Section 8.2).

Corollary 3.1. Under the same assumptions as Proposition 3.1, there exists a bandwidth h(β) such that

E[ ‖r̂_{h(β)} − r‖_{f_X}^2 ] ≤ C n^{−2β̄/(2β̄+d)},

where β̄ is the harmonic mean of β_1, ..., β_d (that is, d/β̄ = Σ_{l=1}^d 1/β_l) and C > 0 is a constant that does not depend on n.
Thus the usual convergence rate in multivariate nonparametric estimation can be achieved by our estimator, provided that its bandwidth is carefully chosen. Here, the bandwidth h(β) that minimizes the upper-bound of the inequality of Proposition 3.1 depends on the smoothness index β of the unknown function cg. This smoothness index is also unknown a priori. The challenge of adaptive estimation is to propose a data-driven choice that also leads to an estimator with the same optimal convergence rate.

Estimator selection
Let H_n ⊂ (R_+^*)^d be a finite collection of bandwidths. Following Goldenshluger and Lepski (2011), we set, for any h ∈ H_n,

B̂(h) = max_{h' ∈ H_n} { ‖r̂_{h,h'} − r̂_{h'}‖_{f_X}^2 − V(h') }_+,    (10)

where r̂_{h,h'} is the auxiliary estimator built with the doubly-smoothed kernel K_{h'} ⋆ K_h, and

V(h) = κ ‖K‖_1^2 ‖K‖_2^2 E[Y_1^2] / (m_c n h_1 ⋯ h_d),    (11)

where κ > 0 is a tuning constant and m_c can be replaced in practice by an estimator m̂_c. We define

ĥ ∈ argmin_{h ∈ H_n} { B̂(h) + V(h) }    (12)

and the final estimator r̂_ĥ. The criterion (12), inspired by Goldenshluger and Lepski (2011), is known to mimic the optimal bias-variance trade-off in a data-driven way. A short heuristic about the construction of the criterion can be found at the beginning of Section 8.3. It is a global criterion: we select the same bandwidth, whatever the estimation point. This is one difference with Kohler et al. (2009), who propose local choices. Another difference is that our choice does not depend on the smoothness index of the target function.
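The selection rule can be sketched numerically for d = 1 as follows. This is illustrative Python, not the paper's code: the Gaussian kernel, the penalty function V passed in by the user, and the Riemann-sum approximation of the squared L^2-norms are all our choices.

```python
import numpy as np

def gl_select(estimate, bandwidths, grid, V):
    """Goldenshluger-Lepski bandwidth selection, univariate sketch.

    estimate(h) : values of the kernel estimator with bandwidth h on `grid`
    bandwidths  : finite collection H_n of candidate bandwidths
    V(h)        : variance-type penalty chosen by the user
    Squared L2 norms are approximated by Riemann sums on `grid`.
    """
    du = grid[1] - grid[0]
    est = {h: np.asarray(estimate(h), dtype=float) for h in bandwidths}

    def smooth(vals, h):
        # auxiliary double smoothing: K_h applied to an estimate on the grid
        Kmat = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / h) ** 2)
        Kmat /= h * np.sqrt(2 * np.pi)
        return Kmat @ vals * du

    def B(h):
        # empirical bias term: compare doubly-smoothed and plain estimators
        return max(
            max(np.sum((smooth(est[h], hp) - est[hp]) ** 2) * du - V(hp), 0.0)
            for hp in bandwidths
        )

    # data-driven bias-variance trade-off
    return min(bandwidths, key=lambda h: B(h) + V(h))
```

The returned bandwidth minimizes the sum of the empirical bias proxy and the penalty, without any knowledge of the smoothness of the target.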
We start with the study of the estimator r̂_ĥ. The collection H_n is chosen such that the bandwidths are not too small (typically h_1 ⋯ h_d ≥ C (ln n)/n for each h ∈ H_n) and its cardinality grows at most polynomially in n; these conditions are referred to as (13) in the sequel. Such assumptions are very common in the derivation of such estimators (Comte and Lacour, 2013; Chagny, 2015); a dyadic grid is a typical example. We also introduce additional assumptions:

(H_{c,high}) The copula density c is bounded from above on F_X(A).

(H_ε) The noise ε admits moments of sufficiently high order.

The assumption (H_{c,high}) is quite restrictive for copula density estimation if F_X(A) = [0, 1]^d, since many classical copula densities are unbounded near the boundary of the unit cube. However, it is also required for copula density estimation (see Section 4), and it is classical for adaptive density estimation purposes. Moreover, the same upper bound is assumed in Autin et al. (2010) on [0, 1]^d. Assumption (H_{c,high}) is for example satisfied by the Frank copula density, the Farlie-Gumbel-Morgenstern copula, or the copula density of a design with independent marginals. Assumption (H_ε) is classical in adaptive regression estimation, see e.g. Baraud (2002) and Chagny (2015). It is then possible to state the following upper bound, proved in Section 8.3.
Theorem 3.1. Assume that H_n satisfies (13), and assume also (H_ε), (H_{c,low}) and (H_{c,high}). Then there exist constants C, C' > 0 such that

E[ ‖r̂_ĥ − r‖_{f_X}^2 ] ≤ C min_{h ∈ H_n} { B(h) + V(h) } + C'/n.

This result is an oracle-type inequality which asserts that the selected estimator performs as well as the best estimator of the collection (r̂_h)_{h ∈ H_n}, up to multiplicative constants and a remainder term: it achieves the best bias-variance trade-off (see Proposition 3.1). No smoothness assumption is required to establish the result. If we add Assumptions (H_{cg,β}) and (H_{K,ℓ}) (for an index ℓ ∈ (R_+)^d such that ℓ_j ≥ β_j, j = 1, ..., d) to the assumptions of Theorem 3.1, we obtain the same convergence rate as in Corollary 3.1 for the estimator r̂_ĥ.
Corollary 3.2. Under the same assumptions as Theorem 3.1, if we also assume that (H_{cg,β}) and (H_{K,ℓ}) hold, with ℓ_j ≥ β_j for j = 1, ..., d, then

E[ ‖r̂_ĥ − r‖_{f_X}^2 ] ≤ C n^{−2β̄/(2β̄+d)},

for a constant C > 0, where β̄ is the harmonic mean of the components of β. This result can be compared to Theorem 1 of Kohler et al. (2009): our estimate achieves the same convergence rate as their kernel estimate. As already indicated, the smoothness assumptions for the two results are similar. The main difference is that we do not need to know the smoothness index of the targeted function to compute our estimator, while they do. This is what makes our result adaptive. The other assumptions needed to establish the respective results are specific to the chosen methodologies: our assumptions (H_{c,low}) and (H_{c,high}) on the copula density are specific to the extension of the warping device to the multivariate setting, and permit to deal with unbounded support for the design. They are replaced by a moment assumption on the design and a boundedness assumption on the regression function in Kohler et al. (2009).
Notice that Theorem 3.1 and Corollary 3.2 cover the case of the estimator r̂_ĥ, whose bandwidth ĥ is defined with a variance term that involves the unknown quantities E[Y_1^2] and m_c. We choose not to present the fully empirical results: to obtain a fully data-driven estimator, it remains to replace the unknown expectation E[Y_1^2] by its empirical counterpart (1/n) Σ_{i=1}^n Y_i^2 and to change V(h) into the corresponding V̂(h). This is quite classical, and can be done for example as in Theorem 3.4, p. 465, of Brunel and Comte (2005). It is more unusual to replace the lower bound m_c of the copula density by an estimate m̂_c: this can nevertheless be done, through cumbersome computations, following for example the proof of Theorem 4.1 of Chagny et al. (2017). The oracle-type inequality that is then obtained is exactly the same as that of Theorem 3.1, but holds only for a sample size n large enough. The convergence rate of Corollary 3.2 is unchanged. We do not go into the details, to avoid burdening the text with two similar results and lengthening the proofs.

Copula density estimation
The estimator defined by (6) involves an estimator, on F_X(A), of the copula density c, which was assumed to be known in the previous section. This section is devoted to copula density estimation. Since this is an interesting question by itself, we perform the estimation on [0, 1]^d and not only on F_X(A), to be more general, like other papers dealing with copula density estimation (Fermanian, 2005; Autin et al., 2010). However, the results are the same if we restrict the risk, all the L^p-norms involved in the method, and the validity of the assumptions to F_X(A).
An adaptive estimator based on wavelets is defined in Autin et al. (2010) but, to be consistent with the kernel regression estimator already chosen, we propose to use the kernel estimator defined by Fermanian (2005). Consider a kernel W = (W_1, ..., W_d) satisfying the same kind of assumptions as K, a bandwidth b = (b_1, ..., b_d) ∈ (R_+^*)^d, and set, for u ∈ [0, 1]^d,

ĉ_b(u) = (1/n) Σ_{i=1}^n W_b(u − F̂_X(X_i)),    (14)

with W_b(u) = Π_{l=1}^d W_l(u_l/b_l)/b_l. The estimator is very close to the classical kernel density estimator, up to the warping of the data through the empirical c.d.f. Remark that if we replace the estimator F̂_X in (14) by its target F_X, like in the previous section, then ĉ_b(u) is the density estimator of the random vector (F_1(X_1), ..., F_d(X_d)), whose marginal distributions are uniform. We easily obtain the following upper bound for the risk of the copula density estimator when the marginal distributions are known:

E[ ‖ĉ_b − c‖_{L^2([0,1]^d)}^2 ] ≤ ‖W_b ⋆ (c 1_{[0,1]^d}) − c‖_{L^2([0,1]^d)}^2 + ‖W‖_2^2 / (n b_1 ⋯ b_d).    (15)

The results of Fermanian (2005) are asymptotic. Since our goal is to prove nonasymptotic adaptive upper bounds, the Goldenshluger-Lepski method allows us to select a bandwidth b̂ among a finite collection B_n ⊂ (R_+^*)^d. The collection B_n should satisfy a lower bound on the bandwidths analogous to (13) (condition (16)), together with one of two constraints on its size (condition (17)): either |B_n| ≤ ln(n), or a stronger structural condition on the grid, where |B_n| is the cardinal of the set B_n. These assumptions are similar to (13).
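A direct implementation of the warped kernel copula density estimator (14) might look as follows. This is a Python sketch with a Gaussian product kernel (our illustrative choice); the componentwise ranks divided by n play the role of the empirical c.d.f. evaluated at the data.

```python
import numpy as np

def copula_density(u, X, b):
    """Kernel copula density estimator at a point u in [0, 1]^d (sketch).

    c_hat_b(u) = (1/n) sum_i W_b(u - F_hat_X(X_i)), where F_hat_X(X_i) is
    the vector of marginal empirical c.d.f. values at X_i, i.e. the
    componentwise ranks divided by n (assuming no ties).
    """
    n, d = X.shape
    # pseudo-observations: rank of X_{l,i} among X_{l,1..n}, divided by n
    pseudo = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / n
    Z = (np.asarray(u) - pseudo) / np.asarray(b)       # (n, d) scaled gaps
    W = np.exp(-0.5 * Z**2) / np.sqrt(2.0 * np.pi)     # Gaussian kernel values
    return float(np.mean(np.prod(W / b, axis=1)))      # product kernel average
```

For an independent design the copula density is identically 1 on the interior of the cube, which gives an easy sanity check for the estimator.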
Like above for regression estimation, the selected bandwidth is b̂ ∈ argmin_{b ∈ B_n} { B̂_c(b) + V_c(b) }, where B̂_c stands for an empirical counterpart of the bias term of the risk,

B̂_c(b) = max_{b' ∈ B_n} { ‖ĉ_{b,b'} − ĉ_{b'}‖_{L^2([0,1]^d)}^2 − V_c(b') }_+,    (18)

with ĉ_{b,b'} the auxiliary estimator built with the doubly-smoothed kernel W_{b'} ⋆ W_b, and where V_c has the same order as the variance term (compare to (15)):

V_c(b) = κ_c ‖W‖_1^2 ‖W‖_2^2 / (n b_1 ⋯ b_d),    (19)

for a tuning constant κ_c > 0. An oracle-type inequality can then be derived for the final copula density estimator ĉ_b̂. Note that the L^1-norm of the kernel does not appear in (15), but only in the variance term of the Goldenshluger-Lepski method, namely (19), for technical reasons (more details in the proof in Section 8.4 or in Section 3.4.2 of Comte 2015).
The logarithmic term in the upper bound of the inequality can be avoided by assuming the second part of (17) instead of |B_n| ≤ ln(n). Like the tuning constant κ in V (see (11)), the constant κ_c in (19) has to be calibrated. The bound that we obtain in the proof is unfortunately not accurate (this is a consequence of numerous technical upper bounds, based on a concentration inequality), and cannot be used for practical purposes. The tuning of this parameter is discussed below (see Section 6.2). Keep in mind for the following section that the same oracle inequality holds for an integrated risk on a smaller set than [0, 1]^d, e.g. F_X(A). In this case, it is enough to assume an upper bound on the copula density on this set. Proposition 4.1 also permits to derive an adaptive convergence rate for our copula density estimator (even if this is not the initial goal of the paper): if the copula density c belongs to a Nikol'skiĭ ball N_2(α, L') for L' > 0 and α = (α_1, ..., α_d) ∈ (R_+^*)^d, and if the kernel W is of order ℓ ∈ (R_+)^d such that ℓ_j ≥ α_j for j = 1, ..., d (see Assumption (H_{K,ℓ})), then ĉ_b̂ automatically achieves the convergence rate n^{−2ᾱ/(2ᾱ+d)}, where ᾱ is the harmonic mean of the components of α. Following Autin et al. (2010), this is also the lower bound for the minimax risk, and thus our estimator is minimax optimal (with no additional logarithmic factor, in contrast to Corollary 4.1 of Autin et al. 2010).

Plug-in regression estimate
Now we consider the general case of an unknown copula density c in the estimation of the regression function r. The idea is to plug the kernel estimator ĉ_b (defined by (14)) of c into (7), for a well-chosen bandwidth b. We consider the case of fixed bandwidths, both for the regression and the copula estimators; this paves the way for future work on the fully data-driven estimator (with two selected bandwidths, see the concluding remarks below). Let us plug the estimator ĉ_b into r̂_h: for any b, h > 0 and x ∈ A, under Assumption (H_{c,low}), set

r̂_{h,b}(x) = ( (1/n) Σ_{i=1}^n Y_i K_h(F_X(x) − F_X(X_i)) / ĉ_b(F_X(x)) ) 1_{ĉ_b(F_X(x)) ≥ m_c/2}.

This means that r̂_{h,b}(x) = ( (cg)_h / ĉ_b ) ∘ F_X(x) 1_{ĉ_b(F_X(x)) ≥ m_c/2}, where (cg)_h denotes the kernel estimator (3) of cg. To make the estimator fully computable, one needs to know the lower bound m_c of the copula density: in practice it is possible to replace it by a lower bound of an estimator. As explained previously, to avoid making the proofs more technical and cumbersome, we choose not to consider this problem from a theoretical point of view. We obtain the following upper bound for our ratio estimator. Its risk has the order of magnitude of the worst of the risks of r̂_h and ĉ_b.
The result is not surprising, and we cannot expect to obtain a sharper bound for the plug-in estimator. We thus have to add smoothness assumptions both on the regression function and on the copula density to derive the convergence rate of the plug-in estimator.
Finally, to obtain the fully computable estimator, one needs to replace the c.d.f. F_X by its empirical counterpart introduced in (5). The switch is not a problem: the idea is that the empirical c.d.f. converges at a parametric rate, which does not deteriorate our slower nonparametric rate. The multivariate setting does not change anything for the substitution compared to the univariate case. The scheme of the switch can now be considered classical, since it has been detailed at length both by Kerkyacharian and Picard (2004) and by Chagny (2015), but it significantly increases the length of the proofs. That is why, following many works on warped estimation (Chesneau and Willer 2015; Pham Ngoc 2009, ...), we do not give all the details.

Simulation study
In this section we illustrate the performance of our estimator with a simulation study, carried out with the free software R. The regression function that we consider is r(x 1 , x 2 ) = 1/ √ x 1 x 2 (for (x 1 , x 2 ) ∈ R + × R + , see Figure 1).

Fig 1. Regression function
To check the assumptions of the theoretical results, the design (X 1 , X 2 ) is generated using a Frank Copula with parameter 10 and exponential marginals with mean 1 (see Figure 2). The support of the design distribution is thus unbounded, which is possible with our method, according to the theory. The case of bounded support for the design distribution is briefly investigated below, Section 6.3. The response variable is given by Y = r(X 1 , X 2 ) + ε with ε a Gaussian noise with mean 0 and standard deviation 0.025.
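For reproducibility, such a design can be simulated by conditional inversion of the Frank copula. The sketch below is in Python (the paper's simulations use R); the closed-form inverse is the standard conditional distribution of the bivariate Frank copula, and the function name is ours.

```python
import numpy as np

def sample_frank_exponential(n, theta=10.0, seed=0):
    """Draw (X1, X2) from a Frank copula with exponential(1) marginals.

    Conditional-inversion sampler: draw U uniform, then solve
    C(v | u) = W for an independent uniform W (closed form for Frank),
    and map both coordinates through the exponential quantile function.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    # closed-form inverse of the Frank conditional c.d.f. C(v | u)
    num = w * (np.exp(-theta) - 1.0)
    den = np.exp(-theta * u) * (1.0 - w) + w
    v = -np.log1p(num / den) / theta
    # exponential(1) marginals via the inverse c.d.f. F^{-1}(p) = -log(1 - p)
    return -np.log1p(-u), -np.log1p(-v)
```

With theta = 10 the dependence is strongly positive, which is visible in the rank correlation of the simulated sample.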
To study the performance of our estimators, we use a Monte Carlo approximation, with 1000 iterations over independent samples (independent from the data used to compute an estimate r̂), of a relative L^2-risk, namely the Relative Mean Square Error (RMSE), computed as the average of ((r̂ − r)/r)^2 over the evaluation points. Finally, we confront our estimators with the classical Nadaraya-Watson kernel estimator with a cross-validation selected bandwidth (using the npreg function of the R package np, see Hayfield and Racine 2008), which is not designed to deal with an unbounded design, and with the estimator proposed by Kohler et al. (2009).

Fig 2. Illustration of the design: Frank copula with parameter 10 and exponential marginals.
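The relative error criterion can be computed with a minimal helper such as the one below (our own Python sketch; the paper's exact normalisation over its Monte Carlo design is approximated here by an average over evaluation points):

```python
import numpy as np

def rmse(est_values, true_values):
    """Relative Mean Square Error over evaluation points (sketch):
    the average of ((r_hat - r) / r)^2, so that errors are measured
    relative to the magnitude of the regression function."""
    e = np.asarray(est_values, dtype=float)
    t = np.asarray(true_values, dtype=float)
    return float(np.mean(((e - t) / t) ** 2))
```

The relative normalisation matters here because r(x_1, x_2) = 1/sqrt(x_1 x_2) takes very different magnitudes over the (unbounded) support of the design.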

Impact of the estimation of the marginal distributions of the design
In this section we investigate how the estimation of the marginal design distributions, through the empirical c.d.f., affects the results. We compare the estimator r̂_{h,b} computed with the true marginal distributions and r̂_{h,b,F̂} computed with the estimated c.d.f., using the bandwidths h_1 = h_2 = b_1 = b_2 = (log(n)/n)^{0.5}. Several bandwidths have been tested, and this choice "by hand" is a reasonable one among all the possibilities. We provide in Figure 3 the corresponding boxplots for sample sizes n = 100, 500 and 1000. For each sample size, 100 RMSE values (computed from independent samples) are plotted. The estimation of the marginal distributions is carried out using the classical empirical cumulative distribution function, with the function ecdf of the software R. The results are quite similar in both cases. Using the empirical counterpart F̂_X instead of the true c.d.f. F_X does not seem to affect the quality of the final estimator. This kind of result is not very surprising, as the empirical c.d.f. is widely known to be a very good estimate of F_X. From now on, the marginal distributions are thus estimated for all the presented results.

Simulations with data-driven bandwidth for the copula estimator and fixed bandwidth for the regression estimator
In this subsection, we confront our estimator (with and without bandwidth selection) with the well-known Nadaraya-Watson estimator and with the one proposed by Kohler et al. (2009). The bandwidths of the regression estimator are chosen as in the previous subsection, whereas there are two cases for the copula density estimator. For the estimator without bandwidth selection they are chosen as in the previous subsection, but when we perform bandwidth selection, the applied methodology is the Goldenshluger-Lepski procedure detailed in Section 4. The L^2-norm involved in the approximation of the bias term (18) in the selection device is approximated by a Riemann sum over a regular grid of 50 points. As explained above (end of Section 4), the procedure also requires a value for the tuning constant κ_c involved in (19). Classically, we tune it once and for all, for each sample size. Globally following the scheme detailed by Bertin et al. (2016) (Section 7.2), we study the evolution of the risk with respect to the constant, and choose a value that minimizes the risk. But we also take into account recent research by Lacour and Massart (2016) on the difficulty of optimally tuning Lepski-type methods. We thus select b̃ ∈ argmin_{b ∈ B_n} { B̂_c(b) + 2V_c(b) } instead of b̂, and compute the new final estimate ĉ_b̃. The reasons are mainly technical, and we refer to Section 5 of Lacour and Massart (2016) for details. Figure 4 thus presents the calibration results (risk of ĉ_b̃ with respect to the value of κ_c). Remark that the shape of the curve is the same for different regression functions and different design distributions. The figure shows that the value of the constant is crucial: a too small or too large choice can lead to an increase of 50% of the RMSE. The selected values (κ_c = 30 for n = 100, κ_c = 280 for n = 500 and κ_c = 680 for n = 1000) are then used to compute the estimator ĉ_b̃ and to evaluate its performance.
Once this calibration is made, we are in a position to compare the risks of the 4 competing methods. This is carried out in Figure 5.
First, we can see that the estimator of Kohler et al. (2009) seems to perform worse than the others. Then, the Nadaraya-Watson estimator performs better than our approach without bandwidth selection, but worse when we perform copula bandwidth selection. For example, for n = 100, our estimator reduces the median RMSE by 23% and, above all, divides the variance by more than 100. This highlights the robustness of our estimator and the interest of the bandwidth selection step.
The simulations are implemented in R on a server with 140 cores, 400 GB of RAM and an E5-2680 v2 @ 2.80GHz processor. For a data sample of size 100, the computation of the Nadaraya-Watson estimate with a cross-validation selected bandwidth takes 32 seconds, and that of the Kohler et al. (2009) estimate takes 9 seconds. Without any bandwidth selection, our estimate is computed in 7 seconds. Adding the selection step for the bandwidths of the copula density estimate raises this to 84 seconds.

Case of bounded support for the design
The above subsections illustrate the performance of our estimator in the case of an unbounded design. Let us now study the relevance of our estimator when the design has a distribution with compact support. Here, the design is the same as previously (see Figure 2), but restricted to the square [0, 2] × [0, 2] and correctly normalized (see Figure 6). We consider here samples of size n = 100.
Figure 7 below shows the performance of the different methods with bounded or unbounded support.
These results highlight the importance of the bandwidth selection procedure and the robustness of our warped estimator in the case of bounded support: even in this case, our estimator has nearly the same median RMSE as the Nadaraya-Watson estimator, with a variance divided by 5.

Concluding remarks
The aim of the paper is to extend the so-called "warping" device to a multivariate framework, through the study of kernel regression estimation. When the design distribution is known, the extension of the method can be carried out, and results similar to the ones obtained in the univariate framework (nonasymptotic risk bound and optimal convergence rate) are proved. When the design distribution is unknown, the challenge is to cope with the possible dependence structure between the coordinates of the design, and the extension can be done only through the additional estimation of the copula density. This can be done separately in an adaptive way, also with a kernel estimator. Section 5 paves the way for a future study of the plug-in estimator, which is beyond the scope of the paper: the risk of the regression estimator with fixed bandwidth, after plug-in of the copula estimate (also with fixed bandwidth), depends both on the risk of the estimator with known distribution and on the risk of the copula estimator, which is not surprising.
The next step, beyond the scope of the paper, is to propose a bandwidth selection rule for the regression estimate computed with the adaptive copula density estimate (that is, with selected bandwidth). This requires replacing the copula density c in the Goldenshluger-Lepski estimation of the bias term of the risk (see (10)) by its estimator ĉ_b. This probably also implies a modification of the variance term (11) to penalize the plug-in, which is not straightforward. The difficulties are numerous, owing first to the problem of dependence (the regression estimate, the copula estimate, and the selected bandwidth all depend on the design X_i): this makes it difficult to isolate the risk of the adaptive copula estimator from the risk of the regression estimator with known marginal distributions. A natural idea is to imagine that we have at our disposal an additional sample of the design, independent of the data. We could then perhaps conduct the study in the spirit of Bertin et al. (2016), who deal with similar questions for conditional density estimators (which involve the plug-in of marginal density estimates). Another way to tackle the problem could be to adapt very recent research that develops alternative selection algorithms (see for example Lacour et al. 2017; Nguyen 2018) to choose the two bandwidths b and h simultaneously.
Finally, notice that the results obtained above for multivariate random-design regression with additive error term can be extended to handle other multivariate estimation problems, such as regression estimation in the heteroskedastic model or cumulative distribution function estimation from data subject to interval censoring (case 1), as proposed in Chagny (2015) for d = 1.

Proofs
The main tool of the theoretical results is the following concentration inequality (see, for example, Lacour 2008).
Theorem 8.1 (Talagrand Inequality). Let F be a set of uniformly bounded functions, admitting a countable subfamily dense for the supremum norm. Let (V_1, …, V_n) be independent random variables and, for t ∈ F, set ν_n(t) = (1/n) Σ_{i=1}^n (t(V_i) − E[t(V_i)]). Consider M_1, v, and H such that sup_{t∈F} ‖t‖_∞ ≤ M_1, E[sup_{t∈F} |ν_n(t)|] ≤ H, and sup_{t∈F} (1/n) Σ_{i=1}^n Var(t(V_i)) ≤ v. Then, for every δ > 1, there exist numerical positive constants C_1, C_2, c_1 and c_2 such that
E[(sup_{t∈F} ν_n²(t) − 2δH²)_+] ≤ C_1 ( (v/n) exp(−C_2 δ nH²/v) + (M_1²/n²) exp(−c_1 √δ nH/(c_2 M_1)) ).
We will use several times the following standard convolution inequality, called the Young inequality. Let p, q ∈ [1, ∞) be such that 1/p + 1/q ≥ 1. If s ∈ L^p(R^d) and t ∈ L^q(R^d), then s ⋆ t ∈ L^r(R^d) with 1/r = 1/p + 1/q − 1, and
‖s ⋆ t‖_{L^r(R^d)} ≤ ‖s‖_{L^p(R^d)} ‖t‖_{L^q(R^d)}.    (21)
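As an aside, the r = 2, p = 1, q = 2 case of the Young inequality, which is the one used repeatedly in the proofs below, can be checked numerically. The grid, step size and test functions in this Python sketch are our own toy choices, not part of the paper:

```python
import numpy as np

# Discretized check of the Young inequality with r = 2, p = 1, q = 2:
# ||s * t||_{L^2} <= ||s||_{L^1} ||t||_{L^2}.
# On a uniform grid with step dx, this inequality holds exactly, because it
# reduces to the l^p Young inequality for sequences.
rng = np.random.default_rng(1)
dx = 0.01
s = rng.random(500)           # arbitrary nonnegative integrable function on the grid
t = rng.standard_normal(500)  # arbitrary square-integrable function on the grid

conv = np.convolve(s, t) * dx            # (s * t) sampled on the grid
lhs = np.sqrt(np.sum(conv ** 2) * dx)    # ||s * t||_{L^2}
rhs = np.sum(np.abs(s)) * dx * np.sqrt(np.sum(t ** 2) * dx)  # ||s||_1 ||t||_2
```

The bound `lhs <= rhs` is what licenses replacing a convolution K_h ⋆ u by the L¹ norm of the kernel times the L² norm of u in the risk decompositions.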

Proof of Proposition 3.1
The variance term is bounded using Assumption (H_{c,low}). For the bias term, the result is classical; see e.g. Proposition 3, p. 579, of Comte and Lacour (2013) (with r_j = a_j = 0).

Proof of Corollary 3.1
Let f be the multivariate function defined by
f(h_1, …, h_d) = C_2/(n h_1 ⋯ h_d) + C_1 Σ_{l=1}^d h_l^{2β_l}.
By studying f, we prove that it admits a unique minimum on (R*_+)^d. The function f is indeed differentiable, and admits a unique critical point h(β) (the one at which the gradient of f vanishes), such that, for each l = 1, …, d,
2β_l C_1 h_l(β)^{2β_l − 1} = C_2 / (n h_l(β) ∏_{j=1}^d h_j(β)),
or equivalently
h_l(β)^{2β_l} = C_2 / (2β_l C_1 n ∏_{j=1}^d h_j(β)).
By raising the l-th equality to the power 1/(2β_l) and multiplying the d resulting equalities, we get
∏_{j=1}^d h_j(β) = c n^{−d/(2β̄+d)},
where 1/β̄ = (1/d) Σ_{l=1}^d 1/β_l (the harmonic mean of the smoothness indices) and c > 0 is a constant. From this, we derive
h_l(β) = c_l n^{−β̄/(β_l(2β̄+d))},  l = 1, …, d.
This is sufficient to compute the associated convergence rate: indeed, we have, for a constant C,
C_2/(n ∏_{j=1}^d h_j(β)) = C n^{−2β̄/(2β̄+d)},
and the second term of f, namely C_1 Σ_{l=1}^d h_l(β)^{2β_l}, has the same order of magnitude (see (22)). Thus, we obtain the result of Corollary 3.1.
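The computation above can be checked numerically. The following Python sketch (our own illustration; the constants C_1 = C_2 = 1 and the function names are assumptions, not from the paper) solves the first-order conditions in closed form and verifies that the minimal value of f scales as n^{−2β̄/(2β̄+d)}:

```python
import numpy as np

def optimal_bandwidths(beta, n, C1=1.0, C2=1.0):
    """Closed-form critical point of
        f(h) = C2/(n*prod(h)) + C1*sum_l h_l**(2*beta_l),
    obtained from the first-order conditions
        h_l**(2*beta_l) = C2/(2*beta_l*C1*n*prod(h))."""
    beta = np.asarray(beta, float)
    d = len(beta)
    bbar = d / np.sum(1.0 / beta)      # harmonic mean of the beta_l
    s = np.sum(1.0 / (2.0 * beta))     # equals d/(2*bbar)
    # solve for P = prod(h): P**(1+s) = n**(-s) * prod((C2/(2 b C1))**(1/(2 b)))
    logP = (-s * np.log(n)
            + np.sum(np.log(C2 / (2.0 * beta * C1)) / (2.0 * beta))) / (1.0 + s)
    P = np.exp(logP)
    h = (C2 / (2.0 * beta * C1 * n * P)) ** (1.0 / (2.0 * beta))
    return h, bbar

def f(h, beta, n, C1=1.0, C2=1.0):
    """Bias/variance trade-off function minimized in Corollary 3.1."""
    h = np.asarray(h, float)
    beta = np.asarray(beta, float)
    return C2 / (n * np.prod(h)) + C1 * np.sum(h ** (2.0 * beta))
```

Because everything is an exact power law in n, f evaluated at the critical point is exactly proportional to n^{−2β̄/(2β̄+d)}, which is the rate stated in Corollary 3.1.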

Heuristic about the bandwidth selection method
Let us briefly explain the ideas behind the Goldenshluger-Lepski method. Given the finite bandwidth collection H_n ⊂ (R*_+)^d, the optimal choice is the one minimizing the bias/variance trade-off (see (9)), or the upper bound of Proposition 3.1. However, since r is unknown, the bias and variance terms of the risk are unknown, and the smoothness index of r is likely to be unknown too; the optimal bandwidth is thus unattainable. The idea of the method is to mimic the bias/variance decomposition of Proposition 3.1 with empirical quantities. The simplest term is V(h) (see (11)), which has the order of the variance term of the risk. The main specificity of the Goldenshluger-Lepski method is to provide an empirical counterpart for the bias term of the upper bound by comparing, pair by pair, several estimators. The proposal is to introduce auxiliary estimators involving two kernels, K_h and K_{h'}. In our framework, the bias term is ‖((K_h ⋆ (cg))/c) ∘ F_X − r‖²_{f_X}. Since r and cg are unknown, they are replaced by estimators with bandwidth h': this leads to ‖((K_h ⋆ (c ĝ_{h'}))/c) ∘ F_X − r̂_{h'}‖²_{f_X}, which brings out the auxiliary estimator of the method, ((K_h ⋆ (c ĝ_{h'}))/c) ∘ F_X, in our framework. This adds a random part to a deterministic term, namely the bias; this random part should be corrected by subtracting V(h'). Finally, since any bandwidth h' of the collection could be chosen, we scan the whole collection. This leads to the definition of the bias estimate B(h) (see (10)). This explanation is obviously a rough heuristic, and we carefully show that the term B has the order of the bias term (see Inequality (27) below). It remains to select the bandwidth ĥ that minimizes the sum of the empirical terms B(h) and V(h) over all possible bandwidths h. We thus get (12).
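To make the heuristic concrete, here is a minimal univariate illustration of the Goldenshluger-Lepski selection rule, written for plain kernel density estimation rather than for the warped regression estimator of the paper (all names and constants below are our own simplifications):

```python
import numpy as np

def kde(X, h, grid):
    """Gaussian kernel density estimate evaluated on a grid."""
    K = np.exp(-0.5 * ((grid[:, None] - X[None, :]) / h) ** 2)
    return K.mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def gl_select(X, bandwidths, grid, kappa=1.0):
    """Goldenshluger-Lepski selection:
       B(h) = max_{h'} [ ||K_{h'} * fhat_h - fhat_{h'}||^2 - V(h') ]_+
       hhat = argmin_h  B(h) + V(h)."""
    n, dx = len(X), grid[1] - grid[0]
    est = {h: kde(X, h, grid) for h in bandwidths}
    norm_K2 = 1.0 / (2.0 * np.sqrt(np.pi))   # ||K||_2^2 for the Gaussian kernel
    V = {h: kappa * norm_K2 / (n * h) for h in bandwidths}
    G = grid[:, None] - grid[None, :]
    B = {}
    for h in bandwidths:
        vals = []
        for hp in bandwidths:
            Khp = np.exp(-0.5 * (G / hp) ** 2) / (hp * np.sqrt(2.0 * np.pi))
            aux = (Khp @ est[h]) * dx        # auxiliary estimator K_{h'} * fhat_h
            vals.append(max(np.sum((aux - est[hp]) ** 2) * dx - V[hp], 0.0))
        B[h] = max(vals)
    return min(bandwidths, key=lambda h: B[h] + V[h])
```

The structure mirrors (10)-(12): the pairwise comparisons in `B` play the role of the empirical bias, `V` penalizes small bandwidths, and their sum is minimized over the collection.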

Proof of the result
First, the loss function of the selected estimator can be decomposed as follows. Let h ∈ H_n be fixed. We introduce K_h ⋆ (c ĝ_h 1_{F_X(A)}) and follow the decompositions of Theorem 4.2 in Comte (2015). Taking the expectation, the last term of the resulting inequality is the risk of an estimator with fixed bandwidth, controlled by Proposition 3.1. It remains to bound B(h); let us begin by splitting the norm involved in its definition. The proof now shares similarities with the proof of Theorem 1 of Chagny (2015); some easy calculations are thus omitted. We first apply the Young inequality (21) (with r = 2, p = 1, q = 2).
We thus obtain the stated bound for any h ∈ H_n.

Proof of Proposition 4.1
Like the previous one, this proof is based on oracle-type inequalities obtained with the Goldenshluger-Lepski method, so we omit some detailed calculations.
We first obtain a risk decomposition, for any b ∈ B_n. Taking into account inequality (15), it remains to study B_c. Thanks to the convolution inequality (21), we get an upper bound that we roughly enlarge to write, for any b ∈ B_n, a supremum over S_c(0,1), where S_c(0,1) is a countable subset of the unit sphere of L²([0,1]^d) (the set of functions t such that ‖t‖²_{L²([0,1]^d)} = 1), ⟨·,·⟩_{L²([0,1]^d)} is the scalar product of L²([0,1]^d), and ν_{n,c}(t) = (1/n) Σ_{i=1}^n (φ_{t,c,b}(X_i) − E[φ_{t,c,b}(X_i)]), with φ_{t,c,b}(x) = ∫_{[0,1]^d} W_b(u − F_X(x)) t(u) du. We can now apply Theorem 8.1 to the centered empirical process ν_{n,c}. It is not difficult to see that the following choices for the constants are possible. We only detail the computation of v_c: first, n^{−1} Σ_{i=1}^n Var(φ_{t,c,b}(X_i)) = Var(φ_{t,c,b}(X_1)) ≤ E[φ²_{t,c,b}(X_1)]. Then, denoting by W̃_b(x) = W_b(−x), we use Assumption (H_{c,high}) and the Young inequality (21) with p = 2, q = 1, and r = 2, since ‖t‖²_{L²([0,1]^d)} = 1. Using Theorem 8.1, we obtain the required bound for any δ ≥ 1 and some constants C, c_1, c_2 (that may change from line to line), using the first part of (17) and then the constraint (16) on the collection B_n. If instead the second part of (17) is assumed, the sum Σ_{b∈B_n} E[sup_{t∈S_c(0,1)} ···] is bounded in the same way. Thus, if V_c(b)/(3(‖W‖²_{L¹([0,1]^d)} + 1)) ≥ δ‖W‖²_{L²([0,1]^d)}/(n b_1 ⋯ b_d), which means that κ_c is large enough, we have proved the following inequality, which concludes the proof of Proposition 4.1:

Proof of Proposition 5.1
We split the loss function ‖r̂_{h,b} − r‖²_{f_X} = T_1 + T_2, with