L_p adaptive estimation of an anisotropic density under independence hypothesis

Abstract: In this paper, we focus on the problem of multivariate density estimation under an L_p-loss. We provide a data-driven selection rule from a family of kernel estimators and derive for it L_p-risk oracle inequalities depending on the value of p ≥ 1. The proposed estimator permits us to take into account simultaneously the approximation properties of the underlying density and its independence structure. Specifically, we obtain adaptive upper bounds over a scale of anisotropic Nikolskii classes when the smoothness is also measured with the L_p-norm. It is important to emphasize that the adaptation to the unknown independence structure of the estimated density allows us to significantly improve the accuracy of estimation (curse of dimensionality). The main technical tools used in our derivation are uniform bounds on the L_p-norms of empirical processes developed in Goldenshluger and Lepski [13].


Introduction
Let X i = (X i,1 , . . . , X i,d ), i ∈ N * , be a sequence of R d -valued i.i.d. random vectors defined on a complete probability space (Ω, A, P) and having density f with respect to the Lebesgue measure. Furthermore, P f denotes the probability law of X (n) = (X 1 , . . . , X n ), n ∈ N * , and E f is the mathematical expectation with respect to P f .
Goldenshluger and Lepski [14] developed a data-driven selection rule from a family of kernel estimators. Moreover, the selected estimator is minimax adaptive over a scale of anisotropic Nikolskii classes when the smoothness of the underlying density and the error of estimation are measured with the same L p -norm.
Lepski [27] proposed an estimator which takes into account the independence between groups of coordinates of the observed vectors, for estimation under the L ∞ -loss. Thus, it was shown that the adaptation to unknown independence structure permits us to reduce the so-called curse of dimensionality. This result was illustrated by application to adaptive minimax estimation over a scale of anisotropic Nikolskii classes.
In Rebelles [33], the same problem was studied in the pointwise setting and some comparisons between the local procedure and the global one in Lepski [27] have been made.
In the present paper, we address the same problem under an L_p-loss, 1 ≤ p < ∞. As in Goldenshluger and Lepski [14], we consider the case where the smoothness of the underlying density is assumed to be measured in the same L_p-norm. Our main goal is to derive optimal minimax adaptive rates in the context of global estimation of a density, by taking advantage of the fact that some coordinates of the observations may be independent of the others. Throughout the article we compare both our results and our methods with those of Goldenshluger and Lepski [14] and Lepski [27].
Minimax estimation. In the framework of minimax estimation, it is assumed that f belongs to a given set of functions Σ, and the accuracy of an estimator f̃ is measured by its maximal risk over Σ,

R_n^{(q)}[f̃; Σ] := (sup_{f∈Σ} E_f ‖f̃ − f‖_p^q)^{1/q}.

The objective here is to construct an estimator f* which achieves the asymptotics of the minimax risk (the minimax rate of convergence),

ϕ_{n,p}(Σ) ≍ inf_{f̃} R_n^{(q)}[f̃; Σ],

where the infimum is taken over all possible estimators.
See, e.g., Ibragimov and Khasminskii ([19], [20]) and Hasminskii and Ibragimov [17]. It is important to emphasize that minimax rates depend heavily on both the dimension d and the index p of the L_p-risk. The dependence on p disappears when we estimate a density belonging to the class N_{p,d}(β, L) on a given bounded interval of R^d; see, e.g., Donoho et al. [5] for the case d = 1.
To reduce the influence of the dimension on the accuracy of estimation (curse of dimensionality), many researchers have studied the possibility of taking into account not only the smoothness properties of the target function, but also some structural hypothesis on the statistical model. For instance, see the works on the composite function structure in Horowitz and Mammen [18], Iouditski et al. [21] and Baraud and Birgé [1]; the works on the multi-index structure in Goldenshluger and Lepski [12] and Lepski and Serdyukova [28]; and the works on the multiple index model in density estimation in Samarov and Tsybakov [36].
Let us briefly discuss one possible way of facing this problem in the density model setting. As explained above, the approach recently proposed in Lepski [27] is to take into account the independence structure of the density f, namely its product structure induced by the independence structure of the vector X_1.
Structural assumption. Denote by I_d the set of all subsets of {1, . . . , d}, except the empty set. Let P be a given set of partitions of {1, . . . , d}. For all I ∈ I_d denote also Ī = {1, . . . , d}\I and |I| = card(I). We will use ∅̄ for {1, . . . , d}. Finally, for all x ∈ R^d and I ∈ I_d put x_I := (x_i)_{i∈I}. Assume that f_∅̄ ≡ f, that f_∅ ≡ 1, and note that f_I is the marginal density of X_{1,I}. If P ∈ P is such that the vectors X_{1,I}, I ∈ P, are independent, then

f(x) = ∏_{I∈P} f_I(x_I), ∀x ∈ R^d.

In the sequel, the possible independence structure of the density f will be represented by a partition belonging to the set P(f) of partitions P ∈ P for which this factorization holds; see (1.2). Note that P(f) is not empty if we consider that ∅̄ ∈ P, or that P = {P} if the independence structure of f is known. The possibility of choosing P, instead of considering all partitions of {1, . . . , d}, is introduced for technical purposes; this is explained in more detail in Lepski [27], Section 2.1, paragraph "Extra parameters". In this paper, we focus on the problem of minimax estimation with L_p-risk over the anisotropic Nikolskii classes N_{p,d}(β, L, P, f_∞) (defined by (3.1) in Section 3.1). The definition of these classes is a modification of that of the classes N_{p,d}(β, L) to take into account the possible independence structure P of the target density f. Here, we need f and some of its marginals f_I to be uniformly bounded by a real number f_∞ > 0. In particular, we will prove in Section 3.2 that, for fixed β ∈ (0, +∞)^d, L ∈ (0, +∞)^d, P ∈ P(f) and f_∞ > 0,

ϕ_{n,p}(N_{p,d}(β, L, P, f_∞)) = n^{−γ_p r/(γ_p + r)}, r := inf_{I∈P} (Σ_{i∈I} 1/β_i)^{−1},

where γ_p is given in (1.1).
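The factorization above is easy to state in code. The sketch below is our own illustration (the helper names and the concrete Gaussian marginals are not from the paper): it checks that a candidate P is a partition of {1, . . . , d} and evaluates the product f(x) = ∏_{I∈P} f_I(x_I).

```python
import math
from itertools import chain

def is_partition(P, d):
    """Check that P is a partition of {1, ..., d}: non-empty, disjoint
    blocks whose union is the whole index set."""
    elems = sorted(chain.from_iterable(P))
    return all(P) and elems == list(range(1, d + 1))

def product_density(P, marginals, x):
    """Evaluate f(x) = prod_{I in P} f_I(x_I), given a dict mapping each
    block I (as a frozenset) to a callable marginal density f_I."""
    return math.prod(marginals[frozenset(I)](tuple(x[i - 1] for i in sorted(I)))
                     for I in P)

# d = 3, X_{1,1} independent of (X_{1,2}, X_{1,3}): P = {{1}, {2, 3}}
phi1 = lambda x: math.exp(-x[0] ** 2 / 2) / math.sqrt(2 * math.pi)
phi2 = lambda x: math.exp(-(x[0] ** 2 + x[1] ** 2) / 2) / (2 * math.pi)
P = [{1}, {2, 3}]
marginals = {frozenset({1}): phi1, frozenset({2, 3}): phi2}

f_x = product_density(P, marginals, (0.0, 0.0, 0.0))
```

For a standard Gaussian with this block structure, f_x equals the full trivariate density at the origin, (2π)^{−3/2}.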
We remark that minimax rates (the accuracy of estimation) depend heavily on the parameter (β, P). Knowledge of this parameter can seldom be assumed in particular applications. Hence, it becomes necessary to find an estimator whose construction is parameter free.
The first question arising in this framework is the following: does there exist an estimator f* such that

lim sup_{n→∞} ϕ_{n,p}^{−1}(α) R_n^{(q)}[f*; Σ_α] < ∞, ∀α,

where ϕ_{n,p}(α) is the minimax rate of convergence over Σ_α? If such an estimator exists, it is called an optimal adaptive estimator (O.A.E.). As mentioned previously, Goldenshluger and Lepski [14] provide an O.A.E. for estimation under the L_p-risk, 1 < p < ∞, over the scale {N_{p,d}(β, L)}_{(β,L)}. In this paper, we construct an O.A.E. for estimation under the L_p-risk over the scale {N_{p,d}(β, L, P, f_∞)}, 1 < p < ∞. Therefore, we improve the adaptive rates of convergence found in Goldenshluger and Lepski [14] when the target density has a non-trivial independence structure P ≠ {∅̄}.
In Rebelles [33], it was shown that there exists no O.A.E. for pointwise estimation over any scale {N p,d (β, L, P, f )} containing at least two classes. In the pointwise setting, there is a "ln-price" to pay for adaptation both to the smoothness parameter of the target density and to its independence structure.
Organization of the paper. In Section 2, we provide a measurable data-driven selection rule based on bandwidth selection for kernel estimators, and we derive oracle inequalities for the selected estimator. In Section 3, we define anisotropic Nikolskii classes of densities suited to adaptation with respect to their independence structure, and we provide adaptive upper bounds over a scale of those functional classes. It is also established that the quality of estimation we obtain is rate optimal for this problem. Proofs of all main results are given in Section 4. Proofs of technical lemmas are deferred to the Appendix.

2. Estimator's construction and L_p-risk oracle inequalities

Kernel estimators related to independence structure
Let h_max, h_min and V_min be fixed numbers satisfying (2.1). Then, define the set of parameters H[P] and introduce the family of estimators F[P] given in (2.2). Note first that f̂^{(h,∅̄)} = f̂_h is the Parzen-Rosenblatt estimator (see, e.g., Rosenblatt [35], Parzen [32]) with kernel K_∅̄ ≡ K and multibandwidth h.
Next, the introduction of the estimator f̂^{(h,P)} is based on the following simple observation: if there exists P ∈ P(f), the idea is to estimate separately each marginal density corresponding to I ∈ P. Since the estimated density possesses the product structure, we seek its estimator in the same form; this choice is confirmed by scrutinizing the proofs of Theorems 1 and 2 below. Here and in the sequel, ‖·‖_{s,I} denotes the norm ‖·‖_{L_s(R^{|I|}, dx_I)}, s ∈ [1, +∞], I ∈ I_d.
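For concreteness, here is a minimal sketch of this family of estimators (our own illustration, with a Gaussian product kernel; the paper's kernel K, bandwidth set and notation are not reproduced): f̂^{(h,P)} is the product over the blocks I ∈ P of marginal kernel estimators, and the trivial partition recovers the Parzen-Rosenblatt estimator.

```python
import math

def gauss(u):
    """Standard Gaussian kernel (one admissible choice of K)."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def kde_marginal(data, I, h, x):
    """Kernel estimate of the marginal density f_I at x_I, with product
    kernel and per-coordinate bandwidths h[i]; coordinates are 0-based."""
    s = 0.0
    for X in data:
        s += math.prod(gauss((x[i] - X[i]) / h[i]) / h[i] for i in I)
    return s / len(data)

def kde_structured(data, P, h, x):
    """f^(h,P)(x) = prod_{I in P} fhat_{h_I}(x_I). The trivial partition
    P = [range(d)] gives the usual Parzen-Rosenblatt estimator fhat_h."""
    return math.prod(kde_marginal(data, I, h, x) for I in P)

data = [(0.0, 0.0), (1.0, -1.0)]
h = (0.5, 0.5)
joint = kde_structured(data, [(0, 1)], h, (0.0, 0.0))       # Parzen-Rosenblatt
prod = kde_structured(data, [(0,), (1,)], h, (0.0, 0.0))    # product form
```

With independent blocks, each marginal is estimated on its own lower-dimensional space, which is the source of the dimension reduction discussed below.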
Remark 1. As discussed above, if P ∈ P(f) is known, the initial problem reduces to the estimation of the marginals f_I, I ∈ P. Therefore, the natural loss to be used in the definition of the risk for our problem takes this reduction into account. In Section 2.3 we propose a data-driven selection from the family F[P]. The possibility of choosing the sets H_I is introduced to make our procedure practically feasible: indeed, H_I can be chosen as an appropriate grid in [h_min, h_max]^{|I|}. To define our selection rule, we need to introduce some notation and quantities.

Auxiliary estimators and quantities
For I ∈ I_d and h, η ∈ (0, 1]^d introduce the auxiliary estimators

f̂_{h_I,η_I}(x_I) := (1/n) Σ_{i=1}^{n} [K_{h_I} ⋆ K_{η_I}](x_I − X_{i,I}),

where "⋆" stands for the convolution product on R^{|I|}. Obviously, f̂_{h_I,η_I} ≡ f̂_{η_I,h_I}.
We endow the set P with the operation "⋄" introduced in Lepski [27]: for any P, P′ ∈ P,

P ⋄ P′ := {I ∩ I′ ≠ ∅ : I ∈ P, I′ ∈ P′},

which is, in its turn, a partition of {1, . . . , d}. This allows us to define, for h, η ∈ (0, 1]^d and P, P′ ∈ P, the estimators f̂^{(h,P),(η,P′)} in (2.3). The ideas that led to the introduction of the estimators f̂^{(h,P),(η,P′)}, based on both the operations "⋆" and "⋄", are explained in Lepski [27], Section 2.1, paragraph "Estimation construction". Note that the arguments given in the latter paper do not depend on the norm used in the definition of the risk and remain valid for estimation under the L_p-loss. Here, we give only the following simple explanation. Inspired by the methodology proposed by Goldenshluger and Lepski [14], Section 2.6, we seek auxiliary estimators in the form (2.3), noting that the corresponding stochastic term is a sum of bounded and centered random variables and, therefore, is "somehow small".
Finally, since f̂_{h_I} is an estimate of f_I, we come to the introduction of f̂^{(h,P),(η,P′)}, for which a similar behavior can be expected. However, we emphasize that the methodology developed by Goldenshluger and Lepski [14] cannot be applied to the selection of a partition P, since it is not based on selection from a family of linear estimators. Furthermore, estimation under the L_p-loss, 1 ≤ p < ∞, instead of the sup-norm loss, leads us to modify the method proposed in Lepski [27] by introducing the quantities below, together with some specific technical arguments to compute our risk bounds; see the proof of Theorems 1 and 2, Section 4.2.
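The operation "⋄" is simply the common refinement of two partitions, and can be sketched in a few lines (our own illustration; the block values are arbitrary):

```python
from itertools import product

def diamond(P1, P2):
    """P1 'diamond' P2: the partition whose blocks are the non-empty
    intersections I & J with I in P1 and J in P2 (Lepski [27])."""
    return [I & J for I, J in product(P1, P2) if I & J]

P1 = [{1}, {2, 3, 4}]
P2 = [{1, 2}, {3, 4}]
refined = diamond(P1, P2)   # the common refinement {{1}, {2}, {3, 4}}
```

Both P ⋄ P = P and the fact that the result is again a partition of {1, . . . , d} are immediate from the definition.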
For I ∈ I_d and h ∈ (0, 1]^d define the quantities U_p(h_I). For h ∈ (0, 1]^d and P ∈ P put U_p(h, P) := sup_{I∈P} U_p(h_I).
We will see in Section 4.1 that the quantities U_p(h, P) can be viewed as uniform bounds on the L_p-norm of the stochastic errors related to the estimators from the family F[P]. Such "majorants" were developed in Goldenshluger and Lepski [13] and used in Goldenshluger and Lepski [14] for multivariate density estimation under the L_p-loss. Let us remark that U_p(h, P) is a deterministic quantity when p ∈ [1, 2], and a random one when p > 2. In both cases, it follows from the results of Lemmas 1 and 2 below that U_p(h, P) admits an upper bound in which C_3 > 0 is a constant and γ_p is given in (1.1).
We remark that if P = {∅̄} then G_p = 1.

Selection rule and oracle inequalities
For h ∈ (0, 1]^d and P ∈ P introduce the selection rule (2.4)-(2.5), which selects (ĥ, P̂) by minimizing the criterion ∆_p(h, P) + Λ_p U_p(h, P) over H[P]. It is easily checked that (ĥ, P̂) exists, belongs to H[P] and is measurable; see, e.g., Lepski [27], Section 2.1, paragraph "Existence and measurability", for more details.
We also emphasize that the construction of the proposed procedure does not require any condition on the density f. However, the mild assumption f ∈ F[f_∞, P] will be used for computing its risk, where P(f) is given in (1.2). Note that the considered class of densities is determined by P. Define, for (h, P) ∈ H[P] such that P ∈ P(f), the associated risk quantity. If the possible independence structure P of the target density is known, the latter quantity can be viewed as an "L_p-risk" of the estimator f̂^{(h,P)}, defined with a loss l(f̂^{(h,P)}, f) taken as a supremum over the blocks I ∈ P. In this case, we see that the effective dimension of the estimation problem is not d, but d(P) := sup_{I∈P} |I|. Therefore, the best estimator from the family F[P] (the oracle) should be the minimizer f̂^{(h*,P*)} of this quantity. Let us now provide oracle inequalities for our selected estimator f̂.
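Schematically, the selection rule is a minimization of a penalized comparison criterion over the finite family H[P]. The skeleton below is our own paraphrase: Delta and U stand in for the quantities ∆_p and U_p defined in the text (their exact expressions are not reproduced here) and are passed in as callables.

```python
def select(candidates, Delta, U, Lam):
    """Pick the pair (h, P) minimizing Delta(h, P) + Lam * U(h, P) over a
    finite list of (h, P) pairs. Ties are resolved by first occurrence,
    which keeps the rule well defined for a fixed enumeration of the grid."""
    return min(candidates, key=lambda hp: Delta(*hp) + Lam * U(*hp))

# toy illustration: a bias proxy growing with h, a stochastic term
# shrinking with h, and a small multiplier Lam
candidates = [(0.1, "P1"), (0.2, "P1"), (0.1, "P2"), (0.4, "P2")]
h_hat, P_hat = select(candidates,
                      Delta=lambda h, P: h,
                      U=lambda h, P: 1.0 / h,
                      Lam=0.01)
```

The trade-off mirrors the text: the first term plays the role of the (estimated) approximation error, the second that of the majorant on the stochastic error.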
Theorem 2. Let f_∞ > 0, q ≥ 1 and p > 2. Assume that, for some constants C_3 and C_4, the conditions n ≥ C_3 and nV_min > C_4 hold. Then the oracle inequality below holds for all f ∈ F[f_∞, P]. The constants C_3, C_4 and α_{p,i}, p > 2, i = 1, 2, are given in the proof of the theorem and depend on K, d, q, p and f_∞.
Here, we see that the possibility of choosing the set of partitions P is interesting for reasons other than computational ones. Indeed, the latter results lead us to consider various problems in the framework of density estimation.
First, it is possible to consider that P contains the two elements ∅̄ and {{1}, . . . , {d}} if we suppose that the target density may have independent components. We may also consider that P = {P} if the independence structure of the underlying density is known.
Next, for P = {∅̄} (no independence structure) we automatically obtain the oracle inequalities given in Theorems 1 and 2 of Goldenshluger and Lepski [14], up to numerical constants. The proof of Theorem 3 in the latter paper indicates that, for p ∈ [2, ∞), a matching lower bound holds with some constant C_4 > 0. This lower bound holds under very weak assumptions on the density f and, together with the result of our Theorem 2, yields a genuine L_p-risk oracle inequality in this case. Note, however, that in all other cases our upper bound matches the oracle risk only up to a numerical constant. This seems to be a price to pay for taking into account the possible independence structure of the underlying density and, thus, for reducing the influence of the dimension on the quality of estimation.
Furthermore, comparing our results with those in Goldenshluger and Lepski [14], we remark that another price to pay appears through the constant α_{p,1}; see the computations in the proofs of Theorems 1 and 2. Indeed, the prime interest is to obtain oracle inequalities with a constant α_{p,1} close to 1, and this seems to be more difficult whenever the target density has a non-trivial independence structure P ≠ {∅̄}.
However, Theorems 1 and 2 in the present paper lead us to consider various problems arising in the framework of minimax and minimax adaptive estimation. This is the subject of Section 3 below.

A short simulation study
Consider the estimation of a bivariate density (d = 2). Then the set of partitions P contains the two elements P_1 = {{1}, {2}} and P_2 = {{1, 2}}. Moreover, if we consider the smoothness parameter h = (h_1, h_2) as fixed, we only have to compare the accuracy of the estimator f̂^{(h,P_1)} with that of the classical kernel estimator f̂^{(h,P_2)} = f̂_h. Then, the main question is: does our strategy choose the partition P_1 when the two components of X_1 are independent?
Here, we answer this question in the following case. For simplicity, we estimate f on a grid of 100 × 100 points in the domain [−1/2, 1/2]^2 via the Fast Fourier Transform, using n = 1000 simulated random vectors. Because f is an isotropic density, the smoothness parameter h = (h_1, h_2) is an isotropic vector properly chosen in the dyadic grid {h = (2^{−k}, 2^{−k}) : k ∈ N, log_2(ln^2(n)) ≤ k ≤ log_2(n)}, so as to minimize both the L_2-risk (averaged over 1000 samples) of f̂^{(h,P_1)} and that of f̂^{(h,P_2)}. Figure 1 shows the estimators f̂^{(h,P_i)}, i = 1, 2, averaged over 1000 samples. Figure 2 shows the boxplots of the values of both selection criteria, ∆_2(h, P_1) + Λ_2 U_2(h, P_1) (on the left) and ∆_2(h, P_2) + Λ_2 U_2(h, P_2) (on the right), over 1000 samples, with the random quantity Λ_2 multiplied by c = 0.01. Here, our strategy chooses the partition P_1 in 999 of the 1000 samples. We conclude that, for this example, the selected estimator outperforms the classical kernel estimator in almost all cases.
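The comparison behind this study can be reproduced in a few lines. The sketch below is a simplified stand-in (plain sums instead of the FFT implementation, a standard Gaussian product density, a smaller n, and an arbitrary fixed bandwidth; none of these choices come from the paper): it computes a Riemann-sum integrated squared error for the product estimator f̂^{(h,P_1)} and the joint estimator f̂^{(h,P_2)} against the true density.

```python
import math
import random

random.seed(0)

def kde1(data, h, x):
    """1-D Gaussian kernel density estimate."""
    return sum(math.exp(-((x - t) / h) ** 2 / 2) for t in data) / (
        len(data) * h * math.sqrt(2 * math.pi))

n, h = 500, 0.25
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
xs = [a for a, _ in sample]
ys = [b for _, b in sample]

def f_P1(x, y):
    """Product estimator for the partition P1 = {{1}, {2}}."""
    return kde1(xs, h, x) * kde1(ys, h, y)

def f_P2(x, y):
    """Classical bivariate kernel estimator (partition P2 = {{1, 2}})."""
    return sum(math.exp(-(((x - a) / h) ** 2 + ((y - b) / h) ** 2) / 2)
               for a, b in sample) / (n * h * h * 2 * math.pi)

grid = [-3.0 + 6.0 * i / 40 for i in range(41)]
step = (6.0 / 40) ** 2
true = lambda x, y: math.exp(-(x * x + y * y) / 2) / (2 * math.pi)

def ise(f):
    """Riemann-sum approximation of the integrated squared error."""
    return sum((f(x, y) - true(x, y)) ** 2 for x in grid for y in grid) * step

err_P1, err_P2 = ise(f_P1), ise(f_P2)
```

With independent components both estimators share the same bias, while the product estimator has a smaller, effectively one-dimensional variance; err_P1 is therefore typically the smaller of the two, in line with the 999-out-of-1000 count reported above.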

L p adaptive estimation
In this section, we discuss adaptive minimax estimation over a certain scale of anisotropic Nikolskii classes when the smoothness of the underlying density is assumed to be measured with the same L_p-norm as that used to measure the quality of estimation.

Anisotropic Nikolskii classes of densities related to independence structure
We start with the definition of the anisotropic Nikolskii class of densities we use in the sequel. Let {e 1 , . . . , e s } denote the canonical basis in R s , s ∈ N * .
Here D_i^k f denotes the kth order partial derivative of f with respect to the variable t_i, and ⌊β_i⌋ is the largest integer strictly less than β_i.
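The display defining the class N_{p,d}(β, L) is missing from our copy; one standard formulation (following Goldenshluger and Lepski [14]; the side conditions in the paper may differ slightly) reads:

```latex
\mathcal{N}_{p,d}(\beta,L) = \Big\{ f:\mathbb{R}^d\to\mathbb{R} \;:\;
  \|D_i^{k} f\|_p \le L_i,\ k = 0,\dots,\lfloor\beta_i\rfloor,\ \text{and} \\
  \big\| D_i^{\lfloor\beta_i\rfloor} f(\cdot + t e_i)
       - D_i^{\lfloor\beta_i\rfloor} f(\cdot) \big\|_p
  \le L_i\, |t|^{\beta_i-\lfloor\beta_i\rfloor},
  \;\; \forall t\in\mathbb{R},\ i = 1,\dots,d \Big\}.
```

In words: along each coordinate direction e_i, the density has ⌊β_i⌋ bounded derivatives in L_p, and the last one satisfies a Hölder-type condition of order β_i − ⌊β_i⌋ in the same norm.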
In order to take into account the smoothness of the underlying density and its possible independence structure simultaneously, a collection of anisotropic Nikolskii classes of densities was introduced in Lepski [27], Section 3, Definition 2. However, since the adaptation is not necessarily considered with respect to the set of all partitions of {1, . . . , d}, the condition imposed therein can be weakened. For instance, if P = {∅̄} (no independence structure), we want to recover the well-known results concerning adaptive estimation over the scale of anisotropic Nikolskii classes of densities {N_{p,d}(β, L)}, which is not possible with the classes introduced in Lepski [27]. For these reasons, the following collection {N_{p,d}(β, L, P)}_P was introduced in Rebelles [33], Section 3.1.
Finally, recall that the condition f ∈ F[f_∞, P] is required in Theorems 1 and 2, and define the corresponding set of densities accordingly. In the next section, we illustrate the application of Theorems 1 and 2 to adaptive estimation over the anisotropic Nikolskii classes of densities N_{p,d}(β, L, P, f_∞).
where the infimum is taken over all possible estimators.
The proof of Theorem 3 coincides with that of Theorem 3 in Goldenshluger and Lepski [15], up to minor modifications needed to take into account the independence structure of the underlying density. It is therefore omitted.
Our goal now is to show that ϕ n,p (β, P) is the minimax rate of convergence on the anisotropic class N p,d (β, L, P, f ), and that a minimax estimator can be selected from the collection F[ P ] given in (2.2).
Assume that ∅̄ ∈ P, that H_I is the dyadic grid in {h_I ∈ [h_min, h_max]^{|I|} : V_{h_I} ≥ V_min}, I ∈ I_d, and consider the estimator f̂ defined by the selection rule (2.4)-(2.5). We show below that the quality of estimation of f̂ is optimal, up to a numerical constant, on each class N_{p,d}(β, L, P, f_∞), whatever the nuisance parameter (β, L, P, f_∞). We achieve the latter goal with a properly chosen kernel K and numbers h_max, h_min and V_min.
For a given integer l ≥ 2 and a given symmetric Lipschitz function u : R → R satisfying supp(u) ⊆ [−1/(2l), 1/(2l)] and ∫_R u(y)dy = 1, the kernel K is built so as to be of order l. It follows that ϕ_{n,p}(β, P) is the minimax rate of convergence on each functional class N_{p,d}(β, L, P, f_∞) and that our estimator, which is fully data-driven, is an O.A.E. over the scale of functional classes {N_{p,d}(β, L, P, f_∞)}_{(β,L,P,f_∞)}. Let us briefly discuss other consequences of Theorem 4.
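The display giving K from u and l is cut off in our copy. The standard order-l construction used in this literature (e.g., Kerkyacharian, Lepski and Picard [24]) reads K(y) = Σ_{i=1}^{l} C(l, i) (−1)^{i+1} (1/i) u(y/i); we present it as a sketch (the triangular choice of u is ours), checking numerically that ∫K = 1 and that the first l − 1 moments vanish.

```python
import math

def make_u(l):
    """A symmetric Lipschitz bump: triangular, supported on
    [-1/(2l), 1/(2l)], integrating to 1 (one admissible choice of u)."""
    a = 1.0 / (2 * l)
    return lambda y: max(0.0, (a - abs(y)) / a ** 2)

def make_kernel(l, u):
    """Order-l kernel: K(y) = sum_{i=1}^{l} C(l,i) (-1)^(i+1) (1/i) u(y/i).
    Its support stays inside [-1/2, 1/2] since supp(u) is in [-1/(2l), 1/(2l)]."""
    def K(y):
        return sum(math.comb(l, i) * (-1) ** (i + 1) * u(y / i) / i
                   for i in range(1, l + 1))
    return K

def moment(K, k, m=20001):
    """Riemann sum of int y^k K(y) dy over [-1/2, 1/2]."""
    h = 1.0 / (m - 1)
    return sum((-0.5 + j * h) ** k * K(-0.5 + j * h) for j in range(m)) * h

l = 3
K = make_kernel(l, make_u(l))
```

The binomial identities Σ_i C(l,i)(−1)^{i+1} = 1 and Σ_i C(l,i)(−1)^{i} i^k = 0 for 1 ≤ k ≤ l − 1 are exactly what makes K integrate to one with l − 1 vanishing moments.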
Next, in view of the latter consideration, Theorem 4 allows us to compare the influence of the independence structure on the accuracy of estimation. For example, we see that ϕ_{n,p}(β, {∅̄}) ≫ ϕ_{n,p}(β, P) for all P ≠ {∅̄}.
We conclude that the existence of an independence structure significantly improves the accuracy of estimation with L_p-risk. The same conclusion was obtained in Lepski [27] for density estimation under the sup-norm loss and in Rebelles [33] for pointwise density estimation. It is also important to emphasize that there is no price to pay for adaptation to the independence structure in the framework of estimation with an L_p-loss, whereas there is a "ln-price" in the pointwise setting; see Rebelles [33]. Note that, if P = {∅̄} (no independence structure), there is still a "ln-price" to pay for adaptation to the smoothness parameter when we consider the pointwise criterion. This was shown for the first time in Lepski [26] for the Gaussian white noise model, in the unidimensional case.
Finally, in view of the embedding theorem for anisotropic Nikolskii classes, see, e.g., Theorem 6.9 in Nikolskii [31]

Proofs of main results
The main technical tools used in our derivations are uniform bounds on the L_p-norms of empirical processes developed in Goldenshluger and Lepski [13]. We start this section by recalling the corresponding results established in Goldenshluger and Lepski [14] for multivariate density estimation under the L_p-loss.

Uniform bounds on the L p -norm of kernel empirical processes
and put

For h_I ∈ (0, 1]^{|I|} and x_I ∈ R^{|I|}, define the empirical process ξ_{h_I}(x_I). Propositions 1 and 2 below follow immediately from Lemmas 1 and 2 established in Goldenshluger and Lepski [14], Section 4.1. Indeed, assumptions (K1) and (K2) required in the latter paper are satisfied in our setting. Here C_i = C_i(L_K, k_∞, |I|, q), i = 1, 2.
Proposition 2. Let p > 2. Assume that n ≥ C_3, nV_min > C_4, and h_max^{|I|} ≥ 1/√n. Then the corresponding bound holds. The following result is obtained straightforwardly by application of Theorem 2 in Goldenshluger and Lepski [13]. All technical arguments are given in the latter paper and its proof is omitted.
All constants appearing in Propositions 1, 2 and 3 can be expressed explicitly, see corresponding results in Goldenshluger and Lepski [14] and in Goldenshluger and Lepski [13].
To compute our risk bounds we need the following definitions and technical lemmas.

Lemma 1. Let V_min be a fixed number such that nV_min ≥ 1.
(i) Assume that p ∈ [1, 2) and n ≥ 3 ∨ 4^{2p/(2−p)}. Then the stated bound holds.

Lemma 2. Let p > 2, and assume that the conditions n ≥ C_3 and nV_min > C_4 hold for some constants C_3 and C_4. Then the stated bounds hold for all f ∈ F[f_∞, P].

All constants involved in the latter lemmas are given in their proofs, which are postponed to the Appendix.

Note first that, by applying Proposition 3 in Kerkyacharian, Lepski and Picard [24], it is easily established that the required bias bound holds for any h ∈ (0, 1]^d, any P′ ∈ P and any I ∈ P ⋄ P′. Next, by the choice of h_max, we get the required bounds from Lemma 1 (i)-(iii) and Lemma 2 (i)-(ii). Consider now, for all I ∈ P, the system whose solution is given in terms of β̄_I defined in (3.2), with a constant C > 0, for n large enough. Indeed, the choice of the numbers h_max, h_min and V_min implies that the conditions required in both theorems are satisfied. Finally, in view of the properties of the dyadic grids, it is easily seen that the statement of Theorem 4 follows from (4.12) and (4.13).

A.1. Proof of Lemma 1
We divide this proof into several steps.