Fast adaptive estimation of log-additive exponential models in Kullback-Leibler divergence

We study the problem of nonparametric estimation of density functions with a product form on the domain $\triangle=\{( x_1, \ldots, x_d)\in \mathbb{R}^d, 0\leq x_1\leq \dots \leq x_d \leq 1\}$. Such densities appear in the random truncation model as the joint density function of observations. They are also obtained as maximum entropy distributions of order statistics with given marginals. We propose an estimation method based on the approximation of the logarithm of the density by a carefully chosen family of basis functions. We show that the method achieves a fast convergence rate in probability with respect to the Kullback-Leibler divergence for densities whose logarithm belongs to a Sobolev function class with known regularity. In the case when the regularity is unknown, we propose an estimation procedure using convex aggregation of the log-densities to obtain adaptability. The performance of this method is illustrated in a simulation study.


Introduction
In this paper, we estimate densities with a product form on the simplex $\triangle = \{(x_1, \ldots, x_d) \in \mathbb{R}^d,\ 0 \leq x_1 \leq \cdots \leq x_d \leq 1\}$ by a nonparametric approach, given a sample of $n$ independent observations $\mathbb{X}_n = (X_1, \ldots, X_n)$. We restrict our attention to densities which can be written in the form, for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$:
(1) $f_0(x) = \exp\left(\sum_{i=1}^d \ell_i^0(x_i) - a_0\right) \mathbf{1}_{\triangle}(x),$
with $\ell_i^0$ bounded, centered, measurable functions on $I = [0,1]$ for all $1 \leq i \leq d$, and normalizing constant $a_0$. Densities of this form arise, in particular, as solutions of the maximum entropy problem for the distribution of order statistics with given marginals, or in the random truncation model.
More generally, in [10], the authors give a necessary and sufficient condition for the existence of a maximum entropy distribution of order statistics with fixed marginal cumulative distribution functions $F_i$, $1 \leq i \leq d$; see [9] for motivations for this problem. Moreover, its explicit expression is given as a function of the marginal distributions. Let us suppose, for the sake of simplicity, that all $F_i$ are absolutely continuous with density function $f_i$ supported on $I = [0,1]$, and that $F_{i-1} > F_i$ on $(0,1)$ for $2 \leq i \leq d$. Then the maximum entropy density $f^F$, when it exists, admits an explicit expression for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$. This density is of the form required in (1), with $\ell_i^0$ defined in terms of the functions $h_i$ and primitives $K_i$, $2 \leq i \leq d$, of $h_i$ chosen such that the $\ell_i^0$ are centered, and $K_{d+1} = c$ a constant.
We present an additive exponential series model specifically designed to estimate such densities. This exponential model is a multivariate version of the exponential series estimator considered in [5] in the univariate setting. Essentially, we approximate the functions $\ell_i^0$ by a family of polynomials $(\varphi_{i,k}, k \in \mathbb{N})$ which, for each $1 \leq i \leq d$, is orthonormal with respect to the $i$-th marginal of the Lebesgue measure on the support $\triangle$. The model takes the form, for $\theta = (\theta_{i,k};\ 1 \leq i \leq d,\ 1 \leq k \leq m_i)$ and $x = (x_1, \ldots, x_d) \in \triangle$:
$$f_\theta(x) = \exp\left(\sum_{i=1}^d \sum_{k=1}^{m_i} \theta_{i,k}\,\varphi_{i,k}(x_i) - \psi(\theta)\right),$$
with $\psi(\theta)$ the normalizing constant. Even though the polynomials $(\varphi_{i,k}, k \in \mathbb{N})$ are orthonormal for each $1 \leq i \leq d$, if we take $i \neq j$, the families $(\varphi_{i,k}, k \in \mathbb{N})$ and $(\varphi_{j,k}, k \in \mathbb{N})$ are not completely orthogonal with respect to the Lebesgue measure on $\triangle$. The exact definition and further properties of these polynomials can be found in the Appendix. We estimate the parameters of the model by $\hat\theta = (\hat\theta_{i,k};\ 1 \leq i \leq d,\ 1 \leq k \leq m_i)$, obtained by solving the maximum likelihood equations, which amount to matching the expectation of each $\varphi_{i,k}$ under $f_{\hat\theta}$ with its empirical mean over the sample. Approximation of log-densities by polynomials appears in [17] as an application of the maximum entropy principle, while [13] shows existence and consistency of the maximum likelihood estimation. We measure the quality of the estimator $f_{\hat\theta}$ of $f_0$ by the Kullback-Leibler divergence $D(f_0\,\|\,f_{\hat\theta})$ defined as:
$$D\left(f_0\,\|\,f_{\hat\theta}\right) = \int_\triangle f_0 \log\left(f_0/f_{\hat\theta}\right).$$
Convergence rates for nonparametric density estimators have been given by [19] for kernel density estimators, [5] and [32] for exponential series estimators, [4] for histogram-based estimators, and [24] for wavelet-based log-density estimators. Here, we give results on the convergence rate in probability when the functions $\ell_i^0$ belong to a Sobolev space with regularity $r_i > d$ for all $1 \leq i \leq d$. We show that if we take $m = m(n) = (m_1(n), \ldots, m_d(n))$ members of the families $(\varphi_{i,k}, k \in \mathbb{N})$, $1 \leq i \leq d$, and let the $m_i$ grow with $n$ so that the conditions of Theorem 3.3 hold, in particular $(\sum_{i=1}^d m_i)^{2d+1}/n$ tends to 0, then the maximum likelihood estimator $f_{\hat\theta_{m,n}}$ verifies:
$$D\left(f_0\,\|\,f_{\hat\theta_{m,n}}\right) = O_P\left(\sum_{i=1}^d \left(m_i^{-2r_i} + \frac{m_i}{n}\right)\right).$$
Notice that this is the sum of the same univariate convergence rates as in [5]. By choosing $m_i$ proportional to $n^{1/(2r_i+1)}$, which gives the optimal convergence rate $O_P(n^{-2r_i/(2r_i+1)})$ in the univariate case as shown in [34], we achieve a convergence rate of $O_P(n^{-2\min(r)/(2\min(r)+1)})$.
Therefore, by exploiting the special structure of the underlying density and carefully choosing the basis functions, we manage to reduce the problem of estimating a $d$-dimensional density to $d$ one-dimensional density estimation problems. We highlight the fact that this constitutes a significant gain over the convergence rates of general nonparametric multivariate density estimation methods.
In most cases the smoothness parameters $r_i$, $1 \leq i \leq d$, are not available; therefore a method which adapts to the unknown smoothness is required to estimate the density with the best possible convergence rate. Adaptive methods for function estimation based on a random sample include Lepski's method, model selection, wavelet thresholding and aggregation of estimators.
Lepski's method, originating from [27], consists of constructing a grid of regularities and choosing, among the minimax estimators associated to each regularity, the best estimator by an iterative procedure based on the available sample. This method was extensively applied to the Gaussian white noise model, regression, and density estimation; see [8] and references therein. Adaptation via model selection with a complexity penalization criterion was considered by [7] and [3] for a large variety of models including wavelet-based density estimation. Loss in the Kullback-Leibler distance for model selection was studied in [33] and [12] for mixing strategies, and in [35] for the information complexity minimization strategy. More recently, bandwidth selection for multivariate kernel density estimation was addressed in [16] for the $L^s$ risk, $1 \leq s < \infty$, and in [26] for the $L^\infty$ risk. Wavelet-based adaptive density estimation with thresholding was considered in [23] and [14], where an upper bound for the rate of convergence was given for a collection of Besov spaces. Linear and convex aggregate estimators appear in the more recent work [30], with an application to adaptive density estimation in expected $L^2$ risk, with sample splitting.
Here we extend the convex aggregation scheme for the estimation of the logarithm of the density proposed in [11] to achieve adaptability. We take the estimators $f_{\hat\theta_{m,n}}$ for different values of $m \in \mathcal{M}_n$, where $(\mathcal{M}_n)$ is a sequence of sets of parameter configurations with increasing cardinality. These estimators are not uniformly bounded as required in [11], but we show that they are uniformly bounded in probability and that this does not change the general result. The different values of $m$ correspond to different values of the regularity parameters. The convex aggregate estimator $f_\lambda$ takes the form:
$$f_\lambda = \exp\Big(\sum_{m \in \mathcal{M}_n} \lambda_m\, \hat\theta_{m,n}\cdot\varphi_m - \psi_\lambda\Big)\mathbf{1}_\triangle,$$
for aggregation weights $\lambda = (\lambda_m,\ m \in \mathcal{M}_n)$ with $\lambda_m \geq 0$ and $\sum_{m \in \mathcal{M}_n} \lambda_m = 1$, and normalizing constant $\psi_\lambda$ given by:
$$\psi_\lambda = \log\Big(\int_\triangle \exp\Big(\sum_{m \in \mathcal{M}_n} \lambda_m\, \hat\theta_{m,n}\cdot\varphi_m\Big)\Big).$$
To apply the aggregation method, we split our sample $\mathbb{X}_n$ into two parts $\mathbb{X}_n^1$ and $\mathbb{X}_n^2$, with sizes proportional to $n$. We use the first part to create the estimators $f_{\hat\theta_{m,n}}$, then we use the second part to determine the optimal choice of the aggregation parameter $\lambda_n^*$. We select $\lambda_n^*$ by maximizing a penalized version of the log-likelihood function. We show that this method gives a sequence of estimators $f_{\lambda_n^*}$, free of the smoothness parameters $r_1, \ldots, r_d$, which verifies:
$$D\left(f_0\,\|\,f_{\lambda_n^*}\right) = O_P\left(n^{-2\min(r)/(2\min(r)+1)}\right).$$
The rest of the paper is organized as follows. In Section 2 we introduce the notation used in the rest of the paper. In Section 3, we describe the additive exponential series model and the estimation procedure, then we show that the estimator converges to the true underlying density with a convergence rate that is the sum of the convergence rates for the same type of univariate model, see Theorem 3.3. In Section 4, we consider an adaptive method with convex aggregation of the logarithms of the previous estimators to adapt to the unknown smoothness of the underlying density, see Theorem 4.1. We assess the performance of the adaptive estimator via a simulation study in Section 5. The definition of the basis functions and their properties used in the proofs are given in Section 6. The detailed proofs of the results of Sections 3 and 4 are contained in Sections 7, 8 and 9.

Notation
Let $q_i$, $1 \leq i \leq d$, be the one-dimensional marginals of the Lebesgue measure on $\triangle$:
(4) $q_i(t) = \frac{t^{i-1}(1-t)^{d-i}}{(i-1)!\,(d-i)!}, \qquad t \in I.$
If $h_i \in L^1(q_i)$, then we have $\int_\triangle h_i(x_i)\,dx = \int_I h_i\, q_i$. For a vector $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$, let $\min(x)$ (resp. $\max(x)$) denote its smallest (resp. largest) component.
Let us denote the support of a probability density $g$ by $\mathrm{supp}(g) = \{x \in \mathbb{R}^d,\ g(x) > 0\}$. Let $\mathcal{P}(\triangle)$ denote the set of probability densities on $\triangle$. For $g, h \in \mathcal{P}(\triangle)$, the Kullback-Leibler distance $D(g\,\|\,h)$ is defined as:
$$D\left(g\,\|\,h\right) = \int_\triangle g \log(g/h).$$
Definition 2.1. We say that a probability density $f_0 \in \mathcal{P}(\triangle)$ has a product form if there exist bounded measurable functions $(\ell_i^0,\ 1 \leq i \leq d)$ defined on $I$ such that $\int_I \ell_i^0\, q_i = 0$ for $1 \leq i \leq d$ and, a.e. on $\triangle$:
(5) $f_0(x) = \exp\left(\sum_{i=1}^d \ell_i^0(x_i) - a_0\right),$
with $a_0$ the corresponding normalizing constant.
For $1 \leq i \leq d$, let $(\varphi_{i,k}, k \in \mathbb{N})$ be the family of orthonormal polynomials on $I$ with respect to the measure $q_i$; see Section 6 for a precise definition of those polynomials and some of their properties. For $m = (m_1, \ldots, m_d) \in (\mathbb{N}^*)^d$, set $|m| = \sum_{i=1}^d m_i$. For $\theta = (\theta_{i,k})$ and $\theta' = (\theta'_{i,k})$, elements of $\mathbb{R}^{|m|}$, we denote the scalar product $\theta \cdot \theta' = \sum_{i=1}^d \sum_{k=1}^{m_i} \theta_{i,k}\theta'_{i,k}$ and the norm $\|\theta\| = \sqrt{\theta \cdot \theta}$. We define the function $\theta \cdot \varphi_m$ as follows, for $x \in \triangle$: $(\theta \cdot \varphi_m)(x) = \sum_{i=1}^d \sum_{k=1}^{m_i} \theta_{i,k}\,\varphi_{i,k}(x_i)$. For a positive sequence $(a_n)_{n \in \mathbb{N}}$, the notation $O_P(a_n)$ of stochastic boundedness for a sequence of random variables $(Y_n, n \in \mathbb{N})$ means that for every $\varepsilon > 0$, there exists $C_\varepsilon > 0$ such that $\mathbb{P}(|Y_n| > C_\varepsilon\, a_n) < \varepsilon$ for all $n \in \mathbb{N}$.

Additive exponential series model
In this section, we study the problem of estimating an unknown density $f_0$ with a product form on the set $\triangle$, as described in (5), given the sample $\mathbb{X}_n$ drawn from $f_0$. Our goal is to give an estimation method based on a sequence of regular exponential models, which suits the special characteristics of the target density $f_0$. Estimating such a density with standard multidimensional nonparametric techniques naturally suffers from the curse of dimensionality, resulting in slow convergence rates for high-dimensional problems. We show that by taking into account that $f_0$ has a product form, we can recover the one-dimensional convergence rate for the density estimation, allowing for fast convergence of the estimator even if $d$ is large. The quality of the estimators is measured by the Kullback-Leibler distance, as it has strong connections to the maximum entropy framework of [10].
We propose to estimate $f_0$ using the following additive exponential series model, for $m \in (\mathbb{N}^*)^d$ and $\theta \in \mathbb{R}^{|m|}$:
(6) $f_\theta = \exp\left(\theta \cdot \varphi_m - \psi(\theta)\right)\mathbf{1}_\triangle,$
with $\psi(\theta) = \log\left(\int_\triangle \exp(\theta \cdot \varphi_m)\right)$. This model is similar to the one introduced in [32], but there are two major differences. First, we have only kept the univariate terms of the multivariate exponential series estimator of [32], since the target probability density is a product of univariate functions. Second, we have restricted our model to $\triangle$ instead of the hypercube $I^d$, and we have chosen the basis functions $((\varphi_{i,k}, k \in \mathbb{N}), 1 \leq i \leq d)$ which are appropriate for this support.
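As an illustration of (6), the following minimal Python sketch (ours, not the authors' code) evaluates $\theta \cdot \varphi_m$ and approximates $\psi(\theta)$ by Monte Carlo on the simplex. It uses un-normalized shifted Jacobi polynomials as a stand-in for the exact basis of Section 6, so the parametrization of $\theta$ is only illustrative.

```python
import math
import numpy as np
from scipy.special import eval_jacobi

# Minimal sketch of the model (6), with our own (un-normalized) basis choice:
# shifted Jacobi polynomials P_k^{(d-i, i-1)}(2t-1) as in Section 6, but without the
# exact normalizing constants. psi(theta) is approximated by Monte Carlo on the simplex.

d = 2
m = (3, 3)                      # m_i basis functions per coordinate
rng = np.random.default_rng(0)

def phi(i, k, t):
    # k-th (un-normalized) basis polynomial for coordinate i, evaluated at t in [0, 1]
    return eval_jacobi(k, d - i, i - 1, 2.0 * t - 1.0)

def theta_dot_phi(theta, x):
    # (theta . phi_m)(x) = sum_i sum_k theta[i, k] * phi_{i,k}(x_i)
    return sum(theta[i][k] * phi(i + 1, k + 1, x[..., i])
               for i in range(d) for k in range(m[i]))

def psi(theta, n_mc=200_000):
    # psi(theta) = log of the integral of exp(theta . phi_m) over the simplex;
    # sorted uniforms have density d! on the simplex, hence the division by d!.
    u = np.sort(rng.random((n_mc, d)), axis=1)
    return np.log(np.mean(np.exp(theta_dot_phi(theta, u))) / math.factorial(d))

theta = rng.normal(scale=0.1, size=(d, max(m)))
x = np.array([0.2, 0.7])        # a point of the simplex (x_1 <= x_2)
log_f = theta_dot_phi(theta, x) - psi(theta)
```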
Remark 3.1. In the general case, one has to be careful when considering a density $f_0$ with a product form and a support different from $\triangle$. Let $f_i^0$ denote the $i$-th marginal density function of $f_0$. If $\mathrm{supp}(f_i^0) = A \subset \mathbb{R}$ for all $1 \leq i \leq d$, we can apply a strictly monotone mapping of $A$ onto $I$ to obtain a distribution with a product form supported on $\triangle$. When the supports of the marginals differ, there is no transformation that yields a random vector with a density as in Definition 2.1. A possible way to treat this case consists of constructing a family of basis functions which has similar properties with respect to $\mathrm{supp}(f_0)$ as the family $((\varphi_{i,k}, k \in \mathbb{N}), 1 \leq i \leq d)$ with respect to $\triangle$, which we discuss in detail in Section 6. Then we could define an exponential series model with this family of basis functions and support restricted to $\mathrm{supp}(f_0)$ to estimate $f_0$.
Let $m \in (\mathbb{N}^*)^d$. We define the following function on $\mathbb{R}^{|m|}$, taking values in $\mathbb{R}^{|m|}$, by:
(7) $A_m(\theta) = \int_\triangle \varphi_m\, f_\theta.$
According to Lemma 3 in [5], we have the following result on $A_m$.

Lemma 3.2. The function $A_m$ is one-to-one from $\mathbb{R}^{|m|}$ onto its image $\Omega_m = A_m(\mathbb{R}^{|m|})$.
We denote by $\Theta_m : \Omega_m \to \mathbb{R}^{|m|}$ the inverse of $A_m$. The empirical mean of the sample $\mathbb{X}_n$ of size $n$ is:
(8) $\hat\mu_{m,n} = \frac{1}{n}\sum_{j=1}^n \varphi_m(X_j).$
In Section 8.2 we show that $\hat\mu_{m,n} \in \Omega_m$ a.s. when $n \geq 2$.
For $n \geq 2$, we define a.s. the maximum likelihood estimator $\hat f_{m,n} = f_{\hat\theta_{m,n}}$ of $f_0$ by choosing:
(9) $\hat\theta_{m,n} = \Theta_m(\hat\mu_{m,n}).$
The loss between the estimator $\hat f_{m,n}$ and the true underlying density $f_0$ is measured by the Kullback-Leibler divergence $D(f_0\,\|\,\hat f_{m,n})$.
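To see why (9) defines the maximum likelihood estimator, here is a short derivation of our own, consistent with (6)-(8) and with the characterization (23) used in Section 5. The normalized log-likelihood of the model (6) is
$$\frac{1}{n}\sum_{j=1}^n \log f_\theta(X_j) = \theta \cdot \hat\mu_{m,n} - \psi(\theta),$$
whose gradient in $\theta$ is $\hat\mu_{m,n} - \int_\triangle \varphi_m f_\theta = \hat\mu_{m,n} - A_m(\theta)$. Since $\psi$ is the log-partition function of the exponential family, the log-likelihood is concave, so it is maximal exactly when $A_m(\theta) = \hat\mu_{m,n}$, that is, at $\theta = \Theta_m(\hat\mu_{m,n})$ whenever $\hat\mu_{m,n} \in \Omega_m$.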
For $r \in \mathbb{N}^*$, let $W_r^2(q_i)$ denote the Sobolev space of functions in $L^2(q_i)$ such that the $(r-1)$-th derivative is absolutely continuous and the $L^2$ norm of the $r$-th derivative is finite:
$$W_r^2(q_i) = \left\{h \in L^2(q_i) :\ h^{(r-1)} \text{ is absolutely continuous and } h^{(r)} \in L^2(q_i)\right\}.$$
The main result is given by the following theorem whose proof is given in Section 8.3.
Theorem 3.3. Let $f_0 \in \mathcal{P}(\triangle)$ be a probability density with a product form, see Definition 2.1. Assume the functions $\ell_i^0$ defined in (5) belong to the Sobolev spaces $W_{r_i}^2(q_i)$, $r_i \in \mathbb{N}$ with $r_i > d$, for all $1 \leq i \leq d$. Let $(X_n, n \in \mathbb{N}^*)$ be i.i.d. random variables with density $f_0$. We consider a sequence $(m(n) = (m_1(n), \ldots, m_d(n)), n \in \mathbb{N}^*)$ such that $\lim_{n\to\infty} m_i(n) = +\infty$ for all $1 \leq i \leq d$, and which satisfies conditions (10) and (11); in particular,
(11) $\lim_{n\to\infty} \Big(\sum_{i=1}^d m_i(n)\Big)^{2d+1}\big/ n = 0.$
Then the Kullback-Leibler distance $D(f_0\,\|\,\hat f_{m,n})$ of the maximum likelihood estimator $\hat f_{m,n}$ defined by (9) to $f_0$ converges in probability to 0 with the convergence rate:
$$D\left(f_0\,\|\,\hat f_{m,n}\right) = O_P\left(\sum_{i=1}^d \left(m_i(n)^{-2r_i} + \frac{m_i(n)}{n}\right)\right).$$
Remark 3.4. Choose $m_i(n)$ proportional to $n^{1/(2r_i+1)}$ for each $1 \leq i \leq d$. This choice constitutes a balance between the bias and the variance term. Then the conditions (10) and (11) are satisfied, and we obtain that:
$$D\left(f_0\,\|\,\hat f_{m,n}\right) = O_P\left(n^{-2\min(r)/(2\min(r)+1)}\right).$$
Thus the convergence rate corresponds to the least smooth $\ell_i^0$. This rate can also be obtained with a choice where all $m_i$ are the same, namely with $(m^*(n) = (v^*(n), \ldots, v^*(n)), n \in \mathbb{N}^*)$ and $v^*(n) = n^{1/(2\min(r)+1)}$.
For $r = (r_1, \ldots, r_d) \in (\mathbb{N}^*)^d$ with $r_i > d$ for $1 \leq i \leq d$, and a constant $\kappa > 0$, let $\mathcal{K}_r(\kappa)$ denote the set (13) of product-form densities $f_0 \in \mathcal{P}(\triangle)$ whose components $\ell_i^0$ belong to $W_{r_i}^2(q_i)$ with norms bounded by $\kappa$. The constants $A_1$ and $A_2$, appearing in the upper bounds during the proof of Theorem 3.3 (more precisely in Propositions 8.3 and 8.5), are uniformly bounded on $\mathcal{K}_r(\kappa)$, thanks to Corollary 6.13 and to $\|\log(f_0)\|_\infty \leq 2d\kappa + |\log(d!)|$, which is due to (43). This yields the following corollary on the uniform convergence in probability over the set $\mathcal{K}_r(\kappa)$ of densities.
Corollary 3.5. Under the assumptions of Theorem 3.3, the convergence of Theorem 3.3 holds uniformly in probability over the class $\mathcal{K}_r(\kappa)$.
Remark 3.6. Since we let $r_i$ vary for each $1 \leq i \leq d$, our class of densities $\mathcal{K}_r(\kappa)$ has an anisotropic feature. Estimation of anisotropic multivariate functions for the $L^s$ risk, $1 \leq s \leq \infty$, was considered in multiple papers. For a Gaussian white noise model, [22] obtains minimax convergence rates on anisotropic Besov classes for the $L^s$ risk, $1 \leq s < \infty$, while [6] gives the minimax rate of convergence on anisotropic Hölder classes for the $L^\infty$ risk. For kernel density estimation, results on the minimax convergence rate for anisotropic Nikol'skii classes for the $L^s$ risk, $1 \leq s < \infty$, can be found in [16]. These papers conclude, in general, that if the considered class has smoothness parameter $r_i$ in the $i$-th coordinate, $1 \leq i \leq d$, then the optimal convergence rate becomes $n^{-2\bar R/(2\bar R+1)}$ (up to a logarithmic factor for the $L^\infty$ risk), with $\bar R$ defined by the equation $1/\bar R = \sum_{i=1}^d 1/r_i$. Since $\bar R < r_i$ for all $1 \leq i \leq d$, the convergence rate $n^{-2\min(r)/(2\min(r)+1)}$ is strictly better than the convergence rate for these anisotropic classes. In the isotropic case, when $r_i = r$ for all $1 \leq i \leq d$, the minimax convergence rate specializes to $n^{-2r/(2r+d)}$ (which was obtained in [32] as an upper bound). This rate deteriorates quickly as the dimension $d$ increases. However, by exploiting the multiplicative structure of the model, we obtain the univariate convergence rate $n^{-2r/(2r+1)}$, which is minimax optimal, see [34].
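To make the comparison of Remark 3.6 concrete, here is a small worked example of our own, based on the rates recalled above. Take $d = 2$ and $r = (3, 4)$ (so that $r_i > d$). The choice of Remark 3.4, $m_1(n) \propto n^{1/7}$ and $m_2(n) \propto n^{1/9}$, gives
$$D\left(f_0\,\|\,\hat f_{m,n}\right) = O_P\!\left(n^{-6/7} + n^{-8/9}\right) = O_P\!\left(n^{-6/7}\right) = O_P\!\left(n^{-2\min(r)/(2\min(r)+1)}\right),$$
whereas the anisotropic benchmark of Remark 3.6 has $1/\bar R = 1/3 + 1/4 = 7/12$, hence $\bar R = 12/7$, and rate $n^{-2\bar R/(2\bar R+1)} = n^{-24/31}$, which is indeed slower than $n^{-6/7}$.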

Adaptive estimation
Notice that the choice of the optimal series of estimators $\hat f_{m^*,n}$, with $m^*$ defined in Remark 3.4, requires the knowledge of at least $\min(r)$. When this knowledge is not available, we propose an adaptive method based on the estimators of Section 3, which asymptotically mimics the behaviour of the optimal choice. Let us introduce some notation first. We separate the sample $\mathbb{X}_n$ into two parts $\mathbb{X}_n^1$ and $\mathbb{X}_n^2$ of sizes $n_1 = C_e n$ and $n_2 = n - C_e n$ respectively, with some constant $C_e \in (0,1)$. The first part of the sample will be used to create our estimators, and the second part will be used in the aggregation procedure. Let $(N_n, n \in \mathbb{N}^*)$ be a sequence of non-decreasing positive integers such that $\lim_{n\to\infty} N_n = +\infty$, and let $\mathcal{M}_n$ denote the corresponding set (14) of parameter configurations. For $m \in \mathcal{M}_n$, let $\hat f_{m,n}$ be the additive exponential series estimator based on the first part of the sample, namely $\hat f_{m,n} = f_{\hat\theta_{m,n}}$ with $\hat\theta_{m,n}$ given by (9) using the sample $\mathbb{X}_n^1$ (replacing $n$ with $n_1$ in the definition (8) of $\hat\mu_{m,n}$). Let $\mathcal{F}_n = \{\hat f_{m,n},\ m \in \mathcal{M}_n\}$ denote the set of different estimators obtained by this procedure; notice that $\mathrm{Card}(\mathcal{F}_n)$ is finite and grows with $N_n$. Recall that by Remark 3.4, for $r = (r_1, \ldots, r_d)$ with $r_i > d$, the estimator $\hat f_{m^*,n}$ achieves the optimal convergence rate $O_P(n^{-2\min(r)/(2\min(r)+1)})$ for $n$ larger than a threshold depending on $r$. By letting $N_n$ go to infinity, we ensure that, for every combination of regularity parameters $r = (r_1, \ldots, r_d)$ with $r_i > d$, the sequence of optimal estimators $\hat f_{m^*,n}$ is included in the sets $\mathcal{F}_n$ for $n$ large enough.
We use the second part of the sample $\mathbb{X}_n^2$ to create an aggregate estimator based on $\mathcal{F}_n$, which asymptotically mimics the performance of the optimal sequence $\hat f_{m^*,n}$. We will write $\hat\ell_{m,n} = \hat\theta_{m,n} \cdot \varphi_m$ to ease notation. We define the convex combination $\hat\ell_\lambda$ of the functions $\hat\ell_{m,n}$, $m \in \mathcal{M}_n$:
$$\hat\ell_\lambda = \sum_{m \in \mathcal{M}_n} \lambda_m\, \hat\ell_{m,n}, \qquad \lambda \in \Lambda^+ = \Big\{\lambda = (\lambda_m,\ m \in \mathcal{M}_n):\ \lambda_m \geq 0,\ \sum_{m \in \mathcal{M}_n} \lambda_m = 1\Big\}.$$
For such a convex combination, we define the probability density function $f_\lambda$ as:
(16) $f_\lambda = \exp\left(\hat\ell_\lambda - \psi_\lambda\right)\mathbf{1}_\triangle, \quad \text{with } \psi_\lambda = \log\left(\int_\triangle \exp(\hat\ell_\lambda)\right).$
We apply the convex aggregation method for log-densities developed in [11] to get an aggregate estimator which achieves adaptability. Notice that the reference probability measure of this paper corresponds to $d!\,\mathbf{1}_\triangle(x)dx$. This implies that $\psi_\lambda$ here differs from the $\psi_\lambda$ of [11] by the constant $\log(d!)$, but this does not affect the calculations. The aggregation weights are chosen by maximizing the penalized maximum likelihood criterion $H_n$, defined on $\Lambda^+$ and based on the second part of the sample, with the penalizing function $\mathrm{pen}(\lambda) = \sum_{m \in \mathcal{M}_n} \lambda_m D(f_\lambda\,\|\,\hat f_{m,n})$. The convex aggregate estimator $f_{\lambda_n^*}$ is obtained by setting:
(18) $\lambda_n^* = \operatorname*{argmax}_{\lambda \in \Lambda^+} H_n(\lambda).$
The main result of this section is given by the next theorem, which asserts that if we choose $N_n = o(\log(n))$ such that $\lim_{n\to\infty} N_n = +\infty$, the sequence of convex aggregate estimators $f_{\lambda_n^*}$ converges to $f_0$ with the optimal convergence rate, i.e. as if the smoothness were known.
Theorem 4.1. Let $f_0 \in \mathcal{P}(\triangle)$ be a probability density with a product form given by (5). Assume the functions $\ell_i^0$ belong to the Sobolev spaces $W_{r_i}^2(q_i)$, $r_i \in \mathbb{N}$ with $r_i > d$, for all $1 \leq i \leq d$. Let $(N_n, n \in \mathbb{N}^*)$ be such that $N_n = o(\log(n))$ and $\lim_{n\to\infty} N_n = +\infty$. The convex aggregate estimator $f_{\lambda_n^*}$ defined by (16), with $\lambda_n^*$ given by (18), converges to $f_0$ in probability with the convergence rate:
(19) $D\left(f_0\,\|\,f_{\lambda_n^*}\right) = O_P\left(n^{-2\min(r)/(2\min(r)+1)}\right).$
The proof of this theorem is provided in Section 9. Similarly to Corollary 3.5, we have uniform convergence over sets of densities $\mathcal{K}_r(\kappa)$, defined in (13), with increasing regularity; see Corollary 4.2, in which $R_n$ satisfies the three inequalities (20), (21) and (22). Remark 4.3. For example, when $N_n = \log(n)/(2\log(\log(n)))$, then (20), (21) and (22) are satisfied with $R_n = N_n$ for $n$ large enough.
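As an illustration of the aggregation step, the following Python sketch (ours, not the authors' implementation) maximizes a penalized held-out log-likelihood over convex weights. The precise criterion $H_n$ and the constant in front of the penalty should be taken from its definition above and from [11]; the factor $1/2$ used below is only a plausible placeholder.

```python
import math
import numpy as np
from scipy.optimize import minimize

# Sketch (ours) of the convex aggregation step: given the fitted log-components
# ell_m = theta_m . phi_m (passed as callables mapping an (n, d) array of points of
# the simplex to an (n,) array), maximize a penalized held-out log-likelihood over
# convex weights lambda. The penalty is pen(lambda) = sum_m lambda_m D(f_lambda||f_m);
# the 1/2 in front of it is a placeholder for the constant used in [11].

def aggregate(ell_list, X2, d, n_mc=100_000, seed=0):
    rng = np.random.default_rng(seed)
    U = np.sort(rng.random((n_mc, d)), axis=1)              # MC points, density d! on simplex
    L_mc = np.column_stack([ell(U) for ell in ell_list])    # ell_m at the MC points
    L_dat = np.column_stack([ell(X2) for ell in ell_list])  # ell_m at the held-out data
    log_dfact = math.log(math.factorial(d))

    def psi(vals):                                           # log of int_simplex exp(vals)
        return np.log(np.mean(np.exp(vals))) - log_dfact

    psi_m = np.array([psi(L_mc[:, j]) for j in range(len(ell_list))])

    def neg_criterion(lam):
        ell_lam = L_mc @ lam
        psi_lam = psi(ell_lam)
        loglik = np.mean(L_dat @ lam) - psi_lam
        # f_lambda at the MC points, divided by d! so that averages approximate
        # integrals over the simplex
        w = np.exp(ell_lam - psi_lam) / math.factorial(d)
        kl = np.array([np.mean(w * (ell_lam - psi_lam - L_mc[:, j] + psi_m[j]))
                       for j in range(len(ell_list))])       # D(f_lambda || f_m)
        return -(loglik - 0.5 * lam @ kl)

    lam0 = np.full(len(ell_list), 1.0 / len(ell_list))
    res = minimize(neg_criterion, lam0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * len(ell_list),
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    return res.x                                             # aggregation weights lambda*
```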

Simulation study: random truncation model
In this section we present the results of Monte Carlo simulation studies on the performance of the additive exponential series estimator. We take the example of the random truncation model introduced in Section 1 with $d = 2$, which is used in many applications. This model naturally satisfies our model assumptions.
Let $Z = (Z_1, Z_2)$ be a pair of independent random variables with density functions $p_1, p_2$ respectively, such that $\triangle \subset \mathrm{supp}(p)$, where $p(x_1, x_2) = p_1(x_1)p_2(x_2)$ is the joint density function of $Z$. Suppose that we only observe the pair $(Z_1, Z_2)$ if $0 \leq Z_1 \leq Z_2 \leq 1$. Then the joint density function $f$ of the observable pairs is given by, for $(x_1, x_2) \in \triangle$:
$$f(x_1, x_2) = \frac{p_1(x_1)\,p_2(x_2)}{\mathbb{P}(0 \leq Z_1 \leq Z_2 \leq 1)}.$$
This corresponds to the form (2) with $g_1, g_2$ given in terms of $p_1, p_2$. We choose the densities $p_1, p_2$ from several families of distributions, including the Gumbel$(\alpha, \beta)$ distribution with $\alpha > 0$, $\beta \in \mathbb{R}$. The exact choices for the densities $p_1, p_2$ are given in Table 1. Figure 1 shows the resulting density functions $g_1$ and $g_2$ for each case.
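The random truncation model can be simulated by rejection: draw independent $Z_1 \sim p_1$, $Z_2 \sim p_2$ and keep a pair only when $0 \leq Z_1 \leq Z_2 \leq 1$. The Python sketch below is ours; the Gumbel parameters are illustrative and not the ones of Table 1.

```python
import numpy as np
from scipy import stats

# Sketch (ours) of the random truncation model used in the simulations: draw
# independent Z1 ~ p1, Z2 ~ p2 and keep a pair only when 0 <= Z1 <= Z2 <= 1,
# which yields observations with the product-form density f on the simplex.

def sample_truncated(n, p1, p2, rng, batch=10_000):
    """Rejection sampler: returns an (n, 2) array of accepted pairs."""
    out, kept = [], 0
    while kept < n:
        z1 = p1.rvs(size=batch, random_state=rng)
        z2 = p2.rvs(size=batch, random_state=rng)
        ok = (0.0 <= z1) & (z1 <= z2) & (z2 <= 1.0)
        out.append(np.column_stack([z1[ok], z2[ok]]))
        kept += ok.sum()
    return np.concatenate(out)[:n]

rng = np.random.default_rng(1)
p1 = stats.gumbel_r(loc=0.3, scale=0.2)   # illustrative marginals for Z1, Z2
p2 = stats.gumbel_r(loc=0.6, scale=0.2)
X = sample_truncated(1000, p1, p2, rng)    # sample from f on the simplex
```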
To calculate the parameters $\hat\theta_{m,n}$, we recall that $\hat\theta_{m,n}$ is the solution of equation (9), and can therefore also be characterized as:
(23) $\hat\theta_{m,n} = \operatorname*{argmax}_{\theta \in \mathbb{R}^{|m|}}\left(\theta \cdot \hat\mu_{m,n} - \psi(\theta)\right),$
with $\hat\mu_{m,n}$ defined by (8), see Lemma 7.4. We use a numerical optimisation method to solve (23) and obtain the parameters $\hat\theta_{m,n}$. We estimate our model with $m_1 = m_2 = m$ and $m = 1, 2, 3, 4$. We compute the final estimator with the convex aggregation method proposed in Section 4. We ran 100 estimations with increasing sample sizes $n \in \{200, 500, 1000\}$, and we calculated the average Kullback-Leibler distance as well as the $L^2$ distance between $f_0$ and its estimator. We used $C_e = 0.8$, i.e. 80% of the sample, to calculate the initial estimators, and the remaining 20% to perform the aggregation. The distances were calculated by numerical integration. We compare the results with a truncated kernel density estimator with Gaussian kernel functions and bandwidth selection based on Scott's rule. The results are summarized in Tables 2 and 3.
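A possible numerical implementation of (23) is sketched below (our own code, not the authors'): maximize $\theta \cdot \hat\mu_{m,n} - \psi(\theta)$ with a quasi-Newton method, approximating $\psi$ by Monte Carlo. As in the earlier sketch, the basis is un-normalized shifted Jacobi, so the parametrization is illustrative.

```python
import math
import numpy as np
from scipy.special import eval_jacobi
from scipy.optimize import minimize

# Sketch (ours) of the maximum likelihood step (23): maximize theta . mu_hat - psi(theta)
# with BFGS, psi being approximated by Monte Carlo on the simplex. The basis polynomials
# are un-normalized shifted Jacobi as in the sketch after (6).

d, m = 2, 3                                    # two coordinates, m_1 = m_2 = 3
rng = np.random.default_rng(2)

def design(X):
    """Rows: observations on the simplex; columns: phi_{i,k}(x_i), i = 1..d, k = 1..m."""
    X = np.atleast_2d(X)
    cols = [eval_jacobi(k, d - i, i - 1, 2.0 * X[:, i - 1] - 1.0)
            for i in range(1, d + 1) for k in range(1, m + 1)]
    return np.column_stack(cols)

U = np.sort(rng.random((200_000, d)), axis=1)  # Monte Carlo points, density d! on simplex
Phi_mc = design(U)

def psi(theta):
    return np.log(np.mean(np.exp(Phi_mc @ theta)) / math.factorial(d))

def fit(X):
    mu_hat = design(X).mean(axis=0)            # empirical means (8)
    neg_loglik = lambda th: psi(th) - th @ mu_hat
    return minimize(neg_loglik, np.zeros(d * m), method="BFGS").x

# X = sample_truncated(1000, p1, p2, rng)      # e.g. from the truncation sketch above
# theta_hat = fit(X)
```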

We can conclude that the additive exponential series estimator outperforms the kernel density estimator both with respect to the Kullback-Leibler distance and the $L^2$ distance. As expected, the performance of both methods improves with the sample size. The boxplots of the 100 values of the Kullback-Leibler and $L^2$ distances for the different sample sizes can be found in Figures 2, 4 and 6. Figures 3, 5 and 7 illustrate the different estimators compared to the true joint density function for the three cases, obtained with a sample size of 1000. We can observe that the additive exponential series method leads to a smooth estimator compared to the kernel method.
Remark 5.1. The additive exponential series model encompasses many popular choices for the marginals $p_1, p_2$. For example, the exponential distribution is included in the model for $m_i = 1$, and the normal distribution is included for $m_i = 2$. Thus we expect that choosing exponential or normal distributions for $p_1, p_2$ yields even better results for the additive exponential series estimator, which was confirmed by our numerical experiments (not included here for brevity).
6. Appendix: Orthonormal series of polynomials
6.1. Jacobi polynomials. The following results can be found in [2], p. 774. The Jacobi polynomials $(P_k^{(\alpha,\beta)}, k \in \mathbb{N})$ are orthogonal on $[-1,1]$ with respect to the weight function $(1-t)^\alpha(1+t)^\beta$. They are given by Rodrigues' formula, for $t \in [-1,1]$, $k \in \mathbb{N}$: The normalizing constants are given by: (24) In what follows, we will be interested in the Jacobi polynomials with $\alpha = d-i$ and $\beta = i-1$, which are orthogonal with respect to the corresponding weight function; $P_k^{(d-i,i-1)}$ has degree $k$. The derivatives of the Jacobi polynomials $P_k^{(d-i,i-1)}$, $r \leq k$, verify, for $t \in I$ (see Proposition 1.4.15 of [15]): We also have: Recall the definition (4) of the marginals $q_i$ of the Lebesgue measure on the simplex. According to the following lemma, the polynomials $(\varphi_{i,k}, k \in \mathbb{N})$ form an orthonormal basis of $L^2(q_i)$ for all $1 \leq i \leq d$. Notice that $\varphi_{i,k}$ has degree $k$.
Lemma 6.2. For $1 \leq i \leq d$ and $k, \ell \in \mathbb{N}$, we have $\int_I \varphi_{i,k}\,\varphi_{i,\ell}\,q_i = \mathbf{1}_{\{k = \ell\}}$.
Proof. We have, for $k, \ell \in \mathbb{N}$: where we used (24) for the last equality.
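The orthonormality of Lemma 6.2 can be checked numerically. The Python sketch below is our own: it assumes, consistently with (4) and Definition 6.1, that $q_i(t) = t^{i-1}(1-t)^{d-i}/((i-1)!(d-i)!)$ and that $\varphi_{i,k}$ is a rescaled version of $P_k^{(d-i,i-1)}(2t-1)$; the normalization is computed numerically rather than taken from (24).

```python
import math
import numpy as np
from scipy.special import eval_jacobi
from scipy.integrate import quad

# Numerical sanity check (ours): the shifted Jacobi polynomials P_k^{(d-i, i-1)}(2t-1),
# once rescaled, are orthonormal in L^2(q_i), with q_i the i-th one-dimensional marginal
# of the Lebesgue measure on the simplex (assumption consistent with (4)).

d, i = 3, 2                                    # dimension and coordinate index

def q(t):
    return t ** (i - 1) * (1.0 - t) ** (d - i) / (math.factorial(i - 1) * math.factorial(d - i))

def inner(k, l):
    # scalar product of the k-th and l-th shifted Jacobi polynomials in L^2(q_i) on [0, 1]
    f = lambda t: (eval_jacobi(k, d - i, i - 1, 2 * t - 1)
                   * eval_jacobi(l, d - i, i - 1, 2 * t - 1) * q(t))
    return quad(f, 0.0, 1.0)[0]

norms = [math.sqrt(inner(k, k)) for k in range(5)]
gram = np.array([[inner(k, l) / (norms[k] * norms[l]) for l in range(5)] for k in range(5)])
print(np.round(gram, 6))                       # should be close to the identity matrix
```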
6.3. Mixed scalar products. Recall notation (3), so that $(\varphi_{[i],k}, k \in \mathbb{N})$ is a family of orthonormal polynomials with respect to the Lebesgue measure on $\triangle$, for all $1 \leq i \leq d$.
We give the mixed scalar products of $(\varphi_{[i],k}, k \in \mathbb{N})$ and $(\varphi_{[j],\ell}, \ell \in \mathbb{N})$, $1 \leq i < j \leq d$, with respect to the Lebesgue measure on the simplex $\triangle$.
Lemma 6.3. For $1 \leq i < j \leq d$ and $k, \ell \in \mathbb{N}$, the mixed scalar product $\int_\triangle \varphi_{[i],k}\,\varphi_{[j],\ell}$ vanishes when $k \neq \ell$, and satisfies $\left|\int_\triangle \varphi_{[i],k}\,\varphi_{[j],\ell}\right| \leq 1$ for all $k, \ell \in \mathbb{N}$.
Proof. We have: with $r_k$ a polynomial defined on $I$ given by: Notice that $r_k$ is a polynomial of degree at most $k$, as $\varphi_{i,k}$ is a polynomial of degree $k$. Therefore if $k < \ell$, the integral equals 0, since $\varphi_{j,\ell}$ is orthogonal (with respect to the measure $q_j$) to any polynomial of degree less than $\ell$. Similar calculations show that if $k > \ell$, the integral is also 0.
Let us now consider the case $k = \ell$. We compute the coefficient $\nu_k$ of $t^k$ in the polynomial $r_k$. We deduce from (25) that the leading coefficient $\omega_{i,k}$ of $\varphi_{i,k}$ is given by: Using this we obtain for $\nu_k$: and thus $r_k$ has degree $k$. The orthonormality of $(\varphi_{j,k}, k \in \mathbb{N})$ ensures that $\int_I r_k\,\varphi_{j,k}\,q_j = \nu_k/\omega_{j,k}$. Therefore, we obtain: This shows that the family of functions $\varphi = (\varphi_{i,k},\ 1 \leq i \leq d,\ k \in \mathbb{N})$ is not orthogonal with respect to the Lebesgue measure on $\triangle$. For $k \in \mathbb{N}^*$, let us consider the matrix $R_k \in \mathbb{R}^{d \times d}$ with elements:
(29) $(R_k)_{i,j} = \int_\triangle \varphi_{[i],k}\,\varphi_{[j],k}, \qquad 1 \leq i, j \leq d.$
It is a Gram matrix, therefore symmetric and positive semi-definite. Let $\lambda_{k,1} \geq \cdots \geq \lambda_{k,d}$ denote the eigenvalues of $R_k$. We aim to find a lower bound for these eigenvalues which is independent of $k$.
Lemma 6.4. For $k \in \mathbb{N}^*$, the smallest eigenvalue $\lambda_{k,d}$ of $R_k$ is given by:
$$\lambda_{k,d} = \frac{k}{k+d-1},$$
and we have $\lambda_{k,d} \geq 1/d$.
Proof. It is easy to check that the inverse $R_k^{-1}$ of $R_k$ exists and is symmetric tridiagonal, with diagonal entries $D_i$, $1 \leq i \leq d$, and lower (and upper) diagonal elements $Q_i$, $1 \leq i \leq d-1$, given by: The matrix $R_k^{-1}$ is positive definite, since all of its leading principal minors are positive. In particular, this ensures that the eigenvalues of $R_k$ and $R_k^{-1}$ are all positive. Let $c_i(\lambda)$, $1 \leq i \leq d$, denote the $i$-th leading principal minor of the matrix $R_k^{-1} - \lambda I_d$, where $I_d$ is the $d$-dimensional identity matrix. The eigenvalues of $R_k^{-1}$ are exactly the roots of the characteristic polynomial $c_d(\lambda)$. Since $R_k^{-1}$ is symmetric and tridiagonal, we have the following recurrence relation for $c_i(\lambda)$, $1 \leq i \leq d$: Let $M_k$ be the symmetric tridiagonal $d \times d$ matrix with diagonal entries $D_i$, $1 \leq i \leq d$, and lower (and upper) diagonal elements $|Q_i|$, $1 \leq i \leq d-1$; then $M_k$ and $R_k^{-1}$ have the same eigenvalues.
It is easy to check that $\lambda^* = (k+d-1)/k$ is an eigenvalue of $M_k$ with corresponding eigenvector $v = (v_1, \ldots, v_d)$ given by, for $1 \leq i \leq d$: The matrix $M_k$ has non-negative elements, with positive elements on the diagonal, sub- and superdiagonal. Therefore $M_k$ is irreducible, and we can apply the Perron-Frobenius theorem for non-negative, irreducible matrices: the largest eigenvalue of $M_k$ has multiplicity one and is the only eigenvalue with a corresponding eigenvector $x$ such that $x > 0$. Since $v > 0$, we deduce that $\lambda^*$ is the largest eigenvalue of $M_k$, hence of $R_k^{-1}$, so that $\lambda_{k,d} = 1/\lambda^* = k/(k+d-1)$. Since $\lambda_{k,d}$ is increasing in $k$, we have the uniform lower bound $1/d$.
Remark 6.5. We conjecture that the eigenvalues $\lambda_{k,i}$ of $R_k$ are given by, for $1 \leq i \leq d$:
6.4. Bounds between different norms. In this section, we give inequalities between different types of norms for functions defined on the simplex $\triangle$. These inequalities are used in the proof of Theorem 3.3.
We first give lower and upper bounds on $\|\theta \cdot \varphi_m\|_{L^2}$.
Lemma 6.6. For all $\theta \in \mathbb{R}^{|m|}$ we have:
$$\frac{1}{\sqrt{d}}\,\|\theta\| \leq \|\theta \cdot \varphi_m\|_{L^2} \leq \sqrt{d}\,\|\theta\|.$$
Proof. For the upper bound, one simply has, by the triangle inequality and the orthonormality: For the lower bound, we have: where we used the normality of $\varphi_{[i],k}$ with respect to the Lebesgue measure on $\triangle$, and Lemma 6.3 for the cross products. We can rewrite this in matrix form: where $R_k \in \mathbb{R}^{d \times d}$ is given by (29). Since, according to Lemma 6.4, all the eigenvalues of $R_k$ are bounded from below by $1/d$, this gives: This concludes the proof.
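The uniform eigenvalue bound of Lemma 6.4, which drives the lower bound in Lemma 6.6, can also be checked numerically. The sketch below is ours, and it takes the entries of $R_k$ to be the normalized degree-$k$ mixed scalar products, as in the reconstruction of (29) above.

```python
import numpy as np
from scipy.special import eval_jacobi

# Numerical check (ours) of Lemma 6.4: build the Gram matrix R_k of the degree-k basis
# polynomials phi_{[i],k}, i = 1..d, normalized in L^2 of (a multiple of) the Lebesgue
# measure on the simplex, and verify that its smallest eigenvalue is at least 1/d.
# The normalization constant cancels in the Gram matrix, so the d! factor is irrelevant.

d, k, n_mc = 4, 3, 400_000
rng = np.random.default_rng(3)
U = np.sort(rng.random((n_mc, d)), axis=1)         # density d! on the simplex

# columns: un-normalized phi_{i,k}(x_i) evaluated at the Monte Carlo points
V = np.column_stack([eval_jacobi(k, d - i, i - 1, 2.0 * U[:, i - 1] - 1.0)
                     for i in range(1, d + 1)])
V /= np.sqrt(np.mean(V ** 2, axis=0))              # normalize each column

R_k = (V.T @ V) / n_mc                             # Monte Carlo Gram matrix
lam_min = np.linalg.eigvalsh(R_k)[0]
print(lam_min, k / (k + d - 1), 1.0 / d)           # expect lam_min ~ k/(k+d-1) >= 1/d
```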
We give an inequality between different norms for polynomials defined on I.
Lemma 6.7. If $h$ is a polynomial of degree less than or equal to $n$ on $I$, then we have, for all $t \in I$:
Proof. Writing $h = \sum_{k=0}^{n} \beta_k\, \varphi_{i,k}$, by the Cauchy-Schwarz inequality we have:
We deduce from Definition 6.1 of $\varphi_{i,k}$ and (27) that: For all $1 \leq i \leq d$, we have the uniform upper bound: This implies that, for $t \in I$: Bessel's inequality implies that $\sum_{k=0}^n \beta_k^2 \leq \|h\|_{L^2(q_i)}^2$. We conclude the proof using (31).
We recall the notation $S_m$ for the linear space spanned by $(\varphi_{[i],k};\ 1 \leq i \leq d,\ 1 \leq k \leq m_i)$.
Proof of Lemma 6.8. Let $g \in S_m$. We can write $g = \theta \cdot \varphi_m$ for a unique $\theta \in \mathbb{R}^{|m|}$. Let $g = \sum_{i=1}^d g_i(x_i)$, where $g_i$ is a polynomial defined on $I$ of degree at most $m_i$, for all $1 \leq i \leq d$.
We have: where we used Lemma 6.7 for the second inequality, Cauchy-Schwarz for the third inequality, and Lemma 6.6 for the fourth inequality.
Remark 6.9. For $d$ fixed, $\kappa_m$ as a function of $m$ verifies $\kappa_m = O(|m|^d)$.
We now consider additive functions $\ell = \sum_{i=1}^d \ell_{[i]}$ with $\ell_i \in W_{r_i}^2(q_i)$ and $\int_I \ell_i\, q_i = 0$ for $1 \leq i \leq d$. Let $\ell_{i,m_i}$ be the orthogonal projection in $L^2(q_i)$ of $\ell_i$ on the span of $(\varphi_{i,k},\ 0 \leq k \leq m_i)$, so that $\ell_m = \sum_{i=1}^d \ell_{i,m_i}$ is the approximation of $\ell$ on $S_m$ given by (47). We start by giving a bound on the $L^2(q_i)$ norm of the error when we approximate $\ell_i$ by $\ell_{i,m_i}$.
Lemma 6.10. For each $1 \leq i \leq d$, $m_i + 1 \geq r_i$ and $\ell_i \in W_{r_i}^2(q_i)$, we have:
Proof. Notice that (26) implies that the series $(\varphi_{i,k}, k \in \mathbb{N})$ and the normalizing constants $\kappa_{i,k} \geq 0$ are given by: where we used the definition of $\varphi_{i,k}$ for the second equality, (26) for the third equality and (24) for the fourth equality. Notice that $\kappa_{i,k}$ is non-decreasing as a function of $k$. Since where the first inequality is due to the monotonicity of $\kappa_{i,k}$ as $k$ increases. Thanks to (26) and the definition of $\kappa_{i,k}$, we get that $(\varphi_{i,k}, k \in \mathbb{N})$ This and (35) imply (33). Lemma 6.10 yields a simple bound on the $L^2$ norm of the approximation error $\ell - \ell_m$.
Corollary 6.11. For $m = (m_1, \ldots, m_d)$, $m_i + 1 \geq r_i$ and $\ell_i \in W_{r_i}^2(q_i)$ for all $1 \leq i \leq d$, we get:
Proof. We have: where we used (33) for the first equality.
Lastly, we bound the $L^\infty$ norm of the approximation error.
Lemma 6.12. For each $1 \leq i \leq d$, $m_i + 1 \geq r_i$ and $\ell_i \in W_{r_i}^2(q_i)$, we have:
Proof. We recall the constants where we used Cauchy-Schwarz for the second inequality, (32) and (36) for the third inequality, $\kappa_{i,k}^2 \geq (d-i)!\,(i-1)!\,(k+r_i)^{2r_i}\,e^{-2r_i}$ for the fourth inequality, and $\sum_{k=m_i+1}^{\infty}(k+r_i)^{-2r_i+2d} \leq (2r_i-2d-1)^{-1}(m_i+r_i)^{-2r_i+2d+1}$ for the fifth inequality.
Corollary 6.13. There exists a constant $C > 0$ such that for all $\ell_i \in W_{r_i}^2(q_i)$ and $m_i + 1 \geq r_i > d$ for all $1 \leq i \leq d$, we have:
Proof. Notice that for $m_i + 1 \geq r_i > d$, we have: and that the right-hand side is bounded by a constant $C > 0$ for all $r_i \in \mathbb{N}^*$. Therefore:
7. Preliminary elements for the proof of Theorem 3.3
We adapt the results from [5] to our setting. Let us recall Lemmas 1 and 2 of [5].
Corollary 7.2. Let $g, h \in \mathcal{P}(\triangle)$. If $\|\log(g/h)\|_\infty < +\infty$, then we have, for any constant $\kappa \in \mathbb{R}$: and:
Recall Definition 2.1 for densities $f_0$ with a product form on $\triangle$. We give a few bounds between the $L^\infty$ norms of $\log(f_0)$, $\ell^0$ and the constant $a_0$.
Lemma 7.3. Let $f_0 \in \mathcal{P}(\triangle)$ be given by Definition 2.1. Then we have:
The first part of (43) can be obtained by bounding $\ell^0$ with $\|\ell^0\|_\infty$ in the definition of $a_0$. The second part is a direct consequence of this. The first part of (44) can be deduced from the fact that the functions $\ell_i^0$ are centered. The second part is again a direct consequence of the first part.
Let $m \in (\mathbb{N}^*)^d$. Recall the application $A_m$ defined in (7), and set $\Omega_m = A_m(\mathbb{R}^{|m|})$. For $\alpha \in \mathbb{R}^{|m|}$, we define the function $F_\alpha$ on $\mathbb{R}^{|m|}$ by:
(45) $F_\alpha(\theta) = \theta \cdot \alpha - \psi(\theta).$
Recall also the additive exponential series model $f_\theta$ given by (6).
The information projection of a density $f$ onto the exponential family (6) is the density of this family which is closest to $f$ with respect to the Kullback-Leibler distance. We consider the linear space of real-valued functions defined on $\triangle$ and generated by $\varphi_m$:
(47) $S_m = \left\{\theta \cdot \varphi_m,\ \theta \in \mathbb{R}^{|m|}\right\}.$
The following lemma summarizes Lemmas 6.6 and 6.8.
Lemma 7.6. Let $m \in (\mathbb{N}^*)^d$. We have for all $g \in S_m$: For all $\theta \in \mathbb{R}^{|m|}$, we have:
Now we give upper and lower bounds for the Kullback-Leibler distance between two members $f_\theta$ and $f_{\theta'}$ of the exponential family in terms of the Euclidean distance $\|\theta - \theta'\|$. Notice that for all where we used (50) for the second inequality, the fact that the functions $(\varphi_{[i],k})$ Then $\alpha$ belongs to $\Omega_m$ and $\theta^* = \Theta_m(\alpha)$ exists. Let $\tau$ be such that: Then $\theta^*$ satisfies:
Proof. Suppose that $\alpha \neq A_m(\theta)$ (otherwise the results are trivial). Recall $F_\alpha$ defined in (45). We have, for all $\theta' \in \mathbb{R}^{|m|}$: Using (52) and the Cauchy-Schwarz inequality, we obtain the strict inequality: We consider the ball centered at $\theta$, $B_r = \{\theta' \in \mathbb{R}^{|m|},\ \|\theta' - \theta\| \leq r\}$, with radius $r = 3d\, e^{\tau + \|\log(f_\theta)\|_\infty}\, \|A_m(\theta) - \alpha\|$. For all $\theta' \in \partial B_r$, we have: The right-hand side is non-negative as $6 d^{3/2} e^{1 + \|\log(f_\theta)\|_\infty}\, \kappa_m\, \|A_m(\theta) - \alpha\| \leq \tau \leq 1$, see the condition on $\tau$. Thus, the value of $F_\alpha$ at $\theta$, an interior point of $B_r$, is larger than the values of $F_\alpha$ on $\partial B_r$. Therefore $F_\alpha$ is maximal at a point, say $\theta^*$, in the interior of $B_r$. Since the gradient of $F_\alpha$ at $\theta^*$ equals 0, we have $\nabla F_\alpha(\theta^*) = \alpha - \int_\triangle \varphi_m f_{\theta^*} = 0$, which means that $\alpha \in \Omega_m$ and $\theta^* = \Theta_m(\alpha)$. Since $\theta^*$ is inside $B_r$, we get (54). The upper bound (55) is due to (50) of Lemma 7.7. To prove (56), we use (57) and the fact that $F_\alpha(\theta) - F_\alpha(\theta^*) \leq 0$, which gives:
We divide the proof of Theorem 3.3 into two parts: first we bound the error due to the bias of the proposed exponential model, then we bound the error due to the variance of the sample estimation. We formulate the results in two general propositions, which can later be specialized to obtain Theorem 3.3.

8.1. Bias of the estimator. The bias error comes from the information projection of the true underlying density $f_0$ onto the family of the exponential series model $\{f_\theta,\ \theta \in \mathbb{R}^{|m|}\}$. We recall the linear space $S_m$ spanned by $(\varphi_{[i],k};\ 1 \leq i \leq d,\ 1 \leq k \leq m_i)$, where $\varphi_{i,k}$ is a polynomial of degree $k$, and the form of the probability density $f_0$ given in (5). For $1 \leq i \leq d$, let $\ell_{i,m_i}^0$ be the orthogonal projection in $L^2(q_i)$ of $\ell_i^0$ on the vector space spanned by $(\varphi_{i,k},\ 0 \leq k \leq m_i)$ or, equivalently, on the vector space spanned by $(\varphi_{i,k},\ 1 \leq k \leq m_i)$, as we assumed that $\int_I \ell_i^0\, q_i = 0$.
Proposition 8.3. The information projection $f_{\theta^*}$ of $f_0$ exists and verifies, with
Proof. The existence of $\theta^*$ is due to Lemma 8.1. Thanks to Lemma 7.4 and (41) with $\kappa = \psi(\theta_0) - a_0$, we can deduce that: We need the following lemma to control $\|\log(f_0/f_{\theta^*})\|_\infty$.
Proof. Let $\theta^*$ be defined as in Proposition 8.3. Let $X = (X_1, \ldots, X_d)$ denote a random variable with density $f_0$. Let $\theta$ in Lemma 7.8 be equal to $\theta^*$, and for $\alpha$ take the empirical mean $\hat\mu_{m,n}$. With this setting, we have: By Chebyshev's inequality, $\|\alpha - \alpha_0\|^2 \leq |m| K/n$ except on a set whose probability verifies: $\leq e^{\|\log(f_0)\|_\infty}$ by the normality of $\varphi_{i,k}$. Therefore we obtain: We can apply Lemma 7.8 on the event Thanks to (60) we have: Since $\varepsilon_m \leq 1$, (62) holds if $\delta_{m,n}^2 \leq 1/K$. Then, except on a set of probability less than $e^{\|\log(f_0)\|_\infty}/K$, the maximum likelihood estimator $\hat\theta_{m,n}$ satisfies, thanks to (56) with $\tau = \delta_{m,n}\sqrt{K}$: and the boundedness of $\gamma_m$ when $m_i > r_i$ for all $1 \leq i \leq d$ is due to Corollary 6.13. By Remark 6.9, we have that $\kappa_m = O(|m|^d)$. If (10) holds, then $\kappa_m \Delta_m$ converges to 0. Therefore, for $m$ large enough, we have that $\varepsilon_m$ defined in (59) is less than 1. By Proposition 8.3, the information projection $f_{\theta^*}$ of $f_0$ exists. For such $m$, by Lemma 7.4, we have that for all $\theta \in \mathbb{R}^{|m|}$: The condition $\delta_{m,n} \leq 1$ in Proposition 8.5 is verified for $n$ large enough since $\gamma_m$ is bounded and (11) holds, giving $\lim_{n\to\infty} \delta_{m,n} = 0$. Proposition 8.5 then ensures that $D(f_{\theta^*}\,\|\,\hat f_{m,n}) = O_P(|m|/n)$. Therefore the proof is complete.

Proof of Theorem 4.1
In this section we provide the elements of the proof of Theorem 4.1. We assume the hypotheses of Theorem 4.1 and recall the notation of Section 4. We shall point out where we use the inequalities (20), (21) and (22) to achieve uniformity in $r$ in Corollary 4.2.
First recall that $\ell^0$ from (5) admits the following representation: Using Corollary 6.13 and $\psi(\theta_m^0) - a_0 \leq \|\ell_m^0 - \ell^0\|_\infty$, we obtain that $\|\log(f_{\theta_m^0}/f_0)\|_\infty$ is bounded for all $m \in (\mathbb{N}^*)^d$ such that $m_i \geq r_i$: Recall the notation $A_m^0 = \int_\triangle \varphi_m f_0$ for the expected value of $\varphi_m(X_1)$, $\hat\mu_{m,n}$ for the corresponding empirical mean based on the sample $\mathbb{X}_n^1$, and $\hat\ell_{m,n} = \hat\theta_{m,n} \cdot \varphi_m$, where $\hat\theta_{m,n}$ is the maximum likelihood estimate given by (9). Let $T_n > 0$ be defined as: We first show that, with probability converging to 1, the estimators are uniformly bounded.
Lemma 9.1. Let $n \in \mathbb{N}^*$, $n \geq n^*$, and $\mathcal{M}_n$ as in (14). Then we have: with $C_{T_n}$ defined as: with a finite constant $C$ given by (72). Moreover, on the event $A_n$, we have the following uniform upper bound for $\|\hat\ell_{m,n}\|_\infty$, $m \in \mathcal{M}_n$:
(69) $\|\hat\ell_{m,n}\|_\infty \leq 4 + 4\gamma + 2\|\log(f_0)\|_\infty.$
where we used (44) for the first inequality.
We also give a sharp oracle inequality for the convex aggregate estimator $f_{\lambda_n^*}$, conditionally on $A_n$ with $n$ fixed. The following lemma is a direct application of Theorem 3.6 of [11] and (69).
Lemma 9.3. Let $n \in \mathbb{N}^*$ be fixed. Conditionally on $A_n$, let $f_{\lambda_n^*}$ be given by (16) with $\lambda_n^*$ defined as in (18). Then for any $x > 0$, we have with probability greater than $1 - \exp(-x)$:
Let $\varepsilon > 0$. To prove (19), we need to find $C_\varepsilon > 0$ such that for all $n$ large enough:

Figure 1. Density functions $g_1$, $g_2$ of the left-truncated models used in the simulation study.

Figure 2. Boxplot of the Kullback-Leibler and $L^2$ distances for the additive exponential series estimator (AESE) and the truncated kernel estimators with Beta marginals.

Figure 3. Joint density functions of the true density and its estimators with Beta marginals.

Figure 4. Boxplot of the Kullback-Leibler and $L^2$ distances for the additive exponential series estimator (AESE) and the truncated kernel estimators with Gumbel marginals.

Figure 5. Joint density functions of the true density and its estimators with Gumbel marginals.

Figure 6. Boxplot of the Kullback-Leibler and $L^2$ distances for the additive exponential series estimator (AESE) and the truncated kernel estimators with Normal mix marginals.

Figure 7. Joint density functions of the true density and its estimators with Normal mix marginals.

6.2. Definition of the basis functions. Based on the Jacobi polynomials, we define a shifted version, normalized and adapted to the interval $I = [0,1]$.
Definition 6.1. For $1 \leq i \leq d$, $k \in \mathbb{N}$, we define for $t \in I$:

and the different norms introduced in Section 7.
Lemma 6.8. Let $m \in (\mathbb{N}^*)^d$ and $\kappa_m = \sqrt{2\,d!\sum_{i=1}^d (m_i + d)^{2d}}$. Then we have, for every $g \in S_m$:

6.5. Bounds on approximations. Now we bound the $L^2$ and $L^\infty$ norms of the approximation error of additive functions whose components belong to Sobolev spaces. Let $m = (m_1, \ldots, m_d) \in (\mathbb{N}^*)^d$ and $r = (r_1, \ldots, r_d) \in (\mathbb{N}^*)^d$ be such that $m_i + 1 \geq r_i$ for all $1 \leq i \leq d$. Let $\ell = \sum_{i=1}^d \ell_{[i]}$

are orthogonal to the constant function with respect to the Lebesgue measure on $\triangle$ for the third inequality, and (49) for the fourth inequality. Now we show that the application $\Theta_m$ is locally Lipschitz.
Lemma 7.8. Let $m \in (\mathbb{N}^*)^d$ and $\theta \in \mathbb{R}^{|m|}$. If $\alpha \in \mathbb{R}^{|m|}$ satisfies:

8.2. Variance of the estimator. We control the variance error, due to the parameter estimation, in terms of the size of the sample. We keep the notation used in Section 8.1; in particular, $\varepsilon_m$ is defined by (59) and $\kappa_m = \sqrt{2\,d!\sum_{i=1}^d (m_i + d)^{2d}}$. The results are summarized in the following proposition.
Proposition 8.5. Let $f_0 \in \mathcal{P}(\triangle)$ have a product form given by Definition 2.1. Let $m \in (\mathbb{N}^*)^d$ and suppose that $\varepsilon_m \leq 1$. Set:

Table 2. Average Kullback-Leibler distances for the additive exponential series estimator (AESE) and the truncated kernel estimator (Kernel), based on 100 samples of size $n$. Variances are given in parentheses.

Table 3. Average $L^2$ distances for the additive exponential series estimator (AESE) and the truncated kernel estimator (Kernel), based on 100 samples of size $n$. Variances are given in parentheses.