Optimal exponential bounds for aggregation of estimators for the Kullback-Leibler loss

We study the problem of model selection type aggregation with respect to the Kullback-Leibler divergence for various probabilistic models. Rather than considering a convex combination of the initial estimators $f_1, \ldots, f_N$, our aggregation procedures rely on the convex combination of the logarithms of these functions. The first method is designed for probability density estimation as it gives an aggregate estimator that is also a proper density function, whereas the second method concerns spectral density estimation and has no such mass-conserving feature. We select the aggregation weights based on a penalized maximum likelihood criterion. We give sharp oracle inequalities that hold with high probability, with a remainder term that is decomposed into a bias and a variance part. We also show the optimality of the remainder terms by providing the corresponding lower bound results.


Introduction
The pure aggregation framework focuses on the best way to combine estimators (seen as deterministic functions) in order to attain nearly the best risk of these estimators and to quantify the residual term. Given N estimators f_k, 1 ≤ k ≤ N, and a sample X = (X_1, . . . , X_n) from the model f, the problem is to find an aggregated estimator f̂ which performs nearly as well as the best f_λ, λ ∈ U, where:

f_λ = Σ_{k=1}^N λ_k f_k,

and U is a certain subset of R^N (we assume that linear combinations of the estimators are valid candidates). The performance of the estimator is measured by a loss function L. Common loss functions include the L_p distance (with p = 2 in most cases), the Kullback-Leibler or other divergences, the Hellinger distance, etc. The aggregation problem can be formulated as follows: find an aggregate estimator f̂ such that, for some constant C ≥ 1, f̂ satisfies an oracle inequality in expectation, i.e.:

E[L(f, f̂)] ≤ C min_{λ∈U} L(f, f_λ) + R_{n,N},   (1)

or in deviation, i.e. for ε > 0 we have with probability greater than 1 − ε:

L(f, f̂) ≤ C min_{λ∈U} L(f, f_λ) + R_{n,N,ε},   (2)

with remainder terms R_{n,N} and R_{n,N,ε} which do not depend on f or f_k, 1 ≤ k ≤ N. If C = 1, then the oracle inequality is sharp. Historically, three types of problems were identified depending on the choice of U. In the model selection problem, the estimator mimics the best estimator amongst f_1, . . . , f_N, that is U = {e_k, 1 ≤ k ≤ N}, with e_k = (λ_j, 1 ≤ j ≤ N) ∈ R^N the unit vector in direction k given by λ_j = 1_{j=k}. In the convex aggregation problem, the f_λ are the convex combinations of f_k, 1 ≤ k ≤ N, i.e. U is the simplex Λ⁺ ⊂ R^N with:

Λ⁺ = {λ = (λ_k, 1 ≤ k ≤ N) ∈ R^N : λ_k ≥ 0 and Σ_{k=1}^N λ_k = 1}.   (3)

Finally, in the linear aggregation problem, U = R^N is the entire linear span of the initial estimators. Recently, these problems were generalized to ℓ_q-aggregation in [29], defined as aggregation with U = ℓ_q(t_n), i.e. a ball of radius t_n > 0 in the ℓ_q-norm, 0 ≤ q ≤ 1. Early papers usually consider the L_2 loss in expectation as in (1).
For the regression model with random design, optimal bounds for the L 2 loss in expectation for model selection aggregation was considered in [31] and [30], for convex aggregation in [20] with improved results for large N in [33], and for linear aggregation in [28], where knowledge of the density of the design is required. These results were extended to the case of regression with fixed design for the model selection aggregation in [15] and [16], and for affine estimators in the convex aggregation problem in [14]. A unified aggregation procedure which achieves near optimal loss for all three problems simultaneously was proposed in [8].
For density estimation, first results include [10] and [32], who independently considered the model selection aggregation under the Kullback-Leibler loss in expectation. For positive, integrable functions p, q, let D(p‖q) denote the generalized Kullback-Leibler divergence given by:

D(p‖q) = ∫ p log(p/q) − ∫ p + ∫ q.   (4)

This is a Bregman divergence, introduced in [7]; therefore D(p‖q) is non-negative and D(p‖q) = 0 if and only if p = q a.e. The Kullback-Leibler loss of an estimator f̂ is given by D(f‖f̂). In [10] and [32], the authors introduced the progressive mixture rule to give a series of estimators which verify oracle inequalities with optimal remainder terms. They introduced the sequential risk for estimating f with weights δ:

R_seq(f, n, δ) = (1/(n+1)) Σ_{i=0}^{n} E[D(f‖f̂_{δ,i})],

where the estimators f̂_{δ,i} depend only on the first i observations. They proved that there exists a progressive mixture procedure associated to the weights δ*_seq such that

R_seq(f, n, δ*_seq) ≤ min_{1≤k≤N} D(f‖f_k) + log(N)/(n+1).

This method was later generalized as the mirror averaging algorithm in [21] and applied to various problems. Corresponding lower bounds which ensure the optimality of this procedure were shown in [22]. The convex and linear aggregation problems for densities under the L_2 loss in expectation were considered in [26].
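For concreteness, the generalized divergence (4) can be evaluated numerically on a grid; in the following sketch the functions p and q and the helper generalized_kl are illustrative choices of ours, not part of the procedures studied in this paper.

```python
import numpy as np

def generalized_kl(p, q, dx):
    """Generalized Kullback-Leibler divergence of (4):
    D(p || q) = int p log(p/q) - int p + int q, approximated on a regular grid
    (Riemann sum), with the convention log(0/0) = 0 (points with p = 0
    contribute only through int q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return dx * (np.sum(p[mask] * np.log(p[mask] / q[mask])) - p.sum() + q.sum())

# Two illustrative positive, integrable (not necessarily normalized) functions on [0, 1]
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
p = np.exp(-x)     # total mass 1 - 1/e
q = 2.0 - x        # total mass 3/2

d_pq = generalized_kl(p, q, dx)
d_pp = generalized_kl(p, p, dx)   # a Bregman divergence vanishes at p = q
```

In accordance with the Bregman divergence property, d_pp vanishes while d_pq is strictly positive, even though neither function has total mass one.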
While a lot of papers considered the expected value of the loss, relatively few papers address the question of optimality in deviation, that is with high probability as in (2). For the regression problem with random design, [1] shows that the progressive mixture method is deviation sub-optimal for the model selection aggregation problem, and proposes a new algorithm which is optimal for the L 2 loss in deviation and expectation as well. Another deviation optimal method based on sample splitting and empirical risk minimization on a restricted domain was proposed in [23]. For the fixed design regression setting, [25] considers all three aggregation problems in the context of generalized linear models and gives constrained likelihood maximization methods which are optimal in both expectation and deviation with respect to the Kullback-Leibler loss. More recently, [13] extends the results of [25] for model selection by introducing the Q-aggregation method and giving a greedy algorithm which produces a sparse aggregate achieving the optimal rate in deviation for the L 2 loss. More general properties of this method applied to other aggregation problems as well are discussed in [12].
For density estimation, optimal bounds in deviation with respect to the L_2 loss for model selection aggregation are given in [3]. The author gives a non-asymptotic sharp oracle inequality under the assumption that f and the estimators f_k, 1 ≤ k ≤ N, are bounded, and shows the optimality of the remainder term by providing the corresponding lower bounds as well. The penalized empirical risk minimization procedure introduced in [3] inspired our current work. Here, we consider a more general framework which incorporates, as a special case, the density estimation problem. Moreover, we give results in deviation for the Kullback-Leibler loss instead of the L_2 loss considered in [3]. The spectral density model is equivalent to estimating the covariance function Cov(X_t, X_{t+h}) for all integers h from a stationary Gaussian sequence. It can be seen as the estimation of a Toeplitz covariance operator and has many applications.
Linear aggregation of lag window spectral density estimators with L_2 loss was studied in [11]. The method we propose is more general, as it can be applied to any set of estimators f_k, 1 ≤ k ≤ N, not only kernel estimators. Moreover, we establish oracle inequalities for the model selection type of risk, which has a less ambitious target than the linear aggregation risk (it aims at the minimal risk over a finite set of estimators instead of the minimal risk over all their linear combinations), but leads to better performance than linear aggregation due to the smaller aggregation price. Also, this paper concerns optimal bounds in deviation for the Kullback-Leibler loss instead of the L_2 loss in expectation.
We now present our main contributions. We propose aggregation schemes for the estimation of probability densities on R^d and the estimation of spectral densities of stationary Gaussian sequences. We prove sharp oracle inequalities in deviation for the Kullback-Leibler loss: for initial estimators f_k, 1 ≤ k ≤ N, we propose an aggregate estimator f̂ such that, for every f belonging to a large class of functions F and every x > 0, with probability greater than 1 − exp(−x):

D(f‖f̂) ≤ min_{1≤k≤N} D(f‖f_k) + R_{n,N,x},

with a remainder term R_{n,N,x} of order (x + log N)/n. We propose two methods of aggregation for non-negative estimators, see Propositions 2.4 and 2.6. Contrary to the usual approach of giving an aggregate estimator which is a linear or convex combination of the initial estimators, we consider an aggregation based on a convex combination of the logarithms of these estimators. The aggregate estimators f̂ = f^D_λ̂ for the probability density model and f̂ = f^S_λ̂ for the spectral density model, with λ̂ = λ̂(X_1, . . . , X_n) ∈ Λ⁺, maximize a penalized maximum likelihood criterion. The exact form of the convex aggregates f^D_λ̂ and f^S_λ̂ will be made precise in later sections for each setup.
The first method concerns estimators with a given total mass and produces an aggregate f^D_λ̂ which has the same total mass. This method is particularly well suited for density estimation, as it provides an aggregate which is itself a proper density function. We use this method to propose an adaptive nonparametric density estimator for maximum entropy distributions of order statistics in [9]. The second method, giving the aggregate f^S_λ̂, does not have the mass-conserving feature, but can be applied to a wider range of statistical estimation problems, in particular to spectral density estimation. We show that both procedures give an aggregate which verifies a sharp oracle inequality with a bias and a variance term. The bias term is new in oracle inequalities: it appears from the estimation of linear functionals of f, and not from the estimation of f itself. In most cases, this bias is of order 1/n or smaller under mild smoothness assumptions on f and on the estimators (f_k, 1 ≤ k ≤ N). The smoothness parameter does not interfere at all with the rates; it only appears in the constants of the remainder term.
When applied to density estimation, our results provide sharp oracle inequalities with the optimal remainder term of order log(N)/n. Theorem 3.1 proposes an aggregate estimator f̂^D_* such that, for any x > 0, with probability higher than 1 − exp(−x):

D(f‖f̂^D_*) ≤ min_{1≤k≤N} D(f‖f_k) + β (x + log(N))/n,

where β is an explicit constant depending only on the infinity norm of the logarithms of f and f_k, 1 ≤ k ≤ N. In this case, the empirical measure provides an unbiased estimator of linear functionals of f and there is no bias term.
In the case of spectral density estimation, we need to assume a small amount of smoothness for the logarithm of the true spectral density and of the estimators. We require that there exists some r > 1/2 such that the logarithms of the functions belong to the periodic Sobolev space W^r. We show that this also implies that the spectral density itself belongs to W^r, so that our assumption is slightly more restrictive than the usual assumption in the literature, namely that f belongs to W^{1/2}. Theorem 3.5 proposes an aggregate estimator f̂^S_* such that, for any x > 0, with probability higher than 1 − exp(−x):

D(f‖f̂^S_*) ≤ min_{1≤k≤N} D(f‖f_k) + β (x + log(N))/n + α/n,

where β and α are constants which depend only on the regularity and the Sobolev norm of the logarithms of f and the f_k, 1 ≤ k ≤ N. To show the optimality in deviation of the aggregation procedures, we give the corresponding tight lower bounds as well, with the same remainder terms, see Propositions 4.2 and 4.3. This complements the results of [22] and [3] obtained for the density estimation problem. In [22] the lower bound for the expected value of the Kullback-Leibler loss was shown with the same order for the remainder term, while in [3] similar results were obtained in deviation for the L_2 loss. The proof of Proposition 4.2 is quite close to the proof of the lower bounds in expectation in [22]. However, this construction, based on the Haar basis, does not satisfy the additional smoothness assumptions of Proposition 4.3. A new construction of test functions f_1, ..., f_N and a new proof for the spectral density model can be found in Section 5.
In conclusion, we introduce two aggregation procedures based on a penalized maximum likelihood criterion for exponential families of functions and show their sharp optimality in deviation with respect to the Kullback-Leibler risk. The progressive mixture rule is a strong competitor of our procedure for the probability density model: it satisfies an oracle inequality in expectation under far less restrictive assumptions, and it behaves better numerically in our brief examples for large enough sample sizes (larger than 200). No such competitor exists for the spectral density model that we also treat here.
The rest of the paper is organised as follows. In Section 2 we introduce the notation and give the basic definitions used in the rest of the paper. We present the two types of convex aggregation of the logarithms in Sections 2.1 and 2.2, and give a general sharp oracle inequality in deviation for the Kullback-Leibler loss for each method and setup. In Section 3 we apply the methods to the probability density estimation problem, together with a numerical implementation, and to the spectral density estimation problem. The corresponding lower bounds can be found in Section 4 for both problems. Proofs are gathered in Section 5. We summarize the properties of Toeplitz matrices and periodic Sobolev spaces in the Appendix.

Aggregation procedures for the Kullback-Leibler divergence
In this section, we propose two convex aggregation methods, suited for models subject to different types of constraints: non-negative functions with a fixed given total mass, and non-negative functions without mass restriction. First, we introduce the setups and the aggregation procedure suited for each type of constraint. Then, we state non-asymptotic oracle inequalities for the Kullback-Leibler divergence in a general form.
Let h : R^d → R_+ be a reference probability density with support H := {x ∈ R^d : h(x) > 0}. Note that H may be compact or not. We consider the set

G = {f : R^d → R_+ measurable such that ‖log(f/h)‖_∞ < ∞},

with the convention log(0/0) = 0. Notice that any function in the set G is integrable. Moreover, ‖log(f/h)‖_∞ < ∞ implies that f and h necessarily have the same support. Thus, the Kullback-Leibler divergence D(f‖f′), defined in (4), is finite for any functions f, f′ belonging to G.
We consider a probabilistic model P = {P f ; f ∈ G}, with P f a probability distribution depending on f . We assume that we have a sample X = (X 1 , . . . , X n ), n ∈ N * with distribution P f . In the sequel, the examples we consider are models P f that correspond either to a probability density model or to a spectral density model.
More precisely, the probability density model corresponds to a sample of i.i.d. random variables with common probability density function (pdf) f: the random variable X = (X_1, . . . , X_n) has joint pdf f^{⊗n}(x) = ∏_{i=1}^n f(x_i). The spectral density model corresponds to a sample issued from a stationary Gaussian sequence with spectral density function (sdf) f. Let h = 1/(2π) 1_{[−π,π]} be the reference density and (X_k)_{k∈Z} be a stationary, centered Gaussian sequence with covariance function γ_j = Cov(X_k, X_{k+j}), for j ∈ Z. Under the standard assumption that Σ_{j=0}^∞ |γ_j| < +∞, the spectral density f associated to the process is the even function defined on [−π, π] whose Fourier coefficients are the γ_j:

f(x) = (1/2π) Σ_{j∈Z} γ_j e^{ijx}.

Note that γ_0 = ∫_{−π}^{π} f(x) dx. Now, suppose we have (f_k, 1 ≤ k ≤ N), N distinct estimators belonging to G. We propose two aggregation methods based on these estimators and the available sample that behave, with high probability, as well as the best initial estimator f_{k*} in terms of the Kullback-Leibler divergence, where k* is defined as:

k* = argmin_{1≤k≤N} D(f‖f_k).

The first aggregation method concerns setups where f as well as (f_k, 1 ≤ k ≤ N) have a given total mass, supposed equal to 1 without loss of generality: ∫ f = ∫ f_1 = . . . = ∫ f_N = 1. This is the case in both the probability density model and in the spectral density model with given variance γ_0. We set the classical notation for nonparametric exponential families of functions:

f = exp(t − ψ) h,   (6)

where ψ = ∫ h log(h/f) and t = ψ + log(f/h), giving that ∫ t h = 0 and that ψ = log(∫ e^t h) is a normalization constant (when needed). Similarly, f_k = exp(t_k − ψ_k) h for t_k and ψ_k. We proceed in this case by aggregating (t_k, 1 ≤ k ≤ N) into a convex combination t_λ = Σ_{k=1}^N λ_k t_k and by constructing

f^D_λ = exp(t_λ − ψ_λ) h,

where ψ_λ = log(∫ e^{t_λ} h) is the normalization constant. The second aggregation method concerns setups where f and (f_k, 1 ≤ k ≤ N) are non-negative with arbitrary total mass, e.g. the more realistic spectral density model with unknown variance.
In this case, we proceed by aggregating the functions g_k := log(f_k/h) into a convex combination g_λ = Σ_{k=1}^N λ_k g_k and leave the resulting function unnormalized:

f^S_λ = exp(g_λ) h.

It is easy to see that the normalized f^S_λ / ∫ f^S_λ is actually the same as f^D_λ. However, the criteria we estimate and penalize in order to define the optimal coefficients λ̂^D_* and λ̂^S_* are different, and the two aggregation methods lead to two different estimators.
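As a numerical illustration of the two aggregates (all choices below, including the candidate estimators f1, f2 and the uniform reference h, are ours and purely illustrative):

```python
import numpy as np

# Reference density h = uniform on [0, 1]; two candidate densities of mass one
x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
h = np.ones_like(x)
f1 = 0.5 + x
f2 = 1.5 - x
lam = np.array([0.3, 0.7])

g = np.stack([np.log(f1 / h), np.log(f2 / h)])   # g_k = log(f_k / h)
g_lam = lam @ g                                  # convex combination of the logs

# Second method: unnormalized aggregate f^S_lambda = exp(g_lambda) h
f_S = np.exp(g_lam) * h

# First method: mass-conserving aggregate f^D_lambda; it coincides with the
# renormalized f^S_lambda / int f^S_lambda
f_D = f_S / (f_S.sum() * dx)

mass_S = f_S.sum() * dx   # < 1 by Jensen's inequality (strict when f1 != f2)
mass_D = f_D.sum() * dx   # = 1: the aggregate is a proper density
```

The renormalized second aggregate recovers the first one, while the unnormalized aggregate has total mass strictly below one here, by pointwise convexity of the exponential.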
With the previous notation, we check that, for all f in G, the divergences D(f‖f^D_λ) and D(f‖f^S_λ) are well defined and finite. Remark 2.1. At this point we notice that the assumption that f belongs to the class G is quite restrictive. Indeed, taking e.g. h as the indicator function of [0,1] implies that there exist two constants 0 < m ≤ M < ∞ such that m ≤ f ≤ M. However, if f tends to 0 in the probability density model, [32] suggested data augmentation to raise the function above 0: mix the data with an auxiliary sample drawn from a distribution with density ϕ, and use (f_k + ϕ)/2 instead of f_k. We may choose ϕ to be the uniform density over a compact set that covers all the data.
Moreover, the functions f and f_k, 1 ≤ k ≤ N, may be unbounded, or tend to 0, through the reference probability density h. Consider e.g. h(x) = (1 − α)/x^α for x in [0,1] and some α between 0 and 1, or an exponential distribution on the non-negative real numbers, or a Gaussian distribution on the real line. These cases are not covered by the L_2 aggregation procedure. We fix some additional notation. We denote by I_n an integrable estimator of the function f, measurable with respect to the sample X = (X_1, . . . , X_n). The estimator I_n may be a biased estimator of f, as is the periodogram in the spectral density model for example. We denote by f̄_n the expected value of I_n:

f̄_n = E_f[I_n].

For a measurable function p on R^d and a measure Q on R^d (resp. a measurable function q on R^d), we write ⟨p, Q⟩ = ∫ p(x) Q(dx) (resp. ⟨p, q⟩ = ∫ pq) when the integral is well defined. We shall consider the L_2(h) norm given by ‖p‖_{L_2(h)} = (∫ p² h)^{1/2}. We now describe the implementation, using data, of our two aggregation methods for the estimators (f_k, 1 ≤ k ≤ N), and first results. For each of the two previously defined aggregation methods, we see how the Kullback-Leibler discrepancy between the estimated f and the convex combination f_λ (which is either f^D_λ or f^S_λ) can be reduced to a linear functional of f. Such linear functionals can be estimated at fast rates, in a biased or, sometimes, an unbiased way. We penalize this estimator to get the criterion H_n(λ), a concave function of λ to be maximized over the simplex Λ⁺. The resulting argument of this optimization, λ̂_*, provides our aggregated procedure f̂_* = f_{λ̂_*}, which is optimal in deviation. Proofs are structured in two main steps for each aggregation method. First, for f as well as f_k, 1 ≤ k ≤ N, belonging to G, Lemmas 2.3 and 2.5 show that the penalized criterion H_n(λ) has a unique maximizer provided that {t_k, 1 ≤ k ≤ N}, respectively {g_k, 1 ≤ k ≤ N}, are linearly independent. Moreover, H_n(λ) is proved to be strongly concave around its maximum.
Next, we assume additionally that the norms ‖log(f_k/h)‖_∞ are uniformly bounded by some constant K > 0 for all 1 ≤ k ≤ N; see Remark 2.1 on this constraint. We state in Propositions 2.4 and 2.6 oracle inequalities in deviation in a quite general form. The remainder term in these inequalities is decomposed, for each method, into a bias term for estimating the linear functional ⟨·, f⟩ of f and the maximum of N stochastic terms which are centered at second order. No additional smoothness assumptions are required here.

Non-negative functions with given total mass
In this Section, we shall consider non-negative functions with mass 1, but what follows can readily be adapted to functions with any given total mass, such as spectral density functions with given variance.
We want to estimate f based on the estimators f_k ∈ G, 1 ≤ k ≤ N, which we assume to be non-negative with mass 1. Recall the representation (6) of f and f_k. For λ ∈ Λ⁺ defined by (3), recall that the aggregated estimator f^D_λ is given by the convex combination of (t_k, 1 ≤ k ≤ N):

f^D_λ = exp(t_λ − ψ_λ) h,  with t_λ = Σ_{k=1}^N λ_k t_k.

Minimizing the Kullback-Leibler divergence λ ↦ D(f‖f^D_λ) is thus equivalent to maximizing

λ ↦ ⟨t_λ, f⟩ − ψ_λ.

Let I_n be a non-negative estimator of f based on the sample X = (X_1, . . . , X_n), which may be the empirical measure or a biased estimator with f̄_n = E_f[I_n]. We estimate the scalar product ⟨t_λ, f⟩ by ⟨t_λ, I_n⟩. To select the aggregation weights λ, we consider on Λ⁺ the penalized empirical criterion H^D_n(λ) given by:

H^D_n(λ) = ⟨t_λ, I_n⟩ − ψ_λ − (1/2) pen^D(λ),   (9)

with penalty term:

pen^D(λ) = Σ_{k=1}^N λ_k D(f^D_λ ‖ f_k) = Σ_{k=1}^N λ_k ψ_k − ψ_λ.

Remark 2.2. The penalty term in the definition of H^D_n can be multiplied by any constant θ ∈ (0, 1) instead of 1/2. The choice of 1/2 is optimal in the sense that it ensures that the constant exp(−6K)/4 in (14) of Proposition 2.4 is maximal, giving the sharpest result.
The penalty term is always non-negative and finite. Notice that H^D_n simplifies to:

H^D_n(λ) = ⟨t_λ, I_n⟩ − (1/2) ψ_λ − (1/2) Σ_{k=1}^N λ_k ψ_k,

which is obviously a concave function of λ, as ψ_λ is convex in λ. Lemma 2.3 below asserts that the function H^D_n, defined by (9), admits a unique maximizer on Λ⁺ and that it is strictly concave around this maximizer.
Lemma 2.3. Assume that the functions (t_k, 1 ≤ k ≤ N) are linearly independent. Then there exists a unique λ̂^D_* ∈ Λ⁺ such that:

λ̂^D_* = argmax_{λ∈Λ⁺} H^D_n(λ).   (10)

Furthermore, H^D_n is strictly concave around this maximizer. Using λ̂^D_* defined in (10), we set:

f̂^D_* = f^D_{λ̂^D_*}.   (12)

We show that the convex aggregate estimator f̂^D_* verifies, almost surely, the following non-asymptotic inequality with a bias and a variance term.
Proposition 2.4. Let f and the f_k, 1 ≤ k ≤ N, belong to G, with ‖log(f_k/h)‖_∞ ≤ K for 1 ≤ k ≤ N, and let X = (X_1, . . . , X_n) be a sample from the model P_f. Then, almost surely, the loss D(f‖f̂^D_*) is bounded by min_{1≤k≤N} D(f‖f_k) plus a remainder term involving the bias functional B_n, given for ℓ ∈ L_∞(R^d) by:

B_n(ℓ) = ⟨ℓ, f̄_n − f⟩,   (13)

and a variance function V^D_n : Λ⁺ → R made of N stochastic terms which are centered at second order. The bias term B_n is new in aggregation results. We stress that it is a bias for estimating the linear functional ⟨ℓ, f⟩ appearing in the definition of the risk, and not a bias for estimating the function f itself. The bias for linear functionals can be made O(1/n), and thus negligible in front of the variance, under very mild assumptions on the function f.
Obviously, when f is a pdf, we plug the empirical measure into ⟨ℓ, f⟩ and the resulting estimator n^{−1} Σ_{i=1}^n ℓ(X_i) is unbiased.
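The maximization of such a penalized criterion over the simplex Λ⁺ can be carried out, e.g., by exponentiated-gradient (mirror) ascent. The following minimal sketch maximizes only the unpenalized fit term λ ↦ ⟨t_λ, I_n⟩ − ψ_λ, as a stand-in for the full penalized criterion, which has the same linear-minus-convex shape; the functions t_k, the stand-in vector c of empirical terms, and all numerical choices are ours.

```python
import numpy as np

# Grid discretization of [0, 1]; reference density h = 1 (uniform)
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
h = np.ones_like(x)

# Two centered log-estimators t_k (so that int t_k h = 0), as in representation (6)
t = np.stack([x - 0.5, np.sin(2.0 * np.pi * x)])
c = np.array([0.10, 0.05])  # stand-in values for the empirical terms <t_k, I_n>

def psi(lam):
    """Normalization constant psi_lambda = log int exp(t_lambda) h."""
    return np.log(np.sum(np.exp(lam @ t) * h * dx))

def objective(lam):
    """Concave fit term lam -> <t_lambda, c> - psi_lambda (penalty omitted)."""
    return lam @ c - psi(lam)

def gradient(lam):
    """Gradient: c_k - int t_k f_lambda, with f_lambda = exp(t_lambda - psi_lambda) h."""
    w = np.exp(lam @ t) * h
    w /= w.sum()
    return c - t @ w

# Exponentiated-gradient (mirror) ascent keeps lambda on the simplex
lam = np.array([0.5, 0.5])
obj0 = objective(lam)
for _ in range(500):
    lam = lam * np.exp(0.5 * gradient(lam))
    lam /= lam.sum()
obj = objective(lam)
```

The multiplicative update keeps the weights non-negative and normalized at every step, so no explicit projection onto Λ⁺ is needed.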

Non-negative functions
In this Section, we shall consider non-negative functions without mass restriction. We want to estimate a function f ∈ G based on the estimators f_k ∈ G, 1 ≤ k ≤ N. Since most of the proofs in this Section are similar to those in Section 2.1, we only give them when there is a substantial new element. Recall the representation (6) of f and f_k. For λ ∈ Λ⁺ defined by (3), we consider the aggregate estimator f^S_λ given by the convex aggregation of the g_k := log(f_k/h), 1 ≤ k ≤ N:

f^S_λ = exp(g_λ) h,  with g_λ = Σ_{k=1}^N λ_k g_k.

By our assumptions, f and f^S_λ belong to G, thus D(f‖f^S_λ) < ∞ for all λ ∈ Λ⁺. Minimizing the Kullback-Leibler divergence λ ↦ D(f‖f^S_λ) is therefore equivalent to maximizing

λ ↦ ⟨g_λ, f⟩ − ∫ f^S_λ.

As I_n is a non-negative estimator of f based on the sample X = (X_1, . . . , X_n), we estimate the scalar product ⟨g_λ, f⟩ by ⟨g_λ, I_n⟩. Here we select the aggregation weights λ based on the penalized empirical criterion H^S_n(λ) given by:

H^S_n(λ) = ⟨g_λ, I_n⟩ − ∫ f^S_λ − (1/2) pen^S(λ),   (17)

with the penalty term:

pen^S(λ) = Σ_{k=1}^N λ_k D(f^S_λ ‖ f_k) = Σ_{k=1}^N λ_k ∫ f_k − ∫ f^S_λ.

The choice of the factor 1/2 for the penalty is justified by arguments similar to those given in Remark 2.2. The penalty term is always non-negative and finite.
Notice that H^S_n simplifies to:

H^S_n(λ) = ⟨g_λ, I_n⟩ − (1/2) ∫ f^S_λ − (1/2) Σ_{k=1}^N λ_k ∫ f_k.

Thus, the new criterion H^S_n is also the sum of a linear term in λ and of a concave term −(1/2) ∫ f^S_λ; the proofs are very similar to those in the previous section for this reason. Lemma 2.5 below asserts that the function H^S_n admits a unique maximizer on Λ⁺ and that it is strictly concave around this maximizer.

Lemma 2.5. Assume that the functions (g_k, 1 ≤ k ≤ N) are linearly independent and recall H^S_n defined by (17). Then there exists a unique λ̂^S_* ∈ Λ⁺ such that:

λ̂^S_* = argmax_{λ∈Λ⁺} H^S_n(λ).   (18)

Furthermore, H^S_n is strictly concave around this maximizer. Using λ̂^S_* defined in (18), we set:

f̂^S_* = f^S_{λ̂^S_*}.   (20)

We show that the convex aggregate estimator f̂^S_* verifies, almost surely, the following non-asymptotic inequality with a bias and a variance term.

Proposition 2.6. Let f and the f_k, 1 ≤ k ≤ N, belong to G, and let X = (X_1, . . . , X_n) be a sample from the model P_f. Then, almost surely, the loss D(f‖f̂^S_*) is bounded by min_{1≤k≤N} D(f‖f_k) plus a remainder term involving the bias functional B_n given by (13) and a variance function V^S_n : Λ⁺ → R made of N stochastic terms which are centered at second order.

Applications and numerical implementation
In this section we apply the methods established in Sections 2.1 and 2.2 to the problems of density estimation and spectral density estimation, respectively. By construction, the aggregate f^D_λ of Section 2.1 is better adapted to the density estimation problem, as it produces a proper density function. In this model, we choose to use the unbiased empirical estimator of the linear functionals of f. Theorem 3.1 makes the remainder term explicit uniformly over pdfs f such that ‖log(f/h)‖_∞ ≤ L, for some L > 0. A short numerical study might inspire the reader for future developments.
For the spectral density estimation problem, the aggregate f^S_λ provides the optimal results. It is more convenient in this case to plug in the periodogram I_n in order to get a biased estimator ⟨ℓ, I_n⟩ of ⟨ℓ, f⟩. In order to control the bias and make it negligible with respect to the stochastic term, we assume that log(f/h) belongs to a Sobolev ellipsoid of smoothness r, for some r > 1/2 arbitrarily close to 1/2. Note that r is not needed anywhere in the procedure; it only appears in the constants of the remainder term and does not affect the rate log(N)/n. We show that our smoothness assumption implies that Σ_{j≥1} j^{2r} Cov(X_0, X_j)² ≤ L, which is slightly more restrictive than the usual assumption in spectral density models that Σ_{j≥1} j Cov(X_0, X_j)² ≤ L, when (X_j, j ∈ Z) is a stationary Gaussian sequence with spectral density function f.

Probability density model
Recall that the model {P_f, f ∈ F^D(L)} corresponds to i.i.d. random sampling from a probability density f ∈ F^D(L): the random variable X = (X_1, . . . , X_n) has joint pdf f^{⊗n}. We consider the following subset of probability density functions, for L > 0:

F^D(L) = {f ∈ G : f is a probability density and ‖log(f/h)‖_∞ ≤ L}.

Recall that for any fixed f in G, the function t_f := ψ + log(f/h) has its sup-norm bounded as follows: ‖t_f‖_∞ ≤ 2 ‖log(f/h)‖_∞, which is finite by assumption. We refer to Remark 2.1 for a discussion of this assumption. We want to establish here an oracle inequality uniform over f, which is the reason why we assume that ‖t_f‖_∞ is uniformly bounded. This bound does not appear anywhere in the algorithm, but only in the remainder term, which is a price for the aggregation of estimators.
The aggregation procedure involves the estimation of linear functionals ⟨ℓ, f⟩, for functions ℓ in L_1(h). We estimate the probability measure f(x)dx by the empirical measure I_n, giving

⟨ℓ, I_n⟩ = (1/n) Σ_{i=1}^n ℓ(X_i),   (21)

which is an unbiased estimator of ⟨ℓ, f⟩. In the following Theorem, we give a sharp non-asymptotic oracle inequality in probability for the aggregation procedure f̂^D_* with a remainder term of order log(N)/n. We prove in Section 4.1 the lower bound showing that this remainder term is optimal.

Theorem 3.1. Let f ∈ F^D(L) and f_k ∈ G with ‖log(f_k/h)‖_∞ ≤ K, 1 ≤ k ≤ N, be such that (t_k, 1 ≤ k ≤ N) are linearly independent. Let X = (X_1, . . . , X_n) be an i.i.d. sample from f and I_n be defined in (21). Let f̂^D_* be given by (12). Then for any x > 0 we have, with probability greater than 1 − exp(−x):

D(f‖f̂^D_*) ≤ min_{1≤k≤N} D(f‖f_k) + β (x + log(N))/n,

with β = 2 exp(6K + 2L) + 4K/3.
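The unbiasedness of the plug-in (21) is easy to check by simulation; in the following sketch the density f(x) = 2x and the test function ℓ are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative density f(x) = 2x on [0, 1], sampled by inverse CDF: F^{-1}(u) = sqrt(u)
n = 200_000
X = np.sqrt(rng.random(n))

def ell(x):
    """A bounded test function."""
    return np.cos(3.0 * x)

# Plug-in of the empirical measure: <ell, I_n> = n^{-1} sum_i ell(X_i),
# an unbiased estimator of <ell, f>
estimate = ell(X).mean()

# Exact value <ell, f> = int_0^1 2 x cos(3x) dx, by integration by parts
exact = 2.0 * np.sin(3.0) / 3.0 + 2.0 * (np.cos(3.0) - 1.0) / 9.0
```

The Monte Carlo estimate agrees with the exact integral up to the expected n^{−1/2} fluctuation, with no bias term, as stated after Theorem 3.1.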
Remark 3.2. We can also use the aggregation method of Section 2.2 and consider the normalized estimator f̂^S_* / ∫ f̂^S_*, which is a proper density function. Notice that the optimal weights λ̂^D_* (which define f̂^D_*) and λ̂^S_* (which define f̂^S_*) maximize different criteria. Indeed, according to (18), the vector λ̂^S_* maximizes:

λ ↦ ⟨g_λ, I_n⟩ − (1/2) ∫ f^S_λ − (1/2) Σ_{k=1}^N λ_k ∫ f_k
  = ⟨t_λ, I_n⟩ − Σ_{k=1}^N λ_k ψ_k − (1/2) exp(ψ_λ − Σ_{k=1}^N λ_k ψ_k) − 1/2,

and according to (10) the vector λ̂^D_* maximizes:

λ ↦ ⟨t_λ, I_n⟩ − (1/2) ψ_λ − (1/2) Σ_{k=1}^N λ_k ψ_k,

where we used the identity g_λ = t_λ − Σ_{k=1}^N λ_k ψ_k and the equality ∫ f^S_λ = exp(ψ_λ − Σ_{k=1}^N λ_k ψ_k). This shows that the resulting aggregated estimators are different.

Spectral density model
In this section we apply the convex aggregation scheme of Section 2.2 to the spectral density estimation of stationary centered Gaussian sequences. Let h = 1/(2π) 1_{[−π,π]} be the reference density and (X_k)_{k∈Z} be a stationary, centered Gaussian sequence with covariance function γ defined, for j ∈ Z, by:

γ_j = Cov(X_k, X_{k+j}).

Notice that γ_{−j} = γ_j. The joint distribution of X = (X_1, . . . , X_n) is then a multivariate, centered Gaussian distribution with covariance matrix Σ_n ∈ R^{n×n} given by [Σ_n]_{i,j} = γ_{i−j} for 1 ≤ i, j ≤ n. Notice that the sequence (γ_j)_{j∈Z} is positive semi-definite.
We make the following standard assumption on the covariance function γ:

Σ_{j∈Z} |γ_j| < +∞.   (22)

The spectral density function f associated to the process is the even function defined on [−π, π] whose Fourier coefficients are the γ_j:

f(x) = (1/2π) Σ_{j∈Z} γ_j e^{ijx} = (1/2π) (γ_0 + 2 Σ_{j≥1} γ_j cos(jx)).   (23)
Condition (22) ensures that the spectral density is well defined, continuous and bounded. It is also even and non-negative, as (γ_j)_{j∈Z} is positive semi-definite. The function f completely characterizes the model, as the covariance matrix of X is the Toeplitz matrix built on 2πf. For ℓ ∈ L_1(h), we define the corresponding Toeplitz matrix T_n(ℓ) of size n × n by:

[T_n(ℓ)]_{j,k} = (1/2π) ∫_{−π}^{π} ℓ(x) e^{−i(j−k)x} dx,  1 ≤ j, k ≤ n.

Notice that T_n(2πf) = Σ_n. Some properties of the Toeplitz matrix T_n(ℓ) are collected in Section A.1. We choose the following estimator of f, for x ∈ [−π, π]:

I_n(x) = (1/2π) Σ_{|j|≤n−1} γ̂_{|j|} e^{ijx},   (24)

with (γ̂_j, 0 ≤ j ≤ n − 1) the empirical estimates of the covariances (γ_j, 0 ≤ j ≤ n − 1):

γ̂_j = (1/n) Σ_{i=1}^{n−j} X_i X_{i+j}.   (25)

The function I_n is a biased estimator, where the bias is due to two different sources: the truncation of the infinite sum at n, and the renormalization in (25) by n instead of n − j (but it is asymptotically unbiased as n goes to infinity when condition (22) is satisfied). The expected value f̄_n of I_n is given by:

f̄_n(x) = (1/2π) Σ_{|j|≤n−1} (1 − |j|/n) γ_{|j|} e^{ijx}.

In order to be able to apply Proposition 2.6, we assume that f and the estimators f_1, . . . , f_N of f belong to G (they are in particular positive and bounded) and are even functions. In particular, the estimators f_1, . . . , f_N and the convex aggregate estimator f̂^S_* defined in (20) are proper spectral densities of stationary Gaussian sequences. Remark 3.3. By choosing h = 1/(2π) 1_{[−π,π]}, we restrict our attention to spectral densities that are bounded away from +∞ and 0; see [24] and [6] for the characterization of such spectral densities. Note that we can apply the aggregation procedure to non-even functions f_k, 1 ≤ k ≤ N, but the resulting estimator would not be a proper spectral density in that case.
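The empirical covariances (25) and the estimator (24) can be sketched as follows; the white-noise example and all numerical choices are illustrative, and the bias described above enters through the normalization by n.

```python
import numpy as np

def cov_estimates(X):
    """Empirical covariances gamma_hat_j = n^{-1} sum_{i=1}^{n-j} X_i X_{i+j},
    0 <= j <= n-1, normalized by n as in (25); hence biased, with expectation
    (1 - j/n) gamma_j."""
    n = len(X)
    return np.array([np.dot(X[: n - j], X[j:]) / n for j in range(n)])

def spectral_estimate(gamma_hat, x):
    """I_n(x) = (2 pi)^{-1} sum_{|j| <= n-1} gamma_hat_|j| e^{ijx}
             = (2 pi)^{-1} (gamma_hat_0 + 2 sum_{j >= 1} gamma_hat_j cos(jx))."""
    j = np.arange(1, len(gamma_hat))
    return (gamma_hat[0] + 2.0 * np.sum(gamma_hat[1:] * np.cos(j * x))) / (2.0 * np.pi)

# White noise: true covariances are gamma_0 = 1, gamma_j = 0 for j >= 1,
# so the true spectral density is the constant 1/(2 pi)
rng = np.random.default_rng(2)
n = 5000
X = rng.standard_normal(n)
g = cov_estimates(X)
I_at_0 = spectral_estimate(g, 0.0)   # equals (sum_i X_i)^2 / (2 pi n), hence non-negative
```

Note that the raw estimator I_n does not concentrate pointwise (I_n(0) has the order of a chi-squared variable here), which is one reason the aggregation acts on smoothed candidate estimators f_k rather than on I_n itself.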
Since I_n is a biased estimator of f, to prove a sharp oracle inequality for spectral density estimation we shall assume some regularity on the functions f and f_1, . . . , f_N, in order to be able to control the bias term. More precisely, these conditions will be Sobolev conditions on their logarithms, that is, on the functions log(f/h) and log(f_1/h), . . . , log(f_N/h).
For ℓ ∈ L_2(h), the corresponding Fourier coefficients are defined for k ∈ Z by a_k = (1/2π) ∫_{−π}^{π} e^{−ikx} ℓ(x) dx. From Fourier series theory, we deduce that Σ_{k∈Z} |a_k|² = ‖ℓ‖²_{L_2(h)} and a.e. ℓ(x) = Σ_{k∈Z} a_k e^{ikx}. If furthermore Σ_{k∈Z} |a_k| is finite, then ℓ is continuous, ℓ(x) = Σ_{k∈Z} a_k e^{ikx} for x ∈ [−π, π], and ‖ℓ‖_∞ ≤ Σ_{k∈Z} |a_k|. For r > 0, we define the Sobolev norm ‖ℓ‖_{2,r} of ℓ as:

‖ℓ‖²_{2,r} = Σ_{k∈Z} (1 + |k|^{2r}) |a_k|².

The corresponding Sobolev space is defined by:

W^r = {ℓ ∈ L_2(h) : ‖ℓ‖_{2,r} < ∞}.

For r > 1/2, we can bound the supremum norm of ℓ by its Sobolev norm:

‖ℓ‖_∞ ≤ Σ_{k∈Z} |a_k| ≤ C_r ‖ℓ‖_{2,r},   (26)

where we used the Cauchy-Schwarz inequality for the second inequality, with C_r² = Σ_{k∈Z} (1 + |k|^{2r})^{−1} < ∞. The proof of the following Lemma seems to be part of the folklore, but since we did not find a proper reference, we give it in Section A.2.

Lemma 3.4.
Let r > 1/2, K > 0. There exists a finite constant C(r, K) such that for any g ∈ W r with g 2,r ≤ K, then we have exp(g) 2,r ≤ C(r, K).
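The sup-norm bound (26) can be checked numerically; the following sketch assumes the weight convention (1 + |k|^{2r}) for the Sobolev norm (one common choice) and truncates the series defining the constant C_r.

```python
import numpy as np

def sobolev_norm(a, r):
    """||l||_{2,r} from Fourier coefficients a_k, k = -K..K, under the assumed
    weight convention ||l||_{2,r}^2 = sum_k (1 + |k|^{2r}) |a_k|^2."""
    K = (len(a) - 1) // 2
    k = np.arange(-K, K + 1)
    return np.sqrt(np.sum((1.0 + np.abs(k) ** (2.0 * r)) * np.abs(a) ** 2))

def C_r(r, K=100_000):
    """C_r with C_r^2 = sum_k (1 + |k|^{2r})^{-1}, truncated at |k| <= K;
    the series converges exactly when r > 1/2."""
    k = np.arange(-K, K + 1)
    return np.sqrt(np.sum(1.0 / (1.0 + np.abs(k) ** (2.0 * r))))

# Test function l(x) = 1 + cos(x): a_{-1} = a_1 = 1/2, a_0 = 1, sup-norm 2
a = np.array([0.5, 1.0, 0.5])
r = 1.0
bound = C_r(r) * sobolev_norm(a, r)   # Cauchy-Schwarz bound on ||l||_inf, as in (26)
sup_norm = 2.0
```

Here the first inequality in (26) is an equality (the a_k are non-negative), while the Cauchy-Schwarz step loses a constant factor, as expected.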
For r > 1/2, we consider the following subset of functions:

F^S_r(L) = {f ∈ G : g_f := log(f/h) is even and ‖g_f‖_{2,r} ≤ L}.

For f ∈ F^S_r(L), we deduce from (26) that g_f is continuous and bounded. This implies that f is a positive, continuous, even function and thus a proper spectral density. Notice that 2π ‖f‖_∞ ≤ exp(‖g_f‖_∞) ≤ exp(C_r L). We deduce from (23) that γ_k = ∫_{−π}^{π} e^{−ikx} f(x) dx. Thus, Lemma 3.4 and (26) imply that the covariance function associated to f ∈ F^S_r(L) satisfies (22). We also get that Σ_{j=1}^∞ j γ_j² < +∞, which is a standard assumption for spectral density estimation.
The following Theorem is the main result of this section.
are linearly independent. Let $X = (X_1, \ldots, X_n)$ be a sample of a stationary centered Gaussian sequence with spectral density $f$ and let $I_n$ be defined in (24). Let $\hat f^S_*$ be given by (15). Then for any $x > 0$, we have with probability greater than $1 - \exp(-x)$: with $\beta = 4(K e^L + e^{2L+3K})$ and $\alpha = 4 K C(r, L)/C_r$.
Remark 3.6. When the value of $\gamma_0$ is given, we can use the aggregation method of Section 2.1 after normalizing the estimators $f_k$, $1 \le k \le N$, by dividing $f_k$ by $\int f_k$. The final estimator of $f$ then takes the form $\hat f^D$ and verifies a sharp oracle inequality similar to that of $\hat f^S_*$ (that is, without the term $\alpha/n$ of Theorem 3.5). When the value of $\gamma_0$ is unknown, it could be estimated empirically by $\hat\gamma_0$ in order to estimate $f$. However, the empirical estimation of $\gamma_0$ introduces an error term of order $1/\sqrt{n}$, which leads to a suboptimal remainder term for this aggregation method.

Numerical implementation
In this section we apply the aggregation method $f^D_\lambda$ established in Section 2.1 to the problem of density estimation. We compare the performance of our aggregation scheme to the progressive mixture method introduced by [10] and [32], which proves to be a strong competitor.
To the best of our knowledge, there is no progressive mixture rule for spectral density estimation.
We consider two examples of probability densities to estimate: a shifted and truncated version of the Claw distribution and of the Smooth Comb distribution used in [26]. A random variable is distributed according to the Claw distribution if its density $f_{\mathrm{Claw}}$ is given by: where $\varphi$ is the density function of a standard normal random variable and $\alpha$ a normalizing constant. The Smooth Comb distribution has density function $f_{\mathrm{SmComb}}$ defined as, for $x \in \mathbb{R}$: with normalizing constant $a$. Given a sample of the underlying distribution, we use the first part of it to create kernel density estimators with different bandwidths, then use the second part to perform the aggregation. We consider 5 different estimators with the normal kernel and bandwidths $h = 0.005, 0.0075, 0.01, 0.02, 0.05$. We create 100 samples of size 1000, and then use the first $n = 50, 100, 200, 500, 1000$ observations of each sample to estimate the true density, allocating 80% of the observations to creating the kernel estimators and 20% to the aggregation step.
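The experimental pipeline above can be sketched in a few lines: fit Gaussian-kernel estimators on the first part of the sample, then pick convex weights for the *logarithms* of the estimators on the held-out part. This is a simplified illustration, not the paper's code: the target density, bandwidths, and a coarse grid search standing in for the penalized maximum-likelihood criterion are all our own choices.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def gauss_kde(train, h):
    """Gaussian-kernel density estimator with bandwidth h, returned as a callable."""
    def f(x):
        z = (np.asarray(x)[:, None] - train[None, :]) / h
        return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return f

sample = rng.normal(0.5, 0.15, size=1000)        # stand-in for the Claw target
train, valid = sample[:800], sample[800:]        # 80% / 20% split, as in the text
fs = [gauss_kde(train, h) for h in (0.02, 0.05, 0.1)]

# Aggregate on the log scale: f_lambda proportional to exp(sum_k lambda_k log f_k),
# renormalized to integrate to one; weights chosen by held-out likelihood.
grid_x = np.linspace(0.0, 1.0, 1000)
dx = grid_x[1] - grid_x[0]
log_valid = np.log(np.array([f(valid) for f in fs]))    # shape (N, n_valid)
log_grid = np.log(np.array([f(grid_x) for f in fs]))    # shape (N, n_grid)

best_lam, best_ll = None, -np.inf
for w1, w2 in product(np.linspace(0, 1, 11), repeat=2):
    if w1 + w2 > 1 + 1e-12:
        continue                                 # stay on the simplex
    lam = np.array([w1, w2, 1.0 - w1 - w2])
    mass = np.sum(np.exp(lam @ log_grid)) * dx   # normalizing constant of f_lambda
    ll = np.mean(lam @ log_valid) - np.log(mass) # held-out log-likelihood
    if ll > best_ll:
        best_ll, best_lam = ll, lam

assert best_lam is not None and abs(best_lam.sum() - 1.0) < 1e-9
```

The geometric mixture stays a proper density after renormalization, which is the mass-conserving feature of the density-estimation procedure.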
The results of the estimations can be found in Figure 1. We can observe that for small sample sizes the convex aggregation method performs better in terms of risk and dispersion, but as the sample size increases, the progressive mixture method yields better results. Further numerical studies may include a hybrid procedure that would benefit from our aggregated procedure for small samples and from the progressive mixture procedure for large enough samples. The latter might also be introduced for the spectral density model, which characterizes the covariance of stationary series and is of great interest for many applications.

Lower bounds
In this section we show that the aggregation procedures given in Section 2 are optimal by giving lower bounds corresponding to the upper bounds of Theorems 3.1 and 3.5, for the estimation of the probability density function as well as for the spectral density.

Probability density estimation
In this section we suppose that the reference density is the uniform distribution on $[0,1]^d$. Remark 4.1. If the reference density is not the uniform distribution on $[0,1]^d$, then we can apply the Rosenblatt transformation, see [27], to reduce the problem to this latter case. More precisely, according to [27], if the random variable $Z$ has probability density $h$, then there exist two maps $T$ and $T^{-1}$ such that $U = T(Z)$ is uniform on $[0,1]^d$ and a.s. $Z = T^{-1}(U)$. Then if the random variable $X$ has density $f = \exp(g)\, h$, we deduce that $T(X)$ has density: Furthermore, if $f_1$ and $f_2$ are two densities (with respect to the reference density $h$), then we have: We now give the main result of this section. Let $P_f$ denote the probability measure when $X_1, \ldots, X_n$ are i.i.d. random variables with density $f$.
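In one dimension the Rosenblatt transformation reduces to applying the CDF: if $Z$ has continuous CDF $F$, then $U = F(Z)$ is uniform on $[0,1]$. A minimal numerical check (in dimension $d$ one conditions coordinate by coordinate, as in [27]):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

# Standard normal CDF, vectorized from math.erf (no SciPy needed)
normal_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

z = rng.standard_normal(200_000)   # Z with density h = standard normal
u = normal_cdf(z)                  # T(Z): should be uniform on [0, 1]

# first two moments of the uniform distribution on [0, 1]
assert abs(u.mean() - 0.5) < 5e-3
assert abs(u.var() - 1.0 / 12.0) < 5e-3
```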

Spectral density estimation
In this section we give a lower bound for the aggregation of spectral density estimators. Let $P_f$ denote the probability measure when $(X_n)_{n \in \mathbb{Z}}$ is a centered Gaussian sequence with spectral density $f$. Recall the set of positive even functions $F^S_r(L) \subset \mathcal{G}$ defined by (28) for $r > 1/2$. Proposition 4.3. Let $N \ge 2$, $r > 1/2$, $L > 0$. There exists a constant $C(r, L)$ free of $N$ and $N$ spectral densities $(f_k, 1 \le k \le N)$ belonging to $F^S_r(L)$ such that for all $n \ge 1$, $x \in \mathbb{R}_+$ satisfying: we have: with the infimum taken over all estimators $\hat f_n$ based on the sample sequence $X = (X_1, \ldots, X_n)$, and $\beta = 8^{-5/2}/3$.

Proofs of results in Section 2
Proof of Lemma 2.3. Consider the form (9) of $H^D_n(\lambda)$. Obviously, the function . This implies that for all $\lambda, \lambda' \in \Lambda^+$: Since $\psi_\lambda$ is convex and differentiable, we deduce from (9) that $H^D_n$ is concave and differentiable. We also have, by the linearity of $L^D_n$ and (32), that for all $\lambda, \lambda' \in \Lambda^+$: The concave function $H^D_n$, on the compact set $\Lambda^+$, attains its maximum at some set of points $\Lambda^* \subset \Lambda^+$. For $\hat\lambda^* \in \Lambda^*$, we have for all $\lambda \in \Lambda^+$: see for example Equation 4.21 of [5]. Using (33) with $\lambda = \hat\lambda^*$ and (34), we get (11). Let $\hat\lambda^{1*}$ and $\hat\lambda^{2*}$ be elements of $\Lambda^*$. Then by (11), we have: which implies that a.e. $f^D_{\hat\lambda^{1*}} = f^D_{\hat\lambda^{2*}}$. By the linear independence of $(t_k, 1 \le k \le N)$, this gives $\hat\lambda^{1*} = \hat\lambda^{2*}$, proving the uniqueness of the maximizer.
Proof of Proposition 2.4. Using (8), we get: By the definition of $k^*$, together with $\mathrm{pen}^D(e_k) = 0$ for all $1 \le k \le N$ and the strict concavity (11) of $H^D_n$ at $\hat\lambda^D_*$ with $\lambda = e_{k^*}$, we get: We recall, see Lemma 1 of [2], that for any non-negative integrable functions $p$ and $q$ on $\mathbb{R}^d$ satisfying $\|\log(p/q)\|_\infty < +\infty$, we have: We have: where we used (36) for the first inequality, the fact that ψ ≤ t f ∞, as well as t f h = 0. By using this lower bound on $D\big(\hat f^D_* \,\big\|\, f_k\big)$ for both terms on the right hand side of (35), we get: where the first equality is due to the following bias-variance decomposition, which holds for all $\ell \in L^2(h)$ and $\lambda \in \Lambda^+$: The function $V^D_n$ is affine in $\lambda$; therefore it attains its maximum on $\Lambda^+$ at some $e_k$, $1 \le k \le N$, giving: This concludes the proof.
Proof of Lemma 2.5. Notice that for all $\lambda, \lambda' \in \Lambda^+$: The proof is then similar to the proof of Lemma 2.3, using (38) instead of (32).
Proof of Proposition 2.6. Similarly to the proof of Proposition 2.4, we obtain that: Since $\|\log(\hat f^S_*/f_k)\|_\infty = \|\hat g^S_* - g_k\|_\infty \le 2K$ for $1 \le k \le N$, we can apply (36) with $\hat f^S_*$ and $f_k$: where in the second and third inequalities we use that $\|\hat g^S_*\|_\infty \le \max_{1 \le k \le N} \|g_k\|_\infty \le K$. Applying (40) to both terms on the right hand side of (39) gives: where we used (37) for the second equality. The function $V^S_n$ is affine in $\lambda$; therefore it attains its maximum on $\Lambda^+$ at some $e_k$, $1 \le k \le N$, giving: This concludes the proof.
Proof of Theorem 3.1. By Proposition 2.4, we have that: Since $I_n(dy)$ is an unbiased estimator of $f(y)\,dy$, we get $B_n(t^D_* - t_{k^*}) = 0$. Notice that: which will provide a control of the second term on the right hand side of (41). Thus, the proof of the theorem will be complete as soon as (42) is proved.
To prove (42), we use the concentration inequality of Proposition 5.3 in [3], which states that for $Y_1, \ldots, Y_n$ independent random variables with finite variances such that $|Y_i - \mathbb{E} Y_i| \le b$ for all $1 \le i \le n$, we have for all $u > 0$ and $a > 0$: Then, since $f_k$ and $f_{k^*}$ belong to $F^D(K)$, we have $|Y_i - \mathbb{E} Y_i| \le 4K$, and: Applying (43) with $a = \exp(-6K - 2L)/4$, $b = 4K$ and $u = \log(N) + x$, we obtain: where the second inequality is due to (44). This proves (42) and completes the proof.
Proof of Theorem 3.5. Using Proposition 2.6 and the notation defined there, we have that: First step: concentration inequality for $\max_{1 \le k \le N} V^S_n(e_k)$.
We shall prove that: It is enough to prove that for each $1 \le k \le N$: Indeed, take $u = \log(N) + x$ and use the union bound over $1 \le k \le N$ to deduce (46) from (47). The rest of this first step is devoted to the proof of (47). Recall definition (63) of Toeplitz matrices associated to Fourier coefficients. We express the scalar product $\langle \ell, I_n \rangle$ for $\ell \in L^\infty([-\pi, \pi])$ in matrix form: We have the following expression for the covariance matrix of $X$: $\Sigma_n = 2\pi T_n(f)$.
Since $f$ is positive, $\Sigma_n$ is positive-definite. Set $\xi = \Sigma_n^{-1/2} X$, so that $\xi$ is a centered $n$-dimensional Gaussian vector whose covariance matrix is the $n$-dimensional identity matrix. Taking the expected value in (48), we obtain: where $\mathrm{tr}(A)$ denotes the trace of the matrix $A$, and $R_n(\ell) = \Sigma_n^{1/2} T_n(\ell) \Sigma_n^{1/2}$. Therefore the difference $\langle \ell, I_n - \bar f_n \rangle$ takes the form: We shall take $\ell = g_k - g_{k^*}$. For this reason, we assume that $\ell$ is even, and we denote by $(\eta_i, 1 \le i \le n)$ the eigenvalues of the symmetric matrix $R_n(\ell)$, with $\eta_1$ having the largest absolute value. Similarly to Lemma 4.2 of [4], we have for all $a > 0$: where we used for the second inequality that $2\sqrt{vw} \le v/a + aw$ for all $v, w, a > 0$. Let us give upper bounds for $|\eta_1|$ and $\eta^2$. We write $\rho(A)$ for the spectral radius of a matrix $A \in \mathbb{R}^{n \times n}$. By the well-known properties of the spectral radius, we have: We deduce from (64) that $\rho(\Sigma_n) = \rho(2\pi T_n(f)) \le 2\pi \|f\|_\infty \le \exp(L)$ and $\rho(T_n(\ell)) \le \|\ell\|_\infty \le 2K$. As for $\eta^2$, we have: where we used (65) for the last inequality. Using (50) and (51) in (49) gives: where for the second inequality we set $a = 4\exp(2L + 3K)$. This proves (47), and thus (46).
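The spectral-radius bound $\rho(T_n(\ell)) \le \|\ell\|_\infty$ used above — in fact, all eigenvalues of $T_n(\ell)$ lie between the essential infimum and supremum of $\ell$ — can be illustrated numerically. A minimal sketch, assuming the usual convention $(T_n(\ell))_{j,k} = a_{j-k}(\ell)$ for the elided definition (63); names are ours.

```python
import numpy as np

def fourier_coeffs(ell, ks, m=4096):
    # a_k = (1/2pi) int_{-pi}^{pi} e^{-ikx} ell(x) dx via the rectangle rule
    x = -np.pi + 2 * np.pi * np.arange(m) / m
    vals = ell(x)
    return np.array([np.mean(vals * np.exp(-1j * k * x)) for k in ks])

def toeplitz(ell, n):
    # (T_n(ell))_{j,k} = a_{j-k}, with a_k the Fourier coefficients of ell
    a = fourier_coeffs(ell, np.arange(-(n - 1), n))
    return np.array([[a[(j - k) + n - 1] for k in range(n)] for j in range(n)])

ell = lambda x: 1.0 + 0.5 * np.cos(x)    # real, even, with 0.5 <= ell <= 1.5
n = 40
T = toeplitz(ell, n)
eig = np.linalg.eigvalsh(0.5 * (T + T.conj().T))  # T is Hermitian up to round-off

# eigenvalues of T_n(ell) lie in [inf ell, sup ell]; in particular the
# spectral radius is bounded by ||ell||_inf
assert eig.min() >= 0.5 - 1e-8
assert eig.max() <= 1.5 + 1e-8
```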

Second step: Upper bound for the bias term
We set $\ell^* = \hat g^S_* - g_{k^*}$, and we have $\|\ell^*\|_{2,r} \le 2K/C_r$. Let $(a_k)_{k \in \mathbb{Z}}$ be the corresponding Fourier coefficients, which are real as $\ell^*$ is even. We decompose the bias term as follows: with $\bar f_{n,1}$, $\bar f_{n,2}$ given by, for $x \in [-\pi, \pi]$: For the first term of the right hand side of (52), notice that: We deduce that $\langle \bar f_{n,1} - f, \ell^* \rangle = \langle \bar f_{n,1} - f, \bar\ell^* \rangle$, with $\bar\ell^*(x) = \sum_{|j| \ge n} a_j e^{ijx}$. Then, by the Cauchy-Schwarz inequality, we get: Thanks to Lemma 3.4, we get: We deduce that: For the second term on the right hand side of (52), we have: Using the Cauchy-Schwarz inequality and then Lemma 3.4, we get, as $r > 1/2$: Therefore, combining (53) and (54), we obtain the following upper bound for the bias:
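The tail estimate behind this step is the Cauchy-Schwarz bound $\sum_{|j| \ge n} |a_j| \le \big(\sum_j j^{2r} a_j^2\big)^{1/2} \big(\sum_{|j| \ge n} j^{-2r}\big)^{1/2}$, which decays like $n^{1/2 - r}$. A small numerical check with synthetic coefficients (our own, for illustration only):

```python
import numpy as np

r = 1.0
rng = np.random.default_rng(3)
J = 2000
j = np.arange(1, J + 1)
# coefficients of a W^r-type function: j^{2r} a_j^2 is summable
a = rng.uniform(-1, 1, J) / j ** (r + 0.6)

sob = np.sqrt(np.sum(j ** (2 * r) * a ** 2))   # weighted-l2 (Sobolev-type) norm

for n in (10, 50, 200):
    tail_l1 = np.sum(np.abs(a[n - 1:]))                            # sum_{j >= n} |a_j|
    cs_bound = sob * np.sqrt(np.sum(1.0 / j[n - 1:] ** (2 * r)))   # Cauchy-Schwarz
    assert tail_l1 <= cs_bound + 1e-12
```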

Proofs of results in Section 4
In the following proofs, we shall use the Hellinger distance, which is defined as follows. For two non-negative integrable functions $p$ and $q$, the Hellinger distance $H(p, q)$ is defined as: A well-known property of this distance is that its square is smaller than the Kullback-Leibler divergence defined by (4), that is, for all non-negative integrable functions $p$ and $q$, we have: Proof of Proposition 4.2. Since the probability densities $(f_k, 1 \le k \le N)$ belong to $F^D(L)$, we have: For the choice of $(f_k, 1 \le k \le N)$, we follow the choice given in the proof of Theorem 2 of [22]. Let $D$ be the smallest positive integer such that $2^{D/8} \ge N$ and $\Delta = \{0, 1\}^D$. For $0 \le j \le D - 1$, $s \in \mathbb{R}$, we set: Notice the support of the function $\alpha_j$ is $(j/D, (j+1)/D]$. Then for any $\delta = (\delta_1, \ldots, \delta_D) \in \Delta$, the function $f_\delta$ defined by: is a probability density function with $e^L \ge 1 + T/D \ge f_\delta \ge 1 - T/D \ge e^{-L}$. This implies that $f_\delta \in F^D(L)$. As shown in the proof of Theorem 2 in [22], there exist $N$ probability densities $(f_k, 1 \le k \le N)$ amongst $\{f_\delta, \delta \in \Delta\}$ such that for any $i \ne j$, we have: and $f_1$ can be chosen to be the density of the uniform distribution on $[0,1]^d$. Recall the notation $p^{\otimes n}$ for the $n$-fold product probability density corresponding to the probability density $p$. Then we also have (see the proof of Theorem 2 of [22]) for all $1 \le i \le N$: Let us take $T = D\sqrt{(\log(N) + x)/(3n)}$, so that with condition (29) we indeed have $T \le D(1 - e^{-L})$. With this choice, and the definition of $\beta$, we have: Now we apply Corollary 5.1 of [3] with $m = N - 1$ and with the squared Hellinger distance instead of the $L^2$ distance to get that for any estimator $\hat f_n$: This concludes the proof.
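The inequality $H^2(p, q) \le D(p \,\|\, q)$ — with the convention $H^2(p, q) = \int (\sqrt{p} - \sqrt{q})^2$, which matches the statement that the *square* of the distance is below the divergence — can be checked numerically on discrete distributions. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(100):
    p = rng.random(10); p /= p.sum()   # random probability vectors
    q = rng.random(10); q /= q.sum()
    hell_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)   # squared Hellinger distance
    kl = np.sum(p * np.log(p / q))                     # Kullback-Leibler divergence
    assert hell_sq <= kl + 1e-12
```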
Proof of Proposition 4.3. Similarly to the proof of Proposition 4.2, the left hand side of (31) is greater than: We shall choose a set of spectral densities $(f_k, 1 \le k \le N)$ with $f_k \in F^S_r(L)$, similarly to the proof of Proposition 4.2. Let us define $\varphi : [0, \pi] \to \mathbb{R}$ as, for $x \in [0, \pi]$: We have that $\varphi \in C^\infty(\mathbb{R})$ and: Let $D$ be the smallest integer such that $2^{D/8} \ge N$ and $\Delta = \{0, 1\}^D$. For $1 \le j \le D$, $x \in [0, \pi]$, let $\bar\alpha_j(x)$ be defined as: and for any $\delta = (\delta_1, \ldots, \delta_D) \in \Delta$ and $s \ge 0$, let the function $f^\delta_s$ be defined by: Since $\int_0^\pi \varphi = 0$, we get: We assume that $s \in [0, 1/2]$, so that $2\pi f^\delta_s \ge 1/2$. Let us denote $g^\delta_s = g_{f^\delta_s} = \log(2\pi f^\delta_s)$. We first give upper bounds for $\|(g^\delta_s)^{(p)}\|_{L^2(h)}$ with $p \in \mathbb{N}$. For $p = 0$, we have by (58): For $p \ge 1$, we get by Faà di Bruno's formula that: The $\ell$-th derivative of $2\pi f^\delta_s$ is given by, for $y \in [0, \pi]$: Therefore we have the following bound for this derivative: From $\varphi \in C^\infty(\mathbb{R})$, we deduce that $\|\varphi^{(\ell)}\|_\infty$ is finite for all $\ell \in \mathbb{N}^*$. Since $s \in [0, 1/2]$ and $2\pi f^\delta_s \ge 1 - s\|\varphi\|_\infty \ge 1/2$, there exists a constant $\bar C_p$ depending on $p$ (and not depending on $N$) such that: In order to have $f^\delta_s \in F^S_r(L)$, we need to ensure that $\|g^\delta_s\|_{2,r} \le L/C_r$. For $r \in \mathbb{N}^*$, we have: Therefore if $s \in [0, s_{r,L}]$ with $s_{r,L} \in [0, 1/2]$ given by: then by (59) and (61) we get: Let $\lfloor r \rfloor$ and $\lceil r \rceil$ denote the unique integers such that $r - 1 < \lfloor r \rfloor \le r$ and $r \le \lceil r \rceil < r + 1$. For $r \notin \mathbb{N}^*$, Hölder's inequality yields:
Using (61) with $p = \lfloor r \rfloor$ and $p = \lceil r \rceil$, we obtain: Hence if $s \in [0, s_{r,L}]$ with $s_{r,L} \in [0, 1/2]$ given by: we also have $\|g^\delta_s\|_{2,r} \le L/C_r$, so that $f^\delta_s \in F^S_r(L)$. Mimicking the proof of Theorem 2 in [22] and omitting the details, we first obtain (see the last inequality of p. 975 in [22]) that for $\delta, \delta' \in \Delta$: with $\sigma(\delta, \delta')$ the Hamming distance between $\delta$ and $\delta'$, and then deduce that there exist $(\delta^k, 1 \le k \le N)$ in $\Delta$ with $\delta^1 = 0$ such that for any $1 \le i \ne j \le N$ and $s \in [0, s_{r,L}]$, we have (see the first inequality of p. 976 in [22]): Notice that $f^{\delta^1}_s = f^0_s = h$ is the density of the uniform distribution on $[-\pi, \pi]$. With a slight abuse of notation, let us denote by $P_f$ the joint probability density of the centered Gaussian sequence $X = (X_1, \ldots, X_n)$ corresponding to the spectral density $f$. Assume $X$ is standardized (that is, $\mathrm{Var}(X_1) = 1$), which implies $\int f = 1$. Let $\Sigma_{n,f}$ denote the corresponding covariance matrix. Since $h = (1/2\pi)\mathbf{1}_{[-\pi,\pi]}$, we have $\Sigma_{n,h} = I_n$, the $n \times n$ identity matrix. We compute: The expected value in the previous equality can be written as: $\mathbb{E}_f\big[X^T(\Sigma_{n,f}^{-1} - I_n)X\big] = \mathrm{tr}\big((\Sigma_{n,f}^{-1} - I_n)\, \mathbb{E}_f[X X^T]\big) = \mathrm{tr}(I_n - \Sigma_{n,f}) = 0$, where for the last equality we used that the Gaussian random variables are standardized. This yields $D(P_f \| P_h) = -\frac{1}{2} \log(\det(\Sigma_{n,f}))$. We can use this last equality for $f = f^\delta_s$ since $\int f^\delta_s = 1$ thanks to (56), and obtain: Notice that for $s \in [0, s_{r,L}]$, we have $3/2 \ge 1 + s\|\varphi\|_\infty \ge 2\pi f^\delta_s \ge 1 - s\|\varphi\|_\infty \ge 1/2$ thanks to (58) and (56). Therefore we have: where we used $\Sigma_{n,f^\delta_s} = T_n(2\pi f^\delta_s)$ and Lemma A.2 with $\ell = 2\pi f^\delta_s$ for the first inequality, and (57) for the second inequality. We set: Proof. Notice that by Property (1), the eigenvalues $(\nu_i, 1 \le i \le n)$ of $T_n(\ell)$ verify $\nu_i \in [1/2, 3/2]$. For $t \in [-1/2, 1/2]$, we have $\log(1 + t) \ge t - t^2$, giving that: where we used $T_n(\ell - 1) = T_n(\ell) - I_n$ for the second equality and Property (2) for the second inequality.
Set $L = C_r K$. Let $f = e^g$ with $\|g\|_{2,r} \le K$. We have $\|g^{(p)}\|_\infty \le C_1 K$ for all integers $p < r$. According to Leibniz's rule, we get that $f^{(r)} = g^{(r)} f + P_r(g^{(1)}, \ldots, g^{(r-1)}) f$, where $P_r$ is a polynomial function of maximal degree $r$ such that: $\max |P_r(x_1, \ldots, x_{r-1})| \le C_{r,1} K^r$ for some finite constant $C_{r,1}$. We deduce that: $\|f^{(r)}\|_{L^2(h)} \le e^L \|g^{(r)}\|_{L^2(h)} + e^L C_{r,1} K^r$. Then use that $\|f\|_{L^2(h)} \le e^L$ to get the Lemma for $r \in \mathbb{N}^*$.
Let $K > 0$ and set $L = C_r K$. Let $f = e^g$ with $g \in W^r$ such that $\|g\|_{2,r} \le K$. Following the proof of Lemma A.3, we first give an upper bound of $J_s(\ell, f)$ in this