Adaptive density estimation based on a mixture of Gammas

We consider the problem of Bayesian density estimation on the positive semiline for possibly unbounded densities. We propose a hierarchical Bayesian estimator based on the gamma mixture prior which can be viewed as a location mixture. We study convergence rates of Bayesian density estimators based on such mixtures. We construct approximations of the local H\"older densities, and of their extension to unbounded densities, to be continuous mixtures of gamma distributions, leading to approximations of such densities by finite mixtures. These results are then used to derive posterior concentration rates, with priors based on these mixture models. The rates are minimax (up to a log n term) and since the priors are independent of the smoothness the rates are adaptive to the smoothness.

Nonparametric density estimation using Bayesian models with a mixture prior distribution has been used extensively in practice due to their flexibility and available computational techniques using MCMC. In some cases their theoretical properties have been studied, and in particular the asymptotic behaviour of the associated posterior distribution. Posterior weak consistency has been studied quite systematically in particular by [16], but posterior concentration rates have been derived only for a small number of kernels. In the case of density estimation on [0,1] (or any compact interval of R) [10] has studied mixtures of Beta densities, and [9] have considered mixtures of triangular densities, Gaussian location mixtures have been considered by [5,4,7,13,11] and, more generally, power exponential kernels by [7,12]. Location scale mixtures have been considered also by [1]. Apart from the latter paper, the posterior concentration rates have been obtained by the above authors are equal to the minimax estimation rate (up to a log n term) over some collections of functional classes, showing that nonparametric mixture models are not only flexible prior models, but they also lead to optimal procedures, in the frequentist sense. The above results do not cover however models for densities on R + and the posterior concentration rates have been obtained only under the condition that the densities are uniformly bounded.
In this paper we propose to estimate a possibly unbounded density supported on the positive semiline via a Bayesian approach using a Dirichlet Process mixture of Gamma densities as a prior distribution. The proposed prior distribution does not depend on regularity properties of the unknown density (such as the Hölder exponent) so the resulting posterior estimates are adaptive. Bayesian Gamma mixtures are widely used in practice, for instance, for pattern recognition [2] and for modelling the signal-to-noise ratio in wireless channels [14]. An algorithm for implementing a Gamma mixture with unknown number of components as well as aspects of the practical application of this model is given in [15].
The main purpose of the paper is to derive the conditions on the Gamma mixture prior and on the hyperpriors so that the posterior distribution asymptotically concentrates at the optimal rate (up to a log factor) around the true density over smooth classes of densities. We derive the concentration rate of the posterior distribution when the unknown density belongs to a local Hölder class on (0, ∞) (see the formal definition below) adapting the techniques applied in Shen et al. [13], Rousseau [10] and Kruijer et al. [7] to the proposed mixture of Gamma densities. In particular, we will show that this mixture provides a good approximation for such functions. Secondly, we investigate the concentration rate of this posterior distribution for an unknown density on (0, ∞) that can be unbounded at 0, namely for a density x α−1 h(x) for α ∈ (0, 1) in a neighbourhood of 0 and for function h belonging to a locally Hölder class on (0, ∞). A typical example of such behaviour is a Gamma density with the shape parameter between 0 and 1.
For a bounded density, we use the lower bound on the rate of convergence for estimators of densities from the local Hölder class H(β, L) which is n −β/(2β+1) [8].
The paper is organised as follows. In Section 2 we define the prior distribution and study the concentration rate of the corresponding posterior distribution over an extension of the local Hölder class to possibly unbounded densities. We also discuss the choice of the base measure of the Dirichlet process prior as well as the hyperprior measure on the shape parameter of the Gamma distribution that lead to consistent estimation with the posterior concentration rate equal to the minimax optimal rate of convergence up to a log factor. Numerical performance of the estimator is studied on simulated data for bounded and unbounded densities and on real data, with the results presented in Section 3. The proof of the main result is given in Section 4, and the proofs of the auxiliary results are deferred to the appendix.

Setup and Notation
Throughout the paper we assume that X n = (X 1 , · · · , X n ) is an n-sample from a distribution with density f on R + with respect to Lebesgue measure. We denote by F = {f ∈ L 1 (R + ); f : The aim is to estimate the unknown density f ∈ F which we do using a Bayesian approach. We construct a prior probability on F, by modelling f as a mixture of Gamma densities, see (2.1) below. The associated posterior distribution is denoted by Π(·|X n ). Let f 0 be the true density of the X i 's and we are interested in determining the posterior concentration rate ε n = o(1) defined by where · 1 is the L 1 norm.
In the following Section we present the main results of the paper.

Main results
We start with description of the mixture of Gamma distributions which underpins the construction of our prior model on F.

Prior model : mixtures of Gamma distributions
We consider the following Gamma mixture types of models: We consider (P, z) ∼ Π = Π 1 ⊗Π z , where Π 1 is a probability on the set of discrete distributions over R + an Π z is a probability distribution on R + = [0, +∞). We also denote R+ * =]0, +∞). Hence the densities are represented by location Gamma mixtures, since in the above parametrization is the mean of the Gamma distribution with parameters (z, z/ ). This particular parametrization leads to the variance equal to 2 /z, and imsart-ejs ver. 2014/10/16 file: BR2016EJS_arxiv.tex date: May 30, 2016 N.Bochkina and J.Rousseau/Density estimation based on a mixture of Gammas 5 as z goes to infinity, the Gamma distribution (z, z/ ) can be approximated by a Gaussian random variable with mean and variance 2 /z. This allows for precise approximation near 0 and more loose approximation in the tail. This parametrization has also been used in Wiper et al. [15].
The key to good approximation properties of a continuous density f by the gamma mixtures defined above is We explain in more details in Section 2.3.3, why mixtures of Gamma distributions as proposed here are flexible models for estimating smooth densities on R + . The general idea is that, as in the case of mixtures of Beta distributions in Rousseau [10] or mixtures of Gaussian distributions in Kruijer et al. [7], under regularity conditions with f verifying some Hölder -type condition with regularity β > 0, one can construct a probability density f 1 on R + such that The continuous mixture K z f 1 can then be approximated by a discrete mixture with O( √ z log z) components. We consider discrete priors on P and priors on z that satisfy the following condition: Condition (P): The prior on z, Π z satisfies : for some constants c, c , c 0 > 0 and ρ z ≥ 0, We consider either of these two types of prior on P : • Dirichlet Prior of P : P ∼ DP (m, G) where DP (m, G) denotes the Dirichlet Process with mass m > 0 and base probability measure G having positive and continuous density g on R + * satisfying: for some −1 < a 0 ≤ a 0 and 1 < a 1 ≤ a 1 . • Finite mixture : Condition (P) is quite mild. It is satisfied for instance for √ z following a Gamma distribution, in which case ρ z = 0. The prior condition on the base measure (2.4) imposes fat tails on G.
Note that, it appears from the proofs that if g(x) ≤ e −cx then posterior concentration rates would remain unchanged over the functional class described below, assuming that the true density f also has exponential type tails. Hence, when estimating such densities, G could be chosen a Gamma random variable, but the inverse Gamma does not satisfy these conditions and posterior concentration rates are not derived in this case.

Functional classes
In this paper we are interested in estimating densities that are possibly unbounded at 0. To construct a class of such functions over which posterior concentration rates are derived, first we need a class of bounded functions.
Lemma 2.1. For any f ∈ P α (β, L(·), γ, C 0 , C 1 , e, ∆) and x > 0,  1. Note that in the above functional class P(β, L(·), γ, C 0 , C 1 , e, ∆), f is bounded from above but is allowed converge to 0 as x goes to 0. Interestingly if f (x) = x τ h(x) for any τ ≥ 0 and h satisfies (2.6) then condition (2.6) is satisfied by f . 2. If f (x) ∈ P(β, L(·), γ, C 0 , C 1 , e, ∆) is also bounded from below, then condition (2.5) can be interpreted as log f being locally Hölder (possibly with a constant function L log f ) with the corresponding integral conditions. Similarly, the same holds for h(x) bounded from below if f (x) = x α−1 h(x) ∈ P α (β, L(·), γ, C 0 , C 1 , e, ∆). 3. The moment condition (2.6) is satisfied for a number of densities on the positive semiline (see Appendix C.2 for details): (a) for the Weibull distribution with density In this case, lim x→0 f (x) = 0 so we can take α = 1, and e.g. for b ∈ (0, 2) we can take γ = 1 and β < b/2. Similarly to location mixtures of Gaussian densities, mixtures of Gamma densities provide a flexible tool to approximate smooth densities on R + , and using the representation of Lemma 2.1, to approximate smooth but possibly unbounded densities. The posterior concentration rates are presented in the following Theorem.
Note also that the tail assumption (2.9) is much weaker than the tail condition considered for mixtures of Gaussians as in [7] or Shen et al. [13] where exponential decay in the form e −c|x| τ is assumed for some τ > 0. Because the posterior concentration rates obtained in [7] or Shen et al. [13] are upper bounds, it is not clear that the exponential tail condition was necessary. It seems however that because Gamma densities have fatter tails than Gaussian densities, they allow for approximations of densities with fatter tails.
The theorem is proved following the approach of Ghosal et al. [3], so that first we construct an approximation of the true density f 0 by a continuous mixture of Gammas in the form K z f 1 for some density f 1 close to f 0 and then approximate the continuous mixture by a discrete and finite mixture. This allows us to control the prior mass of Kullback-Leibler neighbourhoods. The tail condition (2.9) is used in this second step, similarly as the exponential tail condition used with Gaussian mixtures in [7] or Shen et al. [13]. It is to be noted however that local Hölder condition involved in the definition of the functional classes P α (.) implicitly also induce some tail behaviour on f 0 .

Mixture of inverse Gamma distributions
Although (2.9) is a rather mild condition, it excludes fat tail distributions such as the folded Cauchy density whose density at infinity behaves like x −2 . If one is interested in estimating fat tail densities at infinity, then a simple transformation allows to do it using mixtures of Gamma densities.
Consider the class of densities q( for small x and some a > 0 and f 0 (x) x −b−1 for some 0 < b ≤ 2 when x goes to infinity. Then q 0 (y) = y −2 f 0 (1/y) y −a−1 when y goes to infinity and q 0 (y) y b−1 when y goes to 0. Hence q 0 satisfies the tail conditions both at 0 and infinity assumed in Theorem 2.1. Since ||q − q 0 || 1 = ||f − f 0 || 1 , Theorem 2.1 implies that density q(x) can be estimated using the appropriately adapted prior, such that the corresponding prior on f satisfies the conditions stated in the theorem, with the same rate of convergence.
The prior for estimating q(x) = x −2 f (1/x) is a mixture of inverse Gammas: The hyperprior is (P , z) ∼Π =Π 1 ⊗Π z , where Π z is a probability distribution on R + satisfying condition (2.3) andΠ 1 is a probability on the set of discrete distributions over R + satisfying either of these two types of prior on P : • Dirichlet Prior ofP :P ∼ DP (m,Ḡ) where DP (m,Ḡ) denotes the Dirichlet Process with mass m > 0 and base probability measureḠ having positive and continuous densityḡ on R + * satisfying: for some −1 <ā 0 ≤ā 0 and 1 <ā 1 ≤ā 1 (note that this corresponds to the density x −2ḡ (1/x) satisfying conditions (2.4) withā 0 = a 1 − 2 and a 1 = a 0 + 2). • Finite mixture : imsart-ejs ver. 2014/10/16 file: BR2016EJS_arxiv.tex date: May 30, 2016 In particular, we have similar approximation properties for q in the following class: We summarize the result in the following corollary.
Corollary 2.1. Consider the prior defined by (2.11) that satisfies condition (P), and assume that X n = (X 1 , · · · , X n ) is a sample of independent observations distributed according to a probability P 0 on R + having density q 0 with respect to Lebesgue measure. Assume that there exists α ∈ (0, 1] such that q 0 ∈P α (β, L(·), γ, C 0 , C 1 , e, ∆) and that q 0 satisfies the following condition for some ρ 1 > 0: Then, there exists M > 0 such that Note that the folded Cauchy density satisfies the conditions for q 0 (x) required in the corollary.

Approximation of densities by Gamma mixtures
As in Rousseau [10], Kruijer et al. [7], one of the key elements in the proof of Theorem 2.1 is the approximation of a smooth density f by a continuous Gamma mixture of the form K z f 1 where f 1 is a smooth function close to f which is of independent interest. Similarly to Rousseau [10], Kruijer et al. [7], f 1 is constructed iteratively to be able to adapt to the smoothness of f 0 . The general idea is that K z f 0 (x) is a good approximation of f 0 if f 0 has smoothness β ≤ 2, as in the Gaussian case, because the Gamma density g z, behaves like a Gaussian density when z goes to infinity. To approximate f 0 with the correct order z −β/2 for β > 2, we need to correct for the error K z f 0 − f 0 and replace The above approximation scheme is described in the following proposition: where C R depends on α, β, L(·), γ, C 0 , C 1 , e, ∆ only. Moreover, the probability densityf where B depends only on α, β, L(·), γ, C 0 , C 1 , e, ∆.

Prior model
In the following sections we will fit a Bayesian model to the data with following Dirichlet Process prior: It is easy to check that Condition (P) holds for this prior. We use default choices of free parameters m = 1, a = 2 and b = c = 1, however we check sensitivity with respect to these parameters.
To sample from the posterior we use the slice sampling algorithm of Kalli et al. [6]. Introducing the auxiliary variables u = (u 1 , · · · , u n ) the uniform random truncation variables and c = (c 1 , · · · , c n ) the allocation variables so that the full likelihood is written as allows to use a Gibbs sampler algorithm based on the following full conditional distributions where the full conditional on z is sampled using a Metropolis-Hasting step with proposal

Simulations
We simulate n = 1000 observations from the true density f 0 , and fit a Bayesian model with prior (3.1). We considered the following true densities. Even though the folded Cauchy density does not satisfy the conditions of the theorem, we show that the proposed gamma mixture still provides a reasonably concentrating posterior distribution. 1250 thinned samples from the posterior distribution are plotted for each true density after at least 50000 burn in iterations, with the red line representing the mean of the posterior distribution and the green line representing the true density. Improvement of the concentration of the posterior distribution with increasing sample size is presented for the two component mixture. For all considered true densities, including the unbounded one and the folded Cauchy density, the proposed gamma mixture model performs well. However, the value of the folded Cauchy density around 0 has high uncertainty. Sensitivity with respect to the choice of the free parameters was investigated for all densities, all leading to good performance (presented for the unbounded density Gamma(0.4, 1)).

Email arrival data
In this section we consider the data of the intervals between arrival times of emails modelled in [15] which consists of the interarrival times (minutes) of 203 E-mail messages (we are grateful to Fabrizio Ruggeri, one of the authors, who kindly provided us with the data). The proposed location Gamma mixture (3.1) with the default choice of the free parameters was fitted to the data. The histogram of the data with superimposed posterior mean and the samples from the posterior distribution are shown in Figure 4). The histogram shows that the fit of the location mixture is similar to the fit of the location-scale mixture presented in [15]. The plot of the samples with the posterior indicates uncertainty about the values of the density around 0 as well as high uncertainty about the possible second mode around 3. We also present a zoom in into the neighbourhood of 0 which confirms the findings that the density is small around 0 and that the distribution of the email arrival times differs from exponential.

Proof of Theorem 2.1
Proof of Theorem 2.1. The proof consists in verifying the assumptions of Theorem 2.1 of Ghosal et al. [3]. The first assumption, on the prior mass of Kullback -Leibler neighbourhood of the true density, is verified in Lemma 4.1. In Lemma 4.2 we control the L 1 (and Hellinger) entropy of the sieves which are defined below.
Choosing ζ small enough completes the proof of (4.1), and hence the proof of Theorem 2.1.
We extend the definition of K z in the following way: for any distribution P , define If P has Lebesgue density f , then K z * P (x) = K z f (x).
Lemma 4.1. Assume that the probability density f 0 ∈ P α (β, L, γ, C 0 , C 1 , e, ∆) and that there exist C > 0 and ρ 1 > 0 such that Then, for any 0 > 0, there exist κ, C p > 0 such that for any prior satisfying condition (P) and n ≥ 1 where n = 0 n −β/(2β+1) (log n) q , with q defined in Theorem 2.1. The constants κ and C p depend on Π, 0 and on the constants defining the functional class.
Lemma 4.1 is proved in Section 4.2.
As in Shen et al. [13], we control the entropy of the following approximating sets.
The proof of Lemma 4.2 is given in Section 4.3. In the next section we prove Lemma 4.1.

Proof of Lemma 4.1
Consider P N the discrete distribution constructed in Lemma B.2, which we write as Note that for all P ∈ P z , P (U 0 ) ∈ (z −A , 2z −A ). Let z n = n 2/(2β+1) (log n) t with t = 2q − 5 if ρ z ≤ 5/2 and t = 2(q − ρ z ) if ρ z > 5/2 and set I n = (z n , 2z n ). Then for all z ∈ I n and all P ∈ P 2zn , Lemma B.3 implies that if A is large enough (depending on β, L, γ, C 0 , C 1 , e, ∆).
In particular, we have that j α j = m for the DP prior, and Also, we have that Note that for x ∈ (0, 1], Γ(x) ≤ x −1 , and for x > 1, Γ(x) ≤ exp(x log x). Adapting Lemma 6.1 of Ghosal et al. [3] to the case of hyperparameters α i of the Dirichlet distribution possibly greater than 1 , we obtain: Using condition (P) on Π z we have that Π z (I n ) e −c √ zn(log zn) ρz , replacing z n by its expression terminates the proof of Lemma 4.1 for the DP prior. For the mixture prior satisfying (P), we have exp (−C(N + N log n)) , which terminates the proof.
Fix δ 2 = ε/C for the constant C to be defined below, and δ 0 = ε/ √ 2z. LetÂ be the following set . In particular, for any z ∈ [z(1 + δ 2 ) , z(1 + δ 2 ) +1 ) for some ∈ {0, 1, . . . , L}, infẑ ∈Ẑ |ẑ/z − 1| ≤ δ 2 . LetŜ be an ε-net for S = {(p 1 , . . . , p J ) : The second term is less than ε by the definition of Q. The last term is bounded in the same way as in [13] by J j=1 |π j − π j | ≤ 2ε. To bound the third term, we first bound the L 1 distance between the two gamma densities using Lemma C.1: To bound the first term, we bound the Kullback-Leibler distance between the corresponding probability distributions: for some z c between z andẑ where ψ 1 (z) = (log Γ(z)) is the trigamma function. It is known that as z → 0, ψ 1 (z) = γ + z −2 + o(1), and as z → ∞, ψ 1 (z) = z −1 + 0.5z −2 + o(z −2 ) which implies that for both z large and small, which implies that we can bound the Kullback-Leibler distance as for an appropriate constant C. Therefore, the first term is bounded by by the definition of δ 2 . Now we study cardinality of setÂ. For each z ∈ [z,z] , due to δ 2 = Cε and by the definition of L. Therefore, combining all the inequalities together, we obtain that and henceQ is a 5ε-net of Q, with The second inequality is proved following the same route as in the proof of Proposition 2 in Shen et al. [13].
Applying case α = 1, for P(β, L(·), γ, C 0 , C 1 , e, ∆) to h(x) with k such that β ∈ (2k − 2, 2k], i.e. k = r 0 , we have that there exists d j ∈ R, j = 1, · · · , k such that the function Thus, we can definẽ and the first part of (2.13) holds with R(x) x α−1 R 0 (x). From the proof of case α = 1, it follows that R 0 (x) has terms proportional to (x 2 + 1)x β (1 + x γ )L h (x) with 1 ≤ ≤ r 0 and x j |h (j) (x)| for 1 ≤ j ≤ r. Therefore, the second part of (2.13) hold due to f ∈ P α (β, L(·), γ, C 0 , C 1 , e, ∆) and inequality g 2 (x)dµ ≤ [ g p (x)dµ] 2/p for p ≥ 2 for a probability measure µ. We now prove (2.15). 3. Proof of (2.15): general case We follow the same route as in Kruijer et al. [7]. We bound We first prove that and since for all j = 1, · · · , 2k A1(a) c which implies (A.4). We now bound the second term of the right hand side of (A.3). Now we consider the integral A1(a) c K zf (x)dx. Note that Using (C.5), we have that for any H > 0 by choosing M large enough since for the appropriate choice of M . We need only to study what happens if x ∈ A c 1 (a), ∈ A 1 (2a) and | /x−1| ≤ M log z/z. We assume that z is large enough so that M log z/z ≤ 1/2 and hence We boundh(x) from below using ∈ A 1 (2a) and and which implies that In particular, it implies that if a ≥ β/2 then x ∈ A 1 (a); so for x ∈ A c 1 (a), we must have a < β/2. Therefore, using (A.1) and taking a ∈ (β/4, β/2), we have

For large enough z and for any
Lemma B.1. Let e z = z −a , E z = z b and H be a probability distribution on [e z , E z ]. Then for all κ > 0, there exists N 0 > 0 and a probability distribution P with at mostN = N 0 √ z(log z) 3/2 supporting points such that : for all x ∈ [τ 0 e z , τ 1 E z ] with 0 < τ 0 < 1 < τ 1 < +∞ when z is large enough (B.1) Proof of Lemma B.1. The proof of the Lemma is based on the ideas of Ghosal and van der Vaart [5] combined with some ideas of Rousseau [10]. We use Gaussian approximation C.5. Set u = /x and consider |u − 1| ≤ M log z/z := δ z with M some arbitrarily large constant. Then writing h(u) = log u + 1/u − 1 as soon as N + 1 > N 0 log z with N 0 large enough, for some τ > 0. Choose r > 0, then a Taylor expansion of h(u) around 1 leads to where Q N,z ( , x) is a polynomial funtion of with degrees smaller than 2N and If | /x − 1| ∈ (δ z , δ) with δ > 0 arbitrarily small but fixed, h( /x) ≥ ( /x − 1) 2 /4 and Split [e z , E z ] into intervals in the form I j = [e z (1 + δ z /2) j , e z (1 + δ z /2) j+1 ], with j ≤ J z and Following Lemma A1 of Ghosal and van der Vaart [5], since the functions → , ≤ 2N are continuous over I j , there exists a probability P j,N with support included in I j with at most N + 1 points in the support such that for all ≤ 2N where H j = HI Ij /H[I j ]. Construct P N = j H(I j )P j,N , then P N has support [e z , E z ] and for all x, Let ∈ I j and x ∈ [e z (1 + δ z /2) j−1 , e z (1 + δ z /2) j+2 ], then x/ ≤ (1 + δ z /2) 2 1 + 3δ z when z is large enough and as soon as r/2 > a + κ as soon as M > 4(κ + a) + 2. Finally if | /x − 1| > δ, using Lemma C.2 for some c > 0. This implies that for all x ∈ R, where P N has at most N 0 √ z(log z) 3/2 supporting points in [e z , E z ], with N 0 depending on κ, a, b.
The following Lemma allows us to control the Kullback-Leibler divergence between f and K z * P .
Note that it appears from the proof of Lemma B.2, that P N can be chosen so that for all 1 ≤ ≤ J z where J z is such that e z (1 Since We also have that It implies that For an arbitrary κ > 0, which we will choose later, consider the discrete distribution P N constructed in Lemma B.1, which we write asP N = Jz j=1 Note that for e z = z −a and E z = z b , J z ≤ (b + a)M −1 √ z log z. This implies that For any distribution P with support [e z , E z ], by Lemma C.2 with u = /x > 1 + δ, δ = 1 and c 1 = c(δ): dP ( ) Similarly, applying Lemma C.2 with u = /x < 1 − δ, δ = 1/2 and c 0.5 = c(δ), Hence choosing κ ≥ 2β + b implies that Let A > 0 and construct the grid (u ) : Nj i=1 p j,i δ uj,i be the probability on R + with supporting points u j,i where u j,i is the closest point to j,i on the grid (u , ≤ L). If there are multiple u i,j then we collapse the probabilities and without loss of generality we can assume that the u i,j are all distinct. Define p j,i g z, j,i (x) exp(z −A+1 (1 + 2E z /e z )) for large enough z, which implies K z * P N (x) ≤ K z * P N (x)(2 + Cz a+b+1−A ), ∀x ∈ [e z /2, 2E z ]. (B.6) Finally where the last inequality comes from Lemma C.1. By choosing A > 1/2 + β, Lemma B.2 is proved by re-indexing p i,j as p l and u i,j as u l , l ≤ N .

B.2. Kullback-Leibler neighbourhoods
In the following Lemma we describe Kullback-Leibler neighbourhoods of f of size 2 n . Lemma B.3. Assume that f ∈ P α (β, L, γ, C 0 , C 1 , e, ∆), and that there exists ρ 1 > 0 such that Then, if A is large enough, for all z large enough and all P ∈ P z , KL(f, K z * P ) ≤ z −β (log z); V (f, K z * P ) ≤ z −β (log z) 2 .