Posterior rates of convergence for Dirichlet mixtures of exponential power densities

Abstract: A Dirichlet mixture of exponential power distributions, used as a prior on densities supported on the real line in the problem of Bayesian density estimation, is a natural generalization of a Dirichlet mixture of normals, which has been shown to possess good frequentist asymptotic properties in terms of posterior consistency and rates of convergence. In this article, we establish upper bounds on the rates of convergence for the posterior distribution of a Dirichlet mixture of exponential power densities, assuming that the true density has the same form as the model. When the kernel is analytic and the mixing distribution has either compact support or sub-exponential tails, we obtain a nearly parametric rate, up to a logarithmic factor whose exponent depends on the tail behaviour of the base measure of the Dirichlet process and on the exponential decay rate at zero of the prior for the scale parameter. The result covers the important special case where the true density is a location mixture of normals and shows that a nearly parametric rate arises also when the prior on the scale contains zero in its support, provided it has a sufficiently fast decay rate at zero. This improves on some recent results on density estimation with Dirichlet mixtures of normals by allowing the inverse-gamma distribution, a commonly used prior on the square of the bandwidth. When the kernel is not infinitely differentiable at zero, as may happen depending on the shape parameter, the posterior distribution is shown to concentrate around the sampling density at a slower rate.


Introduction
Mixtures of probability distributions naturally arise in some contexts as models for observations of phenomena with multiple underlying factors. In a Bayesian framework, mixture models provide convenient constructs for density estimation because a prior can be induced on a space of densities by putting a prior on the mixing distribution of a given collection of kernels. This approach, which combines the conceptual simplicity of the scheme with the flexibility afforded by the wide range of possible kernel choices depending on the sample space, was initiated by Ferguson [5] and Lo [17], who used a Dirichlet process prior on the mixing distribution and derived the expressions for the posterior and the Bayes density estimator (or predictive distribution). The difficulty in defining a prior on a set of densities directly through a Dirichlet process lies in the fact that this process selects discrete distributions. A typical choice for the kernel when the sample space is the real line is the normal density, but an exponential power density can be used more generally. The motivation for considering mixtures of distributions other than the normal lies in the fact that the empirical distributions of many phenomena fail to conform to a normal. Exponential power distributions may represent a reasonable alternative when the discrepancy is in the tails. For example, the normal-Laplace distribution, which results from the convolution of independent normal and Laplace components, behaves like the normal in the middle of its range and like the Laplace in the tails. The interest in this distribution is due to its role in describing the stopped state of a Brownian motion whose starting value is normally distributed and whose stopping hazard rate is constant. Its use in the study of high-frequency price data is pointed out in Reed [19].
The normal-Laplace distribution makes an interesting case for considering mixtures of exponential power distributions because, as it will be shown, available results on normal mixtures do not cover the case where the mixing distribution has fatter tails than those of the kernel.
An EP distribution is characterized by three parameters: the location parameter θ = E[X], the scale parameter σ ≡ σ_p = {E[|X − θ|^p]}^{1/p} and the shape parameter (or exponent) p, which determines the thickness of the tails. In this parametrization, the EP density is ψ_{σ, p}(x − θ) = exp{−|x − θ|^p/(pσ^p)}/(2p^{1/p}Γ(1 + 1/p)σ), x ∈ R. (1.1) Note that Var(X) = σ² p^{2/p} Γ(3/p)/Γ(1/p). For p = 2, the expression in (1.1) reduces to the density of a normal, ψ_{σ, 2}(x − θ) = σ^{−1}φ((x − θ)/σ), where φ(·) stands for the standard normal density. As the shape parameter varies, the EP density describes platykurtic distributions for p > 2, namely, distributions with lighter tails than those of a normal, and leptokurtic distributions for 0 < p < 2. In particular, for p = 1, the Laplace or double-exponential density is obtained. EP distributions with 0 < p < 1 are called fractional distributions: they have super-Laplace tails, i.e., tails heavier than those of the Laplace density. As p → 0⁺, the limiting distribution is degenerate at θ; as p → ∞, the limiting distribution is uniform over the interval (θ − σ, θ + σ]. The derivation of EP distributions in the above form can be attributed to, among others, Lunetta [18] and Vianelli [26], but EP distributions were first obtained, with a different parametrization, by Subbotin [24] as a generalization of the Gaussian distribution for modelling the distribution of random errors. For a review of the main properties of Subbotin's EP distributions see, e.g., Johnson, Kotz and Balakrishnan [12], Chap. 24. EP distributions have been considered as an alternative to the normal distribution in statistical modelling and used to study Bayesian robustness, see, e.g., Choy and Smith [3]. West [30] showed that EP distributions with p ∈ [1, 2] admit a scale mixture of normals representation, the normal (p = 2) being a degenerate mixture, while Walker and Gutiérrez-Peña [28] proved that EP distributions with p ≥ 1 admit a scale mixture of uniforms representation: X|(U = u) ∼ Uniform(θ − σp^{1/p}u^{1/p}/√2, θ + σp^{1/p}u^{1/p}/√2) with U ∼ Gamma(1 + 1/p, 2^{−p/2}).
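As a numerical sanity check (not part of the original text), the parametrization above can be coded directly; the function name and all parameter values below are illustrative.

```python
import numpy as np
from math import gamma

def ep_density(x, theta=0.0, sigma=1.0, p=2.0):
    """EP density ψ_{σ,p}(x − θ) in the parametrization E|X − θ|^p = σ^p."""
    norm = 2.0 * p**(1.0 / p) * gamma(1.0 + 1.0 / p) * sigma
    return np.exp(-np.abs(x - theta)**p / (p * sigma**p)) / norm

# p = 2 recovers the normal density with mean θ and variance σ²
x = np.linspace(-6.0, 6.0, 2001)
assert np.allclose(ep_density(x, p=2.0),
                   np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi))

# variance identity Var(X) = σ² p^{2/p} Γ(3/p)/Γ(1/p), checked at p = 1 (Laplace)
p, sigma = 1.0, 1.5
xs = np.linspace(-60.0, 60.0, 200001)
dx = xs[1] - xs[0]
var_num = float(np.sum(xs**2 * ep_density(xs, sigma=sigma, p=p)) * dx)
var_formula = sigma**2 * p**(2.0 / p) * gamma(3.0 / p) / gamma(1.0 / p)
assert abs(var_num - var_formula) < 1e-3
```

For p = 1 the formula gives Var(X) = 2σ², the familiar Laplace variance, which the quadrature reproduces.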
In this article, we consider location mixtures of EP distributions. The scale parameter is assumed to be distributed independently of the mixing measure. Thus, a prior on a set of densities is induced by the product measure of the prior for the mixing distribution and the prior for the scale. In what follows, we shall mainly focus on the case where the standard choice of a Dirichlet process prior for the mixing distribution is considered. We are interested in establishing whether and, if so, how fast the posterior distribution concentrates its mass around the true density, as the amount of data increases. Weak and strong posterior consistency, as well as rates of convergence, for Dirichlet location and location-scale mixtures of normals have been studied in the recent literature. Ghosal, Ghosh and Ramamoorthi [7] showed that, under general conditions, a Dirichlet location mixture prior with Gaussian kernel gives rise to a consistent posterior in the total variation distance. Lijoi, Prünster and Walker [16] weakened the condition on the tails of the base measure of the Dirichlet process from an exponential to a power decay law. Tokdar [25] proved strong consistency for Dirichlet location-scale mixtures of normals and weakened a moment condition on the true density (Theorem 3.3, page 96). Ghosal and van der Vaart [8] obtained (upper bounds on the) rates of convergence of the posteriors for Dirichlet location and location-scale normal mixture priors in the case where the true density is itself a location or a location-scale mixture of normals, the mixing distribution having either compact support or sub-Gaussian tails, the scale being restricted to a known bounded interval. They showed that, under regularity conditions on the prior, the posterior converges at a nearly parametric rate, up to a logarithmic term whose exponent depends on the tail behaviour of the base measure. 
For Dirichlet location mixtures of normals, assuming the same set-up, Walker, Lijoi and Prünster [29] slightly improved the power of the logarithmic term. This fast rate seems to be due to a combination of factors: the form of the true density, which is exactly of the type selected by the prior, together with the infinite degree of smoothness of the Gaussian kernel and the fact that a mixture of normals, with mixing distribution having exponentially decaying tails, can be approximated by a finite mixture of normals with a restricted number of components. This yields a high prior concentration rate and, when the scale is forced to stay bounded away from zero and infinity, a small entropy number, leading to a nearly parametric rate. It may be asked whether the assumption on the scale parameter plays a crucial role in determining the posterior rate or may be removed without affecting the rate, except possibly for a logarithmic factor. If the prior on the scale has full support on (0, ∞), one has to consider a richer approximating class of densities, a sieve, with scale bounded below and above by sequences respectively decreasing to zero and increasing to infinity at proper rates. Employing a fully supported prior for the scale calls for regularity conditions on its tails, the requirement on the decay rate at zero being expectedly more restrictive than that at infinity, because the most important values are those included in a neighborhood of zero. In fact, as the bandwidth tends to zero, normal mixtures approximate any density and the complexity of the sieve increases.
The focus of the article is on rates of convergence for posterior distributions of Dirichlet EP mixtures. We begin by showing that a finite EP mixture can be estimated at a nearly parametric rate, whatever the value of the kernel shape parameter, using a finite EP mixture prior. Then we consider true densities that may be infinite EP mixtures. First, the case where the kernel is analytic and the mixing distribution has either compact support or sub-exponential tails is considered. No constraint is assumed on the scale. We show that a nearly parametric rate arises also when the prior on the scale has full support on (0, ∞). The main result states that, under regularity conditions on the tails of the base measure of the Dirichlet process and the prior for the scale, the posterior of a Dirichlet mixture of EP distributions converges at the target rate. The conditions we impose on the tails of the prior for the scale are satisfied by the common choice of an inverse-gamma prior on σ². This result covers the important special case of Dirichlet normal mixtures, complementing the findings of Ghosal and van der Vaart [8], extending those of Walker, Lijoi and Prünster [29] and improving on those of Scricciolo [21], where a slower rate, heavily depending on the tail decay rate at zero of the prior for the bandwidth, was obtained. We also consider the case where the kernel is not infinitely differentiable at zero. A slower rate is obtained even assuming the scale to lie in a compact interval.
The article is organized as follows. Section 2 describes the set-up and introduces some notation. Section 3 analyzes the case of finite mixtures. Section 4 presents the main results on infinite mixtures. It is split into two subsections: Subsection 4.1 deals with the case where the kernel is analytic, Subsection 4.2 investigates the case of kernels that are not infinitely differentiable at zero. Section 5 is devoted to final remarks and discussion. Auxiliary results invoked in the proofs of the theorems are reported in the Appendix.

Preliminaries
Suppose we have observations X_1, ..., X_n from an unknown density f_{F, σ} on R which is a location mixture of EP distributions, where F denotes the mixing distribution. In what follows, F will also be used to indicate the corresponding probability measure. Note that f_{F, σ} is just the convolution of ψ_{σ, p} and F, f_{F, σ}(x) = (F * ψ_{σ, p})(x) = ∫ ψ_{σ, p}(x − θ) dF(θ), x ∈ R. The scale parameter σ is assumed to be distributed independently of F according to some distribution G on (0, ∞). Let π denote the overall prior on M(R) × (0, ∞), where M(R) is the set of all probability measures on R. The prior π induces a prior on the class of densities F := {f_{F, σ}: (F, σ) ∈ M(R) × (0, ∞)} via the mapping (F, σ) → f_{F, σ}. We shall use the same symbol π to denote either measure, the correct interpretation being clear from the context. We assume that F is equipped with a metric d, which may be either the Hellinger metric, d_H(f, g) := {∫ (√f − √g)² dλ}^{1/2}, where λ is the Lebesgue measure on R, or the one induced by the L¹-norm, ‖f − g‖₁ := ∫ |f − g| dλ. We are interested in assessing the rate of convergence for the posterior distribution corresponding to a prior π, under the non-Bayesian assumption that X_1, ..., X_n are i.i.d. observations from a density f_0. In the following, it will be explicitly stated whether f_0 is itself a location mixture of EP distributions, i.e., f_0 ≡ f_{F_0, σ_0}, where F_0 and σ_0 denote the true values of F and σ, respectively. A sequence ε_n → 0, as n → ∞, is said to be an upper bound on the posterior rate of convergence relative to a metric d on F if, for some constant M > 0, the posterior probability π(f: d(f, f_0) > Mε_n | X_1, ..., X_n) → 0, either P_0^∞-almost surely or in P_0^n-probability, where P_0 stands for the probability measure corresponding to f_0. In order to derive the rates of convergence for Dirichlet mixtures of EP distributions, we shall appeal to a theorem of Ghosal and van der Vaart [8], page 1239, reported as Theorem A.1 in the Appendix.
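The Hellinger distance in this (unsquared) convention is easy to approximate on a grid; the following sketch, with illustrative choices, checks a quadrature against the closed form for two unit-variance normals.

```python
import numpy as np

def hellinger(f_vals, g_vals, dx):
    """d_H(f, g) = {∫ (√f − √g)² dλ}^{1/2}, approximated by a Riemann sum."""
    return float(np.sqrt(np.sum((np.sqrt(f_vals) - np.sqrt(g_vals))**2) * dx))

x = np.linspace(-12.0, 12.0, 100001)
dx = x[1] - x[0]
phi = lambda x, mu: np.exp(-(x - mu)**2 / 2.0) / np.sqrt(2.0 * np.pi)

mu = 1.0
d_num = hellinger(phi(x, 0.0), phi(x, mu), dx)
# closed form for two unit-variance normals: d_H² = 2 − 2 exp(−(μ₁ − μ₂)²/8)
d_exact = np.sqrt(2.0 - 2.0 * np.exp(-mu**2 / 8.0))
assert abs(d_num - d_exact) < 1e-5
```

The closed form follows from the Bhattacharyya affinity ∫ √(fg) dλ = exp(−(μ₁ − μ₂)²/8) for unit-variance normals.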
Condition (A.1) involves the packing number of a class of densities, say F′, equipped with a (semi-)metric d, denoted by D(ε, F′, d), which is defined as the maximum number of points in F′ such that the distance between each pair is at least ε. This number is related to the ε-covering number N(ε, F′, d), the minimum number of balls of radius ε needed to cover F′, by the inequalities N(ε, F′, d) ≤ D(ε, F′, d) ≤ N(ε/2, F′, d). (2.1) The logarithm of the packing or covering number is referred to as the (metric) entropy. The symbols "≲" and "≳" will be used throughout to indicate inequalities valid up to constants that may be universal or depend on P_0. Fixed constants within the present set-up are inessential to our purposes.
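A small brute-force illustration of the chain (2.1), not from the paper, on a finite metric space (the point cloud, seed and radii are all illustrative):

```python
from itertools import combinations
import random

random.seed(3)
pts = [(random.random(), random.random()) for _ in range(8)]
dist = lambda a, b: ((a[0] - b[0])**2 + (a[1] - b[1])**2) ** 0.5

def packing_number(eps):
    """D(ε): size of the largest ε-separated subset (exact, brute force)."""
    best = 1
    for k in range(2, len(pts) + 1):
        if any(all(dist(a, b) >= eps for a, b in combinations(sub, 2))
               for sub in combinations(pts, k)):
            best = k
    return best

def covering_number(eps):
    """N(ε): fewest closed ε-balls centred at the points covering all of them."""
    for k in range(1, len(pts) + 1):
        for centres in combinations(pts, k):
            if all(any(dist(p, c) <= eps for c in centres) for p in pts):
                return k

for eps in (0.15, 0.3, 0.5):
    assert covering_number(eps) <= packing_number(eps) <= packing_number(eps)
    assert covering_number(eps) <= packing_number(eps) <= covering_number(eps / 2)
```

A maximal ε-separated set is automatically an ε-cover (giving the left inequality), while the pigeonhole principle over ε/2-balls gives the right one.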

Finite mixtures
The analysis of data generated by mixture models may reasonably begin from the case where we have observations of phenomena with multiple underlying factors whose cardinality is finite. If the responsible factors are known to be finitely many, then a discrete EP mixture prior can be conveniently used to estimate the sampling density. We first consider the case where all the components of the mixture have the same value of the shape parameter. The generic density takes the form f_{F, σ}(x) = Σ_{j=1}^k w_j ψ_{σ, p}(x − θ_j), with F a discrete distribution having a finite number of atoms, which can be represented as F = Σ_{j=1}^k w_j δ_{θ_j}, where δ_{θ_j} denotes a point mass at θ_j. The number of mixture components is a random variable (r.v.) K with probability mass function ρ(·) on the positive integers. Given K = k, the vector w_k := (w_1, ..., w_k) of mixing weights has conditional distribution π_k on the (k − 1)-dimensional simplex ∆_k := {w_k ∈ R^k: 0 ≤ w_j ≤ 1, j = 1, ..., k, Σ_{j=1}^k w_j = 1} and the locations θ_1, ..., θ_k are independent r.v.'s with common distribution Π. We shall also write θ_k for (θ_1, ..., θ_k) to ease the notation. The hierarchical model can thus be described: • K ∼ ρ and σ ∼ G independently; • given (k, σ), the r.v.'s θ_1, ..., θ_k are i.i.d. with common distribution Π and, independently, w_k ∼ π_k; • given (k, σ, w_k, θ_k), the observations X_1, ..., X_n are conditionally i.i.d. with density f_{F, σ}.
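The hierarchical model above can be simulated directly. The sketch below is illustrative only: the hyperpriors (geometric ρ, inverse-gamma on σ², symmetric Dirichlet weights, normal Π) are choices of convenience, not prescribed by the paper.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)

def ep_density(x, theta, sigma, p):
    """EP kernel ψ_{σ,p}(x − θ) with E|X − θ|^p = σ^p."""
    norm = 2.0 * p**(1.0 / p) * gamma(1.0 + 1.0 / p) * sigma
    return np.exp(-np.abs(x - theta)**p / (p * sigma**p)) / norm

def draw_prior_density(p=2.0, b=1.0, a_dir=1.0):
    """One draw f_{F,σ} from the finite-mixture hierarchical prior (illustrative)."""
    k = int(rng.geometric(1.0 - np.exp(-b)))           # ρ(k) ∝ e^{-bk}
    sigma = float(np.sqrt(1.0 / rng.gamma(2.0, 1.0)))  # σ² ~ inverse-gamma(2, 1)
    w = rng.dirichlet(np.full(k, a_dir))               # w_k ~ Dir(a, ..., a)
    theta = rng.normal(0.0, 2.0, size=k)               # θ_j i.i.d. Π, here normal
    return lambda x: sum(wj * ep_density(x, tj, sigma, p)
                         for wj, tj in zip(w, theta))

f = draw_prior_density()
x = np.linspace(-25.0, 25.0, 50001)
mass = float(np.sum(f(x)) * (x[1] - x[0]))
assert abs(mass - 1.0) < 1e-2   # a prior draw is a genuine density
```

Each draw is a random finite EP mixture; the final check verifies that it integrates to one on a wide grid.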
We shall show that, if the true density is itself a finite mixture of EP densities with the same shape, then, under regularity conditions on the prior, the posterior converges at the nearly parametric rate n^{−1/2} log n. This rate agrees with those found for other discrete kernel mixture priors, when the true density is of the same form as the assumed model, see, for instance, Ghosal [6] for Bernstein polynomials, Kruijer and van der Vaart [15] for beta mixtures, Scricciolo [22] for histograms and polygons. The proof exploits the approximation properties of the mixtures under consideration to find a good fitting distribution of a given density in a proper subclass. In fact, any finite EP mixture can be approximated arbitrarily closely (in the distance induced by the L¹-norm) by mixtures having exactly the same number of components, with locations and mixing weights taking values in suitable neighborhoods of the corresponding true elements. This technique is used to provide an estimate of the prior concentration rate as well as an upper bound on the metric entropy of a sieve set. In the first case, the number of mixture components is constant, which leads to a high prior concentration rate; in the second case, it can be chosen to increase quite slowly, at a logarithmic rate, as the approximation error goes to zero. In the metric entropy estimate, we shall also need the bandwidth to approach zero at a logarithmic rate; thus, a suitable condition on the decay rate at zero of the prior G for σ is required. We shall use the following assumption.
(A) G has a continuous and positive Lebesgue density g on an interval containing σ_0 and, for constants d > 0, γ, ϖ ∈ (0, ∞], satisfies G(s) ≲ e^{−ds^{−γ}} as s → 0 and 1 − G(s) ≲ s^{−ϖ} as s → ∞. (3.1)
The common choice of an inverse-gamma prior on σ² when the kernel is Gaussian meets requirement (A). Let σ² ∼ IG(α, β), with parameters α, β > 0. The condition on the tail behaviour at zero is satisfied with γ = 2: in fact, for some constant d > 0, G(s) = P(σ² ≤ s²) ≲ e^{−ds^{−2}} as s → 0. The condition on the tail behaviour at infinity is satisfied with ϖ = 2α, since 1 − G(s) = P(σ² > s²) ≲ s^{−2α} as s → ∞. The next theorem states the first result of the section.
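For the integer case α = 2 the distribution function of σ has a closed form, which makes the two tail conditions in (3.1) easy to verify numerically; the hyperparameter values and the constant d = β/2 below are illustrative.

```python
from math import exp

beta = 1.0      # σ² ~ IG(α, β) with α = 2; values are illustrative
alpha = 2

def G(s):
    """P(σ ≤ s): since 1/σ² ~ Gamma(2, rate β), G(s) = e^{-β/s²}(1 + β/s²)."""
    x = beta / (s * s)
    return exp(-x) * (1.0 + x)

# tail condition at zero, (3.1) with γ = 2 (take d = β/2 < β)
for s in (0.25, 0.35, 0.5):
    assert G(s) <= exp(-0.5 * beta / s**2)

# tail condition at infinity, (3.1) with ϖ = 2α = 4
for s in (2.0, 5.0, 10.0):
    assert 1.0 - G(s) <= s**(-2 * alpha)
```

The closed form comes from P(Gamma(2, β) ≥ x) = e^{−βx}(1 + βx) applied to 1/σ² at x = 1/s².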
Theorem 3.1. Let p > 0 be fixed. Suppose that f_0 = F_0 * ψ_{σ_0, p} is a finite EP mixture. Assume that (i) the prior G for σ satisfies condition (A); (ii) the prior ρ for the number of components is such that, for constants b, B > 0, 0 < ρ(k) ≤ Be^{−bk} for all k ∈ N; (iii) for each k ∈ N, the prior π_k for the mixing weights is a Dirichlet distribution Dir(α_1, ..., α_k), with parameters α_1, ..., α_k such that, for constants a, A > 0, D ≥ 1 and for 0 < ε ≤ 1/(Dk), Aε^a ≤ α_j ≤ D, j = 1, ..., k; (iv) the common prior Π for the locations has a continuous and positive density on an interval containing the support of F_0 and, for constants c, ϑ > 0, satisfies Π({θ: |θ| > t}) ≲ e^{−ct^ϑ} for large t > 0. Then, the posterior rate of convergence relative to d_H is ε_n = n^{−1/2} log n.
Proof. We appeal to Theorem A.1 and show that conditions (A.1)-(A.3) are satisfied with the sequences ε̄_n = n^{−1/2} log n and ε̃_n = n^{−1/2}(log n)^{1/2}, so that (an upper bound on) the rate is given by ε_n := (ε̄_n ∨ ε̃_n) = ε̄_n. For sequences k_n of positive integers and a_n, s_n, t_n of positive real numbers specified below, F_n := F_{k_n, a_n, s_n, t_n} is the class of EP mixtures with at most k_n components defined as F_{k_n, a_n, s_n, t_n} := {f_{F, σ}: F = Σ_{j=1}^k w_j δ_{θ_j}, k ≤ k_n, |θ_j| ≤ a_n, j = 1, ..., k, s_n ≤ σ ≤ t_n}. Along the lines of Lemma 3 of Ghosal and van der Vaart [10], pages 705-707, it can be shown that, for t_n such that log t_n = O(log n), log D(ε̄_n, F_n, d_H) ≲ k_n log (a_n/(s_n ε̄_n)).
Remark 3.1. The condition on the parameters of the Dirichlet distribution contained in assumption (iii) is satisfied if, for example, a noninformative specification is considered, as when all the α_j's are taken to be equal to some constant D ≥ 1.
Now, we consider the case where the shape parameter is allowed to mix both in the prior and in the true density. The generic density has the form f_{F, p_k, σ}(x) = Σ_{j=1}^k w_j ψ_{σ, p_j}(x − θ_j), where p_k := (p_1, ..., p_k) is the vector of shape parameters, which are assumed to be independent r.v.'s with common distribution H. The hierarchical model can be described as follows: • K ∼ ρ and σ ∼ G independently; • given (k, σ), the r.v.'s θ_1, ..., θ_k are i.i.d. Π, the weights w_k ∼ π_k and the shapes p_1, ..., p_k are i.i.d. ∼ H independently; • given (k, σ, w_k, θ_k, p_k), the r.v.'s X_1, ..., X_n are conditionally independent and identically distributed with density f_{F, p_k, σ}.
Theorem 3.2. Suppose that f_0 = Σ_{j=1}^{k_0} w_j^0 ψ_{σ_0, p_j^0}(· − θ_j^0) is a finite EP mixture of the form (3.2), with p_j^0 > 0 for all j = 1, ..., k_0. Assume that conditions (i)-(iv) of Theorem 3.1 are satisfied and (v) the common prior H for the shape parameters has a continuous and positive Lebesgue density h on (0, ∞) and, for constants c_1 > 0 and β, υ ∈ (0, ∞], satisfies H(s) ≲ e^{−c_1 s^{−β}} as s → 0 and 1 − H(s) ≲ s^{−υ} as s → ∞. (3.3) Then, the posterior rate of convergence relative to d_H is ε_n = n^{−1/2} log n.
Proof. The main steps in the proof of Theorem 3.1 go through in this case. Conditions (A.1)-(A.3) are satisfied with the same sequences ε̄_n = n^{−1/2} log n and ε̃_n = n^{−1/2}(log n)^{1/2}. We begin by considering condition (A.1). For sequences k_n, a_n, s_n, t_n specified as in Theorem 3.1, and p_n, q_n of positive real numbers to be chosen below, let F_n be the class of mixtures with at most k_n components defined as F_n := {f_{F, p_k, σ}: F = Σ_{j=1}^k w_j δ_{θ_j}, k ≤ k_n, |θ_j| ≤ a_n, j = 1, ..., k, p_j ∈ [p_n, q_n], s_n ≤ σ ≤ t_n}. Similarly to Theorem 3.1 and using Lemma A.4, it can be proved that, for large enough n, log D(ε̄_n, F_n, d_H) ≲ k_n log (a_n q_n t_n/(p_n s_n ε̄_n)) + 1/p_n.
Hence condition (A.1) is satisfied. Using assumptions (i), (ii), (iv) and (v), the prior concentration condition is seen to be satisfied with ε̃_n = n^{−1/2}(log n)^{1/2}. In fact, for any σ > 0, any p_{k_0} := (p_1, ..., p_{k_0}) with positive p_j's, and any discrete distribution F on R with k_0 support points θ_{k_0} := (θ_1, ..., θ_{k_0}) ∈ R^{k_0} and weights w_{k_0} := (w_1, ..., w_{k_0}) ∈ ∆_{k_0}, the divergence between f_0 and f_{F, p_{k_0}, σ} can be controlled by inequalities (A.6), (A.7) and Lemma A.4, where min_{1≤j≤k_0} p_j^0 > 0 by assumption and min_{1≤j≤k_0} c_{p_j^0} > 0 because, for any p > 0, c_p is positive, with lim_{p→0⁺} c_p = 0 and lim_{p→∞} c_p = 2. For each j = 1, ..., k_0, z_j^* is a point lying between 1/(p_j ∨ p_j^0) and 1/(p_j ∧ p_j^0). Since the relevant norm evaluated at z_j^* does not depend on θ_j, the symbol θ_j may be suppressed. For any density f_{F, p_{k_0}, σ} in the resulting set, by Theorem 5 of Wong and Shen [31], pages 357-358, for a suitable constant c_1 > 0, the set is contained in B_KL(f_0; c_1 ε(log ε^{−1})²). Using arguments similar to those of Theorem 3.1 and the fact that, for small enough ε > 0, condition (v) applies, we conclude that, for a suitable constant c_2 > 0 (possibly depending on f_0), π(B_KL(f_0; ε̃_n²)) ≳ exp{−c_2 nε̃_n²} for sufficiently large n.
Remark 3.2. An example of a distribution with density of the form (3.2) is given by the Laplace-normal (or Gauss-Laplace) mixture with equal locations, which was used by Kanji [14] and Jones and McLachlan [13] to fit wind-shear data. Haas, Mittnik and Paolella [11] used the Gauss-Laplace mixture for modelling and predicting financial risk.
Remark 3.3. Suppose the investigator wants to assign positive prior probability to EP densities having tails equal to or lighter than those of a Laplace. A prior distribution for p on [1, ∞) can be specified by considering a symmetric beta distribution Beta(a, a) for 1/p, as in Box and Tiao [2], page 167, with a ≥ 1.
Thus, 1/p has a symmetric distribution around the normal-theory value 1/2. Deviations from normality can be taken into account by "adjusting" the value of the parameter a. When a = 1, the prior distribution is uniform over (0, 1]. When a > 1, the prior distribution is symmetric, with mode at 1/2, and assigns high probability to EP distributions in a neighborhood of the normal. As a → ∞, the prior density becomes more and more peaked at 1/2 and converges to a delta function, representing an assumption of exact normality. Since the prior distribution for p corresponding to a beta for 1/p is left-truncated at 1, the lower tail condition in (3.3) is trivially satisfied with β = ∞, while the upper tail condition is satisfied with υ = a because 1 − H(t) = P(p > t) = P(1/p < 1/t) ≲ t^{−a} for large t. Remark 3.4. As a consequence of Theorem 3.1 or Theorem 3.2, the Bayes estimator f̂_n(·) := ∫ f(·) π(df | X_1, ..., X_n) converges to f_0 in the Hellinger distance, in P_0^n-probability, at a rate at least as fast as n^{−1/2} log n, see, e.g., Theorem 5 of Shen and Wasserman [23], page 694.
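The tail behaviour of this prior on p is easy to see by simulation; the hyperparameter a, the threshold t and the seed below are illustrative, and the leading-order constant comes from the Beta(a, a) CDF near zero.

```python
import random
from math import gamma

random.seed(0)
a = 3.0                                   # illustrative Box-Tiao hyperparameter
draws = [1.0 / random.betavariate(a, a) for _ in range(200000)]

# the prior charges only p ≥ 1: EP kernels with Laplace-or-lighter tails
assert min(draws) >= 1.0

# upper-tail behaviour with υ = a: P(p > t) = P(1/p < 1/t) ≈ t^{-a} Γ(2a)/(a Γ(a)²)
t = 10.0
tail = sum(d > t for d in draws) / len(draws)
leading = t**(-a) * gamma(2 * a) / (a * gamma(a)**2)
assert 0.5 * leading < tail < 1.5 * leading
```

The polynomial tail t^{−a} is what makes the upper tail condition in (3.3) hold with υ = a, while left-truncation at p = 1 disposes of the lower tail condition.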

Infinite mixtures
In this section, we analyze the case where the underlying factors may be infinitely many. We consider densities of the form f_{F, σ} = F * ψ_{σ, p}, where F can be any distribution on R. As a prior for F, we adopt a Dirichlet process D_α with base measure α. We recall that a Dirichlet process on a measurable space (X, A), with a finite and positive base measure α on (X, A), is a random probability measure F on (X, A) such that, for every finite partition (A_1, ..., A_k) of X, the vector of probabilities (F(A_1), ..., F(A_k)) has a Dirichlet distribution with parameters (α(A_1), ..., α(A_k)). Observations from a Dirichlet process mixture of EP distributions can be structurally described as follows: • F ∼ D_α and σ ∼ G independently; • given (F, σ), the observations are i.i.d. with density f_{F, σ}. Let π = D_α × G denote the prior distribution for (F, σ), with the proviso that the symbol π will also be used for the prior induced on F by the mapping (F, σ) → f_{F, σ}, the ambiguity being resolved by the context. We recall that π is said to be strongly consistent at f_0 if, for every ε > 0, π(B(f_0; ε)|X_1, ..., X_n) → 1 almost surely, where B(f_0; ε) := {f: d(f, f_0) < ε} and d can be either the Hellinger or the L¹-metric. Ascertainment of posterior consistency for Dirichlet EP mixture priors can proceed as for Dirichlet normal mixture priors, for which sufficient conditions were derived by Ghosal, Ghosh and Ramamoorthi [7] in Theorem 7, page 152, using the sieve approach. Lijoi, Prünster and Walker [16], Theorem 1, page 1293, weakened their conditions adopting an alternative approach due to Walker [27]. In both approaches, the main idea behind the proof of consistency is to show that the prior satisfies Schwartz's [20] condition on the positivity of the prior probability of Kullback-Leibler neighborhoods of f_0, known as the Kullback-Leibler property and indicated by the notation f_0 ∈ KL(π). The following proposition provides sufficient conditions for f_0 to be in the Kullback-Leibler support of π.
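The structural description above (in particular the discreteness of F noted in the Introduction) can be sketched via Sethuraman's stick-breaking representation of the Dirichlet process; the base measure, total mass and truncation rule below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_dp(total_mass=1.0, tol=1e-10):
    """Draw F = Σ_j w_j δ_{θ_j} from D_α via stick breaking, truncated once the
    leftover stick length falls below tol. The base measure is taken to be
    total_mass × N(0, 1) purely for illustration."""
    weights, leftover = [], 1.0
    while leftover > tol:
        v = rng.beta(1.0, total_mass)    # stick proportions V_j ~ Beta(1, |α|)
        weights.append(leftover * v)
        leftover *= 1.0 - v
    w = np.array(weights)
    theta = rng.normal(0.0, 1.0, size=w.size)   # atom locations i.i.d. from ᾱ
    return w, theta

w, theta = stick_breaking_dp()
# F is (almost surely) discrete: countably many atoms with weights summing to one
assert np.all(w > 0) and abs(float(w.sum()) - 1.0) < 1e-8
```

Convolving such a draw with the kernel ψ_{σ, p} then yields a draw f_{F, σ} from the induced prior on densities.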
Proposition 4.1. Assume that π = D α × G, with the base measure α supported on R and the prior G supported on (0, ∞). Let f 0 be a continuous density on R satisfying the following conditions: Then, f 0 ∈ KL(π).
Proof. First, note that π = D α × G is a Type II mixture prior or a Prior 2, in the terminology of Wu and Ghosal [32], page 299. Thus, we can appeal to their Theorem 3, page 310. Once observed that the weak support of D α is M(R), the proof parallels that of Theorem 6, page 312, which deals with the case where p = 1, namely, the kernel is a Laplace density. Note that conditions (a)-(c) coincide with conditions B4-B6 of Theorem 2, pages 305-309. See also Wu and Ghosal [33]. It is, therefore, sufficient to check that conditions B3 and B7 are satisfied. We begin with condition B3. Since, for x = 0, and the condition is satisfied. As for condition B7, we need to check that, for any a ∈ R and b > 0, and, for some η 0 > 0, The linearity of log ψ 1, p (x) as a function of |x| p , together with assumption (d), imply that the above integrals are finite.

Analytic kernels
In this subsection, we consider the case where the true density f_0 = F_0 * ψ_{σ_0, p} is a location mixture of EP densities with shape parameter p an even integer. If p = 2m, m ∈ N, then, for any value σ_0 of the scale, the kernel ψ_{σ_0, p} is an analytic function: it is infinitely differentiable on R and, in particular, at the origin. We establish that, if the true mixing distribution F_0 has either compact support or sub-exponential tails, then the sequence of posterior distributions (weakly) converges to a point mass at the true probability measure P_0 at a nearly parametric rate, up to a logarithmic factor. The key idea of the proof is the following. A density f_0 of the stated form can be uniformly approximated by a finite mixture of EP distributions with a number of components that increases at a logarithmic rate as the approximation error goes to zero. Because of the analyticity of the kernel, such a finitely supported mixing distribution can be found by matching a relatively small number of moments of F_0 or of its (renormalized) restriction to a compact set, see Lemma A.5 (cf. Lemma 3.1 of Ghosal and van der Vaart [8], pages 1240-1241). This result is used to provide an exponential lower bound on the prior probability of Kullback-Leibler-type neighborhoods of f_0 as well as an exponential upper bound on the covering number of a sieve set.
Theorem 4.1. Let p be a fixed even integer. Suppose that f_0 = F_0 * ψ_{σ_0, p}, with the true mixing distribution F_0 having compact support. If the base measure α has a continuous and positive density with tails satisfying α({θ: |θ| > t}) ≲ e^{−bt^δ} for constants b, δ > 0, and the prior G for σ satisfies condition (A), then the posterior rate of convergence relative to d_H is ε_n = n^{−1/2}(log n)^κ, where κ is as in (4.2), depending on p, δ and γ.
Proof. We show that conditions (A.1) through (A.3) of Theorem A.1 are satisfied by ε̄_n = n^{−1/2}(log n)^κ, with κ as in (4.2), and ε̃_n = n^{−1/2} log n. Given η ∈ (0, 1/5), for positive constants E, F and L to be suitably chosen later on, consider the sieve F_{a, η²/16, s, t} with locations restricted to [−a, a] and scale restricted to [s, t]. An estimate of the η-metric entropy of F_{a, η²/16, s, t} is provided using the second inequality in (2.1) and the bound d_H² ≤ ‖·‖₁, where the third inequality descends from Lemma A.3 of Ghosal and van der Vaart [8], page 1261, and the last one from Lemma A.9.
The choice η_n = ε̄_n = n^{−1/2}(log n)^κ, s_n = E(log η_n^{−1})^{−2/γ}, t_n = exp{F(log η_n^{−1})²} and a_n = L(log η_n^{−1})^{2/δ} leads to the conclusion that, for F_n := F_{a_n, η_n²/16, s_n, t_n}, condition (A.1) is verified because log D(ε̄_n, F_n, d_H) ≲ (log n)^{2κ} = nε̄_n². By the independence of F and σ, condition (A.2) is also seen to be satisfied. It remains to be checked that condition (A.3) holds. We show that, for small enough ε > 0, there exist constants c_1, c_2 > 0 so that the prior mass of Kullback-Leibler-type neighborhoods of f_0 is bounded below as in (4.4). The proof is in the same spirit as that of Theorem 5.1 of Ghosal and van der Vaart [8], pages 1251-1253. Let 0 < ε ≤ [(σ_0/2) ∧ (1 − e^{−1})/√2] be fixed. Let [−a_0, a_0] be the support of F_0. By Lemma A.5, there exists a discrete distribution F_0′ (depending on ε), supported on at most N ≲ log(1/ε) points θ_1, ..., θ_N in [−a_0, a_0] that are at least 2ε-separated, such that f_{F_0′, σ_0} approximates f_0 to within ε, where in (4.5) we have used inequality (A.6). Using the independence of F and σ, together with the assumptions on the base measure α and the prior G for σ, we obtain the bound in (4.4). Thus, condition (A.3) is satisfied with ε̃_n = n^{−1/2} log n and the proof is complete.
Remark 4.1. Given p = 2m, m ∈ N, the best rate is obtained when κ = 1. This implies having δ ≥ 2p, namely, the base measure α should have quite rapidly decaying tails. Note that the rate depends on the tail behaviour of the prior G for σ only through its exponential decay rate γ at zero.

Remark 4.2.
If p = 2, i.e., the kernel is Gaussian, and γ = ∞, i.e., the scale parameter σ is bounded below away from zero by a known positive constant, then (an upper bound on) the rate coincides with the one found by Ghosal and van der Vaart [8] in Theorem 5.1, page 1250, where, except for assumption (A), the same conditions as in Theorem 4.1 are postulated.
The assumption of Theorem 4.1 that the true mixing distribution F_0 has compact support can be relaxed to allow for an unbounded set of locations, without affecting the nearly parametric rate, by requiring F_0 to have sub-exponential tails and the base measure α to have an EP density with suitably constrained shape parameter.
Theorem 4.2. Let p be a fixed even integer. Suppose that f_0 = F_0 * ψ_{σ_0, p}, with the true mixing distribution F_0 having sub-exponential tails, F_0({θ: |θ| > t}) ≲ e^{−c_0 t^p} for large t > 0, (4.7) for some constant c_0 > 0. If the base measure α has a density α′ such that, for constants b > 0 and 0 < δ ≤ p, α′(θ) ∝ e^{−b|θ|^δ}, θ ∈ R, (4.8) and the prior distribution G for σ satisfies assumption (A), then the posterior rate of convergence relative to d_H is ε_n = n^{−1/2}(log n)^κ, where κ depends on p, δ and γ.
Proof. First, note that a density α′, as assumed in (4.8), is continuous and positive on R and that, for δ ≥ 1, the corresponding base measure α satisfies the tail condition required in Theorem 4.1. By (4.7), we have ‖f_{F_0^*, σ_0} − f_0‖₁ ≲ ε, where F_0^* denotes the renormalized restriction of F_0 to [−a_ε, a_ε]. By Lemma A.5, there exists a discrete distribution F_0′, which matches the (finite) moments of F_0^* up to the order N ≲ log(1/ε) and has at most N support points in [−a_ε, a_ε] that are (at least) 2ε-separated, such that, for any such F_0′ and any σ > 0 with |σ − σ_0| ≤ ε, the same chain of inequalities as in (4.5) applies because, by assumption, δ/p ≤ 1. Then, the proof can be completed by laying out the same arguments as in Theorem 4.1.

Remark 4.3.
Given p = 2m, m ∈ N, the best rate is obtained for δ = p, namely, when the tail decay rate of the base measure α equals the shape parameter of the kernel. In such a case, κ takes its smallest value: for instance, for the Gaussian kernel (p = 2) with a normal base measure and an inverse-gamma prior on σ², the rate becomes n^{−1/2}(log n)^{5/2}, see Corollary 4.1.
Remark 4.4. Given p > 0, if the true density is f_0 = F_0 * ψ_{σ_0, p}, then the tails of f_0 are controlled by those of F_0 and of the kernel, as expressed by the upper bound in (4.11). If the true mixing distribution F_0 has sub-exponential tails, i.e., for constants c_0, q > 0, F_0({θ: |θ| > t}) ≲ e^{−c_0 t^q} for large t > 0, then it is the term with the slower decay rate that dominates in the upper bound in (4.11). Therefore, under assumption (4.7) of Theorem 4.2, f_0 has tails of the same order as the kernel, meaning that f_0(x)/e^{−C_0|x|^p} → c as |x| → ∞, where c is a positive constant. Theorem 4.2 contemplates only the case where F_0 has a tail decay rate at least as fast as that of the kernel, as when, for example, f_0 is the convolution of two normals, f_0 = φ_{τ_0} * φ_{σ_0} = φ_{(τ_0² + σ_0²)^{1/2}}. If, instead, f_0 is the convolution of a Gaussian kernel and a Laplace distribution, f_0 = ψ_{τ_0, 1} * φ_{σ_0}, then Theorem 4.2 does not apply.
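The Gaussian example in the remark, φ_{τ_0} * φ_{σ_0} = φ_{(τ_0² + σ_0²)^{1/2}}, can be verified numerically; the scale values below are illustrative.

```python
import numpy as np

tau0, sigma0 = 0.8, 1.1                      # illustrative scales
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
phi = lambda x, s: np.exp(-x**2 / (2.0 * s * s)) / (s * np.sqrt(2.0 * np.pi))

# discrete convolution on a symmetric grid approximates (φ_{τ0} * φ_{σ0})(x)
conv = np.convolve(phi(x, tau0), phi(x, sigma0), mode="same") * dx
target = phi(x, np.sqrt(tau0**2 + sigma0**2))
assert float(np.max(np.abs(conv - target))) < 1e-6
```

Replacing the first factor by a Laplace density would give the normal-Laplace case discussed above, whose mixing tails are heavier than the kernel's and which falls outside the scope of Theorem 4.2.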
The result of Theorem 4.2 for the important special case of a Gaussian kernel with an inverse-gamma prior on σ² is separately stated in the following corollary.
Corollary 4.1. Suppose that f_0 is a mixture of normals, f_0 = F_0 * φ_{σ_0}, with the true mixing distribution F_0 having sub-Gaussian tails, i.e., F_0({θ : |θ| > t}) ≲ e^{−c_0 t²}, for large t > 0 and some constant c_0 > 0. If the base measure α is normal and the prior on σ² is an inverse-gamma, then the posterior rate of convergence relative to d_H is ε_n = n^{−1/2}(log n)^{5/2}.
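One reconstruction of the logarithmic exponent of Theorem 4.2 that is consistent with both this corollary and the comparison in Remark 4.5 is κ = 1/2 + p/δ + p/γ, with δ the tail decay rate of the base measure and γ the decay rate at zero of the prior for σ. The bookkeeping for the corollary is then:

```latex
% Gaussian kernel: p = 2. Normal base measure: \delta = 2.
% Inverse-gamma prior on \sigma^2: the density of \sigma decays like
% e^{-c/\sigma^2} at zero, so \gamma = 2.
\kappa = \frac{1}{2} + \frac{p}{\delta} + \frac{p}{\gamma}
       = \frac{1}{2} + \frac{2}{2} + \frac{2}{2} = \frac{5}{2}.
```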
In the next corollary, we state a result on the posterior expected density f̂_n, for which an explicit expression can be found in Lo [17], Theorem 2, pages 353-354. The "in probability" statement is understood to be with respect to P_0^n.
In Theorem 4.1 and Theorem 4.2, verification of the remaining mass condition (A.2) for F_n^c has led us to ask for the base measure α to have either sub-exponential tails or a density α′ of a prescribed form. In the next theorem, the use of Lemma A.11, which provides an upper bound on the posterior probability of F_n^c by exploiting the properties of the Dirichlet process, allows us to impose a less restrictive condition on α: in fact, condition (4.12) requires α′ to have tails only bounded below by those of an EP density. On the other hand, however, we consider a stronger condition on the upper tail of the prior G for σ, a requirement that might be an artifact of the method of proof. The requirement on G is formalized hereafter as condition (A′) for easy reference.
Assumption (A′) differs from assumption (A) in that it requires G to have an exponentially decaying tail also at infinity. This rules out the possibility of using an inverse-gamma prior on σ², unless σ is known to lie in some interval (0, σ̄], with 0 < σ̄ < ∞: in such a case, in fact, a right-truncated inverse-gamma distribution trivially satisfies the tail condition at infinity. For example, a prior distribution for σ verifying condition (A′) may have a continuous and positive density g, with parameters β, ν > 0, proportional to an inverse-gamma IG(1; β) density on (0, 1] and to a Weibull W(ν; β) density on (1, ∞). Condition (A′) is then satisfied with γ_1 = 1 and γ_2 = ν.

The conditions on G appearing in (A′) were also postulated by Ghosal and van der Vaart [10], page 699, to assess posterior rates of convergence for Dirichlet normal mixtures at smooth densities. However, these authors consider a sample-size-dependent prior on σ: in their set-up, in fact, G is the distribution of σ/σ_n, with σ_n a sequence of positive real numbers such that n^{−a_1} ≲ σ_n ≲ n^{−a_2}, for some 0 < a_2 < a_1 < 1.

Theorem 4.3. Let p be a fixed even integer. Suppose that f_0 = F_0 * ψ_{σ_0, p}, with the true mixing distribution F_0 having either compact support or sub-exponential tails as in (4.7). If the base measure α has a continuous and positive density α′ that, for constants b > 0 and 0 < δ < p, satisfies α′(θ) ≳ e^{−b|θ|^δ}, for large |θ|, (4.12) and the prior distribution G for σ satisfies condition (A′), then the posterior rate of convergence relative to d_H is ε_n = n^{−1/2}(log n)^κ, where κ > 0 depends on p, δ and the decay rates γ_1, γ_2 of condition (A′).
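As an aside on condition (A′): a density of the spliced kind described above can be written down concretely. The sketch below uses one assumed parametrization (the values β = 1 and ν = 3/2 are arbitrary, and the exact formula is not taken from the article), with the two pieces matched for continuity at σ = 1 and the whole normalized numerically.

```python
import numpy as np
from scipy.integrate import quad

BETA, NU = 1.0, 1.5  # illustrative parameter values

def g_unnormalized(s, beta=BETA, nu=NU):
    """Spliced kernel: inverse-gamma IG(1, beta) shape on (0, 1],
    Weibull W(nu, beta) shape on (1, inf), continuous at s = 1."""
    if s <= 0.0:
        return 0.0
    if s <= 1.0:
        # decays like e^{-beta/s} at zero, so gamma_1 = 1
        return s ** -2.0 * np.exp(-beta / s)
    # decays like e^{-beta * s^nu} at infinity, so gamma_2 = nu;
    # at s = 1 both pieces equal e^{-beta}
    return s ** (nu - 1.0) * np.exp(-beta * s ** nu)

# normalize, splitting the integral at the kink s = 1
Z = quad(g_unnormalized, 0.0, 1.0)[0] + quad(g_unnormalized, 1.0, np.inf)[0]

def g(s):
    """Normalized prior density for sigma of the kind verifying (A')."""
    return g_unnormalized(s) / Z
```

For β = 1 the normalizing constant is available in closed form, Z = e^{−1}(1 + 1/ν), since the two pieces integrate to e^{−1} and e^{−1}/ν respectively.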
If nη_n² → ∞, then also the first term in (A.17) converges to zero. To complete the proof, it remains to choose η_n so that condition (A.1) is also satisfied. From Theorem 4.1, which appeals to Lemma A.9, it is known that log D(η_n, F_n, d_H) ≲ [(a_n/s_n)^p ∨ log(1/η_n)] × log(1/η_n).
Remark 4.5. Let supp(g) denote the support of the prior density g for σ. An inspection of the proof of Theorem 4.3 shows that, when supp(g) ⊆ (0, σ̄], for some 0 < σ̄ < ∞, then δ is also allowed to take the value p, i.e., 0 < δ ≤ p. Furthermore, it turns out that κ = 1 + p/γ, with γ the tail decay rate of G at zero. This rate is better than the one we would get using Theorem 4.2: in fact, since (p/δ) ≥ 1, it results that 1/2 + p/δ + p/γ > 1 + p/γ.

Remark 4.6. If G has compact support [σ̲, σ̄] ⊂ (0, ∞), which corresponds to having γ_1 = γ_2 = ∞, then κ = 1 and, for any value of p = 2m, m ∈ ℕ, an upper bound on the rate is n^{−1/2}(log n), which is the same as that obtained by Walker.

The fact that ψ_{σ, p} is not infinitely differentiable at the origin seems to play a key role in the search for a finitely supported approximating mixing distribution F′_0 of a given F_0 with a restricted number of atoms. When p ≠ 2m, m ∈ ℕ, the arguments of Lemma 3.1 in Ghosal and van der Vaart [8], pages 1240-1241, cannot be used to find F′_0 by the moment matching condition, because θ and x do not separate by factorization (see their equation (3.8), page 1241). An approximating mixing distribution can be found using the matching condition combined with a preliminary partitioning argument, but this incurs an additional factor of ε^{−(1∨1/p)} in the number of support points (cf. Lemma A.5), leading to a slower than nearly parametric rate.
Theorem 4.4. Let p > 1/2 be such that p ≠ 2m, m ∈ ℕ. Suppose that f_0 = F_0 * ψ_{σ_0, p}, with the true mixing distribution F_0 having either compact support or sub-exponential tails as in (4.7) and the true scale σ_0 lying in a compact interval [σ̲, σ̄] ⊂ (0, ∞). If the base measure α has a continuous and positive density α′ that, for constants b > 0 and 0 < δ ≤ p, satisfies (4.12), and the prior G for σ is supported on [σ̲, σ̄] and has a continuous and positive density on an interval containing σ_0, then the posterior rate of convergence relative to d_H is ε_n = n^{−r}(log n)^κ, where the exponents 0 < r < 1/2 and κ > 0 depend on p.

For p > 1/2, without loss of generality, we may assume that the support points of F′_0 are at least 2ε²-separated (cf. Ghosal and van der Vaart [8], page 1252). Represent F′_0 as ∑_{j=1}^N w_j δ_{θ_j}, with |θ_j − θ_k| ≥ 2ε² for all j ≠ k, and reason as in the proof of Theorem 4.1 to bound below, for a constant c > 0, the prior probability assigned to suitable neighbourhoods of F′_0 by any probability measure F on ℝ in those neighbourhoods. Note that, for p > 1/2, the condition ε² ≤ (1/N) of Lemma A.2 in Ghosal and van der Vaart [8], page 1260, is satisfied. Hence, for constants c′, c″ > 0, the lower bound can be made explicit and, for a suitable constant c_2 > 0, we have Π(B_KL(f_0; ε̃_n²)) ≳ exp{−c_2 n ε̃_n²}. Following the arguments in the proof of Theorem 4.2, the same bound can be obtained for the case where F_0 has sub-exponential tails. To complete the proof, note that κ > ς, thus an upper bound on the rate is given by ε_n = ε̃_n.
Remark 4.7. The result covers the case of a normal-Laplace density f_0, for which a rate of n^{−1/4}(log n)^{5/4} is obtained.
Some remarks are in order. Differently from Theorems 4.1, 4.2 and 4.3, the posterior rate has been derived here under the assumption that the scale lies in a compact interval. The result is based on inequalities that are explicit in the lower and upper bounds on the scale, so that, in principle, a rate could be obtained also when σ is unconstrained; yet, as the rate would be slower, we do not pursue this aim. As far as we are aware, the (minimax) optimal rate of convergence, relative to the Hellinger or the L_1-metric, for this density estimation problem is unknown. Admittedly, we have found only an upper bound on the posterior rate and are not able to say whether, except possibly for a logarithmic factor, this bound is sharp.

Final remarks
In this article, we have studied frequentist asymptotic properties of posterior distributions of Dirichlet EP mixture priors, focussing on rates of convergence. Some theoretical properties usually discussed for specific densities nested in the family of EP distributions, like the normal, have been extended to more general kernels, with non-trivial implications for posterior rates. The discrepancy observed in the rates for analytic and non-analytic kernels seems to suggest that, for infinite mixtures, the accuracy of the posterior in quantifying the uncertainty on the true density may heavily depend on the regularity of the kernel. It should be mentioned that a similar behaviour has recently been noted by de Jonge and van Zanten [4] in a regression setting, insofar as the regularity of the kernel influences the posterior contraction rate, leading to a slower rate when the kernel is not analytic. This feature deserves further investigation to clarify whether and, if so, why the lack of regularity of the kernel causes a loss in the rate.
The results on posterior rates are of interest not only in density estimation, as previously pointed out, but also in the context of linear regression with unknown error distribution, see, e.g., Ghosal and van der Vaart [9], pages 205-207. In both settings, however, the cases where the true mixing distribution has tails decaying not exponentially fast or the true density is a scale mixture of normals, like the Cauchy or Student's t distribution, are not covered. This suggests that a potential future direction to pursue is an extension of previous results to Dirichlet scale mixtures of EP distributions.

Appendix: Auxiliary results
This Appendix reports the statement of a theorem from the recent literature (cf. Theorem A.1), which is instrumental to derive the main results of the article, establishes some facts about the uniform approximation of EP mixtures by finite mixtures (cf. Lemma A.5) and provides an upper bound on the metric entropy of a sieve set of EP mixtures (cf. Lemma A.9).

By the Mean Value Theorem, applied at an intermediate point z*, with Ψ(z) := (d/dz) log Γ(z) the digamma function. As z* → 0, the term T_1 is controlled through the expansion Ψ(z) = −1/z − γ + O(z), where γ is the Euler-Mascheroni constant, and T_2 ≲ σ(z*)^{−1}. The first bound in (A.8) then follows from the fact that (z*)^{−1} < (p ∨ p′). As z* → ∞, T_1 = O(1) and T_2 = O(1), whence the second bound.
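The small-argument behaviour of the digamma function invoked above, Ψ(z) = −1/z − γ + O(z) as z → 0⁺, with γ the Euler-Mascheroni constant, is standard and can be checked numerically (an illustrative sketch):

```python
import numpy as np
from scipy.special import digamma

# Psi(z) + 1/z -> -gamma as z -> 0+, with an O(z) error term
for z in (1e-3, 1e-5, 1e-7):
    residual = digamma(z) + 1.0 / z + np.euler_gamma
    assert abs(residual) < 10.0 * z  # residual is ~ (pi^2 / 6) * z
```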
The next lemma establishes that, at least when the kernel is analytic, EP mixtures can be uniformly approximated by finite EP mixtures with a relatively small number of components.
Lemma A.5. Let p > 0 be fixed. Let 0 < ε < 1 and a, σ > 0 be given. Then there exists a discrete distribution F′ on [−a, a], with a restricted number N of support points, such that the mixtures F * ψ_{σ, p} and F′ * ψ_{σ, p} are uniformly within a multiple of ε of each other.

Proof. In the case where p is an even integer, the result can be proved as in Lemma 3.1 of Ghosal and van der Vaart [8], pages 1240-1241. In the case where p ≠ 2m, m ∈ ℕ, the approximation is obtained, for any constant M > 0 and any probability measure F′ on [−a, a], by combining the moment matching condition with a preliminary partitioning argument.
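The flavour of Lemma A.5 can be illustrated with Gauss-Legendre quadrature (a stand-in for the moment matching argument, not the proof's construction): an N-point rule defines a discrete distribution on [−a, a] that matches the moments of the uniform distribution up to order 2N − 1, and, the kernel being analytic for p = 2, the resulting finite mixture is uniformly very close to the full mixture even for small N.

```python
import numpy as np

a, sigma, p = 1.0, 1.0, 2.0
kernel = lambda u: np.exp(-np.abs(u / sigma) ** p)  # unnormalized EP kernel

x = np.linspace(-6.0, 6.0, 1201)

# "Full" mixture: mixing distribution Uniform[-a, a], via trapezoidal quadrature
theta = np.linspace(-a, a, 2001)
w_tr = np.full(theta.size, theta[1] - theta[0])
w_tr[0] *= 0.5
w_tr[-1] *= 0.5
f_full = (kernel(x[:, None] - theta[None, :]) * w_tr).sum(axis=1) / (2.0 * a)

# Discrete mixing distribution: N Gauss-Legendre atoms match the moments
# of Uniform[-a, a] up to order 2N - 1
N = 8
nodes, wq = np.polynomial.legendre.leggauss(N)
f_disc = (kernel(x[:, None] - (a * nodes)[None, :]) * (wq / 2.0)).sum(axis=1)

sup_err = float(np.max(np.abs(f_full - f_disc)))  # uniform approximation error
```

With only N = 8 atoms the uniform error is already far below 10⁻⁵; for a kernel that is not infinitely differentiable at zero (p ≠ 2m) such rapid decay is no longer available, in line with the discussion preceding Theorem 4.4.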
Remark A.2. For σ = 1 and p = 2, inequality (A.14) reduces to a Gaussian tail bound which, rewritten in terms of the Mills ratio, involves the standard normal cumulative distribution function Φ(·).
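For reference, the classical Mills-ratio bound for the standard normal, 1 − Φ(x) ≤ φ(x)/x for x > 0 (the kind of inequality alluded to above; the exact form of (A.14) is not reproduced here), can be verified numerically:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi_bar(x):
    """Standard normal upper tail probability 1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# 1 - Phi(x) <= phi(x) / x for every x > 0
for k in range(1, 101):
    x = 0.1 * k
    assert Phi_bar(x) <= phi(x) / x
```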
Next, a similar bound valid for any p > 0 is derived which, in particular, holds for p ∈ (0, 1).
Lemma A.7. Let p > 0 be fixed and let σ > 0 be given. Then, for any B > σ, a tail bound of the same type holds.

The following lemma provides an upper bound on the L_1-distance between two EP mixtures by relating it to the corresponding L_∞-distance. It is based on Lemma A.7 and extends an analogous result valid for normal mixtures (cf. Lemma 3.2 of Ghosal and van der Vaart [8], pages 1242-1243).
Lemma A.8. Let p > 0 be fixed. Given a > 0, let F and F′ be probability measures on [−a, a]. For given σ > 0, define d_∞ := ‖F * ψ_{σ, p} − F′ * ψ_{σ, p}‖_∞. If d_∞ ≤ (e^{−2/p}/σ), then ‖F * ψ_{σ, p} − F′ * ψ_{σ, p}‖_1 ≲ d_∞ [a ∨ σ(p^{−1} log(1/(σd_∞)))^{1/p}].

The next lemma gives an upper bound on the L_1-metric entropy of a sieve set of EP mixtures. It is based on Lemma A.5 and Lemma A.8 and can be proved similarly to Lemma 3 of Ghosal and van der Vaart [10], pages 705-707, which deals with normal mixtures.

Lemma A.9. Let p = 2m, m ∈ ℕ, be fixed. Let 0 < ε < 1/5. Let 0 < s < t and a > 0 be such that, for some ν > 0, suitable growth conditions hold, and define the corresponding sieve of EP mixtures.

The following assertion is useful to estimate the prior probability of Kullback-Leibler type neighborhoods of f_0 when checking condition (A.3). Ghosal and van der Vaart [8] present this result for p = 2 (cf. Lemma 4.1, pages 1248-1249).
Lemma A.10. Let p > 0 be fixed. Let F′ be a probability measure on ℝ such that, for some constant b′ > 0, an exponential-type tail bound holds for large t > 0.
The next lemma is a version of Lemma 11 of Ghosal and van der Vaart [10], pages 715-717, adapted to EP mixtures.
Lemma A.11. Let p > 0 be fixed. Let X_1, . . . , X_n be i.i.d. observations from a probability measure P_0 with density f_0 = F_0 * ψ_{σ_0, p}. Suppose that the model is a location mixture of EP densities, i.e., f_{F, σ} = F * ψ_{σ, p}, with the scale parameter σ distributed independently of the mixing distribution F. If the base measure α of the Dirichlet process prior for F has a continuous and positive density α′ on [−a, a], with a ≥ 1, then, for any 0 < T ≤ (a/2) and η > 0, there exists a constant K > 0 (depending only on p) such that the stated bound holds, where λ_a := inf_{|θ| ≤ a} α′(θ) > 0.