Posterior contraction rates for deconvolution of Dirichlet-Laplace mixtures

We study nonparametric Bayesian inference with location mixtures of the Laplace density and a Dirichlet process prior on the mixing distribution. We derive a contraction rate of the corresponding posterior distribution, both for the mixing distribution relative to the Wasserstein metric and for the mixed density relative to the Hellinger and $L_q$ metrics.


Introduction
Consider statistical inference using the following hierarchical Bayesian model for observations X_1, …, X_n: (i) A probability distribution G on R is generated from the Dirichlet process prior DP(α) with base measure α. (ii) An i.i.d. sample Z_1, …, Z_n is generated from G. (iii) An i.i.d. sample e_1, …, e_n is generated from a known density f, independently of the other samples. (iv) The observations are X_i = Z_i + e_i, for i = 1, …, n.
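The sampling scheme (i)-(iv) is straightforward to simulate. The following sketch (not part of the paper; all variable names are illustrative) draws a truncated stick-breaking realization of the Dirichlet process with a uniform base measure on [−a, a] and adds Laplace noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_dp(alpha_mass, base_sampler, trunc=500):
    """Draw a truncated realization G = sum_j w_j * delta_{theta_j} from DP(alpha)."""
    betas = rng.beta(1.0, alpha_mass, size=trunc)
    w = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    theta = base_sampler(trunc)
    return w / w.sum(), theta  # renormalize the truncated weights

# (i): base measure alpha uniform on [-a, a], prior precision 1
a = 2.0
weights, atoms = stick_breaking_dp(1.0, lambda k: rng.uniform(-a, a, size=k))

# (ii)-(iv): Z_i ~ G, Laplace noise e_i with density e^{-|x|}/2, X_i = Z_i + e_i
n = 1000
Z = rng.choice(atoms, p=weights, size=n)
e = rng.laplace(loc=0.0, scale=1.0, size=n)
X = Z + e
```

The truncation level is a practical device only; the paper's model uses the full (infinite) stick-breaking representation.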
In this setting, given G the data X_1, …, X_n are an i.i.d. sample from the convolution p_G = f * G of the density f and the measure G. The scheme defines a conditional distribution of G given the data X_1, …, X_n, the posterior distribution of G, and consequently also posterior distributions for quantities that derive from G, including the convolution density p_G. We are interested in whether this posterior distribution can recover a "true" mixing distribution G_0 if the observations X_1, …, X_n are in reality a sample from the mixed distribution p_{G_0}, for some given probability distribution G_0. The main contribution of this paper concerns the case that f is the Laplace density f(x) = e^{−|x|}/2. For distributions on the full line, Laplace mixtures seem to be the second most popular class next to mixtures of the normal distribution, with applications in for instance speech recognition and astronomy (Kotz et al. [2001]) and clustering problems in genetics (Bailey et al. [1994]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.
We consider two notions of recovery. The first measures the distance between the posterior of G and G_0 through the Wasserstein metric

    W_k(G, G′) = ( inf_{γ ∈ Γ(G,G′)} ∫ |x − y|^k dγ(x, y) )^{1/k},

where Γ(G, G′) is the collection of all couplings γ of G and G′ into a bivariate measure with marginals G and G′ (i.e. if (x, y) ∼ γ, then x ∼ G and y ∼ G′), and k ≥ 1. The Wasserstein metric is a classical metric on probability distributions, well suited for obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see Villani [2009]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by Nguyen [2013], within a general mixing framework. We refer to that paper for further motivation of the Wasserstein metric for mixtures, and to Villani [2009] for general background on the Wasserstein metric. In the present paper we improve the upper bound on posterior contraction rates given in Nguyen [2013], at least in the case of Laplace mixtures, obtaining a rate of nearly n^{−1/8} for W_1 (and slower rates for k > 1). Apparently the minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric is currently unknown; work on recovery of a mixing distribution by non-Bayesian methods is given in Zhang [1990]. It is not clear from our result whether the upper bound n^{−1/8} is sharp. The second notion of recovery measures the distance of the posterior of G to G_0 indirectly, through the Hellinger or L_q-distances between the mixed densities p_G and p_{G_0}. This is equivalent to studying the estimation of the true density p_{G_0} of the observations through the density p_G under the posterior distribution.
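As a quick numerical illustration (not from the paper), W_1 between two discrete mixing distributions can be computed with SciPy and cross-checked against the one-dimensional identity W_1(G, G′) = ∫ |F_G − F_{G′}| dx, which avoids searching over couplings; the distributions below are arbitrary examples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# two discrete mixing distributions on the line
xs = np.array([-1.0, 0.0, 1.0]); ps = np.array([0.2, 0.5, 0.3])
ys = np.array([-1.0, 0.5, 1.0]); qs = np.array([0.3, 0.4, 0.3])

w1 = wasserstein_distance(xs, ys, u_weights=ps, v_weights=qs)

# cross-check via the CDF formula on a fine grid
grid = np.linspace(-2.0, 2.0, 40001)
F = ((grid[:, None] >= xs) * ps).sum(axis=1)    # F_G
Fq = ((grid[:, None] >= ys) * qs).sum(axis=1)   # F_G'
w1_cdf = (np.abs(F - Fq)[:-1] * np.diff(grid)).sum()

assert abs(w1 - 0.3) < 1e-8       # exact value, computable by hand
assert abs(w1 - w1_cdf) < 1e-3    # CDF formula agrees up to grid error
```

For k > 1 no such closed form in terms of the CDFs exists in general, which is one reason the metrics W_k for k > 1 are harder to work with.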
As the Laplace kernel f has Fourier transform f̃(λ) = 1/(1 + λ²), the mixed densities p_G have Fourier transforms satisfying

    |p̃_G(λ)| = |G̃(λ)|/(1 + λ²) ≤ 1/(1 + λ²).

Estimation of a density with a polynomially decaying Fourier transform was first considered in Watson and Leadbetter [1963]. According to their theorem in Section 3A, a suitable kernel estimator possesses a root mean square error of order n^{−3/8} with respect to the L_2-norm for estimating a density whose Fourier transform decays exactly at the order 2. This rate is the "usual rate" n^{−α/(2α+1)} of nonparametric estimation for smoothness α = 3/2. This is understandable, as |p̃(λ)| ≲ 1/(1 + |λ|²) implies that ∫ (1 + |λ|²)^α |p̃(λ)|² dλ < ∞, for every α < 3/2, so that a density with Fourier transform decaying at the square rate belongs to any Sobolev class of regularity α < 3/2. Indeed, in Golubev [1992] the rate n^{−α/(2α+1)} is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In the present paper we show that the posterior distribution of Laplace mixtures p_G contracts to p_{G_0} at the rate n^{−3/8} up to a logarithmic factor, relative to the L_2-norm and the Hellinger distance, and we also establish rates for other L_q-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order 3/2. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This insight would suggest a rate n^{−1/3} (the usual nonparametric rate for α = 1), which is slower than n^{−3/8}; hence this heuristic is misleading.
[imsart-ejs ver. 2011/11/15 file: laplace.tex date: January 27, 2016]
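The displayed Fourier identity can be checked numerically for any discrete mixing measure; the sketch below (illustrative, not from the paper) compares a quadrature approximation of the Fourier transform of p_G with the closed form G̃(λ)/(1 + λ²):

```python
import numpy as np

zs = np.array([-0.5, 0.3, 1.0]); ws = np.array([0.3, 0.3, 0.4])  # a discrete G
x = np.linspace(-40.0, 40.0, 160001)
dx = x[1] - x[0]
pG = (ws * 0.5 * np.exp(-np.abs(x[:, None] - zs))).sum(axis=1)  # p_G = f * G

def trap(g):  # trapezoidal rule on the grid x
    return ((g[1:] + g[:-1]) * 0.5 * dx).sum()

for lam in (0.5, 2.0, 10.0):
    ft_num = trap(pG * np.exp(1j * lam * x))                      # numerical FT of p_G
    ft_exact = (ws * np.exp(1j * lam * zs)).sum() / (1 + lam**2)  # G~(lam)/(1+lam^2)
    assert abs(ft_num - ft_exact) < 1e-4
    assert abs(ft_num) <= 1.0 / (1 + lam**2) + 1e-4               # polynomial decay bound
```

The |λ|^{−2} decay visible here is exactly the "ordinary smoothness of order 2" that drives the rates in this paper.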
Besides recovery relative to the Wasserstein metric and the induced metrics on p_G, one might consider recovery relative to a metric on the distribution function of G. Frequentist recovery rates for this problem were obtained in Fan [1991] under some restrictions. There is no simple relation between these rates and the rates for the other metrics. The same is true for the rates for deconvolution of densities, also studied in Fan [1991]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of G.
Contraction rates for Dirichlet mixtures of the normal kernel were considered in Ghosal and van der Vaart [2001], Ghosal and van der Vaart [2007], Kruijer et al. [2010], Shen et al. [2011], Scricciolo [2011]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach fails for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship between these metrics and a power of the Hellinger distance, and next apply a variant of the contraction theorem of Ghosal et al. [2000], which is included in the appendix of the paper. Contraction rates of mixtures with priors other than the Dirichlet were considered in Scricciolo [2011]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in Knapik and Salomond [2014], and results specific to deconvolution in Donnet et al. [2014]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results of the present paper.
The paper is organized as follows. In the next section we state the main results of the paper, which are proved in the subsequent sections. In Section 3 we establish suitable finite approximations relative to the L_q- and Hellinger distances. The L_q-approximations also apply to kernels other than the Laplace, and are in terms of the tail decay of the kernel's characteristic function. In Sections 4 and 5 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the L_q, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 6 and 7 contain the proofs of the main results.

Notation and preliminaries
Throughout the paper integrals given without limits are integrals over the real line R. The L_q-norm is given by ‖f‖_q = (∫ |f|^q(x) dx)^{1/q}, with ‖·‖_∞ the uniform norm. The Hellinger distance on the space of densities is given by

    h(f, g) = ( ∫ (√f − √g)²(x) dx )^{1/2}.

It is easy to see that h²(f, g) ≤ ‖f − g‖_1 ≤ 2h(f, g), for any two probability densities f and g. Furthermore, if the densities f and g are uniformly bounded by a constant M, then ‖f − g‖_2 ≤ 2√M h(f, g). The Kullback–Leibler discrepancy and corresponding variance are denoted by

    K(p_0, p) = ∫ log(p_0/p) dP_0,    K_2(p_0, p) = ∫ (log(p_0/p))² dP_0,

with P_0 the measure corresponding to the density p_0. We are primarily interested in the Laplace kernel, but a number of results are true for general kernels f. The Fourier transform of a function f and the inverse Fourier transform of a function f̃ are given by

    f̃(λ) = ∫ e^{iλx} f(x) dx,    f(x) = (1/2π) ∫ e^{−iλx} f̃(λ) dλ.

The covering number N(ε, Θ, ρ) of a metric space (Θ, ρ) is the minimum number of ε-balls needed to cover the entire space Θ.
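The stated inequalities between the Hellinger, L_1 and L_2 distances are easy to verify numerically; the following sketch (illustrative only) checks them for two shifted Laplace densities, both bounded by M = 1/2:

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))          # Laplace density at 0
g = 0.5 * np.exp(-np.abs(x - 1.5))    # Laplace density shifted by 1.5

def integ(h):  # trapezoidal rule
    return ((h[1:] + h[:-1]) * 0.5 * dx).sum()

hell = np.sqrt(integ((np.sqrt(f) - np.sqrt(g)) ** 2))  # Hellinger distance h(f, g)
l1 = integ(np.abs(f - g))
l2 = np.sqrt(integ((f - g) ** 2))
M = 0.5                                                # uniform bound on both densities

assert hell ** 2 <= l1 + 1e-9          # h^2 <= ||f - g||_1
assert l1 <= 2 * hell + 1e-9           # ||f - g||_1 <= 2 h
assert l2 <= 2 * np.sqrt(M) * hell + 1e-9
```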
Throughout the paper ≲ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore a_n ≍ b_n means that c ≤ lim inf_{n→∞} a_n/b_n ≤ lim sup_{n→∞} a_n/b_n ≤ C, for some positive constants c and C.
We denote by M[−a, a] the set of all probability measures on a given interval [−a, a].

Main Results
Let Π_n(·|X_1, …, X_n) be the posterior distribution for G in the scheme (i)-(iv) introduced at the beginning of the paper. We study this random distribution under the assumption that X_1, …, X_n are an i.i.d. sample from the mixture density p_{G_0} = f * G_0, for a given probability distribution G_0. We assume that G_0 is supported in a compact interval [−a, a], and that the base measure α of the Dirichlet prior in (i) is concentrated on this interval, with a Lebesgue density bounded away from 0 and ∞.
Theorem 1. If G_0 is supported on [−a, a], f is the Laplace kernel, and α has support [−a, a] with a Lebesgue density bounded away from 0 and ∞, then for every k ≥ 1 there exists a constant M such that

    E_0 Π_n( G: W_k(G, G_0) ≥ M n^{−3/(8(k+2))} (log n)^{(8k+7)/(8(k+2))} | X_1, …, X_n ) → 0.

The rate for the Wasserstein metric W_k given in the theorem deteriorates with increasing k, which is perhaps not unreasonable as the Wasserstein metrics increase with k. The fastest rate is n^{−1/8} (log n)^{5/8}, obtained for W_1.
Theorem 2. If G_0 is supported on [−a, a], f is the Laplace kernel, and α has support [−a, a] with a Lebesgue density bounded away from 0 and ∞, then there exists a constant M such that

    E_0 Π_n( G: h(p_G, p_{G_0}) ≥ M (log n/n)^{3/8} | X_1, …, X_n ) → 0,   (2.2)

and, for every q ∈ [2, ∞),

    E_0 Π_n( G: ‖p_G − p_{G_0}‖_q ≥ M (log n/n)^{3/(4q)} | X_1, …, X_n ) → 0.   (2.3)

The rate for the L_q-distance given in (2.3) deteriorates with increasing q. For q = 2 it is the same as the rate (log n/n)^{3/8} for the Hellinger distance.
In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of a compact support ensures that the rate is fully determined by the complexity of the mixtures, and not their tail behaviour.

Finite Approximation
In this section we show that a general mixture p_G can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel f. We first consider approximation with respect to the L_q-norm, which applies to mixtures p_G = f * G for a general kernel f, and next approximation with respect to the Hellinger distance for the case that f is the Laplace kernel. The first result generalizes a result of Ghosal and van der Vaart [2001] for normal mixtures; also see Scricciolo [2011].
The result splits in two cases, depending on the tail behaviour of the Fourier transform f̃ of f.

Lemma 1. Let ε < 1 be sufficiently small and fixed, and let p be the conjugate exponent of a given q ∈ [2, ∞), i.e. 1/p + 1/q = 1. For a probability measure G supported on [−a, a] there exists a discrete probability measure G′ on [−a, a] with at most N support points such that ‖p_G − p_{G′}‖_q ≲ ε, where:
(i) N ≍ log(1/ε), if f̃ is supersmooth, in the sense that |f̃(λ)| ≲ e^{−c|λ|^r} for some constants c > 0 and r ≥ 1;
(ii) N ≍ ε^{−1/(β−1/p)}, if f̃ is ordinary smooth with smoothness parameter β > 1/p, in the sense that |f̃(λ)| ≲ (1 + |λ|)^{−β}.

Proof. The Fourier transform of p_G is given by f̃G̃, for G̃(λ) = ∫ e^{iλz} dG(z). Determine G′ so that it possesses the same moments as G up to order k − 1, i.e.

    ∫ z^j dG′(z) = ∫ z^j dG(z),    j = 1, …, k − 1.
By Lemma A.1 in Ghosal and van der Vaart [2001], G′ can be chosen to have at most k support points.
Then, for G and G′ supported on [−a, a], we have

    |G̃(λ) − G̃′(λ)| = | ∫ ( e^{iλz} − Σ_{j=0}^{k−1} (iλz)^j/j! ) d(G − G′)(z) | ≤ 2 (a|λ|)^k/k!.

The inequality comes from |e^{iy} − Σ_{j=0}^{k−1} (iy)^j/j!| ≤ |y|^k/k!. Therefore, by the Hausdorff–Young inequality,

    ‖p_G − p_{G′}‖_q ≲ ( ∫ |f̃ (G̃ − G̃′)|^p(λ) dλ )^{1/p} ≲ ( ∫_{|λ|>M} |f̃(λ)|^p dλ )^{1/p} + ( ∫_{|λ|≤M} (2(a|λ|)^k/k!)^p dλ )^{1/p}.

We denote the first term in the preceding display by I_1 and the second term by I_2. It is easy to bound I_2: by Stirling's approximation, I_2 ≲ M^{1/p} (eaM/k)^k, which is bounded by ε for k a sufficiently large multiple of max(M, log(1/ε)). For I_1 we separately consider the supersmooth and ordinary smooth cases.
In the ordinary smooth case with smoothness parameter β, we have the bound I_1 ≲ M^{−(β−1/p)}. We choose M = (1/ε)^{1/(β−1/p)} to render the right side equal to ε. Then N ≍ k ≍ M ≍ ε^{−1/(β−1/p)}.

The number of support points in the preceding lemma is increasing in q and decreasing in β. For approximation in the L_2-norm (q = 2), the number of support points is of order ε^{−1/(β−1/2)}, and this reduces to ε^{−2/3} for the Laplace kernel (ordinary smooth with β = 2). The exponent β − 1/2 can be interpreted as (almost) the Sobolev smoothness of p_G, since, for α < β − 1/2,

    ∫ (1 + |λ|²)^α |p̃_G(λ)|² dλ < ∞.

We do not have a compelling intuition for this correspondence. The Hellinger distance is more sensitive to areas where the densities are close to zero, which causes the approach of the preceding lemma to give results that are not sharp. The following lemma does give sharp results, but is restricted to the Laplace kernel.
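The moment-matching mechanism behind these approximations can be illustrated numerically. The sketch below (not from the paper) uses Gauss-Legendre nodes as a convenient stand-in for the moment-matched discrete measure of Lemma A.1: an N-point Gauss-Legendre rule matches the moments of the uniform mixing measure on [−a, a] up to degree 2N − 1, and the L_2 distance between the corresponding Laplace mixtures drops quickly as N grows:

```python
import numpy as np

a = 1.0
x = np.linspace(-25.0, 25.0, 5001)
dx = x[1] - x[0]
lap = lambda t: 0.5 * np.exp(-np.abs(t))  # Laplace kernel

# p_G for G = uniform[-a, a], via a fine midpoint rule in z
K = 1000
z = -a + (np.arange(K) + 0.5) * (2 * a / K)
pG = lap(x[:, None] - z).mean(axis=1)

def l2norm(g):  # trapezoidal approximation of the L_2 norm
    return np.sqrt(((g[1:] ** 2 + g[:-1] ** 2) * 0.5 * dx).sum())

errs = []
for N in (2, 4, 8):
    nodes, wts = np.polynomial.legendre.leggauss(N)
    # G' = sum_i (wts_i/2) delta_{a*nodes_i} matches moments of G up to degree 2N-1
    pN = ((wts / 2.0) * lap(x[:, None] - a * nodes)).sum(axis=1)
    errs.append(l2norm(pG - pN))

assert errs[0] > errs[1] > errs[2]  # more matched moments => better L_2 approximation
assert errs[2] < 0.05
```

The observed decay is roughly the polynomial order N^{−3/2} predicted by the lemma for the Laplace kernel (β = 2, q = 2), in contrast to the logarithmic number of support points that suffices for supersmooth kernels.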
Lemma 2. For a probability measure G supported on [−a, a] there exists a discrete measure G′ with at most N ≍ ε^{−2/3} support points such that, for p_G = f * G and f the Laplace density, h(p_G, p_{G′}) ≲ ε.

Proof. Since p_G(x) ≥ f(|x| + a) = e^{−a} e^{−|x|}/2, for every x and every probability measure G supported on [−a, a], the Hellinger distance between Laplace mixtures satisfies

    h²(p_G, p_{G′}) = ∫ (p_G − p_{G′})²/(√p_G + √p_{G′})²(x) dx ≲ ∫ (p_G − p_{G′})²(x) e^{|x|} dx.

If we write q_G(x) = p_G(x) e^{|x|/2}, and q̃_G for the corresponding Fourier transform, then the integral on the right side is equal to (1/2π) ∫ |q̃_{G′} − q̃_G|²(λ) dλ, by Plancherel's theorem. By an explicit computation we obtain q̃_G(λ) = (1/2) ∫ r(λ, z) dG(z), where r(λ, z) is given, for z ≥ 0 (the case z < 0 being symmetric), by

    r(λ, z) = e^{−z}/(iλ + 1/2) + e^{−z} (e^{(iλ+3/2)z} − 1)/(iλ + 3/2) − e^{(iλ+1/2)z}/(iλ − 1/2)
            = e^{−z}/((iλ + 1/2)(iλ + 3/2)) − 2 e^{iλz} e^{z/2}/((iλ + 3/2)(iλ − 1/2)).   (3.1)

Now let G′ be a discrete measure on [−a, a] such that

    ∫ e^{−z} dG′(z) = ∫ e^{−z} dG(z),    ∫ z^j e^{z/2} dG′(z) = ∫ z^j e^{z/2} dG(z),  j = 0, 1, …, k.

By Lemma A.1 in Ghosal and van der Vaart [2001], G′ can be chosen to have at most k + 1 support points. By the choice of G′ the first term of r(λ, z) gives no contribution to the difference ∫ r(λ, z) d(G′ − G)(z). As the second term of r(λ, z) is for large |λ| bounded in absolute value by a multiple of |λ|^{−2}, it follows that

    ∫_{|λ|>M} | ∫ r(λ, z) d(G′ − G)(z) |² dλ ≲ M^{−3}.

By the choice of G′ we can replace e^{iλz} in the second term of r(λ, z) by e^{iλz} − Σ_{j=0}^{k} (iλz)^j/j!, again without changing the integral ∫ r(λ, z) d(G′ − G)(z). It follows, by a similar argument as in the proof of Lemma 1, that we can reduce both I_1 (the integral over |λ| > M) and I_2 (the integral over |λ| ≤ M) to ε² by choosing M ≍ ε^{−2/3} and k ≍ 2aeM.

Entropy
We study the covering numbers of the class of mixtures p G = f * G, where G ranges over the collection M[−a, a] of all probability measures on [−a, a]. We present a bound for any L q -norm and general kernels f , and a bound for the Hellinger distance that is specific to the Laplace kernel.
Proposition 1. If both ‖f‖_q and ‖f′‖_q are finite and f̃ has ordinary smoothness β, then, for p_G = f * G and any q ≥ 2,

    log N( ε, {p_G: G ∈ M[−a, a]}, ‖·‖_q ) ≲ ε^{−1/(β−1+1/q)} log(1/ε).

Proof. We construct an ε-net of P_a = {p_G: G ∈ M[−a, a]} from the collection of all p_G such that the mixing measure G ∈ M[−a, a] is discrete with at most N ≤ D ε^{−1/(β−1+1/q)} support points, for a suitable constant D.
In light of the approximation Lemma 1, the set of all mixtures p_G with G a discrete probability measure with N ≲ ε^{−1/(β−1+1/q)} support points forms an ε-net over the set of all mixtures p_G as in the lemma. It therefore suffices to construct an ε-net of the given cardinality over this set of discrete mixtures.
By Jensen's inequality and Fubini's theorem, for a discrete measure Σ_i p_i δ_{θ_i} and its relocation Σ_i p_i δ_{θ′_i},

    ‖ Σ_i p_i f(· − θ_i) − Σ_i p_i f(· − θ′_i) ‖_q ≤ Σ_i p_i ‖f(· − θ_i) − f(· − θ′_i)‖_q ≤ ‖f′‖_q max_i |θ_i − θ′_i|.

Furthermore, for any probability vectors p and p′ and locations θ_i,

    ‖ Σ_i (p_i − p′_i) f(· − θ_i) ‖_q ≤ ‖f‖_q ‖p − p′‖_1.

Combining these inequalities, we see that for two discrete probability measures Σ_i p_i δ_{θ_i} and Σ_i p′_i δ_{θ′_i},

    ‖p_G − p_{G′}‖_q ≲ max_i |θ_i − θ′_i| + ‖p − p′‖_1.

Thus we can construct an ε-net over the discrete mixtures by relocating the support points (θ_i)_{i=1}^N to the nearest points (θ′_i)_{i=1}^N in an ε-net on [−a, a], and relocating the weight vector p to the nearest point p′ in an ε-net for the l_1-norm over the N-dimensional l_1-unit simplex. This gives a set of at most (2a/ε)^N (5/ε)^N mixtures (cf. Ghosal and van der Vaart [2007] for the entropy of the l_1-unit simplex), whence the log-cardinality is of the order N log(1/ε). This gives the bound of the proposition.
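The l_1-net over the unit simplex used above can be built explicitly as the set of probability vectors with coordinates on a grid of mesh 1/m (a stars-and-bars enumeration); the sketch below (illustrative, not from the paper) verifies that every probability vector is within a controlled l_1 distance of this net:

```python
import itertools
import numpy as np

def simplex_grid(N, m):
    """All probability vectors in dimension N with coordinates that are multiples of 1/m."""
    pts = []
    for c in itertools.combinations(range(m + N - 1), N - 1):
        parts = np.diff([-1, *c, m + N - 1]) - 1  # stars and bars
        pts.append(np.array(parts) / m)
    return np.array(pts)

N, m = 3, 10
net = simplex_grid(N, m)
assert net.shape[0] == 66  # C(m+N-1, N-1) = C(12, 2) grid points

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(N), size=200)
dists = np.abs(p[:, None, :] - net[None, :, :]).sum(axis=2).min(axis=1)
assert dists.max() <= 2 * N / m  # crude rounding bound: within l1-distance 2N/m
```

Since the grid size is polynomial in m per coordinate, the log-cardinality of an ε-net grows like N log(1/ε), which is the source of the entropy bound in the proposition.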
Proposition 2. For f the Laplace kernel and p_G = f * G,

    log N( ε, {p_G: G ∈ M[−a, a]}, h ) ≲ ε^{−2/3} log(1/ε).

Proof. Because the function √f is absolutely continuous with derivative x ↦ −2^{−3/2} e^{−|x|/2} sgn(x), we have by Jensen's inequality and Fubini's theorem, upon integrating this derivative, that the densities p_G and p_{G′} with mixing measures Σ_i p_i δ_{θ_i} and Σ_i p_i δ_{θ′_i} satisfy

    h(p_G, p_{G′}) ≲ max_i |θ_i − θ′_i|.

For mixing measures Σ_i p_i δ_{θ_i} and Σ_i p′_i δ_{θ_i} with the same support points but different weights, we have

    h²(p_G, p_{G′}) ≤ ‖p_G − p_{G′}‖_1 ≤ ‖p − p′‖_1.

Therefore the bound follows by arguments similar to those in the proof of Proposition 1, where presently we use Lemma 2 to determine suitable finite approximations.
The map G → p G = f * G is one-to-one as soon as the characteristic function of f is never zero. Under this condition we can also view the Wasserstein distance on the mixing distribution as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.
Proposition 3. For any k ≥ 1 and any sufficiently small ε > 0,

    log N( ε, M[−a, a], W_k ) ≲ ε^{−1} log(1/ε).

The proposition is a consequence of Lemma 4, below, which applies to the set of all Borel probability measures on a general metric space (Θ, ρ) (cf. Nguyen [2013]).
Lemma 3. For any probability measure G concentrated on countably many disjoint sets Θ_1, Θ_2, … and any probability measure G′ concentrated on disjoint sets Θ′_1, Θ′_2, …, with p_i = G(Θ_i) and p′_i = G′(Θ′_i), and any k ≥ 1,

    W_k^k(G, G′) ≤ ( sup_i sup_{θ∈Θ_i, θ′∈Θ′_i} ρ(θ, θ′) )^k + diam(Θ)^k (1/2) Σ_i |p_i − p′_i|.

In particular, if G and G′ concentrate on the same sets Θ_i, the first term can be replaced by sup_i diam(Θ_i)^k.

Proof. Construct a coupling of G and G′ as follows. With probability Σ_i p_i ∧ p′_i, draw an index i with probability proportional to p_i ∧ p′_i, and next draw θ from the renormalized restriction of G to Θ_i and θ′ from the renormalized restriction of G′ to Θ′_i; with the remaining probability, draw θ and θ′ independently from the renormalized remaining parts of G and G′. Then θ and θ′ have marginal distributions G and G′, and

    E ρ(θ, θ′)^k ≤ sup_i sup_{θ∈Θ_i, θ′∈Θ′_i} ρ(θ, θ′)^k + diam(Θ)^k P(θ, θ′ not matched).

The first term is bounded by the k-th power of the first term of the lemma, while the probability in the second term is equal to 1 − Σ_i p_i ∧ p′_i = Σ_i |p_i − p′_i|/2.

Lemma 4. For the set M(Θ) of all Borel probability measures on a metric space (Θ, ρ), any k ≥ 1, and 0 < ε < min{2/3, diam(Θ)},

    N( 2ε, M(Θ), W_k ) ≤ (4 diam(Θ)/ε)^{k N(ε, Θ, ρ)}.
Proof. For a minimal ε-net over Θ of N = N(ε, Θ, ρ) points θ_1, …, θ_N, let Θ = ∪_i Θ_i be the partition obtained by assigning each θ ∈ Θ to a closest point of the net. For any G let G_ε = Σ_i G(Θ_i) δ_{θ_i}, for arbitrary but fixed points θ_i ∈ Θ_i, and let M_ε be the set of all measures G_ε so obtained. We next form the measures G_{ε,p} = Σ_i p_i δ_{θ_i}, for (p_1, …, p_N) ranging over an (ε/diam(Θ))^k-net for the l_1-distance over the N-dimensional unit simplex. By Lemma 3 every G_ε is within W_k-distance ε of some G_{ε,p}, and every G is within W_k-distance ε of G_ε. Thus N(2ε, M(Θ), W_k) is bounded from above by the number of points p, which is bounded by (4 diam(Θ)/ε)^{kN} (cf. Lemma A.4 in Ghosal et al. [2000]).

Prior mass
The main result of this section is the following proposition, which gives a lower bound on the mass that the prior (i)-(iv) assigns to a neighbourhood of a mixture p_{G_0}.
Proposition 4. If Π is the Dirichlet process DP(α) with base measure α that has a Lebesgue density bounded away from 0 and ∞ on its support [−a, a], and f is the Laplace kernel, then for every sufficiently small ε > 0 and every probability measure G_0 on [−a, a],

    Π( G: K(p_{G_0}, p_G) ≤ ε², K_2(p_{G_0}, p_G) ≤ ε² ) ≥ e^{−C ε^{−2/3} log(1/ε)},

for a constant C that depends on a and α only.

Proof. By Lemma 2 there exists a discrete measure G_1 with N ≲ ε^{−2/3} support points such that h(p_{G_0}, p_{G_1}) ≤ ε. We may assume that the support points of G_1 are at least 2ε²-separated. If not, we take a maximal 2ε²-separated set in the support points of G_1, and replace G_1 by the discrete measure obtained by relocating the masses of G_1 to the nearest points of this set; then h(p_{G_1}, p_{G′_1}) ≲ ε², as seen in the proof of Proposition 2. Now by Lemmas 6 and 5, if G_1 = Σ_{j=1}^N p_j δ_{z_j}, with the support points z_j at least 2ε²-separated, then for every probability measure G,

    h²(p_{G_1}, p_G) ≲ ε² + Σ_{j=1}^N |G[z_j − ε², z_j + ε²] − p_j|.

By Lemma A.2 of Ghosal and van der Vaart [2001], since the base measure α has density bounded away from zero and infinity on [−a, a] by assumption, we have

    Π( G: Σ_{j=1}^N |G[z_j − ε², z_j + ε²] − p_j| ≤ ε² ) ≥ e^{−cN log(1/ε)}.

The proposition follows upon combining the preceding bounds with (5.2).
Lemma 5. If G′ = Σ_{j=1}^N p_j δ_{z_j} is a probability measure supported on points z_1, …, z_N in R with |z_j − z_k| > 2ε for j ≠ k, then for any probability measure G on R and any kernel f with ‖f‖_2 and ‖f′‖_2 finite,

    ‖p_G − p_{G′}‖_2 ≲ ε ‖f′‖_2 + ‖f‖_2 Σ_{j=1}^N |G([z_j − ε, z_j + ε]) − p_j|.

Lemma 6. For probability measures G and G′ supported on [−a, a], and f the Laplace kernel,

    h²(p_G, p_{G′}) ≲ e^{3a/2} ‖p_G − p_{G′}‖_2,   (5.1)
    max( K(p_G, p_{G′}), K_2(p_G, p_{G′}) ) ≲ (1 + a)² h²(p_G, p_{G′}).   (5.2)

Proofs. The first lemma is a generalization of Lemma 4 in Ghosal and van der Vaart [2007] from the normal to general kernels, and is proved in the same manner.
In view of the shape of the Laplace kernel, it is easy to see that, for G compactly supported on [−a, a],

    e^{−a} e^{−|x|}/2 ≤ p_G(x) ≤ e^{a} e^{−|x|}/2.

We bound the squared Hellinger distance, for any A > 0, as

    h²(p_G, p_{G′}) ≲ e^{a} e^{A} ‖p_G − p_{G′}‖_2² + e^{a} e^{−A},

where the first term bounds the integral of (√p_G − √p_{G′})² over |x| ≤ A, using the lower bound on the densities, and the second term bounds the integral of p_G + p_{G′} over |x| > A, using the upper bound. By the elementary inequality t + u/t ≥ 2√u, for u, t > 0, the right side cannot be improved beyond the order ‖p_G − p_{G′}‖_2, and we obtain (5.1) upon choosing A = min(a, log ‖p_G − p_{G′}‖_2^{−1} − a/2). For the proof of the second assertion we first note that, if both G and G′ are compactly supported on [−a, a], then by the preceding display

    p_G(x)/p_{G′}(x) ≤ (e^{a} e^{−|x|}/2)/(e^{−a} e^{−|x|}/2) = e^{2a}.

Therefore ‖p_G/p_{G′}‖_∞ ≤ e^{2a}, and (5.2) follows by Lemma 8 in Ghosal and van der Vaart [2007].

Proof of Theorem 1
The proof is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in Nguyen [2013]. Let C_k be a constant such that the map ε ↦ ε[log(C_k/ε)]^{k+1/2} is monotone on (0, 2].

Lemma 7. For probability measures G and G′ supported on [−a, a], and p_G = f * G for a probability density f with inf_λ (1 + |λ|^β)|f̃(λ)| > 0, and any k ≥ 1,

    W_k(G, G′) ≲ h(p_G, p_{G′})^{1/(k+β)} [ log( C_k/h(p_G, p_{G′}) ) ]^{(k+1/2)/(k+β)}.

Proof. By Theorem 6.15 in Villani [2009] the Wasserstein distance W_k(G, G′) is bounded above by a multiple of the kth root of ∫ |x|^k d|G − G′|(x), where |G − G′| is the total variation measure of the difference G − G′. We apply this to the convolutions of G and G′ with the normal distribution Φ_δ with mean 0 and variance δ², to find, for every M > 0,

    W_k(G * Φ_δ, G′ * Φ_δ)^k ≲ M^k ‖φ_δ * (G − G′)‖_1 + K_δ M^k e^{−M},

where Z is a standard normal variable and the number K_δ := e^{2|a|} E e^{2δ|Z|} is uniformly bounded if δ ≤ δ_k, for some fixed δ_k. By Plancherel's theorem,

    ‖φ_δ * (G − G′)‖_2 ≲ ‖φ̃_δ (G̃ − G̃′)‖_2 ≤ sup_λ (φ̃_δ/|f̃|)(λ) ‖f̃ (G̃ − G̃′)‖_2 ≲ δ^{−β} h(p_G, p_{G′}),

where we have again applied Plancherel's theorem, used that the L_2-metric on uniformly bounded densities is bounded by the Hellinger distance, and used the assumption on the Fourier transform of f, which shows that (φ̃_δ/|f̃|)(λ) ≲ (1 + |λ|^β) e^{−δ²λ²/2} ≲ δ^{−β}. If U ∼ G is independent of Z ∼ N(0, 1), then (U, U + δZ) gives a coupling of G and G * Φ_δ; therefore the definition of the Wasserstein metric gives

    W_k(G, G * Φ_δ) ≤ ( E |δZ|^k )^{1/k} ≲ δ.

Combining the preceding inequalities with the triangle inequality, we see that, for δ ∈ (0, δ_k] and any M > 0, W_k(G, G′) is bounded by a multiple of δ plus the kth root of the first display. The lemma follows by optimizing this over M and δ. Specifically, for ε = h(p_G, p_{G′}), we choose M = [k/(k + β)] log(C_k/ε) and δ = (M^{k+1/2} ε)^{1/(k+β)}. These are eligible choices, for which the relevant supremum over ε ∈ (0, 2] is indeed a finite number; in fact the supremum is taken at ε = 2, by the assumption on C_k.
For the Laplace kernel f we choose β = 2 in the preceding lemma, and then obtain that d(p_G, p_{G′}) ≤ h(p_G, p_{G′}), for the "discrepancy" d = γ^{−1}(W_k), where γ(ε) = D_k ε^{1/(k+β)}[log(C_k/ε)]^{(k+1/2)/(k+β)} is a multiple of the (monotone) transformation on the right side of the preceding lemma. For small values of W_k(G_1, G_2) we have

    d(p_{G_1}, p_{G_2}) ≍ W_k(G_1, G_2)^{k+2} / [log(1/W_k(G_1, G_2))]^{k+1/2}.   (6.1)

As k + 2 > 1, the discrepancy d may not satisfy the triangle inequality, but it does possess the properties (a)-(d) in the appendix, Section 9. The balls of the discrepancy d are convex, as the balls of the Wasserstein metrics are convex (see Villani [2009]). It follows that Theorem 3 applies to yield a rate of posterior contraction relative to d, and hence relative to W_k ≍ d^{1/(k+2)} [log(1/d)]^{(k+1/2)/(k+2)}. We apply the theorem with P = P_n equal to the set of mixtures p_G = f * G, as G ranges over M[−a, a]. Thus (9.3) is trivially satisfied.
For the entropy condition (9.1) we have, by Proposition 3 and (6.1),

    log N( ε, P_n, d ) = log N( γ(ε), M[−a, a], W_k ) ≲ γ(ε)^{−1} log(1/γ(ε)) ≲ ε^{−1/(k+2)} log(1/ε),

so that (9.1) is satisfied for ε_n a multiple of (log n/n)^{(k+2)/(2k+5)}.
Theorem 3 yields a rate of contraction relative to d equal to the slower of the two rates, which is (log n/n)^{3/8}. This translates into the rate for the Wasserstein distance given in Theorem 1.

Proof of Theorem 2
We apply Theorem 3, with P = P_n the set of all mixtures p_G as G ranges over M[−a, a]. For d = h the rate follows immediately by combining Propositions 2 and 4.
Since the densities p_G are uniformly bounded by 1/2, the L_q-distance ‖p_G − p_{G′}‖_q is bounded above by a multiple of h(p_G, p_{G′})^{2/q}. We can therefore apply Theorem 3 with the discrepancy d(p, p′) = ‖p − p′‖_q^{q/2}. In view of Proposition 1,

    log N( ε, P_n, d ) ≲ ε^{−2/(q+1)} log(1/ε).
Therefore the entropy condition (9.1) is satisfied for ε_n ≍ (log n/n)^{(q+1)/(2q+4)}. By Proposition 4 the prior mass condition is satisfied for ε_n ≍ (log n/n)^{3/8}. By Theorem 3 the rate of contraction relative to d is the slower of these two rates, which for q ≥ 2 is (log n/n)^{3/8}. The rate relative to the L_q-norm is the (2/q)th power of this rate.

Normal mixtures
We conclude by reproducing the results on normal mixtures from Ghosal and van der Vaart [2001], but in the L_2-norm. The normal kernel is supersmooth, so by the approximation lemma, for any measure G_1 compactly supported on [−a, a] we can find a discrete measure G_2 with a number of support points of order N ≍ log(1/ε) such that ‖p_{G_1} − p_{G_2}‖_2 ≤ ε. Following the same procedure as before, with G_0 the true mixing measure, the prior mass condition takes the form

    Π( G: K(p_{G_0}, p_G) ≤ ε², K_2(p_{G_0}, p_G) ≤ ε² ) ≥ e^{−C (log(1/ε))²}.

Thus we obtain ε_n = log n/√n. By Lemma 1, we have the entropy estimate

    log N( ε, P_a, ‖·‖_2 ) ≲ (log(1/ε))².

This coincides with the estimate from the prior mass condition, and thus we obtain the rate ε_n = log n/√n with respect to the L_2-norm. This is the same rate as obtained in Ghosal and van der Vaart [2001], only now relative to the L_2-norm. However, we lose a √log n factor compared to Watson and Leadbetter [1963], where the rate is √(log n/n).

Appendix: contraction rates relative to non-metrics
The basic theorem of Ghosal et al. [2000] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance.
In the present situation we would like to apply this result to a power smaller than one of the Wasserstein metric, which is not a metric. In this appendix we establish a rate of contraction that is valid for more general discrepancies. We consider a general "discrepancy measure" d, which is a map d: P × P → R on the product of the set of densities on a given measurable space with itself, having the following properties, for some constant C > 0:
(a) d(p, q) ≥ 0, for every p and q;
(b) d(p, q) = 0 if and only if p = q;
(c) d(p, q) = d(q, p), for every p and q;
(d) d(p, q) ≤ C (d(p, r) + d(r, q)), for every p, q and r.
Thus d is a metric except that the triangle inequality is replaced with the weaker condition (d), which incorporates a constant C, possibly bigger than 1. Call a set of the form {x: d(x, y) < c} a d-ball, and define covering numbers N(ε, P, d) relative to d as usual.
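A discrepancy of exactly this kind arises by raising a metric to a power larger than one, as happens with γ^{−1}(W_k) above. A minimal numerical sketch (not from the paper) of the phenomenon: the cube of the Euclidean distance on R violates the triangle inequality, but satisfies property (d) with C = 4, because (u + v)³ ≤ 4(u³ + v³) for u, v ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(2)

def d(x, y):
    return np.abs(x - y) ** 3  # a power > 1 of a metric: not a metric itself

# the ordinary triangle inequality fails:
assert d(0.0, 2.0) > d(0.0, 1.0) + d(1.0, 2.0)  # 8 > 1 + 1

# ... but the C-relaxed triangle inequality (property (d)) holds with C = 4:
p = rng.uniform(-5, 5, size=(10000, 3))
lhs = d(p[:, 0], p[:, 2])
rhs = d(p[:, 0], p[:, 1]) + d(p[:, 1], p[:, 2])
assert np.all(lhs <= 4 * rhs + 1e-9)
```

More generally, the sth power of a metric satisfies (d) with C = 2^{s−1}, which is the source of the constant C in the contraction argument.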
Let Π_n(·|X_1, …, X_n) be the posterior distribution of p given an i.i.d. sample X_1, …, X_n from a density p that is equipped with a prior probability distribution Π.
Theorem 3. Let d be a discrepancy measure with properties (a)-(d) whose balls are convex, and which satisfies d ≤ h. Suppose that, for a sequence ε_n → 0 with nε_n² → ∞, sets of densities P_n, and a constant c > 0,

    log N( ε_n, P_n, d ) ≤ nε_n²,   (9.1)
    Π( p: K(p_0, p) ≤ ε_n², K_2(p_0, p) ≤ ε_n² ) ≥ e^{−c nε_n²},   (9.2)
    Π( P \ P_n ) ≤ e^{−(c+4) nε_n²}.   (9.3)

Then, for a sufficiently large constant M, E_0 Π_n( p: d(p, p_0) ≥ M ε_n | X_1, …, X_n ) → 0. The proof follows that of Ghosal et al. [2000], with the tests for Hellinger balls replaced by the tests of the following lemma.

Lemma 8. Let d be a discrepancy measure with properties (a)-(d) whose balls are convex and which satisfies d ≤ h, let P be a probability measure and let Q be a dominated set of probability measures. Then for every ε > 0 and every n there exist tests φ_n such that, with N(ε) := sup_{j∈N} N( C^{−1}jε/4, {Q ∈ Q: Cjε < d(P, Q) < 2Cjε}, d ),

    P^n φ_n ≤ N(ε) Σ_{j≥1} e^{−nj²ε²/32},    sup { Q^n(1 − φ_n): Q ∈ Q, d(P, Q) > Cε } ≤ e^{−nε²/32}.
Proof. For a given j ∈ N, choose a maximal set Q_{j,1}, Q_{j,2}, …, Q_{j,N_j} in the shell Q_j = {Q ∈ Q: Cjε < d(P, Q) < 2Cjε} such that d(Q_{j,k}, Q_{j,l}) ≥ jε/2 for every k ≠ l. By property (d) of the discrepancy, every ball in a cover of Q_j by balls of radius C^{−1}jε/4 contains at most one Q_{j,k}; thus N_j ≤ N(C^{−1}jε/4, Q_j, d) ≤ N(ε). Furthermore, the N_j balls B_{j,l} of radius jε/2 around the Q_{j,l} cover Q_j, as otherwise the set of Q_{j,l} would not be maximal. For any point Q of B_{j,l} we have d(P, Q) ≥ C^{−1} d(P, Q_{j,l}) − d(Q, Q_{j,l}) ≥ jε/2. Since d ≤ h and the balls B_{j,l} are convex, Lemma 9 provides a test of P against B_{j,l} with error probabilities bounded by e^{−n(jε/2)²/8} = e^{−nj²ε²/32}; the supremum of these tests over j and l has the properties claimed in the lemma.
The following lemma comes from the general results of Birgé [1984] and Le Cam [1986].
Lemma 9. For any probability measure P and dominated, convex set of probability measures Q with h(p, q) > ε for every q ∈ Q, and any n ∈ N, there exists a test φ_n such that

    P^n φ_n ≤ e^{−nε²/8},    sup_{Q∈Q} Q^n(1 − φ_n) ≤ e^{−nε²/8}.