Kullback Leibler property of kernel mixture priors in Bayesian density estimation

Positivity of the prior probability of a Kullback-Leibler neighborhood around the true density, commonly known as the Kullback-Leibler property, plays a fundamental role in posterior consistency. A popular prior for Bayesian estimation is given by a Dirichlet mixture, where the kernels are chosen depending on the sample space and the class of densities to be estimated. The Kullback-Leibler property of the Dirichlet mixture prior has been shown for some special kernels like the normal density or the Bernstein polynomial, under appropriate conditions. In this paper, we obtain easily verifiable sufficient conditions under which a prior obtained by mixing a general kernel possesses the Kullback-Leibler property. We study a wide variety of kernels used in practice, including the normal, $t$, histogram, gamma and Weibull densities, among others, and show that the Kullback-Leibler property holds if some easily verifiable conditions are satisfied at the true density. This gives a catalog of conditions required for the Kullback-Leibler property, which can be readily used in applications.


Introduction
Density estimation, which is also relevant in various applications such as cluster analysis and robust estimation, is a fundamental nonparametric inference problem. In the Bayesian approach to density estimation, a prior such as a Gaussian process, a Polya tree process, or a Dirichlet mixture is constructed on the space of probability densities. Dirichlet mixtures were introduced by Ferguson [9] and Lo [21], who also obtained expressions for the resulting posterior and predictive distributions. West [30], West, Müller and Escobar [31] and Escobar and West [6; 7] developed powerful Markov chain Monte Carlo methods to calculate Bayes estimates and other posterior quantities for Dirichlet mixtures.
The priors of interest in this paper are of mixture type and can be described in terms of a kernel and a prior for the mixing distribution. Let X be the sample space and let Θ be the space of the mixing parameter θ. Let K(x; θ) be the kernel on X × Θ, i.e., K(x; θ) is a jointly measurable function such that for all θ, K(·; θ) is a probability density on X. The choice of an appropriate kernel depends on the underlying sample space X, on which the true density is defined. If X is the entire real line, a location-scale kernel is appropriate. If X is the unit interval, a uniform or triangular density kernel, or a Bernstein polynomial, may be considered. If X is the positive half line (0, ∞), mixtures of gamma, Weibull, lognormal, exponential or inverse gamma densities may be used. Petrone and Veronese [25] discussed the issue of the choice of a kernel in view of a constructive approximation known as the Feller sampling scheme. Let P, the mixing distribution on Θ, be given a prior Π on M(Θ), the space of probability measures on Θ. Let supp(Π) denote the weak support of Π. The prior on P and the chosen kernel then give rise to a prior on D(X), the space of densities on X, via the map $P \mapsto f_P(x) := \int K(x;\theta)\,dP(\theta)$. We shall call such a prior a type I mixture prior, or Prior 1 in short. To enrich the family of kernels, let the kernel function contain another parameter φ, referred to as the hyper-parameter. In this case, we shall denote the kernel by K(x; θ, φ). The hyper-parameter φ might be elicited a priori or be given a prior. In the former case, such a prior essentially reduces to Prior 1. For the latter case, assume that φ is independent of P and denote the prior for φ by µ. Let Φ be the space of φ and let supp(µ) denote the support of µ. With such a random hyper-parameter in the chosen kernel, the prior on densities is induced by µ × Π via the map $(\phi, P) \mapsto f_{P,\phi}(x) := \int K(x;\theta,\phi)\,dP(\theta)$. We shall call this prior a type II mixture prior, or simply Prior 2.
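The map from a mixing distribution to a density can be made concrete with a small numerical sketch. The example below assumes a normal location kernel with the scale playing the role of the hyper-parameter φ, and a hypothetical discrete mixing distribution P with three atoms; the names `normal_kernel`, `mixture_density` and `integrate` are illustrative and do not come from the paper.

```python
import math

def normal_kernel(x, theta, phi):
    """Assumed kernel K(x; theta, phi): normal density with location theta and scale phi."""
    z = (x - theta) / phi
    return math.exp(-0.5 * z * z) / (phi * math.sqrt(2.0 * math.pi))

def mixture_density(x, atoms, weights, phi):
    """f_{P,phi}(x) = sum_j w_j K(x; theta_j, phi) for a discrete mixing distribution P."""
    return sum(w * normal_kernel(x, t, phi) for t, w in zip(atoms, weights))

def integrate(f, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * step)
    return total * step

# Hypothetical mixing distribution P (illustrative atoms and weights only).
atoms = [-2.0, 0.0, 3.0]
weights = [0.3, 0.5, 0.2]

# The induced f_{P,phi} should again be a probability density on the real line.
total_mass = integrate(lambda x: mixture_density(x, atoms, weights, phi=0.7), -15.0, 15.0)
```

Fixing `phi` in advance corresponds to Prior 1 with an elicited hyper-parameter, while drawing `phi` from a separate prior before calling `mixture_density` would correspond to Prior 2.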
Clearly, Prior 2 contains Prior 1 as a special case where φ is treated as a vacuous parameter. In some situations, the prior Π may contain an additional indexing parameter ξ. For instance, when Π is the Dirichlet process with base measure $\alpha_\xi$ (written as DP($\alpha_\xi$)) depending on an indexing parameter ξ, which is also given a prior, we obtain a mixture of Dirichlet processes (MDP) [1] prior for the mixing distribution P. Addition of this hierarchical structure to Prior 1 or Prior 2 gives somewhat more flexibility. In this paper, we do not make any specific assumption on Π like DP or MDP other than requiring that it has large weak support. The prior induced on the space of densities by a mixing distribution P ∼ Π (and φ ∼ µ and ξ ∼ π) will be denoted by $\Pi^*$ and we shall refer to it as a kernel mixture prior. Note that the variable x and the parameters θ, φ and ξ mentioned above are not necessarily one-dimensional.
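For readers who want to simulate a mixing distribution P ∼ DP(α), a truncated stick-breaking draw is one common device. The sketch below is an assumption-laden illustration: the stick-breaking representation is not discussed in this paper, the truncation level `n_atoms` is arbitrary, and the function name `draw_dp_stick_breaking` and the normal base measure are hypothetical choices.

```python
import random

def draw_dp_stick_breaking(total_mass, base_sampler, n_atoms, rng):
    """Approximate draw of a discrete P from DP(alpha), truncated to n_atoms atoms.

    total_mass   -- assumed total mass of the base measure alpha
    base_sampler -- function rng -> atom, sampling from the normalized base measure
    """
    atoms, weights = [], []
    remaining = 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, total_mass)  # stick-breaking proportion
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    # The leftover stick is assigned to the last atom so the weights sum to one.
    weights.append(remaining)
    atoms.append(base_sampler(rng))
    return atoms, weights

rng = random.Random(0)
atoms, weights = draw_dp_stick_breaking(2.0, lambda r: r.gauss(0.0, 3.0), 200, rng)
```

Each such draw is a discrete probability measure; plugging its atoms and weights into a kernel gives one random density from the induced kernel mixture prior.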
Asymptotic properties, such as consistency, and rate of convergence of the posterior distribution based on kernel mixture priors were established by Ghosal, Ghosh and Ramamoorthi [11], Tokdar [29], and Ghosal and van der Vaart [13; 14], when the kernel is chosen to be a normal probability density (and the prior distribution of the mixing distribution is DP). Similar results for Dirichlet mixture of Bernstein polynomials were shown by Petrone and Wasserman [26], Ghosal [10] and Kruijer and van der Vaart [19]. However, in the literature, there is a lack of such results for mixture of other kernels, which are also widely used in practice. We are only aware of the article by Petrone and Veronese [25] who considered general kernels. However, they derived consistency only under the strong and unrealistic condition that the true density is exactly of the mixture type for some compactly supported mixing distribution, or the true density itself is compactly supported and is approximated in terms of Kullback-Leibler divergence by its convolution with the chosen kernel.
Schwartz [28] showed that consistency at a true density $f_0$ holds if the prior assigns positive probabilities to a specific type of neighborhood of $f_0$ defined by the Kullback-Leibler divergence measure and the size of the model is restricted in some appropriate sense. Thus the prior positivity condition, known as the Kullback-Leibler property (KL property), is fundamental in posterior consistency studies. More formally, let a density function f be given a prior $\Pi^*$. Define a Kullback-Leibler neighborhood of f of size ε by $K_\epsilon(f) = \{g : K(f; g) < \epsilon\}$, where $K(f; g) = \int f \log(f/g)$ is the Kullback-Leibler divergence between f and g. We say that the KL property holds at $f_0 \in D(X)$, or that $f_0$ is in the Kullback-Leibler support (KL support) of $\Pi^*$, and write $f_0 \in KL(\Pi^*)$, if $\Pi^*(K_\epsilon(f_0)) > 0$ for every ε > 0. For the weak topology, the size condition in Schwartz's theorem holds automatically [16, Theorem 4.4.2]. Further, Ghosal, Ghosh and Ramamoorthi [12] argued that this property drives consistency of the parametric part in some semiparametric models.
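The divergence K(f; g) and membership in a Kullback-Leibler neighborhood can be checked numerically in simple cases. The sketch below approximates the integral by the trapezoidal rule for two normal densities, where the closed form (μ₁ − μ₀)²/2 is available for comparison; the helper names and the truncation interval are assumptions made for this illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def kl_divergence(f, g, lo, hi, n=20000):
    """Trapezoidal approximation of K(f; g) = integral of f log(f/g) over [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        fx = f(x)
        if fx > 0.0:  # points with f(x) = 0 contribute nothing
            weight = 0.5 if i in (0, n) else 1.0
            total += weight * fx * math.log(fx / g(x))
    return total * step

f0 = lambda x: normal_pdf(x, 0.0, 1.0)  # plays the role of the true density
g = lambda x: normal_pdf(x, 1.0, 1.0)   # a candidate density

kl = kl_divergence(f0, g, -12.0, 12.0)  # closed form for this pair is 0.5
in_neighborhood = kl < 0.6              # is g in the KL neighborhood of f0 of size 0.6?
```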
This paper addresses the KL property of general kernel mixture priors, which are not restricted to any particular type of kernel or to any particular prior distribution for the mixing distribution, thus addressing one of the most important issues in posterior consistency. The distinguishing feature of our results is that we allow the true density to be not of the chosen mixture type, and impose only simple moment conditions and qualitative conditions like continuity or positivity.
Ghosal, Ghosh and Ramamoorthi [11] presented results on consistency for a Dirichlet location mixture of a normal kernel with an additional scale parameter in terms of both the weak and $L_1$-topologies. Tokdar [29, Theorem 3.2] considered a location-scale mixture of the normal kernel and established consistency in the weak topology (weak consistency) under more relaxed conditions. If the prior Π is chosen to be DP(α), Tokdar [29] also weakened a moment condition on the true density in his Theorem 3.3. His Theorem 3.2 will be implied by Theorem 4 in this paper (with the choice λ = 0 there). In fact, we establish the KL property for a general location-scale kernel mixture and show that such a result applies to various kernels including the skew-normal, t, double-exponential and logistic. This is a substantial generalization of results known for only the normal kernel thus far. Moreover, we obtain results about the KL property for priors with kernels not belonging to location-scale families, e.g., the Weibull, gamma, uniform, and exponential kernels. The examples studied here provide a ready catalog of conditions required for the KL property to hold for virtually all kernel mixture priors that are of practical interest.
With the help of our results on the KL property, consistency in $L_1$- (equivalently, Hellinger) distance can be obtained by constructing appropriate sieves approximating the class of mixtures and establishing entropy bounds for them. Since the techniques used for sieve construction and bounding entropy vary widely depending on the chosen kernel, we do not address $L_1$-consistency in this paper.
The paper is organized as follows. In Section 2, we study the kernel mixture priors under complete generality without specifying a kernel or the nature of it. In Section 3, using the results provided in Section 2, we study the priors with kernels of the location-scale type. In Section 4, the priors with concretely specified kernels are studied as examples by using the results in the previous sections.

General Kernel Mixture Priors
First we observe that the Kullback-Leibler property is preserved under taking mixtures.
where ξ is an indexing parameter following a prior π and let $f_0$ be the true density. Suppose that there exists a set B with properties The proof is almost a trivial application of Fubini's theorem, since
Remark 1. If Π = DP(α) and supp($P_\epsilon$) ⊂ supp(α), then $P_\epsilon$ ∈ supp(Π); see, for instance, Theorem 3.2.4 of [16]. In particular, the condition holds for any chosen $P_\epsilon$ if α is fully supported on Θ. A similar assertion holds when Π is the Polya tree prior PT($\{T_m\}$, $\mathcal{A}$) (see [20]). Let $T_m$ be a collection of gradually refining binary partitions and $\mathcal{A} = \{\alpha_{\epsilon_1,\ldots,\epsilon_m} : \epsilon_1, \ldots, \epsilon_m = 0 \text{ or } 1,\ m \ge 1\}$. If the end points of $T_m$ form a dense subset of some set S where S ⊃ supp($P_\epsilon$) and the elements of $\mathcal{A}$, which control the beta distributions regulating the mass allocation to the sets in $\Pi_m$, are positive, then also $P_\epsilon$ ∈ supp(Π). This is implicit in Theorem 5 of [20] or Theorem 3.3.6 of [20]; for an explicit statement and proof, see Theorem 2.20 of [15]. Now, if W is an open neighborhood of $P_\epsilon$, then Π(W) > 0 holds.
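The mixture-preservation observation that opens this section can be written out as a short chain of inequalities. What follows is a sketch of one natural reading rather than the paper's own display: the notation $\Pi^*_\xi$ for the kernel mixture prior conditional on the indexing parameter ξ, the decomposition of $\Pi^*$ as a π-mixture of the $\Pi^*_\xi$, and the two hypotheses on B stated in the comments are all assumptions introduced here.

```latex
% Assumed notation: \Pi^*_\xi is the prior on densities given \xi, with \xi \sim \pi,
% so that \Pi^* = \int \Pi^*_\xi \, d\pi(\xi).
% Assumed hypotheses on B: \pi(B) > 0, and f_0 \in KL(\Pi^*_\xi) for every \xi \in B.
\[
\Pi^*\bigl(K_\epsilon(f_0)\bigr)
  = \int \Pi^*_\xi\bigl(K_\epsilon(f_0)\bigr)\,d\pi(\xi)
  \;\ge\; \int_B \Pi^*_\xi\bigl(K_\epsilon(f_0)\bigr)\,d\pi(\xi)
  \;>\; 0
  \qquad \text{for every } \epsilon > 0 .
\]
```

The last inequality uses the fact that a strictly positive integrand over a set of positive π-measure has a strictly positive integral.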
In most applications, we can choose $P_\epsilon$ to be compactly supported. Compactness of supp($P_\epsilon$) often helps satisfy Conditions A4-A9 in Lemmas 2 and 3, which are useful in verifying the conditions of Theorem 1.
Proof. By Condition A4, we have that $K(x;\theta,\phi) \to K(x;\theta,\phi_\epsilon)$ as $\phi \to \phi_\epsilon$, for any given x and θ. By Condition A6 and the dominated convergence theorem (DCT), $f_{P_\epsilon,\phi}(x) \to f_{P_\epsilon,\phi_\epsilon}(x)$ as $\phi \to \phi_\epsilon$, for any given x. Equivalently, this can be written as log Note that By Condition A5 and the DCT, f 0 log if P ∈ U. Also $\int_\Theta K(x;\theta,\phi)\,dP_\epsilon(\theta) > c$ for any x ∈ C, since $P_\epsilon$ has support in D. Hence, given φ ∈ A, for any P ∈ U and x ∈ C, Then, for 3δ + ε/4 < 1, By choosing δ small enough, we can ensure that the right hand side (RHS) of the last display is less than ε/2. Hence, for any given φ ∈ A for any P ∈ U.

Location scale kernel
In this section we discuss priors with kernel functions belonging to location-scale families. We write the kernels as $K(x;\theta,h) = h^{-d}\chi((x-\theta)/h)$. Obviously, when d = 1, this reduces to the ordinary derivative and $\|\cdot\|$ denotes the absolute value. We have the following theorems, whose proofs use some ideas from the proof of Theorem 3.2 of [29].
Theorem 2. Let $f_0(x)$ be the true density and $\Pi^*$ be a type I prior on D(X) with kernel function $h^{-d}\chi((x-\theta)/h)$, i.e., P ∼ Π, and given P, (θ, h) ∼ P. If χ(·) and $f_0(x)$ satisfy:
B1. χ(·) is bounded, continuous and positive everywhere;
B2. there exists $l_1 > 0$ such that χ(x) decreases as x moves away from 0 outside the ball {x :
Remark 3. Tokdar [29] assumed that the weak support of Π includes all compactly supported probabilities in $\mathbb{R}^d \times \mathbb{R}_+$. Then automatically the weak support of Π is $M(\mathbb{R}^d \times \mathbb{R}_+)$. This is because any arbitrary probability measure can be weakly approximated by a sequence of compactly supported probability measures.
Proof of Theorem 2. We prove this theorem by verifying the conditions in Theorem 1. Since there is no hyper-parameter in Prior 1, we only need to show that Conditions A1 and A3 are met.
To show that Condition A1 is met, we define, Since for any given a, $\chi(a) f_0(x - a h_m) \to \chi(a) f_0(x)$ as $h_m \to 0$ and $f_0$ is bounded, by the DCT, we obtain $f_{P_m}(x) \to f_0(x)$. Now, to satisfy Condition A1, we show that To this end, observe that Hence, as log f0(x) Also Let m < $l_1$. Now, for x > m, using assumption B2, The last inequality holds when This follows because, with z = T η x + T η+1 x/ x , a positive multiple of x, The last inequality holds because when x ≤ m, Combining (7) and (9), we obtain From Condition B5, Hence, The first term on the RHS of the above inequality is finite, by Condition B6. By Conditions B5 and B7, the second term is also finite. Thus $\int f_0(x) \log\frac{f_0(x)}{f_{P_m}(x)}\,dx \to 0$ as m → ∞, i.e., Condition A1 is satisfied.
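The conclusion of this step, that the Kullback-Leibler divergence between $f_0$ and a kernel-smoothed version of it vanishes as the bandwidth shrinks, can be seen in closed form in a toy case. The sketch below assumes $f_0$ = N(0, 1) and a normal kernel, so that the smoothed density is N(0, 1 + h²); it ignores the compact truncation used in the actual construction of $P_m$, whose definition is not reproduced here, and the function name is hypothetical.

```python
import math

def kl_after_normal_smoothing(h):
    """K(f0; f_h) for f0 = N(0, 1) and f_h = N(0, 1 + h^2).

    f_h is the convolution of f0 with a N(0, h^2) kernel (toy assumption,
    without the truncation to a compact set used in the proof).
    """
    variance = 1.0 + h * h
    return 0.5 * (math.log(variance) + 1.0 / variance - 1.0)

bandwidths = [1.0, 0.5, 0.1, 0.01]
kl_values = [kl_after_normal_smoothing(h) for h in bandwidths]
```

In this toy case the divergence behaves like h⁴/4 for small h, so it decreases to zero monotonically along the listed bandwidths.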
We show that Condition A3 is met by verifying the conditions of Lemma 3. First, from the proof above, we see that for any ε > 0, there exists $m_\epsilon$ such that $\int f_0(x)\log\frac{f_0(x)}{f_{P_{m_\epsilon}}(x)}\,dx < \epsilon$. Let $P_\epsilon$ in Theorem 1 be chosen to be $P_{m_\epsilon}$, which is compactly supported. By Condition B8, $P_\epsilon$ ∈ supp(Π). Second, Condition A7 is satisfied. To show log By Condition B7 and expression (10) Condition A8 is satisfied by Condition B1. We show that Condition A9 is also satisfied. Let C ⊂ X be a given compact set. First we show that Such an E contains D in its interior, and is compact. By the definition of uniform equicontinuity, it suffices to show that for any ε > 0, there exists δ > 0 such that for all x ∈ C and all (θ, h), Since E and C are compact and h is bounded away from 0 within E, , which ensures that the second term on the RHS of (11) is less than ε/2. Notice that Thus the uniform equicontinuity required in Condition A9 is satisfied.
We can enlarge E to ensure that $h^{-d}\chi((x-\theta)/h)$ is less than any preassigned number for x ∈ C and (θ, h) ∈ $E^c$. This holds for large values of h, since . This follows from Assumption B9 when d ≥ 2. For d = 1, the condition automatically holds since $\int\chi(y)\,dy = 1$ implies $\chi(y) = o(\|y\|^{-1})$ with the help of the monotonicity condition B2. For given C, choosing a and h large enough to construct the set E, we have $\sup\{h^{-d}\chi((x-\theta)/h) : x \in C,\ (\theta, h) \in E^c\} < c\epsilon/4$, for any given ε.
Now we consider Prior 2 with location-scale family kernels. Let the location parameter for the density be mixed according to P following a prior Π. Let the scale parameter h be a hyper-parameter, which is also given a prior distribution µ. Assume that h and P are a priori independently distributed. We let $\Pi^*$ denote the prior for the density functions on X, induced by Π × µ via the mapping $(P, h) \mapsto f_{P,h}(x) := \int h^{-d}\chi((x-\theta)/h)\,dP(\theta)$. We then have the following theorem.
Theorem 3. For the prior described above, let χ(x) and $f_0(x)$ be densities on X satisfying Conditions B1-B9. Then $f_0 \in KL(\Pi^*)$.
Proof. The proof uses Theorem 1 and Lemmas 2 and 3. Verification of Conditions A7-A9 is similar to (but easier than) that in Theorem 2. The second inequality in Condition B7 implies that Condition A5 is satisfied. Conditions A4 and A6 are satisfied since χ(·) is a continuous probability density function and the kernel we consider here is a location family of χ(·) with a fixed scale. Condition A1 can be proved in the same way as in the proof of Theorem 2.

Examples
In this section, we discuss the KL property for some kernel mixture priors with concretely specified kernels. More precisely, we prove that the property holds under some conditions on the true density when the kernel is chosen to be skew-normal (normal also, as it is a special case), multivariate normal, logistic, double exponential, t (Cauchy also as it is a special case), histogram, triangular, uniform, scaled uniform, exponential, log-normal, gamma, inverse gamma and Weibull densities.

Location-scale kernels
For a given density χ(·) supported on the entire real line (or $\mathbb{R}^d$ when X is d-dimensional), we shall consider two types of kernel mixture priors: Prior 1, where both the location parameter θ and the scale parameter φ of $\phi^{-d}\chi((x-\theta)/\phi)$ are mixed according to a random probability measure on $\mathbb{R}^d \times (0, \infty)$, and Prior 2, where θ is mixed according to a random probability measure P on $\mathbb{R}^d$ and φ is given a prior µ on (0, ∞). The KL property may be verified by checking Conditions B1-B9 for the kernel and applying Theorem 2 or Theorem 3, respectively.
In this subsection, we consider several examples of location-scale kernels. Conditions B1 and B2 can be easily verified. Conditions B4-B6 are assumed in each of the following theorems for the location-scale density kernels. By choosing the prior on P as described in Remark 1, Condition B8 can be satisfied. In this subsection, only the multivariate normal density has a mixing parameter θ with dimension d ≥ 2. For this kernel, Condition B9 is obviously satisfied. Hence, in the rest of this subsection, for each kernel function and corresponding prior, we only show that Conditions B3 and B7 are satisfied.

Skew-normal density kernel
Consider the skew-normal kernel where the skewness parameter λ is given. We have the following result.
Theorem 4. Assume that the prior Π satisfies B8. Let f 0 (x) be a continuous density on R satisfying conditions B4, B5, B6 and there exists η > 0 such that for any a and b, where c 1 (x) and c 2 (x) are bounded functions here.

Remark 4.
With λ = 0, Theorem 4 implies Theorem 3.2 of [29], since the normal density is a special case of the skew-normal.

Multivariate normal density kernel
Let We have the following result.
Theorem 5. Assume that the prior Π satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}^d$ satisfying Conditions B4, B5, B6 and that
Proof. The proof of this theorem is very similar to the proof of Theorem 4, with λ = 0 and some other minor modifications in all the steps except in verifying Condition B7. Note that for some bounded functions $c_1(x)$ and $c_2(x)$, we have that and similarly for any a and b.

Logistic density kernel
Let the kernel be $\chi(x) = e^{-x}/(1 + e^{-x})^2$. We have the following result.

t ν -density kernel
Let the kernel be given by where the degrees of freedom ν is given. Let $\log_+ u = \max(\log u, 0)$. We have the following result.
Proof. Condition B3 is satisfied, since χ′(z) Condition B7 can be verified by observing that the tail of $|\log\chi_\nu(x)|$ grows like log |x| as |x| → ∞.
Remark 5. Since the Cauchy density is the t-density with ν = 1, Theorem 8 applies to the Cauchy kernel.

Kernels with bounded support
The priors with kernels supported on [0, 1] are preferred for estimating densities supported on [0, 1]. We study the KL property of such priors using Theorem 1.
The following lemma will be used repeatedly in the proofs below.

Histogram density kernel
Let the kernel function be Consider a kernel mixture prior obtained by mixing both θ and m. We have the following result. An analogous result holds when only θ is mixed and m is given a prior with infinite support.
To see this, define . By Riemann integrability of a continuous function, for any ǫ 1 > 0, there exists M 1 > 0, such that for m > M 1 , | m 1 Since f 0 is continuous on a compact set, it is uniformly continuous. Hence, for any given , we have where M is an upper bound for f 0 on [0, 1]. Hence, by choosing ǫ 1 and ǫ 2 small enough, there exists M 3 = max(M 1 , M 2 ) such that for m > M 3 , (13) holds.
Since we consider $f_0$ bounded away from 0 here, Condition A1 will be satisfied by choosing $m_\epsilon$ large enough and appropriate weights $\{w_1, \ldots, w_{m_\epsilon}\}$. Let where $0 < \delta_1 < 1/4$ and ε > 0. Since W is not empty and it is an open neighborhood of some distribution that belongs to the support of Π, we have Π(W) > 0. For P ∈ W, we have, with the index i corresponding to the given x, $f_{P_{m_\epsilon}}/f_P < e^{\epsilon}$, and hence $\int f_0 \log(f_{P_{m_\epsilon}}/f_P) < \epsilon$ for all P ∈ W.
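The approximation step behind Condition A1 for the histogram kernel can be illustrated numerically. The sketch below makes several assumptions that are not taken from the paper: the histogram mixture with m equal-width bins is taken to be m·w_i on the i-th bin, the weights are the bin probabilities under $f_0$, and $f_0(x) = 0.5 + x$ on [0, 1] is a hypothetical true density that is continuous and bounded away from 0.

```python
import math

def f0(x):
    """Hypothetical true density on [0, 1], bounded away from zero."""
    return 0.5 + x

def F0(x):
    """Distribution function of f0."""
    return 0.5 * x + 0.5 * x * x

def histogram_mixture(x, m):
    """Assumed histogram mixture: height m * w_i on bin i, with w_i the f0-mass of the bin."""
    i = min(int(x * m), m - 1)
    return m * (F0((i + 1) / m) - F0(i / m))

def kl_to_histogram(m, n=20000):
    """Midpoint-rule approximation of the KL divergence from f0 to the m-bin histogram."""
    step = 1.0 / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * step
        total += f0(x) * math.log(f0(x) / histogram_mixture(x, m))
    return total * step

kl_values = [kl_to_histogram(m) for m in (2, 8, 32)]
```

The divergence shrinks as the number of bins grows, which is the behavior the choice of a large $m_\epsilon$ relies on.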

Triangular density kernel
Let the kernel function be Construct a kernel mixture prior by mixing both m and n. We have the following result.
Proof. Since the mixing parameters are discrete, defining $w_i = f_0(i/n) / \sum_{j=0}^{n} f_0(j/n)$ and letting $W = \{P : P(i/n) > w_i e^{-\epsilon},\ i = 1, 2, \ldots, n\}$, we can complete the proof as in Theorem 9.

Bernstein polynomial kernel
In the literature, Bernstein polynomials have been used to estimate densities under both frequentist and Bayesian frameworks. The motivation of the prior comes from the fact that any bounded function on [0, 1] can be approximated by a Bernstein polynomial at each point of continuity of the function; see [22].
Proof. Though the prior is slightly different from Prior 2 in that $\Pi_k$ is allowed to depend on k, we can still use Theorem 1 by changing Π(W) > 0 to $\Pi_k(W) > 0$ for any given k. This follows since k is discrete. By Lemma 4, we may assume that $f_0$ is bounded from below. Since Bernstein polynomials uniformly approximate any continuous density (see, for instance, Theorem 1 of [5]), it follows that Condition A1 is satisfied. Condition A3 holds by the discreteness of k and the assumed positivity condition of its prior. The rest of the proof proceeds as before by considering all possible weights $w'_j > w_j e^{-\epsilon}$.
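The uniform approximation used for Condition A1 can be checked on a simple example. The sketch below assumes the standard Bernstein density $\sum_{j=1}^{k} [F_0(j/k) - F_0((j-1)/k)]\,\beta(x; j, k-j+1)$, which is not displayed in this section, and uses the hypothetical true density $f_0(x) = 0.5 + x$; for this particular $f_0$ the sup-norm error works out to 0.5/k, attained at the endpoints.

```python
import math

def f0(x):
    """Hypothetical true density on [0, 1]."""
    return 0.5 + x

def F0(x):
    """Distribution function of f0."""
    return 0.5 * x + 0.5 * x * x

def beta_density(x, j, k):
    """Beta(j, k - j + 1) density, written with a binomial coefficient."""
    return k * math.comb(k - 1, j - 1) * x ** (j - 1) * (1.0 - x) ** (k - j)

def bernstein_density(x, k):
    """Assumed Bernstein density of order k with weights F0(j/k) - F0((j-1)/k)."""
    return sum((F0(j / k) - F0((j - 1) / k)) * beta_density(x, j, k)
               for j in range(1, k + 1))

def sup_error(k, grid_size=200):
    """Largest absolute error on an equally spaced grid that includes 0 and 1."""
    return max(abs(bernstein_density(i / grid_size, k) - f0(i / grid_size))
               for i in range(grid_size + 1))

errors = {k: sup_error(k) for k in (5, 20, 80)}
```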

Lognormal density kernel
Let the kernel function K(x; θ, φ) be the lognormal density. Consider a type I or type II mixture prior based on this kernel. Transform x → $e^y$ in the kernel function and in $f_0$. If the model using $e^y K(e^y;\theta,\phi)$ as the kernel function possesses the KL property at $e^y f_0(e^y)$, then the corresponding model using K(x; θ, φ) as the kernel function possesses the KL property at $f_0(x)$. This is because, by the change of variable $x = e^y$, $\int f_0(x) \log\frac{f_0(x)}{\int K(x;\theta,\phi)\,dP(\theta)}\,dx = \int e^y f_0(e^y) \log\frac{e^y f_0(e^y)}{\int e^y K(e^y;\theta,\phi)\,dP(\theta)}\,dy < \epsilon$.
For the lognormal kernel, we have the following result.
Theorem 12. Assume that the prior Π satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}_+$ satisfying
1. $f_0$ is nowhere zero except at x = 0 and bounded above by M < ∞;
Then $f_0 \in KL(\Pi^*)$.

Weibull density kernel
The Weibull density is a widely used kernel function. Ghosh and Ghosal [17] discussed a model using this density as the kernel function and showed posterior consistency useful in survival analysis. However, the assumption on the true density $f_0$ made there was quite strong. Here we establish the KL property with this kernel under very general assumptions.
The Weibull kernel is given by $K(x;\theta,\phi) = \theta\phi^{-1}x^{\theta-1}e^{-x^\theta/\phi}$. We can transform this kernel using the map $x = e^y$ to $e^y K(e^y;\theta,\phi) = \theta\,W(\theta(y - \theta^{-1}\log\phi))$, where $W(z) = \exp[z - e^z]$, the location parameter is $\theta^{-1}\log\phi$ and the scale parameter is $\theta^{-1}$. We have the following result.
Theorem 13. Let f 0 (x) be a continuous density on R + satisfying 1. f 0 is nowhere zero except at x = 0 and bounded above by M < ∞; .
Proof. We need to verify Conditions B3-B7 for the kernel W(·) and the true density $e^y f_0(e^y)$. Condition B3 is satisfied, since we have $W'(z)/W(z) = 1 - e^z$. To verify Condition B7, observe that Condition 4 of this theorem implies R e y f 0 (e y ) log e 2|y| 1+η W (e 2|y| 1+η )dy < ∞ and R e y f 0 (e y )| log W (e y−a b )|dy < ∞.
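The two algebraic facts used for the Weibull kernel, namely that the log-transformed kernel is a location-scale family generated by W and that W′(z)/W(z) = 1 − e^z, are easy to confirm numerically. The sketch below is only a sanity check; the function names and the test values of θ, φ, y and z are arbitrary choices made for illustration.

```python
import math

def weibull_kernel(x, theta, phi):
    """K(x; theta, phi) = theta * phi^{-1} * x^{theta-1} * exp(-x^theta / phi)."""
    return (theta / phi) * x ** (theta - 1.0) * math.exp(-(x ** theta) / phi)

def W(z):
    """W(z) = exp(z - e^z)."""
    return math.exp(z - math.exp(z))

def transformed_kernel(y, theta, phi):
    """Density of Y = log X when X has the Weibull kernel: e^y K(e^y; theta, phi)."""
    x = math.exp(y)
    return x * weibull_kernel(x, theta, phi)

def location_scale_form(y, theta, phi):
    """h^{-1} W((y - location) / h) with location log(phi)/theta and scale h = 1/theta."""
    location = math.log(phi) / theta
    scale = 1.0 / theta
    return W((y - location) / scale) / scale

def log_derivative_of_W(z, step=1e-5):
    """Central finite-difference approximation of W'(z) / W(z)."""
    return (W(z + step) - W(z - step)) / (2.0 * step * W(z))

theta, phi = 1.7, 2.5  # arbitrary illustrative parameter values
max_gap = max(abs(transformed_kernel(y, theta, phi) - location_scale_form(y, theta, phi))
              for y in (-3.0, -1.0, 0.0, 0.5, 2.0))
```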

Gamma density kernel
The gamma density is one of the most widely used kernel functions for density estimation on [0, ∞). Hanson [18] discussed a model using the gamma density as the kernel, with a hierarchical structure that has as many stages as the most general one we discussed in Section 1. Chen [4] and Bouezmarni and Scaillet [3] discussed a mixture of gamma model with a different parametrization.
Proof. We use $K_m(x;\alpha)$ to denote $K(x;\alpha,m^{-1})$. Let where as a density function for α, and $\mathbb{1}(\cdot)$ is the indicator function. Obviously, $P_m$ is compactly supported and $f_m(x) = f_{P_m}(x)$. Let $F_m$ be the probability measure corresponding to $f_m$. By Lemma 5 in the Appendix, $\int f_0(x)\log\frac{f_0(x)}{f_m(x)}\,dx \to 0$ as m → ∞, which implies that Condition A1 is satisfied.
To complete the proof, we show that Condition A3 is satisfied by verifying the conditions of Lemma 3. For any given Based on expressions (19), (20) and (25) in the Appendix, we have log inf K(x; α, β) By Condition B7*, we have that $\int |\log\inf_{(\alpha,\beta)\in D}K(x;\alpha,\beta)|\,f_0(x)\,dx < \infty$. Further, $\log f_{m_\epsilon}(x)$ is also $f_0$-integrable by a similar argument. Condition A8 is obviously satisfied. Condition A9 is satisfied by letting E be a large enough compact set containing D. This proves the theorem.

Inverse gamma density kernel
The inverse gamma density function is defined as $h(x;a,b) = \frac{b^a}{\Gamma(a)}x^{-a-1}e^{-b/x}$. We consider the reparametrization K(x; k, z) = h(x; k, kz) as the kernel function and construct mixture priors. Let $\phi_\delta$ be defined as in (14). We have the following result.

Proof. Observe that
where g is the gamma density. By Proposition 3.1 in [3], we have for any x ∈ [0, ∞), The derivative of the logarithm of the expression on the RHS of above relation are given by, is the digamma function, and its details is given in the proof of Lemma 5 in the Appendix. Therefore and hence where Ga is the cumulative distribution function (c.d.f.) of gamma distribution with parameter (m + 1, 1). For large m, the last expression is bounded below by {Φ(1 + δ/x) − Φ(1)}/2 in view of the central limit theorem. Similarly, for 1 ≤ x < m and large m, the lower bound for the left hand side (LHS) of (16) is fm(x) dx → 0 as m → ∞, which implies that Condition A1 is satisfied. Similarly as in the proof of Theorem 14, we can show that Condition A3 is also satisfied.

Exponential density kernel
Consider a mixture prior based on the exponential kernel. Let $K(x;\theta) = \theta e^{-\theta x}$. Recall that a function φ on $\mathbb{R}_+$ is completely monotone if it possesses derivatives $\varphi^{(n)}$ of all orders and $(-1)^n\varphi^{(n)}(x) \ge 0$ for x > 0. Let $\bar F_0(x) = 1 - F_0(x)$, where $F_0$ is the distribution function corresponding to the density function $f_0$. We have the following result.
Theorem 16. If $f_0$ is a continuous density on $\mathbb{R}_+$, x and $|\log f_0(x)|$ are $f_0$-integrable, $\bar F_0(x)$ is completely monotone, and the weak support of Π is $M(\mathbb{R}_+)$, then $f_0 \in KL(\Pi^*)$.
Since log f 0 (x) and x are both f 0 -integrable, by the DCT, we have as a → ∞. Thus Condition A1 is satisfied.
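Complete monotonicity of the survival function is what allows $f_0$ to be written as a mixture of exponential kernels. As an illustration that is not taken from the paper, the sketch below uses a Lomax density $f_0(x) = a(1+x)^{-(a+1)}$, whose survival function $(1+x)^{-a}$ is completely monotone, and checks numerically that it equals the exponential mixture with a Gamma(a, 1) mixing density; the shape a = 2, the truncation point and the function names are assumptions.

```python
import math

def exponential_kernel(x, theta):
    """K(x; theta) = theta * exp(-theta * x)."""
    return theta * math.exp(-theta * x)

def gamma_mixing_density(theta, shape):
    """Gamma(shape, rate 1) density, the assumed mixing density on theta."""
    return theta ** (shape - 1.0) * math.exp(-theta) / math.gamma(shape)

def exponential_mixture(x, shape, upper=60.0, n=20000):
    """Trapezoidal approximation of the mixture of K(x; theta) over theta in [0, upper]."""
    step = upper / n
    total = 0.0
    for i in range(n + 1):
        theta = i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * exponential_kernel(x, theta) * gamma_mixing_density(theta, shape)
    return total * step

def lomax_density(x, shape):
    """f0(x) = shape * (1 + x)^{-(shape + 1)}."""
    return shape * (1.0 + x) ** (-(shape + 1.0))

gaps = [abs(exponential_mixture(x, 2.0) - lomax_density(x, 2.0)) for x in (0.0, 0.5, 2.0)]
```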

Scaled uniform density kernel
Let the true density $f_0$ be supported on X = $\mathbb{R}_+$, and consider a mixture prior based on the scaled uniform kernel $K(x;\theta) = \theta^{-1}\mathbb{1}\{0 \le x \le \theta\}$.
Theorem 17. If $f_0(x)$ is a continuous and decreasing density function on $\mathbb{R}_+$ such that $\int f_0|\log f_0| < \infty$ and the weak support of Π is $M(\mathbb{R}_+)$, then $f_0 \in KL(\Pi^*)$.
Since $|\log f_0(x)|$ is $f_0$-integrable, by using the DCT, we have $\int f_0\log(f_0/f_m) \to 0$ as m → ∞. Condition A3 is satisfied by a similar argument as in the proof of Theorem 9.
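The requirement that $f_0$ be decreasing reflects the fact that mixtures of scaled uniform kernels are decreasing densities. As a hedged illustration not taken from the paper, the sketch below takes $f_0(x) = e^{-x}$ and the mixing density $p(\theta) = -\theta f_0'(\theta) = \theta e^{-\theta}$, and checks numerically that the resulting uniform mixture recovers $f_0$; the truncation point and function names are assumptions, and the quadrature is written for x > 0 only.

```python
import math

def scaled_uniform_kernel(x, theta):
    """K(x; theta) = 1/theta on [0, theta], zero elsewhere."""
    return 1.0 / theta if 0.0 <= x <= theta else 0.0

def mixing_density(theta):
    """Assumed mixing density p(theta) = -theta * f0'(theta) for f0(x) = exp(-x)."""
    return theta * math.exp(-theta)

def uniform_mixture(x, upper=40.0, n=40000):
    """Trapezoidal approximation of the mixture over theta in [x, upper]; needs x > 0.

    The kernel vanishes for theta < x, so the integral starts at theta = x.
    """
    step = (upper - x) / n
    total = 0.0
    for i in range(n + 1):
        theta = x + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * scaled_uniform_kernel(x, theta) * mixing_density(theta)
    return total * step

gaps = [abs(uniform_mixture(x) - math.exp(-x)) for x in (0.1, 1.0, 3.0)]
```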
as m → ∞. We show that for any δ > 0, as m → ∞, since for any given x and δ, there exists q < 0 such that 1 + log(x/v) − x/v < q for all the v ∈ A m , |x − v| > δ.
Thus Conditions C1-C3 in Lemma 7 are all satisfied and we have that $f_m(x) \to f_0(x)$ as m → ∞ for each x > 0.
Lemma 8. Let $K_m(x;\alpha)$ be defined as in Section 12. If Condition B7* is satisfied, then there exists a function 0 < C(x) < 1 such that and $\int \log\frac{1}{C(x)}\,f_0(x)\,dx < \infty$.
Proof. For $m^{-1} < x < 1$, applying Stirling's inequality and noting that v < x + δ < 1 + δ in the following integral, it follows that x mv e m(v−x) (v + m −1 ) mv dv. (29)