A novel approach to Bayesian consistency

Abstract: It is well known that the Kullback-Leibler support condition implies posterior consistency in the weak topology, but it is not sufficient for consistency in the total variation distance; a counter-example exists. Since then many authors have proposed sufficient conditions for strong consistency, and the aim of the present paper is to introduce new conditions with specific application to nonparametric mixture models with heavy-tailed components, such as the Student-t. The key is a more focused result on a particular class of densities: if strong consistency fails, it must fail on densities of this type. This allows us to move away from the traditional types of sieves currently employed.


Introduction
In this paper we consider a novel approach to Bayesian consistency in nonparametric problems, concentrating on mixture models, which are the type of nonparametric model most commonly used in practice. The first formulation of posterior consistency is due to Doob [9], but his approach has a drawback in infinite-dimensional models; see [7,8]. Instead it is commonly assumed that the observations are i.i.d. from some fixed but unknown density function, and a general sufficient condition for weak consistency is given by Schwartz [25].
To set the scene, assume that the observations $X_1, \dots, X_n$ are i.i.d. real-valued random variables from a true density $p_0$. Let the model $\mathcal{L}$ be the space of all Lebesgue densities on $(\mathbb{R}, \mathcal{R})$ equipped with the total variation metric, and let $\Pi$ be a prior on $(\mathcal{L}, \mathbb{L})$, where $\mathcal{R}$ and $\mathbb{L}$ are the corresponding Borel $\sigma$-algebras. Formally, for a (pseudo-)metric $d$ on $\mathcal{L}$, the posterior distribution $\Pi(\cdot \mid X_1, \dots, X_n)$ is said to be $d$-consistent at $p_0$ if $\Pi(d(p_0, p) > \eta \mid X_1, \dots, X_n)$ converges to zero in probability for every $\eta > 0$. When $d$ is the total variation (Lévy-Prokhorov, resp.) metric, the posterior is called strongly (weakly, resp.) consistent.
Let $K(p, q) = \int p \log(p/q)\,d\mu$ be the Kullback-Leibler (KL) divergence, where $\mu$ is the Lebesgue measure. Schwartz [25] has shown that if $p_0$ lies in the KL support of $\Pi$, that is,
$$\Pi\bigl(p \in \mathcal{L} : K(p_0, p) < \delta\bigr) > 0 \quad \text{for every } \delta > 0, \qquad (1.1)$$
then the posterior distribution is weakly consistent at $p_0$. Along with the KL support condition (1.1), various sufficient conditions for strong consistency have been studied in infinite-dimensional models; see [3,29,28,6] for general conditions. Some important references concerning specific models and priors are [1,11,13,5]. Further work incorporating convergence rates can be found in [16,12,31,21], for example. Since the total variation metric is stronger than the Lévy-Prokhorov metric, see [20, p. 34], the KL support condition (1.1) is not in general sufficient for strong consistency. In this regard, Barron, Schervish and Wasserman [3] constructed a prior satisfying the KL support condition (1.1) for which the corresponding posterior distribution is not strongly consistent. Walker, Lijoi and Prünster [30] explained this phenomenon with the notion of data tracking.
In this paper we present a new sufficient condition for strong consistency and apply it to nonparametric mixture models. Since convergence in the Lévy-Prokhorov metric is equivalent to weak convergence, once the prior satisfies the KL support condition (1.1), Schwartz's theorem implies that there exists some sequence $\epsilon_n \downarrow 0$ such that $\Pi(d_P(p_0, p) > \epsilon_n \mid X_1, \dots, X_n)$ converges to zero in probability. For strong consistency, therefore, it suffices to show that $\Pi(A_{n,\eta} \mid X_1, \dots, X_n) \to 0$ in probability for every $\eta > 0$, where
$$A_{n,\eta} = \bigl\{p \in \mathcal{L} : d_P(p_0, p) \le \epsilon_n,\ d_V(p_0, p) > \eta\bigr\}. \qquad (1.2)$$
The new approach is based on the fact that $A_{n,\eta}$ is a collection of "weird" densities, in the sense that it consists of highly fluctuating densities centered around $p_0$. With a reasonable prior, therefore, the prior mass on $A_{n,\eta}$ is negligible, which in turn implies strong consistency. The focus on $A_{n,\eta}$ allows us to move away from the typical uses of sieves. Our approach is very different from [30], relying on a special property of the densities in $A_{n,\eta}$. The new approach entails different kinds of sieves, avoiding the calculation of Hellinger entropy or of prior probabilities of small Hellinger balls. Instead, we require a Lévy-Prokhorov convergence rate $(\epsilon_n)$, for which we provide a general sufficient condition. Our new approach significantly simplifies the conditions required on the hyperparameter of a Dirichlet process in a mixture model, for example. In particular, the mean parameter can have an arbitrarily heavy tail. We also consider a mixture of Student's t distributions, which can be used to model heavy-tailed distributions; its consistency has yet to be established in the literature.
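To make the "weird density" picture concrete, the following sketch (our illustration; the paper's Figure 2 plays the same role) constructs densities that converge weakly to the uniform density while staying at a fixed total variation distance from it. The oscillating family $p_n$ below is an assumption of ours, chosen only for illustration.

```python
import numpy as np

# p0 = Unif(0,1) and p_n(x) = 1 + cos(2*pi*n*x): a highly fluctuating density
# centered around p0. As n grows, p_n -> p0 weakly, yet
# d_V(p0, p_n) = integral of |cos(2*pi*n*x)| = 2/pi stays bounded away from 0.
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]
p0 = np.ones_like(x)

for n in (1, 10, 100):
    pn = 1.0 + np.cos(2 * np.pi * n * x)
    tv = np.sum(np.abs(pn - p0)) * dx                   # total variation distance
    cdf_gap = np.max(np.abs(np.cumsum(pn - p0) * dx))   # sup-distance between CDFs;
    # the Levy metric, which metrizes weak convergence on R, is bounded by this gap
    print(f"n = {n:3d}: d_V = {tv:.3f}, sup CDF gap = {cdf_gap:.5f}")
```

The total variation distance stays near $2/\pi \approx 0.637$ while the CDF gap, and hence the weak distance, decays like $1/n$.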

Notation
For $p \in \mathcal{L}$, the corresponding probability measure is denoted $P$, and vice versa. The expectation of a function $f$ with respect to $P$ is denoted $Pf$, i.e. $Pf = \int f(x)\,dP(x)$. The expectation under the true distribution is denoted $\mathrm{E}$. Let $d_V$ and $d_H$ be the total variation and Hellinger metrics. The indicator function of a set $A$ is denoted $\mathbb{1}_A$. For two positive sequences $(a_n)$ and $(b_n)$, $a_n \ll b_n$ means that $a_n/b_n \to 0$. The maximum of two numbers $a$ and $b$ is denoted $a \vee b$. The inequality $\lesssim$ means "less than up to a constant multiple," where the constant is universal (such as 2, $\pi$, $e$) unless specified explicitly.
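As a quick numerical illustration of these quantities (a sketch of ours, with two normal densities standing in for $p_0$ and $p$, and $d_V$ taken as the $L_1$ distance so that $d_V \le 2$, as used later in the proofs):

```python
import numpy as np

# Grid approximations of K(p, q) = int p log(p/q), d_V(p, q) = int |p - q|,
# and the Hellinger distance d_H(p, q) = (int (sqrt(p) - sqrt(q))^2)^{1/2}.
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = normal_pdf(x, 0.0, 1.0)   # stand-in for p0
q = normal_pdf(x, 0.5, 1.2)   # stand-in for p

kl = np.sum(p * np.log(p / q)) * dx
tv = np.sum(np.abs(p - q)) * dx
hell = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
print(f"K = {kl:.4f}, d_V = {tv:.4f}, d_H = {hell:.4f}")
# One can check numerically that d_H^2 <= d_V <= 2 d_H, consistent with the
# standard inequalities between these metrics.
```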

Main results
For $p \in \mathcal{L}$ and $\gamma > 0$, define a non-negative function $p_\gamma$ on $\mathbb{R}$ as
$$p_\gamma(x) = \frac{1}{2\gamma} \int_{B_\gamma(x)} p(y)\,dy,$$
where $B_\gamma(x) = \{y \in \mathbb{R} : |y - x| < \gamma\}$. Note that $p_\gamma = p * U_\gamma$, where $*$ denotes convolution and $U_\gamma$ is the uniform distribution on the interval $[-\gamma, \gamma]$. Therefore, $p_\gamma$ is also a probability density, which can be understood as a smoothed version of $p$, where $\gamma$ controls the degree of smoothness. For example, suppose
$$p(y) = \begin{cases} 2 & 0 < y < 1/4 \text{ or } 1/2 < y < 3/4, \\ 0 & \text{otherwise}; \end{cases}$$
then $p_\gamma$ rounds off the two plateaus; see Figure 1. For simplicity, $(p_0)_\gamma$ is written as $p_{0,\gamma}$. For two probability measures $P$ and $Q$, let
$$d_P(P, Q) = \inf\bigl\{\epsilon > 0 : P(A) \le Q(A^\epsilon) + \epsilon \text{ for every Borel set } A\bigr\}$$
be the Lévy-Prokhorov metric, where $A^\epsilon = \cup_{x \in A} B_\epsilon(x)$. Note that convergence in $d_P$ is equivalent to weak convergence, and one of the two inequalities in the usual definition of $d_P$ can be omitted; see [17].

For a given density $p_0$, suppose that a density $p$ is close to $p_0$ in $d_P$ but far away from $p_0$ in $d_V$. This is only possible when $p$ is a "weird" density, in the sense that it fluctuates rapidly around $p_0$, as illustrated in Figure 2. An important property of such a density is that $d_V(p, p_\gamma)$ is large even for small $\gamma$. Note that for every fixed $p \in \mathcal{L}$, $d_V(p, p_\gamma)$ converges to zero as $\gamma \to 0$, by the Lebesgue differentiation theorem and Scheffé's lemma, but the convergence is never uniform over $\mathcal{L}$, owing to highly fluctuating densities. Therefore, if the prior probability of densities with large $d_V(p, p_\gamma)$ is sufficiently small, the posterior distribution will be strongly consistent. The key point here is that after excluding weird densities from $\mathcal{L}$, $d_V(p, p_\gamma)$ can be shown to converge uniformly.

To be more specific, note that by Schwartz's theorem [25], the KL support condition (1.1) guarantees the existence of a sequence $\epsilon_n \downarrow 0$ such that
$$\Pi\bigl(d_P(p_0, p) > \epsilon_n \mid X_1, \dots, X_n\bigr) \to 0 \text{ in probability.} \qquad (2.1)$$
If $d_P(p, p_0) \le \epsilon_n$, then for any sequence $(\gamma_n)$ with $\gamma_n \to 0$ and $\epsilon_n/\gamma_n \to 0$, we have
$$d_V(p_0, p) \le d_V(p, p_{\gamma_n}) + o(1) \qquad (2.2)$$
as $n \to \infty$, where the $o(1)$ term depends on $\epsilon_n$, $\gamma_n$ and $p_0$ only; see the proof of Theorem 2.1. Condition (ii) of Theorem 2.1 is not essential but simplifies the proof; note that condition (ii) holds if the tail of $p_0$ is not heavier than that of the Cauchy distribution. Since $d_V(p, p_{\gamma_n}) = o(1)$ for every fixed $p \in \mathcal{L}$, the sets $(L_n)$ can typically be chosen to increase to $\mathcal{L}$, constituting new sieves. In the existing Bayesian literature, such sieves are required to have bounded entropy [11,3] or to satisfy a certain prior summability condition [29]. Instead of these conditions, our requirement is (2.3), which eventually gives $A_{n,\eta} \cap L_n = \emptyset$, where $A_{n,\eta}$ is defined in (1.2). Note that $A_{n,\eta}$ decreases to the singleton $\{p_0\}$, while $L_n$ grows to the whole set $\mathcal{L}$. As illustrated in the next section, we can easily find $(\gamma_n)$ and $(L_n)$ satisfying (2.3) in nonparametric mixture models.
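A small sketch of ours illustrating the smoothing operator and the uniformity phenomenon: for the two-plateau density above, $d_V(p, p_\gamma)$ vanishes as $\gamma \to 0$, whereas for an oscillating density it stays large until $\gamma$ drops below the oscillation scale. The grid convolution below has minor boundary effects and is for illustration only.

```python
import numpy as np

# p_gamma = p * U_gamma, i.e. p_gamma(x) = (2*gamma)^{-1} * integral of p over
# B_gamma(x), approximated by a moving average on a grid.
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]

def smooth(p, gamma):
    w = max(1, int(round(gamma / dx)))
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    return np.convolve(p, kernel, mode="same")   # zero-padding at the boundary

def tv(p, q):
    return np.sum(np.abs(p - q)) * dx

plateau = np.where((x < 0.25) | ((x > 0.5) & (x < 0.75)), 2.0, 0.0)
weird = 1.0 + np.cos(2 * np.pi * 100 * x)        # oscillation scale 1/100

for gamma in (0.1, 0.01, 0.001):
    print(f"gamma={gamma:6.3f}: plateau d_V = {tv(plateau, smooth(plateau, gamma)):.3f}, "
          f"weird d_V = {tv(weird, smooth(weird, gamma)):.3f}")
```

For the plateau density $d_V(p, p_\gamma)$ is of order $\gamma$, while for the oscillating density it remains near $2/\pi$ until $\gamma \ll 1/100$, matching the non-uniformity described above.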
Note that in Barron's counter-example [3], the prior puts large mass on a set of weird densities such as the one in Figure 2. As a consequence, we cannot choose a sequence of sets (L n ) satisfying (2.3), resulting in posterior inconsistency.
It should be emphasized that to prove (2.3) we need to know a Lévy-Prokhorov rate $(\epsilon_n)$, which can be interpreted as the "price" for avoiding the construction of complicated sieves. Note that the KL support condition guarantees the existence of "some" rate sequence $(\epsilon_n)$. If we do not know what $\epsilon_n$ is, we only know that there exists a sequence $\gamma_n$ such that $\gamma_n \to 0$ and $\epsilon_n/\gamma_n \to 0$. If $\gamma_n$ converges too slowly, however, a sieve $L_n$ satisfying the second condition of (2.3) cannot contain sufficiently many densities, and as a consequence the posterior probability of $L_n^c$ might not be sufficiently small.

For a given sequence $(\delta_n)$, define a specialized KL ball around $p_0$ as
$$K_n = \Bigl\{p \in \mathcal{L} : \int p_0 \log\frac{p_0}{p}\,d\mu \le \delta_n^2,\ \int p_0 \Bigl(\log\frac{p_0}{p}\Bigr)^2 d\mu \le \delta_n^2\Bigr\}.$$
The prior mass condition $\Pi(K_n) \ge e^{-n\delta_n^2}$ is a standard assumption used to achieve a posterior convergence rate of at least $(\delta_n)$; see for example [12,31]. Let $B_n = \{p \in \mathcal{L} : d_P(p, p_0) \le \epsilon_n\}$. Since $d_P$ induces the weak topology, there exist a number $r > 0$ and a finite number of bounded continuous functions $g_1, \dots, g_k$ such that
$$\bigl\{p \in \mathcal{L} : |P g_j - P_0 g_j| \le r,\ j = 1, \dots, k\bigr\} \subset B_n.$$
Note that the number $k$ of sub-bases and the radius $r$ may depend on $p_0$ and $\epsilon_n$.
The key idea for obtaining the Lévy-Prokhorov rate is to find these numbers. In this context, it is shown in Lemma 5.2 that for every $\epsilon_n \downarrow 0$ with $n\epsilon_n^4 \to \infty$, there exists a sequence of tests $(\varphi_n)$ whose type I error probability, and whose type II error probability uniformly over $\{p \in \mathcal{L} : d_P(p_0, p) > \epsilon_n\}$, are bounded by $e^{-Kn\epsilon_n^4}$, where $K > 0$ is a universal constant. As a consequence, $\Pi(K_n) \ge e^{-n\delta_n^2}$ implies (2.1) for every $\epsilon_n \gtrsim \delta_n \vee n^{-1/4}$. Although $n^{-1/4}$ might be far from the optimal rate, it is sufficient for strong consistency in many examples.
Suppose that $\Pi(K_n) \ge e^{-c_1 n\delta_n^2}$ for a constant $c_1 > 0$ and a sequence $\delta_n \downarrow 0$. Then (2.1) holds for every sequence $\epsilon_n$ with $\delta_n \vee n^{-1/4} \lesssim \epsilon_n \ll 1$; see Lemma 8.1 of [12]. Combining this with the previous two theorems, we have the following corollary.

Mixture of normal distributions
Consider a location mixture of normal distributions
$$p_{F,\sigma}(x) = \int \phi_\sigma(x - z)\,dF(z),$$
where $\phi_\sigma(x) = \phi(x/\sigma)/\sigma$, $\phi$ is the standard normal density and $F$ is a probability measure. A prior $\Pi$ on $\mathcal{L}$ can be constructed by putting independent priors on $\sigma$ and $F$. With a slight abuse of notation, we use $\Pi$ to denote both a prior for $(\sigma, F)$ and the induced prior for $p$.
For $p = p_{F,\sigma}$, it can be shown that
$$d_V(p, p_\gamma) \lesssim \frac{\gamma}{\sigma}; \qquad (3.1)$$
see Lemma 5.3. Note that the right hand side of (3.1) depends on $p$ through $\sigma$ only. Therefore, sieves $(L_n)$ can be constructed independently of $F$.
For a concrete example, we consider an inverse gamma $\Gamma^{-1}(a_1, a_2)$ prior for $\sigma^2$, which is standard in both theory and practice, where $a_1, a_2 > 0$ are hyperparameters and $\Gamma^{-1}(a_1, a_2)$ denotes the inverse gamma distribution whose density is proportional to $x \mapsto x^{-a_1-1} e^{-a_2/x}$. Note that this prior on $\sigma^2$ puts little mass near zero, implying that the prior mass of densities with large $d_V(p, p_\gamma)$ for small $\gamma$ is nearly zero; cf. (3.1).
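The claim that the inverse gamma puts little mass near zero is easy to check numerically; a minimal sketch, with illustrative hyperparameter values of our choosing:

```python
from scipy.stats import invgamma

# Gamma^{-1}(a1, a2): density proportional to x^{-a1-1} e^{-a2/x}. Its mass
# near zero decays like e^{-a2/x}, so the prior probability of a very small
# bandwidth sigma is tiny; this is the property used to control the sieves.
a1, a2 = 2.0, 1.0
for beta in (0.5, 0.2, 0.1, 0.05):
    # P(sigma <= beta) = P(sigma^2 <= beta^2); scipy's invgamma takes the
    # shape a1 and scale a2.
    print(f"P(sigma <= {beta:4.2f}) = {invgamma.cdf(beta**2, a1, scale=a2):.3e}")
```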
In most examples $\Pi(K_n) \ge e^{-n\delta_n^2}$ holds with $\delta_n$ much smaller than $n^{-1/4}$, so the condition given in Theorem 3.2 is very mild. A natural choice for the prior on $F$ is $\mathrm{DP}(a_3, G)$, the Dirichlet process with precision $a_3 > 0$ and mean $G$. For the Dirichlet process mixture of normals prior, the prior concentration condition has been studied extensively in the literature; see for example [15,26,22,4]. In most existing papers, the true density $p_0$ is first approximated by a finite mixture $p^*$ with a sufficiently small number $N$ of components, and it is then proved that a DP mixture prior puts sufficiently large mass around $p^*$. It should be noted that in the above-mentioned papers, the tail of $G$ must be exponentially thin in order to construct suitable sieves. Lijoi, Prünster and Walker [23] partly resolved this problem using the martingale approach of [29], but it is still required that $G$ has a finite mean. With our approach, however, the only requirement is the prior concentration on $K_n$, which holds provided the tail of $G$ is not extremely thin; see Proposition 2 of [4] for the most recent result. Therefore, conditions on $G$ can be significantly weakened. For example, the Cauchy and heavier-tailed distributions can be taken for $G$, which is not allowed by any other existing method.
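For concreteness, one draw from this prior can be sketched by truncated stick-breaking; the hyperparameter values and the truncation level below are illustrative assumptions of ours, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# One draw from the prior: sigma^2 ~ Gamma^{-1}(a1, a2) and F ~ DP(a3, G)
# with G the standard Cauchy, truncated at T stick-breaking atoms. The heavy
# tail of G poses no difficulty for the present approach.
a1, a2, a3, T = 2.0, 1.0, 1.0, 500

sigma2 = 1.0 / rng.gamma(a1, 1.0 / a2)                      # sigma^2 ~ Gamma^{-1}(a1, a2)
v = rng.beta(1.0, a3, size=T)                               # stick-breaking proportions
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # DP weights
theta = rng.standard_cauchy(size=T)                         # atoms from G (heavy-tailed)

def p_F_sigma(x):
    """Density of the truncated mixture sum_j w_j * phi_sigma(x - theta_j)."""
    z = (x[:, None] - theta[None, :]) / np.sqrt(sigma2)
    return (np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi * sigma2) * w).sum(axis=1)

x = np.linspace(-20.0, 20.0, 4001)
print("sigma =", np.sqrt(sigma2),
      "; mass on [-20, 20] =", p_F_sigma(x).sum() * (x[1] - x[0]))
```

Because $G$ is Cauchy, occasional atoms land far from the origin, which is exactly the heavy-tailed behavior that exponentially thin base measures rule out.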

Mixture of Student's t distributions
If the true density $p_0$ is heavy-tailed, e.g. its tail is of polynomial order, then it is theoretically unknown whether Bayesian procedures based on normal mixtures work well. Practically, there are two possible ways to use a Dirichlet process mixture of normals for fitting data generated from a heavy-tailed distribution. The first is to use a location-scale mixture. In this regard, Tokdar [27] proved posterior consistency for a location-scale mixture under mild conditions; his result allows a heavy-tailed distribution such as the Cauchy for the true density. The second is to use a heavy-tailed mean parameter $G$. Unfortunately, for both methods it is challenging to push the theoretical results beyond consistency. In particular, existing mathematical tools for obtaining convergence rates might be difficult to apply, because it is rarely possible to find $(\delta_n)$ satisfying $\Pi(K_n) \ge e^{-n\delta_n^2}$ with a heavy-tailed $p_0$. We do not know whether this is due to mathematical difficulty or to an intrinsic limitation of normal mixtures.
As another alternative, we consider a mixture of Student's t distributions. While mixtures of Student's t distributions have been considered in applications, see for example [24,10,18], their asymptotic behavior has not been studied in the literature. In Bayesian analysis, this is due to the technical challenge of constructing suitable sieves with heavy-tailed components. Since the approach of the present paper avoids the construction of complicated sieves, it can also be applied to Student's t mixtures.
Let $h$ be the density of the Student's t distribution with $v > 0$ degrees of freedom, and let $h_\sigma(x) = \sigma^{-1} h(x/\sigma)$. For a fixed $v$, consider a location mixture of the form
$$p_{F,\sigma}(x) = \int h_\sigma(x - z)\,dF(z).$$
Similarly to the case of normal mixtures, for $p = p_{F,\sigma}$ we have
$$d_V(p, p_\gamma) \lesssim \frac{\gamma}{\sigma} \qquad (3.2)$$
by Lemma 5.4, where the constant in the inequality depends only on $v$. A prior $\Pi$ can be constructed by putting independent priors on $\sigma$ and $F$. As in the case of normal mixtures, we can put an inverse gamma prior on $\sigma^2$. We omit the proofs of the following two theorems because, after replacing (3.1) by (3.2), they are identical to those for the normal mixture case.

We put a $\mathrm{DP}(a_3, G)$ prior on $F$. Although the required condition for the prior concentration is mild, it is technically demanding to prove $\Pi(K_n) \ge e^{-cn\delta_n^2}$. We imitate techniques known for normal mixtures. As mentioned earlier, the key part of the proof is the approximation of $p_0$ by $p_{F,\sigma}$ for some $(F, \sigma)$ with $F$ finitely supported. To be a bit more specific, for any probability measure $F$ on a compact interval $[-a, a]$, the total variation distance between $\phi_\sigma * F$ and $\phi_\sigma * \tilde F$ is small if the first few, say $N$, moments of $F$ and $\tilde F$ are the same; see Lemma 3.1 of [14]. Also, there exists a discrete measure $\tilde F$ with at most $N$ support points such that this moment condition is satisfied; see Lemma A.1 of [14]. Since the Student's t distribution is a scale mixture of normal distributions [2], we have
$$h(x) = \int \phi_\tau(x)\,dH(\tau^{-2}),$$
where $H$ is the $\Gamma(v/2, v/2)$ distribution; see (5.3) for details. Therefore, by applying the finite approximation technique for continuous normal mixtures, a mixture of Student's t distributions can also be approximated by a finite mixture. Combining this with known concentration results for the Dirichlet distribution, we have the following theorem. Although the proof is long and quite similar to that in [15], we provide full details for the reader's convenience. We note that the main difference from normal mixtures is that a discrete measure $\tilde F$ should be constructed independently of the scale parameter; see Lemma 5.9.

Theorem 3.5. Put independent $\Gamma^{-1}(a_1, a_2)$ and $\mathrm{DP}(a_3, G)$ priors on $\sigma^2$ and $F$, respectively, where $v > 4$ and $G$ is the standard Cauchy distribution. Suppose that $p_0$ satisfies (2.4) and is twice continuously differentiable with first and second order derivatives $p_0'$ and $p_0''$. Furthermore, assume that $\int (p_0''/p_0)^2 p_0\,d\mu < \infty$, $\int (p_0'/p_0)^4 p_0\,d\mu < \infty$ and $P_0([-x, x]) \ge 1 - x^{-\beta}$ for some $\beta > 4/3$ and every large enough $x$. Then $\Pi(K_n) \ge e^{-n\delta_n^2}$ for some $\delta_n \lesssim n^{-1/4}$.
Note that the mean parameter $G$ of the Dirichlet process is assumed to be the standard Cauchy distribution, but it can be replaced by any other distribution whose tail is of polynomial order. Although Theorem 3.5 does not allow the Cauchy distribution as $p_0$, owing to the tail assumption on $p_0$, it would not be difficult to extend the result with a more elaborate proof. For example, if $p_0$ is smoother than the twice-differentiability condition of Theorem 3.5 requires, refined approximation techniques can be applied to obtain better rates, as in [22,4,26].
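The scale-mixture representation $h(x) = \int \phi_\tau(x)\,dH(\tau^{-2})$ used above is easy to verify numerically; a sketch of ours:

```python
import numpy as np
from scipy.stats import t as student_t, kstest

rng = np.random.default_rng(0)

# If tau^{-2} ~ H = Gamma(v/2, v/2) and Z ~ N(0, 1), then Z * tau follows the
# Student-t distribution with v degrees of freedom.
v, n = 5.0, 200000
w = rng.gamma(v / 2.0, 2.0 / v, size=n)      # w = tau^{-2} ~ Gamma(v/2, v/2) (shape, scale)
samples = rng.standard_normal(n) / np.sqrt(w)
print(kstest(samples, student_t(df=v).cdf))  # large p-value: consistent with t_v
```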

Discussion
The key idea in the proof of Theorem 2.1 lies in the inequality (2.2). This can be extended to consistency with a rate $(\eta_n)$. Assume for the moment that the support of $p_0$ is bounded. To find an upper bound in (2.2), we applied $d_V(p, p_{\gamma_n}) \lesssim \gamma_n/\sigma_n$, $d_V(p_0, p_{0,\gamma_n}) \lesssim \gamma_n$ and $d_V(p_{0,\gamma_n}, p_{\gamma_n}) \lesssim \epsilon_n/\gamma_n$. By taking $\gamma_n^2 \asymp \epsilon_n \sigma_n$, a rate sequence $(\eta_n)$ can be chosen with $\eta_n \asymp \sqrt{\epsilon_n/\sigma_n}$. However, this rate is far from optimal even when $\epsilon_n \asymp n^{-1/2}$ and $\sigma_n \asymp 1/\log n$. Better rates can be obtained if we have better bounds for $d_V(p, p_{\gamma_n})$, $d_V(p_0, p_{0,\gamma_n})$ and $d_V(p_{0,\gamma_n}, p_{\gamma_n})$ (or similar quantities). For example, if $p_0$ belongs to a $\beta$-Hölder class, the bound for $d_V(p_0, p_{0,\gamma_n})$ might be improved to $d_V(p_0, p_{0,\gamma_n}) \lesssim \gamma_n^\beta$, as with normal mixtures [22]. We leave this more delicate analysis of rates as future work; and since our approach does not require entropy calculations, we believe that it can eliminate additional $\log n$ terms appearing in the existing literature.
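Under the bounds just quoted, the balancing step can be written out explicitly; the following display is our sketch of that arithmetic, not a numbered equation of the paper.

```latex
% Triangle inequality with the three quoted bounds:
d_V(p_0, p)
  \le d_V(p, p_{\gamma_n}) + d_V(p_{\gamma_n}, p_{0,\gamma_n}) + d_V(p_{0,\gamma_n}, p_0)
  \lesssim \frac{\gamma_n}{\sigma_n} + \frac{\epsilon_n}{\gamma_n} + \gamma_n .
% Balancing the first two terms:
\gamma_n^2 \asymp \epsilon_n \sigma_n
  \quad\Longrightarrow\quad
  \frac{\gamma_n}{\sigma_n} \asymp \frac{\epsilon_n}{\gamma_n}
  \asymp \sqrt{\epsilon_n/\sigma_n},
\qquad \eta_n \asymp \sqrt{\epsilon_n/\sigma_n},
% since the third term \gamma_n = \sqrt{\epsilon_n\sigma_n} is of smaller
% order when \sigma_n \le 1.
```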
Proof of Theorem 2.1. It suffices to prove that for every $\eta > 0$, $A_{n,\eta} \cap L_n$ is an empty set for all large enough $n$, where $A_{n,\eta}$ is defined in (1.2). For every $p \in A_{n,\eta}$, we have $d_V(p_0, p_{0,\gamma_n}) = o(1)$. Also, for any $M > 0$ and $p \in A_{n,\eta}$, the integral $\int_{-M}^{M} |p_{\gamma_n}(x) - p_{0,\gamma_n}(x)|\,dx$ is bounded by $2M\gamma_n^{-1}(\epsilon_n + \xi(\epsilon_n))$ by Lemma 5.1. Note that if $X$ and $U_{\gamma_n}$ are independent random variables following $P_0$ and $\mathrm{Unif}[-\gamma_n, \gamma_n]$, respectively, then the law of $X + U_{\gamma_n}$ is equal to $P_{0,\gamma_n}$. This implies that $P_{0,\gamma_n}([-M, M]^c) \le P_0([-M + \gamma_n, M - \gamma_n]^c)$, so the contribution outside $[-M, M]$ is negligible for large $M$. Therefore, we have
$$d_V(p, p_{\gamma_n}) \ge d_V(p_0, p) - d_V(p_0, p_{0,\gamma_n}) - d_V(p_{0,\gamma_n}, p_{\gamma_n}). \qquad (5.1)$$
Since $\xi(\epsilon_n) \lesssim \epsilon_n$ by (2.4), $M$ can be taken arbitrarily large and $\epsilon_n/\gamma_n = o(1)$, the right hand side of (5.1) is bounded below by $\eta/2$ for every $p \in A_{n,\eta}$ and all large enough $n$. Since $\sup_{p \in L_n} d_V(p, p_{\gamma_n}) = o(1)$, we conclude that $A_{n,\eta} \cap L_n$ is empty for large enough $n$.

Proof of Lemma 5.2. Choose real numbers $a_0 < a_1 < \dots < a_N$ and set $B_0 = (-\infty, a_0]$, $B_{N+1} = (a_N, \infty)$ and $B_j = (a_{j-1}, a_j]$ for $j = 1, \dots, N$. For $\delta > 0$, define bounded continuous functions $\psi_j$, for $j = 1, \dots, N$, such that $\psi_j(x) = 1$ for $x \in [a_{j-1} + \delta, a_j - \delta]$, $\psi_j(x) = 0$ for $x \le a_{j-1}$ or $x \ge a_j$, and $\psi_j$ is linear on the intervals $[a_{j-1}, a_{j-1} + \delta]$ and $[a_j - \delta, a_j]$. We can choose $\delta$ sufficiently small so that each $P_0\psi_j$ is arbitrarily close to $P_0(B_j)$.
Define tests $\varphi_{n,k}$ accordingly, and set $\varphi_n = \max_k \varphi_{n,k}$. Since each $g_k$ is bounded by 1, Hoeffding's inequality [19] implies that the type I error of each $\varphi_{n,k}$ decays exponentially in $n\epsilon^2$, and for $p \in U_k^c$ the type II error admits a similar exponential bound. The last display implies the required bound for the maximum test if $N \le n\epsilon^2/(16 \log 2)$. Since $N \lesssim M\epsilon^{-2}$, the desired sequence of tests exists provided that $n\epsilon_n^4$ is larger than a universal constant.
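To illustrate the mechanism behind Lemma 5.2 numerically, the sketch below simulates a moment-based test for a single bounded function $g$; the specific statistic, threshold and alternative are our illustration, not the lemma's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Test phi_n = 1{ |n^{-1} sum_i g(X_i) - P_0 g| > r/2 } for a function g
# bounded by 1. Hoeffding's inequality gives type I error <= 2 exp(-n r^2 / 8),
# and under any p with |P g - P_0 g| > r the type II error obeys the same bound.
g = np.tanh                  # an arbitrary continuous function bounded by 1
r = 0.2
reps, n = 2000, 500

p0g = np.mean(g(rng.standard_normal(10**6)))               # P_0 g under p0 = N(0,1)
m0 = g(rng.standard_normal((reps, n))).mean(axis=1)        # data from p0
m1 = g(rng.standard_normal((reps, n)) + 0.5).mean(axis=1)  # data from N(0.5, 1)

type1 = np.mean(np.abs(m0 - p0g) > r / 2)
power = np.mean(np.abs(m1 - p0g) > r / 2)
print(f"type I error = {type1:.4f}, Hoeffding bound = {2*np.exp(-n*r*r/8):.4f}, "
      f"power = {power:.3f}")
```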
Proof of Theorem 2.2. Let $\epsilon_n$ be a sequence such that $\delta_n \vee n^{-1/4} \lesssim \epsilon_n \ll 1$, and let $A_n = \{p \in \mathcal{L} : d_P(p_0, p) \ge \epsilon_n\}$. By Lemma 8.1 of [12] and Lemma 5.2, there exists a constant $c_2 > 0$ such that $P_0^n(\Omega_n) \to 1$, where $\Omega_n$ is the event that $\int R_n(p)\,d\Pi(p) > e^{-c_2 n\delta_n^2}$. Also, by Lemma 5.2, there exist a constant $K > 0$ and a sequence of tests $(\varphi_n)$ satisfying (2.7) for every large enough $n$. It follows that $\Pi(A_n \mid X_1, \dots, X_n) \to 0$ in probability, and the proof is complete.
Proof of Theorem 3.1. Take a sequence $(\gamma_n)$ such that $\epsilon_n \ll \gamma_n \ll \sigma_n$. Then $\sup_{p \in L_n} d_V(p, p_{\gamma_n}) \lesssim \gamma_n/\sigma_n = o(1)$ by Lemma 5.3. Therefore, strong consistency holds by Theorem 2.1.

Proof of Theorem 3.2
For a sequence $(M_n) \to \infty$ diverging sufficiently slowly (in a sense described below), let $\epsilon_n = n^{-1/4} M_n$ and $\gamma_n = \epsilon_n M_n$. If $M_n$ grows sufficiently slowly, we can choose a sequence $(\beta_n)$ such that $n\delta_n^4 M_n^8 \ll \beta_n^4$ and $\gamma_n \ll \beta_n \ll 1$.
Let $L_n = \{p_{F,\sigma} : \sigma \ge \beta_n\}$. Then $\Pi(L_n^c) = \Pi(\sigma < \beta_n) \le C e^{-a_2\beta_n^{-2}/2}$ for large enough $n$, where $C$ is a constant depending only on $a_1$ and $a_2$. By the construction of $(\beta_n)$, it follows that $\Pi(L_n^c) \le e^{-5n\delta_n^2}$. Also, for any $p = p_{F,\sigma}$ in $L_n$, we have $d_V(p, p_{\gamma_n}) \lesssim \gamma_n/\sigma \le \gamma_n/\beta_n = o(1)$, where the first inequality holds by Lemma 5.3. Therefore, strong consistency holds by Corollary 2.1.

Proof of Theorem 3.5
Throughout this subsection, $h$ is the density of the Student's t distribution with $v$ degrees of freedom, $h_\sigma(x) = \sigma^{-1}h(x/\sigma)$, and the constants in $\lesssim$ may depend on $v$. Proof. Let $\gamma > 0$ be given, and write $h_\sigma'(x) = \partial h_\sigma(x)/\partial x$ and $g_\sigma(x) = \sup_{|y - x| < \gamma} |h_\sigma'(y)|$. Since $|h_\sigma(y) - h_\sigma(x)| \le \gamma\, g_\sigma(x)$ whenever $|y - x| < \gamma$, the assertion follows by integrating this bound, where the constants in $\lesssim$ depend only on $v$.
Proof. The first bound holds by Lemma 5.6. The summand on the right hand side is bounded accordingly, and combining the last two displays gives the stated estimate, where $T_z h_\sigma$ is defined as in Lemma 5.6. The proof is completed by a further application of Lemma 5.6.
Assume that $p_0$ is twice continuously differentiable with first and second order derivatives $p_0'$ and $p_0''$.
In both cases, the constants $c_1, c_2$ depend only on $v$ and the given integrals.
Proof. By the Taylor expansion with the integral form of the remainder, we have
$$p_0(x + \sigma y) = p_0(x) + \sigma y\, p_0'(x) + \sigma^2 y^2 \int_0^1 (1 - t)\, p_0''(x + t\sigma y)\,dt$$
for every $x$ and $y$. Since $p(x) = \int p_0(x + \sigma y) h(y)\,dy$ and the linear term vanishes by the symmetry of $h$, the stated bound holds for some constant $c_1 > 0$, where the last integral is finite for $v > 2$.
The proof of the second inequality is identical to that of Lemma 4 of [15], for which $\int y^4 h(y)\,dy < \infty$ is required.
The following lemma is an extension of Lemma 2 of [15], in the sense that the discrete probability measure $\tilde F$ can be taken independently of $\sigma \ge \sigma_0$. Lemma 5.9. Let $a, \sigma_0, \epsilon > 0$ be given numbers such that $a/\sigma_0 \ge 1$. For any probability measure $F$ on $[-a, a]$, there exists a discrete probability measure $\tilde F$ on $[-a, a]$, with fewer than $Da\sigma_0^{-1}\log\epsilon^{-1}$ support points, such that
$$\sup_{\sigma \ge \sigma_0} d_V(\phi_\sigma * F, \phi_\sigma * \tilde F) \lesssim \epsilon(\log\epsilon^{-1})^{1/2}.$$
Proof. Throughout this proof, $p_{F,\sigma}$ denotes $\phi_\sigma * F$, not $h_\sigma * F$. Partition the interval $[-a, a]$ into $k$ disjoint, consecutive subintervals $I_1, \dots, I_k$ of length $\sigma_0$ and a final interval $I_{k+1}$ of length $l_{k+1}$ smaller than $\sigma_0$, where $k$ is the largest integer less than or equal to $2a/\sigma_0$. Write $F = \sum_{i=1}^{k+1} F(I_i) F_i$, where each $F_i$ is a probability measure concentrated on $I_i$; then $p_{F,\sigma} = \sum_{i=1}^{k+1} F(I_i)\, p_{F_i,\sigma}$. Let $Z_i$ be a random variable distributed according to $F_i$, and, for $a_i$ the left endpoint of $I_i$, let $G_i$ be the law of $W_i = (Z_i - a_i)/\sigma_0$. For $\sigma \ge \sigma_0$, let $G_{i,\sigma}$ be the law of $W_{i,\sigma} = W_i \sigma_0/\sigma$. Thus, $G_i$ and $G_{i,\sigma}$ are supported on $[0, 1]$ and $[0, \sigma_0/\sigma] \subset [0, 1]$, respectively.
As in the proof of Lemma 2 of [15], it can be shown that for each $i$ there exists a discrete probability measure $\tilde G_i$, with fewer than $N_i \lesssim \log\epsilon^{-1}$ support points, such that $d_V(p_{G_i,1}, p_{\tilde G_i,1}) \lesssim \epsilon(\log\epsilon^{-1})^{1/2}$. Note that, from the construction, the first $N_i$ moments of $G_i$ and $\tilde G_i$ are identical; see the proof of Lemma 3.1 of [14]. Let $\tilde G_{i,\sigma}$ be the law of $\tilde W_{i,\sigma} = \tilde W_i \sigma_0/\sigma$, where $\tilde W_i$ is a random variable distributed according to $\tilde G_i$. Then the first $N_i$ moments of $G_{i,\sigma}$ and $\tilde G_{i,\sigma}$ are also identical, so $d_V(p_{G_{i,\sigma},1}, p_{\tilde G_{i,\sigma},1}) \lesssim \epsilon(\log\epsilon^{-1})^{1/2}$ by Lemmas 3.1 and 3.2 of [14]. Let $\tilde F_i$ be the law of $a_i + \sigma_0 \tilde W_i$, set $\tilde F = \sum_{i=1}^{k+1} F(I_i)\tilde F_i$, and relate $\tilde F_i$ to $\tilde G_{i,\sigma}$ in the same way as $F_i$ to $G_{i,\sigma}$. It follows that $d_V(p_{F,\sigma}, p_{\tilde F,\sigma}) \le \max_i d_V(p_{F_i,\sigma}, p_{\tilde F_i,\sigma}) \lesssim \epsilon(\log\epsilon^{-1})^{1/2}$ uniformly in $\sigma \ge \sigma_0$.

Proof. Since $d_V$ is bounded by 2, we may assume that $\epsilon > 0$ is sufficiently small. Note that $h(x) = \int \phi_\tau(x)\,dH(\tau^{-2})$, where $H$ is the $\Gamma(v/2, v/2)$ distribution (mean 1 and variance $2/v$); see [2]. Thus, $h_\sigma(x) = \int \phi_{\sigma\tau}(x)\,dH(\tau^{-2})$. Let $F$ be a given probability measure on $[-a, a]$. Then, for any probability measure $\tilde F$ on $[-a, a]$, the distance $d_V(h_\sigma * F, h_\sigma * \tilde F)$ can be split into the contribution of scales with $\tau^{-2} \le t$ and a remainder, for every large enough $t$; the remainder is bounded by a constant multiple of $\epsilon$ provided that $t \ge 3v^{-1}\log\epsilon^{-1}$. Thus, the right hand side of (5.3) is bounded by $C\epsilon(\log\epsilon^{-1})^{1/2}$, where $C > 0$ is a constant depending only on $v$. By Lemma 5.9, there exists a discrete probability measure $\tilde F$ with fewer than $Da\sigma_0^{-1}\log\epsilon^{-1}$ support points satisfying the required approximation.
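The moment-matching step can be illustrated numerically: Gauss-Legendre quadrature yields a discrete measure on $[-a, a]$ whose first $2N - 1$ moments match those of the uniform distribution, and the corresponding normal mixtures are then close in total variation. This is our illustration of the technique of Lemma 3.1 of [14], not the proof's exact construction; the values of $a$, $\sigma$ and $N$ are arbitrary.

```python
import numpy as np

# F = Unif[-a, a]; Gauss-Legendre nodes/weights give a discrete measure F~
# with N atoms matching the first 2N - 1 moments of F. The normal mixtures
# phi_sigma * F and phi_sigma * F~ are then nearly equal in total variation.
a, sigma, N = 2.0, 0.5, 8
nodes, weights = np.polynomial.legendre.leggauss(N)
nodes, weights = a * nodes, weights / 2.0        # rescale from [-1, 1] to Unif[-a, a]

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

def phi(z):
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

zgrid = np.linspace(-a, a, 2001)
p_cont = phi(x[:, None] - zgrid[None, :]).mean(axis=1)             # phi_sigma * Unif[-a, a]
p_disc = (phi(x[:, None] - nodes[None, :]) * weights).sum(axis=1)  # phi_sigma * F~
print(f"N = {N}: d_V = {np.sum(np.abs(p_cont - p_disc)) * dx:.2e}")
```

Increasing $N$ makes the total variation distance decay rapidly, consistent with the $\log\epsilon^{-1}$ support-point count in Lemma 5.9.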