Criteria for posterior consistency and convergence at a rate

Frequentist conditions for asymptotic consistency of Bayesian procedures with i.i.d. data focus on lower bounds for prior mass in Kullback-Leibler neighbourhoods of the data distribution. The goal of this paper is to investigate the flexibility in these criteria. We derive a versatile new posterior consistency theorem, which is used to consider Kullback-Leibler consistency and indicate when it is sufficient to have a prior that charges metric balls instead of KL-neighbourhoods. We generalize our proposal to sieved models with Barron's negligible prior mass condition and to separable models with variations on Walker's condition. Results are also applied in semi-parametric consistency: support boundary estimation is considered explicitly and consistency is proved in a model for which Kullback-Leibler priors do not exist. As a further demonstration of applicability, we consider metric consistency at a rate: under a mild integrability condition, the second-order Ghosal-Ghosh-van der Vaart prior mass condition can be relaxed to a lower bound for ordinary KL-neighbourhoods. The posterior rate is derived in a parametric model for heavy-tailed distributions in which the Ghosal-Ghosh-van der Vaart condition cannot be satisfied by any prior.


Introduction and main result
Aside from computational issues, the most restrictive aspects of non-parametric Bayesian methods result from the limited availability of priors. In general, distributions on infinite-dimensional spaces are relatively hard to define and control technically, so unnecessary elimination of candidate priors is highly undesirable. Specializing to frequentist asymptotic aspects, the conditions that Bayesian limit theorems pose on priors play a crucial role and have received much attention, as reviewed in several excellent overview texts [14, 17, 16] over the years. It is the goal of this paper to extend the range of criteria on the prior for posterior consistency and convergence at a rate [15], showing asymptotic suitability for a wider range of priors. From the outset, we accept that this may come at the expense of additional model conditions.

Introduction
Doob [10] studied posterior limits with the help of his martingale convergence theory and gave the first general posterior consistency theorem for i.i.d. data. Notwithstanding the generality of its Bayesian interpretation, Doob's theorem is not quite satisfactory to the frequentist interested in non-parametric statistics, in that Doob's prior null set of possible inconsistency can be very large, as was stressed by Schwartz [33] and amplified by Freedman [11, 12, 9]. To frequentists, Freedman's counterexamples discredited Bayesian methods for non-parametric statistics greatly. The resulting under-appreciation was hard to justify, given Schwartz's 1965 posterior consistency theorem [34] for i.i.d. data: posteriors are consistent in the frequentist sense if consistent uniform tests exist and the prior Π is a so-called Kullback-Leibler prior: for all δ > 0,

    Π( P ∈ P : −P_0 log(dP/dP_0) < δ ) > 0.    (1.1)

Although there are alternatives [24], for example those based on Le Cam's inequality [25, 28], condition (1.1) has become the standard. Schwartz's theorem does not cover all examples, however.

Example 1.1. Consider i.i.d. X_1, X_2, … from a distribution P_0 with Lebesgue density p_0 : R → R that is supported on an interval of known width (say, 1) but unknown location. The model is parametrized in terms of a density η supported on [0, 1] with η(x) > 0 for all x ∈ [0, 1] and a location θ ∈ R:

    p_{θ,η}(x) = η(x − θ).

Note that if θ does not equal θ′, then −P_{θ,η} log(dP_{θ′,η′}/dP_{θ,η}) = ∞, for all η, η′. Therefore Kullback-Leibler neighbourhoods do not have any extent in the θ-direction and no prior can be a Kullback-Leibler prior in this model.
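The failure in example 1.1 is driven purely by support mismatch: wherever P_0 puts mass but p_{θ,η} vanishes, the log-likelihood ratio is −∞ and the Kullback-Leibler divergence is infinite. A minimal numerical sketch of this mechanism, assuming (for illustration only) that η is the uniform density on [0, 1], so that P_θ is the uniform distribution on [θ, θ + 1]:

    import numpy as np

    # Sketch: KL divergence of P_theta w.r.t. P_0 = U[0,1] is infinite for
    # theta != 0, because P_0 puts mass where p_theta vanishes. The uniform
    # choice of eta is an illustrative assumption, not part of the model.

    def kl_p0_vs_ptheta(theta, n_grid=100_000):
        """Riemann-sum approximation of -P_0 log(dP_theta/dP_0)."""
        x = np.linspace(0.0, 1.0, n_grid, endpoint=False) + 0.5 / n_grid
        p0 = np.ones_like(x)                                   # density of U[0,1]
        ptheta = ((x >= theta) & (x <= theta + 1.0)).astype(float)
        with np.errstate(divide="ignore"):
            integrand = p0 * (np.log(p0) - np.log(ptheta))
        return integrand.mean()                                # integral over [0,1]

    print(kl_p0_vs_ptheta(0.0))   # 0.0: same distribution
    print(kl_p0_vs_ptheta(0.1))   # inf: supports mismatch on [0, 0.1)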
Even in a simple model of uniform distributions {U[0, θ] : θ ∈ [0, 1]}, any prior that is KL must have a non-zero point-mass at θ = 1. (See example 7.2 for more.) Schwartz's theorem for posterior consistency in metric parameter spaces requires that the model is of finite entropy with respect to the Hellinger metric. That condition is rather restrictive and can be mitigated in several ways. The (testing) problem was noted by Le Cam (see the Le Cam dimension of the model [26]) and his solution can be applied in Schwartz's setting too. A more Bayesian solution partitions the model sequentially into subsets of bounded Hellinger metric entropy and subsets of negligible prior mass (see, for example, [2] and section 4.4.2 of [17]). Walker has proposed a method that does not depend on finite covers but adds a summability condition to condition (1.1) [39]. (For more, see subsection 4.2.)

In metric parameter spaces, consistency can be strengthened to posterior convergence at a rate: extensions of Schwartz's theorem [15, 35] apply Barron's sieve idea and a minimax test construction [4, 5], in combination with a second-order Kullback-Leibler condition on the prior: for some C > 0 and large enough n,

    Π( P ∈ P : −P_0 log(dP/dP_0) < ε_n², P_0 ( log(dP/dP_0) )² < ε_n² ) ≥ e^{−C n ε_n²},    (1.2)

to conclude that the posterior concentrates its mass in Hellinger balls around P_0 of radii ε_n → 0. But again, not all examples are covered: below, heavy-tailed distributions are found for which integrability of squared log-density ratios is violated.
Example 1.2. Consider an i.i.d. sample of integers X_1, X_2, … from a distribution P_a, (a ≥ 1), defined by, for all k ≥ 2,

    P_a( X = k ) = (1/Z_a) k^{−a} (log k)^{−3},    (1.3)

with Z_a = Σ_{k≥2} k^{−a} (log k)^{−3} < ∞. As it turns out, for a = 1, b > 1,

    −P_1 log( dP_b/dP_1 ) < ∞,   while   P_1 ( log(dP_b/dP_1) )² = ∞.

Therefore, Schwartz's condition (1.1) for the prior of a can be satisfied, but there exists no prior such that (1.2) is satisfied for all P_0 in the model. (See example 7.3 for more.) In fact, if we change the third power of the log-factor in the denominator of (1.3) to a square, Schwartz's KL-priors also do not exist.
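The dichotomy in example 1.2 can be inspected numerically. Truncating the model at a level K and renormalizing (a simulation convenience, not part of the model; b = 2 is an arbitrary illustration), partial sums of the first log-moment stabilize, while those of the second moment grow without bound — slowly, at rate log log K:

    import numpy as np

    # Numerical sketch of example 1.2: p_a(k) ∝ k^{-a} (log k)^{-3}, k >= 2.
    # For a = 1, b = 2 the KL divergence -P_1 log(dP_b/dP_1) is finite, but
    # P_1 (log(dP_b/dP_1))^2 diverges; its partial sums grow like log log K.

    def moments(K, a=1.0, b=2.0):
        k = np.arange(2, K + 1, dtype=float)
        w1 = k ** (-a) / np.log(k) ** 3        # unnormalized p_1
        w1 /= w1.sum()                         # renormalize on {2,...,K}
        log_ratio = (b - a) * np.log(k)        # log(p_1/p_b), up to a constant
        return (w1 * log_ratio).sum(), (w1 * log_ratio ** 2).sum()

    for K in (10**3, 10**4, 10**5, 10**6):
        m1, m2 = moments(K)
        print(f"K={K:>8}: first moment ~ {m1:.4f}, second moment ~ {m2:.2f}")
    # The first column stabilizes; the second keeps growing without bound.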
Schwartz's theorem and its rate-specific version have become the standard frequentist tools for the asymptotic analysis of Bayesian posteriors, almost to the point of exclusivity. As a consequence, lower bounds for prior mass in Kullback-Leibler neighbourhoods, cf. (1.1) and (1.2), are virtually the only criteria frequentists apply to priors in non-parametric asymptotic analyses (notable exceptions are made in the examples of [7, 8, 18] as well as [16]; see, however, lemma 3.1 below). Since these lower bounds on prior mass of Kullback-Leibler neighbourhoods are sufficient conditions applicable for i.i.d. data, it is not clear whether other criteria for the prior can be formulated. The goal of this paper is to increase flexibility in criteria for prior choice, by formulating a greater variety of suitability conditions for priors. The goal is not to generalize the conditions of Schwartz's theorem or to sharpen its assertion; rather, we want to show that stringency with regard to the prior can be relaxed at the expense of stringency with regard to conditions on the model.

Main result
The main result is summarized in the next theorem: we have in mind a fixed model subset V (e.g. the complement of a fixed neighbourhood of P_0) for which we want to demonstrate asymptotically vanishing posterior mass. Following the ideas of [34, 26, 4, 5], the set V is covered by a finite collection of subsets V_1, …, V_N to be tested against P_0 separately with the help of the minimax theorem: each V_i is matched with a model subset B_i (which can be thought of as a 'neighbourhood' of P_0 if the model is well-specified) such that Π(B_i) > 0 and inequality (1.5) below is satisfied. The B_i are often chosen as Kullback-Leibler neighbourhoods (as in Schwartz's theorem), but under a moment condition on likelihood ratios, larger neighbourhoods can act as alternatives.
Throughout this paper and in the formulation below, we assume that the model is dominated and we use posterior (2.1). Let co(V) denote the convex hull of V and let P^Π_n, (n ≥ 1), denote the n-fold prior predictive distributions:

    P^Π_n(A) = ∫ P^n(A) dΠ(P),

for all A ∈ σ(X_1, …, X_n). Furthermore, for given α ∈ [0, 1], model subsets B, W and a given distribution P_0, define,

    π_{P_0}(W, B; α) = sup_{P ∈ W} sup_{Q ∈ B} P_0 ( dP/dQ )^α,    (1.4)

(and π_{P_0}(W, B) = inf_{α∈[0,1]} π_{P_0}(W, B; α); see appendix B).
Theorem 1.3. Let X_1, X_2, … be distributed i.i.d.-P_0 and let V_1, …, V_N be measurable model subsets covering V. If, for every 1 ≤ i ≤ N, there exists a model subset B_i such that,

    π_{P_0}( co(V_i), B_i ) < 1,    (1.5)

Π(B_i) > 0 and sup_{Q∈B_i} P_0 (dP/dQ) < ∞ for all P ∈ V_i, then,

    Π( V | X_1, …, X_n ) → 0,   P_0-almost-surely.

Although this angle will not be pursued further in this paper, it is noted that P_0 is not required to be in the model P, so that the theorem applies both to well- and to mis-specified models [21] in the form stated. Furthermore, in subsection 3.1 it is shown that condition (1.5) is equivalent, in quite some generality, to separation of B_i and co(V_i) in Kullback-Leibler divergence with respect to P_0,

    inf { −P_0 log( dP/dQ ) : P ∈ co(V_i), Q ∈ B_i } > 0,

underlining the fundamental nature of condition (1.1). But even with this equivalence in mind, the theorem is uncommitted regarding the nature of the V_i, and, more importantly, we may use any B_i that (i) allow uniform control of P_0 (p/q)^α, and (ii) allow convenient choice of a prior such that Π(B_i) > 0. The two requirements on B_i leave room for trade-offs between being 'small enough' to satisfy (i), but 'large enough' to enable a choice for Π that leads to (ii). The freedom to choose the B_i and Π lends the method the desired flexibility.
In what follows, it is shown that Schwartz's theorem, Barron's sieve generalization, Walker's theorem and posterior rates of convergence can all be related to theorem 1.3. In section 2, the denominator of the posterior is considered in detail and theorem 1.3 is proved. In section 3 we establish that condition (1.5) is equivalent to KL-separation. Based on that, Schwartz's theorem is re-derived with several variations, e.g. posterior consistency in Kullback-Leibler divergence and Hellinger consistency with priors that charge metric balls. In section 4 separable models are considered and in section 5 we consider posterior rates. To provide an example of how our proposals enhance flexibility, corollary 5.3 shows that condition (1.2) can be replaced by a Schwartz-type KL-condition: for some K > 0,

    Π( P ∈ P : −P_0 log(dP/dP_0) < ε_n² ) ≥ e^{−K n ε_n²},

under a very simple integrability condition on the model. Section 6 applies the results to semi-parametric estimation of support boundary points for a density on a bounded interval in R [23]. The last section contains a short discussion on applications, including consistency in non-parametric density estimation with various Dirichlet mixtures, and counterexamples 1.1 and 1.2. Appendices A, B and C contain notes on supports, properties of Hellinger transforms, and proofs, respectively.

Posterior consistency
To establish the basics, the model (P, B) is a measurable space consisting of Markov kernels P on a sample space (X, A): the map A ↦ P(A) is a probability measure for every P ∈ P and the map P ↦ P(A) is measurable for every A ∈ A. Assuming the model is dominated by a σ-finite measure (with density p for P ∈ P), a prior probability measure Π on (P, B) gives rise to the posterior:

    Π( A | X_1, …, X_n ) = ∫_A ∏_{i=1}^n p(X_i) dΠ(P) / ∫_P ∏_{i=1}^n p(X_i) dΠ(P),    (2.1)

for A ∈ B. We take the frequentist i.i.d. perspective, i.e. we assume that there exists a distribution P_0 on (X, A) such that (X_1, …, X_n) ~ P_0^n. As a consequence, expression (2.1) does not make sense automatically: for the denominator to be non-zero with P_0^n-probability one, we impose that,

    P_0^n ≪ P^Π_n,    (2.2)

for every n ≥ 1, where P^Π_n is the prior predictive distribution. If (2.2) is not satisfied, expression (2.1) for the posterior is ill-defined for infinitely many n ≥ 1 with P_0^∞-probability one; proposition 2.1 provides a sufficient condition to prevent this. The following example illustrates what can go wrong.

Example 2.2. Consider an i.i.d. sample of pairs (X_1, Y_1), (X_2, Y_2), … of real-valued random variables related through Y = f(X) + e for some non-negative regression function f, such that for all δ > 0, P(f(X) < δ) > 0. Errors e_1, e_2, … are independent of X and i.i.d., with a shared marginal distribution supported on [θ, ∞), for some unknown θ ∈ R to be estimated. The problem occurs when the statistician believes that his errors are positive with probability one, while their true distribution assigns (small but) non-zero probability to negative outcomes. (In finance, examples abound, arising when one anticipates non-negative returns (for example a hedged return, the total return on a bond or an auction price) based on an incomplete or simplified model for downside risk.) The statistician makes a choice for the prior Π that reflects his belief, placing no mass on negative values for θ. When sequential i.i.d. draws are conducted, sooner or later a negative value of the error will occur in conjunction with a small value of f(X), resulting in a negative value for Y. But negative values of f(X) + e have probability zero according to all distributions in a subset of the model of prior mass one: sooner or later, the likelihood evaluates to zero Π-almost-everywhere in the model, resulting in a posterior that is ill-defined.
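The mechanism of the example can be simulated directly. In the sketch below, all concrete choices — f(x) = x² with X ~ U[0, 1], and true errors θ_0 + Exp(1) with θ_0 = −0.05 — are illustrative assumptions; the likelihood vanishes Π-almost-everywhere as soon as one residual Y − f(X) is negative, because the prior only charges error distributions supported on [θ, ∞) with θ ≥ 0 (f is taken known here):

    import numpy as np

    # Simulation sketch of the ill-defined posterior in the example above.
    # The model density g_theta of the errors is supported on [theta, inf),
    # so the likelihood is positive only if min_i (Y_i - f(X_i)) >= theta.

    rng = np.random.default_rng(0)
    theta0, n = -0.05, 20_000
    x = rng.uniform(size=n)
    e = theta0 + rng.exponential(size=n)     # negative with small probability
    y = x ** 2 + e
    running_min = np.minimum.accumulate(y - x ** 2)

    # First sample size at which the likelihood vanishes for every theta >= 0:
    hit = np.argmax(running_min < 0.0)
    if running_min[hit] < 0.0:
        print(f"likelihood is zero Pi-a.e. from n = {hit + 1} onward")
    else:
        print("no violation yet; increase n")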

A sketch of the proof of theorem 1.3
Our first lemma asserts that, under the condition that certain specific test sequences for covers of complements exist, posterior concentration in neighbourhoods follows. The proof is inspired by Le Cam's dimensionality restrictions and, more broadly, by [34, 26, 4, 5, 28, 15, 17]. The argument is essentially an application of the minimax theorem (see section 16.4 of [28], section 45 of [36]), adapted as in [21]. The essential difference between lemma 2.3 and other Bayesian limit theorems is that the posterior numerator and denominator are dealt with simultaneously rather than separately, so that the prior Π is one of the factors that determines testing power and can be balanced against model properties directly.
In the following lemma, V is a fixed set (e.g. the complement of an open neighbourhood of P_0) for which we want to prove asymptotically vanishing posterior mass. We cover V by a finite number of model subsets V_1, …, V_N such that for each V_i, a special type of test sequence exists. In the next subsection, we give conditions for the existence of such sequences.

Lemma 2.3. Assume that P_0^n ≪ P^Π_n for all n ≥ 1. For some N ≥ 1, let V_1, …, V_N be a finite collection of measurable model subsets. If there exist constants D_i > 0 and test sequences (φ_{i,n}) for all 1 ≤ i ≤ N such that,

    P_0^n φ_{i,n} + sup_{P̄_n ∈ co(V_i^n)} P_0^n ( (1 − φ_{i,n}) dP̄_n/dP^Π_n ) ≤ e^{−n D_i},    (2.3)

for large enough n (where V_i^n = {P^n : P ∈ V_i} and P̄_n denotes an element of its convex hull), then any V ⊂ ∪_{1≤i≤N} V_i receives posterior mass zero asymptotically,

    Π( V | X_1, …, X_n ) → 0,   P_0-almost-surely.    (2.4)

The condition that covers of the model have to be of finite order is restrictive and problems arise already in a parametric context. In such cases, application of the theorem requires a bit more refinement [26] (see example C.1), or the alternatives of section 4.
Le Cam [26, 27, 28] and Birgé [4, 5] propose a seminal approach to testing that combines the minimax theorem with the Hellinger geometry of the model. Here we stay close to the methods of [21], which are inspired by the above and their application in [15]. Define V^n = {P^n : P ∈ V} and denote its convex hull by co(V^n); elements from co(V^n) are denoted P̄_n. The following lemma says that testing power is bounded in terms of Hellinger transforms [28].
Lemma 2.4. Let n ≥ 1, V ∈ B be given; assume that P_0^n (dP^n/dP^Π_n) < ∞ for all P ∈ V. Then there exists a test sequence (φ_n) such that,

    P_0^n φ_n + sup_{P̄_n ∈ co(V^n)} P_0^n ( (1 − φ_n) dP̄_n/dP^Π_n ) ≤ sup_{P̄_n ∈ co(V^n)} inf_{0≤α≤1} P_0^n ( dP̄_n/dP^Π_n )^α.    (2.5)

Given Π and a measurable B such that Π(B) > 0, define the local prior predictive distributions P^{Π|B}_n by conditioning the prior predictive on B:

    P^{Π|B}_n(A) = (1/Π(B)) ∫_B P^n(A) dΠ(P),    (2.6)

for all n ≥ 1 and A ∈ σ(X_1, …, X_n). The following lemma formulates an upper bound for the right-hand side of inequality (2.5), which prescribes the (n-independent) form of the central requirement of theorem 1.3.
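The Hellinger-transform quantities appearing in (2.5) are computable objects. As a sanity check in a toy case — a hypothetical pair of unit-variance normal distributions, where π(α) = ∫ p_0^α p^{1−α} dμ has the closed form exp(−α(1−α)(μ_0−μ_1)²/2) — one may verify numerically that the transform lies strictly below 1 for α ∈ (0, 1), so that testing power decays geometrically in n:

    import numpy as np
    from scipy.integrate import quad

    # Sketch: compute the Hellinger transform pi(alpha) for two normals and
    # compare against the known closed form (an assumption-free classical fact
    # for the Gaussian family; the means/variance below are arbitrary choices).

    def hellinger_transform(alpha, mu0, mu1, sigma=1.0):
        p0 = lambda x: np.exp(-(x - mu0)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)
        p1 = lambda x: np.exp(-(x - mu1)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)
        val, _ = quad(lambda x: p0(x)**alpha * p1(x)**(1 - alpha), -np.inf, np.inf)
        return val

    mu0, mu1 = 0.0, 1.0
    alphas = np.linspace(0.05, 0.95, 19)
    numeric = [hellinger_transform(a, mu0, mu1) for a in alphas]
    closed = np.exp(-alphas * (1 - alphas) * (mu0 - mu1)**2 / 2.0)
    print(np.allclose(numeric, closed, atol=1e-8))   # True
    print(min(numeric))   # minimum at alpha = 1/2: exp(-1/8) ~ 0.8825 < 1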

Variations on Schwartz's theorem
In this section we apply theorem 1.3 to re-derive Schwartz's theorem, sharpen its assertion to consistency in Kullback-Leibler divergence, and consider model conditions that allow priors charging metric balls rather than Kullback-Leibler neighbourhoods.

Schwartz's theorem and Kullback-Leibler priors
The strategy to prove posterior consistency in a certain topology (or, more generally, to prove posterior concentration outside a set V) now runs as follows: one looks for a finite cover of V by model subsets V_i, (1 ≤ i ≤ N), satisfying the inequalities (1.5) for subsets B_i that are as large as possible and neighbourhoods of P_0 in an appropriate sense. Subsequently, we try to find (a σ-algebra B on P and) a prior Π : B → [0, 1] such that Π(B_i) > 0 for all 1 ≤ i ≤ N. In this regard the following lemma offers guidance, because it relates testing power to Kullback-Leibler separation of the sets B and W in definition (1.4). It is stressed that in applications the sets W_i are convex hulls of model subsets V_i.
Lemma 3.1. Let P_0 ∈ B ⊂ P and W ⊂ P be given and assume that there exists an a ∈ (0, 1) such that P_0 (dP/dQ)^a < ∞ for all Q ∈ B and P ∈ W. Then there exists an α ∈ (0, 1) such that,

    π_{P_0}(W, B; α) < 1,    (3.1)

if and only if,

    inf { −P_0 log( dP/dQ ) : P ∈ W, Q ∈ B } > 0.    (3.2)

Quite generally, lemma 3.1 shows that model subsets are consistently testable if and only if they can be separated from neighbourhoods of P_0 in Kullback-Leibler divergence. This illustrates the fundamental nature of Schwartz's prior mass requirement and undermines hopes for useful priors that charge different neighbourhoods of P_0 in general. However, this does not exclude the possibility of gaining freedom in the choice of the prior by strengthening requirements on the model, as we hope to demonstrate with the rest of this paper.
Due to the fact that Kullback-Leibler divergence dominates the Hellinger distance, Schwartz's theorem can be proved from theorem 1.3 and lemma 3.1 (at least, for models that have sup_{Q∈B} P_0 (dP/dQ) < ∞ for all P ∈ V and B a Kullback-Leibler neighbourhood of P_0). Schwartz's theorem does not fully exploit the room that (3.2) offers, because it stops short of asserting posterior consistency in Kullback-Leibler divergence. However, it is well known [39, 16] that Kullback-Leibler consistency does not require much more than Schwartz's conditions. The following theorem provides posterior Kullback-Leibler consistency without requiring more of the prior, by imposing an integrability condition on the model.

Theorem 3.2. Let P_0 and the model be such that for some Kullback-Leibler neighbourhood B of P_0, sup_{Q∈B} P_0 (dP/dQ) < ∞ for all P ∈ P. Let Π be a Kullback-Leibler prior. For any ε > 0, assume that {P ∈ P : −P_0 log(dP/dP_0) ≥ ε} is covered by finitely many model subsets V_1, …, V_N whose convex hulls remain separated from P_0 in Kullback-Leibler divergence:

    inf { −P_0 log( dP/dP_0 ) : P ∈ co(V_i) } > 0,   for all 1 ≤ i ≤ N.    (3.3)

Then the posterior is consistent in Kullback-Leibler divergence: for all ε > 0,

    Π( P ∈ P : −P_0 log(dP/dP_0) ≥ ε | X_1, …, X_n ) → 0,   P_0-almost-surely.    (3.4)

To appreciate how a finite cover of Kullback-Leibler neighbourhoods may occur in models, consider the following example, which relies on relative compactness with respect to the uniform norm for log-densities.

Example 3.3. Let ε > 0 be given and assume that the complement V of a Kullback-Leibler ball of radius ε > 0 contains N points P_1, …, P_N such that the convex sets,

    V_i = { P ∈ P : || dP/dP_i − 1 ||_∞ < ε/2 },

cover V. Finiteness of the cover can be guaranteed, for example with the Ascoli-Arzelà theorem, if the model describes data taking values in a fixed bounded interval in R and the associated family of log-densities is bounded and equicontinuous. (Other ways to find suitable covers refer to ||·||_∞-entropy or bracketing numbers for log-likelihood ratios [38].) Then any P ∈ co(V_i) satisfies || dP/dP_i − 1 ||_∞ < ε/2 as well, and hence, log(dP/dP_i) ≤ log(1 + ε/2) ≤ ε/2. As a result,

    −P_0 log( dP/dP_0 ) ≥ −P_0 log( dP_i/dP_0 ) − P_0 log( dP/dP_i ) ≥ ε − ε/2 = ε/2 > 0,

and (3.3) holds. In such models, any prior Π satisfying (1.1) leads to a posterior that is consistent with respect to Kullback-Leibler divergence.

Priors that charge metric balls
In this subsection we discuss model conditions that allow one to relax Schwartz's condition on the prior to the condition that the prior has full support in the Hellinger (or another) metric topology. Given some P_0 and a suitable (metric) neighbourhood B, we impose that for all Q ∈ B and all P ∈ P,

    P_0 ( dP/dQ )² ≤ L²,    (3.5)

for some constant L > 0. Under this condition, the Cauchy-Schwarz inequality leads to,

    P_0 ( dP/dQ ) ≤ ( P_0 ( dP/dQ )² )^{1/2} ≤ L.
Combined with lemma 2.3 this gives the following theorem.
Theorem 3.4. Assume the model P has finite Hellinger metric entropy numbers. Furthermore, assume that there exist a constant L > 0 and a Hellinger ball B centred on P_0 such that (3.5) holds for all P ∈ P and Q ∈ B. Finally, assume that for any Hellinger neighbourhood B′ of P_0, Π(B′) > 0. Then the posterior is Hellinger consistent, P_0-almost-surely.
Next choose 1 ≤ r < ∞. Analogous to the Hellinger metric (r = 2), define, for all probability measures P, Q, Matusita's r-metric distance [30],

    d_r(P, Q) = ( ∫ | p^{1/r} − q^{1/r} |^r dμ )^{1/r},

(based on any σ-finite μ that dominates P and Q). Applying Hölder's inequality where we applied Cauchy-Schwarz before, we arrive at the following theorem concerning priors that charge d_r-balls.
Theorem 3.5. Let 1 ≤ r < ∞ be given and let the model P have finite d_r-metric entropy numbers. Let X_1, X_2, … be i.i.d.-P_0 distributed for some P_0 ∈ P. Assume that the prior is such that P_0^n ≪ P^Π_n, for all n ≥ 1, and satisfies,

    Π( P ∈ P : d_r(P_0, P) < δ ) > 0,    (3.6)

for all δ > 0. In addition, assume that there are an L > 0 and a d_r-ball B such that for all P ∈ P and Q ∈ B, P_0 (dP/dQ)^{(s/r)∨1} ≤ L^s, where 1/r + 1/s = 1. Then the posterior is consistent for the d_r-metric, P_0-almost-surely.
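For concreteness, a small sketch of Matusita's r-metric for distributions on a finite sample space (the three-point distributions below are arbitrary illustrations); r = 2 recovers the Hellinger metric and r = 1 the L¹ distance:

    import numpy as np

    # Sketch: Matusita's r-metric for discrete distributions on a finite set,
    # d_r(P, Q) = ( sum_k |p_k^(1/r) - q_k^(1/r)|^r )^(1/r).

    def matusita(p, q, r=2.0):
        p, q = np.asarray(p, float), np.asarray(q, float)
        return (np.abs(p**(1/r) - q**(1/r))**r).sum()**(1/r)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(matusita(p, q, r=2))     # Hellinger distance
    print(matusita(p, q, r=1))     # L1 distance (twice total variation)
    for r in (1.5, 2.0, 3.0):
        print(r, matusita(p, q, r))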
Remark 3.6. For the models under discussion, we note the following general construction of so-called net priors [25, 28, 13, 15, 20]: denote the metric on P by d. Initially, assume that P has finite d-metric entropy numbers. Let (η_m) be any sequence such that η_m > 0 for all m ≥ 1 and η_m ↓ 0. For fixed m ≥ 1, let P_1, …, P_{M_m} denote an η_m-net for P and define Π_m to be the measure that places mass 1/M_m at every P_i, (1 ≤ i ≤ M_m). Choose a sequence (λ_m) such that λ_m > 0 for all m ≥ 1 and Σ_{m≥1} λ_m = 1, to define the net prior Π = Σ_{m≥1} λ_m Π_m. A net prior assigns non-zero mass to every open set.
In addition, lower bounds for prior mass in metric balls are proportional to inverses of upper bounds for metric entropy numbers, provided we choose (λ_m) appropriately. In case P is not totally bounded, one may generalize the above construction by choosing a sieve (K_m) of relatively compact submodels. Net priors, or more generally, Borel priors of full support, are helpful if one is interested in the construction of Kullback-Leibler priors, at least if the corresponding topology is fine enough.

Lemma 3.7. If for every P ∈ P, the Kullback-Leibler divergence P → R : Q ↦ −P log(dQ/dP) is continuous, then a Borel prior of full support is a Kullback-Leibler prior.
When discussing consistency, requirements on the model like (3.5) serve to guarantee continuity of the Kullback-Leibler divergence. For example, the perceptive reader may have recognized in (3.5) sufficiency to invoke theorem 5 of [41], which provides an upper bound for the Kullback-Leibler divergence in terms of the Hellinger distance. The latter is a stronger, Lipschitz-like variation on the continuity condition of the above lemma.
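The net-prior construction of remark 3.6 is easy to implement whenever nets are computable. The sketch below builds a net prior for a hypothetically chosen model — the Bernoulli family under the Hellinger metric — with η_m = λ_m = 2^{−m}; the greedy net and the finite grid standing in for the parameter interval are simulation conveniences:

    import numpy as np

    # Construction sketch of a net prior (remark 3.6): for each m, an
    # eta_m-net is weighted uniformly, and the nets are mixed with geometric
    # weights lambda_m (renormalize the lambda_m to sum to one if desired).

    def hellinger(p, q):
        aff = np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q))
        return np.sqrt(max(1.0 - aff, 0.0))

    def eta_net(eta):
        """Greedy eta-net for Bernoulli parameters on a fine grid."""
        grid, net = np.linspace(0.0, 1.0, 2001), []
        for p in grid:
            if all(hellinger(p, q) >= eta for q in net):
                net.append(p)
        return net

    support, weights = [], []
    for m in range(1, 8):
        eta_m, lam_m = 2.0**-m, 2.0**-m
        net = eta_net(eta_m)
        support += net
        weights += [lam_m / len(net)] * len(net)
    print(f"{len(support)} support points, total mass {sum(weights):.4f}")
    # Every Hellinger ball of positive radius receives positive mass.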

Posterior consistency on separable models
Requiring finiteness of the order of the cover in theorem 1.3 and lemma 2.3 is somewhat crude. There are several ways out: firstly, in subsection 4.1 we explore the possibility of letting a sieve of totally bounded submodels approximate the full model, analogous to Barron's theorem. Secondly, Hellinger consistency of the posterior on separable models formed the assertion of a remarkable theorem of Walker, for a Kullback-Leibler prior that also satisfies a summability condition [39]. In subsection 4.2 we show that variations on Walker's theorem can be derived with the methods of section 2.

Generalization to sieves
When the model is (a measurable subset of) a Polish space, inner regularity guarantees that the model is approximated in prior measure by (relatively) compact submodels.Since the latter are of finite metric entropy, a proof is conceivable based on a sieve of compact submodels with complements of 'negligible' prior mass.
Theorem 4.1. Let X_1, X_2, … be i.i.d.-P_0 for some P_0 ∈ P and let V be given. Assume that P_0^n ≪ P^Π_n for all n ≥ 1 and that there exist constants K, L > 0 and a sequence of submodels (P_n) such that for large enough n ≥ 1,
(i.) V ∩ P_n is covered by model subsets V_1, …, V_{N_n} of order N_n ≤ exp(½Ln), with tests φ_{1,n}, …, φ_{N_n,n} such that, for all 1 ≤ i ≤ N_n,

    P_0^n φ_{i,n} + sup_{P̄_n ∈ co(V_i^n)} P_0^n ( (1 − φ_{i,n}) dP̄_n/dP^Π_n ) ≤ e^{−nL},

(ii.) the prior mass outside the sieve is negligible,

    Π( P \ P_n ) ≤ e^{−nK} Π(B) π_{P_0}( co(V), B )^n,    (4.1)

for some model subset B such that Π(B) > 0. Then Π(V | X_1, …, X_n) → 0, P_0-almost-surely.
Condition (ii.) of theorem 4.1 defines what 'negligibility' of prior mass outside the sieve means. If we think of B as a small neighbourhood around P_0, it appears that the freedom to choose B small enables upper bounds for the Hellinger-transform factor in (4.1) arbitrarily close to one. In such cases, condition (ii.) reduces to the requirement that Π(P \ P_n) decreases exponentially [1].

Condition (i.) of theorem 4.1 corresponds with condition (2.3), and the upper bounds for testing power of the preceding subsections remain applicable. More particularly, condition (i.) has the following alternative, illustrated in example 4.2 below.
(i'.) there exist a model subset B with Π(B) > 0 and a cover V_1, …, V_{N_n} for V ∩ P_n of order N_n ≤ exp(½Ln), such that for every 1 ≤ i ≤ N_n, π_{P_0}(co(V_i), B) < e^{−L} and sup_{Q∈B} P_0 (dP/dQ) < ∞ for all P ∈ V_i.

Example 4.2. Assume that X_1, X_2, … are i.i.d.-P_0 for some P_0 in a model P that is dominated by a σ-finite measure μ. Consider a prior Π that charges all L^∞(μ)-balls around log p_0 (where p_0, p denote the μ-densities of P_0, P respectively):

    Π( P ∈ P : || log p − log p_0 ||_∞ < ε ) > 0,

for all ε > 0. Note that, for all P ∈ P,

    P_0 ( dP/dQ ) ≤ e^ε P_0 ( dP/dP_0 ) ≤ e^ε,

whenever || log q − log p_0 ||_∞ ≤ ε. Hence, a sieve (P_n) satisfying condition (i.) such that Π(P \ P_n) ≤ exp(−nK′) for some small K′ > 0 would suffice in this case and similar ones.
A generalization of condition (ii.) of theorem 4.1 involving n-dependent choices for B can be found in appendix C. Theorem 4.1 is applied in the support boundary problem of section 6, see remark 6.4.

Variations on Walker's theorem
In this subsection we abandon constructions based on finite covers altogether and require only that the cover is countable. Le Cam's dimensionality restrictions [26, 27] are related to this in example C.1. More generally, a natural setting arises when we consider models that are separable in some metric topology, in which case countable covers by balls of any radius exist.

Theorem 4.3. Let P and Π be given and assume that P_0^n ≪ P^Π_n for all n ≥ 1. Let V be a model subset, with a countable cover V_1, V_2, … and B_1, B_2, … such that for all i ≥ 1, we have Π(B_i) > 0 and, for all P ∈ V_i, sup_{Q∈B_i} P_0 (dP/dQ) < ∞. Then,

    P_0^n Π( V | X_1, …, X_n ) ≤ Σ_{i≥1} ( Π(V_i)/Π(B_i) )^{1/2} π_{P_0}( co(V_i), B_i ; ½ )^n.    (4.2)

Compare the upper bound in (4.2) to that of example C.1. Two corollaries show how theorem 4.3 is related to Walker's condition [39]. Note that in the first, the prior is not required to be a KL-prior.
Corollary 4.4. Let P and Π be given and assume that P_0^n ≪ P^Π_n for all n ≥ 1. Let V be a model subset, with a countable cover V_1, V_2, …, and a B ⊂ P such that Π(B) > 0 and, for all i ≥ 1 and P ∈ V_i, sup_{Q∈B} P_0 (dP/dQ) < ∞. Furthermore, assume that,

    sup_{i≥1} π_{P_0}( co(V_i), B ) < 1.    (4.3)

If the prior satisfies the summability condition,

    Σ_{i≥1} Π(V_i)^{1/2} < ∞,    (4.4)

then Π(V | X_1, …, X_n) → 0, P_0-almost-surely. For another perspective on Walker's condition, see exercises 8.11-8.12 in [16]. The second corollary does not impose model conditions like (4.3), and, instead, requires a Kullback-Leibler prior that satisfies a slightly different summability condition.
Corollary 4.5. Let P be separable in the Hellinger topology. Assume that there is a Kullback-Leibler neighbourhood B of P_0 such that for all P ∈ P, sup_{Q∈B} P_0 (dP/dQ) < ∞. Let Π be a Kullback-Leibler prior such that for all ε > 0,

    Σ_{i≥1} Π(V_i)^{1/2} < ∞,

where the V_i, (i ≥ 1), are any cover of P by Hellinger balls of a fixed radius ε.
Then the posterior is P_0-almost-surely Hellinger consistent.
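Summability conditions of this kind are easy to check in practice. A sketch with hypothetical power-law prior masses Π(V_i) ∝ i^{−p} over a countable cover (the power-law masses are an illustrative assumption): the square-root sum is finite precisely when p > 2, visible from the stabilization (or not) of partial sums:

    import numpy as np

    # Sketch of a Walker-type summability check: sum_i Pi(V_i)^(1/2) with
    # Pi(V_i) proportional to i^{-p} is finite iff p > 2.

    def sqrt_sum(p, N):
        i = np.arange(1, N + 1, dtype=float)
        mass = i**-p / (i**-p).sum()        # normalized prior masses
        return np.sqrt(mass).sum()

    for p in (1.5, 2.0, 2.5, 3.0):
        print(p, sqrt_sum(p, 10**4), sqrt_sum(p, 10**6))
    # For p <= 2 the partial sums keep growing with N; for p > 2 they stabilize.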

Posterior rates of convergence
Minimax rates of convergence for (estimators based on) posterior distributions were considered more or less simultaneously in Ghosal, Ghosh and van der Vaart [15] and Shen and Wasserman [35], with conditions that display very close resemblance. Both pose (1.2) as the condition on the prior and both appear to be inspired by Wong and Shen [41], as well as Ghosal et al. [13] and/or Barron et al. [2], which concern posterior consistency based on controlled bracketing entropy for a sieve, up to subsets of negligible prior mass, following ideas that were first laid down in [1]. In [40], Walker, Lijoi and Prünster extend the results of [39] to posterior rates of convergence. Note that the methods proposed in the preceding sections hold at finite values of n ≥ 1: the hypotheses B, V as well as the constant α can be made n-dependent without changing the basic building blocks. As such, not much needs to be adapted for the preceding results to extend also to rates of posterior convergence.
Below we follow Barron's ideas again and sharpen theorem 4.1 to accommodate rates of posterior convergence. For the theorem below, we endow the model with a metric d and assume that the prior is Borel with respect to the associated metric topology.
Theorem 5.1. Let X_1, X_2, … be i.i.d.-P_0 for some P_0 ∈ P. Assume that the prior Π is such that P_0^n ≪ P^Π_n for all n ≥ 1. Let (ε_n) be a sequence with ε_n ↓ 0 and nε_n² → ∞. Define V_n = {P ∈ P : d(P, P_0) > ε_n}, a sequence of measurable submodels P_n ⊂ P and measurable model subsets B_n such that sup_{Q∈B_n} P_0 (dP/dQ) < ∞ for all P ∈ V_n. Assume that, for sufficiently large n ≥ 1,
(i.) V_n ∩ P_n is covered by N_n ≤ exp(Lnε_n²) model subsets V_{n,1}, …, V_{n,N_n} such that, for every 1 ≤ i ≤ N_n and some α ∈ (0, 1),

    π_{P_0}( co(V_{n,i}), B_n ; α ) ≤ e^{−α ε_n²},    (5.1)

while also,
(ii.) the prior mass outside the sieve is negligible relative to the prior mass of B_n,

    Π( P \ P_n ) ≤ e^{−n ε_n² (L + K)} Π(B_n).    (5.2)

Then, P_0^n Π( P ∈ P : d(P, P_0) > ε_n | X_1, …, X_n ) → 0.

This theorem has been formulated generally and this generality obscures the interpretation of the conditions somewhat: the first condition plays the same role as the entropy condition in the Ghosal-Ghosh-van der Vaart theorem; it enables the construction of a suitable minimax test. Sufficiency of prior mass around P_0 forms part of the second condition, which also assures that the sieve approximates the model closely enough, by upper-bounding prior mass outside the sieve. Under an integrability condition, condition (5.1) for the sets co(V_{n,i}) and B_n follows from a minimal amount of separation of co(V_{n,i}) and B_n in Kullback-Leibler divergence.

Lemma 5.2. Consider two model subsets B, W such that P_0 ∈ B. Suppose that for some a ∈ (0, 1), P_0 (dP/dQ)^a is finite for all P ∈ W and Q ∈ B. If, for some Δ > 0,

    inf { −P_0 log( dP/dQ ) : P ∈ W, Q ∈ B } ≥ Δ,

then there exists an α ∈ (0, 1) such that,

    π_{P_0}( W, B ; α ) ≤ e^{−αΔ/2}.    (5.3)

Conversely, if for some Δ > 0,

    sup { −P_0 log( dP/dQ ) : P ∈ W, Q ∈ B } ≤ Δ,

then π_{P_0}(W, B; α) > e^{−αΔ} for all α ∈ (0, 1).
Lemma 5.2 says that if B and W are separated in Kullback-Leibler divergence by some small difference Δ, then the logarithm of the Hellinger transform log π_{P_0}(W, B) is upper-bounded by a multiple of −Δ. This emphasizes the role played by the Kullback-Leibler divergence and illustrates the associated limitation: not all models have integrable likelihood ratios, and Kullback-Leibler divergences that are infinite make inequality (5.3) void.

With lemma 5.2 in hand, we can simplify and specify theorem 5.1 considerably, to bring it closer to the Ghosal-Ghosh-van der Vaart theorem. For simplicity of presentation, we do not incorporate Barron's negligible prior mass argument (although one could trivially).
Corollary 5.3. Let X_1, X_2, … be i.i.d.-P_0 for some P_0 ∈ P. Specify that the metric on P is the Hellinger metric H; define (ε_n) with ε_n ↓ 0 and nε_n² → ∞, and take V_n = {P ∈ P : H(P_0, P) > Mε_n}, for M > 0, and B_n = {Q ∈ P : −P_0 log(dQ/dP_0) < ε_n²}. Assume that for large enough n and all P ∈ V_n, sup{ P_0 (dP/dQ) : Q ∈ B_n } < ∞, that (i.) the Hellinger metric entropy satisfies log N(ε_n, P, H) ≤ Lnε_n² for some L > 0, and that (ii.) for some K > 0,

    Π(B_n) ≥ e^{−K n ε_n²},    (5.4)

then P_0^n Π( P ∈ P : H(P_0, P) > Mε_n | X_1, …, X_n ) → 0, for sufficiently large M > 0.

Comparison with inequality (1.2) shows that the requirement on the prior is in terms of Schwartz's KL-neighbourhoods rather than the second-order KL-neighbourhoods of (1.2), a convenience that comes at the expense of an integrability condition for likelihood ratios. (It is noted that Ghosal and van der Vaart [16] provide a refinement of [15] that also does not involve second-order KL-neighbourhoods. The simplicity of the integrability condition of corollary 5.3 seems preferable to the technical intricacy of their theorem 8.11, however.) For an analysis of example 1.2 using corollary 5.3, see example 7.3.

Marginal consistency
In this section, we consider a problem of the following basic, semi-parametric type [3]: let Θ be an open subset of R^k parametrizing the parameter of interest θ and let H be a measurable (and typically infinite-dimensional) parameter space for the nuisance parameter η. The model is P = {P_{θ,η} : θ ∈ Θ, η ∈ H}, where Θ × H → P : (θ, η) ↦ P_{θ,η} is a Markov kernel on the sample space (X, A) describing the distributions of individual points from an infinite i.i.d. sample X_1, X_2, … ∈ X. Given a metric g : Θ × Θ → [0, ∞) and a prior measure Π on Θ × H, we say that the posterior is marginally consistent for the parameter of interest, if for all ε > 0,

    Π( P_{θ,η} ∈ P : g(θ, θ_0) > ε | X_1, …, X_n ) → 0,   P_{θ_0,η_0}-almost-surely,

for all θ_0 ∈ Θ and η_0 ∈ H. Marginal consistency amounts to consistency with respect to the pseudo-metric d : P × P → [0, ∞),

    d( P_{θ,η}, P_{θ′,η′} ) = g(θ, θ′),

for all θ, θ′ ∈ Θ and η, η′ ∈ H. The following theorem is a formulation of theorem 1.3 specific to marginal consistency.

Theorem 6.1. Let P = {P_{θ,η} : θ ∈ Θ, η ∈ H} be a model for data X_1, X_2, … assumed distributed i.i.d.-P_0 for some P_0 ∈ P in the Hellinger support of Π. Let ε > 0 be given, define V = {P_{θ,η} ∈ P : g(θ, θ_0) > ε, η ∈ H} and assume that V_1, …, V_N form a finite cover of V. If there exist model subsets B_1, …, B_N such that for every 1 ≤ i ≤ N, π_{P_0}(co(V_i), B_i) < 1, Π(B_i) > 0 and sup_{Q∈B_i} P_0 (dP/dQ) < ∞ for all P ∈ V_i, then the posterior is marginally consistent, P_0-almost-surely.

Density support boundaries
Consistent support boundary estimation (see [19], or [31] for a more recent, Bayesian reference), though easy from the perspective of point-estimation, is not a triviality when using Bayesian methods, because one is required to specify a nuisance space [32]. The Bernstein-von Mises phenomenon for this type of problem is studied in Kleijn and Knapik [23] and leads to an exponential rather than normal limiting form for the posterior. Below, we prove consistency using theorem 1.3 and note that the result generalizes relatively straightforwardly to rates (see below). The model is defined as follows: for some constant σ > 0, define the parameter of interest to lie in the space Θ = {θ = (θ_1, θ_2) ∈ R² : 0 < θ_2 − θ_1 < σ}, equipped with the Euclidean norm ||·||. Let H be a collection of Lebesgue probability densities η : [0, 1] → [0, ∞) with a modulus of continuity f (i.e. a continuous, monotone increasing f : (0, a) → (0, ∞) (for some a > 0) with f(0+) = 0), such that,

    ∫_0^ε η(x) dx ≥ f(ε)   and   ∫_{1−ε}^1 η(x) dx ≥ f(ε),    (6.2)

for all 0 < ε < a. The model P = {P_{θ,η} : θ ∈ Θ, η ∈ H} is defined in terms of Lebesgue densities of the following semi-parametric form,

    p_{θ,η}(x) = (θ_2 − θ_1)^{−1} η( (x − θ_1)/(θ_2 − θ_1) ),

for some (θ_1, θ_2) ∈ Θ and η ∈ H. A condition like (6.2) is necessarily part of the analysis, because questions concerning support boundary points make sense only if the distributions under consideration put mass in every neighbourhood of θ_1 and θ_2. (Let ||·||_{s,Q} denote the L_s(Q)-norm, for s ≥ 1.)

Theorem 6.2. For some σ > 0, let Θ be {(θ_1, θ_2) ∈ R² : 0 < θ_2 − θ_1 < σ} and let the space H with associated function f as in (6.2) be given. Assume that there exists an s ≥ 1 such that the sets B,

    B = { Q ∈ P : || dP_0/dQ − 1 ||_{s,Q} < δ },

satisfy Π(B) > 0 for all δ > 0. Also assume there exists a constant K > 0 such that for all P ∈ P and Q ∈ B, || dP/dQ ||_{r,Q} ≤ K, where 1/r + 1/s = 1. If X_1, X_2, … form an i.i.d.-P_0 sample for P_0 = P_{θ_0,η_0} ∈ P, then,

    Π( P_{θ,η} ∈ P : ||θ − θ_0|| > ε | X_1, …, X_n ) → 0,   P_0-almost-surely,

for all ε > 0.

Example 6.3. For given M > 0, let C_M denote the collection of continuous functions h = e^z on [0, 1] with ||z||_∞ ≤ M, and let g be a fixed Lebesgue probability density on [0, 1]. Let H consist of the densities,

    η(x) = g(x) h(x) / ∫_0^1 g(y) h(y) dy,

for some h ∈ C_M and all x ∈ [0, 1]. To define a prior on H, let U ~ U[−M, M] be uniformly distributed on [−M, M] and let W = {W(x) : x ∈ [0, 1]} be Brownian motion on [0, 1], independent of U. Note that it is possible to condition the process Z(x) = U + W(x) on −M ≤ Z(x) ≤ M for all x ∈ [0, 1] (or reflect Z in z = −M and z = M). Define the distribution of η under the prior Π_H by taking h = e^Z. On Θ, let Π_Θ denote a prior with a Lebesgue density that is continuous and strictly positive on Θ. One verifies easily that the model satisfies (6.2) with f defined by,

    f(ε) = e^{−2M} min( ∫_0^ε g(x) dx, ∫_{1−ε}^1 g(x) dx ),

for all ε > 0 small enough. The prior mass requirement is satisfied because the distribution of the process Z has full support relative to the uniform norm in the collection of all continuous functions on [0, 1] bounded by M.

Remark 6.4. If the assumed bound σ > 0 is set to infinity, testing power is lost (see the proof of theorem 6.2, or note that if one pictures distributions P of wider and wider support, the minimal mass bound (6.2) implies that less and less mass remains to lower-bound P(p_0 = 0) and P_0(p = 0)). To see that the bound is of a technical rather than essential nature, note that if a model of bounded-support distributions satisfies (6.2) and is uniformly tight, such a constant σ > 0 exists. Consequently, a sequence of models with growing σ's can be used: for given P_0 = P_{θ_0,η_0}, there is a lower bound σ′ > 0 such that the model of theorem 6.2 is well-specified for all σ > σ′. So if σ_m → ∞, the corresponding models P_m are well-specified for large enough m and the posteriors on those P_m are consistent, cf. theorem 6.2. By diagonalization, there exists a sequence (σ_{m(n)})_{n≥1} that traverses (σ_m) slowly enough to guarantee that consistency obtains while we increase m(n) with the sample size n.
To know exactly how slowly we should let σ go to infinity, we use theorem 4.1: let σ_n increase with n and define the sieve P_n = {P_{θ,η} ∈ P : 0 < θ_2 − θ_1 < σ_n}. Since N_n = 4 for all n ≥ 1 (namely the sets V_{+,1}, V_{−,1}, V_{+,2} and V_{−,2} in the proof of theorem 6.2), any constant L > 0 will do, as long as n f(ε/σ_n) → ∞. (Similarly, rates of convergence can be studied with the choice ε = ε_n: the modulus of continuity f then determines how ε_n, σ_n and other n-dependencies must be fine-tuned.) A glance at inequality (C.8) suggests that condition (4.1) applies if we choose Π such that,

    Π( P \ P_n ) ≤ e^{−nK},

for some K > 0. For example, if the family H consists of densities that display jumps at both θ_1 and θ_2 of some minimal size δ > 0, then f(x) ≥ ½ δ x for values of x > 0 that are close enough to x = 0. Consequently, for a model in which support boundaries represent discontinuous jumps, marginal posterior consistency obtains if we let σ_n = o(n). If H consists of densities that are continuous (k = 0) or k ≥ 1 times continuously differentiable at the boundary points, then f(x) is lower-bounded by a multiple of x^{k+2}, which implies that σ_n must be of order o(n^{1/(k+2)}).
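A draw from the nuisance prior of example 6.3 is easy to simulate. In the sketch below, the Brownian motion is discretized as a Gaussian random walk, Z is reflected into [−M, M], and g is (hypothetically) taken to be the uniform density on [0, 1]; the discretization is a simulation convenience only:

    import numpy as np

    # Simulation sketch of the prior of example 6.3: Z(x) = U + W(x), with
    # U ~ U[-M, M] and W Brownian motion, reflected into [-M, M]; h = exp(Z).

    rng = np.random.default_rng(1)
    M, n_grid = 1.0, 1000
    dt = 1.0 / n_grid

    def reflect(z, M):
        """Reflect real numbers into the interval [-M, M]."""
        z = (z + M) % (4 * M)
        return np.where(z <= 2 * M, z - M, 3 * M - z)

    U = rng.uniform(-M, M)
    W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), n_grid))])
    Z = reflect(U + W, M)
    h = np.exp(Z)                    # a draw of h = e^Z on a grid over [0, 1]

    g = np.ones(n_grid + 1)          # illustrative choice: g uniform on [0, 1]
    eta = g * h / (g * h).mean()     # Riemann-sum normalization of g*h
    print(eta.min(), eta.max())      # bounded away from 0 and infinity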

Conclusion and examples
Schwartz's theorem is central to the frequentist perspective on Bayesian non-parametric statistics and it has been in place for more than fifty years: it is beautiful and powerful, in that it applies to a very wide class of models. However, its generality with respect to the model implies that it is rather stringent with respect to the prior. Since choices for non-parametric priors are usually not abundant, overly stringent criteria form a problem. In this paper, an attempt has been made to demonstrate that there is more flexibility in the criteria for the prior, if one is willing to accept stricter model conditions.

Some easy examples
Because Hellinger consistent density estimation using mixtures is a well-studied subject, especially with Dirichlet priors, we discuss that example below in quite some generality, to illustrate the practicality of the proposed methods.
Example 7.1. Consider a model P for observation of one of two real-valued, dependent random variables X, Z, assuming that if we were to observe Z, the distribution for X would be known: X | Z = z is assumed to have a Lebesgue density p(·|z) : R → R such that z ↦ p(x|z) is bounded and continuous for every x. We observe only an i.i.d. sample X_1, X_2, … from P_0 ∈ P and the corresponding Z_1, Z_2, … remain hidden. The model P then consists of distributions P_F for X with Lebesgue densities of the form,

    p_F(x) = ∫ p(x|z) dF(z),

where the parameter F represents the unknown distribution of Z. For reasons explained below, assume that Z ∈ [0, 1], so that the space D of all distributions on [0, 1] is compact in Prokhorov's weak topology. Note that for any fixed x ∈ R, F ↦ p_F(x) is weakly continuous. By Scheffé's lemma, this pointwise continuity implies weak-to-total-variational continuity of the map F ↦ P_F, which is equivalent to weak-to-Hellinger continuity. Since D is weakly compact, this implies that the model P is Hellinger compact (and consequently, Hellinger entropy numbers are all finite). Additionally, we make the assumption that the L²-condition (3.5) is satisfied; for example in the well-known normal location mixture model, where X | Z = z is distributed normally with mean z [14], the family P = {p_F : F ∈ D} is contained in an envelope that allows straightforward verification of (3.5) (for details, see the proof of theorem 3.2 in [21]).
With finite entropy numbers and (3.5) established, note that any prior Π on D that is Borel for the weak topology induces a prior that is Borel for the Hellinger topology on the model P. If the weak support of Π equals D, then the induced Hellinger support includes P. For instance, a Dirichlet prior for F with a base measure of full support on [0, 1] suffices to conclude from theorem 3.4 that the posterior is Hellinger consistent. Other priors on D, like Gibbs-type measures of full weak support [6], would also suffice. In fact, consistency applies for any bounded, continuous (and some semi-continuous) kernel(s) x ↦ p(x|z) such that the mixture densities satisfy (3.5).
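To make example 7.1 concrete, the sketch below draws a single random mixture density p_F(x) = ∫ p(x|z) dF(z) with a normal location kernel and F drawn from a Dirichlet process via truncated stick-breaking; the truncation level, concentration parameter and Uniform[0,1] base measure are simulation choices, not part of the example:

    import numpy as np

    # Sketch: one draw of a Dirichlet-process normal location mixture density,
    # X | Z = z ~ N(z, 1), Z ~ F, F ~ DP(c, Uniform[0,1]) (truncated).

    rng = np.random.default_rng(2)

    def sample_dp_mixture_density(c=2.0, trunc=200):
        v = rng.beta(1.0, c, size=trunc)
        w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])  # stick-breaking
        z = rng.uniform(0.0, 1.0, size=trunc)     # atoms from the base measure
        def p_F(x):
            x = np.atleast_1d(x)[:, None]
            return (w * np.exp(-(x - z)**2 / 2) / np.sqrt(2*np.pi)).sum(axis=1)
        return p_F

    p_F = sample_dp_mixture_density()
    x = np.linspace(-3, 4, 8)
    print(p_F(x))    # pointwise values of a random mixture density x -> p_F(x)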
To conclude, we demonstrate that the approach advocated in this paper applies in counterexamples 1.1 and 1.2.

Example 7.2. Assume that the width of the support of p_0 is equal to one. The model consists of densities η supported on [0, 1], shifted over θ in R,

    p_{θ,η}(x) = η(x − θ).

Consider H with some prior Π_H and a prior Π_Θ on Θ = R with a Lebesgue density that is continuous and strictly positive on all of R. Note that if θ ≠ θ′, the Kullback-Leibler divergence of P_{θ,η} with respect to P_{θ′,η′} is infinite, for all η, η′ ∈ H. As noted, KL-neighbourhoods do not have any extent in the θ-direction; however, the construction of example 6.3 remains applicable. In fact, in the present, fixed-width simplification, the situation is more transparent: if we write P_0 = P_{θ_0,η_0} and V = V_+ ∪ V_− with V_+ = {P_{θ,η} : θ > θ_0 + ε, η ∈ H} and V_− = {P_{θ,η} : θ < θ_0 − ε, η ∈ H} for some ε > 0, then we choose B_± = {P_{θ,η} : |θ − θ_0| < δ, η ∈ H} for some 0 < δ < ε, so that Π(B_±) > 0. Consider only α = 0 and notice that the mismatch in extent of supports implies that, for all P ∈ co(V_±),

    P_0( p = 0 ) ≥ f(ε) > 0,

based on (6.2). If H is chosen such that for all P ∈ V_±, sup_{Q∈B_±} P_0 (p/q) < ∞, then (6.3) follows (even regardless of the prior on H, which is remarkable). Larger spaces H can be considered if the sets B_± are restricted appropriately while maintaining Π(B_±) > 0. Conclude that for the estimation of an unknown θ_0 ∈ R, Schwartz's theorem does not apply, while example 6.3 remains in effect.

Example 7.3. Recall example 1.2: the sample X_1, X_2, … consists of i.i.d. integers from a distribution P_a, (a ≥ 1), defined by,

    P_a( X = k ) = (1/Z_a) k^{−a} (log k)^{−3},

for all k ≥ 2 (with normalization Z_a). The parameter a is smooth and the Fisher information is non-singular, so a can be estimated at parametric rate, but, as noted, there exists no prior for the parameter a such that condition (1.2) can be satisfied for all P_0 in the model. Corollary 5.3 remains valid, however, and demonstrates that the posterior converges at √n-rate. Because corollary 5.3 is formulated for totally bounded parameter spaces only, without a negligibility condition like (5.2), we restrict the parameter a to a bounded interval I = [1, L], for some L > 1. (However, the result below is expected to hold also without this restriction.) For any rate ε_n that is slower than n^{−1/2}, write ε_n = n^{−1/2} M_n, with M_n → ∞, and note that we only have to consider M_n that diverge very slowly, i.e. ε_n that are arbitrarily close to the parametric rate. Also note that there exist constants 0 < c < C such that c |a − b| ≤ H(P_a, P_b) ≤ C |a − b|, for all a, b ∈ I. Define V_n = {P : H(P, P_0) ≥ Mε_n} for some M > 0. We cover V_n with Hellinger balls of radii proportional to ε_n. Hence, for any a ≥ 1, any P_c ∈ V_n and any P_b with |b − a| < ε_n/M², the quantity P_0 log(dP_c/dP_b) (which may be equal to −∞) is upper-bounded by a negative multiple of ε_n², providing the Kullback-Leibler separation required in lemma 5.2.
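As a numerical companion to example 7.3 (truncation at K and the specific values of b are computational conveniences), the Hellinger distance in the family behaves locally like a multiple of |a − b| — the finite, non-singular Fisher information underlying the √n-rate of corollary 5.3:

    import numpy as np

    # Sketch: locally linear behaviour H(P_a, P_b) ~ const * |a - b| in the
    # family p_a(k) ∝ k^{-a} (log k)^{-3}, truncated at K for computation.

    def p(a, K=10**6):
        k = np.arange(2, K + 1, dtype=float)
        w = k**-a / np.log(k)**3
        return w / w.sum()

    def hellinger(pa, pb):
        return np.sqrt(1.0 - np.sqrt(pa * pb).sum())

    pa = p(1.0)
    for b in (1.1, 1.05, 1.025):
        print(b, hellinger(pa, p(b)) / (b - 1.0))   # roughly constant ratio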

Appendix C: Proofs
This section contains all proofs of theorems and lemmas in the main text.
Proof of lemma 2.3. For a set V covered by measurable V_1, …, V_N, almost-sure convergence per individual V_i implies the assertion. So we fix some 1 ≤ i ≤ N and note that,

    Π( V_i | X_1, …, X_n ) ≤ φ_{i,n} + (1 − φ_{i,n}) Π( V_i | X_1, …, X_n ).

By Fubini's theorem, the P_0^n-expectation of the second term on the right is bounded by the supremum in (2.3). From (2.3) we conclude that P_0^n Π(V_i | X_1, …, X_n) ≤ e^{−nD_i}, for large enough n. Apply Markov's inequality to find that,

    P_0^n( Π( V_i | X_1, …, X_n ) ≥ e^{−nD_i/2} ) ≤ e^{−nD_i/2},

so that the first Borel-Cantelli lemma guarantees that Π(V_i | X_1, …, X_n) → 0, P_0-almost-surely. Replicating this argument for all 1 ≤ i ≤ N, assertion (2.4) follows.
Example C.1. Suppose that we wish to prove consistency relative to some metric d on P, but coverings of the model by d-balls are not finite. Then we may try the following construction: for ε > 0, we define W = {P ∈ P : d(P, P_0) > ε} and W_k = {P ∈ P : 2^{k−1} ε ≤ d(P, P_0) < 2^k ε}, (k ≥ 1). Assume that the covering numbers N_k of the model subsets W_k (related to the so-called Le Cam dimension of the model [26]) are finite. Let V_{k,1}, …, V_{k,N_k} be d-balls of radius 2^{k−2} ε covering W_k. Assume that for every d-ball V_{k,i}, (1 ≤ i ≤ N_k), there exists a test sequence (φ_{k,i,n})_{n≥1} such that (2.3) is satisfied with D_{k,i} ≥ d²(V_{k,i}, P_0). Then, for every n ≥ 1,

    P_0^n Π( W | X_1, …, X_n ) ≤ Σ_{k≥1} Σ_{i=1}^{N_k} e^{−n D_{k,i}} ≤ Σ_{k≥1} N_k e^{−n 4^{k−2} ε²}.

If we show that the right-hand side goes to zero as n → ∞, the posterior is d-consistent.
The following lemma reduces the testing criterion to an n-independent condition, cf. (1.5).
Proof of lemma 2.5. Let 0 ≤ α ≤ 1 be given. Note that for all n ≥ 1, P^Π_n(A) ≥ Π(B) P^{Π|B}_n(A), for all A ∈ σ(X_1, …, X_n). Combining that with the convexity of x ↦ x^{−α} on (0, ∞), we see that,

    P_0^n ( dP̄_n/dP^Π_n )^α ≤ Π(B)^{−α} P_0^n ( dP̄_n/dP^{Π|B}_n )^α.    (C.2)

With the use of Fubini's theorem and lemma 6.2 in Kleijn and van der Vaart (2006) [21], which says that Hellinger transforms factorize when taken over convex hulls of products, we find:

    P_0^n ( dP̄_n/dP^{Π|B}_n )^α ≤ ( sup_{P ∈ V} sup_{Q ∈ B} P_0 ( dP/dQ )^α )^n = π_{P_0}( V, B ; α )^n.

Applying (C.2) with α = 1 and P̄_n = P^n, and using that for all P ∈ V, P_0 (dP/dQ) is bounded uniformly over B, we see that also P_0^n (dP̄_n/dP^Π_n) < ∞. By (2.5), we obtain (2.7).

C.2. Proofs for section 3
Proof of lemma 3.1. Assume that (3.2) holds. Lemma B.1 says that α ↦ P_0 (dP/dQ)^α is convex and continuously differentiable on (0, a), with derivative at α = 0 equal to P_0 log(dP/dQ). So for all α ∈ (0, a),

    P_0 ( dP/dQ )^α = 1 + α P_0 log( dP/dQ ) + o(α),   (α ↓ 0),

uniformly in P ∈ W and Q ∈ B, so that (3.2) yields π_{P_0}(W, B; α) < 1 for small enough α > 0. Conversely, Jensen's inequality gives P_0 (dP/dQ)^α ≥ exp(−α (−P_0 log(dP/dQ))), so that failure of (3.2) implies π_{P_0}(W, B; α) ≥ 1 for all α ∈ (0, a).

Proof of theorem 3.2. Every total-variational neighbourhood of P_0 contains a Kullback-Leibler neighbourhood, so combination of lemma 3.1, lemma 2.5 and theorem 1.3 proves posterior consistency in Kullback-Leibler divergence, cf. (3.4).

Proof of theorem 3.4. Proposition 2.1 guarantees that P_0^n ≪ P^Π_n, for all n ≥ 1. For given ε > 0, let V denote {P ∈ P : H(P, P_0) > 2ε}. Since P is totally bounded in the Hellinger metric, there exist P_1, …, P_N such that the model subsets V_i = {P ∈ P : H(P, P_i) < ε} form a cover of V. On the basis of the constant L of (3.5), define B = {Q ∈ P : H(Q, P_0) < ε²/(4L) ∧ ε}, and decompose the posterior mass of V as in the proof of theorem 1.3. Using again the local prior predictive distribution P^{Π|B}_n of (2.6), the third term of the decomposition satisfies an exponential upper bound. Like at the end of the proof of lemma 2.3, an application of the Borel-Cantelli lemma proves the assertion.
C.3. Proofs for section 4

Proof of theorem 4.3. By monotone convergence,

    P_0^n Π( V | X_1, …, X_n ) ≤ Σ_{i≥1} P_0^n Π( V_i | X_1, …, X_n ).

We treat the terms in the sum separately with the help of test sequences (φ_{i,n}), for all i ≥ 1, following the proof of lemma 2.3. Like in the proof of lemma 2.5, the assumptions that sup_{Q∈B_i} P_0 (dP/dQ) < ∞ and Π(B_i) > 0 imply that P_0^n (dP̄_n/dP^Π_n) < ∞, for all P̄_n ∈ co(V_i^n). So the φ_{i,n} can be chosen in accordance with the minimax theorem and, following the proof of lemma 2.5, the i-th term on the right-hand side is bounded by (Π(V_i)/Π(B_i))^{1/2} π_{P_0}(co(V_i), B_i; ½)^n, which proves (4.2).

Proof of corollary 4.4. Under (4.3), each factor π_{P_0}(co(V_i), B; ½)^n is bounded by γ^n (for some constant 0 < γ < 1), which goes to zero at geometric rate; the sum on the right-hand side of (4.2) then converges to zero if (4.4) holds.

Proof of corollary 4.5. Given ε > 0, define V = {P : H(P, P_0) ≥ ε} and let {V_i : i ≥ 1} denote a countable cover of V by Hellinger balls of radius ε/4. For every P in the convex hull of such a ball,

    −P_0 log( dP/dP_0 ) ≥ H²(P, P_0) ≥ (3ε/4)².

Note that (C.6) serves as a lower bound for the right-hand side of the previous display, which enables the choice B = {P ∈ P : −P_0 log(dP/dP_0) < ε²/4} to guarantee that there exist constants 0 < α, γ < 1 such that,

    P_0^n Π( V | X_1, …, X_n ) ≤ Σ_{i≥1} ( Π(V_i)/Π(B) )^{1/2} γ^n,

which goes to zero since Π(B) > 0 and the sum is finite by assumption.

C.4. Proofs for section 5
Proof of theorem 5.1.Fix n ≥ 1 large enough to satisfy conditions (i) and (ii).
According to lemma 2.5, there exist test functions φ_{n,i} : X^n → [0, 1] for all 1 ≤ i ≤ N_n, such that, for all α ∈ [0, 1],

    P_0^n φ_{n,i} + sup_{P̄_n ∈ co(V_{n,i}^n)} P_0^n ( (1 − φ_{n,i}) dP̄_n/dP^Π_n ) ≤ Π(B_n)^{−α} π_{P_0}( co(V_{n,i}), B_n ; α )^n.

Define ψ_n = max_i φ_{n,i} and decompose the n-th posterior for V_n = {P ∈ P : d(P, P_0) ≥ ε_n} as follows,

    Π( V_n | X_1, …, X_n ) ≤ ψ_n + (1 − ψ_n) Π( V_n ∩ P_n | X_1, …, X_n ) + (1 − ψ_n) Π( V_n \ P_n | X_1, …, X_n ).

The first term is upper-bounded with the help of the testing bound above and condition (i.); the remaining terms are controlled by condition (ii.), following the proof of theorem 4.1.

Proof of corollary 5.3. Take P_n = P for all n ≥ 1. Note that (5.4) implies that Π is a Kullback-Leibler prior, which implies that P_0^n ≪ P^Π_n. Let V_n = {P ∈ P : H(P, P_0) ≥ ε_n} and B_n = {P ∈ P : −P_0 log(dP/dP_0) < ε_n²/8}. By condition (i.), there is a cover of V_n consisting of Hellinger balls of radii ε_n/2 of order N_n = N(ε_n, P, H) ≤ exp(Lnε_n²). Note that for every 1 ≤ i ≤ N_n and all P ∈ co(V_{n,i}), we have −P_0 log(dP/dP_0) ≥ H²(P, P_0) ≥ (H(V_n, P_0) − ε_n/2)² = ε_n²/4, while −P_0 log(dQ/dP_0) ≤ ε_n²/8 for all Q ∈ B_n. According to lemma 5.2, the separation in Kullback-Leibler divergence between B_n and V_n implies that π_{P_0}( co(V_{n,i}), B_n ) ≤ e^{−α ε_n²} for some α > 0. Possibly after rescaling of ε_n by an n-independent constant (which leads to a larger α, effectively), π_{P_0} satisfies condition (5.1). The assertion then follows from theorem 5.1.
