A Bernstein-Von Mises Theorem for discrete probability distributions

We investigate the asymptotic normality of the posterior distribution in the discrete setting, when the model dimension increases with the sample size. We consider a probability mass function $\theta_0$ on $\mathbbm{N}\setminus \{0\}$ and a sequence of truncation levels $(k_n)_n$ satisfying $k_n^3\leq n\inf_{i\leq k_n}\theta_0(i).$ Let $\hat{\theta}_n$ denote the maximum likelihood estimate of $(\theta_0(i))_{i\leq k_n}$ and let $\Delta_n(\theta_0)$ denote the $k_n$-dimensional vector whose $i$-th coordinate is $\sqrt{n}(\hat{\theta}_n(i)-\theta_0(i))$ for $1\leq i\leq k_n.$ We check that, under mild conditions on $\theta_0$ and on the sequence of prior probabilities on the $k_n$-dimensional simplices, the variation distance between the posterior distribution recentered around $\hat{\theta}_n$ and rescaled by $\sqrt{n}$ and the $k_n$-dimensional Gaussian distribution $\mathcal{N}(\Delta_n(\theta_0),I^{-1}(\theta_0))$ converges in probability to $0.$ This theorem can be used to prove the asymptotic normality of Bayesian estimators of Shannon and R\'{e}nyi entropies. The proofs are based on concentration inequalities for centered and non-centered chi-square (Pearson) statistics. The latter allow us to establish posterior concentration rates with respect to the Fisher distance rather than with respect to the Hellinger distance, as is commonplace in non-parametric Bayesian statistics.


Introduction
The classical Bernstein-Von Mises Theorem asserts that for regular (Hellinger differentiable) parametric models, under mild smoothness conditions on the prior distribution, after centering around the maximum likelihood estimate and rescaling, the posterior distribution of the parameter is asymptotically Gaussian and that the limiting covariance matrix coincides with the inverse of the Fisher information matrix. This theorem provides a frequentist perspective on the Bayesian methodology and elements for reconciling the two approaches. In regular parametric models, Bernstein-von Mises theorems motivate the interchange of Bayesian credible sets and frequentist confidence regions. Refinements of the Bernstein-von Mises theorem have also proved helpful when analyzing the redundancy of universal coding for smoothly parametrized classes of sources over finite alphabets.
The proof of the classical Bernstein-Von Mises theorem relies on rather sophisticated arguments. Some of them seem to be tied to the finite dimensionality of the considered models. Hence, extensions of Bernstein-von Mises theorems to non-parametric and semi-parametric settings have both received deserved attention and shown moderate progress during the last four decades. Soon after Bayesian inference was put on firm frequentist foundations by Doob (1949), Schwartz (1965) and others, Freedman (1963) (see also Freedman, 1965) pointed out that even when dealing with the simplest possible case, that of independent, identically distributed, discrete observations, there is no such thing as a general posterior consistency result, let alone a general Bernstein-Von Mises Theorem. Moreover, according to the evidence presented by Freedman (1965), it is mandatory to focus on moderately large classes of distributions. Despite such early negative results, non-parametric Bayesian theory has been progressing at a steady pace. The framework of empirical process theory has made it possible to provide sufficient conditions for posterior consistency and to relate posterior concentration rates to model complexity (Ghosal and van der Vaart, 2001, 2007b; Ghosal et al., 2000).
Among the different approaches to non-parametric inference, using simple models with increasing dimensions has attracted attention in the context of maximum likelihood inference (Portnoy, 1988; Fan and Truong, 1993; Fan et al., 2001; Fan, 1993) and in the context of Bayesian inference (Ghosal, 2000). The last reference is especially relevant to this paper. Therein, S. Ghosal considers nested sequences of exponential models satisfying a number of assumptions involving the growth rate of models with sample size, the growth rate of the determinant of the Fisher information matrix with respect to model dimension (and thus sample size), prior smoothness, and moment bounds for score functions in small Kullback-Leibler balls located around the sampling probability (those conditions will be explained and compared with our own conditions in Section 3.1). S. Ghosal proves a Bernstein-Von Mises Theorem (Ghosal, 2000, Theorem 2.3) for the log-odds parametrization, partially building on previous results from Portnoy (1988) concerning maximum likelihood estimates. However, our objectives significantly differ from those of S. Ghosal. In (Ghosal, 2000), the main application of non-parametric Bernstein-Von Mises Theorems for multinomial models seems to be non-parametric density estimation using histograms. This framework justifies special attention to multinomial distributions which are almost uniform. Our ultimate goal is quite different. In information-theoretical language, we are interested in investigating memoryless sources over infinite alphabets (see Kieffer, 1978; Gyorfi et al., 1993; Boucheron et al., 2009, and references therein). In Information Theory, refinements of Bernstein-Von Mises Theorems make it possible to investigate the so-called maximin redundancy of universal coding over parametric classes of sources (Clarke and Barron, 1994).
In Information Theory, a source over a (countable) alphabet is a probability distribution over the set of infinite sequences of symbols from the alphabet. The redundancy of a (coding) probability distribution with respect to a source on a given (finite) sequence of symbols is the logarithm of the ratio between the probability of the sequence under the source and under the coding probability. In universal coding theory, average redundancy with respect to a prior distribution over sources can be written as the difference between the (differential) Shannon entropy of the prior distribution and the average value of the (differential) entropy of the conditional posterior distribution. Thanks to non-trivial refinements of the Bernstein-Von Mises Theorem, the latter conditional entropy can be approximated by the (differential) entropy of a Gaussian distribution whose covariance matrix is the inverse of the Fisher information matrix defined by the source under consideration. This elegant approach provides sharp asymptotic and non-asymptotic results when dealing with classes of sources which are soundly parameterized by subsets of finite-dimensional spaces (see Clarke and Barron, 1990, for precise definitions). When turning to larger classes of sources, for example toward memoryless sources over countable alphabets (Boucheron et al., 2009), this approach to the characterization of maximin redundancy has not (yet) been carried out. A major impediment is the current unavailability of adequate non-parametric Bernstein-Von Mises Theorems. This paper is a first step in developing the Bayesian tools that are needed to precisely quantify the minimax redundancy of universal coding of non-parametric classes of sources over infinite alphabets. Because of our ultimate goals, we cannot focus on almost uniform multinomial models.
We are specifically interested in situations where the sampling probability mass functions decay at a prescribed rate (say algebraic or exponential) as in (Boucheron et al., 2009).
As pointed out by Ghosal, in models with an increasing number of parameters, justifying the asymptotic normality of the posterior distribution is more involved, and precisely characterizing under which conditions on the prior and sampling distributions this asymptotic normality holds remains an open-ended question. For example, in the context of discrete distributions, several ways of defining the divergence between distributions look reasonable. Most of the recent work on non-parametric Bayesian statistics has dealt with posterior concentration rates and has been developed using the Hellinger distance (Ghosal and van der Vaart, 2001, 2007b). One may wonder whether some posterior concentration rate results obtained using the Hellinger metrization can be strengthened. It is not clear how to tackle this issue in full generality. In this paper, taking advantage of the peculiarities of our models, we use another, demonstrably stronger, information divergence, the Fisher ($\chi^2$) "distance", and establish posterior concentration rates with respect to Fisher balls (see Lemma 3.6). The proof relies on known concentration inequalities for centered $\chi^2$ (Pearson) statistics and (apparently) new concentration inequalities for non-centered $\chi^2$ statistics.
Paraphrasing van der Vaart (1998), as the notion of convergence in the Bernstein-Von Mises Theorem is a rather complicated one, the expected reward, once such a Theorem has been proved, is that "nice" functionals applied to the posterior laws should converge in distribution in the usual sense. An obvious candidate for deriving that kind of result is a Bayesian variation on the Delta method. However, we are facing two kinds of obstacles here. On the one hand, we cannot rely on the availability of a Bernstein-Von Mises Theorem when considering the infinite-dimensional model (Freedman, 1963, 1965). This precludes using the traditional functional Delta method as described for example in (van der Vaart and Wellner, 1996; van der Vaart, 1998). On the other hand, when considering models of increasing dimensions, a variant of the Delta method has to be derived in an ad hoc manner. This is what we do. We assess this rule of thumb by examining plug-in estimates of Shannon and Rényi entropies. Such functionals characterize the compressibility of a given probability distribution (Csiszár and Körner, 1981; Cover and Thomas, 1991; Gallager, 1968). The problem of estimating such functionals has been investigated by Antos and Kontoyiannis (2001) and Paninski (2004). It has been checked there that plug-in estimates of the Shannon and Rényi entropies are consistent, and some lower and upper bounds on the rate of convergence have been proposed. To the best of our knowledge, classes of distributions for which plug-in estimates satisfy a central limit theorem have not been systematically characterized. Here, the Bernstein-Von Mises Theorem allows us to derive central limit theorems for Bayesian entropy estimators (see Theorem 3.12) and provides the basis for constructing Bayesian credible sets. In the present context, those credible sets are known to coincide asymptotically with Bayesian bootstrap confidence regions (Rubin, 1981).
The paper is organized as follows. In Section 2, the framework and notation of the paper are introduced. A few technical conditions warranting local asymptotic normality when handling models of increasing dimensions are also stated. The main results of the paper are presented in Section 3. The non-parametric Bernstein-Von Mises Theorem (Theorem 3.7) is described in Subsection 3.1. It is complemented by a posterior concentration lemma (Lemma 3.6) that might be interesting in its own right. A roadmap of the proof of the Bernstein-Von Mises Theorem is stated thereafter. In Subsection 3.2, the asymptotic normality of Bayesian estimators of various entropies is derived using the non-parametric Bernstein-Von Mises Theorem and various tail bounds for quadratic forms that are also useful in the derivation of the Bernstein-Von Mises theorem. In Subsection 3.3, sequences of Dirichlet priors are checked to satisfy the conditions of the Bernstein-Von Mises Theorem. In Subsection 3.4, the main results of the paper are illustrated on the envelope classes investigated by Boucheron et al. (2009). In Subsection 3.5, the setting of Theorem 3.7 is compared with the framework described in (Ghosal, 2000). In Subsection 3.6, the posterior concentration lemma is compared with related recent results in non-parametric Bayesian statistics. The proof of the Bernstein-Von Mises Theorem is given in Section 4. It adapts Le Cam's proof (Le Cam and Yang, 2000; van der Vaart, 2002) to the non-parametric setting using a collection of old and new non-asymptotic tail bounds for chi-square statistics. The proof of the asymptotic normality of Bayesian entropy estimators is given in Section 5. It relies on the Bernstein-Von Mises Theorem and on the aforementioned tail bounds for chi-square statistics.

Notation and background
This section describes the statistical framework we will work with, as well as the behavior of likelihood ratios in this framework. At the end of the section, a useful contiguity result is stated.
Throughout the paper, $\theta = (\theta(i))_{i\in\mathbb{N}^*}$ denotes a probability mass function over $\mathbb{N}^* = \mathbb{N}\setminus\{0\}$ and $\Theta$ denotes the set of probability mass functions over $\mathbb{N}^*$. If the sequence $x = x_1, \ldots, x_n$ denotes a sample of $n$ elements from $\mathbb{N}^*$, let $N_i$ denote the number of occurrences of $i$ in $x$: $N_i(x) = \sum_{j=1}^n \mathbf{1}_{x_j = i}$. The log-likelihood function maps $\Theta \times \mathbb{N}_*^n$ to $\mathbb{R}$:
$$\ell_n(\theta, x) = \sum_{j=1}^n \log \theta(x_j) = \sum_{i \geq 1} N_i(x) \log \theta(i)\, .$$
When the sample $x$ is clear from context, $\ell_n(\theta, x)$ is abbreviated to $\ell_n(\theta)$.
Throughout the paper, $\theta_0$ denotes the (unknown) probability mass function under which samples are collected. Let $\Omega = \mathbb{N}_*^{\mathbb{N}}$ and let $X_1, \ldots, X_n, \ldots$ denote the coordinate projections. Then $\mathbb{P}_0$ denotes the probability distribution over $\Omega$ (equipped with the cylinder $\sigma$-algebra $\mathcal{F}$) under which the coordinate projections are independent and identically distributed according to $\theta_0$. Recall that the maximum likelihood estimator $\hat\theta$ of $\theta_0$ on a sample $x$ is given by the empirical probability mass function: $\hat\theta(i) = N_i/n$. Let $k$ denote a positive integer that may and should depend on the sample size $n$. We will be interested in the estimation of the $\theta_0(i)$ for $i = 1, \ldots, k$. In this respect, all the useful information is conveyed by the counts $N_i$, $i = 1, \ldots, k$, or equivalently by what will be called the truncated version of the sample. The truncated version of sample $x$ is denoted by $\bar{x}$ and constructed as follows: $\bar{x}_j = x_j$ if $x_j \leq k$, and $\bar{x}_j = 0$ otherwise. The counter $N_0$ is defined as the number of occurrences of $0$ in $\bar{x}$: $N_0(\bar{x}) = n - \sum_{i=1}^k N_i(x)$. The image of $\theta \in \Theta$ by truncation is a p.m.f. over $\{0, \ldots, k\}$; it is still denoted by $\theta$, with $\theta(0) = \sum_{i>k} \theta(i)$. Let $\Theta_k$ denote the set of p.m.f. over $\{0, \ldots, k\}$. In the sequel, depending on context, $\theta_0$ may denote either the p.m.f. on $\mathbb{N}^*$ from which the sample is drawn or its image by truncation at level $k$.
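In code, the counts $N_i$, the truncated sample and the truncated maximum likelihood estimate described above can be sketched as follows (a minimal illustration; the function names are ours):

```python
from collections import Counter

def truncate_sample(x, k):
    """Truncated sample bar-x: every symbol larger than k is mapped to 0."""
    return [xi if xi <= k else 0 for xi in x]

def counts(x, k):
    """Occurrence counts N_0, ..., N_k of the truncated sample."""
    c = Counter(truncate_sample(x, k))
    return [c.get(i, 0) for i in range(k + 1)]

def mle(x, k):
    """Truncated maximum likelihood estimate: hat-theta(i) = N_i / n."""
    n = len(x)
    return [N / n for N in counts(x, k)]

# Toy sample over N* = {1, 2, 3, ...}, truncated at level k = 2.
x = [1, 2, 5, 1, 3, 1, 2, 7]
print(counts(x, 2))  # [3, 3, 2]: N_0 counts the symbols larger than 2
print(mle(x, 2))     # [0.375, 0.375, 0.25]
```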
Henceforth, $\theta \in \Theta_{k_n}$ may denote either $(\theta(i))_{0\leq i\leq k_n}$ or its projection on the $k_n$ last coordinates $(\theta(i))_{1\leq i\leq k_n}$; in the same way, if $h$ denotes a vector $(h(i))_{0\leq i\leq k_n}$ in $\mathbb{R}^{k_n+1}$ such that $\sum_{i=0}^{k_n} h(i) = 0$, then $h$ may also denote its projection on the $k_n$ last coordinates $(h(i))_{1\leq i\leq k_n}$, depending on the context. For a given sample $x$, the score function is the gradient of the log-likelihood at $\theta \in \Theta_k$: for $i \in \{1, \ldots, k\}$, $\dot\ell_n(\theta)_i = N_i/\theta(i) - N_0/\theta(0)$. Assume all components of $\theta \in \Theta_k$ are positive; then the information matrix $I(\theta)$ is defined by $I(\theta)_{i,j} = \frac{\mathbf{1}_{i=j}}{\theta(i)} + \frac{1}{\theta(0)}$ for $1 \leq i, j \leq k$.
Let $k_n$ denote a truncation level. If $h$ belongs to $\mathbb{R}^{k_n+1}$ and satisfies $\sum_{i=0}^{k_n} h(i) = 0$, let $\sigma_n(h)$ be defined by $\sigma_n^2(h) = \sum_{i=0}^{k_n} h(i)^2/\theta_0(i)$, where we agree on the following convention: if $\theta_0(i) = 0$ and $h(i) = 0$, then $h(i)^2/\theta_0(i) = 0$. In the parametric setting, that is when $k_n$ remains fixed, Le Cam's proof of the Bernstein-Von Mises Theorem (van der Vaart, 1998; van der Vaart, 2002) is made significantly more transparent by resorting to a contiguity argument. In order to adapt this argument to our setting, we need to formulate two conditions. In the sequel, $(k_n)_{n\in\mathbb{N}}$ denotes a non-decreasing sequence of truncation levels.

Condition 2.1. The sequence of truncation levels $(k_n)_n$ satisfies $k_n^3 \leq n \inf_{i\leq k_n} \theta_0(i)$.

Let $(h_n)_{n\in\mathbb{N}}$ denote a sequence of elements from $\mathbb{R}^{k_n+1}$ such that for each $n$, $\sum_{i=0}^{k_n} h_n(i) = 0$. The sequence $(h_n)_{n\in\mathbb{N}}$ is said to be tangent at the p.m.f. $\theta_0$ if the following condition is satisfied.
Condition 2.2. There exists a positive real $\sigma$ such that the sequence $\sigma_n^2(h_n)$ tends to $\sigma^2 > 0$.
The probability distribution $\mathbb{P}_{n,h}$ over $\{0, \ldots, k_n\}^n$ is the product distribution whose marginal is the truncated p.m.f. $\theta_0 + h_n/\sqrt{n}$. We are now equipped to state the building block of the contiguity argument; the proof is given in the appendix (A).
Lemma 2.3. Let $\theta_0$ denote a probability mass function over $\mathbb{N}^*$. If the sequence of truncation levels $(k_n)_{n\in\mathbb{N}}$ satisfies Condition 2.1 and if the sequence $(h_n)_{n\in\mathbb{N}}$ satisfies the tangency Condition 2.2, then the sequences $(\mathbb{P}_{n,h})_n$ and $(\mathbb{P}_{n,0})_n$ are mutually contiguous, that is, for any sequence $(B_n)$ of events where for each $n$, $B_n \subseteq \{0, \ldots, k_n\}^n$, the following holds: $\mathbb{P}_{n,h}(B_n) \to 0$ if and only if $\mathbb{P}_{n,0}(B_n) \to 0$.

Note that throughout the paper, we use De Finetti's convention: if $(\Omega, \mathcal{F}, P)$ denotes a probability space and $Z$ a random variable on $(\Omega, \mathcal{F})$, then $PZ = P[Z] = P(Z)$ denotes the expected value of $Z$ (provided it is well-defined, that is, $PZ^+$ and $PZ^-$ are not both infinite). If $A$ denotes an event, then $P\{A\} = P\mathbf{1}_A$.

Main results
In a Bayesian setting, the set of parameters is endowed with a prior distribution. In this paper, we consider a sequence of prior distributions $(W_n)_{n\in\mathbb{N}}$ matching the non-decreasing sequence of truncation levels we use. Let $W_n$ be a prior probability distribution for $(\theta(i))_{1\leq i\leq k_n}$ such that $\theta = (\theta(i))_{0\leq i\leq k_n} \in \Theta_{k_n}$. Henceforth, we assume that $W_n$ has a density $w_n$ with respect to Lebesgue measure on $\mathbb{R}^{k_n}$. Let $T = (\tau(i))_{0\leq i\leq k_n}$ be a random variable such that $(\tau(i))_{1\leq i\leq k_n}$ is distributed according to $W_n$ and $\tau(0) = 1 - \sum_{i=1}^{k_n} \tau(i)$. Conditionally on $T = \theta$, $(X_n)_{n\in\mathbb{N}}$ is a sequence of independent random variables distributed according to the p.m.f. $\theta$.

Non-parametric Bernstein-Von Mises Theorem
Let $H_n$ be the random variable $H_n = \sqrt{n}\,(\tau(i) - \theta_0(i))_{1\leq i\leq k_n}$, and let $P_{H_n|X_{1:n}}$ denote its posterior distribution, that is, its distribution conditionally on the observations $X_{1:n} = (X_1, \ldots, X_n)$. If the truncation level $k_n = k$ (that is, the dimension of the parameter space $\Theta_{k_n}$) is a constant integer, the classical parametric Bernstein-Von Mises Theorem asserts that the sequence of posterior distributions is asymptotically Gaussian with centering $\Delta_n(\theta_0) = \sqrt{n}(\hat\theta - \theta_0)$ and covariance $I^{-1}(\theta_0)$ when the observations $X_{1:n}$ are independently distributed according to $\theta_0$.
Theorem 3.7 below asserts that under adequate conditions on the sequence of priors W n and on the tail behavior of θ 0 , the Bernstein-Von Mises Theorem still holds provided the truncation levels k n do not increase too fast toward infinity.
For any sequence of prior distributions (W n ) n , for a sequence M n of real numbers increasing to +∞, and a sequence (k n ) n of truncation levels that satisfy Condition 2.1, we will use the following three conditions in order to establish the three propositions the Bernstein-Von Mises Theorem depends on.
Condition 3.1. The sequence of truncation levels $(k_n)_n$ and radii $(M_n)_n$ satisfies

Requiring a prior smoothness condition is commonplace when establishing the asymptotic normality of posterior distributions in parametric settings.
Requiring a prior concentration condition, sometimes called a small-ball probability condition, is usual in non-parametric Bayesian statistics.
Note that the prior concentration condition entails the second condition in Condition 3.1.
The next lemma, which is proved in Section 4.3, asserts that under mild conditions, the posterior distribution concentrates on $\chi^2$ (Fisher) balls centered around maximum likelihood estimates.
Lemma 3.6. (Posterior concentration) If the p.m.f. $\theta_0$ and the sequence of truncation levels $(k_n)_n$ satisfy Conditions (2.1), (3.4) and (3.5), then $P_{H_n|X_{1:n}}\{H_n^T I(\theta_0) H_n \geq M_n\}$ tends to $0$ in $\mathbb{P}_{n,0}$-probability.

This posterior concentration lemma allows us to recover the parametric posterior concentration phenomenon if truncation levels remain fixed, and strengthens the generic non-parametric posterior concentration theorem from Ghosal et al. (2000).
Theorem 3.7. (Bernstein-Von Mises) If the p.m.f. $\theta_0$, the sequences of truncation levels $(k_n)_n$ and radii $(M_n)_n$, and the sequence of priors $(W_n)_n$ satisfy Conditions (2.1), (3.1), (3.4) and (3.5), then $\left\| P_{H_n|X_{1:n}} - \mathcal{N}\left(\Delta_n(\theta_0), I^{-1}(\theta_0)\right)\right\|$ tends to $0$ in $\mathbb{P}_{n,0}$-probability, where $\|\cdot\|$ denotes the total variation norm.
A comparison of the Theorem with previous results available in the literature (Ghosal and van der Vaart, 2007a,b; Ghosal et al., 2000; Ghosal, 2000) is given at the end of the Section.
Remark 3.8. A corollary of the Bernstein-Von Mises Theorem is that

The proof of Theorem 3.7 is organized along the lines of Le Cam's proof of the parametric Bernstein-Von Mises Theorem as exposed by A. van der Vaart in (van der Vaart, 1998) (see also van der Vaart (2002)).

Roadmap of the proof of the Bernstein-Von Mises theorem.
If $P$ is any probability distribution on $\mathbb{R}^{k_n}$ and $M > 0$ is any positive real, let $P^M$ be the conditional probability distribution given the ellipsoid $\{u \in \mathbb{R}^{k_n} : u^T I(\theta_0) u \leq M\}$. To alleviate notation, we will use the shorthands $\mathcal{N}_{k_n}$ and $\mathcal{N}^{M_n}_{k_n}$ to denote the (random) distributions $\mathcal{N}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0))$ and $\mathcal{N}^{M_n}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0))$. From the triangle inequality, it follows that:
$$\left\|P_{H_n|X_{1:n}} - \mathcal{N}_{k_n}\right\| \leq \left\|\mathcal{N}_{k_n} - \mathcal{N}^{M_n}_{k_n}\right\| + \left\|\mathcal{N}^{M_n}_{k_n} - P^{M_n}_{H_n|X_{1:n}}\right\| + \left\|P^{M_n}_{H_n|X_{1:n}} - P_{H_n|X_{1:n}}\right\|\, .$$
The proof of Theorem 3.7 boils down to checking that each of the three terms on the right-hand side tends to $0$ in $\mathbb{P}_{n,0}$-probability.
The first term proves to be the easiest to control, thanks to the well-known concentration properties of the Gaussian distribution. Upper-bounding the middle term is arguably the most delicate part of the proof. The posterior concentration Lemma allows us to deal with the third term.
Let us call $NV(M_n)$ the middle term $\|\mathcal{N}^{M_n}_{k_n} - P^{M_n}_{H_n|X_{1:n}}\|$. The posterior density is proportional to the product of the prior density and of the likelihood function. Hence, controlling the variation distance between $\mathcal{N}^{M_n}_{k_n}$ and $P^{M_n}_{H_n|X_{1:n}}$ requires a good understanding of log-likelihood ratios. A quadratic Taylor expansion of the log-likelihood ratio, combined with algebra along the lines described in (van der Vaart, 2002, p. 142) (computational details are given in the Appendix, see Section C), leads to (3.9). We prove in Section 4.1 that the decay of $NV(M_n)$ depends on prior smoothness around $\theta_0$ and on the ratio between $M_n$ and $(n \inf_{i\leq k_n} \theta_0(i))^{1/3}$:

Proposition 3.10. If the sequence of truncation levels $(k_n)_n$ and radii $(M_n)_n$ satisfies Conditions (3.1) and (3.4), then $NV(M_n)$ tends to $0$ in $\mathbb{P}_{n,0}$-probability.

The third term $\|P^{M_n}_{H_n|X_{1:n}} - P_{H_n|X_{1:n}}\|$ is handled thanks to the posterior concentration lemma, since by Lemma B.1 in the appendix, $\|P^{M_n}_{H_n|X_{1:n}} - P_{H_n|X_{1:n}}\| = 2 P_{H_n|X_{1:n}}\{H_n^T I(\theta_0) H_n \geq M_n\}$.
The proof of the Theorem is concluded by upper-bounding $\|\mathcal{N}_{k_n} - \mathcal{N}^{M_n}_{k_n}\|$. The latter quantity is a matter of concern because we are facing increasing dimensions $(k_n)_{n\in\mathbb{N}}$. The following is checked in Section 4.4.

Proposition 3.11. There exists a universal constant $C$ such that if $\liminf_n (n \inf_{i\leq k_n} \theta_0(i)) \geq c_0 > 0$ and $\liminf M_n/k_n \geq 64$, then for large enough $n$,

Estimating functionals
The Bernstein-von Mises Theorem provides a handy tool for checking the asymptotic normality of estimators of Rényi and Shannon entropies. Antos and Kontoyiannis (2001) established that plug-in estimators of Shannon and Rényi entropies are consistent whatever the sampling probability is. They also proved that entropy estimation may be arbitrarily slow, and that on a large class of sampling distributions, the mean squared error is $O(\log n/n)$. In the parametric setting, that is, with fixed finite alphabets, analogues of the delta method and the classical Bernstein-Von Mises Theorem can be used to check the asymptotic normality of both frequentist and Bayesian entropy estimators. Our purpose is to show that the non-parametric Bernstein-Von Mises Theorem can be used as well.
For any $\alpha > 0$, let $g_\alpha$ be the real function defined for non-negative real numbers by $g_\alpha(u) = u^\alpha$ for $\alpha \neq 1$, and $g_1(u) = u \log u$ (with the convention $g_1(0) = 0$). The additive functional $G_\alpha$ is defined by $G_\alpha(\theta) = \sum_{i\geq 1} g_\alpha(\theta(i))$. The Shannon entropy of the probability mass function $\theta$ is $-G_1(\theta)$ and, for $\alpha \neq 1$, $\frac{-1}{\alpha-1} \log G_\alpha(\theta)$ is the Rényi entropy of order $\alpha$ (Cover and Thomas, 1991).
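A minimal numerical sketch of these functionals (taking the additive functional to be $G_\alpha(\theta) = \sum_i g_\alpha(\theta(i))$, per the definition above; the function names are ours):

```python
import math

def g_alpha(u, alpha):
    """g_alpha(u) = u^alpha for alpha != 1, and u*log(u) for alpha = 1,
    with the convention g_1(0) = 0."""
    if alpha == 1:
        return u * math.log(u) if u > 0 else 0.0
    return u ** alpha

def G_alpha(theta, alpha):
    """Additive functional G_alpha(theta) = sum_i g_alpha(theta(i))."""
    return sum(g_alpha(t, alpha) for t in theta)

def shannon_entropy(theta):
    """Shannon entropy: -G_1(theta)."""
    return -G_alpha(theta, 1)

def renyi_entropy(theta, alpha):
    """Renyi entropy of order alpha != 1: -log(G_alpha(theta))/(alpha - 1)."""
    return -math.log(G_alpha(theta, alpha)) / (alpha - 1)

theta = [0.5, 0.25, 0.25]
print(shannon_entropy(theta))     # = 1.5 * log 2
print(renyi_entropy(theta, 2.0))  # = -log(0.375)
```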
Let $T = (\tau(i))_{0\leq i\leq k_n}$ be distributed according to the posterior distribution; a Bayesian estimator of $G_\alpha(\theta)$ may be constructed using the posterior distribution of $G_{n,\alpha}(T) = \sum_{i=1}^{k_n} g_\alpha(\tau(i))$. The Bernstein-Von Mises Theorem asserts that under $\mathbb{P}_{n,0}$, for large enough $n$, the posterior distribution of $(\tau(i))_{1\leq i\leq k_n}$ is approximately Gaussian, centered around the maximum likelihood estimator $\hat\theta_n = (\hat\theta(i))_{1\leq i\leq k_n}$, with covariance $\frac{1}{n} I(\theta_0)^{-1}$. Theorem 3.12 below makes a similar assertion concerning $G_{n,\alpha}(T)$.
Let $G_{n,\alpha}(\hat\theta_n)$ be the truncated plug-in maximum likelihood estimator: $G_{n,\alpha}(\hat\theta_n) = \sum_{i=1}^{k_n} g_\alpha(\hat\theta_n(i))$. The variance parameter $\gamma_{n,\alpha}$ is defined by
$$\gamma_{n,\alpha}^2 = \sum_{i=1}^{k_n} \theta_0(i)\, g_\alpha'(\theta_0(i))^2 - \Big(\sum_{i=1}^{k_n} \theta_0(i)\, g_\alpha'(\theta_0(i))\Big)^2\, .$$
Notice that $\gamma^2_{n,1}$ has limit $\gamma^2_1 = \sum_i \theta_0(i)(\log \theta_0(i) + 1)^2 - \big(\sum_i \theta_0(i)(\log \theta_0(i) + 1)\big)^2$ as soon as this is finite, and for $\alpha \neq 1$, $\gamma^2_{n,\alpha}$ has limit $\gamma^2_\alpha = \alpha^2 \big[\sum_i \theta_0(i)^{2\alpha - 1} - (\sum_i \theta_0(i)^\alpha)^2\big]$ as soon as this is finite, which requires at least that $\alpha > \frac{1}{2}$.
Now, let $\mathcal{I}$ be the collection of all intervals in $\mathbb{R}$, and for any $I \in \mathcal{I}$, let $\Phi(I) = \int_I \phi(x)\,dx$ where $\phi$ is the density of $\mathcal{N}(0,1)$. The following Theorem asserts that the Lévy-Prokhorov distance between the posterior distribution of $\sqrt{n}\left(G_{n,\alpha}(T) - G_{n,\alpha}(\hat\theta_n)\right)$ and $\mathcal{N}(0, \gamma^2_\alpha)$ tends to $0$ in $\mathbb{P}_{n,0}$-probability. The Lévy-Prokhorov distance metrizes convergence in distribution.
Theorem 3.12. (Estimating functionals) If $\lim_n \gamma^2_{n,\alpha} = \gamma^2_\alpha$ is finite, then under the assumptions of the Bernstein-von Mises Theorem (Theorem 3.7),
$$\sup_{I\in\mathcal{I}} \left| P\left\{\sqrt{n}\left(G_{n,\alpha}(T) - G_{n,\alpha}(\hat\theta_n)\right) \in \gamma_{n,\alpha} I \,\middle|\, X_{1:n}\right\} - \Phi(I) \right| \longrightarrow 0 \quad \text{in } \mathbb{P}_{n,0}\text{-probability.}$$
The proof of this theorem is given in Section 5. Let us define the symmetric Bayesian credible set with would-be coverage probability $1 - \delta$ as the smallest interval which has posterior probability larger than $1 - \delta$. This credible set is an empirical interval since it is defined through an empirical quantity, the posterior distribution. In order to construct such a region, it is enough to sample from the posterior distribution using MCMC sampling methods. Note that this symmetric Bayesian credible set is not the (non-fully-empirical) interval $\left[G_{n,\alpha}(\hat\theta_n) - u_\delta \gamma_{n,\alpha}/\sqrt{n},\; G_{n,\alpha}(\hat\theta_n) + u_\delta \gamma_{n,\alpha}/\sqrt{n}\right]$, where $u_\delta$ is the $1 - \delta/2$ quantile of $\mathcal{N}(0,1)$. Theorem 3.12 just asserts that asymptotically, the symmetric Bayesian credible set has length $2u_\delta \gamma_{n,\alpha}/\sqrt{n}$ and is centered around $G_{n,\alpha}(\hat\theta_n)$. Hence Theorem 3.12 asserts that, in $\mathbb{P}_{n,0}$-probability, Bayesian credible sets for $G_\alpha(\theta_0)$ and frequentist confidence intervals based on truncated plug-in maximum likelihood estimators are asymptotically equivalent. The next theorem provides sufficient conditions for the plug-in truncated maximum likelihood estimators to satisfy a central limit theorem with limiting variance $\gamma^2_\alpha$.

Theorem 3.13. (MLE functional estimation) Assume that $\lim_n \gamma^2_{n,\alpha} = \gamma^2_\alpha$ is finite. If the truncation parameter $k_n$ satisfies:
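The credible-set construction above can be sketched as follows. Since the Dirichlet posterior of Section 3.3 has an explicit form, we sample it directly rather than running MCMC; the cell counts, the prior parameter and all function names below are illustrative assumptions:

```python
import math
import random

def smallest_credible_interval(samples, delta):
    """Smallest interval containing a fraction >= 1 - delta of the
    posterior draws: slide a window of fixed count over the sorted draws."""
    s = sorted(samples)
    n = len(s)
    m = math.ceil((1 - delta) * n)  # number of draws the interval must contain
    best = min(range(n - m + 1), key=lambda j: s[j + m - 1] - s[j])
    return s[best], s[best + m - 1]

def entropy(theta):
    """Shannon entropy of a p.m.f. (convention 0 log 0 = 0)."""
    return -sum(t * math.log(t) for t in theta if t > 0)

def posterior_entropy_draws(counts, beta, n_draws, rng):
    """Plug-in entropy under Dirichlet(beta + counts) posterior draws,
    generated through independent Gamma variables."""
    draws = []
    for _ in range(n_draws):
        g = [rng.gammavariate(beta + N, 1.0) for N in counts]
        total = sum(g)
        draws.append(entropy([gi / total for gi in g]))
    return draws

rng = random.Random(0)
draws = posterior_entropy_draws([30, 30, 20], beta=1.0, n_draws=2000, rng=rng)
lo, hi = smallest_credible_interval(draws, delta=0.05)
print(lo, hi)  # a symmetric-style 95% credible interval for the entropy
```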

Dirichlet prior distributions
We may now check that when using Dirichlet distributions as prior distributions, there exist truncation levels $(k_n)_n$ and radii $(M_n)_n$ such that Conditions 3.4 (prior smoothness) and 3.5 (prior concentration) hold. Let $\beta = (\beta_0, \beta_1, \ldots, \beta_{k_n})$ be a $(k_n+1)$-tuple of positive real numbers. The Dirichlet distribution with parameter $(\beta_0, \beta_1, \ldots, \beta_{k_n})$ on the probability mass functions on $\{0, 1, \ldots, k_n\}$ has density
$$w_{n,\beta}(\theta) = \frac{\Gamma\big(\sum_{i=0}^{k_n} \beta_i\big)}{\prod_{i=0}^{k_n} \Gamma(\beta_i)} \prod_{i=0}^{k_n} \theta(i)^{\beta_i - 1}\, .$$
In the absence of prior knowledge concerning the sampling distribution $\theta_0$, we refrain from assigning different masses to the coordinate components: we consider Dirichlet priors $W_{n,\beta}$ with constant parameter $\beta = (\beta, \ldots, \beta)$ for some positive $\beta$.
Thus, using a Dirichlet prior with parameter $\beta$, the Prior Smoothness and Prior Concentration Conditions hold for $\theta_0$ with truncation levels $k_n$ as soon as Condition 3.1 holds and $k_n \log n + \log(\det(I(\theta_0))) = o(M_n)$.
But the existence of a sequence of radii $(M_n)$ tending to infinity such that both the last condition and Condition 3.1 hold is a straightforward consequence of Condition 2.1 and of the condition in Proposition 3.14.
Note that if the prior distribution is Dirichlet with parameter $\beta$, then the posterior distribution is Dirichlet with parameters $\beta + (N_0, N_1, \ldots, N_{k_n})$. Let $n_i = \sum_{j<i} N_j$ for $i \leq k_n$, agreeing on $n_0 = 0$. Sampling from the posterior distribution is equivalent to picking an independent sample of $n$ exponentially distributed random variables $Y_1, \ldots, Y_n$, picking another independent sample $Z_0, \ldots, Z_{k_n}$ of $k_n + 1$ independent $\Gamma(\beta, 1)$-distributed random variables, and letting
$$\theta^*(i) = \Big(Z_i + \sum_{n_i < j \leq n_{i+1}} Y_j\Big)\Big/\Big(\sum_{j=1}^n Y_j + \sum_{j=0}^{k_n} Z_j\Big)\, .$$
The latter procedure is very close to the Bayesian Bootstrap (Rubin, 1981); indeed, we obtain the latter procedure if we omit the $Z_i$ from the weights. This procedure, which has been extensively investigated (see Lo, 1987, 1988; Weng, 1989, among other references), is now considered as a special case of the exchangeable bootstrap (see van der Vaart and Wellner, 1996, and references therein). Theorems from the preceding section tell us that the Bayesian bootstrap of (non-linear) functionals of the sampling distribution approximates the asymptotic distribution of maximum likelihood estimates. We leave the analysis of the second-order properties of the posterior distribution to further investigations.
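The sampling procedure described above can be sketched as follows (a minimal illustration; grouping the exponential variables by symbol rather than by position in the sample is distributionally equivalent, and the function names are ours):

```python
import random

def posterior_draw(sample, k, beta, rng):
    """One draw theta* from the Dirichlet posterior: one Exp(1) variable Y_j
    per observation, grouped by symbol, plus an independent Gamma(beta, 1)
    variable Z_i per cell. Omitting the Z_i from the weights gives the
    Bayesian bootstrap (Rubin, 1981)."""
    xbar = [xi if xi <= k else 0 for xi in sample]  # truncate the sample
    y_sums = [0.0] * (k + 1)
    for xi in xbar:                                 # group the Y_j by symbol
        y_sums[xi] += rng.expovariate(1.0)
    z = [rng.gammavariate(beta, 1.0) for _ in range(k + 1)]
    total = sum(y_sums) + sum(z)
    return [(z[i] + y_sums[i]) / total for i in range(k + 1)]

rng = random.Random(1)
theta_star = posterior_draw([1, 2, 5, 1, 3, 1, 2, 7], k=2, beta=0.5, rng=rng)
print(theta_star)  # a random p.m.f. on {0, 1, 2}
```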

Examples
Previous results may now be applied to two examples of envelope classes already investigated by Boucheron et al. (2009):

1. The sampling probability $\theta_0$ is said to have exponential($\eta$) decay if there exist $\eta > 0$ and a positive constant $C$ such that for all $i \in \mathbb{N}^*$, $\theta_0(i) \leq C \exp(-\eta i)$.

2. The sampling probability $\theta_0$ is said to have polynomial($\eta$) decay if there exist $\eta > 1$ and a positive constant $C$ such that for all $i \in \mathbb{N}^*$, $\theta_0(i) \leq C i^{-\eta}$.

Let us first assume that $\theta_0$ has exponential($\eta$) decay, and let $\tilde{c} = \frac{1}{C} \wedge \exp(-\eta)$. Invoking Proposition 3.14, the non-parametric Bernstein-Von Mises Theorem holds for $\theta_0$ with exponential($\eta$) decay using the Dirichlet prior with parameter $\beta > 0$ with truncation levels $k_n = \frac{1}{\eta}(\log n - a \log\log n)$, $a > 6$.
Now assume that $\theta_0$ has polynomial($\eta$) decay. Invoking Proposition 3.14, the non-parametric Bernstein-Von Mises Theorem holds using the Dirichlet prior with parameter $\beta > 0$ for appropriately chosen truncation levels. Theorems 3.12 and 3.13 concerning the estimation of functionals hold as soon as $2\alpha > 1 + 1/\eta$, so that the Bayesian estimates of entropy and of Rényi entropy of order $\alpha > 1/2 + 1/(2\eta)$ satisfy a Bernstein-von Mises theorem with rate $\sqrt{n}$.
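As an illustration of how slowly the truncation level grows in the exponential-decay case, the formula $k_n = \frac{1}{\eta}(\log n - a \log\log n)$ can be evaluated numerically (the choice $a = 6.5$ below is an arbitrary admissible value satisfying $a > 6$, and the function name is ours):

```python
import math

def truncation_level_exponential(n, eta, a=6.5):
    """k_n = (log n - a*log log n) / eta for exponential(eta) decay,
    floored to an integer and clipped at 1 for small n."""
    return max(1, math.floor((math.log(n) - a * math.log(math.log(n))) / eta))

for n in (10**4, 10**13, 10**15):
    print(n, truncation_level_exponential(n, eta=1.0))
```

The correction term $a \log\log n$ dominates until $n$ is astronomically large, so only a handful of cells are estimated even for huge samples.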

Comparison with Ghosal's conditions
We now compare our conditions with the set of conditions used by Ghosal (2000) to establish a Bernstein-Von Mises Theorem for sequences of multinomial models using the log-odds parametrization. An exhaustive comparison of the two approaches (that is, comparing the merits of combining Le Cam's proof and concentration inequalities for some quadratic forms with the merits of Ghosal's proof, which refines Portnoy's arguments) should first be based on a general-purpose result characterizing the impact of re-parametrization on the asymptotic normality of posterior distributions. This would exceed the ambitions of this paper. Then a thorough comparison between conditions (P) (Prior Smoothness and Concentration) and (R) (Prior Concentration and behavior of likelihood ratios in the vicinity of the target $\theta_0$) and the conditions used in this paper would be in order. As a matter of fact, provided re-parametrization is taken into account, the prior smoothness conditions in the two papers are not essentially different.
On the other hand, the conditions on the integrability of likelihood ratios seem somewhat different. Dealing with general exponential families, Ghosal (2000) imposes upper bounds on the third and fourth moments of linear forms of $I(\theta)\Delta_1(\theta)$ for $\theta$ close to $\theta_0$ (this is the meaning of the conditions on the growth of $B_{1,n}(c)$ and $B_{2,n}(c)$). In this paper, we take advantage of the fact that $\Delta_n(\theta)$ is a multinomial vector.
Choosing $a$ as $\frac{1}{\sqrt{k_n}} \mathbf{1}$ and carefully performing straightforward computations, it is possible to check that if $\theta_0$ has polynomial($\eta$) decay (according to the framework of Section 3.4), then $B_{2,n}(0) \geq C k_n^{2\eta}$, so that the clause $B_{2,n}(c \log k_n) k_n^2 (\log k_n)/n \to 0$ for all $c > 0$ in Condition (R) implies $k_n^{2+2\eta} \log k_n / n \to 0$. This condition is more demanding than the conditions we obtained at the end of Section 3.4.

Classical non-parametric approach to posterior concentration
We compare the posterior concentration lemma (Lemma 3.6) with the classical results on posterior concentration obtained in non-parametric statistics (see Ghosal et al., 2000; Ghosal and van der Vaart, 2001, 2007a,b).
Let $\Theta_{k_n}$ denote the set of probability distributions over $\{0, \ldots, k_n\}$. Let $\epsilon^2_n$ satisfy $n\epsilon^2_n = M_n$. Let $V_n(\epsilon_n)$ be the set
$$V_n(\epsilon_n) = \Big\{\theta \in \Theta_{k_n} : \sum_i \theta_0(i) \log\frac{\theta_0(i)}{\theta(i)} \leq \epsilon_n^2,\ \sum_i \theta_0(i) \Big(\log\frac{\theta_0(i)}{\theta(i)}\Big)^2 \leq \epsilon_n^2 \Big\}\, .$$
Let $d$ denote the Hellinger distance between probability mass functions: $d(\theta, \theta') = \big(\sum_i (\sqrt{\theta(i)} - \sqrt{\theta'(i)})^2\big)^{1/2}$. Let $D(\epsilon, \Theta_{k_n}, d)$ denote the $\epsilon$-packing number of $\Theta_{k_n}$, that is, the maximum number of points in $\Theta_{k_n}$ such that the Hellinger distance between every pair is at least $\epsilon$.
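The Hellinger distance used here can be computed as follows (a minimal sketch using the unnormalized convention, which matches the isometry with the Euclidean geometry of the unit sphere invoked below; the function name is ours):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two p.m.f.s on the same finite support:
    d(p, q) = (sum_i (sqrt(p_i) - sqrt(q_i))^2)^(1/2)."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(hellinger(p, q))
print(hellinger(p, p))  # 0.0
```

With this convention, mutually singular p.m.f.s are at distance $\sqrt{2}$, the diameter of the positive quadrant of the unit sphere.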
Theorem 2.1 in Ghosal et al. (2000) asserts that, if for some $C>0$ we have $W_n\{V_n(\epsilon_n)\}\geq \exp(-Cn\epsilon_n^2)$ and if $\log D(\epsilon_n,\Theta_{k_n},d)\leq n\epsilon_n^2$, then for large enough $A$, the posterior mass of $\{\theta : d(\theta,\theta_0)\geq A\epsilon_n\}$ tends to $0$ in probability. In this paper, the prior $W_n$ is supported by $\Theta_{k_n}$, and a careful reading shows that the proof in Ghosal et al. (2000) can be adapted to situations where the sampling probability changes with $n$. Now, $\Theta_{k_n}$ endowed with the Hellinger distance is isometric to the intersection of the positive quadrant and the unit sphere of $\mathbbm{R}^{k_n+1}$ endowed with the Euclidean metric, so that there exists a universal constant $C$ such that $\log D(\epsilon_n,\Theta_{k_n},d)\leq n\epsilon_n^2$ if and only if $k_n\log(n/M_n)\leq C M_n$.
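The isometry invoked above is elementary to check numerically: the square-root map $\theta\mapsto(\sqrt{\theta(i)})_i$ sends probability mass functions to the positive quadrant of the unit sphere, and turns the Hellinger distance into the Euclidean one. A minimal sketch (Python/NumPy; the helper name `hellinger` is ours):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance as used in this paper: Euclidean distance
    # between the square-root vectors (no 1/sqrt(2) normalization).
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(0)
k = 10
p = rng.dirichlet(np.ones(k + 1))  # two random pmfs on {0, ..., k}
q = rng.dirichlet(np.ones(k + 1))

# The square-root map lands on the unit sphere (positive quadrant) ...
assert np.isclose(np.sum(np.sqrt(p) ** 2), 1.0)
# ... and the Hellinger distance is the Euclidean distance of the images.
assert np.isclose(hellinger(p, q), np.linalg.norm(np.sqrt(p) - np.sqrt(q)))
```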

Proof of the Bernstein-Von Mises Theorem
In this section, we establish the building blocks of the proof of the Bernstein-Von Mises Theorem, namely Proposition 3.10, the posterior concentration lemma (Lemma 3.6), and Proposition 3.11.

Truncated distributions
In order to prove Proposition 3.10, it is enough to upper-bound $NV(M_n)$, where $A_n$ and $C_n$ are defined in Section 3.1. We take advantage of the fact that integration is performed on $E_{\theta_0,k_n}(M_n)$ in order to uniformly upper-bound the integrand.
Using the duality between $\ell^1_{k_n}$ and $\ell^\infty_{k_n}$, one gets a bound valid for all $h\in E_{\theta_0,k_n}(M_n)$.
The second term can be upper-bounded using the prior smoothness condition. The first term is a sum of two random suprema. The expected value of the maximum of random variables with uniformly controlled logarithmic moment generating functions can be handily upper-bounded thanks to an argument due to Pisier (Massart, 2003): if $(W_i)_{1\leq i\leq k}$ are real random variables, then
$$\mathbbm{E}\Big[\max_{1\leq i\leq k} W_i\Big] \leq \inf_{\lambda>0} \frac{1}{\lambda}\log\Big(\sum_{i=1}^{k} \mathbbm{E}\,e^{\lambda W_i}\Big). \qquad (4.1)$$
For each $i$, the random variable $N_i$ is binomially distributed with parameters $n$ and $\theta_0(i)$. Using $\log(1+u)\leq u$ and, for all $u\geq 0$, $\frac{e^u-1}{u}-1 \geq \frac{e^{-u}-1}{u}+1$, Inequality (4.1) leads to the desired bound: choosing $\lambda = \sqrt{\log(2(k_n+1))\, n\inf_{i\leq k_n}\theta_0(i)}$, as the function $u\mapsto \frac{e^u-1}{u}-1$ is increasing on $\mathbbm{R}_+$, and letting $\delta_n = \sqrt{\log(2(k_n+1))/(n\inf_{i\leq k_n}\theta_0(i))}$, one may bound $\sup_{h\in E_{\theta_0,k_n}(M_n)} w_n\big(\theta_0+\frac{h}{\sqrt n}\big)$, and the proposition follows using Assumptions (3.1) and (3.4) and the fact that $R(u)=O(u)$ as $u$ tends to $0$.
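Pisier's maximal inequality can be illustrated numerically. For $k$ i.i.d. standard Gaussian variables, the right-hand side equals $\log(k)/\lambda + \lambda/2$, minimized at $\lambda=\sqrt{2\log k}$, which yields the familiar bound $\mathbbm{E}[\max_i W_i]\leq\sqrt{2\log k}$. A Monte Carlo sketch (Python/NumPy, purely illustrative; the Gaussian choice is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
k, reps = 100, 20000

# Monte Carlo estimate of E[max_{i <= k} W_i] for k i.i.d. N(0,1) variables.
emax = rng.standard_normal((reps, k)).max(axis=1).mean()

# Pisier's bound: E[max_i W_i] <= (1/lam) log(sum_i E[exp(lam W_i)]).
# For N(0,1) variables the right-hand side is log(k)/lam + lam/2,
# minimized at lam = sqrt(2 log k), where it equals sqrt(2 log k).
bound = np.sqrt(2 * np.log(k))
assert emax <= bound  # the bound holds, here with room to spare
```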

Tail bounds for quadratic forms
In this section, we gather a few results concerning tail bounds for quadratic forms, or square roots of quadratic forms, in Gaussian and empirical settings. All these bounds are obtained by resorting to concentration inequalities for Gaussian distributions or for suprema of empirical processes. Let us start with a first bound concerning chi-square distributions. Let $\xi_n^2$ be distributed according to $\chi^2_{k_n}$ (the chi-square distribution with $k_n$ degrees of freedom); the following inequality is a direct consequence of Cirelson's inequality (Massart, 2003): for all $x>0$,
$$P\big\{\xi_n \geq \sqrt{k_n} + \sqrt{2x}\big\} \leq e^{-x}. \qquad (4.2)$$
The next handy inequality provides non-asymptotic tail bounds for Pearson statistics. For any $\theta\in\Theta_{k_n}$, let $V_n(\theta)$ denote the square root of the Pearson statistic
$$V_n(\theta)^2 = \sum_{i=0}^{k_n} \frac{(N_i-n\theta(i))^2}{n\theta(i)}.$$
The following follows from Talagrand's inequality for suprema of empirical processes (Massart, 2003, p. 170): for all $x>0$,
$$P\Big\{V_n(\theta_0) \geq 2\sqrt{k_n} + \sqrt{2x} + \frac{3x}{\sqrt{n\inf_{i\leq k_n}\theta_0(i)}}\Big\} \leq e^{-x}. \qquad (4.3)$$
Non-centered Pearson statistics also show up while proving the posterior concentration lemma. Let $\theta = \theta_0 + \frac{h}{\sqrt n}$ with $\sigma_n(h)\geq\sqrt{M_n}$. The definition of $V_n(\theta_0)$ then provides a handle on $V_n(\theta)$. Computations carried out in the Appendix allow to establish a tail bound, referred to as (4.4) below, valid as soon as $\sigma_n^2(h)\geq M_n$ and $M_n = o(n\inf_{i\leq k_n}\theta_0(i))$.
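The Cirelson-type bound $P\{\xi_n\geq\sqrt{k_n}+\sqrt{2x}\}\leq e^{-x}$ can be checked by simulation, viewing $\xi_n$ as the Euclidean norm of a standard Gaussian vector. A sketch at one illustrative pair $(k,x)$ of our choosing (Python/NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
k, reps, x = 50, 100000, 2.0

# xi is the square root of a chi-square(k) variable, realized as the
# Euclidean norm of a k-dimensional standard Gaussian vector.
xi = np.linalg.norm(rng.standard_normal((reps, k)), axis=1)

# Empirical tail versus the Cirelson-type bound exp(-x).
emp = (xi >= np.sqrt(k) + np.sqrt(2 * x)).mean()
assert emp <= np.exp(-x)
```

The bound is far from tight here; its value lies in being non-asymptotic and dimension-explicit.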

Proof of the posterior concentration lemma
Proof. We need to check that $P_{H_n|X_{1:n}}\{H_n^T I(\theta_0) H_n \geq M_n\}$ is small in $\mathbbm{P}_{n,0}$-probability. For any $\theta\in\Theta_{k_n}$, let $V_n(\theta)$ denote the square root of the Pearson statistic, as in Section 4.2. A sequence of tests $(\phi_n)_{n\in\mathbbm{N}}$ is defined by $\phi_n = \mathbbm{1}_{\{V_n(\theta_0)\geq s_n\}}$, where each threshold $s_n$ is defined by $s_n = 2\sqrt{k_n}+\sqrt{2x_n}+3x_n/\sqrt{n\inf_{i\leq k_n}\theta_0(i)}$ with $x_n = \frac{M_n}{4}$. The tests $\phi_n$ aim at separating $\theta_0$ from the complements of Fisher balls centered at $\theta_0$, that is, from $\big\{\theta_0+\frac{h}{\sqrt n} : \sigma_n^2(h)\geq M_n\big\}$. Hence, it suffices to check that
$$P_{H_n|X_{1:n}}\big\{H_n^T I(\theta_0)H_n \geq M_n\big\} = P_{H_n|X_{1:n}}\big\{H_n^T I(\theta_0)H_n\geq M_n\big\}\,\phi_n + P_{H_n|X_{1:n}}\big\{H_n^T I(\theta_0)H_n\geq M_n\big\}\,(1-\phi_n) \leq \phi_n + P_{H_n|X_{1:n}}\big\{H_n^T I(\theta_0)H_n\geq M_n\big\}\,(1-\phi_n)$$
is small in $\mathbbm{P}_{n,0}$-probability. As, in order to upper-bound $\mathbbm{P}_{n,0}\,\phi_n$, it is enough to bound the tail of Pearson's statistic under $\mathbbm{P}_{n,0}$, we focus on the expected value of the second term. Note that the latter is null as soon as the maximum likelihood estimator errs too far away from $\theta_0$.
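The behavior of the tests $\phi_n$ under $\theta_0$ can be watched by simulation: $V_n(\theta_0)$ concentrates around $\sqrt{k_n}$, well below $s_n$. A minimal sketch (Python/NumPy; the uniform $\theta_0$ and the value of $x_n$ are our illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n, reps = 20, 10000, 5000
theta0 = np.full(k + 1, 1.0 / (k + 1))  # illustrative uniform pmf on {0,...,k}

# Threshold s_n from the definition of the tests phi_n (x_n plays M_n / 4).
x_n = 5.0
s_n = 2 * np.sqrt(k) + np.sqrt(2 * x_n) + 3 * x_n / np.sqrt(n * theta0.min())

# Square root of the Pearson statistic V_n(theta0), sampled under theta0.
counts = rng.multinomial(n, theta0, size=reps)
V = np.sqrt(((counts - n * theta0) ** 2 / (n * theta0)).sum(axis=1))

# Under theta0 the tests rarely fire: the rejection frequency stays
# below the exponential bound exp(-x_n).
assert (V >= s_n).mean() <= np.exp(-x_n)
```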
In order to control $\mathbbm{P}_{n,0}\big[P_{H_n|X_{1:n}}\{H_n^T I(\theta_0)H_n\geq M_n\}(1-\phi_n)\big]$, we resort to the same contiguity trick as in van der Vaart (1998). Let $A$ be a fixed positive real; define the probability distribution $\mathbbm{P}_{n,A}$ on $\mathbbm{N}^n$ as the mixture of the $\mathbbm{P}_{n,h}$ obtained when the prior is conditioned on the ellipsoid $\theta_0+\frac{1}{\sqrt n}E_{\theta_0,k_n}(A)$. Arguing as in van der Vaart (2002), thanks to Lemma 2.3, one can check that the sequences $(\mathbbm{P}_{n,0})_n$ and $(\mathbbm{P}_{n,A})_n$ are mutually contiguous (for the sake of self-containedness, a proof is given in Appendix A). Hence, it is enough to upper-bound $\mathbbm{P}_{n,A}\big[P_{H_n|X_{1:n}}\{H_n^T I(\theta_0)H_n\geq M_n\}(1-\phi_n)\big]$. We will handle $\mathbbm{P}_{n,0}\,\phi_n$ and $\mathbbm{P}_{n,h}(1-\phi_n)$ using non-asymptotic upper bounds for centered and non-centered Pearson statistics, while the prior mass around $\theta_0$, $W_n\{\theta_0+E_{\theta_0,k_n}(A)/\sqrt n\}$, can be lower-bounded assuming Conditions 3.4 and 3.5.

Non-centered Pearson statistics show up while handling $\mathbbm{P}_{n,h}(1-\phi_n)$. Then, using the definition of $\phi_n$, Inequality (4.4) entails an upper bound on $\mathbbm{P}_{n,h}(1-\phi_n)$. Let us now lower-bound $W_n\{\theta_0+E_{\theta_0,k_n}(A)/\sqrt n\}$ by performing a change of variables. The volume of the ellipsoid in $\mathbbm{R}^{k_n}$ induced by $E_{\theta_0,k_n}(A)$ is the inverse of the square root of the determinant of $I(\theta_0)$ (that is, $\prod_{i=0}^{k_n}\theta_0(i)^{1/2}$) times the volume of the sphere with radius $\sqrt{A}$. Thus, assuming Conditions 3.4 and 3.5, the desired lower bound holds up to a factor $(1+o(1))$.

Posterior Gaussian concentration
Proving Proposition 3.11 amounts to checking that the growth rate of the sequence of radii $M_n$ is large enough to balance the growth rate of the dimension $k_n$. By Lemma B.1, the right-hand side can be upper-bounded in terms of $\xi_n^2$, where $\xi_n^2$ is distributed according to $\chi^2_{k_n}$ (the chi-square distribution with $k_n$ degrees of freedom). The first term is then handled by invoking (4.2), while the second term in the upper bound is handled using (4.3) and choosing $x = \min\big(\frac{M_n}{128}, \frac{c_0 M_n^2}{512}\big)$.

Proof of Theorem 3.12
In frequentist statistics, once asymptotic normality has been proved for an estimator, the so-called delta method allows to extend this result to smooth functionals of the estimator. In this section, we develop an ad hoc approach that parallels the classical derivation of the delta method. Taylor expansions allow to write $\sqrt n\,(G_{n,\alpha}(T)-G_\alpha(\theta_0))$ as the sum of a linear function of $H_n-\Delta_n(\theta_0)$ and of two (random) quadratic forms. Checking the theorem amounts to establishing that, under $\mathbbm{P}_{n,0}$, those two quadratic forms converge to $0$ in distribution.
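For intuition about the delta-method strategy, one can watch the plug-in estimate of the Shannon entropy (one of the functionals mentioned in the abstract) behave asymptotically normally, with limiting variance $\sigma^2=\mathrm{Var}(-\log\theta_0(X))$. A simulation sketch (Python/NumPy) under an illustrative four-point $\theta_0$ of our choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = np.array([0.4, 0.3, 0.2, 0.1])        # illustrative pmf
H0 = -(theta0 * np.log(theta0)).sum()          # Shannon entropy of theta0
sigma2 = (theta0 * np.log(theta0) ** 2).sum() - H0 ** 2  # Var(-log theta0(X))

n, reps = 20000, 400
counts = rng.multinomial(n, theta0, size=reps)
theta_hat = counts / n
# Plug-in entropies; empty cells contribute 0 (convention 0 log 0 = 0).
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(theta_hat > 0, theta_hat * np.log(theta_hat), 0.0)
H_hat = -plogp.sum(axis=1)

# Standardized statistic; the delta method predicts approximately N(0, 1).
Z = np.sqrt(n) * (H_hat - H0) / np.sqrt(sigma2)
assert abs(Z.mean()) < 0.3   # crude sanity checks only
assert 0.5 < Z.var() < 2.0
```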
A Taylor expansion of the logarithm leads to an expansion of $\log\frac{d\mathbbm{P}_{n,h_n}}{d\mathbbm{P}_{n,0}}$. The proof consists in checking the three following points: 1. the remainder term converges in probability toward $0$; 2. the first summand converges in distribution toward $\mathcal{N}(0,\sigma^2)$; 3. the middle term converges in probability toward $-\sigma^2/2$.
Let us check the first point. As a matter of fact,
$$\sup_{i\leq k_n} \frac{|h_n(i)|}{\sqrt n\,\theta_0(i)} \leq \frac{\sigma_n(h_n)}{\sqrt{n\inf_{i\leq k_n}\theta_0(i)}} = o(1).$$
In order to check the second point, note that the random variable $Z_n(h_n) = \frac{1}{\sqrt n}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)}{\theta_0(i)}$ can be rewritten as a sum of i.i.d. random variables. Under $\mathbbm{P}_{n,0}$, each random variable $Y_j$ is equal to $\frac{h_n(i)}{\theta_0(i)}$ with probability $\theta_0(i)$, and $\sigma_n^2(h_n)$ is bounded and bounded away from $0$. The Berry-Esseen Theorem (Dudley, 2002) entails that, as $n$ tends to infinity, $Z_n(h_n)$ converges in distribution toward $\mathcal{N}(0,\sigma^2)$.
Hence, the sequence of distributions of the likelihood ratios $\frac{d\mathbbm{P}_{n,h}}{d\mathbbm{P}_{n,0}}(X_{1:n})$ converges weakly toward a log-normal distribution with parameters $-\sigma^2/2$ and $\sigma^2$. The Lemma follows directly from Le Cam's first Lemma (van der Vaart, 1998).
Lemma A.1. Let $\theta_0$ denote a probability mass function over $\mathbbm{N}\setminus\{0\}$. If the sequence of truncation levels $(k_n)_{n\in\mathbbm{N}}$ satisfies Condition 2.1, the sequences $(\mathbbm{P}_{n,0})_n$ and $(\mathbbm{P}_{n,A})_n$ are mutually contiguous.
Proof. Let $(B_n)$ be a sequence of events where, for each $n$, $B_n\subseteq\{0,\ldots,k_n\}^n$. Then $\mathbbm{P}_{n,A}(B_n) \leq \sup_{\sigma_n^2(h)\leq A} \mathbbm{P}_{n,h}(B_n)$, so that, for some sequence $(h_n)_n$ such that $\sigma_n^2(h_n)\leq A$ for all $n$, $\limsup_{n\to+\infty} \mathbbm{P}_{n,A}(B_n) \leq \limsup_{n\to+\infty} \mathbbm{P}_{n,h_n}(B_n)$.
The reverse implication may be proved with the same reasoning, using that $\inf_{\sigma_n^2(h)\leq A}\mathbbm{P}_{n,h}(B_n) \leq \mathbbm{P}_{n,A}(B_n)$.

Appendix B: Distance in variation and conditioning
The obvious proof of the following folklore lemma is omitted.
Lemma B.1. Let $P$ denote a probability distribution on some space $(\Omega,\mathcal{F})$. Let $A$ denote an event with non-null $P$-probability and let $P_A$ denote the conditional probability given $A$, that is, $P_A(B) = P(A\cap B)/P(A)$. Then $\|P_A - P\| = P(A^c)$.
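The identity of Lemma B.1 is easy to confirm numerically for discrete distributions, computing the distance in variation as half the $\ell^1$ distance between mass functions. A quick check (Python/NumPy; the pmf and event are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
k = 8
P = rng.dirichlet(np.ones(k))              # a random pmf on {0, ..., 7}
A = np.zeros(k, dtype=bool)
A[:5] = True                               # an event with P(A) > 0

# Conditional distribution given A: restrict to A and renormalize.
P_A = np.where(A, P, 0.0) / P[A].sum()

# Distance in variation between pmfs: sup_B |P_A(B) - P(B)|
# equals half the l1 distance between mass functions.
tv = 0.5 * np.abs(P_A - P).sum()
assert np.isclose(tv, 1.0 - P[A].sum())    # equals P(A^c), as the lemma states
```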

Appendix D: Tail bounds for non-centered Pearson statistics
This section provides a proof of Inequality (4.4). Recall from Section 4.2 that, although the statistic at stake is just a sum of i.i.d. random variables, we found no obvious way to use classical exponential inequalities (either Hoeffding's or Bernstein's inequality) to prove the tail bounds we need. Before resorting to classical inequalities, we split the sum into two pieces according to the signs of the coefficients $a^*_i$. The two pieces are handled using negative association arguments.
Let $J = \{i : i\leq k_n,\ a^*_i\geq 0\}$ and $J^c = \{i : i\leq k_n,\ a^*_i < 0\}$. Following Dubhashi and Ranjan (1998), a collection of random variables $Z_1,\ldots,Z_n$ is said to be negatively associated if, for any $I\subseteq\{1,\ldots,n\}$ and any functions $f:\mathbbm{R}^{|I|}\to\mathbbm{R}$ and $g:\mathbbm{R}^{|I^c|}\to\mathbbm{R}$ that are either both non-decreasing or both non-increasing,
$$\mathbbm{E}\big[f(Z_i,\ i\in I)\,g(Z_j,\ j\in I^c)\big] \leq \mathbbm{E}\big[f(Z_i,\ i\in I)\big]\,\mathbbm{E}\big[g(Z_j,\ j\in I^c)\big].$$
By Theorem 14 from Dubhashi and Ranjan (1998), both sets of random variables $\big(a^*_i(N_i-n\theta(i))/\sqrt{n\theta_0(i)}\big)_{i\in J}$ and $\big(a^*_i(N_i-n\theta(i))/\sqrt{n\theta_0(i)}\big)_{i\in J^c}$ are negatively associated in the sense of Dubhashi and Ranjan (1998).
The logarithmic moment generating function of $\sum_{i\in I} a^*_i(N_i-n\theta(i))/\sqrt{n\theta_0(i)}$, where $I=J$ or $J^c$, can then be controlled coordinate-wise: each $N_i$ is binomially distributed with parameters $n$ and $\theta(i)$.
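Negative association of multinomial counts can be observed directly: non-decreasing functions of disjoint sets of coordinates are non-positively correlated. A simulation sketch (Python/NumPy; the pmf and the functions $f$, $g$ are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 100, 200000
theta = np.array([0.3, 0.3, 0.2, 0.2])     # illustrative pmf
N = rng.multinomial(n, theta, size=reps)   # rows are multinomial count vectors

# f and g are non-decreasing functions of disjoint coordinate sets:
f = N[:, 0].astype(float)                  # non-decreasing in N_0
g = N[:, 1].astype(float) ** 2             # non-decreasing in N_1

# Negative association predicts E[f g] <= E[f] E[g], i.e. Cov(f, g) <= 0.
cov = (f * g).mean() - f.mean() * g.mean()
assert cov < 0  # multinomial coordinates compete for the same n trials
```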