Reference priors for exponential families with increasing dimension

In this article, we establish the asymptotic normality of the posterior distribution for the natural parameter in an exponential family based on independent and identically distributed data. The mode of convergence is expected Kullback-Leibler distance and the number of parameters $p$ is increasing with the sample size $n$. Using this, we give an asymptotic expansion of the Shannon mutual information valid when $p = p_n$ increases at a sufficiently slow rate. The second term in the asymptotic expansion is the largest term that depends on the prior and can be optimized to give Jeffreys' prior as the reference prior in the absence of nuisance parameters. In the presence of nuisance parameters, we find an analogous result for each fixed value of the nuisance parameter. In three examples, we determine the rates at which $p_n$ can be allowed to increase while still retaining asymptotic normality and the reference prior property.


Introduction
The Shannon mutual information (SMI) was derived in Shannon (1948a), see also Shannon (1948b), as a rate of information transmission across an information-theoretic channel, that is, the electrical engineer's analog of a likelihood. Formally, the SMI for a random variable $X$ distributed as $P_\theta$ with density $p_\theta$ and equipped with the prior density $\Pi$ is
$$I(\Theta; X) = \int \Pi(\theta)\, D(P_\theta \,\|\, P_X)\, d\theta, \qquad D(F \,\|\, G) = \int f \log\frac{f}{g},$$
where $D(\cdot\,\|\,\cdot)$ is the relative entropy or Kullback-Leibler number and $P_X$ is the marginal for $X$. Here, $f$ and $g$ are densities for distributions $F$ and $G$ with respect to a common dominating measure $\nu$ (suppressed in the notation). The interpretation is that someone draws a value of $\theta$ according to $\Pi$ and transmits it over the channel defined by the likelihood, so the receiver receives an outcome of $X$ with conditional density $p(x|\theta)$. The mutual information is then a transmission rate, in bits per symbol. So, the fastest information transmission will occur for the data source $\Pi$ that maximizes the mutual information.
The supremal transmission rate, over Π, is the capacity of the channel. In addition, if the relative entropy is regarded as a redundancy in noiseless source coding, i.e., it is the extra bits sent beyond what optimal coding would require, the mutual information is the Bayes redundancy and maximizing it gives the maximin redundancy.
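To make the definitions of the SMI and of capacity concrete, the following is a minimal sketch for a finite channel. It assumes discrete parameter and data spaces, which is a simplification since the paper works with densities. The names `mutual_information` and `blahut_arimoto` are illustrative and do not come from the original; the second function uses the standard Blahut-Arimoto iteration, which the paper does not discuss, to approximate the capacity-achieving prior numerically.

```python
import numpy as np


def mutual_information(prior, channel):
    """SMI in bits for a discrete channel.

    prior:   shape (k,), the prior Pi over k parameter values
    channel: shape (k, m), row i is the conditional pmf p(x | theta_i)
    """
    prior = np.asarray(prior, dtype=float)
    channel = np.asarray(channel, dtype=float)
    marginal = prior @ channel                 # P_X(x) = sum_i Pi(theta_i) p(x | theta_i)
    joint = prior[:, None] * channel
    mask = joint > 0                           # terms with zero mass contribute nothing
    ratio = channel[mask] / np.broadcast_to(marginal, channel.shape)[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))


def blahut_arimoto(channel, iters=2000):
    """Approximate the capacity-achieving prior by the Blahut-Arimoto iteration."""
    channel = np.asarray(channel, dtype=float)
    k = channel.shape[0]
    prior = np.full(k, 1.0 / k)                # start from the uniform prior
    for _ in range(iters):
        marginal = prior @ channel
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(channel > 0, np.log(channel / marginal), 0.0)
        kl = np.sum(channel * log_ratio, axis=1)   # D(p(.|theta_i) || P_X) in nats
        prior = prior * np.exp(kl)                 # reweight toward informative inputs
        prior /= prior.sum()
    return prior, mutual_information(prior, channel)


if __name__ == "__main__":
    z_channel = np.array([[1.0, 0.0], [0.5, 0.5]])
    prior, capacity = blahut_arimoto(z_channel)
    print(prior, capacity)
```

For the Z-channel in the usage lines, a direct calculation gives the optimal prior $(0.6, 0.4)$ and capacity $\log_2(5/4) \approx 0.322$ bits, which the iteration should approach.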
In statistics, the widespread use of the SMI began with Lindley (1956). Since then, the SMI as a statistical quantity has been regarded as a measure of dependence between a parameter and data, a measure of distance between distributions, a mode of convergence, a measure of "information" in a data set, and as a sort of average learning rate (with respect to n). In fact, these various interpretations are much at one with Shannon's original communications theory interpretation.
The seminal contribution of Bernardo (1979) was to recognize that the capacity of a channel, that is, its maximal SMI, had an important interpretation in Bayesian statistics. The capacity of a channel represents the fastest learning rate a statistician could achieve on average from a fixed likelihood, and this rate could be attained by finding the capacity-achieving source distribution, which he termed a reference prior. In fact, prior to Bernardo (1979), Ibragimov and Hasminsky (1973) established that Jeffreys' prior is the reference prior in an asymptotic sense (in the absence of nuisance parameters) without using the term reference prior; see also Clarke and Barron (1994) for a modern formulation. The technique of their proof rested on posterior normality.
Because of its desirable properties, the concept of reference priors has received extensive development, especially in a series of papers by Berger and Bernardo and their collaborators such as Berger and Bernardo (1992a) and Berger and Bernardo (1992b). The paper Berger and Bernardo (1989) deserves special mention because it extended the concept of reference priors to include nuisance parameter cases. The derivation of their new reference prior was not given explicitly but probably relied on a calculus of variations argument applied to a heuristic asymptotic expansion. Later, Ghosh and Mukerjee (1992) presented an argument on the basis of a formal asymptotic expansion. Moreover, in the recent paper Berger et al. (2009), the notion of a reference prior has been formalized.
From this overall body of work, it can be surmised that the usual electrical engineering treatment of information concepts, which mostly but not entirely uses discrete random variables, is less appropriate in statistics where continuous variables are common. Essentially, this meant that identifying reference priors had to be done asymptotically in the sample size $n$, since finite $n$ optimizations give discrete reference priors. However, see Zhang (1994) for a convergence result that applies to the case in .
Three important contributions that largely completed the theoretical treatment of reference priors for finite dimensions and led to the present work are the following. Let $X$ be $p$-dimensional and have a distribution controlled by a parameter $\theta$. Write $X^n = (X_1, \ldots, X_n)$ to mean $n$ independent and identically distributed (IID) outcomes of $X$ and let $\Pi(\theta|X^n)$ be the posterior density corresponding to the prior $\Pi$. Then the SMI is given by
$$I(\Theta; X^n) = E_{m_n}\, D\big(\Pi(\cdot|X^n) \,\|\, \Pi(\cdot)\big),$$
where $m_n$ is the mixture of the $n$-fold product of $f(\cdot|\theta)$s with respect to $\Pi$. Understanding the asymptotics of $I(\Theta; X^n)$ as $n$ increases proceeded from Ghosh and Mukerjee (1992) to Sun and Berger (1998), who further developed the conditional mutual information given a nuisance parameter $\psi$, and then to Clarke and Yuan (2004), who handled the general case of $I(T_n | S_n, \psi)$, in which a conditioning statistic $S_n$ (over which the integration is done) as well as a nuisance parameter value $\psi$ are present. The work Berger et al. (2009) extended the class of priors over which the asymptotic optimization had been done in earlier cases.
This tour-de-force showed that the previous restrictions on the prior and likelihood were convenient but not necessary. The successful extension of reference prior concepts to distances other than the relative entropy was done for the chi-squared distance in Clarke and Sun (1997) and completed in Ghosh et al. (2010). Once these results were available, the major outstanding conceptual issues for reference priors for finite dimensional parameters with independent data were largely resolved. Admittedly, there are gaps such as dealing with nuisance parameters outside the relative entropy distance definition, but it is not clear how extensively useful this would be.
Aside from the case of dependent data, which is still being studied, the frontier for reference priors has shifted to high dimensional settings. Reference priors, or more generally objective priors, beyond the finite dimensional case, have received little attention despite the popularity of Ghosh and Ramamoorthi (2003) and the rapid development of nonparametric Bayesian methods that ensued. Apart from the generic recommendation to use a noninformative base measure in a Dirichlet process prior, the main contribution to objective prior selection in the nonparametric case seems to be Ghosal et al. (1997). There, a sequence of uniform priors on carefully selected finite subsets of a class of distributions was proposed. It was shown that when this sequence has a weak limit it can correspond to a uniform distribution and reduces to the Jeffreys prior in regular parametric settings. A variant on this construction, formed by taking a convex combination of those uniform distributions, leads to a consistent posterior even in the nonparametric setting. The posterior usually converges at the optimal rate; see Ghosal et al. (2000).
The present paper is between the finite dimensional setting that has been well studied and the purely nonparametric approach just described. That is, we find rates of increase on the number of parameters p in terms of n so that at each stage, Jeffreys' prior is the reference prior in an asymptotic sense and the sequence of posteriors formed from these priors will be asymptotically normal.
The connection between posterior normality and reference priors has been recognized since Bernardo (1979). This was used implicitly in Berger and Bernardo (1989) and explicitly in Clarke and Barron (1990). Indeed, it is easy to see that asymptotic normality of the posterior should be equivalent to the determination of the reference prior under reasonable conditions such as the parameter having fixed finite dimension, the likelihood satisfying smoothness assumptions, and the mode of convergence being strong enough, that is, essentially equivalent to convergence in the sense of the integrated Kullback-Leibler divergence.
As in the fixed dimensional case, the root of the asymptotic expansion of the SMI lies in posterior asymptotic normality in the $L_1$-sense. This is possible because working with the local parameter allows explicit bounds for the error in approximating a posterior density by its limiting normal form under suitable uniform integrability conditions. The study of posterior normality in the increasing $p$ setting was pioneered by Ghosal (2000); see also Ghosal (1997), Ghosal (1999), and Boucheron and Gassiat (2009). However, the approach in Ghosal (2000) does not give an estimate of the probability of the set $W^c = W_n^c$ on which the $L_1$-distance between a posterior and its limiting normal may fail to be small; Ghosal (2000) only implies that the probability of $W^c$ converges to zero. In order to be able to use the approximate normality in the $L_1$-sense to derive bounds for the expected Kullback-Leibler divergence, we need an explicit bound on the probability of the set $W^c$. Essentially, we construct a set $W$ with high probability on which the $L_1$-distance between the posterior and its approximating normal density is small. Proofs of our results on asymptotic normality in the $L_1$-sense are patterned after those of Ghosal (2000), but there are important technical differences. We use a different decomposition of the integrals into central and tail regions as well as higher moments to bound probabilities. To gain the necessary control on the probability of $W^c$, we reduce $W^c$ further, but then to make the $L_1$-distance small on the larger $W$, we must impose stronger conditions. This leads to a stronger growth restriction on the dimension $p$ as the sample size $n$ grows. Although most proofs in the section on asymptotic normality use ideas already in Ghosal (2000), for the sake of self-containedness and transparency, we shall give complete proofs of most of the results on asymptotic normality in the $L_1$-sense.
Once asymptotic normality is obtained, we use arguments similar to those in Clarke and Barron (1990) to make the transition from $L_1$-distance to the expected Kullback-Leibler divergence. The resulting analysis gives an asymptotic expansion of the SMI as the sum of a dominant term free of the prior, a term that depends on the prior but does not grow with $n$, and another small error term. The representation is virtually identical with that in Clarke and Barron (1990), except that $p$ can now grow to infinity as $n$ does. Optimizing the second term over the prior establishes Jeffreys' prior as the reference prior.
It will be seen that the growth restriction on the rate $p = p_n$ depends on the specific model under consideration. In the easiest case, all the random variables $X_1, \ldots, X_p$ are independent univariate $N(\theta_i, 1)$'s for $i = 1, \ldots, p$, and the Jeffreys prior is uniform on any compact set. Then, it is enough to choose $p = O(n^{1/3-\eta})$ for any $\eta > 0$. By contrast, when the $X_i$'s are multinomial, it will be seen that a much slower growth rate of $p$ with $n$, namely $p = O(n^{1/9-\eta})$, appears to be required for the reference prior to exist and give posterior normality. In our third example, a Dirichlet distribution, we find order $p = O(n^{1/6-\eta})$.
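To get a feel for how restrictive these rates are, the following sketch evaluates $n^{a-\eta}$ for the three exponents quoted above. It sets the constant hidden in the $O(\cdot)$ notation to 1, which is an assumption, so the values indicate orders of magnitude only. The helper name `dimension_bound` and the dictionary `RATES` are hypothetical and not part of the original.

```python
def dimension_bound(n, exponent, eta=0.0):
    """Return n**(exponent - eta): the growth bound on p with the O(.) constant set to 1."""
    return float(n) ** (exponent - eta)


# Exponents quoted in the text for the three examples.
RATES = {"normal": 1 / 3, "Dirichlet": 1 / 6, "multinomial": 1 / 9}

if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        row = ", ".join(
            f"{name}: {dimension_bound(n, a):8.1f}" for name, a in RATES.items()
        )
        print(f"n = {n:>10d} -> {row}")
```

With $n = 10^9$ and $\eta = 0$, the multinomial bound is about 10 cells, against about 1000 coordinates for the normal model.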
We do not know if these rates are the best possible, but it appears that some restrictions like these on the growth rate of $p$ are essential. In the next section we define our setting and notation. Then, Section 3 states our main results for identifying reference priors when nuisance parameters are not present and when they are. Section 4 presents three examples of our results, the normal, the multinomial and the Dirichlet, in which explicit rates on $p$ can be given in terms of $n$. Section 5 discusses extensions of the present results and their implications for prior selection in high dimensions. Appendix A states and proves an asymptotic normality theorem essential to the identification of the reference priors in Section 3 and Appendix B provides some details of proof of the results in Section 3. For convenience, Appendix C gathers together some simple lemmas we use in the various derivations.
We use the following symbols throughout this paper: "$\lesssim$" means inequality "up to a constant multiple"; $I_p$ is the identity matrix of order $p$; and $x^T = (x_1, \ldots, x_p)$ (respectively, $A^T$) stands for the transpose of a vector $x$ (respectively, matrix $A = ((a_{ij}))$ for $i, j = 1, \ldots, p$). The notation $\|\cdot\|$ denotes the Euclidean norm for vectors as well as the operator norm for matrices, that is, $\|A\| = \sup\{\|Ax\| : \|x\| \le 1\}$. We use $\phi_p(\cdot|\mu, \Sigma)$ to mean the $p$-dimensional normal density with mean vector $\mu$ and dispersion matrix $\Sigma = ((\sigma_{ij}))$ for $i, j = 1, \ldots, p$, and $a_n = O(b_n)$ (respectively, $a_n = o(b_n)$) means that $a_n/b_n$ is bounded (respectively, $a_n/b_n \to 0$). We denote a generic constant by $C$, not necessarily the same from occurrence to occurrence.

Setting and assumptions
Let $X^n = (X_1, X_2, \ldots, X_n)$ be IID with density $f(x|\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$, and suppose that the dimension of the $X_i$'s is $p = p_n \to \infty$, where densities are with respect to a $\sigma$-finite measure $\nu$ on $\mathbb{R}^p$. Each distinct value of $p$ is regarded as a stage in the overall structure and there is no necessary linkage from one stage to the next except that we assume that there is a true value $\theta_0$ uniformly in the interior of the $p$-dimensional parameter spaces, i.e., there exists an $\epsilon_0 > 0$ (fixed) such that at the $p$th stage $\{\theta : \|\theta - \theta_0\| < \epsilon_0\} \subset \Theta$. This means that the dimension of the true parameter is increasing but that the extra entries thereby introduced as $p$ increases do not move the true value outside the interior of the corresponding parameter space.
We restrict to the case of natural exponential families given by $f(x|\theta) = \exp[x^T\theta - \psi(\theta)]$. The true mean is therefore $\mu = E_{\theta_0}(X) = \psi'(\theta_0) \in \mathbb{R}^p$ and the $p \times p$ Fisher information matrix is given by $F = \psi''(\theta_0)$. The maximum likelihood estimator (MLE) $\hat\theta$ satisfies $\psi'(\hat\theta) = \bar X = n^{-1}\sum_{i=1}^n X_i$. We use $P_{\theta_0}$ to denote the true distribution of the data, where dependence on $n$ and $p$ is suppressed in the notation.
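As a concrete instance of this notation, consider the independent normal model treated later in Section 4. The following worked illustration is not part of the original derivation; it assumes the dominating measure $\nu$ is taken to be the standard normal distribution $N(0, I_p)$ on $\mathbb{R}^p$, so that the $N(\theta, I_p)$ density relative to $\nu$ has the natural exponential family form.

```latex
\begin{align*}
f(x \mid \theta) &= \exp\!\big[x^{T}\theta - \psi(\theta)\big],
  \qquad \psi(\theta) = \tfrac{1}{2}\,\|\theta\|^{2}, \\
\mu &= \psi'(\theta_0) = \theta_0,
  \qquad F = \psi''(\theta_0) = I_p, \\
\psi'(\hat\theta) &= \bar X
  \;\Longrightarrow\;
  \hat\theta = \bar X = n^{-1}\sum_{i=1}^{n} X_i .
\end{align*}
```

In this case the Fisher information does not depend on $\theta_0$, which is why Jeffreys' prior is uniform for the normal example.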
To state the quantities on which we will impose conditions, let $X \sim f(\cdot|\theta)$ and $V = J^{-1}(X - \mu)$. Let $V = (V_1, \ldots, V_p)'$ and for fixed $\delta > 0$ we can define where $m \ge 0$ is a constant related to the growth of $M_r$ (see condition MCV4). Note that the local restriction on the parameter space appears in (2.4) and (2.5) so the expectation is indexed by the $\theta$ (which depends on $u$), not $\theta_0$. Now, the main hypotheses can be stated in three classes. For any fixed $M > 0$, assume the following two conditions hold uniformly for all $\theta_0$ with $\|\theta_0\| \le M$ (i.e., the implicit constants do not depend on $\theta_0$ as long as $\|\theta_0\| \le M$).
It will be seen in Section 4 that $m = 0$ will suffice for the normal and Dirichlet examples whereas $m = 1$ seems to be needed for the multinomial example; the role of $m$ partly explains the difference in the rates ranging from $n^{1/3}$ to $n^{1/9}$. Second, we must impose conditions on the prior density $\Pi$ for $\theta$.

Conditions PDB: [Prior Density Bounds]
The prior density $\Pi$ satisfies where $K_n$ is some constant, subject to some growth condition (see Condition (BF2) below).
Note that Conditions (PDB1) and (PDB2) ensure that $\Pi(\theta)$ remains bounded below by $e^{-cp\log p}$ for some $c > 0$, for all $\theta$ sufficiently close to $\theta_0$. It is not hard to see that there is a large class of priors for which (PDB1) and (PDB2) are satisfied. Indeed, suppose $\Pi$ is an independence prior given by a product of the $h_j(\theta_j)$, where the $\log h_j$'s satisfy a uniform positivity condition at $\theta_0$, i.e., $h_j(\theta_{j,0}) > \epsilon$, and a uniform Lipschitz condition on a neighborhood of $\theta_0$, i.e., |h j (θ j, ensuring (PDB1) and
Our third set of conditions controls the growth of the norm of the Fisher information or its inverse, and also involves the Lipschitz constant $K_n$ of $\log \Pi(\theta)$ defined in (PDB2) and moment bounds $B_n$ and $B_n^*$.

Conditions BF: [Bounds using F]
For some $\alpha \ge 0$, $\delta > 0$, at $\theta_0$ we have We further assume that $\log n = O(\log p)$. If this fails, the setting is very similar to fixed dimension, and results will go through by slight variation of the arguments; see Ghosal (2000), pages 52-53, for more explanation.
We comment that (BF1) is essentially always satisfied. Indeed, if $\det(F)$ is written as the product of the eigenvalues $\lambda_j$ of $F$, for $j = 1, \ldots, p$, the geometric mean-arithmetic mean inequality gives
$$\det(F) = \prod_{j=1}^{p} \lambda_j \le \Big(\frac{1}{p}\sum_{j=1}^{p} \lambda_j\Big)^{p} = \Big(\frac{\mathrm{tr}(F)}{p}\Big)^{p}.$$
So, taking logarithms and rearranging terms gives provided the diagonal entries of $F$ are uniformly bounded by a polynomial in $p$. The same condition clearly implies $\|F\| \le \mathrm{tr}(F) = O(p^{\alpha})$. This occurs in the normal, multinomial, and Dirichlet examples in Section 4. In addition, for priors satisfying uniform positivity and Lipschitz conditions as above (so that $K_n = O(\sqrt{p})$), (BF2) is always satisfied for some rate.
Given $\Pi$, the posterior density of $\theta$ in an exponential family assumes the convenient form The mixture of densities over the whole parameter space, i.e., the marginal density of $X^n$, will be denoted $m_n(\cdot)$. That is,
$$m_n(X^n) = \int p(X^n|\theta)\,\Pi(\theta)\,d\theta = n^{-p/2}(\det(F))^{-1/2}\int p(X^n|\theta_0 + n^{-1/2}Hu)\,\Pi(\theta_0 + n^{-1/2}Hu)\,du,$$
where the second expression follows from a change of variables.
By contrast, for examining local behavior, we define the local likelihood ratio process, that is, the likelihood ratio in terms of the local parameter $u$, by (2.10). Consequently, the posterior density whose asymptotics we want to find is given in terms of $u$ by When no confusion will result, we drop the subscript $n$, writing only $\Pi^*$ for the posterior, and we use $\Pi$ to denote both the prior probability and its density.

Statements of results
In this section, we state our three main results for the increasing p setting. First, we give an asymptotic expression for the relative entropy between the n-fold product of densities and their mixture distribution. From this we derive a reference prior in the absence of nuisance parameters. Then, equipped with these results we identify reference priors in the presence of nuisance parameters.

No nuisance parameters
In the absence of nuisance parameters, we can derive reference priors from an asymptotic expression for the relative entropy. Our result is the following.
A sketch of the proof is given below; some details are relegated to Section 7. To use Theorem 3.1, let $\Pi$ be a prior density satisfying Conditions (PDB1) and (PDB2) uniformly in $\theta_0$ and concentrated on $\{\|\theta_0\| \le M\} \cap \Theta$. The SMI is given by the expected Kullback-Leibler divergence between the posterior and the prior. By the uniformity in Theorem 3.1, we obtain the following result.
Theorem 3.2. Assume that the conditions of Theorem 3.1 hold uniformly for asymptotically maximizes the SMI.
The proof of Theorem 3.1 rests on the posterior normality established in Appendix A as well as bounding the density ratio $m_n(X^n)/p(X^n|\theta_0)$, formally shown in Appendix B. We begin our sketch of the proof of Theorem 3.1 by stating the bounds for the density ratio. Taken together, these permit general upper and lower bounds on the relative entropy between $P^n_\theta$ and $m_n$. Note that two auxiliary bounds $\lambda_n$ and $\lambda_n^*$ appear in this result. They are formally defined in Appendix A, Lemma 6.7 and in Corollary 6.2, respectively. Here, it is enough to observe $p\lambda_n \to 0$ and $\lambda_n^* \to 0$ as $n \to \infty$ in $P_{\theta_0}$-probability; see Lemma 6.8.
and (PDB2). Then on $W$, we have the bound
Lower bound on $m_n(X^n)/p(X^n|\theta_0)$: Assume Conditions (MCV1) and (MCV3). Then on $W$, we have the bound
Now, we provide a sketch of the proof of Theorem 3.1.
Sketch of proof of Theorem 3.1: The strategy of the proof is to define an error expression and show it goes to zero in $L_1$. Then, the result will follow from the fact that $E(\|\Delta_n\|^2) = p$. To bound $|R_n|$ on $W$, note that Lemma 3.3 may be written as Now, under the Conditions (MCV1)-(MCV4) and (BF1) and (BF2), restricting to the set $W$ gives the bound which goes to 0 as $n \to \infty$. Thus it remains to show that and that the convergence is uniform over $\theta_0 \in \Theta$ satisfying $\|\theta_0\| \le M$.
where $1_A$ is the indicator function for the set $A$. These four convergences are verified in Appendix B.

Nuisance parameters present
While it is often reasonable to use a reference prior for reference purposes, or even directly as an objective prior, it is also common for nuisance parameters to appear. This is particularly common with high dimensional parameters. Thus, reference priors have been extensively studied in nuisance parameter contexts. It will be seen below that our results extend readily to the setting of Berger and Bernardo (1989), Ghosh and Mukerjee (1992), and Clarke and Yuan (2004). For instance, the conditional mutual information given a nuisance parameter $\xi$ is and integrating over $\xi$ gives the conditional SMI, where $P^n_\xi = \int P^n_{\theta,\xi}\,\Pi(\theta|\xi)\,d\theta$. If $\xi$ is fixed dimensional and varies over a compact set, it is enough to verify that the expansion for the information inside the integral is uniform in $\xi$. If this is done, then we obtain an analog to the prior proposed in Berger and Bernardo (1989) and Ghosh and Mukerjee (1992).
The typical situation is that the limiting form in Theorem 3.2 is an improper density. When Jeffreys' prior is not proper, it is routinely truncated. In such cases conditions like "maximizing mutual information" and "permissibility", or their increasing-dimensional analogs, must be imposed; see Berger et al. (2009). A separate problem is that many inference settings have nuisance parameters. That is, prior selection for the parameter of interest must be done conditionally on the value of some other parameter, say $\xi$. However, the inferential goal is not to estimate $\xi$, only $\theta$. We are unconcerned about the value of $\xi$ except that it may affect our inferences on $\theta$. The classic example of this is estimating a normal mean without being concerned about the variance. When the variance is unknown, the intervals from a $t_{n-1}$-distribution for estimating $\theta$ are wider than those from a $N(\theta, \sigma_0^2)$ with $\sigma_0$ known.
To state the nuisance parameter setting formally, augment $\theta$ with $\xi$, where $p = p_n = \dim(\theta)$ and $q = q_n = \dim(\xi)$. That is, in principle, there may be countably many nuisance parameters in the limit of large $n$. Assume that the data remain IID and there are true values $\theta_0$ and $\xi_0$ uniformly in the interior of the $(p+q)$-dimensional parameter spaces. That is, there is a fixed $\epsilon_0$ so that for any stage $p+q$, $\{(\theta, \xi) : \|(\theta, \xi) - (\theta_0, \xi_0)\| < \epsilon_0\} \subset \Theta \times \Xi$, where $\Theta$ is the $p$-dimensional parameter space for $\theta$ and $\Xi$ is the $q$-dimensional nuisance parameter space for $\xi$ at the $n$-th stage. Now, the natural exponential family can be written as where the natural parameter is $\eta = (\eta_1(\theta_1, \xi), \ldots, \eta_p(\theta_p, \xi))$. That is, each component of $\eta$ consists of one of the $\theta_j$'s and possibly all the $\xi_j$'s.
The notation from Section 2 carries over directly. We use $P_{\theta_0,\xi}$ to denote the true distribution of the data, where dependence on $n$, $p$, and $q$ is suppressed in the notation. Now, the true mean is $p$-dimensional and given by $\mu = E_{\theta_0,\xi}(X) = \psi'(\theta_0, \xi) \in \mathbb{R}^p$ and the $p \times p$ Fisher information matrix for the parameter $\theta$ at $\theta_0$ is given by To state our next result, suppose $F$ has been partitioned and write $F_{1,1}$ to mean the upper left $p \times p$ block of $F$. Suppose $\xi$ is known and the dependence of $\psi$ on $\xi$ is smooth, i.e., has continuous first and second derivatives neither of which is zero on a neighborhood $N(\zeta) = N_\xi(\zeta)$ of radius $\zeta$ centered at $\xi$. Also, assume that the three classes of hypotheses (MCV), (PDB), and (BF) hold conditionally on $\xi$. Then we get a conditional version of Theorem 3.2.
Theorem 3.4. Suppose that the dependence of $\psi$ on $\xi$ is smooth and the uniform versions of the hypotheses of Theorem 3.2 are satisfied. Then, for each fixed $\xi \in N(\zeta)$, we have and the error term is uniformly small for $\xi \in N(\zeta)$ as $n$ increases.
Proof. This follows from verifying that the uniformized versions of the Lemmas and Theorems continue to hold under the uniformized hypotheses.
Since the error term in Theorem 3.4 is uniformly small, it is natural to extract a corollary by integrating. We have the following.
Corollary 3.5. Assume the hypotheses of Theorem 3.4 and that ξ has been assigned a prior Π(ξ) on N (ζ). Then, the conditional SMI satisfies Thus, the prior asymptotically maximizes the conditional SMI but the marginal prior for ξ is unconstrained.

Examples
In this section we examine three cases in which p is increasing. We can verify that our hypotheses are satisfied and therefore both the asymptotic normality theorem and its consequences for reference priors hold.

Independent normal model
Consider $n$ IID samples from a $p$-dimensional normal model with mean $\theta = (\theta_1, \ldots, \theta_p)$ and identity covariance matrix, that is, the components of these variables are also independent. Assume also a nonsingular prior density $\Pi$ for $\theta$. First we verify Conditions (PDB), (MCV) and (BF) for this case. Then, we see that if $p = O(n^{1/3-\eta})$ for some $\eta > 0$, the conclusion of Theorem 3.2 holds, and hence Jeffreys' prior, which is the uniform prior on every compact rectangle, is the reference prior.
To begin, observe that (1) and tr(F ) = p. Note that (PDB1) and (PDB2) are satisfied for any well-behaved product-type prior, in particular the uniform, which is Jeffreys' prior in this case.
In this example, it is possible to derive a similar expansion under the much weaker growth restriction $p/n \to 0$ by direct computation, provided that we use an independence prior for the components of $\theta$. In this case, the posterior is again of product form, so the expected Kullback-Leibler divergence is the sum of the Kullback-Leibler divergences for each component. Individually, the Kullback-Leibler divergence between the posterior and the corresponding normal approximation decays like the square of the Hellinger distance, that is, as $n^{-1}$. As there are $p$ components and the Kullback-Leibler divergence is additive in the components, the overall Kullback-Leibler divergence of the product posterior density to the appropriate product normal density decays like $p/n \to 0$.
Indeed, to see why (3.1) in Theorem 3.2 holds, let $\theta = \theta_0 + u/\sqrt{n}$ and write By asymptotic normality (see Appendix A), the first term is $o(1)$ a.s. for $p/n \to 0$, so it suffices to show that the expectation of the second term with respect to the marginal distribution gives the first two terms of (3.1). Note that Since $E(u_i^2 | X_{i,1}, \ldots, X_{i,n}) \to 1$ in $L_1$, the claim follows by integration with respect to $\Pi^*$ first and then by integrating with respect to the marginal of $X$. Thus the Jeffreys prior is asymptotically entropy maximizing among all product form priors under only the mild restriction that $p/n \to 0$. Note that Jeffreys' prior, being the uniform distribution, can be regarded as a product of $p$ constant functions of the components $\theta_j$ of $\theta$ for $j = 1, \ldots, p$.
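The role of the condition $p/n \to 0$ for product priors can be checked numerically in a case where the SMI is available in closed form. The sketch below is an illustration under stated assumptions and is not the paper's argument: it takes independent $N(0, \tau^2)$ priors on each coordinate (which are not compactly supported, so they fall outside the paper's conditions), for which the exact SMI is $(p/2)\log(1 + n\tau^2)$ nats, and it compares this with the familiar fixed-dimension form $(p/2)\log(n/(2\pi e)) + \int \Pi \log(\sqrt{\det F}/\Pi)$, assumed here to be the shape of the expansion. The function names are hypothetical.

```python
import math


def exact_smi(n, p, tau2):
    """Exact SMI (nats) for X_i ~ N(theta, I_p), theta ~ N(0, tau2 * I_p), n observations."""
    return 0.5 * p * math.log1p(n * tau2)


def expansion(n, p, tau2):
    """Leading term plus prior-dependent term of the assumed expansion.

    Since F = I_p, the prior term reduces to the differential entropy of
    N(0, tau2 * I_p), which is (p/2) * log(2 * pi * e * tau2).
    """
    leading = 0.5 * p * math.log(n / (2.0 * math.pi * math.e))
    prior_term = 0.5 * p * math.log(2.0 * math.pi * math.e * tau2)
    return leading + prior_term


def gap(n, p, tau2):
    """Remainder exact - expansion; equals (p/2) * log(1 + 1/(n * tau2))."""
    return exact_smi(n, p, tau2) - expansion(n, p, tau2)


if __name__ == "__main__":
    for n, p in [(10**4, 10), (10**4, 100), (10**4, 10**4)]:
        print(n, p, gap(n, p, tau2=1.0))
```

In this toy case the remainder is approximately $p/(2n\tau^2)$, so it vanishes exactly when $p/n \to 0$ and stays bounded away from zero when $p$ grows proportionally to $n$.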
Next, to illustrate Corollary 3.5, consider $n$ IID samples from a $p$-dimensional normal with mean $\theta = (\theta_1, \ldots, \theta_p)$ and covariance matrix $\sigma^2 I_p$, the $p \times p$ diagonal matrix with diagonal entries $\sigma^2$ unknown. Treating $\sigma$ as a nuisance parameter, and the $\theta_i = \mu/\sigma^2$, for $i = 1, \ldots, p$, as the parameters of interest, the three classes of conditions can be verified conditionally on a value of $\sigma$ in much the same way as in the absence of nuisance parameters. Indeed, it will be apparent that the most stringent condition comes from (MCV1), so our main results will hold when $p = O(n^{1/3-\eta})$ for some $\eta > 0$.
To see this, start by fixing a value of $\sigma$. Observe that the $(p+1) \times (p+1)$ Fisher information matrix as a function of $(\theta, \sigma^2)$ is $F = \mathrm{diag}(\sigma^{-2}, \ldots, \sigma^{-2}, 2\sigma^{-2})$. So, the Fisher information matrix for $\theta$ is the $p \times p$ matrix $F_{1,1} = \sigma^{-2} I_p$. For the third set of conditions, it can be seen that (BF1) is satisfied for $\alpha = 1$ since $\det(F_{1,1}) = \sigma^{-2p}$ and (BF2) is satisfied as in the case when nuisance parameters are not present. Now, conditional versions (on $\sigma$) of Theorems 6.1, 3.1 and 3.2 hold whenever $p = O(n^{1/3-\eta})$ for some $\eta > 0$. In particular, Theorem 3.4 holds and, since conditions (MCV), (PDB), and (BF) hold uniformly for compact sets on which $\sigma > 0$, Corollary 3.5 holds, giving that $\sqrt{\det(F_{1,1})} \propto \sigma^{-p}$ is the conditional reference prior. It is seen that this is improper and independent of $\theta$. Indeed, the analysis extends to the case that each of the $p$ components are independent $k$-dimensional normal random variables all having the same variance matrix $\Sigma(\zeta)$, regarded as a nuisance parameter, provided $\Sigma$ varies over a compact set of non-singular matrices smoothly parametrized (with non-zero derivative) by $\zeta = (\zeta_1, \ldots, \zeta_q)$ for some fixed $q$, and $k$ is fixed as well. If $q$ increases, it is not clear that Theorems 6.1, 3.1, 3.2, 3.4 and Corollary 3.5 hold.

Multinomial model
Consider a multinomial distribution with $(p+1)$ cells. The distribution is characterized by the probability vector $\pi = (\pi_1, \ldots, \pi_p)$ in which $P(\text{cell } j) = \pi_j$ for $j = 1, \ldots, p$, and the probability of the zeroth cell is $\pi_0 = 1 - \sum_{j=1}^p \pi_j$. The multinomial is an exponential family which has $p$ natural parameters given by $\theta_j = \log(\pi_j/\pi_0)$. This transformation corresponds to $\pi_j = e^{\theta_j}/(1 + \sum_{j=1}^p e^{\theta_j})$ for $j = 1, \ldots, p$, and $\pi_0 = 1/(1 + \sum_{j=1}^p e^{\theta_j})$. It will be seen next that it is enough to require that $p = O(n^{1/9-\eta})$ for some $\eta > 0$ for Theorem 6.1 to hold and for Theorem 3.2 to show that Jeffreys' prior is the reference prior.
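The change of parametrization above can be sketched in a few lines. The function names `to_natural`, `to_probabilities` and `mle_natural` are hypothetical. The last one assumes all cell counts are positive and uses the fact that, for this family, $\psi(\theta) = \log(1 + \sum_j e^{\theta_j})$ and $\psi'(\theta) = \pi$, so the likelihood equation $\psi'(\hat\theta) = \bar X$ is solved by applying the natural-parameter map to the observed cell proportions.

```python
import numpy as np


def to_natural(pi):
    """Map (pi_1, ..., pi_p) to theta_j = log(pi_j / pi_0), with pi_0 = 1 - sum_j pi_j."""
    pi = np.asarray(pi, dtype=float)
    pi0 = 1.0 - pi.sum()
    return np.log(pi / pi0)


def to_probabilities(theta):
    """Inverse map: pi_j = exp(theta_j) / (1 + sum_k exp(theta_k))."""
    theta = np.asarray(theta, dtype=float)
    weights = np.exp(theta)
    return weights / (1.0 + weights.sum())


def mle_natural(counts):
    """MLE of theta from cell counts (cell 0 first); assumes every count is positive."""
    counts = np.asarray(counts, dtype=float)
    xbar = counts[1:] / counts.sum()      # sample mean of the indicator vector X
    return to_natural(xbar)               # solves psi'(theta_hat) = xbar


if __name__ == "__main__":
    theta_hat = mle_natural([5, 3, 2])
    print(theta_hat, to_probabilities(theta_hat))
```

The round trip `to_probabilities(to_natural(pi))` returns `pi`, and the fitted probabilities reproduce the observed proportions of cells $1, \ldots, p$.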
To proceed, we verify that the (BF) conditions are satisfied. Suppose for all $j = 1, \ldots, p$, $|\theta_j| \le M$ for some bound $M > 0$. Essentially, this means that all $\pi_j$'s are $O(p^{-1})$. It can be verified that $F = D - \pi\pi^T$, where $D = \mathrm{diag}(\pi_1, \ldots, \pi_p)$. Using standard arguments in matrix algebra and induction on $p$, it can be shown that $\det(F) = \prod_{j=0}^p \pi_j$. To transform this back into the natural parameters, recall the formula where $F_\theta$ is the Fisher information in the $\theta_j$ parameters, $F_\pi$ is the Fisher information in the $\pi_j$ parameters, and $U$ is the matrix with $(i, j)$ entries So, for $|\theta_j| \le M$, (BF1) is satisfied for $\alpha = 1$. This is slightly stronger than applying (2.9). For (BF2), we can take $K_n = O(p^{1/2})$ because (4.2) is a product form prior (in $\theta$) with uniform positivity and as used in (2.7) and (2.8). Now, to get the rate from condition (BF2) we use part A of Lemma 8.1 to get that $F^{-1} = D^{-1} + (1 - \pi^T D^{-1}\pi)^{-1}\mathbf{1}\mathbf{1}^T$. (Note that the denominator is $1 - \pi^T D^{-1}\pi = 1 - \sum_{j=1}^p \pi_j = \pi_0$.) In both the $\pi$ parametrization and the natural $\theta$ parametrization, . This bound is actually sharp. For instance, when $\theta = 0$, i.e., $\pi_j = 1/(1+p)$ for all $j = 0, \ldots, p$, it can be verified that the largest eigenvalue of $F^{-1}$ is $O(p^2)$. It will be seen in the (MCV) conditions that, for the multinomial, we can set $m = 1$. Thus, (BF2) will be satisfied if $p^{5/2+\delta}/\sqrt{n} \to 0$. Next, we examine the (MCV) conditions. We can now use part B of Lemma 8.1 to find $F^{-1/2}$ and verify the (MCV) conditions directly, or we can observe by Section 3 of (Ghosal, 2000, 60- Thus all the (MCV) conditions are satisfied when $p = O(n^{1/9-\eta})$ for some $\eta > 0$. Before we can conclude that this is the rate, we must verify (PDB).
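The determinant and inverse formulas for $F = D - \pi\pi^T$ used in verifying (BF) can be checked numerically. The following is a small sketch with hypothetical function names; the probability vector in the usage lines is an arbitrary choice made for illustration. It also confirms that at $\pi_j = 1/(1+p)$ the largest eigenvalue of $F^{-1}$ equals $(p+1)^2$, in line with the $O(p^2)$ order stated above.

```python
import numpy as np


def fisher_information(pi):
    """F = D - pi pi^T with D = diag(pi_1, ..., pi_p)."""
    pi = np.asarray(pi, dtype=float)
    return np.diag(pi) - np.outer(pi, pi)


def fisher_inverse(pi):
    """Closed form F^{-1} = D^{-1} + (1 / pi_0) * 1 1^T, with pi_0 = 1 - sum_j pi_j."""
    pi = np.asarray(pi, dtype=float)
    pi0 = 1.0 - pi.sum()
    return np.diag(1.0 / pi) + np.ones((pi.size, pi.size)) / pi0


if __name__ == "__main__":
    pi = np.array([0.1, 0.2, 0.3, 0.15])          # arbitrary example, pi_0 = 0.25
    F = fisher_information(pi)
    print(np.linalg.det(F), np.prod(pi) * (1.0 - pi.sum()))   # the two values agree
    print(np.allclose(F @ fisher_inverse(pi), np.eye(pi.size)))

    p = 6
    uniform = np.full(p, 1.0 / (p + 1))           # theta = 0
    print(np.linalg.eigvalsh(fisher_inverse(uniform)).max(), (p + 1) ** 2)
```

The inverse formula is the Sherman-Morrison identity applied to the rank-one update $D - \pi\pi^T$, using $D^{-1}\pi = \mathbf{1}$.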
Note that conditions (PDB) are written in terms of the natural parametrization, so we must transform from the $\pi_j$-parametrization to the $\theta_j$-parametrization, as we did for the (BF) conditions. However, this time we are working with the priors rather than the Fisher information. Let us consider a prior of product form on $\pi$, say $\prod_{j=0}^p h_j(\pi_j)$. To find the Jacobian, recall (4.1). where $\pi_j = \pi_j(\theta)$. Since is seen that we get that $\Pi(\theta) = \prod_{j=0}^p \pi_j h_j(\pi_j)$ where the $\pi_j$'s are expressed as functions of $\theta$. That is, $\Pi(\theta)$ remains of product form. Note that this does not mean independence since $\pi_0 = 1 - \sum_{j=1}^p \pi_j$. Now, assuming $|\log h_j(\pi_j)| = O(\log p)$, which is satisfied for the conjugate Dirichlet class of priors, we have that for $\theta_0$ Now, for $\theta$ in a compact set, all $\pi_j$ are of order $O(p^{-1})$. So, using the inequality $|\log x - \log y| \le \max(x^{-1}, y^{-1})|x - y|$, we get that, for $j = 0, \ldots, p$, since $1/\pi_j$ and $1/\pi_j'$ are of order $O(p)$. Likewise, for all $j = 0, \ldots, p$, we get when the $h_j$'s are Lipschitz with $O(p)$ constant. Using these in (4.3), observe we bound the $j = 0$ term by the sum of the other terms: Thus, (4.3) becomes It is seen that $e^{\theta_j}$ is bounded on compact sets, $(1 + \sum_{k=1}^p e^{\theta_k}) = O(p)$, and the sum over $j$ in the first term gives $p$. Thus, thereby verifying (PDB2).
Note that growth restrictions on $p$ are more stringent for the multinomial than for the normal to identify Jeffreys' prior. This is because in the normal case, components are independent, giving a diagonal Fisher information matrix and moment bounds which do not grow with $p$. These features let us treat the components nearly separately, leading to a weaker growth restriction on $p$. This is consistent with the growth restrictions required for asymptotic normality studied in Ghosal (2000).
It is worth contrasting the priors obtained here via Theorem 3.2, Theorem 3.4 and Corollary 3.5 with other priors for the multinomial. Aside from Jeffreys' prior, the earliest seems to have been developed by Sono (1983) on the basis of transforming the $\pi_j$'s so that the standardized highest posterior density (HPD) regions in the transformed parameters match those of the likelihood by itself. Essentially, this is a sort of invariance and frequentist matching approach. Sono's method gives Jeffreys' prior for $p = 1$ but not for $p \ge 2$. Sono (1983) noted that the resulting priors depend on the ordering of the parameters and that the Bayes test for a point hypothesis on $\pi$ based on the standardized HPD regions is equivalent to the likelihood ratio test (independent of the ordering of the $\pi_j$'s). Berger and Bernardo (1992b) studied the same problem from the standpoint of reference priors, developing ordered group reference priors. This technique is helpful when there is a natural way to partition a finite dimensional parameter of interest into several subvectors that can be ranked in order of inferential importance. They observe that product form priors, such as the Jeffreys prior in the multinomial case, allow inference about the groups of parameters to be decoupled in the sense that the product form of the prior leads to a sort of product form for the posterior. Moreover, the case of the Jeffreys prior for the multinomial is quite special in that it is proper and marginalizes, i.e., integrating out the last subvector of parameters in the prior leads to the reference prior for all but the last subvector of parameters. Berger and Bernardo (1992b) verify that there are often numerous ways to partition a parameter vector into subvectors and that the results are typically not equivalent. Most recently, Bernardo (2010) derived that a Dirichlet$(p^{-1}, \ldots, p^{-1})$ is obtained when the $p$ parameters are partitioned into $p$ groups of one parameter each, i.e., the prior assignment is done one-at-a-time, treating the remaining parameters at each stage as a nuisance. In the present context of increasing $p$, however, this "converges" to a Dirichlet$(0, \ldots, 0)$, which is improper and does not satisfy the requirement that all $\alpha_j^{-1}$ be bounded. Beyond tractable cases like the multinomial, the ordered group reference prior method may not work as conveniently because it rests on assigning an objective prior at each conditioning step. In general, the prior they used comes out of their paper Berger and Bernardo (1989) and follows from an asymptotic expansion in Ghosh and Mukerjee (1992) that requires independence assumptions that are often not satisfied. Thus, while this prior may be a sensible choice in general and may be regarded as an approximation to the reference prior identified in Corollary 3.5, it is not in general a reference prior. Nevertheless, the method of building up objective priors by ordering the parameters and choosing priors in $m$ stages does provide a way to find non-informative priors when there are many parameters.
From a more heuristic perspective, the Dirichlet$(\alpha_1, \ldots, \alpha_p)$ distribution is conjugate to the multinomial $\pi$ and choosing $\alpha_j = 1$ for all $j$ gives a uniform distribution on the $(p - 1)$-dimensional simplex. This prior is objective in the sense of the Principle of Insufficient Reason. This differs from Jeffreys' prior, which is Dirichlet$(1/2, \ldots, 1/2)$ and proper but is not uniform. By contrast, setting all the $\alpha_j$'s to zero results in the limiting improper prior resulting from one-parameter-at-a-time prior assignment. The Dirichlet$(0, \ldots, 0)$, however, can be regarded as uniform on the $\log \pi_j$'s, see Chap. 3, Sec. 5 in Gelman et al. (2004). Heo and Kim (2007) examined the behavior of the posterior from a multinomial likelihood using the Dirichlet prior for a variety of choices of the $\alpha_j$'s.
We need to find $F^{-1}$ and $J$, which we then use to find $J^{-1}$. Using part A of Lemma 8.1 we get $F^{-1} = D^{-1} + a(1 - a\,\mathbf{1}^T D^{-1}\mathbf{1})^{-1} D^{-1}\mathbf{1}\mathbf{1}^T D^{-1}$, in which the factor simplifies to $a(1 - a\sum_{j=1}^{p} a_j^{-1})^{-1}$ and the $(j, k)$-th entry in $D^{-1}\mathbf{1}\mathbf{1}^T D^{-1}$ is $a_j^{-1}a_k^{-1}$ for $j, k = 1, \ldots, p$. Note that $F^{-1}$ is well defined provided $a^{-1} \ne \sum_{j=1}^{p} a_j^{-1}$. It is seen that, for any $\ell$, $a_j$ is bounded above for $\theta_j > \ell > -1$ and $a_j$ goes to zero as $\theta_j$ increases. Likewise, $a$ is bounded above for $\sum_{j=1}^{p} \theta_j + p > \ell > -1$ and goes to zero as $\sum_{j=1}^{p} \theta_j$ increases. Since $F$ is the variance-covariance matrix of a nonsingular distribution, $F$ is positive definite for any set of $\theta$'s satisfying $\theta_j > -1$. Thus, $a^{-1} \ne \sum_{j=1}^{p} a_j^{-1}$. Further, because of the continuous dependence of the $a_j$'s on the $\theta_j$'s, it follows that $b = 1 - a\sum_{j=1}^{p} a_j^{-1}$ remains bounded away from zero if $\theta_j > \ell > -1$ for all $j$ and any fixed $\ell$. It follows that $\|F^{-1}\| \le \mathrm{tr}(F^{-1}) = O(p)$. Now, we use part B of Lemma 8.1 to find $J$, a square root of $F$. Letting $u = \sqrt{a}\,\mathbf{1}$, we find where $v = -a\bigl(1 + \sqrt{1 - a\sum_{j=1}^{p} a_j^{-1}}\bigr)^{-1}\mathbf{1}$ and $w^T = (a_1^{-1/2}, \ldots, a_p^{-1/2})$. So, using part A of Lemma 8.1 again, we find the inverse is which is bounded away from zero and $c \le a$ is bounded uniformly in $\theta$ as long as $\theta_j > \ell$ for any $j$, for any fixed $\ell > -1$. Note that the entries in $M$ are bounded as well.
To find the rates required for the (MCV) conditions we must examine the moments of V = J −1 (X − µ). So, consider a p-dimensional unit vector u T = (u 1 , . . . , u p ) and observe that, from the form of V , we have (4.5) Recall that X j = log Y j and that Y j = W j /W in distribution, where the W j 's are independent Gamma(θ j + 1, 1) random variables and W = p j=1 W j is a Gamma( p j=1 θ j + p, 1) random variable. Consequently, It is the fourth moment of the last upper bound that we must control for (MCV3). So, using (a + b + c + d) 2 ≤ 4a 2 + 4b 2 + 4c 2 + 4d 2 gives four terms We bound (4.6) and (4.8) by the Marcinkiewicz-Zygmund inequality for centered random variables with finite 2r-th moments: Thus, for r = 2 (4.6) is bounded by and (4.8) is bounded by (4.11) Now, to work out rates for the (MCV) conditions, we start by finding the orders of (4.7), (4.9), (4.10), and (4.11).
Consider the expression in (4.10). As before, when the θ's are bounded, a −2 j < C ′ and there is a C so that E| log W j | 4 < C. Since 1 = u 4 ≥ u 4 1 + · · · + u 4 p , (4.10) is bounded by Next, for (4.7), recall that W ∼ Gamma( p j=1 θ j + p, 1). So E( log W ) 4 = Ψ ′′′ ( p j=1 θ j + p) ≤ c since all polygamma functions are bounded above as long as the argument stays away from zero. This is due to the series representation Indeed, as t → ∞. In the present case, all θ j > ℓ > −1, so p j=1 θ j + p grows like p, i.e., E( log W ) 4 = O(p −3 ). For bounded θ, there is a ℓ so that a j > ℓ > 0, i.e., a −1/2 j has a finite bound C ′ . So, (4.7) is bounded by Now consider the expression in (4.11). Bounding E| W j | 4 's and a −2 j 's by a constant, (4.11) is bounded from above by We now argue that c 2 ( p j=1 (u j /a j )) 2 is bounded. Then, since b is bounded away from zero, it will follow that the above expression is O(p 2 ). To this end, observe that c = a(1 + √ b) −1 and note that since aa −1 j ≤ 1. Alternatively, one can use the fact that c = O(p −1 ).
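The decay of the polygamma terms can be illustrated numerically for the trigamma function, which has the standard series representation $\Psi'(t) = \sum_{k \ge 0} (t + k)^{-2}$. The sketch below is an illustration under stated assumptions, not the paper's computation: `trigamma` is a hypothetical helper that peels off leading terms of the series through the recurrence $\Psi'(t) = \Psi'(t+1) + t^{-2}$ and then applies the usual asymptotic expansion once the argument is at least 10.

```python
from math import pi


def trigamma(t):
    """Approximate Psi'(t) for t > 0.

    Leading terms of the series sum_{k>=0} 1/(t+k)^2 are added one by one
    until the argument reaches 10; the remainder uses the asymptotic
    expansion 1/t + 1/(2 t^2) + 1/(6 t^3) - 1/(30 t^5) + 1/(42 t^7).
    """
    if t <= 0:
        raise ValueError("this sketch only handles t > 0")
    total = 0.0
    while t < 10.0:
        total += 1.0 / (t * t)
        t += 1.0
    inv = 1.0 / t
    inv2 = inv * inv
    tail = inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return total + tail


# Sanity check against the known value Psi'(1) = pi^2 / 6.
print(trigamma(1.0), pi ** 2 / 6.0)

# t * Psi'(t) approaches 1, so Psi'(t) = O(1/t): with an argument growing
# like p, the quantity a = Psi'(sum_j theta_j + p) is of order 1/p.
for t in (10.0, 100.0, 1000.0):
    print(t, t * trigamma(t))
```

The same recurrence-plus-expansion idea extends to higher-order polygamma functions, which decay faster; only the trigamma case is sketched here.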
Finally, we bound (4.9). As E( log W ) 4 = O(p −3 ), letting C ′ be a bound on all a −1 j 's and C ′′ be a bound on a −1/2 j 's for bounded θ, (4.8) is bounded by since, as shown above, c p j=1 (u j /a j ) = O(1) and . Thus E|u T V | 4 = O(p 2 ) for any u with u = 1 for θ bounded away from zero and uniformly bounded above. That is, B * n = O(p 2 ) since the domain of θ's includes local neighborhoods of the sort used in the definition of B * n . To get rates for B n and B ′ n , note that E|X| 3 ≤ (E|X| 4 ) 3/4 . Thus, E|u T V | 3 ≤ (O(p 2 )) 3/4 = O(p 3/2 ) and B n = B ′ n = O(p 3/2 ). Note that the preceding extends to give M 2r = max j E θ0 |V j | 2r = O(1). We can now identify rates because we can choose m = 0 to satisfy (MCV4). Write Then, for any r ≥ 1, Taking expectations gives uniformly for all θ j 's bounded and θ j > ℓ > −1. This gives E|V j | 2r = O(1) for any r ≥ 1 and j = 1, . . . , p. So, we can choose m = 0 in (MCV4) as claimed. Now, (MCV1)-(MCV3) give the following.
It remains to verify (BF1), (BF2), (PDB1) and (PDB2). Verification of (BF1) is easy: apply (2.9) to see that any α > 1 will suffice. By contrast, Conditions (BF2), (PDB1) and (PDB2) involve properties of the prior and so are related to each other. We consider two cases: Π is a conjugate prior and Π is Jeffreys' prior.
Beginning with conjugate priors, recall that a regular exponential family such as the Dirichlet has natural form $\exp\{\sum_{j=1}^{p} \theta_j x_j - \psi(\theta)\}$. So, its conjugate family is of the form $\Pi(\theta) \propto \exp\{\sum_{j=1}^{p} \theta_j \alpha_j - \lambda\psi(\theta)\}$, where $\lambda > 0$ and $\alpha_j/\lambda < 0$ on bounded sets of $\theta$; see Chap. 4, p. 113, of Brown (1986). In the present setting, $\theta$ ranges over $\{\theta : -1 < \ell < \theta_j < M \text{ for all } j\}$, and the Dirichlet is regular in the natural parameter $\theta$. Note that the conjugate prior is not of product form, so we cannot use (2.7) or (2.8) and must proceed with a direct verification of (PDB1) and (PDB2).
For any conjugate prior, (PDB1) is The last term on the right is bounded when the θ j 's are in a compact set so (PDB1) is satisfied for conjugate priors with rate O(p). For (PDB2), the difference of logarithms for conjugate priors is Thus, it is seen that the rate K n in (BF2) is K n (p) = O( √ p).

Now, verification of (BF2) is easy. Since
, a weaker constraint than the (MCV) conditions imposed. So, the overall rate is $p = O(n^{1/6-\eta})$.
Next, we turn to the verification of (BF2), (PDB1) and (PDB2) for Jeffreys' prior where $a_j = \Psi'(\theta_j + 1)$ and $a = \Psi'(\sum_{j=1}^{p} \theta_j + p)$. Since Jeffreys' prior is based on the Fisher information, it is enough to observe that, as long as the entries in $F$ are polynomially bounded in $p$, $\log\det(F)$ will be $O(p \log p)$ using (2.9), thereby satisfying (PDB1).
To begin the verification of (PDB2), write We first show that the "error term" in matrix ordering. To do this, we use the form of F (θ) and calculate directly. We have where 1 is the vector (1, . . . , 1) T of p ones, a ′ = Ψ ′ ( p j=1 θ ′ j + p) and a ′ j = Ψ ′ (θ ′ j + 1). It is sufficient to show that all entries are bounded by a constant multiple of θ − θ ′ . Note that max j |θ j − θ ′ | ≤ θ − θ ′ .
For the first term in (4.13), note |a j − a ′ j | is bounded by |θ j − θ ′ j |, so each entry is bounded by max j |θ j − θ ′ j |. The second term is similar because all the entries in the matrix are of the form (a j − a ′ j ) times a bound which is finite when all the θ j 's are bounded and that |a j − a ′ j | ≤ C|θ j − θ ′ j | by the mean value theorem. Further, a = O(p −1 ), so the entries in the second term are bounded by Cp −1 θ − θ ′ . For the third term, note that by the mean value theorem, since |Ψ ′ (t)| = O(t −1 ) as t → ∞. The other terms are of the form a −1 j and hence bounded above since the a j 's are bounded below when the θ j 's are bounded. Finally, as observed above, (a ′ − a) in the first factor in the fourth term in (8.2) is bounded by Cp −1/2 θ − θ ′ and a = O(p −1 ). The second factor is O(p) since the a −1 j 's are bounded by a constant when the θ j 's are bounded. The remaining two factors are entry-wise bounded as well, so all entries in this term are bounded by the overall bound Cp −1/2 θ −θ ′ . Collecting these bounds together, all entries of F (θ) −1 (F (θ ′ ) − F (θ)) are bounded by a constant multiple of θ − θ ′ . An application of Lemma 8.2 now gives (4.12). This leads to (4.14) So, we get (PDB2) and can take K n = O(p 2 ) in (BF2). Using this and F −1 = O(p), (BF2) becomes p 2 p 1/2 p (1+0)/2+δ / √ n → 0, i.e., p = O(n 1/6−η ), as in (MCV1).
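The exponent arithmetic behind these rates can be made explicit. The sketch below assumes, as the display above suggests, that (BF2) requires $K_n \|F^{-1}\|^{1/2} p^{(1+m)/2+\delta}/\sqrt{n} \to 0$; the helper names `bf2_power` and `growth_exponent` are hypothetical. If the powers of $p$ add up to $e + \delta$, the requirement $p^{e+\delta}/\sqrt{n} \to 0$ for small $\delta$ corresponds to $p = O(n^{1/(2e) - \eta})$.

```python
from fractions import Fraction


def bf2_power(k_power, finv_power, m):
    """Total power of p in K_n * ||F^{-1}||^{1/2} * p^{(1+m)/2}, ignoring delta.

    k_power    : K_n = O(p^k_power)
    finv_power : ||F^{-1}|| = O(p^finv_power)
    m          : the integer m from (MCV4)
    """
    return Fraction(k_power) + Fraction(finv_power) / 2 + Fraction(1 + m, 2)


def growth_exponent(total_power):
    """Exponent e such that p^total_power / sqrt(n) -> 0 is implied by p = O(n^(e - eta))."""
    return Fraction(1, 2) / Fraction(total_power)


# Dirichlet with Jeffreys' prior: K_n = O(p^2), ||F^{-1}|| = O(p), m = 0.
dirichlet = bf2_power(2, 1, 0)
print(dirichlet, growth_exponent(dirichlet))

# Multinomial: K_n = O(p^{1/2}), ||F^{-1}|| = O(p^2), m = 1.
multinomial = bf2_power(Fraction(1, 2), 2, 1)
print(multinomial, growth_exponent(multinomial))
```

For the multinomial this gives the exponent implied by (BF2) alone; the overall rate quoted in the text is smaller because the (MCV) conditions are more restrictive there.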

Discussion
Here we have established an asymptotic expansion for the relative entropy $D(p^n_{\theta_0} \| m_n)$ between an $n$-fold product of an i.i.d. model and the mixture over such models with respect to the prior. The error term is $o(1)$ and the dimension $p$ of $\theta$ is increasing with $n$. We observe that our expansion is uniform under appropriate assumptions. This leads to an expansion for $I(X^n \mid \Pi)$, the SMI between a parameter and a sample of size $n$, for a general class of priors. The term involving the prior can be maximized so that the corresponding reference prior is seen to be Jeffreys' prior, even when $p$ is increasing with $n$. We have verified in three examples, the normal, the multinomial, and the Dirichlet, that our hypotheses are satisfied when $p = O(n^{1/3-\eta})$, $p = O(n^{1/9-\eta})$, and $p = O(n^{1/6-\eta})$, respectively, for some $\eta > 0$. An analogous treatment can be given when the model depends on a nuisance parameter. In particular, one can integrate the asymptotic expression for the SMI given a specific value of the nuisance parameter over a range of nuisance parameters to obtain the conditional SMI, which can be optimized as well to give a conditional version of Jeffreys' prior, although the prior on the nuisance parameter is indeterminate. We comment that the treatment given for the more general setting of Clarke and Yuan (2004) is also expected to generalize. Moreover, other measures of distance may be amenable to the same sort of treatment, parallel to Ghosh et al. (2010).
Our main results, like other reference prior derivations, rest on an asymptotic normality result in Appendix A. The key feature of this result, in contrast to other asymptotic normality results, is that the error of approximation admits an explicit bound in the increasing parameter case. The approximation is in $L^1$-distance and the set on which the bound fails has probability decaying polynomially in $p$.
From Bernardo (1979), Clarke and Yuan (2004), the present results, and numerous other authors, it can be seen that, typically, when SMI can be optimized in its conditional or unconditional form, the result is a prior based on the normalized square roots of asymptotic variances. The reference prior obtained in Section 5 is also based on Jeffreys' prior, but conditionally on the nuisance parameter (whose distribution is unconstrained). In Clarke and Yuan (2004), all the priors obtained are based on square roots of asymptotic variances, typically of an asymptotically normal statistic, or on ratios of asymptotic variances from asymptotically normal statistics. This general form is consistent with those derived under invariance considerations by George and McCulloch (1993).
The merit of Jeffreys' prior, and variants such as ratios of asymptotic variances, remains somewhat inconclusive for high dimensional problems. Obviously, if the information-theoretic assumptions are satisfied, then Jeffreys' prior, or its similarly derived variants are ineluctable. Even so, using the Jeffreys prior directly can be cumbersome when the Fisher information is far from diagonal, e.g., the Dirichlet example. One way around this (suggested originally by Jeffreys and since studied extensively) is to use a product of Jeffreys priors for individual parameters, or groups of parameters, see Berger and Bernardo (1992b) for one instance of this. There is evidence that this is a viable solution in some cases. There are also cases in which truncating the parameter space to get propriety leads to nontrivial dependence on the truncation. This can be examined via robustness to cut-off specification and some researchers have put a hyper prior on the point of truncation to good effect.
Nevertheless, some investigators argue that, relative to ideal inference, Jeffreys' prior can put too little or too much weight on tail regions of the parameter space. Chen et al. (2009) noted that in many binomial regression problems Jeffreys' prior has tails that are lighter than those of any multivariate t-distribution. By contrast, Jeffreys' prior for the mean and variance, $(\mu, \Sigma)$, in a multivariate normal problem is $\Pi(\mu, \Sigma) \propto |\Sigma|^{-(p+2)/2}$ but the exact frequentist matching prior is $\Pi_{FM}(\mu, \Sigma) \propto |\Sigma|^{-p}$, see Geisser and Cornfield (1963), indicating the tails of Jeffreys' prior may be heavier than desirable. Note that this means Jeffreys' prior can put too little or too much mass around some points such as zero. This occurs even in the simplest setting of a Bernoulli($\pi$), where Jeffreys' prior is $\Pi_J(\pi) \propto (\pi(1-\pi))^{-1/2}$ and puts relatively high mass near 0 and 1 compared with the region around 1/2, making some values of $\pi$ more reasonable a priori than others. Zhu and Lu (2004) explain this by an estimator matching argument. Roughly, they looked for priors that make the MLE equal to the posterior mean and argue that the uniform is not always least informative, deriving the Haldane prior $\Pi_H(\pi) \propto (\pi(1-\pi))^{-1}$ (when one wants a uniform distribution on $\log(\pi/(1-\pi))$) and a prior that concentrates near $\pi = 0$ or 1.
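The estimator matching idea can be seen in the Bernoulli case with a small calculation. Under a Beta$(a, b)$ prior, the posterior after $x$ successes in $n$ trials is Beta$(a + x, b + n - x)$, with mean $(a + x)/(a + b + n)$. The sketch below uses a hypothetical helper `posterior_mean` and illustrative data; it treats the Haldane prior as the formal limit $a = b = 0$, which yields a proper posterior only when $0 < x < n$.

```python
def posterior_mean(successes, trials, a, b):
    """Posterior mean of pi under a Beta(a, b) prior for Bernoulli data.

    a = b = 0 is the improper Haldane prior, taken here as a formal limit;
    the posterior is then proper only if 0 < successes < trials.
    """
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    if a + b + trials <= 0:
        raise ValueError("posterior mean undefined without data or prior mass")
    return (a + successes) / (a + b + trials)


x, n = 3, 10  # illustrative data, not taken from the paper
mle = x / n
for name, (a, b) in {
    "Haldane  Beta(0, 0)": (0.0, 0.0),
    "Jeffreys Beta(1/2, 1/2)": (0.5, 0.5),
    "uniform  Beta(1, 1)": (1.0, 1.0),
}.items():
    print(name, posterior_mean(x, n, a, b), "MLE:", mle)
```

Only the Haldane choice reproduces the MLE $x/n$ exactly; the Jeffreys and uniform priors shrink the estimate toward 1/2 by different amounts.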
On the other hand, for some sparse problems involving dimension reduction via principal components, see Guan and Dy (2009), Jeffreys' prior seems to work well. Also, various modifications of Jeffreys' prior, such as those in Berger and Bernardo (1992b) and Yang and Berger (1994), give good performance, even in certain regression problems; see Chen et al. (2009). Overall, it seems rare that Jeffreys' prior, or some modification of it, will fail to give good results.

Appendix A: Posterior normality
The proof of Theorem 3.1 rests on the asymptotic normality of the posterior in the $L^1$-sense. This posterior normality is of a particularly strong form because we obtain an explicit bound on the $L^1$-distance on a "good" set $W = W_{n,p,\theta_0,\delta} = \{\|\Delta_n\| \le \tfrac{1}{4} p^{(1+m)/2+\delta}\}$ such that $P(W^c)$ decays to zero at a polynomial rate in $p^{-1}$ (see Lemma 6.5). The first step of the proof of posterior normality is to use an instance of an inequality that can be stated informally as follows. Let $a$ and $b$ be positive integrable functions of $u$. Let $\{N, N^c\}$ be a partition of the domain such that, informally, $N$ stands for the central region, where $|a - b|$ is small, whereas $N^c$ stands for the tail region, where $a$ and $b$ are individually small. Then, we can estimate the $L^1$-distance between the normalized functions as follows. By adding and subtracting $a/\int b$, bounding the first term by the second, and partitioning the domain of integration, we get inequality (6.1). Now, we can state our bound on the $L^1$-distance between the posterior and its normal approximation.
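The add-and-subtract step described above yields, before the domain is partitioned, the bound $\int |a/\int a - b/\int b| \le 2\int |a - b| / \int b$. This intermediate form is an assumption about the display that is not reproduced here, though it follows directly from the stated steps. A minimal numerical sketch on a grid, with a hypothetical helper name and illustrative choices of $a$ and $b$:

```python
from math import cos, exp, sin


def normalized_l1_distance_and_bound(a_vals, b_vals, step):
    """Riemann-sum versions of int |a/int a - b/int b| and 2 int |a - b| / int b."""
    total_a = sum(a_vals) * step
    total_b = sum(b_vals) * step
    lhs = sum(abs(x / total_a - y / total_b) for x, y in zip(a_vals, b_vals)) * step
    rhs = 2.0 * sum(abs(x - y) for x, y in zip(a_vals, b_vals)) * step / total_b
    return lhs, rhs


step = 0.01
grid = [-8.0 + step * i for i in range(1601)]  # covers [-8, 8]

# b plays the role of the unnormalized normal approximation; a is a positive
# perturbation of it (the factor stays between 0.5 and 1.5).
b_vals = [exp(-0.5 * u * u) for u in grid]
a_vals = [(1.0 + 0.3 * sin(u) + 0.2 * cos(u)) * exp(-0.5 * u * u) for u in grid]

lhs, rhs = normalized_l1_distance_and_bound(a_vals, b_vals, step)
print(lhs, rhs, lhs <= rhs)
```

Splitting the right-hand integral over $N$ and $N^c$ and using $|a - b| \le a + b$ on the tail region then gives a three-term bound of the kind used in the proof of Theorem 6.1.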
Theorem 6.1. Assume Conditions (MCV), (PDB) and (BF). Then on W , we have Proof. The proof is very similar to that of Theorem 2.3 of Ghosal (2000). Start by using (6.1) with a(u) = Π(θ 0 + n −1/2 Hu)Z n (u), b(u) = Π(θ 0 )Z n (u), recognizing that φ p is b(u)/ b(v)dv. This gives that the left hand side of (6.2) is bounded by two times 3) The first term in (6.3) can be bounded by adding and subtracting Π(θ 0 )Z n (u) and using the triangle inequality, namely by in view of Lemmas 6.8 and 6.12 below, respectively. The bound on the second term of (6.3) follows from Lemma 6.10 and is e −p 1+m+2δ /16 . The bound on the third term of (6.3) follows directly from Lemma 6.11 below and is e −c1p 1+m+2δ .
To complement Theorem 6.1, we extract a corollary that bounds the probability of W c . It is this result that is used in Theorem 3.1.
is bounded by a multiple of Proof. By adding and subtracting the limiting normal and using the triangle inequality, The corollary now follows from Theorem 6.1 and Lemma 6.11 below.
Next, we turn to the formal proof of Theorem 6.1 which we have broken up into a series of Lemmas. We begin with a result on local expansions for ψ and ψ ′ . It is a restatement, in terms of the local parameter, of an approximation result due to Portnoy (1988).
Lemma 6.3. The normalizing function in the exponential family has the local expansion for every u, where, for someθ lies between θ 0 and θ 0 + n −1/2 Hu, where R 2n = 1 2n Eθ (u T V ) 2 JV . The following lemma bounds the moments of ∆ n and hence controls probabilities of deviation of it.
Lemma 6.4. Let $r \ge 1$. Then there exist universal constants $C_{2r}$, depending only on $r$, so that $E\|\Delta_n\|^{2r} \le C_{2r} M_{2r} p^{r}$.
The next lemma bounds the probability of W c , ensuring it is unlikely that ∆ n is too large.
Proof. The proof follows from Markov's inequality, Lemma 6.4 and the use of Condition (MCV4) to control M 2r . We have from Condition (MCV4) that Note that the main role of (MCV4) appears in Lemma 6.5 above to control the probability of W c . This kind of condition is not needed for posterior normality because posterior normality is a local property, i.e., depends only on a shrinking neighborhood of θ 0 , and on the increase in number of data points. It is only when we want to aggregate over data sequences by taking a probability that we must control moments as in (MCV4).
On $W$, the set where $\Delta_n$ is relatively small, a bound on the maximum likelihood estimator for the standardized parameter can be given. The probability that this bound is violated can also be controlled at the $O(p^{-2r\delta})$ rate. This is formalized in the following.
This completes the proof of Theorem 6.1.
We begin with the proof of Lemma 3.3.
Here, in the second step, estimate (6.19) and the lower bound of the prior ratio in Lemma 6.12 were used. Now, the statement follows from the second part of Lemma 6.11.
Proof of Theorem 3.1: Let us start with the first error term (3.4).
By restricting the domain of integration, we have m n (X n ) = p(X n |θ)Π(θ)dθ ≥ θ−θ0 < 1 n p(X n |θ)Π(θ)dθ = Π θ − θ 0 < 1 n 1 Π( θ − θ 0 < 1 n ) θ−θ0 < 1 n p(X n |θ)Π(θ)dθ ; here and afterwards, we shall slightly abuse notation to denote prior probabilities by the same symbol Π. Therefore, by Jensen's inequality (− log E(X) ≤ −E(log X)), Condition (PDB1), and the form of exponential families for n independent random variables, log(p(X n |θ 0 )/m n (X n )) is Taken together, we now can bound (3.4) using the following: E θ0 log + p(X n |θ 0 ) m n (X n ) ½ W c ≤ (p log n + cp log p)P θ0 (W c ) + n Π( θ − θ 0 < 1 n ) The first term in (7.1) tends to zero when n is polynomial in p provided we choose r > 1/(2δ). Thus, it is enough to use expression (6.5) in the second term of (7.1) to see it is bounded by To deal with the three terms under the square root in (7.2), note that u ≤ n F θ − θ 0 and that the domain of integration is θ − θ 0 ≤ n −1 . So, the first term under the square root is Similarly, the second term under the square root is bounded above by (4n 6 ) −1 F 2 .
Using these three bounds and the fact that the resulting integral cancels the prior probability, (7.2) is bounded above by Since √ a 1 + · · · + a k ≤ √ a 1 + · · · + √ a k when a j ≥ 0, (7.4) is bounded by which goes to zero as a consequence of Condition (BF0) and the (MCV) conditions. So, (3.4) holds.