Data-driven priors and their posterior concentration rates

Abstract: In high-dimensional problems, choosing a prior distribution such that the corresponding posterior has desirable practical and theoretical properties can be challenging. This begs the question: can the data be used to help choose a prior? In this paper, we develop a general strategy for constructing a data-driven or empirical prior, along with sufficient conditions for the corresponding posterior distribution to achieve a certain concentration rate. The idea is that the prior should put sufficient mass on parameter values for which the likelihood is large. An interesting byproduct of this data-driven centering is that the asymptotic properties of the posterior are less sensitive to the prior shape which, in turn, allows users to work with priors of computationally convenient forms while maintaining the desired rates. General results on both adaptive and non-adaptive rates based on empirical priors are presented, along with illustrations in density estimation, nonparametric regression, and high-dimensional normal models.


Introduction
The Bayesian framework is ideally suited for updating prior beliefs. However, applications often do not come equipped with genuine prior beliefs, so the data analyst must make a choice. For low-dimensional problems, the posterior is relatively insensitive to the choice of prior, at least asymptotically, so default non-informative priors can be used. For modern high-dimensional problems, on the other hand, the prior matters, and the present way of thinking is to choose a prior such that the corresponding posterior distribution has certain desirable properties. For example, in sparse high-dimensional normal linear models, conjugate normal priors are attractive due to their computational simplicity. However, it was shown in Castillo and van der Vaart (2012, Theorem 2.8) that, for priors with thin normal tails, the posterior has certain suboptimal asymptotic properties, so these are out and more sophisticated priors like the horseshoe (Carvalho, Polson and Scott, 2010) and its variants (e.g., Armagan, Dunson and Lee, 2013; Bhattacharya et al., 2015; Bhadra et al., 2017) are now in. The point is that, at least in high-dimensional problems, the interpretation of prior distributions has changed: their role is simply to facilitate efficient posterior inference and, therefore, only priors whose corresponding posterior has good properties are used. So if an empirical or data-dependent prior had some practical or theoretical benefit, then there would be no reason not to use it. This begs the two-part question: are there any benefits to the use of an empirical prior and, if so, how can one be constructed so that these benefits are realized?
The idea of letting the prior depend on data is not new. Classical empirical Bayes, as described in Berger (1985, Ch. 4.5), Carlin and Louis (1996), and more recently in Efron (2010), leaves certain prior hyperparameters unspecified and then uses the data to construct plug-in estimates of these parameters, usually via marginal maximum likelihood. That is, if θ is the parameter of interest, then a class {Q_γ : γ ∈ Γ} of prior distributions for θ is considered, and rather than introducing another prior for γ, one simply gets an estimator, γ̂, based on the data, and uses the plug-in prior Q_γ̂. The primary motivation for such a strategy is to let the data help carry some of the data analyst's prior specification burden. This, in turn, can provide some computational benefits, since the posterior for γ does not need to be evaluated. These computational savings are usually minimal in the high-dimensional settings we have in mind here, since γ is usually of very low dimension compared to the interest parameter θ. Posterior distribution properties for these classical empirical Bayes strategies have been investigated recently in, e.g., Szabó, van der Vaart and van Zanten (2013) and van der Pas, Szabó and van der Vaart (2017a,b) for a high-dimensional Gaussian model, and more generally in Petrone, Rousseau and Scricciolo (2014), Rousseau and Szabo (2017), and Donnet et al. (2018). These results confirm a natural conjecture that the use of the data-dependent prior Q_γ̂ is asymptotically equivalent to the use of the data-independent prior Q_{γ⋆}, where γ⋆ is an appropriately defined "best" value. But they do not reveal any theoretical benefit to the use of a data-dependent prior; they only say that the performance is no worse than it would be with the special data-independent prior Q_{γ⋆}.
What is missing from the classical approach is a direct use of the information that the data contain about θ itself; the data enter only indirectly, through a marginal likelihood that is of little relevance to the actual problem.
Fortunately, there are other strategies for constructing empirical priors. Martin and Walker (2014) and Martin, Mess and Walker (2017) recently employed a new type of empirical Bayes procedure, in two structured high-dimensional Gaussian linear models; related approaches to these problems can be found in Belitser (2017), Belitser and Nurushev (2017), Belitser and Ghosal (2019), and Arias-Castro and Lounici (2014). Their main idea was to suitably center the prior for θ around a good estimator, and they were able to establish various optimal posterior concentration rate and structure learning results. An important practical consequence of their approach is that the computationally convenient conjugate normal priors, shown to be suboptimal in the classical Bayesian setting, do actually meet the conditions for optimality in this new empirical Bayes context. The practical and theoretical benefits in these cases have been refined and extended in Martin and Shen (2017), Martin and Ning (2019), and Martin and Tang (2019); see, also, Lee, Lee and Lin (2017). However, their empirical prior construction and the asymptotic properties rely heavily on the Gaussian linear model structure, so whether there is a general framework underlying these developments remains an open question. Our main contribution here is to give an affirmative answer to this question, by presenting a general empirical prior construction and establishing general posterior concentration rate results.
To set the scene, let X^n be the data, indexed by n ≥ 1, not necessarily independent and identically distributed (iid) or even independent, with joint distribution P^n_θ with density p^n_θ indexed by a parameter θ in Θ, possibly high- or infinite-dimensional. For a sequence of prior distributions, Π_n, on Θ, the posterior distribution, Π^n, for θ is defined, according to Bayes's formula, as

Π^n(B) = ∫_B L_n(θ) Π_n(dθ) / ∫_Θ L_n(θ) Π_n(dθ), B ⊆ Θ,

where L_n(θ) = p^n_θ(X^n) is the likelihood function. A relevant property of the posterior Π^n is its concentration rate relative to the Hellinger distance on the set of joint densities {p^n_θ : θ ∈ Θ}. Recall that the Hellinger distance between two densities, say, f and g, with dominating measure μ, is given by H²(f, g) = ½ ∫ (f^{1/2} − g^{1/2})² dμ. If ε_n is a sequence with ε_n → 0 no faster than n^{−1/2}, then we say that the posterior distribution has (Hellinger) concentration rate ε_n if

E^n_{θ⋆} Π^n(A_{Mε_n}) → 0 as n → ∞,

where θ⋆ is the true parameter value, A_{Mε_n} = {θ ∈ Θ : H(p^n_θ, p^n_{θ⋆}) > M n^{1/2} ε_n}, and M > 0 is a sufficiently large constant. Here E^n_{θ⋆} denotes expectation with respect to the joint distribution P^n_{θ⋆}. For a deterministic or data-independent sequence of priors, Π_n, this property has been investigated in Ghosal, Ghosh and van der Vaart (2000) and Walker, Lijoi and Prünster (2007) for the iid case and by Ghosal and van der Vaart (2007a) in the non-iid case. Here we investigate this property for certain data-dependent priors.
To motivate our specific empirical prior construction, recall an essential part of the posterior concentration rate proofs for standard Bayesian posteriors. If ε_n is the desired rate, then it is typical to consider a "neighborhood" of the true θ⋆ of the form

{θ : K(p^n_{θ⋆}, p^n_θ) ≤ nε_n², V(p^n_{θ⋆}, p^n_θ) ≤ nε_n²}, (1)

where K is the Kullback-Leibler divergence and V is the corresponding second moment. A crucial step in proving that the posterior attains the ε_n rate is to demonstrate that the prior allocates a sufficient amount of mass to the set in (1). If the prior could be suitably centered at θ⋆, then this prior concentration would be trivial. The difficulty, of course, is that θ⋆ is unknown, so care is needed to construct a prior satisfying this prior concentration property simultaneously for a sufficiently wide range of θ⋆. In fact, this placement of prior mass can be problematic and is one reason why examples like monotone density estimation are challenging; see Salomond (2014). Our proposed alternative is motivated by considering an "empirical version" of the neighborhood in (1), namely,

{θ : ∫ log(p_{θ̂_n}/p_θ) dP_n ≤ dε_n²}, (2)

where θ̂_n is a suitable estimator, P_n is the empirical distribution, and d > 0 is a constant. We do not need a term corresponding to the second moment, V, as in (1). This is equivalent to

L_n = {θ : L_n(θ) ≥ e^{−dnε_n²} L_n(θ̂_n)}.

This is effectively a neighborhood of θ̂_n, which is known, unlike the θ⋆ in (1), so it is straightforward to construct a prior to assign a sufficient amount of mass to L_n. The consequence is that a prior satisfying this mass condition would depend on the data, since it must be suitably centered at θ̂_n. But aside from the data-dependent centering and some care in its spread (see Remark 3), the specific shape of the empirical prior distribution satisfying this property is not particularly important. Therefore, the conditions can be checked with relatively simple, often conjugate, priors, which greatly simplifies posterior computations.
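To make the empirical neighborhood concrete, here is a small numerical sketch in an assumed iid N(θ, 1) model (the sample size and the cutoff constant d are arbitrary choices for illustration): with ε_n = n^{−1/2}, the set of θ whose likelihood is within a factor e^{−dnε_n²} of the maximum is exactly an interval around the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(1.5, 1.0, size=n)  # iid N(theta, 1) data with theta = 1.5

theta_hat = x.mean()  # MLE of theta in the N(theta, 1) model

def log_lik(theta):
    return -0.5 * np.sum((x - theta) ** 2)

# Empirical neighborhood {theta : L_n(theta) >= exp(-d n eps_n^2) L_n(theta_hat)}
# with eps_n = n^{-1/2}.  Since log L_n(theta) - log L_n(theta_hat)
# = -n (theta - theta_hat)^2 / 2, this is the interval theta_hat +/- sqrt(2d/n).
d = 2.0
eps_n = n ** -0.5
grid = np.linspace(theta_hat - 0.5, theta_hat + 0.5, 2001)
in_Ln = np.array([log_lik(t) >= log_lik(theta_hat) - d * n * eps_n ** 2 for t in grid])

covered = grid[in_Ln]
half_width = np.sqrt(2 * d / n)
print(covered.min(), covered.max(), theta_hat - half_width, theta_hat + half_width)
```

The printed endpoints of the likelihood-based set agree with the closed-form interval, illustrating that the neighborhood is centered at the known estimator rather than at the unknown true value.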
Moreover, the method in general is quite versatile, providing simple solutions with optimal concentration rates in challenging problems like monotone (Martin, 2018) and heavy-tailed density estimation (Section 4.3), and other shape-constrained problems (Martin and Shen, 2017), while giving improved rates in a classical nonparametric regression problem (Section 4.5).

The discussion above focused on cases where the target rate ε_n is known, which can be unrealistic in high-dimensional problems. For example, in a nonparametric regression problem, the optimal rate will depend on the smoothness of the true mean function. If this smoothness is known, then it is possible to tune the prior so that the attainable and targeted rates agree. However, if the smoothness is unknown, as is often the case, the prior cannot make direct use of this information, so one needs to make the prior more flexible so that it can adapt to the unknown rate. Adaptive posterior concentration rate results have received considerable attention in the recent literature; see van der Vaart and van Zanten (2009), Kruijer, Rousseau and van der Vaart (2010), Arbel, Gayraud and Rousseau (2013), Scricciolo (2015), and Shen and Ghosal (2015). The common feature in all this work is that the prior should be a mixture over an appropriate model complexity index. The empirical prior approach described above can readily handle this modification, and we provide general sufficient conditions for adaptive empirical Bayes posterior concentration.
The remainder of this paper is organized as follows. In Section 2, we introduce the notion of an empirical prior and present the conditions needed for the corresponding posterior distribution to concentrate at the true parameter value at a particular rate. This discussion is split into two parts, depending on whether the complexity is known or unknown. Section 3 presents the proofs of the two main theorems, and a take-away point is that the arguments are quite straightforward, suggesting that the particular empirical prior construction is indeed very natural. Several examples are presented in Section 4, starting from a relatively simple parametric problem and ending with a challenging adaptive nonparametric density estimation problem. We conclude, in Section 5, with a brief discussion. Details for the examples are in the Appendix.

Known complexity
For our first case, suppose the complexity of θ⋆, e.g., the smoothness of the true density or regression function, is known. Then we know the target rate, ε_n, and we can make use of this information to design an appropriate sieve on which to construct an empirical prior. For this case, below we present a set of sufficient conditions which imply that the posterior corresponding to our empirical prior has Hellinger concentration rate ε_n. Applications of this result will be given in Section 4.
Our prior construction here and in the next subsection relies on a sieve, Θ_n, an increasing sequence of finite-dimensional subsets of the parameter space Θ. Let θ̂_n = arg max_{θ ∈ Θ_n} L_n(θ) be a sieve maximum likelihood estimator (MLE). As is always the case, what distinguishes a sieve from some other subset of the parameter space is its approximation properties. Condition S1 below states specifically what will be required.
Condition S1. There exists a sequence θ† = θ†_n ∈ Θ_n such that K(p^n_{θ⋆}, p^n_{θ†}) ≤ nε_n² and V(p^n_{θ⋆}, p^n_{θ†}) ≤ nε_n².
Remark 1. The sequence θ† = θ†_n in Condition S1 can be interpreted as "pseudo-true" parameter values in the sense that n^{−1} K(p^n_{θ⋆}, p^n_{θ†}) → 0. If Θ_n eventually contains θ⋆, then we can trivially take θ† = θ⋆. However, in examples like that in Section 4.5, the model does not include the true distribution, so identifying θ† is more challenging. Fortunately, appropriate sieves are already known in many of the key examples.
Remark 2. Define the likelihood ratio, R_n(θ) = L_n(θ)/L_n(θ⋆). An important consequence of Condition S1 is a bound on R_n(θ̂_n) at the sieve MLE, which will be used in what follows. That is, there exists a constant c > 1 such that R_n(θ̂_n) ≥ e^{−cnε_n²} with P^n_{θ⋆}-probability converging to 1.
Indeed, for θ† in Condition S1, by definition of θ̂_n, we trivially have R_n(θ̂_n) ≥ R_n(θ†), and for the iid case it follows from Lemma 8.1 in Ghosal, Ghosh and van der Vaart (2000), with their "Π" a point mass at θ†, that R_n(θ†) ≥ e^{−cnε_n²} with P^n_{θ⋆}-probability converging to 1. The general case is handled in Lemma 10 of Ghosal and van der Vaart (2007b).
The sieve Θ_n will also serve as the support of our yet-to-be-defined empirical prior Π_n. Since it is finite-dimensional, we will assume that it is equipped with a data-independent measure ν_n, e.g., Lebesgue measure, and Π_n will have a density π_n with respect to ν_n. Requiring the dominating measure to be data-independent rules out the case of a degenerate prior supported at θ̂_n, a situation we are not interested in investigating.
The next two conditions, LP1 and GP1, concern the prior supported on Θ_n. The first, a local prior condition, formally describes how the empirical prior Π_n should concentrate on the empirical version of the Kullback-Leibler neighborhood (1) alluded to in Section 1, namely,

L_n = {θ ∈ Θ_n : L_n(θ) ≥ e^{−dnε_n²} L_n(θ̂_n)}, (3)

for a constant d > 0. On one hand, requiring that a sufficient amount of mass be assigned to L_n is similar to the standard local prior support conditions in Ghosal, Ghosh and van der Vaart (2000), Shen and Wasserman (2001), and Walker, Lijoi and Prünster (2007), inspired by the developments in Barron (1988). On the other hand, the neighborhood's dependence on the data is our chief novelty and the main driver of our empirical prior construction.

Condition LP1. Given ε_n, there exists C > 0 such that the prior Π_n satisfies

Π_n(L_n) ≥ e^{−Cnε_n²},

where L_n is as in (3), depending implicitly on ε_n.

Remark 3. LP1 often requires the spread of Π_n to be decreasing with n. For example, in a scalar normal mean problem, to satisfy LP1 with ε_n = n^{−1/2} requires, say, a normal empirical prior, centered at the sample mean, with variance v_n = vn^{−1} for some v > 0. Of course, LP1 is a sufficient but not a necessary condition, so it is possible, at least in simple cases like this, to get the desired posterior concentration rate with other priors, e.g., with constant v_n. We are conditioned to believe that a tight prior is undesirable because it might be overly informative, but this rationale is based on the prior center being fixed. In the present case, the prior gets its "non-informativeness" from the data-driven center. And from this perspective, relatively tight prior concentration is actually quite reasonable, since one cannot expect a real benefit from the prior centering without putting a substantial amount of prior mass there. And based on results presented elsewhere (see Section 5), the relatively tight empirical prior does not negatively affect the frequentist validity of the posterior uncertainty quantification.
Finally, when n is fixed, the empirical prior spread involves constants, e.g., v in v n above, that can be chosen by the data analyst, so there is no flexibility lost in practice.
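Remark 3's scalar normal example can be checked directly: with empirical prior N(θ̂_n, v/n) and L_n the interval θ̂_n ± (2d/n)^{1/2}, the prior mass Π_n(L_n) is free of n, so LP1 holds with ε_n = n^{−1/2} and a fixed C. A quick numerical confirmation, with arbitrary illustrative constants d and v:

```python
from math import erf, exp, log, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

d, v = 2.0, 1.0  # cutoff constant in L_n and prior variance scale (illustrative)
masses = []
for n in [100, 1000, 10000]:
    # L_n = theta_hat +/- sqrt(2d/n); empirical prior is N(theta_hat, v/n).
    # Standardizing, the prior mass of L_n is 2 Phi(sqrt(2d/v)) - 1, free of n.
    mass = 2.0 * Phi(sqrt(2.0 * d / v)) - 1.0
    masses.append(mass)
    # LP1 with eps_n = n^{-1/2} asks for mass >= exp(-C), a fixed constant
    print(n, mass, -log(mass))
```

The prior mass of L_n stays constant as n grows, even though both the set and the prior are shrinking, which is exactly what LP1 requires.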
The second prior condition is global and effectively controls the tails of the empirical prior density π n , i.e., how heavy can the tails be and still achieve the desired rate. This is an empirical prior version of the more familiar prior tail condition (Ghosal, Ghosh and van der Vaart, 2000) or the prior summability condition (Walker, Lijoi and Prünster, 2007) in the classical Bayesian nonparametric setting.
Condition GP1. Given ε_n, there exist constants K > 0 and p > 1 such that the density function π_n of the empirical prior Π_n satisfies

∫_{Θ_n} {E^n_{θ⋆}[π_n(θ)^p]}^{1/p} ν_n(dθ) ≲ e^{Knε_n²},

where "≲" means less than or equal to up to a multiplicative constant.
Condition GP1 says that π_n cannot have too-heavy tails, but in a distributional sense, taking into account its dependence on the data. While it might be unfamiliar, our examples in Section 4 show that it can be verified for commonly used priors, such as a normal distribution, centered at the MLE θ̂_n, with suitable variance, and any p > 1.
With the empirical prior Π_n on Θ_n, having density π_n with respect to ν_n, the posterior distribution is defined as

Π^n(B) = ∫_B L_n(θ) π_n(θ) ν_n(dθ) / ∫_{Θ_n} L_n(θ) π_n(θ) ν_n(dθ). (5)

The following theorem considers the asymptotic behavior of the random variable Π^n(A_{Mε_n}), where A_{Mε_n} is the Hellinger neighborhood described in Section 1. While this Hellinger neighborhood is relatively specific, the result entails rates with respect to other metrics in the examples of Section 4. And, for example, in the usual iid case, if the posterior mass assigned to A_{Mε_n} vanishes, then so does that assigned to

{θ : H(p_θ, p_{θ⋆}) > Mε_n},

with H here the Hellinger distance between the individual densities, in which case ε_n is the usual Hellinger rate.
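For intuition about what such a posterior looks like, consider a minimal sketch in an assumed iid N(θ, 1) model with the conjugate empirical prior N(θ̂_n, v/n) from Remark 3: everything is closed-form, and the posterior spread shrinks at the n^{−1/2} rate.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, v = 0.7, 1.0
for n in [50, 500, 5000]:
    x = rng.normal(theta_true, 1.0, size=n)
    theta_hat = x.mean()
    # Empirical prior N(theta_hat, v/n) times N(theta, 1) likelihood: by conjugacy
    # the posterior is normal with precision n (data) + n/v (prior).  Both the
    # prior and the likelihood are centered at theta_hat, so the posterior is too.
    post_mean = theta_hat
    post_sd = (n + n / v) ** -0.5
    print(n, round(post_mean, 3), round(post_sd, 4))
```

The posterior standard deviation is proportional to n^{−1/2}, consistent with the parametric rate obtained in Proposition 1 below.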
Theorem 1. Let ε_n be such that ε_n → 0 and nε_n² → ∞. If, for this ε_n, Condition S1 holds and the empirical prior satisfies LP1 and GP1, then there exists a constant M > 0 such that E^n_{θ⋆} Π^n(A_{Mε_n}) → 0 as n → ∞. If, instead of nε_n² → ∞, the sequence nε_n² is bounded, then the same conclusion holds but with the constant M replaced by an arbitrary sequence M_n → ∞.

Unknown complexity
As discussed above, the attainable concentration rates depend on certain complexity features of the unknown θ⋆, e.g., smoothness of a regression function. If that feature is known, as in Section 2.1, then so is the desired rate, ε_n, and that information can be used to construct a suitable sieve on which to define a prior, empirical or otherwise. When that feature is unknown, the standard practice (e.g., Ghosal and van der Vaart, 2017, Chap. 10) is to work with a prior that mixes over models of different complexity levels, leading to a posterior that adapts to the "right" complexity for the unknown θ⋆. Here we adopt that same mixture strategy, but with an empirical twist.
Start with a representation of θ as a pair (S, θ_S), where S is some model index, taking values in a finite set S_n, and θ_S is the corresponding model parameter, taking values in Θ_{n,S}. This suggests a sieve

Θ_n = ⋃_{S ∈ S_n} Θ_{n,S}.

The particular form of this decomposition can vary across applications. One that is common is to represent a log-density or regression function in terms of a basis expansion and let θ = (θ_1, θ_2, . . .) denote the coefficients. Then S could correspond to a finite set of indices that are "turned on," and S_n a collection of subsets of {1, 2, . . .} whose cardinality is bounded by some specified T_n. This version of S is used in Sections 4.4 and 4.5 for a sparse normal means model and nonparametric regression, respectively. In mixture models, on the other hand, S would be an integer that represents the number of mixture components. An important feature of S, or of Θ_{n,S}, is its dimension, which we will denote by |S|; in our examples, each Θ_{n,S} will be finite-dimensional and |S| is literally its dimension, but this could also apply to infinite-dimensional Θ_{n,S} with |S| a suitable entropy of Θ_{n,S}. The key point is that |S| measures the complexity of model S, both intuitively and in the technical sense that a more complex S, one with larger |S|, will have a slower associated rate.
Here, compared to Section 2.1, we do not know the complexity of the true θ⋆ or, more specifically, we do not know which Θ_{n,S}, if any, contains θ⋆. If there happens to exist a true model S⋆, so that θ⋆ ∈ Θ_{n,S⋆}, then the rate we would hope to achieve is ε_n = ε_{n,S⋆}, in which case we say that the posterior concentration rate is adaptive. More generally, if there exists a "best" model S† (see Condition S2 below), then adaptation entails that the posterior concentrates at the associated oracle rate ε_n = ε_{n,S†}.
The driving assumption behind recent developments in high-dimensional inference is that the truth is not too complex, and we can incorporate such a belief into our prior for S. Towards this, start with a marginal prior w_n for S, supported on S_n, and a conditional prior Π_{n,S} for θ_S, given S, supported on Θ_{n,S}. Since Θ_{n,S} is finite-dimensional, there is some non-data-dependent measure, ν_{n,S}, such as Lebesgue measure, with respect to which Π_{n,S} has a density, π_{n,S}. Then the prior distribution Π_n on Θ_n is a mixture

Π_n(B) = Σ_{S ∈ S_n} w_n(S) Π_{n,S}(B ∩ Θ_{n,S}), where Π_{n,S}(B) = ∫_B π_{n,S}(θ) ν_{n,S}(dθ).

In practice, we often have prior information in the form of a "low-complexity assumption," i.e., small w_n(S) for complex S, but we can be non-informative about θ_S, as before, by letting the data control its prior center. As before, various conditions are needed in order to prove that the posterior concentrates at a certain rate. Again, these come in the form of a condition on the sieve and local and global conditions on the prior. In this case, the complexity is unknown and we seek a more general adaptive concentration result so, naturally, the conditions here are more complicated than in Section 2.1.
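As a concrete, purely illustrative instance of the mixture construction, take the sparse normal means setting of Section 4.4: S is a support set, the conditional prior for θ_S is centered at the corresponding observations, and a marginal prior w_n downweights complex models. The specific formula for w_n below is a hypothetical low-complexity choice, not necessarily the one used in Section 4.4.

```python
import numpy as np
from math import comb, log

rng = np.random.default_rng(2)
n = 100
theta_true = np.zeros(n)
theta_true[:3] = 5.0                  # a sparse signal
x = rng.normal(theta_true, 1.0)

# Hypothetical marginal prior on the support S, penalizing complexity |S|:
# w_n(S) proportional to 1 / (C(n, |S|) * n^(2|S|)).
def log_w(S):
    s = len(S)
    return -log(comb(n, s)) - 2.0 * s * log(n)

# Conditional empirical prior for theta_S, given S: centered at the MLE x[S],
# e.g., theta_S ~ N(x[S], tau * I) for some spread tau (a conjugate shape).
S_small, S_big = (0, 1, 2), tuple(range(10))
center = x[list(S_small)]
print(log_w(S_small), log_w(S_big), center)
```

The weight comparison shows the low-complexity assumption at work: the 3-dimensional model receives exponentially more prior weight than the 10-dimensional one, while the data, not the analyst, supply the center of each conditional prior.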
Recall that the complexity of the model, as measured by |S|, and the quality of approximation are at odds with one another, i.e., a simple model with small |S| will tend to have large Kullback-Leibler approximation error, and vice versa.

Condition S2. There exist a model S† = S†_n ∈ S_n and a sequence θ† = θ†_n ∈ Θ_{n,S†} such that |S†| ≤ nε_n², K(p^n_{θ⋆}, p^n_{θ†}) ≤ nε_n², and V(p^n_{θ⋆}, p^n_{θ†}) ≤ nε_n², where ε_n = ε_{n,S†}.

The smallest ε_n for which Condition S2 holds will be called the oracle rate.
In some examples, it is known that θ⋆ belongs to Θ_{n,S⋆} for some S⋆, in which case we can take S† = S⋆ and θ† = θ⋆, so that Condition S2 is trivial and the corresponding oracle rate is simply the rate ε_{n,S⋆} associated with the true parameter space. In cases where θ⋆ does not belong to any sieve, approximation-theoretic results are needed to check Condition S2. Examples of both types are presented in Section 4. Regardless, S† acts like the "pseudo-true" model, θ† a deterministic sequence of "pseudo-true" parameters, and ε_n = ε_{n,S†} is the oracle rate; see Remark 1. Moreover, like in Remark 2, Condition S2 implies a bound on the likelihood ratio, i.e.,

R_n(θ̂_{n,S†}) ≥ e^{−cnε_n²} for some c > 1, with P^n_{θ⋆}-probability converging to 1, (7)

where θ̂_{n,S} denotes the sieve MLE over Θ_{n,S}, S ∈ S_n. Next, similar to what we did in Section 2.1, let us define the sets

L_{n,S} = {θ ∈ Θ_{n,S} : L_n(θ) ≥ e^{−dnε_{n,S}²} L_n(θ̂_{n,S})}, S ∈ S_n,

which are just neighborhoods of θ̂_{n,S} in Θ_{n,S}. Then we have the following versions of the local and global prior conditions, suitable for the adaptive case, which dictate how the prior Π_{n,S} allocates mass to L_{n,S} and Θ_{n,S} ∩ L_{n,S}^c, respectively.

Condition LP2. Given ε_n and the pseudo-true model S† from Condition S2, there exist constants A > 0 and C > 0 such that, as n → ∞,

w_n(S†) Π_{n,S†}(L_{n,S†}) ≥ A e^{−Cnε_n²}.

Condition GP2. Given ε_n, there exist constants K ≥ 0 and p > 1 such that

Σ_{S ∈ S_n} w_n(S) ∫_{Θ_{n,S}} {E^n_{θ⋆}[π_{n,S}(θ)^p]}^{1/p} ν_{n,S}(dθ) ≲ e^{Knε_n²}. (8)

In certain examples, such as those in Sections 4.4-4.5, it can be shown that the integral in Condition GP2 above is bounded by e^{κ|S|} for some constant κ. Then the condition is satisfied with K = 0 if the prior w_n for S is such that the marginal prior for |S| has exponential tails (e.g., Arbel, Gayraud and Rousseau, 2013; Shen and Ghosal, 2015).
For adaptive concentration rates, some extra regularization is needed in addition to the prior centering. This additional regularization amounts to a second way in which the prior depends on the data, so we refer to these as double empirical priors, and below we will consider two types of regularization.
• Type 1 Regularization. For an α ∈ (0, 1) to be specified, if Π_n is the empirical prior above, then we set the double empirical prior as

Π̃_n(dθ) ∝ L_n(θ)^{−(1−α)} Π_n(dθ).

Dividing by a portion of the likelihood penalizes those parameters that "track the data too closely" (Walker and Hjort, 2001), hence regularization. A range of acceptable α values is identified below. In fact, α can often be arbitrarily close to 1, so this is indeed a very minor adjustment.

• Type 2 Regularization. Even though the regularization step in the above construction is very mild, some readers might be uncomfortable with what can be viewed as even a minor adjustment to the likelihood. An alternative approach is to place the additional regularization on the prior w_n for S. That is, if w_n is as above, then for an α ∈ (0, 1) to be specified, define

w̃_n(S) ∝ w_n(S) L_n(θ̂_{n,S})^{−(1−α)}, S ∈ S_n. (10)

This has the effect of putting an even smaller weight on those models that fit the data "too well" in the sense that their maximum likelihood is large, hence regularization. But, as above, often any α < 1 is allowed, so this extra regularization is quite mild. This amounts to a double empirical prior of the form

Π̃_n(B) = Σ_{S ∈ S_n} w̃_n(S) Π_{n,S}(B ∩ Θ_{n,S}).

In either case, for a suitable (and implicit) α to be defined below, the posterior distribution based on the double empirical prior can be expressed as

Π^n(B) = ∫_B L_n(θ) Π̃_n(dθ) / ∫_{Θ_n} L_n(θ) Π̃_n(dθ). (12)

Theorem 2. Let ε_n be such that ε_n → 0 and nε_n² → ∞, and assume that Conditions S2, LP2, and GP2 hold for this ε_n. For the constant p > 1 in Condition GP2, take any α ∈ (0, 1 − p^{−1}).
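The Type 2 adjustment can be sketched numerically. Under the reading above, the regularized weight divides w_n(S) by a fraction of the model's maximized likelihood, so better-fitting (typically more complex) models are down-weighted; all numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical inputs: prior weights w_n(S) for three nested models and the
# maximized log-likelihoods log L_n(theta_hat_{n,S}) (larger = better fit).
log_w = np.log(np.array([0.5, 0.3, 0.2]))
log_max_lik = np.array([-120.0, -100.0, -95.0])

alpha = 0.9  # regularization fraction; the text indicates any alpha < 1 often works

# Type 2: w~_n(S) proportional to w_n(S) * L_n(theta_hat_{n,S})^{-(1 - alpha)}
log_w_reg = log_w - (1.0 - alpha) * log_max_lik
log_w_reg -= np.logaddexp.reduce(log_w_reg)  # normalize on the log scale
w_reg = np.exp(log_w_reg)
print(np.round(w_reg, 4))
```

With α close to 1 the adjustment is mild; as α decreases, mass shifts further away from the models with the largest maximized likelihood.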
Then there exists M > 0 such that the posterior Π^n in (12), whether it be based on Type 1 or Type 2 regularization, satisfies E^n_{θ⋆} Π^n(A_{Mε_n}) → 0 as n → ∞.

Condition S2 automatically holds for any ε_n larger than the oracle rate and, since Condition LP2 depends specifically on the pseudo-true model S†, it can typically be shown that it too holds for the oracle rate. So, as long as Condition GP2 also holds for the oracle rate, we get the advertised adaptation property. Otherwise, the rate is the larger of the oracle rate in Conditions S2 and LP2 and that which satisfies Condition GP2. Moreover, if the integral in (8) is exponential in the dimension |S|, then Condition GP2 can be well-controlled with weights w_n(S) that are exponentially small in |S|. Finally, note that Condition GP2 can often be verified for any p > 1; see the examples in Section 4 and the results in Martin, Mess and Walker (2017), Martin and Shen (2017), etc. In such cases, any α < 1 is allowed in either Type 1 or Type 2 regularization.

Proof of Theorem 1
Start by expressing the posterior Π^n in (5) as

Π^n(A) = N_n(A)/D_n, where N_n(A) = ∫_A R_n(θ) π_n(θ) ν_n(dθ) and D_n = N_n(Θ_n). (13)

The dependence of the prior on data requires some modification of the usual arguments for establishing concentration properties of Π^n. In particular, in Lemma 1, the lower bound on the denominator D_n in (13) is obtained quite simply thanks to the data-dependent prior, formalizing the motivation for this empirical Bayes approach described in Section 1, while Lemma 2 applies Hölder's inequality to get an upper bound on the numerator N_n(A_{Mε_n}).
Lemma 1. Under Condition LP1, D_n ≥ e^{−(C+d)nε_n²} R_n(θ̂_n), where C and d are the constants in Condition LP1 and Equation (3), respectively.

Proof. The denominator D_n can be trivially lower-bounded as follows:

D_n ≥ ∫_{L_n} R_n(θ) π_n(θ) ν_n(dθ).

Now use the definition of L_n, together with Condition LP1, to complete the proof.
Lemma 2. Assume Condition GP1 holds for ε_n with constants (K, p), and let q > 1 be the Hölder conjugate of p. Then

E^n_{θ⋆}{N_n(A_{Mε_n}) / R_n(θ̂_n)^{1−1/(2q)}} ≲ e^{−(M²/q − K)nε_n²}.

Proof. Start with the following simple bound, which uses R_n(θ) ≤ R_n(θ̂_n) for θ ∈ Θ_n:

N_n(A_{Mε_n}) ≤ R_n(θ̂_n)^{1−1/(2q)} ∫_{A_{Mε_n}} R_n(θ)^{1/(2q)} π_n(θ) ν_n(dθ).

Dividing both sides by R_n(θ̂_n)^{1−1/(2q)}, taking expectations, moving this expectation inside the integral, and applying Hölder's inequality, gives

E^n_{θ⋆}{N_n(A_{Mε_n}) / R_n(θ̂_n)^{1−1/(2q)}} ≤ ∫_{A_{Mε_n}} {E^n_{θ⋆}[R_n(θ)^{1/2}]}^{1/q} {E^n_{θ⋆}[π_n(θ)^p]}^{1/p} ν_n(dθ).

A standard argument (e.g., Walker and Hjort, 2001) shows that the first expectation on the right-hand side above equals 1 − H²(p^n_θ, p^n_{θ⋆}) and, therefore, is upper bounded by e^{−M²nε_n²}, uniformly in θ ∈ A_{Mε_n}. Under Condition GP1, the integral of the second expectation is ≲ e^{Knε_n²}. Combining these two bounds proves the claim.
Proof of Theorem 1. To start, set a_n = e^{−cnε_n²} and b_n = c_0 e^{−(c+C+d)nε_n²}, where the constants (C, c, d) are as in Condition LP1, Remark 2, and Equation (3), respectively, and c_0 is another sufficiently small constant. Also, abbreviate N_n = N_n(A_{Mε_n}) and R_n = R_n(θ̂_n). If 1(·) denotes the indicator function, then

Π^n(A_{Mε_n}) = (N_n/D_n) 1(R_n ≥ a_n and D_n ≥ b_n) + (N_n/D_n) 1(R_n < a_n or D_n < b_n)
            ≤ (N_n/D_n) 1(R_n ≥ a_n and D_n ≥ b_n) + 1(R_n < a_n) + 1(D_n < b_n).
Taking expectations and applying Lemma 2, we get

E^n_{θ⋆} Π^n(A_{Mε_n}) ≲ e^{(C + c/(2q) + d)nε_n²} e^{−Gnε_n²} + P^n_{θ⋆}(R_n < a_n) + P^n_{θ⋆}(D_n < b_n), (14)

where G = M²/q − K. The second and third terms are o(1) by Remark 2 and Lemma 1, respectively. If we take G > C + c/(2q) + d or, equivalently, M² > q(K + C + c/(2q) + d), then the first term is o(1) as well, completing the proof of the first claim.
For the second claim, when nε_n² is bounded, the conclusion (14) still holds, and the latter two terms are still o(1). The first term in the upper bound is decreasing in G or, equivalently, in M, so the upper bound vanishes for any M_n → ∞.

Proof of Theorem 2
The proof approach here is similar to that of Theorem 1 above, with a few differences. We will start with the posterior defined by the double empirical prior with Type 1 regularization described in Section 2.2. For that version of the prior, the posterior probability Π^n(A_{Mε_n}) is a ratio N_n(A_{Mε_n})/D_n, where

N_n(A) = Σ_{S ∈ S_n} w_n(S) ∫_{A ∩ Θ_{n,S}} R_n(θ)^α π_{n,S}(θ) ν_{n,S}(dθ) and D_n = N_n(Θ_n).

After proving Theorem 2 for this case, we will describe the adjustments needed to get the same result with the Type 2 regularized double empirical prior. Throughout, we will assume Conditions S2, LP2, and GP2 hold with ε_n.
Under Condition GP2, the summation on the right-hand side above is bounded by a constant times e^{Knε_n²}, and the claim now follows with k′ = kq^{−1}.
Proof of Theorem 2. By Lemma 3 and Condition LP2, the denominator D_n can be suitably lower-bounded. And by (7) we have R_n(θ̂_{n,S†}) ≥ e^{−cnε_n²} for some c > 1, with P^n_{θ⋆}-probability converging to 1. Since |S†| ≤ nε_n², this lower bound for the denominator can be combined with the upper bound on the numerator from Lemma 4, using an argument very similar to that in the proof of Theorem 1. So, for M sufficiently large, the resulting upper bound on E^n_{θ⋆} Π^n(A_{Mε_n}) vanishes, proving the claim.
It turns out that the proof for the Type 2 regularized double empirical prior follows along almost the same lines. The key is that we do not need to be concerned about the normalizing constant in the definition of w̃_n in (10), because it appears in both the numerator and denominator of the posterior probability. Similarly, we can replace L_n(θ̂_{n,S}) in (10) by R_n(θ̂_{n,S}) so, for the proof, we are free to assume that

w̃_n(S) = w_n(S) R_n(θ̂_{n,S})^{−(1−α)}, S ∈ S_n.

With this, the bound on the denominator, D_n, of the posterior probability from Lemma 3 is unchanged. For the numerator, N_n(A_{Mε_n}), note the following trivial inequality:

R_n(θ) R_n(θ̂_{n,S})^{−(1−α)} ≤ R_n(θ)^α, θ ∈ Θ_{n,S}.

Consequently, the numerator is bounded by Σ_{S ∈ S_n} w_n(S) ∫_{A_{Mε_n} ∩ Θ_{n,S}} R_n(θ)^α π_{n,S}(θ) ν_{n,S}(dθ), and the right-hand side is exactly what we bounded in the proof of Lemma 4. So we can put together the bounds on the numerator and denominator exactly like we did above to obtain the ε_n posterior concentration rate for the Type 2 regularized version of the double empirical prior.

Fixed finite-dimensional parameter estimation
Suppose that the parameter space Θ is a fixed subset of R d , for a fixed d < ∞.
Under the usual regularity conditions, the log-likelihood ℓ_n = log L_n is twice continuously differentiable, its derivative ℓ̇_n satisfies ℓ̇_n(θ̂_n) = 0 at the (unique) global MLE θ̂_n, and the following expansion holds:

ℓ_n(θ) = ℓ_n(θ̂_n) − ½ (θ − θ̂_n)^⊤ Σ̂_n (θ − θ̂_n) + r_n(θ), (15)

where Σ̂_n = −ℓ̈_n(θ̂_n) and the remainder r_n is negligible near θ̂_n. Then, up to the approximation error in (15), the set L_n can be expressed as an ellipsoid centered at the MLE,

L_n ≈ {θ : (θ − θ̂_n)^⊤ Σ̂_n (θ − θ̂_n) ≲ nε_n²}.

For rate ε_n = n^{−1/2}, this suggests an empirical prior of the form

Π_n = N_d(θ̂_n, n^{−1} Ψ), (16)

for some fixed positive definite matrix Ψ, in order to ensure Condition S1. The proposition below states that this empirical prior yields a posterior that concentrates at the parametric rate ε_n = n^{−1/2}. Note that we do not need any additional fine-tuning, like in Theorem 2.4 of Ghosal, Ghosh and van der Vaart (2000), to get optimal rates in the finite-dimensional case.
Proposition 1. Assume that each component θ_j of the d-dimensional parameter θ takes values in (−∞, ∞), and that the regularity conditions necessary to establish the quadratic approximation (15) hold. Then Conditions LP1 and GP1 hold for the empirical prior (16) with ε_n = n^{−1/2}. Therefore, the posterior, with α = 1, concentrates at the rate ε_n = n^{−1/2} relative to any metric on Θ.
Proof. See the Appendix.
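To make the centering idea concrete, here is a minimal numerical sketch in the simplest setting, a N(θ, 1) model with d = 1, where the prior center is the MLE (the sample mean) and the matrix Ψ reduces to a scalar ψ. The function name and the choice ψ = 1 are ours, purely illustrative; the point is that the prior N(θ̂_n, ψ/n) is conjugate, so the posterior is available in closed form and visibly concentrates at the n^{−1/2} rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_prior_posterior(x, psi=1.0):
    """Conjugate posterior under the empirical prior N(theta_hat, psi/n).

    Model: X_1, ..., X_n iid N(theta, 1); MLE theta_hat = sample mean.
    Because the prior is centered at the MLE, the posterior mean is the
    MLE itself and the posterior sd is O(n^{-1/2}).
    """
    n = len(x)
    theta_hat = x.mean()
    prior_prec = n / psi   # precision of the empirical prior N(theta_hat, psi/n)
    lik_prec = n           # precision contributed by n unit-variance observations
    post_prec = prior_prec + lik_prec
    post_mean = (prior_prec * theta_hat + lik_prec * theta_hat) / post_prec
    return post_mean, post_prec ** -0.5

x = rng.normal(2.0, 1.0, size=400)
mean, sd = empirical_prior_posterior(x, psi=1.0)
```

Since both the prior and the likelihood are centered at x̄, the posterior mean equals the MLE regardless of ψ; the data-driven centering makes the posterior insensitive to the prior's shape, which is the point of the construction.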

Density estimation via histograms
Consider estimation of a density function, p⋆, supported on the compact interval [0, 1], based on iid samples X_1, . . . , X_n. A simple approach to developing a Bayesian model for this problem is a random histogram prior (e.g., Scricciolo, 2007, 2015), consisting of mixtures of uniforms, i.e., piecewise constant densities
p_θ(x) = Σ_{s=1}^S θ_s S 1(x ∈ E_s),  E_s = ((s−1)/S, s/S],
where the parameter θ is a vector in the S-dimensional probability simplex Δ(S). That is, p_θ is effectively a histogram with S bins, all of the same width, S^{−1}, and the height of the s-th bar is Sθ_s, s = 1, . . . , S. Here, assuming the regularity of the true density is known, we construct an empirical prior for the vector parameter θ such that, under conditions on the true density, the corresponding posterior on the space of densities has Hellinger concentration rate within a logarithmic factor of the minimax rate. More sophisticated models for density estimation will be presented in Sections 4.3 and 4.6.
Let S = S_n be the number of bins, specified below. This defines a sieve Θ_n = Δ(S_n) and, under the proposed histogram model, the data can be treated as multinomial, so the (sieve) MLE is θ̂_n = (θ̂_{n,1}, . . . , θ̂_{n,S}), where θ̂_{n,s} is just the proportion of observations in the s-th bin, s = 1, . . . , S. Here we propose a Dirichlet prior Π_n for θ, namely,
θ ∼ Π_n = Dir_S(α̂),  α̂_s = 1 + c θ̂_{n,s},  s = 1, . . . , S,
which is centered on the sieve MLE in the sense that the mode of the empirical prior density is θ̂_n; the factor c = c_n will be specified below. Finally, this empirical prior for θ determines an empirical prior for the density via the mapping θ → p_θ.
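As a small illustration (ours, not from the paper), the empirical Dirichlet prior and its conjugate posterior update can be computed directly from the bin counts; the value c = 1000 below is an arbitrary placeholder for the c_n specified in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def histogram_empirical_prior(x, S, c):
    """Empirical Dirichlet prior for the random-histogram model.

    Bins E_s = ((s-1)/S, s/S]; sieve MLE theta_hat_s = n_s / n.
    Prior Dir_S(alpha_hat) with alpha_hat_s = 1 + c * theta_hat_s has its
    mode exactly at the sieve MLE.  The multinomial likelihood is conjugate,
    so the posterior is Dirichlet with parameter alpha_hat + counts.
    """
    n = len(x)
    counts = np.histogram(x, bins=S, range=(0.0, 1.0))[0]
    theta_hat = counts / n
    alpha_hat = 1.0 + c * theta_hat
    alpha_post = alpha_hat + counts
    return theta_hat, alpha_hat, alpha_post

x = rng.beta(2.0, 5.0, size=1000)   # synthetic data on [0, 1]
S = 10
theta_hat, alpha_hat, alpha_post = histogram_empirical_prior(x, S, c=1000.0)
```

The mode identity is easy to check: since Σ_s α̂_s = S + c, the Dirichlet mode (α̂_s − 1)/(Σ α̂ − S) equals c θ̂_{n,s}/c = θ̂_{n,s}.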

Mixture density estimation
Let X_1, . . . , X_n be iid samples from a density p_θ of the mixture form
p_θ(x) = ∫ k(x | μ) θ(dμ),
where k(x | μ) is a known kernel and the mixing distribution θ is unknown. Here we focus on the normal mixture case, with k(x | μ) = N(x | μ, σ²), where σ is known, but see Remark 4. The full parameter space Θ, which contains the true mixing distribution θ⋆, is the set of all probability measures on the μ-space, but we consider here a finite mixture model of the form
p_{S,θ}(x) = Σ_{s=1}^S ω_s k(x | μ_s),
for an integer S, a vector ω = (ω_1, . . . , ω_S) in the simplex Δ(S), and a set of distinct support points μ = (μ_1, . . . , μ_S). For fixed S, let θ̂ = (ω̂, μ̂) be the MLE for the mixture weights and locations, respectively, where the optimization is restricted so that |μ̂_s| ≤ B, with B = B_n to be determined. We propose to "center" an empirical prior on the S-specific MLE as follows:
• ω and μ are independent;
• the vector ω is Dir_S(α̂), as in Section 4.2, where α̂_s = 1 + cω̂_s, s = 1, . . . , S;
• the components μ_1, . . . , μ_S of μ are independent, with μ_s ∼ Unif(μ̂_s − δ_n, μ̂_s + δ_n),
where δ_n is a sequence of positive constants to be determined.

Proposition 3. Suppose that the true mixing distribution θ in
Proof. See the Appendix.
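The centering step can be sketched numerically. The EM routine below and the tuning values c and δ_n are our own illustrative placeholders, not choices from the paper: EM produces an approximate S-specific sieve MLE (ω̂, μ̂), from which the empirical prior's hyperparameters are read off directly.

```python
import numpy as np

rng = np.random.default_rng(2)

def em_normal_mixture(x, S=2, sigma=1.0, iters=200):
    """Plain EM for an S-component normal mixture with known, common sigma.

    Returns an approximate sieve MLE (omega_hat, mu_hat).
    """
    mu = np.quantile(x, (np.arange(S) + 0.5) / S)   # spread-out initialization
    omega = np.full(S, 1.0 / S)
    for _ in range(iters):
        # E-step: responsibilities of each component for each observation
        logk = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
        w = omega * np.exp(logk)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: update weights and locations
        omega = w.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return omega, mu

# Synthetic data from a two-component normal mixture
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 700)])
omega_hat, mu_hat = em_normal_mixture(x, S=2)

# Empirical prior centered at the sieve MLE (c and delta_n are placeholders)
c, delta_n = 10.0, 0.1
alpha_hat = 1.0 + c * omega_hat                     # omega ~ Dir_S(alpha_hat)
mu_lo, mu_hi = mu_hat - delta_n, mu_hat + delta_n   # mu_s ~ Unif(mu_hat_s ± delta_n)
```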
Remark 4. The proof of Proposition 3 is not especially sensitive to the choice of kernel. More specifically, the local prior support condition, LP1, can be verified for kernels other than the normal, the key condition being Equation (24) in the Appendix. For example, that condition can be verified for the Cauchy kernel k(x | μ) = [πσ{1 + σ^{−2}(x − μ)²}]^{−1}, where σ is a fixed scale parameter. Therefore, using the same empirical prior formulation as in the normal case, the argument in the proof of Proposition 3 shows that the Cauchy mixture posterior achieves the rate ε_n = (log n)n^{−1/2} when the true density p⋆ = p_{θ⋆} is a finite Cauchy mixture. That the rate is nearly parametric is not surprising, since the finite Cauchy mixture is effectively finite-dimensional, but, to our knowledge, the Bayesian literature says nothing about rates for heavy-tailed density estimation, whereas it fits quite easily into our setup. Of course, the challenge is in going from a finite to an infinite Cauchy mixture and, if suitable bounds on the error in approximating the latter by the former were available, then our analysis would immediately give a rate for the more general case.

Estimation of a sparse normal mean vector
Consider inference on the mean vector θ = (θ_1, . . . , θ_n) of a normal distribution, N_n(θ, I_n), based on a single sample X = (X_1, . . . , X_n). That is, X_i ∼ N(θ_i, 1), for i = 1, . . . , n, independent. The mean vector is assumed to be sparse in the sense that most of the components, θ_i, are zero, but the locations and values of the non-zero components are unknown. This problem was considered by Martin and Walker (2014), who show that a version of the double empirical Bayes posterior contracts at the optimal minimax rate. Here we propose an arguably simpler empirical prior and demonstrate the same asymptotic optimality of the posterior based on the general results in Section 2.2.
Write the mean vector θ as a pair (S, θ_S), where S ⊆ {1, 2, . . . , n} identifies the non-zero entries of θ, and θ_S is the |S|-vector of non-zero values. Assume that the true mean vector θ⋆ has support S⋆ with |S⋆| = s_n, where s_n = o(n). The sieves Θ_{n,S} are subsets of R^n that constrain the components of the vectors corresponding to indices in S^c to be zero; no constraint on the non-zero components is imposed. Note that we can trivially restrict to subsets S of cardinality no more than T_n = n. Furthermore, Condition S2 is trivially satisfied because θ⋆ belongs to the sieve Θ_{n,S⋆} by definition, so we can take θ† = θ⋆.
For this model, the Hellinger distance between joint densities satisfies
H²(p_θ, p_{θ′}) = 1 − exp{−∥θ − θ′∥²/8},
where ∥·∥ is the usual ℓ₂-norm on R^n. In this sparse setting, as demonstrated by Donoho et al. (1992), the ℓ₂-minimax rate of convergence is s_n log(n/s_n); we set this rate equal to nε_n², so that ε_n² = (s_n/n) log(n/s_n). Therefore, if we can construct a prior such that Conditions LP2 and GP2 hold for this ε_n, then it will follow from Theorem 2 that the corresponding empirical Bayes posterior concentrates at the optimal minimax rate.
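The Hellinger calculation above rests on the standard normal affinity ∫ √(N(x | a, 1) N(x | b, 1)) dx = e^{−(a−b)²/8}, which tensorizes over coordinates. A quick numerical check of this identity (illustrative only):

```python
import numpy as np

def affinity(a, b, grid):
    """Riemann-sum approximation of the Hellinger affinity between
    N(a, 1) and N(b, 1); exact value is exp(-(a - b)^2 / 8)."""
    pa = np.exp(-0.5 * (grid - a) ** 2) / np.sqrt(2 * np.pi)
    pb = np.exp(-0.5 * (grid - b) ** 2) / np.sqrt(2 * np.pi)
    dx = grid[1] - grid[0]
    return np.sqrt(pa * pb).sum() * dx

grid = np.linspace(-30.0, 30.0, 200001)
a, b = 0.0, 2.0
num = affinity(a, b, grid)
exact = np.exp(-(a - b) ** 2 / 8.0)
```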
Let the prior distribution w_n for S be given by
w_n(S) ∝ C(n, |S|)^{−1} e^{−B|S|},  S ⊆ {1, . . . , n},
where C(n, s) denotes the binomial coefficient and B > 0 is a sufficiently large constant; given S, the conditional prior for θ_S is normal, centered at the sieve MLE θ̂_{n,S} = X_S.
Proposition 4. Suppose the normal mean vector θ⋆ is s_n-sparse in the sense that only s_n = o(n) of its entries are non-zero. For the empirical prior described above, there exists a constant M > 0 such that the corresponding posterior distribution Π^n, using Type I or Type II regularization, with any α < 1, satisfies E_{θ⋆} Π^n({θ : ∥θ − θ⋆∥² > Ms_n log(n/s_n)}) → 0.
Proof. See the Appendix.
Note that the prior being employed in this empirical Bayes formulation is conjugate, leading to some computational savings compared to the non-conjugate priors shown to be optimal in Castillo and van der Vaart (2012) under a classical Bayesian formulation; see Martin, Mess and Walker (2017) and Martin (2017) for more on computational benefits, and Martin and Ning (2019) for results on coverage of credible sets based on this empirical Bayes model. A similar approach to the one described above is considered in Martin and Shen (2017) to get minimax optimal posterior concentration rates and fast computation for the case where θ is known to be piecewise constant.
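For intuition, the conjugacy can be exercised on a toy problem small enough to enumerate all 2^n configurations. This is our own illustrative reading: the model prior w_n(S) ∝ C(n,|S|)^{−1}e^{−B|S|} follows the text, while the conditional prior θ_S ∼ N(X_S, τ I) and the values of α, τ, B are placeholder choices. With a fractional likelihood (α < 1), the marginal mass of each configuration is available in closed form:

```python
import itertools
import math

# Deterministic toy data: two big "signal" coordinates, six near-zero ones.
X = [5.0, 4.0, 0.1, -0.2, 0.05, 0.0, -0.1, 0.2]
n = len(X)
alpha, tau, B = 0.99, 1.0, 1.0   # likelihood fraction, prior variance, penalty

def log_post(S):
    """Unnormalized log posterior mass of configuration S (a frozenset).

    Prior: w_n(S) ∝ C(n,|S|)^{-1} e^{-B|S|}; given S, theta_S ~ N(X_S, tau I).
    Integrating the alpha-fractional Gaussian likelihood against this prior
    gives a factor (1 + alpha*tau)^{-1/2} per included coordinate, while each
    excluded coordinate contributes exp(-alpha * X_i^2 / 2).
    """
    s = len(S)
    lp = -math.log(math.comb(n, s)) - B * s           # prior on the model
    lp += -0.5 * s * math.log(1.0 + alpha * tau)      # integrated signal coords
    lp += sum(-0.5 * alpha * X[i] ** 2 for i in range(n) if i not in S)
    return lp

subsets = [frozenset(c) for r in range(n + 1)
           for c in itertools.combinations(range(n), r)]
logw = [log_post(S) for S in subsets]
m = max(logw)
w = [math.exp(v - m) for v in logw]
Z = sum(w)
# Marginal posterior inclusion probability of each coordinate
incl = [sum(wi for S, wi in zip(subsets, w) if i in S) / Z for i in range(n)]
```

The two large coordinates get inclusion probability near one, while the near-zero coordinates are heavily penalized by the complexity prior.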

Regression function estimation
Consider the nonparametric regression model
Y_i = f(t_i) + z_i,  i = 1, . . . , n,
where z_1, . . . , z_n are iid N(0, 1), t_1, . . . , t_n are equi-spaced design points in [0, 1], i.e., t_i = i/n, and f is an unknown function. Following Arbel, Gayraud and Rousseau (2013), we consider a Fourier basis expansion for f = f_θ, so that f_θ = Σ_{j≥1} θ_j φ_j, where θ = (θ_1, θ_2, . . .) and (φ_1, φ_2, . . .) are the basis coefficients and functions, respectively. They give conditions such that their Bayesian posterior distribution for f, induced by a prior on the basis coefficients θ, concentrates at the true f at the minimax rate corresponding to the unknown smoothness of f. Here we derive a similar result, with a better rate, for the posterior derived from an empirical prior.
Following the calculations in Section 4.4, the Hellinger distance between the joint distributions of (Y_1, . . . , Y_n) for two different regression functions, f and g, satisfies
H² = 1 − exp{−n∥f − g∥_n²/8},
where ∥f∥_n² = n^{−1} Σ_{i=1}^n f(t_i)² is the squared L₂-norm corresponding to the empirical distribution of the covariate t. So, if the conditions of Theorem 2 are satisfied, then we get a posterior concentration rate relative to the metric ∥·∥_n.
Suppose that the true regression function f⋆ is in a Sobolev space of index β > ½. That is, there is an infinite coefficient vector θ⋆ such that f⋆ = f_{θ⋆} and Σ_{j≥1} θ⋆_j² j^{2β} ≲ 1. This implies that the coefficients θ⋆_j for large j are of relatively small magnitude, and it suggests a particular formulation of the model and empirical prior. As before, we rewrite the infinite vector θ as (S, θ_S), but this time S is just an integer in {1, 2, . . . , n}, and θ_S = (θ_1, . . . , θ_S, 0, 0, . . .) is an infinite vector with only the first S terms non-zero. That is, we restrict our prior to be supported on vectors whose tails vanish in this sense. For the prior w_n for the integer S, we take w_n(s) ∝ e^{−g(s)s}, s = 1, . . . , n, where g(s) is a non-decreasing slowly varying function; this includes the case g(s) ≡ B for B sufficiently large, see the proof of the proposition. Next, for the conditional prior for θ_S, given S, note first that the sieve MLE is the least-squares estimator θ̂_S = (Φ_S^⊤Φ_S)^{−1}Φ_S^⊤Y, where Φ_S is the n × |S| matrix determined by the basis functions at the observed covariates, i.e., Φ_S = (φ_j(t_i))_{ij}, i = 1, . . . , n and j = 1, . . . , |S|. As in Martin, Mess and Walker (2017), this suggests a conditional prior of the form θ_S | S ∼ N(θ̂_S, γ(Φ_S^⊤Φ_S)^{−1}), for some γ > 0. This empirical prior for θ ≡ (S, θ_S) induces a corresponding empirical prior for f through the mapping θ → f_θ.
Proposition 5. Suppose that the true regression function f⋆ is in a Sobolev space of index β > ½. For the empirical prior described above, there exists a constant M > 0 such that the corresponding posterior distribution Π^n, using Type I or Type II regularization, with any α < 1, satisfies E_{f⋆} Π^n({f : ∥f − f⋆∥_n² > Mn^{−2β/(2β+1)}}) → 0.
Proof. See the Appendix.
Note that the rate obtained in Proposition 5 is exactly the optimal minimax rate, i.e., there are no extra logarithmic factors. This, as in Section 4.4, is a consequence of f⋆ eventually being in the specified sieve; such extra log factors arise from having to approximate the true parameter by an element of the sieve. A similar result, without the additional logarithmic terms, is given in Gao and Zhou (2016).
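The empirical prior center above, the least-squares sieve MLE, is easy to compute. The sketch below (ours) uses a cosine basis as a concrete Fourier-type choice; the basis, the noise level, and the test function are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def sieve_mle_fourier(y, t, S):
    """Least-squares sieve MLE for the first S cosine-basis coefficients.

    Phi_S has columns phi_j(t_i), and theta_hat_S = (Phi'Phi)^{-1} Phi'y is
    the center of the conditional empirical prior for theta_S, given S.
    """
    j = np.arange(1, S + 1)
    Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(t, j))  # n x S design matrix
    theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi, theta_hat

n = 200
t = np.arange(1, n + 1) / n                      # equi-spaced design, t_i = i/n
f_true = np.sqrt(2.0) * np.cos(np.pi * t)        # equals phi_1, so theta* = (1, 0, ...)
y = f_true + rng.normal(0.0, 0.5, size=n)
Phi, theta_hat = sieve_mle_fourier(y, t, S=5)
f_hat = Phi @ theta_hat
```

Since the true function is the first basis element, the fitted coefficient vector is close to (1, 0, 0, 0, 0), and the fit recovers f⋆ at roughly the n^{−1/2} per-coefficient scale.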

Nonparametric density estimation
Consider the problem of estimating a density p⋆ supported on the real line. As in Section 4.3, we propose a normal mixture model and demonstrate the asymptotic concentration properties of the posterior based on an empirical prior, but with the added feature that the rate is adaptive to the unknown smoothness of the true density function. Specifically, as in Kruijer, Rousseau and van der Vaart (2010), we assume that data X_1, . . . , X_n are iid from a true density p⋆, where p⋆ satisfies Conditions C1–C4 in their paper; in particular, we assume that log p⋆ is Hölder with smoothness parameter β. They propose a fully Bayesian model, one that does not depend on the unknown β, and demonstrate that the posterior concentration rate, relative to the Hellinger distance, is ε_n = (log n)^t n^{−β/(2β+1)} for a suitable constant t > 0, which is within a logarithmic factor of the optimal rate.
Here we extend the approach presented in Section 4.3 to achieve adaptation by incorporating a prior for the number of mixture components, S, as well as for the S-specific kernel variance σ_S², as opposed to fixing their values. For the prior w_n for S, we let w_n(S) ∝ e^{−D(log S)^r S}, S = 1, . . . , n, where r > 1 and D > 0 are specified constants. Given S, we consider a mixture model with S components of the form
p_{S,θ_S}(x) = Σ_{s=1}^S ω_{s,S} N(x | μ_{s,S}, λ_S^{−1}),
where θ_S = (ω_S, μ_S, λ_S), ω_S = (ω_{1,S}, . . . , ω_{S,S}) is a probability vector in Δ(S), μ_S = (μ_{1,S}, . . . , μ_{S,S}) is an S-vector of mixture locations, and λ_S is a precision (inverse variance) that is the same in all the kernels for a given S. We can fit this model to data using, say, the EM algorithm, and produce the given-S sieve MLE: ω̂_S = (ω̂_{1,S}, . . . , ω̂_{S,S}), μ̂_S = (μ̂_{1,S}, . . . , μ̂_{S,S}), and λ̂_S. Following our approach in Section 4.3, we consider an empirical prior for ω_S obtained by taking ω_S ∼ Dir_S(α̂), where α̂_{s,S} = 1 + cω̂_{s,S} and c = c_S is to be determined. The prior for μ_S follows the same approach as in Section 4.3, i.e., the μ_{s,S} are independent Unif(μ̂_{s,S} − δ, μ̂_{s,S} + δ), where δ = δ_S is to be determined. The prior for λ_S is also uniform, λ_S ∼ Unif(λ̂_S(1 − ψ), λ̂_S(1 + ψ)), where ψ = ψ_S is to be determined. Also, as with the μ̂_{s,S} being restricted to the interval (−B, +B), we restrict λ̂_S to lie in (B_l, B_u), to be determined. Then we get a prior on the density function through the mapping (S, θ_S) → p_{S,θ_S}. For this choice of empirical prior, the following proposition shows that the corresponding posterior distribution concentrates around a suitable true density p⋆ at the optimal rate, up to a logarithmic factor, exactly as in Kruijer, Rousseau and van der Vaart (2010).
Proposition 6. Suppose that the true density p⋆ satisfies Conditions C1–C4 in Kruijer, Rousseau and van der Vaart (2010); in particular, log p⋆ is Hölder continuous with smoothness parameter β. For the empirical prior described above, if B = (log n)², B_l = n^{−1}, B_u = n^{b−2}, and, for each S, c = c_S = n²S^{−1}, δ = δ_S = S^{1/2}n^{−(b+3/2)}, and ψ = ψ_S = Sn^{−1}, for a sufficiently large b > 2, then there exist constants M > 0 and t > 0 such that the corresponding posterior distribution Π^n, using Type I or Type II regularization, with any α < 1, concentrates around p⋆ at the rate ε_n = (log n)^t n^{−β/(2β+1)} relative to the Hellinger distance.
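For concreteness, the tuning choices stated in Proposition 6 are plain arithmetic in (n, S, b); the helper below (our own, purely for orientation) evaluates them and shows the scalings involved:

```python
import math

def prop6_tuning(n, S, b):
    """Tuning choices as stated in Proposition 6:
    B = (log n)^2, B_l = 1/n, B_u = n^{b-2},
    c_S = n^2 / S, delta_S = S^{1/2} n^{-(b + 3/2)}, psi_S = S / n.
    """
    return {
        "B": math.log(n) ** 2,
        "B_l": 1.0 / n,
        "B_u": n ** (b - 2.0),
        "c": n ** 2 / S,
        "delta": math.sqrt(S) * n ** -(b + 1.5),
        "psi": S / n,
    }

tune = prop6_tuning(n=10_000, S=25, b=3.0)
```

Note the separation of scales: c_S grows like n²/S (so the Dirichlet prior concentrates hard on ω̂_S), while δ_S and ψ_S shrink polynomially, keeping the uniform priors tightly centered on the sieve MLE.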

Conclusion
This paper considers the construction of an empirical or data-dependent prior such that, when combined with the likelihood via Bayes's formula, it yields a posterior distribution with desirable asymptotic concentration properties. The details vary a bit depending on whether the complexity of the true θ⋆ is known to the user or not (Sections 2.1–2.2), but the basic idea is to first choose a suitable sieve and then center the prior for the sieve parameters on the sieve MLE. This makes it straightforward to establish the necessary local prior support condition and to lower-bound the posterior denominator, which is a major obstacle in the standard Bayesian nonparametric setting. Having the data involved in the prior complicates the usual argument to upper-bound the posterior numerator but, compared to the usual global prior conditions involving entropy, here we only need to suitably control the spread of the empirical prior. The end result is a data-dependent measure that achieves a certain, often optimal, concentration rate, adaptively if necessary.
The approach presented here is quite versatile, so there are many potential applications beyond the examples studied here. A more general question to be considered in follow-up work, one that has attracted a lot of attention in the Bayesian nonparametric community recently, concerns the coverage probability of credible regions derived from our empirical Bayes posterior distribution. Having suitable concentration rates is an important first step, but coverage properties will require new insights. The theoretical results presented in Martin and Ning (2019) for the sparse normal means problem and in Martin and Tang (2019) for regression, along with the numerical results in Martin (2018) for monotone density estimation, are promising, but more work is needed.

A.1. Proof of Proposition 1
The proof appeals to the asymptotic distribution of the MLE, θ̂_n ∼ N_d(θ⋆, n^{−1}Σ^{−1}), where Σ is the Fisher information matrix evaluated at θ⋆. As long as Ψ is non-singular, the resulting bound does not depend on n and is finite, which implies we can take ε_n = n^{−1/2}. It follows from Theorem 1 that the Hellinger rate is ε_n = n^{−1/2} and, since all metrics on the finite-dimensional Θ are equivalent, the same rate obtains for any other metric.
We highlight that the integral involved in checking Condition GP1 is at most exponential in the dimension of the parameter space; this result will be useful in the proofs of some of the other propositions.

A.2. Proof of Proposition 2
We start by verifying Condition LP1. Note that, for models in the support of the prior, the data are multinomial, so the likelihood function is
L_n(θ) = Π_{s=1}^S (Sθ_s)^{n_s},
where (n_1, . . . , n_S) are the bin counts, i.e., n_s = |{i : X_i ∈ E_s}|, s = 1, . . . , S. Taking expectation with respect to θ ∼ Dir_S(α̂) yields, after simplification, the gamma-function bound (20) on E{L_n(θ)/L_n(θ̂_n)}. Next, a simple "reverse Markov inequality" says that, for any random variable Y ∈ (0, 1),
P(Y > a) ≥ (EY − a)/(1 − a),  a ∈ (0, 1).  (21)
Recall that L_n = {θ ∈ Θ_n : L_n(θ) > e^{−dnε_n²} L_n(θ̂_n)} as in (3), so we can apply (21), with Y = L_n(θ)/L_n(θ̂_n), to lower-bound Π_n(L_n). Then it follows from (20) that Condition LP1 is satisfied, with C > d, if
Γ(c + S + n)/{Γ(c + S)c^n} ≤ e^{anε_n²} for some a < d.  (22)
Towards this, if c = nε_n^{−2} as in the proposition statement, then the left-hand side above is upper-bounded by e^{nε_n²(1+S/n)}. Since S ≤ n, (22) holds for, say, d > 2, hence Condition LP1.
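The reverse Markov inequality (21) is elementary but easy to mis-state, so here is a small, exact numerical check (an illustration, not part of the proof), using a discrete Y supported on (0, 1):

```python
from fractions import Fraction

def reverse_markov_bound(values, probs, a):
    """Exact check of the reverse Markov inequality for Y in (0, 1):
    P(Y > a) >= (E[Y] - a) / (1 - a), computed in rational arithmetic.
    """
    EY = sum(v * p for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v > a)
    bound = (EY - Fraction(a)) / (1 - Fraction(a))
    return tail, bound

# A discrete Y on (0, 1): E[Y] = 1/2
values = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
tail, bound = reverse_markov_bound(values, probs, Fraction(1, 3))
```

Here P(Y > 1/3) = 3/4 while the bound (1/2 − 1/3)/(2/3) = 1/4, so the inequality holds with room to spare.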
Towards Condition GP1, note that the Dirichlet density for θ satisfies a bound obtained via Stirling's formula, valid for all n_s > 0 due to the value of c, and this bound is uniform in θ. Then Condition GP1 holds if we can bound the product of this and Γ(S)^{−1}, the volume of Δ(S), by e^{Knε_n²} for a constant K > 0. Using Stirling's formula again, and the fact that c/S → ∞, the relevant quantity is of order S log(1 + c/S), so we need S log(1 + c/S) ≲ nε_n². Since c/S ≲ n², the logarithmic term is ≲ log n. But we assumed that S ≤ nε_n²(log n)^{−1}, so the product is ≲ nε_n², proving Condition GP1.
It remains to check Condition S1. A natural candidate for the pseudo-true parameter θ† in Condition S1 is the one that sets θ_s equal to the probability assigned by the true density p⋆ to E_s; that is, θ†_s = ∫_{E_s} p⋆(x) dx, s = 1, . . . , S. It is known (e.g., Scricciolo, 2015, p. 93) that, if p⋆ is β-Hölder, with β ∈ (0, 1], then the sup-norm approximation error of p_{θ†} is ≲ S^{−β}. Since p⋆ is uniformly bounded away from 0, it follows from Lemma 8 in Ghosal and van der Vaart (2007a) that max{K(p⋆, p_{θ†}), V(p⋆, p_{θ†})} is bounded by a multiple of the squared sup-norm error which, in turn, is upper-bounded by S^{−2β} by the above. Therefore, we need S = S_n to satisfy S^{−β} ≲ ε_n, and this is achieved by choosing S = nε_n²(log n)^{−1} as in the proposition. This establishes Condition S1, completing the proof.

A.3. Proof of Proposition 3
We start by verifying Condition LP1. Towards this, we first note that, for mixtures in the support of the prior, the likelihood function can be written as
L_n(θ) = Σ_{(n_1,...,n_S)} ω_1^{n_1} ⋯ ω_S^{n_S} Σ_{(s_1,...,s_n)} Π_{s=1}^S Π_{i: s_i=s} k(X_i | μ_s),  (23)
where the first sum is over all S-tuples of non-negative integers (n_1, . . . , n_S) that sum to n, the second sum is over all n-tuples of integers in {1, . . . , S} with (n_1, . . . , n_S) as the corresponding frequency table, and k(x | μ) = N(x | μ, σ²) for known σ². We adopt the convention that, if n_s = 0, then the product over {i : s_i = s} is identically 1. Next, since the prior has ω and μ independent, we only need to bound E(ω_1^{n_1} ⋯ ω_S^{n_S}) and E{Π_{s=1}^S Π_{i: s_i=s} k(X_i | μ_s)} for a generic (n_1, . . . , n_S). The first expectation is with respect to the prior for ω and can be handled exactly as in the proof of Proposition 2. For the second expectation, which is with respect to the prior for μ, since the prior has the components of μ independent, the product factors over s, so we can work with a generic s. Writing out the product of kernels and applying Jensen's inequality, i.e., E(e^Z) ≥ e^{E(Z)}, the expectation is lower-bounded as
E{Π_{i: s_i=s} k(X_i | μ_s)} ≥ e^{−n_s v_n/(2σ²)} Π_{i: s_i=s} k(X_i | μ̂_s),  (24)
where v_n = δ_n²/3 is the variance of μ_s ∼ Unif(μ̂_s − δ_n, μ̂_s + δ_n). Putting the two expectations back together, from (23) we have
E{L_n(θ)} ≥ {Γ(c + S)c^n/Γ(c + S + n)} e^{−nv_n/(2σ²)} L_n(θ̂_n),  (25)
where now the expectation is with respect to both priors. Recall that L_n = {θ ∈ Θ_n : L_n(θ) > e^{−dnε_n²} L_n(θ̂_n)} as in (3), and define L̃_n = {θ ∈ L_n : L_n(θ) ≤ L_n(θ̂_n)}. Since L_n ⊇ L̃_n and, for θ ∈ L̃_n, we have L_n(θ)/L_n(θ̂_n) ≤ 1, we can apply the reverse Markov inequality (21) again. Then it follows from (25) that
Π_n(L_n) ≥ {Γ(c + S)c^n/Γ(c + S + n)} e^{−nv_n/(2σ²)} − e^{−dnε_n²}
and, therefore, Condition LP1 is satisfied if
nv_n/(2σ²) ≤ bnε_n² and Γ(c + S + n)/{Γ(c + S)c^n} ≤ e^{anε_n²}, where a + b < d.
The first condition is easy to arrange: it requires that v_n ≤ 2bσ²ε_n², i.e., δ_n ≤ (6bσ²)^{1/2}ε_n, which holds by assumption on δ_n. The second condition holds with a = 2 by the argument in the proof of Proposition 2. Therefore, Condition LP1 holds. Towards Condition GP1, putting together the bound on the Dirichlet density function from the proof of Proposition 2 and the corresponding bound on the uniform densities, we obtain, for any p > 1, a product of two factors to control. Condition GP1 holds if both factors can be bounded by e^{Knε_n²} for a constant K > 0. The first factor, coming from the Dirichlet part, is handled just as in the proof of Proposition 2 and, for the second factor, since δ_n ∝ ε_n and B_n ∝ log^{1/2}(ε_n^{−1}), we have B_n/δ_n ≲ n^{1/2}, so the relevant exponent is ≲ S log n ≲ nε_n². This takes care of the second factor, proving Condition GP1.
Finally, we refer to Section 4 of Ghosal and van der Vaart (2001), where it is shown that there exists a finite mixture, characterized by θ†, with S components and locations in [−B_n, B_n], such that max{K(p_{θ⋆}, p_{θ†}), V(p_{θ⋆}, p_{θ†})} ≲ ε_n². This θ† satisfies our Condition S1, so the proposition follows from Theorem 1.
In the context of Remark 4, when the normal kernel is replaced by a Cauchy kernel, we need to verify (24) in order to meet LP1. To this end, start with the expectation of the product of Cauchy kernels, where the expectation is with respect to the prior for the μ_s and σ is assumed known. The log of this expectation is lower-bounded via Jensen's inequality and, after exponentiating, the bound is the product of Cauchy kernels evaluated at μ̂_s times a second factor that is lower-bounded by exp(−n_s v_n/σ²). Therefore, Condition LP1 holds with the same ε_n as in the normal case.
Condition GP1 in this case does not depend on the form of the kernel, whether normal or Cauchy. And S1 is satisfied if we assume the true density p⋆ = p_{θ⋆} is a finite mixture of densities, for example, Cauchy. This proves the claim in Remark 4, namely, that the empirical Bayes posterior, based on a Cauchy kernel, concentrates at the rate ε_n = (log n)n^{−1/2} when the true density is a finite Cauchy mixture.

A.4. Proof of Proposition 4
The proportionality constant depends on n (and g) but is bounded away from zero and infinity as n → ∞, so it can be ignored in our analysis. Here we check the second part of Condition LP2. Indeed, for the true model S⋆ of size s_n, using the inequality C(n, s) ≤ (en/s)^s, we have
w_n(S⋆) ∝ C(n, s_n)^{−1} e^{−Bs_n} ≥ e^{−[B+1+log(n/s_n)]s_n}
and, since nε_n² = s_n log(n/s_n), the second condition in Condition LP2 holds for all large n with A > 1. Next, for Condition GP2, note that the prior w_n given above corresponds to a hierarchical prior for S that starts with a truncated geometric prior for |S| and then a uniform prior for S, given |S|. It then follows directly that Condition GP2 on the marginal prior for |S| is satisfied.
For Condition LP2, we first write the likelihood ratio for a generic θ ∈ Θ_S:
L_n(θ)/L_n(θ̂_{n,S}) = e^{−½∥θ_S − θ̂_{n,S}∥²}.
Therefore, L_{n,S} = {θ ∈ Θ_S : ½∥θ_S − θ̂_{n,S}∥² < |S|}. This is just a ball in R^{|S|}, so we can bound the Gaussian measure assigned to it; for moderate to large |S|, this measure is at least exp[{1 − 2γ + log γ + log 2}|S|/2] and, therefore, plugging in S⋆ for the generic S above, we see that Condition LP2 holds if 1 − 2γ + log γ + log 2 < 0. For Condition GP2, the calculation is similar to that in the finite-dimensional case handled in Proposition 1. Indeed, the last part of that proof showed that, for a d-dimensional normal mean model with covariance matrix Σ^{−1} and a normal empirical prior with mean θ̂_n and covariance matrix proportional to Σ^{−1}, the integral specified in the second part of Condition GP2 is exponential in the dimension d. In the present case, we have ∫_{Θ_S} E_{θ⋆}[{π_{n,S}(θ)}^p]^{1/p} dθ ≤ e^{κ|S|} for some κ > 0 and then, clearly, Condition GP2 holds with K = κ. If we take B in the prior w_n for S larger than this K, then the conditions of Theorem 2 are met with ε_n² = (s_n/n) log(n/s_n).
Write ε_n = (log n)^t n^{−β/(2β+1)} for a constant t > 0 to be determined. For Condition S2, we appeal to Lemma 4 in Kruijer, Rousseau and van der Vaart (2010), which states that there exists a finite normal mixture, p†, having S_n components, with S_n ≍ n^{1/(2β+1)}(log n)^{k−t} = nε_n²(log n)^{k−3t}, such that max{K(p⋆, p†), V(p⋆, p†)} ≤ ε_n², where k = 2/τ² and τ² is related to the tails of p⋆ in their Condition C3. So, if t is sufficiently large, then our Condition S2 holds.
For Condition GP2, we first note that, by a straightforward modification of the argument given in the proof of Proposition 3, the integral ∫_{Δ(S)×R^S×R₊} E[{π_{n,S}(θ)}^p]^{1/p} dθ is bounded by e^{bS log n}, for some b > 0, times factors of the form (1 + B/δ) coming from the uniform priors. The logarithmic term appears in the first factor because, as in the proof of Proposition 3, the exponent can be bounded by a constant times S log(1 + c/S) ≲ S log n, since c/S = n²/S² < n². To get the upper bound to be exponential in S, we can take the tuning parameters as in the proposition statement.

R. Martin and S. G. Walker
With these choices, it follows that the right-hand side in the previous display is upper-bounded by e^{3b log n}, independent of S. Therefore, the summation in (8) over S = 1, . . . , n is upper-bounded by ne^{3b log n} = e^{(3b+1) log n}. Since log n ≤ nε_n², we have that Condition GP2 holds.
Condition LP2 has two parts. For the first part, which concerns the prior concentration on L_n, we can follow the argument in the proof of Proposition 3. In particular, with the additional prior on λ, the version of (25) is
E{L_n(θ)} ≥ {Γ(c + S)c^n/Γ(c + S + n)} e^{−nδ²λ̂/6} e^{−nzψ} L_n(θ̂_S),
for some z ∈ (0, 1). This is based on the result that, if λ ∼ Unif(λ̂(1 − ψ), λ̂(1 + ψ)), then Eλ = λ̂ and E log λ > log λ̂ − zψ for some z ∈ (0, 1). With c = n²S^{−1} as proposed, the argument in the proof of Proposition 2 shows that the first term on the right-hand side of the above display is lower-bounded by e^{−CS} for some C > 0. To make the other terms lower-bounded by something of the order e^{−C′S}, we need δ and ψ to satisfy δ² ≲ B_u^{−2}S/n and ψ ≲ S/n.
Given these constraints and those coming from checking Condition GP2 above, we can now fix the tuning parameters. From Lemma 4 in Kruijer, Rousseau and van der Vaart (2010), we can deduce that the absolute values of the locations for p† are smaller than a constant times a power of log n; hence, we can take B = (log n)². Also, we need B_l ≲ ε_n^β, which is met by taking B_l = n^{−1}. To meet our constraints, we can take B_u = n^{b−2}, so we need b ≥ 2. These conditions on (B, B_l, B_u, δ, ψ) are met by the choices stated in the proposition. For the second part of Condition LP2, which concerns the concentration of w_n around S_n, we have
w_n(S_n) ≥ e^{−D(log S_n)^r S_n} ≳ e^{−Dnε_n²(log n)^{k+r−3t}}.
So, just as in Kruijer, Rousseau and van der Vaart (2010), as long as 3t > k + r, we get w_n(S_n) ≥ e^{−Dnε_n²}, as required in Condition LP2.