Empirical priors for targeted posterior concentration rates

In high-dimensional problems, choosing a prior distribution such that the corresponding posterior has desirable properties can be challenging. This paper develops a general strategy for constructing empirical or data-dependent priors whose corresponding posterior distributions achieve targeted, often optimal, concentration rates. The idea is to place a prior which has sufficient mass on parameter values for which the likelihood is suitably large. This makes the asymptotic properties of the posterior less sensitive to the shape of the prior which, in turn, allows users to work with priors of convenient forms while maintaining the desired posterior concentration rates. General results on both adaptive and non-adaptive rates based on empirical priors are presented, along with illustrations in density estimation, nonparametric regression, and high-dimensional structured normal models.


Background
Current theoretical research on Bayesian methods is largely concerned with finding posterior concentration rates. To set the scene, if Π n denotes a posterior distribution for some parameter θ in a metric space (Θ, d), with true value θ ⋆ , the goal is to find the most rapidly vanishing sequence ε n such that, for a constant M > 0, E θ ⋆ [Π n ({θ : d(θ ⋆ , θ) > Mε n })] → 0, n → ∞. (1) The traditional setting involves independent and identically distributed (i.i.d.) observations and θ is a density function with d being the Hellinger or L 1 metric; see Ghosal et al. (2000) and Walker et al. (2007). Results for the non-i.i.d. case are developed in Ghosal and van der Vaar (2007).
In the classical Bayesian framework, especially in high-or infinite-dimensional models, the prior must be controlled very carefully-roughly, the prior tails can be neither too fat nor too thin-because it completely determines the attainable concentration rate. One idea of current interest is the use of generic data-dependent measures. These are probability measures driven by the data and not necessarily the result of a Bayesian prior-to-posterior construction; see, e.g., Belitser (2016). Here our focus is on datadependent measures arising from an empirical Bayes approach, where the posterior is obtained by passing an empirical or data-dependent prior through the likelihood function via Bayes's formula. The classical empirical Bayes approach starts with a family of priors indexed by a parameter γ, i.e., Π(dθ | γ), and then estimates γ based on the data. This is typically done by finding γ to maximize the marginal likelihood, L n (θ) Π(dθ | γ), where L n (θ) denotes the likelihood function. The corresponding posterior has a simple form, namely, Π n (dθ) ∝ L n (θ) Π(θ | γ), but demonstrating that it has desirable asymptotic concentration properties is a non-trivial exercise (e.g., Donnet et al. 2014;Rousseau and Szabó 2016). For more on empirical Bayes, see Efron (2014).
There is no particularly compelling justification for this classical empirical Bayes approach; so why not consider an alternative where the choice of data-dependent prior is motivated specifically by the properties one wants the posterior to have? Hence, our goal here is to redefine the idea of empirical Bayes, and we propose a more poignant choice of empirical prior designed specifically so that the corresponding posterior distribution achieves the desired concentration rate properties. Martin and Walker (2014) and Martin et al. (2015) recently employed a new type of empirical Bayes procedure in two structured high-dimensional Gaussian linear models and obtained optimal minimax posterior concentration rates. Their main idea is to suitably center the prior around a good estimator of the parameter, a relatively straightforward task for these normal linear models. An important practical consequence is that the computationally convenient normal priors, which have been shown to be suboptimal in these problems in a classical Bayesian context (e.g., Castillo and van der Vaart 2012, Theorem 2.8), do actually meet the conditions for optimality in this new empirical Bayes context. It is not clear, however, if this strategy of prior centering can be applied to cases beyond these normal linear models. In this paper, we develop a general framework for this new kind of empirical Bayes approach, with supporting theory.

Main contributions
A benefit of this general framework is its simplicity and versatility, i.e., that the conditions are relatively easy to check for standard prior types and that the same techniques can be used for a wide range of models and/or true parameters. For example, the proposed approach can handle models that involve mixtures of light-and heavy-tailed kernels (Section 4.3), something that apparently the existing Bayesian nonparametric machinery currently cannot do (e.g., Kruijer et al. 2010Kruijer et al. , p. 1229. Shape-restricted problems, such as monotone density estimation, discussed in Salomond (2014), is another situation where the standard Bayesian machinery is not fully satisfactory, but the methods presented herein can be applied directly.
To motivate the use of our empirical priors in particular, it helps to recall one of the essential parts of proofs of posterior concentration rates for standard Bayesian posteriors. Suppose we have data X n with a joint density p n θ , depending on a parameter θ; high-and infinite-dimensional parameters are our main focus in this paper but, to keep the present discussion simple, we consider θ to be finite-dimensional. If ε n is the target concentration rate, then it is typical to consider a "neighborhood" of the true θ ⋆ of the form θ : K(p n θ ⋆ , p n θ ) ≤ nε 2 n , V (p n θ ⋆ , p n θ ) ≤ nε 2 n , where K is the Kullback-Leibler divergence and V is the corresponding second moment; see Section 1.4. A crucial step in proving that the posterior attains the target concentration rate is to demonstrate that the prior allocates a sufficient amount of mass to the set in (2). If the prior could be suitably centered at θ ⋆ , then this prior concentration would be trivial. The difficulty, of course, is that θ ⋆ is unknown, so care is needed to construct a prior satisfying this prior concentration property uniformly on a sufficiently wide range of θ ⋆ . In fact, this placement of prior mass is known to be problematic in the usual Bayesian proofs and is one key reason why a number of examples, such as heavy-tailed density estimation, are particularly challenging.
As an alternative, consider the following "empirical version" of (2), where L n (θ) is the likelihood function based on X n andθ n is a maximum likelihood estimator. This is effectively a neighborhood ofθ n , which is known, so it is straightforward to construct a prior to assign a sufficient amount of mass to L n . The catch is that a prior satisfying this mass condition would be data-dependent, or empirical, since it must be appropriately centered atθ n . One can proceed to construct a corresponding empirical Bayes posterior by combining this empirical prior with the likelihood via Bayes's formula. Ifθ n behaves badly, then the empirical prior to posterior update can correct for it, provided certain conditions are satisfied. Our key observation is that an empirical prior that allocates a sufficient amount of mass to L n is easy to arrange in practice (see Section 4) and is a significant step towards proving concentration rate results for the corresponding empirical Bayes posterior. Our aim is to put sufficient mass around the maximum likelihood estimator in the prior, in fact the maximum allowed up to a constant, which ensures the target rate for the posterior. Future work will look at how to set the constant in order to match posterior credible regions with confidence regions, for example.
While the attainable posterior concentration rate is determined by the prior, the targeted rate depends on the true value θ ⋆ of the parameter in some way. For example, in a nonparametric regression problem, the optimal rate will depend on the smoothness of the true regression function. If this smoothness is known, then it is possible to tune the prior so that the attainable and targeted rates agree. However, if the smoothness is unknown, as is often the case, the prior cannot make direct use of this information, so one needs to make the prior more flexible so that it can adapt to the unknown concentration rate. Adaptive posterior concentration rate results have received considerable attention in the recent literature, see van der Vaart and van Zanten (2009), Kruijer et al. (2010), Arbel et al. (2013), Scricciolo (2015), and Shen and Ghosal (2015), with the common denominator in all this work is that the prior should be a mixture over an appropriate model complexity index. The empirical prior approach described above can readily handle this modification, and we provide general sufficient conditions for adaptive empirical Bayes posterior concentration.

Outline of the paper
In Section 2 we introduce the notion of an empirical prior and present the conditions needed for the corresponding posterior distribution to concentrate at the true parameter value at a particular rate. This discussion is split into two parts, depending on whether the target rate is known or unknown. A toy example is given that shows the conditions of the theorems are not unreasonable. Section 3 presents the proofs of the two main theorems, and a take-away point is that the arguments are quite straightforward, suggesting that the particular empirical prior construction is indeed very natural. Several examples are presented in Section 4, starting from a relatively simple parametric problem and ending with a challenging adaptive nonparametric density estimation problem. We conclude, in Section 5, with a brief discussion. Details for the examples are in the Appendix.

Notation
Suppose that data X n , indexed by n ≥ 1, not necessarily independent or i.i.d., have a joint distribution with density p n θ , indexed by a parameter θ ∈ Θ, possibly high-or infinite-dimensional. Write L n (θ) = p n θ (X n ) for the likelihood function. If Π n is a prior distribution, possibly depending on data, supported on a subset Θ n ⊆ Θ, and having a density π n with respect to some non-data-dependent dominating measure ν n on Θ n , then Bayes's formula gives the posterior distribution Π n (A) = A L n (θ)π n (θ) ν n (dθ) Θn L n (θ)π n (θ) ν n (dθ) , A ⊆ Θ n .
Typically, ν n would be Lebesgue or counting measure, depending on the structure of Θ n . For our theoretical analysis, if θ ⋆ is the true value of the parameter, then it will be convenient to rewrite the posterior distribution as where R n (θ) = L n (θ)/L n (θ ⋆ ) is the likelihood ratio, and N n (·) and D n denote the numerator and denominator of the ratio, respectively. Some minor modification to this familiar form will be considered in Section 2.2.
Our results below will establish convergence rates for the posterior Π n relative to the Hellinger distance on the set of joint densities {p n θ : θ ∈ Θ}. Recall that the Hellinger distance between two densities, say, f and g, with dominating measure µ, is given by We say that the posterior distribution has (Hellinger) concentration rate (at least) ε n at and M > 0 is a sufficiently large constant. This particular set can be related to some other more familiar types of neighborhoods in certain cases; see Section 4 for details. For example, consider the typical i.i.d. setup, so that p n θ is just a product of marginal densities p θ . In this case, Therefore, in the i.i.d. case, if Π n (A M εn ) vanishes, then so does the posterior probability of {θ : H(p θ ⋆ , p θ ) > Mε n }, so that ε n is the usual Hellinger concentration rate.
In addition to the Hellinger distance, we will also have a need for the Kullback-Leibler divergence, K, and the corresponding second moment, V , given by Sieves will play an important role in our prior construction and analysis. According to Grenander (1981) and Geman and Hwang (1982), a sieve is simply an increasing sequence of (finite-dimensional) subsets of the parameter space. We will denote these generically by Θ n . Care is needed in choosing the sieves to have the appropriate approximation properties; see Conditions S1 and S2 in Section 2 and the examples in Section 4. We will letθ n denote a sieve maximum likelihood estimator (MLE), i.e.,θ n = arg max θ∈Θn L n (θ). An important subset of Θ n is the "neighborhood" of the sieve MLE eluded to above, i.e., Finally, we write ∆(S) = {(θ 1 , . . . , θ S ) : θ s ≥ 0, S s=1 θ s = 1} for the S-dimensional probability simplex, 1(·) for the indicator function, " " for inequality up to a universal constant, |A| for the cardinality of a finite set A, and, for a number p > 1, we say that q = p/(p − 1) is the Hölder conjugate of p in the sense that p −1 + q −1 = 1.
2 Empirical priors and posterior concentration 2.1 Known target rate For our first case, suppose that the target rate, ε n , is known. That is, the feature of θ ⋆ that determines the desired rate, e.g., the smoothness of the true regression function, is known. In such cases, we can make use of the known target rate to design an appropriate sieve on which to construct an empirical prior. Condition S1 below concerns the sieve's approximation properties, and is familiar in the posterior concentration literature. Condition S1. There exists an increasing sequence of subsets Θ n of Θ and a deterministic sequence θ † = θ † n in Θ n such that max K(p n θ ⋆ , p n θ † ), V (p n θ ⋆ , p n θ † ) ≤ nε 2 n , all large n.
Remark 1. The sequence θ † = θ † n in Condition S1 can be interpreted as "pseudo-true" parameter values in the sense that n −1 K(p n θ ⋆ , p n θ † ) → 0. In the case that the sieves eventually contain θ ⋆ , then θ † is trivial. However, there will be examples where the model does not include the true distribution, in which case, identifying θ † is more challenging. Fortunately, appropriate sieves are already known in many of the key examples.
Remark 2. An important consequence of Condition S1, that will be used in the proof of our main theorems, is a bound on the likelihood ratio R n (θ n ) at the sieve MLE, in particular, there exists a constant c > 1 such that R n (θ n ) ≥ e −cnε 2 n with P θ ⋆ -probability converging to 1. See Lemma 8.1 in Ghosal et al. (2000).
The next two conditions, namely, Conditions LP1 and GP1 below, are conditions on the prior. The first, a local prior condition, formally describes how the empirical prior Π n should concentrate on the "neighborhoods" L n in (4). On one hand, this is similar to the standard local prior support conditions in Ghosal et al. (2000), Shen and Wasserman (2001), and Walker et al. (2007) but, on the other hand, the neighborhood's dependence on the data is our chief novelty and is the main driver of our empirical prior construction. The second, a global prior condition, effectively controls the tails of the empirical prior density π n , i.e., how heavy can the tails be and still achieve the desired rate.
Condition GP1. There exists constants K > 0 and p > 1, such that the density function π n of the empirical prior Π n satisfies where, again, ν n is a non-data-dependent dominating measure on Θ n .
We have claimed that it is relatively simple to construct an empirical prior to satisfy Conditions LP1 and GP1 above. In fact, in many cases, we can take Π n to be a normal prior with meanθ n and suitable variance. Details of examples like this, as well as others for which a normal prior is not appropriate, are given in Section 4. Here, to show that the conditions are quite reasonable, we provide a simple illustration.
Toy Example. Consider X 1 , . . . , X n i.i.d. N(θ, 1), so thatθ n =X, the sample mean and ν n is constant equal to Lebesgue measure on R. The target rate is ε n = n −1/2 . We can take the sieve Θ n to be fixed at Θ = R and set θ † = θ ⋆ , so that Conditions S1 and R1 hold trivially. Next, for L n in (4), with d = 1, we have If we take Π n = N(X, s 2 ), then it can be shown that Condition LP1 holds if we take the prior standard deviation s as .
We claim that Condition GP1 also holds with this choice. To see this, for the prior density π n (θ) = N(θ |X, s 2 ) and constant p > 1, we have the omitted proportionality constant here and below depends on p but not on s orX. Then the familiar property of normal convolutions gives where θ ⋆ is the true mean, so the integral of the p −1 power is Given the form for s, the right-hand side is bounded as n → ∞ and, therefore, Condition GP1 is satisfied with ε n = n −1/2 .

Unknown target rate
As discussed in Section 1, the target rate depends on some features of the unknown θ ⋆ . In this case, care is needed to construct a prior which is adaptive in the sense that it still leads to posterior concentration at the desired rate. Towards adaptivity, we will make two adjustments to the prior described in Section 2.1. The first step is to introduce a mixture element into the prior, and the second step, for regularization purposes, incorporates data in the prior again but in a different way than the prior centering step. The starting point, again, is with the selection of a suitable sieve. Let Θ be the full parameter space, and let Θ n be an increasing sequence of finite-dimensional subsets. Express the parameter θ as an infinite vector θ = (θ 1 , θ 2 , . . .), e.g., θ could be the coefficients attached to a basis expansion of a regression function or a log-density function, so that the event "θ j = 0" means that feature j is "turned off" and, therefore, the corresponding θ is less complex. This suggests that we define the sieves as where S is a finite subset of {1, 2, . . . , T n }, with T n increasing with n, and As in Section 2.1, we will need four conditions in order to establish our adaptive posterior concentration rate result: two conditions on the sieve and sieve MLE, a local prior condition, and a global prior condition. Since we are seeking a stronger adaptive rate result, naturally, the conditions here are stronger.
In some examples, it is known that the true parameter belongs to one of the sieve sets, so that θ † can be taken as θ ⋆ , and Condition S2 is trivial. In other cases, θ ⋆ may not belong to any sieve, so approximation-theoretic results on the sieve will be needed to establish this. Examples of both types are presented in Section 4. In any case, the set of indices S ⋆ n acts like the "true model" and θ † is a deterministic sequence of "pseudo-true" parameter values; see Remark 1.
Towards writing down the empirical prior, it will help if we express the infinitedimensional vector θ as (S, θ S ), i.e., as a pair consisting of the indices of its non-zero entries and the corresponding non-zero values, then it is natural to introduce a prior for θ in a hierarchical way, with a prior for w n for S and a conditional prior Π n,S for θ S , given S. Write π n,S for the density function of Π n,S with respect to a non-data-dependent dominating measure ν n,S on Θ S . Technically, the conditional prior as a distribution for the infinite-dimensional θ such that the components with indices in S have density π n,S and the remaining components have point-mass distributions at 0; in other words, the conditional prior is a product measure made up of Π n,S and a product of point masses. To summarize, so far, the proposed empirical prior for θ is a mixture of the form Π n (dθ) = S:|S|≤Tn w n (S) Π n,S (dθ).
Next, similar to what we did in Section 2, let us define the sets which are effectively neighborhoods in Θ S centered aroundθ n,S . Then we have the following versions of the local and global prior conditions, suitable for the adaptive case.
Condition LP2. There exist constants A > 0 and C > 0 such that In certain examples, such as those in Sections 4.4-4.5, it can be shown that the integral in Condition GP2 above is bounded by e κ|S| for some constant κ. Then the condition is satisfied with K = 0 if the prior w n for S is such that the marginal prior for |S| has exponential tails (e.g., Arbel et al. 2013;Shen and Ghosal 2015). However, for other examples, such as density estimation (Section 4.6), a bit more care is required.
To achieve adaptive posterior concentration rates, we propose a slight modification to the previous approach, one that incorporates data into the prior in two ways: one for prior centering, like before, and another for suitable regularization. That is, for an α ∈ (0, 1) to be specified, if Π n is the empirical prior described above, then we consider a double empirical prior defined as Dividing by a portion of the likelihood has the effect of penalizing those parameter values that "track the data too closely" (Walker and Hjort 2001), hence regularization. Then the corresponding double empirical Bayes posterior has two equivalent forms: Of the two expressions, the former is more intuitive from an "empirical Bayes" perspective, while the latter is easier to work with in our theoretical analysis. The latter also resembles some recent uses of a power likelihood in, e.g., Grünwald and van Ommen (2016), Bissiri et al. (2016), Holmes and Walker (2017), Syring and Martin (2016), and others, for the purpose of robustness.
To identify an appropriate power α ∈ (0, 1), take p > 1 as in Condition GP2, and let q > 1 be the Hölder conjugate. Then we propose to take α such that αq = 1 2 , i.e., To summarize, in its most convenient form, the posterior distribution based on a doubly empirical prior Π n in (7) is Theorem 2. Let ε n = ε n (θ ⋆ ) be a target rate corresponding to the true θ ⋆ , and assume that Conditions S2, LP2, and GP2 hold for this ε n . Then there exists a constant M > 0 and α of the form (10) such that Π n in (11) Proof. See Section 3.

Proof of Theorem 1
The dependence of the prior on data requires some modification of the usual arguments. In particular, in Lemma 1, the lower bound on the denominator D n in (3) is obtained quite simply thanks to the data-dependent prior, formalizing the motivation for this empirical Bayes approach described in Section 1, while Lemma 2 applies Hölder's inequality to get an upper bound on the numerator N n (A M εn ).
Proof. The denominator D n can be trivially lower-bounded as follows: Now use the definition of L n to complete the proof.
Lemma 2. Assume Condition GP1 holds for ε n with constants (K, p), and let q > 1 be the Hölder conjugate of p. Then Start with the following simple bound: Dividing both sides by R n (θ n ) 1− 1 2q , and taking expectations (with respect to P θ ⋆ ), moving this expectation inside the integral, and applying Hölder's inequality, gives A standard argument (e.g., Walker and Hjort 2001) shows that the first expectation on the right hand side above equals 1 − H 2 (p n θ ⋆ , p n θ ) and, therefore, is upper bounded by e −M 2 nε 2 n , uniformly in θ ∈ A M εn . Under Condition GP1, the integral of the second expectation is bounded by e Knε 2 n . Combining these two bounds proves the claim.
Proof of Theorem 1. To start, set a n = e −cnε 2 where the constants (C, c, d) are as in Condition LP1, Remark 2, and Equation (4), respectively, and c 0 is another sufficiently small constant. Also, abbreviate N n = N n (A M εn ) and R n = R n (θ n ). Then Taking expectation and applying Lemma 2, we get The second and third terms are o(1) by Remark 2 and Condition LP1, respectively. If we take G > C + c 2q + d or, equivalently, M 2 > q(K + C + c 2q + d), then the first term is o(1) as well, completing the proof of the first claim.
For the second claim, when nε 2 n is bounded, the conclusion (12) still holds, and the latter two terms are still o(1). The first term in the upper bound is decreasing in G or, equivalently, in M, so the upper bound vanishes for any M n → ∞.

Proof of Theorem 2
Write the posterior probability Π n (A M εn ) as a ratio N n (A M εn )/D n , where The strategy of the proof here is similar to that of Theorem 1. In particular, the empirical nature of the prior makes getting the lower bound on D n very simple.
. Proof. Almost identical to the proof of Lemma 1.
Lemma 4. Assume Condition GP2 holds with constants (K, p), let q > 1 be the Hölder conjugate of p, and let α be determined by (10). Then where G = M 2 q − K. Proof. Taking expectation of N n (A M εn ), moving expectation inside the integral, and applying Hölder's inequality, we get The first expectation on the right hand side above is upper bounded by e −M 2 nε 2 n , uniformly in θ ∈ A M εn ∩ Θ S and in S, so Under Condition GP2, the summation on the right-hand side above is bounded by a constant times e Knε 2 n and the claim now follows immediately.
Proof of Theorem 2. Under the stated conditions, by Lemma 3, An argument similar to that in Remark 2 shows that R n (θ n,S ⋆ n ) ≥ e −cnε 2 n for some c > 1, with P θ ⋆ -probability converging to 1. Since |S ⋆ n | ≤ nε 2 n , this lower bound for the denominator can be combined with the upper bound in the numerator from Lemma 4 using an argument very similar to that in the proof of Theorem 1, to get So, for M sufficiently large, the upper bound vanishes, proving the claim.

Fixed finite-dimensional parameter estimation
Suppose that the parameter space Θ is a fixed subset of R d , for a fixed d < ∞. Under the usual regularity conditions, the log-likelihood ℓ n = log L n is twice continuously differentiable, its derivativel n satisfiesl n (θ n ) = 0 at the (unique) global MLEθ n , and the following expansion holds: whereΣ n = −l n (θ n ). Then the set L n can be expressed as For rate ε n = n −1/2 , this suggests an empirical prior of the form: for some fixed matrix Ψ in order to ensure S1. The proposition below states that this empirical prior yields a posterior that concentrates at the parametric rate ε n = n −1/2 .
Proposition 1. Assume that each component θ j in the d-dimensional parameter θ are on (−∞, ∞), and that the quadratic approximation (13) holds. Then Conditions LP1 and GP1 hold for the empirical prior (14) with ε n = n −1/2 . Therefore, the posterior concentrates at the rate ε n = n −1/2 relative to any metric on Θ.
Proof. Similar to the toy example; see the Appendix for details.

Density estimation via histograms
Consider estimation of a density function, p, supported on the compact interval [0, 1], based on i.i.d. samples X 1 , . . . , X n . A simple approach to develop a Bayesian model for this problem is a random histogram prior (e.g., Scricciolo 2007Scricciolo , 2015. That is, we consider a partition of the interval [0, 1] into S bins of equal length, i.e., [0, 1] = S s=1 E s , where E s = [ s−1 S , s S ), s = 1, . . . , S. For a given S, write the model consisting of mixtures of uniforms, i.e., piecewise constant densities, where the parameter θ is a vector in the S-dimensional probability simplex, ∆(S). That is, p θ is effectively a histogram with S bins, all of the same width, S −1 , and the height of the s th bar is S −1 θ s , s = 1, . . . , S. Here, assuming the regularity of the true density is known, we construct an empirical prior for the vector parameter θ such that, under conditions on the true density, the corresponding posterior on the space of densities has Hellinger concentration rate within a logarithmic factor of the minimax rate. More sophisticated models for density estimation will be presented in Sections 4.3 and 4.6. Let S = S n be the number of bins, specified below. This defines a sieve Θ n = ∆(S n ) and, under the proposed histogram model, the data can be treated as multinomial, so the (sieve) MLE isθ n = (θ n,1 , . . . ,θ n,S ), whereθ n,s is just the proportion of observations in the s th bin, s = 1, . . . , S. Here we propose a Dirichlet prior Π n for θ, namely, θ ∼ Π n = Dir S (α),α s = 1 + cθ n,s , s = 1, . . . , S, which is centered on the sieve MLE in the sense that the mode of the empirical prior density isθ n ; the factor c = c n will be specified below. Finally, this empirical prior for θ determines an empirical prior for the density via the mapping θ → p θ .
Proof. See the Appendix.

Mixture density estimation
Let X 1 , . . . , X n be i.i.d. samples from a density p θ of the form where k(x | µ) is a known kernel and the mixing distribution θ is unknown. Here we focus on the normal mixture case, where k(x | µ) = N(x | µ, σ 2 ), where σ is known, but see Remark 3. The full parameter space Θ, which contains the true mixing distribution θ ⋆ , is the set of all probability measures on the µ-space, but we consider here a finite mixture model of the form for an integer S, a vector ω = (ω 1 , . . . , ω S ) in the simplex ∆(S), and a set of distinct support points µ = (µ 1 , . . . , µ S ). For fixed S, letθ = (ω,μ) be the MLE for the mixture weights and locations, respectively, where the optimization is restricted so that |μ s | ≤ B, where B = B n is to be determined. We propose to "center" an empirical prior on the S-specific MLE as follows: • ω and µ are independent; • the vector ω is Dir S (α) like in Section 4.2, whereα s = 1 + cω s , s = 1, . . . , S; • the components (µ 1 , . . . , µ S ) of µ are independent, with where δ n is a sequence of positive constants to be determined.
This determines an empirical prior for the density function through the mapping (16).
Remark 3. The proof of Proposition 3 is not especially sensitive to the choice of kernel. More specifically, the local prior support condition, LP1, can be verified for kernels other than normal, the key condition being Equation (22) in the Appendix. For example, that condition can be verified for the Cauchy kernel where σ is a fixed scale parameter. Therefore, using the same empirical prior formulation as for the normal case, the same argument in the proof of Proposition 3 shows that the Cauchy mixture posterior achieves the target rate ε n = (log n)n −1/2 when the true density p ⋆ = p θ ⋆ is a finite Cauchy mixture. To our knowledge, this mixture of heavy-tailed kernels has yet to be considered in Bayesian nonparametrics literature (cf., Kruijer et al. 2010Kruijer et al. , p. 1229), but it fits quite easily into our general setup proposed here.

Estimation of a sparse normal mean vector
Consider inference on the mean vector θ = (θ 1 , . . . , θ n ) ⊤ of a normal distribution, N n (θ, I n ), based on a single sample X = (X 1 , . . . , X n ). That is, X i ∼ N(θ i , 1), for i = 1, . . . , n, independent. The mean vector is assumed to be sparse in the sense that most of the components, θ i , are zero, but the locations and values of the non-zero components are unknown. This problem was considered by Martin and Walker (2014) and they show that a version of the double empirical Bayes posterior contracts at the optimal minimax rate. Here we propose an arguably simpler empirical prior and demonstrate the same asymptotic optimality of the posterior based on the general results in Section 2.2.
Write the mean vector θ as a pair (S, θ S ), where S ⊆ {1, 2, . . . , n} identifies the nonzero entries of θ, and θ S is the |S|-vector of non-zero values. Assume that the true mean vector θ ⋆ has |S ⋆ n | = s ⋆ n such that s ⋆ n = o(n). The sieves Θ S are subsets of R n that constrain the components of the vectors corresponding to indices in S c to be zero; no constraint on the non-zero components is imposed. Note that we can trivially restrict to subsets S of cardinality no more than T n = n. Furthermore, Condition S2 is trivially satisfied because θ ⋆ belongs to the sieve S ⋆ n by definition, so we can take θ † = θ ⋆ . For this model, the Hellinger distance for joint densities satisfies where · is the usual ℓ 2 -norm on R n . In this sparse setting, as demonstrated by Donoho et al. (1992), the ℓ 2 -minimax rate of convergence is s ⋆ n log(n/s ⋆ n ); we set this rate equal to nε 2 n , so that ε 2 n = (s ⋆ n /n) log(n/s ⋆ n ). Therefore, if we can construct a prior such that Conditions LP2 and GP2 hold for this ε n , then it will follow from Theorem 2 that the corresponding empirical Bayes posterior concentrates at the optimal minimax rate.
Let the prior distribution w n for S be given by where g(s) is a non-decreasing slowly varying function as s → ∞, which includes the case where g(s) ≡ B for a sufficiently large constant B; see the proof of the proposition. For the conditional prior for θ S , given S, based on the intuition from the toy example, we let where the sieve MLE isθ n,S = X S = (X i : i ∈ S).
Proposition 4. Suppose the normal mean vector θ ⋆ is s ⋆ n -sparse in the sense that only s ⋆ n = o(n) of the entries in θ ⋆ are non-zero. For the empirical prior described above, if γ is sufficiently small, then there exists a constant M > 0 such that the corresponding posterior distribution Π n satisfies Proof. See the Appendix.

Regression function estimation
Consider a nonparametric regression model where z 1 , . . . , z n are i.i. d. N(0, 1), t 1 , . . . , t n are equi-spaced design points in [0, 1], i.e., t i = i/n, and f is an unknown function. Following Arbel et al. (2013), we consider a Fourier basis expansion for f = f θ , so that f (t) = ∞ j=1 θ j φ j (t), where θ = (θ 1 , θ 2 , . . .) and (φ 1 , φ 2 , . . .) are the basis coefficients and functions, respectively. They give conditions such that their Bayesian posterior distribution for f , induced by a prior on the basis coefficients θ, concentrates at the true f ⋆ at the minimax rate corresponding to the unknown smoothness of f ⋆ . Here we derive a similar result, with a better rate, for the posterior derived from an empirical prior.
Following the calculations in Section 4.4, the Hellinger distance between the joint distribution of (Y 1 , . . . , Y n ) for two different regression functions, f and g, satisfies where f 2 n = n −1 n i=1 f (t i ) 2 is the squared L 2 -norm corresponding to the empirical distribution of the covariate t. So, if the conditions of Theorem 2 are satisfied, then we get a posterior concentration rate result relative to the metric · n .
Suppose that the true regression function f ⋆ is in a Sobolev space of index β > 1 2 . That is, there is an infinite coefficient vector θ ⋆ such that f ⋆ = f θ ⋆ and ∞ j=1 θ ⋆2 j j 2β 1. This implies that the coefficients θ ⋆ j for large j are of relatively small magnitude and suggests a particular formulation of the model and empirical prior. As before, we rewrite the infinite vector θ as (S, θ S ), but this time S is just an integer in {1, 2, . . . , n}, and θ S = (θ 1 , . . . , θ S , 0, 0, . . .) is an infinite vector with only the first S terms non-zero. That is, we will restrict our prior to be supported on vectors whose tails vanish in this sense. For the prior w n for the integer S, we take w n (s) ∝ e −g(s)s , s = 1, . . . , n, where g(s), is a non-decreasing slowly varying function, which includes the case of g(s) ≡ B for B sufficiently large; see the proof of the proposition. Next, for the conditional prior for θ S , given S, note first that the sieve MLE is a least-squares estimator where Φ S is the n × |S| matrix determined by the basis functions at the observed covariates, i.e., Φ S = (φ j (t i )) ij , i = 1, . . . , n and j = 1, . . . , |S|. As in Martin et al. (2015), this suggests a conditional prior of the form where γ < 1 is sufficiently small. This empirical prior for θ ≡ (S, θ S ) induces a corresponding empirical prior for f through the mapping θ → f θ .
Proposition 5. Suppose that the true regression function f ⋆ is in a Sobolev space of index β > 1 2 . For the empirical prior described above, if γ is sufficiently small, then there exists a constant M > 0 such that the corresponding posterior distribution Π n satisfies Proof. See the Appendix.
Note that the rate obtained in Proposition 5 is exactly the optimal minimax rate, i.e., there are no additional logarithmic factors. This is mainly due to the covariance structure in the prior for θ S , given S, which is very natural in the present framework. A similar result, without the additional logarithmic terms, is given in Gao and Zhou (2016).

Nonparametric density estimation
Consider the problem of estimating a density p supported on the real line. Like in Section 4.3, we propose a normal mixture model and demonstrate the asymptotic concentration properties of the posterior based on an empirical prior, but with the added feature that the rate is adaptive to the unknown smoothness of the true density function. Specifically, as in Kruijer et al. (2010), we assume that data X 1 , . . . , X n are i.i.d. from a true density p ⋆ , where p ⋆ satisfies the conditions C1-C4 in their paper; in particular, we assume that log p ⋆ is Hölder with smoothness parameter β. They propose a fully Bayesian model-one that does not depend on the unknown β-and demonstrate that the posterior concentration rate, relative to the Hellinger distance, is ε n = (log n) t n −β/(2β+1) for suitable constant t > 0, which is within a logarithmic factor of the optimal minimax rate.
Here we extend the approach presented in Section 4.3 to achieve adaptation by incorporating a prior for the number of mixture components, S, as well as the S-specific kernel variance σ 2 S as opposed to fixing their values. For the prior w n for S, we let w n (S) ∝ e −D(log S) r S , S = 1, . . . , n, where r > 1 and D > 0 are specified constants. Given S, we consider a mixture model with S components of the form where θ S = (ω S , µ S , λ S ), ω S = (ω 1,S , . . . , ω S,S ) is a probability vector in ∆(S), µ S = (µ 1,S , . . . , µ S,S ) is a S-vector of mixture locations, and λ S is a precision (inverse variance) that is the same in all the kernels for a given S. We can fit this model to data using, say, the EM algorithm, and produce a given-S sieve MLE:ω S = (ω 1,S , . . . ,ω S,S ),μ S = (μ 1 , . . . ,μ S ), andλ S . Following our approach in Section 4.3, consider an empirical prior for ω S obtained by taking ω S | S ∼ Dir S (α S ) whereα s,S = 1 + cω s,S and c = c S is to be determined. The prior for µ S follows the same approach as in Section 4.3, i.e., µ S,s ∼ Unif(μ S,s − δ,μ S,s + δ), s = 1, . . . , S, independent, where δ = δ S is to be determined. The prior for λ S is also uniform, where ψ = ψ S is to be determined. Also, as withμ S being restricted to the interval (−B, +B), we restrict theλ S to lie in (B l , B u ), to be determined. Then we get a prior on the density function through the mapping (S, θ S ) → p S,θ S . For this choice of empirical prior, the following proposition shows that the corresponding posterior distribution concentrates around a suitable true density p ⋆ at the optimal minimax rate, up to a logarithmic factor, exactly as in Kruijer et al. (2010).

Conclusion
This paper considers the construction of an empirical or data-dependent prior such that, when combined with the likelihood via Bayes's formula, gives a posterior distribution with desired asymptotic concentration properties. The details vary a bit depending on whether the targeted rate is known to the user or not (Sections 2.1-2.2), but the basic idea is to first choose a suitable sieve and then center the prior for the sieve parameters on the sieve MLE. This makes establishing the necessary local prior support condition and lower-bounding the posterior denominator straightforward, which is a major obstacle in the standard Bayesian nonparametric setting. Having the data involved in the prior complicates the usual argument to upper-bound the posterior numerator, but compared to the usual global prior conditions involving entropy, here we only need to suitably control the spread of the empirical prior. The end result is a data-dependent measure that achieves the targeted concentration rate, adaptively, if necessary. The approach presented here is quite versatile, so there are many potential applications beyond those examples studied here. For example, high-dimensional generalized linear models, sparse precision matrix estimation, shape-restricted function estimation, time series, etc. A more general question to be considered in a follow-up work, one that has attracted a lot of attention in the Bayesian nonparametric community recently, concerns the coverage probability of credible regions derived from our empirical Bayes posterior distribution. Having suitable concentration rates is an important step in the right direction, but pinning down the constants will require some new insights.

A Details for the examples A.1 Proof of Proposition 1
For Condition LP1, under the proposed normal prior, we have Making a change of variable, z = n 1/2 Ψ 1/2 (θ −θ n ), the integral above can be rewritten as and, therefore, Π n (L n ) is lower-bounded by a constant not depending on n so Π n (L n ) is bounded away from zero; hence Condition LP1 holds with ε n = n −1/2 . For Condition GP1, we can basically proceed as outlined in the toy example above. So, writing the prior as θ ∼ N d (θ n , n −1 Ψ −1 ), and the asymptotic distribution of the MLE aŝ θ ∼ N d (θ ⋆ , n −1 Σ ⋆−1 ), where Σ ⋆ is the asymptotic covariance matrix, i.e., the Fisher information matrix evaluated at θ ⋆ , we have, π n (θ) p ∝ |pnΨ| −1/2 |nΨ| p/2 N d (θ |θ n , (pnΨ) −1 ).
As long as Ψ is non-singular, the right-hand side above is not dependent on n and is finite, which implies we can take ε n = n −1/2 . It follows from Theorem 1 that the Hellinger rate is ε n = n −1/2 and, since all metrics on the finite-dimensional Θ are equivalent, the same rate obtains for any other metric. We should highlight the result that the integral involved in checking Condition GP1 is at most exponential in the dimension of the parameter space: This result will be useful in the proof of some of the other propositions.
Then it follows from (18) that Π n (L n ) ≥ Γ(c + S) c n Γ(c + S + n) − e −dnε 2 n and, therefore, Condition LP1 is satisfied if Towards this, we have So, if c = nε −2 n as in the proposition statement, then the right-hand side above is upperbounded by e nε 2 n (1+S/n) . Since S ≤ n, (20) holds for, say, d ≥ 2, hence, Condition LP1.
Towards Condition GP1, note that the Dirichlet component for θ satisfies where the "≈" is by Stirling's formula, valid for all n s > 0 due to the value of c. This has a uniform upper bound: Then Condition GP1 holds if we can bound this by e Knε 2 n for a constant K > 0. Using Stirling's formula again, and the fact that c/S → ∞, we have We need S log(1 + c/S) ≤ nε 2 n . Since c/S ≪ n 2 , the logarithmic term is log n. But we assumed that S ≤ nε 2 n (log n) −1 , so the product is nε 2 n , proving Condition GP1. It remains to check Condition S1. A natural candidate for the pseudo-true parameter θ † in Condition S1 is one that sets θ s equal to the probability assigned by the true density p ⋆ to E s . Indeed, set It is known (e.g., Scricciolo 2015, p. 93) that, if p ⋆ is β-Hölder, with β ∈ (0, 1], then the sup-norm approximation error of p θ † is Since p ⋆ is uniformly bounded away from 0, it follows from Lemma 8 in Ghosal and van der Vaart (2007) that both K(p ⋆ , p θ † ) and V (p ⋆ , p θ † ) are upper-bounded by (a constant times) H 2 (p ⋆ , p θ † ) which, in turn, is upper-bounded by S −2β by the above display. Therefore, we need S = S n to satisfy S −β ≤ ε n , and this is achieved by choosing S = nε 2 n (log n) −1 as in the proposition. This establishes Condition S1, completing the proof.

A.3 Proof of Proposition 3
We start by verifying Condition LP1. Towards this, we first note that, for mixtures in the support of the prior, the likelihood function is which can be rewritten as L n (θ) = (n 1 ,...,n S ) ω n 1 1 · · · ω n S S (s 1 ,...,sn) S s=1 i:s i =s where the first sum is over all S-tuples of non-negative integers (n 1 , . . . , n S ) that sum to n, the second sum is over all n-tuples of integers 1, . . . , S with (n 1 , . . . , n S ) as the corresponding frequency table, and k(x | µ) = N(x | µ, σ 2 ) for known σ 2 . We also take the convention that, if n s = 0, then the product i:s i =s is identically 1. Next, since the prior has ω and µ independent, we only need to bound E(ω n 1 1 · · · ω n S S ) and E S s=1 i:s i =s for a generic (n 1 , . . . , n S ). The first expectation is with respect to the prior for ω and can be handled exactly like in the proof of Proposition 2. For the second expectation, which is with respect to the prior for the µ, since the prior has the components of µ independent, we have so we can work with a generic s. Writing out the product of kernels, we get By Jensen's inequality, i.e., E(e Z ) ≥ e E(Z) , the expectation on the right-hand side is lower bounded by e − ns 2σ 2 E(µs−X) 2 = e − ns 2σ 2 {vn+(μs−X) 2 } , where v n = δ 2 n /3 is the variance of µ s ∼ Unif(μ s − δ n ,μ s + δ n ). This implies Putting the two expectations back together, from (21) we have that where now the expectation is with respect to both priors. Recall that L n = {θ ∈ Θ n : L n (θ) > e −dnε 2 n L n (θ)} as in (4), and define L ′ n = {θ ∈ L n : L n (θ) ≤ L n (θ n )}. Since, L n ⊇ L ′ n and, for θ ∈ L ′ n , we have L n (θ)/L n (θ n ) ≤ 1, we can apply the reverse Markov inequality (19) again to get Then it follows from (23) that Π n (L n ) ≥ Γ(c + S) c n Γ(c + S + n) e − Snvn 2σ 2 − e −dnε 2 n and, therefore, Condition LP1 is satisfied if nv n 2σ 2 ≤ bnε 2 n and Γ(c + S + n) Γ(c + S)c n ≤ e anε 2 n , where a + b < d. The first condition is easy to arrange; it requires that v n ≤ 2bσ 2 ε 2 n ⇐⇒ δ n ≤ (6bσ 2 ) 1/2 ε n , which holds by assumption on δ n . The second condition holds with a = 2 by the argument presented in the proof of Proposition 2. Therefore, Condition LP1 holds.
Towards Condition GP1, putting together the bound on the Dirichlet density function in the proof of Proposition 2 and the following bound on the uniform densities, we have that, for any p > 1, Then Condition GP1 holds if we can make both terms in this product to be like e Knε 2 n for a constant K > 0. The first term in the product, coming from the Dirichlet part, is handled just like in the proof of Proposition 2 and, for the second factor, we have Since δ n ∝ ε n and B n ∝ log 1/2 (ε −1 n ), we have B n /δ n ∝ n 1/2 , so the exponent above is S log n nε 2 n . This takes care of the second factor, proving Condition GP1. Finally, we refer to Section 4 in Ghosal and van der Vaart (2001) where they show that there exists a finite mixture, characterized by θ † , with S components and locations in [−B n , B n ], such that max{K(p θ ⋆ , p θ † ), V (p θ ⋆ , p θ † )} ≤ ε 2 . This θ † satisfies our Condition S1, so the proposition follows from Theorem 1.
In the context of Remark 3, when the normal kernel is replaced by a Cauchy kernel, we need to verify (22) in order to meet LP1. To this end, let us start with E exp − log where the expectation is with respect to the prior for the µ s and the σ is assumed known. This expectation is easily seen to be lower-bounded by The right-hand term term can be written as s i =s 1 1 + (X i −μ s ) 2 /σ 2 1 s i =s 1 + vn/σ 2 1+(X i −μs) 2 /σ 2 and the second term here is lower-bounded by exp(−n s v n /σ 2 ). Therefore, Condition LP1 holds with the same ε n as in the normal case.
Condition GP1 in this case does not depend on the form of the kernel, whether it be normal or Cauchy. And S1 is satisfied if we assume the true density p ⋆ = p θ ⋆ is a finite mixture of densities, for example, the Cauchy. This proves the claim in Remark 3, namely, that the empirical Bayes posterior, based on a Cauchy kernel, concentrates at the rate ε n = (log n)n −1/2 when the true density is a finite Cauchy mixture.

A.4 Proof of Proposition 4
The proportionality constant depends on n (and g) but it is bounded away from zero and infinity as n → ∞ so can be ignored in our analysis. Here we can check the second part of Condition LP2. Indeed, for the true model S ⋆ n of size s ⋆ n , using the inequality n s ≤ (en/s) s , we have and, since nε 2 n = s ⋆ n log(n/s ⋆ n ), the second condition in Condition LP2 holds for all large n with A > 1. Next, for Condition GP2, note that the prior w n given above corresponds to a hierarchical prior for S that starts with a truncated geometric prior for |S| and then a uniform prior for S, given |S|. Then it follows directly that Condition GP2 on the marginal prior for |S| is satisfied.
Therefore, L n,S = {θ ∈ Θ S : 1 2 θ −θ n,S 2 < |S|}. This is just a ball in R |S| so we can bound the Gaussian measure assigned to it. Indeed, Stirling's formula gives an approximation of the lower bound: e −γ|S| γ |S|/2 2 |S|/2 e |S|/2 |S|/2 2π 1/2 . For moderate to large |S|, the above display is exp 1 − 2γ + log γ + log 2 |S| 2 and, therefore, plugging in S ⋆ n for the generic S above, we see that Condition LP2 holds if 1 − 2γ + log γ + log 2 < 0. For Condition GP2, the calculation is similar to that in the finite-dimensional case handled in Proposition 1. Indeed, the last part of the proof showed that, for a d-dimensional normal mean model with covariance matrix Σ −1 and a normal empirical prior of with meanθ n and covariance matrix proportional to Σ −1 , then the integral specified in the second part of Condition GP2 is exponential in the dimension d. In the present case, we have that Θ S E θ ⋆ {π n,S (θ) p } 1 p dθ = e κ|S| for some κ > 0 and then, clearly, Condition GP2 holds with K = κ. If we take B in the prior w n for S to be larger than this K, then the conditions of Theorem 2 are met with ε 2 n = (s ⋆ n /n) log(n/s ⋆ n ).

A.5 Proof of Proposition 5
By the choice of marginal prior for S and the normal form of the conditional prior for θ S , given S, Conditions LP2 and GP2 follow immediately or almost exactly like in Section 4.4. Indeed, the second part of Condition GP2 holds with K the same as was derived in Section 4.4. Therefore, we have only to check Condition S2. Let p θ denote the density corresponding to regression function f = f θ . If θ ⋆ is the coefficient vector in the basis expansion of f ⋆ , then it is easy to check that If f ⋆ is smooth in the sense that it belongs to a Sobolev space indexed by β > 1 2 , i.e., the basis coefficient vector θ ⋆ satisfies ∞ j=1 θ ⋆2 j j 2β 1, then it follows that K(p n θ ⋆ , p n θ ⋆ S ) n|S| −2β . So, if we take ε n = n −β/(2β+1) and |S ⋆ n | = ⌊nε 2 n ⌋ = ⌊n 1/(2β+1) ⌋, then a candidate θ † in Condition S2 is θ † = θ ⋆ S . That the desired bound on the Kullback-Leibler second moment V also holds for this θ † follows similarly, as in Arbel et al. (2013, p. 558). This establishes Condition S2 so the conclusion of the proposition follows from Theorem 2.

A.6 Proof of Proposition 6
Write ε n = (log n) t n −β/(2β+1) for a constant t > 0 to be determined. For Condition S2, we appeal to Lemma 4 in Kruijer et al. (2010) which states that there exists a finite normal mixture, p † , having S ⋆ n components, with S ⋆ n n 1/(2β+1) (log n) k−t = nε 2 n (log n) k−3t , such that max K(p ⋆ , p † ), V (p ⋆ , p † ) ≤ ε 2 n , where k = 2/τ 2 and τ 2 is related to the tails of p ⋆ in their Condition C3. So, if t is sufficiently large, then our Condition S2 holds.
For Condition GP2, we first note that, by a straightforward modification of the argument given in the proof of Proposition 3, we have ∆(S)×R S ×R + E p ⋆ {π n,S (θ) p } 1/p dθ ≤ e bS log n 1 + B δ S B u (1 + ψ) − B l (1 − ψ) 2ψB l , for some b > 0. The logarithmic term appears in the first product because, as in the proof of Proposition 3, the exponent can be bounded by a constant times S log(1 + c/S) S log n since c/S = n 2 /S 2 < n 2 . To get the upper bound in the above display to be exponential in S, we can take δ B n b and ψ B u − B l B l 1 e bS log n − (B l + B u )/(2B l ) .
With these choices, it follows that the right-hand side in the previous display is upper bounded by e 3b log n , independent of S. Therefore, trivially, the summation in (8) is also upper bounded by e 3b log n . Since log n ≤ nε 2 n , we have that Condition GP2 holds. Condition LP2 has two parts to it. For the first part, which concerns the prior concentration on L n , we can follow the argument in the proof of Proposition 3. In particular, with the additional prior on λ, the corresponding version of (23) is EL n (θ S ) ≥ Γ(c + S) c n Γ(c + S + n) e − 1 6 nδ 2λ e −nzψ L n (θ S ) for some z ∈ (0, 1). This is based on the result that if λ ∼ Unif(λ(1 − ψ),λ(1 + ψ)) then Eλ =λ and E log λ > logλ − zψ for some z ∈ (0, 1). With c = n 2 S −1 as proposed, the argument in the proof of Proposition 2 shows that the first term on the right-hand side of the above display is lower-bounded by e −CS for some C > 0. To make other other terms lower-bounded by something of the order e −C ′ S , we need δ and ψ to satisfy δ 2 1 B 2 u S n and ψ S n .
Given these constraints and those coming from checking Condition GP2 above, we require From Lemma 4 in Kruijer et al. (2010), we can deduce that the absolute value of the locations for p † are smaller than a constant times log ε −β n . Hence, we can take B = (log n) 2 . Also, we need B l ε β n which is met by taking B l = n −1 . To meet our constraints, we can take B u = n b−2 , so we need b ≥ 2. These conditions on (B, B l , B u , δ, ψ) are met by the choices stated in the proposition. For the second part of Condition LP2, which concerns the concentration of w n around S ⋆ n , we have w n (S ⋆ n ) ≥ e −D(log S ⋆ n ) r S ⋆ n e −Dnε 2 n (log n) k+r−3t .
So, just like in Kruijer et al. (2010), as long as 3t > k + r, we get w n (S ⋆ n ) ≥ e −Dnε 2 n as required in Condition LP2.