Empirical Bayes scaling of Gaussian priors in the white noise model

Abstract: The performance of nonparametric estimators is heavily dependent on a bandwidth parameter. In nonparametric Bayesian methods this parameter can be specified as a hyperparameter of the nonparametric prior. The value of this hyperparameter may be made dependent on the data. The empirical Bayes method is to set its value by maximizing the marginal likelihood of the data in the Bayesian framework. In this paper we analyze a particular version of this method, common in practice, in which the hyperparameter scales the prior variance. We characterize the behavior of the random hyperparameter, and show that a nonparametric Bayes method using it gives optimal recovery over a scale of regularity classes. This scale is limited, however, by the regularity of the unscaled prior. While a prior can be scaled up to make it appropriate for arbitrarily rough truths, scaling cannot increase the nominal smoothness by much. Surprisingly, the standard empirical Bayes method is even more limited in this respect than an oracle, deterministic scaling method. The same can be said for the hierarchical Bayes method.


Introduction
Recent years have seen increasing use of Bayesian methods in high-dimensional or nonparametric statistical problems. It is known from both theory (e.g. [13,15,5]) and practice that the (asymptotic) performance of such methods is sensitive to the fine properties of the prior that is employed. This dependence can be alleviated by adapting the prior to the data through one or more tuning parameters, so-called hyperparameters. In the case of function estimation such parameters can for instance describe the degree of regularity of a prior, a length scale, or a bandwidth.
Two tuning methods are widely used. The first is to endow the hyperparameters with a prior distribution, and leads to fully Bayesian procedures, referred to as hierarchical Bayes. The frequentist behavior of such methods has been studied in e.g. [2,14,20,23,27], where it was found that, if the priors are well chosen, they can yield adaptive, rate-optimal recovery for a range of nonparametric statistical problems. A second possible approach is to estimate the hyperparameters from the data, e.g. by using a likelihood-based method. This approach is not fully Bayesian, and commonly called empirical Bayes, but is often computationally convenient and therefore commonly used in practice.
The theoretical performance of empirical Bayes methods in nonparametric problems has been studied in only a limited number of special cases, see for instance [1,17]. Because a general understanding of such methods appears difficult at this time, in this paper we focus on the important case that the hyperparameter is a scale parameter of a Gaussian prior. This situation was first considered in work on spline smoothing (see [29]), where the posterior mean based on a (multiply integrated, scaled and released) Brownian motion prior for an unknown function is a penalized least squares estimator, and choosing the scale parameter of the prior is equivalent to choosing the smoothing parameter (that multiplies the penalty).
We consider scaled Gaussian priors in the particular case of the Gaussian white noise model, which allows tractable formulas. In view of the close relation between this model and many other nonparametric models, it is expected that our findings generalize. However, since we deliberately consider a particular method, this does not follow from general results on equivalence of experiments, and thus will require further investigation.
The term empirical Bayes is used in various ways (see [22,10,30,16] for the original and alternative uses). In our situation it means determining a suitable value of a (scaling) parameter of a prior from the data, which could still refer to different methods. Specifically, we study the maximum likelihood estimator (MLE) for the scale parameter based on the marginal Bayesian likelihood (see (2.5) below). This is a natural method, which attempts to take the best of both worlds. The method is also of interest because of its close relation to the "full" (hierarchical Bayes) method. These two methods differ only in that empirical Bayes takes the MLE for the (univariate) parameter of the marginal Bayesian likelihood, whereas hierarchical Bayes equips this (univariate) parameter with a prior. Within our framework the two methods perform equivalently, as we show in Section 2.3.
We investigate the behavior of the empirical Bayes method in a frequentist set-up: the method is (empirical) Bayesian, but it is evaluated under the assumption that the data are generated under a given "true" parameter. In this situation minimax optimal rates can be used as a benchmark for performance. However, it is not our primary aim to construct minimax estimators, or even to exhibit priors that lead to minimax posterior means. Rather the particular (scaled) priors and specific likelihood-based empirical Bayes method for choosing the scaling parameter are the starting points. We aim at establishing their performance, as they are natural and widely applied choices. For the aim of minimax estimation there are various other methods (see e.g. [4,21,9]).
The results of this paper are a step towards a more general understanding of empirical Bayes methods. They concern the behaviour of the empirical Bayes scaling parameter and the contraction of the resulting plug-in posterior distribution. We study contraction of the full posterior distribution rather than a summary measure, such as the posterior mean. The full posterior is important for the use of the Bayesian method for uncertainty quantification, for instance through credible sets: sets of prescribed posterior probability. We hope to report on this involved issue in a future paper; understanding the behaviour of the empirical Bayes scaling parameter will be essential in that investigation.
In an earlier paper [19] we considered the performance of posterior distributions based on the same priors, but with deterministic scaling. It turned out that for a given base prior and a given true regularity level there is an optimal scaling rate. It is natural to compare the empirical Bayes method, which gives a data-dependent rate, to the performance with this optimal rate, which would be available to an oracle. Here we found the following somewhat surprising result. While it is known that the oracle procedure fails to be minimax if the regularity of the true parameter is higher than a level dependent on the unscaled prior (see [25] and the next section), it turns out that the empirical Bayes method fails to follow the oracle already if the regularity of the true parameter exceeds an even lower threshold. This finding may motivate the investigation of different empirical Bayes schemes. On the positive side, our results show that empirical Bayes works adequately if the base prior does not undersmooth the true parameter, or does so only by a little.
In the next section we give a precise description of the problem, and state our main findings. In Section 3 we illustrate the results with some simulations and pictures. Sections 4 and 5 contain the proofs.
We write a ≲ b if a ≤ Cb for a constant C that is universal or fixed in the context, and a_n ≍ b_n if a_n/b_n → 1.

Setup
To be able to derive concrete results we consider a relatively tractable nonparametric model: the Gaussian sequence model or, equivalently, the signal-in-white-noise model, sometimes also called the normal means model. This model often serves as a platform to investigate the behavior of statistical procedures; see for instance [7,24,2,5,12,6] for studies on various aspects of non- and over-smoothing procedures in this setting.
We assume we observe a sequence X = (X_1, X_2, . . .) satisfying

X_i = θ_{0,i} + (1/√n) Z_i,    i = 1, 2, . . . ,    (2.1)

for θ_0 = (θ_{0,1}, θ_{0,2}, . . .) an unknown element of ℓ_2 = {θ ∈ ℝ^∞ : ‖θ‖² = Σ_k θ_k² < ∞} and Z_1, Z_2, . . . independent, standard normal random variables. We denote the "true" distribution of X by P_0 and the corresponding expectations by E_0. All results refer to this distribution, although in the next paragraphs we adopt a Bayesian point of view, in which the parameter is random, to motivate the posterior distribution and the empirical Bayes likelihood.
This model is equivalent to the signal-in-white-noise model, in which we observe the process (Y_t : 0 ≤ t ≤ 1) given by

Y_t = ∫_0^t f_0(s) ds + (1/√n) W_t,

with f_0 ∈ L_2[0, 1] an unknown function and W a standard Brownian motion. Indeed, if (e_i) is an orthonormal basis of L_2[0, 1], then the variables X_i = ∫_0^1 e_i(s) dY_s satisfy (2.1), with θ_{0,i} = ⟨f_0, e_i⟩ the Fourier coefficients of f_0 relative to the basis (e_i). In Section 3 we illustrate our findings by simulated data in this setting.
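This correspondence is easy to check numerically. The sketch below (the sine basis, the coefficient vector and all parameter values are illustrative choices, not taken from the paper) discretizes a path of Y and verifies that the stochastic integrals X_i = ∫ e_i dY recover the Fourier coefficients up to noise of the order 1/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500            # noise level 1/n in the white noise model
T = 20000          # grid points for the discretized path (Y_t)
t = np.linspace(0.0, 1.0, T + 1)
dt = 1.0 / T

# illustrative test signal with known coefficients w.r.t. e_k(s) = sqrt(2) sin(k pi s)
theta0 = np.array([1.0, -0.5, 0.25, 0.0, 0.1])
def e(k, s):
    return np.sqrt(2.0) * np.sin(k * np.pi * s)

f0 = sum(th * e(k + 1, t[:-1]) for k, th in enumerate(theta0))

# increments dY_t = f0(t) dt + (1/sqrt(n)) dW_t
dY = f0 * dt + rng.standard_normal(T) * np.sqrt(dt / n)

# X_i = int_0^1 e_i(s) dY_s, which equals theta_{0,i} plus noise of sd 1/sqrt(n)
X = np.array([np.sum(e(i + 1, t[:-1]) * dY) for i in range(len(theta0))])
print(np.max(np.abs(X - theta0)))   # of the order 1/sqrt(n) ≈ 0.045
```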
The variance of the errors in (2.1) is taken equal to the known value 1/n. It is clear from the signal-in-white-noise representation that a possible parameter σ², changing the variance to σ²/n, would be 'estimable' without error from the data, e.g. by n times the quadratic variation [Y]_1 of the signal (Y_t : 0 ≤ t ≤ 1). Thus it is no loss of generality not to introduce such an additional parameter in the model; taking it equal to unity simplifies the notation. A different situation would arise if the signal were observed only on a discrete time set. We expect that similar phenomena will occur in this different model, but verifying this will require significant additional technical work. Including an additional variance parameter would be natural in that setting.
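The claim that σ² would be estimable essentially without error can be illustrated on a discretized path (a sketch with arbitrary illustrative values for n, σ and the drift): n times the realized quadratic variation recovers σ² up to discretization error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, T = 100, 2.0, 200_000      # illustrative values; sigma is the unknown scale
dt = 1.0 / T
s = np.linspace(0.0, 1.0, T, endpoint=False)

# increments of Y_t = int_0^t f0 du + (sigma/sqrt(n)) W_t; the smooth drift f0
# contributes only O(dt) to the quadratic variation
f0 = np.sin(2 * np.pi * s)
dY = f0 * dt + (sigma / np.sqrt(n)) * np.sqrt(dt) * rng.standard_normal(T)

sigma2_hat = n * np.sum(dY ** 2)     # n times the realized quadratic variation [Y]_1
print(sigma2_hat)                    # close to sigma**2 = 4
```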
We assume that the parameter θ_0 belongs to a hyperrectangle in ℓ_2, i.e.

θ_{0,k}² ≤ C² k^{−1−2β}    for every k ∈ ℕ,    (2.2)

for (unknown) constants C, β > 0. In the case that the θ_{0,k}'s are the Fourier coefficients of some unknown function, this roughly means assuming that the function has "regularity" of the order β. It is known that the minimax rate of estimation relative to the ℓ_2-norm over hyperrectangles of this form is of the order n^{−β/(1+2β)} (see [8]).

In the Bayesian set-up the model (2.1) is viewed as giving the conditional distribution of X given the parameter θ_0, and inference on the unknown parameter θ_0 begins by postulating a prior distribution for θ_0. We consider the family of priors

Π_τ = ⊗_{k=1}^∞ N(0, τ² k^{−1−2α})    (2.3)

on ℝ^∞, where α > 0 is a fixed parameter and τ > 0 is a scaling parameter that will be set by an empirical Bayes approach. In other words, under the prior Π_τ the coordinates θ_{0,k} of θ_0 are independent, centered Gaussian variables with variances τ² k^{−1−2α}. The parameter α determines the speed at which the variances tend to zero. It can be interpreted as the baseline "regularity" of the unscaled prior. Indeed, for fixed τ > 0 and any s < α, the prior Π_τ gives full mass to the Sobolev space H^s = {θ ∈ ℓ_2 : Σ_k θ_k² k^{2s} < ∞}. In this paper we stick to this prior. The fact that the prior gives mass zero to the Sobolev space of order α motivated [31] (also see [3]) to consider various modifications, such as block-dependent priors. Another alternative would be to mix priors of the form (2.3) over the value of α. Estimating α by empirical or hierarchical Bayes (with τ = 1 fixed) is considered in [18].
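To see how α governs the smoothness of prior draws, one can sample from Π_τ and inspect truncated Sobolev norms (an illustrative sketch; the truncation level K and all parameter values are arbitrary): the H^s-norm stabilizes for s < α and grows with the truncation level for s > α.

```python
import numpy as np

rng = np.random.default_rng(2)
tau, alpha, K = 1.0, 1.0, 5000       # illustrative values; K truncates the series
k = np.arange(1, K + 1)
theta = tau * k ** (-(1 + 2 * alpha) / 2) * rng.standard_normal(K)  # draw from Pi_tau

def sobolev_norm_sq(theta, s):
    """Truncated Sobolev norm sum_k theta_k^2 k^{2s}."""
    kk = np.arange(1, len(theta) + 1)
    return np.sum(theta ** 2 * kk ** (2 * s))

# E sum_k theta_k^2 k^{2s} = tau^2 sum_k k^{2s-1-2alpha}: finite iff s < alpha
print(sobolev_norm_sq(theta, 0.5))   # s < alpha: stays bounded as K grows
print(sobolev_norm_sq(theta, 1.5))   # s > alpha: grows without bound with K
```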
Under the (conditional) model (2.1) and the prior (2.3) the coordinates (θ_{0,k}, X_k) of the vector (θ_0, X) are independent, and hence the conditional distribution of θ_0 given X factorizes over the coordinates as well. Thus the computation of the posterior distribution reduces to countably many posterior computations in conjugate normal models. It is straightforward to verify that the posterior distribution Π_τ(· | X) is given by

Π_τ(· | X) = ⊗_{k=1}^∞ N( (nτ²)/(k^{1+2α} + nτ²) X_k , (τ² k^{−1−2α})/(1 + nτ² k^{−1−2α}) ).    (2.4)

In the empirical Bayes approach we subsequently replace the hyperparameter τ by a data-driven choice τ̂_n. In the Bayesian setting described by the conditional distributions θ | τ ∼ Π_τ and X | (θ, τ) ∼ ⊗_k N(θ_k, 1/n), it holds that

X | τ ∼ ⊗_{k=1}^∞ N(0, τ² k^{−1−2α} + 1/n).

The corresponding log-likelihood for τ (relative to an infinite product of N(0, 1/n)-distributions) is given by

ℓ_n(τ) = −(1/2) Σ_{k=1}^∞ ( log(1 + nτ² k^{−1−2α}) − (n²τ² k^{−1−2α})/(1 + nτ² k^{−1−2α}) X_k² ).    (2.5)

We shall prove that with P_0-probability going to one, ℓ_n attains a global maximum on (0, ∞), and denote the point where this is attained by τ̂_n. (If the point of global maximum is not unique, any global maximum can be chosen.) Outside the event on which ℓ_n has a global maximum, τ̂_n can be set to an arbitrary value.
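Since X_k | τ is a centered normal with variance τ²k^{−1−2α} + 1/n, independently across k, the marginal log-likelihood for τ can be evaluated directly. The following sketch (illustrative parameter values; the series is truncated at an ad hoc level K, and a crude log-spaced grid replaces a formal maximizer) computes a data-driven scale and the corresponding plug-in posterior mean:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, alpha, beta, C = 1000, 2000, 1.0, 1.0, 1.0   # illustrative values
k = np.arange(1, K + 1)
theta0 = C * k ** (-0.5 - beta)      # a truth on the "boundary" of (2.2)
X = theta0 + rng.standard_normal(K) / np.sqrt(n)

def log_marginal(tau):
    """Log-likelihood of tau relative to an i.i.d. N(0, 1/n) product (truncated at K)."""
    r = n * tau ** 2 * k ** (-1.0 - 2 * alpha)     # prior-to-noise variance ratios
    return -0.5 * np.sum(np.log1p(r) - n * X ** 2 * r / (1 + r))

# maximize over a log-spaced grid instead of a formal optimizer
taus = np.exp(np.linspace(np.log(1e-3), np.log(1e3), 600))
tau_hat = taus[np.argmax([log_marginal(t) for t in taus])]

# plug-in posterior mean at the empirical Bayes scale
r = n * tau_hat ** 2 * k ** (-1.0 - 2 * alpha)
theta_hat = r / (1 + r) * X

print(tau_hat)
print(np.linalg.norm(theta_hat - theta0), np.linalg.norm(X - theta0))
```

The posterior mean at the selected scale shrinks the high-frequency coordinates and typically improves substantially on the raw observations.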
The empirical Bayes posterior is now defined as the random measure Π_{τ̂_n}(· | X) obtained by substituting τ̂_n for τ in the posterior distribution (2.4), i.e. Π_{τ̂_n}(B | X) = Π_τ(B | X)|_{τ=τ̂_n} for measurable subsets B ⊂ ℓ_2. The results presented in the next subsection concern the rate at which the empirical Bayes posterior contracts to the true parameter θ_0. Furthermore, we characterize the behavior of τ̂_n itself.
If the true parameter satisfies, in addition to (2.2), also the reverse inequality θ_{0,k}² ≥ c² k^{−1−2β} (with a constant c ≤ C), then it turns out that τ̂_n has a precise behavior, and the performance of the posterior can be established by uniformity arguments. The more difficult case is to consider τ̂_n and Π_{τ̂_n}(· | X) under general θ_0 in the rectangle described by (2.2).

Main results
If the prior is not rescaled, i.e. we use the prior Π τ for some fixed value of τ , then the posterior (2.4) contracts to θ 0 at the optimal rate n −β/(1+2β) if and only if α = β (cf. [26,5,19,11]). That is, the Bayesian procedure performs optimally if and only if the "regularities" of the prior and the unknown parameter match.
This relationship changes if the parameter τ = τ_n is chosen to tend to zero or infinity with n. Two situations arise: if the prior does not under-smooth the unknown parameter too much, then the optimal rate can still be attained, whereas in the other case the posterior gives suboptimal recovery no matter the scaling ([19,25]). More precisely: (i) If β ≤ 1 + 2α and τ_n = n^{(α−β)/(1+2β)}, then the posterior probability Π_{τ_n}(θ : ‖θ − θ_0‖ ≥ M_n n^{−β/(1+2β)} | X) tends to 0 for every θ_0 satisfying (2.2) and every M_n → ∞. (ii) If β > 1 + 2α, then for every choice of scaling rate τ_n this posterior probability tends to 1 for some θ_0 satisfying (2.2).
The optimal rescaling rate τ n = n (α−β)/(1+2β) in case (i) depends on the unknown parameter β that measures the smoothness of the true parameter. We therefore call it the oracle rescaling rate. Our aim is to compare the performance of the empirical Bayes procedure to that of the oracle procedure.
Remarkably, the performance of the empirical Bayes procedure cuts the range β ≤ 1 + 2α, where optimal deterministic scaling is possible, into two subregimes. If β < 1/2 + α, then the empirical Bayes posterior matches the oracle procedure and contracts at the optimal rate n^{−β/(1+2β)} to θ_0. On the other hand, if 1/2 + α ≤ β < 1 + 2α, then the empirical Bayes procedure performs strictly worse than the oracle. The message is that smooth priors perform well from the perspective of contraction rates; if empirical Bayes scaling is used, then a good prior should under-smooth the truth by at most 1/2 level of regularity.
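The resulting rate exponents can be tabulated directly (a sketch ignoring constants and possible logarithmic factors; the helper functions are not from the paper, only a restatement of the exponents above):

```python
def optimal_exponent(beta):
    # minimax rate over the hyperrectangle (2.2) is n^{-beta/(1+2beta)}
    return beta / (1 + 2 * beta)

def eb_exponent(alpha, beta):
    # empirical Bayes contraction exponent: optimal below the cutoff
    # beta = alpha + 1/2, frozen at the cutoff value above it
    return optimal_exponent(min(beta, alpha + 0.5))

alpha = 1.0
for beta in (0.5, 1.0, 1.4, 1.5, 2.0, 3.0):
    print(beta, optimal_exponent(beta), eb_exponent(alpha, beta))
```

For β below the cutoff the two exponents coincide; beyond it the empirical Bayes exponent stays at the cutoff value (1/2 + α)/(2 + 2α) while the minimax exponent keeps improving.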
Besides the empirical Bayes posterior, we study the empirical Bayes rescaling rate τ̂_n itself. In our first theorem we give upper and lower bounds for its magnitude. For given nonzero θ_0 consider the function h_n : (0, ∞) → (0, ∞) given in (2.6). For fixed n the function h_n is positive on (0, ∞) and tends to zero as τ → ∞, by dominated convergence, for any nonzero θ_0 ∈ ℓ_2. Therefore, for positive constants l < L we can define bounds τ_n ≤ τ̄_n through (2.7) and (2.8). In the next theorem we show that τ̂_n belongs with probability tending to one to the interval [τ_n, τ̄_n], provided l is chosen sufficiently small and L sufficiently big.
The function h_n and the bounds τ_n ≤ τ̄_n depend on the unknown true parameter θ_0. For typical θ_0 the upper and lower bounds have the same order of magnitude. In particular, this is true for θ_0 satisfying the exact asymptotic behavior θ_{0,k}² ≍ C² k^{−1−2β}, in which case the bounds take the form (2.9), for constants d depending on α, β, C and l (see Section 4.4). The cut-off at β = α + 1/2 is clearly visible in this bound. The exact asymptotic behavior θ_{0,k}² ≍ C² k^{−1−2β} may be considered a worst case for θ_0 belonging to the hyperrectangle (2.2). For general θ_0 that are not "on the boundary" of any rectangle (for any β), the behaviour of τ̂_n may be complicated, but we shall see in (the proof of) Theorem 2.2 that the lower and upper bounds τ_n and τ̄_n are sufficiently sharp to analyze the behaviour of the empirical Bayes posterior distribution of θ.
The worst-case upper bound τ̄_n in (2.9) has the same order as the optimal rescaling rate n^{(α−β)/(1+2β)} if β < 1/2 + α, but not if β ≥ 1/2 + α. The theorem shows that the empirical Bayes procedure selects the common order whenever the lower and upper bounds have the same order, in particular when θ_{0,k}² ≍ C² k^{−1−2β}. Hence in the latter case the empirical Bayes procedure selects the proper oracle scaling rate if β < α + 1/2, but not in the other case; that is, it does so only if the baseline "regularity" α of the prior is sufficiently large compared to the regularity β of the truth θ_0. This suggests that the empirical Bayes posterior will match the oracle only in the case β < α + 1/2, and performs sub-optimally if β ≥ α + 1/2. The following theorem, which is the main result of this paper, states that this is true under the general assumption (2.2).
The first assertion of the theorem shows that the empirical Bayes procedure attains the optimal rate if β < 1/2 + α, but a slower rate in the other cases. The rate n^{−(1/2+α)/(2+2α)} in the case that β > 1/2 + α is the optimal rate for the value β = 1/2 + α at the cut point. If (2.2) holds for some β > 1/2 + α, then it also holds for β = 1/2 + α. An interpretation is therefore that the empirical Bayes procedure with a prior of regularity α is incapable of exploiting regularity (2.2) in the true function θ_0 beyond the level 1/2 + α. The second and third assertions of the theorem show that the rates are sharp. The final assertion shows in a very strong sense that the deterioration of the rate in the third case is caused completely by the prior.
The good news is that the empirical Bayes procedure repairs any amount of prior over-smoothing, at least as far as contraction rates are concerned.

Hierarchical Bayes
Instead of substituting the random value τ̂_n into the posterior distribution for θ, the hierarchical Bayes approach models τ with a prior distribution λ, and next performs a full Bayes analysis with the mixture prior

∫_0^∞ Π_τ dλ(τ).

Here Π_τ is the prior on θ with scale τ, as given in (2.3). Besides a posterior distribution on θ, this also yields a posterior distribution for τ, which can be written in the form

λ(B | X) = ( ∫_B e^{ℓ_n(τ)} dλ(τ) ) / ( ∫_0^∞ e^{ℓ_n(τ)} dλ(τ) ),

for ℓ_n the marginal log-likelihood of X given τ, given in (2.5). By definition the empirical Bayes value τ̂_n is the point of maximum of the integrand in the integrals on the right. Thus the two methods are closely related. The link is made formal in the following theorem, which implies that the hierarchical Bayes method copies both the good and the bad behaviour (suboptimality if β ≥ α + 1/2) of the empirical Bayes method. We restrict to the inverse Gamma distribution as a prior for τ². Inspection of the proof shows that the theorem goes through for many other priors λ. Define τ_n and τ̄_n as before by (2.7) and (2.8), where the constant L in (2.8) is chosen sufficiently large. As a consequence the posterior distribution of θ relative to the prior ∫_0^∞ Π_τ dλ(τ) has the same properties as Π_{τ̂_n}(· | X) given in Theorem 2.2.

Some Simulation Results
To illustrate the main results we simulated data from the signal-in-white-noise model for n = 200 and a true function f_0 whose Fourier coefficients are given by θ_{0,k} = k^{−2.25} sin(10k), corresponding to a true regularity level as in (2.2) given by β = 1.75. Figure 1 shows the function f_0, its primitive, and the noisy observation Y. We put the Gaussian prior (2.3) on (the Fourier coefficients of) f_0, with prior regularity level α = 1.75, and determined an appropriate scaling parameter τ̂_n by the empirical Bayes method. The left panel of Figure 2 shows the true signal f_0 (black) and the posterior mean (red). The right panel shows the empirical log-likelihood for τ.
The empirical Bayes reconstruction is satisfactory. To illustrate that the scale parameter τ of the prior really matters, we also computed the posterior means with scaling parameter 20 times larger and 20 times smaller than the empirical Bayes value. This leads to under-smoothing (blue) and over-smoothing (green), respectively, as shown in Figure 3.
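A sketch of this experiment in the sequence representation (the coefficients θ_{0,k} = k^{−2.25} sin(10k), n = 200 and α = 1.75 follow the text; the truncation level, random seed and τ-grid are illustrative choices, not those behind the figures) compares the ℓ_2-errors of the posterior means at the empirical Bayes scale and at the two rescaled values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, alpha = 200, 400, 1.75         # n, alpha as in the text; K is an ad hoc truncation
k = np.arange(1, K + 1)
theta0 = k ** (-2.25) * np.sin(10 * k)       # beta = 1.75 in (2.2)
X = theta0 + rng.standard_normal(K) / np.sqrt(n)

def log_marginal(tau):
    r = n * tau ** 2 * k ** (-1.0 - 2 * alpha)
    return -0.5 * np.sum(np.log1p(r) - n * X ** 2 * r / (1 + r))

taus = np.exp(np.linspace(np.log(1e-3), np.log(1e3), 600))
tau_hat = taus[np.argmax([log_marginal(t) for t in taus])]

def post_mean(tau):
    r = n * tau ** 2 * k ** (-1.0 - 2 * alpha)
    return r / (1 + r) * X

err_over = np.linalg.norm(post_mean(tau_hat / 20) - theta0)   # over-smoothing
err_eb = np.linalg.norm(post_mean(tau_hat) - theta0)          # empirical Bayes
err_under = np.linalg.norm(post_mean(tau_hat * 20) - theta0)  # under-smoothing
print(tau_hat, err_over, err_eb, err_under)
```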
In an attempt to visualize the cut-off at β = α + 1/2 we repeated the procedure for various prior regularities α near β, every time choosing the scaling by the empirical Bayes method. The results are shown in Figure 4. The theory claims that large values of α (i.e. over-smoothing) work fine, as they can and will be corrected by the choice of the scale parameter τ̂_n, but values of α below β − 1/2 cannot be corrected, and lead to suboptimal reconstruction. This is illustrated in Figure 4, in which the prior smoothness increases in steps of 1/2 from β to β + 1 in the top panels, and decreases from β to β − 1 in the bottom panels. The last reconstruction, for α = β − 1, is clearly not satisfactory.
The theory says that the empirical and hierarchical Bayes methods do not differ much. We illustrate this in Figure 5, which is the hierarchical Bayes version of Figure 2. Instead of the likelihood for τ, the picture shows the posterior distribution of this parameter in the right panel. We used the inverse Gamma distribution for the squared scaling parameter τ², which is conjugate to the Gaussian location family. The posterior distribution was computed by a Gibbs sampler. Finally, Figure 6 is the hierarchical Bayes counterpart of
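The Gibbs sampler mentioned here exploits conjugacy in both conditionals: given τ² the coordinates θ_k have independent normal posteriors, and given θ the inverse Gamma prior on τ² updates in closed form. A minimal sketch (the hyperparameters a, b, chain length, seed and truncation level are arbitrary illustrative choices, not those used for the figures):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, alpha = 200, 400, 1.75
k = np.arange(1, K + 1)
theta0 = k ** (-2.25) * np.sin(10 * k)
X = theta0 + rng.standard_normal(K) / np.sqrt(n)

a, b = 1.0, 1.0                      # hypothetical IG(a, b) hyperparameters for tau^2
tau2 = 1.0
tau2_samples = []
for it in range(2000):
    # theta | X, tau^2: independent normals (conjugate normal model per coordinate)
    lam = tau2 * k ** (-1.0 - 2 * alpha)         # prior variances
    post_var = lam / (1 + n * lam)
    post_mean = n * lam / (1 + n * lam) * X
    theta = post_mean + np.sqrt(post_var) * rng.standard_normal(K)
    # tau^2 | theta: IG(a + K/2, b + 0.5 sum_k theta_k^2 k^{1+2alpha}) by conjugacy
    shape = a + K / 2
    rate = b + 0.5 * np.sum(theta ** 2 * k ** (1.0 + 2 * alpha))
    tau2 = 1.0 / rng.gamma(shape, 1.0 / rate)
    if it >= 500:                    # discard a burn-in period
        tau2_samples.append(tau2)

print(np.mean(tau2_samples))
```

The retained draws approximate the posterior distribution of τ² shown in the right panel of Figure 5; in this sparse-information setting the chain mixes slowly, so long runs are advisable.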

Proof of Theorem 2.1
Every term of the series (2.5) that defines ℓ_n is a smooth function of τ. With the help of the dominated convergence theorem, it is straightforward to see that the function ℓ_n is (P_0-a.s.) continuously differentiable on (0, ∞), with derivative given by the series of term-wise derivatives. It will be convenient first to substitute ν^{1+2α} = τ²n, and then differentiate with respect to ν. The resulting derivative map M_n is given in (4.1). In the new parametrization the bounds τ_n and τ̄_n turn into bounds ν_n ≤ ν̄_n, expressed through the function h : (0, ∞) → (0, ∞) given in (4.2). In the following subsections we prove that if the constants l, L > 0 are sufficiently small and large, respectively, then with probability tending to 1: (i) the function M_n is strictly negative and bounded away from 0 on (ν̄_n, ∞); (ii) M_n is larger than any given constant on (ν_n/2, ν_n); (iii) M_n is bounded below by a fixed constant on (0, ν_n/2).
Property (i) shows that the primitive function of M_n (and hence the log marginal Bayesian likelihood ℓ_n) is decreasing on (ν̄_n, ∞), whence an absolute maximum is taken to the left of ν̄_n. The pair of properties (ii) and (iii) implies that the primitive function of M_n increases more on (ν_n/2, ν_n) than it possibly decreases on (0, ν_n/2). Thus an absolute maximum of ℓ_n is taken to the right of ν_n. We conclude that the absolute maximum of ℓ_n is taken in the interval [ν_n, ν̄_n], which is the first assertion of Theorem 2.1. Because the constant in (iii) may be negative, it does not follow that M_n is nonnegative throughout (0, ν_n/2). Thus our proof does not exclude additional local maxima on this interval. In fact such local maxima may exist for irregular θ_0, as we illustrate in Section 4.6.

Asymptotic behavior of M_n on (ν̄_n, ∞)
In this section we prove that if l in the definition of ν̄_n is small enough, then (4.3) holds. This shows that M_n is negative throughout [ν̄_n, ∞) with probability tending to one, so that the empirical likelihood is strictly decreasing on this interval. For the proof of (4.3) we note that, since E_0 X_k² = θ_{0,k}² + 1/n, the expectation E_0 M_n(ν) can be written as a sum of two terms, involving the function h defined in (4.2). By considering Riemann sums (cf. Lemma A.1 in the appendix) we see that for ν → ∞ the second term on the right converges to the positive constant c_α := ∫_0^∞ (x^{1+2α} + 1)^{−2} dx. By the definition of ν̄_n we have nh(ν) ≤ l for ν ≥ ν̄_n. It follows that (4.3) is satisfied for l < c_α.
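The Riemann-sum approximation behind the constant c_α can be checked numerically (a sketch; the grid, truncation level and the value α = 1 are arbitrary): the normalized sum (1/ν) Σ_k ((k/ν)^{1+2α} + 1)^{−2} approaches the integral defining c_α as ν grows.

```python
import numpy as np

alpha = 1.0

def riemann(nu, K=2_000_000):
    # (1/nu) sum_k f(k/nu) for f(x) = (x^{1+2alpha} + 1)^{-2}, truncated at K
    k = np.arange(1, K + 1)
    return np.sum(1.0 / ((k / nu) ** (1 + 2 * alpha) + 1) ** 2) / nu

# c_alpha = int_0^infty (x^{1+2alpha} + 1)^{-2} dx, via a fine midpoint rule on [0, 200]
x = (np.arange(200_000) + 0.5) * 1e-3
c_alpha = np.sum(1.0 / (x ** (1 + 2 * alpha) + 1) ** 2) * 1e-3

print(riemann(50.0), c_alpha)        # the two values agree up to O(1/nu)
```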
The variance bound (4.5) can be verified as follows: the first part is handled by splitting the sum into the parts k ≤ ν and k > ν and bounding k^{1+2α} + ν^{1+2α} below by ν^{1+2α} and k^{1+2α}, respectively; for the second part we use the inequality xy/(x + y)² ≤ 1, valid for xy > 0, and the definition of h. For ν ≥ ν̄_n we have that nh(ν) is bounded by l, and hence the right side of (4.5) is bounded by a multiple of 1/ν. It follows that Var_0 M_n(ν̄_n) → 0 as required. Furthermore, combination with the triangle inequality shows that the d_n-diameter of the set [ν̄_n, ∞) is bounded by a multiple of 1/√ν̄_n. Next we consider the covering number N(ε, [ν̄_n, ∞), d_n). Because the d_n-diameter of the set [ν, ∞) is bounded above by a multiple of 1/√ν, for a large enough constant A the interval [A/ε², ∞) is included in a single d_n-ball of radius ε. The remaining interval can be covered by the K ≲ 1 + (log(A/(ε²ν̄_n)))_+ intervals [A/(2^{k+1}ε²), A/(2^kε²)]. By Lemma 4.1 (below), on each of these intervals N(ε, [A/(2^{k+1}ε²), A/(2^kε²)], d_n) ≲ 1/ε. Putting things together concludes the proof of (4.4).
For ν^{1+2α} ≤ n^{1/3} we can argue as follows. Since E_0 X_k² = 1/n + θ_{0,k}² and 2xy ≥ x + y for x, y ≥ 1, the expected value of the right-hand side is bounded below by an expression which, for n large enough, is bounded below by a constant times n^{1/3} if θ_0 ≠ 0. Since Var_0 X_k² = 2/n² + 4θ_{0,k}²/n ≲ 1/n² + 1/(n k^{1+2β}), the variance is bounded by a constant times an expression which is (easily) bounded by n^{1/3} for n large enough. The proof of the statement is now completed by an application of Chebyshev's inequality.

Asymptotic behavior of M_n on (n^{1/(3+6α)}, ν_n)
In this section we show that if the constant L in the definition of ν n is chosen large enough, then M n is bounded uniformly below by a fixed (negative) constant on (n 1/(3+6α) , ν n ) and by an arbitrarily large constant on (ν n /2, ν n ), with probability tending to 1.
We have a lower bound for M_n(ν) in which the stochastic part enters through a sequence of random variables G_n; we shall show that G_n tends in probability to zero. Then for every ν ≥ n^{1/(3+6α)} we use that the function x ↦ x − 2g√x possesses minimal value −g² on (0, ∞) and is bounded below by x/2 for x ≥ 16g². It follows that the left side is bounded below on (n^{1/(3+6α)}, ν_n) by a negative constant that tends to zero, and is "big" whenever nh(ν) is big.
Finally we prove that G_n → 0 in probability. The process H(ν)/√(h(ν)/ν) is Gaussian. Lemma 4.2 (below) shows that on the interval [ν, 2ν] its intrinsic metric is bounded above by a multiple of | · |/ν. It follows that the covering number of this interval relative to the Gaussian metric is bounded above by a multiple of 1/ε. Since ν_n is bounded by a power of n, the interval (1, ν_n] can be covered with O(log n) intervals of this type, and hence has covering number bounded above by a multiple of (log n)/ε. By Corollary 2.2.5 in [28], applied with ψ(x) = e^{x²} − 1, the expected supremum can be bounded accordingly. Together with the fact that Var_0 H(ν) ≲ h(ν)/ν, this shows that G_n is of the order O_P(n^{−1/(6+12α)} √(log log n)).

Asymptotic behavior for special choices of θ 0
For θ_{0,k}² = C² k^{−1−2β} the function h given by (4.2) satisfies, as ν → ∞, the asymptotic relation (4.9) (cf. Lemma A.1), for certain constants c_{α,β}. In this case the definition of ν̄_n readily gives the expression (4.10) for ν̄_n. Furthermore, by its definition ν_n satisfies the same equation with L instead of l.
If θ_0 satisfies the one-sided inequality θ_{0,k}² ≤ C² k^{−1−2β} (or θ_{0,k}² ≥ c² k^{−1−2β}), then the function h can be upper bounded as previously (or lower bounded, with c instead of C, respectively). By its definition the upper bound ν̄_n can then be upper bounded by the right side of (4.10) (or ν_n can be lower bounded by this expression with c replacing C, respectively). Thus given both the upper and lower bound on θ_0, the two quantities ν_n and ν̄_n have the same order.
Finally assume again that θ_{0,k}² ≍ C² k^{−1−2β}. Relation (4.4) is then valid also with ν_n replacing ν̄_n: sup_{ν ≥ ν_n} |M_n(ν) − E_0 M_n(ν)| → 0 in probability, in all three cases. Since ν̂_n is a zero of M_n and is contained in [ν_n, ν̄_n], it follows that E_0 M_n(ν)|_{ν=ν̂_n} → 0 in probability. Again employing (4.9), we conclude that the corresponding expression in c_{α,β} and c_α tends to zero in probability, in the three cases, respectively, for the constants c_{α,β} defined previously. This readily gives that τ̂_n/τ_n tends to a constant in probability.

The special case θ_0 = 0
If θ_0 = 0, then the expected value E_0 M_n(ν), given at the beginning of Section 4.1, tends to a negative constant as ν → ∞, and is 0 only at ν = 0. Thus it is negative and bounded away from zero on every interval [ν, ∞) for ν > 0.
Furthermore, in this case the function h vanishes and hence the computations in Section 4.1 show that Var_0 M_n(ν) ≲ 1/ν for every ν > 0, and that the upper bound on Var_0(M_n(ν_1) − M_n(ν_2)) given by Lemma 4.1 is valid without the factor (1 + nh(ν_1)) in its right side. Similar arguments as in Section 4.1 then show that M_n tends to its expectation uniformly on every sequence of intervals [ν_n, ∞) with ν_n → ∞.
Combination of these findings shows that P_0(ν̂_n ≤ ν_n) → 1 for every ν_n → ∞. This is equivalent to n τ̂_n² being bounded in probability.

Example: multiple local maxima
We construct a fixed parameter θ_0 and a subsequence n_j → ∞ such that, with probability tending to 1, the random map M_{n_j} is strictly negative somewhere in the interval [0, ν_{n_j}]. We fix 0 < β < α + 1/2 and, for (large) positive constants A, B and C to be determined later and j ∈ ℕ, we set ν_j = A^j and n_j = B ν_j^{1+2β}, and define θ_0 accordingly. We shall show that by choosing the constants A, B and C sufficiently large, we can ensure that n_j h(ν_j/C) becomes arbitrarily small (positive) and n_j h(ν_j) > L for j large enough. The latter implies that ν_j/C < ν_j ≤ ν_{n_j}, and the former that E_0 M_{n_j}(ν_j/C) is smaller than a negative constant. Using (4.5) we then also get that Var_0 M_{n_j}(ν_j/C) ≲ 1/ν_j → 0, and the claim follows by Chebyshev's inequality.
To upper bound n_j h(ν_j/C) we split the sum in the definition of h into three parts: the sum over the indices k < 2ν_{j−1}, the sum over the indices ν_j < k < 2ν_j, and, since ν_j < ν_{j+1}, the sum over k ≥ ν_{j+1}, each of which can be bounded separately. Combining the three parts gives an upper bound on n_j h(ν_j/C); a corresponding lower bound on n_j h(ν_j) follows similarly. To complete the construction, observe that by choosing B large enough we can ensure that n_j h(ν_j) > L. By next choosing C large enough and then A large enough we can make n_j h(ν_j/C) arbitrarily small.

Proof of Theorem 2.2
It is convenient to continue to work with the parametrization ν^{1+2α} = τ²n. Slightly abusing notation, we denote by Π_ν the same prior as Π_τ for ν^{1+2α} = τ²n, and similarly for the posterior. In this notation the empirical Bayes posterior is Π_{ν̂_n}(· | X), where ν̂_n is the (or rather a) zero of the random function M_n on (0, ∞) defined by (4.1).
Because ‖θ − θ_0‖² = Σ_k (θ_k − θ_{0,k})², we have, with θ̂_{ν,k} = ν^{1+2α}(k^{1+2α} + ν^{1+2α})^{−1} X_k the posterior mean, the decomposition (5.1) of the posterior risk. By Markov's inequality the left side divided by (M_n ε_n)² is an upper bound on Π_ν(θ : ‖θ − θ_0‖ ≥ M_n ε_n | X), for any M_n ε_n > 0. We wish to show that the latter probability evaluated at ν = ν̂_n tends to zero for the appropriate rate ε_n = ε_{n,α,β} and any M_n → ∞. By Theorem 2.1, with probability going to 1 the empirical Bayes rescaling rate ν̂_n belongs to the interval [ν_n, ν̄_n]. Therefore, to prove Theorem 2.2 it suffices to show that the expectation of the supremum of this expression over ν ∈ [ν_n, ν̄_n] is of the appropriate order ε_n². We shall first show that the supremum of the expectations has the right order, and next that the expectation of the supremum has the same order.

Posterior risk for scaling in [ν_n, ν̄_n]
The second term of (5.1) is deterministic. The expectation of the first term can be split into square bias and variance terms, so that the expectation of (5.1) is given by a sum of three series. In this section we prove that the supremum of this expression over ν ∈ [ν_n, ν̄_n] is bounded by a constant times n^{−2β/(1+2β)} + ν̄_n/n. In Section 4.4 it was seen that under (2.2) the upper bound ν̄_n is bounded above by the right side of (4.10), which shows that ν̄_n/n ≍ ε²_{n,α,β}, the (square) order claimed in Theorem 2.2. The first term n^{−2β/(1+2β)} is smaller than this order, in all three cases.
The series in the second and third terms are bounded by a multiple of ν (and asymptotic to ν times a constant as ν → ∞), and hence the suprema of these terms over ν ∈ [ν_n, ν̄_n] are bounded by a multiple of ν̄_n/n.
The first series is decreasing in ν and hence it suffices to consider it at ν = ν_n. Its terms are bounded above by θ_{0,k}², so that, in view of (2.2), the tail of the series beyond an index N is bounded by a multiple of N^{−2β}. By the definition of ν_n (and continuity of the series) we have, for ν ≥ ν_n, the bound (5.2). (The function h is as in (4.2).) As a first consequence the initial part of the series can be bounded. It remains to consider the terms between ν_n and N. For ν ≤ k ≤ 2ν and any ν > 0 we have that ν^{1+2α} k^{1+2α}/(k^{1+2α} + ν^{1+2α})² ≥ 1/(2^{1+2α} + 3). Therefore, as a second consequence of (5.2), the sum over the indices k ∈ [ν, 2ν] can be bounded, for ν ≥ ν_n. For L large enough that ν_n 2^L ≥ N we can sum these bounds over the dyadic blocks; for ν_n 2^L ∼ N this is bounded above by a multiple of LN/n ≲ n^{−2β/(1+2β)}.

Uniform result for the posterior risk
In this section we bound the expectation of the supremum. Using the explicit expressions for the θ̂_{ν,k} we see that the random variable in the supremum is the absolute value of V(ν)/n − 2W(ν)/√n, for certain processes V and W. We deal with the two processes separately. By comparison with Riemann sums (cf. Lemma A.1) we see that Var_0 V(ν) ≍ ν ∫_0^∞ (x^{1+2α} + 1)^{−4} dx as ν → ∞. By Lemma 5.1 below the standard deviation metric of V is bounded above by a multiple of | · |/√ν on the interval [ν, 2ν]. Therefore the covering number of this interval relative to the standard deviation metric is bounded above by a multiple of √ν/ε. Covering the interval [ν_n, ν̄_n] with the intervals (2^{−m−1} ν̄_n, 2^{−m} ν̄_n], for m = 0, 1, 2, . . ., we see that its covering number is bounded above by Σ_m √(2^{−m} ν̄_n)/ε ≲ √ν̄_n/ε. By Corollary 2.2.5 in [28], applied with ψ(x) = x², the expected supremum of |V| can be bounded accordingly; divided by n this yields the order √ν̄_n/n ≤ ν̄_n/n. It remains to deal with the process W. Because W(ν) = νH(ν) for H given in (4.7), we have by (4.8) that Var_0 W(ν) = νh(ν), for h given in (4.2). By (5.2) we have that Var_0 W(ν) ≤ νL/n for ν ≥ ν_n. Furthermore, by Lemma 5.1 (below) the standard deviation metric of W is bounded above by a multiple of | · | √(h(ν)/ν) ≲ | · |/√(νn) on an interval [ν, 2ν] with ν ≥ ν_n. By the same reasoning as in the preceding paragraph this shows that the covering number of the interval [ν_n, ν̄_n] relative to the standard deviation metric is bounded above by √(ν̄_n/n)/ε. By Corollary 2.2.5 in [28], applied with ψ(x) = e^{x²} − 1, the expected supremum of |W| can be bounded accordingly; divided by √n this yields the order √(ν̄_n log n)/n ≤ ν̄_n/n.
The left side of the second inequality of the lemma can be written in the form Σ_k (h_k(ν_1) − h_k(ν_2))² k^{2+4α} θ_{0,k}², this time for the function h_k given by h_k(ν) = ν^{1+2α}/(k^{1+2α} + ν^{1+2α})². The derivative of this function satisfies |h′_k(ν)| ≲ ν^{2α}/(k^{1+2α} + ν^{1+2α})². By the mean value theorem the left side of the lemma is therefore bounded by a multiple of the corresponding series of squared derivatives, which is bounded by ν_1^{−1} h(ν_1).
Furthermore, the constant c_2 can be chosen arbitrarily large by choosing L in (2.8) large enough, while the constant c_3 is fixed.