Adaptive posterior contraction rates for the horseshoe

We investigate the frequentist properties of Bayesian procedures for estimation based on the horseshoe prior in the sparse multivariate normal means model. Previous theoretical results assumed that the sparsity level, that is, the number of signals, was known. We drop this assumption and characterize the behavior of the maximum marginal likelihood estimator (MMLE) of a key parameter of the horseshoe prior. We prove that the MMLE is an effective estimator of the sparsity level, in the sense that it leads to (near) minimax optimal estimation of the underlying mean vector generating the data. Besides this empirical Bayes procedure, we consider the hierarchical Bayes method of putting a prior on the unknown sparsity level as well. We show that both Bayesian techniques lead to rate-adaptive optimal posterior contraction, which implies that the horseshoe posterior is a good candidate for generating rate-adaptive credible sets.


Introduction
The rise of big datasets with few signals, such as gene expression data and astronomical images, has given an impulse to the study of sparse models. The sequence model, or sparse normal means problem, is well studied. In this model, a random vector Y^n = (Y_1, ..., Y_n) with values in R^n is observed, and each single observation Y_i is the sum of a fixed mean θ_{0,i} and standard normal noise ε_i:

Y_i = θ_{0,i} + ε_i,   i = 1, ..., n.   (1.1)

We perform inference on the mean vector θ_0 = (θ_{0,1}, ..., θ_{0,n}), and assume it to be sparse in the nearly black sense, meaning that all except an unknown number p_n = Σ_{i=1}^n 1{θ_{0,i} ≠ 0} of the means are zero. We assume that p_n increases with n, but not as fast as n: p_n → ∞ and p_n/n → 0 as n tends to infinity.
Many methods to recover θ_0 have been suggested. Those most directly related to this work are [30, 20, 9, 8, 18, 16, 19, 14, 5, 3, 2, 25]. In the present paper we study the Bayesian method based on the horseshoe prior [7, 6, 28, 23, 24]. Under this prior the coordinates θ_1, ..., θ_n are an i.i.d. sample from a scale mixture of normals with a half-Cauchy prior on the variance, as follows. Given a "global hyperparameter" τ,

θ_i | λ_i, τ ~ N(0, τ²λ_i²),   λ_i ~ Ca⁺(0, 1),   i = 1, ..., n,   (1.2)

where Ca⁺(0, 1) denotes the standard half-Cauchy distribution on the positive reals. In the Bayesian model the observations Y_i follow (1.1) with θ_0 taken equal to θ. The posterior distribution is then as usual obtained as the conditional distribution of θ given Y^n. For a given value of τ, possibly determined by an empirical Bayes method, aspects of the posterior distribution of θ, such as its mean and variance, can be computed with the help of analytic formulas and numerical integration [23, 24, 33]. It is also possible to equip τ with a hyper prior, and follow a hierarchical, full Bayes approach. Several MCMC samplers and software packages are available for computation of the posterior distribution [27, 21, 15, 31, 17].

‡ Research supported by the Netherlands Organization for Scientific Research. § The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637.
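To fix ideas, the sequence model (1.1) and the horseshoe prior (1.2) can be simulated in a few lines. The sketch below is in Python and is purely illustrative (the paper's own numerical work uses R); the signal size 5 and the choice τ = p_n/n are our illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_n = 200, 10

# Sparse "nearly black" mean vector: p_n nonzero signals, the rest exactly zero.
theta0 = np.zeros(n)
theta0[:p_n] = 5.0  # illustrative signal size

# Observations from the sequence model (1.1): Y_i = theta_{0,i} + eps_i.
y = theta0 + rng.standard_normal(n)

# One draw from the horseshoe prior (1.2) for a given global parameter tau:
# lambda_i ~ half-Cauchy(0, 1), theta_i | lambda_i ~ N(0, (tau * lambda_i)^2).
tau = p_n / n  # illustrative choice, of the order of the sparsity level
lam = np.abs(rng.standard_cauchy(n))
theta_prior = tau * lam * rng.standard_normal(n)
```

The half-Cauchy scales λ_i allow individual coordinates to escape the overall shrinkage imposed by a small τ, which is what makes the prior suitable for sparse vectors.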
The horseshoe posterior has performed well in simulations [7, 6, 23, 22, 3, 1]. Theoretical investigation in [33] shows that the parameter τ can, up to a logarithmic factor, be interpreted as the fraction of nonzero parameters θ_i. In particular, if τ is chosen to be at most of the order (p_n/n)√(log(n/p_n)), then the horseshoe posterior contracts to the true parameter at the (near) minimax rate of recovery for quadratic loss over sparse models [33]. While motivated by these good properties, we also believe that the results obtained give insight into the performance of Bayesian procedures for sparsity in general.
In the present paper we make three novel contributions. First and second, we establish the contraction rates of the posterior distributions of θ in the hierarchical, full Bayes case and in the general empirical Bayes case. Third, we study the particular empirical Bayes method of estimating τ by the method of maximum Bayesian marginal likelihood.
As the parameter τ can be viewed as measuring sparsity, the first two contributions are both focused on adaptation to the number p_n of nonzero means, which is unlikely to be known in practice. The hierarchical and empirical Bayes methods studied here are shown to have similar performance, both in theory and in a small simulation study, and appear to outperform the ad-hoc estimator introduced in [33]. The horseshoe posterior attains similar contraction rates as the spike-and-slab priors, as obtained in [20, 9, 8], and two-component mixtures, as in [25]. We obtain these results under general conditions on the hyper prior on τ, and for general empirical Bayes methods.
The conditions for the empirical Bayes method are met in particular by the maximum marginal likelihood estimator (MMLE). This is the maximum likelihood estimator of τ under the assumption that the "prior" (1.2) is part of the data-generating model, leaving only τ as a parameter. The MMLE is a natural estimator and is easy to compute. It turns out that the "MMLE plug-in posterior distribution" closely mimics the hierarchical Bayes posterior distribution, as has been observed in other settings [29, 26]. Besides its practical benefit, this correspondence provides a theoretical tool to analyze the hierarchical Bayes method, which need not rely on testing arguments (as in [12, 13, 34]).
In the Bayesian framework the spread of the posterior distribution over the parameter space is used as an indication of the error in estimation. For instance, a set of prescribed posterior probability around the center of the posterior distribution (a credible set) is often used in the same way as a confidence region for the parameter. In the follow-up paper [32], we investigate the coverage properties and sizes of the adaptive credible balls and marginal credible intervals.
The paper is organized as follows. We first introduce the MMLE in Section 2. Next we present contraction rates in Section 3, for general empirical and hierarchical Bayes approaches, and specifically for the MMLE. We illustrate the results in Section 4. We conclude with appendices containing all proofs not given in the main text.

Notation
We use Π(· | Y^n, τ) for the posterior distribution of θ relative to the prior (1.2) given fixed τ, and Π(· | Y^n) for the posterior distribution in the hierarchical setup where τ has received a prior. The empirical Bayes "plug-in posterior" is the first object with a data-based variable τ̂_n substituted for τ. In order to stress that this does not entail conditioning on τ̂_n, we also write Π_{τ̂_n}(· | Y^n) for it. The density of the standard normal distribution is denoted by ϕ. Furthermore, ℓ_0[p] = {θ ∈ R^n : Σ_{i=1}^n 1{θ_i ≠ 0} ≤ p} denotes the class of nearly black vectors.

Maximum marginal likelihood estimator
In this section we define the MMLE and compare it to a naive empirical Bayes estimator previously suggested in [33]. In Section 3.1, we show that the MMLE is close to the "optimal" value τ_n(p_n) = (p_n/n)√(log(n/p_n)) with high probability, and leads to posterior contraction at the near-minimax rate.
The marginal prior density of a parameter θ_i in the model (1.2) is given by

g_τ(θ) = ∫_0^∞ ϕ(θ/(τλ))/(τλ) · 2/(π(1 + λ²)) dλ.   (2.1)

In the Bayesian model the observations Y_i are distributed according to the convolution of this density and the standard normal density. The MMLE is the maximum likelihood estimator of τ in this latter model, given by

τ̂_M = argmax_{τ ∈ [1/n, 1]} ∏_{i=1}^n (ϕ * g_τ)(Y_i).   (2.2)

The restriction of the MMLE to the interval [1/n, 1] can be motivated by the interpretation of τ as the level of sparsity, as in [33], which makes the interval correspond to assuming that at least one and at most all parameters are nonzero. The lower bound of 1/n has the additional advantage of preventing computational issues that arise when τ is very small ([33, 10]). We consider the observation in [10], that an empirical Bayes approach cannot replace a hierarchical Bayes one because the estimate of τ tends to be too small, to be too general. In both our theoretical study and in our simulation results the restriction that the MMLE be at least 1/n prevents a collapse to zero. Our simulations, presented in Section 4, also give no reason to believe that the hierarchical Bayes method is inherently better than empirical Bayes. Indeed, they behave very similarly (depending on the prior on τ).
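Because the MMLE only involves a one-dimensional maximization, it is easy to approximate numerically. The sketch below (Python; the paper's own implementation is in R) forms the marginal density ψ_τ = ϕ * g_τ as a half-Cauchy mixture of N(0, 1 + τ²λ²) densities and maximizes the resulting log-likelihood over [1/n, 1]; the function names are ours, and this is a plain numerical sketch, not the package's algorithm.

```python
import numpy as np
from scipy import integrate, optimize

SQRT_2PI = np.sqrt(2.0 * np.pi)

def psi(y, tau):
    """Marginal density of one observation: half-Cauchy mixture of N(0, 1 + tau^2 lam^2)."""
    def f(lam):
        s2 = 1.0 + (tau * lam) ** 2
        normal = np.exp(-0.5 * y * y / s2) / (np.sqrt(s2) * SQRT_2PI)
        half_cauchy = 2.0 / (np.pi * (1.0 + lam * lam))
        return normal * half_cauchy
    return integrate.quad(f, 0.0, np.inf)[0]

def mmle(y):
    """Maximize the marginal log-likelihood over [1/n, 1], as in (2.2)."""
    n = len(y)
    neg_log_lik = lambda tau: -sum(np.log(psi(yi, tau)) for yi in y)
    res = optimize.minimize_scalar(neg_log_lik, bounds=(1.0 / n, 1.0), method="bounded")
    return res.x
```

For data containing several large means the maximizer moves up from the lower endpoint 1/n, in line with the behavior illustrated in Figure 1.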
The MMLE requires only a one-dimensional maximization and is thus easily computed. The behavior of the quantity to be maximized in (2.2), and of the MMLE itself, is illustrated in Figure 1. A function for its computation is available in the R package 'horseshoe' ([31]).
An interpretation of τ as the fraction of nonzero coordinates motivates another estimator ([33]), which is based on a count of the number of observations that exceed the "universal threshold" √(2 log n):

τ_S(c_1, c_2) = max( #{i : |Y_i| ≥ √(c_1 log n)}/(c_2 n), 1/n ),   (2.3)

where c_1 and c_2 are positive constants. If c_2 > 1 and (c_1 > 2, or c_1 = 2 and p_n ≳ log n), then the plug-in posterior distribution with the simple estimator τ_S(c_1, c_2) contracts at the near minimax rate, with square rate p_n log n (see [33], Section 4). This also follows from Theorem 3.2 in the present paper, as τ_S(c_1, c_2) satisfies Condition 1 below.
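Our reading of the count-based estimator (2.3) can be sketched as follows; the truncation below at 1/n mirrors the restriction used for the MMLE, and the defaults c1 = 2, c2 = 1 are the values used in the simulation study of Section 4.

```python
import numpy as np

def tau_simple(y, c1=2.0, c2=1.0):
    """Count-based estimator: fraction of |Y_i| above sqrt(c1 log n), truncated below at 1/n."""
    n = len(y)
    count = np.count_nonzero(np.abs(y) >= np.sqrt(c1 * np.log(n)))
    return max(count / (c2 * n), 1.0 / n)
```

By construction this estimator can never exceed the observed fraction of threshold exceedances, which explains the cap at roughly p_n/n discussed below.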
In [33], it was observed that the simple estimator is prone to underestimation of the sparsity level if signals are smaller than the universal threshold. This is corroborated by the numerical study presented in Figure 2. The figure shows approximations to the expected values of τ_S and τ_M when θ_0 is a vector of length n = 100, with p_n coordinates drawn from a N(A, 1) distribution, with A ∈ {1, 4, 7}, and the remaining coordinates drawn from a N(0, 1/4) distribution. For this sample size the "universal threshold" √(2 log n) is approximately 3, and thus signals with A = 1 should be difficult to detect, whereas those with A = 7 should be easy; those with A = 4 represent a boundary case.
The figure shows that in all cases the MMLE (2.2) yields larger estimates of τ than the simple estimator (2.3), and thus leads to less shrinkage. This is expected in light of the results in the following section, which show that the MMLE is of order τ_n(p_n), whereas the simple estimator is capped at p_n/n. Both estimators appear to be linear in the number of nonzero coordinates of θ_0, with different slopes. When the signals are below the universal threshold, the simple estimator is unlikely to detect any of them, whereas the MMLE may still pick up some of the signals. We study the consequences of this for the mean square errors in Section 4.

Contraction rates
In this section we establish the rate of contraction of both the empirical Bayes and full Bayes posterior distributions. The rate of contraction refers to properties of these posterior distributions when the vector Y^n follows a normal distribution on R^n with mean θ_0 and covariance the identity. We give general conditions on the empirical Bayes estimator τ̂_n and the hyper prior on τ that ensure that the square posterior rate of contraction to θ_0 of the resulting posterior distributions is the near minimax rate p_n log n for estimation of θ_0 relative to the Euclidean norm. We also show that these conditions are met by the MMLE and natural hyper priors on τ.
The minimax rate, the usual criterion for point estimators, has proven to be a useful benchmark for the speed of contraction of posterior distributions as well. The posterior cannot contract faster to the truth than at the minimax rate [12]. The square minimax rate over the nearly black class ℓ_0[p_n] is of the order p_n log(n/p_n).

Empirical Bayes
The empirical Bayes posterior distribution achieves the near-minimax contraction rate provided that the estimator τ̂_n of τ satisfies the following condition. Let τ_n(p) = (p/n)√(log(n/p)).

Condition 1. The estimator τ̂_n takes its values in the interval [1/n, C τ_n(p_n)] with probability tending to one, uniformly over θ_0 ∈ ℓ_0[p_n], for some constant C > 0.
This condition is weaker than the condition given in [33] for ℓ_2-adaptation of the empirical Bayes posterior mean, which requires asymptotic concentration of τ̂_n on the same interval [1/n, C τ_n(p_n)], but at a certain rate. In [33] a plug-in value for τ of order τ_n(p_n) was found to be the largest value of τ for which the posterior distribution contracts at the minimax rate and has variance of the same order. Condition 1 can be interpreted as ensuring that τ̂_n is at most of this "optimal" order. The lower bound can be interpreted as assuming that there is at least one nonzero mean, which is reasonable in light of the assumption p_n → ∞. In addition, it prevents computational issues, as discussed in Section 2.
A main result of the present paper is that the MMLE satisfies Condition 1 (Theorem 3.1).
A second main result is that under Condition 1 the posterior contracts at the near-minimax rate.
Theorem 3.2. For any estimator τ̂_n of τ that satisfies Condition 1, the empirical Bayes posterior distribution contracts around the true parameter at the near-minimax rate: for any M_n → ∞,

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Π_{τ̂_n}(θ : ||θ − θ_0|| ≥ M_n √(p_n log n) | Y^n) → 0.

In particular, this is true for τ̂_n equal to the MMLE.

Hierarchical Bayes
The full Bayes posterior distribution contracts at the near minimax rate whenever the prior density π_n on τ satisfies the following two conditions.

Condition 2. The prior density π_n is supported on the interval [1/n, 1].
The restriction of the prior distribution to the interval [1/n, 1] can be motivated by the same reasons as discussed below the definition of the MMLE in Section 2. In our simulations (see also [33]) we have also noted that the large values of τ produced by, for instance, a sampler using a half-Cauchy prior, as in the original set-up proposed by [7], were not beneficial to recovery.
As t_n is of the same order as τ_n(p_n), Condition 3 is similar to Condition 1 in the empirical Bayes case. It requires that there is sufficient prior mass around the "optimal" values of τ. The condition is satisfied by many prior densities, including the usual ones, except in the very sparse case that p_n ≲ log n, when it requires that π_n is unbounded near zero. For this situation we also introduce a weaker condition (Condition 4), which is still good enough for a contraction rate with additional logarithmic factors.

The following lemma is a crucial ingredient of the derivation of the contraction rate. It shows that the posterior distribution of τ concentrates its mass at most a constant multiple of t_n away from zero. We denote the posterior distribution of τ by the same general symbol Π(· | Y^n).

Lemma 3.6. If Conditions 2 and 3 hold, then, uniformly over θ_0 ∈ ℓ_0[p_n],

E_{θ_0} Π(τ : τ > 5t_n | Y^n) → 0.

Furthermore, if only Conditions 2 and 4 hold, then the similar assertion is true but with 5t_n replaced by a sequence that is larger by at most a logarithmic factor.

We are ready to state the posterior contraction result for the full Bayes posterior.

Theorem 3.7. If the prior on τ satisfies Conditions 2 and 3, then the hierarchical Bayes posterior contracts to the true parameter at the near minimax rate: for any M_n → ∞ and p_n → ∞,

sup_{θ_0 ∈ ℓ_0[p_n]} E_{θ_0} Π(θ : ||θ − θ_0|| ≥ M_n √(p_n log n) | Y^n) → 0.

If the prior on τ satisfies only Conditions 2 and 4, then this is true with √(p_n log n) replaced by √(p_n) log n.
Proof. Using the notation r_n = √(p_n log n), we can bound the left side of the preceding display by

E_{θ_0} sup_{τ ≤ 5t_n} Π(θ : ||θ − θ_0|| ≥ M_n r_n | Y^n, τ) + E_{θ_0} Π(τ : τ > 5t_n | Y^n).

The first term on the right tends to zero by Theorem 3.2, and the second by Lemma 3.6.

Simulation study
We study the relative performances of the empirical Bayes and hierarchical Bayes approaches further through simulation studies, extending the simulation study in [33]. We consider the mean square error (MSE) for empirical Bayes combined with either (i) the simple estimator (with c_1 = 2, c_2 = 1) or (ii) the MMLE, and for hierarchical Bayes with either (iii) a Cauchy prior on τ, or (iv) a Cauchy prior truncated to [1/n, 1] on τ.
We created a ground truth θ_0 of length n = 400 with p_n ∈ {20, 200}, where each nonzero mean was fixed to A ∈ {1, 2, ..., 10}. We computed the posterior mean for each of the four procedures, and approximated the MSE by averaging over N = 100 iterations. The results are shown in Figure 3. In addition the figure shows the MSE separately for the nonzero and zero coordinates of θ_0, and the average value (of the posterior mean) of τ. The shapes of the curves of the overall MSE for methods (i) and (iii) were discussed in [33]. Values close to the threshold √(2 log n) ≈ 3.5 pose the most difficult problem, and hierarchical Bayes with a Cauchy prior performs better below the threshold, while empirical Bayes with the simple estimator performs better above, as the simple estimator is very close to p_n/n in those settings, whereas the values of τ resulting from hierarchical Bayes are much larger.
Three new features stand out in this comparison, with the MMLE and hierarchical Bayes with a truncated Cauchy added in, and the opportunity to study the zero and nonzero means separately. The first is that empirical Bayes with the MMLE and hierarchical Bayes with the Cauchy prior truncated to [1/n, 1] behave very similarly, as was expected from our proofs, in which the comparison of the two methods is fruitfully explored.
Secondly, in the most sparse setting (p_n = 20), full Bayes with the truncated and non-truncated Cauchy priors yields very similar results, as the mean value of τ does not come close to the 'maximum' of 1 in either approach; in the less sparse setting (p_n = 200), however, the truncated Cauchy (and the MMLE) offer an improvement over the non-truncated Cauchy. The non-truncated Cauchy does lead to lower MSE on the nonzero means close to the threshold, but overestimates the zero means due to the large values of τ. With the MMLE and the truncated Cauchy, the restriction to [1/n, 1] prevents the marginal posterior of τ from concentrating too far away from the 'optimal' values of order τ_n(p_n), leading to better estimation results for the zero means, and only slightly higher MSE for the nonzero means.
Thirdly, the lower MSE of the simple estimator for large values of A in case p_n = 20 is mostly due to a small improvement in estimating the zero means, compared to the truncated Cauchy and the MMLE. As so many of the parameters are zero, this leads to lower overall MSE. However, close to the threshold, the absolute differences between these methods on the nonzero means can be quite large, and the simple estimator performs worse than all three other methods for these values.
Thus, from an estimation point of view, empirical Bayes with the MMLE or hierarchical Bayes with a truncated Cauchy seem to deliver the best results, only to be outperformed by hierarchical Bayes with a non-truncated Cauchy in a non-sparse setting with all nonzero means very close to the universal threshold.
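The posterior means compared in this section can be approximated by one-dimensional numerical integration over the local scale λ: given λ, the coordinate θ_i is conditionally normal with shrinkage factor τ²λ²/(1 + τ²λ²). The sketch below (Python; the study itself relies on the R package 'horseshoe', and the function name is ours) illustrates the strong shrinkage of near-zero observations versus the near-unbiasedness at large signals.

```python
import numpy as np
from scipy import integrate

def phi(x, s=1.0):
    """Normal density with mean zero and standard deviation s."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def posterior_mean(y, tau):
    """Coordinatewise posterior mean under the horseshoe prior, for fixed tau.

    Given lambda, theta | (y, lambda) ~ N(k*y, k) with k = (tau*lam)^2 / (1 + (tau*lam)^2);
    lambda is averaged out with posterior weight proportional to
    phi(y; sd sqrt(1 + tau^2 lam^2)) times the half-Cauchy density.
    """
    def w(lam):
        return phi(y, np.sqrt(1.0 + (tau * lam) ** 2)) * 2.0 / (np.pi * (1.0 + lam ** 2))
    num = integrate.quad(
        lambda l: (tau * l) ** 2 / (1.0 + (tau * l) ** 2) * y * w(l), 0.0, np.inf)[0]
    den = integrate.quad(w, 0.0, np.inf)[0]
    return num / den
```

An observation of size 0.5 is shrunk nearly to zero, while one of size 4 retains most of its value; this is the mechanism behind the separate MSE curves for the zero and nonzero means in Figure 3.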
We split the sum over the index sets I_0 := {i : θ_{0,i} = 0} and I_1 := {i : θ_{0,i} ≠ 0}. By Lemma C.1, with m_τ given by (C.3), the derivative with respect to τ of the log marginal likelihood equals Σ_{i ∈ I_0} m_τ(Y_i) + Σ_{i ∈ I_1} m_τ(Y_i). By Proposition C.2 the expectations of the terms in the first sum are strictly negative and bounded away from zero for τ ≥ ε, for any given ε > 0. By Lemma C.6 the sum behaves like its expectation, uniformly in τ. By Lemma C.7(i) the function m_τ is uniformly bounded by a constant C_u. It follows that for every ε > 0 there exists a constant C_ε > 0 such that, for all τ ≥ ε, and with p_n = #{i : θ_{0,i} ≠ 0}, the preceding display is bounded above by −(n − p_n)C_ε + p_n C_u. This is negative with probability tending to one as soon as (n − p_n)/p_n > C_u/C_ε, and in that case the maximum cannot be attained at a point τ ≥ ε. Since this is true for any ε > 0, we conclude that τ̂_M tends to zero in probability.
We can now apply Proposition C.2 and Lemma C.3 to obtain a more precise bound on the derivative as τ → 0. This bound is negative for τ/ζ_τ ≳ p_n/(n − p_n), and then τ̂_M is situated to the left of the solution of the corresponding equation, that is, τ̂_M/ζ_{τ̂_M} ≲ p_n/(n − p_n), which implies that τ̂_M ≲ τ_n(p_n), given the assumption that p_n = o(n).
For the proof of Theorem 3.2, we use the following observations. The posterior density of θ_i given (Y_i = y, τ) is (for fixed τ) an exponential family with density

θ ↦ ϕ(y − θ)g_τ(θ)/ψ_τ(y) = c_τ(y) e^{θy} g_τ(θ) e^{−θ²/2},

where g_τ is the prior density given in (2.1), ψ_τ is the Bayesian marginal density of Y_i, given in (C.2), and the norming constant c_τ(y) can be expressed in the function I_{−1/2}(y) defined in (C.1). The cumulant generating function z ↦ log E(e^{zθ_i} | Y_i = y, τ) of the family is given by z ↦ log(c_τ(y)/c_τ(y + z)), which is z ↦ log I_{−1/2}(y + z) plus an additive constant independent of z. We conclude that the first, second and fourth cumulants are given by the corresponding derivatives of this function, which can be computed by repeatedly using the product and sum rules together with the identity I′_k(y) = y I_{k+1}(y), for I_k as in (C.1). Hence, in view of Chebyshev's inequality, it is sufficient to control the posterior variances var(θ_i | Y_i, τ). Applying the upper bound (B.5) for the p_n nonzero coordinates θ_{0,i}, and the upper bound in the last display for the zero parameters, we obtain a bound on the sum of these variances. An application of Markov's inequality then leads to (B.3). The proof of (B.4) is similar; for the nonzero θ_{0,i} we use a corresponding bound on var(θ_i | Y_i, τ).

B.2. Proof of Lemma 3.6
The number t_n defined in Condition 4 is the (approximate) solution to the equation p_n C_u/τ = C_e(n − p_n)/(2ζ_τ), for C_e = (π/2)^{3/2}. By the decomposition (A.2) and Bayes's formula, with P_{θ_0}-probability tending to one, the posterior mass of the set {τ : τ > c_n t_n} tends to zero for c_n ≥ 5.

The Bayesian marginal density of Y_i given τ is the convolution ψ_τ := ϕ * g_τ of the standard normal density and the prior density g_τ given in (2.1). The latter is a half-Cauchy mixture of normal densities ϕ_{τλ} with mean zero and standard deviation τλ. By Fubini's theorem it follows that ψ_τ is a half-Cauchy mixture of the densities ϕ * ϕ_{τλ}; a second expression follows by the substitution 1 − z = (1 + τ²λ²)^{−1} and some algebra. Note that I_{−1/2} depends on τ, but this has been suppressed from the notation I_k.

Proof. From (C.2) we infer, with a dot denoting the partial derivative with respect to τ, an expression for ψ̇_τ. By integration by parts, and substituting the right-hand side in formula (C.3), we readily see by some algebra that τ^{−1} times the latter formula reduces to the right side of the preceding display.
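The mixture representation of ψ_τ used here is easy to check numerically: the convolution ϕ * g_τ, with g_τ computed from (2.1), should agree with the half-Cauchy mixture of the N(0, 1 + τ²λ²) densities. A small sketch (our function names; the finite integration range and loose tolerance are concessions to the nested quadrature):

```python
import numpy as np
from scipy import integrate

def phi(x, s=1.0):
    """Normal density with mean zero and standard deviation s."""
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def half_cauchy(lam):
    return 2.0 / (np.pi * (1.0 + lam ** 2))

def psi_mixture(y, tau):
    """psi_tau as a half-Cauchy mixture of the N(0, 1 + tau^2 lam^2) densities."""
    f = lambda lam: phi(y, np.sqrt(1.0 + (tau * lam) ** 2)) * half_cauchy(lam)
    return integrate.quad(f, 0.0, np.inf)[0]

def psi_convolution(y, tau):
    """psi_tau = phi * g_tau, with the prior density g_tau computed from (2.1)."""
    def g(theta):
        f = lambda lam: phi(theta, tau * lam) * half_cauchy(lam)
        return integrate.quad(f, 0.0, np.inf)[0]
    h = lambda theta: phi(y - theta) * g(theta)
    # g_tau has a logarithmic singularity at 0, so split the range there.
    return integrate.quad(h, -10.0, 0.0)[0] + integrate.quad(h, 0.0, 10.0)[0]
```

Agreement of the two computations is a direct numerical illustration of the Fubini argument in the text.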
Proof. Let κ_τ be the solution to the equation e^{y²/2}/(y²/2) = 1/τ. We split the integral over (0, ∞) into the three parts (0, ζ_τ), (ζ_τ, κ_τ), and (κ_τ, ∞), where we shall see that the last two parts give negligible contributions. By Lemma C.7(vi) and (vii), and by the definition of κ_τ, both of these terms are of smaller order than τ/ζ_τ. Because e^{y²/2}/y² is increasing for large y and reaches the value 1/(2τ) at κ_τ, the integrand on (0, ζ_τ) can be expanded, where the remainder R_τ is bounded in absolute value with the help of Lemma C.10: the integrand is bounded above by a constant for y near 0 and by a multiple of y^{−2} otherwise, and hence the integral remains bounded. Thus the remainder R_τ is negligible. By Fubini's theorem the integral in the preceding display can be rewritten, by the fact that the inner integral vanishes when computed over the interval (0, ∞) rather than (0, ζ_τ). Since ∫_y^∞ [(va)² − 1]ϕ(va) dv = yϕ(ya), it follows that the right side is equal to an integral over z ∈ (0, 1), which we split into the ranges (0, 1/2) and (1/2, 1). For z in the first range we have 1 − z ≥ 1/2, whence the contribution of this range is bounded in absolute value by a negligible term. Uniformly in z in the range (1/2, 1) we have τ² + (1 − τ²)z ∼ z, and the corresponding contribution is obtained by the substitution ζ_τ²(1 − z) = u. The integral tends to ∫_0^∞ e^{−u/2} du = 2, and hence the expression is asymptotic to half the expression as claimed.
The second statement follows by the same estimates. Since E_0 m_τ(Y) ∼ −cτ/ζ_τ for a positive constant c, as τ ↓ 0, the continuous function τ ↦ E_0 m_τ(Y) is certainly negative if τ > 0 is close to zero. To see that it is bounded away from zero as τ moves away from 0, we computed E_0 m_τ(Y) via numerical integration. The result is shown in Figure 4.

Lemma C.3. For any ε_τ ↓ 0, and uniformly in the relevant range of τ, the corresponding assertions hold. In view of Corollary 2.2.5 of [35] (applied with ψ(x) = x²) it is sufficient to show that var_{θ_0} G_n(τ) → 0 for some τ, where d_n is the intrinsic metric defined by its square, diam_n is the diameter of the interval [1/n, 1] with respect to the metric d_n, and N(ε, A, d_n) is the covering number of the set A by balls of radius ε with respect to the metric d_n. This tends to zero, as nτ_n ≥ 1 by assumption. Combining this with the triangle inequality we also see that the diameter diam_n tends to 0. Next we deal with the entropy. The metric d_n is up to a constant equal to the square root of the left side of (C.6), and by Lemma C.4 it satisfies a Lipschitz bound. To compute the covering number of the interval [1/n, 1], we cover it by the dyadic blocks [2^i/n, 2^{i+1}/n], for i = 0, 1, 2, ..., ⌊log₂ n⌋. On the i-th block the distance d_n(τ_1, τ_2) is bounded above by a multiple of n|τ_1 − τ_2|/2^{3i/2}. We conclude that the i-th block can be covered by a multiple of ε^{−1} 2^{−i/2} balls of radius ε. Therefore the whole interval [1/n, 1] can be covered by a multiple of ε^{−1} Σ_i 2^{−i/2} ≲ ε^{−1} balls of radius ε. Hence the integral of the entropy is bounded, and tends to zero as diam_n tends to zero.
The second assertion of the lemma follows similarly, where we use the second parts of Lemmas C.5 and C.4.
Furthermore, for |θ| ≤ ζ_τ/4, ε = 1/16 and 0 < τ < 1. Proof. In view of Lemma C.11 the left side of (C.6) is bounded above by, for ṁ_τ denoting the partial derivative of m_τ with respect to τ, a sum of two expected values. By Lemma C.5 the second expected value on the right-hand side is bounded from above by a multiple of sup_{τ ∈ [τ_1, τ_2]} τ^{−3} ≤ τ_1^{−3}. To handle the first expected value, we note that the partial derivative of I_k with respect to τ is given by İ_k = 2τ(J_{k+1} − J_k), for functions J_k satisfying J_k ≤ I_k/τ². Since k ↦ I_k and k ↦ J_k are decreasing and nonnegative, combining the preceding two displays yields the desired bound; here E_θ Y⁴ is bounded. For the proof of the second assertion of the lemma, when |θ| ≤ ζ_τ/4, we argue similarly, using the second bound from Lemma C.5.
The first follows from the monotonicity and (v).
For the proof of the second assertion we note that if τ ≥ δ > 0, then δ² ≤ τ² + (1 − τ²)z ≤ 1, for every z ∈ [0, 1], so that the denominators in the integrands of I_{−1/2}, I_{1/2}, I_{3/2} are uniformly bounded away from zero and infinity. After changing variables zy²/2 = v, the numerator and denominator take the forms of the integrals in the second and first assertions of Lemma C.8, except that the range of integration is (0, y²/2) rather than (1, y). In view of the lemma the quotient approaches 1 as y → ∞. For y in a bounded interval the leading factor y² is bounded, while the integral in the numerator is smaller than the integral in the denominator, since z ≤ 1 on the domain of integration.

Proof. We split the integral in the definition of I_k over the intervals [0, τ^a] and [τ^a, 1], for a = 2/(k + 1). The contribution of the first interval is bounded above by e^{τ^a y²/2} ∫_0^{τ^a} z^k/√((1 − τ²)z) dz ≲ e^{τ^a y²/2} τ^{ka}.
Combining these displays, we see that the claimed bound holds. This remains valid if we enlarge the range of integration to [0, 1]. The change of coordinates zy²/2 = v completes the proof of the equality in the first assertion.
For the second assertion we expand the integral in the first assertion with the help of the second assertion of Lemma C.8. Note here that for k > −1 the integrals in the latter lemma can be taken over (0, y) instead of (1, y), since the difference is a constant.
The inequality in the first assertion is valid as y → ∞, in view of the second assertion, and from the fact that G(y) := (y²/2)^{−k} ∫_0^{y²/2} v^{k−1} e^v dv possesses a finite limit as y ↓ 0 it follows that it is also valid as y → 0. For intermediate y the inequality follows since the continuous function y ↦ G(y)e^{−y²/2}/(y^{−2} ∧ 1) is bounded on compacta in (0, ∞).
For the proofs of the assertions concerning I_{1/2} − I_{3/2} we write

(I_{1/2} − I_{3/2})(y) = ∫_0^1 z^{1/2}(1 − z)(τ² + (1 − τ²)z)^{−1/2} e^{y²z/2} dz.
Next we follow the same approach as previously.
Proof. Define g = f_2/f_1. Since ∫_0^∞ f_1(x) dx = ∫_0^∞ f_1(x)g(x) dx and g is monotonically increasing, there exists an x_0 > 0 such that g(x) ≤ 1 for x < x_0 and g(x) ≥ 1 for x > x_0. Comparing the integrals of h over these two ranges, and using the definition of g, the right side is E_{f_2} h(X) − E_{f_1} h(X).

Fig 1 .
Fig 1. Logarithm of the quantity to be maximized in (2.2). The red dot indicates the location of the MMLE. Each plot was made using a single simulated data set of 100 observations. From left to right, top to bottom, there are 1, 5, 15 or 40 means equal to 10; the remaining means are equal to zero.

Fig 3 .
Fig 3. Mean square error (overall, for the nonzero coordinates, and for the zero coordinates) of the posterior mean corresponding to empirical Bayes with the simple estimator with c_1 = 2, c_2 = 1 or the MMLE (•), and to hierarchical Bayes with a Cauchy prior on τ or a Cauchy prior truncated to [1/n, 1]. The bottom plot shows the average estimated value of τ (or the posterior mean in the case of the hierarchical Bayes approaches). The settings are n = 400 and p_n = 20 (left) and p_n = 200 (right); the results are approximations based on averaging over N = 100 samples for each value of A.
Appendix A: Proof of the main result about the MMLE

A.1. Proof of Theorem 3.1

By its definition the MMLE maximizes the logarithm of the marginal likelihood function, which is given by

For the nonzero θ_{0,i} the relevant quantity is bounded by a multiple of log n, by Lemma B.1 (iv) and (v), while for the zero θ_{0,i} we use that var(θ_i | Y_i, τ) is bounded above by τ e^{Y_i²/2} for |Y_i| ≤ ζ_{τ_n}, and bounded above by 1 + Y_i² otherwise, by Lemma B.1 (vi) and (v). For the two cases of parameter values this gives bounds for E_{θ_{0,i}} sup_{τ ∈ [1/n, Cτ_n]} var(θ_i | Y_i, τ) of the same form as the bounds for the square bias, resulting in the overall bound p_n log n + (n − p_n)τ_n ζ_{τ_n} ≲ p_n log n for the sum of these variances. An application of Markov's inequality gives (B.4).

Fig 4 .
Fig 4. Upper bound on E_0 m_τ(Y) as computed with the R integrate() routine (solid line). The upper bound m_τ(y) ≤ y² was used for |y| > 500 for numerical stability. The dashed line shows the asymptotic value (C.4).
The empirical Bayes posterior is found by replacing τ in the posterior distribution Π(· | Y^n, τ) of θ relative to the prior (1.2) with a given τ by a data-based estimator τ̂_n; we denote this by Π_{τ̂_n}(· | Y^n). The full Bayes posterior Π(· | Y^n) is the ordinary posterior distribution of θ in the model where τ is also equipped with a prior and (1.2) is interpreted as the conditional prior of θ given τ.