On general maximum likelihood empirical Bayes estimation of heteroscedastic IID normal means

Abstract: We propose a general maximum likelihood empirical Bayes (GMLEB) method for the estimation of heteroscedastic normal means with known variances. The idea is to plug the generalized maximum likelihood estimator of the prior into the oracle Bayes rule. From the point of view of restricted empirical Bayes, the general empirical Bayes aims at a benchmark risk smaller than that of linear empirical Bayes methods when the unknown means are i.i.d. variables. We prove an oracle inequality which states that, under mild conditions, the regret of the GMLEB is of smaller order than $(\log n)^5/n$. The proof is based on a large deviation inequality for the generalized maximum likelihood estimator. The oracle inequality leads to the property that the GMLEB is adaptive minimax in $L_p$ balls when the order of the norm of the ball is larger than $((\log n)^{5/2}/\sqrt{n})^{1/(p\wedge 2)}$. We demonstrate the superb risk performance of the GMLEB through simulation experiments.


Introduction

Suppose we observe independent
$$X_i = \theta_i + \sigma_i \varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \qquad i = 1, \dots, n, \tag{1.1}$$
where the means $\theta_i$ are unknown and the standard deviations $\sigma_i$ are known. The problem is to estimate the mean vector $(\theta_1, \dots, \theta_n)$ under the average squared loss
$$\ell_n(\hat{\theta}, \theta) = \frac{1}{n}\sum_{i=1}^n (\hat{\theta}_i - \theta_i)^2. \tag{1.2}$$
This problem has been considered by many in the literature, including recent studies by [18] and [17]. However, while the existing studies are typically based on the shrinkage approach, our focus is on the general empirical Bayes [13,15], or equivalently nonparametric empirical Bayes [12].
In general empirical Bayes, the unknowns $\theta_i$ are typically treated as constants in the compound approach [13]. In a homoscedastic compound decision problem ($\sigma_i = \sigma$ for all $i$), the average risk of a separable rule $\hat\theta_i = t(X_i)$ is written as
$$\frac{1}{n}\sum_{i=1}^n E\{t(X_i)-\theta_i\}^2 = \iint \{t(x)-\theta\}^2 f(x|\theta,\sigma)\,dx\,dG_n(\theta), \tag{1.3}$$
where $f(x|\theta,\sigma)$ is the density of $N(\theta,\sigma^2)$ and $G_n$ is the empirical distribution of the $\theta_i$. Robbins [13,14] observed that the optimal solution of the above problem is the Bayes rule $t^*_{G_n,\sigma}(x) = E_{G_n}(\theta|X=x,\sigma)$. This can be viewed as the fundamental theorem of compound decisions, as it connects the compound problem to the Bayes approach. The idea is to plug in an estimate of $G_n$ to mimic the Bayes rule or its performance. In the presence of heteroscedasticity, the calculation in (1.3) no longer goes through, as the $X_i - \theta_i$ do not have the same distribution. In the heteroscedastic case with known $\sigma_i$, we may write
$$\frac{1}{n}\sum_{i=1}^n E\{t(X_i,\sigma_i)-\theta_i\}^2 = \iint \{t(x,\sigma)-\theta\}^2 f(x|\theta,\sigma)\,dx\,dG_n(\theta,\sigma), \tag{1.4}$$
where $G_n$ is the empirical distribution of the pairs $(\theta_i, \sigma_i)$. This still connects the compound problem to Bayes. However, the fundamental theorem fails in general in the presence of heteroscedasticity with observable $\sigma_i$, as the meaning and implication of putting a known quantity in the prior $G_n$ is unclear. Moreover, there may not be a sufficient sample size at each $\sigma$-value to allow sufficiently accurate estimation of a nonparametric unknown prior.
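For completeness, here is the one-line calculation behind (1.3): averaging the individual risks over $i$ turns a sum over the constants $\theta_i$ into an integral against their empirical distribution.

```latex
\frac{1}{n}\sum_{i=1}^n E\,\{t(X_i)-\theta_i\}^2
  = \frac{1}{n}\sum_{i=1}^n \int \{t(x)-\theta_i\}^2 f(x\mid\theta_i,\sigma)\,dx
  = \iint \{t(x)-\theta\}^2 f(x\mid\theta,\sigma)\,dx\,dG_n(\theta).
```

Hence the compound risk of any separable rule equals the Bayes risk of $t$ under the "prior" $G_n$, and it is minimized by the posterior mean $t^*_{G_n,\sigma}$.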
One plausible way out is to take the empirical Bayes view that the $\theta_i$ are i.i.d. variables with an unknown common prior $G$. Empirical Bayes methods can then be understood from the point of view of restricted empirical Bayes. Given a class of decision functions $D$, with oracular knowledge of $G$, the oracle benchmark is
$$R_D(G) = \inf_{t\in D}\,\frac{1}{n}\sum_{i=1}^n E_G\{t(X_i,\sigma_i)-\theta_i\}^2. \tag{1.5}$$
The regret of an estimator $\hat t_n$ is
$$r_{G,D}(\hat t_n) = \frac{1}{n}\sum_{i=1}^n E_G\{\hat t_n(X_i,\sigma_i)-\theta_i\}^2 - R_D(G).$$
The aim of restricted empirical Bayes is to seek $\hat t_n\in D$ satisfying the asymptotic optimality
$$r_{G,D}(\hat t_n) \to 0, \quad \text{as } n\to\infty. \tag{1.6}$$
Let $G$ be a normal distribution with mean $\mu$ and variance $\tau^2$. With $D$ being the class of all linear estimators, the optimal estimator in $D$ is the linear Bayes rule
$$t^*_D(x,\sigma) = \mu + \frac{\tau^2}{\tau^2+\sigma^2}\,(x-\mu),$$
and an estimator with consistently estimated $(\mu,\tau^2)$ approximates the optimal linear rule $t^*_D(x)$ in the sense of (1.6). In the heteroscedastic case, Xie, Kou and Brown [18] proposed to select an estimator from the class
$$\Big\{\hat\theta_i = \mu + \frac{\tau^2}{\tau^2+\sigma_i^2}\,(X_i-\mu):\ \mu\in\mathbb{R},\ \tau^2\ge 0\Big\}.$$
The parameters $\mu$ and $\tau^2$ are estimated by minimizing a Stein's unbiased risk estimate (SURE) function. Xie, Kou and Brown [18] also suggested a semiparametric shrinkage estimator of the form $\hat\theta_i = (1-b(\sigma_i^2))X_i + b(\sigma_i^2)\mu$, with the shrinkage factor $b(\cdot)$ a monotone function of $\sigma_i^2$. Both SURE estimators satisfy the asymptotic optimality (1.6). Since $\tau^2/(\tau^2+\sigma_i^2)$ is monotone in $\sigma_i^2$, any estimator of the former form is also of the latter form. Hence, the semiparametric SURE aims at a smaller benchmark risk than the parametric SURE.
Denote the density of the normal location mixture with mixing distribution $G$ and scale $\sigma$ by
$$f_{G,\sigma}(x) = \int \frac{1}{\sigma}\,\varphi\Big(\frac{x-u}{\sigma}\Big)\,dG(u), \tag{1.7}$$
where $\varphi(x)$ is the standard normal density. It is well known that for any prior $G$, the Bayes rule is given by Tweedie's formula [14,1,4]
$$t^*_G(x,\sigma) = x + \sigma^2\,\frac{f'_{G,\sigma}(x)}{f_{G,\sigma}(x)}, \tag{1.8}$$
and the corresponding benchmark is
$$R^*_n(G) = \frac{1}{n}\sum_{i=1}^n R(G,\sigma_i), \quad\text{where } R(G,\sigma) = E_G\{t^*_G(X,\sigma)-\theta\}^2 \tag{1.9}$$
is the Bayes risk for univariate estimation. The general empirical Bayes approach assumes no knowledge about the unknown prior $G$ but still aims to mimic the Bayes rule $t^*_G(\cdot,\sigma_i)$ in (1.8), or to approximately achieve the risk benchmark $R^*_n(G)$. Compared with the parametric and semiparametric methods, the general empirical Bayes is greedier, since it aims at the optimal estimator among all rules. There are two main strategies to approximate the Bayes rule in (1.8): modeling on the $\theta$ space, called "g-modeling", and modeling on the $x$ space, called "f-modeling". Efron [5] provided examples and summarized the advantages of both strategies. As demonstrated in [7] and [10], the compound decision problem is a favorable case for nonparametric g-modeling. Nonparametric g-modeling refers to estimating the unknown prior by the generalized MLE [10]
$$\hat G_n = \operatorname*{arg\,max}_{G\in\mathscr{G}}\,\frac{1}{n}\sum_{i=1}^n \log f_{G,\sigma_i}(X_i), \tag{1.10}$$
where $f_{G,\sigma}(x)$ is the mixture density as in (1.7) and $\mathscr{G}$ is the family of all distribution functions. The calculation of the generalized MLE is usually difficult.
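To make (1.7) and (1.8) concrete, the following minimal sketch evaluates the mixture density and the Bayes rule for a discrete prior on a grid; it uses the posterior-mean form of (1.8), which is equivalent to Tweedie's formula. The function names and the discrete-prior representation are ours, for illustration only.

```python
import numpy as np

def mixture_density(x, sigma, support, weights):
    """Mixture density f_{G,sigma}(x) of (1.7) for a discrete prior G
    that places `weights` on the grid points `support`."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - support[None, :]) / sigma
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (phi * weights[None, :]).sum(axis=1) / sigma

def tweedie_rule(x, sigma, support, weights):
    """Bayes rule t*_G(x, sigma) of (1.8), computed as the posterior
    mean E_G(theta | X = x) rather than via the derivative form."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - support[None, :]) / sigma
    post = np.exp(-0.5 * z**2) * weights[None, :]  # unnormalized posterior
    post /= post.sum(axis=1, keepdims=True)        # posterior of theta given x
    return post @ support                          # posterior mean
```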
Recently, Koenker and Mizera [10] proposed a convex optimization approach to computing the generalized MLE, which has proven to be efficient and accurate. The heteroscedastic option in the REBayes package [9] facilitates our research. Fu, James and Sun [6] also considered the general empirical Bayes method for the heteroscedastic normal means problem (1.1)-(1.2) with i.i.d. $\theta_i$. They suggested an f-modeling procedure to mimic the Bayes rule in (1.8) and proved its optimality in the sense of (1.6). Still, the heart of the question is whether the gain from aiming at the smaller benchmark risk is large enough to offset the additional cost of the nonparametric estimation. Our results affirm that when the $\theta_i$ are drawn from a common prior $G$, the proposed general maximum likelihood empirical Bayes (GMLEB) estimator realizes a risk reduction over linear methods. The rest of this paper is organized as follows. In Section 2 we provide an oracle inequality that gives non-asymptotic upper bounds for the regret of the GMLEB, together with some of its implications. In Section 3 we prove a large deviation inequality for the generalized MLE under the average Hellinger distance, which is a key element of the oracle inequality. Other elements leading to the oracle inequality are provided in Section 4. In Section 5 we present some simulation results. Mathematical proofs of theorems and lemmas are given either right after their statements or in Section 6.
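As a concrete, if naive, illustration of the g-modeling computation discussed above: the generalized MLE restricted to a fixed grid can be approximated by EM, each iteration of which increases the mixture likelihood. This is only a toy sketch; the grid size and iteration cap are arbitrary choices of ours, and the interior-point method of Koenker and Mizera [10] is far more efficient.

```python
import numpy as np

def gmle_em(X, sigmas, grid=None, n_iter=500):
    """Approximate the generalized MLE (1.10) by EM over a prior
    supported on a fixed grid (a slow stand-in for REBayes)."""
    X = np.asarray(X, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    if grid is None:
        grid = np.linspace(X.min(), X.max(), 300)   # arbitrary default grid
    w = np.full(grid.size, 1.0 / grid.size)         # uniform initial weights
    z = (X[:, None] - grid[None, :]) / sigmas[:, None]
    lik = np.exp(-0.5 * z**2) / sigmas[:, None]     # f(X_i | theta = u_j), up to a constant
    for _ in range(n_iter):
        post = lik * w[None, :]
        post /= post.sum(axis=1, keepdims=True)     # E-step: posterior of grid points
        w = post.mean(axis=0)                       # M-step: reweight the prior
    return grid, w
```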

Main results
In the remaining part of the paper, the unknown prior from which the $\theta_i$ are drawn is denoted by $G^*_n$. We assume that the variances are uniformly bounded, i.e., there exist constants $\sigma_l$ and $\sigma_u$ such that $\sigma_l \le \inf_n \min_i \sigma_i \le \sup_n \max_i \sigma_i \le \sigma_u$. In our analyses, we allow approximate solutions to (1.10). For definiteness and notational simplicity, the generalized MLE is any solution of
$$\prod_{i=1}^n f_{\hat G_n,\sigma_i}(X_i) \ge q_n \sup_{G\in\mathscr{G}}\,\prod_{i=1}^n f_{G,\sigma_i}(X_i), \tag{2.1}$$
where $q_n = (e\sqrt{2\pi}/n^2)\wedge 1$. The GMLEB estimator is defined as
$$\hat\theta_i = t^*_{\hat G_n}(X_i, \sigma_i), \qquad i = 1,\dots,n, \tag{2.2}$$
where $\hat G_n$ is any approximate generalized MLE satisfying (2.1) for the prior $G^*_n$ and $f_{G,\sigma}(x)$ is as in (1.7).
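To make the definition concrete, here is a hypothetical end-to-end sketch of the GMLEB built from the toy functions above (gmle_em and tweedie_rule are our illustrative names, not the paper's implementation): estimate the prior by the approximate generalized MLE, then plug it into the Bayes rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
theta = rng.normal(3.0, 1.0, size=n)          # i.i.d. draws from a prior G*_n
sigmas = rng.uniform(0.5, 1.5, size=n)        # known standard deviations
X = theta + sigmas * rng.standard_normal(n)   # observations (1.1)

grid, w = gmle_em(X, sigmas)                  # step 1: approximate generalized MLE
theta_hat = np.array([tweedie_rule(x, s, grid, w)[0]
                      for x, s in zip(X, sigmas)])  # step 2: plug-in rule (2.2)
print("average squared loss:", np.mean((theta_hat - theta) ** 2))
```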

An oracle inequality for the GMLEB
Let $\mu_p(G) = \{\int |u|^p\,dG(u)\}^{1/p}$ be the $p$-th absolute moment of a distribution function $G$. The convergence rate $\varepsilon_n$, as a function of the sample size $n$, the mixing distribution $G$, and the power $p$ of the absolute moment, is defined as
$$\varepsilon(n, G, p) = \max\left\{\sqrt{\frac{2\log n}{n}},\ \frac{\sqrt{\log n}\,\mu_p(G)}{n^{1/p}}\right\}.$$
Here is an outline of the proof of Theorem 1. First of all, one problem with analyzing the GMLEB is that the denominator $f_{\hat G_n,\sigma_i}$ in definition (2.2) could be arbitrarily small. In order to rule out that possibility, we define a regularized rule $t^*_{\hat G_n}(X_i,\sigma_i;\rho_n)$ which replaces this denominator with $f_{\hat G_n,\sigma_i}\vee(\rho_n/\sigma_i)$, and in Theorem 5 we show that this rule agrees with the GMLEB for suitably small $\rho_n$. In what follows, $x^*$ is the constant as in Theorem 4, and $d(\cdot,\cdot)$ is the average Hellinger distance defined in (3.2). The large deviation inequality in Theorem 4 and the analytical properties of the regularized Bayes rule in Lemma 2 provide an upper bound for the part of the regret on the event $d(\hat G_n, G^*_n) > x^*\varepsilon_n$. Because the generalized MLE is based on the same data, the remaining part of the regret is not separable. We use the following strategy. Let $(t^*_{H_j}(\cdot,\sigma_1;\rho_n), \dots, t^*_{H_j}(\cdot,\sigma_n;\rho_n))$, $j \le N$, be a set of approximated regularized Bayes rules, in the sense that it is a $(2\eta^*)$-net of the collection of all regularized Bayes rules under $\|\cdot\|_{\infty,M}$, where $\eta^*$ will be specified in Theorem 7. By the entropy bound in Theorem 7, such a collection of distributions $\{H_j\}$ exists with moderate cardinality $N$, so that the approximation error term $E_{G^*_n}\zeta^2_{3n}$ in (2.9) is small. Finally, Theorem 6 provides an upper bound on the regret due to the lack of knowledge of $G^*_n$, which implies that $E_{G^*_n}\zeta^2_{4n}$ in (2.10) is small. These upper bounds for the individual pieces $E_{G^*_n}\zeta^2_{jn}$ are put together to complete the proof.

Consequences of the oracle inequality
Theorem 2. Suppose that under $P_{G^*_n}$, $\theta_1, \dots, \theta_n$ are i.i.d. random variables from a distribution $G^*_n$, and given the $\theta_i$'s, $X_i \sim N(\theta_i, \sigma_i^2)$ are independent observations with known variances. Then the GMLEB asymptotically attains, for a class of distributions $\mathscr{G}$, the minimax risk for the average squared loss,
$$\mathscr{R}_n(\mathscr{G}) = \inf_{\hat\theta}\,\sup_{G\in\mathscr{G}}\,\frac{1}{n}\sum_{i=1}^n E_G(\hat\theta_i-\theta_i)^2,$$
and this holds uniformly for a range of sequences $\{\mathscr{G}_n, n \ge 1\}$ of distribution classes. For positive $p$ and $C$, the $L_p$ balls of distribution functions are defined as
$$\mathscr{G}_{p,C} = \{G:\ \mu_p(G) \le C\}.$$
Donoho and Johnstone [3] proved the corresponding minimax rate as $C_n \to 0$; the maximum risk of the GMLEB over $\mathscr{G}_{p,C_n}$ attains it asymptotically without knowledge of $p$ or $C_n$. This is the adaptive minimaxity in $\mathscr{G}_{p,C_n}$.

A large deviation inequality for the generalized MLE
In [7], the analysis of the risk is divided into two parts: one outside a Hellinger neighborhood $d(f_{\hat G_n}, f_{G^*_n}) \le x\varepsilon_n$, the other inside this neighborhood. An essential ingredient is a large deviation inequality for $d(f_{\hat G_n}, f_{G^*_n})$. In the heteroscedastic case, it seems that a certain omnibus distance between the $f_{\hat G_n,\sigma_i}$ and $f_{G^*_n,\sigma_i}$ should be used. We use the average Hellinger distance $d(\hat G_n, G^*_n)$ as defined in (3.2) below, and provide a large deviation inequality for it. This result plays a crucial role in the oracle inequality stated in Theorem 1.
Define the collection of $n$-dimensional vectors of marginal densities as
$$\mathscr{F}_n = \big\{(f_{G,\sigma_1}(x), \dots, f_{G,\sigma_n}(x)):\ G \in \mathscr{G}\big\},$$
where $\mathscr{G}$ is the family of all distribution functions. For two vectors $(f_{G,\sigma_1}(x), \dots, f_{G,\sigma_n}(x))$, $(f_{H,\sigma_1}(x), \dots, f_{H,\sigma_n}(x)) \in \mathscr{F}_n$, define the average Hellinger distance
$$d^2(G, H) = \frac{1}{n}\sum_{i=1}^n h^2(f_{G,\sigma_i}, f_{H,\sigma_i}), \tag{3.2}$$
where $h^2(f,g) = \int \{\sqrt{f(x)} - \sqrt{g(x)}\}^2\,dx$ is the square of the Hellinger distance between probability densities $f$ and $g$. Define the supremum norm on bounded intervals as
$$\|h\|_{\infty,M} = \max_{1\le i\le n}\,\sup_{|x|\le M} |h_i(x)|,$$
where $h = (h_1(x), \dots, h_n(x))$ is an $n$-dimensional vector of functions.
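A minimal numerical sketch of (3.2) for discrete priors, approximating each univariate Hellinger integral by a Riemann sum (the integration grid is an arbitrary choice of ours; mixture_density is the illustrative helper defined earlier):

```python
import numpy as np

def avg_hellinger(sigmas, support, w_G, w_H, x_grid=None):
    """Average Hellinger distance d(G, H) of (3.2) between two discrete
    priors on `support` with weights `w_G` and `w_H`."""
    if x_grid is None:
        pad = 8.0 * float(np.max(sigmas))
        x_grid = np.linspace(support.min() - pad, support.max() + pad, 4000)
    dx = x_grid[1] - x_grid[0]
    h2 = 0.0
    for s in sigmas:
        fG = mixture_density(x_grid, s, support, w_G)
        fH = mixture_density(x_grid, s, support, w_H)
        h2 += np.sum((np.sqrt(fG) - np.sqrt(fH)) ** 2) * dx  # squared Hellinger
    return np.sqrt(h2 / len(sigmas))
```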

Theorem 4.
Suppose that under $P_{G^*_n}$, $\theta_1, \dots, \theta_n$ are i.i.d. random variables from a distribution $G^*_n$, and given the $\theta_i$'s, $X_i \sim N(\theta_i, \sigma_i^2)$ are independent observations with known variances. Let $f_{G,\sigma}$ be as in (1.7). Let $\hat G_n$ be an approximate generalized MLE satisfying (2.1). Then, there exists a universal constant $x^*$ such that for all $t \ge x^*$ and $\log n > 1/p$,
$$P_{G^*_n}\big\{d(\hat G_n, G^*_n) \ge t\,\varepsilon(n, G^*_n, p)\big\} \le e^{-t^2 n\,\varepsilon^2(n, G^*_n, p)}.$$

Other elements of the oracle inequality
In this section we provide other elements of the oracle inequality in Theorem 1. We divide this section into four subsections to study: (1) the connection between the GMLEB and the regularized rule, (2) some analytical properties of the regularized Bayes estimator, (3) regret of a regularized Bayes estimator with a misspecified prior, and (4) an entropy bound for regularized Bayes rules.
For the Bayes rule $t^*_G(x,\sigma) = x + \sigma^2 f'_{G,\sigma}(x)/f_{G,\sigma}(x)$, we may want to avoid dividing by a near-zero quantity. Define the regularized Bayes rule as
$$t^*_G(x,\sigma;\rho) = x + \sigma^2\,\frac{f'_{G,\sigma}(x)}{f_{G,\sigma}(x)\vee(\rho/\sigma)}. \tag{4.1}$$
Let $F$ be the scale change of $G$ under the parameter $\sigma$,
$$F(u) = G(\sigma u), \tag{4.2}$$
and write $t^*_F(y) = t^*_F(y,1)$ and $t^*_F(y;\rho) = t^*_F(y,1;\rho)$ as the Bayes and regularized Bayes rules for the unit-variance normal means problem. With $y = x/\sigma$, by the definition of $F$, we have $t^*_G(x,\sigma)/\sigma = t^*_F(y)$ and $t^*_G(x,\sigma;\rho)/\sigma = t^*_F(y;\rho)$. This is the scale invariance of the Bayes and regularized Bayes rules.
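A sketch of (4.1) for a discrete prior, with the denominator floored at $\rho/\sigma$; this one implements the derivative form of Tweedie's formula directly. The scale invariance can be checked numerically: regularized_tweedie(x, s, grid, w, rho) / s agrees with regularized_tweedie(x / s, 1.0, grid / s, w, rho).

```python
import numpy as np

def regularized_tweedie(x, sigma, support, weights, rho):
    """Regularized Bayes rule (4.1): Tweedie's formula with the
    denominator f_{G,sigma}(x) floored at rho / sigma."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - support[None, :]) / sigma
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    f = (phi * weights[None, :]).sum(axis=1) / sigma             # f_{G,sigma}(x)
    fp = (phi * weights[None, :] * (-z)).sum(axis=1) / sigma**2  # f'_{G,sigma}(x)
    return x + sigma**2 * fp / np.maximum(f, rho / sigma)
```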

Connection between the GMLEB and the regularized rule
The connection between the GMLEB estimator (2.2) and the regularized Bayes rule in (4.1) is that the likelihood inequality (2.1), with $0 < q_n \le 1$, keeps the fitted marginal densities $f_{\hat G_n,\sigma_i}(X_i)$ from being too small, so that the GMLEB coincides with a suitably regularized rule at the observations. This is a consequence of the following theorem.

Some properties of the regularized Bayes estimator
In this subsection we give some analytical properties of the regularized Bayes estimator. Denote the inverse function of $y = \varphi(x)$, $x \ge 0$, by
$$\varphi^{-1}(y) = \sqrt{-\log(2\pi y^2)}, \qquad 0 < y \le 1/\sqrt{2\pi}.$$
The main message is that, with a regularization level $\rho_n$ of order a negative power of $n$, the regularized Bayes rule never moves an observation by more than $c_0\sigma\sqrt{\log n}$ for some constant $c_0$. This is established in the following lemmas.

Regret of a regularized Bayes estimator with a misspecified prior
Let $F_i$ and $F^*_i$ be the scale changes of $G$ and $G^*_n$ under the parameter $\sigma_i$ according to (4.2), respectively, and consider the regret of the unit-variance regularized Bayes rule with the misspecified prior $F_i$ in place of $F^*_i$ (4.14). Then, by Theorem 3 of [7] and Lemma 6.1 of [19], for all $0 < \rho \le (2\pi e^2)^{-1/2}$ and $x_0 > 0$, this regret is bounded by a multiple of the squared Hellinger distance $h^2(f_{F_i}, f_{F^*_i})$, up to factors depending on $\rho$ and $x_0$, where $M_0$ is a universal constant. Note that the Hellinger distance is invariant under the scale change: $h(f_{G,\sigma_i}, f_{G^*_n,\sigma_i}) = h(f_{F_i}, f_{F^*_i})$. Thus we have the following risk bound for the regularized Bayes rule with a misspecified prior, which will be used to bound $E_{G^*_n}\zeta^2_{4n}$ in (2.10). Theorem 6. For any $0 < \rho \le (2\pi e^2)^{-1/2}$ and $x_0 > 0$, the regret of the regularized Bayes rule $t^*_G(\cdot,\sigma_i;\rho)$ under a misspecified prior $G$ is controlled by the average squared Hellinger distance $d^2(G, G^*_n)$.

An entropy bound for regularized Bayes rules
We now provide an entropy bound for collections of regularized Bayes rules. It is used to bound $E_{G^*_n}\zeta^2_{3n}$ in (2.9) with a Gaussian isoperimetric inequality. For any family $\mathscr{H}$ of functions and semi-distance $d$, the $\eta$-covering number is
$$N(\eta, \mathscr{H}, d) = \min\Big\{N:\ \exists\, h_1, \dots, h_N \text{ such that } \min_{j\le N} d(h, h_j) \le \eta \ \ \forall\, h \in \mathscr{H}\Big\}.$$
For each fixed $\rho > 0$, define the collection of the regularized Bayes rules $t^*_G(x;\rho)$ in (4.1) as
$$\mathscr{T}_\rho = \big\{(t^*_G(\cdot,\sigma_1;\rho), \dots, t^*_G(\cdot,\sigma_n;\rho)):\ G \in \mathscr{G}\big\}, \tag{4.18}$$
where $\mathscr{G}$ is the family of all distribution functions.

Numerical studies
In order to investigate the adaptivity of the GMLEB to different heteroscedastic mean vectors, we carry out a simulation study. In Table 1, the $\theta_i$ are drawn from the priors specified in the table. We report the sum of squared loss $\sum_i (\hat\theta_i - \theta_i)^2$ for $n = 1000$, averaged over 100 replications. We display the simulation results for six estimators: the extended James-Stein [2], the shrinkage estimator SURE-M and the semiparametric shrinkage estimator SURE-SG [18], the group-linear method [17], the NEST [6], and the GMLEB. We also display Oracle, the risk of the oracle Bayes rule $t^*_{G^*_n}(\cdot,\sigma_i)$ in (1.8). In each column, the boldface entry marks the best performer. The sum of squared loss of the GMLEB is the smallest among the reported estimators and tracks the oracle risk very well. Indeed, here the oracle Bayes rule in (1.8) is nonlinear.
In Table 2, we report another simulation with independent $\theta_i$ and $\sigma_i^2$. The means are generated by $\theta_i \sim (1-p)\delta_0 + pN(3,\tau^2)$, where $\delta_u$ is the degenerate distribution at $u$. We set $p = 0.2$ to $0.8$ in increments of $0.2$, and $\tau^2 = 0.1$, $1$ or $10$. The GMLEB is the best performer across all combinations.
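A minimal sketch of one cell of the Table 2 design, run with the toy GMLEB pipeline above; the $\sigma_i$ design below is our assumption for illustration and may differ from the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, tau2 = 1000, 0.2, 1.0                   # one (p, tau^2) cell of Table 2

is_signal = rng.random(n) < p                 # theta_i ~ (1-p) delta_0 + p N(3, tau^2)
theta = np.where(is_signal, rng.normal(3.0, np.sqrt(tau2), size=n), 0.0)
sigmas = rng.uniform(0.5, 1.5, size=n)        # illustrative variance design
X = theta + sigmas * rng.standard_normal(n)

grid, w = gmle_em(X, sigmas)
theta_hat = np.array([tweedie_rule(x, s, grid, w)[0] for x, s in zip(X, sigmas)])
print("sum of squared loss:", np.sum((theta_hat - theta) ** 2))
```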
Proofs

As we have described in the outline, we divide the proof into four steps.