On improved predictive density estimation with parametric constraints

Abstract: We consider the problem of predictive density estimation for normal models under Kullback-Leibler loss (KL loss) when the parameter space is constrained to a convex set. More particularly, we assume that $X \sim N_p(\mu, v_x I)$ is observed and that we wish to estimate the density of $Y \sim N_p(\mu, v_y I)$ under KL loss when $\mu$ is restricted to the convex set $C \subset \mathbb{R}^p$. We show that the best unrestricted invariant predictive density estimator $\hat{p}_U$ is dominated by the Bayes estimator $\hat{p}_{\pi_C}$ associated to the uniform prior $\pi_C$ on $C$. We also study so-called plug-in estimators, giving conditions under which domination of one estimator of the mean vector $\mu$ over another under the usual quadratic loss translates into a domination result for certain corresponding plug-in density estimators under KL loss. Risk comparisons and domination results are also given for plug-in estimators versus Bayes predictive density estimators. Additionally, minimaxity and domination results are given for the cases where: (i) $C$ is a cone, and (ii) $C$ is a ball.


Introduction
We consider the problem of predictive density estimation for Gaussian models under Kullback-Leibler loss (KL loss) when the parameter space is constrained to a convex set. More precisely, let $X|\mu \sim N_p(\mu, v_x I)$ and $Y|\mu \sim N_p(\mu, v_y I)$ be two independent normal random vectors with common unknown mean $\mu$, which we assume to be restricted to a convex set $C \subset \mathbb{R}^p$. The scale parameters $v_x$ and $v_y$ are assumed to be known, and we denote by $p(x|\mu, v_x)$ and $p(y|\mu, v_y)$ the conditional densities of $X$ and $Y$.
Under the above restriction, we seek to determine efficient predictive density estimators $\hat{p}(y|x)$ of the density $p(y|\mu, v_y)$, based on observing only $X = x$, relative to the Kullback-Leibler loss
$$L(\mu, \hat{p}(\cdot|x)) = \int_{\mathbb{R}^p} p(y|\mu, v_y) \log \frac{p(y|\mu, v_y)}{\hat{p}(y|x)} \, dy \qquad (1.1)$$
and the associated Kullback-Leibler (KL) risk
$$R_{KL}(\mu, \hat{p}) = \int_{\mathbb{R}^p} p(x|\mu, v_x) \, L(\mu, \hat{p}(\cdot|x)) \, dx. \qquad (1.2)$$
This model was considered by George, Liang and Xu [6] when the mean $\mu$ is unrestricted, that is, $\mu \in \mathbb{R}^p$. The reference density is the generalized Bayes predictive density $\hat{p}_U(y|x)$ associated to the noninformative prior on $\mathbb{R}^p$, $\pi(\mu) \equiv 1$. Its expression may be derived from a more general result due to Aitchison (see [1]), as the conditional density of $Y$ given $X = x$ associated with prior measure $\pi$, and given by
$$\hat{p}_\pi(y|x) = \frac{\int_{\mathbb{R}^p} p(x|\mu, v_x) \, p(y|\mu, v_y) \, \pi(\mu) \, d\mu}{\int_{\mathbb{R}^p} p(x|\mu, v_x) \, \pi(\mu) \, d\mu}. \qquad (1.3)$$
Komaki [10] noticed that expression (1.3) with $\pi(\mu) \equiv 1$ reduces to
$$\hat{p}_U(y|x) = \frac{1}{(2\pi (v_x + v_y))^{p/2}} \exp\left( -\frac{\|y - x\|^2}{2 (v_x + v_y)} \right), \qquad (1.4)$$
and Murray [17] showed that $\hat{p}_U$ is best invariant with constant risk, under translations and nonsingular linear transformations. For a location family, Ng [18] extended this invariance property and Liang and Barron [12] proved that $\hat{p}_U$ is minimax.
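As a sanity check on the constant-risk property mentioned above, the following minimal sketch compares a Monte Carlo estimate of the KL risk of $\hat{p}_U$ with the closed form $(p/2)\log((v_x+v_y)/v_y)$. It assumes the standard form $\hat{p}_U(\cdot|x) = N_p(x, (v_x+v_y)I)$; the function names are illustrative and do not come from the paper.

```python
import math
import random


def kl_normal_isotropic(mu, var_true, center, var_est):
    """KL( N_p(mu, var_true*I) || N_p(center, var_est*I) ), closed form."""
    p = len(mu)
    sq_dist = sum((m - c) ** 2 for m, c in zip(mu, center))
    return 0.5 * (p * (var_true / var_est - 1.0 + math.log(var_est / var_true))
                  + sq_dist / var_est)


def kl_risk_p_u_closed_form(p, v_x, v_y):
    """Constant KL risk of p_U (assumed form): (p/2) * log((v_x + v_y) / v_y)."""
    return 0.5 * p * math.log((v_x + v_y) / v_y)


def kl_risk_p_u_monte_carlo(mu, v_x, v_y, n_draws=20000, seed=0):
    """Average the KL loss of N_p(x, (v_x+v_y)*I) over draws X ~ N_p(mu, v_x*I)."""
    rng = random.Random(seed)
    sd_x = math.sqrt(v_x)
    total = 0.0
    for _ in range(n_draws):
        x = [rng.gauss(m, sd_x) for m in mu]
        total += kl_normal_isotropic(mu, v_y, x, v_x + v_y)
    return total / n_draws


if __name__ == "__main__":
    for mu in ([0.0, 0.0, 0.0], [5.0, -2.0, 1.0]):
        mc = kl_risk_p_u_monte_carlo(mu, v_x=1.0, v_y=0.5)
        exact = kl_risk_p_u_closed_form(len(mu), 1.0, 0.5)
        print(mu, round(mc, 3), round(exact, 3))
```

The two values of $\mu$ should give nearly the same Monte Carlo risk, which is what a constant-risk estimator predicts.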
In the normal case, George, Liang and Xu [6] gave a simple direct proof of the minimaxity of $\hat{p}_U$ in (1.4) and showed, among other results, that $\hat{p}_U$ can be improved by Bayes predictive densities $\hat{p}_\pi$ under superharmonic priors $\pi$ for $p \geq 3$, thus adding to previous findings due to Komaki [10]. We refer to a recent review of Bayesian predictive estimation by George and Xu [7] for further exposition and description of recent research in this area.
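We will make use of the following key representation for Bayesian estimators given by George, Liang and Xu. Hereafter, we let $W = (v_y X + v_x Y)/(v_x + v_y)$ and $v_w = v_x v_y/(v_x + v_y)$, and we let $m_\pi(W; v_w)$ and $m_\pi(X; v_x)$ be the marginal densities of $W$ and $X$ respectively under prior $\pi$. The Bayes predictive density associated with $\pi$ then satisfies
$$\hat{p}_\pi(y|x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)} \, \hat{p}_U(y|x), \qquad (1.5)$$
where $w = (v_y x + v_x y)/(v_x + v_y)$ and $\hat{p}_U(\cdot|X)$ is the Bayes estimator associated with the uniform prior on $\mathbb{R}^p$ given by (1.4).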
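The representation of George, Liang and Xu can be checked numerically in a case where everything is available in closed form. The sketch below takes $p = 1$ and a normal prior $\pi = N(0, \tau)$, and assumes the usual definitions $W = (v_y X + v_x Y)/(v_x + v_y)$ and $v_w = v_x v_y/(v_x + v_y)$; the normal prior and all function names are illustrative choices, not part of the paper.

```python
import math


def normal_pdf(t, mean, var):
    """Density of N(mean, var) evaluated at t."""
    return math.exp(-((t - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def predictive_direct(y, x, v_x, v_y, tau):
    """Bayes predictive density under the prior N(0, tau), from the posterior of mu."""
    post_mean = tau * x / (tau + v_x)
    post_var = tau * v_x / (tau + v_x)
    return normal_pdf(y, post_mean, post_var + v_y)


def predictive_via_marginals(y, x, v_x, v_y, tau):
    """Same density written as m(w; v_w) / m(x; v_x) * p_U(y | x)."""
    v_w = v_x * v_y / (v_x + v_y)
    w = (v_y * x + v_x * y) / (v_x + v_y)
    m_w = normal_pdf(w, 0.0, tau + v_w)   # marginal of W under N(0, tau)
    m_x = normal_pdf(x, 0.0, tau + v_x)   # marginal of X under N(0, tau)
    p_u = normal_pdf(y, x, v_x + v_y)     # flat-prior predictive density
    return m_w / m_x * p_u


if __name__ == "__main__":
    for (x, y) in [(0.0, 0.0), (1.0, 0.0), (-2.0, 3.5)]:
        a = predictive_direct(y, x, v_x=1.0, v_y=1.0, tau=1.0)
        b = predictive_via_marginals(y, x, v_x=1.0, v_y=1.0, tau=1.0)
        print(x, y, a, b)
```

The two columns printed for each $(x, y)$ should agree up to floating-point error.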
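It is also shown that, for any prior $\pi$, the difference between the KL risks of $\hat{p}_U(\cdot|x)$ and the Bayesian predictive density $\hat{p}_\pi(\cdot|x)$ is given by
$$R_{KL}(\mu, \hat{p}_U) - R_{KL}(\mu, \hat{p}_\pi) = E_{\mu, v_w}[\log m_\pi(W; v_w)] - E_{\mu, v_x}[\log m_\pi(X; v_x)], \qquad (1.6)$$
where $E_{\mu,v}$ stands for the expectation with respect to the normal distribution $N_p(\mu, vI)$.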
George, Liang and Xu underlined the fact that there exists a parallel between this predictive density estimation problem and the estimation of the mean vector $\mu$ under quadratic loss $\|\hat{\mu} - \mu\|^2$, giving rise to the quadratic risk
$$R_Q^v(\mu, \hat{\mu}) = E_{\mu,v}\left[\|\hat{\mu}(X) - \mu\|^2\right]. \qquad (1.7)$$
More precisely, they show that the predictive density $\hat{p}_U$ plays a role similar to that of the standard estimator $X$ of $\mu$, which is best invariant and minimax under the quadratic risk (1.7), but inadmissible for $p > 2$.
Our findings here involve the elaboration and the use of similar connections between the risks $R_Q^v$ and $R_{KL}$ to draw inferences regarding domination and minimaxity for predictive density estimation problems when the mean $\mu$ is restricted to a convex set $C$. These findings parallel, and rely on, several results applicable to estimating a bounded multivariate mean with risk $R_Q$ ([2,5,8,14,15,20], among others). In Section 2, using a result of Brown, George and Xu [3], our first result formalizes a link between the two risks $R_{KL}$ and $R_Q^v$, the quantities involving $R_{KL}$ being expressed as integrated quantities involving $R_Q^v$. This link is used to express the risk differences between Bayes and plug-in estimators and between Bayes estimators as well.
In Section 3, various applications are given for restricted parameter spaces $C$. First, referring to Hartigan [8], who showed that the Bayes estimator of $\mu$ with respect to the uniform prior $\pi_C$ on a convex set $C$ with a non-empty interior dominates $X$ under the quadratic risk (1.7), we obtain, via two different paths, a similar result for the domination of $\hat{p}_{\pi_C}$ over $\hat{p}_U$. We also show that our proof of dominance for KL loss implies dominance for quadratic loss, thus providing an alternative proof of Hartigan's result. Secondly, we turn our attention to plug-in densities. When an estimator $\delta_1$ for $\mu \in C$ is dominated by a Bayes estimator $\hat{\mu}_{\pi,v}$, associated to a prior $\pi$ and to a scale factor $v$, we give conditions under which the Bayes predictive density estimator $\hat{p}_\pi$ dominates the plug-in density estimator $\hat{p}_1(X) \sim N_p(\delta_1(X), v_y I_p)$. In the case when $p = 1$, we apply these results to obtain improvements on the plug-in maximum likelihood estimator $\hat{p}_{mle}(X) \sim N_p(\delta_{mle}(X), v_y I_p)$.
In Section 4, we deal with the minimaxity of Bayes predictive density estimators when $\mu$ is restricted to a ball or to a cone. As a specific result, we show that, when $\|\mu\| \leq m$, the boundary uniform Bayes estimator is minimax for risk $R_{KL}$ whenever $m \leq c_0(p)\sqrt{v_w}$, where $c_0(p)$ is the constant given by Berry [2] and Marchand and Perron [15]. When $\mu$ belongs to a convex cone $C$ with non-empty interior, we prove that the unrestricted predictive density estimator $\hat{p}_U$ in (1.4) remains minimax (as it is when no restriction is assumed) for risk $R_{KL}$ when $\mu \in C$. This finding parallels the result of Tsukuma and Kubokawa [20], who established that $X$ is still minimax for estimating $\mu$ under the restriction to a polyhedral cone. In Section 5 we expand on some additional considerations concerning plug-in estimators. Section 6 contains concluding remarks, and Section A is an appendix with details on some of the proofs.

Context and preliminary results
In this section, we expand upon a link between estimation under the risks $R_Q^v$ and the risk $R_{KL}$. The following lemma and theorem concern plug-in estimators $\hat{p}_1 \sim N_p(\delta_1(X), v_y I_p)$ and Bayesian estimators $\hat{p}_\pi(\cdot|X)$. Theorem 2.1 provides a useful expression for an $R_{KL}$ risk difference in terms of integrated $R_Q^v$ risk differences.
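Lemma 2.1. For a prior $\pi$, and a plug-in estimator $\hat{p}_1 \sim N_p(\delta_1(X), v_y I_p)$, we have
(a) $$E_{\mu,v_w}[\log m_\pi(W; v_w)] - E_{\mu,v_x}[\log m_\pi(X; v_x)] = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q^v(\mu, X) - R_Q^v(\mu, \hat{\mu}_{\pi,v}) \right] dv;$$
(b) $$R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_U) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q^{v_x}(\mu, \delta_1) - R_Q^v(\mu, X) \right] dv, \qquad (2.1)$$
where $\hat{\mu}_{\pi,v}$ is the Bayes estimator of $\mu$ under prior $\pi$ for risk $R_Q^v$ and $\hat{p}_U$ is, as above, the generalized Bayes estimator with respect to the flat prior on $\mathbb{R}^p$.
Proof. Part (a) is taken from Brown, George, and Xu ([3], Theorem 1). For part (b), we need to show that
$$R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_U) = \frac{1}{2 v_y} R_Q^{v_x}(\mu, \delta_1) - \frac{p}{2} \log \frac{v_x}{v_w},$$
which indeed matches (2.1), since $\int_{v_w}^{v_x} v^{-2} \, dv = 1/v_w - 1/v_x = 1/v_y$ and $R_Q^v(\mu, X) = pv$.
Theorem 2.1. For a plug-in estimator $\hat{p}_1 \sim N_p(\delta_1(X), v_y I_p)$ and for Bayes estimators $\hat{p}_\pi$ and $\hat{p}_{\pi'}$, we have
(a) $$R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_\pi) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q^{v_x}(\mu, \delta_1) - R_Q^v(\mu, \hat{\mu}_{\pi,v}) \right] dv;$$
(b) $$R_{KL}(\mu, \hat{p}_{\pi'}) - R_{KL}(\mu, \hat{p}_\pi) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q^v(\mu, \hat{\mu}_{\pi',v}) - R_Q^v(\mu, \hat{\mu}_{\pi,v}) \right] dv.$$
Proof. Part (b) follows immediately from part (a) (or, alternatively, from (1.6) and part (a) of Lemma 2.1). From the definition of the risk $R_{KL}$, we have for (a):
$$R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_\pi) = E\left[ \log \frac{\hat{p}_\pi(Y|X)}{\hat{p}_1(Y|X)} \right],$$
where the expectation is taken with respect to the joint distribution of $(X, Y)$. Upon applying (1.5), we obtain equivalently
$$E_{\mu,v_w}[\log m_\pi(W; v_w)] - E_{\mu,v_x}[\log m_\pi(X; v_x)] + R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_U),$$
and the result now follows from Lemma 2.1.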
Note that Corollary 1 of Brown et al. [3] is an integrated version of (b). We will make use of Theorem 2.1 to obtain dominance results applicable for risk $R_{KL}$ by working inside the above integrals and relying on the associated comparisons for the risks $R_Q^v$. As an illustration, take $\delta_1(X) = X$ and $\pi$ to be the flat prior on $\mathbb{R}^p$. We have $R_Q^{v_x}(\mu, X) = p v_x$, $R_Q^v(\mu, \hat{\mu}_{\pi,v}) = pv$ and, from part (a) of Theorem 2.1:
$$R_{KL}(\mu, \hat{p}_1) - R_{KL}(\mu, \hat{p}_U) = \frac{p}{2} \int_{v_w}^{v_x} \frac{v_x - v}{v^2} \, dv,$$
which is positive. The finding that $\hat{p}_U$ dominates $\hat{p}_1$ is not new of course (e.g., Aitchison [1]), but the objective here was rather to illustrate Theorem 2.1 above.
Finally, we expand on some definitions and notations with respect to convex sets and cones in $\mathbb{R}^p$. A subset $C$ of $\mathbb{R}^p$ will be called a (positively homogeneous) cone if it is closed under positive scalar multiplication, i.e., $\alpha x \in C$ when $x \in C$ and $\alpha > 0$ (e.g., [19]). In the above, $C$ is a cone with vertex the origin. More generally, for such a $C$ and for any $g \in \mathbb{R}^p$, the set $C_g = C + g$ is an affine cone with vertex $g$, where we adopt hereafter the notation: for $\alpha \in \mathbb{R}$, $\theta \in \mathbb{R}^p$ and $A \subset \mathbb{R}^p$, $\alpha A + \theta = \{\alpha a + \theta \mid a \in A\}$.
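The illustration with $\delta_1(X) = X$ given earlier in this section can be verified numerically. The sketch below assumes the closed forms $R_{KL}(\mu, \hat{p}_1) = p v_x/(2 v_y)$ for the plug-in $N_p(X, v_y I_p)$ and $R_{KL}(\mu, \hat{p}_U) = (p/2)\log(v_x/v_w)$, and compares their difference with a midpoint-rule evaluation of $(p/2)\int_{v_w}^{v_x} (v_x - v)/v^2 \, dv$. Function names are illustrative.

```python
import math


def risk_plug_in_x(p, v_x, v_y):
    """KL risk of the plug-in N_p(X, v_y*I): E||X - mu||^2 / (2*v_y)."""
    return p * v_x / (2.0 * v_y)


def risk_p_u(p, v_x, v_y):
    """KL risk of p_U: (p/2) * log(v_x / v_w), with v_w = v_x*v_y/(v_x+v_y)."""
    v_w = v_x * v_y / (v_x + v_y)
    return 0.5 * p * math.log(v_x / v_w)


def integrated_quadratic_gap(p, v_x, v_y, n=20000):
    """Midpoint rule for (p/2) * integral over (v_w, v_x) of (v_x - v) / v^2 dv."""
    v_w = v_x * v_y / (v_x + v_y)
    h = (v_x - v_w) / n
    total = 0.0
    for i in range(n):
        v = v_w + (i + 0.5) * h
        total += (v_x - v) / (v * v)
    return 0.5 * p * total * h


if __name__ == "__main__":
    p, v_x, v_y = 3, 1.0, 0.5
    lhs = risk_plug_in_x(p, v_x, v_y) - risk_p_u(p, v_x, v_y)
    rhs = integrated_quadratic_gap(p, v_x, v_y)
    print(lhs, rhs)
```

The integrand is positive on $(v_w, v_x)$, so the integral form makes the sign of the risk difference visible at a glance, which is the point of working inside the integral.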

Improving on $\hat{p}_U$
For the problem of estimating $\mu$ under risk $R_Q^{v_x}$, with $\mu$ restricted to a convex subset $C$ of $\mathbb{R}^p$ with a non-empty interior, but otherwise arbitrary, Hartigan [8] showed quite generally that the Bayes estimator with respect to a uniform prior on $C$ dominates the estimator $X$. We obtain an analogous result here for the predictive density estimation problem with risk $R_{KL}$. A first proof follows from Hartigan's result and Theorem 2.1. A second proof, with a much different flavor, circumvents the explicit use of Hartigan's result and Theorem 2.1 and follows, surprisingly, a more direct route to establishing the result. The second part of this subsection parallels a result by Kubokawa [11] in providing a class of priors for which we obtain Bayesian improvements on $\hat{p}_U$ in the univariate case with $\mu$ bounded to an interval.
Theorem 3.1. Let $C \subset \mathbb{R}^p$ be a convex set with a non-empty interior, and let $\pi_C(\mu) = \mathbb{1}_C(\mu)$ be the noninformative prior restricted to $C$. Then $\hat{p}_{\pi_C}$ dominates $\hat{p}_U$.
First proof. We apply part (b) of Theorem 2.1 with $\hat{p}_{\pi'} \equiv \hat{p}_U$, $\hat{p}_\pi \equiv \hat{p}_{\pi_C}$, $\hat{\mu}_{\pi',v}(X) = X$ for all $v$, and $\hat{\mu}_{\pi,v}(X)$ the Bayes estimator of $\mu$ for $R_Q^v$ under prior $\pi_C$. We infer from Hartigan [8] that $R_Q^v(\mu, \hat{\mu}_{\pi,v}) \leq R_Q^v(\mu, X)$ for all $\mu \in C$ and all $v > 0$, with equality iff $C$ is a cone and $\mu$ is its vertex. Making use of this and Theorem 2.1, the result follows immediately.
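Second proof. Using the expression of the risk difference in (1.6), the risk difference between $\hat{p}_U$ and $\hat{p}_{\pi_C}$ equals
$$\triangle R(\mu) = E_{\mu,v_w}\left[ \log \int_C p(W|c, v_w) \, dc \right] - E_{\mu,v_x}\left[ \log \int_C p(X|c, v_x) \, dc \right]$$
by expressing the marginals $m_{\pi_C}$. Then, applying the successive changes of variables $z = (w - \mu)/\sqrt{v_w}$ and $u = (c - \mu)/\sqrt{v_w}$ in the first term, and $z = (x - \mu)/\sqrt{v_x}$ and $u = (c - \mu)/\sqrt{v_x}$ in the second term, we obtain
$$\triangle R(\mu) = E_{0,1}\left[ \log \int_{(C - \mu)/\sqrt{v_w}} \phi_p(u - Z) \, du - \log \int_{(C - \mu)/\sqrt{v_x}} \phi_p(u - Z) \, du \right],$$
where $\phi_p$ denotes the density of $N_p(0, I_p)$. We need to prove that, for any $\mu \in C$, $\triangle R(\mu) \geq 0$. Let $\mu \in C$. Clearly it suffices to show that, for any $z \in \mathbb{R}^p$, we have
$$\int_{(C - \mu)/\sqrt{v_x}} \phi_p(u - z) \, du \leq \int_{(C - \mu)/\sqrt{v_w}} \phi_p(u - z) \, du. \qquad (3.3)$$
For this, it suffices to show that $(C - \mu)/\sqrt{v_x} \subset (C - \mu)/\sqrt{v_w}$, that is, for any $c \in C$,
$$c' = \mu + \sqrt{v_w/v_x} \, (c - \mu) \in C.$$
As $v_w < v_x$, the last quantity is a convex combination of $c$ and $\mu$. Since $c \in C$, $\mu \in C$ and $C$ is convex, we have $c' \in C$, which implies that (3.3) is satisfied.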
Finally, since {C − µ} is a convex set containing the origin, (3.3) is satisfied with equality iff C is a cone and µ is its vertex which means that △R(µ) > 0, except when C is a cone and µ is its vertex. This completes the proof.
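It is worth noting that this second proof of Theorem 3.1 and part (b) of Theorem 2.1 guarantee that the domination of $\hat{p}_{\pi_C}$ over $\hat{p}_U$ implies Hartigan's result, that is, the domination of $\hat{\mu}_{\pi_C}$ over $X$ when $C$ is convex with a non-empty interior. Indeed, for any $v_x > 0$ and $v_y > 0$ fixed, the domination of $\hat{p}_{\pi_C}$ over $\hat{p}_U$ indicates that
$$\int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q^v(\mu, X) - R_Q^v(\mu, \hat{\mu}_{\pi_C,v}) \right] dv \geq 0. \qquad (3.4)$$
Any pair $0 < a < b$ can be realized as $(v_w, v_x)$ by taking $v_x = b$ and $v_y = ab/(b - a)$. Hence, for any $0 < a < b$, the integral on $(a, b)$ of the integrand term in (3.4) is nonnegative, which implies that this integrand term is nonnegative almost everywhere. Finally, by continuity of this function in $v$, $R_Q^v(\mu, X) - R_Q^v(\mu, \hat{\mu}_{\pi_C,v}) \geq 0$ for all $v > 0$. Hence, our method provides an independent proof of Hartigan's result.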
We now turn to the particular case where $p = 1$ and $C$ is a compact interval, say $[-m, m]$ without loss of generality. Kubokawa [11] obtained a class of priors that lead to improvements on the minimum risk equivariant estimator $X$ for risk $R_Q^v$. Here, in an analogous manner as above, we give a parallel result for our density estimation problem with risk $R_{KL}$.
Proof. Since Kubokawa [11] showed that $\hat{\mu}_{\pi,v}(X)$ dominates $X$ for $\mu \in [-m, m]$ for all $v > 0$ under the given assumptions on $\pi$, the result follows directly from Theorem 2.1 with $\hat{p}_{\pi'} \equiv \hat{p}_U$.

Improving on a maximum likelihood estimator
In this subsection, we give other direct implications of Theorem 2.1 and further applications when the mean µ is restricted to a ball centered at 0 of radius m. As in the above Hartigan type result, we can borrow known or easy to derive results for the analogous estimation problem with risk R v Q . However, in contrast, we seek to dominate here the plug-in density based on the maximum likelihood estimator for a univariate bounded normal mean, rather than the generalized Bayes estimatorp U . Corollary 3.1. Suppose that the Bayesian estimatorμ π,v dominates a given estimator δ 1 for µ ∈ C, and for all Proof. The proof is straightforward. First, (B) implies (A). Secondly, by virtue of Theorem 2.1 and given our assumptions, we have Proceeding in analogous manner, we also have the following potentially useful result which focusses on the risk behavior of the Bayes estimatorsμ π,vx , for v w < v < v x , instead of that of δ 1 .
Corollary 3.2. Suppose that the Bayesian estimatorμ π,vx dominates a given estimator δ 1 for µ ∈ C under risk R vx Q . Suppose further that either We now turn to an application of Corollary 3.1 for a univariate normal mean which is bounded to an interval, but, in view of making use of its condition (B), we require the following preliminary result. Proof. Let X ∼ N (µ, v) with a ≤ µ ≤ b, and let φ(t) = (1/ √ 2π) e −t 2 /2 be the standard normal density function. The maximum likelihood estimator of µ is given by By differentiating under the integral sign with respect to σ we obtain after simplification. Hence R(µ, σ, δ mle ) increases in σ and we have the desired result.
For risk R v Q , µ ∈ [a, b], Marchand and Perron [14], as well as Casella and Strawderman [4] (for δ BU below), showed that: • all Bayes estimators δ π (X) with π a symmetric measure about with c 1 ≈ 0.4837, c 2 ≈ 0.5230, and where δ BU (X) is the Bayes estimator with respect to the two-point uniform prior on {−m, m}, δ F U (X) is the Bayes estimator with respect to the uniform U (a, b) prior. For the density estimating problem, we denotep BU (X),p F U (X) as the corresponding Bayes estimators for the boundary uniform and fully uniform priors respectively. In view of the above dominance results for risk R v Q , part (B) of Corollary 3.1, and Lemma 3.1, we now derive improvements on the plug-in maximum likelihood estimator p mle (X) ∼ N p (δ mle (X), v y I p ).
(b) all Bayes estimators $\hat{p}_\pi(X)$, with $\pi$ a symmetric measure about
The Bayesian estimators $\hat{p}_\pi(X)$, $\hat{p}_{BU}(X)$ and $\hat{p}_{FU}(X)$ may be evaluated directly by computing the predictive density $\hat{p}(\cdot|x)$ as in (1.3), or via (1.5). For instance, in the case of the two-point uniform prior on $\{-m, m\}$, we obtain the predictive density $\hat{p}_{BU}(X)$ as the density of a mixture of the two normal distributions $N(-m, v_y)$ and $N(m, v_y)$ with respective weights $w(x) = (1 + e^{2mx/v_x})^{-1}$ and $1 - w(x)$.
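The two-point mixture form of $\hat{p}_{BU}$ described above can be cross-checked against a direct evaluation of the Bayes predictive density as a ratio of prior-weighted sums, which is what (1.3) becomes for a prior with mass $1/2$ at each of $\pm m$. The sketch below is for $p = 1$; the function names are illustrative.

```python
import math


def normal_pdf(t, mean, var):
    """Density of N(mean, var) evaluated at t."""
    return math.exp(-((t - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def weight_minus_m(x, m, v_x):
    """w(x) = 1 / (1 + exp(2*m*x/v_x)), written to avoid overflow for large |x|."""
    z = 2.0 * m * x / v_x
    if z >= 0.0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def p_bu_mixture(y, x, m, v_x, v_y):
    """Mixture of N(-m, v_y) and N(m, v_y) with weights w(x) and 1 - w(x)."""
    w = weight_minus_m(x, m, v_x)
    return w * normal_pdf(y, -m, v_y) + (1.0 - w) * normal_pdf(y, m, v_y)


def p_bu_direct(y, x, m, v_x, v_y):
    """Ratio of prior-weighted sums for the prior with mass 1/2 at -m and at m."""
    support = (-m, m)
    num = sum(normal_pdf(x, t, v_x) * normal_pdf(y, t, v_y) for t in support)
    den = sum(normal_pdf(x, t, v_x) for t in support)
    return num / den


if __name__ == "__main__":
    for x in (-1.5, 0.0, 0.8):
        print(x, p_bu_mixture(0.3, x, 1.0, 1.0, 0.5), p_bu_direct(0.3, x, 1.0, 1.0, 0.5))
```

At $x = 0$ the weight is $1/2$, and as $x$ grows the mixture concentrates on the component centered at $m$.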

Bayesian improvements on the plug-in
Here is an instructive example in relationship with Corollaries 3.1 and 3.2. Consider normal priors π τ ∼ N p (0, τ I) and the corresponding Bayes estimatorŝ µ πτ ,v (x) = τ /(τ + v) x under squared error loss. Now, for cases where µ is constrained to the ball B m = {µ ∈ R p : µ ≤ m}, one may verify that aX So sufficiently large prior variances lead here to improvements, with the dominance for τ → ∞ interpretable as the dominance ofp U on B m for all m > 0, which is of course already known. Alternatively, in view of applying Corollary 3.2, we begin with the weaker condition τ ≥ (m 2 /2p) − v x /2 forμ πτ ,vx (X) to dominate δ 1 (X) = X under risk R vx Q . However, turning to condition (B'), we observe that the quadratic risk of µ πτ ,v (X), given by We hence obtain τ ≥ max(v x , (m 2 /2p) − v x /2) as an alternative sufficient condition to (3.5), and the improved sufficient condition forp πτ (X) to dominate the plug-in N p (X, v y I p ) under risk R KL with µ ∈ B m . Finally, we point out that yet a stronger result can be arrived at by working with condition (A') of Corollary 3.2.
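The threshold $\tau \geq m^2/(2p) - v/2$ appearing in this example can be recovered from the quadratic risk of a linear estimator $aX$, which for $X \sim N_p(\mu, vI)$ equals $a^2 p v + (1 - a)^2 \|\mu\|^2$. The sketch below, with illustrative function names, evaluates this risk at the boundary $\|\mu\| = m$ (the worst case on the ball, since the risk increases in $\|\mu\|$) and compares it with the risk $pv$ of $X$.

```python
def shrinkage_risk(a, p, v, mu_norm_sq):
    """Quadratic risk of a*X when X ~ N_p(mu, v*I): a^2*p*v + (1-a)^2*||mu||^2."""
    return a * a * p * v + (1.0 - a) ** 2 * mu_norm_sq


def tau_threshold(p, v, m):
    """Smallest prior variance tau with (tau/(tau+v))*X no worse than X on ||mu|| <= m."""
    return max(0.0, m * m / (2.0 * p) - v / 2.0)


def dominates_x_on_ball(tau, p, v, m, tol=1e-12):
    """Compare the worst-case risk on the ball (attained at ||mu|| = m) with p*v."""
    a = tau / (tau + v)
    return shrinkage_risk(a, p, v, m * m) <= p * v + tol


if __name__ == "__main__":
    p, v, m = 2, 1.0, 3.0
    tau0 = tau_threshold(p, v, m)
    print(tau0, dominates_x_on_ball(tau0, p, v, m), dominates_x_on_ball(1.0, p, v, m))
```

With $p = 2$, $v = 1$ and $m = 3$, the threshold is $\tau_0 = 1.75$: at $\tau_0$ the boundary risk equals $pv$, and a smaller prior variance such as $\tau = 1$ fails the comparison.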

Minimax results
In this section, we derive minimaxity results for cases where µ is restricted. First, we fully exploit Section 2's relationships between the risks R KL and R Q , as well as known minimaxity results for cases where the mean is restricted to a ball. Secondly, we establish that the estimatorp U remains minimax when the mean µ is restricted to a cone yielding a general result analogous to Tsukuma and Kubokawa's result [20] under squared error loss.

Minimaxity when the mean is restricted to a ball
Here is a general framework which we will seek to apply in cases where the mean µ is restricted to a ball of radius m centered around the origin.
Theorem 4.1. Consider our general problems of estimating µ with risk R v Q and the density p Y (·|µ, v y ) with risk R KL , where µ ∈ C ⊆ R p . Suppose there exists a prior measure π * and a subset C 0 of C such that π * (C 0 ) = 1; . Thenp π * is minimax for risk R KL among all Bayesian estimators that have constant risk , sinceR v is the Bayes risk associated with π * . Now, making use of Theorem 2.1 withp π ′ ≡p U and the constant R KL risk ofp U , we obtain indeed for any Bayes estimatorp π ′ , As an example, consider a constraint to a ball where µ ∈ C = {µ ∈ R p : µ ≤ m}. For this problem, Bayes estimators form a complete class and a minimax estimator can be found among orthogonally invariant estimators. Since such estimators have constant risk on spheres where µ = λ, a minimax estimator can be found among orthogonally invariant Bayes estimators, or equivalently among Bayes estimators with spherically symmetric priors ( [9]). Now take C 0 = {µ ∈ C : µ = m} and π * to be the uniform prior measure on C 0 . As studied in Berry [2], Marchand and Perron [15], or Casella and Strawderman [4] (p = 1), the corresponding Bayes estimator is given byμ π * ,v (x) = √ v. Hence, to satisfy the assumptions of Theorem 4.1, we , which yields the following.
Corollary 4.1. For estimating the density $p(\cdot|\mu, v_y)$ with the constraint $\|\mu\| \leq m$, the boundary uniform Bayes estimator $\hat{p}_{\pi^*}$ is minimax for risk $R_{KL}$ whenever $m \leq c_0(p)\sqrt{v_w}$, where $c_0(p)$ is the constant defined and evaluated by Berry [2], and Marchand and Perron [15].
We conclude this subsection by pointing out that, as at the end of Section 3.2, $\hat{p}_{\pi^*}$ may be evaluated directly as in (1.3), or by using (1.5), and that Marchand and Perron [15] provide various properties of $c_0(p)$, including the lower bound $\sqrt{p} \leq c_0(p)$.

Minimaxity of $\hat{p}_U$ when the mean is restricted to a cone
In this section, we deal with the minimaxity ofp U in (1.4) when the mean µ is restricted to a cone. We point out that another potential and related approach to the problem is given by the recent work of Marchand and Strawderman [16]. Proof. Let r be the constant risk ofp U . We will show that r is a limit of Bayes risks of a sequence of Bayes predictive densitiesp π k (y|x) with respect to a sequence of proper priors π k lying in C.
where B k denotes the ball of radius k centered at 0. By the positive homogeneity of C, we can express C k as Consider, as a sequence of proper priors, the uniform distributions on the convex sets C k , that is, where λ is the Lebesgue measure on R p .
Using (4.2) and the expression of the risk difference in (1.6), the difference of Bayes risks betweenp U andp π k is Now, using the second expression of π k (µ) in (4.2) and expressing the marginal m π k in the bracketed term in (4.3), we have Then with the changes of variables It remains to show that the limit of L(v, k) = L * (k/ √ v) when k goes to infinity exists and hence does not depend on v, which will imply that lim k→∞ (r− r(π k ,p π k )) = 0. Let A k,r = k √ v {C 1 − r}. As the interior of C is non-empty, for any interior point r of C 1 , there exists a ball B ǫ in R p of radius ǫ > 0, centered at 0, such that B ǫ ⊂ C 1 − r.
Then it follows that and hence lim Now the convexity of C, and hence of C 1 , allows, according to Lemma A.1, to apply the Lebesgue dominated convergence theorem to the expression in (4.7). Thus we obtain lim k→∞ L(k, v) = 0 for any fixed v. Therefore lim k→∞ (r − r(π k ,p π k )) = 0, which establishes the minimaxity ofp U . 1 As an immediate consequence of Theorem 3.1 and Theorem 4.2, we have the following corollary.
Corollary 4.2. Let $C \subset \mathbb{R}^p$ be a convex cone with non-empty interior, and let $\pi_C(\mu) = \mathbb{1}_C(\mu)$ be the flat prior restricted to $C$. Then $\hat{p}_{\pi_C}$ is minimax for risk $R_{KL}$ when $\mu \in C$.
Note that Theorem 4.2 also holds for affine convex cones with non-empty interiors. Indeed, it suffices to consider the sequence of proper priors $\pi_k(\mu) = \frac{1}{\lambda(C_k)} \mathbb{1}_{C_k}(\mu - g)$. More generally, we can extend Theorem 4.2 to cones which are not necessarily convex. The proof is analogous to that of Theorem 4.2 and is relegated to the appendix.
Theorem 4.3. Suppose $C$ is a finite disjoint union of (affine) convex cones, in the sense that the restriction is given by
$$\mu \in C = \bigcup_{i=1}^{n} (C_i + g_i),$$
where $C_1, \ldots, C_n$ are convex cones with non-empty interiors, and $g_1, \ldots, g_n$ are $n$ fixed points in $\mathbb{R}^p$ where, for any $i \neq j$,
$$(C_i + g_i) \cap (C_j + g_j) = \emptyset.$$
Then $\hat{p}_U$ is still minimax when $\mu$ is restricted to $C$.

Plug-in estimators: Some additional considerations
We now expand on a general phenomenon concerning plug-in estimators and how they can be improved upon within the class of normal density estimators. Below, we set $G_m(c) = (1 - \frac{1}{c})m - \log c$ for $c > 1$, $m > 1$, and we let $c_0(m)$ denote the root of $G_m(c)$ that lies in $(m, \infty)$. It is easy to show that $G_m(c)$ is positive for $c \in (1, m]$, is maximized on $(1, \infty)$ at $c = m$, and does indeed have a single root on $(m, \infty)$. As above, we denote by $R_Q^{v_x}(\mu, \delta)$ the risk $E_{\mu,v_x}[\|\delta(X) - \mu\|^2]$.
Theorem 5.1. Suppose $\delta(X)$ is an estimator of $\mu$, $\mu \in C$, such that $\inf_{\mu \in C} R_Q^{v_x}(\mu, \delta) > 0$. Consider the estimator $\hat{p}_c \sim N_p(\delta(X), c v_y I_p)$ and the plug-in version $\hat{p}_1 \sim N_p(\delta(X), v_y I_p)$. We then have the following.
Proof. We have Proof. The result follows directly with the decomposition
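The function $G_m(c) = (1 - 1/c)m - \log c$ defined at the start of this section has no closed-form root, but $c_0(m)$ is easy to compute. The following minimal sketch uses bisection on $(m, \infty)$, relying only on the stated facts that $G_m$ is positive at $c = m$ and has a single root beyond it; the function names and the bracketing strategy are illustrative choices.

```python
import math


def g(m, c):
    """G_m(c) = (1 - 1/c) * m - log(c), for c > 1 and m > 1."""
    return (1.0 - 1.0 / c) * m - math.log(c)


def c0(m, iterations=200):
    """Root of G_m on (m, infinity), located by bisection."""
    if m <= 1.0:
        raise ValueError("m must exceed 1")
    lo = m            # G_m(m) = m - 1 - log(m) > 0 for m > 1
    hi = 2.0 * m
    while g(m, hi) > 0.0:   # G_m(c) -> -infinity, so this loop terminates
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(m, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for m in (1.5, 2.0, 10.0):
        root = c0(m)
        print(m, root, g(m, root))
```

Bisection is used rather than Newton's method because $G_m$ is decreasing on $(m, \infty)$ with a sign change, so a bracket is available and convergence is guaranteed.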

Conclusion
In [6], George et al. put forth an interesting parallel between minimax conditions for estimating a multivariate normal mean vector under quadratic risks $R_Q^v$ and minimax conditions for a predictive density estimator under Kullback-Leibler risk $R_{KL}$. Similar connections with regard to admissibility were also developed by Brown et al. [3]. In this paper, we obtain similar connections for dominance and minimaxity results when the mean is restricted to a convex set or to a cone. For instance, in the case where the mean is restricted to a convex set $C$, we prove domination of the Bayes predictive density estimator $\hat{p}_{\pi_C}$ with respect to the uniform prior $\pi_C$ over the generalized Bayes predictive density $\hat{p}_U(y|x)$ associated with the flat prior on $\mathbb{R}^p$.
An essential use is made of an explicit link between a collection of risks $R_Q^v$ and the Kullback-Leibler risk $R_{KL}$. It allows us to create settings where a Bayesian predictive density estimator $\hat{p}_\pi$ dominates a plug-in density estimator $\hat{p}_1(X) \sim N_p(\delta_1(X), v_y I_p)$. Examples including improvements on the plug-in maximum likelihood estimator $\hat{p}_{mle}(X) \sim N_p(\delta_{mle}(X), v_y I_p)$ are derived when the dimension $p$ is 1.
Minimaxity results are also obtained when the mean is restricted to a ball, as we derive conditions under which the boundary uniform Bayes estimator is minimax for risk $R_{KL}$. Also, when the restricted parameter space is a convex cone $C$ with non-empty interior, we prove that the unrestricted predictive density estimator $\hat{p}_U$ remains minimax (as it is when no restriction is assumed).

Appendix A: Appendix
We expand first here on a technical lemma useful in the proof of Theorem 4.2, and we conclude with a proof of Theorem 4.3. For convenience, we denote by φ(s) the normal density p(s|0, 1).
Let C 1 ⊂ B 1 be a convex set in R p with non-empty interior. For t > 0, write Then define C k = ∪ n i=1 C i k + g i = ∪ n i=1 kC i 1 + g i . Note that λ(C k ) = k p λ(C 1 ) and, for any k 1 ≤ k 2 , we have C k1 ⊂ C k2 . Using the sequence of proper priors π k (µ) = 1 k p 1 λ(C1) ½ C k (µ), we have, as in the proof of Theorem 4.2, r − r(π k ,p π k ) = 1 λ(C 1 ) with, for i = 1, . . . n, thanks to (4.6). Then by the change of variables µ = µ ′ + g i . Note that the expression ofL i (v, k) corresponds to the expression of L(v, k) in (4.6) which has been shown to satisfy lim k→∞ L(v, k) = 0. Hence lim k→∞L i (v, k) = 0 and from (A.2) we get lim k→∞ L i (v, k) ≥ 0. Since L i (v, k) ≤ 0, it follows that lim k→∞ L i (v, k) = 0. Therefore lim k→∞ (r − r(π k ,p π k )) = 0, which completes the proof.