Optimal properties of some Bayesian inferences

Relative surprise regions are shown to minimize, among Bayesian credible regions, the prior probability of covering a false value from the prior. Such regions are also shown to be unbiased in the sense that the prior probability of covering a false value is bounded above by the prior probability of covering the true value. Relative surprise regions are shown to maximize both the Bayes factor in favor of the region containing the true value and the relative belief ratio, among all credible regions with the same posterior content. Relative surprise regions emerge naturally when we consider equivalence classes of credible regions generated via reparameterizations.


Introduction
Suppose we have the ingredients for a proper Bayesian analysis. For this we observe data x from a statistical model {f θ : θ ∈ Θ}, where f θ is a density with respect to support measure µ on the sample space X , and we have a proper prior density π on θ, with respect to support measure υ on Θ. With these ingredients we have available the joint distribution of (θ, X), as given by the density f θ (x) π (θ) with respect to support measure υ × µ, and the observed value x. We denote the prior predictive measure of X by M (B) = E Π (P θ (B)) and the posterior measure of θ by Π(A | x). For a quantity of interest τ = Υ(θ), taking values in a set T , we denote the marginal posterior and prior measures of Υ by Π Υ (· | x) and Π Υ respectively, with corresponding densities π Υ (· | x) and π Υ , taken with respect to a support measure ν T on T .
Bayes' theorem, or the principle of conditional probability, says that any probability statements about the unknown θ, after observing x, should be based on the posterior Π(· | x). These ingredients alone, however, do not prescribe what γ-credible region B γ (x) ⊂ T we should quote for τ = Υ(θ). Since there are typically many subsets of T containing γ of the posterior probability, we need a rule for choosing among them.
Relative surprise credible regions for τ, as discussed in Evans (1997), are based on a particular approach to assessing a hypothesis H 0 : τ = τ 0 . For this we compute the observed relative surprise (ORS) given by

\[ \Pi_{\Upsilon}\left( \frac{\pi_{\Upsilon}(\tau \mid x)}{\pi_{\Upsilon}(\tau)} > \frac{\pi_{\Upsilon}(\tau_0 \mid x)}{\pi_{\Upsilon}(\tau_0)} \,\Big|\, x \right). \tag{1} \]

We see that (1) compares the relative increase in belief for τ 0 , from a priori to a posteriori, with this increase for each of the other possible values in T . Other approaches to measuring surprise are discussed in Good (1988). For estimation, we consider (1) as a function of τ 0 and select a value which minimizes this quantity as the estimate, called the least relative surprise estimate (LRSE). To obtain a γ-credible region for τ we simply invert (1) in the standard way to obtain the γ-relative surprise region

\[ C_{\gamma}(x) = \left\{ \tau_0 : \Pi_{\Upsilon}\left( \frac{\pi_{\Upsilon}(\tau \mid x)}{\pi_{\Upsilon}(\tau)} > \frac{\pi_{\Upsilon}(\tau_0 \mid x)}{\pi_{\Upsilon}(\tau_0)} \,\Big|\, x \right) \le \gamma \right\}. \tag{2} \]

One virtue of relative surprise inferences is that they are invariant under reparameterizations.
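The three quantities just described (ORS, LRSE and γ-relative surprise region) can be approximated on a grid. The following is only a minimal sketch under stated assumptions, not the authors' computation: it assumes a Binomial(n, θ) model with a Beta(a, b) prior and τ = θ, takes the ORS to be the posterior probability of those values whose ratio of posterior to prior density exceeds the ratio at τ 0 , and uses hypothetical helper names (`beta_pdf`, `relative_surprise_summary`) and a midpoint grid of size m chosen purely for illustration.

```python
import math

def beta_pdf(t, a, b):
    """Density of a Beta(a, b) distribution at t in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def relative_surprise_summary(a, b, x, n, tau0, gamma, m=2000):
    """Grid approximation of the LRSE, the ORS at tau0 and a gamma-region.

    Assumed setting (illustrative only): x successes in n Bernoulli trials,
    Beta(a, b) prior on the success probability.
    """
    h = 1.0 / m
    grid = [(i + 0.5) * h for i in range(m)]           # midpoints of (0, 1)
    post = [beta_pdf(t, a + x, b + n - x) for t in grid]
    prior = [beta_pdf(t, a, b) for t in grid]
    ratio = [p / q for p, q in zip(post, prior)]        # posterior / prior density

    # LRSE: grid value with the largest ratio (the value with ORS closest to 0).
    lrse = grid[max(range(m), key=ratio.__getitem__)]

    # ORS at tau0: posterior mass of the values whose ratio exceeds that of tau0.
    ratio0 = beta_pdf(tau0, a + x, b + n - x) / beta_pdf(tau0, a, b)
    ors = sum(p * h for p, r in zip(post, ratio) if r > ratio0)

    # Region: add cells in decreasing order of the ratio until mass >= gamma.
    mass, chosen = 0.0, []
    for i in sorted(range(m), key=ratio.__getitem__, reverse=True):
        if mass >= gamma:
            break
        chosen.append(i)
        mass += post[i] * h
    # In this model the ratio is unimodal, so the chosen cells form an interval.
    region = (grid[min(chosen)] - h / 2, grid[max(chosen)] + h / 2)
    return lrse, ors, region, mass

lrse, ors, region, mass = relative_surprise_summary(2, 2, 7, 10, tau0=0.7, gamma=0.95)
print(lrse, ors, region, mass)
```

In this conjugate setting the ratio of posterior to prior density is proportional to the likelihood, so the LRSE sits at x/n; with other models or parameters of interest the same grid logic applies but the region need not be an interval.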
In Evans, Guttman and Swartz (2006) it was shown that relative surprise inferences possess an optimal property in the class of Bayesian inferences. In that development, (2) was taken as the basic concept. In particular, if we consider the class of all γ-credible regions for τ = Υ(θ), then the γ-relative surprise region for τ has the smallest prior content among all γ-credible regions for this quantity. Hypothesis assessments and estimates are derived from relative surprise regions in a direct way and so also possess optimal properties. The LRSE is obtained by taking the region with γ = 0 and the ORS is obtained as inf{γ : τ 0 ∈ C γ (x)}. In section 2 we show that this optimal property has a direct interpretation in terms of minimizing the prior probability of covering a false value and argue that this is an appropriate way to assess repeated sampling properties in contexts where we have a proper prior. In section 3 we prove that, for relative surprise regions, the prior probability of covering a false value is always bounded above by the prior probability of covering the true value and so such sets are, in a generalized sense, unbiased.
As discussed in Evans and Zou (2002) and Evans, Guttman and Swartz (2006), there is a close connection between relative surprise inferences and Bayes factors. In section 3 we establish some results that deepen this connection and show that relative surprise inferences lead to optimal results when interpreted in terms of Bayes factors. In particular, we prove that a γ-relative surprise region C γ (x) for τ = Υ(θ) always has a Bayes factor in favor of the region containing the true value bounded below by unity and, moreover, the Bayes factor is maximized among all γ-credible regions for τ by C γ (x). Further, we introduce the relative belief ratio as an alternative method for measuring change in belief from a priori to a posteriori, and show that this is also bounded below by unity for relative surprise regions and that such regions maximize this quantity as well. The lower bound can be seen as a natural consistency requirement on inferences, in the sense that it would be odd to report a γ-credible region for τ for which our belief in the set containing the true value declined from a priori to a posteriori. While a decline in belief from a priori to a posteriori makes sense for any particular subset of T , it does not make sense for our best report for a set supposedly containing the true value, as the data are suggesting otherwise. Further, the optimality result indicates that we are making the best use of the data, from this point of view, when we choose to use relative surprise regions. In section 4 we show that in quite general circumstances relative surprise regions arise as a member of an equivalence class of credible regions under reparameterizations. Further, we show that choosing among these regions is equivalent to choosing the measure we use to construct an hpd-like credible region, where hpd stands for highest posterior density. We argue that the most natural choice of this measure is Π Υ , which gives relative surprise regions.

Covering false values
Suppose we have a rule for determining a γ-credible region for τ = Υ(θ) based on the sampling model and prior, i.e., for each γ ∈ [0, 1] and x ∈ X , the rule determines a region B γ (x) ⊂ T satisfying Π Υ (B γ (x) | x) ≥ γ. The coverage P θ (Υ(θ) ∈ B γ (X)) of this region is then of considerable interest, particularly when Π is taken to be a diffuse prior. In such a context it seems natural to ask that a γ-credible region satisfy, or at least approximately satisfy, the confidence property P θ (Υ(θ) ∈ B γ (X)) ≥ γ for all θ ∈ Θ. In other words, in an i.i.d. sequence x i ∼ P θ for i = 1, 2, . . . , we require that the proportion of times that Υ(θ) ∈ B γ (x i ) is at least γ, and also that this property hold for all θ ∈ Θ. It is well known that Bayesian credible regions do not generally possess this property, see Joshi (1974), and in fact can perform rather poorly in this regard. A number of papers discuss issues concerned with comparing frequency and Bayesian inferences, including Berger and Sellke (1987), Casella and Berger (1987), and Samaniego and Reneau (1994), as well as the texts Gelman, Carlin, Stern and Rubin (2004), Carlin and Louis (2000), and Robert (2001).
We restrict to proper priors, as then, letting E M denote expectation with respect to the prior predictive distribution of the data,

\[ E_{\Pi}\left( P_{\theta}\left( \Upsilon(\theta) \in B_{\gamma}(X) \right) \right) = E_{M}\left( \Pi_{\Upsilon}\left( B_{\gamma}(X) \mid X \right) \right) \ge \gamma. \tag{3} \]

This can be interpreted as saying that the prior probability the γ-credible region B γ contains a value Υ(θ), when θ ∼ Π, is at least γ. This probability can also be given a long-run relative frequency interpretation in the i.i.d. sequence (θ i , x i ) ∼ Π × P θ for i = 1, 2, . . . , as the proportion of times Υ(θ i ) ∈ B γ (x i ). Various arguments can be offered for the restriction to proper priors, e.g., see DeGroot (1970). In particular, when we have a proper prior, this long-run relative frequency seems more appropriate than the confidence property, as the confidence property requires good coverage at values of θ that have a priori very little weight. Further, while the confidence property has its appeal, the plethora of absurd confidence regions, see Plante (1991) for some discussion, might at least lead one to doubt the wisdom of focusing too closely on confidence. Property (3) holds for any γ-credible region B γ and so does not help us choose among them. Consider, however, the accuracy of the region B γ , where this is measured by the probability of B γ covering an independent false value τ ∼ Λ where Λ is a probability measure on T .
Definition 1. The prior probability of covering a false value from probability measure Λ is given by

\[ E_{\Pi}\left( E_{P_{\theta}}\left( \Lambda\left( B_{\gamma}(X) \right) \right) \right) = E_{M}\left( \Lambda\left( B_{\gamma}(X) \right) \right). \tag{4} \]

So the "true" value of θ is generated from the prior Π, the data x is generated from P θ , and the "false" value τ of the parameter of interest is generated from Λ independent of the true value, i.e., τ has no connection with the data.
To obtain a γ-credible region B γ that minimizes (4) we make use of the following results. Suppose we have a probability measure P and a σ-finite measure Q on a set Ω. Further, suppose that P and Q are both absolutely continuous with respect to the same measure on Ω with respective densities p and q. Let . Lemma 1 and Theorem 2 are proved in Evans, Guttman and Swartz (2006). In that paper P is taken to be the posterior but otherwise the proofs are the same.
The following result establishes the optimality, with respect to (4), of hpd-like credible regions as defined in (5).
Theorem 3. Suppose that the probability distribution Λ is also absolutely continuous with respect to ν T on T with density λ. Then, in the Bayesian model specified by Π × P θ , the region B Λ,γ given by (4). This completes the proof.

M. Evans and M. Shakhatreh/Optimal properties of some Bayesian inferences 1272
Specializing this to the choice Λ = Π Υ , we have the following result.
This says that C γ minimizes the prior probability of covering a false value of the parameter of interest, when the false value follows the marginal prior distribution Π Υ and is independent of the true value of the model parameter. It seems natural to take Λ = Π Υ as this distribution identifies the values of τ that we consider a priori at least plausible. A repeated sampling interpretation of this is obtained by considering a sequence (θ i , x i , τ i ), for i = 1, 2, . . . , of independent values from the joint distribution Π × P θ × Π Υ . Then Corollary 4 says that, among all γ-credible regions B γ for Υ(θ) formed from Π × P θ , a γ-relative surprise region for Υ(θ) minimizes the proportion of times the event τ i ∈ B γ (x i ) occurs. It is worth noting that Theorem 3 and Corollary 4 also hold when θ ∼ Π * for any probability distribution Π * . The proof is the same. So, for example, we could take Π * to be degenerate at some value and C γ would still be optimal. From a practical point of view, however, the choices Π * = Π and Λ = Π Υ seem to be the most sensible, as (4) then has an interpretation as a prior probability.
In addition (4), with Λ = Π Υ , would appear to have several uses. First we can quote this probability as a way of assessing the accuracy of C γ with a given prior. If this probability is quite high, then we have a region with low accuracy. Also (4) can be used for experimental design purposes such as setting sample size. Consider the following example.

For example, the following table gives some values of the prior probability of covering a false value when γ = .95 and σ^2 = 1, based on a Monte Carlo integration sample size of 10^3, with the standard errors in parentheses. It is straightforward to show that (7) converges to 0 as n → ∞. So, by choosing n large enough, we can make (7) as small as we like and so control the error in our inference. If σ^2 → ∞, so the prior is becoming more diffuse, the prior probability of covering a false value generated from the prior converges to 0. This is exactly how we would want our region to behave, namely, the data become much more important in determining the inference as the prior becomes more diffuse. For example, C γ (x) → x̄ ± n^{−1/2} z_{(1+γ)/2} as σ^2 → ∞. So for a very diffuse prior, C γ (x) has a very small probability of covering an independently generated value from the prior.
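A Monte Carlo estimate of the prior probability of covering a false value can be sketched as follows. The setting here is an assumption made for illustration, since the details of the example are not reproduced in this text: x 1 , . . . , x n are taken to be i.i.d. N(θ, 1), the prior is taken to be N(0, σ^2), and Υ(θ) = θ. Under these assumptions the logarithm of the ratio of posterior to prior density is a concave quadratic in τ with its maximum at the sample mean, so the relative surprise region is an interval centered at the sample mean with half-width chosen to give posterior content γ. The function names are hypothetical.

```python
import math
import random

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rs_half_width(xbar, n, sigma2, gamma):
    """Half-width c so that xbar +/- c has posterior content gamma.

    Assumed model: N(theta, 1) sample of size n, N(0, sigma2) prior.
    """
    v = 1.0 / (n + 1.0 / sigma2)   # posterior variance
    m = n * xbar * v               # posterior mean
    s = math.sqrt(v)

    def content(c):
        return norm_cdf((xbar + c - m) / s) - norm_cdf((xbar - c - m) / s)

    lo, hi = 0.0, 1.0
    while content(hi) < gamma:     # bracket the solution
        hi *= 2.0
    for _ in range(60):            # bisection on the half-width
        mid = 0.5 * (lo + hi)
        if content(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return hi

def prob_cover_false(n, sigma2, gamma, draws=1000, seed=0):
    """Monte Carlo estimate (and standard error) of E_M(Pi(C_gamma(X)))."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    total, total_sq = 0.0, 0.0
    for _ in range(draws):
        # Prior predictive distribution of the sample mean is N(0, sigma2 + 1/n).
        xbar = rng.gauss(0.0, math.sqrt(sigma2 + 1.0 / n))
        c = rs_half_width(xbar, n, sigma2, gamma)
        # Prior content of the interval: chance an independent prior draw is covered.
        cover = norm_cdf((xbar + c) / sigma) - norm_cdf((xbar - c) / sigma)
        total += cover
        total_sq += cover * cover
    mean = total / draws
    var = max(total_sq / draws - mean * mean, 0.0)
    return mean, math.sqrt(var / draws)

for n in (1, 5, 20, 100):
    print(n, prob_cover_false(n, sigma2=1.0, gamma=0.95))
```

Running the loop over increasing n is one way to pick a sample size that brings the estimated probability below a chosen tolerance; the printed values depend on the seed and on the assumed model.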
Example 1 illustrates that we cannot use (4), with Λ = Π Υ , to compare priors. A more concentrated prior will give a higher value for (4) than a more diffuse one; however, we have different regions C γ (x) and different distributions for the false values under different priors. This emphasizes the importance of a careful choice of the prior so that unrealistic values of the parameter are excluded.
Similar optimality results can be obtained for the ORS given by (1). For suppose we agree to reject the hypothesis H 0 : Υ(θ) = τ 0 whenever the ORS is greater than γ. This is equivalent to rejecting H 0 whenever τ 0 ∈ C c γ (x). Now consider the class of tests specified by γ-credible regions B γ , so we reject H 0 whenever τ 0 ∈ B c γ (x). In this case, we want to find B γ maximizing This is the conditional prior probability, given that H 0 is true, that we would reject the hypothesis specified by τ, when τ is a value independently generated from the prior. The quantity (8) is clearly analogous to power in the frequentist context. Then, arguing as in Theorem 3 and Corollary 4, we have that (8) is maximized by C c γ (x) among all rejection regions with posterior content less than or equal to 1 − γ. Also, we can use (8) to determine a sample size so that the test based on C c γ has a prescribed value for this conditional prior probability.

Change in belief and unbiasedness
For a set C ⊂ T with 0 < Π Υ (C) < 1, we call

\[ BF_{C}(x) = \frac{\Pi_{\Upsilon}(C \mid x)}{1 - \Pi_{\Upsilon}(C \mid x)} \Big/ \frac{\Pi_{\Upsilon}(C)}{1 - \Pi_{\Upsilon}(C)} \]

the Bayes factor in favor of the true value of τ being in C. Suppose we let C shrink nicely to τ 0 . As in Rudin (1974), p. 163, a sequence of Borel sets C i shrinks nicely to a point τ 0 if there is an α > 0 such that each C i lies in an open ball B(τ 0 , r i ) centered at τ 0 and of radius r i > 0, with µ(C i ) ≥ αµ(B(τ 0 , r i )) for every i, where µ is volume measure, and r i → 0 as i → ∞. Then BF C (x) converges to π Υ (τ 0 | x)/π Υ (τ 0 ) whenever these densities are continuous at τ 0 . So we can think of this quantity as an approximation to the Bayes factor associated with τ 0 and the ORS is a calibration of this value to determine if it is indeed small and thus evidence against τ 0 as a plausible value. The Bayes factor in favor of C is a measure of the change in our belief that C contains the true value from a priori to a posteriori. Perhaps a simpler measure of this change in belief is given by the following.
Definition 2. The relative belief ratio of a subset C ⊂ T is given by

\[ RB_{C}(x) = \frac{\Pi_{\Upsilon}(C \mid x)}{\Pi_{\Upsilon}(C)}. \]

Again, as C shrinks nicely to {τ 0 }, RB C (x) converges to π Υ (τ 0 | x)/π Υ (τ 0 ). Note that BF C (x) = RB C (x)/RB C c (x) and so BF C is not a function of RB C or conversely. They are measuring change in belief on different scales. Clearly the two will be approximately equal when RB C c (x) ≈ 1 and this will occur whenever C is "small". Now consider a γ-relative surprise region C γ (x) for τ. From (2), and the fact that the function Π From this we have the following property for relative surprise regions. We assume throughout the remainder of this section that π Υ (τ ) > 0 for every τ ∈ T .
So Lemma 5 says that the ratio of posterior to prior probabilities of C γ (x) satisfies the same inequality that the respective densities do on this set.
We have an important lower bound on BF Cγ (x) (x) and RB Cγ (x) (x).
Accordingly the Bayes factor and the relative belief ratio always indicate an increase in belief in the set C γ (x) from a priori to a posteriori. In particular, the posterior probability content of C γ (x) is always greater than its prior content. Note that, since BF C c (x) = 1/BF C (x), we have that BF C c γ (x) (x) < 1 and RB C c γ (x) (x) < 1 for a relative surprise region C γ (x).
Note that the fact that the Bayes factor and relative belief ratio are always greater than 1 for a relative surprise region does not imply that relative surprise inferences never find evidence against a hypothesized value H 0 : τ = τ 0 . For we assess H 0 by computing (1), or equivalently from (2), computing γ * = inf{γ : τ 0 ∈ C γ (x)}. If γ * is large (near 1), then we have evidence against H 0 . Alternatively, we could select an appropriate γ and report C γ (x) as our best choice of a γ-credible region to contain the true value. If τ 0 ∉ C γ (x), then we have evidence against H 0 .
Of course, there may be other credible regions with these properties. For example, hpd regions often have these properties, although there does not seem to be an easy general proof of this. In any case, the following shows that relative surprise regions are best from this point-of-view.
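Theorem 7. The set C γ (x) has maximal Bayes factor and maximal relative belief ratio among all measurable sets C ⊂ T satisfying Π Υ (C | x) = Π Υ (C γ (x) | x).
Proof. From Theorem 2 we know that Π Υ (C) is minimized, among all measurable sets C satisfying Π Υ (C | x) ≥ Π Υ (C γ (x) | x), by C = C γ (x). Since the prior odds Π Υ (C)/(1 − Π Υ (C)) are increasing in Π Υ (C), this quantity is also minimized by the same choice when we restrict to those C satisfying Π Υ (C | x) = Π Υ (C γ (x) | x). With the posterior content fixed, the result follows for the Bayes factor and is obvious for the relative belief ratio.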
Theorem 7 is most relevant when there are a number of credible regions, including the relative surprise region C γ (x), with posterior content exactly equal to γ. Theorem 7 then says that C γ (x) is the best choice among these regions from the point of view of the Bayes factor and the relative belief ratio, as it provides the largest increase in belief from a priori to a posteriori. We have the following immediate consequence.
Corollary 8. Suppose that the true value of θ is selected according to Π. Then E M (BF Bγ (X) (X)) and E M (RB Bγ (X) (X)) are maximized, among credible regions B γ (x) satisfying Π Υ (B γ (x) | x) = Π Υ (C γ (x) | x) for every x, by B γ (x) = C γ (x). This says that the prior mean Bayes factor and prior mean relative belief ratio are maximized by C γ (x).
Consider the following example as an illustration.
Example 2 (Probability of joint success). Suppose we observe x from a Binomial(n, θ 1 ), an independent y from a Binomial(n, θ 2 ), we put independent uniform priors on θ 1 and θ 2 and we are interested in making inference about ψ = θ 1 θ 2 . This is the probability of simultaneous success from tossing two coins where the coins have probability of heads equal to θ 1 and θ 2 , respectively. Suppose we have n = 5 and observe x = 4 and y = 1. In the following table we give some γ-hpd intervals and γ-relative surprise (rs) intervals for ψ. We see that these intervals are quite different. Also the relative surprise intervals always dominate the hpd intervals in the sense that the Bayes factor and relative belief ratio of the relative surprise interval are always greater than the corresponding quantities for the hpd interval, as proven generally in Theorem 7. The estimate determined by the hpd approach is the mode and this is given by .122 while the LRSE is .186. While the hpd intervals, in this example, always have RB > 1 and BF > 1, other methods of forming the intervals do not necessarily give intervals with these properties. For example, if we took the left-tail of the posterior as a γ-credible interval for ψ, then the left-tail .4-credible interval has RB = .730 and BF = .640.

In frequentist contexts, a confidence region is said to be unbiased if the probability of the region containing a particular false value is always less than or equal to the probability of the region containing the true value. The following result shows that relative surprise regions are unbiased in a generalized sense.
Theorem 9. For a relative surprise region, the prior probability of containing an independent value generated from the prior is always less than the prior probability of containing the true value, when it is generated from the prior.
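As a numerical companion to Example 2, the relative belief ratio and Bayes factor of a left-tail interval for ψ = θ 1 θ 2 can be approximated by simulation. This is a sketch under the stated setup only (independent uniform priors, so the posteriors are Beta(x + 1, n − x + 1) and Beta(y + 1, n − y + 1)); the function names, number of draws and seed are illustrative assumptions, and the output is not claimed to match the values quoted in the example. The prior distribution function of a product of two independent uniforms, q(1 − log q), is used in closed form.

```python
import math
import random

def prior_cdf_psi(q):
    """P(theta1 * theta2 <= q) for independent Uniform(0, 1) theta1, theta2."""
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    return q * (1.0 - math.log(q))

def relative_belief(prior, post):
    """Ratio of posterior to prior content of a set."""
    return post / prior

def bayes_factor(prior, post):
    """Posterior odds of a set divided by its prior odds."""
    return (post / (1.0 - post)) / (prior / (1.0 - prior))

def left_tail_summary(x, y, n, gamma, draws=100_000, seed=1):
    """RB and BF of the left-tail interval [0, q] with posterior content near gamma."""
    rng = random.Random(seed)
    # Posterior draws of psi under independent uniform priors.
    psi = sorted(
        rng.betavariate(x + 1, n - x + 1) * rng.betavariate(y + 1, n - y + 1)
        for _ in range(draws)
    )
    q = psi[int(gamma * draws) - 1]                  # empirical gamma-quantile
    post = sum(1 for v in psi if v <= q) / draws     # Monte Carlo posterior content
    prior = prior_cdf_psi(q)                         # exact prior content
    return {
        "q": q,
        "prior": prior,
        "post": post,
        "RB": relative_belief(prior, post),
        "BF": bayes_factor(prior, post),
    }

print(left_tail_summary(x=4, y=1, n=5, gamma=0.4))
```

The same two helper functions apply to any candidate interval once its prior and posterior contents are available, which makes it straightforward to compare competing γ-credible intervals on both scales.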

Reparameterizations
A basic principle of inference is that inferences about a parameter of interest should be invariant under reparameterizations, e.g., whatever rule we use to obtain a γ-credible region B γ for a parameter of interest τ, the rule should yield the region ΨB γ for any 1-1, sufficiently smooth, reparameterization ψ = Ψ(τ ). Relative surprise inferences satisfy this principle.
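The invariance can be checked directly from the change of variables formula. The following worked equations are a sketch under the assumptions that T is an open subset of R^k, ν T is volume measure, and Ψ is 1-1 and continuously differentiable with continuously differentiable inverse; the symbols π Ψ (· | x) and π Ψ for the posterior and prior densities of ψ = Ψ(τ) are notation introduced here for illustration.

```latex
\begin{align*}
\pi_{\Psi}(\psi \mid x) &= \pi_{\Upsilon}\bigl(\Psi^{-1}(\psi) \mid x\bigr)\,
  \bigl\lvert \det J_{\Psi}\bigl(\Psi^{-1}(\psi)\bigr) \bigr\rvert^{-1}, \\
\pi_{\Psi}(\psi) &= \pi_{\Upsilon}\bigl(\Psi^{-1}(\psi)\bigr)\,
  \bigl\lvert \det J_{\Psi}\bigl(\Psi^{-1}(\psi)\bigr) \bigr\rvert^{-1}, \\
\frac{\pi_{\Psi}(\psi \mid x)}{\pi_{\Psi}(\psi)}
  &= \frac{\pi_{\Upsilon}(\tau \mid x)}{\pi_{\Upsilon}(\tau)},
  \qquad \tau = \Psi^{-1}(\psi).
\end{align*}
```

The Jacobian factor cancels in the ratio, so the ordering of values by the ratio of posterior to prior density is the same in either parameterization, and posterior probabilities of corresponding sets agree. A region defined by thresholding this ratio at a given posterior content therefore transforms to the image of the original region under Ψ.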
Suppose, however, that we insist on forming credible regions for parameters taking values in T by minimizing their Λ content, where Λ is also absolutely continuous with respect to ν T on T with density λ. Let T be an open subset of R k and D T ,T denote the class of reparameterizations Ψ : T → T that are 1-1, onto, continuously differentiable and such that Ψ −1 is continuously differentiable. Then, by Theorem 3, the γ-credible region for ψ = Ψ(τ ) that has minimal Λ content is given by the hpd-like region where J Ψ (τ ) is the Jacobian of the transformation Ψ evaluated at τ.
: Ψ ∈ D T ,T be the class of γ-credible regions for τ that arise via reparameterizations, when using the measure Λ to construct the credible regions. Each of the regions in [B Λ,γ (x)] is a plausible candidate as a γ-credible region for the parameter of interest and it is not clear how we should choose among them. The following result provides an approach to this choice.
Lemma 11. If Λ is a probability measure and Ψ −1 Note that, when Ψ ΠΥ , Ψ Λ are the respective probability transforms, λ is continuous and positive and π Υ is positive and continuous, then by the inverse function theorem, we must have that Ψ * = Ψ −1 Λ • Ψ ΠΥ ∈ D T ,T and J Ψ * (τ ) = π Υ (τ )/λ(Ψ * (τ )). Lemma 11 says that, in very general circumstances, when we choose to optimize with respect to a probability measure Λ on T , a relative surprise region is always available as an equivalent credible region under a reparameterization.
As previously noted, when we consider choosing among the elements of [B Λ,γ (x)] we need only consider which measure Λ • Ψ is most appropriate. Theorem 3 says that choosing Λ • Ψ leads to a region that minimizes the prior probability of covering a false value τ ∼ Λ • Ψ. Unless there are good reasons to do otherwise, the most appropriate weighting to apply to false values is given by the prior Π Υ = Λ • Ψ −1 Λ • Ψ ΠΥ . This leads to a region that focuses on the parameter values that we believe are a priori important. For example, choosing a credible region that minimized the probability of covering false values that are well out of range where the prior placed most of its mass, would seem to be clearly inappropriate, as we presumably know a priori that these are unrealistic values.
The following illustrates the need for a rule to select a credible region.
Example 3 (All γ-credible intervals can arise from reparameterizations). Suppose that the posterior of τ ∈ R 1 is absolutely continuous, has finite second moment and π Υ (τ | x) > 0 for every τ ∈ R 1 . Let τ 1 > τ 0 be such that Π Υ ((τ 0 , τ 1 ) | x) = γ. Let Λ be a probability measure with density λ(τ ) > 0 for every τ ∈ R 1 . From Lemma 10, we have that the set of all Λ hpd-like γ-credible intervals for τ obtained via reparameterizations is the same as the set of all Λ • Ψ hpd-like γ-credible intervals for τ as Ψ ranges over all reparameterizations. Then, there is a constant k γ (x, Ψ) such that i.e., the probability measure with this density is a smooth reparameterization of Λ, and so While the reparameterization in the example depends on the data, it is not clear generally how to rule out reparameterizations. With the relative surprise rule this is not an issue, because of invariance.
So far we have restricted the discussion to probability measures Λ. Suppose, however, that Λ is a bounded measure on T . It is immediate, for any positive constant b, that B bΛ,γ (x) = B Λ,γ (x). So we can take b = 1/Λ(T ) and simply treat Λ as a probability measure, as we get the same set of credible regions and C γ (x) ∈ [B Λ,γ (x)] . Suppose now that Λ is an unbounded measure with density λ with respect to ν T . Further suppose that there is a sequence of bounded measures Λ n with densities λ n with respect to ν T , such that λ n → λ pointwise as n → ∞. For example, if Λ and ν T are volume measure on R k , then λ ≡ 1 and we can take λ n to be (2πn)^{k/2} times a N k (0, nI) density. Then, when the posterior distribution of π Υ (τ | x)/λ(τ ) is continuous, we have that B Λn,γ (x) → B Λ,γ (x) as n → ∞, since lim inf B Λn,γ (x) = lim sup B Λn,γ (x) = B Λ,γ (x) up to a set having posterior measure 0. If λ and π Υ are positive and continuous, then we have that C γ (x) ∈ [B Λn,γ (x)] for each n. Therefore, B Λ,γ (x) is approximated by B Λn,γ (x) for large n and C γ (x) is equivalent to this set under a reparameterization. Accordingly, we can think of C γ (x) as being approximately equivalent to B Λ,γ (x) under a reparameterization.
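The volume measure example in the preceding paragraph can be written out explicitly. Using the standard form of the N k (0, nI) density, the normalizing constant cancels, which shows both that each Λ n is bounded and that λ n converges pointwise to the constant density of volume measure:

```latex
\begin{align*}
\lambda_n(\tau)
  &= (2\pi n)^{k/2} \cdot (2\pi n)^{-k/2}
     \exp\!\left( -\frac{\lVert \tau \rVert^{2}}{2n} \right)
   = \exp\!\left( -\frac{\lVert \tau \rVert^{2}}{2n} \right)
   \longrightarrow 1 = \lambda(\tau)
   \quad \text{as } n \to \infty, \\
\Lambda_n(\mathbb{R}^{k})
  &= \int_{\mathbb{R}^{k}} \lambda_n(\tau)\, d\tau = (2\pi n)^{k/2} < \infty .
\end{align*}
```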

Conclusions
Relative surprise regions have been shown to minimize the prior probability of covering a false value from the prior. This prior probability can be seen to serve as a measure of accuracy of a credible region and can be used for design purposes. Further, relative surprise regions have optimal properties with respect to the Bayes factor and relative belief ratio of the region. Finally, we have shown that relative surprise regions arise very naturally when we consider choosing among equivalent credible regions based on reparameterizations.
The relevance of our results in a particular application depends on the prior. In our view this is no different than concerns about the relevance of our choice of a sampling model in a problem, i.e., if we make a poor choice, then any inferences drawn based on this model are at least suspect. Model checking methods can increase our confidence, when the model passes, that our choice makes sense. Similarly, methods for checking for prior-data conflict, such as those discussed in Evans and Moshonov (2006, 2007), can increase our confidence that the prior we have chosen makes sense. When the model and prior pass such checks, then optimal inferences drawn from such ingredients have greater force. In particular, the repeated sampling interpretations based upon the prior then seem much more appropriate to us than the common frequentist practice of looking for procedures that possess good properties uniformly over all values of the model parameter, i.e., even at values of the parameter that we believe a priori are not relevant.