On resolving the Savage-Dickey paradox

The Savage-Dickey ratio is known as a specialised representation of the Bayes factor (O'Hagan and Forster, 2004) that allows for a functional plugging approximation of this quantity. We demonstrate here that the Savage-Dickey representation is in fact a generic representation of the Bayes factor that relies on specific measure-theoretic versions of the densities involved in the ratio, instead of being a special identity that imposes constraints on the prior distributions. We clarify the measure-theoretic foundations of the representation as well as of the generalisation of Verdinelli and Wasserman (1995), and we compare this new approximation with their version, as well as with bridge sampling and Chib's approaches.

(This quantity $B_{01}(x)$ is then compared to 1 in order to decide about the strength of the support of the data in favour of $H_0$ or $H_a$.) It is thus mathematically clearly and uniquely defined, provided both integrals exist and differ from both 0 and $\infty$. The practical computation of the Bayes factor has generated a large literature on approximation methods (see, e.g., Chib, 1995; Gelman and Meng, 1998; Chen et al., 2000; Chopin and Robert, 2010), seeking improvements in numerical precision. The Savage-Dickey (Dickey, 1971) representation of the Bayes factor is primarily known as a special identity that relates the Bayes factor to the posterior distribution corresponding to the more complex hypothesis. As described in Verdinelli and Wasserman (1995) and Chen et al. (2000, pages 164-165), this representation has practical implications as a basis for simulation methods. However, as stressed in Dickey (1971) and O'Hagan and Forster (2004), the foundation of the Savage-Dickey representation is clearly theoretical.
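For reference, the quantities discussed here can be written out as follows. This is only a sketch of the usual textbook form, assuming a null hypothesis $H_0: \theta = \theta_0$ with nuisance parameter $\psi$ and prior $\pi_0(\psi)$, an alternative $H_a$ with prior $\pi_1(\theta, \psi)$, and a likelihood $f(x|\theta, \psi)$. The labels (1) and (2) are matched to the way the text refers to the representation and to Dickey's condition, which is an assumption about the original numbering.

```latex
% Bayes factor as a ratio of marginal likelihoods (definition)
B_{01}(x) = \frac{m_0(x)}{m_1(x)}
          = \frac{\int \pi_0(\psi)\, f(x|\theta_0,\psi)\, d\psi}
                 {\int \pi_1(\theta,\psi)\, f(x|\theta,\psi)\, d\psi\, d\theta}

% Savage-Dickey representation, assumed to be the identity labelled (1)
B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}                \tag{1}

% Dickey's (1971) condition on the priors, assumed to be the one labelled (2)
\pi_1(\psi|\theta_0) = \pi_0(\psi)                                    \tag{2}
```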
While the difficulty with the representation (1) is usually addressed in terms of computational aspects, given that $\pi_1(\theta|x)$ is rarely available in closed form, we argue in the current paper that the Savage-Dickey representation faces challenges of a deeper nature, which lead us to consider it a 'paradox'. First, by considering both prior and posterior marginal distributions of $\theta$ solely under the alternative model, (1) seems to indicate that the posterior probability of the null hypothesis $H_0: \theta = \theta_0$ is contained within the alternative hypothesis posterior distribution, even though the set of $(\theta, \psi)$'s such that $\theta = \theta_0$ has zero probability under this alternative distribution. Second, as explained in Section 2, an even more fundamental difficulty with assumption (2) is that it is meaningless when examined (as it should be) within the mathematical axioms of measure theory.
Having stated those mathematical difficulties with the Savage-Dickey representation, we proceed to show in Section 3 that similar identities hold under no constraint on the prior distributions. In Section 4, we derive computational algorithms that exploit these representations to approximate the Bayes factor, in an approach that differs from the earlier solution of Verdinelli and Wasserman (1995). The paper concludes with an illustration in the setting of variable selection within a probit model.

A measure-theoretic paradox
When considering a standard probabilistic setting where the dominating measure on the parameter space is the Lebesgue measure, rather than a counting measure, the conditional density $\pi_1(\psi|\theta)$ is rigorously (Billingsley, 1986) defined as the density of the conditional probability distribution or, equivalently, by the condition that
$$P\big((\theta, \psi) \in A_1 \times A_2\big) = \int_{A_1} \int_{A_2} \pi_1(\psi|\theta)\, d\psi\; \pi_1(\theta)\, d\theta$$
for all measurable sets $A_1 \times A_2$, when $\pi_1(\theta)$ is the associated marginal density of $\theta$. Therefore, this identity points out the well-known fact that the conditional density function $\pi_1(\psi|\theta)$ is defined up to a set of measure zero both in $\psi$ for every value of $\theta$ and in $\theta$. This implies that arbitrarily changing the value of the function $\pi_1(\cdot|\theta)$ for a negligible collection of values of $\theta$ does not impact the properties of the conditional distribution.
In the setting where the Savage-Dickey representation is advocated, the value $\theta_0$ to be tested is not determined from the observations but is instead given in advance, since this is a testing problem. Therefore the density function $\pi_1(\psi|\theta_0)$ may be chosen in a completely arbitrary manner, and measure theory offers no possible reason for a unique representation of $\pi_1(\psi|\theta_0)$. This implies that there always is a version of the conditional density $\pi_1(\psi|\theta_0)$ such that Dickey's (1971) condition (2) is satisfied (and, conversely, that there are infinitely many versions for which it is not satisfied). As a result, from a mathematical perspective, condition (2) cannot be seen as an assumption on the prior $\pi_1$ without further conditions, contrary to what is stated in the original Dickey (1971) and later in O'Hagan and Forster (2004), Consonni and Veronese (2008) and Wetzels et al. (2010). This difficulty is the first part of what we call the Savage-Dickey paradox, namely that, as stated, the representation (1) relies on a mathematically void constraint on the prior distribution.
In the specific case of the artificial example introduced above, the choice of the conditional density $\pi_1(\psi|\theta_0)$ is therefore arbitrary: if we pick for this density the density of the $\mathcal{N}(0, 1)$ distribution, there is agreement between $\pi_1(\psi|\theta_0)$ and $\pi_0(\psi)$, while, if we select instead the function $\exp(+\psi^2/2)$, which is not a density, there is no agreement in the sense of condition (2). The paradox is that this disagreement has no consequence whatsoever in the Savage-Dickey representation.
The second part of the Savage-Dickey paradox is that the representation (1) is solely valid for a specific and unique choice of a version of the density for both the conditional density $\pi_1(\psi|\theta_0)$ and the joint density $\pi_1(\theta_0, \psi)$. When looking at the derivation of (1), the choices of some specific versions of those densities are indeed noteworthy: in the following development,
$$\begin{aligned}
B_{01}(x) &= \frac{\int \pi_0(\psi) f(x|\theta_0, \psi)\, d\psi}{\int \pi_1(\theta, \psi) f(x|\theta, \psi)\, d\psi\, d\theta} \\
&= \frac{\int \pi_1(\psi|\theta_0) f(x|\theta_0, \psi)\, d\psi}{m_1(x)} \\
&= \frac{\int \pi_1(\theta_0, \psi) f(x|\theta_0, \psi)\, d\psi}{m_1(x)\, \pi_1(\theta_0)} \\
&= \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)},
\end{aligned}$$
where $m_1(x)$ denotes the denominator of the first line, the second equality depends on a specific choice of the version of $\pi_1(\psi|\theta_0)$ but not on the choice of the version of $\pi_1(\theta_0)$, while the third equality depends on a specific choice of the version of $\pi_1(\psi, \theta_0)$ as equal to $\pi_0(\psi)\pi_1(\theta_0)$, thus related to the choice of the version of $\pi_1(\theta_0)$. The last equality leading to the Savage-Dickey representation relies on the choice of a specific version of $\pi_1(\theta_0|x)$ as well, namely that the constraint
$$\frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)} = \frac{\int \pi_0(\psi) f(x|\theta_0, \psi)\, d\psi}{m_1(x)} \qquad (3)$$
holds, where the right hand side is equal to the Bayes factor $B_{01}(x)$ and is therefore independent of the version. This rigorous analysis implies that the Savage-Dickey representation is tautological, due to the availability of a version of the posterior density that makes it hold.
As an illustration, consider once again the artificial example above. As already stressed, the value to be tested, $\theta_0 = 1$, is set prior to the experiment. Thus, without modifying either the prior distribution under model $M_1$ or the marginal posterior distribution of the parameter $\theta$ under model $M_1$, and in a completely rigorous measure-theoretic framework, we can select $\pi_1(\theta_0) = 100 = \pi_1(\theta_0|x)$.
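For that choice, we obtain
$$\frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)} = 1,$$
which differs from $B_{01}(x)$ in general. Hence, for this specific choice of the densities, the Savage-Dickey representation does not hold.
Verdinelli and Wasserman (1995) have proposed a generalisation of the Savage-Dickey density ratio when the constraint (2) on the prior densities is not verified (we stress again that this is a mathematically void constraint on the respective prior distributions). Verdinelli and Wasserman (1995) state that
$$B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\; \mathbb{E}^{\pi_1(\psi|x, \theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right].$$
This representation of Verdinelli and Wasserman (1995) therefore remains valid for any choice of versions for $\pi_1(\theta_0|x)$, $\pi_1(\theta_0)$ and $\pi_1(\psi|\theta_0)$, provided the conditional density $\pi_1(\psi|\theta_0, x)$ is defined by
$$\pi_1(\psi|\theta_0, x) = \frac{f(x|\theta_0, \psi)\, \pi_1(\psi|\theta_0)\, \pi_1(\theta_0)}{m_1(x)\, \pi_1(\theta_0|x)},$$
which obviously means that the Verdinelli-Wasserman representation is dependent on the choice of a version of $\pi_1(\theta_0)$.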
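We now establish that an alternative representation of the Bayes factor is available and can be exploited for approximation purposes. When considering the Bayes factor
$$B_{01}(x) = \frac{\int \pi_0(\psi) f(x|\theta_0, \psi)\, d\psi\; \pi_1(\theta_0)}{m_1(x)\, \pi_1(\theta_0)},$$
where the right hand side obviously is independent of the choice of the version of $\pi_1(\theta_0)$, the numerator can be seen as involving a specific version in $\theta = \theta_0$ of the marginal posterior density
$$\tilde\pi_1(\theta|x) \propto \int \pi_0(\psi) f(x|\theta, \psi)\, d\psi\; \pi_1(\theta),$$
which is associated with the alternative prior $\tilde\pi_1(\theta, \psi) = \pi_1(\theta)\pi_0(\psi)$. Indeed, this density $\tilde\pi_1(\theta|x)$ appears as the marginal posterior density of the posterior distribution defined by the density
$$\tilde\pi_1(\theta, \psi|x) = \frac{\pi_0(\psi)\, \pi_1(\theta)\, f(x|\theta, \psi)}{\tilde m_1(x)},$$
where $\tilde m_1(x)$ is the proper normalising constant of the joint posterior density. In order to guarantee a Savage-Dickey-like representation of the Bayes factor, the appropriate version of the marginal posterior density in $\theta = \theta_0$, $\tilde\pi_1(\theta_0|x)$, is obtained by imposing
$$\frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)} = \frac{\int \pi_0(\psi) f(x|\theta_0, \psi)\, d\psi}{\tilde m_1(x)}, \qquad (4)$$
where, once again, the right hand side of the equation is uniquely defined. This constraint amounts to imposing that Bayes' theorem holds in $\theta = \theta_0$ instead of almost everywhere (and thus not necessarily in $\theta = \theta_0$). It then leads to the alternative representation
$$B_{01}(x) = \frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\; \frac{\tilde m_1(x)}{m_1(x)},$$
which holds for any value chosen for $\pi_1(\theta_0)$ provided condition (4) applies. This new representation may seem to be only formal, since both $m_1(x)$ and $\tilde m_1(x)$ are usually unavailable in closed form, but we can take advantage of the fact that the bridge sampling identity of Torrie and Valleau (1977) (see also Gelman and Meng, 1998) gives an unbiased estimator of $\tilde m_1(x)/m_1(x)$ since
$$\mathbb{E}^{\pi_1(\theta, \psi|x)}\left[\frac{\pi_0(\psi)\, \pi_1(\theta)\, f(x|\theta, \psi)}{\pi_1(\theta, \psi)\, f(x|\theta, \psi)}\right] = \mathbb{E}^{\pi_1(\theta, \psi|x)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta)}\right] = \frac{\tilde m_1(x)}{m_1(x)}.$$
In conclusion, we obtain the representation
$$B_{01}(x) = \frac{\tilde\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\; \mathbb{E}^{\pi_1(\theta, \psi|x)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta)}\right],$$
whose expectation part is uniquely defined (in that it does not depend on the choice of a version of the densities involved therein), while the first ratio must satisfy condition (4).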
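We further note that this representation clearly differs from Verdinelli and Wasserman's (1995) representation,
$$B_{01}(x) = \frac{\pi_1(\theta_0|x)}{\pi_1(\theta_0)}\; \mathbb{E}^{\pi_1(\psi|x, \theta_0)}\left[\frac{\pi_0(\psi)}{\pi_1(\psi|\theta_0)}\right], \qquad (6)$$
since (6) uses a specific version of the marginal posterior density on $\theta$ in $\theta_0$, as well as a specific version of the full conditional posterior density of $\psi$ given $\theta_0$.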
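The two-factor structure of the alternative representation (a ratio of densities at $\theta_0$ under the modified prior $\pi_1(\theta)\pi_0(\psi)$, times an expectation under the original posterior) can be checked numerically on a toy model. The sketch below is not the artificial example of the paper: it assumes $x \sim \mathcal{N}(\theta + \psi, 1)$, $\theta_0 = 0$, $\pi_0(\psi) = \mathcal{N}(0, 1)$, $\pi_1(\theta) = \mathcal{N}(0, 1)$ and $\pi_1(\psi|\theta) = \mathcal{N}(0, 2)$, chosen only because every quantity is then Gaussian and available in closed form. All function and variable names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_exact(x):
    """Exact B01(x) = m0(x) / m1(x) for the assumed toy model."""
    # H0: theta = 0 and psi ~ N(0, 1), hence x ~ N(0, 1 + 1).
    # Ha: theta ~ N(0, 1) and psi ~ N(0, 2), hence x ~ N(0, 1 + 2 + 1).
    return normal_pdf(x, 0.0, 2.0) / normal_pdf(x, 0.0, 4.0)


def bayes_factor_estimate(x, n_draws=100_000, seed=1):
    """Monte Carlo version of the two-factor representation."""
    rng = random.Random(seed)
    # First factor: under the modified prior pi1(theta) * pi0(psi), the
    # marginal posterior of theta is N(x / 3, 2 / 3); the continuous version
    # of this density is evaluated at theta0 = 0.
    tilde_posterior_at_theta0 = normal_pdf(0.0, x / 3.0, 2.0 / 3.0)
    prior_at_theta0 = normal_pdf(0.0, 0.0, 1.0)
    # Second factor: average of pi0(psi) / pi1(psi | theta) over draws from
    # the original posterior, whose psi-marginal is N(x / 2, 1) here.
    total = 0.0
    for _ in range(n_draws):
        psi = rng.gauss(x / 2.0, 1.0)
        total += normal_pdf(psi, 0.0, 1.0) / normal_pdf(psi, 0.0, 2.0)
    ratio_of_marginals = total / n_draws
    return tilde_posterior_at_theta0 / prior_at_theta0 * ratio_of_marginals


x_obs = 1.0
exact = bayes_factor_exact(x_obs)
estimate = bayes_factor_estimate(x_obs)
print(f"exact B01 = {exact:.4f}, two-factor estimate = {estimate:.4f}")
```

In this toy setting the importance weight is bounded, so the Monte Carlo average has finite variance; this need not be the case for other prior choices.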

Computational solutions
In this Section, we consider the computational implications of the above representation in the specific case of latent variable models, namely under the practical possibility of a data completion by a latent variable $z$ such that
$$f(x|\theta, \psi) = \int f(x|\theta, \psi, z)\, f(z|\theta, \psi)\, dz$$
when $\pi_1(\theta|x, \psi, z) \propto \pi_1(\theta) f(x|\theta, \psi, z)$ is available in closed form, including the normalising constant.
Note that, for a sample $(\bar\theta^{(t)}, \bar\psi^{(t)})$, $t = 1, \ldots, T$, from $\tilde\pi_1(\theta, \psi|x)$,
$$\left[\frac{1}{T} \sum_{t=1}^{T} \frac{\pi_1(\bar\psi^{(t)}|\bar\theta^{(t)})}{\pi_0(\bar\psi^{(t)})}\right]^{-1}$$
is another convergent (if biased) estimator of $\tilde m_1(x)/m_1(x)$. The availability of two estimates of the ratio $\tilde m_1(x)/m_1(x)$ is a major bonus from a computational point of view, since the comparison of both estimators may allow for the detection of infinite variance estimators, as well as for a check of the coherence of the approximations. The first approach requires two simulation sequences, one from $\tilde\pi_1(\theta, \psi|x)$ and one from $\pi_1(\theta, \psi|x)$, but this is a void constraint in that, if $H_0$ is rejected, a sample from the alternative hypothesis posterior will be required no matter what. Although we do not pursue this possibility in the current paper, note that a comparison of the different representations (including Verdinelli and Wasserman's, 1995, as exposed below) could be conducted by expressing them in the bridge sampling formalism (Gelman and Meng, 1998).
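As a hedged illustration of how two estimates of $\tilde m_1(x)/m_1(x)$ can be compared, the sketch below uses the same kind of toy Gaussian model as an assumption ($x \sim \mathcal{N}(\theta + \psi, 1)$, $\pi_0(\psi) = \mathcal{N}(0, 1)$, $\pi_1(\theta) = \mathcal{N}(0, 1)$, $\pi_1(\psi|\theta) = \mathcal{N}(0, 2)$). The first estimate averages $\pi_0(\psi)/\pi_1(\psi|\theta)$ over draws from the original posterior; the second, whose inverse-average form is our assumption here, uses only draws from the modified posterior. Function names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def prior_ratio(psi):
    """pi0(psi) / pi1(psi | theta), with pi0 = N(0, 1) and pi1(.|theta) = N(0, 2)."""
    return normal_pdf(psi, 0.0, 1.0) / normal_pdf(psi, 0.0, 2.0)


def estimate_from_original_posterior(x, n_draws, rng):
    """Unbiased estimate: plain average under the original posterior."""
    # In the toy model, psi | x ~ N(x / 2, 1) under the original prior.
    total = sum(prior_ratio(rng.gauss(x / 2.0, 1.0)) for _ in range(n_draws))
    return total / n_draws


def estimate_from_modified_posterior(x, n_draws, rng):
    """Convergent but biased estimate: inverse of an average of inverse ratios."""
    # In the toy model, psi | x ~ N(x / 3, 2 / 3) under the modified prior;
    # random.gauss expects a standard deviation, not a variance.
    sd = math.sqrt(2.0 / 3.0)
    total = sum(1.0 / prior_ratio(rng.gauss(x / 3.0, sd)) for _ in range(n_draws))
    return n_draws / total


x_obs = 1.0
rng = random.Random(2)
# Closed-form value of the ratio: tilde m1 is N(0, 3), m1 is N(0, 4).
exact_ratio = normal_pdf(x_obs, 0.0, 3.0) / normal_pdf(x_obs, 0.0, 4.0)
direct = estimate_from_original_posterior(x_obs, 100_000, rng)
inverse = estimate_from_modified_posterior(x_obs, 100_000, rng)
print(f"exact = {exact_ratio:.4f}, direct = {direct:.4f}, inverse = {inverse:.4f}")
```

A marked disagreement between the two values would suggest that at least one of them suffers from a heavy-tailed weight distribution; in this toy model both weights have finite variance, so the two estimates agree closely.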

An illustration
Although our purpose in this note is far from advancing the superiority of the Savage-Dickey type representations for Bayes factor approximation, given the wealth of available solutions for embedded models (Chen et al., 2000, Marin and, we briefly consider an example where both Verdinelli and Wasserman's (1995) and our proposal apply. The model is a probit regression, with the posterior distribution of its regression coefficients based on the prior modelling adopted in Marin and Robert (2007), which extends Zellner's (1971) g-prior to generalised linear models. We take as data the Pima Indian diabetes study dataset available in R (R Development Core Team, 2008), with 332 women registered, and build a probit model predicting the presence of diabetes from three predictors: the glucose concentration, the diastolic blood pressure and the diabetes pedigree function. We assess the impact of the diabetes pedigree function, i.e. we test the nullity of the coefficient $\theta$ associated with this variable. For more details on the statistical and computational issues, see Marin and Robert (2010), since this paper relies on the Pima Indian probit model as a benchmark. This probit model is a natural setting for completion by a truncated normal latent variable (Albert and Chib, 1993). We can thus easily implement a Gibbs sampler to produce output from all the posterior distributions considered in the previous Section. Besides, in that case, the conditional distribution $\pi_1(\theta|x, \psi, z)$ is a normal distribution with closed form parameters. It is therefore straightforward to compute the unbiased estimators (7) and (8).
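The latent variable completion mentioned above can be sketched in a few lines. The code below is a simplified illustration, not the implementation used for the Pima Indian study: it assumes a single covariate with coefficient `beta`, a $\mathcal{N}(0, \texttt{prior\_var})$ prior in place of the g-prior, and synthetic data. It alternates between truncated normal draws of the latent variables and a normal draw of the coefficient, whose full conditional is available in closed form. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_normal(mean, positive, rng):
    """Draw from N(mean, 1) restricted to z > 0 if positive, else to z <= 0."""
    cut = STD_NORMAL.cdf(-mean)  # P(z <= 0) when z ~ N(mean, 1)
    u = rng.uniform(cut, 1.0) if positive else rng.uniform(0.0, cut)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf strictly inside (0, 1)
    return mean + STD_NORMAL.inv_cdf(u)


def gibbs_probit(xs, ys, n_iter=1500, burn_in=500, prior_var=25.0, seed=3):
    """Data augmentation Gibbs sampler for a one-coefficient probit model."""
    rng = random.Random(seed)
    # The conditional variance of beta given the latent variables is fixed.
    post_var = 1.0 / (sum(v * v for v in xs) + 1.0 / prior_var)
    beta, draws = 0.0, []
    for iteration in range(n_iter):
        # Step 1: latent variables given beta, truncated according to y.
        zs = [truncated_normal(beta * v, y == 1, rng) for v, y in zip(xs, ys)]
        # Step 2: beta given the latent variables is normal in closed form.
        post_mean = post_var * sum(v * z for v, z in zip(xs, zs))
        beta = rng.gauss(post_mean, math.sqrt(post_var))
        if iteration >= burn_in:
            draws.append(beta)
    return draws


# Synthetic data generated with a true coefficient equal to 1 (an assumption).
data_rng = random.Random(0)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1 if data_rng.gauss(1.0 * v, 1.0) > 0 else 0 for v in xs]
draws = gibbs_probit(xs, ys)
posterior_mean = sum(draws) / len(draws)
print(f"posterior mean of beta: {posterior_mean:.3f}")
```

Because the full conditional of the coefficient is a known normal density, it can be evaluated at a tested value along the chain and averaged, which is the kind of closed-form ingredient the estimators discussed in the previous Section rely on.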
Figure 1 compares the variation of this approximation with other standard solutions covered in Marin and Robert (2010) for the same example, namely the regular importance sampling approximation based on the MLE asymptotic distribution, Chib's version based on the same completion, and a bridge sampling (Gelman and Meng, 1998) solution completing $\pi_0(\cdot)$ with the full conditional being derived from the conditional MLE asymptotic distribution. The boxplots are all based on 100 replicates of $T = 20{,}000$ simulations. While the estimators (7) and (8) are not as accurate as Chib's version and as the importance sampler in this specific case, their variabilities remain of a reasonable order and are very comparable. The R code and the reformatted datasets used in this Section are available at the following address: http://www.math.univ-montp2.fr/~marin/savage/dickey.html.
Figure 1: Comparison of the variabilities of five approximations of the Bayes factor evaluating the impact of the diabetes pedigree covariate upon the occurrence of diabetes in the Pima Indian population, based on a probit modelling. The boxplots are based on 100 replicas and the Savage-Dickey representation proposed in the current paper is denoted by MR, while Verdinelli and Wasserman's (1995) version is denoted by VW.
our exposition of the Savage-Dickey paradox. The second author also thanks Geoff Nicholls for pointing out the bridge sampling connection at the CRiSM workshop at the University of Warwick, May 31, 2010. This work has been supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2009-2012 project Big'MC.