Estimating the Marginal Likelihood Using the Arithmetic Mean Identity

In this paper we propose a conceptually straightforward method to estimate the marginal data density (also called the marginal likelihood). We show that the marginal likelihood is equal to the prior mean of the conditional density of the data given the vector of parameters, restricted to a certain subset A of the parameter space, times the reciprocal of the posterior probability of A. This identity motivates the use of an Arithmetic Mean estimator based on simulation from the prior distribution restricted to any (reasonably chosen) subset of the parameter space. Trimming the parameter space in this way removes regions of relatively low likelihood and thereby improves the efficiency of the Arithmetic Mean estimator. We show that the adjusted Arithmetic Mean estimator is unbiased and consistent.


Introduction
The marginal data densities (i.e. the normalizing constants of the posterior distributions of the model parameters, also called marginal likelihoods or integrated likelihoods) are key quantities for formal Bayesian model selection and model averaging; see, e.g. Zellner (1971). The posterior odds ratio, used for comparing two competing models, is equal to the product of the prior odds and the Bayes factor, and the Bayes factor is defined as the ratio of the models' marginal likelihoods. The marginal likelihoods and the prior model probabilities form the posterior model probabilities, which are necessary for Bayesian model averaging and for testing statistical hypotheses. The marginal likelihoods are therefore essential to the Bayesian approach.
Let us consider a model in which (i) the parameter space is denoted by Θ, (ii) p(θ) is the prior density function of the parameters collected in θ ∈ Θ, and (iii) y is the vector of observations. The marginal data density, p(y), is defined as the integral, over the whole parameter space, of the conditional data density given the parameters, p(y|θ), with respect to the prior distribution:

p(y) = ∫_Θ p(y|θ) p(θ) dθ.   (1)

Even in simple models, correct assessment of the marginal likelihood is computationally challenging. In more complicated models, high-dimensional integration is required to estimate marginal data densities. In the majority of models it is not possible to analytically integrate the parameters out of the joint distribution of y and θ, p(y, θ), and some Monte Carlo approximation of p(y) at the observed vector y is needed.
In this paper we propose a conceptually straightforward method to estimate the marginal data density. Our proposal is motivated by the problems with the Harmonic Mean estimator and the solutions proposed by Lenk (2009). By analogy with the adjusted Harmonic Mean estimator, we propose a corrected Arithmetic Mean estimator. We therefore start by reviewing estimation of the marginal data density by means of the Harmonic Mean (proposed by Newton and Raftery (1994)), which has become one of the most popular methods due to its simplicity. Under the assumption that the prior distribution of the vector of parameters is proper, we have

1/p(y) = E_θ|y [1/p(y|θ)],   (3)

where E_θ|y(·) denotes the expected value with respect to the posterior distribution of θ. In (3), the reciprocal of p(y) is expressed as the posterior expected value of the inverse of the conditional density of the data y given θ. In other words, the marginal data density is equal to the posterior Harmonic Mean of the conditional density of the data. This suggests using the sample Harmonic Mean of the conditional density of y given θ, based on draws from the posterior distribution, p(θ|y). The Harmonic Mean (HM) estimator, proposed by Newton and Raftery (1994), is given by

p̂_HME(y) = [ (1/k) Σ_{q=1}^{k} 1/p(y|θ^(q)) ]^(−1),

where {θ^(q)}_{q=1}^{k} is drawn from the posterior distribution of the parameters by means of a Markov Chain Monte Carlo (MCMC) method.
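The HM estimator can be sketched in a few lines. The model below is a deliberately simple stand-in, not one of the paper's models: y_i ~ N(θ, 1) with prior θ ~ N(0, 1), for which exact posterior draws and a closed-form ln p(y) are available, so the sketch can be checked. The log-mean-exp helper keeps the averaging of likelihood reciprocals numerically stable.

```python
import math
import random

def log_mean_exp(logs):
    # numerically stable log of the arithmetic mean of exp(logs)
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) / len(logs))

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1). Then theta | y ~ N(sum(y)/(n+1), 1/(n+1)), and y ~ N(0, I + 11'),
# whose log-density gives ln p(y) in closed form.
def log_lik(theta, y):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def log_marginal(y):
    # det(I + 11') = n + 1 and (I + 11')^{-1} = I - 11'/(n + 1)
    n, s = len(y), sum(y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
            - 0.5 * (sum(yi * yi for yi in y) - s * s / (n + 1)))

def log_hm_estimator(y, k=100_000, seed=1):
    rng = random.Random(seed)
    n = len(y)
    mean, sd = sum(y) / (n + 1), math.sqrt(1.0 / (n + 1))
    draws = [rng.gauss(mean, sd) for _ in range(k)]  # exact posterior draws
    # ln p_HM(y) = -ln( (1/k) * sum_q 1/p(y|theta^(q)) )
    return -log_mean_exp([-log_lik(t, y) for t in draws])
```

Even in this toy case the reciprocal likelihoods have heavy tails, which is the instability discussed below.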
Even though the HM estimator is consistent (see Newton and Raftery (1994)), it has serious shortcomings. It can be unstable (see Raftery et al. (2007)), and it overestimates the marginal data density. Moreover, as pointed out by Lenk (2009), it is characterised by a so-called "simulation pseudo-bias". Lenk (2009) proposed several methods for correcting the "pseudo-bias", provided that we can draw from the posterior distribution restricted to a subset of the parameter space whose posterior probability is close to one. The adjusted HM estimator, proposed by Lenk (2009), is given by

p̂(y) = P̂(A) [ (1/k) Σ_{q=1}^{k} I_A(θ^(q)) / p(y|θ^(q)) ]^(−1),

where P̂(A) is an assessment of the prior probability of a subset A ⊆ Θ whose posterior probability is greater than 1 − ε for small ε > 0 (i.e. P(A|y) > 1 − ε); see Lenk (2009). Based on the identity

p(y) = P(A) { E_θ|y [ I_A(θ) / p(y|θ) ] }^(−1),   (6)

which is true for any subset A ⊆ Θ such that 0 < P(A) < +∞ and 0 < P(A|y) < +∞, Pajor and Osiewalski (2013) have shown that Lenk's correction can be used regardless of the posterior probability accumulated in the chosen subset of the parameter space, and that some estimators from the adjusted HM estimator class (depending on the choice of A) can be used in the case of an improper prior. Their analytical result makes it possible to select A, and consequently to improve the numerical properties of the adjusted Harmonic Mean estimator. It is easy to show that the identity (6) is equivalent to

p(y) = [P(A)/P(A|y)] { E_θ|y [ 1/p(y|θ) | A ] }^(−1).   (7)

Given the subset A ⊆ Θ, the identities (6) and (7) naturally lead to the following estimators of the marginal data density (further called Corrected Harmonic Mean):

p̂_CHME(y) = P̂(A) [ (1/k) Σ_{q=1}^{k} I_A(θ^(q)) / p(y|θ^(q)) ]^(−1),   (8)

p̂(y) = [P̂(A)/P̂(A|y)] [ (1/k) Σ_{q=1}^{k} 1/p(y|θ_A^(q)) ]^(−1),

where {θ^(q)}_{q=1}^{k} and {θ_A^(q)}_{q=1}^{k} are drawn from p(θ|y) and p(θ|y, A), respectively.
The Corrected Harmonic Mean estimator (given by (8)) is still biased, and it overestimates the marginal data density. Using the method presented in Xie et al. (2011), we have shown that even if P(A) is known, the expected value of the Corrected Harmonic Mean estimator exceeds p(y), i.e.

E[ p̂_CHME(y) ] > p(y)

(see Supplementary Appendix A in Pajor (2016)). The bias results from the presence of the reciprocal function in (8), which is convex for positive arguments. To overcome the bias, we propose using the Arithmetic Mean instead of the Harmonic Mean. In turn, to reduce the unacceptably high variance of the Arithmetic Mean estimator, we trim the prior sample to eliminate problematic regions of the parameter space. Our method belongs to the class of importance-sampling-based methods that estimate p(y) separately for each model. Estimating marginal likelihoods instead of Bayes factors allows a new model to be compared directly with any model for which the marginal likelihood has already been calculated.
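The direction of the bias is just Jensen's inequality for the convex map x ↦ 1/x: the harmonic mean of positive values never exceeds their arithmetic mean. A two-line numeric check, with arbitrary illustrative values:

```python
# Jensen's inequality for the convex map x -> 1/x: E[1/X] >= 1/E[X], so a
# harmonic mean of positive values never exceeds the corresponding arithmetic
# mean; inverting an average of reciprocals therefore biases the result.
vals = [0.5, 1.0, 4.0]                          # arbitrary illustrative values
arith = sum(vals) / len(vals)                   # 5.5 / 3
harm = len(vals) / sum(1.0 / v for v in vals)   # 3 / 3.25
```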
Various other numerical methods have been proposed in the literature to estimate the marginal data density directly or to compute ratios of two marginal likelihoods (i.e. Bayes factors); for a review see, e.g. Ardia et al. (2012) and Friel and Wyse (2012). Very popular methods for computing Bayes factors or marginal posterior probabilities are based on the reversible jump algorithm of Green (1995) and on bridge sampling (see Meng and Wong (1996)). Bartolucci et al. (2006) proposed a class of estimators of the Bayes factor based on an extension of the bridge sampling identity of Meng and Wong (1996), combined with a reversible jump sampler.
A widely used method of approximating the marginal likelihood is the Laplace-Metropolis approximation (see Kass and Raftery (1995), Raftery (1996)). It approximates the marginal likelihood by replacing the posterior density with a Normal density whose mean equals the posterior mode and whose covariance matrix equals the posterior covariance matrix, both obtained from the posterior simulation output. Unfortunately, the approximation may not be valid for some models, e.g. mixture models, in which asymptotic normality does not hold (see, e.g. Frühwirth-Schnatter (2006)).
Among the numerical methods used to estimate the marginal likelihood, a very popular one is the method of Chib (1995), which estimates the marginal likelihood from Gibbs sampling output (with the use of Rao-Blackwellization). The idea was extended by Chib and Jeliazkov (2001) to cases where the Metropolis-Hastings algorithm is used to generate posterior samples. These methods require additional samples from reduced Gibbs or reduced Metropolis-Hastings runs, which may be time-consuming when the parameter space is split into a large number of blocks. Moreover, the numerical accuracy of the estimate depends on the choice of a specific point of the parameter space. Friel and Pettitt (2008) proposed the power posterior method, inspired by ideas from path sampling (proposed by Gelman and Meng (1998)) and based on samples from a distribution proportional to the likelihood raised to a power t times the prior density. A difficulty with this method is choosing the optimal temperature schedule. Recently, Weinberg (2012) proposed a novel approach based on Lebesgue integration. All estimation methods have advantages and disadvantages; no perfect solution exists.
In the following section, we propose a new class of estimators of the marginal likelihood. They are based on the Arithmetic Mean of likelihood values calculated only over an arbitrary subset of the parameter space, corrected by the reciprocal of the posterior probability of that subset. We also show that, under some assumptions, the new estimators (which depend on the choice of a subset of the parameter space) are unbiased and consistent (see Supplementary Appendix B in Pajor (2016)). In Sections 3 and 4, we present simulation results and data examples showing that the new estimator performs very well in comparison with the Corrected Harmonic Mean, Chib, and Laplace-Metropolis estimators. Finally, we conclude that preliminary experience with the estimators introduced here is very promising.

Corrected Arithmetic Mean estimators: methodology
Equation (1) suggests using the sample Arithmetic Mean of the conditional density of y given θ based on draws from the prior distribution, p(θ). The Arithmetic Mean (AM) estimator is given by

p̂_AME(y) = (1/k) Σ_{q=1}^{k} p(y|θ^(q)),   (11)

where {θ^(q)}_{q=1}^{k} is now drawn from the prior distribution of the parameters. This estimator was mentioned by, e.g. Hammersley and Handscomb (1964) and Raftery and Banfield (1991), and it was used by McCulloch and Rossi (1992) for logistic regression models as well as by Lewis and Raftery (1997) for comparing alternative hierarchical (i.e. random-effects) models. Although the AM estimator is unbiased, it can have a very high variance (and thus it can be quite inefficient). If the posterior distribution is much more concentrated than the prior, then, while sampling from the prior distribution, most points fall in regions where the sampling density of the data (or the likelihood) is close to zero. Hence, the estimate of p(y) depends on only a few points from the region of high values of p(y|θ), and an extremely large simulation sample would be required to obtain an adequate result (see Lewis and Raftery (1997), Raftery (1996)). Moreover, the AM estimator cannot be used with improper priors. A natural remedy for the inefficiency is to trim the prior sample so as to eliminate regions of the parameter space with very low likelihood, as is done for the Corrected Harmonic Mean estimator. This motivates our modification of the Arithmetic Mean estimator.
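The plain AM estimator is a one-line Monte Carlo average. The sketch below again uses a toy stand-in model (y_i ~ N(θ, 1), prior θ ~ N(0, 1), an assumption for illustration); in this one-dimensional case the prior and posterior overlap substantially, so the plain AM still works, whereas in the high-dimensional settings discussed above it degenerates.

```python
import math
import random

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1), for which
# ln p(y) = -(n/2)ln(2*pi) - (1/2)ln(n+1) - (1/2)(sum y_i^2 - (sum y_i)^2/(n+1)).
def log_lik(theta, y):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def log_am_estimator(y, k=100_000, seed=1):
    # ln of (1/k) * sum_q p(y | theta^(q)), theta^(q) drawn from the prior
    rng = random.Random(seed)
    logs = [log_lik(rng.gauss(0.0, 1.0), y) for _ in range(k)]
    m = max(logs)  # log-mean-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / k)
```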
Let us assume that A ⊆ Θ, 0 < P(A) < +∞, and 0 < P(A|y) < +∞. Starting from the identity

P(A|y) = ∫_A p(θ|y) dθ = (1/p(y)) ∫_A p(y|θ) p(θ) dθ,

we obtain

p(y) P(A|y) = ∫_Θ p(y|θ) I_A(θ) p(θ) dθ,

and consequently,

p(y) = E_θ[ p(y|θ) I_A(θ) ] / P(A|y),   (14)

or equivalently,

p(y) = [P(A)/P(A|y)] E_θ[ p(y|θ) | A ],   (15)

where E_θ(·) denotes the expected value with respect to the prior distribution of θ, and E_θ(·|A) denotes the expected value with respect to the prior distribution restricted to A.
Equation (14) says that the marginal density of the data can be expressed as the product of the reciprocal of the posterior probability of the subset A, P(A|y), and the expected value of the conditional density of the data times the indicator function of A, p(y|θ)I_A(θ), where the expectation is taken with respect to the prior distribution of the model parameters. Identity (14) naturally leads to the following estimator of the marginal data density (further called the Corrected Arithmetic Mean estimator):

p̂_CAME(y) = (1/P̂(A|y)) (1/k) Σ_{q=1}^{k} p(y|θ^(q)) I_A(θ^(q)),   (16)

where {θ^(q)}_{q=1}^{k} is drawn from the prior distribution, p(θ). The assessment of the posterior probability of the subset A, P̂(A|y), also requires sampling from the posterior distribution.
In turn, identity (15) suggests that the marginal likelihood can be approximated by the product of the ratio of the prior to the posterior probability of A and the sample Arithmetic Mean of the conditional data density based on draws from the prior distribution restricted to A, p(θ|A):

p̂(y) = [P̂(A)/P̂(A|y)] (1/k) Σ_{q=1}^{k} p(y|θ_A^(q)),   (17)

where {θ_A^(q)}_{q=1}^{k} is drawn from p(θ|A). Note that we have just defined a new class of estimators indexed by the subset A ⊆ Θ. Under additional assumptions (first, the subset A is compact; second, the likelihood is bounded on A; third, P(A|y) is known; fourth, it is possible to generate samples from the prior distribution), it is easy to show that the Corrected Arithmetic Mean (CAM) estimator is unbiased and consistent (see Theorem 2 in Supplementary Appendix B in Pajor (2016) for details). This stems from the fact that the CAM estimator is just a simple Monte Carlo estimator. In practice, P(A|y) needs to be estimated via posterior simulation. If it is possible to generate samples forming an ergodic Markov chain whose equilibrium distribution is the posterior distribution, then the CAM estimator remains consistent (see Theorem 3 in Supplementary Appendix B in Pajor (2016)). Because the subset A in (16) and (17) can be selected arbitrarily, the inefficiency of the Arithmetic Mean (11) can be overcome: A should be chosen in the region of large values of p(y|θ), which reduces the variance of the Arithmetic Mean. Note that, from the numerical point of view, an optimal choice of A is one with P̂(A|y) = 1, because then p̂_CAME(y) = (1/k) Σ_{q=1}^{k} p(y|θ^(q)) I_A(θ^(q)) and the CAM estimator requires additional simulation only from the prior distribution. Moreover, to evaluate the expected value in (14), importance sampling can be used. Suppose that there exists an algorithm for generating samples according to a probability density function s(·), referred to as an importance function.
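The CAM recipe with P̂(A|y) = 1 can be sketched end-to-end. The toy model below is again an illustrative stand-in (y_i ~ N(θ, 1), prior θ ~ N(0, 1)), with A taken as the interval spanned by the posterior draws, so every posterior draw lies in A and P̂(A|y) = 1:

```python
import math
import random

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1), with exact posterior theta | y ~ N(sum(y)/(n+1), 1/(n+1)).
def log_lik(theta, y):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def log_cam_estimator(y, k=100_000, seed=1):
    rng = random.Random(seed)
    n = len(y)
    # Step 1: posterior draws define A = [min, max] of the sampled values,
    # so that P(A|y) is assessed as 1 (every posterior draw falls inside A).
    mean, sd = sum(y) / (n + 1), math.sqrt(1.0 / (n + 1))
    post = [rng.gauss(mean, sd) for _ in range(k)]
    lo, hi = min(post), max(post)
    # Step 2: prior draws; only those inside A contribute (the indicator I_A).
    logs = []
    for _ in range(k):
        t = rng.gauss(0.0, 1.0)
        if lo <= t <= hi:
            logs.append(log_lik(t, y))
    # ln of (1/k) * sum_q p(y|theta^(q)) I_A(theta^(q)), divided by P-hat(A|y) = 1;
    # note the division by the TOTAL number of prior draws k, not by len(logs)
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) / k)
```

Dividing by k rather than by the number of accepted draws is essential: the trimmed draws contribute zeros to the Arithmetic Mean, which is exactly what identity (16) requires.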
Then

E_θ[ p(y|θ) I_A(θ) ] = E_s[ p(y|θ) I_A(θ) p(θ)/s(θ) ],

where E_s(·) denotes the expected value with respect to the importance distribution. This yields the importance sampling estimator of p(y) with importance function s(·), namely

p̂_IS(y) = (1/P̂(A|y)) (1/k) Σ_{q=1}^{k} p(y|θ^(q)) I_A(θ^(q)) p(θ^(q))/s(θ^(q)),

where {θ^(q)}_{q=1}^{k} is drawn from the importance sampling distribution. The simplicity of the Corrected Arithmetic Mean estimator and its good properties are its main advantages over other techniques based on importance sampling, e.g. the stabilized version of the Harmonic Mean (see Raftery et al. (2007)), annealed importance sampling (see Neal (2001)), importance-weighted marginal density estimators (see Chen (1994), Chen (2005)), and others (see, e.g. Meng and Wong (1996), Raftery (1996), Lewis and Raftery (1997), Han and Carlin (2001), Raftery et al. (2007), Friel and Pettitt (2008), Lenk (2009), Xie et al. (2011), Weinberg (2012)). Moreover, our new estimators can be used in the case of an improper prior (provided that A ⊆ Θ has non-zero and finite prior measure), in which p(θ) is a density with respect to some σ-finite measure such that p(y) is non-zero and finite, i.e. the posterior distribution exists.
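The importance-sampling variant can be sketched on the same illustrative toy model (y_i ~ N(θ, 1), prior θ ~ N(0, 1); an assumption, not one of the paper's models), taking A = Θ for simplicity and the exact Normal posterior as the importance function. Because s then equals the posterior exactly, every weighted term equals p(y) and the estimator has zero variance; with an approximate s fitted to MCMC output the weights would vary.

```python
import math
import random

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1).
def log_lik(theta, y):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def log_norm_pdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def log_is_estimator(y, k=10_000, seed=1):
    # importance function s(.) = exact posterior N(sum(y)/(n+1), 1/(n+1));
    # in practice s would only approximate the posterior
    rng = random.Random(seed)
    n = len(y)
    mean, sd = sum(y) / (n + 1), math.sqrt(1.0 / (n + 1))
    logs = []
    for _ in range(k):
        t = rng.gauss(mean, sd)
        # log of p(y|t) * p(t) / s(t); here A = Theta, so I_A = 1 and P(A|y) = 1
        logs.append(log_lik(t, y) + log_norm_pdf(t, 0.0, 1.0) - log_norm_pdf(t, mean, sd))
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) / k)
```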

Simulation study
In this section we present simulation studies for three classes of models in which, under some assumptions, the true values of the marginal likelihoods are known (i.e. can be calculated analytically). We consider two examples presented by Lenk (2009) (the conjugate normal model and linear regression models) and autoregressive (AR) models. As regards AR models, we consider two cases: first, autoregressive models with a conjugate inverse Gamma-Normal prior distribution; second, autoregressive models with covariance stationarity conditions and independent inverse Gamma-Normal prior distributions. In the second class of AR models, unlike the conjugate AR models, a closed-form expression for the marginal likelihood does not exist. Thus we calculated the natural logarithms of the Bayes factors in favour of the true (assumed) model against the other models. In addition to the new estimator, the HM, corrected HM, AM, Laplace-Metropolis (LM), and Chib's estimators are used. The Laplace-Metropolis estimator is defined as follows (see Raftery (1996)):

p̂_LME(y) = (2π)^(d/2) det(Ψ)^(1/2) p(y|θ̂) p(θ̂),

where θ̂ is the posterior mode of ln[p(y|θ)p(θ)], Ψ is the negative inverse Hessian of ln[p(y|θ)p(θ)] evaluated at θ = θ̂, and d is the dimension of θ. In our examples, θ̂ was estimated from the posterior simulation output: ln[p(y|θ^(q))p(θ^(q))] was computed for each q = 1, ..., k, and the value of θ^(q) maximizing it was taken. The matrix Ψ was approximated by the estimated posterior covariance matrix from the posterior simulation output (as suggested by Raftery (1996)).
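The simulation-based Laplace-Metropolis recipe just described can be sketched on the illustrative toy model used earlier (y_i ~ N(θ, 1), prior θ ~ N(0, 1); an assumption, not one of the paper's models). In this Gaussian case the log-posterior kernel is exactly quadratic, so the Laplace approximation is exact up to simulation error in θ̂ and Ψ:

```python
import math
import random

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1). The kernel ln[p(y|theta)p(theta)] is exactly quadratic here.
def log_kernel(theta, y):
    loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)
    logprior = -0.5 * math.log(2 * math.pi) - 0.5 * theta ** 2
    return loglik + logprior

def log_lm_estimator(y, k=100_000, seed=1):
    rng = random.Random(seed)
    n = len(y)
    mean, sd = sum(y) / (n + 1), math.sqrt(1.0 / (n + 1))
    draws = [rng.gauss(mean, sd) for _ in range(k)]  # stand-in for MCMC output
    # theta-hat: the draw maximizing the log kernel; Psi: sample posterior variance
    theta_hat = max(draws, key=lambda t: log_kernel(t, y))
    mu = sum(draws) / k
    psi = sum((t - mu) ** 2 for t in draws) / (k - 1)
    d = 1  # dimension of theta
    # ln p_LM(y) = (d/2) ln(2*pi) + (1/2) ln det(Psi) + ln[p(y|theta-hat)p(theta-hat)]
    return 0.5 * d * math.log(2 * math.pi) + 0.5 * math.log(psi) + log_kernel(theta_hat, y)
```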
As a "gold standard" for calculating the marginal likelihood, we also consider the method proposed by Chib (1995) and extended by Chib and Jeliazkov (2001). Chib's method is based on the identity p(θ|y) = p(y|θ)p(θ)/p(y), leading to the following equation:

p(y) = p(y|θ*) p(θ*) / p(θ*|y),

which holds for any θ* ∈ Θ. In practice (for the sake of estimation efficiency), the point θ* is chosen from a region of high posterior density. Usually, both the prior ordinate p(θ*) and the likelihood ordinate p(y|θ*) can be calculated easily. Estimating the posterior ordinate p(θ*|y) is harder, but it can be approximated from the output of an MCMC sampler. This method requires reduced Gibbs steps (or reduced Metropolis-Hastings steps) in addition to the Gibbs sampler (or the Metropolis-Hastings algorithm).
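In the illustrative toy model used earlier (y_i ~ N(θ, 1), prior θ ~ N(0, 1); an assumption, not one of the paper's models), the posterior ordinate is known in closed form, so Chib's identity can be evaluated exactly and checked to be independent of θ*:

```python
import math

# Toy conjugate model (an illustration, not from the paper): y_i ~ N(theta, 1),
# theta ~ N(0, 1), so theta | y ~ N(sum(y)/(n+1), 1/(n+1)) and the posterior
# ordinate p(theta*|y) is available exactly; Chib's identity is then exact.
def log_chib_estimator(y, theta_star=None):
    n = len(y)
    mean, var = sum(y) / (n + 1), 1.0 / (n + 1)
    if theta_star is None:
        theta_star = mean  # a high-posterior-density point, as recommended
    loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta_star) ** 2 for yi in y)
    logprior = -0.5 * math.log(2 * math.pi) - 0.5 * theta_star ** 2
    logpost = -0.5 * math.log(2 * math.pi * var) - 0.5 * (theta_star - mean) ** 2 / var
    # ln p(y) = ln p(y|theta*) + ln p(theta*) - ln p(theta*|y), for any theta*
    return loglik + logprior - logpost
```

In realistic models the third term, the posterior ordinate, is the part that must be estimated from (reduced) MCMC runs; the identity itself is exact.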
In the conjugate normal model considered here, the conditional posterior distribution of μ given σ² is Normal, and the marginal posterior distribution of σ² is an inverse Gamma; ȳ denotes the Arithmetic Mean of the sample y.
By integrating the parameters out of the joint distribution of parameters and data, p(y|μ, σ²)p(μ, σ²), the marginal density of the data can be presented in closed form (see Lenk (2009)). In Figure 1, we present the natural logarithms of consecutive estimates obtained with the HM, CHM and CAM estimators in the framework of the conjugate normal model, for 100 observations generated from a normal distribution with mean 30 and variance 4. Following Lenk (2009), the prior parameters are: m₀ = 0, w₀ = 0.05, r₀ = 3, s₀ = 3. The true value of ln p(y) is −217.893 (the horizontal dashed line in Figure 1). We report results based on 10000 Monte Carlo simulations, repeated 1000 times. As regards the subset A, in this model we assume that it is a rectangle limited by the range of the Monte Carlo sampler output. The prior probability of the subset A, P(A), was approximated using importance sampling, with the inverse Gamma-Normal distribution truncated to A as the sampling distribution; its mean and variance were set to the mean and variance of the posterior distribution of (μ, σ²)′. The evaluation of p(y) using (16) requires estimation of P(A|y). We put P̂(A|y) = 1 and applied the Monte Carlo Importance Sampling (MCIS) method with the importance function being an independent inverse Gamma-Normal distribution with the mean and variance of the posterior distribution (estimated from the MC draws). We can see from Figure 1 that the natural logarithms of the corrected HM and AM estimators cover the true value of ln p(y), and that the spread of the estimates is clearly smaller for the CAM estimator. The prior probability of A, P(A), is approximately 1.17 · 10⁻⁴.
In Table 1, average errors (AE) and root mean squared errors (RMSE) in the conjugate normal model are presented. In this model ln p(y) can be calculated analytically, so the true value of ln p(y) can be compared with its estimates. As mentioned above, we consider several very popular estimators of the marginal likelihood: the Harmonic Mean, Corrected Harmonic Mean, Arithmetic Mean, Corrected Arithmetic Mean (newly proposed), Chib, and Laplace-Metropolis estimators. All realizations of these estimators are computed on the log scale. The Arithmetic Mean estimator and Chib's method perform best of all the considered methods, which is natural in such a simple model: the normalizing constants of the conditional posterior distribution of μ and of the marginal posterior distribution of σ² are known, so the use of Chib's method or the Arithmetic Mean leads to the true value of the marginal likelihood, and their average errors and RMSE are equal to zero. Because of the "simulation pseudo-bias", the HM estimator performs worst: it has the highest average and root mean squared errors of all the estimators under consideration. This is natural, because the RMSE can be written as the square root of the sum of a variance term and a squared bias term:

RMSE = [Var + bias²]^(1/2).

Thus, more biased estimators have higher RMSE. The Corrected Arithmetic Mean estimator has smaller absolute average errors than the Laplace-Metropolis estimator, but the latter has smaller root mean squared errors. The results demonstrate that our proposed estimator can be better than the CHM estimator.
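The RMSE decomposition above is easy to verify numerically on a set of replicated estimates of a known target (the values below are arbitrary and purely illustrative):

```python
import math

# Decomposition RMSE^2 = Var + bias^2 for replicated estimates of a known target.
def rmse_decomposition(estimates, truth):
    k = len(estimates)
    mean = sum(estimates) / k
    bias = mean - truth
    var = sum((e - mean) ** 2 for e in estimates) / k  # population variance
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / k)
    return rmse, math.sqrt(var + bias ** 2)  # the two values coincide
```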

Linear regression models
The simulation properties of our new estimator can easily be checked in linear regression models. Such a model can be written in standard notation as

y = Xβ + ε,

where X is an N × K matrix of regressors and β is a K × 1 vector of parameters. Moreover, we assume that ε ~ N(0, σ²I_N), with a conjugate inverse Gamma-Normal family of prior distributions, β|σ² ~ N(b₀, σ²V₀) and σ² ~ IG(r₀/2, s₀/2). The values of the regressors are generated from a standard normal distribution. Moreover, to simulate the datasets we generated samples of size N = 25, 100, 200 data points from model (24), with K = 3, 20, 40, 100, ε ~ N(0, I_N), and β = (5, 1, −2, 1, 1, ..., 1)′. It is well known that the joint posterior density of β and σ² also has inverse Gamma-Normal form: the posterior distribution of β conditional on σ² is Normal, and the marginal posterior distribution of σ² is inverse Gamma. The natural logarithm of the marginal data density can be expressed as

ln p(y) = ln a − ((r₀ + N)/2) ln[1 + (y − Xb₀)′(I_N + XV₀X′)⁻¹(y − Xb₀)/s₀],

where

ln a = −(N/2) ln(πs₀) − (1/2) ln det(I_N + XV₀X′) + ln Γ((r₀+N)/2) − ln Γ(r₀/2)

(see, e.g. the Appendix to Lenk (2009)).
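For a single regressor (K = 1, a simplification for illustration; the paper's experiments use the general K-regressor case), the determinant and quadratic form reduce via the matrix determinant lemma and the Sherman-Morrison formula, so the closed-form ln p(y) fits in a few lines. The default hyperparameter values below are arbitrary placeholders.

```python
import math

# Closed-form ln p(y) for a one-regressor conjugate model (a sketch):
# y = beta*x + eps, eps ~ N(0, sigma^2 I_N),
# beta | sigma^2 ~ N(b0, sigma^2 * v0), sigma^2 ~ IG(r0/2, s0/2).
def log_marginal_reg(y, x, b0=0.0, v0=1.0, r0=6.0, s0=6.0):
    n = len(y)
    e = [yi - b0 * xi for yi, xi in zip(y, x)]          # y - X*b0
    xx = sum(xi * xi for xi in x)
    xe = sum(xi * ei for xi, ei in zip(x, e))
    c = 1.0 + v0 * xx                                   # det(I + v0*x*x'), determinant lemma
    q = sum(ei * ei for ei in e) - v0 * xe * xe / c     # e'(I + v0*x*x')^{-1} e, Sherman-Morrison
    return (math.lgamma((r0 + n) / 2.0) - math.lgamma(r0 / 2.0)
            - 0.5 * n * math.log(math.pi * s0) - 0.5 * math.log(c)
            - 0.5 * (r0 + n) * math.log(1.0 + q / s0))
```

For N = 1 this is the log-density of a univariate Student-t, so integrating its exponential over y should return 1, which is a convenient self-check.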
Results obtained using various estimators of the log-marginal likelihood are presented in Table 2. First, for the CHM estimator we assume that the subset A is the intersection of the region of the parameter space where the conditional density p(y|θ) exceeds the smallest value of p(y|θ) evaluated at the pseudo-random sample {θ^(q)}_{q=1}^{k} from the posterior distribution, and the hypercuboid limited by the range of the sampler output, ⊗ᵢ[min_q{θᵢ^(q)}, max_q{θᵢ^(q)}]. Moreover, the CHM estimator is computed using three methods of assessing the "simulation pseudo-bias". Namely, to approximate P(A), similarly to Lenk (2009), we use Monte Carlo Importance Sampling (MCIS): importance sampling with a Uniform importance distribution, MCIS with a Normal distribution, and MCIS with an independent inverse Gamma-Normal distribution. The first two are based on Lenk's proposition and his code fragments (available in the supplemental materials for Lenk (2009)). In the third method, the mean and variance of the importance distribution are evaluated from the MC draws from the posterior distributions of β and σ². In turn, for the CAM estimator we assume that the subset A is the hypercuboid limited by the range of the posterior sampler output, A = ⊗ᵢ[min_q{θᵢ^(q)}, max_q{θᵢ^(q)}]. Consequently, P̂(A|y) = 1 and the prior simulation support lies in regions where the likelihood is significant. To simulate the datasets, samples of N = 100 data points were generated from model (24). The Monte Carlo sampler was run for 10000 iterations. The true and estimated marginal data density values are presented in Table 2 and Figure 2.
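Building the hypercuboid A from sampler output and evaluating its indicator is a small generic utility; a sketch (function names are illustrative, not from the paper's code):

```python
# Hypercuboid A spanned by posterior sampler output: per-coordinate [min, max]
# ranges, plus the indicator I_A used to trim the prior sample (a generic sketch).
def hypercuboid(draws):
    # draws: list of equal-length parameter vectors from the posterior sampler
    dims = len(draws[0])
    lo = [min(d[i] for d in draws) for i in range(dims)]
    hi = [max(d[i] for d in draws) for i in range(dims)]
    return lo, hi

def indicator(theta, lo, hi):
    # I_A(theta): 1 if theta lies inside the hypercuboid (boundary included), else 0
    return int(all(l <= t <= h for t, l, h in zip(theta, lo, hi)))
```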
In Table 2, we also present average errors and root mean squared errors. The Corrected Arithmetic Mean estimator performs best: its average errors and RMSE are the smallest. Although the three importance sampling distributions generate different estimates from the same CHM estimator, each of them is worse than the estimates produced by the CAM estimator with the inverse Gamma-Normal density as the importance function. Moreover, the CAM estimator has a clearly less dispersed distribution than the Harmonic Mean (see Figure 2).

Table 2: Average error and root mean squared error in the linear regression model. Results obtained for simulated data from linear regression models; 1000 datasets were generated. ln p(y) was estimated with a Monte Carlo sampler based on 10000 iterations; the true value of ln p(y) was calculated analytically. ln HME denotes the natural logarithm of the Harmonic Mean estimator, ln CHME of the Corrected Harmonic Mean estimator, ln AME of the Arithmetic Mean estimator, ln CAME of the Corrected Arithmetic Mean estimator, and ln LME of the Laplace-Metropolis estimator.

Note that for K ≥ 20 the AM estimator is inefficient (an overflow or an underflow occurs), because the posterior distribution is much more concentrated than the prior. By trimming the parameter space to the subset A, we eliminate very low likelihood regions, and consequently reduce the variance of the estimator.
As regards Chib's method, for any vector θ* = (β*, σ²*)′ the posterior density can be decomposed as p(θ*|y) = p(β*|σ²*, y)p(σ²*|y), where the first term is the density function of a Normal distribution and the second is the density function of an inverse Gamma distribution, see (24) and (26). Consequently, Chib's method reduces to the use of (27).

Autoregressive models

The estimators can also be compared in models in which analytical derivation of marginal posterior densities, by integrating parameters out of the joint distribution (which determines the normalizing constant of the posterior distribution), may not be possible. We consider autoregressive models of order p,

y_t = β₀ + β₁y_{t−1} + ⋯ + β_p y_{t−p} + ε_t,  t = 1, ..., N,

with ε = (ε₁, ε₂, ..., ε_N)′ ~ N(0, σ²I_N), an independent inverse Gamma-Normal prior distribution (i.e. β ~ N_{p+1}(b₀, V₀), σ² ~ IG(r₀/2, s₀/2)), and stationarity conditions for {y_t}. In AR(p) models with independent prior distributions, the joint posterior density of β and σ² does not take a standard form, but the full conditional distributions are standard: a Normal distribution for β, truncated to the subset S of the parameter space where the stationarity conditions for {y_t} are satisfied, and an inverse Gamma distribution for σ². Thus, the Gibbs sampler can be used to simulate from the joint posterior distribution (by simulation from these Normal and inverse Gamma distributions). Similarly to the linear regression model, for the CAM estimator we assumed that the subset A is the intersection of the parameter space and the hypercuboid limited by the range of the posterior sampler output, A = ⊗ᵢ[min_q{θᵢ^(q)}, max_q{θᵢ^(q)}] ∩ S, where θ^(q) = (β^(q), σ²^(q))′, q = 1, ..., k, i = 1, ..., K + 1; the subset used for the CHM estimators is defined analogously. A sample of N = 100 data points was generated from model (28).

Table 3: Results obtained in AR(p) models with conjugate inverse Gamma-Normal priors, y held fixed for each p. SD denotes standard deviation, RMSE root mean squared error, AE average error (true minus estimate). ln p(y) was calculated via its analytical expression, given in (23).

It is worth noting that for a conjugate prior distribution, i.e. β|σ² ~ N_{p+1}(b₀, σ²V₀) and σ² ~ IG(r₀/2, s₀/2), this model is a special case of the normal linear model, with y = (y₁, y₂, ..., y_N)′ and X the matrix whose t-th row is (1, y_{t−1}, ..., y_{t−p}). Therefore, in the conjugate AR models, the joint posterior density of β = (β₀, β₁, ..., β_p)′ and σ² takes a well-known form: the product of an inverse Gamma density and a (conditional) Normal density. Thus, the natural logarithm of the marginal data density can be calculated from (27).
In Table 3, we present the main characteristics of the estimates of ln p(y) obtained with the proposed estimator and a collection of popular methods. In normal autoregressive models with conjugate prior distributions (where the marginal data density is available analytically), the CAM estimator performs best among the considered estimators: it has the smallest root mean squared errors and average errors. Note that the CAM estimator has substantially smaller average errors than the CHM estimator (with its three different methods of estimating P(A)). Thus, the newly proposed estimator has better simulation properties than that proposed by Lenk (2009). Moreover, the CAM estimator performs well even for small samples.

Table 4: Mean, standard deviation, root mean squared error and average error of the natural logarithm of estimates obtained in AR(p) models with independent inverse Gamma-Normal priors and stationarity conditions, y held fixed for each p. SD denotes standard deviation, RMSE root mean squared error relative to ln ChibE(1), AE average error relative to ln ChibE(1) (ln ChibE(1) minus estimate), IG-N inverse Gamma-Normal. ln p(y) is not available analytically.

Table 4 contains means, standard deviations, root mean squared errors and average errors (relative to ln ChibE(1)) of the natural logarithm of the estimates obtained in AR(p) models with independent inverse Gamma-Normal priors and under stationarity conditions. Unlike in the conjugate AR model, a closed-form expression for the marginal likelihood does not exist here. Therefore, the root mean squared errors and average errors are determined relative to Chib's method calculated at the posterior maximum. The Gibbs sampler was run for 10000 iterations, with the initial draw set equal to the maximum likelihood estimates. We assume b₀ = (0.1, 0.4, 0.03, ..., 0.03, −0.3)′ for the AR(p) structure used, and b₀ = (0.1, 0.4, 0.03, ..., 0.03)′ for the AR(p − 1) and AR(p + 1) structures. The vector of data used, y, is the same in the stationary AR models and in the conjugate autoregressive model. For each p ∈ {2, 5, 15} we estimate three models: AR(p) (which is true), and AR(p − 1) and AR(p + 1) (which are false). We use 1000 replications for each model.
As regards Chib's method, the joint posterior density of (β*, σ²*)′ can be decomposed as p(θ*|y) = p(σ²*|y)p(β*|σ²*, y), where (β*, σ²*)′ is chosen to be the posterior mean of (β, σ²)′ or the maximum of the posterior probability density function. Because the full conditional densities, including their normalizing constants, are known (see (29) and (30)), the conditional density p(β*|σ²*, y) is available directly, while the ordinate p(σ²*|y) is estimated from the draws of the Gibbs run (by averaging full conditional densities).

Table 5: Frequency of indications of the correct AR(p) structure based on the logarithms of Bayes factors for the true AR(p) structure against the others. Datasets for which analytically calculated Bayes factors indicate the true specification are considered. Results obtained for simulated data from autoregressive models; 1000 datasets were generated. In AR(p) models with independent inverse Gamma-Normal (IG-N) priors, ln p(y) was estimated with a Gibbs sampler based on 10000 iterations. Chib's method was calculated at the posterior means and at numerical approximations of the maximum of the posterior probability density function. In AR(p) models with the conjugate inverse Gamma-Normal prior, the true value of ln p(y) was calculated analytically; moreover, direct MC samples from the posterior distributions were used.
We can see from Table 4 that Chib's estimators have the smallest dispersion as measured by the standard deviation. The average values of the natural logarithm of the CAM estimates are the closest to those of Chib's estimates.
Because in models with independent prior distributions p(y) cannot be derived analytically, we calculated Bayes factors for the true (assumed for the purpose of simulation) AR(p) model against the neighbouring (false) models, AR(p ± 1). The results are summarized in Table 5, where the frequencies of indications of the correct structure are presented. In the AR(p) models with conjugate inverse Gamma-Normal prior distributions, the CAM estimator favours the true specifications in all cases. The LM and AM estimators perform as the second-best methods. In AR(p) models with independent inverse Gamma-Normal prior distributions, once again the LM, AM, CAM, and Chib's estimators perform best. Even though the number of iterations is fairly small and the HM estimator is biased (i.e. the marginal likelihoods are overestimated), it leads to the correct model selection in over 95.7% of cases. The HM and CHM estimators provide worse estimates of the Bayes factors than the CAM estimator does. Once again, the CAM estimator performs very well, being as good as Chib's method.

Non-nested linear regression models
We now present a classical example of Bayesian model selection considered, among others, by Carlin and Chib (1995), Han and Carlin (2001), Bartolucci et al. (2006), and Friel and Pettitt (2008), where different methods for estimating the Bayes factor between two non-nested competing linear regression models have been compared. Based on n = 42 specimens of radiata pine, the maximum compressive strength parallel to the grain (y_t) is related either to the specimen's density (x_t) or to its density adjusted for resin content (z_t). The data set is taken from Williams (1959). The two competing models are:

M_1: y_t = α + β(x_t − x̄) + ε_t, ε_t ~ N(0, σ²_ε),
M_2: y_t = γ + δ(z_t − z̄) + η_t, η_t ~ N(0, σ²_η).

In order to facilitate comparisons with other methods, we use the same prior specification as in the articles cited above, i.e. N(185, 10⁴) for β and δ, N(3000, 10⁶) for α and γ, and IG(6/2, 300²/2) for σ²_ε and σ²_η. We also assume prior independence among all the parameters. As regards the subset A, used for both the CAM and CHM estimators, it is taken to be the intersection of the region of the parameter space where the conditional density function p(y|θ) exceeds the smallest value of p(y|θ) evaluated at the pseudo-random sample {θ^(q)}_{q=1}^k from the posterior distribution with the cuboid limited by the range of the Gibbs sampler output.
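The construction of the subset A just described can be sketched as a small helper: the hypercuboid spanned by the posterior sampler output, intersected with the region where the likelihood is at least its smallest sampled value. This is a hypothetical illustration, not the paper's implementation; `make_in_A` and the toy likelihood in the usage are illustrative names:

```python
import numpy as np

def make_in_A(post_draws, loglik):
    """Build an indicator function of the set A used by the CAM/CHM estimators.

    post_draws: (k, dim) array of posterior draws.
    loglik: function theta -> ln p(y|theta).
    """
    # Hypercuboid limited by the range of the sampler output, per coordinate.
    lo, hi = post_draws.min(axis=0), post_draws.max(axis=0)
    # Likelihood threshold: smallest value over the posterior sample.
    thresh = min(loglik(th) for th in post_draws)

    def in_A(theta):
        theta = np.asarray(theta)
        return bool(np.all(theta >= lo) and np.all(theta <= hi)
                    and loglik(theta) >= thresh)

    return in_A
```

By construction every posterior draw lies inside A, so the sample-based estimate of P(A|y) equals 1, which is what makes the correction term easy to handle in practice.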
We estimated the marginal likelihoods for models M_1 and M_2, and we computed the Bayes factor in favour of model M_2, i.e. B_21. Each estimate was calculated 100 times, so that the bias, standard errors and relative errors of the estimates could be computed. Chib's method was calculated at the posterior means and at the numerical approximations of the maximum of the posterior probability density function, analogously to the example shown in Section 3.3.
The aim of this example is to compare estimates obtained using the Corrected Arithmetic Mean with those obtained by the methods proposed by Bartolucci et al. (2006) and Friel and Pettitt (2008). The Gibbs sampler was run for 40000 iterations, of which the first 10000 were treated as burn-in and discarded. The results are displayed in Table 6, where the mean, average errors, standard errors, root mean squared errors, and relative RMSE for the Bayes factor are presented. To compute the RMSE and relative RMSE, following Han and Carlin (2001), Friel and Pettitt (2008) and Bartolucci et al. (2006), we used B_21 = 4862 as computed by Green and O'Hagan (1998).
As can be seen from Table 6, the CAM estimator performs very well, having an efficiency very similar to (and in some respects better than) that of Chib's marginal likelihood method. Of course, in these models Chib's method is not difficult to use, because the full conditional distributions are standard and easy to sample from. The performance of the Harmonic and Corrected Harmonic Mean estimators is the poorest, but it is possible that the CHM estimator could be improved by selecting a better importance function. It is important to stress that our method estimates B_21 with better efficiency than the power posterior methods presented by Friel and Pettitt (2008). The biases and standard errors for estimates of B_21 obtained using the power posterior methods are several times greater than those obtained using our Corrected Arithmetic Mean. The bias and the standard error for estimates of the Bayes factor obtained using the serial MCMC approach are equal to 10 and 132, respectively. We can also compare our results with those presented by Bartolucci et al. (2006). According to their results, the estimators of B_21 based on an extension of the bridge sampling identity perform worse than the CAM estimator. The standard errors for these estimates range from 204.5 to 246.3 (over 25-fold higher than for the CAM estimator), and the relative errors range from 4.20% to 5.07%, over 26-fold higher than for the CAM estimator.

Mixture models for galaxy data
Now we show that our new estimator can help one to handle difficult problems with multimodal posterior distributions. In this section our method of approximating the marginal likelihood is applied to one-dimensional Gaussian mixture models for the so-called "galaxy data". The dataset consists of the velocities (in 10³ km/s) of 82 distant galaxies, diverging from our own galaxy, presented by Postman et al. (1986) and analysed with the use of different mixture models by a number of researchers, including, for example, Roeder (1990), Carlin and Chib (1995), Chib (1995), Phillips and Smith (1996), Raftery (1996), Richardson and Green (1997), Neal (1999), Liang and Wong (2001), Steele et al. (2006), and many others. 3 The Gaussian mixture model that we apply in our examples is the one considered by Chib (1995). In the model with d components, the conditional probability density function of independent and identically distributed observations y_t, t = 1, 2, . . . , n, is

p(y_t|θ) = Σ_{i=1}^d ω_i φ(y_t|μ_i, σ²_i),    (32)

where y = (y_1, . . . , y_n) is the vector of observations, ω = (ω_1, . . . , ω_d) is a probability vector whose elements add up to one, φ(y_t|μ_i, σ²_i) is the probability density of the normal distribution with mean μ_i and variance σ²_i, and θ = (μ_1, . . . , μ_d, σ²_1, . . . , σ²_d, ω_1, . . . , ω_d) collects all unknown model parameters. Similarly to Chib (1995), it is assumed that all elements of θ are mutually independent and have the following prior distributions:

μ_i ~ N(μ_0, σ²_0),   σ²_i ~ IG(r_0/2, s_0/2),   ω ~ Dirichlet(α_1, . . . , α_d),

where μ_0 = 20, σ²_0 = 100, r_0 = 6, s_0 = 40, α_i = 1, i = 1, . . . , d. As pointed out in Chib (1995), the prior hyperparameters are chosen so that these prior distributions reflect weak prior information about the parameters. The posterior distribution is given by p(θ|y) ∝ p(θ) Π_{t=1}^n p(y_t|θ), where p(θ) is the prior density function of the vector of parameters.
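Evaluating the mixture likelihood in (32) requires summing the d component densities for each observation, which is best done in log space for numerical stability. The sketch below is illustrative code, not the paper's implementation:

```python
import numpy as np

def log_mixture_lik(y, w, mu, sig2):
    """ln p(y|theta) for a d-component Gaussian mixture,
    p(y_t|theta) = sum_i w_i * phi(y_t | mu_i, sig2_i)."""
    y = np.asarray(y, dtype=float)[:, None]          # shape (n, 1)
    # Component log densities, shape (n, d) via broadcasting.
    log_phi = (-0.5 * np.log(2 * np.pi * sig2)
               - (y - mu) ** 2 / (2 * sig2))
    # Log-sum-exp over components to avoid underflow.
    m = log_phi.max(axis=1, keepdims=True)
    per_obs = m[:, 0] + np.log(np.exp(log_phi - m) @ w)
    return per_obs.sum()
```

With d = 1 and w = (1) the expression reduces to an ordinary Gaussian log-likelihood, which gives a quick sanity check.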
The posterior distribution has d! symmetric modes because the prior distribution is exchangeable and the likelihood function is invariant to permutations of the component labels (see Frühwirth-Schnatter (2001)). The symmetry of the posterior distributions of the parameters causes several numerical problems (e.g. the absence of label-switching in the sampler output). To simulate from the posterior distribution, Chib (1995) used the augmented Gibbs sampler. However, the Gibbs sampler can be inefficient as a numerical tool for visiting all d! symmetric modes of the posterior distribution. Different Bayesian solutions to this problem exist (see, e.g. Celeux et al. (2000), Stephens (2000), Berkhof et al. (2003), Chung et al. (2004), Jasra et al. (2005), Marin et al. (2005), Frühwirth-Schnatter (2006)). In this paper we used a simple and effective algorithm proposed by Frühwirth-Schnatter (2001): permutation sampling combined with the Gibbs sampler.
Similarly to Chib (1995), the mixture model can be expressed in terms of independent and identically distributed latent variables z_t, t = 1, . . . , n, associated with the observations, which take values in the discrete space {1, 2, . . . , d} and are such that Pr(z_t = i|θ) = ω_i and p(y_t|z_t = i, θ) = φ(y_t|μ_i, σ²_i), which leads to the mixture model in (32).
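The latent-variable representation also gives a direct way to simulate data from the model: draw each allocation z_t from the weights, then draw y_t from the selected component. A brief sketch (the component parameters below are illustrative assumptions, not estimates for the galaxy data):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 82
w = np.array([0.1, 0.6, 0.3])        # mixture weights, sum to one
mu = np.array([9.7, 21.4, 33.0])     # illustrative component means
sig2 = np.array([0.5, 4.0, 1.0])     # illustrative component variances

# Latent allocations: Pr(z_t = i | theta) = w_i ...
z = rng.choice(d, size=n, p=w)
# ... then y_t | z_t = i ~ N(mu_i, sig2_i), which marginally yields (32).
y = rng.normal(mu[z], np.sqrt(sig2[z]))
```

In the augmented Gibbs sampler the same representation is run in reverse: the allocations z are sampled from their full conditional given y and θ, which restores conjugacy for the remaining parameter blocks.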
In models with the additional assumption that the variance σ²_i remains constant across components, the full conditional distribution of σ² is IG((r_0 + n)/2, (s_0 + Σ_{t=1}^n (y_t − μ_{z_t})²)/2). Because the Gibbs sampler may not explore the whole unconstrained parameter space, but tends to stick to one posterior mode with occasional switches to other posterior modes (as a consequence, some of the posterior modes are hardly ever or simply never "visited"), the random permutation sampler introduced by Frühwirth-Schnatter (2001) is applied. Similarly to Raftery (1996), models with two, three, and four components are considered. In addition, mixture models with equal and unequal variances are taken into account. Results obtained using various simple estimators of the log-marginal likelihood are presented in Tables 7 and 8. Each method was repeated 100 times, so 100 estimates were obtained for each procedure. As in the previous example, for the CHM and the CAM estimators we assume that the subset A is the intersection of the region of the parameter space where the conditional density function p(y|θ) exceeds the smallest value of p(y|θ) evaluated at the pseudo-random sample {θ^(q)}_{q=1}^k from the posterior distribution and the hypercuboid limited by the range of the sampler output:

A = ⊗_i [min_q θ_i^(q), max_q θ_i^(q)] ∩ {θ : p(y|θ) ≥ min_{q=1,...,k} p(y|θ^(q))}.

Table 7: Results for Galaxy Data. ln p(y) was estimated with the Gibbs sampler based on 30000 iterations. Chib's method was calculated at one of the posterior modes and at the numerical approximations of the maximum of the posterior probability density function. ln AME* was computed based on 10⁸ points drawn from the prior distribution. Results reported by Neal (1999) are also included.

Consequently, the estimate of P(A|y) equals 1, and the prior simulation support lies in regions where the likelihood is significant. To estimate p(y) using (16), we applied the MCIS method with the importance function being an independent inverse Gamma-Normal and Dirichlet distribution with the mean and variance of the posterior distribution (estimated from the MCMC draws). Similarly, to approximate P(A) we use Monte Carlo Importance Sampling with independent inverse Gamma-Normal and Dirichlet distributions. The Gibbs sampler with random permutations was run for 31000 iterations, the first 1000 being discarded as burn-in. 4 In the case of Chib's method, the natural logarithm of the posterior density function, ln p(θ*|y), was computed from the decomposition ln p(θ*|y) = ln p(ω*|μ*, σ²*, y) + ln p(σ²*|μ*, y) + ln p(μ*|y), where the vector θ* was calculated from the posterior simulation output as (i) a posterior mode of ln[p(y|θ)p(θ)], and (ii) a numerical approximation of the maximum likelihood value. Unsatisfactory results were obtained when Chib's estimator was applied at the posterior means. In order to calculate the components of (38), reduced Gibbs runs were used. The value of the posterior density function at θ* was estimated through Rao-Blackwellization using reduced samples of size 30000. Since p(y) is unknown, we estimated it by the Arithmetic Mean estimator using 10⁸ draws from the prior distribution. For the sake of comparison, results reported by Neal (1999) are also included in Table 7.
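The importance sampling step for P(A) averages the indicator of A times the prior-to-importance density ratio, P(A) = E_q[1_A(θ) p(θ)/q(θ)]. A one-dimensional sketch with a normal prior and a toy interval A (all distributions here are illustrative, not the inverse Gamma-Normal and Dirichlet functions used above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: prior p = N(0, 1), importance density q = N(0, 0.5^2),
# subset A = [-1, 1], so the true P(A) is Phi(1) - Phi(-1) ~ 0.6827.
def log_prior(th):
    return -0.5 * np.log(2 * np.pi) - 0.5 * th ** 2

def log_q(th):
    return -0.5 * np.log(2 * np.pi * 0.25) - th ** 2 / 0.5

def in_A(th):
    return np.abs(th) <= 1.0

# Monte Carlo Importance Sampling: P(A) = E_q[ 1_A(theta) p(theta)/q(theta) ].
th = rng.normal(0.0, 0.5, size=200_000)
weights = np.exp(log_prior(th) - log_q(th))
p_A_hat = np.mean(in_A(th) * weights)
```

The weights stay bounded on A because q concentrates where A lies; in the mixture application the same role is played by importance densities matched to the posterior moments.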
Table 8 presents the average errors (AE) and root mean squared errors (RMSE) obtained in the equal- and unequal-variance models. A closed-form expression for the marginal likelihood does not exist; therefore, the root mean squared errors and average errors are calculated relative to ln AME*. As can be seen from the results in Table 8, our new estimator performs best. The CAM estimator appears to work very well, being better than Chib's method: it has the smallest average errors and RMSE. Due to the multimodality of the posterior distributions, the Laplace-Metropolis estimator (used here in a naive manner) is the least accurate across all models. Despite its simulation pseudo-bias, the Harmonic Mean estimator works better than the LM estimator. Finally, it is important to stress that the CAM estimator can be easily implemented, because it only requires additional draws from the inverse Gamma-Normal and Dirichlet distributions centered at the posterior means.

Discussion
In this paper a new class of estimators of the marginal data density is proposed. The idea is to correct the arithmetic mean estimator by trimming the prior sample to a certain subset A ⊆ Θ. We show that the new estimators are consistent under the following assumptions:
1. The subset A is compact;
2. A has non-zero and finite prior as well as posterior measure;
3. The likelihood is bounded on A;
4. It is possible to generate samples forming an ergodic Markov chain whose equilibrium distribution is the posterior distribution;
5. Random samples can be generated from the prior distribution (or from the importance sampling distribution).
Moreover, the CAM estimators can be used in the case of an improper prior as long as the posterior distribution is proper. In the simulations we focus on simple models in which the marginal data density, p(y), is known. Comparison of the true value of the marginal likelihood with its estimates obtained with different estimators points to the higher accuracy of the method proposed in this paper. The operational characteristics of our estimator were also illustrated using two very popular data examples. The first involved a choice between two non-nested regression models for specimens of radiata pine. The second involved Gaussian mixture models for the galaxy data. These results demonstrate that the CAM estimators are promising. The properties of the new estimators depend on the choice of the subset A. At present there is no answer to the question: "What is the most appropriate subset A?". The challenge is to choose the subset A so as to minimize the variance of the estimator.
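The corrected arithmetic mean identity, p(y) = E_prior[p(y|θ) 1_A(θ)] / P(A|y), can be checked end-to-end in a conjugate toy model where p(y) is known analytically. This is a sketch under assumed illustrative settings (a normal mean model and a simple interval A), not one of the paper's models:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(3)
n = 10
y = rng.normal(0.5, 1.0, size=n)

# Conjugate toy model y_t ~ N(theta, 1), theta ~ N(0, 1):
# posterior theta | y ~ N(m, v) with m = n*ybar/(n+1), v = 1/(n+1).
m, v = n * y.mean() / (n + 1), 1.0 / (n + 1)

def log_lik(th):
    # ln p(y|theta) for a vector of theta values
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum((y[:, None] - th) ** 2, axis=0))

# Trimming set A: posterior mean +/- 4 posterior standard deviations,
# which removes the low-likelihood regions of the prior.
lo, hi = m - 4 * np.sqrt(v), m + 4 * np.sqrt(v)

# CAM identity: p(y) = E_prior[ p(y|theta) 1_A(theta) ] / P(A|y).
th = rng.normal(0.0, 1.0, size=500_000)        # draws from the prior
in_A = (th >= lo) & (th <= hi)
num = np.mean(np.exp(log_lik(th)) * in_A)

Phi = lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
p_A_post = Phi((hi - m) / np.sqrt(v)) - Phi((lo - m) / np.sqrt(v))
log_ml_cam = np.log(num / p_A_post)

# Analytic benchmark: marginally y ~ N(0, I + 1 1').
S = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(S)
log_ml_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
                - 0.5 * y @ np.linalg.solve(S, y))
```

Here P(A|y) is available in closed form; in the applications above it is instead estimated from the posterior sampler output, and with A built from that output the estimate equals 1.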
The simulation studies and empirical examples considered in the paper demonstrate that choosing A as the intersection of the region of the parameter space in which the likelihood function exceeds its smallest value over a pseudo-random sample from the posterior distribution with the hypercuboid limited by the range of the posterior sampler output is effective, in the sense that a relatively small simulation sample suffices to obtain acceptable results.