A New Monte Carlo Method for Estimating Marginal Likelihoods

. Evaluating the marginal likelihood in Bayesian analysis is essential for model selection. Estimators based on a single Markov chain Monte Carlo sample from the posterior distribution include the harmonic mean estimator and the inﬂated density ratio estimator. We propose a new class of Monte Carlo estimators based on this single Markov chain Monte Carlo sample. This class can be thought of as a generalization of the harmonic mean and inﬂated density ratio estimators using a partition weighted kernel (likelihood times prior). We show that our estimator is consistent and has better theoretical properties than the harmonic mean and inﬂated density ratio estimators. In addition, we provide guidelines on choosing optimal weights. Simulation studies were conducted to examine the empirical performance of the proposed estimator. We further demonstrate the desirable features of the proposed estimator with two real data sets: one is from a prostate cancer study using an ordinal probit regression model with latent variables; the other is for the power prior construction from two Eastern Cooperative Oncology Group phase III clinical trials using the cure rate survival model with similar objectives. prior inference problem to compare our estimator to HM and IDR; our results show that PWK has the smallest empirical MCSE and RMSE. The computation time for our method is only slightly longer than that for the HM which indicates our spherical shell partition approach is very eﬃcient. We conduct another simulation study for a mixture of two bivariate normal distributions to illustrate the ePWK estimator, which is obtained by additionally slicing the partition rings in the partition step of the PWK method. We show that the ePWK method reduces the MCSE and RMSE by a great deal when compared to the HM and IDR methods at the cost of slightly more computation time.


Introduction
The Bayes factor quantifying evidence of one model over a competing model is commonly used for model comparison or variable selection in Bayesian inference. The Bayes factor is a ratio of two marginal likelihoods, where the marginal likelihood is essentially the average fit of the model to the data. However, the integration for the marginal likelihood is often analytically intractable due to the complex kernel (product of likelihood and prior) structure. To deal with this computational problem, several Monte Carlo methods have been developed. They include the importance sampling (IS) of Geweke (1989), the harmonic mean (HM) of Newton and Raftery (1994) and its generalization (GHM) by Gelfand and Dey (1994), the serial approaches of Chib (1995) and Chib and Jeliazkov (2001), the inflated density ratio method (IDR) of Petris and Tardella (2003) and Petris and Tardella (2007), the thermodynamic integration (TI) of Lartillot and Philippe (2006) and Friel and Pettitt (2008), the constrained GHM estimator with the highest posterior density (HPD) region of Robert and Wraith (2009) and Marin and Robert (2010), and the steppingstone sampling of Xie et al. (2011) and Fan et al. (2011). Under some mild conditions, they are all shown to be asymptotically convergent to the marginal likelihood by the ergodic theorem. They vary in using Monte Carlo samples or kernels in the Monte Carlo integration.
We assume only a single Markov chain Monte Carlo (MCMC) sample from the posterior distribution, which may be readily available from standard Bayesian software, and the known kernel function for computing the marginal likelihood. The HM and IDR estimators are the only existing methods that need only these two minimal assumptions. The main difference between the HM and the IDR estimators lies in the different weights assigned to the inverse of the kernel function. The former uses the prior function as a weight, while the latter uses the difference between a perturbed density and its kernel function. Although the HM estimator has been used in practice because of its simplicity, it can be unstable when the prior has heavier tails than the likelihood function and it is known to overestimate the marginal likelihood (Lartillot and Philippe, 2006;Xie et al., 2011).
While the IDR estimator has better control over the tails of the kernel than the HM estimator, it requires reparameterization, posterior mode calculation, and a careful selection of radius. Under the aforementioned two minimal assumptions, we extend the HM and IDR methods to develop a new Monte Carlo method, namely, the partition weighted kernel (PWK) estimator. The PWK estimator is constructed by first partitioning the working parameter space, where the kernel is bounded away from zero, and then estimating the marginal likelihood by a weighted average of the kernel values evaluated at the MCMC sample, where weights are assigned locally using a representative kernel value in each subset. We show the PWK estimator is consistent and has finite variance. When the partition is refined enough to make the kernel values in the same region similar, we can construct the best (minimum variance) PWK estimator. Our simulation studies empirically show that the proposed PWK estimator outperforms both the HM and IDR estimators with respect to root mean square error.
The rest of the article is organized as follows. Section 2 is a review of the HM, GHM and IDR methods that motivate the PWK estimator. In Section 3, we develop the PWK estimator and its theoretical properties. Additionally, in the class of the general PWK estimator, we find the best (minimum variance) PWK estimator and provide a spherical shell approach to realize it. In Section 4, an extended general PWK estimator defined on the full support of the kernel function is investigated. Besides the theoretical properties, we show that the HM and IDR estimators are special cases in this family. In Section 5, we conduct simulation studies of a bivariate normal case with the normalinverse-Wishart prior and a mixture of two bivariate normal distributions to compare the performance and computing time of the HM, IDR and PWK estimators. In Section 6, we compare the results and performance of the PWK estimator to the methods by Chib (1995) and Chen (2005) for an ordinal probit regression model. Moreover, we apply the PWK estimator to the determination of the optimal power prior using two Eastern Cooperative Oncology Group (ECOG) clinical trial data sets. Finally, we conclude with a discussion in Section 7. The proofs of all theorems are given in the Supplementary Web Materials (Wang et al., 2017a).

Preliminary
We review several Monte Carlo methods that only require a known kernel function and an MCMC sample from the posterior distribution to compute the marginal likelihood. Suppose θ is a p-dimensional vector of parameters and D denotes the data. Then, the kernel function for the joint posterior density π(θ|D) is q(θ) = L(θ|D)π(θ), where L(θ|D) is the likelihood function and π(θ) is a proper prior density. Assume Θ ⊂ R p is the support of q(θ). The unknown marginal likelihood c is defined to be Θ q(θ)dθ. The integration is often analytically intractable due to complicated kernel structure.
To estimate the normalizing constant c, Newton and Raftery (1994) suggest the following equation to motivate the HM method, Let {θ t , t = 1, . . . , T } be an MCMC sample from the posterior distribution π(θ|D) = q(θ)/c. The HM estimator is then given bŷ where the prior π(θ t ) can be viewed as the weight assigned to 1/q(θ t ). Although it has the features of simplicity and asymptotic convergence to the marginal likelihood, the finite variance is not guaranteed. Xie et al. (2011) also point out that the HM estimator tends to overestimate the marginal likelihood. Gelfand and Dey (1994) suggest the GHM estimator where π(θ) in (1) is replaced by a lighter-tailed density function f (θ) compared to q(θ): By proposing a light-tailed density, the ratio f (θ t )/q(θ t ) can be controlled. Consequently, the estimator has finite variance. However, in high dimensional problems, finding a suitable density f (θ) may be a challenge. Petris and Tardella (2003) propose the IDR estimator. They use the difference between a perturbed distribution q r (θ), which is inflated in the center of the kernel, and the posterior kernel q(θ) as the weight. The perturbed density q r (θ) is defined as where r is the chosen radius and w(θ) = θ (1 − r p /||θ|| p ) 1/p . It follows, where b r = Volume of the ball {θ : ||θ|| ≤ r} = π p/2 r p /Γ(p/2 + 1). This leads to the following equation, and the IDR estimator is given bŷ Under some mild conditions, the estimator is shown to have finite variance by Petris and Tardella (2007). However, the method requires a careful selection of radius and unbounded support of q(θ). Any bounded parameter must be reparameterized to the full real line. Also, in order to have a more efficient estimator, mode finding is essential and standardization of an MCMC sample with respect to the mode and the sample covariance matrix is required.

A New Monte Carlo Estimator
We first modify (1) and (6) We next partition the working parameter space into K subsets, where the ratio of h(θ) over q(θ) has similar values within each subset, to reduce the variance of the Monte Carlo estimator. The general form of the PWK estimator with unspecified local weights is essentially a weighted average for the harmonic mean estimator for q(θ) with the same weights assigned locally to an MCMC sample in a subset.
The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010). However, we do not require h(θ) to be a density function as in GHM or constrained GHM. Consequently, we allow a larger class of estimators to be considered.

General Monte Carlo Estimator
Suppose {A 1 , . . . , A K } forms a partition of the working parameter space Ω, where for an integer K > 0, w 1 , . . . , w K are the weights assigned to these K regions, respectively.
Let the weight function be the step function: So we can evaluate Δ: where V (A k ) is the volume of the k th subset in the partition, that is, Using the step function h(.) in (9), the PWK estimator for d ≡ 1/c is given bŷ . (10) In order to establish consistency and finite variance of the PWK estimator, we introduce two assumptions.
Assumption 1: The volume of each region V (A k ) < ∞ for k = 1, 2, . . . , K. Note that we consider the estimator for d rather than c because we can obtain an unbiased estimator with finite variance for d = 1/c. (10) is that when a certain full conditional density is available, the computation can be lessened. This is often the case in the generalized linear model with latent variables or random effects, and in any Gibbs sampler or its hybrid. To be specific, let (ϑ 1 , ϑ 2 ) be 2 blocks of parameters, ϑ 1 = (θ 1 , . . . , θ q ) and ϑ 2 = (θ q+1 , . . . , θ p ) . Assume that a full conditional density, π(ϑ 1 |D, ϑ 2 ), is available. Then, the p-dimensional estimation problem can be reduced to p − q dimensions:

The Optimal Monte Carlo Estimation
Our next step is to find the optimal weight w k in the class of PWK estimators (10), motivated by Chen and Shao (2002).
In practice, it is quite difficult to estimate the second moment α k . A very large sample size is required in order to obtain an accurate estimate of α k . However, the results shown in Theorem 2 shed light on the choices of A 1 , . . . , A K and w k . First, it is only required that w k be proportional to Remark 3. Following on Remark 1, when a full conditional density π(ϑ 1 |D, ϑ 2 ) is available, the estimatord in (11) reduces further tô .

Remark 4.
In practice, the marginal likelihood is often reported in log scale. Considering the dependence within the MCMC sample, we use the Overlapping Batch Statistics (OBS) of Schmeiser et al. (1990) to estimate the Monte Carlo (MC) standard error of − log(d). Letη b denote an estimate of the reciprocal of the marginal likelihood in log scale using the b th batch, Then, the OBS estimated MC standard error ofη = − log(d) is given by ) and a batch size B is suggested to be 10 ≤ T/B ≤ 20 in Schmeiser et al. (1990).

Construction of the Partition with Subsets A 1 , A 2 , . . . , A K
In order to make q(θ) roughly constant over A k for each k, which is a sufficient condition for the PWK estimator in (11) to be optimal, we provide the following rings approach for achieving it: Step 1: Step 2: Use the MCMC sample to compute the mean φ and the covariance matrix Σ of φ and then standardize φ by Step 3: Construct a working parameter space for ψ by choosing a reasonable radius r such that ψ < r for most of the standardized MCMC sample.
Step 4: Partition the working parameter space into a sequence of K spherical shells Step Step 7: where

Remark 5.
When K is sufficiently large,q(ψ t ) in (13) will be roughly constant over A k and the PWK estimate will be close to optimal. In addition, each kernel valueq(ψ t ) is simply the original kernel value q(θ t ) multiplied by the absolute value of the Jacobian function.

Extension of the General PWK Estimator
In this section, we generalize the PWK estimator from the working parameter space to the full support space and from the locally constant weight function to a general weight function of θ. We call this class extended PWK (ePWK) estimators.
Suppose {A 1 , . . . , A K * } is a partition of Θ, and w k (θ) is a weight function defined on A k . We need the following assumption to define this ePWK class: Under Assumption 3, the extended form of the general PWK in (10) is given bŷ Theorem 3. Under Assumption 3 and q(θ) > 0, then the ePWK estimatord * in (14) is Remark 6. It is easy to see thatd in (10) is a special case ofd * in (14). When K * = K + 1 and each fixed weight w k is assigned to an MCMC sample in each region A k except w K * = 0,d * reduces tod.
Remark 7. The HM estimator is another special case ofd * in (14). When using the prior π(θ i ) as weights, the inverse ofd * is the HM estimator. .
Remark 8. In addition,d * in (14) includes the IDR estimator as a special case. Let Thus, the inverse ofd * reduces to the IDR estimator. Note w 1 (θ t ) and w 2 (θ t ) in IDR are allowed to be negative.
Remark 9. When the posterior kernel q(.) after the transformation is roughly symmetric, the constant weight w k assigned to partition set A k constructed using the rings approach discussed in Section 3.3 often leads to an efficient PWK estimator in (10) as empirically demonstrated in Section 5.1 and Section 6. However, when the posterior kernel q(.) is very skewed or multimodal, the constant weight w k would result in an inefficient PWK estimator. For such a complex case, we can apply the ePWK estimator in (14). The functional weight w k (θ) can be constructed as follows. We first divide the is a representative point in A k , for = 1, . . . , m k . In Section 5.2, we apply this version of the ePWK estimator to an example involving a bimodal distribution to examine its empirical performance.

A Bivariate Normal Example
We apply the PWK estimator for computing the normalizing constant of the posterior of the parameters of a bivariate normal distribution with the normal-inverse-Wishart prior.
We consider both location and scale parameters to be unknown. Including the scale parameters makes computation challenging. Let y = (y 1 , y 2 , . . . , y n ) be n observations from a bivariate normal distribution, where μ ∈ R 2 and Σ are unknown parameters. The likelihood function is The prior for μ and Σ is specified as follows: with hyperparameters μ 0 , κ 0 , ν 0 , and Λ 0 . Then, the joint posterior kernel is given by . Under this setting, the analytical form of the normalizing constant is available as follows: where , κ n = κ 0 + n, and ν n = ν 0 + n. We set the hyperparameters μ 0 = (0, 0) , k 0 = 0.01, ν 0 = 3, and Λ 0 = 1 0.7 0.7 1 . We generated a random sample y with n = 200 from a bivariate normal distribution with μ = (0, 0) and Σ = 1 0.7 0.7 1 . The corresponding sample meanȳ was (−0.029, 0.040) , and the sample variance-covariance matrix S was 201.987 143.330 143.330 192.365 . Using (15), the marginal likelihood in log scale is −507.278. In this example, in order to apply the spherical shell approach in Section 3.3, a transformation of Σ was needed. Here, we used the log transformation for each variance parameter and the Fisher z-transformation for the correlation coefficient parameter to have un-bounded support for each of them. Then, we standardized each transformed MCMC sample from its transformed sample mean and standard deviation. In the new parameter space, we constructed the working parameter space and its partition by choosing r = 1.5, 2, or 2.5 and K = 10, 20, or 100. After selecting a representative point in each spherical shell, we estimated d = 1/c using (13). We compare our method to the HM and IDR methods based on 1,000 independent MCMC samples with T = 1, 000 or T = 10, 000 in Table 1 Table 1 shows the results, where the average computing time (in seconds) per MCMC sample on an Intel i7 processor machine with 12 GB of RAM memory using a Windows 8.1 operating system is given in the last column. From Table 1, we see that (i) PWK has the best performance with much smaller MCSE and RMSE than HM and IDR under both T = 1, 000 and T = 10, 000; (ii) when T increases, the MCSE and the RMSE of the PWK estimator become smaller under all choices of r and K; (iii) the performance of the HM estimator slightly improves but the IDR estimator does not when T increases; and (iv) the computing time of the PWK estimator is comparable to that of the HM estimator while the IDR estimator requires the most computing time. It is interesting to mention that the MCSE and the RMSE of the PWK estimator are very similar for all choices of r and K under each T , implying the robustness of the PWK estimator with respect to the specification of the working parameter space and the number of partition subsets.
In this example, we also examine the performance of ePWK by adding a subset A K+1 = Θ ∩ Ω c = {θ : ||θ|| > r} such that K * = K + 1 and ∪ K * k=1 A k = Θ. We further Under this specification, we have A K+1 w K+1 (θ)dθ = q(θ * K+1 ). Holding the other subsets A 1 , . . . , A K and their corresponding weights the same as for PWK, the resulting values of MCSE and RMSE by ePWK are 0.06332 and 0.06579 when T = 1, 000, K = 100, and r = 1.5; 0.05167 and 0.05420 when r = 2.0; and 0.05500 and 0.05772 when r = 2.5. Compared to the results of PWK (0.06375 and 0.06621 when r = 1.5; 0.05168 and 0.05420 when r = 2.0; and 0.05499 and 0.05772 when r = 2.5), ePWK performs very similarly to PWK, which is expected since the posterior kernel has light tails and very low values on A K+1 .
To evaluate the effect of a vague prior on the precision of the PWK estimator, we extend our simulation study by considering different values of hyperparameters κ 0 and ν 0 . Note that the value of log c in Table 1 is computed under κ 0 = 0.01 and ν 0 = 3, which corresponds to a relatively vague prior for (μ, Σ). Table 2 shows the simulation results of the PWK estimators with r = 2 and K = 100 for (κ 0 , ν 0 ) = (0.0001, 3), (1, 3), and (1, 10) in addition to (0.01, 3). From Table 2, we see that the MCSE values under these different values of (κ 0 , ν 0 ) are almost the same while the RMSE values are comparable except the last one with (κ 0 , ν 0 ) = (1, 10), in which the RMSE values are slightly larger.
the high but opposite correlations (i.e., ρ 1 = 0.99 and ρ 2 = −0.99), π(μ) cannot be homogeneous over a partition ring formed by the spherical shell approach in Section 3.3. To circumvent this difficulty, following Remark 9, we additionally slice the existing partition rings by dividing them equally along the angle from 0 to 360 degrees as shown by the dashed lines in Figure 1(b), where the center of the circle is the sample posterior mean (denoted asμ). Now, the heterogeneity of π(μ) over each partition subset is effectively eliminated by this additional slicing step. We note that this version of ePWK is the same as PWK except for additional slicing over the partition rings. Table 3 shows the results of HM, IDR, and ePWK estimators based on 1,000 independent random samples with T = 1, 000 or T = 10, 000 from (16). For ePWK, we consider different values of K (the number of rings) with the same m k = m (the number of slices) for k = 1, . . . , K and r (75%, 90%, or 95% × max 1≤t≤T ||μ t −μ||). We use the same values of r for both IDR and ePWK. From Table 3, we see that (i) the RMSE values of the ePWK are considerably smaller than those of HM and IDR; (ii) the performance of ePWK improves when the sample size (T ) or the number of rings (K) increases; and (iii) ePWK takes slightly longer computing time than HM and IDR.
Next, we consider a more challenging case, where μ 02 is replaced by (5, 5) so that the two modes are much further away from each other. Figure 2(a) is a scatter plot of a random sample with T = 10, 000 and Figure 2(b) shows the partition subsets of the chosen working parameter space. Table 4 summarizes the simulation results with the same simulation setting as before. We see that ePWK outperforms both HM and IDR under this more challenging case. As expected, the RMSE values in Table 4 are larger than those in Table 3

The Ordinal Probit Regression Model
In the first example, we apply the PWK method to computing the marginal likelihood under the ordinal probit regression model. Let y = (y 1 , y 2 , . . . , y n ) denote the vector of observed ordinal responses, each is coded as one value from 0, 1, . . . , J − 1, X denote the n × p covariate matrix with the i th row equal to the covariate of the i th subject x i , and u = (u 1 , u 2 , . . . , u n ) denote the vector of latent random variables. We consider the following hierarchical model as in Albert and Chib (1993) such that  ∼ N (0, σ 2 ). Based on the reparameterization of Nandram and Chen (1996), the cutpoints for dividing the latent variable u i can be specified as −∞ = γ 0 < γ 1 = 0 ≤ γ 2 ≤ · · · ≤ γ J−1 = 1 < γ J = ∞. Under this setting, the likelihood function is given in Chen (2005) where θ = (β , σ, γ 2 , . . . , γ J−2 ) if J ≥ 4, otherwise, θ = (β , σ) , and Φ(.) is the cumulative standard normal distribution function. Then, we specify normal, inverse gamma, and uniform priors for the parameters β, σ 2 , and γ, respectively.
To examine the performance of the PWK estimator under this model, we consider the prostate cancer data of n = 713 patients as in Chen (2005). In this data set, Pathological Extracapsular Extension (PECE, y) is a clinical ordinal response variable, and Prostate Specific Antigen (PSA, x 1 ), Clinical Gleason Score (GLEAS, x 2 ), and Clinical Stage (CSTAGE, x 3 ) are three covariates. PECE takes values of 0, 1, or 2, where 0 means that there is no cancer cell present in or near the capsule, 1 denotes that the cancer cells extend into but not through the capsule, and 2 indicates that cancer cells extend through the capsule. PSA and GLEAS are continuous variables while CSTAGE is a binary outcome, which was assigned to 1 if the 1992 American Joint Commission on cancer clinical stage T-category was 1, and assigned to 2 if the T-category was 2 or higher.
In this application, J = 3 so that all four cutpoints can be assigned to fixed values: −∞ = γ 0 < γ 1 = 0 < γ 2 = 1 < γ 3 = ∞. Then, the prior distribution is specified as π(θ) = π(β|σ 2 )π(σ 2 ), where β|σ 2 ∼ N (0, 10σ 2 I 4 ) and σ 2 ∼ IG(a 0 = 1, b 0 = 0.1). The density function of an inverse gamma distribution IG(a 0 , b 0 The marginal likelihood is not analytically available. Nevertheless, the estimates of this are obtained in Table 1 of Chen (2005) using the method proposed by Chen (called Chen's method) and the method proposed by Chib (1995) (called Chib's method). Chen's method needs only a single MCMC sample from the joint posterior distribution π(β, σ 2 |D). However, Chib's method with two blocks requires an additional MCMC sample from the conditional posterior distribution π(σ 2 |β * , D), where β * is the posterior mean of β. We compare PWK to these two methods under the same MCMC sample sizes T = 2, 500, or 5, 000 as in Chen (2005), except that Chib's method doubles them.  For the PWK, we apply a log transformation for σ 2 . Then, after the standardization of the transformed MCMC sample, we consider K = 10, 20, and 100 and r = 0.75 χ 2 5,0.95 , χ 2 5,0.95 , and 1.25 χ 2 5,0.95 to investigate robustness of the PWK estimates with respect to these choices. We note that χ 2 5,0.95 is the square-root of the 95 th percentile of the Chi-square distribution with p = dim(θ) = 5 degrees of freedom, which is derived by computing the norm of p independent standard normal distributions as in Yu et al. (2015). Table 5 shows the PWK estimates and the corresponding estimated MCSE (eMCSE) under the MCMC samples with T = 2,500 and 5,000, where eMCSE is computed using (12) with T/B = 10. We note that we use the same MCMC sample sizes as in Chen (2005). The results show the PWK estimators are relatively robust to the choice of the radius r and the number K of partition subsets.
From Table 1 of Chen (2005), the estimates of log c and eMCSE's are −758.71 and 0.038 based on Chen's method and −758.67 and 0.037 based on Chib's method for T = 2, 500; and −758.71 and 0.024 based on Chen's method and −758.70 and 0.023 based on Chib's method for T = 5, 000. We see from Table 5 that the PWK estimates of log c are similar to those under both Chen's and Chib's methods but with smaller eMCSE's under the MCMC samples with T = 2,500 and 5,000, respectively. For instance, the PWK estimates of log c and the corresponding eMCSE's are −758.70 and 0.020 for T = 2, 500 and −758.70 and 0.016 for T = 5, 000 when r = χ 2 5,0.95 and K = 100. Thus, the PWK yields a slightly more precise estimate of log c than the other two methods.

Analysis of ECOG Data
In this subsection, we apply the PWK estimator to the problem of determining the power prior based on historical data for the current analysis. Assume we have conducted two clinical trials for the same objective. A natural way to combine these two trials is to consider the power prior setting, which allows us to borrow information from the historical data to construct the prior for the current analysis. Assume we have an initial prior for the unknown parameters that is determined before observing the historical data. To quantify the heterogeneity between the current data and the historical data, the power prior weights the historical likelihood function by the power a 0 , where 0 ≤ a 0 ≤ 1, to indicate the extent to which the historical likelihood is incorporated into the initial prior. Our objective is to find the optimal a 0 which maximizes the marginal likelihood for the current data. Ibrahim et al. (2015) point out the difficulty of finding this solution except for normal linear regression models. Therefore, they resort to using the deviance information criterion (DIC) and the logarithm of pseudo-marginal likelihood (LPML) criterion for constructing the parameter a 0 of the power prior in Ibrahim et al. (2012Ibrahim et al. ( , 2015. To evaluate DIC, we need to plug the MCMC sample into the sum of the log likelihood over all data points; to evaluate LPML, we need to take the sum of the log transformation of each CPO, where the i th CPO is the harmonic mean of the i th likelihood evaluated at the MCMC sample from the posterior distribution based on the full sample. Both methods yield much less computational burden than the marginal likelihood method. We will show how the PWK estimator can circumvent the computational burden in evaluating the marginal likelihood. The effectiveness of Interferon Alpha-2b (IFN) in immunotherapy for melanoma patients has been evaluated by two observation-controlled clinical trials: Eastern Cooperative Oncology Group (ECOG) phase III, E1684, followed by E1690. The first trial E1684 was conducted with 286 patients randomly assigned to either IFN or Observation. The IFN arm demonstrated a significantly better survival curve, but with substantial side effects due to high dose regimen. To confirm the results of the E1684 and the benefit of IFN at a lower dosage, a later trial E1690 was conducted with three arms: high dose IFN, low dose IFN, and Observation. We use the data in E1684 as the historical data and a subset (high dose arm and Observation) of the E1690 trial as our current data. There are 427 patients in this subset.
For n = 427 patients in the current trial (E1690), we follow the model in Chen et al. (1999). Let y i denote the relapse-free survival time for the i th patient, ν i denote the censoring status, which is equal to 1 if y i is a failure time and to 0 if it is right censored, x i = (1, trt i ) denote the vector of covariates, where trt i = 1 if the i th patient received IFN and trt i = 0 if the i th patient was assigned to Observation. Then, the likelihood function is given by where D = (n, y, ν, X) is the observed current data, β = (β 0 , β 1 ) , and F (y|λ) is the cumulative distribution function and f (y|λ) is the corresponding density function. In (17), we use the same piecewise exponential model for F (y|λ) as Ibrahim et al. (2012), which is given by where s j−1 ≤ y < s j , s 0 = 0 < s 1 < s 2 < . . . < s 5 = ∞, and λ = (λ 1 , . . . , λ 5 ) .
For n 0 = 286 patients in the historical trial (E1684), we attempt to extract some of its information to set up the prior distribution for the current analysis. Similarly, we let y 0i denote the survival time for the i th patient, ν 0i denote the censoring status, and x 0i = (1, trt 0i ) denote the vector of covariates. So D 0 = (n 0 , y 0 , ν 0 , X 0 ) is the observed historical data. Assume π 0 (β, λ) is an initial prior. Here, we specify an initial proper prior N (0, 100I 2 ) for β and Exp(λ 0 = 1/100) (λ 0 : rate parameter) for each λ j , j = 1, . . . , 5, to come close to the flat prior in Ibrahim et al. (2012). To update the initial prior with the historical data, the power prior is intuitively set as the initial prior π 0 multiplied by the historical likelihood function with power a 0 as follows: where π(β, λ|D 0 , a 0 ) is called the power prior and 0 ≤ a 0 ≤ 1. In this setting, we can see when a 0 = 0, the power prior is exactly equal to the initial prior, which integrates to be 1, and when a 0 = 0, the power prior is equal to the right-hand side kernel function in (18) divided by c 0 = L(β, λ|D 0 ) a0 π 0 (β, λ)dβdλ. Combining the likelihood function in (17) and the power prior in (18), the posterior distribution of β and λ given (D, D 0 , a 0 ) will be π(β, λ|D, D 0 , a 0 ) ∝ L(β, λ|D)π(β, λ|D 0 , a 0 ).
In this framework, we compare the marginal likelihoods of L(β, λ|D)π(β, λ|D 0 , a 0 ) for 0 ≤ a 0 ≤ 1. The one with the highest marginal likelihood is our final model, and its corresponding a 0 determines the power prior.
For each choice of a 0 with an increment of 0.1 from 0 to 1, an MCMC sample size is fixed at 10,000. The log transformation of each λ j is needed. After the standardization of the transformed MCMC sample, we choose the maximum radius r = χ 2 7,0.95 due to p = 7, and the number of spherical shells K = 100. By (13) and (12), we can obtain the marginal likelihood estimate and its eMCSE for each chosen a 0 . We summarize the results in Table 6. Table 6 also includes the PWK estimates under r = 0.75 χ 2 5,0.95 , 1.25 χ 2 5,0.95 and K = 10, 20 to investigate the robustness of the PWK method.
Note the marginal likelihood function c can be shown to be continuous in a 0 . Therefore, from Table 6, we see that the best choice of a 0 is between 0.5 and 0.6 under the marginal likelihood criterion. This result is quite comparable to the result of a 0 = 0.4 in Ibrahim et al. (2012) obtained by DIC and LPML criteria, where a suitable marginal likelihood computation was not accessible. We also observe that the results are quite robust to the different values of r and K, and all point out that the best choice of a 0 is between 0.5 and 0.6.

Discussion
The marginal likelihood is often analytically intractable due to a complicated kernel structure. Nevertheless, an MCMC sample from the posterior distribution is readily available from Bayesian computing software. Additionally, the likelihood values evaluated at the MCMC sample are output in a file. Consequently, we can produce kernel values easily using the output and the prior function. In this paper, we propose a new algorithm, PWK, for estimating the marginal likelihood based on this single MCMC sample and its corresponding kernel values. Unlike some existing algorithms requiring knowledge of the structure of the kernel, we only need to know the kernel values evaluated at the MCMC sample. Therefore, our algorithm can be applied to Bayesian model selection, assessing the sensitivity of conclusions to the prior distribution, and Bayes hypothesis tests. We implement our methodology using the R programming language (R Core Team, 2015). The R codes along with README files are available as Online Supplementary Materials (Wang et al., 2017b).
We extend PWK to handle the parameter space with the full support (ePWK) and we show that HM and IDR are special cases of ePWK. We conduct a simulation study from a bivariate normal distribution with 5 parameters in a Bayesian conjugate  prior inference problem to compare our estimator to HM and IDR; our results show that PWK has the smallest empirical MCSE and RMSE. The computation time for our method is only slightly longer than that for the HM which indicates our spherical shell partition approach is very efficient. We conduct another simulation study for a mixture of two bivariate normal distributions to illustrate the ePWK estimator, which is obtained by additionally slicing the partition rings in the partition step of the PWK method. We show that the ePWK method reduces the MCSE and RMSE by a great deal when compared to the HM and IDR methods at the cost of slightly more computation time.
In example analyses of real data, we first consider an ordinal probit regression model, and compare our method to that in Chib (1995) and Chen (2005) with the same MCMC sample size for Chen's method (Chib's method requires twice this sample size). We find the three methods produce comparable estimates for the marginal likelihood and the PWK method produces the smallest eMCSE. In the second example, we consider a cure rate survival model with the piecewise constant baseline hazard function and a power prior construction based on two clinical trial data sets. We obtain the optimal power prior using the marginal likelihood criterion as opposed to the DIC and LPML methods considered by Ibrahim et al. (2012). We obtain similar results, except that the PWK approach indicates more borrowing of the historical data.
In unimodal problems, we suggest using the square root of the 95 th percentile in a Chi-square distribution with p degrees of freedom as a guide to choosing a value for the radius r for constructing the working parameter space of the standardized MCMC sample. This is because, after standardizing the MCMC sample, the marginal distribution of each parameter is approximately standard normal. Although the results are quite robust to the choices of r as shown in simulation and case studies, using the Chi-square distribution for guidance ensures that we make use of most of the MCMC sample and avoid the region with posterior density close to 0. For multimodal problems, we suggest using 95% × max 1≤t≤T ||μ t −μ|| as a guide value for constructing the working parameter space of the transformed MCMC sample. Since this approach may result in many partition subsets with extremely small posterior density in the working parameter space, we can use the spherical rings approach as demonstrated in Section 5.2 to obtain the homogeneity of the MCMC sample in each subset. This new partition approach can also be extended to a p-dimensional problem (p > 2) by introducing another p − 2 angular coordinates as in Lehnen and Wesenberg (2003) and slicing them as in Section 5.2.