Eﬃcient Model Comparison Techniques for Models Requiring Large Scale Data Augmentation

. Selecting between competing statistical models is a challenging problem especially when the competing models are non-nested. In this paper we oﬀer a simple solution by devising an algorithm which combines MCMC and importance sampling to obtain computationally eﬃcient estimates of the marginal likelihood which can then be used to compare the models. The algorithm is successfully applied to a longitudinal epidemic data set, where calculating the marginal likelihood is made more challenging by the presence of large amounts of missing data. In this context, our importance sampling approach is shown to outperform existing methods for computing the marginal likelihood.


Introduction
The central pillar of Bayesian statistics is Bayes' Theorem.That is, given a parameteric model M with parameters θ = (θ 1 , . . ., θ d ) and data x = (x 1 , x 2 , . . ., x n ), the joint distribution of (θ, x) satisfies π(θ|x)π(x) = π(x|θ)π(θ). (1) The four terms in (1) are the posterior distribution π(θ|x), the marginal likelihood or evidence π(x), the likelihood π(x|θ) and the prior distribution π(θ).The terms on the right hand side of (1) are usually easier to derive than those on the left hand side.The statistician has considerable control over the prior distribution and this can be chosen pragmatically to reflect prior beliefs and to be mathematically tractable.For many statistical problems the likelihood can easily be derived.However, the quantity of primary interest is usually the posterior distribution.Rearranging (1) it is straightforward to obtain an expression for π(θ|x) so long as the marginal likelihood can be computed.

Efficient Model Comparison Techniques
which is only possible analytically for a relatively small set of simple models.
A key solution to being unable to obtain an analytical expression for the posterior distribution is to obtain samples from the posterior distribution using Markov chain Monte Carlo (MCMC; Metropolis et al., 1953;Hastings, 1970).A major strength of MCMC is that it circumvents the need to compute π(x) and this has led to its widespread use in Bayesian statistics over the last 25 years or so.However, Bayesian model choice typically requires the computation of Bayes Factors (Kass and Raftery, 1995) or posterior model probabilities, which are both functions of the marginal likelihoods for the competing models.In Chib (1995) a simple rewriting of (1) was exploited to obtain estimates of the marginal likelihood using output from a Gibbs sampler.This has been extended in Chib and Jeliazkov (2001) and Chen (2005) to be used with the general Metropolis-Hastings algorithm.Importance sampling approaches to estimating the marginal likelihood have also been suggested (Gelfand and Dey, 1994), along with generalisations such as bridge sampling (Meng and Wong, 1996), which 'bridges' information from posterior and importance samples.More recently approaches have exploited the 'thermodynamic integral' such as power posterior methods Friel and Pettitt (2008).Alternative methods such as Sequential Monte Carlo (e.g.Zhou et al., 2015) and nested sampling (Skilling, 2004) do not require any MCMC: computation of the marginal likelihood and samples from the posterior distribution are produced simultaneously.A potential drawback for many of the above approaches to marginal likelihood estimation is that it may not be obvious how to apply them efficiently to models incorporating large amounts of missing data.
It should be noted that there are model comparison techniques such as reversible jump (RJ)MCMC (Green, 1995) which can used to compare models without the need to compute the marginal likelihood.RJMCMC works well for nested models where it is straightforward to define a good transition rule for models with different parameters.However, in the case where we have large amounts of missing data it is often necessary to use some form of data augmentation technique, where the missing information is inferred alongside the other parameters of the model.Using RJMCMC becomes much harder in these cases since the dimension of the parameter space (including the augmented data) becomes large.This is exacerbated further when the missing information between the competing models has a different structure.In this latter case the use of intermediary (bridging) models (Karagiannis and Andrieu, 2013) to move between the models of interest is a possibility.
The aim of the current paper is to demonstrate a straightforward mechanism for estimating the marginal likelihood of models with large amounts of missing data.The idea combines MCMC and importance sampling in a natural and semi-automatic manner to produce marginal likelihood estimates.The details of the algorithm developed are given Section 2. In Section 3 we consider a linear mixed model example.This enables us to demonstrate two important facets of the approach.Firstly, for the special case of the linear model we can compare our estimates of the marginal likelihood with exact computations which show very good agreement.Secondly, we show that making the distinction between model parameters and augmented data (random effects terms) assists tremendously in devising an efficient estimator of the marginal likelihood.In Section 4 we consider final outcome household epidemic data where in the special case of the Reed-Frost model exact, but expensive, computation of the marginal likelihood is possible for comparison purposes.In Section 5 we apply the methodology to an epidemic example, for the transmission of Streptococcus pneumoniae (Melegaro et al., 2004) comparing our algorithm to existing methods for computing the marginal likelihood demonstrating its simplicity and effectiveness in the presence of missing data.Finally in Section 6 we briefly discuss extensions and limitations of the algorithm.

Algorithm
Our starting point in the estimation of π(x) is to note that we can rewrite (2) as where q(θ) denotes a d-dimensional probability density function.We assume that if π(θ) > 0 then q(θ) > 0. Then an unbiased estimator, P q of π(x) is obtained by sampling θ 1 , θ 2 , . . ., θ N from q(θ) and setting Thus P q is an importance sampled (see, for example, Ripley, 1987) estimate of π(x), and the effectiveness of the estimator given by (4) depends upon the variability of π(x|θ)π(θ)/q(θ).
The remainder of the paper and this Section, in particular, is focussed on how we can effectively exploit (4) in the estimation of π(x).The first observation is that the optimal choice of q(θ) is π(θ|x), the posterior density but if we knew this, then π(x) would also be known.A simple solution is to use output from an MCMC algorithm to inform the proposal distribution (Clyde et al., 2007).For most statistical models the likelihood times the prior is unimodal for sufficiently large n.In these circumstances, the posterior distribution of θ is almost always approximately Gaussian with mean θ, the posterior mode, and covariance matrix Σ = −I( θ) −1 , where I(θ) denotes the Fisher information evaluated at θ.That is, we have a central limit theorem type behaviour similar to that observed for maximum likelihood estimators as n → ∞.This central limit theorem approximation is implicitly behind the Laplace approximations of integrals used in Tierney and Kadane (1986), (2.2) and Gelfand and Dey (1994), (8).This underpins the simple suggestion in Clyde et al. (2007) of using a multivariate t-distribution as an importance sampling distribution with location and scale parameters estimated from MCMC output.Alternatively, we can use a "defense mixture" (Hesterberg, 1995), where φ(•; μ, Σ) is the probability density function of a multivariate Gaussian distribution with mean μ and covariance matrix Σ, estimated from the MCMC output, and p is a mixing proportion.This proposal ensures that the ratio of the prior density to the proposal density is bounded above by 1/(1 − p) with p typically chosen to be 0.95.We found that a t-distribution proposal was preferable in Sections 3 and 4, whereas the defense mixture proposal was the preferred choice in Section 5.
We are now in position to outline the three step algorithmic procedure, which is implemented in the paper followed by highlighting the scope and limitation of the approach.The steps are as follows: 1. Obtain a sample θ 1 , θ 2 , . . ., θ K from the (approximate) posterior distribution, π(θ|x).Throughout this paper, and in practice, this will generally be achieved using MCMC with K chosen such that the sample is representative of the posterior distribution.However, any alternative method for obtaining an approximate sample from the posterior distribution could be used.
2. Use the sample θ 1 , θ 2 , . . ., θ K to derive a parametric approximation of the posterior distribution and let q(•) denote the corresponding probability density function.
For example, choosing q(•) either to be a multivariate t-distribution or a "defense mixture" will usually work well.
In situations where π(x|θ) is analytically available, the construction of an MCMC algorithm to sample from π(θ|x) will be straightforward and implementation of the algorithm will be trivial.Then the procedure becomes a simple and fast appendage to a standard MCMC algorithm.However, assuming an independent and identically distributed sample from q(•), the variance of the importance sampling estimator given in (4) is given by which again highlights the importance of the proposal q(θ) resembling the posterior π(θ|x) as closely as possible.As the dimension of θ increases the variance of the estimator will typically grow due to the curse of dimensionality (see Doucet and Johansen, 2011, page 671 for an explanation) and this is the main potential limitation.The examples in Section 5 show that the algorithm can be effectively used for moderate numbers of parameters with d = 11.In passing we remark that a dependent sample from q(•) in Step 3 of the algorithm can be exploited to reduce the variance of the estimator.A prime example is the defense mixture proposal where pN and (1 − p)N samples are drawn from the multivariate Gaussian distribution and the prior, respectively.
The motivation for the work are in circumstances where π(x|θ i ) is not readily available, see Sections 3 and 5, and further work is required to implement the algorithm.
When π(x|θ) is not available, it is often possible, with the addition of augmented data y, to obtain an analytical expression for π(x, y|θ).This can then be utilised within an MCMC algorithm to obtain samples from the joint posterior π(θ, y|x).Devising an importance sampling proposal distribution q(θ, y) approximating π(θ, y|x) will not be practical if y is high-dimensional, for example, the dimension of y is equal to or greater to n, the dimension of x.See, for example, Section 3 for limitations of this approach.The solution that we propose is to use the marginal MCMC output from π(θ|x) to inform the proposal distribution q(θ) in the importance sampling above, and then to separately consider the computation of π(x|θ), which will be largely problem specific.In the linear mixed model example in Section 3, the distribution of y i (random effect term) is readily available given θ and x i , and hence we can sample the random effects y from their full conditional distributions.This approach extends to the epidemic model in Section 5, where y represents the unobserved infectious status of individuals with respect to Streptococcus pneumoniae carriage and the Forward Filtering Backwards Sampling (FFBS) algorithm (Carter and Kohn, 1994) can be used to calculate π(y|x, θ), and hence π(x|θ).In future work we will show how particle filtering, (Gordon et al., 1993), can be applied to estimate π(x|θ) extending the scope of the algorithm with particular reference to Poisson regression models (Zeger, 1988).The estimation of π(x|θ) can be potentially computationally costly and thus the overall cost of the algorithm needs to be considered.However, the computation of the {π(x| θi )}'s can, in contrast to the MCMC runs, be undertaken in parallel, which can ease the computational burden.
In our approach each model is required to be analysed separately and the computational cost increases approximately linearly in the number of models to be compared.Therefore this approach is not competitive for comparing large numbers of nested models, for example, the inclusion or exclusion of p covariates in a generalised linear model, a situation where reversible jump MCMC (Green, 1995) can be effectively applied.Our approach is more suited to comparing a small number of competing models which potentially have rather different dynamics such as integer valued autoregressive (Neal and Subba Rao, 2007) and Poisson regression (Zeger, 1988) models for integer valued time series, an example which we will present in future work.The approach is particularly suited to situations which allow the posterior distribution of the parameters to be approximately Gaussian, assisting in the construction q(•), but this assumption can be relaxed.Furthermore, the appropriateness of a Gaussian, or t-distribution approximation of the posterior can easily be assessed from the MCMC output.

Linear mixed model
We illustrate our methodology on the linear mixed model.In particular we may wish to ask the model choice question of whether it is necessary to include a random effect in the model or not.This question would be extremely challenging to address using reversible jump MCMC because it would require an efficient proposal distribution for the complete set of random effects when jumping between models.However it is straightforward to fit both models using MCMC due to the availability of a Gibbs sampler.The full conditional distribution of the random effects then unlocks an efficient importance sampling algorithm for the calculation of the marginal likelihood.
The simplest linear mixed model takes the following form.Let the data be divided into m units or clusters, and assume that for i = 1, . . ., m and j = 1, . . ., n i , where ij ∼ N (0, σ 2 ) are independent and identically distributed errors.We assume that the random effects satisfy δ i |φ ∼ N (0, φ 2 ) and are independent conditional on the standard deviation parameter φ.The vector of unknown parameters for the model is given by θ = (β, σ, δ, φ).Let Z denote the design matrix of the fixed effects, with rows z T ij and let W be the design matrix for the random effects, so that x = Zβ + Wδ + .For a review of Bayesian approaches to generalized linear mixed models, see for example Fong et al. (2010).

Simulation study
To illustrate the application of our importance sampling technique we performed a simulation study where the true model is known.We simulated data for m = 50 clusters, each containing n i = 3 observations, giving n = 150 in total.We generated 3 predictor variables for each cluster by drawing m values from a standard normal distribution.We fixed our true parameters to be β T = (10, −20, 30), σ = 1 and φ = 2.For every cluster we assumed the same predictors and drew a random effect δ i from δ i |φ ∼ N (0, φ 2 ).Finally, the observed data x = [x ij ] were drawn from (10).

Model 1 -with random effects
For the fixed effects we chose Zellner's g-prior (Smith and Kohn, 1996), namely β|σ ∼ N (0, gσ 2 (Z T Z) −1 ).In our application we chose g = n, known as the unit information prior (Kohn et al., 2001).For the variance parameters we used inverse gamma priors: , setting these parameters equal to 1 in our implementation.These conjugate priors allow a Gibbs sampling algorithm to sample from the posterior distribution.The full conditional distributions are given by, After a burn-in of 1000 iterations, we drew 10000 samples from the MCMC.To demonstrate the increased efficiency provided by making use of the approach to handle missing data described in Section 2, we considered two importance sampling estimators.

Full posterior importance sampling
In the full posterior importance sampler, we estimate the mean and covariance matrix for the full parameter vector θ including the random effects, giving m + 5 parameters in total.We then used these as the centre and scale matrix for multivariate t-distributed proposal distribution with 5 degrees of freedom, with density q 1 (θ).We drew N = 1000 samples from this proposal, denoting them by {θ i } N i=1 .The importance sampling estimator is then P q1 given in (4).

Marginal posterior importance sampling
In the second importance sampling approach we make use of missing data technique described in Section 2. We calculate the marginal mean and covariance matrix for the (restricted) parameter vector ψ = (β, σ, φ).Again we use these as the centre and scale matrix for a multivariate t-distributed proposal with 5 degrees of freedom, but this time it has just 5 dimensions.For each ψ i that is drawn from the proposal q 2 (ψ), we sample the random effects δ i from their full conditional distribution in (15).The importance proposal is therefore given by q 2 (ψ)π(δ|x, ψ).
Note that in both of these estimators we have chosen to parameterise our proposal distribution in terms of the standard deviations (σ and φ) rather than the variances in order to lighten the tails.However since the priors are written in terms of the variance parameters we must multiply the proposal densities q 1 (θ) and q 2 (ψ) by the Jacobian σφ/4 to obtain the correct marginal likelihood estimator.

Model 0 -without random effects
The model with no random effects is just a linear model, given by x ij = z T ij β + ij .We assume the same conjugate priors for β and σ 2 described in Section 3.2 and in this case the marginal likelihood may be calculated analytically, namely, where a * and b * are given by ( 12) and ( 13) evaluated at δ = 0.For comparison purposes, we also use our importance sampling approach based on 10000 samples drawn directly from the joint posterior for β and σ.The importance proposal q was again based on a t-distribution with 5 degrees of freedom.

Results
Figure 1 shows the variation in 50 Monte Carlo replicates of each importance sampling estimator, based on 1000 samples.The importance sampling estimates for the linear model fall close to the true log marginal likelihood value, indicated by a dashed vertical line.For the linear mixed model treating the random effects as missing data and drawing them from their full conditional distribution greatly reduces the variance of the importance sampling estimator.The Monte Carlo standard errors for the linear model and linear mixed model with missing data were 0.0123 and 0.0171, compared with 0.106 for the linear mixed model with an importance proposal for the full parameter vector.
When we increase the number of clusters m from 50 to 500 (Figure 2 of the supplementary material (Touloupou et al., 2017)) the Monte Carlo standard error for the Figure 1: Variation of the log marginal likelihood estimates for the linear model and linear mixed model (lmm) over 50 replicates.For the lmm, in the full posterior approach the whole parameter vector (including random effects) was approximated in the importance proposal; in the marginal posterior approach the random effects were left out, and drawn from their full conditional distribution.A dashed vertical line indicates the true log marginal likelihood for the linear model.marginal posterior approach was 0.015, showing no increase due to the increase in missing data.For comparison, the Monte Carlo standard error for the full parameter importance sampling approach increased to 0.825 for m = 500.
The precision of the marginal likelihood estimator is important when it comes to estimating the Bayes factor.With m = 50 clusters the 50 Monte Carlo estimates of Bayes factor in favour of the linear model (from the missing data approach) fall between 1.279 and 1.359 for the marginal posterior importance sampling estimator.When the full posterior importance sampling estimator is used the 50 Bayes factors fall between 1.032 and 1.664, and it's not hard to envisage situations in which this loss of precision could lead to an incorrect conclusion.

Final outcome epidemic data
In this Section, we look at applying the methodology developed in Section 2 to final outcome household epidemic data.Specifically, we assume that the data consist of the number of individuals infected during the course of an epidemic in a number of households of various sizes.We follow Addy et al. (1991) and Neal and Kypraios (2015) in assuming that the epidemics in each of the households are independent with each member of a household having probability p G of being infected globally from the community at large and the within household epidemic spread emanating from the individuals infected globally.Within each household the disease dynamics are assumed to follow a homogeneously mixing, generalised stochastic epidemic model, where infectious individuals have independent and identically distributed infectious periods according to an arbitrary, but specified, non-negative probability distribution Q with E[Q] = 1 and during their infectious period make contact with a given susceptible in their household at the points of a homogeneous Poisson point process with rate λ L .The special case where Q ≡ 1 has the same final outcomes in terms of those infected as a household Reed-Frost model within household infection probability p L = 1−exp(−λ L ).Throughout we assign an Exp(1) prior to λ L which corresponds to a U (0, 1) prior on p L and also a U (0, 1) prior on p G .
For the household Reed-Frost epidemic it is trivial to adapt the approach of Neal and Kypraios (2015), Section 3.3 to compute the marginal likelihood exactly.Details of how this can be done are given in the supplementary material (Touloupou et al., 2017).Therefore we are able to compare our estimation of the marginal likelihood with the exact marginal likelihood.The exact computation of the marginal likelihood grows exponentially in the total number of households whilst the MCMC algorithm for estimating the parameters and the algorithm for estimating the marginal likelihood have essentially constant computational cost for a given maximum household size.Therefore the exact computation of the marginal likelihood is only practical for data containing a small number of households and we apply it to the influenza data sets from Seattle, reported in Fox and Hall (1980), which contain approximately 90 households each.
Exact computation of the marginal likelihood is not possible for general Q.Therefore we use our approach to compare three different choices of infectious period Q ≡ 1, Q ∼ Gamma(2, 2) and Q ∼ Exp(1) to study which infectious period distribution is most applicable for a given epidemic data set.This mimics analysis carried out in Addy et al. (1991) in a maximum likelihood framework where two infectious periods, a constant and a gamma with shape parameter 2, were compared for a combined data set of two influenza outbreaks in Tecumseh, Michigan, see Monto et al. (1985).
Let x denote the observed epidemic data.The recursive equations given in Ball et al. (1997), (3.12), can be used to compute P h k (h = 1, 2, . . .; k = 0, 1, . . ., h), the probability of observing k individuals out of h being infected in a household of size h.Therefore it is straightforward to compute π(x|λ L , p G ) or π(x|p L , p G ) for the Reed-Frost model.Consequently, it is trivial to construct a random walk Metropolis algorithm to sample from π(λ L , p G |x) (or π(p L , p G |x) for the Reed-Frost model) and to estimate the marginal likelihood using samples from a proposal density q(λ L , p G ), which is a multivariate t distribution with 10 degrees of freedom and mean and covariance matrix obtained from the MCMC samples.
We applied the algorithm for estimating the marginal likelihood to the two Seattle data sets and the Tecumseh data set for each of the three infectious period distributions.In all cases we ran the MCMC algorithm for 11000 iterations discarding the first 1000 iterations as burn-in and then used 1000 samples to estimate the marginal likelihood.For the Seattle influenza A data set with maximum household size of 3 it took approximately 2.4 seconds to compute each marginal likelihood in R on a desktop PC with Intel i5 processor.For the Seattle influenza B data set and the Tecumseh data set with maximum household size of 5 it took approximately 5 seconds to compute each marginal likelihood.The log marginal likelihoods are given in Table 1, they show that is little information in the data to choose between different Q, agreeing with findings in Addy et al. (1991) and Ball et al. (1997).Interestingly Q ∼ Exp(1) is preferred for the Seattle influenza A and Tecumseh data sets, whereas Q ≡ 1 is Seattle influenza B data set.For the two Seattle data sets we also computed the exact log marginal likelihoods, −15.08 and −24.98, for the household Reed-Frost model applied to the Seattle influenza A and influenza B data sets, respectively.For the Seattle data sets we repeated the estimation of the log marginal likelihood 100 times for the Reed-Frost model to obtain Monte Carlo standard errors for the estimated log marginal likelihoods of 0.0062 and 0.0073 for the Seattle influenza A and Seattle influenza B data sets, respectively, demonstrating good agreement between the estimated and exact log marginal likelihoods.The calculations of the exact marginal likelihood took approximately 0.5 and 14 seconds for the Seattle influenza A and influenza B data sets, respectively.The code for computing the exact marginal likelihood could not be applied to the combined Tecumseh data set, or even the separate Tecumseh data sets (see, for example Clancy and O'Neill, 2007) since enumerating over all possible augmented data states exceeded R's memory allocation.This problem could be circumvented to some extent using sufficient statistics as in Neal and Kypraios (2015) but the current approach offers a simple and fast alternative.1: Estimated log marginal likelihood for the three influenza data sets using the three choices of infectious period distribution; Q ∼ Exp(1), Q ∼ Gamma(2, 2) and Q ≡ 1.

Introduction
In this Section, we explore the application of the methodology developed in Section 2 to a scenario where π(x|θ) is not readily available, and data augmentation is required both with in the MCMC algorithm and estimation of the marginal likelihood.The example used is based on a longitudinal household study of preschool children under 3 years old and all household members was conducted in the United Kingdom from October 2001 to July 2002 (Hussain et al., 2005).The size of the families varied from 2 to 7, although in most there were 3 or 4 members.All family members were examined for Streptococcus pneumoniae carriage (Pnc) using nasopharyngeal swabs once every 4 weeks over a 10month period.The carriage status of each individual was recorded at each occasion as 1, if a carrier or 0, if a non-carrier.
Following Melegaro et al. (2004), we consider an Susceptible-Infected-Susceptible (SIS) epidemic model for the transmission of Pnc within a household.At any given time, an individual is assumed to be in either the susceptible non-carrier state 0, or the infectious carrier state 1.The population is divided into two age groups, children under 5 years old and everyone else greater than 5 years (whom for brevity we refer to as 'adults'), denoted by i = 1, 2, respectively.Let I 1 (t) and I 2 (t) denote the numbers of carrier children and carrier adults in the household at time t.The transition between state 0 and 1 is referred to as an infection and the reverse transition is referred to as clearance.The transition probabilities between states in a short time interval δt are defined for an individual in the age group i: where μ i and k i are the clearance and the community acquisition rates respectively for age group i and z is the household size.The rate β ij is the transmission rate from an infected individual in age group i to an uninfected individual in age group j.The term (z − 1) w in ( 16) represents a density correction factor, where w corresponds to the level of density dependence and (z − 1) is the number of other family members in a household size z.For example, w = 1 represents frequency dependent transmission, where the average number of contacts is equal for each individual in the population.
Finally, the probability of infection at the initial swab is assumed to be π i for age group i.We refer to this model as M 1 .
Given the dependence of the carriage status of all individuals in a household, the within household carriage dynamics in a household of size z can be modelled as a discrete time Markov chain with 2 z states (all possible binary vectors of infectious statuses of the z individuals).The presence of unobserved events, that may have occurred in between swabbing intervals, has been discussed previously (Auranen et al., 2000), and must be considered in setting up the model.The approach adopted in this paper to overcome this issue is to use Bayesian data augmentation methods.Model fitting is performed within a Bayesian framework using an MCMC algorithm, imputing the unobserved carriage states in each household.Let O j ⊆ {1, 2, . . ., T } denote the set of prescheduled observation times of household j = 1, 2, . . ., J, and let U j = {1, 2, . . ., T } \ O j denote the unobserved times.Let x j,t be the binary vector of carriage states for individuals in household j at observation time t.The observed longitudinal data X = [x j,t ] t∈Oj ;j=1,...,J consists of the household carriage statuses x j,t at the observation times.Similarly let y j,t be the corresponding latent carriage status of household j at time t ∈ U j , and form the corresponding missing data matrix Y = [y j,t ] t∈Uj ;j=1,...,J .Let θ denote the vector of model parameters, including the rates of acquiring and clearing carriage, the density correction w and the initial probabilities of carriage.
The remainder of this Section is structured as follows.In Sections 5.2 and 5.3, we introduce the MCMC algorithm and importance sampling algorithms, respectively, required to implement our approach.In particular, we introduce the Forward Filtering Backward Sampling algorithm (Carter and Kohn, 1994) to assist with dealing with the augmented data y.In Sections 5.4 and 5.5 simulated data (where the true model is known) are used to illustrate the implementation, performance and applicability of the proposed method and its comparative performance against a range of alternatives.We demonstrate that for a fixed computational cost our approach performs at least as well as existing methods, and with the exception of bridge sampling (Meng and Wong, 1996), performs considerably better.

Markov chain Monte Carlo algorithm
In the Bayesian approach, the missing data is represented as a nuisance parameter and inferred from the observed data like any other parameter.The joint posterior density of the latent carriage states y, and the model parameters θ can be factorized as: where z j,t equals x j,t if t ∈ O j ; y j,t if t ∈ U j and ∅ if t = 0.This factorization is based on the assumption that conditionally on the model parameters, the carriage process is assumed to be independent across households.
In order to simulate from the posterior distribution, we construct an MCMC algorithm that employs both Gibbs and Metropolis-Hastings updates.The main emphasis is on sampling the unobserved carriage process y, which we do using a Gibbs step via the Forward Filtering Backward Sampling (FFBS) algorithm (Carter and Kohn, 1994).In the first part of this algorithm, recursive filtering equations (Anderson and Moore, 1979) are used to calculate P (y j,t | z j,t+1 , x j,Oj ∩{1:t} , θ) for each t ∈ U j working forwards in time.The second part then works backwards through time, simulating y j,t from these conditionals, starting with t = max(U j ) and ending with t = min(U j ).The model parameters π 1 and π 2 are updated using Gibbs updates and the remaining parameters are updated jointly using an adaptive Metropolis-Hastings random walk proposal (Roberts and Rosenthal, 2009).

Marginal likelihood estimation via importance sampling
The availability of the full conditional distribution of the missing data P (y|x, θ) from the FFBS algorithm allows the missing data component y to be updated using a Gibb's step in the MCMC algorithm.This full conditional can be exploited further in the estimation of the marginal likelihood.We require P (x|θ) in order to form the importance sampling estimator in (4).Using Bayes' Theorem we can rewrite this as for any y such that P (y|x, θ) > 0. Therefore evaluation of P (x|θ) at the point θ can be done by evaluating the right-hand-side of (18) with any suitable y.A suitable y is guaranteed if it is sampled from the full conditional distribution y|(x, θ).
Our approach proceeds as follows.In step 1 we use MCMC to obtain samples from the joint posterior of θ and y.In step 2 we fit a multivariate normal distribution to the posterior samples for θ only, and use it to construct a normalised proposal density q(θ).In step 3, we obtain N samples from q(θ) and for each sample θ i we obtain a corresponding sample for the missing data y i using the Forward Filtering Backward Sampling algorithm.We then use these samples to calculate the importance sampling estimator of the marginal likelihood: The choice of q(θ) is important for the accuracy and computational efficiency of the importance sampling approach.As discussed earlier, we want q(θ) to be a good approximation of π(θ|x) but with heavier tails to ensure that the variance of P q is small.We therefore investigate a range of proposals distributions based on a fitted multivariate normal distribution with mean μ and covariance matrix Σ based on the MCMC output.These include drawing θ from IS Nj : N (μ, jΣ) (j = 1, 2, 3), a multivariate Normal distribution with different variances; IS t d : t d (μ, Σ) (d = 4, 6, 8), a multivariate Student's t distribution with d degrees of freedom, mean μ and covariance matrix d d−2 Σ (if d > 2) and IS mix : q(θ) = 0.95 × N (θ; μ, Σ) + 0.05 × π(θ) (mixture of a multivariate Normal density and the prior).

Marginal likelihood estimation
We consider the problem of estimating the marginal likelihood under the model introduced in Section 5.1, using the methods described above.These estimators were evaluated on synthetic data analogous to the real data in Melegaro et al. (2004).More specifically, the parameter values were based on the maximum likelihood estimates from the analysis of Pnc data; parameters were chosen to be k 1 = 0.012, k 2 = 0.004, β 11 = 0.047, β 12 = 0.005, β 21 = 0.106, β 22 = 0.048, μ 1 = 0.020, μ 2 = 0.053, w = 1.184, π 1 = 0.425 and π 2 = 0.095.We set the time-interval δt = 7.Only complete family transitions, where the infection state of all household members was known on two consecutive observations, were used previously (Melegaro et al., 2004; 51% of the full dataset).Although our approach could easily handle the missing data, for comparability we match the number of complete transitions by family size and number of adults to generate our data set; a total of 66 families comprising 260 individuals including 94 children under 5 years.The simulations were designed so that real and simulated datasets have the same sampling times.The hidden variable y consists of 1650 y j,t 's, comprising 6500 unobserved binary variables in total.
We compare the proposed importance sampling approach for estimating the marginal likelihood (based on the 7 proposal densities) with bridge sampling (Meng and Wong, 1996) (using the importance samples from IS mix ), harmonic mean (Newton and Raftery, 1994), Chib's method (Chib, 1995;Chib and Jeliazkov, 2001) and the power posteriors method (Friel and Pettitt, 2008).Details of the computation of these estimators are given in the supplementary material (Touloupou et al., 2017).To compare the different methods on a fair basis, we chose to dedicate equivalent amounts of computational effort for estimation of the log marginal likelihood, instead of fixing the total number of samples.
Implementation details are given as follows.The construction of the importance density was based on 25000 MCMC samples after a burn-in of obtained from the MCMC sampler described in Section 5.2.These posterior samples were used to estimate the reference parameters μ and Σ for a multivariate Student's t or normal proposal density.The marginal likelihood estimate was then based on 25000 importance sampling draws from the obtained proposal density q(θ), using the estimator in (19).To produce the bridge sampling estimate, the 25000 samples from IS mix were combined with 250 thinned samples from the MCMC.In order to apply Chib's methods, the same posterior samples were used for computing the high posterior density point.The log marginal was estimated by generating 22000 draws in each complete and reduced MCMC run, with the first 2000 draws removed as burn-in.Harmonic mean analysis was based on 50000 posterior samples, following a 3000 iteration burn-in.For the power posterior method, it was necessary to specify the temperature scheme and a pilot analysis (not counted in the computation cost) was used to choose 20 partitions on the unit interval.The MCMC sampler was run for 2650 iterations for each temperature in the descending series, omitting the first 650 as burn-in, finishing with 2650 samples at t = 0 (the prior).
Each procedure was repeated 50 times to provide an empirical Monte Carlo estimate of the variation in each approach.We also vary the total running time in order to investigate the effect of this on the accuracy of the marginal likelihood estimates, see Table 1 in the supplementary material.For each analysis method we used the same priors: Gamma(0.01,0.01)for the density factor w; Beta(1,1) for the initial probabilities of infection π 1 and π 2 and Gamma(1,1) for the remaining parameters.
Figure 2 shows the variability of the eleven marginal likelihood estimators.Except for the harmonic mean, all the methods appear to have produced consistent estimates of the marginal likelihood.Chib's method produced better estimates of the marginal likelihood than the power posterior method, which is more computationally expensive than the other methods and therefore uses a small number of MCMC samples at each temperature, leading to large uncertainty.However as seen in Figure 2, the bridge sampling and the importance sampling methods offer significant improvements in precision over the other methods.Moreover, increasing the number of samples N , led to a decrease in the Monte Carlo standard errors of order O( √ N ), see Table 1 in the supplementary material, indicating that the variances of the corresponding estimators are finite.
The success of the importance sampling approach is not surprising since it explores the posterior distribution of parameters more efficiently than the other methods due to the independence of the samples drawn from the proposal density.More surprisingly we were unable to use the bridge sampling technique to improve substantially on the standard errors, which dropped from 0.0196 for IS mix to 0.0179 for BS mix .The bridge sampling estimator attempts to combine information from the MCMC and importance samples, however the optimal estimator is derived assuming that independent samples from the posterior were available, which we approached by applying a thinning of 100 to the samples.With low levels of thinning (results shown in supplementary material Figure 4) we found that bridge sampling actually increased the standard error of the marginal likelihood estimate.
On the basis of this example, the lowest variance importance sampling estimator was obtained using the proposal density IS mix -a mixture of the prior and the Figure 2: Top: Boxplots of the estimated log marginal likelihood for model M 1 over 50 replicates for our importance sampling approach with the mixture proposal (IS mix ), Chib's method (Chib), power posteriors method (PP) and harmonic mean (HM) (note the different scales for the top and bottom plots).Bottom: Zoomed in boxplots of the estimated log marginal likelihood for model M 1 over 50 replicates for each of our importance sampling approach (IS N1 , IS N2 , IS N3 , IS t4 , IS t6 , IS t8 , IS mix ) and bridge sampling (BS mix ).normal fitted to the posterior samples.Therefore, in the next section we use this proposal density when estimating the log marginal likelihood via importance sampling.

Model comparison
In this Section, we apply the marginal likelihood estimation approaches to the problem of Bayesian model choice.We focus on their ability to distinguish between biologically motivated hypotheses concerning the dynamics of Pnc transmission.In particular we compare their performance against the established technique of Reversible Jump Markov Chain Monte Carlo (RJMCMC) and then demonstrate that the importance sampling approach can solve problems that are extremely challenging with RJMCMC.We show that using our approach it is possible to answer the epidemiological important question of how household size is related to transmission with extended discussion given in the supplementary material.
Suppose that we wish to evaluate the evidence in favour of the community acquisition rates being equal for adults and children, in the hope of developing a more parsimonious model.We call the model described in Section 5.1, in which children have community acquisition rate k 1 and adults have rate k 2 , model M 1 .The nested model, in which k 1 = k 2 is called M 2 .We generated realistic simulated datasets from each of these models and then used importance sampling, bridge sampling, Chib's method, power posteriors, the harmonic mean and reversible jump MCMC to estimate the Bayes factor in favour of M 1 , denoted by B 12 .As before, we used approximately the same computational effort for each of these approaches.For M 1 we assumed k 1 = 0.012 and k 2 = 0.004, whilst for M 2 we assumed k 1 = k 2 = 0.008.Details of the RJMCMC algorithm for selecting between models M 1 and M 2 are given in the supplementary material.We ran the RJMCMC chain with a 30000 burn-in followed by 76000 samples which ensured that similar computational effort was given to RJMCMC as to the other methods.When the evidence is strongly in favour of one model, the RJMCMC will not move between models very often and can provide poor estimates of the Bayes factor.A variant of the method, called RJMCMC corrected (RJcor), can tackle this issue by assigning higher prior probability to the model that is visited less often.This probability is estimated as π(M m ) = 1 − π(M m | x), where π(M m | x) is obtained from a pilot run of RJMCMC with initial π(M m ) = 0.5, for m = 1, 2. For RJcor we did 30000 pilot iterations and then another 76000 iterations, of which 30000 were discarded as a burn in.
Figure 3 provides a graphical representation of the variability in log(B 12 ) over 50 repeats of each Monte Carlo approach.The plot highlights that the estimators based on importance sampling and bridge sampling were the most accurate in both scenarios.The left panel of Figure 3 gives results for data generated from M 1 .Importance sampling, bridge sampling, Chib and RJ methods lead to similar estimates, whereas power posterior and harmonic mean overestimated the log Bayes factor.Moreover, RJcor produced slightly more accurate estimates of the log Bayes factor than vanilla RJMCMC.All methods selected the correct model, with largest variation from the harmonic mean estimator.In the right panel of Figure 3, the results use data generated from model M 2 .Due to the huge variance in log(B 12 ), the harmonic mean sometimes favoured the wrong model.Although the remaining methods correctly identified the true model, the importance and bridge sampling methods again produced the most precise estimates of the Bayes factor; the standard errors provided by the two methods are almost identical.
Figure 3: Variability of the log Bayes factor estimates based on 50 Monte Carlo repeats for the importance sampling method with mixture proposals (IS mix ), bridge sampling method with mixture proposals (BS mix ), Chib's method (Chib), reversible jump MCMC (RJ), corrected reversible jump MCMC (RJcor), power posteriors (PP) and harmonic mean (HM) methods (note the different scales for the top and bottom plots).
Figure 4 demonstrates the evolution of the log Bayes factor in favour of M 1 as a function of computation time using data generated from M 1 .The importance sampling estimator (in blue) converges much more rapidly than the other estimators, showing very tight credible intervals.Chib's method (in green) and corrected RJMCMC (in red) appear to converge to the same value, but more slowly and have wider CIs.The power posterior method gradually approaches the consensus estimate, requiring significantly more samples to stabilize.The harmonic mean estimator was heavily unstable and also provided much wider credible intervals than the other methods.
In the supplementary material (Touloupou et al., 2017), further model comparison questions are considered and the strength of the importance sampling technique for answering these questions is further demonstrated.In particular, we consider heterogeneity in household transmission rates, density-dependence in within-household transmission and the amount of missing data.

Conclusions
In this paper we have introduced a simple three stage algorithm for efficiently estimating the marginal likelihood.The key components are an MCMC algorithm for obtaining samples from the posterior distribution, π(θ|x), an approximating distribution q(θ) to Figure 4: Evolution of log Bayes factor estimates in favour of model M 1 as a function of computation time.The solid lines corresponds to the median and the shaded areas give the 95% credible intervals, estimated from 50 Monte Carlo replicates.Yellow represents the harmonic mean method, grey is for the power posterior, red and green correspond to RJMCMC corrected and Chib's methods respectively and blue represents the importance sampling approach with the mixture proposals.sample from and an effective estimate of the likelihood π(x|θ).The first observation is whilst an MCMC algorithm will often be relatively straightforward to construct, alternative methods for sampling from the posterior distribution could be equally considered.Moreover, it is not important if a sample from an approximate posterior distribution (for example, Monte Carlo within Metropolis;O'Neill et al., 2000) is used since all that is required for computation of the marginal likelihood is to be able to make a reasonable choice of q(•).The key limitation to using this approach is effective estimation of the likelihood π(x|θ) in cases where it is not analytically tractable.In Section 5 of this paper the temporal nature of the data allowed the FFBS algorithm to be utilised to compute π(x|θ) and more generally filtering methods are a promising avenue of research to explore in the estimation of π(x|θ).The importance sampling and the associated estimation of the likelihood is trivially parallelisable which can be utilised to speed up implementation.Finally, in cases where the likelihood can easily be computed the algorithm becomes a simple add-on to MCMC to compute the marginal likelihood.