Posterior-based proposals for speeding up Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) is widely used for Bayesian inference in models of complex systems. Performance, however, is often unsatisfactory in models with many latent variables, due to so-called poor mixing, necessitating the development of application-specific implementations. This paper introduces 'posterior-based proposals' (PBPs), a new type of MCMC update applicable to a huge class of statistical models (those whose conditional dependence structures are represented by directed acyclic graphs). PBPs generate large joint updates in parameter and latent variable space, while retaining good acceptance rates (typically 33%). Evaluation against other approaches (from standard Gibbs/random walk updates to state-of-the-art Hamiltonian and particle MCMC methods) was carried out for widely varying model types: an individual-based model for disease diagnostic test data, a financial stochastic volatility model, a mixed model used in statistical genetics and a population model used in ecology. While different methods worked better or worse in different scenarios, PBPs were found to be either near to the fastest or significantly faster than the next best approach (by up to a factor of 10). PBPs therefore represent an additional general-purpose technique that can be usefully applied in a wide variety of contexts.


Introduction
Markov chain Monte Carlo (MCMC) techniques allow correlated samples to be drawn from essentially any probability distribution by iteratively generating successive values of a carefully constructed Markov chain. This flexibility has led MCMC to become the method of choice for inferring model parameters under Bayesian inference [1]. However, for high dimensional systems (e.g. where inference is over many tens, hundreds or even thousands of variables), MCMC often suffers from a problem known as "poor mixing". This manifests itself as a high degree of correlation between consecutive samples along the Markov chain, so requiring a very large number of iterations to adequately explore the posterior [2]. This limitation is of practical importance, because it restricts the possible models to which MCMC can realistically be applied. The focus of this paper is to introduce and explore a new approach that helps alleviate these mixing problems, thus reducing the computational time necessary to generate accurate inference. This approach has practical advantages over existing methodologies that aim to address the same problem [3][4][5].
In Bayesian inference the posterior distribution π(θ,ξ|y) represents the state of knowledge concerning the parameters θ and latent variables ξ of a given stochastic model taking into account data y. Using Bayes' theorem this posterior distribution can be expressed as

π(θ,ξ|y) = π(y|ξ,θ)π(ξ|θ)π(θ)/π(y),    (1)

where π(y|ξ,θ) is here referred to as the observation model, π(ξ|θ) is the latent process likelihood (i.e. the model that is usually the focus of interest), π(θ) is the prior distribution (representing the state of knowledge prior to data y being considered), and π(y) is a constant factor known as the model evidence [6].
An MCMC implementation of Bayesian inference aims to produce samples from the posterior, i.e. a list of parameter values θi and latent variables ξi distributed in accordance with Eq. (1). This is achieved by sequentially proposing some change to the current state θi and/or ξi to generate θp and ξp, and accepting or rejecting this change with a Metropolis-Hastings (MH) probability to create the next member on the list, i.e. θi+1 and ξi+1 [7]. Note, in some instances it is possible to probabilistically generate samples directly, e.g. via Gibbs sampling [8] or slice sampling [9], without the need for an accept/reject step. The term Monte Carlo refers to the probabilistic nature of these updates, which form a Markov chain, i.e. step i+1 in the chain depends only on the state at the previous step. A typical approach is to sequentially update each parameter and latent variable separately. Figure 1(a) illustrates the problem of poor mixing resulting from implementing this "standard" MCMC when there is a strong dependency between latent variables and model parameters. The dark shaded area represents high posterior probability as a function of θ and ξ. Consider first fixing θ and making changes to ξ, as shown by the dashed line in Fig. 1(a). Because MCMC samples are probabilistically constrained to lie in the shaded region, the chain will make limited progress even for a large number of MH updates. Similarly, changes to θ whilst fixing ξ will be restricted to move along horizontal lines, which are again limited in scope. A typical output from an MCMC algorithm which independently updates parameters and latent variables is illustrated in Fig. 1(b), which shows the trace plot for one of the variables in θ. Thousands of MH updates are potentially needed to generate just one uncorrelated effective sample from the posterior.
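As an illustration (not taken from the paper), the single-site random-walk MH scheme described above can be sketched in a few lines of Python. The toy target below is our own invention: a strongly correlated bivariate normal standing in for the θ-ξ dependence, with the step size chosen arbitrarily. Updating one coordinate at a time leaves the chain crawling along the posterior ridge, which shows up as very high autocorrelation in the trace.

```python
import math, random

def mh_single_site(logpost, x, step=0.5, rng=random):
    """One sweep of single-site random-walk Metropolis-Hastings:
    perturb and accept/reject each coordinate of x in turn."""
    x = list(x)
    for k in range(len(x)):
        prop = list(x)
        prop[k] += rng.gauss(0.0, step)
        # Accept with the usual MH probability min(1, pi(prop)/pi(x))
        if math.log(1.0 - rng.random()) < logpost(prop) - logpost(x):
            x = prop
    return x

# Toy target (hypothetical): theta ~ N(0,1), xi | theta ~ N(theta, 0.01),
# mimicking the strong theta-xi dependence described in the text.
def logpost(v):
    t, xi = v
    return -0.5 * (t * t + (xi - t) ** 2 / 0.01)

random.seed(1)
chain = [[0.0, 0.0]]
for _ in range(2000):
    chain.append(mh_single_site(logpost, chain[-1]))
```

Because each conditional distribution is much narrower than the marginal, the lag-1 autocorrelation of the θ component stays close to 1, exactly the behaviour sketched in Fig. 1(b).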
The way out of this sorry state of affairs is shown in Fig. 1(c). Here the proposals are performed jointly in parameter and latent variable space (as indicated by the pink shaded region) and share a similar correlated structure to the posterior itself. In this case, proposals can jump much further without the posterior probability becoming negligibly small. Consequently, as illustrated in Fig. 1(d), successive samples are less correlated and fewer are needed to be representative of the posterior.
Various techniques to perform joint updates have been proposed in the literature. For example, particle MCMC (PMCMC) [3] samples a new set of parameters θp relative to θi and then sets about generating ξp. This is achieved by directly sampling from the model π(ξ|θp) multiple times, with each instance referred to as a "particle". The final result is built up in a series of stages that sequentially take into account a larger fraction of the data y. At the end of each stage those particles which agree well with the sequentially introduced observations are duplicated at the expense of those which agree less well (so-called particle filtering). Whilst this method exhibits very good mixing, computational speed is often compromised because the number of particles needed to generate a reasonable acceptance probability can be very large [10]. Furthermore, PMCMC is limited to scenarios in which data are sequentially ordered (such as time series data).
Approximate Bayesian computation (ABC) [3] also samples directly from the model, but rather than fitting the samples to the data through a full observation model, as in Eq. (1), here the fit with the data is characterised by a much simpler distance measure χ. Often χ is specifically chosen such that a substantial proportion of simulated samples contribute to the final result. Such an approach, however, comes at a significant cost, because ABC generates only approximate, rather than exact, draws from the posterior [11]. Furthermore, doubts have been cast on ABC's ability to accurately perform model selection [11,12].
In Hamiltonian MCMC (HMCMC) [4], θi and ξi are dynamically changed through a series of small intermediary steps (which take into account local gradients in the log of the posterior probability) to reach θp and ξp. This final state is then accepted or rejected with an overall MH probability. This, again, produces good mixing (due to the fact that θi,ξi and θp,ξp can be widely separated), but its efficiency is critically dependent on the number of intermediary steps needed for a sufficiently high acceptance rate [4]. Although recent improvements have helped to optimise this technique (most notably the "No-U-Turn Sampler" introduced in [13]), HMCMC is applicable only to those models for which θ and ξ are continuous quantities. Consequently, it is not well suited to tackling models with discrete variables, e.g. disease status [14], or a variable number of dimensions, e.g. event data [15].
Another approach is to develop a non-centred parameterisation [5,16,17], in which a new set of latent variables ξ' (whose distributions are typically independent of the model parameters θ) is introduced. These new variables are related to the original latent variables through some deterministic function ξ=h(ξ',θ,y). MCMC implemented using this re-parameterisation can lead to improved mixing [17]. In this case, proposed changes to ξ' can be thought of as joint proposals in both ξ and θ.

This paper introduces a new class of MCMC proposal valid for the vast majority of statistical models (Section 2). We refer to these as "posterior-based proposals" (PBPs, Section 3), as they are constructed with the aid of importance distributions (Section 4) which approximate the posterior. PBPs enable joint updates to both θ and ξ (unlike standard approaches), are fast (i.e. they don't require multiple particles like PMCMC), are accurate (i.e. they draw samples from the true posterior, unlike ABC) and can be applied to continuous or discrete state space models (unlike HMCMC). A further novelty of PBPs is that they not only account for correlations between θ and ξ inherent to the model, but can also take into account the data (unlike non-centred re-parameterisations). Applications to models used in disciplines ranging from statistical genetics to epidemiology to finance demonstrate that PBPs potentially offer considerable improvements in performance over standard approaches (Section 5).

Broadly applicable model framework
PBPs are applicable to any statistical model whose conditional dependence structure can be represented by a directed acyclic graph (DAG) [18], as illustrated in Fig. 2. This encompasses a vast range of statistical models, including mixed models [19], generalised linear mixed models [20] and discrete time Markov processes [21].
A key property of DAGs is that the indexes e for the latent variables can be ordered such that each element ξe is conditionally dependent only on those elements with lower index, ξe'<e (a property known as topological ordering).
Simulation consists of sequentially sampling each latent variable ξe (starting from e=1 up to e=E) from a set of model-defining univariate probability distributions π(ξe|ξe'<e,θ). Consequently, the latent process likelihood in Eq. (1) can be expressed as the product of these univariate distributions over all e.

Table 1: PBP sampling schemes given some importance distribution (ID) functional form (left hand column), whose characteristic parameters change from some initial set to a proposed set.
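This topologically ordered simulation, and the corresponding factorisation of the latent process likelihood, can be sketched in Python. The chain-structured toy model below (each ξe Poisson with a rate depending on its predecessor) is our own invention for illustration; only the structure (sample in order, sum the univariate log-probabilities) reflects the framework described above.

```python
import math, random

# Hypothetical chain-structured DAG: xi_1 ~ Po(theta),
# xi_e ~ Po(theta + 0.5*xi_{e-1}) for e > 1.
def rate(e, xi_prev, theta):
    return theta if e == 0 else theta + 0.5 * xi_prev

def log_pois(k, lam):
    """log p.m.f. of a Poisson(lam) variate."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def simulate(E, theta, rng):
    """Sample xi_1..xi_E in topological order from pi(xi_e | xi_{e'<e}, theta)."""
    xi = []
    for e in range(E):
        lam = rate(e, xi[-1] if xi else None, theta)
        # Crude Poisson sampler via c.d.f. inversion (fine for small lam)
        u, k, p, cum = rng.random(), 0, math.exp(-lam), 0.0
        while True:
            cum += p
            if u <= cum:
                break
            k += 1
            p *= lam / k
        xi.append(k)
    return xi

def latent_loglik(xi, theta):
    """log pi(xi|theta) = sum_e log pi(xi_e | xi_{e'<e}, theta)."""
    return sum(log_pois(xi[e], rate(e, xi[e - 1] if e else None, theta))
               for e in range(len(xi)))

rng = random.Random(0)
xi = simulate(5, 2.0, rng)
```

The same `rate` function serves both simulation and likelihood evaluation, which is exactly what the DAG factorisation buys.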

Table 1(b):
Draws from the hypergeometric distribution X ~ HG(N, K, n) have p.m.f. P(X=k) = C(K,k)C(N−K,n−k)/C(N,n), i.e. the probability of k successes in n draws, without replacement, from a population of size N containing K successes.
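This p.m.f. can be checked numerically; a minimal sketch using Python's `math.comb` (with the parameter convention above: population N, K successes, n draws):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """p.m.f. of X ~ HG(N, K, n): probability of k successes in n draws
    without replacement from a population of N containing K successes."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Sanity check: the p.m.f. sums to 1 over its support.
total = sum(hypergeom_pmf(k, 20, 7, 5) for k in range(6))
```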

Aim
PBPs first propose a new set of parameters θp relative to θi, and then generate ξp by stochastically modifying ξi to account for this change in parameters (note, this is in stark contrast to PMCMC, which aims to sample ξp directly from π(ξ|θp,y) without reference to ξi or θi). The novel PBP process involves sequentially sampling each latent variable ξ_e^p from e=1 up to E, and makes use of so-called "importance distributions" (IDs) applied to both the initial and proposed states. These importance distributions f_ID(ξe|ξe'<e,θ,y) are approximations to the posterior distributions π(ξe|ξe'<e,θ,y) used in importance sampling (see Appendix A for further details). For clarity, we leave a discussion of how these approximations are made in practice until Section 4.

Example
We first run through an illustrative example of a PBP and then provide a general description in Section 3.3. Figure 3 illustrates the case in which the ID for ξe is Poisson, with expected event number λi in the initial state and λp in the proposed state. A unique feature of PBPs is that the sampling distribution for ξ_e^p crucially depends on the relative size of λp and λi. When λp>λi (as it is in Fig. 3), ξ_e^p is generated by adding a Poisson distributed variable onto ξ_e^i, with expected event number given by the difference between λp and λi:

ξ_e^p = ξ_e^i + X,  X ~ Po(λp−λi)    (4)

(e.g. in Fig. 3, with ξ_e^i=5, a random Poisson sample X=8 results in ξ_e^p = ξ_e^i + X = 13). Such an approach makes sense, because adding an approximately Poisson distributed quantity with expected event number λp−λi to one with expected event number λi gives an approximately Poisson distributed variable with expected event number λp, as required for ξ_e^p on the left hand side of Eq. (4). Thus, Eq. (4) modifies ξ_e^i to generate ξ_e^p, accounting for the change in λ.
On the other hand, when λp≤λi the actual number of events in the proposed state ξ_e^p should be less than in the initial state ξ_e^i, because the expected number of events parameter λ has reduced.
Specifically, each existing event in ξ_e^i is retained in ξ_e^p with probability λp/λi, which is equivalent to sampling ξ_e^p from the following binomial distribution:

ξ_e^p ~ Bin(ξ_e^i, λp/λi).    (5)
Together, the two potential sampling schemes for ξ_e^p in Eqs. (4) and (5) (selected depending on whether λp is bigger or smaller than λi) make up the PBP proposal in cases in which the ID is Poisson. This is summarised by the first line in Table 1.
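The Poisson scheme of Eqs. (4) and (5) is short enough to state in code. The sketch below is a direct transcription (with a basic inversion sampler for the Poisson draws): superposition when the rate rises, binomial thinning when it falls. Both branches rely on exact properties of the Poisson distribution, so if ξ_e^i ~ Po(λi) then ξ_e^p ~ Po(λp), which the final lines check empirically.

```python
import math, random

def sample_poisson(lam, rng):
    """Inversion sampler for a Poisson(lam) variate (fine for small lam)."""
    u, k, p, cum = rng.random(), 0, math.exp(-lam), 0.0
    while True:
        cum += p
        if u <= cum:
            return k
        k += 1
        p *= lam / k

def pbp_poisson(xi_i, lam_i, lam_p, rng):
    """Poisson-ID PBP step: superpose extra Po(lam_p - lam_i) events when the
    rate rises (Eq. 4); thin each event with probability lam_p/lam_i when it
    falls (Eq. 5). When lam_p == lam_i the state is returned unchanged."""
    if lam_p > lam_i:
        return xi_i + sample_poisson(lam_p - lam_i, rng)
    return sum(1 for _ in range(xi_i) if rng.random() < lam_p / lam_i)

# Empirical check: map draws xi_i ~ Po(3.0) through the scheme with
# lam_p = 1.2; the proposals should then be distributed as Po(1.2).
rng = random.Random(42)
props = [pbp_poisson(sample_poisson(3.0, rng), 3.0, 1.2, rng)
         for _ in range(20000)]
mean_p = sum(props) / len(props)
```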

General approach
In general the choice of ID will depend on the distribution π(ξ|θp,y), which is itself model dependent.
If, for example, π(ξe|ξe'<e,θ,y) is better represented by a normal ID, the PBP sampling procedure is taken from the second line in Table 1. In all, Table 1 summarises sampling schemes for twelve different ID functional forms, each corresponding to probability distributions commonly used in statistical models. These schemes are specifically designed to satisfy the following two conditions: (1) if ξ_e^i is distributed according to the ID with the initial characteristic parameters, then ξ_e^p is distributed according to the ID with the proposed parameters; and (2) if the characteristic parameters are unchanged, then ξ_e^p = ξ_e^i exactly (Eqs. (6)). One could simply sample ξ_e^p directly from the proposed ID, but this doesn't satisfy condition 2 and usually turns out to be inefficient. Deriving sampling schemes which also satisfy condition 2 is a non-trivial task guided by intuition and trial and error. The validity of Eqs. (6) for both Poisson and normal IDs is explicitly demonstrated in Appendix B.
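For a normal ID, one map that satisfies both conditions is a deterministic location-scale transformation (a sketch consistent with the logic above, not a verbatim copy of the Table 1 entry, which is not reproduced in this text): identical parameters give ξ_e^p = ξ_e^i, and ξ_e^i ~ N(μi, σi²) implies ξ_e^p ~ N(μp, σp²) exactly.

```python
import random, statistics

def pbp_normal(xi_i, mu_i, sd_i, mu_p, sd_p):
    """Location-scale map for a normal ID: shift/rescale xi_i so that a
    N(mu_i, sd_i^2) input becomes a N(mu_p, sd_p^2) output, and unchanged
    parameters return xi_i exactly (condition 2)."""
    return mu_p + (sd_p / sd_i) * (xi_i - mu_i)

# Empirical check of condition 1: N(2, 1) draws mapped to N(-1, 0.25).
rng = random.Random(7)
draws = [pbp_normal(rng.gauss(2.0, 1.0), 2.0, 1.0, -1.0, 0.5)
         for _ in range(20000)]
m, s = statistics.fmean(draws), statistics.stdev(draws)
```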

Algorithm
We now describe the general algorithm used to implement PBPs:

POSTERIOR-BASED PROPOSALS
Step 1: Generate θp - A proposed set of parameter values is drawn from a multivariate normal (MVN) distribution centred on the current set of parameters in the chain, with covariance proportional to Σθ scaled by a tuneable jumping parameter j, where Σθ is a numerical approximation to the covariance matrix for π(θ|y) (estimation of Σθ and optimisation of j are achieved during an initial "adaptation" period, as explained in Appendix C). (Note, performing a joint update on all parameters, instead of each individually, helps to alleviate poor mixing due to strong parameter correlations in π(θ|y).)
Step 2: Generate ξp - We take each latent variable ξe in turn (starting from e=1 up to e=E) and calculate the characteristic quantities defining the IDs for the initial and proposed states (e.g. in the Poisson case these would be λi and λp), as described in Section 4. ξ_e^p is then sampled using the specially designed proposals outlined in Table 1 (note, this table contains separate lines referring to different potential ID functional forms). This sampling procedure is at the heart of PBPs and represents the key novelty of this approach.
Step 3: Accept or reject - The proposed state θp, ξp is accepted with the MH probability in Eq. (9); otherwise the chain remains at θi, ξi.
Further insights into the PBP procedure are given in Appendix E.
PBPs, by design, result in ξp exhibiting correlation with ξi (indeed, if θp=θi then ξp is exactly the same as ξi, as required by Eq. (6)). The PBP MCMC algorithm used in this paper mitigates these correlations by interspersing PBPs with standard updates for the latent variables every U steps. Appendix H shows that mixing is not sensitive to the exact value of U, and is approximately optimised when U=4 (as subsequently used).
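The overall algorithm can be sketched end-to-end on a deliberately tiny toy model (entirely our own invention): a single rate parameter θ with a flat prior on θ>0, one latent count ξ ~ Po(θ), and one noisy observation y ~ N(ξ, 1). Using ID0 (i.e. the model itself, making this a model-based proposal), the latent process and proposal terms cancel, so the sketch accepts on the observation-likelihood ratio alone; this cancellation is a property of the MBP special case, not of PBPs in general.

```python
import math, random

def sample_poisson(lam, rng):
    """Inversion sampler for a Poisson(lam) variate (fine for small lam)."""
    u, k, p, cum = rng.random(), 0, math.exp(-lam), 0.0
    while True:
        cum += p
        if u <= cum:
            return k
        k += 1
        p *= lam / k

def log_obs(y, xi):
    return -0.5 * (y - xi) ** 2          # N(xi, 1), up to a constant

def mbp_update(theta, xi, y, rng, jump=0.3):
    theta_p = theta + rng.gauss(0.0, jump)      # Step 1: propose parameters
    if theta_p <= 0:
        return theta, xi                        # flat prior restricted to theta > 0
    if theta_p > theta:                         # Step 2: jointly modify xi
        xi_p = xi + sample_poisson(theta_p - theta, rng)   # superposition
    else:
        xi_p = sum(1 for _ in range(xi)
                   if rng.random() < theta_p / theta)      # thinning
    # Step 3: accept/reject; for ID0 the latent terms cancel, leaving
    # only the observation-likelihood ratio (prior ratio is 1 here)
    if math.log(1.0 - rng.random()) < log_obs(y, xi_p) - log_obs(y, xi):
        return theta_p, xi_p
    return theta, xi

rng = random.Random(3)
theta, xi, y = 4.0, 4, 6.0
chain = []
for _ in range(5000):
    theta, xi = mbp_update(theta, xi, y, rng)
    chain.append(theta)
```

With a flat prior the posterior mean of θ sits near ξ+1 with ξ concentrated around y, so the chain should settle around 7 here.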

Generating importance distributions (ID)
For each latent variable ξe, the distribution we wish to approximate is π(ξe|ξe'<e,θ,y), i.e. the posterior probability distribution for ξe given ξe'<e, parameters θ, and data y. This distribution can be expressed as

π(ξe|ξe'<e,θ,y) = ∫…∫ π(ξd≥e|ξe'<e,θ,y) dξe+1…dξE,    (10)

where π(ξd≥e|ξe'<e,θ,y) is the joint posterior distribution for latent variables with index e and above (conditional on everything else). The integrals in Eq. (10) successively marginalise over the unknown latent variables, starting with the last, ξE, all the way back to ξe+1. Using Bayes' theorem and Eq. (1), the integrand in Eq. (10) can be expanded as a product of model and observation terms, leading to Eq. (13). For now we make the simplification that one observation ye is made per latent variable ξe.
Analytically performing these integrals is usually not possible. However, different levels of approximation can be made depending on the point at which the expression on the right hand side of Eq. (13) is truncated. This leads to a family of importance distributions of increasing accuracy (as illustrated in Fig. 4):

ID0
By taking just the first term on the right hand side of Eq. (13), the ID is simply set to the distribution from the model itself, i.e. f_ID0(ξe|ξe'<e,θ,y) = π(ξe|ξe'<e,θ). Note, this can only be done provided the model distribution π(ξe|ξe'<e,θ) has a functional form belonging to one of the possibilities in Table 1 (or, alternatively, a new PBP sampling scheme based on the model distribution is created which satisfies the conditions in Eq. (6)). If this cannot be achieved, the functional form for ID0 is chosen to match π(ξe|ξe'<e,θ) as closely as possible.
The calculation of ID0, as illustrated in Fig. 4(b), involves only those latent variables on which ξe is conditionally dependent. PBPs using ID0 are equivalent to model-based proposals (MBPs) [10], and their MH acceptance probability from Eq. (9) then simplifies, because the latent process likelihood terms cancel with the proposal probabilities, leaving only the observation model and prior contributions.

ID1
The ID accuracy is improved by using the first two terms on the right hand side of Eq. (13), i.e. f_ID1(ξe|ξe'<e,θ,y) = c π(ξe|ξe'<e,θ)π(ye|ξe), where c is a normalising factor. Calculation of ID1, as illustrated in Fig. 4(c), includes not only those latent variables on which ξe is dependent, but also the observation ye on ξe itself (as indicated by the green circle).

ID2
Including observations on those latent variables which depend on ξe, e.g. ξE-1 in Fig. 4(d), Eq. (13) gives the further improved approximation ID2. Higher order approximations (i.e. IDn for n>2, which take into account successively more of the model and data) usually prove to be too computationally expensive to be efficient.

Choosing ID
Choosing which level of ID optimises PBPs involves a trade-off between the computational cost of generating IDs and the size of posterior jumps (and hence improvement in mixing) they allow.
Unfortunately, determining a priori which option is best is challenging. Indeed, in the results section below we find examples for which ID1 and ID2 each represent the optimum solution for different problems.
The classification scheme presented above is based on models which have one observation per latent variable. For models in which this is not the case, generation of IDs relies on approximating the target distribution π(ξe|ξe'<e,θ,y) directly, which may or may not be computationally challenging, depending on the scenario considered. However, provided the model itself makes use of the distributions in Table 1, using ID0 (i.e. MBPs) is always possible.

Empirical evaluation
We now investigate the relative computational performance of PBPs compared to standard MCMC approaches applied to four contrasting model types. In each case we examine algorithm performance for parameter inference from data simulated from the models in question. Results shown are based on 10^6 MCMC "updates" (where each update makes changes to all the latent variables and model parameters), preceded by 10^4 discarded samples from a burn-in/adaptation period (see Appendix C). One way to measure MCMC efficiency is to calculate the "effective sample size" [2] (see Appendix I), and here we calculate the computational time for any given algorithm to generate 100 effective posterior samples of a given parameter. Unless otherwise stated, uninformative flat priors are assumed.
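As a pointer to how such an efficiency measure works in practice, the sketch below implements one common ESS estimator (the initial-positive-sequence form, ESS = n / (1 + 2Σρk), truncated at the first non-positive autocorrelation); this is a standard construction, not necessarily the exact estimator of Appendix I. An AR(1) series stands in for a poorly mixing chain.

```python
import random

def effective_sample_size(x):
    """ESS = n / (1 + 2*sum(rho_k)), truncating the sum at the first
    non-positive autocorrelation estimate rho_k."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return float(n)
    tau = 1.0
    for k in range(1, n):
        rho = sum((x[i] - mean) * (x[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho <= 0:
            break
        tau += 2 * rho
    return n / tau

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(2000)]   # well-mixed "chain"
ar, v = [], 0.0                                # AR(1), coefficient 0.9:
for _ in range(2000):                          # heavily autocorrelated
    v = 0.9 * v + rng.gauss(0, 1)
    ar.append(v)
```

The autocorrelated series yields a far smaller ESS than the independent one, mirroring the gap between poorly and well mixing samplers.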

Inferring disease prevalence and diagnostic test performance
Suppose we aim to estimate the disease prevalence (fraction of infected individuals) pD in a population of P individuals using cross-sectional diagnostic test results. Such diagnostic tests are typically imperfect, and are characterised by a sensitivity Se (the probability of a positive test result given an infected individual) and a specificity Sp (the probability of a negative test result given an uninfected individual). Suppose Se and Sp are unknown. In the absence of a gold standard defining which individuals are truly infected, inference is only possible when two or more independent tests are recorded per individual (due to confounding). Here we assume that results are available from two tests, labelled t={1,2}. The model is shown in Fig. 5 and described in detail, along with the development of PBP proposals, in Appendix J. We note that this model could be embedded in more complex models, e.g. fitted to data from capture-mark-recapture programmes [22].
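The data-generating side of this model is simple to sketch: each individual is infected with probability pD, and each test returns a positive with probability Se if infected, or 1−Sp if not. The prevalence, sensitivity and specificity values below are illustrative choices of ours, not the values used in the paper's simulations.

```python
import random

def simulate_tests(P, prev, se, sp, rng):
    """Simulate two conditionally independent imperfect diagnostic tests
    per individual; se[t]/sp[t] are the sensitivity/specificity of test t."""
    data = []
    for _ in range(P):
        infected = rng.random() < prev
        results = tuple(
            (rng.random() < se[t]) if infected else (rng.random() >= sp[t])
            for t in range(2)
        )
        data.append(results)
    return data

rng = random.Random(5)
data = simulate_tests(10000, 0.3, [0.9, 0.85], [0.95, 0.9], rng)
pos1 = sum(r[0] for r in data) / len(data)   # apparent prevalence, test 1
```

Note how the apparent prevalence on test 1 (≈ prev·Se + (1−prev)(1−Sp) = 0.305 here) differs from the true prevalence 0.3, which is why Se and Sp must be inferred jointly with pD.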
Inference was then performed using MBP/PBP MCMC approaches as well as standard Gibbs sampling. Figure 6(a) shows how posterior samples for pD vary as each of the three algorithms progresses. By binning these samples, the marginal posterior distributions for pD can be generated, as shown in Fig. 6(b) (which also shows results for the other model parameters). These distributions contain the known values used to generate the data (denoted by vertical black lines), indicating successful inference.
It is important to note that in the limit of infinite MCMC sample number each of the three algorithms is expected to generate exactly the same set of marginal distributions (i.e. those shown in Fig. 6(b)). However, the speed with which they converge does vary. For example, Fig. 6(c) shows how the computational time (i.e. the CPU time to generate 100 effective samples) to infer pD increases with population size. Here we find that MBPs (ID0) actually perform worse than the standard Gibbs approach; however, when observations are incorporated (ID1) the resultant PBP algorithm is around two times faster than Gibbs sampling (at least when the number of individuals is large). Although mixing is greatly improved, as is evident from Fig. 6(a), this is offset by the computational overhead associated with each PBP. This simple example demonstrates a possible modest improvement in computational speed when using PBPs compared to a standard Gibbs approach. The next example, however, shows that much larger potential gains can be made.

Fitting dynamic models: a stochastic volatility model
Stochastic volatility (SV) models are used to capture time-varying volatility in financial markets, and are essential tools in risk management, asset pricing and asset allocation [23]. In economics, a "logarithmic rate of return" can be defined by ye=log(Ve+1/Ve), where Ve is the price of an asset (e.g. a share) on day e. Consequently, when ye is positive the asset goes up in price on day e+1, and when it is negative it goes down. One way to capture time variation in ye is through the so-called SVt model [24]:

ye = exp(he/2) ue,
he+1 = μ + ϕ(he − μ) + ηe,

where the ue are i.i.d. with a Student's t-distribution (characterised by parameter ν) and the ηe are i.i.d. normal (with zero mean and variance σ²). Note, if he were fixed (i.e. σ=0), the logarithmic rate of return ye would simply be sampled from a distribution with fixed variance, or "volatility". The introduction of time variation in the variable he, whose temporal correlations are measured by 0<ϕ<1, means that ye experiences stochastic volatility, i.e. periods when there are large variations in asset price, and other periods when there is not much variation.
The SVt model is represented in Fig. 7(a).
Simulated data were created using μ=-10, ϕ=0.99, ν=12, σ²=0.0121 (parameter values based on estimates made from the S&P 500 index [23]). Figure 7(b) shows the time variation in ye and he (observe how changes in he correspond to changes in the volatility of ye) over E=3000 days. A detailed development of PBP proposals for this model is given in Appendix K.
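A simulation of the SVt model under these parameter values can be sketched as follows; the latent series is started at its stationary mean, and the Student-t variate is built from a normal over a chi-square (a standard construction), both of which are our own implementation choices.

```python
import math, random

def simulate_svt(E, mu, phi, nu, sigma2, rng):
    """Simulate the SVt model: h_{e+1} = mu + phi*(h_e - mu) + eta_e with
    eta_e ~ N(0, sigma2), and y_e = exp(h_e/2) * u_e with u_e ~ Student-t(nu)."""
    sd = math.sqrt(sigma2)
    h = mu                                # start at the stationary mean
    hs, ys = [], []
    for _ in range(E):
        # Student-t(nu) variate via N(0,1) / sqrt(chisq_nu / nu)
        chisq = sum(rng.gauss(0, 1) ** 2 for _ in range(nu))
        u = rng.gauss(0, 1) / math.sqrt(chisq / nu)
        hs.append(h)
        ys.append(math.exp(h / 2) * u)
        h = mu + phi * (h - mu) + rng.gauss(0, sd)
    return hs, ys

rng = random.Random(11)
hs, ys = simulate_svt(3000, -10.0, 0.99, 12, 0.0121, rng)
```

With ϕ=0.99 the latent series he drifts slowly, producing the extended quiet and volatile spells visible in Fig. 7(b).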

Speed comparison
Bayesian inference was performed on the simulated data in Fig. 7. The standard algorithm is at its fastest when ϕ~1 but slows down considerably as ϕ → 0, whereas the opposite holds for the MBP/PBP algorithms. In addition, ID1 is found to be faster than ID2 or ID0. For real financial markets ϕ is within the range 0.95 to 0.99 [23], reflecting a high degree of persistence in volatility. This corresponds to PBPs running between four and ten times faster than the standard approach. However, the left hand side of Fig. 8(b) clearly demonstrates the existence of regimes for which PBPs are faster by a factor exceeding 100.

Fig. 9: The mixed model (Section 5.3).

Mixed models (MMs)
MMs [25] explain observations in terms of both "fixed effects" (e.g. individual attributes such as gender or disease status) and "random effects" (which account for random uncontrollable factors within a study, e.g. variation in student grades as a result of variation in the quality of schools). MMs are useful in a wide variety of applications in the physical, biological and social sciences [26][27][28][29]. They assume that a vector of N measurements y can be decomposed into three contributions:

y = Xβ + Za + ε,    (20)

where X and Z are design matrices that define model structure, β is a vector of F fixed effects, a is a vector of E random effects and ε are residuals. Random effects and residuals are assumed to be MVN with zero mean and covariance matrices G and R, respectively. For simplicity we assume that G = σa²A and R = σε²I, where A is a correlation matrix.
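A simulation of Eq. (20) for the simplest case (A = I, i.e. uncorrelated random effects, as in the first example below) can be sketched as follows; the dimensions, variance components and grouping structure (each record assigned to one random effect via Z) are illustrative choices of ours.

```python
import random

def simulate_mm(N, E, beta, var_a, var_e, rng):
    """Simulate y = X*beta + Z*a + eps for an uncorrelated-random-effects
    mixed model (A = I); Z assigns each of the N records to one of the
    E random effects."""
    F = len(beta)
    a = [rng.gauss(0.0, var_a ** 0.5) for _ in range(E)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(F)] for _ in range(N)]
    group = [rng.randrange(E) for _ in range(N)]   # Z as a group index
    y = [sum(X[i][f] * beta[f] for f in range(F))  # fixed effects X*beta
         + a[group[i]]                             # random effect (Z*a)_i
         + rng.gauss(0.0, var_e ** 0.5)            # residual eps_i
         for i in range(N)]
    return X, group, y

rng = random.Random(2)
X, group, y = simulate_mm(2000, 50, [1.0, -0.5], 0.4, 0.6, rng)
```

With these choices the total variance of y decomposes into the fixed-effect contribution plus σa² + σε², which is the structure the inference schemes below must disentangle.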

Speed comparison
We present three example models to illustrate how algorithm performance depends on the nature of the covariance matrix A (a detailed development of PBP proposals for the MM framework is presented in Appendix L):

Uncorrelated random effects (A=I)
Simulated data were generated using N=10^5, E=100, F=2 (see Appendix L for further details). Figure 10(a) shows how the computational time to infer r² varies as the value of r² used to generate the data changes. The standard Gibbs approach is consistently fastest because here the data are highly informative about the latent variables. Rather than the posterior exhibiting strong correlations between latent variables and parameters (as in Fig. 1(a)), it in fact shows very little correlation. This illustrates that when mixing is already good, PBPs are not expected to speed up MCMC.

Correlated random effects (A -1 sparse)
An important application of mixed models is in the field of quantitative genetics, which aims to understand the genetic basis of traits of interest [30]. As an illustration, take y to represent measurements of height within a population. In contrast to the previous example, observations are correlated between members of a population, e.g. if an individual has tall parents they are more likely to be tall. This correlation results from genetic inheritance. The relatedness of individuals within the population is captured by the so-called "relationship matrix" A.
Simulated data were generated assuming a population randomly mated over four generations, with N=E=4×10^3, F=2 (see Appendix L for further details). Figure 10(b) shows how the computational time to infer r² varies with the r² value used to generate the data. Disregarding fixed effects, r²=0 implies no genetic inheritance and r²=1 corresponds to a trait dominated by genetic inheritance. In these limits Gibbs sampling slows down significantly due to strong parameter-latent variable correlations in the posterior, leading to poor mixing. Using ID0 shows an improvement over Gibbs for low r², however for high r² its performance proves to be poor. Using ID1 and ID2 leads to further improvements in computational speed, resulting in PBPs becoming consistently faster than the standard Gibbs approach (e.g. ID2 is a factor of 100 times faster in the limit r²→0). The move to ID3 shows little improvement, and for ID4, ID5, etc., PBPs become progressively slower despite mixing better. Consequently ID2, which uses measurements taken on individuals as well as close relatives, represents the optimum choice in this particular example.

Correlated random effects (A -1 dense)
In cases in which the relationship matrix A is unknown, it can be approximated by the genomic relationship matrix [31] (which is calculated using the measured genotypes of all individuals). Such an approach results in all elements of A^-1 being non-zero (whereas the vast majority of them were exactly zero before).
We mimic this change by adding small random contributions to A^-1 and repeating the previous analysis. The results are shown in Fig. 10(c). The general pattern is largely the same as in Fig. 10(b), although all approaches are now considerably slower due to the additional computations necessary. An important point to mention is how ID2 is generated under this scenario. ID2 ordinarily uses information from all those additive genetic effects that depend on ae (as well as their predecessors).
When A^-1 is dense, this approach is extremely slow because of the large number of dependencies within the model. However, huge gains in speed can be achieved by selecting only those dependencies which are deemed large (see Appendix L). This illustrates that ignoring weak dependencies can be an effective general approach to reducing the computational cost associated with calculating IDs.

Generalized linear mixed models (GLMM)
GLMMs are a widely used extension of MMs that account for non-normally distributed observations [20]. The vector of observations y is now drawn from

y ~ χ(μ),    (21)

where χ belongs to the exponential family of distributions (which includes the normal, binomial, Poisson and gamma distributions). The vector of means μ depends on fixed and random effects through a link function g:

g(μ) = Xβ + Za,    (22)

where X and Z are design matrices that define the model structure, β is a vector of F fixed effects, and a is a vector of E random effects (which are assumed to be MVN with zero mean and covariance matrix G). In contrast to the MM expression in Eq. (20), the residuals ε no longer appear here, as any discrepancies between the data y and the model are incorporated into χ in Eq. (21).
As an illustrative example of a GLMM, consider data y consisting of binary (0/1) measurements of disease status (representing uninfected/infected) that are explained in terms of a quantitative genetics model (like the one introduced in Section 5.3). In this case χ is chosen to be a Bernoulli distribution and g is a logit function, such that g(μj) = log(μj/(1−μj)). Although PBPs can be applied in the general case, here we assume one additive genetic effect per individual (i.e. Z=I) and G = σa²A (where A is the relationship matrix). Under this model Xβ + a is a vector giving each individual's propensity for disease (which is assumed to be normally distributed) and a represents the additive genetic component of this quantity. Appendix M provides further details.
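The Bernoulli/logit data-generating process of Eqs. (21)-(23) can be sketched as follows. For brevity this toy uses A = I (unrelated individuals), whereas the paper's analysis uses a pedigree-derived A; all numerical values are illustrative.

```python
import math, random

def simulate_glmm_binary(N, beta, var_a, rng):
    """Bernoulli GLMM with logit link and one additive effect per
    individual (Z = I; A = I here for simplicity):
    eta_j = X_j*beta + a_j and P(y_j = 1) = 1 / (1 + exp(-eta_j))."""
    F = len(beta)
    y, eta = [], []
    for _ in range(N):
        x = [rng.gauss(0, 1) for _ in range(F)]
        a = rng.gauss(0, var_a ** 0.5)
        e = sum(x[f] * beta[f] for f in range(F)) + a
        eta.append(e)
        y.append(1 if rng.random() < 1 / (1 + math.exp(-e)) else 0)
    return y, eta

rng = random.Random(9)
y, eta = simulate_glmm_binary(5000, [0.0, 0.0], 1.0, rng)
frac = sum(y) / len(y)   # overall disease frequency
```

With zero fixed effects the propensities are symmetric about 0, so roughly half the individuals are cases, yet the underlying continuous propensity Xβ + a is only observed through these coarse 0/1 outcomes, which is why inference here is much harder than for the quantitative trait.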

Speed comparison
For GLMMs Gibbs sampling is no longer possible, so instead a standard random walk MH scheme is implemented. Furthermore, since the model no longer contains the residual variance parameter σε² (because no such parameter exists for the Bernoulli distribution), σa² is used to quantify the relative importance of genetics to an individual's propensity for disease.
We again assume a population of individuals randomly mated over four generations, with N=E=4×10^3, F=2 (see Appendix M for further details). Figure 11 shows how the computational time to infer σa² varies with the value of σa² used to generate the data. We see that algorithm speeds are considerably slower than for the corresponding MM containing the same number of individuals (Fig. 10(b)). This is because the information contained within the observations (i.e. binary disease measurements) is significantly less than when traits are measured directly (e.g. quantitative measurements such as height), resulting in greater uncertainty in the underlying additive genetic effects and consequently a larger posterior parameter space.
Figure 11 shows that PBPs using ID1 (which represents the optimum algorithm in this particular case) are consistently faster than standard MCMC by a factor of around 30 (roughly independent of the value of σa²).

Summary
This paper introduced posterior-based proposals (PBPs) and demonstrated that they speed up MCMC inference in many cases where existing approaches perform poorly. PBPs are applicable to the majority of statistical models (namely, those whose conditional dependence structure is expressible in terms of a directed acyclic graph) [32]. Performance is enhanced by improving the mixing of MCMC (i.e. increasing the rate at which uncorrelated samples are generated) through jointly proposing changes to model parameters θ and latent variables ξ (see Section 3 for details). PBPs are a family of proposal schemes built by generating importance distributions (ID0, ID1, etc.) that systematically account for the dependence structure in the Bayesian posterior with increasing accuracy. The zeroth-order approximation ID0 ignores the data and thus corresponds to "model-based proposals" (MBPs) [10]. The optimal level of approximation depends on problem-specific trade-offs between the improved mixing resulting from increased acceptance rates and the computational cost of generating suitably accurate IDs.
The relative computational speed of PBP MCMC compared to "standard" Gibbs/random walk MH techniques was investigated for benchmark models used in applications ranging from finance to ecology to statistical genetics. Unsurprisingly, in cases where mixing of the MCMC chain is not an issue for standard approaches, PBPs were not significantly faster (indeed, for a mixed model with uncorrelated random effects, Gibbs sampling was actually found to be faster). However, in many scientifically challenging applications such correlations are critical, e.g. in quantitative genetics where they represent relatedness between individuals [19], and in such cases PBPs were found to be substantially faster than Gibbs sampling. This was especially true when inferring the size of small random effects (corresponding to traits with only a small component of genetic inheritance [33,34]), where performance was up to two orders of magnitude faster.

Discussion
We now discuss PBPs in relation to two widely used techniques for overcoming slow mixing: PMCMC and HMCMC.
As stated previously, PMCMC only works on problems in which data can be incorporated in a sequential manner (i.e. it would not be suitable for the examples in Sections 5.1, 5.3 and 5.4 above). Even then, however, PMCMC can become slow because a very large number of particles may be required to achieve a reasonable acceptance rate. Indeed, a previous paper [10] using observations taken from multiple spatially linked epidemics showed that PMCMC became much slower in comparison to an MBP approach as the number of epidemics was increased. However, a key advantage of PMCMC approaches is that they are usually relatively easy to implement and typically allow for efficient parallelisation.
HMCMC relies on calculating gradients of the log-likelihood, and hence is not applicable to models with discrete variables (e.g. the example in Section 5.1) or when the number of variables within the model changes. Its efficiency is very much dependent on the shape of the posterior. Often it is tested on high-dimensional multivariate normal-type posterior distributions, where it is found to perform well against other approaches [4]. However, in many real-world problems there can be sharp gradients in which some parameters change much more rapidly than others. These irregularities necessitate many small intermediate steps for each MCMC iteration, significantly reducing performance. In contrast to PMCMC, HMCMC cannot easily be parallelised.
Finally we come to the PBPs introduced here. Generally speaking they tend to work better (in comparison to both standard approaches and HMCMC) on problems which are "model dominant". By this we mean that the shape of the posterior is largely determined by the latent process likelihood, with the observation model providing only a small perturbation on top of this. A good example would be a complex model applied to relatively few actual measurements. (Note that PMCMC also works well in such scenarios, because fewer particles are required for a reasonable acceptance rate.) The reason that both PMCMC and PBPs work in this model-dominant regime can be seen by looking at the limiting case in which there is no data. Here both methods easily map out the prior distribution for model parameters (with latent variables naturally being dealt with by simulation in the former and by MBPs in the latter). In contrast, HMCMC is faced with the potentially difficult task of integrating from one parameter / latent variable combination to another, often taking a large number of steps to do so. The flip side of this argument, however, is the "data dominant" regime, in which there is a lot of data and a relatively simple model. In this case we might naturally expect HMCMC to work better than either PMCMC or PBPs. However, here mixing itself is typically fast and so standard approaches would usually suffice. Like HMCMC, PBPs also cannot easily be parallelised. However, a potential extension to PBPs would be to incorporate them into a particle-like framework in which multiple proposals are made, with a particle filter applied at different time points. This may lead to further improvements in speed under some scenarios, and will be the subject of future investigation.
One final point to mention is that whilst some complex models may not fit into the DAG structure required by PBPs, certain subsections of them may. Here PBPs could still profitably be used by fixing all non-DAG elements of the model under the proposal (whereby they are incorporated into the data y for the purposes of the proposal step). Whatever the complexity of the model, therefore, PBPs can be considered an additional tool in the MCMC programmer's arsenal.

Appendices Appendix A: Importance sampling
Importance sampling (IS) is a general technique for estimating properties of a given distribution (which cannot be sampled from directly) using samples generated from a different, more accessible, distribution [35,36]. We emphasise that PBPs are not a type of IS, but they do make use of importance distributions (IDs), so a brief description is now given. The posterior in Eq. (1) can be expressed as π(θ,ξ|y) = π(θ|y)π(ξ|θ,y) which, using Eq. (2), may be written

π(θ,ξ|y) = π(θ|y) ∏e=1..E π(ξe|ξe'<e,θ,y).

This implies that, in principle, posterior samples can be generated by first sampling θ from π(θ|y), then ξ1 from π(ξ1|θ,y), then ξ2 from π(ξ2|ξ1,θ,y), and so on until a complete set of latent variables ξ is generated. Unfortunately, however, these distributions are typically intractable, so cannot be sampled from directly. To overcome this difficulty, importance distributions (IDs) fID(θ|y) (which, for example, could be chosen to be multivariate normal) and fID(ξe|ξe'<e,θ,y) (a set of univariate distributions, such as normal or Poisson, for each variable e) are defined which can be sampled from. IDs are chosen to resemble the true distributions as closely as possible.
IS consists of first sampling θ from fID(θ|y) and then successively drawing ξe from fID(ξe|ξe'<e,θ,y) for e=1 to E. The resulting sample (θ,ξ) has an associated "weight" w, which accounts for the fact that it is not a true posterior sample [35]. Repeating IS a sufficient number of times gives consistent estimates for posterior quantities of interest. For example, if i indexes sample number then

θ̄ ≈ Σi wi θi / Σi wi,    (27)

gives an estimate for parameter posterior means. Unfortunately, IS becomes highly inefficient for complex models because the vast majority of samples have negligible weight (leading to poor statistical estimates for quantities such as those in Eq. (27)). This necessitates the use of MCMC approaches in the first place.
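As an illustration of the self-normalised estimate in Eq. (27), the following sketch (not from the paper; the standard-normal target and the wider normal ID are arbitrary choices, and all function names are hypothetical) estimates a posterior mean by importance sampling:

```python
import math
import random

def target_pdf(x):
    # Unnormalised "posterior": a standard normal density.
    return math.exp(-0.5 * x * x)

def id_pdf(x, sigma=2.0):
    # Importance distribution: a normal with a wider spread.
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def is_posterior_mean(n_samples, sigma=2.0, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)             # draw from the ID
        w = target_pdf(x) / id_pdf(x, sigma)  # importance weight
        num += w * x
        den += w
    return num / den                          # self-normalised estimate, cf. Eq. (27)
```

For this toy target the estimate converges to the true mean of zero as the number of samples grows; the weight-degeneracy problem described above appears only once the target becomes high dimensional.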
In summary, this appendix has identified importance distributions fID(ξe|ξe'<e,θ,y) (which take standard functional forms such as normal, Poisson, etc.) that aim to account for both the model and the observations by approximating π(ξe|ξe'<e,θ,y). For example, supposing that π(ξe|ξe'<e,θ,y) has a normal-like shape, the importance distribution would be chosen to be normal, fID(ξe|ξe'<e,θ,y) = fnorm(ξe|μ(ξe'<e,θ,y),σ(ξe'<e,θ,y)), with mean μ and standard deviation σ depending functionally on ξe'<e, θ and y. Details of these functional dependencies are given in Section 4.

Appendix B: Derivation of PBP for Poisson or normal IDs
Here we explicitly demonstrate the validity of the two conditions in Eq. (6) when using the sampling procedures outlined in Table 1.

Poisson ID: The proposed state is generated via ξp = ξi + X, where ξi is the number in the initial state and X is sampled from a Poisson distribution with mean given by the difference between λp and λi. Calculating the probability of this proposal and using it, along with Eq. (A1), shows that condition 1 in Eq. (6) is, indeed, satisfied. Condition 2 is satisfied because X=0 when λp=λi in Eq. (A3), and so by definition ξp=ξi in Eq. (A2).
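The construction above rests on the additivity of the Poisson distribution: if ξi ~ Poisson(λi) and, independently, X ~ Poisson(λp−λi), then ξp = ξi + X ~ Poisson(λp); and λp=λi forces X=0, so that ξp=ξi. A minimal numerical check in Python (a sketch with hypothetical names, not the paper's code; the sampler uses Knuth's algorithm, which is fine for the small means used here):

```python
import math
import random

def poisson(lam, rng):
    # Knuth's product-of-uniforms algorithm for small means.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def propose(xi_i, lam_i, lam_p, rng):
    # PBP-style move for a Poisson ID (assumes lam_p >= lam_i):
    # add an independent Poisson(lam_p - lam_i) increment.
    return xi_i + poisson(lam_p - lam_i, rng)

rng = random.Random(1)
lam_i, lam_p = 2.0, 3.0
samples = [propose(poisson(lam_i, rng), lam_i, lam_p, rng) for _ in range(100000)]
mean_p = sum(samples) / len(samples)   # close to lam_p = 3 by Poisson additivity
```

Note that when lam_p equals lam_i the increment is zero with probability one, which is exactly condition 2.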

Normal ID
Normal ID: Here we consider the case of a normal ID for latent variable ξe. The ratio between Eqs. (A12) and (A11), together with the expressions in Eq. (A9), shows that condition 1 in Eq. (6) is, indeed, satisfied. Furthermore, condition 2 is satisfied by noting that the variance of the proposal in Eq. (A10) goes to zero when the initial and proposed IDs coincide. The validity of all the sampling procedures in Table 1 can be verified by following essentially the same procedure as above to show that the two conditions in Eq. (6) are satisfied.

Appendix C: Adaptation period
Posterior-based proposals contain two quantities in Eq. (8) that need to be established: a numerical approximation to the covariance matrix of the posterior Σθ, and a jumping size j. Motivated by adaptive MCMC [37,38], these are calculated during an "adaptation" period, which also acts as the required burn-in phase (i.e. it ensures that the first sample drawn after the adaptation period is representative of a random draw from the posterior). In this study Iad=10⁴ iterations are used for adaptation, after which Σθ and j are fixed (this ensures that detailed balance is strictly enforced when MCMC samples are actually taken).
Covariance matrix Σθ: During the adaptation period the number of MCMC iterations i increases from 1 to Iad. An approximation to the posterior covariance matrix Σθ is calculated every 100 iterations based on samples from i/2 to i. This is effectively equivalent to using a dynamically changing burn-in period set at half the current iteration number. As the adaptation period progresses the estimated covariance matrix Σθ becomes a progressively better approximation, which helps to improve the efficiency of the algorithm. For the first 100 samples Σθ is set to a diagonal matrix with elements chosen to be sufficiently small to ensure a good initial acceptance rate.
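The dynamic burn-in scheme above can be sketched as follows (a minimal illustration, not the paper's code; the function name and the list-of-lists representation of the θ samples are my own):

```python
def posterior_cov(samples):
    # Covariance estimated from samples i/2 to i, i.e. discarding the first
    # half of the chain so far as a dynamically growing burn-in.
    recent = samples[len(samples) // 2:]
    n, d = len(recent), len(recent[0])
    mean = [sum(s[k] for s in recent) / n for k in range(d)]
    return [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in recent) / (n - 1)
             for b in range(d)] for a in range(d)]
```

In practice this would be re-evaluated every 100 iterations on the stored θ samples, with the result plugged into the MVN proposal of Eq. (8).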

Jumping size j:
This determines the acceptance rate for PBPs in Eq. (9). If j is too large very few proposals are accepted, and if it is too small mixing is slow. A robust heuristic method for optimising j is as follows. Initially j is set to a small quantity. Each time a PBP is accepted, j is multiplied by a factor slightly greater than one, and each time it is rejected, j is multiplied by a factor slightly less than one (Eqs. (B2) and (B3)). These numerical factors are chosen for two reasons. Firstly, the two updates in Eqs. (B2) and (B3) balance each other out when acceptance occurs around 33% of the time (which from Appendix H was found to be approximately optimal), leading to a steady-state solution for j. Secondly, they are sufficiently close to 1 to prevent large fluctuations in j, but sufficiently far to allow the steady-state solution to be found during the adaptation period.
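A sketch of this style of multiplicative tuning is shown below. The constants are illustrative (the paper's exact factors are in Eqs. (B2) and (B3), which did not survive extraction): UP is chosen arbitrarily, and DOWN is solved so that the log-updates balance at a 33% acceptance rate.

```python
import math
import random

# With acceptance rate r the updates balance when
#   r*log(UP) + (1-r)*log(DOWN) = 0.
# Fixing UP = 1.02 and solving for DOWN at r = 0.33 puts the
# steady state of j at ~33% acceptance.
UP = 1.02
DOWN = math.exp(-0.33 / 0.67 * math.log(UP))

def adapt(j, accepted):
    # Grow j slightly on acceptance, shrink it slightly on rejection.
    return j * UP if accepted else j * DOWN

# Toy check: feed the rule a 33% acceptance stream; j then fluctuates
# around its initial scale rather than drifting systematically.
rng = random.Random(0)
j = 0.01
for _ in range(5000):
    j = adapt(j, rng.random() < 0.33)
```

Because the factors are close to one, j performs a near-driftless random walk in log space at the target acceptance rate, which is the "steady state" the appendix refers to.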

Appendix D: Derivation of acceptance probability
We derive the expression in Eq. (9). Based on Eq. (1), the MH acceptance probability is given by Eq. (C1), where gi→p represents the proposal probability density for generating θp and ξp given the current state θi and ξi, and gp→i represents the corresponding quantity in the opposite direction (i.e. starting at θp and ξp and proposing the state θi and ξi). Following steps 1 and 2 in the PBP algorithm from Section 3.4, the overall proposal probability can be expressed as the product of the MVN parameter proposal and the ID sampling probabilities.
Substituting this, along with the reverse transition, into Eq. (C1) (and noting that the MVN distributions are symmetric) gives an intermediate expression; substituting condition 1 from Eq. (6) into this then leads to the final result in Eq. (9).

1) Optimisation
The proposal distribution in parameter space introduced in Eq. (8) has the advantage of being simple and robust against highly correlated model parameters. Generally speaking, however, it may not represent the optimum choice. For example, if two variables A and B are largely uncorrelated in the posterior it may actually be computationally faster to propose changes to A and B separately. This is especially true when proposing changes to A is much faster (e.g. fixed effects in mixed models) than to B (e.g. random effects).
In the most general case, a combination of the following two types of proposal can be used in Eq. (8): multiple-parameter changes, in which χ represents a subset of the parameters in θ, and individual-parameter changes. Providing an automated way to determine the optimum choice of proposals in parameter space for a given model will be the subject of future research.

2) Parameters not continuous
In some cases model parameters are discrete, e.g. for an epidemiological model the initial population numbers in various compartments might be included in θ. If these discrete variables are approximately normally distributed (as they would be if the population sizes are reasonably large), then we can again draw a vector v from an MVN distribution as before, but this time round the discrete model variables to the nearest integer. In cases in which model parameters are not expected to be approximately normally distributed they can simply be updated individually.
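A minimal sketch of such a rounded proposal (illustrative only; a diagonal random-walk step is used in place of the full MVN draw of Eq. (8), and all names are hypothetical):

```python
import random

def propose_params(theta, step, discrete, rng):
    # Gaussian random-walk proposal that rounds designated components to
    # integers, as suggested for approximately-normal discrete parameters.
    prop = []
    for k, (value, s) in enumerate(zip(theta, step)):
        v = value + rng.gauss(0.0, s)
        prop.append(round(v) if k in discrete else v)
    return prop

rng = random.Random(0)
theta = [0.5, 100]   # a continuous rate and a discrete population count
new = propose_params(theta, [0.1, 5.0], {1}, rng)
```

Note that rounding makes the proposal density piecewise constant in the discrete components, so the symmetry used in the acceptance ratio is preserved.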

3) Restrictions
For some of the functional forms in Table 1, only one of the characteristic parameters can be updated at a time.For example, for the beta distribution only α can be changed whilst fixing β, or vice versa.Consequently, proposals to both α and β cannot be performed simultaneously.

Appendix G: Normal distributions
The probability density function for drawing a value x from a normal distribution with mean μ and variance σ² is given by

fnorm(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)).

The equivalent multivariate normal distribution is

fMVN(x|μ,Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x−μ)ᵀΣ⁻¹(x−μ)),

where d is the number of dimensions and Σ is the covariance matrix that captures the variances of individual elements (e.g. parameters) as well as the covariances between them.
Cholesky decomposition provides a standard way to draw samples from a multivariate normal distribution [39]. Provided Σ is positive-definite it can be written as Σ=BBᵀ, where B is a lower triangular matrix. Samples from the multivariate normal are then generated using x = μ + Bz, where z is a vector of independent normally distributed samples with mean zero and unit variance.
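The recipe above can be sketched in a few lines of Python (illustrative only; in practice a library routine would be used, and the nested-list matrix representation here is my own):

```python
import math
import random

def cholesky(sigma):
    # Lower-triangular B with sigma = B B^T (sigma must be positive-definite).
    d = len(sigma)
    B = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(B[i][k] * B[j][k] for k in range(j))
            if i == j:
                B[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                B[i][j] = (sigma[i][j] - s) / B[j][j]
    return B

def mvn_sample(mu, sigma, rng):
    # x = mu + B z, with z a vector of independent standard normals.
    B = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(B[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mu))]

B = cholesky([[4.0, 2.0], [2.0, 3.0]])   # expect [[2, 0], [1, sqrt(2)]]
x = mvn_sample([1.0, -1.0], [[4.0, 2.0], [2.0, 3.0]], random.Random(0))
```

Since B is lower triangular, each component of x only needs the first i+1 entries of z, which is what the inner sum exploits.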

Secondly, Fig. H(b) shows the variation in CPU time as a function of the tuneable constant κ (used in Table 1 in cases in which the ID is normally distributed). Again, performance is largely the same provided κ is smaller than around 0.1. For this study κ=0.03 was selected.
Finally, Fig. H(c) shows that the algorithm is optimised when the MH acceptance probability is around 33%. This is implemented using the methods outlined in Appendix C.

Appendix I: Effective sample number
Given X correlated MCMC samples of some quantity xi, the autocorrelation function can be approximated by

Fτ ≈ (1/(X−τ)) Σi=1..X−τ (xi − μ̂)(xi+τ − μ̂) / σ̂²,

where μ̂ and σ̂² are estimates for the mean and variance of x. The effective sample size is given by the actual sample number X, corrected for correlations between successive samples:

Xeff = X / (1 + 2 Στ=1..∞ Fτ).    (H3)

When actually calculating Xeff, the sum in Eq. (H3) clearly cannot go to infinity. In fact, Fτ often exhibits considerable fluctuations for large τ, and these can generate unwanted bias. The simplest way to deal with this is to truncate the sum in Eq. (H3) at a maximum lag τmax, defined to be the largest value of τ up to which Fτ remains positive (see [40]); this represents the point at which the autocorrelation function first intersects the x-axis.
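The truncated estimator of Eq. (H3) can be sketched as follows (an illustration, not the paper's code; the AR(1) chain in the demo is a hypothetical stand-in for MCMC output):

```python
import random

def effective_sample_size(x):
    # X_eff = X / (1 + 2 * sum of F_tau up to the truncation point),
    # truncating at the first lag where the autocorrelation drops below zero.
    n = len(x)
    mean = sum(v for v in x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    acf_sum = 0.0
    for tau in range(1, n):
        f = sum((x[i] - mean) * (x[i + tau] - mean)
                for i in range(n - tau)) / ((n - tau) * var)
        if f < 0.0:          # truncation point tau_max, cf. Eq. (H3)
            break
        acf_sum += f
    return n / (1.0 + 2.0 * acf_sum)

# Demo: an uncorrelated chain versus a strongly autocorrelated one.
rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(2000)]
corr, v = [], 0.0
for _ in range(2000):
    v = 0.95 * v + rng.gauss(0.0, 1.0)
    corr.append(v)
ess_iid = effective_sample_size(iid)
ess_corr = effective_sample_size(corr)
```

For the uncorrelated chain the estimate stays near the raw sample count, while for the AR(1)-style chain it collapses, reflecting the theoretical factor (1−ρ)/(1+ρ).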
We identify the following model parameters θ={pD, Se1, Sp1, Se2, Sp2} and latent variables ξ={D}. The observation model and latent process likelihood are expressed in terms of Nd, the number of individuals with disease status d, and Nt r|d, the number of cases in which test t gives result r for individuals with disease status d.

Importance distributions
Step 2 of the PBP algorithm (introduced in Section 3.4) makes use of IDs, and successive approximations to these IDs were discussed in Section 4. Here we explicitly present expressions for this particular model: ID0 is given by the model itself, where fBern is the Bernoulli probability distribution.
ID1 takes into account both the model and the observations, and follows from Eq. (16). Since ID1 actually represents the exact importance distribution, no higher-order terms need to be considered in this particular example. For ID0, z is equal to pD, and for ID1, z = p1/(p1+p0), where the definitions of p0 and p1 are given in Eq. (I5).

Proposals
Sequentially going through each individual e, the values for the initial ze(i) and proposed ze(p) states are calculated. Table 1 shows the corresponding update rule for the Bernoulli distribution.

Gibbs samplers for disease prevalence model
For the disease diagnostic test model it is possible to explicitly sample directly from the posterior when model parameters and latent variables are each considered separately.
Model parameters: Assuming a uniform prior, samples are sequentially drawn from beta distributions, where Γ is the gamma function and E is the observation period.
Importance distributions: ID0 is given by the model, and the definition of ID2 follows from Eq. (17).

In other words, the posterior distribution for he at time e depends not only on the value of he-1 (i.e. the previous time point), but also on the observations at times e and e+1. In the general case this integral is intractable, but again the observation likelihood can be approximated by a Taylor series expansion (Eq. (J7)). Substituting the results from Eq. (J7) into (J6), and noting that the integral over two normally distributed quantities is Gaussian with respect to the difference in means, with variance given by the sum of the variances of the two original distributions, Eq. (J9) becomes a product of three normal distributions (Eq. (J11)), which gives the final result.

The standard approach for the stochastic volatility model: By multiplying the two expressions in Eq. (J1), the posterior distribution for μ takes the form of a standard normal distribution, and μ is sampled directly from it.
Similarly, the correlation parameter ϕ is also sampled from a normal distribution, whilst the conditional posterior for σ² is an inverted chi-squared distribution. All generated samples which have zero prior probability are subsequently resampled.
We adopt a simple random walk Metropolis-Hastings scheme (i.e. propose a new parameter/variable by adding a normally distributed contribution to its existing value and accepting or rejecting that change) for ν and he. In the case of he, care is taken to only calculate those parts of the latent process likelihood and observation probability that actually change (to optimise the code as far as possible). The jumping sizes of the separate proposals are individually tuned to give acceptance approximately 33% of the time (using the same procedure as for j in Appendix C).

Individuals in the first generation are assumed to be unrelated (i.e. conditionally independent), and those in the 2nd, 3rd and 4th generations are conditionally dependent on exactly two individuals in the previous generation (i.e. their parents). Simulated data were generated using a population size of P=1000, two fixed effects β={1,0.1}, randomly allocated gender (i.e. Xi2=0 or 1 with equal probability, along with Xi1=1) and one additive genetic effect per individual (i.e. Z=I).
Correlated random effects (A⁻¹ dense): Simulated data were generated in the same way as in the sparse case, except that small noise (uniformly sampled in the range between -0.005 and 0.005) was added to each element of A⁻¹ to mimic a typical genomic relationship matrix.

Observation model and latent process likelihood
We identify the model parameters and latent variables for this model as before. At first glance it might appear that PBPs are not applicable here, because the latent variables are MVN distributed (i.e. a distribution not contained within Table 1). This, however, turns out not to be the case, because of the following transformation.
Separating out those terms in the latent process likelihood which depend on aE (and remembering that A is symmetric and fixed) leads to the product of a normal distribution for aE (given ae'<E) multiplied by a new MVN distribution over the remaining E-1 latent variables. This scheme can be iterated until the original MVN distribution is converted into a product of normal distributions, one for each of the random effects, in which the matrix M and vector s are fixed and calculated from A by recursively applying Eq. (K2). Note that Eq. (K4) follows the same structure as Eq. (2), showing that it represents a DAG (specifically, the one illustrated in Fig. 9).
Under the above transformation, the observation model and latent process likelihood are given by expressions in which i runs over the observations, f runs over the fixed effects and e runs over the random effects.

Importance distributions
The different levels of ID approximation are illustrated in Fig. L2. From Eq. (16), we see that ID1 is generated by taking the product of the model and the observation probability distributions. For simplicity we assume that each observation contains a single random effect, but that each random effect may have many observations made on it (PBPs can also be applied in the more general case, but estimation of the IDs becomes somewhat more complicated).
A rearrangement of the observation model in Eq. (K4) leads to an effective observation probability for each individual random effect, where ye combines all observations yi that include random effect ae. The distribution for ID2 takes into account those random effects which depend on ae. As stated in Eq. (17) and illustrated in Fig. 4(d), derivation of ID2 involves integrating over those random effects ad which depend on ae. Explicitly writing down the expression for fID2(ξe|ξe'<e,θ,y) is somewhat verbose; instead, we build up the final result by considering different contributions in turn.
To start with we consider a particular random effect ad (which depends on ae), and find out how it affects fID2 when it is integrated out. The contribution to the model part of the full posterior from ad is given by Eq. (K10), where e' sums over all those random effects on which ad depends. Three possibilities exist for e': 1) e' < e, in which case ae' is known (as represented by the shaded circles in Fig. 4); 2) e' = e; and 3) e' > e, in which case the additional latent variable e' has to first be integrated out. In the third case, Eq. (K10) is first recast in terms of the specific variable e'=r that needs to be integrated out. We then note that the posterior also has a contribution for ar coming from its measurement and those random effects upon which it depends; these are captured by ID1 from Eq. (K9). The distribution for ID2a is calculated in exactly the same way as for ID2, except that only those contributions with |Mee'|>0.1 are included in the sum in Eq. (K5).

Gibbs samplers for mixed models
Mixed models represent a case for which it is possible to sample directly from the posterior when model parameters and latent variables are each considered separately [30]. Assuming a simple uniform prior, multiplying the two expressions in Eq. (K6) leads to the posterior probability distribution. The following Gibbs proposals can be identified, which are sequentially applied to constitute a single "update". Model parameters: rearranging Eq. (K24) shows that σ² also has an inverted chi-squared distribution (see below for how to draw samples from the inverted chi-squared distributions in Eqs. (K25) and (K26)).
Taking each fixed effect βf in turn, the posterior can be written as a normal distribution from which samples are drawn directly.

Sampling from the inverse chi-squared distribution
We assume an inverse chi-squared distribution of the standard form. A simple method of generating samples from this distribution is to draw a chi-squared variate and invert it (suitably scaled).

From Eq. (21), ID1 is found by taking the product of ID0 in Eq. (L2) and the observation probability at the current latent variable in Eq. (L3). This leads to the final result.
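This recipe can be sketched as follows (the precise parameterisation in the paper's equation did not survive extraction, so the scale/degrees-of-freedom form here is an assumption, and the names are hypothetical):

```python
import random

def inv_chi_squared(nu, scale, rng):
    # If s is a chi-squared variate with nu degrees of freedom (a sum of nu
    # squared standard normals), then scale / s is a scaled inverse
    # chi-squared draw.
    s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return scale / s

# Demo: for nu > 2 the mean of scale/s is scale/(nu - 2).
rng = random.Random(0)
draws = [inv_chi_squared(10, 8.0, rng) for _ in range(200000)]
mean = sum(draws) / len(draws)   # close to 8/(10-2) = 1
```

Drawing the chi-squared variate as a sum of squared normals keeps the sketch dependency-free; a Gamma sampler would be used in practice.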

Fig. 1 Illustrative example of poor/good mixing. (a) Proposals are made individually on parameters and latent variables separately (the dashed line shows the case of a latent variable being changed). The shading represents the region of high posterior probability. (b) Trace plot exhibiting poor mixing. (c) Efficient joint proposals are made using the distribution in pink (which is correlated in the same way as the posterior). (d) Trace plot exhibiting good mixing.

Fig. 2 A directed acyclic graph (DAG) illustrating a model with parameters θ, latent variables ξ and observations y. The arrows represent conditional dependencies. The model assumes that latent variables ξe (where e goes from 1 to E=6 in this example) are sampled from a set of univariate probability distributions π(ξe|ξe'<e,θ) and that the yr are sampled from π(yr|ξ,θ). Note, in general each observation yr can depend on an arbitrary combination of ξe (this example shows the special case where yr depends only on ξr).

Fig. 3 Shows the true (unknown) distribution π(ξe|ξe'<e,θ,y) (black lines) and importance distribution fID(ξe|ξe'<e,θ,y) (dashed red lines) for a given latent variable ξe using (a) the current state on the MCMC chain and (b) the proposed state. In this particular example ξe takes non-negative integer values and the IDs are Poisson distributions.

Figure 3 shows hypothetical distributions for a particular value of e for which ξe takes non-negative integer values. Here the black lines represent the true (unknown) distributions for the current and proposed states. Since these curves are approximately Poisson distributed, the IDs are taken to be Poisson (as shown by the red dashed lines in Fig. 3), with probability mass functions characterised by "expected event number" parameters λi and λp, which are themselves functionally dependent on (ξe'<e, θi, y) and (ξe'<e, θp, y), respectively (see Section 4).

Fig. 4 (a) The model (circles containing parameters and observations have been omitted for clarity). The posterior distribution for ξe assumes that ξe'<e are known (indicated by the blue shading). (b-e) Successive approximations for the ID (described in the text). The bold, green circles indicate that observations on these latent variables have been used in calculating the corresponding ID.

Step 3: Accept or reject the joint proposal for θp and ξp with MH probability (see Appendix D).

Fig. 5 The disease diagnostic test model (Section 5.1) with probability of disease pD, true disease status De (1/0 denoting infected/uninfected), and observed disease status as measured from two diagnostic tests t=1,2 (which have sensitivities Se1,2 and specificities Sp1,2).

Fig. 6 Results for the disease diagnostic test model. (a) Trace plots for disease probability pD as a function of MCMC sample for three different algorithms. (b) The posterior distributions for the model parameters (the vertical lines represent the true values). (c) The CPU time needed to generate 100 effective samples for pD as a function of the number of individuals in the population.

Fig. 8 Results for the SVt stochastic volatility model. (a) The posterior distributions for the model parameters. (b) The CPU time needed to generate 100 effective samples of σ² as a function of the correlation parameter ϕ used to simulate the data.

Figure 8(a) identifies marginal posterior distributions which contain the true parameter values, as indicated by the black vertical lines. For each algorithm, Fig. 8(b) shows how the computational time to infer σ² varies with the correlation parameter ϕ used to generate the data (with all other parameters fixed as above).

Here A is a known symmetric matrix and R = σ²I, where I is the identity matrix (such that residuals are uncorrelated between measurements).

σa²/σr² measures the relative contribution of random effects to residuals.

Fig. 10 Results for mixed models. Shows the CPU time needed to generate 100 effective samples of the variance ratio σa²/(σa²+σe²+σr²).

Fig. 11 The CPU time to generate 100 effective samples of σa² as a function of σa², using a GLMM with binary observations and a logit link function.

Figure H: Optimisation of key parameters.
Here c is a normalisation constant. Explicitly incorporating the observation model from Eq. (I1) gives the required expression. As an illustration of how PBPs are implemented in practice, we explicitly go through step 2 of the PBP algorithm from Section 3.4 (which stochastically modifies De(i) to generate De(p)). The ID is given by a Bernoulli distribution with disease probability z.

Each individual e is considered in turn and, using the definitions in Eq. (I5), we set De=1 with probability p1/(p1+p0), and De=0 otherwise.

Figure L2: Various levels of approximation used for estimating fID(ae|ae'<e,θ,y) for the quantitative genetics model. The shaded circles represent known additive genetic effects ae'<e and the bold, green circles indicate the actual trait measurements used.

Figure L1: A specific quantitative genetics example in which a represents additive genetic effects for a population of P individuals randomly mated over four generations (note, for clarity β and y have been omitted from this diagram). Note, for the non-founding population the random effect for each individual has contributions from its two parents in the previous generation. Residuals ε are incorporated into the observation model.
Combining Eqs. (K11) and (K12) and integrating over ar leads to a posterior probability which can again be re-cast in terms of ad. Comparing with the original expression in Eq. (K10), we see that the effect of integrating out ar is to replace ar with its posterior estimate in the mean and to add an additional contribution to the variance. The procedure above can be repeated for all e'>e. Including the contribution which comes from the observation on ad itself (Eqs. (K16) and (K17)) and integrating over ad implies that the posterior is proportional to an expression which can be recast in terms of ae; this represents the overall contribution to fID2 from latent variable d. To find fID2, therefore, this distribution must be multiplied over all random effects d which depend on ae, and the observation and model contributions from ae itself must also be included.

With all other quantities fixed, the posterior has an inverted chi-squared distribution with respect to σa².


Here um are uniformly generated random numbers between 0 and 1, and the sign Se=1 or -1 corresponds to whether the disease status of individual e is 1 or 0, respectively. Importance distributions: ID0 is given by the latent process model. To generate ID1, the observation model is first approximated by a normal distribution (achieved by Taylor series expanding the log of the observation probability about its mode).

Table 1(a): Shows how to sample the proposed latent variable ξe(p) from ξe(i) for the Poisson and normal IDs.
Suppose, arbitrarily, that the expected number of occurrences in the proposed state λp is greater than the expected number in the initial state λi. Table 1 shows how the actual number in the proposed state is generated. We now consider the reverse transition: since p and i are now switched in Table 1, the corresponding proposal probability density function follows directly. The expression for ID1 is given in Eq. (16). The first thing to note is that the product of the model π(ξe|ξe'<e,θ) and observation probability π(ye|ξe,θ) distributions from Eq. (J1) is not a standard distribution with which PBPs can be used (i.e. it is not listed in Table 1). One way around this problem is to first approximate the observation model as a normal distribution. Gibbs sampling for random effect ae is achieved through the corresponding conditional distribution.