Joining and splitting models with Markov melding

Analysing multiple evidence sources is often feasible only via a modular approach, with separate submodels specified for smaller components of the available evidence. Here we introduce a generic framework that enables fully Bayesian analysis in this setting. We propose a generic method for forming a suitable joint model when joining submodels, and a convenient computational algorithm for fitting this joint model in stages, rather than as a single, monolithic model. The approach also enables splitting of large joint models into smaller submodels, allowing inference for the original joint model to be conducted via our multi-stage algorithm. We motivate and demonstrate our approach through two examples: joining components of an evidence synthesis of A/H1N1 influenza, and splitting a large ecology model.


Introduction
The increasing availability of large amounts of diverse types of data in all scientific fields has prompted an explosion in applications of methods that combine multiple sources of evidence using (Bayesian) graphical models (for example, Moran and Clark, 2011;Commenges and Hejblum, 2012;Shubin et al., 2016;Birrell et al., 2016).Such evidence synthesis methods have several advantages (Ades and Sutton, 2006;Welton et al., 2012;Jackson et al., 2015): resulting estimates are typically more precise, due to the increased amount of information; they are consistent with all available knowledge; and the risk of potential biases introduced if estimation relies on a 'best quality' subset is minimised.
However, dealing with joint models of several sources of evidence, including data and expert opinion, may be inferentially imprudent, computationally challenging, or even infeasible.It is often sensible to take a modular approach, where separate submodels are specified for smaller components of the available data, facilitating computation and, importantly, allowing insight into the influence of each submodel on the joint model inference (Green et al., 2003;Liu et al., 2009).These submodels can originate in two ways: either by first specifying submodels that, in a Bayesian framework, should be joined in a single model to allow all information and uncertainty to be fully propagated; or as a result of splitting an existing joint model.Formally, consider M probability submodels p m (φ, ψ m , Y m ), m = 1, . . ., M , for submodel-specific multivariate parameters ψ m and observable random variables Y m , as well as a multivariate parameter φ common to all submodels that acts as a 'link' between the submodels.The problem is then to join the submodels into a single model p comb (φ, ψ 1 , . . ., ψ M , Y 1 , . . ., Y M ) so that the posterior distributions for the link parameter φ and the submodelspecific parameters ψ m account for all observations and uncertainty.A suitable joint model for a collection of submodels naturally arises in some contexts from standard model constructs, such as a hierarchical model (Figure 1).However, it is not immediately clear how to form such a joint model when either: the submodels are not expressed in a form conditional upon the link parameter φ, particularly if the link parameter is a non-invertible deterministic function of the other parameters; or the prior marginal distributions p m (φ), m = 1, . . ., M , for the link parameter φ differ in the submodels.In applied research, convenient approximate two-stage approaches have been widely used, where one submodel is fitted and an approximation of the resulting posterior is provided to a second submodel (Jackson et al., 2009;Presanis et al., 2014).However, the joint model that is implied by such an approach is unclear (Eddy et al., 1992;Ades and Sutton, 2006).
Conversely, suppose a joint model p(φ, ψ 1 , . . ., ψ M , Y 1 , . . ., Y M ) exists that needs splitting into M submodels p m (φ, ψ m , Y m ), m = 1, . . ., M .The submodels should be faithful to the original model in the sense that joining the submodels results in the original model.In some contexts, suitable submodels arise naturally from the structure of the joint model, resulting in splitting strategies used implicitly in the context of hierarchical models (Lunn et al., 2013a;Tom et al., 2010;Liang and Weiss, 2007) and of tall data (Scott et al., 2016;Neiswanger et al., 2014).However, neither the general conditions stipulating when splitting is permissible nor a general framework for splitting a model are immediately clear.
In this paper we introduce Markov melding, a simple, generic approach for joining and splitting models that clarifies and generalises various proposals made in the literature under the umbrella of one theoretical framework.Markov melding builds on the theory of Markov combination (Dawid and Lauritzen, 1993) and super Markov combination (Massa and Lauritzen, 2010;Massa and Riccomagno, 2017) and combines it with ideas from Bayesian melding (Poole and Raftery, 2000), enabling evidence synthesis (Ades and Sutton, 2006;Welton et al., 2012) and model expansion (Draper, 1995) in realistic applied settings.Markov combination is a framework for combining submodels when the prior marginal distributions p m (φ), m = 1, . . ., M , are identical.Our approach relaxes this assumption, which is often not satisfied in applied settings, to allow joining submodels with similar but not identical prior marginal distributions.It also accounts for contexts where the link parameter φ is a non-invertible deterministic function of other parameters in a submodel.When joining submodels, Markov melding aims to preserve the original submodels as faithfully as possible, and, in particular, always preserves the submodel-specific conditional distributions p m (ψ m , Y m | φ) for all m.Note that, while Markov melding is defined for any collection of submodels, the results may be misleading if any evidence components (priors, submodels and data) strongly conflict (Presanis et al., 2013;Gåsemyr and Natvig, 2009).Such conflict should be investigated and resolved, for example through bias modelling (Turner et al., 2009), before proceeding with the synthesis.In terms of splitting, the Markov melding framework proposed here clarifies the conditions required and the general framework in which to conduct model splitting, facilitating the modular approach advocated above.Notably, we generalise existing tall data splitting approaches (Scott et al., 2016;Neiswanger et al., 2014) for independent, identically distributed data to other types of data.
Finally, we also develop an algorithm for fitting the Markov melded model in stages, for both joining and splitting models.This algorithm extends naturally that employed in Lunn et al. (2013a) and is closely related to those proposed in Liang and Weiss (2007) and Tom et al. (2010).
The paper is organised as follows: in Section 2 we introduce some examples motivating this work; Section 3 provides the conceptual framework underlying our approach; inferential and computational aspects of the approach are presented in Section 4; Section 5 gives details and results for the motivating examples; we conclude with a discussion and suggestions for further work in Section 6.

Motivating examples
We motivate and demonstrate our framework for joining and splitting models with two examples, for which we provide here a brief high-level outline.We show how Markov melding applies in each case in Section 3.3; and provide full details and results in Section 5.For both examples, as in the rest of the paper, we use directed acyclic graphs (DAGs) to represent the dependence structure between variables in a model (Figures 2  and 3).Each variable in the model is represented by a node with rectangular nodes denoting observed variables and links between the nodes indicating direct dependencies.Stochastic (distributional) dependencies are represented by solid lines and deterministic (logical) relationships by dashed lines.The joint distribution of all nodes is the product of the conditional distributions of each node given its direct parents, and conditional independence relationships can be read from the graph (Lauritzen, 1996).

Joining: A/H1N1 influenza evidence synthesis
Public health responses to influenza outbreaks rely on knowledge of severity: the probability that an infection results in a severe event such as hospitalisation or death.One method to estimate severity is by combining estimates of cumulative numbers of severe events with estimates of cumulative numbers of infections obtained from synthesizing different data sources.This approach is adopted in Presanis et al. (2014) for the A/H1N1 pandemic, where information from intensive care units (ICU) is integrated with several other sources.Figure 2 provides a schematic representation of the submodels used for each evidence component.A crucial ingredient is the cumulative number of ICU admissions for the A/H1N1 strain, χ.A lower bound φ for χ is estimable through an immigration-death model governed by transition rates θ, from time-dependent (weekly) prevalence data y on suspected 'flu cases in ICU (Figure 2  The two submodels imply two different prior models for the link quantity φ.A further complication is that the deterministic function connecting φ to the ICU submodel parameters is a sum of products, which is not invertible, preventing the ICU submodel from being expressed conditional on φ.Presanis et al. (2014) therefore transferred information between the two separate submodels via an approximate approach (see Section 4.3 for details).We show in this paper how Markov melding can be used to join the two submodels formally into a single joint model, making all the assumptions involved explicit.We also explain the relationship between our approach and the approximate approach.

Splitting: large ecology model
As an example of splitting a large DAG model we consider a joint model (Besbeas et al., 2002) for two distinct sources of data about British Lapwings (Vanellus vanellus).These data sources are primarily collected to inform different aspects of studies of the birds: census-type data provide a measure of breeding population size, while mark-recapturerecovery data provide estimates of the annual survival probability of the birds via observations of the survival of uniquely marked individuals.These data are related and a joint model allows inference to account for all information available.In the joint Bayesian model of Brooks et al. (2004) (Figure 3(a)), the mark-recapture-recovery data y are modelled in terms of the recovery rate λ, and the survival rate φ for birds; and the census data x are modelled in terms of the survival rate φ, and the productivity rate γ of adult female birds.The joint model links the data sources using the common survival rate parameter φ.Brooks et al. (2004) considered fitting the census and mark-recapture-recovery models both separately and jointly using standard Markov chain Monte Carlo (MCMC) algorithms, but considering the joint model simultaneously is cumbersome and MCMC convergence is slow.We describe in this paper how, through Markov melding, inference from such a joint model can be carried out in stages after splitting the model into two separate submodels (Figure 3(b)), circumventing the need to directly fit the joint model in a single MCMC procedure.The multi-stage fitting process is more computationally efficient, and gives insight into the contribution of each submodel to the joint model.

Joining models
To combine probabilistic models in a principled way, we propose Markov melding as an extension of a Markov combination, which has been introduced by Dawid and Lauritzen (1993) and discussed extensively with generalisations and applications in Massa and Lauritzen (2010) and Massa and Riccomagno (2017).
Let p denote either a probability distribution for discrete random variables or a probability density for continuous variables (we assume such a density exists).In both cases we talk of p interchangeably as a probability or probability distribution and we express conditional probabilities as p(ψ | φ) = p(ψ, φ)/p(φ), where p(φ) > 0. We will assume that when conditioning on a variable its distribution has support in the relevant region.For random variables X 1 , X 2 and X 3 , X 1 ⊥ ⊥ X 2 | X 3 means that X 1 and X 2 are conditionally independent given X 3 .Dawid and Lauritzen (1993) define the submodels p m (φ, ψ m , Y m ), m = 1, . . ., M , as consistent in the link parameter φ if the prior marginal distributions p m (φ) = p(φ) are the same for all m.They define the Markov combination p comb of M consistent submodels as the joint model

Markov combination
By construction, model (1) assumes that the submodels are conditionally-independent: (ψ m , Y m ) ⊥ ⊥ (ψ , Y ) | φ for m = (see Figure 1).All prior marginal distributions and submodel-specific conditional distributions, given the link parameter, are preserved: Furthermore, the model has maximal entropy among the set of distributions with this marginal preservation property, and so can be viewed as the least constrained among such distributions (Massa and Lauritzen, 2010).However, only prior marginals are preserved in Markov combinations: the posterior distributions of φ and any ψ under the Markov combination model account for all data Y m , m = 1, . . ., M , rather than just the submodel-specific data Y , and are not preserved.

Markov melding
If the submodels are not consistent in their link parameter φ, that is, if the prior marginal distributions p 1 (φ), . . ., p M (φ) of the link parameter differ, a Markov combination cannot be formed directly.However, the original submodels p m (φ, ψ m , Y m ), m = 1, . . ., M , can be altered so that the marginals p 1 (φ), . . ., p M (φ) for the link parameter become consistent.This is achieved by a procedure we term marginal replacement, where a new model p repl,m (φ, ψ m , Y m ) is formed by replacing the marginal distribution p m (φ) of φ in the original model p m (φ, ψ m , Y m ) by a new marginal distribution p pool (φ): where the pooled density p pool (φ) = g(p 1 (φ), . . ., p M (φ)) is a function g of the individual prior marginal densities.Here, and in what follows, we assume that such a pooled density exists, that g has been chosen such that p pool (φ) dφ = 1, and that p pool reflects an appropriate summary of the individual marginal distributions (we discuss options below).Since p repl,m (φ, ψ m , Y m ), m = 1, . . ., M , are consistent in the link parameter φ (that is, they all have the same prior marginal p pool (φ)), we can form their Markov combination We term this construction Markov melding of the submodels p m (φ, ψ m , Y m ) with pooled density p pool (φ) = g(p 1 (φ), . . ., p M (φ)), which amounts to applying the Markov combination (1) to submodels satisfying the consistency condition after marginal replacement as in (2)1 .The submodel-specific conditional distributions, given the link parameter, are preserved in the Markov melded model: By extending a similar argument used in Poole and Raftery (2000), it can be shown that marginal replacement has the attractive property that p repl,m minimises the Kullback-Leibler divergence D KL of a distribution q(φ, ψ, Y ) to p m (φ, ψ m , Y m ) under the constraint that the marginals on φ agree, q(φ) = p pool (φ): Marginal replacement can also be interpreted as a generalisation of Bayesian updating in the light of new information.Details are provided in Supplementary Material A.

Markov melding with deterministic variables
Care is required when some of the dependencies in a submodel are deterministic, as in the example in Section 2.1.The considerations are identical to those for Bayesian melding (Poole and Raftery, 2000), where priors on the input and output of deterministic functions are combined.Specifically, assume the k-dimensional link parameter φ is deterministically related to a -dimensional parameter θ, k ≤ , in a model p(φ, θ, ψ, Y ).The probability model is effectively given by p(θ, ψ, Y ) and φ follows an induced distribution.We assume φ is exclusively a deterministic function φ(θ) of the parameter θ.
To apply Markov melding, we need to ensure that the prior marginal distribution on φ is well defined, and that we can apply marginal replacement to φ = φ(θ).We must assume that φ(θ) is an invertible function or, in the case of k < , that φ(θ) can be expanded into an invertible function φ e (θ) = (φ(θ), t(θ)), with a − k dimensional deterministic function t(θ).We denote the inverse function by θ(φ, t).The function φ e induces a probability distribution on (φ, t, ψ, Y ) which can be represented as where J θ (φ, t) is the Jacobian determinant for the transformation θ(φ, t).The marginal distribution on φ can now be obtained as p(φ) = p(φ, t, ψ, Y ) dt dψ dY .We show in Supplementary Material B that p(φ) is independent of the chosen parametric extension t(θ) and so is well defined, and that we can apply marginal replacement, as defined by (2), to replace p(φ) with p pool (φ): Markov melding with p repl (θ, ψ, Y ) can now be applied as in (3).

Pooling marginal distributions
The pooling function g determines the prior marginal distributions p meld (φ, ψ m , Y m ), which, in general, will not match those in the original submodels.It must, therefore, be chosen subjectively, ensuring that the pooled density p pool (φ) appropriately represents prior knowledge of the link parameter φ.Various standard pooling functions have been suggested in the multiple expert elicitation literature (see, for example, Clemen and Winkler, 1999;O'Hagan et al., 2006).The difference here is that we propose to pool prior marginal distributions of submodels, rather than directly-specified priors.A simple option is linear pooling, where w = (w 1 , . . ., w M ) , with w m ≥ 0 to weight the submodel priors.An alternative is log pooling, with w m ≥ 0, a logarithmic version of the linear pooling (for reasons why logarithmic pooling might be attractive see Supplementary Material C).A special case of log pooling is product of experts (PoE) pooling (Hinton, 2002) when in which equal weight is given to each submodel prior.A further special case of linear or log pooling is dictatorial pooling p pool (φ) = p m 0 (φ) when one submodel m 0 is considered authoritative2 .We shall assume throughout this paper that the weights w are a fixed quantity, chosen subjectively, in contrast to some of the power prior literature (Neuenschwander et al., 2009), where attempts have been made to treat the weight w as an unknown parameter.
Figure 4 shows the pooled density when combining two normal distributions under three different pooling functions with three choices of weights.The PoE approach is arguably the least intuitive pooling function due to the rather concentrated combined distribution implied.However, the required computation is greatly simplified (see Section 4), and so if (and only if) PoE adequately represents prior beliefs then PoE pooling is an attractive option.The choice of pooling function is particularly important when there is some disagreement between the priors, but if there is substantial conflict between submodel priors we do not recommend the use of Markov melding, as mentioned above.

Splitting models
We may want to split up a larger model, for example, for computational efficiency or to understand the influence of each submodel on the joint model.In this case we want to split a large joint model , in such a way that joining the submodels using Markov melding recovers the original, joint model.
where p 1 (φ), . . ., p M (φ) are new prior marginal distributions.These marginal distributions and the pooling function g can chosen freely to enable efficient computation, provided that the pooled distribution p pool (φ) = g(p 1 (φ), . . ., p M (φ)) is the same as the original marginal distribution p(φ).An obvious choice is p m (φ) = p(φ) that is, conditioning on the link variable φ makes the parts that are intended for splitting conditionally independent.
Figure 5 shows a few stylised situations with M = 2 where splitting for computational purposes might be desirable.The joint distributions for all models is p(φ, ψ 1 , ψ 2 , Y 1 , Y 2 ).The model in Figure 5(a) can be split into p , with a new prior distribution p 2 (φ), which could be different and computationally simpler than p(φ) = p(φ, Y 1 | ψ 1 )p(ψ 1 ) dψ 1 dY 1 .Markov melding, with dictatorial pooling p pool (φ) = p 1 (φ), results in leading to the original model, regardless of the choice of p 2 (φ).The case in Figure 5(b) is similar to the example in Section 2.2 (see Section 3.3.2for a definition of splitting in this case).Note that in Figures 5(a  Replacing the marginal distribution of φ with pooled density p pool (φ) in the ICU submodel, using (4), and in the severity submodel, using (2), and then applying Markov melding, as in (3), results in where

Inference and computation
The joint posterior distribution, given data Y m = y m , m = 1, . . .M , under the Markov melded model in (3) is The degree of difficulty of inference for this posterior distribution depends on the specification of the submodels.Our focus is settings in which, considered separately, each of the original collection of submodels is amenable to inference by standard Monte Carlo methods (for example, Robert and Casella, 2004).
In Section 4.1 we first consider a standard Metropolis-within-Gibbs sampler, but when the constituent submodels are complex, this sampler may be cumbersome and slow.We thus propose a multi-stage Metropolis-within-Gibbs sampler, in which inference for the full Markov melded model is generated iteratively in stages, starting with standard inference on one of the submodels.This latter sampling scheme enables a convenient modular approach to inference.Both approaches, in general, require the marginal prior densities p m (φ) of the link parameter under each submodel, m = 1, . . ., M , which will not usually be analytically tractable.In Section 4.2 we discuss approaches to estimating these densities, although there is no need to estimate them if PoE pooling is chosen.In Section 4.3 we show how approximate approaches, such as those used by Presanis et al. (2014), relate to the Markov melded model.

Metropolis-Hastings samplers
A general Metropolis-Hastings sampler for the posterior distribution ( 6) can be constructed in the usual way.Candidate values (φ , ψ 1 , . . ., ψ M ) for each parameter of the Markov melded model are drawn from a proposal distribution q(φ , ψ 1 , . . ., ψ M | φ, ψ 1 , . . ., ψ M ), based on the current values (φ, ψ 1 , . . ., ψ M ) of the Markov chain.The candidate values are accepted with probability min(1, r), where r is in the form where the target-to-proposal density ratio is

Metropolis-within-Gibbs sampler
A particular form of the above general sampler is a Metropolis-within-Gibbs sampling scheme (Müller, 1991), in which samples are drawn from the full conditional distribution of each latent parameter ψ 1 , . . ., ψ M , and then the link parameter φ in turn.
Latent parameter updates Markov melding does not introduce any extra complexities in sampling the parameters ψ m in each submodel m = 1, . . ., M (conditional on the link parameter φ) beyond those inherent to the original submodels, considered separately.Typically, they can be sampled using standard algorithms.For instance, a Metropolis-Hastings algorithm, in which we draw a candidate value ψ m from a proposal distribution q(ψ m | ψ m ) based upon the current value ψ m , will be feasible whenever the corresponding algorithm is feasible for estimation of the posterior distribution of the m th submodel alone.In this case, since terms involving marginal densities for the link parameter φ in (6) cancel, the target-to-proposal density ratio in (7) simplifies to This target-to-proposal density ratio is identical to that required for a Metropolis-Hastings update for the parameter ψ m , conditional on the link parameter φ, when the m th submodel alone is the target distribution.
Link parameter updates To update the link parameters, a candidate value φ is drawn from an appropriate proposal distribution q(φ | φ), based upon the current value φ, and is accepted according to the target-to-proposal density ratio When the prior marginal distributions p m (φ) or p pool (φ) are not analytically tractable, we propose to use an approximation p m (φ) in their place, calculated using the methods described in Section 4.2.Note that, under PoE pooling, the terms involving the marginal distributions for the link parameter φ cancel in (8), leaving removing the need to estimate the marginal prior distribution for the link parameter φ.

Multi-stage Metropolis-within-Gibbs sampler
When the constituent submodels are complex, an alternative, multi-stage approach may be computationally preferable to the Metropolis-within-Gibbs sampler.The multi-stage approach generalises the two stage approach in Lunn et al. (2013a).We assume a factorisation of the pooled prior p pool (φ) = M m=1 p pool,m (φ).A default factorisation for any pooling function sets p pool,m (φ) = p pool (φ) 1/M , but there may be more computationallyefficient factorisations.For example, when PoE pooling is used, the factorisation with p pool,m (φ) = p m (φ) is more computationally efficient, as we describe below.The aim then is to sample, iteratively in stages = 1, . . ., M , from Since p meld,M (φ, ψ 1 , . . ., ψ M | y 1 , . . ., y M ) = p meld (φ, ψ 1 , . . ., ψ M | y 1 , . . ., y M ), after M stages the samples obtained reflect the posterior distribution (6) of the full Markov melded model.Note that each p m (φ) (and thus also p pool (φ)) can be estimated from the submodels in advance and independently of the following sampling scheme, as we describe in Section 4.2.

Estimating marginal distributions
The prior marginal densities p m (φ) of the link parameter under each of the M submodels are central to Markov melding, and in particular are required to evaluate the acceptance probability of proposals within the MCMC samplers we proposed above.However, these marginals are not generally analytically tractable, except when the prior distribution p m (φ) is directly-specified as a standard, tractable distribution, such as when φ appears as a founder node in a DAG representation of the submodel.When not available analytically, we can estimate the marginal density p m (φ) for each submodel m by kernel density estimation (Henderson and Parmeter, 2015) with samples drawn from p m (φ) = p m (φ, ψ m , Y m ) dψ m dY m by standard (forward) Monte Carlo.Care is required if φ has high dimension because the curse of dimensionality applies to kernel density estimation; see Section 6 for further discussion.

Normal two-stage approximation method
Approximate approaches for joining submodels are widely used in applied research.In this section, we show that approximate inference for the Markov melded model formed by joining submodels (with PoE pooling) can be produced using a standard normal approximation approach.
Consider the case when the Markov melded model p meld (φ, ψ 1 , ψ 2 , Y 1 , Y 2 ) is formed by joining M = 2 submodels p 1 (φ, ψ 1 , Y 1 ) and p 2 (φ, ψ 2 , Y 2 ).Suppose that ψ 1 is not a parameter of interest in the posterior distribution, so that it can be integrated over An approximate two stage sampler that mimics the multi-stage sampler (above) can then be constructed for this marginal distribution.
Stage 1 Fit the submodel p 1 (φ, ψ 1 , Y 1 ) to obtain posterior samples from p 1 (φ | Y 1 ), and approximate the posterior by a (multivariate) normal distribution with mean µ and covariance Σ.With p N denoting the probability density function for a (multivariate) normal distribution, , we obtain an approximation for (11) by replacing This two-stage approximate approach is commonly used in practice (see Section 6) in the form , with c a data dependent constant.
In this case the likelihood of the second submodel is modified by a factor p N ( µ | φ, Σ) (in a DAG representation a dependency of the constant µ on φ and the constant Σ is added).This approach can be viewed as approximate Markov melding with PoE pooling, in which one submodel is represented by a normal approximation.If, instead of PoE pooling, one wishes to regard the marginal p 2 (φ) on the link variable φ as authoritative and thus fully retain it, dictatorial pooling p pool (φ) = p 2 (φ) leads to the variant where µ 0 and Σ 0 are an estimate of the mean and covariance of the prior marginal p 1 (φ), which can be obtained at stage one in parallel to the posterior by sampling from the prior submodel, and Adjusting according to µ 0 and Σ 0 removes the prior p 1 (φ) from approximate joint model.For simplicity the time domain is suppressed: parameters with subscripts T , U and V are collections of parameters across the time range denoted by the subscript.For example, y a,U = {y a,t : t ∈ U }.

ICU submodel
The main data source in the ICU submodel is prevalence-type data from the Department of Health's Winter Watch scheme (Department of Health, 2011), which records the total number of patients in all ICUs in England with suspected pandemic A/H1N1 influenza infection.Weekly observations y a,t taken at days t ∈ U = {8, 15, 22, . . ., 78} for age group a ∈ {1, 2} (children and adults respectively) are available between December 2010 and February 2011.To estimate the link parameter φ = (φ a ) = (φ 1 , φ 2 ), that is, the cumulative number of ICU admissions over the period of observation t ∈ T = {1, . . ., 78}, from such prevalence data requires an immigration-death model for the system of ICU admission and exits from ICU. Assume that new ICU admissions follow an inhomogeneous Poisson process with rate λ a,t at time t, and the length of stay in ICU is exponentially distributed with rate µ a .Then the number of patients admitted up to time t who are still present in ICU at time t follows a thinned inhomogeneous Poisson process and the observed number of prevalent patients is y a,t ∼ Po(η a,t ), a ∈ {1, 2}, t ∈ U , with expectation, under a discretised formulation with daily time steps, given by η a,t = t u=1 λ a,u exp{−µ a (t − u)}, t ∈ T .We assume η a,1 = 0 to enforce the assumption that no patients with suspected 'flu were in ICU a week before observations began.
The product of the expected new admissions of suspected cases λ a,t and the proportion positive for A/H1N1 π pos a,t gives the expected number of confirmed new admissions on day t.The link parameter φ a is the uninvertible sum of these products over time: We model the proportion positive π pos a,t using weekly virological positivity data from the sentinel laboratory surveillance system Data Mart (Public Health England, 2014), which records the number z pos a,v of A/H1N1-positive swabs out of the total number n pos a,v tested during week v ∈ V = {1, . . ., 11} in age group a ∈ {1, 2}.We assume a uniform prior π pos a,t ∼ Unif(ω a,v , 1), t ∈ T , for the true positivity, where v = 1 for t = 1, . . ., 14 and v = (t − 1)/7 for t = 15, . . ., 78, and where the lower bound ω a,v is informed by a binomial model for the positivity data: z pos a,v ∼ Bin(n pos a,v , ω a,v ), v ∈ V .For the expected new admissions λ a,t , we assume a random-walk prior with log(λ a,1 ) ∼ Unif(0, 250) and log(λ a,t ) ∼ N(log(λ a,t−1 ), γ −2 a ) for t = 2, . . ., 78, with γ a ∼ Unif(0.1,2.7).For the length of ICU stays we assume constant age-group specific exit rates µ 1 = exp(−α) and µ 2 = exp(−{α+β}), with α ∼ N(2.7058, 0.0788 2 ) and β ∼ N(−0.4969, 0.2048 2 ) (Presanis et al., 2014).

Severity submodel
We consider a simplified version of the full, complex severity submodel in Presanis et al. (2014).The Winter Watch ICU data are only available for a portion of the time of the 'third wave' of the A/H1N1 pandemic, and so the cumulative number of confirmed new admissions φ a from the ICU submodel is a lower bound for the true number χ a of ICU admissions during the third wave.We thus assume φ a ∼ Bin(χ a , π det ), a ∈ {1, 2}, where π det is the age-constant detection probability, to which we assign a Beta(6, 4) prior.We incorporate the remaining evidence in the full severity submodel of Presanis et al. (2014) via informative priors χ 1 ∼ Lognormal(4.93,0.17 2 ) and χ 2 ∼ Lognormal(7.71,0.23 2 ).

Markov melded model
We joined the submodels as in (5).We considered linear and log pooling with pooling weight w 1 = 0.25, 0.5 and 0.75 (and w 2 = 1 − w 1 ), and PoE pooling.
We estimated the marginal priors for φ = (φ 1 , φ 2 ) under the ICU and severity submodels using kernel density estimation with a bivariate t-distribution kernel, using 5×10 4 independent draws, sampled from the corresponding submodel by forward Monte Carlo.The marginal priors are shown in Figure 7(a).Note that the ICU submodel prior for φ is extremely flat, whereas the severity submodel prior is concentrated on a small part of the parameter space.The combined density using each of the pooling functions (with w 1 = w 2 = 0.5) is shown in Figure 7(b).Linear and PoE pooling in this case lead to similar densities, whereas the log pooling prior is more dispersed.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q χ 1 χ 2  We then estimated, in stage one, the posterior distribution of the link parameter φ under the ICU submodel alone.We drew 5 million iterations from the ICU submodel using JAGS (Plummer, 2015b), retaining every 100 th iteration, after discarding 5 × 10 4 iterations as burn-in.In stage two, for the Markov melded models under linear, log and PoE pooling, we drew 2 × 10 6 samples using the multi-stage Metropolis-within-Gibbs sampler, with the first 10 4 samples discarded as burn-in.
Figure 8 shows the results.There is a notable reduction in uncertainty in the posteriors from Markov melding compared to the ICU submodel posterior, especially in φ 1 , demonstrating the benefit of joining the submodels.In the adult age group (a = 2), the Markov melding results are robust to the choice of pooling function and pooling weight: the likelihood from the ICU submodel dominates over the pooled prior.There is considerable agreement between the various approaches in the child age group (a = 1) as well, although the choice of pooling weight has some influence on the upper tail under log pooling.As anticipated by Section 4.3, the normal approximation (fitted using OpenBUGS)
and PoE pooling posteriors are close, due to the near normality of the ICU posterior distribution.

Splitting: large ecology model
Figure 9 is a DAG representation of the full, joint model outlined in Section 2.2.

Mark-recapture-recovery data
Mark-recapture-recovery data y t 1 ,t 2 record the number of ringed birds released before May in year t 1 = 1, . . ., 35, and recovered (dead) in the 12 months up to April in year t 2 = t 1 + 1, . . ., 36.The years correspond to observations for releases from 1963 (t = 1) to 1997 and recoveries from 1964 to 1998.The number of birds y t 1 ,37 released in year t 1 and never recovered is also available.We assume We model the probability π t 1 ,t 2 of recovery in year t 2 following release in year t 1 in terms of the recovery rate λ t , and the survival rates η C,t and η A,t for immature (1 year old) and breeding (2 years or older) birds, respectively, up to April of year t: The recovery rate is the probability that a bird that dies in year t is recovered.The probability of a bird released in year t 1 being never recovered is π t 1 ,37 = 1 − 36 u=t 1 +1 π t 1 ,u .

Census data
We assume that the observed census-type data x t , which are available for 1965 (t = 3) to 1998, account for only breeding birds and that there is no emigration.We model the census data via the true number of breeding females µ A,t and immature females µ C,t , and the productivity rate γ t , the average number of female offspring per breeding female in year t, which could be greater than 1.Specifically we assume for t = 3, . . ., 36 with the observation variance σ 2 assumed constant.

Regression models and prior distributions
We model the parameters η G,t , λ t and γ t with regression models, with z t denoting the (observed) number of frost days in year t.

Results
We split the joint model, as described in Section 3.3.2,into two components: the markrecapture-recovery submodel and the census submodel.Denote by Ω 0 = (η C , η A , φ α,C , φ α,A , φ β,C , φ β,A ) the parameters shared by both submodels and by Ω 1 = (π, λ, α λ , β λ ) the parameters specific to the recovery submodel.Under both the mark-recapture-recovery submodel (stage one) and the census submodel (stage two), we use independent normal priors, with mean 0 and standard deviation √ 200, for each component φ α,C , φ α,A , φ β,C and φ β,A of the link parameter.These priors were chosen so that PoE pooling of these priors results in the original prior for the link parameters under the joint model.
In stage one we drew samples from the posterior distribution p 1 (Ω 1 , Ω 0 | y) under the recovery submodel, and retained these samples for use as a proposal distribution in stage two, in which we drew samples under the full joint model.In stage one, we drew 2.5×10 5 MCMC iterations from the posterior distribution of the mark-recapture-recovery submodel, taking 7 hours on a single core of an Intel Xeon E5-2620 2.0GHz CPU.In stage two, we discarded all but every 100 th iteration, leaving 2.5 × 10 5 MCMC iterations for inference.This took 6 1 2 hours.Figure 10 shows the results.We compare the two-stage estimates to the estimates of the joint distribution based upon 6×10 5 MCMC iterations (retaining every 10 th iteration) drawn using a standard (one stage) MCMC sampler, which took 22 hours to run in OpenBUGS.We regard these results as the 'gold standard' that we aim to match with the two stage sampling approach.The components of the link parameters φ α,C and φ β,C corresponding to the immature birds have posterior distributions that closely agree under the joint model and mark-recapture-recovery submodel alone, but there are differences in the parameters corresponding to mature birds.In particular there is a sizeable difference for the regression parameter φ β,A , which is estimated to be notably higher under the joint model than under the mark-recapture-recovery submodel alone.The two-stage approach accurately captures this shift (Figure 10, right-hand panel).The similarity of φ α,C , φ α,A and φ β,C in the stage 1 posterior (recovery model) and the stage 2 posterior implies that the census submodel contains little information about these parameters.In contrast, the census submodel does contain information about the regression parameter φ β,A describing the relationship between the survival rate of adult birds and the number of frost days.
The census information suggests that φ β,A should be less negative than implied by the recovery information, implying that adult survival rate decreases only slightly in harsher winters.

Further work and discussion
We have presented a unifying view and a generic method for joining and splitting probabilistic submodels that share a common variable.We have extended the notion of Markov combination to the case where prior marginal distributions in each submodel need not be identical, enabling a principled approach to joining models in realistic applied settings, assuming that there is not strong conflict between evidence components and that it is reasonable to assume that the submodels are conditionally independent.We also introduced a computational algorithm that allows inference for submodels to be efficiently conducted in stages, when considering either joining or splitting models.The remainder of this section discusses related work, computational issues and alternative approaches.

Related work
The key idea for a melding approach can be attributed to Poole and Raftery (2000), but their presentation focuses on a limited set of models and is tied up with a deterministic link parameter φ.This slightly obscures the key issues that we present more generally in Section 3, where we clearly separate issues relating to marginal replacement from issues related to deterministic transformation of random variables.A further influence is the work on decomposable graphical models (Dawid and Lauritzen, 1993), where a key concept is the separator, a subset of variables that splits the model into two parts that are independent conditional on the separator.Separators correspond to link variables in Markov melding.The rich literature on decomposable graphs and corresponding algorithms, such as junction tree algorithms (Lauritzen, 1996), suggests extensions of Markov melding to a series of link variables (separators) for joining several submodels into chain or tree formations.Evidence synthesis models (Eddy et al., 1992;Jackson et al., 2009;Albert et al., 2011;Commenges and Hejblum, 2012) often employ the approximate approach of summarising the results of a first-stage submodel via a Gaussian or other distribution, for use in a second-stage submodel as a likelihood term.We demonstrated in Section 4.3 that this approach is an approximation to Markov melding under PoE pooling, therefore justifying the approximation.Similar approximations are widely used in standard and network meta-analysis (for example, Hasselblad et al., 1992;Ades and Sutton, 2006;Welton et al., 2008).Similarly, in more general hierarchical models, splitting models to make inference faster or easier has previously been considered (Liang and Weiss, 2007;Tom et al., 2010;Lunn et al., 2013a).In this setting, posterior inference is first obtained from independent unit-specific submodels, with flat, independent priors replacing all hierarchical priors in the joint model.Inference for the joint model is recovered in stage two through Markov melding of these unit-specific submodels with dictatorial pooling, so that only the hierarchical prior is reflected in the final results.This can make cross-validation more convenient (Goudie et al., 2015).Splitting models into conditionally independent components at a set of separator or link parameters is also a key aspect of cross-validatory posterior predictive methods, including "node-splitting", for assessing conflict across subsets of evidence (Presanis et al., 2013;Gåsemyr and Natvig, 2009).Markov melding may provide a natural, computationally-efficient approach for systematic conflict assessment (Presanis et al., 2016).
Our framework can also be viewed as encapsulating a range of approaches proposed in the big data literature for handling a large number of observations ('tall data').With tall data it may be infeasible even to store all of the data on a single computer, nevermind evaluate functions depending on the whole dataset thousands of times, as needed in MCMC.Instead, a divide-and-conquer approach can be taken, in which the original exchangeable data y are partitioned into B batches y 1 , . . ., y B , each of which contains few enough observations that standard statistical methods can be applied without undue trouble.The key observation is that the full posterior distribution p(φ | y) can be split into a number of submodel posteriors p b (φ | y b ) ∝ p(y b | φ)p(φ) 1/B , b = 1, . . ., B. This is a form of model splitting (Section 3.2), with PoE pooling and the original prior apportioned equally among the batches.Various approaches for integrating the batch-specific posteriors to approximate the overall posterior have been proposed (Huang and Gelman, 2005;Scott et al., 2016;Neiswanger et al., 2014;Wang and Dunson, 2013;Bardenet et al., 2017;Minsker et al., 2014).However, this literature has so far only considered independent, identically distributed data, whereas we have considered more general models and data.

Computational challenges
In our examples, the link variable φ is comparatively low dimensional and simple kernel density estimation using a multivariate t-distribution kernel proved sufficient.Moreover, the results were robust with respect to the choice of kernel and kernel bandwidth.For higher-dimensional link variables more care in the choice of kernel estimation method might be required (Henderson and Parmeter, 2015), or, alternatively, we might wish to estimate the ratio of densities directly to improve stability (Sugiyama et al., 2012).
The multi-stage sampler (Section 4.1.2)broadly falls into the category of a sequential Monte Carlo sampler (Doucet et al., 2013), as described in Supplementary Material D. While the Markov melding model is invariant to the ordering of the submodels used (assuming the pooling function is also), the efficiency of the multi-stage algorithm may not be in practice, due to the need for there to be sufficient stage one samples in the appropriate region.If two submodels contain an approximately equal amount of nonconflicting information, then the ordering is unlikely to be important.In other settings, more care may be required.For example, suppose submodel M 1 contains considerably more information than M 2 .If stage one uses M 1 , then the stage one posterior may be so precise that it is unable to be adjusted for the extra information in M 2 .In contrast, if M 2 is used first, then the estimate of the posterior distribution may be very coarse, due to a lack of samples in the central part of the posterior distribution.Further research will be needed to identify the best ordering to adopt in general.

Alternative approaches
We obtained consistency in the link parameter φ, as required by Markov combination, through marginal replacement (Section 3.1).This approach assumes the priors differ in substance across submodels.Alternatively, as we outline in Supplementary Material E, we could assume that the priors differ only due to different scalings in each submodel, and so can be made consistent through rescaling, similar to when deriving multivariate distributions from copulas (Durante and Sempi, 2010).Yet another approach is a supra-Bayesian approach (Lindley et al., 1979;Leonelli, 2015), in which the decision maker models the experts' opinions.
The prior pooling approach considered within our framework includes a judgement as to how to weight the different submodels.Various other methods have been proposed for weighting evidence, including the cut operator (Lunn et al., 2013b;Plummer, 2015a); the power prior approach in clinical trials (for example, Neuenschwander et al., 2009); and modularisation in the computer models literature (Liu et al., 2009).Further research is required to investigate the relationship of Markov melding to other weighting approaches.marginals on φ agree, q(φ) = p pool (φ), that is, p repl (φ, ψ m , Y m ) = argmin q {D KL (q p m ) | q(φ) = p pool (φ) for all φ} This is easily shown as follows (we drop index m and variable Y for simplicity).The KL divergence under the constraint is given by The second term is the KL divergence of the marginals and is constant.The first term can be minimised to 0 by choosing q(ψ | φ) = p(ψ | φ) for all φ and consequently q(φ, ψ) = p(ψ | φ)p pool (φ) is the solution to the constrained KL divergence minimisation.
A similar argument has been made in Poole and Raftery (2000) to justify their choice of a distribution for Bayesian melding.Notice that the same argument can still be made based on D KL (p m q) with the roles of p m and q exchanged.Marginal replacement can also be seen as a generalisation of Bayesian updating in the light of new information.For example, if we learn that φ can assume only the value of a constant φ 0 , standard Bayesian updating entails conditioning on the new information φ 0 to form the posterior distribution p m (ψ m , Y m | φ 0 ).This can be viewed as a special case of marginal replacement in which the new marginal p pool (φ) is the point mass δ φ 0 (φ) on φ 0 , since the marginal distribution of (ψ m , Y m ) under the marginal replacement model is this posterior distribution, as follows by standard properties of the Dirac delta function δ φ 0 : In this sense, (2) enables integration of new information on φ provided not only in the form of a specific value φ 0 but in the form of a general density function p pool (φ).
Finally, Approximate Bayesian Computation (ABC) can be interpreted as a marginal replacement similar to the replacement in equation (S1), when φ typically represents a data variable3 .Instead of p pool (φ) = δ φ 0 (φ) in standard posterior inference, ABC uses where I is the indicator function of an event, d is a distance function, S some summary statistic for the data variable φ, φ 0 is observed and a small constant.ABC can thus be seen as very similar to standard posterior inference but with a widening of the δ function (Wilkinson, 2013;Miller and Dunson, 2015).In fact, the limits → 0 and → ∞ lead to the posterior and prior distributions on φ, respectively.
Note that we only need to evaluate the likelihood ρ m (φ) for the last submodel m = due to the factorisation of p meld .Equivalently, a sample can be obtained via Metropolis-Hastings sampling with target-to-proposal density ratio where the proposal functions q(φ ) just samples uniformly from S −1 .
We opted for the latter sampling approach since for the models envisaged it is rarely possible to marginalise out ψ m and the Metropolis-Hastings sampler is able to sample from both φ and ψ m together.
A notorious problem with static parameters such as φ is depletion of the sample with fewer distinct values of φ at each stage.Various schemes have been proposed to rejuvenate the sample.Liu and West (2001) propose adding a disturbance ζ to φ at each stage.This amounts to sampling from a kernel smoothed version of the original sample.Care needs be taken to avoid undue increase in variance from stage to stage.Liu and West (2001) show how the increase can be controlled by cleverly correlating the disturbance ζ with φ.Gilks and Berzuini (2001) propose a rejuvenation through a move step after the sampling step.This move step, applied at stage , needs to leave distribution π invariant, for example, by one or more Metropolis-Hastings steps.In our case this is only possible by evaluating the full distribution p meld, involving all submodels m = 1, . . ., , somehow defeating the purpose of the scheme to avoid revisiting submodels earlier than .However, for a long sequence of submodels an occasional move step might be beneficial despite the increase in computation.

E Transformation of marginals of the link variable
An alternative approach to achieve the same distribution of the link variable φ in all submodels required by (1) is via a suitable transformation of φ so that all marginals agree similar to a copula approach (Durante and Sempi, 2010).We assume we have link variables φ m for each submodel m, measuring the same quantity (for example weight) but on different scales (say, kilograms, stones, pounds), which we indicate by a submodelspecific index m of φ m .However, the twist is that we cannot assume the transformations between scales are known.Instead we assume they can be reconstructed by matching quantiles of the prior distributions on the link variables.That is, find transformations so that all distributions are identical after rescaling.
We further assume we have a presentation of the link variable φ on a standard scale with distribution p pool (φ).We are now able to define new distributions that agree in their marginals on φ m and φ

Figure 1 :
Figure 1: DAG representation of a joint hierarchical model linking M submodels.

Figure 2 :
Figure2: High-level DAG representations of the influenza submodels.The double circle denotes the (highly) informative prior for χ, reflecting data from the full severity submodel that is omitted here.Detailed DAGs of these submodels are shown in Figure6.

Figure 3 :
Figure 3: High-level DAG representations of the ecology models.Detailed DAG representations of these models are shown in Figure 9.

Figure 5 :
Figure 5: DAG representations of stylised situations where model splitting might be desirable.Splitting the joint model is possible in (a) and (b), but not in (c).

Figure 6
Figure6shows DAG representations of the two submodels outlined in Section 2.1.

Figure 6 :
Figure6: DAG representations of the submodels of A/H1N1 influenza.Repeated variables are enclosed by a rounded rectangle, with the label denoting the range of repetition.For simplicity the time domain is suppressed: parameters with subscripts T , U and V are collections of parameters across the time range denoted by the subscript.For example, y a,U = {y a,t : t ∈ U }.

Figure 7 :
Figure 7: Prior distributions for φ a , the cumulative number of confirmed new admissions in age group a, in the A/H1N1 influenza evidence synthesis: (a) under the ICU and severity submodels; (b) pooled priors under three pooling functions with w 1 = w 2 = 0.5.

Figure 8 :
Figure8: Medians and 95% credible intervals for: the posterior distribution for the link parameters φ 1 and φ 2 under the ICU submodel; the prior distribution for each parameter under the severity submodel; the posterior distribution for each parameter according to the normal approximation and the Markov melded model under each pooling function.The x-axis shows number of individuals, except for π det , which shows probabilities.

Figure 10 :
Figure 10: Histograms of the posterior densities of the link parameters φ α,C , φ α,A , φ β,C and φ β,A under the recovery submodel (Stage 1), and under the full joint model, as estimated by Stage 2 of the two stage sampler and by a standard MCMC sampler for the joint model.