Efficient inference for stochastic differential equation mixed-effects models using correlated particle pseudo-marginal algorithms

Abstract Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that are able to account for random variability inherent in the underlying time-dynamics, as well as the variability between experimental units and, optionally, account for measurement error. Fully Bayesian inference for state-space SDEMEMs is performed, using data at discrete times that may be incomplete and subject to measurement error. However, the inference problem is complicated by the typical intractability of the observed data likelihood which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameter values of interest. The algorithm is made computationally efficient through careful use of blocking strategies and correlated pseudo-marginal Metropolis–Hastings steps within the Gibbs scheme. The resulting methodology is flexible and is able to deal with a large class of SDEMEMs. The methodology is demonstrated on three case studies, including tumor growth dynamics and neuronal data. The gains in terms of increased computational efficiency are model and data dependent, but unless bespoke sampling strategies requiring analytical derivations are possible for a given model, we generally observe an efficiency increase of one order of magnitude when using correlated particle methods together with our blocked-Gibbs strategy.


Introduction
Stochastic differential equations (SDEs) are arguably the most widely used and studied stochastic dynamic models. SDEs allow the representation of stochastic time-dynamics, and are ubiquitous in applied research, most notably in finance (Steele, 2012), systems biology (Wilkinson, 2018), pharmacokinetic/pharmacodynamic modeling (Lavielle, 2014) and neuronal modeling. SDEs extend the possibilities offered by ordinary differential equations (ODEs) by allowing random dynamics. As such, they can in principle replace ODEs in practical applications, to offer a richer mathematical representation for complex phenomena that are intrinsically non-deterministic. However, in practice, switching from ODEs to SDEs is usually far from trivial, due to the absence of closed-form solutions to SDEs (except for the simplest toy problems), implying the need for numerical approximation procedures (Kloeden and Platen, 1992). Numerical approximation schemes, while useful for simulation purposes, considerably complicate statistical inference for model parameters. For reviews of inference strategies for SDE models, see e.g. Fuchs (2013) (including Bayesian approaches) and Sørensen (2004) (classical approaches). Generally, in the non-Bayesian framework, the literature on parametric inference for SDEs is vast; however, there is no inference procedure that is applicable to general nonlinear SDEs and is also easy to implement on a computer. This is due to the lack of explicit transition densities for most SDE models. The problem is particularly difficult for measurements that are observed without error, i.e. Markovian observations. On the other hand, the Bayesian literature offers powerful solutions to the inference problem when observations arise from state-space models.
In our case, this means that if we assume that observations are observed with error, and that the latent process is a Markov process, then the literature based on sequential Monte Carlo (particle filters) is readily available in the form of pseudo-marginal methods (Andrieu and Roberts, 2009), and closely related particle MCMC methods (Andrieu et al., 2010), which we introduce in Section 4.
Our goal is to produce novel Gibbs samplers embedding special types of pseudo-marginal algorithms allowing for exact Bayesian inference in a specific class of state-space SDE models. In this paper, we consider ''repeated measurement experiments'', modeled via mixed-effects, where the dynamics are Markov processes expressed via stochastic differential equations. These dynamics are assumed directly unobservable, i.e. are only observable up to measurement error. The practical goal is to fit observations pertaining to several ''units'' (i.e. independent experiments, such as measurements on different subjects) simultaneously, by formulating a state-space model having parameters randomly varying between the several individuals. The resulting model is typically referred to as a stochastic differential equation mixed-effects model (SDEMEM). SDEMEMs are interesting because, in addition to explaining intrinsic stochasticity in the time-dynamics, they also take into account random variation between experimental units. The latter variation permits the understanding of between-subjects variability within a population. When considered in a state-space model, these two types of variability (population variation and intrinsic stochasticity) are separated from the third source of variation, namely residual variation (measurement error). Thanks to their generality, and the ability to separate the three levels of variation, SDEMEMs have attracted attention, see e.g. Donnet and Samson (2013a) for a review and Whitaker (2016) for a more recent account. See also Section 2 for a discussion on previous literature.
In the present work, we mainly focus on a general, plug-and-play approach for exact Bayesian inference in SDEMEMs, meaning that analytic calculations are not necessary, thanks to the flexibility of the underlying sequential Monte Carlo (SMC) algorithms. We also describe a non plug-and-play approach to handle specific situations. As in Picchini and Forman (2019), our random effects and measurement error can have arbitrary distributions, provided that the measurement error density can be evaluated point-wise. Unlike Picchini and Forman (2019), we use a Gibbs sampler to target the marginal parameter posterior. Subject-specific, common and random-effect population parameters are updated in separate blocks, with pseudo-marginal Metropolis-Hastings (PMMH) steps used to update the subject-specific and common parameters, and Metropolis-Hastings (MH) steps used to update the random-effect population parameters. We believe that, to date, our work results in the most general plug-and-play approach to inference for state-space SDEMEMs (a similar method was concurrently and independently introduced, on July 25 2019 on arXiv, in Botha et al. (2020); see the discussion in Section 6). However, the price to pay for such generality is that pseudo-marginal methods guided by SMC algorithms are computationally demanding. In order to make pseudo-marginal methods scale better as the number of observations increases, we exploit recent advances based on correlated PMMH (CPMMH). We combine CPMMH with a novel blocking strategy and show that it is possible to considerably reduce the number of required particles, and hence the computational requirements for exact Bayesian inference. In our experiments, unless specific models admit bespoke efficient sampling strategies (e.g. Section 5.3, where it was possible to implement an advanced particle filter), CPMMH-based algorithms with our novel blocking strategy are one order of magnitude more efficient than standard PMMH.
Occasionally we even observed a 40-fold increase in efficiency, as in Section 5.1.
The remainder of this paper is organized as follows. Background literature is discussed in Section 2. Stochastic differential mixed-effects models and the inference task are introduced in Section 3. Our proposed approach to inference is described in Section 4. Applications are considered in Section 5, including a simulation study considering an Ornstein-Uhlenbeck SDEMEM, a tumor-growth model and finally a challenging neuronal data case-study. A discussion is in Section 6. Julia and R codes can be found at https://github.com/SamuelWiqvist/efficient_SDEMEM.

Background literature
Here we briefly review key papers on inference for SDEMEMs, and refer the reader to https://umbertopicchini.github.io/sdemem/ for a comprehensive list of publications. Early attempts at inference for SDEMEMs use methodology borrowed from the standard (deterministic) nonlinear mixed-effects literature, such as FOCE (first order conditional estimation) combined with the extended Kalman filter, as in Overgaard et al. (2005). This approach can only deal with SDEMEMs having a constant diffusion coefficient; see instead Leander et al. (2015) for an extension to state-dependent diffusion coefficients. The resulting inference in Overgaard et al. (2005) is approximate maximum likelihood estimation, and no uncertainty quantification is given. Moreover, only Gaussian random effects are allowed and measurement error is also assumed Gaussian. Other maximum likelihood approaches are in Picchini et al. (2010) and Picchini and Ditlevsen (2011), where a closed-form series expansion for the unknown transition density is found using the method in Ait-Sahalia (2008); however, the methodology can only be applied to reducible multivariate diffusions without measurement error. Donnet et al. (2010) discuss inference for SDEMEMs in a Bayesian framework. They implement a Gibbs sampler when the SDE (for each subject) has an explicit solution, and consider Gaussian random effects and Gaussian measurement error. When no explicit solution exists, they approximate the diffusion process using the Euler-Maruyama approximation. The approach of Donnet and Samson (2013b) is of particular interest, since it is the first attempt to employ particle filters for inference in SDEMEMs: they construct an exact maximum likelihood strategy based on stochastic approximation EM (SAEM), where latent trajectories are ''proposed'' via particle Markov chain Monte Carlo.
The major problem with using SAEM is the need for sufficient summary statistics for the ''complete likelihood'', which makes the methodology essentially impractical for arbitrarily complex models. Delattre and Lavielle (2013) also use SAEM, but they avoid the need for the (usually unavailable) summary statistics for the complete likelihood, and propose trajectories using the extended Kalman filter instead of particle MCMC. Unlike in Donnet and Samson (2013b), the inference in Delattre and Lavielle (2013) is approximate and measurement error and random effects are required to be Gaussian. Ruse et al. (2019) analyze multivariate diffusions under the conditions that the random effects are Gaussian distributed and that both fixed parameters and random effects enter linearly in the SDE. Whitaker et al. (2017) work with the Euler-Maruyama approximation and adopt a data augmentation approach to integrate over the uncertainty associated with the latent diffusion process, by employing carefully designed bridge constructs inside a Gibbs sampler. A linear noise approximation (LNA) is also considered. However, the limitations are that the observation equation has to be a linear combination of the latent states and measurement error has to be Gaussian. In addition, producing the bridge construct in the data augmentation approach or the LNA-based likelihood requires some careful analytic derivations. Consequently, neither approach can be regarded as a plug-and-play method (that is, a method that only requires forward simulation and evaluation of the measurement error density). 
In Picchini and Forman (2019), approximate and exact Bayesian approaches for a tumor growth study were considered: the approximate approach was based on synthetic likelihoods (Wood, 2010; Price et al., 2018), where summary statistics of the data are used for the inference, while exact inference used pseudo-marginal methodology via an auxiliary particle filter, which is suited to targeting measurements observed with small error. It was found that using a particle approach to integrate out the random effects was very time consuming. Even though the data set was small (comprising 5-8 subjects to fit, depending on the experimental group, and around 10 observations per subject), the number of particles required to approximate each individual likelihood was on the order of thousands. This is very time consuming when the number of ''subjects'' (denoted M in the rest of this work) increases.

Stochastic differential mixed-effects models
Consider the case where we have M experimental units randomly chosen from a theoretical population. Our goal is to perform inference based on simultaneously fitting all data from the M units. Now assume that the experiment we are analyzing consists of observing a stochastically evolving dynamic process, and that associated with each unit i is a d-dimensional Itô process {X^i_t, t ≥ 0} satisfying

dX^i_t = α(X^i_t, D^i, κ, φ^i) dt + β^{1/2}(X^i_t, D^i, κ, φ^i) dW^i_t.   (1)

Here, α is a d-vector of drift functions, the diffusion coefficient β is a d × d positive definite matrix with square root β^{1/2}, W^i_t is a d-vector of (uncorrelated) standard Brownian motion processes and D^i are unit-specific static or time-dependent deterministic inputs (e.g. covariates, forcing functions); see e.g. Leander et al. (2015). The p-vector parameter κ = (κ_1, . . . , κ_p)^T is common to all units, whereas the q-vectors φ^i, i = 1, . . . , M, are unit-specific random effects. In the most general random effects scenario we let π(φ^i|η) denote the joint distribution of φ^i, parameterized by the r-vector η = (η_1, . . . , η_r)^T. The model defined by (1) allows for differences between experimental units through different realizations of the Brownian motion paths W^i_t and the random effects φ^i, accounting for inherent stochasticity within a unit and for variation between experimental units, respectively.
We assume that each latent process {X^i_t, t ≥ 0} cannot be observed exactly, but that observations y^i = (y^i_1, . . . , y^i_n)^T are available. Without loss of generality, we assume units are observed at the same integer time points {1, 2, . . . , n}; that is, in the following we write n instead of, say, n^i for all i. However, this is only for convenience of notation, and we could easily accommodate the possibility of different units i having different values n^i and, in turn, units being observed at different sets of times. The observations are assumed conditionally independent (given the latent process) and we link them to the latent process via

Y^i_t = h(X^i_t, S^i, ε^i_t),   t = 1, . . . , n,   (2)

where ε^i_t is the measurement noise, S^i is (as D^i) a unit-specific deterministic input, and h(·) is a possibly nonlinear d_0-dimensional function of its arguments. In the applications in Section 5 we have D^i = S^i = ∅, the empty set, for every i, and hence for simplicity of notation we disregard D^i and S^i in the rest of the paper. However, having non-empty sets does not introduce any additional complication to our methodology. Notice that the possibility of having d_0 < d implies that some coordinates of the {X^i_t} system may be unobserved at some (or all) t. We denote by π(y^i_t | x^i_t, ξ) the density linking Y^i_t and X^i_t. A notable special case of the observation model is when h(X^i_t, ε^i_t) = F^T X^i_t + ε^i_t for a constant matrix F and ε^i_t | Σ indep ∼ N(0, Σ), allowing for observation of a linear combination of components of X^i_t, subject to additive Gaussian noise. Notice that our methodology in Sections 3.1-4.4 can be applied to an arbitrary h(·), provided this can be evaluated pointwise for any value of its arguments. For example, in Section 5.2 we have that h(·) is the logarithm of the sum of the components of a bivariate X^i_t. We refer to the model constituted by the system (1)-(2) as a SDEMEM.
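To make the linear-Gaussian special case concrete, the observation density can be evaluated directly. The following Python sketch is our own illustration (the function name `obs_density` and the specific F and Σ values are hypothetical, not taken from the paper's code); it evaluates π(y^i_t | x^i_t, ξ) when h(X^i_t, ε^i_t) = F^T X^i_t + ε^i_t with ε^i_t ∼ N(0, Σ):

```python
import numpy as np
from scipy.stats import multivariate_normal

def obs_density(y_t, x_t, F, Sigma):
    """Evaluate pi(y_t | x_t, xi) for the linear-Gaussian special case
    h(x, eps) = F^T x + eps, with eps ~ N(0, Sigma)."""
    return multivariate_normal.pdf(y_t, mean=F.T @ x_t, cov=Sigma)

# Observe only the first coordinate of a bivariate latent state (d0 = 1 < d = 2)
F = np.array([[1.0], [0.0]])    # d x d0 selection matrix (hypothetical)
Sigma = np.array([[0.1]])       # measurement-error covariance (hypothetical)
x_t = np.array([0.5, -1.2])
dens = obs_density(np.array([0.6]), x_t, F, Sigma)
```

Here d_0 = 1 < d = 2, so the second coordinate of the latent state is unobserved, as discussed above.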
This is a state-space model, due to the Markov property of the Itô processes {X^i_t, t ≥ 0} and the assumption of conditional independence of the observations given the latent processes. The model is flexible: equation (1) explains the intrinsic stochasticity in the dynamics (via β) and the variation between units (via the random effects φ^i), while (2) explains residual variation (measurement error, via ξ).

Bayesian inference
Denote with x = (x^1, . . . , x^M)^T the set of unobserved states collected across all M diffusion processes {X^i_t} at the same set of integer times {1, 2, . . . , n} as for the data y = (y^1, . . . , y^M)^T. Then, given data y and latent values x, the joint posterior for the common parameters κ, fixed/random effects φ = (φ^1, . . . , φ^M)^T, hyperparameters η and measurement error parameters ξ is

π(κ, η, ξ, φ, x | y) ∝ π(κ, η, ξ) π(φ|η) π(x|κ, φ) π(y|x, ξ),

where π(κ, η, ξ) is the joint prior density ascribed to κ, η and ξ. These three parameters may be assumed a priori independent, and then we can write π(κ, η, ξ) = π(κ)π(η)π(ξ), though this need not be the case and we can easily assume a priori correlated parameters. In addition we have that

π(x|κ, φ) = ∏_{i=1}^M π(x^i|κ, φ^i) = ∏_{i=1}^M { π(x^i_0) ∏_{j=1}^n π(x^i_j | x^i_{j-1}, κ, φ^i) }

and

π(y|x, ξ) = ∏_{i=1}^M ∏_{j=1}^n π(y^i_j | x^i_j, ξ).

Note that π(x^i_j | x^i_{j-1}, κ, φ^i) will typically be intractable. In this case, we assume that it is possible to generate draws (up to arbitrary accuracy) from π(x^i_j | x^i_{j-1}, κ, φ^i) using a suitable numerical approximation. For example, the Euler-Maruyama approximation of (1) is

X^i_{t+∆t} = X^i_t + α(X^i_t, κ, φ^i) ∆t + β^{1/2}(X^i_t, κ, φ^i) ∆W^i_t,

where ∆W^i_t ∼ N(0, I_d ∆t) and the time-step ∆t, which need not be the inter-observation time, is chosen by the practitioner to balance accuracy and efficiency.
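As a minimal illustration of generating approximate draws from the intractable transition density, the following Python sketch applies the Euler-Maruyama scheme with m sub-steps per unit observation interval to a hypothetical scalar SDE (the drift, diffusion and parameter values below are placeholders for illustration, not a model from the paper):

```python
import numpy as np

def euler_maruyama(x0, alpha, beta_sqrt, kappa, phi, n, m, rng):
    """Simulate a scalar latent path on the integer grid {0, 1, ..., n}
    using m Euler-Maruyama sub-steps of size dt = 1/m per unit interval."""
    dt = 1.0 / m
    path = np.empty(n + 1)
    path[0] = xt = x0
    for j in range(1, n + 1):
        for _ in range(m):
            dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment, N(0, dt)
            xt = xt + alpha(xt, kappa, phi) * dt + beta_sqrt(xt, kappa, phi) * dW
        path[j] = xt
    return path

# Hypothetical scalar SDE: mean-reverting drift, constant diffusion
rng = np.random.default_rng(1)
path = euler_maruyama(0.0,
                      lambda x, k, p: p[0] * (p[1] - x),   # drift alpha
                      lambda x, k, p: p[2],                # diffusion beta^{1/2}
                      kappa=None, phi=(0.5, 1.0, 0.3), n=10, m=20, rng=rng)
```

Taking m ≫ 1 reduces the discretization bias at the cost of proportionally more simulation work, mirroring the accuracy/efficiency trade-off noted above.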
Hence, marginalizing out the auxiliary variables u gives the marginal parameter posterior in (8). Directly targeting the high-dimensional posterior π(κ, η, ξ, φ, u|y) with PMMH is likely to give very small acceptance rates. The structure of the SDEMEM naturally admits a Gibbs sampling strategy. We outline our novel Gibbs samplers in the next section.

Gibbs sampling and blocking strategies
The form of (10) immediately suggests a Gibbs sampler that sequentially takes draws from the full conditionals. However, we can design two types of Gibbs sampler. Our first, novel strategy is denoted ''naive Gibbs'', where the u^i are updated together with both the subject-specific and common parameters.
Naive Gibbs:
1. π(φ^i, u^i | κ, η, ξ, y^i), i = 1, . . . , M
2. π(κ, u | η, ξ, φ, y)
3. π(ξ, u | κ, η, φ, y)
4. π(η | φ)

Note that step 1 consists of a set of draws of M conditionally independent random variables, since

π(φ, u | κ, η, ξ, y) = ∏_{i=1}^M π(φ^i, u^i | κ, η, ξ, y^i).

Hence, step 1 gives a sample from π(φ, u|κ, η, ξ, y). Draws from the full conditionals in 1-3 can be obtained by using Metropolis-Hastings within Gibbs. Taking the [φ^i, u^i] block as an example, we use a proposal density of the form q(φ^i* | φ^i) g(u^i*). Effectively, samples from the full conditionals in 1-3 are obtained via draws from pseudo-marginal MH kernels. The above strategy is somewhat naive: the auxiliary variables u need only be updated once per Gibbs iteration, whereas in steps 1 to 3 of the naive Gibbs procedure the vectors u^i are simulated anew in each of the three steps (notice that g(u^i) appears in each of the first three steps). We therefore propose to update the blocks [φ^i, u^i], i = 1, . . . , M, in step 1 only, and condition on the most recent value of u in the remaining steps. We call this second, novel strategy ''blocked Gibbs''.
Blocked Gibbs:
1. π(φ^i, u^i | κ, η, ξ, y^i), i = 1, . . . , M
2. π(κ | η, ξ, φ, u, y)
3. π(ξ | κ, η, φ, u, y)
4. π(η | φ)

The aim of blocking in this way is to reduce the variance of the acceptance probability associated with steps 2 and 3, which involve the product of M estimates, as opposed to a single estimate in each constituent part of step 1. Also, notice that g(u^i) appears only in the first step. The effect of blocking in this way is explored empirically in Section 5.
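To illustrate the structure of the blocked scheme, here is a deliberately simplified, self-contained Python sketch of step 1 (the [φ^i, u^i] blocks updated by correlated-proposal MH). The function `loglik_hat` is a toy stand-in, chosen only so that the code runs: in the real algorithm it would be the particle-filter estimate of log π̂_{u^i}(y^i | κ, ξ, φ^i), and the prior π(φ^i|η) (omitted from this toy target) would also enter the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the particle-filter log-likelihood estimate:
# it depends on both the parameter and the auxiliary variables u.
def loglik_hat(phi_i, u_i):
    return -0.5 * (phi_i - 1.0) ** 2 + 0.1 * u_i.mean()

M, d_u = 3, 50                      # units, auxiliary-variable dimension
rho = 0.99                          # Crank-Nicolson correlation parameter
phi = np.zeros(M)
u = [rng.standard_normal(d_u) for _ in range(M)]
ll = np.array([loglik_hat(phi[i], u[i]) for i in range(M)])

for it in range(1000):
    # Step 1: update each [phi_i, u_i] block with one correlated MH step
    for i in range(M):
        phi_prop = phi[i] + 0.3 * rng.standard_normal()
        u_prop = rho * u[i] + np.sqrt(1 - rho ** 2) * rng.standard_normal(d_u)
        ll_prop = loglik_hat(phi_prop, u_prop)
        # symmetric proposals: accept with prob min(1, exp(ll_prop - ll_i))
        if np.log(rng.uniform()) < ll_prop - ll[i]:
            phi[i], u[i], ll[i] = phi_prop, u_prop, ll_prop
    # Steps 2-3 (common parameters) would condition on the current u,
    # reusing ll without refreshing the auxiliary variables.
    # Step 4 would draw eta | phi directly (conjugate update), omitted here.
```

The key structural point is that the auxiliary variables are refreshed only inside the per-unit blocks of step 1, while the remaining steps reuse the stored likelihood estimates.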

Estimating the likelihood
It remains to show that we can generate non-negative unbiased estimates π̂_u(y|κ, ξ, φ). This can be achieved by running a sequential Monte Carlo procedure, also known as a particle filter. The simplest approach is to use the bootstrap particle filter (Stewart and McCarty, 1992; Gordon et al., 1993; see also Künsch, 2013) that, for a single experimental unit, recursively draws from the filtering distribution π(x^i_t | y^i_{1:t}, κ, ξ, φ^i) for each t = 1, . . . , n. Here, y^i_{1:t} denotes the observations of experiment i for time-steps 1, . . . , t. Essentially, a sequence of importance sampling and resampling steps is used to propagate a weighted sample {(x^i_{t,k}, w(u^i_{t,k})), k = 1, . . . , N_i} from the filtering distribution, where N_i is the number of particles for unit i. Note that we let the weight depend explicitly on the tth component of the auxiliary variable u^i associated with experimental unit i. At time t, the particle filter uses the approximation

π(x^i_t | y^i_{1:t}, κ, ξ, φ^i) ≈ ∑_{k=1}^{N_i} w̃(u^i_{t,k}) δ(x^i_t − x^i_{t,k}),

where the w̃(u^i_{t,k}) are normalized weights and δ(·) denotes the Dirac delta. A simple importance sampling/resampling strategy follows, where particles are resampled (with replacement) in proportion to their weights and propagated via x^i_{t+1,k} = f_{t+1}(u^i_{t+1,k}), where f_{t+1}(·) is a deterministic function of u^i_{t+1,k} (as well as of the parameters and previous latent state, suppressed for simplicity) that gives an explicit connection between the particles and auxiliary variables. An example of f_t(·) is given by the Euler-Maruyama approximation,

x^i_{t,k} = x^i_{t−1,k} + α(x^i_{t−1,k}, κ, φ^i) ∆t + β^{1/2}(x^i_{t−1,k}, κ, φ^i) √∆t u^i_{t,k},

where u^i_{t,k} ∼ N(0, I_d) and ∆t is a suitably chosen time-step. In practice, unless ∆t is sufficiently small to allow an accurate Euler-Maruyama approximation, f_t(u^i_{t,k}) will describe recursive application of the numerical approximation.

Algorithm 1 Bootstrap particle filter for experimental unit i
Input: parameters κ, φ i , ξ , auxiliary variables u i , data y i and the number of particles N i .
(e) Update observed data likelihood estimate: compute

π̂_{u^i}(y^i_{1:t} | κ, ξ, φ^i) = π̂_{u^i}(y^i_{1:t−1} | κ, ξ, φ^i) × (1/N_i) ∑_{k=1}^{N_i} w(u^i_{t,k}).

Algorithm 1 provides a complete description of the bootstrap particle filter when applied to a single experimental unit. Notice, however, the addition of a non-standard and optional sorting step 2b', which becomes useful when implementing a correlated pseudo-marginal approach, as described in Section 4.3. For the resampling step we follow Deligiannidis et al. (2018), among others, and use systematic resampling (see e.g. Murray et al., 2016), which only requires simulating a single uniform random variable at each time point. It is straightforward to augment the auxiliary variable u^i to include the random variables used in the resampling step. As a by-product of the particle filter, the observed data likelihood π(y^i | κ, ξ, φ^i) can be estimated via the quantity

π̂_{u^i}(y^i | κ, ξ, φ^i) = ∏_{t=1}^n (1/N_i) ∑_{k=1}^{N_i} w(u^i_{t,k}).

Moreover, the corresponding estimator can be shown to be unbiased (Del Moral, 2004; Pitt et al., 2012). The full Gibbs sampler for generating draws from the joint posterior (10) is given by Algorithm 2. For ease of exposition, we have blocked the updates for κ and ξ, but note that the use of separate updates for these parameters is straightforward.
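A compact Python sketch of the filter just described (our own rendering, not the paper's Julia implementation) makes the role of the auxiliary variables explicit: the propagation noise `u_prop` and the single resampling uniform per time point `u_res` are supplied externally, and the optional sorting step 2b' is included. The example model at the bottom (random-walk state, Gaussian measurement error) is hypothetical:

```python
import numpy as np
from scipy.stats import norm

def systematic_resample(w, v):
    """Systematic resampling of normalized weights w, driven by a single
    uniform v; returns N ancestor indices."""
    N = len(w)
    cs = np.cumsum(w)
    cs[-1] = 1.0                          # guard against floating-point drift
    positions = (v + np.arange(N)) / N
    return np.searchsorted(cs, positions)

def bootstrap_pf(y, x0, propagate, obs_logpdf, u_prop, u_res, sort=False):
    """Bootstrap particle filter for one unit, driven entirely by the
    auxiliary variables (u_prop: n x N propagation noise, u_res: n
    resampling uniforms). Returns the log of the unbiased likelihood
    estimate."""
    n, N = u_prop.shape
    x = np.full(N, x0)
    loglik = 0.0
    for t in range(n):
        x = propagate(x, u_prop[t])       # x_{t,k} = f_t(u_prop[t,k])
        logw = obs_logpdf(y[t], x)
        c = logw.max()
        w = np.exp(logw - c)
        loglik += c + np.log(w.mean())    # running product of mean weights
        if sort:                          # step 2b' (CPMMH only):
            order = np.argsort(x)         # 1-d Euclidean sort
            x, w = x[order], w[order]
        x = x[systematic_resample(w / w.sum(), u_res[t])]
    return loglik

# Hypothetical scalar random-walk state with Gaussian measurement error
rng = np.random.default_rng(2)
n, N = 25, 100
y = rng.standard_normal(n).cumsum() * 0.1
u_prop = rng.standard_normal((n, N))
u_res = rng.uniform(size=n)
ll = bootstrap_pf(y, 0.0, lambda x, u: x + 0.1 * u,
                  lambda yt, x: norm.logpdf(yt, x, 0.3), u_prop, u_res)
```

Because every source of randomness is passed in through (u_prop, u_res), repeated calls with the same auxiliary variables return the same estimate, which is exactly the property the correlated scheme of Section 4.3 exploits.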
The precise implementation of step 4 of the Gibbs sampler is likely to be example specific, and we anticipate that a direct draw of η^(j) ∼ π(·|φ^(j)) will often be possible, for example when the components of φ are assumed to be normally distributed and η consists of the corresponding means and precisions, for which a semi-conjugate prior specification is possible; see Section 5.1.
Executing Algorithm 2 requires n ∑_{i=1}^M N_i draws from the transition density governing the SDE in (1) per iteration. In scenarios where the transition density is intractable, draws of a suitable numerical approximation are required. For example, we may use the Euler-Maruyama discretization with time step ∆t = 1/m, where m ≥ 1 is chosen to limit the associated discretization bias (and typically m ≫ 1). In this case, of the order of m n ∑_{i=1}^M N_i draws of (7) are required. As discussed by Andrieu et al. (2010), the number of particles per experimental unit, N_i, should be scaled in proportion to the number of data points n. Consequently, the use of PMMH kernels is likely to be computationally prohibitive in practice. We therefore consider the adaptation of a recently proposed correlated PMMH method for our problem.

A correlated pseudo-marginal approach
Consider again the task of sampling the full conditional π(φ^i, u^i | κ, η, ξ, y^i) associated with the ith experimental unit.
In steps 2(a-c) of Algorithm 2, a (pseudo-marginal) Metropolis-Hastings step is used whereby the auxiliary variables u^i are proposed from the associated pdf g(·) (notice we could introduce a subject-specific g_i(·), but we refrain from doing so in the interest of a lighter notation). As discussed by Deligiannidis et al. (2018) (see also Dahlin et al., 2015), the proposal kernel need not be restricted to the use of g(u^i). The correlated PMMH (CPMMH) scheme generalizes the PMMH scheme by generating a new u^i* from K(u^i* | u^i), where K(·|·) satisfies the detailed balance equation

g(u^i) K(u^i* | u^i) = g(u^i*) K(u^i | u^i*).

It is then straightforward to show that a MH scheme with proposal kernel q(φ^i* | φ^i) K(u^i* | u^i) and acceptance probability (13) satisfies detailed balance with respect to the target π(φ^i, u^i | κ, η, ξ, y^i).
We take g(u^i) as a standard Gaussian density and K(u^i* | u^i) as the kernel associated with a Crank-Nicolson proposal (Deligiannidis et al., 2018). Hence

u^i* = ρ u^i + √(1 − ρ²) ω^i,   ω^i ∼ N(0, I_d),

where I_d is the identity matrix whose dimension d is determined by the number of elements in u^i. The parameter ρ is chosen to be close to 1, to induce strong positive correlation between successive estimates of the observed data likelihood. Care must be taken here when executing Algorithm 1 in step 2(b). Upon changing φ^i and u^i, the effect of the resampling step is likely to prune out different particles, thus breaking the correlation between successive estimates of the observed data likelihood. Sorting the particles before resampling can alleviate this problem (Deligiannidis et al., 2018). We follow Choppala et al. (2016) (see also Golightly et al., 2019) by using a simple Euclidean sorting procedure which, for the case of a 1-dimensional latent state (e.g. when dim(X^i_t) = 1 for every t), entails sorting the particles from smallest to largest prior to resampling. This is step 2b' in Algorithm 1, denoted ''optional'' as it only applies to CPMMH, not PMMH.
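The Crank-Nicolson kernel is straightforward to implement; a small Python sketch (variable names are ours) illustrates both the update and the induced correlation:

```python
import numpy as np

def crank_nicolson(u, rho, rng):
    """Crank-Nicolson proposal u* = rho*u + sqrt(1 - rho^2)*w, w ~ N(0, I).
    This kernel leaves the standard Gaussian density g invariant and hence
    satisfies g(u) K(u*|u) = g(u*) K(u|u*)."""
    return rho * u + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(u.shape)

rng = np.random.default_rng(3)
u = rng.standard_normal(10_000)
u_star = crank_nicolson(u, rho=0.99, rng=rng)
corr = np.corrcoef(u, u_star)[0, 1]   # close to rho; marginals remain N(0, 1)
```

With ρ near 1, u* stays close to u, so the two likelihood estimates computed from u and u* are strongly positively correlated, which is what allows CPMMH to tolerate noisier (cheaper) estimates than PMMH.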

Tuning the number of particles for likelihood approximation
It remains to choose the number of particles N_i to be used to obtain estimates of the observed data likelihood contributions π̂_{u^i}(y^i | κ, ξ, φ^i). Note that we allow a different number of particles per experimental unit, to accommodate differing lengths of the y^i and potential model misspecification at the level of an individual unit. In the case of PMMH, a simple strategy is to fix φ^i, κ and ξ at some central posterior value (obtained from a pilot run), and choose N_i so that the variance of the log-likelihood (denoted σ²_{N_i}) is around 2 (Doucet et al., 2015; Sherlock et al., 2015). When using a CPMMH kernel, we follow Tran et al. (2016a) and Choppala et al. (2016) and choose N_i so that σ²_{N_i} ≈ 2.16²/(1 − ρ²_l), where ρ_l denotes the correlation between successive log-likelihood estimates. Hence, an initial pilot run (with the number of particles set at some conservative value) is required to determine plausible values of the parameters. This pilot run can also be used to give estimates of var(φ^i | y^i), i = 1, . . . , M, each of which can subsequently be used as the innovation variance in a Gaussian random walk proposal for φ^i.
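The tuning loop can be sketched as follows; `toy_loglik_hat` is a deliberately artificial unbiased estimator standing in for the particle-filter estimate, and the doubling schedule, `target_var` threshold and repetition count are illustrative choices rather than the paper's exact procedure:

```python
import numpy as np

def tune_particles(loglik_hat, theta, target_var=2.0, N0=8, reps=50, rng=None):
    """Double the number of particles N until the empirical variance of the
    log-likelihood estimator, at a fixed central parameter value theta,
    drops below target_var."""
    rng = rng or np.random.default_rng()
    N = N0
    while True:
        lls = [loglik_hat(theta, N, rng) for _ in range(reps)]
        v = np.var(lls)
        if v < target_var:
            return N, v
        N *= 2

# Hypothetical unbiased estimator of 1 (so log-estimates scatter around 0):
# an importance-sampling-style average of N lognormal draws.
def toy_loglik_hat(theta, N, rng):
    z = rng.standard_normal(N)
    return np.log(np.mean(np.exp(theta * z - 0.5 * theta ** 2)))

N, v = tune_particles(toy_loglik_hat, theta=1.5, rng=np.random.default_rng(4))
```

For a CPMMH kernel the same loop applies with the larger target variance discussed above, which is why far fewer particles typically suffice.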
In Sections 5.1 and 5.3 we employ the generalized adaptive Metropolis (AM) algorithm (Andrieu and Thoms, 2008) to tune the proposal distributions. Regarding the generation of proposals φ^i*, in the first step of the blocked Gibbs scheme we tune subject-specific proposal distributions, separately for each φ^i*. In addition to these M proposal distributions we also tune a proposal distribution for (κ*, ξ*). Thus, we automatically tune a total of M + 1 proposal distributions via the generalized AM algorithm. Additionally, in Sections 5.1 and 5.3 we found that the use of different proposal distributions for each φ^i* was beneficial, since the random effects for the different subjects varied around very different values.

Ornstein-Uhlenbeck SDEMEM
We consider the following Ornstein-Uhlenbeck (OU) SDEMEM:

dX^i_t = θ^i_1 (θ^i_2 − X^i_t) dt + θ^i_3 dW^i_t,   Y^i_t = X^i_t + ε^i_t,   ε^i_t ∼ N(0, σ²_ε).   (16)

Here θ^i_2 ∈ R is the stationary mean for the {X^i_t} process, θ^i_1 > 0 is a growth rate (expressing how rapidly the system reacts to perturbations) and θ^i_3 is the diffusion coefficient. The OU process is a standard toy-model in that it is completely tractable; that is, the associated SDE has a known (Gaussian) transition density, e.g. Fuchs (2013). This fact, coupled with the assumption that the Y^i_t | X^i_t are conditionally Gaussian and linear in the latent states, implies that we can apply the Kalman filter to evaluate the likelihood function exactly. Therefore, exact inference is possible for the OU SDEMEM (both maximum likelihood and Bayesian). The random effects are φ^i = (φ^i_1, φ^i_2, φ^i_3), with hyperparameters η = (µ_1, µ_2, µ_3, τ_1, τ_2, τ_3) and τ_j the precision of φ^i_j. The SDEMEM (16) has no parameters κ that are shared among subjects, and the full set of parameters that we want to infer is (µ_1, µ_2, µ_3, τ_1, τ_2, τ_3, σ_ε).
As already mentioned, we can compute the likelihood π(y|φ, σ_ε) = ∏_{i=1}^M π(y^i | φ^i, σ_ε) exactly, using a Kalman filter (see Tornøe et al., 2005 and Donnet and Samson, 2013a for descriptions pertaining to SDEMEMs). The filter can then be used in Algorithm 2; that is, we avoid the particle filter (Algorithm 1) and replace it with the Kalman filter in Algorithm 2. Results from Algorithm 2 when using this approach are denoted ''Kalman''. The transition density for the latent state is known, and therefore we do not need to use an Euler-Maruyama discretization when propagating the states forward in the particle filter. Instead we propagate the particles using the simulation scheme induced by the exact transition density:

x^i_t = θ^i_2 + (x^i_{t−1} − θ^i_2) exp(−θ^i_1 ∆) + √( (θ^i_3)² (1 − exp(−2 θ^i_1 ∆)) / (2 θ^i_1) ) u^i_t,   (17)

where ∆ is the time between consecutive observations and u^i_t ∼ N(0, 1) independently for all t and all i. Clearly, the u^i_t appearing in (17) are among the variates that we correlate when implementing CPMMH, in addition to the variates produced in the resampling steps.
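A short Python sketch of exact-transition propagation for the OU process (parameter values below are arbitrary, chosen only for illustration):

```python
import numpy as np

def ou_exact_step(x, th1, th2, th3, dt, u):
    """One draw from the exact OU transition density, driven by u ~ N(0,1):
    conditional mean th2 + (x - th2) exp(-th1 dt), conditional variance
    th3^2 (1 - exp(-2 th1 dt)) / (2 th1)."""
    mean = th2 + (x - th2) * np.exp(-th1 * dt)
    sd = np.sqrt(th3 ** 2 * (1.0 - np.exp(-2.0 * th1 * dt)) / (2.0 * th1))
    return mean + sd * u

# The chain is stationary around th2 with stationary sd th3 / sqrt(2 th1)
rng = np.random.default_rng(5)
xs, x = [], 0.0
for _ in range(2000):
    x = ou_exact_step(x, th1=0.5, th2=2.0, th3=0.4, dt=1.0,
                      u=rng.standard_normal())
    xs.append(x)
```

Because the step is an explicit deterministic function of the standard normal variate u, it fits directly into the auxiliary-variable formulation: fixing (or correlating) u fixes (or correlates) the simulated path.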
The number of particles used for each method was selected using the approach described in Section 4.4. All five methods return exact Bayesian inference; while this is obvious for ''Kalman'', we remind the reader that it holds also for the other four approaches, as these are instances of the pseudo-marginal approach. Therefore, special interest is in efficiency comparisons between the last four algorithms, ''Kalman'' being the obvious gold-standard.
The priors are semi-conjugate and we can therefore use a tractable Gibbs step to sample η in step 4 of Algorithm 2.
An extended introduction to the semi-conjugate prior, including the tractable posterior, can be found in Murphy (2007). We ran all four methods for 60k iterations, considering the first 10k iterations to be the burn-in period. We set the starting value for σ_ε at σ_ε0 = 0.2, which is far from its ground-truth value. The starting values for the random effects φ^i_j were set to their prior means. The proposal distributions were adaptively tuned using the generalized AM algorithm, and the particle filters were implemented on a single-core computer, thus no parallelization was utilized. From Table 1 we conclude that CPMMH is about 20 to 40 times more efficient than PMMH in terms of mESS/m, depending on which correlation level we use. Furthermore, ''Kalman'' is about 5140 times more efficient than PMMH. However, the latter comparison is not very interesting, since the Kalman filter can be applied only to a very restricted class of models. The marginal posteriors in Figs. 2-3 show that the various methods generate very similar posterior inference, which is reassuring. We left out the inference results from CPMMH-0999 for reasons of clarity. However, we observed that with N = 50 CPMMH-0999 produces slightly biased inference for σ_ε, due to failing to adequately mix over the auxiliary variable u, while inference for the remaining parameters is similar to the other considered methods. We verified (results not shown) that using N = 100 is enough to repair this problem. From Figs. 2-3 we can conclude that all parameters, with the possible exclusion of τ_2, are well inferred. Regarding τ_2, this is the precision for θ^i_2, the latter representing the stationary mean for an OU model. Clearly, by looking at Fig. 1, the occasional outlier in the upper part of the figure may contribute to underestimating the true precision of the stationary mean.
To check whether CPMMH is indeed necessary, we ran PMMH with 100 particles (i.e., the same number of particles as for CPMMH-099). The inference results produced by PMMH with 100 particles showed considerable mismatch (in terms of posterior output) for both the η parameters and σ_ε relative to those obtained from CPMMH-099, resulting from the extremely poor mixing of the chain.
In summary, CPMMH returns reliable inference with a much smaller number of particles than PMMH, while resulting in a procedure that is about 20 to 40 times more efficient (the 40-times figure holds if we are prepared to accept a small bias in σ_ϵ). Again, for most models exact inference based on a closed-form expression for the likelihood function is unavailable, so being able to obtain accurate inference using a computationally cheaper version of PMMH is very appealing.
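The mechanism that makes CPMMH cheap is the Crank-Nicolson update of the auxiliary variates u driving the particle filter: proposed variates are a ρ-weighted mixture of the current ones and fresh Gaussian noise, which preserves the N(0, I) distribution of u while correlating successive likelihood estimates. A minimal sketch (the interface is ours):

```python
import numpy as np

def crank_nicolson_update(u, rho, rng):
    """Propose u' = rho * u + sqrt(1 - rho^2) * xi with xi ~ N(0, I).
    rho = 0 recovers standard PMMH (independent variates); rho close to 1
    reuses almost the same variates, so successive log-likelihood
    estimates are strongly positively correlated and their difference
    has low variance."""
    xi = rng.standard_normal(np.shape(u))
    return rho * u + np.sqrt(1.0 - rho ** 2) * xi
```

Since the marginal N(0, I) law of u is preserved, the update can be accepted or rejected jointly with the parameter proposal without biasing the pseudo-marginal target.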
Notice that while for this simple case study PMMH-naive has the same mESS as PMMH, this is not the case for the case study in Section 5.2, where the blocked-Gibbs sampler produces a much larger mESS value than naive-Gibbs.

Investigating the choice of number of particles
A crucial problem when running methods based on particle filters is the selection of the number of particles N. In this section we investigate this problem by running CPMMH-099 and CPMMH-0999 with N ∈ {5, 10, 20, 50, 100} particles on 25 different (simulated) data sets. We also ran the Kalman algorithm on the same 25 data sets for comparison. In this analysis we are only interested in the quality and computational efficiency of the inference; hence, we initialized all algorithms at the ground-truth parameter values and ran each algorithm for 60k iterations, discarding the first 10k iterations as burn-in. We first estimated the Wasserstein distance between the marginal posteriors for σ_ϵ and η from the CPMMH algorithms and the corresponding Kalman-based marginal posteriors. This distance was computed via the POT package (Flamary and Courty, 2017); we do not compute the Wasserstein distance for the marginal posteriors of the random effects φ_i, since these are not of central interest to us. All Wasserstein distances are based on the last 5k samples of the corresponding chains. To obtain a performance measure that accounts for both the quality of the inference and the computational effort, we multiply the Wasserstein distances by the runtimes (in minutes) of the CPMMH algorithms, obtaining the measure Wasserstein distance × runtime (m); see Figs. 4 and 5. Smaller values of this measure are preferable, as they indicate high computational efficiency and/or accurate inference. The reason for considering this measure is to take the quality of the inference into account, since for N < 20 we noticed that it is possible to obtain chains that do not indicate adequate convergence within a reasonable time-frame.
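The performance measure can be reproduced with any 1-d Wasserstein implementation; the sketch below uses SciPy's empirical-sample version rather than the POT package used in the paper, with illustrative argument names:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wd_times_runtime(chain, reference_chain, runtime_min):
    """Wasserstein distance between the marginal posterior samples of a
    CPMMH chain and a reference (e.g. Kalman-based) chain, multiplied by
    the runtime in minutes. Smaller values indicate accurate and/or
    cheap inference."""
    return wasserstein_distance(chain, reference_chain) * runtime_min
```

For identical samples the distance (and hence the measure) is zero, while a pure location shift of size d contributes d to the distance, so the measure penalizes both posterior mismatch and long runtimes.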
We conclude that, on average, results for different correlation levels are similar. However, for σ_ϵ we obtain better performance when using more particles (a lower Wasserstein distance × runtime (m) value); this results from inaccurate inference for σ_ϵ when too few (N < 50) particles are used, leading to a large Wasserstein distance. This is not the case for η: Fig. 5 shows that performance is better with fewer particles, since inference for η is good even when using few particles (though not reported, we observed that the Wasserstein distances for η are similar across all attempted values of N). Thus, to infer the measurement noise parameter σ_ϵ accurately we need N ≥ 50 particles in this case, while inference for η is satisfactory even with fewer particles.
Another issue we analyze is the variability of mESS across the different data sets, based on 50k iterations of CPMMH. To investigate this we computed the 25th and 75th percentiles of mESS for CPMMH-099 with N = 100 and CPMMH-0999 with N = 50, based on the inference results for all unknown parameters from the 25 simulated data sets. The 25th and 75th percentiles of mESS are [227, 240] for CPMMH-099 (N = 100) and [227, 252] for CPMMH-0999 (N = 50). Given that the mESS values are computed on different data sets, some degree of variation is expected, and we conclude that the observed mESS variability is fairly small.

Tumor growth SDEMEM
We consider a stochastic differential equation mixed-effects model that has been used to describe tumor volume dynamics in mice receiving a treatment. Here we study a simplified version of the model in Picchini and Forman (2019), given by (19) for experimental units i = 1, . . . , M. Here, W_{1,t} and W_{2,t} are uncorrelated Brownian motion processes, and X^i_{1,t} and X^i_{2,t} are, respectively, the volume of surviving tumor cells and the volume of cells killed by the treatment for mouse i. Let V^i_t = X^i_{1,t} + X^i_{2,t} denote the total tumor volume at time t in mouse i. The observation model is given by (20). We complete the SDEMEM specification via the random effects distribution in (21), so that η = (µ_1, . . . , µ_4, τ_1, . . . , τ_4).
We recognize that X^i_{1,t} and X^i_{2,t} are geometric Brownian motion processes, so that (19) can be solved analytically to give (22), where logN(·, ·) denotes the log-Normal distribution. Despite the availability of a closed-form solution to the underlying SDE model, the observed data likelihood is intractable, due to the nonlinear form of (20) as a function of log(X^i_{1,t} + X^i_{2,t}). Nevertheless, a tractable approximation can be found by linearizing log V^i_t. The resulting linear noise approximation (LNA) is derived in Appendix B, and in what follows we compare inference under the gold-standard PMMH to that obtained under the LNA.
We mimicked the real-data application in Picchini and Forman (2019) by generating 21 observations at integer times for each of M = 10 replicates. We took η = (log 0.29, log 0.25, log 0.09, log 0.34, 10, 10, 10, 10) and sampled φ^i_j | η using (21). The latent SDE process was then generated using (22) with initial condition x^i_0 = (75, 75)^T (assumed known for all units), and each observation was corrupted according to (20) with σ²_e = 0.2. The resulting data traces are consistent with the total tumor volume observations of the chemotherapy-treated subjects in Picchini and Forman (2019) and can be seen in Fig. 6. We adopted semi-conjugate, independent N(−2, 1) and Ga(2, 0.2) priors for the µ_j and τ_j respectively, and took log σ_e ∼ N(0, 1) to complete the prior specification. Given the use of synthetic data of equal length for each experimental unit, we pragmatically took the number of particles as N_i = N, i = 1, . . . , 10. Our choice of N was guided by the tuning advice of Section 4.4. For example, with CPMMH we obtain typical ρ_L values of around 0.75 when parameter values are fixed at an estimate of the posterior mean. This gives σ²_N = 10.6, which is achieved with N = 7 particles. To avoid potentially sticky behavior of the chain in the posterior tails, we chose the conservative value N = 10. We compare four approaches: naive PMMH (where the u^i are updated with both the subject-specific and common parameters), PMMH (where the u^i are updated only with the subject-specific parameters; Algorithm 2), CPMMH (Algorithm 2 with a Crank-Nicolson proposal on the u^i) and the LNA-based approach. We ran each scheme for 500k iterations. The results are summarized in Table 2 and Fig. 7, the latter showing marginal posterior densities of the components of η.
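The synthetic-data generation can be sketched as follows, exploiting the fact that each latent component is a geometric Brownian motion with an exact transition. The parameter names (growth, kill, s1, s2) are illustrative stand-ins for the drift and diffusion parameters of (19), not the paper's notation:

```python
import numpy as np

def simulate_unit(x0, growth, kill, s1, s2, sig_eps, times, rng):
    """Generate noisy log total-volume observations for one unit:
    two independent GBMs (surviving and killed cell volumes), observed
    through log(X1 + X2) plus N(0, sig_eps^2) noise (cf. (20))."""
    x1, x2 = x0
    y, t_prev = [], 0.0
    for t in times:
        dt = t - t_prev
        # exact GBM transition: X_{t+dt} = X_t exp((mu - s^2/2) dt + s dW)
        x1 *= np.exp((growth - 0.5 * s1 ** 2) * dt
                     + s1 * np.sqrt(dt) * rng.standard_normal())
        x2 *= np.exp((kill - 0.5 * s2 ** 2) * dt
                     + s2 * np.sqrt(dt) * rng.standard_normal())
        y.append(np.log(x1 + x2) + sig_eps * rng.standard_normal())
        t_prev = t
    return np.array(y)
```

Calling this once per unit, with random effects drawn from their population distribution, reproduces the structure (though not the exact parameterization) of the synthetic data described above.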
We see that inferences for these parameters are consistent with the true values that generated the data (with similar results obtained for the other parameters), and that inference via CPMMH is consistent with that from the gold-standard PMMH. Similar results are obtained for σ_ϵ (not shown for brevity). At the same time, from Table 2 we note that CPMMH with ρ = 0.999 is about 11 times more efficient than the naive PMMH and almost 3 times more efficient than PMMH with additional blocking. Finally, the LNA-based approach provides an accurate alternative to PMMH, except for τ_4. Everything considered, however, CPMMH is to be preferred here: its computational efficiency is comparable to that of the LNA but, unlike the latter, CPMMH provides accurate inference for all parameters and is plug-and-play.

Use of the Euler-Maruyama approximation
We anticipate that for many applications of interest, an analytic solution of the underlying SDE will not be available. It is commonplace to use a numerical approximation in place of an intractable analytic solution, the simplest being the Euler-Maruyama (E-M) approximation. In this section we investigate the effect of the E-M approximation on the performance of PMMH and CPMMH for the tumor growth model.
The Euler-Maruyama approximation of (19) replaces the SDE with a discrete-time update over increments, where, for example, ΔX^i_{1,t} = X^i_{1,t+Δt} − X^i_{1,t} and ΔW^i_{1,t} ∼ N(0, Δt), with other terms defined similarly. To allow arbitrary accuracy of E-M, the inter-observation interval is divided into steps of size Δt = 1/L for the numerical integration, for integer L ≥ 1. We find that using L = 5 (giving 4 intermediate times between observation instants) allows sufficient accuracy (compared to the analytic solution) to permit use of the same tuning choices when re-running PMMH (including the naive scheme) and CPMMH. Our findings are summarized in Table 3.
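A generic E-M propagator with L substeps per inter-observation interval might look as follows, where drift and diff are user-supplied stand-ins for the coefficients of (19):

```python
import numpy as np

def em_propagate(x, drift, diff, dt, L, rng):
    """Euler-Maruyama propagation over an interval of length dt using
    L substeps of size dt/L; drift(x) and diff(x) return the drift
    vector and the (diagonal) diffusion coefficient, and the Gaussian
    increments dw play the role of the stored innovations."""
    h = dt / L
    for _ in range(L):
        dw = np.sqrt(h) * rng.standard_normal(np.shape(x))
        x = x + drift(x) * h + diff(x) * dw
    return x
```

In CPMMH the increments dw would be derived from the stored auxiliary variates u rather than drawn fresh, so each extra substep multiplies the number of variates that must be stored and updated.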
Unsurprisingly, inspection of Table 3 reveals that the relative performance of the three competing pseudo-marginal schemes is similar to that obtained when using the analytic solution; CPMMH provides almost an order-of-magnitude increase in mESS/m over a naive PMMH approach. We note that use of the Euler-Maruyama approximation requires computation and storage of an additional 1/Δt innovations per SDE component, inter-observation interval, particle and subject, which accounts for the increase in CPU time compared to using the analytic solution. Nevertheless, we find that our proposed approach is able to accommodate an intractable-SDE scenario and provides a worthwhile increase in performance over competing approaches.

Comparison with ODEMEM
To highlight the potential issues that arise when ignoring inherent stochasticity, we consider inference for an ordinary differential equation mixed-effects model (ODEMEM) of tumor growth. We take the SDEMEM in (19) and set the diffusion parameters γ^i and ψ^i to zero for i = 1, . . . , M. The observation model and random-effects distributions remain unchanged from (20) and (21), upon omitting log γ^i and log ψ^i from φ^i. The ODE system in (23) can be solved analytically, so that the likelihood associated with each experimental unit is available in closed form. Fitting the ODEMEM to the synthetic data set from Section 5.2 is then straightforward, via a Metropolis-within-Gibbs scheme. Figs. 8 and 9 summarize our findings. Unsurprisingly, since the ODEMEM is unable to account for intrinsic stochasticity, the observation standard deviation is massively over-estimated: Fig. 8 shows little agreement between the marginal posteriors under the ODEMEM and SDEMEM for this parameter. In terms of model fit, both the observation (Y^1_t) and latent process (X^1_t = log V^1_t) predictive distributions for unit 1 are over-concentrated under the ODEMEM. Similar results (not shown) are obtained for the other experimental units. Notably, from Fig. 9, around half of the actual simulated X_t values lie outside the 95% credible interval under the ODEMEM.

Neuronal data
Here we consider a much more challenging problem: modeling a large number of observations pertaining to neuronal data. In particular, we are interested in modeling the neuronal membrane potential across inter-spike intervals (ISIs). The problem of modeling the membrane potential from ISI measurements using SDEs has been considered numerous times, also using SDEMEMs; see Picchini et al. (2008). In fact, here we analyze the same data considered in Lansky et al. (2006) and Picchini et al. (2008), or rather a subset thereof, due to computational constraints. The ''leaky integrate-and-fire'' model is one of the most common, in both artificial neural network applications and descriptions of biological systems. Deterministic and stochastic implementations of the model are possible. In the stochastic version, under specific assumptions (Lansky, 1984), it coincides with the Ornstein-Uhlenbeck stochastic process and has been extensively investigated in the neuronal context, for instance in Ditlevsen and Lansky (2005). Consider Fig. 10 as an illustrative example, reporting values of neuronal membrane depolarization studied in Höpfner (2007). Inter-spike intervals are the stretches of observations between ''firing'' times of the neuron, the latter represented by the spikes appearing in Fig. 10 (note these are not the data we analyzed; the figure is used only for illustration). Data corresponding to the near-deterministic spikes are removed, and what is left constitutes data from several ISIs. As in Picchini et al. (2008), we consider data from different ISIs as independent. Hence, M is the number of considered ISIs. These are 312 in total; however, because of computational limitations, we only analyze a subset of 100 ISIs, hence our results are based on M = 100 and a total of 162,610 observations.
A challenge is posed by the fact that some ISIs are much longer than others (in our case they vary between 600 and 2,500 observations), meaning that longer ISIs would typically require a larger N to avoid particle depletion, while using the same large N to approximate all M likelihood terms would waste computational resources. This is where CPMMH proves particularly useful, as it allows us to keep a small N across all units while still avoiding sticky behavior in the MCMC chains. Data from the 100 ISIs are plotted on a common time-scale in Fig. 11 (after some translation so that each ISI starts approximately at zero value at time zero). These constitute the observed processes {Y_t; t ≥ 0}. Differences with the SDEMEM in Picchini et al. (2008) are that: (i) their observations were assumed unaffected by measurement noise, i.e. observations were directly available from {X^i_t; t ≥ 0}, i = 1, . . . , M, a convenient assumption that eases calculations towards exact maximum likelihood estimation, but one that can generally be argued against; (ii) in Picchini et al. (2008) the only random effect was ν^i, with the remaining parameters treated as fixed effects, while in the present case we have random effects λ^i and σ^i in addition to ν^i. Of course, here we also need to estimate σ_ϵ, which was not done in Picchini et al. (2008) since no measurement error was assumed. As in Section 5.1, the random effects are constrained to be positive, and we therefore define φ^i = (log λ^i, log ν^i, log σ^i) and η = (µ_1, µ_2, µ_3, τ_1, τ_2, τ_3), with τ_j the precision of φ^i_j. Since we have a similar setting to Section 5.1, we employ the same semi-conjugate priors with hyperparameters (µ_{0,1}, M_{0,1}, α_1, β_1) = (log(0.1), 1, 2, 1), (µ_{0,2}, M_{0,2}, α_2, β_2) = (log(1.5), 1, 2, 1), (µ_{0,3}, M_{0,3}, α_3, β_3) = (log(0.5), 1, 2, 1).
The considered data are measured with techniques ensuring high precision, and we therefore assume the prior log σ_ϵ ∼ N(−1, 1). Because of the small measurement noise, we expect a bootstrap filter to perform poorly, leading to a very noisy approximation of the likelihood π(y|φ, σ_ϵ) = ∏_{i=1}^M π(y^i|φ^i, σ_ϵ). To obtain a good approximation of the likelihood, we instead use the bridge particle filter of Golightly and Wilkinson (2011), since, as explained below, the bootstrap filter is statistically inadequate for this experiment (it is also computationally inadequate, since it would require too large a number of particles to handle within the limited memory of our computer). In Appendix A we derive the bridge filter for the model in (24), and we also compare the forward propagation of particles under the bootstrap and bridge filters. In Appendix A.2 we see that the likelihood approximation obtained from the bootstrap filter is very inaccurate, due to its inability to handle measurements with small observational noise; consequently, the number of particles required to give likelihood estimates with low variance is computationally prohibitive. Therefore, for this example we only report results based on the bridge filter (which is not a plug-and-play method). We use the following four algorithms already defined in Section 5.1: Kalman, which here is obviously the gold-standard method; PMMH using the bridge filter with N = 1 particle; CPMMH-0999 using the bridge filter, also with 1 particle; and CPMMH-09 using the bridge filter with 1 particle. We find that, due to propagating particles conditional on the next observation, a single particle was enough to give likelihood estimates with low variance. We ran all algorithms for 100k iterations, treating the first 20k iterations as burn-in.
The starting value for σ_ϵ was set far from the posterior mean obtained from a pilot run of the Kalman algorithm, and the starting values for the random effects φ^i_j were set to their prior means. For all algorithms, the proposal distributions were tuned adaptively using the generalized AM algorithm, as described in Section 4.5. We ran the algorithms on a single-core computer, so no parallelization was utilized. The posterior marginals in Figs. 12-13 show that inference results for all algorithms are very similar, except for CPMMH-0999, for which posterior samples of σ_ϵ are inconsistent with the output of the other schemes. We note that the case N = 1 corresponds to a joint update of the parameters and latent process x. Inducing strong positive correlation between successive values of u therefore results in extremely slow mixing over the latent process and, in turn, the parameters. This is particularly evident for σ_ϵ, whose update requires calculation of likelihood estimates over all experimental units. Reducing ρ to 0.9 appears to alleviate this problem. Runtimes and ESS values are given in Table 4. As expected, Kalman is the most efficient algorithm, being 19 times more efficient than PMMH in terms of ESS/min. However, here PMMH and CPMMH have the same efficiency in terms of ESS/min; thus, CPMMH does not produce any efficiency improvement for this case study. This is due to the efficiency of the bridge filter in guiding state proposals towards the next observation, which allows us to run PMMH with very few particles, making the potential improvement brought by CPMMH essentially null. We now compare our results with those in Picchini et al. (2008). Since we have assumed that the random effects φ^i = (log λ^i, log ν^i, log σ^i) are Gaussian, the (λ^i, ν^i, σ^i) are log-Normally distributed with means (λ, ν, σ) and standard deviations (σ_λ, σ_ν, σ_σ) respectively.
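The adaptive tuning can be sketched with the standard recursive moment updates used by Haario-style adaptive Metropolis, here as a stand-in for the generalized AM algorithm referenced in the text; the scaling 2.38²/d is the usual random-walk choice:

```python
import numpy as np

def am_update(mean, cov, theta, n):
    """Recursively update the empirical mean and covariance of the chain
    after observing the n-th sample theta (n >= 1). The covariance
    recursion reproduces the ddof=1 sample covariance; the proposal
    covariance is then a scaled version of cov."""
    new_mean = mean + (theta - mean) / n
    if n > 1:
        cov = (n - 2) / (n - 1) * cov + np.outer(theta - mean, theta - mean) / n
    return new_mean, cov

def proposal_cov(cov, d, eps=1e-8):
    """Random-walk proposal covariance with the classical 2.38^2/d
    scaling and a small jitter for numerical stability."""
    return (2.38 ** 2 / d) * (cov + eps * np.eye(d))
```

Feeding each accepted (or current) parameter vector through am_update and proposing from N(theta, proposal_cov(cov, d)) gives the adaptive random-walk behavior described above, up to the details of the specific generalized AM variant used in the paper.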
By plugging the posterior means for (log λ^i, log ν^i, log σ^i), as returned by ''Kalman'', into the formulas for the mean and standard deviation of a log-Normal distribution, we obtain λ = 0.036 (σ_λ = 0.009) [1/msec], ν = 0.406 (σ_ν = 0.105) [mV/msec], and σ = 0.433 (σ_σ = 0.072). In Picchini et al. (2008) we used a maximum likelihood approach, a procedure fast enough for Markovian data (there we did not assume a state-space model) to allow point estimates using all 312 ISIs (instead of 100 ISIs as here), but still too slow to permit bootstrapped confidence intervals; intervals based on asymptotic normality were therefore reported. The point estimates there were ν̂ = 0.494 and σ̂_ν = 0.072, which are similar to our Bayesian estimates. It makes sense that the inferences are not very different, since in the end our estimate of σ_ϵ is very small. We appreciate how close our posterior means based on 100 ISIs are to the maximum likelihood estimates using 312 ISIs.

Discussion
We have constructed an efficient and general inference methodology for the parameters of stochastic differential equation mixed-effects models (SDEMEMs). While SDEMEMs are a flexible class of models for ''population estimation'', their use has been limited by technical difficulties that make inference algorithms (both classical and Bayesian) computationally intensive. Our work proposed strategies to both (i) produce Bayesian inference for very general SDEMEMs, without the limitations of previous methods, and (ii) alleviate the computational requirements induced by the generality of our methods. The SDEMEMs we considered are general in the sense that the underlying SDEs can be nonlinear in the states and in the parameters; the random parameters can have any distribution (not restricted to the Gaussian family); and the observation equation does not have to be a linear combination of the latent states. We produced a Metropolis-within-Gibbs algorithm (hereafter Gibbs sampler, Algorithm 2) with carefully constructed blocking strategies, where the technically difficult approximation to the unavailable likelihood function is efficiently handled via correlated particle filters. The use of correlated particle filters brings the well-known benefit of requiring fewer particles than the particle marginal Metropolis-Hastings (PMMH) algorithm. In our experiments, the novel blocked-Gibbs sampler embedding a correlated PMMH (CPMMH) shows that it is possible to considerably reduce the number of required particles while still obtaining a value of the effective sample size (ESS) comparable to using standard PMMH within the Gibbs sampler. This means that the Gibbs sampler with embedded CPMMH is computationally efficient: on two out of three examples of increasing complexity, our algorithm was much more efficient than a similar algorithm using standard PMMH, sometimes by a factor of 40.
Some care must be taken when choosing ρ, which governs the level of correlation between successive likelihood estimates. Taking ρ ≈ 1 can result in the sampler failing to adequately mix over the auxiliary variables. We found that this problem was exacerbated when using relatively few particles (such as N = 1), but it can be overcome by reducing ρ. The fact that our approach is an instance of the pseudo-marginal methodology of Andrieu and Roberts (2009) implies that we produce exact (simulation-based) Bayesian inference for the parameters of our SDEMEMs, regardless of the number of particles used. We mostly focus on producing ''plug-and-play'' methodology (but see below for exceptions), meaning that no preliminary analytic calculations should be required to run our methods: forward simulation from the SDE simulator should be enough. What does need to be set is the number of particles N and, when correlated particle filters are used (CPMMH), the correlation parameter ρ (though this is easily set within the interval [0.90, 0.999]). Finally, the usual settings for the MCMC proposal distribution should be decided (the covariance matrix of the proposal function q(·)). However, for the neuronal data example we had to employ a bridge filter, since the observational noise is very low for this case study, causing the bootstrap filter to perform poorly. The bridge filter is not plug-and-play (as discussed below); however, in this paper we have decided to include a non-plug-and-play method to show how to analyze complex case studies with existing state-of-the-art sequential Monte Carlo filters. When considering a plug-and-play approach, our proposed methodology relies on the bootstrap particle filter, within which particles are propagated according to the SDE solution or an approximation thereof. We note that in scenarios where the observations are particularly informative (e.g.
the neuronal data case study in Section 5.3), it may be beneficial to propagate particles conditional on the observations, using a carefully chosen bridge construct. We refer the reader to Golightly et al. (2019) for details on the use of such constructs within a CPMMH scheme for SDEs. Note, however, that the constructs in Golightly et al. (2019) require the conditional distribution of the observations (i.e. (2) in our context) to be Gaussian. This is the underlying assumption exploited in Botha et al. (2020) to enable the use of bridge constructs in inference for SDEMEMs. Botha et al. (2020) also use methods based on correlated particle filters, in work proposed independently of and concurrently with ours (appearing on arXiv on July 25, 2019). See for example their ''component-wise pseudo-marginal'' (CWPM) method, which is similar to the naive Gibbs strategy we also propose, and which they found to be the best strategy among a battery of explored methods. In order to correlate the particles, Botha et al. (2020) advocate the blockwise pseudo-marginal strategy of Tran et al. (2016b): at each iteration of a CPMMH algorithm they randomly pick a unit in the set {1, . . . , M} and update the corresponding auxiliary variates only for that unit, whereas for the remaining M − 1 units they reuse the same auxiliary variates u^i employed in the last accepted likelihood approximation. This approach implies an estimated correlation between log-likelihoods of around 1 − 1/M, so that the correlation level is completely determined by the number of units. This means that for a small M (e.g. M = 5 or 10, implying correlations of 0.80 and 0.90 respectively) a blockwise pseudo-marginal strategy might not be as effective as it could be.
On the other hand, assuming a very efficient and scalable implementation allowing measurements from M = 10,000 units, the blockwise pseudo-marginal approach would produce highly correlated particles, which can sometimes be detrimental: it may not allow enough variety in the auxiliary variates, ultimately producing long-term correlations in the parameter chains, as documented in Section 5.3 when using a small number of particles N. We therefore think it advantageous to use a method that allows the statistician to decide on the amount of injected correlation: even though this means one more parameter to set (ρ in our treatment), we find this decision rather straightforward, as mentioned above.
We hope this work can push forward the use of SDEMEMs in applied research: even though inference methods for SDEMEMs have been available since around 2005, theoretical and computational limitations have meant that only specific SDEMEMs could be handled efficiently, while others required ad-hoc solutions or computationally very intensive algorithms. We believe our work showcases the possibility of employing very general SDEMEMs in practical applications.
Here the bridge filter is derived for the example in Section 5.3. The analytical transition density for the X_t process in Section 5.3 is

π(x_{t+Δt} | x_t) = N( x_{t+Δt}; x_t e^{−λΔt} + (ν/λ)(1 − e^{−λΔt}), (σ²/2λ)(1 − e^{−2λΔt}) ).
The joint density of X_{t+Δt} and Y_{t+Δt}, conditional on X_t, is

(X_{t+Δt}, Y_{t+Δt})^T | X_t = x_t ∼ N( (α_0, α_0)^T, [[β_0, β_0], [β_0, β_0 + σ_ϵ²]] ),

where α_0 = x_t e^{−λΔt} + (ν/λ)(1 − e^{−λΔt}) and β_0 = (σ²/2λ)(1 − e^{−2λΔt}). The conditional distribution used as proposal distribution in the bridge filter is

π̂(x_{t+Δt} | x_t, y_{t+Δt}) = N(x_{t+Δt}; µ, Σ),   (A.1)

with µ = α_0 + β_0(β_0 + σ_ϵ²)^{−1}(y_{t+Δt} − α_0) and Σ = β_0 − β_0²(β_0 + σ_ϵ²)^{−1}. Eq. (A.1) can be used to propagate particles forward, which is a much more efficient approach than in the bootstrap filter case: the bootstrap sampler is myopic to the next observation, while (A.1) looks ahead towards y_{t+Δt}. Thus, the bridge filter is similar in structure to Algorithm 1, with the difference that the particle propagation step consists of sampling from (A.1), and the weights are given by

w̃_{t+Δt,k} = π(y_{t+Δt} | x_{t+Δt,k}, σ_ϵ²) π(x_{t+Δt,k} | x_{t,k}) / π̂(x_{t+Δt,k} | x_{t,k}, y_{t+Δt}),   w_{t+Δt,k} = w̃_{t+Δt,k} / ∑_{j=1}^N w̃_{t+Δt,j},   k = 1, . . . , N.
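The resulting propagation step can be sketched in Python, implementing the Gaussian conditioning of the one-step OU moments on the next noisy observation (argument names are ours):

```python
import numpy as np

def ou_bridge_propagate(x, y_next, lam, nu, sigma, sig_eps, dt, rng):
    """Sample x_{t+dt} from the bridge proposal N(mu, Sigma) of (A.1):
    the OU one-step mean a0 and variance b0 are conditioned on the next
    observation y_{t+dt}, so particles are guided towards the data."""
    a0 = x * np.exp(-lam * dt) + (nu / lam) * (1.0 - np.exp(-lam * dt))
    b0 = sigma ** 2 / (2.0 * lam) * (1.0 - np.exp(-2.0 * lam * dt))
    gain = b0 / (b0 + sig_eps ** 2)
    mu = a0 + gain * (y_next - a0)
    var = b0 - gain * b0
    return mu + np.sqrt(var) * rng.standard_normal(np.shape(x))
```

As sig_eps → 0 the proposal collapses onto y_{t+dt}, which is consistent with a single particle sufficing in Section 5.3.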

A.2. Comparing the bootstrap filter and the bridge particle filter
To compare the performance of the bootstrap and the bridge filter, we run both filters with the same number of particles (500 particles for each subject) using the 100 ISIs of neuronal data from Section 5.3. Parameters are set at the posterior means obtained from the Kalman algorithm. The comparison is interesting since it illustrates the well-known issue of running particle filters when the observational error is small (here σ_ϵ ≈ 0.001), whence the bootstrap filter is expected to produce sub-optimal results. This is due to its inability to ''target'' the next observation, producing very small weights because of the small σ_ϵ. In Fig. A.14 we compare the forward propagation of the particles for one ISI chosen at random. It is evident that the bridge filter follows the data more closely. Furthermore, we run each filter independently 100 times and compare the average log-likelihood values, the standard deviation of the 100 log-likelihood estimates, and the runtimes; see Table A.5. The superiority of the bridge filter is evident, as it returns an average log-likelihood value very close to that provided by the Kalman filter. In particular, notice how unreliable the bootstrap filter's log-likelihood estimates are (due to the small observation error).
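The variance comparison can be reproduced with a minimal bootstrap filter for an OU state-space model; repeated calls with different seeds give the spread of the log-likelihood estimator. Model and argument names are illustrative, and x_0 is pragmatically fixed at the first observation:

```python
import numpy as np

def bootstrap_loglik(y, lam, nu, sigma, sig_eps, dt, N, rng):
    """Bootstrap particle filter estimate of the log-likelihood for an
    OU latent process observed with N(0, sig_eps^2) noise: particles are
    propagated blindly via the exact OU transition, weighted by the
    Gaussian observation density, then resampled."""
    x = np.full(N, y[0])
    ll = 0.0
    mean_fac = np.exp(-lam * dt)
    var = sigma ** 2 / (2.0 * lam) * (1.0 - np.exp(-2.0 * lam * dt))
    for obs in y[1:]:
        x = x * mean_fac + (nu / lam) * (1.0 - mean_fac) \
            + np.sqrt(var) * rng.standard_normal(N)
        w = np.exp(-0.5 * (obs - x) ** 2 / sig_eps ** 2)
        w /= np.sqrt(2.0 * np.pi) * sig_eps
        ll += np.log(np.mean(w) + 1e-300)  # guard against total weight depletion
        if w.sum() > 0:
            x = rng.choice(x, size=N, p=w / w.sum())
    return ll
```

With very small sig_eps almost all weights underflow, which is exactly the failure mode documented above for the bootstrap filter.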
We now compare the inference results for CPMMH when using the bridge filter and the bootstrap filter. We ran four algorithms: Kalman; PMMH with N = 1 particle using the bridge filter; CPMMH-09 with N = 1 particle using the bridge filter; and CPMMH-099 with N = 100 particles using the bootstrap filter. We ran Kalman, PMMH and CPMMH-09 for 100k iterations, and CPMMH-099 for only 35k iterations, as this case is computationally more intensive. In Fig. A.15 we see that under the bootstrap-filter-driven inference scheme, the σ_ϵ chain fails to adequately explore regions of high posterior density. We emphasize that this is due to using too few particles (N = 100): it is clear from Table A.5 that the number of particles required to match the efficiency of the bridge filter is computationally infeasible. Marginal posteriors for the remaining parameters (not shown) are, however, similar across all algorithms. The reason why the population parameters η appear unaffected by these issues, unlike σ_ϵ, is that step 4 of the Gibbs algorithms in Section 4.1 (both the naive and blocked versions) does not depend on the approximated likelihood, whereas step 2 (which samples σ_ϵ) does.

Appendix B. Tumor growth -linear noise approximation
The linear noise approximation (LNA) can be derived in a number of more or less formal ways. We present a brief informal derivation here and refer the reader to Fearnhead et al. (2014) and the references therein for further details. We remark that the LNA is not a necessary feature of our general plug-and-play methodology outlined in Section 4 and Algorithm 2.

B.1. Setup
Consider the tumor growth model in (19), (20) and (21), and a single experimental unit, so that the superscript i can be dropped from the notation. To obtain a tractable observed data likelihood, we construct the linear noise approximation of log V_t = log(X_{1,t} + X_{2,t}).
We apply the linear noise approximation (LNA) by partitioning Z_t as Z_t = m_t + R_t, where m_t is a deterministic process satisfying

dm_t/dt = α(m_t, φ),   (B.1)

and {R_t, t ≥ 0} is a residual stochastic process satisfying

dR_t = {α(Z_t, φ) − α(m_t, φ)} dt + √β(Z_t, φ) dW_t.

By Taylor expanding α and β about the deterministic process m_t, retaining the first two terms in the expansion of α and the first term in the expansion of β, we obtain an approximate residual stochastic process {R̂_t, t ≥ 0} satisfying

dR̂_t = J_t R̂_t dt + √β(m_t, φ) dW_t,

where J_t is the Jacobian matrix with (i, j)th element (J_t)_{i,j} = ∂α_i(m_t, φ)/∂m_{j,t}. Assuming initial values m_0 = z_0 and R̂_0 = 0, the approximating distribution of Z_t is given by

Z_t | φ ∼ N(m_t, H_t),

where m_t satisfies (B.1) and, after several calculations which we omit for brevity, H_t is the solution to

dH_t/dt = H_t J_t^T + β(m_t, φ) + J_t H_t.
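When the moment ODEs admit no closed form, they can be integrated numerically; a simple Euler scheme for dm/dt = α(m) and dH/dt = HJ^T + β(m) + JH, with H_0 = 0, might look as follows (alpha, jac and beta are user-supplied stand-ins for the drift, its Jacobian and the diffusion matrix):

```python
import numpy as np

def lna_moments(m0, alpha, jac, beta, t_end, steps):
    """Euler integration of the LNA moment ODEs, returning the mean m_t
    and covariance H_t of the approximating Gaussian for Z_t at t_end."""
    h = t_end / steps
    m = np.asarray(m0, dtype=float)
    H = np.zeros((m.size, m.size))
    for _ in range(steps):
        J = jac(m)
        H = H + h * (H @ J.T + beta(m) + J @ H)
        m = m + h * alpha(m)
    return m, H
```

As a sanity check, for a scalar linear drift α(m) = −am with constant β = b², the scheme converges to the familiar OU moments m_0 e^{−at} and b²(1 − e^{−2at})/(2a).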
(b) One-step forecast. Using the observation equation, we have that Y_{i+1} | y_{1:i} ∼ N( P^T m_{i+1}, P^T H_{i+1} P + σ_e² ).
(c) Posterior at i + 1. Combining the distributions in (a) and (b) gives the joint distribution of Z_{i+1} and Y_{i+1} (conditional on y_{1:i} and φ) as a multivariate normal distribution. Inference for the SDEMEM defined by (19), (20) and (21) may be performed via a Gibbs sampler that draws from the following full conditionals