Gaussian process approximations for fast inference from infectious disease data

We present a ﬂexible framework for deriving and quantifying the accuracy of Gaussian process approximations to non-linear stochastic individual-based models of epidemics. We develop this for the SIR and SEIR models, and show how it can be used to perform quick maximum likelihood inference for the underlying parameters given population estimates of the number of infecteds or cases at given time points. We also show how the unobserved processes can be inferred at the same time as the underlying parameters.


Introduction
Analysing data in real time allows us to learn about diseases, to estimate key parameters to understand disease dynamics, and to evaluate interventions. Often we have imperfect, incomplete observations that we would like to analyse quickly so our results can be useful to public health authorities. Many methods exist to fit to data the non-linear epidemic models that are used in policy (O'Neill, 2010), depending on the model and data involved. Here we are interested in the case of daily or weekly data on the number of infecteds or cumulative incidence, where there is no tractable closed-form likelihood and where computationally intensive methods such as multiple imputation are too slow.
There has been an explosion in methods to deal with the intractability of likelihoods in such cases in recent years. Examples include iterative filtering (Ionides et al., 2006), Approximate Bayesian Computation based on Sequential Monte Carlo (ABC-SMC) (Toni et al., 2009), particle-Markov chain Monte Carlo (PMCMC) (Andrieu et al., 2010), SMC 2 (Chopin et al., 2013), simulation-based inference (McKinley et al., 2014), forward simulation MCMC (Neal and Terry Huang, 2015), and many others. These approaches are, however, typically computationally intensive and can be difficult to implement.
On the other hand, fitting deterministic ordinary differential equation (ODE) models for epidemics directly to data (for example by least-squares fitting) is often an ill-posed inverse problem; it is widely accepted that stochastic effects need to be included if inference and prediction are to be reliable (Leander et al., 2014;House, 2015;King et al., 2015). One low-dimensional approach to stochasticity involves working in the diffusion limit (Dargatz, 2010;Guy et al., 2015;Leite and Williams, 2017), however this can involve additional technical and computational problems due to the analytical intractability of the diffusion limit itself.
We therefore consider here a Gaussian process approximation approach to speed-up real time analysis of disease data when other methods are too complex. This simultaneously accounts for stochasticity and avoids the problems of direct fitting of ODE models. This approximation is applied initially to the stochastic SIR (susceptible-infectious-removed) model. We also consider the SEIR (susceptible-exposed-infectious-removed) model, although it is easy to use this approximation scheme with a more complex compartmental model or even models used outside epidemiology. We stress that our approach involves an additional approximation step beyond the derivation of a stochastic differential equation (SDE) limit, that must be controlled and which is the main technical component of our work.
Other authors have made use of Gaussian approximations in epidemic inference. Ross et al. (2006) considered parameter estimation for the SIS model, Fearnhead et al. (2014) and Zimmer et al. (2017) considered Gaussian approximations based on the linear noise approximation, and Ball and House (2017) considered inference for the SIR epidemic on a network using a Gaussian process approximation based on the covariance function for an approximating branching process.
In this work, we present three main developments. First, we consider a very broad class of Gaussian process approximations, including those based on stochastic moment closure (which have branching process results as a special case) and those based on a linear time-inhomogeneous SDE (which have linear noise approximation approaches as a special case). Secondly, we give a method for deriving bounds on the errors of any Gaussian process approximation through Taylor expansion in the time interval between observations, using the scheme shown in Figure 1. Finally, we go into more depth than previously on the numerical comparison of different Gaussian process approximations for parameter inference on simulated data, and on how to apply these approaches to the data on cumulative incidence typically available from epidemiological outbreak reports.

Models and notation
While our approach is general, we will focus here on epidemic dynamics, in particular the Markovian SIR model, also called the general stochastic epidemic (Bailey, 1975;Andersson and Britton, 2000). We will define the different approaches as shown in Figure 1.

Pure jump Markov chain
We will write N S (t), N I (t) and N R (t) for the random integer numbers in the population who are susceptible, infectious and removed respectively. The vector N(t) = (N S (t), N I (t), N R (t)) is then a continuous-time Markov chain with events and rates where the population size N = N S + N I + N R is a constant.

Diffusion approximation
Using the convergence results of Kurtz (1970Kurtz ( , 1971) the Markov chain defined by (1) is well approximated by the solution X(t) of the SDE 2 where (ignoring the removed individuals who do not need to be counted due to the constant population size) we let and W = W 1 W 2 where W 1 and W 2 are two independent Wiener processes. Explicitly, this approximation takes the form Note that the distribution of X(∆t)|X(0) given by (2) will not in general be Gaussian.

Deterministic approximation
We can then further apply the results of Kurtz (1970Kurtz ( , 1971 to derive the deterministic approximation where s(t) and i(t) are the numbers of susceptible and infectious individuals respectively at time t that satisfy this deterministic model.

Multivariate normal moment closure
For comparison, the multivariate normal (MVN) moment closure method is also used to obtain an approximation of the stochastic SIR model (Isham, 1991). In this method, it is assumed that the joint distribution of the susceptible and infectious populations can be approximated by a bivariate normal distribution. This gives a set of 5 ODEs for the mean and variances of the susceptible and infectious populations.
We write X(t) and Y (t) for the number of susceptible and infectious people respectively in the MVN moment closure approximation of the stochastic SIR model at time t. These follow a Gaussian process GP(µ(t), σ(t)) with mean and variance-covariance matrix that obey equations which are derived from applying the assumption of multivariate normality to moment equations derived from the Markov chain (1).

Linear stochastic process approximation
The linear stochastic process (LSP) approximation approach introduced in the SIR context by Isham (1991) (referring to a more complex model of AIDS due to Tan and Hsu (1989)) is to let the susceptible population evolve deterministically, while the infectious individuals are normally distributed. This gives the following set of three ODEs for the evolution of the deterministic susceptible population, s(t), and the mean, µ Y = E[Y (t)], and variance, σ Y Y = var(Y (t)), of the infectious population: While we mention this approach to show that various Gaussian moment closures are possible, for inferential purposes we will typically require the susceptible population to be random rather than deterministic making it unsuitable.

Linear SDE approximation
Following Archambeau et al. (2007), we note that the SDE will have a Gaussian process solution GP(m(t), C(t)) with mean and variance-covariance matrix sat- Our aim is therefore to choose the time-varying matrices in equation (9) so that it approximates (2) and hence the full stochastic system. In this context, there is one 'obvious' choice for the matrix U (t): For the matrix A(t) and the vector b(t), however, there are many choices that have the correct mean behaviour. We will consider how an existing approach in the literature (the Linear Noise model) is a special case of the linear SDE approach, and also introduce other special cases that we do not believe are named.
We write X (t), Y(t) for the number of susceptible and infectious people respectively in this approximation of the stochastic SIR model at time t. These follow a Gaussian process GP(m(t), C(t)) with mean and variance-covariance matrix var(Y(t)) . (12)

Linear noise approximation
To derive the linear noise (LN) approximation (Uhlenbeck and Ornstein, 1930;Black and McKane, 2010) we start with the SDE (2) and write 4 where the quantitiesS,Ĩ are assumed to be small in the approximation. Ignoring quadratic terms O(S 2 ,SĨ,Ĩ 2 ) and keeping only linear terms gives This is a special case of the linear SDE approach where

Other special cases
We consider two other special cases, which we name in an obvious way, For each of these we obtain a set of five ODEs from (10), which we can solve numerically to give an approximation of the mean and variances of the stochastic SIR model. Note that there are more special cases, similar to these, for which we can write down A(t) and b(t).

Comparison of multivariate normal moment closure and linear SDE approximations
The main difference between the MVN moment closure and linear SDE approaches is as follows. For the MVN moment closure, we have For the linear SDE, in contrast, we have So the MVN moment closure allows for the negative correlation of the susceptibles and infecteds at the beginning (and end) of the outbreak. This means that E[Y ] initially increases more slowly which allows for a random delay time before the outbreak takes off. Figure 2 shows an example of these trajectories for a specific set of parameter values. Simulating 10 4 trajectories gives a distribution at each time point to which we compare each approximating Gaussian distribution.
For each approximation we computed the mean and variance of the size of the susceptible and infectious populations forward from the current data point until the next. The mean of the approximation was then reset to the data point, and the variances to zero. Figure 2 shows an example of this for one specific set of parameter values and one specific approximation.
We compared the approximations numerically to the stochastic simulations using the Kullback-Leibler (KL) divergence, a measure of difference between two probability distributions. The Gaussian distributions of the approximations were discretised in order to compare with the discrete distribution given by the stochastic simulations. For discrete probability distributions P and Q the KL divergence is defined as A better approximation results in a smaller KL divergence (MacKay, 2003). The KL divergence was computed each time data were obtained and before the simulations and approximations were reset to the data value. Note that we could not compute the KL divergence for a comparison of the LSP with the stochastic simulations for the size of the susceptible population as in the LSP the susceptible population evolves deterministically. Additionally, on occasion the KL divergence could not be computed for other approximations because one distribution took a value very close to zero. However, as displayed in Figure 3, this does not occur very often. Figure 3 demonstrates these comparisons for three examples of epidemiological model rate parameters that we have chosen to cover a range of R 0 values. The MVN moment closure and LN approximations consistently have the smallest KL divergence in both the size of the susceptible population and the size of the infectious population ( Figure 3). Additionally, and in particular for larger population sizes, the A noise and LSP approximations also approximate well the size of the infectious population. The b noise approximation does not approximate the size of the infectious class as well, in particular at the start and end of the epidemic.
For approximating the size of the susceptible population, the A noise and b noise approximations perform adequately but not as well as the other approximations particularly at the start and end of the epidemic.
We performed the same analysis over longer time steps, ∆t, and saw similar results; the MVN moment closure and LN approximations are best, with the A noise approximation also a good approximation for the infectious population. However, the A noise approximation becomes a much less good approximation of the susceptible population over longer time steps.
The ODEs that define the approximations were solved numerically. The b noise approximation has the simplest set of ODEs and so is fastest to solve. The A noise and LSP approximations are slower and, finally, the MVN moment closure and LN approximations take longest (Figure 4).
In addition, we note the ease with which the A noise and b noise approximations can be derived (in essence, just written down) in comparison to the MVN moment closure and LN approaches. This will become more pronounced for more complex compartmental models (for example, as demonstrated in section 5.2 when we apply these approximations to the SEIR model).
In conclusion, these numerical comparisons show that the A noise Gaussian process approximation can perform comparably to the MVN moment closure and LN approaches, in particular for large population sizes and for the infectious population size, while being computationally faster and more simple to derive. We expect these advantages to become much more pronounced for more complex models. 6 We now consider what progress is possible analytically.

Analytical comparisons
Each of the Gaussian models previously described is chosen on the basis of different a priori assumptions, however we would like to find a way to control the errors introduced in a systematic manner. Since we are interested in regularly-spaced observations, the relevant control parameter is the timestep ∆t. Our starting point is the diffusion approximation to the stochastic SIR model in the regime where the susceptible population is approximately constant (e.g. at N when this is close to its starting value). This is chosen to simplify calculations, although we note the work of Cauchemez and Ferguson (2008) suggests that the approximation of constant susceptible population can be made throughout the epidemic if the time period ∆t over which it is made is relatively small. In this limit, the system is described by the SDE first analysed by Feller (1951), where r = β − γ and ρ = β + γ. We will now consider how to expand this equation in ∆t.

Taylor scheme for SDEs
The origin of the methods used to derive Taylor schemes for SDEs is often attributed to Milstein (1975), who derived a scheme up to O(∆t). In fact, there are many such schemes that solve the SIR SDE locally in time; we use the weak order-3 scheme given by Kloeden and Platen (1992). This scheme has a rather complex general form, however for the specific equation (20) subject to initial condition I(0) = I 0 1 we obtain the following result following some analytical work: I(∆t) = I 0 1 + r∆t + 1 2 r 2 ∆t 2 + ρI 0 ∆t 1 + 3 2 r∆t + 7 6 r 2 ∆t 2 where U ∼ N (0, 1) is a standard normal random variable and O(∆t 3 , I 0 0 ) represents terms containing the variables ∆t n for n ≥ 3 and I m 0 for m ≥ 0. Such a stochastic Taylor expansion can be carried out for any SDE, including multi-dimensional ones, and as such we believe this method has significant promise for evaluation of Gaussian process approximations in general. While the complexity of analysis can grow, it is possible to work with such stochastic systems using computer algebra (Kendall, 2005), and we think that this would be an interesting direction for future research.

Local solution of ODEs giving the approximations
For the Gaussian process approximations that arise from linear SDEs, we consider (10) in the limit where the size of the susceptible population is approximately constant to give the following ODEs for the mean, m 2 , and variance, C 22 , of the number of infecteds, Solving these, using a standard Taylor series expansion subject to initial conditions m 2 (0) = i(0) = I 0 and C 22 (0) = 0, gives for each model m 2 (∆t) = I 0 1 + r∆t + 1 2 i.e. as we would expect the Taylor series for e r∆t . For b noise, we have C 22 (∆t) = ρI 0 1 + 1 2 r∆t + 1 6 r 2 ∆t 2 + · · · ∆t , where for all other models (A noise and LN) we have C 22 (∆t) = ρI 0 1 + 3 2 r∆t + 7 6 r 2 ∆t 2 + · · · ∆t .

Bounding the errors of the Gaussian process
Putting the results above together, we see that the MVN moment closure, linear noise and A noise are all consistent with the SDE (20) expanded as in equation (21), so we will continue to work with these approaches, while the b noise is significantly less accurate and so we do not consider it further. To see why errors at this order represent a significant improvement on other possible approaches, consider the Euler-Maruyama approximation to (20), J(∆t) = (1 + r∆t)I 0 + ρI 0 ∆tU , where U is a standard Gaussian random variable. Comparing to (21) we see that errors to this appear at O(∆t). This tells us that the ODE-based Gaussian process approximations we continue to analyse (MVN, LN and A) are more accurate than Euler-Maruyama steps by several orders of magnitude in ∆t.

Inference
We initially apply the Gaussian process approximations to synthetic data from the SIR model to demonstrate that the size of the susceptible population can be recovered from weekly measurements of the number of infecteds. Secondly, we consider real data from a norovirus outbreak with the SEIR (susceptible-exposed-infectious-removed) model to demonstrate that it is straightforward to use these approximations with more complex models. 8

Simulated data from the SIR model
Consider a disease that is well approximated by the SIR model with transmission rate constant β and recovery rate constant γ. Suppose we have data of the form of a set of times {t i } n i=0 together with associated measurements of the number of infecteds {y i } n i=0 . Consider a Gaussian process approximation to the SIR model such that given susceptible and infectious populations of size x 0 and y 0 respectively at the start of a time interval of length ∆t, at the end of that time interval the mean and variance-covariance matrix are µ(∆t; x 0 , y 0 , β, γ) and Σ(∆t; x 0 , y 0 , β, γ) respectively. If we also had measurements of the susceptible population, {x i } n i=0 , then we could write the likelihood function for the parameters of the approximating model given the data as In practice, the data on the susceptible population are not readily available and instead this information can be imputed using the marginal and marginal conditional distributions of a multivariate normal distribution. These can be explicitly computed as follows (Eaton, 1983). For random vector (x, y) with multivariate normal distribution N (µ, Σ) an observation y i has marginal probability density function and conditional on this observation x i has marginal conditional probability density function We can use these rules at each observation point to build up a likelihood from the product of terms such as (30) and also to update the mean vector and variance-covariance matrix for the Gaussian process approximation.
Synthetic data were obtained from one simulation of the stochastic SIR model using parameter values β = 2, γ = 1, and N = 1 × 10 4 with one initial infected. Using each of the MVN moment closure, LN, and A noise approximations, we were reliably able to recover the epidemiological model parameters as shown in Figure 5. The three approximation methods all gave similar results for parameter estimates (estimatesβ = 2.04 andγ = 1.01) and for the size of the susceptible population from regular data on the number of infecteds.

Cumulative incidence from a real norovirus outbreak with the SEIR model
In the previous section we discussed the case when the available data are the number of infecteds at regular time points. An alternative, and common, situation is when only illness onset times, and not recovery times are available. For an SIR model this corresponds to having measurements of the cumulative incidence N − S(t).
In this section, we consider real data from an outbreak of norovirus on a cruise ship visiting the British Isles as reported by Vivancos et al. (2010). This report gives us data on the number of new reported norovirus cases per day during this outbreak in a small, closed population, N = 1714. A single norovirus outbreak is commonly assumed to follow the SEIR framework where, after infection, individuals enter a latent state, E, that they leave at rate ω on becoming infectious (Simmons et al., 2013). The SDE for the SEIR framework, equivalent to equation (2) from the SIR framework, is given by dX = F(X) dt + V (X) dW with The deterministic approximation of the stochastic SEIR model is given by the ODEs where, as before, s(t), e(t), and i(t) are the numbers of susceptible, exposed, and infected individuals respectively at time t given by the deterministic model. To use the linear SDE Gaussian process approximation with the SEIR model we simply need again to choose the time-varying matrices in (9) so that it approximates (32). For the A noise approximation introduced in this work this will be As before, we can infer the unobserved time series E(t) and I(t) using the general conditional laws of Gaussian processes (Rasmussen and Williams, 2005), and use the marginal laws to perform maximum likelihood estimation on the parameter values β, γ, and S(0). Note that we fit S(0) instead of taking it to be N − 1 because we do not know the infection history of the population. For example, some of the population may have been previously recently exposed to norovirus and therefore not currently susceptible. We make the somewhat conservative assumption that newly diagnosed individuals are no longer susceptible but could potentially be in any of the E, I, or R states so that our data are effectively values of N − S(t) -a different approach could easily be taken within our inferential framework.
Additionally, some care must be taken because ω is poorly identifiable from this cumulative incidence data, and our attempts to fit it alongside the other three parameters produced unrealistically large estimates motivating us to fix this parameter from other data. We found that the literature gives the latent, or incubation, period of norovirus to be between 0.5 and 2 days. For example, an SEIR model fitted to an outbreak in a long-term care facility estimated the latent period of norovirus as 1.3 days (Vanderpas et al., 2009). A systematic review of the incubation period of norovirus genogroups I and II gives it as 1.2 days (95% confidence interval 1.1-1.2) (Lee et al., 2013). The CDC report that the incubation period of norovirus is between 0.5 and 2 days (Centers for Disease Control and Prevention, 2011). Finally, a large dataset of norovirus outbreaks showed the incubation period to have a mean and median of 1.4 (95% confidence interval 1.3-1.4) days. Since ω is the reciprocal of the latent period, we therefore chose ω = 2 days −1 as the largest value consistent with existing knowledge about norovirus.
Working at two significant figures or zero decimal places as appropriate, parameter estimates and 95% confidence intervals from fitting the data with the MVN moment closure approximation are: β ≈ 21 [8.4, 33], γ ≈ 1.7 [0, 3.9] days −1 , and S 0 ≈ 258 [159,357]. Parameter estimates and 95% confidence intervals from fitting the data with the LN approximation are: β ≈ 18 [8.6, 27], γ ≈ 1.1 [0, 3.1] days −1 , and S 0 ≈ 237 [137,336]. Parameter estimates and 95% confidence intervals from fitting the data with the A noise approximation are: β ≈ 23 [0.81, 44], γ ≈ 1.5 [0, 4.7] days −1 , and S 0 ≈ 241 [125,357]. These confidence intervals are truncated at zero for rate parameters. The standard error estimates for the intervals were taken from the leading diagonal of an approximate covariance matrix of the parameter estimates. The approximate covariance matrices were computed as the negative inverse of an approximation to the Hessian of the log-likelihood at the maximum likelihood estimates obtained using finite differences (from the MATLAB function mlecov()). The correlation matrices between the parameters for each approach are, respectively, Our overall picture, as would be expected when fitting a complex non-linear stochastic model to limited data, is of highly correlated parameters with relatively large marginal confidence intervals. The average infectious periods (estimated from 1/γ as 0.59, 0.91, and 0.67 days for the MVN moment closure, LN, and A noise approximations respectively) are shorter than the natural history of norovirus would indicate, which is likely to be due to control measures in place upon the ship (Vivancos et al., 2010) limiting the time period during which cases are able to infect others. Additionally, S(0) is estimated as much smaller than N , which could be due to pre-existing immunity, control measures in place on board the ship, and non-homogeneous mixing (through excursion choice and cabin location) (Vivancos et al., 2010).
Results for learning the time series of S(t), E(t) and I(t) are shown in Figure 6, which shows general agreement on mean behaviour, but differences in the uncertainty.
In the results presented so far, the full dataset has been used to estimate the parameters of the epidemic model, before the time series were estimated as the epidemic progressed. We show here, in Figure 7, the results of also estimating β, γ, and S 0 as the epidemic progressed, beginning from day nine. Note that in this instance, these methods did not work with fewer than nine days of data. These datapoint-by-datapoint estimates remain consistent over the epidemic.
We conclude that even in this common case where there is a small dataset of symptom onset times, our Gaussian process approach can be applied and gives epidemiologically reasonable answers in little computational time.

Conclusions
In this paper, we have sought to provide results that will allow Gaussian process approximation to become a more routinely used technique in infectious disease epidemiology. We have considered a wide class of Gaussian process approximations, as well as methods to bound the errors on the use of these. We have compared these using simulated data and applied them to real data. For future work, we would like to address the methodological challenges that face infectious disease epidemiologists and the public health system such as inaccuracy in case ascertainment, the requirement for analysis methods to be online in real-time and robust, and the ability to make predictions going forward under uncertainty. Errors O(∆t k )

Stochastic Taylor expansion
Errors O(∆t m/2 ) Figure 1: The overall scheme proposed here for assessment of the accuracy of a given stochastic moment closure approximation. Here we consider primarily SIR epidemics but the approach could apply to other population processes. Errors are controlled by the inverse of N , the population size, and by the time-step ∆t, where we use k and m to stand for the integer order of errors in the time-step to be determined by model analysis.   Figure 4: Running time for the sets of ordinary differential equations for each of the approximation methods with β = 2, γ = 1, and N = 1 × 10 6 . (The box denotes the median, lower quartile, and upper quartile. Whiskers extend to the maximum and minimum. When the interquartile range is narrow, as here, the box displays as simply a thick line.)   Figure 6: Inference of the susceptible (left), exposed (centre), and infectious (right) population using the multivariate normal moment closure approximation (top), the linear noise approximation (middle), and the A noise approximation (bottom) from data of the number of new cases of norovirus per day on a cruise ship. The black diamonds (left) are our known values which we obtain from the data reported by Vivancos et al. (2010), as described in the text. The dark lines are the mean and light lines are the mean plus/minus one standard deviation.  Figure 7: Estimates of the model parameters β, γ, and S 0 as the epidemic progresses for the multivariate normal moment closure approximation (blue), the linear noise approximation (red), and the A noise approximation (yellow). The dots are maximum likelihood estimates and the bars are 95% confidence intervals.