Correlations between stochastic epidemics in two interacting populations

It is increasingly apparent that heterogeneity in the interaction between individuals plays an important role in the dynamics, persistence, evolution and control of infectious diseases. In epidemic modelling two main forms of heterogeneity are commonly considered: spatial heterogeneity due to the segregation of populations and het- erogeneity in risk at the same location. The transition from random-mixing to heterogeneous-mixing models is made by incorporating the interaction, or coupling, within and between subpopulations. However, such cou- plings are di ﬃ cult to measure explicitly; instead, their action through the correlations between subpopulations is often all that can be observed. Here, using moment-closure methodology supported by stochastic simulation, we investigate how the coupling and resulting correlation are related. We focus on the simplest case of interactions, two identical coupled populations, and show that for a wide range of parameters the correlation between the prevalence of infection takes a relatively simple form. In particular, the correlation can be approximated by a logistic function of the between population coupling, with the free parameter determined analytically from the epidemiological parameters. These results suggest that detailed case-reporting data alone may be su ﬃ cient to infer the strength of between population interaction and hence lead to more accurate mathematical descriptions of infectious disease behaviour.


Introduction
The incorporation of heterogeneity is an increasingly important feature of epidemiological models, with spatial-structure (Grenfell and Bolker, 1998;Xia et al., 2004;Viboud et al., 2006) and risk-structure (Schenzle, 1984;Keeling and Grenfell, 1997;Keeling and White, 2010) most prominently considered. This more realistic heterogeneous structure has marked influences on many properties including: invasion dynamics, leading to travelling waves in spatial systems (Viboud et al., 2006;Diekmann, 1978;Grenfell et al., 2002) and aggregation in certain groups for risk-structured populations (Schenzle, 1984); endemic behaviour, breaking the simple relationships between proportion susceptible and the basic reproductive ratio that hold for simpler models (Keeling and Rohani, 2008); persistence, generally acting to increase the persistence within stochastic populations (Keeling, 2000c;Hagenaars et al., 2004); and control, leading to targeted interventions (Keeling and White, 2010;Christley et al., 2005;Wallinga et al., 2010). Therefore, structuring the population has profound and wide-reaching implications.
A common modelling paradigm that captures multiple forms of heterogeneity in epidemic models is the metapopulation-type model (Gilpin and Hanski, 1991;Hanski, 1998;Hanski and Gaggiotti, 2004;. In such models, the population is divided into multiple interacting, or 'coupled', subpopulations, where within-population interactions, and hence transmission, typically occur at a much higher rate than between-population interactions. Although metapopulation models are often considered as a form of spatial model, their compartmental structure can equally well apply to age or risk structured mixing, or to multiple host species. In the case of two coupled subpopulations, which we focus on in this work, these could refer to two distinct communities, different risk groups (e.g. high and low risk) or different age groups (e.g. adults and children).
In general, the difference between the metapopulation framework and the standard homogeneous mixing models is the way that transmission, or interaction, between the subpopulations is incorporated. The interaction between subpopulations is often represented as a matrix of transmission rates within and between populations, which has clear links to the number of cases generated by each group and hence to the basic reproductive ratio, R 0 , through the dominant eigenvalue (Diekmann et al., 1990;Heesterbeek, 2002). When dealing with M populations, this transmission matrix has M 2 terms, which creates unidentifiability problems when attempting to estimate parameters from endemic equilibria, as we only have M pieces of information (Grenfell and Anderson, 1985).
For spatial models, there have been some efforts to construct mechanistic models of between-population interaction (Belik et al., 2011;Sattenspiel and Dietz, 1995;Keeling and Rohani, 2002). Other approaches estimate the coupling using a generalised gravity transmission model in combination with data, for example: commuter mobility data (Viboud et al., 2006;Balcan et al., 2009), mobile phone data (used as a proxy for human mobility) (Tizzoni et al., 2014) or spatiotemporal time series of disease incidence (Xia et al., 2004), where coupling parameters are estimated so that simulated epidemic sizes, fade-out lengths and spatial synchrony patterns replicate observed ones. However, the rules governing between-population interactions are complex and good data on relevant movements between populations is rare, especially in developing countries where epidemiological models are likely to be applied. Moreover, even with access to good data it is far from clear how a given pattern of movements should translate into a single phenomenological transmission parameter for the subpopulations concerned. Radiation models have been proposed more recently as an alternative to gravity models and only require spatial distribution of population to estimate coupling (Simini et al., 2012;Masucci et al., 2013), although this model sometimes fail to describe human mobility at smaller scales (Masucci et al., 2013;Yang et al., 2014). For age-structured models, the transmission matrix is often based upon diary-based records of interactions (Mossong et al., 2008;Danon et al., 2013;Read et al., 2014), but the same issues regarding translation of social interactions into a transmission rate between age-groups applies.
However, for a stochastic system, the − M M ( 1) 1 2 correlations between the levels of infection in the subpopulations may help to mitigate this unidentifiability, especially if the transmission matrix can be assumed to have some form of symmetry. Such long-term data on disease incidence is more widely available (Olsen and Schaffer, 1990;Grenfell and Harwood, 1997). From this data we can estimate the correlation between epidemics in distinct subpopulations and, given a relationship between the coupling strength and the correlation, we can then infer possible movement parameters. This extra information that is accrued from the correlations between subpopulations is a substantive motivating element for this study.
Due to the nonlinearity of the processes invoked by transmission, exact analysis of stochastic epidemics is often mathematically intractable. Computer simulation of the epidemic process, whilst commonly used and clearly useful, provides only a limited insight into the dynamics and can be computationally expensive. Approximation methods circumvent these issues and allow us to derive analytic results and develop intuition about the expected behaviour and variability of the epidemic process. One widely used method is the moment closure approximation, whereby differential equations for the means, variances and covariances can be derived either from first principles or from the Kolmogorov forward equations (Keeling, 2000a,b;Lloyd, 2004;Keeling and Ross, 2008). The most commonly used moment closure approximation, and the one used throughout this paper, assumes that thirdorder cumulants and higher are equal to zero; this is equivalent to assuming the distribution of states follows a multivariate normal distribution (Whittle, 1957). Alternative approximations have been proposed based on different distributional assumptions: the multiplicative moment closure approximation of Keeling (2000a) assumes a multivariate log-normal distribution; (Nåsell, 2003) assumes a binomial distribution; while (Krishnarajah et al., 2005) assumes a beta-binomial distribution.
In this paper we derive an approximation for the correlation between the level of infection in two distinct subpopulations as a function of the relative transmission rates, or the coupling, between them; this improves upon the results of Keeling and Rohani (2002) and corrects an error in the parametrisation of the original approximation. Using a multivariate normal moment closure approximation we derive this approximation for the simple case of two identical subpopulations and provide conditions under which we expect this result to hold. We also numerically evaluate our model and compare our analytic approximation to stochastic simulations of the epidemic process.

A simple stochastic epidemic model
We begin by introducing the notation for a simple stochastic SIR model, with births, deaths, transmission and recovery. At any time t ∈ [0, ∞), individuals are in one of three states: susceptible, infected or recovered. A given susceptible individual meets other individuals at rate k > 0. We assume that these encounters are sufficiently close that if the other individual is infected, then transmission of infection occurs with probability τ and the susceptible individual immediately becomes infected and infectious to others. Reverting to standard epidemiological modelling notation, we let the transmission rate be β = kτ. Susceptible individuals can also succumb to infection independent of contact with infected individuals in the populations; this occurs at rate ϵ > 0, the external import rate. Infected individuals recover from infection at rate γ > 0, after which they become immune to further infection. Susceptible, infected and recovered individuals all die at rate μ > 0, independent of infection status; we assume that a death is immediately followed by the birth of a susceptible individual, and hence the total population size remains constant. The basic reproductive ratio, R 0 , for this process is R 0 = β/(γ + μ). Let S(t), I(t), R(t) ∈ {0, 1, 2, … } denote the number of susceptible, infected and recovered individuals, respectively, at time t ≥ 0. If we take the (constant) population size to be N then we can reduce the dimensionality of the system by setting Equivalently, we can write down the Kolmogorov forward equation, also known as the ensemble or master equation, for this process. Let p t (s, i) denote the probability that there are s susceptible individuals and i infectious individuals in the population at time t; then the Kolmogorov forward equations are given by This formulation is extremely useful for describing the complete nature of a given stochastic system, but for large population sizes the method is impractical as the number of ordinary differential equations (ODEs) grows quadratically with the population size. However, such equations have the advantage that they can be mechanistically used, together with the closure assumption, to derive equations for means, variances and other moments of interest.

A stochastic epidemic model for coupled populations
Now consider a pair of identical populations of size N. We assume the populations are the same size for analytic tractability; we discuss the relaxation of this assumption in Sections 3 and 4. Furthermore, we assume that both populations exhibit the same population dynamics as the simple stochastic epidemic model described in Section 2.1; however, we now assume that a proportion σ ∈ [0, 1] of an individual's contacts are with individuals in the other population. In this way, σ describes the interaction, or 'coupling', between the two populations, and the force of infection in each population depends on the number of infected individuals in both populations. Changing σ does not change the basic reproductive ratio in this model, but simply determines the distribution of secondary cases between the two subpopulations.
We now let S j (t), I j (t), R j (t) ∈ {0, 1, 2, … } denote the number of susceptible, infected and recovered individuals, respectively, in population j = 1, 2 at time t ≥ 0; and again insist that population sizes The transition rates for the resulting four-dimensional Markov chain from state (s 1 , i 1 , s 2 , i 2 ) at time t are summarised in Table 1. Again the Kolmogorov forward equations for this process can be formulated; these are shown in the Supplementary Information.

Moment closure approximations in the coupled stochastic epidemic model
An exact analysis of the coupled stochastic epidemic model is mathematically intractable. Instead we consider the approximate behaviour of the firstand second-order central moments of the process. For the coupled stochastic epidemic model there are eight distinct firstand second-order central moments, five of which are 'within-population' and three of which are 'between-population'. Since the two populations are identical, there are symmetries within the system that can be exploited: for example, and Var(S 1 ) = Var(S 2 ), and similarly for other central moments. We denote the within-population central moments by and denote the between-population central moments by Using the Kolmogorov forward equation, we can write down an ODE for each of the eight firstand second-order central moments, details of this method can be found in existing literature on moment closure approximations in epidemiological modelling Keeling, 2000b;Lloyd, 2004;Keeling et al., 2000); or the differential equation for can be calculated from first principles using: rate of event change in due to event. events  Due to the non-linearity of the infection term in the model, the ODE for an nth-order moment will depend on one or more (n + 1)th-order moments. To fully define the system of ODEs we would therefore have to write down an infinite set of equations. To circumvent this problem we use a moment closure approximation, which truncates this set of equations at some order. Here, we make a second-order moment closure approximation, which assumes that third-and higher-order cumulants are equal to zero. In this way, third-order moments can be written in terms of the mean and covariance. This is equivalent to assuming that the random variable has a multivariate normal (MVN) distribution (Whittle, 1957) and so we refer to this approximation as a second-order MVN moment closure approximation. The resulting set of eight ODEs and their derivation can be found in the Supplementary Information. Note that a first-order moment closure approximation assumes that second-and higher-order cumulants are equal to zero; this approximation returns the standard set of ODEs for the SIR-model, which describe the stochastic process in the large-population limit.

Results
We derive a theoretical approximation for the correlation at endemic equilibrium between the number of infected individuals in population 1 and the number of infected individuals in population 2 as a function of the coupling, σ. We define the correlation between the number of infected individuals in each population at endemic equilibrium as: which, in the case of two identical populations where the variances are equal and using our earlier notation, simplifies to: where X * denotes the quantity X at endemic equilibrium.

Theoretical result
For two identical populations we find that we can write the correlation as a sigmoidal function plus a correction term that is often relatively small: To derive this result, we use the moment equation for Ĉ II derived in the Supplementary Information: At equilibrium = dĈ /dt 0 II and if we divide by C N 2 */ II , then and hence we have the following approximation for the correlation that we will henceforth refer to as the MVN correlation: Moreover, if we can show that Δ ≪ 1 then we have the following simplified approximation for the correlation Table 1 A summary of the transition rates of the four-dimensional Markov chain epidemic model {(S 1 (t), I 1 (t), S 2 (t), I 2 (t)) : t ≥ 0} from state (s 1 , i 1 , s 2 , i 2 ) with birth/ death rate μ > 0, contact rate β > 0, external import rate ϵ > 0, recovery rate γ > 0 and coupling σ ∈ [0, 1].
If we relax the assumption that the two populations are of equal size, then we do not obtain this simple relationship between the coupling and correlation. The model is less amenable to analytic methods and the resulting approximation now depends on r = N 1 /N 2 and on the equilibrium values S * j and Var(I j ), j = 1, 2. To fully define the approximation we would need to estimate these values from data or through simulation. This result and full derivation are given in the Supplementary Information.
We can also derive an alternative approximate expression for ξ that is independent of S * , hence eliminating the need to find the equilibrium of the 8-dimensional ODE model. By ignoring the effects of both imports and correlations and taking the large population limit, we can find an approximation to S * , which leads to the following expression: Full details of this derivation can be found in the Supplementary Information. This parametrisation of ξ is preferable to the original (Eq. (4)) as it removes the need to estimate the number of susceptible individuals in the population at endemic equilibrium, either from data or through simulation. In addition, this alternative parametrisation provides intuition into how the epidemic parameters directly impact the correlation. We can see that as R 0 increases then the correlation also increases. Conversely, as the external import rate ϵ increases, then the correlation decreases: as ϵ increases then external infections mask the effect of the between-population infections. Given the appeal of the simpler from of Eq. (10), in the work that follows we evaluate the approximation of the correlation ρ by the sigmoidal function σ/(ξ′ + σ).
The MVN moment closure approximation holds in the large-population limit (i.e N → ∞) and assumes that the distribution of states is a multivariate normal distribution; this follows from the results of (Kurtz, 1970(Kurtz, , 1971, which show that a stochastic process can be approximated by a deterministic processes in the large population limit. Further error in approximation comes from assuming that Δ ≪ 1 and that ξ is constant and equal to ξ′. In the following section, our aim is to understand whether our approximation (Eq. (9)) and expression for ξ′ (Eq. (10)) are generic to a wider range of assumptions and parameters.
Given parameter values for our coupled stochastic epidemic model, we are able solve the underlying ODEs and hence check numerically that Δ is small and calculate ξ. The absolute error introduced into our approximation by assuming that Δ ≪ 1 is given by Δ; the error relative to the correlation ρ is given by Δ/ρ. The absolute error introduced into our approximation by assuming that ξ is constant (Eq. (10)) is [ρ − σ/ (ξ′ + σ) + Δ], or equivalently [σ/(ξ + σ) − σ/(ξ′ + σ)]; the error relative to the correlation ρ is given by [σ/(ξ + σ) − σ/(ξ′ + σ)]/ρ. In this paper we take 0.1 as a threshold for the absolute error and 0.25 as a threshold for the error relative to the correlation ρ. If the absolute or relative error exceed 0.1 or 0.25 respectively, then we say that the approximation fails.

Numerical results
To compare our analytic predictions to simulation results, we need to define a set of base parameters. In the majority of the numerical analysis we will utilise parameters for a highly-transmissible measleslike endemic disease in the UK (Anderson and May, 1992), although we note that a full model of measles requires both seasonality (Earn et al., 2000;Rohani et al., 2002;Grenfell and Bolker, 1995) and age-structure (Schenzle, 1984;Keeling and Grenfell, 1997;Bolker, 1993). We will also use parameters representing mumps (Anderson and May, 1992), rubella (Anderson and May, 1992), chickenpox (Anderson and May, 1992), whooping cough (Anderson and May, 1992), smallpox (Keeling and Rohani, 2008) and influenza (Cauchemez et al., 2004;Biggerstaff et al., 2014). For all diseases, we consider two identical populations of size N = 10 5 where μ = 5.5 × 10 −5 days −1 and ϵ = 5.5 × 10 −5 days −1 . Disease-specific parameters are given in Table 2. The numerical integration of ODEs is performed using the MATLAB ode45 solver with a relative error tolerance of 10 −5 . Fig. 1 shows the equilibrium values of the first-order central moments S * and I * and second-order central moments C * II and Ĉ * II for a measles-like endemic disease in the UK as the coupling parameter σ is varied between 0 and 1. These results are obtained by numerical integration of the 8-dimensional ODEs given in the Supplementary Information, and therefore only depend on the MVN moment closure approximation. We note that all curves broadly show a sigmoidal pattern (although S * has a minimum and I * a maximum at σ = 0.5), with S * and C * II decreasing with the coupling and I * and Ĉ * II increasing with the coupling.
For these measles-like epidemic parameters, we compare the MVN correlation ρ (Eq. (8)) and our approximation σ/(ξ′ + σ), ξ′ = 0.0625, to the results of full stochastic simulations (Fig. 2a). We simulate the stochastic process over a 200 year period using the Gillespie algorithm, with a burn-in period of 50 years, and generate 1000 realisations of the process for each value of σ. The correlation is calculated as a timeweighted Pearson correlation coefficient for 50 < t ≤ 200 years. From this comparison we draw three conclusions. Firstly, all three correlations follow a sigmoidal relationship increasing from zero for low coupling to a value close to one when the coupling is largest-although we note that values of σ > 0.5 do not match with our idealised view of a metapopulation in which within-population transmission is larger than between-population transmission. Secondly, the remarkably close agreement between ρ and the simulation results, suggest our use of the MVN moment closure approximation is justified. Finally, σ/(ξ′ + σ) is a reasonable approximation for the MVN correlation ρ as the difference between the two curves is small.
We also compare the MVN correlation ρ and our approximation σ/ (ξ′ + σ) for six other infectious diseases in the UK: mumps, rubella, chickenpox, whooping cough, smallpox and influenza (Fig. 2b). Interestingly we observe that our approximation underestimates the correlation for diseases with a high R 0 (e.g. whooping cough) and overestimates the correlation for diseases with a low R 0 (e.g. smallpox and influenza). We attribute this to the differential action of Δ and the approximation to ξ across epidemiological parameters. However, across all diseases the difference between ρ and our approximation is small, hence we can relate the phenomenological coupling parameter, σ, to the correlation between the number of infected individuals in two populations by ρ = σ/(ξ′ + σ).
In Fig. 3, we evaluate the two main sources of error in our approximation (Eq. (9)), introduced by assuming that Δ ≪ 1 and that ξ is constant and equal to ξ′. For measles-like parameters, Δ is small in both absolute and relative terms (green lines in Fig. 3); importantly, Δ never exceeds our chosen thresholds of 0.1 and 0.25 for the absolute and relative error respectively, and has a diminishing impact on the Table 2 Epidemiological parameters for seven infectious diseases in the UK; across all diseases we take N = 10 5 , μ = 5.5 × 10 −5 days −1 and ϵ = 5.5 × 10 −5 days −1 . We also give the value of the parameter ξ′ = ϵ/(μ(R 0 − 1)) taken in our approximation for the correlation. correlation when the coupling σ exceeds 0.065. The error introduced into the approximation by assuming ξ is constant is given by Fig. 3). Again, we observe that this error is well within our chosen thresholds of 0.1 and 0.25 for the absolute and relative error respectively. Moreover, this error is approximately one order of magnitude smaller than Δ, which tells us that for measles-like parameters the main source of error is our approximation is due to assuming that Δ ≪ 1.
Overall, these findings suggest that our simple approximation (Eq. (9)) should hold for these parameters across the entire range of coupling values.

Smaller population sizes
Whilst the MVN moment closure approximation holds in the large population limit (Kurtz, 1970(Kurtz, , 1971, we may often be interested in much smaller populations where the impact of stochasticity is more pronounced. For N = 10 2 , 10 3 , 10 4 , 10 5 we compare our approximation for the correlation to stochastic simulations. We generate 1000 realisations of each (N, σ) pair using the same method as described in Section 3.2 and calculate the mean of all firstand second-order central moments and the mean and variance of the correlation. Since ξ′ = ϵ/ (μ(R 0 − 1)) is independent of N then we take ξ′ = 0.0625 for all N.
In Fig. 4 we compare the stochastic simulations for each of N = 10 2 , 10 3 , 10 4 , 10 5 to our approximation σ/(0.0625 + σ). We find that, for a given σ, decreasing the population size leads to weaker correlations; equivalently, this means that in smaller populations stronger coupling is required to achieve the same level of correlation. This is because in smaller populations the correlation between the two populations is reduced by independent stochastic effects acting on the two populations. Despite this, we find even for N = 10 3 the correlation between the two populations is well approximated by σ/(ξ′ + σ); only at very small population sizes, N = 100, is our approximation a poor estimate of the correlation.
Although it is simpler to take ξ = ξ′, this value can also be calculated as = + − ξ N γ μ βS βS ( ( ) *)/ *, for some specific value of S * . This method requires the numerical integration of the ODEs given in the Supplementary Information; however, we find that for N ≲ 10 4.2 ≈ 16, 000 the numerical solution to the system of ODEs "blows up". Therefore, we cannot use this method for calculating ξ to parametrise our approximation in smaller populations. This phenomenon occurs since we assume that the distribution of states follows a multivariate normal distribution and at low levels of infection this leads to a significant proportion of the distribution being negative. For example, for N = 10 5 then ≈ I * 67, but for N ≲ 10 4.2 then ≲ I 10.6. Zero infectious cases should act as a boundary for the distribution, and hence as I reduces the multivariate normal assumption breaks down.

Parameter sensitivity analysis
We perform a brief parameter sensitivity analysis to understand how the correlation between the number of infected individuals in the two populations is affected by the value of the epidemiological parameters. In the first half of the analysis, we use the parameters for a measles-like disease in the UK (N = 10 5 , μ = 5.5 × 10 −5 , R 0 = 17, γ −1 = 13 and ϵ = 5.5 × 10 −5 ) and independently change the value of each of the four parameters μ, β, ϵ and γ. We show the impact of these epidemiological parameters on ξ′ = ϵ/(μ(R 0 − 1)). For each set of epidemiological parameter values we also calculate Δ and compare ξ ∼ ξ′ across all values of coupling σ to determine the range of parameter values for which our approximation holds, that is, for which the absolute and relative errors are within our chosen thresholds of 0.1 and 0.25 respectively for all coupling values.
We find that the values of each of the four key parameters have a profound impact on the correlation between the number of infected individuals in the two populations, but that our approximation holds for a wide range of realistic values (Fig. 5a). The correlation increases with the birth rate, μ, the basic reproductive ratio, R 0 = β/(γ + μ), varied by changing β, and the mean infectious period, γ −1 ; increases in the external import rate, ϵ, lead to a decrease in the correlation. The exact region in which our approximation fails is a complex trade-off between all parameters; however, for all four parameters the approximation fails (that is, either the absolute or relative error exceeds our chosen thresholds) as ′ + ξ becomes smaller; failures occur for Fig. 1. The effect of the coupling, σ, on key mean variables S I C *, *, * II and Ĉ * II for a measles-like endemic disease in the UK (N = 10 5 , μ = 5.5 × 10 −5 , R 0 = 17, γ −1 = 13 and ϵ = 5.5 × 10 −5 ), calculated from the ODEs given in the Supplementary Information. μ ≳ 5.62 × 10 −5 , β ≳ 1.35 (R 0 ≳ 17.5), ϵ ≲ 4.61 × 10 −5 and γ −1 ≳ 13. This failure mode is due to the growing importance of the correction term Δ relative to our approximation σ/(ξ′ + σ).
In the second half of the analysis, we focus on the external import rate ϵ, as this is generally the most difficult parameter to estimate. For each of the epidemiological parameter sets representing mumps, Fig. 2. Comparing the MVN correlation ρ and our approximation σ/(ξ′ + σ) to stochastic simulations, for (a) a measles-like endemic disease in the UK (N = 10 5 , μ = 5.5 × 10 −5 , R 0 = 17, γ −1 = 13 and ϵ = 5.5 × 10 −5 ; ξ′ = 0.0625) and (b) for parameters representing mumps, rubella, chickenpox, whooping cough, smallpox and influenza (parameter values are given in Table 2). We generate 1000 realisations of the process for each value of σ and calculate the correlation as a timeweighted Pearson correlation coefficient for 50 < t ≤ 200 years; error bars represent ± 2 standard deviations. rubella, chickenpox, whooping cough, smallpox and influenza, and for each value of ϵ, we show ξ′ = ϵ/(μ(R 0 − 1)). For each diseases parameter set we also determine a range of values for ϵ for which the approximation holds, that is, for which the absolute and relative error are within our chosen thresholds of 0.1 and 0.25 respectively.
We find that the external import rate ϵ has a significant impact on the correlation (Fig. 5b). For all diseases we consider, increasing the external import rate leads to a higher value of ξ′ and thus predicts a lower correlation for a given coupling strength: as the external import rate is increased, external infections mask the effect of the betweenpopulation infections. For measles, mumps, rubella, chickenpox, whooping cough and smallpox, the approximation fails (that is, either the absolute or relative error exceeds our chosen thresholds) as ϵ become smaller. We also observe that our approximation fails at lower values of ϵ in diseases with a lower R 0 : for example, our approximation for whooping cough (R 0 = 17) fails for ϵ ≲ 8.71 × 10 −5 , whereas our approximation for rubella (R 0 = 7) fails for ϵ ≲ 9.78 × 10 −6 . However, for influenza, our approximation fails as ϵ becomes larger. Further analysis (given in the Supplementary Information) shows that for influenza, ξ′ = ϵ/μ(R 0 − 1)) significantly overestimates ξ for large values of ϵ, so assuming that ξ = ξ′ leads to a large error in our approximation. However, for smallpox, rubella, chickenpox, mumps and whooping cough the difference between ξ′ and ξ is small and so we do not observe failure for large values of ϵ in the range of values that we consider; the value of ϵ would have to be unrealistically high to observe such an effect.
modelling is how to infer the coupling between subpopulations. Sufficiently rich data on relevant interactions is often lacking, especially in developing countries, and it is unclear how such data should translate into a single phenomenological coupling parameter. In light of data on disease incidence being more widely available, we derive an approximation for the correlation, ρ, between the number of infected individuals in two identical populations as a function of the coupling parameter σ, providing a one-to-one mapping between the correlation Fig. 5. Sensitivity analysis for our approximation to the epidemic parameters. The parameter ξ in our approximation is calculated as ξ′ = ϵ/(μ(R 0 − 1)); our approximation holds if both the absolute and relative error is within our chosen thresholds of 0.1 and 0.25 respectively (represented by a solid line). The approximation fails due to one or both of the absolute and relative errors exceeding our chosen thresholds (dotted and dashed lines respectively). This analysis is performed for (a) each of the four epidemiological parameters μ, β, ϵ and γ with baseline parameters μ = 5.5 × 10 −5 , R 0 = 17, γ −1 = 13 and ϵ = 5.5 × 10 −5 (shown by a circle) and (b) for the external import rate ϵ for the given diseases with baseline parameters given in Table 2 (shown by an arrow on the x-axis). N = 10 5 throughout. and the coupling.
The results presented here refine the analysis of  and correct an error in the original derivation of ξ. Our numerical results for a measles-like infection show substantial correlation for all but the weakest coupling. These findings are consistent with similar studies focussing on persistence and spatial synchronisation of measles outbreaks (Lloyd, 2004;Bolker and Grenfell, 1996), despite differences in the characterisation of the basic model. An analytic relationship between the coupling and correlation has been previously derived (Rozhnova et al., 2012) in a more general setting and yields similar numerical results: their relationship is derived through the van Kampen system-size expansion and analysis of the power spectrum. However, we believe that our results provide a significantly simpler relationship between correlations and epidemiological parameters, providing greater intuition and analytical traction. In addition, throughout we compare our analytically tractable results to solution of the moment-based ODEs (given in the Supplementary Information) and to numerical simulation, providing a deeper understanding of the parameter ranges over which the simple results hold and hence the range of applications where the methods are of use. We also differentiate between different modes of failure in our approximation between diseases with low and high basic reproductive ratios.
Our work also offers an alternative parametrisation of ξ (Eq. (10)) that depends only on the epidemiological parameters and holds in the large-population limit; however, our numerical analysis shows that this parametrisation also leads to a good qualitative approximation in populations of size N = 10 3 . This parametrisation is preferable to the original as it provides intuition and insight into how the epidemic parameters affect the correlation. In addition, it removes the need to estimate the number of susceptible individuals in the population at endemic equilibrium, either from data or through simulation. This is particularly useful in smaller populations where we find that the MVN moment closure approximation fails numerically and the solution to the ODEs 'blows up'. This type of failure is well-documented in the literature and typically attributed to large negative covariances, frequent global extinctions or when the distribution of states is bimodal (Keeling, 2000a,b;Lloyd, 2004;Krishnarajah et al., 2005;Nåsell, 1999). In the limit as N → ∞, (Nåsell, 1999) shows that the distribution of states conditioned on non-extinction is approximately normal when R 0 is greater than 1; however, this does not explain why the MVN moment closure approximation sometimes appears to hold for smaller populations, such as in our own analysis and the wider literature (Isham, 1995).
Our model is sufficiently general that it can describe multiple forms of heterogeneity in the population including spatial, age and risk heterogeneity; however, a limitation of the model it that the underlying SIR model is too simple describe the full dynamics of many diseases. For example, as noted previously a full model of measles dynamics should include both seasonality and age structure. These limitations should be addressed before using our results to infer the between-population coupling parameters. We have also shown that adding complexity reduces the analytical tractability of the model, such as with populations of unequal size; in the most general case, analysis of the model may require a computational, rather than analytical, approach. Finally, whilst data on disease incidence in each of the subpopulations is more widely available than mobility data, our results are still limited by the availability and quality of such data. In particular, our results will be affected by under-reporting of infections.
Our results provide a method by which the coupling can be estimated from the correlation between the number of infected individuals in two populations using data on disease incidence. Crucially, this allows us to estimate the coupling between subpopulations even in the absence of data on human mobility, thus circumventing one of the main limitations of metapopulation models. At present our model considers the mathematically tractable case of two identical populations at endemic equilibrium. Future research should aim to address the limitations outlined above by improving the underlying epidemic model, for example by incorporating seasonality or age structure. The current model can easily be extended to multiple identical interacting populations when the underlying graph is a symmetric graph, such as the complete graph or k-regular infinite tree graph. This holds since any adjacent populations will have identical neighbourhoods. These extensions will significantly improve the realism of the model and validate the use of the results in the inference of between-population coupling parameters.

Conclusion
A limitation of metapopulation models is how to infer the coupling between subpopulations. This paper relates the correlation between the number of infected individuals in two identical populations as a function of the coupling, providing a one-to-one mapping between the correlation and the coupling. Combined with data on disease incidence in each of the subpopulations, this result provides a method by which the between-population coupling can be estimated, even in the absence of information on the population mobility.