Approximating steady state distributions for household structured epidemic models Biology

infectious the increased transmission potential between individuals of the household when compared with two in different households. for these heterogeneities in transmission enables control measures to be more effectively planned. Ideally, pre-control data may be used to ﬁt such a household-structured model at an endemic steady state, before making dynamic forward-predictions under different proposed strategies. However, this requires the accurate calculation of the steady states for the full dynamic model. We observe that steady state SIS dynamics with household structure cannot necessarily be described by the master equation for a single household, instead requiring consideration of the full system. However, solving the full system of equations becomes increasingly computationally intensive, particu- larly for higher-dimensional models. We compare two approximations to the full system: the single household master equation; and a proposed alternative method, using the Fokker–Planck equation. Moment closure is another commonly used method, but for more complicated systems, the equations quickly become unwieldy and very difﬁcult to derive. In comparison, using the master equation for a sin- gle household is easily implementable, however it can be quite inaccurate. In this paper we compare these methods in terms of accuracy and ease of implementation. We ﬁnd that there are regions of param- eter space in which each method outperforms the other, and that these regions of parameter space can be characterised by the infection prevalence, or by the correlation between household states.


Introduction
Transmission of an infectious disease is often seen to be greater within a household than between those from different households (Kinyanjui et al., 2016). Household structured models take these population heterogeneities into account, and can be used to inform different potential control policies. In particular, they explicitly include (at least) two different forms of infection (Black et al., 2013): infection from within the same household as the infected individual and infection from outside the same household as an infected individual. The first of these typically corresponds to a higher rate of infection, but between a smaller number of contacts, while the latter often corresponds to a lower rate of infection, but affecting a much larger number of individuals (Hilton and Keeling, 2019;Ball et al., 1997).
Household-structured infectious disease models have been used extensively in the literature, particularly in the case of pandemic influenza (Fraser et al., 2011;Wu et al., 2006). Benefits of these models include better representation of the population under consideration (Frank and Neal, 2002) and the ability to incorporate different intervention strategies. This is useful as control policies are often targeted at the level of households, for both practical and structural reasons (Pellis et al., 2009). For example, the current eradication strategy for yaws (a neglected tropical disease) includes treating infected individuals and their contacts, which can be easily modelled using a household structure (Holmes et al., 2020). Similarly, there are a large number of studies investigating household-based vaccination strategies for pandemic influenza (House and Keeling, 2009;Black et al., 2013).
In spite of the widespread use of household models, there are some subtleties in their use. A natural way to fit a household model to data is to assume the system is at steady state, which enables https://doi.org/10.1016/j.jtbi.2021.110974 0022-5193/Ó 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). the fitting of a constant force of infection experienced outside the household. However, when subsequently predicting the response to potential changes in strategy, it is necessary to then link this external force of infection to infections in other households. The different approximations involved can lead to discrepancies, however, between the dynamic forward projection and the steady state. We describe in more detail below how these discrepancies arise.
Compartmental stochastic models in epidemiology can typically be expressed as a set of master equations. That is, a set of ordinary differential equations (ODEs) that describe how the probability of being in each state varies with time (Keeling and Ross, 2008). This gives a set of linear ODEs which can be written in the form _ p ¼ Ap, where A is the state transition matrix. This then has solution p t ð Þ ¼ p 0 ð Þe At , or we can calculate the steady state distribution by calculating the eigenvector corresponding to the 0 eigenvalue of A (Keeling and Ross, 2008) or by solving the system of linear equations Ap ¼ 0.
For example, consider the stochastic steady state SIA model in Dyson et al. (2017). This is a household-structured model in which each individual in a household can either be susceptible to infection, S, infected and infectious, I, or asymptomatic and not infectious, A. An individual can be infected from someone within the same household at rate b, or they can be infected from someone outside their household with external force of infection, e. Data were used to parameterise the model at steady state, allowing the effects of other households on the external force of infection to be considered implicitly, by taking e to be constant. An infectious individual can recover at rate d, or they can become asymptomatic at rate k. Finally, asymptomatic individuals can recover at rate d, or their symptoms can recur causing them to become infectious again (without further exposure to an infectious individual), which occurs at rate q. These dynamics are summarised in figure 1. Letting P N S;I;A denote the probability of a household of size N containing S; I and A susceptible, infectious and asymptomatic individuals respectively, the master equation describing the dynamics of a single household of size N for this system is given in Eq. 1.
subject to the following constraints on S; I; A and N: S P 0; I P 0; A P 0: How can this system be simulated forwards in time? It is necessary to make an assumption that links the rate of external infection to the infection within households. One approach is to assume the external force of infection, e, takes the form where a is the between-household rate of infection that would achieve the same force of infection at steady state, M is the number of households, I i ð Þ denotes the number of infectious individuals in household i, and N i ð Þ denotes the size of household i. However, if we simulate the system forward to steady state using this new between household rate of infection, a, calculated by rearranging Eq. 2 using the steady state prevalence (calculated by solving the master equation with constant e, Eq. 1), we find that the simulated steady state is lower than expected. This is shown in Fig. 2 However, the red line is calculated using a constant e, while the blue line used e given by Eq. 2. As such, this means the external force of infection is time varying (and the corresponding value of a was selected to ensure the steady states match). This suggests that the approximation taken by the master equations for the number of households in state S; I; A ð Þ breaks down when there is between-household interaction with a finite number of households. In the limit of a large number of households, the approximation is valid (Ross et al., 2010), but not in the case of a finite population size. Instead, we need a set of master equations that captures the full range of interactions between households. Unfortunately, this significantly increases the number of possible system states, and this quickly becomes impractical unless the number of households is very small (Ball and Lyne, 2001). Previous work has also looked at approximating household models. In Black et al. (2014), the authors investigate infectious disease processes on clumped population structures (e.g. households). They use a branching process approximation to investigate the start of the exponential growth phase, and then apply a diffusion approximation to investigate the variance during the early asymptotic phase of the infectious process. However, the authors do not address the question of how accurate these approximations are at a finite number of clumps (or households), but rather just consider the situation in which the number of clumps tends to infinity. In this paper we consider different approximations to a similar system (SIS dynamics, starting with households of size 2), to determine the steady state distributions of households in such a population. We investigate which approximations are more accurate in different regions of parameter space. We then extend this to larger household sizes, determine how the accuracy of approximations is affected by the population structure and in particular what happens when we have the same population size partitioned into fewer, but larger, households.
Depending on the parameter values used, the correlation between household states can vary substantially. We investigate the relationship between this correlation and the accuracy of different approximations. Finally, we return to our original question -how accurate is each approximation when used to convert a steady state force of infection into a between household rate of infection?.

Master equation
We begin by considering the simplest such system -a population of households of size 2 undergoing SIS dynamics. This can be expressed as a set of master equations describing the household states s; i ð Þ, where s denotes the number of susceptible individuals and i denotes the number of infectious individuals. For a household of size h, the set of permissible household states under this model can be defined as Similarly, the set of system states can be expressed in terms of these household states, assuming a constant population of N households each of size 2 (so the total population size is 2N). Let m denote the number of households in household state 2; 0 ð Þ (the number of households with 2 susceptible individuals) and let n denote the number of households in household state 1; 1 ð Þ (the number of households with 1 susceptible individual and 1 infectious individual). Then N À m À n denotes the number of households in household state 0; 2 ð Þ (the number of households consisting of 2 infectious individuals and no susceptible individuals), denoted by k. A system state is then given by the tuple m; n; k ð Þand the full set of permissible system states is expressed as We consider the stochastic compartmental SIS model incorporating the household structure as described above, with events and corresponding rates summarised in Table 1 Let p m;n;k t ð Þ denote the probability that the system is in a state containing m; n and k households in their respective states. We define the bivariate step operators, E a;b (Van Kampen, 2007) such that E a;b f m; n ð Þ¼f m þ a; n þ b ð Þ : Substituting in k ¼ N À m À n to eliminate k, we can use the bivariate step operator to write the master equation as @p m;n @t ¼ E 1;À1 À 1 À Á a m;n p m;n þ E À1;1 À 1 À Á b m;n p m;n þ E 0;1 À 1 À Á c m;n p m;n þ E 0;À1 À 1 À Á d m;n p m;n ; ð3Þ with the following rates: Numerically solving this system of equations at steady state gives us the probability of occupying each system state at steady state. From this, quantities of interest such as the expected number of infectious individuals can be calculated. However, due to the large state space, it is not feasible to solve this system for large population sizes.

Quasi-steady state
While the above system of equations are difficult to solve for larger populations, it can be solved numerically when dealing with small population sizes. However, due to the absorbing state at N; 0; 0 ð Þ (N households in household state 2; 0 ð Þ, so no infection in the population), every system will eventually reach a state of  no infection (Dickman and Vidigal, Jan 2002). Thus, the 0 eigenvalue corresponds to the disease-free eigenvector and we obtain a singular matrix. During this time, a system may spend a long period of time at a quasi-steady state. This quasi-steady state is the steady state of interest. To find the quasi-steady state, the system of master equations are solved conditioned on non-extinction (Mubayi et al., 2019). That is, we consider the sub-matrix formed by removing the rows and columns corresponding to the diseasefree states. The eigenvector corresponding to the smallest eigenvalue of this sub-matrix then represents the quasi-steady state distribution for this system. In this way, the system can be studied at low population sizes. As mentioned before, this system of equations is too large for more than a small number of households to find an exact steady state solution. As such, methods to approximate this system which would allow the steady state distribution to be approximated. One commonly used approximation method that works well for simple systems is moment closure (Keeling, 2000). However, for more complicated systems, the system of moment closure equations can become very difficult to derive (Kuehn, 2016), and so here we restrict ourselves to methods that can be generalised more easily.
Two different methods for approximating the steady state distributions are compared here, which we will refer to as the single household (SHH) approximation and the Fokker-Planck peak (FPP) approximation. To assess which method approximates the true steady state distribution more accurately, we calculate the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) of each approximation from the true steady state distribution. Letting Q denote the distribution obtained from the approximation and P the distribution obtained from solving the master equation, we denote the KL divergence of Q from P by D KL Q jjP ð Þ. For the purposes of calculating the KL divergence, we take the steady state distribution to be the proportion of households in each household state.

Single household (SHH) approximation
The first approximation is derived by looking at stochastic SIS dynamics in a household of size 2. This differs from the previous system as these equations only look at transitions between household states, rather than system states. On the other hand, the corresponding fully system of master equations give us a probability of finding each steady state household distribution.
To write down this system, let p n denote the probability of finding n infectious individuals, and N À n susceptible individuals in a household of size N. There are two events that can then happen: an infectious individual becomes susceptible, or a susceptible individual becomes infectious. However, unlike a standard SIS model, there are two components to the rate of infection. One component comes from inside the household, and one from outside the household. We assume the rate of infection from outside the household is proportional to the fraction of the population that is infectious. A summary of these rates can be seen in Table 2, and the relationship to the master equation of the system can be seen in Fig. 4.
Using these rates, the master equation (a system of N þ 1 differential equations describing the time-evolution of p n ) in the case n ¼ 2 can be written down as dp 0 which can be solved analytically to give the SHH approximation to the steady state.

Fokker-Planck peak approximation
We now consider another approximation to this system. Our aim is to find an alternative method that provides a good level of accuracy, without quickly becoming infeasible to derive analytically. To that end, we start by deriving the Fokker-Planck equation for this system (Gardiner, 2004), which can be thought of as a second-order Taylor expansion of the master equation. Doing so, we obtain @u @t From the Fokker-Planck equation, we can obtain a deterministic drift vector f and a diffusion matrix D. Using these, we define where e i is a unit vector in the direction x i . The FPP approximation is then obtained by solving the following system of ODES (Mendler et al., 2018): More details of this derivation can be found in Section A.1.
We note that f x ð Þ represents the deterministic drift term for this system, and corresponds to the system of equations obtained for the single household approximation. Similarly, as the number of households N ! 1, the diffusion term tends to 0, and we again obtain the single household approximation. A summary detailing how to obtain each approximation and how they each relate to each other and the master equations can be seen in Fig. 4.

Phase portraits and correlation
The behaviour around the steady state can vary depending on the parameter values used in the model. More specifically, the correlation between household states 2; 0 ð Þand 1; 1 ð Þis greater under certain parameter values. Quantifying the differences in this behaviour could identify another method for distinguishing between parameter regimes in which each approximation is more accurate.
To that end, we look at phase portraits of the Fokker-Planck approximated system under different parameter values, as this allows us to examine the steady state behaviour more closely. We plot the nullclines for this system, and the steady state (where the nullclines intersect). We also plot trajectories around the steady state, simulated using the Gillepsie Algorithm and the steady state as determined by this simulated trajectory. Finally, we calculate the steady state distribution from the master equation.
To investigate this further, we generate a set of parameter values using latin hypercube sampling (LHS) (Iman, 2014) to ensure we thoroughly search the parameter space. We then calculate the KL divergence of both the SHH and FPP methods from the true steady state under these parameter values.
The correlation between household states 2; 0 ð Þ and 1; 1 ð Þ is then calculated in two ways. The first is using the Gillespie algorithm to simulate the steady state distribution, from which we can calculate the correlation between household states 2; 0 ð Þ and 1; 1 ð Þ. We then plot this correlation against KL divergence.
Secondly we assume the number of households in each household state follows a multinomial distribution, with parameters equal to the steady state solution obtained from the FPP approximation. If our steady state solutions are p m ; p n ; 1 À p m À p n ð Þ , then Corr m; n ð Þ¼À ffiffiffiffiffiffiffiffiffiffi ffi p m p n p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 1 À p m ð Þ1 À p n ð Þ p : Similarly, we plot this correlation against KL divergence to ensure we get the same qualitative results when calculating the correlation from the simulated trajectories.

n-Person households
We extend the work above to now consider households of some arbitrary size n, rather than just size 2 as considered before. In this case, we let m i denote the number of households in household state n À i; i ð Þ . That is, the number of households with n À i susceptible individuals, and i infectious individuals. So m 0 denotes the number of households with n susceptible individuals, while m n denotes the number of households with n infectious individuals. We consider a fixed population of size N households such that P n k¼0 m k ¼ N. This gives us a population consisting of nN individuals. For a household of size n, we have n þ 1 different possible household states, and 2n state transitions. Of these 2n state transitions, n are associated with infection events, while the other n are associated with recovery events. This is demonstrated in Fig. 5.
Each of these transition rates is summarised in Table 3. The master equation and Fokker-Planck equation can be derived in the same way as before (see Section A.3) with rates i k and r k as given in Table 3. From these, the FPP and SHH approximations can be derived as before.

Simulating forward steady state master equations
In this section, we look at determining between-household rates of infection from a model that has been fitted at steady state. This can be useful as we may only have steady state data available to us (as in Dyson et al. (2017)), but will still want to simulate the dynamics forward in time after perturbing the system. We can do this by finding the between household rate of infection which, at steady state, corresponds to the distribution of household states we obtain from the steady state model. In particular, we are interested in looking at how well each of the approximations perform.
Suppose we have a household structured model (a population of size 2M consisting of M households of size 2) undergoing SIS dynamics that has been parameterised at steady state. That is, we have a constant external force of infection (in a similar manner to the SIA model described previously), e, rather than explicitly considering the interactions between different households using a between-household rate of infection, a. As the houses are now independent (e does not depend on the infection level of other households), we can solve this system exactly at steady state using the master equation for a single household. Alternatively, we can use the master equation described initially to obtain a set of ODEs describing the time evolution of the moments of the system. As the equations are linear, this will correspond precisely to the deterministic system we obtain by considering a single household (Hahl and Kremling, 2016).
The master equation can be written as follows dp 0 dt The steady state distribution p Ã is given by the eigenvector of this system (when written in matrix form) corresponding to the zero eigenvalue. We then consider the external force of infection to be a function of the infection status of the other households, tak-ing the form given in Eq. 2. Using the SHH approximation, a can be calculated and fed back into the model to obtain a new steady state. Alternatively, the FPP approximation can be used to obtain this value of a by finding the value of a that solves a 1 x ð Þ ¼ a 2 x ð Þ, where a x ð Þ is the system of equations obtained from the FPP approximation.
As before, we are interested in which regions of parameter space one method outperforms the other. To investigate this, we take values of e; b, and c, varying each parameter one at a time.
We then find the corresponding steady state distribution, and we then find the value of a corresponding to that steady state distribution (as described above). As there is not a monotonic relationship between the KL divergence and a, the existence of an a that corresponds to a value of e is not guaranteed. Instead, the value of a that minimises the KL divergence of the master equations from the steady state system corresponding to e is used.

KL divergence from ME
Here we look at the KL divergence of the three methods from the steady state distribution obtained by solving the full system of master equations (Eq. 3) as the parameters vary for a fixed population size of 250 households. A higher KL Divergence indicates the distribution is further from the 'true' master equationderived distribution.
We plot a heat map of the ratio of the two KL divergences, with the colour map centred around 1 (both methods having equal KL divergence). A KL divergence ratio greater than 1 corresponds to the SHH approximation being more accurate (the red areas of the heatmap), while a KL divergence ratio less than one corresponds to the FPP approximation being more accurate (the blue areas of the heatmap). The results of this can be seen in Fig. 6. Fig. 6 shows that the areas with lower infection and higher recovery parameter values correspond to a greater accuracy of the FPP approximation, while lower recovery and greater infection values correspond to a greater accuracy for the SHH approximation. It can also be useful to consider multiple metrics, to see if any conclusions made are consistent across these different metrics. We provide similar plots using different metrics for assessing accuracy in appendix A.4. Further analysis shows that using a moment closure approximation provides more accurate results than either the SHH and FPP approximations (results not shown). Due to the relative difficulty in deriving systems of moment closure equa-   tions, particularly for more complex systems, it is not considered here.

Phase portraits and correlation
It was observed that the FPP approximation works well in low infection parameter regimes. Now we want to determine whether there are any qualitative differences that explain these differences. To do this, we look at the phase portraits of the system under lowinfection and high-infection parameter regimes. Each subplot in Fig. 7 shows three different points plotted, as well as each phase portrait. These are the steady state from the FPP approximation (light blue dot), the true mean steady state obtained by solving the full master equations (red dot), and the steady state obtained by averaging the points on the simulated trajectory from the Gillespie algorithm (black dot). As shown in Fig. 7, the approximation is much closer to the true solution than that obtained from the Gillespie algorithm. Fig. 7 shows a stark contrast between the behaviour around the steady state under the two different parameter regimes. The two household state (2,0) and (1,1) do not appear to have a strong correlation under the high infection regime (top row), while there is a much stronger correlation between them under a low infection parameter regime (bottom row). Specifically, as the proportion of households in state (2,0) decreases, the proportion of households in state (1,1) increases. Conversely, there is no clear trend when looking at the trajectory around the steady state under the high infection parameter regime.
From this, we hypothesised that households states (2,0) and (1,1) were more correlated under low infection parameter regimes, and would result in the FPP approximation being more accurate. Fig. 8 shows the results of this, with a number of different parameter sets sampled and the KL divergence of each approximation calculated.
As the correlation between the two states increases (from more negative towards 0), the FPP approximation becomes less accurate while the SHH approximation becomes more effective, with equality occurring around a correlation of À0:4 (Fig. 8). The same trend is observed whether we calculate the correlation from simulated trajectories (red and blue dots), or whether we approximate the correlation by assuming a multinomial distribution with parameters given by the steady state obtained using the FPP approximation (black and green dots).
The KL divergences for a population partitioned into households of different sizes can be seen in Table 4, with the steady state distributions for households of sizes 2,3 and 4 shown in Fig. 9. We denote the KL divergence of the distribution obtained under approximation Q from the true steady state distribution P by D KL PjjQ ð Þ. We let Q 1 and Q 2 denote the steady state distributions obtained using the SHH and FPP approximations respectively.
Based on visual inspection of Fig. 9 and looking at the KL divergences, the accuracy of the Fokker-Planck approximation relative to the single household approximation appears to decrease as household size increases.
We now return to the example described in the introductionhow can we find appropriate parameters to simulate a system The steady state distribution of this system is where p i denotes the probability of a household having i infectious individuals at steady state. Rearranging Eq. 2, we can calculate the corresponding value of a as being a ¼ 0:15. However, we now put this value of a back into the master equation for the full system (Eq. 3), which gives us a steady state distribution of The KL divergence of this steady state distribution from the true steady state distribution can be calculated as D KL PjjQ ð Þ¼5:16Â 10 À4 , which compares to a KL divergence of 1:12 Â 10 À5 for the optimal value of a (the value which minimises the KL divergence).   8. Plot of the correlation between household states 2; 0 ð Þ and 1; 1 ð Þ (as determined by simulations from the Gillespie algorithm (red and blue) and assuming a multinomial distribution (black and green)), and the KL divergence of the FPP and SHH approximations from the true steady state distribution for N ¼ 400.  If instead the Fokker-Planck peak approximation is used, taking a such that a 1 x Ã ; y Ã ð Þ¼a 2 x Ã ; y Ã ð Þ, we get a KL divergence from the true steady state distribution of 3:19 Â 10 À4 , which is lower than that obtained by using the single household approximation. The KL divergence for each approximation as we vary parameters can be seen in Fig. 10. We observe the same trends we did previously, with the FPP approximation outperforming the SHH approximation in regions of parameter space corresponding to lower levels of infection.

Discussion
In this study we investigated the accuracy of different approximations to the steady state distribution of a system consisting of a population of households undergoing stochastic SIS dynamics. It was shown that at steady state, it is insufficient to consider the stochastic dynamics of a single household (or equivalently, the deterministic dynamics for the population of households). Instead, a more refined model is required that considers all reactions between households.
The difficulty with this new model lies in the complexity of the system -it is no longer feasible to solve numerically in most situations. To that end, we looked at the accuracy and ease of implementation of two different approximations, the single household approximation and the Fokker-Planck peak approximation.
In a low infection parameter regime, the FPP approximation outperformed the SHH approximation, meaning the KL divergence between the true solution (obtained numerically) and the solution obtained from the FPP approximation was lower than that from the SHH approximation. However, as we move into a higher infection parameter regime (one with larger infection rates or smaller recovery rates), the SHH outperforms the FPP approximation. This is shown by varying two parameters at a time in Fig. 6. Thus, we found that the accuracy of each approximation was closely related to the prevalence associated with the region of parameter space we are in. While it is an expected result that the SHH approximation will perform better in a high infection parameter regime than in a low infection parameter regime (as the different households would synchronise more quickly), it is less clear as to whether we would expect the SHH approximation to outperform the FPP approximation in this scenario.
The FPP approximation works by taking the peak of the steady state distribution produced using the Fokker-Planck equation. As such, this approximation to the mean value is only valid when the distribution is not skewed. However, we could expect the distribution to be more skewed when pushed into the corner of the domain (corresponding to system state 0; 0; N ð Þ ), as here the bounds of the region are having a larger impact on the steady state distribution. This typically corresponds to a high infection parameter regime (as most households have two infectious individuals, and very few have none). The SHH approximation does not depend on the number of households in the system. As such, we may expect this to perform less well at smaller population sizes. Looking in closer detail at an example from each parameter regime, it was observed that household states 2; 0 ð Þ and 1; 1 ð Þ in systems in the low infection parameter regime were more strongly correlated than those in a high infection parameter regime. This is due to strong negative correlations (which are present between susceptible and infectious individuals in stochastic models Keeling and Rohani, 2008) primarily occurring along the diagonal boundary of the domain. This suggested that the correlation between household states may be a good statistic to identify which approximation should be used, rather than having to investigate the full 3-dimensional parameter space. While simulating trajectories with which to calculate the correlation can be costly, it can be approximated by assuming the number of households in each state follows a multinomial distribution. Fig. 8 shows that both methods obtain similar KL divergences. It also shows that the FPP approximation is more accurate in high correlation parameter regimes, while the SHH approximation is more accurate in low correlation parameter regimes, with both methods performing similarly at a correlation between À0.3 and À0.4. While there are some differences in the correlations obtained under each method, it is a broad region of correlation values we are interested in, rather than specific values the correlation takes.
This behaviour can typically be explained by the way households move between states. A household cannot move between states 2; 0 ð Þ and 0; 2 ð Þ without first passing through 1; 1 ð Þ. In a low infection parameter regime, most of the dynamics will be happening between household states 2; 0 ð Þ and 1; 1 ð Þ causing the correlation between these household states to be stronger. Conversely, in a high infection parameter regime, it is mostly interaction between household states 1; 1 ð Þand 0; 2 ð Þthat will be occurring, causing the correlation between these two states to be stronger (and the correlation between 2; 0 ð Þ and 1; 1 ð Þ to be weaker). However, we note that multiple household distributions can produce the same prevalence while having different correlations between household states 2; 0 ð Þand 1; 1 ð Þ, and so prevalence and correlation are not completely equivalent. This is particularly noticeable in extreme scenarios whereby between-household infection is largely replaced by a much higher level of withinhousehold infection. We note that correlation between household states does not explain these edge cases any more successfully than prevalence.
After analysing a system of 2-person households, a natural extension was to consider larger household sizes. Due to computational constraints, only households up to size 4 were considered, with the total population size remaining fixed. Fig. 9 shows that for the parameter set considered, the FPP approximation outperformed the SHH approximation for populations of 2-person and 3-person households, but that the reverse was true for 4-person households. It should be noted that the FPP KL divergence was sev- eral orders of magnitude lower for the 2-person household system, but the KL divergence for both the FPP and SHH approximations had the same order of magnitude when considering 3-person households. This suggests the relative effectiveness of the FPP approximation to the SHH approximation decreased as the household size increased. Therefore we conjecture that the accuracy of one approximation relative to the other may depend on the complexity of the system (as defined by the number of system states), with the SHH approximation outperforming the FPP approximation in more complex systems.
Our motivation for investigating this problem was for simulating forward systems that have been parameterised at steady state. In particular, we found that just using the SHH approximation to obtain an between-household rate of infection, and then using this between-household rate of infection to simulate the system forward again to steady state didn't produce the same steady state we started at. Using the methodology we present here, there are multiple ways of achieving this. In particular, we wanted to determine whether we could obtain a more accurate solution to this problem using the FPP approximation. The results (displayed in Fig. 10) are broadly consistent with the results obtained previously: the FPP approximation is more accurate in a low infection parameter regime, while the SHH approximation is more accurate than the FPP approximation in a higher infection parameter regime.
There are a number of avenues not considered here that could be considered in future work. Firstly, we only consider stochastic SIS household structured models. Applying this methodology to a wider range of models, such as SIR and SEIR models, would increase its utility. However, difficulties will occur in finding exact solutions to the master equations. One approach is to use a proxy (e.g. simulating realisations using the Gillespie algorithm) for the true solution. However, there will then be some error around any results, and it is possible that this error could be greater than the error in the approximations themselves making it difficult to assess the validity of each approximation. Similarly, we considered households of sizes 2,3 and 4. Further work should look at larger household sizes to be sure that the trend (FPP becomes less accurate, SHH becomes more accurate relative to FPP) continues. Distributions of household sizes should also be investigated, to better match real-world populations.
We considered two approximations in this work, the SHH and FPP approximations. Whilst moment closure was not considered here, it provides a more accurate approximation than the SHH and FPP approximations at the expense of simplicity to derive. Future work should look at other approximations to this system. Finally, we have shown that correlation between households acts as a good metric in determining which approximation is likely to be more accurate. Further work should look into understanding which correlations are most indicative of this when there are more household states to consider.
For values of R0 close to 1, the FPP approximation may not provide valid solutions at all. As such, the FPP approximation should only be used when R0 is sufficiently high. In this paper, we only considered parameters that resulted in R0 > 1.05 to avoid any risk of the FPP approximation not finding a valid approximation. The results provided in Table 4 suggest the FPP approximation may not be useful in more complicated systems (e.g. an SEIR model, or a model with a larger household size) due to the decrease in accuracy as the size of the state space increases.
In conclusion, we have shown that in order to accurately model a population of households of size 2, it is necessary to fully consider all interactions between states, even if just for analysing the steady states of the model. We have shown that under certain parameter regimes, the FPP approximation provides a more accu-rate approximation than the SHH approximation and that in the 2-person household case, these parameter regimes are well classified by the correlation between two household states. There are many future directions in which this work could be taken, including investigating systems with a higher complexity (e.g. SEIR model) and generalising the use of correlation as a metric to assess the accuracy of each approximation.
CRediT authorship contribution statement

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding declaration
This work has been funded by the Engineering and Physical Sciences Research Council through the MathSys CDT [Grant No. EP/S022244/1]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

A.1. Fokker-Planck derivation
The master equation for a population of N households of size two, all undergoing SIS dynamics, can be written as follows @p m;n @t ¼ E 1;À1 À 1 À Á a m;n p m;n þ E À1;1 À 1 À Á b m;n p m;n þ E 0;1 À 1 À Á c m;n p m;n þ E 0;À1 À 1 À Á d m;n p m;n ; with the following rates: a m;n ¼ a N n þ 2 N À m À n ð Þ ð Þ m b m;n ¼ cn We now want to derive the Fokker-Planck equation, a continuous approximation to the master equation. To this end, we define the following: particular, we are interested in whether any of the approximations systematically under-or over-estimate the true mean and variance. For each approximation, we calculate D var :¼ Var approx ÀVar True , and plot this difference as we vary a and b. We also do the same for the error in the prevalence, with results for both the prevalence and variance under each approximation shown in Fig. 11. Unlike the results displayed in Fig. 6, here we instead calculate an absolute error for each approximation, rather than the error in one approximation relative to the other. In Fig. 11, we see that the SHH approximation consistently overestimates the true prevalence of the system, while the FPP approximation switches between over-and under-estimating the prevalence. However, we find that the FPP approximation only underestimates prevalence in regions of parameter space where R 0 is close to one (and so prevalence is low). Conversely, the FPP approximation consistently under-estimates the variance, while the SHH approximation switches between over-and under-estimating the variance. However, we find that the SHH approximation only over-estimates variance in regions of parameter space where R 0 is close to one (and so prevalence is low).

A.5. Model code
All code used in producing these figures can be found athttps:// github.com/aholmes95/PhD/tree/master/Approximations%20Paper Fig. 11. Heatmaps showing the error in the prevalence (left column) and variance (right column) under the SHH (top row) and FPP (bottom row) approximations. Red areas represent the approximation over-estimating the true value, while blue areas represent the approximation under-estimating the true value. Steady states were obtained for a population of 500 households of size 2. In all four plots we take b ¼ 0:2.