A Bayesian Outbreak Detection Method for Influenza-Like Illness

Epidemic outbreak detection is an important problem in public health and the development of reliable methods for outbreak detection remains an active research area. In this paper we introduce a Bayesian method to detect outbreaks of influenza-like illness from surveillance data. The rationale is that, during the early phase of the outbreak, surveillance data changes from autoregressive dynamics to a regime of exponential growth. Our method uses Bayesian model selection and Bayesian regression to identify the breakpoint. No free parameters need to be tuned. However, historical information regarding influenza-like illnesses needs to be incorporated into the model. In order to show and discuss the performance of our method we analyze synthetic, seasonal, and pandemic outbreak data.


Introduction
An important issue in public health is timely epidemic outbreak detection. Outbreak surveillance and monitoring are usually done by gathering official data reported by hospitals and clinics through medical consultation. One of the most frequent causes of medical consultation in all countries is influenza-like illness (ILI) or acute respiratory infection (ARI) [1][2][3]. ILI are responsible for substantial morbidity and mortality each year [3]. Seasonal influenza occurs throughout the world, and it is ranked as a leading cause of death for people below 4 and above 65 years of age and it is among the 10 top causes of death in almost all age groups [4,5].
Early outbreak detection is necessary in order to take suitable control measures. Outbreaks correspond to breakpoints in surveillance data sets. Substantial research efforts have been devoted to this topic, inspired by a variety of statistical techniques including regression methods, time-series models, and statistical process control approaches and extensions to those fields that involve space-temporal studies and multivariate methods or techniques that include Bayesian inference [6,7]. Comprehensive reviews of the field are presented by Unkel et al. [8], Sonesson and Bock [9], Brookmeyer and Stroup [10], and Watkins et al. [11]. Each of these papers presents a classification of methods used for outbreak detection. In general, outbreak methods use threshold values or threshold intervals to signal an alert, all based on historical data.
There are methods based on linear regression with model selection using criteria like AIC or BIC. However, outbreak detection is made under uncertainty, as noise is present in early signals of influenza surveillance [12]. Statistical methods that ignore this uncertainty may result in overconfident predictions. Bayesian methods provide a way to account for uncertainty in both data and model selection [13]. In this paper we introduce a Bayesian outbreak detection using regression models with model selection based on Bayes factors; see Hoeting et al. [13] for a review. Examples of Bayesian model comparison in linear models are [14,15]. Smith and Spiegelhalter [16] present a review of selection criteria for linear models in terms of Bayes factors. Guo and Speckman [17] examine consistency of Bayes factors in the comparison problem for linear models. One key difference from most other methods is that the method introduced in this paper is not based on historical data alone, but rather on the exponential nature of an epidemic outbreak. For the purposes of this paper, prior information regarding influenza-like illnesses was used to build prior distributions which in turn are useful to estimate the Bayes factors for model selection.  Figure 1: epidemic model. Parameter is the contact rate, is the recovery rate, and accounts for infections due to imported cases. No births or deaths are taken into account given the time frame of an epidemic outbreak (few months for ILI).
The paper is organized as follows. Section 2 describes the results that lead to the outbreak detection method proposed in this paper. Section 3 applies the proposed method to synthetic and real data sets. Section 3.3 discusses the feasibility of our approach. Finally, Section 4 summarizes our findings and offers some perspectives.

Materials and Methods
Let us consider the epidemic process outlined in Figure 1. Let ( ), ( ), and ( ) denote the number of susceptible, infected, and recovered individuals at time and the population size ( ) = ( )+ ( )+ ( ). The deterministic model, without imported infections, that is, = 0, is defined through the following ODE system [18]: (1) is the per capita contact rate between susceptible and infected individuals and is the infection recovery rate. At the onset of an epidemic outbreak the number of infected individuals is small (relative to ); that is, ( 0 ) = 0 ≈ 0 and ( 0 ) = 0 at initial time 0 . Therefore ( 0 ) ≈ and for ≈ 0 ; consequently Here 0 = / is the basic reproductive number, which is defined as the expected number of secondary infections caused by an infectious individual in a totally susceptible population during the time the individual spends in the infectious compartment. An epidemic may occur if 0 is greater than one, while a basic reproductive number smaller than one will not sustain an epidemic; see [18]. Of note, the basic reproductive number does not change if ̸ = 0. In the remainder of the paper we write Δ 0 = 0 −1. Thus ( ) ≈ 0 exp( Δ 0 ) and therefore That is, the logarithm of the number of infected individuals is linear in during an epidemic outbreak. On the other hand, outside epidemic outbreaks we expect that the number of infected reported cases varies around a background level, either around zero or an average number of reports as it is the case in influenza-like illnesses (ILI reports, examples to be analyzed in Section 3). By chance, the number of infected persons reports may vary around the average, with temporary runs going up (or down). In such a case we may fit a linear model in the original scale; namely, The basis for our approach is to compare models (4) and (5), with a short run of reports, using the machinery of Bayesian model selection (see Section 2.1). If the exponential (i.e., linear in log scale) model is selected, it will signal the possible start of an epidemic outbreak. It will be crucial to properly code in the prior distribution for Δ 0 and a clear separation between the two models, since for small values of Δ 0 both models may be quite similar (since ≈ 1+ , for small ). We explain the model selection and prior selection in the following sections.

Bayesian Model Comparison.
Given a data set of reported cases ( ), = 1, 2, . . . , at times , we consider a sliding window of consecutive reports ( ) to compare the statistical models defined by expressions (4) and (5). Before the outbreak, a linear model explains better the reported cases.
On the other hand, during the early phase of the epidemic outbreak the number of infected individuals grows exponentially; thus the exponential model should be selected by the Bayes factors and the onset of the outbreak detected. Next we present an outline of Bayes factors and Bayesian model comparison and the basis for our approach. Given two hypotheses 1 and 2 corresponding to the alternative models 1 and 2 for data and parameters 1 and 2 , the posterior distribution in each case is ( | ) = ( | ) ( | , )/ ( | ), = 1, 2. Here ( | ) and ( | , ) are the prior and likelihood for model and is the normalization constant in each case. The basis of Bayesian model selection is that we can calculate the posterior distribution that each model, or each hypothesis, , is true. Namely, from Bayes's theorem we have where ( ) is the prior probability assigned for model . The Bayes factor ( 1,2 ) comparing these two models is given by the odds ratio of model 1 versus model 2 ; that is, Intuitively, the Bayes factor provides a measure of whether data have increased or decreased the odds on 1 versus 2 . Thus 1,2 > 1 signifies that 1 (or 1 ) is relatively more probable than 2 (or 2 ) given [19]. The optimal decision is therefore to choose the model with the highest posterior probability, that is, model 1 if 1,2 > 1 and model 2 otherwise.
Note that Bayes factors do not make sense when using improper priors (due to unspecified constants) and are sensitive to vague or default a priori distributions; see [20]. However, in this paper we use strong and informative (and indeed proper) priors aimed at distinguishing both models. Therefore the mentioned issues, thoroughly discussed in the Bayesian literature, should be of no concern in the current setting.
Let us denote by 1 the exponential model in (4) and 2 the linear model given in (5). Let be the data at hand, either ( ) for model 1 or log ( ) for model 2, = 1, 2, . . . , . Then we assume Thats is, ∈ R follows a normal distribution with mean and covariance matrix 2 , where is the identity matrix; ∈ R ×2 and ∈ R 2 are the design matrix and the parameter vector, respectively. We will require a different design matrix and prior distributions, for each model . To perform a standard conjugate Bayesian analysis on this linear model [19,21,22] we proceed as follows; please see Appendix A for more details. We use the Normal-Inverse Gamma (NIG) prior distribution: 0 corresponds to the location parameter, Σ 0 is the covariance matrix (for | 2 ∼ 2 ( 0 , 2 Σ 0 )), and 0 and 0 denote the parameters of the Inverse-Gamma distribution (for 2 ∼ InvGa( 0 , 0 )), in the usual way. The posterior distribution results in a NIG( , Σ , , ), where The normalization constant in (6), required by the Bayes factor, is (see Appendix A for more details).
From (4) and (5) it is clear that the design matrices are for the log-linear (exponential) and linear models, with equal to (log( 0 ), Δ 0 ) and ( , ), respectively. Other relevant parameters are explained and set in Table 1. In the following section we discuss and establish prior distributions for each model, setting the hyperparameters of the prior NIG distribution.

Prior Distributions.
As mentioned in Section 2.1, it is crucial to separate both models through a prior distribution that distinguishes clearly the exponential growth from a linear fluctuation. The basic reproduction number 0 plays a central role in the prior information. Here, prior information of our approach is set for influenza-like illnesses; other prior specifications could be attempted for another type of epidemic outbreaks. It is known that for seasonal influenza 0 is approximately 1.5 [23]; therefore prior expectation for Δ 0 will be centered at 0.5. Moreover, in calibrating our models we have found that the bigger the population size the sharper the prior needed, where the prior variance should decrease as 1/ . This rule is in agreement with standard hypothesis in physics; in a well mixed system the amplitude of fluctuations scales like the square root of the system size [24].
For each data window, we first subtract its corresponding mean, for either the logged or the original data, and center the prior linear model around 0. Consequently, the hyperparameters 0 and Σ 0 for the NIG prior are set to for the log-linear (exponential) and linear models, respectively. The outbreak detection method introduced here is robust to other reasonable settings for these hyperparameters. The only critical value is the variance for Δ 0 , which, as mentioned above, needs to be adjusted with the population size as 1/ . The remaining hyperparameters are set to 0 = (1/2)( − ) and 0 = (1/2)( − )̂2, wherê2 is the observed variance in the data window, for either the logged or the original data. Thus, the prior variance is centered near the observed variance for each model.
Indeed, in a pure inference scenario it is questionable to use data driven prior distributions. However, in the current setting it is desired to distinguish between the linear and exponential models and not in fact the estimation of the regression parameters themselves, which are regarded as nuisance. By subtracting the mean and centering the prior of 1 (either to Δ 0 or to ) to 0 and by setting a priori ( 2 ) ≈̂2 we are helping the inference of the regression parameters in each case (and equally for both models). This is a key feature of the proposed approach, since we will use a small window of three consecutive reports, and uncentered priors would blur the relative weight of each model, rendering the model comparison useless. Overall, the prior distribution selection at this stage should be regarded as a pragmatic approach to making the outbreak detection procedure work.
Once the outbreak is detected we may then try to estimate 0 using the data window at hand. Again, since the data set is very small, we will use a noninformative prior (see [19]) and use the marginal posterior for the regression parameters of the log-linear (exponential) model to estimate wherê= ( ) −1 and̂= 0.5( −̂) (indeed, is the logged data). The marginal distribution of any one of the entries of is a univariate Student distribution. We are interested in 2 (corresponding to Δ 0 ); thus 2 ∼ St((̂) 2 , 2 ( ) 22 , − ). We will use the posterior expected value, 2 = (̂) 2 , of this posterior marginal to estimate 0 ; namely, 0 =̂2 + 1. Also, since is fixed an estimator for can be produced witĥ= (̂2 + 1) .
In Section 3 we compute 12 over a moving window of four consecutive data points, that is, = 4, to decide whether changes are due to data oscillations (linear model is selected and 12 < 1), or the onset of exponential growth occurs (the exponential model is selected and 12 > 1) and an epidemic outbreak is expected.

Results
We have tested the predictive capacity of the outbreak detection method proposed in this paper with real and synthetic data sets. The real data sets used are from the Spanish influenza outbreak in San Francisco, USA, in 1918 (see [25]) and data of the acute respiratory illnesses (ARI) from San Luis Potosí, México (see Noyola and Arteaga-Domínguez [26]). Outbreak information and model relevant parameters like the infection rate ( ), the basic reproductive number ( 0 ), and the week of outbreak were estimated. In each figure, red dots indicate three consecutive points in which the exponential model is selected over the linear model; that is, 12 > 1. Grey points indicate one single four-point window in which 12 > 1. As explained in the previous section, once the outbreak is detected we use the log-linear model, with a noninformative prior, to produce estimators for both 0 and .

Synthetic Data Analysis.
To create synthetic data we have avoided committing an "Inverse Crime" [27]. Synthetic data was produced with a closely related but different model to the one assumed in (4) or (5) to be producing the infectious reports. Namely, we use the Gillespie algorithm to make a realization of the epidemic model with demographic stochasticity [28]. Initially all individuals are susceptible and the epidemic outbreak is due to imported cases. The frequency of imported cases is controlled with parameter ; see Figure 1. Of note, the deterministic model (1) is the mean field equation of this stochastic model. Moreover, in a real scenario data is accumulated over the reporting time frame (daily, weekly, etc., reports for infected persons). We then accumulate the simulated data over the reported time frame to produce the synthetic infectious reports ( ). Also, a linear autoregressive process is added to the synthetic data to simulate a background of diseases caused by other agents, as it is the case of influenza-like illness. Simulations have 0 = 1.5, = 1/7 (days); the rate of imported cases is ∈ [10 −7 , 10 −4 ] depending on the population size . Reports are accumulated weekly. Some examples are presented in Figure 2 and the estimates for 0 and are presented in Table 2.

Real Data Analysis.
Real surveillance data sets account for medical consultation cases. These numbers represent infected persons seeking medical attention at health centers. For influenza, it is estimated that as low as 17% of the infected population seek medical consultation and approximately 75% of people with seasonal or pandemic influenza do not exhibit symptoms [29]. However, under normal circumstances reports are proportional to the actual number of infected people and exponential growth in the number of infected people will be shown as such in the reported cases. In the following examples we do not explicitly model subreporting, obtaining good results in all cases.
The Spanish influenza of 1917-18 was a pandemic considered among the most devastating ones in history [30,31]. Figure 3 shows a data set corresponding to San Francisco, USA, spanning from September 24th to November 24th.
Our detection method identifies an outbreak on October 10th. The estimated parameters associated with this epidemic arê= 0.53 and̂0 = 3.7. Both the estimated 0 and BioMed Research International  Figure 4 along with outbreak detection results. In this series of data sets the seasonal outbreak is consistently detected between epidemiological weeks 13 and 15 with 0 between 1.3 and 2.5; see Table 3.
Of note, other questions from ARI surveillance may be addressed; for instance, when do the weekly reports of ARI exceed the historical mean? However, in this paper we limit ourselves to the introduction of the detection method and leave other questions of disease surveillance for future research.  be modeled correctly using previous reports of the expected value of the basic reproductive number. We have learned that the prior variance for Δ 0 needs to reduce as 1/ , where is the population size. This choice may be justified recalling that in a well mixed physical system fluctuations scale like the square root of the system size.
In the examples presented above the outbreak is detected in the presence of underreporting. The good performance of the method is explained considering the fact that the method is based on detecting a qualitative feature of the surveillance data instead of a quantitative threshold. Methods based on historical thresholds may have difficulties in detecting an outbreak happening within or below average historical report levels. Of note, our method uses historical data to calibrate prior distributions; for example, historical data is used to model how much we allow surveillance data to oscillate while in the autoregressive regime. Moreover, the method introduced in this paper allows us to estimate important parameters like infection rate ( ) and the basic reproductive number ( 0 ) which provide valuable information regarding outbreak behavior. The estimation of these quantities was made using a sliding window of three consecutive reports.
Bayesian outbreak detection was applied to two types of real data sets. It consistently succeeded in making an early detection and the estimated 0 and values were in agreement with values reported in the literature.
A Python-Scipy implementation of our approach may be downloaded from http://www.cimat.mx/jac/software; a user friendly interphase is available at request from the authors.

Conclusions
Outbreak detection is an important problem in surveillance of infectious diseases. The development of robust methods of early outbreak detection remains an active research area.
In this paper we use Bayes factors to detect a breakpoint that characterizes the onset of an epidemic outbreak in influenza-like illness surveillance data. The breakpoint characterizes the change from an autoregressive regime to exponential behavior of reported cases at the beginning of an epidemic outbreak. The detection method was successfully used on synthetic and real data sets. The resulting algorithm is straightforwardly implemented. The mathematical methods behind the algorithm are simple but contrast with other proposed methods which are based on calculating thresholds and control charts. Of note, our approach has no free parameters to tune.
The prior distributions used arise from coding information available for influenza-like illness. It is apparent that the method may be applied to surveillance data of other infectious diseases, for example, acute diarrheal diseases, provided enough prior information about the disease of interest is available.
Certainly, it is important to detect outbreaks before they have fully developed, that is, when the number of cases is still low. Our outbreak detection method seems to be able to achieve an early detection of influenza-like illness outbreaks, when synthetic and real data are analyzed. Furthermore, it allows us to make quantitative estimations for important parameters regarding the epidemic. The estimated parameters in the data sets analyzed are in agreement with previously published values.
Some features like the optimal number of reports required to identify an outbreak, optimal number of consecutive Bayes factors required to call an outbreak, and so forth are left as subject of further research. be the data, either ( ) for model 1 or log( ( )) for model 2. Then, we assume that is, ∈ R , follows a normal distribution with mean and covariance matrix 2 , where is the identity matrix, and ∈ R and ∈ R 2 are the design matrix and the parameter vector, respectively. The following details may also be found in [22].
A.1. The NIG Prior. To perform a standard conjugate Bayesian analysis on this linear model, we use the Normal-Inverse Gamma (NIG) prior distribution as follows: This two-dimensional NIG distribution signifies that where 0 correspond to the a priori location parameter and Σ 0 the a priori covariance matrix for and 0 and 0 denote the hyperparameters for the a priori Inverse-Gamma distribution for 2 ; consider The functional form of this prior distribution is given by where Γ(⋅) represents the Gamma function and the IG( 0 , 0 ) prior density for 2 is given by A.2. The Likelihood. The likelihood function for each model is defined as the joint probability of observing the data viewed as a function of the parameters; consequently viewed as a function of and 2 and fixing .  We have that A. 4. The Normalization Constant. This is the constant required by the Bayes factor. We need to compute the distribution ( | 2 ) by integrating out and subsequently integrate out 2 to obtain ( ). Accordingly, Here the matrix identity | + | = | || || −1 + −1 | was applied to obtain Now, the marginal distribution of ( ) is obtained as follows: ( ) = ∫ ( | , 2 ) ( , 2 ) 2 = ∫ ( , 2 ) × NIG ( 0 , Σ 0 , 0 , 0 ) 2 = MVSt 2 0 ( , 0 0 ( + Σ 0 )) .
(A. 15) In more detail, we have which indeed reduces (after some algebraic manipulation) to the NIG( , Σ , , ) density. The marginal distribution of any one of the entries of is a univariate Student distribution. This is used and the correct parameters are described in Section 2.2 to estimate 0 and infection rate ( ).