Examination of models for cholera: insights into model comparison methods

This article provides an overview of the Akaike and Bayesian Information Criteria as applied to the setting of deterministic modelling, with the perspective that this may be a useful tool for biomathematics researchers whose primary interests lie in the analysis of compartmental models. We additionally examine a wide range mechanistic and parameter assumptions in the cholera literature through the unifying lens of model selection criteria. Five models for cholera are considered using multiple model selection formulations, and implications for cholera modelling and for model selection criteria are discussed.


Introduction
We have two primary goals in this work: to share with the biomathematics community a statistically valid framework for the comparison of deterministic models and, as a case study, to express some particular mechanistic and parameter-related concerns that we have gleaned from many years of cholera study. To this end, we put forward a method for model comparison and contrast this technique for model selection with others by measuring them against data from the first year of the cholera outbreak that followed the 2010 Haitian earthquake. First, in Section 2 we introduce and discuss the Akaike Information Criterion (AIC) as a basis for model comparison. This section is written for a diverse group of researchers who are consumers of model selection theory, as opposed to those whose sole research focus may be model assessment. Additionally, we provide a brief overview of the Bayesian Information Criteria (BIC). Section 3 begins with a thorough review of how the spread of cholera might be modelled from a mechanistic viewpoint, and highlights disagreements in the literature regarding key parameter assumptions. Here we also describe five models to be compared using the AIC and BIC criteria against data from the recent outbreak in Haiti. In Section 4, we share some methods and results from prior model selection investigations for cholera, as well as an investigation not intended for cholera that provides a different point of view for model selection methodology.
Finally, we discuss the methods and results from our model selection efforts for cholera in Section 5, and provide a brief conclusion in Section 6.

An introduction to maximum likelihood estimators and AIC
Model formulation involves judgement, experience, trial and error. The validity of a model should (i) be consistent with knowledge of the system under study, (ii) extrapolate to related sets of data and (iii) have reasonable mathematical and statistical properties. Often subject-matter considerations suggest a stochastic argument for a range of suitable models. At other times, however, an analysis can only be realistically completed for one of several competing models, and hence a basis for the selection of a single model is needed.
A precept that is usually implied but is often unstated is the principle of parsimony: 'it is vain to do with more what can be done with fewer'. That is, we should prefer simpler models over more complex ones that fit our data about equally well. But what exactly does this mean? If we have models with one, two and three parameters and respective errors of, say, 100, 2 and 1 in fitting the data, then the second clearly improves on the first, but do the second and the third perform 'about equally well'? Model selection criteria are intended to balance the increased model complexity with any improvement in fit to make this determination.
Before we scrutinize various criteria for model selection from a statistical point of view, we must first define the word 'model'. Do two different parameter values applied to the same mechanistic structure define two different models, or are two models viewed as different because they have different mathematical structures, regardless of the particular values of the parameters within each model? Usually, when the term 'model' is used in a statistical article about model selection, it refers to a probability distribution (population) from which data can be randomly sampled, and the question is, 'From which true probability distribution were the data that I observed sampled?' However, those of us who work with mechanistic models in systems biology typically employ a different meaning for the term 'model' -mostly, we consider a system of differential equations paired with an appropriate parameter set to be a 'model'. In this article we will use 'model" to refer to a mathematical structure paired with a particular set of parameters, and we will refer to a 'family' of models as a mathematical structure whose family members are equipped with differing feasible parameter choices.
A second point to be considered when assessing the relative values of two families of models is that the underlying mechanistic modelling assumptions may be wildly different between models, making the two options not equally suited for a particular application. A third important point is that we may not wish to treat the models as the 'most true' from an application viewpoint after all. The best example of this would be neural network models, where the fitted model is a 'black box', whose contents have no intrinsic interest or meaning but are merely used for predictions.
In this introduction, we seek to reconcile our systems biology point-of-view with the statistical model comparison point-of-view, and to define a statistically accurate process under which several mechanistic models for systems may be compared. Given a particular data sample and asked to find the best approximating model, we encounter two types of errors: (i) an error caused by the choice of the family of models, and (ii) an error due to the estimated parametrization of the model family. This concept can be formulated using the concept of Overall Risk, where overall risk = risk of modelling + risk of estimation.
Often, during the course of the analysis of data, we may discover that the particular form of the specified model is not appropriate for the data at hand. In this case, the risk of modelling will be a larger portion of the overall risk due to the incorrect specification of the model. In fact, the correct specification of the model is a necessary but by no means a sufficient condition for it to be selected. For a given model family, the risk of estimation is encountered while we estimate the true parameter(s). When the true parameter values are not included in the search space, then a bias is caused. Our goal in model selection is to minimize the overall risk, and this is where model assessment measures are used.

Formulation of the AIC for deterministic model comparison
In the following paragraphs, we will introduce the AIC for model selection. Before we explore the details, let us view the numerical value assigned to the AIC intuitively: AIC = lack of fit + penalty for model complexity. (1) The lack of fit term, as the name implies, decreases when the model is more likely to describe the data. In comparison, the penalty for complexity, determined by the number of free parameters within the model, increases for more complicated models. The number of free parameters is considered to be a measure of complexity, and is a compensation for the bias in the lack of fit when the maximum likelihood estimates are used. Note that a poor model choice (risk of modelling) will increase the lack of fit if it describes the data poorly, or may instead/additionally have a high penalty for complexity. If a model's parameters are poorly estimated (risk of estimation), this would cause the lack of fit to increase as well. Thus, AIC demands a compromise between the maximized likelihood (the lack of fit component) and the number of free parameters estimated within the model (the penalty component). Hence, when AIC is applied to models from the same family, the penalty for complexity has no variability. On the other hand, the number of parameters become an important factor when comparing models with varying numbers of parameters. When we are comparing models by the simple expedient of fitting them to data, we must first accept that there are no true models. Indeed, models only approximate reality. The problem is then to determine which model would best approximate reality, given the data we have recorded. In other words, the best we can do is to minimize the loss of information. In itself, the value of the AIC for a given data-set has no meaning. It becomes useful when it is compared to the AIC of a collection of predetermined models, the model with the lowest AIC being the best model among all models specified for the data at hand. If only poor models are considered, then AIC will select the best of the poor models. For family of nondeterministic models, as the sample size gets large, the ability to choose 'good' parameters improves, and first term in the AIC formulation -the likelihood -increases but the penalty term does not, since the penalty only depends on the (fixed) number of parameters. This means that the penalty term has little effect if n, the number of points in the data-set, is large. On the one hand, it is intuitively clear that a complex model cannot be justified to describe a small amount of data; however, as the size of the data increases, objections have been raised that minimizing AIC may not point to the correct model (Bhansali & Downham, 1977;Schwarz, 1978;Woodroofe, 1982). Therefore, critics say, the importance of AIC as a model selection criterion should not be exaggerated (Forster, 1999;Hannan, 1986).
We now wish to investigate the AIC methodology from a statistician's point of view so that we may derive how AIC may properly be applied in a deterministic setting. Suppose that we have a random sample Y 1 , . . . , Y n from the true unknown model g (y). Given this random sample of size n from the true distribution defined by g and a specific family of candidate models f (y; θ), we may calculate the AIC. The work begins by making statistical inferences about the parameters for the model family f . Hence, we proceed by fitting the candidate model f (y; θ) by maximizing the conditional probability that f describes the data, given that the parameter is θ; i.e. we maximize the likelihood L(θ) = f (y i ; θ). Now, letθ denote the corresponding value of θ that maximizes the likelihood L(θ). With the optimal parameterθ for the family f selected, an assessment of the quality of the constructed model needs to be made. The form that the AIC described first in Equation (1) takes in this context is where L(θ) is the maximized likelihood function and p is the number of free parameters in the model. We note that in this formulation, the coefficients in front of the terms in the summation simply weight the terms equally; however, in the following formulation (3), equal weighting would result in the selection of an overly-simple model. At this point of the modelling process, we have simply considered a single family of models, and, as we described previously, AIC is not yet a valuable measure. When the AIC value is calculated for multiple model families, the chosen model is the model-parameter combination with the minimum AIC. The negative sign in Equation (2) ensures that higher likelihood contributes to a smaller AIC, while an increased penalty contributes to a higher AIC and the factor 2 puts the differences for AIC on the same scale as 'likelihood ratio statistics', another method for model comparison.
In biomathematics settings, where compartmental models such as Susceptible-Infected-Removed (SIR) are prevalent, the implementation of AIC as a model 'quality' measure fundamentally differs from its implementation in describing statistical models, where the variation is partially due to the randomness inherent in data being drawn from a population (distribution). In the latter setting, the probability density function (pdf), and hence the likelihood function, specifies the probability of observing the true data vector given a particular parameter vector. Given a set of parameter values, the corresponding pdf will show that some data are more probable than other data. In reality, however, we have already observed the data, and hence we are faced with the inverse problem: given the observed data and a model of interest, find one pdf, among all the probability densities that the model describes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector and the parameter vector. The principle of maximum likelihood estimation states that the desired probability distribution is the one that makes the observed data 'most likely'.
However, when fitting a deterministic compartmental model, the very idea of drawing a random sample from a prespecified distribution is no longer applicable. The basic foundation of a compartmental model selection process is the determination of the parameters of a model based on how well they estimate the response variable of interest, and not on the likelihood of observing the parameters. In the light of our earlier discussion on how AIC works, consider a given family of deterministic models, such as the epidemiological SIR family, and a given data-set which the family is to describe when paired with appropriate parameter choices. Given the data-set, and a particular set of parameter estimates, model fitness then corresponds to the prediction error of the response variable (the output variable of interest). Clearly, different sets of parameter estimates will result in different prediction errors.
Thus, referring to our intuitive AIC definition in Equation (1), the lack of fit for a deterministic model can be measured by the residual sum of squares (RSS) between the model prediction and the observed data. Our goal in parameter estimation is to minimize the error of fit, and this is certainly different from the traditional likelihood approach, where we try to maximize the joint probability of a sample being drawn from a particular statistical population. Now, let us assume that we are comparing several deterministic model families, such as SIR versus SIRB (containing an environmental reservoir) from Section 3 below, so that each model may potentially have a different number of parameters. We then add the penalty for model complexity to the AIC formulation similar to the original formulation given in Equation (2). When measuring the error of fit from a deterministic model, the AIC can be computed by Burnham and Anderson (2002) Again, RSS is the sum of the model prediction squared errors for a particular candidate model that was the winning member of its respective family, and K is the total number of estimated parameters p plus 1, because the variance of the observed error is considered to be an additional parameter for the penalty term. The division by n is by convention and clearly the deletion of this term would simply translate all results equally. As we mentioned following Equation (2), a different coefficient is needed for the lack of fit term in Equation (3) versus in the likelihood formulation given in Equation (2). First, since the minimum AIC should select the preferred model and a decreased RSS already indicates an improved model, a negative sign would not be appropriate. Second, the 'balancing term' must now be the number of observations, and not 2 as in the previous AIC equation. As we will discuss in Sections 4 and 5, while there is a mathematically equivalent 'likelihood' derivation to the measure of lack of fit in Equation (3) above, the interpretation of likelihood in a traditional manner is not supported.
In Section 5, we will compare several implementations of the formulation given in Equation (3) with varying approaches to measuring the lack of fit. In the same manner, we will consider a likelihood formulation similar to that in Equation (2). Finally, we will examine an altogether different comparison measure that is briefly described in the following section.

AIC versus BIC for biological models
In the previous section we explained the framework for the AIC. Another tool too may prove useful in model comparison is the BIC. BIC was derived to provide a consistent estimator of the order/dimension of the true model. While the AIC seeks to measure the amount of information lost when a function f is used to approximate the true data generating function g, BIC assumes that one, true, fixed model exists within the set of candidate models, and that as the sample size n increases, the probability of selecting the true model approaches 1. However, oftentimes the sample size required for the consistent criteria to point to the true model with a reasonably high probability must be very large (in the thousands) (Burnham & Anderson, 2002). Thus, Burnham and Anderson explain that the primary assumptions needed for BIC comparisons do not apply in biological sciences or any other 'noisy' science (Burnham & Anderson, 2002). Nonetheless, some in mathematical ecology do consider the BIC in evaluating models. We will discuss a contrasting viewpoint to the utility of BIC in Section 4.4 by presenting equations for BIC in several contexts, and investigate the value of BIC results in Section 5.

A mechanistic consideration of the spread of cholera
The inspiration for the current investigation has its origins in a desire to improve and validate several mechanistically-driven models for cholera. With recent significant outbreaks in Zimbabwe and later Haiti, there has been much progress in the development of useful cholera models. The ultimate goal of these modelling efforts is, of course, to understand the dynamics of a cholera outbreak better so that we can greatly reduce the morbidity to be incurred in future outbreaks.
Thus, due to our limitations in obtaining accurate data, there is some disconnect between our mechanistic understanding of how cholera spreads and the models we use to describe the spread. There is tremendous disagreement throughout both kinds of research, biological and mathematical, about what parameter values are realistic and what modelling assumptions are important. We will explore some of these disagreements, and then introduce several candidate models for cholera.
Generally, cholera enters a human population through the ingestion of food or water that has been contaminated by the bacterium V. cholerae. The vibrios colonize in the small intestine for 12-72 h, and those humans who experience symptomatic infections then shed an enormous number of vibrios through diarrhoea which can last 1-2 weeks. The infection leads to rapid death without hydration therapy, which is inexpensive and easily applied in all but the most under-developed settings (Nelson et al., 2009). Initial cases of cholera in a population can be caused by the chance ingestion of vibrios that lurk indefinitely in the environment of many affected regions, or that are imported from asymptomatic visitors, as may well be the case in Haiti with the Nepalese peacekeepers (Piarroux et al., 2011). After the vibrios enter the human population, there is a human-environmental amplification of the disease in regions without adequate sanitation, as vibrios that have been shed from humans in turn contaminate the water that is used for drinking and cleaning (Nelson et al., 2009). This amplification is additionally powered by a 700-fold increase in the virulence of the bacteria in its first five or so hours after leaving the human intestine (Merrell et al., 2002). It is thus compelling to consider a model for cholera that attempts to describe the human-environmental disease amplification, as well as one that addresses the importance of freshly shed bacteria in the spread of the disease.
As we mentioned earlier in this section, only some humans will have symptomatic infections, and the proportion of humans with severe infections depends on the bacterial biotype (Kaper et al., 1995) and perhaps other possible factors. It is known that the degree of susceptibility depends on genetic factors (Nelson et al., 2009) and probably on nutrition as well (Glass et al., 1989;Nelson et al., 2009). Additionally, it is believed that prior exposure to the disease causes short-term complete or partial immunity (Longini et al., 2002;Nelson et al., 2009), and Glass et al. (1982) in particular suggest that recovery from a symptomatic cholera infection should protect an individual from a subsequent severe infection. It is noted that in areas with endemic disease, there are a smaller number of symptomatic cases than in areas that are immunologically naïve (Nelson et al., 2009), and from this we might infer that in countries with frequent cholera outbreaks, residents may enjoy some partial immunity if they re-encounter the bacteria. Even if one had accurate data representing the fraction of infections that are asymptomatic, then the extent to which asymptomatic infections result from a physiological predisposition versus prior exposure to cholera would not be clear, except, of course, in an outbreak for which the existing populations have no prior exposure. This latter point makes data from Haiti, whose residents had not experienced cholera prior to 2010, especially important for a model study.
In addition to the lack of quantification available as to the cause of asymptomatic infections, additionally there is wide disagreement regarding the proportion and importance of asymptomatic infections in any given cholera outbreak. There are excellent studies that suggest the per cent of symptomatic individuals might be quite small (Centers for Disease Control, xxxx; Glass et al., 1982;King et al., 2008), about 20% (Rinaldo et al., 2012;World Health Organization, 2010a), or possibly 50% (Nelson et al., 2009). Thus, the proportion of asymptomatic infections in an outbreak is not well understood, and additionally the contribution of asymptomatic individuals to disease amplification is a subject of dispute. Some researchers believe that the 'silent shedders' who are typically undocumented in data collected during an outbreak are driving the disease spread (King et al., 2008; Table 1. Summary of all model variables used in one or more of the five models.

S(t)
Susceptible humanŝ S(t) Susceptible with partial immunity Humans recovered from symptomatic infection Humans recovered from asymptomatic infection Total population in all human compartments Concentration of low-infectious vibrios in environment Rinaldo et al., 2012), while others think that these shedders are of less consequence (Nelson et al., 2009). For those with asymptomatic infections, does the lack of disease awareness cause an increased contribution to the environmental and human-to-human spread of the bacteria during the first one to two days of asymptomatic illness? Are those with symptomatic infections effectively quarantined due to the evident need to observe strict sanitation? Models may thus need to consider a range of bacterial contact rates for each type of individual (symptomatic and asymptomatic) in order to account for the full range of possible outcomes. Another parameter which requires attention is the rate of bacterial non-viability. We have observed that a majority of cholera models assume that the bacteria will not be viable after 30 days in the environment; however, a recent commentary on cholera modelling suggests that in fact persistence in the water may last anywhere from 3 to 40 days, and that this change in assumption can cause an enormous difference in model predictions (Grad, Miller, & Lipsitch, 2012).
With the discussion of disagreements within the literature in hand, we now explore several model options.

Model 1: SIR
We begin by suggesting a simple SIR model, depicted in Figure 1. The model assumes that disease spread is density-dependent, and hence the transmission term includes division by the total population size N. The SIR model is given by the system where the meanings of the variables and parameters are given in Tables 1 and 2. We would claim that this model has little mechanistic merit in its description of cholera transmission, and we include it as a check of our model comparison technique. In Section 5, we will show that in some cases, the fit of this quite possibly inappropriate model to the data we have  is good, particularly from the model selection viewpoint that favours models with few parameters.

Model 2: SIRB
The SIRB model is pictured in Figure 2 and consists of the differential equations with variable and parameter meanings given in Tables 1 and 2. This model includes an environmental class for bacteria, which can be interpreted as the concentration of infectious cholera bacteria in the drinking water. Varying formulations of this simple model, with or without waning immunity or person-to-person infectivity as included in Equations (5), have been used for decades (Capasso & Serio, 1978;Codeço, 2001). Recently, the SIRB model was applied to cholera quite effectively, especially when used in combination with spatial considerations, to model the outbreaks in Zimbabwe (Mukandavire et al., 2011) and Haiti (Bertuzzo, Casagrandi, Gatto, Rodriguez-Iturbe, & Rinaldo, 2010;Rinaldo et al., 2012;Tuite et al., 2011). In other model comparison efforts for cholera, a variation of the SIRB formulation is often selected as the most appropriate to model the available data.

Model 3: SIIRBB
The SIRB model considered in Section 3.2.2 ignores the hyperinfectious state for bacteria as well as the role of asymptomatic infected individuals that we discussed in Section 3.1. First published in 2010 (Miller Neilan et al., 2010), the SIIRBB model adds a hyperinfectious bacterial class as well as an asyptomatic infected human class to the SIRB model. The SIIRBB structure is pictured in Figure 3. This model expanded a 2006 model that first introduced the hyperinfectious class to cholera modelling (Hartley et al., 2005), and similarly did not include the person-to-person infectivity term that we see in the SIRB model in Section 3.2.2. A model similar to the one in Equations (6) was has also been used Andrews and Basu (2011) to consider control measures for the outbreak in Haiti.

Model 4: SIIRB
Shortly after the publication of the first cholera model to include hyperinfectivity (Hartley et al., 2005), a short note was published (Pascual, Koelle, & Dobson, 2006), suggesting that the inclusion of a hyperinfectious class of bacteria is only a different way to show a faster mode of transmission, and that epidemic data can therefore be modelled equally well by inclusion of person-to-person transmission rather than through a hyperinfectious class of bacteria. In the context of the SIIRBB model of the previous section, we consider replacing the hyperinfectious bacterial class with person-to-person transmission. The resulting model is where again variable and parameter values are provided in Tables 1 and 2. We add the above SIIRB model to our comparisons to check just this idea: can the data be described equally well by a model that includes person-to-person transmission in lieu of a hyperinfectious class of bacteria? Our model comparison results, if we jump ahead a little to Section 5, suggest 'not quite'. The model is diagrammed in Figure 4.

Model 5: SSIIRRB
Finally, we consider a so-called SSIIRRB model that is expanded to allow the possibility that the capacity to obtain partial immunity may drive cholera dynamics. A similar model appears in an age-structured partial differential equations model for cholera with several collaborators (Fister et al., 2016). The model equations are given by with variable and parameter values given in Tables 1 and 2. The inclusion of a partially immune class may lead to interesting results for long-term dynamics in Haiti, where initially the class with partial immunity would have been empty since the Haitian population had no prior exposure to cholera. Although this more complex model may be justifiable from a mechanistic point of view, and is certainly relevant in age-based considerations, the number of parameters needed for this model is extensive. Are there enough data to justify the use of this model? See Figure 5 for graphics describing the model.

Further parameter discussion
One reason that model choice is tricky for the study of cholera is the disagreement surrounding parameter choice. We will highlight some of the discordance within the literature for the parameter values, and use the parameter discussions from the literature to choose biologically-reasonable ranges (see Table 2) for our parameter search. As we will discuss in more detail in Sections 3.4 and 5 following, we intend for the models to describe surveillance data from Haiti which provides the number of hospitalizations from the beginning of the cholera epidemic there. With data that only suggests the number of reported infections, it is quite difficult to discover the transmission route (Eisenberg et al., 2013). Thus, it is not only difficult to recover specific parameter values related to person-toperson or bacterial infectivity (the β * values in Table 2); as we parametrize the five models considered using the hospitalization data, it is also unlikely that we could determine the relative importance of β I A versus β I S . Similarly, as discussed earlier, there is disagreement between the CDC and WHO websites (Centers for Disease Control, xxxx; World Health Organization, 2010a), a Nature Reviews Microbiology overview (Nelson et al., 2009), and several modelling articles (King et al., 2008;Rinaldo et al., 2012) regarding what proportion of infections may be asymptomatic, which suggests that we should allow a wide range of values for our parameter p. Recall also that for the SSIIRRB model (Section 3.2.5), which allows for partial immunity, the parameter p assumes a slightly different meaning than in the SIIRBB and SIIRB models (Sections 3.2.3 and 3.2.4), in which all susceptible individuals are considered equivalent. Finally, we observe that bacterial parameter values may be quite difficult to determine. In models that include the human-environment amplification, the contributions by humans to the bacterial load in the environment (η * ) cannot be estimated outside of numerical model simulations, because of the impossible-to-classify volume and location of water in the environment, as well as the reality that much of the spread of the disease is due to bacterial contamination of household objects (Nelson et al., 2009).

Initial conditions for all models
The model parameters and assumed ranges for the five models described here are given in Table 2. Each model is parametrized using surveillance data from the Haitian Ministries of Public Health for approximately the first fifteen months following the 2010 earthquake (Republique d'Haiti, xxxx). We have considered this data separately for each of the ten Haitian departments as well as its capital city, as we will elaborate in Section 5. For each of the five models, the initial conditions depend on the demographics of the specific region considered.
For all models and all Haitian data-sets, we considered those living in poverty to be the only individuals who would be susceptible to cholera. We used poverty (Verner, 2007) and population (Wikipedia, xxxx) data to obtain initial values for the number of susceptible humans in each department. Where relevant, we assumed that initially all susceptible humans would carry no partial immunity (since Haiti had not been exposed to cholera prior to the epidemic in this study). For each model/data-set combination, we assumed that Day Zero was the first day with hospitalized infected humans as recorded by the Haitian Ministries of Public Health (Republique d'Haiti, xxxx). In all models, we assumed that only a proportion h of infected humans are hospitalized (see Table 2). Thus, h · I(0) = I * ≡ number of hospitalizations on Day Zero.
We assumed for the models with symptomatic and asymptomatic populations that only symptomatic individuals would be hospitalized, so that If I(0) were to represent the true number of infected inclusive of asymptomatic infections, then I S (0) = pI(0), and I A (0) = (1 − p)I(0). It follows that The precise numerical values chosen for the initial conditions above are summarized in Table 3. We assumed that there were no recovered individuals at the start of the epidemic, and (where relevant) no hyperinfectious bacteria on Day Zero. We did assume an environmental concentration of low-infectious bacteria that is some multiple of the Michaelis constant κ (or κ L for the models that consider hyperinfectious bacteria). Depending on the chosen model, this amounts to one of the following (see Table 2):

Prior results in model selection applied to cholera
As we will highlight in the following discussion, there have been a number of interesting studies considering model selection for cholera in recent years. No two research groups seem to be considering model comparison through the same lens. See Section 3.4 for references. Note that initial infections are estimated from data given in graphical form.

Model comparison applied to cholera models using data from Haiti
A thoughtful look at multiple models for cholera is shared by Rinaldo et al. (2012). The authors consider four structural models to describe the spread of cholera through Haiti, which include similar terms to those in our study but also include detailed geographic information and rainfall intensity as well as human mobility patterns. The models consider the data throughout Haiti as a whole, connected by multiple systems of differential equations that are linked through migration and that are parametrized with local information. This is truly an impressive undertaking in its realistic parametrization of several key details. On the other hand, the model comparisons do not include all structures that interest us, and additionally, some assumptions of constant parameter values given in their Table S2 are based on assumptions used by or derived within previous cholera modelling articles. As an example, the length of immunity after acquiring cholera is based on a simulation outcome in one researcher's work (Koelle, Rodó, Pascual, Yunus, & Mostafa, 2005). The length of immunity was based on data from Matlab, Bangladesh. We might hypothesize that this population with partial immunity might be quite different from that in Haiti, since Haitians had not received prior exposure to cholera. Another example of different parameter assumptions is in the consistent use of the contact rate with infected waters to be β B = 1. Our previous discussion in Section 3.3 explains our concern with this assumption.
Thus, our work differs from Rinaldo et al. (2012) in that we assume that fewer parameters are known, and in that we desire to explore a greater array of mechanistic possibilities in various models. We are limited by our resources to consider only individual departments in contrast to a model with spatially-connected departments (Rinaldo et al., 2012). At the same time, we find the question of model selection when the departments are separated interesting. If the data from different departments show support for differing models, then how might one choose an overall model for the region? This question is of interest to other researchers who lack the resources to consider fine GIS-based detail, and hence, a comparison between the viewpoints is valuable.
The form of model comparison in the work by Rinaldo et al. (2012) is in agreement with the form that we discussed in Section 1 and provided in Equations (3). For this article, the authors used one data point for each of 49 weeks spread across 10 Haitian departments and thus considered n = 490 data points. Referring to Equation (3), this Rinaldo et al. also use K = p + 1 parameter values; that is, the number of model parameters p plus a residual variance parameter. The references cited for this approach to model comparison are a 1998 version of Burnham and Anderson (2002) that is discussed in the Introduction, and an article by Corani and Gatto (2007), who cite the work of Burnham and Anderson (2002) and extend the idea to demographic ecological (regression) models. King et al. (2008) engage in a highly stochastic study comparing compartmental models for cholera against data from Bengal that is rich in time and space. They draw likelihoodbased inferences and refer the reader to certain textbooks (Casella & Berger, 2002;Cox & Barndorff-Nielsen, 1994;Rice, 2006) for a background for inference. They provide a detailed algorithm explaining how they find the maximum likelihood via iterated filtering. For model comparison, the authors refer to Akaike's article (Akaike, 1974) as well as the Burnham and Anderson text (Burnham & Anderson, 2002) to explain their application of AIC, as well as AICc which is intended for small sample sizes and is introduced in the following section (Burnham & Anderson, 2002;Richards, 2005). The Akaike article considers the comparison of autoregressive and moving average models, and since King et al. do not specifically write the form of AIC used it is not clear how the AIC references are applied. Eisenberg et al. (2013) perform an interesting experiment of testing whether parameters can be identified using perfect data in an SIRB model for cholera (this discussion replaces the article's notation with our own for the reader's convenience). The authors then ask, using a small data-set from a cholera outbreak in Angola, whether it is possible to determine the mode of transmission for the disease (person-to-person, water-borne, or both). For this task, they apply a corrected AIC (AIC C ) method for small sample sizes, and they cite Hurvich and Tsai (1991). Hurvich and Tsai explain that the AIC C method is for a 'normal linear regression, or autoregressive, model with p regression or autoregressive, parameters', and is given by

Model selection applied to cholera data from Angola
where σ 2 is given in terms of the derivation for variance in a least squares estimate for linear regression. We note that as described in Section 1, the AIC C would be defined by where K is the number of model parameters plus one for the variance of the error in the fit and n is the number of observations from the data in Angola. The authors do not share the details of the model comparison, and we do not know the precise form of AIC C that they use.

A new approach for model comparison involving seasonal epidemics
A different approach for model comparison can be derived if one assumes a certain probability distribution for the number of hospitalized cases in an epidemic. Ponciano and Capistrán (2011) show in detail how the problem of model selection and parameter estimation can be approached if one assumes that the number of new hospitalizations follows a Poisson distribution. With this viewpoint, the AIC can be determined based on a likelihood function that parameter choices allow the new infections to follow a scaled Poisson distribution. We appreciated this derivation and viewpoint, but also felt that in our case, where the epidemic data shows epidemic waves, a Poisson distribution approach would not be a good basis for parameter fitting or model comparison.
The authors also explain a derivation for AIC in the case of parameterizing the model using weather data. In either case (the Poisson-derived likelihood for a compartmentbased model for the number of infections or a more statistically-motivated model for the weather), the authors use the formula for AIC given by whereL is the maximum likelihood estimate for the given model and p is the number of parameters, and as we read the article p is not increased by the parameter for the variance of the fit errors. The likelihood function derived for the error in the weather data in the article is where w j is the jth weather observation and ω t j is the weather model prediction, which they explain is equivalent to minimizing the expression Although the model for weather in this article is not a compartmental model, the equivalence of maximizing likelihood in Equation (10) to our goal of minimizing the RSS values caused us to consider the output of the form of AIC given in Equation (9) in our analysis in the following section. The authors do not stop with an AIC exploration of models considered; rather, they state that one should also check the BIC values. They also state, in contrast to the discussion in Section 2.2, that if there is disagreement between AIC and BIC recommendations, then this is a signal that there is not sufficient data to choose the best model (Ponciano & Capistrán, 2011). We examine this claim for our case study in Section 5.

Methods and results
We applied several formulations of model comparison criteria to the five models for cholera that were described in Section 3. Thus, we sought the model that best described the data giving hospitalizations in each Haitian department and its capital approximately during the first fifteen months of the recent outbreak following the 2010 earthquake. This data is available in chart form from the Haitian Ministries for Public Health (Haiti Institute of Statistics, 2011), and we used the freely available software Plot Digitizer to convert that data to numeric form (SourceForge, xxxx). Hence, in addition to the expected surveillance errors given the setting, there were surely additional errors in representing the data numerically. We believe that these errors are typical, and that the data will have noise in any large-scale cholera study. As a result, if there were a 'true' model that could describe the surveillance data in such an epidemiological setting, then that 'true' model would not be a 'true' representation of the actual disease dynamics. This is simply unavoidable, and we assume that the model which best describes the noisy data would be the best model to describe the actual data if it were possible to discover that data.
The first step in performing model comparison is to choose the best parameter set for each of the five models listed in Section 3, so that the data is optimally described using each model. We chose the parameter sets independently for each of the ten Haitian departments and for the capital (eleven data-sets) for each of the five models. There are many approaches to parameter selection. In this work, we first found twenty best-fit parameter sets for each of the 55 data-set?-model combination using genetic algorithms (GA) (Akman & Schaefer, 2015). As explained in the reference, each single run is the result of the evolution over many generations of 5000 randomly chosen parameter sets from the biologically feasible parameter space. Due to the stochasticity of the parameter fitting method, we observed variation in the goodness of fit and in the selected parameter values through multiple GA runs, and, as we explain further below, this variation is the reason for considering twenty parameter sets for each combination considered. We also used Matlab's lsqcurvefit (MATLAB and optimization toolbox release, 2012a) as a tool for parameter estimation; we note that other researchers are using a Markov Chain Monte Carlo (MCMC) approach which uses a random walk through the parameter space and a method for evaluating proposed moves to find optimal parametrization (Rinaldo et al., 2012). In a future article, we will explore the use of particle swarm optimization as an efficient and effective tool for parameter estimation.
Since the model's fit to the data is a cornerstone of an assigned AIC or BIC value, we worried about the impact that stochasticity might play in model selection. For each of the 5 × 11 = 55 model-data-set combinations, we found the lowest RSS/n value (where, again, n is the number of data observations for the given data-set). However, our work in exploring methods for parameter selection have convinced us that it is unlikely that differing parameter estimation approaches, or even repeated approaches, are likely to retrieve the same parameter values or goodness of fit (this was explored further in a previous article of ours (Akman & Schaefer, 2015)). Thus, we observe that if we might get 'lucky' in finding a strongly performing data-set for model A, but 'unlucky' for model B.
Should we be cautious about the limitations of any parameter fitting method for its ability in a single run to truly find the best parameter set? As a result of this concern, we considered two values in each model comparison effort -one for the minimum value observed over the 20 parameter sets chosen for each of the 55 model-data-set combinations and another for the average value observed over the 20 parameter sets chosen. The former makes a comparison based on the best result obtained for each parameter set, and the latter makes a comparison based on the average over all parameter fitting efforts.
In the light of the prior results explored in Section 4, we studied several formulations for model comparison. We first considered the AIC formulation given in Equation (3) in Section 2, which is also the formulation suggested by Burnham and Anderson (2002) and by Rinaldo et al. (2012) (Section 4.1). However, to explore our concerns about the consistency and quality of parameter fitting efforts, we calculated this AIC value using both the minimum and average RSS values found for each model-data-set combination. Thus, for each of the 55 model-data-set combinations, we calculated AIC 1,min = n ln RSS min n + 2(p + 1), where n is the number of data points for a given department and p is the number of parameters in a given model. We also considered the AIC development used for weather that was described in Section 4.4 above, but with the use of epidemic data. We believe a likelihood formulation should provide consistent results, and thus as an illustration we also considered the likelihood function L given in Equation (10) for each of the 55 model-data-set combinations, and calculated, following Equation (9), the values AIC 2,min = −2 ln L min + 2p, where n is the number of data points for a given department and p is the number of parameters in a given model. To be precise, for each model-data-set combination we calculated the RSS between our scaled data (h · I(j) or h · I S (j) for j = 0, 1, . . . , n − 1) and the surveillance data (I * (j) for j = 0, 1, . . . , n − 1), and we defined Then we define the likelihood function as We now note that Finally, despite our comments in Section 4 above, we follow the recommendation of Ponciano and Capistrán (2011), Section 4.4 and we consider the measure for BIC in both the least squares and likelihood cases. This allows us to consider the perhaps controversial point (Ponciano & Capistrán, 2011) that if AIC and BIC do not agree, then further study is warranted. For the least squares case, we suggest the form given by Hansen (2007) modified for the framework that we described in Section 2 to be BIC = n ln RSS min n + p · ln (n), where n and p are the number of data points and parameters, respectively, considered for the given model. The likelihood formulation given for BIC is where L is the likelihood function, p is the number of parameters, and n is the number observations (Burnham & Anderson, 2002). Thus, we also consider the formulas BIC 1,min = n ln RSS min n + p ln n, BIC 1,avg = n ln RSS avg n + p ln n, as well as BIC 2,min = −2 ln L + p ln n, BIC 2,avg = −2 ln L + p ln n, as we make our comparisons. Thus, we have defined eight model comparison methods. For each of these, we are seeking the minimum AIC and BIC values, and we can only declare a clear victor if there are no close runners-up. Following Burnham and Anderson (2002), Rinaldo et al. (2012), we measure the difference between each model's AIC/BIC value and the minimum value; these values are recorded by columns in Figure 6 and labelled . The standard approach to interpreting the differences assumes criteria that we cannot meet (independent observations, large sample sizes and nested models) (Burnham & Anderson, 2002;Richards, 2005); however, we presume, as did Rinaldo et al., (2012), that the guidelines of the levels of support for each model vary similarly to the levels suggested by Burnham and Anderson (2002). Thus, referring to our results in Figure 6, the model selected has = 0 (dark shading), and we presume that there is substantial to moderate support for a runner-up model when 0 < ≤ 5 (moderate shading). Finally, there is weak support for a runner-up model when 5 < ≤ 10 (light shading). (The criteria in Burnham and Anderson (2002) are slightly different, but do not include all possible differences.) Consider first our 'inadequate' SIR model from Section 3.2.1, for which results are shown in the top-left block in Figure 6. We note that the AIC 1 and BIC 1 criteria never select this model, but in Nippes, which experienced far fewer infections than in other departments (Date et al., 1994), the likelihood formulations AIC 2 and BIC 2 select the SIR model. We do Figure 6. The results of our model comparison formulation investigations are summarized in this chart and discussed in Section 5. Notes: Each model is considered separately, with individual data-sets given as rows and model comparison technique given as columns. Dark shading represents the model is selected by the criteria in the corresponding column. Light to moderate shading indicates support for the runner-up model as an alternative to the selected model. note that while the RSS and likelihood formulations are in disagreement, the AIC versus BIC values are consistent between the formulations.
We now consider the SIRB model described in Section 3.2.2 and view the results in the top-right block within Figure 6. We observe that the SIRB model is almost always selected, and this result is in agreement with most conclusions drawn in Section 4. There are four exceptions, and a small 'blip' (Nord-Est). The first exception is Artibonite, which was hit very early with cholera and whose cases in the period were considered dwarfed by those in other departments. In Artibonite, all selection criteria chose the SIIRBB model (see the next paragraph). Second, Nippes, which was discussed in the previous paragraph and had very few infections, selected the SIR model. The third region which did not select the SIRB model is Nord-Ouest, and here only BIC formulations choose the SIRB model. While the number of hospitalized in Nord-Ouest does not stand out as did the number for Nippes, the particular shape of the surveillance data (Republique d'Haiti, xxxx) is rather unusual for Nord-Ouest in that there is a rather sharp, brief peak in hospitalizations several months into the epidemic. Finally, while the individual data-sets generally justify the choice of the SIRB model, when the data are considered in total, most formulations do not select this model. The 'Total' rows in Figure 6 shows the result when the RSS values or the terms in Equation (11) are summed over all points in all data-sets, and n is taken to be the total points in all data-sets (similar to Rinaldo et al. (2012) but without their complex geographical setup). Thus, there is a disconnect between our intuitive idea that 'clearly the SIRB model is the appropriate model to select given the criteria' and the model that is actually selected when the data are considered as a whole. Where Rinaldo et al., considering a different subset of models, connected the Haitian departments rather than consider them separately, they did select an SIRB formulation (Rinaldo et al., 2012). The question that remains in this study is, 'When the data are considered as a whole, is a complex model truly able to describe the data better, or is our method of considering the data as a whole allowing too much weight for a 'deviant' data set?' The SIIRBB model, described in Section 3.2.3, has results pictured in the middle-left block of Figure 6. Several Haitian departments' data were best described by the SIIRBB model. In particular, the hard-hit Artibonite selects this model across all criteria. Three of the four AIC criteria selected the 'spiky' data from Nord-Ouest, and the values suggested that the SIIRBB model has support from all criteria. With the exception of one BIC criteria, the SIIRBB model is selected as the best model to describe the aggregate data. However, outside of these departments, there is weak or, more likely, no support for this model. We note that where there is a difference between the AIC and BIC choices, the values generally suggest that there is evidence for choice between the two differing models.
We included the SIIRB model to check the hypothesis that the variation of the SIIRBB model could describe the data equally well (see Section 3.2.4). Results are summarized in the right-centre block of Figure 6. It was surprising to us that the SIIRB showed consistently weaker performance than SIIRBB in all but isolated cases. On the other hand, for some departments, and particularly those with fewer hospitalizations (Date et al., 1994), we see that the choice of person-to-person infectivity over the use of a hyperinfectious bacterial class has some merit. Our multiple comparison approaches suggest that the two models cannot be interchanged so easily, thus suggesting that models which include a hyperinfectious class of bacteria may be more appropriate than those substituting personto-person infectivity in regions with high morbidity from cholera.
Finally, the SSIIRRB model -see Section 3.2.5 and the lower-left block of Figure 6 only enjoyed weak support in a single department. We wonder if the SSIIRRB model would perform more strongly in a longer timeframe. The complexity in the model is intended to uncover long-time dynamics. We did not check this hypothesis. Certainly for the data-sets considered, the addition of multiple unknown parameter values would cause this model to perform poorly in for the comparison metrics, but in reality, the complicated SSIIRRB rarely outperformed other models in describing the data-sets even without penalty from additional parameters.

Conclusions and remarks
The initial motivation for this study was to explore the settings in which mechanisticallycomplex models might be needed or justified to explain the dynamics of a cholera outbreak. We find that within the majority of the individual Haitian departments, for the timeframe of about 15 months, the simple SIRB model enjoys the most support. However, in describing areas that experience larger numbers of cholera infections, or in describing the region as a whole, there is evidence for a model that includes the complexities of asymptomatic infections and a hyperinfectious bacterial state. Finally, we observe that the modelling choices of a hyperinfectious bacterial state versus person-to-person transmission are not interchangeable.
Modelling is an art. Comparison methods select between models by balancing the complexity and performance of each model; however, the more subjective step of choosing appropriate models and parameter spaces cannot be under-valued in this process. The decisions of how many data-sets and which time steps should be used to inform the model selection process are critical to the outcome of a model selection process, while choice across correct formulations of the comparison approaches is less important.
In our study, we focused on the commonly used model assessment metrics RSS and likelihood AIC, both of which yielded similar results, and for the most part these comparison metrics were consistent whether we used the minimum or average RSS values for the data-set in question. Further, we observe that while some authors discount the value of BIC in this setting (Burnham & Anderson, 2002), the BIC results do seem similar to the AIC results (Ponciano & Capistrán, 2011), and where there were differences there were other disagreements that also suggested that more information would be helpful in selecting an appropriate model. Finally, we believe that there are hurdles in interpreting composite model recommendations obtained by combining multiple data-sets. In particular, we observe that more complex models seem to enjoy support as the number of observed infections (not the number of data points) increases.
In conclusion, we would urge that increased transparency about the parameter selection process and the particular format of AIC/BIC chosen by modellers would be a service to our community, with the understanding that often these longer discussions may need to be provided in supplementary materials so that they do not distract from the researchers' immediate goals.