Practical unidentifiability of a simple vector-borne disease model: Implications for parameter estimation and intervention assessment

Mathematical modeling has an extensive history in vector-borne disease epidemiology, and is increasingly used for prediction, intervention design, and understanding mechanisms. Many studies rely on parameter estimation to link models and data, and to tailor predictions and counterfactuals to specific settings. However, few studies have formally evaluated whether vector-borne disease models can properly estimate the parameters of interest given the constraints of a particular dataset. Identifiability analysis allows us to examine whether model parameters can be estimated uniquely—a lack of consideration of such issues can result in misleading or incorrect parameter estimates and model predictions. Here, we evaluate both structural (theoretical) and practical identifiability of a commonly used compartmental model of mosquito-borne disease, using the 2010 dengue epidemic in Taiwan as a case study. We show that while the model is structurally identifiable, it is practically unidentifiable under a range of human and mosquito time series measurement scenarios. In particular, the transmission parameters form a practically identifiable combination and thus cannot be estimated separately, potentially leading to incorrect predictions of the effects of interventions. However, in spite of the unidentifiability of the individual parameters, the basic reproduction number was successfully estimated across the unidentifiable parameter ranges. These identifiability issues can be resolved by directly measuring several additional human and mosquito life-cycle parameters both experimentally and in the field. While we only consider the simplest case for the model, we show that a commonly used model of vector-borne disease is unidentifiable from human and mosquito incidence data, making it diﬃcult or impossible to estimate parameters or assess intervention strategies. This work illustrates the importance of examining identifiability when linking models with data to make predictions and inferences, and particularly highlights the importance of combining laboratory, field, and case data if we are to successfully estimate epidemiological and ecological parameters using models.

S1 Appendix: Background on structural identifiabiltiy using differential algebra and the Fisher information matrix Differential Algebra Approach The differential algebra approach [1][2][3][4] is usually used to examine global structural identifiability; it can also be used to construct identifiable parameter combinations and reparameterize the model with these combinations, when the model is unidentifiable [5]. In general, the method aims at producing a set of monic polynomial equations from a model with only outputs (e.g. weekly dengue incidence and/or mosquito population data in our case), inputs (e.g. climate drivers; not included in our model), and their derivatives, plus parameters of interest (e.g. transmission rate and death rate) as coefficients in the equations. The set of equations is called the input-output equations, and we can obtain it simply through substitution and differentiation to eliminate unwanted variables in the model. The implication is that each output/measurement will have an input-output-equation, the coefficients of which are identifiable combinations for the model. From these combinations, we can compute whether the model parameters have a unique solution with the given output(s). Here, we give an example using a simple SIR model to illustrate this process (originally proved in [6]). Consider the model equations below: where µ is birth and death rate, is transmission rate, and is recovery rate. Prevalence (y) is the output of the model. We first rewrite the model with y ignoring the R compartment since it does not affect the model dynamics or the identifiability analysis results, which yields: We then solve (S3) for S and its derivative (Ṡ): S =ẏ + y + µy ẏ S =ÿ + ẏ + µẏ y ẏ(ẏ + y + µy) y 2 (S4) Replacing (S2) with (S4) and clearing the denominators, we have: In order to obtain a monic polynomial equation, we divide (S5) by the coefficient of the highest ranking monomial of y, k, yielding: Eq. (S6) is the input-output equation, and there are four identifiable combinations (the coefficients of the input-output equation) for the SIR model. Assuming the four combinations equal to (a 1 , a 2 , a 3 , a 4 ) respectively, solving for ( , k, , µ) in terms of (a 1 , a 2 , a 3 , a 4 ) gives the unique solution for each parameter of interest, indicating that the SIR model is identifiable. More complete details about the differential algebraic method can be found in [2,5]

S1
Fisher Information Matrix The Fisher information matrix F is a p ⇥ p matrix (where p is the number of parameters), which summarizes the amount of information contained in data about the parameters being estimated. If F is singular, the estimated parameters are unidentifiable. Furthermore, the rank of F represents the number of identifiable parameters or combinations. We calculate F using numerical approximation of the parameter sensitivities based on [7][8][9] as follows: whereẋ is a system of first order ODEs, with t representing time, and u the input functions (if any).
p = {✓ 1 , · · · , ✓ p } is the set of parameters to be estimated, and the model outputs are assigned as y.
Then a simplified form of F for examining identifiability can be constructed as follows: 1. Construct the sensitivity matrix Compute the Fisher information matrix: If F is full rank, the model is said to be locally structurally identifiable, while a rank deficient matrix indicates unidentifiability. In practice, we consider the parameters unidentifiable if the determinant of F is non-zero but small. Inverting F gives the Cramér-Rao bound covariance matrix corresponding to the lower bound on the variance of each parameter [8,10]. This method can only give local and asymptotic estimates, which can be problematic with small amounts of data or models with an unusual likelihood surface.

S2
S2 Appendix: Structural identifiability -Scenario 4 proof using differential algebra Here we present the proof of structural identifiability for the model in Scenario 4. The calculations below are performed using Wolfram Mathematica 10.4.
Proposition S2.1. The re-scaled model given in Eq. 2 is structurally identifiable for output equations describing dengue incidence (y h ), larvae/pupae (y a ), infected mosquitoes (y mei ), and susceptible mosquitoes (y ms ).
Proof. Recall the measurement equations for Scenario 4 are: We note that y h would normally be integrated on a weekly basis to generate weekly incidence, but the differential algebra approach assumes perfect, noise-free, continuous measurements, we can simply differentiate the integrated data to yield the instantaneous incidence shown above, which (from a structural identifiability standpoint) contains equivalent information. From this, we can rewrite the model equations in terms of the measured variables, leaving only S h , I h , and I m as unmeasured variables: The y a equation is already in the form of an input-output equation (i.e. only in terms of the parameters, observed variables, and their derivatives). From the y h , y mei , and y ms equations, we can solve for the S h , I h , and I m variables, which we can then plug in to their respective equations to result in four equations only in terms of the measured variables, their derivatives, and the parameters, i.e. a set of input-output equations. One of these equations is the y a equation above, however the other equations are quite large (the largest has over 4000 terms), so are omitted here for simplicity. From the coefficients of these equations, we can solve for each parameter individually, implying that the model is globally structurally identifiable, i.e. that it is possible to estimate all parameters of interest from all available human and mosquito surveillance data assuming no measurement errors.
From this result, we can then conclude that the original, non-rescaled model (Eq. 1) was structurally unidentifiable-as the original parameters which were combined in rescaling can be changed (by compensating one parameter versus another), without altering the apparent outputs y h , y a , y mei , and y ms . The identifiable combinations in this case will simply be the parameters which were combined in the rescaling process, namely mh C⇡/N , ⇠⇡, ⇡ + µ a ,  h N ,  a C, and  m C⇡. This S3 result can also be proven directly using the same method as in Proposition S2.1, and so we omit the proof here.
Corollary S2.1. The non-rescaled model given in Eq. 1 is structurally unidentifiable for output equations describing dengue incidence (y h ), larvae/pupae (y a ), infected mosquitoes (y mei ), and susceptible mosquitoes (y ms ). In particular, the parameters mh , C, ⇡, N, ⇠ , and µ a are unidentifiable, forming combinations mh C⇡/N , ⇠⇡, ⇡ + µ a ,  h N ,  a C, and  m C⇡. The remaining parameters are identifiable. Figure S3. Profile likelihoods (grey circles) with 300 simulated data sets, assuming normally distributed measurement error with a standard deviation of 9.77 (derived from the residuals of the fit to the Kaohsiung data). Profile likelihoods from estimation using human case surveillance data from 2010 in Kaohsiung, Taiwan are shown as black circles. Dashed lines indicate the threshold for the 95% confidence intervals. The y-axes show the change in the sum of squared error (SSE) from the minimum SSE, and the x-axes show the fold change in each parameter. We note that two of the simulated data sets sets resulted in apparent deviation from the minimum SSE for the right-sides of the  h and µ m profiles (still below the confidence interval threshold), likely due to the optimizer converging to a local minimum, or not fully converging. This can be seen as the optimizer sometimes reverted to the true (lower) best-fit minimum further along in the profile.  Figure S6. Zoomed-in profile likelihoods. Zoomed-in views of the profile likelihoods in Figure 4 and Supporting Information Fig. S4, illustrating the minima surrounding the best fit values (best-fit parameter values given as stars): (a) assuming simulated human case surveillance data, (b) adding mosquito (larvae/pupae) counts, (c) adding immature and adult mosquito counts, and (d) adding immature and adult mosquito counts with adult mosquito infection status. S8 Figure S7. Parameter relationships for human surveillance data profile likelihoods. Compensatory relationships between parameters as each parameter is profiled, assuming human surveillance data. Some parameters show pronounced and consistent relationships with one another, indicating the possibility of a practically identifiable combination, while other parameters appear to have no relationship or noisy/inconsistent relationships depending on which parameter is profiled (possibly indicating more complex combination structures between multiple parameters rather than a single pair).

S4
S9 Figure S8. Profile likelihood when fitting subsets of parameters. Profile likelihood (black circles) when fixing 1) µ m and hm ; 2) hm , µ a , and ⇠; 3) hm ,  h µ a , and ⇠. Stars indicate the best-fit parameter value, and dashed lines indicate the threshold for the 95% confidence intervals. S10