Predicting lymphatic filariasis transmission and elimination dynamics using a multi-model ensemble framework

Highlights
• No single mathematical model captures all features of parasite transmission dynamics.
• Multi-model ensemble modelling can overcome biases of single models.
• A multi-model ensemble of three lymphatic filariasis models is proposed and evaluated.
• The multi-model ensemble outperformed the single models in predicting infection.
• The ensemble approach may improve use of models to inform disease control policy.


Introduction
There is increasing appreciation that large-scale parasite control or elimination problems belong to a decision and policy domain marked by significant uncertainty, complexity, and spatial heterogeneity (Vespignani, 2012; Klepac et al., 2013; Marathe and Vullikanti, 2013; Bhatt et al., 2015; Heesterbeek et al., 2015). Solving these problems is particularly germane for the current global strategies aiming to eliminate complex vector-borne macroparasitic diseases, such as lymphatic filariasis (LF), which exhibit a high degree of geographic heterogeneity in transmission dynamics and infection patterns, and consequently in extinction dynamics (Gambhir et al., 2010; Irvine et al., 2015; Michael and Singh, 2016; Duerr et al., 2005; Singh and Michael, 2015; Jambulingam et al., 2016; Stolk et al., 2006; Swaminathan et al., 2008). Although mathematical models of transmission can capture many features of these complexities, it is recognized that any single model may be inadequate to fully explore and predict the whole spectrum of system behavior (Oreskes et al., 1994; Neuman, 2003). This is partly a consequence of the inherent complexity of natural systems, which gives rise to multiple conceptualizations and mathematical descriptions (Oreskes et al., 1994). It is also a reflection of the fact that many different model structures and parameter sets can acceptably reproduce the observed behavior of a complex dynamical system, such that model acceptance in one or more settings may not constitute evidence for general model applicability (Beven and Freer, 2001; Ramin et al., 2012; Hoeting et al., 1999; Christakos, 2003).
Indeed, it is increasingly realized in this context that even if approaches based on single models are able to explain the observed behavior of a dynamical system for a given set of data, such models may not generalize well enough to predict future system behavior, particularly under changed conditions, constituting the so-called "out of sample" problem (Simidjievski et al., 2015a, 2015b). Taken together, these uncertainties mean that relying upon forecasts or future predictions generated by a single model for parasite management can lead to significant bias and uncertainty in policy decisions (Lindström et al., 2015).
Recognizing that there may not be a true model of a natural dynamical system, but rather several adequate descriptions reflecting different conceptual bases and structures (Reichert and Omlin, 1997), recent work has focused on using ensemble-based approaches to explicitly account for the uncertainty inherent in the model selection process (Hoeting et al., 1999; Raftery et al., 2005; Gal et al., 2014). Thus, a single-model ensemble involves the use of a number of realizations of an individual model, with distinct predictions obtained for each realization by either introducing stochastic elements, perturbing the input data or initial conditions, or selecting different sets of model parameters (Gal et al., 2014; Viney et al., 2009). By contrast, in a multi-model ensemble, several different models are used, wherein rather than picking the single "best-fitting" model to predict responses, the aim typically is to provide some averaged prediction from different models using various combinatory methods (Hoeting et al., 1999; Raftery et al., 2005). Multi-model ensemble studies, in applications ranging from weather forecasting to cell and population dynamics modelling (Hoeting et al., 1999; Simidjievski et al., 2015a, 2015b; Raftery et al., 2005; Kuepfer et al., 2007), have highlighted the utility of this approach for overcoming the problems of over-fitting and model uncertainty, resulting in significant predictive performance gains compared to a single model. Further, studies have shown that even if a multi-model ensemble may not always be the most skillful, it performs better than the worst single model, and, as it is often not possible to predict which of the constituent single-model ensembles will be worst at a given time and location, the use of multi-model ensembles is highly advantageous (Matsueda et al., 2007).
Despite the increasing success of multi-model ensemble methods in other research fields, their application to epidemiological modelling has thus far been limited. However, recent developments in comparing outputs of different influenza models by the MIDAS network (Halloran et al., 2008), assessment of different vaccination strategies (Smith et al., 2012) and impacts of long-term changes in climatic conditions (Ruiz et al., 2014) for malaria, and ensemble-based predictions of Foot and Mouth Disease (FMD) epidemics (Lindström et al., 2015), point to the growing application and value of the method in infectious disease modelling. This body of work demonstrates how combining multiple models can be used to answer critical questions in epidemiology, ranging from providing greater confidence in health outcome predictions to improving the ways disease models inform disease control policy, suggesting that the epidemiological use of ensemble-based models is only going to increase in the future.
In this paper, we describe the construction and evaluation of an ensemble of three well-known simulation models of LF epidemiology that incorporate different modelling approaches (deterministic versus stochastic), structures (population versus individual-based) and parameterization methods (Gambhir et al., 2010; Irvine et al., 2015; Jambulingam et al., 2016; Chan et al., 1998; Norman et al., 2000; Plaisier et al., 1998; Subramanian et al., 2004a; Stolk et al., 2008), in order to better describe the population dynamics of LF and generate more accurate predictions of the impacts of drug and vector-based interventions in communities. The following sections describe the ensemble modelling procedure, analyze the prediction accuracy of the single models as well as the multi-model ensemble, and assess the constructed ensemble model's performance in predicting the population dynamics of LF and the outcomes of various intervention strategies on infection. We end by discussing future work to enhance the ensemble model system for supporting policy-relevant predictions, including potential technical improvements in ensemble construction, and the international coordination mechanisms that will be required to link the system effectively to LF data and to policy making.

The models
The three single LF models that make up this study are EPIFIL, LYMFASIM and TRANSFIL: a Monte Carlo, population-based deterministic model (EPIFIL) and two stochastic individual-based models (LYMFASIM and TRANSFIL). These models thus differ in complexity, from individual-based to population-based, but also in the overall number of parameters used and in the parameterization methods followed. There are also other more subtle differences among the models, including how the effects of infection aggregation are handled, and how drug and vector control are incorporated (Gambhir et al., 2010; Irvine et al., 2015; Michael and Singh, 2016; Singh and Michael, 2015; Jambulingam et al., 2016; Stolk et al., 2006; Swaminathan et al., 2008; Norman et al., 2000; Plaisier et al., 1998; Subramanian et al., 2004a; Michael et al., 2004; Singh et al., 2013; Plaisier et al., 2000). These primary inter-model differences are summarized in Table 1, while Table S4 in part B of the Supplementary Information (SI) captures the key similarities and differences in terms of the model parameters used and optimized during model induction and data fitting, and in running simulations of interventions using annual mass drug administrations (MDAs) and vector control. The full details of the three models and their implementation and fitting procedures for LF infection data have been described extensively elsewhere (Gambhir et al., 2010; Irvine et al., 2015; Jambulingam et al., 2016; Swaminathan et al., 2008; Chan et al., 1998; Norman et al., 2000; Plaisier et al., 1998; Subramanian et al., 2004b), and are summarized in part A of the SI.

Experimental setup
We employed an experimental design in which each LF model was prepared, calibrated and operated by the respective modelling group, following which the relevant simulation outputs from each single model were provided for use in constructing the multi-model LF ensemble. This experimental setup comprised the following steps. First, the three groups were provided with LF baseline infection and post-intervention data from three community sites chosen to represent the vector-mediated transmission dynamics specific to each of the three major LF endemic regions: Africa (primarily Anopheles-mediated transmission), Papua New Guinea (PNG; Anopheles) and India (Culex) (Singh et al., 2013; Njenga et al., 2008; Rajagopalan et al., 1989; Subramanian et al., 1989; Das et al., 1992; Rajagopalan et al., 1988) (Table 2A and B). The groups were asked to calibrate their models to the baseline microfilariae (mf) age-prevalence data (= "training" data) from these sites, and to provide an ensemble of simulations for the construction and analysis of the multi-model ensemble. Each model aimed to generate 500 fits, or model members, but the number of initial simulations drawn by each group varied from 10,000 (LYMFASIM) to 200,000 (EPIFIL) as a result of differences in the fitting procedures followed and the computational intricacies of the three models (see part A of the SI). We term these single-model ensembles, which are calibrated and parameterized using either Monte Carlo or stochastic perturbation methods (see parts A and B of the SI).
The groups were then asked to use their respective single-model members that fell within the bounds of the weighted multi-model ensemble to carry out simulations of the observed MDA in Malindi, Africa, and Nanaha, PNG, and of the integrated vector management (IVM) program carried out in Pondicherry, India (Table 2B) (Singh et al., 2013; Njenga et al., 2008; Rajagopalan et al., 1989; Subramanian et al., 1989; Das et al., 1992; Rajagopalan et al., 1988). These simulations were provided for validation against the community mf prevalence data obtained over the durations of the interventions performed in each site (= "validation" data). An overview of the methodology used in this study for calibrating and validating the LF multi-model ensembles constructed for each study site is shown in Fig. 1.
In a second phase of the study, each group was invited to recalibrate their models to a set of overall mf prevalence values thought to represent the currently expected mean infection levels in the regions of Africa, PNG and India, and to provide both their fitted single-model ensembles at baseline and the resulting predictions of the effects of MDA in the form of timelines to cross below the WHO-set LF elimination target of 1% mf for each of these scenarios. These single-model ensembles and predictions were then combined using the ensemble construction methodology developed in this study to present an analysis of a multi-model-based generation of these timelines in each region compared to those predicted by each single model.

Fig. 1. Overview of the methodology used for developing the lymphatic filariasis (LF) multi-model ensemble. All K (= 3) single LF models were 1) compared prior to inclusion and 2) trained with baseline data on LF infection (Table 2A) to produce a collection of N_k simulations for each of the three LF model types. Each constituent model was 3) assigned a weight reflecting its relative performance in reproducing the baseline age-mf prevalence data in each site. The weights were used to construct the LF multi-model ensemble. To validate the multi-model ensemble and forecast the effects of LF interventions, 4) simulations were generated by the multi-model ensemble and compared with mf age-prevalence data obtained during the intervention period in each site (Table 2B). The four processes outlined are represented by orange boxes. The single-model (SM) and multi-model (MM) types are represented as blue and green diamonds, respectively, and the corresponding simulation outcomes are represented as rectangles of the same color.

Study sites and data
We used pre-intervention baseline and post-intervention follow-up mf prevalence data from three study sites, one from each of the three major LF endemic regions of the world (Singh et al., 2013; Njenga et al., 2008; Rajagopalan et al., 1989; Subramanian et al., 1989; Das et al., 1992; Rajagopalan et al., 1988). Note that the infection data used here were age-stratified at baseline and aggregated into overall prevalence for the follow-up years. In addition to infection prevalence information, the baseline annual biting rate (ABR) of vector mosquitoes, along with the vector genus, is known for two of these sites. The baseline survey data (Table 2A) and post-intervention survey data (Table 2B) available for each site served as training and validation data, respectively, for the analysis of the single versus multi-model ensemble predictive performance.

Multi-model ensemble construction
Fig. 1 provides a summary of the overall methodological framework followed in the construction of the multi-model LF ensemble in this study. Ensemble construction was performed in two steps: (1) the three LF models were individually calibrated and validated for each given data set, each relying on its own model structure and implementation; and (2) the three sets of single-model simulations were then combined using a mean-squared-error, performance-based weighting scheme, as outlined in part C of the SI (Johnson and Swinbank, 2009).
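As an illustration, a performance-based weighting of the single-model ensembles might be computed as in the following minimal sketch, which assumes inverse mean-squared-error weights normalized to sum to one; the exact scheme used in the study is the one detailed in part C of its SI and may differ.

```python
import numpy as np

def ensemble_weights(mse_per_model):
    """Normalized inverse-MSE weights for combining single-model ensembles.
    Hypothetical scheme: a model with lower calibration MSE gets a
    proportionally larger share of the multi-model ensemble."""
    inv = 1.0 / np.asarray(mse_per_model, dtype=float)
    return inv / inv.sum()
```

For example, three models with calibration MSEs of 1, 1 and 2 would receive weights of 0.4, 0.4 and 0.2 under this scheme.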

Evaluation metrics
We used three statistical indices to measure the central tendencies of the distributions of errors between data and predictions made by the single-model and multi-model ensembles, and to compare their predictive performance: viz. the average error (AE), relative error (RE), and the relative root mean squared error (ReRMSE) (Ramin et al., 2012; Simidjievski et al., 2015a, 2015b; Willmott and Matsuura, 2005). Let y_i be the observed baseline prevalence in age group i (i = 1, ..., I), N_k be the number of members comprising model k, and x_{kni} be the model-predicted prevalence in age group i for member n of model k. The AE is calculated as:

AE_k = \frac{1}{N_k I} \sum_{n=1}^{N_k} \sum_{i=1}^{I} \left( x_{kni} - y_i \right),

whereas the RE is given by:

RE_k = \frac{1}{N_k I} \sum_{n=1}^{N_k} \sum_{i=1}^{I} \frac{x_{kni} - y_i}{y_i},

and the ReRMSE is calculated as:

ReRMSE_k = \sqrt{ \frac{ \sum_{n=1}^{N_k} \sum_{i=1}^{I} \left( y_i - x_{kni} \right)^2 }{ N_k \sum_{i=1}^{I} \left( \bar{y} - y_i \right)^2 } },

where \bar{y} represents the overall mean mf prevalence. The AE and RE measure the average difference (as an absolute and relative measure, respectively) between model predictions and observed values, whereas the ReRMSE normalizes these differences by the standard deviation of the system variable (mf prevalence), which thus allows comparisons of model performances either for system variables measured on different scales or for sets of data exhibiting considerable between-study variance in the measured system variable. Smaller values of ReRMSE indicate better predictive performance (Simidjievski et al., 2015a, 2015b).
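A minimal sketch of how the three indices can be computed over an array of member predictions, assuming the standard definitions of these metrics (cf. Simidjievski et al., 2015a):

```python
import numpy as np

def average_error(y, x):
    """AE: mean difference between member predictions x (N_k-by-I)
    and observed age-group prevalences y (length I)."""
    return float(np.mean(x - y))

def relative_error(y, x):
    """RE: mean difference relative to the observed values."""
    return float(np.mean((x - y) / y))

def rermse(y, x):
    """ReRMSE: squared errors normalized by the spread of the
    observations around their overall mean (assumed standard form;
    smaller is better)."""
    num = np.sum((y - x) ** 2)
    den = x.shape[0] * np.sum((np.mean(y) - y) ** 2)
    return float(np.sqrt(num / den))
```

A perfect ensemble (every member reproducing the observed age profile exactly) scores zero on all three indices.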
To test whether the power of the multi-model ensemble derives from the diversity of the constituent single-model ensembles, we measured the diversity of the single-model ensembles and correlated it with the performance improvement of the multi-model ensemble over the single-model ensembles (Simidjievski et al., 2015a, 2015b). To quantify the diversity of the single models, we measured the average pairwise difference of the single-model members:

D_k = \frac{2}{|N_k| \left( |N_k| - 1 \right) I} \sum_{m_1 < m_2} \sum_{i=1}^{I} \left( x_{k m_1 i} - x_{k m_2 i} \right)^2,

where |N_k| denotes the number of members in the single-model ensemble k, I the number of measurements in the mf age-profile data set, m_1 and m_2 two members in the single-model ensemble k, and x_{k m_1 i} and x_{k m_2 i} the simulated values of these outcomes at age group i.
To assess the performance improvement of the multi-model ensemble over a single-model ensemble, we calculate:

Improvement_k = \frac{ReRMSE(SM_k) - ReRMSE(MM)}{ReRMSE(SM_k)},

where MM and SM, respectively, represent the multi-model and single-model ensembles.
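The diversity and improvement measures can be sketched as follows; the use of a squared pairwise distance and of a relative (ratio-based) improvement are assumptions consistent with the descriptions above.

```python
from itertools import combinations
import numpy as np

def diversity(x):
    """Average pairwise difference among ensemble members x (N_k-by-I):
    mean squared difference over all member pairs and age groups."""
    n, i = x.shape
    total = sum(np.sum((x[m1] - x[m2]) ** 2)
                for m1, m2 in combinations(range(n), 2))
    return float(total / (n * (n - 1) / 2 * i))

def performance_improvement(rermse_sm, rermse_mm):
    """Relative gain of the multi-model (MM) over a single-model (SM)
    ensemble; positive when the MM ensemble has the lower error."""
    return (rermse_sm - rermse_mm) / rermse_sm
```

An ensemble of identical members has zero diversity; a multi-model ensemble that halves a single model's ReRMSE yields an improvement of 0.5 under this form.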

Intervention modelling
In two of the present study sites, Nanaha and Malindi, the LF intervention applied was annual MDA using the drug regimens advocated by WHO for PNG and East Africa, respectively, while in Pondicherry, only vector control was implemented, through an integrated vector management (IVM) program which greatly reduced the monthly biting rate of Culex mosquitoes during the period (1981–1985) when IVM was operational. Table 1 and part A of the SI text detail the manner in which the effects of these interventions are implemented in the three individual models. For this model comparison exercise, the drug efficacy rates (instantaneous worm and mf kill rates, respectively denoted by ω and ε, and the rate of reduction in the fecundity of adult female worms, δ_reduc, for a period of p months) were supplied as fixed values to be used across all modelling groups to ensure that treatment efficacy was applied identically in all simulations. In Malindi, Kenya, and Nanaha, PNG, the drug regimens modelled were DEC + ALB and DEC + IVR, respectively. The efficacy values for the two regimens were supplied as follows: DEC + ALB (ω = 55%, ε = 95%, δ_reduc = 95%, p = 6 months) and DEC + IVR (ω = 45%, ε = 99%, δ_reduc = 75%, p = 9 months) (Michael et al., 2004; Singh et al., 2013; Michael, 2002).
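The fixed-efficacy treatment assumptions can be illustrated with a minimal sketch; the function and its arguments are hypothetical, and the fecundity-reduction component (δ_reduc over p months) is omitted for brevity.

```python
def apply_mda_round(worms, mf, omega, epsilon):
    """One MDA round under fixed instantaneous kill rates: a fraction
    omega of adult worms and epsilon of circulating mf are removed."""
    return worms * (1.0 - omega), mf * (1.0 - epsilon)
```

For example, with the DEC + ALB efficacies from the text (ω = 0.55, ε = 0.95), a burden of 100 worms and 100 mf would be reduced to roughly 45 worms and 5 mf after one round.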
In the case of modelling the impact of the integrated vector management (IVM) interventions carried out during the period 1981–1985 in Pondicherry (Das et al., 1992) on the monthly biting rate (MBR), EPIFIL used a simple segmented exponential function of the form MBR = MBR_0 exp[a_1 t] for each of two periods, to capture the observed decline (for the period 1981–1985, when the IVM was in effect) and the gradual rise in this variable during the period 1986–1992 when IVM was discontinued (details in part A of the SI). LYMFASIM and TRANSFIL modelled the decline in MBR using the measured reductions in the average MBRs during the period 1981–1985.
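The segmented exponential form used by EPIFIL can be sketched as below; the parameter names, the rebound rate and the break point are illustrative, not the fitted values from the study.

```python
import math

def monthly_biting_rate(t, mbr0, a1, a2, t_ivm_end=5.0):
    """Segmented exponential MBR: MBR_0 * exp(a1 * t) while IVM is in
    effect (t < t_ivm_end, with a1 < 0 for a decline), followed by an
    exponential rebound at rate a2 once IVM is discontinued."""
    if t < t_ivm_end:
        return mbr0 * math.exp(a1 * t)
    mbr_end = mbr0 * math.exp(a1 * t_ivm_end)
    return mbr_end * math.exp(a2 * (t - t_ivm_end))
```

The two segments join continuously at t_ivm_end, so the simulated biting rate declines while IVM operates and rises gradually thereafter, as described for Pondicherry.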

Scenario modelling
Predictions of the impacts of annual MDA on timelines (in years) to cross the 1% mf prevalence threshold set by WHO for LF elimination were compared between the single- and multi-model ensembles for three hypothetical baseline infection scenarios thought to define current LF conditions in the regions of Africa, Asia and Papua New Guinea. The three scenarios are: (A) an African setting with 10% baseline mf prevalence, where Anopheles is the dominant genus of LF vector mosquitoes; (B) an Asian setting with 5% baseline mf prevalence, where Culex is the dominant genus; and (C) a PNG setting with 40% baseline mf prevalence, where Anopheles is the dominant genus. For all scenarios, the modelled intervention was the application of annual DEC + ALB with 65% coverage. Note that the efficacy rates used for this regimen were the same as those used for the study site of Malindi, Kenya.
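The three scenarios can be summarized in a small configuration sketch; the keys and structure are illustrative rather than the actual inputs of any of the three models.

```python
# Hypothetical baseline scenarios (A-C) as described in the text.
SCENARIOS = {
    "A_Africa": {"baseline_mf": 0.10, "vector": "Anopheles"},
    "B_Asia":   {"baseline_mf": 0.05, "vector": "Culex"},
    "C_PNG":    {"baseline_mf": 0.40, "vector": "Anopheles"},
}

# Common intervention across all scenarios: annual DEC + ALB at 65% coverage.
MDA_POLICY = {"regimen": "DEC + ALB", "coverage": 0.65, "frequency": "annual"}
```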

Single-model fits to baseline training data
The fits of the three single LF transmission models to the baseline mf prevalence data stratified by age in each of the three sites of Malindi, Kenya, Nanaha, PNG, and Pondicherry, India, are shown in Fig. 2. The results show that in the majority of cases the age-prevalence data measured in each site fell within the bounds of the single-model ensembles, indicating that each model individually is able to reproduce the infection age patterns observed in these study sites. This is unsurprising given that each group was invited to provide their fitted models for this study, although these varied between 421 and 500 simulations, or members, per model, with the exception of only 82 members from LYMFASIM for the Malindi, Kenya dataset. The performance statistics calculated for each model displayed in Table 3, however, show that the single models differed in their abilities (based on the AE, RE and ReRMSE metrics) to accurately reproduce the observed data measured in each site. For example, comparing ReRMSE between the models shows that EPIFIL surpasses LYMFASIM and TRANSFIL for the two sites exhibiting medium (Malindi) to high (Nanaha) infection prevalences, whereas LYMFASIM outperforms the other models for the low-prevalence data from Pondicherry. Examination of the diversity index calculated for each model suggests that the predictive performance of the single models for the baseline data may be related inversely to the diversity of the members constituting each single-model ensemble (Table 3). This covariation in model predictive performance and diversity also influenced the weights calculated for each model, with a relatively equal weighting given to each of the three models in the case of Nanaha, whereas the predictions for the Malindi and Pondicherry data were heavily influenced by one model (Table 3).

Table 3. Performance statistics of single and multi-model ensemble predictions with respect to the baseline (training) data and post-intervention (validation) data.
Taken together, these results support the conjecture that it will be difficult to choose a single modelling system for predicting LF infection dynamics that is efficient and robust across all endemic conditions.
To analyze the relationship between the number of members contained in a single-model ensemble and its predictive performance, we used a bootstrap approach to create a range of ensembles of different sizes, M, for each model. For each M, the ReRMSE values of the resultant member fits to data from the respective sites were calculated and averaged. Table 4 summarizes the results of these experiments, and shows that the optimal number of members required for inducing maximal performance varied across models and sites, with the smallest set being 50 members for LYMFASIM in Nanaha and the largest being 350 for LYMFASIM in Malindi and TRANSFIL in Pondicherry. The results indicate, however, that across all three datasets, EPIFIL may achieve optimal performance with 100-200 members, LYMFASIM with 50-350, and TRANSFIL with 150-350.
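The bootstrap experiment over ensemble sizes can be sketched as follows; the resampling protocol (with replacement, number of replicates) is an assumption, not the study's exact procedure.

```python
import numpy as np

def mean_rermse_for_size(members, y, M, n_boot=200, seed=0):
    """Bootstrap-average ReRMSE of size-M sub-ensembles resampled from
    a pool of member predictions (N-by-I) against observed age-group
    prevalences y (length I)."""
    rng = np.random.default_rng(seed)
    den_unit = np.sum((np.mean(y) - y) ** 2)
    scores = []
    for _ in range(n_boot):
        sub = members[rng.integers(0, members.shape[0], size=M)]
        scores.append(np.sqrt(np.sum((y - sub) ** 2) / (M * den_unit)))
    return float(np.mean(scores))
```

Sweeping M over, say, 50 to 500 and picking the size with the lowest average ReRMSE would reproduce the kind of optimum reported in Table 4.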

Multi-model ensemble fits to baseline data
The size of the multi-model ensembles constructed in this study varied between the datasets, comprising a total of 570, 1128, and 888 model members for the Malindi, Nanaha, and Pondicherry sites, respectively. The fits of these multi-model ensembles in comparison to those of the single-model ensembles for the baseline mf age-prevalence data measured in each of the three sites are shown in Fig. 2. The comparative fits visualized in the figure indicate that the multi-model ensembles constructed for each site had, on average, higher performances for predicting these data compared to the single models, with variance also lower than the two most variable single models, LYMFASIM and TRANSFIL. This is formally confirmed by the ReRMSE and diversity values obtained by the multi-model ensemble across all three sites compared to the individual models (Table 3), which show that although single models individually surpassed the performance of the multi-model ensemble in a given site, none of these single models achieved this systematically in all instances compared to the multi-model ensemble.
The ReRMSE values also indicate that for all three datasets, the multi-model ensembles, despite not being the most skillful in each case, always performed better than the corresponding worst single model (Table 3).

Single and multi-model ensemble predictions of site-specific intervention outcomes
In this exercise, we compared the performances of the single and multi-model ensembles in forecasting the outcomes of the LF interventions applied in each of the three study sites. The effects of MDA based on observed drug coverages were investigated in the case of the Malindi and Nanaha sites (Singh et al., 2013; Njenga et al., 2008), while the impact of IVM was modelled in the case of the Pondicherry site, again using actual implementation coverages and efficacy rates (Subramanian et al., 1989; Das et al., 1992; Rajagopalan et al., 1988). Fig. 3 shows the model simulations for these interventions in each site, and confirms our implicit conjecture in this study that using individual models selected on the basis of their performance in fitting training data may lead to inferior prediction performance for data outside the training sample (i.e., a significant overfitting problem (Simidjievski et al., 2015a, 2015b)). Comparison of the ReRMSE values obtained for the single models for the training (baseline) versus the validation (intervention) data indicates that this effect might depend on the specific modelling system under consideration, being most apparent for EPIFIL.
By contrast, the results show that the multi-model ensembles constructed by combining all three single models are able to compensate for inter-model errors and the effects of any overfitting to training data, and are thus able to predict the effects of interventions, on average, better than the corresponding single models in each site. This is highlighted not only by the results depicted in Fig. 3, which show that the multi-model ensemble is able to envelope all the validation data points from each study site within its bounds (the gray-colored subplot in Fig. 3), but also by the fact that the ReRMSE values of the multi-model ensemble are never worse than those of the worst-performing single model (Table 3). The performance improvement of the multi-model ensemble in comparison to each single model also varied, with LYMFASIM and TRANSFIL outperforming the multi-model ensemble in two of the three sites. EPIFIL, which generally produced the best fits to the baseline training data (Table 3), was surpassed by the multi-model ensemble in all three sites, highlighting that superior model performance on training data can lead to poor predictive performance outside such data.
We next examined whether the relative performance of the multi-model ensemble was a function of the varying degree of diversity observed for the single models in the different study data sets. A negative correlation (correlation coefficient = -0.66) was found between the diversity of the single-model members and the multi-model performance improvement over each of the single-model ensembles across the present data sets. This negative relationship between the two variables indicates that the multi-model ensemble is able to improve performance over insufficiently diverse single models by exploiting the diversity of the other constituent models (Simidjievski et al., 2015a, 2015b).
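The reported correlation can be reproduced from paired diversity and improvement values with a standard Pearson coefficient, e.g.:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between single-model diversity values and the
    corresponding multi-model performance improvements (a minimal
    re-implementation; np.corrcoef would serve equally well)."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))
```

A coefficient near -1 for the paired values would indicate, as observed here, that the ensemble's gain over a single model shrinks as that model's own diversity grows.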

Ensemble models and scenario-based predictions of timelines to extinction
The performance statistics for fits of the single and multi-model ensembles to the three hypothetical LF infection scenarios investigated in this study are given in Table 5. The results show that, for the scenario modelling exercise, there was considerable variation between the single models in terms of predictive performance for the regional prevalence data, diversity of good-fitting members, and the weights estimated for constructing the LF multi-model ensemble. EPIFIL displayed the best fit, least diversity and highest weighting, in direct contrast to TRANSFIL, while LYMFASIM occupied an intermediate position with respect to these metrics (Table 5). As with the site-specific data, the multi-model ensemble again performed better than the averaged skill of the single models for reproducing the hypothetical data, and there was no instance where it was outperformed by the worst single model (in fact, it was the second-best performing of the ensembles in this exercise). Fig. 4 depicts the timelines generated by the single and multi-model ensembles for mf prevalence to cross below the WHO 1% mf threshold, as well as the years required to reach this threshold, for each scenario. Two points are immediately apparent: first, the predictions of single models can vary considerably for each modelled scenario; and second, the corresponding multi-model predictions of the median and variance of the years of MDA required to reach the WHO threshold are dictated by the combined performance and diversity of the single models. Thus, while the best-fitting and least diverse model, EPIFIL, predicts a narrow range of 5-6 years of MDA to reach the 1% mf threshold (from an initial prevalence of 10% mf) in the African scenario, the multi-model ensemble is also informed by the more variable LYMFASIM and TRANSFIL (predictions between 1 and 10 years of MDA), and yields a slightly broader range of 4-7 years for this scenario (Figs. 4A and 5A).
In the case of the Asian scenario, the single models predicted anywhere between 1 and 8 years, whereas the multi-model ensemble combines these predictions to indicate that only 2-5 years of annual MDA will be required (Figs. 4B and 5B). The most dramatic variation in the predictions of the single models was observed for the Papua New Guinea scenario, for which LYMFASIM predicted the need for 1-14 years of MDA to reach the 1% mf threshold, while EPIFIL (and also TRANSFIL to a large degree) predicted a much narrower range of 8-9 years (Figs. 4C and 5C). The multi-model ensemble, by contrast, incorporates these single-model uncertainties to predict a range of 4-10 years of MDA, a result stressing its utility as a device for overcoming single-model prediction heterogeneity to produce a combined prediction that takes full account of such errors.
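The timeline summaries above can be derived from simulated trajectories with a simple helper such as the following (illustrative only, not part of any of the three models):

```python
def years_to_threshold(annual_mf_prevalence, threshold=0.01):
    """First year (index, with year 0 = baseline) at which a simulated
    annual mf prevalence trajectory falls below the WHO 1% elimination
    threshold; returns None if the threshold is never crossed."""
    for year, prevalence in enumerate(annual_mf_prevalence):
        if prevalence < threshold:
            return year
    return None
```

Applying this to every member of each ensemble and summarizing the resulting distribution of years would yield the median and range reported for each scenario.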

Discussion
Multi-model ensemble modelling is drawing increasing attention as a mathematical framework for addressing uncertainty when predicting the dynamics of complex systems (IPCC, 2013; Gneiting and Raftery, 2005; Cloke and Pappenberger, 2009; Velazquez et al., 2010). This growing interest also stems from the recognition that when many competing models for a specific complex phenomenon exist, sampling from the available individual models will help reduce uncertainty and increase the reliability of the modelling results for management decision making (Ramin et al., 2012; Ramin et al., 2011; Velazquez et al., 2011). This is not only because single models taken together are likely to represent a wider range of structural configurations and processes, leading to better coverage of the whole behavior of a system, but also because different models will have different strengths and weaknesses and, in an ensemble, the deficiencies in any one model may be balanced by the strengths of others or by compensating errors in another model (Martre et al., 2015). Thus, by reducing the simulation errors of single models, multi-model ensemble modelling is expected to increase the consistency of predictions for a problem and therefore aid better management decisions. Here, our major contribution is to develop and implement an initial multi-model ensemble framework for combining simulations of the three currently existing single models of LF transmission in order to produce better predictions of both parasite transmission dynamics and the impacts of interventions aiming to eliminate transmission in different endemic communities and regions. In particular, we describe a combinatory method that allowed calculation of the mean and variance of an LF multi-model ensemble relying on a weighted combination of simulations from the single models based on their performance in fitting baseline infection prevalence data.
The results are then used to evaluate the performance of the constructed multi-model ensemble for providing collective simulations of the effects of interventions compared to each single model. Although we focused on only one combination method, viz. construction of a weighted mean ensemble with simulations from constituent single models weighted according to their calibration performance, we show that the performance of the multi-model ensemble for simulating infection prevalence data, for both the baseline calibration period and the intervention validation period, was, on average, superior to those of any of the LF single models when evaluations were carried out against field data obtained from endemic communities representative of the three major LF endemic regions of Sub-Saharan Africa, Asia, and Papua New Guinea. Thus, although single-model ensembles outperformed the constructed multi-model ensemble for predicting either the baseline or intervention infection data for a given site-specific dataset, no single model consistently surpassed the collective performance of the multi-model ensemble in reproducing the data across all study sites for each period (Table 3). This result is consistent with findings from previous work on ensemble modelling of complex systems, which has demonstrated how weighted ensemble models often tend, on average, to display significantly higher skill and reliability compared to their constituent single-model ensembles, particularly when the single models exhibit considerable disparity in their individual predictions for a dataset or situation (Viney et al., 2009).
Nonetheless, it is apparent that the reliability of the present weighted multi-model ensemble depends on both the predictive accuracy and the variance of the three single LF models. Thus, even though EPIFIL and LYMFASIM contributed most to the ensemble in the case of the Malindi and Pondicherry datasets respectively, the moderate diversity of their fits to the baseline calibration data meant that the variance of the multi-model ensemble predictions for the intervention data in these sites was also relatively lower than for Nanaha, where each model not only contributed similarly to the ensemble, but also exhibited high diversity in its calibration fits, leading to a correspondingly highly variable ensemble prediction of the validation data (Table 3 and Fig. 3). Although various corrective methods have been proposed to overcome this bias in the contributions of single models to ensemble predictions, our use of the commonly suggested linear regression procedure (Vrugt et al., 2008), while decreasing variance, also reduced the predictive performance of the constructed LF multi-model ensemble (see part E of the SI). This indicates that a more systematic exploration of how best to trim ensemble variability, perhaps via exclusion of extreme predictions (Viney et al., 2009) or by applying probabilistic approaches (Ramin et al., 2012;Raftery et al., 2005), such as Bayesian model averaging (BMA), will be required to more fully address this problem and improve the forecasting performance of the multi-model ensemble developed in this study. Note, however, that even though we used the intervention data to validate both the single and multi-model ensembles, this approach may be confounded by the fact that we also asked each modelling group to use fixed drug parameters when simulating the effects of interventions on infection prevalence.
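Exclusion of extreme predictions, in the spirit of the approach suggested by Viney et al. (2009), could take a form like the following sketch. The function name and the particular quantile band are illustrative assumptions, not the procedure used in this study.

```python
import numpy as np

def trim_extremes(member_predictions, lower_q=0.05, upper_q=0.95):
    """Drop extreme ensemble members before pooling, as one way to
    reduce inflated ensemble variance.

    member_predictions: 2-D array (members x time points)
    Members whose mean prediction falls outside the chosen quantile
    band of all member means are excluded.
    """
    member_means = member_predictions.mean(axis=1)
    lo, hi = np.quantile(member_means, [lower_q, upper_q])
    keep = (member_means >= lo) & (member_means <= hi)
    return member_predictions[keep]
```

The tradeoff discussed above applies directly here: a narrower quantile band trims more variance but also discards more of the member diversity that the validation results show to be useful.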
If these parameters are incorrect, then the model predictions of trends in infection during the intervention period may reflect the outcomes of this uncertainty rather than the ability to accurately predict the dynamics of control. Future work will need to address this possibility by allowing the single models to estimate these drug parameters directly from intervention data (Singh et al., 2013). Nevertheless, it is clear that the observed improvement of ensemble performance over a single model in predicting the validation or intervention data was related to the diversity of that model's fits to the calibration, or training, dataset (Table 3). This is in line with the consensus in the ensemble literature that multi-model ensembles perform better by exploiting the diversity of their constituent models (Simidjievski et al., 2015a, 2015b). Taken together, the observed interrelationships between diversity and ensemble variance on the one hand, and performance gain on the other, indicate that there may be tradeoffs between these variables. One simple first approach to managing this tradeoff is to control the number of members contained in each single-model ensemble. Our results gauging the optimal number of members required to produce maximal predictive performance showed that this varied between both the three LF models and the calibration datasets, with EPIFIL needing between 100 and 200 members, LYMFASIM requiring 50-350, and TRANSFIL requiring between 150 and 350 members to produce the best performances across the three sites.
Given that each group provided between 421 and 500 good-fitting model members for each site, with the exception of only 82 members in the case of LYMFASIM for the Malindi dataset, this result suggests that the larger variance exhibited by LYMFASIM and TRANSFIL, for Nanaha and Pondicherry in particular, could reflect the combined effects of the different total numbers of simulations carried out (which ranged from 10,000 in the case of LYMFASIM up to 200,000 in the case of EPIFIL), in addition to inherent stochastic variation, fewer open parameters, and a larger than required member size used by these models. This indicates that future work will need to pay closer attention to calculating the optimal total simulation and selected member sizes needed by a single model in order to improve the predictive performances of both the single and the resultant multi-model ensembles.
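A member-size calculation of the kind called for here could, for example, scan candidate ensemble sizes against held-out data. This sketch, with hypothetical names and a simple RMSE criterion, is one possible starting point rather than the method used in the study.

```python
import numpy as np

def optimal_member_count(members, validation_obs, sizes=(50, 100, 150, 200, 350)):
    """Gauge how many best-fitting members a single-model ensemble needs.

    members: 2-D array (members x time points), ordered best-fitting first
    validation_obs: observed prevalences at the same time points
    Returns the candidate member count whose ensemble mean achieves the
    lowest RMSE against the validation data.
    """
    best_n, best_rmse = None, np.inf
    for n in sizes:
        n = min(n, len(members))  # cannot use more members than supplied
        mean_pred = members[:n].mean(axis=0)
        rmse = np.sqrt(np.mean((mean_pred - validation_obs) ** 2))
        if rmse < best_rmse:
            best_n, best_rmse = n, rmse
    return best_n
```

Scanning over held-out data in this way would let each modelling group report a site-specific optimal member size, rather than fixing one size across all sites.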
The multi-model ensemble analysis carried out in this study also produced new insights regarding the potential predictive performances of the three constituent LF models. In particular, the results showed that EPIFIL may have a tendency to overfit a given training dataset, leading to lower diversity in the ensemble members and therefore a lower capacity for reproducing 'out-of-sample' intervention predictions (Figs. 2 and 3, and Table 3). By contrast, LYMFASIM and TRANSFIL, while generally fitting the training data less well, had sufficiently diverse members to generate better-fitting predictions of the intervention/validation data (Figs. 2 and 3, and Table 3). Since these LF models differ only slightly in structure, with only the ways in which the aggregation parameter and drug coverage are implemented differing significantly between them (part A of the SI), the observed differential abilities of the models for simulating the training and intervention data must therefore reflect the different means by which the models are fitted to data. Specifically, this may be related to the fact that the Bayesian Melding method by which EPIFIL is fitted to data facilitates the use of a larger number of optimizable parameters (Gambhir et al., 2010;Michael and Singh, 2016;Singh and Michael, 2015;Singh et al., 2013), as compared to individual-based models, such as LYMFASIM and TRANSFIL, which have many parameters that must be prescribed a priori and only a few optimizable parameters, to compensate for the computationally intensive nature of individual-based simulation (see part A of the SI). These implementation differences mean that EPIFIL is better able to fit a given calibration dataset, and, given the Importance Sampling algorithm employed to select the best-fitting members, is also able to obtain fits with low between-member diversity.
In contrast, with most parameters fixed, LYMFASIM and TRANSFIL were able to fit the training dataset only by selecting a more diverse set of compensatory members. These results imply that the diversity of EPIFIL's members will need to be increased while, commensurately, the fitted members of LYMFASIM and TRANSFIL will need to be trimmed, if these single models are to be used individually to improve predictions of the effects of LF interventions.
The ability of multi-model ensembles to overcome these biases of single models, however, is amply demonstrated by the regional scenario modelling results obtained in this study (Fig. 4). The key finding here is that although the predictions of the multi-model ensemble are influenced greatly by the accuracy of the best-fitting single model in reproducing the calibration or training data, the variance of the ensemble predictions is also informed by the more variable model members (Table 5 and Fig. 4). Thus, while the best-calibrated model for each scenario, EPIFIL, predicted a much lower range in the number of annual MDA cycles required to fall below the 1% mf threshold set by WHO as a global target for determining LF transmission elimination in all three of the regional scenarios investigated here, the multi-model ensemble, borrowing from the contributions of the more variable LYMFASIM and TRANSFIL, yielded slightly broader ranges in each case, intermediate between the predictions of EPIFIL on the one hand and those of LYMFASIM and TRANSFIL on the other. These results show that a weighted multi-model ensemble can overcome the low diversity of a model like EPIFIL while at the same time trimming the variance of individual-based models like LYMFASIM and TRANSFIL, allowing a synthesis of predictions that takes account of both the current uncertainties and biases pertaining to the conceptualization and implementation of available single models and a likely better coverage of the whole system space (Martre et al., 2015).
Given that the scenario modelling work carried out in this study is effectively an inverse solution exercise, it is important to assess the respective predictions for consistency and coherence in order to determine the reliability of the results for supporting parasite management. Here, we highlight a major finding in this regard that is consistent with previous work in modelling LF interventions, viz. that, in line with the single-model predictions, the multi-model ensemble forecasts of the number of annual MDAs required to meet the WHO 1% mf target across the three regional scenarios also tended to be positively related to baseline prevalence (Irvine et al., 2015;Michael and Singh, 2016;Singh and Michael, 2015;Jambulingam et al., 2016;Michael et al., 2004;Singh et al., 2013). Thus, while the multi-model ensemble predicted between 2 and 5 years of MDA for the Asian scenario (for an initial mf prevalence of 5%), it predicted between 4 and 7 years and between 4 and 10 years of MDA to meet the WHO target in the case of the African (initial prevalence of 10%) and PNG (initial prevalence of 40%) scenarios, respectively. Although these are first findings, and although the respective single- and multi-model predictions take explicit account of differences in vector genus composition between the three scenarios investigated, these results indicate, as previously demonstrated by the single models, that timelines to set targets will be greatly influenced by baseline endemicity. This strongly implies that any variations in baseline prevalence in a setting will lead to site-specific differences in the durations of annual MDA required to break LF transmission (Michael and Singh, 2016;Singh and Michael, 2015;Michael et al., 2004). The policy implication of this result is clear, viz.
that the WHO-recommended six years of MDA is highly unlikely to break LF transmission uniformly everywhere, and that a more flexible approach that takes full account of spatial heterogeneities in baseline initial conditions will be required to effect the global elimination of LF.
In conclusion, this study has shown how, by reducing simulation error and improving the accuracy and consistency of simulation results, multi-model ensembles based on a deterministic weighted combination of single-model predictions may offer a powerful tool for reproducing LF infection patterns in different settings and for efficiently forecasting the effects of interventions. This work represents our first attempt to develop a multi-model LF ensemble, and it is clear that future research will be required to extend the weighting scheme and the performance standard used in the initial framework presented here. In particular, this work will benefit from adopting the emerging statistical post-processing methods developed in the field of ensemble weather forecasting, including various bias correction methods and methods from Bayesian calibration and ensemble modelling, to yield a more robust approach for the synthesis of multiple parasite transmission models (Ramin et al., 2012;Lindström et al., 2015;Raftery et al., 2005;Viney et al., 2009;Martre et al., 2015;Vrugt et al., 2008). Since the accuracy of multi-model ensembles depends on the accuracy of their constituent single models, future research should also focus on better calibration of the LF models used to generate the ensemble of model outputs (Makowski, 2015). A continued effort to develop reliable single models will increase their individual ability to support reliable decision making in parasite control and, in turn, improve the skill of multi-model ensembles to do so as well. While these technical improvements will enhance reliable LF ensemble modelling, it is also important to consider the organizational and collaborative mechanisms required to develop such ensemble frameworks so that predictions may be delivered to policy makers effectively.
A first need in this regard is perhaps the provision of an open forum for the exchange of observations, including LF monitoring data from endemic countries, models, and ensemble configurations and results, with the entire LF community. Currently, sharing of such observations and model results is done primarily on a bilateral basis through individual contact, although modelling collaboratives, such as the Gates Foundation-funded Neglected Tropical Disease Modelling Consortium, are clearly beginning to overcome this issue. Recent developments based on Science Gateways (Manset et al., 2012;Foster, 2005;McLennan and Kennell, 2010), by which collaborative tools are made available via the internet, may, however, provide a more effective way to achieve this, and should be explored in earnest if ensemble-based modelling is to become an effective tool for guiding decisions in the global fight against parasitic diseases.