Utilizing Nontraditional Data Sources for Near Real-Time Estimation of Transmission Dynamics During the 2015-2016 Colombian Zika Virus Disease Outbreak

Background: Approximately 40 countries in Central and South America have experienced local vector-born transmission of Zika virus, resulting in nearly 300,000 total reported cases of Zika virus disease to date. Of the cases that have sought care thus far in the region, more than 70,000 have been reported out of Colombia. Objective: In this paper, we use nontraditional digital disease surveillance data via HealthMap and Google Trends to develop near real-time estimates for the basic ( R 0 ) and observed ( R obs ) reproductive numbers associated with Zika virus disease in Colombia. We then validate our results against traditional health care-based disease surveillance data. Methods: Cumulative reported case counts of Zika virus disease in Colombia were acquired via the HealthMap digital disease surveillance system. Linear smoothing was conducted to adjust the shape of the HealthMap cumulative case curve using Google search data.


Introduction
Recent infectious disease outbreaks-including severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), Ebola, and influenza A (H1N1)-have presented great challenges to the global public health community, including lack of basic epidemiologic knowledge to support important preparedness and control decisions. To address this gap, innovative surveillance methods have been developed over the last several years to leverage the increasing availability of digital data related to outbreaks. To date, many studies have retrospectively examined nontraditional digital data sources and have demonstrated their utility in estimating epidemic curves or changes in important epidemiologic parameters over time [1][2][3]. Such studies have provided a foundation for building near real-time prospective analytic techniques that can assess transmission dynamics in the absence of traditional data. These methodological developments fill a knowledge vacuum that may prove useful for public health decision making in the early stages of an outbreak.
The ongoing outbreak of Zika virus disease in Central and South America has attracted global attention due to its rapid geospatial growth as well as concerns over associated central nervous system complications [4,5]. Although Zika virus is primarily transmitted via Aedes mosquitoes, evidence of vertical and sexual transmission exists [6][7][8]. Likely introduced to the Americas in mid-to late 2013, the virus has since been propagated by the density of competent vectors throughout the region [8]. At present, approximately 40 countries in Central and South America have experienced local vector-borne transmission, resulting in nearly 300,000 total reported cases to date [9]. Of the cases that have sought care thus far in the region, about 70,000 have been reported out of Colombia, of which 17% were pregnant at time of clinical or laboratory diagnosis [9,10]. However, given the generally mild nature of Zika virus disease and subsequent lack of care seeking, reported cases undoubtedly comprise a small fraction of total cases [11,12].
Current prevention efforts focus on vector suppression [13], while interest in and efforts toward vaccine development are mounting rapidly due to increasing rates of Guillain-Barré syndrome following Zika virus infection and microcephaly in newborn babies born to women infected with Zika virus while pregnant [4,5]. Quantitative analyses designed to inform vaccine policies-in addition to other preparedness and control activities-are dependent on the transmission dynamics associated with the disease and, therefore, estimates for critical epidemiologic parameters are urgently needed for such decision making within the context of Zika virus disease.
In this paper, we use nontraditional digital disease surveillance data via HealthMap and Google Trends to develop near real-time estimates for the basic and observed reproductive numbers associated with Zika virus disease in Colombia as well as expected final outbreak size and duration. We then validate our results against traditional health care-based disease surveillance data and discuss the implications of our work on outbreak mitigation strategies in Colombia and assessment of transmission dynamics elsewhere in the region.

Methods
Cumulative reported case counts of Zika virus disease in Colombia were acquired via the HealthMap digital disease surveillance system, consisting of 28 unique nongovernmental media alerts between October 16, 2015 and April 16, 2016 [14]. The cumulative reported case curve obtained from these reports shows an unrealistic L-shape, presumably due to increased interest in reporting during recent weeks and lack of awareness during early weeks ( Figure 1). By assuming that the total number of cases obtained from HealthMap was a reasonable approximation of reality for the given time period, we used Google search data to distribute cumulative reported case counts more realistically over time.
Although many cases of dengue and influenza go undetected, previous studies have shown that relevant Google search trends demonstrate high linear correlation with reported disease incidence over time [15,16]. Thus, we obtained weekly Google search fractions of the term "Zika" from Colombia via the Google Trends website (accessed on April 29, 2016). These search fractions are displayed weekly as normalized values that range from zero to 100, which reflect the level of nationwide search interest in the word "Zika" from January 4, 2004 (first available datum) to April 16, 2016.
We created a smoothed cumulative incidence curve (referred to as "smoothed HealthMap") by scaling the Google search curve against the HealthMap-reported Zika cases [17]. The scaling constant was obtained by dividing the most recent total number of HealthMap-reported Zika cases by the total number of Google search fractions from May 31, 2015 to April 16, 2016. Perhaps due to initial delays in reporting, the first relevant uptick of the term "Zika" in the Google Trends data occurred during the week of May 31, 2015, approximately 20 weeks before the first HealthMap alert of Zika virus disease in Colombia. Because of this, May 31, 2015 was selected as the start date for modeling efforts using smoothed HealthMap data; April 16, 2016 (last available datum at time of manuscript preparation) was selected as the cut-off date.
Due to successful applications in other data-scarce (ie, cumulative incidence only) settings, the Incidence Decay and Exponential Adjustment (IDEA) model was used to estimate the basic reproductive number (R 0 ) and the discount factor (d) associated with the ongoing outbreak [2,18,19]. Both R 0 and d were solved for using nonlinear optimization to minimize the sum of squared differences (SSD) between reported (user-inputted) and modeled cumulative incidence (I) curves across multiple serial intervals (ie, outbreak generations). Figure  2 presents a formulation for I expressed in terms of R 0 and d, where t is the number of outbreak generations (ie, serial intervals) that have passed thus far and is inversely proportional to the serial interval length (ie, number of days per serial interval [SI]). Given that distribution for the SI associated with Zika virus disease had not yet been established, R 0 and d were solved for iteratively over a range of 14 deterministic lengths (10-23 days) [20].
These values of R 0 and d were then used to define maximum, minimum, and mean values for the observed reproductive number (R obs ), final reported outbreak size (I max ), and final reported duration (t max ).The observed number of secondary infections per infected individual for a given value of t (R obs ) was calculated using the following equation: When d is greater than zero, R 0 does not equal R obs . In such circumstances, disease incidence is nonexponential due to either planned or unplanned reductions in disease duration, contact rate, or infectiousness of cases [18]. Likewise, final reported outbreak duration (t max ) was calculated as follows [18]: Final reported outbreak duration can also be expressed in days by multiplying t max by SI; however, in calculating I max , original units (ie, outbreak generations) are used ( Figure 3).
In the event that a viable vaccine is developed before the ongoing outbreak in Colombia ends (t max ), the following equation was used to assess the percentage of the susceptible population that would need to be immunized against Zika virus (%Vax) to eliminate transmission, assuming 100% vaccine efficacy: %Vax=1-(1/ R obs ).
After completion of the analyses on the digital surveillance data, we performed a validation study using traditional surveillance data obtained from weekly Instituto Nacional de Salud (INS) (National Institute of Health, Colombia) epidemiological bulletin publications [21]. The INS first reported incidence of Zika virus disease in Colombia on October 16, 2015. However, subsequent publications indicated that the outbreak likely began during epidemiologic week 32 of 2015 or earlier [22]. As result, August 22, 2015 was selected as a start date for modeling efforts using INS data. April 16, 2016 (date of the most recent publication at time of manuscript preparation) was selected as the cut-off date [22]. The analyses described previously for the smoothed HealthMap dataset were conducted on the INS dataset as well, resulting in R 0 , d, R obs , I max , t max , and %Vax estimates for both digital (smoothed HealthMap) and traditional (INS) cumulative reported case data.
Using the digital (smoothed HealthMap) cumulative case counts, we estimated a mean R 0 of 3.26 (range 1.91-5.05) and a mean d of 0.04 (range 0.01-0.07) across 14 deterministic serial interval lengths (range 10-23 days) ( Figure 6). We then calculated a mean R obs of 1.63 (range 1.31-2.05), a mean I max of 85,546 cases (range 80,028-93,885 cases), and a mean t max of 530 days (range 522-538 days; November 2016). Cumulative reported case projections using these modeled parameters are shown in Figure  7.
The traditional (INS) data yielded a mean R 0 of 5.36 (range 2.52-9.63) and a mean d of 0.07 (range 0.02-0.14) across 14 deterministic serial interval lengths (range 10-23 days) ( Figure  8). Using these, we calculated a mean R obs of 1.96 (range 1.45-2.58), a mean I max of 77,386 cases (range 76,587-78,619 cases), and a mean t max of 387 days (range 382-392 days; September 2016). Cumulative reported case projections using these modeled parameters are shown in Figure 9.
Although R 0 values calculated using the traditional (INS) data were general higher than those calculated using digital (smoothed HealthMap) cumulative case counts (SSD=82.14), R obs values were quite similar across data sources (SSD=1.84).

Discussion
When depletion of susceptible individuals due to infection (ie, via death or immunity-conferred recovery) is small relative to the total population, basic reproductive numbers obtained using the IDEA model are comparable to simple SIR-type models [19]. Although they are especially suitable for use in data-scarce settings, SIR-type models-and, by extension, the IDEA model-cannot easily incorporate global dynamics, such as the importation and exportation of infectious agents (ie, vectors and humans) or significant climate events (ie, El Niño and La Niña). Nevertheless, others have demonstrated that simple SIR-type models perform similarly to complex mechanistic models when describing the transmission dynamics of vector-borne and water-borne diseases in localized contexts [23,24]. As a result, the IDEA model is a reasonable method for analyzing nationwide transmission dynamics of Zika virus disease in Colombia.
As defined by the IDEA modeling method, R 0 represents potential transmissibility of a given pathogen in a fully susceptible, naïve population; meanwhile, R obs represents observed transmission in the face of existing interventions, as captured by d [2,18,19]. In this sense, the R obs is similar to the effective reproductive number (R t ), which represents transmissibility in a population that is not fully susceptible. Mean modeled estimates for R 0 across both data sources were consistent with R 0 estimates for Zika virus disease in French Polynesia and with R 0 estimates for chikungunya and dengue [12,25,26]. Mean modeled estimates for R obs were also comparable to R t estimates for chikungunya and dengue [27,28]. To take into account the effects of ongoing transmission control efforts, R obs was used instead of R 0 to calculate %Vax.
In this study, we found that using the traditional (INS) data yielded higher R 0 estimates than the digital (smoothed HealthMap) cumulative reported case counts. Nevertheless, because estimates for d were also higher, modeled ranges for R obs and %Vax were comparable across both data sources. Similarly, the narrow range of possible case projections generated by the traditional (INS) data was largely encompassed by the wider range produced by the digital (smoothed HealthMap) cumulative reported case counts. Therefore, in the absence of traditional health care-based surveillance data, important epidemiologic parameters may be estimated using smoothed digital surveillance data as described here.
The methods used in this study are not without limitations. For both data sources, estimates for country-level case projections and I max apply only to those that seek care; true caseloads are likely to be as much as five times higher than those that are reported [11,12]. Furthermore, because country-level data are utilized, in-country transmission heterogeneities are not captured. As geographic granularity of digital surveillance data improves, similar analyses should be conducted at smaller scales. Nevertheless, given that projection models are designed to serve as decision-support tools, estimating the number of cases that will report to hospitals and clinics over the next several months-even at the country level-is still valuable for the purposes of resource allocation. This may be especially pertinent with respect to diagnostic support for pregnant women presenting with clinical symptoms for Zika virus disease. To date, nearly 20% of all reported Zika virus disease cases in Colombia have been pregnant women; if the current rate holds, thousands more may be infected and seek care before the outbreak ends. However, the projections presented in this paper only apply in the event that circumstances remain unchanged (eg, no new interventions are put in place).
With improved compliance, vector suppression interventions (eg, elimination of standing water, exhaustive use of insect repellant) have the potential to bring this outbreak to a swift close, even in the absence of a vaccine. In the event that a viable vaccine can be developed before the outbreak ends, our estimates suggest that approximately half of the susceptible population would need to be immunized to confer herd immunity. Considering the growing body of evidence linking Zika virus infection during pregnancy to microcephaly in newborn babies, women of childbearing age should be given priority if the option becomes available [4,5].
Regardless of whether a vaccine reaches the market before the outbreak in Colombia ends, the data acquisition and modeling approach presented in this paper may still benefit other Zika-affected countries with limited capacity for government-implemented health care-based data collection. Although traditional surveillance data should be used preferentially, in its absence digital surveillance data can yield comparable estimates for key transmission parameters. It has been shown that digital surveillance data can be used retrospectively to assess transmission dynamics of well-understood pathogens (eg, Vibrio cholerae); however, our findings suggest that similar analyses can also be conducted in near real time for emerging infectious diseases [3]. Moreover, the epidemiologic parameters estimates from these analyses may be readily updated as new information emerges, enabling prospective tracking of transmission dynamics at the country level despite data scarcity.
Recent history has shown the need for rapid epidemiologic assessments to better inform intervention strategies in the face of a public health emergency. For effective evaluation of such interventions, baseline estimates for transmissibility-like those described in this study-must be established. Furthermore, changes in outbreak dynamics must be closely monitored in order to assess the impact of active interventions on disease transmission. Our approach offers an important alternative to guesswork based loosely on related diseases and previous outbreaks. Given the absence of traditional surveillance data and transmission heterogeneities across Central and South America, digital surveillance data can and should be used to conduct similar analyses for other Zika-affected countries in the months ahead.