Forecasting the West Nile Virus in the United States: An Extensive Novel Data Streams–Based Time Series Analysis and Structural Equation Modeling of Related Digital Searching Behavior

Background West Nile virus is an arbovirus responsible for an infection that tends to peak during the late summer and early fall. Tools monitoring Web searches are emerging as powerful sources of data, especially concerning infectious diseases such as West Nile virus. Objective This study aimed at exploring the potential predictive power of West Nile virus–related Web searches. Methods Different novel data streams, including Google Trends, WikiTrends, YouTube, and Google News, were used to extract search trends. Data regarding West Nile virus cases were obtained from the Centers for Disease Control and Prevention. Data were analyzed using regression, times series analysis, structural equation modeling, and clustering analysis. Results In the regression analysis, an association between Web searches and “real-world” epidemiological figures was found. The best seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) was found to be (0,1,1)x(0,1,1)4. Using data from 2004 to 2015, we were able to predict data for 2016. From the structural equation modeling, the consumption of West Nile virus–related news fully mediated the relation between Google Trends and the consumption of YouTube videos, as well as the relation between the latter variable and the number of West Nile virus cases. Web searches fully mediated the relation between epidemiological figures and the consumption of YouTube videos, as well as the relation between epidemiological data and the number of accesses to the West Nile virus–related Wikipedia page. In the clustering analysis, the consumption of news was most similar to the Web searches pattern, which was less close to the consumption of YouTube videos and least similar to the behavior of accessing West Nile virus–related Wikipedia pages. Conclusions Our study demonstrated an association between epidemiological data and search patterns related to the West Nile virus. Based on this correlation, further studies are needed to examine the practicality of these findings.


Introduction
West Nile virus, first isolated in Uganda in 1937, is a widely distributed arbovirus belonging to the Flavivirus genus and to the Flaviviridae family that can cause West Nile fever. This mosquito-borne infection has a seasonal trend with peaks during summer and autumn. In 70% to 80% of West Nile fever cases, no or few symptoms are reported [1]. Symptomatic infections generally consist of self-limited influenza-like illness with high-degree fever, chills, myalgia, and arthralgia, which last for approximately 5 days. Although rare at a rate of approximately 1%, neurological diseases, such as meningitis, encephalitis, meningoencephalitis, or poliomyelitis, can occur and are of great concern because they are characterized by many sequelae, especially among elderly hospitalized patients [2].
West Nile virus was detected in North America for the first time in August 1999 during an outbreak that occurred in College Point, Queens, in New York City. A cluster of human encephalitis cases, all residing in the same 16-square-mile area, was identified by Drs Deborah Asnis (a local physician based in Queens), Marcelle Layton, and Annie Fine (of the New York City Department of Health) [2]. This cluster was preceded by anecdotal reports of dead animals and birds at the Bronx Zoo, including American crows (Corvus brachyrhynchos), Chilean flamingos (Phoenicopterus chilensis), and a snowy owl (Bubo scandiacus). New human cases were subsequently diagnosed in Brooklyn, the Bronx, and Manhattan. Since then, West Nile virus has spread to the contiguous states, from the Mississippi River to the Pacific Coast, causing further outbreaks such as those that occurred during the summer of 2002. Cases decreased from 2008 to 2011, but in 2012 a major outbreak took place with its epicenter in the Dallas-Fort Worth, Texas metroplex [3,4]. The Texas Public Health authority issued a public health emergency, which attracted a lot of media coverage and public opinion reaction.
The widespread resurgence of human West Nile virus disease in 2012 following several years of relatively low incidence has highlighted the continued public health hazard posed by West Nile virus, and has emphasized the need for more accurate predictive models of when and where new West Nile virus outbreaks will occur.
Web-based tools are emerging as remarkable sources of data, especially for infectious diseases, by enabling Web search monitoring in real time and potentially capturing epidemiologically relevant information [5][6][7]. Infodemiology (a portmanteau of information and epidemiology) and infoveillance (a portmanteau of information and surveillance) indicate the emerging "science of distribution and determinants of information in an electronic medium, specifically the internet, or in a population, with the ultimate aim to inform and improve public health and public policy" [8]. Systematically tracking and monitoring, collecting, and analyzing health-related demand data generated by novel data streams could have the potential to predict events relevant for public health purposes, such as epidemic outbreaks, as well as to investigate the effect of media coverage in terms of potential distortions, misinformation, and biases-the so-called "epidemics of fear" [9].
Little is known about West Nile virus-related digital behavior. To the best of our knowledge, only a few authors have investigated this topic. Carneiro and Mylonakis [10] reported a preliminary qualitative observation about a positive correlation between Google searches and West Nile virus epidemiological cases from 2004 to 2008. They found that the search volume exhibited a cyclical pattern, with regular peaks in August each year, reproducing the epidemiological figures. Also, Web searches related to West Nile fever symptoms (fever, headache, fatigue, rash, and eye pain) were characterized by seasonal patterns. The authors noticed increases in search volume for rash starting in May, just a month before the increases in cases. Furthermore, the top-ranked US cities in terms of West Nile virus-related search volumes were in states characterized by the highest epidemiological burden.
Bragazzi and collaborators [11] assessed the association between Web searches and cases in Italy from a quantitative standpoint from 2004 to 2015. They found a correlation of r=.76 (P<.001) and r=.80 (P<.001) between Google searches and "real-world" epidemiological cases in the same study period on a monthly basis and a yearly basis, respectively. The presence of a regular pattern in West Nile virus-related Web queries was confirmed by the partial autocorrelation function analysis and by spectral analysis. From a geospatial point of view, correlation between digital behavior and epidemiological figures yielded r=.54 (P<.05).
However, the potential predictive power of West Nile virus-related Web searches has not yet been explored. To fill this gap in knowledge, we conducted this study.

Methods
West Nile virus-related data were retrieved, downloaded, and analyzed from several novel data streams, including Google Trends, WikiTrends, YouTube, and Google News, as well as from epidemiological repositories.

Novel Data Streams
Google Trends (an open source tool) was mined from inception (2004) to 2015, by searching for West Nile virus in the United States and using the "search topic" option. This strategy enables one to systematically collect all the searches related to a given keyword or list of keywords (in this case, West Nile virus), including synonyms and related terms, not just the precise string of characters typed by users [12]. WikiTrends is a freely available tool that could be used to investigate information seeking behavior concerning West Nile virus. It was mined from inception (2008) to 2015. The viewing of YouTube videos was investigated from 2008 to 2015, using Google Trends and selecting "YouTube" option. Finally, Google News is an open source news aggregator that can be used to explore the media coverage of a given topic. The consumption of West Nile virus-related news was explored from 2008 to 2015 using Google Trends and selecting "Google News" option. For further details concerning novel data streams, the reader is referred to Bragazzi et al [12].

Epidemiological Repositories
Epidemiological data related to West Nile virus cases in the United States were obtained from the Centers for Disease Control and Prevention (CDC) website and the bulletins of the Morbidity and Mortality Weekly Report, a publication of the CDC (data available on a trimester basis).

Statistical Analysis
Novel data streams-generated data were retrieved and downloaded from 2004 for Google Trends and 2008 for the other open source tools to 2015. All data were analyzed on a trimester basis. To detect a potential association with "real-world" epidemiological figures, regression analyses (with time as the confounding variable) were carried out. Furthermore, novel data streams-generated data were modeled as a time series and analyzed using time series analyses. In particular, a seasonal autoregressive integrated moving average model with explicative variable (SARIMAX) was used. By visually inspecting the autocorrelogram and the partial autocorrelogram based on the autocorrelation and partial autocorrelation function, respectively, p (the order of the model or, in other words, the number of time lags), d (the degree of differencing of the model), q (the order of the moving average model), P (the order of the seasonal part of the model), D (the degree of differencing of the seasonal part of the model), and Q (the order of the moving average model for the seasonal part) coefficients and s (lag parameter) were determined. The explicative variable was the number of West Nile virus cases. Different models were run, and the best one was chosen based on the Akaike information criterion (AIC), corrected AIC, and Schwartz Bayesian information criterion values. The best model was used to forecast Google Trends-based relative search volumes for 2016. Furthermore, structural equation modeling and clustering analysis were used to capture the complex interplay between the different novel data streams.
Regression and clustering analyses were performed using SPSS version 24.0 (IBM Corp, Armonk, NY, USA), whereas the SARIMAX models and the structural equation modeling were carried out with XLSTAT (Addinsoft, Paris, France). A P<.05 was considered statistically significant.

Results
Visual inspection of novel data streams-based data showed that each tool captured a specific digital behavior, generating specific curves which were not perfectly superimposable ( Figure 1). Concerning temporal trends, only West Nile virus-related Web searches pattern well-reproduced the epidemiological trend, with most Google queries concentrated in August. For instance, the number of accesses to the West Nile virus-related Wikipedia page (as captured by WikiTrends) and the consumption of YouTube videos exhibited high search volumes also during winter months compared to Google Trends ( Figure 2).
Regression analyses showed a significant correlation between real-world epidemiological data and novel data streams-generated figures only for Google Trends data (Table  1 and Figure 3), with the effect of year (P=.001) and of West Nile virus cases (P<.001) reaching statistical significance.   A Google Trends-based autocorrelogram and partial autocorrelogram are reported in Figure 4. These show statistically significant positive spikes for lags 0, 4, and 8 and lags 0 and 4, respectively. Descriptive statistics for Google Trends-generated data modeled as a time series is shown in Table 2. The best SARIMAX model was found to be (0,1,1)x(0,1,1) 4 (Multimedia Appendix 1 and Figure 5), or a "seasonal exponential smoothing" model, being MA(1)xSMA (1). This kind of model represents a variation of the seasonal random trend, with a fine tuning obtained adding the MA(1) and the SMA(1) components. Its parameters are reported in Table 3.
Concerning structural equation modeling, the consumption of West Nile virus-related news fully mediated the relationship between Google Trends and the consumption of YouTube videos, as well as the relation between this latter variable and the number of West Nile virus cases. Web searches as captured by Google Trends fully mediated the relation between West Nile virus cases and the consumption of YouTube videos, as well as the relation between epidemiological data and the number of accesses to the West Nile virus-related Wikipedia page as captured by WikiTrends (Figure 6a). When adjusting for time as a potential confounding factor (Figure 6b), the consumption of YouTube videos mediated by the consumption of news was found to increase throughout time in a statistically significant way, although when mediated by the number of accesses to the West Nile virus-related Wikipedia page as captured by WikiTrends tended to decrease. Interestingly, the West Nile virus-related Web search behavior decreased over time (as captured by Google Trends and mediated by the number of epidemiological cases).
Clustering analysis showed that the consumption of news was most similar to the Web searches pattern (as captured by Google Trends), which was less close to the consumption of YouTube videos and least similar to accessing the West Nile virus-related Wikipedia page as captured by WikiTrends (as can be seen by the dendrogram in Figure 7).

Principal Findings
Currently, arboviruses are re-emerging infectious agents. This is not a new phenomenon-it has been happening for centuries-but today arboviral re-emergence and dispersion are more rapid and geographically extensive mainly due to globalization and to arthropod adaptation to its effects [13].
In the existing scholarly literature, different predictive models of West Nile virus have been reported. Kala and colleagues [14] described a geographically weighted regression modeling of West Nile virus risk based on environmental parameters (ie, stream density, road density, and land surface temperature) being statistically superior to classical approaches relying on ordinary least squares regression analyses in terms of predictive power. Shand and coworkers [15] improved the performance of the mosquito-based surveillance approach by incorporating rainfall, temperature, and interaction terms between precipitation and temperature as predictors. Rochlin and coauthors [16] also exploited socioeconomic factors and found a higher proportion of the population with college education, increased habitat fragmentation, and proximity to West Nile virus-positive mosquito pools correlated with higher West Nile virus human risk, whereas Trawinski and Mackay [17] relied on meteorological parameters (including average, minimum, and maximum temperature values; precipitation; relative humidity; and evapotranspiration). Day and Shaman [18] used water table depth as a measure of drought and as a proxy of arboviral transmission in two peninsular Florida regions. In another investigation, Shaman and coworkers [19] exploited hydrological variables and found that wetter spring conditions and drier summer conditions predicted increased West Nile virus human risk in Colorado's eastern plains. Ghosh and Guha [20] performed a computational neural network analysis for capturing the nonlinear complex relationship between West Nile virus epidemiology and several different variables, including temperature, precipitation, wetlands, housing age, presence of catch basins and ditches, and mosquito abatement policies. Establishing an accurate predictive model is of crucial importance in arboviral infection control.
In our work, we exploited a variable (West Nile virus-related digital behavior), which has so far not been used in predicting West Nile virus epidemiology. We explored different novel data streams (Google Trends, WikiTrends, Google News, and YouTube) concerning seeking behavior and we were able to find a statistically significant association between epidemiological figures and digital behavior only in the case of Web searches as captured by Google Trends. Furthermore, we computed the best SARIMAX model for the period of 2004 to 2015, and we were able to forecast data related to 2016. Moreover, structural equation modeling and clustering analysis have enabled us to capture the complex interplay between the different novel data streams and the West Nile virus-related digital seeking behavior.
Even if our experience suggests the usefulness of using Google Trends for predicting West Nile virus, this should be considered as a pilot study, calling for the need for making our model more accurate and reliable, and maybe incorporating other variables (eg, environmental, socioeconomic, and ecological ones). This is of fundamental importance when designing and implementing a digital system for West Nile virus surveillance, which could complement the classical one or those actually under experimentation [21,22]. The combination of Google Trends and other predictors could reach an adequate temporal concordance with the real-world epidemiological figures and, therefore, could enable nowcasting or forecasting of new West Nile virus cases.
Our study has some limitations that should be recognized. Some of the novel data streams used provide users with relative, normalized figures, and not with raw, absolute data, thus hindering further mathematical processing and statistical analysis. Another drawback is given by the fact that Google Trends captures only a portion of the entire population, namely the percentage of people using Google as their preferred search engine (although Google is the most commonly used search engine worldwide). Furthermore, we did not perform a content analysis of the West Nile virus-related material; from the existing literature, it is known to be of rather poor quality and to exhibit some degrees of inconsistencies [23,24]. For instance, Birnbrauer and colleagues [23] explored how West Nile virus risk information was portrayed from its 1999 arrival in the United States through the year 2012, analyzing 428 articles obtained through Google News. Authors identified the following themes and topics: action, conflict, consequence, new evidence, reassurance, and uncertainty, with the action frame recurring most frequently. Moreover, West Nile virus risk was found to be improperly communicated, with statistical figures generally inaccurately reported. Dubey and coworkers [24] analyzed a total of 106 West Nile virus-related YouTube videos, 79.24% of which were found to contain useful information about the disease (60.71% related to disease prevention, and 34.52% concerning news and research updates). Videos were typically uploaded by individuals (54.6%) or news agencies (41.8%), but rarely by health care agencies (3.4%). Despite the usefulness of most West Nile virus-related videos, nonuseful videos received more views, both overall and on a daily basis. Moreover, West Nile virus-related digital behavior could have been influenced and, eventually, also distorted by extrinsic variables, such as the media coverage in terms of dissemination of imbalanced and biased information [25]. Some articles have shown that Google Trends does not always match with epidemiological data [25,26], such as in the case of Google Flu Trends [27], even though it is feasible to exploit some statistical techniques to externally revise novel data streams-generated figures, recalibrate them, and improve their accuracy and predictive power [28,29]. As such, the field of "behavioral medicine" remains largely unexplored [30], and because traditional surveillance is plagued by intrinsic limitations, enhanced methods for identification of real-time new cases and assessment of disease patterns and trends are urgently needed [31].

Conclusions
Statistically significant temporal correlations between West Nile virus epidemiological data and Google Trends suggest the feasibility of exploiting Google Trends as an internet-based monitoring tool. This is timely and of crucial importance given the recent re-emergence of arboviral infections. Workers in the field of public health and health authorities should be aware of the public interest and reaction to West Nile virus outbreaks in terms of Web searches. They could exploit the new information and communication technologies both for performing real-time monitoring of new population-based epidemic events and for carrying out a content analysis of the available online material, promptly replying to public concerns and correcting prejudices and inaccurate and misleading reports by disseminating high-quality information. However, based on the previously mentioned limitations of this paper, further studies are warranted to make our model more useful and practical.

Conflicts of Interest
None declared.