Surveilling Influenza Incidence With Centers for Disease Control and Prevention Web Traffic Data: Demonstration Using a Novel Dataset

Background: Influenza epidemics result in a public health and economic burden worldwide. Traditional surveillance techniques, which rely on doctor visits, provide data with a delay of 1 to 2 weeks. A means of obtaining real-time data and forecasting future outbreaks is desirable to provide more timely responses to influenza epidemics.

Objective: This study aimed to present the first implementation of a novel dataset by demonstrating its ability to supplement traditional disease surveillance at multiple spatial resolutions.

Methods: We used internet traffic data from the Centers for Disease Control and Prevention (CDC) website to determine the potential usability of this data source. We tested the traffic generated by 10 influenza-related pages in 8 states and 9 census divisions within the United States and compared it against clinical surveillance data.

Results: Our results yielded an r² value of 0.955 in the most successful case, promising results for some cases, and unsuccessful results for other cases. In the interest of scientific transparency, to further the understanding of when internet data streams are an appropriate supplemental data source, we also included negative results (ie, unsuccessful models). Models that focused on a single influenza season were more successful than those that attempted to model multiple influenza seasons. Geographic resolution appeared to play a key role, with national and regional models being more successful, overall, than models at the state level.

Conclusions: These results demonstrate that internet data may be able to complement traditional influenza surveillance in some cases but not in others. Specifically, our results show that the CDC website traffic may inform national- and division-level models but not models for each individual state. In addition, our results show better agreement when the data were broken up by seasons instead of aggregated over several years. We anticipate that this work will lead to more complex nowcasting and forecasting models using this data stream.


Introduction Every year, an estimated 5% to 20% of people in the United States become infected with influenza [1]. The typical influenza season begins in October and ends in May, with the peak occurring in the winter months. Annually, 3,000-50,000 people die from the flu, with another 200,000 requiring hospitalization [2]. The yearly flu burden is estimated to cost around $87 billion in lost productivity [2]. Timely surveillance of influenza can help reduce this burden, allowing health care facilities to more adequately prepare for the influx of patients when flu levels are high [3].
One common surveillance measure is the fraction of patients presenting with influenza-like illness (ILI), defined as a fever of at least 100 °F (37.8 °C) and a cough or sore throat with no other known cause [4]. ILI data are collected from about 2,900 volunteer health care providers throughout the United States, although each week only about 1,800 report their data. These data are then aggregated and made public after a time lag of about 1-2 weeks [1,5-9]. Because the ILI data are collected from volunteer providers, the data set is incomplete. If policies were enacted to provide incentives for health care providers to report, or to make reporting compulsory, the result would be a more complete data set. Other surveillance systems include virological data from the World Health Organization, emergency department visits, electronic health records, crowd-sourced ILI reports, Widely Internet Sourced Distributed Monitoring, Influenzanet, and Flu Near You [10,11].

Internet Data Streams
In the United States, 87% of adults use the Internet [12]. Of those Internet users, 72% have searched for health information online within the last year [12]. The most common health-related searches are for information regarding a specific disease or condition (66%) and information about a specific treatment or procedure (56%) [12,13].
There are two main types of health-related Internet activity. The first is health sharing, in which Internet users post about health-related topics (e.g., a tweet about being sick). The second is health seeking, in which users utilize the Internet to obtain information about health-related topics [5]. In this paper, we focus on health-seeking behavior. Previous studies have shown that analyzing online health-seeking behavior can improve early detection of disease incidence by detecting changes in disease activity [8,14-18]. Similarly, other studies have shown that Internet data emerging from search queries can aid detection of outbreaks in areas with large populations of Internet users [19], because online health-related search queries and epidemics are often strongly correlated [19,20].
Internet data have been used to forecast disease incidence in other models. Polgreen et al. developed linear influenza forecasting models with lags of 1 to 10 weeks for each of the 9 U.S. census regions using search queries from Yahoo [8]. The best performing models had lags of 1-3 weeks and an average r² of 0.38 (with a high of 0.57 in the East South Central region) [8]. These low r² values demonstrate potential problems in relying on search information alone. Ginsberg et al. were able to predict influenza epidemics two weeks in advance using Google search queries to fit linear models using log-odds of ILI visits and related searches [14].
Using a Poisson distribution and LASSO regression, McIver and Brownstein obtained an r² value of 0.946 using Wikipedia data [7], although some data were excluded from their analyses because of increased media attention and higher than normal influenza activity. Generous et al. also used Wikipedia page view data to monitor and forecast disease incidence [21]. For OLS nowcasting, the r² value was 0.98 in the best case; for the best fit, the weekly data were offset by one week [21].
As part of the CDC's 2013-2014 Predict the Influenza Season Challenge, 9 teams used digital data sources to create forecasting models. The digital sources these teams utilized were Wikipedia, Twitter, Google Flu Trends, and HealthMap. The teams used either mechanistic or statistical models to create their forecasts, with the most successful team using multiple data sources, which may have reduced the biases usually associated with Internet data streams [22]. Broniatowski et al. used Twitter data to detect increasing and decreasing influenza prevalence with 85% accuracy [23]. Zhang et al. used Twitter data to inform stochastic, spatially structured mechanistic models of influenza in the United States, Italy, and Spain [24].
Internet data streams have also been used to supplement traditional surveillance techniques with nowcasting models. Paul et al. used Twitter along with ILI data from the CDC to produce nowcasting influenza models, as well as nowcasting models using ILI data alone, and concluded that the addition of Twitter data led to more accurate nowcasts [25].

Santillana et al. combined Google Trends data and CDC-reported ILI data to create models for nowcasting and forecasting influenza [26]. Lampos et al. used search query data to explore both linear and nonlinear nowcasting models [27]. Yang et al. used Google search data to create an influenza tracking model with autoregression [28].
In contrast, we consider data on page views of the CDC website rather than search data from sites not solely devoted to public health. We use this data set because we expect it to be inherently less noisy, given the site's focus on public health issues. We use ordinary least squares (OLS) to nowcast influenza nationally, across the 9 U.S. census divisions, and across 8 states using access data from 10 influenza-related CDC pages. Our nowcasting models cover influenza seasons from 2013 to 2016, with the 2012-2013 season being partially included because our data set begins January 1, 2013. The inclusion of an incomplete influenza season serves to show whether this data set can be used given a more restrictive time frame. We include both positive and negative results to advance our knowledge of when Internet data may or may not work. The negative results are crucial to advancing the field of disease surveillance using Internet data, as they demonstrate when these data sources contribute to unreliable surveillance. We focus on answering the following two research questions:

Q1: Can CDC page visits be used as an additional data source for monitoring disease incidence?

Q2: What is the appropriate shift needed to obtain the best data fit?

Data Sources
We used page view data provided by the Centers for Disease Control and Prevention (CDC). Each data point contains the page name, date and time of access, and the geographic location from where the page was viewed. These data are available at geographic resolutions of national and state levels and include some metropolitan areas (e.g., New York City). The data are available at a number of temporal resolutions beginning on January 1, 2013. For these models, we use weekly page view data to coincide with the ILI data temporal resolution. The data are available as raw page view counts and normalized page view counts, and we consider the latter for this work. We selected pages associated with general influenza information, treatment, and diagnosis. Pages were sometimes renamed, but we were able to follow the evolution of each selected page by utilizing key words in the page titles as well as the date ranges for available data.
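The weekly aggregation described above can be sketched as follows. This is a minimal illustration using pandas with made-up records; the field names ("page", "timestamp", "state") and page titles are illustrative, not the actual schema of the CDC data set. The grouping uses weeks ending Saturday, matching the CDC's MMWR week convention.

```python
# Sketch: aggregate raw page view records into weekly counts per page.
# Records and field names are illustrative stand-ins for the CDC data.
import pandas as pd

records = pd.DataFrame(
    {
        "page": ["symptoms", "symptoms", "treatment", "symptoms"],
        "timestamp": pd.to_datetime(
            ["2013-01-01", "2013-01-03", "2013-01-02", "2013-01-09"]
        ),
        "state": ["TX", "TX", "TX", "TX"],
    }
)

# Count page views per page per week (weeks ending Saturday, as in MMWR weeks).
weekly = (
    records.groupby(["page", pd.Grouper(key="timestamp", freq="W-SAT")])
    .size()
    .rename("views")
    .reset_index()
)
print(weekly)
```

The same grouping extends to any temporal resolution the raw data support; weekly counts are used here only to match the cadence of the ILI reports.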
Because the majority of health-related Internet searches concern specific conditions, treatments, and procedures [13], we selected pages related to those topics. These pages also align with Johnson et al., who used pages in the categories of Diagnosis/Treatment and Prevention/Vaccination for influenza surveillance [29]. Specifically, we used the following pages: antivirals, flu basics, FluView, high risk complications, key facts, prevention, symptoms, treating influenza, treatment, and vaccine. We then aggregate the page views of interest for each of our models. A complete list of pages can be found in the appendix. The states we selected were chosen based on the severity of flu (determined from FluView) during the available seasons and on the availability of ILI data, which is not standardized and depends on each state's reporting mechanism. ILI data for each state include the week ending or starting date as well as the percentage of influenza-like illness for the specified week. While some states also report additional data, such as school closures and hospitalizations, these data are not made available by every state.
Note that the ILI reporting and accessibility vary across all the states. The states we selected were 1) California, 2) Maine, 3) Missouri, 4) New Jersey, 5) New Mexico, 6) North Carolina, 7) Texas, and 8) Wisconsin. With the exception of Texas, these states did not release ILI data outside of the typical flu season. A complete list of the data sources for the state ILI can be found in Appendix B, and the clinical data are available in Appendix E. Figure 1 shows the percentage of ILI visits for each state considered in this study as well as the national percentage of ILI visits. We see distinct spikes that indicate the peaks of the flu seasons.
With the exception of Maine, which behaves as an outlier at times, each state's curve shows spikes indicating distinct "peak" weeks of influenza activity. Texas also exhibits outlier behavior, with ILI percentages consistently higher than the typical national baseline of 2%, which is used to determine when the flu has reached epidemic status. These two outliers are shown in teal (Texas) and dark blue (Maine). The national ILI is shown in black. The remaining states exhibit behavior consistent with the national ILI trend. Figure 2 shows the CDC page view data as a heat map: weeks with more page views are shown darker than weeks with fewer page views.

Linear Regression
We used statsmodels version 0.9.0, a statistical analysis module for Python, to perform linear regression on our data sets using OLS. This creates a linear model M of the form

M = α₀ + α₁X₁ + α₂X₂ + … + αₙXₙ,

where the αᵢ are the regression coefficients and X = (1, X₁, X₂, …, Xₙ) is the vector of CDC page view data, with n representing the number of CDC pages used for the model, ranging from 1 to 10. We correlate ILI and CDC page views for the same week or with a one-week shift. In the shifted cases, we shift the ILI data forward by one week, so that the model associates the current week's page views with the following week's ILI data. This shifting is performed to account for the incubation period of influenza and the time between the onset of symptoms and the first doctor visit.
Statsmodels uses the CDC page view and ILI data to determine the appropriate regression coefficients, fits the parameters with OLS, and computes the goodness of fit, r², also referred to as the coefficient of determination. The r² value measures how well two time series correlate: r² = 1 indicates a perfect fit, while r² = 0 indicates no correlation. Although r² is not necessarily the best metric for judging goodness of fit [5], it is the most common one and still provides a decent overall sense of fit quality. Additionally, we examined the root mean squared error (RMSE) and the normalized root mean squared error (NRMSE) using Python's scikit-learn library.
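The three metrics can be computed as follows. This sketch uses toy series, and it normalizes the RMSE by the range of the observed series, which is one common convention; the paper does not specify which normalization it used.

```python
# Sketch: r^2, RMSE, and NRMSE for a fitted vs. observed series.
# The normalization by the observed range is an assumed convention.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

observed = np.array([1.0, 2.0, 4.0, 3.0, 2.0])  # toy ILI-like series
fitted = np.array([1.2, 1.8, 3.9, 3.2, 2.1])    # toy model output

r2 = r2_score(observed, fitted)
rmse = np.sqrt(mean_squared_error(observed, fitted))
nrmse = rmse / (observed.max() - observed.min())
print(r2, rmse, nrmse)
```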

Results
We analyzed the data at the national, division, and state levels and computed r² for each geographic resolution. In this section, we discuss the results of our experiments, both successes and failures, and include figures of models at the national, census division, and state levels. Because of the varying scales between page views and ILI percent, we choose to normalize the data and our models in order to plot them on the same axes. We use raw data to create the models, and then we normalize each model with respect to its maximum. We also normalize the ILI data and CDC.gov web traffic data with respect to their maximums for the given time period so that all three curves may appear in the same plot.
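The max-normalization used for plotting amounts to dividing each series by its own maximum, as in this small sketch (toy values only):

```python
# Sketch: scale each series by its maximum so all curves share a [0, 1] axis.
import numpy as np

def normalize_to_max(series: np.ndarray) -> np.ndarray:
    """Scale a series so its maximum value is 1."""
    return series / series.max()

ili = np.array([1.5, 2.0, 6.0, 3.0])             # toy ILI percentages
page_views = np.array([800.0, 1200.0, 4000.0, 1500.0])  # toy weekly views
print(normalize_to_max(ili))         # peak week maps to 1.0
print(normalize_to_max(page_views))
```

Note that this rescaling is applied only for visualization; the models themselves are fit on the raw data.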

National Results
We selected pages that corresponded to the topics most often searched during online health-seeking activities. When we combined all ten pages, we were able to achieve an r² value of 0.889 in the best national case. Table 1 shows the most successful model for each influenza season included in this study, and Figure 3 shows these models.

Census Division Results
Using the data for each of the nine census divisions, we were able to achieve r² > 0.7 in at least one case for each division. We considered all seasons together and separately, with the better results coming from modeling each individual season. We considered all pages together as well as the pages most closely associated with the topics most commonly searched by health-seeking individuals. In the most successful case, the model was able to closely match the 2015-2016 influenza season for the West North Central division with an r² of 0.955 using the FluView, Symptoms, and Treatment pages. Although we had successes using all 10 pages, the most successful model for each division involved only these three pages. Figure 4 shows some of these models, and Table 2 highlights these successes.

State Results
We computed r² for each of the states considered in this study, using a variety of pages and page combinations. Table 3 lists the most successful model for each state, the season, the data shift, and the r² value. These models aggregated all 10 pages, and the success varied by state.
We were not surprised that Texas had the best fit. Texas was the only state we included that provided ILI data not only for the typical influenza season but also for the off-season, and this additional data likely contributed to the success of the Texas models. The lack of success we encountered in modeling Maine was also expected because of Maine's outlier ILI behavior, with values considerably lower than, and out of pattern with, those of other states. The models in Figure 5 included all 10 pages aggregated together. However, as the individual state results indicate, this does not always lead to the best fit: successful models often included a combination of select pages (such as FluView, Symptoms, and Treatment) rather than an aggregation of all 10. Furthermore, aside from Texas, we did not have ILI data for the states outside of the typical flu season. Without this additional data, we are unable to determine how strongly the lower page views in the off-season correlate with off-season ILI.
We then shifted the ILI data forward by one week. The regression analysis yielded 7 state/season combinations with r² values greater than 0.7 (see Table 4). The table also includes both the regular and normalized root mean squared errors. Again, although the correlation appears to be weak, using these same three pages (FluView, Symptoms, and Treatment) with a one-week shift produced stronger correlations than taking all 10 pages together.

Model Failures
We generally found the models to be successful when considering pages most closely related to the topics most commonly searched during online health-seeking activities, such as specific diseases and treatments, and less successful when using broader page sets or longer time spans.

Conclusions
Internet surveillance data have proven beneficial in predicting ILI incidence during flu seasons. However, our results show that the benefit of Internet data streams for informing disease surveillance is inconclusive. That is, our work shows that the CDC website traffic can be informative in some cases (e.g., at the national level) but not in others (e.g., at the state level). To determine the extent, we return to our original research questions.

Q1:
Given the successes of some of our models, we can conclude that CDC page view data can be used as an additional data source for monitoring disease incidence in some cases (for example, at the national level). The degree to which these data can be used appears to depend on the page selection and time frame. We obtained successful nowcasts when selecting pages related to the topics most commonly used for online health queries (specific diseases and treatments) during the time span of a typical influenza season. Longer time spans and pages less associated with specific diseases and treatments led to less successful models. These results can assist others in selecting appropriate supplemental data sets for disease surveillance.

Q2:
We obtained our most successful results using a one-week shift. Two-week shifts were successful in some cases but were overall less correlated than one-week shifts. Using no shift at all proved successful in some cases but not in others. We surmise that the shift required for the best fit depends upon the incubation period for the disease in question as well as the time period of reporting. The CDC Internet data are available daily; however, ILI data are available weekly, so we are limited in the types of shifts we can apply to the data sets.
We conclude that more studies on Internet data streams are needed to understand when and why Internet data works. Our methods are consistent with other feasibility studies and provide insight into conditions under which Internet data streams may inform influenza models. Future work should include rigorously testing the predictive power of the models by separating data into training and testing sets [5].

Conflicts of Interest
None declared.