Monitoring seasonal influenza epidemics by using internet search data with an ensemble penalized regression model

Guo, Pi; Zhang, Jianjun; Wang, Li; Yang, Shaoyi; Luo, Ganfeng; Deng, Changyu; Wen, Ye; Zhang, Qingying

doi:10.1038/srep46469

Published: 19 April 2017

Scientific Reports volume 7, Article number: 46469 (2017)

Abstract

Seasonal influenza epidemics cause serious public health problems in China. Search queries-based surveillance was recently proposed to complement traditional monitoring approaches of influenza epidemics. However, developing robust techniques of search query selection and enhancing predictability for influenza epidemics remains a challenge. This study aimed to develop a novel ensemble framework to improve penalized regression models for detecting influenza epidemics by using Baidu search engine query data from China. The ensemble framework applied a combination of bootstrap aggregating (bagging) and rank aggregation method to optimize penalized regression models. Different algorithms including lasso, ridge, elastic net and the algorithms in the proposed ensemble framework were compared by using Baidu search engine queries. Most of the selected search terms captured the peaks and troughs of the time series curves of influenza cases. The predictability of the conventional penalized regression models were improved by the proposed ensemble framework. The elastic net regression model outperformed the compared models, with the minimum prediction errors. We established a Baidu search engine queries-based surveillance model for monitoring influenza epidemics, and the proposed model provides a useful tool to support the public health response to influenza and other infectious diseases.

Introduction

Seasonal influenza is a serious public health problem that causes severe illness and death around the world. According to the World Health Organization (WHO), seasonal influenza occurs with an annual attack rate estimated at 5% to 10% in adults and 20% to 30% in children. The epidemics are estimated to result in about 3 to 5 million cases of severe illness and 250,000 to 500,000 deaths worldwide each year¹. During 2008–2011, an annual average of 92,677 seasonal influenza cases was reported in China². Overall, influenza pandemics have posed a significant burden of excess influenza-associated mortality in the country³. To achieve near real-time surveillance of the spread of infectious diseases, several novel approaches based on online surveillance systems and using informal sources such as news reports⁴, social media data^5,6, and search query data^7,8 have been proposed.

In 2009, Ginsberg, J. et al.⁸ first presented a novel method of analyzing large numbers of Google search queries to track influenza-like illness in the United States. The proposed method provided near real-time estimates of seasonal influenza activity each day and overcame the limitation of traditional systems requiring 1–2 weeks to gather and process surveillance data⁸. To estimate the seasonal influenza activity and quickly detect outbreaks in China, several programs were used to predict trends of influenza epidemics^9,10. However, these techniques used only influenza-like illness or influenza case data, which leaves room to improve the robustness of the predictions. In 2013, Yuan, Q. et al.¹¹ first explored combining influenza case data with internet search query data from the search engine Baidu within a linear regression framework to monitor influenza epidemics in China. This provided a new way to monitor the spread of influenza in the country. To describe the search behavior of its users, Baidu released the search volume daily on the Baidu Index website (http://index.baidu.com). The search volume of individual search keywords can be extracted to assess changes in the search behavior of users.

According to Yuan, Q. et al.¹¹, the construction of the prediction model involved compositing many search keywords into a single index according to different weights. However, in practice, many search keywords are used to construct the prediction model. The direct compositing of all keywords into a single index is not convenient for assessing the contribution of each keyword to the prediction. Developing robust techniques of search keyword selection and enhancing the ability to predict influenza epidemics remains challenging. Beyond the use of a linear regression model for prediction, we explored an ensemble framework that incorporated different penalized regression algorithms including lasso, ridge and elastic net¹² to avoid the over-fitting problem with various keywords, identify informative predictors from a pool of candidate keywords, and estimate the parameters of the model with low variability.

In our previous study¹³, a penalized regression model based on random bootstrap samples¹⁴ was able to detect significant variables with better predictive performance. How well a model predicts is quantified in practice by performance measures. For example, accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC)¹⁵ and kappa index of agreement (KIA)¹⁶ are often used to evaluate performance for classification problems. However, in many settings, the assessment of performance by a single measure has inherent problems¹⁷. For example, in disease surveillance applications, predicting periods of high incidence of infectious disease requires high sensitivity and/or specificity in addition to prediction accuracy¹⁸. Different performance measures reflect different characteristics of the constructed prediction model. Therefore, under many circumstances, several performance measures must be considered simultaneously.

To improve prediction robustness, we sought to develop a Baidu search engine query data-based prediction model whose performance was optimized with respect to a set of measures. A novel ensemble framework was established by combining bootstrap aggregating (bagging) and a multi-objective optimization method in this study. New ensemble penalized regression models using the lasso, ridge and elastic net algorithms were constructed, and applied to predict seasonal influenza activity. Results of this study indicated that the ensemble elastic net regression model outperformed the compared models in monitoring seasonal influenza activity by using Baidu search engine query data.

Material and Methods

Ensemble penalized regression model

Penalized regression model

We first considered the lasso (L1-penalized) linear regression model¹². We have an n × 1 response vector y = (y₁, y₂, …, y_n)^T and linearly independent predictors x_j = (x_1j, x_2j, …, x_nj)^T (j = 1, …, p). Let X = [x₁, …, x_p] be the predictor matrix. We assume that $E(y \mid x) = \beta_1 x_1 + \dots + \beta_p x_p$. The estimates in the lasso linear regression model are defined as (1):

$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\| y - \sum_{j=1}^{p} x_j \beta_j \right\|_2^2 + \lambda \|\beta\|_1 \tag{1}$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ denotes the L1 norm of the vector β, and λ is the nonnegative tuning parameter. This estimation method continuously shrinks the coefficients toward 0 as λ increases, and some coefficients are shrunk to exactly 0 if λ is sufficiently large¹⁹.

Next, we considered the lasso logistic regression setup by using the tuning parameter λ. The estimates in the model are given by (2):

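$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ -\sum_{i=1}^{n} \left[ y_i\, x_{(i)}^{T}\beta - \log\left(1 + e^{x_{(i)}^{T}\beta}\right) \right] + \lambda \|\beta\|_1 \right\} \tag{2}$$

where $x_{(i)}$ denotes the i-th row of X and λ is again the tuning parameter used for shrinking coefficients in the model. Generally, the cross-validation method was proposed to select the optimal λ²⁰. The ridge and elastic net penalized regression models were established using different penalties¹², and the optimal values of their tuning parameters were chosen in a similar way.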
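To illustrate the shrinkage behaviour described above, the following is a minimal sketch of lasso estimation by cyclic coordinate descent in pure Python. It is not the authors' implementation: the function names, the (1/2n) scaling of the squared-error term, the absence of an intercept and the fixed number of sweeps are all assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward 0 by gamma, clipping at 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * sum_j |b_j| by coordinate descent.

    X is a list of n rows with p columns and y a list of n responses.
    No intercept is fitted, so y and the columns of X are assumed centred
    (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                # Prediction from every predictor except j.
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            if norm > 0:
                beta[j] = soft_threshold(rho / n, lam) / (norm / n)
            else:
                beta[j] = 0.0
    return beta


# Toy data: the response depends on the first (centred) column only.
X_toy = [[-1.5, 0.5], [-0.5, -0.5], [0.5, -0.5], [1.5, 0.5]]
y_toy = [-3.0, -1.0, 1.0, 3.0]
small_penalty = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
large_penalty = lasso_coordinate_descent(X_toy, y_toy, lam=3.0)
```

With the small penalty the irrelevant second coefficient is set to exactly zero while the first is only mildly shrunk; with the large penalty both coefficients are zero.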

Ensemble penalized regression model built with a bagging strategy

To improve the performance of the conventional penalized regression model, we used a combination of bagging and a rank aggregation²¹ method to develop an ensemble penalized regression model. The architecture of the model consists of a sequence of processing procedures primarily including model training, validation, evaluation and averaging, which are implemented in many random bootstrap samplings (Fig. 1). The details for the methodology are presented below.

**Figure 1: Architecture of the ensemble penalized regression model.**

According to Breiman, L.²², bagging is a method of generating multiple versions of a prediction model and using them to obtain an aggregated prediction, which gives substantial gains in prediction accuracy. Suppose that a training set L consists of data X_n×p with known outcomes y = (y₁, …, y_n) that are independently drawn from the probability distribution P, and that we establish a prediction model φ(X, L) from it. Here, n is the number of samples and p is the number of predictors. By taking repeated bootstrap samples {L^(B)} from L, we formed a set of new prediction models φ(X, L^(B)). The final prediction of the bagging model, denoted by φ_A(x) = Eφ(X, L^(B)), was obtained by averaging the results of the sub-models. The proof of the validity of bagging for improving prediction accuracy is given in the Methods section of the Supplemental Material.
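The bagging step can be sketched as follows. This is an illustrative toy, not the authors' code: a one-predictor least-squares line stands in for the penalized regression sub-model, and the function names, the seed and the default of 100 sub-models are assumptions.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the OOB set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor (stand-in learner)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to the mean response
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bagged_predict(xs, ys, x_new, n_models=100, seed=0):
    """Average the predictions of sub-models fitted on bootstrap samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        in_bag, _ = bootstrap_split(len(xs), rng)
        slope, intercept = fit_line([xs[i] for i in in_bag],
                                    [ys[i] for i in in_bag])
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)


xs_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys_toy = [2.0 * x + 1.0 for x in xs_toy]
bagged = bagged_predict(xs_toy, ys_toy, x_new=2.5)
```

The out-of-bag indices returned by `bootstrap_split` are the samples that a sub-model never saw, which is what makes them usable for the performance measures computed later in the framework.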

To build the ensemble model, we randomly drew several (B) bootstrap samples from the original data {X_n×p, y_n×1}, trained B penalized regression models, M¹, M², …, M^B, by using the bootstrap samples and combined them to obtain an aggregated prediction. To determine an optimal sub-model in the ensemble penalized regression model according to several performance measures during each random sampling, we used a multi-objective optimization method via the weighted rank aggregation²¹. First, each measure ranked the sub-models according to their performance under that particular measure and generated the ordered lists of sub-models, R₁, …, R_K, where K is the number of measures used. Second, the weighted rank aggregation approach was used to produce an aggregated list that ranked the sub-models according to their performance under all K measures simultaneously. To obtain the optimal ordered list of models, we defined the following objective function:

$$\Phi(\delta) = \sum_{i=1}^{K} w_i \, d(\delta, R_i)$$

where δ is an ordered list of models of size Q, d is a distance function that estimates the similarity between any two ordered lists, and w_i is a weight factor associated with each measure. The Spearman footrule distance function²³ was used to estimate the similarity between any two lists of models.
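As a small worked illustration of weighted rank aggregation with the Spearman footrule distance, the sketch below scores every candidate ordering against the per-measure rankings and keeps the one with the smallest weighted total distance. Two simplifications are assumptions of this sketch: each measure carries a single scalar weight, and the search is exhaustive enumeration, which is only feasible for very short lists and is swapped in here in place of a stochastic search such as the cross-entropy method.

```python
from itertools import permutations


def footrule(list_a, list_b):
    """Spearman footrule: sum of absolute rank differences between two orderings."""
    position_in_b = {model: rank for rank, model in enumerate(list_b)}
    return sum(abs(rank - position_in_b[model])
               for rank, model in enumerate(list_a))


def aggregate_ranks(ranked_lists, weights):
    """Return the ordering that minimises sum_i w_i * d(ordering, R_i)."""
    models = sorted(ranked_lists[0])
    best_order, best_value = None, float("inf")
    for candidate in permutations(models):
        value = sum(w * footrule(candidate, ranking)
                    for w, ranking in zip(weights, ranked_lists))
        if value < best_value:
            best_order, best_value = list(candidate), value
    return best_order, best_value


# Three hypothetical measures rank three sub-models A, B and C.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
best_order, best_value = aggregate_ranks(rankings, weights=[1.0, 1.0, 1.0])
```

Here the ordering A, B, C has total distance 0 + 2 + 2 = 4, lower than any other permutation, so sub-model A would be taken as the optimal sub-model for this resampling round.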

Determining an optimal model according to all K measures simultaneously is equivalent to finding the list δ^* that minimizes the objective function Φ(δ). To determine δ^*, the cross-entropy method was used for rank aggregation²⁴. The algorithm of the ensemble penalized regression model is given as follows:

Algorithm. Ensemble penalized regression model.

Input:

(X, y): training set that contains n samples and a p-dimensional vector of predictors, with X of dimension n × p and y of length n.
B: number of random bootstrap samplings.
n_bootstrap: size of random bootstrap samples with replacement.
Q: size of an ordered list of sub-models in the ensemble model.
K: number of performance measures.
RP: size of random subspace predictor.
δ: an initial ordered list of sub-models of size Q.
d(.): the Spearman footrule distance function.

Output: prediction ψ_average of the ensemble model.

for b = 1 to B do
    generate bootstrap samples
    generate out-of-bag (OOB) samples
    for q = 1 to Q do
        randomly select RP predictors as a subset from the original p predictors
        generate a new subset of predictors
        generate new bootstrap samples
        generate new OOB samples
        establish a penalized regression model M_q
        for k = 1 to K do
            compute performance measure w_k,q based on the OOB samples
        end
    end
    generate a matrix of performance measures W_K×Q, where the measures in each row (w_i1, w_i2, …, w_iQ) are ranked in order of descending values
    generate K ordered lists of sub-models {R_i = (M₁, M₂, …, M_Q)ⁱ, i = 1, …, K} according to W_K×Q
    establish the objective function Φ(δ) (w_i = (w_i1, w_i2, …, w_iQ))
    perform the cross-entropy method for rank aggregation to determine the optimal parameter δ^* minimizing the value of Φ(δ)
    obtain an optimal ordered list of sub-models
end
establish the ensemble penalized regression model according to the B optimal sub-models
produce the prediction via model averaging

In our experience, the model performed similarly once parameter B was large, for example, B = 100. The value of n_bootstrap was set to the size of the original data. The size of the ordered list of sub-models Q was set to 10 to ensure efficiency and fast convergence²¹. Previous studies^25,26 suggested that the random subspace method usually produced an improved ensemble model. Thus, we constructed the ensemble model by using a random subset of RP predictors, as proposed by Breiman, L.²⁷. To assess the contribution of each predictor in the ensemble model, we used a permutation method to estimate the importance of each predictor as follows:

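$$I_j = \frac{1}{B} \sum_{i=1}^{B} \left[ e_i\left(\tilde{X}_j^{\mathrm{OOB}}\right) - e_i\left(X^{\mathrm{OOB}}\right) \right]$$

where I_j is the importance score of predictor j, $\tilde{X}_j^{\mathrm{OOB}}$ represents the OOB samples with the j^th predictor randomly permuted, X^OOB is the non-permuted samples, and e_i is the error rate of prediction. The architecture of the ensemble penalized regression model is depicted in Fig. 1.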
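The permutation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' procedure: it averages over repeated permutations for a single fitted model instead of over bootstrap sub-models, uses mean squared error as the error measure, and the names `permutation_importance` and `toy_model` are hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X_oob, y_oob, j, n_repeats=20, seed=0):
    """Average increase in OOB error after randomly permuting predictor j."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y_oob, [predict(row) for row in X_oob])
    increases = []
    for _ in range(n_repeats):
        column = [row[j] for row in X_oob]
        rng.shuffle(column)
        # Rebuild the OOB rows with only column j shuffled.
        permuted = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(X_oob, column)]
        error = mean_squared_error(y_oob, [predict(row) for row in permuted])
        increases.append(error - baseline)
    return sum(increases) / n_repeats


def toy_model(row):
    """Hypothetical fitted model that uses only the first predictor."""
    return 3.0 * row[0]


X_oob_toy = [[1.0, 9.0], [2.0, 4.0], [3.0, 7.0], [4.0, 1.0], [5.0, 6.0]]
y_oob_toy = [toy_model(row) for row in X_oob_toy]
importance_used = permutation_importance(toy_model, X_oob_toy, y_oob_toy, j=0)
importance_unused = permutation_importance(toy_model, X_oob_toy, y_oob_toy, j=1)
```

Permuting a predictor that the model ignores leaves the error unchanged, so its score is zero, whereas permuting an informative predictor inflates the error.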

Model evaluation

To widen the application of the ensemble model, we considered two set-ups of the model including the logistic and linear regression models for monitoring influenza epidemics. For the logistic regression model, we used five performance measures, including accuracy, sensitivity, specificity, AUC¹⁵ and KIA¹⁶. For the linear regression model, we used relative error (RE), root mean square error (RMSE), mean absolute error (MAE) and symmetric mean absolute percentage error (SMAPE)²⁸ to assess performance.
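For reference, three of the regression error measures named above can be written as follows. The SMAPE variant shown (absolute error divided by the mean of the absolute actual and predicted values, expressed as a percentage) is one common definition and is an assumption here, since several variants exist; RE is omitted because its exact definition is not given in this section.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric mean absolute percentage error (assumed variant, in percent)."""
    terms = []
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2.0
        # Treat a 0/0 term as zero error.
        terms.append(abs(a - p) / denominator if denominator > 0 else 0.0)
    return 100.0 * sum(terms) / len(terms)


observed_toy = [100.0, 200.0]
predicted_toy = [110.0, 190.0]
toy_errors = (rmse(observed_toy, predicted_toy),
              mae(observed_toy, predicted_toy),
              smape(observed_toy, predicted_toy))
```

RMSE penalizes large misses more heavily than MAE, while SMAPE is scale-free, which is one reason to consider several measures at once.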

Application to monitor seasonal influenza activity

Data sources

This study used monthly case counts of influenza occurring from January 2011 to May 2015 in China for testing the model. These laboratory-confirmed cases of influenza were reported by physicians to the notifiable disease-monitoring system managed by China’s Center for Disease Control and Prevention, and the data are publicly available on the official website (http://www.moh.gov.cn/). The influenza surveillance data for the studied period corresponded to a total of 53 months of influenza cases. Table 1 shows the details of monthly influenza case counts used in this study.

Table 1 Data of influenza cases confirmed by laboratory test for the period January 2011 to May 2015 in China were publicly available from China’s Center for Disease Control and Prevention.

Search query data were obtained from the Baidu Index website, which contains logs of online search query volume for numerous keywords searched by Baidu users. Since the search query data were available on a daily basis, we converted the data to monthly counts over the study period for analysis.
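The daily-to-monthly conversion might look like the sketch below. Summing the daily volumes within each calendar month is an assumption (the text does not say whether totals or averages were used), and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_records):
    """Sum daily search volumes into {(year, month): total}, sorted by month.

    daily_records is an iterable of (datetime.date, volume) pairs.
    """
    totals = defaultdict(int)
    for day, volume in daily_records:
        totals[(day.year, day.month)] += volume
    return dict(sorted(totals.items()))


daily_toy = [
    (date(2011, 1, 30), 5),
    (date(2011, 1, 31), 7),
    (date(2011, 2, 1), 3),
]
monthly_toy = monthly_totals(daily_toy)
```

Keying on the (year, month) pair keeps the same calendar month from different years separate, which matters for a series spanning 2011 to 2015.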

Keyword selection, crawling and filtering

Previous studies generally chose the names or clinical symptoms of the studied diseases as the primary terms to find more related keywords^11,29,30. Following this idea, we used the term “influenza” (“” in Chinese) as a primary keyword to search for more keywords associated with the studied disease on a Chinese website (http://tool.chinaz.com/baidu/words.aspx). The recommended keywords were comprehensively extracted from different sources, including Baidu, portal websites, and blogs¹¹. Typing in the primary keyword returned a total of 100 related keywords for further analysis (Table 2). After determining the related keywords, we built an auto-crawler in Python and used it to collect search volume data for the keywords. The framework of the auto-crawler is depicted in Fig. 2. The Python scripts are available from the authors on request for academic use.

Table 2 Search keywords from Baidu search engine used in this study.

**Figure 2: Framework of an auto-crawler using Python to collect search query data from the Baidu Index website.**

Because some recommended keywords were not necessarily related to influenza epidemics, we further filtered the keywords in three steps: first, the selected search keywords should represent factors that might affect the influenza epidemic; second, the search volume data for each keyword could be presented as a sequential time series with a specific resolution of time (e.g., daily, weekly or monthly); third, the time series of selected keywords should have a maximum cross-correlation coefficient of at least 0.4 with the influenza case data. These filtering approaches were also proposed in previous studies^11,30.
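The third filtering step can be sketched as follows. The threshold of 0.4 comes from the text; the choice of Pearson correlation, the range of lags examined (the keyword leading the case series by 0 to `max_lag` months) and the function names are assumptions of this sketch.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def max_cross_correlation(keyword, cases, max_lag=1):
    """Largest correlation between cases[t] and keyword[t - lag], lag = 0..max_lag."""
    best = -1.0
    for lag in range(max_lag + 1):
        shifted_keyword = keyword[:len(keyword) - lag]
        aligned_cases = cases[lag:]
        best = max(best, pearson(shifted_keyword, aligned_cases))
    return best


def keep_keyword(keyword, cases, threshold=0.4, max_lag=1):
    """Retain a keyword whose maximum cross-correlation reaches the threshold."""
    return max_cross_correlation(keyword, cases, max_lag) >= threshold


cases_toy = [1, 2, 3, 4, 5, 4, 3, 2]
leading_keyword = [2, 3, 4, 5, 4, 3, 2, 1]   # same shape, one month earlier
flat_keyword = [5, 5, 5, 5, 5, 5, 5, 5]      # carries no signal
```

In this toy data, `leading_keyword` matches the case series perfectly at a lag of one month and is kept, while `flat_keyword` is discarded.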

We considered two scenarios of model validation. First, the influenza case surveillance data were divided into a fitting and validation dataset. Models were fitted by using data from January 2011 to June 2014, and the remaining part of the data was used for model validation. Second, to compare the models for monitoring a high level of influenza epidemics, we investigated three cases of high incidence thresholds defined as the median, 75th and 90th percentiles of number of influenza cases over the study period, and evaluated their performance. The receiver operating characteristic (ROC) curve was used to assess the predictive ability of the models.
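To make the second scenario concrete, the sketch below turns monthly case counts into binary high-incidence labels at a chosen percentile. The linear-interpolation percentile and the rule that a month counts as high incidence when its count is at or above the threshold are assumptions; the text does not specify either.

```python
def percentile(values, q):
    """q-th percentile using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def high_incidence_labels(cases, q):
    """Return (threshold, labels), where label 1 marks months at or above it."""
    threshold = percentile(cases, q)
    return threshold, [1 if count >= threshold else 0 for count in cases]


monthly_cases_toy = [4, 9, 1, 7, 3, 8, 2, 6, 5]
median_threshold, median_labels = high_incidence_labels(monthly_cases_toy, 50)
p75_threshold, p75_labels = high_incidence_labels(monthly_cases_toy, 75)
p90_threshold, p90_labels = high_incidence_labels(monthly_cases_toy, 90)
```

Raising the percentile leaves fewer positive months, so the logistic models face an increasingly imbalanced classification problem, which is one reason accuracy alone is a weak summary.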

Results

On the basis of our filtering steps, 19 of the 100 keywords were not related to influenza epidemics, 8 keywords did not have sequential time series due to low search volume, and a set of only 58 keywords was retained for building the compared models (Table 2). Taking into account the delayed effects of predictors, we considered time lags of 0 to 1 month and the autoregressive term of influenza case number in the previous month. In total, 117 predictors (58 keywords × 2 lags, plus 1 autoregressive term) were used for building the prediction models. In this case, the number of predictors was greater than the length of the time series of influenza cases (117 > 53). Thus, the penalized estimation of parameters in the model was necessary in this study.
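The predictor count can be checked with a small sketch that builds the lagged design matrix. The function name, the column naming scheme and the decision to drop the first month (which has no lag-1 values) are assumptions made for illustration.

```python
def build_design_matrix(keywords, cases, lags=(0, 1)):
    """Build lagged keyword predictors plus a lag-1 autoregressive term.

    keywords maps a keyword name to its list of monthly search volumes;
    cases is the list of monthly case counts of the same length.
    Returns (column_names, rows, targets).
    """
    start = max(max(lags), 1)  # first month with every lagged value available
    names = [f"{name}_lag{lag}" for name in keywords for lag in lags]
    names.append("cases_lag1")
    rows, targets = [], []
    for t in range(start, len(cases)):
        row = [series[t - lag] for series in keywords.values() for lag in lags]
        row.append(cases[t - 1])  # influenza cases in the previous month
        rows.append(row)
        targets.append(cases[t])
    return names, rows, targets


# Hypothetical data with the study's dimensions: 58 keywords over 53 months.
keywords_toy = {f"keyword_{i}": list(range(53)) for i in range(58)}
cases_series_toy = list(range(53))
names, rows, targets = build_design_matrix(keywords_toy, cases_series_toy)
```

With 58 keywords at two lags and one autoregressive column, `names` has 117 entries, which exceeds the number of available months and motivates the penalized estimation.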

In general, influenza causes annual epidemics that peak during the spring and winter in China. Most of our selected search keywords captured the peaks and troughs of the time series curves of influenza cases, so they were good indicators for monitoring influenza epidemics in the country (Figures S1–S5).

Comparison of prediction performance of different penalized regression models and the algorithms in the proposed ensemble framework is shown in Table 3. For the prediction of seasonal influenza case counts in the period between July 2014 and May 2015, the ensemble framework improved the performance of the conventional lasso, ridge and elastic net regression models. Among the models, the ensemble elastic net regression model outperformed the others since it had the smallest prediction errors (Table 3). Regardless of the periods for model fitting and prediction, the ensemble elastic net regression model was able to capture the peaks and troughs of the time series curves of influenza cases (Fig. 3). The forecast intervals given by the ensemble model well covered the actual epidemic curve of influenza cases.

Table 3 Prediction performance of different penalized regression algorithms (lasso, ridge and elastic net) and the algorithms in the proposed ensemble framework was compared using the number of influenza cases during the period of July 2014 to May 2015.

**Figure 3: Predictions of influenza cases according to the ensemble elastic net regression model for the period of July 2014 to May 2015.**

For monitoring a high level of influenza epidemics, this study integrated the set-up of logistic regression models in the ensemble prediction framework. We studied three situations of high incidence thresholds defined as the median, 75th and 90th percentiles of number of influenza cases over the study period. The performance of the models in detecting a large number of influenza cases was assessed using accuracy, sensitivity, specificity, AUC and KIA (Table 4). Overall, the ensemble elastic net regression model had the largest average AUC of 0.97, and thus outperformed the others, irrespective of the threshold of influenza incidence used. In addition, the results suggested that the predictability of the conventional lasso, ridge and elastic net models was consistently improved by the ensemble framework (Fig. 4).
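The five classification measures can be computed as in the following sketch. These are the textbook definitions, with KIA taken to be Cohen's kappa and AUC computed as the proportion of positive-negative pairs ranked correctly (ties counting one half); the function names are hypothetical and the toy labels are not study data.

```python
def confusion_measures(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {
        "accuracy": observed,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "kappa": kappa,
    }


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


truth_toy = [1, 1, 1, 0, 0, 0, 0, 0]
predicted_labels_toy = [1, 1, 0, 0, 0, 0, 0, 1]
measures_toy = confusion_measures(truth_toy, predicted_labels_toy)
auc_toy = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy example accuracy is 0.75 but kappa is only about 0.47, which shows how chance-corrected agreement can tell a different story from raw accuracy when classes are imbalanced.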

Table 4 Comparison of different penalized regression algorithms (ridge, lasso and elastic net) and the algorithms in the proposed ensemble framework in predicting influenza epidemics, by using three cases of high incidence thresholds defined as the median, 75th and 90th percentiles of number of influenza cases over the study period.

**Figure 4: Performance of different penalized regression algorithms (ridge, lasso and elastic net) and the algorithms in the proposed ensemble framework in predicting influenza epidemics.**

Figure 5 shows the estimated importance score for the top 25 keywords contributing to the prediction of the ensemble model. The keyword, “type a flu” (variable X39), was the most significant factor predicting influenza epidemics over the study period. In addition, the keywords “saying type a h1n1 flu” (variable X99), “the toll of swine flu-related death” (variable X52) and “flu symptom” (variable X47) played important roles in the internet search queries-based surveillance model we established. The ensemble elastic net regression model performed similarly with a large number of random bootstrap samplings, for example, with B = 100 (Figure S6). It also guaranteed that the prediction of the ensemble model converged to a stable result.

**Figure 5: Contribution of each predictor to the prediction in the ensemble elastic net regression model.**

Discussion

We used bagging and a multi-objective optimization technique to establish a novel ensemble elastic net penalized regression model to detect seasonal influenza epidemics in China. The results showed high performance and little fluctuation in extrapolating ability for the proposed model as a Baidu search engine queries-based surveillance framework. The empirical analysis demonstrated that our ensemble models monitored seasonal influenza epidemics better than the conventional penalized regression models.

Recently, Salathé M. et al.³¹ discussed the importance of digital disease surveillance for rapid disease outbreak detection and proposed it as a powerful tool to complement traditional approaches. In fact, internet search query data are being explored as a low-cost approach to providing near real-time estimates of disease activity and are becoming widely used for disease surveillance^11,18,29,30. In China, influenza activity based on routine surveillance data from the Ministry of Health of China was usually reported with a 1 to 2-week lag. Hence, as a convenient source for timely estimation of influenza activity and detection of epidemics, search query data can help improve the results of traditional disease surveillance.

In a newly released report³², about 87% of Chinese internet users preferred Baidu to search for any information, so it is the most popular search engine in China. With the wide use of the Baidu search engine, the search volume of Baidu naturally reflects Chinese online behavior³⁰. Therefore, data from Baidu are more representative of search queries in China for this analysis. Many search keywords are more likely to be captured with this search engine to build a Baidu search engine queries-based surveillance model.

The data for the surveillance model must be automatically fetched over the internet. To achieve this goal, we established an auto-crawler by using Python to collect search volume data for the keywords obtained. The auto-crawler was built mainly with the Selenium package for Python. Its framework comprised calling the Selenium webdriver³³ to start a browser and open the Baidu Index website, constructing a new uniform resource locator (URL) from a keyword, calling the Selenium webdriver to open the URL and take screenshots containing the figures of search volume, and calling Tesseract-OCR to extract the data (Fig. 2).

For our empirical analysis, the number of predictors derived from the search terms was greater than the sample size (117 > 53) (Table 2). Beyond the use of a linear regression model with stepwise selection of significant variables for prediction¹¹, this study utilized penalized regression approaches¹² to establish prediction models with various search keywords. With a large number of predictors in the model, we would prefer to search for a smaller subset that has the strongest effects. A feature of the penalized regression models is a tuning parameter, λ, that controls the amount of shrinkage applied to the coefficients. By shrinking variables with very unstable estimates towards zero, the approach can effectively exclude some irrelevant variables and produce a subset of variables with strong effects. Regarding the tuning parameter, the traditional way of choosing the optimal λ is to use the cross-validation method. However, the robustness of variable selection is affected to some extent by the fold assignment used for cross-validation³⁴. As a result, the model parameters are estimated with a degree of variability. To enhance the predictability of penalized regression models, we combined the methods of bagging and multi-objective optimization to construct the ensemble penalized regression models. Bagging can substantially improve the accuracy of an unstable prediction model²². Our study suggested that the proposed ensemble framework significantly improved the performance of the conventional lasso, ridge and elastic net regression models, and the ensemble elastic net regression model was optimal in estimating influenza activity.

We found high correlations between specific search terms of Baidu and seasonal influenza incidence. We developed an index of importance score to estimate the contribution of each search term to the prediction of influenza epidemics. Breiman, L.²⁷ introduced a practical approach to measure variable importance based on computationally intensive permutations. We adopted this idea and assessed the contribution of each predictor in the ensemble model. For the performance, our predictions of time periods with high influenza incidence based on the ensemble elastic net regression model were very accurate, for different thresholds of high incidence (Table 4). Together, these results demonstrate the viability of the presented ensemble model in supporting influenza surveillance. The ensemble model performed similarly when the number of bootstrap replicates was large. The results of the empirical study indicated that the ensemble model was robust.

Although China has established a notifiable infectious disease monitoring system nationwide, reported influenza cases are available to the public with a delay of about 1 to 2 weeks. The rapid expansion of the geographical distribution and genetic diversity of novel influenza viruses poses a direct challenge to current disease control systems in China³⁵. Potentially, influenza may become a long-term threat to public health in this country. Predictive search term-based models were found to perform better than a model using only reported cases to predict future cases^7,8,11. Specifically, an internet search-term model returns results more quickly and with better performance¹⁸. Our study also suggested that most of the selected search keywords captured the peaks and troughs of the time series curves of influenza cases. Our ensemble elastic net regression model predicted seasonal influenza epidemics with high performance. Thus, in China, this internet search term-based system might be used as a supplement to existing surveillance systems. However, we should note that surveillance models based on internet search query data, such as Google Flu Trends, have substantial flaws, including missing the first wave of the 2009 influenza H1N1 pandemic and overestimating the intensity of the H3N2 epidemic during the 2012/2013 season in the United States³⁶. This means that there is room to improve the performance of surveillance models based on internet search query data and provide reliable surveillance for seasonal or pandemic influenza³⁶. In addition, because Google pulled out of mainland China in 2010, search query data from Google during the study period of 2011–2015 are not publicly available in mainland China. Therefore, an overall comparison between the algorithm proposed in this study and that of Google Flu Trends cannot be made. These considerations motivate further validation of the proposed algorithm in future studies.

Several limitations of this study should be mentioned. First, different people may use different words to search for the same information, especially when searching in Chinese, which has various ways of expression. Thus, search keywords should be carefully selected to reflect terms most likely associated with influenza epidemics. Second, internet searching behavior is susceptible to the impact of media reports, which might affect the performance of the internet search term-based system³⁷. Third, in the empirical study, 100 bootstrap replicates were used for building the ensemble model. With this setting, the ensemble prediction converged to a stable result but required considerable time to generate an aggregated prediction. This issue was also discussed by Breiman, L.²⁷. Integrating parallel computing into the ensemble model would be a practical way to speed up the analysis and improve computing efficiency.

In conclusion, the present study developed a novel ensemble elastic net penalized regression model by combining bagging and a multi-objective optimization method to monitor seasonal influenza activity. The approach provides a useful tool in support of the public health response to influenza and other infectious diseases in China.

Additional Information

How to cite this article: Guo, P. et al. Monitoring seasonal influenza epidemics by using internet search data with an ensemble penalized regression model. Sci. Rep. 7, 46469; doi: 10.1038/srep46469 (2017).

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

World Health Organization. Influenza (Seasonal) http://www.who.int/mediacentre/factsheets/fs211/en/ (Date of access: 26/01/2017) (2014).
He, Q. et al. Effectiveness of seasonal influenza vaccine against clinically diagnosed influenza over 2 consecutive seasons in children in Guangzhou, China: a matched case-control study. Human Vaccines & Immunotherapeutics 9, 1720–1724 (2013).
Article Google Scholar
H, Y. et al. Regional variation in mortality impact of the 2009 A(H1N1) influenza pandemic in China. Influenza & Other Respiratory Viruses 7, 1350–1360 (2013).
Article MathSciNet Google Scholar
Freifeld, C., Mandl, K., Reis, B. & Brownstein, J. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. Journal of the American Medical Informatics Association 15, 150–157 (2008).
Chew, C. & Eysenbach, G. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. Plos One 5, e14118 (2010).
Brownstein, J. S., Freifeld, C. C. & Madoff, L. C. Digital disease detection–harnessing the Web for public health surveillance. New England Journal of Medicine 360, 1656–1658 (2009).
Eysenbach, G. Infodemiology: tracking flu-related searches on the web for syndromic surveillance. AMIA Annual Symposium Proceedings. 244, 244–248 (2006).
Ginsberg, J. et al. Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014 (2009).
Ou, C., Deng, Z. & Yang, L. Prediction of Influenza-like Illness Using Auto-regression Model. Chinese Journal of Health Statistics 24, 569–571 (2007).
Zhao, Y. U., Fang, Q. S., Zhou, M., Lian-Hong, L. I. & Wang, W. Surveillance of influenza in Zhejiang, 2008–2012. Disease Surveillance 27, 1003–9961 (2012).
Yuan, Q. et al. Monitoring Influenza Epidemics in China with Search Query from Baidu. Plos One 8, e64323–e64323 (2013).
Tibshirani, R. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society 73, 273–282 (2011).
Guo, P. et al. Improved Variable Selection Algorithm Using a LASSO-Type Penalty, with an Application to Assessing Hepatitis B Infection Relevant Factors in Community Residents. Plos One 10, e0134151 (2015).
Efron, B. & Gong, G. A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation. American Statistician 37, 36–48 (1983).
Guo, P. et al. Gene expression profile based classification models of psoriasis. Genomics 103, 48–55 (2014).
Cohen, J. A coefficient of agreement of nominal scales. Educational and Psychological Measurement 20, 37–46 (1960).
Datta, S., Pihur, V. & Datta, S. An adaptive optimal ensemble classifier via bagging and rank aggregation with applications to high dimensional data. BMC Bioinformatics 11, 427 (2010).
Althouse, B. M., Ng, Y. Y. & Cummings, D. A. T. Prediction of Dengue Incidence Using Search Query Surveillance. Plos Neglected Tropical Diseases 5, e1258–e1258 (2011).
Zou, H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101, 1418–1429 (2006).
Guo, P. et al. Blood lead levels and associated factors among children in Guiyu of China: a population-based study. Plos One 9, e105470–e105470 (2014).
Pihur, V., Datta, S. & Datta, S. Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 23, 1607–1615 (2007).
Breiman, L. Bagging predictors. Machine Learning 24, 123–140 (1996).
Fagin, R., Kumar, R. & Sivakumar, D. Comparing top k lists. SIAM Journal on Discrete Mathematics 17, 28–36 (2003).
Pihur, V., Datta, S. & Datta, S. RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics 10, 62 (2009).
Hoens, T. R. & Chawla, N. V. Generating Diverse Ensembles to Counter the Problem of Class Imbalance. Advances in Knowledge Discovery and Data Mining 6119, 488–499 (2010).
Panov, P. & Džeroski, S. Combining Bagging and Random Subspaces to Create Better Ensembles. Advances in Intelligent Data Analysis VII 4723, 118–129 (2007).
Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
Makridakis, S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting 9, 527–529 (1993).
Kang, M., Zhong, H., He, J., Rutherford, S. & Yang, F. Using Google Trends for influenza surveillance in South China. Plos One 8, e55205–e55205 (2012).
Gu, Y. et al. Early detection of an epidemic erythromelalgia outbreak using Baidu search data. Scientific Reports 5, 12649 (2015).
Salathé, M., Freifeld, C. C., Mekaru, S. R., Tomasulo, A. F. & Brownstein, J. S. Influenza A (H7N9) and the importance of digital epidemiology. New England Journal of Medicine 369, 401–404 (2013).
China Internet Network Information Center. The Chinese search engine market research report in 2013 http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/ (Date of access: 26/01/2017) (2013).
npm Enterprise. selenium-webdriver https://www.npmjs.com/package/selenium-webdriver (Date of access: 26/01/2017) (2016).
Roberts, S. & Nowak, G. Stabilizing the lasso against cross-validation variability. Computational Statistics & Data Analysis 70, 198–211 (2014).
Lam, T. T. et al. Dissemination, divergence and establishment of H7N9 influenza viruses in China. Nature 522, 102–105 (2015).
Olson, D. R., Konty, K. J., Paladini, M., Viboud, C. & Simonsen, L. Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLOS Computational Biology 9, e1003256 (2013).
Valdivia, A. et al. Monitoring influenza activity in Europe with Google Flu Trends: comparison with the findings of sentinel physician networks - results for 2009–10. Euro Surveill 15, 2–7 (2010).

Acknowledgements

We thank China’s Center for Disease Control and Prevention for providing publicly available data on reported influenza cases. This study was supported by the Department of Education, Guangdong Government under the Top-tier University Development Scheme for Research and Control of Infectious Diseases (2015022 and 2015023). We thank Mrs Laura Smales (BioMedEditing, Toronto, Canada) for English language editing. We also thank the editor and two anonymous reviewers for their professional suggestions, which greatly improved the manuscript.

Author information

Authors and Affiliations

Department of Preventive Medicine, Shantou University Medical College, No. 22 Xinling Road, Shantou, 515041, Guangdong, People’s Republic of China
Pi Guo, Jianjun Zhang, Li Wang, Shaoyi Yang, Ganfeng Luo, Changyu Deng, Ye Wen & Qingying Zhang

Contributions

P.G. and Q.Y.Z. conceived and designed the study. P.G., J.J.Z., L.W., S.Y.Y., G.F.L., C.Y.D., Y.W. and Q.Y.Z. collected and cleaned the data. P.G. and Q.Y.Z. analyzed and interpreted the data and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Qingying Zhang.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Material (DOC 5085 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Guo, P., Zhang, J., Wang, L. et al. Monitoring seasonal influenza epidemics by using internet search data with an ensemble penalized regression model. Sci Rep 7, 46469 (2017). https://doi.org/10.1038/srep46469

Received: 04 October 2016
Accepted: 20 March 2017
Published: 19 April 2017
DOI: https://doi.org/10.1038/srep46469