Forecasting influenza epidemics by integrating internet search queries and traditional surveillance data with the support vector machine regression model in Liaoning, from 2011 to 2015

Background Influenza epidemics pose significant social and economic challenges in China. Internet search query data have been identified as a valuable source for detecting emerging influenza epidemics. However, selecting the search queries and choosing the prediction method remain crucial challenges for improving predictions. The purpose of this study was to explore the application of the Support Vector Machine (SVM) regression model in merging search engine query data and traditional influenza data.
Methods The official monthly reported number of influenza cases in Liaoning province, China, from January 2011 to December 2015 was acquired from the China National Scientific Data Center for Public Health. Search queries potentially related to influenza over the corresponding period were identified from Baidu Index, a publicly available search engine database. An SVM regression model was built for prediction, and its three parameters (C, γ, ε) were chosen by leave-one-out cross-validation (LOOCV) during model construction. Model performance was evaluated with Root Mean Square Error, Root Mean Square Percentage Error and Mean Absolute Percentage Error.
Results In total, 17 search queries related to influenza were generated through the initial query selection approach and were adopted to construct the SVM regression model, including nine queries in the same month, three queries at a lag of one month, one query at a lag of two months and four queries at a lag of three months. The SVM model performed well with the parameters (C = 2, γ = 0.005, ε = 0.0001), based on the ensemble data integrating the influenza surveillance data and Baidu search query data.
Conclusions The results demonstrated the feasibility of using internet search engine query data as a complementary data source for influenza surveillance and the efficiency of the SVM regression model in tracking influenza epidemics in Liaoning.


INTRODUCTION
Seasonal influenza is a serious public health problem and remains rampant across the world. According to the latest estimates from the United States Centers for Disease Control and Prevention (US-CDC), influenza epidemics cause about three to five million cases of severe illness and about 290,000 to 650,000 deaths each year (World Health Organization, 2017). The National Health and Family Planning Commission of the People's Republic of China reported that China had 456,718 influenza cases, with an incidence rate of 33.0994 per 100,000, in 2017 (National Health and Family Planning Commission of the People's Republic of China, 2018). Influenza epidemics pose significant social and economic challenges in China (Wang et al., 2015). It is necessary to establish a real-time flu surveillance system for rapid and effective responses in China.
A national notifiable infectious disease reporting system has been established to continuously report influenza cases in China, but the system reports flu activity for the previous month, so the flu data lag by one month. Traditional flu surveillance methods for prediction were mainly based on hospital or laboratory data (Wang et al., 2017). The idea of applying internet search query data to infectious disease prediction came from Ginsberg et al. (2009), who presented a new method that provided nearly real-time surveillance of influenza-like illness and overcame the lag-time limitations of the traditional flu surveillance systems of the United States. Online search query data are more timely, can be measured in real time, and can stay closely synchronized with the flu epidemic. In order to monitor infectious disease activity in time, numerous recent studies have drawn on online search query data or social media data, including Google (Seo & Shin, 2017; Yang et al., 2017; Xu et al., 2017; Pollett et al., 2017), Yahoo (Polgreen et al., 2008), Naver (Shin et al., 2016), Daum (Woo et al., 2016; Seo et al., 2014), the Baidu search engine (Guo et al., 2017b), Twitter (Wagner et al., 2017; Kagashe, Yan & Suheryani, 2017; Allen et al., 2016; Yun et al., 2016) and Weibo (Fung et al., 2013; Zhang et al., 2015) social media, Wikipedia (Hickmann et al., 2015; McIver & Brownstein, 2014), hospital or clinicians' databases (Bouzille et al., 2018; Santillana et al., 2014), and so on. Since Google withdrew from mainland China in 2010, Google search query data and Google Flu Trends are not accessible in mainland China. This article constructs a forecasting model for influenza based on ensemble data integrating traditional influenza case data and search data from Baidu, the most popular search engine in China.
Support vector machines (SVMs) are supervised learning models with associated learning algorithms, and their application to classification and regression has been a hot topic recently. For regression problems, SVMs have been applied in many fields: air quality forecasting, water demand and water quality prediction (Ghalehkhondabi et al., 2017; Zhang, Zou & Shan, 2017), biomedicine (Nickerson et al., 2016), etc. SVMs can efficiently perform non-linear classification by means of the kernel trick, in which the inputs are implicitly mapped into high-dimensional feature spaces. Lampos et al. (2015) indicated that a nonlinear query modeling approach presented the lowest cumulative nowcasting error. Woo et al. (2016) found that the SVM regression model based on weekly influenza incidence data and query data from the Korean website Daum performed well. Guo et al. (2017b) comprehensively assessed six machine learning algorithms based on Baidu search engine data and dengue case data in Guangdong and reported that the SVM regression model performed better than the other forecasting techniques. Thus, this article attempted to build an SVM regression model to predict flu activity.
The influenza epidemic situation varies greatly among different regions. China has a vast territory that spans tropical, subtropical and temperate regions, and it is a large challenge to establish an influenza prediction mechanism for the whole country. Liaoning province is both a coastal and a border region in the northeastern part of China, where the feasibility of influenza prediction models based on internet search query data is still unknown. Thus, the purpose of this study was to investigate whether an early warning model utilizing both online influenza query data and traditional surveillance data could improve influenza prediction.

Study setting and data collection
Liaoning is a coastal province in the northeast of China with a population of approximately 43.77 million in 2016 and a temperate continental monsoon climate. The official monthly reported number of influenza cases in Liaoning province from January 2011 to December 2015 was acquired from the China National Scientific Data Center for Public Health (http://www.phsciencedata.cn). The China National Scientific Data Center for Public Health is open to registered users in mainland China, and the latest influenza incidence data available were those for December 2015. Search queries potentially related to influenza over the corresponding period were identified from Baidu Index (http://index.baidu.com), an online keyword research tool that is publicly accessible worldwide. "Influenza" was first adopted as the primary indicator term to find more related queries about influenza on the Chinese website http://tool.chinaz.com/baidu/words.aspx, a free online platform that provides keyword mining for the Baidu search engine in mainland China. The monthly average volume of those related search queries from Liaoning, from January 2011 to December 2015, was extracted from the Baidu Index website.

Statistical analysis
The related influenza search terms were first ranked, and those terms that had no data within one calendar year during the study period were excluded. Pearson correlation analysis was performed to explore the correlation between influenza-related search queries and the reported number of influenza cases in Liaoning. Search terms with a statistically significant correlation coefficient above 0.4 were passed on to the construction of the SVM regression model. Selection by maximum cross-correlation coefficient has been proposed in previous studies (Guo et al., 2017a; Yuan et al., 2013).
The influenza case surveillance data were divided into two parts for the SVM regression model: the fitting dataset and the validation dataset. The 45 months of data from January 2011 to September 2014 were used for model training, and the remaining 15 months of data from October 2014 to December 2015 were used as the test set for model prediction. The three parameters (C, γ, ε) of the SVM regression model were chosen by leave-one-out cross-validation (LOOCV) during the model construction process. Three metrics were adopted to measure the performance of the SVM regression model: Root Mean Square Error (RMSE), Root Mean Square Percentage Error (RMSPE), and Mean Absolute Percentage Error (MAPE). These three metrics are measures of the prediction accuracy of a forecasting method in statistics. RMSE is very sensitive to very large or very small errors in a set of measurements, so it reflects the precision of the forecasting well. RMSPE is a percent difference between predicted and true values. MAPE is the most common measure of forecast error, and it functions best when there are no extremes in the data (including zeros). The definitions of these three metrics are provided below. The notation in the study is as follows: y_i denotes the observed number of influenza cases at time t_i, and ŷ_i denotes the value predicted by the SVM regression model at time t_i.
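The query-filtering step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R code: the function names (`pearson`, `lagged_correlations`, `select_lag`) and the toy series are hypothetical, and the significance test mentioned in the text is omitted, so only the 0.4 threshold on the correlation coefficient is applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlations(query, cases, max_lag=3):
    """Correlate query volume in month t - lag with case counts in month t."""
    out = {}
    for lag in range(max_lag + 1):
        q = query[: len(query) - lag]  # query volume, shifted back by `lag`
        c = cases[lag:]                # cases observed `lag` months later
        out[lag] = pearson(q, c)
    return out


def select_lag(query, cases, threshold=0.4, max_lag=3):
    """Return (lag, r) for the strongest lag if r exceeds the threshold, else None."""
    corr = lagged_correlations(query, cases, max_lag)
    lag = max(corr, key=corr.get)
    return (lag, corr[lag]) if corr[lag] > threshold else None


# Toy data (hypothetical): the query series leads the case series by one month.
cases = [10, 12, 30, 80, 60, 20, 11, 9, 10, 14, 35, 90]
query = cases[1:] + [70]
best_lag, best_r = select_lag(query, cases)
print(best_lag, round(best_r, 3))
```

Here the toy query series is constructed to lead the cases by exactly one month, so the strongest correlation is found at a lag of one month.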
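Root Mean Square Error, a measure of the difference between predicted and true values, is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (1)$$

Root Mean Square Percentage Error, a measure of the percent difference between predicted and true values, is defined as:

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2} \times 100\% \qquad (2)$$

Mean Absolute Percentage Error, the mean of the absolute percentage errors of the forecasts, is defined as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\% \qquad (3)$$

where n is the number of predicted time points.

The statistical analysis and the construction of the SVM regression model were performed using R statistical software version 3.4.2 with the package e1071.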
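As a worked illustration of the three metrics, the following Python sketch computes them for a small made-up series. It assumes the standard definitions of RMSE, RMSPE and MAPE, with the two percentage errors expressed in percent; the function names and the numbers are illustrative only and are not taken from the study.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root Mean Square Error, in the units of the data."""
    n = len(observed)
    return sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)


def rmspe(observed, predicted):
    """Root Mean Square Percentage Error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100 * sqrt(sum(((y - p) / y) ** 2 for y, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean Absolute Percentage Error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100 * sum(abs((y - p) / y) for y, p in zip(observed, predicted)) / n


# Made-up monthly case counts and forecasts, for illustration only.
observed = [100, 200, 400]
predicted = [110, 180, 400]

print(round(rmse(observed, predicted), 3))   # 12.91
print(round(rmspe(observed, predicted), 3))  # 8.165
print(round(mape(observed, predicted), 3))   # 6.667
```

Both percentage metrics divide by the observed value, which is why they are undefined for months with zero reported cases.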

Baidu search terms filtering
According to the filtering criteria, 46 search terms with available sequential data within every calendar year of the study period remained (Table S1). Pearson correlation analysis was performed between the search terms and influenza cases across different lag periods (in the same month and at lags of one, two and three months). The correlation of the search terms with influenza cases in the non-flu season is provided as a baseline level of their relationship (Table 1). Twenty-nine of the remaining 46 terms were excluded because their Pearson correlation coefficients with influenza cases were less than 0.4 across every lag period. A total of 17 search terms that were strongly correlated with influenza cases across different lag periods were retained for the construction of the SVM regression model, including nine queries in the same month, three queries at a lag of one month, one query at a lag of two months and four queries at a lag of three months (Table 2). Meanwhile, the number of influenza cases in one month might affect the number of incident cases in the following months. Pearson correlation analysis was therefore performed between the reported number of influenza cases in a given month and the historically reported numbers of influenza cases. The correlation coefficients were 0.672, 0.498 and 0.151 at lags of one, two and three months, respectively. The reported number of influenza cases at a lag of one month showed the strongest correlation, so it was included in the SVM regression model.
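One way to assemble the model inputs described above is to take each retained query at its selected lag and append the case count from the previous month. The Python sketch below is a hypothetical illustration: the function `build_design_matrix`, the query names and the numbers are assumptions, and dropping the first months (for which lagged values do not exist) is one possible choice that the text does not specify.

```python
def build_design_matrix(queries, lags, cases, case_lag=1):
    """Build predictor rows and targets from lagged series.

    queries:  dict mapping query name -> list of monthly search volumes
    lags:     dict mapping query name -> selected lag in months
    cases:    list of monthly reported case counts
    case_lag: lag (in months) of the historical case predictor
    """
    names = sorted(queries)
    max_lag = max(max(lags.values()), case_lag)
    rows, targets = [], []
    # Months before max_lag are skipped because their lagged inputs are missing.
    for t in range(max_lag, len(cases)):
        row = [queries[name][t - lags[name]] for name in names]
        row.append(cases[t - case_lag])  # historical cases as an extra predictor
        rows.append(row)
        targets.append(cases[t])
    return rows, targets


# Hypothetical toy series: query "a" is used in the same month, "b" at a lag of three months.
cases = [1, 2, 3, 4, 5, 6]
queries = {"a": [10, 20, 30, 40, 50, 60], "b": [100, 200, 300, 400, 500, 600]}
lags = {"a": 0, "b": 3}

X, y = build_design_matrix(queries, lags, cases)
print(X)  # [[40, 100, 3], [50, 200, 4], [60, 300, 5]]
print(y)  # [4, 5, 6]
```

With 17 retained queries and one lagged case series, each row would hold 18 predictors.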

Parameter selection of SVM regression model
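The mathematical formula of the SVM regression model is provided below:

$$\min_{\alpha,\,\alpha^{*}} \ \frac{1}{2}(\alpha - \alpha^{*})^{\top} Q\, (\alpha - \alpha^{*}) + \varepsilon \sum_{i=1}^{k} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{k} y_i (\alpha_i - \alpha_i^{*})$$

$$\text{subject to} \quad 0 \le \alpha_i,\ \alpha_i^{*} \le C, \quad i = 1, \dots, k, \qquad \sum_{i=1}^{k} (\alpha_i - \alpha_i^{*}) = 0,$$

where α and α* are the Lagrange multipliers, C is the upper bound of all variables, Q is a k by k positive semidefinite matrix with Q_ij = K(x_i, x_j), and K(x_i, x_j) is the kernel.

The expression of the radial basis function is provided below:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right)$$

During the process of leave-one-out cross-validation, we started from the default values (γ = 0.0556, ε = 0.1) and adjusted the C value to observe the model fitting results (Table 3). The same method was applied to the selection of the other two parameters, γ and ε (Tables 4 and 5). The values of these three parameters were evaluated according to the lowest test error, and the optimal parameters of the model were then determined (C = 2, γ = 0.005, ε = 0.0001).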
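The one-parameter-at-a-time search described above can be sketched as follows. This is an assumption-laden illustration rather than the study's procedure: it uses scikit-learn's `SVR` in place of the R package e1071, synthetic data in place of the Liaoning series, hypothetical candidate grids, and the leave-one-out RMSE on the training months as the selection criterion. Unlike e1071, `SVR` does not scale inputs by default, so results are not directly comparable. The starting γ of 1/18 ≈ 0.0556 appears consistent with e1071's default of one over the number of predictors (17 queries plus one lagged case series).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR


def loocv_rmse(X, y, C, gamma, epsilon):
    """Leave-one-out RMSE of an RBF-kernel epsilon-SVR."""
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return float(np.sqrt(np.mean((y - pred) ** 2)))


def coordinate_search(X, y, grids, start):
    """Tune C, then gamma, then epsilon, holding the other two fixed each time."""
    best = dict(start)
    for name in ("C", "gamma", "epsilon"):
        scores = {}
        for value in grids[name]:
            trial = dict(best, **{name: value})
            scores[value] = loocv_rmse(X, y, **trial)
        best[name] = min(scores, key=scores.get)  # keep the lowest-error value
    return best


# Synthetic stand-in data: 45 training months, 18 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 18))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=45)

# Hypothetical candidate grids; the study's own candidates are in its Tables 3 to 5.
grids = {
    "C": [0.5, 1, 2, 4],
    "gamma": [0.005, 0.0556, 0.5],
    "epsilon": [0.0001, 0.01, 0.1],
}
start = {"C": 1, "gamma": 1 / 18, "epsilon": 0.1}

best = coordinate_search(X, y, grids, start)
print(best)
```

A coordinate-wise search of this kind is cheaper than a full grid search but can miss interactions between C, γ and ε, which is a trade-off worth keeping in mind when interpreting the selected values.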

Comparison and prediction of SVM regression models from different data sources
Compared with the models based on influenza case data at a lag of one month alone or on Baidu search data alone, the SVM regression model based on ensemble data integrating historical influenza surveillance data and Baidu search data showed the best accuracy, with the lowest RMSE (42.654), and the best robustness, with the lowest MAPE (26.197%), as seen in Table 6.
The predicted values of the above three models and the actual number of influenza cases from October 2014 to December 2015 are presented in Fig. 1. The prediction curves of the model based on internet search query data and the model based on ensemble data were almost identical, and the trend of both curves was consistent with the overall trend of the actual influenza cases curve. The SVM regression model based on ensemble data was capable of predicting the timing and magnitude of most periods, whereas it failed to predict the influenza outbreak peak in March 2015. The predictions of the model based on flu data at a lag of one month were markedly lower than the actual values in the first six months of the forecast, but the overall trend was consistent in the following nine months. The residual of each predictor is displayed in Fig. 2.
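The comparison of the three data sources can be illustrated with a short Python sketch. It is a hedged example on synthetic data: scikit-learn's `SVR` stands in for the R package e1071 (which, unlike `SVR`, scales inputs and targets by default), the helper `evaluate` and the generated series are hypothetical, and the printed numbers have no relation to Table 6. Only the 45/15 split and the reported parameter values are taken from the text.

```python
import numpy as np
from sklearn.svm import SVR


def evaluate(X, y, n_train=45, C=2, gamma=0.005, epsilon=0.0001):
    """Fit on the first n_train months and score the remaining months."""
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
    model.fit(X[:n_train], y[:n_train])
    pred = model.predict(X[n_train:])
    obs = y[n_train:]
    rmse = float(np.sqrt(np.mean((obs - pred) ** 2)))
    mape = float(100 * np.mean(np.abs((obs - pred) / obs)))
    return rmse, mape


# Synthetic stand-ins for 60 months of data (not the Liaoning series).
rng = np.random.default_rng(1)
n = 60
search = rng.normal(size=(n, 17))  # 17 query series at their selected lags
lagged = rng.normal(size=(n, 1))   # case counts at a lag of one month
y = 100 + 10 * search[:, 0] + 5 * lagged[:, 0] + rng.normal(scale=1.0, size=n)

feature_sets = {
    "lagged cases only": lagged,
    "search queries only": search,
    "ensemble": np.hstack([search, lagged]),
}
results = {name: evaluate(X, y) for name, X in feature_sets.items()}
for name, (rmse, mape) in results.items():
    print(f"{name}: RMSE={rmse:.3f}, MAPE={mape:.3f}%")
```

Keeping the split chronological, rather than shuffling months, mirrors the forecasting setting in which only past data are available at prediction time.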

DISCUSSION
This article presented an efficient SVM regression model to predict flu activity and track the course of the epidemic in Liaoning province of China. The entire analysis demonstrated that the SVM regression model based on ensemble data was better than the models based on a single data source.
With the rapid development and popularity of the Internet, this new method of infectious disease surveillance based on online search query data is more convenient and accurate. According to the 41st Statistical Report on Internet Development released by the China Internet Network Information Center (CNNIC), by December 2017 the number of internet users in China had steadily increased to 772 million, and the internet penetration rate had reached 55.8%, exceeding the global average level (China Internet Network Information Center, 2018). Baidu is the most widely used search engine in China, making it the most representative and available data source for studies tracking the online information-seeking behavior of Chinese people. Based on Baidu search query data, Chinese scholars have made great efforts in the field of disease monitoring, covering diseases such as norovirus (Liu et al., 2017b), dengue, hand, foot, and mouth disease (Du et al., 2017), and epidemic erythromelalgia (EM) (Gu et al., 2015). These forecasting models performed well in early warning. However, most of this research focused on southeastern coastal regions of China, such as Guangdong and Zhejiang, and few disease prediction models have been constructed and applied in the coastal areas of northeast China. Predicting influenza activity in Liaoning province, located in the northeast of China, is therefore a significant attempt, and this article could provide some hints and lessons for flu forecasting and alerting in the region. A strong correlation between influenza cases and Baidu search terms was found in the present study.
Influenza is characterized by a short incubation period and a sudden onset of symptoms such as fever and cough (usually dry), so it is reasonable that most search terms about the flu virus, symptoms and therapy correlated closely with influenza cases in the same month. The most effective timing for vaccination is about one to two months prior to the flu season. Winter and spring are the peak flu seasons in Liaoning, China, so September and October are the best months for flu vaccination in the study area. Search behavior about the flu vaccine often precedes vaccination, which explains why Baidu search terms at a lag of three months had a strong correlation with the occurrence of influenza cases.
The present study showed that the forecasting model based on internet search queries was better than the model based on traditional data in terms of accuracy and stability. The results were consistent with those of other studies (Guo et al., 2017b; Yuan et al., 2013). However, Olson et al. (2013) investigated the reliability of Google Flu Trends (GFT) from 2003 to 2013 and compared flu timing and intensity between the forecasting data and actual influenza incidence at the national, regional and local levels. They concluded that GFT data could not serve as reliable surveillance for seasonal or pandemic influenza and that traditional surveillance is still irreplaceable. The main reason was that GFT was based on internet data without considering epidemiological factors such as the age distribution of patients, geographical location, illness complaints or clinical manifestations. Our study showed the advantages of ensemble data integrating traditional influenza incidence data and search engine data in forecasting. Meanwhile, there may be some room to improve the SVM model presented in this study. Although most of the forecast values fitted the actual influenza cases well, the SVM regression model failed to identify the influenza peak in March 2015. Climatic factors have a great impact on influenza incidence (Gomez-Barroso et al., 2017), so a possible reason for the miss is that March is a cold month within the flu peak season in northeastern China, a circumstance that internet search query data could not distinguish.
Several limitations of the influenza forecasting model based on ensemble data integrating traditional influenza case data and Baidu search engine data need to be mentioned. Firstly, media reports may influence internet searching behavior, which directly affects the performance of the forecasting model. In addition, because factors that affect influenza, such as seasonal and meteorological factors, were not considered, the forecasting results may be biased to some degree. Furthermore, the correlation analysis of the search keywords was mainly based on previously existing vocabulary. With the rapid changes of the internet environment, however, many new online search terms are produced at every moment. These new terms are hard to track and have usually been overlooked.

CONCLUSIONS
The present study built a forecasting model based on ensemble data integrating Baidu search query data and traditional flu data in Liaoning province. Among the SVM regression models, the model based on ensemble data showed the best accuracy and robustness, outperforming the models based on a single data source. It could complement traditional surveillance of influenza dynamics in Liaoning.