Improving Google Flu Trends for COVID-19 estimates using Weibo posts

While incomplete non-medical data has been integrated into prediction models for epidemics, the accuracy and the generalizability of the data are difficult to guarantee. To comprehensively evaluate the ability and applicability of using social media data to predict the development of COVID-19, a new confirmed case prediction algorithm that improves the Google Flu Trends algorithm, called Weibo COVID-19 Trends (WCT), is established based on the post dataset generated by all users in Wuhan on Sina Weibo. A genetic algorithm is designed to select the keyword set for filtering COVID-19 related posts. WCT consistently achieves the highest average test score in the training set, measured as the correlation between daily new confirmed case counts and the prediction results. It continues to produce the best prediction results among the compared algorithms when the number of forecast days increases from one to eight days, with the highest correlation score ranging from 0.98 (P < 0.01) to 0.86 (P < 0.01) over the whole analysis period. Additionally, WCT effectively improves the Google Flu Trends algorithm's shortcoming of overestimating the epidemic peak value. This study offers a highly adaptive approach for feature engineering of third-party data in epidemic prediction, providing useful insights for the prediction of newly emerging infectious diseases at an early stage.


Introduction
Since the outbreak of COVID-19 (formerly known as 2019-nCoV) in 2019 (Shen et al., 2020), the pandemic has become a major threat to the whole world. By May 30, 2021, the virus had affected more than 169 million people and caused the deaths of 3.5 million in more than 190 countries and regions worldwide (JHU, 2021). Although many measures have been taken to cope with the health emergency of national concern, such as social distancing, lockdowns, quarantines, and university and business closures (Tison et al., 2020), monitoring the dynamics of the epidemic and preventing its spread pose a huge challenge in practice due to the limited capacity of conventional disease surveillance systems. Studies have shown that publicly available data can play a crucial role in tracking the spread of epidemic disease as a complement to conventional public health surveillance (Gundecha and Liu, 2012; Samaras et al., 2020). Non-medical data generated from various sources (Aiello et al., 2020; Kirian and Weintraub, 2010; Ram et al., 2015) has been widely used to estimate disease incidences and to detect disease outbreaks before clinically confirmed data is available (Charles-Smith et al., 2015; Dai et al., 2021; Lu et al., 2021). Social media data collected from Facebook (Gittelman et al., 2015; Strekalova, 2016), YouTube (Basch et al., 2015; Nerghes et al., 2018), and Instagram (Guidry et al., 2017; Seltzer et al., 2017), and Internet search queries (Ginsberg et al., 2009; Zhao et al., 2018) are also used to predict diseases for public health concerns. For example, Twitter data is widely used for early warning and outbreak detection, such as to predict syphilis (Young et al., 2018), swine flu (Kostkova et al., 2014), flu (Chen et al., 2014), and Ebola (Yom-Tov, 2015).
A representative work is that of the Google research and development team, who developed the Google Flu Trends (GFT) algorithm based on the high correlation between the number of certain queries on the Google search platform and the level of influenza-like activity (Ginsberg et al., 2009). They accurately estimated the level of influenza activity in near-real time without knowing the development stage and transmission mechanism of the disease. Since then, many researchers have been inspired to track epidemics with social media data (Araujo et al., 2017; Huang et al., 2013; Signorini et al., 2011). For the unprecedented COVID-19 pandemic, some researchers have also applied social media and Internet data to monitor and estimate the development of the epidemic (Ayyoubzadeh et al., 2020; Li et al., 2020; Qin et al., 2020). However, many of these studies used only sampled, incomplete data, so the integrity of the dataset and the accuracy of the prediction models are both difficult to guarantee, and there is still a lack of a general prediction framework that can accurately predict the course of COVID-19 using social media data.
To detect and predict the development of COVID-19 using publicly available social media data, this study uses the daily new confirmed COVID-19 case counts in Wuhan reported by its Health Commission, together with a complete dataset of user posts from Sina Weibo (Weibo, 2020), a Twitter-like microblog platform in China, to propose a new confirmed case prediction algorithm named Weibo COVID-19 Trends (WCT) based on the GFT algorithm. WCT can effectively predict the daily new confirmed case counts before the official report is released. This study also provides a general prediction framework that can be easily extended to predict other diseases or public emergencies using accessible third-party data. It offers a promising approach for forecasting newly emerging infectious diseases at an early stage, when most epidemiological characteristics are unknown. Table 1 lists the nomenclature used in each processing step of this study.
The main contributions of this study are summarized as follows:
1. A new confirmed case prediction algorithm is developed based on GFT to predict the development of COVID-19.
2. A genetic algorithm is designed to select a keyword set to filter Weibo posts related to COVID-19.
3. A highly adaptive framework for feature engineering, which allows third parties to utilize the data for epidemic predictions, is proposed.
The rest of the paper is organized as follows. Section 2 reviews the GFT algorithm and its updated versions. Section 3 mainly describes the framework for the proposed COVID-19 prediction algorithm (i.e., WCT), in which a genetic algorithm is implemented to improve related keyword set selection. Section 4 presents the estimated results of WCT with a comparison with other algorithms including GFT. Finally, Section 5 summarizes the findings and limitations of this study.

The initial version of GFT
Google Flu Trends (GFT) is a short-term forecasting tool for weekly influenza activity as an auxiliary method of influenza surveillance (CDC, 2020). It was launched in 2008 with satisfying forecast precision at that time and was further applied to influenza surveillance and early warning systems in many countries (Butler, 2013). Although Google had improved the details of the algorithm many times in the process of GFT application, due to the impact of a sudden increase in influenza-like illness (ILI) related queries and other factors (Kandula and Shaman, 2019;Lazer et al., 2014b), the problem of inaccurate prediction of the algorithm has never been solved completely. Finally, Google shut down the GFT flu prediction function in 2015 (GFT, 2015).
The most well-known GFT algorithm is its initial version. With input on the fraction of certain ILI-related search queries from Google and the percentages of ILI physician visits from the US Centers for Disease Control and Prevention (CDC), the GFT algorithm trains a log-odds linear regression model (LR) to estimate ILI incidence. LR uses the log-odds of an ILI physician visit and the log-odds of an ILI-related search query to realize regression prediction:
$$\operatorname{logit}(I(t)) = \alpha \, \operatorname{logit}(Q(t)) + \varepsilon \quad (1)$$
where $\operatorname{logit}(p) = \ln(p / (1 - p))$, $I(t)$ is the percentage of ILI physician visits, $Q(t)$ is the ILI-related query fraction at time $t$ (i.e., the sum of each query fraction in the selected ILI-related search queries set), $\alpha$ is the multiplicative coefficient, and $\varepsilon$ is the error term. Firstly, the model is trained on each of the 50 million candidate common queries separately. It outputs the prediction result of ILI physician visits and the Pearson correlation score between the estimates and the CDC ILI data. Then the aggregated top-scoring queries are used to train the model, and the best fit (when the number of keywords n = 45) is selected automatically. The selection of queries from the best fit is called "the greedy combination algorithm" (GCA). Finally, the selected queries are used to train the model and predict the ILI physician visits. This approach successfully estimated the level of weekly influenza activity in the United States from 2007 to 2008 with a mean correlation score of 0.97, 1-2 weeks ahead of the reports published by CDC. It offers the opportunity to use search queries to detect influenza epidemics and inspires researchers to explore the application of social media data in public health surveillance (Cui et al., 2015; Schmidt, 2012).
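As an illustration of the log-odds regression described above, the following minimal sketch fits the single coefficient α by least squares through the origin, mirroring the form of the equation as stated in the text. The function names (`fit_alpha`, `predict_ili`) and the toy numbers are hypothetical and are not taken from GFT; fractions are assumed to lie strictly between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a proportion p, assumed to lie strictly in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_alpha(query_frac, ili_frac):
    """Least-squares estimate of alpha in logit(I) = alpha * logit(Q) + eps.

    No intercept is fitted, to match the equation as written in the text.
    """
    xs = [logit(q) for q in query_frac]
    ys = [logit(i) for i in ili_frac]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict_ili(alpha, query_frac):
    """Map query fractions to estimated ILI visit fractions."""
    return [inv_logit(alpha * logit(q)) for q in query_frac]

# Toy data (illustrative only): ILI fractions generated with alpha = 0.8.
query_frac = [0.001, 0.002, 0.004, 0.008]
ili_frac = [inv_logit(0.8 * logit(q)) for q in query_frac]

alpha = fit_alpha(query_frac, ili_frac)
estimates = predict_ili(alpha, query_frac)
```

Because the toy ILI series is generated without noise, the fitted coefficient recovers the generating value; with real data the error term would leave a residual.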

Updated versions and developments
Google officially launched GFT (GFT 1.0) in November 2008, and it subsequently gained wide popularity. However, in the first wave of the influenza A (H1N1) epidemic, that is, from April to August 2009, the predicted incidence of H1N1 was far lower than the ILI activity reported by CDC (Butler, 2013). Therefore, Google upgraded GFT for the first time and developed the second version, GFT 2.0 (Cook et al., 2011).
GFT 2.0 adjusted the number and category of selected search queries, referring to the ILI monitoring data during the first wave of the H1N1 epidemic (March 29, 2009 to September 13, 2009). It increased the number of search query terms and deleted search queries that were not directly related to influenza, which significantly improved its performance. From its launch in September 2009 until 2012, its prediction results were very similar to the ILI activity in the United States. In the influenza epidemic season of 2012-2013, however, GFT 2.0 greatly overestimated the influenza epidemic, with almost twice the result of CDC monitoring (Butler, 2013). This overestimation led to the second upgrade of GFT (Copeland et al., 2013). GFT 3.0 was officially launched in October 2013, and it made two changes based on GFT 2.0, that is, weakening the impact of abnormal media hot spots and using elastic net to predict ILI (previously based on linear regression). Compared with GFT 2.0, GFT 3.0 significantly reduced the peak amount of its predicted ILI in the 2012-2013 flu season. However, its predicted result was still slightly higher than that of CDC in the United States, and in the 31 weeks after the implementation of GFT 3.0, the prediction result was higher in 23 weeks (Lazer et al., 2014a).
The last upgrade of GFT took place in August 2014 (Lampos et al., 2015). GFT 4.0 expanded the GFT 3.0 model by incorporating the queries selected by the Elastic Net into a non-linear regression framework, based on a composite Gaussian Process. It also injected the ILI activity data as prior knowledge about the disease into the model. The bias of GFT prediction was significantly reduced. GFT 4.0 was used until August 2015, when Google shut down the GFT prediction service.
Because of the important role of ILI surveillance in public health, many researchers are still committed to improving the predictive performance of ILI, such as correcting the limitations of the GFT algorithm process, updating or adding the training data source of the prediction model, and proposing new prediction algorithms based on GFT. Kandula and Shaman (2019) proposed a corrected GFT algorithm, which uses the estimated value of the original GFT algorithm as new data for training the ILI prediction model, reducing the total prediction error by 44%. This algorithm considers the problem that the ILI data provided by CDC is not timely and incomplete when the GFT algorithm is proposed. It uses complete ILI data and GFT estimates to train the prediction model and replaces LR with an autoregressive integrated moving average (ARIMA) model. The algorithm greatly improves the prediction accuracy and proves the validity and practicability of the GFT prediction results. Similarly, other studies (Dugas et al., 2013;Preis and Moat, 2014;Santillana et al., 2015;Wagner et al., 2018) also found that replacing LR with other non-linear regression models and combining new data sources, including search queries, social media, and traditional data sources, into the prediction model can significantly improve the accuracy of ILI prediction.

Data description
Sina Weibo is a popular Chinese microblog platform with millions of users voluntarily sharing their lives and thoughts (Weibo, 2020). The considerable amount of post-data generated by so many users offers the possibility of monitoring and predicting the development of emerging infectious diseases. In this study, all posts made by Weibo users in Wuhan from December 1, 2019, to March 20, 2020, were collected. The dataset spans 111 days and contains the period before the COVID-19 outbreak and its evolution. The dataset contains 38,182,972 posts published publicly by 2,239,450 unique users. Each record of post data contains the post's content, type (whether the post was original or forwarded), time, user nickname, and corresponding encryption ID. If the post was forwarded, the post data contained the original post content (otherwise, it was blank), original time posted, original user nickname, and ID. During the data collection period, the mean number of daily unique users was over 117,000, and they generated more than 343,000 posts every day. On average, each user generated 2.9 posts per day.  (Fig. 1c), and the Pearson correlation score is 0.89 (P < 0.01).

The framework of Weibo COVID-19 Trends (WCT)
Inspired by the high correlation score between the relative frequency of certain keywords in Weibo posts and the daily new confirmed case counts of COVID-19, a new confirmed case prediction algorithm named WCT is proposed based on GFT. The basic algorithm process of WCT and its comparison with GFT are shown in Fig. 2. Both algorithms train a regression model to predict the case counts, with the Pearson correlation score (R) between the prediction results and the real case counts as the evaluation indicator. In WCT, GCA is replaced by the genetic algorithm (GA) (Mitchell, 1998) when selecting the keyword set for the best fit of the prediction model. After comparing the performance of different prediction models, the LR model in GFT is selected as the prediction model in WCT.

GA for keyword set selection
A prior list of 41 keywords (see Appendix Table A) was first compiled to select all posts that contain COVID-19 information, including the pneumonia-like epidemic's medical terminology, symptoms, and epidemic control measures and organizations. This yielded 4,761,010 related posts out of a total of 38,182,972 posts from all users (12.47%). Next, the keywords from each post related to the pneumonia-like epidemic were extracted, and a list of the 118,572 most commonly used keywords (see Appendix Table B) was produced. The 2,000 most frequent keywords, ranked by absolute frequency, were chosen for the subsequent analysis. The "absolute frequency" of a keyword is the total number of posts containing that keyword since the beginning of the statistical period. Next, the time series of the relative frequency of each commonly used keyword was obtained. The "relative frequency" of a keyword on a certain day refers to the number of all posts containing the keyword on that day divided by the number of unique users on that day. The relative frequency of a keyword set (KS), i.e., the sum of the relative frequencies of the keywords in the selected KS, was used to train the case count prediction model and then predict the development of the epidemic. The purpose of KS combination and selection is to find the most epidemic-relevant keyword set (MKS) from the list of most commonly used keywords in Weibo posts. This paper designs a selection algorithm to seek the MKS that obtains the highest R between the prediction results and the real case counts. Viewing the composition of a KS as analogous to an arrangement of chromosomes, GA is used to select the MKS. The fitness function of GA maximizes R between the prediction results yielded by the prediction model and the real case counts. The process of GA is as follows:
Step 1 KS initialization. The initial KS group is formed by M KSs, with each KS containing N keywords. Each KS is scored according to the fitness function to maximize R.
Step 2 KS update. New KSs are formed through crossover, mutation, and combination of keywords in the KSs. Each iteration of the algorithm chooses the M better KSs based on R for the next generation, and the iteration repeats.
Step 3 Stop criteria. When the maximum number of iterations MG is reached or R is high enough, the algorithm stops and the program outputs the MKS.
The flow chart of GA is shown in Fig. 3a. In the implementation, the parameters were set as M = 25 and MG = 100. An MKS was then obtained separately for each fixed length N from 1 to 50. To avoid over-fitting, the training period was set from December 1, 2019, to January 29, 2020, and the test period was set from January 29, 2020, to February 22, 2020. To evaluate the advantages of GA, the MKS obtained by GCA in GFT was also analyzed. The detailed MKS selection results are presented in Section 4.2.
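The three GA steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: the fitness is the direct Pearson R between the summed relative frequency of a KS and the case counts (rather than R of a trained prediction model), and the one-point crossover, the mutation rate `p_mut`, the threshold `r_stop`, and the toy data are all hypothetical choices.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def fitness(ks, freq, cases):
    """R between the summed relative frequency of keyword set ks and the case counts."""
    total = [sum(freq[k][d] for k in ks) for d in range(len(cases))]
    return pearson(total, cases)

def select_mks(freq, cases, N, M=25, MG=100, p_mut=0.1, r_stop=0.999, seed=0):
    rng = random.Random(seed)
    vocab = sorted(freq)
    score = lambda ks: fitness(ks, freq, cases)
    # Step 1: initialise M keyword sets of length N (duplicates are allowed).
    group = [[rng.choice(vocab) for _ in range(N)] for _ in range(M)]
    for _ in range(MG):
        # Step 2: build children by one-point crossover followed by mutation.
        children = []
        for _ in range(M):
            a, b = rng.sample(group, 2)
            cut = rng.randint(0, N)
            child = a[:cut] + b[cut:]
            child = [rng.choice(vocab) if rng.random() < p_mut else k for k in child]
            children.append(child)
        # Keep the M best sets among parents and children for the next generation.
        group = sorted(group + children, key=score, reverse=True)[:M]
        # Step 3: stop early once R is high enough; otherwise run MG generations.
        if score(group[0]) >= r_stop:
            break
    return group[0], score(group[0])

# Toy data (illustrative only): daily relative frequencies of four keywords.
cases = [1, 2, 4, 8, 16, 12, 6, 3]
freq = {
    "fever": [0.1, 0.2, 0.4, 0.8, 1.6, 1.2, 0.6, 0.3],
    "cough": [0.1, 0.1, 0.3, 0.6, 1.2, 1.4, 0.8, 0.4],
    "concert": [0.5, 0.4, 0.5, 0.4, 0.5, 0.4, 0.5, 0.4],
    "article": [0.3, 0.3, 0.3, 0.2, 0.3, 0.3, 0.3, 0.3],
}
best_ks, best_r = select_mks(freq, cases, N=2, MG=30)
```

Because parents compete with their children for the M slots, the best score in the group never decreases from one generation to the next.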

LR for predicting the number of new confirmed cases
In this section, LR model was applied to predict the number of new confirmed cases using the relative frequency of MKS obtained by GA and a historical case count sequence. The analysis period covers the complete development stage of COVID-19 in Wuhan except February 12 and 13, 2020, due to a change in the criteria for counting diagnoses of the virus. During that period, the number of new confirmed cases increased abnormally. The starting and ending times of the training set and the predicting set are from December 1, 2019, to February 21, 2020, and from February 22, 2020, to March 20, 2020, respectively. The case counts series were manually smoothed with a 3-day window length and then used as input data for prediction.
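The text does not specify how the 3-day smoothing was carried out. As one plausible reading, the sketch below applies a trailing moving average, so that each smoothed value uses only the current day and the two preceding days; the function name and the example series are hypothetical.

```python
def smooth_trailing(series, window=3):
    """Trailing moving average; early days use as many past values as exist."""
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        chunk = series[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Illustrative case counts with a one-day spike.
raw_counts = [3, 6, 9, 30, 3]
smoothed_counts = smooth_trailing(raw_counts)
```

A trailing window is chosen here because it does not draw on counts from days that would not yet be known at prediction time; a centered window would be an equally valid reading of the text.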
There are also two parameters in the fitting process, the duration (D) of the training data and the lag (g) for prediction. For example, a prediction model trained with D = 6, g = 1 is shown in Fig. 3b. In this study, D = 3 was set to ensure adequate training data in the training process, and g = 1 was set to predict the next day's case counts using all information up to date. All training processes apply three-fold cross validation to reduce overfitting. The training and predicting processes are introduced as follows.
Training process
$$\mathrm{Model}_{trained} = \mathrm{FIT}_m(C_t, C_{t-g}, C_{t-g-1}, \ldots, C_{t-g-D+1}, P_{t-g}, P_{t-g-1}, \ldots, P_{t-g-D+1}) \quad (2)$$
where $\mathrm{Model}_{trained}$ is the trained model, $C_t$ and $P_t$ are the case count and the relative frequency of MKS at time $t$ during the training period, respectively, and $\mathrm{FIT}_m$ is the fitting process that takes the training data $\{C_t, C_{t-g}, C_{t-g-1}, \ldots, C_{t-g-D+1}, P_{t-g}, P_{t-g-1}, \ldots, P_{t-g-D+1}\}$ to train $\mathrm{Model}_{trained}$. The length of the training window is D and the dimension of the training data is 2D + 1. The whole training set is $\{C_t, C_{t-g}, C_{t-g-1}, \ldots, C_{t-g-D+1}, P_{t-g}, P_{t-g-1}, \ldots, P_{t-g-D+1}\}$ (t increases from 1).
Predicting process
$$C_t = \mathrm{Model}_{trained}(C_{t-g-1}, C_{t-g-2}, \ldots, C_{t-g-D+1}, P_{t-g-1}, P_{t-g-2}, \ldots, P_{t-g-D+1}) \quad (3)$$
where $C_t$ is the case count at time $t$ during the predicting period. Historical data is input as $\{C_{t-g-1}, C_{t-g-2}, \ldots, C_{t-g-D+1}, P_{t-g-1}, P_{t-g-2}, \ldots, P_{t-g-D+1}\}$ into the trained model $\mathrm{Model}_{trained}$. Then the prediction result of the case count at time $t$ is output. The length of the predicting window is D and the dimension of the predicting data is 2D. The whole predicting set is $\{C_{t-g-1}, C_{t-g-2}, \ldots, C_{t-g-D+1}, P_{t-g-1}, P_{t-g-2}, \ldots, P_{t-g-D+1}\}$ (t increases from 1).
Previous research has demonstrated that non-linear regression models, such as Gaussian Processes and Long Short-Term Memory (LSTM), can achieve great performance in COVID-19 tracking and prediction (Alakus and Turkoglu, 2020; Lampos et al., 2021). The performance of the LSTM model was therefore also calculated for comparison with the LR model. A 4-layer LSTM model was designed with a dropout rate of 0.15. The loss function was mean square error (MSE) and the optimizer was Adam. The number of training epochs was 100 and the batch size was 10. The detailed estimated results are provided in Section 4.3.
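To make the roles of D and g concrete, the sketch below assembles training rows following the index pattern of the training equation: each row holds D lagged case counts and D lagged MKS relative frequencies, starting g days before the target day. The function name `build_rows` and the toy series are hypothetical, and the fitting step itself is left out; any linear regressor could be trained on the resulting rows.

```python
def build_rows(C, P, D, g):
    """Build lagged features for predicting C[t] from data up to day t - g.

    Each feature row is [C[t-g], ..., C[t-g-D+1], P[t-g], ..., P[t-g-D+1]]
    (2D values) and the matching target is C[t], giving 2D + 1 values per sample.
    """
    X, y = [], []
    for t in range(g + D - 1, len(C)):
        lags = range(t - g, t - g - D, -1)  # t-g, t-g-1, ..., t-g-D+1
        X.append([C[i] for i in lags] + [P[i] for i in lags])
        y.append(C[t])
    return X, y

# Toy series (illustrative only): case counts and MKS relative frequency.
C = [10, 20, 30, 40, 50, 60]
P = [1, 2, 3, 4, 5, 6]
X, y = build_rows(C, P, D=3, g=1)
```

With D = 3 and g = 1, the first usable target is the fourth day, because three earlier days are needed to fill the window.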

Overview of COVID-19 related keywords and case counts
To investigate the relationship between the frequency of COVID-19 related keywords and the number of new confirmed cases per day, the temporal evolution of the keywords and of the number of new confirmed COVID-19 cases in Wuhan was analyzed in this section. The direct Pearson correlation score R between the relative frequency of each of the top 2,000 commonly used keywords in Weibo posts and the number of new confirmed cases each day during the whole statistical period was calculated. Most of the correlated keywords are related to the treatment of COVID-19 ('hospitalization', 'physical examination', 'patient', and so on), and a few are used to describe symptoms or conditions (such as 'breathing difficulties', 'cough'). The most correlated keywords are 'hospital beds' (R = 0.84, P < 0.01) and 'Shu Hongbing' (R = 0.78, P < 0.01). The R value, as well as the absolute frequency of the top ten most correlated and uncorrelated keywords, is listed in Appendix Table C.
The evolution of the number of confirmed cases of COVID-19 and the relative frequency of the five most relevant keywords are shown in Fig. 4. It can be seen that the relative frequency of each keyword is very similar to the trend of the number of new confirmed cases, supporting the motivation of tracking COVID-19 with Weibo data. In contrast, the 10 keywords with the weakest correlation ('article', 'new product', '##', 'grandpa Li', 'concert', 'Trump', '19', 'Hubei Economy TV', '2019') were also analyzed. These keywords with low correlation scores have little to do with the symptoms or treatment of COVID-19.
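The keyword screening described in this section can be sketched as follows: the relative frequency of a keyword on a day is its post count divided by the number of unique users that day, and keywords are ranked by the Pearson R of that series against the daily new confirmed cases. The function names and all numbers below are hypothetical toy values, not data from this study.

```python
import math

def relative_frequency(post_counts, unique_users):
    """Posts containing the keyword per day, divided by unique users that day."""
    return [c / u for c, u in zip(post_counts, unique_users)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rank_keywords(counts_by_keyword, unique_users, cases):
    """Return (keyword, R) pairs sorted from most to least correlated."""
    scores = {
        kw: pearson(relative_frequency(counts, unique_users), cases)
        for kw, counts in counts_by_keyword.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (illustrative only).
unique_users = [100, 100, 200, 200]
cases = [1, 2, 4, 8]
counts_by_keyword = {
    "hospital beds": [1, 2, 8, 16],
    "concert": [8, 4, 4, 2],
}
ranked = rank_keywords(counts_by_keyword, unique_users, cases)
```

Dividing by the number of unique users keeps a keyword from appearing to rise simply because more people were posting on a given day.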

The R value of the selected MKS
The GA and GCA algorithms were both used to select MKS. By setting the length of MKS (N) to vary from 1 to 50 and applying the LR and LSTM prediction models (D = 3, g = 1) within the GA and GCA algorithms, the changes in the indicator R between the prediction results and the real case counts were compared to evaluate the performance of the MKS selection algorithm. Each prediction model adopted three-fold cross validation and then output the average test score of the training set as R.
The MKSs (1 ≤ N ≤ 50) with the highest R selected by each algorithm are presented in Table 2. The original Chinese text for the keywords in each MKS is provided in Appendix Table D. Most keywords in the MKS obtained by the GA or GCA algorithm are medical terms directly related to COVID-19 (such as 'isolation', 'CT', and 'coronavirus'). The MKS also contains keywords which are not directly related to COVID-19, such as numbers ('14', '17') and personal pronouns ('you'). GA has the feature of retaining the most relevant keywords and automatically outputting the MKS with the best performance. The keywords in an MKS can be repeated if duplication makes the MKS perform better. Some duplicated keywords can be found in the MKS of the GA-related algorithms (see Table 2). This is because the KS with duplicated keywords performs best in the iteration process of GA and becomes the MKS. Judging from the correlation between the relative frequency of MKS and the daily case count of COVID-19, the performance of GA and GCA is close, but judging from the R value of the MKS obtained by the two algorithms, GA is better than GCA. The highest test score is obtained by the GA&LR algorithm (WCT) with R = 0.66 (p < 0.01), which is higher than the test score of GFT (i.e., GCA&LR) of R = 0.62 (p < 0.01).
Among the four combination algorithms, GA&LR (WCT) has the best performance with an average test score of R = 0.65 (p < 0.01), while the average test score of GCA&LSTM is the smallest at R = 0.43 (p < 0.01). The variation of R for MKS with different N is shown in Fig. 5a. Notably, GA-based predictions are much more stable than GCA-based ones. For GA&LR and GA&LSTM, the correlation scores vary in a very limited range, from 0.60 to 0.66 and from 0.55 to 0.62, respectively. For GCA-based predictions, however, the correlation scores show unexpectedly large variations. While GCA&LSTM generates the poorest prediction results, the correlation score of GCA&LR can drop to 0.21 when N = 50. In short, predictions of daily new confirmed cases based on the MKS filtered by GA agree closely with the real data.
In addition, the performances of MKSs filtered by GA and GCA (N from 1 to 50) were compared when the fitness function was to maximize the direct R between the relative frequency of the MKS and daily new confirmed case counts. The experimental results further evidenced the superiority of GA in selecting more relevant keyword sets, and it is not sensitive to the length of keywords N (see Figure D8 in Appendix).

The prediction performance of WCT
In this section, the relative frequency of the selected MKS and daily new confirmed case counts were applied to train prediction models and predict the case counts in the whole analysis period with D = 3, g = 1. For each prediction result, R values between the prediction results and the real case counts in the whole analysis period, the training set, and the predicting set were calculated as the indicators of performance. Note that, unlike the three-fold cross validation technique used in the previous analysis, all the data in the training set were used to construct the models in this section.
The MKSs with the highest R selected by GA and GCA were used to train the LR and LSTM models, where the lengths of MKS in GCA&LR, GCA&LSTM, GA&LR, and GA&LSTM are N = 35, 37, 44 and 25, respectively (see Table 2). The prediction results show that WCT (referred to as GA&LR in Fig. 5b) has higher prediction accuracy than GFT (referred to as GCA&LR in Fig. 5b). The performance of WCT is R = 0.97 (p < 0.01) over the whole analysis period, the best among the compared models, while the performance of GFT is R = 0.96 (p < 0.01). The performance of WCT in the training set (R = 0.98 (p < 0.01)) and the predicting set (R = 0.87 (p < 0.01)) is also the best among the four algorithms.
Compared with GFT, which overestimated the daily new confirmed cases during the outbreak period (from February 4, 2020 to February 5, 2020) by 6%-8%, WCT breaks through this limitation, and its prediction error is kept below 100 cases (0-3%) (Fig. 6a). The combination of GA and LR effectively overcomes GFT's shortcoming of over-estimating the epidemic peak value. Besides, in both the training and testing processes, WCT consistently outperforms the other algorithms. In contrast, the LSTM model does not perform well in this task. In both GA&LSTM and GCA&LSTM, the peak number of cases was underestimated by up to 80%, and in the late stage of the epidemic, the LSTM models overestimated the number of new cases by 10%-60% from March 1, 2020 to March 22, 2020.
Table 2 The keyword combination and performance of MKS selected by four algorithms.

Sensitivity analysis of WCT
In this section, the performances of the WCT algorithm under different parameter combinations were tested to evaluate the effect of duration of the training data (D) as well as the lag for prediction (g). The parameter D is set to change from 1 to 7, implying that the length of the training window increased from one day to a week before the days to be predicted. The parameter g is set to change from 1 to 15, implying that the algorithm attempts to predict the number of daily new confirmed cases on the gth day in the future. The length of MKS when it produces the best performance in the three-fold cross validation for each algorithm is used in this analysis (see Table 2). Fig. 7 shows the performance of the four algorithms.
The four algorithms all show robustness to the parameter D, especially when g is set in the range of 1-3. When the number of days of historical data used for prediction (D) increases from 1 to 7, the performances of the four algorithms are rather robust, in comparison to the large variation of R in terms of the lag parameter g. Overall, there is a weak tendency of increased performance with larger D, i.e., the prediction model works better when more historical data are included in the training process.
When g is small, i.e., for more recent predictions, the WCT model continues to produce the best result provided D is in the range of 2-5. For example, when the algorithm extends the prediction from the next day (g = 1) to the second day (g = 2) with D = 3, the performance of WCT is R = 0.97 and R = 0.96, while the R values of GFT are only 0.96 and 0.93, respectively. When g increases from 10 to 12 with a week's historical data being trained (D = 7), the R value of WCT varies in the range of 0.71 to 0.59, whereas that of GFT is only 0.59 to 0.51. The four algorithms all show sensitivity to the parameter g. As the number of days to predict cases in advance increases, it becomes more difficult for the model to predict the future based on existing data. Compared to the GCA-based algorithms (GFT and GCA&LSTM), the GA-based algorithms (WCT and GA&LSTM) show less sensitivity to changes in the g parameter. For example, WCT still performs well with R = 0.88 (D = 6) when g = 7, while the maximum R of GFT is only 0.78 (D = 7).
Comparing the predictions based on the LR model and the LSTM model, the LSTM model is less sensitive to the g parameter and can still maintain a good performance when g increases. WCT continues to produce the best prediction results among the compared algorithms when the number of forecast days increases from one to eight days, with the highest correlation score ranging from 0.98 (p < 0.01) to 0.86 (p < 0.01). However, when g increases to 15, the GA&LSTM model can maintain an R as high as 0.67 (D = 7), while WCT has R = 0.49 (D = 7).
Some studies have applied social media datasets to predict new confirmed cases of COVID-19. Qin et al. (2020) used the Baidu search index to predict new confirmed case counts with a performance of R = 0.99 for g = 1. However, this model is of limited practical value as it was not tested for longer-term predictions; WCT, on the other hand, can predict case counts 1-8 days into the future with a high R = 0.86-0.98. Lampos et al. (2021) designed an unsupervised prediction model using Google Trends data, which can predict newly confirmed case counts with R = 0.83-0.85, more than 16 days ahead of official reports. However, this model relies on manual construction of the keyword set for Google Trends, which is highly subjective, whereas WCT utilizes GA to select the MKS automatically and heuristically, with little human intervention in the MKS selection process. Ayyoubzadeh et al. (2020) also used Google Trends data to predict newly confirmed case counts in Iran. Comparing a linear model and an LSTM model, they found that the linear model performs better than the LSTM model, which is consistent with the conclusion of this study.
From the above comparison results of sensitivity analysis, it is clear that the WCT method exhibits relatively stronger robustness to the parameters D and g. It produces the highest correlation scores with short future predictions and can maintain relatively more stable performance for longer future estimates.

Conclusion and discussion
In this study, an algorithm called WCT is proposed to predict new confirmed cases of COVID-19. Inputting the number of historical case counts and a comprehensive dataset of Sina Weibo posts by Wuhan users, the number of daily new confirmed cases can be accurately predicted by WCT.
This paper applied a genetic algorithm to automatically construct the keyword set. It consistently achieves the highest average test score in the training set, higher than that obtained by GCA (0.66 vs. 0.62). The keyword sets selected by the genetic algorithm are more relevant and more stable than those selected by GCA in terms of the Pearson correlation score between the prediction results and the real case counts.
The relative frequency of related posts filtered by the selected keyword set is then fed into the LR algorithm, which produces estimates with a high correlation score of 0.97 (p < 0.01) over the whole analysis period, one day ahead of the official reports. WCT can accurately predict the development of COVID-19 using only the historical number of cases combined with Weibo post data. Compared with GFT, WCT overcomes GFT's shortcoming of over-estimating the epidemic peak value.
However, since the development of public emergencies on social media is dynamic, one limitation of the WCT model is that the keyword set may need to be updated continuously as a public emergency develops, to ensure accurate prediction in the later stages of an epidemic or other public emergency. This makes the application of the method challenging. Compared with the prediction results of the classical GFT model, considering the influence of noise and other factors, the prediction accuracy of the WCT model in short-term estimates needs to be further improved.
This study offers a promising approach of using Sina Weibo data or other social media data to realize syndromic surveillance-based disease prediction and to increase global awareness of events. It provides a process for mining epidemic development trends from large-scale social media data without too many manual parameters. In the future, the use of WCT can be extended to monitor and track other diseases or public emergencies by inputting social media data.

Declaration of competing interest
The authors declare that there are no conflicts of interest.