A data-driven model for social media fake news detection

Abstract: The rapid development of social media has led to the spread of large amounts of fake news, which not only affects people's daily lives but also harms the credibility of social media platforms. Detecting Chinese fake news is therefore a challenging and meaningful task. However, existing fake news datasets from Chinese social media platforms are relatively small and were collected long ago, so they cannot meet the requirements of further research. Against this background, we release a new Chinese Weibo Fake News dataset, which contains 26320 fake news items collected from Weibo. In addition, we propose a fake news detection model based on data augmentation that effectively alleviates the lack of fake news data and improves the generalization ability and robustness of the model. We conduct extensive experiments on our Chinese Weibo Fake News dataset and successfully deploy the model as a web service. The experimental results demonstrate the effectiveness of the proposed end-to-end model for detecting fake news on social media platforms.


Introduction
In the Internet era, the channels through which people send and receive information have become more abundant, making it straightforward to share information and interact with one another. Online social media platforms (e.g., Weibo, Douyin, and Facebook) play an increasingly important role in content production and information propagation. However, fake news spreads spontaneously on these platforms, and some people with malicious intent use them to spread fake news virally [1−3]. Fig. 1 shows a piece of fake news about taxes on Weibo. The proliferation of fake news causes public panic, disrupts social order, sways public opinion, and manipulates the focus of the public; thus, fake news has become a significant destabilizing factor on social media [4]. For example, widespread fake news on social media caused public panic during the Covid-19 crisis [5]. Therefore, effective detection of fake news on social media is of great significance for maintaining the stability of social life and cyberspace security.
In general, there is no single accepted definition of fake news. The Merriam-Webster Online Dictionary defines fake news as 'news reports that are intentionally false or misleading' [6]. Shu defined fake news as a verifiably false piece of information shared intentionally to mislead readers [7]. Ajao et al. defined it, in the context of online social media, as 'any story circulated, shared, or propagated which cannot be authenticated' [8]. In academic research, however, fake news is often defined more broadly as an unverified or unconfirmed message. In this study, we define fake news as misleading information that has been proven false.
In recent years, academics and practitioners worldwide have paid increasing attention to the spread of fake news on social media platforms. To address this problem, a large number of fake news analysis and detection works have appeared and achieved remarkable results. The earliest identification methods were detection techniques based on manual features [9−10]. On the one hand, with the development of artificial intelligence, recognition techniques based on machine learning were proposed. A popular approach is to use classical text representation methods such as n-gram and bag-of-words models [7] and then train a supervised classifier [11]. Since the introduction of deep learning, many works have applied it to mine higher-level feature representations for fake news detection [12−13], learning better text representations or sequence features from fake news. In addition, user characteristics and information dissemination structures are important features. A large number of malicious accounts on social networks deliberately spread content containing misinformation to manipulate people's opinions and decisions [14], so some fake news detection methods focus on detecting such users [15], for example by using account attributes (e.g., the language used by the account, its geographic location, and its authenticated identity) [16−17]. On the other hand, many researchers have focused on international social media platforms (Facebook, Flickr, etc.), whereas fewer have focused on Chinese platforms. The few works geared towards Chinese social media platforms such as Weibo [18] either analyze limited cases or a limited amount of data, making it difficult for models to learn the characteristics of fake news sufficiently to achieve reliable detection results [12].
This greatly limits the progress and development of fake news analysis and detection and limits the comprehensive understanding of fake news on social media. Therefore, more data are required to enhance the discriminative power of the automatic fake news detection model.
Fake news detection has social significance, but practical applications face several problems. Manual fake news detection is undoubtedly the most reliable method. For example, international anti-rumor websites require experts to analyze information and provide evidence to clarify whether it is fake news, and the domestic social media platform Weibo provides a false-information reporting function. Weibo's annual report shows that the number of monthly reports reached as high as 127200, with a minimum of 74100. Because the correctness of the news is determined entirely by humans, manual detection depends heavily on the ability of the appraiser; its knowledge limitations and long detection cycle are obvious shortcomings. Meanwhile, information spreads explosively and the scale of fake news grows exponentially, so manual detection gradually fails to meet detection needs. It is also challenging to quickly distinguish users who maliciously report microblogs from those who reasonably report questionable ones. Therefore, the sheer volume of data strains manpower requirements and manual review efficiency.
In this study, considering the current problems of fake news detection, we first release a Chinese Weibo Fake News dataset that includes more than 20000 fake news items and more than 30000 real news items. Compared with previous Chinese fake news datasets, this significantly expands the data richness of existing Chinese social network fake news datasets. The dataset also includes a wealth of user information and report information on fake news, including users' reporting reasons, to facilitate further research. We also evaluate the effectiveness of several current fake news detection methods on our dataset. Second, based on our dataset, we propose a novel method that combines contextualized representations from large pre-trained language models, such as BERT [19] or XLNet [20], with the machine learning model LightGBM [21] for fake news detection on Chinese social media. The contributions of this study are summarized as follows.
(i) We created a new Chinese social media platform fake news dataset containing a high quantity of content and abundant information to facilitate research on fake news detection.
(ii) We propose a framework based on post content and user attributes for fake news detection and evaluate the effectiveness of our dataset.
(iii) Our model can realize end-to-end fake news detection and achieve stable and accurate results, which can reduce the burden of manual reviews. The rest of the paper is organized as follows: Section 2 provides a review of related works, Section 3 presents our created dataset, and Section 4 introduces our proposed method. Experiments are presented in Section 5. Finally, conclusions are presented in Section 6.

Related work
Researchers have made great contributions to the analysis and detection of fake news; thus, we provide a brief review of related work in two directions: data collection and fake news detection methods.
Dataset-related: Song et al. [22] collected large-scale social media fake news data using Weibo as the research platform and conducted a comprehensive quantitative and semantic analysis of Chinese social media fake news. Ma et al. [12] proposed two datasets in 2016, and many subsequent fake news detection tasks have been performed on them. Shu et al. proposed two datasets built from the English websites PolitiFact and GossipCop in 2017 [7,23,24]. Wang et al. [25] collected the WeChat dataset from the Official Accounts of WeChat, the largest instant messaging platform in China.

Content-based: In recent years, many researchers have focused on text content to determine the authenticity of information. Early research on fake news detection used linguistic features such as text length, word classes, and the percentage of pronouns [26,27]. These works analyzed the textual characteristics of a post, for example TFIDF and topic features [28], and then used them to classify the post as credible or untrustworthy. Malicious online users spread fake opinions through confusing language, distinctive writing styles [29], or sensational emotions [30]; some approaches therefore take these signals into account [31−33]. Chen et al. [34] utilized attention mechanisms based on a recurrent neural network (RNN) to learn text features for fake news recognition.
User-based: Another group of researchers regards user analysis as a key part of fake news detection. Most users on social networks are ordinary people performing normal social activities, but a small number of malicious users deliberately create fake news to affect the emotions of other users and achieve their own goals, such as attracting public attention, triggering users' negative sentiment toward the country, or interfering with politics. Given such distinct characteristics, there is a clear difference between malicious and normal users. Yang et al. [35] consider account-based features including 'is verified,' 'has a description,' gender, avatar type, and name type. Shu et al. [36] analyzed explicit and implicit user profile features from social media platforms, where explicit features can be obtained directly, such as the user's name and gender, and implicit features are not directly available from user meta-information but can be inferred from other data, such as the user's historical tweets. The method proposed by Ma et al. [12] models user responses, capturing rich information to learn hidden user representations. Liu et al. [37] combined an RNN and a convolutional neural network (CNN) to capture user representations over time series, which addresses early-detection limitations. Qian et al. [38] proposed a two-level convolutional neural network with a user response generator (TCNN-URG).
Propagation-based: Recently, some researchers have begun to exploit graph structures for fake news detection. The spread of fake news on social networks resembles the spread of epidemics in a population, and graph structures form between users grouped by interests, similar opinions, and interactions with the news creator [24]. Because the diffusion patterns of online news resemble graph structures, leveraging them is a significant direction for fake news detection. Refs. [13, 39−41] utilized the structural characteristics of message propagation by constructing propagation networks. Shu et al. [42] constructed publisher-news relation and user-news interaction networks to capture the potential relations among publishers, news pieces, and users. Yuan et al. [43] captured the local and global relationships among all source tweets, retweets, and users by modeling a global heterogeneous graph. In contrast to previous ideas, Sampson et al. [44] combined the implicit links in conversation structures with the inherent network structure to increase the accuracy of fake news classification.

Dataset
In this section, we introduce the Chinese Weibo Fake News dataset in detail.

Data collection
Sina Weibo is the social media platform with the largest user base in China. According to the latest Sina Weibo user survey report, its daily active users now exceed 200 million, leading to massive amounts of information being posted on the platform every day. Fake news mixed in with everyday information can mislead people and have serious consequences. For example, during the early stage of the Covid-19 epidemic, which the public followed closely, much fake news was spread on Weibo by users with ulterior motives, and the resulting panic caused extremely negative social effects. Weibo officials have therefore taken many measures to deal with such situations. In 2012, Weibo issued a series of management conventions and launched its official fake news busting service on the Weibo Community Management Center. Its report processing hall is dedicated to handling users' reports of false information and publicly shows the processing results, such as deducting credit points, deleting Weibo posts, or banning user accounts. For our dataset, we collected fake news microblogs from this service. Because the judgments come from Sina Weibo's official fake news busting service, the data are official and authoritative, which largely avoids errors from manual annotation.
As shown in Fig. 2, we provide an example of the handling of fake news using Sina Weibo's official fake news busting service. Below, we introduce in detail the judgment of Sina Weibo's official fake news busting service on fake news. Sina Weibo's official fake news busting service report processing page displays the reported person and corresponding information. The reporting interface also provides the reporting time and the reasons provided by the reporting person. Furthermore, the interface displays the official judgment results, judgment basis, and processing method. Clicking the related hyperlink enters the user's homepage and microblog, which allows us to obtain important user information and details of the microblog.
We collected microblogs published by reported users on the fake news busting service from December 20, 2010, to December 1, 2020. Our dataset contains 26320 fake news items and 35426 real news items.

Data analysis
The annual distribution of fake news microblogs in our dataset is shown in Fig. 3. According to the official Weibo report, fake news has increased yearly; however, most original microblogs are deleted once they are confirmed as fake news on the platform, so such data do not appear in our dataset. In addition, although some microblogs in our dataset have similar themes and topics, their semantics may differ slightly. Considering the impact of different texts on the analysis, we did not exclude these microblogs; as far as possible, we cover the various situations encountered in real-life fake news detection.
Each microblog in the Chinese Weibo Fake News dataset corresponds to a user. A total of 24181 users in our dataset posted fake news microblogs, of which only 4902 were officially certified; the vast majority of fake news thus comes from ordinary users. Follower statistics show that most of these users have fewer than 500 followers. On the one hand, most ordinary users are incapable of identifying fake news; once its content matches their daily interests, they easily participate in its spread. On the other hand, there are a large number of virtual accounts on Weibo, which people with ulterior motives use to spread fake news. These accounts are never used again after posting fake news, so there is no way to punish their owners. The distribution of the number of followers of users posting fake news is shown in Fig. 4.
To show the influence of fake news, we use the sum of the numbers of reposts, comments, and likes of each microblog as an indicator of its influence on Weibo; statistical analysis of this indicator reveals the influence distribution of rumor microblogs. Owing to differences in topic, release time, and users, the influence of each microblog differs. The statistics of this influence indicator are shown in Fig. 5. We find that the influence of most fake news is limited: microblogs with an influence below 500 account for 92% of the total. However, a few fake news microblogs have a large influence and may cause serious consequences. Therefore, detecting fake news that may become highly influential is a challenge that should be considered.
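The influence indicator described above can be sketched in a few lines. This is an illustration only: the field names (`reposts`, `comments`, `likes`) are hypothetical, not the dataset's actual schema, and the sample posts are invented.

```python
def influence(post):
    """Influence of one microblog: reposts + comments + likes."""
    return post["reposts"] + post["comments"] + post["likes"]

def low_influence_ratio(posts, threshold=500):
    """Fraction of microblogs whose influence falls below the threshold."""
    scores = [influence(p) for p in posts]
    return sum(s < threshold for s in scores) / len(scores)

# Invented example posts for illustration.
posts = [
    {"reposts": 3, "comments": 10, "likes": 25},     # influence 38
    {"reposts": 120, "comments": 200, "likes": 400},  # influence 720
    {"reposts": 0, "comments": 2, "likes": 1},        # influence 3
    {"reposts": 40, "comments": 60, "likes": 90},     # influence 190
]
print(low_influence_ratio(posts))  # 0.75 (3 of 4 below 500)
```

Computing this ratio over the full dataset is how the 92% figure above would be obtained.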

Method
In this section, we introduce a data-driven combination model to improve fake news detection performance. Fig. 6 shows the proposed framework, which comprises the following components: (i) The input consists of microblog content and the corresponding user information. (ii) An augmenter expands the limited fake news microblog data. (iii) Contextualized representations of the original data, the augmented data, and the users' descriptions are obtained with BERT. (iv) These representations are concatenated and input to LightGBM for fake news detection. The details of the proposed framework are described below.

Data augmentation
In the fake news detection task, data imbalance and the small amount of fake news data are important factors that hinder progress. In this study, we generate new "fake news" samples to improve model performance. For each original sample $x_i$ in the training data, where $i$ indexes the training samples, we randomly generate a certain number of augmented samples $x_i^a$ and $x_i^b$, where $x_i^a$ and $x_i^b$ denote augmented versions of $x_i$ and $a$ and $b$ represent the two augmentation methods introduced below.

Word-level data augmentation: Based on the original data, we create new data similar to the original by replacing synonyms, deleting unimportant words, and randomly inserting synonyms. We use the Jieba word segmentation tool to segment the Chinese fake news text and then remove stop words. On this basis, we randomly modify the words in the text at a ratio of 0.2 according to the methods described above and obtain augmented texts that differ from the original. Among the augmented sentences produced by the different operations, we select those most similar to the original by calculating their BLEU scores.

Sentence-level data augmentation: We use the free translation API provided by Google to translate the original data into other languages and then translate it back into Chinese. Owing to the different logical orders of the various languages, this method yields new data that are semantically consistent with, but lexically different from, the original data.
Through the above methods, we obtain several augmented texts and select the final two by calculating their BLEU scores.
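The word-level operations above can be sketched as follows. This is a minimal, language-agnostic illustration under simplifying assumptions: it tokenizes on whitespace instead of using Jieba, and it implements only random deletion and random word swap (synonym replacement would additionally need a Chinese synonym dictionary).

```python
import random

def random_deletion(tokens, ratio=0.2, rng=None):
    """Drop each token with probability `ratio`; never return an empty list."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= ratio]
    return kept or tokens[:1]

def random_swap(tokens, ratio=0.2, rng=None):
    """Swap roughly ratio * len(tokens) random pairs of positions."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    n_swaps = max(1, int(len(tokens) * ratio))
    for _ in range(n_swaps):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

# Illustrative English sentence; the paper applies this to segmented Chinese text.
sentence = "breaking news the tax rate will double next month".split()
print(random_deletion(sentence))
print(random_swap(sentence))
```

In the paper's pipeline, each augmented sentence would then be scored against the original with BLEU, keeping only the closest candidates.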

Feature representation learning
Each microblog contains rich text information, and we obtain useful topic representations using Latent Dirichlet Allocation (LDA) for fake news classification. LDA is an unsupervised probabilistic generative technique that can identify hidden topic information in text. The model assumes that "the article selects a topic with a probability and selects a word from this topic with a certain probability," where both the text-to-topic and topic-to-word draws follow a multinomial distribution:

$$P(x_1, x_2, \ldots, x_k; n, p_1, p_2, \ldots, p_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \quad (1)$$

By observing our fake news data, we find that most topics in fake news relate to social and political issues. Through LDA, we obtain the hidden topic information in the text, represented as $f_{\text{LDA}}$. Because the meaning of the text cannot be fully understood using LDA alone, we use the BERT model as the backbone to obtain features for all data, covering both the original and augmented texts. BERT uses the transformer as its main framework. The transformer abandons traditional recurrent and convolutional structures and instead uses the self-attention mechanism, which better handles long-range dependencies in text and improves training speed. The attention mechanism lets the model focus on all input information that is important for the target word, capturing global information, and its parallel training greatly increases speed, so the model's effectiveness is substantially improved. In addition, inspecting the attention weight matrix allows us to understand the roles of different words in the task more intuitively, providing strong interpretability. BERT performs extensive self-supervised learning on a massive corpus to learn good word representations and thus has strong feature representation ability; it can also be fine-tuned for specific tasks to improve the final model's performance. We use $f_{\text{Bert}}$ to denote the text representations learned by BERT.
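As a small numeric check of the multinomial distribution in Eq. (1), which LDA assumes for both the text-to-topic and topic-to-word draws, the probability mass function can be computed directly; the topic probabilities below are invented for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(x1..xk; n, p1..pk) = n! / (x1! ... xk!) * p1^x1 * ... * pk^xk."""
    n = sum(counts)
    coeff = factorial(n) / prod(factorial(x) for x in counts)
    return coeff * prod(p ** x for x, p in zip(counts, probs))

# Example: 4 word draws over 3 topics with probabilities (0.5, 0.3, 0.2).
# Coefficient 4!/(2!1!1!) = 12, so the result is 12 * 0.25 * 0.3 * 0.2 ≈ 0.18.
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))
```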
Finally, we concatenate these into the textual representation:

$$f = f_{\text{LDA}} \,\|\, f_{\text{Bert}} \quad (2)$$

where $\|$ denotes the concatenation operation.
Research has shown that on social media platforms, user behavior exhibits a degree of group aggregation; for example, people with the same hobbies and educational backgrounds are more likely to gather together. Because a social structure is composed of many users and their activities, the importance of users in social networks is obvious. On social networks, a certain number of spam accounts are dedicated to spreading fake news, whereas many official, authoritative accounts publish strictly reviewed content of relatively high credibility. Therefore, the user attributes of social accounts carry a higher weight in fake news detection. User attributes comprise the user's self-description text, the number of fans, the number of followers, and whether the account is authenticated. In this study, we utilize the pre-trained BERT model to process user descriptions, denoted as $f_{\text{description}}$. We use the remaining numerical features directly, denoted as $f_{\text{numerical}}$; because our classification model is tree-based and trained through feature splitting, numerical scaling does not affect the location of the split points or the structure of the tree model. We concatenate the learned user text features and numerical features to obtain the final user representation:

$$f_{\text{user}} = f_{\text{description}} \,\|\, f_{\text{numerical}} \quad (3)$$

After obtaining the augmented text features and user features, we use the LightGBM [21] model to obtain the final fake news prediction. LightGBM is a framework that implements the gradient boosting decision tree (GBDT) [45] algorithm. The main idea of GBDT is to iteratively train weak classifiers (decision trees) to obtain the optimal model, which is resistant to overfitting and fast to train.
Meanwhile, when deploying the model to our online fake news detection system, we found that some microblogs do not provide user information. Therefore, we train two models, one with and one without user features, and the final output is the weighted combination of the two.
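The deployment-time fallback described above can be sketched as follows. This is an assumption-laden illustration: the paper does not state the combination weight, so the 0.5 here is hypothetical, and the two probabilities would in practice come from the trained LightGBM models.

```python
def combine_predictions(p_text, p_text_user=None, weight=0.5):
    """Combine the text-only and text+user model scores.

    p_text      -- fake-news probability from the text-only model
    p_text_user -- probability from the text+user model, or None when the
                   microblog carries no user information
    weight      -- weight on the text-only model (illustrative value)
    """
    if p_text_user is None:
        # No user information available: fall back to the text-only model.
        return p_text
    return weight * p_text + (1 - weight) * p_text_user

print(combine_predictions(0.8, 0.6))   # ≈ 0.7
print(combine_predictions(0.8, None))  # 0.8 (fallback)
```

This keeps the system end-to-end even for posts whose authors have empty or deleted profiles.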

Experiments
In this section, we discuss the evaluation indicators of the fake news detection algorithms. In addition, we introduce the effect of the baseline experiment on fake news detection and validate the proposed model on this dataset.

Dataset
In this study, we utilize the Chinese Weibo Fake News dataset to assess the proposed model. The details of our dataset are presented in Section 3. Table 1 lists the dataset statistics.

Evaluation metrics
In the experiment, the trained model is evaluated using standard classification performance metrics: accuracy, precision, recall (sensitivity), and F1-score. First, we introduce the concept of a confusion matrix, as shown in Table 2.
Based on the confusion matrix, we define the evaluation metrics as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1-score = 2 × Precision × Recall / (Precision + Recall).

In general, higher values indicate better performance.
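The four metrics can be computed directly from the confusion-matrix cells; the counts below are invented for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives,
# 20 false negatives, 80 true negatives.
acc, p, r, f1 = metrics(tp=90, fp=10, fn=20, tn=80)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```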

Baseline approaches
The fake news detection methods compared with our framework are as follows: Naive Bayes: The Naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence among features. We employ both the count vectorizer and the TFIDF vectorizer with Naive Bayes.
SVM [46]: The support vector machine (SVM) is a binary classification model whose basic form is a linear classifier with the largest margin in the feature space. With kernel techniques, the SVM becomes a nonlinear classifier, which we use to model microblog content.
LSTM: We employ pre-trained Chinese word vectors [47] to represent each microblog and then build a Bi-LSTM network with layer sizes of 64 and 32. On top of the Bi-LSTM, a fully connected layer outputs the probability that the news is fake.
GRU [12]: The gated recurrent unit (GRU) uses an RNN to model the text and extract features; it performs similarly to LSTM but is easier to train, which greatly improves training efficiency.
CNN [48] : The CNN model uses a convolutional neural network to learn fake news representations for each microblog. The extracted microblog representations are finally passed through a fully connected layer with a softmax function to predict the fake news result.

Performance comparison
The performances of our proposed framework and other excellent algorithms on our dataset are presented in Table 3. For the parameter selection of the classification model, we set the maximum depth of the decision tree to five, learning rate to 0.12, bagging fraction to 0.8, and feature fraction to 0.8. The proposed framework achieves the best performance.
Our model achieves an F1-score of 0.903, with an accuracy of 0.890 on the test data, the best results among the compared methods on the fake news task. The prediction performance of the Naive Bayes method is much lower than that of the other methods, regardless of whether the CountVectorizer or the TFIDF word-embedding method is used; the main reason is that Bayes classification is sensitive to the form of the features. Compared with Naive Bayes, SVM improves performance, generalizes better, and is less sensitive to the data form. However, compared with deep learning models, early machine learning methods still perform much worse. Deep learning methods learn high-level features directly from data, extract features automatically, and integrate feature learning into model construction, thereby overcoming the shortcomings of traditional feature engineering; however, they require large amounts of training data to perform well. Moreover, none of the above methods consider the important role of user information in fake news detection. Our method does consider user information, and through a weighted combination of the plain-text fake news detection model and the model containing user information, it achieves higher accuracy and flexibility in practical applications.

Conclusions
In this study, we propose an end-to-end combination framework that uses data augmentation and jointly models microblog text information and user attribute characteristics for fake news detection. Considering that fake news data are scarce and strongly time-dependent, data augmentation improves the generalization ability and robustness of the model on the fake news detection task, while full use of the microblog text and user attribute characteristics improves its performance. Finally, in practical applications, a weighted combination of the plain-text model and the model enriched with user characteristics yields more stable predictions. Our proposed framework is validated through extensive experiments on the Sina Weibo data and achieves the best performance, which indicates the effectiveness of the method for fake news detection. In the future, our model could be extended to other tasks, and we will continue to explore additional detection methods for fake news.

Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities (WK3480000008, WK3480000010, WK2100000023).

Biographies
Xin Chen is currently pursuing a master's degree at the University of Science and Technology of China, Hefei, China. Her research focuses mainly on rumor detection in social media and cross-modal understanding.
Zhendong Mao received his PhD degree in computer application technology from the Institute of Computing Technology, Chinese Academy of Sciences, in 2014. He was an assistant professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, from 2014 to 2018. He is currently a professor at the School of Cyberspace Science and Technology, University of Science and Technology of China, Hefei, China. His research interests include computer vision, natural language processing and cross-modal understanding.