Rumor detection based on topic classification and multi-scale feature fusion

In recent years, with the rapid development of Internet technology, the spread of online rumors has become a major obstacle to the stable development of social networks and to public security. Most existing research focuses on rumor detection in general domains and ignores the differences among domains. Based on the characteristics of rumors in the health domain, this paper proposes a rumor detection method based on topic classification and multi-scale feature fusion. Different methods are used to extract features from sub-datasets of different scales, taking into account correlations and differences at the overall, inter-topic, and intra-topic levels, and the classification decision is made after feature fusion. Experimental results show that, on a health-domain dataset, the method outperforms general detection methods and also improves on algorithms designed for the same domain.


Introduction
With the rapid development of social networks, a large amount of health information, such as diet and health care advice, has emerged and seriously affects people's daily lives. According to the 2019 report on the governance of Internet rumors, the WeChat platform released 17881 anti-rumor articles in 2019, and the health category was the area hit hardest by rumor propagation. For example, when the novel coronavirus outbreak occurred in 2020, various rumors also broke out. Leaving the spread of health rumors unchecked can lead to social incidents, public panic, economic losses, and other harms.
At present, rumor detection on social platforms is still mainly manual. As the number of rumors grows, automatically detecting rumors in specific domains is of great significance. To address this problem, this paper constructs a health dataset based on the Daily Express. Multi-level analysis of the dataset shows that there are great differences among topics. Therefore, a rumor detection method based on topic classification and multi-scale feature fusion is proposed. First, MLM and BiLSTM + attention are used to extract features from different sub-datasets, and a multi-scale algorithm is used to extract word-level features. After normalization, the features are fused. Finally, a neural network outputs the prediction.

Related work
Yang et al. [1] used an SVM model for automatic detection experiments based on traditional event content and user information, combined with event propagation characteristics, client type, and event geographical location. Liang et al. [2] detected rumors by introducing user behavior features and applied traditional machine learning algorithms in comparative experiments. Shu et al. [3] proposed jointly detecting false accounts and false news based on latent user characteristics. Rosa et al. [4] proposed a new rumor detection system based on influence. Deep neural networks have since brought new breakthroughs in natural language processing and other fields. Ma et al. [8] were the first to introduce recurrent neural networks to detect rumor events, based on the data structure characteristics of event-level rumors. Shu et al. [9] proposed an interpretable detection model that uses news content and user comments to capture the top-k sentences and comments that are most valuable for checking.
However, the above methods focus on rumor analysis in general domains. The texts they study are predominantly short, whereas news in different domains differs considerably.

Data set source
This paper collects all news from the Daily Express between July 29, 2016 and November 30, 2018, 79209 items in total. After manual screening and data annotation, 13754 valid items verified by authoritative organizations such as "doctor Dingxiang", "the whole people are serious" and "microblog refutation" were checked. Annotation yielded 2511 rumor items and 5898 non-rumor items with an average length of 840, which makes them long-text data. During labeling, the rumor paragraphs of the rumor data were extracted separately. 20% of the data is used as the test set.

Data preprocessing
The text is segmented into words and stop words are removed. Because health data is expressed mainly in domain terminology, and the texts are long and cover a wide range of content, it is difficult to distinguish health rumors at the scale of the whole text. Therefore, we cluster the segmented text with the LDA topic model. The number of topics is selected from the range [1,15], and topic similarity is calculated by formula (1). Comparing the 10 highest-weighted words of each topic shows that the separation is clearest with three topics: (1) mother and infant gender, (2) food safety, (3) health care. The dataset is labeled with these topic types. TF-IDF is used to calculate the weight of feature words. The expected cross entropy reflects the distance between the probability distribution of the text topic classes and the probability distribution of the text topic classes given the presence of a specific word. The calculation formula is as follows:
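A commonly used form of expected cross entropy is sketched below. It is given as an assumption rather than as the authors' exact formula; P(C_i), the share of documents that belong to category C_i, is a symbol introduced for this sketch.

```latex
\mathrm{ECE}(t) = P(t) \sum_{i} P(C_i \mid t) \, \log \frac{P(C_i \mid t)}{P(C_i)}
```

Under this form, the score is large when the presence of the word t shifts the category distribution far away from the prior distribution of the categories.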

Feature extraction and fusion
In the expected cross entropy formula, t is the feature word, C_i is the i-th category, P(t) is the document frequency of the word, and P(C_i|t) is the document frequency of category C_i given the word t.
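These quantities can be estimated directly from document counts. The sketch below assumes the common form ECE(t) = P(t) * sum_i P(Ci|t) * log(P(Ci|t) / P(Ci)), where P(Ci) is the share of documents in category Ci. That form, the function name `expected_cross_entropy`, and the toy documents are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(docs, labels):
    """Score each word by expected cross entropy (assumed common form).

    docs   -- list of tokenized documents (lists of words)
    labels -- one category label per document
    """
    n_docs = len(docs)
    class_count = Counter(labels)            # documents per category
    word_count = Counter()                   # documents containing the word
    word_class_count = defaultdict(Counter)  # same, split by category
    for doc, label in zip(docs, labels):
        for word in set(doc):
            word_count[word] += 1
            word_class_count[word][label] += 1

    scores = {}
    for word, n_word in word_count.items():
        p_t = n_word / n_docs
        total = 0.0
        # Categories that never contain the word contribute zero and are skipped.
        for label, n_both in word_class_count[word].items():
            p_c_given_t = n_both / n_word
            p_c = class_count[label] / n_docs
            total += p_c_given_t * math.log(p_c_given_t / p_c)
        scores[word] = p_t * total
    return scores

# Hypothetical toy corpus, labeled with the three topics named in the paper.
docs = [
    ["vitamin", "cures", "cold"],
    ["vitamin", "food", "safety"],
    ["infant", "food", "safety"],
    ["infant", "sleep", "safety"],
]
labels = ["health care", "food safety", "food safety", "mother and infant"]
scores = expected_cross_entropy(docs, labels)
```

In this toy corpus, "cures" appears in only one category and so scores higher than "safety", which is spread across two categories.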
The dominance rate is suitable only for binary classification and considers only the score of a text feature for the target class.
The top 2500 feature words were selected by ranking the words by dominance rate in descending order. The three groups of feature words above are normalized to the interval [0,1] and then passed through the tanh activation function.
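The selection and scaling steps can be sketched as follows. This is a minimal illustration that assumes min-max scaling for the normalization to [0,1], which the text does not specify. The helper names and the toy scores are hypothetical.

```python
import math

def top_k_words(score_by_word, k):
    """Return the k words with the highest scores, best first."""
    ranked = sorted(score_by_word.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def min_max_normalize(scores):
    """Scale scores linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def squash(scores):
    """Normalize to [0, 1], then apply tanh as described in the text."""
    return [math.tanh(s) for s in min_max_normalize(scores)]

# Toy example; the paper keeps the top 2500 words, here k is 2.
selected = top_k_words({"a": 0.1, "b": 0.9, "c": 0.5}, 2)
features = squash([0.2, 1.5, 0.7, 3.1])
```

Because the inputs to tanh lie in [0,1], the resulting feature values lie between 0 and tanh(1), roughly 0.76.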

Model pre-training based on MLM
MLM (masked language model) masks a certain proportion of the words in a sentence and trains the model to predict the masked words from their context. This method is used in Google's BERT [10] and has shown excellent results. In this paper, we make a simple adjustment for the rumor data and train on the rumor paragraphs together with the complete rumor data. The output is the transformed feature vector v_2.
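The masking step of MLM can be sketched as below. This is a simplified illustration, not the authors' training code. The mask ratio is a free parameter here because the text only says "a certain proportion", and the sketch always substitutes a `[MASK]` token, whereas BERT also sometimes keeps or randomly replaces the selected tokens. The function name and the example sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Replace a share of tokens with MASK and return the prediction targets.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    # Mask at least one token when the sequence is not empty.
    n_mask = max(1, round(len(tokens) * mask_ratio)) if tokens else 0
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets

tokens = "eating garlic every day prevents all viral infections".split()
masked, targets = mask_tokens(tokens, mask_ratio=0.25, seed=0)
```

The model would then be trained to recover each entry of `targets` from the surrounding unmasked words in `masked`.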

Feature training based on BiLSTM + attention
An attention mechanism is introduced into the BiLSTM model to simulate the attention characteristics of the human brain and focus on the more important information. The model is applied to all experimental data to obtain complete context information. The model architecture follows Huang et al. [11], where S_xy denotes the similarity between the x-th target output and the y-th input.
S_xy is defined as the difference between the cosine similarity and the Euclidean distance. The closer the two vectors are, the smaller the Euclidean distance and the larger the cosine similarity.
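A minimal sketch of this similarity score follows. It assumes the cosine term is the usual cosine similarity and that the scores are turned into attention weights with a softmax, which the text does not state. All function names and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(target, source):
    """S_xy: cosine similarity minus Euclidean distance."""
    return cosine_similarity(target, source) - euclidean_distance(target, source)

def attention_weights(target, sources):
    """Softmax over the similarity scores (assumed normalization)."""
    scores = [similarity(target, s) for s in sources]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

target = [1.0, 0.0]
sources = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(target, sources)
```

An input identical to the target gets the highest score, and the weight falls as the input moves away from the target in both angle and distance.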

Feature fusion
The three feature vectors extracted above are fused by vector concatenation to obtain the vector V, which is expressed as:
In the convolutional neural network part of the architecture, V is input to the fully connected layer to obtain the vector V', and softmax classification is used to obtain the scores scr, the larger of which gives the classification result. The model uses the cross entropy loss function to measure the distance between the true value and the predicted value. The cost function formula is:
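The fusion and classification steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the vectors `v1`, `v2`, `v3`, the weights `W`, the bias `b`, and the class indices are made-up toy values, and the loss is the standard cross entropy for a single example, -log of the probability assigned to the true class.

```python
import math

def fuse(*vectors):
    """Concatenate feature vectors into one fused vector."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross entropy loss for one example with true class index `label`."""
    return -math.log(probs[label])

# Hypothetical toy feature vectors and layer parameters.
v1, v2, v3 = [0.2, 0.9], [0.5], [0.1, 0.4]
V = fuse(v1, v2, v3)
W = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [-0.3, 0.4, 0.1, 0.2, -0.1]]
b = [0.0, 0.1]

scores = softmax(dense(V, W, b))
predicted = scores.index(max(scores))  # class with the larger score
loss = cross_entropy(scores, label=1)
```

The predicted class is the index of the larger softmax score, and the loss shrinks toward zero as the score of the true class approaches 1.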

Result analysis
The method in this paper is compared with existing general methods and domain methods on the same dataset. The test results of each model are shown in Table 1.

Table 1
Method          Precision   Recall   F1     Accuracy
[12]            .865        .803     .833   .839
SVM [2]         .881        .800     .838   .846
BiLSTM [13]     .861        .828     .845   .851
TextCNN [14]

[12]            .816        .875     .845   .839
SVM [2]         .817        .892     .853   .846
BiLSTM [13]     .835        .867     .850   .851
TextCNN [14]

It can be seen from the table that the accuracy of the proposed rumor detection model based on topic classification and multi-scale fusion reaches 89.1% on the health dataset, which is superior to the general models and 1.4% higher than the domain model. BiLSTM-dec and TextCNN-dec fuse domain entity features, which shows that domain entities can improve the feature extraction accuracy on health data. BiLSTM is also used in the model of this paper, and its experimental accuracy is higher than that of the models with domain entity features, which indicates that multi-scale features contribute to rumor detection in the health domain. The model performs better on rumor data than on non-rumor data in precision, recall, and F1. Among the three topics, the result is best for food safety, reaching 91.8%; health care also reaches 87.2%.
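As a quick consistency check on the values quoted for Table 1, F1 is the harmonic mean of precision and recall. The sketch below assumes that the first three values in each row are precision, recall, and F1 in that order. The helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as quoted for two baselines in Table 1.
rows = {
    "SVM [2]": (0.881, 0.800, 0.838),
    "BiLSTM [13]": (0.861, 0.828, 0.845),
}
gaps = {name: abs(f1_score(p, r) - f1) for name, (p, r, f1) in rows.items()}
```

For both rows the recomputed F1 differs from the reported value by less than 0.0015, which is consistent with rounding of the inputs to three decimals.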

Conclusion
This paper proposes a rumor detection method based on topic classification and multi-scale fusion that takes into account the long-text characteristics of rumor information in the health domain. After topic classification, the associations and differences between and within topics are considered, and the features of specific sub-datasets are extracted for multi-scale fusion and classification. The method clearly improves the accuracy of rumor detection in the health domain, which indicates that it is well adapted to the characteristics of this domain.