Research Article Early Rumor Detection Based on Deep Recurrent Q-Learning

Online social networks provide convenient conditions for the spread of rumors, and false rumors bring great harm to social life. Rumor dissemination is a process, and effective identification of rumors in the early stage of their appearance will reduce the negative impact of false rumors. This paper proposes a novel early rumor detection (ERD) model based on reinforcement learning. In the rumor detection part, a dual-engine rumor detection model based on deep learning is proposed to realize the differential feature extraction of original tweets and their replies. A double self-attention (DSA) mechanism is proposed, which can eliminate data redundancy in sentences and words at the same time. In the reinforcement learning part, an ERD model based on Deep Recurrent Q-Learning Network (DRQN) is proposed, which uses LSTM to learn the state sequence features, and the optimization strategy of the reward function is to take into account the timeliness and accuracy of rumor detection. Experiments show that, compared with existing methods, the ERD model proposed in this paper has a greater improvement in the timeliness and detection rate of rumor detection.


Introduction
With the rapid development of the Internet, social networks and people's lives have become increasingly intertwined, and the participation and utilization rate of netizens has risen rapidly [1]. The global digital statistics report [2] released by "We Are Social" in 2019 shows that, by the end of 2018, there were 3.48 billion social network users in the world, accounting for 45% of the world's total population. Social network platforms represented by Twitter and Weibo provide netizens with functions to post information and express opinions. News media have gradually established official accounts on social networks for news reporting, so social networks have gradually become people's main sources of information.
5G and edge computing bring certain security issues to social networks [3,4]. From the content point of view, the popularity of social networks improves the efficiency of daily life, but it has also created an environment for online rumors. An early study of social psychology defined rumors as "propositions spread without verification by relevant departments" [5]. Today, with the explosion of information, a huge amount of information is spread on social networks every day, including a lot of rumors. The harm of rumors to society cannot be ignored. For example, the 2011 Japanese earthquake triggered a tsunami that caused an explosion at the Fukushima nuclear power plant. The incident had little impact on China, but some illegal traders took the opportunity to drive up salt prices on the grounds that sea salt had been contaminated by nuclear radiation. Salt prices soared in many places across the country, and many people were incited to scramble for salt, which seriously disrupted the order of social life. The purpose of rumors is generally destructive, such as pranks and revenge on society. Because social networks are virtual and anonymous, the cost of creating and spreading rumors is extremely low, and online rumors have become rampant. In response, social network platforms have established rumor-refuting accounts to refute rumors manually, but relying only on manual review is both costly in labor and very inefficient, so artificial intelligence-based rumor detection technology has gradually become a research hot spot. The spreading process of rumors has obvious timeliness [6]. Specifically, rumors spread rapidly in the form of outbreaks when they first appear, but over time their spreading speed drops greatly until they eventually die out. Figure 1 shows the spreading sequence diagram of a Twitter rumor.
The red text represents the original tweet, the yellow text represents the reply message that questioned the original tweet, and the green text represents the reply message that opposes the original tweet. The rumor was released after a shooting incident. The general content of the rumor was "According to the police, there were a large number of shooters in the shooting." Later, after investigation by relevant departments, it was found that there was only one shooter and that the police had not disclosed the information. Within two hours after the rumor was released, there were a thousand reposts, and the rumor spread to tens of thousands of people. Analysis of the spreading process of this rumor shows that when there is obvious opposition and questioning information in the comment area, the information can be preliminarily judged as a rumor. If rumors can be accurately identified and their spread can be controlled when they first appear, the adverse effects of false rumors can be greatly reduced. Therefore, research on early rumor detection on social networks is very important.
The main contributions of this paper are as follows. Aiming at the difference in content characteristics between original tweets and reply messages on Twitter, a dual-engine rumor detection model based on the self-attention mechanism is proposed, which improves the accuracy of rumor detection; in addition, we propose an early rumor detection (ERD) model based on recurrent Q-learning, which can detect rumors earlier with higher accuracy.
The remainder of this paper is organized as follows: Section 2 briefly introduces the related works and research issues; Section 3 proposes an ERD model based on deep recurrent Q-learning and a dual-engine rumor detection model based on the self-attention mechanism; Section 4 discusses and analyzes the experimental results; Section 5 draws the conclusion and proposes future research directions.

Rumor Detection.
The essence of the rumor detection problem is text classification. The current research on rumor detection is divided into two categories: methods based on traditional machine learning and methods based on deep learning. The former generally uses methods such as naive Bayes classification, decision trees, and support vector machines. Castillo et al. [8] used machine learning algorithms based on feature engineering to classify rumors and extracted a large number of text features based on the characteristics of rumors, including text length and the number of likes. On the basis of this method, many scholars [9][10][11][12][13][14] began to try to use different machine learning algorithms and richer features to study rumor detection.
Although methods based on machine learning can solve the problem of rumor detection to a certain extent, the feature engineering stage is time-consuming, laborious, and inefficient. The quality of the features relies heavily on manual experience, which affects the quality of the rumor detection model. With the widespread application of deep learning in the field of natural language processing, researchers have also begun to use deep learning methods to solve the problem of rumor detection. The reply information of tweets has a great influence on the effect of rumor detection, and researchers have proposed multitask learning models for rumor detection and user stance detection. The most typical method is the multitask joint learning model proposed by Ma et al. [15], which defines a shared layer as a bridge for exchanging information between the rumor detection deep learning model and the user stance detection deep learning model. Li et al. [16] added user characteristics and an attention mechanism on this basis to improve the performance of the model. With the emergence of the BERT (Bidirectional Encoder Representations from Transformers) language model, rumor detection methods based on BERT have also been proposed, such as the model of Yu et al. [14]. In addition, rumor detection models based on the propagation tree have gradually attracted the attention of researchers. Their starting point is to abstract the tweets into a propagation tree according to the timeline and convert rumor classification into tweet tree classification, as in the work of Ma et al. [17] and Kumar [18].

ERD.
Figure 1: A rumor from PHEME dataset [7]. Source tweet: "Police service: there were "numerous gunmen" at the Canada War Memorial shooting. One person was shot." Replies: "2nd suspect parrantly shot behind parliament. Lockdown almost everywhere... third suspect on the loose. That s heavy." and "Ottawa was always such a safe city. This is madness."
In the current research related to rumor detection, the focus is mainly on improving the accuracy of rumor detection, and the early detection of rumors is less studied. The current ERD methods can be divided into three types:
(1) Real-time rumor detection, such as the model proposed by Castillo et al. [19]. This model uses a support vector machine to classify the original information without considering the reply information, so there is no delay in the detection time, so as to achieve real-time detection. Although the real-time rumor detection method can ensure the detection of rumors in the early stage, it has a high rate of misjudgment and has little practical value.
(2) ERD based on static checkpoints. For example, Dungs et al. [20] proposed a detection method based on a hidden Markov model. This method uses a fixed number of replies as an interval when the model reads the reply information (the interval length used in the literature is 5) and sets a static checkpoint at the end of each interval; at each checkpoint the model considers whether to output the detection result. If the detection result is output, the rumor detection process ends; otherwise, the reply information continues to be read until a checkpoint at which the result is output. Although this method can theoretically achieve early detection, it is not flexible enough to give play to the potential performance of the model.
(3) ERD based on reinforcement learning. For example, the model proposed by Zhou et al. [21] consists of two parts: a rumor detection module (RDM) and a checkpoint module (CKM). The CKM is implemented by a reinforcement learning model, which dynamically controls the number of replies input to the RDM and thereby enables ERD. The model can use the reinforcement learning method to constantly weigh the detection time against the detection accuracy to achieve the best balance between the "early nature" and accuracy of rumor detection.
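The static-checkpoint strategy in (2) can be illustrated with a short sketch. It is not the method of Dungs et al. [20]: the function names and the toy decision rule are hypothetical, and only the interval length of 5 comes from the text.

```python
from typing import Callable, List, Optional, Tuple


def detect_with_static_checkpoints(
    replies: List[str],
    decide: Callable[[List[str]], Optional[str]],
    interval: int = 5,
) -> Tuple[Optional[str], int]:
    """Read replies in fixed-size intervals and stop at the first checkpoint
    where `decide` returns a label.

    `decide` is a hypothetical stand-in for the detector: it returns a label
    or None when it declines to output a result yet.
    Returns (label, number_of_replies_read).
    """
    read = 0
    while read < len(replies):
        read = min(read + interval, len(replies))  # advance to next checkpoint
        label = decide(replies[:read])
        if label is not None:
            return label, read
    return None, read


def toy_decide(seen: List[str]) -> Optional[str]:
    """Toy rule (assumption): three or more doubting replies mean 'rumor'."""
    doubts = sum(("?" in r) or ("fake" in r.lower()) for r in seen)
    return "rumor" if doubts >= 3 else None
```

Because a decision is only possible at multiples of the interval, a rumor that is already obvious after two replies still waits until the fifth, which is the inflexibility noted above.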

Problems.
The original tweet and the reply message are two completely different kinds of message. The original tweet is generally a complete description of an event, and its expression is more rigorous. Reply messages are reactions to the original tweet and are sometimes only an emoticon or a punctuation mark. Existing models generally ignore the difference between original tweets and their reply information. Even though individual multitask joint learning models (such as Ma et al. [13]) use two independent networks to process the original tweet and the reply information, these two networks often have the same structure. For two kinds of information with such large differences, it is not reasonable to use the same network for modeling, especially for text with chaotic word order such as comments. If only a recurrent neural network is used for modeling, many potential features will be ignored. Therefore, it is necessary to model the characteristics of tweets and reply messages separately.
In terms of ERD, this paper adopts a reinforcement learning strategy, building on the ERD model proposed by Zhou et al. [21], which applies reinforcement learning to the ERD problem. The ERD model is composed of a CKM and an RDM. The CKM is implemented by DQN to control the number of reply messages input to the RDM. The RDM is implemented by GRU. Through in-depth analysis of the model, we find that it has the following problems:
(1) The potential meaning of the state sequence is ignored. Reinforcement learning is generally based on the "Markov decision process," but in the case of ERD, it is more reasonable to regard the ERD process as a "partially observable Markov decision process," because the state sequence generated by the RDM is potentially helpful for ERD. Specifically, the state sequence refers to the coding sequence of known information formed by continuously inputting reply information to the RDM. For each state in the state sequence, the RDM can output the rumor classification result corresponding to that state. But for different states, even if the classification result output by the RDM is the same, the corresponding probabilities are different, and the probability value may show a steady upward trend. For example, when the RDM has read 2 reply messages, the probabilities it outputs for rumor and nonrumor are 0.55 and 0.45 (softmax does the final classification, so the two probabilities sum to 1), and the classification result is a rumor; when 4 messages have been read, the probabilities are 0.85 and 0.15, and the classification result is still a rumor. The difference is that the model has become more "certain."
For the case where the classification probability changes steadily (the probability of one label increases, so the probability of the other label inevitably decreases because the probabilities sum to 1), the model can be allowed to output the detection result in advance, which avoids prolonging the rumor detection process unnecessarily. Because the actual state of the environment is represented by observing the partial trend of the state, this process is actually a partially observable Markov decision process. The CKM in ERD is implemented by DQN based on the Markov decision process, so the sequence features cannot be obtained, resulting in the model's poor performance in the timeliness of rumor detection.
(2) Incomplete reward function. The number of rumors on social networking platforms is much smaller than the number of nonrumors, so rumor detection is actually anomaly detection. In the field of anomaly detection, a model with a higher accuracy rate may not be the best one, and a model with a higher recall rate is often more practical [22]. The reward function used by ERD does not account for the uneven sample distribution, which leads to low model recall. In addition, the "early" detection strategy is insufficiently flexible. The reward function of ERD is as follows:
$$r = \begin{cases} \log M, & \text{terminate, prediction correct} \\ -P, & \text{terminate, prediction incorrect} \\ -\varepsilon, & \text{continue} \end{cases}$$
If the decision action is to terminate the reading, there are two situations. If the prediction is correct, a reward of $\log M$ is obtained, where $M$ is the number of samples whose predictions are correct; if the prediction fails, the penalty is $-P$, where $P$ is the constant 100. If the action is to continue reading, it is punished by $-\varepsilon$, where $\varepsilon$ is 0.01. This function has the following specific problems: (1) The reward of $\log M$ for a correct prediction is intended to keep the model performing well in stages and to make the model converge as soon as possible, but the model then converges to a local optimum, which reduces its generalization ability. (2) A prediction error is punished by $-P$; the purpose is to make the model judge "more cautiously," because the penalty for a prediction error is large. However, this also harms generalization: when the number of rumors is small, treating the misrecognition of rumors and the misrecognition of nonrumors equally lowers the recall rate of the model. (3) Continuing to read the reply information is punished by $-\varepsilon$. The original intention is to make the model output the result as soon as possible, but the model does not perceive the "urgency" of time.
In other words, for a rumor message, the penalty for continuing to read is the same whether the model has read 2 replies or 50 replies. In fact, the penalty should be greater after reading 50 replies, because the rumor may have spread further over time and a result needs to be reached as soon as possible.
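The baseline reward described above can be written down directly, which makes the third problem easy to see. This is a minimal sketch: the function and argument names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and only P = 100 and ε = 0.01 come from the text.

```python
import math

P = 100.0        # penalty constant stated in the text
EPSILON = 0.01   # per-step penalty stated in the text


def baseline_reward(action: str, replies_read: int,
                    correct: bool = False, num_correct: int = 1) -> float:
    """Reward of the baseline ERD model as described in the text.

    `num_correct` plays the role of M, the number of samples predicted
    correctly so far. `replies_read` is accepted only to show that the
    baseline ignores it.
    """
    if action == "continue":
        return -EPSILON                      # same penalty at every step
    if action == "terminate":
        return math.log(num_correct) if correct else -P
    raise ValueError(f"unknown action: {action}")
```

Calling `baseline_reward("continue", 2)` and `baseline_reward("continue", 50)` returns the same value, so the agent receives no signal that waiting longer is more costly.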

Problem Description.
The goal of ERD is to achieve higher accuracy of rumor detection by collecting less information. This problem can be described as follows: set the input tweet as $X = \{x_0, x_1, \ldots, x_n\}$, where $x_0$ represents the original tweet and the others represent the reply information related to the original tweet, sorted in chronological order. Each $x_i$ is composed of text information and metadata information.
The classification result is $y \in \{\text{rumor}, \text{nonrumor}\}$, and $t \in [0, n]$ represents the number of reply messages used in the rumor detection process, so $t$ indirectly expresses how long the detection process takes. Therefore, the purpose of ERD can be described as follows: for input $X$, accurately output the tweet type $y$ with $t$ as small as possible.
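The problem statement can be read as a simple loop over the chronologically ordered posts. The sketch below only illustrates the roles of $X$, $y$, and $t$; `classify`, `should_stop`, and the toy rules are hypothetical placeholders, not the models proposed in this paper.

```python
from typing import Callable, Sequence, Tuple


def early_detect(
    posts: Sequence[str],
    classify: Callable[[Sequence[str]], Tuple[str, float]],
    should_stop: Callable[[float, int], bool],
) -> Tuple[str, int]:
    """Return (y, t): the predicted label and the number of replies used.

    posts[0] is the original tweet x_0; posts[1:] are replies x_1..x_n
    in chronological order.
    """
    if not posts:
        raise ValueError("posts must contain at least the original tweet x_0")
    last = len(posts) - 1
    for t in range(len(posts)):
        label, confidence = classify(posts[: t + 1])
        # Stop when the policy says so, or when no replies remain.
        if t == last or should_stop(confidence, t):
            return label, t
    raise AssertionError("unreachable")


def toy_classify(seen: Sequence[str]) -> Tuple[str, float]:
    """Toy rule (assumption): confidence grows with the number of replies."""
    return "rumor", min(0.95, 0.5 + 0.1 * (len(seen) - 1))


def toy_stop(confidence: float, t: int) -> bool:
    return confidence > 0.75
```

With seven posts (one tweet and six replies), the toy policy stops at t = 3; an ERD model aims to make this t small without lowering accuracy.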

Early Rumor Detection Model Based on DRQN.
In order to effectively solve the problems mentioned in Section 2, we propose an ERD model based on DRQN (Deep Recurrent Q-Learning Network). The basic model architecture is shown in Figure 2, which consists of an RDM and a control model.

Control Model.
The control model is implemented by DRQN, which is a typical algorithm for partially observable Markov decision processes. The recurrent neural network gives the model a memory of the state sequence, so that it can learn the potential features in that sequence. In the control module, this paper uses LSTM to realize the memory function of the state sequence. The LSTM obtains the action it considers reasonable by observing the state information and the last judgment. The specific calculation process is shown in the following formula:
$$h_t = \mathrm{LSTM}(\mathrm{state}_t, h_{t-1}), \qquad F = \mathrm{FC}(h_t), \qquad P(\text{action}) = \mathrm{softmax}(F)$$
Here, in addition to receiving the current state information $\mathrm{state}_t$, the LSTM network also receives the LSTM neuron information $h_{t-1}$ from the previous moment. After outputting $h_t$, it passes through the fully connected layer to obtain a vector $F$ of length two, and finally the action probability distribution is output through the softmax function.
It is worth noting that the input state of LSTM is the last vector used for classification in the RDM, and there are two output actions: (1) Continue: It means that the current information is not enough to determine whether it is a rumor, and let the RDM read another reply message. (2) Terminate: It indicates the end of the detection process and outputs the detection result. In other words, the RDM has sufficient information to judge whether the original tweet is a rumor and outputs the result in advance to achieve the purpose of early detection.
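The data flow of the control model (state and previous hidden state into an LSTM, then a fully connected layer of size two, then softmax over the actions Continue and Terminate) can be sketched in plain Python. This is an untrained toy with random weights, not the authors' implementation: the class name, dimensions, and initialization are assumptions, and the standard LSTM cell equations are used.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


class TinyLSTMPolicy:
    """Hypothetical control head: LSTM cell -> FC (length 2) -> softmax."""

    def __init__(self, state_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        in_dim = state_dim + hidden_dim
        # One weight matrix per gate: input, forget, output, candidate.
        self.W = {g: mat(hidden_dim, in_dim) for g in ("i", "f", "o", "c")}
        self.b = {g: [0.0] * hidden_dim for g in ("i", "f", "o", "c")}
        self.W_fc, self.b_fc = mat(2, hidden_dim), [0.0, 0.0]
        self.hidden_dim = hidden_dim

    def init_memory(self):
        return [0.0] * self.hidden_dim, [0.0] * self.hidden_dim

    def step(self, state, h_prev, c_prev):
        """One time step; returns ([p_continue, p_terminate], h_t, c_t)."""
        x = list(state) + list(h_prev)
        i = [_sigmoid(v) for v in _affine(self.W["i"], self.b["i"], x)]
        f = [_sigmoid(v) for v in _affine(self.W["f"], self.b["f"], x)]
        o = [_sigmoid(v) for v in _affine(self.W["o"], self.b["o"], x)]
        cand = [math.tanh(v) for v in _affine(self.W["c"], self.b["c"], x)]
        c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, cand)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        F = _affine(self.W_fc, self.b_fc, h)   # vector of length two
        return softmax(F), h, c
```

Feeding the same state twice yields different hidden states on the second step, which is the memory of the state sequence that a feed-forward DQN lacks.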
The reward function is the core of reinforcement learning, and the quality of its design directly determines the performance of the model. We design the following reward function:
$$r = \begin{cases} R, & \text{terminate, prediction correct} \\ -2P, & \text{terminate, prediction incorrect, actual label is rumor} \\ -P, & \text{terminate, prediction incorrect, actual label is nonrumor} \\ -(\log n + \varepsilon), & \text{continue} \end{cases}$$
When the model makes a stop-reading action and the prediction is correct, it directly gets a reward of $R$, which avoids falling into a local optimum. If the prediction is wrong, there are two situations: when the actual label is a rumor, the model receives a punishment of $-2P$; when the actual label is a nonrumor, it is punished by $-P$. There are two reasons for adopting this strategy: (1) there are few rumor samples; (2) the losses caused by the two kinds of misjudgment are different. Misjudging a rumor as a nonrumor has a greater impact than identifying a nonrumor as a rumor, because a rumor missed by the model will spread further, whereas in the latter case the cost of misjudgment increases but no rumor is missed. Therefore, if information that was originally a rumor is judged not to be a rumor, the model receives a doubled penalty.
When the model continues to read data, it is punished by $-(\log n + \varepsilon)$, where $n$ is the number of reply messages read by the model and $\varepsilon$ is a small value that prevents the penalty from being 0 when the first reply message is read. The more reply messages are read, the greater the penalty for continuing to read.
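A minimal sketch of this reward function follows. The names are hypothetical, the default values of R, P, and ε are placeholders because the text does not state them for the proposed model, and the natural logarithm is assumed.

```python
import math


def proposed_reward(action: str, n_read: int = 1,
                    predicted: str = "", actual: str = "",
                    R: float = 1.0, P: float = 1.0, eps: float = 0.01) -> float:
    """Reward described in the text; R, P, eps defaults are placeholders.

    n_read is the number of reply messages read so far (n >= 1).
    """
    if action == "continue":
        if n_read < 1:
            raise ValueError("n_read must be at least 1")
        return -(math.log(n_read) + eps)     # grows with the replies read
    if action == "terminate":
        if predicted == actual:
            return R
        # A missed rumor costs twice as much as a false alarm.
        return -2.0 * P if actual == "rumor" else -P
    raise ValueError(f"unknown action: {action}")
```

Unlike a constant step penalty, `proposed_reward("continue", n_read=50)` is far more negative than `proposed_reward("continue", n_read=2)`, so the agent is pushed to decide before many replies have accumulated.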

RDM.
In view of the difference between the original tweet and the reply information, this paper proposes a dual-engine RDM based on the self-attention mechanism. The specific model architecture is shown in Figure 3, which is mainly composed of the original tweet network and the reply information network.
In the network for the source tweet, the text data passes through the word embedding layer, the GRU network, and the word-level self-attention mechanism in turn. The metadata feature extractor is used to extract the credibility characteristics of the tweet publisher and the basic information of the original tweet.
The specific calculation process is as follows: $\mathrm{Output} = \{\mathrm{output}_1, \mathrm{output}_2, \ldots, \mathrm{output}_n\}$. The metadata features are listed in Table 1.
Since the reply information usually carries strong emotional color, is expressed more casually, and has an unstable word order, this paper uses a bidirectional GRU, Text-CNN, and a text feature extractor to extract the reply information features in parallel, and the final feature vector is constructed through vector concatenation.
(1) Bidirectional GRU. The word order of reply messages is unstable. In order to extract more information, we use a bidirectional GRU network, which encodes each reply in both the forward and the backward direction.
(2) Text-CNN. To handle the casual way in which replies are written, this paper uses Text-CNN to extract semantic features from text with abnormal word order. Text-CNN consists of a convolutional layer and a pooling layer. The convolutional layer is used to extract text features, a process that can be expressed by the following formula:
$$a_i = f(W \cdot M_{i:i+h-1} + b), \qquad A = [a_1, a_2, \ldots, a_{n-h+1}].$$
$M_{i:i+h-1}$ denotes the word vectors from row $i$ to row $i+h-1$ of the word vector matrix. After a linear transformation (with kernel weights $W$ and bias $b$) is applied to $M_{i:i+h-1}$, the activation function $f$ is used to obtain $a_i$, which represents the $i$-th text feature extracted by the convolution kernel of length $h$. Finally, all the features extracted by the convolution kernel are concatenated to obtain the vector $A$. The above process describes one convolution kernel, and the same steps are repeated for multiple convolution kernels. In addition, the pooling function of this model is max pooling; that is, after the features extracted by the convolutional layer are obtained, the largest feature is selected to represent all the features, which can be expressed by the formula $a = \max\{a_1, a_2, \ldots, a_{n-h+1}\}$.
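The convolution and max pooling steps can be checked on a small example. This is a sketch in plain Python under stated assumptions: ReLU is assumed for the activation function f (the text does not name it), and the function names and toy matrices are hypothetical.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]


def relu(z: float) -> float:
    return max(0.0, z)


def conv_features(M: Matrix, kernel: Matrix, bias: float,
                  f: Callable[[float], float] = relu) -> List[float]:
    """Slide a kernel of length h over the n x d word-vector matrix M.

    Returns A = [a_1, ..., a_{n-h+1}], where a_i = f(W . M[i:i+h] + b).
    """
    n, h = len(M), len(kernel)
    feats = []
    for i in range(n - h + 1):
        window = M[i:i + h]
        z = sum(w * x
                for k_row, m_row in zip(kernel, window)
                for w, x in zip(k_row, m_row)) + bias
        feats.append(f(z))
    return feats


def text_cnn(M: Matrix, kernels: Sequence[Matrix],
             biases: Sequence[float]) -> List[float]:
    """Max-pool the feature map of each kernel into a single value."""
    pooled = []
    for kernel, bias in zip(kernels, biases):
        feats = conv_features(M, kernel, bias)
        if not feats:
            raise ValueError("text is shorter than the kernel length")
        pooled.append(max(feats))
    return pooled
```

For a four-word text and a kernel of length 2, the feature map has 4 - 2 + 1 = 3 entries, and pooling keeps only the largest, so the output length depends on the number of kernels rather than on the text length.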
(3) Text feature extractor. The text feature extractor extracts relatively intuitive text features, such as the count of negative words, whether there is an exclamation mark, and the similarity with the original tweet. The calculation formula of cosine similarity between two vectors $u$ and $v$ is as follows:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
(4) Post-level self-attention mechanism. The self-attention mechanism at the post level weights the encodings of all reply messages after feature extraction, so that the model can pay attention to the useful replies. The specific calculation process is shown in the following formula:
$$R = \{\mathrm{reply}_1, \mathrm{reply}_2, \ldots, \mathrm{reply}_n\} \qquad (10)$$
$$Q = W_Q R, \qquad K = W_K R, \qquad V = W_V R$$
In formula (10), $\mathrm{reply}_n$ represents the encoding of a reply message by the reply message network, and $R$ is the set of reply message codes. The self-attention mechanism requires three vectors $Q$, $K$, and $V$ to represent query, key, and value, respectively, which are obtained from $R$ through three different linear transformations. Attention can be calculated based on these vectors. The specific calculation process is shown in the following formula:
$$a_i = \mathrm{softmax}(f(Q, K_i)) = \frac{\exp(f(Q, K_i))}{\sum_j \exp(f(Q, K_j))} \qquad (14)$$
$$\mathrm{Attention}(Q, K, V) = \sum_i a_i V_i$$
In formula (14), $f(Q, K_i)$ is a general linear transformation that calculates the similarity between $Q$ and $K$. The attention weight $a_i$ of each post is obtained through the softmax function, and the product of the weights and the vector $V$ is the reply information after attention.
(1) Pretraining the RDM. There must be a reliable RDM as a basis before training the control module (CTM).
This pretrained RDM reads all the reply information during the training process. This also means that the performance of the ERD model finally obtained after adding the CTM will not exceed the performance of the pretrained RDM. The purpose of adding the CTM is to let the RDM reach its best performance with the least information, and that best performance is the performance of the pretrained RDM. In the RDM training process, the batch size is 80, the dropout is 0.4, and the loss function is the cross-entropy loss. In addition, the Adam optimizer is used with an initial learning rate of 0.0005. During training, when the model's loss on the validation set fails to decrease twice in a row, the learning rate is reduced by a factor of 10. When the model's loss on the validation set stabilizes, the training process stops.
(2) Training the CTM. If the CTM is trained directly, its parameters will lack stability, which makes it difficult for the model to converge. Therefore, in order to speed up convergence, we use a dual network structure and an experience replay mechanism to train the CTM. The dual networks are two CTM networks with the same structure, called the current network (parameters $\theta$) and the target network (parameters $\theta'$). The experience pool stores $n$ four-tuple records $(s_t, a_t, r_t, s_{t+1})$ about the environment, and a batch of samples is randomly drawn from the experience pool for each training step. The training process trains the current network first and updates the target network with the parameters of the current network after a certain number of batches. The rumor detection problem is relatively special: traditional reinforcement learning is aimed at a single environment, such as a game scene, but in the rumor detection problem each rumor sample is actually an environment, so multiple environments must be considered when constructing the experience pool. The specific training process is shown in Algorithm 1.
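The two stabilizing mechanisms described in (2), the experience pool and the periodic copy of the current network into the target network, can be sketched as follows. All names are hypothetical, the parameters are represented as a plain dictionary for illustration, and the discount factor and the handling of terminal states are assumptions that the text does not specify.

```python
import copy
import random
from collections import deque


class ExperiencePool:
    """Fixed-size pool of (s_t, a_t, r_t, s_next) records."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) drops the oldest record automatically when full.
        self.records = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def insert(self, s, a, r, s_next):
        self.records.append((s, a, r, s_next))

    def is_full(self):
        return len(self.records) == self.records.maxlen

    def sample(self, batch_size):
        return self.rng.sample(list(self.records), batch_size)


def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """y_t = r_t + gamma * max_a Q(s_next, a; theta_target).

    `gamma` and the terminal shortcut are assumptions for this sketch.
    """
    return reward if terminal else reward + gamma * max(next_q_values)


def sync_target(current_params, target_params):
    """Copy the current network's parameters into the target network."""
    target_params.clear()
    target_params.update(copy.deepcopy(current_params))
```

Because each rumor sample is its own environment, records from many different samples end up mixed in the same pool, and random sampling breaks the correlation between consecutive steps of one sample.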

Experimental Environment.
The experimental environment is shown in Table 3.

Dataset.
The experimental dataset is the public rumor dataset PHEME Dataset of Rumors and Non-Rumors (PHEME Rumor) [7], which contains rumor and nonrumor data about five breaking news events collected on Twitter by Arkaitz Zubiaga in 2016. The data distribution of each breaking news event is shown in Table 4. The dataset is divided into a training set, a validation set, and a test set with the ratio 7 : 1 : 2. The validation set is used to observe the training results in real time, and the model with the best validation results is selected as the final model. Finally, the test set is used to test the performance of the model.

Baselines.
This paper selects the following RDMs as baseline algorithms:
(1) CRF: the RDM proposed by Zubiaga et al. [23].
(2) GAN-GRU: the RDM proposed by Ma et al. [24].
In the current ERD research, only ERD [21] uses reinforcement learning to solve the ERD problem, so we use ERD as the baseline method. The control variable method is used in the comparison: the submodels of the two models are cross-combined into multiple models for comparison experiments.
First, the reinforcement learning module in ERD is named RL1, and its RDM is named RDM1; the reinforcement learning module in the model proposed in this paper is named RL2, and its RDM is named RDM2. The experimental models to be compared are the following three:
(1) RDM1_RL1: the ERD model.
(2) RDM2_RL1: the RDM proposed in this paper combined with the control module of ERD.
(3) RDM2_RL2: the ERD model proposed in this paper.

Rumor Detection Performance Evaluation.
Comparative experimental results of eight RDMs are shown in Table 5.
It can be seen from Table 5 that GAN-GRU performs well in terms of accuracy but poorly in terms of precision and recall; the model of Zhou et al. is relatively stable; LSTM-Attention achieves the highest precision; and the models proposed in this paper, SA-SE, SA-DE, and DSA-DE, achieve good results on all four indicators. With the double self-attention (DSA) mechanism and the dual-engine network added, DSA-DE improves accuracy by 2.3% over the sentence-level dual-engine model SA-DE and by 5% over the single-level attention, single-engine model SA-SE. This shows that the dual-engine network and the DSA mechanism improve performance on the rumor detection task.

ERD Efficiency Evaluation.
We use reinforcement learning to achieve ERD by controlling the number of replies input to the RDM. To evaluate the model, we first use standard indicators: accuracy, precision, recall, and F1 score.
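For reference, the four standard indicators can be computed as in the sketch below, which treats "rumor" as the positive class. That choice, the function name, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "rumor") -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Recall is the indicator most sensitive to missed rumors, which is why it matters when rumors are the minority class.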
It can be seen from Table 6 that, with the control module RL1 added, the accuracy of RDM2_RL1 is 0.05 higher than that of RDM1_RL1, indicating that the dual-engine RDM based on the DSA mechanism proposed in this paper performs better than the rumor detection module in ERD. Comparing the control modules, when the control modules RL1 and RL2 are added to RDM2, the accuracy rates drop to 0.80 and 0.81, respectively, indicating that the DRQN-based control module proposed in this paper can maintain a high accuracy rate.
Secondly, we use the average number of replies used for each sample as one of the evaluation indicators, recorded as mean posts used. To evaluate earliness more intuitively, we additionally report the early detection rate.

Algorithm 1
Input: network Q(s, a), environment set E, experience pool P
Output: Q(s, a, θ′)
Initialize current network Q(s, a, θ) and target network Q(s, a, θ′), with θ′ = θ
for each epoch do
    Select an environment e from E
    Initialize environment e, and get state s_t
    while true do
        According to s_t, use the ϵ-greedy strategy to select action a_t from Q(s, a, θ)
        Perform action a_t in the environment to get the new state s_{t+1} and reward r_t
        if P is full then
            Delete the oldest experience record
        end if
        Insert (s_t, a_t, r_t, s_{t+1}) into P
        s_t ← s_{t+1}
        if s_t is the last state then
            break
        end if
    end while
    if P is full then
        Select a batch of records from P randomly
        for each record do
            Use the target network to get y_t = r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, θ′)
            Use the loss function (y_t − Q(s_t, a_t, θ))^2 to update the current network Q(s, a, θ)
            Update the target network with the current network every n epochs
        end for
    end if
end for

Figure 5 shows the experimental results in early detection of the models RDM1_RL1, RDM2_RL1, and RDM2_RL2 with the control module added. The average number of reply messages used and the early detection rate are shown in Figures 5(a) and 5(b), respectively. It can be seen that the proposed model RDM2_RL2 uses the least information. On average, each sample uses only 1.004 reply messages, and the early detection rate is 8.058, indicating that the model proposed in this paper finds a better balance between accuracy and timeliness, so that it can identify rumors at an early stage while maintaining accuracy. Figure 6 shows the change of the average reward value of the model RDM2_RL2 during the training process.
It can be seen that the average reward value of the model stabilizes between 23 and 24 after 40 rounds of training, indicating that the proposed DRQN-based control module is effective. Figure 7 shows the changes in the early detection rate of the models RDM1_RL1, RDM2_RL1, and RDM2_RL2. It can be seen that, compared with RDM1_RL1, the model RDM2_RL1 has the same learning ability, but its final training result is better than that of RDM1_RL1. This shows that the proposed RDM is more helpful for detecting rumors. The proposed model RDM2_RL2 has a stronger learning ability. It reaches its peak early detection rate at about 50 rounds of training, and its early detection rate far exceeds those of RDM1_RL1 and RDM2_RL1. Therefore, the proposed model performs well in both the accuracy and the timeliness of rumor detection.

Conclusions
In terms of rumor detection, this paper first identifies three problems in existing research: the inability to obtain the optimal representation of the reply information, the neglect of the difference between the original tweet and the reply information, and the inability to handle redundant data well. In response to these problems, this paper takes the difference between the original tweet and the reply information in Twitter data as an entry point and proposes a dual-engine RDM, which processes the original tweet and the reply information separately. The DSA mechanism is proposed to solve the problem of data redundancy in the two dimensions of sentences and words. Because existing multitask models cannot obtain the optimal representation of the reply information, this paper uses a single-task learning model and lets the model itself learn to encode the reply messages. The final experimental results show that the method proposed in this paper has a better detection effect.
In terms of ERD, this paper approaches the problem from the perspective of reinforcement learning. First, the following problems are found through analysis of existing research: the potential meaning of the state sequence is ignored, the reward function is imperfect, and the performance of the RDM is poor. In response to these problems, this paper proposes an ERD model based on DRQN and describes the model in detail. In order to analyze the experimental results more effectively, this paper proposes the ERD rate as an evaluation index for the performance of the ERD model. Finally, the model is verified on the rumor dataset. The experimental results show that the proposed model can detect rumors earlier while maintaining the accuracy of rumor detection.
Although the model proposed in this paper has achieved good results by comparing the baseline method, it can still be optimized from the following perspectives.
In natural language processing tasks, the data cleaning stage cannot be ignored. In future research, more fine-grained methods can be considered to clean information such as words, sentences, special symbols, and URLs.
In terms of rumor detection, the language model can be modeled using the relatively new BERT. BERT can learn specific expressions of words in specific language scenarios, and the model may achieve better performance. In addition, in terms of features, more meaningful features can be explored for experimentation.
In the ERD problem, more powerful reinforcement learning algorithms such as A3C can be used for modeling. The reward function and training method can also be further optimized. In addition, some potential temporal features can be explored for modeling. Of course, the problem of ERD is not necessarily limited to reinforcement learning, and there may be more suitable methods for ERD.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.