A Comprehensive Review on Fake News Detection With Deep Learning

A prominent issue of the present time is that organizations from different domains are struggling to obtain effective solutions for detecting online fake news. Distinguishing fake information on the Internet is challenging because it is often written to deceive users. Compared with many machine learning techniques, deep learning-based techniques can detect fake news more accurately. Previous review papers were based on data mining and machine learning techniques, scarcely exploring deep learning techniques for fake news detection. Moreover, emerging deep learning-based approaches such as attention mechanisms, Generative Adversarial Networks, and Bidirectional Encoder Representations from Transformers are absent from previous surveys. This study investigates advanced and state-of-the-art fake news detection mechanisms in depth. We begin by highlighting the consequences of fake news. We then discuss the datasets used in previous research and the associated NLP techniques. A comprehensive overview of deep learning-based techniques is provided, organizing representative methods into categories. The prominent evaluation metrics in fake news detection are also discussed. Finally, we offer recommendations for improving fake news detection in future research directions.


I. INTRODUCTION
The Internet has changed the way people interact and communicate through low cost, simple access, and fast information dissemination. Consequently, social media and online portals have become more popular than traditional newspapers for searching and reading news. Although social media has become a powerful means of information, it can harm society by influencing major events. The issue of online false news gained particular attention after the 2016 U.S. presidential election [1], [2]. According to Zhang and Ghorbani [3], voters can be easily manipulated by deceptive political statements and claims. Research shows that false news propagates more quickly through humans than true information and causes tremendous effects [4].
The terms rumor and fake news are closely interrelated. Fake news, or disinformation, is intentionally created. On the other hand, rumors are unconfirmed and questionable information spread without the aim to deceive [15]. On social media sites, spreaders' intentions can be difficult to determine; as a result, any false or incorrect information on the Internet is typically branded as misinformation. Distinguishing real from fake information is challenging, but many approaches have been adopted to address this issue. Various machine learning (ML) methods have been used to detect false information spread online, including knowledge verification [16], natural language processing (NLP) [16]-[18], and sentiment analysis [19]. Early research concentrated on leveraging textual information derived from the article's content, such as statistical text features [20] and emotional information [21]-[23]. (The associate editor coordinating the review of this manuscript and approving it for publication was Sergio Consoli.)
Deep learning (DL) has recently become an emerging technology among the research community and has proven more effective in recognizing fake news than traditional ML methods. DL has particular advantages over ML, such as a) automated feature extraction, b) light dependence on data pre-processing, c) the ability to extract high-dimensional features, and d) better accuracy. Further, the current wide availability of data and programming frameworks has boosted the usage and robustness of DL-based approaches. Hence, in the last five years, numerous articles have been published on fake news detection, mostly based on DL strategies [24]. This review compares the extensive body of DL-based fake news detection research efforts. (This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.)
A number of research works have been published surveying fake news detection [5], [25], [26]. Our investigation reveals that existing studies do not provide a thorough overview of deep learning-based architectures for detecting fake news. The existing survey papers mostly cover ML strategies for detecting fake news, scarcely exploring DL strategies [3], [9], [10]. We provide a complete list of NLP techniques and describe their benefits and drawbacks. In this survey, we perform an in-depth analysis of current DL-based studies. Table 1 provides a brief overview of the existing survey papers and our research contributions. The present study aims to address the weaknesses and strengths of previous research by conducting a systematic survey on fake news detection. First, we divide existing fake news detection research into two main categories: (1) Natural Language Processing (NLP) and (2) Deep Learning (DL). We discuss NLP techniques such as data pre-processing, data vectorizing, and feature extraction. Second, we analyze fake news detection systems based on different DL architectures. Finally, we discuss the evaluation metrics used in fake news detection. Figure 1 depicts an overall taxonomy of fake news detection approaches. Table 2 lists the acronyms used throughout the survey to assist readers who encounter unfamiliar abbreviations.
The rest of the paper is organized as follows. Section II highlights the consequences of fake news. Section III describes the datasets used. Section IV explains the Natural Language Processing techniques in fake news detection. Section V contains an in-depth analysis of deep learning strategies. Section VI presents the evaluation metrics used in previous studies. Section VII narrates the challenges and future research directions. Finally, Section VIII concludes the paper.

II. FAKE NEWS CONSEQUENCES
Fake news has existed since the beginning of human civilization. However, its spread has been amplified by modern technologies and the transformation of the global media landscape. Fake news can have major consequences on social, political, and economic environments, and fake information takes many forms. Because information molds our view of the world, fake news has a huge impact: we make critical decisions and develop impressions about situations and people based on the information we obtain. We cannot make good decisions based on fake, false, distorted, or fabricated information on the Internet. The primary impacts of fake news are as follows:
Impact on Innocent People: Rumors can have a major impact on specific people, who may be harassed on social media and face insults and threats with real-life consequences. People should not believe unverified information on social media or judge a person based on it.
Impact on Health: The number of people searching for health-related news on the Internet is continuously increasing. Fake news in health has a potential impact on people's lives [36]. Therefore, this is one of the major challenges today. Misinformation about health has had a tremendous impact in the last year [37]. Social media platforms have made some policy changes to ban or limit the spread of health misinformation as they face pressure from doctors, lawmakers, and health advocates.
Financial Impact: Fake news is currently a crucial problem in industries and the business world. Dishonest businessmen spread fake news or reviews to raise their profits. Fake information can cause stock prices to fall and can ruin the reputation of a business. Fake news also distorts customer expectations and can foster an unethical business mentality.
Democratic Impact: The media has discussed the fake news phenomenon extensively because fake news played a vital role in the last American presidential election. This is a major democratic problem; the spread of fake news must be stopped because it has real impact.

III. BENCHMARK DATASET
In this section, we discuss the datasets used in various studies. Benchmark datasets were utilized for both training and testing. One of the difficulties in identifying fake news is the shortage of a massive, labeled benchmark dataset with trustworthy ground-truth labels, from which researchers can obtain practical features and construct models [38]. Such datasets have been collected over the last few years for several uses in DL and ML. The datasets are vastly diverse from one another because of different study agendas. For instance, a few datasets are made up entirely of political statements (such as PolitiFact), while others are made up entirely of news articles (FNC-1) or social media posts (Twitter). Datasets can differ based on their modality, labels, and size; we therefore categorize the datasets in Table 3 based on these characteristics. Fake articles are frequently collected from fraudulent websites designed intentionally to disseminate disinformation. These false news stories are eventually shared on social media platforms by their creators. Malicious individuals or bots, along with inattentive users who do not check the source of a story before sharing it, assist in spreading fake news through social media. However, most datasets contain only news content, and language features and writing style alone are not sufficient for developing an efficient detection model. Fake news, Twitter15, and Liar are the most popular publicly available datasets. However, some studies trained their models on datasets they created themselves [39]; we define these as self-collected. Since sufficient information is not provided about these self-collected datasets, it is difficult to compare such studies properly. Using a benchmark dataset, a comparative study can be established against current state-of-the-art methods for detecting fake news. Kaliyar et al. [40] conducted a comparative study of their suggested model against existing methods using the Kaggle dataset and reported an accuracy of 93.50%, the highest utilizing that dataset for fake news detection. A pie chart of the benchmark datasets used is given in Figure 2.

IV. NATURAL LANGUAGE PROCESSING
Natural Language Processing (NLP) is an area of machine learning concerned with a computer's ability to understand, analyze, manipulate, and potentially generate human language. NLP pipelines consist of data pre-processing and word embedding. By utilizing deep learning techniques, NLP has seen colossal advancements in recent years [41]. Natural language must be transformed into a mathematical structure for machines to make sense of it. NLP techniques are discussed in Sections IV-A, IV-B, and IV-C.

A. DATA PRE-PROCESSING
Data pre-processing is used to represent complex structures with attributes, binarize attributes, transform discrete attributes, and manage missing and obscure attributes.
Different visualization procedures are helpful during data pre-processing. A careful pre-processing strategy is required before ingesting data into a neural network for fake news detection, because social media data sources are fragmented, unstructured, and noisy. It is well known that data pre-processing saves computational time and space during the learning stage. In addition to limiting the impact of artifacts during the learning process, text pre-processing prevents noisy data from being ingested. After proper text pre-processing, the data becomes a logical representation that retains the most representative descriptive words. Umer et al. [42] experimented with a fake news detection model in which accuracy was only 78% when features were used without data cleaning or pre-processing, which is surprisingly poor. After performing the pre-processing steps and removing unnecessary data, the accuracy increased dramatically to 93.0%. Data quality assessment, dimensionality reduction, and dataset splitting are the pre-processing steps used in various studies [39], [41], [43]. These steps are elaborated in Sections IV-A1, IV-A2, and IV-A3.

1) DATA QUALITY ASSESSMENT
Data are frequently collected from numerous sources of varying reliability and in completely different formats. When working on a machine learning problem, considerable time is invested in managing data quality issues. It is unreasonable to expect the data to be perfect; issues may arise from human error, defects in the data collection process, or limitations of measuring devices. The quality of a dataset is often responsible for the poor performance of fake news detection models, so the quality of the data used in any machine learning project has a huge effect on its chances of success. However, only a few studies ensure the quality of the datasets they use. S and Chitturi [41] collected the George McIntire dataset from GitHub and, during cleaning, dropped the rows that did not have labels, a step that surely contributed to their success in fake news detection.
To ensure the quality of the entire dataset, Wang et al. [44] removed duplicate and low-quality images. Alsaeedi and Al-Sarem [45] extended the data cleaning process with URL removal, lowercasing, hashtag character (#) and mention character (@) removal, and number removal. They also normalized words with recurring characters, such as ''Likkke'', and handled emoticons by replacing positive emoticons with the word ''positive'' and negative emoticons with ''negative''.
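Cleaning steps of this kind can be sketched as follows. This is an illustrative reading of the rules described above, not the authors' published implementation; the emoticon sets and the exact regular expressions are assumptions.

```python
import re

# Hypothetical emoticon lists for illustration only.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}

def clean_tweet(text: str) -> str:
    # Replace emoticons with sentiment words before other stripping.
    for emo in POSITIVE_EMOTICONS:
        text = text.replace(emo, " positive ")
    for emo in NEGATIVE_EMOTICONS:
        text = text.replace(emo, " negative ")
    text = re.sub(r"https?://\S+", " ", text)   # URL removal
    text = re.sub(r"[@#]", " ", text)           # strip mention/hashtag characters
    text = re.sub(r"\d+", " ", text)            # number removal
    text = text.lower()                         # lowercasing
    text = re.sub(r"(.)\1{2,}", r"\1", text)    # collapse 3+ repeated characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Likkke this :) @user #fake http://t.co/x 2021"))
# -> "like this positive user fake"
```

Collapsing repeated characters turns ''Likkke'' into ''like'', matching the normalization the authors describe.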

2) TRAIN/VALIDATION/TEST SPLIT
The dataset may be divided into train, test, and validation sets. The training set is the sample of data used to fit the model parameters. The validation set is a series of examples used to fine-tune the hyperparameters of a model. A set of examples applied only for assessing a fully-specified model's performance is regarded as the test set. Although many studies on fake news detection have divided their datasets into training, validation, and test sets, a few studies have used only training and test sets [46], [47]. Data split ratios of 60:20:20, 70:30, and 80:20 are very common in fake news detection. The Pareto principle (for many outcomes, roughly 80% of consequences come from 20% of the causes) is often used to motivate the 80:20 ratio, and it is typically a safe bet to use the ratios that most studies applied. Mandical et al. [48] applied ratios of 90:5:5 and 80:10:10 when the number of articles in the dataset was less than 10,000 and greater than 10,000, respectively; however, they did not specify the purpose behind this. Jadhav and Thepade [49] compared their model performance based on the data splitting ratio and showed that a 75:25 split performed better than models with other splits. Model parameter estimates exhibit greater variation with smaller training data, while performance statistics exhibit greater variation with smaller testing data. Studies should split data carefully so that neither variation grows too large, which has more to do with the total number of instances in each category than with the percentages. The optimal split of the train, validation, and test sets depends on the hyperparameters, model architecture, data dimension, etc. Table 4 provides an overview of the advantages and disadvantages of the splitting ratios used in most studies.
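A three-way split of the kind discussed above can be sketched as follows; the 80:10:10 ratio and the fixed shuffling seed are illustrative choices, not taken from any cited study.

```python
import random

def train_val_test_split(samples, train=0.8, val=0.1, seed=42):
    # Shuffle a copy so the original order is preserved.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                      # fit model parameters
            shuffled[n_train:n_train + n_val],       # tune hyperparameters
            shuffled[n_train + n_val:])              # final evaluation only

train_set, val_set, test_set = train_val_test_split(list(range(1000)))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models on the same benchmark.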

3) TOKENIZATION, STEMMING AND LEMMATIZATION
Tokenization is a method of breaking down a text into words. The split can be performed on any character, but splitting on the space character is the most common approach.
Stemming chops off a word's ending to reach the base word and usually includes the removal of derivational affixes. A derivational affix is an affix by which one word is obtained from another; the derived word is usually of a distinct word class from the original.
Lemmatization is a text normalization procedure that morphologically analyzes words and generates the root form of inflected words; it is normally intended to remove inflectional endings [64]. An inflectional ending is a group of letters added to the end of a word to modify its meaning; for example, adding -s to bat yields bats.
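The tokenization and stemming steps can be illustrated with a toy example. The suffix list below is a deliberate simplification for illustration; practical systems use a full stemmer (e.g., Porter's algorithm) and a dictionary-backed lemmatizer instead.

```python
def tokenize(text: str):
    # Most common approach: lowercase, then split on whitespace.
    return text.lower().split()

def stem(word: str):
    # Naive suffix stripping; a real stemmer has many more rules.
    for suffix in ("ation", "ings", "ing", "ed", "es", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # do not strip "s" from words like "across"
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Fabricated stories spreading across platforms")
print([stem(t) for t in tokens])
# -> ['fabricat', 'stori', 'spread', 'across', 'platform']
```

Note that stems such as ''stori'' need not be dictionary words, which is exactly where lemmatization (mapping ''stories'' to ''story'') differs from stemming.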
Rusli et al. [52] performed two experiments to detect fake news, with and without stemming and stop-word removal. Using stemming and stop-word removal to strip all affixes and stop-words, they achieved a 0.82 macro-averaged F1-score; without these steps, they achieved 0.80. Performing stemming and stop-word removal in the text pre-processing phase was time-consuming, yet made only a small difference in the results. Although tokenization, stemming, and lemmatization can improve classifier performance, many researchers have not used these techniques [4], [65]. Jain and Kasbe [66] presented a simple technique with web scraping for detecting fake news, showing that a model's truthfulness can be checked by updating the dataset regularly through web scraping. The authors achieved an accuracy of 91% based on text; the result could be improved greatly with some extra pre-processing, such as stemming and stop-word removal.

B. WORD VECTORIZING
Word vectorizing involves mapping words or text to a list of vectors. TF-IDF and Bag of Words (BoW) vectorization techniques are commonly used in machine learning strategies to identify fake news [4], [53], [63]. In term frequency-inverse document frequency (TF-IDF), a word's value rises proportionally to the number of times it appears in the document but is offset by the word's frequency in the corpus. Although this vectorization is successful, the semantic sense of the words is lost in the translation to numbers [48]. The BoW technique treats every news article as a document and computes the frequency count of each word within it, which is then used to produce a numeric representation of the data. Beyond this information loss, the approach has further limitations: the relative location of words is overlooked, and contextual information is lost. This loss can be costly when measured against the gain in computational convenience and ease of use [46]. Rusli et al. [52] used TF-IDF and BoW feature extraction methods to detect fake news; however, this approach may suffer from loss of information.
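The TF-IDF weighting described above can be computed directly. The three-document corpus below is purely illustrative; libraries such as scikit-learn provide production implementations with smoothing variants.

```python
import math
from collections import Counter

docs = [
    "breaking fake story goes viral",
    "official report confirms story",
    "fake claims spread online",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)       # term frequency in this document
    return tf * math.log(N / df[term])    # scaled by inverse document frequency

# "fake" occurs in 2 of 3 documents, "viral" in only 1, so within the
# first document "viral" receives the higher weight.
print(tfidf("viral", tokenized[0]) > tfidf("fake", tokenized[0]))  # True
```

This also shows the limitation noted above: the score depends only on counts, so word order and context play no role.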
Neural network-based models have achieved success on diverse language-related tasks, as opposed to traditional machine learning models such as logistic regression or support vector machines (SVM), by utilizing word embeddings in fake news detection. A word embedding maps words or text to a list of vectors; these low-dimensional, distributed feature representations are appropriate for natural language. The term ''word embedding'' refers to a combination of language modeling and feature learning in which words or expressions from the lexicon are assigned to real-number vectors. Neural network models primarily utilize this method for fake news detection [42], [96]. In word embedding, words are represented using dense vectors that map each word onto a continuous vector space. This is considered an improvement over the BoW model, in which large sparse vectors of vocabulary size were used as word vectors and provided no information about how two words are interrelated [50]. Recently, fake news detection researchers have used pre-trained word-embedding models such as Global Vectors for Word Representation (GloVe) and Word2vec. The primary benefit of these models is their ability to train on large datasets [40]. Unlike Word2vec, GloVe supports parallel implementation, making it easier to train the model on huge datasets. Table 5 summarizes the NLP techniques and word vector models used in deep learning-based fake news detection papers.
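Once pre-trained vectors such as GloVe or Word2vec are loaded, the relatedness of two words is typically measured by cosine similarity between their dense vectors. The 4-dimensional vectors below are invented for illustration; real embeddings have 50-300 dimensions.

```python
import math

# Toy embedding table (made-up values, not real GloVe/Word2vec vectors).
embeddings = {
    "fake":   [0.8, 0.1, 0.3, 0.2],
    "false":  [0.7, 0.2, 0.4, 0.1],
    "banana": [0.0, 0.9, 0.1, 0.8],
}

def cosine(u, v):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(embeddings["fake"], embeddings["false"]))   # high: related words
print(cosine(embeddings["fake"], embeddings["banana"]))  # low: unrelated words
```

This is exactly the relational information the sparse BoW vectors above cannot express: in a good embedding space, semantically close words end up with similar vectors.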

C. FEATURE EXTRACTION
Analyzing a large number of variables requires a huge amount of computational power and memory, and classification algorithms may overfit the training samples and generalize poorly to new samples. Feature extraction builds combinations of variables to overcome these difficulties while still representing the data with adequate precision. Feature extraction and feature selection are frequently used in text mining [69], [97].
Fake news detection strategies concentrate on applying news content and social context features [98]. News content features depict the meta-information relevant to a piece of news [5]. Commonly, in news validation, news content (linguistic and visual information) is used as a feature [99], [100]. Textual features comprise the writing style and emotion [101], [102]. Furthermore, hidden textual representations are generated using tensor factorization [103]-[105] and deep neural networks [106]-[108], achieving high performance in detecting false news from news content. Visual features are retrieved from visual components such as images and video, but only a few studies utilized visual features in fake news detection [109], [110]. In contrast, social context information can also be aggregated for detecting fake news on social media. There are three main perspectives on social content: a) users, b) produced posts, and c) networks (connections amid the users who distributed relevant posts) [5]. User-based features are typically drawn from the user profile on social media [98], [111]. Users' social responses in terms of stances [42], [64], topics [112], or credibility [113]-[115] are represented via post-based features. Recently, several studies have focused on stance features to detect fake news [64]; these can be effective for human fact-checkers in distinguishing false claims [113], [114]. To check the authenticity of a claim/report/headline, it is essential to understand what different news agencies are declaring about that particular claim/report/headline [116]. Network-based features are retrieved by creating specialized networks, such as diffusion networks, interaction networks, and propagation networks [117]-[119].
The propagation network contains rich information about user interactions (likes, comments, responses, or shares) that  show the direction of information flow, timestamp details about interactions, textual information about user interactions, and user profile information about the users who are interacting [120]. We provide Figure 3 depicting important features that were utilized to detect fake news precisely.
It is pivotal to choose the correct selection algorithm for reducing features because feature reduction has a great effect on text classification results. Some common feature reduction algorithms include the Gini Index (GI), Term Frequency-Inverse Document Frequency (TF-IDF), Information Gain (IG), Mutual Information (MI), Principal Component Analysis (PCA), and Chi-Square Statistics (CHI). In content classification, the linear classification model works well with the TF-IDF model [121]. PCA and chi-square have been utilized to improve the adaptability of text classifiers combined with deep learning models. A number of studies compared their model accuracy with and without feature extraction and found a higher success rate with it. Umer et al. [42] compared feature reduction methods (PCA and chi-square) applied with two deep learning models; when their proposed model was utilized with the reduced feature set, the F1-score and accuracy increased by 20% and 4%, respectively, compared to the other techniques. However, many studies did not perform feature extraction although it has a significant impact on the result [16], [122]. Neural networks are considered very powerful machine learning tools due to their capacity for complex feature extraction. Instead of relying on manual feature selection and other existing techniques, researchers currently focus on neural networks for feature extraction [123]. Yang et al. [124] employed TI-CNN (text- and image-information-based convolutional neural network) to extract latent features from both visual and textual information and achieved promising results. Another study [107] used a deep recurrent neural network to extract a collection of latent features for news producers, posts, and topics.
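The chi-square term-scoring idea behind CHI-based feature reduction can be sketched on a 2x2 contingency table: terms whose presence is strongly associated with one class score highly and are kept. The counts below are invented for illustration.

```python
def chi_square(a, b, c, d):
    # a: fake docs containing the term,  b: real docs containing it,
    # c: fake docs without the term,     d: real docs without it
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: a term appears in 40 of 50 fake articles
# but in only 5 of 50 real ones, so it is a strong class indicator.
print(round(chi_square(40, 5, 10, 45), 2))  # 49.49
# A class-independent term scores near zero and would be discarded.
print(round(chi_square(25, 25, 25, 25), 2))  # 0.0
```

Ranking all vocabulary terms by this score and keeping the top k is one simple way to shrink the feature space before training a classifier.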

V. DEEP LEARNING APPROACH FOR FAKE NEWS DETECTION
Deep learning models have seen exceptional growth in recent times owing to their promising success in several fields, including communication and networking [125], [126], computer vision [127], [128], intelligent transportation [129], speech recognition [130], and NLP. Deep learning systems have advantages over traditional machine learning methods. Deep learning is a subfield of machine learning that displays high precision in fake news detection. Generally, ML methods are based on hand-crafted features; because feature extraction is challenging and slow, biased features may appear. ML approaches have also failed to achieve prominent results in fake news detection because they produce high-dimensional representations of linguistic information, resulting in the curse of dimensionality. Existing neural network-based models have outperformed the traditional models owing to their exceptional feature extraction ability [62]. In contrast, DL systems can acquire hidden representations from less complex inputs; the hidden features can be extracted from both the news content and the context. A study by Hiramath and Deshpande [78] showed that deep neural networks (DNNs) require less time than other ML-based classification algorithms such as logistic regression, random forest (RF), and SVM, although DNNs use more memory. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are two broadly utilized paradigms for deep learning in cutting-edge artificial neural networks. Therefore, we provide Figure 4, which shows the percentage of DL-based fake news detection papers using each classifier in recent years.
After inspecting previous studies, we found a general framework for deep learning-based fake news detection. The first step was to collect a dataset or create one. Most studies have used news articles collected from publicly available datasets. The pre-processing technique was applied after collecting the dataset to feed the data in a neural network [42], [96], [131]. Word2vec and GloVe word embedding methods have mostly been used in previous studies to map words into vectors [41], [78], [80]. We represent an overall process for fake news identification with deep learning in Figure 5 based on various studies [40], [42], [61].
We examined 148 DL-based studies to provide a detailed description of these architectures: CNN in Section V-A, RNN in Section V-B, Graph Neural Networks in Section V-C, Generative Adversarial Networks in Section V-D, Attention Mechanisms in Section V-E, Bidirectional Encoder Representations from Transformers in Section V-F, and Ensemble Approaches in Section V-G.

A. CONVOLUTIONAL NEURAL NETWORK (CNN)
A few deep learning models have been introduced to handle ambiguous detection issues; CNNs and RNNs are the most interesting among them [77]. Researchers are trying to boost the performance of fake news detectors with CNNs by exploiting their strength in feature extraction and classification [132]. CNNs are also gaining popularity in NLP, where they are utilized for mapping the features of n-gram patterns. A CNN is similar to a multilayer perceptron (MLP) in that it is a multilayer feed-forward neural network [45]. A CNN consists of an input layer, an output layer, and a sequence of hidden layers. CNNs are mostly used for image recognition and classification. Neural networks with 100 or more hidden layers have been reported in recent studies. Forward-propagation and backward-propagation algorithms are utilized to train neural networks by updating the weights of each layer, using the gradient (derivative) of the cost function. When the sigmoid activation function is applied, the magnitude of the gradient decreases with each layer, which lengthens the training time; this is called the vanishing-gradient problem. A deeper CNN with direct connections between layers (as in densely connected networks) mitigates this problem, and compared to a normal CNN, such a deeper CNN is also less vulnerable to overfitting [67]. Kaliyar et al. [40] proposed FNDNet, a deep CNN designed to learn discriminatory features for fake news detection using multiple hidden layers; the model is less prone to overfitting but takes longer to train. The convolutional layer, pooling layer, and regularization layer are the layers most utilized in CNNs for fake news detection, and the input data are manipulated through pooling and convolution operations. Sections V-A1, V-A2, and V-A3 describe the popular layers used in CNNs.

1) CONVOLUTION LAYER
CNNs work very well in image classification and computer vision because of the convolution operation, and their ability to extract features from inputs for better representation makes them very efficient; these properties also make CNNs powerful for sequence processing [131]. Fernández-Reyes and Shinde [77] proposed a CNN architecture called StackedCNN, which uses 2-dimensional rather than 1-dimensional convolution layers. Fusing pre-trained word embeddings with 2-dimensional convolutional layers has proven helpful for finding patterns in text data, but the performance of StackedCNN is poor compared with state-of-the-art CNNs. Another study, by Li et al. [132], adopted a novel approach with a multilevel CNN (MCNN) and a sensitive words' weight calculating method (TFW). MCNN-TFW successfully captured semantic information from article text content and therefore outperforms the compared methods, including CNN, although their work did not consider latent features. Alsaeedi and Al-Sarem [45] added more convolution layers, which affected the proposed model's performance; according to the results, performance dropped by about 0.014.

2) POOLING LAYER
Max pooling is a pooling operation that selects the greatest element from each patch of each feature map covered by the filter. A pooling layer is a new layer attached to the convolutional layer. Its purpose is to progressively diminish the spatial size of the representation in order to decrease the number of parameters and the computation inside the network. The pooling layer operates independently on each feature map. Max pooling and average pooling are the functions most commonly used in fake news detection. Alsaeedi and Al-Sarem [45] adjusted the hyperparameter settings in a CNN and found the parameter settings that improved the model's performance: the recommended CNN model performs best when the number of units in the dense layer is set to 100, the number of filters to 100, and the window size to 5. The GlobalMaxPooling1D method achieved the highest scores, showing that it works well for fake news detection compared with other pooling methods [45].
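The convolution and pooling operations above can be sketched for a single one-dimensional filter sliding over a toy feature sequence (e.g., one embedding dimension per word); all values are invented for illustration, and a real text CNN learns many such filters jointly.

```python
def conv1d(sequence, kernel):
    # Slide the kernel over every window of the sequence (stride 1, no padding).
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def max_pool(feature_map):
    # Global max pooling: keep only the strongest filter response.
    return max(feature_map)

sequence = [0.1, 0.9, 0.4, 0.7, 0.2]   # toy per-word feature values
kernel = [0.5, 1.0, 0.5]               # one filter with window size 3
fmap = conv1d(sequence, kernel)
print(fmap, max_pool(fmap))
```

A window size of 3, as in the best setting reported above, means each filter response summarizes a trigram-sized span of the input.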

3) REGULARIZATION LAYER
The most crucial problem in classification is to reduce both the training and test errors of the classifier. Another common issue is overfitting (the gap between training and test errors is huge). Overfitting makes the model hard to generalize because it becomes overly tailored to the training set. Regularization is a solution to the overfitting problem: it is applied to the model to lessen overfitting and decrease the generalization error, but not the training error [45]. The dropout regularization method is the one mostly used for fake news detection [133]; other methods such as early stopping and weight penalties were not used in previous studies on fake news detection. Dropout avoids overfitting by randomly deactivating neurons during training. Eventually, all weights are averaged so that no single neuron's weight becomes too high.
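A minimal sketch of the (inverted) dropout idea described above, with an illustrative rate:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and scale the
    survivors by 1/(1 - rate) so the expected activation is unchanged.
    Applied only at training time; at test time the layer is an identity."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones(10)
dropped = dropout(acts, rate=0.5, rng=rng)
# Each surviving unit is scaled to 2.0; dropped units are 0.
```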

B. RECURRENT NEURAL NETWORK (RNN)
The RNN is a type of neural network in which nodes are sequentially connected to construct a directed graph. The output from the earlier step serves as the input to the current step. RNNs are effective in time- and sequence-based predictions, although they are less effective at feature extraction than CNNs. RNNs are suitable for studying sequential texts and expressions. However, they cannot process very long sequences when tanh or ReLU is used as the activation function.
The back-propagation algorithm is utilized to train the RNN. While training neural networks, small steps must be taken repeatedly in the direction of the negative derivative of the error with respect to the network weights in order to minimize the error function. The size of the gradients becomes smaller with each consecutive layer, so the RNN suffers from a vanishing gradient problem in the bottom layers of the network. The vanishing gradient problem can be mitigated in three ways: (1) using the rectified linear unit (ReLU) activation function, (2) using the RMSProp optimization algorithm, and (3) using a different network architecture such as long short-term memory networks (LSTM) or gated recurrent units (GRU). Hence, previous studies focused on LSTM and GRU rather than the vanilla RNN [80], [96], [134]. Bugueño et al. [80] proposed a model based on RNN for propagation tree classification, using the RNN for sequence analysis. The number of epochs was set to 200, which is relatively high in comparison to their number of training examples. To predict fake news articles, authors have proposed distinctive RNN models, specifically LSTM, GRU, tanh-RNN, unidirectional LSTM-RNN, and vanilla RNN. RNNs, and LSTM in particular, are especially successful in processing sequential data (human language) and extracting significant features from diverse data sources. Further, in Sections V-B1 and V-B2, we discuss LSTM and GRU.
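The vanishing gradient behavior can be illustrated with a toy calculation: with tanh activations, each back-propagation step multiplies the gradient by a derivative bounded by 1, so the product shrinks toward zero over many time steps.

```python
import math

def tanh_derivative(x):
    # tanh'(x) = 1 - tanh(x)^2, always <= 1
    return 1.0 - math.tanh(x) ** 2

grad = 1.0
for step in range(50):
    # each unrolled time step scales the gradient by the derivative
    grad *= tanh_derivative(1.0)  # derivative at a typical pre-activation
print(grad)  # shrinks to a vanishingly small value after 50 steps
```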

1) LONG SHORT-TERM MEMORY (LSTM)
LSTM models are front-runners in NLP problems. LSTM is an artificial recurrent neural network architecture used in deep learning and a progressed variation of the RNN [41]. Plain RNNs are not capable of learning long-term dependencies because the error signal decays as it flows backward through time during back-propagation. LSTM, however, can keep ''short-term memories'' for long periods. An LSTM unit is made up of three gates (an input gate, an output gate, and a forget gate) and a cell; through a combination of the three gates, it calculates the hidden state. The cell can retain values over large time intervals; for this reason, a word at the beginning of the content can influence the output for words later in the sentence [67]. LSTM is a very effective solution for addressing the vanishing gradient issue. Bahad et al. [61] proposed an RNN model that suffers from the vanishing gradient issue; to tackle it, they implemented an LSTM-RNN, although the LSTM could not solve the vanishing gradient issue completely. The LSTM-RNN model had a higher precision compared to the initial state-of-the-art CNN. Asghar et al. [135] proposed a bidirectional LSTM (Bi-LSTM) with CNN for rumor detection. The model preserves the sequence information in both directions, and the Bi-LSTM layer is effective in remembering long-term dependencies. Even though the BiLSTM-CNN beat the other models, the suggested approach is computationally expensive.
A study by Ruchansky et al. [123] suggested a model called CSI, which comprises three modules: Capture, Score, and Integrate. The capture module extracts features from the article, and the score module extracts features from the user. Then, by integrating article- and user-based features, the CSI model performs the prediction for fake news detection. The CSI model has fewer parameters than other RNN-based models. Another study by Sahoo and Gupta [136] proposed an approach with both user profile and news content features for detecting false news on Facebook. The authors used LSTM to identify fake news, and a set of new features is extracted by crawling Facebook and using the Facebook API. However, the suggested model requires more time to train and test. Liao et al. [137] proposed a novel model called fake news detection multi-task learning (FDML). The model explores the influence of topic labels for fake news while also using contextual news information to improve detection performance on short false news. The FDML model, in particular, is made up of representation learning and multi-task learning components that train the false news detection task and the news topic categorization task at the same time. However, the performance of the model decreases without the author's information.

2) GATED RECURRENT UNIT (GRU)
In terms of structure, GRU is simpler and more efficient than LSTM because it has only two gates, namely reset and update. The GRU manages the information flow in the same manner as the LSTM unit does, but without the use of a memory unit: it exposes the entire hidden state without any control. In learning long-term dependencies, GRU performs comparably to, and sometimes better than, LSTM, which makes it a promising candidate for NLP applications [41]. GRU is a newer algorithm with performance comparable to that of LSTM but greater computational efficiency, and it has only lately been used to identify false news. Li et al. [134] used a deep bidirectional GRU neural network (two-layer bidirectional GRU) as a rumor detection model; the model suffers from slow convergence. S and Chitturi [41] showed that it is difficult to determine whether one of the gated RNNs (LSTM, GRU) is more successful, and they are usually chosen on the basis of the available computing resources. Girgis et al. [96] experimented with CNN, LSTM, vanilla RNN, and GRU. The vanilla RNN suffers from the vanishing gradient problem, which GRU solves. Though GRU gave the best outcome in their studies, it takes more training time. A bidirectional GRU was utilized by Singhania et al. [87] for word-by-word annotation; with the preceding and subsequent words, it captures a word's meaning within the sentence. A study by Shu et al. [100] proposed a sentence-comment co-attention subnetwork model named dEFEND (Explainable fake news detection) utilizing news content and user comments for fake news detection. The authors processed the textual information with a bidirectional GRU (Bi-GRU) to achieve better performance. However, the model has low learning efficiency.
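A minimal numpy sketch of a single GRU step, showing the two gates and the absence of a separate memory cell (random toy weights, biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Wr, Wh):
    """One GRU step with only two gates (update z, reset r) and no separate
    memory cell -- the structural simplification relative to LSTM.
    Weights act on the concatenated [h_prev, x] vector; biases omitted."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)  # update gate: how much new state to admit
    r = sigmoid(Wr @ hx)  # reset gate: how much old state feeds the candidate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # interpolate old and new state

rng = np.random.default_rng(1)
hidden, inputs = 4, 3
Wz, Wr, Wh = (rng.standard_normal((hidden, hidden + inputs)) for _ in range(3))
h = gru_cell(rng.standard_normal(inputs), np.zeros(hidden), Wz, Wr, Wh)
```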

C. GRAPH NEURAL NETWORK (GNN)
A Graph Neural Network is a form of neural network that operates directly on the graph structure. Node classification is a common application of GNNs: essentially, every node in the network has a label, and the network predicts the labels of nodes without using the ground truth. The network extends recursive neural networks by processing a broader class of graphs, including cyclic, directed, and undirected graphs, and it can handle node-focused applications without requiring any pre-processing steps [138]. GNNs capture global structural features from graphs or trees better than the deep learning models discussed above [139]. However, GNNs are prone to noise in the datasets: adding a small amount of noise to the graph via node perturbation or edge deletion and addition has an adverse effect on the GNN output. The graph convolutional network (GCN) is considered one of the basic GNN variants.
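A minimal sketch of one GCN layer on a toy three-node graph, using the standard normalized-adjacency propagation rule (the graph, features, and weights here are illustrative only):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: H = ReLU(D^-1/2 (A + I) D^-1/2 X W).
    Each node averages its neighbours' (and its own) features, then applies
    a shared linear map and a non-linearity."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

# A 3-node path graph (0-1-2) with 2-dimensional node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)
H = gcn_layer(A, X, W)  # each row mixes a node's features with its neighbours'
```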
A study by Huang et al. [140] claimed to be the first to experiment with a rich structure of user behavior for rumor detection. The user encoder uses graph convolutional networks (GCN) to learn a representation of the user from a graph created from user behavioral information. The authors used two recursive neural networks based on tree structure: a bottom-up RvNN encoder and a top-down RvNN encoder. The tree structure is shown in Figure 8. The proposed model performed worse for the non-rumor class because user behavior information introduces some interference in non-rumor detection.
Another study by Bian et al. [139] proposed top-down GCN and bottom-up GCN using a novel method DropEdge [141] for reducing over-fitting of GCNs. In addition, a root feature enhancement operation is utilized to improve the performance of rumor detection. Although it performed well on three datasets (Weibo, Twitter15, Twitter16), the outliers in the dataset affected the models' performance.
On the other hand, GCNs incur a significant memory footprint in storing the complete adjacency matrix. Furthermore, GCNs are transductive, which implies that inferred nodes must be present at training time, and they do not guarantee generalizable representations [142]. Wu et al. [143] proposed a representation learning algorithm with a gated graph neural network named PGNN (propagation graph neural network). The suggested technique can incorporate structural and textual features into high-level representations by propagating information among neighbor nodes throughout the propagation network. To obtain considerable performance improvements, they also added an attention mechanism. The propagation graph is built using the who-replies-to-whom structure, but the follower-followee and forward relationships are omitted. Zhang et al. [144] presented a simplified aggregation graph neural network (SAGNN) based on efficient aggregation layers. Experiments on publicly accessible Twitter datasets show that the proposed network outperforms state-of-the-art graph convolutional networks while considerably lowering computational costs.

D. GENERATIVE ADVERSARIAL NETWORK (GAN)
Generative Adversarial Networks (GANs) are deep learning-based generative models. The GAN architecture consists of two sub-models: a generator model for creating new instances and a discriminator model for determining whether a given example is genuine or was produced by the generator. Existing adversarial networks are often employed to create images that may be matched to observed samples using a minimax game framework [44]. The generator produces new images, from features learned from the training data, that resemble the original images, while the discriminator predicts whether a generated image is fake or real. GANs are extremely successful in generative modeling and are used to train discriminators in a semi-supervised context, helping to eliminate human participation in data labeling. Furthermore, GANs are useful when the data have imbalanced classes or underrepresented samples. However, GANs produce synthetic data only when the data are continuous, so standard GANs are inapplicable to NLP data, which consist of discrete values such as words, letters, or bytes [145]. Novel techniques are therefore required to train GANs on text data.
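The minimax game can be sketched with the standard GAN losses evaluated on toy discriminator scores (the numbers are illustrative, not from any cited experiment):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimises: label real samples
    as 1 and generated samples as 0."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: push D's score on fakes toward 1."""
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])   # D's scores on real samples
d_fake = np.array([0.1, 0.2])   # D's scores on generated samples
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

Both losses rely on gradients flowing from the discriminator back through the generated sample, which is exactly what breaks for discrete tokens and motivates RL-based workarounds such as SeqGAN.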
A study by Long [145] proposed sequence GAN (SeqGAN), which is a GAN architecture that overcomes the problem of gradient descent in GANs for discrete outputs by employing reinforcement learning (RL) based approach and Monte Carlo search. The authors provide actual news content to the GAN. Then a classifier based on Google's BERT model was trained to identify the real samples from the samples generated by the GAN. The architecture of SeqGAN is provided in Figure 9.
The principle of adversarial learning originated with generative adversarial networks. The adversarial learning concept has produced outstanding results in a wide range of topics, including information retrieval [146], text classification [147], and network embedding [148]. A unique problem in detecting fake news is the recognition of false news about newly emerging events on social media. To solve this problem, Wang et al. [44] suggested an end-to-end architecture called the event adversarial neural network (EANN). This architecture extracts event-invariant characteristics and, therefore, aids in the identification of false news about newly incoming events. It is made up of three major components: a multimodal feature extractor, a fake news detector, and an event discriminator. Another study by Le et al. [149] introduced Malcom, which generates malicious comments that fooled five popular fake news detectors (CSI, dEFEND, etc.) into classifying fake news as real with attack success rates of 94% and 90%. The authors showed that existing methods are not resilient against potential attacks. Though the model performed well, it was not evaluated against defense mechanisms such as adversarial learning.

E. ATTENTION MECHANISM BASED
The attention-related approach is another notable advancement. In deep neural networks, the attention mechanism is an effort to implement the human behavior of selectively focusing on a few important items while ignoring others. Attention acts as a bridge connecting the encoder and decoder, providing the decoder with information from every encoder hidden state. Using this framework, the model selectively concentrates on the valuable components of the input and can thus discover the associations among them, which allows it to deal with long input sentences more effectively. Unlike RNNs or CNNs, attention mechanisms maintain word dependencies in a sentence regardless of the distance between the words. The primary downside of the attention mechanism is that it adds additional weight parameters to the model, which can lengthen the training time, especially if the model's input data are long sequences.
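A minimal numpy sketch of scaled dot-product attention, one common formulation of the mechanism described above; the shapes and random values are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V: each query position
    forms a weighted average of the values, with weights given by query-key
    similarity, regardless of how far apart the positions are."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))   # 2 query positions
K = rng.standard_normal((5, 4))   # 5 key/value positions
V = rng.standard_normal((5, 4))
out, weights = scaled_dot_attention(Q, K, V)
# each row of `weights` sums to 1: a distribution over the 5 positions
```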
A study by Long [150] proposed an attention-based LSTM with speaker profile features, and the experimental findings suggest that employing speaker profiles can help enhance fake news identification. Recently, attention techniques have been used to efficiently extract information related to a mini query (the article headline) from a long text (the news content) [47], [87]. A study by Singhania et al. [87] built an automated detector using a three-level hierarchical attention network (3HAN). The three levels in 3HAN correspond to words, sentences, and the headline. Because of its three levels of attention, 3HAN assigns different weights to different sections of an article. In contrast to other deep learning models, 3HAN yields interpretable results. While 3HAN only uses textual information, a study by Jin et al. [47] used image features, along with social context and text features, with attention on an RNN (att-RNN). Another study used RNNs with a soft-attention mechanism to filter out unique linguistic features [151]. However, this method is based on distinct domain and community features without any external evidence and thus provides a restricted context for credibility analysis. To overcome the shortcomings of previous works, Aloshban [152] proposed automatic fake news classification through self-attention (ACT). Their principle is inspired by the fact that claim texts are fairly short and hence cannot be classified efficiently on their own, so the suggested framework makes use of mutual interactions between a claim and many supporting responses. An LSTM neural network was applied to the article input. The output of the final step of the LSTM may not completely reflect the semantics of the article, while concatenating the vector representations of all words in the text would lead to a massive vector dimension; either way, the internal connections between the article's words could be ignored.
Therefore, employing the self-attention function on the LSTM model extracts key parts of the article through several feature vectors. Their strategy relies heavily on self-attention and an article representation matrix. The graph-aware co-attention network (GCAN) is an innovative approach for detecting fake news [153]. The authors predict whether a source tweet is false based only on its short text content, the user retweet sequence, and user profiles. Given the chronology of its retweeters, GCAN can determine whether a short-text tweet is fraudulent. However, this model is not suitable for long text, as it is difficult to find the relationship between a long tweet and its retweet propagation.

F. BIDIRECTIONAL ENCODER REPRESENTATIONS FOR TRANSFORMERS (BERT)
BERT is a deep learning model that has shown cutting-edge results across a wide variety of natural language processing applications. BERT incorporates pre-trained language representations developed by Google; it is a sophisticated pre-trained word-embedding model built on a transformer-encoder architecture [89]. The BERT method is distinctive in its capacity to identify and capture contextual meaning in a sentence or text [90]. The main restriction of conventional language models is that they are unidirectional, which limits the architectures that can be utilized during pre-training. The BERT model eliminates the unidirectional limitation by using a masked language model (MLM). In addition to the masked language model, BERT employs the next sentence prediction (NSP) task to jointly pre-train text-pair representations. BERT training consists of two stages: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data using a variety of pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using labeled data from the downstream tasks. The architecture of the BERT model is shown in Figure 10.
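A simplified sketch of how the MLM input is constructed (real BERT also leaves some selected tokens unchanged or replaces them with random tokens; the sentence and masking rate here are illustrative):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masked language modelling input: replace a fraction of
    tokens with [MASK]; the model is trained to predict the originals from
    both left and right context."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)        # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)       # no prediction needed here
    return masked, labels

sentence = "fake news spreads faster than true stories".split()
masked, labels = mask_tokens(sentence, rng=random.Random(1))
print(masked)  # some tokens replaced by [MASK]
```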
The data utilized in the BERT model are generic data gathered from Wikipedia and the Book Corpus. While these data contain a wide range of information, specific information on individual domains is still lacking. To overcome this problem, a study by Jwa et al. [75] incorporated news data in the pre-training phase to boost fake news identification. Comparing with the state-of-the-art model stackLSTM, Ding et al. [154] discovered that including mental features, such as a speaker's credit history at the language level, can considerably improve BERT model performance. The history feature helps construct the relationship between the event and the person in reality. However, these studies did not consider any pre-processing methods.
Zhang et al. [91] presented a BERT-based domain-adaptation neural network for multimodal false news detection (BDANN). BDANN is made up of three major components: a multimodal feature extractor, a domain classifier, and a false news detector. In the multimodal feature extractor, the pre-trained BERT model was used to extract text features, whereas the pre-trained VGG-19 model was used to extract image features. The extracted features are then concatenated and sent to the detector to differentiate between fake and real news. However, the existence of noisy images in the Weibo dataset has affected the BDANN results. Kaliyar et al. [92] proposed a BERT-based deep convolutional approach (fakeBERT) for fake news detection. fakeBERT combines BERT with different parallel blocks of a one-dimensional deep convolutional neural network (1d-CNN) with different kernel sizes and filters. The different filters can extract useful information from the training dataset, and the combination of BERT with the 1d-CNN can deal with both large-scale structured and unstructured text. Therefore, the combination is beneficial in dealing with ambiguity.

G. ENSEMBLE APPROACH
Ensemble approaches are strategies that generate several models and combine them to achieve better results. Ensemble models typically yield more precise solutions than a single model does: an ensemble reduces the spread or dispersion of predictions and improves model performance. Ensembling can be applied to supervised and unsupervised learning tasks [86]. Many researchers have used an ensemble approach to boost their performance [42], [133]. Agarwal and Dixit [63] combined two datasets, namely Liar and Kaggle, to evaluate the performance of LSTM and achieved an accuracy of 97%. They also used various models such as CNN, LSTM, SVM, naive Bayes (NB), and k-nearest neighbour (KNN) to build an ensemble model. The authors reported the average accuracy of the algorithms they used but did not report the accuracy of their ensemble model, which is a limitation of their work.
The CNN-LSTM ensemble approach has often been used in previous DL-based studies. Kaliyar [67] used an ensemble of CNN and LSTM; the accuracy was slightly lower than that of the state-of-the-art CNN model, but the precision and recall were effectively improved. Asghar et al. [135] increased the efficiency of their model by using a Bi-LSTM, which retains knowledge from both former and upcoming contexts before passing its input to the CNN model. Even though CNNs and RNNs typically require huge datasets to function successfully, Ajao et al. [133] trained an LSTM-CNN with a smaller dataset. The above-mentioned works considered only text-based features for fake news classification, whereas the addition of new features may produce a more significant result. While most studies used CNN with LSTM, a study by Amine et al. [131] merged two convolutional neural networks to integrate metadata with text. They illustrate that integrating metadata with text results in substantial improvements in fine-grained fake news detection. Furthermore, when tested on real-world datasets, this approach shows improvements over the text-only deep learning model. Moving further, Kumar et al. [86] employed an attention layer, which helps the CNN + LSTM model learn to pay attention to particular regions of the input sequence rather than the full sequence. Utilizing the attention mechanism with CNN + LSTM was reported to be efficient by a small margin. A result analysis of DL-based studies is presented in Table 7.
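A minimal sketch of a hard-voting ensemble over toy base-model predictions (the model names and labels are illustrative, not results from any cited study):

```python
def majority_vote(predictions):
    """Hard-voting ensemble: each base model casts one vote per article and
    the majority label wins (ties broken toward 'fake' here, arbitrarily)."""
    final = []
    for votes in zip(*predictions):
        fake = sum(v == "fake" for v in votes)
        final.append("fake" if fake * 2 >= len(votes) else "real")
    return final

# Hypothetical per-article predictions from three base models.
cnn_preds  = ["fake", "real", "fake"]
lstm_preds = ["fake", "fake", "real"]
svm_preds  = ["real", "real", "fake"]
print(majority_vote([cnn_preds, lstm_preds, svm_preds]))
```

Soft voting (averaging predicted probabilities) and stacking (training a meta-model on base-model outputs) are the usual refinements of this scheme.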

VI. EVALUATION METRICS
A key step in a predictive modeling pipeline is evaluating the output of a machine learning model. Although a model may achieve a high classification result once constructed, it must be determined whether it can address the specific problem in different circumstances. Classification accuracy alone is usually insufficient to make this judgment; other assessment metrics are necessary for a proper evaluation. It is easy to create a model, but it is more challenging to create a promising one, since a promising method must pass the evaluation of the assessment metrics. Diverse evaluation metrics are used to evaluate a model's efficiency, and they are an essential device for arranging and organizing an evaluation. The confusion matrix gives an overview of model performance on the testing dataset against the known true values: it summarizes the counts of true positives, true negatives, false positives, and false negatives. To test their models, researchers considered distinctive sorts of metrics such as accuracy (A), precision (P), and recall (R) [40], [54], [58]. The selection of metrics relies entirely on the model form and its implementation strategy. We describe the evaluation metrics that were widely used in previous studies:

A. ACCURACY
The accuracy score, also known as the classification accuracy rating, is the proportion of correct predictions among the total predictions made by the model. The accuracy (A) can be computed using Equation (1):

A = (TP + TN) / (TP + TN + FP + FN)    (1)
B. PRECISION
Precision (P) is defined as the number of true positive results divided by the total number of positive predictions, including incorrectly recognized ones. The precision can be computed using Equation (2):

P = TP / (TP + FP)    (2)
C. RECALL
Recall (R) is the number of true positive results divided by the total number of samples that should have been identified as positive. The recall can be computed using Equation (3):

R = TP / (TP + FN)    (3)
D. F1-SCORE
The F1-score (F1) reflects the model's balance of precision and recall for each class and is typically used when the dataset is not balanced. The F1-score is often used as an assessment metric in fake news detection [41], [157], [158]. The F1-score can be computed using Equation (4):

F1 = 2 × (P × R) / (P + R)    (4)

E. AREA UNDER THE ROC CURVE (AUC)
The AUC measures the whole two-dimensional area under the entire ROC curve, which plots the true positive rate against the false positive rate (FPR). The FPR can be defined as in Equation (5):

FPR = FP / (FP + TN)    (5)
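These metrics can all be computed directly from confusion-matrix counts; a minimal sketch with toy counts:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (1)
    precision = tp / (tp + fp)                          # Eq. (2)
    recall = tp / (tp + fn)                             # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    fpr = fp / (fp + tn)                                # Eq. (5)
    return accuracy, precision, recall, f1, fpr

# Toy confusion-matrix counts for a binary fake/real classifier.
a, p, r, f1, fpr = metrics(tp=80, tn=90, fp=10, fn=20)
print(round(a, 3), round(p, 3), round(r, 3), round(f1, 3), round(fpr, 3))
```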

VII. CHALLENGES AND RESEARCH DIRECTION
Although numerous studies have been conducted on the identification of fake news, there is always space for further advancement and investigation. In the context of fake news recognition, we highlight challenges and several promising exploration areas for future studies. Although DL-based methods provide higher accuracy than the other methods, there is scope to make them more robust.
• The feature and classifier selection greatly influences the efficiency of the model. Previous studies did not place a high priority on the selection of features and classifiers. Researchers should focus on determining which classifier is most suitable for particular features. The long textual features require the use of sequence models (RNNs), but limited research works have taken this into account. We believe that studies that concentrate on the selection of features and classifiers might potentially improve performance.
• The feature engineering concept is not common in deep learning-based studies. News content and headline features are the widely used features in fake news detection, but several other features such as user behavior [154], user profile, and social network behavior need to be explored. Political or religious bias in profile features and lexical, syntactic, and statistical-based features can increase the detection rate. A fusion of deeply hidden text features with other statistical features may result in a better outcome.
• Propagation-based studies are scarce in this domain [117]. Network-based patterns of news propagation are a piece of information that has not been comprehensively utilized for fake news detection [159]. Thus, we suggest considering news propagation for fake news identification. Meta-data and additional information can increase the robustness and reduce the noise of a single textual claim, but they must be handled with caution.
• Studies focused only on text data for fake news detection, whereas fake news is generated in sophisticated ways, with text or images that have been purposefully altered [95]. Only a few studies have used image features [109], [110]. Thus, we recommend the use of visual data (videos and images). An examination with video and image features will be an investigation region to build a stronger and more robust system.
• Studies that use a fusion of features are scarce in this domain [160]. Combining information from multiple sources may be extremely beneficial in detecting whether Internet articles are fake [95]. We suggest utilizing multi-model-based approaches with later pretrained word embeddings. Many other hidden features may have a great impact on fake news detection. Hence we encourage researchers to investigate hidden features.
• Fake news detection models that learn from newly emerging web articles in real-time could enhance detection results. Another promising future work is the use of a transfer-learning approach for training a neural network with online data streams.
• More datasets containing a larger number of fake news articles should be released, since the lack of data is the major problem in fake news classification. We expect that more training data will improve model performance. Datasets focused on news content are publicly available; on the other hand, datasets based on different textual features are limited, so research utilizing additional textual features is scarce.
• Instead of a simple classifier, using an ensemble method produces better results [49]. Constructing an ensemble model with DL and ML algorithms, in which an LSTM identifies the original article while auxiliary features pass through a second model, can yield better results [41]. A simpler GRU model performs better than an LSTM [80]. Therefore, we recommend combining GRU and CNNs to achieve the leading result.
• Many researchers have achieved high accuracy by using CNN, LSTM, and ensemble models [42], [64]. SeqGAN and Deep Belief Network (DBN) were not explored in this domain. We encourage researchers to experiment with these models.
• Transformers have replaced RNN models such as LSTM as the model of choice for NLP tasks. BERT has been used in the identification of fake news, but Generative Pre-trained Transformer (GPT) has not been used in this domain. We suggest using GPT by fine-tuning fake news detection tasks.
• Existing algorithms make critical decisions without providing precise information about the reasoning that results in specific decisions, predictions, recommendations, or actions [161]. Explainable Artificial Intelligence (XAI) is a study field that tries to make the outcomes of AI systems more understandable to humans [162]. XAI can be a valuable approach to start making progress in this area.

VIII. CONCLUSION
Fake news is escalating as social media is growing.
Researchers are also trying their best to find solutions to keep society safe from fake news. This survey covers the overall analysis of fake news classification by discussing major studies. A thorough understanding of recent approaches in fake news detection is essential because advanced frameworks are the front-runners in this domain. Thus, we analyzed fake news identification methods based on NLP and advanced DL strategies. We presented a taxonomy of fake news detection approaches, explored different NLP techniques and DL architectures, and described their strengths and shortcomings. We also explored diverse evaluation metrics, gave a short description of the experimental findings of previous studies, and briefly outlined possible directions for future research in this field. Fake news identification will remain an active research field for some time with the emergence of novel deep learning network architectures, and there are fewer chances of inaccurate results when using deep learning-based models. We strongly believe that this review will assist researchers in fake news detection in gaining a better, more concise perspective of existing problems, solutions, and future directions.