Fraudulent News Headline Detection with Attention Mechanism

E-mail systems and online social media platforms are ideal places for news dissemination, but a serious problem is the spread of fraudulent news headlines. Previously, fraudulent news headlines were detected mainly through laborious manual review. When the total number of news headlines reaches 1.48 million, manual review becomes practically infeasible. Attention mechanisms are powerful tools for processing news headline text data. In this paper, we propose models based on LSTM and an attention layer, which fit the context of news headlines efficiently and can detect fraudulent news headlines quickly and accurately. Based on the multi-head attention mechanism, which eschews recurrent units and reduces sequential computation, we build a Mini-Transformer Deep Learning model to further improve the classification performance.


Introduction
With the rapid development of the Internet, Internet security is suffering from various potential threats. The rise of the Advanced Persistent Threat (APT) has caused traditional network defense systems to face increasingly severe challenges. According to statistics, social engineering is the main technique that attackers use to launch APT attacks, so it is of practical significance to research defenses against social engineering attacks. Cutting off the chains of attacks, detecting attacks, and isolating attackers is the fastest and most effective method of defending against social engineering attacks.
Currently, in major cases of social engineering attacks, the essential step for attackers is to distribute fraudulent news headlines on e-mail systems and online social media platforms, such as Instant Messaging services (e.g., QQ, WeChat, WhatsApp, Facebook Messenger, and Line) or microblogs (e.g., Twitter and Weibo). Some fraudulent news headlines carry malicious links preset by attackers. Many curious users who see these headlines want to learn more about the news and click directly on the malicious links, which leads to serious consequences, including personal privacy theft, account and password theft, and even huge asset loss.
According to the Symantec Internet Security Threat Report [1] (ISTR Volume 23), 71.4% of targeted social engineering attacks on companies in 2017 involved the use of spear-phishing e-mails. Therefore, the e-mail system remains the main vector through which social engineering attacks reach companies via their employees. Overall, it is of great importance to analyze and detect fraudulent news headlines, which has a profound impact on Internet security and the defense system against social engineering attacks.
In recent years, Deep Learning models, such as Long Short-Term Memory (LSTM) [2], attention layer [3] and Transformer [4], have demonstrated outstanding advantages in solving the problems of Natural Language Processing (NLP). In this paper, for the classification of news headline text data, we add one extra attention layer to the LSTM model and achieve a slight increase in accuracy. In addition, based on multi-head attention, we build Mini-Transformer without complex recurrent or convolutional neural networks to improve the classification performance (i.e., accuracy, precision, recall, and F1 score) dramatically.

Related Work
Although a considerable amount of literature has been published on Internet social engineering, the emerging security issues with e-mail systems and online social media platforms are still not addressed adequately. Moreover, since the operational principle of social engineering attacks has not been clearly revealed, it is difficult to construct an effective defense system.
Castillo et al. [5] raised the issue of fake information detection on Twitter. To examine newsworthy topics on Twitter, they evaluated various classification algorithms and analyzed four groups of features (message, user, topic, and propagation). An automatic method was used to classify the credibility of Twitter messages and achieved high precision and recall.
Ma et al. [6] utilised Recurrent Neural Networks (RNN), including LSTM and Gated Recurrent Unit (GRU), to process massive text data. They proposed a novel method that learns continuous representations of microblog events for identifying rumors on Twitter and Weibo more quickly and accurately.
Guo et al. [7] investigated the relevant characteristics of social media and utilised the attention mechanism to analyze massive news and messages on microblogs. They designed an efficient classification scheme, which can detect rumors more accurately.
Song et al. [8] combined LSTM with the attention mechanism and proposed a novel method of sentiment lexicon embedding for aspect-level sentiment analysis, which better represents the semantic relationships of sentiment words and thereby improves the sentiment classification performance.
Vaswani et al. [4] proposed a new network architecture (Transformer) based solely on attention mechanism, which is not only superior in machine translation quality but also more parallelizable so as to require significantly less time to train.
Our work focuses on the news headlines spread on e-mail systems and online social media platforms. We develop a set of models to detect massive fraudulent news headlines using LSTM and attention mechanism. To further improve the classification performance, we build Mini-Transformer, which consists of multi-head attention layers and fully connected dense layers rather than recurrent unit layers (i.e., LSTM layer and GRU layer).

Methodology
In this section, we first briefly revisit LSTM [2]. Then, we present the formulation of the attention layer proposed by Bahdanau et al. [3]. Finally, we show how we use the multi-head attention mechanism to build Mini-Transformer.

Long Short-Term Memory (LSTM) Networks.
LSTM is able to process variable-length input sequences by recursive operation [2]. With the ability to maintain hidden states and fit the variations of contextual information across relevant time steps, LSTM is well-suited for classifying news headline text data.
Unlike the traditional vanilla RNN unit, whose hidden state is overwritten in each time step, the LSTM unit maintains a long memory cell state $c_t$ in time step $t$. Given an input sequence $X = (x_1, x_2, x_3, \ldots, x_{len})$ with length $len$, where $x_t$ ($1 \le t \le len$) are real number vectors with dimension $d_x$, the hidden state sequence $(h_1, h_2, h_3, \ldots, h_{len})$ consists of real number vectors $h_t$ with dimension $d_h$, and the long memory cell state sequence $(c_1, c_2, c_3, \ldots, c_{len})$ consists of real number vectors $c_t$ that also have dimension $d_h$. From $t = 1$ to $len$, the algorithm iterates as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t]), \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t]), \\
\tilde{c}_t &= \tanh(W_c \cdot [h_{t-1}, x_t]), \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t, \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t]), \\
h_t &= o_t * \tanh(c_t),
\end{aligned}
$$

where $W_f$, $W_i$, $W_c$, and $W_o$ are weight matrices for the forget gate, input gate, long memory cell, and output gate, respectively, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input. The operator "$\cdot$" denotes the dot-product between a matrix and a vector. The operator "$*$" denotes the element-wise multiplication (Hadamard product) between two vectors. $\sigma(\cdot)$ is the logistic sigmoid function, and $\tanh(\cdot)$ is the hyperbolic tangent function.
In the LSTM unit, the forget gate $f_t$ controls how much of the existing memory $c_{t-1}$ is removed from $c_t$, the input gate $i_t$ controls how much of the new memory $\tilde{c}_t$ is added to $c_t$, and the output gate $o_t$ determines the amount of output memory. By removing part of the existing memory $c_{t-1}$ and adding part of the new memory $\tilde{c}_t$, the long-term memory cell $c_t$ is updated. The LSTM unit is illustrated in Figure 1.
After all iterative steps from $t = 1$ to $len$, the last hidden state vector $h_{len}$ is used to generate the real number output $y$ via a fully connected dense layer whose activation is the logistic sigmoid function.
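The recurrence described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the standard gate form in which each weight matrix is applied to the concatenation of $h_{t-1}$ and $x_t$, zero initial states, and no bias terms (the text lists only weight matrices). The output-layer weights `w_out` and the random inputs are hypothetical; the dimensions follow the experimental settings ($len = 15$, $d_x = 25$, $d_h = 16$).

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(X, W_f, W_i, W_c, W_o):
    """Run one LSTM layer over X with shape (len, d_x).

    Each weight matrix has shape (d_h, d_h + d_x) and acts on the
    concatenation [h_{t-1}, x_t]. Returns all hidden states, shape (len, d_h).
    """
    d_h = W_f.shape[0]
    h = np.zeros(d_h)  # h_0, assumed to be zero
    c = np.zeros(d_h)  # c_0, assumed to be zero
    states = []
    for x_t in X:
        z = np.concatenate([h, x_t])   # [h_{t-1}, x_t]
        f = sigmoid(W_f @ z)           # forget gate
        i = sigmoid(W_i @ z)           # input gate
        c_new = np.tanh(W_c @ z)       # candidate (new) memory
        c = f * c + i * c_new          # updated long memory cell state
        o = sigmoid(W_o @ z)           # output gate
        h = o * np.tanh(c)             # hidden state h_t
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_x, d_h = 15, 25, 16
X = rng.normal(size=(seq_len, d_x))  # stand-in for one embedded headline
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for _ in range(4))

H = lstm_forward(X, W_f, W_i, W_c, W_o)
w_out = rng.normal(scale=0.1, size=d_h)  # hypothetical output-layer weights
y = sigmoid(w_out @ H[-1])               # sigmoid dense layer on h_len
label = int(y > 0.5)                     # 1 = fraudulent, 0 = true
```

Returning every hidden state, rather than only the last one, keeps the same function usable as the input stage of an attention layer.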

Attention Layer.
In 2014, Bahdanau et al. [3] introduced the attention mechanism to the NLP field for the first time and completed modeling, transduction, and alignment procedure on the machine translation task at the same time.
The LSTM layer needs to return all hidden states $h_t$ ($1 \le t \le len$) as the input of the attention layer. In the attention layer, attention weight scores $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_{len}$ are computed with $v_\alpha$, $W_\alpha$, and the input sequence $(h_1, h_2, h_3, \ldots, h_{len})$. The scores $\alpha_i$ ($1 \le i \le len$) are real numbers reflecting the importance of each state $h_i$. As trainable parameters, $v_\alpha$ is a real number vector with dimension $d_{attn}$, and $W_\alpha$ is a real number matrix with shape $(d_{attn}, d_h)$. From $i = 1$ to $len$, the algorithm iterates as follows:

$$
e_i = v_\alpha^{T} \cdot \tanh(W_\alpha \cdot h_i),
$$

where $e_i$ is the unnormalized score of state $h_i$ and $v_\alpha^{T}$ is the transpose of $v_\alpha$.

Computational Intelligence and Neuroscience
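For normalization such that $\sum_{i=1}^{len} \alpha_i = 1$, the softmax function is applied to the unnormalized score $e_i$ of each state $h_i$. From $i = 1$ to $len$, the algorithm iterates as follows:

$$
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{len} \exp(e_j)},
$$

where $\exp(\cdot)$ is the exponential function.
We take a weighted sum of all states $h_i$ ($1 \le i \le len$) to compute an expected state $h_{sum}$ with dimension $d_h$, which plays a role similar to that of $h_{len}$. The formula reads as follows:

$$
h_{sum} = \sum_{i=1}^{len} \alpha_i h_i.
$$

The weighted sum state $h_{sum}$ is used to generate the real number output $y$ via a fully connected dense layer whose activation is the logistic sigmoid function.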
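The attention layer described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the unnormalized score is taken in the additive form $v_\alpha^{T} \tanh(W_\alpha h_i)$, and the function name `attention_pool` and the random inputs are hypothetical. The dimensions follow the experimental settings ($len = 15$, $d_h = 16$, $d_{attn} = 32$).

```python
import numpy as np

def attention_pool(H, v_a, W_a):
    """Pool hidden states H (len, d_h) into one vector of dimension d_h.

    v_a has shape (d_attn,), W_a has shape (d_attn, d_h).
    Returns (alpha, h_sum): the normalized weights and the weighted sum.
    """
    e = np.tanh(H @ W_a.T) @ v_a         # unnormalized scores, shape (len,)
    e = e - e.max()                      # shift for numerical stability (softmax is unchanged)
    alpha = np.exp(e) / np.exp(e).sum()  # softmax: weights sum to 1
    h_sum = alpha @ H                    # weighted sum of states, shape (d_h,)
    return alpha, h_sum

rng = np.random.default_rng(0)
seq_len, d_h, d_attn = 15, 16, 32
H = rng.normal(size=(seq_len, d_h))      # stand-in for LSTM hidden states
v_a = rng.normal(size=d_attn)
W_a = rng.normal(size=(d_attn, d_h))

alpha, h_sum = attention_pool(H, v_a, W_a)
```

Because the weights are positive and sum to 1, each component of `h_sum` lies between the smallest and largest value of that component across the hidden states.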

Multi-Head Attention.
In 2017, Vaswani et al. [4] introduced the multi-head attention mechanism, which consists of several attention heads running in parallel. Then, they built the Transformer without any recurrence or convolution to improve the machine translation quality. In addition, the Transformer is more parallelizable and therefore requires significantly less time to train.
In this paper, we propose a simplified Transformer, called Mini-Transformer, for the classification of news headline text data. Mini-Transformer is composed of multi-head attention layers and eschews recurrence or convolution.
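For single-head dot-product attention, given an input sequence $X = (x_1, x_2, x_3, \ldots, x_{len})$ with length $len$, where $x_t$ ($1 \le t \le len$) are real number vectors with dimension $d_x$, we generate $Q$ (Query), $K$ (Key), and $V$ (Value) with the trainable parameter matrices $W_q$, $W_k$, and $W_v$. They are all real number matrices with shape $(d_{head}, d_x)$, where $d_{head}$ denotes the dimension of an attention head. Treating $X$ as a matrix with shape $(len, d_x)$, the formulas are as follows:

$$
Q = X \cdot W_q^{T}, \quad K = X \cdot W_k^{T}, \quad V = X \cdot W_v^{T}.
$$

After generating $Q$, $K$, and $V$ with shape $(len, d_{head})$, we compute a single dot-product attention head, with the scaling factor of [4], as follows:

$$
\mathrm{head} = \mathrm{softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{d_{head}}}\right) \cdot V,
$$

where the softmax is applied to each row. The above single-head dot-product attention outputs a real number matrix with shape $(len, d_{head})$. For multi-head attention, we employ $n_{head}$ parallel attention heads. Due to the reduced dimension of each head ($d_{head}$), the total computational cost is about the same as that of single-head attention with full dimensionality, but multi-head attention is more parallelizable for GPU training. The formulas are as follows:

$$
\begin{aligned}
\mathrm{head}_j &= \mathrm{softmax}\left(\frac{Q_j \cdot K_j^{T}}{\sqrt{d_{head}}}\right) \cdot V_j, \quad 1 \le j \le n_{head}, \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{n_{head}}),
\end{aligned}
$$

where $Q_j$, $K_j$, and $V_j$ are generated with the parameter matrices of the $j$-th head. Finally, multi-head attention outputs a real number matrix with shape $(len, n_{head} \times d_{head})$, as depicted in Figure 2.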
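A minimal NumPy sketch of the multi-head attention described above is given below. It is an illustration under stated assumptions, not the authors' code: the attention scores are scaled by $\sqrt{d_{head}}$ as in [4], the heads are simply concatenated without a further output projection (which matches the stated output shape), and the helper names and random parameters are hypothetical. The dimensions follow the experimental settings ($len = 15$, $d_x = 25$, $n_{head} = 8$, $d_{head} = 64$).

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with a max shift for numerical stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """Single dot-product attention head.

    X has shape (len, d_x); each W has shape (d_head, d_x).
    Returns a matrix with shape (len, d_head).
    """
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T    # each (len, d_head)
    d_head = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(d_head))  # attention weights, (len, len)
    return A @ V

def multi_head_attention(X, heads):
    """Concatenate the outputs of all heads: shape (len, n_head * d_head)."""
    return np.concatenate([attention_head(X, *params) for params in heads], axis=-1)

rng = np.random.default_rng(0)
seq_len, d_x, n_head, d_head = 15, 25, 8, 64
X = rng.normal(size=(seq_len, d_x))  # stand-in for one embedded headline
heads = [
    tuple(rng.normal(scale=0.1, size=(d_head, d_x)) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(n_head)
]

out = multi_head_attention(X, heads)  # shape (15, 8 * 64) = (15, 512)
```

The heads are independent of one another, so the list comprehension could be replaced by a single batched matrix product on a GPU; the loop form is kept here for readability.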

Fraudulent News Headline Detection
Our work focuses on classifying massive news headline data into a fraudulent class (label 1) and a true class (label 0). There are three Deep Learning models, based on LSTM, LSTM with attention layer, and Mini-Transformer, respectively. The proposed scheme consists of the labeled data source, text data preprocessing, and the training, testing, and evaluation of the Deep Learning models.

Scheme Flow Chart.
The flow chart of our proposed scheme for fraudulent news headline detection is shown in Figure 3.

Labeled Dataset.
Three data sources are used in this paper. All of them are publicly available at Kaggle, the world's largest data science community [9].
A news headline is considered valid data if its length is greater than or equal to 7. For balanced sampling, there are a total of 1,481,814 news headlines, including 736,009 items with label 1 and 745,805 items with label 0.
The fraudulent news headline dataset is The Examiner - Spam Clickbait Catalog [10]. The original source is the pseudo news site The Examiner. At a certain point, the site was the 10th largest site on mobile and was attracting twenty million unique visitors per month. The Examiner no longer exists, but Kaggle keeps the last record. Our work focuses on the fraudulent news headlines from January 1, 2013, to December 31, 2015, a total of 736,009 fraudulent news headlines (with a class label of 1).

The true news headline datasets are A Million News Headlines [11] and the News Category Dataset [12], with a total of 745,805 true news headlines (with a class label of 0).
For A Million News Headlines, the original source is the Australian Broadcasting Corporation (ABC). It includes the entire corpus of articles published by the ABC news website. With a volume of two hundred articles per day and a good focus on international news, every event of significance has been captured. It contains a total of 577,264 true news headlines from February 19, 2003, to December 31, 2019.
For the News Category Dataset, the original source is HuffPost. Each news headline has a corresponding category (e.g., parenting, style and beauty, entertainment, wellness, and politics). It contains a total of 168,541 true news headlines from January 28, 2012, to May 26, 2018.

Text Data Preprocessing.
We preprocess the original labeled news headline text data, including deleting repeated news headlines, removing unnecessary English symbols (i.e., ( ) ' " , . ? : - ! #), removing redundant space characters, applying NLTK lemmatization [13], truncating the news headlines that are too long, padding the news headlines that are too short, and converting uppercase letters to lowercase.
Data with a time order are generally called sequences, and news headline text data are typical sequences. The news headlines are represented as a two-dimensional string array with shape $(n, len)$, where $n = 1{,}481{,}814$ is the total number of news headlines and $len$ is the maximum length of news headlines.
We calculate the frequency of each English word in the two-dimensional string array to identify high-frequency words and generate a high-frequency word dictionary. In generating this dictionary, we ignore extremely short words, mark stopwords [14] uniformly with tag 1, and mark low-frequency words uniformly with tag 2. For example, 'a' and 'the' are stopwords marked uniformly with tag 1, while 'sanguine' and 'hypothetical' are low-frequency words marked uniformly with tag 2.
Each original news headline is composed of several words. To facilitate the training and testing of the Deep Learning models, we map each word string to the corresponding integer based on the generated word dictionary, so that the two-dimensional string array of news headlines is converted to a two-dimensional integer array with shape $(n, len)$. In this integer array, extremely short and unnecessary words with only one letter (e.g., 'o' and 'e') have been ignored, tag 1 denotes all stopwords, tag 2 denotes all low-frequency words, and tags greater than or equal to 3 denote the corresponding high-frequency words.
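The tagging scheme above can be sketched in plain Python. This is a simplified illustration under several assumptions that are not stated in the text: stopwords are checked before the one-letter rule (so that 'a' keeps tag 1), padding uses the value 0, high-frequency words are numbered by descending frequency, and lemmatization and duplicate removal are omitted. The tiny `STOPWORDS` set, the helper names, and the sample headlines are hypothetical.

```python
from collections import Counter

STOPWORDS = {"a", "the", "on", "of", "in"}  # tiny stand-in for a real stopword list
PAD, STOP, LOW = 0, 1, 2                    # PAD = 0 is an assumption; tags 1 and 2 follow the text
SYMBOLS = str.maketrans({ch: " " for ch in "()'\",.?:-!#"})

def tokenize(headline):
    """Lowercase, replace the listed symbols with spaces, split on whitespace."""
    return headline.lower().translate(SYMBOLS).split()

def build_dictionary(token_lists, min_count):
    """Assign tags >= 3 to words that occur at least min_count times."""
    counts = Counter(
        w for tokens in token_lists for w in tokens
        if w not in STOPWORDS and len(w) > 1
    )
    frequent = sorted((w for w, c in counts.items() if c >= min_count),
                      key=lambda w: (-counts[w], w))
    return {w: tag for tag, w in enumerate(frequent, start=3)}

def encode(tokens, word_to_tag, max_len):
    """Map tokens to integer tags, then truncate or pad to max_len."""
    ids = []
    for w in tokens:
        if w in STOPWORDS:
            ids.append(STOP)                     # all stopwords share tag 1
        elif len(w) == 1:
            continue                             # other one-letter words are ignored
        else:
            ids.append(word_to_tag.get(w, LOW))  # unknown words share tag 2
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

headlines = ["The cat sat on the mat!", "A cat ate the fish", "O cat, the sanguine cat"]
token_lists = [tokenize(h) for h in headlines]
word_to_tag = build_dictionary(token_lists, min_count=2)  # the paper uses 140 on the full corpus
encoded = [encode(t, word_to_tag, max_len=6) for t in token_lists]
```

In this toy corpus only 'cat' reaches the frequency threshold, so it receives tag 3, while 'sanguine' falls back to the shared low-frequency tag 2.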

Deep Learning Model Structure.
In this section, we propose three Deep Learning models, all of which contain a word embedding layer and an output layer; their structures are shown in Figure 4.
Word2Vec [15] provides a simple and effective method for the vectorized representation of words, which can be employed in the word embedding task. In the word embedding layer, each integer in the two-dimensional integer array of news headlines is converted to a real number vector with dimension $d_x$. Consequently, the two-dimensional integer array is converted to a real number array with shape $(n, len, d_x)$, i.e., each news headline is converted to a word vector input sequence $X = (x_1, x_2, x_3, \ldots, x_{len})$.
In the output layer, the vector $h_{len}$, the vector $h_{sum}$, or the vector with dimension $d_{dense}$ returned from the final dense layer in Mini-Transformer is converted to a real number $y$ via a dense layer whose activation is the logistic sigmoid function. If $y$ is greater than the threshold 0.5, the predicted label is set to 1; otherwise, it is set to 0.
In Mini-Transformer, we employ two layers, each consisting of a multi-head attention sublayer and a fully connected dense sublayer without bias. It is worth noting that the activation function of the dense sublayer is the Rectified Linear Unit (ReLU) [16], but the final dense layer has no activation function.

Experimental Settings and Results
A word is considered a high-frequency word if its frequency is greater than or equal to 140. According to the word frequency statistics, the total number of high-frequency words is 7,996, so the length of the word dictionary is 7,998, including the two shared tags for stopwords and low-frequency words.
For the word embedding layer, the dimension of the word vectors $x_t$ ($d_x$) is 25 and the maximum length of news headlines ($len$) is 15. For the LSTM and attention layers, the dimension of the hidden states $h_t$ ($d_h$) is 16 and $d_{attn}$ is 32.
In Mini-Transformer, for the multi-head attention sublayer, the number of dot-product attention heads ($n_{head}$) is 8 and the dimension of each head ($d_{head}$) is 64; for the final dense layer, $d_{dense}$ is 16, which is the same as $d_h$.
After shuffling, 80% of the original labeled dataset is split into the training set and 20% into the test set for cross-validation. For batch training [17], we combine consecutive news headlines of this text dataset into batches, with the batch size set to 768. For LSTM, LSTM with attention layer, and Mini-Transformer, the test set accuracy and loss curves over 10 epochs are shown in Figure 5; accuracy, precision, recall, and F1 score are shown in Table 1.
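The split, batching, and evaluation metrics described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the training-set size is assumed to be obtained by truncating 80% of the total, and the final (possibly smaller) batch is assumed to be kept.

```python
import math
import random

def train_test_split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_fraction)  # truncation is an assumption
    return idx[:cut], idx[cut:]

def batches(indices, batch_size=768):
    """Yield consecutive slices of at most batch_size indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score for binary labels (1 = fraudulent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Sizes implied by the reported settings (1,481,814 headlines, batch size 768).
n_total = 1_481_814
n_train = int(n_total * 0.8)
n_test = n_total - n_train
batches_per_epoch = math.ceil(n_train / 768)

# Small demonstration on toy data.
train_idx, test_idx = train_test_split_indices(10)
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Under these assumptions, the training set holds 1,185,451 headlines, the test set holds 296,363, and one epoch consists of 1,544 batches.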
For further comparison with the models from [18], we conducted several contrast experiments by employing three regular Machine Learning models: logistic regression [19], linear support vector machine (Linear SVM) [20], and random forest [21].
For logistic regression, the primal formulation is solved with the liblinear solver (dual = False). For Linear SVM, the algorithm is set to solve the primal optimization problem (dual = False). For random forest, the minimum number of samples required to split an internal node is 50 (min_samples_split = 50). The other hyper-parameters keep their scikit-learn defaults.
As Table 1 shows, the three regular Machine Learning models do not achieve good classification results; this may be because they are too simplistic to process massive news headline data.
Compared with LSTM, Mini-Transformer achieves a clear accuracy improvement (0.9%-1.0%). In contrast, LSTM with attention layer achieves only a slight accuracy improvement (<0.1%); this may be because the maximum length of news headlines ($len$) is so short that a general attention layer cannot play a sufficient role in reflecting the importance of all hidden states returned from the LSTM layer.
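The baseline settings above correspond to the following scikit-learn estimator configuration. This fragment only constructs the three estimators with the stated hyper-parameters; the dictionary name `baselines` is hypothetical, and the feature representation fed to these models is not specified in the text, so no fitting step is shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Only the hyper-parameters named in the text are set; all others keep
# their scikit-learn defaults.
baselines = {
    "logistic_regression": LogisticRegression(solver="liblinear", dual=False),
    "linear_svm": LinearSVC(dual=False),
    "random_forest": RandomForestClassifier(min_samples_split=50),
}
```

Each estimator exposes the usual `fit` and `predict` methods, so the same training and evaluation loop can be reused across the three baselines.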

Conclusion
Existing work has not focused on fraudulent news headline detection. In this paper, we have compared the classification performance of mainstream LSTM network and general attention mechanism for fraudulent news headline detection using massive news headline data, which is helpful for the research on the defense system against social engineering attacks.
In addition, according to relevant experience, we have built a more advanced Deep Learning model, Mini-Transformer.

Data Availability
The data used to support the findings of this study are available from The Examiner - Spam Clickbait Catalog | Kaggle: https://www.kaggle.com/therohk/examine-the-examiner, A Million News Headlines | Kaggle: https://www.kaggle.com/therohk/million-headlines, and News Category Dataset | Kaggle: https://www.kaggle.com/rmisra/news-category-dataset.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.