Detection of Online Fake News Using Blending Ensemble Learning

The exponential growth of fake news and its inherent threat to democracy, public trust, and justice have escalated the need for fake news detection and mitigation. Detecting fake news is a complex challenge because it is intentionally written to mislead and hoodwink. Humans are not good at identifying fake news: the detection rate by humans is reported to be 54%, with an additional 4% reported in the literature as being speculative. The significance of fighting fake news is exemplified during the present pandemic. Consequently, social networks are ramping up the use of detection tools and educating the public in recognising fake news. In the literature, several machine learning algorithms have been applied to the detection of fake news with limited and mixed success. However, several advanced machine learning models are not being applied, although recent studies are demonstrating the efficacy of the ensemble machine learning approach; hence, the purpose of this study is to assist in the automated detection of fake news. An ensemble approach is adopted to help resolve the identified gap. This study proposes a blended machine learning ensemble model developed from logistic regression, support vector machine, linear discriminant analysis, stochastic gradient descent, and ridge regression, which is then used on a publicly available dataset to predict whether a news report is true or not. The proposed model is appraised against popular classical machine learning models, while performance metrics such as AUC, ROC, recall, accuracy, precision, and f1-score are used to measure its performance. The results show that the proposed model outperformed other popular classical machine learning models.


Introduction
The increasing use of the Internet coupled with social media platforms has enabled even more people to obtain news from a wide variety of sources instead of traditional news outlets. People who spend a lot of time online are more likely to acquire news and updates through social media, with an increased risk of exposure to wide-scale misinformation [1].
This provides fertile ground for fake news as news articles, hoaxes, reviews, rumours, satires, advertisements, and exaggerated claims proliferate. The widespread distribution of bogus news is capable of producing extremely adverse effects on individuals and humanity [2]. It has now become a part of daily life to hear of worsening weather crises, political violence, intolerance amongst people of different ethnic and cultural backgrounds, and even effects on issues of public health.
This is often done to advance or foist certain ideas into circulation and is often driven by political agendas. Therefore, governments around the world are trying to track and tackle this problem [3]. Bogus news is not a new concept [4] or a product of the digital communication age [5]. It came to prominence most recently during the 2016 US presidential election.
There have been numerous hoax stories by which citizens, governments, and all other social elements are impacted and influenced.
Facebook has been at the epicentre of the controversy, with media houses criticising it for targeting the population and showing them posts in line with what they support [6]. It has been alleged that bogus news could have been decisive in the 2016 US presidential election [7]. Nevertheless, we can contend that false news, and misinformation generally, has become a big problem that may have a significant social cost in the future. The ubiquitous nature of the Internet enables anybody to spread false and biased information easily. It is virtually impossible to prevent or control fake news from being created or disseminated. Consequently, both online platforms and researchers are very proactive in detecting potential false news. It is a complex problem since false news can present itself in multiple ways, making it challenging to identify efficiently, whether manually or automatically [8].
Headlines in the form of clickbait are used to entice users to view probably subjective articles to make a profit. According to Wang [9], "The problem of fake news detection is more challenging than detecting deceptive reviews, since the political language on TV interviews and posts on Facebook and Twitter are mostly short statements." Therefore, it is evident that the development of automated solutions for false news detection is imperative and exigent [3]. Prior works have used many classical models. However, several unconventional learning models have not been applied, although they have proven best in numerous text classification problems [10]. An ensemble approach is proposed to help resolve the identified gap.
Recent studies are demonstrating the effectiveness of ensemble learning approaches with promising results [11].
This study investigates how natural language processing techniques and machine learning can be combined in a blending ensemble approach to create a model that uses the data of previous news reports to predict whether a news report is true or not. The proposed model is compared with classical machine learning models using performance metrics, for example, AUC, ROC, recall, accuracy, precision, and f1-score. These measurements are used to gauge the performance of the model. The remainder of this paper is outlined as follows. The related literature is discussed in Section 2. Section 3 presents the study materials and methods, result analysis is presented in Section 4, and the paper is concluded in Section 5.

Related Works
Humans are fairly unimpressive at recognising deception. Most people believe that the information they obtain is factual and trustworthy. They tend to be unaccountably receptive to knowledge that is not fully understood [12]. Confirmation bias influences people to grasp only what they want to perceive [13]. Therefore, the proliferation and propagation of fake news is a major concern because of its capacity to generate devastating consequences. Diverse machine learning approaches are utilised to combat it. However, the majority have focused on a specific category of news without utilising several advanced methods [13,14].
Numerous neural networks and models based on machine learning have been applied to detect fake news. Models were developed with features designed for specific datasets. Yun and Ahn [15] detected fake news in Korea with machine learning and text mining using a two-step approach.
Initially, the news contents are converted to values by applying text mining, and then classifiers are trained on these values. Aphiwongsophon and Chongstitvatana [16] based their models on identifying fake news using selected data sourced from Twitter. It is likely that these approaches will fall victim to dataset bias and possibly perform poorly on a different category of news [10]. Gilda [17] explored some traditional machine learning approaches. Ahmed et al. [18] investigated and compared six different classification techniques using n-gram analysis and feature extraction on a single dataset. The models were evaluated independently, and the linear support vector machine classifier achieved the best score. However, several advanced learning models have not been applied, although they have excelled in text classification [10].
Research using deep learning to identify fake news has achieved encouraging results [10]. Rashkin et al. [19] used linguistic feature analysis and achieved a remarkable outcome with Long Short-Term Memory. Wang [9] constructed a hybrid model using a convolutional neural network that outclassed other traditional learning models. Singhania et al. [20] applied a three-level attention network incorporating sentences, words, and headlines. Thota et al. [21] presented a neural network to predict the stance using the headline and the body of the article. Wang [9] presented a benchmark dataset named Liar and investigated it using current models. The evaluation hints at how different types of models perform on data that is structured. Also, some models were prone to overfitting.
Ruchansky et al. [22] built a CSI (capture, score, and integrate) model that used text, article response, and characteristics of the users' behaviour. Ajao et al. [23] developed a framework for classifying and identifying fake news in Twitter posts using a hybrid of neural networks. The approach identified pertinent features automatically, without relying on prior knowledge. Lu and Li [24] developed Graph-aware Co-Attention Networks (GCAN) to determine whether a tweet is fake by using the associated sequence of retweet users. Khan et al. [10] analysed the performance of different approaches on three datasets and showed that Naive Bayes can achieve results similar to neural network models when working with a dataset containing under 100 thousand articles. Vijayaraghavan et al. [25] applied different models to detect fake news and state that neural networks generally perform consistently and serve as a powerful universal approximator. However, the reported loss and accuracy are reached only after many epochs, and thus the issue of overfitting comes into play. In addition, a simpler model using logistic regression also delivered good performance results. Consequently, it does not necessarily follow that the more complicated the model, the better the performance. Furthermore, deep learning is "time-consuming and resource-consuming" [26].
Researchers have studied numerous algorithms for text classification that give good performance. However, some algorithms perform better on some datasets but may give only an average performance on other datasets. Therefore, instead of basing a decision on an individual classifier, it is better to use a group of classifiers and take a collective decision [27,28]. This approach, called an ensemble approach, overcomes the weakness of one classifier with the strength of other classifiers and gives better performance than an individual classifier. The diverse nature of the approach and keeping the variance under control contribute greatly to its success. Furthermore, ensemble learning can result in more robust schemes of classification.
Roy et al. [3] developed models built on a Bidirectional Long Short-Term Memory and a Convolutional Neural Network. The output from both of these models was input into a Multilayer Perceptron Model to obtain the final result. Al-Ash et al. [29] used a random forest classifier, an ensemble of decision tree classifiers, to detect Indonesian fake news. Reddy et al. [26] presented a hybrid approach for fake data detection using an ensemble model. Ahmad et al. [30] explored different textual properties in an ensemble approach to detect fake news. Gutierrez-Espinoza et al. [11] evaluated the performance of ensemble learning using different machine learning techniques for classification in order to identify bogus online information.
Mahabub [31] used a distinct method for detecting fake news in developing an ensemble voting classifier that incorporates many familiar machine learning algorithms. Kaur et al. [32] designed a voting model with multiple levels in automating the detection of fake news by experimenting with several models. Saeed et al. [1] incorporated an ensemble approach to detect spam from Arabic texts. Li et al. [33] applied a pipeline to identify fake news by taking into consideration the headline and article text in a stacked ensemble. In all these studies, the ensemble approach yielded better performance when compared to the individual model in the detection of deceptive information.
Therefore, an advanced ensemble approach is adopted to detect fake news. The strategy will integrate blending and machine learning with natural language processing to extend and improve the current approaches.

Materials and Methods
In this section, we present the datasets, proposed framework, explanation of the algorithms, and the metrics that are used for performance evaluation. Two datasets have been selected for our experiments, which include news from a range of different categories and a combination of fake and truthful articles. Both datasets are publicly available and easily accessible on the web. Categorization of news as "fake news" can be "a very challenging and time-consuming task" [34]. Hence, existing datasets are used in this study. The major challenge in identifying false news is the accessibility and calibre of the datasets [35]. Also, finding a corpus of news articles is particularly problematic owing to copyright concerns [17]. The Liar [9] and ISOT [36] datasets are used.
The Liar dataset is publicly accessible and has been successfully used [37]. It comprises 12,836 short labelled statements from politifact.com. There are six labels for rating the truthfulness of a statement: "pants-fire," "false," "barely-true," "half-true," "mostly-true," and "true." We focus on classifying news as true or fake. For binary classification of the news, we transform these labels into two labels. "Pants-fire," "false," and "barely-true" are considered as fake and "half-true," "mostly-true," and "true" are considered as true.
This dataset largely focuses on politics and contains statements of Republicans and Democrats, in addition to a substantial quantity of posts from social media [10]. The ISOT Fake News Dataset comprises both truthful and fake news articles sourced from several domains [30]. The true articles were sourced mainly from reuters.com, a well-known news site on the web. The fake news articles were obtained from numerous sources, primarily from websites that have been flagged by politifact.com. The dataset consists of 44,898 articles, 23,481 being fake articles and 21,417 true articles. Each data point consists of a title, text, subject, and date. The text is the actual news article, and the subject or category is any one of Middle East, government news, US news, world news, politics news, left-news, politics, and news.
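The binary relabelling of the Liar statements described above can be sketched as a simple lookup. This is an illustrative sketch only: the function name `to_binary_label` and the two set names are hypothetical, and the label spellings are assumed to match the six ratings listed above.

```python
# Hypothetical helper illustrating the six-to-two label mapping for Liar.
FAKE_LABELS = {"pants-fire", "false", "barely-true"}
TRUE_LABELS = {"half-true", "mostly-true", "true"}


def to_binary_label(label):
    """Map one of the six Liar ratings to 'fake' or 'true'."""
    key = label.strip().lower()
    if key in FAKE_LABELS:
        return "fake"
    if key in TRUE_LABELS:
        return "true"
    raise ValueError(f"unknown Liar label: {label!r}")


binary = [to_binary_label(l) for l in ["pants-fire", "half-true", "True"]]
```

Raising on an unknown rating, rather than defaulting to one class, keeps spelling variants in the raw data from silently skewing the class balance.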
In the proposed framework, as shown in Figure 1, we extend the current literature by introducing ensemble learning techniques incorporating blending. News articles from several domains are classified as true or fake by working with different feature sets. Blending ensemble techniques with Term Frequency, Term Frequency-Inverse Document Frequency, and n-grams are used in our approach.
Raw texts of the news need to be preprocessed before being fed into the models. Natural language processing techniques will be applied to help improve accuracy. The following operations will be carried out during the preprocessing of the dataset:
(i) Data cleansing: remove irrelevant data that is not required for the analysis.
(ii) Check for missing values that can have an adverse effect on the final result.
(iii) Convert the text to lowercase so that there is consistency.
(iv) Remove all punctuation marks.
(v) Remove stopwords from the textual dataset. These are words that provide no added semantic meaning and are of no significance during natural language processing.
(vi) Stemming (or lemmatization) involves converting words back to their original structure and thus reducing the classes or word types present in the dataset. For example, "Dancing," "Dance," and "Dancer" will be shortened to "dance." Stemming makes classification more efficient and quicker [18]. The Porter Stemmer algorithm will be used due to its accuracy.
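The preprocessing steps listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the study's pipeline: the function name `preprocess` and the tiny `STOPWORDS` set are hypothetical placeholders, and stemming is deliberately left out because a Porter stemmer would normally come from a third-party library such as NLTK.

```python
import string

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "that"}


def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    if not text:  # treat missing values as an empty document
        return []
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = preprocess("The President is DANCING, again!")
```

A stemming pass would be applied to each token after the stopword filter.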
Feature design plays a key role in the performance of machine learning models. Extracting the most relevant or important words and using them as features can be extremely useful. Term Frequency, Term Frequency-Inverse Document Frequency, and n-grams will be used in the extraction of features from the dataset. This approach has been chosen over word embedding based on the experimental results realised in previous studies. Thota et al. [21] achieved better results using n-grams over word embedding. Vijayaraghavan et al. [25] used Word2Vec embedding and showed that it performed the worst when compared to TF-IDF models. Similar results were also confirmed by Smitha and Bharath [38]. Term Frequency utilises the counts of words present within the documents to determine the resemblance between documents. A vector of equal dimension that holds the word counts is associated with each document. Term Frequency-Inverse Document Frequency is a metric frequently used in natural language processing and information retrieval. It measures the significance of a term in a document included in the dataset. Word-based n-grams are used to represent the document's context and to generate features that can be useful in classifying a document as real or fake. This approach has been used successfully with unigrams and bigrams in fake news detection [10].
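To make the three feature types concrete, the following sketch computes word n-grams, raw term frequencies, and TF-IDF weights for a toy corpus. It assumes the textbook weighting tf × log(N / df); libraries often use smoothed variants, so the exact numbers here are illustrative. The function names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Contiguous word n-grams, e.g. bigrams for n=2."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """One {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


docs = [["fake", "news", "spreads"], ["real", "news", "report"]]
weights = tf_idf(docs)
bigrams = word_ngrams(docs[0], 2)
```

In this toy corpus "news" occurs in every document and so receives weight zero, while terms unique to one document receive a positive weight.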

Blending Ensemble Model.
Blending is very closely allied to stacking. Stacking (stacked generalization) involves a learning algorithm being trained to pool the predictions of several other learning algorithms. All the algorithms are trained on the available data. A combiner algorithm is eventually used for the final prediction by taking into account the predictions of the other algorithms [39]. The blending ensemble is a variation of stacking. In blending, the meta-model is fitted on predictions made on a holdout validation dataset rather than on out-of-fold predictions; this is the variant used in this study.
The model learns to combine the predictions of several contributing ensemble base models. Models implementing logistic regression, support vector machine, linear discriminant analysis, stochastic gradient descent, and ridge regression are used in the formation of the ensemble. The blending ensemble algorithm is given in Algorithm 1.
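The holdout-based blending procedure can be sketched in a few lines. This is a toy illustration rather than the study's implementation: `ThresholdStump` and `AccuracyWeightedVote` are hypothetical stand-ins for the five base models and the meta-model, chosen only so that the example runs without third-party libraries. Any objects exposing `fit` and `predict` could be substituted.

```python
class ThresholdStump:
    """Hypothetical toy base model: thresholds a single feature."""

    def __init__(self, feature):
        self.feature = feature
        self.cut = 0.0

    def fit(self, X, y):
        pos = [r[self.feature] for r, l in zip(X, y) if l == 1]
        neg = [r[self.feature] for r, l in zip(X, y) if l == 0]
        # Place the cut halfway between the two class means.
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [1 if r[self.feature] > self.cut else 0 for r in X]


class AccuracyWeightedVote:
    """Hypothetical meta-model: weights each base model by holdout accuracy."""

    def fit(self, Z, y):
        self.weights = [
            sum(1 for row, l in zip(Z, y) if row[j] == l) / len(y)
            for j in range(len(Z[0]))
        ]
        return self

    def predict(self, Z):
        half = sum(self.weights) / 2
        return [
            1 if sum(w for w, p in zip(self.weights, row) if p == 1) > half else 0
            for row in Z
        ]


def stack_predictions(models, X):
    """2D array of base predictions: one row per sample, one column per model."""
    return [list(col) for col in zip(*(m.predict(X) for m in models))]


def fit_blending(models, meta, X_train, y_train, X_hold, y_hold):
    for m in models:
        m.fit(X_train, y_train)  # base models see only the training set
    meta.fit(stack_predictions(models, X_hold), y_hold)  # meta-model sees holdout
    return models, meta


def predict_blending(models, meta, X):
    return meta.predict(stack_predictions(models, X))


X_train, y_train = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]], [0, 0, 1, 1]
X_hold, y_hold = [[0.15, 4.0], [0.3, 1.0], [0.7, 5.0], [0.95, 2.0]], [0, 0, 1, 1]
models, meta = fit_blending(
    [ThresholdStump(0), ThresholdStump(1)], AccuracyWeightedVote(),
    X_train, y_train, X_hold, y_hold,
)
preds = predict_blending(models, meta, [[0.05, 9.0], [0.99, 0.0]])
```

The key design point is that the meta-model never sees predictions on data the base models were trained on, which is what separates blending from naive stacking on the training set.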

Logistic Regression.
A logistic regression model is used since the text is being classified, resulting in a binary output (0/1, true/false, or true/fake). The hypothesis function passes a linear combination of the features through a sigmoid function, which transforms the output into a probability. The goal is to achieve optimal probability by minimizing the cost function [30]. Hence, logistic regression produces a logistic curve that is restricted to values between 0 and 1 by utilising a sigmoid function.
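As an illustration of the hypothesis and cost functions, the sketch below assumes the standard formulation: the sigmoid σ(z) = 1 / (1 + e^(−z)) applied to a linear score, and the binary cross-entropy (log loss) as the cost. The exact notation used in [30] may differ, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(y_true, y_prob):
    """Mean binary cross-entropy; y_true holds 0/1 labels."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    )
    return -total / len(y_true)


p = predict_proba([1.0, -1.0], 0.0, [2.0, 2.0])  # linear score is 0
loss = log_loss([1, 0], [0.5, 0.5])
```

A score of zero maps to a probability of exactly 0.5, which is the usual decision threshold between the two classes.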

Support Vector Machine.
A support vector machine creates a hyperplane to separate and group features. Support vectors are identified on either side of the hyperplane in order to calculate the optimal hyperplane, with each vector maximising the distance between them. The greater the distance of the vectors around the hyperplane, the more accurate the decision boundary between the category features [25]. The data points are classified into distinct classes depending on their position relative to the hyperplane.
The key motive is to maximise the gaps that exist between the hyperplane and the data points. The margin is maximised by the loss function. The hyperplane is defined by $w^{T}x + b = 0$, where $w$ is the weight vector and $b$ is the bias.
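The relationship between the hyperplane, the margin, and a two-term loss can be sketched as follows. The objective used here, mean hinge loss plus an L2 penalty scaled by `lam`, is the common soft-margin form and is an assumption; the function names and the parameter `lam` are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def soft_margin_objective(w, b, X, y, lam):
    """Error term (mean hinge loss) plus regularization term (lam * ||w||^2)."""
    hinge = sum(
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    ) / len(y)
    return hinge + lam * sum(wi * wi for wi in w)


X, y = [[2.0, 0.0], [-2.0, 0.0]], [1, -1]
objective = soft_margin_objective([1.0, 0.0], 0.0, X, y, lam=0.1)
width = margin_width([3.0, 4.0])
```

Shrinking the norm of `w` widens the margin, which is why the penalty on the weights and the goal of a large margin pull in the same direction.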
The errors are calculated by the first term in the loss function. The regularization function is represented by the second term and is used to circumvent overfitting [27].

Linear Discriminant Analysis.
The linear discriminant analysis (LDA) process can be summarised using three significant steps [37]:
(i) Compute the between-class variance (separability) between the different classes.
(ii) Compute the within-class variance.
(iii) Create the lower-dimensional space that maximises the between-class variance and minimises the within-class variance. The projection onto this lower-dimensional space is given by Fisher's criterion.
LDA assumes that the input variables are numeric, normally distributed, and have the same spread (variance). Otherwise, it may be necessary to transform or normalize the data before modelling. The model is multiclass: it supports both two-class and multiclass classification problems with no modification.
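The three LDA steps are usually written in terms of scatter matrices. The block below gives the standard textbook forms, which are assumed here rather than taken from [37]: $c$ is the number of classes, $D_i$ the set of samples in class $i$, $n_i$ its size, $\mu_i$ its mean, and $\mu$ the overall mean.

```latex
\begin{align}
S_B &= \sum_{i=1}^{c} n_i\,(\mu_i - \mu)(\mu_i - \mu)^{T}
      && \text{(between-class scatter)} \\
S_W &= \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^{T}
      && \text{(within-class scatter)} \\
W^{*} &= \arg\max_{W} \frac{\lvert W^{T} S_B W \rvert}{\lvert W^{T} S_W W \rvert}
      && \text{(Fisher's criterion)}
\end{align}
```

Under this formulation, the columns of $W^{*}$ are the leading eigenvectors of $S_W^{-1} S_B$, provided $S_W$ is invertible.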

Stochastic Gradient Descent.
Stochastic gradient descent uses an iterative method to optimize an objective function that has suitable smoothness properties, such as being differentiable or subdifferentiable. The method uses randomly shuffled or selected samples to estimate the gradients. Therefore, stochastic gradient descent "can be regarded as a stochastic approximation of gradient descent optimization" [6]. The gradient is principally the slope or slant of a function. It is the amount "of change of a parameter with the quantity of change in another parameter" [40]. The greater the gradient, the sharper the slope. Gradient descent is applied iteratively to find the parameter values of a function that minimize its value, moving at each step in the direction of steepest decrease. Therefore, the objective is to determine optimal parameter values in order to obtain the minimum value of the cost function.
Mathematically, the details can be expressed (for classification) as follows: given a set of training examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$, the objective is to learn a linear scoring function $f(x) = w^{T}x + b$ with model parameters $w \in \mathbb{R}^m$ and intercept $b \in \mathbb{R}$. Predictions for binary classification are made by looking at the sign of $f(x)$. To determine the parameters of the model, the regularized training error is minimized: $$E(w, b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w),$$ where $L$ is a loss function and $R$ is a regularization (penalty) term used to penalize the model complexity; $\alpha > 0$ is a hyperparameter that controls the strength of the regularization [37].
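A minimal sketch of one way to minimise such a regularized training error with per-sample updates is shown below. It assumes the hinge loss for L and the penalty R(w) = ½‖w‖², which is one common choice among several; the function names, learning rate, and toy data are all hypothetical.

```python
import random


def sgd_train(X, y, alpha=0.01, lr=0.1, epochs=20, seed=0):
    """Fit f(x) = w.x + b with hinge loss and an L2 penalty via SGD."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The gradient of the penalty 0.5 * ||w||^2 is w itself.
            grad_w = [alpha * wj for wj in w]
            grad_b = 0.0
            if y[i] * score < 1:  # hinge loss is active for this sample
                grad_w = [g - y[i] * xj for g, xj in zip(grad_w, X[i])]
                grad_b = -y[i]
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Class label from the sign of the scoring function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = sgd_train(X, y)
labels = [predict(w, b, x) for x in X]
```

Each update uses the gradient from a single sample, so the cost per step is independent of the dataset size.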

Algorithm 1: Blending ensemble.
(1) Split the dataset. The dataset is split into test and train sets.
(2) Construct the base models.
(3) Repeat
      Fit on the training set.
      Make prediction on holdout set.
      Store the predictions as input for blending
    until the end of base models.
    Build a 2D array using the stored predictions.
    Create the blending model.
    Fit the blending model on the predictions from base models.
(4) Make predictions with the blending ensemble.
    Repeat
      Make prediction with base model on test set.
      Store the prediction
    until the end of base models.
    Build a 2D array using the stored predictions.

Ridge Regression.
The regression method serves as a basis for the ridge classifier. For binary classification, the target variable is converted into +1 or −1 depending on the class to which it belongs; for multiclass data, multioutput regression is used, and the class with the largest predicted value is taken as the target class. Ridge regression is virtually the same as linear regression except that a small bias is introduced. Consequently, the variance is reduced significantly. So, by beginning with a somewhat worse fit, better predictions in the long term are possible. The added bias is called the ridge regression penalty. It is computed by finding the product of lambda (written as $\alpha$ below) and the squared weight associated with each feature. Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. A penalized residual sum of squares is minimized by the ridge coefficients: $$\min_{w} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2,$$ where $\alpha \geq 0$ is the complexity parameter that regulates the shrinkage. The greater the value of $\alpha$, the larger the shrinkage, and therefore, the coefficients become increasingly robust to collinearity.
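The shrinkage effect of the ridge penalty is easiest to see with a single feature and no intercept, where the penalized least-squares problem has the closed-form solution w = Σxy / (Σx² + α). The sketch below uses that simplified case with ±1 targets, as in the ridge classifier; the function names and toy data are hypothetical.

```python
def ridge_weight(xs, ys, alpha):
    """Minimiser of sum((w*x - y)^2) + alpha * w^2 for one feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def ridge_classify(w, x):
    """Binary ridge classification: the sign of the regression output."""
    return 1 if w * x >= 0 else -1


xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]  # classes encoded as -1 / +1
w_ols = ridge_weight(xs, ys, alpha=0.0)     # ordinary least squares
w_ridge = ridge_weight(xs, ys, alpha=10.0)  # coefficient shrunk towards zero
```

The penalized coefficient is smaller in magnitude than the unpenalized one, yet the predicted class for each sample is unchanged because only the sign of the output matters.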

Results and Discussion
In this section, we present the performance analysis of the traditional machine learning models and the blending ensemble. This is done for both the Liar and ISOT datasets. Six performance measurements have been used in the comparison of the six models. These include ROC AUC, f1-score, AUC, precision, recall, and accuracy. The metrics are calculated for both the real and fake classes. Table 1 summarises the experimental results for the Liar dataset in the detection of fake news.
The best performing base model on the Liar dataset is the logistic regression classifier, which achieved the best scores for four out of six comparison metrics. These include ROC AUC, AUC, precision, and accuracy. However, overall, the blending ensemble has delivered the top performance. Its four best scores out of the six include ROC AUC, AUC, recall, and accuracy. Table 2 summarises the experimental results for the ISOT dataset in the detection of fake news. The linear support vector machine classifier is the best performing base model on the ISOT dataset, with the best scores in five out of six comparison metrics. These include ROC AUC, f1-score, AUC, recall, and accuracy. However, overall, the blending ensemble is the top-performing model. Its four best scores out of the possible six performance metrics include ROC AUC, f1-score, recall, and accuracy.

ROC Curve.
The performance of a classification model can be visualized or verified by using the ROC curve. The true positive rate (on the y-axis) is plotted against the false positive rate (on the x-axis). It is considered a probability curve. The area under the curve is regarded as a key metric for evaluating the model's classification performance. It measures the performance of the classifier at different threshold settings and indicates the "degree or measure of separability" [6]. Thus, it represents the capacity of the model to distinguish between different classes. The larger the area under the curve, the better the model is able to distinguish true news articles from fake news articles. The ROC curves for the Liar and ISOT datasets are shown in Figures 2 and 3, respectively. In both instances, we observe that the blending ensemble is the superior model since the area enclosed underneath the curve is the largest. On the other hand, the area under the ROC curve for the LDA is the smallest. Therefore, we can conclude that the LDA model is the worst performer on both datasets. These deductions can easily be validated by checking the respective ROC AUC scores in the performance metrics tables.
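The area under the ROC curve has a useful equivalent reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a toy set of scores; the function name and the data are hypothetical, and the pairwise loop is meant for illustration rather than for large test sets.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.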

Precision-Recall Curve.
The precision-recall curves for the Liar and ISOT datasets are shown in Figures 4 and 5, respectively. Each curve is constructed by computing and then plotting the precision (on the y-axis) against the recall (on the x-axis) for each classifier at various thresholds. The curve summarises the trade-off between the true positive rate and the positive predictive value for a classification (predictive) model across varied probability thresholds.
A good classifier maintains both a high recall and a high precision throughout the graph and will "hug" the upper right corner of the plots [41]. This is evident for the ISOT dataset, which indicates substantially better performance by all the classifiers when compared to the Liar dataset. This observation is bolstered by the AUC scores in the performance metrics tables. Once again, the blending ensemble features very strongly in the comparison plots. It has the top AUC score on the Liar dataset and the second-best score, just 0.001 behind, on the ISOT dataset.
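Each point on a precision-recall curve comes from applying one probability threshold to the classifier's scores. A minimal sketch of that computation follows; the function name and toy scores are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall


y_true, scores = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
low = precision_recall_at(y_true, scores, 0.35)   # permissive threshold
high = precision_recall_at(y_true, scores, 0.5)   # strict threshold
```

Raising the threshold in this example trades recall for precision, which is the trade-off the curve traces out.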

Confusion Matrix.
A confusion matrix is used for the analysis of a machine learning model. It reflects the data in connection with the true positives, false negatives, false positives, and true negatives [42]. Figures 6 and 7 represent the confusion matrix of blending ensemble on predictions made on the test sets for the Liar and ISOT datasets, respectively.
We can make the following deductions based on the confusion matrix of the Liar dataset:
(i) 1201 fake news articles have been correctly predicted as fake
(ii) 2702 news articles that are true (real) have been correctly predicted as true
(iii) 1687 fake news articles have been incorrectly predicted as true
(iv) 828 true news articles have been incorrectly predicted as fake
Similarly, we can make the following inferences based on the confusion matrix of the ISOT dataset:
(ii) 10462 news articles that are true (real) have been correctly predicted as true
(iii) 170 fake news articles have been incorrectly predicted as true
(iv) 17 true news articles have been incorrectly predicted as fake
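The summary metrics follow directly from the four confusion-matrix counts. The sketch below applies the standard definitions to the Liar counts listed above, treating "true" as the positive class; that choice of positive class and the function name are assumptions made for illustration.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Liar test set, with "true" as the positive class:
# tp = true predicted true, tn = fake predicted fake,
# fp = fake predicted true, fn = true predicted fake.
liar = metrics_from_counts(tp=2702, tn=1201, fp=1687, fn=828)
```

Swapping the roles of the two classes changes precision, recall, and F1 but leaves accuracy the same, which is why the metrics are reported per class.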

Conclusions
The paper has presented the application of six machine learning models using TF-IDF vectors as features (n-gram level TF-IDF) for the goal of discovering fake news. By building the classifiers and conducting the experiments, we can conclude that the blending ensemble is the best performing model on the Liar and ISOT datasets. This has been validated by employing a variety of metrics to measure performance.
This model is saved and will be used for prediction. The results of the blending ensemble compare favourably with other studies. It is a performance improvement compared to many of the ensemble models used by Ahmad et al. [30] on the ISOT dataset, including the two benchmark models Wang-CNN and Wang-Bi-LSTM. We presented a blending ensemble model for detecting fake news as a solution to improve the current approaches. Our plans include experimenting with other and larger datasets and varying the type, combination, and number of base models for the ensemble. We will also consider examining current trends on social media connected to fake news in order to incorporate them into our detection model. However, an associated limitation is that such data are often inconsistent, which adds to the mistakes or anomalies of the prediction model.

Data Availability
All the data are available from the authors.

Conflicts of Interest
The authors declare no conflicts of interest.