A Predictive Model for Benchmarking the Performance of Algorithms for Fake and Counterfeit News Classification in Global Networks

The pervasive spread of fake news on online social media has emerged as a critical threat to societal integrity and democratic processes. To address this pressing issue, this research harnesses the power of supervised AI algorithms for classifying fake news. Algorithms such as the Passive Aggressive Classifier, Perceptron, and decision stump undergo meticulous refinement for text classification tasks, leveraging 29 models trained on diverse social media datasets; sensors can also be utilized for data collection. Data preprocessing involves rigorous cleansing and feature vector generation using TF-IDF and Count Vectorizers. The models' efficacy in distinguishing genuine news from falsified or exaggerated content is evaluated using metrics such as accuracy, precision, and recall. To obtain the best-performing algorithm for each of the datasets, a predictive model was developed, through which SG performs best in Dataset 1 with 0.681190, BernoulliRBM in Dataset 2 with 0.933789, LinearSVC in Dataset 3 with 0.689180, and BernoulliRBM in Dataset 4 with 0.026346. This research illuminates strategies for classifying fake news, offering potential solutions to ensure information integrity and democratic discourse, thus carrying profound implications for academia and real-world applications. This work also suggests the strength of sensors for data collection in IoT environments, big data analytics for smart cities, and sensor applications that contribute to maintaining the integrity of information within urban environments.


Introduction
Online social media platforms have become increasingly influential in a time when digital communication and information exchange are advancing quickly. These platforms have emerged as the main avenues for the broadcast of news, presenting both unprecedented potential and difficulties. Among the most urgent problems are the frequency and effects of fake news, that is, purposefully false or misleading material presented as news [1].
The emergence of false news has wide-ranging effects. It weakens not only the credibility of information sources but also the public's confidence in media organizations and democratic institutions. Misinformation can spread like wildfire over social media networks, confusing the audience and possibly influencing their opinions and decisions [2]. Misinformation is frequently fueled by clickbait, political goals, or sensationalism. To address this issue, researchers and professionals have turned to cutting-edge technology, such as supervised artificial intelligence algorithms, to assist in identifying false information within the dynamic and linked landscape of online social media.
This study proposes a two-phased detection methodology designed to identify instances of fake news within the realm of social media. The recommended framework integrates supervised artificial intelligence algorithms with text analysis techniques, forming an innovative approach. In the initial stage of the project, text mining methodologies are employed to analyze a dataset comprising internet news. The primary objective of these text analysis techniques is to extract structured information from unstructured news stories.
To ensure robustness, these supervised algorithms are subjected to rigorous training and testing using four distinct datasets. Through this comprehensive approach, this study aims to effectively distinguish between bogus news and authentic news within the dynamic and complex landscape of social media.
This research achieves uniqueness by creating a predictive model that can easily choose the best algorithms and maximize their performance. Furthermore, the model effectively improves evaluation by offering a consistent method for assessing algorithmic performance, which in turn identifies regions in which algorithms perform well or poorly. It assists in identifying algorithms that stop the spread of bogus news before it causes havoc by detecting it early. By detecting fake news before it goes viral, real-time prediction not only enhances model accuracy by capturing the newest trends and patterns but also allows for quick action against fake news, decreasing its spread. Rapid incident reaction and mitigation are further facilitated by the real-time forecast. By benchmarking numerous algorithms for classifying fake and counterfeit news, this work allows developers and researchers to construct more dependable and effective solutions. This study clarifies how decisions are made and explains why certain news stories are classified as either positive or negative. Furthermore, the models can be expected to support a variety of datasets, techniques, and use cases. These succinct explanations, among others, are the innovations of this research endeavor.
The following are the major contributions of our research:
i. Development of predictive models for benchmarking the performance of the algorithms used for fake and counterfeit news classification in global networks.
ii. Benchmarking of multiple algorithms: in order to determine which techniques are the most efficient, this work benchmarks the performance of multiple algorithms.
iii. Real-time prediction: the method can identify bogus and fraudulent news in real time, enabling prompt alerting, awareness raising, and intervention.
iv. Interpretability and explainability: this study sheds light on the decision-making process and explains why some news items are categorized as either positive or negative (fake or counterfeit).
v. Constant learning and adaptation: by incorporating mechanisms for ongoing learning and adaptation, this work can remain effective even in the face of changing techniques used by fake and counterfeit news sources.
vi. Provision of a platform for internet users to identify fake news and thereby avoid becoming victims of fake and fraudulent news.
vii. A searchlight for internet users to be selective about where to source reliable real-time information.
The following are some real-world uses for algorithms designed to classify false and counterfeit news on international networks:
1. Social media companies: by incorporating algorithms to identify and eliminate false material, they can stop the spread of false information.
2. News aggregators and fact-checking websites: these websites automate the process of fact-checking news stories by using algorithms.
3. Search engines: users are guaranteed to receive reliable information by applying algorithms that demote or delete phony news from search results.
4. Online advertising platforms: employing algorithms to thwart the display of advertisements on fake news websites, diminishing the sources of income for disinformation.
5. Government and law enforcement organizations: using algorithms to track and examine the dissemination of false information, these organizations can make informed decisions about enforcement and policy.
6. Media outlets and journalists: upholding journalistic integrity by using algorithms to validate sources and identify false information.

Structure of the Paper
The rest of the paper is organized as follows: Section 1 presents the Introduction, while a summary of related works is presented in Section 2. The materials and methods used are detailed in Section 3. The results obtained in the course of experimentation are provided in Section 4, while the predictions of the best-performing algorithms are comprehensively presented in Section 5. The discussion of the results, as well as the concluding remarks, is provided in Section 6.

Related Works
In 2022, Shaina and Chen suggested a novel architecture for detecting fake news that was intended to overcome issues like the early detection of bogus news and the lack of labeled data needed to train the detection model [3]. This system uses data from the report articles and the social setting to identify fake news. It is based on a transformer design and was influenced by the BART architecture. The model's encoder blocks carry out the representation learning task. The issues of early fake news detection are further helped by the decoder blocks, which forecast future behavior based on historical observations. Compared to autoencoding models (exBAKE and BERT), autoregressive models (Grover and GPT-2) performed better for early detection. ExBAKE, FANG, 2-Stage Transf., Declare, TriFN, and VGCN-BERT all showed improved performance in later time steps, according to the authors. This behavior was attributed to longer learning time. The performance of LG, TextCNN, and XGBoost was noted to have been inferior to that of the other baselines.
An effective deep diffusive neural network model for fake news identification, named FakeDetector, is proposed in the publication "FakeDetector: Effective Fake News Detection with Deep Diffusive Neural Network" [4]. By extracting explicit and latent elements from textual data, the model simultaneously learns representations of subjects, writers, and news articles. This strategy is based on the finding that looking into correlations between news pieces, their subjects or topics, and their authors or distributors can help with false news identification [5].
In 2019, Benamira et al. [6] examined the use of semi-supervised learning and graph neural networks for the identification of false news. The requirement for efficient detection techniques that can use both labeled and unlabeled data to increase accuracy is what spurred this strategy. Neural network models known as graph neural networks (GNNs) were developed to handle data represented in graph domains [7]. They have been used for many different purposes, such as graph node categorization. Semi-supervised learning is a learning methodology that uses both labeled and unlabeled data to train models. Graph neural networks and semi-supervised learning have the potential to enhance fake news identification. GNNs may capture interactions and dependencies between nodes by utilizing the graph structure of social networks or other data representations, which might be helpful for spotting patterns of fake news propagation. Additionally, semi-supervised learning with unlabeled data enables the model to learn from a bigger dataset, potentially improving its capacity to reliably identify bogus news. The experimental results, in the authors' opinion, showed that the suggested methodology performed better than conventional classification algorithms, particularly when trained on a small sample of tagged articles.
A benchmark framework for examining and discussing machine/deep learning methods used in fake news detection was given by Galli et al. [8] in 2022. Their framework intends to overcome the difficulties associated with fake news identification, such as the variety of subjects and language elaborations employed in its creation. It offered a consistent evaluation framework for contrasting the effectiveness of various detection models. The authors investigated how different elements, including textual, social, and network-based features, may be used to identify bogus news. They evaluated the effectiveness of several methods and offered details on the advantages and disadvantages of every strategy. Three datasets were used for these analyses: FakeNews, a big, unbalanced dataset, and PHEME and LIAR, two smaller datasets. First, different machine learning models (logistic regression, decision tree, SVC, and Random Forest) were compared; the results showed that logistic regression was the most successful model in terms of efficacy and efficiency measures. In contrast to other models, logistic regression has several advantages, such as simplicity of interpretation, speed of implementation, and few tuning parameters. BERT achieved the greatest overall results because it conducted context-based word-level embedding, though it was challenging to train. By combining multimedia and content analysis, a multimodal technique was further developed to perform false image classification. This approach yielded the greatest results in terms of recall, accuracy, F1, and precision using multimedia data.
Using the swarming traits of fake news, the FakeSwarm fake news recognition system was introduced in 2023 [9]. The authors used three distinct swarm feature types (metric representation, principal component analysis, and location encoding) to demonstrate the importance of considering swarming characteristics in the identification of false news. The ISOT fake news dataset was used by the authors to carry out their analysis. There are 23,481 phony news pieces and 21,417 true news articles in the dataset, which contains news stories from 2015 to 2018. The fake news pieces were gathered from several sites that fact-checking agencies like Politifact and Wikipedia had identified as false. On the other hand, reliable content came from crawling Reuters.com. The authors' examination of the available data revealed that combining all three swarm feature categories produced an excellent accuracy and F1 score of more than 97%.
A two-step method was proposed by [10] for identifying fake news on social media. Preprocessing the data was the initial step in the method's procedure to convert unstructured datasets into structured datasets. The news texts and other texts in the dataset are vectorized using the Document-Term Matrix and the acquired TF weighting algorithm. In the second stage, 23 supervised AI algorithms were applied to the dataset, which had previously been text-mined and organized into a structured format. This study used publicly available datasets to empirically test twenty-three intelligent classification techniques. The four evaluation metrics (i.e., recall, accuracy, F-measure, and precision) were then used to compare these classification models. The authors claimed that the decision tree method produced the best mean values in terms of accuracy, precision, and F-measure. In terms of the recall metric, the ZeroR, CVPS, and WIHW algorithms appeared to be the best.
Wang et al. [11] introduced a novel fine-grained multimodal fusion network (FMFN) to fully fuse textual and visual information for the purpose of identifying fake news. When a tweet has both text and an image, word embeddings from the text are extracted using a pretrained language model, and each word embedding can be considered a textual feature. Deep CNNs are utilized to extract various visual aspects from the image. Scaled dot-product attention was used to combine word embeddings from the text with multiple feature vectors representing different aspects of the image. This approach captures the dependencies between textual and visual features more accurately and accounts for correlations between different visual features. According to the findings of an extensive experiment the authors carried out on a publicly available Weibo dataset, the FMFN, which fuses multiple visual and textual features, learns a joint representation superior to that obtained by other methods that simply combine a single text representation with a single visual representation when identifying fake news.
The Mc-DNN multi-channel deep learning model was introduced by Tembhurne et al. [12] in 2022. It employs and processes news headlines and articles from several channels to distinguish between real and fraudulent news. The performance of Mc-DNN was investigated using combinations of RNN and CNN, GRU and CNN, LSTM and CNN, BiGRU and CNN, and BiLSTM and CNN. The BiLSTM and CNN Mc-DNN was reported to achieve the maximum accuracies of 99.23% and 94.68% in the task of false news identification on the ISOT fake news dataset and FND, respectively [13].
The authors of a previous publication presented an entity debiasing framework (ENDEF), which generalizes fake news detection algorithms to future data by reducing entity bias from a cause-and-effect approach [14]. Based on the causal connection linking news entities, news contents, and news truthfulness, the authors separately modeled the contribution of each cause (entities and contents) during training. They reduced the direct influence of the entities during the inference stage in order to reduce entity bias. Extensive offline studies on the English and Chinese datasets demonstrate that the proposed method may greatly improve the performance of base false news detectors, and online tests validate its superiority in practice.
According to Murayama et al. [15], the bulk of false news datasets are dependent on a specific time period. As a result, detection models developed using such a dataset struggle to identify unexpected fake news brought about by political and societal changes; they may also produce biased output from the input, such as names of particular people and organizations. Because it is a result of the origination date of news in each dataset, the authors called this diachronic bias. The authors developed masking techniques based on Wikidata to reduce the impact of human names and tested whether they strengthen fake news detection algorithms through trials using in-domain and out-of-domain data. Based on their tests, the authors were able to confirm that these masking techniques increased model robustness and accuracy on out-of-domain datasets.
In a different approach, Min et al. [16] formulated the problem of social context-based fake news identification as a diverse graph classification problem and introduced the Post-User Interaction Network (PSIN). This model effectively models the post-post, user-user, and post-user connections in social contexts while maintaining their inherent features through the use of a divide-and-conquer strategy. The authors employed an adversarial topic discriminator for topic-agnostic feature learning in order to broaden the method's applicability to recently emerging subjects. Extensive experiments demonstrate that the proposed method outperforms SOTA baselines in both on-topic and off-topic scenarios.
Sahoo and Gupta [17] proposed an autonomous fake news detection solution for the Chrome environment to detect fake news on Facebook. The authors used deep learning to analyze account behavior by utilizing a range of Facebook account-related data in addition to other features connected to news articles. These authors claimed that multiple experimental analyses of actual material showed that their strategy for detecting false news had higher accuracy than the current state-of-the-art methods.
A thorough analysis of the most advanced techniques for identifying malicious users and bots, based on the various features suggested in the authors' unique taxonomy, was published in a paper by Shahid et al. [18] in 2022. In order to aid researchers who are new to this subject, the authors discussed numerous important problems and prospective future research areas concerning the critical issue of false news detection.
Shu et al. [19] looked at the problem of understanding and using social media user profiles for the identification of fake news in another paper. By tracking users' sharing habits, the authors were able to identify group representative users who are more likely to spread both false and accurate news. Subsequently, they carried out a comparative study of explicit and implicit profile attributes among various user groups to ascertain their capacity to facilitate the differentiation of false from authentic news. The authors demonstrated the usefulness of user profile features with a bogus news categorization problem. The authors further confirmed the effectiveness of these characteristics using feature significance analysis.
In a study by Kaliyar et al. [20], the authors analyzed the substance of the news piece and the echo chambers on social media (groups of people who share the same beliefs) to determine which news was fake. A tensor representing social context, that is, the association between user profiles on social media and news articles, was produced by combining news, user, and community data. The tensor and the news material were integrated to create a representation that included the social context as well as the news information. The suggested method was tested on the real-world dataset BuzzFeed. The decomposition-derived parameters were used as features in the news classification process. An ensemble machine learning classifier (XGBoost) and a deep neural network model (DeepFakE) were employed for the classification task. According to the authors, the proposed model (DeepFakE) beats the current techniques for identifying fake news since it applies deep learning to a combination of social context-based qualities (as an echo chamber) and news content.
Gedara et al. (2024) adopted the fuzzy transform to achieve fake news detection through data reduction. In an attempt to reduce the training period, the Long Short-Term Memory architecture was adopted. The technique, though facing several challenges (the application of a limited number of algorithms), shows a reasonable degree of efficiency when evaluated with the common evaluation metrics [21].
The Similarity-Aware Multimodal Prompt Learning (SAMPLE) technique was adopted by Jiang et al. (2023). This technique, along with three prompt templates, was used to detect fake news. With the introduction of a similarity-aware fusing method, the reported accuracy is encouraging. However, the experimentation was conducted on only a few datasets, which limits the generality and reliability of the results [22].
In an effort to shift focus from the popular approach of textual fake news detection using machine learning, Meel and Vishwakarma in 2021 proposed a multimodal fake news detection technique. The method focuses on four different techniques of multimodal data analysis. The results and comparative evaluation against other similar work show a high degree of accuracy, to the tune of 95.90%. The technique, however, is limited in application, as it is not yet a generalized technique [23,24].
Deepfake is a common technique used on social media to manipulate the faces of internet users in order to maliciously commit a crime on behalf of the original owner of the face. Efforts to solve this social media challenge were the focus of [25]. To solve this problem, the authors used a stacking-based ensemble approach, which features a combination of Xception and EfficientNet-B7. A reasonable degree of accuracy was observed. This is an appreciable improvement over the existing solutions, even though a limited number of algorithms were adopted for the ensemble technique.

Materials and Methods
This section delves into the heart of the study, focusing on the implementation and methodology of utilizing artificial intelligence (AI) to combat the pervasive issue of fake and counterfeit news within the realm of social media. It navigates through the intricacies of data collection, preprocessing, and the selection of supervised AI algorithms tailored for text classification tasks. By elucidating the process of model training, validation, and performance evaluation using standard metrics, this section offers a comprehensive insight into the strategies employed to discern between authentic and deceptive information circulating across various social media platforms.

Dataset 1
The first dataset, consisting of political news, was taken from Kaggle; it was compiled and provided for a community prediction competition [26]. It consists of 25,116 rows of training data, each containing the ID, title, author, text, and label columns, and 5864 rows of test data, each containing the ID, title, author, and text columns. This dataset was one of four datasets used in training our different machine learning models.

Dataset 2
The second dataset came from the University of Victoria's Online Academic Community [27]. Thousands of false news and real articles are combined to create this fake news dataset. The dataset was assembled from a number of stories published on both reputable and shady news websites.

Dataset 3
LIAR is a publicly accessible dataset for identifying false news [28]. POLITIFACT.COM offers a thorough analytical report and links to the original documents for each case; 12.8K manually labeled brief utterances were gathered over a ten-year period in a variety of circumstances. Research that involves fact-checking can also make use of this dataset. Notably, this news dataset is orders of magnitude larger than prior, comparable public fake news databases. The LIAR dataset contains 12.8K human-labeled brief statements from POLITIFACT.COM's API; a POLITIFACT.COM editor verified each statement.

Dataset 4
The fourth dataset was obtained via the https://zenodo.org/record/4561253 (accessed on 23 January 2024) page on the Zenodo website. For the purpose of detecting fake news in text data, ref. [29] created the WELFake dataset, which has 72,134 news stories with 35,028 true and 37,106 fraudulent news items. These are arranged in columns for serial number, title, text, and label. The title represents the news heading, the text represents the news content, and the label indicates whether the news is true or fake (0 being fake and 1 being real). The serial number begins at 0. Of the roughly 78,098 records, only 72,134 are accessible in the dataset. A summary of the datasets is provided in Table 1.

Text Categorization
Text categorization, also known as text classification, is a task in natural language processing (NLP) and machine learning that entails classifying text documents into specified groups or labels according to their content [30]. Text classification aims to automatically assign text documents to one or more specified categories so that organizing, searching, and deriving insights from massive amounts of text data is made simpler. The text classification pipeline is provided in Figure 1 [31].
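As a concrete illustration of such a pipeline, the sketch below assembles a TF-IDF vectorizer and a Passive Aggressive Classifier (two components named elsewhere in this paper) with scikit-learn. The toy texts, labels, and parameters are illustrative assumptions, not the study's actual configuration.

```python
# Minimal text classification pipeline sketch (assumed toy data,
# not this study's actual datasets or tuned parameters).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "official report confirms the election results",
    "scientists publish peer reviewed climate study",
    "shocking miracle cure doctors do not want you to know",
    "celebrity secretly replaced by clone claims anonymous source",
]
labels = [0, 0, 1, 1]  # 0 = real, 1 = fake (toy labels)

# The pipeline vectorizes raw text, then trains the classifier.
model = make_pipeline(TfidfVectorizer(), PassiveAggressiveClassifier(max_iter=1000))
model.fit(texts, labels)

pred = model.predict(["miracle cure shocking secret"])
print(pred[0] in (0, 1))  # every prediction is one of the two classes
```

The same `fit`/`predict` interface applies regardless of which of the benchmarked algorithms is plugged into the second pipeline stage.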

Text Preprocessing
A series of procedures known as text preprocessing are used to clean and prepare text data for analysis [32]. The following actions are usually involved:
i. Lowercasing: to guarantee uniformity in text data, convert every text to lowercase. This lessens the likelihood of case sensitivity problems, which facilitates word matching and processing.
ii. Tokenization: divide the text into discrete words, or tokens. Tokenization plays a crucial role in segmenting sentences or paragraphs into manageable chunks for analysis.
iii. Removal of Stop Words: stop words such as "the", "and", "in", and "of" should be eliminated from the text. For many NLP tasks, these terms are typically not informative.
iv. Removal of Punctuation: symbols, special characters, and punctuation are frequently unnecessary for analysis.
v. Lemmatization: reducing words to their most basic or root form is known as lemmatization. Stemming and lemmatization aid in consolidating related words (e.g., "running" and "ran" both become "run").
vi. HTML Tag Removal: HTML tags may need to be removed if the text data originates from web pages.
vii. Text Normalization: handling email addresses, URLs, and other text-specific patterns may require additional normalization procedures.
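The steps above can be sketched in a minimal, self-contained form. The stop-word list and lemma dictionary below are tiny illustrative stand-ins for the fuller resources (e.g., NLTK's corpora and WordNet lemmatizer) a real pipeline would use.

```python
import re

# Illustrative stand-ins for real stop-word lists and lemmatizers.
STOP_WORDS = {"the", "and", "in", "of", "a", "an", "to", "is"}
LEMMAS = {"running": "run", "ran": "run", "stories": "story"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tag removal
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # punctuation removal
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # naive lemmatization

print(preprocess("<p>The running stories, and the facts!</p>"))
# → ['run', 'story', 'facts']
```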

Feature Extraction
Textual data must be transformed into vector or numerical representations before being fed into machine learning algorithms. This procedure is known as feature extraction. It involves converting unprocessed data into a more comprehensible and useful representation, and it is an essential stage in machine learning and data analysis [33]. It seeks to locate and extract the most pertinent and non-redundant characteristics from raw data, thereby improving the efficiency of machine learning algorithms. There are many techniques for feature extraction from text data; here, a combination of bag-of-words and TF-IDF (Term Frequency-Inverse Document Frequency) was the chosen method.


Bag-of-Words (BoW)
The bag-of-words model is a method of representing text as an unordered collection of words.It is commonly used in natural language processing (NLP) and information retrieval (IR).The bag-of-words model disregards grammar and word order, but it keeps multiplicity.This means that if a word occurs multiple times in a document, it is counted multiple times in the bag-of-words representation of that document [34].
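A minimal sketch of this idea: the representation below discards word order entirely but preserves multiplicity, so a repeated word is counted once per occurrence. The example sentence is illustrative.

```python
from collections import Counter

def bag_of_words(tokens):
    # Order is discarded; only how often each word occurs is kept.
    return Counter(tokens)

doc = "fake news spreads fast and fake claims spread faster".split()
bow = bag_of_words(doc)
print(bow["fake"])  # → 2: the duplicate occurrence is counted twice
print(bag_of_words(list(reversed(doc))) == bow)  # → True: order is irrelevant
```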

Term Frequency (TF)
A statistical metric called Term Frequency-Inverse Document Frequency (TF-IDF) assesses a word's significance in a document within a set of documents [35]. It is a commonly used method in text mining and information retrieval.
The term frequency (TF) of a term (word) is the number of times it appears in a document. It is a straightforward indicator of how frequently a term occurs in a given document.

Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) measures how rare a term is across a set of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents that contain the term.

TF-IDF
TF-IDF is the product of TF and IDF. It evaluates a term's significance in a document by considering both its occurrence in the document and its rarity within the collection. A high TF-IDF score indicates a term's importance to the document, not merely its commonness [35].
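The computation can be sketched as follows, using raw counts for TF and log(N/df) for IDF over an illustrative toy corpus; note that library implementations (e.g., scikit-learn's TfidfVectorizer) typically use smoothed and normalized variants of this formula.

```python
import math

# Toy corpus of three tokenized documents (illustrative only).
corpus = [
    ["election", "news", "report"],
    ["fake", "news", "alert"],
    ["weather", "report"],
]

def tf(term, doc):
    return doc.count(term)  # raw term frequency within one document

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

doc = corpus[1]
# "news" appears in 2 of 3 documents, "fake" in only 1, so "fake" gets
# the higher weight even though both occur once in this document.
print(tf_idf("fake", doc, corpus) > tf_idf("news", doc, corpus))  # → True
```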

"lbfgs" Logistic Regression
"lbfgs" logistic regression refers to a specific type of logistic regression algorithm that uses the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) optimization method.Logistic regression is a statistical method used for binary classification problems.Based on one or more predictor factors, it estimates the probability of a binary result (1/0, Yes/No, or True/False).L-BFGS is a popular optimization algorithm used for finding the minimum of a function, typically in the context of machine learning and numerical optimization.It is an iterative method that belongs to the family of quasi-Newton methods.L-BFGS is known for being memory-efficient and suitable for problems with a large number of parameters.
When combined, "lbfgs Logistic Regression" suggests that logistic regression is being used with the L-BFGS optimization method to train a binary classification model.This combination is often employed when dealing with machine learning tasks that require optimizing the logistic regression model's parameters to fit the data.
The sigmoid function, a logistic function, is used in logistic regression to map predictions and their probability.An S-shaped curve known as the sigmoid function transforms any real number into a range between 0 and 1.
For logistic regression, the sigmoid function serves as the activation function:

P(Y = 1|X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + ... + βnXn)))

where P(Y = 1|X) is the probability of the target variable Y being 1 given the predictors X, β0, β1, β2, ..., βn are the coefficients of the model, X1, X2, ..., Xn are the predictor variables, and e is the base of the natural logarithm (Euler's number).
The logistic function 1/(1 + e^(−x)) transforms the output of a linear equation into a range between 0 and 1, representing probabilities [36].
The coefficients β0, β1, β2, ..., βn are estimated using optimization algorithms (often maximum likelihood estimation) to minimize the error between the predicted probabilities and the actual outcomes in the training data.
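A minimal sketch with scikit-learn (synthetic data invented for illustration); solver="lbfgs" selects the L-BFGS optimizer described above, which estimates the coefficients iteratively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data generated from a linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) > 0).astype(int)

# L-BFGS iteratively estimates the coefficients (the β values).
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:1])[0, 1]  # sigmoid output, a value in [0, 1]
acc = clf.score(X, y)
print(acc, proba)
```

The same sketch applies to the "liblinear", "newton-cg", and "sag" variants discussed below by changing only the solver argument.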

"liblinear" Logistic Regression
liblinear is a library for large-scale linear classification that includes logistic regression. It is a popular choice for training logistic regression models on large datasets because it is fast and efficient. liblinear solves the logistic regression optimization problem using coordinate descent.
Coordinate descent is an iterative algorithm that optimizes one coordinate at a time. It is simple, efficient, and well suited to large datasets. The "liblinear" solver is based on a linear support vector machine (SVM) formulation and works well for small- to medium-sized datasets. It optimizes the logistic regression cost function using techniques such as coordinate descent. Refer to Equations (1) and (2) for reference.

"newton-cg" Logistic Regression
The "newton-cg" solver stands for "Newton Conjugate-Gradient".It is a numerical optimization technique that combines the Newton-Raphson method with the conjugate gradient method to determine the best logistic regression model parameters.This solver is known for its efficiency and suitability for a wide range of logistic regression problems.
When combined, "newton-cg Logistic Regression" indicates that logistic regression is being used, and the "newton-cg" solver is chosen as the optimization method to train the logistic regression model.This combination is often applied when you have a logistic regression problem that benefits from the characteristics of the "newton-cg" optimization algorithm.Refer to Equations ( 1) and (2) for reference.

"sag" Logistic Regression
"sag" logistic regression uses stochastic average gradient descent (SAG) to train the model. SAG is an iterative algorithm that updates the model parameters by averaging the gradients of a small batch of samples at each iteration. It is a fast and efficient algorithm for training logistic regression models on large datasets, is particularly well suited to datasets with a large number of features, and is less sensitive to the initial parameter values than other logistic regression solvers. Refer to Equations (1) and (2) for reference.

5. Random Forest
Random Forest is a supervised machine learning approach that can be used for regression and classification tasks. To generate a final prediction, it builds a large number of decision trees during the training phase and aggregates their outputs (by majority vote for classification or by averaging for regression).
The foundation of Random Forests is ensemble learning, which combines the predictions of several different models to generate a more accurate forecast. Because of this, Random Forests are less likely than individual decision trees to overfit.
In contrast to decision tree classifiers, which select the root node using the Gini index or information gain, Random Forests select the root node randomly, and the splitting of the attribute nodes also occurs randomly [37].
The impurity at a node can be measured, for example, by the entropy −P+ log2 P+ − P− log2 P−, where P+ represents the likelihood of a positive class and P− denotes the likelihood of a negative class [38].
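A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data (invented here as a stand-in for a vectorized news dataset): 100 randomized trees are built and their votes are aggregated into the final prediction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a vectorized news dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 100 randomized trees; the ensemble vote is the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
acc = forest.score(X, y)
print(acc, len(forest.estimators_))
```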

Perceptron
The perceptron is a linear classifier: its predictions are based on a linear predictor function that combines the feature vector with a set of weights.
A collection of input values is fed into the perceptron, and each input is multiplied by a corresponding weight. An activation function is then applied to the sum of these products to produce an output. The activation function is usually a step function that determines whether the input is above or below a threshold and outputs a binary value (0 or 1) accordingly. Many algorithms can be used to train a perceptron, but the perceptron learning algorithm is the most widely used: it iteratively modifies the perceptron's weights until all of the training data are classified correctly.
The following is the fundamental equation for a perceptron's operation:

f(x) = 1 if w · x + b > 0, and 0 otherwise, with w · x = ∑_{i=1}^{m} w_i x_i

where w is a vector of real-valued weights, b is the bias, an independent term that shifts the decision boundary away from the origin, and x is the vector of input values. Here, m denotes the number of perceptron inputs. The output can be represented as "1" or "0"; depending on the activation function being used, it can also be represented as "1" or "−1" [39].
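As an illustrative sketch with scikit-learn's Perceptron (toy data invented here): the AND gate is linearly separable, so the perceptron learning rule is guaranteed to converge on it.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# The AND gate is linearly separable, so the perceptron
# learning rule converges to a perfect separator.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

p = Perceptron(max_iter=100, tol=None, random_state=0).fit(X, y)
preds = p.predict(X)  # step-function output: 0 or 1
print(preds, p.coef_, p.intercept_)
```

The learned coef_ and intercept_ correspond to w and b in the equation above.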

7. Ridge Classifier
A Ridge Classifier is a linear classification algorithm built on Ridge regression. It is used to categorize data into two or more classes based on their features. Ridge regression is a regularized linear regression approach that prevents overfitting by including a penalty term in the loss function. This regularization term penalizes large model coefficients, which makes the model learn less complex correlations between the target variable and the data.
The Ridge Classifier works by first converting the target variable into {−1, 1} and then treating the problem as a regression task. It uses the Ridge regression algorithm to learn a linear model that predicts the target variable; the predicted class is then determined by the sign of the predicted target value.
The Ridge Classifier's objective is to minimize the following cost function:

Cost(w) = ∑_i loss(y_i, ŷ_i) + α‖w‖²

Here, loss(y_i, ŷ_i) represents the loss function used for classification (typically logistic loss or squared hinge loss), α is the regularization parameter that controls the strength of the penalty term, and w denotes the coefficients (weights) of the linear function.
The Ridge Classifier seeks to find the optimal weights (w) that minimize this cost function, balancing between fitting the training data well and keeping the coefficients small to prevent overfitting.
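A minimal sketch with scikit-learn's RidgeClassifier (synthetic data invented for illustration); alpha corresponds to the penalty strength α above, and the labels are internally mapped to {−1, 1} as described.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# alpha is the regularization strength: larger alpha shrinks
# the coefficients w more aggressively to prevent overfitting.
clf = RidgeClassifier(alpha=1.0).fit(X, y)
acc = clf.score(X, y)
print(acc)
```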

8. CatBoost Classifier (CatBoost)
A CatBoost classifier is a gradient boosting classifier designed specifically to handle categorical features. It employs a range of methods to enhance gradient boosting's effectiveness on categorical data, including the following:
i. Ordered boosting: CatBoost uses ordered boosting to discover connections among categorical variables with an inherent ordering.
ii. Feature hashing: To handle a large number of categorical features efficiently, CatBoost takes advantage of feature hashing.
iii. Oblivious trees: CatBoost uses oblivious trees, decision trees that are insensitive to the way the categorical features are arranged.
The ensemble prediction is

F(x) = B0 + ∑_{m=1}^{M} f_m(x)

Here, F(x) is the final prediction for input x, B0 is the initial prediction (often a global constant or the average of the target variable), M is the number of trees in the ensemble, and f_m(x) represents the prediction of the m-th tree for input x.
To arrive at the final prediction, the starting prediction is added to the increments predicted by each individual tree f_m(x). These trees are built step by step by the CatBoost algorithm, with each new tree attempting to rectify the mistakes the ensemble has made thus far.

Nearest Centroid Classifier
A Nearest Centroid Classifier is a simple and effective classification algorithm that assigns a new data point to the class whose centroid is closest to it. The centroid of a class is the mean of all the data points in that class.
To classify a new data point using the Nearest Centroid Classifier, the following steps are typically taken:
i. Determine how far the new data point is from each class's centroid.
ii. Assign the new data point to the class whose centroid is nearest to it.
Being a non-parametric technique, the Nearest Centroid Classifier makes no assumptions about the data's underlying distribution. This makes it an adaptable method that works well for a range of applications.
Let us define the key components:
i. x_ij: feature vector of the i-th sample in class j.
ii. c_j: centroid for class j.
iii. N: number of features.
iv. n_j: number of samples in class j.
The steps involved are detailed below.

Centroid Calculation
For each class j, calculate the centroid c_j as the mean of the feature vectors belonging to that class:

c_j = (1/n_j) ∑_{i=1}^{n_j} x_ij

This equation computes the centroid c_j for class j by averaging the feature vectors x_ij within that class.

Distance Calculation
Given a new test instance x_test, compute the distance from x_test to each centroid c_j using a distance metric such as the Euclidean distance:

d(x_test, c_j) = √(∑_{k=1}^{N} (x_test,k − c_j,k)²)    (9)

Equation (9) calculates the Euclidean distance between the test instance x_test and each centroid c_j across all features.
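The two steps above can be sketched directly in NumPy (toy data invented here) and cross-checked against scikit-learn's NearestCentroid implementation.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])

# Centroid calculation: mean of the feature vectors in each class.
classes = np.unique(y)
centroids = np.array([X[y == c].mean(axis=0) for c in classes])

# Distance calculation: Euclidean distance to each centroid.
x_test = np.array([4.5, 4.5])
dists = np.linalg.norm(centroids - x_test, axis=1)
pred = classes[np.argmin(dists)]

# scikit-learn's implementation gives the same assignment.
assert NearestCentroid().fit(X, y).predict([x_test])[0] == pred
print(centroids, pred)
```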

Stochastic Gradient Descent (SGD Classifier)
Stochastic Gradient Descent (SGD) is an optimization algorithm for training machine learning models. It updates the model parameters iteratively in the direction of the negative gradient of the loss function; the gradient indicates the direction in which the parameters should be adjusted to minimize the loss.
Because SGD is stochastic, it uses only one training example at a time to update the model's parameters. This makes SGD particularly effective for training on huge datasets, where updating the parameters using the full dataset at once would be impractical [40].
Equation (10) is the straight-line equation, where m is the slope and b is the intercept. The parameters are updated by small changes, m = m − ∂m and b = b − ∂b, where ∂m and ∂b are gradient-based updates. The cost function for N samples is Cost = (1/N) ∑_{i=1}^{N} (Yi′ − Yi)².

11. Support Vector Classifier-SVC (kernel = "linear", C = 0.025)
SVC here is a linear-kernel support vector machine (SVM) classifier with a regularization parameter of 0.025. SVMs are a class of machine learning algorithms applicable to both regression and classification tasks. To divide the data points into two classes, they locate a hyperplane in the feature space.
The linear kernel is a simple kernel that calculates the dot product between two data points, making it a good choice for datasets with a small number of features. The regularization parameter controls how much the model is penalized for misclassifying data points: a larger value yields a more complex model that fits the training set more closely and is more prone to overfitting [40]. The Support Vector Classifier (SVC) is a specific implementation of the support vector machine (SVM) algorithm for classification tasks, and its mathematical representation closely follows the general SVM formulation for binary classification. For simplicity, consider a linear SVC for linearly separable classes. Given the training data (x_i, y_i), where x_i are feature vectors and y_i are class labels (+1 or −1 for binary classification), the decision function for SVC is similar to that of the SVM:

f(x) = sign(w · x + b)

where x represents the input feature vector, w represents the weight vector, b is the bias term, and · denotes the dot product between vectors.
The optimization problem for SVC aims to find the optimal hyperplane that separates the classes while maximizing the margin and minimizing classification errors:

minimize (1/2)‖w‖²  subject to  y_i(w · x_i + b) ≥ 1 for all i

where ‖w‖ represents the Euclidean norm of w. The objective function minimizes ‖w‖² to maximize the margin, and the constraints ensure that data points are correctly classified and sufficiently far from the decision boundary (at a distance of at least 1/‖w‖).
12. Support Vector Classifier-SVC (kernel = "rbf", gamma = 2, C = 1)
This Support Vector Classifier is an SVM classifier with a radial basis function (RBF) kernel and a regularization parameter of 1. RBF SVMs are machine learning methods that can be applied to both classification and regression problems. To divide the data points into two classes, they locate a hyperplane in the feature space.
The RBF kernel is a non-linear kernel that uses the distance between two data points to measure their similarity, making it a good option for datasets whose data are not linearly separable. The regularization parameter determines how much the model is penalized for misclassifying data points; a larger value produces a more sophisticated model that is more prone to overfitting the training set. Refer to Equations (11) and (12) for reference.

LinearSVC
LinearSVC is a support vector machine (SVM) classifier with a linear kernel. SVMs are a class of machine learning algorithms applicable to both regression and classification tasks; to divide the data points into two classes, they locate a hyperplane in the feature space. LinearSVC is similar to the linear SVM and SVC but is implemented with a different optimization algorithm, typically based on the LIBLINEAR library, making it more suitable for large-scale datasets. The optimization problem is expressed below.
LinearSVC is a faster and more efficient implementation of SVM classification for the case of a linear kernel.It also has fewer parameters to tune, making it easier to use.
min_{w,b} (1/2)‖w‖² + C ∑_i max(0, 1 − y_i(w · x_i + b))

Here, w represents the weight vector, b is the bias term, C is the regularization parameter, controlling the trade-off between maximizing the margin and minimizing the classification error, and · denotes the dot product between vectors. The objective function consists of two terms: the first minimizes the norm of the weight vector to maximize the margin, while the second is the hinge loss, which penalizes misclassifications. The parameter C balances the importance of these two terms, controlling the regularization strength.
LinearSVC aims to find the optimal w and b defining the hyperplane that separates the classes by solving this optimization problem with efficient algorithms. It constructs a linear decision boundary in the input space to classify data points into different classes. LinearSVC is particularly efficient for large-scale datasets and linearly separable problems, offering a computationally tractable solution for linear classification tasks.
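A minimal text-classification sketch combining TF-IDF features with LinearSVC (the toy headlines and labels below are invented for illustration, not drawn from the study's datasets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy headlines; 1 = fake-leaning, 0 = genuine-leaning.
texts = [
    "shocking miracle cure revealed",
    "council approves road budget",
    "you won a free prize click now",
    "local school opens new library",
]
labels = [1, 0, 1, 0]

# TF-IDF vectorization feeds a linear decision boundary.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(texts, labels)
train_acc = model.score(texts, labels)
print(train_acc)
```

The pipeline mirrors the paper's workflow: vectorize the text, then fit a linear separator over the resulting feature space.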

ZeroR Classifier
The ZeroR classifier is among the most straightforward and fundamental machine learning algorithms. It is essentially a baseline or reference model: rather than a learning algorithm in the conventional sense, it functions as a simple benchmark for assessing the effectiveness of more intricate machine learning models.
The ZeroR algorithm generates predictions using only the most frequent class or value in the training dataset. In a classification task, it assigns every instance to the class that occurs most frequently in the training set; in regression tasks, every instance is given a constant value, usually the mean or median of the target variable.
The principle behind ZeroR is straightforward. For classification tasks, it predicts the most frequent class label from the training data for all instances in the test data, disregarding any features or input variables. Essentially, it always predicts the same class: the mode of the target variable in the training set.
For regression tasks, it predicts the mean or median of the target variable in the training set for all instances in the test data, irrespective of the input features.
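scikit-learn has no estimator named ZeroR, but DummyClassifier with strategy="most_frequent" implements the same baseline; the toy labels below are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1, 1, 1, 0, 1, 0, 1])  # the majority class is 1
X = np.zeros((7, 3))                 # features are ignored entirely

zero_r = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = zero_r.predict(np.zeros((2, 3)))  # always the modal class
print(preds)
```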

Decision Tree Classifier
A decision tree classifier is a supervised machine learning algorithm that can be applied to both regression and classification tasks. It works by building a tree-like model of the data, in which each node represents a decision. The model then uses this tree to make predictions on fresh data points [41].
Splits are chosen with an impurity measure such as the entropy −∑_i p_i log2 p_i, where p_i is the probability of the i-th class.

Passive Aggressive Classifier
This classifier is particularly useful in large-scale, streaming, or online learning scenarios, where data arrive sequentially and models need to adapt to changes. The "passive-aggressive" name is derived from the algorithm's behavior when updating its model: it adjusts its parameters based on a trade-off between a "passive" update (minimizing the loss function) and an "aggressive" update (correcting misclassifications).
The update rule for the Passive Aggressive Classifier is typically based on the loss function and the gradient of the loss.
Given w as the weight vector, x as the input vector, y as the true label, ŷ as the predicted label, η as the learning rate, and C as the regularization parameter, the update rule for the weight vector w of the Passive Aggressive Classifier is often represented as

w ← w − η(∂L/∂w + C·w)

where the loss L depends on the error term y − ŷ, ∂L/∂w represents the gradient of the loss with respect to the weight vector, and the regularization term C·w penalizes large weights to control overfitting.
The Passive Aggressive algorithm adjusts the weights so as to minimize the loss while staying close to the previous weight vector. The aggressiveness of the update is controlled by the learning rate (η) and the magnitude of the error term.
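The streaming setting the algorithm targets can be sketched with scikit-learn's PassiveAggressiveClassifier and partial_fit (synthetic data invented for illustration): batches arrive one at a time and the weights are updated after each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
classes = np.unique(y)

pac = PassiveAggressiveClassifier(C=1.0, random_state=1)

# partial_fit mimics streaming: data arrive in batches and
# the model updates its weights after each one.
for start in range(0, len(X), 100):
    pac.partial_fit(X[start:start + 100], y[start:start + 100], classes=classes)

acc = pac.score(X, y)
print(acc)
```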

Extra Tree Classifier
The Extra Tree Classifier is an ensemble learning algorithm that predicts using a set of decision trees. It is comparable to the Random Forest Classifier, with a few significant differences: the Extra Tree Classifier splits the features and samples randomly at each node of the tree, while the Random Forest Classifier uses a more informed approach to selecting features and samples, and the Extra Tree Classifier does not bootstrap the training data, while the Random Forest Classifier does.

Random Patches
Random Patches is an ensemble machine learning method that blends the ideas of random subspaces and bagging. Its main application is to enhance the performance of machine learning models, particularly Random Forests and decision trees. Because Random Patches randomizes both the data and the features used for training, it is also known as "Feature Bagging".

Sampling Data
For each model i in the ensemble M, a subset of the training data D i is randomly sampled with replacement from the original training set D. This sampling might include N samples from D, where N is less than the total number of samples in D.

Selecting Features
For each subset D_i, a random subset of features F_i is chosen. This might involve selecting a certain number K of features randomly from the total feature set F.
These randomly generated subsets are then used to train separate models (such as decision trees or other classifiers). During inference, the predictions from each model are combined, by voting (for classification) or averaging (for regression), to produce the final forecast.
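The two sampling steps above can be sketched with scikit-learn's BaggingClassifier, whose max_samples and max_features arguments randomize the rows and columns respectively (synthetic data invented for illustration; the base estimator defaults to a decision tree).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Random Patches = bagging over both samples and features:
# each base tree sees a random 60% of the rows and 50% of the columns.
patches = BaggingClassifier(
    n_estimators=50,
    max_samples=0.6,
    max_features=0.5,
    bootstrap=True,        # sample rows with replacement
    random_state=0,
).fit(X, y)

acc = patches.score(X, y)
print(acc, len(patches.estimators_))
```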

Voting Classifier
A voting classifier is an ensemble learning algorithm that creates predictions by combining several classifiers. A set of base classifiers is trained on the training data, and their predictions are then combined, by majority vote or by averaging probabilities, to arrive at a final prediction.
The voting classifier combines the predictions of multiple base models M1, M2, ..., Mn as follows. For Hard Voting,

ŷ = argmax_c ∑_{i=1}^{n} 1(M_i(x) = c)

where 1 is the indicator function, M_i(x) represents the prediction of the i-th model for input x, and c is the class label. For Soft Voting (for classifiers that output probabilities),

ŷ = argmax_c ∑_{i=1}^{n} P(M_i(x) = c)

where P(M_i(x) = c) represents the probability of class c predicted by the i-th model for input x.
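A minimal sketch with scikit-learn's VotingClassifier combining three heterogeneous base models (synthetic data invented for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Hard voting counts each model's class vote; soft voting would
# average the predicted probabilities instead.
vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=500)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
).fit(X, y)

acc = vote.score(X, y)
print(acc)
```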

Stacked Generalization
Stacked generalization, also known as super learning, is an ensemble learning method that combines the predictions of several machine learning models to generate a more accurate final prediction. It operates by using the base models' outputs to train a meta-model, also called a stacking model; the meta-model learns how to combine the predictions of the base models so as to minimize the total error.
Stacked generalization (stacking) thus involves combining the predictions from multiple base models to train a meta-model that learns to make the final predictions. Let us break down the mathematical representation step by step.
Given the training dataset (X, y) and M diverse base models denoted f1, f2, ..., fM, the procedure is as follows.
Base Model Training: train the base models f_i on the training dataset X to obtain predictions ŷ_i = f_i(X), for i = 1, 2, ..., M.
Formation of the Meta-Features: use the predictions of these base models as new features (meta-features) for the meta-model, forming a new dataset X_meta = [ŷ1, ŷ2, ..., ŷM]. Here, X_meta is a matrix in which each row represents an instance in the original dataset and each column contains the predictions made by one of the base models.
Training the Meta-Model: train a meta-model (meta-learner) g using the meta-features X_meta and the true target values y, so that g(X_meta) ≈ y.

Multilayer Perceptron (MLP) Classifier
A multilayer perceptron (MLP) classifier is a kind of artificial neural network (ANN) suitable for both regression and classification tasks. MLPs comprise multiple layers of interconnected nodes, or neurons; every neuron in one layer is linked to every neuron in the next layer. MLP classifiers are trained with backpropagation, a supervised learning technique that minimizes the loss function by modifying the weights of the connections between neurons. The loss function indicates how accurately the MLP predicts the target values.
Let us consider a simple MLP classifier with multiple hidden layers. X represents the input features, W^(i) the weights of the connections between the i-th layer and the (i + 1)-th layer, b^(i) the bias terms for the i-th layer, and A^(i) the activation of the i-th layer.
The forward pass through an MLP with multiple hidden layers can be represented mathematically as follows.
Input Layer to Hidden Layer (Layer 1 to Layer 2):

Z^(1) = W^(1)X + b^(1),  A^(1) = σ(Z^(1))

where Z^(1) is the weighted sum of the inputs, A^(1) is the activation of the first hidden layer, and σ is the activation function (e.g., ReLU, sigmoid, or tanh) applied element-wise to Z^(1).
Hidden Layers (Layer 2 to Layer N − 1): for each subsequent hidden layer i (from 2 to N − 1),

Z^(i) = W^(i)A^(i−1) + b^(i),  A^(i) = σ(Z^(i))

where Z^(i) is the weighted sum of activations from the previous layer and A^(i) is the activation of the i-th hidden layer.
Last Hidden Layer to Output Layer (Layer N − 1 to Output Layer):

Z^(N) = W^(N)A^(N−1) + b^(N),  ŷ = σ(Z^(N))

where Z^(N) is the weighted sum of activations from the last hidden layer and ŷ is the final output or prediction.

Bernoulli Restricted Boltzmann Machine (Bernoulli RBM)
A Bernoulli Restricted Boltzmann Machine is a kind of neural network model, more precisely a generative stochastic artificial neural network. RBMs are employed in unsupervised learning applications such as collaborative filtering, dimensionality reduction, and feature learning. The term "Bernoulli" in the name indicates the probability distribution used for the binary units in the model.
A Bernoulli RBM consists of two layers of units: a visible layer and a hidden layer. The visible layer represents the input data, and the hidden layer represents the latent representation of the data. The units in both layers are binary: each can be either on or off.
It has a mathematical representation in terms of probability related to the visible and hidden units and the energy function.
Suppose that we have a Bernoulli RBM with M hidden and N visible units.
Energy Function: the energy of a configuration of visible and hidden units in an RBM is given by

E(v, h) = −∑_i a_i v_i − ∑_j b_j h_j − ∑_i ∑_j v_i W_ij h_j

where v = (v1, v2, ..., vN) represents the states of the visible units, h = (h1, h2, ..., hM) represents the states of the hidden units, W_ij are the weights between visible unit i and hidden unit j, and a_i and b_j are the biases for visible unit i and hidden unit j, respectively.
Joint Probability: the joint probability of a configuration of visible and hidden units in an RBM is defined using the energy function:

P(v, h) = e^(−E(v, h)) / Z

where Z is the normalization constant (partition function), calculated by summing over all possible configurations of visible and hidden units: Z = ∑_{v,h} e^(−E(v, h)).
Conditional Probabilities: the conditional probabilities for the states of the hidden units given the visible units, and vice versa, are

P(h_j = 1 | v) = σ(b_j + ∑_i v_i W_ij),  P(v_i = 1 | h) = σ(a_i + ∑_j W_ij h_j)

where σ(x) = 1/(1 + e^(−x)) is the sigmoid function.
To train a Bernoulli RBM, one usually maximizes the log-likelihood of the training data to determine the weights and biases. Methods such as Stochastic Gradient Descent and Contrastive Divergence (CD) are frequently employed to update the parameters based on observed data samples.
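A minimal sketch with scikit-learn's BernoulliRBM (invented binary toy data): the RBM learns a latent representation whose transform outputs are the conditional probabilities P(h_j = 1 | v); here it feeds a logistic regression head, one plausible way an RBM can be used in a classification pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Binary visible units, as the Bernoulli RBM assumes.
rng = np.random.default_rng(0)
X = (rng.random((100, 16)) > 0.5).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
model = Pipeline([("rbm", rbm), ("clf", LogisticRegression(max_iter=500))])
model.fit(X, y)

# transform() returns P(h_j = 1 | v) for each hidden unit.
hidden = rbm.transform(X)
print(hidden.shape)
```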

AdaBoost Classifier
An AdaBoost Classifier is an ensemble learning approach that builds a strong classifier by combining several weak classifiers. It trains a set of weak classifiers iteratively on the training data and then modifies the weights of the weak classifiers according to how well they perform. A weighted average of the predictions made by the weak classifiers makes up the AdaBoost Classifier's final prediction.
The ensemble prediction is computed by combining the weighted predictions of all the weak learners:

H(x) = sign(∑_{t=1}^{T} α_t h_t(x))

where H(x) is the final prediction for sample x, α_t is the weight of weak learner h_t, and T is the total number of weak learners.
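A minimal sketch with scikit-learn's AdaBoostClassifier (synthetic data invented for illustration); its default weak learner is a depth-1 decision tree, i.e., a decision stump.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each boosting round re-weights the samples the current
# ensemble misclassifies, then fits the next weak learner.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = ada.score(X, y)
print(acc)
```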

Gradient Boosting Classifier
A gradient boosting classifier is an ensemble machine learning algorithm used for classification tasks. It builds predictive models by combining multiple decision trees through gradient descent optimization. The method is known for its high predictive accuracy and robustness in handling complex relationships in data and is widely used in applications such as spam detection, fraud detection, and image classification. Key parameters to tune include the learning rate, the number of trees, and the tree depth.
Gradient boosting optimizes the ensemble model by adding weak learners one after another, each improving on the prior models. Because each new learner corrects the mistakes committed by the previous ensemble, an effective prediction model is produced.

Ordinal Learning Model
An ordinal learning model is a kind of machine learning model used to predict ordered categorical variables. This means that rather than just predicting the class label, the model also predicts the rank or degree of a variable. Ordinal learning models are frequently employed when data are naturally ordered, as in customer satisfaction surveys or medical diagnoses.
Because conventional classification and regression models do not account for the ordered character of the target variable, they are not immediately appropriate for ordinal data. Ordinal learning models were created to solve this problem and produce predictions that respect the data's ordinal structure.
Ordinal Logistic Regression is an example of an ordinal regression model. Its formula is given by the cumulative probabilities associated with the ordered categories. Take, for example, a variable Y with K ordered categories (low, medium, and high).
The cumulative probability for a given category k is represented as

P(Y ≤ k | X) = 1 / (1 + e^(−(αk + β1X1 + β2X2 + ... + βpXp)))

where P(Y ≤ k|X) is the probability that the outcome Y is less than or equal to category k given the input features X, αk is the intercept specific to category k, and β1, β2, ..., βp are the coefficients associated with the features X1, X2, ..., Xp, respectively. This formula extends logistic regression, which is based on the logistic (sigmoid) function, to accommodate ordered categorical outcomes. The parameters αk and β are determined during training to maximize the probability of observing the provided ordinal outcomes given the input features.

Extreme Gradient Boosting (XGBoost)
Extreme gradient boosting is an ensemble learning algorithm: it creates a stronger prediction by combining the predictions of several weak learners, which in XGBoost are decision trees. XGBoost constructs the decision trees sequentially, with each tree trained to fix the mistakes of the one before it. XGBoost optimizes an objective function made up of a loss function and a regularization term. The objective to be minimized is the sum of the loss function L and the regularization term Ω:

Objective = L(predictions, labels) + Ω(model)

The particular equations used in the execution of XGBoost include the optimization of this objective function as well as the gradient and Hessian computations for building and refining the ensemble of trees.

Decision Stump
A decision stump is a straightforward machine learning model consisting of a decision tree with only one level. For binary classification, it generates predictions from the value of a single input feature. Decision stumps are frequently employed as building blocks in more intricate machine learning methods, including gradient boosting machines and AdaBoost.
A decision stump finds the optimal split of the data based on a single feature. Assume that we have a dataset containing one feature, x, and the target variable y.
The decision stump's split condition can be represented as

ŷ = C1 if x ≤ θ, and ŷ = C2 otherwise

where x is the feature value, θ is the threshold value used to split the data, and C1 and C2 are the predicted classes on either side of the split. The decision stump algorithm searches the feature space for the threshold value that optimizes a given criterion, frequently information gain or purity; for example, it could minimize Gini impurity or misclassification errors in classification tasks.
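Since a decision stump is just a depth-1 tree, it can be sketched with scikit-learn's DecisionTreeClassifier and max_depth=1 (toy data invented for illustration); the learned threshold corresponds to θ above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature x, one learned threshold θ: exactly a decision stump.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
theta = stump.tree_.threshold[0]       # the split point θ
preds = stump.predict([[2.5], [7.5]])  # C1 below θ, C2 above
print(theta, preds)
```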

Complement Naïve Bayes (CNB)
Complement Naïve Bayes is a variation of the Naive Bayes classifier created to overcome some of the drawbacks of the original Naive Bayes algorithm.
The Naive Bayes classifier's primary drawback is its assumption that the features are unrelated to one another. In real-world data, this is frequently not the case, because features might be correlated. By identifying the relationships between features and applying this knowledge, Complement NB increases classification accuracy.
The Naive Bayes classifier's tendency to favor the majority class in unbalanced datasets is another drawback. Complement NB overcomes this constraint by utilizing a weighted log-likelihood function that assigns greater weight to the minority class.
Bayes' theorem provides the following formula for the conditional probability in Complement Naive Bayes:

P(c|x) = P(x|c) P(c) / P(x),

where P(c|x) is the probability of class c given the features x, P(x|c) is the likelihood of observing the features x given class c, P(c) is the prior probability of class c, and P(x) is the evidence probability.
The class-conditional probability is computed differently in Complement Naive Bayes. Rather than directly modeling the chance of observing the features given the class, CNB determines the chance of the features given the class's absence (i.e., the complement of the class): P(x|¬c).
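A brief sketch using scikit-learn's ComplementNB follows; the four documents and their fake/real labels are invented toy values, not drawn from the paper's datasets.

```python
# ComplementNB estimates feature statistics from the *complement* of each
# class, P(x | not c), which makes it more robust on imbalanced text data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

docs = ["breaking shocking claim", "official report confirms",
        "shocking hoax exposed", "verified statement released"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (toy labels)

X = CountVectorizer().fit_transform(docs)
clf = ComplementNB().fit(X, labels)
pred = clf.predict(X)
```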

Multinomial Naïve Bayes (MNB)
Multinomial feature classification problems are the specialty of MNB. Features that can take a set number of values, such as the frequency with which a word appears in a document, are known as multinomial features.
MNB operates by computing how likely each class is given the features of the data item. The class with the highest probability is then predicted as the correct class.
P(x|c) represents the likelihood of observing the features x given class c. Assuming feature independence, Multinomial Naive Bayes estimates this likelihood as the product of the probabilities of each feature (word or token) given the class:

P(x|c) = P(x_1|c) × P(x_2|c) × ... × P(x_n|c),

where i indexes the i-th feature (word or token) and n is the total number of features.
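A short sketch with scikit-learn's MultinomialNB on word counts; as above, the documents and labels are toy placeholders.

```python
# MultinomialNB models word counts: under feature independence, the
# class-conditional likelihood factorizes over the vocabulary terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["fake fake hoax", "real official report",
        "hoax exposed fake", "report verified real"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (toy labels)

X = CountVectorizer().fit_transform(docs)
clf = MultinomialNB().fit(X, labels)
proba = clf.predict_proba(X)  # P(c | x) for each document
```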

Justification for the Choice of Ensemble Models/Techniques Chosen
Twelve ensemble models, which include Random Forest, CatBoost, Random Patches, voting, stacked generalization, multilayer perceptron (MLP), Bernoulli RBM, AdaBoost, gradient boosting, ordinal learning model (OLM), XGBoost, and Extra Tree, were selected from a total of 29 methods. It should be noted that all 12 algorithms made their decisions using one or more ensemble techniques (such as stacking, voting, bagging, and boosting). Since AdaBoost, bagging, and Random Forest are existing meta-estimators, the algorithms in the ensemble were selected to offer a fresh combination of techniques. A combination of Random Forest, SVC, and decision tree using the voting ensemble technique, or of Random Forest, SVC, and gradient boosting using the stacking ensemble technique, was seen to produce superior outcomes, especially when compared to traditional algorithms.
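As an illustrative sketch only (the hyperparameters and data are toy values, not the paper's configuration), the two highlighted combinations can be built with scikit-learn:

```python
# Hard voting over RF + SVC + decision tree, and stacking of
# RF + SVC + gradient boosting with a default logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

voter = VotingClassifier([
    ("rf", RandomForestClassifier(random_state=0)),
    ("svc", SVC(random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
])
stacker = StackingClassifier([
    ("rf", RandomForestClassifier(random_state=0)),
    ("svc", SVC(random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
])
voter.fit(X, y)
stacker.fit(X, y)
```

Stacking trains the meta-learner on cross-validated predictions of the base models, whereas voting simply takes the majority class.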

Performance Metrics for Evaluation
The effectiveness of the supervised artificial intelligence algorithms on the test data was evaluated using a confusion matrix. It is particularly useful for evaluating a model's ability to predict different classes or categories within a dataset. A confusion matrix shows the outcomes of a model's attempt to classify a batch of data points into one of several predefined classes or labels. In a typical confusion matrix, the rows represent the actual classes, or ground truth, and the columns represent the classes that the model predicted. The cells of the matrix contain counts of the number of data points that belong to each combination of actual and predicted classes. A confusion matrix yields the following four crucial values:
• True Positives (TPs): Instances in which the model successfully predicted the positive class (the class the instance truly belongs to). In this study, this is fake news that was labeled as fake news.
• True Negatives (TNs): Instances in which the model properly identified the negative class as not falling under the positive category. Here, this is real news that was classified as real news.
• False Positives (FPs): Instances in which the model wrongly predicts the positive class when it should have predicted the negative class. Here, this is genuine news mislabeled as fake news.
• False Negatives (FNs): Instances in which the model mistakenly predicts the negative class when it should have predicted the positive class. Here, this is fake news mistakenly identified as authentic news.
All these values are summarized in Table 2. They can be used to produce a number of performance measures, including the following:
• Accuracy: It assesses the overall correctness of the model's predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: It assesses how many of the model's positive predictions were actually positive. Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): It evaluates how well the model detects every positive instance. Recall = TP / (TP + FN)
• Specificity (True Negative Rate): It evaluates the model's ability to identify every negative instance. Specificity = TN / (TN + FP)
• F1 Score: It provides a balanced evaluation of a model's efficacy and is the harmonic mean of precision and recall. F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Matthews Correlation Coefficient (MCC): This is employed to assess the performance of a binary classification model. It offers a fair measure by taking into consideration true positives, true negatives, false positives, and false negatives, even when the classes have different sizes. The MCC is ascertained using the following formula: MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
• KAPPA: Also referred to as Cohen's Kappa, this statistic assesses how well a classification model is doing, particularly when the class distribution is unbalanced. It measures the degree of agreement between the actual and predicted categories while accounting for the agreement that could occur by chance: Kappa = (P_o - P_e) / (1 - P_e). Here, P_o is the observed agreement, the proportion of instances that were correctly predicted by the model, and P_e is the expected agreement, the probability that the model's predictions and the true labels would agree by chance. The observed agreement (P_o) is computed by dividing the sum of the diagonal members of the confusion matrix by the total number of instances. The expected agreement (P_e) is the sum, over all classes, of the product of the marginal probabilities of the true and predicted labels.
• Area Under the Receiver Operating Characteristic curve (AUC-ROC, or simply AUC): This performance metric is commonly used for binary classification problems. The ROC curve visually illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for different classification thresholds. The area under this ROC curve, the AUC, is a single scalar statistic that summarizes the model's performance over a range of classification thresholds. A higher AUC usually indicates better discrimination between positive and negative cases.
• False Discovery Rate (FDR): This statistical metric, applied in hypothesis testing and binary classification, is the proportion of false positives among all of a model's positive predictions. FDR = FP / (FP + TP)
• False Negative Rate (FNR): Sometimes referred to as the Miss Rate, this binary classification metric quantifies the proportion of actual positive cases that a model mistakenly predicts as negative. FNR = FN / (FN + TP)
• False Positive Rate (FPR): Often called the False Alarm Rate or Fall-Out, this binary classification metric measures the proportion of actual negative cases that a model incorrectly predicts as positive. FPR = FP / (FP + TN)
• Negative Predictive Value (NPV): This metric quantifies the proportion of actual negative instances among those that a model predicts to be negative. NPV = TN / (TN + FN)
This pipeline outlines the systematic approach to handling data for machine learning tasks. It starts with inputting raw data, followed by preprocessing steps such as lowercasing the text, removing stop words, and tokenizing and lemmatizing the content. Feature extraction techniques like TF-IDF and bag-of-words transform the processed data into numerical representations. A Stratified Shuffle Split ensures balanced train-test splits, crucial for maintaining class proportions in classification tasks. The models are then trained on the training set. Evaluation metrics like the accuracy score, confusion matrix, and classification reports assess model performance, allowing for iterative improvements. Finally, successful models are deployed for real-world applications, completing the cycle of transforming raw data into actionable insights.
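The pipeline and the confusion-matrix-derived metrics above can be sketched end to end as follows; the documents and labels are toy placeholders, and PassiveAggressiveClassifier stands in for any one of the 29 models.

```python
# TF-IDF features, a stratified train/test split, one classifier, and
# metrics computed from the confusion matrix counts (TN, FP, FN, TP).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef
from sklearn.model_selection import StratifiedShuffleSplit

docs = ["shocking hoax spreads online", "official report confirms figures",
        "fake miracle cure claim", "verified election results published"] * 10
labels = np.array([1, 0, 1, 0] * 10)  # 1 = fake, 0 = real (toy labels)

X = TfidfVectorizer().fit_transform(docs)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(split.split(X, labels))

clf = PassiveAggressiveClassifier(random_state=0).fit(X[train_idx], labels[train_idx])
y_pred = clf.predict(X[test_idx])

tn, fp, fn, tp = confusion_matrix(labels[test_idx], y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp) if (tp + fp) else 0.0
recall      = tp / (tp + fn) if (tp + fn) else 0.0
specificity = tn / (tn + fp) if (tn + fp) else 0.0
mcc   = matthews_corrcoef(labels[test_idx], y_pred)
kappa = cohen_kappa_score(labels[test_idx], y_pred)
```

Because the toy test documents are copies of training documents, the classifier scores perfectly here; on real datasets these metrics diverge and jointly characterize performance.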

Results
The architectural model of the proposed system is depicted in Figure 2. The scikit-learn module and the open-source Python programming language were used to create the models in the experiment. Jupyter Notebook was used as the implementation and testing environment of choice. More information covering this is provided in a later section. Throughout this project, a variety of machine learning algorithms, which can be categorized into traditional and ensemble classifiers, were employed, including "lbfgs" logistic regression, "liblinear" logistic regression, "newton-cg" logistic regression, "sag" logistic regression, Random Forest, perceptron, Ridge Classifier, CatBoost, Nearest Centroid, Stochastic Gradient Descent (SGD), SVC (kernel = "linear", C = 0.025), SVC (gamma = 2, C = 1), LinearSVC, ZeroR, decision tree, Passive Aggressive, Extra Tree, Random Patches, voting, stacked generalization, multilayer perceptron (MLP), Bernoulli RBM, AdaBoost, gradient boosting, ordinal learning model (OLM), XGBoost, decision stump, Complement Naïve Bayes, and Multinomial Naïve Bayes. These were thoroughly trained on our four datasets, and each algorithm produced outputs. As stated earlier, every algorithm's output comprised the following performance metrics: F1 Score, MCC, KAPPA, accuracy, specificity, precision, recall, FDR, FPR, FNR, and NPV. We can determine which classifier performs best at recognizing fake news based on the accuracy levels.

Data Analysis of Dataset 1
Table 3 displays the outcomes of all supervised machine learning algorithms used for Dataset 1. Stacked generalization, an ensemble algorithm, had the maximum overall accuracy score of 0.9805. The highest accuracy score achieved by a traditional learning algorithm was attained by the Ridge Classifier, with a value of 0.9776. ZeroR performed the poorest overall in terms of accuracy. ComplementNB achieved the highest overall precision score at 0.9959. With a precision value of 0.9863, the voting classifier was the ensemble model with the highest value.

Data Analysis of Dataset 2
With an accuracy score of 0.9991, Bernoulli RBM had the best level of accuracy among all the algorithms, and according to the data in Table 4, decision stump, with an accuracy rating of 0.9962, was the traditional model with the highest value. ZeroR had the greatest recall value of 1.0000, but its accuracy score was a pitiful 0.5229. The Bernoulli RBM, with a recall of 0.9960, comes next. At 0.9432, Complement NB has the lowest recall value. (LR: logistic regression; SGD: Stochastic Gradient Descent; SVC: support vector machine; SG: stacked generalization; MLP: multilayer perceptron; OLM: ordinal learning model.)

Data Analysis of Dataset 3
Out of all the algorithms utilized, the stacked generalization algorithm had the best accuracy and the fourth-highest precision score. In terms of precision, Random Patches has the greatest value.
In terms of recall, the stacked generalization classifier appears to be the best choice, with a value of 0.9707. The lowest accuracy, 0.5143, is attained by ZeroR. ZeroR also produced the lowest precision value, 0.0000, as well as the lowest recall score, 0.0000. The results on Dataset 3 are provided in Table 5.

Data Analysis of Dataset 4
The stacked generalization algorithm had the highest accuracy score of all the algorithms used. The next four machine learning algorithms, "lbfgs", "liblinear", "newton-cg", and "sag" logistic regression, all had the same accuracy rating of 0.6088. According to the data in Table 6, the voting classifier's accuracy was the lowest. SVC (gamma = 2, C = 1) had the maximum precision score of 0.6053; Multinomial NB comes in second with a score of 0.5994, and gradient boosting comes in third with 0.5943. With a score of 0.0000, SVC (kernel = "linear", C = 0.025), ZeroR, and decision stump obtained the lowest results.

The three models with the lowest recall scores were MLP, Extra Tree, and Random Forest, at 0.3797, 0.3619, and 0.3530, respectively. With a recall score of 0.68, the Bernoulli RBM performed the best, followed by the Nearest Centroid algorithm (0.6269) and the decision tree (0.5178).

Prediction of the Best-Performing Algorithms
Different models with high accuracy have been developed for the determination of the characteristics of the composites. The complexities and difficulties involved in the application and computation of these models make them unsuitable for use in practice. Therefore, in this work, a relatively simple approach using the general empirical modeler (GEM 16.0) software was adopted. It is simple mathematical modeling software with great accuracy and ease of computation, mainly developed for the prediction of multiple quality characteristics. The procedures (the flowchart) and the graphical user interface of the software are given in Figures 3 and 4. In an attempt to obtain the best-performing algorithms across the given datasets, a predictive model was developed (see Equation (43)).
PREDICTIVE MODEL = 0.9890 × ACC (43)
The accuracy results of different classifiers in classification tasks are shown in Table 8. Out of all the examples in the dataset, the accuracy column shows the percentage of correctly identified instances. From the foregoing, the new approach, as shown in Table 8, has the highest level of accuracy. In most cases, some of the compared similar works used only accuracy as a metric for benchmarking the efficiency of the works. Suffice it to say that this approach, as adopted by these researchers, confirms that the results obtained in our case are more reliable and efficient than those of other similar researchers. The results obtained in [41-47], as well as the results by Khan et al., 2021 [48], confirm the underperformance of their techniques as presented in this study.
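As an arithmetic illustration only, and assuming the model is purely linear in accuracy as the truncated form of Equation (43) suggests (any additional terms may have been lost in extraction), the predictive score could be applied to each algorithm's accuracy as follows. The SG and Ridge Classifier accuracies are the Dataset 1 values reported above; the ZeroR value is a hypothetical placeholder.

```python
# Apply the (assumed linear) predictive model score = 0.9890 * ACC to
# each algorithm's accuracy and select the best-performing one.
accuracies = {"SG": 0.9805, "Ridge Classifier": 0.9776, "ZeroR": 0.52}
scores = {name: 0.9890 * acc for name, acc in accuracies.items()}
best = max(scores, key=scores.get)
```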

Discussion and Conclusions
The issue of "fake news" is one of the primary problems accompanying the development of technology (the internet), social media, and other online activities. As covered earlier, it can be used by malicious actors for harmful purposes. In order to identify fake news on social media, this study suggests a method that combines text mining with supervised AI algorithms [49]. Text mining analysis and supervised artificial intelligence algorithms have previously been the subject of independent research. This combined model was tested using four different real-world datasets, and the outcomes were examined using the following metrics: MCC, KAPPA, F1 score, accuracy, specificity, recall, precision, FPR, FNR, FDR, and NPV [50].
A model based on optimized regression was developed to identify the best-performing algorithm across each of the four datasets. The results are provided in Table 7. The results show SG with the highest value of 0.681190 in Dataset 1. In Dataset 2, Bernoulli RBM has the highest value of 0.933789, while LinearSVC has the highest value of 0.689180 in Dataset 3. Finally, Bernoulli RBM has the highest value of 0.026346 in Dataset 4.
In summary, the aim and objectives the authors set out to achieve in this research have been successfully met. This is visible from the results that have been reported. Future research could enhance the current work by incorporating the detection of bogus news from both image and video sources. The collection and processing of data through sensors play a crucial role, as sensors can be utilized to capture real-time data from various social media platforms, facilitating continuous monitoring and supporting the implementation of AI algorithms and big data analytics in smart city initiatives [51].
Advanced machine learning and AI techniques such as explainable AI (XAI) and deep learning models, multimodal analysis such as cross-platform analysis, natural language processing (NLP) improvements with multilingual capabilities, and techniques allowing the real-time detection of fake news across the globe are issues for future consideration in this line of research [52].
The major shortcoming of this technique is that, at present, it cannot explain why a story is classified as misleading or wrong. Verifying the underlying claims remains the most effective method for ensuring the accuracy of the stories under consideration. Achieving this would undoubtedly add to the quality and reliability of the generated results, beyond flagging stories as good or bad based on what is observable.
Sufficient computational resources are available to facilitate algorithmic processing, but large amounts of data are required for the continual training and testing of the algorithms.
While skilled producers of fake news may devise evasion strategies, algorithms are bound to inherit biases found in their training data. Furthermore, algorithms can have trouble comprehending complex contexts. For the algorithms to be effective, real-time information processing is required. Scenarios may also arise in which the algorithms do not function properly in different linguistic and cultural contexts, and algorithms naturally tend to make errors when classifying news as true or fake.
Finally, this work adopted a relatively simple approach for the prediction of the best-performing algorithms using the general empirical modeler (GEM 16.0) software developed by [53].

Figure 2. Architecture of fake news detection model.


Figure 5. Predicted values for the algorithms with Dataset 1.

Figure 6. Predicted values for the algorithms with Dataset 2.

Figure 7. Predicted values for the algorithms with Dataset 3.


Table 2 .
The confusion matrix.

Table 8 .
Fake and counterfeit news detection based on machine learning (ML) used for benchmarking the performance and computational overhead of the new approach.