Arabic Spam Tweets Classification: A Comprehensive Machine Learning Approach

Abstract: Nowadays, one of the most common problems faced by Twitter (also known as X) users, including individuals as well as organizations, is dealing with spam tweets. The problem continues to proliferate due to the increasing popularity and number of users of social media platforms. Exploiting this overwhelming interest, spammers can post texts, images, and videos containing suspicious links that can be used to spread viruses, rumors, negative marketing, and sarcasm, and potentially hack the user's information. Spam detection is among the hottest research areas in natural language processing (NLP) and cybersecurity. Several studies have been conducted in this regard, but they mainly focus on the English language. Arabic tweet spam detection, however, still has a long way to go, especially for the diverse dialects other than Modern Standard Arabic (MSA), since the standard dialect is seldom used in tweets. The situation demands an automated, robust, and efficient Arabic spam tweet detection approach. To address the issue, in this research, various machine learning and deep learning models have been investigated to detect spam tweets in Arabic, including Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), and Long Short-Term Memory (LSTM). In this regard, we have focused on the words as well as the meaning of the tweet text. Over several experiments, the proposed models have produced promising results in contrast to previous approaches on the same and other datasets. The results show that the RF classifier achieved 96.78% accuracy and the LSTM classifier achieved 94.56%, followed by the SVM classifier with 82%. Further, in terms of F1-score, there is an improvement of 21.38%, 19.16%, and 5.2% using the RF, LSTM, and SVM classifiers, respectively, compared to schemes using the same dataset.


Introduction
In the present digital era, reviews that are posted on websites, applications, and social media platforms hold great significance. These reviews act as evaluations of various services, products, and places. People rely on these assessments to decide whether to use a particular service, purchase a product, or book a place. Additionally, these reviews have a profound impact on companies, as they shape their product features, services, and marketing campaigns based on customer feedback. Opinion-mining tools have been developed to assist businesses and decision-makers in improving product quality and enhancing sales and revenue. These tools include sentiment classification, feature-based opinion-mining, comparative sentences, and opinion searches [1].
However, reviews can also present a double-edged sword. Companies can limit who can provide reviews on their applications by linking them to serial or reservation numbers. Nevertheless, individuals can write reviews on social media platforms like Twitter and Facebook. Recently, competitors have taken advantage of this situation by employing paid attacks that can negatively impact business development and influence people's decisions [2]. These paid attacks are often carried out by bots, making it difficult to determine whether the reviews are genuine or spam. Detecting spam in reviews involves two aspects: spam detection and spammer detection. Spam detection mainly focuses on classifying the submitted text as human-generated or bot-generated [3].
Spammer detection is a process that focuses on identifying the source of spam and determining whether it comes from an individual or a group of spammers. There are three techniques that can be utilized to identify spam or spammers. The first two techniques involve natural language processing (NLP) and product feature detection, and they apply to the text. The third technique involves analyzing the behavior of the reviewer, which includes examining their internet protocol (IP) address, the repeatability of their reviews, and the timing of their submissions [4]. Spamming in social media can lead to several issues, such as cluttering the feeds of consumers, making it difficult to find relevant and valuable content. Spam links and comments may also contain harmful information that can be exploited to distribute malware or engage in phishing scams. Moreover, spam content may include hate speech, which can worsen racial tensions and societal problems [5]. Ensemble learning is a powerful technique that combines multiple machine learning algorithms to achieve better performance than using these algorithms individually [5]. The study by [6] is considered one of the foundational studies in ensemble learning; it introduced a technique for dividing the feature space among multiple classifier elements. The authors in [7] showed that an ensemble of identical ANN classifiers performed considerably better than a single classifier in terms of prediction performance. Schapire [8] proposed the boosting technique, which transforms a weak classifier into a strong one. Boosting has paved the way for robust algorithms such as AdaBoost, gradient boosting, and extreme gradient boosting (XGBoost) [9]. Ensemble learning combines the predictions of multiple individual learners to obtain a more accurate prediction than a single model can produce. There are two types of ensemble methods: parallel and sequential. Parallel methods involve training different base classifiers independently and combining their predictions using a combiner. Two popular parallel ensemble methods are the Bagging and Random Forest algorithms. These methods encourage diversity among ensemble members by generating base learners in parallel [10]. Sequential ensembles, like boosting algorithms, train models iteratively to correct the errors made by previous models; they do not fit the base models independently. Parallel ensembles are further classified into homogeneous and heterogeneous. Homogeneous ensembles comprise models built using the same machine learning algorithm, while heterogeneous ensembles are made up of models from different algorithms. The success of ensemble learning approaches depends on the accuracy and diversity of the base learners. Accuracy denotes the capability of a model to generalize efficiently to unseen instances of the input data, while diversity refers to the differences in errors among the base learners [10].
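The parallel vs. sequential distinction can be made concrete with a short sketch (assuming scikit-learn is available; the data here are synthetic, not the study's tweets): a bagging ensemble fits its trees independently on bootstrap samples, while AdaBoost fits its learners one after another on reweighted errors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Parallel ensemble: 50 decision trees, each fit independently on a
# bootstrap sample; predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Sequential ensemble: each weak learner is fit to correct the errors of
# the previous ones by reweighting the misclassified samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

print("bagging ", round(bagging.score(X_te, y_te), 3))
print("boosting", round(boosting.score(X_te, y_te), 3))
```

Both calls produce a fitted ensemble; which one generalizes better depends on the data and on the diversity of the base learners, as discussed above.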
Most of the research studies on spam detection are conducted in the English language. Very few studies have been carried out to detect spam in the Arabic language. Arabic is a morphologically rich natural language, and its dialects and script variations across the Arabian Peninsula pose several unique linguistic challenges. For example, a range of diacritics that change not only the representation but the entire meaning of a word, contextual and semantic diversity, a blend of native and modern language inflections, and special symbols with their diverse uses are a few among the many characteristics that differentiate Arabic from Western languages. Therefore, there is a need to develop robust approaches to identify spam messages in Arabic, covering different dialects as well as Modern Standard Arabic (MSA). The existing research is mainly confined to email spam detection and other social media, such as Facebook and YouTube, while tweet spam detection is limited. Moreover, the studies on tweet spam detection either use a limited dataset, or in some cases perform a conversion to English prior to spam tweet detection. This research focuses on identifying spam tweets in Arabic by considering the impact of text preprocessing on the identification process. The same entities are used for all the models during the training and testing phases. Various classifiers from machine learning (ML) and deep learning (DL) are employed to identify spam tweets, along with different Arabic NLP preprocessing and feature extraction techniques.
The rest of the paper is organized as follows: Section 2 provides an overview of the previous research conducted in the field of spam text detection. Section 3 provides the background of the ML and DL algorithms used in the study. Section 4 presents the proposed scheme. Section 5 presents the results and discussion, while Section 6 concludes the paper.

Related Studies in English Tweet Spam Detection
The study detailed in [11] addresses spam sent to web pages, surveying the types of spam and proposing a methodology based on feature extraction and classification algorithms for spam detection, achieving 81.8% recall and 83.1% precision with ADTree using the best features. Bahnsen et al. [12] propose a flexible and intelligent model using classification algorithms and data mining methods to detect phishing websites, achieving an accuracy rate of 98.7% and highlighting the relationship between website features for future detection frameworks. Preethi and Velmayil [13] present a method for analyzing phishing URLs using lexical analysis, employing a pre-phish algorithm and machine learning methods to classify phishing and non-phishing URLs, achieving a 97.83% accuracy and a 1.82% false prediction rate. Nagaraj et al. [14] propose an ensemble machine learning model, combining Random Forest and neural networks, for classifying phishing websites with a prediction accuracy of 93.41%. Ubing et al. [15] present an ensemble learning approach based on feature elicitation and plural voting, achieving an accuracy of 95% for phishing website detection, surpassing existing methods. In recent studies, different models have been proposed for detecting spam and inappropriate content. The authors of [16] proposed a model that uses content-based features, such as word frequency count, sentiment polarity, and review length, to achieve an accuracy of 86.32% with the Naive Bayes classifier.
Jain et al. [17] proposed a multi-instance learning model and a convolutional neural network (CNN) model with Gated Recurrent Units (GRU) for text classification, achieving the highest accuracy of 91.9% with the CNN-GRU model. Mani et al. [18] introduced an ensemble technique that combines the Naive Bayes, Random Forest (RF), and Support Vector Machine (SVM) classifiers using N-gram features and achieves the highest average accuracy of 87.68% for detecting spam reviews. Siddique et al. [19] developed a model for email content discovery and classification, employing CNN, Naïve Bayes (NB), LSTM, and SVM algorithms, with the LSTM model achieving a maximum accuracy of 98.4% for perceiving and categorizing inappropriate and unsolicited spam emails written in Urdu. Dewis and Viana [20] proposed "Phish Responder", a Python-based solution that combines deep learning and NLP techniques to detect spam and phishing emails, achieving the highest average accuracy of 99% using the LSTM model for textual datasets and 94% with the MLP model for numerical datasets. Alzaqebah et al. [21] propose an improved version of the Multi-Verse Optimizer (MVO) algorithm for feature selection in cybercrime classification problems, demonstrating the superiority of the improved algorithm (IMVO) in maintaining solution diversity and improving searchability. AbdulNabi and Yaseen [22] examine machine and deep learning algorithms, including Bidirectional Encoder Representations from Transformers (BERT), for spam and phishing email detection, showing that the BERT model achieves a maximum accuracy and F1-score of 98.67% and 98.66%, respectively, compared to other classifiers.
After a brief review, it is apparent that spam detection, especially tweet spam detection in the English language, has achieved significant improvements in terms of accuracy and other evaluation metrics. This is largely due to the development of well-established models and the consistent use of accents and dialects in the language, which makes NLP more efficient and effective in detecting spam and analyzing other types of social media content.
Table 1 provides a summary of studies conducted on spam detection in the English language. It also includes the methods and classifiers used, the type of dataset, and the evaluation results in terms of accuracy, precision, recall, and F1-score.

Related Studies in Arabic Tweet Spam Detection
Al-Kabi et al. [23] propose a system for ranking Arabic web pages and detecting spam based on content and link features. The system utilizes user feedback to improve its performance and demonstrates an improvement over other methods in terms of performance and accuracy. Ghourabi et al. [24] employ machine learning techniques to detect spam SMS messages in Arabic and English. They propose a hybrid deep learning model combining LSTM and CNN and evaluate it against various classification algorithms. The CNN-LSTM model achieves superior performance, with an accuracy, precision, recall, F1-score, and AUC of 98.37%, 95.39%, 87.87%, 91.48%, and 93.7%, respectively. Mohammed et al. [25] present an intelligent and adaptive learning approach for detecting spam emails. They propose a visual anti-spam model using a trainable Naive Bayes classifier trained in Arabic, English, and Chinese. The proposed model efficiently detects and filters spam emails, achieving an overall accuracy of 98.4%, a false positive rate of 0.08%, and a false negative error rate of 2.90%. Alkadri [26] proposes an integrated Twitter spam detection framework focusing on Arabic content. The framework combines NLP, data augmentation, and supervised ML algorithms. The model achieves a total accuracy of 92% and improves the F1-score from 58% to 89% by augmenting the data. It is worth mentioning that this accuracy was obtained on a small, selected subset of the actual dataset. The authors in [27] propose four techniques for identifying spam in Arabic reviews, combining ML techniques with rule-based classifiers and employing content-based features such as N-grams and negation processing. The group approach achieves 95.25% and 99.98% classification accuracies on the DOSC and HARD datasets, respectively, outperforming existing work by 25%. Alzanin and Azmi [28] propose two learning models, semi-supervised learning using the Expectation-Maximization (E-M) algorithm and unsupervised learning using the NB algorithm, for detecting fake Arabic tweets. The semi-supervised learning model performs better, with an accuracy of 78.6%, using features based on tweets and topics. A study [29] conducted a systematic literature review on the use of AI strategies for crime prediction. The review analyzed 120 research papers and identified various crime analysis types, types of crimes studied, prediction techniques, performance metrics, and the strengths, weaknesses, and limitations of the proposed methods. The review describes supervised machine learning as the most commonly used method and provides guidance for researchers in the field of smart crime prediction.
Alotaibi et al. [30] address improving customer service for the Saudi Telecom Company (STC) in Saudi Arabia. The researchers analyze tweets from the Twitter platform to measure user satisfaction and identify their sentiments and criticisms. They propose a BERT-based model for spam detection and sentiment analysis in imbalanced data from Arabic tweets. The model is trained using a dataset of 24,513 Arabic tweets, and its performance is evaluated using F1-score, accuracy, and recall metrics. The results demonstrate that the MARBERT model performs well in Arabic multi-label sentiment analysis, outperforming existing techniques in the literature with an F1-score of 75%. Alorini and Rawat introduced a dataset in their study [31], which consisted of Gulf Dialectical Arabic (Gulf DA) translated into English. The purpose of this dataset was to build a Gulf Knowledge Base (GulfKB). The researchers then utilized Bayesian inference in the GulfKB model-based reasoning to identify malicious content and suspicious users. Through numerical evaluation, they demonstrated that their approach achieved an accuracy of 91% and surpassed other existing methods described in the current literature.
Alghamdi and Khan introduced an intelligent system in their research [32] to analyze Arabic tweets and identify suspicious messages. The system uses supervised machine learning algorithms to detect suspicious activities in Arabic tweets, and its development involved collecting a dataset of Arabic tweets and manually labeling them as suspicious or not suspicious. Six supervised machine learning algorithms were evaluated, and the support vector machine algorithm outperformed the others, achieving a mean accuracy of 86.72%. The study contributed to the field by developing a labeled dataset of Arabic tweets and establishing a statistical benchmark for future research. This system can be an effective tool for law enforcement agencies to identify suspicious messages and prevent crime.
Alhassun and Rassam conducted a study [33] to assess the effectiveness of a combined text-and-metadata framework for detecting spam from Arabic Twitter accounts. The researchers examined whether account suspensions could serve as an indicator of Arabic spam accounts. The long short-term memory (LSTM)-combined model achieved high precision and recall rates of 94% and 93.8%, respectively, outperforming the logistic regression (LR) and SVM approaches. The proposed framework demonstrated its superiority by achieving the highest accuracy of 94.27% in the combined model. Despite the challenges posed by Arabic tweets and their high sensitivity, the text-based model utilizing convolutional neural networks (CNN) performed well, with an accuracy of 80%. Kaddoura et al. [34] presented deep learning and classical machine learning approaches to Arabic tweet spam classification. In this regard, they collected a dataset and labelled it manually [35]. N-gram methods were applied for feature extraction and combined with SVM, NN, NB, and LR, while Global Vectors (GloVe) and fastText models were used for the deep learning approaches, which outperformed the aforementioned models.
Table 2 summarizes the techniques involving Arabic spam detection in various social media datasets. The techniques in [26,30] used Arabic Twitter datasets similar or close to the one in the current study, which also involves an additionally collected dataset.
Based on the comprehensive review of the literature (Table 2), it is apparent that the existing research in the Arabic language is mainly confined to email spam detection and other social media, such as Facebook and YouTube, while tweet spam detection (based on the tweet text, not the account) is somewhat limited, and there is much room for improvement in terms of accuracy and other figures of merit. Moreover, the studies on tweet spam detection either use a limited, self-generated dataset, or in some cases perform a conversion to English prior to spam tweet detection [31]. Therefore, the situation demands a comprehensive Arabic tweet spam detection approach with a diverse dataset and improved accuracy. The proposed study aims to fill this research gap.

Ensemble Machine Learning Techniques and Algorithms
Machine learning originates from pattern recognition and artificial intelligence, specifically within the subfield of computer science. It is closely intertwined with computational statistics and primarily revolves around prediction. Over the past few years, significant research efforts in machine learning have been dedicated to various domains, including NLP, computer vision, pattern recognition, cognitive computing, and knowledge representation. These represent critical application areas for machine learning techniques, enabling advancements in language understanding, image analysis, pattern detection, cognitive modeling, and the representation of knowledge in computational systems [37]. Ensemble learning is a technique that combines multiple machine learning (ML) algorithms to achieve better performance than using individual algorithms alone. Based on the literature review, we have shortlisted SVM and NB as classical ML techniques, while RF is shortlisted as an ensemble technique [36][37][38][39].
RF models are machine learning techniques that forecast the output by combining the results of a series of regression decision trees. Each tree is built separately from a random vector sampled from the input data, with the same distribution for all trees in the forest. The NB technique is a simple text categorization algorithm. It is a probabilistic approach that estimates, for each class, the probability of each attribute. It has been used effectively for various problems and applications, but it excels in NLP. Likewise, SVM is a powerful supervised machine learning model for text and other data classification, regardless of the size of the dataset [36][37][38][39].
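As a concrete sketch of the three shortlisted classifiers (a toy, hypothetical English corpus with made-up labels is used here for brevity; the study itself works on preprocessed Arabic tweets), each can be trained on TF-IDF features with scikit-learn:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Tiny illustrative corpus (hypothetical placeholders): 1 = spam, 0 = non-spam.
texts = ["win a free prize now", "free money click here",
         "meeting at noon tomorrow", "see you at the office"]
labels = [1, 1, 0, 0]

# TF-IDF turns each word into a numeric weight per document.
X = TfidfVectorizer().fit_transform(texts)

preds = {}
for clf in (RandomForestClassifier(random_state=0), MultinomialNB(), SVC()):
    clf.fit(X, labels)
    preds[type(clf).__name__] = clf.predict(X)
    print(type(clf).__name__, preds[type(clf).__name__])
```

The same interface (fit, then predict) applies to all three models, which is what makes a side-by-side comparison on one feature matrix straightforward.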

Deep Learning Techniques
Deep learning is a type of machine learning that focuses on training artificial neural networks with several layers, also known as deep neural networks. These networks are designed to imitate the structure and function of the human brain, with interconnected layers of artificial neurons. One of the main advantages of deep learning is its ability to automatically learn hierarchical representations from raw data. Traditional machine learning approaches often require feature engineering, which involves manually designing and selecting relevant features from the input data. In contrast, deep learning learns these features automatically as part of the model training process, eliminating the need for manual feature engineering and allowing the model to extract complex and abstract representations directly from the data [40]. Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that addresses the challenge of capturing long-term dependencies in sequential data. It introduces a memory unit and gate mechanism, which enables the network to selectively remember or forget information over a sequence of inputs. In the current study, we have employed LSTM as a deep learning model because of its effectiveness in similar problems, as observed in the literature [41]. Though the techniques investigated in the proposed study are classical and exist in the literature, the nature of the dataset, the preprocessing techniques, and the handling of native Arabic NLP is a task in itself, and it makes the study novel and distinguished.
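The gate mechanism can be made concrete with a single LSTM cell step written in NumPy (an illustrative forward pass with random weights, not the trained network used in this study):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the input (i),
    forget (f), and output (o) gates and the candidate memory (g)."""
    z = W @ x + U @ h_prev + b           # pre-activations, shape (4 * hidden,)
    hid = h_prev.size
    i = sigmoid(z[0 * hid:1 * hid])      # input gate: what to write
    f = sigmoid(z[1 * hid:2 * hid])      # forget gate: what to keep
    o = sigmoid(z[2 * hid:3 * hid])      # output gate: what to expose
    g = np.tanh(z[3 * hid:4 * hid])      # candidate memory content
    c = f * c_prev + i * g               # selectively remember / forget
    h = o * np.tanh(c)                   # hidden state emitted this step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in))
U = rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):  # a 5-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

The forget gate f and input gate i together implement the "selectively remember or forget" behaviour described above: c carries information across steps, scaled and updated at each input.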

Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE is a technique used for data augmentation that balances the class distribution by creating synthetic examples of the minority class. Rather than duplicating existing minority class instances, SMOTE generates synthetic samples by interpolating between neighboring instances in the feature space. This process, known as oversampling, adds more data points to the dataset and better represents the distribution of the classes. Conversely, if it is performed in reverse, where instances of one class are reduced to equate with the other, it is known as undersampling [42]. In Figure 1, undersampling is shown on the left, where samples are reduced to balance the classes; oversampling is shown on the right, where one class is multiplied to achieve a balanced dataset.
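The interpolation idea can be sketched minimally in NumPy (a simplified, hypothetical variant that uses only the single nearest minority neighbour; library implementations such as imbalanced-learn's SMOTE sample among k nearest neighbours):

```python
import numpy as np

def smote_oversample(X_min, n_new, rng):
    """Create n_new synthetic minority samples by interpolating each
    chosen point toward its nearest minority-class neighbour."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Nearest neighbour among the other minority points.
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf
        nn = X_min[np.argmin(d)]
        lam = rng.random()               # interpolation factor in [0, 1)
        synthetic.append(x + lam * (nn - x))
    return np.array(synthetic)

rng = np.random.default_rng(1)
X_major = rng.normal(0, 1, size=(100, 2))    # majority class
X_minor = rng.normal(3, 1, size=(20, 2))     # minority class
X_new = smote_oversample(X_minor, 80, rng)   # balance 20 -> 100
print(len(X_minor) + len(X_new))
```

Because each synthetic point is a convex combination of two minority samples, it lies between them in the feature space rather than being a duplicate, which is the key difference from naive oversampling.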

Natural Language Processing (NLP)
NLP is a field of study in artificial intelligence that aims to enable computers to understand, interpret, and process human language for various purposes, such as opinion mining, sentiment analysis, and affect detection in text. In today's world, NLP is essential due to the vast amount of unstructured data that is available. Basic tasks, such as content rating, subject discovery and modeling, contextual extraction, sentiment analysis, speech-to-text, text-to-speech, automatic document summarization, and machine translation, are often combined in high-level NLP capabilities [43].

Dataset
For the present study, we have investigated a diverse dataset obtained from three different sources. Firstly, data were collected from Twitter using the API. Secondly, data were obtained from a study conducted by Alotaibi et al. [30]. Thirdly, data were sourced from a recent study [26] for additional comparisons. The objective behind aggregating data from various sources was to develop a comprehensive model for detecting tweet spam, covering diverse dialects from the Arabic region, including MSA and others.

Proposed Approach
This section presents the proposed approach followed in conducting the research. Figure 2 shows the research methodology flowchart.

Research Steps:
• Read data from Twitter using Python's pandas library and extract the data frame.

• Preprocessing: this step is crucial when applying AI algorithms, because raw tweet text is not directly compatible with them.
• NLP: this step converts the data into a form to which AI can be applied. It includes normalizing letters (converting letters that have multiple forms to a single form), tokenizing the text (converting each word into a token to initialize the data for the next stage), and lemmatizing (reducing each word to its root).
• Feature extraction: converting each word to a number and replacing each word with its number. This step is essential to convert non-numerical data into numerical data suitable for AI.
• Balancing: when one class has more samples than the other, performance suffers, so balancing generates samples for the class with fewer samples until it is balanced with the other class.
• ML and deep learning: this step builds and trains ensemble ML and DL models on the prepared data.
• Evaluation: comparing accuracy and other metrics across two or more models to select the best one.
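The research steps can be sketched end to end with scikit-learn (a toy run on hypothetical English placeholder texts; the actual study first applies the Arabic preprocessing described in the Data Preprocessing section):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Step 1-2: read the (here, already cleaned) tweet texts and labels.
# Hypothetical placeholders (1 = spam, 0 = non-spam), repeated so the
# split has enough samples.
texts = ["free prize click link", "win cash now", "lunch at noon",
         "project meeting today", "claim your reward", "see you later"] * 10
labels = [1, 1, 0, 0, 1, 0] * 10

# Step 3-4: feature extraction -- TF-IDF replaces each word with a number.
X = TfidfVectorizer().fit_transform(texts)

# Step 6-7: train a model on one split and evaluate it on the other.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
model = MultinomialNB().fit(X_tr, y_tr)
pred = model.predict(X_te)
acc, f1 = accuracy_score(y_te, pred), f1_score(y_te, pred)
print("accuracy", acc, "f1", f1)
```

The balancing step is omitted here because the toy labels are already balanced; on imbalanced data, SMOTE would be applied to the training features before fitting.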

Data Preprocessing
In this step, all unwanted words, characters, and URLs are deleted, and clean data are generated. This step is crucial when applying machine learning (ML) and deep learning (DL) algorithms, since unwanted and useless characters negatively affect the model's performance, and it is essential for building an accurate and suitable model to decide on the data entered. It includes normalizing the text so that it contains no URLs, and removing punctuation and diacritics from the Arabic text. The detailed data preprocessing pipeline is illustrated in Figure 3 and described subsequently. The input text is normalized, then several eliminations take place, such as diacritics, hashtags, punctuation symbols, and stop words. After that, tokenization and lemmatization are performed to make the text ready for the next phase.
(a) Text Normalization: normalization is the process of reducing letters to their basic form. As the Arabic language is morphologically rich, it requires normalization, for instance, removing Tatweel (elongation), e.g., "كتــــاب" to "كتاب". Table 3 presents the normalization form for certain Arabic letters.
(b) Eliminating diacritics, punctuation ('+*/. ...), and repeating characters: removing these to clean and standardize the text data for further analysis; examples are shown in Table 4, taken from our previous work [44].

Diacritic Marks Characters
Fatha AI 2024, 5, FOR PEER REVIEW 10 for further analysis.For instance, " "; this is shown in Table 4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.  4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.  4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.  
4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.  4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.  4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5. for further analysis.For instance, " "; this is shown in Table 4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.

# Word
Sukun AI 2024, 5, FOR PEER REVIEW 10 for further analysis.For instance, " "; this is shown in Table 4, taken from our previous work [44].(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, ‫ﻣﻦ‬ ‫ﻫﺬﺍ,‬ ‫ﺍﻟﺬﻱ,‬ are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5.

# Word
(c) Eliminating hashtags, user references or indications, and URLs.(d) Eliminating punctuation symbols, for instance, full stops and commas, because they do not play any significant role in spam detection.(e) Eliminating stop words: Stop words are applied to formation of the language but usually do not contribute to its subjects.For instance, , , are a few Arabic stop words.The Arabic stop words are collected from various Arabic sources [44].A few examples of stop words are given in Table 5. (f) Tokenization: Convert the text into tokens, individual words, or meaningful units to facilitate further analysis.After the tokenization step, the data becomes separable and more adequate for the analysis.(g) Lemmatization: Convert each word to its base or root form to reduce inflectional variations and ensure consistency.In the existing studies, stemming was used in this regard, though that is vulnerable to over-stemming and under-stemming phenomenon.Though a bit computationally expensive, lemmatization is way better in terms of accuracy, since it keeps the context intact while returning the word base form, aka lemma, from the dictionary.It efficiently handles grammar and delivers the accurate language representation.
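Steps (a)–(f) above can be sketched in Python with regular expressions alone. The diacritic range, the Alef mapping, and the tiny stop-word list below are illustrative subsets, not the refined lists used in the study (lemmatization, step (g), requires a morphological analyzer and is omitted):

```python
import re

# Illustrative subsets; the study uses refined lists from [44] and Table 3.
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # Fathatan .. Sukun
TATWEEL = "\u0640"                             # elongation character
STOP_WORDS = {"من", "هذا", "الذي"}             # tiny illustrative list

def preprocess(text: str) -> list[str]:
    text = text.replace(TATWEEL, "")                    # (a) remove Tatweel
    text = re.sub("[إأآ]", "ا", text)                   # (a) typical Alef normalization
    text = DIACRITICS.sub("", text)                     # (b) strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1", text)            # (b) collapse repeated chars
    text = re.sub(r"https?://\S+|[#@]\S+", " ", text)   # (c) URLs, hashtags, mentions
    text = re.sub(r"[^\w\s]", " ", text)                # (d) punctuation
    tokens = text.split()                               # (f) tokenization
    return [t for t in tokens if t not in STOP_WORDS]   # (e) stop words
```

Because Python's `\w` is Unicode-aware, the punctuation pass keeps Arabic letters intact while discarding symbols.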

Dataset Splitting for Training and Testing
Dataset splitting involves dividing the available data into two or more subsets, which are used to create separate training and testing datasets for machine learning models. The purpose of this process is to evaluate the performance of a model on an independent dataset that it has not previously seen during training. Typically, the dataset is split into a training set and a testing set: the training set is used to train the model, while the testing set is used to evaluate the model's performance. Alternatively, cross-validation can be used, in which the dataset is divided into multiple subsets, each of which takes a turn at training and testing. To split the dataset, we used 80% for training and 20% for testing. The training set was used to train the model to recognize patterns and make predictions, while the testing set was used to evaluate the trained model's ability to generalize to new, unseen data. It is important to note that the testing dataset is kept fully separate from the training dataset.
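The 80/20 hold-out split described above can be sketched as follows (the seed and the toy sample list are illustrative):

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle and hold out `test_ratio` of the samples for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

tweets = [f"tweet_{i}" for i in range(100)]
train, test = train_test_split(tweets)   # 80 training, 20 testing samples
```

Shuffling before the cut prevents any ordering in the collected tweets (e.g., by source or date) from leaking into the split.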

Feature Extraction
In the current study, since we are working with textual data, there is a need to extract important features to improve the accuracy of spam prediction. Feature extraction is the process of converting textual data into numerical data that is suitable for prediction. This is achieved by converting the words in comments to numerical symbols, also known as sequences, where each word or letter is assigned a code. For instance, the word "negatives" may be encoded as "1"; whenever this word appears in any text, email, or tweet, it is replaced with its corresponding symbol "1". To ensure that all sentences have the same length, the padding sequence method is used: zeros are appended to shorter sequences to match the length of the longest one, resulting in uniform lengths for the texts. Various techniques, such as term frequency-inverse document frequency (TF-IDF), word embeddings, and bag-of-words, are used for feature extraction, as explained subsequently.
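The word-to-code encoding and zero-padding described above can be sketched as follows (the example sentences are illustrative; code 0 is reserved for padding):

```python
def encode_and_pad(texts):
    """Map each word to an integer code, then zero-pad to equal length."""
    vocab = {}
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1   # 0 is reserved for padding
            seq.append(vocab[word])
        sequences.append(seq)
    max_len = max(len(s) for s in sequences)
    # Append zeros to shorter sequences so all have length max_len.
    return [s + [0] * (max_len - len(s)) for s in sequences], vocab

padded, vocab = encode_and_pad(["win a free prize", "free prize"])
```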

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is widely used for text classification and feature extraction. Term frequency is the number of times a word appears within a document, as given in Equation (1). The inverse document frequency reflects how common or rare a word is across the entire record set, as given in Equation (2): if a word is ubiquitous and appears in nearly all records, this value approaches 0; otherwise, it is larger. The TF-IDF score is the product of the two, as given in Equation (3).
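Using the standard definitions (TF(t, d) = count of t in d divided by the terms in d; IDF(t) = log(N / df(t)); TF-IDF = TF × IDF — the log base and any smoothing are implementation choices, so this is a sketch rather than the study's exact formulation):

```python
import math

def tf_idf(docs):
    """Return per-document {term: tf-idf} maps for a list of tokenized docs."""
    n_docs = len(docs)
    df = {}                                   # number of docs containing each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        doc_scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)       # Equation (1)
            idf = math.log(n_docs / df[term])     # Equation (2)
            doc_scores[term] = tf * idf           # Equation (3)
        scores.append(doc_scores)
    return scores

scores = tf_idf([["free", "prize", "now"], ["meeting", "at", "noon"]])
```

A term that appears in every document gets IDF = log(N/N) = 0, so its TF-IDF weight vanishes, which is exactly the down-weighting of ubiquitous words described above.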

Dataset Balancing
The dataset obtained from various sources contained an imbalanced number of instances. In general, a ratio of about 20-80 was observed between spam and non-spam tweets, respectively. It is apparent that the dataset is imbalanced, which may lead to unfair analysis. To balance the dataset, the SMOTE technique is applied. This technique not only balances the dataset classes, but also promotes fairness in the analysis, where each class takes an equal part in the model training and evaluation.

Model Evaluation
Machine learning and deep learning models are often evaluated using accuracy and error to determine the relationship between predicted and actual values. To evaluate the performance of a proposed model on a given dataset, four measures are typically used: accuracy, F1-score, recall, and precision, as cited in references [45][46][47]. These formulas are expressed by means of the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts.
• Accuracy is the ratio of correctly classified outcomes (TP and TN) to the total number of classified instances: Accuracy = (TP + TN) / (TP + TN + FP + FN).
• Recall is the percentage of positive tweets correctly identified by the model out of all actual positive tweets in the dataset: Recall = TP / (TP + FN).
• Precision is the proportion of true positive tweets (TP) among all predicted positive tweets (TP and FP): Precision = TP / (TP + FP).
• The F1-score combines precision and recall in a harmonic mean: F1 = 2 × (Precision × Recall) / (Precision + Recall).
It is worth mentioning that the distinction between the proposed and the existing models is mainly based on the diverse preprocessing methods applied to the Arabic text prior to model building, incorporating a refined list of diacritics, stop words, and others. That has contributed to better feature extraction and improved model training and effectiveness, as evident in the next section, and is the main reason why the use of widely known machine learning and deep learning techniques yields such a clear improvement in the results.
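The four measures follow directly from the confusion-matrix counts (the counts below are made-up illustration values, not the study's results):

```python
def metrics(tp, tn, fp, fn):
    """Compute the four evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
```

Because F1 is a harmonic mean, it always lies between precision and recall, and is pulled toward the smaller of the two.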

Results and Discussion
This section presents the results of all classifiers in detail and demonstrates the effects of unigram and TF-IDF features on text classification for each model. After preprocessing the dataset and extracting the features, the dataset is fed to the classifiers to determine whether a tweet is spam. The following sections summarize the results of the two experiments.

Results
The proposed models, including RF as an ensemble learning model, LSTM as a deep learning model, and SVM and NB as classical machine learning models, were implemented in the Python programming language using the aforementioned dataset. These models were chosen mainly because they have performed well on similar problems in the literature.
The Random Forest classifier was the first model we trained. This algorithm creates a set of decision trees, each trained on a different subset of the data using a random selection of features. By combining the predictions of multiple trees, the Random Forest classifier aims to increase the overall accuracy and robustness of the model. Additionally, we performed hyperparameter tuning to optimize the model's performance. After evaluation, the model achieved an accuracy of 96.57%, a precision of 95%, a recall of 97.80%, and an F1-score of 96.38%. These results are consistent, significant, and promising when compared to similar models in the literature.
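A minimal sketch of this setup with scikit-learn (the random feature matrix stands in for the real TF-IDF vectors, and `n_estimators` is illustrative rather than the tuned value from the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 20))                   # stand-in for TF-IDF feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # toy spam/non-spam labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)            # fraction of correct predictions
```

Each tree sees a bootstrap sample of the rows and a random subset of features at every split; averaging their votes is what gives the ensemble its robustness over a single decision tree.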
Similarly, after hyperparameter tuning, the LSTM model (second experiment) was configured with 64 neurons in the hidden layer and trained for 30 epochs. The algorithm achieved an accuracy of 94.58%, a precision of 91.25%, a recall of 97.28%, and an F1-score of 94.16%. While these results are slightly lower than the RF algorithm's, they are still consistent and promising when compared to results on a similar dataset in the literature. The RF advantage is mainly due to the power of ensemble classification: the RF algorithm outperforms the LSTM model by approximately 2% in accuracy, 3.75% in precision, 0.52% in recall, and 2.22% in F1-score.
Table 6 presents the performance results of the proposed algorithms for all four metrics: accuracy, precision, recall, and F1-score. Figure 4 compares the results obtained from all four algorithms in terms of the same metrics.

Comparison with State-of-the-Art Approaches
A comparison with state-of-the-art approaches is conducted against schemes that use the same datasets. In the literature, the F1-score is a commonly used metric for comparison, since it balances other metrics such as precision and recall [48][49][50]. A comparison between the proposed approach and Alotaibi et al.'s [30] work is conducted as they share a common dataset. The technique in [30] achieved an F1-score of 75%, whereas the proposed approach obtained F1-scores of 96.38% and 94.16% for the ensemble learning and deep learning models, respectively. For the SVM model, the proposed approach outperformed by 4.8%, while LSTM and RF outperformed by 19.16% and 21.38%, respectively. Similarly, another scheme by Alkadri et al. [26], with identical datasets collected from Saudi Arabia, yielded its highest F1-score of 89% using SVC, while the proposed LSTM and RF schemes outperformed it by 5.16% and 7.38%, respectively. However, SVM underperformed by 6.8% in this regard, as shown in Figure 5.

Discussion
This study proposed classical machine learning and deep learning models for tweet spam detection in Arabic. In this regard, four algorithms were investigated: SVM, NB, RF, and LSTM. A comparison was made among them in terms of accuracy, precision, recall, and F1-score. After training the models on the dataset, the results were presented and reviewed in the previous section. By examining and analyzing the practical results of the proposed models, several criteria were adopted for comparing the algorithms. The classification relies not only on the presence of potentially suspicious URLs, but primarily on the text and its meaning or semantics, as this is the best indicator for determining whether a tweet is spam. We also consider the number of followers, likes, and retweets. Moreover, in the NLP part, various dialects have been catered for, including Modern Standard Arabic (MSA). Finally, it was observed that Random Forest and LSTM were good choices for classifying Arabic texts, in contrast to SVM and NB. The experimental results demonstrate that Random Forest predicts many labels accurately due to its ensemble nature, and that the LSTM performs well in terms of accuracy, loss, and overfitting.
In contrast to English, Arabic tweet spam detection involves more preprocessing with diverse operations, which makes it more complicated and vulnerable to classification errors. For instance, the diversity of dialects, diacritic marks, and punctuation symbols, as well as the type and number of grammatical rules, distinguish Arabic from English. Arabic tweet spam detection therefore involves additional effort, from dataset collection through preprocessing to the training and evaluation of the models.
Regarding the limitations of the study, it should be noted that the dataset used for analysis is somewhat restricted. Nonetheless, the tweets were collected from a diverse range of users with different Arabic dialects. To enhance the dataset, it is recommended to employ data augmentation techniques. It is also recommended to use advanced feature extraction techniques and encoders to further fine-tune the results, for example, word to vector (word2vec) and global vectors for word representation (GloVe), along with Modern Arabic Bidirectional Encoder Representations from Transformers (MARBERT), a large-scale pre-trained masked language model covering both Dialectal Arabic (DA) and MSA [30,50].

Conclusions
The purpose of this study was to identify spam tweets in Arabic by utilizing machine learning and deep learning techniques. Four different models, namely Support Vector Machine, Naïve Bayes, Random Forest, and LSTM, were tested and evaluated using a dataset that was collected and combined with existing datasets. The experimental results revealed that the Random Forest classifier achieved the highest accuracy, precision, recall, and F1-score, followed by the LSTM model, with no signs of overfitting observed. The SVM and NB models performed relatively poorly on all metrics, with SVM performing better than NB overall. The proposed models exhibited promising and improved performance in contrast to closely related state-of-the-art approaches. These findings suggest that ensemble and deep learning models are suitable for classifying Arabic tweets and are superior to other methods. In the future, the authors intend to investigate stacking ensemble models and transfer learning using more enriched and augmented datasets. Moreover, researchers in the field may investigate other feature extraction methods and preprocessing techniques within the existing problem, such as word to vector (word2vec) and global vectors for word representation (GloVe).

Figure 2. Methodology of the proposed study. Research steps:
• Read data from Twitter using Python's pandas library and extract the data frame.
• Preprocessing: This step is crucial when applying AI algorithms, because the algorithms are not always compatible with raw text.
• NLP: This step is essential for converting the data to a form to which AI can be applied. It contains normalizing letters (converting letters with multiple forms to a single form), tokenizing the text (converting each word to a token) to prepare the data for the next step, and lemmatizing (converting each word to its root).


Figure 4. Comparison of RF and LSTM models.


Figure 5. Comparison with state-of-the-art approaches.


Table 1. Summary of the techniques in the English language.

Table 2. Summary of the techniques in the Arabic language.

Table 3. Examples of some letters in normalized form.


Table 5. Example of Arabic stop words.

Table 6. Performance evaluation of the proposed models.
