Comparative Study of Machine Learning Algorithms for Performing Ham or Spam Classification in SMS

Purpose: Fraud is rampant in the current era of technology, in which a great deal of information is easily accessible. Everyone therefore needs to be able to tell whether the information they receive is legitimate or fraudulent. This research classifies SMS messages as ham or spam. Its purpose is to build a model that helps classify messages and to determine which machine learning method performs ham or spam classification accurately and efficiently. Methods: The classification is carried out with machine learning methods, namely the Random Forest, Logistic Regression, Support Vector Classification, Gradient Boosting, and XGBoost Classifier algorithms. Results: In testing, Random Forest achieved an accuracy of 97.28%, Logistic Regression 94.67%, Support Vector Classification 97.93%, and XGBoost Classifier 96.47%. The best precision obtained in this study is 98%, with the Random Forest algorithm. The best recall is 94% and the best f1-score is 95%, both with the SVC algorithm. Novelty: This research compares several algorithms. XGBoost has rarely been used in previous research to classify ham or spam messages. We focus on giving brief information based on the algorithm comparison and on showing the best algorithm for classifying ham or spam messages. In addition, the machine learning model built in this research achieves better accuracy than previous research.


INTRODUCTION
Technological advances are growing rapidly, especially in the telecommunications sector [1], [2]. However, this growth can also lead to digital crime [3]. Many frauds are carried out digitally, and one of the media used is SMS [4]. This happens because SMS is easy to use and inexpensive to send [5], and because SMS does not require the internet [6], so messages can be sent easily at any time. Salman et al. [7] stated that in 2021 at least an estimated 86 million USD was lost to SMS fraud in the United States. Fraud can also take the form of spam, which is very disturbing to SMS users [8] because it usually carries unimportant information [9]. Spam emails are unsolicited, often irrelevant or malicious messages sent in bulk to a large number of recipients without their consent [10]. SMS spam may be exploited to spread hate speech and criminal activity, or to disseminate commercial information [11].
Machine learning (ML) is part of Artificial Intelligence (AI) [12]. It also belongs to computer science; in ML, computers can learn independently without being explicitly programmed [13]. In other words, a machine learning model can learn automatically once it has been given an algorithm for handling the given task [14], so that it can perform intelligent data analysis automatically [15]. This is possible because machine learning algorithms use a mathematical approach to learn patterns from data [16] and can therefore handle data more efficiently [17]. The model built can thus update itself and adjust to the actions that have been taken [18]. The results of the model's analysis and learning can then be used for prediction, estimation, or classification [19].
Random Forest is a classification technique that combines several decision tree models [20] into a forest [21] that can be used for data classification. Using many trees can produce better accuracy [22]. Logistic Regression is a type of supervised classification [23], so it requires a label as a target. In practice, logistic regression performs very well in predicting discrete outcomes with only two classes [24]. This is possible because logistic regression models the probability of an event with a logistic function [25]. Support Vector Classification (SVC) is a classification algorithm based on Vapnik's statistical learning theory [26]. SVC works in the same way as SVM: it looks for the separating hyperplane with the maximum margin, that is, the largest distance between the hyperplane and the nearest samples, the support vectors [27]. XGBoost is an implementation of the Gradient Boosting Decision Tree [28]. It is also an ensemble algorithm that combines several decision tree models [29].
In the process, XGBoost improves on the decision tree so that the tree model built does not overfit [30].
This research discusses the classification of SMS messages as ham or spam. Its purpose is to build a classification model that can separate ham from spam based on SMS text. The model is expected to help users determine whether a received message is ham or spam. The research uses four machine learning algorithms for classification, namely Random Forest, Logistic Regression, Support Vector Classification, and XGBoost, in order to compare them and find which one classifies ham or spam best. Random Forest is used because it is an ensemble learning model that combines several decision trees, so it is expected to perform well in classification. Logistic Regression is used because it is an excellent algorithm for classification with only two main classes. Support Vector Classification is used because it works like SVM, which relies on support vectors and is good at classifying data. XGBoost is used because it is an ensemble learning algorithm in which each tree improves on the previous ones, so it is expected to perform well in data classification.
Research conducted by Jakins et al. [31] discusses building AI models that predict diseases using random forest and naïve Bayes classifiers. The purpose of that research is to build a model that can accurately predict diabetes, coronary heart disease, and breast cancer. For diabetes prediction, the naïve Bayes algorithm obtained the better test accuracy, 74.46%. For heart disease prediction, the random forest algorithm obtained the better test accuracy, 83.85%, and for breast cancer prediction it also obtained the better test accuracy, 92.40%. Research conducted by Shah et al. [23] in 2020 presents a comparative study of logistic regression, random forest, and K-Nearest Neighbors for text classification. The purpose of that research is to find which algorithm is more effective for classifying text. After testing, the logistic regression algorithm achieved the best classification accuracy, 97%.
Research conducted by Barman et al. [32] discusses the classification of soil texture using multi-class SVM. The purpose of that research is to process soil images in order to build a soil classification system that rural farmers can use at low cost. With 3 soil classes the test accuracy was 95.72%, and with 12 soil classes it was 91.37%. Research conducted by Qi [33] in 2020 discusses the text-based classification of theft. The purpose of that study is to determine how the XGBoost classifier performs in classification compared with other machine learning algorithms. After several adjustments, the XGBoost algorithm obtained the best results, with a precision of 96.8%, a recall of 96.4%, and an f1-score of 96.6%. Previous research conducted by Kudupudi in 2021 [34] also discusses the classification of ham or spam messages using the same dataset. That study, which used the logistic regression algorithm, obtained an accuracy of 96%.

METHODS
In this research, we used the SMS spam dataset obtained from the kaggle.com website. The data consist of two main classes, ham and spam, with a total of 5574 messages. The data consist of 4825 ham messages and 747 spam messages. This research focuses on SMS for ham or spam classification with machine learning because SMS data are available in large quantities, provide long-term validity for algorithms to practice classification, are easy to adapt to other message types, and are still relevant to most users in various regions. Figure 1 shows a visualization of the dataset used in this study. The 5574 messages are divided into training data and test data, with 70% training data and 30% test data. This dataset can serve as an initial foundation for developing and validating machine learning methods for ham or spam classification. In addition, although the data are from 2012, the basic concepts of spam message classification are still relevant and can be adapted to current trends. Therefore, while this dataset may not fully reflect today's real-world SMS, the results of this study can provide useful preliminary insights for developing more advanced methods that can be applied to larger datasets and more current contexts.
Before training and testing, the data must first be processed. The processing in this research converts the text data into vectors so that they can be handled more easily by the machine learning algorithms. A statistical approach is used for this conversion, namely TF-IDF, or Term Frequency/Inverse Document Frequency [35]. TF-IDF converts text into numbers based on the importance of each word in a document and in the corpus [36]. The weight of each word is obtained during this analysis [37]. The formula for TF (how often a word appears) is given in equation 1, the formula for IDF (how unique the word is) in equation 2, and the formula for the weight (the TF-IDF value) in equation 3. Although TF-IDF is limited in capturing semantics, it is effective at identifying the keywords that distinguish ham messages from spam, especially with limited data. Semantic methods such as Word2Vec or GloVe, by contrast, require more data and substantial computational resources. The choice of method depends on resources and research objectives; here TF-IDF provides adequate results with limited resources, and alternative methods can be considered according to the research context.
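The TF, IDF, and weight calculations described above can be illustrated with a minimal sketch. It assumes the common textbook definitions (term count divided by document length, the logarithm of the number of documents over the document frequency, and their product), which may differ in detail from the paper's equations 1 to 3. The function name `tf_idf` and the toy messages are hypothetical and are not taken from the dataset. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values would differ from the ones computed here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed textbook definitions (not taken from the paper):
      tf(t, d) = count of t in d / number of terms in d
      idf(t)   = log(N / df(t)), where N is the number of documents
                 and df(t) is the number of documents containing t
      weight   = tf(t, d) * idf(t)
    """
    # Naive whitespace tokenization; a real pipeline would clean the text more.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy messages, invented for illustration only.
messages = [
    "win a free prize now",
    "are you free for lunch",
    "call now to win cash",
]
vectors = tf_idf(messages)
```

In this toy corpus, "prize" occurs in only one message and therefore receives a higher weight than "free", which occurs in two. This is the behaviour that lets TF-IDF highlight words that distinguish one message from the others.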
This paper may not have directly considered the relevance of SMS messages or presented the most advanced method. However, the goal was to develop a solid foundation for classifying ham or spam messages based on text characteristics. Using traditional algorithms is a reasonable step toward understanding the basics of SMS message classification before considering more complex methods. SMS messages may vary in urgency and importance, but the results of this study can be a useful basis for further development, including the assessment of message urgency levels. Thus, although this study has limitations, it provides a foundation for further research on SMS message classification.
After the text has been transformed into vectors, the data can be used to train and test machine learning models. This research uses models built with the Random Forest, Logistic Regression, Support Vector Classification, and XGBoost Classifier algorithms.
Random Forest is an algorithm that combines several decision tree models [20] to form a forest [21], from which the classification is finally made. Random Forest is therefore an ensemble learning algorithm, that is, one that combines several models to perform better classification. The stages of the random forest process are as follows.
Define the number of trees N.
(1) Randomly select X features from the data.
(2) For each i in X:
    (1) Compute the entropy: $E(T) = -\sum_{j=1}^{n} p_j \log_2(p_j)$
    (2) Compute the entropy after splitting on X: $E(T, X) = \sum_{c \in X} P(c)\,E(c)$
    (3) Compute the information gain: $Gain(T, X) = E(T) - E(T, X)$
    (4) Select the node X that has the highest information gain.
    (5) Split the node into sub-nodes.
    Repeat steps 1 to 5 until the tree is constructed and the required minimum number of samples is reached.
(3) Repeat steps 1 to 2 N times to build a forest of N trees.
Here $p_j$ is the proportion of samples in class $j$, $n$ is the number of classes, and $P(c)$ is the proportion of samples that take value $c$ of feature X.
Logistic Regression is a type of supervised classification [23] that performs very well in predicting discrete outcomes [24]. This is possible because logistic regression models the probability of an event with a logistic function [25]. The logistic function yields a value between 0 and 1, which is then assigned to class 0 or 1. Logistic regression is therefore well suited to binary classification. The pseudocode for logistic regression is given below.
Support Vector Classification is a classification algorithm based on Vapnik's statistical learning theory [26]. SVC works in the same way as SVM: it looks for the separating hyperplane with the maximum margin, that is, the largest distance between the hyperplane and the nearest samples, the support vectors [27]. Because it follows the same process as SVM, SVC also requires a hyperplane to make predictions. The pseudocode for SVC is given below.
XGBoost is an implementation of the Gradient Boosting Decision Tree [28]. It is also an ensemble algorithm that combines several decision tree models [29]. In the process, XGBoost improves on the decision tree so that the tree model built does not overfit [30]. The resulting model is therefore expected to make good predictions. The pseudocode for XGBoost is given below.
After training, the model is tested, and its performance can then be calculated. In this study, model performance is calculated from a confusion matrix, which records which of the model's predictions were correct or wrong during testing. The values in the confusion matrix are true positives, false positives, true negatives, and false negatives. From these values, model performance can be expressed as accuracy, precision, recall, and f1-score. The formulas for precision, recall, and f1-score are given in equations 4, 5, and 6.
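For reference, the standard definitions of these three measures are sketched below in terms of true positives (TP), false positives (FP), and false negatives (FN). The numbering follows the order in which the text cites equations 4 to 6; the exact notation is an assumption and may differ from the paper's own equations.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{4}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{5}\\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}
\end{align}
```

Under the same convention, accuracy is the share of all predictions that are correct, $(TP + TN) / (TP + TN + FP + FN)$, where TN is the number of true negatives.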
The work is split into two processes, training and testing. Training lets the model built with a machine learning algorithm learn patterns from the given data. Testing then evaluates the trained model so that its performance in classifying SMS text as ham or spam is known. The flow of the classification research method is given in Figure 2, and each stage is explained below.
1. Read the data for the classification process.
2. Transform the data from text to vectors so that they can be used for training and testing the models.
3. Divide the vectorized data into training data and test data, with 70% training data and 30% test data.
4. Build the models used for classification, namely models with the Random Forest, Logistic Regression, Support Vector Classification, and XGBoost Classifier algorithms. The parameters used in each model are given in Table 1.
5. Train and test the models, with their parameters, on the training and test data divided earlier.
6. After testing, calculate the performance of each model. The performance measures, namely accuracy, precision, recall, and f1-score, are calculated from the values of the confusion matrix.
Table 2 shows the accuracy obtained after testing for ham or spam classification. As seen in Table 2, the best accuracy, 97.93%, is obtained with the Support Vector Classification algorithm. This shows that the Support Vector Classification model can classify ham or spam accurately. The lowest accuracy, 94.67%, is obtained with the Logistic Regression algorithm. However, all accuracies in Table 2 exceed 94%, which means that classification works well in every model.
The confusion matrix is also obtained during testing and is given in Figure 3. Figure 3 shows the confusion matrix of the model test results, from which the numbers of true positives, false positives, true negatives, and false negatives can be read. These values are used to calculate precision, recall, and f1-score, which in turn describe the performance of the model in classification. The resulting precision, recall, and f1-score values are given in Table 3.
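The step from confusion matrix values to accuracy, precision, recall, and f1-score can be sketched as follows. The counts are hypothetical and are not the values in Figure 3, and the function name `classification_metrics` is illustrative. The text does not state how the reported averages are computed, so the sketch assumes an unweighted (macro) average over the ham and spam classes.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical confusion matrix counts, NOT the values of Figure 3.
ham_as_ham, ham_as_spam = 90, 10    # true class: ham
spam_as_ham, spam_as_spam = 5, 45   # true class: spam

# Treat each class in turn as the positive class.
spam = classification_metrics(tp=spam_as_spam, fp=ham_as_spam,
                              tn=ham_as_ham, fn=spam_as_ham)
ham = classification_metrics(tp=ham_as_ham, fp=spam_as_ham,
                             tn=spam_as_spam, fn=ham_as_spam)

# Assumed macro average: unweighted mean of the per-class scores.
macro = {name: (ham[name] + spam[name]) / 2
         for name in ("precision", "recall", "f1")}

# Support is the number of test samples behind the scores.
support = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
```

Accuracy is the same whichever class is treated as positive, while precision, recall, and f1-score differ per class. This is why a comparison on an imbalanced ham and spam dataset benefits from reporting all four measures.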
Table 3 shows the precision, recall, and f1-score obtained after testing the models. As seen in Table 3, the best average precision is obtained with the Random Forest algorithm. This means that the predictions made by the Random Forest model are correct for a large share of the messages it assigns to each class. The best average recall is obtained with Support Vector Classification, which shows that the Support Vector Classification model recovers most of the messages in every class. The best average f1-score, 95%, is also obtained with Support Vector Classification, which shows that this model achieves a very good harmonic mean of precision and recall. The support value is the same for all algorithms, 1839, because the support is the amount of data used for testing.
Some research has already been done on ham or spam classification. Table 4 lists related research on this task and compares the methods used in previous research with those used in this research. The method used in this study succeeds in improving accuracy on the ham or spam classification task, as shown by the higher accuracy obtained in this study compared with previous research.
Table 4 (rows recoverable from the source):
Reference | Method | Reported accuracy
[38] | CNN and RNN | CNN 96.4%, RNN 97.8%
Kudupudi [34] | Logistic Regression | 96%
Alzahrani [39] | Logistic Regression, Naïve Bayes, SVC, Neural Network | LR 94.26%, NB 88.16%, SVC 94.26%, NN 97.67%
Shobana [40] | Multinominal

CONCLUSION
After testing and analyzing ham or spam classification with machine learning algorithms, the best accuracy obtained is 97.93%, achieved with the Support Vector Classification algorithm. It can therefore be concluded that Support Vector Classification is an algorithm that classifies SMS text as ham or spam properly and accurately. The results also show that the model built in this study achieves better accuracy than previous studies.
Future research is expected to use neural network methods so that their classification performance can be observed. Further research could also classify Indonesian news or text, use other processing methods such as Word2Vec or GloVe, and add more parameters to the Random Forest, Logistic Regression, Support Vector Classification, or XGBoost Classifier algorithms, so that a more in-depth performance analysis of the data classification process can be carried out.

Figure 2. Workflow process for classification

Table 1. Parameters for the machine learning algorithms

Table 2. Accuracy comparison of the models built

Table 3. Precision, recall, f1-score, and support from the testing process

Table 4. Comparison with previous research