ETLE Sentiment Analysis Performance Increasement with TF-IDF, MDI Feature Selection, and SVM



Introduction
In Indonesia, the government, through the Indonesian National Police (POLRI), has recently released a new regulation, Electronic Traffic Law Enforcement (ETLE). Under ETLE, traffic tickets are issued electronically through camera monitoring connected directly to the vehicle registration certificate (STNK) database [1]. The National Police expects several benefits from implementing ETLE, including a more efficient and effective ticketing process and reduced ticketing bribery [2]. The government can measure whether people like or dislike such public policies through sentiment analysis [3].
Several studies have applied sentiment analysis to find out people's responses to ETLE. Khalida et al. [4] used naïve Bayes for predictive sentiment analysis on Twitter comments about ETLE. However, this model only achieves an accuracy of 0.42. Meanwhile, Rahat et al. [5] showed that, for sentiment analysis in the field of COVID-19, a support vector machine (SVM) can outperform naïve Bayes. There is therefore a research opportunity to improve the performance of the sentiment analysis model on ETLE by trying SVM.
Term frequency-inverse document frequency (TF-IDF) is an important feature extraction stage in sentiment analysis [6]. The TF-IDF process usually produces many new features in the dataset and causes the curse of dimensionality problem [7]. A feature selection process can reduce the number of these features. For example, Nafis et al. [8] used SVM with recursive feature elimination (RFE) to reduce the number of features in their sentiment analysis. Mean decrease of impurity (MDI) is another feature selection method, and several studies have used it for text analysis. For example, Rabby et al. [9] used MDI to see which words are the most important in classifying documents related to COVID-19. Using MDI for feature selection on TF-IDF in the sentiment analysis process is a research opportunity.
This study proposes using SVM, TF-IDF, and MDI to evaluate sentiment polarity analysis on ETLE policies. First, we retrieve tweets about ETLE from Twitter. Then we pre-process the text and remove stop words. The next step is to carry out the TF-IDF process. We apply two feature selection methods for comparison: MDI and RFE. Next, we compare two classification models, naïve Bayes and SVM. We evaluate the pre-processing stage with the probability density function (PDF) and the t-test. Meanwhile, we use a bag of words (BoW) to evaluate the stop word removal stage. Finally, we use sensitivity, specificity, and the receiver operating characteristic (ROC) curve to evaluate the feature selection and classification methods.
To the best of our knowledge, there has never been an evaluation of sentiment analysis on ETLE policies using TF-IDF, MDI, and SVM. Our research contributions are as follows:
1. A novel sentiment analysis model for ETLE with features extracted from tweets using TF-IDF
2. Novel features for the sentiment analysis model on ETLE, selected using MDI
3. An enhanced classification model for sentiment analysis of tweets related to ETLE policies using SVM
The remainder of this paper is organized as follows: Chapter 2 explains the papers related to our research and how our research addresses gaps in existing research. Chapter 3 presents the research methodology roadmap and describes each process. Chapter 4 shows the results of testing and a discussion of the contributions of this research. Finally, Chapter 5 concludes this research.

Several studies have applied computer science to ETLE. Pratama et al. [10] discussed the installation of a smart city in Jambi City, Indonesia. The smart city installation consists of various features, including ETLE and a complaint application called SIKESAL. At the end of the study, there is a report on sentiment analysis of the feedback in SIKESAL. Khalida et al. [4] used naïve Bayes for predictive sentiment analysis on Twitter comments about ETLE. However, this model only achieves an accuracy of 0.42.
Furthermore, several studies have used TF-IDF and SVM for sentiment analysis. Prabowo et al. [11] used TF-IDF and SVM for sentiment analysis in cyberbullying detection. They took their data from Instagram comments, and the prediction results have an accuracy of 0.93. Alkaff et al. [12] also used TF-IDF and SVM for sentiment analysis, but on YouTube user comments on movie trailers. SVM gives the best accuracy in that study, with a value of 0.86. The research opportunity is to apply TF-IDF and SVM to sentiment analysis of ETLE policies.
Several studies have implemented feature selection to increase the quality of TF-IDF. Nafis et al. [13] used SVM-RFE for feature selection in text analysis. In their study, this method outperforms the other methods and gives an accuracy of 0.98. Rabbi et al. [9] used MDI to see which words are the most important in classifying documents related to COVID-19. Using MDI for feature selection on TF-IDF in the sentiment analysis process is a research opportunity. Table 1 compares the related works and highlights our research contribution.

Text pre-processing consists of several important stages [14]. The translation process translates tweets from Indonesian into English. Then, case folding makes all letters lowercase. Next, the stop word removal stage removes unnecessary words such as "and," "or," and "will." In the next step, the stemming process removes the affixes of a word. In addition, the lemmatizing process returns a word to its base form.
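The pre-processing stages described above can be sketched as a small pipeline. This is a minimal illustration, not the implementation used in this study: the stop word list, the suffix list, and all function names are assumptions made for the example, the translation stage is omitted, and a real pipeline would rely on an established stemmer or lemmatizer rather than naive suffix stripping.

```python
import re

# Illustrative stop word list (assumption); the list used in the study
# is not specified.
STOP_WORDS = {"and", "or", "will", "the", "of", "a", "is", "to"}

# Toy suffix list for stemming (assumption).
SUFFIXES = ("ing", "ed", "s")


def case_fold(text):
    """Make all letters lowercase."""
    return text.lower()


def tokenize(text):
    """Split lowercase text into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Case folding, tokenization, stop word removal, then stemming."""
    tokens = tokenize(case_fold(tweet))
    tokens = remove_stop_words(tokens)
    return [stem(t) for t in tokens]


print(preprocess("The police will be ticketing drivers and cameras"))
```

Keeping each stage as its own function makes it easy to measure the effect of a single stage, for example tweet length before and after stop word removal.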

TF-IDF, MDI, and SVM
TF-IDF is a feature extraction method in text analysis that indicates how important a word is in a corpus [15]. TF-IDF consists of two phases. TF describes the frequency of a word $t$ in document $d$ [16]. The formula for TF, $\mathrm{tf}(t, d)$, is given in equation (1):

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{1}$$

where $f_{t,d}$ is the number of occurrences of the word $t$ in document $d$ and $\sum_{t' \in d} f_{t',d}$ is the total number of words in document $d$.
The next phase, IDF, describes how common a word is across the set of documents $D$ [17]. The IDF is given in equation (2):

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2}$$

where $N$ is the number of all documents and $|\{d \in D : t \in d\}|$ is the number of documents in which the word $t$ appears.

Random forest provides MDI as its feature importance measure, which can be used for feature selection [18]. MDI is calculated by adding up the impurity reductions at every node of a tree where the feature is used for splitting, and then averaging the results over all trees [19]. The MDI of a feature within one tree, $\mathrm{MDI}(X_j, T)$, is given in equation (3):

$$\mathrm{MDI}(X_j, T) = \sum_{t \in T :\, v(s_t) = X_j} p(t)\, \Delta i(s_t, t) \tag{3}$$

where $T$ is a tree in a random forest, $X_j$ is a feature, $t$ is a node in $T$, and $p(t)$ is the proportion of samples that reach $t$. The impurity reduction at node $t$ is given in equation (4):

$$\Delta i(s_t, t) = i(t) - \frac{N_{t_L}}{N_t}\, i(t_L) - \frac{N_{t_R}}{N_t}\, i(t_R) \tag{4}$$

where $t_L$ is the left child of node $t$, $t_R$ is the right child of node $t$, and $N_t$ is the number of samples at node $t$. Then $i(t)$ is the impurity function, $s_t$ is the split at $t$, and $v(s_t)$ is the feature used for splitting at $t$.

SVM is a supervised machine learning method for binary classification that looks for the separating hyperplane with the widest margin between the two classes in the dataset [20]. If a hyperplane can separate the two classes, the dataset is linearly separable; if it cannot, a kernel function is needed [21]. The kernel function maps the dataset to a higher dimension so that the dataset becomes linearly separable [22]. One of the kernel functions is the radial basis function (RBF). The RBF kernel, $K(x, y)$, where $x$ and $y$ are two input vectors, is given in equation (5):

$$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right) \tag{5}$$

where $\lVert x - y \rVert^2$ is the squared Euclidean distance and $\gamma$ is a free parameter.
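To make the quantities in this section concrete, the following sketch computes TF and IDF as described for equations (1) and (2), the impurity reduction of a single split, and the RBF kernel value. It rests on several assumptions: the natural logarithm is used for IDF, Gini impurity stands in for the impurity function (the text does not name one), the kernel width parameter is called `gamma`, and the toy corpus and all function names are invented for illustration.

```python
import math
from collections import Counter


def tf(term, doc):
    """Occurrences of term divided by the total number of words in doc."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Log of the number of documents over the number containing term.

    Raises ZeroDivisionError if the term appears in no document.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def gini(labels):
    """Gini impurity of the labels reaching a node (assumed impurity function)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of both children."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Toy corpus of tokenized documents (illustrative only).
corpus = [
    ["etle", "camera", "ticket"],
    ["etle", "police", "police"],
    ["traffic", "ticket"],
]
tfidf_police = tf("police", corpus[1]) * idf("police", corpus)
print(round(tfidf_police, 4))
print(impurity_decrease([1, 1, 0, 0], [1, 1], [0, 0]))
```

In the toy corpus, "police" appears twice in a three-word document and in only one of three documents, so it receives a high TF-IDF weight, whereas "etle" appears in two documents and is weighted lower. A split that separates the two labels perfectly removes all of the parent's impurity, which is the largest reduction that node can contribute to MDI.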

Benchmarking and Testing Metrics
We test whether tweet length differs significantly before and after pre-processing by comparing the PDFs of the two datasets and applying a t-test [23]. The two datasets differ significantly in tweet length if the t-test gives a $p$-value < 0.05 at a confidence level of 0.95 [24].
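The comparison of tweet lengths can be sketched as follows. The text does not state which t-test variant is used, so this example assumes Welch's unequal-variance t-test, and it approximates the two-sided p-value with the standard normal distribution, which is only reasonable for large samples. The function names and the toy samples are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def approx_two_sided_p(t):
    """Two-sided p-value from the standard normal (large-sample assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Toy tweet lengths in words (illustrative, not the study's data).
before = [1, 2, 3, 4, 5]
after = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(before, after)
print(t_stat, dof, approx_two_sided_p(t_stat))
```

With only a handful of samples, as in the toy data, the exact Student's t distribution with the returned degrees of freedom should replace the normal approximation.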
We use BoW to count the words with the highest frequency before and after removing stop words. BoW is a feature extraction method that counts word frequencies by treating a sentence as a bag of words, so no attention is given to grammar and structure [25]. The BoW formula is given in equation (6):

$$\mathrm{BoW} = \biguplus_{i=1}^{n} d_i \tag{6}$$

where $n$ is the number of documents in the corpus, $d_i$ is the multiset of words in the $i$-th document, and $\biguplus$ is the disjoint union operation.

We benchmark MDI against SVM-RFE [13]. RFE is a wrapper-type feature selection method, which requires a classifier for its feature selection process; here the classifier is SVM [26]. Figure 2 shows the SVM-RFE algorithm, where FS is a data structure that stores feature sets [27]. The feature set contains a combination of possible features. FR is a data structure that stores feature rankings. In the SVM model, feature ranking uses the weight of each feature. N is the size of the FS. The recursion occurs in the iterative feature selection process.
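The BoW count described at the start of this section can be sketched with Python's `collections.Counter`, whose addition behaves like a multiset union that sums word counts across documents. The toy documents, the stop word list, and the function name are assumptions made for the example.

```python
from collections import Counter


def bag_of_words(documents):
    """Sum the per-document word multisets into one corpus-level bag."""
    bag = Counter()
    for tokens in documents:
        bag += Counter(tokens)
    return bag


# Toy tokenized documents (illustrative only).
docs = [
    ["the", "etle", "camera", "of", "the", "police"],
    ["the", "traffic", "police"],
]
stop_words = {"the", "of"}  # illustrative list, not the study's

before = bag_of_words(docs)
after = bag_of_words([[t for t in d if t not in stop_words] for d in docs])

print(before.most_common(2))
print(after.most_common(2))
```

Comparing the most frequent words of the two bags shows directly whether stop words dominated the counts before removal.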

Figure 2. The SVM-RFE algorithm
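The recursive elimination loop can be sketched as below. This is only a skeleton under stated assumptions: the ranking model is passed in as a hypothetical `fit_weights` callable, and the example uses a simple class-mean difference as a stand-in ranker, which is not an SVM. In SVM-RFE the weights would come from a trained linear SVM. The variable names only loosely mirror the FS and FR structures of Figure 2.

```python
def rfe(X, y, n_keep, fit_weights):
    """Repeatedly drop the feature with the smallest absolute weight.

    X: list of rows, y: list of 0/1 labels,
    fit_weights(X_sub, y) -> one weight per column of X_sub.
    Returns (kept feature indices, eliminated indices in removal order).
    """
    remaining = list(range(len(X[0])))  # current feature set
    eliminated = []                     # features in elimination order
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit_weights(X_sub, y)
        worst = min(range(len(remaining)), key=lambda k: abs(weights[k]))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


def mean_difference_weights(X_sub, y):
    """Stand-in ranker (NOT an SVM): class-mean difference per feature."""
    pos = [row for row, label in zip(X_sub, y) if label == 1]
    neg = [row for row, label in zip(X_sub, y) if label == 0]
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(len(X_sub[0]))
    ]


# Toy data: feature 0 separates the classes, feature 1 does not,
# feature 2 separates them only weakly.
X = [[1.0, 0.2, 5.0], [0.9, 0.1, 5.2], [0.1, 0.2, 5.0], [0.0, 0.1, 5.0]]
y = [1, 1, 0, 0]
kept, dropped = rfe(X, y, n_keep=1, fit_weights=mean_difference_weights)
print(kept, dropped)
```

Refitting the ranker after every removal is what distinguishes RFE from a one-shot ranking: the weight of a feature can change once a correlated feature has been eliminated.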
We benchmark the SVM classifier against naïve Bayes. Naïve Bayes classifies with Bayes' theorem under an assumption of strong independence between features [28]. A problem that often arises in sentiment analysis is data imbalance. With imbalanced data, it is important to measure sensitivity and specificity [29]. In binary classification, sensitivity shows the ability of a model to predict label 1, whereas specificity shows the ability of a model to predict label 0. We also calculate the accuracy of the best model. The formulas for sensitivity, specificity, and accuracy are as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives, and FP is the number of false positives [30].

ROC is a curve showing the relationship between the true positive rate (TPR) and the false positive rate (FPR) [31]. TPR is equivalent to sensitivity, while FPR equals 1 − specificity [32]. AUC is the area under the ROC curve; a larger AUC signifies better performance, i.e., the model is better able to discriminate between labels 0 and 1 [33]. For a model that performs no worse than chance, the AUC ranges from 0.5 to 1 [34]. A value of 0.5 indicates that a model's predictive ability is equal to random guessing [35].
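The evaluation metrics in this section can be sketched in a few lines. The function names and toy labels are illustrative assumptions. AUC is computed here through its rank interpretation, the probability that a randomly chosen label-1 example scores higher than a randomly chosen label-0 example (ties count half), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Share of label-1 examples predicted as 1 (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of label-0 examples predicted as 0."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all examples predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative only).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))
print(auc(y_true, scores))
```

A model that gives every example the same score ranks no pair correctly and ties them all, so its AUC is 0.5, which matches the random-guess baseline described above.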

Results and Analysis
In this section, we present and analyze the results of our sentiment analysis on tweets related to the ETLE policies, demonstrating the effectiveness of our preprocessing steps and the comparative performance of different feature selection and classification methods.

Results
The first stage of testing is pre-processing. We analyze tweet length statistics before and after pre-processing. The average tweet length before pre-processing is 28.3 ± 13.2 words, while the average tweet length after pre-processing is shorter, namely 26.2 ± 12.5 words. We use the t-test to analyze the significance of the difference between the distributions of the two datasets. The t-test result shows that, at a confidence level of 0.95, the difference is significant, with a $p$-value of 0.002. Figure 3 shows the PDFs of the two datasets.

Figure 3. PDF comparison of tweet length before and after pre-processing stage
In the next step, we remove stop words. We use BoW to analyze changes in the dataset before and after removing stop words. Figure 4 shows the bar chart of BoW results before and after removing stop words. Before removing stop words, the two most frequent words are "the," with a total of 1,232, and "of," with 622. After the stop word removal stage, these two words no longer appear. Instead, the top 5 words are "etle" with 543, "traffic" with 404, "police" with 401, "electronic" with 169, and "national" with 159.

We then combine the two classifiers, naïve Bayes and SVM, with three feature configurations: TF-IDF alone, TF-IDF with SVM-RFE feature selection, and TF-IDF with MDI feature selection. So, we have six models. We test each model based on its sensitivity and specificity values. TF-IDF produces 1,022 new features, which SVM-RFE reduces to 402. MDI also reduces these features to 402. Figure 5 shows a performance comparison of the six models. Five models, all except the naïve Bayes+TF-IDF model, have a sensitivity of 1.0. Among these five models, SVM+TF-IDF+MDI has the highest specificity, 0.94. The naïve Bayes+TF-IDF model has a higher specificity, 1.0, but a sensitivity of 0.00. We also calculate the accuracy of the SVM+TF-IDF+MDI model, which is 0.99.

To get a value that summarizes sensitivity and specificity, we compute the ROC of each model. Figure 6 shows the ROC of the six models. The model with the highest AUC is SVM+TF-IDF+MDI, with a value of 0.97. The AUCs of the other two SVM models are lower than that of SVM+TF-IDF+MDI but higher than those of the three models using naïve Bayes. All models using naïve Bayes have AUC = 0.50, the same as a random guess.
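As a rough consistency check on the figures reported for SVM+TF-IDF+MDI, the accuracy can be recomputed from the sensitivity and specificity. This sketch assumes that label 1 is the positive class and that the metrics are computed over all 402 tweets (337 positive and 65 negative, the class counts given in the conclusion). The evaluation split is not stated in this section, so both points are assumptions.

$$\mathrm{Accuracy} = \frac{\mathrm{Sensitivity} \times 337 + \mathrm{Specificity} \times 65}{337 + 65} = \frac{1.0 \times 337 + 0.94 \times 65}{402} = \frac{398.1}{402} \approx 0.99$$

Under these assumptions, the result agrees with the reported accuracy of 0.99. Because positive tweets make up about 84% of the dataset, accuracy is dominated by sensitivity, which is why specificity and AUC are reported alongside it.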

Analysis
Paper [5] showed that SVM works better than naïve Bayes for sentiment analysis on COVID-19 tweets. Our contribution is to show that SVM is also better than naïve Bayes for sentiment analysis on tweets related to ETLE policies.
Previous research, namely paper [4], conducted sentiment analysis on tweets related to ETLE, but its accuracy is only 0.42. Our contribution is a sentiment analysis model for ETLE that adds the TF-IDF process. This novel model has an accuracy of 0.99, better than that of the state-of-the-art model.
Our feature extraction using TF-IDF resulted in 1,022 new features. That many features can trigger the curse of dimensionality. To reduce dimensions with feature selection, some studies apply SVM-RFE [8], while others apply MDI [9]. Our contribution is to show that MDI is better than SVM-RFE at reducing dimensions while improving sentiment analysis performance on ETLE-related tweets.

Conclusion
This study implements sentiment analysis on tweets about ETLE. We use a dataset crawled from Twitter in which 337 tweets have positive sentiment and 65 tweets have negative sentiment. We apply TF-IDF as the feature extraction process, compare MDI and SVM-RFE as feature selection methods, and compare SVM and naïve Bayes as sentiment classification methods. The test results show that TF-IDF produces 1,022 new features. The combination of the methods we use results in six models. SVM+TF-IDF+MDI is the model with the best performance compared to the other five models. Its accuracy and AUC are 0.99 and 0.97, respectively.

Figure 1. Our proposed research methodology

Figure 4. BoW comparison before and after removing stop words

Figure 5. Sensitivity and specificity comparisons of model performance from the combination of TF-IDF, two classification methods, and two feature selection methods