Sentiment Analysis of 2024 Presidential Candidates Election Using SVM Algorithm

Presidential elections are held every 5 years with various candidates. On Twitter in particular, arguments about political matters are common, and many users take part in discussions about the presidential election. This study therefore applies sentiment analysis to infer user responses to the presidential election, and validates the result by measuring the Pearson Correlation between electability survey results and Twitter sentiment data. Text data are converted into numeric data using the TF-IDF weighting method. The model uses the Support Vector Machine algorithm with the Gaussian RBF (Radial Basis Function) kernel. The 10-Fold Cross Validation method, which divides the dataset into training data and test data with a 90:10 split, is used to find the best model, and the Confusion Matrix is used to validate it by f1-score. The results of the analysis are compared, using Pearson Correlation, with the news portal electability surveys covering the 3 candidates. This study identifies, by f1-score, the best fold for each presidential candidate's data. In the Pearson Correlation result, higher positive sentiment accompanies higher electability for Anies Baswedan and Prabowo Subianto, while the relationship is reversed for Ganjar Pranowo. Further research can examine hyperparameter tuning and other kernels for the Support Vector Machine algorithm.


INTRODUCTION
Elections occur every 5 years with the number of voters increasing each period, with an increase of 28.90% from 2004 to 2019 [1]. Social media such as Facebook, Twitter, LinkedIn, and YouTube have become part of human life: most interactions can now take place on social media, to the point that it has become part of the self-identity of the general public [2]. On social media, especially Twitter, there is a lot of dissemination of information, entertainment, and political opinion. The election to be held in 2024, a political activity often referred to as a festival of democracy, is no exception and is widely discussed on social media. One way to convey opinions through social media is to use certain hashtags in tweets posted on Twitter. Twitter is used in this study because tweets are easy to respond to with likes, comments, and replies, so that topics currently being discussed spread easily [3]. Further analysis is therefore needed to see how much Twitter users agree or disagree with the 2024 presidential election, using data obtained from tweets related to this topic, namely the 2024 general election.

Sentiment analysis, or Opinion Mining, is the study of opinions, sentiments, evaluations, behavior, and emotions towards an entity [2], [3]. Sentiment analysis is used to find out whether posted tweets contain positive, negative, or neutral words [4]. This study uses the Support Vector Machine (SVM) method, a supervised machine learning classification method that predicts classes based on patterns learned from training. Classification is done by creating a hyperplane between the positive and negative classes [2]. The results of this study are the sentiments of Twitter users towards the presidential election in 2024.

II. RESEARCH METHODOLOGY
This research consists of 6 processes in outline, namely data collection, preprocessing, Term Frequency-Inverse Document Frequency weighting, Support Vector Machine training using 10-Fold Cross Validation, choosing the best fold to be used as a model, and testing, as shown below.

A. Data Collection
The first stage of this research is to scrape data from Twitter with the keyword "(Name of Candidate) President" to narrow down the context of the collected tweets. The data collection process is carried out with the help of the snscrape library for Python. Data was collected from 1 January 2023 to 31 March 2023. The scraping process produced a total of 18,267 tweets, of which 7,810 were duplicates or irrelevant to the theme "2024 Presidential Candidate Election", leaving a total of 10,457 clean tweets across the 3 presidential candidates.
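The cleaning step above can be pictured with a short sketch. The paper does not say how duplicates or irrelevant tweets were detected, so the rules here (case- and whitespace-insensitive matching for duplicates, a required-keyword check for relevance) and the function name `clean_tweets` are assumptions made only for illustration.

```python
def clean_tweets(tweets, required_terms):
    """Drop duplicate tweets and tweets that mention none of the required terms.

    Both rules are illustrative assumptions, not the paper's actual criteria.
    """
    seen = set()
    kept = []
    for text in tweets:
        # Normalise case and whitespace so trivially different copies match.
        key = " ".join(text.lower().split())
        if key in seen:
            continue  # duplicate of an earlier tweet
        seen.add(key)
        if not any(term in key for term in required_terms):
            continue  # off-topic tweet
        kept.append(text)
    return kept


raw = [
    "Pak Anies Baswedan calon presiden",
    "pak anies baswedan   calon presiden",  # duplicate after normalisation
    "Resep nasi goreng enak",               # irrelevant to the theme
]
cleaned = clean_tweets(raw, required_terms=["presiden"])

# The counts reported in the paper are consistent with each other.
remaining = 18267 - 7810
```

Here `cleaned` keeps only the first tweet, and `remaining` reproduces the 10,457 clean tweets reported above.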

B. Preprocessing
The pre-processing stage cleans and transforms the data so that it can be processed by the system easily and accurately. The pre-processing stages in this study include:
a. Lowercasing
Lowercasing changes all letters to lowercase so that the same word always takes the same form. Table 1 shows an example of the lowercasing process on some of Anies Baswedan's tweets:

[[pak] [anies] [baswedan] [calon] [presiden]] [[pak] [anies] [baswedan] [calon] [presiden]]
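A minimal sketch of the pre-processing chain follows, covering lowercasing, removal of special characters and numbers, and tokenization. Stemming is left out because it needs an Indonesian stemmer that the paper does not name; the function name `preprocess` and the regular expression are assumptions.

```python
import re


def preprocess(text):
    """Lowercase, strip special characters and numbers, then tokenize."""
    lowered = text.lower()
    # Keep only the letters a-z and whitespace; everything else becomes a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    # Splitting on whitespace yields one token per word.
    return letters_only.split()


tokens = preprocess("Pak Anies Baswedan, Calon Presiden 2024!")
```

For this input, `tokens` holds the five words `pak`, `anies`, `baswedan`, `calon`, and `presiden`, matching the token form shown in the example row above.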

C. TF-IDF
TF-IDF weighting is a strong feature weighting method that assesses the weight of each word in a document by measuring the frequency and distribution of words. It consists of TF (Term Frequency), which is the frequency of occurrence of a word in a document, and IDF (Inverse Document Frequency), which is the inverse of the number of documents containing the word. The TF-IDF value is obtained using the following equations:
a. The Term Frequency (TF) value is the frequency of appearance of feature t in document d [5]:
$tf_{t,d} = f(t, d)$
b. The Inverse Document Frequency (IDF) value is obtained from the logarithm of the number of documents n divided by df, the number of documents containing feature t, plus 1 [6]:
$idf_t = \log\left(\frac{n}{df_t}\right) + 1$
c. The Term Frequency Inverse Document Frequency (TF-IDF) value is obtained by multiplying the TF and IDF values [7]:
$w_{t,d} = tf_{t,d} \times idf_t$
TF-IDF converts the text into numerical form so that the processed features can be fed into the classifier's equations. This method was therefore chosen to extract features from the Twitter data.
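The weighting described above can be sketched in a few lines: TF is the raw count of a term in a document, and IDF is the logarithm of n divided by the document frequency, plus 1. The natural logarithm and the function name `tf_idf` are assumptions, since the paper does not state the log base or show its implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts inside this document
        weights.append({t: tf[t] * (math.log(n / df[t]) + 1) for t in tf})
    return weights


docs = [
    ["anies", "calon", "presiden"],
    ["ganjar", "calon", "presiden"],
    ["prabowo", "presiden", "presiden"],
]
w = tf_idf(docs)
```

A term that appears in every document, such as `presiden`, gets an IDF of exactly 1, so its weight equals its count. A rarer term such as `anies` receives a larger weight than `calon`, which appears in two of the three documents.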

D. Support Vector Machine
The algorithm chosen is the Support Vector Machine because it is a text mining algorithm with high accuracy [8]. In [8], the Support Vector Machine algorithm has higher accuracy, precision, and recall than the Naïve Bayes algorithm, and in [9] its accuracy, precision, and recall are higher than those of the K-Nearest Neighbor (K-NN) and Naïve Bayes algorithms. The Support Vector Machine was created by Boser, Guyon, and Vapnik [10] and is a classification method that finds the best hyperplane to separate 2 classes [11]. The algorithm produces a hyperplane that separates the space into 2 parts, one for each class, as can be seen in Figure 2. Several kernels can be applied to the SVM algorithm:
a. The linear kernel is a kernel function that separates 2 classes linearly. In its equation, x_i is the training data, with i = 1, 2, …, n indexing the data, and y_i is the label of the training data [10].
b. The polynomial kernel is a kernel function for data that are not linearly separable. Its equation is as follows:
$K(x_i, x_j) = (x_i \cdot x_j + c)^d$
where x_i and x_j are the training data and c and d are constants [10].
c. The Radial Basis Function (RBF) kernel is a function for analyzing data that are not linearly separable, and it requires 2 parameters (Gamma and Cost).

$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$ (6)
where $\lVert x_i - x_j \rVert$ is the Euclidean distance and $2\sigma^2$ is a free parameter that defines how quickly the RBF decays toward 0 [10].
To use the Support Vector Machine algorithm, it is necessary to find the best weight vector $\vec{w}$ to separate the classes. For data that are not widely spread, a linear or polynomial hyperplane is normally sufficient. In the authors' case, however, the data distribution is very wide and the feature dimensions are unbounded, so a kernel trick is needed. The authors address this by using the Gaussian RBF (Radial Basis Function) kernel.
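For comparison, the three kernels discussed above can be written as small functions. This is a sketch under the standard textbook definitions; the parameter name `sigma` (the Gaussian width, related to the Gamma parameter by gamma = 1 / (2 * sigma^2)) and the default values are assumptions, not settings reported in the paper.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # Standard linear kernel: the plain inner product.
    return dot(x, y)


def polynomial_kernel(x, y, c=1.0, d=2):
    # Inner product shifted by constant c and raised to degree d.
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, sigma=1.0):
    # Gaussian of the squared Euclidean distance between x and y.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


x_i = [1.0, 0.0]
x_j = [0.0, 1.0]
k_lin = linear_kernel(x_i, x_j)
k_poly = polynomial_kernel(x_i, x_j)
k_rbf = rbf_kernel(x_i, x_j)
```

The RBF value is 1 for identical points and falls toward 0 as the distance grows, which is what lets the classifier draw non-linear boundaries in the TF-IDF feature space.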

E. 10-Fold Cross Validation
K-Fold Cross Validation is a form of testing that assesses the performance of an algorithm by randomly dividing the sample data into K groups (folds). In each round, one of the K groups is used as test data and the remaining groups are used as training data [12].
The 10-Fold Cross Validation method divides the dataset into 10 parts, giving a test-to-training data ratio of 10:90 in every fold. An SVM model is trained on each fold, and its f1-score is computed.
The 10 data divisions are then compared, and the model is taken from the fold with the best f1-score.
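The fold selection described above can be sketched as follows. The helper names `k_fold_indices` and `best_fold`, the fixed shuffle seed, and the `score_fold` callback (which would train an SVM on the training indices and return the f1-score on the test indices) are all assumptions for illustration. The sample count 2,837 is the sum of the 2,554 training and 283 testing tweets reported for one candidate.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the samples
    folds = [indices[i::k] for i in range(k)]  # k nearly equal groups
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def best_fold(n_samples, score_fold, k=10):
    """Return (best score, fold number), with folds numbered from 1."""
    scored = [
        (score_fold(train, test), number)
        for number, (train, test) in enumerate(k_fold_indices(n_samples, k), start=1)
    ]
    return max(scored)


splits = list(k_fold_indices(2837, k=10))
test_sizes = sorted({len(test) for _, test in splits})
```

Because 2,837 is not divisible by 10, the folds hold either 283 or 284 test tweets, which is consistent with the 2,554:283 split quoted for the selected fold.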

F. Testing
The Confusion Matrix is a table for evaluating classification models, and the f1-score derived from it is the value used to determine the best model [13]. The authors therefore evaluate the models using the Confusion Matrix method, whose layout is shown in Table 8. The precision, recall, and f1-score values for a class A are then obtained using the equations:
$Precision_A = \frac{TP_A}{TP_A + FP_A}$
$Recall_A = \frac{TP_A}{TP_A + FN_A}$
$F1_A = \frac{2 \times Precision_A \times Recall_A}{Precision_A + Recall_A}$
where TP_A is the True Positive count, FP_A is the False Positive count, and FN_A is the False Negative count for class A.
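As a worked example of these metrics, the sketch below computes precision, recall, and f1-score for one class from its confusion-matrix counts, using the standard definitions. The counts are invented for illustration and are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from True Positive, False Positive, False Negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single class.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

With these counts the precision is 8/10, the recall is 8/12, and the f1-score is their harmonic mean, 16/22.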

G. Electability Correlation
Pearson Correlation is a method that aims to find a linear relationship between 2 variables. Its result is a correlation coefficient r ranging from -1 to 1, where the sign gives the direction of the relationship and the magnitude gives its strength [10]. The Pearson Product Moment Correlation is given by the following equation:
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - \left(\sum X\right)^2\right)\left(n\sum Y^2 - \left(\sum Y\right)^2\right)}}$
where n is the amount of data, $\sum XY$ is the sum of the products of paired X and Y values, $\sum X$ is the sum of the X variable, $\sum Y$ is the sum of the Y variable, $\sum X^2$ is the sum of squared X values, and $\sum Y^2$ is the sum of squared Y values.
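The Pearson Product Moment Correlation can be computed directly from the sums named above. The sketch below uses the standard formula; the monthly values are invented placeholders, not the survey or sentiment figures from the study, and the function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical three-month series: share of positive tweets vs. survey electability.
positive_share = [0.40, 0.45, 0.55]
electability = [20.0, 22.0, 27.0]
r = pearson_r(positive_share, electability)
```

A value of r close to 1 means the two series rise together, while a value close to -1 means one falls as the other rises.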

III. RESULTS AND DISCUSSION
The SVM algorithm is implemented exactly as described above: development of the SVM system begins with the data cleaning stage, then transforms the data into numeric form, then trains the model by dividing the data 10:90 using the 10-Fold Cross Validation method, and finally tests the best model by taking the f1-score value with the Confusion Matrix method. The best fold is determined by f1-score, which is the appropriate measure when the False Negative (FN) and False Positive (FP) counts are not very close [14]. To obtain the f1-score in models with more than 2 labels, the authors use the macro f1, which is the average of the f1-score values of all classes. The macro value is used because it weights every class evenly, which suits datasets that are not balanced [15].
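To illustrate why the macro average matters on unbalanced data, the sketch below computes the macro f1 from true and predicted labels. The label names and the toy predictions are invented for illustration; the per-class f1 is written in its equivalent count form 2TP / (2TP + FP + FN).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class f1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class.
y_true = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 2
y_pred = ["positive"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

The always-positive classifier reaches an accuracy of 0.6, yet its macro f1 is only 0.25, because the two classes it never predicts each contribute a score of 0 to the average.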
Next, the results of the training process are obtained by finding the model with the best fold, followed by an analysis of the relationship between the model's predictions and the survey data using the Pearson Product Moment Correlation. The test results for the 3 presidential candidates are shown in Table 9, and the monthly electability survey data for each candidate are shown in Table 10. It can be concluded that there is a relationship between sentiment and the news portal electability surveys for all 3 presidential candidates. For candidates Anies Baswedan and Prabowo Subianto, the correlation shows a strong relationship in the same direction, meaning that the higher the positive sentiment, the higher the electability of the candidate. For candidate Ganjar Pranowo, the correlation shows a strong inverse relationship, meaning that the lower the positive sentiment, the higher the electability of the candidate.

IV. CONCLUSION
Based on the results of this research, it can be concluded that the Sentiment Analysis Engine implemented with the Support Vector Machine algorithm and the Gaussian RBF kernel found the best fold for each presidential candidate: the 8th fold with an f1-score of 0.66 for Anies Baswedan, with 2,554 training data and 283 testing data; the 5th fold with an f1-score of 0.72 for Ganjar Pranowo, with 3,330 training data and 370 testing data; and the 4th fold with an f1-score of 0.78 for Prabowo Subianto, with 3,487 training data and 387 testing data.
For presidential candidate Anies Baswedan, the correlation between the electability survey and positive sentiment is 0.991, and the correlation with negative sentiment is -0.991, which indicates a strong relationship in the same direction. For presidential candidate Ganjar Pranowo, the correlation with positive sentiment is -0.789 and the correlation with negative sentiment is 0.789, which indicates a strong relationship in the opposite direction. For presidential candidate Prabowo Subianto, the correlation with positive sentiment is 0.801 and the correlation with negative sentiment is -0.801, which indicates a strong relationship in the same direction. It can be concluded that there is a link between sentiment and the news portal electability surveys.

Table 2. Remove Special Character and Number Result

Table 3. Stemming Result

Tokenize: this stage breaks sentences into tokens per word, so that the next process can be applied at the word level. The process is shown in Table 4 below:

Table 8. Confusion Matrix Table

Table 9. Best Fold Test Results for Every Presidential Candidate

Table 10. Electability Survey and Sentiment