Enhancing Machine Learning Model Performance in Addressing Class Imbalance

This research investigates methods for addressing class imbalance in machine learning models, focusing on the Support Vector Machine (SVM) algorithm. We apply Over-sampling (SMOTE) and Under-sampling techniques to a dataset with class imbalance and evaluate the performance of SVM using these methods. The data consists of Twitter posts related to the 2024 electoral discourse. The findings indicate that incorporating SMOTE effectively enhances the performance of SVM models, particularly within the SVM Polynomial variant. However, the use of Under-sampling shows limited impact on improving SVM model performance. This study provides valuable insights for researchers and practitioners in choosing appropriate strategies for handling class imbalance in machine learning models.


INTRODUCTION
Class imbalance is one of the challenges researchers face when building models with machine learning [1]. The consequence of class imbalance is that minority classes are often misclassified as majority classes [2]. Skewed data distributions in datasets manifest when a single class is disproportionately represented compared to the others [3]. Numerous approaches exist for tackling this problem, including ADASYN (Adaptive Synthetic Sampling Approach) [4] and SMOTE (Synthetic Minority Over-sampling Technique) [5].
The Synthetic Minority Over-sampling Technique (SMOTE) is a valuable tool for addressing class imbalance in machine learning datasets. By generating synthetic samples for the minority class, SMOTE aids in training more accurate and robust predictive models. While SMOTE offers several benefits, it is essential to consider its limitations and the appropriate parameters to achieve optimal results [6]. Overall, SMOTE remains a popular and effective method for tackling class imbalance in various machine learning applications [7].
SMOTE has been widely adopted by other researchers, as seen in [8], where class resampling using Under-sampling was conducted. That study predicted company bankruptcies using Complement Naïve Bayes; the researchers combined SMOTE with Under-sampling techniques, resulting in an accuracy improvement of over 2%. Additionally, another study [9] employed SMOTE with Over-sampling techniques; the KNN algorithm was used to test the data, and the use of SMOTE improved KNN performance by 9.97%. Furthermore, [10] also utilized Over-sampling techniques with SMOTE, leading to improved accuracy across all algorithms used.
From several previous articles, two SMOTE-related resampling techniques are identified: Over-sampling and Under-sampling. Under-sampling involves diminishing the volume of data in the predominant class to match the quantity present in the underrepresented class [11]. In contrast, Over-sampling is a technique that augments the representation of the minority class to equalize its size with that of the majority class [6]. This research employs both techniques for comparison purposes.
In this investigation, the data comprised Twitter posts related to the 2024 electoral discourse, and the computational framework employed was the Support Vector Machine (SVM), a conventional supervised learning approach frequently applied in classification tasks [12]. SVM determines the optimal hyperplane by maximizing the margin between different classes [13]. A hyperplane serves as a discriminant function used to delineate categories or groups [14]. In this study, three different SVM kernels are used: RBF, Polynomial, and Linear.
Before conducting the modeling process with SVM, the advantages of the employed techniques, SMOTE with Over-sampling and Under-sampling, are determined. This study conducts data preprocessing through a range of techniques, including data purification, case normalization, lexical normalization, tokenization, stopword elimination, and morphological stemming. To label the data, this study utilizes a lexicon-based approach. The lexicon-based method operates by first creating an opinion word dictionary (lexicon); words found in this dictionary are used to identify whether a sentence contains an opinion or not [15]. The created lexicon is then used for automated labeling, as sketched below.
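As a rough illustration of the lexicon-based approach, the sketch below counts hits against a small opinion-word dictionary. The Indonesian entries and the label_tweet helper are hypothetical placeholders, not the study's actual lexicon.

```python
# Hypothetical opinion-word dictionary; the study's full lexicon would be substituted here.
positive_words = {"baik", "bagus", "hebat", "damai"}
negative_words = {"buruk", "jelek", "curang"}

def label_tweet(text: str) -> str:
    """Assign a sentiment label by counting lexicon hits in the tweet."""
    tokens = text.lower().split()
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_tweet("kinerja pemerintah sangat baik"))  # -> "positive"
```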

Dataset
The dataset for this research is collected from the social media platform Twitter (or X) spanning a period of 3 months, from January to March 2023. The initial dataset comprises 10,001 tweets. The data collection involved using specific keywords related to the research topic. Preprocessing included steps such as removing duplicates, filtering out irrelevant content, and normalizing text, which resulted in data reduction.

Preprocessing
Data pre-processing denotes the methodological transformation of unrefined data into a more digestible format [16]. This procedure holds paramount significance given that raw data frequently exhibits irregular formatting. Moreover, raw data cannot be processed directly in data mining, making preprocessing essential to facilitate subsequent data analysis [17]. The preprocessing steps are as follows (a combined sketch appears after step f):

a. Data Cleaning
Data preprocessing starts with either data cleaning or filtering [18]. This involves reselecting raw data and removing incomplete, irrelevant, and inaccurate data. This step helps avoid misunderstandings during data analysis. By applying this process, the quality of the dataset is significantly improved, leading to more accurate and reliable results in subsequent analysis. It ensures that the model is trained on data that is representative and free from noise, thereby enhancing the overall performance and robustness of the machine learning algorithms used in the study.

b. Case Folding
This procedure aims to transform all letters within the document into lowercase [18].
Moreover, symbols such as punctuation marks, numerical digits, and any other non-alphabetic characters are systematically excluded. This is because they serve as word separators or delimiters and therefore carry no weight in text analysis and interpretation [19]. Furthermore, leading and trailing spaces are removed, a technique known as whitespace removal [20]. By applying these preprocessing steps, the text data becomes standardized and consistent, which significantly enhances the performance and accuracy of the machine learning models. It ensures that the models are not misled by variations in text format and can focus on meaningful content, leading to more reliable and valid analysis outcomes.

c. Normalization
Normalization is a preprocessing method that transforms raw data into a different form to obtain data more suitable for analysis and modeling [21]. This process focuses on scaling the data. Normalizing data helps prevent large initial data ranges from overshadowing smaller data ranges by assigning equal weight to all data points [22]. This is crucial because it ensures that each feature contributes equally to the model's learning process, thereby enhancing the overall performance and accuracy of the model. Without normalization, features with larger scales could dominate the learning process, leading to biased and suboptimal results.

d. Tokenization
The process of tokenization encompasses the segmentation of a string of characters within a given text into individual lexical units [23]. According to [24], tokenization involves parsing an input string by isolating each word. In essence, this procedure divides a document into its constituent words.

e. Stopword Removal
Stopwords are common words that appear frequently, carry little significant information, and are usually ignored or discarded when creating indices or lists of words [25]. Stopwords are also often considered noise in text; they include the most common words, such as prepositions like "in," "to," and "that."

f. Stemming
Stemming, as a method in natural language processing (NLP), is employed to normalize words by simplifying them to their root or base form. This technique seeks to eliminate superfluous linguistic variations, such as affixes and verb conjugations, facilitating the identification of common roots across various word forms [26].
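The following sketch strings steps a through f together for a single tweet. The regular expressions, the slang dictionary, and the stopword set are illustrative stand-ins for the study's actual resources, and the Indonesian stemming step is only indicated.

```python
import re

# Hypothetical normalization and stopword resources; the study's full
# Indonesian dictionaries would be substituted in practice.
SLANG = {"gak": "tidak", "yg": "yang"}
STOPWORDS = {"yang", "di", "ke", "dan", "ini", "itu"}

def preprocess(tweet: str) -> list[str]:
    text = re.sub(r"http\S+|@\w+|#\w+", " ", tweet)         # a. data cleaning: URLs, mentions, hashtags
    text = text.lower()                                      # b. case folding
    text = re.sub(r"[^a-z\s]", " ", text).strip()            # drop digits/punctuation, trim whitespace
    tokens = [SLANG.get(t, t) for t in text.split()]         # c. normalization + d. tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]       # e. stopword removal
    # f. stemming would follow here, e.g. via an Indonesian stemmer such as Sastrawi
    return tokens

print(preprocess("Debat capres 2024 gak membosankan! https://t.co/x @user"))
# -> ['debat', 'capres', 'tidak', 'membosankan']
```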

Labeling
After cleaning the data to be processed, the next step is labeling. In this study, labeling is performed using the lexicon-based method. Several studies have stated that using this method can improve accuracy [27], [28]. This study employed a dataset comprising 4,800 instances, with a subset of tweets excluded from analysis during preprocessing due to redundancy.

Word Weighting With TF-IDF
TF-IDF is an established algorithm employed to assess the significance of a word within a text. This computational process is rooted in both term frequency, which gauges a term's presence in a specific document, and inverse document frequency, which evaluates its prevalence across multiple documents. Through this calculation, TF-IDF is capable of discerning the degree of a word's importance and relevance within a given context [29]. The Term Frequency-Inverse Document Frequency (TF-IDF) method is widely employed for the efficient analysis of voluminous datasets [30]. The TF-IDF algorithm assigns weights to each keyword in each category to find keyword similarities with the available categories. Before weighting, five preprocessing steps are performed: sentence segmentation, case folding, tokenizing, filtering, and stemming. Subsequently, the TF-IDF weight, query relevance weight, and similarity weight are computed [31].
Based on previous studies discussing the implementation of the TF-IDF method, various formula variations were found for word weighting. The term frequency-inverse document frequency (TF-IDF) metric increases with the number of occurrences of a term within a document, but is counterbalanced by the term's frequency within the overall corpus. The distinct TF-IDF weighting schemes frequently serve as the principal mechanism for evaluating and ranking document relevance as perceived by users. Fundamentally, TF-IDF is the product of the term frequency (TF) and the inverse document frequency (IDF), and multiple approaches exist to compute each statistic. For term frequency tf(t, d), the simplest method is to use the raw frequency of the term in the document, i.e., how many times term t appears in document d. If we denote this raw frequency as f(t, d), the simple tf scheme is

tf(t, d) = f(t, d)

The IDF value of a term (word) can be calculated using the following equation:

idf_t = log(D / df_t)

where D is the total number of documents and df_t is the number of documents containing the term t. The weight W of each document for a keyword (query) term is then calculated as:

W_{d,t} = tf(t, d) × idf_t

where d is the document index, t is the term index from the keyword, W_{d,t} is the weight of document d for term t, and tf is the term frequency. Once the weights W of each document are known, a sorting process is performed: the larger the W value, the greater the similarity of the document to the searched word, and vice versa.
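A minimal sketch of these equations on a toy tokenized corpus (the three example tweets are hypothetical); production pipelines would typically use a library implementation such as scikit-learn's TfidfVectorizer.

```python
import math

# Toy tokenized corpus; D = 3 documents (hypothetical tweets).
docs = [
    ["pemilu", "damai", "pemilu"],
    ["pemilu", "curang"],
    ["damai", "aman"],
]

def tfidf_weight(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """W_{d,t} = tf(t, d) * idf_t, following the equations above."""
    tf = doc.count(term)                             # raw term frequency f(t, d)
    df = sum(term in d for d in corpus)              # number of documents containing t
    idf = math.log(len(corpus) / df) if df else 0.0  # idf_t = log(D / df_t)
    return tf * idf

print(tfidf_weight("pemilu", docs[0], docs))  # 2 * log(3/2) ≈ 0.81
```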

SMOTE
This study then employs SMOTE to balance classes. Although the data used in this study is nearly balanced, both Under-sampling and Over-sampling are employed to assess the benefits of SMOTE. SMOTE works by generating synthetic samples for the minority class, thus ensuring that the classifier is not biased towards the majority class. By balancing the classes, SMOTE helps improve the performance of the model, especially in scenarios where the minority class is of critical interest. Applying both Under-sampling and Over-sampling allows for a comprehensive evaluation of how SMOTE can enhance model accuracy and robustness, providing insights into its effectiveness across different data distributions. This approach not only helps achieve better model performance but also ensures that the model generalizes well to unseen data, ultimately leading to more reliable predictions.
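A minimal sketch of both resampling strategies using the imbalanced-learn library. The synthetic feature matrix stands in for the study's TF-IDF vectors, and representing the Under-sampling variant with RandomUnderSampler is an assumption, not the authors' documented configuration.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic stand-in for the TF-IDF feature matrix and lexicon labels.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)

X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)                  # Over-sampling
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)   # Under-sampling

print(Counter(y), Counter(y_over), Counter(y_under))  # class counts before/after
```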

SVM
Support Vector Machine (SVM) is a form of supervised learning deployed in machine learning for tasks associated with both classification and regression [32]. SVM operates by creating a predictive model that delineates data points through a hyperplane, which in turn maximizes the distance, or margin, between the hyperplane and the closest data points [33]. A hyperplane, a mathematical construct typically in multidimensional space, effectively divides a data set into two distinct classes. This division is quantified by the margin, which represents the shortest distance from the hyperplane to the nearest data point in each class. A core objective of support vector machines (SVMs) lies in optimally dividing data into two classes by identifying a hyperplane that maximizes this margin.
Researchers often use SVM as a classifier due to its robustness and effectiveness in high-dimensional spaces. Unlike other machine learning algorithms, SVM is particularly effective when the number of dimensions exceeds the number of samples, and it remains efficient in handling large feature spaces. Furthermore, SVM is versatile due to its use of different kernel functions, allowing it to model complex non-linear relationships. The ability to handle various kernels, such as linear, polynomial, and radial basis function (RBF) kernels, makes SVM a powerful tool for a wide range of classification problems. In this study, SVM was chosen because of its high performance in terms of accuracy and its ability to generalize well to unseen data, making it suitable for the dataset and the specific classification tasks at hand. The kernels used in this study include:

a. SVM Kernel Linear
The basic linear kernel is employed when the data under scrutiny is inherently linearly separable [34]. The linear kernel is suitable when there are many features, because mapping to a higher-dimensional space does not significantly improve performance, as in text classification [35]. In text classification, both the number of instances (documents) and the number of features (words) are large [36]. The equation for the linear kernel is

K(x, z) = x · z

where K(x, z) is the kernel function that measures the similarity between two input vectors x and z, and x · z is the dot product between the vectors x and z.
This formula effectively measures the linear similarity between two vectors: the higher the value of the dot product, the greater the similarity between the two vectors. Linear kernels are often used because they are simple and efficient to compute and work well when data can be separated linearly in feature space.
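A two-line illustration of the linear kernel as a plain dot product (the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.5])
z = np.array([0.5, 1.0, 2.0])

# Linear kernel: K(x, z) = x . z
print(np.dot(x, z))  # 0.5 + 2.0 + 1.0 = 3.5
```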

b. SVM Kernel RBF
The radial basis function (RBF) kernel is frequently employed in analytical settings where the data does not demonstrate linear separability. Within this context, the RBF kernel features two parameters: gamma and cost [37]. The parameter C, part of the support vector machine (SVM) model, is crucial for minimizing misclassification errors in the training data; it defines the trade-off between achieving a low training error and a sufficiently simple decision boundary. The parameter gamma, on the other hand, regulates the influence of individual samples on the decision boundary. Low gamma values mean that distant points are taken into account when determining the decision boundary, while high gamma values prioritize closer points [38]. The equation for the RBF kernel is

K(x, z) = exp(-γ ||x − z||²)

where K(x, z) is the kernel function that measures the similarity between two input vectors x and z, exp is the exponential function, and γ (gamma) is a scalar parameter that determines the width of the Gaussian function. Gamma controls how far the influence of a single sample reaches: a large gamma value means a small radius of influence, resulting in a tighter model; conversely, a small gamma value means a large radius of influence, resulting in a looser model.
The RBF kernel function transforms the distance between two data points into a similarity measure ranging from 0 to 1. When two data points are very close to each other, the kernel value approaches 1; conversely, if two data points are far apart, the kernel value approaches 0. The RBF kernel is highly effective for handling non-linearly separable data by mapping it to a higher-dimensional feature space.
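A small sketch of the RBF formula, showing the kernel value approaching 1 for identical points and 0 for distant ones (the gamma value is chosen arbitrarily):

```python
import numpy as np

def rbf_kernel(x: np.ndarray, z: np.ndarray, gamma: float = 0.5) -> float:
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))                      # identical points -> 1.0
print(rbf_kernel(x, np.array([5.0, 6.0])))  # distant points -> ~1.1e-07
```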

c. SVM Kernel Polynomial
The polynomial kernel constitutes a type of kernel function applied in instances where the data cannot be delineated linearly [39]. The polynomial kernel is especially well-suited for problems in which all training sets have been subjected to normalization [40]. The equation for the polynomial kernel takes two common forms:

K(x, z) = (x · z)^d    or    K(x, z) = (1 + x · z)^d

where K(x, z) is the kernel function that measures the similarity between two input vectors x and z, x · z is the dot product of the vectors x and z, and d is the degree of the polynomial.

For K(x, z) = (x · z)^d: this version calculates the dot product of x and z and then raises it to the power of d. It captures interactions up to the d-th degree, making it suitable for problems where the relationship between the features is polynomial.

For K(x, z) = (1 + x · z)^d: this version raises the constant-augmented term (1 + x · z) to the power of d. The constant term allows the kernel to account for all polynomial terms up to degree d, including the interaction terms and bias. This version often performs better because it captures a broader range of interactions between the features.
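The sketch below evaluates both polynomial variants at degree d = 2 and then shows how the three kernels used in the study map onto scikit-learn's SVC; the hyperparameter values are illustrative, not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([1.0, 2.0])
z = np.array([0.5, 1.0])
d = 2  # polynomial degree

print(np.dot(x, z) ** d)        # K(x, z) = (x . z)^d      -> 6.25
print((1 + np.dot(x, z)) ** d)  # K(x, z) = (1 + x . z)^d  -> 12.25

# The three SVM variants compared in this study, as scikit-learn models.
models = {
    "linear": SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=2, coef0=1.0),
    "rbf": SVC(kernel="rbf", gamma=0.5, C=1.0),
}
```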

RESULTS AND DISCUSSION
The following are the modeling results after the preprocessing and word weighting processes. The initial experiment involved SVM modeling without using SMOTE. Table 1 presents the evaluation of the SVM model without SMOTE, reporting Accuracy, Precision, Recall, and F1-Score for comparison. SVM Linear demonstrates consistent performance across all metrics, with accuracy, precision, recall, and F1-Score all at 87%. This indicates that SVM Linear effectively classifies the data, achieving high accuracy in predicting both positive and negative classes. The superior performance of SVM Linear may be due to the simplicity of the linear hyperplane, which is easier to optimize in a less complex feature space.
SVM Polynomial shows significantly lower performance compared to SVM Linear. With an accuracy of only 50%, and precision, recall, and F1-Score of 25%, 50%, and 33% respectively, SVM Polynomial struggles to classify the data, resulting in overall poor performance. SVM Polynomial may be overfitting to the training data, leading to poor performance on the test data.
SVM RBF shows relatively good performance, with an accuracy of 83%. SVM RBF maintains high precision, recall, and F1-Score values, indicating its effectiveness in predicting both positive and negative classes. However, its performance is slightly lower than SVM Linear. The lower performance of SVM RBF compared to SVM Linear could be due to the complexity of the RBF kernel, which may not fully match the current data distribution.
Next, SVM was tested with Over-sampling. Table 2 presents the evaluation of the SVM model with SMOTE (Over-sampling). From the table, it can be seen that the use of SMOTE has significantly improved the performance of the SVM model, particularly for SVM Polynomial. In SVM Linear, although there is a performance increase, the difference is not as large as in SVM Polynomial and SVM RBF. SVM Linear with SMOTE nevertheless shows a substantial increase in accuracy, from 87% to 91%. The application of the Over-sampling method has thus contributed to enhancing the model's capability in classifying data. Additionally, there is a noticeable improvement in precision, recall, and F1-Score, indicating an increased effectiveness of the model in distinguishing between positive and negative classes.
SVM Polynomial also shows significant improvement after using SMOTE, although its accuracy is still below SVM Linear and SVM RBF. The precision, recall, and F1-Score metrics show substantial increases, indicating an improved ability of the model to classify data. Meanwhile, SVM RBF shows a smaller improvement than SVM Linear and Polynomial after the application of SMOTE. However, the improvement is still present in accuracy, precision, recall, and F1-Score, indicating that the Over-sampling technique still has a positive impact on the model's performance. Overall, the evaluation results show that the use of SMOTE effectively improves the performance of the SVM model in classifying data, particularly in addressing class imbalance.
The third experiment involved SVM with SMOTE Under-sampling. Table 3 presents the evaluation of the SVM model with SMOTE (Under-sampling). From the table, it can be seen that the use of SMOTE with Under-sampling has a limited impact on improving the performance of the SVM model. SVM Linear shows consistent results under SMOTE Under-sampling, with no significant changes in accuracy, precision, recall, and F1-Score. Despite the application of the Under-sampling technique to address class imbalance, the results remain similar to SVM without SMOTE.
SVM Polynomial also shows a similar pattern, with no significant changes in model performance after the application of SMOTE Under-sampling. The accuracy, precision, recall, and F1-Score remain low, indicating that this technique does not provide significant improvement in the model's ability to classify data. SVM RBF shows a slight improvement in some evaluation metrics after the application of SMOTE Under-sampling. However, the improvement is not significant and is still far from the expected performance. This indicates that the combination of SMOTE and Under-sampling techniques does not provide significant improvement in the SVM model's performance when applied to the current dataset. Overall, the evaluation results show that the use of SMOTE with the Under-sampling technique has a limited impact on improving the SVM model's performance: although intended to address class imbalance, this technique fails to provide significant enhancement in model performance on the tested dataset.
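A condensed sketch of the evaluation procedure behind Tables 1-3: train each kernel variant and report accuracy, precision, recall, and F1-Score. The synthetic data and the macro averaging are assumptions; the study's actual features, labels, and split are described above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and lexicon-based labels.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name in ("linear", "poly", "rbf"):
    y_pred = SVC(kernel=name).fit(X_train, y_train).predict(X_test)
    print(name,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred, average="macro"),
          recall_score(y_test, y_pred, average="macro"),
          f1_score(y_test, y_pred, average="macro"))
```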
The results obtained in this study can be considered superior compared to previous research. Table 4 provides a comparison with previous studies using different algorithms. The analysis indicates that the use of Support Vector Machine (SVM) on the Twitter dataset achieved the highest accuracy of 87%. Compared to previous studies, SVM demonstrated a significant performance improvement. Research using Logistic Regression on the Blibli dataset achieved an accuracy of 73%, and Naïve Bayes Classifier on the Play Store dataset achieved 80%. K-Nearest Neighbour on the Movie Review dataset and Decision Tree on the Twitter dataset each achieved an accuracy of 86%, while CNN on the Twitter dataset only reached 78%. Research using a Simple Neural Network on the Twitter dataset achieved an accuracy of 85%. Therefore, it can be concluded that SVM outperformed the other algorithms used in previous studies on the Twitter dataset, demonstrating its effectiveness in text classification on this platform.

CONCLUSION
This study emphasizes the importance of addressing class imbalance in machine learning models, particularly with Support Vector Machine (SVM) algorithms. Evaluations using Over-sampling (SMOTE) and Under-sampling methods were conducted on a dataset with class imbalance. Results indicate that the use of SMOTE significantly enhances SVM model performance, especially for SVM Polynomial, by correcting class imbalance and improving accuracy, precision, recall, and F1-Score. However, integrating SMOTE with Under-sampling showed minimal impact on SVM efficiency, failing to significantly improve model performance. These findings highlight the necessity of carefully selecting class imbalance handling methods tailored to specific dataset characteristics. This study provides valuable insights for researchers and practitioners in choosing effective methodologies for managing class imbalance to enhance machine learning model effectiveness.
For future research, it is recommended to explore combinations of other machine learning algorithms that can enhance the accuracy and stability of text classification models. Additionally, the application of more complex ensemble learning methods can be investigated to address challenges in more intricate sentiment analysis. The implementation of new techniques in data preprocessing can also be explored to ensure that the data used is clean and relevant, thereby improving the overall performance of the models.

ACKNOWLEDGMENTS
The author would like to express gratitude to the University of Lancang Kuning for providing financial support for this research.

Figure 1
Figure 1 illustrates the methodology flowchart of this study.

Table 1.
SVM Comparison Without SMOTE

Table 4.
Comparison with Previous Research