SMOTE-Tomek Re-sampling Based on Random Forest Method to Overcome Unbalanced Data for Multi-class Classification

Abstract—Mobile app review data needs to be utilized to understand the app characteristics users desire. App providers can improve app performance based on user preferences by applying sentiment and emotion classification to app review data. However, text-based analysis often faces two problems: data variation and data imbalance, which can lead to biased and inaccurate classification models. Pre-processing is necessary to comprehend the data requirements, and feature extraction is applied for word weighting to overcome variation in the data. In addition, re-sampling techniques are needed to overcome the imbalanced sample distribution. Re-sampling techniques such as Tomek Links and SMOTE focus only on the majority or the minority data. This research applies the combined SMOTE-Tomek technique, which addresses not only the minority data but also the majority data. Model performance improves because the technique combines oversampling of the minority classes with under-sampling of the majority class to eliminate noisy data. The data was modeled using an Ensemble Learning Random Forest for classification. The model achieved a Precision of 84%, Recall of 84%, F1-Score of 84%, and Accuracy of 84%. Furthermore, the model was optimized using GridSearchCV, increasing Precision, Recall, F1-Score, and Accuracy to 85% each.


I. INTRODUCTION
As part of technological advances, mobile applications have integrated into the lives of the global community. Various groups use mobile applications to meet various needs, ranging from education, health, entertainment, and e-commerce to financial transactions [1]. Based on data from Statista, the use of mobile applications in Indonesia continues to grow. The total number of downloads reached 7.31 billion in 2021, 7.7 billion in 2022, and 7.56 billion in 2023, a reduction from the previous year's figure of 7.7 billion [2]. Despite the decline, mobile app downloads in Indonesia remain numerous.
The use of mobile applications has generated many user reviews in text form. The reviews provide deep insights into the user experience of the mobile app, as well as user feedback regarding the appearance and performance of the app [3]. To draw conclusions from the arguments presented by mobile app users, rather than relying on the ratings they give, sentiment analysis is needed; it is one of the topics in Natural Language Processing (NLP). Sentiment analysis aims to study, extract, and identify subjective information from people's expressions, opinions, or emotions toward a topic. This analysis usually categorizes the sentiments into three categories: positive, negative, and neutral [4]. Sentiment analysis and emotional understanding are intertwined in the context of mobile app reviews. Emotions can influence people's opinions, thoughts, or feelings based on biological and psychological states [5]. Mobile app review data can support the classification of the emotions of mobile app users. Companies or stakeholders use the results to understand the feature characteristics and service needs that users want from their products and services.
However, significant challenges need to be overcome in the sentiment analysis process. The two most common challenges are data variation and unbalanced data distribution. Data variation arises from the diversity of language used by users, giving rise to ambiguity in text analysis. In addition, the problem of unbalanced data distribution becomes more complex in the multi-class case, where each class has a different amount of data. If some classes have more data than others, the model will tend to learn their patterns more easily [6]. As a result, the analysis model becomes biased and inaccurate due to the decrease in classification performance on the minority classes. The model also tends to predict the majority class and ignore the minority classes, so the accuracy of the prediction results is biased towards the majority class [7].
To overcome data imbalance, re-sampling techniques such as oversampling using SMOTE (Synthetic Minority Oversampling Technique) and undersampling using Tomek Links are often applied. The combination of these two techniques, known as SMOTE-Tomek, is often used to handle significant data imbalance between classes [8]. SMOTE-Tomek helps to create a more balanced distribution of data and removes data that creates ambiguity between classes. This combination helps to make the classification model more accurate and reliable by minimizing the bias towards the majority class and increasing the accuracy on the minority classes [9].
Research [10] on personality classification on Twitter using the SVM, Decision Tree, Random Forest, AdaBoost, and Gradient Boosting algorithms with the SMOTE-Tomek, Random Undersampling, and SMOTE re-sampling techniques found that SMOTE-Tomek performed best among the re-sampling techniques. In addition, SVM performed better than the boosting algorithms. The SVM model's accuracy rose from an initial 39% to 55% after SMOTE-Tomek, and tuning further increased it to 56% [10].
Furthermore, research [7] performed sentiment analysis of application reviews using Random Forest, Neural Network, KNN, and SVM models. Accuracy increased when applying Tomek Links: the SVM model achieved 81% and Random Forest 80%, while KNN and NN accuracy remained below 77%. Tomek Links can clean data from the majority class that has characteristics similar to the minority class and could become noise, resolve class imbalance, and improve model performance [7]. Additionally, in research [11] on the classification of MBKM program comments using Random Forest, Logistic Regression, MLP, and SVM algorithms, the Tomek Links undersampling technique worked better than the Near Miss technique [11]. In research [12] on text classification using the IndoBERT model, the SMOTE technique produced a better accuracy of 82% compared to only 78% with augmentation techniques, because SMOTE can improve classification capability by adding synthetic data to the minority class based on nearest neighbours [12].
The literature review shows that the SMOTE-Tomek, Tomek Links, and SMOTE re-sampling techniques effectively overcome class imbalance and improve model performance in various classification applications. Classification algorithms such as Random Forest (RF) and Support Vector Machine (SVM) are widely used because they can learn complex patterns in textual data, making them suitable for tasks such as sentiment analysis [7], [10], [11]. However, the use of these algorithms for multi-class classification in the context of sentiment analysis of mobile application reviews is still rare. Previous studies also did not use feature representations such as TF-IDF [10], and some limited the number of feature names used, which may affect the classification results [11].
Therefore, this research aims to evaluate the effectiveness of the SMOTE-Tomek technique for multi-class classification using TF-IDF for feature representation. The effectiveness of SMOTE-Tomek will be compared to SMOTE and Tomek Links to determine whether it provides additional benefits in enhancing the performance of complex sentiment analysis models. Additionally, hyperparameter tuning using GridSearchCV will be applied to optimize the best-performing classification model to achieve higher accuracy. This research is expected to contribute meaningfully to developing multi-class text classification models by addressing data variation and using re-sampling techniques to overcome dataset imbalance.

II. RESEARCH METHODOLOGY
This research uses several stages. It starts with retrieval of the labelled dataset. Then, through data pre-processing, the data is cleaned of noise. Next, features are extracted using TF-IDF. After that, data imbalance is addressed using Tomek Links, SMOTE, and SMOTE-Tomek. Subsequently, the best-performing model is optimized using GridSearchCV hyperparameter tuning. Figure 1 shows the research methodology.

A. Load Dataset
The dataset is taken from research conducted by Ricco San and Karen, who produced the Multilabel Sentiment and Emotion Dataset from the Indonesian Mobile Application Review [13]. This dataset comes from reviews of 10 mobile applications in Indonesia. User review data on mobile applications is textual. The limited amount of textual data in Indonesian means this dataset needs further development for text analysis-based research. Data pre-processing is the process of cleaning, transforming, and tidying up data to get better-quality data suitable for further analysis. The Multilabel Sentiment and Emotion Dataset from Indonesian Mobile Application Review has gone through several pre-processing stages, including removing base duplicates, URLs, mentions/hashtags/special characters, emoji, duplicates, newline characters, and the mobile app rating data column. However, this dataset needs additional pre-processing because the data is still fairly unstructured, which can affect classification performance. Figure 2 explains the additional pre-processing workflow for this dataset.

B. Data Pre-processing
1) Lowercase: This case-folding process converts all letters in the text to lowercase. It ensures consistent data and avoids differences in understanding caused by capitalization, which can lead to unnecessary duplication of words [14]. For example, "Bintang" becomes "bintang" to standardize its representation.
3) Remove Special Character: Removing non-alphanumeric characters from the text [16]. These characters add no meaning to text understanding and can cause noise in the algorithm. For example, "©®®©".

4) Remove Whitespace: This step removes unnecessary whitespace, such as at the beginning or end of a sentence, double spaces, or whitespace resulting from earlier cleaning processes [17]. For example, "tolong  perbaiki" becomes "tolong perbaiki".

8) Remove Missing Value: Empty or NaN data entries are removed from the dataset [18]. For example, reviews that were empty from the beginning or became empty after cleaning in the pre-processing stage.

9) Remove Duplicated Data: Data entries that are identical and appear more than once in the dataset are removed [18]. For example, if multiple identical reviews exist, only one is retained while the duplicates are removed.
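The additional cleaning steps above can be sketched with pandas. This is a minimal illustration under the assumption that the reviews sit in a single text column; the column name "review" and the toy data are hypothetical, not taken from the paper's code.

```python
import pandas as pd

# Toy data: two identical dirty reviews and one missing entry (hypothetical)
df = pd.DataFrame({"review": ["Tolong  Perbaiki!!", "Tolong  Perbaiki!!", None]})

df["review"] = (
    df["review"]
    .str.lower()                                   # lowercase (case folding)
    .str.replace(r"[^a-z0-9 ]", " ", regex=True)   # remove special characters
    .str.replace(r"\s+", " ", regex=True)          # collapse extra whitespace
    .str.strip()
)
df = df.dropna()                                   # remove missing values
df = df.drop_duplicates()                          # remove duplicated data
print(df["review"].tolist())  # ['tolong perbaiki']
```

Each chained `.str` operation mirrors one numbered step; running them in this order avoids duplicates surviving because of superficial differences such as casing or punctuation.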

C. Feature Extraction
Term Frequency-Inverse Document Frequency (TF-IDF) is a feature extraction method frequently used to convert text data into a numerical form in NLP, using Equations (1) to (3).
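Equations (1) to (3) are not reproduced in this excerpt; as a reference, the standard TF-IDF formulation they typically denote is, for a term t in document d among N documents:

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \qquad (1)
```
```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)} \qquad (2)
```
```latex
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t) \qquad (3)
```

where f_{t,d} is the raw count of t in d and df(t) is the number of documents containing t. Note that scikit-learn's TfidfVectorizer uses a smoothed variant, idf(t) = log((1+N)/(1+df(t))) + 1, by default.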
We can use TfidfVectorizer() from the scikit-learn library to implement TF-IDF. TfidfVectorizer() enables efficient tokenization, weighting, and encoding of new text. Once the text data is converted into a numerical representation, these features can be used for various NLP tasks [19].

D. Data Balancing
Data balancing works by balancing the data in each class of the dataset. This can be achieved through several methods, namely oversampling, undersampling, and combined re-sampling [8].

E. Split Dataset
After handling the data imbalance, the next step is to divide the data into training data and test data, with a ratio of 80:20.
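The 80:20 split can be done with scikit-learn's train_test_split; the arrays below are dummy stand-ins for the re-sampled features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in for the re-sampled features and labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 80:20 split; stratify keeps the class ratios equal in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # 80 20
```

Stratification is a sensible default here because the split happens after re-sampling and should preserve the balanced class distribution.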

F. Model Analysis
Classification models can be used to identify and understand the emotions contained in each review. This research utilizes the Random Forest and Support Vector Machine algorithms.
1) Random Forest: Random Forest is a decision tree-based ensemble learning method and a powerful approach to classification. This model creates a set of decision trees from randomly selected subsets of the training data. Each tree provides a prediction, and the predictions from all trees are combined through a voting process. The final prediction is determined based on the majority of votes [25]. Each tree is built from training data samples drawn with a bootstrap sampling technique. Each tree then classifies the test data, which is assigned to the category with the most votes, known as the majority vote. This method's main advantage lies in overcoming overfitting problems and providing more stable and accurate predictions [23]. Figure 7 illustrates the stages of the Random Forest algorithm [23].

2) Support Vector Machine (SVM): SVM can be used for regression and classification because it finds the optimal hyperplane that separates classes in the feature space by maximizing the margin, the distance between the hyperplane and the support vectors. The advantage of this method is its ability to handle complex data with good performance [23]. Figure 8 illustrates a visual representation of the SVM algorithm.
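The voting process described for Random Forest can be sketched as follows; the synthetic 3-class data stands in for the TF-IDF features (an SVM via `sklearn.svm.SVC` could be swapped in the same way).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data in place of the real TF-IDF features
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 bootstrap-sampled trees; the majority vote across trees is the prediction
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```

Internally, `predict` averages the per-tree class probabilities, which for hard-voting trees reduces to the majority vote described above.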

G. Evaluation
The Confusion Matrix evaluates model performance and provides an overview of how well the model classifies the data. The variables measured in the Confusion Matrix are: true positive (TP), the number of positive samples correctly predicted as positive; true negative (TN), the number of negative samples correctly predicted as negative; false positive (FP), the number of negative samples incorrectly predicted as positive; and false negative (FN), the number of positive samples incorrectly predicted as negative. From the Confusion Matrix, evaluation metrics such as accuracy, precision, recall, and F1-Score can be calculated using Equations (4) to (7), respectively [26].
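These metrics can be computed directly from predictions with scikit-learn; the tiny 3-class labels below are illustrative only.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Tiny 3-class example (labels are illustrative)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
print("accuracy :", accuracy_score(y_true, y_pred))
# Macro averaging weights every class equally, which suits imbalanced
# multi-class evaluation
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```

In the multi-class setting, TP/FP/FN are counted per class (one-vs-rest) and then aggregated by the chosen averaging strategy.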

H. Tuning
Hyperparameter tuning is used to improve model performance.It allows the model to fit the training data better and produce better results when used to predict new data.GridSearchCV is a hyperparameter tuning method that tries all parameter combinations in the search space to find the parameter combination with the smallest error [27].
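An exhaustive grid search as described can be sketched as follows; the grid shown is a small illustrative subset, not the paper's full search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data in place of the real features
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

# Small illustrative grid; the full search space in the paper is larger
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)  # exhaustively tries all 4 combinations with 3-fold CV
print(search.best_params_, search.best_score_)
```

`best_params_` holds the combination with the highest cross-validated score, and `best_estimator_` is a model refitted on the full data with those parameters.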

III. RESULT AND DISCUSSION
In this research, a series of stages have been carried out to obtain optimal results in text data classification.This section describes the results of each stage of the research that has been carried out.

A. Load Dataset
The dataset 'Multilabel Sentiment and Emotion Dataset from Indonesian Mobile Application Review' by Ricco San and Karen has gone through data pre-processing to remove noise, yielding a total collection of 21,697 reviews, and a data annotation process to identify the type of sentiment and emotion in each review text. The data is annotated into three sentiments: positive, negative, and neutral. There are six emotions: anger, fear, sadness, happiness, love, and neutrality. Table I shows the class distribution of the dataset.

B. Data Preprocessing
At this stage, the data is cleaned to create more structured data and avoid irrelevant, error-containing data that can reduce the accuracy and efficiency of the model. Table II shows the flow of data pre-processing. After the stemming process, the data list is returned to string form. Then, the dataset is cleaned of missing values and duplicate data resulting from empty data or the cleaning process. Table III compares the class distribution before and after pre-processing, with the total data reduced from 21,697 to 19,724.

C. Label Encoding
Label encoding is done to simplify the classification process. The class names on the Sentiment label are encoded as 0 for negative, 1 for neutral, and 2 for positive. The Emotion labels are encoded as 0 for anger, 1 for fear, 2 for sad, 3 for happy, 4 for love, and 5 for neutral.
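The encoding above amounts to a simple mapping; a minimal sketch follows (the dictionary and variable names are illustrative, not from the paper's code).

```python
# Mappings as described in the text (names are illustrative)
sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}
emotion_map = {"anger": 0, "fear": 1, "sad": 2, "happy": 3, "love": 4,
               "neutral": 5}

emotions = ["happy", "anger", "neutral"]
encoded = [emotion_map[e] for e in emotions]
print(encoded)  # [3, 0, 5]
```

An equivalent result could be obtained with scikit-learn's LabelEncoder, though explicit dictionaries keep the class-to-integer assignment fixed and readable.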

D. Feature Extraction
The cleaned data needs feature extraction to convert the data into numeric form. Table IV shows an example of TF-IDF calculation results for feature extraction. The higher the value, the more important or influential the word is within the sentence in the document. The TF-IDF weighting result is stored in a sparse matrix. Table V shows the sparse matrix of TF-IDF weighting results; in the first row (0, 15339), the first number (0) indicates that the value comes from the first document in the corpus, the second number (15339) is the index of the word in the list of all words (feature names), and the third number (0.12312022870173846) is the TF-IDF weight. Feature extraction using TF-IDF resulted in 20,141 feature names. Figure 9 shows the top 10 feature names.

E. Data Balancing
Data balancing is required for this dataset. In Figure 10, it can be seen that the data is not balanced in each class for both the Sentiment and Emotion labels. Data balancing in this research focuses on Emotion as the target label because the Sentiment and Emotion labels have a close relationship: the negative class is connected to the anger, fear, or sad class; the positive class is connected only to the happy or love class; and the neutral class is connected only to the neutral class. Emotion knowledge can be used to determine sentiment, but not vice versa.

2) Oversampling (SMOTE): SMOTE generates new synthetic data samples in each minority class. As a result of the SMOTE process, the anger class produces 3,864 samples, fear produces 5,240, sad produces 2,919, happy produces 862, and love produces 6,319, while neutral is the majority class, so there are no additional samples. Figure 12 compares the class distribution before and after applying SMOTE oversampling.

The results of using T-SNE to visualize the data distribution before and after applying the various re-sampling techniques are shown in Figure 14. In the original data, there is very significant overlap between classes. After implementing Tomek Links, although overlap between classes remains, the neutral class (5) shows a slight decrease in data. Implementing SMOTE produces a more balanced data distribution. The data after the SMOTE-Tomek process is similar to SMOTE, but on closer inspection the overlap between classes is effectively reduced.

G. Model Analysis
Research data from the various sampling techniques are used to train Random Forest and SVM models to learn patterns in the training data. The models were then tested on the test data to predict the class labels. Based on the evaluation of the Random Forest and SVM models on the original data and on data re-sampled using Tomek Links, SMOTE, and SMOTE-Tomek, the Random Forest model performed best on data re-sampled using SMOTE-Tomek. This model achieved 84% precision, 84% recall, 84% F1-Score, and 84% accuracy. These results show that SMOTE-Tomek is effective in overcoming data imbalance and provides additional benefit by improving the performance of complex sentiment analysis models compared to the other re-sampling techniques.

H. GridSearchCV
The best parameter combination with the smallest error found by GridSearchCV hyperparameter tuning for the SMOTE-Tomek data using the Random Forest model is as follows:
a) 'bootstrap': False (not using bootstrap sampling)
b) 'max_depth': None (no limit on tree depth)
c) 'min_samples_leaf': 1 (the minimum number of samples at each leaf node is 1)
d) 'min_samples_split': 2 (the minimum number of samples required to split a node is 2)
e) 'n_estimators': 200 (the number of estimators in the ensemble is 200)

With these parameters, the Random Forest model provides the best performance with a small error rate when applied to data processed with the SMOTE-Tomek technique. The confusion matrix evaluation of the Random Forest model with the SMOTE-Tomek re-sampling technique before and after GridSearchCV hyperparameter tuning is compared in terms of Precision, Recall, F1-Score, and Accuracy in Table VIII. Figure 19 visualizes the increase in accuracy after applying GridSearchCV to the Random Forest model with SMOTE-Tomek re-sampling.

IV. CONCLUSION
Classification based on user emotions is very important. To solve the problem of multi-class imbalance in application review data, the SMOTE-Tomek re-sampling technique can be used with the Random Forest method. The research results show that implementing a set of pre-processing techniques and, especially, the SMOTE-Tomek technique, which overcomes data imbalance by oversampling the minority classes and then eliminating samples at risk of becoming noise in the majority class, improves the performance of the Random Forest model from 58% to 84%. The model was then optimized using GridSearchCV hyperparameter tuning, increasing the accuracy to 85%. This improvement shows that combining SMOTE-Tomek with GridSearchCV can improve model performance. For future research, it is recommended that other re-sampling techniques using a Pipeline be explored.

Figure 3 shows a flowchart of the Tomek Links, SMOTE, and SMOTE-Tomek re-sampling techniques.

1) Undersampling (Tomek Links): This method works by removing samples from the majority class that are closest to the minority class, called Tomek Links. These pairs consist of nearest neighbours from different classes. Therefore, although the class distribution does not change, the dataset becomes cleaner and the boundaries between classes become clearer [20]. Figure 4 is an illustration of the Tomek Links process.

Figure 4. Illustration of the definition of Tomek Links [8]

Figure 5. Illustration of the Definition of SMOTE

3) Combined Re-sampling (SMOTE-Tomek): A combination of the SMOTE oversampling and Tomek Links undersampling techniques [21]. The process combines the capability of SMOTE, which generates synthetic data for the minority classes, with Tomek Links, which removes ambiguous samples near the class boundaries.

Figure 7. Stages of Random Forest

Figure 10. Distribution of Emotion Before Re-sampling

Data balancing in this study uses three techniques: oversampling, undersampling, and combined re-sampling. 1) Undersampling (Tomek Links): Tomek Links are removed, leaving more representative data. The amount of data removed in each class was as follows: the anger class lost 234 samples, fear lost 194, sad lost 394, happy lost 461, love lost no samples, and neutral lost 691. Figure 11 compares the class distribution before and after applying Tomek Links undersampling.

Figure 11. Comparison of Tomek Links Re-sampling Results

Figure 12. Comparison of SMOTE Re-sampling Results

3) Combined Re-sampling (SMOTE-Tomek): SMOTE-Tomek works by using SMOTE to oversample the minority data and then cleaning the majority data identified as forming Tomek Links. The amount of SMOTE result data in each class becomes 6,488. Then the Tomek Links are removed: the anger class loses 25 samples, fear loses 3, sad loses 74, happy loses 256, and neutral loses 300, while the love class has no Tomek Links and hence keeps the SMOTE result data. Figure 13 compares the class distribution before and after applying combined re-sampling with SMOTE-Tomek.

Figure 14. T-SNE Visualisation Comparing Data Distributions Using the Tomek Links, SMOTE, and SMOTE-Tomek Re-sampling Techniques

F. Split Dataset
Datasets from the various sampling techniques are divided with a ratio of 80% as training data and 20% as test data. Table VI compares the data distribution of the training and test sets for each sampled dataset.

Figure 15 shows the confusion matrix for the Random Forest model with the SMOTE-Tomek technique. Out of the six classes, the model correctly classified 1118, 1233, 1003, 943, 1294, and 828 samples in each class, respectively, according to the experimental results.

Figure 16. Classification Report: RF Model on SMOTE-Tomek Data

Figure 18. Classification Report: RF Model with GridSearchCV on SMOTE-Tomek Data

Figure 19. Comparison of Accuracy Before and After Tuning with the SMOTE-Tomek Re-sampling Technique

TABLE III. CLASS DISTRIBUTION BEFORE AND AFTER PRE-PROCESSING

TABLE V. SPARSE MATRIX OF TF-IDF WEIGHTING RESULTS

Table VII shows the performance comparison of the Random Forest and SVM models with the various sampling techniques.

TABLE VII. COMPARISON OF TESTING RESULTS

TABLE VIII. COMPARISON BEFORE AND AFTER TUNING