Aspect-Based Analysis of Telkomsel User Sentiment on Twitter Using the Random Forest Classification Method and GloVe Feature Expansion



I. INTRODUCTION
Telkomsel is one of the largest telecommunications providers in Indonesia and offers many services to the Indonesian public, including internet access. Telkomsel has a strong reputation as an operator and serves more than 23.5 million customers throughout Indonesia [1].
The features provided by Telkomsel attract customers and range from internet packages, SMS, telephony, and multimedia to digital payments. Users respond to this broad range of features with positive, negative, and neutral opinions. These responses are called sentiments [2]. Many such opinions about Telkomsel's features are posted on Twitter. With thousands or even millions of tweets posted every day, Twitter makes it easier for the author to collect data on Telkomsel user sentiment.
In this study, the author conducts sentiment analysis using the Random Forest classification method with GloVe feature expansion. Random Forest is an ensemble classifier consisting of several decision trees that together form a random "forest" [3]. Feature expansion is used to expand the vocabulary, which later assists the learning process [4].
Various studies have addressed sentiment classification on Twitter using different algorithms. Popular and widely used classification methods for sentiment analysis include Naïve Bayes, Decision Tree, and Random Forest [5]. Alita et al. [6] used the Random Forest classification method for sentiment analysis to detect sarcasm. Sari et al. [7] used binary logistic regression, Naïve Bayes classifier (NBC), and Support Vector Machine (SVM) methods. Nasution et al. [8] used K-NN and SVM for Twitter sentiment analysis.
However, [5]-[8] focus only on standard (baseline) algorithms that have not been optimized to improve accuracy. There are many ways to improve the accuracy of a machine learning classifier, one of which is feature expansion. Therefore, this study proposes a Random Forest algorithm with GloVe feature expansion for classifying the sentiment of Telkomsel users on Indonesian Twitter.
The analysis in this study is divided into two aspects: the signal aspect and the service aspect. The study evaluates the performance of Random Forest classification with feature extraction and feature expansion for each aspect. It is hoped that this research produces better and more accurate performance scores than previous research.

A. System Overview
This research consists of data crawling, data labeling, data preprocessing, feature extraction, data splitting, feature expansion, and sentiment analysis modeling. Each stage is explained in the following subsections.

*) Corresponding author (Erwin Budi Setiawan). Email: erwinbudisetiawan@telkomuniversity.ac.id

B. Data Crawling and Labelling
In this study, the dataset was collected from Twitter using several keywords related to Telkomsel, such as "Telkomsel signal", "Telkomsel service", "Telkomsel slow", "Telkomsel quality", "Telkomsel network", "Telkomsel down", and "Telkomsel roaming". The data were crawled using the Twitter API and the Python Twint library. After the crawled data were retrieved, they were saved in a CSV file. These data are used to train the sentiment analysis model. After crawling, the dataset was labeled with three labels: positive, negative, and neutral. Table 1 shows some of the labeled data.

C. Data Preprocessing
The dataset collected by crawling is still unstructured and contains a lot of noise [9]. Therefore, a process is needed to convert the unstructured data into structured data. The preprocessing stage consists of Data Cleaning, Case Folding, Tokenizing, Stopword Removal, and Stemming. Each step is explained as follows:
1. Data Cleaning: removing noise in the text such as usernames, hashtags, URL links, numbers, and punctuation.
2. Case Folding: converting capital letters to lowercase.
3. Tokenizing: breaking the text into tokens, that is, the individual words it contains.
4. Stopword Removal: deleting words that are considered to have no important effect on the classification, such as conjunctions.
5. Stemming: reducing each word to its root by removing prefixes and suffixes.
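The five preprocessing steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the preprocessing code used in this study: the stopword list, the suffix list, and the helper names `clean`, `toy_stem`, and `preprocess` are hypothetical, and the toy stemmer only strips a few suffixes, whereas a real Indonesian stemmer also handles prefixes and relies on a root-word dictionary.

```python
import re

# Hypothetical mini stopword list; a real Indonesian list is far larger.
STOPWORDS = {"dan", "yang", "di", "ke", "dari", "ini", "itu"}

# Hypothetical suffixes for a toy stemmer (assumption, for illustration only).
SUFFIXES = ("nya", "kan", "lah")


def clean(text: str) -> str:
    """Data cleaning: drop URLs, usernames, hashtags, numbers, punctuation."""
    text = re.sub(r"https?://\S+", " ", text)   # URL links
    text = re.sub(r"[@#]\w+", " ", text)        # usernames and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # numbers and punctuation
    return text


def toy_stem(token: str) -> str:
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> list[str]:
    text = clean(tweet)                                  # 1. data cleaning
    text = text.lower()                                  # 2. case folding
    tokens = text.split()                                # 3. tokenizing
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. stopword removal
    return [toy_stem(t) for t in tokens]                 # 5. stemming


tokens = preprocess(
    "@user Sinyalnya Telkomsel lambat dan hilang!! https://t.co/x1 #telkomsel 123"
)
print(tokens)  # ['sinyal', 'telkomsel', 'lambat', 'hilang']
```

Keeping each step as its own function makes it straightforward to swap the toy stemmer or stopword list for a full Indonesian resource.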

D. Data Split
After preprocessing, the data are split. Data splitting is the process of separating training data from test data [10]. In this study, the split ratio is 80:20, with 80% of the data used for training and 20% for testing. After splitting, there are 13,590 training samples and 3,398 test samples.
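The arithmetic behind the 80:20 split can be checked with a short sketch. It assumes a simple shuffled split with a fixed seed; whether the split in this study was shuffled or stratified by label is not stated, and the function name `split_data` and the seed value are hypothetical.

```python
import random


def split_data(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of the samples and cut it at the training ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 16,988 labeled tweets reported in this study.
dataset = list(range(16_988))
train, test = split_data(dataset)
print(len(train), len(test))  # 13590 3398
```

The resulting sizes match the 13,590 training and 3,398 test samples reported above.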

E. Feature Extraction with TF-IDF
After splitting, the data proceed to the feature extraction stage. Feature extraction is an important factor that can affect accuracy at the classification stage. The feature extraction method used in this research is TF-IDF. This method assigns a weight to each word based on how frequently it is used [11].
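A minimal sketch of TF-IDF weighting follows. It assumes the common textbook variant, with term frequency as the relative count within a document and inverse document frequency as log(N / df); the exact variant used in this study is not specified, and library implementations often add smoothing and normalization. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows


# Toy tokenized tweets (made up for illustration).
docs = [
    ["sinyal", "lambat"],
    ["sinyal", "bagus"],
    ["layanan", "lambat", "lambat"],
]
tfidf_rows = tf_idf(docs)
for row in tfidf_rows:
    print(row)
```

In this toy corpus, "layanan" and "bagus" each appear in only one document, so they receive a higher inverse document frequency than "sinyal" and "lambat", which appear in two.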

F. GloVe Feature Expansion
In this study, GloVe is used for feature expansion, which aims to expand the features of a word. The GloVe algorithm produces a list of similar words for each word [12]. For example, Table 2 lists the words most similar to the word "internet", sorted by rank.
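A ranked similarity list of this kind can be derived from word vectors, as the following sketch shows. It assumes cosine similarity as the ranking measure, which the text does not state, and the three-dimensional vectors in `VECTORS` are hypothetical; real GloVe vectors are learned from corpus co-occurrence counts and typically have 50 to 300 dimensions.

```python
import math

# Hypothetical toy word vectors (assumption, for illustration only).
VECTORS = {
    "internet": [0.9, 0.1, 0.0],
    "jaringan": [0.8, 0.2, 0.1],
    "sinyal": [0.7, 0.3, 0.0],
    "pulsa": [0.1, 0.9, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=3):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


for rank, (word, score) in enumerate(most_similar("internet", VECTORS), start=1):
    print(rank, word, round(score, 3))
```

With these toy vectors, "jaringan" ranks first, "sinyal" second, and "pulsa" last.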

G. Random Forest
Random Forest is a combination of tree predictors. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [13]. Random Forest was introduced by Ho (1995), who combined many trees built on the training data to produce a high level of accuracy [14].

Figure 2. Random Forest visualization
The starting point of a tree is the root node, and the point where a chain ends is called a leaf node. A node represents a particular characteristic, whereas a branch represents a range of values [15]. For Random Forest, the dataset is divided into training and test sets. Each tree is built on in-bag data, a subset of the training data, and the remaining training data form its out-of-bag set [16].
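The in-bag and out-of-bag partition can be illustrated with a short sketch. It assumes the usual bootstrap scheme, in which each tree draws as many training indices as there are samples, with replacement, and it combines tree outputs by majority vote. The function names and the seed are hypothetical, and no actual decision trees are trained here.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(predictions):
    """Combine per-tree class predictions into the forest's prediction."""
    return max(set(predictions), key=predictions.count)


rng = random.Random(0)  # fixed seed, chosen only for repeatability
for tree_id in range(3):
    in_bag, out_of_bag = bootstrap_split(10, rng)
    print(tree_id, sorted(in_bag), out_of_bag)

# Three hypothetical tree predictions for one tweet.
print(majority_vote(["negative", "neutral", "negative"]))  # negative
```

Because sampling is done with replacement, some indices repeat in the in-bag list and others never appear; the latter form the out-of-bag set for that tree.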

H. Hyperparameter Tuning Using Grid Search
Hyperparameter tuning is critical for the effective deployment of most machine learning (ML) algorithms. Grid Search has also been used for hyperparameter tuning in a case study of tweet emotion classification in Indonesian [17].
In this study, the author conducts experiments with hyperparameter tuning to improve performance. Selecting the right model parameters is very important because it improves classification performance. The best parameters produce the best performance values. In this study, hyperparameter tuning was used to determine the optimal parameter values.
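The idea of grid search can be sketched as an exhaustive loop over parameter combinations. This is an illustration under stated assumptions: the parameter names and values in `PARAM_GRID` are hypothetical because the search space of this study is not listed in the text, and `toy_evaluate` returns made-up scores in place of training and cross-validating a model.

```python
from itertools import product

# Hypothetical search space (assumption, not the grid used in this study).
PARAM_GRID = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
}


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for cross-validated model scoring (made-up numbers)."""
    depth_bonus = {10: 0.00, 20: 0.03, None: 0.01}[params["max_depth"]]
    return 0.80 + 0.0001 * params["n_estimators"] + depth_bonus


best_params, best_score = grid_search(PARAM_GRID, toy_evaluate)
print(best_params, round(best_score, 2))
```

The cost grows with the product of the list lengths, here 2 × 3 = 6 evaluations, so the grid is usually kept small.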

I. Performance Measurement
The classification results need to be evaluated to determine the performance of the model. In this study, the evaluation method used is the confusion matrix. The confusion matrix consists of four elements: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). TP is the number of positive samples correctly predicted as positive, FN is the number of positive samples incorrectly predicted as negative, FP is the number of negative samples incorrectly predicted as positive, and TN is the number of negative samples correctly predicted as negative. Several performance measures can be calculated from the confusion matrix elements, including accuracy, precision, recall, and F1-score [19]. They are obtained using the following standard formulas:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
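The four measures can be computed directly from the confusion matrix counts, as in the sketch below. The counts are made up for illustration, and the function name `binary_metrics` is hypothetical. With three sentiment labels, these quantities would be computed per class (one class against the rest) and then averaged; which averaging scheme this study used is not stated.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1-score from confusion matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for a single class.
scores = binary_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in scores.items():
    print(name, round(value, 4))
```

The guards against zero denominators cover the case where a class is never predicted or never occurs in the test data.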

A. Data Distribution
The crawling process yielded 16,988 Indonesian-language tweets containing the opinions of Telkomsel users in Indonesia. There are three labels: positive, neutral, and negative. The data distribution can be seen in Figure 3.

Figure 3. Data distribution for each aspect
In addition, complementary data for building the word dictionaries were taken from several news media, such as CNN Indonesia, Indonews, Kompas, Tempo, Detik.com, Liputan6, and Republika, totaling 142,544 items. The Indonews corpus was used in combination experiments in this study to find the best results. Three GloVe corpora were created: one built from the tweet dataset, one built from the news dataset, and one built from the combined Tweet+News dataset.

B. Test Results and Analysis
In this study, several steps are taken before the evaluation step. First, the tweet data are crawled and labeled. Second, the data are preprocessed and then split into training data and test data. After that, features are extracted using TF-IDF. Then the data are classified using the Random Forest model. The following scenarios are used in this study:
1. Scenario 1: Testing baseline data using different split ratios
2. Scenario 2: Testing oversampling performance using SMOTE
3. Scenario 3: Testing feature expansion performance using GloVe
4. Scenario 4: Testing hyperparameter tuning performance

Scenario 1 (Baseline + TF-IDF)
The first scenario determines the baseline, that is, the initial data carried into the subsequent tests. This test is performed on the tweet data obtained in the crawling stage (Table 1), and its results can be seen in Table 4. The test is designed to find the best performance value among the split proportions.
The first step is to split the data in Table 1 using three proportions: 80% training data and 20% test data, 90% training data and 10% test data, and 70% training data and 30% test data. The data are then trained using the Random Forest classification method, and the results for each proportion are compared to find the best one.
As can be seen from Table 4, the proportion of 80% training data and 20% test data gives the best performance compared with the other proportions, with an accuracy of 56.91% and an F1-score of 24.18% for the signal aspect, and an accuracy of 79.5% and an F1-score of 29.7% for the service aspect.

Scenario 2 (Oversampling with SMOTE)
This test is performed on the best data from the first scenario, which can be seen in Table 4. The data with a proportion of 80% training data and 20% test data are oversampled using SMOTE. Oversampling aims to balance the imbalanced tweet data [20].
In this scenario, the data with the 80:20 proportion are first oversampled so that the classes become balanced, as was done in [9]. After balancing, the data are trained using the Random Forest classification method. As can be seen in Table 5, performance increased substantially because the data processed in the first scenario were still imbalanced.
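The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the library implementation used in this study: the function name `smote`, the value of `k`, the seed, and the two-dimensional toy points are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy feature vectors: six majority-class points and three minority-class points.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.0, 0.2]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # 6 6
```

Every synthetic point lies on a line segment between two existing minority points, so the minority class grows without simply duplicating samples.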

Scenario 3 (Feature Expansion using GloVe)
This test is performed on the best data from the second test, which can be seen in Table 5. The data oversampled with SMOTE then undergo feature expansion, which aims to expand the vocabulary of the dataset.
In scenario 3, the first step is to train the data for every aspect in Table 6. The corpora used are tweet data, news data, and tweet+news data, which can be seen in Table 4. The data are trained using the GloVe feature expansion algorithm available as a Python library. The data are trained to obtain the accuracy for each top-n feature setting. This study uses four settings: top-1, top-5, top-10, and top-20 features.
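One way a top-n setting can affect the feature vector is sketched below. The expansion rule is an assumption, since the text does not spell it out: a vocabulary feature that is absent from a tweet is switched on when one of its top-n similar words appears in the tweet. The similarity lists in `SIMILAR`, the binary feature values, and the function name `expand_features` are hypothetical; the study itself uses TF-IDF weights and GloVe-derived lists.

```python
# Hypothetical ranked similarity lists, as a GloVe corpus might produce.
SIMILAR = {
    "internet": ["jaringan", "sinyal", "koneksi", "data", "wifi"],
    "lambat": ["lemot", "lelet", "pelan"],
}


def expand_features(tokens, vocabulary, similar, top_n):
    """Binary feature vector where absent words may be filled by similar words."""
    present = set(tokens)
    vector = []
    for feature in vocabulary:
        if feature in present:
            vector.append(1)
        elif present & set(similar.get(feature, [])[:top_n]):
            vector.append(1)  # filled in by a top-n similar word
        else:
            vector.append(0)
    return vector


vocabulary = ["internet", "lambat", "mahal"]
tweet = ["koneksi", "lemot", "banget"]

for n in (1, 5):
    print(n, expand_features(tweet, vocabulary, SIMILAR, top_n=n))
# 1 [0, 1, 0]
# 5 [1, 1, 0]
```

A larger n fills in more features but also admits less closely related words, which is one plausible reason to compare several top-n settings.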
As can be seen from Table 6 and Table 7, the best performance in the third scenario is obtained by feature expansion with the Tweet+News corpus and top-1 features, with an accuracy of 83.58% and an F1-score of 67.87% for the signal aspect, and an accuracy of 77.52% and an F1-score of 43.81% for the service aspect.

Scenario 4 (Hyperparameter Tuning)
This test is performed on the best data from the previous scenario, namely the feature expansion data shown in Table 6 and Table 7. The top-1 feature expansion data with the Tweet+News corpus gave the best performance in the third scenario, so these data are processed at this stage with the hyperparameter tuning method. The best parameters are obtained through a search using GridSearchCV [18], a Python tool that searches for the best parameter values of a classifier. The best parameters can be seen in Table 8. As Table 9 shows, in the signal aspect the accuracy increased to 93.52% and the F1-score increased to 73.58%, and in the service aspect the accuracy increased to 85.49% and the F1-score increased to 66.32%.

After the four test scenarios, it can be concluded that each test affects the performance of the Random Forest model. In the first scenario, among training and test proportions of 80:20, 90:10, and 70:30, the 80:20 proportion gives higher accuracy than the others. The second scenario tests oversampling with SMOTE. In this test, performance increases markedly because the training data are balanced, whereas the data in the first test were imbalanced, consistent with [20]. The third scenario tests GloVe feature expansion. The data were tested using three corpora: the tweet corpus, the news corpus, and the tweet+news corpus. The best result was obtained with the tweet+news corpus and top-1 features. Performance at this stage increases, consistent with [9], because the data are trained with features expanded from the corpus that was built.

IV. CONCLUSION
Balancing the data using SMOTE affects the performance value of each aspect. Implementing feature expansion with GloVe improves performance. In the signal aspect, accuracy increases to 93.52% and the F1-score to 73.58%; in the service aspect, accuracy increases to 85.49% and the F1-score to 66.32%.
Further research could combine other feature extraction methods, such as Bag of Words (BoW), with other feature expansion methods, such as Word2Vec or FastText, and with other classification methods, such as SVM and Naïve Bayes.

Figure 1. Sentiment analysis system using Random Forest classification and GloVe feature expansion

Table 2. Examples of words similar to the word "internet"

Table 4. First scenario results

Table 5. Second scenario results

Table 7. Third scenario results for the service aspect

Table 8. Best parameters
Note: The parameter values in Table 8 were obtained through the Grid Search library available in Python.

Table 9. Fourth scenario results