Semi-supervised learning for sentiment classification with ensemble multi-classifier approach



Introduction
Sentiment analysis is the process of recognizing the writer's positive or negative feelings in documents. Sentiment analysis can be divided into the document, sentence, and aspect levels [1]. At the document/sentence level, sentiment analysis classifies the sentiment of a document/sentence as positive or negative. Sentiment analysis has been applied to several domains using various techniques. Most supervised sentiment analysis uses machine learning, which requires a labeled dataset to train the model. Building a fully labeled dataset requires substantial effort and cost to obtain labels for instances [2]. Semi-supervised learning (SSL) has emerged as a promising method to annotate unlabeled data [3]. The semi-supervised approach builds the model from labeled data and incrementally improves its performance by labeling the sentiment polarity of unlabeled instances. This approach avoids time-consuming and expensive data labeling without reducing model performance.
This study aims to create a semi-supervised learning model (SSL-Model) for sentiment analysis using an ensemble approach. For vectorization, Term Frequency-Inverse Document Frequency (TF-IDF) with n-grams was applied, and an ensemble stacking mechanism was implemented. Six models were set up, combining the unigram, bigram, and trigram vectors with the Random Forest and SVM classifiers.

Data Preprocessing
The US Airlines dataset and the IMDB dataset were used. These datasets are often used for comparing sentiment analysis models: US Airlines was investigated in [14], [15], and [16], and IMDB in [5], [17], and [18]. US Airlines consists of 14640 airline reviews downloaded from Kaggle and released by CrowdFlower in CSV format; it covers three classes (positive, neutral, and negative). The IMDB dataset consists of 50,000 documents downloaded from Kaggle at https://www.kaggle.com/code/rafetcan/sentiment-analysis/data and covers two classes, positive and negative. Both datasets needed to be pre-processed first because they were unstructured and contained non-alphabetical or special characters. The pre-processing stages are described in Table 1. As an example, consider pre-processing the sentence: "@VirginAmerica you know what would be amazingly awesome? BOSS-FLL PLEASE!!!!!! I want to fly with only you". After the pre-processing stage, only documents consisting of at least two syllables were carried forward. As a result, 14096 US Airline documents could proceed to the vectorization stage, while for the IMDB dataset all documents could proceed.
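The exact cleaning stages are defined in Table 1; as an illustration, a minimal cleaning pass over the example tweet might look like the following sketch (the regular expressions and the `preprocess` helper are assumptions, not the paper's implementation):

```python
import re

def preprocess(text):
    """Rough cleaning sketch: strip @mentions, drop non-alphabetic
    characters, collapse whitespace, and lowercase."""
    text = re.sub(r"@\w+", " ", text)         # remove @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters only
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text.lower()

tweet = ("@VirginAmerica you know what would be amazingly awesome? "
         "BOSS-FLL PLEASE!!!!!! I want to fly with only you")
print(preprocess(tweet))
# you know what would be amazingly awesome boss fll please i want to fly with only you
```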

Vectorization
Term Frequency-Inverse Document Frequency (TF-IDF) is an algorithm for calculating the weight of each word in a set of documents. Term frequency is the frequency of occurrence of term Y in document X divided by the total number of terms in document X [19]. IDF reduces the weight of a term whose occurrences are spread throughout the documents. TF-IDF vector data is a sparse matrix with dimensions (n_samples, n_features). n_features is the number of features, usually the top terms with the largest TF-IDF scores, and n_samples is the number of documents, i.e. the number of rows in the dataset. For very large document collections, the features form a very high-dimensional matrix because every word that appears in the documents is represented by its score [20]. The TF-IDF vectorizer performed well for sentiment analysis in research [21] and [22].
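A minimal sketch of this vectorization using scikit-learn's `TfidfVectorizer` (the toy documents and the `max_features` cap are illustrative assumptions, not the paper's settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the flight was great and the crew was friendly",
    "terrible delay and rude staff on this flight",
    "great crew great service",
]

# One vectorizer per n-gram setting; max_features caps n_features
# at the top-scoring terms (value here is illustrative).
unigram_vec = TfidfVectorizer(ngram_range=(1, 1), max_features=1000)
bigram_vec = TfidfVectorizer(ngram_range=(2, 2), max_features=1000)

X_uni = unigram_vec.fit_transform(docs)  # sparse, shape (n_samples, n_features)
X_bi = bigram_vec.fit_transform(docs)

print(X_uni.shape[0])  # n_samples == number of documents == 3
```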

Modeling
Random Forest (RF) is used to build the ensemble multi-classifier model. Random Forest is an ensemble of decision trees, where each tree is formed using the entropy approach or the Gini index [23]. RF reduces the occurrence of overfitting by creating many trees via the bootstrapping technique and by splitting nodes. RF splits each node using the best-split strategy (Fig. 1). The final classification is the majority class of these trees. Random Forest performs well for sentiment analysis, as revealed in research [24] and [25]. In this research, the Random Forest parameters were set to n_estimators = 100, the Gini index as the split criterion, and min_samples_split = 2.
SVM is also a popular classification technique. It finds the optimum hyperplane to separate documents from different classes (Fig. 2). The SVM strategy for obtaining the optimum hyperplane is to detect the outermost data points in the two classes and then determine the hyperplane considering those outer points [26]. SVM performed well in research [27] and [28]. In this study, SVM with the Radial Basis Function (RBF) kernel was implemented.
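The two classifiers with the stated parameters can be instantiated as in the sketch below (the toy feature matrix is illustrative; `probability=True` on the SVM is an assumption, added only to expose confidence scores for the later pseudo-labeling step):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Parameters as stated in the text; RBF kernel for the SVM.
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            min_samples_split=2, random_state=0)
svm = SVC(kernel="rbf", probability=True)  # probability=True exposes
                                           # scores for pseudo-labeling

# Tiny illustrative fit on toy TF-IDF-like features (not the paper's data)
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = ["positive", "positive", "negative", "negative"]
rf.fit(X, y)
svm.fit(X, y)
print(rf.predict([[0.85, 0.15]])[0])  # positive
```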

Architecture
The SSL-Model architecture proposed in this study is shown in Fig. 3. The process began with reading the annotated input data as training data, test data, and unlabeled data (the gray boxes in Fig. 3). The training data was vectorized using TF-IDF, producing three vectors: the unigram, bigram, and trigram tokenization vectors. The three vectors were used to create three models using RF (and SVM as a comparison in the next experiment). The performance of the three models was measured using the F1 score in Test 1. This F1 score was used as a weight in the voting process at the threshold calculation stage. As shown in Fig. 3, the three models worked separately to annotate the unlabeled data, and every model produced pseudo-labels. A threshold number was used to select whether the annotated data (pseudo-labels) was worthy of becoming training data. Several threshold numbers ranging from 0.6 to 0.9 had been tried in the preliminary study, and 0.6 was chosen because it produced a more accurate and larger set of labeled documents. High-confidence documents were added to the training data, while documents with low-confidence labels were re-labeled in the next iteration. The iterations in the SSL model ran ten times or until the unlabeled data ran out. The output of the model was training data (DT) that had been labeled by humans and machines. The resulting training data was used to build a new classifier model, which was evaluated using the F1 score and accuracy in Test 2. Test 2 measured the performance of the SSL-Model.
Fig. 4 describes the pseudocode of the proposed model. The pseudocode began with the declaration of a threshold number. The next step was to input the training data (DT), test data (DTest), and unlabeled data (UN) on lines 2-4. The training, test, and unlabeled datasets were then converted to unigram, bigram, and trigram representations using the TF-IDF method (lines 6-9).
Then, three classifier models were formed using the three training sets and machine learning (RF or SVM) on line 10. In the next step, every classifier was validated using the test data; accuracy and F1-score were used as metrics to measure the performance of each model (lines 12-14). The labeling process was on lines 15-17. The selection process for each newly annotated document, deciding whether it was suitable as training data, was on lines 18-24. The process began by checking whether the newly annotated document tended to be positive, negative, or neutral based on the pseudo-label weights. If the weight exceeded the threshold, the document was accepted as training data; otherwise, it was checked again in the next iteration with the new model (formed with the new training data).
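The loop described above can be sketched as a simplified self-training routine; here a single classifier's predicted probability stands in for the paper's weighted three-model vote, and the `self_train` helper and toy data are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

def self_train(clf, X_train, y_train, X_unlabeled, threshold=0.6, max_iter=10):
    """Simplified SSL loop: pseudo-label the unlabeled pool, keep only
    high-confidence labels as new training data, and repeat until the
    pool is empty, max_iter is reached, or no new labels are accepted."""
    X_train, y_train = list(X_train), list(y_train)
    unlabeled = list(X_unlabeled)
    for _ in range(max_iter):
        if not unlabeled:
            break                                    # unlabeled data ran out
        clf.fit(X_train, y_train)
        rest = []
        for x, proba in zip(unlabeled, clf.predict_proba(unlabeled)):
            if proba.max() >= threshold:             # confident pseudo-label
                X_train.append(x)
                y_train.append(clf.classes_[proba.argmax()])
            else:
                rest.append(x)                       # retry in next iteration
        if len(rest) == len(unlabeled):
            break                                    # converged: nothing accepted
        unlabeled = rest
    clf.fit(X_train, y_train)                        # final model: expert + pseudo labels
    return clf, unlabeled

# Toy demonstration (illustrative one-dimensional data, not the paper's)
clf, remaining = self_train(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_train=[[0.0], [0.1], [0.9], [1.0]], y_train=[0, 0, 1, 1],
    X_unlabeled=[[0.05], [0.95]])
print(len(remaining))  # 0: every document received a confident pseudo-label
```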

Validation
The Confusion Matrix is a performance measurement for machine learning classification. Its output can cover two or more classes, as in research [29] and [30]. The confusion matrix compares the actual conditions with the predicted results (Table 2). Accuracy (1) is the ratio of correctly predicted observations to the total observations:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

The F1 Score, given in (2), is the weighted average of Precision and Recall. The F1 Score is usually more useful than accuracy, especially when the class distribution is uneven:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall) (2)

Precision (3) is the ratio of correctly predicted positive observations (TP) to the total predicted positive observations (TP + FP). High precision corresponds to a low false positive rate:

Precision = TP / (TP + FP) (3)

Recall (sensitivity), presented in (4), is the ratio of correctly predicted positive observations (TP) to all observations in the actual positive class (TP + FN):

Recall = TP / (TP + FN) (4)
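These metrics can be computed directly from the confusion-matrix counts, for example (the counts here are made-up illustrative numbers):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                       # TP / (TP + FP)
    recall = tp / (tp + fn)                          # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=40, fp=10, fn=10, tn=40)
print(round(f1, 2))  # 0.8
```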

Experiment on US Airline Dataset
The US Airlines dataset was randomly divided into test data and training data. The number of labeled test data for each of E1, E2, E3, and E4 was 1464. For training, four datasets coded E1, E2, E3, and E4 were prepared, with 2928, 1464, 732, and 366 labeled training documents (annotated dataset), respectively. The leftover training data was used as the unlabeled (unannotated) dataset. The baseline model in every experiment E1, E2, E3, and E4 was trained with the labeled training data and tested using the labeled test data. Table 3 shows the first experiment, in which the SSL results were processed step by step from E1 with a 0.6 (60%) threshold using SVM, and Table 4 shows the first experiment using Random Forest. Table 3 explains that the first step (baseline row) measures baseline performance. The model was built using 2928 training data and classified 1410 test data. The test results showed that the baseline accuracy was 0.67 and the F1-Score was 0.69. At this step, 9758 unlabeled data had not yet been processed. In the next step, the first iteration, 10569 new training data were generated, i.e. the sum of the previous training data and the annotation results from the unlabeled dataset. The new training data was used to create a new classification model. The new model was tested using the test data, and the results showed that the accuracy increased to 0.73 and the F1 score increased to 0.73. At this step, only 2117 unlabeled documents remained. Iterations 2 to 7 proceeded in the same way as the first iteration. From the second iteration onward, the accuracy decreased to 0.69 and the F1-Score to 0.70. The seventh iteration was the last step, in which no additional unlabeled documents could be annotated. Accuracy and F1-Score were 0.69 and 0.70, which is the final performance of the SSL. The final condition was convergent, i.e.
the amount of unlabeled data was the same as in the previous iteration. The remaining 157 unlabeled documents required manual annotation. So far, it could be concluded that SVM was able to increase the accuracy over the baseline (from 0.67 to 0.69) and the F1-score from 0.69 to 0.70. The experiment was continued with Random Forest (Table 4).
Fig. 5 compares the performance of each iteration from Table 3. The graph shows that performance increases in the first step of SSL, then decreases and stabilizes from the second step onward. From the baseline stage to iteration 1 there is an increase in accuracy and F1-Score because the classifier model was formed using labeling from experts. From the second to the seventh iteration, the accuracy decreased to 0.69 and the F1-Score to 0.70 because the classifier model was formed using combined expert and machine labeling (pseudo-labels). Fig. 6 compares the performance of each iteration from Table 4; here, performance decreased in the first step of SSL and then increased in the second step.
As in Table 3, Table 4 explains that the first step measures baseline performance. The model was built using 2928 training data and classified 1410 test data; the baseline accuracy was 0.73 and the F1-Score was 0.73, higher than in the SVM trial. At this step, 9758 unlabeled data had not yet been processed. In the next step, the first iteration, 9464 new training data were generated and used to create a new classification model. The new model was tested using the test data, and the results showed a decrease in accuracy to 0.70 and in F1 score to 0.71. Fig. 6 shows that from the baseline to iteration 1 there was a decrease in accuracy and F1-Score because the Random Forest classifier did not work as well as SVM (on the US Airlines dataset). After the first iteration, 3222 unlabeled documents remained. In the second iteration, the accuracy increased to 0.71 and the F1-Score to 0.72 because the RF classifier model improved after being formed using a combination of expert and machine labeling (pseudo-labels). The second iteration was the last step, in which the number of unlabeled documents reached 0. Accuracy and F1-Score were 0.71 and 0.72, which is the final performance of the SSL. The final condition was obtained after all the unlabeled data had been successfully annotated. SVM was more selective in the classification process, so it required more iterations than RF, and on US Airlines the SVM performance was higher than that of RF.
At baseline, RF was higher than SVM, but its accuracy then decreased from the baseline (from 0.73 to 0.71), as did its F1-score (from 0.73 to 0.72). The advantage was that RF required fewer iterations and all unlabeled data were successfully annotated. The experiment was continued in scenarios E2, E3, and E4. The accuracy and F1-score results from all experiments are presented in Table 5. These experiments remained on the two machine learning models at the 60% threshold. Table 5 shows that the accuracy and F1-score at the baseline of the RF models are higher than those of the SVM models. The semi-supervised classification results show that accuracy and F1-score also tend to be linear in the number of training instances. The difference between the average accuracy of the baseline and that of the SSL model in SVM is 0.03, which is better than the RF SSL model (-0.04). The difference between the average F1 score of the baseline and that of the SSL model in SVM is 0.005, which is better than the RF SSL model (-0.01). SSL models created using SVM tended to provide better accuracy over the baseline. This means that SVM was better at maintaining the performance of the SSL process than RF, although in some experiments RF performed better than SVM.

Experiment on IMDB Dataset
Similar to the previous experiment, four experimental datasets coded E1, E2, E3, and E4 were prepared. The IMDB dataset was randomly divided into training data and test data in a 9:1 ratio. The number of labeled test data for each of E1, E2, E3, and E4 was 5000 (10% of all IMDB data). The number of labeled training data (annotated dataset) in E1, E2, E3, and E4 was 5000, 2500, 1250, and 625, respectively. The leftover training data was used as the unlabeled (unannotated) dataset. As in the previous experiment, the baseline model in E1, E2, E3, and E4 was trained with the labeled training data without pseudo-labels and tested using the labeled test data. Table 6 shows the first experiment, in which the SSL results were processed step by step from the E1 dataset with a 60% threshold using SVM, and Table 7 shows the corresponding experiment using Random Forest. In the first step in Table 6 (baseline row), the model was built using 5000 training data and classified 5000 test data; the baseline accuracy was 0.85 and the F1-Score was 0.85. At this step, 40000 unlabeled data had not yet been processed. In the next step, the first iteration, 45000 new training data were generated and used to create a new classifier. The new classifier was tested using the test data and showed a decrease in accuracy to 0.83 and in F1 score to 0.83. The second iteration was the last step, in which the number of unlabeled documents reached 0; this is the final performance of the SSL. Table 7 explains that with Random Forest the baseline accuracy is 0.84 and the F1-Score is 0.84, lower than the SVM model. In the next step, the first iteration, 45000 new training data were generated and used to create a new classification model. The new model was tested, and the accuracy decreased to 0.80 and the F1 score decreased to 0.80. The second iteration was the last step, in which the number of unlabeled documents reached 0; this is the final performance of the SSL.
Both methods required the same number of iterations. The experiment was continued in scenarios E2, E3, and E4 on the two machine learning models at the 60% threshold. The accuracy and F1-scores from all experiments are presented in Table 8. Table 8 describes the SSL-Model operations on the IMDB dataset and presents different results from US Airlines. The accuracy and F1-score at the baseline of the SVM models were higher than those of the Random Forest models. The accuracy and F1-score also tended to be linear in the number of training instances. The difference between the average accuracy of the baseline and that of the SSL model in SVM was -0.02, which was better than the RF SSL model (-0.03). The difference between the average F1 score of the baseline and that of the SSL model in SVM was -0.01, which was better than the RF SSL model (-0.03). In both types of machine learning, the accuracy of the SSL model decreased relative to the baseline. However, the main conclusion is that SVM is better at maintaining the accuracy of the SSL process than RF.
This study outperformed the F1 scores of Balakrishnan et al. (73.8% for RF and 72.2% for SVM) [12]. It also outperformed the F1-scores from research [13], which were 79.039 for B-SVM (an SVM model without SSL) and 79.95 for SSSVM (an SVM model with bootstrap) when using 1000 labeled data. In this study, on the IMDB dataset, the F1-scores reached 83% for SVM and 80% for RF.

Conclusion
This study presents semi-supervised learning for sentiment classification with an ensemble multi-classifier approach to construct an annotated sentiment corpus from the US Airlines and IMDB datasets. TF-IDF techniques were implemented to build the vectors for modeling the classifiers. The results support several conclusions. The first is that in SSL the classification accuracy is highly dependent on the suitability of the dataset for the machine learning algorithm used. On the IMDB and US Airlines datasets, SVM is better at maintaining model performance against the baseline. On the US Airlines dataset, RF is better at achieving baseline performance but fails to maintain model performance during SSL. Future research will test sentiment analysis using several machine learning algorithms, datasets, and vectorizers, such as FastText or Word2Vec.