Classification Performance Comparison of BERT and IndoBERT on Self-Report of COVID-19 Status on Social Media

Messages shared on social media platforms such as X can be automatically categorized into two groups: those that self-report COVID-19 status and those that do not. Such messages can serve as a monitoring tool for tracking the spread of the COVID-19 pandemic. The classification of social media messages can be achieved by applying classification algorithms. Many deep-learning algorithms, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, have been used for text classification. However, CNNs are limited in capturing global context, while LSTMs focus on word-by-word sequences; both also require large amounts of training data. Bidirectional Encoder Representations from Transformers (BERT) is a text-classification approach developed to address these shortcomings, and many BERT variants have since emerged. The primary objective of this study was to compare the effectiveness of two classification models, BERT and IndoBERT, in identifying self-report messages of COVID-19 status. Both models were evaluated using raw and preprocessed text data from X. The findings show that the IndoBERT model performed best, achieving an accuracy of 94%, whereas the BERT model achieved 82%.


Introduction
COVID-19 is an illness caused by the coronavirus, first reported in Wuhan at the end of 2019. The first COVID-19 cases in Indonesia were disclosed in March 2020. Common indications of COVID-19 include fever, coughing, difficulty breathing, and a diminished sense of smell. By April 2020, the pandemic had registered more than 3 million documented cases worldwide, along with roughly 208,516 fatalities [1]. This high mortality is partly attributable to delays in promptly identifying those infected with COVID-19 [2]. Consequently, infected individuals continue their customary activities, facilitating transmission of the virus to their close contacts.
The worldwide dissemination of COVID-19 evolved into a pandemic, significantly impacting individuals across the globe. Throughout this global health crisis, social media, particularly the platform X, emerged as a predominant means of disseminating information about the COVID-19 virus.
X has emerged as a widely embraced platform, with over 3.7 million active users who post approximately 10 million messages per day [3]. Among this vast volume of posts, individuals often use the platform to share information about the COVID-19 crisis, including personal accounts of symptoms experienced or even instances of infection. Users have also recounted on X the adverse impact the pandemic has had on their families. Given its real-time nature, X lends itself well to monitoring the progression of the COVID-19 pandemic.
Identification of COVID-19 status can be automated by classifying self-report messages with classification algorithms extensively developed by scholars in the field. Techniques employed for text classification include Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), word2vec, Support Vector Machines (SVM), and Convolutional Neural Networks (CNN).
The classification of emotions in text was undertaken in [4] using the LSTM method, which yielded an accuracy of 73.15%. In [3], the investigation focused on analysing emotions and sentiments in posts using BERT, reaching an accuracy of 92%. A subsequent study [5] also examined emotions and sentiments with the BERT model, achieving a performance level of 92%. Furthermore, researchers devised the IndoBERT model to classify Indonesian texts [6]; their findings demonstrate that IndoBERT is more effective on Indonesian texts when training data is limited.
The BERT model has recently gained popularity among researchers. In contrast to other language models, BERT is a deep bidirectional model pre-trained on unlabeled text data by integrating left- and right-side context layers [7]. Consequently, BERT models can be adapted to various machine-learning tasks, such as classification and question answering, by adding a single output layer [8].
However, BERT still has limitations, specifically the restricted coverage of Indonesian-language vocabulary. Consequently, numerous scholars have designed BERT architectures for their respective languages, and the model is now available in many languages, including Indonesian, as exemplified by IndoBERT [6]. Another text-classification study used the IndoBERT model as a basis and experimentally investigated the impact of text preprocessing on classification performance [9].
Current literature on COVID-19 on social media predominantly focuses on sentiment analysis. Conversely, there is a dearth of research on text classification aimed at identifying self-reported messages regarding COVID-19 status, and the available datasets primarily consist of English-language texts [1], [10], [11].
This study compares the performance of BERT and IndoBERT models in identifying self-reported COVID-19 status from social media messages in Indonesian. Each model was trained and tested using both raw and preprocessed text, with the objective of determining which model combination offers the highest classification performance.

Dataset
The dataset used in this study is derived from a previous investigation [12]. It comprises 1,000 messages sourced from X containing the keyword "covid". The dataset is divided into two classes, positive and negative. A positive message is one that, alongside the keyword COVID-19, mentions one or more COVID-19 symptoms and whose connotation indicates that the author is presumed positive for COVID-19. There are 500 positive messages in total. Examples of positive-category messages can be observed in Table 1.
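As a concrete illustration, the labelled dataset can be loaded as a simple two-column table of message text and class label. The file layout, column names, and the two sample messages below are assumptions for illustration; the actual dataset from [12] may be structured differently.

```python
# Minimal sketch of loading a labelled message dataset.
# Column names ("text", "label") and the sample rows are assumptions.
import csv
import io

# A tiny in-memory stand-in for the real 1,000-message CSV file.
sample_csv = """text,label
"batuk dan demam, sepertinya saya kena covid",positive
"berita covid hari ini menurun",negative
"""

def load_dataset(fp):
    """Read (text, label) pairs from a CSV file object."""
    reader = csv.DictReader(fp)
    texts, labels = [], []
    for row in reader:
        texts.append(row["text"])
        labels.append(row["label"])
    return texts, labels

texts, labels = load_dataset(io.StringIO(sample_csv))
print(len(texts), labels)  # 2 ['positive', 'negative']
```

In the real dataset the two classes are balanced, with 500 messages each.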

Research Implementation
The various phases executed throughout this investigation are illustrated in Figure 1.
The initial phase involves gathering the dataset, as described in the preceding section. The text is then preprocessed for normalization using tokenization, stemming, and stopword-removal techniques [13], [14], [15]. Tokenization splits sentences into words, punctuation, and other meaningful units according to the rules of the language in use. Stemming transforms an inflected word into its base form (root) by removing affixes such as prefixes, suffixes, and confixes. Stopword removal excludes words that carry little meaning.
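The three preprocessing steps above can be sketched in pure Python. The stopword list and affix-stripping rules below are toy assumptions for illustration only; Indonesian pipelines commonly rely on a dedicated stemmer such as Sastrawi.

```python
# Toy sketch of tokenization, stemming, and stopword removal.
# The stopword list and affix rules are illustrative assumptions, not
# a real Indonesian stemmer.
import re

STOPWORDS = {"yang", "dan", "di", "ke", "saya"}  # tiny sample list

def tokenize(sentence):
    # Split a sentence into word and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def stem(token):
    # Naive affix stripping (prefix "me-", suffix "-nya"), for illustration only.
    if token.startswith("me") and len(token) > 4:
        token = token[2:]
    if token.endswith("nya") and len(token) > 5:
        token = token[:-3]
    return token

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("Saya merasa demam dan batuk")
print(remove_stopwords([stem(t) for t in tokens]))  # ['rasa', 'demam', 'batuk']
```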
Additional preprocessing adds special tokens: [CLS] at the start of the sentence, [SEP] at its end, and [PAD] for shorter sentences [16]. The next phase divides the data using the hold-out technique, which separates the data into training data (80%) and test data (20%) [17], [18]. The learning phase then constructs classification models using two methods, BERT and IndoBERT. This stage produces four models combining methods and preprocessing options, as listed in Table 3. The hyperparameter values used for BERT and IndoBERT are identical. Models whose preprocessing column value is "No" do not undergo text-processing techniques such as stemming or stopword removal. The testing and validation phase predicts the class labels of the test data with the four classification models built in the preceding stage, and the model evaluation stage compares and analyses their performance. Model performance is assessed with a confusion matrix to determine accuracy, sensitivity, and specificity [14]. The confusion matrix is depicted in Table 4, and the corresponding formulas are:

Accuracy = (TP + TN) / (TP + TN + FP + FN) x 100% (1)

Sensitivity = TP / (TP + FN) x 100% (2)

Specificity = TN / (TN + FP) x 100% (3)
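The hold-out split and the three confusion-matrix measures can be sketched as follows. The confusion-matrix counts in the usage example are hypothetical, chosen only to be consistent with the percentages later reported for the IndoBERT model.

```python
# Sketch of the 80/20 hold-out split and the three performance measures
# computed from confusion-matrix counts (TP, TN, FP, FN).
import random

def holdout_split(data, train_frac=0.8, seed=42):
    """Shuffle and split data into training and test portions."""
    items = data[:]
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity as percentages."""
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = 100 * tp / (tp + fn)   # recall on the positive class
    specificity = 100 * tn / (tn + fp)
    return accuracy, sensitivity, specificity

train_data, test_data = holdout_split(list(range(1000)))
print(len(train_data), len(test_data))     # 800 200
print(metrics(tp=96, tn=92, fp=8, fn=4))   # (94.0, 96.0, 92.0)
```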

Results and Discussion
The results obtained from processing the text data in the special-token addition phase are presented in Table 5.

Special token addition
The next stage executes an encoding procedure that transforms each token into an index; the outcomes are shown in Table 7. An attention mask is then generated to differentiate meaningful word tokens from insignificant padding; the results are presented in Table 8.
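The encoding and attention-mask steps can be illustrated with a toy vocabulary. The token-to-index mapping and maximum length below are assumptions for illustration; in practice the BERT or IndoBERT tokenizer supplies both the vocabulary and the mask.

```python
# Toy sketch of encoding tokens to indices and building the attention
# mask. The vocabulary and MAX_LEN are illustrative assumptions.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "demam": 10, "batuk": 11, "covid": 12}
MAX_LEN = 8

def encode(tokens, max_len=MAX_LEN):
    # Wrap the token sequence in [CLS] ... [SEP], map to indices,
    # then truncate or pad to a fixed length.
    ids = ([VOCAB["[CLS]"]]
           + [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
           + [VOCAB["[SEP]"]])
    ids = ids[:max_len] + [VOCAB["[PAD]"]] * (max_len - len(ids))
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [0 if i == VOCAB["[PAD]"] else 1 for i in ids]
    return ids, mask

ids, mask = encode(["demam", "batuk"])
print(ids)   # [2, 10, 11, 3, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```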
The prepared input is then fed into the BERT and IndoBERT networks, each consisting of a stack of 12 transformer encoder layers. Each encoder layer comprises two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. After passing through all encoders, an output vector is produced for each token; however, only the output vector of the [CLS] token is used as the input to the classifier.
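The role of the [CLS] output vector can be sketched with dummy numbers: only the vector at position 0 is passed through a linear layer and a softmax to obtain class probabilities. The encoder outputs, weights, and two-dimensional vector size below are illustrative assumptions, not the actual model dimensions (BERT's hidden size is 768).

```python
# Sketch of classifying from the [CLS] output vector only. All numeric
# values are dummies; real vectors come from the 12-layer encoder stack.
import math

def classify_from_cls(encoder_outputs, weights, bias):
    # encoder_outputs holds one vector per token; position 0 is [CLS].
    cls_vec = encoder_outputs[0]
    # Linear layer: one logit per class (positive / negative).
    logits = [sum(w_i * x for w_i, x in zip(w, cls_vec)) + b
              for w, b in zip(weights, bias)]
    # Softmax over the class logits.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

outputs = [[0.5, -0.2], [0.1, 0.9], [0.3, 0.3]]  # 3 tokens, 2-dim vectors
probs = classify_from_cls(outputs,
                          weights=[[1.0, 0.0], [0.0, 1.0]],
                          bias=[0.0, 0.0])
print(probs)  # probabilities for the two classes, summing to 1
```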

The dataset, transformed into the resulting input-vector form, is then split into training and test data, after which the learning, testing, and validation phases proceed. The performance of the classification models can be observed in Table 9.

Conclusion
The findings of this investigation demonstrate that the IndoBERT-based approach produced the best model for discerning self-reported COVID-19 status messages, compared with the BERT-based classification model. The analysis also indicates that preprocessing with stemming and stopword removal can improve the performance of identifying self-reported COVID-19 status messages.
The self-report COVID-19 status message identification model based on IndoBERT demonstrated the highest performance, with an accuracy of 94%, a specificity of 92%, and a sensitivity of 96%. The BERT-based model achieved an accuracy of 82%, a specificity of 89.11%, and a sensitivity of 74.75%. An unpaired t-test on the test results yielded a two-tailed P value of 0.0488, indicating a statistically significant difference; thus, the IndoBERT-based model performs significantly better. The impact of text preprocessing on model performance, in contrast, was not significant: a paired t-test yielded a two-tailed P value of 0.0528.
Future work will evaluate additional BERT-derived methods, including ALBERT, RoBERTa, and DistilBERT, with the objective of constructing a model that effectively discerns self-reported COVID-19 status in messages.

Acknowledgements
The computation time for the computer system in this study was furnished by the Data Science Lab, which operates within the Computer Science Department of the Faculty of Mathematics and Natural Sciences at Lambung Mangkurat University. This investigation received financial support from the Program Dosen Wajib Meneliti (PDWM) grant provided by PNBP Lambung Mangkurat University.
True Positive (TP) is an outcome where the model correctly predicts the positive class. True Negative (TN) is an outcome where the model correctly predicts the negative class. False Positive (FP) is an outcome where the model incorrectly predicts the positive class. False Negative (FN) is an outcome where the model incorrectly predicts the negative class. These values are used to calculate the accuracy, sensitivity, and specificity of the classification, as given in Equations (1)-(3).
Figure 2. Upon examining these results, it becomes apparent that model No. 4 outperforms the other models. Model No. 4 is a text-based model preprocessed with stemming and stopword removal.

Figure 3. Comparison of performance between the BERT and IndoBERT models.

Figure 3 juxtaposes the classification performance achieved by the different methods. The outcomes show that models built with the IndoBERT technique exhibit superior efficacy on both the accuracy and sensitivity metrics. Notably, the elevated sensitivity signifies the model's strong ability to accurately discern self-reported messages pertaining to COVID-19 status.

Figure 4. Effect of preprocessing on the classification models' performance.

Figure 4 compares the impact of preprocessing on the efficacy of the classification models. The comparison reveals that preprocessing steps, such as stemming and stopword elimination, have the potential to improve classification performance.

Table 1: Positive messages

Negative-category messages are messages that do not pertain to positive COVID-19 cases. The total count of negative messages is 500. Examples of negative-category messages are shown in Table 2.

Table 2: Negative messages

Table 3 :
List of Classification Models

Table 6 :
Result of special token addition

Table 7 :
Result of encoding step

Table 8 :
Result of attention mask

Table 9 :
Performance of classification models