Naïve Bayes and TF-IDF for Sentiment Analysis of the Covid-19 Booster Vaccine

The booster vaccine polemic became a trending topic on Twitter and drew both support and opposition. Distribution of the booster vaccine began on January 12, 2022. The program was provided free of charge to the people of Indonesia to prevent the spread of Omicron, the new variant of Covid-19. The contribution of this study is an analysis of sentiment toward the Covid-19 booster vaccine using the Naïve Bayes and TF-IDF methods. We conducted sentiment analysis to determine whether each tweet was positive, negative, or neutral. TF-IDF determines how relevant the terms in a document are by weighting words. The research follows the CRISP-DM stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The cleaned dataset contains 1,557 tweets: 1,335 positive, 171 neutral, and 51 negative. Testing with a 60:40 data split yielded accuracy, precision, and recall values of 85.26%, 85%, and 100%, respectively. These results are 7.26%, 12%, and 20% higher than those of previous studies using the same data split.


Introduction
Twitter is one of the largest data-producing social media platforms on the Internet, and its data can be used for analysis. Twitter's flexibility lets users interact freely by writing, reading, or holding discussions with other users. Its advantage is that it makes it easier for people to express opinions without having to meet in person [1]. The growing amount of digital data on Twitter must be turned into useful information, and one way to do so is to analyze it. The field concerned with analyzing such text data is known as text mining. Twitter analysis is distinctive because it lets us explore people's opinions about a topic. Such an opinion is known as a sentiment. Sentiment analysis is the process of automatically understanding, extracting, and processing the text data contained in sentences that express sentiment. It aims to measure the proportion of opinions about a condition, product, company, institution, or other subject. The results of the analysis are positive, negative, or neutral sentiments [2]. These results support more accurate and effective decision-making [3].
The booster vaccine polemic has become a trending topic on Twitter and has drawn both support and opposition. It was widely discussed and became a matter of debate in Indonesian society. Booster vaccines were distributed starting January 12, 2022. The booster vaccination program is free of charge and is intended to help the Indonesian people prevent Omicron [4], a new variant of Covid-19. The vaccination targets people aged 18 and over, with the elderly as the main priority.
Hoax news on social media about harmful consequences of booster vaccines has hampered efforts to distribute them throughout Indonesia [5]. The tendency of Indonesians to react negatively right away and to share information when they fall ill after the first and second vaccine doses results in high negative sentiment [6]. On the other hand, a positive response from most Indonesians is still possible because the vaccine prevents the transmission of Covid-19 [7]. Previous studies, such as Olhang et al. [8] [12], discuss only vaccines in general. This study, in contrast, examines the reaction of the Indonesian people on Twitter to booster vaccines.

Research Methods
This research follows the CRISP-DM stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Business Understanding covers problem analysis. Data Understanding covers data collection. Data Preparation covers manual data elimination, preprocessing, and labeling. Modeling covers data modeling, word weighting with TF-IDF, and classification using Naïve Bayes. Evaluation tests the validity of the results using the confusion matrix. Deployment ensures that the application, written in the PHP programming language, runs properly. Figure 1 shows the stages of CRISP-DM [13]-[14]. This study aims to classify tweets into positive, negative, and neutral sentiments using the Naïve Bayes method.
The first stage is Business Understanding. This stage builds an in-depth understanding of the needs and the definition of the problems related to booster vaccines.
The second stage is Data Understanding, which covers data collection. The data used in this study comes from Twitter. The keyword used to collect tweets is "booster vaccine," and the tweets collected are in Indonesian. Tweets were collected from June 1, 2022, to June 15, 2022, yielding 4,073 tweets.
The third stage is Data Preparation. This stage prepares the data for processing and is therefore often called preprocessing [15]. It consists of several steps. Cleansing removes content that is unrelated to words, such as symbol characters, emoticons, and URL links [16]. Case Folding converts all text characters to lowercase [17]. Tokenizing divides sentences into parts or words; the words this process produces are called tokens [12]. Stop-word Removal removes common words that carry no meaning; for example, some verbs are eliminated, and adjectives and adverbs are added to the list of stop words [18]. The stop words in this study come from a dictionary on Kaggle.com [14]. Stemming finds the root of each word that remains after the previous filtering step and is applied to every word. It works by returning each word to its root form.
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol.
The fourth stage is Modeling, the core of data classification. The classifier used in this study is Naïve Bayes, a machine learning method based on the statistical and probabilistic calculations proposed by the British scientist Thomas Bayes. Naïve Bayes predicts future outcomes based on past experience. The data is divided into three classes: positive, negative, and neutral.
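The preprocessing steps described above (cleansing, case folding, tokenizing, stop-word removal, and stemming) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the study's implementation: `STOP_WORDS` is a tiny hypothetical list standing in for the Kaggle.com dictionary, and `toy_stem` is a naive suffix stripper standing in for a real Indonesian stemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the study uses a
# stop-word dictionary from Kaggle.com.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

# Hypothetical suffixes for a toy stemmer; a real Indonesian stemmer also
# handles prefixes and many exceptions.
SUFFIXES = ("nya", "kan", "an", "i")


def cleanse(text: str) -> str:
    """Remove URLs, then replace every non-letter character with a space."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)


def case_fold(text: str) -> str:
    """Convert all characters to lowercase."""
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split the text into tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> list[str]:
    """Run the five preprocessing steps in order."""
    tokens = tokenize(case_fold(cleanse(tweet)))
    return [toy_stem(t) for t in remove_stop_words(tokens)]


if __name__ == "__main__":
    print(preprocess("Vaksin BOOSTER ini gratis!!! https://t.co/abc"))
```

The steps are kept as separate functions so that any one of them, most likely the stemmer, can be swapped for a production-quality component without touching the others.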
The Naïve Bayes classifier is used to measure how accurately the data can be classified into positive, negative, and neutral sentiments. The classification is supported by the TF-IDF method, which determines how relevant the terms in a document are by weighting words [20]. The goal is to count the occurrences of words in each sentence used in the study. The TF-IDF procedure is summarized in Algorithm 1.
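As a rough illustration of TF-IDF word weighting, the sketch below computes one weight per term per document. It assumes a common variant (term frequency normalized by document length, multiplied by the base-10 logarithm of N/df); the paper's Algorithm 1 may use a different variant, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dictionary per tokenized document.

    Assumed variant: tf = count / document length, idf = log10(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log10(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        ["vaksin", "booster", "gratis"],
        ["vaksin", "booster", "sakit"],
        ["vaksin", "aman"],
    ]
    for doc_weights in tf_idf(docs):
        print(doc_weights)
```

In this variant, a term that occurs in every document (here "vaksin") receives a weight of zero, while a term confined to a single document receives the highest weight, which is the sense in which TF-IDF measures relevance.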
The fifth stage is Evaluation. This stage checks whether the results are appropriate. Evaluation is usually done by testing, and the tests are run on a specific application. Testing uses a 60:40 data split: 60% of the data is used as training data and 40% as test data. The 60:40 split was chosen because another study [9] obtained fairly good results with it, and we expected this split to improve the results. The training data is used to build a model that represents the whole dataset, and the test data serves as new data for comparison. The Confusion Matrix method is used in this stage to obtain the accuracy, precision, and recall values. The values obtained are an accuracy of 85.26%, a precision of 0.85, and a recall of 1. The probability of each term in the test data is calculated and compared for each class [21], as shown in equation (1):
$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |\mathrm{vocabulary}|} \tag{1}$$
where w_k is word k in all documents labeled as class j (positive, negative, or neutral sentiment), v_j is all words (vocabulary) in class j, n_k is the number of times the word appears in class j, n is the number of words in class j, and |vocabulary| is the total number of unique words in the documents.
The sixth stage is Deployment, which covers implementation. This research is implemented using the PHP programming language. The dataset consists of 1,335 positive, 51 negative, and 171 neutral sentiment data.
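The classification and evaluation steps can be sketched together as follows. The sketch assumes the usual add-one (Laplace) smoothed estimate (n_k + 1) / (n + |vocabulary|), which matches the variable definitions given for equation (1), and it computes precision and recall for a single class (here the positive class), which is an assumption because the text does not state how the three classes were combined. All function names are hypothetical, and the study's own application is written in PHP.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocabulary = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocabulary


def word_prob(word, label, word_counts, vocabulary):
    """Add-one smoothed estimate: (n_k + 1) / (n + |vocabulary|)."""
    n_k = word_counts[label][word]          # occurrences of the word in the class
    n = sum(word_counts[label].values())    # total words in the class
    return (n_k + 1) / (n + len(vocabulary))


def predict(tokens, model):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        for t in tokens:
            if t in vocabulary:  # assumption: skip words unseen in training
                score += math.log(word_prob(t, label, word_counts, vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(y_true, y_pred, positive="positive"):
    """Accuracy over all classes; precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    docs = [["vaksin", "aman"], ["booster", "gratis", "aman"],
            ["sakit", "demam"], ["vaksin", "sakit"]]
    labels = ["positive", "positive", "negative", "negative"]
    model = train(docs, labels)
    print(predict(["booster", "aman"], model))
    print(evaluate(["positive", "positive", "negative"],
                   ["positive", "positive", "positive"]))
```

Scores are summed in log space to avoid numerical underflow when many small probabilities are multiplied. The toy data in the usage example is invented for illustration and is unrelated to the study's dataset.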

Conclusion
This study analyzed sentiment toward the Covid-19 booster vaccine on Twitter using Naïve Bayes and TF-IDF. A total of 1,557 data were manually divided into training data and test data at a ratio of 60:40. The dataset consists of 1,335 positive, 51 negative, and 171 neutral sentiment data. The training data comprises 933 data and the test data 634 data. Testing the Naïve Bayes method with the 60:40 ratio yielded an accuracy of 85.26%, a precision of 85%, and a recall of 100%. These results improve on previous research that used the same 60:40 split, with increases of 7.26% in accuracy, 12% in precision, and 20% in recall. The analysis is still limited to quantitative analysis using Naïve Bayes and TF-IDF. Further research could complement it with qualitative critical analysis to make it more comprehensive.