Transfer Learning for Sentiment Classification Using Bidirectional Encoder Representations from Transformers (BERT) Model

Sentiment is currently one of the most emerging areas of research due to the large amount of web content coming from social networking websites. Sentiment analysis is a crucial process for recommending systems for most people. Generally, the purpose of sentiment analysis is to determine an author’s attitude toward a subject or the overall tone of a document. There is a huge collection of studies that make an effort to predict how useful online reviews will be and have produced conflicting results on the efficacy of different methodologies. Furthermore, many of the current solutions employ manual feature generation and conventional shallow learning methods, which restrict generalization. As a result, the goal of this research is to develop a general approach using transfer learning by applying the “BERT (Bidirectional Encoder Representations from Transformers)”-based model. The efficiency of BERT classification is then evaluated by comparing it with similar machine learning techniques. In the experimental evaluation, the proposed model demonstrated superior performance in terms of outstanding prediction and high accuracy compared to earlier research. Comparative tests conducted on positive and negative Yelp reviews reveal that fine-tuned BERT classification performs better than other approaches. In addition, it is observed that BERT classifiers using batch size and sequence length significantly affect classification performance.


Introduction
Sentiment analysis, or opinion mining, is the computational study of people's opinions, feelings, emotions, and attitudes toward entities such as products, services, questions, events, subjects and their attributes [1]. For example, sentiment analysis can track the mood of the public toward a particular entity in order to create actionable knowledge. In addition, this kind of knowledge may be useful in understanding, explaining, and predicting social phenomena. In the business domain, sentiment analysis plays a critical role in helping businesses improve their strategy and better understand customers' feedback on their products. In today's client-centric business culture, understanding the client is becoming more important. Because of the importance of sentiment analysis for business and society, it has spread from computer science to the sciences of management and social. Over the past few years, industrial activities related to sentiment analysis have also flourished: Many large businesses or organizers have built their own internal capabilities [2]. Businesses need to understand human emotions because consumers can now express themselves more freely than ever before. Products may actively listen to their customers by dynamically evaluating feedback from surveys and social media posts to create products and services that are tailored to their needs [3].The traditional sentiment analysis techniques mostly concentrate on characteristic engineering and machine learning approaches. The characteristics, comprising n-grams, bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and part-of-speech (POS) tags, are first extracted from the text. Subsequently, classification methods, such as naïve Bayes (NB), K-nearest neighbor (KNN) and support vector machine (SVM), are applied to these characteristics to classify the sentiment's polarity [3]. Recently, studies have focused on transfer learning. Transfer learning is a state-of-the-art deep learning method for natural language processing (NLP) challenges [4,5]. In 2018, BERT was introduced, "which stands for Bidirectional Encoder Representations from Transformers" model [4]. It is intended to jointly condition both left and right contexts in all layers in order to pre-train deep bidirectional representation from unlabeled text [4]. Based on factors such as the quantity of transformer layers, self-attention layers, word embedding, forms of fine-tuning, masking, parameters, and others, different BERT transfer learning model types exist [5]. To address this knowledge gap and deliver timely insight, this research attempts to exploit the applicability of BERT in sentiment analysis and to what extent it will improve performance as compared to traditional machine learning when employing proper preprocessing and suitable fine-tuning. We do this for text reviews that are positive as well as negative. We take advantage of the fact that the BERT tokenizer generates contextualized token embeddings instead of manually creating them. Using reviews taken from the Yelp dataset for training and testing, a BERT that has been fine-tuned for review classification is evaluated. To compare the effectiveness of BERT-based classifiers for the categorization of reviews, their performance is compared with that of machine learning techniques, such as KNN, NB, and SVM [6]. In addition, the proposed models are compared with previous studies of the same. Our work is motivated by trying to achieve better performance while keeping a simple model that permits automatic preprocessing as opposed to manual preprocessing. BERT provides an opportunity to attain that. We further aim for higher prediction accuracy via fine-tuning as compared to previous studies. The remaining parts of this paper are broken into the following sections. A short overview of existing literature is presented in Section 2. Section 3 of this article provides a description of the study's research methodology. In Section 4, the outcomes are shown and discussed. Section 5 of this article discusses the implications of this study. Finally, the study is concluded in Section 6.

Literature Review
The growth of social media has added to the importance of the web as a source of knowledge. It has been observed that the number of people who frequently use social networking is increasing. Web reviews are one sort of user-generated content that focuses on the individual perception of a product, service, event, or subject [7]. Many approaches based on various characteristics and machine learning techniques have been presented and evaluated using various datasets, such as Amazon, Yelp, and other resources. Hemalatha et al. [8] investigated sentiment analysis using Yelp review datasets. They compared machine learning techniques, such as NB, MNB, and SVC. They illustrated that the NB using TF-IDF had the highest accuracy of 79% among all the others. Govind et al. [9] evaluated various techniques on well-known datasets, such as Yelp in an attempt to identify the most efficient techniques for sentiment mining, including Unigrams, Bigrams, SVM, NB, and random forests (RF). When the author compares the RF method to traditional methods, they discover that sentiment mining is significantly improved. Liu [10] compared how well various deep learning and machine learning models performed at predicting user sentiment. The author discovered that less complicated models, such as logistic regression (LR) and SVM, are better at predicting sentiment than more sophisticated models, such as gradient boosting (GB), LSTM, and BERT. Liu [11] studied Yelp review datasets on machine learning and transfer learning models. The machine learning models include NB, LR, RF, and SVM. In addition, the transformer models BERT, DistiBERT, RoBERTa, and XLNet are applied [12,13]. The highest accuracy of 70% was achieved by XLNet. Durairaj and Chinnalagu [14] suggested a fine-tuned BERT model to predict customer sentiment by using customer reviews from various datasets. The proposed model's performance was compared with SVM, FastText, BiLSTM and hybrid FastText-BiLSTM models. The outcome of the experiment demonstrates that the BERT model outperforms other models. Alamoudi and Alghamdi [7] proposed machine learning, deep learning, and transfer learning-based models for the sentiment classification on the Yelp review dataset, and the ALBERT model achieved 98.30% accuracy [15]. Xu et al. [16] used BERT to evaluate online product reviews'. The BERT model is accurate in predicting reviews. Prottasha et al. [17] compared the performance of many modeling methods (Word2Vec, GloVe, FastText, and BERT) and found that a properly calibrated BERT is superior to the competition in many natural language processing (NLP) applications, especially in the area of sentiment analysis. Bilal et al. [18] suggested features for the reviewer's network strength and adapted them to estimate helpfulness. In order to predict the efficiency of reviews, a variety of algorithms based on shallow learning were applied to around forty different parameters connected to the evaluation, and the person who wrote it. Ge et al.'s [19] proposal for recommending reviews was to have the model predict how useful a review would be based on the ratings given to those reviews that had already received positive ratings. Mutinda et al. [20] suggested the LEBERT model, which employs an n-gram to segment the input text and a sentiment lexicon to determine whether segments of the text contain sentimental words. BERT then uses these detected sections' words to create a vector. The resulting word vector is fed into a fully connected layer in a convolutional neural network (CNN) to extract features. Overall, the proposed LeBERT model was 88.2% accurate when used for binary classification. Zhang et al. [21] presented a Sliced BI-GRU (bidirectional-gated recurrent unit) architecture that employs BERT embedding in conjunction with the multi-head self-attention mechanism. The BERT models' word vector representation first, which plays a role in the neural network's embedding layer, and then they divide the input sequence into equal-length chunks. Additional features are extracted with the help of Bi-sequence GRU. According to experimental results, this model achieves an impressive 74.37% accuracy in classification on the Yelp dataset. Başarslan and Kayaalp [22] employed SVM and NB techniques from machine learning as well as CNN, RNN, and LSTM techniques from deep learning to categorize the responses. BERT, Glove, Word2Vec, TF-IDF, and BOW are some of the word embedding techniques employed. The model built using BERT and LSTM proved to be the most effective of all the models tested. Out of all the text representations and word-embedding approaches, BERT proved to be the most effective when used in a model. Using the Yelp review dataset, machine learning achieved accuracy rates of 80-86% and deep learning models achieved accuracy rates of 81-89%. Benarab and Gui [23] suggested a CNN-enhanced transformer encoder to produce a more generalizable representation using convolutional layers, identify similarities between representations from all BERT layers, and to compute the average employing the multi-head attention method. On Yelp datasets, the suggested method obtains an accuracy of 82.23%. Better results are achieved when BERT is combined with Bi-LSTM, CNN, and RNN rather than Word2vec, as suggested by Bello et al. [24]. While Word2vec limits our capacity to understand the context in which a word is used, BERT takes the previous and following inputs into account. We discovered that sentiment analysis has not been adequately studied on the Yelp review dataset by reviewing the literature. There is no previous research comparing the performance of different types of prediction models, such as machine learning and transfer learning. As a result, this proposed investigation will address a research gap. The purpose of this paper is to compare the efficacy of different classification models.

Research Methodology
This section explains how to gather data, label them, split the dataset into training and testing data, fine-tune BERT, train machine learning techniques, and measure classifier performance to distinguish between positive and negative reviews.

Data Collection and Labeling
Yelp review datasets are employed in this study to identify positive and negative reviews. Yelp is an established platform for crowd-sourced reviews and ratings that was launched in 2004 [25]. Review texts for thousands of samples are included in it [26].
There are also plenty of data about businesses, reviews, and users in the Yelp Open Dataset [27,28]. The Yelp dataset has been shown to be useful for scientific, educational, and personal uses [11]. Reviews that receive four or more positive votes are considered positive, while reviews that receive none at all are considered negative. According to the existing literature, Ge et al. [19] and Bilal et al. [18] discarded all of the reviews, with the exception of those with a rating of 0 or 4, which respectively reflect negative and positive reviews. In total, 10,000 reviews are chosen from the used dataset, employing stratified sampling methods [18]. The resulting dataset consists of 5000 records each for positive and negative reviews. The total number of sentences is 10,000, with a total of 740,877 tokens. The average number of tokens per sentence is 75, and the positive-to-negative sentiment ratio is 1.0.

Dataset
In this research, a dataset of 10,000 Yelp reviews with an equal balance of positive and negative reviews is used. The stratified sample method is used to split the dataset into training and testing datasets. Training comprises 80% (8000 samples) of the Yelp dataset, and testing comprises the remaining 20% (2000 samples). Ten percent of BERT's training dataset samples was used for validation through each training cycle. The Yelp dataset used in this research is explained in detail in Table 1.
Training comprises 80% of the Yelp dataset, and testing comprises the remaining 20%. Testing involves 2000 samples, while training involves 8000 samples. Following that, the training dataset is divided into training (90% of it) with 7200 samples and testing (10% of it) with 800 samples. The BERT model consists of 12 transformer blocks, 12 self-attention heads, and 768 hidden sizes. In this research, we fine-tune BERT for the classification tasks using Google Colab. BERT classification models are built using all lengths of sequence, 64, 128, 256, and 320, with all batch sizes set to 32 for training as well as testing. In addition, the BERT classification model is constructed with a sequence of length 384 and a batch size of either 16 or 32, depending on whether it is being used for training as well as testing. Additionally, the BERT classification is built with a sequence of length 512 and a batch size of either 8 or 32 for training as well as testing. The researchers suggest 16 and 32 batch sizes for fine-tuning BERT [4]. Table 2 provides the BERT base model's hyper-parameters. Each batch inside an epoch requires a new set of parameters, which the algorithms must update. The proposed approach for the BERT model is depicted in Figure 1. In order to evaluate the efficacy of each training iteration, researchers employ a validation split to compute validation loss and accuracy. In this study, various classifiers are fine-tuned based on various sequence lengths.

Machine Learning Approaches
In this research, the efficiency of the fine-tuned BERT-based model cannot be determined without comparison to non-BERT models. In order to classify positive and negative reviews, text classifiers KNN, NB, and SVM are trained. Textual features are generated from text using the term frequency-inverse document frequency (TF-IDF), and then machine learning models (KNN, NB, and SVM) are trained. Table 3 provides the machine learning model hyper-parameters. Clean data, tokenization, stop word elimination, and lemmatizing techniques are used for preprocessing methods. On the basis of TF-IDF, word vectors are produced. Figure 2 shows the steps of the proposed methodology for machine learning methods.To ensure that the machine learning model is constructed effectively and efficiently, preprocessing focuses on removing irrelevant characteristics and extracting relevant ones from the text. The following procedures were used to preprocess Yelp review datasets: 1.
The term "clean data" refers to the practice of deleting extraneous characters from text, such as HTML tags and trailing letters after apostrophes (such as the s in it's). Additionally, take off the web address and any punctuation. Moreover, replace all nonalphanumeric characters with a single space. For instance, "123&%, Me, 2 2023" would become "123 Me 2 2023". These symbols may cause noise in the data because they do not have much meaning. Thus, symbols are eliminated from the Yelp reviews [29]. Such a process does not impact the semantics of the dataset.

2.
Tokenization is a method of breaking down a string of characters into individual words. A token's value is not dependent on any other tokens [29]. 3.
The stop words are eliminated because they do not considerably improve the clarity of the data. Eliminating stop words shifts the focus to meaningful words, which reduces the text's dimensionality and makes it easier to discern patterns and meanings.

Performance Evaluation
In order to predict positive reviews, a binary classification task is used, with positive reviews being labeled as true (1) and negative reviews being labeled as false (0). Both the BERT-and machine learning-based classifiers are tested on a testing dataset containing 2000 instances to determine how well they perform. After passing through the initial formatting procedure, the test data are transformed into formats suitable for use with BERT. A few preprocessing methods are applied to generate word vectors before testing KNN, NB, and SVM. The sequence lengths employed for BERT classifier evaluation are identical to those employed for fine-tuning the corresponding BERT-based model. As an example, the length of sequence 64 is used in both the fine-tuning and evaluation of a BERT model. The literature uses a variety of different classification models and evaluation metrics. As a result, the problem domain and the dataset's attributes, such as balance or imbalance, should be taken into account when choosing the right metrics. Accuracy, precision, recall, and F1 score are used in this research as measurement methods to measure the classifier performance. Eventually, by comparing the fine-tuned BERT model to the KNN, NB, and SVM, its performance in classifying positive and negative reviews is evaluated.

Result and Discussion
The evaluation outcomes of various fine-tuned BERT models classified using various sequence lengths are presented and addressed in this part. To further examine and evaluate the efficacy of the various methodologies employed in this work, the evaluation outcomes for machine learning-based classifiers (KNN, NB, and SVM) are presented and addressed. The overall outcome of fine-tuning BERT classification models for the various sequence lengths used in this experiment is presented in Table 4 for training and validation. The results comprise training time, batch size, training loss, validation loss, training accuracy, and validation accuracy for four epochs for each length sequence. Losses during training and validation for the BERT classification model are depicted in Figure 3 for the following lengths of sequence: 64, 128, 256, 320, 384, and 512. The BERT-based model is initially trained and validated using the length of sequence 64 to ensure optimal performance.    Classifiers with different sequence lengths and batch sizes are compared. The training loss is shown to diminish steadily over the period of 4 iterations. In contrast, there is no pattern to the validation loss, which is totally random. Additionally, the validation accuracy tends to vary randomly. The highest validation accuracy (0.981) is achieved with a length of sequence 512 and a batch size 32. Conversely, the worst validation accuracy (0.935) is achieved with a length of sequence 64 and batch size 32.
The research shows that training and validation times grow exponentially with sequence length. Classifier training and validation times vary with sequence and batch sizes, as well as the automatically assigned GPU performance in Google Colab. The experiments were carried out in Python on a Google Colab Jupiter notebook with NVIDIA A100-SXM, 83.5 GB of memory, and a 40 GB GPU, as shown in Figure 4. The BERT classifier required 11.5 GB of memory and 38.5 GB of GPU to run with a batch size 32 and a maximum length of sequence 512. In addition to training several BERT models, the dataset of 8000 samples is used to train KNN, NB, and SVM classification models based on TF-IDF. Then, a test dataset consisting of 2000 reviews is used to evaluate the efficacy of KNN, NB, SVM, and different BERT classifiers. Table 5 Table 6 shows the results of all of the various classification models' ratings, including accuracy, F1 score, precision, and recall. These metrics were determined using the prediction results shown in Table 5. Classifiers were compared based on their ability to sort machine learning into classification; KNN was found to have the worst accuracy (0.855) and F1 score (0.853). The F1 score of 0.910 and accuracy of 0.911 shows that NB is of higher quality than KNN. With an accuracy of (0.946) and F1 score of (0.945), SVM performed better than KNN and NB. Based on the BERT classification model results, the classification algorithm with a length of sequence 64 and a batch size 32 had the worst accuracy (0.927) and F1 score (0.926). The classifier, on the other hand, achieved the highest accuracy (0.973) and F1 score (0.973) with a length of sequence 384 and using batch size 32. Another thing that is apparent is that a length of sequence 384 using batch size 16 and a length of sequence 512 using batch size 8 produce results that are competitively better than a sequence of length 256 using batch size 32, a sequence of length 320 using batch size 32, and a sequence of length 512 using batch size 32 with regard to performance. The findings demonstrate that the BERT-based models classification performance is significantly influenced by the sequence length and batch size used to optimize and evaluate the model. Figure 5 compares the accuracy of machine learning classifiers and BERT classifiers. According to the comparison, BERT classifiers perform better than machine learning classifiers. SVM outperforms other machine learning (KNN and NB) classifiers in terms of accuracy and F1 score. The accuracy and F1 score of SVM is higher than those of BERT with a sequence length of 64 and batch size 32. The best accuracy of 0.973% was obtained by the BERT classification model with batch size 32 and length of sequence 384 as shown in Figure 6. This is an improvement of 0.027 (2.7%) over the accuracy obtained by the SVM classifier (0.946).
After comparing the SVM and BERT prediction results in Table 5 for a length of sequence 384 and a batch size of 32, it is evident that the variation in accuracy is related to the TP, FP, FN, and TN predictions. SVM predicted 959 as TP, 68 as FP, 41 as FN, and 932 as TN, whereas BERT-384-32 predicted 965 as TP, 29 as FP, 24 as FN, and 966 as TN. Both the BERT features and sequence length with batch size used to optimize and evaluate the classifier are responsible for the effective performance of BERT-384-32. BERT features the ability to record word context in both directions without eliminating stop words, in contrast to machine learning, which omits the majority of words and does not account for word placement in context.   Table 7 displays, from the published literature of the Bilal and Almazroi models [6], various measures of categorization algorithm performance. It is worth noting that the BERT model achieved the best results compared to the others. BERT-3320-32 (0.707% accuracy) is followed by BERT-512-8 (0.697%), BERT-128-32 (0.694%), BERT-384-16 (0.683%), SVM (0.679%), BERT-64-32 (0.668%), and NB ( 0.596%) accuracy.  [6]. Table 8     The accuracy of the proposed model compared to other models in the literature is displayed in Figure 9. There is an overall enhancement to all classifier models. BERT-384-16 and NB are improved by 29%, followed by SVM, BERT-256-32, and BERT-512-8, which are improved by 27%. KNN, BERT-64-32, BERT-128-32 and BERT-320-32 have improved accuracy of 26%.  Table 9 shows the F1 score for classifier models for the proposed model and Bilal and Almazroi models [6]. It is notable that the BERT model outperformed the other models with the highest F1 score. BERT-384-32 (0.973% F1 score) is followed by BERT-384-16 (0.971%), BERT-512-8 (0.97%), BERT-320-32 (0.969%), BERT-512-32 (0.967%), BERT-256-32 (0.965%), BERT-128-32 (0.952%), SVM (0.945%), BERT-64-32 (0.926%), NB (0.91%), and KNN (0.853%), and then other models in [6]. Figure 10 shows the proposed model's F1 score in comparison to other models found in the literature. There is an overall enhancement in F1 scores for all classifier models. BERT-512-8, BERT-384-16, and NB are improved by 28%, followed by SVM, which is improved by 27%. KNN, BERT-128-32 and BERT-257-32 are improved by 26%. BERT-320-32 is improved by 25%, and BERT-64-32 is improved by 24%. Table 10 shows the accuracy of classifier models for the proposed model and previous studies on the Yelp dataset (binary label). It is notable that the proposed BERT-384-32 model outperformed the other models with the highest accuracy. The proposed BERT-384-32 model (97.3% accuracy) is followed by the proposed BERT-320-32 model (96.9%), and then LoBERT (BERT+ CNN) proposed by Mutinda et al. [20]. After that follows the BERT model proposed by Mutinda et al. [20] (84.00% accuracy) and the BERT-320-32 proposed by Bilal and Almazroi [6] (71.7% accuracy).   It is worth nothing that despite the better accuracy produced by SVM as compared to KNN and NB, SVM performs efficiently, which conforms to the finding in [31]. Unlike the finding in [31], our study shows that BERT outperforms SVM in accuracy. It is also important to note that in order to produce word vectors with the best possible quality for traditional machine learning classifiers, text data preprocessing is essential. Machine learning approaches also have the disadvantage of producing a large number of features, from which the most important ones must be selected automatically through methods for choosing features. In contrast, BERT employs the BERT tokenizer to transform content directly into a designated input structure without needing any pre-processing. On top of that, unlike machine learning methods, BERT uses a bidirectional transformer that considers both context directions.

Implications
Several theoretical and practical consequences derive from this study's results. This research helps researchers resolve conflicting results from prior research regarding the efficiency of BERT classification models to predict accuracy by comparing the BERT model's performance to that of machine learning approaches. Experimental data and the comparison of outcomes of the BERT classification model with machine learning techniques will also be useful for investigators in selecting the best strategy. According to the structure of the dataset used for this research, the negative reviews tend to be far longer on average than the positive ones. That will instruct you on how to create insightful evaluations using roughly 150 words. Just optimizing BERT using batch size and a specific sequence length was the focus of previous studies. In order to improve the BERT model, different batch sizes with different sequence lengths were employed in this research. These insights will allow researchers to better comprehend and evaluate the importance and influence of using various batch sizes with various sequence lengths on classifier predictive accuracy. In addition, the BERT classifier scored best when tuned using batch size 32 with a length of sequence 384, demonstrating to academics and practitioners that shortening reviews to a length of sequence 384 and using them as a prediction tool yields excellent results. However, it could change based on how the dataset is organized. Researchers will be better able to comprehend and enhance the research being conducted to predict the impact of reviews with the help of the generic technique provided in this investigation; no preprocessing was performed.

Conclusions
Online reviews are becoming more prevalent, exceeding people's capacity to organize them in a way that is useful for making decisions about purchases. The purpose of this research is to avoid the restrictions placed on the generalizability of the solution by earlier studies on manual features, as a result, estimating the effectiveness of reviews on the internet without depending on any manually generated features. This work uses BERT, a cutting-edge method for a variety of NLP challenges, together with machine learning classifiers KNN, NB, and SVM. Using a dataset of Yelp reviews, the effectiveness of different BERT classifiers that were given varying sequence lengths to train with was evaluated and compared with the efficiency of machine learning classifiers. The evaluation's findings demonstrated that, when it came to categorizing positive and negative reviews, tuned BERT classifiers outperformed machine learning methods. The proposed model showed improved accuracy on both accuracy and F1 scores compared with previous studies. Moreover, the maximum accuracy as well as the highest F1 score are obtained by the BERT classifier using batch size 32 with a sequence length of 384. In future work, the BERT transformer model will be fine-tuned and compared to other transformer models, such as the ALBERT, RoBERTa, and XLNet for binary classification. This study makes a contribution by investigating how accurately the BERT base model can predict the efficacy of reviews by varying the sequence lengths that it uses for its measurement. This will shed light on how reducing the length of the text (review) to an optimal length can enhance its predictive performance for researchers.  Data Availability Statement: The Yelp review dataset, which is publicly available, is utilized in this manuscript. https://huggingface.co/datasets/yelp_review_full and https://www.yelp.com/ dataset/download.