Bayesian estimation-based sentiment word embedding model for sentiment analysis

Sentiment word embedding has been extensively studied and used in sentiment analysis tasks. However, most existing models fail to differentiate high-frequency and low-frequency words. Accordingly, the sentiment information of low-frequency words is insufficiently captured, resulting in inaccurate sentiment word embeddings and degraded overall sentiment analysis performance. A Bayesian estimation-based sentiment word embedding (BESWE) model, which aims to precisely extract the sentiment information of low-frequency words, is proposed. In the model, a Bayesian estimator is constructed based on the co-occurrence probabilities and sentiment probabilities of words, and a novel loss function is defined for sentiment word embedding learning. Experimental results on sentiment lexicons and the Movie Review dataset show that BESWE outperforms many state-of-the-art methods, for example, C&W, CBOW, GloVe, SE-HyRank and DLJT1, in sentiment analysis tasks, which demonstrates that Bayesian estimation can effectively capture the sentiment information of low-frequency words and integrate it into the word embedding through the loss function. In addition, replacing the embeddings of low-frequency words in the state-of-the-art methods with BESWE significantly improves the performance of those methods in sentiment analysis tasks.

whereas the latter uses the global word co-occurrence counts to derive word embeddings. As such, word embedding models are able to capture context information and map semantically similar words into neighbouring points within the word embedding space [17]. Nevertheless, considering that semantically similar words may have opposite sentiment polarities [18], the widely used word embedding models struggle to handle sentiment information. According to Yu et al. [19], among the top 10 semantically similar words to each target word, about 30% appear to have a sentiment polarity opposite to that of the target word. For this reason, research is still ongoing to develop sentiment word embedding algorithms that can appropriately tackle both semantic and sentiment information.
Recently, sentiment word embedding models have been put forward to tackle sentiment information. Similar to traditional word embedding models, the recently developed sentiment word embedding models can also be classified into prediction-based models and count-based models. On the one hand, in the prediction-based models [18, 35-37], word sentiment is taken as local information for model learning. That is, by using neural networks or predicting functions, the sentiment word embedding can be learnt and the sentiment polarities can be classified. On the other hand, for the count-based models [20, 21], the sentiment word embedding is derived from the word-sentiment matrices of labelled corpora, in which case the sentiment information is used as global statistics.
Despite these new models, the application of sentiment word embedding models is still limited. Current state-of-the-art models fail to dedicatedly identify and process the low-frequency words in texts, which results in the loss of low-frequency but significant information [22]. According to Zipf's law [23], a large proportion of the words in any corpus are low-frequency words [24]. For instance, in the Stanford Sentiment Treebank (SST) corpus [25], words with occurrence counts lower than 5 account for 72.35% of all words, and those lower than 10 account for 84.06%. In addition, because of the small number of low-frequency word samples, the learning process is prone to over-fitting. Notably, low-frequency words carry not only sentiment information but also noise that might affect model learning. Instead of analyzing the low-frequency words separately, many current sentiment word embedding models directly utilize the word information for sentiment word embedding learning. For example, in the labeled singular value decomposition (LSVD) model, the word-context matrix and the word-sentiment matrix are stitched together to obtain the sentiment word embedding through singular value decomposition. Likewise, the LGloVe model and the DLJT1 model use both matrices in the loss function of a least squares algorithm for model learning. In this way, the sentiment word embedding models fail to extract the precise sentiment of low-frequency words in the word representations. This outstanding issue thus results in low accuracy of sentiment polarity classification.
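The skew described by Zipf's law is easy to verify on any tokenized corpus. The sketch below counts the share of vocabulary items falling under a frequency threshold; the toy corpus is made up for illustration, not drawn from SST:

```python
from collections import Counter

def low_frequency_share(tokens, threshold):
    """Fraction of vocabulary entries whose corpus frequency is below threshold."""
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)

# Toy corpus: a few frequent words plus many words that occur only once.
corpus = ["good"] * 50 + ["bad"] * 30 + ["movie"] * 20 + \
         [f"rare{i}" for i in range(20)]

# 20 of the 23 vocabulary entries occur fewer than 5 times.
print(low_frequency_share(corpus, 5))
```

On real corpora the same computation reproduces the heavy tail the paper reports for SST.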
For the task of sentiment analysis, we propose a novel method for sentiment word embedding learning. We design and deploy a Bayesian estimation-based sentiment word embedding (BESWE) model. Inspired by GloVe, we introduce the sentiment probability and derive a corresponding loss function. Accordingly, the learnt word embedding is able to convey both context information and sentiment information. Seeing that the co-occurrence probability in GloVe is computed via maximum likelihood estimation, which has deficiencies in processing small samples, we use the Bayesian estimation method to extract the sentiment information from the word-sentiment matrices for model learning. By introducing prior knowledge, Bayesian estimation can overcome the defect of insufficient sentiment information [26] and increase the robustness of the model [27]. Owing to its distinctive structure, our model is especially accurate in capturing the sentiment information of low-frequency words. On the one hand, we use a Bayesian estimator to compute the co-occurrence probabilities and the sentiment probabilities based on the word-context matrices and word-sentiment matrices from the corpus. On the other hand, we learn the sentiment word embedding in a novel way by dedicatedly constructing a loss function. We compare our method against other state-of-the-art models using identical experimental setups. In these configurations, the proposed model obtains strong performance on various natural language processing (NLP) tasks and is considerably better than classical approaches in low-frequency word sentiment identification.
The major contributions of this work are threefold and can be summarized as follows: (1) Based on the Bayesian estimation principle, our sentiment probability computation method extracts the sentiment information of low-frequency words from the word-sentiment matrix. (2) Our BESWE model is developed for the sentiment analysis of low-frequency words; it integrates sentiment information with word embeddings and achieves a higher accuracy in sentiment analysis tasks involving low-frequency words and low-frequency sentences. (3) The conceptual framework of BESWE can be applied to other word embedding models. By capturing the sentiment of low-frequency words, the sentiment polarity classification accuracy of current models can be largely improved by integrating BESWE.
The rest of this study is organized as follows. Section 2 presents the prerequisites needed for understanding the proposed model. Section 3 describes the proposed sentiment word embedding model. Section 4 presents the experimental results. Related work is reviewed in Section 5. Conclusions are drawn in Section 6.

| GloVe
The GloVe model is a word-embedding method that combines evidence from the local context and the global counts [28]. Typically, the method involves three distinguished words i, j and k, where i and j are target words and k is a context word. Let x be the matrix of word-word co-occurrence counts, and define the element x_ik as the number of times word k appears in the context of word i. Correspondingly, x_i = Σ_k x_ik indicates the total number of occurrences of any word within the context of word i. Therefore, the co-occurrence probability of k in the context of i is:

P_ik = x_ik / x_i.    (1)

Then, P_ik/P_jk denotes the relation of i to k relative to that of j to k. As long as k has similar relations to i and j, that is, both relevant or both irrelevant, the ratio P_ik/P_jk will be close to 1. The information within the ratio of co-occurrence probabilities can be formulated as:

F((w_i − w_j)^T w̃_k) = P_ik / P_jk,    (2)

where w ∈ R^n refers to the target word embedding and w̃ ∈ R^n to the context embedding.
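As a concrete illustration, the co-occurrence probabilities and their ratios can be computed from a toy count matrix (the counts below are made up for illustration):

```python
import numpy as np

# Toy word-word co-occurrence counts x[i, k]: rows are target words,
# columns are context words.
x = np.array([[10.0, 2.0, 8.0],    # target word i
              [ 9.0, 1.0, 2.0]])   # target word j

x_i = x.sum(axis=1, keepdims=True)   # total context counts per target word
P = x / x_i                          # co-occurrence probabilities P_ik

ratio = P[0] / P[1]                  # P_ik / P_jk for every context word k
print(ratio)  # entries near 1 mean k relates to i and j similarly
```

Here the first context word yields a ratio of 2/3, close to 1, while the third yields 2.4, signalling that it is far more characteristic of target word i.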

| Parameter estimation principle
Here, we show that the co-occurrence probability P_ik can be derived from maximum likelihood estimation. For each target word i, x_i independent random trials are conducted to extract a context word [29]. Each trial has V possible outcomes. The number of occurrences of the k-th outcome and its probability are denoted by x_ik and P_ik, respectively.
If the random variable X_i = (X_i1, X_i2, …, X_iV) stands for the occurrence counts of all possible outcomes, in which X_ik is the count of the k-th one, then X_i obeys the multinomial distribution:

P(X_i1 = x_i1, …, X_iV = x_iV) = (x_i! / (x_i1! x_i2! ⋯ x_iV!)) ∏_k P_ik^{x_ik}.    (3)

Hence, a log-likelihood function can be constructed:

ln L(P_i1, P_i2, …, P_iV) = Σ_k x_ik ln P_ik + const.    (4)

Maximizing the log-likelihood in Equation (4) can be viewed as an equality-constrained optimization problem:

max Σ_k x_ik ln P_ik  subject to  Σ_k P_ik = 1.    (5)

Accordingly, the corresponding Lagrangian function can be formulated as:

L(P_i1, …, P_iV, λ) = Σ_k x_ik ln P_ik + λ(1 − Σ_k P_ik).    (6)

Setting the partial derivative with respect to P_ik to zero gives x_ik / P_ik = λ, and substituting into the constraint yields λ = Σ_k x_ik = x_i. The estimate of P_ik is therefore:

P̂_ik = x_ik / x_i.    (7)

Hence, the co-occurrence probability in GloVe is equivalent to the maximum likelihood estimate of the co-occurrence probability. Theoretically, maximum likelihood estimation is a statistical method suited to abundant samples. In contrast, Bayesian estimation is more effective for small samples [26].
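The closed form P̂_ik = x_ik / x_i can be checked numerically: the empirical frequencies score at least as high a multinomial log-likelihood as any other distribution on the probability simplex. A quick sketch:

```python
import numpy as np

def log_likelihood(p, counts):
    """Multinomial log-likelihood, up to the constant multinomial coefficient."""
    return float(np.sum(counts * np.log(p)))

counts = np.array([7.0, 2.0, 1.0])    # observed counts x_ik for one target word
p_mle = counts / counts.sum()         # closed-form MLE: P_ik = x_ik / x_i

# No randomly drawn distribution on the simplex beats the empirical frequencies.
rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert log_likelihood(p_mle, counts) >= log_likelihood(q, counts)

print(p_mle)  # [0.7 0.2 0.1]
```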

| METHODOLOGY
This section introduces the architecture of our BESWE model and its working principle.

Notation:
- For i ∈ {1, 2, …, V}, with V the number of words, we define x as the word-context matrix, with x_ik and P_ik, respectively, the occurrence count and the occurrence probability of word k in the context of word i; x_i = Σ_k x_ik stands for the total number of occurrences of any word within the context of word i.
- We define t as the word-sentiment matrix, with t_i1 the number of positive texts including word i and t_i0 the number of negative texts including word i; t_i = t_i1 + t_i0. We define B_i as the probability of word i being positive and 1 − B_i as the probability of word i being negative.
- w_i stands for the embedding of word i as a target word.
- w̃_i stands for the embedding of word i as a context word.
- s_i stands for the bias embedding of word i.

| Model architecture
The architecture of the BESWE model is shown in Figure 1. Our sentiment word embedding model is developed on the foundation of GloVe. Since the feasibility of maximum likelihood estimation for parameter calculation in GloVe has been demonstrated, we use the co-occurrence probability for word context learning and the sentiment probability for word sentiment learning. By traversing the corpus with sentiment labels, the co-occurrence counts x_i and the sentiment counts t_i can be obtained. To deal with low-frequency words, the co-occurrence counts are sent to a Bayesian estimator for co-occurrence probability computation. Likewise, the sentiment probability is obtained in the same manner from the sentiment counts of the word. At this stage, the estimates of ln P_ik and ln B_i are obtained, denoted c_ik and e_i, respectively. Based on these two outcomes, a loss function is constructed for sentiment word embedding learning, as detailed in the 'Loss function' subsection below. The loss function is minimized via the AdaGrad optimizer [30]. Along these processes, w_i, w̃_i and s_i are learnt, where w_i stands for the sentiment word embedding of the BESWE model. More details of the Bayesian estimator and the loss function for sentiment word embedding learning are described as follows.
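AdaGrad keeps a running sum of squared gradients and scales each coordinate's step by its inverse square root, so frequently updated coordinates receive smaller steps. A minimal sketch on a stand-in quadratic objective (not the BESWE loss itself), using the paper's learning rate of 0.05:

```python
import numpy as np

def adagrad_step(param, grad, cache, lr=0.05, eps=1e-8):
    """One AdaGrad update: per-coordinate step sizes decay with accumulated
    squared gradients."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Minimize f(w) = ||w - target||^2 as a stand-in for the embedding loss.
target = np.array([0.3, -0.7])
w = np.zeros(2)
cache = np.zeros(2)
for _ in range(5000):
    grad = 2 * (w - target)
    w, cache = adagrad_step(w, grad, cache)

print(np.round(w, 3))  # converges to the target [0.3, -0.7]
```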

| Bayesian estimator
Based on the working principle of the proposed model, the co-occurrence probability and the sentiment probability are calculated by the Bayesian estimator. Distinctively, the estimation of the co-occurrence probability comes from the existing D-GloVe model [16]. According to the parameter distribution assumption in Section 2.2, P_i is assumed to obey a Dirichlet prior distribution with parameter α, as illustrated in Figure 2. Thus, the Bayesian estimation of ln P_ik is:

c_ik = ln((x_ik + α_k) / (x_i + Σ_v α_v))    (14)

and

α_k = λ_1 · n_k / Σ_v n_v,    (15)

where n_k is the occurrence count of word k in the learning samples, and λ_1 is the regulatory factor.
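Under the usual Dirichlet-multinomial reading of this estimator, the prior pseudo-counts are proportional to the corpus-wide frequencies n_k and sum to λ_1; the posterior-mean estimate then smooths each row of the count matrix towards the unigram distribution. The parameterization in this sketch is our assumption of the estimator's shape, not a verbatim reproduction of D-GloVe:

```python
import numpy as np

def bayes_log_cooccurrence(x_row, n, lam1):
    """Posterior-mean estimate of ln P_ik under a Dirichlet prior whose
    pseudo-counts are tied to corpus-wide word frequencies n_k
    (assumed parameterization)."""
    prior = lam1 * n / n.sum()                 # pseudo-counts from the prior
    return np.log((x_row + prior) / (x_row.sum() + lam1))

x_row = np.array([0.0, 1.0, 3.0])   # counts for a low-frequency target word
n = np.array([100.0, 50.0, 50.0])   # corpus frequencies of the context words

c = bayes_log_cooccurrence(x_row, n, lam1=0.5)
print(np.exp(c))  # smoothed probabilities: no zero entries, still sum to 1
```

Note that the zero count no longer yields ln 0, which is exactly what makes the estimate usable for low-frequency words.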

F I G U R E 1 Model architecture
In addition, we can also compute the Bayesian estimate of ln B_i. Bernoulli experiments are applied to extract the sentiment polarity of the target word i for t_i trials, each outcome being either positive or negative. As presented above, B_i and t_i1 represent the probability and the occurrence count of word i being positive, while 1 − B_i and t_i0 indicate those of it being negative. If T_i = (T_i1, T_i0) denotes the occurrence counts of both outcomes, the random variable T_i follows the binomial distribution, that is,

P(T_i1 = t_i1, T_i0 = t_i0) = (t_i! / (t_i1! t_i0!)) B_i^{t_i1} (1 − B_i)^{t_i0}.    (16)

The distribution of the aforementioned parameters is illustrated in Figure 3.
Therefore, the Bayesian estimation of ln B_i is:

e_i = ln((t_i1 + β_1) / (t_i + λ_2))    (18)

and

β_k = λ_2 · m_k / (m_0 + m_1),    (19)

where m_k stands for the number of texts with sentiment label k, and λ_2 is the regulatory factor for word sentiment learning.
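By analogy with the co-occurrence case, the sentiment counts of a word can be smoothed towards the corpus-level label proportions, with λ_2 controlling the prior strength. Again, this parameterization is our assumed reading of the estimator, not the paper's verbatim formula:

```python
import numpy as np

def bayes_log_sentiment(t_i1, t_i0, m1, m0, lam2):
    """Posterior-mean estimate of ln B_i: positive/negative counts of word i
    smoothed towards the corpus-level label proportions (assumed form)."""
    prior_pos = lam2 * m1 / (m1 + m0)
    return np.log((t_i1 + prior_pos) / (t_i1 + t_i0 + lam2))

# A word seen once, in a positive text, in a corpus with balanced labels.
e_i = bayes_log_sentiment(t_i1=1, t_i0=0, m1=5000, m0=5000, lam2=1.0)
print(np.exp(e_i))  # 0.75: pulled from 1.0 towards the prior 0.5
```

For such a singleton word, the raw estimate B_i = 1 would be far too confident; the prior tempers it, which is the point of the Bayesian treatment of low-frequency words.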

| Loss function
Considering the semantic information characterization in GloVe, we now take the ratio of sentiment probabilities to address the relationship of word sentiment. More details about the words' sentiment relation and the ratio are presented in Appendix 7.1. For words i and j, the sentiment relation is expressed as B_i/B_j. The objective function is established as:

F(w_i^T s_i − w_j^T s_j) = B_i / B_j,    (22)

where w_j and s_j stand for the word embedding and bias embedding of word j, respectively. Assuming that F is a homomorphism between the groups (R, +) and (R_{>0}, ×), we incorporate the sentiment information into the word embeddings. Based on Equations (2) and (22), we get:

F(w_i^T s_i) / F(w_j^T s_j) = B_i / B_j.    (23)

Considering the properties of group homomorphism, we transform the above equation according to the commutative law of addition into:

w_i^T s_i = ln B_i.    (24)

In line with the basic theory of GloVe, the loss function of BESWE can be constructed in the same manner:

J = Σ_{i,k} f(x_ik) (w_i^T w̃_k − c_ik)^2 + Σ_i (w_i^T s_i − e_i)^2,    (25)

where c_ik and e_i are derived from Equations (14) and (18), respectively.
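A GloVe-style loss of this kind, with a weighted semantic term fitting w_i · w̃_k to c_ik and a sentiment term fitting w_i · s_i to e_i, can be sketched as follows; the exact weighting scheme and the absence of extra scalar bias terms are assumptions on our part, not the paper's verbatim formulation:

```python
import numpy as np

def beswe_loss(W, W_ctx, S, C, E, weights):
    """Sketch of an assumed GloVe-style BESWE loss: a weighted semantic term
    fitting w_i . w~_k to c_ik (Bayesian estimate of ln P_ik) plus a
    sentiment term fitting w_i . s_i to e_i (estimate of ln B_i)."""
    semantic = weights * (W @ W_ctx.T - C) ** 2       # context-fit residuals
    sentiment = (np.sum(W * S, axis=1) - E) ** 2      # sentiment-fit residuals
    return float(semantic.sum() + sentiment.sum())

rng = np.random.default_rng(0)
V, n = 4, 3                                   # tiny vocabulary, 3-d embeddings
W, W_ctx, S = (rng.normal(size=(V, n)) for _ in range(3))
C, E = rng.normal(size=(V, V)), rng.normal(size=V)
weights = np.ones((V, V))

print(beswe_loss(W, W_ctx, S, C, E, weights))  # scalar objective for AdaGrad
```

The loss reaches zero exactly when every inner product matches its Bayesian target, which is the fixed point the AdaGrad optimization drives towards.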

| EXPERIMENT
In this section, the working performance of the proposed BESWE model is evaluated. Several state-of-the-art word embedding models, along with the learning of specific word representations, are taken for comparison. To this end, a word similarity task is carried out first. To evaluate the sentiment embeddings, word- and sentence-level sentiment analysis using the different models is then considered. The sentiment analysis tasks are further divided into three subtasks, which verify the efficiency of basic sentiment classification for all words, low-frequency word sentiment identification and integration with other baseline models. The outline of the experiments is exhibited in Figure 4.

Baseline models: We compare the effectiveness of the proposed model with that of other widely used models. Specifically, the word embedding models C&W [10], CBOW [13] and GloVe [15], together with the sentiment embedding models SE-HyRank [18] and DLJT1 [21], are implemented. For all the models used in this work, the word representation dimension is 50 and the learning rate is set to 0.05. All associated parameters are fine-tuned to get better results.

Task 1, Word similarity measure: The capacity of the word embedding models is verified via standard word similarity tasks. The average performance on word similarity is obtained on the datasets EN-WS-353-ALL, EN-WS-353-SIM and SCWS. We first use the word embeddings to calculate the similarity scores of the word pairs. Together with the standard similarity scores provided by the datasets, we then compute the correlation coefficient, which serves as the evaluation metric for this task and is detailed elsewhere [31].

Task 2, Word-level sentiment analysis: Here, a support vector machine classifier is trained on the word vectors, with each word vector representing a specific sentiment.
The popular sentiment polarity lexicons NRC and MPQA [32] are taken as the ground truth of the word sentiments. The numbers of positive and negative words are 2301 and 4151 for MPQA, and 2231 and 3324 for NRC. N-fold cross validation with N = 5 and N = 10 is performed on all datasets. The working performance refers to the classification accuracy of the models. The word-level sentiment analysis is carried out via the following three subtasks.

| Experiment settings
Task 2.1, Basic word-level sentiment analysis: The usual task for evaluating performance is to capture the sentiment information of the words. Thus, we generate the word embeddings and apply them to the sentiment classification of the sentiment lexicons.

Task 2.2, Low-frequency word sentiment analysis: Since the Bayesian estimation principle is able to deal with low-frequency words, we pick the words with frequencies lower than 5 and lower than 10 from the SST corpus for investigation. In this way, the low-frequency word embeddings are collected for sentiment analysis.

Task 2.3, BESWE integration with other models: In order to improve the classification accuracy, we integrate BESWE with the baseline models. A specific word embedding set is developed, which contains low-frequency word embeddings from BESWE and non-low-frequency word embeddings from another baseline model. Sentiment analysis is then carried out using these word embeddings for model learning.

Task 3, Sentence-level sentiment analysis: For sentence-level sentiment analysis, the Movie Review dataset [33] is used, which contains 10,662 samples with a 1:1 proportion of each polarity. We use a convolutional neural network (CNN), namely text-CNN, with its online implementation [34]. The inputs of text-CNN are word embeddings. The samples are divided into training, validation and testing sets with a ratio of 6:2:2. Training is run for 200 epochs using the default settings. Based on the validation outcomes, we apply the optimal model to testing. The evaluation metrics are the classification accuracy and F1 score. As in the word-level sentiment analysis, there are three subtasks.
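The 6:2:2 shuffle-and-split over the 10,662 Movie Review samples can be sketched as follows (the seed is arbitrary):

```python
import random

def split_samples(samples, seed=42):
    """Shuffle and split into train/validation/test sets with a 6:2:2 ratio,
    as used for the Movie Review experiments."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_samples(range(10662))
print(len(train), len(val), len(test))  # 6397 2132 2133
```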
Task 3.1, Basic sentence-level sentiment analysis: As stated in Task 2.1, we take the sentences from Movie Review for sentiment analysis.

Task 3.2, Low-frequency sentence sentiment analysis: In this task, we select the sentences in which the proportion of low-frequency words exceeds 10% as the low-frequency sentence samples. These low-frequency sentences are then taken for sentiment analysis.

The Bayesian estimation principle is effective in tackling the large number of low-frequency words in the corpus, based on which more accurate sentiment word embeddings are accessible. Accordingly, one can see a considerable gap between our model and the baseline models.

Task 2.1, Basic word-level sentiment analysis: The word-level analysis is carried out on the dataset of single-word entries. For basic word-level sentiment analysis, our model proves to be a competitive alternative to classical word embedding models (Table 2). Furthermore, the BESWE model obtains the best outcome with 10-fold cross validation on the NRC dataset.

Task 2.2, Low-frequency word sentiment analysis: The BESWE model shows better accuracy in tackling low-frequency words, as shown in Table 3. For the frequencies lower than 5 (LF-5) and lower than 10 (LF-10), our model shows its stability in sentiment polarity classification compared to the baseline models. As mentioned in the result of Task 1, the low-frequency word processing ability contributes to establishing precise sentiment word embeddings. Likewise, a clear performance gap between BESWE and the other algorithms is observed on the NRC dataset.

Task 2.3, BESWE integration with other models: In line with the results of Task 2.2, we combine the proposed model with the baseline models, not only to show its superiority, but also to improve the word-level sentiment analysis outcomes of the baseline models. Since the BESWE algorithm is capable of processing low-frequency words, the low-frequency word embeddings (i.e. LF-5 and LF-10 from BESWE) are incorporated into the non-low-frequency word embeddings from the baselines. The outcomes of the integration method are reported in Table 4. The classification accuracy is significantly improved for most models. The maximum performance gain of 4.44% is observed for the integration of DLJT1 and BESWE against the basic DLJT1 on the MPQA dataset.

Task 3.1, Basic sentence-level sentiment analysis: From Table 5, we see that the BESWE model achieves better accuracy than any of the baseline methods.

Task 3.2, Low-frequency sentence sentiment analysis: In line with the outcome on low-frequency words, the sentiment analysis on low-frequency sentences exceeds the baselines in both evaluation settings. The highest classification accuracy of BESWE, based on word frequencies lower than 5 (LF-5), is 81.25%, as shown in Table 6. Meanwhile, the maximum performance gap of 9.75% is observed against DLJT1. Comparing Tables 5 and 6, the improvement in accuracy over the baselines in Table 6 is larger than that in Table 5. One possible explanation is that the low-frequency sentences in Task 3.2 contain more low-frequency words.
Since the sentiment word embeddings learnt via the Bayesian principle are considerably more informative, it is reasonable to expect better performance in low-frequency word analysis, as is the case. Besides, the accuracy on 'LF-5' in Table 6 exceeds that on 'LF-10', which further verifies the significance of our model.

Task 3.3, BESWE integration with other models: Similar to Task 2.3, since the working performance on low-frequency sentences is highlighted, we integrate the corresponding word embeddings to improve the classification accuracy of the baselines. The outcomes of the integration methods, which combine the baselines and the proposed BESWE, are reported in Table 7. These outcomes do not exceed those from the direct use of the proposed model. That is, for the sentiment analysis of low-frequency sentences, BESWE always obtains the best and most consistent results in the identification of sentiment polarity. Nevertheless, the application of BESWE can narrow the performance gap: for the C&W model, a 10% improvement is observed.

Effects of λ_1 and λ_2: The hyperparameters in the BESWE model include the regulatory factors λ_1 and λ_2 used to represent the semantic and sentiment information. In this experiment, to obtain the optimal settings, we vary the values of λ_1 and λ_2 within {1, 0.75, 0.5, 0.25, 0.1, 0.05, 0.02, 0.01} to learn the BESWE model. In this way, we obtain 64 different BESWE models.
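Enumerating the 8 × 8 grid of (λ_1, λ_2) settings is straightforward:

```python
from itertools import product

lambda_values = [1, 0.75, 0.5, 0.25, 0.1, 0.05, 0.02, 0.01]

# Every (lambda_1, lambda_2) pair defines one BESWE model to train.
grid = list(product(lambda_values, repeat=2))
print(len(grid))  # 64 candidate settings
```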
The results of the low-frequency sentence sentiment analysis under different hyperparameter settings are shown in Figures 5 and 6. The former indicates the accuracy for words with frequencies lower than 5, while the latter indicates that for frequencies lower than 10. The variation of λ_1 does not cause a significant difference, while λ_2 is negatively correlated with the accuracy. According to Figure 5, the highest accuracy of the sentence-level sentiment analysis is 81.25%, attained at λ_1 ∈ {0.05, 0.75} and λ_2 = 0.01. Likewise, in Figure 6, the optimal values of λ_1 and λ_2 are both 0.01, which lead to an accuracy of 80.50%.

To sum up, these experimental results clarify the effectiveness of the proposed sentiment word embedding. The BESWE model outperforms other state-of-the-art models in the word similarity measure. In the sentiment analysis at both word and sentence levels, our method shows comparable outcomes. Specifically, our model produces considerably better results than the baseline methods in capturing the sentiment information of both low-frequency words and low-frequency sentences. Moreover, by integrating the low-frequency word embeddings from BESWE into other models, the classification accuracies of the baseline models improve to a large extent.

| Word embeddings
As pointed out in the Introduction, both prediction-based and count-based word embedding models are applied to learn word embeddings [8]. Fundamentally, Bengio et al. [9] establish a neural network language model to predict target words using preceding contexts and thus learn the word embeddings. Following this theory, Collobert and Weston [10, 11] put forward a CNN to predict the target word based on not only preceding but also succeeding contexts. As cutting-edge prediction-based methods, the CBOW and skip-gram models [12, 13] have a simple single-layer architecture. Both of these models can efficiently compute word embeddings from large-scale datasets. Besides, researchers also focus on using the global word-context co-occurrence counts in the corpus for learning word embeddings, which form the basis of count-based methods. Deerwester et al. [14] first propose the latent semantic analysis model, exploiting word-document co-occurrence counts to learn word embeddings. As one of the most widespread models, GloVe [15] fits the word-word co-occurrence counts via a specific weighted least squares model. Consistent and competitive results can be obtained on sentiment analysis tasks.

| Sentiment word embeddings
Likewise, prediction-based and count-based sentiment word embedding models have been dedicatedly studied. In 2011, Maas et al. [35] apply a logistic regression as the predictor to learn sentiment word embeddings. In the same year, an approach based on recursive autoencoders [36] is designed to learn vector representations of phrases and full sentences, which exploits the vector representations at each node of the hierarchy and uses a softmax classifier for sentiment label prediction. Tang et al. [18] propose a hybrid ranking method for learning sentiment embeddings by regulating the traditional C&W model, which encodes sentiment information in the continuous representation of words. With the development of deep learning networks, Lan et al. [37] construct a CNN to detect semantic and sentiment information; the two kinds of information are then integrated to generate sentiment word vectors. In terms of count-based sentiment word embedding models, Li et al. [21] incorporate the sentiment counts into model learning by proposing a variety of count-based models (e.g. DLJT1) on the foundation of GloVe. Furthermore, LSVD and LGloVe [20] are developed as improvements of SVD and GloVe, respectively. In these models, the application of word-label counts facilitates the learning of sentiment word embeddings with label information. Different from these works, our work focuses on resolving the issue of low-frequency word sentiment analysis, a topic that currently receives little attention within the NLP domain. Our model aims to extract the sentiment information by obtaining more accurate sentiment word embeddings. Notably, our model is orthogonal to the aforementioned models: it outperforms the state-of-the-art models, while the adaptation of our approach into these methods also results in better working performance.

| CONCLUSIONS
In this work, a novel BESWE model is designed and deployed on sentiment analysis tasks. Aiming to obtain both semantic and sentiment information, a Bayesian estimator is developed to compute the co-occurrence probability and the sentiment probability. Furthermore, a loss function for sentiment word embedding learning is constructed. We test our model on a variety of tasks to evaluate its working performance. Experimental results indicate that the BESWE model is a comparable alternative to the state-of-the-art methods in word similarity identification and word- and sentence-level sentiment analysis. Specifically, our model outperforms the other methods in low-frequency word and low-frequency sentence sentiment polarity classification, demonstrating its efficacy. By integrating BESWE into the baselines, the classification accuracy can be improved considerably compared to the basic models.
This study offers a creative and practical method for capturing both semantic and sentiment information. Distinctively, our model shows its superiority in dealing with low-frequency words and thus achieves a higher accuracy in sentiment analysis.

F I G U R E 5 The sensitivity of λ_1 and λ_2 on the BESWE (LF-5) in low-frequency sentence sentiment analysis

F I G U R E 6 The sensitivity of λ_1 and λ_2 on the BESWE (LF-10) in low-frequency sentence sentiment analysis