Improving sentiment analysis accuracy with emoji embedding

Due to the diversity and variability of Chinese syntax and semantics, accurately identifying and distinguishing individual emotions from online texts is challenging. To overcome this limitation, we incorporate a new source of individual sentiment, emojis, which contain thousands of graphic symbols and are increasingly being used for expressing emotion in online conversations. We examined popular sentiment analysis algorithms, including rule-based and classification algorithms, to evaluate the impact of supplementing emojis as additional features to improve the algorithm performance. Emojis were also translated into corresponding sentiment words when constructing features for comparison with those directly generated from emoji label words. In addition, considering different functions of emojis in texts, we classified all posts in the dataset by their emoji usage and examined the changes in algorithm performance. We found that emojis are effective as expanding features for improving the accuracy of sentiment analysis algorithms, and the algorithm performance can be further increased by taking different emoji usages into consideration. In this study, we developed an improved emoji-embedding model based on Bi-LSTM (namely, CEmo-LSTM), which achieves the highest accuracy (around 0.95) when analyzing online Chinese texts. We applied the CEmo-LSTM algorithm to a large dataset collected from Weibo from December 1, 2019 to March 20, 2020 to understand the sentiment evolution of online users during the COVID-19 pandemic. We found that the pandemic remarkably impacted individual sentiments and caused more passive emotions (e.g., horror and sadness). Our novel emoji-embedding algorithm creatively combined emojis as well as emoji usage with the sentiment analysis model and can handle emotion mining tasks more effectively and efficiently.


Introduction
Sentiment analysis (SA) [ 1 , 2 ] aims to extract and identify the affective states or subjective opinions from texts and is generally considered to be a natural language processing (NLP) technique. A basic task in SA is classifying the polarity of a given text and determining whether the expressed emotion in the sentence (or document) is positive, negative, or neutral. With the advancement of machine learning and deep learning, different kinds of classification algorithms are widely used in SA tasks. Since users' posts on social networking platforms are generally short and concise, lexicon-based approaches to SA are also used frequently. However, although various advanced SA methods have been proposed, accurately identifying and classifying personal emotions from online texts is still challenging due to the diversity and variability of Chinese syntax and semantics. In addition, the rapid changes in internet slang further intensify the difficulty of understanding Chinese texts.
In recent years, users on social networks have become accustomed to utilizing a set of graphic symbols in online conversations to express their emotions. Mohammad et al. [20] constructed SVM classifiers with sparse indicator features, including n-grams, POS tags, punctuation, and emojis. Calefato et al. [21] and Ding et al. [22] took emoticons into account in their proposed SA techniques. All of them demonstrated the feasibility of leveraging these emotional cues to benefit SA. However, these studies mainly considered emojis as one feature and did not research the sentiment effects of emojis on the whole texts. Little attention has been given to SA models that incorporate the different emoji usages in texts.
In this study, we proposed an emoji-embedding architecture named CEmo-LSTM to improve the accuracy of sentiment identification and classification in SA tasks. We further evaluated the benefits of introducing emojis to the accuracy of SA in both the traditional rule-based and supervised learning algorithms. Additionally, the most effective approach for embedding emojis in SA algorithms was examined. We compared the performance of the CEmo-LSTM model with that of other mainstream SA models in different experimental settings. Finally, by collecting all posts and embedded emojis published by users on Weibo during the COVID-19 outbreak, we utilized CEmo-LSTM to analyze the sentiment evolution of online users and measured the impact of the COVID-19 pandemic on individual moods. To the best of our knowledge, this is the first study that comprehensively evaluates the effectiveness of introducing emoji usage into SA algorithms.

Data collection
Weibo is a popular Twitter-like social media platform in China, which provides a rich publicly available data source for opinion mining and SA. We collected all data from Weibo that were posted publicly by users located in Wuhan (the capital of the Hubei province in China), including microblog text, posting time, author ID, and gender, from December 1, 2019 to March 20, 2020. By comparing the sentiments in posts published by Wuhan users before and after the COVID-19 outbreak, we can analyze the sentiment evolution of online users and further explore the impact of COVID-19 on individual moods. Overall, 38,183,194 microblog posts from 2,239,472 unique users were collected. We found that emotion tokens (i.e., emoji characters) were commonly used in Weibo posts. There were 15,609,843 posts containing emoji symbols, accounting for 40.88% of the total posts. In addition, 1,279,828 users used emojis at least once, accounting for 57.15% of all unique users.
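The reported shares can be checked directly from the counts given above. The following is a minimal arithmetic sketch; the helper name `share` is introduced here for illustration only.

```python
# Counts reported in the data collection paragraph.
total_posts = 38_183_194
emoji_posts = 15_609_843
total_users = 2_239_472
emoji_users = 1_279_828


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


post_share = share(emoji_posts, total_posts)   # share of posts containing emojis
user_share = share(emoji_users, total_users)   # share of users who used emojis
print(post_share, user_share)
```

Both values agree with the percentages stated in the text (40.88% of posts and 57.15% of users).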

Annotation
Although there have been some annotated corpora on Chinese and English for SA [23, 24], they do not explicitly model the interaction between emojis and text. To fill in this gap, we manually annotated a Chinese microblog corpus. A total of 10 annotators (graduate students majoring in data analytics) were engaged to label the corpus, which consists of 10,000 randomly selected microblog posts. The sentiment polarities of the posts were manually classified as positive, negative, and neutral, denoted by 1, -1, and 0, respectively (Table 1). The annotators were asked to label each post by considering both the plain text and embedded emojis.
As there are several principal functions for which emojis are used (e.g., sentiment expression, sentiment enhancement, and sentiment modification) [25], the emoji usage of each post containing emojis was also annotated. Specifically, the emoji usage of each post was classified into three categories, strengthening, reversing (or revising), and uncertain, labelled by 1, -1, and 0, respectively, indicating whether the sentiment of the embedded emojis was consistent (1) or inconsistent (-1) with the sentiment of the text-only post (Table 2). The label 0 was used to denote when the effect of emojis in the post could not be confidently determined. We found that most emojis embedded in the posts were used to strengthen and clarify the sentiment of the original texts, accounting for approximately 73.6% of all posts with emojis included in the corpus. Finally, all 10,000 microblog posts were labelled with their sentiment polarities, of which 5499 posts containing emojis were also annotated with their emoji usages.
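One way to picture the two-level annotation scheme is as a small record type. The sketch below is illustrative only: the class name `AnnotatedPost`, its field names, and the validation rules are assumptions, not the corpus format actually used; only the label values (1, -1, 0) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Label values as described in the annotation scheme.
SENTIMENT_LABELS = {1: "positive", -1: "negative", 0: "neutral"}
EMOJI_USAGE_LABELS = {1: "strengthening", -1: "reversing", 0: "uncertain"}


@dataclass
class AnnotatedPost:
    """Hypothetical record for one annotated microblog post."""
    text: str
    sentiment: int                      # 1, -1, or 0
    emojis: List[str] = field(default_factory=list)
    emoji_usage: Optional[int] = None   # only set for posts containing emojis

    def __post_init__(self) -> None:
        if self.sentiment not in SENTIMENT_LABELS:
            raise ValueError(f"unknown sentiment label: {self.sentiment}")
        if self.emojis and self.emoji_usage not in EMOJI_USAGE_LABELS:
            raise ValueError("posts with emojis need an emoji usage label")
        if not self.emojis and self.emoji_usage is not None:
            raise ValueError("emoji usage is only defined for posts with emojis")


# Toy examples with made-up content and hypothetical emoji tag words.
with_emoji = AnnotatedPost("great day", 1, ["[laugh]"], 1)
plain = AnnotatedPost("meeting at noon", 0)
```

Keeping the emoji usage label optional mirrors the corpus, where only the 5499 posts containing emojis carry that second annotation.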

CEmo-LSTM model
In this study, we proposed a deep learning architecture, named the Chinese emoji-embedding LSTM model (CEmo-LSTM), to exploit the impact of emojis on sentiment analysis. Specifically, CEmo-LSTM introduces the emoji usage in online posts based on bidirectional long short-term memory (Bi-LSTM) [26]. Both plain texts and embedded emojis are used as input features, but before feature construction, the training corpus needs to be marked and filtered by different emoji usages. Since annotators can denote the emoji usage of each post when labeling the sentiment polarity of the training dataset, the workload for data annotation is not greatly increased.
As illustrated in Figure 1, our model includes the input sentence, word (emoji) representation, word embedding layer, Bi-LSTM layer, dropout layer, and a softmax layer. Given an input post $P$, the model first classifies the post according to whether there are any emojis embedded and evaluates the emoji usage of each post containing emojis. For posts containing emojis, both texts and emojis are input as features. Then, a microblog post can be described as $P = \{w_1, \dots, w_n, e_1, \dots, e_m\}$, where $w_i$ denotes the word token and $e_j$ denotes the emoji. Through the embedding layer, both $w_i$ and $e_j$ are converted to the vector representation, $x$, as the input of the deep learning model to predict the sentiment polarity of a post. A Bi-LSTM layer is built to capture the representation of a microblog post, and a dropout layer is added to prevent over-fitting and improve the generalizability of the model. Finally, a softmax activation function is used to calculate a probability distribution over the set of sentiment polarities $\{1, -1, 0\}$. Consequently, a list of labels of input posts is predicted according to the corresponding output of the softmax layer.

Experimental setting
To evaluate the performance of CEmo-LSTM, we have to prove the impact of emojis on the sentiment identification of texts and discover the best approach for exploiting the novel emotional clue (i.e., emojis) in online posts for SA tasks. Specifically, our goal is to answer three research questions:
RQ1: Does the supplementation of emojis promote the emotion recognition of texts? To answer this question, a rigorous comparison was conducted between posts with embedded emojis and plain-text posts. An emoji-embedded post was denoted as $P = \{w_1, \dots, w_n; E\}$, where $w_i$ denotes the word token and $E$ indicates the set of emoji tag words.
RQ2: Can the tag words of emojis be directly used when constructing features? We examined whether the vagueness and ambiguity of emoji tag words would affect the sentiment identification of SA algorithms. Before constructing features, all emojis were converted into corresponding sentiment words (e.g., Sad, Happy) instead of emoji tag words based on their meanings and sentiments, and we evaluated the changes in algorithm performance. Accordingly, an emoji-embedded post was denoted as $P = \{w_1, \dots, w_n; ES\}$, where $ES$ is the set of sentiment words translated from emojis.
RQ3: Does the classification of the training dataset on emoji usage improve the performance of SA algorithms? Corresponding to this question, an experiment was also conducted. We classified the emoji usage of all posts containing emojis to examine the impact of the introduction of emoji usage on SA algorithms. We found that, in most posts on Weibo, the emotions expressed by emojis were consistent with emotions of plain texts, and the main function of emojis was to clarify and enhance the sentiment of the sentence. Hence, strengthening posts in the corpora (labelled with 1 in the field emoji usage) were selected and used to train SA models. A post classified by emoji usage was described as $P = \{w_1, \dots, w_n; EU\}$, where $w_i$ denotes the word token and $EU$ stands for the set of emojis embedded.
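The three experimental settings differ only in how the token sequence of a post is assembled and which posts enter training. The sketch below illustrates this under stated assumptions: the function names, the setting strings, the dictionary-based post format, and the emoji-to-sentiment mapping are all hypothetical and are not taken from the study's implementation.

```python
from typing import Dict, List

# Hypothetical mapping from emoji tag words to sentiment words (RQ2 setting).
EMOJI_TO_SENTIMENT: Dict[str, str] = {"[laugh]": "Happy", "[tears]": "Sad"}


def build_tokens(words: List[str], emojis: List[str], setting: str) -> List[str]:
    """Assemble the feature tokens of one post for a given setting."""
    if setting == "text":        # plain text only
        return list(words)
    if setting == "text+E":      # text plus emoji tag words
        return list(words) + list(emojis)
    if setting == "text+ES":     # text plus sentiment words translated from emojis
        # Emojis without a mapping are kept as tag words (an assumption).
        return list(words) + [EMOJI_TO_SENTIMENT.get(e, e) for e in emojis]
    raise ValueError(f"unknown setting: {setting}")


def select_strengthening(posts: List[dict]) -> List[dict]:
    """Keep only posts whose emoji usage label is 1 (RQ3 setting)."""
    return [p for p in posts if p.get("emoji_usage") == 1]


words, emojis = ["exam", "finished"], ["[laugh]"]
tokens_e = build_tokens(words, emojis, "text+E")
tokens_es = build_tokens(words, emojis, "text+ES")
corpus = [
    {"id": 1, "emoji_usage": 1},
    {"id": 2, "emoji_usage": -1},
    {"id": 3, "emoji_usage": 0},
]
strengthening = select_strengthening(corpus)
```

Appending emoji tokens after the word tokens is a simplification; preserving each emoji's original position in the post would be an equally plausible design.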
In all experiments, text segmentation work was carried out with Jieba [27], a popular Chinese word segmentation package. We filtered out stop words, punctuation, and spaces in the posts to clean the data. For the CEmo-LSTM model, we set the dimensions of word and emoji embeddings as 200, and the dimensions of hidden and cell states in Bi-LSTM cells as 64. The dropout probability (the probability of randomly dropping units during training) in the dropout layer was set to 0.4. The "categorical_crossentropy" loss function was chosen, and the RMSprop [28] method was used for optimizing the objective functions. Since CEmo-LSTM is an emoji-embedding SA model based on emoji usage, in the first two experiments for RQ1 and RQ2, CEmo-LSTM (text) represents CEmo-LSTM's implementation in plain texts, CEmo-LSTM (text + E) represents the implementation with emoji tags embedded, and CEmo-LSTM (text + ES) denotes when emoji tags were replaced with sentiment words.
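The hyperparameters above can be collected into a compact model definition. The following is a minimal sketch, assuming a Keras-style API (suggested by the loss name, but not confirmed by the text); the function name, the `HYPERPARAMS` dictionary, the label-to-index mapping, and the use of `mask_zero` for padded sequences are assumptions made for illustration.

```python
from typing import List

# Values stated in the experimental setting.
HYPERPARAMS = {
    "embedding_dim": 200,   # word and emoji embedding size
    "lstm_units": 64,       # hidden/cell state size per direction
    "dropout": 0.4,
    "loss": "categorical_crossentropy",
    "num_classes": 3,       # sentiment polarities {1, -1, 0}
}

# Assumed mapping from sentiment labels to softmax output positions.
LABEL_TO_INDEX = {-1: 0, 0: 1, 1: 2}


def one_hot(label: int) -> List[float]:
    """Encode a sentiment label as a one-hot target vector."""
    vec = [0.0] * HYPERPARAMS["num_classes"]
    vec[LABEL_TO_INDEX[label]] = 1.0
    return vec


def build_cemo_lstm_sketch(vocab_size: int):
    """Build an embedding -> Bi-LSTM -> dropout -> softmax model.

    TensorFlow is imported lazily so the constants above remain usable
    without it. The vocabulary is assumed to cover words and emoji tokens.
    """
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=HYPERPARAMS["embedding_dim"],
            mask_zero=True,  # assumption: index 0 is reserved for padding
        ),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(HYPERPARAMS["lstm_units"])
        ),
        tf.keras.layers.Dropout(HYPERPARAMS["dropout"]),
        tf.keras.layers.Dense(HYPERPARAMS["num_classes"], activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(),
        loss=HYPERPARAMS["loss"],
        metrics=["accuracy"],
    )
    return model
```

Calling `build_cemo_lstm_sketch(vocab_size)` with the size of the combined word and emoji vocabulary returns a compiled model; learning rate, batch size, and number of epochs are not given in the text and are left at library defaults here.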

Baselines
To evaluate the effectiveness of the improved emoji-embedding SA model (CEmo-LSTM), we introduced several baseline models, including the state-of-the-art method, for comparison. Using the same experimental settings, we evaluated the performance of different algorithms. Both supervised and unsupervised learning methods were carried out.

Rule-based approach
In general, the implementation of rule-based SA relies on a specific sentiment lexicon. In this study, we constructed two lexicons: the traditional lexicon for sentiment words (sentiment lexicon, for short) and an emoji lexicon based on the sentiment of different emojis. Based on these two lexicons, we extracted all sentiment words and emojis contained in each post. By measuring the frequency and emotional intensity of sentiment words (or emojis), each post was assigned a sentiment score. If the score was greater than 0, the post was considered positive. Finally, the accuracy of the algorithm was validated by comparing the results with the manual annotations of posts.
(1) Sentiment lexicon. To construct the sentiment lexicon, we first integrated four popular Chinese sentiment dictionaries, including DU-TIR, C-LIWC, HowNet, and NTUSD [ 29 , 30 ]. Then, by supplementing popular sentiment words used on the internet [31] , we built a comprehensive sentiment lexicon, which is more suitable for SA on Weibo.
(2) Emoji lexicon. The popularity of different emojis is highly heterogeneous [32, 33]: in the Sina Weibo data used, the top 100 most popular emojis account for approximately 96% of all emojis used daily. We constructed an emoji lexicon (Table 3) based on the top 100 most frequently used emojis and classified them into three categories, positive, negative, and neutral, according to their official annotations and emotions expressed. Each emoji was also assigned a sentiment value, with positive emojis denoted from 1 to 5 and negative emojis denoted from -1 to -5, respectively. The absolute value represents the emotional intensity.
Table 3 Example of emoji lexicon (columns: Positive, Negative, Neutral).
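The rule-based scoring described above can be sketched as a sum of lexicon intensities. In the sketch below, the lexicon entries, their intensity values, and the function names are hypothetical toy data. The text only states that a score above 0 means positive; treating a score below 0 as negative and a score of 0 as neutral is an assumption.

```python
from typing import Dict, List

# Toy lexicons with made-up entries; real lexicons would be far larger.
SENTIMENT_LEXICON: Dict[str, int] = {"happy": 2, "sad": -2, "terrible": -3}
EMOJI_LEXICON: Dict[str, int] = {"[laugh]": 3, "[tears]": -3, "[doge]": 0}


def score_post(tokens: List[str], use_emojis: bool = True) -> int:
    """Sum the intensities of all sentiment words (and emojis) in a post.

    Repeated tokens count each time, so frequency contributes to the score.
    """
    score = sum(SENTIMENT_LEXICON.get(t, 0) for t in tokens)
    if use_emojis:
        score += sum(EMOJI_LEXICON.get(t, 0) for t in tokens)
    return score


def classify(score: int) -> int:
    """Map a score to a polarity label: 1, -1, or 0."""
    if score > 0:
        return 1
    if score < 0:   # assumption: symmetric rule for negative posts
        return -1
    return 0


post = ["today", "sad", "[laugh]", "[laugh]"]
label_with_emojis = classify(score_post(post, use_emojis=True))
label_text_only = classify(score_post(post, use_emojis=False))
```

In this toy post the emojis outweigh the single negative word, so the predicted polarity flips when the emoji lexicon is included, which illustrates why the two lexicons are evaluated separately.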

Classification algorithms
A total of six mainstream classification algorithms, which are widely used in SA tasks due to their promising effectiveness, were evaluated. It is worth noting that the rule-based approach was only used to discover the impact of emojis on sentiment recognition (the first experiment). All six classification algorithms were implemented in all three experiments, and their performance was compared with that of the CEmo-LSTM model based on the same experimental settings. The detailed setting information for each algorithm is summarized below.
• Logistic Regression (LR) [34]: LR was carried out for SA tasks in all three experiments, with LR (text) representing the implementation of LR in plain texts, LR (text + E) representing the implementation in posts with emoji tags embedded, LR (text + ES) representing LR's operation when emoji tags were replaced by corresponding sentiment words, and LR (EU) representing LR's operation in posts classified by emoji usage. To train each model, the features of the posts were used as inputs, such as emojis, bag-of-words, and TF-IDF values.
• Support Vector Machine (SVM) [35]: SVM was also compared in each experiment, with SVM (text) denoting SVM's operation in plain texts, SVM (text + E) denoting posts with emojis embedded, SVM (text + ES) denoting the introduction of emojis' sentiment words, and SVM (EU) relating to emoji usage. Similarly, emojis, bag-of-words, and TF-IDF values in the posts were used to train each classifier.
• Naive Bayes classifier (NB) [36]: For the NB algorithm, the Naive Bayes classifier for multinomial models (i.e., the multinomial Naive Bayes classifier) was used, which is suitable for classification with discrete features (e.g., word counts for text classification). Likewise, NB was carried out in each experiment, corresponding to NB (text), NB (text + E), NB (text + ES), and NB (EU). All parameters were kept the same in each NB operation.
• Gradient Boosting Decision Tree (GBDT) [37]: GBDT, the gradient boosting classifier, was also operated as GBDT (text), GBDT (text + E), GBDT (text + ES), and GBDT (EU). In each experiment, text features (or with emojis) were used to train the GBDT classifier. The learning rate was set to 0.05, and the number of estimators was 540. The dropout probability was set as 0.15, and Adam was used as the optimization method during training.

Evaluation metric
We used tenfold cross validation in our experiments. The original dataset was randomly split into ten equal sections. In each fold, nine sections were selected for training, and the tenth section was used for testing. The classification results were measured by accuracy, $Acc$, which is the ratio of correctly identified sentiments of posts among all corpora, defined as
$$Acc = \frac{T}{N},$$
where $T$ indicates the number of predicted sentiment ratings that are identical with manual sentiment ratings, and $N$ indicates the number of posts. In each experiment, we compared the accuracy of the CEmo-LSTM model with that of all baseline algorithms.
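The evaluation protocol can be sketched in a few lines: a random split into ten folds, and accuracy as the ratio of correct predictions T to the number of posts N. The function names and the fixed random seed below are assumptions for illustration, not details from the study.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def ten_fold_splits(
    n: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal sections
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Return T / N, where T counts predictions equal to the manual labels."""
    if len(predicted) != len(gold) or len(gold) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    t = sum(p == g for p, g in zip(predicted, gold))
    return t / len(gold)


splits = list(ten_fold_splits(100))
example_acc = accuracy([1, 0, -1, 1], [1, 0, 1, 1])   # 3 of 4 correct
```

With 10,000 annotated posts, each fold would hold 1,000 test posts and 9,000 training posts; how the ten fold accuracies are aggregated is not stated in the text.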

Effect of emojis on the accuracy of sentiment recognition
(1) Rule-based approach. To discover the effect of emojis on sentiment recognition, the classical rule-based approach for unsupervised learning was examined. We ran the rule-based algorithm both on posts with embedded emojis (emoji posts) and on posts consisting of plain text (emoji-free posts). We found that the performance of the algorithm with emoji posts (accuracy = 0.561) was significantly better than with emoji-free posts (accuracy = 0.360). Emojis are beneficial clues for the rule-based algorithm in SA tasks. This further indicates that emojis play an important role in clarifying and enhancing the sentiment of sentences. However, the accuracy of the rule-based algorithm in both scenarios was not satisfactory, possibly due to the short length of internet micro-texts and inadequate emotional clues.
(2) Classification algorithms. In order to further evaluate the impact of emojis on sentiment recognition, the performance of the classification algorithms in supervised learning was examined in the two scenarios. As shown in Table 4, in most cases the accuracy of the classification algorithms with emoji posts was significantly higher than with emoji-free posts, indicating that in supervised learning algorithms the supplementation of emojis helps to clarify sentence emotions. In addition, we found that in sentiment classification of online microtexts, algorithms using deep learning outperformed other classification algorithms. CEmo-LSTM showed the highest accuracy on our dataset, followed by LSTM, which is also widely applied in other text-based SA tasks [40]. Generally, classification algorithms were superior to the rule-based (unsupervised) algorithm in accuracy. Because the operation of unsupervised learning does not rely on manual annotations and the sentiment lexicon can be updated based on specific datasets, the rule-based SA algorithm is also frequently used in practical scenarios.
Table 6 The performance of different SA algorithms based on emoji usage.

Feature comparison between emoji tag words and sentiment words
It was assumed that the ambiguity of emoji tag words would affect the understanding of sentence emotions for SA algorithms. Consequently, all emoji tags in posts were replaced with corresponding sentiment words when constructing features. However, the empirical results fail to prove this hypothesis, and the accuracy of all algorithms unexpectedly decreased ( Table 5 ). Replacing tag words of emojis with sentiment words slightly reduced the performance of the original algorithms. This indicates that the ambiguity of emoji tags has no negative impact on sentiment classification in practice, and they can be used as effective features in SA tasks.

Improving algorithm accuracy with sentiment strengthening
To evaluate the impact of the introduction of emoji usage in SA algorithms, strengthening posts in the corpora were selected and used to train SA models. We found that the accuracy of each classification algorithm significantly improved after examining the consistency between the sentiments of emojis and those of plain texts. This indicates that posts in which the emoji sentiment is inconsistent with the text sentiment tend to reduce the performance of SA algorithms. Before training SA models, it is useful to classify the training dataset by emoji usage. This also supports the rationale behind the architecture design of the CEmo-LSTM model. As shown in Table 6, although the introduction of emoji usage dramatically improved the accuracy of all algorithms, our improved emoji-embedding model (CEmo-LSTM) always provided the best performance in SA tasks.

Case study
As the COVID-19 pandemic sweeps across the world, it is causing widespread concern, fear, and stress. Some studies indicated that the pandemic not only threatened physical health but also affected individual mentality and emotions [41] . To understand the potential emotional changes of Wuhan residents caused by COVID-19, we collected all posts published by Weibo users who were located in Wuhan during the COVID-19 outbreak and conducted SA on the dataset utilizing the CEmo-LSTM algorithm.
The percentage of positive posts and the percentage of negative posts published daily were calculated, respectively. We found that after the COVID-19 outbreak the number of positive posts on Weibo dropped drastically (Figure 2A). This result verifies the above conclusion that the pandemic has had psychologically negative effects on individuals. Furthermore, in order to examine the evolution of specific sentiments of Wuhan residents, we divided user sentiments into seven categories: Happy, Appreciated, Angry, Sad, Scared, Disgusted, and Surprised [42]. The evolution pattern of each sentiment was analyzed (Figure 2B). It can be seen that after the outbreak of COVID-19, with the spread of the novel coronavirus, the proportion of posts related to Sad and Scared made by Wuhan users clearly increased. By further mining the textual content of these posts, we found that most topics were relevant to the spread, treatment, and impact of COVID-19. In general, the outbreak of the pandemic has indeed caused more negative emotions for Wuhan residents.
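The daily percentages behind such a trend analysis can be computed by grouping predicted labels by posting date. The sketch below uses made-up toy records; the function name and the `(date, label)` record format are assumptions, and the labels stand in for model predictions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def daily_sentiment_share(
    records: Iterable[Tuple[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Return the percentage of positive and negative posts for each day.

    Each record is (date_string, predicted_label) with labels 1, -1, or 0.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for day, label in records:
        counts[day][label] += 1
    shares: Dict[str, Dict[str, float]] = {}
    for day in sorted(counts):
        total = sum(counts[day].values())
        shares[day] = {
            "positive": 100 * counts[day][1] / total,
            "negative": 100 * counts[day][-1] / total,
        }
    return shares


# Toy records, not real data.
toy_records = [
    ("2020-01-01", 1), ("2020-01-01", 1), ("2020-01-01", 0), ("2020-01-01", -1),
    ("2020-01-02", -1), ("2020-01-02", -1), ("2020-01-02", 1), ("2020-01-02", 0),
]
shares = daily_sentiment_share(toy_records)
```

Using percentages rather than raw counts keeps days with different posting volumes comparable.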

Conclusion & discussion
Due to the diversity of Chinese expressions and the variability of Chinese syntax and semantics, SA algorithms are unable to achieve satisfactory results when processing Chinese texts, especially short microtexts. Emojis, which are graphic symbols carrying specific meanings, have been frequently embedded within micro-texts to more directly express emotional meanings, and they provide novel information on user sentiments. As emojis are widely adopted in online conversations across  apps and platforms, they can be introduced into SA algorithms as crucial features to improve performance.
In this study, we examined and compared popular SA algorithms, including the rule-based algorithm in unsupervised learning, classification algorithms (e.g., SVM), and neural network algorithms (e.g., LSTM) in supervised learning. The effect of emoji introduction and the ambiguity of emoji tags were also evaluated. We found that the accuracy of supervised learning algorithms is generally higher than that of unsupervised learning algorithms. Further, deep learning algorithms (e.g., LSTM and Bi-LSTM) always achieve the best performance. In addition, we found that introducing emojis is beneficial to improve the performance of SA algorithms, and emoji tag words can be used directly when constructing features for classifier training. It is worth noting that after classifying the emoji usage of posts in the training set, the performance of each algorithm improved significantly. An overview of all algorithms and their improvements is shown in Figure 3 .
Accordingly, combined with emoji usage, we developed an improved emoji-embedding model based on Bi-LSTM (namely, CEmo-LSTM), in which emojis are used as one of the features, and the training data are classified by their emoji usage before training the classifier. Compared with existing SA algorithms [43] , our model achieved the highest accuracy when analyzing online Chinese texts. Finally, the proposed algorithm, CEmo-LSTM, was applied to the SA of Wuhan residents during the COVID-19 outbreak. It was found that the pandemic has had a negative impact on individual sentiments, and the outbreak has resulted in more passive emotions (e.g., scared and sad) on the part of Wuhan residents.
This study proposed a novel emoji-embedding algorithm based on emoji usage for SA, highlighting the sentiment evolution of social platform users due to the COVID-19 outbreak. However, the emotional symbols we focused on in this study were mainly the common emojis of Sina, and there are many different emoji packs from other sources. It is therefore necessary to explore the usage patterns of a larger set of emojis and to implement comparative studies across platforms and contexts. In addition, different usage habits of emojis may lead to different sentiment semantics. In future work, we intend to analyze the emoji use habits of online users in detail to promote the comprehension of emotional meanings conveyed by emojis, and to explore more complex contexts to further improve the performance of the CEmo-LSTM algorithm.

Declaration of Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.