SocialNLP 2018 EmotionX Challenge Overview: Recognizing Emotions in Dialogues

This paper presents an overview of EmotionX, the Dialogue Emotion Recognition Challenge held at the Sixth SocialNLP Workshop, whose task is to recognize the emotion of each utterance in a dialogue. The challenge uses the EmotionLines dataset as its experimental material. EmotionLines contains conversations from Friends TV show transcripts (Friends) and real chat logs (EmotionPush), in which every utterance is labeled with an emotion. The organizers provide baseline results. Eighteen teams registered for the challenge and 5 of them successfully submitted results. The best team achieves an unweighted accuracy of 62.48 and 62.5 on EmotionPush and Friends, respectively. In this paper we present the task definition, the test collection, the evaluation results of the participating groups, and their approaches.


Introduction
Human emotion underlies our daily interactions with other people, and the study by Ekman (1987) shows that emotion is a universal phenomenon across different cultures. An emotion detection system can improve mutual understanding between individuals by surfacing emotion signals that would otherwise go undetected. Because human perception of emotion is inherently multi-modal, involving vision and speech, multi-modal emotion recognition plays an important role in the emotion detection area (Sebe et al.; Kessous et al., 2010; Haq and Jackson, 2011). At the same time, studies in uni-modal emotion recognition also contribute in a variety of modalities such as vision (Ekman and Friesen, 2003), speech (Nwe et al., 2003) and text (Alm et al., 2005). However, with the progress of social media and dialogue systems, especially online customer services, textual emotion recognition has attracted more attention. In social media, hashtags and emoji are widely used and can provide substantial emotion clues (Qadir and Riloff, 2014; Kralj Novak et al., 2015). For dialogue systems, instant emotion detection could help customer service notice the dissatisfaction of clients. Still, textual emotion recognition needs further exploration in dialogue systems for several reasons. First, a text segment can express different emotions depending on its context. Taking the dialogue from Hsu et al. (2018) in Table 1 as an example, Okay! could express joy or anger in different scenarios. Second, informal language and short sentences are everywhere in daily conversation. For instance, lol actually means laugh out loud. Therefore, emotion flow modeling and informal language understanding are essential for improving dialogue emotion recognition systems.
For EmotionX shared task in SocialNLP 2018, we select an emotional dialogue dataset, Emo-

EmotionLines Dataset
EmotionLines is collected from two sources: Friends TV show transcripts (Friends) and Facebook messenger logs (EmotionPush). Dialogues are randomly selected from the raw data in four buckets of dialogue length, [4-9], [10-14], [15-19], and [20-24], with 250 dialogues for each bucket. However, EmotionPush is a private chat log, and releasing it may raise privacy issues. To cope with this problem, the Stanford Named Entity Recognizer (Manning et al., 2014) was adopted to replace the named entities in the corpus. In Hsu et al. (2018), Amazon Mechanical Turk was used to label the emotion of every utterance. Following Ekman's (1987) six basic emotions, with neutral added, seven emotions are available to annotators in the labeling interface. To handle utterances on which annotators disagree widely, any utterance annotated with more than two emotions is considered a non-neutral utterance. As a result, there are eight emotion labels in total in both the Friends and EmotionPush datasets: joy, anger, sadness, surprise, fear, disgust, neutral, and non-neutral. Figure 1 shows the emotion label distribution for these two datasets.
As we can see, more than 45% of the utterances carry the neutral label in both datasets, and the more severe label imbalance in EmotionPush reflects the real situation that most utterances in daily conversations are neutral.
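The rule for collapsing several crowd annotations into one label can be sketched as follows. This is only an illustrative sketch: the text states that an utterance annotated with more than two emotions becomes non-neutral, but it does not say how the remaining cases are resolved, so the majority vote used here is an assumption, and the function name `aggregate_labels` is hypothetical.

```python
from collections import Counter

NON_NEUTRAL = "non-neutral"


def aggregate_labels(annotations):
    """Collapse the crowd annotations of one utterance into a single label.

    Stated in the text: more than two distinct emotions -> "non-neutral".
    Assumed here (not stated in the text): otherwise the most frequent
    label is kept.
    """
    counts = Counter(annotations)
    if len(counts) > 2:
        return NON_NEUTRAL
    # most_common(1) returns the label with the highest count.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(aggregate_labels(["joy", "joy", "neutral", "joy", "joy"]))
    print(aggregate_labels(["joy", "anger", "sadness", "joy", "neutral"]))
```

Under these assumptions, the first call keeps joy, while the second yields non-neutral because three distinct emotions were chosen.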

Challenge Setup
In the shared task, each dataset is split into training, validation, and testing sets with 720, 80, and 200 dialogues, respectively. Because some emotions have very few utterances, we only evaluate the performance of recognizing four emotions: Joy, Anger, Sadness, and Neutral, as announced early in the challenge. Generally speaking, recognizing strong emotions may provide more value than detecting the neutral emotion. To make a meaningful comparison in this challenge, we chose the unweighted accuracy (UWA) as our metric instead of the weighted accuracy (WA), as the latter is heavily compromised by the large proportion of the neutral emotion.
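\[
\mathrm{WA} = \sum_{l \in C} s_l \, a_l, \qquad \mathrm{UWA} = \frac{1}{|C|} \sum_{l \in C} a_l
\]
where a_l denotes the accuracy of emotion class l, s_l denotes the percentage of utterances in emotion class l, and C is the set of evaluated emotion classes.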
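The difference between the two metrics can be illustrated with a short sketch. It assumes that UWA is the plain mean of the per-class accuracies and that WA weights each class accuracy by that class's share of utterances; the function names and the toy labels are hypothetical and are not part of the official evaluation script.

```python
from collections import defaultdict

EVAL_CLASSES = ("joy", "anger", "sadness", "neutral")


def per_class_accuracy(gold, pred, classes=EVAL_CLASSES):
    """Return ({class: accuracy}, {class: count}) over the evaluated classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        if g in classes:  # utterances with other gold labels are ignored
            total[g] += 1
            correct[g] += int(g == p)
    # Classes without any gold utterance are skipped to avoid dividing by zero.
    acc = {c: correct[c] / total[c] for c in classes if total[c] > 0}
    return acc, total


def uwa_and_wa(gold, pred, classes=EVAL_CLASSES):
    """Compute the unweighted accuracy (UWA) and weighted accuracy (WA)."""
    acc, total = per_class_accuracy(gold, pred, classes)
    n = sum(total[c] for c in acc)
    uwa = sum(acc.values()) / len(acc)              # mean of a_l
    wa = sum(total[c] / n * acc[c] for c in acc)    # sum of s_l * a_l
    return uwa, wa


if __name__ == "__main__":
    # A classifier that always answers "neutral" on an imbalanced toy set.
    gold = ["neutral"] * 8 + ["joy", "anger"]
    pred = ["neutral"] * 10
    print(uwa_and_wa(gold, pred))
```

On this toy set the always-neutral classifier reaches a WA of 0.80 but a UWA of only about 0.33, which shows why WA is dominated by the neutral class.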

Submission
We received 18 registrations, and 5 teams successfully submitted their results in the end. In the following, we summarize the approaches proposed by these 5 teams. More details can be found in their challenge papers.

AR (Adobe Research)
A CNN-DCNN autoencoder-based emotion classifier is proposed. The latent feature of the CNN-DCNN is augmented with linguistic features, such as lexical, syntactic, derived, and psycho-linguistic features, as well as the formality list. Joint training of the classifier and the autoencoder improves generalizability, and the linguistic features boost performance on the minority classes. AR is the only team that considers the imbalance of emotions and also the only team that does not use context information.
SmartDubai NLP (Smart Dubai Government Establishment)
This team implements multiple approaches, including logistic regression, Naive Bayes, CNN-LSTM, and XGBoost, using TF-IDF, word vectors, and some NLP features to train their models. In addition, Internet slang is converted to its meaning, e.g. lol is replaced by lots of laughs. In the end, logistic regression with TF-IDF of words and characters reached the highest performance.
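The slang conversion step can be sketched as a simple dictionary lookup over word tokens. This is a minimal illustration rather than the team's implementation: the table `SLANG` and the function `expand_slang` are hypothetical, and only the lol entry is taken from the description above.

```python
import re

# Hypothetical slang table; only the "lol" entry comes from the text.
SLANG = {"lol": "lots of laughs"}

# Whole-word tokens, so that "lol" inside "lollipop" is left alone.
_TOKEN = re.compile(r"\b\w+\b")


def expand_slang(utterance, table=SLANG):
    """Replace each slang token with its meaning (case-insensitive lookup)."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)

    return _TOKEN.sub(replace, utterance)


if __name__ == "__main__":
    print(expand_slang("LOL that was great"))
    print(expand_slang("I want a lollipop"))
```

Matching on whole tokens keeps ordinary words intact, at the cost of missing slang that is glued to other characters.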
Area66 (TCS Research)
A hierarchical attention network with a conditional random field (CRF) layer on top is proposed. The word embeddings of an utterance are fed into an LSTM, and an attention mechanism then selects the words that carry important emotion information to form the sentence embedding. To model context dependency, the utterance embeddings of a dialogue are passed through another LSTM and a CRF layer to predict the emotion of each utterance.
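The word-level attention step described above can be sketched in a few lines. This sketch assumes plain dot-product scoring against a context vector, which is a simplification: the team's exact scoring function is not given here, and the names `attention_pool`, `hidden_states`, and `context` are hypothetical. The hidden states stand in for the LSTM outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_pool(hidden_states, context):
    """Pool T word vectors into one sentence vector.

    hidden_states: list of T vectors (e.g. per-word LSTM outputs).
    context: query vector of the same dimension (learned in a real model).
    Returns (sentence_vector, attention_weights).
    """
    # Assumed scoring: dot product between each word state and the context.
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Sentence embedding = attention-weighted sum of the word states.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0]]
    print(attention_pool(states, [10.0, 0.0]))
```

In the hierarchical setting, the pooled utterance vectors would then be fed to the dialogue-level LSTM and CRF layer, which this sketch does not cover.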
JTML (ESPOL University)
The team provides a classifier that uses a 1-dimensional CNN to extract utterance features, with an attention mechanism across utterances to capture context information. The proposed GRU-Attention model uses a sequential GRU to learn the relationship between previous utterances and the current utterance, which improves UWA.

Evaluation Results
Table 2 gives a brief summary of the approaches proposed by the teams that participated in the EmotionX challenge, together with their final results. The performance varies across teams. In particular, Table 3 shows that SmartDubai and JTML obtained lower UWA scores because of low accuracy on minority emotion classes such as anger and sadness. In contrast, the winning team AR reached similar performance across the four emotions on both datasets.

Word Embedding
All teams used pre-trained word embeddings: GloVe (Pennington et al., 2014) for four teams and fastText (Joulin et al., 2016) for one team. Area66 used GloVe-Tweet, which is more closely related to informal language; the other teams did not mention the pre-training data in their papers. Using pre-trained word embeddings can reduce the unseen-word issue in the testing phase, especially for a relatively small dataset (Friends and EmotionPush only contain about 14,000 utterances, which is small compared to the datasets commonly used for pre-training embeddings).

Neural Network
Neural network architectures are adopted in all challenge papers. Acting as a universal feature extractor, a neural network can minimize the feature engineering process. AR and JTML apply CNNs to generate utterance embeddings, while Area66 and DLC choose LSTMs instead. By modeling context information in dialogues, DLC shows that self-attention improves UWA performance on both datasets. In addition, the AR team finds that adding the reconstruction loss of the DCNN can improve generalizability.

Linguistic Features
Team AR combines the latent feature of the CNN-DCNN with linguistic features to predict utterance emotion. AR is also the only team leveraging external resources, e.g. lexicons and the formality list. By adding linguistic features to the neural model, the accuracy on anger is boosted by 8.2% and 33.3% on Friends and EmotionPush, respectively. The SmartDubai team uses word and character TF-IDF independently with logistic regression. Their results show that it surpasses XGBoost using TF-IDF and some linguistic features, e.g. sentence length and percentage of unique words, and also outperforms CNN-BiLSTM using fastText word embeddings.

Data Imbalance
Data imbalance directly harms UWA performance. In Table 3, the accuracy on minority emotions such as anger and sadness is relatively low for SmartDubai and JTML, leading to low UWA performance. In contrast, AR is the only team that considers data imbalance in the training process. They achieve balanced accuracy across emotions by applying a weighted loss function, and ultimately obtain the best performance in the EmotionX challenge.
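A class-weighted loss of this kind can be sketched as follows. The weighting scheme is an assumption: inverse class frequency is used here only as a common choice, since the exact weights applied by the AR team are not given in this overview, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = N / (K * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs, gold, weights):
    """Class-weighted negative log-likelihood, normalized by the total weight.

    probs: list of {class: predicted probability} dicts, one per utterance.
    gold: list of gold labels, aligned with probs.
    """
    total = sum(weights[g] * -math.log(p[g]) for p, g in zip(probs, gold))
    norm = sum(weights[g] for g in gold)
    return total / norm


if __name__ == "__main__":
    gold = ["neutral"] * 8 + ["anger"] * 2
    weights = inverse_frequency_weights(gold)
    print(weights)  # anger receives a larger weight than neutral
    uniform = [{"neutral": 0.5, "anger": 0.5}] * len(gold)
    print(weighted_cross_entropy(uniform, gold, weights))
```

With these weights, a mistake on a rare anger utterance costs four times as much as one on a neutral utterance in this toy set, which pushes a model toward more even per-class accuracy.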

Conclusion
We held a successful dialogue emotion recognition challenge, EmotionX, at SocialNLP 2018. Many researchers noticed this challenge and requested the datasets, and 5 teams successfully submitted their results this year. Various interesting approaches were proposed for this challenge, and the best performance reaches an unweighted accuracy of 62.5% and 62.48% on the Friends and EmotionPush datasets of EmotionLines, respectively. We will continue organizing this challenge at SocialNLP 2019 and plan to add a subtask on emotional dialogue generation, in the hope of encouraging and facilitating research on emotion analysis in dialogues.

Table 2: Overview of methods proposed by the participants and UWA on both datasets. The JTML team is not in the ranking list because of late submission. * SmartDubai only used word and character TF-IDF as features for logistic regression; fastText is used by their other framework.

Table 3: Accuracy of four emotions on Friends and EmotionPush datasets.