Learning More from Mixed Emotions: A Label Refinement Method for Emotion Recognition in Conversations

Abstract One-hot labels are commonly employed as ground truth in Emotion Recognition in Conversations (ERC). However, this approach may not fully encompass all the emotions conveyed in a single utterance, leading to suboptimal performance. Regrettably, current ERC datasets lack comprehensive emotionally distributed labels. To address this issue, we propose the Emotion Label Refinement (EmoLR) method, which utilizes context- and speaker-sensitive information to infer mixed emotional labels. EmoLR comprises an Emotion Predictor (EP) module and a Label Refinement (LR) module. The EP module recognizes emotions and provides context/speaker states for the LR module. Subsequently, the LR module calculates the similarity between these states and ground-truth labels, generating a refined label distribution (RLD). The RLD captures a more comprehensive range of emotions than the original one-hot labels. These refined labels are then used for model training in place of the one-hot labels. Experimental results on three public conversational datasets demonstrate that our EmoLR achieves state-of-the-art performance.


Introduction
Emotion recognition in conversations (ERC) is an important research topic with broad applications, including human-computer interaction (Poria et al., 2017), opinion mining (Cambria et al., 2013), and intent recognition (Ma et al., 2018).
Unlike vanilla text emotion detection, ERC models need to model context- and speaker-sensitive dependencies (Tu et al., 2022a) to simulate the interactive nature of conversations. Recurrent neural networks (RNNs) (Zaremba et al., 2014) and their variants have been successfully applied to ERC. Recently, ERC research has primarily focused on understanding the influence of internal/external factors on emotions in conversations, such as topics (Zhu et al., 2021), commonsense (Zhong et al., 2019; Jiang et al., 2022), causal relations (Ghosal et al., 2020), and intent (Poria et al., 2019b). These efforts have improved models' understanding of the semantic structure and meaning of conversations.
However, despite advancements in modeling context information (Zhong et al., 2021; Saxena et al., 2022), the final classification paradigm in ERC remains the same: calculating a loss between the predicted probability distribution and one-hot labels. This black-and-white learning paradigm has the following problem: according to Plutchik's wheel of emotions (Plutchik, 1980) and the hourglass model (Cambria et al., 2012), emotions expressed in dialogue are often mixed expressions comprising various basic emotions (such as anger, sadness, fear, and happiness) (Chaturvedi et al., 2019; Jiang et al., 2023). Each basic emotion contributes to the overall emotional expression to some extent. However, the one-hot label representation assumes emotions are independent. In real scenarios, as shown in Figure 1, a dialogue normally expresses not a single emotion but multiple emotions with a specific distribution. Particularly when dealing with mixed and ambiguous emotions, one-hot vectors are insufficient as labels: they overlook the emotional information within the utterance, which may lead to suboptimal performance of ERC models.
To address the limitations of one-hot labels, several methods have been proposed in the fields of text classification and object detection, such as label smoothing (LS) (Müller et al., 2019) and label distribution learning (LDL) (Geng, 2016). LS partially alleviates the problem by randomly injecting noise, but it still cannot fundamentally recover the inherent emotional distribution within an utterance. LDL is capable of handling instances with multiple labels quantitatively, making it advantageous for tasks involving fuzzy labels (Xu et al., 2019). However, obtaining the true label distribution for structured conversation data is challenging.
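To make the contrast concrete, the following minimal sketch (with hypothetical numbers, not taken from any dataset) compares a one-hot label with a refined distribution over the six IEMOCAP emotion classes. A prediction that spreads mass over the secondary emotions is much closer, in KL divergence, to the refined target than to the one-hot one.

```python
import math

EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

# Hypothetical labels for one utterance: a one-hot label keeps only the
# dominant emotion, while a refined distribution keeps the mixture.
one_hot = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]          # "sad" only
refined = [0.05, 0.60, 0.05, 0.05, 0.05, 0.20]    # sad-dominant mixture

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# A prediction spreading mass over sad and frustrated: near the refined
# target, far from the one-hot target.
pred = [0.05, 0.55, 0.05, 0.05, 0.05, 0.25]
dominant = EMOTIONS[refined.index(max(refined))]   # still "sad"
```

The refined vector preserves the one-hot label's argmax while retaining the secondary emotions that the one-hot representation discards.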
Based on this, we propose the Emotion Label Refinement (EmoLR) method, which exploits context- and speaker-sensitive information. EmoLR consists of two components: the emotion predictor (EP) and the label refinement (LR). The EP module preserves context- and speaker-related representations during the emotion detection process. In the LR module, the context- and speaker-sensitive representations are compared with each label to estimate their correlation individually, and the refined label is generated from the correlation scores. The original one-hot label is combined with the new label distribution to reduce interference from noise in the model, and the final output, after softmax activation, is used for training. The refined label, capturing the relationship between the speaker and contextual information, reflects all possible emotions to varying degrees. This distribution provides a more comprehensive representation of the utterance's emotional state than the ground-truth label alone, enabling models to learn more label-related information and improving performance in dialogue emotion analysis. Our main contributions are summarized as follows:

In summary, the outputs (emotion states) of the works discussed above were all trained with one-hot labels. In contrast, our approach supervises the model with refined labels, enabling a more comprehensive extraction of semantic information about emotions in conversation.

Label Refinement
The process of transforming original logical labels into a label distribution is defined as label refinement (LR) or label enhancement (LE). Müller et al. (2019) proved that LR can yield more compact clustering. Vaswani et al. (2017) used LR in language translation tasks. Song et al. (2020) used LR to regularize RNN language model training by replacing hard output targets. Lukasik et al. (2020) introduced a technique to smooth well-formed correlation sequences for the seq2seq problem. LDL is an effective method for LR: Xu et al. (2021) introduced a Laplace label enhancement algorithm based on graph principles, and Zhang et al. (2020) designed a tensor-based multi-view label refinement method to obtain more effective label distributions. Additionally, label embedding has shown promise in classification tasks. Zhang et al. (2018) proposed a multi-task label embedding that transforms tasks into a vector-matching problem. Wang et al. (2018) proposed to regard text classification as a label-word joint embedding problem. Bagherinezhad et al. (2018) studied the impact of label attributes and introduced label refinement in image recognition. However, most LE methods require multi-label information, which is not available in ERC datasets, and obtaining true label distributions manually is also challenging. Our proposed method can generate distributed labels in conversations, allowing LDL to be applied more widely to ERC tasks.

Problem Definition
The task is to recognize the emotion label of each utterance u_i using the original emotion label y_i and the refined label RLD_i. To this end, we aim to maximize the following objective, where RLD_i represents the RLD of the i-th utterance in the conversation and θ denotes the set of parameter matrices of the model. We propose the EmoLR method to generate the RLD, which provides more label-related information across different emotion classes and improves the performance of the EP.

Emotion Label Refinement
One-hot labels often discard important information when utterances convey mixed emotions, since they can only represent a single emotion. To learn more label-related information, we aim to obtain a new label distribution that reflects the dependencies among different emotion dimensions within a sample. Considering that context and speaker states are crucial factors for emotions (Hazarika et al., 2018a,b), we propose EmoLR to recover relevant emotional information for training. EmoLR calculates the context-sensitive and speaker-sensitive dependencies between instances and labels respectively, and generates the RLD by dynamically fusing the two types of dependencies. However, the RLD may introduce some noise into the model; therefore, during training, we integrate both the refined labels and the corresponding one-hot labels to reduce the impact of noise. EmoLR consists of two components: the emotion predictor (EP) and the label refinement (LR). The overall architecture is shown in Figure 2. We introduce EmoLR in detail below.
The EP is the basic emotion predictor that takes into account both context and speaker states. We aim to simulate the conversation process: the global context information [c_1, ..., c_{t-1}] influences the speaker state s_{p,t}, and the speaker state s_{p,t-1} influences the context state c_t at the next time-step. Finally, the emotion state e_t is updated from the speaker state s_{p,t-1} and the last time-step emotion state e_{t-1} by a GRU network. We utilize three groups of GRU networks to extract the context state c_t, the speaker state s_{p,t}, and the emotion representation e_t, respectively. The process of capturing the three states is as follows, where u_t represents the t-th utterance in the conversation.

Figure 2: Illustration of the proposed model, which is composed of an emotion predictor and a label refinement component. We extract the information from speakers and context within the conversation model to refine the original one-hot label into the RLD. Therefore, the label refinement module can be regarded as the main process of EmoLR.

The LR component consists of a label encoder, a similarity layer, and a label attention module. A multi-layer perceptron (MLP) is used as the encoder to generate the transformation matrix W. The similarity layer takes the label embedding vectors, speaker states, and context states as inputs and computes their similarity using the dot product. The label attention module then dynamically fuses the context-sensitive and speaker-sensitive dependencies to generate the RLD. Thus, the RLD becomes a context- and speaker-dependent distribution that adapts the degree of context-sensitiveness and speaker-sensitiveness for each emotion. The process can be represented as follows, where f_e is an encoder function that transforms the labels [label_1, label_2, ..., label_C] into a label embedding matrix. With the LR component, we assume that the learned RLD reflects the similarity relationship between emotions, context, and speaker states. This helps the model represent the emotions in utterances more comprehensively, especially in the case of mixed emotions.
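As a minimal numerical sketch of the LR component (random stand-in vectors, a mean-based toy attention, and a fixed mixing weight beta are all illustrative assumptions, not the paper's learned parameters), the label embeddings are scored against the context and speaker states by dot product, the two score vectors are fused, and the one-hot label is mixed in before the final softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

C, D = 6, 16                          # number of emotion classes, state size
label_emb = rng.normal(size=(C, D))   # stand-in for f_e(labels)
c_t = rng.normal(size=D)              # context state from the EP's GRUs
s_t = rng.normal(size=D)              # speaker state from the EP's GRUs
one_hot = np.eye(C)[1]                # ground-truth label (class 1)

# Dot-product similarity between each label embedding and the two states.
sim_ctx = label_emb @ c_t             # context-sensitive scores
sim_speak = label_emb @ s_t           # speaker-sensitive scores

# Toy stand-in for the label attention that fuses the two score vectors.
alpha = softmax(np.array([sim_ctx.mean(), sim_speak.mean()]))
fused = alpha[0] * sim_ctx + alpha[1] * sim_speak

# Mix in the one-hot label to damp noise, then normalize to obtain the RLD.
beta = 1.0                            # hypothetical mixing weight
rld = softmax(fused + beta * one_hot)
```

The result is a valid probability distribution over the C emotion classes whose shape depends on both the context and the speaker state.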

Training
During training, the RLD replaces the one-hot label and is considered the new training target; the model is trained under the supervision of the RLD.
To measure the difference between the RLD and the predicted label distribution e_t, we use the KL divergence (Kullback and Leibler, 1951) as the loss function, where z is obtained by applying softmax to the RLD. Based on the above process, the complete loss function can be written with η indicating the L2 regularization term and θ representing the set of parameter matrices of the model.
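The KL term can be sketched directly (the score vectors below are illustrative, not from the paper): z is the softmax-normalized RLD, e_t is the predicted distribution, and the loss is KL(z || e_t), which is non-negative and zero only when the two distributions match.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl_divergence(z, e, eps=1e-12):
    """KL(z || e) of the predicted distribution e from the target z."""
    z = np.asarray(z, dtype=float)
    e = np.asarray(e, dtype=float)
    return float(np.sum(z * (np.log(z + eps) - np.log(e + eps))))

# Illustrative values for a 6-class problem.
rld_scores = np.array([0.2, 2.0, 0.1, 0.1, 0.1, 0.8])   # unnormalized RLD
z = softmax(rld_scores)                                   # training target
e_t = softmax(np.array([0.1, 1.8, 0.2, 0.1, 0.2, 0.9]))  # model prediction

loss = kl_divergence(z, e_t)
```

In practice this scalar would be summed over utterances and combined with the L2 term η||θ||² described above.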
It is worth noting that label refinement occurs only during the training stage and is ignored during prediction. This training method reduces the influence of noise on the model as much as possible.

Datasets
We conducted comprehensive experiments on three benchmark datasets: (i) IEMOCAP (Busso et al., 2008), (ii) MELD (Poria et al., 2019a), and (iii) EmoryNLP (Zahiri and Choi, 2018). The statistics are shown in Table 1. All three are multi-modal datasets, including textual, visual, and acoustic information for each utterance; in this paper, however, we focus solely on the textual information.
IEMOCAP is a two-party conversational dataset performed by ten people across five sessions. Each utterance is labeled with one of six emotions: happy, sad, neutral, angry, excited, or frustrated.
MELD is a multi-party dataset collected from the Friends TV series, an extension of the EmotionLines dataset. It includes over 1,400 multi-party conversations and 13,000 utterances, each labeled with one of seven emotions: anger, disgust, sadness, joy, surprise, fear, or neutral.
EmoryNLP is another dataset based on the Friends TV series. Each utterance is labeled with one of seven emotion classes: neutral, sad, mad, scared, powerful, peaceful, and joyful.
For all experiments on these three datasets, we use accuracy (Acc.) and the weighted-average F1 score (W-Avg F1) as the evaluation metrics to compare the performance of different models.
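The weighted-average F1 metric used throughout can be sketched as follows: per-class F1 scores are weighted by each class's support, which matters for the imbalanced label distributions in these datasets.

```python
from collections import Counter

def weighted_avg_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1
    return score
```

This matches the `average='weighted'` behavior of common ML toolkits; libraries such as scikit-learn provide the same computation.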

Baselines
We compare the performance of EmoLR with the following baselines. CNN (Kim, 2014) is trained at the utterance level to predict final emotion labels without context information.
ICON (Hazarika et al., 2018a) is a multi-modality emotion detection framework that discriminates the roles of the participants.
KET (Zhong et al., 2019) combines external knowledge through emotional intensity and contextual relevance.
DialogueRNN (Majumder et al., 2019) uses two GRU networks to track the state of each participant throughout the conversation.
DialogueGCN (Ghosal et al., 2019) enhances the modeling of dependencies among the speakers of each utterance.
COSMIC (Ghosal et al., 2020) introduces causal knowledge to enrich the speaker states.

ERMC-DisGCN (Sun et al., 2021) proposes to control the contextual cues and capture speaker-level features.

DialogueCRN (Hu et al., 2021) models the retrieval and reasoning process of cognition by mimicking the human thinking process, in order to fully understand the dialogue context.

SKAIG (Li et al., 2021) proposes a psychological knowledge-aware interaction graph to consider the influence of the speaker's psychological state on their actions and intentions.

DAG-ERC (Shen et al., 2021) utilizes a directed acyclic graph (DAG) to encode utterances and better model the intrinsic structure within a conversation.

Hyperparameter Settings
For all the baselines, we conducted experiments according to their original experimental settings on the three datasets, using randomly assigned seeds. For our proposed model, EmoLR, we used the Adam optimizer (Kingma and Ba, 2014) with a batch size of 16, an L2 regularization weight of 3e-4, and a learning rate of 1e-4 throughout training. The dropout rate was set to 0.5. We used RoBERTa (Liu et al., 2019) for word embeddings, taking the average of the last four layers as input. To optimize EmoLR's performance across the datasets, we used hold-out validation with a validation set to conduct a thorough hyperparameter search.
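The "average of the last four layers" embedding step can be sketched as follows; the random arrays stand in for the per-layer hidden states a RoBERTa encoder would return (the layer count and sizes here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the hidden states of each encoder layer, shaped
# (sequence length, hidden size), as an encoder like RoBERTa returns.
num_layers, seq_len, hidden = 12, 8, 32
hidden_states = [rng.normal(size=(seq_len, hidden)) for _ in range(num_layers)]

# Utterance representation: element-wise mean of the last four layers.
embedding = np.mean(hidden_states[-4:], axis=0)   # (seq_len, hidden)
```

With a real model, `hidden_states` would come from the encoder's per-layer outputs; the averaging itself is unchanged.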

Comparison with the Baselines
IEMOCAP: On the IEMOCAP dataset, our proposed EmoLR method achieves the best performance among the models shown in Table 2, with an average W-Avg F1 score of 68.12%. It outperforms SKAIG by about 1.2% and most other baseline models by at least about 2%. To explain the performance difference, we need to consider the structural features of these models and the nature of the conversations. The top three baselines in terms of performance are DialogueCRN, SKAIG, and DAG-ERC, all of which attempt to model speaker-level context. EmoLR's use of label refinement to encode class information, providing richer context and speaker states than the ground-truth labels alone, is a significant reason for its improved performance.
MELD and EmoryNLP: On the MELD dataset, our proposed model achieves a W-Avg F1 score of 65.16%, slightly lower than COSMIC's 65.21%. For EmoryNLP, our model outperforms all baselines except COSMIC and SKAIG by around 2%-4%. The MELD and EmoryNLP datasets, both drawn from the Friends TV series, pose challenges: utterances are short, emotion-related words rarely appear, and many conversations involve more than five participants, making it difficult to design an ideal model. Our proposed model performs well by efficiently addressing the issues of short utterances and rare emotion words: to capture the complete context and speaker information, the label refinement method utilizes the context information and the current local emotion state to compute the global emotion label. However, COSMIC and SKAIG remain competitive on these two datasets.

Ablation Study
In this section, we report ablation studies on the impact of the different sensitive dependencies in EmoLR; the results are presented in Table 2. When using only the EP (training without label refinement, EmoLR−RLD), the classification performance is the worst on all three datasets: the W-Avg F1 scores are only 63.51%, 60.70%, and 36.52%, worse than some baselines. These results highlight the importance of the RLD for the EP. Adding context-sensitive dependence (EmoLR−Sim_speak) improves the W-Avg F1 scores to 66.13%, 64.17%, and 37.98%, demonstrating the significance of context for the label distribution. Similarly, EmoLR−Sim_ctx, which uses speaker-sensitive labels without context-sensitive dependence, achieves F1 scores of 65.20%, 63.72%, and 37.23%, showing that speaker-sensitive dependence is also essential for the label distribution. Notably, EmoLR−Sim_speak outperforms EmoLR−Sim_ctx, indicating that labels are more sensitive to context.

Model pair           IEMOCAP    MELD       EmoryNLP
EmoLR vs. DAG-ERC    3.793e-2   1.835e-3   4.007e-1
EmoLR vs. SKAIG      1.220e-4   3.856e-3   5.635e-3
EmoLR vs. COSMIC     2.117e-6   7.421e-1   8.438e-3

Table 3: The results (p-values) of the significance test among models with similar performance.

We also observe that some methods yield results close to those of our model. To compare them, a t-test at a significance level of 0.05 is employed on the three comparative methods with similar experimental results, as shown in Table 3. On the IEMOCAP dataset, EmoLR significantly outperforms all of the comparative methods. On the MELD dataset, EmoLR yields results similar to COSMIC's but surpasses the other comparative methods. On the EmoryNLP dataset, EmoLR performs similarly to DAG-ERC but outperforms the other comparative methods. Overall, EmoLR's performance is superior to most of the comparative methods. For a clearer presentation and comparison, a visualization experiment exhibits the predicted distributions with and without label refinement. The t-SNE (Van der Maaten and Hinton, 2008)
method is used to project the high-dimensional data space into a low-dimensional one. From Figure 3, it is evident that the distribution of the EP under one-hot label supervision is discrete and chaotic in subfigure (a), making it difficult to find relationships among emotions. In contrast, the prediction distribution of the EP under refined-label supervision is significantly closer for utterances of the same emotion class: in subfigure (d), each class exhibits a more compact cluster and a more regular distribution.
In addition, a t-test at a significance level of 0.05 is employed for the ablation study. The experimental results are shown in Table 4, and there is a clear difference between the RLD and the ground-truth label.
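The paired t-tests above compare per-run scores of two models on the same splits. A minimal sketch of the test statistic (the score lists below are hypothetical, not the paper's runs; in practice one would use a library routine such as `scipy.stats.ttest_rel` to obtain the p-value):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test between two lists of run scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # mean difference divided by its standard error (sample stdev, n-1)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical W-Avg F1 scores over five runs for two models.
model_a = [68.1, 67.9, 68.3, 68.0, 68.2]
model_b = [66.9, 67.0, 66.8, 67.1, 66.7]
t = paired_t_statistic(model_a, model_b)
```

The resulting t statistic is compared against the critical value at the chosen significance level (0.05 here) with n−1 degrees of freedom.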

Generalization Analysis
By applying the LR component to other models, such as DialogueRNN, DialogueGCN, COSMIC, and DAG-ERC, we observed significant improvements over the baselines presented in Table 5. After performing a paired t-test (p < 0.05), a statistically significant difference was found between the RLD-augmented and original models. This demonstrates that LR is effective not only in the EP but also in other models; based on these experiments, we conclude that our proposed method is effective and broadly applicable in ERC. Furthermore, Table 6 displays experimental results for multimodal ERC models. Specifically, we extracted visual and audio information from the multimodal data following previous works, ICON (Hazarika et al., 2018a) and DialogueRNN (Majumder et al., 2019), both of which utilized multimodal information from the same datasets in their experiments. We combined the representations of the three modalities into a new representation of the corpus, which was then fed into our model. The results show that the RLD achieves remarkable performance in both unimodal and multimodal settings, demonstrating its applicability to both text-only and multimodal scenarios.

Comparison with Label Smoothing and Label Confusion Learning
We compared our RLD with other label enhancement methods implemented on the same predictor; the experimental results are presented in Table 7. RLD outperforms label smoothing (LS) and the label confusion model (LCM) on both datasets. LS, which uniformly redistributes part of the label mass as noise, does not fundamentally solve the weakness of one-hot labels in quantitatively representing the corresponding emotions. LCM, on the other hand, is not sensitive to the context and speaker states in the conversation. These limitations constitute the primary reason for EmoLR's superior performance.
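The contrast with LS can be made explicit: LS moves a fixed fraction ε of the label mass uniformly onto every class, regardless of context or speaker, so its off-gold entries are all identical (ε is the usual smoothing hyperparameter; 0.1 below is just an example value).

```python
def label_smoothing(one_hot, eps=0.1):
    """LS target: keep (1 - eps) on the gold class, spread eps uniformly."""
    num_classes = len(one_hot)
    return [v * (1.0 - eps) + eps / num_classes for v in one_hot]

one_hot = [0, 1, 0, 0, 0, 0]      # gold class at index 1
ls_target = label_smoothing(one_hot)
```

Unlike the RLD, which assigns different masses to different secondary emotions depending on the context and speaker states, every non-gold class here receives exactly ε/C.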

Correlation Analysis and Case Study
We conducted manual evaluations of each utterance with the assistance of three annotators. During tagging, annotators labeled 1 if a given emotion was possibly present and 0 otherwise. For example, an utterance in IEMOCAP conveying both happiness and excitement is labeled {1,0,0,0,1,0}. In cases of discrepancies among the annotators, a majority vote was taken: emotions receiving two or more votes were labeled 1, and 0 otherwise. Correlation analyses were performed between the manual evaluations, the one-hot labels, and the RLD, and the experimental results are shown in Table 8. The manual evaluations exhibit a stronger correlation with the RLD than with the one-hot labels.
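The Pearson correlation coefficient (PCC) reported in Table 8 can be sketched as follows; the two vectors compared would be a manual annotation vector and a label vector (one-hot or RLD) for the same utterance.

```python
def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical vectors for one utterance: manual votes vs. a refined label.
manual = [1, 0, 0, 0, 1, 0]
rld = [0.30, 0.05, 0.05, 0.05, 0.45, 0.10]
pcc = pearson_corr(manual, rld)
```

A soft label that puts its mass on the humanly annotated emotions yields a higher PCC than a one-hot vector that keeps only one of them.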
To better analyze how the RLD enhances the emotion detection ability of the EP, we present case studies with selected examples from the IEMOCAP and MELD datasets in Figure 4. Correct results and emotional keywords are highlighted in red, and the ground-truth label (Result_o) and refined label (Result_f) for each utterance are shown separately. In Figure 4(a), the conversation occurs in a negative atmosphere, but U^B_1 is a common greeting about the other speaker's father, and its words have no obvious emotional tendency. This contrast with the context leads the model to misjudge the speaker as neutral, assuming the speaker is not yet frustrated enough. However, the refined labels consider multiple emotions based on speaker- and context-sensitive perceptions, enabling the model to detect frustration. Figure 3 likewise demonstrates that the emotion states trained with the RLD overlap less than those trained with one-hot labels. Similarly, for utterance U^A_3, the high probability of marriage being associated with happiness, combined with few negative emotional words, leads to an incorrect identification as excited; considering the sad context, however, the predominant emotion is correctly associated with sadness. In the MELD example in Figure 4(b), the word "well" is typically associated with a joyful emotion, but here it expresses a more excited state of mind. The RLD results align more closely with human evaluations in both examples, highlighting the effectiveness of label refinement, especially under confusing and ambiguous conditions.

Error Analysis
Despite the strong performance of the EmoLR method, there are instances where it fails to recognize certain emotions in dialogues. As shown in Figure 4, our model misclassifies a sad utterance as excited. Some utterances convey a single but extremely strong emotion, and the refined labels may introduce noise that leads to incorrect emotion recognition. For example, the frustration in utterance U^A_1 is clearly evident from the keyword "dead", but after label refinement the result is influenced by the word "marry", yielding an incorrect excited prediction. Figure 3 also shows that many different emotion classes lie close together in the t-SNE result; for example, the clusters of happiness and excitement are located close to each other, indicating their similarity in original label value. We also explore the impact of ground-truth labels on the RLD in Table 9. The experiments show that training the model directly with the RLD alone is not ideal, indicating that the RLD is not completely correct: the EP still requires ground-truth labels to help mitigate the noise caused by the RLD. Notably, the experiment with the ground-truth weight set to 4 yields the best results, emphasizing the importance of striking the right balance. How to alleviate the negative effects of refined labels remains a challenge for future work.

Conclusion
In this paper, we propose the EmoLR method, which adaptively generates a refined label distribution that quantitatively describes emotional intensity. The RLD guides the model to learn more label-related knowledge and to capture more comprehensive semantic information. Our method derives the RLD from context- and speaker-sensitive states without requiring external knowledge or changing the original model structure. EmoLR has proven effective on both unimodal and multimodal data through extensive experimental analysis and outperforms state-of-the-art results on three datasets.
Future work on EmoLR should address the noise the RLD introduces through unrelated emotions and reduce its interference. Additionally, exploring efficient methods to encode labels in multimodal settings may shed further light on the relevance of mixed emotions.

Figure 1: The emotions expressed in an utterance are abundant and interrelated. The bottom-left radar chart shows the emotion of utterance U_t^B. The emotion is influenced by the context (green lines) and the current speaker's state (blue lines). The red lines indicate that the emotion is recognized from both context and speaker states. For example, U_t^B not only expresses sadness but also includes frustration and happiness. Although sadness has the dominant effect in this utterance, we cannot ignore the semantic information contained in the other emotions.

Figure 3: t-SNE visualization of the emotion states of utterances from the IEMOCAP test set. Colors indicate the ground-truth emotion labels. r1 denotes the one-hot label; r2, the EP without speaker-sensitive refinement; r3, the EP without context-sensitive refinement; r4, the RLD.

Figure 4: Case studies of the basic predictor with label refinement.
Notation: D_u, D_c, D_s, and D_e denote the sizes of u_t, c_t, s_{p,t}, and e_t respectively, and D_c, D_s, and D_e are set to the same value. c_t ∈ R^{D_c} represents the contextual representation at time t, s_{p,t} ∈ R^{D_s} the state of speaker p at time t, e_t ∈ R^{D_e} the emotion state at time t, and W_α ∈ R^{D_u × D_c} the attention weight matrix.

Table 1: The statistics of splits, classes, and evaluation metrics adopted in the three datasets. Sim_speak denotes the speaker-sensitive score, and Sim_ctx the context-sensitive score.
where [label_1, ..., label_C] is transformed into the label embedding matrix R^{(l)}, C is the number of classes, W is the transformation matrix, b is the bias, and RLD represents the refined label distribution based on the attention scores over Sim_speak and Sim_ctx.

Table 2: Experimental results on the three datasets. The W-Avg F1 scores are the weighted-average F1 over five runs with different random seeds. Test scores are chosen at the best validation scores. '−' denotes removal of the following component.

Table 4: The results of the significance test in the ablation study.

Table 5: The results of the generalization analysis.

Table 6: Comparison of performance on IEMOCAP and MELD under different modality combinations. The difference between modalities was found to be statistically significant under a paired t-test (p < 0.05).

Table 7: Comparison with label smoothing (LS) and the label confusion model (LCM).

Table 8: Correlation analysis. PCC denotes the Pearson correlation coefficient.

Table 9: The effect of the ground-truth label on the RLD on IEMOCAP.