Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN

With the availability of an enormous quantity of multimodal data and its widespread applications, automatic sentiment analysis and emotion classification in conversation has become an active research topic. The interlocutor state, the contextual state between neighboring utterances, and multimodal fusion play an important role in multimodal sentiment analysis and emotion detection in conversation. In this article, a recurrent neural network (RNN) based method is developed to capture the interlocutor state and the contextual state between utterances. A pair-wise attention mechanism is used to understand the relationship between the modalities and their importance before fusion. First, the modalities are fused two at a time, and finally all the modalities are fused to form the trimodal representation feature vector. The experiments are conducted on three standard datasets: IEMOCAP, CMU-MOSEI, and CMU-MOSI. The proposed model is evaluated using two metrics, accuracy and F1-score, and the results demonstrate that it performs better than the standard baselines.


I. Introduction
The main aim of automatic sentiment and emotion analysis in conversational videos is to analyze and detect the sentiment and emotional state of a participant. Owing to recent advancements in Internet technologies and social media networks, users post their reviews of a service or a product in the form of conversational videos on social media platforms such as Twitter, Flickr, YouTube, and Facebook. Recently, multimodal sentiment and emotion analysis in conversation has become an active research topic because of its widespread applications in areas such as healthcare assistant devices, education, dialogue understanding, human-computer interaction, and human resource management. In prior work, the unimodal features were extracted from the available modalities and then fused to form the multimodal feature vector. For multimodal fusion, there are three options: early fusion (feature concatenation), model-based fusion, and late (decision) fusion. In feature concatenation, the features from individual modalities are concatenated to obtain the multimodal feature vector.
Recently, many approaches have been proposed for utterance-level sentiment and emotion analysis [1], [2]. In late fusion, the feature vectors from individual modalities are modeled using classical classifiers, and the outputs of the classifiers on the unimodal feature vectors are fused using an ensemble approach [3]. These fusion strategies perform fairly well but cannot accommodate the contextual information among the utterances or the interlocutor state of the participant. More recently, attention-based contextual fusion and contextual cross-modality fusion strategies have shown promising results. In the contextual fusion technique, a bidirectional recurrent neural network (RNN) is used to extract the context between the utterances of a video [4]. In contextual cross-modality fusion, the importance of each modality is considered in multimodal fusion along with the contextual information [5]. In [6], dynamic fusion is performed by paying attention at each time step. In [7], evolutionary computing-based multi-layer feature optimization is used to improve the overall classification accuracy.
The sentiment or emotional state of a particular participant in the conversation is not considered in these models. Hence, the existing models fail to capture the flow of the conversation and the contextual information among the utterances. In reality, the contextual state and the sentiment or emotion of a particular party add considerable value to the overall result. The proposed model assumes that the sentiment or emotional state of an utterance depends mainly on the interlocutor state of the participant, the previous emotional state of the participant, and the context between the utterances [8]. By incorporating the interlocutor state of the particular participant and the context between the utterances, the proposed method outperforms the baselines by over 2%.
The main contributions of the proposed model are:
• An effective multimodal sentiment and emotion analysis technique is proposed to extract the contextual information among the utterances and accommodate the interlocutor state of a particular participant in the conversation.
• The pair-wise attention-based mechanism is used to understand the relationship and importance of modalities before fusion.
• The proposed model effectively captures the sentiment or emotional state of the participant in the conversation.
• The model is tested and validated on three standard datasets and the results are compared against the standard baselines for multimodal sentiment and emotion analysis in conversational videos.
The remainder of the article is structured as follows. Section II describes the important work carried out in multimodal sentiment and emotion analysis, context extraction between utterances, and traditional techniques in multimodal fusion. Section III presents the proposed attention-based multimodal sentiment and emotion analysis in conversation using the RNN model. Section IV presents the experimental setup, the results on three standard datasets, and a comparison of the proposed model against standard baselines. Finally, Section V concludes the paper and presents future work in multimodal affective computing in conversational videos.

II. Related Work
Sentiment analysis and emotion detection in conversation are popular research topics in multimodal affective computing [9] because of their applications in various areas such as sentiment analysis, health-care assistance devices, recommendation systems, education, and human-computer interaction [10]. Multimodal data carries information in three modes: text (transcribed audio), audio, and video. The traditional multimodal sentiment analysis and emotion detection technique extracts the unimodal features from the three modalities and then uses feature-level (early) fusion [11] [12], decision-level (late) fusion [13] [14] [15], or hybrid fusion [16] to merge effective information from the different modalities.
An utterance is a segment of a video (not necessarily a complete sentence), and a video review contains a sequence of such utterances. In utterance-level or segment-level sentiment and emotion classification, each segment of a video is analyzed and assigned a label [17]. Recently, many approaches have been proposed for analyzing sentiment and detecting emotion at the utterance level [1], [2]. In [18], the authors extracted acoustic, lexical, and visual features and combined SVM classifiers using an ensemble approach, which achieves better results than conventional methods. The authors of [19] fused acoustic and linguistic cues at the feature level using 3-D activation valence for emotion recognition. In [20], the authors extracted textual, speech, and visual features using convolutional neural networks and analyzed sentiment and emotion using multiple kernel learning.
In [21], acoustic information and visual cues are fused to model a multimodal emotion recognition system, and contextual information is used for sentiment and emotion analysis. In recent work on multimodal sentiment and emotion analysis in conversational videos, each utterance of a video is processed sequentially using an RNN. The model proposed in [8] propagates the context among the utterances and the sequential information to the next utterance. It uses bidirectional recurrent neural networks [22] to extract the context between the utterances and feeds the information sequentially. DialogueRNN [23] uses an attention-based pooling approach to capture the context of a particular utterance in the conversation. However, this pooling-based attention mechanism fails to consider the participant information of a particular utterance and its effect on other utterances. DialogueRNN uses a global state and a participant state for modeling multimodal emotion detection in conversation.
Other notable works include [24] [25] [26], where multimodal sentiment and emotion detection is addressed using deep learning-based models. Ghosal et al. [27] proposed a pair-wise attention-based method to understand the importance of individual modalities and the relationship between the modalities before fusion. Two-dimensional graph-based feature extraction methods using fuzzy logic are discussed in [28] [29] and [30]. The PRAAT software was used to extract the emotional state from voice [31]. The proposed model considers the context between the utterances, the interlocutor state of a participant, and the previous emotion state to effectively model the multimodal sentiment and emotion analysis system in conversational videos.

III. Proposed Methodology
The proposed attention-based multimodal sentiment and emotion analysis in conversation using RNN is discussed in detail in this section. An overview of the proposed model is as follows:
• First, the utterance-level features of the individual modalities (acoustic, textual, and visual) are extracted.
• The pair-wise attention-based mechanism is used to understand the relationship and importance of modalities before fusion.
• The gated recurrent unit (GRU), a variant of RNN, is used to model the interlocutor state of the participant, context extraction, and emotion decoding.
• Bimodal and trimodal fusion are performed by considering the previous emotional state, the importance of individual modality, and interlocutor state. A trimodal representation of feature vector acts as an input for final sentiment or emotion prediction.

IEMOCAP
The IEMOCAP dataset is a collection of 12 hours of two-way acted dyadic conversations among multiple speakers. Each conversational video is divided into multiple opinion segments called utterances. Each utterance is annotated with an emotion label such as anger, sadness, excitement, happiness, fear, neutral, or surprise. Videos with the labels angry, happy, sad, excited, frustrated, and neutral are considered for comparison with the state-of-the-art models.

CMU-MOSEI
The CMU-MOSEI dataset contains 3228 videos with 23453 small segments called utterances from 1000 speakers, collected from YouTube. CMU-MOSEI is a transcribed, gender-balanced, properly punctuated dataset. The average number of segments per video is 7.3 and the average length of each segment is 7.28 seconds. The total numbers of words and unique words in the utterances are 447143 and 23026, respectively. The dataset is manually labeled with six emotions: anger, disgust, fear, happiness, sadness, and surprise.

CMU-MOSI
The CMU-MOSI dataset contains 93 videos with 2199 utterances, in which 89 speakers review various products and topics in English. The average segment length is 4.2 seconds, with about 12 words per utterance. Each utterance is manually labeled by 5 assessors with a score ranging from -3 to +3. The average of the 5 assessors' scores is taken as the sentiment polarity. The video-level and utterance-level train-test distributions of the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets are shown in Table I. The label distribution statistics of the CMU-MOSI and IEMOCAP datasets are given in Table II and Table III, respectively.
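As a small illustration of the CMU-MOSI labeling scheme, the following sketch averages five assessor scores in the range -3 to +3. The function names and the zero threshold used to turn the average into a binary polarity are assumptions made for this example and are not taken from the dataset description.

```python
from statistics import mean

def sentiment_score(assessor_scores):
    """Average the per-assessor scores, each expected to lie in [-3, +3]."""
    if not assessor_scores:
        raise ValueError("at least one assessor score is required")
    if any(s < -3 or s > 3 for s in assessor_scores):
        raise ValueError("scores must lie in [-3, +3]")
    return mean(assessor_scores)

def polarity(score, threshold=0.0):
    """Binarize an averaged score; the threshold of 0 is an assumption."""
    return "positive" if score >= threshold else "negative"

scores = [2, 1, 3, 2, 2]          # five hypothetical assessor scores
avg = sentiment_score(scores)
label = polarity(avg)
```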

B. Feature Extraction
This section discusses the steps followed in extracting features from acoustic, text, and visual modalities.

Audio Feature Extraction
The open-source openSMILE toolkit [34] is used for acoustic feature extraction from the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets. The acoustic features are extracted at a frame rate of 30 Hz with a 100 ms sliding window. The dimension of the utterance-level acoustic features is 73, 73, and 384 for the CMU-MOSI, IEMOCAP, and CMU-MOSEI datasets, respectively.
Let $f_{ai}$ be the feature vector of the $i$-th segment. The acoustic feature vector $f_a$ is then represented by
$$f_a = \{f_{a1}, f_{a2}, \ldots, f_{an}\} \tag{1}$$
where $n$ is the number of segments or utterances.

Textual Feature Extraction
For the CMU-MOSI and IEMOCAP datasets, features from the text (transcribed audio) modality are extracted from each utterance using a convolutional neural network (CNN) [35]. First, each utterance is represented as word2vec vectors [36] to capture the context in the text. These word2vec vectors are processed using 3 convolutional layers. The three layers have feature maps of size 50, 75, and 100 with filters of sizes 2, 3, and 2, respectively. Max-pooling with a 2x2 window is applied after every convolutional layer. The fully connected layer receives its input from the last convolutional layer, and its output is fed to a softmax classifier. The fully connected layer has 600 neurons with the ReLU [37] activation function. The softmax output of the CNN is used as the textual features. GloVe embeddings are used for extracting textual features from the CMU-MOSEI dataset. The dimension of the utterance-level textual features is 100 for the CMU-MOSI and IEMOCAP datasets and 300 for the CMU-MOSEI dataset.
Let $f_{ti}$ be the feature vector of the $i$-th segment. The textual feature vector $f_t$ is then represented by
$$f_t = \{f_{t1}, f_{t2}, \ldots, f_{tn}\} \tag{2}$$
where $n$ is the number of segments or utterances.

Visual Feature Extraction
In the past, 3D convolutional neural networks have been successfully used for object detection and classification [38]. The results presented in [38] outperform traditional object tracking and detection and motivate the adoption of 3D-CNN in this work. Visual features are extracted using 3D-CNN from the CMU-MOSI and IEMOCAP datasets and using the Facet tool from the CMU-MOSEI dataset. The dimension of the utterance-level visual features is 100 for the CMU-MOSI and IEMOCAP datasets and 35 for the CMU-MOSEI dataset.
Let $f_{vi}$ be the feature vector of the $i$-th segment. The visual feature vector $f_v$ is then represented by
$$f_v = \{f_{v1}, f_{v2}, \ldots, f_{vn}\} \tag{3}$$
where $n$ is the number of segments or utterances.
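The utterance-level feature dimensions reported in this section can be collected into a small lookup table and used to check that each video contributes one vector of the expected size per segment. This is only an illustrative sketch: the names `FEATURE_DIMS` and `check_video_features` are hypothetical, and the dimensions are copied from the text above.

```python
# Utterance-level feature dimensions per dataset and modality, as stated in the text.
FEATURE_DIMS = {
    "CMU-MOSI":  {"acoustic": 73,  "textual": 100, "visual": 100},
    "IEMOCAP":   {"acoustic": 73,  "textual": 100, "visual": 100},
    "CMU-MOSEI": {"acoustic": 384, "textual": 300, "visual": 35},
}

def check_video_features(dataset, modality, segments):
    """Validate the per-segment vectors f_m1 ... f_mn of one video.

    Returns (n, d): the number of segments and the expected dimension.
    """
    expected = FEATURE_DIMS[dataset][modality]
    for i, vec in enumerate(segments, start=1):
        if len(vec) != expected:
            raise ValueError(
                f"segment {i}: expected {expected} features, got {len(vec)}"
            )
    return len(segments), expected

# Three hypothetical visual segments from a CMU-MOSEI video.
shape = check_video_features("CMU-MOSEI", "visual", [[0.0] * 35 for _ in range(3)])
```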

C. Problem Statement
Let $P_1$ and $P_2$ be the two participants in the conversation, and let $u_1, u_2, \ldots, u_n$ be the utterances, each uttered by either $P_1$ or $P_2$. Each utterance is assigned a sentiment score and one of the emotion labels, such as happy, sad, anger, surprise, disgust, and fear. Because each utterance is attributed to one of the participants, the average sentiment of that participant can be taken into account when computing the sentiment score or emotion label. This also avoids misclassification due to long pauses by a participant in the conversation. Let $u_t$ be the $t$-th utterance, uttered by party $p$ at timestamp $t$, which is represented by three modalities (text, visual, and acoustic):
$$u_t = (t_t, v_t, a_t) \tag{4}$$
where $t_t$, $v_t$, and $a_t$ are the textual, visual, and acoustic feature vectors of the $t$-th utterance at timestamp $t$, and $p \in \{P_1, P_2\}$.
The objective is to take the feature vectors from the three modalities of an utterance, the cumulative context representation of the conversation, and the emotional state of the previous participant as input, and to output the sentiment score and the associated emotion label.
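The problem setup can be pictured as a simple data structure: a conversation is an ordered list of utterances, each tagged with its speaker and carrying one feature vector per modality. The sketch below is illustrative only; the class name `Utterance`, its field names, and the toy feature values are assumptions, not part of the proposed model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    """One utterance u_t with its three modality feature vectors."""
    speaker: str                       # "P1" or "P2"
    textual: List[float]               # t_t
    visual: List[float]                # v_t
    acoustic: List[float]              # a_t
    emotion: Optional[str] = None      # e.g. "happy"; None when unlabeled
    sentiment: Optional[float] = None  # sentiment score, when available

def utterances_of(conversation: List[Utterance], speaker: str) -> List[Utterance]:
    """Return the utterances of one participant, in conversation order."""
    return [u for u in conversation if u.speaker == speaker]

# A toy dyadic conversation with made-up features and labels.
conversation = [
    Utterance("P1", [0.1, 0.2], [0.3], [0.4], emotion="happy", sentiment=1.5),
    Utterance("P2", [0.0, 0.5], [0.1], [0.2], emotion="sad", sentiment=-1.0),
    Utterance("P1", [0.2, 0.1], [0.2], [0.3], emotion="happy", sentiment=2.0),
]
p1 = utterances_of(conversation, "P1")
p1_mean_sentiment = sum(u.sentiment for u in p1) / len(p1)
```

Grouping utterances by speaker in this way is what makes a per-participant average sentiment available.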

D. Proposed Model Description
The sentiment or emotion of an utterance depends on the cumulative contextual state of the conversation, the interlocutor state, and the sentiment or emotional state of the previous participant. Hence, the proposed model considers the cumulative context and the emotion of the participants to predict the sentiment or emotional state of an utterance. The proposed model has three branches of recurrent neural networks (RNNs) to capture the participant interlocutor state, the cumulative context, and the sentiment or emotional state of the participant. Each modality uses one RNN to capture participant dyadic information, and another set of RNNs is used to capture the sentiment or emotional state of the participant. One RNN is used to capture the cumulative contextual information. A weighted pooling based pair-wise attention mechanism is applied to understand the relative importance of the individual modalities before fusion. Finally, the modalities are fused two at a time, and then all modalities are fused to form a trimodal representation feature vector for predicting the sentiment score or emotion label of an utterance.

Interlocutor State
The interlocutor state network captures and keeps track of the state of each participant involved in the multimodal conversation. The network has $n \times m$ RNNs, where $n$ is the number of participants and $m$ is the number of modalities. The output of the interlocutor state is the input for updating the cumulative contextual vector and for predicting the emotion or sentiment of the utterance. Initially, the interlocutor state is set to the null vector. For the utterance at timestamp $t$, the interlocutor state of a particular modality is updated from $i^{(m)}_{t-1}$ to $i^{(m)}_t$ using the feature representation of that modality at timestamp $t$ (that is, $f^{(t)}_t$, $f^{(a)}_t$, or $f^{(v)}_t$) and the attentive cumulative contextual vector representation until timestamp $t$ (that is, $C^{(t)}_t$, $C^{(a)}_t$, or $C^{(v)}_t$). The cumulative contextual vector is used along with the utterance representation in order to capture the contextual information of the conversation up to that timestamp. The interlocutor state update is described by the following formula and shown in Fig. 1.
$$i^{(m)}_t = \mathrm{GRU}\left(i^{(m)}_{t-1}, \left(f^{(m)}_t \oplus C^{(m)}_t\right)\right) \tag{5}$$
where $\oplus$ represents the concatenation operator and $m$ is the modality, with values $t$, $a$, or $v$.
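The interlocutor state update can be sketched with a minimal GRU cell written in plain Python. The gate equations follow the standard GRU formulation; the class name `GRUCell`, the random initialization, the toy dimensions, and the helper `update_interlocutor` are assumptions for illustration and do not reproduce the authors' implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class GRUCell:
    """Minimal GRU cell with update (z), reset (r), and candidate (n) gates."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W = {g: mat(hidden_size, input_size) for g in "zrn"}   # input weights
        self.U = {g: mat(hidden_size, hidden_size) for g in "zrn"}  # recurrent weights
        self.b = {g: [0.0] * hidden_size for g in "zrn"}            # biases

    def _pre(self, gate, x, h):
        """Pre-activation W x + U h + b for one gate."""
        wx, uh = matvec(self.W[gate], x), matvec(self.U[gate], h)
        return [a + b + c for a, b, c in zip(wx, uh, self.b[gate])]

    def __call__(self, h_prev, x):
        z = [sigmoid(v) for v in self._pre("z", x, h_prev)]
        r = [sigmoid(v) for v in self._pre("r", x, h_prev)]
        reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
        n = [math.tanh(v) for v in self._pre("n", x, reset_h)]
        # Blend the previous state with the candidate state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]

def update_interlocutor(cell, i_prev, f_t, C_t):
    """Feed the concatenation (f_t ⊕ C_t) and the previous state to the GRU."""
    return cell(i_prev, f_t + C_t)  # list concatenation plays the role of ⊕

f_t = [0.2, -0.1, 0.4, 0.0]   # toy utterance features for one modality
C_t = [0.1, 0.3, -0.2]        # toy attentive context vector
i_prev = [0.0, 0.0, 0.0]      # interlocutor state starts as the null vector
cell = GRUCell(input_size=len(f_t) + len(C_t), hidden_size=3)
i_next = update_interlocutor(cell, i_prev, f_t, C_t)
```

In practice a framework GRU cell would replace this hand-written one; the point here is only how the concatenated input and the previous state enter the update.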

Cumulative Context
In conversational sentiment analysis and emotion detection, to determine the sentiment or emotional state of an utterance at timestamp $t$, the preceding utterances at times earlier than $t$ can be considered as its cumulative context. The interlocutor state of the previous utterance (that is, $i^{(t)}_{t-1}$, $i^{(a)}_{t-1}$, or $i^{(v)}_{t-1}$) and the utterance-level modality representation at timestamp $t$ (that is, $f^{(t)}_t$, $f^{(a)}_t$, or $f^{(v)}_t$) are used to update the cumulative context vector representation from $c^{(m)}_{t-1}$ to $c^{(m)}_t$. This helps to capture the dependencies between the utterances and the participants. The cumulative context state update is described by the following formula and shown in Fig. 2.
$$c^{(m)}_t = \mathrm{GRU}\left(c^{(m)}_{t-1}, \left(f^{(m)}_t \oplus i^{(m)}_{t-1}\right)\right) \tag{6}$$
where $\oplus$ represents the concatenation operator and $m$ is the modality, with values $t$, $a$, or $v$. Weighted pooling based attention is performed over the cumulative context vector representations until timestamp $t$.

(7)
where $C^{(m)}_t$ is the attentive cumulative contextual vector.
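Weighted pooling over the context history can be illustrated as a softmax-weighted sum of the context vectors seen so far. The scoring function is not recoverable from the text here, so the sketch below assumes dot-product scoring against a query vector; the names `attentive_context` and `query` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_context(history, query):
    """Pool the context vectors c_1 ... c_t into one attentive vector.

    Each context vector is scored by its dot product with `query`
    (an assumed scoring rule), and the scores are turned into weights.
    """
    scores = [sum(c * q for c, q in zip(state, query)) for state in history]
    weights = softmax(scores)
    dim = len(history[0])
    pooled = [sum(w * state[k] for w, state in zip(weights, history))
              for k in range(dim)]
    return pooled, weights

history = [[1.0, 0.0], [0.0, 1.0]]                  # two toy context vectors
pooled_uniform, w_uniform = attentive_context(history, [0.0, 0.0])
pooled_peaked, w_peaked = attentive_context(history, [10.0, 0.0])
```

With a zero query all context vectors receive equal weight, while a query aligned with the first vector concentrates almost all of the weight on it.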

Emotion State
The emotion state network is used to decode the sentiment or emotional information encoded by the interlocutor state RNN. The previous emotion state output (that is, e(t) t-1 or e(a) t-1 or e(v) t-1 ) and the sentiment or emotional information of the interlocutor state (that is, i(t) t or i(a) t or i(v) t ) are the inputs to the emotion state RNN at timestamp t. Weighted pooling based pair-wise attention is performed on the output produced by the emotion state RNN to produce the relevant sentiment or emotion label. The steps in the emotion state update are described using the following formula and shown in Fig. 3.
where ⊕ represents the concatenation operator and m is the modality, with values t, a, or v.
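The way the three recurrent states feed one another for a single modality can be sketched as a loop over the utterances. This is a data-flow illustration under simplifying assumptions: `toy_cell` is a stand-in for a GRU cell, `mean_pool` replaces the attention weights with uniform pooling, and keeping one interlocutor state per speaker in a dictionary is an interpretation of the description, not a confirmed implementation detail.

```python
import math

def toy_cell(state, inputs):
    """Stand-in for a GRU cell: mixes the state with the mean of the inputs."""
    drive = sum(inputs) / len(inputs)
    return [math.tanh(0.5 * s + 0.5 * drive) for s in state]

def mean_pool(history):
    """Uniform pooling over the context history (attention weights omitted)."""
    dim = len(history[0])
    return [sum(state[k] for state in history) / len(history) for k in range(dim)]

def run_modality(features, speakers, hidden=3):
    """Process one modality of a conversation, utterance by utterance."""
    zero = [0.0] * hidden
    interlocutor = {p: list(zero) for p in set(speakers)}  # one state per participant
    context, emotion = list(zero), list(zero)
    history, emotions = [], []
    for f_t, p in zip(features, speakers):
        # Cumulative context: previous context, current features, speaker's prior state.
        context = toy_cell(context, f_t + interlocutor[p])
        history.append(context)
        attended = mean_pool(history)
        # Interlocutor state: current features plus the pooled context.
        interlocutor[p] = toy_cell(interlocutor[p], f_t + attended)
        # Emotion state: previous emotion state driven by the interlocutor state.
        emotion = toy_cell(emotion, interlocutor[p])
        emotions.append(emotion)
    return emotions

features = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]   # toy per-utterance features
speakers = ["P1", "P2", "P1"]
emotion_vectors = run_modality(features, speakers)
```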

Weighted Pooling Based Pair-Wise Attention and Bimodal Fusion
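For each timestamp $t$, the emotion state network produces one emotion vector per modality, namely $e^{(t)}_t$, $e^{(a)}_t$, and $e^{(v)}_t$. Weighted pooling based pair-wise attention [4] is performed on two emotion vectors at a time to obtain the bimodal representation emotion vectors. Let $X$ and $Y$ be the two emotion state outputs produced by the emotion state network at timestamp $t$. The weighted pooling based pair-wise attention mechanism is then performed as follows:
$$M_1 = X Y^{\top}, \quad M_2 = Y X^{\top} \tag{9}$$
$$N_1(i,j) = \frac{e^{M_1(i,j)}}{\sum_{k} e^{M_1(i,k)}} \tag{10}$$
$$N_2(i,j) = \frac{e^{M_2(i,j)}}{\sum_{k} e^{M_2(i,k)}} \tag{11}$$
$$O_1 = N_1 Y, \quad O_2 = N_2 X \tag{12}$$
$$A_1 = O_1 \odot X, \quad A_2 = O_2 \odot Y \tag{13}$$
$$B\_Fusion = A_1 \oplus A_2 \tag{14}$$
where $B\_Fusion$ is the bimodal fusion at timestamp $t$.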
The pair-wise matching matrices at timestamp t are calculated in equation (9), then the probability distribution scores (weights) of each modality are calculated in equation (10) and (11). Modality specific attentive representations are calculated in equation (12). An important component among the multiple modalities and utterances is calculated by performing element-wise matrix multiplication as shown in equation (13). Attentive matrix representations are then concatenated to produce bimodal representation at timestamp t as shown in equation (14). The steps in Weighted Pooling based pairwise attention and bimodal fusion at timestamp t are shown in Fig. 4.
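The sequence of steps described above (matching matrices, softmax weights, attentive representations, element-wise multiplication, concatenation) can be sketched in plain Python. The sketch assumes that X and Y are matrices with one row per utterance and that the computation follows the pair-wise attention formulation of Ghosal et al. [27]; the helper names are hypothetical.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def softmax_rows(M):
    """Apply a numerically stable softmax to every row of M."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def hadamard(A, B):
    """Element-wise product of two equally shaped matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def bimodal_fusion(X, Y):
    """Pair-wise attention between two modalities, then concatenation."""
    M1 = matmul(X, transpose(Y))                     # pair-wise matching matrices
    M2 = matmul(Y, transpose(X))
    N1, N2 = softmax_rows(M1), softmax_rows(M2)      # attention weights
    O1, O2 = matmul(N1, Y), matmul(N2, X)            # modality-specific attentive reps
    A1, A2 = hadamard(O1, X), hadamard(O2, Y)        # element-wise multiplication
    return [a1 + a2 for a1, a2 in zip(A1, A2)]       # row-wise concatenation

X = [[1.0, 0.0], [0.0, 1.0]]   # toy emotion vectors, modality 1
Y = [[1.0, 0.0], [0.0, 1.0]]   # toy emotion vectors, modality 2
fused = bimodal_fusion(X, Y)
```

For inputs of shape (u, d), the fused representation has shape (u, 2d), since the two attentive matrices are concatenated along the feature axis.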

Trimodal Fusion
The bimodal attentive representation and the emotional state of the utterance are used to obtain the trimodal representation. The bimodal attentive representation and the output of the emotion state RNN at timestamp t are concatenated to form the final trimodal attentive representation at timestamp t. The trimodal fusion at timestamp t is shown in Fig. 5.

E. Classification and Training
The trimodal sentiment or emotional representation is fed to a softmax classifier to predict the label of a test utterance in the conversation. The softmax classifier takes the concatenated sentiment or emotion vector $e^{(tav)}_t$ at timestamp $t$ as its input. The softmax output is represented as
$$p = \mathrm{softmax}\left(w^{(s)} e^{(tav)}_t + b^{(s)}\right) \tag{16}$$
where $w^{(s)}$ is the weight matrix, $b^{(s)}$ is the bias, and $p$ is the predicted distribution over the sentiment or emotion classes.
$$\hat{y} = \arg\max_{j} \, p[j] \tag{17}$$
where $\hat{y}$ is the predicted label of the test utterance.
The cross-entropy loss function $L(\theta)$ is used to train the model and is represented as
$$L(\theta) = -\frac{1}{N} \sum_{s=1}^{N} \sum_{c=1}^{M} y_{s,c} \log \hat{y}_{s,c} + \lambda \lVert \theta \rVert_2^2 \tag{18}$$
where $N$ is the number of utterances in the training data, $y_s$ and $\hat{y}_s$ are the true and predicted labels of the $s$-th utterance (with $y_{s,c}$ and $\hat{y}_{s,c}$ their entries for class $c$), $M$ is the number of categories (classes), and $\lambda$ is the L2-regularization weight. Adam [39] is used to optimize the parameters of the cross-entropy loss function because it adapts the learning rate for each parameter. The proposed algorithm for attention-based multimodal sentiment and emotion analysis in conversation using RNN is summarized in Table IV.
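The classification and training objective can be illustrated with a short sketch: a linear layer followed by softmax, an argmax for the predicted label, and a cross-entropy loss with an L2 penalty. The function names, the toy weights, and the exact form of the penalty (regularization weight times the sum of squared parameters, averaged negative log-likelihood) are assumptions for this example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, e):
    """Return the class distribution and the argmax label for input vector e."""
    logits = [sum(w * x for w, x in zip(row, e)) + bi for row, bi in zip(W, b)]
    p = softmax(logits)
    label = max(range(len(p)), key=p.__getitem__)
    return p, label

def cross_entropy(probs, labels, params, lam):
    """Mean negative log-likelihood plus an L2 penalty on the parameters."""
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    l2 = lam * sum(w * w for w in params)
    return nll + l2

W = [[1.0, 0.0], [0.0, 1.0]]   # toy weight matrix (2 classes, 2 features)
b = [0.0, 0.0]                 # toy bias
e_tav = [2.0, 0.0]             # toy trimodal representation of one utterance
p, y_hat = predict(W, b, e_tav)
flat_params = [w for row in W for w in row]
loss = cross_entropy([p], [0], flat_params, lam=0.01)
```

In a framework such as PyTorch, the L2 term is commonly supplied through the optimizer's weight decay setting rather than added to the loss by hand.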

IV. Results Analysis and Discussion
The proposed attention-based multimodal sentiment and emotion analysis framework for conversation using RNN is implemented in Python using PyTorch, with TensorFlow used as the backend. The model is evaluated on a Tesla K80 GPU with 12 GB of RAM. The experiments are conducted on three standard datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP. The experimental results of the proposed method are compared against standard baselines such as [25], [27], [40], [41], and [42]. The proposed model is evaluated using two metrics, classification accuracy and F1-score. First, the results are obtained for the pairs of modalities (text-audio, text-video, and audio-video), and then for all three modalities, with and without the attention mechanism. The comparison of the results of the proposed technique for sentiment analysis with and without attention is given in Tables V and VI. The results show that the attention-based model performs better than the standard baselines in all possible combinations of constituent modalities except for the audio-video combination on the CMU-MOSI dataset. The trimodal model performs better than the bimodal model. The emotion detection results on the CMU-MOSEI and IEMOCAP datasets with and without attention are shown in Tables VII and VIII. The results show that the attention-based models perform better than the standard baselines and the model without attention, except for the label happy in the CMU-MOSEI dataset. Fig. 6, Fig. 7, Fig. 8, and Fig. 9 show a comparison of the experimental results of the proposed method on the CMU-MOSEI, IEMOCAP, and CMU-MOSI datasets against the standard baselines. On the CMU-MOSI and CMU-MOSEI datasets, the trimodal models perform better than the bimodal and unimodal models, whereas the audio-video combination performs the worst among all possible combinations in sentiment classification.
For emotion classification, the proposed model obtains the best results on the CMU-MOSEI dataset, as it effectively uses all the available modalities and captures the contextual information, owing to the availability of a large dataset for training.
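The two evaluation metrics can be computed as in the following sketch. The text does not state which F1 averaging is used, so the support-weighted average over classes shown here is an assumption, and the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1-score of one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average of per-class F1-scores, weighted by class support."""
    labels = sorted(set(y_true))
    total = len(y_true)
    return sum(f1_for_label(y_true, y_pred, l) * y_true.count(l)
               for l in labels) / total

y_true = ["pos", "pos", "neg", "neg"]   # toy ground-truth labels
y_pred = ["pos", "neg", "neg", "neg"]   # toy predictions
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```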

V. Conclusion and Future Work
Multimodal fusion, capturing the interlocutor state of the participant, and understanding the context between the utterances are the most important issues in multimodal sentiment analysis and emotion detection in conversation. In this paper, features are first extracted from the individual modalities (textual, acoustic, and visual). Textual features are extracted using CNN and GloVe embeddings, audio features using the openSMILE toolkit, and visual features using 3D-CNN and the Facet toolkit. An attention-based pair-wise technique is used to extract the context between the utterances and to understand the importance of the constituent modalities before fusion. A recurrent neural network, more specifically a gated recurrent unit (GRU) based model, is used to capture the interlocutor state and to extract the context. By incorporating the contextual information, the interlocutor state, and the previous emotion state, the proposed model performs better than the standard baselines in terms of classification accuracy. In the future, we will explore techniques to handle more than two participants in conversational videos. We will also study feature selection methods to understand whether emotion-specific features can improve the overall classification accuracy.