Targeted aspect-based multimodal sentiment analysis: an attention capsule extraction and multi-head fusion network

Multimodal sentiment analysis has established its significance in a variety of domains. For sentiment analysis, different aspects across distinct modalities, all corresponding to one target, are processed and analyzed. In this work, we propose targeted aspect-based multimodal sentiment analysis (TABMSA) for the first time. Furthermore, an attention capsule extraction and multi-head fusion network (EF-Net) is devised for the task of TABMSA. A multi-head attention (MHA) based network and ResNet-152 are employed to process texts and images, respectively. The integration of MHA and the capsule network aims to capture the interaction among the multimodal inputs. In addition to the targeted aspect, information from the context and the image is also incorporated for sentiment prediction. We evaluate the proposed model on two manually annotated datasets. The experimental results demonstrate the effectiveness of our proposed model for this new task.


INTRODUCTION
Sentiment analysis, also referred to as sentiment classification, aims to extract opinions from large volumes of unstructured text and classify them into sentiment polarities: positive, neutral or negative [1]. To date, much of the work on sentiment analysis focuses on textual data [2]. Notably, with the advances of social media, it is important to precisely capture sentiment expressed through different modalities (i.e. textual, acoustic and visual) [3][4]. Recent reports reveal that nearly 40% of cellphone reviews on ZOL.com contain both text and image, and these attract over three times the attention of text-only reviews [2]. As such, the ability to analyze sentiment on multimodal data is increasingly important.
On current shopping and social platforms, where text and image information mutually reinforce and complement each other, models are dedicatedly devised to classify sentiment polarity by using both kinds of data and their latent relation [5]. Recent publications report achievements on the task of multimodal sentiment analysis. Xu et al. propose a Multi-Interactive Memory Network, together with an aspect-based multimodal sentiment analysis (ABMSA) dataset for model evaluation [2]. Yu et al. develop methods for target-oriented multimodal sentiment classification (TMSC) [5][6] by integrating attention mechanisms and the pre-trained ResNet [7]. Experimental results show that even higher accuracy can be obtained by incorporating the image into classical sentiment analysis.
On the other hand, in many scenarios sentiments towards different aspects of more than one entity are discussed in the same unit of text. Targeted aspect-based sentiment analysis (TABSA) combines the challenges and the strengths of aspect-based sentiment analysis and target-oriented sentiment analysis, and paves the way for greater depth of analysis. Namely, this task requires detecting the aspect category and the sentiment polarity for a given targeted entity. According to Saeidi et al., TABSA caters for more generic text by making fewer assumptions while requiring a finer-grained understanding, which is both novel and practical for sentiment analysis [8].
In this work, we introduce a new task, namely targeted aspect-based multimodal sentiment analysis (TABMSA), which integrates multimodal information into TABSA to facilitate sentiment analysis. That is, by exploiting information from both texts and images, sentiment classification with higher accuracy can be obtained. As illustrated in Table 1, there are three targets in the text: 'Dr Lucille Corti', 'Dr Lukwiya' and 'Uganda'. For the targets 'Dr Lucille Corti' and 'Dr Lukwiya', the aspects include 'event' and 'appearance'. Notably, the sentiment polarity for 'appearance' is positive according to the image, while that for 'event' is negative according to the text. In such cases, an approach that precisely captures the information of both texts and images is required.
In this paper, we propose an attention capsule extraction and multi-head fusion network (EF-Net) for the task of TABMSA. In our model, features are extracted with a multi-head self-attention network for the context, a bidirectional GRU for the targeted aspect, and ResNet-152 with a capsule network for the image, which preserves more of the related information. For multimodal interaction and fusion, the multi-head attention network is applied to maximize the contribution of each modality to sentiment prediction. Lastly, the multimodal representation, concatenated with the original semantic representation, is fed into the sentiment classifier. Experiments are conducted on two manually annotated multimodal datasets [5] to verify the effectiveness of EF-Net against baseline methods.

TABSA
Within TABSA, the sentiment associated with a specific aspect of an entity is discussed. As mentioned above, current work builds on the foundation of Saeidi et al.'s baseline methods and dataset [8]. As an example, Ma et al. develop an LSTM-based model that utilizes the commonsense knowledge provided in SenticNet to incorporate external knowledge [9]. Language models such as BERT have also been taken as alternatives [10]. Besides, a recurrent entity network is designed and deployed to track entity states via word-level information and sentence-level hidden memory [11]. In some studies, researchers exploit context-independent, randomly initialized vectors to represent the aspects, which fails to capture the interaction between an aspect and its context.

ABMSA
The text-image pair is the most common form of multimodal data [2]. In most cases, the joint use of these modalities not only enhances sentiment expression but also improves classification accuracy in sentiment analysis. As presented in [3], a co-memory attentional mechanism is established to interactively model the interaction between text and image, and thus analyze the effects of one modality on the other. Motivated by fine-grained sentiment analysis, the Multi-Interactive Memory Network is proposed to learn the interactive influences between cross-modality data and the self-influences within single-modality data [2]. Likewise, models such as TomBERT [5] and ESAFN [6] have also received great attention due to their superiority in multimodal sentiment classification tasks.

METHODOLOGY
The task of TABMSA can be formulated as follows: given an image $I$ and a text sequence $S$ of $n$ words containing $m$ targets, each target $t$ is associated with an aspect $a$. Our purpose is to determine the sentiment polarity towards the targeted aspect $(t, a)$ in $(S, I)$. The architecture of the EF-Net model is shown in Fig. 1. Our model mainly contains four layers: the feature extracting layer, the multimodal interaction layer, the multimodality fusion layer and the final classification layer. The model first extracts features from the texts and images and encodes them into corresponding representations in the feature extracting layer. Then multimodal information interaction is carried out to preserve the most relevant information. In the multimodality fusion layer, a multi-head attention-based fusion network is applied to filter and fuse the inter-modal information. The multimodality fusion outcome, together with the original semantic sequences, is concatenated for final sentiment prediction. The details of each part are described as follows. We start with a brief introduction of the multi-head attention (MHA) network, which is applied throughout our model.

Multi-Head Attention (MHA) network
The multi-head attention (MHA) network performs multiple attention functions in parallel and can be considered an improvement of the traditional attention mechanism [12]. Basically, traditional attention is defined as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$   (1)

where $Q$ stands for Query, $K$ for Key and $V$ for Value. The regulator $\sqrt{d_k}$ is used to constrain the dot-product values. In MHA, the inputs $Q$, $K$ and $V$ are first mapped through parameter matrices. Then the attention functions are computed in parallel and their outcomes are concatenated to obtain the multi-head attention value. Thus, we have

$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}$   (2)

$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$

where $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the projection parameter matrices for the corresponding inputs and $\mathrm{head}_i$ is the attention of the $i$-th head.
In this work, we also employ multi-head self-attention (MHSA), which can be regarded as a special case of MHA in which identical inputs are fed to the model, i.e. $Q = K = V = X$. The attention is then delivered as:

$\mathrm{MHSA}(X) = \mathrm{MHA}(X, X, X)$   (3)

where $X$ indicates a general input of the MHA network.
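To make eqns (1)-(3) concrete, below is a minimal PyTorch sketch of the multi-head attention module, assuming 300-dimensional (GloVe-sized) inputs and 4 heads; the class, variable names and layer sizes are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of scaled dot-product attention (eqn (1)) and multi-head
# attention (eqn (2)); MHSA (eqn (3)) is MHA applied to identical inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=300, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # Projection matrices W_i^Q, W_i^K, W_i^V (packed) and output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        b = q.size(0)
        # Project and split into heads: (batch, heads, seq_len, d_k)
        q = self.w_q(q).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, eqn (1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_k ** 0.5)
        out = torch.matmul(F.softmax(scores, dim=-1), v)
        # Concatenate the heads and project, eqn (2)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_k)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=300, n_heads=4)
x = torch.randn(2, 13, 300)       # (batch, sequence length, embedding dim)
self_attended = mha(x, x, x)      # MHSA: identical Q, K and V, eqn (3)
```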

Feature extracting layer
In this layer, both the texts and the images are sent to the model as inputs. For the textual data, we map the words into low-dimensional vectors by looking them up in pre-trained GloVe embeddings [12]. Thus, the word embeddings for the given text are obtained. Let $X_c$ be the word embeddings of the context, $X_t$ those of the target and $X_a$ those of the aspect.

Context representation. In addition to the semantic information, the position information is also considered. The relative distance between each word and the target is computed and the outcomes are represented as the position embeddings $X_p$. By employing the MHSA mechanism, the concatenation of the word embeddings and the position embeddings is transformed into the context representation $H_c$:

$H_c = \mathrm{MHSA}([X_c; X_p])$

The context representation $H_c$ retains the original semantic information and syntactic structure of the context to the greatest extent, so we obtain $h_c^{avg}$ after average pooling of $H_c$, which is used for feature fusion in the final sentiment classification:

$h_c^{avg} = \mathrm{AvgPool}(H_c)$

Targeted aspect representation. Seeing that both the target and the aspect are short text sequences, a Bi-GRU is employed to capture their semantic information. That is, the word embeddings of the target and the aspect are concatenated and sent to the Bi-GRU network. The targeted aspect representation is given by:

$H_{ta} = \mathrm{BiGRU}([X_t; X_a])$

Visual representation. In order to make full use of image information, an effective image recognition model, ResNet-152, is used for image feature extraction. For a specific input image $I$, we resize it to a 224×224-pixel image $I'$. With the pre-trained ResNet-152, the image feature is:

$R = \mathrm{ResNet}(I')$   (9)

where $R$ is a 7×7×2048-dimensional tensor. Nevertheless, since ResNet does not handle the position information of the target in the image, we feed $R$ into a one-layer capsule network. Thereby, the image representation $H_v$, which contains the position information of the target, is written as:

$H_v = \mathrm{CapsNet}(R)$   (10)

Targeted aspect specific image attention. Aiming to remove context unrelated to the target (e.g. the image background) and preserve the most related parts, an attention mechanism is applied, based on which the more essential image representation $h_v^{att}$ is defined as:

$h_v^{att} = \mathrm{softmax}\big((W_1 H_{ta})^{\top}(W_2 H_v)\big) H_v$   (11)

where $W_1$ and $W_2$ are trainable parameter matrices that map $H_{ta}$ and $H_v$ into a sub-space of the same dimension.
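As a rough illustration of the feature extracting layer, the sketch below encodes the concatenated target and aspect embeddings with a Bi-GRU and extracts the 7×7×2048 ResNet-152 feature map of eqn (9); the one-layer capsule network and the position embeddings are omitted, and the module names, sizes and torchvision weights API are assumptions on our part.

```python
# Feature extraction sketch: Bi-GRU for the targeted aspect, ResNet-152 for the image.
import torch
import torch.nn as nn
from torchvision import models

# Targeted aspect representation H_ta: concatenated target/aspect embeddings -> Bi-GRU
bigru = nn.GRU(input_size=300, hidden_size=150, bidirectional=True, batch_first=True)
x_target_aspect = torch.randn(2, 4, 300)   # (batch, target+aspect length, GloVe dim)
H_ta, _ = bigru(x_target_aspect)           # (2, 4, 300) after concatenating both directions

# Visual feature R: 7x7x2048 ResNet-152 feature map, flattened into 49 region vectors
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop average pooling and fc head
backbone.eval()
image = torch.randn(2, 3, 224, 224)        # the resized 224x224 input image I'
with torch.no_grad():
    R = backbone(image)                    # (2, 2048, 7, 7), eqn (9)
R = R.flatten(2).transpose(1, 2)           # (2, 49, 2048): one vector per image region
```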

Multimodal interaction layer
The multimodal interaction layer is responsible for analyzing the relations between the targeted aspect and the context and between the targeted aspect and the image, and thus distilling the key information from the multimodal inputs. The main purpose of this layer is to obtain the targeted aspect specific textual attention and the targeted aspect specific visual attention. Therefore, the MHA network is utilized to model the interaction with respect to the targeted aspect. For the targeted aspect $H_{ta}$ and the context $H_c$, we set $H_{ta}$ as $Q$ and $H_c$ as $K$ and $V$. The interaction between the targeted aspect and its context is then characterized by:

$H_{tc} = \mathrm{MHA}(H_{ta}, H_c, H_c)$   (12)

where $H_{tc}$ is the targeted aspect specific context representation.

Likewise, the targeted aspect specific image representation $H_{tv}$, indicating the interaction between the targeted aspect and the image, is:

$H_{tv} = \mathrm{MHA}(H_{ta}, h_v^{att}, h_v^{att})$   (13)
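Continuing the sketch, eqns (12) and (13) amount to two cross-attention calls where the targeted aspect acts as the query; the shared 300-dimensional feature space for all inputs (including a projected image representation) is an assumption.

```python
# Multimodal interaction sketch, reusing the MultiHeadAttention class from above.
H_c = torch.randn(2, 13, 300)       # context representation from MHSA (placeholder)
h_v_att = torch.randn(2, 49, 300)   # attended image regions, projected to 300-d (placeholder)

text_interact = MultiHeadAttention(d_model=300, n_heads=4)
image_interact = MultiHeadAttention(d_model=300, n_heads=4)

H_tc = text_interact(H_ta, H_c, H_c)            # eqn (12): aspect-specific context
H_tv = image_interact(H_ta, h_v_att, h_v_att)   # eqn (13): aspect-specific image
```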

Multimodality fusion layer
In practical use, not only the targeted aspect but also the context and the image carry sentiment information for determining sentiment polarities. Accordingly, following the multimodal interaction layer, the targeted aspect-specific representations from different modalities are incorporated. Instead of using a gated mechanism to control the contribution of each component, we take the three representations as the inputs of an MHA model for information fusion. By exploiting the MHA mechanism, the multimodal representation is given as:

$H_m = \mathrm{MHA}(H_{ta}, H_{tc}, H_{tv})$   (14)

where $H_m$ stands for the multimodal representation. In eqn. (14), we set $H_{ta}$ as $Q$, $H_{tc}$ as $K$ and $H_{tv}$ as $V$. Based on the multimodal representation $H_m$, we calculate $h_m$ via average pooling. This representation is further enriched by concatenating the average representations of the context and the image ($h_c^{avg}$ and $h_v^{avg}$):

$h = [h_m; h_c^{avg}; h_v^{avg}]$   (15)

where $h$ is the final representation with multimodal information.
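A sketch of the fusion step under the same assumptions: one more MHA pass over the aspect-specific representations (eqn (14)), average pooling, and concatenation with the pooled context and image representations (eqn (15)).

```python
# Multimodality fusion sketch.
fusion = MultiHeadAttention(d_model=300, n_heads=4)
H_m = fusion(H_ta, H_tc, H_tv)          # eqn (14): Q = H_ta, K = H_tc, V = H_tv

h_m = H_m.mean(dim=1)                   # average pooling of the multimodal representation
h_c_avg = H_c.mean(dim=1)               # pooled context representation
h_v_avg = h_v_att.mean(dim=1)           # pooled image representation
h_final = torch.cat([h_m, h_c_avg, h_v_avg], dim=-1)   # eqn (15): 900-d in this sketch
```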

Final Classification
The aforementioned final representation $h$ is fed into a softmax classifier to obtain the sentiment polarity distribution:

$\hat{y} = \mathrm{softmax}(Wh + b)$

where $W$ and $b$ are the trainable weight matrix and bias vector, and the output dimension equals $C$, the number of sentiment polarities.
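The classifier itself is a single linear layer followed by softmax; the 900-dimensional input and three output classes follow the sketch above rather than the paper's exact sizes.

```python
# Final classification sketch: linear projection + softmax over the polarities.
classifier = nn.Linear(900, 3)           # 3 polarities: positive, neutral, negative
logits = classifier(h_final)
probs = torch.softmax(logits, dim=-1)    # predicted sentiment polarity distribution
```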

Model training
Training is conducted by minimizing the categorical cross-entropy with L2 regularization:

$L = -\sum_{i=1}^{N}\sum_{j=1}^{C} y_i^j \log \hat{y}_i^j + \lambda \lVert \theta \rVert^2$

where $N$ is the number of aspect terms in the sentence and $C$ is the number of sentiment polarities. The parameter $y_i^j$ stands for the real sentiment distribution of the $i$-th aspect term and $\hat{y}_i^j$ is the predicted probability of the $j$-th sentiment polarity. Besides, $\lambda$ is the weight of the regularization term.
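A minimal training-step sketch: PyTorch's CrossEntropyLoss covers the categorical cross-entropy term, and the L2 regularization is approximated here by the optimizer's weight_decay, which is our assumption rather than the paper's stated implementation.

```python
# Training objective sketch: cross-entropy plus L2 regularization via weight decay.
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
# In practice the optimizer would cover all model parameters, not only the classifier.
optimizer = optim.Adam(classifier.parameters(), lr=0.001, weight_decay=1e-5)

labels = torch.tensor([0, 2])            # gold polarities for the toy batch of two samples
loss = criterion(logits, labels)         # expects raw logits, not softmax probabilities
optimizer.zero_grad()
loss.backward()
optimizer.step()
```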

Dataset
We manually annotate a large-scale TABMSA dataset based on two publicly available TMSC datasets, Twitter15 and Twitter17 [5]. Three experienced researchers working on natural language processing (NLP) are invited to extract targets and aspects in the sentences and label their sentiment polarities. To start with, 500 samples from the datasets are randomly picked in advance to reveal the most frequent target and aspect types, which are 'people', 'place', 'time', 'organization' and 'other'. The targets, as well as the corresponding aspects, are presented in Table 2. Considering the TABMSA task, each sample in our dataset is composed of an image and a text, together with targets and aspects of specific sentiment polarities. The expressed sentiment polarities are predefined as positive, neutral or negative. Details of our dataset are exhibited in Table 3.

Table 3. Statistics of the annotated datasets.
                              Twitter15   Twitter17
#Sentence                        3502        2910
#Label                              3           3
#Target-aspect pairs             5466        6427
Avg. #Aspect / sentence           1.6         2.2
Avg. text length / sentence      13.2        13.9
Max text length / sentence         36          31
Min text length / sentence          1           3

Experimental setting
As mentioned above, experiments are conducted on the annotated datasets to evaluate performance. We set the maximum padding length of the textual content to 36 for Twitter15 and 31 for Twitter17. The images are sent to the pre-trained ResNet-152 to obtain the 7×7×2048-dimensional visual feature tensor. For our model, we set the learning rate to 0.001, the dropout rate to 0.3 and the batch size to 128. The number of attention heads is 4.
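For reference, the reported hyper-parameters can be collected into a small configuration dictionary; the dictionary itself and its key names are our own convenience, not part of any released code.

```python
# Hyper-parameters reported in the experimental setting.
config = {
    "max_text_length": {"twitter15": 36, "twitter17": 31},
    "image_feature_shape": (7, 7, 2048),   # pre-trained ResNet-152 feature map
    "learning_rate": 0.001,
    "dropout": 0.3,
    "batch_size": 128,
    "attention_heads": 4,
}
```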

Model comparison
In order to verify the superiority of our model, we compare it with classical textual sentiment analysis methods (LSTM, GRU, ATAE-LSTM, MemNet and IAN) and representative multimodal sentiment analysis methods (Res-MemNet and Res-IAN).
LSTM is used to model the hidden states of the context. As a lighter version of LSTM, GRU has a simpler structure and a strong capability for modeling long text sequences. ATAE-LSTM [1] applies an LSTM with aspect-embedding concatenation, while its attention network selects the words of sentiment significance. MemNet [13] applies a multi-layer attention mechanism on top of a common word embedding layer. The representations in IAN [14] are modeled with LSTM-based interactive attention networks, where pooled hidden states are used to compute the attention scores. Res-MemNet and Res-IAN concatenate the max-pooling layer of ResNet with the hidden representation of MemNet or IAN for multimodal sentiment classification. Notably, for all the aforementioned models, the sentiment polarity distribution of the target is finally determined by a softmax classifier.

Main Results
In this experiment, we adopt accuracy and Macro-F1 as evaluation metrics. Table 4 shows the main results. On the classical TABSA task, the proposed model with the image processing part removed, labeled 'EF-Net (Text)', has the best and most consistent outcomes on both datasets. Among all the models, LSTM obtains the worst performance due to its inability to distinguish targets and contexts in the sentence. In comparison, with the analysis of target and aspect, performance improves considerably. Besides, the employment of the attention mechanism also contributes to the improvement in classification accuracy. EF-Net (Text) makes use of both the position information and the semantic information, so the representations in our model are considerably more informative for conveying sentiment. Furthermore, the MHA network captures the interaction between the targeted aspect and the context, so that more essential information is preserved for sentiment classification.
On the other hand, the multimodal sentiment analysis models are generally more competitive than the text-only ones. With the integration of visual context, even higher classification accuracy is attainable. On the task of TABMSA, EF-Net still significantly outperforms the baseline models: minimum performance gaps of 1.89% on Twitter15 and 0.9% on Twitter17 against the Res-EF-Net (Text) method can be observed in Table 4. Clearly, our model is a better alternative for multimodal sentiment analysis. Besides the effectiveness of EF-Net (Text), another explanation is that we fuse the image data with the texts while modeling the multimodal interaction, which exploits both the sentiment information and the relation between modalities. Since EF-Net is more capable of dealing with TABMSA, it is reasonable to expect even higher accuracy in further evaluation settings. We also examine the effect of the number of attention heads: due to the increasing number of parameters and the resulting overfitting, the classification accuracy drops when the head number continues to increase (i.e. 5, 6).
Fig. 3 shows an example of visual and textual attention visualization. For the text '@ABQJournal Bad accident at San Mateo and H751. Motorcycle hits car and flip', the corresponding image is presented in Fig. 3(a). The target and aspect in the sentence are 'San Mateo' and 'event', respectively. According to Fig. 3(b), our model pays more attention to the motorcycle within the image. In addition, the MHA model (with Head = 4) assigns more attention weight to words such as 'Motorcycle', 'bad' and 'accident', as shown in Fig. 3(c). Accordingly, our model classifies the sentiment polarity as negative, which demonstrates that it can properly capture the information and interaction of the modalities.

CONCLUSION
In this work, we present a novel multimodal sentiment analysis task, namely TABMSA. In line with this task, the EF-Net model is designed and deployed. We first construct the representations of the multimodal inputs. By employing the MHA network, the interactions between the different representations are captured to deliver more of the related information. Moreover, the targeted aspect representation is enriched with the fusion of context and image information, which improves the multimodal sentiment classification accuracy to a large extent. Experimental results validate that the proposed model consistently outperforms the baseline models.