Self-adaptive attention fusion for multimodal aspect-based sentiment analysis

Multimodal aspect term extraction (MATE) and multimodal aspect-oriented sentiment classification (MASC) are two crucial subtasks in multimodal sentiment analysis. The use of pretrained generative models has attracted increasing attention in aspect-based sentiment analysis (ABSA). However, the inherent semantic gap between textual and visual modalities poses a challenge in transferring text-based generative pretraining models to image-text multimodal sentiment analysis tasks. To tackle this issue, this paper proposes a self-adaptive cross-modal attention fusion architecture for joint multimodal aspect-based sentiment analysis (JMABSA), a generative model built on an image-text selective fusion mechanism that bridges the semantic gap between text and image representations and adaptively transfers a text-based pretraining model to the multimodal JMABSA task. We conducted extensive experiments on two benchmark datasets, and the experimental results show that our model outperforms other state-of-the-art approaches by a significant margin.


Introduction
Early sentiment analysis mainly focused on text, considering only the interrelationships between words and phrases to analyze sentiment [1]. In recent years, as content on internet social platforms has gradually shifted from purely textual to multimodal content, the task of multimodal sentiment analysis has received increasing attention. Common multimodal sentiment analysis tasks include video sentiment analysis [2] and image-text sentiment analysis.
As a fundamental sentiment analysis task, joint multimodal aspect-based sentiment analysis (JMABSA) aims to extract the potential aspect terms and identify the aspects' sentiment polarities simultaneously from text with the aid of images, which has received increasing attention over the past few years. For example, in Figure 1, the objective of JMABSA is to detect all aspect-sentiment pairs, i.e., (Roger Federer, Positive), (Gerry Berry, Positive) and (Gerry Weber Open, Neutral).
Multimodal aspect term extraction (MATE) and multimodal aspect sentiment classification (MASC) are the two subtasks contained in JMABSA. Most previous works cast JMABSA as a pipeline of these two subtasks. However, this step-by-step operation propagates errors generated in the first step to the next step, thereby reducing the sentiment analysis performance of the final results.
There have been many attempts to explore MATE and MASC based on pretrained models. Yu et al. [3] proposed a multimodal bidirectional encoder representation from transformers (BERT) for target-oriented sentiment classification, capturing the multimodal interactions with a target attention mechanism. Khan et al. [4] introduced a two-stream multimodal target sentiment classification model with BERT by combining text and image captions. Yu et al. [5] designed a hierarchical interactive multimodal transformer to identify aspect-oriented sentiment polarities by capturing text-image interactions. Ling et al. [6] presented a task-specific vision-language pretrained model based on bidirectional and auto-regressive transformers (BART) [36] for MASC. Unfortunately, there are considerable feature representation gaps, as visual and textual features are initialized with their corresponding modality-specific models. Therefore, directly incorporating visual features into pretrained textual models inevitably suffers from modality alignment ambiguity. The image is another modality that contains many helpful details, such as salient objects, scenario information and facial expressions. These visual details are valuable for aspect extraction and sentiment polarity identification. As depicted in Figure 1, it is challenging to determine the sentiment polarity of the aspects from the text alone, whereas the visual modality offers valuable clues, such as facial expressions, that assist in predicting the sentiment of Roger Federer and Gerry Berry. More concretely, in the MATE task, we prefer to capture the salient objects and the scenario information to enhance aspect term extraction, whereas in the MASC task, facial expressions are very helpful for identifying sentiment polarities. Therefore, visual information can be employed as pivotal information to bridge the task gap between MATE and MASC, alleviating the error propagation problem of JMABSA.
Although there is a significant modality gap between image and text, the two modalities can complement each other. Inspired by this observation, this paper develops a visual-textual interactive sequence-to-sequence (Seq2Seq) framework based on BART to address the joint aspect term extraction and aspect-oriented sentiment classification problems. The inclusion of image information is indispensable for the JMABSA task. However, the semantic gap between text and image impedes the effective integration of the two modalities in the multimodal BART model, so bridging this gap and establishing connections between the two modalities is of paramount importance for JMABSA. To address the semantic gap problem, this paper proposes an adaptive visual-to-textual fusion module. The contributions of this work are summarized as follows.
• To eliminate the inherent semantic gap between textual and visual modalities, we employ the image as pivotal information to connect the two modalities, and visual details are dynamically extracted to enhance the performance of JMABSA.
• An adaptive visual-to-textual fusion module is built to adaptively incorporate task-specific visual information into a pretrained BART encoder, encouraging the network to learn a multimodal representation.
• Experimental results on the TWITTER-15 and TWITTER-17 datasets show that the proposed approach significantly enhances the performance of MATE and MASC and improves F1 scores on both test sets. Moreover, our model nearly matches the performance of task-specific pretrained methods.

Text-based aspect-based sentiment analysis (ABSA)
Aspect-based sentiment analysis (ABSA) aims to identify sentiment polarities at the aspect level. To handle ABSA in different scenarios, several subtasks have been defined, and the main research line focuses on two primary ones: Aspect term extraction and aspect sentiment classification. For aspect term extraction, early works mainly focused on extracting sequence features via sequence tagging methods based on convolutional neural networks (CNN) [7] and recurrent neural networks (RNN) [8]. Recent works have explored Seq2Seq methods for aspect term extraction [9,10]. Similarly, for aspect sentiment classification, early studies were mainly based on manually designed features [11,12]. In recent years, various deep learning approaches have been proposed, including attention-based methods [13-17], CNN-based networks [18,19] and graph neural network (GNN) based methods [20-24]. Concurrently, the pretrained language model BERT [25] has demonstrated exceptional performance across numerous natural language processing (NLP) tasks. Li et al. [26] achieved favorable results by employing the BERT model for aspect-based sentiment classification. Since these two subtasks are highly dependent on each other, more recent studies attempt to solve them jointly.
Joint aspect sentiment analysis (JASA) aims to extract aspects and predict their sentiments jointly. Some studies leveraged pipeline methods to solve this problem [27,28], formulating the target extraction task as a sequence tagging problem. Hu et al. [29] proposed a span-based extract-then-classify framework. Recently, Yan et al. [30] proposed a unified generative framework based on BART and achieved state-of-the-art performance on JASA. Despite achieving remarkable improvement, all of the above studies focus only on the textual modality and fail to model visual guidance for either subtask. In our work, we propose a multimodal architecture that handles both subtasks jointly.

Multimodal aspect-based sentiment analysis (MABSA)
In the past few years, MABSA has drawn much attention. Existing studies on MABSA mostly focus on its two subtasks: MATE and MASC. As a pioneer, Xu et al. [31] first proposed the MASC task. Several studies have focused on modeling the interactions among the aspect, text and image based on attention mechanisms [31-34]. With the successful application of pretrained models to NLP tasks, Yu et al. [3] proposed a multimodal BERT architecture [25], which adapts BERT to obtain textual features and interactions between the textual and visual modalities. Moreover, Khan et al. [4] adapted a transformer architecture for image captioning, which translates the image input into an auxiliary sentence and then feeds the auxiliary sentence into a BERT language model. Despite these advances in MABSA, almost all of the methods handle each subtask independently, ignoring the innate connection between the two subtasks. Therefore, we aim to extend this line of research by proposing a more effective method that jointly performs MATE and MASC.
In recent years, inspired by the success of the JASA task, Ju et al. [35] introduced the task of joint multimodal aspect-sentiment analysis, which aims to jointly extract aspects and predict their sentiments from a text-image pair. More recently, Ling et al. [6] proposed a task-specific vision-language pretraining (VLP) framework for MABSA, a unified multimodal encoder-decoder architecture based on BART. Nevertheless, VLP fails to capture the alignment between the text and image modalities when transferring text-based generative pretraining models to the image-text multimodal sentiment analysis task. In contrast to VLP, our proposed model aims to bridge the semantic gap between text and image representations and transfer text-based pretraining models to the JMABSA task self-adaptively.

Methodology
Our proposed self-adaptive attention fusion (SAAF) model mainly focuses on building an effective model to bridge the semantic gap between text and image. As shown in Figure 2, SAAF comprises several parts: Feature extraction, an adaptive visual-to-textual fusion layer and a visual-enhanced BART module.
Task definition. We conceptualize the JMABSA task as a sequence generation problem. Consider $D$ as a set comprising multimodal samples. Formally, we are given a multimodal tweet comprising an image denoted as $V$ and a sentence with $n$ words denoted as $T = (t_1, t_2, \ldots, t_n)$. Our goal is to obtain the sequence $y$ that represents all potential aspect terms along with their respective sentiment polarities. We formulate the output as $y = (a^b_1, a^e_1, s_1, \ldots, a^b_i, a^e_i, s_i, \ldots, a^b_k, a^e_k, s_k)$, where $a^b_i$ and $a^e_i$ denote the beginning and end indices of the $i$-th aspect, $s_i$ denotes the sentiment polarity toward that aspect and $k$ represents the number of aspect terms contained in $T$.
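For concreteness, the following minimal sketch (Python, not taken from the authors' code) shows how such a target index sequence could be assembled from annotated aspect spans; the sentiment-to-index mapping and the token indices in the example are hypothetical.

```python
# Minimal sketch: building y = (a_1^b, a_1^e, s_1, ..., a_k^b, a_k^e, s_k)
# from span-level annotations. Not the authors' released code.
from typing import List, Tuple

# Illustrative sentiment-to-index mapping; the actual ids used by the decoder differ.
SENTIMENT_ID = {"negative": 0, "neutral": 1, "positive": 2}

def build_target_sequence(aspects: List[Tuple[int, int, str]]) -> List[int]:
    """Each aspect is (begin index, end index, sentiment polarity)."""
    y: List[int] = []
    for begin, end, polarity in aspects:
        y.extend([begin, end, SENTIMENT_ID[polarity]])
    return y

# Figure 1 example with hypothetical token indices:
# (Roger Federer, Positive), (Gerry Berry, Positive), (Gerry Weber Open, Neutral)
print(build_target_sequence([(0, 1, "positive"), (5, 6, "positive"), (8, 10, "neutral")]))
```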

Feature extraction
Text embedding. Given the competitive performance exhibited by the Seq2Seq pretrained model BART [36] in the context of JASA [30], it is adopted for acquiring word embeddings. Following the procedure delineated in [6], the markers <s> and </s> are employed to denote the start and end of a sentence. Formally, the textual representation of a sample is denoted as $E \in \mathbb{R}^{n \times d}$, where $d$ denotes the hidden dimension of BART, which is equal to 768.

Image embedding. The regional representations are obtained by Faster R-CNN [37]. Precisely, Faster R-CNN is adopted to extract all object proposals from the image $V$, and the 36 object proposals with the highest confidence are retained. The detected objects and their visual features are obtained as

$R' = \mathrm{FasterRCNN}(V),$

where $\mathrm{FasterRCNN}$ denotes the Faster R-CNN detector [37] and $R'$ denotes the extracted regional visual features. Then, $R'$ is projected to match the textual embedding size of BART. Finally, the visual sequence is denoted as

$R \in \mathbb{R}^{36 \times d}.$ (3.4)
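To illustrate the image side of this step, the sketch below projects Faster R-CNN region features to BART's hidden size; the 2048-dimensional region features and the single linear projection are assumptions for illustration, not details stated in the paper.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Sketch: project Faster R-CNN region features to BART's hidden size (d = 768).
    The 2048-d input size is the common bottom-up-attention setting and is an
    assumption here, not a detail given in the paper."""
    def __init__(self, region_dim: int = 2048, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, 36, region_dim) -> (batch, 36, hidden_dim)
        return self.proj(region_feats)

regions = torch.randn(2, 36, 2048)   # placeholder features for the 36 kept proposals
R = VisualProjection()(regions)      # R in R^{36 x d}
print(R.shape)                       # torch.Size([2, 36, 768])
```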

Adaptive visual-to-textual fusion layer
Cross-modal interaction. The multi-head self-attention layer [38] is utilized to capture intra-modal interactions within the text by aggregating information from nearby words through text self-attention:

$E_{\mathrm{self}} = \mathrm{Norm}(\mathrm{ATT}_{\mathrm{self}}(E, E, E)),$ (3.5)

where $\mathrm{ATT}_{\mathrm{self}}$ denotes the multi-head self-attention with the textual feature $E$ as the query/key/value matrix and $\mathrm{Norm}$ denotes layer normalization [39]. Simultaneously, a cross-modal transformer layer [40] is utilized to achieve inter-modal interaction across the text and visual modalities. In this context, the textual features $E$ function as the query matrix, while the visual features $R$ serve as the key/value matrix:

$E_{X \to V} = \mathrm{ATT}_{\mathrm{cross}}(E, R, R),$ (3.6)

where $\mathrm{ATT}_{\mathrm{cross}}$ denotes the cross-modal transformer.
Subsequently, $E_{X \to V}$ is fed into a feed-forward network (FFN) followed by a normalization layer, and an additional residual connection from $E$ is added to further enhance the textual representation:

$\hat{E} = \mathrm{Norm}(\mathrm{FFN}(E_{X \to V}) + E),$ (3.7)

where $\mathrm{FFN}$ denotes a feed-forward network [38] and $\hat{E}$ denotes the strengthened textual representation.
Through cross-modal interaction, the visual information and textual information are merged. Compared with previous work, our approach can better extract visual features that are closely related to the text in JMABSA.
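The sketch below illustrates this cross-modal interaction with standard PyTorch attention modules; the exact wiring (in particular, feeding the self-attended text into the cross-modal attention) and the layer sizes are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Sketch of text self-attention, text-to-image cross-attention, and an
    FFN with a residual connection from the original text features E."""
    def __init__(self, d: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm_self = nn.LayerNorm(d)
        self.norm_cross = nn.LayerNorm(d)
        self.norm_out = nn.LayerNorm(d)

    def forward(self, E: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # Intra-modal interaction: text attends to itself.
        E_self, _ = self.self_attn(E, E, E)
        E_self = self.norm_self(E_self)
        # Inter-modal interaction: text as query, visual regions as key/value.
        # (Using the self-attended text as the query is an assumption of this sketch.)
        E_xv, _ = self.cross_attn(E_self, R, R)
        E_xv = self.norm_cross(E_xv)
        # FFN followed by normalization, with a residual connection from E.
        return self.norm_out(self.ffn(E_xv) + E)

E_hat = CrossModalInteraction()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
print(E_hat.shape)   # torch.Size([2, 20, 768]): strengthened textual representation
```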
Selective fusion. With the strengthened textual representation $\hat{E}$ obtained through cross-modal interaction, the selective fusion further aims to filter out region features that are unrelated to the text. Essentially, the selective fusion receives two inputs: One is the strengthened textual representation $\hat{E}$, and the other is the purely visual feature $R$. Initially, $R$ and $\hat{E}$ are concatenated to generate a bimodal factor denoted as $[R; \hat{E}]$. This factor is then employed to compute the gate vector $g$:

$g = \mathrm{sigmoid}(W_g [R; \hat{E}] + b_g),$ (3.8)

where $W_g$ and $b_g$ are learnable parameters and $\mathrm{sigmoid}$ denotes a Sigmoid nonlinear activation function.
The selective fusion gate highlights the relevant information within the visual modality, conditioned on the textual representation that encompasses image information. Subsequently, the gate vector is utilized to acquire the textually related regional feature $\tilde{R}$ through the selective filter:

$\tilde{R} = g \odot R.$ (3.9)

Cross-modal mixup. To enhance the resilience of the multimodal representation, a cross-modal mixup module is devised. The core philosophy behind cross-modal mixup is to create new samples by linearly interpolating a pair of training samples, encouraging linear behavior between training samples. A particularly appealing implementation of such a multimodal data augmentation approach is studied in TMix [41]. The synthetic sample is generated as

$\hat{R} = \lambda \hat{E} + (1 - \lambda) \tilde{R},$ (3.10)

where $\lambda$ is a scalar balancing the textual and visual features, sampled from a Beta distribution:

$\lambda \sim \mathrm{Be}(\alpha, \beta),$ (3.11)

where $\mathrm{Be}$ denotes the Beta distribution and $\alpha$ and $\beta$ are hyperparameters that control the distribution of $\lambda$. $\hat{R}$ is produced as the ultimate visual representation.
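A rough sketch of the selective fusion gate followed by cross-modal mixup is shown below. Because the paper does not spell out how the different sequence lengths of the text and the 36 regions are reconciled, the sketch pools the text into a sentence vector before gating and mixing; that pooling, and the single linear gate, are assumptions.

```python
import torch
import torch.nn as nn

class SelectiveFusionWithMixup(nn.Module):
    """Sketch of the gate g = sigmoid(W[R; E] + b), the filter R~ = g * R, and the
    TMix-style interpolation R^ = lambda * E + (1 - lambda) * R~."""
    def __init__(self, d: int = 768, alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)                      # consumes the bimodal factor [R; E]
        self.beta_dist = torch.distributions.Beta(alpha, beta)

    def forward(self, E: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # E: (batch, n, d) strengthened text; R: (batch, 36, d) visual regions.
        e_pooled = E.mean(dim=1, keepdim=True).expand_as(R)  # assumption: broadcast pooled text
        g = torch.sigmoid(self.gate(torch.cat([R, e_pooled], dim=-1)))
        r_filtered = g * R                                   # keep text-related region features
        lam = self.beta_dist.sample().to(R.device)           # lambda ~ Be(alpha, beta)
        return lam * e_pooled + (1 - lam) * r_filtered       # ultimate visual representation

R_hat = SelectiveFusionWithMixup()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
print(R_hat.shape)   # torch.Size([2, 36, 768])
```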

Visual-enhanced BART module
The backbone of our model is BART [36], a Transformer-based Seq2Seq autoencoder. Following [6,42], the original BART model is transformed into a multimodal variant capable of encoding the multimodal input.
Encoder. The encoder of our model is based on a multilayer bidirectional Transformer. Following [42], two distinct tokens, <img> and </img>, are introduced to mark the start and end of the visual features generated by the multimodal interpolation layer. Subsequently, the original text representation $E$ and the visual representation enriched with multimodal information $\hat{R}$ are concatenated into the multimodal input $D$:

$D = E \oplus \hat{R},$

where $\oplus$ denotes the concatenation operation. Following this, $D$ is fed into the position embedding layer to derive the ultimate multimodal representation:

$D = \mathrm{PE}(D),$

where $D \in \mathbb{R}^{(n + 36) \times d}$ and $\mathrm{PE}$ denotes the position embedding layer. Finally, $D$ serves as the input to the multimodal BART encoder.
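The sketch below shows how the multimodal encoder input could be assembled as described above; the learned absolute position embedding is one possible realization of PE, and the <img>/</img> boundary tokens are omitted for brevity.

```python
import torch
import torch.nn as nn

def build_multimodal_input(E: torch.Tensor, R_hat: torch.Tensor,
                           pos_emb: nn.Embedding) -> torch.Tensor:
    """Sketch: D = E concatenated with R_hat, plus position embeddings, before the
    multimodal BART encoder. The learned position embedding is an assumption."""
    D = torch.cat([E, R_hat], dim=1)                      # (batch, n + 36, d)
    positions = torch.arange(D.size(1), device=D.device)
    return D + pos_emb(positions)                         # ultimate multimodal representation

pos_emb = nn.Embedding(512, 768)
D = build_multimodal_input(torch.randn(2, 20, 768), torch.randn(2, 36, 768), pos_emb)
print(D.shape)   # torch.Size([2, 56, 768])
```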
Decoder. The decoder of our model is also a multilayer Transformer. Different from the bidirectional encoder, the decoder is unidirectional. The output of the multimodal BART encoder is denoted as $H^m$.

The predicted distribution over the target index sequence is computed as

$P(y_t \mid y_{<t}, X) = \mathrm{softmax}(\mathrm{MLP}(h_t)),$

where $\mathrm{MLP}$ denotes the multilayer perceptron and $h_t$ denotes the decoder hidden state at step $t$. The loss function is the cross-entropy between the predicted label distribution and the true label distribution during training:

$\mathcal{L} = -\sum_{t} \log P(y_t \mid y_{<t}, X),$

where the true label sequence $y$ is provided by the dataset annotations and $X$ denotes the multimodal input.
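A minimal sketch of this training objective is shown below; the MLP shape, the size of the target vocabulary and the teacher-forced setup are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Sketch: an MLP over decoder hidden states yields a distribution over target
# indices, trained with cross-entropy against the gold index sequence y.
d, num_targets = 768, 256                      # num_targets is illustrative
mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, num_targets))

decoder_hidden = torch.randn(2, 9, d)          # decoder states for a length-9 target sequence
gold = torch.randint(0, num_targets, (2, 9))   # gold index sequence y
logits = mlp(decoder_hidden)                   # (batch, steps, num_targets)
loss = nn.functional.cross_entropy(logits.reshape(-1, num_targets), gold.reshape(-1))
loss.backward()                                # standard teacher-forced training step
print(float(loss))
```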

Dataset
Two benchmark datasets, TWITTER-15 and TWITTER-17, are employed for evaluation, following [3]. The detailed statistics of both datasets are shown in Table 1.

Evaluation metrics
The evaluation metrics employed to assess performance are the micro F1 score (F1), precision (P) and recall (R). The micro F1 score combines the precision and recall of the model, providing a comprehensive assessment of overall performance. Precision measures the model's ability to correctly predict positive samples, while recall gauges its success in capturing all positive samples. Together, these three metrics provide a thorough evaluation of the model's performance in multi-class classification, offering insights into different aspects of its effectiveness.
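For clarity, the sketch below computes micro precision, recall and F1 over predicted (aspect span, sentiment) pairs, which is how JMABSA predictions are typically scored; the pair representation is illustrative.

```python
# Sketch: micro precision/recall/F1 over exact-match (begin, end, sentiment) pairs.
from typing import List, Set, Tuple

Pair = Tuple[int, int, str]   # (begin index, end index, sentiment)

def micro_prf(pred: List[Set[Pair]], gold: List[Set[Pair]]):
    tp = sum(len(p & g) for p, g in zip(pred, gold))   # exact-match true positives
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(micro_prf([{(0, 1, "positive"), (5, 6, "neutral")}],
                [{(0, 1, "positive"), (5, 6, "positive")}]))   # (0.5, 0.5, 0.5)
```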

Implementation details
Our approach is implemented using PyTorch (torch-1.11.0) on an RTX 3070Ti GPU. The hidden size of our model is 768, the same as the hidden dimension of BART [36].
The model is trained with an early stopping mechanism to prevent overfitting. In particular, the training process spans up to 100 epochs, and the model's performance on the validation set is assessed after each epoch. Training ceases if the model fails to improve its F1 score on the validation set for p consecutive epochs, where p is a predefined hyperparameter. Subsequently, the final model is derived from the last checkpoint, and its performance is evaluated on the test set.
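The early stopping schedule described above can be sketched as follows; train_one_epoch and evaluate_f1 are stand-ins for the actual training and validation routines, which are not shown here.

```python
from typing import Callable

def train_with_early_stopping(train_one_epoch: Callable[[], None],
                              evaluate_f1: Callable[[], float],
                              patience: int, max_epochs: int = 100) -> float:
    """Sketch of the early stopping loop: stop after `patience` epochs without
    an improved validation F1 score."""
    best_f1, epochs_without_improvement = 0.0, 0
    for _ in range(max_epochs):
        train_one_epoch()                # one pass over the training set
        val_f1 = evaluate_f1()           # micro F1 on the validation set
        if val_f1 > best_f1:
            best_f1, epochs_without_improvement = val_f1, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                        # no F1 gain for p consecutive epochs
    return best_f1

# Toy usage with stand-in callables:
scores = iter([0.55, 0.60, 0.59, 0.58, 0.57])
print(train_with_early_stopping(lambda: None, lambda: next(scores, 0.0), patience=3))
```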

Baselines
We compare our SAAF model against two distinct categories of existing baseline systems.
We first evaluate text-only methodologies: 1) SPAN [29] is a span-based method that formulates the JASA task as a span prediction problem, 2) directional graph convolutional networks (D-GCN) [43] is a BERT-based graph convolutional network that formulates the JASA task as a sequence labeling problem to leverage syntactic information between words and 3) BART [30] is a unified generative framework based on BART that formulates the JASA task as an index generation problem.
Additionally, since there are few studies on JMABSA, the following multimodal strategies are taken into account: 1) Two pipeline approaches built from representative MATE and MASC methods, unified multimodal transformer (UMT)+TomBERT and OSCGA+TomBERT, 2) UMT-collapsed [44], OSCGA-collapsed [45] and relation propagation-based BERT (RpBERT)-collapsed [46], which are three collapsed tagging approaches, 3) JML [35], the first multimodal joint learning approach, which proposes an auxiliary relation detection module to control the exploitation of visual information, 4) VLP-MABSA [6], a unified multimodal encoder-decoder architecture for multimodal joint learning and 5) the cross-modal multitask transformer (CMMT) [47], which proposes a text-guided cross-modal interaction module to dynamically control the contributions of visual information.

Main results
In Table 2, the consistently superior performance of our underlying model, BART, in comparison to the other two text-only methods underscores its proficiency in joint learning tasks. For the multimodal methods, previous pipeline approaches and collapsed tagging approaches perform much worse than recent joint learning approaches, probably because of the error propagation problem when the two subtasks are carried out separately. As the first joint learning method, JML performs better than previous studies since the joint framework alleviates the error propagation problem. Moreover, our model outperforms VLP-MABSA by 2.5% and 2.0% with respect to the F1 score on TWITTER-15 and TWITTER-17, respectively. This mainly benefits from its generative paradigm, which is superior in joint learning tasks. In conclusion, our proposed SAAF model distinctly attains the highest performance, as evaluated by the F1 score, on the TWITTER-17 dataset. Furthermore, on the TWITTER-15 dataset, the F1 score of SAAF is only 0.2% lower than that of VLP-MABSA, which is heavily pretrained. This demonstrates that SAAF is competitive among the state-of-the-art methods and confirms the effectiveness of our SAAF model.

Ablation study of adaptive visual-to-textual fusion layer
Cross-modal interaction. To verify the effect of cross-modal interaction, the unprocessed raw textual representation is directly fed into both the selective fusion layer and the cross-modal mixup layer. The results are shown in Table 3. Without cross-modal interaction, the F1 scores on the TWITTER-15 and TWITTER-17 datasets drop by about 0.5% and 1.2%, respectively, compared to the full model. These results further prove that extracting visual features closely related to the text leads to better multimodal fusion.
Selective fusion. Table 4 reports the ablation study of the selective fusion layer, in which the unprocessed visual feature is fed into the cross-modal mixup layer instead of the fused representation. The performance drops sharply after the removal of selective fusion, illustrating the effectiveness of the selective fusion layer, which filters out region features unrelated to the text.

Cross-modal mixup. The effectiveness of the cross-modal mixup layer is evaluated by omitting it from the adaptive visual-to-textual fusion layer. As shown in Table 5, the performance decreases by 1.4% and 1.8% on the TWITTER-15 and TWITTER-17 datasets, respectively, after removing the cross-modal mixup layer, which illustrates the necessity of performing cross-modal mixup.

Selective fusion & cross-modal mixup. As can be seen in Table 6, w/o selective fusion & cross-modal mixup denotes the BART model with only our cross-modal interaction module. The model performs worse after removing both the selective fusion layer and the cross-modal mixup layer, which confirms the effectiveness of both layers. As indicated, the removal of either one or both modules produces varying degrees of performance decline. This underscores the efficacy of the individual components, thereby augmenting the dependability and interpretability of our model.

Case study
To further demonstrate the effectiveness of our approach, we randomly select three samples from the TWITTER-17 dataset for a case study. Table 7 presents the three test examples with predictions from two baseline methods, Multimodal-BART (denoted by M-BART) and VLP. In example (a), both M-BART and VLP erroneously extract the aspect term "Mott Basketball Camp." In example (b), M-BART fails to recognize the aspect term RutgersU, while VLP predicts the right aspect but wrongly predicts its sentiment as positive; meanwhile, M-BART also fails to correctly predict the sentiment toward the aspect term Obama. In example (c), M-BART fails to extract the aspect term Pillers1957, while VLP extracts the wrong aspect term (i.e., KSC U10). In contrast, across all cases our approach, SAAF, effectively extracts all aspect terms and accurately classifies their sentiments by adaptively fusing the visual and textual modalities for both subtasks within a generative framework.

Conclusions
In this paper, we propose a self-adaptive cross-modal attention fusion architecture. This architecture leverages a selective fusion mechanism between image and text to bridge the semantic gap and enables the adaptive transfer of text-based pretraining models to the multimodal JMABSA task. Experimental results show that our proposed approach generally outperforms many competitive unimodal and multimodal methods.

Figure 1 .
An example of the MABSA task.

Table 1 .
Statistics of two benchmark datasets for JMABSA.

Table 2 .
Comparison between previous methods and our SAAF model on two benchmark datasets. a denotes results from Ju et al., b denotes results from Liang et al. and c denotes results from Yang et al.

Table 3 .
Ablation study of cross-modal interaction.

Table 4 .
Ablation study of selective fusion.

Table 5 .
Ablation study of cross-modal mixup.

Table 6 .
Ablation study of selective fusion & cross-modal mixup.