Abstract
Multimodal sentiment analysis has attracted increasing attention and has broad application prospects. Most existing methods focus on a single modality and therefore cannot handle social media data, which typically combines several modalities. Moreover, most multimodal approaches simply combine the two modalities without exploring the complicated correlations between them, which leads to unsatisfactory performance on multimodal sentiment classification. Motivated by these limitations, we propose a Deep Multi-Level Attentive network (DMLANet), which exploits the correlation between the image and text modalities to improve multimodal learning. Specifically, we generate a bi-attentive visual map along the spatial and channel dimensions to strengthen the representational power of the convolutional neural network. We then model the correlation between image regions and word semantics by applying semantic attention to extract the textual features most relevant to the bi-attentive visual features. Finally, self-attention automatically selects the sentiment-rich multimodal features used for classification. Extensive evaluations on four real-world datasets, namely MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, verify our method's superiority.
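The abstract outlines a three-stage attention pipeline: a CBAM-style channel-plus-spatial bi-attention over CNN feature maps, semantic attention that conditions word features on the attended visual summary, and self-attention over the fused multimodal features. The sketch below is a minimal PyTorch rendering of that pipeline under stated assumptions; every module name, dimension, pooling choice, and the use of `nn.MultiheadAttention` for the final stage are illustrative guesses, not the paper's exact architecture.

```python
# Hedged sketch of a DMLANet-style pipeline; shapes and modules are assumptions.
import torch
import torch.nn as nn

class BiAttentiveVisual(nn.Module):
    """Channel + spatial attention over CNN feature maps (CBAM-style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (B, C, H, W)
        # Channel attention from average- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))     # bi-attentive map

class SemanticAttention(nn.Module):
    """Weight word features by their relevance to the visual summary."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, words, visual):          # words: (B, T, D), visual: (B, D)
        q = visual[:, None, :].expand_as(words)
        alpha = torch.softmax(self.score(torch.cat([words, q], -1)), dim=1)
        return (alpha * words).sum(dim=1)       # visually grounded text feature

class DMLANetSketch(nn.Module):
    """End-to-end sketch: bi-attention -> semantic attention -> self-attention."""
    def __init__(self, channels=2048, dim=256, num_classes=3):
        super().__init__()
        self.bi_att = BiAttentiveVisual(channels)
        self.vis_proj = nn.Linear(channels, dim)
        self.sem_att = SemanticAttention(dim)
        self.self_att = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, feat_map, words):        # feat_map: (B, C, H, W)
        v = self.vis_proj(self.bi_att(feat_map).mean(dim=(2, 3)))
        t = self.sem_att(words, v)
        m = torch.stack([v, t], dim=1)          # (B, 2, D) modality tokens
        fused, _ = self.self_att(m, m, m)       # self-attention across modalities
        return self.cls(fused.mean(dim=1))      # sentiment logits

if __name__ == "__main__":
    model = DMLANetSketch()
    logits = model(torch.randn(2, 2048, 7, 7), torch.randn(2, 12, 256))
    print(logits.shape)  # torch.Size([2, 3])
```

In this reading, the bi-attentive map refines *where* and *which* visual features matter before the text ever sees them, semantic attention then filters the words against that refined summary, and the final self-attention lets the two modality tokens re-weight each other before classification.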