
A Deep Multi-level Attentive Network for Multimodal Sentiment Analysis


Abstract

Multimodal sentiment analysis has attracted increasing attention and has broad application prospects. Most existing methods focus on a single modality and therefore fail to handle social media data, which spans multiple modalities. Moreover, most multimodal approaches simply combine the two modalities without exploring the complicated correlations between them, which leads to unsatisfactory performance in multimodal sentiment classification. Motivated by this status quo, we propose a Deep Multi-level Attentive Network (DMLANet), which exploits the correlation between the image and text modalities to improve multimodal learning. Specifically, we generate a bi-attentive visual map along the spatial and channel dimensions to magnify the representation power of the convolutional neural network. We then model the correlation between image regions and word semantics by applying semantic attention to extract the textual features related to the bi-attentive visual features. Finally, self-attention is employed to automatically select the sentiment-rich multimodal features for classification. Extensive evaluations on four real-world datasets, namely MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, verify our method's superiority.
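
As a rough illustration of the pipeline described in the abstract, the sketch below shows how the three attention levels could be wired together in PyTorch: channel-and-spatial (bi-attentive) weighting of a CNN feature map, semantic attention that grounds word features in the attended visual summary, and self-attention over the resulting multimodal features before classification. The module names, dimensions, and pooling choices are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the multi-level attention idea, assuming PyTorch.
# Module names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class BiAttentiveVisual(nn.Module):
    """Channel attention followed by spatial attention over a CNN feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                              # x: (B, C, H, W)
        # channel attention from average- and max-pooled descriptors
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ca[:, :, None, None]
        # spatial attention from channel-pooled maps
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa

class SemanticAttention(nn.Module):
    """Weight word features by their relevance to the attended visual summary."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, words, visual):                  # words: (B, T, D), visual: (B, D)
        v = visual.unsqueeze(1).expand(-1, words.size(1), -1)
        alpha = torch.softmax(self.score(torch.cat([words, v], dim=-1)).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * words).sum(dim=1)  # (B, D)

class DMLANetSketch(nn.Module):
    """Bi-attentive visual map -> semantic attention -> self-attentive fusion."""
    def __init__(self, channels=2048, dim=512, num_classes=3):
        super().__init__()
        self.visual_att = BiAttentiveVisual(channels)
        self.proj = nn.Linear(channels, dim)
        self.semantic_att = SemanticAttention(dim)
        self.self_att = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, feat_map, word_feats):           # feat_map: (B, C, H, W)
        v = self.proj(self.visual_att(feat_map).mean(dim=(2, 3)))  # bi-attentive visual vector
        t = self.semantic_att(word_feats, v)                       # visually grounded text vector
        seq = torch.stack([v, t], dim=1)                           # (B, 2, D) multimodal tokens
        fused, _ = self.self_att(seq, seq, seq)                    # self-attention over modalities
        return self.classifier(fused.flatten(1))                   # sentiment logits
```

In practice the feature map and word features would come from a pretrained CNN backbone and a text encoder, respectively; the sketch only fixes the attention flow between them.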



      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
        January 2023, 505 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/3572858
        • Editor: Abdulmotaleb El Saddik

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 January 2023
        • Online AM: 16 March 2022
        • Accepted: 5 February 2022
        • Received: 7 April 2021
        Published in TOMM Volume 19, Issue 1


        Qualifiers

        • research-article
        • Refereed
