DOI: 10.1145/3581783.3612859
Research article

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023

Published: 27 October 2023

ABSTRACT

In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models serve as robust acoustic and visual representations of the raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. We then introduce a joint decoding structure for emotion classification and valence regression in the decoding stage, together with an uncertainty-based multi-task loss that optimizes the whole process. Finally, by combining the three structures at the posterior-probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of the Multimodal Emotion Recognition Challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
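The pipeline outlined in the abstract (attention-based gathering of frame-level features, an uncertainty-weighted multi-task loss, and posterior-level fusion of several systems) can be illustrated with a minimal NumPy sketch. All function and variable names below are hypothetical, and the loss follows the widely used homoscedastic-uncertainty formulation of Cipolla, Gal, and Kendall (CVPR 2018) rather than the paper's exact implementation; the attention pooling is likewise a generic stand-in for the AFG structures, not their published design.

```python
import numpy as np

def attention_pool(frames: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Collapse frame-level features (T, D) into one utterance-level vector (D,).

    A generic attention pooling: each frame is scored against a learned query,
    scores are softmax-normalized, and frames are averaged with those weights.
    """
    scores = frames @ query / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames

def uncertainty_weighted_loss(cls_loss: float, reg_loss: float,
                              log_var_cls: float, log_var_reg: float) -> float:
    """Combine classification and valence-regression losses with learnable
    per-task log-variances (homoscedastic uncertainty weighting)."""
    return (np.exp(-log_var_cls) * cls_loss + 0.5 * log_var_cls
            + 0.5 * np.exp(-log_var_reg) * reg_loss + 0.5 * log_var_reg)

def fuse_posteriors(posteriors: np.ndarray) -> np.ndarray:
    """Average class posteriors from several fusion structures (S, C) and
    renormalize, giving the final discrete-emotion prediction."""
    fused = posteriors.mean(axis=0)
    return fused / fused.sum()
```

Fusing at the posterior level, rather than at the feature level, lets each fusion structure be trained independently and combined afterwards without any joint retraining.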


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

          Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions (24%)

