ABSTRACT
In this paper, we propose a novel framework that recognizes both discrete and dimensional emotions. In our framework, deep features extracted from foundation models serve as robust acoustic and visual representations of the raw video. We design three structures based on attention-guided feature gathering (AFG) for deep feature fusion, and introduce a joint decoding structure that performs emotion classification and valence regression in the decoding stage. An uncertainty-based multi-task loss is designed to optimize the whole process. Finally, combining the three structures at the posterior-probability level yields the final predictions of discrete and dimensional emotions. On the Multimodal Emotion Recognition Challenge (MER 2023) dataset, the proposed framework delivers consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
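The uncertainty-based multi-task loss mentioned above can be illustrated with a minimal sketch. It assumes the homoscedastic-uncertainty weighting of Cipolla, Gal, and Kendall (CVPR 2018); the function name and the plain-float interface are illustrative, not the paper's actual implementation, which would operate on framework tensors with learnable log-variance parameters.

```python
import math

def uncertainty_weighted_loss(cls_loss, reg_loss, log_var_cls, log_var_reg):
    """Combine a classification loss and a valence-regression loss using
    learned homoscedastic task uncertainties (Kendall et al., 2018).

    log_var_cls / log_var_reg are learnable log-variances: a larger variance
    down-weights its task, while the additive log term keeps the variance
    from growing without bound (i.e., from ignoring the task entirely).
    """
    w_cls = math.exp(-log_var_cls)        # 1 / sigma_cls^2
    w_reg = 0.5 * math.exp(-log_var_reg)  # 1 / (2 * sigma_reg^2)
    return w_cls * cls_loss + w_reg * reg_loss + 0.5 * (log_var_cls + log_var_reg)
```

With both log-variances at zero, the combined loss reduces to cls_loss + 0.5 * reg_loss; during training, the log-variances would be optimized jointly with the network weights so each task's weight adapts to its noise level.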
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023