ABSTRACT
In this paper, we propose a novel framework that recognizes both discrete and dimensional emotions. In our framework, deep features extracted from foundation models serve as robust acoustic and visual representations of the raw video. We design three structures based on attention-guided feature gathering (AFG) for deep feature fusion, and introduce a joint decoding structure that performs emotion classification and valence regression in the decoding stage. An uncertainty-based multi-task loss is designed to optimize the whole process. Finally, combining the three structures at the posterior-probability level yields the final predictions of discrete and dimensional emotions. On the Multimodal Emotion Recognition Challenge (MER 2023) dataset, the proposed framework delivers consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
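The uncertainty-based multi-task loss mentioned above can be illustrated with a minimal sketch. It assumes the homoscedastic-uncertainty weighting of Cipolla, Gal, and Kendall (CVPR 2018); the function name and the plain-float interface are illustrative, not the paper's actual implementation, which would operate on framework tensors with learnable log-variance parameters.

```python
import math

def uncertainty_weighted_loss(cls_loss, reg_loss, log_var_cls, log_var_reg):
    """Combine a classification loss and a valence-regression loss using
    learned homoscedastic task uncertainties (Kendall et al., 2018).

    log_var_cls / log_var_reg are learnable log-variances: a larger variance
    down-weights its task, while the additive log term keeps the variance
    from growing without bound (i.e., from ignoring the task entirely).
    """
    w_cls = math.exp(-log_var_cls)        # 1 / sigma_cls^2
    w_reg = 0.5 * math.exp(-log_var_reg)  # 1 / (2 * sigma_reg^2)
    return w_cls * cls_loss + w_reg * reg_loss + 0.5 * (log_var_cls + log_var_reg)
```

With both log-variances at zero, the combined loss reduces to cls_loss + 0.5 * reg_loss; during training, the log-variances would be optimized jointly with the network weights so each task's weight adapts to its noise level.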
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023