ABSTRACT
Multimodal emotion recognition is the task of identifying and understanding emotions by integrating information from multiple modalities, such as audio, visual, and textual data. However, the scarcity of labeled data poses a significant challenge for this task. To address it, this paper proposes a semi-supervised learning framework that incorporates consensus decision-making and label correction. Firstly, we employ supervised learning on the trimodal input data to establish robust initial models. Secondly, we generate reliable pseudo-labels for unlabeled data by leveraging consensus decision-making and label correction. Thirdly, we train the model in a supervised manner using both labeled and pseudo-labeled data. Moreover, the pseudo-label generation and semi-supervised training can be iterated to refine the model further. Experimental results on the MER 2023 dataset show the effectiveness of our proposed framework, which achieves significant improvements on the MER-MULTI, MER-NOISE, and MER-SEMI subsets.
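The pseudo-labeling step described above can be sketched as follows. This is a minimal, hypothetical illustration of consensus decision-making across the three unimodal predictors, not the paper's exact rule: a pseudo-label is accepted only when the audio, visual, and text models agree on the class and the averaged confidence clears a threshold; disagreeing or low-confidence samples are left unlabeled for the next iteration.

```python
import numpy as np

def consensus_pseudo_labels(audio_probs, visual_probs, text_probs, threshold=0.8):
    """Assign pseudo-labels to unlabeled samples by trimodal consensus.

    Each *_probs argument is an (N, C) array of class probabilities from one
    unimodal model. Returns (labels, keep_mask): the fused argmax label per
    sample, and a boolean mask marking which pseudo-labels are reliable.
    The agreement-plus-threshold rule here is an illustrative assumption.
    """
    preds = [p.argmax(axis=1) for p in (audio_probs, visual_probs, text_probs)]
    fused = (audio_probs + visual_probs + text_probs) / 3.0
    # Consensus: all three modalities predict the same class.
    agree = (preds[0] == preds[1]) & (preds[1] == preds[2])
    # Confidence gate on the fused (averaged) distribution.
    confident = fused.max(axis=1) >= threshold
    return fused.argmax(axis=1), agree & confident
```

Samples that fail the mask would simply stay in the unlabeled pool; as the models improve over iterations, more of them pass the consensus check, which mirrors the iterative refinement the abstract describes.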