ABSTRACT
Multimodal emotion recognition is the task of identifying and understanding emotions by integrating information from multiple modalities, such as audio, visual, and textual data. However, the scarcity of labeled data poses a significant challenge for this task. To address it, this paper proposes a semi-supervised learning framework that incorporates consensus decision-making and label correction. Firstly, we employ supervised learning on the trimodal input data to establish robust initial models. Secondly, we generate reliable pseudo-labels for unlabeled data by leveraging consensus decision-making and label correction. Thirdly, we train the model in a supervised manner using both labeled and pseudo-labeled data. Moreover, the pseudo-label generation and semi-supervised training can be iterated to refine the model further. Experimental results on the MER 2023 dataset show the effectiveness of our proposed framework, which achieves significant improvements on the MER-MULTI, MER-NOISE, and MER-SEMI subsets.
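The pseudo-labeling step described above can be sketched as follows. This is a minimal, hypothetical illustration of consensus decision-making across the three unimodal predictors, not the paper's exact rule: a pseudo-label is accepted only when the audio, visual, and text models agree on the class and the averaged confidence clears a threshold; disagreeing or low-confidence samples are left unlabeled for the next iteration.

```python
import numpy as np

def consensus_pseudo_labels(audio_probs, visual_probs, text_probs, threshold=0.8):
    """Assign pseudo-labels to unlabeled samples by trimodal consensus.

    Each *_probs argument is an (N, C) array of class probabilities from one
    unimodal model. Returns (labels, keep_mask): the fused argmax label per
    sample, and a boolean mask marking which pseudo-labels are reliable.
    The agreement-plus-threshold rule here is an illustrative assumption.
    """
    preds = [p.argmax(axis=1) for p in (audio_probs, visual_probs, text_probs)]
    fused = (audio_probs + visual_probs + text_probs) / 3.0
    # Consensus: all three modalities predict the same class.
    agree = (preds[0] == preds[1]) & (preds[1] == preds[2])
    # Confidence gate on the fused (averaged) distribution.
    confident = fused.max(axis=1) >= threshold
    return fused.argmax(axis=1), agree & confident
```

Samples that fail the mask would simply stay in the unlabeled pool; as the models improve over iterations, more of them pass the consensus check, which mirrors the iterative refinement the abstract describes.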