DOI: 10.1145/3607865.3613182

Semi-supervised Multimodal Emotion Recognition with Consensus Decision-making and Label Correction

Published: 29 October 2023

ABSTRACT

Multimodal emotion recognition is the task of identifying and understanding emotions by integrating information from multiple modalities, such as audio, visual, and textual data. However, the scarcity of labeled data poses a significant challenge for this task. To address it, this paper proposes a novel semi-supervised learning framework that incorporates consensus decision-making and label correction. Firstly, we employ supervised learning on the trimodal input data to establish robust initial models. Secondly, we generate reliable pseudo-labels for unlabeled data by leveraging consensus decision-making and label correction. Thirdly, we train the model in a supervised manner using both labeled and pseudo-labeled data. Moreover, the pseudo-label generation and semi-supervised training steps can be iterated to refine the model further. Experimental results on the MER 2023 dataset demonstrate the effectiveness of the proposed framework, which achieves significant improvements on the MER-MULTI, MER-NOISE, and MER-SEMI subsets.
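The abstract outlines an iterative pseudo-labeling loop. The Python sketch below illustrates one plausible reading of that loop; it is not the authors' code. Assumptions: each per-modality model exposes a hypothetical fit(pairs) / predict_proba(x) interface, where predict_proba returns a per-sample probability vector and each model internally selects its own modality from the trimodal sample; consensus is taken as unanimous agreement plus a confidence threshold; label correction is a margin test over the averaged trimodal prediction. The thresholds and decision rules are illustrative only.

```python
import numpy as np

def consensus_pseudo_labels(models, unlabeled, threshold=0.9):
    """Consensus decision-making (assumed rule): keep an unlabeled sample
    only when the audio, visual, and text models all predict the same
    class and their average confidence clears the threshold."""
    kept = []
    for sample in unlabeled:  # sample = (x_audio, x_video, x_text)
        probs = [m.predict_proba(x) for m, x in zip(models, sample)]
        preds = [int(np.argmax(p)) for p in probs]
        conf = float(np.mean([np.max(p) for p in probs]))
        if len(set(preds)) == 1 and conf >= threshold:
            kept.append((sample, preds[0]))
    return kept

def correct_labels(models, pseudo_set, margin=0.3):
    """Label correction (one plausible rule; the paper's exact criterion
    may differ): overwrite a pseudo-label when the averaged trimodal
    prediction prefers another class by a clear margin."""
    corrected = []
    for sample, y in pseudo_set:
        avg = np.mean([m.predict_proba(x) for m, x in zip(models, sample)],
                      axis=0)
        y_hat = int(np.argmax(avg))
        if y_hat != y and avg[y_hat] - avg[y] > margin:
            y = y_hat
        corrected.append((sample, y))
    return corrected

def train_iteratively(models, labeled, unlabeled, rounds=3):
    """The loop from the abstract: supervised warm-up, consensus
    pseudo-labeling with correction, mixed retraining, repeated."""
    for m in models:
        m.fit(labeled)                          # step 1: robust initial models
    for _ in range(rounds):                     # step 4: iterate to refine
        pseudo = consensus_pseudo_labels(models, unlabeled)  # step 2a
        pseudo = correct_labels(models, pseudo)              # step 2b
        for m in models:
            m.fit(labeled + pseudo)             # step 3: labeled + pseudo-labeled
    return models
```

In a scheme like this, the consensus threshold trades pseudo-label coverage against precision: a stricter threshold admits fewer but cleaner pseudo-labels, at the cost of leaving more of the unlabeled data unused per round.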


• Published in

  MRAC '23: Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing
  October 2023
  88 pages
  ISBN: 9798400702884
  DOI: 10.1145/3607865

  Copyright © 2023 ACM


• Publisher

  Association for Computing Machinery
  New York, NY, United States



• Qualifiers

  research-article

