
DBT: multimodal emotion recognition based on dual-branch transformer

Published in The Journal of Supercomputing

Abstract

Labeled datasets for speech emotion recognition are scarce, because emotion is subjective and annotation requires considerable expert time to identify emotion categories. The wav2vec2.0 model, in contrast, learns general speech representations through self-supervised training, so we apply it to speech emotion recognition. We propose a multimodal dual-branch transformer network (DBT). In the speech branch, we first extract speech features with wav2vec2.0, then apply a fine-tuning strategy together with a self-attention-based interlayer feature fusion strategy, and finally classify emotions with a fully convolutional network. In the text branch, we use RoBERTa for text emotion recognition, and the two modalities are fused with an improved weighted Dempster–Shafer (DS) strategy. In addition, we propose an accuracy-weighted label smoothing method that further improves recognition accuracy. Comprehensive experiments on two benchmarks, IEMOCAP (English) and CASIA (Chinese), show that the proposed method achieves higher accuracy than state-of-the-art methods.
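The abstract does not specify the exact form of the improved weighted Dempster–Shafer fusion, so the following is only a minimal sketch of how the speech-branch and text-branch class probabilities could be combined with a reliability-weighted Dempster–Shafer rule. The emotion label set, the branch reliability weights, and the example probabilities are illustrative assumptions, not the authors' settings.

```python
import numpy as np

# Emotion classes and branch reliabilities below are illustrative assumptions.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def discount(probs, alpha):
    """Turn one branch's class probabilities into a mass function whose focal
    elements are the singleton emotions plus the whole frame Theta, with the
    branch reliability alpha in [0, 1] used as a discounting weight."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()
    singleton_mass = alpha * probs        # mass on each singleton {emotion_i}
    theta_mass = 1.0 - alpha              # undiscounted remainder on Theta
    return singleton_mass, theta_mass

def ds_combine(m1, t1, m2, t2):
    """Dempster's rule for two mass functions of the form produced above.
    Returns a fused probability vector over the emotion classes."""
    # Agreement: both sources support the same singleton, or one supports a
    # singleton and the other leaves its mass on Theta.
    fused = m1 * m2 + m1 * t2 + m2 * t1
    # Conflict: the two sources put mass on different singletons.
    conflict = m1.sum() * m2.sum() - np.dot(m1, m2)
    norm = 1.0 - conflict
    fused = fused / norm
    theta = (t1 * t2) / norm
    # Spread the residual Theta mass uniformly to obtain a proper distribution.
    return fused + theta / len(fused)

# Example: softmax outputs of the speech and text branches for one utterance.
speech_probs = [0.10, 0.55, 0.25, 0.10]
text_probs = [0.05, 0.70, 0.15, 0.10]
m_s, t_s = discount(speech_probs, alpha=0.8)   # assumed speech-branch weight
m_t, t_t = discount(text_probs, alpha=0.9)     # assumed text-branch weight
fused = ds_combine(m_s, t_s, m_t, t_t)
print(EMOTIONS[int(np.argmax(fused))], np.round(fused, 3))
```

With both weights set to 1 this rule reduces to a normalised element-wise product of the two probability vectors, while smaller weights let the more reliable branch dominate the fused decision.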


Data Availability Statement

The data used in this study cannot be made publicly available.

Notes

  1. http://www.chineseldc.org/resource_info.php?rid=76.

  2. https://huggingface.co/facebook/wav2vec2-large-960h-lv60.

  3. https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn.

  4. https://huggingface.co/roberta-large?text=The+goal+of+life+is+%3Cmask%3E.
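The pretrained checkpoints referenced in footnotes 2–4 can be loaded with the Hugging Face transformers library. This is a minimal loading sketch under that assumption, not the authors' training code; exposing every layer's hidden states is only one plausible way to access the per-layer features that the abstract's interlayer feature fusion would need.

```python
import torch
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Model,
    RobertaTokenizer,
    RobertaModel,
)

# Speech encoder: the English checkpoint from footnote 2
# (footnote 3 points to the corresponding Chinese checkpoint).
SPEECH_CKPT = "facebook/wav2vec2-large-960h-lv60"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(SPEECH_CKPT)
wav2vec2 = Wav2Vec2Model.from_pretrained(
    SPEECH_CKPT,
    output_hidden_states=True,  # expose every layer's features for interlayer fusion
)

# Text encoder: the RoBERTa checkpoint from footnote 4.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
roberta = RobertaModel.from_pretrained("roberta-large")

# One forward pass over a placeholder second of 16 kHz audio.
audio = torch.zeros(16000).numpy()
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
layer_feats = wav2vec2(**inputs).hidden_states  # tuple with one tensor per layer
```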


Funding

This study was supported by the National Key R&D Program of China under Grant 2020YFC0833102.

Author information

Contributions

YY, CH, YT, and YX contributed to the conception of the study. YY, CH, and YF performed the experiments. YY, YT, CH, and YX contributed significantly to the analysis and manuscript preparation. YY and CH performed the data analyses and wrote the manuscript. YT, YX, and XH helped perform the analysis with constructive discussions. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yan Tian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yi, Y., Tian, Y., He, C. et al. DBT: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79, 8611–8633 (2023). https://doi.org/10.1007/s11227-022-05001-5
