
A Three-stage multimodal emotion recognition network based on text low-rank fusion

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Multimodal emotion recognition has achieved strong results by fusing information from multiple modalities such as audio, text, and visual signals. How to exploit multimodal interaction and fusion to transform sparse unimodal representations into compact multimodal representations has become a vital research topic in multimodal emotion recognition. However, the extracted unimodal features must be representative, and multimodal fusion can cause the loss of feature information, which poses a particular challenge for multimodal emotion recognition. To address these problems, this paper proposes a three-stage multimodal emotion recognition network based on text low-rank fusion that extracts unimodal features, combines bimodal features, and fuses multimodal features. Specifically, we introduce a Residual-based Attention Mechanism in the first, feature-extraction stage, which filters out redundant information and extracts valuable unimodal information. We then use a Cross-modal Transformer to carry out inter-modal interaction. Finally, we introduce a Text-based Low-rank Fusion Module that enhances multimodal fusion by leveraging the complementarity between modalities, ensuring comprehensive fused features. The accuracy of the proposed model on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets is 82.1%, 80.8%, and 83.0%, respectively. Extensive ablation experiments further verify the effectiveness and generalization of the model.
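
To make the fusion idea concrete, below is a minimal PyTorch sketch of a low-rank fusion layer in the spirit of the Text-based Low-rank Fusion Module named in the abstract, following the general low-rank multimodal fusion recipe. The class name, feature dimensions, rank, and initialization are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a low-rank multimodal fusion layer.
# All names, dimensions, and the initialization scale are assumptions for
# illustration; this is not the authors' released implementation.
import torch
import torch.nn as nn


class LowRankFusion(nn.Module):
    """Fuse audio, visual, and text vectors with rank-r modality factors."""

    def __init__(self, dim_a, dim_v, dim_t, out_dim, rank=4):
        super().__init__()
        self.rank = rank
        # One factor tensor per modality; the +1 lets us append a constant 1
        # so unimodal information survives the elementwise product.
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a + 1, out_dim) * 0.1)
        self.factor_v = nn.Parameter(torch.randn(rank, dim_v + 1, out_dim) * 0.1)
        self.factor_t = nn.Parameter(torch.randn(rank, dim_t + 1, out_dim) * 0.1)
        self.fusion_weights = nn.Parameter(torch.randn(1, rank) * 0.1)
        self.fusion_bias = nn.Parameter(torch.zeros(1, out_dim))

    @staticmethod
    def _append_one(x):
        ones = torch.ones(x.size(0), 1, device=x.device, dtype=x.dtype)
        return torch.cat([x, ones], dim=1)

    def forward(self, a, v, t):
        a, v, t = self._append_one(a), self._append_one(v), self._append_one(t)
        # Project each modality with its rank-r factors: (rank, batch, out_dim).
        proj_a = torch.einsum('bi,rio->rbo', a, self.factor_a)
        proj_v = torch.einsum('bi,rio->rbo', v, self.factor_v)
        proj_t = torch.einsum('bi,rio->rbo', t, self.factor_t)
        # Elementwise product across modalities approximates the full tensor
        # product at a fraction of the parameter count.
        fused = proj_a * proj_v * proj_t
        # Weighted sum over the rank dimension gives the fused representation.
        out = torch.einsum('kr,rbo->kbo', self.fusion_weights, fused).squeeze(0)
        return out + self.fusion_bias


if __name__ == "__main__":
    # Feature sizes below are placeholders (e.g. handcrafted audio/visual
    # descriptors and a BERT sentence embedding), not the paper's settings.
    fusion = LowRankFusion(dim_a=74, dim_v=35, dim_t=768, out_dim=128, rank=4)
    a, v, t = torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 768)
    print(fusion(a, v, t).shape)  # torch.Size([8, 128])
```

The low-rank factorization is what keeps the trilinear audio-visual-text interaction tractable: instead of materializing a full outer-product tensor, each modality is projected rank by rank and the projections are combined elementwise.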


Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work is financially supported by the National Natural Science Foundation of China (No. 61573266) and the Natural Science Basic Research Program of Shaanxi (No. 2021JM-133).

Author information


Contributions

Linlin Zhao: Conceptualization, Methodology, Software, Writing - review & editing. Youlong Yang: Conceptualization, Formal analysis, Supervision, Funding acquisition. Tong Ning: Supervision, Writing - review & editing.

Corresponding author

Correspondence to Linlin Zhao.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, L., Yang, Y. & Ning, T. A Three-stage multimodal emotion recognition network based on text low-rank fusion. Multimedia Systems 30, 142 (2024). https://doi.org/10.1007/s00530-024-01345-5

