Abstract
Multimodal emotion recognition fuses information from multiple modalities, such as audio, text, and vision, and has achieved strong results on emotion recognition tasks. How to exploit cross-modal interaction and fusion to transform sparse unimodal representations into compact multimodal ones has become a vital research topic in multimodal emotion recognition. However, the extracted unimodal features must be representative, and multimodal fusion can lose feature information, which poses a particular challenge for multimodal emotion recognition. To address these problems, this paper proposes a three-stage multimodal emotion recognition network based on text low-rank fusion that extracts unimodal features, combines bimodal features, and fuses multimodal features. Specifically, we introduce a Residual-based Attention Mechanism in the first, feature extraction stage, which filters out redundant information and retains valuable unimodal information. We then use a Cross-modal Transformer to perform inter-modal interaction. Finally, we introduce a Text-based Low-rank Fusion Module that enhances multimodal fusion by exploiting the complementarity between modalities, yielding comprehensive fused features. The proposed model achieves accuracies of 82.1%, 80.8%, and 83.0% on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets, respectively. Extensive ablation experiments verify the effectiveness and generalization of the model.
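To make the fusion stage concrete, the low-rank multimodal fusion idea the module builds on (Liu et al. [22]) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name, dimensions, and random factors are all hypothetical, and the modality-specific factor matrices would be learned in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_fusion(features, factors):
    """Fuse modality vectors with modality-specific low-rank factors.

    features: list of 1-D arrays, one per modality (e.g. text, audio, visual)
    factors:  list of arrays, each shaped (rank, d_m + 1, d_out)
    Returns a fused vector of length d_out.
    """
    fused = None
    for x, W in zip(features, factors):
        x_aug = np.concatenate([x, [1.0]])          # append 1 so unimodal terms survive
        proj = np.einsum('rio,i->ro', W, x_aug)     # (rank, d_out) projection per modality
        # elementwise product across modalities approximates the full tensor fusion
        fused = proj if fused is None else fused * proj
    return fused.sum(axis=0)                        # collapse the rank dimension

# toy text / audio / visual embeddings of different widths (hypothetical sizes)
dims, d_out, rank = [8, 5, 6], 4, 3
feats = [rng.standard_normal(d) for d in dims]
facts = [rng.standard_normal((rank, d + 1, d_out)) for d in dims]
z = low_rank_fusion(feats, facts)
print(z.shape)  # (4,)
```

The appended constant 1 is what lets unimodal and bimodal terms survive the product, so the fused vector mixes interactions of every order while avoiding the exponential cost of an explicit outer-product tensor.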
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Zou, S., Huang, X., Shen, X., Liu, H.: Improving multimodal fusion with main modal transformer for emotion recognition in conversation. Knowl. Based Syst. 258, 109978 (2022)
Zhao, S., Jia, G., Yang, J., Ding, G., Keutzer, K.: Emotion recognition from multiple modalities: fundamentals and methodologies. IEEE Signal Process. Mag. 38(6), 59–73 (2021). https://doi.org/10.1109/MSP.2021.3106895
Khalil, R.A., Jones, E., Babar, M.I., Jan, T., Zafar, M.H., Alhussain, T.: Speech emotion recognition using deep learning techniques: a review. IEEE Access 7, 117327–117345 (2019). https://doi.org/10.1109/ACCESS.2019.2936124
Zhao, Z., Liu, Q., Zhou, F.: Robust lightweight facial expression recognition network with label distribution training. In: AAAI Conference on Artificial Intelligence (2021). https://api.semanticscholar.org/CorpusID:235306283
Zhang, J., Xing, L., Tan, Z., Wang, H., Wang, K.: Multi-head attention fusion networks for multi-modal speech emotion recognition. Comput. Ind. Eng. 168, 108078 (2022)
Ji, Q., Zhu, Z., Lan, P.: Real-time nonintrusive monitoring and prediction of driver fatigue. IEEE Trans. Veh. Technol. 53(4), 1052–1068 (2004). https://doi.org/10.1109/TVT.2004.830974
Huang, C., Zaiane, O.R., Trabelsi, A., Dziri, N.: Automatic dialogue generation with expressed emotions. In: North American Chapter of the Association for Computational Linguistics (2018). https://api.semanticscholar.org/CorpusID:13788863
Busso, C., Bulut, M., Narayanan, S.S.: Toward effective automatic recognition systems of emotion in speech. (2014). https://api.semanticscholar.org/CorpusID:31805666
Liu, S., Gao, P., Li, Y., Fu, W., Ding, W.: Multi-modal fusion network with complementarity and importance for emotion recognition. Inf. Sci. 619, 679–694 (2022)
Majumder, N., Hazarika, D., Gelbukh, A., Cambria, E., Poria, S.: Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl. Based Syst. 161, 124–133 (2018)
Gan, C., Wang, K., Zhu, Q., Xiang, Y., Jain, D.K., García, S.: Speech emotion recognition via multiple fusion under spatial-temporal parallel network. Neurocomputing 555, 126623 (2023)
Cheng, J., Dong, L., Lapata, M.: Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733 (2016)
Dai, W., Cahyawijaya, S., Liu, Z., Fung, P.: Multimodal end-to-end sparse model for emotion recognition. arXiv preprint arXiv:2103.09666 (2021)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
Chumachenko, K., Iosifidis, A., Gabbouj, M.: Self-attention fusion for audiovisual emotion recognition with incomplete data. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 2822–2828 (2022)
Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A., Morency, L.-P.: Efficient low-rank multimodal fusion with modality-specific factors. In: Annual Meeting of the Association for Computational Linguistics (2018). https://api.semanticscholar.org/CorpusID:44131945
Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.-P.: Tensor fusion network for multimodal sentiment analysis. In: Conference on Empirical Methods in Natural Language Processing (2017). https://api.semanticscholar.org/CorpusID:950292
Zhalehpour, S., Onder, O., Akhtar, Z., Erdem, C.E.: Baum-1: a spontaneous audio-visual face database of affective and mental states. IEEE Trans. Affect. Comput. 8(3), 300–313 (2017). https://doi.org/10.1109/TAFFC.2016.2553038
Dhall, A., Goecke, R., Lucey, S., Gedeon, T.: Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2106–2112 (2011). https://doi.org/10.1109/ICCVW.2011.6130508
McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The semaine database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012). https://doi.org/10.1109/T-AFFC.2011.20
Zadeh, A., Zellers, R., Pincus, E., Morency, L.-P.: Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016)
Zadeh, A., Liang, P.P., Poria, S., Cambria, E., Morency, L.-P.: Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In: Annual Meeting of the Association for Computational Linguistics (2018). https://api.semanticscholar.org/CorpusID:51868869
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Provost, E.M., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: Iemocap: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008)
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018)
Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8 (2013). https://doi.org/10.1109/FG.2013.6553805
Schmidt, P., Reiss, A., Dürichen, R., Marberger, C., Laerhoven, K.V.: Introducing wesad, a multimodal dataset for wearable stress and affect detection. Proceedings of the 20th ACM International Conference on Multimodal Interaction (2018)
Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the Munich versatile and fast open-source audio feature extractor. In: ACM International Conference on Multimedia (2010)
Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269 (2017)
Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019). https://api.semanticscholar.org/CorpusID:52967399
Hao, M., Cao, W., Liu, Z., Wu, M., Xiao, P.: Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features. Neurocomputing 391, 42–51 (2020)
Yan, J., Zheng, W., Cui, Z., Tang, C., Zhang, T., Zong, Y.: Multi-cue fusion for emotion recognition in the wild. Neurocomputing 309, 27–35 (2018)
Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
Cambria, E., Hazarika, D., Poria, S., Hussain, A., Subramanyam, R.B.V.: Benchmarking multimodal sentiment analysis. arXiv preprint arXiv:1707.09538 (2017)
Tsai, Y.-H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.-P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6558–6569 (2019)
Zhang, F., Li, X.-C., Lim, C.P., Hua, Q., Dong, C.-R., Zhai, J.-H.: Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Inf. Fusion 88, 296–304 (2022)
Huan, R., Zhong, G., Chen, P., Liang, R.: Unimf: a unified multimodal framework for multimodal sentiment analysis in missing modalities and unaligned multimodal sequences. IEEE Trans. Multimed. (2023). https://doi.org/10.1109/TMM.2023.3338769
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Neural Information Processing Systems (2017). https://api.semanticscholar.org/CorpusID:13756489
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Sun, Z., Sarma, P.K., Sethares, W.A., Liang, Y.: Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In: AAAI Conference on Artificial Intelligence (2019). https://api.semanticscholar.org/CorpusID:207930647
Wang, X., Girshick, R.B., Gupta, A.K., He, K.: Non-local neural networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
Williams, J., Kleinegesse, S., Comanescu, R., Radu, O.: Recognizing emotions in video using multimodal DNN feature fusion. (2018). https://api.semanticscholar.org/CorpusID:52000158
Zadeh, A., Liang, P.P., Mazumder, N., Poria, S., Cambria, E., Morency, L.-P.: Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927 (2018)
Lv, F., Chen, X., Huang, Y., Duan, L., Lin, G.: Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2554–2562 (2021). https://doi.org/10.1109/CVPR46437.2021.00258
Acknowledgements
This work is financially supported by the National Natural Science Foundation of China (No. 61573266) and the Natural Science Basic Research Program of Shaanxi (No. 2021JM-133).
Author information
Authors and Affiliations
Contributions
Linlin Zhao: Conceptualization, Methodology, Software, Writing-review & editing. Youlong Yang: Conceptualization, Formal analysis, Supervision, Funding acquisition. Tong Ning: Supervision, Writing-review & editing.
Corresponding author
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Additional information
Communicated by B. Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, L., Yang, Y. & Ning, T. A Three-stage multimodal emotion recognition network based on text low-rank fusion. Multimedia Systems 30, 142 (2024). https://doi.org/10.1007/s00530-024-01345-5