Abstract
Multimodal emotion recognition is becoming increasingly important in human–computer interaction because human communication carries emotional information across several channels. By considering multiple modalities simultaneously, multimodal emotion recognition improves both accuracy and robustness, and as emotion recognition becomes more central to human–computer interaction, automatic emotion detection systems grow increasingly necessary. However, the scarcity of labeled data poses a challenge for multimodal emotion recognition. To address this issue, we propose a transfer-learning approach that uses pretrained models such as RoBERTa together with attention mechanisms: self-attention to extract relevant features within each modality, and multi-head attention to fuse information across modalities. The aim of this paper is to provide a strategy for reliably predicting emotions from audio, visual, and textual inputs by merging and complementing hand-crafted features with those learned by deep networks. During the study, three popular multimodal emotion recognition datasets, IEMOCAP, CMU-MOSI, and CMU-MOSEI, are analyzed and ranked by quality. This study helps in constructing a network that places an appropriate amount of focus on each feature modality, through an architecture that efficiently combines textual features extracted by RoBERTa with features from the other modalities. As part of this work, a model that outperforms BERT-based baselines is introduced.
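The abstract does not specify the exact architecture, so the following is only a minimal PyTorch sketch of the fusion idea it describes: token-level text features come from a pretrained RoBERTa encoder, audio and video frame features are projected to the same width, and multi-head cross-attention fuses the modalities before classification. The feature dimensions (74 for audio, 35 for video, as in common CMU-MOSEI feature sets), the number of classes, and the mean-pooling strategy are illustrative assumptions, not the authors' confirmed design.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel


class AttentionFusionClassifier(nn.Module):
    """Hypothetical sketch: RoBERTa text encoding + multi-head attention fusion."""

    def __init__(self, audio_dim=74, video_dim=35, d_model=768,
                 n_heads=8, n_classes=6):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Project per-frame audio/video features into RoBERTa's hidden size.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Multi-head attention: text tokens attend over the other modalities.
        self.attn_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, input_ids, attention_mask, audio_feats, video_feats):
        # Contextual token embeddings, shape (batch, seq_len, d_model).
        text = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        audio = self.audio_proj(audio_feats)   # (batch, audio_len, d_model)
        video = self.video_proj(video_feats)   # (batch, video_len, d_model)
        # Text queries attend over audio and video key/value sequences.
        t2a, _ = self.attn_audio(text, audio, audio)
        t2v, _ = self.attn_video(text, video, video)
        # Mean-pool each stream, concatenate, and classify.
        fused = torch.cat([text.mean(1), t2a.mean(1), t2v.mean(1)], dim=-1)
        return self.classifier(fused)
```

Under these assumptions, the multi-head attention weights determine how much focus the network places on each modality for a given utterance, which is the balancing behavior the abstract describes.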
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sharma, D., Jayabalan, M., Sultanova, N., Mustafina, J., Yao, D.N.L. (2024). Multimodal Emotion Recognition Using Attention-Based Model with Language, Audio, and Video Modalities. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_15
DOI: https://doi.org/10.1007/978-981-97-0293-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics (R0)