Abstract
Multimodal emotion recognition is becoming increasingly important in human–computer interaction because human communication carries emotional information across several channels. By considering multiple modalities simultaneously, multimodal emotion recognition improves both accuracy and robustness, and as emotion recognition becomes more central to human–computer interaction, automatic emotion detection systems grow increasingly necessary. However, the scarcity of labeled data poses a challenge for multimodal emotion recognition. To address this issue, we propose a transfer-learning approach that uses pretrained models such as RoBERTa together with attention mechanisms: self-attention to extract relevant features within each modality, and multi-head attention to fuse information across modalities. The aim of this paper is to provide a strategy for reliably predicting emotions from audio, visual, and textual inputs by merging and complementing hand-crafted features with those learned by deep networks. During the study, three popular multimodal emotion recognition datasets, IEMOCAP, CMU-MOSI, and CMU-MOSEI, are analyzed and ranked by quality. This study helps in constructing a network that places an appropriate amount of focus on each feature modality, through an architecture that efficiently combines textual features extracted by RoBERTa with features from the other modalities. As part of this work, a model that outperforms BERT-based baselines is introduced.
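The abstract does not specify the exact architecture, so the following is only a minimal PyTorch sketch of the fusion idea it describes: token-level text features come from a pretrained RoBERTa encoder, audio and video frame features are projected to the same width, and multi-head cross-attention fuses the modalities before classification. The feature dimensions (74 for audio, 35 for video, as in common CMU-MOSEI feature sets), the number of classes, and the mean-pooling strategy are illustrative assumptions, not the authors' confirmed design.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel


class AttentionFusionClassifier(nn.Module):
    """Hypothetical sketch: RoBERTa text encoding + multi-head attention fusion."""

    def __init__(self, audio_dim=74, video_dim=35, d_model=768,
                 n_heads=8, n_classes=6):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Project per-frame audio/video features into RoBERTa's hidden size.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Multi-head attention: text tokens attend over the other modalities.
        self.attn_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, input_ids, attention_mask, audio_feats, video_feats):
        # Contextual token embeddings, shape (batch, seq_len, d_model).
        text = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        audio = self.audio_proj(audio_feats)   # (batch, audio_len, d_model)
        video = self.video_proj(video_feats)   # (batch, video_len, d_model)
        # Text queries attend over audio and video key/value sequences.
        t2a, _ = self.attn_audio(text, audio, audio)
        t2v, _ = self.attn_video(text, video, video)
        # Mean-pool each stream, concatenate, and classify.
        fused = torch.cat([text.mean(1), t2a.mean(1), t2v.mean(1)], dim=-1)
        return self.classifier(fused)
```

Under these assumptions, the multi-head attention weights determine how much focus the network places on each modality for a given utterance, which is the balancing behavior the abstract describes.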
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Sharma, D., Jayabalan, M., Sultanova, N., Mustafina, J., Yao, D.N.L. (2024). Multimodal Emotion Recognition Using Attention-Based Model with Language, Audio, and Video Modalities. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_15
DOI: https://doi.org/10.1007/978-981-97-0293-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics (R0)