Abstract
Automatic emotion recognition is widely recognized as a difficult task, even for humans. Because classification problems require sufficient data, we introduce a framework developed to generate labeled audio and thereby build our own database. In this paper we present a new model for audio-video emotion recognition based on Transfer Learning (TL). The idea is to combine a pre-trained Convolutional Neural Network (CNN), used as a high-level feature extractor, with a Bidirectional Recurrent Neural Network (BRNN) that handles inputs of variable sequence length. Throughout the design process we discuss both the main difficulties arising from the high complexity of the task, rooted in its inherently subjective nature, and the strong results obtained by testing the model on different databases, outperforming state-of-the-art algorithms on the SAVEE [3] database. Furthermore, we use this application to perform per-user precision classification in low-resource, real-world scenarios with promising results.
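To make the architecture described above concrete, the following is a minimal PyTorch sketch of the idea: a frozen, ImageNet pre-trained CNN extracts per-frame features, and a bidirectional LSTM aggregates them over clips of varying length. The ResNet-18 backbone, hidden size, and seven emotion classes are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence
from torchvision import models


class CnnBrnnEmotionNet(nn.Module):
    """Frozen pre-trained CNN per frame + bidirectional LSTM over the clip.

    Sketch only: backbone, hidden size, and class count are assumptions."""

    def __init__(self, num_classes: int = 7, hidden: int = 128):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        # Keep everything up to the 512-d global average pool; drop the classifier.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.cnn.parameters():  # transfer learning: freeze the extractor
            p.requires_grad = False
        self.rnn = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # frames: (batch, max_len, 3, 224, 224) zero-padded clips;
        # lengths: the true number of frames in each clip.
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, 512)
        # Packing lets the BRNN skip the padding, so clips of any length work.
        packed = pack_padded_sequence(feats, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h, _) = self.rnn(packed)
        # Concatenate the final forward and backward hidden states, then classify.
        return self.head(torch.cat([h[-2], h[-1]], dim=1))
```

The packed-sequence step is what addresses the variable-length issue named in the abstract: batches may mix clips of different durations, and the recurrent summary is computed only over real frames. An analogous branch over audio features (e.g. spectrogram frames) could feed the same recurrent stage.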
References
Brave, S., Nass, C.: Emotion in human-computer interaction. In: The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications (2002)
Sebe, N., Cohen, I., Huang, T.S.: Multimodal emotion recognition, 4 (2004)
Haq, S., Jackson, P.J.B., Edge, J.: Audio-visual feature selection and reduction for emotion classification. In: Proceedings of International Conference on Auditory-Visual Speech Processing (AVSP 2008), Tangalooma, Australia, September 2008
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Interspeech 2005, pp. 1517–1520 (2005)
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit. 44(3), 572–587 (2011)
Pini, S., Ahmed, O.B., Cornia, M., Baraldi, L., Cucchiara, R., Huet, B.: Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild. In: ICMI 2017, 19th ACM International Conference on Multimodal Interaction, Glasgow, United Kingdom, 13–17 November 2017 (2017)
Harár, P., Burget, R., Dutta, M.K.: Speech emotion recognition with deep learning. In: 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 137–140, February 2017
Lalitha, S., Madhavan, A., Bhushan, B., Saketh, S.: Speech emotion recognition. In: 2014 International Conference on Advances in Electronics Computers and Communications, pp. 1–4, October 2014
Ding, W., et al.: Audio and face video emotion recognition in the wild using deep neural networks and small datasets, pp. 506–513 (2016)
Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q.: Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 28(10), 3030–3043 (2018)
Gu, Y., Chen, S., Marsic, I.: Deep multimodal learning for emotion recognition in spoken language. CoRR, abs/1802.08332 (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates Inc. (2012)
Kahou, S.E., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI 2015, pp. 467–474. ACM, New York (2015)
Tarnowski, P., Kołodziej, M., Majkowski, A., Rak, R.: Emotion recognition using facial expressions. Procedia Comput. Sci. 108, 1175–1184 (2017)
Busso, C., et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI 2004, pp. 205–211. ACM, New York (2004)
Kim, Y., Lee, H., Provost, E.M.: Deep learning for robust feature generation in audiovisual emotion recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687–3691, May 2013
Ouyang, X., et al.: Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pp. 577–582. ACM, New York (2017)
Latif, S., Rana, R., Younis, S., Qadir, J., Epps, J.: Transfer learning for improving speech emotion classification accuracy. In: Interspeech 2018, pp. 257–261 (2018)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009 (2009)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015)
Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. CoRR, abs/1412.0767 (2014)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR, abs/1604.02878 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding. In: Interspeech 2018, pp. 3688–3692 (2018)
Huang, Z., Xue, W., Mao, Q.: Speech emotion recognition with unsupervised feature learning. Front. Inf. Technol. Electron. Eng. 16(5), 358–366 (2015). https://doi.org/10.1631/FITEE.1400323
Acknowledgments
This work has been partially funded by the Spanish Ministry of Economy and Competitiveness and the European Union (FEDER) within the framework of the project DSSL: "Deep and Subspace Speech Learning" (TEC2015-68172-C2-2-P). We also thank ESAI S.L. Estudios y Soluciones Avanzadas de Ingeniería for their support and partial funding of this research towards an automatic emotion recognition system.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cano Montes, A., Hernández Gómez, L.A. (2020). Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning. In: Abramowicz, W., Klein, G. (eds) Business Information Systems. BIS 2020. Lecture Notes in Business Information Processing, vol 389. Springer, Cham. https://doi.org/10.1007/978-3-030-53337-3_32
DOI: https://doi.org/10.1007/978-3-030-53337-3_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-53336-6
Online ISBN: 978-3-030-53337-3
eBook Packages: Computer Science (R0)