Abstract
Automatic emotion recognition is widely recognized as a difficult task, even for humans. Because classification problems require sufficient data, we introduce a framework developed to generate labeled audio and thereby build our own database. In this paper we present a new model for audio-video emotion recognition based on Transfer Learning (TL). The idea is to combine a pre-trained Convolutional Neural Network (CNN), used as a high-level feature extractor, with a Bidirectional Recurrent Neural Network (BRNN) that handles inputs of variable sequence length. Throughout the design process we discuss both the main difficulties arising from the high complexity of the task, rooted in its inherently subjective nature, and the strong results obtained by testing the model on different databases, outperforming state-of-the-art algorithms on the SAVEE [3] database. Furthermore, we use this application to perform per-user precision classification in low-resource, real-world scenarios with promising results.
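To make the architecture described above concrete, the following is a minimal PyTorch sketch of the idea: a frozen, ImageNet pre-trained CNN extracts per-frame features, and a bidirectional LSTM aggregates them over clips of varying length. The ResNet-18 backbone, hidden size, and seven emotion classes are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence
from torchvision import models


class CnnBrnnEmotionNet(nn.Module):
    """Frozen pre-trained CNN per frame + bidirectional LSTM over the clip.

    Sketch only: backbone, hidden size, and class count are assumptions."""

    def __init__(self, num_classes: int = 7, hidden: int = 128):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")
        # Keep everything up to the 512-d global average pool; drop the classifier.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.cnn.parameters():  # transfer learning: freeze the extractor
            p.requires_grad = False
        self.rnn = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # frames: (batch, max_len, 3, 224, 224) zero-padded clips;
        # lengths: the true number of frames in each clip.
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, 512)
        # Packing lets the BRNN skip the padding, so clips of any length work.
        packed = pack_padded_sequence(feats, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h, _) = self.rnn(packed)
        # Concatenate the final forward and backward hidden states, then classify.
        return self.head(torch.cat([h[-2], h[-1]], dim=1))
```

The packed-sequence step is what addresses the variable-length issue named in the abstract: batches may mix clips of different durations, and the recurrent summary is computed only over real frames. An analogous branch over audio features (e.g. spectrogram frames) could feed the same recurrent stage.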
References
Brave, S., Nass, C.: Emotion in human-computer interaction. In: The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications (2002)
Sebe, N., Cohen, I., Huang, T.S.: Multimodal emotion recognition, 4 (2004)
Haq, S., Jackson, P.J.B., Edge, J.: Audio-visual feature selection and reduction for emotion classification. In: Proceedings of International Conference on Auditory-Visual Speech Processing (AVSP 2008), Tangalooma, Australia, September 2008
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Interspeech 2005, pp. 1517–1520 (2005)
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit. 44(3), 572–587 (2011)
Pini, S., Ahmed, O.B., Cornia, M., Baraldi, L., Cucchiara, R., Huet, B.: Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild. In: ICMI 2017, 19th ACM International Conference on Multimodal Interaction, Glasgow, United Kingdom, 13–17 November 2017 (2017)
Harár, P., Burget, R., Dutta, M.K.: Speech emotion recognition with deep learning. In: 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 137–140, February 2017
Lalitha, S., Madhavan, A., Bhushan, B., Saketh, S.: Speech emotion recognition. In: 2014 International Conference on Advances in Electronics Computers and Communications, pp. 1–4, October 2014
Ding, W., et al.: Audio and face video emotion recognition in the wild using deep neural networks and small datasets, pp. 506–513 (2016)
Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q.: Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 28(10), 3030–3043 (2018)
Gu, Y., Chen, S., Marsic, I.: Deep multimodal learning for emotion recognition in spoken language. CoRR, abs/1802.08332 (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates Inc. (2012)
Kahou, S.E., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI 2015, pp. 467–474. ACM, New York (2015)
Tarnowski, P., Kołodziej, M., Majkowski, A., Rak, R.: Emotion recognition using facial expressions. Procedia Comput. Sci. 108, 1175–1184 (2017)
Busso, C., et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI 2004, pp. 205–211. ACM, New York (2004)
Kim, Y., Lee, H., Provost, E.M.: Deep learning for robust feature generation in audiovisual emotion recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687–3691, May 2013
Ouyang, X., et al.: Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pp. 577–582. ACM, New York (2017)
Latif, S., Rana, R., Younis, S., Qadir, J., Epps, J.: Transfer learning for improving speech emotion classification accuracy. In: Interspeech 2018, pp. 257–261 (2018)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009 (2009)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015)
Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. CoRR, abs/1412.0767 (2014)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR, abs/1604.02878 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding. In: Interspeech 2018, pp. 3688–3692 (2018)
Huang, Z., Xue, W., Mao, Q.: Speech emotion recognition with unsupervised feature learning. Front. Inf. Technol. Electron. Eng. 16(5), 358–366 (2015). https://doi.org/10.1631/FITEE.1400323
Acknowledgments
This work has been partially funded by the Spanish Ministry of Economy and Competitiveness and the European Union (FEDER) within the framework of the project DSSL: "Deep and Subspace Speech Learning" (TEC2015-68172-C2-2-P). We also thank ESAI S.L. Estudios y Soluciones Avanzadas de Ingeniería for their support and partial funding of this research towards an automatic emotion recognition system.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cano Montes, A., Hernández Gómez, L.A. (2020). Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning. In: Abramowicz, W., Klein, G. (eds) Business Information Systems. BIS 2020. Lecture Notes in Business Information Processing, vol 389. Springer, Cham. https://doi.org/10.1007/978-3-030-53337-3_32
DOI: https://doi.org/10.1007/978-3-030-53337-3_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-53336-6
Online ISBN: 978-3-030-53337-3
eBook Packages: Computer Science (R0)