Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning

  • Conference paper
Business Information Systems (BIS 2020)

Abstract

Automatic emotion recognition is renowned for being a difficult task, even for human intelligence. Given the importance of having enough data in classification problems, we introduce a framework developed to generate labeled audio and thereby create our own database. In this paper we present a new model for audio-video emotion recognition using Transfer Learning (TL). The idea is to combine a pre-trained Convolutional Neural Network (CNN), acting as a high-level feature extractor, with a Bidirectional Recurrent Neural Network (BRNN) that addresses the issue of variable-length sequence inputs. Throughout the design process we discuss the main difficulties arising from the inherently subjective nature of the task, as well as the strong results obtained by testing the model on different databases, outperforming the state-of-the-art algorithms on the SAVEE [3] database. Furthermore, we use this system to perform per-user precision classification in low-resource real-world scenarios with promising results.
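The pipeline the abstract describes (a frozen pre-trained CNN producing per-frame features, followed by a bidirectional RNN that maps a sequence of any length to a fixed-size embedding) can be sketched in miniature. Everything below is illustrative only: the tiny random linear map standing in for the pre-trained CNN, the hidden sizes, and the seven-class output standing in for an emotion set such as SAVEE's are assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_H, FRAME_W, FEAT_DIM, HID = 8, 8, 16, 12
N_EMOTIONS = 7  # hypothetical class count, e.g. a SAVEE-style label set

# Frozen stand-in for the pre-trained CNN feature extractor.
# A real system would run each frame through an ImageNet-trained network.
W_cnn = rng.standard_normal((FRAME_H * FRAME_W, FEAT_DIM)) * 0.1

def cnn_features(frames):
    """Map a (T, H, W) clip to (T, FEAT_DIM) per-frame features."""
    return frames.reshape(len(frames), -1) @ W_cnn

def make_rnn_params():
    # Input-to-hidden weights, hidden-to-hidden weights, bias.
    return (rng.standard_normal((HID, FEAT_DIM)) * 0.1,
            rng.standard_normal((HID, HID)) * 0.1,
            np.zeros(HID))

fwd_params, bwd_params = make_rnn_params(), make_rnn_params()

def rnn_last_state(feats, Wx, Wh, b):
    """Simple tanh RNN unrolled over however many frames exist."""
    h = np.zeros(HID)
    for x in feats:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def encode_clip(frames):
    """Variable-length clip -> fixed-size embedding via a bidirectional RNN:
    one pass forward in time, one over the reversed sequence, concatenated."""
    feats = cnn_features(frames)
    return np.concatenate([rnn_last_state(feats, *fwd_params),
                           rnn_last_state(feats[::-1], *bwd_params)])

W_out = rng.standard_normal((N_EMOTIONS, 2 * HID)) * 0.1

def predict(frames):
    """Softmax over emotion classes from the fixed-size clip embedding."""
    logits = W_out @ encode_clip(frames)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Clips of different lengths map to the same-sized representation,
# which is what lets a single classifier head handle them all.
short_clip = rng.standard_normal((5, FRAME_H, FRAME_W))
long_clip = rng.standard_normal((40, FRAME_H, FRAME_W))
print(encode_clip(short_clip).shape, encode_clip(long_clip).shape)
```

The design point this illustrates is that the recurrent encoder, not padding or cropping, absorbs the length variation: the CNN runs once per frame, and only the final forward and backward hidden states reach the classifier.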

References

  1. Brave, S., Nass, C.: Emotion in human-computer interaction. In: The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications (2002)

  2. Sebe, N., Cohen, I., Huang, T.S.: Multimodal emotion recognition, 4 (2004)

  3. Haq, S., Jackson, P.J.B., Edge, J.: Audio-visual feature selection and reduction for emotion classification. In: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP 2008), Tangalooma, Australia, September 2008

  4. Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008). https://doi.org/10.1007/s10579-008-9076-6

  5. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech, vol. 5, pp. 1517–1520 (2005)

  6. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit. 44, 572–587 (2011)

  7. Pini, S., Ahmed, O.B., Cornia, M., Baraldi, L., Cucchiara, R., Huet, B.: Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild. In: ICMI 2017, 19th ACM International Conference on Multimodal Interaction, Glasgow, United Kingdom, 13–17 November 2017 (2017)

  8. Harár, P., Burget, R., Dutta, M.K.: Speech emotion recognition with deep learning. In: 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 137–140, February 2017

  9. Lalitha, S., Madhavan, A., Bhushan, B., Saketh, S.: Speech emotion recognition. In: 2014 International Conference on Advances in Electronics Computers and Communications, pp. 1–4, October 2014

  10. Ding, W., et al.: Audio and face video emotion recognition in the wild using deep neural networks and small datasets, pp. 506–513 (2016)

  11. Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q.: Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 28(10), 3030–3043 (2018)

  12. Gu, Y., Chen, S., Marsic, I.: Deep multimodal learning for emotion recognition in spoken language. CoRR, abs/1802.08332 (2018)

  13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates Inc. (2012)

  14. Kahou, S.E., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI 2015, pp. 467–474. ACM, New York (2015)

  15. Tarnowski, P., Kołodziej, M., Majkowski, A., Rak, R.: Emotion recognition using facial expressions. Procedia Comput. Sci. 108, 1175–1184 (2017)

  16. Busso, C., et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI 2004, pp. 205–211. ACM, New York (2004)

  17. Kim, Y., Lee, H., Provost, E.M.: Deep learning for robust feature generation in audiovisual emotion recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687–3691, May 2013

  18. Ouyang, X., et al.: Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pp. 577–582. ACM, New York (2017)

  19. Latif, S., Rana, R., Younis, S., Qadir, J., Epps, J.: Transfer learning for improving speech emotion classification accuracy, pp. 257–261 (2018)

  20. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR 2009 (2009)

  21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2014)

  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015)

  23. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. CoRR, abs/1412.0767 (2014)

  24. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)

  25. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multi-task cascaded convolutional networks. CoRR, abs/1604.02878 (2016)

  26. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  27. Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding, pp. 3688–3692 (2018)

  28. Huang, Z., Xue, W., Mao, Q.: Speech emotion recognition with unsupervised feature learning. Front. Inf. Technol. Electron. Eng. 16(5), 358–366 (2015). https://doi.org/10.1631/FITEE.1400323

Acknowledgments

This work has been partially funded by the Spanish Ministry of Economy and Competitiveness and the European Union (FEDER) within the framework of the project DSSL: “Deep and Subspace Speech Learning (TEC2015-68172-C2-2-P)”. We also thank ESAI S.L. Estudios y Soluciones Avanzadas de Ingeniería for their support and for partially funding this research towards an automatic emotion recognition system.

Author information

Corresponding author

Correspondence to Antonio Cano Montes.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cano Montes, A., Hernández Gómez, L.A. (2020). Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning. In: Abramowicz, W., Klein, G. (eds) Business Information Systems. BIS 2020. Lecture Notes in Business Information Processing, vol 389. Springer, Cham. https://doi.org/10.1007/978-3-030-53337-3_32

  • DOI: https://doi.org/10.1007/978-3-030-53337-3_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-53336-6

  • Online ISBN: 978-3-030-53337-3

  • eBook Packages: Computer Science (R0)
