
Multilingual Speech Emotion Recognition Using Deep Learning Approach

  • Conference paper
Advances in Visual Informatics (IVIC 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14322))


Abstract

Emotion is an inherent part of human communication: while emotions are conveyed mostly through facial expressions, spoken words also carry cues that reflect a speaker's emotional state. This project researched and evaluated deep neural network performance on multilingual speech emotion recognition using the RAVDESS and EMO-DB emotional speech databases, both individually and combined. The methodology comprised five steps: data collection and speech signal extraction, signal conversion, image recognition using transfer learning, result validation, and deployment of the trained network in a graphical user interface (GUI). AlexNet and SqueezeNet were compared in a transfer learning setting by training the networks with different maximum epochs, learning rates, and image augmentations. AlexNet achieved a higher validation accuracy than SqueezeNet, reaching 66.20% when trained on the combined RAVDESS and EMO-DB databases. On the held-out test set of 264 samples, the trained model obtained an F1-score of 0.6253.
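The signal-conversion step described in the abstract — turning a speech waveform into a spectrogram image that a pretrained image classifier can consume — can be sketched as follows. This is an illustrative NumPy-only reconstruction, not the authors' implementation (the paper used MATLAB's Deep Network Designer); the frame length (25 ms), hop (10 ms), and Hann window are assumed, typical values rather than values reported in the paper.

```python
import numpy as np

def log_spectrogram(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Convert a 1-D speech signal into a log-magnitude spectrogram
    (frames x frequency bins), ready to be rendered as an image
    and fed to a CNN such as AlexNet or SqueezeNet."""
    frame = int(sr * frame_ms / 1000)   # samples per analysis frame
    hop = int(sr * hop_ms / 1000)       # samples between frame starts
    window = np.hanning(frame)          # taper to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame) // hop
    spec = np.empty((n_frames, frame // 2 + 1))
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame] * window
        mag = np.abs(np.fft.rfft(chunk))        # magnitude spectrum
        spec[i] = 20 * np.log10(mag + 1e-10)    # decibel scale
    return spec

# Example: a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = log_spectrogram(tone, sr)
print(S.shape)  # one row per frame, one column per frequency bin
```

In the pipeline the abstract describes, each such spectrogram would be saved as an RGB image, augmented, and used to fine-tune the final layers of a pretrained network.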



Acknowledgements

The authors would like to thank Tunku Abdul Rahman University of Management and Technology for providing the computing resources used to run these experiments. Special thanks also go to the creators of the open-access RAVDESS and EMO-DB databases for permission to use them in our experiments.

Author information


Corresponding author

Correspondence to Kai Sze Hong.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Liau, C.S., Hong, K.S. (2024). Multilingual Speech Emotion Recognition Using Deep Learning Approach. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2023. Lecture Notes in Computer Science, vol 14322. Springer, Singapore. https://doi.org/10.1007/978-981-99-7339-2_43


  • DOI: https://doi.org/10.1007/978-981-99-7339-2_43

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7338-5

  • Online ISBN: 978-981-99-7339-2

  • eBook Packages: Computer Science, Computer Science (R0)
