Abstract
Emotion is inherent to human beings and is used to convey feelings to listeners. While emotions are conveyed mostly through facial expressions, spoken words also reflect a speaker's emotional state. This project researched and evaluated the performance of deep neural networks on multilingual speech emotion recognition using the RAVDESS and EMO-DB emotional speech databases, both individually and combined. The methodology comprised five steps: data collection and speech signal extraction, signal conversion, image recognition using transfer learning, result validation, and implementation of the trained network in a graphical user interface (GUI). AlexNet and SqueezeNet were investigated in a transfer-learning setting by training the networks with different maximum epoch counts, learning rates, and image augmentations. AlexNet achieved a higher validation accuracy than SqueezeNet, reaching 66.20% when trained on the combined RAVDESS and EMO-DB databases. On a test set of 264 samples, the trained model obtained an F1-score of 0.6253.
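The signal-conversion step above (turning a speech waveform into a spectrogram image that an image classifier such as AlexNet can consume) can be illustrated with a short sketch. The paper's pipeline was built in MATLAB; the NumPy version below is only an approximation of the idea, and the function name and framing parameters are our own, not the authors':

```python
import numpy as np

def log_spectrogram(signal, sr, n_fft=512, hop=256):
    """Short-time log-magnitude spectrogram of a 1-D audio signal.

    Illustrative sketch of the 'signal conversion' step: frame the
    waveform, window each frame, take the FFT magnitude, and compress
    with log1p. The result can be rendered as an image for a CNN.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (frames, n_fft//2 + 1)
    return np.log1p(spec).T                     # (freq bins, frames)

# Demo on a synthetic 440 Hz tone, 1 s at 16 kHz (stand-in for speech).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = log_spectrogram(tone, sr)

# The dominant frequency bin should sit near 440 Hz.
peak_bin = int(S.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512
```

In the actual system the spectrogram would then be resized to the input resolution of the pretrained network (227x227 for AlexNet) before fine-tuning.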
Acknowledgements
The authors would like to thank Tunku Abdul Rahman University of Management and Technology for providing the computing resources used to run these experiments. Special thanks are also due for permission to use the open-access databases (RAVDESS and EMO-DB) in our experiments.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Liau, C.S., Hong, K.S. (2024). Multilingual Speech Emotion Recognition Using Deep Learning Approach. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2023. Lecture Notes in Computer Science, vol 14322. Springer, Singapore. https://doi.org/10.1007/978-981-99-7339-2_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7338-5
Online ISBN: 978-981-99-7339-2