Abstract
Emotion is inherent to human beings and is used to convey feelings to listeners. While emotions are conveyed mostly through facial expressions, spoken words also reflect a speaker's emotional state. This project researched and evaluated the performance of deep neural networks on multilingual speech emotion recognition using the RAVDESS and EMO-DB emotional speech databases, both individually and combined. The methodology comprised five steps: data collection and speech signal extraction, signal conversion, image recognition using transfer learning, result validation, and implementation of the trained network in a graphical user interface (GUI). AlexNet and SqueezeNet were investigated in a transfer-learning setting by training the networks with different maximum epoch counts, learning rates, and image augmentations. AlexNet achieved a higher validation accuracy than SqueezeNet, reaching 66.20% when trained on the combined RAVDESS and EMO-DB databases. On a test set of 264 samples, the trained model obtained an F1-score of 0.6253.
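The signal-conversion step above (turning a speech waveform into a spectrogram image that an image classifier such as AlexNet can consume) can be illustrated with a short sketch. The paper's pipeline was built in MATLAB; the NumPy version below is only an approximation of the idea, and the function name and framing parameters are our own, not the authors':

```python
import numpy as np

def log_spectrogram(signal, sr, n_fft=512, hop=256):
    """Short-time log-magnitude spectrogram of a 1-D audio signal.

    Illustrative sketch of the 'signal conversion' step: frame the
    waveform, window each frame, take the FFT magnitude, and compress
    with log1p. The result can be rendered as an image for a CNN.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (frames, n_fft//2 + 1)
    return np.log1p(spec).T                     # (freq bins, frames)

# Demo on a synthetic 440 Hz tone, 1 s at 16 kHz (stand-in for speech).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = log_spectrogram(tone, sr)

# The dominant frequency bin should sit near 440 Hz.
peak_bin = int(S.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512
```

In the actual system the spectrogram would then be resized to the input resolution of the pretrained network (227x227 for AlexNet) before fine-tuning.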
Acknowledgements
The authors would like to thank Tunku Abdul Rahman University of Management and Technology for providing the computing resources used to run these experiments. Special thanks are also due for permission to use the open-access databases (RAVDESS and EMO-DB) in our experiments.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Liau, C.S., Hong, K.S. (2024). Multilingual Speech Emotion Recognition Using Deep Learning Approach. In: Badioze Zaman, H., et al. Advances in Visual Informatics. IVIC 2023. Lecture Notes in Computer Science, vol 14322. Springer, Singapore. https://doi.org/10.1007/978-981-99-7339-2_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7338-5
Online ISBN: 978-981-99-7339-2