An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition

  • Conference paper
  • Intelligent Information and Database Systems (ACIIDS 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13758)

Abstract

In recent years, virtual assistants have become an essential part of many applications on smart devices, where users speak to the assistant to give commands. This makes speech emotion recognition an important problem for improving the quality of virtual-assistant services. Speech emotion recognition is not a straightforward task, however, as emotion can be expressed through a variety of acoustic features, and a solid understanding of those features is crucial to achieving good results. To this end, this paper conducts empirical experiments on three kinds of speech features, the Mel-spectrogram, Mel-frequency cepstral coefficients (MFCCs), and the tempogram, together with their variants, for the task of speech emotion recognition. Convolutional Neural Networks, Long Short-Term Memory networks, a Multi-layer Perceptron classifier, and the Light Gradient Boosting Machine are used to build emotion classification models on top of these features. Two popular datasets, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Crowd-Sourced Emotional Multimodal Actors Dataset (CREMA-D), are used to train and evaluate the models.
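
This page does not reproduce the paper's extraction settings, but all three features can be computed with librosa (see the Notes section below). The following is a minimal sketch of the extraction step; the file name, sampling rate, and parameters such as n_mels and n_mfcc are illustrative assumptions, not the authors' configuration.

    import numpy as np
    import librosa

    # Hypothetical input file; RAVDESS and CREMA-D provide thousands of such clips.
    AUDIO_PATH = "speech_sample.wav"

    # Load the waveform (librosa resamples to 22,050 Hz by default).
    y, sr = librosa.load(AUDIO_PATH, sr=22050)

    # 1) Mel-spectrogram, converted to decibels for a perceptually scaled view.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # 2) Mel-frequency cepstral coefficients (MFCCs).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # 3) Tempogram: local autocorrelation of the onset-strength envelope.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)

    # Fixed-size classifiers (MLP, LightGBM) need one vector per clip; averaging
    # each feature over time is one common, simple way to obtain it.
    feature_vector = np.concatenate([
        mel_db.mean(axis=1),     # 128 values
        mfcc.mean(axis=1),       # 40 values
        tempogram.mean(axis=1),  # 384 values (librosa's default win_length)
    ])
    print(feature_vector.shape)  # (552,)

Sequence models such as the CNN and LSTM would instead consume the full two-dimensional feature maps rather than time-averaged vectors.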

Notes

  1. https://librosa.org/doc/latest/index.html
  2. https://scikit-learn.org/stable/
  3. https://www.tensorflow.org/
  4. https://lightgbm.readthedocs.io/en/latest/
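
The notes above point to the toolkits involved: librosa for feature extraction, scikit-learn and TensorFlow for the MLP and neural models, and LightGBM for gradient boosting. As a hedged illustration of how the classical models in the abstract consume time-averaged feature vectors, the sketch below trains a LightGBM classifier on synthetic stand-in data. The data shapes, the 8-class label space (matching RAVDESS's eight emotion categories), and the hyperparameters are assumptions for illustration, not the paper's reported setup.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from lightgbm import LGBMClassifier

    # Synthetic stand-in for time-averaged speech features (e.g. the
    # 552-dimensional vectors from the extraction sketch above).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 552))
    y = rng.integers(0, 8, size=1000)  # 8 emotion classes, as in RAVDESS

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Illustrative hyperparameters, not the authors' settings.
    clf = LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=31)
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

The same split-and-evaluate pattern would apply to the MLP (scikit-learn) and to the CNN and LSTM models (TensorFlow), with the neural models taking 2-D feature maps as input.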

Author information

Correspondence to Binh Van Duong.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Duong, B.V., Ha, C.N., Nguyen, T.T., Nguyen, P., Do, T.H. (2022). An Empirical Experiment on Feature Extractions Based for Speech Emotion Recognition. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_15

  • DOI: https://doi.org/10.1007/978-3-031-21967-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
