Abstract
Human-robot interaction involves human intentions and human emotions. Since the emergence of positive psychology, psychological research has concentrated heavily on the factors involved in human emotion generation. Speech emotion recognition (SER) is a challenging task due to the complexity of emotions, and it is gaining importance because good emotional health supports good social and mental health. Among the various approaches to SER, the most advanced models combine a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network, but these suffer from limited parallelization over sequences and long computation times. Attention mechanisms, by contrast, have shown strong performance in learning salient feature representations for specific tasks. Building on this technique, we propose an emotion recognition system with a relation-aware self-attention mechanism that memorizes the discriminative features for SER, using spectrograms as input. A CNN with relation-aware self-attention is modelled to analyse 3D log-Mel spectrograms and extract high-level features; the model employs 3D convolutional layers, 3D max-pooling layers, and LSTM networks. The attention layer attends to the emotion-salient parts of each utterance and assembles discriminative utterance-level representations for SER. Finally, a fully connected layer with 64 output units maps these utterance-level representations to higher-level representations. The proposed relation-aware attention-based 3D CNN and LSTM model achieved an average recognition accuracy of 80.80% in speech emotion recognition. The model focuses on enhancing the attention mechanism to gain the additional benefit of sequence-to-sequence parallelization while improving recognition accuracy.
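The relation-aware self-attention the abstract refers to follows the relative-position formulation of Shaw et al. (2018): the attention logits and the weighted sum are each augmented with learned relative-position embeddings. Below is a minimal single-head NumPy sketch of that mechanism, not the authors' implementation; the function names and the dense `(T, T, d)` relative-embedding tensors `rel_k` and `rel_v` are illustrative simplifications (in practice relative embeddings are shared across clipped distances).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_aware_attention(X, Wq, Wk, Wv, rel_k, rel_v):
    """Single-head self-attention with relative position representations.

    X      : (T, d) input sequence (e.g. frame-level spectrogram features)
    Wq/Wk/Wv : (d, d) query/key/value projections
    rel_k  : (T, T, d) relative-position embeddings added to the keys
    rel_v  : (T, T, d) relative-position embeddings added to the values
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # Logits: e_ij = q_i . (k_j + a^K_ij) / sqrt(d)
    scores = (Q @ K.T + np.einsum('id,ijd->ij', Q, rel_k)) / np.sqrt(d)
    A = softmax(scores, axis=-1)
    # Output: z_i = sum_j A_ij (v_j + a^V_ij)
    return A @ V + np.einsum('ij,ijd->id', A, rel_v)
```

With `rel_k` and `rel_v` set to zero, this reduces exactly to standard scaled dot-product self-attention, which makes the role of the relation-aware terms easy to isolate.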
Dangol, R., Alsadoon, A., Prasad, P.W.C. et al. Speech Emotion Recognition Using Convolutional Neural Network and Long-Short Term Memory. Multimed Tools Appl 79, 32917–32934 (2020). https://doi.org/10.1007/s11042-020-09693-w