DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features

Fahad, Md. Shah; Deepak, Akshay; Pradhan, Gayadhar; Yadav, Jainath

doi:10.1007/s00034-020-01486-8

Published: 21 July 2020

Circuits, Systems, and Signal Processing
Volume 40, pages 466–489 (2021)

Md. Shah Fahad¹ (ORCID: orcid.org/0000-0002-2556-131X), Akshay Deepak¹, Gayadhar Pradhan² & Jainath Yadav³

Abstract

Speech emotion recognition (SER) systems are often evaluated in a speaker-independent manner. However, the variation in acoustic features between the speakers used for training and those used for evaluation causes a significant drop in accuracy during evaluation. While speaker-adaptive techniques have been used for speech recognition, to the best of our knowledge, they have not been employed for emotion recognition. Motivated by this, a speaker-adaptive DNN-HMM-based SER system is proposed in this paper. The feature-space maximum likelihood linear regression (fMLLR) technique is used for speaker adaptation during both the training and testing phases. The proposed system uses MFCC and epoch-based features. We exploit our earlier work on robust detection of epochs from emotional speech to obtain emotion-specific epoch-based features, namely instantaneous pitch, phase, and the strength of excitation. The combined feature set improves on the MFCC features, which have been the baseline for SER systems in the literature, by 5.07% and on the state-of-the-art techniques by 7.13%. Using just the MFCC features, the proposed model improves upon the state-of-the-art techniques by 2.06%. These results bring out the importance of speaker adaptation for SER systems and highlight the complementary nature of the MFCC and epoch-based features for emotion recognition from speech. All experiments were carried out on the IEMOCAP emotional dataset.
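As an illustration of the speaker-adaptation step named in the abstract, feature-space maximum likelihood linear regression is commonly described as applying one affine transform per speaker, x' = A x + b, to every feature frame. The sketch below only applies a given transform; estimating A and b by maximum likelihood is omitted. The function name apply_fmllr, the 2-dimensional toy frames, and the values of A and b are hypothetical and are not taken from the paper.

```python
def apply_fmllr(frames, A, b):
    """Apply a per-speaker affine transform x' = A x + b to each frame.

    frames: list of feature vectors (lists of floats) for one speaker
    A:      square matrix as a list of rows (hypothetical values here)
    b:      bias vector with the same dimension as each frame
    """
    dim = len(b)
    adapted = []
    for x in frames:
        if len(x) != dim:
            raise ValueError("frame dimension does not match transform")
        # Matrix-vector product plus bias, one output coefficient at a time.
        adapted.append(
            [sum(A[i][j] * x[j] for j in range(dim)) + b[i] for i in range(dim)]
        )
    return adapted


# Toy 2-dimensional frames; real feature frames have more coefficients.
frames = [[1.0, 2.0], [0.0, -1.0]]
A = [[2.0, 0.0], [0.0, 0.5]]  # assumed speaker-specific transform
b = [1.0, -1.0]               # assumed speaker-specific bias

adapted = apply_fmllr(frames, A, b)
print(adapted)  # [[3.0, 0.0], [1.0, -1.5]]
```

Because the transform acts on the features rather than on the model parameters, the same routine can be applied to a speaker's frames in both the training and the testing phase, which matches the usage the abstract describes.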

Acknowledgements

Akshay Deepak has been awarded the Young Faculty Research Fellowship (YFRF) of the Visvesvaraya PhD Programme of the Ministry of Electronics & Information Technology (MeitY), Government of India. In this regard, he would like to acknowledge that this publication is an outcome of the R&D work undertaken in the project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, India
Md. Shah Fahad & Akshay Deepak
Department of Electronics and Communication, National Institute of Technology Patna, Patna, India
Gayadhar Pradhan
Department of Computer Science, Central University of South Bihar, Gaya, India
Jainath Yadav

Corresponding author

Correspondence to Md. Shah Fahad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fahad, M.S., Deepak, A., Pradhan, G. et al. DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features. Circuits Syst Signal Process 40, 466–489 (2021). https://doi.org/10.1007/s00034-020-01486-8

Received: 23 April 2019
Revised: 12 June 2020
Accepted: 13 June 2020
Published: 21 July 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s00034-020-01486-8
