Abstract
In this paper, we propose a robust audio-visual keyword spotting (AVKS) system. The system is built on a deep neural network (DNN) model trained with the state-level minimum Bayes risk (sMBR) criterion. Speech sounds are represented at the phonetic level using the symbols of the International Phonetic Alphabet (IPA). The proposed system recognizes 34 phonemes and the silence region, and detects predefined keywords formed from these phonemes. Most audio-visual keyword spotting systems use Mel-frequency cepstral coefficients (MFCC) as the audio feature. This feature captures only vocal-tract related information and carries no excitation source information. We therefore explore excitation source features as supplementary information in this work. The excitation source features extracted from the glottal flow derivative (GFD) and the linear prediction (LP) residual through standard mel cepstral analysis are termed Glottal Mel-Frequency Cepstral Coefficients (GMFCC) and Residual Mel-Frequency Cepstral Coefficients (RMFCC), respectively. The GFD signal is generated using the Iterative Adaptive Inverse Filtering (IAIF) method, whereas the LP residual is estimated by an inverse filtering process. In our experimental analysis, we observe that the glottal-based excitation feature outperforms the LP-residual-based excitation feature on the keyword spotting task; hence, we adopt the GMFCC features in the development of our proposed system. The AVKS system using MFCC and Discrete Cosine Transform (DCT) based visual features extracted from the mouth region achieves an average accuracy of 93.87%, and the inclusion of the GMFCC feature improves this to 94.93%. These experimental observations show the benefit of excitation source information for audio-visual keyword spotting under noisy conditions.
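As a minimal sketch of the residual-based branch of this pipeline (not the authors' implementation), the snippet below estimates the LP residual by frame-wise inverse filtering and then applies standard mel cepstral analysis to the residual to obtain RMFCC-style features. The input file name, LP order, frame size, hop, and number of coefficients are illustrative assumptions; the GMFCC branch would instead apply the same mel cepstral analysis to a GFD signal obtained with IAIF.

# Sketch: LP residual via inverse filtering, then MFCC on the residual (RMFCC).
# Assumes librosa and scipy are available; parameters are illustrative.
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(y, order=16, frame_len=400, hop=160):
    # Frame-wise LP analysis; the residual is e[n] = A(z) y[n],
    # i.e., the speech frame passed through its own inverse filter A(z).
    residual = np.zeros_like(y)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len]
        a = librosa.lpc(frame, order=order)          # A(z) coefficients, a[0] = 1
        e = lfilter(a, [1.0], frame)                 # inverse filtering
        residual[start:start + frame_len] += e * np.hanning(frame_len)  # overlap-add
    return residual

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical input file
res = lp_residual(y)
# RMFCC: standard mel cepstral analysis applied to the residual signal
rmfcc = librosa.feature.mfcc(y=res, sr=sr, n_mfcc=13)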