Abstract
Speech recognition from voice signals has long been a core component of computer-aided services (CAS). Automatic speech recognition (ASR) draws on several techniques, including a variety of classification and speech-analysis methods, to extract information from the signals. In this work we introduce a novel method for automatically identifying speakers and their response tones from continuous speech. The speech corpus used in the experiments is built from Assamese, a language of Northeast India spoken alongside many other languages and cultures of the region. Suppressing the noise mixed with speakers' voices is a crucial issue in contact centres, and we propose a GMM-based approach to address it. Initial estimates of the speech and noise power spectra are obtained by forming systems of equations from the GMM mean vectors. In this first stage the noise category is resolved and the input SNR is assessed; from the initial estimate of the noise power spectrum, the selected noise model is then constructed. A Wiener filter is applied to the refined estimate to attenuate background noise and enhance the noisy speech. Assuming the uncertainty parameters are known, we then evaluate classification on both synthetic data and the enhanced speech data. Finally, clustering is performed on the Gaussian data to identify three clusters corresponding to high-, medium-, and low-tone voices, enabling identification of three common human emotional states.
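The enhancement and clustering stages described in the abstract can be sketched as follows. This is a minimal illustration on synthetic signals, not the paper's implementation: the GMM-derived noise power-spectrum estimate is replaced by an oracle noise spectrum, and the Gaussian tone clustering by a simple 1-D k-means stand-in; the signal parameters and pitch values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: a clean 440 Hz tone buried in white noise.
fs = 8000
t = np.arange(fs) / fs  # 1 second of samples
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * rng.standard_normal(t.size)
noisy = clean + noise

# Step 1: noise power spectrum estimate. The paper derives this from the
# GMM mean vectors; here we use the true noise spectrum as a stand-in.
noise_psd = np.abs(np.fft.rfft(noise)) ** 2

# Step 2: Wiener filter — per-bin gain = speech PSD / (speech PSD + noise PSD).
noisy_spec = np.fft.rfft(noisy)
noisy_psd = np.abs(noisy_spec) ** 2
speech_psd = np.maximum(noisy_psd - noise_psd, 1e-10)  # floor at a small value
gain = speech_psd / (speech_psd + noise_psd)
enhanced = np.fft.irfft(gain * noisy_spec, n=noisy.size)

# The enhanced signal should be closer to the clean one than the noisy input.
err_noisy = np.mean((noisy - clean) ** 2)
err_enh = np.mean((enhanced - clean) ** 2)

# Step 3: group per-utterance pitch estimates into three tone clusters
# (low / medium / high) via 1-D k-means on hypothetical pitch values (Hz).
pitches = np.concatenate([rng.normal(m, 5.0, 50) for m in (120.0, 200.0, 280.0)])
centers = np.array([pitches.min(), np.median(pitches), pitches.max()])
for _ in range(20):
    labels = np.argmin(np.abs(pitches[:, None] - centers[None, :]), axis=1)
    centers = np.array([pitches[labels == k].mean() for k in range(3)])
```

With well-separated tone distributions, the three cluster centres converge near the low, medium, and high pitch means, mirroring the three-cluster tone identification described above.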
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Devi, M., Sarma, M.K., Talukdar, J. (2023). Speech Recognition Via Machine Learning in Recording Studio. In: Singh, S.N., Mahanta, S., Singh, Y.J. (eds) Proceedings of the NIELIT's International Conference on Communication, Electronics and Digital Technology. NICE-DT 2023. Lecture Notes in Networks and Systems, vol 676. Springer, Singapore. https://doi.org/10.1007/978-981-99-1699-3_4
Print ISBN: 978-981-99-1698-6
Online ISBN: 978-981-99-1699-3
eBook Packages: Intelligent Technologies and Robotics