Abstract
Automatic Speech Recognition (ASR) systems that convert spoken language into written text have greatly transformed human–machine interaction. Although these systems have achieved impressive results in many languages, building accurate and reliable ASR models for low-resource languages such as Gujarati poses significant challenges. Gujarati lacks both data and linguistic resources, making it difficult to develop high-performance ASR systems. In this paper, we propose an approach to enhance the effectiveness of a Gujarati ASR model despite these limited resources. We achieve this by incorporating integrated features, namely Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC), into the DeepSpeech2 architecture, and by implementing an improved spell-correction technique based on the Bidirectional Encoder Representations from Transformers (BERT) algorithm. Through testing and evaluation, our approach has demonstrated superiority over previous state-of-the-art methodologies. The experimental results show that our proposed method consistently reduces the Word Error Rate (WER) by 10–12 percentage points compared with existing work, surpassing the previous most significant improvement of 5.87%. Our findings demonstrate the viability of developing accurate and dependable ASR systems for resource-constrained languages such as Gujarati.
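The abstract does not specify how the MFCC and GFCC streams are integrated; a common and minimal approach, sketched below under that assumption, is frame-wise concatenation of the two cepstral feature matrices before they are fed to the acoustic model. The function name and dimensions here are illustrative, not taken from the paper.

```python
import numpy as np

def integrate_features(mfcc: np.ndarray, gfcc: np.ndarray) -> np.ndarray:
    """Frame-wise concatenation of MFCC and GFCC feature matrices.

    mfcc: shape (n_frames_m, n_mfcc); gfcc: shape (n_frames_g, n_gfcc).
    The two streams are truncated to the shorter frame count so that
    every output frame carries both spectral views.
    """
    n_frames = min(mfcc.shape[0], gfcc.shape[0])
    return np.hstack([mfcc[:n_frames], gfcc[:n_frames]])

# Example: 100 frames of 13 MFCCs combined with 13 GFCCs
# yields a (100, 26)-dimensional integrated feature matrix.
mfcc = np.random.randn(100, 13)
gfcc = np.random.randn(102, 13)
features = integrate_features(mfcc, gfcc)
```

In an end-to-end pipeline such as DeepSpeech2, the resulting 26-dimensional frames would simply replace the single-stream input features; the network architecture itself is unchanged apart from the input dimension.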
Ethics declarations
Conflict of interest
I, Mohit Dua, on behalf of all the authors, declare that this study did not receive funding from any source, that neither the authors nor the submitted manuscript has any conflict of interest, and that this article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dua, M., Bhagat, B. & Dua, S. An amalgamation of integrated features with DeepSpeech2 architecture and improved spell corrector for improving Gujarati language ASR system. Int J Speech Technol (2024). https://doi.org/10.1007/s10772-024-10082-z