Abstract
This paper presents an automatic Indian language identification (LID) system based on tonal and non-tonal pre-classification, using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and individual languages are then identified within each group. The system uses syllable-, word- (tri-syllable) and phrase-level (multi-word) prosody (collectively called multi-level prosody) together with spectral features, namely Mel-frequency cepstral coefficients (MFCCs), mean Hilbert envelope coefficients (MHECs), and the shifted delta cepstral coefficients of both, for the pre-classification task. Multi-level analysis of the spectral features is also proposed, and the complementarity of the syllable-, word- and phrase-level (spectral + prosody) features is examined for the pre-classification-based LID task. Four models are developed to identify the languages: a Gaussian mixture model-universal background model (GMM-UBM), an artificial neural network (ANN), an i-vector based support vector machine (SVM), and a deep neural network (DNN). Experiments are carried out on the National Institute of Technology Silchar language database (NITS-LD) and the OGI Multi-language Telephone Speech corpus (OGI-MLTS). They confirm that both prosody and (spectral + prosody) features obtained at the syllable, word and phrase levels carry complementary information for the pre-classification-based LID task. At the pre-classification stage, DNN models based on multi-level (prosody + MFCC) features, coupled with a score-combination technique, yield the lowest equal error rate (EER) of 9.6% on NITS-LD. On OGI-MLTS, the lowest EER of 10.2% is observed for multi-level (prosody + MHEC) features. The pre-classification module improves the performance of the baseline single-stage LID system by 3.2% on NITS-LD and 4.2% on OGI-MLTS.
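The shifted delta cepstral (SDC) coefficients mentioned in the abstract are conventionally parameterized by an N-d-P-k scheme: for each frame, k delta vectors are computed at shifts of P frames, each delta spanning 2d frames, and the deltas are stacked into one feature vector. The sketch below is a minimal NumPy illustration of that standard scheme, not the authors' implementation; the parameter values (d=1, P=3, k=7) are illustrative defaults, not taken from the paper.

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted delta cepstral features under the N-d-P-k scheme.

    cepstra : (T, N) array of per-frame cepstral coefficients.
    Returns a (T, N*k) array; deltas that would reach outside the
    utterance are left as zeros (zero padding at the edges).
    """
    T, N = cepstra.shape
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            plus = t + i * p + d   # frame ahead of the shifted center
            minus = t + i * p - d  # frame behind the shifted center
            if 0 <= minus and plus < T:
                out[t, i * N:(i + 1) * N] = cepstra[plus] - cepstra[minus]
    return out
```

With 7 MFCCs per frame and k=7, each output frame is a 49-dimensional vector, which matches the widely used 7-1-3-7 configuration for language identification.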
Cite this article
China Bhanja, C., Laskar, M.A. & Laskar, R.H. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system. Lang Resources & Evaluation 55, 689–730 (2021). https://doi.org/10.1007/s10579-020-09527-z