Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

In this paper, an attempt has been made to develop an automatic tonal and non-tonal pre-classification-based Indian language identification (LID) system using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and then individual languages are identified from among the languages of the respective groups. The system uses syllable-, word- (tri-syllable) and phrase-level (multi-word) prosody (collectively called multi-level prosody) along with spectral features, namely Mel-frequency cepstral coefficients (MFCCs), Mean Hilbert envelope coefficients (MHECs), and shifted delta cepstral coefficients of MFCCs and MHECs, for the pre-classification task. Multi-level analysis of spectral features has also been proposed, and the complementarity of the syllable-, word- and phrase-level (spectral + prosody) features has been examined for the pre-classification-based LID task. Four different models, namely Gaussian Mixture Model (GMM)-Universal Background Model (UBM), Artificial Neural Network (ANN), i-vector-based support vector machine (SVM) and Deep Neural Network (DNN), have been developed to identify the languages. Experiments have been carried out on the National Institute of Technology Silchar language database (NITS-LD) and the OGI Multi-language Telephone Speech corpus (OGI-MLTS). The experiments confirm that both prosody and (spectral + prosody) features obtained from the syllable, word and phrase levels carry complementary information for the pre-classification-based LID task. At the pre-classification stage, DNN models based on multi-level (prosody + MFCC) features, coupled with a score-combination technique, result in the lowest EER value of 9.6% for NITS-LD. For the OGI-MLTS database, the lowest EER value of 10.2% is observed for multi-level (prosody + MHEC) features. The pre-classification module helps to improve the performance of the baseline single-stage LID system by 3.2% and 4.2% for the NITS-LD and OGI-MLTS databases, respectively.
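The shifted delta cepstral (SDC) coefficients mentioned in the abstract stack delta-cepstral vectors from several time-shifted frames so that each feature vector captures longer-span temporal dynamics. The sketch below is a minimal NumPy illustration of the common N-d-P-k SDC parameterization; the function name, default parameter values, and edge-clamping behaviour are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def shifted_delta_cepstra(cep, N=7, d=1, P=3, k=7):
    """Compute SDC features from a cepstral matrix `cep` of shape
    (num_frames, num_coeffs), using the N-d-P-k scheme:
    for each frame t, k delta vectors delta(t + i*P), i = 0..k-1,
    are stacked, where delta(t) = cep[t + d] - cep[t - d] over the
    first N coefficients. Frame indices that fall outside the matrix
    are clamped to the first/last frame (an assumed edge policy).
    """
    T = cep.shape[0]
    c = cep[:, :N]
    out = np.zeros((T, N * k))
    for t in range(T):
        blocks = []
        for i in range(k):
            hi = min(t + i * P + d, T - 1)  # forward context, clamped
            lo = max(t + i * P - d, 0)      # backward context, clamped
            blocks.append(c[hi] - c[lo])    # delta at shift i*P
        out[t] = np.concatenate(blocks)     # stacked k deltas of N coeffs
    return out
```

Each output frame therefore has N*k dimensions; with the widely used 7-1-3-7 setting this gives 49-dimensional SDC vectors per MFCC (or MHEC) frame.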



Author information


Corresponding author

Correspondence to Chuya China Bhanja.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

China Bhanja, C., Laskar, M.A. & Laskar, R.H. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system. Lang Resources & Evaluation 55, 689–730 (2021). https://doi.org/10.1007/s10579-020-09527-z

