Abstract
This paper presents an automatic Indian language identification (LID) system based on tonal and non-tonal pre-classification, using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and individual languages are then identified within each group. The system uses syllable-, word- (tri-syllable) and phrase-level (multi-word) prosody (collectively called multi-level prosody) together with spectral features, namely Mel-frequency cepstral coefficients (MFCCs), mean Hilbert envelope coefficients (MHECs), and the shifted delta cepstral coefficients of both, for the pre-classification task. Multi-level analysis of the spectral features is also proposed, and the complementarity of the syllable-, word- and phrase-level (spectral + prosody) features is examined for the pre-classification-based LID task. Four models are developed to identify the languages: a Gaussian mixture model-universal background model (GMM-UBM), an artificial neural network (ANN), an i-vector based support vector machine (SVM), and a deep neural network (DNN). Experiments are carried out on the National Institute of Technology Silchar language database (NITS-LD) and the OGI Multi-language Telephone Speech corpus (OGI-MLTS). They confirm that both prosody and (spectral + prosody) features obtained at the syllable, word and phrase levels carry complementary information for the pre-classification-based LID task. At the pre-classification stage, DNN models based on multi-level (prosody + MFCC) features, coupled with a score-combination technique, yield the lowest equal error rate (EER) of 9.6% on NITS-LD. On OGI-MLTS, the lowest EER of 10.2% is observed for multi-level (prosody + MHEC) features. The pre-classification module improves the performance of the baseline single-stage LID system by 3.2% on NITS-LD and 4.2% on OGI-MLTS.
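The shifted delta cepstral (SDC) coefficients mentioned in the abstract are conventionally parameterized by an N-d-P-k scheme: for each frame, k delta vectors are computed at shifts of P frames, each delta spanning 2d frames, and the deltas are stacked into one feature vector. The sketch below is a minimal NumPy illustration of that standard scheme, not the authors' implementation; the parameter values (d=1, P=3, k=7) are illustrative defaults, not taken from the paper.

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted delta cepstral features under the N-d-P-k scheme.

    cepstra : (T, N) array of per-frame cepstral coefficients.
    Returns a (T, N*k) array; deltas that would reach outside the
    utterance are left as zeros (zero padding at the edges).
    """
    T, N = cepstra.shape
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            plus = t + i * p + d   # frame ahead of the shifted center
            minus = t + i * p - d  # frame behind the shifted center
            if 0 <= minus and plus < T:
                out[t, i * N:(i + 1) * N] = cepstra[plus] - cepstra[minus]
    return out
```

With 7 MFCCs per frame and k=7, each output frame is a 49-dimensional vector, which matches the widely used 7-1-3-7 configuration for language identification.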
Cite this article
China Bhanja, C., Laskar, M.A. & Laskar, R.H. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system. Lang Resources & Evaluation 55, 689–730 (2021). https://doi.org/10.1007/s10579-020-09527-z