Robust features for multilingual acoustic modeling

Santhosh Kumar, C.; Mohandas, V. P.

doi:10.1007/s10772-011-9092-6

Published: 11 May 2011

Volume 14, pages 147–155, (2011)

International Journal of Speech Technology

C. Santhosh Kumar¹ & V. P. Mohandas¹

Abstract

In this paper, we propose a technique to derive robust features for multilingual acoustic modeling using hidden Markov model–Gaussian mixture models (HMM-GMM). We achieve this by discriminatively combining the phonetic contexts of the target languages (languages in the multilingual system). Phonetic context is captured using wide temporal context of the features, and the dimensionality of the resulting feature set is reduced to suit the HMM-GMM implementation using a neural network with a bottle-neck in one of the hidden layers. The output before the non-linearity at the bottle-neck layer of the neural network is the new feature. Since the features are optimized for the target languages in the multilingual recognizer, they are referred to as Target Languages Oriented Features (TLOF).
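The feature pipeline described above can be illustrated with a minimal sketch: neighbouring frames are stacked to capture a wide temporal context, and the stacked vector is passed through a network whose bottleneck layer output, taken before the non-linearity, becomes the new feature. Everything specific here is an assumption made for illustration and is not taken from the paper: the 13-dimensional input, the context of 4 frames on each side, the layer sizes, the tanh activation, and the function names `stack_context` and `bottleneck_features`. The weights are random placeholders; in a real system they would come from a network trained on the phone targets of the target languages.

```python
import numpy as np


def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its neighbours (wide temporal context)."""
    n_frames, _ = frames.shape
    # Repeat the edge frames so that every frame has a full context window.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    windows = [padded[i:i + n_frames] for i in range(left + right + 1)]
    # Resulting shape: (n_frames, dim * (left + right + 1))
    return np.concatenate(windows, axis=1)


def bottleneck_features(stacked, w_in, b_in, w_bn, b_bn):
    """Return the linear output of the bottleneck layer (no non-linearity)."""
    hidden = np.tanh(stacked @ w_in + b_in)  # first hidden layer (assumed tanh)
    return hidden @ w_bn + b_bn              # pre-activation bottleneck output


# Hypothetical sizes: 100 frames of 13-dim features, 9-frame context,
# a 64-unit hidden layer and a 39-unit bottleneck.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))
stacked = stack_context(mfcc)

# Untrained placeholder weights, used only to show the data flow.
w_in = rng.standard_normal((stacked.shape[1], 64)) * 0.1
b_in = np.zeros(64)
w_bn = rng.standard_normal((64, 39)) * 0.1
b_bn = np.zeros(39)

features = bottleneck_features(stacked, w_in, b_in, w_bn, b_bn)
```

The bottleneck width sets the dimensionality of the feature vector handed to the HMM-GMM system, which is why the reduction is done inside the network rather than by a separate projection step.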

We perform our experiments on two of the most widely spoken Indian languages, Hindi and Tamil. TLOF offers significant performance improvements over both monolingual and multilingual phone recognizers that use Mel frequency cepstral coefficients (MFCC). This indicates that TLOF can help share data across languages.

TLOF also improves the performance of monolingual acoustic models relative to systems using MFCC.

Author information

Authors and Affiliations

ECE Department, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore, India
C. Santhosh Kumar & V. P. Mohandas

Corresponding author

Correspondence to C. Santhosh Kumar.

About this article

Cite this article

Santhosh Kumar, C., Mohandas, V.P. Robust features for multilingual acoustic modeling. Int J Speech Technol 14, 147–155 (2011). https://doi.org/10.1007/s10772-011-9092-6

Received: 13 October 2010
Accepted: 18 April 2011
Published: 11 May 2011
Issue Date: September 2011
DOI: https://doi.org/10.1007/s10772-011-9092-6
