Abstract
In this paper, we propose a technique to derive robust features for multilingual acoustic modeling using hidden Markov model–Gaussian mixture models (HMM-GMM). We achieve this by discriminatively combining the phonetic contexts of the target languages (the languages in the multilingual system). Phonetic context is captured using a wide temporal context of the features, and the dimensionality of the resulting feature set is reduced to suit the HMM-GMM implementation using a neural network with a bottle-neck in one of its hidden layers. The output of the bottle-neck layer, taken before the non-linearity, is the new feature. Since the features are optimized for the target languages in the multilingual recognizer, they are referred to as Target Languages Oriented Features (TLOF).
We perform our experiments on two of the most widely spoken Indian languages, Hindi and Tamil. TLOF offers significant performance improvements over both monolingual and multilingual phone recognizers using Mel frequency cepstral coefficients (MFCC), which suggests that TLOF can help share data across languages.
TLOF also improves the performance of monolingual acoustic models compared to systems using MFCC.
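The feature extraction outlined in the abstract can be illustrated with a minimal sketch: stack a wide temporal context around each frame, pass it through a network with a narrow hidden layer, and keep that layer's output before the non-linearity. Everything specific here is an assumption made for illustration and is not taken from the paper: the 39-dimensional input, the context of 5 frames on each side, the layer sizes, the `tanh` activation, and the function names `stack_context` and `bottleneck_features`. The weights are random placeholders; in an actual system they would presumably come from a network trained discriminatively on phone targets of the target languages.

```python
import numpy as np

def stack_context(feats, left, right):
    """Concatenate each frame with its neighbours; edge frames are repeated as padding."""
    n_frames = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    # Column block k holds the frame at offset (k - left) from the centre frame.
    return np.hstack([padded[k:k + n_frames] for k in range(left + right + 1)])

def bottleneck_features(stacked, w_in, b_in, w_bn, b_bn):
    """Return the bottle-neck layer's linear output, i.e. before any non-linearity."""
    hidden = np.tanh(stacked @ w_in + b_in)  # first hidden layer (assumed tanh)
    return hidden @ w_bn + b_bn              # pre-activation of the bottle-neck layer

rng = np.random.default_rng(0)

# Hypothetical input: 200 frames of 39-dimensional MFCC-like features.
mfcc = rng.standard_normal((200, 39))

# Assumed context: 5 frames on each side, giving 11 * 39 = 429 dimensions.
stacked = stack_context(mfcc, left=5, right=5)

# Assumed layer sizes; random weights stand in for a trained network.
hidden_dim, bn_dim = 500, 39
w_in = rng.standard_normal((stacked.shape[1], hidden_dim)) * 0.01
b_in = np.zeros(hidden_dim)
w_bn = rng.standard_normal((hidden_dim, bn_dim)) * 0.01
b_bn = np.zeros(bn_dim)

tlof_like = bottleneck_features(stacked, w_in, b_in, w_bn, b_bn)
print(stacked.shape, tlof_like.shape)
```

Keeping the bottle-neck dimension small returns the wide-context input to a size that an HMM-GMM system can model directly, which is the role the abstract assigns to the bottle-neck layer.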
Cite this article
Santhosh Kumar, C., Mohandas, V.P. Robust features for multilingual acoustic modeling. Int J Speech Technol 14, 147–155 (2011). https://doi.org/10.1007/s10772-011-9092-6
DOI: https://doi.org/10.1007/s10772-011-9092-6