Ensemble Acoustic Modeling for CD-DNN-HMM Using Random Forests of Phonetic Decision Trees

Journal of Signal Processing Systems

Abstract

We propose a novel approach that generates an ensemble of context-dependent deep neural networks (CD-DNNs) by using random forests of phonetic decision trees (RF-PDTs), and we construct an ensemble acoustic model (EAM) from this ensemble for speech recognition. We present evaluation results on the TIMIT dataset and a telemedicine automatic captioning dataset, where the proposed RF-PDT+CD-DNN based EAM outperforms the conventional CD-DNN based single acoustic model (SAM) in both phone and word recognition accuracy.
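
The abstract does not say how the ensemble members are combined, so the following is only an illustrative sketch of one generic option, not the authors' method. It assumes that each tree in the forest yields its own tying map from a triphone state to a tied-state index, that each CD-DNN supplies a per-frame log-likelihood for each of its own tied states, and that members are combined by uniform averaging of likelihoods. The names `combine_state_scores`, `tying_maps`, and `member_log_likes`, and the toy numbers, are hypothetical.

```python
import math


def combine_state_scores(member_log_likes, tying_maps, state):
    """Return the log of the mean likelihood of `state` over all ensemble members.

    member_log_likes: one list per member, indexed by that member's tied-state id
                      (assumed layout, for illustration only).
    tying_maps:       one dict per member, mapping a triphone state to a tied-state id.
    """
    # Look up each member's score through that member's own state tying.
    logs = [scores[tying[state]] for scores, tying in zip(member_log_likes, tying_maps)]
    # Uniform average in the likelihood domain, computed stably via log-sum-exp.
    peak = max(logs)
    return peak + math.log(sum(math.exp(x - peak) for x in logs) / len(logs))


# Toy example: two members whose trees tie the same triphone state differently.
tying_maps = [
    {"a-b+c[2]": 0, "x-b+c[2]": 0},
    {"a-b+c[2]": 1, "x-b+c[2]": 0},
]
member_log_likes = [
    [math.log(0.2), math.log(0.8)],
    [math.log(0.5), math.log(0.4)],
]
score = combine_state_scores(member_log_likes, tying_maps, "a-b+c[2]")
```

In this toy case the two members contribute likelihoods 0.2 and 0.4 for the queried state, so the combined score is the log of their mean. Weighted or log-domain averaging would be equally plausible choices under the same assumptions.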

Author information

Corresponding author

Correspondence to Yunxin Zhao.

About this article

Cite this article

Zhao, T., Zhao, Y. & Chen, X. Ensemble Acoustic Modeling for CD-DNN-HMM Using Random Forests of Phonetic Decision Trees. J Sign Process Syst 82, 187–196 (2016). https://doi.org/10.1007/s11265-015-1001-9

  • DOI: https://doi.org/10.1007/s11265-015-1001-9
