
Synthesizing English emphatic speech for multimodal corrective feedback in computer-aided pronunciation training

Multimedia Tools and Applications

Abstract

Emphasis plays an important role in expressive speech synthesis by highlighting the focus of an utterance to draw the listener's attention. We present a hidden Markov model (HMM)-based emphatic speech synthesis model whose ultimate objective is to synthesize corrective feedback in a computer-aided pronunciation training (CAPT) system. We first analyze contrastive (neutral versus emphatic) speech recordings, examining how the acoustic features of emphasis change at different prosodic locations and how locally prominent the emphasis is. Based on this analysis, we develop a perturbation model that predicts, with high accuracy, the changes in acoustic features from neutral to emphatic speech. Building on the perturbation model, we then develop an HMM-based emphatic speech synthesis model. Unlike previous work, the HMM model is trained on a neutral corpus, but context features and additional acoustic-feature-related features are used when growing the decision tree. The output of the perturbation model can then supervise the HMM model to synthesize emphatic speech, instead of the perturbation model being applied directly at the back end of a neutral speech synthesis model. In this way, the demand for an emphasis corpus is reduced and the quality degradation introduced by speech modification algorithms is avoided. Experiments indicate that the proposed emphatic speech synthesis model improves the emphasis quality of synthesized speech while maintaining a high degree of naturalness.
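To make the perturbation-model idea concrete, the sketch below illustrates how per-syllable scaling factors for F0, duration, and energy could be learned from contrastive neutral/emphatic recordings and applied to neutral parameters. This is not the authors' implementation: the context features, the toy data, and the use of scikit-learn's LinearRegression are assumptions made purely for illustration. In the proposed system, the predicted changes supervise an HMM-based synthesizer rather than being applied as a direct back-end modification.

```python
# Minimal sketch of the perturbation-model idea described in the abstract,
# NOT the authors' implementation. The feature set, toy data, and linear
# regressor are assumptions made for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical context features per syllable, e.g.
# [normalized position in phrase, lexical-stress flag,
#  distance to the emphasized word, part-of-speech id].
X_train = np.array([
    [0.10, 1, 0, 3],
    [0.40, 0, 1, 1],
    [0.70, 1, 2, 2],
    [0.95, 0, 3, 1],
])

# Targets: ratios of emphatic to neutral F0, duration, and energy,
# as would be measured from contrastive neutral/emphatic recordings.
y_train = np.array([
    [1.25, 1.30, 1.20],
    [1.05, 1.00, 1.02],
    [0.95, 0.98, 0.97],
    [0.90, 0.95, 0.96],
])

# LinearRegression handles multi-output targets directly.
perturbation_model = LinearRegression().fit(X_train, y_train)

def perturb(neutral_f0, neutral_dur, neutral_energy, context):
    """Scale neutral acoustic parameters toward their emphatic counterparts."""
    f0_r, dur_r, en_r = perturbation_model.predict(np.array([context]))[0]
    return neutral_f0 * f0_r, neutral_dur * dur_r, neutral_energy * en_r

# Example: predicted emphatic F0 (Hz), duration (s), and energy (dB)
# for a phrase-initial stressed syllable at the emphasized word.
print(perturb(200.0, 0.12, 65.0, [0.10, 1, 0, 3]))
```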



Acknowledgments

This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304), and partially supported by the Hong Kong SAR Government's Research Grants Council (N-CUHK414/09) and the National Natural Science Foundation of China (60805008 and 61003094). The authors would like to thank the students of the Human-Computer Speech Interaction research group at Tsinghua University, the Graduate School at Shenzhen of Tsinghua University, and the Chinese University of Hong Kong for their help with the dataset setup and the experiments.

Author information


Corresponding author

Correspondence to Jia Jia.


About this article

Cite this article

Meng, F., Wu, Z., Jia, J. et al. Synthesizing English emphatic speech for multimodal corrective feedback in computer-aided pronunciation training. Multimed Tools Appl 73, 463–489 (2014). https://doi.org/10.1007/s11042-013-1601-y
