Abstract
Emphasis plays an important role in expressive speech synthesis: it highlights the focus of an utterance and draws the listener's attention to it. We present a hidden Markov model (HMM)-based emphatic speech synthesis model whose ultimate objective is to synthesize corrective feedback in a computer-aided pronunciation training (CAPT) system. We first analyze contrastive (neutral versus emphatic) speech recordings, examining the changes in acoustic features under emphasis at different prosodic locations and the local prominence of emphasis. Based on this analysis, we develop a perturbation model that predicts the changes in acoustic features from neutral to emphatic speech with high accuracy. Building on the perturbation model, we then develop an HMM-based emphatic speech synthesis model. Unlike previous work, the HMM is trained on a neutral corpus, but context features and additional acoustic-feature-related features are used when growing the decision tree. The output of the perturbation model can then be used to supervise the HMM in synthesizing emphatic speech, instead of applying the perturbation model directly at the back end of a neutral speech synthesis model. In this way, the demand for an emphasis corpus is reduced and the loss of speech quality caused by speech modification algorithms is avoided. Experiments indicate that the proposed model improves the emphasis quality of synthesized speech while maintaining a high degree of naturalness.
Acknowledgments
This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). It is also partially supported by the Hong Kong SAR Government's Research Grants Council (N-CUHK414/09) and the National Natural Science Foundation of China (60805008 and 61003094). The authors would like to thank the students of the Human Computer Speech Interaction research group at Tsinghua University, the Graduate School at Shenzhen of Tsinghua University, and the Chinese University of Hong Kong for their cooperation with the dataset setup and experiments.
Cite this article
Meng, F., Wu, Z., Jia, J. et al. Synthesizing English emphatic speech for multimodal corrective feedback in computer-aided pronunciation training. Multimed Tools Appl 73, 463–489 (2014). https://doi.org/10.1007/s11042-013-1601-y
DOI: https://doi.org/10.1007/s11042-013-1601-y