Abstract
This paper addresses intonation synthesis that combines statistical and generative models to manipulate fundamental frequency (F0) contours in the framework of HMM-based speech synthesis. In light of the Fujisaki model, an F0 contour is represented on a logarithmic scale as a superposition of micro, accent, and register components. The three component sets are extracted from a speech corpus by a pitch decomposition algorithm based on a functional F0 model, and a separate context-dependent HMM (CDHMM) is trained for each component. At synthesis time, the CDHMM-generated micro, accent, and register components are superimposed to form the F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, the method improves naturalness by achieving better local and global F0 behavior, and it exhibits a link between phonology and phonetics, which makes it possible to control intonation flexibly, on the fly, by using given marking information to manipulate the parameters of the functional F0 model.
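The superposition step described in the abstract can be illustrated with a minimal sketch. It assumes only that the three components are per-frame values in the natural-log F0 domain and that they combine by addition. The function name `superimpose_log_f0` and the toy numbers are hypothetical, chosen for illustration; this is not the paper's functional F0 model, its decomposition algorithm, or its CDHMM generation.

```python
import math


def superimpose_log_f0(micro, accent, register):
    """Add three per-frame log-F0 components and return F0 in Hz.

    Assumption: each input is a list of natural-log values, one per frame,
    and the components are strictly additive in the log domain.
    """
    if not (len(micro) == len(accent) == len(register)):
        raise ValueError("components must have the same number of frames")
    return [math.exp(m + a + r) for m, a, r in zip(micro, accent, register)]


# Hypothetical toy values (not taken from the paper):
# a flat 120 Hz register, an accent that raises F0 by 25% on two frames,
# and a micro perturbation that lowers F0 by 4% on one frame.
register = [math.log(120.0)] * 4
accent = [0.0, math.log(1.25), math.log(1.25), 0.0]
micro = [0.0, 0.0, math.log(0.96), 0.0]

f0_hz = superimpose_log_f0(micro, accent, register)
```

Because the components add in the log domain, each one acts as a multiplicative factor on F0 in Hz, so scaling or replacing one component leaves the other two unchanged.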
Cite this article
Ni, J., Shiga, Y. & Hori, C. Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model. J Sign Process Syst 82, 273–286 (2016). https://doi.org/10.1007/s11265-015-1011-7
DOI: https://doi.org/10.1007/s11265-015-1011-7