Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model

Ni, Jinfu; Shiga, Yoshinori; Hori, Chiori

doi:10.1007/s11265-015-1011-7


Published: 19 May 2015

Volume 82, pages 273–286, (2016)

Journal of Signal Processing Systems

Abstract

This paper addresses intonation synthesis that combines statistical and generative models to manipulate fundamental frequency (F₀) contours within the framework of HMM-based speech synthesis. In light of the Fujisaki model, an F₀ contour is represented as a superposition of micro, accent, and register components on a logarithmic scale. The three component sets are extracted from a speech corpus by a pitch decomposition algorithm built on a functional F₀ model, and a separate context-dependent HMM (CDHMM) is trained for each component. At synthesis time, the CDHMM-generated micro, accent, and register components are superimposed to form the F₀ contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, the method improves naturalness by achieving better local and global F₀ behavior. It also exhibits a link between phonology and phonetics, which makes it possible to control intonation flexibly by using given marking information to manipulate the parameters of the functional F₀ model on the fly.
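
The superposition step described in the abstract can be illustrated with a minimal sketch. It assumes that each component is given as a per-frame value in the natural-log F₀ domain and that the three are simply summed before conversion back to Hz. The function name `superpose_log_f0` and the toy values are hypothetical and are not taken from the paper, which obtains the components from trained CDHMMs and a functional F₀ model.

```python
import math


def superpose_log_f0(micro, accent, register):
    """Sum three per-frame log-F0 components and return F0 in Hz.

    Assumption: all components are natural-log offsets on a shared
    frame grid; this is an illustrative simplification.
    """
    if not (len(micro) == len(accent) == len(register)):
        raise ValueError("components must have the same number of frames")
    # Addition in the log domain corresponds to multiplication in Hz.
    return [math.exp(m + a + r) for m, a, r in zip(micro, accent, register)]


# Toy, hand-made values (hypothetical, not data from the paper).
register = [math.log(120.0)] * 5        # slowly varying baseline level
accent = [0.0, 0.1, 0.2, 0.1, 0.0]      # local rise and fall
micro = [0.0, -0.01, 0.0, 0.01, 0.0]    # small segmental perturbations

f0 = superpose_log_f0(micro, accent, register)
```

Because the components are added on a logarithmic scale, scaling one of them (for example, the accent component) changes the contour multiplicatively in Hz, which is one way to picture the kind of intonation control the abstract mentions.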

Author information

Authors and Affiliations

Spoken Language Communication Laboratory, Universal Communication Research Institute, National Institute of Information and Communications Technology, Kyoto, Japan
Jinfu Ni, Yoshinori Shiga & Chiori Hori

Corresponding author

Correspondence to Jinfu Ni.

About this article

Cite this article

Ni, J., Shiga, Y. & Hori, C. Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model. J Sign Process Syst 82, 273–286 (2016). https://doi.org/10.1007/s11265-015-1011-7

Received: 15 November 2014
Revised: 14 April 2015
Accepted: 28 April 2015
Published: 19 May 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11265-015-1011-7
