Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis

  • Conference paper
Statistical Language and Speech Processing (SLSP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Abstract

Text-to-speech synthesis is a task that solves many real-world problems such as providing speaking and reading ability to people who lack those capabilities. It is thus viewed mainly as an engineering problem rather than a purely scientific one. Therefore many of the solutions in speech synthesis are purely practical. However, from the point of view of phonetics, the process of producing speech from text artificially is also a scientific one. Here I argue – using an example from speech prosody, namely speech melody – that phonetics is the key discipline in helping to solve what is arguably one of the most interesting problems in machine learning.

Notes

  1. For a good overview of the techniques used, see [43].

  2. There are interesting developments towards more articulatory control in HMM-based TTS [53]. However, this can only be seen as a compromise, as the units are still defined acoustically and do not necessarily correspond to the actual underlying articulatory gestures.

References

  1. (2014). http://www.simple4all.org

  2. Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2–3), 109–118 (1992)

  3. Alku, P., Tiitinen, H., Näätänen, R.: A method for generating natural-sounding speech stimuli for cognitive brain research. Clin. Neurophysiol. 110, 1329–1333 (1999)

  4. Altosaar, T., Karjalainen, M.: Multiple-resolution analysis of speech signals. In: Proceedings of IEEE ICASSP-88, New York (1988)

  5. Anumanchipalli, G.K., Oliveira, L.C., Black, A.W.: A statistical phrase/accent model for intonation modeling. In: INTERSPEECH, pp. 1813–1816 (2011)

  6. Arnold, D., Wagner, P., Möbius, B.: Obtaining prominence judgments from naïve listeners - influence of rating scales, linguistic levels and normalisation. In: Proceedings of Interspeech 2012 (2012)

  7. Badino, L., Clark, R.A., Wester, M.: Towards hierarchical prosodic prominence generation in TTS synthesis. In: INTERSPEECH (2012)

  8. Badino, L., D’Ausilio, A., Fadiga, L., Metta, G.: Computational validation of the motor contribution to speech perception. Top. Cogn. Sci. 6(3), 461–475 (2014)

  9. Bailly, G., Holm, B.: SFC: a trainable prosodic model. Speech Commun. 46(3), 348–364 (2005)

  10. Becker, S., Schröder, M., Barry, W.J.: Rule-based prosody prediction for German text-to-speech synthesis. In: Proceedings of Speech Prosody 2006, pp. 503–506 (2006)

  11. Bengio, Y.: Evolving culture vs local minima. arXiv preprint arXiv:1203.2990 (2012)

  12. Bengio, Y.: Deep learning of representations: looking forward. In: Dediu, A.-H., Martín-Vide, C., Mitkov, R., Truthe, B. (eds.) SLSP 2013. LNCS, vol. 7978, pp. 1–37. Springer, Heidelberg (2013)

  13. Beňuš, Š.: Conversational entrainment in the use of discourse markers. In: Bassis, S., Esposito, A., Morabito, F.C. (eds.) Recent Advances of Neural Network Models and Applications, pp. 345–352. Springer, Heidelberg (2014)

  14. Birkholz, P.: Modeling consonant-vowel coarticulation for articulatory speech synthesis. PloS One 8(4), e60603 (2013)

  15. Birkholz, P., Jackel, D.: A three-dimensional model of the vocal tract for speech synthesis. In: Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, pp. 2597–2600 (2003)

  16. Bolinger, D.L.: Around the edge of language: intonation. Harvard Educ. Rev. 34(2), 282–296 (1964)

  17. Campbell, W.N.: CHATR: a high-definition speech re-sequencing system. In: Proceedings of 3rd ASA/ASJ Joint Meeting, pp. 1223–1228 (1996)

  18. Cole, J., Mo, Y., Hasegawa-Johnson, M.: Signal-based and expectation-based factors in the perception of prosodic prominence. Lab. Phonology 1(2), 425–452 (2010)

  19. Cooper, F.S.: Speech synthesizers. In: Proceedings of 4th International Congress of Phonetic Sciences (ICPhS’61), pp. 3–13 (1962)

  20. D’Ausilio, A., Maffongelli, L., Bartoli, E., Campanella, M., Ferrari, E., Berry, J., Fadiga, L.: Listening to speech recruits specific tongue motor synergies as revealed by transcranial magnetic stimulation and tissue-doppler ultrasound imaging. Philos. Trans. R. Soc. B: Biol. Sci. 369(1644), 20130418 (2014)

  21. Denes, P.B., Pinson, E.N.: The Speech Chain, p. 121. Bell Laboratory Educational Publication, New York (1963)

  22. Deng, L.: A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Trans. Signal Inf. Process. 3, e2 (2014)

  23. Deng, L., Li, X.: Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio, Speech Lang. Process. 21(5), 1060–1089 (2013)

  24. Dutoit, T.: An Introduction to Text-to-Speech Synthesis, vol. 3. Springer, New York (1997)

  25. Eriksson, A., Thunberg, G.C., Traunmüller, H.: Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing. In: Proceedings of European Conference on Speech Communication and Technology, Aalborg, vol. 1, pp. 399–402, September 2001

  26. Fant, C.G.M., Martony, J., Rengman, U., Risberg, A.: OVE II synthesis strategy. In: Proceedings of the Speech Communication Seminar F, vol. 5 (1962)

  27. Farouk, M.H.: Application of Wavelets in Speech Processing. Springer, New York (2014)

  28. Flanagan, J.L.: Speech Analysis, Synthesis and Perception, vol. 1, 2nd edn. Springer, Heidelberg (1972)

  29. Flanagan, J.L.: Note on the design of “terminal-analog” speech synthesizers. J. Acoust. Soc. Am. 29(2), 306–310 (1957)

  30. Frank, S.L., Bod, R., Christiansen, M.H.: How hierarchical is language use? Proc. R. Soc. B: Biol. Sci. 279, 4522–4531 (2012)

  31. Fujisaki, H., Hirose, K.: Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Jpn. (E) 5(4), 233–241 (1984)

  32. Fujisaki, H., Sudo, H.: A generative model for the prosody of connected speech in Japanese. Annu. Rep. Eng. Res. Inst. 30, 75–80 (1971)

  33. Fukui, K., Ishikawa, Y., Sawa, T., Shintaku, E., Honda, M., Takanishi, A.: New anthropomorphic talking robot having a three-dimensional articulation mechanism and improved pitch range. In: 2007 IEEE International Conference on Robotics and Automation, pp. 2922–2927. IEEE (2007)

  34. Goldsmith, J.A.: Autosegmental and Metrical Phonology, vol. 11. Blackwell, Oxford (1990)

  35. Grossmann, A., Morlet, J.: Decomposition of functions into wavelets of constant shape, and related transforms. Math. Phys. Lect. Recent Results 11, 135–165 (1985)

  36. Halle, M., Vergnaud, J.R.: Three dimensional phonology. J. Linguist. Res. 1(1), 83–105 (1980)

  37. Halle, M., Vergnaud, J.R., et al.: Metrical Structures in Phonology. MIT, Cambridge (1978)

  38. Hannukainen, A., Lukkari, T., Malinen, J., Palo, P.: Vowel formants from the wave equation. J. Acoust. Soc. Am. 122(1), EL1–EL7 (2007)

  39. Hertz, S.R.: From text to speech with SRS. J. Acoust. Soc. Am. 72(4), 1155–1170 (1982)

  40. Hertz, S.R., Kadin, J., Karplus, K.J.: The delta rule development system for speech synthesis from text. Proc. IEEE 73(11), 1589–1601 (1985)

  41. Hirschberg, J.: Pitch accent in context: predicting intonational prominence from text. Artif. Intell. 63(1–2), 305–340 (1993)

  42. Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 1, pp. 373–376. IEEE (1996)

  43. King, S.: Measuring a decade of progress in text-to-speech. Loguens 1(1) (2014)

  44. Klatt, D.H.: Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 82(3), 737–793 (1987)

  45. Klatt, D.: Acoustic theory of terminal analog speech synthesis. In: Proceedings of 1972 International Conference on Speech Communication Processing, Boston, MA (1972)

  46. Kleijn, W.B.: Principles of speech coding. In: Benesty, J., Sondhi, M.M., Huang, Y. (eds.) Springer Handbook of Speech Processing, pp. 283–306. Springer, Heidelberg (2008)

  47. Kochanski, G., Shih, C.: Stem-ML: language-independent prosody description. In: INTERSPEECH, pp. 239–242 (2000)

  48. Kochanski, G., Shih, C.: Prosody modeling with soft templates. Speech Commun. 39(3), 311–352 (2003)

  49. Kruschke, H., Lenz, M.: Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis. In: INTERSPEECH (2003)

  50. Lei, M., Wu, Y.J., Soong, F.K., Ling, Z.H., Dai, L.R.: A hierarchical f0 modeling method for HMM-based speech synthesis. In: INTERSPEECH, pp. 2170–2173 (2010)

  51. Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74(6), 431 (1967)

  52. Liberman, A.M., Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21(1), 1–36 (1985)

  53. Ling, Z.H., Richmond, K., Yamagishi, J.: Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Trans. Audio Speech Lang. Process. 21(1), 207–219 (2013)

  54. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press (1999)

  55. Mishra, T., Santen, J.V., Klabbers, E.: Decomposition of pitch curves in the general superpositional intonation model. In: Speech Prosody, Dresden, Germany (2006)

  56. Moro, E.B.: A 19th-century speaking machine: the Tecnefón of Severino Perez y Vazquez. Historiographia Linguistica 34(1), 19–36 (2007)

  57. Nishikawa, K., Asama, K., Hayashi, K., Takanobu, H., Takanishi, A.: Development of a talking robot. In: Proceedings of 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems 2000 (IROS 2000), vol. 3, pp. 1760–1765. IEEE (2000)

  58. Öhman, S.: Word and sentence intonation: a quantitative model. Speech Transmission Laboratory, Department of Speech Communication, Royal Institute of Technology (1967)

  59. Pfeifer, R., Lungarella, M., Iida, F.: Self-organization, embodiment, and biologically inspired robotics. Science 318(5853), 1088–1093 (2007)

  60. Raitio, T., Lu, H., Kane, J., Suni, A., Vainio, M., King, S., Alku, P.: Voice source modelling using deep neural networks for statistical parametric speech synthesis. In: 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014 (accepted)

  61. Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P.: Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. In: Proceedings of Interspeech, Singapore, September 2014 (accepted)

  62. Raitio, T., Suni, A., Pohjalainen, J., Airaksinen, M., Vainio, M., Alku, P.: Analysis and synthesis of shouted speech. In: Interspeech, Lyon, France, pp. 1544–1548, August 2013

  63. Raitio, T., Suni, A., Vainio, M., Alku, P.: Analysis of HMM-based Lombard speech synthesis. In: Interspeech, Florence, Italy, pp. 2781–2784, August 2011

  64. Raitio, T., Suni, A., Vainio, M., Alku, P.: Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise. Comput. Speech Lang. 28(2), 648–664 (2014)

  65. Ramachandran, R., Mammone, R.: Modern Methods of Speech Processing. Springer, New York (1995)

  66. Riley, M.D.: Speech Time-Frequency Representation, vol. 63. Springer, New York (1989)

  67. van Rooij, J.C., Plomp, R.: The effect of linguistic entropy on speech perception in noise in young and elderly listeners. J. Acoust. Soc. Am. 90(6), 2985–2991 (1991)

  68. van Santen, J.P., Mishra, T., Klabbers, E.: Estimating phrase curves in the general superpositional intonation model. In: Fifth ISCA Workshop on Speech Synthesis (2004)

  69. Schroeder, M.R.: A brief history of synthetic speech. Speech Commun. 13(1), 231–237 (1993)

  70. Simko, J., Cummins, F.: Embodied task dynamics. Psychol. Rev. 117(4), 1229 (2010)

  71. Šimko, J., O’Dell, M., Vainio, M.: Emergent consonantal quantity contrast and context-dependence of gestural phasing. J. Phonetics 44, 130–151 (2014)

  72. Sondhi, M.M., Schroeter, J.: A hybrid time-frequency domain articulatory speech synthesizer. IEEE Trans. Acoust. Speech Signal Process. 35(7), 955–967 (1987)

  73. Sproat, R.W.: Multilingual Text-to-Speech Synthesis. Kluwer Academic Publishers, Boston (1997)

  74. Story, B.H.: A parametric model of the vocal tract area function for vowel and consonant simulation. J. Acoust. Soc. Am. 117(5), 3231–3254 (2005)

  75. Suni, A., Aalto, D., Raitio, T., Alku, P., Vainio, M.: Wavelets for intonation modeling in HMM speech synthesis. In: 8th ISCA Speech Synthesis Workshop (SSW8), Barcelona, Spain, pp. 285–290, August-September 2013

  76. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM speech synthesis entry for Blizzard Challenge 2010. In: Blizzard Challenge 2010 Workshop, Kyoto, Japan, September 2010

  77. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2011: utilizing source unit selection in HMM-based speech synthesis for improved excitation generation. In: Blizzard Challenge 2011 Workshop, Florence, Italy, September 2011

  78. Suni, A., Raitio, T., Vainio, M., Alku, P.: The GlottHMM entry for Blizzard Challenge 2012 - hybrid approach. In: Blizzard Challenge 2012 Workshop, Portland, Oregon, September 2012

  79. Suni, A., Simko, J., Aalto, D., Vainio, M.: Continuous wavelet transform in text-to-speech synthesis prosody control (in preparation)

  80. Suni, A.S., Aalto, D., Raitio, T., Alku, P., Vainio, M., et al.: Wavelets for intonation modeling in HMM speech synthesis. In: Proceedings of 8th ISCA Workshop on Speech Synthesis, Barcelona, 31 August-2 September 2013

  81. Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge (2009)

  82. Tokuda, K., Kobayashi, T., Imai, S.: Speech parameter generation from HMM using dynamic features. In: 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 1, pp. 660–663. IEEE (1995)

  83. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T.: Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’00, vol. 3, pp. 1315–1318. IEEE (2000)

  84. Vainio, L., Tiainen, M., Tiippana, K., Vainio, M.: Shared processing of planning articulatory gestures and grasping. Exp. Brain Res. 232(7), 2359–2368 (2014)

  85. Vainio, L., Schulman, M., Tiippana, K., Vainio, M.: Effect of syllable articulation on precision and power grip performance. PloS One 8(1), e53061 (2013)

  86. Vainio, M., Järvikivi, J.: Tonal features, intensity, and word order in the perception of prominence. J. Phonetics 34, 319–342 (2006)

  87. Vainio, M., Suni, A., Aalto, D.: Continuous wavelet transform for analysis of speech prosody. In: Proceedings of TRASP 2013 - Tools and Resources for the Analysis of Speech Prosody, An Interspeech 2013 Satellite Event, August 30 2013, Laboratoire Parole et Langage, Aix-en-Provence, France (2013)

  88. Vainio, M., Suni, A., Aalto, D.: Emphasis, word prominence, and continuous wavelet transform in the control of HMM based synthesis. In: Speech Prosody in Speech Synthesis - Modeling, Realizing, Converting Prosody for High Quality and Flexible Speech Synthesis, Prosody, Phonology and Phonetics. Springer (2015)

  89. Vainio, M., Suni, A., Raitio, T., Nurminen, J., Järvikivi, J., Alku, P.: New method for delexicalization and its application to prosodic tagging for text-to-speech synthesis. In: Interspeech, Brighton, UK, pp. 1703–1706, September 2009

  90. Vainio, M., Suni, A., Sirjola, P.: Developing a Finnish concept-to-speech system. In: Langemets, M., Penjam, P. (eds.) Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, pp. 201–206, 4–5 April 2005

  91. von Kempelen, W., de Pázmánd, W.K., Autriche, M.: Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. bei JV Degen (1791)

  92. Watts, O.S.: Unsupervised learning for text-to-speech synthesis. Ph.D. thesis (2013)

  93. Zen, H., Braunschweiler, N.: Context-dependent additive log f_0 model for HMM-based speech synthesis. In: INTERSPEECH, pp. 2091–2094 (2009)

  94. Zen, H., Tokuda, K., Black, A.W.: Statistical parametric speech synthesis. Speech Commun. 51(11), 1039–1064 (2009)

Acknowledgements

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 287678 (Simple4All) and the Academy of Finland (projects 128204, 125940, and 1265610 (the MIND programme)). I would also like to thank Antti Suni, Daniel Aalto, and Juraj Šimko for their insightful discussions regarding this manuscript. Special thanks go to Paavo Alku and Tuomo Raitio for the GlottHMM collaboration.

Author information

Corresponding author

Correspondence to Martti Vainio.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Vainio, M. (2014). Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_3

  • DOI: https://doi.org/10.1007/978-3-319-11397-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11396-8

  • Online ISBN: 978-3-319-11397-5

  • eBook Packages: Computer Science (R0)
