Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis

Wightman, Colin W.; Syrdal, Ann K.; Stemmer, Georg; Conkie, Alistair; Beutnagel, Mark

doi:10.21437/ICSLP.2000-211

Prosody is an important factor in the quality of text-to-speech (TTS) synthesis. Typically, acoustic parameters such as F0 and duration are the only prosody-related variables used in unit selection. Our study explored the explicit use of linguistically and perceptually motivated prosodic categories in unit selection-based TTS. One of our goals was to automate the prosodic labeling of our TTS inventory. However, inter-labeler reliability for some ToBI (Tones and Break Indices) categories was too low for successful training of an automatic prosody recognizer. We therefore developed a prosody labeling system that is simpler and more robust than standard EToBI (English ToBI). This "ToBI Lite" system was used successfully for automatic labeling of the acoustic inventory and in prosodically enriched unit selection. A formal listening test compared subjective quality ratings for several variations of the AT&T unit selection concatenative TTS system that differed only in their method of prosodic labeling of the inventory or in their use of prosody for unit selection. The use of simple prosodic categories in unit selection significantly improved ratings, and automatic prosodic labeling resulted in higher ratings than manual labeling.

Cite as: Wightman, C.W., Syrdal, A.K., Stemmer, G., Conkie, A., Beutnagel, M. (2000) Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 2, 71-74, doi: 10.21437/ICSLP.2000-211

@inproceedings{wightman00_icslp,
  author={Colin W. Wightman and Ann K. Syrdal and Georg Stemmer and Alistair Conkie and Mark Beutnagel},
  title={{Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  volume={2},
  pages={71--74},
  doi={10.21437/ICSLP.2000-211}
}