Unsupervised prominence prediction for speech synthesis

Mehrabani, Mahnoosh; Mishra, Taniya; Conkie, Alistair

doi:10.21437/Interspeech.2013-394

We propose an unsupervised prominence prediction method for expressive speech synthesis. Prominence patterns are learned by statistical analysis of prosodic features extracted from speech data. Because the method is unsupervised and data-driven, it adapts easily to new speakers, speech styles, and even languages, without requiring expert knowledge or complicated linguistic rules. First, prosodic features that are predictive of prominence are extracted at the foot level. Next, the extracted features are clustered, with each cluster representing a prominence level. The optimal number of perceptually distinct prominence levels is then determined from the just-noticeable differences of the prosodic features. Finally, the proposed prominence prediction is applied to prosody prediction for unit selection speech synthesis. Perceptual evaluation results show a preference for 4-level unsupervised prominence prediction over a rule-based baseline in terms of the naturalness and expressiveness of the synthesized speech.
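
The clustering and level-selection steps described in the abstract might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the stopping rule (increase the number of clusters until two cluster centers fall within one just-noticeable difference of each other in every feature) is one possible reading of "perceptually distinct". The function names, the per-foot feature values, and the JND thresholds are all hypothetical.

```python
from itertools import combinations


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic, evenly spaced initial centers."""
    ordered = sorted(points)
    n = len(ordered)
    # Spread the k initial centers across the sorted data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def perceptually_distinct(centers):
    """True if every pair of centers differs by >= 1 JND in at least one feature.

    Centers are expected in JND units (each feature divided by its JND).
    """
    return all(
        any(abs(a - b) >= 1.0 for a, b in zip(c1, c2))
        for c1, c2 in combinations(centers, 2)
    )


def choose_levels(features, jnd, max_k=6):
    """Return the largest k, tried in increasing order, whose centers stay distinct."""
    # Assumption: rescale each feature by its JND so distances are in JND units.
    scaled = [tuple(x / j for x, j in zip(f, jnd)) for f in features]
    best = 1
    for k in range(2, max_k + 1):
        if k > len(set(scaled)):
            break
        if not perceptually_distinct(kmeans(scaled, k)):
            break
        best = k
    return best


# Hypothetical per-foot features: (F0 range in semitones, duration in ms).
features = [
    (1.0, 100), (1.2, 105), (0.9, 98),
    (4.0, 160), (4.1, 165), (3.9, 158),
    (8.0, 240), (8.2, 236), (7.9, 245),
]
# Hypothetical JND thresholds for the two features above.
jnd = (1.0, 20.0)

levels = choose_levels(features, jnd)
print(levels)
```

The toy data is built around three well-separated groups only to show the mechanics; the number of levels found on real speech data depends on the features and thresholds used.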

Cite as: Mehrabani, M., Mishra, T., Conkie, A. (2013) Unsupervised prominence prediction for speech synthesis. Proc. Interspeech 2013, 1559-1563, doi: 10.21437/Interspeech.2013-394

@inproceedings{mehrabani13_interspeech,
  author={Mahnoosh Mehrabani and Taniya Mishra and Alistair Conkie},
  title={{Unsupervised prominence prediction for speech synthesis}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={1559--1563},
  doi={10.21437/Interspeech.2013-394}
}