Text-to-speech synthesis with arbitrary speaker's voice from average voice

Tamura, Masatsune; Masuko, Takashi; Tokuda, Keiichi; Kobayashi, Takao

doi:10.21437/Eurospeech.2001-107

This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR adaptation algorithm. To generate speech of an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptation, we derive an algorithm for adaptation of state duration. We demonstrate that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features. Synthetic speech generated from models adapted using only four sentences is very close to that generated from speaker-dependent models trained on a large amount of speech data.
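As a rough illustration of the adaptation step mentioned in the abstract, the sketch below applies the standard MLLR mean transform, in which each Gaussian mean mu of the average voice model is mapped to A * mu + b. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mllr_adapt_mean`, the toy matrices, and the two-dimensional means are all hypothetical, the transform is assumed to be already estimated from the target speaker's data, and the state duration adaptation derived in the paper is not covered.

```python
# Minimal sketch of applying an MLLR mean transform (illustrative only).
# Assumption: A and b were already estimated from the target speaker's
# adaptation sentences; estimation itself is not shown here.

def mllr_adapt_mean(mean, A, b):
    """Return the adapted mean A * mean + b for one Gaussian."""
    if len(A) != len(b) or any(len(row) != len(mean) for row in A):
        raise ValueError("shape mismatch between A, b, and mean")
    return [sum(a * m for a, m in zip(row, mean)) + offset
            for row, offset in zip(A, b)]

# Toy 2-dimensional example; all numbers are made up for illustration.
average_voice_means = [[1.0, 2.0], [0.0, -1.0]]
A = [[2.0, 0.0], [0.0, 0.5]]   # hypothetical regression matrix
b = [1.0, -1.0]                # hypothetical bias vector

# One shared transform is applied to every Gaussian in the regression class.
adapted_means = [mllr_adapt_mean(mu, A, b) for mu in average_voice_means]
print(adapted_means)
```

Because one transform is shared by many Gaussians, only a small number of parameters has to be estimated, which is consistent with the abstract's observation that a few sentences from the target speaker can suffice.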

Cite as: Tamura, M., Masuko, T., Tokuda, K., Kobayashi, T. (2001) Text-to-speech synthesis with arbitrary speaker's voice from average voice. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 345-348, doi: 10.21437/Eurospeech.2001-107

@inproceedings{tamura01_eurospeech,
  author={Masatsune Tamura and Takashi Masuko and Keiichi Tokuda and Takao Kobayashi},
  title={{Text-to-speech synthesis with arbitrary speaker's voice from average voice}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={345--348},
  doi={10.21437/Eurospeech.2001-107}
}