Speech synthesis based on a single set of articulatory-movement HMMs, commonly applied to both speech recognition (SR) and speech synthesis (SS), is described. In the SS module, speaker-invariant HMMs generate an articulatory feature (AF) sequence; the AFs are then converted into vocal-tract parameters by a multilayer neural network (MLN), and a speech signal is synthesized through an LSP digital filter. A CELP coding technique is applied to improve the voice sources, which are generated from codes embedded in the corresponding HMM states. Because the proposed SS module separates phonetic information from speaker individuality, the target speaker's voice can be synthesized from a small amount of speech data. In the experiments, listening tests with ten subjects evaluated both the sound quality and the speaker individuality of the synthesized speech. The results confirmed that the proposed SS module can produce good-quality speech of the target speaker even when trained on as little as two sentences.
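The pipeline described above ends in an LSP digital filter driven by the generated voice source. As a rough illustration of that final stage only, the sketch below converts a frame of line spectral pair (LSP) frequencies into LPC coefficients and runs an all-pole synthesis filter over an excitation signal. The LSP values, filter order, and impulse excitation are made-up placeholders rather than parameters from the paper, and the HMM, MLN, and CELP stages are omitted entirely.

```python
import numpy as np

def lsp_to_lpc(lsp):
    """Convert sorted LSP frequencies (radians, even order p) to LPC coefficients.

    Odd-indexed frequencies (1-based) are roots of the symmetric polynomial Q,
    which also has a root at z = -1; even-indexed frequencies are roots of the
    antisymmetric polynomial P, which has a root at z = +1.
    The LPC polynomial is recovered as A(z) = (P(z) + Q(z)) / 2.
    """
    wQ, wP = lsp[0::2], lsp[1::2]
    Q = np.poly(np.concatenate([np.exp(1j * wQ), np.exp(-1j * wQ)])).real
    P = np.poly(np.concatenate([np.exp(1j * wP), np.exp(-1j * wP)])).real
    Q = np.convolve(Q, [1.0, 1.0])    # attach the fixed zero at z = -1
    P = np.convolve(P, [1.0, -1.0])   # attach the fixed zero at z = +1
    a = 0.5 * (P + Q)
    return a[:-1]                     # highest-order terms cancel to zero

def synthesize(a, excitation):
    """All-pole filtering 1/A(z): y[n] = e[n] - sum_k a[k] * y[n-k]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Placeholder LSPs for one frame (order 6), impulse-excited for illustration.
lsp = np.array([0.3, 0.8, 1.5, 2.0, 2.5, 2.9])
a = lsp_to_lpc(lsp)
excitation = np.zeros(200)
excitation[0] = 1.0
frame = synthesize(a, excitation)
```

A sorted, interlaced LSP set is guaranteed to yield a stable (minimum-phase) A(z), which is one reason LSPs, rather than raw LPC coefficients, are the usual domain for interpolating vocal-tract parameters between frames.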
Cite as: Nitta, T., Onoda, T., Kimura, M., Iribe, Y., Katsurada, K. (2011) Speech synthesis based on articulatory-movement HMMs with voice-source codebooks. Proc. Interspeech 2011, 1841-1844, doi: 10.21437/Interspeech.2011-43
@inproceedings{nitta11_interspeech,
  author={Tsuneo Nitta and Takayuki Onoda and Masashi Kimura and Yurie Iribe and Kouichi Katsurada},
  title={{Speech synthesis based on articulatory-movement HMMs with voice-source codebooks}},
  year=2011,
  booktitle={Proc. Interspeech 2011},
  pages={1841--1844},
  doi={10.21437/Interspeech.2011-43}
}