ISCA Archive Interspeech 2014

Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort

Tuomo Raitio, Antti Suni, Lauri Juvela, Martti Vainio, Paavo Alku

This paper studies a deep neural network (DNN) based voice source modelling method for the synthesis of speech with varying vocal effort. The new trainable voice source model uses a DNN to learn a mapping between acoustic features and the time-domain, pitch-synchronous glottal flow waveform. The model is trained on speech material spanning breathy, normal, and Lombard styles. In synthesis, a normal voice is first adapted to the desired style, and the flexible DNN-based voice source model then automatically generates a style-specific excitation waveform from the adapted acoustic features. The proposed model is compared to a robust, high-quality excitation modelling method that uses a manually selected mean glottal flow pulse for each vocal effort level together with a spectral matching filter that matches the voice source spectrum to the desired style. Subjective evaluations show that the proposed DNN-based method is rated comparable to the baseline, while avoiding the manual selection of pulses and being computationally faster than a system using a spectral matching filter.
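The core idea of the abstract, a network mapping per-frame acoustic features to a time-domain glottal pulse, can be sketched as a small feed-forward model. This is a minimal illustration, not the authors' architecture: the feature dimension, pulse length, layer sizes, and tanh/linear activations are all assumptions, and the weights here are random rather than trained.

```python
import numpy as np

# Assumed dimensions (illustrative, not from the paper):
FEATURE_DIM = 48   # e.g. spectral envelope + F0 + energy per frame
PULSE_LEN = 400    # samples in one pitch-synchronous glottal pulse
HIDDEN = 256       # single hidden layer width

rng = np.random.default_rng(0)
W1 = rng.standard_normal((HIDDEN, FEATURE_DIM)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((PULSE_LEN, HIDDEN)) * 0.01
b2 = np.zeros(PULSE_LEN)

def predict_pulse(features: np.ndarray) -> np.ndarray:
    """Forward pass: acoustic feature frame -> glottal flow pulse waveform."""
    h = np.tanh(W1 @ features + b1)   # hidden layer (assumed activation)
    return W2 @ h + b2                # linear output: waveform samples

# One synthetic feature frame stands in for adapted acoustic features;
# in the paper's pipeline these would come from style adaptation.
frame = rng.standard_normal(FEATURE_DIM)
pulse = predict_pulse(frame)
print(pulse.shape)  # (400,)
```

In synthesis, one such pulse would be generated per pitch period and concatenated (e.g. with overlap-add) to form the excitation signal; training would fit the weights by regression from features to reference glottal pulses extracted by inverse filtering.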


doi: 10.21437/Interspeech.2014-444

Cite as: Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P. (2014) Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. Proc. Interspeech 2014, 1969-1973, doi: 10.21437/Interspeech.2014-444

@inproceedings{raitio14_interspeech,
  author={Tuomo Raitio and Antti Suni and Lauri Juvela and Martti Vainio and Paavo Alku},
  title={{Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1969--1973},
  doi={10.21437/Interspeech.2014-444}
}