This paper studies a deep neural network (DNN) based voice source modelling method for the synthesis of speech with varying vocal effort. The trainable voice source model uses a DNN to learn a mapping from acoustic features to the time-domain, pitch-synchronous glottal flow waveform. The model is trained on speech material covering breathy, normal, and Lombard speech. In synthesis, a normal voice is first adapted to the desired style, and the DNN-based voice source model then automatically generates a style-specific excitation waveform from the adapted acoustic features. The proposed model is compared to a robust, high-quality baseline excitation modelling method that uses manually selected mean glottal flow pulses for each vocal effort level, together with a spectral matching filter that matches the voice source spectrum to the desired style. Subjective evaluations show that the proposed DNN-based method is rated comparable to the baseline, while avoiding manual pulse selection and being computationally faster than a system that uses a spectral matching filter.
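The mapping described in the abstract, from a frame of acoustic features to a time-domain glottal flow pulse, can be illustrated with a minimal sketch. This is not the authors' implementation: the single hidden layer, the layer sizes (`FEATURE_DIM`, `HIDDEN_DIM`, `PULSE_LEN`), the function names, the plain gradient-descent update, and the synthetic half-sine training targets are all assumptions chosen only to keep the example small and runnable.

```python
import numpy as np

# All sizes below are hypothetical placeholders, not values from the paper.
FEATURE_DIM = 47   # length of one frame's acoustic feature vector
HIDDEN_DIM = 64    # width of the single hidden layer
PULSE_LEN = 400    # samples in one fixed-length glottal flow pulse


def init_params(rng, in_dim=FEATURE_DIM, hidden=HIDDEN_DIM, out_dim=PULSE_LEN):
    """Small random weights for a one-hidden-layer feedforward network."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def predict_pulse(params, features):
    """Map acoustic features (one frame or a batch) to pulse waveform(s)."""
    h = np.tanh(features @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]


def train_step(params, features, pulses, lr=0.1):
    """One gradient-descent step on mean squared error.

    `features` and `pulses` are 2-D batches. Parameters are updated in
    place; the loss measured before the update is returned.
    """
    h = np.tanh(features @ params["W1"] + params["b1"])
    err = h @ params["W2"] + params["b2"] - pulses
    loss = float(np.mean(err ** 2))
    g_out = 2.0 * err / err.size                       # dLoss/dPrediction
    g_hid = (g_out @ params["W2"].T) * (1.0 - h ** 2)  # back through tanh
    params["W2"] -= lr * (h.T @ g_out)
    params["b2"] -= lr * g_out.sum(axis=0)
    params["W1"] -= lr * (features.T @ g_hid)
    params["b1"] -= lr * g_hid.sum(axis=0)
    return loss


# Toy data only: random "features" and a half-sine standing in for a pulse.
rng = np.random.default_rng(0)
params = init_params(rng)
features = rng.normal(size=(8, FEATURE_DIM))
pulses = np.tile(np.sin(np.linspace(0.0, np.pi, PULSE_LEN)), (8, 1))
for _ in range(20):
    loss = train_step(params, features, pulses)

# At synthesis time, style-adapted features would be fed in the same way.
excitation = predict_pulse(params, features[0])
```

In this sketch, changing the vocal effort style requires no change to the model itself: only the input features differ, which mirrors the abstract's point that the excitation waveform is generated from the adapted acoustic features without manual pulse selection.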
Cite as: Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P. (2014) Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. Proc. Interspeech 2014, 1969-1973, doi: 10.21437/Interspeech.2014-444
@inproceedings{raitio14_interspeech,
  author={Tuomo Raitio and Antti Suni and Lauri Juvela and Martti Vainio and Paavo Alku},
  title={{Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1969--1973},
  doi={10.21437/Interspeech.2014-444}
}