ISCA Archive Interspeech 2022

WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses

Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu

In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several specialized modules and techniques: 1) A deep bi-directional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) A Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) A 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; 4) A novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. To our knowledge, WeSinger is the first SVS system to adopt 24 kHz LPCNet and multi-singer pre-training simultaneously. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves state-of-the-art performance on the recent public Chinese singing corpus Opencpop\footnote{https://wenet.org.cn/opencpop/}. Some synthesized singing samples are available online\footnote{https://zzw922cn.github.io/wesinger/}.
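The abstract's "pitch-weighted decoder loss" can be illustrated with a minimal sketch: a reconstruction loss whose per-frame weights are boosted on voiced frames so that pitch-carrying regions dominate training. The function name `pitch_weighted_l1`, the weighting scheme, and the `alpha` parameter are all illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def pitch_weighted_l1(pred_mel, target_mel, f0, alpha=1.0):
    """Hypothetical pitch-weighted L1 reconstruction loss (illustrative only).

    pred_mel, target_mel: (T, D) mel-spectrogram frames
    f0: (T,) frame-level fundamental frequency in Hz (0 = unvoiced)
    alpha: extra weight given to voiced frames (assumption, not from the paper)
    """
    # Per-frame weights: 1 for unvoiced frames, 1 + alpha for voiced frames.
    weights = 1.0 + alpha * (np.asarray(f0) > 0).astype(np.float64)  # (T,)
    # Mean absolute error per frame across mel bins.
    per_frame = np.abs(np.asarray(pred_mel) - np.asarray(target_mel)).mean(axis=1)  # (T,)
    # Weighted average over frames, normalized by total weight.
    return float((weights * per_frame).sum() / weights.sum())
```

In this sketch, voiced frames contribute more to the loss than unvoiced ones, which is one simple way a decoder can be pushed to reconstruct pitched singing content more accurately.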


doi: 10.21437/Interspeech.2022-454

Cite as: Zhang, Z., Zheng, Y., Li, X., Lu, L. (2022) WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses. Proc. Interspeech 2022, 4252-4256, doi: 10.21437/Interspeech.2022-454

@inproceedings{zhang22e_interspeech,
  author={Zewang Zhang and Yibin Zheng and Xinhui Li and Li Lu},
  title={{WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4252--4256},
  doi={10.21437/Interspeech.2022-454}
}