Abstract
The increasing availability of audio data on the internet has led to a multitude of datasets for developing and training text-to-speech (TTS) applications based on deep neural networks. However, highly varying voice quality, low sampling rates, missing text normalization, and disadvantageous alignment of audio samples to their corresponding transcript sentences still limit the performance of deep neural networks trained on this task. Moreover, data resources in languages such as German remain very limited. We introduce the "HUI-Audio-Corpus-German", a large, open-source dataset for TTS engines, created with a processing pipeline that produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
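A common way to validate audio-to-transcription alignments in pipelines of this kind is to run an ASR system over each clip and compare its output to the book text via Levenshtein (edit) distance, discarding clips whose error rate is too high. The sketch below illustrates this idea; the function names, the 0.15 threshold, and the use of character-level edit distance are illustrative assumptions, not the authors' published implementation.

```python
# Hedged sketch: a plausible quality gate for audio-to-transcript alignment.
# Assumes an external ASR system (e.g. a German DeepSpeech model) has already
# produced a hypothesis transcript for each audio clip.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_error(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

def accept_alignment(reference: str, asr_output: str,
                     threshold: float = 0.15) -> bool:
    """Keep a clip only if the ASR output is close enough to the book text."""
    return normalized_error(reference, asr_output) <= threshold
```

Filtering on a normalized error rate rather than the raw distance keeps the threshold comparable across sentences of different lengths.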
© 2021 Springer Nature Switzerland AG
Cite this paper
Puchtler, P., Wirth, J., Peinl, R. (2021). HUI-Audio-Corpus-German: A High Quality TTS Dataset. In: Edelkamp, S., Möller, R., Rueckert, E. (eds) KI 2021: Advances in Artificial Intelligence. KI 2021. Lecture Notes in Computer Science(), vol 12873. Springer, Cham. https://doi.org/10.1007/978-3-030-87626-5_15
Print ISBN: 978-3-030-87625-8
Online ISBN: 978-3-030-87626-5