Abstract
The increasing availability of audio data on the internet has led to a multitude of datasets for developing and training text-to-speech (TTS) applications based on deep neural networks. However, highly varying voice quality, low sampling rates, missing text normalization, and disadvantageous alignment of audio samples to their corresponding transcript sentences still limit the performance of deep neural networks trained on this task. Moreover, data resources in languages such as German remain very limited. We introduce the "HUI-Audio-Corpus-German", a large, open-source dataset for TTS engines, created with a processing pipeline that produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
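A common way to validate audio-to-transcription alignments in pipelines of this kind is to run an ASR system over each clip and compare its output to the book text via Levenshtein (edit) distance, discarding clips whose error rate is too high. The sketch below illustrates this idea; the function names, the 0.15 threshold, and the use of character-level edit distance are illustrative assumptions, not the authors' published implementation.

```python
# Hedged sketch: a plausible quality gate for audio-to-transcript alignment.
# Assumes an external ASR system (e.g. a German DeepSpeech model) has already
# produced a hypothesis transcript for each audio clip.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_error(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

def accept_alignment(reference: str, asr_output: str,
                     threshold: float = 0.15) -> bool:
    """Keep a clip only if the ASR output is close enough to the book text."""
    return normalized_error(reference, asr_output) <= threshold
```

Filtering on a normalized error rate rather than the raw distance keeps the threshold comparable across sentences of different lengths.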
© 2021 Springer Nature Switzerland AG
Cite this paper
Puchtler, P., Wirth, J., Peinl, R. (2021). HUI-Audio-Corpus-German: A High Quality TTS Dataset. In: Edelkamp, S., Möller, R., Rueckert, E. (eds) KI 2021: Advances in Artificial Intelligence. KI 2021. Lecture Notes in Computer Science(), vol 12873. Springer, Cham. https://doi.org/10.1007/978-3-030-87626-5_15
Print ISBN: 978-3-030-87625-8
Online ISBN: 978-3-030-87626-5