HUI-Audio-Corpus-German: A High Quality TTS Dataset

  • Conference paper
  • Published in: KI 2021: Advances in Artificial Intelligence (KI 2021)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12873)

Abstract

The increasing availability of audio data on the internet has led to a multitude of datasets for developing and training text-to-speech (TTS) applications based on deep neural networks. However, highly varying voice quality, low sampling rates, missing text normalization, and poor alignment of audio samples to their corresponding transcript sentences still limit the performance of deep neural networks trained on this task. In addition, data resources for languages such as German remain very limited. We introduce the “HUI-Audio-Corpus-German”, a large open-source dataset for TTS engines, created with a processing pipeline that produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.
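One way such a pipeline can verify audio-to-transcript alignment is to run an ASR model over each clip and score the ASR output against the candidate book text with a normalized edit distance, keeping only clips that match closely. The sketch below is an illustrative assumption of how that scoring step might look, not the authors' actual implementation; the function names and any acceptance threshold are hypothetical.

```python
# Illustrative sketch: score how well an ASR transcript of an audio clip
# matches its candidate text, using normalized Levenshtein distance.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # ensure a is the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def alignment_score(asr_transcript: str, candidate_text: str) -> float:
    """Return 1.0 for a perfect match, 0.0 for completely different strings."""
    a = asr_transcript.lower().strip()
    b = candidate_text.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# A clip would be kept only if the ASR output closely matches the book text.
print(alignment_score("es war einmal ein könig", "Es war einmal ein König"))  # 1.0
```

In practice the comparison would run on text already lowercased and normalized by earlier pipeline stages, so that casing and punctuation differences do not count against an otherwise correct alignment.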


Notes

  1. https://opendata.iisys.de/datasets.html#hui-audio-corpus-german
  2. https://github.com/iisys-hof/HUI-Audio-Corpus-German
  3. https://librivox.org/
  4. https://librivox.org/api/info
  5. https://github.com/csteinmetz1/pyloudnorm
  6. https://github.com/AASHISHAG/deepspeech-german#trained-models
  7. https://www.projekt-gutenberg.org/
  8. https://gutenberg.org
  9. https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/vctk/voc1/conf/multi_band_melgan.v2.yaml
  10. https://www.wiktionary.org/
  11. https://opendata.iisys.de/datasets.html#hui-audio-corpus-german


Author information

Corresponding author: Pascal Puchtler.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Puchtler, P., Wirth, J., Peinl, R. (2021). HUI-Audio-Corpus-German: A High Quality TTS Dataset. In: Edelkamp, S., Möller, R., Rueckert, E. (eds) KI 2021: Advances in Artificial Intelligence. KI 2021. Lecture Notes in Computer Science, vol 12873. Springer, Cham. https://doi.org/10.1007/978-3-030-87626-5_15

  • DOI: https://doi.org/10.1007/978-3-030-87626-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87625-8

  • Online ISBN: 978-3-030-87626-5

  • eBook Packages: Computer Science; Computer Science (R0)
