
Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies

  • Conference paper
  • Text, Speech, and Dialogue (TSD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)


Abstract

This paper provides a survey of the current state of ARTIC, the modern Czech concatenative corpus-based text-to-speech system. Over more than a decade of research and development in the field of speech technologies and applications, the system has been extended with new languages (and, consequently, with language-dependent NLP methods), and its speech generation capabilities have improved significantly as new speech generation modules (SPS, DNN, HSS) were, and still are being, designed and incorporated into it. ARTIC also has to cope with widely varying requirements on the data from which speech is generated, in terms of size, quality and the domain of the output speech, while always having to achieve the highest possible quality in terms of both naturalness and intelligibility. The paper therefore summarizes some of the most significant achievements and the demanding tasks the system has had to tackle, illustrating the universality and flexibility of this Czech TTS system.

This research was supported by the Technology Agency of the Czech Republic, project No. TH02010307.




Author information

Corresponding author: Daniel Tihelka



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M. (2018). Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol. 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_40


  • DOI: https://doi.org/10.1007/978-3-030-00794-2_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science, Computer Science (R0)
