Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies

Tihelka, Daniel; Hanzlíček, Zdeněk; Jůzová, Markéta; Vít, Jakub; Matoušek, Jindřich; Grůber, Martin

doi:10.1007/978-3-030-00794-2_40


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

This paper surveys the current state of ARTIC, the modern Czech concatenative corpus-based text-to-speech system. Over more than a decade of research and development in the field of speech technologies and applications, the system has been enriched with new languages (and, as a consequence, language-dependent NLP methods), and its speech generation capabilities have been significantly improved as new, progressive speech generation modules (SPS, DNN, HSS) have been, and still are being, designed and incorporated into it. ARTIC also has to cope with varied requirements on the data from which speech is generated, which differ in size, quality, and the domain of the output speech, while the requirement to achieve the highest quality in terms of both naturalness and intelligibility has always remained. The paper therefore summarizes some of the most significant achievements and demanding tasks the system has had to tackle, illustrating the universality and flexibility of this Czech TTS system.

This research was supported by the Technology Agency of the Czech Republic, project No. TH02010307.



Author information

Authors and Affiliations

New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Daniel Tihelka, Zdeněk Hanzlíček, Jindřich Matoušek & Martin Grůber
Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
Markéta Jůzová, Jakub Vít & Jindřich Matoušek


Corresponding author

Correspondence to Daniel Tihelka.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala


About this paper

Cite this paper

Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M. (2018). Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_40


DOI: https://doi.org/10.1007/978-3-030-00794-2_40
Published: 08 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science, Computer Science (R0)
