A Text-to-Speech Platform for Variable Length Optimal Unit Searching Using Perception Based Cost Functions

International Journal of Speech Technology

Abstract

In concatenative text-to-speech (TTS) synthesis, the quality of synthetic speech depends closely on the size of the speech corpus. In this paper, we describe our work on a new corpus-based Bell Labs TTS system. The system comprises large acoustic inventories with a rich set of annotations, models and data structures for representing and managing those inventories, and an optimal unit selection algorithm that accommodates a broad range of possible cost criteria. We also propose a new method for setting the weights in the cost functions based on a perceptual preference test. Our results show that this approach can successfully predict human preference patterns. Synthetic speech produced with weights determined in this manner consistently demonstrates smoother transitions and higher voice quality than speech produced with manually set weights.
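The abstract does not spell out the search procedure, so the following is only a minimal sketch of what a weighted-cost unit selection search can look like in general, not the authors' algorithm. It assumes one fixed slot per target position (the paper's title refers to variable-length units, which this sketch does not model), and every name in it (`select_units`, `target_cost`, `join_cost`, `w_target`, `w_join`, and the toy pitch data) is hypothetical. The search is a standard dynamic-programming (Viterbi-style) pass that minimizes the weighted sum of per-unit target costs and per-boundary join costs.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification for one position (assumed type)
U = TypeVar("U")  # candidate unit from the inventory (assumed type)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> List[U]:
    """Return the candidate sequence with the lowest total weighted cost.

    Assumes candidates[i] is a non-empty list of units for targets[i].
    """
    if not targets:
        return []
    # best[i][j]: lowest cost of any path that ends at candidate j of slot i.
    best = [[w_target * target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]  # back-pointers for the traceback
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            # Cheapest predecessor: accumulated cost plus weighted join cost.
            k, cost = min(
                (
                    (k, best[i - 1][k] + w_join * join_cost(prev, u))
                    for k, prev in enumerate(candidates[i - 1])
                ),
                key=lambda pair: pair[1],
            )
            row.append(cost + w_target * target_cost(targets[i], u))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path: List[U] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path


# Toy data (invented for illustration): a unit is (pitch_in, pitch_out) in Hz,
# and a target is the desired mean pitch of the unit.
Unit = Tuple[float, float]


def toy_target_cost(t: float, u: Unit) -> float:
    return abs(t - (u[0] + u[1]) / 2)


def toy_join_cost(a: Unit, b: Unit) -> float:
    return abs(a[1] - b[0])  # pitch jump at the concatenation point


toy_targets = [100.0, 110.0]
toy_candidates = [[(98, 102), (100, 120)], [(104, 116), (119, 101)]]

balanced = select_units(toy_targets, toy_candidates, toy_target_cost, toy_join_cost)
join_only = select_units(
    toy_targets, toy_candidates, toy_target_cost, toy_join_cost, w_target=0.0
)
print(balanced, join_only)
```

In this toy setup, changing the weights changes which units win: with equal weights the search prefers units that match the target pitch, while setting `w_target` to zero makes it pick the pair with the smallest pitch jump at the join. That sensitivity is why the choice of weights matters, and it is the quantity the abstract proposes to set from a perceptual preference test rather than by hand.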

Cite this article

Lee, M., Lopresti, D.P. & Olive, J.P. A Text-to-Speech Platform for Variable Length Optimal Unit Searching Using Perception Based Cost Functions. International Journal of Speech Technology 6, 347–356 (2003). https://doi.org/10.1023/A:1025752731945
