Measuring the naturalness of synthetic speech

International Journal of Speech Technology

Abstract

Even the highest-quality synthetic speech generated by rule sounds unlike human speech. As the intelligibility of rule-based synthetic speech improves and the number of applications for synthetic speech increases, the naturalness of synthetic speech will become an important factor in determining its use. To improve this aspect of synthetic speech quality, it is necessary to have diagnostic tests that can measure naturalness. Currently, none of the available metrics for evaluating the acceptability of synthetic speech distinguishes sufficiently between measuring overall acceptability (including naturalness) and simply measuring the ability of listeners to extract intelligible information from the signal. In this paper we propose a new methodology for measuring the naturalness of particular aspects of synthesized speech, independent of the intelligibility of the speech. Although naturalness is a multidimensional, subjective quality of speech, this methodology makes it possible to assess the separate contributions of prosodic, segmental, and source characteristics of the utterance. In two experiments, listeners reliably differentiated the naturalness of speech produced by two male talkers from that of speech produced by two text-to-speech systems. Furthermore, they reliably differentiated between the two text-to-speech systems. The results of these experiments demonstrate that perception of naturalness is affected by information contained within the smallest part of speech, the glottal pulse, and by information contained within the prosodic structure of a syllable. These results show that the new methodology provides a solid basis for measuring and diagnosing the naturalness of synthetic speech.

Cite this article

Nusbaum, H.C., Francis, A.L. & Henly, A.S. Measuring the naturalness of synthetic speech. Int J Speech Technol 2, 7–19 (1997). https://doi.org/10.1007/BF02215800
