Measuring the naturalness of synthetic speech

Nusbaum, Howard C.; Francis, Alexander L.; Henly, Anne S.

doi:10.1007/BF02215800

Published: May 1997

Volume 2, pages 7–19, (1997)

International Journal of Speech Technology

Abstract

Even the highest-quality synthetic speech generated by rule sounds unlike human speech. As the intelligibility of rule-based synthetic speech improves and the number of applications for synthetic speech increases, naturalness will become an important factor in determining its use. Improving this aspect of synthetic speech quality requires diagnostic tests that can measure naturalness. Currently, none of the available metrics for evaluating the acceptability of synthetic speech distinguishes sufficiently between measuring overall acceptability (including naturalness) and simply measuring the ability of listeners to extract intelligible information from the signal. In this paper we propose a new methodology for measuring the naturalness of particular aspects of synthesized speech, independent of the intelligibility of the speech. Although naturalness is a multidimensional, subjective quality of speech, this methodology makes it possible to assess the separate contributions of the prosodic, segmental, and source characteristics of an utterance. In two experiments, listeners reliably differentiated the naturalness of speech produced by two male talkers and two text-to-speech systems. Furthermore, they reliably differentiated between the two text-to-speech systems. The results of these experiments demonstrate that the perception of naturalness is affected by information contained within the smallest part of speech, the glottal pulse, and by information contained within the prosodic structure of a syllable. These results show that the new methodology provides a solid basis for measuring and diagnosing the naturalness of synthetic speech.

Author information

Authors and Affiliations

Center for Computational Psychology, Committee on Cognition and Communication, The University of Chicago, 5848 South University Avenue, Chicago, IL 60637
Howard C. Nusbaum, Alexander L. Francis & Anne S. Henly

About this article

Cite this article

Nusbaum, H.C., Francis, A.L. & Henly, A.S. Measuring the naturalness of synthetic speech. Int J Speech Technol 2, 7–19 (1997). https://doi.org/10.1007/BF02215800

Received: 18 May 1995
Accepted: 16 June 1995
Issue Date: May 1997
DOI: https://doi.org/10.1007/BF02215800
