Abstract
Even the highest-quality synthetic speech generated by rule sounds unlike human speech. As the intelligibility of rule-based synthetic speech improves and the number of applications for synthetic speech increases, naturalness will become an important factor in determining its use. Improving this aspect of synthetic speech quality requires diagnostic tests that can measure naturalness. Currently, none of the available metrics for evaluating the acceptability of synthetic speech distinguishes sufficiently between measuring overall acceptability (including naturalness) and simply measuring the ability of listeners to extract intelligible information from the signal. In this paper we propose a new methodology for measuring the naturalness of particular aspects of synthesized speech, independent of the intelligibility of the speech. Although naturalness is a multidimensional, subjective quality of speech, this methodology makes it possible to assess the separate contributions of prosodic, segmental, and source characteristics of the utterance. In two experiments, listeners reliably differentiated the naturalness of speech produced by two male talkers and two text-to-speech systems. Furthermore, they reliably differentiated between the two text-to-speech systems. The results of these experiments demonstrate that the perception of naturalness is affected by information contained within the smallest part of speech, the glottal pulse, and by information contained within the prosodic structure of a syllable. These results show that the new methodology provides a solid basis for measuring and diagnosing the naturalness of synthetic speech.
Cite this article
Nusbaum, H.C., Francis, A.L. & Henly, A.S. Measuring the naturalness of synthetic speech. Int J Speech Technol 2, 7–19 (1997). https://doi.org/10.1007/BF02215800
DOI: https://doi.org/10.1007/BF02215800