Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates
Introduction
Conversational speech is one of the most difficult genres for automatic speech recognition (ASR) systems to recognize, due to high levels of disfluency, non-canonical pronunciations, and acoustic and prosodic variability. In order to improve ASR performance, it is important to understand which of these factors is most problematic for recognition. Previous work on recognition of spontaneous monologues and dialogues has shown that infrequent words are more likely to be misrecognized (Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001) and that fast speech is associated with higher error rates (Siegler and Stern, 1995; Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001). In some studies, very slow speech has also been found to correlate with higher error rates (Siegler and Stern, 1995; Shinozaki and Furui, 2001). In Shinozaki and Furui’s (2001) analysis of a Japanese ASR system, word length (in phones) was found to be a useful predictor of error rates, with more errors on shorter words. In Hirschberg et al.’s (2004) analysis of two human–computer dialogue systems, misrecognized turns were found to have (on average) higher maximum pitch and energy than correctly recognized turns. Results for speech rate were ambiguous: faster utterances had higher error rates in one corpus, but lower error rates in the other. Finally, Adda-Decker and Lamel (2005) demonstrated that both French and English ASR systems had more trouble with male speakers than female speakers, and suggested several possible explanations, including higher rates of disfluencies and more reduction.
In parallel to these studies within the speech-recognition community, a body of work has accumulated in the psycholinguistics literature examining factors that affect the speed and accuracy of spoken word recognition in humans. Experiments are typically carried out using isolated words as stimuli, and controlling for numerous factors such as word frequency, duration, and length. Like ASR systems, humans are better (faster and more accurate) at recognizing frequent words than infrequent words (Howes, 1954; Marslen-Wilson, 1987; Dahan et al., 2001). In addition, it is widely accepted that recognition is worse for words that are phonetically similar to many other words than for highly distinctive words (Luce and Pisoni, 1998). Rather than using a graded notion of phonetic similarity, psycholinguistic experiments typically make the simplifying assumption that two words are “similar” if they differ by a single phone (insertion, substitution, or deletion). Such pairs are referred to as neighbors. Early on, it was shown that both the number of neighbors of a word and the frequency of those neighbors are significant predictors of recognition performance; it is now common to see those two factors combined into a single predictor known as frequency-weighted neighborhood density (Luce and Pisoni, 1998; Vitevitch and Luce, 1999), which we discuss in more detail in Section 3.1.
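To make these notions concrete, the sketch below computes one-phone neighbors and a frequency-weighted neighborhood density over a toy phone-string lexicon. The lexicon, the counts, and the use of raw log frequency as the weight are illustrative assumptions; formulations in the literature vary (e.g., in how frequencies are normalized), and the measure used in this study is described in Section 3.1.

```python
from math import log

def neighbors(word, lexicon):
    """Return lexicon entries whose phone string differs from `word`
    by exactly one phone (substitution, insertion, or deletion)."""
    def edit1(a, b):
        if a == b or abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):  # one substitution
            return sum(x != y for x, y in zip(a, b)) == 1
        # one insertion/deletion: longer string minus one phone == shorter
        short, long_ = (a, b) if len(a) < len(b) else (b, a)
        return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))
    return [w for w in lexicon if edit1(word, w)]

def fwnd(word, lexicon, freqs):
    """Frequency-weighted neighborhood density: here, the sum of the log
    frequencies of the word's phonological neighbors (one common form)."""
    return sum(log(freqs[n]) for n in neighbors(word, lexicon))

# Toy lexicon of phone tuples with hypothetical frequency counts.
freqs = {
    ('k', 'ae', 't'): 500,   # cat
    ('b', 'ae', 't'): 200,   # bat
    ('k', 'ae', 'p'): 100,   # cap
    ('ae', 't'): 300,        # at
    ('d', 'ao', 'g'): 400,   # dog
}
lexicon = list(freqs)
print(sorted(neighbors(('k', 'ae', 't'), lexicon)))  # bat, cap, at; not dog
```

Under this definition, a word with many high-frequency neighbors (a dense, competitive neighborhood) gets a high score, which in human listeners predicts slower and less accurate recognition.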
Many questions are left unanswered by these previous studies. In the word-level analyses of Fosler-Lussier and Morgan (1999) and Shinozaki and Furui (2001), only substitution and deletion errors were considered, and it is unclear whether including insertions would have led to different results. Moreover, these studies primarily analyzed lexical, rather than prosodic, factors. Hirschberg et al.’s (2004) work suggests that utterance-level prosodic factors can impact error rates in human–computer dialogues, but leaves open the question of which factors are important at the word level and how they influence recognition of natural conversational speech. Adda-Decker and Lamel’s (2005) suggestion that higher rates of disfluency are a cause of worse recognition for male speakers presupposes that disfluencies raise error rates. While this assumption seems natural, it was never carefully tested, and in particular neither Adda-Decker and Lamel nor any of the other papers cited investigated whether disfluent words are associated with errors in adjacent words, or are simply more likely to be misrecognized themselves. Many factors that are often thought to influence error rates, such as a word’s status as a content or function word, and whether it starts a turn, also remained unexamined. Next, the neighborhood-related factors found to be important in human word recognition have, to our knowledge, never even been proposed as possible explanatory variables in ASR, much less formally analyzed. Additionally, many of these factors are known to be correlated. Disfluent speech, for example, is linked to changes in both prosody and rate of speech, and low-frequency words tend to have longer duration. Since previous work has generally examined each factor independently, it is not clear which factors would still be linked to word error after accounting for these correlations.
A final issue not addressed by these previous studies is that of speaker differences. While ASR error rates are known to vary enormously between speakers (Doddington and Schalk, 1981, Nusbaum and Pisoni, 1987, Nusbaum et al., 1995), most previous analyses have averaged over speakers rather than examining speaker differences explicitly, and the causes of such differences are not well understood. Several early hypotheses regarding the causes of these differences, such as the user’s motivation to use the system or the variability of the user’s speech with respect to user-specific training data (Nusbaum and Pisoni, 1987), can be ruled out for recognition of conversational speech because the user is speaking to another human and there is no user-specific training data. However, we still do not know the extent to which differences in error rates between speakers can be accounted for by the lexical, prosodic, and disfluency factors discussed above, or whether additional factors are at work.
The present study is designed to address the questions raised above by analyzing the effects of a wide range of lexical and prosodic factors on the accuracy of two English ASR systems for conversational telephone speech. We introduce a new measure of error, individual word error rate (IWER), which allows us to include insertion errors in our analysis, along with deletions and substitutions. Using this measure, we examine the effects of each factor on the recognition performance of two different state-of-the-art conversational telephone speech recognizers – the SRI/ICSI/UW RT-04 system (Stolcke et al., 2006) and the 2004 CU-HTK system (Evermann et al., 2004b; Evermann et al., 2005). In the remainder of the paper, we first describe the data used in our study and the individual word error rate measure. Next, we present the features we collected for each word and the effects of those features individually on IWER. Finally, we develop a joint statistical model to examine the effects of each feature while accounting for possible correlations, and to determine the relative importance of speaker differences other than those reflected in the features we collected.
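The IWER measure itself is defined later in the paper; as a rough sketch of the underlying idea, one can align the hypothesis to the reference with a standard Levenshtein alignment, charge each substitution and deletion to the reference word involved, and split each insertion's penalty across the adjacent reference words. The even split used below is an illustrative choice, not the authors' exact weighting.

```python
def align(ref, hyp):
    """Levenshtein alignment of hypothesis words to reference words.
    Returns (op, i) pairs, where op is 'ok', 'sub', 'del', or 'ins';
    for 'ins', i is the index of the preceding reference word (or -1)."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # match/sub
                          D[i-1][j] + 1,                         # deletion
                          D[i][j-1] + 1)                         # insertion
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            ops.append(('ok' if ref[i-1] == hyp[j-1] else 'sub', i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            ops.append(('del', i - 1))
            i -= 1
        else:
            ops.append(('ins', i - 1))
            j -= 1
    return ops[::-1]

def iwer(ref, hyp):
    """Per-word error scores over the reference: substitutions and
    deletions count 1 against the reference word; each insertion's unit
    penalty is split evenly across the adjacent reference words
    (an illustrative split -- the paper defines its own weighting)."""
    score = [0.0] * len(ref)
    for op, i in align(ref, hyp):
        if op in ('sub', 'del'):
            score[i] += 1.0
        elif op == 'ins':
            adj = [k for k in (i, i + 1) if 0 <= k < len(ref)]
            for k in adj:
                score[k] += 1.0 / len(adj)
    return score

print(iwer("the cat sat".split(), "the bat sat".split()))  # → [0.0, 1.0, 0.0]
```

Unlike corpus-level WER, this assigns every error to specific reference words, so insertions can be included when asking which words attract errors.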
Section snippets
Data
Our analysis is based on the output from two state-of-the-art speech recognition systems on the conversational telephone speech evaluation data from the National Institute of Standards and Technology (NIST) 2003 Rich Transcription exercise (RT-03). The two recognizers are the SRI/ICSI/UW RT-04 system (Stolcke et al., 2006) and the 2004 CU-HTK system (Evermann et al., 2004b; Evermann et al., 2005).
Analysis of individual features
In this section, we first describe all of the features we collected for each word and how the features were extracted. We then provide results detailing the association between each individual feature and recognition error rates.
Analysis using a joint model
In the previous section, we investigated the effects of various individual features on ASR error rates. However, there are many correlations between these features – for example, words with longer duration are likely to have a larger range of pitch and intensity. In this section, we build a single model for each system’s output with all of our features as potential predictors in order to determine the effects of each feature after accounting for possible correlations. We use the no-contractions
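A joint model of this kind – per-word binary error regressed on several correlated predictors at once – can be sketched as a logistic regression. Everything below is illustrative: the data are synthetic, the two predictors stand in for features such as log duration and pitch range, and the plain Newton-method fit is not the paper's actual modeling toolkit. The point is that a predictor correlated with a genuine cause (here, `pitch` correlates with `dur`) can look important in isolation but receives a near-zero coefficient once both enter the model jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic word-level data: two correlated predictors; only the first
# actually drives the log-odds of a recognition error.
n = 5000
dur = rng.normal(size=n)                            # e.g. log duration
pitch = 0.7 * dur + rng.normal(scale=0.7, size=n)   # correlated feature
logit = -1.0 + 1.5 * dur                            # pitch has no true effect
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))    # 1 = word misrecognized

X = np.column_stack([np.ones(n), dur, pitch])       # intercept + features

# Fit logistic regression by Newton's method (IRLS).
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (y - p)                            # gradient of log-likelihood
    H = X.T @ (X * (p * (1 - p))[:, None])          # observed information
    w += np.linalg.solve(H, grad)

print(np.round(w, 2))  # coefficient on `pitch` should be near zero
```

The same logic motivates the joint analysis in this section: a feature's individual association with IWER may vanish once correlated features are controlled for.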
Conclusion
In this paper, we introduced the individual word error rate (IWER) for measuring ASR performance on individual words, including insertions as well as deletions and substitutions. Using IWER, we analyzed the effects of a large variety of lexical, disfluency, contextual, and prosodic features in two different ASR systems, both individually and in a joint model. We found that despite differences in the overall performance of the two systems, the effects of the factors we examined were extremely similar.
Acknowledgments
This work was supported by the Edinburgh–Stanford LINK and ONR MURI award N000140510388. We thank Andreas Stolcke for providing the SRI recognizer output, language model, and forced alignments; Phil Woodland for providing the Cambridge recognizer output and other evaluation data; and Katrin Kirchhoff and Raghunandan Kumaran for datasets used in preliminary work, useful scripts, and additional help.
References (43)

- Bradlow, A.R., Torretta, G.M., Pisoni, D.B., 1996. Intelligibility of normal speech I: Global and fine-grained acoustic–phonetic talker characteristics. Speech Comm.
- Dahan, D., Magnuson, J.S., Tanenhaus, M.K., 2001. Time course of frequency effects in spoken-word recognition: evidence from eye movements. Cognit. Psychol.
- Diehl, R.L., Lindblom, B., Hoemeke, K.A., Fahey, R.P., 1996. On explaining certain male–female differences in the phonetic realization of vowel categories. J. Phonetics.
- Fosler-Lussier, E., Morgan, N., 1999. Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Comm.
- Hirschberg, J., Litman, D., Swerts, M., 2004. Prosodic and other cues to speech recognition failures. Speech Comm.
- Marslen-Wilson, W.D., 1987. Functional parallelism in spoken word-recognition. Cognition.
- Nakamura, M., Iwano, K., Furui, S., 2008. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance. Comput. Speech Language.
- Automatic measurement of speech recognition performance: a comparison of six speaker-dependent recognition devices. Comput. Speech Language, 1987.
- Vitevitch, M.S., Luce, P.A., 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. J. Memory Language.
- Adda-Decker, M., Lamel, L., 2005. Do speech recognizers prefer female speakers? In: Proc. INTERSPEECH, pp....
- Baayen, R.H., 2008. Analyzing Linguistic Data. Cambridge University Press.
- Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., Gildea, D., 2003. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. J. Acoust. Soc. Amer.
- Doddington, G.R., Schalk, T.B., 1981. Speech recognition: turning theory to practice. IEEE Spectrum.
- Fougeron, C., Keating, P.A., 1997. Articulatory strengthening at edges of prosodic domains. J. Acoust. Soc. Amer.