Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates
Introduction
Conversational speech is one of the most difficult genres for automatic speech recognition (ASR) systems to recognize, due to high levels of disfluency, non-canonical pronunciations, and acoustic and prosodic variability. In order to improve ASR performance, it is important to understand which of these factors is most problematic for recognition. Previous work on recognition of spontaneous monologues and dialogues has shown that infrequent words are more likely to be misrecognized (Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001) and that fast speech is associated with higher error rates (Siegler and Stern, 1995; Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001). In some studies, very slow speech has also been found to correlate with higher error rates (Siegler and Stern, 1995; Shinozaki and Furui, 2001). In Shinozaki and Furui’s (2001) analysis of a Japanese ASR system, word length (in phones) was found to be a useful predictor of error rates, with more errors on shorter words. In Hirschberg et al.’s (2004) analysis of two human–computer dialogue systems, misrecognized turns were found to have (on average) higher maximum pitch and energy than correctly recognized turns. Results for speech rate were ambiguous: faster utterances had higher error rates in one corpus, but lower error rates in the other. Finally, Adda-Decker and Lamel (2005) demonstrated that both French and English ASR systems had more trouble with male speakers than female speakers, and suggested several possible explanations, including higher rates of disfluencies and more reduction.
In parallel to these studies within the speech-recognition community, a body of work has accumulated in the psycholinguistics literature examining factors that affect the speed and accuracy of spoken word recognition in humans. Experiments are typically carried out using isolated words as stimuli, and controlling for numerous factors such as word frequency, duration, and length. Like ASR systems, humans are better (faster and more accurate) at recognizing frequent words than infrequent words (Howes, 1954; Marslen-Wilson, 1987; Dahan et al., 2001). In addition, it is widely accepted that recognition is worse for words that are phonetically similar to many other words than for highly distinctive words (Luce and Pisoni, 1998). Rather than using a graded notion of phonetic similarity, psycholinguistic experiments typically make the simplifying assumption that two words are “similar” if they differ by a single phone (insertion, substitution, or deletion). Such pairs are referred to as neighbors. Early on, it was shown that both the number of neighbors of a word and the frequency of those neighbors are significant predictors of recognition performance; it is now common to see those two factors combined into a single predictor known as frequency-weighted neighborhood density (Luce and Pisoni, 1998; Vitevitch and Luce, 1999), which we discuss in more detail in Section 3.1.
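To make these notions concrete, the sketch below computes one-phone neighbors and a frequency-weighted neighborhood density over a toy phone-string lexicon. The lexicon, the counts, and the use of raw log frequency as the weight are illustrative assumptions; formulations in the literature vary (e.g., in how frequencies are normalized), and the measure used in this study is described in Section 3.1.

```python
from math import log

def neighbors(word, lexicon):
    """Return lexicon entries whose phone string differs from `word`
    by exactly one phone (substitution, insertion, or deletion)."""
    def edit1(a, b):
        if a == b or abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):  # one substitution
            return sum(x != y for x, y in zip(a, b)) == 1
        # one insertion/deletion: longer string minus one phone == shorter
        short, long_ = (a, b) if len(a) < len(b) else (b, a)
        return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))
    return [w for w in lexicon if edit1(word, w)]

def fwnd(word, lexicon, freqs):
    """Frequency-weighted neighborhood density: here, the sum of the log
    frequencies of the word's phonological neighbors (one common form)."""
    return sum(log(freqs[n]) for n in neighbors(word, lexicon))

# Toy lexicon of phone tuples with hypothetical frequency counts.
freqs = {
    ('k', 'ae', 't'): 500,   # cat
    ('b', 'ae', 't'): 200,   # bat
    ('k', 'ae', 'p'): 100,   # cap
    ('ae', 't'): 300,        # at
    ('d', 'ao', 'g'): 400,   # dog
}
lexicon = list(freqs)
print(sorted(neighbors(('k', 'ae', 't'), lexicon)))  # bat, cap, at; not dog
```

Under this definition, a word with many high-frequency neighbors (a dense, competitive neighborhood) gets a high score, which in human listeners predicts slower and less accurate recognition.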
Many questions are left unanswered by these previous studies. In the word-level analyses of Fosler-Lussier and Morgan (1999) and Shinozaki and Furui (2001), only substitution and deletion errors were considered, and it is unclear whether including insertions would have led to different results. Moreover, these studies primarily analyzed lexical, rather than prosodic, factors. Hirschberg et al.’s (2004) work suggests that utterance-level prosodic factors can impact error rates in human–computer dialogues, but leaves open the question of which factors are important at the word level and how they influence recognition of natural conversational speech. Adda-Decker and Lamel’s (2005) suggestion that higher rates of disfluency are a cause of worse recognition for male speakers presupposes that disfluencies raise error rates. While this assumption seems natural, it was never carefully tested, and in particular neither Adda-Decker and Lamel nor any of the other papers cited investigated whether disfluent words are associated with errors in adjacent words, or are simply more likely to be misrecognized themselves. Many factors that are often thought to influence error rates, such as a word’s status as a content or function word, and whether it starts a turn, also remained unexamined. Next, the neighborhood-related factors found to be important in human word recognition have, to our knowledge, never even been proposed as possible explanatory variables in ASR, much less formally analyzed. Additionally, many of these factors are known to be correlated. Disfluent speech, for example, is linked to changes in both prosody and rate of speech, and low-frequency words tend to have longer duration. Since previous work has generally examined each factor independently, it is not clear which factors would still be linked to word error after accounting for these correlations.
A final issue not addressed by these previous studies is that of speaker differences. While ASR error rates are known to vary enormously between speakers (Doddington and Schalk, 1981, Nusbaum and Pisoni, 1987, Nusbaum et al., 1995), most previous analyses have averaged over speakers rather than examining speaker differences explicitly, and the causes of such differences are not well understood. Several early hypotheses regarding the causes of these differences, such as the user’s motivation to use the system or the variability of the user’s speech with respect to user-specific training data (Nusbaum and Pisoni, 1987), can be ruled out for recognition of conversational speech because the user is speaking to another human and there is no user-specific training data. However, we still do not know the extent to which differences in error rates between speakers can be accounted for by the lexical, prosodic, and disfluency factors discussed above, or whether additional factors are at work.
The present study is designed to address the questions raised above by analyzing the effects of a wide range of lexical and prosodic factors on the accuracy of two English ASR systems for conversational telephone speech. We introduce a new measure of error, individual word error rate (IWER), which allows us to include insertion errors in our analysis, along with deletions and substitutions. Using this measure, we examine the effects of each factor on the recognition performance of two different state-of-the-art conversational telephone speech recognizers – the SRI/ICSI/UW RT-04 system (Stolcke et al., 2006) and the 2004 CU-HTK system (Evermann et al., 2004b; Evermann et al., 2005). In the remainder of the paper, we first describe the data used in our study and the individual word error rate measure. Next, we present the features we collected for each word and the effects of those features individually on IWER. Finally, we develop a joint statistical model to examine the effects of each feature while accounting for possible correlations, and to determine the relative importance of speaker differences other than those reflected in the features we collected.
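The IWER measure itself is defined later in the paper; as a rough sketch of the underlying idea, one can align the hypothesis to the reference with a standard Levenshtein alignment, charge each substitution and deletion to the reference word involved, and split each insertion's penalty across the adjacent reference words. The even split used below is an illustrative choice, not the authors' exact weighting.

```python
def align(ref, hyp):
    """Levenshtein alignment of hypothesis words to reference words.
    Returns (op, i) pairs, where op is 'ok', 'sub', 'del', or 'ins';
    for 'ins', i is the index of the preceding reference word (or -1)."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (ref[i-1] != hyp[j-1]),  # match/sub
                          D[i-1][j] + 1,                         # deletion
                          D[i][j-1] + 1)                         # insertion
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            ops.append(('ok' if ref[i-1] == hyp[j-1] else 'sub', i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            ops.append(('del', i - 1))
            i -= 1
        else:
            ops.append(('ins', i - 1))
            j -= 1
    return ops[::-1]

def iwer(ref, hyp):
    """Per-word error scores over the reference: substitutions and
    deletions count 1 against the reference word; each insertion's unit
    penalty is split evenly across the adjacent reference words
    (an illustrative split -- the paper defines its own weighting)."""
    score = [0.0] * len(ref)
    for op, i in align(ref, hyp):
        if op in ('sub', 'del'):
            score[i] += 1.0
        elif op == 'ins':
            adj = [k for k in (i, i + 1) if 0 <= k < len(ref)]
            for k in adj:
                score[k] += 1.0 / len(adj)
    return score

print(iwer("the cat sat".split(), "the bat sat".split()))  # → [0.0, 1.0, 0.0]
```

Unlike corpus-level WER, this assigns every error to specific reference words, so insertions can be included when asking which words attract errors.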
Section snippets
Data
Our analysis is based on the output from two state-of-the-art speech recognition systems on the conversational telephone speech evaluation data from the National Institute of Standards and Technology (NIST) 2003 Rich Transcription exercise (RT-03). The two recognizers are the SRI/ICSI/UW RT-04 system (Stolcke et al., 2006) and the 2004 CU-HTK system (Evermann et al., 2004b; Evermann et al., 2005).
Analysis of individual features
In this section, we first describe all of the features we collected for each word and how the features were extracted. We then provide results detailing the association between each individual feature and recognition error rates.
Analysis using a joint model
In the previous section, we investigated the effects of various individual features on ASR error rates. However, there are many correlations between these features – for example, words with longer duration are likely to have a larger range of pitch and intensity. In this section, we build a single model for each system’s output with all of our features as potential predictors in order to determine the effects of each feature after accounting for possible correlations. We use the no-contractions
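A joint model of this kind – per-word binary error regressed on several correlated predictors at once – can be sketched as a logistic regression. Everything below is illustrative: the data are synthetic, the two predictors stand in for features such as log duration and pitch range, and the plain Newton-method fit is not the paper's actual modeling toolkit. The point is that a predictor correlated with a genuine cause (here, `pitch` correlates with `dur`) can look important in isolation but receives a near-zero coefficient once both enter the model jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic word-level data: two correlated predictors; only the first
# actually drives the log-odds of a recognition error.
n = 5000
dur = rng.normal(size=n)                            # e.g. log duration
pitch = 0.7 * dur + rng.normal(scale=0.7, size=n)   # correlated feature
logit = -1.0 + 1.5 * dur                            # pitch has no true effect
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))    # 1 = word misrecognized

X = np.column_stack([np.ones(n), dur, pitch])       # intercept + features

# Fit logistic regression by Newton's method (IRLS).
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (y - p)                            # gradient of log-likelihood
    H = X.T @ (X * (p * (1 - p))[:, None])          # observed information
    w += np.linalg.solve(H, grad)

print(np.round(w, 2))  # coefficient on `pitch` should be near zero
```

The same logic motivates the joint analysis in this section: a feature's individual association with IWER may vanish once correlated features are controlled for.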
Conclusion
In this paper, we introduced the individual word error rate (IWER) for measuring ASR performance on individual words, including insertions as well as deletions and substitutions. Using IWER, we analyzed the effects of a large variety of lexical, disfluency, contextual, and prosodic features in two different ASR systems, both individually and in a joint model. We found that despite differences in the overall performance of the two systems, the effects of the factors we examined were extremely similar.
Acknowledgments
This work was supported by the Edinburgh–Stanford LINK and ONR MURI award N000140510388. We thank Andreas Stolcke for providing the SRI recognizer output, language model, and forced alignments; Phil Woodland for providing the Cambridge recognizer output and other evaluation data; and Katrin Kirchhoff and Raghunandan Kumaran for datasets used in preliminary work, useful scripts, and additional help.
References (43)

- Bradlow, A.R., Torretta, G.M., Pisoni, D.B., 1996. Intelligibility of normal speech I: Global and fine-grained acoustic–phonetic talker characteristics. Speech Comm.
- Dahan, D., Magnuson, J.S., Tanenhaus, M.K., 2001. Time course of frequency effects in spoken-word recognition: evidence from eye movements. Cognit. Psychol.
- Diehl, R.L., Lindblom, B., Hoemeke, K.A., Fahey, R.P., 1996. On explaining certain male–female differences in the phonetic realization of vowel categories. J. Phonetics.
- Fosler-Lussier, E., Morgan, N., 1999. Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Comm.
- Hirschberg, J., Litman, D., Swerts, M., 2004. Prosodic and other cues to speech recognition failures. Speech Comm.
- Marslen-Wilson, W.D., 1987. Functional parallelism in spoken word-recognition. Cognition.
- Nakamura, M., Iwano, K., Furui, S., 2008. Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance. Comput. Speech Language.
- Automatic measurement of speech recognition performance: a comparison of six speaker-dependent recognition devices. Comput. Speech Language, 1987.
- Vitevitch, M.S., Luce, P.A., 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. J. Memory Language.
- Adda-Decker, M., Lamel, L., 2005. Do speech recognizers prefer female speakers? In: Proc. INTERSPEECH, pp....
- Baayen, R.H., 2008. Analyzing Linguistic Data. Cambridge University Press.
- Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., Gildea, D., 2003. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. J. Acoust. Soc. Amer.
- Doddington, G.R., Schalk, T.B., 1981. Speech recognition: turning theory to practice. IEEE Spectrum.
- Fougeron, C., Keating, P.A., 1997. Articulatory strengthening at edges of prosodic domains. J. Acoust. Soc. Amer.