Intelligibility predictors and neural representation of speech
Research highlights
- A representation of speech is inferred from human behavior.
- That and other representations are used to recognize consonant-vowel syllables.
- Error patterns (confusions) are compared to human error patterns.
- The new representation made mistakes more like humans when the test and training conditions were mismatched.
Introduction
The authors of (Hermansky, 1998, Allen, 1994) inspire us to (1) examine existing knowledge of human speech perception, (2) employ transformations of speech which simplify the relationship between acoustics and human perception, and (3) use a task, phone classification, which allows machine recognition behavior to be compared in a comprehensible way with human behavior. The goal of this paper is to examine some qualitative knowledge of human speech perception and to address questions about the structure of the classifier humans use to perform phone transcription.
A great deal of descriptive knowledge exists about speech perception, including:
- 1.
Experiments which find “cues” indicating membership to various phonetic categories by modifying the time-frequency content of a speech signal and observing human classifications. Many phonetic categories have been investigated in this way (Stevens and Blumstein, 1978, Kewley-Port et al., 1983, Cooper et al., 1952, Jongman, 1989, Hedrick and Ohde, 1993, Repp, 1986, Repp, 1988, Sharf and Hemeyer, 1972, Darwin and Pearson, 1982).
- 2.
Measurement of human classification accuracy as a function of distortion: removal of fine spectral content, temporal modulations, representation of speech exclusively by formants, etc. (Shannon et al., 1995, Remez et al., 1981, Drullman et al., 1996, Furui, 1986).
- 3.
Models of human behavior as a function of physical qualities of a speech communication channel, such as noise level and filtering. These models of human behavior are called intelligibility predictors. The most notable are the Articulation Index (French and Steinberg, 1947) and Speech Transmission Index (Houtgast and Steeneken, 1980).
These studies have contributed greatly to speech and hearing science, audiology, and psychology; however, they have arguably had little effect on the design of machine speech recognition systems. This is likely because they describe human behavior without attempting to infer how humans accomplish it. This study is different in that it attempts to infer something about the structure of the human phone classifier.
Systems, known as intelligibility predictors, were developed to aid the design of speech communication equipment and auditoria. They are models of human performance as a function of parameters of a speech communication channel. The Articulation Index (AI) models the phone error rate as a function of masking noise spectrum and channel filtering. It is described in (French and Steinberg, 1947, Fletcher and Galt, 1950), reformulated in (Kryter, 1962a, Müsch, 2000, Allen, 2005), verified in (Kryter, 1962b, Ronan et al., 2004, Pavlovic and Studebaker, 1984), and standardized in (ANSI, 1969, ANSI, 1997). The accuracy and generality of its predictions over a variety of acoustic conditions is remarkable.
The AI model of human phone error rate indicates that the most important factor affecting human performance is the speech-to-noise ratio as a function of frequency. The AI is the frequency-average of the non-linearly transformed speech-to-noise ratio (described in detail in Section 1.2). There are numerous modifiers which compensate for sharp filtering, high speech levels, loud maskers, or sharply bandpass maskers, all of which evoke effects in the auditory periphery. It may be deduced from the formulation in (Fletcher and Galt, 1950) that these effects play a relatively small role in typical listening conditions. In fact, another formulation (French and Steinberg, 1947) considers fewer of these peripheral effects presumably because they were not seen as necessary. There is also empirical evidence that a reasonably good prediction of intelligibility can be obtained from an even simpler formulation (Phatak et al., 2008).
It seems reasonable to expect that the human brain keeps a running estimate of prevailing noise and filtering conditions, and uses them to interpret acoustic signals, including speech. This notion was suggested by Hermansky and Morgan (1994), who then developed a representation of speech which ignored the effects of slowly varying filtering and noise conditions. It is also substantiated by Drullman et al. (1994b), which showed that low frequency modulations do not affect human performance. The effectiveness of the Articulation Index has been thought to imply that the brain estimates background noise levels, and only “sees” speech if it is unlikely to have come from the background noise. French and Steinberg (1947) put forth this interpretation:
When speech, which is constantly fluctuating in intensity, is reproduced at sufficiently low level only the occasional portions of highest intensity will be heard …
If W is equal to the fraction of the time intervals that speech in a critical band can be heard, it should be possible to derive W from the characteristics of speech and hearing … it will be appreciated that there are certain consequences that can be tested if the hypothesis is correct that W is equal to the proportion of the intervals of speech in a band which can be heard. There are…
The symbol W is essentially the logarithm of the frequency-specific signal-to-noise ratio. The intelligibility prediction produced by the AI is essentially an exponential function of the average of W over frequency.
They conclude that the speech-derived estimates of W are consistent enough with perceptual data to endorse their hypothesis that W is proportional to the percentage of time intervals during which the speech signal is unlikely to have come from the noise. They use the phrase can be heard in a way which seems synonymous with signal detection. Also, an AI model parameter (denoted p in the formulation by French and Steinberg (1947)) is specifically related to the probability distribution of speech (the level in decibels which is higher than 99% of speech levels), and is employed in a way which assumes speech is detectable if its level is greater than a threshold. Two studies (Phatak and Allen, 2007, Pavlovic and Studebaker, 1984) have shown that frequency-specific values for p based on the level distribution of speech offer a better prediction of human recognition accuracy, supporting this view. The meaning of W, the form of the AI prediction of intelligibility, and its relationship to signal detection will be discussed in more detail in Section 1.2.
The AI predicts average phone error rate for a large amount of phonetically balanced speech, based on the average spectrum of speech, and information about the acoustic conditions. The interpretation of the AI offered above is based on the average spectrum of speech and average phone error rate. In this paper we will attempt to determine whether this interpretation holds for classification of individual utterances, based on the acoustics of individual utterances.
The following paragraphs place this study in the context of research on machine speech recognition, and human speech perception.
Standard representations of speech for speech recognition include the mel-frequency cepstral coefficients (MFCCs) and perceptual LPC (PLP). Davis and Mermelstein (1980) demonstrated that warping the frequency axis to a perceptually-based scale improves word discriminability. Hermansky (1990) demonstrated that an all-pole summary of the loudness spectrum (PLP) exhibits less inter-speaker variability than the raw loudness spectrum. Optimization-based approaches have been adopted recently; for example, transforming the speech signal to maximize information content (Padmanabhan and Dharanipragada, 2005), or transforming the speech signal into a form which can be parsimoniously represented by parametric distributions used in speech recognition systems (Omar and Hasegawa-Johnson, 2004). None of them have supplanted the MFCCs as the dominant representation of speech for automatic speech recognition. The current study is different because it seeks to determine whether a representation of speech in noise is more or less consistent with human behavior, rather than deriving one more appropriate for speech recognition systems.
The idea of using representations of speech inspired by the human auditory system is not new. For example, Hermansky (1990) suggests a representation of speech based on human auditory tuning, level normalization, and compression (which are present in the auditory system). In (Strope and Alwan, 1997) the authors simulate the dynamic activity of the auditory system to emulate a phenomenon called forward masking, and show that a recognizer based on it is more robust to background noise than conventional systems. Another representation of speech called RASTA (Hermansky and Morgan, 1994) is predicated on an assumption that the brain keeps a running estimate of noise and filtering conditions, and uses them when recognizing speech. All systems compared in the current study use an auditory-like representation of speech similar to PLP (described in Hermansky, 1990). Our intention is to test a particular representation of speech in noise to deduce the structure used to classify phones, rather than to test the merits of auditory-like representations of speech, which we already consider to be important.
Studies about representations of speech in noise, and models for detection of speech in noise are especially relevant. Experiments have been done (for example in Viemeister and Wakefield, 1991, Durlach et al., 1986) which demonstrate that Bayes’ rule applied to the probability distribution of auditory signals can predict human performance for some psycho-physical tasks. Hant and Alwan (2003) show that a similar model also predicts discrimination of some speech sounds. This paper is meant to expand the domain of tasks which Bayes’ rule can explain.
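The detection-theoretic view underlying these studies can be made concrete with a toy sketch: Bayes' rule decides whether an observed band level came from noise alone or from speech plus noise. The Gaussian likelihood models and all parameter values here (a 15 dB mean speech level, a 5 dB spread, equal priors) are illustrative assumptions, not values from the studies cited.

```python
import math

def posterior_speech(x, mu_n=0.0, mu_s=15.0, sigma=5.0, prior=0.5):
    """P(speech present | observed band level x in dB) via Bayes' rule.

    Both hypotheses are modeled as Gaussians with common spread sigma
    (an assumption made only for illustration): noise-only centered at
    mu_n, speech-plus-noise centered at mu_s.
    """
    lik_noise = math.exp(-0.5 * ((x - mu_n) / sigma) ** 2)
    lik_speech = math.exp(-0.5 * ((x - mu_s) / sigma) ** 2)
    # Posterior odds: prior-weighted likelihood of the speech hypothesis.
    return prior * lik_speech / (prior * lik_speech + (1.0 - prior) * lik_noise)
```

With equal priors the posterior crosses 0.5 exactly halfway between the two means, so thresholding the posterior is equivalent to thresholding the observed level, which is the connection to the "can be heard" threshold discussed above.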
The Articulation Index models human phone recognition accuracy as a function of filtering and masking conditions. Several versions of the AI (mentioned in Section 1) have been published, which vary in sophistication and correspondingly, their accuracy and convenience. For the sake of brevity, we will describe the version published in (Allen, 2005) which has good accuracy in typical listening conditions.
First, speech is filtered into (in this formulation, 30) disjoint frequency bands with bandpass filters. The edges of these bands were chosen to fit empirical data, and are roughly proportional to the critical bandwidth (Fletcher, 1938, Allen, 1994). The second step is measurement of the speech and noise root-mean-squared levels in each band, denoted here by σs,k and σn,k, respectively, where k indexes the frequency band.
The Articulation Index is computed from
$$\mathrm{AI}=\frac{1}{K}\sum_{k=1}^{K} W_k,\qquad W_k=\frac{1}{30}\min\!\left\{30,\;\max\!\left[0,\;20\log_{10}\!\left(\frac{c\,\sigma_{s,k}}{\sigma_{n,k}}\right)\right]\right\}\qquad(1)$$
The parameter p (and c = 10^{p/20}) is related to the quote by French and Steinberg (1947) in Section 1. They describe p as the "difference in dB between the intensity in a critical band exceeded by 1% of the 1/8-s intervals of received speech and the long average intensity in the same band," depicted in Fig. 1. Thus p (and c) are related to the threshold which is thought to determine whether humans "hear" speech at a particular frequency and time. French and Steinberg (1947) use A in place of our symbol AI and represent the argument of the summation in Eq. (1) with W. They hypothesize that W "is equal to the fraction of the time intervals that speech in a critical band can be heard" which, in terms of Fig. 1, suggests some level on the abscissa which represents a threshold: speech intervals above the threshold can be heard and those below the threshold cannot. They suggest W could be computed by integrating the speech probability distribution in Fig. 1 above this threshold.
The probability of a human incorrectly identifying a phone can be computed from the Articulation Index (Eq. (1)) with
$$e = e_{\min}^{\mathrm{AI}}\qquad(2)$$
where e_min is a parameter equal to approximately 0.015.
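The two-step computation above can be summarized in a minimal sketch, assuming K = 30 bands, the clipping of the band SNR to [0, 30] dB described in the text, and an illustrative value of p ≈ 12 dB (the band-specific values in the original formulations differ):

```python
import numpy as np

def articulation_index(sigma_s, sigma_n, p=12.0):
    """AI from per-band RMS speech and noise levels (sketch).

    sigma_s, sigma_n: arrays of per-band RMS levels (one entry per band).
    p: level in dB exceeded by 1% of speech intervals, relative to the
       long-term average level; c = 10**(p/20) scales the band SNR.
    """
    c = 10.0 ** (p / 20.0)
    snr_db = 20.0 * np.log10(c * sigma_s / sigma_n)
    # Per-band contribution W_k, clipped to [0, 30] dB and scaled to [0, 1].
    W = np.clip(snr_db, 0.0, 30.0) / 30.0
    return float(W.mean())

def phone_error_rate(ai, e_min=0.015):
    """Predicted probability of misidentifying a phone, e = e_min ** AI."""
    return e_min ** ai
```

At very high band SNRs the AI saturates at 1 and the predicted error rate reaches its floor e_min; at AI = 0 the prediction is chance-free error of 1, i.e. no phone information is transmitted.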
The experiment described in this paper is meant to test the hypothesis that humans’ phone transcriptions for an acoustic waveform are based on the time-frequency signal-to-noise ratio rather than the short-time spectral level: A particular time-frequency sample will affect classification only if that sample is unlikely to have resulted from the prevailing noise level in that spectral channel. This is a difficult proposition to test directly because many samples interact with each other in the brain, and our perceptual experiments are not sensitive enough to measure the effect of a single sample. Rather than attempt to directly test this hypothesis, which in our view is ill advised, we will classify speech sounds with several representations of speech, and examine the results to see which are most consistent with human classifications.
Four representations of speech will be tested:
- 1.
The power spectrum. Many automatic speech recognition systems observe a linear transform of the log power spectrum (mel-frequency cepstral coefficients); the power spectrum may therefore be considered analogous to the representations usually used in speech recognition. These will be referred to as the STF (spectro-temporal features).
- 2.
A representation based on the Articulation Index, which is essentially the speech-to-noise ratio as a function of time and frequency. This will be referred to as the AIF.
- 3.
A thresholded version of the AIF. A particular time-frequency pixel is unity if its SNR is greater than some threshold, and zero otherwise. These will be referred to as the AIBINF (“BIN” for binary).
- 4.
A version of the STF enhanced by spectral subtraction, which is called SSF.
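The relationships among the four feature types can be sketched as follows. The names (STF, AIF, AIBINF, SSF) are from the text above, but the exact filterbank, clipping range, detection threshold, and subtraction factor used in the experiments are not reproduced here; the values below are illustrative assumptions.

```python
import numpy as np

def features(power_spec, noise_psd, snr_floor_db=0.0, snr_ceil_db=30.0,
             thresh_db=0.0, ss_factor=1.0):
    """Sketch of the four feature types from a power spectrogram.

    power_spec: (frames, bands) noisy-speech power spectrogram.
    noise_psd:  (bands,) estimate of the masker power per band.
    All parameter values are illustrative, not those of the paper.
    """
    eps = 1e-12
    stf = 10.0 * np.log10(power_spec + eps)           # log power spectrum
    snr_db = stf - 10.0 * np.log10(noise_psd + eps)   # per-pixel SNR in dB
    aif = np.clip(snr_db, snr_floor_db, snr_ceil_db)  # AI-like SNR features
    aibinf = (snr_db > thresh_db).astype(float)       # 1 = detected, 0 = not
    ss_pow = np.maximum(power_spec - ss_factor * noise_psd, eps)
    ssf = 10.0 * np.log10(ss_pow)                     # spectral subtraction
    return stf, aif, aibinf, ssf
```

Note the structural difference: the STF and SSF remain level representations of the (de-noised) signal, while the AIF and AIBINF are functions of the signal-to-noise ratio, which is what the hypothesis under test distinguishes.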
This study will evaluate these speech representations based on the similarity between the mistakes they produce and the mistakes produced by humans in the same acoustic conditions. Greater consistency between human and machine errors is interpreted to mean greater similarity between the human recognition process and the classifier implied by a particular representation. The recognition accuracy provided by the various feature types will be compared, since a better performing feature type will be of interest to speech recognition researchers.
If the AIF leads to mistakes similar to those made by humans, it will support our hypothesis that humans estimate the prevailing noise spectrum and represent speech as an SNR spectrum rather than as the power spectrum of the noisy signal (as in the STF). The drop in performance between the AIF and AIBINF indicates how much information is gained by using a representation with fine level resolution (as in the AIF) rather than only a single bit (1 = detected, 0 = not detected) for each time-frequency pixel. The SSF are included because the AIF will be of less engineering interest if they do not provide an advantage over spectral subtraction, a simple and ubiquitous noise removal technique.
Section 2 describes the speech representations used to test our hypotheses, the human speech classification experiments, the machine speech classification experiments, and the metric used to compare results from them. Section 3 shows the recognition accuracies for each experiment, the relative performance of the various feature types, the similarity between the human and machine mistakes, and the most evident conclusions. The final section discusses their implications for the hypothesis presented above.
Section snippets
Speech materials
The stimuli used in this study are consonant-vowel sounds from the “Articulation Index Corpus” published by the Linguistic Data Consortium (Catalog #LDC2005S22). The sixteen consonants [/p, t, k, f, θ, s, ʃ, b, d, g, v, ð, z, ʒ, m, n/] are paired with vowels in all experiments. The average duration of the speech sounds is 500 ms.
The machine experiment uses the sixteen consonants paired with ten vowels, and approximately fifty examples of each consonant-vowel pair. The total number of sounds is
Results
In this section we will summarize the results of the experiment: recognition accuracy, and similarity to human response patterns. The first is relevant to evaluating the AI-based features’ value for automatic speech recognition, and the second to our hypothesis about human speech perception.
Fig. 2 shows data from a subset of the conditions. Panels (a) and (d) of Fig. 2 show recognition accuracies for the conditions where the test noise spectrum and level match the training noise spectrum and
Review of hypotheses
Humans do not suffer from train-test mismatch in speech classification problems, but machines do. An automatic classifier using AI-based features suffers less than a classifier using spectral subtraction: its classification accuracy is higher, and its confusion matrices more closely resemble the confusion matrices produced by human subjects (lower symmetrized KL divergence).
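The symmetrized KL divergence between human and machine confusion matrices can be computed as in this sketch. Rows must be smoothed and normalized so the divergence is finite when a response cell is empty; the smoothing constant is an assumption, and the paper's exact normalization may differ.

```python
import numpy as np

def sym_kl(P, Q, eps=1e-6):
    """Symmetrized KL divergence between two confusion matrices.

    P, Q: (phones, phones) count or probability matrices, where rows
    index the presented phone and columns the response. Each row is
    smoothed by eps and renormalized to a probability distribution.
    """
    P = (P + eps) / (P + eps).sum(axis=1, keepdims=True)
    Q = (Q + eps) / (Q + eps).sum(axis=1, keepdims=True)
    kl_pq = np.sum(P * np.log(P / Q))   # D(P || Q), summed over rows
    kl_qp = np.sum(Q * np.log(Q / P))   # D(Q || P), summed over rows
    return 0.5 * (kl_pq + kl_qp)
```

A lower value means the machine's confusion pattern more closely resembles the human one; identical matrices give a divergence of zero.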
Classification accuracy and KL divergence are correlated in only two respects: they are degraded by train-test mismatch,
Conclusions
We classified speech sounds with several representations of speech meant to help us determine which representation is more consistent with human behavior.
The AI-based representations performed better and had error patterns more consistent with humans in cases where the testing and training noise spectrum or level were mismatched. This property could be valuable in a practical recognizer because robustness to changes in conditions is a major problem in speech recognition.
A thresholded version of
References (44)
- Allen (1994). How do humans process and recognize speech? IEEE Trans. Speech Audio Process.
- Allen (2005). Consonant recognition and the articulation index. J. Acoust. Soc. Amer.
- ANSI (1969). Methods for the calculation of the articulation index, ANSI…
- ANSI (1997). Methods for the calculation of the speech intelligibility index, ANSI…
- Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. (1976)
- Cooper et al. (1952). Some experiments on the perception of synthetic speech sounds. J. Acoust. Soc. Amer.
- Cover and Thomas (2006). Elements of Information Theory.
- Darwin and Pearson (1982). What tells us when voicing has started? Speech Comm.
- Davis and Mermelstein (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
- Drullman et al. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer.
- Drullman et al. Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Amer.
- Drullman et al. Effect of temporal modulation reduction on spectral contrasts in speech. J. Acoust. Soc. Amer.
- Durlach et al. (1986). Towards a model for discrimination of broadband signals. J. Acoust. Soc. Amer.
- Fletcher (1938). Loudness, masking and their relation to the hearing process and the problem of noise measurement. J. Acoust. Soc. Amer.
- Fletcher and Galt (1950). Perception of speech and its relation to telephony. J. Acoust. Soc. Amer.
- French and Steinberg (1947). Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Amer.
- Furui (1986). On the role of spectral transition for speech perception. J. Acoust. Soc. Amer.
- Hant and Alwan (2003). A psychoacoustic-masking model to predict the perception of speech-like stimuli in noise. Speech Comm.
- Hedrick and Ohde (1993). Effect of relative amplitude of frication on perception of place of articulation. J. Acoust. Soc. Amer.
- Hermansky (1990). Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Amer.
- Hermansky (1998). Should recognizers have ears? Speech Comm.
- Hermansky and Morgan (1994). RASTA processing of speech. IEEE Trans. Speech Audio Process.