
Speech Communication

Volume 53, Issue 2, February 2011, Pages 185-194

Intelligibility predictors and neural representation of speech

https://doi.org/10.1016/j.specom.2010.08.016

Abstract

Intelligibility predictors tell us a great deal about human speech perception, in particular which acoustic factors strongly affect human behavior, and which do not. A particular intelligibility predictor, the Articulation Index (AI), is interesting because it models human behavior in noise, and its form has implications about the representation of speech in the brain. Specifically, the Articulation Index implies that a listener pre-consciously estimates the masking noise distribution and uses it to classify time/frequency samples as speech or non-speech. We classify consonants using representations of speech and noise which are consistent with this hypothesis, and determine whether their error rates and error patterns are more or less consistent with human behavior than those of representations typical of automatic speech recognition systems. The new representations resulted in error patterns more similar to those of humans in cases where the testing and training data sets do not have the same masking noise spectrum.

Research highlights

  • A representation of speech is inferred from human behavior.
  • That and other representations are used to recognize consonant-vowel syllables.
  • Error patterns (confusions) are compared to human error patterns.
  • The new representation made mistakes more like humans with a test-train mismatch.

Introduction

The authors of Hermansky (1998) and Allen (1994) inspire us to (1) examine existing knowledge of human speech perception, (2) employ transformations of speech which simplify the relationship between acoustics and human perception, and (3) use a task, phone classification, which allows machine recognition behavior to be compared in a comprehensible way with human behavior. The goal of this paper is to examine some qualitative knowledge of human speech perception, and to address questions about the structure of the classifier humans use to perform phone transcription.

A great deal of descriptive knowledge exists about speech perception, including:

  1. Experiments which find “cues” indicating membership in various phonetic categories by modifying the time-frequency content of a speech signal and observing human classifications. Many phonetic categories have been investigated in this way (Stevens and Blumstein, 1978, Kewley-Port et al., 1983, Cooper et al., 1952, Jongman, 1989, Hedrick and Ohde, 1993, Repp, 1986, Repp, 1988, Sharf and Hemeyer, 1972, Darwin and Pearson, 1982).

  2. Measurement of human classification accuracy as a function of distortion: removal of fine spectral content, removal of temporal modulations, representation of speech exclusively by formants, etc. (Shannon et al., 1995, Remez et al., 1981, Drullman et al., 1996, Furui, 1986).

  3. Models of human behavior as a function of physical qualities of a speech communication channel, such as noise level and filtering. These models of human behavior are called intelligibility predictors. The most notable are the Articulation Index (French and Steinberg, 1947) and the Speech Transmission Index (Houtgast and Steeneken, 1980).

These studies have contributed greatly to speech and hearing science, audiology, and psychology; however, they arguably have had little effect on the design of machine speech recognition systems. This is likely because they describe human behavior without attempting to infer how humans achieve it. This study is different in that it attempts to infer something about the structure of the human phone classifier.

Systems known as intelligibility predictors were developed to aid the design of speech communication equipment and auditoria. They are models of human performance as a function of parameters of a speech communication channel. The Articulation Index (AI) models the phone error rate as a function of masking noise spectrum and channel filtering. It is described in (French and Steinberg, 1947, Fletcher and Galt, 1950), reformulated in (Kryter, 1962a, Müsch, 2000, Allen, 2005), verified in (Kryter, 1962b, Ronan et al., 2004, Pavlovic and Studebaker, 1984), and standardized in (ANSI, 1969, ANSI, 1997). The accuracy and generality of its predictions over a variety of acoustic conditions are remarkable.

The AI model of human phone error rate indicates that the most important factor affecting human performance is the speech-to-noise ratio as a function of frequency. The AI is the frequency-average of the non-linearly transformed speech-to-noise ratio (described in detail in Section 1.2). There are numerous modifiers which compensate for sharp filtering, high speech levels, loud maskers, or sharply bandpass maskers, all of which evoke effects in the auditory periphery. It may be deduced from the formulation in (Fletcher and Galt, 1950) that these effects play a relatively small role in typical listening conditions. In fact, another formulation (French and Steinberg, 1947) considers fewer of these peripheral effects presumably because they were not seen as necessary. There is also empirical evidence that a reasonably good prediction of intelligibility can be obtained from an even simpler formulation (Phatak et al., 2008).

It seems reasonable to expect that the human brain keeps a running estimate of prevailing noise and filtering conditions, and uses them to interpret acoustic signals, including speech. This notion was suggested by Hermansky and Morgan (1994), who then developed a representation of speech which ignored the effects of slowly varying filtering and noise conditions. It is also substantiated by Drullman et al. (1994b), which showed that low frequency modulations do not affect human performance. The effectiveness of the Articulation Index has been thought to imply that the brain estimates background noise levels, and only “sees” speech if it is unlikely to have come from the background noise. French and Steinberg (1947) put forth this interpretation:

When speech, which is constantly fluctuating in intensity, is reproduced at sufficiently low level only the occasional portions of highest intensity will be heard …

If W is equal to the fraction of the time intervals that speech in a critical band can be heard, it should be possible to derive W from the characteristics of speech and hearing … it will be appreciated that there are certain consequences that can be tested if the hypothesis is correct that W is equal to the proportion of the intervals of speech in a band which can be heard. There are…

The symbol W is essentially the logarithm of the frequency-specific signal-to-noise ratio. The intelligibility prediction produced by the AI is essentially an exponential function of the average of W over frequency.

They conclude that the speech-derived estimates of W are consistent enough with perceptual data to endorse their hypothesis that W is proportional to the percentage of time intervals during which the speech signal is unlikely to have come from the noise. They use the phrase can be heard in a way which seems synonymous with signal detection. Also, an AI model parameter (denoted p in the formulation by French and Steinberg (1947)) is specifically related to the probability distribution of speech (the level in decibels which is higher than 99% of speech levels), and is employed in a way which assumes speech is detectable if its level is greater than a threshold. Two studies (Phatak and Allen, 2007, Pavlovic and Studebaker, 1984) have shown that frequency-specific values for p based on the level distribution of speech offer a better prediction of human recognition accuracy, supporting this view. The meaning of W, the form of the AI prediction of intelligibility, and its relationship to signal detection will be discussed in more detail in Section 1.2.

The AI predicts average phone error rate for a large amount of phonetically balanced speech, based on the average spectrum of speech, and information about the acoustic conditions. The interpretation of the AI offered above is based on the average spectrum of speech and average phone error rate. In this paper we will attempt to determine whether this interpretation holds for classification of individual utterances, based on the acoustics of individual utterances.

The following paragraphs place this study in the context of research on machine speech recognition, and human speech perception.

Standard representations of speech for speech recognition include the mel-frequency cepstral coefficients (MFCCs) and perceptual LPC (PLP). Davis and Mermelstein (1980) demonstrated that warping the frequency axis to a perceptually-based scale improves word discriminability. Hermansky (1990) demonstrated that an all-pole summary of the loudness spectrum (PLP) exhibits less inter-speaker variability than the raw loudness spectrum. Optimization-based approaches have been adopted recently; for example, transforming the speech signal to maximize information content (Padmanabhan and Dharanipragada, 2005), or transforming the speech signal into a form which can be parsimoniously represented by the parametric distributions used in speech recognition systems (Omar and Hasegawa-Johnson, 2004). None of these has supplanted the MFCCs as the dominant representation of speech for automatic speech recognition. The current study is different because it seeks to determine whether a representation of speech in noise is more or less consistent with human behavior, rather than to derive one more appropriate for speech recognition systems.

The idea of using representations of speech inspired by the human auditory system is not new. For example, Hermansky (1990) suggests a representation of speech based on human auditory tuning, level normalization, and compression (all of which are present in the auditory system). Strope and Alwan (1997) simulated the dynamic activity of the auditory system to emulate a phenomenon called forward masking, and showed that a recognizer based on it is more robust to background noise than conventional systems. Another representation of speech, called RASTA (Hermansky and Morgan, 1994), is predicated on the assumption that the brain keeps a running estimate of noise and filtering conditions, and uses them when recognizing speech. All systems compared in the current study use an auditory-like representation of speech similar to PLP (described in Hermansky, 1990). Our intention is to test a particular representation of speech in noise to deduce the structure used to classify phones, rather than to test the merits of auditory-like representations of speech, which we already consider to be important.

Studies about representations of speech in noise, and models for detection of speech in noise are especially relevant. Experiments have been done (for example in Viemeister and Wakefield, 1991, Durlach et al., 1986) which demonstrate that Bayes’ rule applied to the probability distribution of auditory signals can predict human performance for some psycho-physical tasks. Hant and Alwan (2003) show that a similar model also predicts discrimination of some speech sounds. This paper is meant to expand the domain of tasks which Bayes’ rule can explain.

The Articulation Index models human phone recognition accuracy as a function of filtering and masking conditions. Several versions of the AI (mentioned in Section 1) have been published, which vary in sophistication and, correspondingly, in their accuracy and convenience. For the sake of brevity, we will describe the version published in (Allen, 2005), which has good accuracy in typical listening conditions.

First, speech is filtered into disjoint frequency bands (30 in this formulation) with bandpass filters. The edges of these bands were chosen to fit empirical data, and are roughly proportional to the critical bandwidth (Fletcher, 1938, Allen, 1994). The second step is measurement of the speech and noise root-mean-squared levels in each band, denoted here by σs,k and σn,k, respectively, where k indexes the frequency band.

The Articulation Index is computed as

$$AI = \frac{1}{30}\,\frac{1}{K}\sum_{k=1}^{K}\min\!\left(30,\; 10\log_{10}\!\left(1+\frac{c^{2}\,\sigma_{s,k}^{2}}{\sigma_{n,k}^{2}}\right)\right). \qquad (1)$$

The parameter p (and c = 10^{p/20}) is related to the quote by French and Steinberg (1947) in Section 1. They describe p as the “difference in db between the intensity in a critical band exceeded by 1% of 1/8 second intervals of received speech and the long average intensity in the same band,” depicted in Fig. 1. Thus p (and c) are related to the threshold which is thought to determine whether humans “hear” speech at a particular frequency and time. French and Steinberg (1947) use A in place of our symbol AI and represent the argument of the summation in Eq. (1) with W. They hypothesize that W “is equal to the fraction of the time intervals that speech in a critical band can be heard” which, in terms of Fig. 1, suggests some level on the abscissa which represents a threshold: speech intervals above the threshold can be heard and those below the threshold cannot. They suggest W could be computed by integrating the speech probability distribution in Fig. 1 above this threshold.
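Eq. (1) is simple enough to state directly in code. The following is a minimal sketch, not the authors' implementation: it assumes the band filtering and RMS measurement described above have already produced the per-band levels σs,k and σn,k, and the default p of 12 dB is only an illustrative placeholder rather than a value taken from this paper.

```python
import numpy as np

def articulation_index(sigma_s, sigma_n, p_db=12.0):
    """Articulation Index per Eq. (1).

    sigma_s, sigma_n : per-band RMS levels of speech and noise (length K).
    p_db             : the parameter p in dB; 12 dB is only a placeholder here,
                       not the fitted value discussed in the text.
    """
    sigma_s = np.asarray(sigma_s, dtype=float)
    sigma_n = np.asarray(sigma_n, dtype=float)
    K = sigma_s.size
    c = 10.0 ** (p_db / 20.0)                                   # c = 10^(p/20)
    band_snr_db = 10.0 * np.log10(1.0 + (c * sigma_s / sigma_n) ** 2)
    return np.minimum(30.0, band_snr_db).sum() / (30.0 * K)    # (1/30K) * sum_k min(30, ...)
```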

The probability of a human incorrectly identifying a phone can be computed from the Articulation Index (Eq. (1)) as

$$P_{e} = e_{\min}^{AI},$$

where $e_{\min}$ is a parameter equal to approximately 0.015.
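As a quick numerical illustration of this relation, using the stated value $e_{\min} = 0.015$: a condition yielding AI = 0.5 predicts $P_e = 0.015^{0.5} \approx 0.12$, roughly a 12% phone error rate, while AI = 1 recovers the error floor of about 1.5% and AI = 0 gives $P_e = 1$ (complete unintelligibility).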

The experiment described in this paper is meant to test the hypothesis that humans’ phone transcriptions for an acoustic waveform are based on the time-frequency signal-to-noise ratio rather than the short-time spectral level: A particular time-frequency sample will affect classification only if that sample is unlikely to have resulted from the prevailing noise level in that spectral channel. This is a difficult proposition to test directly because many samples interact with each other in the brain, and our perceptual experiments are not sensitive enough to measure the effect of a single sample. Rather than attempt to directly test this hypothesis, which in our view is ill advised, we will classify speech sounds with several representations of speech, and examine the results to see which are most consistent with human classifications.

Four representations of speech will be tested:

  1. The power spectrum. Many automatic speech recognition systems observe a linear transform of the log power spectrum (mel-frequency cepstral coefficients), therefore the power spectrum may be considered analogous to the representations usually used in speech recognition. These will be referred to as the STF (spectro-temporal features).

  2. A representation based on the Articulation Index, which is essentially the speech-to-noise ratio as a function of time and frequency. This will be referred to as the AIF.

  3. A thresholded version of the AIF. A particular time-frequency pixel is unity if its SNR is greater than some threshold, and zero otherwise. These will be referred to as the AIBINF (“BIN” for binary).

  4. A version of the STF enhanced by spectral subtraction, which is called SSF.

This study will evaluate these speech representations based on the similarity between the mistakes they produce and the mistakes produced by humans in the same acoustic conditions. Greater consistency between human and machine errors is interpreted to mean greater similarity between the human recognition process and the classifier implied by a particular representation. The recognition accuracy provided by the various feature types will be compared, since a better performing feature type will be of interest to speech recognition researchers.

If the AIF leads to mistakes similar to those made by humans, it will support our hypothesis that humans estimate the prevailing noise spectrum and represent speech as an SNR spectrum rather than as the power spectrum of the noisy signal (as in the STF). The drop in performance between AIF and AIBINF determines how much information is gained by using a high level-resolution representation of the signal (as in the AIF) rather than only a single bit (1 = detected, 0 = not detected) for each time/frequency pixel. The SSF are included because the AIF features will be of less engineering interest if they do not provide an advantage over the simple and ubiquitous noise removal technique called spectral subtraction.
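To make the differences between the four feature types concrete, the sketch below constructs simplified versions of each from a noisy power spectrogram and a per-channel noise estimate. It is an illustrative reading of the descriptions above, not the front end used in our experiments; the function and argument names, the 0 dB AIBINF threshold, and the spectral-subtraction floor are all assumptions made for the example.

```python
import numpy as np

def make_features(power_spec, noise_power, thresh_db=0.0, ss_floor=1e-3):
    """Illustrative versions of the four feature types (not the paper's exact front end).

    power_spec  : (T, K) noisy-speech power per time frame and frequency channel.
    noise_power : (K,)  estimated masking-noise power per channel.
    """
    eps = 1e-12

    # 1. STF: log power spectrum, analogous to the input of MFCC-style features.
    stf = 10.0 * np.log10(power_spec + eps)

    # 2. AIF: time-frequency speech-to-noise ratio in dB, in the spirit of the AI.
    aif = 10.0 * np.log10(power_spec / (noise_power + eps) + eps)

    # 3. AIBINF: one bit per pixel -- is this sample above the prevailing noise?
    aibinf = (aif > thresh_db).astype(float)

    # 4. SSF: spectral subtraction -- remove the noise estimate, floor the result,
    #    then take log power as in the STF.
    ssf = 10.0 * np.log10(np.maximum(power_spec - noise_power,
                                     ss_floor * power_spec) + eps)

    return stf, aif, aibinf, ssf
```

In this simplified view the AIF differs from the STF only by the per-channel noise normalization, the AIBINF keeps a single detection bit per time-frequency pixel, and the SSF removes the noise estimate before the log, in the spirit of Boll (1976).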

Section 2 describes the speech representations used to test our hypotheses, the human speech classification experiments, the machine speech classification experiments, and the metric used to compare results from them. Section 3 shows the recognition accuracies for each experiment, the relative performance of the various feature types, the similarity between the human and machine mistakes, and the most evident conclusions. The final section discusses their implications for the hypothesis presented above.


Speech materials

The stimuli used in this study are consonant-vowel sounds from the “Articulation Index Corpus” published by the Linguistic Data Consortium (Catalog #LDC2005S22). The sixteen consonants [/p, t, k, f, θ, s, ʃ, b, d, g, v, ð, z, ʒ, m, n/] are paired with vowels in all experiments. The average duration of the speech sounds is 500 ms.

The machine experiment uses the sixteen consonants paired with ten vowels, and approximately fifty examples of each consonant-vowel pair. The total number of sounds is

Results

In this section we will summarize the results of the experiment: recognition accuracy, and similarity to human response patterns. The first is relevant to evaluating the AI-based features’ value for automatic speech recognition, and the second to our hypothesis about human speech perception.

Fig. 2 shows data from a subset of the conditions. Panels (a) and (d) of Fig. 2 show recognition accuracies for the conditions where the test noise spectrum and level match the training noise spectrum and

Review of hypotheses

Humans do not suffer from train-test mismatch in speech classification problems, but machines do. An automatic classifier using AI-based features suffers less than a classifier using spectral subtraction: its classification accuracy is higher, and its confusion matrices more closely resemble the confusion matrices produced by human subjects (lower symmetrized KL divergence).
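A minimal sketch of the comparison metric named above, symmetrized KL divergence between human and machine confusion matrices, is given below. The row smoothing and normalization are assumptions made so the example is self-contained, not a description of the exact computation used in this study.

```python
import numpy as np

def symmetrized_kl(human_conf, machine_conf, smooth=1e-6):
    """Symmetrized KL divergence between two confusion matrices.

    Each matrix is (N_phones, N_phones): rows index the presented sound,
    columns the response. Rows are smoothed and normalized to conditional
    distributions before comparison.
    """
    def row_dist(conf):
        conf = np.asarray(conf, dtype=float) + smooth
        return conf / conf.sum(axis=1, keepdims=True)

    p = row_dist(human_conf)
    q = row_dist(machine_conf)
    kl_pq = np.sum(p * np.log(p / q))   # D(P || Q), summed over all rows
    kl_qp = np.sum(q * np.log(q / p))   # D(Q || P)
    return kl_pq + kl_qp
```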

Classification accuracy and KL divergence are correlated in only two respects: they are degraded by train-test mismatch,

Conclusions

We classified speech sounds with several representations of speech meant to help us determine which representation is more consistent with human behavior.

The AI-based representations performed better and had error patterns more consistent with humans in cases where the testing and training noise spectrum or level were mismatched. This property could be valuable in a practical recognizer because robustness to changes in conditions is a major problem in speech recognition.

A thresholded version of

References

  • C. Darwin et al., What tells us when voicing has started? Speech Comm. (1982)
  • J. Hant et al., A psychoacoustic-masking model to predict the perception of speech-like stimuli in noise, Speech Comm. (2003)
  • H. Hermansky, Should recognizers have ears? Speech Comm. (1998)
  • J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. (1994)
  • J. Allen, Consonant recognition and the articulation index, J. Acoust. Soc. Amer. (2005)
  • ANSI (1969). Methods for the calculation of the articulation index, ANSI...
  • ANSI (1997). Methods for the calculation of the speech intelligibility index, ANSI...
  • S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. (1976)
  • F. Cooper et al., Some experiments on the perception of synthetic speech sounds, J. Acoust. Soc. Amer. (1952)
  • T.M. Cover et al., Elements of Information Theory (2006)
  • S. Davis et al., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. (1980)
  • R. Drullman et al., Effect of temporal envelope smearing on speech reception, J. Acoust. Soc. Amer. (1994)
  • R. Drullman et al., Effect of reducing slow temporal modulations on speech reception, J. Acoust. Soc. Amer. (1994)
  • R. Drullman et al., Effect of temporal modulation reduction on spectral contrasts in speech, J. Acoust. Soc. Amer. (1996)
  • N. Durlach et al., Towards a model for discrimination of broadband signals, J. Acoust. Soc. Amer. (1986)
  • H. Fletcher, Loudness, masking and their relation to the hearing process and the problem of noise measurement, J. Acoust. Soc. Amer. (1938)
  • H. Fletcher et al., Perception of speech and its relation to telephony, J. Acoust. Soc. Amer. (1950)
  • N. French et al., Factors governing the intelligibility of speech sounds, J. Acoust. Soc. Amer. (1947)
  • S. Furui, On the role of spectral transition for speech perception, J. Acoust. Soc. Amer. (1986)
  • M. Hedrick et al., Effect of relative amplitude of frication on perception of place of articulation, J. Acoust. Soc. Amer. (1993)
  • H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Amer. (1990)
  • H. Hermansky et al., RASTA processing of speech, IEEE Trans. Speech Audio Process. (1994)