Speech Communication

Volume 84, November 2016, Pages 66-82

Human-inspired modulation frequency features for noise-robust ASR

https://doi.org/10.1016/j.specom.2016.09.003

Highlights

  • We investigate whether the configuration of an auditory model that is optimal for predicting the intelligibility of speech under several adverse conditions is also optimal as a frontend for an automatic speech recognition system. We found that the answer is negative. The frequency resolution in the modulation filterbank that is optimal for predicting global intelligibility is too coarse for effective automatic speech recognition.

  • We also found that detailed resolution of the modulation frequencies at the low end of the spectrum becomes more important as the signal-to-noise ratio decreases. Quite unexpectedly, the modulation frequency spectra of car noise and train station noise appeared to be different from the spectra of the other noise types in AURORA-2.

  • To handle the very high-dimensional and redundant feature vectors, we used a sparse coding approach for estimating the posterior probabilities of the subword units. As in a previous system that used sparse coding, we found that noise robustness in the lowest SNR conditions is improved relative to systems based on GMMs, but at the cost of slightly lower performance in the highest SNR conditions. We discuss the impact of the distance measures used in the sparse coding engine. We suggest several ways in which recognition accuracy can be improved, guided by knowledge about human speech processing.

  • We briefly point out possible connections between combining an auditory model as a frontend and an exemplar-based procedure for estimating posterior probabilities with recent findings in brain research.

  • The eventual aim of our research is building a model of speech recognition that is as robust to noise as humans are, using as much as possible the same processing procedures as humans do, so that the remaining recognition errors are similar to the errors that humans make.

Abstract

This paper investigates a computational model that combines a frontend based on an auditory model with an exemplar-based sparse coding procedure for estimating the posterior probabilities of sub-word units when processing noisified speech. Envelope modulation spectrogram (EMS) features are extracted using an auditory model which decomposes the envelopes of the outputs of a bank of gammatone filters into one lowpass and multiple bandpass components. Through a systematic analysis of the configuration of the modulation filterbank, we investigate how and why different configurations affect the posterior probabilities of sub-word units by measuring the recognition accuracy on a semantics-free speech recognition task. Our main finding is that representing speech signal dynamics by means of multiple bandpass filters typically improves recognition accuracy. This effect is particularly noticeable in very noisy conditions. In addition, we find that for maximum noise robustness the bandpass filters should focus on low modulation frequencies. This reinforces our intuition that noise robustness can be increased by exploiting redundancy in those frequency channels whose integration time is long enough not to suffer from envelope modulations that are solely due to noise. The ASR system we design based on these findings behaves more similarly to human recognition of noisified digit strings than conventional ASR systems do. Thanks to the relation between the modulation filterbank and the procedures for computing dynamic acoustic features in conventional ASR systems, this finding can be used to improve the frontends of those systems.

Introduction

Over the last few decades a substantial body of neurophysiological and behavioral knowledge about the human auditory system has been accumulated. Psycho-acoustic research has provided detailed information about the frequency and time resolution capabilities of the human auditory system (e.g. Fletcher, 1940, Zwicker, Flottorp, Stevens, 1957, Kay, Matthews, 1972, Bacon, Viemeister, 1985, Houtgast, 1989, Houtgast, Steeneken, 1985, Drullman, Festen, Plomp, 1994, Dau, Kollmeier, Kohlrausch, 1997a, Dau, Kollmeier, Kohlrausch, 1997b, Ewert, Dau, 2000, Chi, Ru, Shamma, 2005, Moore, 2008, Jørgensen, Dau, 2011, Jørgensen, Ewert, Dau, 2013). It is now generally assumed that the rate at which the tonotopic representations in the cochlea change over time, characterized by the so-called modulation frequencies, is a crucial aspect of the intelligibility of speech signals. Drullman et al. (1994) showed that modulation frequencies between 4 Hz and 16 Hz carry the bulk of the information in speech signals. Modulation frequencies around 4 Hz roughly correspond to the number of syllables per second in normal speech; the highest modulation frequencies are most likely related to changes induced by transitions between phones.1 Despite the fact that several attempts have been made to integrate the concept of modulation frequencies in automatic speech recognition (ASR) (e.g. Hermansky, 1997, Kanedera et al., 1998, Kanedera, Arai, Hermansky, Pavel, 1999, Hermansky, 2011, Schädler, Meyer, Kollmeier, 2012, Moritz, Anemüller, Kollmeier, 2015), these investigations have not led to the crucial breakthrough in noise-robust ASR that was hoped for. The performance gap between human speech recognition (HSR) and ASR is still large, especially for speech corrupted by noise (e.g. Lippmann, 1996, Sroka, Braida, 2005, Meyer, Brand, Kollmeier, 2011, Meyer, 2013).

For meaningful connected speech, part of the human advantage is evidently due to semantic predictability. However, even in tasks that offer no semantic advantage, such as recognizing digit sequences (Meyer, 2013) or phonemes (Meyer et al., 2011), humans tend to outperform machines substantially. Therefore, it must be assumed that acoustic details that are important in human processing are lost in feature extraction or in the computation of posterior probabilities in ASR systems.

There is convincing evidence that some information is lost if (noisy) speech signals are merely represented as sequences of spectral envelopes. Demuynck et al. (2004) showed that it is possible to reconstruct intelligible speech from a sequence of MFCC vectors, but when Meyer et al. (2011) investigated the accuracy with which human listeners recognize re-synthesized speech in noise, they found that in order to achieve the same phoneme recognition accuracy as with the original speech, the re-synthesized speech required a signal-to-noise ratio (SNR) that was 10 dB higher (3.8 dB versus −6.2 dB).

In Macho et al. (2002) it was shown that an advanced frontend that implements a dynamic noise reduction prior to the computation of MFCC features reduces the word error rate. Meyer (2013) showed that advanced features, such as power-normalized cepstral coefficients (PNCC) (Kim and Stern, 2009) and Gabor filter features (Schädler et al., 2012), improve recognition accuracy compared to standard MFCCs. The advanced frontend, the PNCC features and the Gabor filter features all capture characteristics of the temporal dynamics of speech signals that go beyond static coefficients enriched with deltas and delta-deltas. Therefore, it is quite likely that both HSR and ASR suffer from the fact that a conventional frontend, which samples the spectral envelope 100 times per second and then adds first and second order time derivatives, yields an impoverished representation of crucial information about the dynamic changes in noisy speech.
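
To make the comparison concrete: the conventional deltas and delta-deltas referred to above are linear-regression slopes computed over a short window of static coefficients. The sketch below (plain NumPy; the window half-width N = 2 is the common default, not a value taken from this paper) shows the standard formulation; delta-deltas are obtained by applying the same operation to the deltas.

```python
import numpy as np

def delta(features, N=2):
    """Standard delta coefficients: linear-regression slope over +/-N frames.

    features: (T, D) array of static coefficients (e.g., MFCCs), one row
    per 10 ms frame. Returns a (T, D) array of first-order time derivatives.
    """
    T, D = features.shape
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros((T, D))
    for t in range(T):
        acc = np.zeros(D)
        for n in range(1, N + 1):
            # difference between the frames n steps ahead and n steps back
            acc += n * (padded[t + N + n] - padded[t + N - n])
        out[t] = acc / denom
    return out

# delta-deltas are deltas of deltas: dd = delta(delta(mfcc))
```

Viewed this way, the delta operator is itself a crude modulation bandpass filter, which is precisely the connection to the modulation filterbank explored below.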

The research reported here is part of a long-term enterprise aimed at understanding human speech comprehension by means of a computational model that conforms to (neuro)physiological knowledge. For that purpose we want to build a simulation that not only makes equally few, but also the same kind of recognition errors as humans in tasks that do not involve elusive semantic processing. As a first step in that direction we investigate the performance of ASR systems with frontends inspired by an auditory model that has been shown to predict intelligibility quite accurately in conditions with additive stationary noise, reverberation, and non-linear processing with spectral subtraction (Elhilali, Chi, Shamma, 2003, Jørgensen, Dau, 2011, Jørgensen, Dau, 2014, Jørgensen, Ewert, Dau, 2013). In addition, we investigate how an exemplar-based procedure for estimating the posterior probabilities of sub-word units interacts with the auditory-based frontends.

Auditory models predict speech intelligibility on the basis of the difference between the long-term average power of the noise and the speech signals at the output of the peripheral auditory system (Jørgensen and Dau, 2011). However, it is evident that the long-term power spectrum of a speech signal is not sufficient for speech recognition. Auditory models are silent about all the processing of their outputs that is necessary to accomplish speech recognition. As a consequence, it is not clear whether an auditory model that performs well in predicting intelligibility for humans based on the envelope power signal-to-noise ratio, such as the SNRenv model (Jørgensen and Dau, 2011), is also optimal in an ASR system that most probably processes the output of the auditory model in a different way than humans do.
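
For illustration, the quantity at the heart of such predictions compares the envelope power of speech and of noise per channel. The sketch below is a minimal SNRenv-style statistic under the simplifying assumption that the clean-speech envelope power can be approximated by subtracting the noise envelope power from that of the noisy mixture, with a floor; it is not the exact formulation published by Jørgensen and Dau (2011).

```python
import numpy as np

def envelope_power(env):
    """AC-coupled envelope power, normalized by the squared DC (mean) value."""
    dc = np.mean(env)
    return np.mean((env - dc) ** 2) / (dc ** 2 + 1e-12)

def snr_env_db(env_mix, env_noise, floor=1e-4):
    """SNRenv-style statistic for one (audio, modulation) channel.

    env_mix:   envelope of the noisy speech in this channel
    env_noise: envelope of the noise alone in this channel
    The speech envelope power is approximated by subtraction and floored;
    the published model adds further details (e.g., per-band weighting).
    """
    p_mix = envelope_power(env_mix)
    p_noise = max(envelope_power(env_noise), floor)
    p_speech = max(p_mix - p_noise, floor)
    return 10.0 * np.log10(p_speech / p_noise)
```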

The modulation filterbank in the auditory frontend proposed in e.g. Jørgensen and Dau (2011, 2014) and Jørgensen et al. (2013) consists of a lowpass filter (LPF) and a number of bandpass filters (BPFs) that together cover the modulation frequency band up to 20 Hz. In our work we vary the cut-off frequency of the LPF, as well as the number and center frequencies of the BPFs. In this respect, our experiments are somewhat similar to those reported in Moritz et al. (2015), who aimed to harness knowledge about the human auditory system to improve the conventional procedure for enriching MFCCs with delta and delta-delta coefficients. In our research the focus is on understanding how and why resolving specific details in the modulation spectrum improves recognition performance, rather than on obtaining the highest possible recognition accuracy. The way in which we use sparse coding for estimating the likelihood of sub-word units in noise-corrupted speech is very different from the approach pioneered by Gemmeke et al. (2011), who tried to capture the articulatory continuity in speech by using exemplars that spanned 300 ms. In Ahmadi et al. (2014) it was shown that single-frame samples of the output of a modulation filterbank capture a comparable amount of information about articulatory continuity. In that paper we designed the modulation filterbank based on knowledge collected from the relevant literature on the impact of different modulation bands on clean speech recognition. Here, we extend that work substantially by experimenting with conceptually motivated designs of the filterbank.
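
A configurable modulation filterbank of this kind can be sketched as follows, with Butterworth filters standing in for the filter shapes used in the cited models. The LPF cut-off, the number of BPFs and their centre frequencies are exactly the parameters we vary; the octave-spaced defaults below are illustrative only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def modulation_filterbank(env, fs_env, f_lp=1.0, centers=(2, 4, 8, 16)):
    """Apply a lowpass + Q=1 bandpass modulation filterbank to one
    sub-band envelope sampled at fs_env Hz (fs_env must exceed 2 * max edge).

    Returns a (1 + len(centers), len(env)) array: the LPF output first,
    then one row per bandpass output.
    """
    outputs = []
    sos_lp = butter(2, f_lp, btype="low", fs=fs_env, output="sos")
    outputs.append(sosfilt(sos_lp, env))
    for fc in centers:
        # for Q = fc / bandwidth = 1, the -3 dB edges satisfy
        # f2 - f1 = fc and fc = sqrt(f1 * f2)
        f1 = fc * (np.sqrt(1.25) - 0.5)
        f2 = fc * (np.sqrt(1.25) + 0.5)
        sos_bp = butter(1, (f1, f2), btype="band", fs=fs_env, output="sos")
        outputs.append(sosfilt(sos_bp, env))
    return np.stack(outputs)
```

Stacking these outputs over all gammatone channels at each 10 ms step yields an EMS feature vector; this is a sketch of the idea, not the exact implementation evaluated below.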

All theories of human speech comprehension (e.g. Cutler, 2012) and all extant ASR systems (e.g. Rabiner, Juang, 1993, Huang, Acero, Hon, 2001, Holmes, Holmes, 2001) assume that speech recognition hinges on recognizing words in some lexicon, and that these words are represented in the form of a limited number of sub-word units. Recognition after the frontend is assumed to comprise two additional processes, viz. estimating the likelihoods of sub-word units and finding the sequence of words that is most likely given those likelihoods. Both computational models of HSR (e.g. ten Bosch, Boves, Ernestus, 2013, ten Bosch, Boves, Tucker, Ernestus, 2015) and ASR systems prefer statistical models or, alternatively, neural network models for estimating sub-word unit likelihoods, and some sort of finite state transducer for finding the best path through the sub-word unit lattice.
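
The path-search step is standardly implemented as a Viterbi search over the sub-word unit lattice. A generic log-domain sketch in plain NumPy (illustrative, not the decoder of any specific system discussed here):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence given per-frame sub-word unit scores.

    log_obs:   (T, S) per-frame log-likelihoods of the S sub-word states
    log_trans: (S, S) log transition probabilities, rows = source state
    log_init:  (S,)   initial log state probabilities
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (from, to) path scores
        back[t] = np.argmax(cand, axis=0)      # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):              # trace the best path back
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```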

Despite the analogy between artificial neural networks and the operation of the brain, and despite the fact that networks of spiking neurons have been shown to be able to approximate arbitrary statistical distributions (e.g. Buesing et al., 2011), there is no empirical evidence in support of the claim that human speech processing makes use of statistical models of sub-word units. Therefore, we decided to explore the possibility that the estimation of the likelihoods of sub-word units is mediated by an exemplar-based procedure (Goldinger, 1998). Exemplar-based procedures offer several benefits compared to GMM-based approaches. An advantage that is especially beneficial for our work is that exemplar-based approaches can handle high-dimensional feature vectors without the need for dimensionality reduction procedures, which are likely to mix up tonotopic features that are clean with features that are corrupted by some kind of ‘noise’. In addition, exemplar-based representations are compatible with recent findings about the representation of auditory patterns in the human cortex (Mesgarani, Cheung, Johnson, Chang, 2014, Mesgarani, David, Fritz, Shamma, 2014) and with models of memory formation and retrieval (e.g. Wei, Wang, Wang, 2012, Myers, Wallis, 2013).

De Wachter et al. (2007) have shown that an exemplar-based approach to automatic speech recognition is feasible when using MFCCs and GMMs. More recently, Gemmeke et al. (2011) and Ahmadi et al. (2014) have shown that noise-robust ASR systems can be built using exemplar-based procedures in combination with sparse coding (SC) (e.g. Lee, Seung, 1999, Olshausen, Field, 2004, Ness, Walters, Lyon, 2012). Geiger et al. (2013) have shown that the exemplar-based SC approach can be extended to handle medium-vocabulary noise-robust ASR. In sparse coding procedures a (possibly very large) dictionary of exemplars of speech and noise is used to represent unknown incoming observations as a sparse sum of the exemplars in the dictionary.
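
As a concrete illustration of the core operation, the sketch below decomposes one observed feature vector as a non-negative combination of dictionary exemplars. Plain non-negative least squares is used for brevity (NNLS solutions tend to be sparse); the solver and the hypothetical max_atoms pruning parameter are illustrative stand-ins, not the configuration used in this paper.

```python
import numpy as np
from scipy.optimize import nnls

def sparse_code(dictionary, observation, max_atoms=20):
    """Represent one observation as a sparse non-negative sum of exemplars.

    dictionary:  (D, K) matrix whose K columns are speech and noise exemplars
    observation: (D,) feature vector extracted from the noisy signal
    Returns a (K,) vector of activation weights, mostly zero.
    """
    weights, _residual = nnls(dictionary, observation)
    if np.count_nonzero(weights) > max_atoms:
        # enforce sparsity explicitly by keeping only the largest activations
        weights[np.argsort(weights)[:-max_atoms]] = 0.0
    return weights
```

Posterior estimates for sub-word units can then be obtained by pooling the activation weights of all speech exemplars that carry the same state label and normalizing.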

The seminal research at Bell Labs by Fletcher (1940, 1953) provides evidence for the hypothesis that speech processing relies on matching incoming signals to stored knowledge in separate frequency bands. That insight has been explored for the purpose of noise-robust ASR in the form of multi-stream processing (Misra, 2006). We apply the same insight to the frequency bands in the modulation spectrum: we assume that the high-dimensional modulation spectrum contains enough features that are not affected by the noise, so that these features will dominate the distance measure in a sparse coding engine. The probability that ‘clean’ bands exist will depend on the design details of the modulation filterbank (and on the noise characteristics).
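
One simple way to operationalize this assumption, not the distance measure actually used in our sparse coding engine, is a trimmed band-wise distance in which the best-matching bands dominate:

```python
import numpy as np

def trimmed_band_distance(x, exemplar, trim=0.3):
    """Distance that lets 'clean' bands dominate: compute a squared error
    per (frequency, modulation) band and drop the worst fraction, on the
    assumption that heavily noise-corrupted bands yield the largest errors.
    trim is a hypothetical parameter chosen for illustration.
    """
    err = (x - exemplar) ** 2                  # one error per band
    k = max(1, int(len(err) * (1.0 - trim)))   # number of bands to keep
    return float(np.sum(np.sort(err)[:k]))
```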

A sparse coding engine that represents noisy speech in the form of sparse sums of clean speech and pure noise exemplars can operate in three main ways. If it starts by matching noise exemplars, the operation is reminiscent of noise suppression and spectral subtraction (e.g. Kolossa and Haeb-Umbach, 2011). If the engine starts by matching speech exemplars, its operation is reminiscent of missing data approaches and glimpsing (Cooke, 2006). Combinations of both strategies can also be envisaged. The third strategy, and the one used in this paper, is treating the noise and speech exemplars in exactly the same way, leaving it to the solver to decide whether an unknown observation is first matched to speech or to noise exemplars.

To maximize the possibility of comparing our results to previous research, we develop our initial system using the AURORA-2 data set. Although one might argue that the AURORA-2 task is not representative of a general speech recognition task, the task does not limit the generalizability of the insights gained. Actually, the design of AURORA-2 is beneficial for our current purpose for two reasons. First, recognizing connected digit strings does not require an advanced language model; the fact that all sequences of two digits are equally probable minimizes the interference between the frontend and the backend. This set-up also corresponds to research on human speech intelligibility, which is often based on short semantically unpredictable (and therefore effectively meaningless) utterances. Second, the literature contains a number of benchmarks to which the current results can be compared. In our experiments we follow the conventional approach to the AURORA-2 task, which requires estimating the posterior probabilities of 176 speech and 3 silence states in a hidden Markov model.


System overview

The recognition system used in this work is depicted schematically in Fig. 1. We discern three main processing blocks. In the first block, acoustic features are extracted every 10 ms from the speech signal using the same type of signal processing as employed in the speech-based envelope power spectrum model (sEPSM) proposed by Jørgensen and Dau (2011) and Jørgensen et al. (2013). The sEPSM model contains more simplifying assumptions than the auditory model proposed in Chi et al. (2005), but the
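
To make the first block concrete, the following sketch computes one such analysis channel under simplifying assumptions: a Butterworth bandpass stands in for the gammatone filter, the Hilbert envelope replaces the model's envelope extraction stage, and the envelope is resampled to 100 Hz to match the 10 ms feature rate. The parameter choices are illustrative, not those of the sEPSM.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert, resample_poly

def subband_envelope(x, fs, fc, rel_bw=0.25, fs_env=100):
    """One EMS analysis channel: bandpass around fc, envelope, downsample.

    x:  waveform sampled at fs Hz (integer); fc: channel centre frequency.
    rel_bw sets the (illustrative) relative bandwidth of the bandpass.
    Returns the sub-band envelope sampled at fs_env Hz (one value per 10 ms).
    """
    f1, f2 = fc * (1.0 - rel_bw), fc * (1.0 + rel_bw)
    sos = butter(2, (f1, f2), btype="band", fs=fs, output="sos")
    band = sosfilt(sos, x)
    env = np.abs(hilbert(band))              # Hilbert envelope of the band
    return resample_poly(env, fs_env, fs)    # e.g., 8000 Hz -> 100 Hz
```

The outputs of the modulation filterbank applied to each such envelope, concatenated over all channels, form the EMS feature vector extracted every 10 ms.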

Exploiting modulation frequency domain information

To investigate the impact of the way in which information about modulation frequencies is represented in the EMS feature vectors, we designed a sequence of experiments. In Study 1 we use a simplified version of the auditory model to investigate several technical and conceptual issues. We also address the correspondence between the LPF and BPFs in the modulation filterbank on the one hand and the static and dynamic features in conventional ASR systems on the other (cf. Moritz et al., 2015). In Study 2

ASR

In this research we investigated how different configurations of the modulation filterbank affect recognition performance. To deepen our understanding of the strengths and weaknesses of the combination of EMS features and SC, we compared the performance on test sets A and B of AURORA-2 with previously published recognition accuracies of three other systems: the ‘standard’ AURORA-2 system trained with the multi-condition data (Hirsch and Pearce, 2000), the multi-condition AURORA-2 system that

General discussion

In this paper we investigated how different configurations of the modulation filterbank in an auditory frontend affect the degree to which an exemplar-based engine can provide accurate posterior probability estimates of sub-word units when recognizing noise-corrupted speech. The auditory model proposed by Jørgensen and Dau (2014), which consists of an LPF with a cut-off frequency of 1 Hz and nine Q=1 BPFs with center frequencies one octave apart, served as the point of departure. For estimating

Conclusion

In this paper we investigated to what extent a model of the human auditory system that is capable of predicting speech intelligibility in adverse conditions also provides a promising starting point for designing the frontend of a noise robust ASR system. The long-term goal of the research is to design a computational model that shows human-like recognition behavior in terms of performance level and the type of errors. We investigated which details of the auditory model configuration are most

Acknowledgments

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. FP7-PEOPLE-2011-290000. We express our gratitude to Torsten Dau, whose input during the design phase of the experiments was most helpful; his contributions to interpreting the results are also greatly appreciated. We are also grateful to Tobias May for his advice during the experiments and for providing part

References (72)

  • S. Ahmadi et al.

    Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

    EURASIP J. Audio Speech Music Process.

    (2014)
  • D. Baby et al.

    Investigating modulation spectrogram features for deep neural network-based automatic speech recognition

    Proceedings INTERSPEECH. Dresden, Germany

    (2015)
  • S.P. Bacon et al.

    Temporal modulation transfer functions in normal-hearing and hearing-impaired listeners

    Int. J. Audiol.

    (1985)
  • H. Bourlard

    Non-stationary multi-channel (multi-stream) processing towards robust and adaptive ASR

    Proceedings ESCA Workshop Robust Methods Speech Recognition in Adverse Conditions

    (1999)
  • H. Bourlard et al.

    Towards subband-based speech recognition

    Proceedings of EUSIPCO

    (1996)
  • L. Buesing et al.

    Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons

    PLoS Comput. Biol.

    (2011)
  • T. Chi et al.

    Multiresolution spectrotemporal analysis of complex sounds

    J. Acoust. Soc. Am.

    (2005)
  • J. Choi et al.

    Toward sparse coding on cosine distance

    22nd International Conference on Pattern Recognition (ICPR)

    (2014)
  • M. Cooke

    A glimpsing model of speech perception in noise

    J. Acoust. Soc. Am.

    (2006)
  • A. Cutler

    Native Listening: Language Experience and the Recognition of Spoken Words

    (2012)
  • T. Dau et al.

    Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers

    J. Acoust. Soc. Am.

    (1997)
  • T. Dau et al.

    Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration

    J. Acoust. Soc. Am.

    (1997)
  • T. Dau et al.

    A quantitative model of the “effective” signal processing in the auditory system. I. Model structure

    J. Acoust. Soc. Am.

    (1996)
  • M. De Wachter et al.

    Template-based continuous speech recognition

    IEEE Trans. Audio Speech Lang. Process.

    (2007)
  • K. Demuynck et al.

    Synthesizing speech from speech recognition parameters

    Proceedings of Interspeech

    (2004)
  • R. Drullman et al.

    Effect of temporal envelope smearing on speech reception

    J. Acoust. Soc. Am.

    (1994)
  • B. Efron et al.

    Least angle regression

    Ann. Stat.

    (2004)
  • M. Elhilali et al.

    A spectro-temporal modulation index (STMI) for assessment of speech intelligibility

    Speech Commun.

    (2003)
  • S.D. Ewert et al.

    Characterizing frequency selectivity for envelope fluctuations

    J. Acoust. Soc. Am.

    (2000)
  • H. Fletcher

    Auditory patterns

    Rev. Mod. Phys.

    (1940)
  • H. Fletcher

    Speech and Hearing in Communication

    (1953)
  • J. Geiger et al.

    The TUM+TUT+KUL approach to the 2nd CHiME challenge: multi-stream ASR exploiting BLSTM networks and sparse NMF

    Proceedings of CHiME

    (2013)
  • J.F. Gemmeke et al.

    Exemplar-based sparse representations for noise robust automatic speech recognition

    IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • S. Goldinger

    Echoes of echoes? An episodic theory of lexical access

    Psychol. Rev.

    (1998)
  • S. Grossberg et al.

    Laminar cortical dynamics of conscious speech perception: neural model of phonemic restoration using subsequent context in noise

    J. Acoust. Soc. Am.

    (2011)
  • M.J. Henry et al.

    Selective attention to temporal features on nested time scales

    Cereb. Cortex

    (2013)
  • H. Hermansky

    The modulation spectrum in the automatic recognition of speech

    Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding. Santa Barbara

    (1997)
  • H. Hermansky

    Speech recognition from spectral dynamics

    Sadhana

    (2011)
  • H. Hermansky

    Multistream recognition of speech: dealing with unknown unknowns

    Proc. IEEE

    (2013)
  • H. Hermansky et al.

    Multi-resolution rasta filtering for TANDEM-based ASR

    Proc. Int. Conf. Spoken Lang. Process.

    (2005)
  • Hirsch, H., Pearce, D., 2006. Applying the advanced ETSI frontend to the Aurora-2 task. Tech. Report version 1.1.
  • H.G. Hirsch et al.

    The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions

    Proceedings ISCA Workshop ASR2000, Automatic Speech Recognition: Challenges for the Next Millennium. Paris, France

    (2000)
  • J. Holmes et al.

    Speech Synthesis and Recognition

    (2001)
  • T. Houtgast

    Frequency selectivity in amplitude-modulation detection

    J. Acoust. Soc. Am.

    (1989)
  • T. Houtgast et al.

    A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria

    J. Acoust. Soc. Am.

    (1985)
  • X. Huang et al.

    Spoken Language Processing

    (2001)