Abstract
The segregation of concurrent speakers and other sound sources is important for improving the performance of audio technology, such as noise reduction and automatic speech recognition (ASR), in difficult acoustic conditions. This technology is relevant for applications like hearing aids, mobile audio devices, robotics, hands-free audio communication and speech-based computer interfaces. Computational auditory-scene analysis (CASA) techniques simulate aspects of the processing in the human perceptual system using statistical signal-processing techniques to improve inferences about the causes of the audio input received by the system. This study argues that CASA is a promising approach to source separation and outlines several theoretical arguments to support this hypothesis. With a focus on computational binaural scene analysis, principles of CASA techniques are reviewed. Furthermore, in an experimental approach, the applicability of a recent model of binaural interaction to improving ASR performance in multi-speaker conditions with spatially separated, moving speakers is explored. The binaural model provides input to a statistical inference filter that employs a priori information on possible movements of the sources in order to track the positions of the speakers. The tracks are used to adapt a beamformer that selects a specific speaker. The output of the beamformer is subsequently used for an ASR task. Compared to the unprocessed, that is, mixed, data in a two-speaker condition, the word recognition rates obtained with the enhanced signals based on binaural information increased from 30.8 % to 88.4 %, demonstrating the potential of the proposed CASA-based approach.
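The tracking stage described above, a statistical filter that combines noisy direction estimates with a priori assumptions about source movement, can be illustrated with a minimal sketch. The code below is a plain bootstrap particle filter for the azimuth of a single source, not the filter used in the chapter. The function names (`track_azimuth`, `simulate`, `rms_error`), the random-walk movement model, the Gaussian observation model, and all parameter values are assumptions made for illustration only.

```python
import math
import random


def track_azimuth(observations, n_particles=500, process_std=2.0,
                  obs_std=10.0, seed=0):
    """Bootstrap particle filter for one source azimuth in degrees.

    observations: noisy azimuth estimates, one per time frame.
    Returns the posterior-mean azimuth for each frame.
    """
    rng = random.Random(seed)
    # Start with particles spread uniformly over the frontal half-plane.
    particles = [rng.uniform(-90.0, 90.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: assumed random-walk prior on source movement.
        particles = [min(90.0, max(-90.0, p + rng.gauss(0.0, process_std)))
                     for p in particles]
        # Update: assumed Gaussian likelihood of the observed direction.
        log_w = [-0.5 * ((z - p) / obs_std) ** 2 for p in particles]
        max_log_w = max(log_w)  # subtract the maximum for numerical stability
        weights = [math.exp(lw - max_log_w) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample (multinomial) to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


def rms_error(estimates, truth):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


def simulate(n_frames=200, obs_std=10.0, seed=1):
    """Synthetic speaker moving from -40 to +40 degrees, plus noisy estimates."""
    rng = random.Random(seed)
    truth = [-40.0 + 80.0 * t / (n_frames - 1) for t in range(n_frames)]
    noisy = [a + rng.gauss(0.0, obs_std) for a in truth]
    return truth, noisy


if __name__ == "__main__":
    truth, noisy = simulate()
    track = track_azimuth(noisy)
    print("RMS error of raw estimates:  ", rms_error(noisy, truth))
    print("RMS error of filtered track: ", rms_error(track, truth))
```

In a pipeline like the one outlined in the abstract, such a track would then steer a beamformer toward the selected speaker. Handling several concurrent speakers additionally requires associating each observation with a source, which this single-source sketch leaves out.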
Notes
- 1.
Note that this approach can be extended to more inputs, for example, multiple microphones or audiovisual input, or might be restricted to a single input. The current study covers its application to binaural input signals like recordings from a dummy head.
- 2.
- 3.
The algorithm is part of a Matlab-Toolbox provided by [25].
References
M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process., 50:174–188, 2002.
R. Beutelmann and T. Brand. Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am., 120:331–342, 2006.
J. Bitzer and K. U. Simmer. Superdirective microphone arrays. In M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 2. Springer, 2001.
A. Brand, O. Behrend, T. Marquardt, D. McAlpine, and B. Grothe. Precise inhibition is essential for microsecond interaural time difference coding. Nature, 417:543–547, 2002.
J. Breebaart, S. van de Par, and A. Kohlrausch. Binaural processing model based on contralateral inhibition. I. Model structure. J. Acoust. Soc. Am., 110:1074–1088, 2001.
A. S. Bregman. Auditory scene analysis: The perceptual organization of sound. MIT Press, 1990.
K. O. Bushara, T. Hanakawa, I. Immisch, K. Toma, K. Kansaku, and M. Hallett. Neural correlates of cross-modal binding. Nat. Neurosci., 6:190–195, 2003.
C. E. Carr and M. Konishi. Axonal delay lines for time measurement in the owl’s brainstem. Proc. Natl. Acad. Sci. U. S. A., 85:8311–8315, 1988.
G. Casella and C. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83:81–94, 1996.
H. Christensen, N. M. N. Ma, S. N. Wrigley, and J. Barker. A speech fragment approach to localising multiple speakers in reverberant environments. In IEEE ICASSP, 2009.
M. Cooke. Glimpsing speech. Journal of Phonetics, 31:579–584, 2003.
H. Cox, R. Zeskind, and M. Owen. Robust adaptive beamforming. IEEE Trans. Acoust., Speech, Signal Process., 35:1365–1376, 1987.
S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Process., 28:357–366, 1980.
M. Dietz, S. D. Ewert, and V. Hohmann. Lateralization of stimuli with independent fine-structure and envelope-based temporal disparities. J. Acoust. Soc. Am., 125:1622–1635, 2009.
M. Dietz, S. D. Ewert, and V. Hohmann. Auditory model based direction estimation of concurrent speakers from binaural signals. Speech Commun., 53:592–605, 2011.
M. Dietz, S. D. Ewert, and V. Hohmann. Lateralization based on interaural differences in the second-order amplitude modulator. J. Acoust. Soc. Am., 131:398–408, 2012.
M. Dietz, S. D. Ewert, V. Hohmann, and B. Kollmeier. Coding of temporally fluctuating interaural timing disparities in a binaural processing model based on phase differences. Brain Res., 1220:234–245, 2008.
M. Dietz, T. Marquardt, D. Greenberg, and D. McAlpine. The influence of the envelope waveform on binaural tuning of neurons in the inferior colliculus and its relation to binaural perception. In B. C. J. Moore, R. Patterson, I. M. Winter, R. P. Carlyon, and H. E. Gockel, editors, Basic Aspects of Hearing: Physiology and Perception, chapter 25. Springer, New York, 2013.
A. Doucet, N. de Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer, 2001.
C. Faller and J. Merimaa. Source localization in complex listening situations: Selection of binaural cues based on interaural coherence. J. Acoust. Soc. Am., 116:3075–3089, 2004.
K. Friston and S. Kiebel. Cortical circuits for perceptual inference. Neural Networks, 22:1093–1104, 2009.
M. J. Goupell and W. M. Hartmann. Interaural fluctuations and the detection of interaural incoherence: Bandwidth effects. J. Acoust. Soc. Am., 119:3971–3986, 2006.
S. Harding, J. P. Barker, and G. J. Brown. Mask estimation for missing data speech recognition based on statistics of binaural interaction. IEEE T. Audio. Speech., 14:58–67, 2006.
J. Hartikainen and S. Särkkä. Optimal filtering with Kalman filters and smoothers: a manual for Matlab toolbox EKF/UKF. Technical report, Department of Biomedical Engineering and Computational Science, Helsinki University of Technology, 2008.
J. Hartikainen and S. Särkkä. RBMCDAbox-Matlab toolbox of Rao-Blackwellized data association particle filters. Technical report, Department of Biomedical Engineering and Computational Science, Helsinki University of Technology, 2008.
H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87:1738–1752, 1990.
V. Hohmann. Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica united with Acustica, 88:433–442, 2002.
L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol., 41:35–39, 1948.
H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier. Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses. EURASIP Journal on Advances in Signal Processing, 2009:298605, 2009.
M. Klein-Hennig, M. Dietz, V. Hohmann, and S. D. Ewert. The influence of different segments of the ongoing envelope on sensitivity to interaural time delays. J. Acoust. Soc. Am., 129:3856–3872, 2011.
M. Kleinschmidt. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica, 88:416–422, 2002.
D. Kolossa, F. Astudillo, A. Abad, S. Zeiler, R. Saeidi, P. Mowlaee, and R. Martin. CHiME challenge: Approaches to robustness using beamforming and uncertainty-of-observation techniques. Int. Workshop on Machine Listening in Multisource Environments, 1:6–11, 2011.
A.-G. Lang and A. Buchner. Relative influence of interaural time and intensity differences on lateralization is modulated by attention to one or the other cue: 500-Hz sine tones. J. Acoust. Soc. Am., 126:2536–2542, 2009.
N. Le Goff, J. Buchholz, and T. Dau. Modeling localization of complex sounds in the impaired and aided impaired auditory system. In J. Blauert, editor, The technology of binaural listening, chapter 5. Springer, Berlin-Heidelberg-New York NY, 2013.
W. Lindemann. Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals. J. Acoust. Soc. Am., 80:1608–1622, 1986.
R. F. Lyon. A computational model of binaural localization and separation. In IEEE ICASSP, volume 8, pages 1148–1151, 1983.
T. May, S. van de Par, and A. Kohlrausch. A probabilistic model for robust localization based on a binaural auditory front-end. IEEE T. Audio. Speech., 19:1–13, 2011.
T. May, S. van de Par, and A. Kohlrausch. A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation. IEEE T. Audio. Speech., 20:1–15, 2012.
T. May, S. van de Par, and A. Kohlrausch. Noise-robust speaker recognition combining missing data techniques and universal background modeling. IEEE T. Audio. Speech., 20:108–121, 2012.
T. May, S. van de Par, and A. Kohlrausch. Binaural localization and detection of speakers in complex acoustic scenes. In J. Blauert, editor, The technology of binaural listening, chapter 15. Springer, Berlin-Heidelberg-New York NY, 2013.
D. McAlpine and B. Grothe. Sound localization and delay lines-do mammals fit the model? Trends Neurosci., 26:347–350, 2003.
D. McAlpine, D. Jiang, and A. R. Palmer. A neural code for low-frequency sound localization in mammals. Nat. Neurosci., 4:396–401, 2001.
J. Nix and V. Hohmann. Sound source localization in real sound fields based on empirical statistics of interaural parameters. J. Acoust. Soc. Am., 119:463–479, 2006.
J. Nix and V. Hohmann. Combined estimation of spectral envelopes and sound source direction of concurrent voices by multidimensional statistical filtering. IEEE T. Audio. Speech., 15:995–1008, 2007.
B. Opitz, A. Mecklinger, A. D. Friederici, and D. Y. Von Cramon. The functional neuroanatomy of novelty processing: integrating ERP and fMRI results. Cereb. Cortex, 9:379–391, 1999.
B. Osnes, K. Hugdahl, and K. Specht. Effective connectivity analysis demonstrates involvement of premotor cortex during speech perception. Neuroimage, 54:2437–2445, 2011.
P. Paavilainen, M. Jaramillo, R. Näätänen, and I. Winkler. Neuronal populations in the human brain extracting invariant relationships from acoustic variance. Neurosci. Lett., 265:179–182, 1999.
K. Palomäki and G. J. Brown. A computational model of binaural speech recognition: Role of across-frequency vs. within-frequency processing and internal noise. Speech Commun., 53:924–940, 2011.
K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43:361–378, 2004.
D. P. Phillips. A perceptual architecture for sound lateralization in man. Hear. Res., 238:124–132, 2008.
V. Pulkki and T. Hirvonen. Functional count-comparison model for binaural decoding. Acta Acustica united with Acustica, 95:883–900, 2009.
L. Rayleigh. On our perception of sound direction. Philos. Mag., 13:214–232, 1907.
H. Riedel and B. Kollmeier. Interaural delay-dependent changes in the binaural difference potential of the human auditory brain stem response. Hear. Res., 218:5–19, 2006.
N. Roman, D. Wang, and G. J. Brown. Speech segregation based on sound localization. J. Acoust. Soc. Am., 114:2236–2252, 2003.
S. Särkkä, A. Vehtari, and J. Lampinen. Rao-Blackwellized particle filter for multiple target tracking. Information Fusion, 8:2–15, 2007.
P. Søndergaard and P. Majdak. The auditory-modeling toolbox. In J. Blauert, editor, The technology of binaural listening, chapter 2. Springer, Berlin-Heidelberg-New York NY, 2013.
S. Spors and H. Wierstorf. Evaluation of perceptual properties of phase-mode beamforming in the context of data-based binaural synthesis. In 5th International Symposium on Communications Control and Signal Processing (ISCCSP), 2012, pages 1–4, 2012.
R. Stern and N. Morgan. Hearing is believing: Biologically-inspired feature extraction for robust automatic speech recognition. IEEE Signal Processing Magazine, 29:34–43, 2012.
R. Stern, A. Zeiberg, and C. Trahiotis. Lateralization of complex binaural stimuli: A weighted-image model. J. Acoust. Soc. Am., 84:156–165, 1988.
R. M. Stern and H. S. Colburn. Theory of binaural interaction based in auditory-nerve data. IV. A model for subjective lateral position. J. Acoust. Soc. Am., 64:127–140, 1978.
S. K. Thompson, K. von Kriegstein, A. Deane-Pratt, T. Marquardt, R. Deichmann, T. D. Griffiths, and D. McAlpine. Representation of interaural time delay in the human auditory midbrain. Nat. Neurosci., 9:1096–1098, 2006.
S. P. Thompson. On binaural audition. Philos. Mag., 4:274–276, 1877.
S. P. Thompson. On the function of the two ears in the perception of space. Philos. Mag., 13:406–416, 1882.
M. van der Heijden and C. Trahiotis. Masking with interaurally delayed stimuli: the use of "internal" delays in binaural detection. J. Acoust. Soc. Am., 105:388–399, 1999.
G. von Békésy. Zur Theorie des Hörens. Über das Richtungshören bei einer Zeitdifferenz oder Lautstärkenungleichheit der beiderseitigen Schalleinwirkungen. Phys. Z., 31:824–835, 1930.
C. Wacongne, J. P. Changeux, and S. Dehaene. A neuronal model of predictive coding accounting for the mismatch negativity. J. Neurosci., 32:3665–3678, 2012.
K. C. Wagener and T. Brand. Sentence intelligibility in noise for listeners with normal hearing and hearing impairment: influence of measurement procedure and masking parameters. Int. J. Audiol., 44:144–156, 2005.
D. Wang and G. J. Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
S. Wilson, A. Saygin, M. Sereno, and M. Iacoboni. Listening to speech activates motor areas involved in speech production. Nat. Neurosci., 7:701–702, 2004.
I. Winkler. Interpreting the Mismatch Negativity. J. Psychophysiol., 21:147–163, 2007.
J. Woodruff and D. Wang. Binaural localization of multiple sources in reverberant and noisy environments. IEEE T. Audio. Speech., 20:1503–1512, 2012.
S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK book. Cambridge University Engineering Department, 3, 2002.
Acknowledgments
Supported by the DFG—SFB/TRR 31 The active auditory system, URL: http://www.uni-oldenburg.de/sfbtr31. The authors would like to thank M. Klein-Hennig for casting the IPD model code in the AMToolbox format, D. Marquardt and G. Coleman for their contributions to the beamforming algorithm, M. R. Schädler for sharing the code of the OLSA recognition system, H. Kayser for support with the HRIR database, and two anonymous reviewers for constructive suggestions.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Spille, C., Meyer, B.T., Dietz, M., Hohmann, V. (2013). Binaural Scene Analysis with Multidimensional Statistical Filters. In: Blauert, J. (eds) The Technology of Binaural Listening. Modern Acoustics and Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37762-4_6
DOI: https://doi.org/10.1007/978-3-642-37762-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37761-7
Online ISBN: 978-3-642-37762-4
eBook Packages: Engineering, Engineering (R0)