Binaural Scene Analysis with Multidimensional Statistical Filters

Spille, C.; Meyer, B. T.; Dietz, M.; Hohmann, V.

doi:10.1007/978-3-642-37762-4_6

Part of the book series: Modern Acoustics and Signal Processing (MASP)

Abstract

The segregation of concurrent speakers and other sound sources is an important step towards improving the performance of audio technology, such as noise reduction and automatic speech recognition (ASR), in difficult acoustic conditions. This technology is relevant for applications like hearing aids, mobile audio devices, robotics, hands-free audio communication and speech-based computer interfaces. Computational auditory-scene analysis (CASA) techniques simulate aspects of the processing properties of the human perceptual system using statistical signal-processing techniques to improve inferences about the causes of the audio input received by the system. This study argues that CASA is a promising approach to achieve source separation and outlines several theoretical arguments to support this hypothesis. With a focus on computational binaural scene analysis, principles of CASA techniques are reviewed. Furthermore, in an experimental approach, the applicability of a recent model of binaural interaction to improve ASR performance in multi-speaker conditions with spatially separated moving speakers is explored. The binaural model provides input to a statistical inference filter that employs a priori information on possible movements of the sources in order to track the positions of the speakers. The tracks are used to adapt a beamformer that selects a specific speaker. The output of the beamformer is subsequently used for an ASR task. Compared to the unprocessed, that is, mixed, data in a two-speaker condition, the word recognition rates obtained with the enhanced signals based on binaural information increased from 30.8 to 88.4%, demonstrating the potential of the proposed CASA-based approach.
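
The tracking stage described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation (which, according to the notes, relies on a Matlab toolbox); it assumes a single source, a random-walk motion prior standing in for the a priori movement information, and Gaussian observation noise on the azimuth estimates. All function names and parameter values (track_azimuth, simulate, rmse, step_std, obs_std) are hypothetical and chosen for illustration only.

```python
import math
import random


def track_azimuth(observations, n_particles=500, step_std=2.0,
                  obs_std=10.0, seed=0):
    """Bootstrap particle filter for one source azimuth in degrees.

    Assumptions (not taken from the chapter): random-walk motion prior,
    Gaussian observation noise, azimuth limited to [-90, 90].
    """
    rng = random.Random(seed)
    # Start with particles spread uniformly over the frontal half-plane.
    particles = [rng.uniform(-90.0, 90.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: move every particle according to the motion prior.
        particles = [min(90.0, max(-90.0, p + rng.gauss(0.0, step_std)))
                     for p in particles]
        # Update: weight each particle by the observation likelihood.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2)
                   for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: posterior mean of the azimuth.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample (multinomial) to concentrate on likely positions.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


def simulate(n_frames=100, obs_std=10.0, seed=1):
    """Synthetic source moving from -30 to +30 degrees with noisy readings."""
    rng = random.Random(seed)
    truth = [-30.0 + 60.0 * t / (n_frames - 1) for t in range(n_frames)]
    noisy = [a + rng.gauss(0.0, obs_std) for a in truth]
    return truth, noisy


def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    truth, noisy = simulate()
    tracked = track_azimuth(noisy)
    print("RMSE of raw observations:", round(rmse(noisy, truth), 2))
    print("RMSE of tracked azimuth: ", round(rmse(tracked, truth), 2))
```

In a complete system the tracked azimuth would steer the beamformer towards the selected speaker; extending the sketch to several simultaneous speakers would additionally require a data-association step, which is omitted here.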

Notes

1. Note that this approach can be extended to more inputs, for example, multiple microphones or audiovisual input, or might be restricted to a single input. The current study covers its application to binaural input signals like recordings from a dummy head.
2. A demo folder containing the file exp_spille2013 used to run the IPD model and to generate Fig. 6 is available in the AMToolbox [56].
3. The algorithm is part of a Matlab toolbox provided by [25].

References

M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process., 50:174–188, 2002.
R. Beutelmann and T. Brand. Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am., 120:331–342, 2006.
J. Bitzer and K. U. Simmer. Superdirective microphone arrays. In M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 2. Springer, 2001.
A. Brand, O. Behrend, T. Marquardt, D. McAlpine, and B. Grothe. Precise inhibition is essential for microsecond interaural time difference coding. Nature, 417:543–547, 2002.
J. Breebaart, S. van de Par, and A. Kohlrausch. Binaural processing model based on contralateral inhibition. I. Model structure. J. Acoust. Soc. Am., 110:1074–1088, 2001.
A. S. Bregman. Auditory scene analysis: The perceptual organization of sound. MIT Press, 1990.
K. O. Bushara, T. Hanakawa, I. Immisch, K. Toma, K. Kansaku, and M. Hallett. Neural correlates of cross-modal binding. Nat. Neurosci., 6:190–195, 2003.
C. E. Carr and M. Konishi. Axonal delay lines for time measurement in the owl’s brainstem. Proc. Natl. Acad. Sci. U. S. A., 85:8311–8315, 1988.
G. Casella and C. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83:81–94, 1996.
H. Christensen, N. M. N. Ma, S. N. Wrigley, and J. Barker. A speech fragment approach to localising multiple speakers in reverberant environments. In IEEE ICASSP, 2009.
M. Cooke. Glimpsing speech. Journal of Phonetics, 31:579–584, 2003.
H. Cox, R. Zeskind, and M. Owen. Robust adaptive beamforming. IEEE Trans. Acoust., Speech, Signal Process., 35:1365–1376, 1987.
S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Process., 28:357–366, 1980.
M. Dietz, S. D. Ewert, and V. Hohmann. Lateralization of stimuli with independent fine-structure and envelope-based temporal disparities. J. Acoust. Soc. Am., 125:1622–1635, 2009.
M. Dietz, S. D. Ewert, and V. Hohmann. Auditory model based direction estimation of concurrent speakers from binaural signals. Speech Commun., 53:592–605, 2011.
M. Dietz, S. D. Ewert, and V. Hohmann. Lateralization based on interaural differences in the second-order amplitude modulator. J. Acoust. Soc. Am., 131:398–408, 2012.
M. Dietz, S. D. Ewert, V. Hohmann, and B. Kollmeier. Coding of temporally fluctuating interaural timing disparities in a binaural processing model based on phase differences. Brain Res., 1220:234–245, 2008.
M. Dietz, T. Marquardt, D. Greenberg, and D. McAlpine. The influence of the envelope waveform on binaural tuning of neurons in the inferior colliculus and its relation to binaural perception. In B. C. J. Moore, R. Patterson, I. M. Winter, R. P. Carlyon, and H. E. Gockel, editors, Basic Aspects of Hearing: Physiology and Perception, chapter 25. Springer, New York, 2013.
A. Doucet, N. de Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer, 2001.
C. Faller and J. Merimaa. Source localization in complex listening situations: Selection of binaural cues based on interaural coherence. J. Acoust. Soc. Am., 116:3075–3089, 2004.
K. Friston and S. Kiebel. Cortical circuits for perceptual inference. Neural Networks, 22:1093–1104, 2009.
M. J. Goupell and W. M. Hartmann. Interaural fluctuations and the detection of interaural incoherence: Bandwidth effects. J. Acoust. Soc. Am., 119:3971–3986, 2006.
S. Harding, J. P. Barker, and G. J. Brown. Mask estimation for missing data speech recognition based on statistics of binaural interaction. IEEE T. Audio. Speech., 14:58–67, 2006.
J. Hartikainen and S. Särkkä. Optimal filtering with Kalman filters and smoothers - a Manual for Matlab toolbox EKF/UKF. Technical report, Department of Biomedical Engineering and Computational Science, Helsinki University of Technology, 2008.
J. Hartikainen and S. Särkkä. RBMCDAbox - Matlab toolbox of Rao-Blackwellized data association particle filters. Technical report, Department of Biomedical Engineering and Computational Science, Helsinki University of Technology, 2008.
H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87:1738–1752, 1990.
V. Hohmann. Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica united with Acustica, 88:433–442, 2002.
L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol., 41:35–39, 1948.
H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier. Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses. EURASIP Journal on Advances in Signal Processing, 2009:298605, 2009.
M. Klein-Hennig, M. Dietz, V. Hohmann, and S. D. Ewert. The influence of different segments of the ongoing envelope on sensitivity to interaural time delays. J. Acoust. Soc. Am., 129:3856–3872, 2011.
M. Kleinschmidt. Methods for capturing spectro-temporal modulations in automatic speech recognition. Acta Acustica united with Acustica, 88:416–422, 2002.
D. Kolossa, F. Astudillo, A. Abad, S. Zeiler, R. Saeidi, P. Mowlaee, and R. Martin. CHiME challenge: Approaches to robustness using beamforming and uncertainty-of-observation techniques. Int. Workshop on Machine Listening in Multisource Environments, 1:6–11, 2011.
A.-G. Lang and A. Buchner. Relative influence of interaural time and intensity differences on lateralization is modulated by attention to one or the other cue: 500-Hz sine tones. J. Acoust. Soc. Am., 126:2536–2542, 2009.
N. Le Goff, J. Buchholz, and T. Dau. Modeling localization of complex sounds in the impaired and aided impaired auditory system. In J. Blauert, editor, The technology of binaural listening, chapter 5. Springer, Berlin-Heidelberg-New York NY, 2013.
W. Lindemann. Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals. J. Acoust. Soc. Am., 80:1608–1622, 1986.
R. F. Lyon. A computational model of binaural localization and separation. In IEEE ICASSP, volume 8, pages 1148–1151, 1983.
T. May, S. van de Par, and A. Kohlrausch. A probabilistic model for robust localization based on a binaural auditory front-end. IEEE T. Audio. Speech., 19:1–13, 2011.
T. May, S. van de Par, and A. Kohlrausch. A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation. IEEE T. Audio. Speech., 20:1–15, 2012.
T. May, S. van de Par, and A. Kohlrausch. Noise-robust speaker recognition combining missing data techniques and universal background modeling. IEEE T. Audio. Speech., 20:108–121, 2012.
T. May, S. van de Par, and A. Kohlrausch. Binaural localization and detection of speakers in complex acoustic scenes. In J. Blauert, editor, The technology of binaural listening, chapter 15. Springer, Berlin-Heidelberg-New York NY, 2013.
D. McAlpine and B. Grothe. Sound localization and delay lines - do mammals fit the model? Trends Neurosci., 26:347–350, 2003.
D. McAlpine, D. Jiang, and A. R. Palmer. A neural code for low-frequency sound localization in mammals. Nat. Neurosci., 4:396–401, 2001.
J. Nix and V. Hohmann. Sound source localization in real sound fields based on empirical statistics of interaural parameters. J. Acoust. Soc. Am., 119:463–479, 2006.
J. Nix and V. Hohmann. Combined estimation of spectral envelopes and sound source direction of concurrent voices by multidimensional statistical filtering. IEEE T. Audio. Speech., 15:995–1008, 2007.
B. Opitz, A. Mecklinger, A. D. Friederici, and D. Y. Von Cramon. The functional neuroanatomy of novelty processing: integrating ERP and fMRI results. Cereb. Cortex, 9:379–391, 1999.
B. Osnes, K. Hugdahl, and K. Specht. Effective connectivity analysis demonstrates involvement of premotor cortex during speech perception. Neuroimage, 54:2437–2445, 2011.
P. Paavilainen, M. Jaramillo, R. Näätänen, and I. Winkler. Neuronal populations in the human brain extracting invariant relationships from acoustic variance. Neurosci. Lett., 265:179–182, 1999.
K. Palomäki and G. J. Brown. A computational model of binaural speech recognition: Role of across-frequency vs. within-frequency processing and internal noise. Speech Commun., 53:924–940, 2011.
K. J. Palomäki, G. J. Brown, and D. Wang. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Commun., 43:361–378, 2004.
D. P. Phillips. A perceptual architecture for sound lateralization in man. Hear. Res., 238:124–132, 2008.
V. Pulkki and T. Hirvonen. Functional count-comparison model for binaural decoding. Acta Acustica united with Acustica, 95:883–900, 2009.
L. Rayleigh. On our perception of sound direction. Philos. Mag., 13:214–232, 1907.
H. Riedel and B. Kollmeier. Interaural delay-dependent changes in the binaural difference potential of the human auditory brain stem response. Hear. Res., 218:5–19, 2006.
N. Roman, D. Wang, and G. J. Brown. Speech segregation based on sound localization. J. Acoust. Soc. Am., 114:2236–2252, 2003.
S. Särkkä, A. Vehtari, and J. Lampinen. Rao-Blackwellized particle filter for multiple target tracking. Information Fusion, 8:2–15, 2007.
P. Søndergaard and P. Majdak. The auditory-modeling toolbox. In J. Blauert, editor, The technology of binaural listening, chapter 2. Springer, Berlin-Heidelberg-New York NY, 2013.
S. Spors and H. Wierstorf. Evaluation of perceptual properties of phase-mode beamforming in the context of data-based binaural synthesis. In 5th International Symposium on Communications Control and Signal Processing (ISCCSP), 2012, pages 1–4, 2012.
R. Stern and N. Morgan. Hearing is believing: Biologically-inspired feature extraction for robust automatic speech recognition. IEEE Signal Processing Magazine, 29:34–43, 2012.
R. Stern, A. Zeiberg, and C. Trahiotis. Lateralization of complex binaural stimuli: A weighted-image model. J. Acoust. Soc. Am., 84:156–165, 1988.
R. M. Stern and H. S. Colburn. Theory of binaural interaction based in auditory-nerve data. IV. A model for subjective lateral position. J. Acoust. Soc. Am., 64:127–140, 1978.
S. K. Thompson, K. von Kriegstein, A. Deane-Pratt, T. Marquardt, R. Deichmann, T. D. Griffiths, and D. McAlpine. Representation of interaural time delay in the human auditory midbrain. Nat. Neurosci., 9:1096–1098, 2006.
S. P. Thompson. On binaural audition. Philos. Mag., 4:274–276, 1877.
S. P. Thompson. On the function of the two ears in the perception of space. Philos. Mag., 13:406–416, 1882.
M. van der Heijden and C. Trahiotis. Masking with interaurally delayed stimuli: the use of "internal" delays in binaural detection. J. Acoust. Soc. Am., 105:388–399, 1999.
G. von Békésy. Zur Theorie des Hörens. Über das Richtungshören bei einer Zeitdifferenz oder Lautstärkenungleichheit der beiderseitigen Schalleinwirkungen. Phys. Z., 31:824–835, 1930.
C. Wacongne, J. P. Changeux, and S. Dehaene. A neuronal model of predictive coding accounting for the mismatch negativity. J. Neurosci., 32:3665–3678, 2012.
K. C. Wagener and T. Brand. Sentence intelligibility in noise for listeners with normal hearing and hearing impairment: influence of measurement procedure and masking parameters. Int. J. Audiol., 44:144–156, 2005.
D. Wang and G. J. Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
S. Wilson, A. Saygin, M. Sereno, and M. Iacoboni. Listening to speech activates motor areas involved in speech production. Nat. Neurosci., 7:701–702, 2004.
I. Winkler. Interpreting the Mismatch Negativity. J. Psychophysiol., 21:147–163, 2007.
J. Woodruff and D. Wang. Binaural localization of multiple sources in reverberant and noisy environments. IEEE T. Audio. Speech., 20:1503–1512, 2012.
S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK book. Cambridge University Engineering Department, 3, 2002.

Acknowledgments

Supported by the DFG—SFB/TRR 31 The active auditory system, URL: http://www.uni-oldenburg.de/sfbtr31. The authors would like to thank M. Klein-Hennig for casting the IPD model code in the AMToolbox format, D. Marquardt and G. Coleman for their contributions to the beamforming algorithm, M. R. Schädler for sharing the code of the OLSA recognition system, H. Kayser for support with the HRIR database, and two anonymous reviewers for constructive suggestions.

Author information

Authors and Affiliations

Department of Medical Physics and Acoustics, University of Oldenburg, 26111, Oldenburg, Germany
C. Spille, B. T. Meyer, M. Dietz & V. Hohmann

Corresponding author

Correspondence to V. Hohmann.

Editor information

Editors and Affiliations

Fak. Elektrotechnik, LS Allgm.Elektrotechn.+Akustik, Univ. Bochum, Bochum, Germany
Jens Blauert

About this chapter

Cite this chapter

Spille, C., Meyer, B.T., Dietz, M., Hohmann, V. (2013). Binaural Scene Analysis with Multidimensional Statistical Filters. In: Blauert, J. (eds) The Technology of Binaural Listening. Modern Acoustics and Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37762-4_6

DOI: https://doi.org/10.1007/978-3-642-37762-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37761-7
Online ISBN: 978-3-642-37762-4
eBook Packages: Engineering, Engineering (R0)
