Abstract
In this chapter we consider the separation of multiple sound sources of different types, including multiple speakers and transients, which are measured by a single microphone and a video camera. We address the problem of separating a particular sound source from all the others, focusing specifically on obtaining an underlying representation of it while attenuating the other sources. When the video camera is pointed only at the desired sound source, the problem becomes equivalent to extracting the source common to the audio and video modalities while ignoring the other sources. We use a kernel-based method designed specifically for this task, which provides an underlying representation of the common source. We demonstrate the usefulness of the obtained representation for activity detection of the common source and discuss how it may be further used for source separation.
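To make the idea of extracting the common source more concrete, the following is a minimal sketch, assuming the kernel-based method resembles the alternating-diffusion construction suggested by the chapter title. A row-stochastic Gaussian kernel is built for each modality, the two kernels are multiplied, and the leading non-trivial eigenvectors of the product serve as a representation of the common source. The function names, the median bandwidth heuristic, the feature matrices, and the use of the real part of the eigenvectors are illustrative assumptions and are not taken from the chapter.

```python
import numpy as np

def row_stochastic_kernel(X, eps=None):
    """Gaussian affinity between the rows of X, normalized so each row sums to 1."""
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    if eps is None:
        # Assumption: median heuristic for the kernel bandwidth.
        eps = np.median(d2) + 1e-12
    W = np.exp(-d2 / eps)
    return W / W.sum(axis=1, keepdims=True)

def alternating_diffusion_embedding(audio_feats, video_feats, n_dims=2):
    """Hypothetical helper: embed frames using the product of the two kernels.

    audio_feats and video_feats hold one row per time frame, and the
    frames are assumed to be aligned across the two modalities.
    """
    K_a = row_stochastic_kernel(audio_feats)
    K_v = row_stochastic_kernel(video_feats)
    # One diffusion step in the audio view followed by one in the video view.
    K = K_v @ K_a
    vals, vecs = np.linalg.eig(K)
    order = np.argsort(-np.abs(vals))
    # Skip the trivial constant eigenvector (eigenvalue 1).
    idx = order[1:n_dims + 1]
    # K is not symmetric, so eigenpairs may be complex; keeping the real
    # part is a simplification made for this sketch.
    embedding = np.real(vecs[:, idx] * vals[idx])
    return embedding, K
```

In this sketch, variability that appears in only one modality tends to be averaged out by the diffusion step in the other modality, which is the intuition behind attenuating the sources not seen by the camera. The resulting embedding coordinates could then be thresholded or passed to a classifier for activity detection of the common source.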
This research was supported by the Israel Science Foundation (grant no. 576/16).
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Dov, D., Talmon, R., Cohen, I. (2018). Audio-Visual Source Separation with Alternating Diffusion Maps. In: Makino, S. (ed.) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_14
DOI: https://doi.org/10.1007/978-3-319-73031-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73030-1
Online ISBN: 978-3-319-73031-8
eBook Packages: Engineering, Engineering (R0)