Audio-Visual Source Separation with Alternating Diffusion Maps

Audio Source Separation

Part of the book series: Signals and Communication Technology (SCT)

Abstract

In this chapter we consider the separation of multiple sound sources of different types, including multiple speakers and transients, which are measured by a single microphone and by a video camera. We address the problem of separating a particular sound source from all other sources, focusing specifically on obtaining an underlying representation of that source while attenuating the others. When the video camera is pointed only at the desired sound source, the problem becomes equivalent to extracting the source that is common to the audio and video modalities while ignoring the other sources. We use a kernel-based method, designed specifically for this task, which provides an underlying representation of the common source. We demonstrate the usefulness of the obtained representation for activity detection of the common source and discuss how it may be further used for source separation.
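The abstract does not spell out the kernel construction, so the following is only a minimal sketch of the general alternating-diffusion idea, not the chapter's implementation. It assumes time-aligned per-frame feature matrices for the two modalities, Gaussian affinities with a median-distance bandwidth, and an arbitrary order for the kernel product; the function names and the synthetic data are hypothetical and chosen for illustration only.

```python
import numpy as np


def row_stochastic_kernel(features):
    """Gaussian affinity between frames, normalized so each row sums to 1."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    # Assumption: median squared distance as the kernel bandwidth (a heuristic).
    bandwidth = max(np.median(sq_dists), 1e-12)
    affinity = np.exp(-sq_dists / bandwidth)
    return affinity / affinity.sum(axis=1, keepdims=True)


def alternating_diffusion_kernel(audio_features, video_features):
    """Product of the per-modality diffusion kernels (one step in each view)."""
    return row_stochastic_kernel(audio_features) @ row_stochastic_kernel(video_features)


def embed(kernel, dim=2):
    """Embed frames with the leading non-trivial eigenvectors of the kernel."""
    eigvals, eigvecs = np.linalg.eig(kernel)
    order = np.argsort(-np.abs(eigvals))
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Skip the first (trivial, constant) eigenvector of the row-stochastic matrix.
    return np.real(eigvecs[:, 1:dim + 1] * eigvals[1:dim + 1])


# Synthetic illustration: both views share a latent angle; each has its own nuisance.
rng = np.random.default_rng(0)
n_frames = 200
theta = rng.uniform(0.0, 2.0 * np.pi, n_frames)
audio = np.column_stack([np.cos(theta), np.sin(theta), rng.normal(size=n_frames)])
video = np.column_stack([np.cos(theta), np.sin(theta), rng.normal(size=n_frames)])

K = alternating_diffusion_kernel(audio, video)
emb = embed(K, dim=2)
```

In this sketch the modality-specific nuisance columns play the role of interfering sources seen by only one sensor, while the shared angle plays the role of the common source; the intent of multiplying the two kernels is that structure present in only one view is attenuated.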

This research was supported by the Israel Science Foundation (grant no. 576/16).

Author information

Corresponding author

Correspondence to David Dov.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Dov, D., Talmon, R., Cohen, I. (2018). Audio-Visual Source Separation with Alternating Diffusion Maps. In: Makino, S. (eds) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_14

  • DOI: https://doi.org/10.1007/978-3-319-73031-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73030-1

  • Online ISBN: 978-3-319-73031-8

  • eBook Packages: Engineering, Engineering (R0)
