
Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10891)

Abstract

A main challenge in applying deep learning to music processing is the availability of training data. One potential solution is Multi-task Learning, in which the model also learns to solve related auxiliary tasks on additional datasets to exploit their correlation. While intuitive in principle, it can be challenging to identify related tasks and construct the model to optimally share information between tasks. In this paper, we explore vocal activity detection as an additional task to stabilise and improve the performance of vocal separation. Further, we identify problematic biases specific to each dataset that could limit the generalisation capability of separation and detection models, to which our proposed approach is robust. Experiments show improved performance in separation as well as vocal detection compared to single-task baselines. However, we find that the commonly used Signal-to-Distortion Ratio (SDR) metrics did not capture the improvement on non-vocal sections, indicating the need for improved evaluation methodologies.
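The multi-task setup described here, in which one model is trained jointly on vocal separation and vocal activity detection, is typically realised as a weighted sum of per-task losses. The sketch below is illustrative only: the function names, the L1/cross-entropy loss choices, and the task weight `alpha` are assumptions for exposition, not details taken from the paper.

```python
import numpy as np

def separation_loss(est_mag, true_mag):
    """L1 distance between estimated and reference vocal magnitude spectrograms."""
    return float(np.mean(np.abs(est_mag - true_mag)))

def detection_loss(pred_prob, labels, eps=1e-7):
    """Per-frame binary cross-entropy for vocal activity (1 = voice present)."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

def joint_loss(est_mag, true_mag, pred_prob, labels, alpha=0.5):
    """Weighted sum of the two task losses; alpha balances detection against separation."""
    return separation_loss(est_mag, true_mag) + alpha * detection_loss(pred_prob, labels)
```

In practice `alpha` would be tuned on a validation set; sharing encoder layers between the two heads is what lets the auxiliary detection data regularise the separator.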

S. Ewert—Work was conducted at Queen Mary University of London.

D. Stoller—This work was funded by EPSRC grant EP/L01632X/1.


Notes

  1. See ancillary files at https://arxiv.org/abs/1804.01650.

  2. https://github.com/faroit/dsd100mat.



Acknowledgements

We thank Emmanouil Benetos for his useful comments and feedback, and Mi Tian for pointers to related literature.

Author information

Correspondence to Daniel Stoller or Simon Dixon.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Stoller, D., Ewert, S., Dixon, S. (2018). Jointly Detecting and Separating Singing Voice: A Multi-Task Approach. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds.) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol. 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_31


  • DOI: https://doi.org/10.1007/978-3-319-93764-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93763-2

  • Online ISBN: 978-3-319-93764-9

  • eBook Packages: Computer Science, Computer Science (R0)
