
Multi-representation knowledge distillation for audio classification

  • 1193: Intelligent Processing of Multimedia Signals
  • Published in Multimedia Tools and Applications

Abstract

Audio classification aims to discriminate between different types of audio signals and has received intensive attention owing to its wide range of applications. In deep learning-based audio classification methods, the raw audio signal is usually transformed into a feature representation, such as the Short-Time Fourier Transform (STFT) spectrogram or Mel-Frequency Cepstral Coefficients (MFCCs), which serves as the network input. However, selecting a feature representation requires expert knowledge and extensive experimental validation. Moreover, relying on a single type of representation may yield suboptimal results, since the information carried by different representations can be complementary. Previous work shows that ensembling networks trained on different representations can greatly boost classification performance, but running inference with multiple networks is cumbersome and computationally expensive. In this paper, we propose a novel end-to-end collaborative training framework for audio classification. The framework takes multiple representations as inputs and trains the networks jointly with a knowledge distillation method. Consequently, our framework significantly improves the performance of the networks without increasing the computational overhead at inference time. Extensive experimental results demonstrate that the proposed approach improves classification performance and achieves competitive results on both acoustic scene classification and general audio tagging tasks.
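For intuition, the following is a minimal PyTorch sketch of the kind of collaborative training the abstract describes: two networks, each consuming a different representation of the same audio batch, are trained jointly with a cross-entropy loss plus a mutual distillation term on their softened predictions. The toy architectures, the symmetric-KL form of the distillation loss, the temperature T, and the weight alpha are illustrative assumptions for exposition, not the paper's actual configuration; the random tensors stand in for real STFT and MFCC features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallCNN(nn.Module):
        """Toy classifier for a single 2-D time-frequency representation."""
        def __init__(self, n_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # pool to 1x1 regardless of input size
            )
            self.classifier = nn.Linear(16, n_classes)

        def forward(self, x):              # x: (batch, 1, freq, time)
            h = self.features(x).flatten(1)
            return self.classifier(h)      # raw logits

    def mutual_distillation_loss(logits_a, logits_b, labels, T=2.0, alpha=0.5):
        """Cross-entropy on hard labels plus a symmetric KL term between the
        two networks' softened predictions (deep-mutual-learning style).
        T and alpha are illustrative hyperparameters, not the paper's values."""
        ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
        # Detach each target so every KL term only updates the "student" side.
        kl_ab = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                         F.softmax(logits_b / T, dim=1).detach(),
                         reduction="batchmean") * T * T
        kl_ba = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                         F.softmax(logits_a / T, dim=1).detach(),
                         reduction="batchmean") * T * T
        return ce + alpha * (kl_ab + kl_ba)

    # One joint training step on two representations of the same audio clips.
    net_stft, net_mfcc = SmallCNN(10), SmallCNN(10)
    opt = torch.optim.Adam(list(net_stft.parameters()) + list(net_mfcc.parameters()))
    stft_batch = torch.randn(8, 1, 128, 64)   # stand-in for real STFT features
    mfcc_batch = torch.randn(8, 1, 40, 64)    # stand-in for real MFCC features
    labels = torch.randint(0, 10, (8,))
    loss = mutual_distillation_loss(net_stft(stft_batch), net_mfcc(mfcc_batch), labels)
    opt.zero_grad(); loss.backward(); opt.step()

At test time either network can be deployed on its own, so the inference cost matches that of a single-representation model; this is what distinguishes the collaborative distillation setup from deploying an ensemble.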




Acknowledgements

This work was supported by the Science and Technology Innovation 2030 “New Generation Artificial Intelligence” major project (grant no. 2020AAA104803).


Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Gao, L., Xu, K., Wang, H. et al. Multi-representation knowledge distillation for audio classification. Multimed Tools Appl 81, 5089–5112 (2022). https://doi.org/10.1007/s11042-021-11610-8

