
Multi-representation knowledge distillation for audio classification

  • 1193: Intelligent Processing of Multimedia Signals
  • Published in Multimedia Tools and Applications

Abstract

Audio classification aims to discriminate between different types of audio signals and has received intensive attention owing to its wide range of applications. In deep learning-based audio classification methods, the raw audio signal is usually transformed into a feature representation, such as the Short-Time Fourier Transform (STFT) spectrogram or Mel-Frequency Cepstral Coefficients (MFCCs), which serves as the network input. However, selecting a feature representation requires expert knowledge and extensive experimental validation. Moreover, relying on a single type of representation may yield suboptimal results, since the information carried by different representations can be complementary. Previous work shows that ensembling networks trained on different representations can greatly boost classification performance, but running inference with multiple networks is cumbersome and computationally expensive. In this paper, we propose a novel end-to-end collaborative training framework for audio classification. The framework takes multiple representations as inputs and trains the networks jointly with a knowledge distillation method. Consequently, our framework significantly improves the performance of the networks without increasing the computational overhead at inference time. Extensive experimental results demonstrate that the proposed approach improves classification performance and achieves competitive results on both acoustic scene classification and general audio tagging tasks.
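For intuition, the following is a minimal PyTorch sketch of the kind of collaborative training the abstract describes: two networks, each consuming a different representation of the same audio batch, are trained jointly with a cross-entropy loss plus a mutual distillation term on their softened predictions. The toy architectures, the symmetric-KL form of the distillation loss, the temperature T, and the weight alpha are illustrative assumptions for exposition, not the paper's actual configuration; the random tensors stand in for real STFT and MFCC features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallCNN(nn.Module):
        """Toy classifier for a single 2-D time-frequency representation."""
        def __init__(self, n_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # pool to 1x1 regardless of input size
            )
            self.classifier = nn.Linear(16, n_classes)

        def forward(self, x):              # x: (batch, 1, freq, time)
            h = self.features(x).flatten(1)
            return self.classifier(h)      # raw logits

    def mutual_distillation_loss(logits_a, logits_b, labels, T=2.0, alpha=0.5):
        """Cross-entropy on hard labels plus a symmetric KL term between the
        two networks' softened predictions (deep-mutual-learning style).
        T and alpha are illustrative hyperparameters, not the paper's values."""
        ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
        # Detach each target so every KL term only updates the "student" side.
        kl_ab = F.kl_div(F.log_softmax(logits_a / T, dim=1),
                         F.softmax(logits_b / T, dim=1).detach(),
                         reduction="batchmean") * T * T
        kl_ba = F.kl_div(F.log_softmax(logits_b / T, dim=1),
                         F.softmax(logits_a / T, dim=1).detach(),
                         reduction="batchmean") * T * T
        return ce + alpha * (kl_ab + kl_ba)

    # One joint training step on two representations of the same audio clips.
    net_stft, net_mfcc = SmallCNN(10), SmallCNN(10)
    opt = torch.optim.Adam(list(net_stft.parameters()) + list(net_mfcc.parameters()))
    stft_batch = torch.randn(8, 1, 128, 64)   # stand-in for real STFT features
    mfcc_batch = torch.randn(8, 1, 40, 64)    # stand-in for real MFCC features
    labels = torch.randint(0, 10, (8,))
    loss = mutual_distillation_loss(net_stft(stft_batch), net_mfcc(mfcc_batch), labels)
    opt.zero_grad(); loss.backward(); opt.step()

At test time either network can be deployed on its own, so the inference cost matches that of a single-representation model; this is what distinguishes the collaborative distillation setup from deploying an ensemble.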




Acknowledgements

This work was supported by the Science and Technology Innovation 2030 “New Generation Artificial Intelligence” major project (grant no. 2020AAA104803).


Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Gao, L., Xu, K., Wang, H. et al. Multi-representation knowledge distillation for audio classification. Multimed Tools Appl 81, 5089–5112 (2022). https://doi.org/10.1007/s11042-021-11610-8

