
Multi-level distance embedding learning for robust acoustic scene classification with unseen devices

  • Industrial and Commercial Application
  • Published in Pattern Analysis and Applications

Abstract

Acoustic scene classification (ASC) aims to identify the scene in which a piece of audio was recorded. In practice, ASC systems must handle audio from a variety of recording devices, including devices that never appeared during training. Audio recorded by different devices, especially unseen ones, differs in sampling rate, amplitude, data distribution, and so on; these differences can severely interfere with the feature learning of CNNs and degrade the performance of an ASC model. To learn high-level features that are less susceptible to device differences from handcrafted features that still carry device information, we propose an ASC method based on a multi-level distance embedding space, called multi-level distance embedding learning (MDEL). Acoustic scene categories form a hierarchy, from the three coarse-grained categories of indoor, outdoor, and transportation down to more fine-grained categories, and this hierarchy corresponds to a similarity relation between categories of different granularity. MDEL exploits this hierarchical similarity between acoustic scene classes to construct an embedding space with multi-level distances. During learning, the model is guided to focus on features common to the same scene class and to learn high-level features that are more robust to the recording device, thereby improving robustness to data from unseen devices. Our method was evaluated on the DCASE2020 Challenge Task 1a dataset, where it improved overall classification accuracy by 1.2%; on audio from unseen devices, classification accuracy improved by 2.3%.
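To make the idea of a multi-level distance embedding space concrete, the sketch below shows one plausible hinge-style loss over pairwise embedding distances: same-scene pairs are pulled within a small margin, pairs that share only a coarse group (indoor, outdoor, transportation) are held at an intermediate distance, and cross-group pairs are pushed farther apart. This is a minimal illustration assuming a PyTorch setup; the function names, the `_safe_mean` helper, and the margins `m1`/`m2` are our assumptions for illustration, not the authors' MDEL implementation (the coarse-to-fine mapping follows the standard grouping of the 10 DCASE2020 Task 1a scene classes).

```python
import torch
import torch.nn.functional as F

# Fine scene class -> coarse group (0: indoor, 1: outdoor, 2: transportation),
# following the usual grouping of the 10 DCASE2020 Task 1a scene classes.
COARSE_OF = {
    "airport": 0, "shopping_mall": 0, "metro_station": 0,
    "park": 1, "public_square": 1, "street_pedestrian": 1, "street_traffic": 1,
    "bus": 2, "metro": 2, "tram": 2,
}

def _safe_mean(t: torch.Tensor) -> torch.Tensor:
    # Mean that returns 0 instead of NaN when a mask selects no pairs.
    return t.mean() if t.numel() > 0 else t.new_zeros(())

def multilevel_distance_loss(emb, fine, coarse, m1=0.3, m2=0.8):
    """Hinge loss imposing a two-level ordering on pairwise distances:
        same fine scene         -> distance below m1
        same coarse group only  -> distance between m1 and m2
        different coarse groups -> distance beyond m2
    emb: (B, D) L2-normalised embeddings; fine, coarse: (B,) integer labels.
    The margins m1 < m2 are illustrative hyper-parameters, not paper values."""
    d = torch.cdist(emb, emb)                        # (B, B) Euclidean distances
    same_fine = fine[:, None] == fine[None, :]
    same_coarse = coarse[:, None] == coarse[None, :]
    off_diag = ~torch.eye(len(fine), dtype=torch.bool, device=emb.device)

    pos = same_fine & off_diag                       # pairs from the same scene
    mid = same_coarse & ~same_fine                   # same coarse group, different scene
    neg = ~same_coarse                               # different coarse groups

    return (
        _safe_mean(F.relu(d[pos] - m1))                          # pull within m1
        + _safe_mean(F.relu(m1 - d[mid]) + F.relu(d[mid] - m2))  # hold in [m1, m2]
        + _safe_mean(F.relu(m2 - d[neg]))                        # push beyond m2
    )
```

In training, such a term would typically be added to the cross-entropy loss on the fine-grained labels with a weighting hyper-parameter, so that the classifier and the multi-level embedding structure are learned jointly.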


Data availability

The dataset that supports the findings of this study is available in Zenodo under the identifier https://doi.org/10.5281/zenodo.3819968.


Acknowledgements

This work is supported in part by the Key Projects of the National Natural Science Foundation of China under Grant U1836220, the National Natural Science Foundation of China under Grants 62176106 and 62006098, the Jiangsu Province Key Research and Development Plan under Grant BE2020036, and the China Postdoctoral Science Foundation under Grant 2020M681515.

Author information


Corresponding author

Correspondence to Qirong Mao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jiang, G., Ma, Z., Mao, Q. et al. Multi-level distance embedding learning for robust acoustic scene classification with unseen devices. Pattern Anal Applic 26, 1089–1099 (2023). https://doi.org/10.1007/s10044-023-01172-w

