ASF-Conformer: Audio Scoring Conformer with FFC for Speaker Verification in Noisy Environments

Zhang, Xiran; Liu, Haiyan; Liu, Caixia; Zhang, Haiyang; Huo, Zhiwei

doi:10.1007/978-3-031-53311-2_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Background noise significantly degrades speech intelligibility, reducing the accuracy and reliability of speaker verification systems. Most existing noise reduction algorithms target specific types of noise, which limits their effectiveness at eliminating background noise in general. Extracting robust features and developing noise-resistant models that adapt to varied noisy environments therefore remain crucial challenges in speaker verification. In this paper, we propose the Audio Scoring Conformer with Fast Fourier Convolution (ASF-Conformer), a Conformer-based speaker verification model. First, an audio scoring module evaluates and weights the audio features, aiming to select more robust features in noisy environments. Second, Fast Fourier Convolution replaces the Conformer's convolution module, improving the model's ability to capture global features while reducing the number of model parameters. Finally, we compare the model with current mainstream models on the public VoxCeleb1 dataset and the synthesized noisy Mu-VoxCeleb1 dataset. Compared with an ECAPA-TDNN model of essentially the same parameter size, ASF-Conformer improves the equal error rate (EER) by 2% on VoxCeleb1 and by 18% on Mu-VoxCeleb1. These results highlight the effectiveness of the proposed model in improving speaker verification accuracy, especially in noisy environments.
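The abstract reports its results as EER, the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how that metric is commonly computed from trial scores. It is not the authors' evaluation code. The function names, the threshold sweep, and all score values are illustrative assumptions, and `relative_reduction` assumes the reported percentages are relative EER reductions, which the abstract does not state.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for same-speaker and different-speaker trial scores.

    Higher scores are assumed to mean "more likely the same speaker".
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    # Every observed score is tried as a candidate decision threshold.
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


def relative_reduction(baseline_eer, proposed_eer):
    """Relative EER reduction of a proposed system over a baseline (hypothetical helper)."""
    return (baseline_eer - proposed_eer) / baseline_eer


# Toy scores, made up for illustration only (not from the paper).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(same_speaker, different_speaker)
print(eer, threshold)  # prints: 0.25 0.6
```

With only a handful of trials the two error curves may not cross exactly, so the sketch returns the midpoint of the two rates at the closest threshold. Evaluation toolkits often interpolate between thresholds instead.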

Author information

Authors and Affiliations

School of Computer Science, Inner Mongolia University, Hohhot, 010021, China
Xiran Zhang, Haiyan Liu, Caixia Liu, Haiyang Zhang & Zhiwei Huo

Corresponding author

Correspondence to Caixia Liu.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Zhang, X., Liu, H., Liu, C., Zhang, H., Huo, Z. (2024). ASF-Conformer: Audio Scoring Conformer with FFC for Speaker Verification in Noisy Environments. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_8

DOI: https://doi.org/10.1007/978-3-031-53311-2_8
Published: 28 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53310-5
Online ISBN: 978-3-031-53311-2
eBook Packages: Computer Science, Computer Science (R0)