DOI: 10.1145/3592572.3592841
research-article

SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection

Published: 12 June 2023

ABSTRACT

The prevalence of voice spoofing attacks in today’s digital world has become a critical security concern. Attackers employ various techniques, such as voice conversion (VC) and text-to-speech (TTS), to generate synthetic speech that imitates the victim’s voice and to gain access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods cannot detect such speech effectively. To address this issue, this paper proposes a novel solution for logical access (LA)-based synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that includes crafted front-end spoofing features and deep attentive features retrieved using the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier to classify speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics are altered to match the target speaker’s formant frequencies, while in VC attacks, the temporal alignment of the speech segments is manipulated to preserve the target speaker’s prosodic features. Building on these observations, this paper targets prosodic and phonetic crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, and presents a preprocessing pipeline shown to be effective for synthetic speech detection. The proposed solution achieved state-of-the-art performance against eight recent feature fusion methods, with a lower EER of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
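
The abstract reports performance as an equal error rate (EER), the operating point at which the false acceptance rate (spoofed speech accepted as bona fide) equals the false rejection rate (bona fide speech rejected). The abstract does not say how the authors computed it, so the following is only a minimal sketch under stated assumptions: higher scores mean "more likely bona fide", the function name `equal_error_rate` is hypothetical, and the toy scores are invented for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def equal_error_rate(bona_fide_scores: List[float],
                     spoof_scores: List[float]) -> Tuple[float, float]:
    """Return (eer, threshold), assuming higher scores mean bona fide.

    A trial is accepted when score >= threshold.
    FAR: fraction of spoof trials that are accepted.
    FRR: fraction of bona fide trials that are rejected.
    """
    if not bona_fide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one value above the
    # maximum so the "reject everything" operating point is also considered.
    candidates = sorted(set(bona_fide_scores) | set(spoof_scores))
    candidates.append(candidates[-1] + 1.0)

    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = candidates[0]
    for t in candidates:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where FAR and FRR are closest and report
            # their average as the EER estimate.
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = t
    return best_eer, best_threshold


# Toy scores, invented for illustration only (not from the paper).
bona_fide = [0.9, 0.8, 0.75, 0.6, 0.3]
spoof = [0.1, 0.2, 0.35, 0.4, 0.7]
eer, threshold = equal_error_rate(bona_fide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

This exhaustive threshold sweep is quadratic in the number of trials, which is fine for a small example. An evaluation at the scale of a full dataset would more likely sort the scores once or interpolate along a ROC curve.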

Published in

MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation
June 2023
65 pages
ISBN: 9798400701870
DOI: 10.1145/3592572

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited
