ABSTRACT
Voice spoofing attacks have become a critical security concern in today’s digital world. Attackers employ techniques such as voice conversion (VC) and text-to-speech (TTS) to generate synthetic speech that imitates a victim’s voice and gain access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods cannot detect such speech effectively. To address this issue, this paper proposes a novel solution for logical access (LA)-based synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that uses crafted front-end spoofing features and deep attentive features retrieved by the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier, which classifies speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics are altered to match the target speaker’s formant frequencies, while in VC attacks, the temporal alignment of the speech segments is manipulated to preserve the target speaker’s prosodic features. Building on these observations, this paper targets prosodic and phonetic crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, and presents a preprocessing pipeline that proves effective for synthetic speech detection. The proposed solution achieves state-of-the-art performance against eight recent feature fusion methods, with a lower EER of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
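The abstract reports its headline result as an equal error rate (EER), the operating point at which the rate of accepted spoofed samples equals the rate of rejected bona fide samples. The following is a minimal sketch of how an EER can be approximated from detector scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more bona fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER, assuming higher scores mean 'more bona fide'."""
    # Candidate thresholds: every observed score, plus one above all of them
    # so that the "reject everything" operating point is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores, not taken from the paper.
bonafide = [0.9, 0.8, 0.4, 0.7]
spoof = [0.1, 0.2, 0.5, 0.3]
print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

With a finite set of scores the two error rates rarely cross exactly, so the sketch averages them at the closest threshold; evaluation toolkits typically interpolate along the error curve instead.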
Index Terms
- SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection