
SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection

Authors:
Awais Khan, Department of Computer Science & Engineering, Oakland University, USA (ORCID: 0000-0003-2497-7687)
Khalid Mahmood Malik, Department of Computer Science & Engineering, Oakland University, USA (ORCID: 0000-0002-7927-3436)

MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation, June 2023, Pages 10–18. https://doi.org/10.1145/3592572.3592841

Published: 12 June 2023


ABSTRACT

Voice spoofing attacks have become a critical security concern. Attackers use techniques such as voice conversion (VC) and text-to-speech (TTS) to generate synthetic speech that imitates a victim’s voice and thereby gain access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods cannot detect such speech effectively. To address this issue, this paper proposes SpoTNet, a novel solution for logical access (LA) synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that combines crafted front-end spoofing features with deep attentive features extracted by the proposed logical spoofing transformer encoder (LSTE). The attentive features are then passed to the proposed multi-layer spoofing classifier, which labels each speech sample as bona fide or synthetic. In TTS-generated speech, the spectral characteristics are altered to match the target speaker’s formant frequencies, whereas in VC attacks the temporal alignment of speech segments is manipulated to preserve the target speaker’s prosodic features. Building on these observations, the paper targets prosodic and phonetic crafted features, namely the Mel-spectrogram, spectral contrast, and spectral envelope, and presents a preprocessing pipeline shown to be effective for synthetic speech detection. The proposed solution achieves state-of-the-art performance against eight recent feature fusion methods, with a lower equal error rate (EER) of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
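The abstract reports performance as an equal error rate (EER), the operating point at which the false acceptance rate (spoofed samples accepted as bona fide) equals the false rejection rate (bona fide samples rejected). The following is a minimal sketch of how an EER can be estimated from detector scores. It is not the authors' evaluation code: the function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def compute_eer(
    bonafide_scores: Sequence[float], spoof_scores: Sequence[float]
) -> Tuple[float, float]:
    """Estimate the equal error rate and the threshold where it occurs.

    Assumption (illustrative): a higher score means "more likely bona fide",
    and a sample is accepted when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))

    best_gap = float("inf")
    eer = 1.0
    best_threshold = thresholds[0]
    for thr in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above it.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # With finitely many scores FAR and FRR rarely match exactly,
            # so report their mean at the closest crossing point.
            best_gap, eer, best_threshold = gap, (far + frr) / 2, thr
    return eer, best_threshold


# Toy scores, invented for this example only.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores one bona fide sample and one spoofed sample fall on the wrong side of the chosen threshold, so both error rates are 25%. An EER of 0.95%, as reported in the abstract, would correspond to both rates being just under 1% at the crossing point.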


Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability


Published in
MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation
June 2023
65 pages
ISBN: 9798400701870
DOI: 10.1145/3592572
Editors:
Luca Cuccovillo, Fraunhofer IDMT, Germany
Bogdan Ionescu, UPB, Romania
Giorgos Kordopatis-Zilos, CTU in Prague, Czech Republic
Symeon Papadopoulos, CERTH-ITI, Greece
Adrian Popescu, CEA LIST, France
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from the ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Logical attacks
Speech synthesis
Text-to-speech
Voice conversion
Voice spoofing detection
Qualifiers
- research-article
- Research
- Refereed limited