short-paper

Open Access

Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection

Authors:
Ziyue Xiang, Purdue University, West Lafayette, IN, USA (ORCID 0000-0001-6054-5801)
Amit Kumar Singh Yadav, Purdue University, West Lafayette, IN, USA (ORCID 0000-0001-6464-7688)
Stefano Tubaro, Politecnico di Milano, Milan, Italy (ORCID 0000-0002-1990-9869)
Paolo Bestagini, Politecnico di Milano, Milan, Italy (ORCID 0000-0003-0406-0222)
Edward J. Delp, Purdue University, West Lafayette, IN, USA (ORCID 0000-0002-2909-7323)

IH&MMSec '23: Proceedings of the 2023 ACM Workshop on Information Hiding and Multimedia Security, June 2023, Pages 163–168, https://doi.org/10.1145/3577163.3595104

Published: 28 June 2023

ABSTRACT

Many speech signals are compressed with MP3 to reduce the data rate. Many synthetic speech detection methods operate on the spectrogram of the speech signal, which usually requires the signal to be fully decompressed first. We show that the design of MP3 compression makes it possible to approximate the spectrogram of MP3 compressed speech efficiently without fully decoding it. We refer to the spectrograms obtained with our proposed approach as Efficient Spectrograms (E-Specs). E-Spec can reduce the complexity of spectrogram computation by ~77.60 percentage points (p.p.) and save ~37.87 p.p. of MP3 decoding time. E-Spec also bypasses the reconstruction artifacts introduced by the MP3 synthesis filterbank, which makes it useful in speech forensics tasks. We tested E-Spec on synthetic speech detection, where a detector must determine whether a speech signal is synthesized or recorded from a human. We examined 4 different neural network architectures to compare the performance of E-Spec against speech features extracted from the fully decoded speech signal. E-Spec achieved the best synthetic speech detection performance for 3 of the 4 architectures; it also achieved the best overall detection performance across architectures. The computation of E-Spec is an approximation of the Short-Time Fourier Transform (STFT), and E-Spec can be extended to other audio compression methods.
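
For orientation, the conventional baseline that the abstract contrasts with E-Spec, a spectrogram computed by STFT on a fully decoded signal, can be sketched as follows. This is only an illustrative NumPy sketch and not the E-Spec computation itself. The function name `stft_spectrogram`, the frame length, the hop size, the Hann window, and the 1 kHz test tone are all assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram of a decoded signal via a Hann-windowed STFT.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    frame_len and hop are illustrative defaults, not values from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames, one row per frame.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Window each frame and keep the magnitude of the one-sided FFT.
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Hypothetical stand-in for decoded speech: a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_spectrogram(tone)
# Frequency bin with the highest average energy across frames.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512
```

In this baseline every sample must first be reconstructed by the decoder before any frame can be transformed, which is the cost the abstract says E-Spec avoids by working from the partially decoded MP3 data.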

Published in
IH&MMSec '23: Proceedings of the 2023 ACM Workshop on Information Hiding and Multimedia Security
June 2023
190 pages
ISBN: 9798400700545
DOI: 10.1145/3577163
General Chair: Daniel Moreira (Loyola University Chicago, USA)
Program Chairs: Aparna Bharati (Lehigh University, USA), Cecilia Pasquini (Fondazione Bruno Kessler, Italy), Yassine Yousfi (Comma.ai, USA)
Copyright © 2023 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
asvspoof19
audio compression
deep learning
mp3 compression
signal processing
synthetic speech detection
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 128 of 318 submissions, 40%