DOI: 10.1145/3577163.3595104 (ACM Conferences: ih-n-mmsec Conference Proceedings)
short-paper
Open Access

Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection

Published: 28 June 2023

ABSTRACT

Many speech signals are compressed with MP3 to reduce the data rate. Many synthetic speech detection methods use the spectrogram of the speech signal, which usually requires the signal to be fully decompressed. We show that the design of MP3 compression allows one to approximate the spectrogram of MP3 compressed speech efficiently without fully decoding it. We call the spectrograms obtained with our proposed approach Efficient Spectrograms (E-Specs). E-Spec can reduce the complexity of spectrogram computation by ~77.60 percentage points (p.p.) and save ~37.87 p.p. of MP3 decoding time. E-Spec bypasses the reconstruction artifacts introduced by the MP3 synthesis filterbank, which makes it useful in speech forensics tasks. We tested E-Spec on synthetic speech detection, where a detector is asked to determine whether a speech signal is synthesized or recorded from a human. We examined four neural network architectures to compare the performance of E-Spec with that of speech features extracted from the fully decoded speech signal. E-Spec achieved the best synthetic speech detection performance for three of the architectures; it also achieved the best overall detection performance across architectures. The computation of E-Spec is an approximation to the Short-Time Fourier Transform (STFT). E-Spec can be extended to other audio compression methods.
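As a rough illustration of why transform coefficients that already exist inside a codec can stand in for an STFT spectrogram, the sketch below compares a Hann-windowed STFT magnitude spectrogram with the magnitudes of a sine-windowed MDCT computed on the same signal. This is not the E-Spec procedure itself: it applies a plain MDCT to raw samples rather than reading coefficients from an MP3 bitstream, and the function names, the 16 kHz test tone, and the window choices are assumptions made for illustration. The block size of 576 coefficients is chosen to match the number of frequency lines in an MP3 granule.

```python
import numpy as np

def stft_spectrogram(x, n_fft=1152, hop=576):
    """Magnitude spectrogram from a Hann-windowed STFT (the reference feature)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

def mdct_spectrogram(x, n_coef=576):
    """Magnitudes of a sine-windowed MDCT; each block spans 2 * n_coef samples."""
    n = np.arange(2 * n_coef)
    k = np.arange(n_coef)
    window = np.sin(np.pi * (n + 0.5) / (2 * n_coef))
    # MDCT basis: cos[(pi / N) * (n + 0.5 + N / 2) * (k + 0.5)], with N = n_coef
    basis = np.cos(np.pi / n_coef * np.outer(k + 0.5, n + 0.5 + n_coef / 2))
    n_frames = 1 + (len(x) - 2 * n_coef) // n_coef
    frames = np.stack([x[i * n_coef:(i + 2) * n_coef] * window for i in range(n_frames)])
    return np.abs(frames @ basis.T).T  # (n_coef, n_frames)

# Hypothetical test signal: one second of a 1003 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1003.0 * t)

spec_stft = stft_spectrogram(x)
spec_mdct = mdct_spectrogram(x)

# Locate the strongest frequency line in each time-averaged representation.
peak_stft = int(np.argmax(spec_stft.mean(axis=1)))
peak_mdct = int(np.argmax(spec_mdct.mean(axis=1)))

# STFT bin b sits at b * sr / n_fft; MDCT line k sits at (k + 0.5) * sr / (2 * n_coef).
print("STFT peak (Hz):", peak_stft * sr / 1152)
print("MDCT peak (Hz):", (peak_mdct + 0.5) * sr / (2 * 576))
```

Both representations place the tone's energy at neighbouring frequency lines, which is the intuition behind approximating a spectrogram from codec-domain coefficients. A real MP3 stream uses a hybrid filterbank (polyphase subbands followed by MDCT), so obtaining its coefficients requires a partial decoder rather than this direct transform.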

