Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection

  • Original Paper
  • Published:
Journal of Computer Virology and Hacking Techniques

Abstract

Modern speech synthesis technologies can be used to deceive voice authentication systems, carry out phone scams, or discredit public figures. Detecting synthesized speech is therefore an urgent task for protecting against voice substitution attacks. The solution to this problem rests on the choice of cepstral coefficients, which determine the quality of the cloned voice. In addition, the dataset used to train the neural network must match the language in which synthesized speech is to be detected. This paper reviews the most widely used cepstral coefficients and compares their effectiveness using two main types of neural networks. To train the networks, a Russian speech dataset was developed; using it achieves the highest accuracy in detecting speech synthesis for deepfakes in Russian.
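The pipeline implied by the abstract, extracting cepstral features from an audio clip and classifying it with a neural network, can be sketched as follows. This is a minimal illustration only, assuming MFCC features via librosa and a small PyTorch CNN; the file name speech.wav, the feature dimensions, and the network layout are placeholders, not the configuration evaluated in the paper.

    # Sketch: cepstral-feature extraction + a small CNN spoofing classifier.
    # Assumptions: MFCC front end (librosa), 16 kHz mono audio, illustrative layer sizes.
    import librosa
    import torch
    import torch.nn as nn

    def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 20) -> torch.Tensor:
        """Load an audio file and return its MFCCs as a (1, n_mfcc, T) tensor."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return torch.from_numpy(mfcc).float().unsqueeze(0)  # add channel dimension

    class SpoofCNN(nn.Module):
        """Small convolutional classifier: two conv blocks, global pooling, linear head."""
        def __init__(self, n_classes: int = 2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.classifier = nn.Linear(32 * 4 * 4, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.features(x)
            return self.classifier(x.flatten(1))

    if __name__ == "__main__":
        # "speech.wav" is a placeholder; any 16 kHz mono recording would do.
        feats = extract_mfcc("speech.wav").unsqueeze(0)  # (batch, 1, n_mfcc, T)
        model = SpoofCNN()
        logits = model(feats)
        print("bona fide vs. spoofed logits:", logits.detach().numpy())

In practice the same skeleton would be trained on labelled bona fide and spoofed recordings, and the MFCC front end could be swapped for other cepstral variants (e.g. LFCC or CQCC) to compare their effectiveness.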

Acknowledgements

This work was funded by the Foundation for Assistance to Small Innovative Enterprises, Russia, under Grant No. 36ГУКoдИИC12-D7/81484.

Author information

Corresponding author

Correspondence to Dmitry Efanov.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Efanov, D., Aleksandrov, P. & Mironov, I. Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection. J Comput Virol Hack Tech (2023). https://doi.org/10.1007/s11416-023-00491-0

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11416-023-00491-0
