Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection

Efanov, Dmitry; Aleksandrov, Pavel; Mironov, Ilia

doi:10.1007/s11416-023-00491-0

Original Paper
Published: 13 August 2023

Journal of Computer Virology and Hacking Techniques (2023)

Abstract

Modern speech synthesis technologies can be used to deceive voice authentication systems, carry out phone scams, or discredit public figures. Detecting synthesized speech is therefore an urgent task in protecting against voice substitution attacks. The solution to this problem is based on the choice of cepstral coefficients that determine the quality of the cloned voice. In addition, the dataset used to train the neural network must match the language in which synthesized speech is to be detected. The paper discusses the most widely used cepstral coefficients and compares their effectiveness using two main types of neural networks. To train the networks, a Russian speech dataset was developed, which makes it possible to achieve the highest accuracy in detecting synthesized speech for deepfakes in Russian.
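For readers unfamiliar with the features named in the abstract, the following is a minimal sketch of the textbook real cepstrum of a single audio frame, i.e. the inverse Fourier transform of the log-magnitude spectrum. It is not the authors' feature pipeline: the abstract does not state which coefficient families, frame sizes, or filter banks were compared, so the function names, the Hamming window, the choice of 13 coefficients, and the synthetic test tone are all illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, num_coeffs=13, eps=1e-10):
    """Return the first `num_coeffs` real-cepstrum coefficients of one frame.

    c[q] = Re(IDFT(log|DFT(windowed frame)|))[q]
    `num_coeffs` and `eps` are illustrative defaults, not values from the paper.
    """
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    # Log-magnitude spectrum; eps avoids log(0) for empty frequency bins.
    log_mag = [math.log(abs(v) + eps) for v in dft(windowed)]
    # Inverse DFT of the log-magnitude spectrum, keeping only low quefrencies.
    return [
        sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n) for k in range(n)).real / n
        for q in range(num_coeffs)
    ]


# Hypothetical input: a 64-sample frame of a 440 Hz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
features = real_cepstrum(frame)
print(len(features))  # one 13-dimensional feature vector for this frame
```

In a detection setting, such per-frame vectors would be stacked over time and passed to a classifier; mel-scaled or constant-Q variants differ mainly in how the spectrum is warped before the logarithm is taken.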



Acknowledgements

The work was funded by the Foundation for Assistance to Small Innovative Enterprises, Russia under Grant No (36ГУКoдИИC12-D7/81484).

Author information

Authors and Affiliations

National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Moscow, Russian Federation
Dmitry Efanov, Pavel Aleksandrov & Ilia Mironov

Corresponding author

Correspondence to Dmitry Efanov.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Efanov, D., Aleksandrov, P. & Mironov, I. Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection. J Comput Virol Hack Tech (2023). https://doi.org/10.1007/s11416-023-00491-0

Received: 02 February 2023
Accepted: 03 July 2023
Published: 13 August 2023
DOI: https://doi.org/10.1007/s11416-023-00491-0
