Abstract
Modern speech synthesis technologies can be used to deceive voice authentication systems, carry out phone scams, or discredit public figures. Detecting synthesized speech is therefore an urgent task in protecting against voice substitution attacks. The solution to this problem depends on the choice of cepstral coefficients, which characterize the quality of the cloned voice. In addition, the dataset used to train the neural network must match the language in which synthesized speech is to be detected. The paper discusses the most widely used cepstral coefficients and compares their effectiveness using two main types of neural networks. To train the networks, a Russian speech dataset was developed; its use yields the highest accuracy in detecting synthesized speech for deepfakes in Russian.
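To make the notion of cepstral coefficients concrete, the following is a minimal sketch of how two commonly used variants, mel-frequency (MFCC) and linear-frequency (LFCC) coefficients, can be computed for a single audio frame. It is not the authors' pipeline: the function names, frame length, filter count, and number of coefficients are illustrative assumptions, and a plain DFT is used in place of an optimized FFT so that the example needs only the standard library.

```python
import cmath
import math


def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame via a direct DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2 / n)
    return spectrum


def triangular_filterbank(n_filters, n_bins, sample_rate, scale):
    """Triangular filters whose edges are evenly spaced on a mel or linear axis."""
    nyquist = sample_rate / 2.0
    if scale == "mel":
        top = hz_to_mel(nyquist)
        edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    else:  # "linear"
        edges = [nyquist * i / (n_filters + 1) for i in range(n_filters + 2)]
    bin_hz = nyquist / (n_bins - 1)
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for b in range(n_bins):
            f = b * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def cepstral_coefficients(frame, sample_rate, n_filters=8, n_ceps=5, scale="mel"):
    """Log filterbank energies followed by an unnormalized DCT-II."""
    spec = power_spectrum(frame)
    bank = triangular_filterbank(n_filters, len(spec), sample_rate, scale)
    # Small constant avoids log(0) when a filter covers no spectral energy.
    log_energy = [math.log(sum(w * p for w, p in zip(weights, spec)) + 1e-10)
                  for weights in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for c in range(n_ceps)]


if __name__ == "__main__":
    sr = 8000  # hypothetical sample rate
    tone = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(64)]
    print("MFCC:", cepstral_coefficients(tone, sr, scale="mel"))
    print("LFCC:", cepstral_coefficients(tone, sr, scale="linear"))
```

In this sketch the two variants differ only in how the filterbank is spaced along the frequency axis. In a detection setting, such per-frame vectors would typically be stacked over time and passed to a neural network classifier.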
Acknowledgements
The work was funded by the Foundation for Assistance to Small Innovative Enterprises, Russia under Grant No (36ГУКoдИИC12-D7/81484).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Efanov, D., Aleksandrov, P. & Mironov, I. Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection. J Comput Virol Hack Tech (2023). https://doi.org/10.1007/s11416-023-00491-0
DOI: https://doi.org/10.1007/s11416-023-00491-0