A study on a two-stage training model for voice spoofing detection

Taein Kang; Il-Youp Kwak

doi:10.7465/jkdi.2023.34.2.203

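Journal information

The Korean Data and Information Science Society
Journal of the Korean Data and Information Science Society (academic journal)
Journal of the Korean Data and Information Science Society, Vol. 34, No. 2
March 2023, pp. 203-214 (12 pages)
DOI: 10.7465/jkdi.2023.34.2.203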

Author information

Taein Kang (Chung-Ang University)
Il-Youp Kwak (Chung-Ang University)

Abstract and keywords

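This study presents a two-stage training model built on deep learning models for voice spoofing detection, together with its performance results. Voice spoofing detection is the task of distinguishing a genuine voice from a spoofed voice that has been replicated in an environment different from that of the original. The need for it is growing in applications such as voice assistants, where speaker identification is directly tied to security. The proposed two-stage training model takes the embedding vectors of several single speech models studied on the LA data set of the Automatic Speaker Verification Spoofing (ASVspoof) 2019 competition, concatenates them to define a new feature, and then builds a deep learning network on that feature. Because the approach resembles existing ensemble techniques in deriving its result from multiple models, we compare it against them using single models that perform well on the LA data. The two-stage training model built from a combination of embeddings from several models achieved an Equal Error Rate (EER) of 0.26%. This is an improvement of 0.34 percentage points over 0.60%, the best result of the Voting ensemble technique, and of 0.57 percentage points over 0.83%, the best single-model result. The work is meaningful in that it presents a basic two-stage training model for voice spoofing detection, and follow-up research could develop it further with a more sophisticated structure.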
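The second stage described above (concatenate frozen first-stage embeddings, then train a new model on the result) can be sketched roughly as follows. This is only an illustration under loud assumptions: the embeddings are synthetic, the sizes 4, 3 and 5 are invented, and a plain logistic regression stands in for the deep learning network used in the paper, whose architecture the abstract does not specify. All function names are hypothetical.

```python
import math
import random


def concat_embeddings(per_model_embeddings):
    """Join each utterance's embeddings from all first-stage models into one vector.

    per_model_embeddings[m][i] is the embedding of utterance i from model m.
    """
    return [
        [value for emb in utterance_embs for value in emb]
        for utterance_embs in zip(*per_model_embeddings)
    ]


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_second_stage(features, labels, epochs=50, lr=0.05):
    """Stand-in second-stage model (assumption): logistic regression trained with SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            grad = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * grad * v for w, v in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def score(model, x):
    """Return the predicted probability that x is bona fide speech."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Synthetic stand-ins for frozen first-stage embeddings (assumed sizes 4, 3, 5).
rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = bona fide, 0 = spoof
embedding_dims = [4, 3, 5]
per_model_embeddings = [
    [[rng.gauss(1.0 if y else -1.0, 1.0) for _ in range(dim)] for y in labels]
    for dim in embedding_dims
]

fused = concat_embeddings(per_model_embeddings)
model = train_second_stage(fused, labels)
accuracy = sum(
    (score(model, x) >= 0.5) == bool(y) for x, y in zip(fused, labels)
) / len(labels)
print(f"fused dimension: {len(fused[0])}, training accuracy: {accuracy:.3f}")
```

The difference from Voting is that the second stage learns from the concatenated embeddings themselves, not from each model's final score, so it can weight individual embedding dimensions across models.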

This work presents a novel 2-stage training method for voice spoofing detection, along with performance experiments. The task in voice spoofing detection is to tell a real voice from a spoofed one that has been replicated in a setting different from that of the original voice. In areas where speaker identification is crucial to security, such as voice assistants, the demand for voice spoofing detection is on the rise. The proposed 2-stage training model takes the embedding vectors of several single speech models studied on the Automatic Speaker Verification Spoofing (ASVspoof) 2019 competition LA data set, concatenates them into a single embedding feature, and then builds a deep learning network on that feature to create an ensemble model. For comparison with existing ensemble methodologies, we examined results based on the fusion of embedding vectors from various single models and on modifications to the deep learning networks. Combining a number of models, the 2-stage training model produced an EER of 0.26%. This is a 0.34 percentage point improvement over the 0.60% of the ensemble technique (Voting method) and a 0.57 percentage point improvement over the best single-model performance of 0.83%.
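For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofs accepted) equals the false rejection rate (bona fide speech rejected). The sketch below is a minimal threshold sweep on toy scores, not the official ASVspoof scoring tool, and it assumes that a higher score means "more likely bona fide". It also reproduces the percentage-point differences quoted in the abstract.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    labels: 1 = bona fide, 0 = spoof. Higher score = more likely bona fide (assumption).
    """
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        far = sum(s >= threshold for s in spoof) / len(spoof)         # spoofs accepted
        frr = sum(s < threshold for s in bona_fide) / len(bona_fide)  # bona fide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one spoof (0.6) outscores one bona fide utterance (0.4).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# EER values reported in the abstract, in percent.
two_stage_eer, voting_eer, single_model_eer = 0.26, 0.60, 0.83
gain_over_voting = round(voting_eer - two_stage_eer, 2)        # percentage points
gain_over_single = round(single_model_eer - two_stage_eer, 2)  # percentage points
print(toy_eer, gain_over_voting, gain_over_single)
```

The gains are differences between two percentages, which is why the abstract reports them in percentage points (%p) and not as relative reductions.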

#Deep learning #two-stage training #voice spoofing detection #embedding