With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions must be addressed to fast-track the development of this new technology. Of particular interest is the ability of a speaker embeddings network to leverage artificially degraded training data to a far greater extent than prior technologies, even when evaluating naturally degraded data. In this study, we explore some of the fundamental requirements for building a good speaker embeddings extractor. We analyze the impact of voice activity detection, the types of degradation, the amount of degraded data, and the number of speakers required for a good network. These aspects are analyzed over a large set of 11 conditions drawn from 7 evaluation datasets. Based on the observed trends, we lay out a set of recommendations for training the network. Applying these recommendations to enhance the default recipe provided in the Kaldi toolkit yields significant gains of 13-21% on the Speakers in the Wild and NIST SRE’16 datasets.
Cite as: Mclaren, M., Castán, D., Nandwana, M.K., Ferrer, L., Yilmaz, E. (2018) How to train your speaker embeddings extractor. Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 327-334, doi: 10.21437/Odyssey.2018-46
@inproceedings{mclaren18_odyssey,
  author={Mitchell Mclaren and Diego Castán and Mahesh Kumar Nandwana and Luciana Ferrer and Emre Yilmaz},
  title={{How to train your speaker embeddings extractor}},
  year={2018},
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2018)},
  pages={327--334},
  doi={10.21437/Odyssey.2018-46}
}