One-to-many voice conversion based on tensor representation of speaker space

Saito, Daisuke; Yamamoto, Keisuke; Minematsu, Nobuaki; Hirose, Keikichi

doi:10.21437/Interspeech.2011-268

This paper describes a novel approach to flexible control of speaker characteristics using a tensor representation of the speaker space. In voice conversion studies, realizing conversion from or to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In EVC, as in speaker recognition approaches, a speaker space is constructed from GMM supervectors, which are high-dimensional vectors obtained by concatenating the mean vectors of each speaker's GMM. In this speaker space, each speaker is represented by a small number of weight parameters for the eigen-supervectors. In this paper, we revisit the construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond to the Gaussian components and the dimensions of the mean vector, respectively, and the speaker space is derived by tensor analysis of the set of these matrices. Our approach addresses an inherent problem of the supervector representation and improves the performance of voice conversion. Experimental results for one-to-many voice conversion demonstrate the effectiveness of the proposed approach.
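The contrast between the two speaker representations described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the data are random stand-ins for per-speaker GMM mean vectors, the function names `supervector_space` and `matrix_space` are hypothetical, and the "tensor analysis" is assumed here to be a simple mode-wise SVD (a Tucker-style decomposition), which may differ from the method used in the paper.

```python
import numpy as np


def supervector_space(means, k):
    """Supervector view: flatten each speaker's (M, D) means into one vector, then PCA."""
    s, m, d = means.shape
    supervectors = means.reshape(s, m * d)         # one row per speaker
    bias = supervectors.mean(axis=0)               # average supervector
    centered = supervectors - bias
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigen_supervectors = vt[:k]                    # (k, M*D) basis vectors
    weights = centered @ eigen_supervectors.T      # (S, k) weights per speaker
    return bias, eigen_supervectors, weights


def matrix_space(means, k_rows, k_cols):
    """Matrix view (assumed Tucker-style analysis): keep each speaker as an (M, D) matrix."""
    s, m, d = means.shape
    mean_matrix = means.mean(axis=0)               # (M, D) average speaker matrix
    centered = means - mean_matrix
    # Unfold the (S, M, D) tensor along the component mode and the dimension mode.
    unfold_rows = centered.transpose(1, 0, 2).reshape(m, s * d)
    unfold_cols = centered.transpose(2, 0, 1).reshape(d, s * m)
    u_rows = np.linalg.svd(unfold_rows, full_matrices=False)[0][:, :k_rows]
    u_cols = np.linalg.svd(unfold_cols, full_matrices=False)[0][:, :k_cols]
    # Each speaker is now a small (k_rows, k_cols) weight matrix.
    core = np.einsum("mi,smd,dj->sij", u_rows, centered, u_cols)
    return mean_matrix, u_rows, u_cols, core


# Toy sizes (assumptions): 20 speakers, 8 Gaussian components, 4-dimensional means.
rng = np.random.default_rng(0)
S, M, D = 20, 8, 4
means = rng.normal(size=(S, M, D))

bias, eigen_sv, weights = supervector_space(means, k=5)
mean_matrix, u_rows, u_cols, core = matrix_space(means, k_rows=4, k_cols=2)

# Reconstruct every speaker's mean matrix from the low-dimensional weight matrices.
recon = mean_matrix + np.einsum("mi,sij,dj->smd", u_rows, core, u_cols)
print(weights.shape, core.shape, recon.shape)
```

In the supervector view, each speaker is a single weight vector and the distinction between Gaussian components and feature dimensions is lost once the means are concatenated. In the matrix view, the two modes keep separate bases (`u_rows` and `u_cols`), and each speaker is described by a small weight matrix.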

Cite as: Saito, D., Yamamoto, K., Minematsu, N., Hirose, K. (2011) One-to-many voice conversion based on tensor representation of speaker space. Proc. Interspeech 2011, 653-656, doi: 10.21437/Interspeech.2011-268

@inproceedings{saito11_interspeech,
  author={Daisuke Saito and Keisuke Yamamoto and Nobuaki Minematsu and Keikichi Hirose},
  title={{One-to-many voice conversion based on tensor representation of speaker space}},
  year=2011,
  booktitle={Proc. Interspeech 2011},
  pages={653--656},
  doi={10.21437/Interspeech.2011-268}
}