Abstract
In the voice conversion (VC) literature, the statistical method based on the Gaussian mixture model (GMM) serves as a benchmark. However, GMM-based conversion has a well-known inherent drawback, the discontinuity problem: features are transformed frame by frame, so the dynamics between adjacent frames are ignored and the quality of the converted speech degrades. A variety of algorithms have been proposed to overcome this deficiency, among which the method based on the state space model (SSM) has provided promising results. In this paper, we present an enhanced version of the traditional SSM, namely the switching SSM (SSSM). The new structure is more flexible than the conventional one in that it allows a mixture of components to account for the rapid transitions between neighboring frames. Moreover, the physical meaning of the SSSM model parameters is examined in depth, leading to efficient application-specific training and transformation procedures for VC. Objective and subjective experiments were conducted to compare the performance of the conventional and the proposed SSM-based methods; the results show that SSSM yields clear improvements in both similarity and quality.
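The abstract gives no equations, so the following is only a generic textbook sketch of how a switching SSM differs from a conventional one, not the parameterization used in the paper. All symbols here ($x_t$, $y_t$, $s_t$, $A$, $C$, $Q$, $R$, $\pi_{ij}$, $M$) are illustrative assumptions: $x_t$ is a hidden continuous state, $y_t$ an observed feature vector for frame $t$, and $s_t$ a discrete switch variable that selects one of $M$ linear-Gaussian components.

```latex
% Conventional linear-Gaussian SSM: one set of parameters for all frames.
\begin{align}
  x_t &= A\,x_{t-1} + w_t, & w_t &\sim \mathcal{N}(0, Q), \\
  y_t &= C\,x_t + v_t,     & v_t &\sim \mathcal{N}(0, R).
\end{align}

% Switching SSM (assumed generic form): a discrete Markov switch
% s_t in {1, ..., M} picks the parameters used at frame t.
\begin{align}
  P(s_t = j \mid s_{t-1} = i) &= \pi_{ij}, \\
  x_t &= A_{s_t}\,x_{t-1} + w_t, & w_t &\sim \mathcal{N}(0, Q_{s_t}), \\
  y_t &= C_{s_t}\,x_t + v_t,     & v_t &\sim \mathcal{N}(0, R_{s_t}).
\end{align}
```

In this generic form, setting $M = 1$ recovers the conventional SSM. With $M > 1$ the dynamics can change from one frame to the next, which is the kind of flexibility the abstract attributes to using a mixture of components for rapid transitions between neighboring frames.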
Cite this article
Xu, N., Bao, J., Liu, X. et al. Voice conversion towards modeling dynamic characteristics using switching state space model. Sci. China Inf. Sci. 56, 1–15 (2013). https://doi.org/10.1007/s11432-013-4799-4
DOI: https://doi.org/10.1007/s11432-013-4799-4