Voice conversion towards modeling dynamic characteristics using switching state space model

  • Research Paper
  • Science China Information Sciences

Abstract

In the voice conversion (VC) literature, the method based on the statistical Gaussian mixture model (GMM) serves as a benchmark. However, an inherent and well-known drawback of the GMM is the discontinuity problem: features are transformed frame by frame, so the dynamics between adjacent frames are ignored and the quality of the converted speech degrades. A variety of algorithms have been proposed to overcome this deficiency, among which the state space model (SSM) based method has shown promising results. In this paper, we present an enhanced version of the traditional SSM, namely the switching SSM (SSSM). The new structure is more flexible than the conventional one in that it uses a mixture of components to account for rapid transitions between neighboring frames. Moreover, the physical meaning of the SSSM model parameters is examined in depth, leading to efficient, application-specific training and transformation procedures for VC. Objective and subjective experiments were conducted to compare the conventional and the proposed SSM-based methods; the results show that the SSSM yields clear improvements in both similarity and quality.
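
As a rough illustration of the switching idea described in the abstract, the following minimal sketch samples from a generic scalar switching linear-Gaussian state space model: a discrete mode follows a Markov chain and selects which linear dynamics and observation gain apply at each frame. It is not the parameterization, training, or conversion procedure of the paper. The function name `simulate_sssm` and all parameters (`a`, `c`, `trans`, `q_std`, `r_std`) are hypothetical names chosen for this example.

```python
import random


def simulate_sssm(num_frames, a, c, trans, q_std, r_std, seed=0):
    """Sample from a scalar switching linear-Gaussian state space model.

    Hypothetical illustration only (not the paper's model):
      s_t ~ Markov chain with row-stochastic transition matrix `trans`
      x_t = a[s_t] * x_{t-1} + w_t,  w_t ~ N(0, q_std^2)   (hidden state)
      y_t = c[s_t] * x_t     + v_t,  v_t ~ N(0, r_std^2)   (observed frame)
    """
    rng = random.Random(seed)
    num_modes = len(a)
    s = rng.randrange(num_modes)  # initial mode, chosen uniformly
    x = rng.gauss(0.0, 1.0)       # initial hidden state
    modes, states, obs = [], [], []
    for _ in range(num_frames):
        # Switch mode according to the current row of the transition matrix.
        s = rng.choices(range(num_modes), weights=trans[s])[0]
        # The hidden state carries information from frame to frame.
        x = a[s] * x + rng.gauss(0.0, q_std)
        # The observation depends on the hidden state and the active mode.
        y = c[s] * x + rng.gauss(0.0, r_std)
        modes.append(s)
        states.append(x)
        obs.append(y)
    return modes, states, obs


if __name__ == "__main__":
    modes, states, obs = simulate_sssm(
        num_frames=10,
        a=[0.95, 0.5],                   # slow vs. fast dynamics
        c=[1.0, 2.0],                    # per-mode observation gains
        trans=[[0.9, 0.1], [0.2, 0.8]],  # mode transition probabilities
        q_std=0.1,
        r_std=0.05,
    )
    for t, (s, y) in enumerate(zip(modes, obs)):
        print(t, s, round(y, 3))
```

With a single mode the sketch reduces to an ordinary linear-Gaussian SSM. With several modes, the active dynamics can change from one frame to the next, which is the extra flexibility the abstract attributes to the SSSM.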

Author information

Corresponding author

Correspondence to Ning Xu.

About this article

Cite this article

Xu, N., Bao, J., Liu, X. et al. Voice conversion towards modeling dynamic characteristics using switching state space model. Sci. China Inf. Sci. 56, 1–15 (2013). https://doi.org/10.1007/s11432-013-4799-4

  • DOI: https://doi.org/10.1007/s11432-013-4799-4