Voice conversion towards modeling dynamic characteristics using switching state space model

Xu, Ning; Bao, JingYi; Liu, XiaoFeng; Jiang, AiMing; Tang, YiBing

doi:10.1007/s11432-013-4799-4

Research Paper
Published: 03 December 2013

Volume 56, pages 1–15 (2013)

Science China Information Sciences

Ning Xu¹,², JingYi Bao³, XiaoFeng Liu¹, AiMing Jiang¹ & YiBing Tang¹

Abstract

In the voice conversion (VC) literature, the method based on the statistical Gaussian mixture model (GMM) serves as a benchmark. However, an inherent drawback of the GMM, known as the discontinuity problem, arises because features are transformed frame by frame: the dynamics between adjacent frames are ignored, which degrades the quality of the converted speech. A variety of algorithms have been proposed to overcome this deficiency, among which the state space model (SSM) based method has given promising results. In this paper, we present an enhanced version of the traditional SSM, namely the switching SSM (SSSM). This structure is more flexible than the conventional one in that it allows a mixture of components to account for rapid transitions between neighboring frames. Moreover, the physical meaning of the SSSM parameters is examined in depth, leading to efficient application-specific training and conversion procedures for VC. Objective and subjective experiments comparing the conventional and the proposed SSM-based methods show that the SSSM yields clear improvements in both the similarity and the quality of the converted speech.
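To make the idea of a switching state space model concrete, the following is a minimal sketch of a generic scalar switching linear SSM, not the model or training procedure of the paper (which the abstract does not specify). It assumes a discrete mode that follows a Markov chain and selects which linear dynamics drive the continuous state, so each frame depends on the previous one. The function name `simulate_switching_ssm`, the parameter layout, and the values in `MODES` and `TRANSITION` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical example parameters: each mode is
# (state coefficient a, observation coefficient c, state noise std q, observation noise std r).
MODES = [(0.95, 1.0, 0.05, 0.01),   # slowly varying mode
         (0.50, 1.0, 0.50, 0.01)]   # rapidly changing mode
# Row i gives the probabilities of moving from mode i to each mode.
TRANSITION = [[0.9, 0.1],
              [0.2, 0.8]]


def simulate_switching_ssm(n_frames, modes, transition, x0=0.0, seed=0):
    """Simulate a scalar switching linear state space model.

    Returns the sequence of active modes and the sequence of observations.
    """
    rng = random.Random(seed)
    state, mode = x0, 0
    mode_path, observations = [], []
    for _ in range(n_frames):
        # Discrete switch: draw the next mode from the current row of the transition matrix.
        mode = rng.choices(range(len(modes)), weights=transition[mode])[0]
        a, c, q, r = modes[mode]
        # Continuous dynamics: the new state depends on the previous frame's state.
        state = a * state + rng.gauss(0.0, q)
        # Observation: noisy linear readout of the hidden state.
        observations.append(c * state + rng.gauss(0.0, r))
        mode_path.append(mode)
    return mode_path, observations
```

Calling `simulate_switching_ssm(50, MODES, TRANSITION)` returns a mode path and 50 observations. With a single mode the sketch reduces to an ordinary linear SSM, which is the sense in which the switching variant is the more flexible of the two.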

Author information

Authors and Affiliations

College of Computer and Information Engineering, Hohai University, Changzhou, 213022, China
Ning Xu, XiaoFeng Liu, AiMing Jiang & YiBing Tang
Ministry of Education Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
Ning Xu
School of Electronic Information and Electric Engineering of Changzhou Institute of Technology, Changzhou, 213002, China
JingYi Bao

Corresponding author

Correspondence to Ning Xu.

About this article

Cite this article

Xu, N., Bao, J., Liu, X. et al. Voice conversion towards modeling dynamic characteristics using switching state space model. Sci. China Inf. Sci. 56, 1–15 (2013). https://doi.org/10.1007/s11432-013-4799-4

Received: 17 September 2012
Accepted: 25 October 2012
Published: 03 December 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s11432-013-4799-4