
Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

Multimedia Tools and Applications

Abstract

This paper presents a photo-realistic facial animation synthesis approach based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchronies between articulatory features such as the lips, tongue and glottis/velum can be controlled. Perceptual Linear Prediction (PLP) features extracted from the audio speech, together with Active Appearance Model (AAM) features extracted from the face images of an audio-visual continuous speech database, are used to train the AF_AVDBN model parameters. Given an input audio speech, the trained model is used to estimate the optimal AAM visual features under a maximum likelihood estimation (MLE) criterion; these features are then used to construct the face images of the animation. In our experiments, facial animations are synthesized for 20 continuous audio speech sentences using the proposed AF_AVDBN model as well as two state-of-the-art methods: the audio-visual state synchronous DBN model (SS_DBN), which implements a multi-stream Hidden Markov Model, and the state asynchronous DBN model (SA_DBN). Objective evaluations of the learned AAM features show that the AF_AVDBN model yields considerably more accurate visual features. Subjective evaluations show that the facial animations synthesized with AF_AVDBN surpass those of the state-based SA_DBN and SS_DBN models in overall naturalness and in how accurately the mouth movements match the speech content.
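To make the synthesis pipeline concrete, the sketch below illustrates only its final step: rebuilding face images from the per-frame AAM appearance parameters that the AF_AVDBN inference stage is assumed to have already produced under the MLE criterion. This is not the authors' implementation; all function and variable names are hypothetical, and it uses the standard linear AAM appearance reconstruction (mean appearance plus weighted principal modes), ignoring shape warping for brevity.

```python
import numpy as np

def reconstruct_animation(aam_mean, aam_modes, param_seq, image_shape):
    """Rebuild one face image per frame from estimated AAM appearance parameters.

    aam_mean    : (H*W,)   mean appearance vector of the trained AAM
    aam_modes   : (H*W, k) principal appearance modes (PCA eigenvectors)
    param_seq   : (T, k)   per-frame AAM parameters estimated under the MLE criterion
    image_shape : (H, W)   size of the synthesized face images
    """
    frames = []
    for params in param_seq:
        # Standard linear AAM synthesis: mean appearance plus a weighted
        # combination of the appearance modes.
        appearance = aam_mean + aam_modes @ params
        frames.append(appearance.reshape(image_shape))
    return np.stack(frames)  # (T, H, W) image sequence for the animation
```

Concatenating the reconstructed frames at the video frame rate, aligned with the input audio, then yields the photo-realistic facial animation described above.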



Acknowledgement

This work was supported by the National Natural Science Foundation of China (grant 61273265), the Shaanxi Provincial Key International Cooperation Project (2011KW-04), the LIAMA-CAVSA project, the EU FP7 project ALIZ-E (grant 248116), and the VUB-HOA CaDE project.

Author information


Corresponding author

Correspondence to Yong Zhao.


Cite this article

Jiang, D., Zhao, Y., Sahli, H. et al. Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features. Multimed Tools Appl 73, 397–415 (2014). https://doi.org/10.1007/s11042-013-1610-x
