
Hidden Markov Model Inversion for Audio-to-Visual Conversion in an MPEG-4 Facial Animation System

Published in: The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology

Abstract

The MPEG-4 standard allows the composition of natural or synthetic video with facial animation. Based on this standard, an animated face can be inserted into natural or synthetic video to create new virtual working environments such as virtual meetings or virtual collaborative environments. For these applications, audio-to-visual conversion techniques can be used to generate a talking face that is synchronized with the voice. In this paper, we address the audio-to-visual conversion problem by introducing a novel Hidden Markov Model Inversion (HMMI) method. In training audio-visual HMMs, the model parameters {λav} are chosen to optimize a criterion such as maximum likelihood. In inverting audio-visual HMMs, the visual parameters that optimize such a criterion are found from the given speech and the model parameters {λav}. Using the proposed HMMI technique, an animated talking face can be synchronized with audio and driven realistically. A virtual conference system named VIRTUAL-FACE, which combines the HMMI technique with the MPEG-4 standard, is introduced to illustrate the role of HMMI in MPEG-4 facial animation applications.
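The two optimization problems described in the abstract can be written compactly. The following is a minimal sketch in standard HMM notation, assuming maximum likelihood as the criterion; the symbols O_a (audio observation sequence) and O_v (visual parameter sequence) are illustrative names, as only {λav} is named in the abstract itself.

Training (estimate the joint audio-visual model from paired data):

\hat{\lambda}_{av} = \arg\max_{\lambda_{av}} P(O_a, O_v \mid \lambda_{av})

Inversion (given new speech O_a and the trained model, estimate the visual trajectory):

\hat{O}_v = \arg\max_{O_v} P\left(O_a, O_v \mid \hat{\lambda}_{av}\right)

Training is the familiar maximum-likelihood estimation step (typically via the Baum-Welch / EM algorithm); inversion instead fixes the model parameters and optimizes over the missing visual observation stream, which is what ties the synthetic face's motion to the input voice.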




Cite this article

Choi, K., Luo, Y. & Hwang, JN. Hidden Markov Model Inversion for Audio-to-Visual Conversion in an MPEG-4 Facial Animation System. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 29, 51–61 (2001). https://doi.org/10.1023/A:1011171430700
