Abstract
The MPEG-4 standard supports the composition of natural or synthetic video with facial animation. Based on this standard, an animated face can be inserted into natural or synthetic video to create new virtual working environments such as virtual meetings or virtual collaborative workspaces. For these applications, audio-to-visual conversion techniques can generate a talking face synchronized with the voice. In this paper, we address the audio-to-visual conversion problem by introducing a novel Hidden Markov Model Inversion (HMMI) method. In training audio-visual HMMs, the model parameters {λav} are chosen to optimize a criterion such as maximum likelihood. In inverting audio-visual HMMs, the visual parameters that optimize a criterion are found from the given speech and the trained model parameters {λav}. With the proposed HMMI technique, an animated talking face can be synchronized with audio and driven realistically. A virtual conference system named VIRTUAL-FACE, which combines the HMMI technique with the MPEG-4 standard, is introduced to demonstrate the role of HMMI in applications of MPEG-4 facial animation.
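To make the inversion idea concrete, the sketch below illustrates one common formulation of audio-to-visual HMM inversion: train a joint audio-visual HMM with diagonal covariances, compute state posteriors from the audio stream alone via forward-backward, then recover the visual trajectory as a posterior- and precision-weighted average of the per-state visual means. All model parameters here are toy numbers chosen for illustration, not values from the paper, and the single-Gaussian, two-state setup is a simplifying assumption.

```python
import numpy as np

# Toy joint audio-visual HMM (illustrative numbers, not from the paper):
# 2 states, each with a 1-D audio Gaussian and a 1-D visual Gaussian.
# Diagonal covariance means the joint density factors per dimension.
pi = np.array([0.5, 0.5])            # initial state probabilities
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])           # state transition matrix
mu_a  = np.array([0.0, 3.0])         # audio mean per state
var_a = np.array([1.0, 1.0])         # audio variance per state
mu_v  = np.array([-1.0, 1.0])        # visual mean per state (e.g. mouth opening)
var_v = np.array([0.5, 0.5])         # visual variance per state

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def state_posteriors(audio):
    """Forward-backward over the audio dimension only: gamma[t, i] = P(q_t = i | audio)."""
    T, N = len(audio), len(pi)
    B = gauss(audio[:, None], mu_a[None, :], var_a[None, :])  # emission likelihoods
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()               # scale to avoid underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def invert(audio):
    """HMM inversion: the visual trajectory maximizing the expected joint
    log-likelihood given the audio.  With diagonal covariances the optimum
    is a posterior- and precision-weighted average of the visual means."""
    gamma = state_posteriors(audio)
    w = gamma / var_v[None, :]                   # precision-weighted posteriors
    return (w @ mu_v) / w.sum(axis=1)

audio = np.array([0.1, -0.2, 0.3, 2.8, 3.1, 2.9])  # low then high audio values
visual = invert(audio)
# Early frames track state 0's visual mean (negative); later frames state 1's (positive).
```

In a full system the audio features would be cepstral vectors, the visual features MPEG-4 facial animation parameters, and the per-state densities Gaussian mixtures re-estimated jointly with the inversion, but the posterior-weighted structure of the estimate stays the same.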
Cite this article
Choi, K., Luo, Y. & Hwang, JN. Hidden Markov Model Inversion for Audio-to-Visual Conversion in an MPEG-4 Facial Animation System. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 29, 51–61 (2001). https://doi.org/10.1023/A:1011171430700