Partial linear regression for speech-driven talking head application

https://doi.org/10.1016/j.image.2005.04.002

Abstract

Avatars in many applications are constructed manually or with a single speech-driven model that requires a large amount of training data and a long training time, so building a user-dependent model more efficiently is essential. In this paper, a new adaptation method, called partial linear regression (PLR), is proposed and adopted in an audio-driven talking head application. This method allows users to adapt part of the model parameters from the available adaptation data while keeping the others unchanged. In our experiments, the PLR algorithm saves the hours that would otherwise be spent retraining a new user-dependent model and adjusts the user-independent model into a more personalized one. The animation results with the adapted model are 36% closer to those of the user-dependent model than those obtained with the pre-trained user-independent model.

Introduction

With the rapid development of multimedia technology, virtual avatars have been widely used in many areas, such as cartoon or computer game characters and news announcers. However, a huge amount of manual effort is needed to adjust an avatar frame by frame to achieve vivid and precise synthetic facial animation, since any asynchrony between mouth motion and voice pronunciation is a fatal defect in realism. Therefore, a real-time speech-driven synthetic talking head, or so-called audio-to-visual synthesis system, is desirable, as it can provide an effective interface for many applications, e.g. image communication [1], [24], video conferencing [12], [7], video processing [8], talking head representation of agents [26], and telephone conversion for people with impaired hearing [22].

An audio-to-visual synthesis system needs a model that describes the correspondence between acoustic parameters and mouth-shape parameters. In other words, the corresponding visual information is to be estimated from given acoustic parameters, such as phonemes, cepstral coefficients, or line spectrum pairs. The visual information can be images or mouth movement parameters. Mouth images were used in the work of Bregler et al. [6] to provide a realistic representation; however, the complexity of stitching the images together and the limited viewing angle reduce its practicality.

A number of algorithms have been proposed for the task of mapping between acoustic parameters and visual parameters. The conversion problem is treated as one of finding the best approximation from given sets of training data. These approaches were briefly discussed in Chen and Rao [10], and include vector quantization [25], hidden Markov models (HMMs) [2], [3], [9], [13], [31], and neural networks [19], [20], [30]. However, speech-driven systems are generally built to be user-independent so as to achieve satisfactory average performance, which implies reduced accuracy for any specific user. To maintain high performance, a time-consuming retraining procedure for a new user-dependent model is unavoidable, since no adaptation method for this application has been reported in the literature.

On the other hand, speaker adaptation methods have been studied extensively in the speech recognition field. They fall into two main categories. The first is the eigenvector-based speaker adaptation method [4], [5], which applies normalization at both the training end and the recognition end to deal with the variety of acoustic characteristics caused by different vocal tracts. The other is based on the acoustic model and is simpler than the former, since normalization of the training data is not necessary. A user-independent model is first established statistically from the training data of several speakers, and its parameters are then modified with a certain amount of adaptation data from a new user. The adaptation schemes include maximum a posteriori (MAP) estimation [11], [17], [27], [28], maximum likelihood linear regression (MLLR) [18], [21], [32], vector field smoothing (VFS) [29], and nonlinear neural networks [16]. These methods adjust the model parameters to maximize the likelihood of the new observation data. Among them, MLLR is the most widely adopted because of its simplicity and its effectiveness when the amount of adaptation data is small.

In this study, we try to integrate the MLLR adaptation approach with the Gaussian mixture model (GMM) based audio-to-visual conversion, because MLLR was first used for speaker adaptation of continuous-density hidden Markov models and the GMM is the kernel distribution of an HMM state. If the audio-to-visual conversion model could be adapted with both audio and visual adaptation data, the task would be exactly the same as that in [21]. However, obtaining precise visual adaptation data for a new user is not feasible in an ordinary environment, since markers, infrared cameras, and post-processing (as in the training phase) would be needed. This makes MLLR inadequate for adapting only the audio parameters while keeping the visual part unchanged. In other words, we require another adaptation method by which the new model maps the audio parameters of a new user to the original visual movements.

A new adaptation method, called partial linear regression (PLR), is proposed in this paper. It is derived from MLLR and put into practice in an audio-driven talking head system (Fig. 1). Rather than a time-consuming retraining procedure, a simple adaptation with a small amount of additional data is sufficient to adjust the model so that it is more applicable to the new user.

The rest of the paper is organized as follows. In Section 2, we describe the audio-driven talking head system which uses the Gaussian mixture model to represent the relationship between audio and video feature vectors. The audio-to-visual conversion is also mentioned. Section 3 provides a review of MLLR and a detailed description of the proposed PLR model adaptation algorithm. Some experimental results are described in Section 4, and Section 5 concludes the paper.

Section snippets

System architecture

The flowcharts of the training and the testing phases of our audio-driven talking head system are described in Fig. 1. In the audio signal processing, we extract 10th-order line spectrum pair (LSP) coefficients [14] from every audio frame of 240 samples. In the training phase, the frame rates of the audio and video signal generally differ from each other. After labeling the beginning and the ending points of every training word manually, we use linear interpolation to align the audio and visual
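As a rough illustration (not the authors' code), the sketch below shows how per-frame audio features could be aligned to the video frame rate by linear interpolation, as described above; the LSP extraction itself is abstracted behind a hypothetical placeholder, and all function and variable names are assumptions.

```python
import numpy as np

def align_audio_to_video(audio_feats, n_video_frames):
    """Linearly interpolate per-frame audio features so that the audio
    sequence has one feature vector per video frame.

    audio_feats: (n_audio_frames, dim) array, e.g. 10th-order LSP vectors
    n_video_frames: number of video (motion-capture) frames for the same word
    """
    n_audio_frames, dim = audio_feats.shape
    # Normalized time axes over the labeled word.
    src_t = np.linspace(0.0, 1.0, n_audio_frames)
    dst_t = np.linspace(0.0, 1.0, n_video_frames)
    aligned = np.empty((n_video_frames, dim))
    for d in range(dim):
        aligned[:, d] = np.interp(dst_t, src_t, audio_feats[:, d])
    return aligned

# Example usage (extract_lsp is a hypothetical 10th-order LSP extractor
# operating on 240-sample frames of 8 kHz audio):
# lsp = extract_lsp(wave, frame_len=240, order=10)
# aligned_lsp = align_audio_to_video(lsp, n_video_frames)
```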

Maximum likelihood linear regression

In MLLR mean adaptation [21], the purpose is to maximize the likelihood of the new observation data such that
$$P(O \mid \pi_i, \hat{\mu}_i, \bar{S}_i) \geq P(O \mid \pi_i, \mu_i, \bar{S}_i),$$
by linear-regressively adjusting the mean vector of every Gaussian kernel, i.e.
$$\hat{\mu}_i = \bar{W} \begin{bmatrix} 1 \\ \mu_i \end{bmatrix}.$$

With the auxiliary function $Q$ defined as
$$Q(\lambda, \hat{\lambda}) = P(O \mid \lambda)\,\log p(O \mid \hat{\lambda}),$$
we can find the optimal value of the matrix $\bar{W}$ by differentiating $Q$ with respect to $\bar{W}$ and setting the derivative to zero.
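For reference, a standard form of the resulting estimation equation from Leggetter and Woodland [21] is shown below; the notation $\gamma_i(\tau)$ for the occupation probability of kernel $i$ at frame $\tau$ and $\xi_i = [1\;\mu_i^{\mathsf T}]^{\mathsf T}$ for the extended mean vector is introduced here only for illustration and is not taken from this paper:
$$\sum_{\tau}\sum_{i} \gamma_i(\tau)\, \bar{S}_i^{-1}\, o(\tau)\, \xi_i^{\mathsf T} \;=\; \sum_{\tau}\sum_{i} \gamma_i(\tau)\, \bar{S}_i^{-1}\, \bar{W}\, \xi_i\, \xi_i^{\mathsf T},$$
which can be solved row by row for $\bar{W}$ when the covariances $\bar{S}_i$ are diagonal.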

Partial linear regression

The MLLR method performs well
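The snippet above is truncated. Based on the abstract and conclusion, PLR updates only the audio portion of each joint audio-visual mean vector while leaving the visual portion, and hence the audio-to-visual correspondence, untouched. The following sketch is a minimal illustration of that partitioning idea, not the authors' algorithm; the layout of the mean vectors, the form of the regression matrix, and all names are assumptions.

```python
import numpy as np

def plr_adapt_means(means, audio_dim, W_audio):
    """Illustrative partial-regression update of GMM mean vectors.

    means:     (n_kernels, audio_dim + visual_dim) joint audio-visual means
    audio_dim: number of audio dimensions at the front of each mean vector
    W_audio:   (audio_dim, audio_dim + 1) regression matrix estimated from
               audio-only adaptation data (offset term included)
    Returns adapted means: the audio block is transformed, the visual block
    (and therefore the audio-to-visual relationship) is kept unchanged.
    """
    adapted = means.copy()
    for k in range(means.shape[0]):
        mu_a = means[k, :audio_dim]
        ext = np.concatenate(([1.0], mu_a))     # extended mean [1, mu_a]
        adapted[k, :audio_dim] = W_audio @ ext  # adapt the audio part only
    return adapted
```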

Experimental data

The ground-truth audio data were captured with a microphone at 8 kHz, 16 bits, mono, and the facial expressions were captured by six infrared cameras at 120 fps, with 27 markers (one of them serving as the root) attached to feature points on the user's face (Fig. 6). Even though six infrared cameras were used for motion capture, the recorded data were still noisy, and a time-consuming post-processing step was required to clean them.

For each of the three male subjects in our experiment, we

Conclusion

We have proposed a new adaptation algorithm based on partial linear regression. The PLR method can update part of the mean vector of a Gaussian mixture model while keeping the corresponding audio-visual relationship unchanged. This is important because precise visual data for a new user cannot be obtained easily, so only audio information may be collected in the adaptation procedure. As the experimental results in Table 1 show, we can derive a more adequate model for the new user via the PLR adaptation

Acknowledgement

This research was conducted under the project “Video-driven 3D Synthetic Facial Animation”, supported by OES Laboratories, Industrial Technology Research Institute (ITRI), Taiwan.

References (32)

  • C.J. Leggetter et al., Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language (1995)
  • K. Aizawa et al., Model-based image coding, Proc. IEEE (August 1995)
  • P.S. Aleksic, A.K. Katsaggelos, Speech-to-video synthesis using facial animation parameters, Proceedings of 2003...
  • P.S. Aleksic et al., Speech-to-video synthesis using MPEG-4 compliant visual features, IEEE Trans. Circuits Systems Video Technol. (May 2004)
  • T. Anastasakos, J. McDonough, R. Schwartz, J. Makhoul, A compact model for speaker adaptive training, ICSLP,...
  • T. Anastasakos, J. McDonough, R. Schwartz, J. Makhoul, A compact model for speaker adaptive training a maximum...
  • C. Bregler, M. Covell, M. Slaney, Video rewrite: driving visual speech with audio, in: Proceedings of the International...
  • Y.J. Chang, C.K. Hsieh, P.W. Hsu, Y.C. Chen, Speech-assisted facial expression analysis and synthesis for virtual...
  • T. Chen, H.P. Graf, K. Wang, Speech-assisted video processing: interpolation and low-bitrate coding, 1994 Conference...
  • T. Chen, R.R. Rao, Audio-visual interaction in multimedia communication, in: Proceedings of the ICASSP, Munich,...
  • T. Chen et al., Audio-visual integration in multimodal communication, Proc. IEEE (May 1998)
  • J.T. Chien, C.H. Lee, H.H. Wang, Improved Bayesian learning of hidden Markov models for speaker adaptation, in:...
  • C.S. Choi et al., Analysis and synthesis of facial image sequences in model-based image coding, IEEE Trans. Circuits Systems Video Technol. (June 1994)
  • K.H. Choi, J.H. Lee, Constrained optimization for a speech-driven talking head, Proceedings of the ISCAS, vol. 2, May...
  • J.R. Deller et al., Discrete-time Processing of Speech Signals (September 1999)
  • A.P. Dempster et al., Maximum-likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B (1977)