Neurocomputing

Volume 149, Part C, 3 February 2015, Pages 1535-1543

Monocular face reconstruction with global and local shape constraints

https://doi.org/10.1016/j.neucom.2014.08.039

Abstract

To reconstruct a 3D face from a single monocular image, this paper proposes an approach that comprises three steps. First, a set of 3D facial features is recovered from 2D features extracted from the image, by solving equations derived from a regularized scaled orthogonal projection. The regularization imposes a global shape constraint that exploits a prior reference 3D facial shape. Second, we warp a high-resolution reference 3D face, using both the recovered 3D features and a local shape constraint at each model point. Last, a realistic 3D face is obtained through texture synthesis. Compared with an existing approach, the proposed feature recovery method has higher accuracy and is robust to facial pose variations in the given image. Moreover, the model warping method based on local shape constraints can warp a high-resolution reference 3D face from only a few 3D features more reasonably and accurately. The proposed approach generates realistic 3D faces with an impressive visual effect.

Introduction

3D face reconstruction from a single monocular image has received considerable attention from researchers in computer vision and computer animation, and for decades it has found use in digital entertainment such as film and game production.

Recently, 3D face reconstruction has also enabled newly emerged applications such as human-computer interaction [1], [2], electronically mediated communication [3] and public security [4], [5]. In [2], the authors developed an automatic face tracking and lip-reading system built on a reconstructed 3D face avatar, for speech learning, emotional-state monitoring and the design of non-verbal human-computer interfaces. A real-time facial tracking system was developed in [3] to extract animation control parameters from videos; the system translates these parameters into 3D facial expressions and retargets the expressions to reconstructed 3D faces for applications such as teleconferencing. In visual surveillance [5], face cues were combined with gait cues as biometric features for person identification.

Prior information is not indispensable for 3D reconstruction from multiple input views [6], but it is necessary for monocular reconstruction or pose estimation [7], [8]. Specifically, monocular 3D face reconstruction is a highly ill-posed problem, so the reconstruction process usually needs additional constraints derived from prior knowledge. The most common constraint in face reconstruction is the shape constraint, which is usually encoded in 3D sample faces. The most notable related work is the family of approaches based on Morphable Models [9], [10]. A Morphable Model is a statistical model constructed by linearly combining a set of 3D sample faces; the desired 3D face is generated by tuning the parameters (combining coefficients) of the model. The optimal parameters are determined by fitting the Morphable Model to the given image so that the 2D projection of the model matches the face in the image. A sample face usually comprises tens of thousands of 3D points, and matching the 2D projection of all these points to image pixels incurs great computational cost, so Morphable Model approaches usually have low computational efficiency. Furthermore, the shape constraint imposed by the sample faces is a global constraint: it acts simultaneously on all points of the face model, so no local face region can be adjusted by tuning the model parameters.
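As a minimal illustration of the linear combination underlying a Morphable Model (the function name and the toy dimensions below are ours for illustration; a real model would use registered 3D scans and, typically, PCA shape modes):

```python
import numpy as np

def morphable_face(mean_shape, basis, coeffs):
    """Generate a 3D face as a linear combination of sample-face modes.

    mean_shape : (3n,)   mean of the registered sample faces (stacked x, y, z)
    basis      : (3n, k) shape modes derived from the samples
    coeffs     : (k,)    combining coefficients, fitted to the input image
    """
    return mean_shape + basis @ coeffs

# Toy example: 4 points (12 coordinates), 2 modes -- illustrative values only
rng = np.random.default_rng(0)
mean = rng.normal(size=12)
modes = rng.normal(size=(12, 2))
face = morphable_face(mean, modes, np.array([0.5, -0.2]))
```

Note that the coefficients act on every coordinate at once through the shared basis, which is exactly why the constraint is global: there is no parameter that moves only one local face region.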

Instead of using all 3D points, some researchers compute the model parameters by fitting a few salient 3D facial feature points to corresponding image feature points [11], [12], [13], [14]. Feature-based fitting speeds up the computation enormously, but the parameter estimation in the above works is unsatisfactory because it relies on an alternating least-squares method that is not derived from the optimality conditions of the objective function. In addition, these improvements fail to exploit powerful methods such as sparse representation [15], [16] and novel distance metric learning [17], [18], [19], [20], which have proved effective for linear approximation problems. Most importantly, the shape constraint here is still a global one, and some approaches [13] can only recover a few 3D feature points, so the 3D face model must be generated by warping a high-resolution reference 3D face with those feature points. The prevalent warping method is scattered data interpolation [21], which measures Euclidean distances directly between feature points and landmark points distributed on the model surface; the warping is therefore still conducted globally and rarely achieves an ideal result with only a few feature points.
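A minimal sketch of the scattered data interpolation criticized above, using Gaussian radial basis functions of the Euclidean distance (the Gaussian kernel, the `sigma` width and the tiny regularizer are our assumptions; [21] may use a different kernel):

```python
import numpy as np

def rbf_warp(model_pts, landmark_pts, landmark_disp, sigma=1.0):
    """Warp every model vertex from displacements known only at landmarks.

    model_pts     : (m, 3) vertices of the high-resolution reference face
    landmark_pts  : (k, 3) landmark positions on the reference face
    landmark_disp : (k, 3) displacements recovered for the landmarks
    """
    # Pairwise Euclidean distances: landmark-to-landmark and vertex-to-landmark
    d_ll = np.linalg.norm(landmark_pts[:, None] - landmark_pts[None], axis=-1)
    d_ml = np.linalg.norm(model_pts[:, None] - landmark_pts[None], axis=-1)
    phi = lambda d: np.exp(-(d / sigma) ** 2)
    # Solve for weights so the warp reproduces the landmark displacements
    w = np.linalg.solve(phi(d_ll) + 1e-8 * np.eye(len(landmark_pts)),
                        landmark_disp)
    return model_pts + phi(d_ml) @ w
```

Because every vertex is influenced through the same distance-based kernel, a handful of landmarks gives the warp little local control, which is precisely the limitation the text points out.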

The aforementioned methods require a large number of sample face models and a detailed, accurate point-wise correspondence between all the models, which prevents their wider use when sample face models are unavailable. In recent years, some researchers have proposed to regularize 3D face reconstruction with a shape constraint derived from a prior face model [22], [23], [24]. In [22], the authors detected facial features around the eyes, mouth, eyebrows and face contour in a given face image, and adapted a generic 3D model into a face-specific 3D model using geometric transformations. Similar work can be found in [23], where a generic 3D face was projected onto the image plane to fit its 2D projection to the input face. Shape regularization is only implicitly used in these two methods: [22] relies on local translations to achieve model fitting, and the depth of the reconstructed 3D face in [23] comes entirely from the generic face, so the regularization results are unsatisfactory. In [24], the regularization was explicitly introduced into the problem formulation, which used the prior face model and albedo to extract illumination and pose information for 3D reconstruction. However, the result varies significantly depending on which prior model is used, and the approach requires considerable manual work to register the image with the prior face.

To this end, we propose a novel approach to 3D face reconstruction. Unlike Morphable Model approaches, we reconstruct the unknown 3D face by fitting it directly to the given face image through a scaled orthogonal projection. To deal with the ill-posedness, we regularize the projection explicitly with a global shape constraint constructed from a reference 3D face. To ensure efficiency, only a few 3D facial features are fitted to a set of image feature points. We then obtain a high-resolution 3D face by warping a reference 3D face model with the recovered 3D features. Unlike previous approaches, the warping is based on a local shape constraint at each point of the face model [25]; the local shape constraints convey the characteristics of local face regions to the reconstructed 3D face. A realistic 3D face is generated after texture mapping. Our approach has the following advantages. First, the feature recovery reduces to solving several equations and is therefore very fast. Second, the recovered features are very close to the optimal solution thanks to the global shape constraint, and the recovery is not sensitive to the facial pose in the given image. Last, warping based on local shape constraints is superior to scattered data interpolation in that it achieves better results from only a few 3D features.
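To make the regularized fit concrete, here is a hedged per-point sketch: given the pose (λ, R) of a scaled orthogonal projection p = λRs, minimizing ‖p_i − λRs_i‖² + μ‖s_i − s_ref,i‖² over each 3D feature s_i has a closed form. The weight μ and this exact per-point formulation are illustrative assumptions of ours, not the paper's derivation:

```python
import numpy as np

def recover_features(p2d, s_ref, R, lam, mu=0.1):
    """Closed-form 3D feature recovery under p = lam * R @ s,
    regularized toward a reference shape (sketch of a global shape
    constraint; mu is an assumed regularization weight).

    p2d   : (n, 2) image feature points
    s_ref : (n, 3) corresponding features on the reference 3D face
    R     : (2, 3) the two orthonormal rows of a rotation
    lam   : scale of the orthogonal projection
    """
    # Setting the gradient of each per-point objective to zero gives
    # (lam^2 R^T R + mu I) s_i = lam R^T p_i + mu s_ref_i
    A = lam ** 2 * R.T @ R + mu * np.eye(3)
    b = lam * p2d @ R + mu * s_ref          # (n, 3) right-hand sides
    return np.linalg.solve(A, b.T).T
```

Without the μ-term the 3×3 system is rank-deficient (the projection discards depth), which is the ill-posedness the global shape constraint resolves.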

The rest of this paper is organized as follows. In the next section, we describe our 3D feature recovery algorithm based on the global shape constraint in detail. In Section 3, we discuss the high-resolution 3D face reconstruction method, as well as the local shape constraints. Texture mapping is introduced in Section 4. Section 5 presents experimental results, and we conclude in the last section.

Section snippets

3D feature recovery

Twenty feature points are extracted from the input image by the Active Appearance Model (AAM) [26] for model fitting. Fig. 1 shows some examples selected from the Pointing face database [28]. The extraction is quite robust when the horizontal viewpoint varies within ±45°. Besides the 2D image features, we manually select the corresponding 20 features on a prior reference human face to construct the global shape constraint. See Fig. 2 for the reference 3D face with the reference features (marked by red dots). The

Face warping based on local shape constraints

The recovered points S carry personalized characteristics of the input face image. However, so far they form only a sparsely distributed set of 3D feature points. The high-resolution, personalized 3D face is constructed by warping a reference 3D face using these recovered 3D points. In this paper, we scan several human faces with a Cyberware 3D scanner to obtain several 3D face models, and create the reference 3D face by preprocessing and averaging these models. The

Texture mapping

Texture mapping transfers the texture of the input image onto the reconstructed face. The key to texture mapping is to determine the mapping from the input image to the 3D face. The previous section determines the projection from 3D space to the 2D space of the input: for a point s ∈ ℝ³, p = λRs. According to this projection, we can synthesize the texture for the 3D face as follows: let v be an arbitrary point on the 3D face and u = λRv; find the nearest point p_i to u in the input image. Then the texture t_i = [
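The nearest-pixel lookup described above can be sketched as follows (rounding and clamping to the image bounds is our assumed implementation of "nearest point"):

```python
import numpy as np

def synthesize_texture(vertices, lam, R, image):
    """Transfer image color to each model vertex via u = lam * R @ v,
    sampling the nearest pixel for each projected vertex.

    vertices : (m, 3)    points on the reconstructed 3D face
    lam, R   : scale and (2, 3) rotation rows of the fitted projection
    image    : (H, W, 3) input face image
    """
    u = lam * vertices @ R.T                          # (m, 2) projections
    h, w = image.shape[:2]
    # Nearest pixel: round each projection and clamp to the image bounds
    cols = np.clip(np.rint(u[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(u[:, 1]).astype(int), 0, h - 1)
    return image[rows, cols]                          # (m, 3) vertex colors
```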

Experiments and discussion

In this section, we demonstrate the results of the proposed algorithms for 3D feature recovery and personalized 3D face reconstruction from a 2D image. We compare the accuracy of the proposed feature recovery algorithm based on the global shape constraint with that of an existing approach [13], and show the strong adaptability of the proposed algorithm to varying poses of input face images of persons of different ages, races and genders. The following errors ε_R = ‖R̃ − R‖_F and ε_S = (1/n) ∑_{i=1}^{n} ‖s̃_i − s_i‖
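The two error measures, a Frobenius-norm rotation error and a mean per-point shape error, can be computed directly (a straightforward implementation of the formulas above):

```python
import numpy as np

def recovery_errors(R_est, R_true, S_est, S_true):
    """eps_R = ||R_est - R_true||_F (Frobenius norm of the rotation error)
    eps_S = (1/n) * sum_i ||s_est_i - s_true_i|| (mean point-wise error).

    R_est, R_true : (2, 3) estimated / ground-truth projection rows
    S_est, S_true : (n, 3) estimated / ground-truth 3D feature points
    """
    eps_R = np.linalg.norm(R_est - R_true)                  # Frobenius norm
    eps_S = np.mean(np.linalg.norm(S_est - S_true, axis=1)) # mean over points
    return eps_R, eps_S
```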

Conclusion

In this paper, we propose a novel approach for monocular high-resolution 3D face reconstruction. Our approach consists of three steps. The first is accurate 3D facial feature recovery from 2D image features, based on a scaled orthogonal projection regularized by a global shape constraint. The second is a model warping method with local shape constraints that constructs a personalized 3D face from the recovered 3D features and a reference 3D face. The last is texture synthesis. Our

Jian Zhang received his B.E. degree and M.E. degree from Shandong University of Science and Technology in 2000 and 2003 respectively, and received the Ph.D. degree from Zhejiang University in 2008. He is currently working in the school of science & technology of Zhejiang International Studies University, as an associate professor for computer science. His research interests include machine learning, computer vision, computer graphics and multimedia processing.

References (31)

  • W. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. C Appl. Rev. (2004)
  • M. Song et al., Three-dimensional face reconstruction from a single image by a coupled RBF network, IEEE Trans. Image Process. (2012)
  • V. Blanz, P. Grother, P. Phillips, T. Vetter, Face recognition based on frontal views generated from non-frontal...
  • V. Blanz, T. Vetter, A morphable model for the synthesis of 3D faces, in: Proceedings of SIGGRAPH'99, 1999, pp....
  • U. Park et al., Age invariant face recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
    Dapeng Tao received a B.E. degree from Northwestern Polytechnical University and a Ph.D. degree from South China University of Technology, respectively. He is currently with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, as an Associate Professor for Human Machine Control. Over the past years, his research interests include machine learning, computer vision and cloud computing.

    Xiangjuan Bian received the B.E. degree in 2001 from Northeastern University, and received the M.E. degree in 2004 from Kunming University of Science and Technology. She is a lecturer in the school of science & technology of Zhejiang International Studies University. Her research focuses on multi-field model construction and simulation and CIMS.

    Xiaosi Zhan received his M.E. degree from Jilin University in 2001, and received the Ph.D. degree in computer science from Nanjing University in 2004. He is a professor working in the school of science & technology of Zhejiang International Studies University. His research interests include image processing and pattern recognition, machine learning, biometric recognition.

This work was supported by the National Natural Science Foundation of China (No. 61303143) and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201326609).
