Neurocomputing

Volume 149, Part C, 3 February 2015, Pages 1535-1543

Monocular face reconstruction with global and local shape constraints

https://doi.org/10.1016/j.neucom.2014.08.039

Abstract

To reconstruct a 3D face from a single monocular image, this paper proposes an approach that comprises three steps. First, a set of 3D facial features is recovered from 2D features extracted from the image, by solving equations derived from a regularized scaled orthogonal projection. The regularization imposes a global shape constraint that exploits a prior reference 3D facial shape. Second, we warp a high-resolution reference 3D face, using both the recovered 3D features and a local shape constraint at each model point. Last, a realistic 3D face is obtained through texture synthesis. Compared with an existing approach, the proposed feature recovery method has higher accuracy and is robust to facial pose variations in the given image. Moreover, the model warping method based on local shape constraints can warp a high-resolution reference 3D face from only a few 3D features more reasonably and accurately. The proposed approach generates realistic 3D faces with an impressive visual effect.

Introduction

3D face reconstruction from a single monocular image has received considerable attention from researchers in computer vision and computer animation, and for decades it has found use in digital entertainment such as film and game production.

Recently, 3D face reconstruction has also enabled newly emerged applications such as human-computer interaction [1], [2], electronically mediated communication [3] and public security [4], [5]. In [2], the authors developed an automatic face tracking and lip-reading system built on a reconstructed 3D face avatar, for speech learning, emotional-state monitoring and the design of non-verbal human-computer interfaces. A real-time facial tracking system was developed in [3] to extract animation control parameters from videos; the system translates these parameters into 3D facial expressions and retargets the expressions to reconstructed 3D faces for applications such as teleconferencing. In visual surveillance [5], face cues were combined with gait cues as biometric features for person identification.

Prior information is not indispensable for 3D reconstruction from multiple input views [6], but it is necessary for monocular reconstruction or pose estimation [7], [8]. Specifically, monocular 3D face reconstruction is a highly ill-posed problem, so the reconstruction process usually needs additional constraints derived from prior knowledge. The most common constraint in face reconstruction is the shape constraint, which is usually encoded in 3D sample faces. The most notable related work is the family of approaches based on Morphable Models [9], [10]. A Morphable Model is a statistical model constructed by linearly combining a set of 3D sample faces; the desired 3D face is generated by tuning the parameters (combining coefficients) of the model. The optimal parameters are determined by fitting the Morphable Model to the given image so that the 2D projection of the model matches the face in the image. A sample face usually comprises tens of thousands of 3D points, and matching the 2D projection of all these points to image pixels incurs great computational cost, so Morphable Model approaches usually have low computational efficiency. Furthermore, the shape constraint imposed by the sample faces is a global constraint: it acts simultaneously on all points of the face model, so no local face region can be adjusted by tuning the model parameters.
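As a minimal illustration of the linear combination underlying a Morphable Model (the function name and the toy dimensions below are ours for illustration; a real model would use registered 3D scans and, typically, PCA shape modes):

```python
import numpy as np

def morphable_face(mean_shape, basis, coeffs):
    """Generate a 3D face as a linear combination of sample-face modes.

    mean_shape : (3n,)   mean of the registered sample faces (stacked x, y, z)
    basis      : (3n, k) shape modes derived from the samples
    coeffs     : (k,)    combining coefficients, fitted to the input image
    """
    return mean_shape + basis @ coeffs

# Toy example: 4 points (12 coordinates), 2 modes -- illustrative values only
rng = np.random.default_rng(0)
mean = rng.normal(size=12)
modes = rng.normal(size=(12, 2))
face = morphable_face(mean, modes, np.array([0.5, -0.2]))
```

Note that the coefficients act on every coordinate at once through the shared basis, which is exactly why the constraint is global: there is no parameter that moves only one local face region.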

Instead of using all 3D points, some researchers compute the model parameters by fitting a few salient 3D facial feature points to corresponding image feature points [11], [12], [13], [14]. Feature-based fitting speeds up the computation enormously, but the parameter estimation in the above works is unsatisfactory because it relies on an alternating least-squares method that is not derived from the optimality conditions of the objective function. In addition, these improvements fail to exploit powerful methods such as sparse representation [15], [16] and novel distance metric learning [17], [18], [19], [20], which have proved effective for linear approximation problems. Most importantly, the shape constraint here is still a global one, and some approaches [13] can only recover a few 3D feature points, so the 3D face model must be generated by warping a high-resolution reference 3D face with those feature points. The prevalent warping method is scattered data interpolation [21], which measures Euclidean distances directly between feature points and landmark points distributed on the model surface; the warping is therefore still conducted globally and rarely achieves an ideal result with only a few feature points.
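A minimal sketch of the scattered data interpolation criticized above, using Gaussian radial basis functions of the Euclidean distance (the Gaussian kernel, the `sigma` width and the tiny regularizer are our assumptions; [21] may use a different kernel):

```python
import numpy as np

def rbf_warp(model_pts, landmark_pts, landmark_disp, sigma=1.0):
    """Warp every model vertex from displacements known only at landmarks.

    model_pts     : (m, 3) vertices of the high-resolution reference face
    landmark_pts  : (k, 3) landmark positions on the reference face
    landmark_disp : (k, 3) displacements recovered for the landmarks
    """
    # Pairwise Euclidean distances: landmark-to-landmark and vertex-to-landmark
    d_ll = np.linalg.norm(landmark_pts[:, None] - landmark_pts[None], axis=-1)
    d_ml = np.linalg.norm(model_pts[:, None] - landmark_pts[None], axis=-1)
    phi = lambda d: np.exp(-(d / sigma) ** 2)
    # Solve for weights so the warp reproduces the landmark displacements
    w = np.linalg.solve(phi(d_ll) + 1e-8 * np.eye(len(landmark_pts)),
                        landmark_disp)
    return model_pts + phi(d_ml) @ w
```

Because every vertex is influenced through the same distance-based kernel, a handful of landmarks gives the warp little local control, which is precisely the limitation the text points out.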

The aforementioned methods require a large number of sample face models and a detailed, accurate point-wise correspondence between all the models, which prevents their wider use when sample face models are unavailable. In recent years, some researchers have proposed to regularize 3D face reconstruction with a shape constraint derived from a prior face model [22], [23], [24]. In [22], the authors detected facial features around the eyes, mouth, eyebrows and face contour in a given face image, and adapted a generic 3D model into a face-specific 3D model using geometric transformations. Similar work can be found in [23], where a generic 3D face was projected onto the image plane to fit its 2D projection to the input face. Shape regularization is only implicitly used in these two methods: [22] relies on local translations to achieve model fitting, and the depth of the reconstructed 3D face in [23] comes entirely from the generic face, so the regularization results are unsatisfactory. In [24], the regularization was explicitly introduced into the problem formulation, which used the prior face model and albedo to extract illumination and pose information for 3D reconstruction. However, the result varies significantly depending on which prior model is used, and the approach requires considerable manual work to register the image with the prior face.

To this end, we propose a novel approach to 3D face reconstruction. Unlike Morphable Model approaches, we reconstruct the unknown 3D face by fitting it directly to the given face image through a scaled orthogonal projection. To deal with the ill-posedness, we regularize the projection explicitly with a global shape constraint constructed from a reference 3D face. To ensure efficiency, only a few 3D facial features are fitted to a set of image feature points. We then obtain a high-resolution 3D face by warping a reference 3D face model with the recovered 3D features. Unlike previous approaches, the warping is based on a local shape constraint at each point of the face model [25]; the local shape constraints convey the characteristics of local face regions to the reconstructed 3D face. A realistic 3D face is generated after texture mapping. Our approach has the following advantages. First, the feature recovery reduces to solving several equations and is therefore very fast. Second, the recovered features are very close to the optimal solution thanks to the global shape constraint, and the recovery is not sensitive to the facial pose in the given image. Last, warping based on local shape constraints is superior to scattered data interpolation in that it achieves better results from only a few 3D features.
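To make the regularized fit concrete, here is a hedged per-point sketch: given the pose (λ, R) of a scaled orthogonal projection p = λRs, minimizing ‖p_i − λRs_i‖² + μ‖s_i − s_ref,i‖² over each 3D feature s_i has a closed form. The weight μ and this exact per-point formulation are illustrative assumptions of ours, not the paper's derivation:

```python
import numpy as np

def recover_features(p2d, s_ref, R, lam, mu=0.1):
    """Closed-form 3D feature recovery under p = lam * R @ s,
    regularized toward a reference shape (sketch of a global shape
    constraint; mu is an assumed regularization weight).

    p2d   : (n, 2) image feature points
    s_ref : (n, 3) corresponding features on the reference 3D face
    R     : (2, 3) the two orthonormal rows of a rotation
    lam   : scale of the orthogonal projection
    """
    # Setting the gradient of each per-point objective to zero gives
    # (lam^2 R^T R + mu I) s_i = lam R^T p_i + mu s_ref_i
    A = lam ** 2 * R.T @ R + mu * np.eye(3)
    b = lam * p2d @ R + mu * s_ref          # (n, 3) right-hand sides
    return np.linalg.solve(A, b.T).T
```

Without the μ-term the 3×3 system is rank-deficient (the projection discards depth), which is the ill-posedness the global shape constraint resolves.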

The rest of this paper is organized as follows. In the next section, we describe our 3D feature recovery algorithm based on the global shape constraint in detail. In Section 3, we discuss the high-resolution 3D face reconstruction method, as well as the local shape constraints. Texture mapping is introduced in Section 4. Section 5 presents experimental results, and we conclude in the last section.

Section snippets

3D feature recovery

Twenty feature points are extracted from the input image by the Active Appearance Model (AAM) [26] for model fitting. Fig. 1 shows some examples selected from the Pointing face database [28]. The extraction is quite robust when the horizontal viewpoint varies within ±45°. Besides the 2D image features, we manually select the corresponding 20 features on a prior reference human face to construct the global shape constraint. See Fig. 2 for the reference 3D face with the reference features (marked by red dots). The

Face warping based on local shape constraints

The recovered points S carry personalized characteristics of the input face image. However, so far they form only a sparsely distributed set of 3D feature points. The high-resolution, personalized 3D face is constructed by warping a reference 3D face using these recovered 3D points. In this paper, we scan several human faces with a Cyberware 3D scanner to obtain several 3D face models, and create the reference 3D face by preprocessing and averaging these models. The

Texture mapping

Texture mapping transfers the texture of the input image onto the reconstructed face. The key to texture mapping is to determine the mapping from the input image to the 3D face. The previous section determines the projection from 3D space to the 2D space of the input: for a point s ∈ ℝ³, p = λRs. According to this projection, we can synthesize the texture for the 3D face as follows: let v be an arbitrary point on the 3D face and u = λRv; find the nearest point p_i to u in the input image. Then the texture t_i = [
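The nearest-pixel lookup described above can be sketched as follows (rounding and clamping to the image bounds is our assumed implementation of "nearest point"):

```python
import numpy as np

def synthesize_texture(vertices, lam, R, image):
    """Transfer image color to each model vertex via u = lam * R @ v,
    sampling the nearest pixel for each projected vertex.

    vertices : (m, 3)    points on the reconstructed 3D face
    lam, R   : scale and (2, 3) rotation rows of the fitted projection
    image    : (H, W, 3) input face image
    """
    u = lam * vertices @ R.T                          # (m, 2) projections
    h, w = image.shape[:2]
    # Nearest pixel: round each projection and clamp to the image bounds
    cols = np.clip(np.rint(u[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(u[:, 1]).astype(int), 0, h - 1)
    return image[rows, cols]                          # (m, 3) vertex colors
```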

Experiments and discussion

In this section, we demonstrate the results of the proposed algorithms for 3D feature recovery and personalized 3D face reconstruction from a 2D image. We compare the accuracy of the proposed feature recovery algorithm based on the global shape constraint with that of an existing approach [13], and show the strong adaptability of the proposed algorithm to varying poses of input face images of persons of different ages, races and genders. The following errors ε_R = ‖R̃ − R‖_F and ε_S = (1/n) ∑_{i=1}^{n} ‖s̃_i − s_i‖
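The two error measures, a Frobenius-norm rotation error and a mean per-point shape error, can be computed directly (a straightforward implementation of the formulas above):

```python
import numpy as np

def recovery_errors(R_est, R_true, S_est, S_true):
    """eps_R = ||R_est - R_true||_F (Frobenius norm of the rotation error)
    eps_S = (1/n) * sum_i ||s_est_i - s_true_i|| (mean point-wise error).

    R_est, R_true : (2, 3) estimated / ground-truth projection rows
    S_est, S_true : (n, 3) estimated / ground-truth 3D feature points
    """
    eps_R = np.linalg.norm(R_est - R_true)                  # Frobenius norm
    eps_S = np.mean(np.linalg.norm(S_est - S_true, axis=1)) # mean over points
    return eps_R, eps_S
```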

Conclusion

In this paper, we propose a novel approach for monocular high-resolution 3D face reconstruction. Our approach consists of three steps. The first is accurate 3D facial feature recovery from 2D image features, based on a scaled orthogonal projection regularized by a global shape constraint. The second is a model warping method with local shape constraints that constructs a personalized 3D face from the recovered 3D features and a reference 3D face. The last is texture synthesis. Our

Jian Zhang received his B.E. degree and M.E. degree from Shandong University of Science and Technology in 2000 and 2003 respectively, and received the Ph.D. degree from Zhejiang University in 2008. He is currently working in the school of science & technology of Zhejiang International Studies University, as an associate professor for computer science. His research interests include machine learning, computer vision, computer graphics and multimedia processing.

References (31)

  • W. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. C Appl. Rev. (2004)
  • M. Song et al., Three-dimensional face reconstruction from a single image by a coupled RBF network, IEEE Trans. Image Process. (2012)
  • V. Blanz, P. Grother, P. Phillips, T. Vetter, Face recognition based on frontal views generated from non-frontal...
  • V. Blanz, T. Vetter, A morphable model for the synthesis of 3D faces, in: Proceedings of SIGGRAPH'99, 1999, pp....
  • U. Park et al., Age invariant face recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
    Dapeng Tao received a B.E. degree from Northwestern Polytechnical University and a Ph.D. degree from South China University of Technology, respectively. He is currently with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, as an Associate Professor for Human Machine Control. Over the past years, his research interests include machine learning, computer vision and cloud computing.

    Xiangjuan Bian received the B.E. degree in 2001 from Northeastern University, and received the M.E. degree in 2004 from Kunming University of Science and Technology. She is a lecturer in the school of science & technology of Zhejiang International Studies University. Her research focuses on multi-field model construction and simulation and CIMS.

    Xiaosi Zhan received his M.E. degree from Jilin University in 2001, and received the Ph.D. degree in computer science from Nanjing University in 2004. He is a professor working in the school of science & technology of Zhejiang International Studies University. His research interests include image processing and pattern recognition, machine learning, biometric recognition.

This work was supported by the National Natural Science Foundation of China (No. 61303143) and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201326609).
