VoD: A novel image representation for head yaw estimation
Introduction
Over the last few decades, there has been significant progress in face recognition research. However, one of the most challenging factors influencing the robustness and accuracy of face recognition is pose variation. To achieve robustness to pose variation, one may have to process face images differently according to their poses. Therefore, head pose estimation has been an active research topic for many years.
More precisely, head pose estimation essentially means computing three types of head rotation: yaw (looking left or right), pitch (looking up or down) and roll (tilting left or right). Among them, the roll rotation can be computed easily from the relative positions of facial feature points, but the other two rotations are rather difficult to estimate. As the estimation of the yaw rotation has many important applications, it attracts more attention than pitch estimation [1], and more research data are available for it. Therefore, in this paper, as most previous works have done, we focus on the challenging problem of estimating the head yaw pose from input face images.
In head pose estimation, one of the crucial steps is to extract an image representation characterizing the pose. Generally speaking, visual features can be roughly categorized into global and local features. While global features encode the holistic configuration of the image, local features encode the detailed traits within a local region. In the literature, many methods combine both global and local features, as they play different roles in the visual perception process. Among them, perhaps the most commonly used is the Bag-of-Words (BoW) model [2], in which local descriptors extracted from an image are first mapped to a set of visual words and the image is then represented as a histogram of visual word occurrences. Recently, Fisher vectors [3], which encode higher-order statistics of local descriptors, have greatly improved on the BoW model for image classification. Instead of encoding only the frequency of visual word occurrences, Fisher vectors encode how the parameters of a generative model should be changed to best represent the image. They can be seen as an extension of BoW, and have been shown to achieve state-of-the-art performance on several challenging object recognition and image retrieval tasks [4], [5].
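To make the Fisher vector construction concrete, the sketch below encodes a set of local descriptors against a diagonal-covariance GMM using the mean and variance gradients of [3], followed by the power and L2 normalization of [4]. The random data, number of Gaussians and descriptor dimension are illustrative placeholders, not the settings used in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) as a 2*K*D Fisher vector:
    gradients of the GMM log-likelihood w.r.t. means and variances."""
    X = np.atleast_2d(descriptors)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=4, covariance_type='diag',
                      random_state=0).fit(rng.normal(size=(500, 9)))
fv = fisher_vector(rng.normal(size=(100, 9)), gmm)
print(fv.shape)                                        # (72,) = 2 * 4 * 9
```

The resulting vector grows linearly with the number of Gaussians and the descriptor dimension, which is why a compact local descriptor keeps the global representation manageable.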
Inspired by these advances, we present a novel image representation for head yaw estimation in this paper. More specifically, the proposed representation encodes a new type of concise 9-dimensional local descriptor with Fisher vectors to describe head images, called the Fisher Vector of local Descriptors, or VoD for short. The VoD representation has been experimentally validated on five head pose datasets (FacePix, Pointing'04, MultiPIE, CAS-PEAL and our own dataset). The results on these datasets show that the proposed representation outperforms the state of the art.
The contribution of this paper is three-fold. First, we propose a 9-dimensional local attribute vector that can be used in the Fisher vector framework. The 9-dimensional vector is extracted at each pixel and contains the pixel's coordinates, intensity, first- and second-order derivatives, and gradient magnitude and orientation. Compared with the SIFT features used in traditional Fisher vectors, the 9-dimensional local descriptor is significantly more efficient to compute. More importantly, despite its conciseness, the descriptor preserves the information essential to head pose estimation.
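A minimal per-pixel implementation of this attribute vector might look like the following; the description above does not fix the derivative filters, so the use of `np.gradient` here is an assumption.

```python
import numpy as np

def local_descriptors(img):
    """Per-pixel 9-D descriptor: x, y, intensity, first-order derivatives
    (Ix, Iy), second-order derivatives (Ixx, Iyy), gradient magnitude
    and gradient orientation. Returns an (H*W, 9) array."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    iy, ix = np.gradient(img)            # derivatives along rows (y) and cols (x)
    iyy = np.gradient(iy, axis=0)
    ixx = np.gradient(ix, axis=1)
    mag = np.hypot(ix, iy)               # gradient magnitude
    ang = np.arctan2(iy, ix)             # gradient orientation
    return np.stack([xs, ys, img, ix, iy, ixx, iyy, mag, ang],
                    axis=-1).reshape(-1, 9)

img = np.arange(16.0).reshape(4, 4)      # toy 4x4 "image"
D = local_descriptors(img)
print(D.shape)                           # (16, 9)
```

Every quantity is a single filtering pass over the image, which is where the efficiency gain over computing dense SIFT comes from.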
Second, the proposed local descriptors are encoded and aggregated into Fisher vectors to form the new VoD representation. To preserve the spatial structure of the head in the global representation, we divide a head image into rectangular bins and compute one VoD per bin.
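The spatial binning can be sketched as below. The grid size and the toy descriptor and encoder functions are placeholders; in the method above, the encoder would be the Fisher vector computed over the descriptors falling in each bin.

```python
import numpy as np

def binned_representation(img, descriptor_fn, encode_fn, grid=(4, 4)):
    """Divide the image into grid[0] x grid[1] rectangular bins, encode
    each bin's local descriptors separately, and concatenate the per-bin
    codes so the global representation keeps spatial structure."""
    rows, cols = grid
    h, w = img.shape
    ys = np.linspace(0, h, rows + 1, dtype=int)
    xs = np.linspace(0, w, cols + 1, dtype=int)
    codes = []
    for i in range(rows):
        for j in range(cols):
            patch = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            codes.append(encode_fn(descriptor_fn(patch)))
    return np.concatenate(codes)

# Toy stand-ins: raw intensity as "descriptor", per-bin mean as "encoder".
img = np.arange(64.0).reshape(8, 8)
rep = binned_representation(img,
                            descriptor_fn=lambda p: p.reshape(-1, 1),
                            encode_fn=lambda d: d.mean(axis=0),
                            grid=(2, 2))
print(rep.shape)                         # (4,): one code per bin
```

Concatenating per-bin codes rather than pooling over the whole image keeps, for example, eye-region statistics distinct from chin-region statistics, which matters for pose.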
Finally, we further improve the discriminative ability of VoD by supervised metric learning. Considering the great success of Keep It Simple and Straightforward Metric Learning (KISSME) [6], we train kVoD from VoD using KISSME in a supervised setting. The resulting kVoD further improves the accuracy of head pose estimation over the state of the art.
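KISSME learns a Mahalanobis metric from the inverse covariances of similar-pair and dissimilar-pair difference vectors [6]. A minimal sketch on toy two-class data (standing in for two pose classes) follows; the final projection onto the PSD cone is a common practical step and an assumption here, not necessarily the authors' exact procedure.

```python
import numpy as np

def kissme(X, similar_pairs, dissimilar_pairs):
    """KISSME sketch: M = inv(Cov_S) - inv(Cov_D), where the covariances
    are taken over pairwise differences of similar and dissimilar index
    pairs, then projected onto the PSD cone to give a valid metric."""
    ds = np.array([X[i] - X[j] for i, j in similar_pairs])
    dd = np.array([X[i] - X[j] for i, j in dissimilar_pairs])
    cov_s = ds.T @ ds / len(ds)          # uncentered: zero-mean assumption
    cov_d = dd.T @ dd / len(dd)
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    vals, vecs = np.linalg.eigh(M)       # clip negative eigenvalues
    return vecs @ np.diag(np.clip(vals, 0, None)) @ vecs.T

def mahalanobis(M, a, b):
    d = a - b
    return float(d @ M @ d)

# Toy data: two well-separated classes standing in for two yaw poses.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 5)),
               rng.normal(3, 1, size=(50, 5))])
sim = [(i, i + 1) for i in range(49)] + [(i, i + 1) for i in range(50, 99)]
dis = [(i, i + 50) for i in range(50)]
M = kissme(X, sim, dis)
```

Under the learned metric, distances between same-class samples shrink relative to distances between different-class samples, which is what makes the representation more discriminative for pose classification.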
The remainder of this paper is organized as follows. Section 2 reviews related methods for head pose estimation. Section 3 introduces the proposed representation in detail. Experiments on five challenging datasets are presented in Section 4 to demonstrate the effectiveness of the proposed representations. Conclusions are drawn in Section 5, with some discussion of future work.
Section snippets
Related work
Head pose estimation from images is a challenging problem due to large variations in illumination, facial expression, subject appearance, occlusion, noise and perspective distortion. A generic (i.e., person-independent) algorithm for head pose estimation has to be robust to such factors. There exists a large body of literature on this topic; see [7] for a review. Broadly speaking, most previous work can be categorized into three groups: algorithms based on facial features [8], [9],
Fisher vectors of local descriptors
This section presents the proposed novel image representation. In the following, we first introduce each component of VoD in detail, and then describe how to further improve its performance with metric learning in a supervised setting.
Experiments
In this section, to validate the effectiveness of the proposed representations, we perform experiments on five different head pose datasets. These datasets have been used extensively in the recent literature, allowing direct comparisons with other approaches. The experimental results show that the proposed representations improve on the state of the art in head pose estimation.
Conclusions
In this paper, we propose a novel image representation for the problem of head pose estimation. The proposed representation encodes a new type of concise 9-dimensional local descriptor into a global descriptor via Fisher vectors. The performance of the proposed representation can be further improved by metric learning. We test our method on five challenging datasets, outperforming the current state of the art on all of them.
There are several aspects to be further studied in the
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Contract nos. 61003103, 61173065, 61105014 and 61332016, and by the President Fund of UCAS.
References (44)
- et al., 3D face pose estimation and tracking from a monocular camera, Image Vis. Comput. (2002)
- L. Chen, L. Zhang, Y. Hu, M. Li, H. Zhang, Head pose estimation using fisher manifold learning, in: Proceedings of IEEE...
- J. Sivic, A. Zisserman, Video google: a text retrieval approach to object matching in videos, in: Proceedings of IEEE...
- F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in: Proceedings of IEEE...
- F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Proceedings...
- F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in:...
- M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints,...
- et al., Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- A. Nikolaidis, I. Pitas, Facial feature extraction and determination of pose, in: Proceedings of NOBLESSE Workshop on...
- F. Fleuret, D. Geman, Fast face detection with precise pose estimation, in: Proceedings of IEEE International...
- Robust full-motion recovery of head by dynamic templates and re-registration techniques, Int. J. Imaging Syst. Technol.
- Learning multiview face subspaces and facial pose estimation using independent component analysis, IEEE Trans. Image Process.
- Nonlinear dimensionality reduction by locally linear embedding, Science
- A global geometric framework for nonlinear dimensionality reduction, Science
- Laplacian eigenmaps and spectral techniques for embedding and clustering, Adv. Neural Inf. Process. Syst.
Cited by (23)
Joint usage of global and local attentions in hourglass network for human pose estimation
2022, Neurocomputing. Citation excerpt: Human pose estimation plays an important role in analyzing human behavior based on images or videos. Accurate and efficient human pose estimation may facilitate various applications such as human action recognition [5,6,40], person ReID [3], human–computer interaction [24,35] and video object tracking [32]. However, due to the volatile camera view angle and complex human posture, human pose estimation remains a challenging task after decades of study.
Head pose estimation: A survey of the last ten years
2021, Signal Processing: Image Communication. Citation excerpt: The authors treat the problem as a multi-class classification task: by applying Gaussian derivatives to the image they extract image features, which undergo a pattern classification algorithm (SVM) to discriminate between the different poses. Similarly, [59] proposes a head pose estimation framework where a nine-dimensional local descriptor is computed for the pixel coordinates of each image. Magnitude and orientation of the gradient are also extracted.
Densely connected attentional pyramid residual network for human pose estimation
2019, Neurocomputing

Face analysis through semantic face segmentation
2019, Signal Processing: Image Communication. Citation excerpt: The MLD results are good on frontal poses but as the orientation of the pose changes from frontal to profile, its MAE increases. Finally, reported results for the methods MGD [36] and kVoD [35] are less uniform, and abrupt changes can be seen in the form of spikes in Fig. 10. By trying all possible combinations of six facial features 'skin', 'hair', 'eyes', 'nose', 'mouth', and 'background', the best results are obtained by concatenating four facial features: 'skin', 'hair', 'eyes', and 'nose'.
Head pose estimation with soft labels using regularized convolutional neural network
2019, Neurocomputing. Citation excerpt: This confirms that by absorbing the benefits from label distribution, robust architecture and fully learning strategy, our method with entire images could try its best to explore the discriminative representation across different categories. The literature [47] extracted Fisher vector of local descriptors (VoD) or its variant (kVoD) that combines with metric learning as features and used a Nearest Centroid (NC) classifier to determine head pose categories. However, this unsupervised method cannot utilize class label information and the best accuracy they reported is about 94.2%.
Computer vision for assistive technologies
2017, Computer Vision and Image Understanding. Citation excerpt: To overcome them, on the one hand, approaches that combine demographic recognition (gender) and behavior analysis (gaze) have been proposed in order to create a user-centered HCI environment by better understanding the needs and intentions of its users (Zhang et al., 2016). On the other hand, more sophisticated algorithmic strategies have been investigated, i.e. a new type of concise 9-dimensional local descriptors with Fisher vectors introduced in the literature (Ma et al., 2015) to describe head images. Concerning the estimation of the gaze direction, an accurate and efficient eye detection method using discriminatory Haar features (DHFs) and a new efficient support vector machine (eSVM) has been recently published (Chen and Liu, 2015).
Bingpeng Ma received the B.S. degree in mechanics, in 1998, and the M.S. degree in mathematics, in 2003, from Huazhong University of Science and Technology. He received Ph.D. degree in computer science at the Institute of Computing Technology, Chinese Academy of Sciences, PR China, in 2009. He was a post-doctorial researcher in University of Caen, France, from 2011 to 2012. He joined the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, in March 2013, and now he is an assistant professor. His research interests cover image analysis, pattern recognition, and computer vision.
Rui Huang received his B.Sc. degree in Peking University (1999) and M.E. degree in Chinese Academy of Sciences (2002). In 2008, he received his Ph.D. degree in Rutgers University, and had since worked there as a research associate for two years. He was an assistant professor at Huazhong University of Science and Technology, and is currently a research staff member at NEC Laboratories China. His research interests include graphical models and their applications in computer vision, pattern recognition and medical imaging.
Lei Qin received the B.S. and M.S. degrees in mathematics from the Dalian University of Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008. He is currently an associate professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include image/video processing, computer vision, and pattern recognition. He has authored or coauthored over 30 technical papers in the area of computer vision. He is a reviewer for IEEE Trans. on Multimedia, IEEE Trans. on Circuits and Systems for Video Technology, and IEEE Transactions on Cybernetics. He has served as TPC member for various conferences, including ECCV, ICPR, ICME, PSIVT, ICIMCS, PCM.