VoD: A novel image representation for head yaw estimation
Introduction
Over the last few decades, there has been significant progress in face recognition research. However, one of the most challenging factors influencing the robustness and accuracy of face recognition is pose variation. To achieve robustness to pose variation, one may have to process face images differently according to their poses. Therefore, head pose estimation has been an active research topic for many years.
More precisely, head pose estimation essentially means computing three types of head rotation: yaw (looking left or right), pitch (looking up or down) and roll (tilting left or right). Among them, the roll rotation can be computed easily from the relative positions of facial feature points, but the other two rotations are rather difficult to estimate. As the estimation of the yaw rotation has many important applications, it attracts more attention than pitch estimation [1], and more research data are available for it. Therefore, in this paper, as most previous works have done, we focus on the challenging problem of estimating the head yaw pose from input face images.
In head pose estimation, one of the crucial steps is to extract an image representation characterizing the pose. Generally speaking, visual features can be roughly categorized into global and local features. While global features encode the holistic configuration of the image, local features encode the detailed traits within a local region. In the literature, many methods combine both global and local features, as they play different roles in the visual perception process. Among them, perhaps the most commonly used is the Bag-of-Words (BoW) model [2], in which local descriptors extracted from an image are first mapped to a set of visual words and the image is then represented as a histogram of visual word occurrences. Recently, Fisher vectors [3], which encode higher-order statistics of local descriptors, have greatly improved on the BoW model for image classification. Instead of encoding only the frequency of visual word occurrences, Fisher vectors encode how the parameters of a generative model should be changed to best represent the image. They can be seen as an extension of BoW, and have been shown to achieve state-of-the-art performance on several challenging object recognition and image retrieval tasks [4], [5].
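To make the Fisher vector construction concrete, the sketch below encodes a set of local descriptors against a diagonal-covariance GMM using the mean and variance gradients of [3], followed by the power and L2 normalization of [4]. The random data, number of Gaussians and descriptor dimension are illustrative placeholders, not the settings used in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) as a 2*K*D Fisher vector:
    gradients of the GMM log-likelihood w.r.t. means and variances."""
    X = np.atleast_2d(descriptors)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization

rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=4, covariance_type='diag',
                      random_state=0).fit(rng.normal(size=(500, 9)))
fv = fisher_vector(rng.normal(size=(100, 9)), gmm)
print(fv.shape)                                        # (72,) = 2 * 4 * 9
```

The resulting vector grows linearly with the number of Gaussians and the descriptor dimension, which is why a compact local descriptor keeps the global representation manageable.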
Inspired by these advances, we present a novel image representation for head yaw estimation in this paper. More specifically, the proposed representation encodes a new type of concise 9-dimensional local descriptor with Fisher vectors to describe head images, called the Fisher Vector of local Descriptors, or VoD for short. The VoD representation has been experimentally validated on five head pose datasets (FacePix, Pointing'04, MultiPIE, CAS-PEAL and our own dataset). The results on these datasets show that the proposed representation outperforms the state of the art.
The contribution of this paper is three-fold. First, we propose a 9-dimensional local attribute vector that can be used in the Fisher vector framework. The 9-dimensional vector is extracted at each pixel and contains the pixel's coordinates, intensity, first- and second-order derivatives, and gradient magnitude and orientation. Compared with the SIFT features used in traditional Fisher vectors, the 9-dimensional local descriptor is significantly more efficient to compute. More importantly, despite its conciseness, the descriptor preserves the information essential to head pose estimation.
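A minimal per-pixel implementation of this attribute vector might look like the following; the description above does not fix the derivative filters, so the use of `np.gradient` here is an assumption.

```python
import numpy as np

def local_descriptors(img):
    """Per-pixel 9-D descriptor: x, y, intensity, first-order derivatives
    (Ix, Iy), second-order derivatives (Ixx, Iyy), gradient magnitude
    and gradient orientation. Returns an (H*W, 9) array."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    iy, ix = np.gradient(img)            # derivatives along rows (y) and cols (x)
    iyy = np.gradient(iy, axis=0)
    ixx = np.gradient(ix, axis=1)
    mag = np.hypot(ix, iy)               # gradient magnitude
    ang = np.arctan2(iy, ix)             # gradient orientation
    return np.stack([xs, ys, img, ix, iy, ixx, iyy, mag, ang],
                    axis=-1).reshape(-1, 9)

img = np.arange(16.0).reshape(4, 4)      # toy 4x4 "image"
D = local_descriptors(img)
print(D.shape)                           # (16, 9)
```

Every quantity is a single filtering pass over the image, which is where the efficiency gain over computing dense SIFT comes from.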
Second, the proposed local descriptors are encoded and aggregated into Fisher vectors to form the new VoD representation. To preserve the spatial structure of the head in the global representation, we divide a head image into rectangular bins and compute one VoD per bin.
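The spatial binning can be sketched as below. The grid size and the toy descriptor and encoder functions are placeholders; in the method above, the encoder would be the Fisher vector computed over the descriptors falling in each bin.

```python
import numpy as np

def binned_representation(img, descriptor_fn, encode_fn, grid=(4, 4)):
    """Divide the image into grid[0] x grid[1] rectangular bins, encode
    each bin's local descriptors separately, and concatenate the per-bin
    codes so the global representation keeps spatial structure."""
    rows, cols = grid
    h, w = img.shape
    ys = np.linspace(0, h, rows + 1, dtype=int)
    xs = np.linspace(0, w, cols + 1, dtype=int)
    codes = []
    for i in range(rows):
        for j in range(cols):
            patch = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            codes.append(encode_fn(descriptor_fn(patch)))
    return np.concatenate(codes)

# Toy stand-ins: raw intensity as "descriptor", per-bin mean as "encoder".
img = np.arange(64.0).reshape(8, 8)
rep = binned_representation(img,
                            descriptor_fn=lambda p: p.reshape(-1, 1),
                            encode_fn=lambda d: d.mean(axis=0),
                            grid=(2, 2))
print(rep.shape)                         # (4,): one code per bin
```

Concatenating per-bin codes rather than pooling over the whole image keeps, for example, eye-region statistics distinct from chin-region statistics, which matters for pose.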
Finally, we further improve the discriminative ability of VoD by supervised metric learning. Considering the great success of Keep It Simple and Straightforward Metric Learning (KISSME) [6], we train kVoD from VoD using KISSME in a supervised setting. The resulting kVoD further improves the accuracy of head pose estimation over the state of the art.
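KISSME learns a Mahalanobis metric from the inverse covariances of similar-pair and dissimilar-pair difference vectors [6]. A minimal sketch on toy two-class data (standing in for two pose classes) follows; the final projection onto the PSD cone is a common practical step and an assumption here, not necessarily the authors' exact procedure.

```python
import numpy as np

def kissme(X, similar_pairs, dissimilar_pairs):
    """KISSME sketch: M = inv(Cov_S) - inv(Cov_D), where the covariances
    are taken over pairwise differences of similar and dissimilar index
    pairs, then projected onto the PSD cone to give a valid metric."""
    ds = np.array([X[i] - X[j] for i, j in similar_pairs])
    dd = np.array([X[i] - X[j] for i, j in dissimilar_pairs])
    cov_s = ds.T @ ds / len(ds)          # uncentered: zero-mean assumption
    cov_d = dd.T @ dd / len(dd)
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    vals, vecs = np.linalg.eigh(M)       # clip negative eigenvalues
    return vecs @ np.diag(np.clip(vals, 0, None)) @ vecs.T

def mahalanobis(M, a, b):
    d = a - b
    return float(d @ M @ d)

# Toy data: two well-separated classes standing in for two yaw poses.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 5)),
               rng.normal(3, 1, size=(50, 5))])
sim = [(i, i + 1) for i in range(49)] + [(i, i + 1) for i in range(50, 99)]
dis = [(i, i + 50) for i in range(50)]
M = kissme(X, sim, dis)
```

Under the learned metric, distances between same-class samples shrink relative to distances between different-class samples, which is what makes the representation more discriminative for pose classification.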
The remainder of this paper is organized as follows. Section 2 reviews related methods for head pose estimation. Section 3 introduces the proposed representation in detail. Experiments on five challenging datasets are presented in Section 4 to demonstrate the effectiveness of the proposed representations. Conclusions are drawn in Section 5, with some discussion of future work.
Section snippets
Related work
Head pose estimation from images is a challenging problem due to large variations in illumination, facial expression, subject appearance, occlusion, noise and perspective distortion. A generic (i.e., person-independent) algorithm for head pose estimation has to be robust to such factors. There exists a large body of literature on this topic; see [7] for a review. Broadly speaking, most previous work can be categorized into three groups: algorithms based on facial features [8], [9],
Fisher vectors of local descriptors
This section presents the proposed novel image representation. In the following, we first introduce each component of VoD in detail, and then describe how to further improve its performance with metric learning in a supervised setting.
Experiments
In this section, to validate the effectiveness of the proposed representations, we perform experiments on five different head pose datasets. These datasets have been used extensively in the recent literature, allowing direct comparisons with other approaches. The experimental results show that the proposed representations improve on the state of the art in head pose estimation.
Conclusions
In this paper, we propose a novel image representation for the problem of head pose estimation. The proposed representation encodes a new type of concise 9-dimensional local descriptor into a global descriptor via Fisher vectors. The performance of the proposed representation can be further improved by metric learning. We test our method on five challenging datasets, outperforming the current state of the art on all of them.
There are several aspects to be further studied in the
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Contract nos. 61003103, 61173065, 61105014 and 61332016, and by the President Fund of UCAS.
References (44)
- et al., 3D face pose estimation and tracking from a monocular camera, Image Vis. Comput. (2002)
- L. Chen, L. Zhang, Y. Hu, M. Li, H. Zhang, Head pose estimation using fisher manifold learning, in: Proceedings of IEEE...
- J. Sivic, A. Zisserman, Video google: a text retrieval approach to object matching in videos, in: Proceedings of IEEE...
- F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in: Proceedings of IEEE...
- F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Proceedings...
- F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in:...
- M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints,...
- et al., Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- A. Nikolaidis, I. Pitas, Facial feature extraction and determination of pose, in: Proceedings of NOBLESSE Workshop on...
- F. Fleuret, D. Geman, Fast face detection with precise pose estimation, in: Proceedings of IEEE International...
- Robust full-motion recovery of head by dynamic templates and re-registration techniques, Int. J. Imaging Syst. Technol.
- Learning multiview face subspaces and facial pose estimation using independent component analysis, IEEE Trans. Image Process.
- Nonlinear dimensionality reduction by locally linear embedding, Science
- A global geometric framework for nonlinear dimensionality reduction, Science
- Laplacian eigenmaps and spectral techniques for embedding and clustering, Adv. Neural Inf. Process. Syst.
Cited by (23)
Joint usage of global and local attentions in hourglass network for human pose estimation
2022, Neurocomputing. Citation excerpt: Human pose estimation plays an important role in analyzing human behavior based on images or videos. Accurate and efficient human pose estimation may facilitate various applications such as human action recognition [5,6,40], person ReID [3], human–computer interaction [24,35] and video object tracking [32]. However, due to the volatile camera view angle and complex human posture, human pose estimation remains a challenging task after decades of study.
Head pose estimation: A survey of the last ten years
2021, Signal Processing: Image Communication. Citation excerpt: The authors treat the problem as a multi-class classification task: by applying Gaussian derivatives to the image they extract image features, which undergo a pattern classification algorithm (SVM) to discriminate between the different poses. Similarly, [59] proposes a head pose estimation framework where a nine-dimensional local descriptor is computed for the pixel coordinates of each image. Magnitude and orientation of the gradient are also extracted.
Densely connected attentional pyramid residual network for human pose estimation
2019, Neurocomputing

Face analysis through semantic face segmentation
2019, Signal Processing: Image Communication. Citation excerpt: The MLD results are good on frontal poses but as the orientation of the pose changes from frontal to profile, its MAE increases. Finally, reported results for the methods MGD [36] and kVoD [35] are less uniform, and abrupt changes can be seen in the form of spikes in Fig. 10. By trying all possible combinations of six facial features 'skin', 'hair', 'eyes', 'nose', 'mouth', and 'background', the best results are obtained by concatenating four facial features: 'skin', 'hair', 'eyes', and 'nose'.
Head pose estimation with soft labels using regularized convolutional neural network
2019, Neurocomputing. Citation excerpt: This confirms that by absorbing the benefits from label distribution, robust architecture and fully learning strategy, our method with entire images could try its best to explore the discriminative representation across different categories. The literature [47] extracted Fisher vector of local descriptors (VoD) or its variant (kVoD) that combines with metric learning as features and used a Nearest Centroid (NC) classifier to determine head pose categories. However, this unsupervised method cannot utilize class label information and the best accuracy they reported is about 94.2%.
Computer vision for assistive technologies
2017, Computer Vision and Image Understanding. Citation excerpt: To overcome them, on the one hand, approaches that combine demographic recognition (gender) and behavior analysis (gaze) have been proposed in order to create a user-centered HCI environment by better understanding the needs and intentions of its users (Zhang et al., 2016). On the other hand, more sophisticated algorithmic strategies have been investigated, i.e. a new type of concise 9-dimensional local descriptors with Fisher vectors introduced in the literature (Ma et al., 2015) to describe head images. Concerning the estimation of the gaze direction, an accurate and efficient eye detection method using discriminatory Haar features (DHFs) and a new efficient support vector machine (eSVM) has been recently published (Chen and Liu, 2015).
Bingpeng Ma received the B.S. degree in mechanics, in 1998, and the M.S. degree in mathematics, in 2003, from Huazhong University of Science and Technology. He received Ph.D. degree in computer science at the Institute of Computing Technology, Chinese Academy of Sciences, PR China, in 2009. He was a post-doctorial researcher in University of Caen, France, from 2011 to 2012. He joined the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, in March 2013, and now he is an assistant professor. His research interests cover image analysis, pattern recognition, and computer vision.
Rui Huang received his B.Sc. degree in Peking University (1999) and M.E. degree in Chinese Academy of Sciences (2002). In 2008, he received his Ph.D. degree in Rutgers University, and had since worked there as a research associate for two years. He was an assistant professor at Huazhong University of Science and Technology, and is currently a research staff member at NEC Laboratories China. His research interests include graphical models and their applications in computer vision, pattern recognition and medical imaging.
Lei Qin received the B.S. and M.S. degrees in mathematics from the Dalian University of Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008. He is currently an associate professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include image/video processing, computer vision, and pattern recognition. He has authored or coauthored over 30 technical papers in the area of computer vision. He is a reviewer for IEEE Trans. on Multimedia, IEEE Trans. on Circuits and Systems for Video Technology, and IEEE Transactions on Cybernetics. He has served as TPC member for various conferences, including ECCV, ICPR, ICME, PSIVT, ICIMCS, PCM.