Elsevier

Neurocomputing

Volume 148, 19 January 2015, Pages 455-466

VoD: A novel image representation for head yaw estimation

https://doi.org/10.1016/j.neucom.2014.07.019

Abstract

Building on the recent advances in the Fisher kernel framework for image classification, this paper proposes a novel image representation for head yaw estimation. Specifically, for each pixel of the image, a concise 9-dimensional local descriptor is computed, consisting of the pixel coordinates, intensity, the first- and second-order derivatives, as well as the magnitude and orientation of the gradient. These local descriptors are encoded by Fisher vectors before being pooled to produce a global representation of the image. The proposed image representation is effective for head yaw estimation, and can be further improved by metric learning. A series of head yaw estimation experiments have been conducted on five datasets, and the results show that the new image representation improves the current state-of-the-art for head yaw estimation.

Introduction

During the last decades, there has been significant progress in face recognition research. However, pose variation remains one of the most challenging factors affecting the robustness and accuracy of face recognition. To achieve robustness to pose variation, one may have to process face images differently according to their poses. Head pose estimation has therefore been an active research topic for many years.

More precisely, head pose estimation essentially means the computation of three types of rotations of the head: yaw (looking left or right), pitch (looking up or down) and roll (tilting left or right). Among them, the roll rotation can be computed easily by the relative positions of the feature points, but the other two rotations are rather difficult to estimate. As the estimation of the yaw rotation has many important applications, it attracts more attention than pitch estimation [1], with more research data available. Therefore, in this paper, as most previous works have done, we focus on the challenging problem of estimating the head yaw pose from the input face images.

In head pose estimation, one of the crucial steps is to extract an image representation characterizing the pose. Generally speaking, the proposed visual features can be roughly categorized into global and local features. While global features encode the holistic configuration of the image, local features encode the detailed traits within a local region. In the literature, many methods combine both global and local features as they play different roles in the visual perception process. Among them, perhaps the most commonly used one is the Bag-of-Words (BoW) model [2], in which local descriptors extracted from an image are first mapped to a set of visual words and the image is then represented as a histogram of visual word occurrences. Recently, the Fisher vectors [3], which encode higher order statistics of local descriptors, improved the BoW model greatly for image classification. Instead of encoding only the frequency of visual word occurrences, Fisher vectors encode how the parameters of the model should be changed to represent the image. They can be seen as an extension of BoW, and have been shown to achieve state-of-the-art performance for several challenging object recognition and image retrieval tasks [4], [5].
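As a rough illustration of Fisher vector encoding (a simplified sketch of the standard formulation [3], not the authors' exact pipeline; power/L2 normalization steps are omitted and the GMM parameters here are assumed given), the gradient with respect to the means and variances of a diagonal-covariance GMM could be computed as:

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Simplified Fisher vector of descriptors X (N x D) under a GMM with
    K components: gradients of the log-likelihood w.r.t. means and
    standard deviations, concatenated into one 2*K*D vector."""
    K, D = means.shape
    # log p(x | k) + log w_k for each descriptor and each Gaussian
    log_p = np.stack([
        np.log(weights[k])
        - 0.5 * np.sum(((X - means[k]) / sigmas[k]) ** 2
                       + np.log(2 * np.pi * sigmas[k] ** 2), axis=1)
        for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)   # soft assignments (posteriors)
    N = X.shape[0]
    parts = []
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        g_mu = (gamma[:, k:k + 1] * diff).sum(axis=0) / (N * np.sqrt(weights[k]))
        g_sig = (gamma[:, k:k + 1] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * weights[k]))
        parts += [g_mu, g_sig]
    return np.concatenate(parts)                # length 2 * K * D
```

Unlike a BoW histogram of length K, the resulting vector has length 2KD, which is what allows it to capture higher-order statistics of the descriptor distribution.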

Inspired by these exciting advances, we present a novel image representation for head yaw estimation in this paper. More specifically, the proposed image representation encodes a new type of concise 9-dimensional local descriptors with Fisher vectors to describe the head images, called the Fisher Vector of local Descriptors (VoD for short). The VoD representation has been experimentally validated on five head pose datasets (FacePix, Pointing'04, MultiPIE, CAS-PEAL and our own dataset). The results on these datasets show that the proposed representation outperforms the state-of-the-art.

The contribution of this paper is three-fold. Firstly, we propose a 9-dimensional local attribute vector which can be applied in the Fisher vector method. The 9-dimensional vector is extracted at each pixel and contains the coordinates, intensity, the first- and second-order derivatives, and the magnitude and orientation of the gradient of the pixel. Compared with the SIFT feature used in traditional Fisher vectors, the 9-dimensional local descriptor is significantly more efficient to compute. More importantly, despite its conciseness, the descriptor preserves enough information essential to head pose estimation.
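As a rough sketch (not the authors' exact implementation; in particular, the choice of Ixx and Iyy as the two second-order terms is our assumption), the per-pixel 9-dimensional descriptor could be computed as:

```python
import numpy as np

def pixel_descriptors(image):
    """Per-pixel 9-D descriptors: (x, y, I, Ix, Iy, Ixx, Iyy, |grad|, theta).
    Returns one 9-D row per pixel of a grayscale image."""
    img = image.astype(np.float64)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]          # pixel coordinates
    Iy, Ix = np.gradient(img)            # first-order derivatives
    Iyy, _ = np.gradient(Iy)             # second-order, vertical
    _, Ixx = np.gradient(Ix)             # second-order, horizontal
    mag = np.hypot(Ix, Iy)               # gradient magnitude
    ori = np.arctan2(Iy, Ix)             # gradient orientation
    feats = np.stack([xs, ys, img, Ix, Iy, Ixx, Iyy, mag, ori], axis=-1)
    return feats.reshape(-1, 9)

desc = pixel_descriptors(np.random.rand(32, 32))
print(desc.shape)  # (1024, 9)
```

Note that, unlike SIFT, this requires only a handful of finite-difference passes over the image, which is where the claimed efficiency gain comes from.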

Secondly, the proposed local descriptors are encoded and aggregated into Fisher vectors to form the new VoD representation. To keep the spatial structure of the head in the global representation, we divide a head image into many rectangular bins and compute one VoD per bin.
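The binning-and-pooling step above can be sketched as follows; the grid size and the stand-in encoder are illustrative assumptions (the actual pipeline applies the Fisher-vector encoder to the local descriptors of each bin):

```python
import numpy as np

def pooled_bins(image, rows, cols, encode):
    """Divide the image into a rows x cols grid of rectangular bins,
    encode each bin independently, and concatenate the per-bin codes so
    the global representation keeps the spatial layout of the head."""
    h, w = image.shape
    row_idx = np.array_split(np.arange(h), rows)
    col_idx = np.array_split(np.arange(w), cols)
    codes = [encode(image[np.ix_(r, c)]) for r in row_idx for c in col_idx]
    return np.concatenate(codes)

# e.g. mean intensity as a trivial stand-in for the per-bin VoD encoder
rep = pooled_bins(np.random.rand(64, 64), 4, 4, lambda b: np.array([b.mean()]))
print(rep.shape)  # (16,)
```

Concatenating per-bin codes rather than pooling over the whole image is what preserves the spatial structure, since a left-facing and a right-facing head produce different bin-wise statistics even when their global statistics are similar.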

Finally, we further improve the discriminative ability of VoD by supervised metric learning. Considering the great success of Keep It Simple and Straightforward Metric Learning (KISSME) [6], we train kVoD from VoD using KISSME under a supervised setting. The resulting kVoD representation greatly improves the accuracy of head pose estimation over the state of the art.
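For reference, KISSME [6] learns a Mahalanobis metric in closed form from pairwise equivalence constraints: the metric matrix is the difference of the inverse covariances of pairwise feature differences from similar and dissimilar pairs. A minimal sketch (the ridge term and the PSD projection are our additions for numerical robustness, not details from the paper):

```python
import numpy as np

def kissme(diffs_similar, diffs_dissimilar):
    """KISSME in a nutshell: M = inv(Cov_S) - inv(Cov_D), where Cov_S / Cov_D
    are covariances of pairwise differences from similar / dissimilar pairs.
    M is re-projected onto the PSD cone to yield a valid metric."""
    d = diffs_similar.shape[1]
    ridge = 1e-8 * np.eye(d)  # guards against singular covariances
    cov_s = diffs_similar.T @ diffs_similar / len(diffs_similar) + ridge
    cov_d = diffs_dissimilar.T @ diffs_dissimilar / len(diffs_dissimilar) + ridge
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    w, V = np.linalg.eigh(M)                       # clip negative eigenvalues
    return V @ np.diag(np.maximum(w, 0)) @ V.T
```

Distances are then computed as (x - y)^T M (x - y); because M is learned from same-pose versus different-pose pairs, directions that vary with pose are stretched while identity-related variation is suppressed.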

The remainder of this paper is organized as follows: in Section 2, we review related methods for head pose estimation; in Section 3, the proposed representation is introduced in detail. Experiments on five challenging datasets are presented in Section 4 to demonstrate the effectiveness of the proposed representations. Conclusions are drawn in Section 5 with some discussions on future work.

Section snippets

Related work

Head pose estimation from images is a challenging problem due to large variations of illumination, facial expressions, subject variability, occlusions, noise and perspective distortion. A generic (i.e., person-independent) algorithm for head pose estimation has to be robust to such factors. There exists a large body of literature on this topic; see [7] for a review. Broadly speaking, most previous work can mainly be categorized into three groups: algorithms based on facial features [8], [9],

Fisher vectors of local descriptor

This section presents the proposed novel image representation. In the following sections, we first introduce each component of VoD in detail, and then describe how to improve its performance using metric learning in the supervised setting.

Experiments

In this section, to validate the effectiveness of the proposed representations, we perform experiments on five different head pose datasets. These datasets have been used extensively in the recent literature, allowing direct comparisons with other approaches. The experimental results show that the proposed representations improve upon the state-of-the-art in head pose estimation.

Conclusions

In this paper, we propose a novel image representation for the problem of head pose estimation. The proposed representation aggregates a new type of concise 9-dimensional local descriptors into a global descriptor via Fisher vectors. The performance of the proposed representation can be improved further by metric learning. We test our method on five challenging datasets, outperforming the current state-of-the-art on all of them.

There are several aspects to be further studied in the

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Contract nos. 61003103, 61173065, 61105014 and 61332016, and by the President Fund of UCAS.


References (44)

  • Q. Ji et al., 3D face pose estimation and tracking from a monocular camera, Image Vis. Comput. (2002)
  • L. Chen, L. Zhang, Y. Hu, M. Li, H. Zhang, Head pose estimation using fisher manifold learning, in: Proceedings of IEEE...
  • J. Sivic, A. Zisserman, Video google: a text retrieval approach to object matching in videos, in: Proceedings of IEEE...
  • F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categorization, in: Proceedings of IEEE...
  • F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Proceedings...
  • F. Perronnin, Y. Liu, J. Sánchez, H. Poirier, Large-scale image retrieval with compressed Fisher vectors, in:...
  • M. Köstinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints,...
  • E. Murphy-Chutorian et al., Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • A. Nikolaidis, I. Pitas, Facial feature extraction and determination of pose, in: Proceedings of NOBLESSE Workshop on...
  • F. Fleuret, D. Geman, Fast face detection with precise pose estimation, in: Proceedings of IEEE International...
  • R. Stiefelhagen, J. Yang, A. Waibel, A model-based gaze tracking system, in: IEEE International Joint Symposia on...
  • J. Xiao et al., Robust full-motion recovery of head by dynamic templates and re-registration techniques, Int. J. Imaging Syst. Technol. (2003)
  • Y. Wei, L. Fradet, T. Tan, Head pose estimation using Gabor eigenspace modeling, in: Proceedings of IEEE International...
  • Y. Li, S. Gong, H. Liddel, Support vector regression and classification based multi-view face detection and...
  • J.N.S. Kwong, S. Gong, Learning support vector machines for a multi-view face model, in: Proceedings of British Machine...
  • T. Darrell, B. Moghaddam, A.P. Pentland, Active face tracking and pose estimation in an interactive room, in:...
  • Stan Z. Li et al., Learning multiview face subspaces and facial pose estimation using independent component analysis, IEEE Trans. Image Process. (2005)
  • M.A. Haj, J. Gonzlez, L.S. Davis, On partial least squares in head pose estimation: how to simultaneously deal with...
  • S.T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, Science (2000)
  • J.B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, Science (2000)
  • M. Belkin et al., Laplacian eigenmaps and spectral techniques for embedding and clustering, Adv. Neural Inf. Process. Syst. (2001)
  • Y. Fu, T.S. Huang, Graph embedded analysis for head pose estimation, in: Proceedings of IEEE International Conference...

    Bingpeng Ma received the B.S. degree in mechanics, in 1998, and the M.S. degree in mathematics, in 2003, from Huazhong University of Science and Technology. He received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, PR China, in 2009. He was a post-doctoral researcher at the University of Caen, France, from 2011 to 2012. He joined the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, in March 2013, and is now an assistant professor. His research interests cover image analysis, pattern recognition, and computer vision.

    Rui Huang received his B.Sc. degree from Peking University (1999) and M.E. degree from the Chinese Academy of Sciences (2002). In 2008, he received his Ph.D. degree from Rutgers University, where he subsequently worked as a research associate for two years. He was an assistant professor at Huazhong University of Science and Technology, and is currently a research staff member at NEC Laboratories China. His research interests include graphical models and their applications in computer vision, pattern recognition and medical imaging.

    Lei Qin received the B.S. and M.S. degrees in mathematics from the Dalian University of Technology, Dalian, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2008. He is currently an associate professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include image/video processing, computer vision, and pattern recognition. He has authored or coauthored over 30 technical papers in the area of computer vision. He is a reviewer for IEEE Trans. on Multimedia, IEEE Trans. on Circuits and Systems for Video Technology, and IEEE Transactions on Cybernetics. He has served as TPC member for various conferences, including ECCV, ICPR, ICME, PSIVT, ICIMCS, PCM.
