Signal Processing, Volume 112, July 2015, Pages 43-52

Sparse auto-encoder based feature learning for human body detection in depth image

https://doi.org/10.1016/j.sigpro.2014.11.003

Highlights

  • A sparse auto-encoder is used to learn depth features for human detection.

  • A beyond-sliding-window localization method based on depth values is proposed.

Abstract

Human body detection in depth images is an active research topic in computer vision, but depth feature extraction remains an open problem. In this paper, a novel feature learning method based on the sparse auto-encoder (SAE) is proposed for human body detection in depth images. The learned feature captures the intrinsic structure of the human body. To further reduce the computational cost of the SAE, both a convolutional neural network and pooling are introduced to reduce the training complexity. In addition, beyond learning the SAE based depth feature, we further pursue detector efficiency. A beyond-sliding-window localization strategy is proposed, based on the fact that the depth values on an object surface are almost the same. The proposed strategy first uses the histogram of depth to generate candidate detection window centers, and then exploits the relationship between human body height and depth to determine the detection window size. It thus avoids the time-consuming sliding window search and enables fast human body localization. Experiments on the SZU Depth Pedestrian dataset verify the effectiveness of the proposed method.

Introduction

Human body detection is a fundamental step in understanding human behaviors from camera(s), with applications in intelligent video surveillance, assisted vehicle driving, and automatic action recognition. It has therefore become an active research field in the computer vision community. However, the appearance of the human body is affected by variations caused by different illuminations, poses, viewing angles, and partial occlusion. It is therefore fairly challenging to build a human body detector for real scenes.

RGB cameras were the main imaging devices in the early stage of human body detection, and extensive progress has been made on detection algorithms based on RGB or gray images [1], [2], [3], [4]. However, RGB and gray images still suffer from the photographic variations stated above, which largely degrade the performance of detection algorithms. Fortunately, with the popularity of depth sensors such as RealSense and Kinect, it is now feasible to obtain the distance between object surfaces and the camera in dynamic environments. Moreover, depth images are insensitive to illumination variations and shadows. As a result, research on detection algorithms over depth images has attracted ever-increasing attention.

In this paper we also pursue detection efficiency. Currently, sliding window (SW) search is the mainstream approach in object and human body detection; for example, in the PASCAL Visual Object Classes (VOC) challenge [5], the majority of the entries used a sliding window approach to the detection task. Essentially, detection or localization is a binary classification problem: determine whether there is an object in a given scanned window. To this end, classifiers such as neural networks (NN), Support Vector Machines (SVMs), Random Forests (RF), and AdaBoost are widely investigated. On the feature side, descriptors such as the Histogram of Oriented Gradients (HOG) [6], Local Binary Patterns (LBP) [7], Integral Channel Features [8], and Haar-like features [9] are widely used.
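
For orientation, the sliding-window procedure described above can be summarized by the short Python sketch below; the window size, stride, threshold, and the extract_feature and classifier callables are hypothetical placeholders, not the settings used in this paper.

```python
def sliding_window_detect(image, classifier, extract_feature,
                          win_h=128, win_w=64, stride=8, threshold=0.5):
    """Scan a 2-D image array with a fixed-size window and classify each patch."""
    detections = []
    H, W = image.shape[:2]
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]
            feature = extract_feature(patch)   # e.g. HOG, LBP, or a learned descriptor
            score = classifier(feature)        # binary human / non-human score
            if score > threshold:
                detections.append((x, y, win_w, win_h, score))
    return detections
```

Scanning every position (and, in practice, every scale) in this way is exactly what the localization strategy proposed later is designed to avoid.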

To carry out human body detection in depth images, the first task is to design a discriminative descriptor. In the literature, quite few feature extraction algorithms have been developed exclusively for depth images; most of the proposed depth descriptors are similar to those in the RGB domain. For example, Spinello and Arras [10] and Wu et al. [11] proposed the Histogram of Oriented Depths (HOD) and the Histogram of Depth Difference (HDD), respectively, both of which are very similar to HOG. Yu et al. [12] proposed the Simplified Local Ternary Patterns (SLTP) descriptor, an improvement of Local Ternary Patterns (LTP). Ikemura and Fujiyoshi [13] proposed a descriptor called the Relational Depth Similarity Feature (RDSF), which is based on statistics of depth values. However, the above features are handcrafted. To some extent, they encode only part of the human body information while neglecting the intrinsic structure of the human body, and are therefore not discriminative enough in complex scenes.

Along with recent advances in sparse coding and deep learning, learning based features have received ever-increasing research attention. For instance, Ren and Ramanan [14] proposed a learned feature based on sparse coding. Dollar et al. [8] proposed a learning based integral channel feature with encouraging performance on pedestrian detection. Sermanet et al. [15] also proposed an unsupervised multi-stage feature learning method for pedestrian detection. However, these features work on the RGB image domain only, and there are few works in the literature on learned depth features. Recently, the sparse auto-encoder (SAE) has become popular; it can automatically learn features from unlabelled data and has achieved satisfying performance in many applications, such as image classification, voice recognition, and hand gesture recognition. In this paper, we introduce the SAE to learn depth image features for human body detection. The SAE brings another merit in terms of efficiency, as its objective function can be optimized via fast backpropagation.
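
As background, a sparse auto-encoder is usually trained by minimizing a reconstruction error plus a weight-decay term and a KL-divergence sparsity penalty on the hidden activations; the standard formulation is sketched below for orientation and is not quoted from this paper:

$$
J(W,b)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\bigl\lVert \hat{x}^{(i)}-x^{(i)}\bigr\rVert^{2}
+\frac{\lambda}{2}\bigl(\lVert W_{1}\rVert_{F}^{2}+\lVert W_{2}\rVert_{F}^{2}\bigr)
+\beta\sum_{j=1}^{s}\operatorname{KL}\bigl(\rho\,\big\|\,\hat{\rho}_{j}\bigr),
\qquad
\operatorname{KL}\bigl(\rho\,\big\|\,\hat{\rho}_{j}\bigr)=\rho\log\frac{\rho}{\hat{\rho}_{j}}+(1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_{j}},
$$

where $\hat{x}^{(i)}$ is the reconstruction of input $x^{(i)}$, $\hat{\rho}_{j}$ is the average activation of hidden unit $j$, $\rho$ is the target sparsity level, and $\lambda$, $\beta$ weight the decay and sparsity terms.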

Beyond learning the SAE based depth feature, we further pursue detector efficiency: during human body localization, it would be time-consuming to use a sliding window to scan all windows in the image scale-space. Researchers have proposed methods to reduce the search space in RGB images, such as branch-and-bound [16] and jumping windows [17]. In some special cases, supplementary information about the scene can be exploited to reduce the search space. For example, when the camera is fixed, moving targets can be extracted by background subtraction, and detection can then be performed on the foreground regions only. Other researchers use scene geometry information to reduce the search space [18]. In driver assistance systems, regions on the road can be quickly identified by calibrating the cameras [19], and potential objects are then detected only in those regions. In depth images, since depth values on an object surface are aggregated together, potential positions of objects can be predicted directly before detection. This inspires us to propose a method that accelerates detection.
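
To make the depth-histogram idea concrete, the sketch below shows one way candidate object depths could be proposed from histogram peaks; the bin count, vote threshold, and peak test are illustrative assumptions rather than the exact procedure of this paper.

```python
import numpy as np

def candidate_depths(depth_image, num_bins=64, min_votes=500):
    """Return depth values whose histogram bins form local peaks (candidate surfaces)."""
    valid = depth_image[depth_image > 0]              # discard missing / zero depth readings
    hist, edges = np.histogram(valid, bins=num_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks = []
    for i in range(1, num_bins - 1):
        if hist[i] >= min_votes and hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]:
            peaks.append(centers[i])                  # a dominant surface lies near this depth
    return peaks
```

Each returned depth would then seed detection window centers on the pixels falling into that bin, instead of scanning every image location.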

Overall, a detection framework beyond sliding window is proposed, aiming at fast human body detection in depth images. Our contributions are three-fold: (1) in terms of discriminative depth feature extraction, we introduce the SAE to learn depth features automatically for human body detection; (2) the histogram of depth is exploited to generate candidate detection windows, avoiding the time-consuming exhaustive sliding window search; and (3) we model the relationship between human body height and depth values, which is then used to determine the object size, avoiding the multiple-scale scanning that is typical of sliding window based detectors.
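
Contribution (3) can be pictured with a pinhole-camera approximation; this is our own illustrative assumption, since the exact height-depth model is given later in the paper:

$$
h \;\approx\; \frac{f\,H}{Z}, \qquad w \;\approx\; \alpha\,h,
$$

where $h$ and $w$ are the detection window height and width in pixels, $f$ is the focal length in pixels, $H$ a nominal human height, $Z$ the candidate depth, and $\alpha$ a fixed aspect ratio. Under such a model each candidate depth fixes the window size directly, so no multi-scale scan is needed.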

This paper is organized as follows. Section 2 summarizes state-of-the-art human body detection in RGB and depth images; Section 3 gives an overview of the proposed human body detection system; the SAE based feature learning method is described in Section 4; experimental results are presented in Section 5; conclusions and future work are presented in Section 6.

Section snippets

Human body detection in RGB images

Roughly speaking, there exist three human body models: the holistic model, the part-based model, and the patch-based model.

Holistic models: The holistic model, also called the monolithic model, extracts the human body feature as a whole rather than concatenating part-based features. The extracted feature (also called the descriptor) in a scanning window is then fed into a binary classifier to decide whether the window contains a human body. Various human body descriptors have been proposed, such

Overview of the proposed method

As shown in Fig. 1, the proposed detection framework works as follows.

Training module: In the training phase, depth values in the full images are normalized, and 16-by-16 patches are randomly selected from the full images. These patches are then fed into the sparse auto-encoder to learn a feature representation. Given the learned SAE, the human body feature is extracted as follows: first, the annotated sub-images containing a centered human body are resized to the normal size, which
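
A rough sketch of this training-data preparation step is given below; the normalization scheme and patch counts are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def sample_training_patches(depth_images, patch_size=16, patches_per_image=100, rng=None):
    """Normalize each depth image and randomly crop patch_size x patch_size patches."""
    rng = rng or np.random.default_rng(0)
    patches = []
    for img in depth_images:
        img = img.astype(np.float64)
        img = (img - img.min()) / (img.max() - img.min() + 1e-8)   # scale depths to [0, 1]
        H, W = img.shape
        for _ in range(patches_per_image):
            y = rng.integers(0, H - patch_size + 1)
            x = rng.integers(0, W - patch_size + 1)
            patches.append(img[y:y + patch_size, x:x + patch_size].reshape(-1))
    return np.stack(patches)   # (N, 256) patch vectors fed to the sparse auto-encoder
```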

Sparse Auto-Encoder

The Sparse Auto-Encoder (SAE) is an unsupervised feature learning method that avoids labor-intensive, handcrafted feature design. Experimental results on various applications, such as natural language processing, computer vision, and audio processing, are encouraging. The goal of the SAE is to make the output reproduce the input, so the hidden layer can be seen as a feature representation of the input. Essentially, the SAE is a type of unsupervised feed-forward neural network; its structure is shown
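
For concreteness, a minimal NumPy sketch of a sparse auto-encoder's forward pass and objective follows; it uses the standard reconstruction + weight-decay + KL-sparsity formulation, and the hyper-parameter values are illustrative rather than those of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """X: (n_visible, m) inputs stored column-wise; returns (loss, hidden activations)."""
    m = X.shape[1]
    A1 = sigmoid(W1 @ X + b1)                  # hidden activations = learned feature
    A2 = sigmoid(W2 @ A1 + b2)                 # reconstruction of the input
    recon = 0.5 * np.sum((A2 - X) ** 2) / m    # average squared reconstruction error
    rho_hat = A1.mean(axis=1)                  # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + beta * kl + decay, A1
```

In practice the weights are obtained by minimizing such a loss with backpropagation, and the hidden activations then serve as the learned depth feature.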

Dataset and performance measurement

Dataset: The SZU Kinect People Dataset, built by Yu et al. [12], is a large-scale dataset for human body detection in depth images. It contains 7260 labeled positive samples; the negative samples are from B3DO (Berkeley 3-D Object Dataset) [50], which is designed for household object detection and does not include any human body. The SZU Kinect People Dataset includes both color images and depth images, and the color images are calibrated with the depth images. Since we only focus on depth image feature

Conclusions

In this paper, we proposed a method for human body detection in depth images based on feature learning. By constructing an auto-encoding neural network that enables the machine to learn image features by itself, we do not need to manually design feature extraction methods. The SAE naturally compresses and reconstructs image features, making the feature extraction process automatic. Since the dimension of the descriptor is fairly high and the computation of the neural network is of great

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61422210, 61373076, and 61202143), the Fundamental Research Funds for the Central Universities (Nos. 2013121026 and 2011121052), the 985 Project of Xiamen University, the Natural Science Foundation of Fujian Province (No. 2013J05100), and the Key Projects Fund of Science and Technology in Xiamen (No. 3502Z20123017).

References (50)

  • G. Rogez et al., Exploiting projective geometry for view-invariant monocular human motion analysis in man-made environments, Comput. Vis. Image Understand. (2014)
  • L. Chen et al., A survey of human motion analysis using depth imagery, Pattern Recognit. Lett. (2013)
  • P. Dollar et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • D. Geronimo et al., Survey of pedestrian detection for advanced driver assistance systems, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • S.-Z. Su et al., A survey on pedestrian detection, Dianzi Xuebao (Acta Electron. Sin.) (2012)
  • Y. Gao et al., Visual–textual joint relevance learning for tag-based social image search, IEEE Trans. Image Process. (2013)
  • M. Everingham et al., The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. (2010)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on...
  • X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, in: 2009 IEEE 12th International...
  • P. Dollar, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: British Machine Vision Conference, vol. 2,...
  • P. Viola et al., Robust real-time face detection, Int. J. Comput. Vis. (2004)
  • L. Spinello, K.O. Arras, People detection in RGB-D data, in: 2011 IEEE/RSJ International Conference on Intelligent...
  • S. Wu, S. Yu, W. Chen, An attempt to pedestrian detection in depth images, in: 2011 Third Chinese Conference on...
  • S. Yu, S. Wu, L. Wang, SLTP: A fast descriptor for people detection in depth images, in: 2012 IEEE Ninth International...
  • S. Ikemura, H. Fujiyoshi, Real-time human detection using relational depth similarity features, in: Asian Conference on...
  • X. Ren, D. Ramanan, Histograms of sparse codes for object detection, in: 2013 IEEE Conference on Computer Vision and...
  • P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature...
  • C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: Object localization by efficient subwindow search, in:...
  • O. Chum, A. Zisserman, An exemplar model for learning object classes, in: IEEE Conference on Computer Vision and...
  • K. Dimza, T.-F. Su, S.-H. Lai, Search space reduction in pedestrian detection for driver assistance system based on...
  • D.M. Gavrila, Pedestrian detection from a moving vehicle, in: European Conference on Computer Vision, ECCV 2000,...
  • C. Papageorgiou et al., A trainable system for object detection, Int. J. Comput. Vis. (2000)
  • B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of...
  • P. Sabzmeydani, G. Mori, Detecting pedestrians by learning shapelet features, in: IEEE Conference on Computer Vision...
  • O. Tuzel et al., Pedestrian detection via classification on Riemannian manifolds, IEEE Trans. Pattern Anal. Mach. Intell. (2008)