Sparse auto-encoder based feature learning for human body detection in depth image
Introduction
Human body detection is a fundamental step in understanding human behavior from camera footage, with applications in intelligent video surveillance, driver assistance, and automatic action recognition. It has therefore become an active research field in the computer vision community. However, the appearance of the human body varies with illumination, pose, viewing angle, and partial occlusion, which makes it challenging to build a human body detector for real scenes.
RGB cameras were the main imaging devices in the early stage of human body detection, and extensive progress has been made on detection algorithms based on RGB or gray images [1], [2], [3], [4]. However, RGB and gray images still suffer from the imaging variations stated above, which largely decrease the performance of detection algorithms. Fortunately, with the popularity of depth sensors such as RealSense and Kinect, it is now feasible to obtain the distance between object surfaces and the camera in dynamic environments. Moreover, depth images are insensitive to illumination variations and shadows. As a result, research on detection algorithms for depth images has attracted ever-increasing attention.
In this paper we pursue detection efficiency. Currently, the sliding window (SW) is the mainstream approach to object and human body detection; for example, the majority of the entries in the PASCAL Visual Object Classes (VOC) challenge [5] used a “sliding window” approach to the detection task. Essentially, detection or localization is a binary classification problem: deciding whether a given scanned window contains an object. To this end, classifiers such as neural networks (NN), Support Vector Machines (SVMs), Random Forests (RF), and AdaBoost have been widely investigated. On the feature side, descriptors such as Histogram of Oriented Gradients (HOG) [6], Local Binary Pattern (LBP) [7], Integral Channel Features [8], and Haar-like features [9] are widely used.
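To make the cost of the sliding-window approach concrete, the following minimal sketch enumerates every window that a detector would have to classify. The window size, stride, and scale values are illustrative assumptions rather than settings from this paper, and `classify` is a hypothetical stand-in for any binary classifier that accepts a window.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride, scales):
    """Yield (x, y, w, h) for every window position at every scale."""
    for s in scales:
        w, h = int(round(win_w * s)), int(round(win_h * s))
        step = max(1, int(round(stride * s)))
        # range() is empty when the scaled window is larger than the image.
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield (x, y, w, h)


def detect(img_w, img_h, classify, **window_args):
    """Keep the windows that the (hypothetical) binary classifier accepts."""
    return [win for win in sliding_windows(img_w, img_h, **window_args)
            if classify(win)]
```

Even with only a handful of scales, a VGA-sized image produces thousands of windows, each requiring one feature extraction and one classifier evaluation, which is the cost that candidate-window methods try to avoid.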
To carry out human body detection in depth images, the first task is to design a discriminative descriptor. In the literature, few feature extraction algorithms have been developed exclusively for depth images, and most of the proposed depth descriptors are similar to those in the RGB domain. For example, Spinello and Arras [10] and Wu et al. [11] proposed Histograms of Oriented Depths (HOD) and Histogram of Depth Difference (HDD), respectively, both of which are very similar to HOG. Yu et al. [12] proposed the Simplified Local Ternary Patterns (SLTP) descriptor, an improvement of Local Ternary Patterns (LTP). Ikemura and Fujiyoshi [13] proposed a descriptor called Relational Depth Similarity Feature (RDSF), which is based on statistical features of depth values. However, all of the above features are hand-crafted. To some extent, they encode only part of the human body information while neglecting the intrinsic structure of the human body, and are therefore not discriminative enough in complex scenes.
Along with recent advances in sparse coding and deep learning, learning-based features are receiving ever-increasing research attention. For instance, Ren and Ramanan [14] proposed a learning-based feature built on sparse coding. Dollar et al. [8] proposed a learning-based integral channel feature, with encouraging performance on pedestrian detection. Sermanet et al. [15] proposed an unsupervised multi-stage feature learning method for pedestrian detection. However, these features work in the RGB image domain only, and there are few works in the literature on learning-based depth features. Recently, the sparse auto-encoder (SAE) has become popular because it can automatically learn features from unlabelled data. SAE has achieved satisfying performance in many applications, such as image classification, voice recognition, and hand gesture recognition. In this paper, we introduce SAE to learn depth image features for human body detection. SAE also offers an efficiency benefit, as its objective function can be optimized quickly via back-propagation.
Having learned the SAE-based depth feature, we further pursue detector efficiency: during human body localization, it is time-consuming to use a sliding window to scan all the windows in the image scale-space. Researchers have proposed methods to reduce the search space in RGB images, such as branch-and-bound [16] and jumping windows [17]. In some special cases, supplementary information about the scene can be exploited to reduce the search space. For example, when the camera is fixed, moving targets can be extracted by background subtraction, and detection can then be performed only on the foreground regions. Other researchers use scene geometry information to reduce the search space [18]. In driver assistance systems, regions on the road can be quickly identified by calibrating the cameras [19], and potential objects are then detected only in those regions. In depth images, since depth values on an object's surface are clustered together, the potential positions of objects can be predicted directly before detection. This inspires us to propose a method to accelerate detection.
Overall, we propose a detection framework beyond the sliding window, aiming at fast human body detection in depth images. Our contributions are three-fold: (1) In terms of discriminative depth feature extraction, we introduce SAE to learn depth features automatically for human body detection. (2) A histogram of depth is exploited to generate candidate detection windows, thus avoiding the time-consuming exhaustive sliding-window search. (3) We model the relationship between human body height and depth values, which is then used to determine the object size, thus avoiding the multi-scale scanning that is typical of sliding-window-based detectors.
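Contributions (2) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the peak-picking rule, the bin width, and the use of a pinhole projection (pixel height = focal length × body height / depth) are assumptions made here for illustration, as are the default focal length of 525 pixels and the nominal body height of 1700 mm.

```python
def depth_histogram(depth_mm, bin_mm=100, max_mm=8000):
    """Count valid depth pixels per bin; non-positive values count as missing."""
    hist = [0] * (-(-max_mm // bin_mm))  # ceiling division
    for row in depth_mm:
        for d in row:
            if 0 < d < max_mm:
                hist[int(d) // bin_mm] += 1
    return hist


def candidate_depths(hist, bin_mm=100, min_count=50):
    """Return bin-centre depths (mm) of well-supported local maxima."""
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        # Strict on the left, non-strict on the right: one peak per plateau.
        if count >= min_count and count > left and count >= right:
            peaks.append(i * bin_mm + bin_mm // 2)
    return peaks


def window_height_px(depth_mm, focal_px=525.0, body_height_mm=1700.0):
    """Assumed pinhole model: expected pixel height of a body at this depth."""
    return focal_px * body_height_mm / depth_mm
```

Each candidate depth yields a single window size, so a detector built this way evaluates one scale per candidate instead of scanning the whole scale-space.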
This paper is organized as follows. Section 2 summarizes the state of the art in human body detection in RGB images and depth images; Section 3 gives an overview of the proposed human body detection system; Section 4 describes the SAE-based feature learning method; Section 5 presents the experimental results; and Section 6 presents the conclusions and future work.
Human body detection in RGB images
Roughly speaking, there exist three human body models: the holistic model, the part-based model, and the patch-based model.
Holistic models: The holistic model, also called the monolithic model, extracts the human body feature as a whole rather than concatenating part-based features. The extracted feature (also called the descriptor) in a scanning window is then fed into a binary classifier to decide whether the window contains a human body. Various human body descriptors have been proposed, such
Overview of the proposed method
As shown in Fig. 1, the proposed detection framework is summarized below.
Training module: In the training phase, depth values in the full images are normalized, and patches of size 16 by 16 are selected randomly from the full images. Those patches are then fed into a sparse auto-encoder to learn a feature representation. Given the learned SAE, the human body feature is extracted as follows: first, the annotated sub-images containing a human body at the image center are resized to normal size, which
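The normalization and random patch selection described for the training module can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say how depth values are normalized, so dividing by a maximum depth and clipping to [0, 1] is an assumption, and the function names, `max_depth`, and `seed` parameters are hypothetical. Plain Python lists are used to keep the sketch dependency-free; an array library would be the usual choice in practice.

```python
import random


def normalize_depth(depth, max_depth):
    """Scale raw depth values into [0, 1], clipping at max_depth (assumed scheme)."""
    return [[min(max(d, 0), max_depth) / max_depth for d in row]
            for row in depth]


def sample_patches(image, n_patches, size=16, seed=0):
    """Draw n_patches random size-by-size patches, each flattened row by row."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    patches = []
    for _ in range(n_patches):
        # randint is inclusive at both ends, so every patch stays in bounds.
        y = rng.randint(0, height - size)
        x = rng.randint(0, width - size)
        patches.append([image[y + dy][x + dx]
                        for dy in range(size) for dx in range(size)])
    return patches
```

Each 16 by 16 patch becomes a 256-dimensional vector, which is the shape of input an auto-encoder with 256 input units would expect.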
Sparse Auto-Encoder
Sparse Auto-Encoder (SAE) is an unsupervised feature learning method that avoids labor-intensive, hand-crafted feature design. It has shown encouraging results in various applications, such as natural language processing, computer vision, and audio processing. The goal of SAE is to make the output equal to the input, that is, to reconstruct its input. The hidden layer in SAE can be seen as a feature extractor for the input layer. Essentially, SAE is a type of unsupervised feed-forward neural network; its structure is shown
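As a rough illustration of what "making the output equal to the input" means as an objective, the sketch below evaluates a commonly used sparse auto-encoder cost: mean squared reconstruction error, plus weight decay, plus a KL-divergence penalty that pushes the average hidden activation towards a small target. This textbook formulation is an assumption here; the exact objective used in this paper is not given in the excerpt, and the hyperparameter names `rho`, `beta`, and `lam` and their defaults are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b):
    """One fully connected sigmoid layer; W is a list of weight rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def sae_cost(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """Reconstruction error + weight decay + KL sparsity penalty (assumed form)."""
    m = len(X)
    recon = 0.0
    mean_act = [0.0] * len(W1)
    for x in X:
        h = forward(x, W1, b1)   # hidden layer: the learned feature
        y = forward(h, W2, b2)   # output layer: reconstruction of x
        recon += 0.5 * sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        mean_act = [a + hj / m for a, hj in zip(mean_act, h)]
    decay = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)
    kl = sum(rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
             for r in mean_act)
    return recon / m + decay + beta * kl
```

After training, only the encoder half is needed: `forward(x, W1, b1)` returns the hidden activations, which serve as the feature vector for a patch `x`.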
Dataset and performance measurement
Dataset: The SZU Kinect People Dataset, built by Yu et al. [12], is a large-scale dataset for human body detection in depth images. It contains 7260 labeled positive samples; the negative samples are taken from B3DO (Berkeley 3-D Object Dataset) [50], which is designed for household object detection and does not include any human bodies. The SZU Kinect People Dataset includes both color images and depth images, and the color images are calibrated with the depth images. Since we only focus on depth image feature
Conclusions
In this paper, we proposed a method for human body detection based on feature learning in depth images. By constructing an auto-encoding neural network that enables the machine to learn image features by itself, we do not need to manually design methods for feature extraction. The SAE can naturally compress and reconstruct image features, making the feature extraction process automatic. Since the dimension of descriptor is fairly high and the computation of neural network is of great
Acknowledgments
This work is supported by the Natural Science Foundation of China (Nos. 61422210, 61373076, and 61202143), the Fundamental Research Funds for the Central Universities (Nos. 2013121026 and 2011121052), the 985 Project of Xiamen University, the Natural Science Foundation of Fujian Province (No. 2013J05100), and the Key Projects Fund of Science and Technology in Xiamen (No. 3502Z20123017).
References (50)
- et al., Exploiting projective geometry for view-invariant monocular human motion analysis in man-made environments, Comput. Vis. Image Understand. (2014)
- et al., A survey of human motion analysis using depth imagery, Pattern Recognit. Lett. (2013)
- et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al., Survey of pedestrian detection for advanced driver assistance systems, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- et al., A survey on pedestrian detection, Dianzi Xuebao (Acta Electron. Sin.) (2012)
- et al., Visual–textual joint relevance learning for tag-based social image search, IEEE Trans. Image Process. (2013)
- et al., The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. (2010)
- N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on...
- X. Wang, T.X. Han, S. Yan, An hog-lbp human detector with partial occlusion handling, in: 2009 IEEE 12th International...
- P. Dollar, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: British Machine Vision Conference, vol. 2,...