View-invariant gesture recognition using 3D optical flow and harmonic motion context
Research highlights
► We apply intensity and depth data acquired by a Time-of-Flight sensor. ► Gestures are represented efficiently using 3D optical flow and motion context. ► We show how a motion context can be made view-invariant using spherical harmonics. ► We address the problem of not knowing when a gesture commences and terminates. ► Recognition rate of 94.4% and validation of view-invariance.
Introduction
Automatic analysis of humans and their actions has received increasing attention in the last decade [21]. One of the areas of interest is recognition of human gestures for use in, for example, Human Computer Interaction.
Many different approaches to gesture recognition have been reported [20]. They apply a number of different segmentation, feature extraction, and recognition strategies. E.g., [19], [31] extract and represent human gestures/actions by velocity histories of tracked keypoints and ballistic dynamics, respectively, while gestures are recognized, e.g., through Hidden Markov Models (HMMs) [1], [24], [25] or Dynamic Bayesian Networks (DBNs) [2], [32]. These methods are virtually all based on analyzing 2D data, i.e., images. A consequence is that such approaches only analyze 2D gestures carried out in the image plane, which is only a projection of the actual gesture. As a result, the projection of the gesture will depend on the viewpoint and not contain full information about the performed gesture. To overcome this shortcoming, the use of 3D data has been introduced through the use of two or more cameras, see for example [4], [33]. In this way, e.g., the surface structure or a 3D volume of the person can be reconstructed, and thereby a more descriptive representation for gesture recognition can be established. We follow this line of work and also apply 3D data. To avoid the difficulties inherent to classical stereo approaches (the correspondence problem, careful camera placement and calibration), we instead apply a Time-of-Flight (ToF) range camera, the SwissRanger SR4000. Each pixel in this camera directly provides a depth value (distance to the object). Even though range-camera technology is still in its early days, e.g., resulting in low-resolution data, the great potential of such sensors has already resulted in them being applied in a number of typical computer vision applications like face detection [7], face tracking [6], shape analysis [13], [15], robot navigation [23] and gesture-based scene navigation [26]. In [14] a survey of recent developments in ToF technology is presented; it discusses applications of this technology for vision, graphics, and HCI.
The development of range cameras has progressed rapidly over the last few years, leading to the release of new and improved camera models from some of the main manufacturers: MESA Imaging [36], PMD Technologies [37] and 3DV Systems [35]. Recently, MESA Imaging released the new SwissRanger SR4000 range camera with higher frame rate (up to 54 fps) and resolution (176 × 144 pixels). 3DV Systems is aiming at a consumer-class range camera with a size and look similar to a regular web camera and an integrated sensor capable of producing 1-megapixel color images, while PMD Technologies made a camera version with an improved operating range (up to 40 m) for, e.g., pedestrian detection in cars.
The SwissRanger camera that we apply also provides an amplitude value corresponding to an intensity value for each pixel. This means that at any given time instant both a depth image and an intensity image are present. For some applications these two types of information complement each other and are therefore both used. For example, in [13], where the objective is to segment planar surfaces in 3D (range) data, the edges in the intensity image are applied to improve the result. Similar benefits of applying both data types can be seen in [6], [7], [8]. We also apply both data types and will show how they complement each other.
Applying 3D data allows for analysis of 3D gestures. However, we are still faced with the problem that a user has to be fronto-parallel with respect to the camera. A few works have been reported without the assumption of the user being fronto-parallel. E.g., in [27] five calibrated and synchronized cameras are used to acquire data (the publicly available IXMAS data set), which is further projected to 64 evenly spaced virtual cameras used for training. Actions are described in a view-invariant manner by computing transform surfaces and manifold learning. Similarly, [33] uses the same data set to compute motion history volumes, which are used to derive view-invariant motion descriptors in Fourier space. Another example is seen in [4], where 3D human body shapes are used for view-independent identification of human body postures, trained and tested on another multi-camera dataset.
The need for multiple calibrated and synchronized cameras, followed by an exhaustive training phase for multiple viewpoints, is obviously not desirable. Instead we aim at a view-invariant approach which is trained on examples from one camera viewpoint and able to recognize gestures from a very different viewpoint, say ±45°. Another issue we want to combat is the often used assumption of known start and end points. That is, often the test data consist of N sequences where each sequence contains one and only one gesture. This obviously makes the problem easier, and it favors a trajectory-based approach, where each gesture is represented as a trajectory through some state-space with known start and end points. For real-life scenarios the start and end points are normally not known. To deal with this issue we follow the notion of recognition through a set of primitives [5], [10], [30], [34]. Concretely, we define a primitive as a time instance with significant 3D motion.
So, we represent gestures as an ordered sequence of 3D motion primitives (temporal instances). We focus on arm gestures and therefore only segment the arms (when they move), thereby suppressing the rest of the (irrelevant) body information. Concretely, we extract the moving arms using a 3D version of optical flow to produce velocity annotated point clouds and represent this data efficiently by their motion context. The motion context is an extended version of the regular shape context [3], and represents the velocity annotated point cloud using the location of motion together with the amount of motion and its direction. We make the primitives invariant to rotation around the vertical axis by re-representing the motion context using spherical harmonic basis functions, yielding a harmonic motion context representation. In each frame the primitive, if any, that best explains the observed data is identified. This leads to a discrete recognition problem, since a video sequence of range data is converted into a string containing a sequence of symbols, each representing a primitive. After pruning the string, a probabilistic Edit Distance classifier is applied to identify which gesture best describes the pruned string. Our approach is illustrated in Fig. 1.
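The core of the rotation-invariance step can be illustrated with a simplified 1D analogue. A rotation of the subject around the vertical axis circularly shifts the azimuthal bins of the motion context; the magnitudes of the azimuthal Fourier coefficients are unchanged by such a shift, which is the same principle the spherical-harmonic representation exploits. The sketch below is an illustrative reduction, not the paper's full harmonic expansion, and the bin counts are arbitrary assumptions:

```python
import numpy as np

def azimuthal_invariant(hist):
    """Shift-invariant features for an (S, T, U) spherical histogram.

    Rotation about the vertical axis only circularly shifts the S
    azimuthal bins; by the Fourier shift theorem, the coefficient
    magnitudes along that axis are unaffected.
    """
    return np.abs(np.fft.fft(hist, axis=0))

rng = np.random.default_rng(0)
h = rng.random((8, 4, 3))            # toy motion context: S=8, T=4, U=3
h_rot = np.roll(h, 3, axis=0)        # subject rotated by 3 azimuthal bins

# The invariant features of the original and rotated histograms agree.
print(np.allclose(azimuthal_invariant(h), azimuthal_invariant(h_rot)))  # True
```

Spherical harmonics generalize this idea to the sphere: rotation about the vertical axis multiplies each order-m coefficient by a phase factor, so taking magnitudes removes the dependence on the viewing azimuth.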
This paper is organized as follows. Data acquisition and preprocessing is presented in Section 2, followed up by how we perform motion detection in 3D. In Section 3 we describe the concept of motion primitives, and how they are represented compactly by introducing motion context. Furthermore, we show how the motion context can be transformed into a view-invariant representation using spherical harmonic basis functions, yielding a harmonic motion context representation. In Section 4 we describe the classification of motion primitives, and how we perform gesture recognition by introducing a probabilistic edit distance classifier. Finally, we present experimental results in Section 5 and concluding remarks in Section 6.
Section snippets
Data acquisition and preprocessing
We capture intensity and range data using a SwissRanger SR4000 range camera from MESA Imaging. The camera is based on the Time-of-Flight (ToF) principle and emits radio-frequency modulated (30 MHz) light in the near-infrared spectrum (850 nm), which is backscattered by the scene and detected by a CMOS CCD. The resolution is 176 × 144 pixels with an active range of 0.3–5.0 m and a field of view of 43.6 × 34.6°. The distance accuracy is typically in the order of ±1 centimeter, depending on the distance
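Back-projecting the per-pixel depth values into a 3D point cloud can be sketched with a simple pinhole model, deriving focal lengths from the stated field of view (43.6° × 34.6°) and resolution (176 × 144 pixels). This is a minimal illustration under a pinhole assumption, not the SR4000's calibrated projection model:

```python
import numpy as np

def depth_to_pointcloud(depth, fov_deg=(43.6, 34.6), res=(176, 144)):
    """Back-project an (H, W) depth image (metres) to an (H, W, 3) point cloud.

    Sketch assuming an ideal pinhole camera: focal lengths follow from
    the field of view and resolution; the principal point is the image
    centre. Real range cameras require per-device calibration.
    """
    w, h = res
    fx = (w / 2.0) / np.tan(np.radians(fov_deg[0]) / 2.0)
    fy = (h / 2.0) / np.tan(np.radians(fov_deg[1]) / 2.0)
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0

    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

Each pixel thus yields a 3D point; annotating these points with velocities from the optical flow gives the velocity annotated point cloud used in the following sections.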
Motion context
After motion detection we are left with a velocity annotated point cloud in 3D, which is represented efficiently using a motion oriented version of shape context. We call this representation the motion context.
A shape context [3] is based on a spherical histogram. This histogram is centered in a reference point and divided linearly into S azimuthal (east–west) bins and T colatitudinal (north–south) bins, while the radial direction is divided into U bins. Fig. 5 gives an example of the shape
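The binning described above can be sketched as follows: each point of the velocity annotated cloud is assigned to one of S × T × U spherical bins around the reference point and weighted by its motion magnitude. The bin counts, maximum radius, linear radial binning, and normalisation below are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def motion_context(points, velocities, center, S=8, T=4, U=3, r_max=1.0):
    """S x T x U spherical histogram of motion around `center`.

    points, velocities: (N, 3) arrays; each point contributes its
    velocity magnitude to the (azimuth, colatitude, radius) bin it
    falls in. Points outside r_max are discarded.
    """
    d = points - center
    r = np.linalg.norm(d, axis=1)
    keep = (r > 0) & (r < r_max)
    d, r = d[keep], r[keep]
    w = np.linalg.norm(velocities[keep], axis=1)        # amount of motion

    az = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)     # east-west angle
    col = np.arccos(np.clip(d[:, 2] / r, -1.0, 1.0))    # north-south angle

    s = np.minimum((az / (2 * np.pi) * S).astype(int), S - 1)
    t = np.minimum((col / np.pi * T).astype(int), T - 1)
    u = np.minimum((r / r_max * U).astype(int), U - 1)  # linear radial bins

    hist = np.zeros((S, T, U))
    np.add.at(hist, (s, t, u), w)                       # accumulate motion
    return hist / (hist.sum() + 1e-12)                  # normalise
```

In this sketch only the magnitude of motion is accumulated; the paper's descriptor additionally encodes motion direction per bin.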
Classification
The classification is divided into two main tasks: recognition of motion primitives by use of the harmonic motion context descriptors, and recognition of the actual gestures using an ordered sequence of primitives (see Fig. 1).
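The string-matching stage can be sketched with a plain Levenshtein edit distance: each gesture model is an ordered primitive string, and the observed (pruned) string is assigned to the gesture whose string it can be transformed into with the fewest insertions, deletions, and substitutions. Note this is the unit-cost variant; the paper's classifier is probabilistic, weighting the edit operations, and the gesture names and primitive symbols below are hypothetical:

```python
def edit_distance(a, b):
    """Levenshtein distance between two primitive strings (unit costs)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n]

# Hypothetical vocabulary: each gesture is an ordered string of primitives.
gestures = {"raise_arm": "ABC", "wave": "ABAB", "point": "ACD"}
observed = "ABCC"   # pruned primitive string detected from the video
best = min(gestures, key=lambda g: edit_distance(observed, gestures[g]))
# best == "raise_arm" (distance 1, versus 2 for the other two gestures)
```

Because recognition operates on the whole symbol string rather than a trajectory with fixed endpoints, unknown gesture start and end points appear only as extra insertions/deletions that the distance absorbs.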
Test and results
For testing purposes we use a vocabulary consisting of 22 primitives. This is illustrated in Fig. 10. The criteria for finding the primitives are (1) that they represent characteristic and representative 3D configurations, (2) that their configurations contain a certain amount of motion, and (3) that the primitives are used in the description of as many gestures as possible, so that fewer primitives are required overall. Using this vocabulary of primitives we describe 4 one- and two-arm gestures:
Conclusion
The contributions of this paper are twofold. Firstly, motion is detected by 2D optical flow estimated in the intensity image but extended to 3D using the depth information acquired from only one viewpoint by a range camera. We show how gestures can be represented efficiently using motion context, and how gesture recognition can be made view-invariant through the use of 3D data and transforming a motion context representation using spherical harmonics. Secondly, for the gesture recognition
Acknowledgments
This work is partially funded by the MoPrim and the BigBrother projects (Danish National Research Councils – FTP) and partially by the HERMES Project (FP6 IST-027110).
References (37)
- et al., Determining optical flow, Artificial Intelligence (1981)
- et al., A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (2006)
- et al., Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding (2006)
- M. Ahmad, S.-W. Lee, HMM-based human action recognition using multiview image sequences, in: International Conference...
- et al., Dynamic Bayesian networks for visual recognition of dynamic gestures, Journal of Intelligent and Fuzzy Systems (2002)
- et al., Shape matching and object recognition using shape contexts, Pattern Analysis and Machine Intelligence (2002)
- I. Cohen, H. Li, Inference of human postures by classification of 3D human body shape, in: Workshop on Analysis and...
- J. Gonzalez, J. Varona, F. Roca, J. Villanueva, aSpaces: action spaces for recognition and synthesis of human actions,...
- M. Haker, M. Bohme, T. Martinetz, E. Barth, Geometric invariants for facial feature tracking with 3D TOF cameras, in:...
- D. Hansen, R. Larsen, F. Lauze, Improving face detection with TOF cameras, in: International Symposium on Signals,...