View-invariant gesture recognition using 3D optical flow and harmonic motion context

https://doi.org/10.1016/j.cviu.2010.07.012

Abstract

This paper presents an approach for view-invariant gesture recognition. The approach is based on 3D data captured by a SwissRanger SR4000 camera. This camera produces both a depth map and an intensity image of a scene. Since the two information types are aligned, we can use the intensity image to define a region of interest for the relevant 3D data. This data fusion improves the quality of the motion detection and hence results in better recognition. The gesture recognition is based on finding motion primitives (temporal instances) in the 3D data. Motion is detected by a 3D version of optical flow and results in velocity annotated point clouds. The 3D motion primitives are represented efficiently by introducing motion context. The motion context is transformed into a view-invariant representation using spherical harmonic basis functions, yielding a harmonic motion context representation. A probabilistic Edit Distance classifier is applied to identify which gesture best describes a string of primitives. The approach is trained on data from one viewpoint and tested on data from a very different viewpoint. The recognition rate is 94.4%, which is similar to the recognition rate when training and testing on gestures from the same viewpoint; hence, the approach is indeed view-invariant.

Research highlights

► We apply intensity and depth data acquired by a Time-of-Flight sensor.
► Gestures are represented efficiently using 3D optical flow and motion context.
► We show how a motion context can be made view-invariant using spherical harmonics.
► We address the problem of not knowing when a gesture commences and terminates.
► Recognition rate of 94.4% and validation of view-invariance.

Introduction

Automatic analysis of humans and their actions has received increasing attention in the last decade [21]. One of the areas of interest is recognition of human gestures for use in, for example, Human Computer Interaction.

Many different approaches to gesture recognition have been reported [20]. They apply a number of different segmentation, feature extraction, and recognition strategies. For example, [19], [31] extract and represent human gestures/actions by velocity histories of tracked keypoints and by ballistic dynamics, respectively, while gestures are recognized, e.g., through Hidden Markov Models (HMMs) [1], [24], [25] or Dynamic Bayesian Networks (DBNs) [2], [32]. These methods are virtually all based on analyzing 2D data, i.e., images. A consequence of this is that such approaches only analyze 2D gestures carried out in the image plane, which is only a projection of the actual gesture. As a result, the projection of the gesture will depend on the viewpoint and will not contain full information about the performed gesture. To overcome this shortcoming, the use of 3D data has been introduced through the use of two or more cameras, see for example [4], [33]. In this way, e.g., the surface structure or a 3D volume of the person can be reconstructed, and thereby a more descriptive representation for gesture recognition can be established. We follow this line of work and also apply 3D data. To avoid the difficulties inherent to classical stereo approaches (the correspondence problem, careful camera placement and calibration) we instead apply a Time-of-Flight (ToF) range camera – the SwissRanger SR4000. Each pixel in this camera directly provides a depth value (distance to object). Even though the technology in range cameras is still in its early days, e.g., resulting in low-resolution data, the great potential of such sensors has already resulted in them being applied in a number of typical computer vision applications like face detection [7], face tracking [6], shape analysis [13], [15], robot navigation [23] and gesture-based scene navigation [26]. In [14] a survey of recent developments in ToF technology is presented. It discusses applications of this technology for vision, graphics, and HCI.

The development of range cameras has progressed rapidly over the last few years, leading to the release of new and improved camera models from some of the main manufacturers: MESA Imaging [36], PMD Technologies [37] and 3DV Systems [35]. Recently, MESA Imaging released the new SwissRanger SR4000 range camera with a higher frame rate (up to 54 fps) and resolution (176 × 144 pixels). 3DV Systems is aiming at a consumer-class range camera with a size and look similar to a regular web camera and an integrated sensor capable of producing 1-megapixel color images, while PMD Technologies has made a camera version with an improved operating range (up to 40 m) for, e.g., pedestrian detection in cars.

The SwissRanger camera that we apply also provides an amplitude value corresponding to an intensity value for each pixel. This means that at any given time instant both a depth image and an intensity image are present. For some applications these two information types complement each other and are therefore both used. For example, in [13], where the objective is to segment planar surfaces in 3D (range) data, the edges in the intensity image are applied to improve the result. Similar benefits of applying both data types can be seen in [6], [7], [8]. We also apply both data types and will show how they complement each other.
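As an illustration of this kind of fusion, the sketch below masks the depth data with a region of interest derived from the aligned intensity image. The thresholding rule and the function name are hypothetical; they only illustrate how pixel-aligned intensity and depth data can be combined.

```python
import numpy as np

def mask_depth_by_intensity(intensity, depth, threshold=0.1):
    """Keep only depth samples whose aligned intensity exceeds a threshold
    (a hypothetical ROI rule); the rest are marked invalid with NaN."""
    assert intensity.shape == depth.shape          # the SR4000 delivers pixel-aligned images
    roi = intensity > threshold                    # region of interest from the intensity image
    masked = depth.astype(float).copy()
    masked[~roi] = np.nan                          # discard 3D data outside the ROI
    return masked
```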

Applying 3D data allows for analysis of 3D gestures. However, we are still faced with the problem that a user has to be fronto-parallel with respect to the camera. A few works have been reported without the assumption of the user being fronto-parallel. E.g., in [27] five calibrated and synchronized cameras are used to acquire data (the publicly available IXMAS data set), which is further projected to 64 evenly spaced virtual cameras used for training. Actions are described in a view-invariant manner by computing R transform surfaces and manifold learning. Similarly, [33] use the same data set to compute motion history volumes, which are used to derive view-invariant motion descriptors in Fourier space. Another example is seen in [4], where 3D human body shapes are used for view-independent identification of human body postures, trained and tested on another multi-camera dataset.

The need for multiple calibrated and synchronized cameras followed by an exhaustive training phase for multiple viewpoints is obviously not desirable. Instead we aim at a view-invariant approach which is trained on examples from one camera viewpoint and able to recognize gestures from a very different viewpoint, say ±45°. Another issue we want to combat is the often-used assumption of known start and end points. That is, often the test data consists of N sequences where each sequence contains one and only one gesture. This obviously makes the problem easier, and it favors a trajectory-based approach, where each gesture is represented as a trajectory through some state-space with known start and end points. For real-life scenarios the start and end points are normally not known. To deal with this issue we follow the notion of recognition through a set of primitives [5], [10], [30], [34]. Concretely, we define a primitive as a time instance with significant 3D motion.

So, we represent gestures as an ordered sequence of 3D motion primitives (temporal instances). We focus on arm gestures and therefore only segment the arms (when they move), thereby suppressing the rest of the (irrelevant) body information. Concretely, we extract the moving arms using a 3D version of optical flow to produce velocity annotated point clouds and represent this data efficiently by its motion context. The motion context is an extended version of the regular shape context [3], and represents the velocity annotated point cloud using the location of motion together with the amount of motion and its direction. We make the primitives invariant to rotation around the vertical axis by re-representing the motion context using spherical harmonic basis functions, yielding a harmonic motion context representation. In each frame the primitive, if any, which best explains the observed data is identified. This leads to a discrete recognition problem, since a video sequence of range data is converted into a string containing a sequence of symbols, each representing a primitive. After pruning the string, a probabilistic Edit Distance classifier is applied to identify which gesture best describes the pruned string. Our approach is illustrated in Fig. 1.
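To make the flow of the approach concrete, the outline below sketches the per-frame processing and the final string classification. All function and variable names are hypothetical placeholders for the steps described above, not the authors' implementation.

```python
def recognize_gesture(frames, primitive_library, gesture_strings):
    """Sketch of the pipeline: frames -> primitive string -> best-matching gesture."""
    symbols = []
    for intensity, depth in frames:
        flow3d = estimate_3d_flow(intensity, depth)        # velocity annotated point cloud
        mc = motion_context(flow3d)                        # spherical motion histogram
        hmc = harmonic_motion_context(mc)                  # view-invariant descriptor
        primitive = best_matching_primitive(hmc, primitive_library)
        if primitive is not None:                          # only frames with significant motion
            symbols.append(primitive)
    pruned = prune(symbols)                                # e.g. collapse repeated symbols
    # pick the gesture whose primitive string best explains the pruned observation
    return min(gesture_strings, key=lambda g: edit_distance(pruned, gesture_strings[g]))
```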

This paper is organized as follows. Data acquisition and preprocessing are presented in Section 2, followed by a description of how we perform motion detection in 3D. In Section 3 we describe the concept of motion primitives and how they are represented compactly by introducing motion context. Furthermore, we show how the motion context can be transformed into a view-invariant representation using spherical harmonic basis functions, yielding a harmonic motion context representation. In Section 4 we describe the classification of motion primitives and how we perform gesture recognition by introducing a probabilistic Edit Distance classifier. Finally, we present experimental results in Section 5 and concluding remarks in Section 6.

Section snippets

Data acquisition and preprocessing

We capture intensity and range data using a SwissRanger SR4000 range camera from MESA Imaging. The camera is based on the Time-of-Flight (ToF) principle and emits radio-frequency modulated (30 MHz) light in the near-infrared spectrum (850 nm), which is backscattered by the scene and detected by a CMOS/CCD sensor. The resolution is 176 × 144 pixels with an active range of 0.3–5.0 m and a field of view of 43.6 × 34.6°. The distance accuracy is typically on the order of ±1 cm, depending on the distance
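Motion is detected as 2D optical flow in the intensity image and extended to 3D using the aligned depth map. The sketch below illustrates one way such lifting could be done with a pinhole camera model; the Farneback flow parameters and the intrinsics (fx, fy, cx, cy) are assumptions for illustration, not values from the paper.

```python
import cv2
import numpy as np

def velocity_annotated_point_cloud(prev_gray, gray, prev_depth, depth,
                                   fx, fy, cx, cy, dt=1.0):
    """Lift dense 2D optical flow (intensity images) to 3D velocities using
    the aligned depth maps, yielding a velocity annotated point cloud."""
    # dense 2D flow on the intensity images (parameter values are illustrative)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    # back-project the previous frame's pixels to 3D (pinhole model)
    Z1 = prev_depth
    X1 = (u - cx) * Z1 / fx
    Y1 = (v - cy) * Z1 / fy
    # displaced pixel positions in the current frame, with depth sampled there
    u2 = u + flow[..., 0]
    v2 = v + flow[..., 1]
    Z2 = cv2.remap(depth, u2, v2, cv2.INTER_LINEAR)
    X2 = (u2 - cx) * Z2 / fx
    Y2 = (v2 - cy) * Z2 / fy
    points = np.stack([X1, Y1, Z1], axis=-1).reshape(-1, 3)
    velocities = (np.stack([X2, Y2, Z2], axis=-1).reshape(-1, 3) - points) / dt
    valid = np.isfinite(velocities).all(axis=1) & (points[:, 2] > 0)
    return points[valid], velocities[valid]
```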

Motion context

After motion detection we are left with a velocity annotated point cloud in 3D, which is represented efficiently using a motion-oriented version of the shape context. We call this representation the motion context.

A shape context [3] is based on a spherical histogram. This histogram is centered at a reference point and divided linearly into S azimuthal (east–west) bins and T colatitudinal (north–south) bins, while the radial direction is divided into U bins. Fig. 5 gives an example of the shape
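A minimal sketch of such a descriptor is given below, assuming S azimuthal, T colatitudinal and U radial bins, samples weighted by velocity magnitude, and a harmonic form obtained by expanding each radial shell in spherical harmonics and keeping coefficient magnitudes (which do not change under rotation about the vertical axis). The bin counts, the choice of reference point and the exact weighting are assumptions for illustration; the paper's precise formulation may differ.

```python
import numpy as np
from scipy.special import sph_harm

def motion_context(points, velocities, center, S=8, T=8, U=4, r_max=1.0):
    """Spherical histogram (S azimuthal x T colatitudinal x U radial bins),
    each sample weighted by the magnitude of its 3D velocity."""
    d = points - center
    r = np.linalg.norm(d, axis=1)
    theta = np.arccos(np.clip(d[:, 2] / np.maximum(r, 1e-9), -1, 1))   # colatitude
    phi = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)              # azimuth
    w = np.linalg.norm(velocities, axis=1)                             # amount of motion
    hist = np.zeros((S, T, U))
    s = np.minimum((phi / (2 * np.pi) * S).astype(int), S - 1)
    t = np.minimum((theta / np.pi * T).astype(int), T - 1)
    u = np.minimum((r / r_max * U).astype(int), U - 1)
    np.add.at(hist, (s, t, u), w)
    return hist

def harmonic_motion_context(hist, L=4):
    """For each radial shell, expand the angular histogram in spherical
    harmonics and keep coefficient magnitudes, which are unaffected by
    rotations about the vertical (azimuthal) axis."""
    S, T, U = hist.shape
    phi_c = (np.arange(S) + 0.5) * 2 * np.pi / S        # azimuthal bin centres
    theta_c = (np.arange(T) + 0.5) * np.pi / T          # colatitudinal bin centres
    PH, TH = np.meshgrid(phi_c, theta_c, indexing="ij")
    desc = []
    for shell in range(U):
        f = hist[:, :, shell]
        for l in range(L + 1):
            for m in range(-l, l + 1):
                Y = sph_harm(m, l, PH, TH)              # scipy: sph_harm(m, n, azimuth, colatitude)
                c = np.sum(f * np.conj(Y) * np.sin(TH)) # approximate quadrature over the sphere
                desc.append(np.abs(c))                  # magnitude removes the azimuthal phase
    return np.asarray(desc)
```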

Classification

The classification is divided into two main tasks: recognition of motion primitives by use of the harmonic motion context descriptors, and recognition of the actual gestures using an ordered sequence of primitives (see Fig. 1).
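As an illustration of the second task, the sketch below classifies a pruned primitive string by its edit distance to each gesture's template string. The paper applies a probabilistic Edit Distance classifier; the standard Levenshtein distance shown here, and the template-dictionary interface, are simplifying assumptions that only convey the underlying idea.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning string a into b."""
    dp = list(range(len(b) + 1))                         # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute (free if symbols match)
    return dp[len(b)]

def classify_gesture(pruned_string, gesture_templates):
    """Return the gesture whose primitive template string is closest to the observation."""
    return min(gesture_templates, key=lambda g: edit_distance(pruned_string, gesture_templates[g]))
```

With purely synthetic templates such as `{"gesture_1": "ABBC", "gesture_2": "DEDE"}` and an observed pruned string "ABC", the classifier would return "gesture_1" (distance 1 versus 4).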

Test and results

For testing purposes we use a vocabulary consisting of 22 primitives. This is illustrated in Fig. 10. The criteria for finding the primitives are (1) that they represent characteristic and representative 3D configurations, (2) that their configurations contain a certain amount of motion, and (3) that the primitives are used in the description of as many gestures as possible, i.e., fewer primitives are required. Using this vocabulary of primitives we describe four one- and two-arm gestures:

Conclusion

The contributions of this paper are twofold. Firstly, motion is detected by 2D optical flow estimated in the intensity image but extended to 3D using the depth information acquired from only one viewpoint by a range camera. We show how gestures can be represented efficiently using motion context, and how gesture recognition can be made view-invariant through the use of 3D data and transforming a motion context representation using spherical harmonics. Secondly, for the gesture recognition

Acknowledgments

This work is partially funded by the MoPrim and the BigBrother projects (Danish National Research Councils – FTP) and partially by the HERMES Project (FP6 IST-027110).

References (37)

  • B. Horn et al., Determining optical flow, Artificial Intelligence (1981).
  • T. Moeslund et al., A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (2006).
  • D. Weinland et al., Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding (2006).
  • M. Ahmad, S.-W. Lee, HMM-based human action recognition using multiview image sequences, in: International Conference...
  • H.H. Avilés-Arriaga et al., Dynamic Bayesian networks for visual recognition of dynamic gestures, Journal of Intelligent and Fuzzy Systems (2002).
  • S. Belongie et al., Shape matching and object recognition using shape contexts, Pattern Analysis and Machine Intelligence (2002).
  • I. Cohen, H. Li, Inference of human postures by classification of 3D human body shape, in: Workshop on Analysis and...
  • J. Gonzalez, J. Varona, F. Roca, J. Villanueva, aSpaces: action spaces for recognition and synthesis of human actions,...
  • M. Haker, M. Böhme, T. Martinetz, E. Barth, Geometric invariants for facial feature tracking with 3D TOF cameras, in:...
  • D. Hansen, R. Larsen, F. Lauze, Improving face detection with TOF cameras, in: International Symposium on Signals,...
  • M. Holte, T. Moeslund, P. Fihl, Fusion of range and intensity information for view invariant gesture recognition, in:...
  • O. Jenkins, M. Mataric, Deriving action and behavior primitives from human motion data, in: International Conference on...
  • Y. Kameda, M. Minoh, K. Ikeda, Three dimensional motion estimation of a human body using a difference image sequence,...
  • S. Khan, LUMS School of Science and Engineering Lahore, Pakistan....
  • O. Kähler, E. Rodner, J. Denzler, On fusion of range and intensity information using graph-cut for planar patch...
  • A. Kolb, E. Barth, R. Koch, R. Larsen, Time-of-flight sensors in computer graphics, in: Eurographics 2009 – State of...
  • R. Larsen, B. Lading, Multiple geodesic distance based registration of surfaces applied to facial expression data, in:...
  • V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady (1966).