Vision-based hand pose estimation: A review
Introduction
There has lately been great emphasis in HCI research on creating easier-to-use interfaces that directly employ the natural communication and manipulation skills of humans. Adopting direct sensing in HCI will allow the deployment of a wide range of applications in more sophisticated computing environments such as Virtual Environments (VEs) or Augmented Reality (AR) systems. The development of these systems involves addressing challenging research problems, including effective input/output techniques, interaction styles, and evaluation methods. In the input domain, the direct sensing approach requires capturing and interpreting the motion of the head, eye gaze, face, hands, arms, or even the whole body.
Among different body parts, the hand is the most effective, general-purpose interaction tool due to its dexterous functionality in communication and manipulation. Various interaction styles draw on both modalities to allow intuitive, natural interaction (see Appendix A). Gesture languages made up of hand postures (i.e., static gestures) or motion patterns (i.e., dynamic gestures) have been employed to implement command and control interfaces [1], [2], [3], [4]. Gesticulations, the spontaneous movements of the hands and arms that accompany speech, have been shown to be very effective tools in Multimodal User Interfaces [5], [6], [7], [8], [9]. Object manipulation interfaces [10], [11], [12] utilize the hand for navigation, selection, and manipulation tasks in VEs. In many applications, such as complex machinery or manipulator control, computer-based puppetry, or musical performance [13], the hand serves as an efficient, high degree of freedom (DOF) control device. Finally, some immersive VE applications, such as surgical simulations [14] and training systems [15], have intricate object manipulation at their core. Broad deployment of hand gesture-based HCI requires the development of general-purpose hand motion capture and interpretation systems.
Currently, the most effective tools for capturing hand motion are electro-mechanical or magnetic sensing devices (data gloves) [16], [17]. These devices are worn on the hand to measure the location of the hand and the finger joint angles. They deliver the most complete, application-independent set of real-time measurements that allow importing all the functionality of the hand in HCI. However, they have several drawbacks in terms of casual use as they are very expensive, hinder the naturalness of hand motion, and require complex calibration and setup procedures to be able to obtain precise measurements.
CV represents a promising alternative to data gloves because of its potential to provide more natural, unencumbered, non-contact interaction. However, several challenges, including accuracy, processing speed, and generality, have to be overcome before this technology can be widely used. Recovering the full-DOF hand motion from images with unavoidable self-occlusions is a very challenging and computationally intensive problem. As a result, current implementations of CV-based systems do not have much in common with glove-based ones. Dating back to the late 70s [18], the dominant method pursued in the implementation of CV-based interaction has been appearance-based modeling of hand motion [19], [20]. These models have been successfully applied to build gesture classification engines for detecting elements of a gesture vocabulary. However, the 3D motion information delivered by these systems is limited to rough estimates of fingertip positions, finger orientations, and/or the palm frame, obtained using appearance-specific features that limit the generality of the approach.
In this study, we review a more general problem, which aims to recover the full kinematic structure of the hand by bridging the gap between CV-based and glove-based sensing. This is a very challenging, high dimensional problem. Since the hand is a very flexible object, its projection leads to a large variety of shapes with many self-occlusions. Nevertheless, there are several good reasons for tackling this problem. First, there are various types of interaction styles and applications that explicitly rely on 3D hand pose information. Second, 3D hand pose forms an effective feature to be used in gesture classification, as it is view independent and directly related to hand motion. Finally, in contrast to appearance-based methods, full DOF hand pose estimation can provide general, principled methods that can be easily adapted to process simple, lower DOF tasks such as pointing, resizing, navigation etc. [21], [22], [23].
There exist several reviews on hand modeling, pose estimation, and gesture recognition [24], [25], [26], [27], [19], [28], the latest of which covers studies up to 2000. However, none of these surveys addresses the pose estimation problem in detail as they mainly concentrate on the gesture classification problem. In this study, we provide a detailed review on pose estimation together with recent contributions in the hand modeling domain including new shape and motion models and the kinematic fitting problem.
It should be mentioned that hand pose estimation is closely related to human body or articulated object pose estimation. Human body pose estimation is a more intensively studied field, and many algorithms used in hand tracking have their roots in methods proposed previously for human body tracking. However, there are also many differences in operating environments, related applications, and the features being used [29]. For example, clothing on the human body introduces extra difficulties in segmentation, but it also makes color or texture features more reliable for tracking compared to the weakly textured, uniformly colored surface of the hand. Another example is the possibility of estimating human body pose part-by-part or hierarchically (first the head, then the torso, and so on) to break the problem into lower-dimensional ones. In the case of the hand, hierarchical processing is limited to two stages: the palm, then the fingers. It would be difficult, if not impossible, to go any further because of the lack of texture, the proximity of the limbs, and the mostly concave shape of the hand, which produces severe occlusions. Therefore, we have limited the content of this paper to studies directly addressing the problem of hand pose estimation. Reviews covering human body pose estimation can be found in [30], [31], [32], [33], [34], [35].
In Section 2, we define the problem of hand pose estimation, discuss the challenges involved, and provide a categorization of the methods that have appeared in the literature. Hand modeling is an important issue for any model-based method and is reviewed in Section 3. Sections 4 (Partial hand pose estimation), 5 (Model-based tracking), and 6 (Single frame pose estimation) provide a detailed review of the methods mentioned in Section 2. In Section 7, we summarize the systems reviewed and discuss their strengths and weaknesses. In Section 8, we discuss potential problems for future research. Finally, our conclusions are provided in Section 9.
CV-based pose estimation
The dominant motion observed in hand image sequences is articulated motion. There is also some elastic motion but recovering it does not have any major practical use in most applications. Therefore, hand pose estimation corresponds to estimating all (or a subset of) the kinematic parameters of the skeleton of the hand (see Fig. 2). Using visual data for this purpose, however, involves solving challenging image analysis problems in real-time.
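Concretely, the kinematic parameters can be collected into a single pose vector. The sketch below assumes one common convention (6 global DOF for the palm plus 4 joint angles per finger, 26 DOF in total); the exact parameterization varies across the systems reviewed, so the names and counts here are illustrative only.

```python
import numpy as np

# Illustrative convention (not prescribed by this survey): 6 global DOF
# for palm position/orientation, plus 4 DOF per finger (2 at the MCP
# joint, 1 each at the PIP and DIP joints): 6 + 5 * 4 = 26 DOF.
GLOBAL_DOF = 6
FINGERS = ("thumb", "index", "middle", "ring", "little")
DOF_PER_FINGER = 4

def make_pose_vector():
    """Return a zero-initialized (neutral) full-DOF hand pose vector."""
    return np.zeros(GLOBAL_DOF + len(FINGERS) * DOF_PER_FINGER)

def finger_params(pose, finger):
    """Slice one finger's joint angles out of the full pose vector."""
    i = FINGERS.index(finger)
    start = GLOBAL_DOF + i * DOF_PER_FINGER
    return pose[start:start + DOF_PER_FINGER]
```

Estimating the hand pose then amounts to recovering this vector, or a constrained subset of it, from image data at every frame.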
In this section, we first discuss some major
Hand modeling
In this section, we provide a review on hand modeling in the context of model-based vision. First, we describe the kinematic model that forms the basis of all types of hand models. A kinematic hand model represents the motion of the hand skeleton, but it is also a redundant model in the sense that it does not capture the correlation between joints. After a review of modeling natural hand motion, we present some hand shape models that allow generating appearances of the hand in arbitrary
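As a minimal illustration of such a kinematic model, the sketch below computes planar forward kinematics for one finger chain and applies the frequently cited coupling constraint theta_DIP ≈ (2/3)·theta_PIP to reduce redundancy. The planar simplification and segment names are assumptions for illustration, not a model endorsed by any particular reviewed system.

```python
import numpy as np

def finger_forward_kinematics(base, lengths, angles):
    """Planar forward kinematics for one finger chain.
    base: (x, y) position of the MCP joint; lengths: segment lengths
    (proximal, middle, distal); angles: per-joint flexion in radians.
    Real hand models are 3D and add abduction/adduction at the MCP."""
    pts = [np.asarray(base, dtype=float)]
    heading = 0.0
    for seg_len, ang in zip(lengths, angles):
        heading += ang  # each joint rotates relative to the previous segment
        step = seg_len * np.array([np.cos(heading), np.sin(heading)])
        pts.append(pts[-1] + step)
    return pts  # joint positions: MCP, PIP, DIP, fingertip

def constrained_angles(theta_mcp, theta_pip):
    """Reduce three flexion DOF to two via the common DIP/PIP coupling."""
    return [theta_mcp, theta_pip, (2.0 / 3.0) * theta_pip]
```

Constraints of this kind capture part of the joint correlation that the raw kinematic model leaves unmodeled.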
Partial hand pose estimation
In this section, we provide a review on estimating partial hand pose, which corresponds to rough models of the hand motion, mainly consisting of position of the fingertips, orientation of the fingers or position and orientation of the palm. Partial hand pose estimation algorithms are used to complement appearance-based systems to provide continuous motion data for manipulation, navigation or pointing tasks. First, we describe the architecture of these systems followed by implementation details.
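As a toy example of the low-level features such systems rely on, fingertip candidates can be crudely located as contour points that are local maxima of distance from the hand centroid. Everything below (the window size, the 1.2 threshold factor) is an illustrative heuristic; practical systems add curvature analysis and temporal filtering.

```python
import numpy as np

def fingertip_candidates(contour, k=5):
    """Pick contour points that are local maxima of distance from the
    hand centroid and clearly farther out than average (naive heuristic).
    contour: (N, 2) array of boundary points in order; returns indices."""
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)
    d = np.linalg.norm(contour - centroid, axis=1)
    n, tips = len(d), []
    for i in range(n):
        window = d[[(i + j) % n for j in range(-k, k + 1)]]
        if d[i] == window.max() and d[i] > 1.2 * d.mean():
            tips.append(i)
    return tips
```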
Model-based tracking
A block diagram of a generic model-based tracking system is shown in Fig. 6. At each frame of the image sequence, a search in the configuration space is executed to find the best parameters that minimize a matching error, which is a measure of similarity between groups of model features and groups of features extracted from the input images. The search is initiated by a prediction mechanism, based on a model of the system dynamics. In the first frame, a prediction is not available, therefore, a
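The steps above can be sketched as a generic predict/search/update loop. The random local search stands in for whatever optimizer a given system uses, and `project`, `extract`, and `predict` are hypothetical placeholders for model-feature projection, image feature extraction, and dynamics-based prediction.

```python
import numpy as np

def track(frames, project, extract, initial_pose,
          predict=None, step=0.1, iters=200, rng=None):
    """Generic model-based tracking loop (a sketch, not any specific
    reviewed system). Each frame: predict a pose, then locally refine
    it by minimizing a matching error between projected model features
    and features extracted from the image."""
    rng = np.random.default_rng(0) if rng is None else rng
    pose = np.asarray(initial_pose, dtype=float)
    history = []
    for frame in frames:
        obs = extract(frame)                 # image features
        if predict is not None and history:
            pose = predict(history)          # dynamics-based prediction
        err = np.linalg.norm(project(pose) - obs)
        for _ in range(iters):               # local random search
            cand = pose + rng.normal(scale=step, size=pose.shape)
            cand_err = np.linalg.norm(project(cand) - obs)
            if cand_err < err:
                pose, err = cand, cand_err
        history.append(pose.copy())
    return history
```

In the first frame no prediction is available, so the loop falls back to the supplied initial pose, mirroring the initialization problem noted above.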
Single frame pose estimation
By single frame pose estimation we mean estimating the pose of the hand using a single image or multiple images taken simultaneously from different views. In terms of the model-based approach, the solution to this problem corresponds to a global search over the entire configuration space. Especially with a single image and unconstrained hand motion, single frame pose estimation is an ambiguous problem due to occlusions.
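A brute-force rendering of this global search is sketched below; `render` and `sampler` are hypothetical placeholders, and the exhaustive sampling serves only to make the idea concrete. Practical systems replace it with tractable alternatives such as indexed appearance databases or hierarchical search.

```python
import numpy as np

def global_pose_search(image_features, render, sampler, n_samples=2000):
    """Single-frame estimation as a global search over configuration
    space: draw candidate poses, score each against the observed image
    features, and keep the best match."""
    best_pose, best_err = None, np.inf
    for _ in range(n_samples):
        pose = sampler()
        err = np.linalg.norm(render(pose) - image_features)
        if err < best_err:
            best_pose, best_err = pose, err
    return best_pose, best_err
```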
One motivation for addressing this more challenging problem is for the
Summary and evaluation
The key characteristics of the full DOF hand pose estimation systems reviewed in this study are summarized in Table 1. These studies were chosen on the basis of the generality of their solutions and satisfactory experimental results. The first column provides the reference number while the other columns provide the key characteristics of each system. Specifically, we report: (1) the effective number of DOF that the system targets (i.e., the final DOF after possible reduction due to constraints), (2)
Future research directions
Model-based vision seems to be a promising direction for hand pose estimation. All the studies reviewed here represent important steps taken forward to achieve the ultimate goal but there are also some problems that have not received enough attention. One is the hand model calibration problem, which has received attention only recently [70]. In many applications, precision is important; however, it may not be possible to obtain it with a general manually constructed hand model. Besides,
Conclusions
CV has a distinctive role in the development of direct sensing-based HCI. However, various challenges must be addressed in order to satisfy the demands of potential interaction methods. Currently, CV-based pose estimation has some limitations in processing arbitrary hand actions. Incorporating the full functionality of the hand in HCI requires capturing the whole hand motion. However, CV can currently support only a small range of hand actions under restrictive conditions. This approach
Acknowledgments
This work was supported by NASA under Grant No. NCC5-583. We acknowledge the editors and reviewers for their constructive comments and pointers to some references that we missed in the first version of this paper.
References (136)
- Human motion analysis: a review, Computer Vision and Image Understanding (1999)
- The visual analysis of human movement: a survey, Computer Vision and Image Understanding (1999)
- A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (2001)
- Recent developments in human motion analysis, Pattern Recognition (2003)
- Video analysis of human dynamics - a survey, Real-Time Imaging (2003)
- A graphical model of the human hand using CATIA, International Journal of Industrial Ergonomics (1993)
- Visual gesture interfaces for virtual environments, Interacting with Computers (2002)
- Eyes in the interface, Image and Vision Computing (1995)
- Human–robot interface by pointing with uncalibrated stereo vision, Image and Vision Computing (1996)
- Unencumbered gestural interaction, IEEE MultiMedia (1996)