A Survey on Hand Pose Estimation with Wearable Sensors and Computer-Vision-Based Methods

Real-time sensing and modeling of the human body, especially the hands, is an important research endeavor for applications such as natural human-computer interaction. Hand pose estimation is a major academic and technical challenge due to the complex structure and dexterous movement of human hands. Boosted by advancements in both hardware and artificial intelligence, various prototypes of data gloves and computer-vision-based methods have been proposed for accurate and rapid hand pose estimation in recent years. However, existing reviews focused either on data gloves or on vision methods, or were even restricted to a particular type of camera, such as the depth camera. The purpose of this survey is to conduct a comprehensive and timely review of recent research advances in sensor-based hand pose estimation, including wearable and vision-based solutions. Hand kinematic models are discussed first. An in-depth review is then conducted on data gloves and vision-based sensor systems with their corresponding modeling methods. In particular, this review also discusses deep-learning-based methods, which are very promising for hand pose estimation. Moreover, the advantages and drawbacks of the current hand gesture estimation methods, their scope of application, and related challenges are also discussed.


Introduction
With the rapid growth of computer science and related fields, the way that humans interact with computers has evolved towards a more natural and ubiquitous form. Various technologies have been developed to capture users' facial expressions as well as body movements and postures to serve two types of applications: information captured becomes a "snapshot" of a user for computers to better understand users' intentions or emotional states; and users apply natural movements instead of using dedicated input devices to send commands for system control or to interact with digital content in a virtual environment.
Among all body parts, we depend heavily on our hands to manipulate objects and communicate with other people in daily life, since hands are dexterous and effective tools with highly developed sensory and motor structures. Therefore, the hand is a critical component for natural human-computer interactions, and many efforts have been made to integrate our hands in the interaction loop for more
1. Existing surveys focus either on glove-based devices [10,11] or on vision-based systems [12][13][14], since these works were carried out in two distinct research communities, i.e., human-computer interaction and computer vision. We cover both directions to provide a complete overview of the state of the art in hand pose estimation, which can be particularly helpful for people building applications with hand pose estimation technology.
2. With the boost of data-driven machine learning methods, a large number of new solutions have been proposed recently, especially in the last three years. It is now urgent to provide a comprehensive review of current progress to help researchers who are interested in this field obtain a quick overview of existing solutions and unsolved challenges.
The remainder of the paper is organized as follows: Section 2 summarizes the structural properties of the hand and some intrinsic and extrinsic difficulties related to the pose estimation problem. Section 3 presents different types of glove-shaped wearable sensors to capture finger-level poses, and Section 4 lists hand pose estimation methods and datasets working with various types of cameras. Finally, Section 5 summarizes challenges and possible working directions on this topic.

Hand Structure
The hand is a highly complex and articulated body part, which makes it difficult to model its kinematic and dynamic properties, especially in real time. Thus, here we start from its anatomical structure. Getting a clear picture of the hand anatomy can help us to better represent a hand's configuration in space. A kinematic hand model can be built according to the hand anatomy to encode the hand's kinematic properties. The IP joints have only flexion-extension ability (1 DoF), and all CMC joints can be considered static (although the CMC joints of the little and the ring finger have some motion capability reflecting palm folding). However, the thumb is more difficult to model, as there are different views on the MCP of the thumb (also called the trapeziometacarpal or TM joint): it can be considered either as a 2 DoF saddle joint, like the other MCP joints, which support both abduction/adduction and flexion/extension, or as a joint with flexion-extension ability only (Figure 2). This leads to two different but very similar kinematic models, with either 27 or 26 DoF. Early work in the field of computer animation started with the 27 DoF model [15], but recent studies all chose the 26 DoF version, which models the thumb more simply [18][19][20].
Based on the degrees of freedom analysis, we can use the kinematic model to generate a feature vector to represent a hand's configuration. More precisely, the 6 DoF frame of the joint connecting the wrist and the hand is often called the global configuration, and the angular DoF of all fingers are called local configurations, which can be combined to form a feature vector for full DoF hand pose estimation.
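As a minimal sketch of how such a feature vector could be assembled for the 26 DoF model, the snippet below concatenates the 6 DoF global configuration with 20 local joint angles. The class name `HandPose`, the Euler-angle orientation, and the even split of four angles per digit are assumptions made for illustration; they are not prescribed by the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

FINGERS = ("thumb", "index", "middle", "ring", "little")
ANGLES_PER_FINGER = 4  # assumed split of the 20 local DoF (hypothetical layout)


@dataclass
class HandPose:
    position: List[float]      # wrist x, y, z (global configuration, part 1)
    orientation: List[float]   # wrist rotation as three angles (global configuration, part 2)
    fingers: Dict[str, List[float]] = field(default_factory=dict)  # local configurations

    def to_vector(self) -> List[float]:
        """Concatenate global (6) and local (20) configurations into 26 values."""
        if len(self.position) != 3 or len(self.orientation) != 3:
            raise ValueError("global configuration needs 3 + 3 values")
        vector = list(self.position) + list(self.orientation)
        for name in FINGERS:
            # Fingers without data default to a flat (all-zero) posture.
            angles = self.fingers.get(name, [0.0] * ANGLES_PER_FINGER)
            if len(angles) != ANGLES_PER_FINGER:
                raise ValueError(f"{name}: expected {ANGLES_PER_FINGER} angles")
            vector.extend(angles)
        return vector


pose = HandPose(
    position=[0.1, 0.0, 0.4],
    orientation=[0.0, 0.0, 1.57],
    fingers={"index": [0.2, 0.05, 0.3, 0.1]},
)
vector = pose.to_vector()
```

A fixed ordering of the digits keeps the vector layout stable, so that any estimator consuming it always finds the same joint at the same index.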
A kinematic model based on an accurate anatomical structure is a useful way to parameterize the hand, but it is not the only choice. In fact, building a high-resolution anatomical model can be overly complicated for many applications, so various simplifications have been proposed in order to keep models only as complicated as needed. For example, the palm can sometimes be represented as a single rigid body if only fingers are of interest [21], although a rigid palm is poor for tasks such as manipulation and grasping. For those tasks, two to four additional DoF can be added for better palm representation [22]. Besides the articulated rigid model, hands can also be modeled as a small group of independent rigid bodies for each component of the hand, and a prior model of the belief propagation network can be used instead to enforce the kinematic relations between these rigid bodies [23].
The kinematic model combined with a shape model is the basis of many model-driven approaches, but the hand can also be modeled in a "non-parametric" way, i.e., an implicit structural model of the hand can be trained from images or other types of data. Different model-based or data-driven methods will be fully discussed in the following sections.

Sensor Taxonomy
The majority of hand pose reconstruction methods are based on either external sensing devices or wearable sensors directly attached to the hand. Although limited in precision, both types of sensors were applied very early in fields such as gaming and virtual reality, and they are still developing rapidly.
Wearable sensors are mostly in the form of gloves (also called "data gloves") that a user can directly put on. Data gloves make use of dedicated electromagnetic or mechanical sensors to directly capture the bending angles of the palm and each finger joint so the local configurations with respect to the wrist can be recorded in real time. As data gloves do not support positional tracking, the global configuration of a hand is often captured with the help of vision-based sensors.
Vision-based sensors, more commonly called cameras, enjoy unprecedented popularity in our daily lives. They can be found on smartphones, drones, and humanoid robots, as well as in streets and supermarkets. Cameras are ubiquitous, low-cost tools that capture a wide range of reflections of visible light, infrared rays, and sometimes lasers. As opposed to wearable sensors, cameras employ indirect measurements: they capture the appearance of the hand in images (pixel arrays), and the positions of hand joints are then derived with intricate algorithms. Recently, with the widespread use of depth cameras (RGB cameras with a depth sensor) and deep learning algorithms, there has been a boost of vision-based methods for hand pose estimation, which in particular motivates this review.
Wearable sensors and vision-based sensors both have advantages and drawbacks. Vision-based sensors generally do not require the users to wear any devices that may hinder free hand motion; this is particularly important in some real-world applications, such as rehabilitation or delicate tool manipulation. However, vision-based sensors need the hands to be always visible to the camera and are sensitive to background noise; wearable devices like data gloves are mostly self-contained and mobility-restricted. Thus, these two types of sensors are complementary to each other in hand pose estimation, and more generally in intelligent human-computer interactions. In the following sections, we discuss in detail state-of-the-art methods and commercial solutions of wearable and vision-based sensors for hand pose estimation.

Wearable Devices
Efforts to develop wearable devices for hand gesture recognition and pose estimation began in the 1970s, and the field has remained active for more than 40 years. This section mainly focuses on recent advancements in the two categories of wearable devices for hand pose estimation, namely data gloves and wearable markers. Wearable devices have been reviewed by other surveys [10,24], but gloves designed merely for gesture recognition were not included.
A data glove is a glove-based system composed of one or multiple sensors for data acquisition, and sometimes processing and power supply integration, to be worn on the user's hands. The bending angle and level of adduction of each finger are captured by embedded sensors of different natures. As summarized by Rashid and Hasan [11], there are typically four types of sensors that can be used for hand-related tasks: bend sensors, stretch sensors, inertial measurement units (IMUs), and magnetic sensors. Most existing data gloves used for hand pose modeling are based on bend or stretch sensors, although some have a combination of multiple types of sensors.
In this section, we present in detail a typical setup and characteristics of data gloves based on bend and stretch sensors as well as other types of sensors.

Bend (Flex) Sensors
Bend or flex sensors are passive resistive devices that are commonly used to measure deflection angles, and they are the most widely used of all sensor types on hand wearables [25]. Bend sensors are thin and available in different sizes, so they can be easily placed on a glove over the knuckles of finger joints. They also have other advantages, such as a relatively long life cycle and low price, and they can stay operational in a wide range of temperatures, which makes them a popular choice to measure different joints of the hand.
Bend sensors can be manufactured by coating resistive carbon elements on a flexible thin plastic substrate or by using optic fibers with mounted receivers. For example, the CyberGlove series gloves are built with conductive-ink-based bend sensors and have been on the market for more than 20 years. The latest CyberGlove III [26] has reached a resolution of less than 1 degree and a data rate of up to 120 records/second. The VPL glove (no longer available) and 5DT glove [27] are also classical data gloves that are based on optical flex sensors.
Besides commercial products, there are also many research efforts to design gloves based on bend sensors for different applications. Some glove designs make use of off-the-shelf bend sensors [28] (Figure 3a), whereas others tried to design novel, soft bend sensors [29,30] (Figure 3b). Typical bend-sensor-based gloves have up to 22 sensors per hand with reasonable cost and design complexity; a design with a bend sensor array [31] can further increase the number of integrated sensors without hindering natural hand movements.
Bend sensors also have some limitations. Although they can bend millions of times, their accuracy generally decreases over time. Bending a flex sensor with no protective coating for a long period can result in a permanent bend in the sensor, affecting its base resistance. This stability issue requires periodic recalibration, which is not a trivial process.
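To make the role of calibration concrete, the following sketch converts a resistive bend sensor reading into a joint angle with a two-point calibration. It assumes a simple voltage divider readout and a linear resistance-to-angle response; the supply voltage, the fixed resistor, and the resistance values are hypothetical, and real sensors may need a nonlinear fit.

```python
def divider_resistance(v_out, v_in=3.3, r_fixed=10_000.0):
    """Sensor resistance in a voltage divider (sensor between v_in and v_out,
    fixed resistor between v_out and ground). Values are assumptions."""
    if not 0.0 < v_out < v_in:
        raise ValueError("v_out must lie strictly between 0 and v_in")
    return r_fixed * (v_in - v_out) / v_out


class BendCalibration:
    """Two-point linear calibration: flat finger and a known bent posture."""

    def __init__(self, r_flat, r_bent, bent_angle_deg=90.0):
        if r_flat == r_bent:
            raise ValueError("calibration postures must give different resistances")
        self.r_flat = r_flat
        self.r_bent = r_bent
        self.bent_angle_deg = bent_angle_deg

    def angle(self, resistance):
        """Interpolate the joint angle (degrees) from a resistance reading."""
        span = self.r_bent - self.r_flat
        return self.bent_angle_deg * (resistance - self.r_flat) / span


# Hypothetical readings recorded while the user holds the two postures.
calibration = BendCalibration(r_flat=25_000.0, r_bent=65_000.0)
mid_angle = calibration.angle(45_000.0)
```

When the base resistance drifts, repeating the two postures and rebuilding `BendCalibration` is the recalibration step in this simplified picture.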

Stretch (Strain) Sensors
Stretch sensors are increasingly used for the measurement of human body movements, as they can be stretched to fit joints and other deformable parts of the human body and obtain measurements of good quality. With the development of material science and sensing technology, various stretch sensors have been proposed in different sizes and sensitivities to fit particular applications, some also with pressure measurement capacity. While non-stretchable data gloves tend to be cumbersome and hinder free hand movements with unsuitable sizes and rigid components, elastic stretch sensors allow for very slim and comfortable data gloves that fit the hand and are particularly dexterous and sensitive.
Stretch sensors are typically resistors with resistance values directly proportional to the sensor's deformation. They can be roughly divided into two groups depending on the process of fabrication. They are either made of stretchy fabrics coated with a conducting material such as polymer or metal, or they are constructed by knitting and stitching conductive fiber with resistive thread to form a mixed structure.
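The proportional relation between resistance and deformation can be written as R = R0 (1 + GF · ε), where R0 is the resistance at rest, ε the strain, and GF a gauge factor. The short sketch below inverts this relation; the gauge-factor formulation and all numeric values are assumptions for illustration, not properties of any specific glove cited here.

```python
def strain_from_resistance(resistance, r_rest, gauge_factor):
    """Invert R = R0 * (1 + GF * strain) for the strain (dimensionless)."""
    if r_rest <= 0 or gauge_factor == 0:
        raise ValueError("r_rest must be positive and gauge_factor non-zero")
    return (resistance - r_rest) / (gauge_factor * r_rest)


# Hypothetical sensor: 1 kOhm at rest, gauge factor 2, reading 1.2 kOhm.
strain = strain_from_resistance(1200.0, r_rest=1000.0, gauge_factor=2.0)
```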
Many recent works have proposed different designs and implementations of stretch sensor gloves. For example, Lee et al. [32] fabricated a stretchable sensor for the detection of tensile as well as compressive strains by putting silver nanoparticle (Ag NP) thin film on a polydimethylsiloxane (PDMS) stamp. Bianchi et al. [33] presented a sensing glove with knitted piezoresistive fabrics (KPFs) (Figure 4a) based on their previous work [34]. This glove is able to track the full hand pose of 19 degrees of freedom (DoF) with only five sensors. Similarly, Michaud et al. [35] built a stretch sensor glove with extremely thin (<50 µm) and skin-conforming sensors made of biphasic, gallium-based metal films embedded in an elastomeric substrate. Besides stretchable fabrics, there are also gloves based on liquid conductors [36,37] (Figure 4b,c) and made with knitted textiles [38,39] (Figure 4d).
However, the abovementioned stretch sensor gloves all have a limited number of embedded sensors (up to 15 [37]), which limits their use for full-hand pose recovery. To solve this problem, Glauser et al. [40] extended the capacitive strain sensor concept of Atalay et al. [38] to achieve dense area-stretch sensor arrays. Later, they designed a stretchable glove based on stretch array sensors, combined with a learned prior, to capture dense surface deformations of full hands [41] (Figure 4e).
Despite recent advances in stretch sensors, one of their major limitations is that their sensitivity changes with the size of the sensor, which makes calibration very difficult. Moreover, stretch sensors exhibit slower response times and, especially, very limited lifespans compared to other sensing technologies.

Other Types of Sensors
Besides bend and stretch sensors, inertial measurement units (IMUs) and magnetic sensing are also very popular.
IMUs are often a combination of accelerometers, gyroscopes, and sometimes magnetometers that provide measurements of linear accelerations and rotation rates. They are commonly used in wearable devices to obtain the orientation and motion-related features of body parts, including hands and fingers [42]. When compared with bend or stretch sensors, IMUs provide good data rates as accelerometers give digital outputs, and they are relatively low cost and have long lifespans. For example, Keyglove (Figure 5a) is an Arduino-powered glove that uses touch combinations to generate keyboard and mouse control codes, and it is now an open source kit for further development. Other IMU-based gloves share similar architectures with 17 [43] or 16 [44] 9-axis IMUs (Figure 5b), where each one includes a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer to provide a real-time measurement of hand joint movements. A recent commercial product named Hi5 VR Glove is designed for VR applications (Figure 5c). It contains six 9-axis IMU sensors on each finger for full left-and-right-hand motion capture with high-performance tracking.
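One common way to turn such raw IMU readings into an orientation estimate is a complementary filter, which blends the integrated gyroscope rate with the tilt implied by gravity in the accelerometer reading. The sketch below tracks a single pitch angle; the filter choice, the axis convention, and the blending weight `alpha` are assumptions for illustration and are not taken from the gloves cited above.

```python
import math


def accel_pitch(ax, ay, az):
    """Pitch angle (rad) implied by the gravity direction in an accelerometer reading."""
    return math.atan2(-ax, math.hypot(ay, az))


def complementary_filter(pitch_prev, gyro_rate, accel, dt, alpha=0.98):
    """One filter step: trust the gyroscope short-term, the accelerometer long-term."""
    gyro_estimate = pitch_prev + gyro_rate * dt   # integrate rotation rate
    return alpha * gyro_estimate + (1.0 - alpha) * accel_pitch(*accel)


# Simulated static segment tilted by 30 degrees: zero rotation rate,
# accelerometer sees gravity rotated by the same angle.
true_pitch = math.radians(30.0)
accel_reading = (-math.sin(true_pitch), 0.0, math.cos(true_pitch))

pitch = 0.0
for _ in range(1000):
    pitch = complementary_filter(pitch, gyro_rate=0.0, accel=accel_reading, dt=0.01)
```

With a stationary sensor the estimate converges to the accelerometer tilt, while during fast motion the gyroscope term dominates and suppresses accelerometer noise.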
Magnetic sensors, including linear Hall-effect and magnetic current sensors, are also used for hand pose capturing. Due to their contactless working principle, magnetic sensors enable repeatable operations by avoiding frictional forces. Moreover, Hall-effect sensors are of low cost and compact in size and can work in a wide range of temperatures. For example, the Humanglove [45] has 20 Hall-effect sensors that can measure the joint angles of fingers. Wu et al. [46] proposed a wearable rehabilitation robotic hand using Hall-effect sensors that can be worn on the forearm. Another light-weight system called Finexus was designed as a multipoint tracking system by instrumenting the fingertips with electromagnets [47].
Figure 4. (a) Sensing glove with knitted piezoresistive fabrics [33]; (b) wearable soft artificial sensing skin made of a hyperelastic elastomer material [36]; (c) data glove made of soft Ecoflex material [37]; (d) wearable glove based on highly stretchable textile-silicone capacitive sensors [38]; (e) glove made of a full soft composite of a stretchable capacitive silicone sensor array [41].
Both IMUs and magnetic sensors are rigid components, and their placement on the glove is tricky because finger tracking requires the sensors to be small enough to be wearable. The sensors have to be placed between each pair of finger joints to capture poses in detail, which is quite challenging due to their fixed shape and dimensions. Moreover, the sensitivity of magnetic sensors increases with their size, so small sensors often lack precision and are more easily disturbed by external magnetic fields.

Evaluations
Direct comparisons between data gloves are rare, as most of them are still lab prototypes and only a few commercial products exist (some have disappeared), so it is difficult to draw conclusions on the reconstruction quality among these solutions.
However, as shown in Table 1, the analysis of the different sensor types is still informative. On the one hand, bend (flex) sensors and stretch (strain) sensors are very suitable for hand pose estimation, as they deform to follow finger movements and palm deformations and are therefore less disturbing for the users. On the other hand, IMUs and magnetic sensors are not subject to mechanical deformation and can thus have longer lifespans. The optimal design of a data glove may therefore involve multiple types of sensors to combine their advantages for better performance at lower cost. A further comparison of wearable technologies on accuracy, cost, and lifetime can be found in [11].

Computer-Vision-Based Methods
The computer vision community has witnessed rapid advancements in almost every sub-domain in recent years, from famous local image descriptors such as SIFT [50], to applied algorithms like the AdaBoost face detection framework [51], and then to the boom of deep-learning-based image analysis methods such as ResNet [52] and GAN [53].
Computer-vision-based hand pose estimation has also made progress in recent years. The pose estimation task can be further subdivided into 2D and 3D estimation tasks according to the input data. Deriving 3D hand poses merely from 2D images is extremely difficult due to depth ambiguity and the difficulty of obtaining fully annotated data for training. The emergence of commodity depth sensors makes pose estimation much easier by resolving the depth ambiguity, and most recently proposed methods are largely based on depth maps. However, some methods still target pose recovery from monocular RGB images alone, as RGB cameras are widely available, whereas depth sensors bring additional cost and are limited in usable range (usually less than 10 m).
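The depth ambiguity can be illustrated with a pinhole camera model: a pixel alone only fixes a viewing ray, and the measured depth selects one 3D point on that ray. The sketch below back-projects a pixel into camera coordinates; the pinhole model and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumed values for illustration and do not describe any particular sensor.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640 x 480 sensor.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# The same pixel corresponds to different 3D points at different depths,
# which is exactly the information a single RGB image does not provide.
near_point = backproject(420, 240, 0.5, **intrinsics)
far_point = backproject(420, 240, 1.0, **intrinsics)
```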
Whichever type of data is used, vision-based hand pose estimation methods can generally be grouped into two categories, namely generative and discriminative. Generative methods are also known as model-based or model-driven methods, as they construct a 3D hand model based on prior knowledge of the hand structure and optimize it continuously to better fit the shape of the observed hand. Discriminative methods are also called appearance-based or data-driven methods, and they directly predict the joint locations from images.
The purpose of both model-based and discriminative hand pose estimation is to obtain a representation of the hand for tracking hand movement. Given a hand image, the main task of a model-based method is to find the optimal parameters of the hand model to fit the hand in the image, with the goal of modeling the hand structure in 3D space. Model-based hand pose estimation does not require any dataset to learn the parameters of the hand model. This is different from the discriminative methods, which use a large amount of hand data to train a unified model that computes the coordinates of the hand joints. In the discriminative methods, learning and prediction are separated: the model is trained offline and then used directly for online prediction, which leads to rapid execution. In the model-based methods, however, the parameters of the hand model need to be re-estimated in each frame.
In this section, we describe in turn the common model-based and discriminative methods proposed in recent years, together with their existing problems and the improvements made to address them. Hybrid methods that use both generative and discriminative models are also introduced. At the end of this section, we describe commonly used public datasets for training and benchmarking purposes.

Generative Methods
A generative method needs to construct an explicit hand model based on prior knowledge of the hand structure to recover the hand pose. The hand model needs to satisfy the morphological constraints of the hand. The task of generative methods is composed of four parts, as shown in Figure 6. Firstly, a hand model is selected according to the prior knowledge. Different kinds of hand models are shown in Figures 7 and 8. Then, the parameters of the model are initialized. A commonly used initialization method is to use the pose from the previous frame as the initial value for the current frame. After that, a similarity or loss function is established to measure the distance between the actual hand and the chosen hand model in terms of hand-crafted features. Commonly used image features are silhouettes, edges, shading, optical flow, and depth values [54][55][56][57][58][59]. Finally, the parameters of the model are updated iteratively until the optimal parameters are found. Commonly used optimization methods are iterative closest point (ICP) [60] and particle swarm optimization (PSO) [61].
The kinematic hand models presented in Section 2.1 are intuitive and accurate hand models for pose fitting tasks, but their high dimensionality makes the optimization difficult to solve in real time; thus, variants of kinematic hand models are more often used in discriminative methods than in model-based methods. Currently, geometric models are often used as the 3D hand model in generative hand pose estimation. Geometric models are usually composed of simple geometric primitives such as triangles, cylinders, polygons, or their combination. Splitting the hand model into smaller structures in this way largely reduces the dimension of the problem and simplifies the task to a certain extent, which is why it is often used in computationally complex model-based hand pose estimation. The most commonly used geometric models are the generalized cylindrical model and the deformable polygonal mesh model.

Generalized Cylindrical Model
Oikonomidis et al. pioneered the use of the generalized cylindrical model (GCM) for generative hand pose estimation [62]. Their hand model consisted of four kinds of geometric primitives: cylinders, ellipsoids, spheres, and cones. The model, shown in Figure 8a, has 26 DOF and 27 parameters. Skin and edge feature maps are used to measure the discrepancy between the hand model and the true hand, with PSO as the optimization method. The authors state that this work demonstrated for the first time that PSO can be used for hand pose estimation with a certain accuracy and robustness.
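To make the role of PSO in model fitting more concrete, the sketch below fits a toy planar finger with two joint angles to observed joint positions. It is a simplified illustration under stated assumptions and not the implementation of [62]: the two-link finger, the link lengths in `LINK_LENGTHS`, the joint-position discrepancy, and all PSO hyperparameters are hypothetical stand-ins for the 26-DOF model and the skin and edge feature maps used in the original work.

```python
import math
import random

LINK_LENGTHS = (4.0, 3.0)  # hypothetical phalanx lengths (arbitrary units)


def finger_joints(angles):
    """Forward kinematics of a planar two-link finger: returns (knuckle, tip)."""
    a1, a2 = angles
    l1, l2 = LINK_LENGTHS
    knuckle = (l1 * math.cos(a1), l1 * math.sin(a1))
    tip = (knuckle[0] + l2 * math.cos(a1 + a2),
           knuckle[1] + l2 * math.sin(a1 + a2))
    return knuckle, tip


def discrepancy(angles, observed):
    """Sum of squared distances between model joints and observed joints."""
    return sum(math.dist(m, o) ** 2
               for m, o in zip(finger_joints(angles), observed))


def pso_minimize(objective, bounds, n_particles=30, n_iters=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic particle swarm optimization over a box-constrained search space."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # personal best positions
    best_val = [objective(p) for p in pos]     # personal best scores
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the joint-angle limits.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


# Synthetic observation generated from known angles, then recovered by PSO.
true_angles = (0.6, 0.9)
observed = finger_joints(true_angles)
bounds = [(0.0, math.pi / 2)] * 2
estimate, error = pso_minimize(lambda a: discrepancy(a, observed), bounds)
```

In a tracking setting, the swarm would typically be initialized around the pose of the previous frame rather than uniformly over the whole parameter range, as discussed above.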
Compared to RGB images, RGB-D images provide depth information as an additional source, which reduces the computational complexity of hand pose estimation and makes it more robust to illumination changes. Thus, Oikonomidis et al. [58] proposed to use skin and depth information from RGB-D images. In the proposed method, the RGB image and the depth image are first obtained from a Kinect. Then, the hand is segmented by combining the skin color information and the depth information. Finally, the hand model is fitted to the real hand by optimization with PSO.
However, this work can track only one hand, so the authors further proposed a method that can track the full articulation of two hands from a video sequence [63]. The objective function measures the discrepancy between the image and the hand model based on the image depth values and color, and a PSO search heuristic is used to optimize it. The method enables tracking of two interacting hands with an accuracy of 6 mm.
Oikonomidis et al. further proposed a method for estimating the hand pose while the hand interacts with objects [64]. In addition to the hand model shown in Figure 8a, a hand collision model consisting of 25 spheres, shown in Figure 8b, was also proposed to keep track of interactions between the hand and the object.
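One way such a sphere-based collision model can be used is to penalize hypotheses in which hand spheres and object spheres interpenetrate. The sketch below is an assumption made for illustration and is not the exact term used in [64]: the function name `penetration_penalty` and the squared-overlap form of the penalty are hypothetical.

```python
import math


def penetration_penalty(spheres_a, spheres_b):
    """Sum of squared penetration depths between two sets of spheres.

    Each sphere is a tuple (x, y, z, radius). Two spheres penetrate when the
    distance between their centers is smaller than the sum of their radii.
    """
    total = 0.0
    for ax, ay, az, ar in spheres_a:
        for bx, by, bz, br in spheres_b:
            center_distance = math.dist((ax, ay, az), (bx, by, bz))
            overlap = ar + br - center_distance
            if overlap > 0.0:
                total += overlap ** 2
    return total


# A hand sphere overlapping an object sphere by 0.5 units, and a distant one.
hand = [(0.0, 0.0, 0.0, 1.0)]
touching_object = [(1.5, 0.0, 0.0, 1.0)]
distant_object = [(5.0, 0.0, 0.0, 1.0)]
```

A term of this kind can be added to the objective function so that the optimizer prefers physically plausible poses in which the hand does not pass through the object.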
Due to the fast motion of the hand, initialization based on the pose of the last frame is not good enough. Qian et al. [19] proposed a method that first detects the fingers to generate intermediate poses that help hand initialization. In their research, a hand model consisting of 48 spheres was used to estimate the hand pose, as shown in Figure 8c. They pointed out that neither gradient-based optimization nor stochastic optimization alone is good enough to minimize the cost function: the former is too sensitive to local minima, and the latter is too slow to converge. Observing the complementarity of the two methods, they used a hybrid optimization method, ICP-PSO, which converges faster and better resists local optima.
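A sphere-based hand model makes the data term of the cost function inexpensive to evaluate. As a rough sketch, assuming the cost is the mean squared distance between each observed 3D point and the surface of its nearest sphere (the name `sphere_model_cost` and this exact form are assumptions for illustration, not the full cost of [19]):

```python
import math


def sphere_model_cost(points, spheres):
    """Mean squared distance from observed 3D points to the nearest sphere surface.

    `points` is a list of (x, y, z) tuples from the depth map; `spheres` is a
    list of (x, y, z, radius) tuples describing the current model hypothesis.
    """
    total = 0.0
    for p in points:
        nearest = min(abs(math.dist(p, (cx, cy, cz)) - r)
                      for cx, cy, cz, r in spheres)
        total += nearest ** 2
    return total / len(points)


# One unit sphere; the first point lies on its surface, the second is 1 unit away.
model = [(0.0, 0.0, 0.0, 1.0)]
cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
cost = sphere_model_cost(cloud, model)
```

Both a local, ICP-like refinement and a PSO search can be driven by a cost of this kind, which is what makes a hybrid of the two straightforward to set up.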

Deformable Polygonal Mesh Model
The deformable polygonal mesh model (DPMM) usually consists of a surface model and an underlying skeleton model. In the parameter calculation process, a specific method is needed to deform the surface model according to pose changes of an underlying articulated skeleton.
In order to recover a 3D hand pose from RGB images only, de La Gorce et al. [55] proposed a deformable triangulated hand surface model, which had 28 DOF and was deformed according to pose changes of an underlying articulated skeleton using skeleton subspace deformation [65,66]. The model is shown in Figure 8. The proposed objective function handles self-occlusion and illumination problems and explicitly uses temporal texture continuity and shading information. The objective function is minimized using a quasi-Newton method. In each frame, the parameters of the hand model are initialized using the results of the previous frame.
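Skeleton-driven surface deformation is commonly implemented as linear blend skinning, in which each vertex is moved by a weighted average of the rigid transforms of the bones that influence it. The following 2D sketch is a simplified illustration under our own assumptions (two bones, hand-picked skinning weights, hypothetical function names) and is not the implementation used in [55].

```python
import math


def rotation_about(pivot, angle):
    """Return a function applying a 2D rotation by `angle` around `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot

    def apply(p):
        x, y = p[0] - px, p[1] - py
        return (px + c * x - s * y, py + s * x + c * y)

    return apply


def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform each vertex as the weighted sum of its bone-transformed copies."""
    skinned = []
    for vertex, vertex_weights in zip(vertices, weights):
        x = y = 0.0
        for w, transform in zip(vertex_weights, bone_transforms):
            tx, ty = transform(vertex)
            x += w * tx
            y += w * ty
        skinned.append((x, y))
    return skinned


# Bone 0 stays fixed; bone 1 bends by 90 degrees around the joint at (1, 0).
bones = [lambda p: p, rotation_about((1.0, 0.0), math.pi / 2)]
rest_vertices = [(0.5, 0.0), (1.5, 0.0), (2.0, 0.0)]
skin_weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]  # each row sums to 1
deformed = linear_blend_skinning(rest_vertices, skin_weights, bones)
```

The vertex near the joint, which is influenced equally by both bones, ends up between its two rigidly transformed positions, which is how the surface follows the skeleton smoothly.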

The quasi-Newton method used in the work above is a local optimization method, which is efficient but requires careful design of the objective function to avoid local minima. Ballan et al. [67] proposed a generative approach based on local optimization that uses a discriminatively trained salient point detector to achieve better accuracy. This method adds edges, optical flow, and collision information to the objective function, and can handle the interaction between two hands and objects. The proposed hand model consists of a surface mesh model and an underlying bone skeleton, and the surface deformations are encoded using the linear blend skinning (LBS) operator [66]. In each frame, the positions of the fingernails are detected as salient points using a Hough forest classifier. These salient points help to find the hand position during the interaction and to distinguish between the two hands. However, the method is computationally heavy and has poor real-time performance.
Different from the work above, Sridhar et al. [68] proposed a faster method that uses a linear support vector machine (SVM) classifier as the discriminative component to find the fingertip positions in the depth map. The proposed hand model is the sum of Gaussians (SoG) model, and the color information is used to calculate the hand model parameters; a gradient descent method is then used to optimize the parameters of the hand model. Tzionas et al. [69] also used the linear blend skinning (LBS) [66] model, which consists of a triangular mesh and an underlying kinematic skeleton. The method uses the information from an RGB-D image to track two interacting hands. This method requires only a single RGB-D camera for hand pose estimation, while the work of Ballan et al. [67] needed a more expensive and elaborate multi-camera system.
In the generative methods, the commonly used data types are RGB and RGB-D images, and the most commonly used optimization technique is PSO. The key information on generative methods is summarized in Tables 2 and 3.

Discriminative Methods
The goal of discriminative methods is to learn a mapping from visual features to the target parameter space, such as joint labels or 3D joint locations, from images or videos. Discriminative methods rely heavily on the quality of training data, as they require one or more datasets to train the model; the labels of the datasets give the positions of the hand joints. The goal of model prediction is to compute the coordinates of the hand joints in the image.
There are two major types of discriminative methods: random forest (RF)-based and convolutional neural network (CNN)-based models.

Random Forest
Methods based on random forests [70] consider hand pose estimation as a regression problem. This line of work was pioneered by Keskin et al. [71], who used a randomized decision forest (RDF) for hand shape classification and applied this shape classification forest (SCF) to a novel multi-layer RDF framework for hand pose estimation. This classifier assigns the input depth pixels to hand shape classes and directs them to the corresponding hand pose estimators trained specifically for that hand shape.
However, the above approach needs large amounts of per-pixel labeled training data, which are difficult to obtain, so it relies extensively on synthetic training data, which leads to performance discrepancies between real and synthetic pose data. To tackle this problem, Tang et al. [72] proposed the semi-supervised transductive regression (STR) forest to learn the relationship between a small, sparsely labeled real dataset and a large synthetic dataset using transductive learning. They also designed a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints.
Pixel-level classification is often prone to errors on noisy real-world data, so Liang et al. [73] used a superpixel-Markov random field (SMRF) parsing scheme to enforce spatial smoothness and the label co-occurrence prior in order to remove misclassified regions. They improved the robustness of regression with more discriminative depth-context features by using a novel distance-adaptive selection method.
To further improve the accuracy and efficiency of the regression forest-based method, Tang et al. [74] proposed a new forest-based, discriminative framework for structured searches in images called latent regression forest (LRF). The method takes a depth map as input and learns the topology of the hand with unsupervised learning in a data driven manner. The main difference of LRF from existing methods is that it employs a structured coarse-to-fine search on a point cloud instead of dense pixels, and an error regression step to avoid error accumulation. As shown in Figure 9, once LRF is trained, point-region correspondence can be found by a tree search in a divide-and-conquer way.
Instead of performing regression for all hand joints at once, one may employ a progressive strategy via a sequence of weak regressors [76]. Based on this idea, Sun et al. [77] proposed a cascaded regression method for hand pose estimation. Their key observation is that different object parts typically exhibit different amounts of variation and degrees of freedom due to the articulated structure. Thus, regressing all parts together is unnecessarily difficult and leads to slow convergence and degraded accuracy. Their hierarchical approach regresses the pose of different parts sequentially in the order of their articulation complexity. Similarly, Wan et al. [78] designed a hierarchical regression framework for estimating hand joint positions from single depth images following the tree-structured topology of the hand from the wrist to the fingertips. They proposed a conditional regression forest, i.e., the frame conditioned regression forest (FCRF) along with local surface normals instead of normal difference as features. This modification was shown to obtain consistent improvement over previous discriminative pose estimation methods on real-world datasets.

Convolutional Neural Networks
Deep learning has developed rapidly in recent years and has been widely used for hand pose estimation. This type of method trains deep convolutional neural networks, learning the model parameters from large labeled datasets so that the network can predict the joint locations.
Tompson et al. [79] proposed a four-stage method for hand pose estimation. First, a decision forest separates the hand from the background in the input image. Second, a robust method is used to label the dataset. Third, a deep convolutional neural network extracts heatmaps from the input hand image. Finally, features are extracted from the heatmaps, and an objective function is minimized to align the features of the model with the heatmap features.
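To illustrate how a joint location can be read out of a predicted heatmap, the sketch below shows two simple decoders: the location of the maximum response and the intensity-weighted centroid. These are simplified stand-ins chosen for illustration and are not necessarily the feature extraction used in [79]; the function names and the toy heatmap are hypothetical.

```python
def argmax_2d(heatmap):
    """Return (x, y) of the highest-scoring pixel; x is the column, y the row."""
    best_x, best_y = 0, 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > heatmap[best_y][best_x]:
                best_x, best_y = x, y
    return best_x, best_y


def centroid_2d(heatmap):
    """Return the intensity-weighted mean (x, y), allowing sub-pixel estimates."""
    total = sum(sum(row) for row in heatmap)
    x = sum(value * col
            for row in heatmap for col, value in enumerate(row)) / total
    y = sum(value * r
            for r, row in enumerate(heatmap) for value in row) / total
    return x, y


# Toy 3x4 heatmap for a single joint (rows are y, columns are x).
toy_heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]
peak = argmax_2d(toy_heatmap)
center = centroid_2d(toy_heatmap)
```

The maximum gives an integer pixel location, while the centroid shifts toward the side with more response mass and can therefore resolve positions between pixels.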
Although the method above shows good results in hand tracking, it is less effective in situations with occlusion, because it uses an inverse kinematics (IK) approach to recover a 3D pose from a 2D image. To solve this problem, Sinha et al. [20] proposed a method based on global and local regression. In their work, the parameters of the wrist are computed by global regression, and the parameters of the five fingers are then calculated separately using five local regression networks, as shown in Figure 10. This method can effectively deal with occlusion problems, and it also avoids the need to re-initialize all parameters when the previous frame is lost.
Figure 10. Global regression calculates the wrist parameters, local regression calculates five fingers parameters [20].
The work above only considered predicting the positions of hand joints directly. However, during hand movement, there is a strong correlation between different hand joints, so prior information can be introduced to constrain the parameter space. The method proposed by Oberweger et al. [80] adds prior information to predict the parameters of the pose in a lower-dimensional space, and can resolve the ambiguity of the finger joints. They introduced a "bottleneck" structure near the end of the network, i.e., a layer with only a small number of neurons.
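The idea of predicting the pose in a lower-dimensional space can be sketched as follows: the network outputs a short code, and a fixed linear map expands this code into the full vector of joint coordinates. The linear form of the decoder (for example, a basis obtained by principal component analysis of training poses), the function name `decode_pose`, and the toy numbers below are assumptions made for illustration.

```python
def decode_pose(code, mean_pose, basis):
    """Reconstruct joint coordinates as mean_pose + sum_k code[k] * basis[k]."""
    pose = list(mean_pose)
    for coefficient, direction in zip(code, basis):
        for i, component in enumerate(direction):
            pose[i] += coefficient * component
    return pose


# Toy example: two 2D joints (4 coordinates) described by a 2D code.
mean_pose = [0.0, 1.0, 0.0, 2.0]
basis = [
    [1.0, 0.0, 1.0, 0.0],  # moves both joints together along x
    [0.0, 0.0, 0.0, 1.0],  # moves only the second joint along y
]
pose = decode_pose([0.5, -1.0], mean_pose, basis)
```

Because every output is a combination of a few basis directions, implausible joint configurations that lie outside the span of the basis cannot be produced, which is how the prior constrains the prediction.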
Although the works above solve the occlusion problem or use prior information to constrain the parameter space to achieve good results, they are in general very demanding on the training dataset. To reduce the cost of obtaining large amounts of labeled real-world data, researchers often use synthetic data to train the convolutional neural network. For example, Ge et al. [81] used a synthetic dataset containing both ground truth 3D meshes and 3D poses to realize 3D hand shape and pose estimation. Wan et al. [82] used depth maps, which were generated online from a hand model provided by [45], to train the deep neural network.
Due to the gap between synthetic and real data, models trained with synthetic data often perform poorly once applied in the real world. Although we are aware of the importance of real data, building a dataset covering all possible camera viewpoints and hand poses with detailed annotations is still a big challenge. To build a functional model without a large training dataset, Baek et al. [83] proposed a method for synthesizing data using skeleton maps to add data to the skeleton space. As shown in Figure 11, the model consists of a hand pose estimator (HPE), a hand pose generator (HPG), and a hand pose discriminator (HPD). This method expands the existing dataset and proposes a way of generating depth map data based on the skeleton map. The simultaneous data generation and model training philosophy yields good prediction results. However, this method still imposes some constraints on the dataset that initiates the model. If the input skeleton map differs greatly from the maps in the dataset during testing, the generated depth map will be blurred, and the final prediction result will be affected.
Figure 11. The augmented skeleton space model [83]. HPE: hand pose estimator; HPG: hand pose generator; HPD: hand pose discriminator.
Thus, further efforts were made in this direction. Oberweger et al. [84] proposed a joint hand-object pose estimation approach that learns a synthesizer CNN to synthesize an image in the model. The synthesizer CNN can generate convincing depth images for a very large range of poses. The method introduces a feedback loop to refine the hand pose estimates. Yang and Yao [85] proposed a method to better deal with the problem of large discrepancies between backgrounds and camera viewpoints. The work proposed the use of disentangled representations and a disentangled variational autoencoder (dVAE) that can synthesize highly realistic images. Spurr et al. [86] developed a generative deep neural network to learn a latent space, which can be used directly to estimate 3D hand poses.
The above-mentioned discriminative methods are summarized in Table 4. There are also some works that can simultaneously track the human body, hands, and face. The convolutional pose machine (CPM) [87], trained with datasets such as FLIC [88], LSP [89], and MPII [90], can deal with cases where there are multiple human bodies and hands in the scene. It aims at single- and multi-person body pose estimation and can make good predictions of hand joint locations. The Perceptual Computing Lab at Carnegie Mellon University proposed a multi-task 2D human pose estimation method named OpenPose [91], which uses a multi-stage approach to estimate poses for human bodies, faces, and hands, where the hand pose estimation is based on improvements to CPM. As a multi-network approach, it directly uses existing body, face, and hand key point detection algorithms. Based on the OpenPose project, Hidalgo et al. [92] applied multi-task learning (MTL), a classic machine learning technique [93][94][95], to an improved OpenPose model in order to train the first single-network approach for 2D whole-body pose estimation. This method combines multiple independent key point detection tasks into a unified framework that simultaneously detects key points of the body, feet, face, and hands. For the hand pose estimation part, the dataset used is the OpenPose hand dataset [96], which combines a subset of 1k hand instances manually annotated from MPII [90] with 15k samples automatically annotated on the Dome or Panoptic Studio [97].
Compared with methods that leverage depth maps from commodity depth sensors (Table 4), obtaining a 3D hand pose from RGB images alone is generally more challenging. As Table 5 shows, Zimmermann and Brox pioneered this direction by proposing a deep network that learns a network-implicit 3D articulation prior [98]. Iqbal et al. proposed a novel 2.5D pose representation and implicitly learned depth map and heatmap distributions with a novel CNN architecture [99]. However, these methods require large amounts of annotated data, which are difficult to generate, so synthetic datasets are used instead. To ensure good generalization to real hands, Rad et al. learned a mapping from paired color and depth images and aligned synthetic depth images with the real depth images [100]. Cai et al. used a weakly supervised method that adapts from a fully annotated synthetic dataset to a weakly labeled real-world dataset with the aid of a depth regularizer [101]. Recently, Ge et al. made further improvements by proposing a graph CNN-based method to reconstruct both 3D hand poses and shapes represented by a full 3D mesh [81].
There are also some very recent works that brought new insights into the field by employing CNNs with custom modifications. CNNs can be applied to images from multiple viewpoints [102], combined with octrees [103,104], applied to a point cloud instead of pixels [105][106][107][108], or even built with a complete 3D architecture [109][110][111][112].
Table 4. Summary of discriminative methods for hand pose estimation with RGB and depth inputs.

Hybrid Methods
Generative methods need to re-compute the parameters of the hand model for each frame, the speed of which is slow and thus the real-time performance is usually poor. Moreover, the parameter of each frame of the hand model is often initialized based on the parameters of the previous frame. If the previous frame estimation has an error, this error will accumulate along the running process, thereby affecting the final quality of hand pose estimation.
Although the parameters of the model can be trained offline and used directly in prediction, discriminant methods require a large amount of annotated data to train the model. If the scenes used for training and testing are quite different, the quality of hand pose estimation will also be compromised.
Therefore, some researchers attempted to combine model-based and data-driven approaches. Xu and Cheng [18] used a single depth image and adopted the Hough forest model in a two-stage hand detection method. First the Hough forest model is used to provide an initial estimate of the direction and 3D position of the hand in the plane, then another Hough forest regression model, which is based on the hand coordinates and direction values acquired in the first step, is used to calculate the depth features that are invariant to the rotation in the plane. Next it uses the hand 3D model to generate a reasonable set of 3D candidate gestures. Finally, based on the candidate gesture, the pose estimation is performed by solving the optimization problem. The method uses a skinned mesh model combined with a discriminative approach to achieve hand pose estimation.
Baek et al. [119] proposed a model that is able to estimate the 3D skeleton structure of the hand from an RGB image and recover the hand shape from it. In their work, a 2D skeleton model was used to predict 21 joint points, and the 3D hand model was a generative mesh model named MANO [120], which represents the hand mesh with 45-dimensional pose parameters and 10-dimensional shape parameters and has also been used in some very recent work [121,122]. The model consists of three parts, namely a 2D evidence estimator to calculate the 2D skeleton coordinates of the hand from the RGB image, a 3D mesh estimator to compute the 3D mesh model of the hand, and a projector that combines the 3D model information with the hand skeleton coordinate information to obtain the coordinates of 3D hand joints. Another work from Zhang et al. [123] predicted the current hand pose from the previous poses with a pre-trained LSTM network, which is an interesting way to generate a "hand model" from past experiences.
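One common way to relate a 3D hand model to 2D skeleton evidence is a reprojection error under a pinhole camera. The sketch below illustrates that generic idea only; whether it matches the projector of [119] is an assumption, and the function names and camera intrinsics (`focal`, `cx`, `cy`) are hypothetical. The dimensions are the ones quoted above.

```python
import math

# Sizes quoted in the text: 21 joints, 45 pose and 10 shape parameters.
NUM_JOINTS, POSE_DIM, SHAPE_DIM = 21, 45, 10

def project(joints_3d, focal, cx, cy):
    """Pinhole projection of camera-space joints (x, y, z), z > 0, to pixels."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in joints_3d]

def reprojection_error(joints_3d, joints_2d, focal, cx, cy):
    """Mean pixel distance between projected 3D joints and 2D evidence."""
    projected = project(joints_3d, focal, cx, cy)
    total = sum(math.hypot(u - u0, v - v0)
                for (u, v), (u0, v0) in zip(projected, joints_2d))
    return total / len(joints_2d)

# Synthetic example: a flat row of joints half a metre in front of the camera.
joints_3d = [(0.01 * i, 0.0, 0.5) for i in range(NUM_JOINTS)]
evidence_2d = project(joints_3d, focal=500.0, cx=320.0, cy=240.0)
print(reprojection_error(joints_3d, evidence_2d, 500.0, 320.0, 240.0))
```

A term of this kind lets 2D joint annotations supervise the parameters of a 3D model even when 3D ground truth is unavailable.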

Public Datasets
At present, most hand pose estimation tasks take place under controlled conditions. Different camera viewpoints, hand poses and shapes, and illuminations and backgrounds are all required to be covered by the training dataset in order to obtain successful hand pose estimation results. However, so far, the variability and quantity in the existing datasets are still relatively limited.
The datasets used in the current literature include RGB images, depth images (depth maps), and their combination (RGB-D). For different data types, the corresponding labels and annotations in the datasets are also different. The datasets widely used in recent years are summarized in Table 6.
As shown in Tables 3 and 4, depth data is becoming more and more popular in hand pose estimation tasks as it is robust to color and illumination changes in the scene and can help extract hands from cluttered backgrounds. Commercial depth sensors such as Kinect and Intel RealSense have relatively good depth sensing performance, although the obtained depth maps are often degraded by noise. We can also see that some datasets contain purely synthetic data and others are constructed with real image data, but manual labeling is not always possible. Thus, to further improve the quality of discriminative methods and their ability to generalize to unseen situations, we can continue to pursue hybrid methods that are less dependent on the training datasets, especially by studying how discriminative methods can help hand model initialization and fast calibration. Another direction we can take is to develop weakly-supervised methods that demand less labeled training data.

Challenges and Future Work
From the analyses above, we can see that existing hand pose estimation systems can already accurately track the movement of the human hand in real time in a relatively controllable environment. However, hand pose estimation cannot yet be considered as a solved problem and still faces many challenges, especially in open and complex environments, where we should take the amount of computing resources needed into consideration.

Challenges
Wearable sensors, or data gloves, are promising for accurate and disturbance-free hand modeling since they generally have a compact design and are becoming lighter and less cumbersome for dexterous hand movements. However, there are three main challenges remaining to be solved.
First, most data gloves are still "in the lab" and there is no industrial standard on the design and fabrication of such devices, which leads to high costs of available commercial products, making them unaffordable for daily use. Second, except for gloves that are based on stretch sensors, most gloves have a fixed size and are difficult to fit to different users' hands. Lastly, gloves are unsuitable for certain cases: for example, some stroke patients have difficulties opening their hands to wear gloves designed for typical users, and gloves get in the way when the user needs to manipulate tiny objects or put their hands into water.
Vision-based methods, on the other hand, have overcome many difficulties faced by common computer vision tasks, such as rotation, scale and illumination invariance, and cluttered backgrounds. The high dimensional nature of hand pose representation, and even hand self-occlusion, are no longer obstacles in the way of achieving accurate hand pose estimation in real time. However, vision-based methods still face the following challenges. First, occlusion is still the major problem. As the hands are extensively used to manipulate objects in daily life, they are very likely to be blocked or partially blocked by objects during interaction, which forms the hand-object-interaction (HOI) problem. There are already some efforts to deal with object occlusion. For example, Tekin et al. [127] proposed an end-to-end architecture to jointly estimate the 3D hand and object poses from egocentric RGB images. Myanganbayar et al. [128] proposed a challenging dataset consisting of hands interacting with 148 objects as a novel benchmark for HOI.
Second, since many methods are data-driven, the quality and coverage of training datasets is of great importance. As discussed in Section 4.4, there are already many useful datasets with 2D/3D annotations. However, a large portion of the annotated data comes from synthetic simulations. Existing methods tried to employ weakly supervised learning, transfer learning, or different data augmentation approaches to better cope with the insufficiency of real-world data, but more data covering a wide range of viewpoints, shapes, illumination conditions, background variations, and objects in interaction are required to train deep learning-based architectures; otherwise, we must find a new way to incorporate the hand model for 3D pose recovery.
Moreover, most deep learning-based methods also require large amounts of computational resources during the training and inference stages. Many algorithms need to run on a graphics processing unit (GPU) to achieve a real-time frame rate, making them difficult to deploy on portable devices such as mobile phones and tablets. Thus, it is important to find effective and efficient solutions on mobile platforms for ubiquitous applications.

Future Work
To conclude, various devices and methods have already enabled hand pose estimation for different applicative purposes in a controlled environment, and we are not far from real-time, efficient, and ubiquitous hand modeling.
In the near future, expertise from material science and electronics is needed to build data gloves for accurate hand modeling that are easy to wear and maintain, yet more affordable. Regarding vision-based methods, data-efficient methods such as weakly supervised learning or hybrid methods are needed to minimize the dependency on large hand pose datasets and to improve the generalization ability to unseen situations. Moreover, we can already see the benefits of new sensors, e.g., the depth sensor, as they can largely reduce the computational complexity of deducing 3D poses from 2D data; thus, novel accurate long-range 3D sensors will definitely contribute to contactless hand pose estimation.