Vision-based Hand Recognition Based on ToF Depth Camera

Abstract Current gesture recognition methods mostly adopt classification-based approaches such as Neural Networks (NN), Support Vector Machines (SVM), and Hidden Markov Models (HMM). As for input image features, most studies combine color and depth images (e.g., RGB-D) to obtain more accurate information about the hand area, but such techniques incur high computational cost and energy consumption. To provide a low-cost gesture recognition method for wearable devices, this thesis uses only a Time-of-Flight depth camera to achieve lightweight gesture recognition. In many traditional gesture recognition methods, users have to wear gloves or bracelets so that depth cameras can accurately capture the hand area and extract the hand contours, palm distances, and angle features. Moreover, the Earth Mover's Distance (EMD) algorithm adopted in most gesture recognition approaches incurs high computation time. In this study, to avoid gloves or bracelets, we propose a new algorithm that computes the wrist cutting edge and captures the palm area. In addition, this thesis proposes an efficient finger detection algorithm that determines the number of fingers and significantly reduces computation time. In the experimental results, our proposed method achieves a recognition rate of 90% and runs at 5 frames per second on the NVIDIA TX1 embedded platform.


Introduction
Human-computer interaction (HCI) is a popular topic: machine operation modes are evolving and receiving growing attention. As machine intelligence becomes widely used, many devices integrate computing platforms to increase users' interest in and opportunities for interacting with machines, so HCI is an important issue. People want to control devices with their hands and reduce their dependence on other input devices; therefore, gesture recognition has become an important research direction. The following illustrates gesture recognition applications in major areas in recent years.
In 2016, at the Consumer Electronics Show, the vehicle manufacturer BMW showed concept cars that can be controlled by gestures [1]. Based on this concept, users can control the entertainment or air-conditioning devices by gesture while the vehicle is in automatic mode, which improves comfort and convenience when operating the vehicle's peripheral systems.
In the field of virtual reality, users have to wear a head-mounted display, so the medium through which they interact with the virtual environment becomes an important issue. The company 'eyeSight' specializes in machine vision and gesture recognition and provides convenient gesture interfaces on many digital devices [2] such as personal computers, tablets, TVs, and wearable devices; it also offers a gesture recognition solution for virtual reality that lets users operate menus with their hands [3]. These examples show that as human-computer interaction flourishes, gesture recognition becomes more and more important. Therefore, this study proposes a gesture recognition and control system that performs recognition based solely on the depth image from a depth camera. The system is implemented for the head-mounted wearable scenario and uses a small-footprint depth camera module to provide an easy gesture control interface for the users. Moreover, compared with the state of the art [4,5], this study significantly reduces the users' dependence on extra devices and is robust to various ambient lighting conditions.
Many methods have been developed for gesture recognition. Regarding the sensing device, they can be divided into glove-based and glove-free approaches. When users wear special sensor gloves [6], the sensors on the glove capture the posture and motion track of the hand and return them to the computer for analysis and calculation, achieving accurate gesture recognition results. Glove-based methods are highly accurate, but the glove device increases cost and gives users a poor experience. Therefore, to free users from sensor gloves and other devices, vision-based methods use a camera to capture and track the user's hand area and then perform gesture recognition. For example, [7] uses a Kinect as the input device, applies image processing to the depth image, and trains a Support Vector Machine (SVM) to classify the input image and achieve gesture recognition. In [8], the hand image is extracted by background subtraction, and gesture recognition is performed by separating the palm and the fingers. For the recognition stage, common classification methods include the Hidden Markov Model (HMM) [9], the Support Vector Machine (SVM) [10], and Neural Networks (NN) [11,12]; these classifiers determine whether the captured image is similar to a gesture in the database, but they require a large number of training samples. Besides classifiers, there are feature comparison methods such as the Finger-Earth Mover's Distance (FEMD) [13], in which users wear gloves or bracelets so that cameras can accurately capture the hand area and obtain the hand contours, palm distances, and angle features.

Materials and Methods
In gesture recognition studies based on image processing, some methods use RGB features such as skin color detection [14] to remove everything except the hand area, and [5] uses the OpenNI SDK to capture and track the hand area. Compared with these methods, this study relies only on the depth image to achieve gesture recognition.

System Architecture
The system flow of this study is shown in Figure 1. The system captures a 165 × 120 pixel depth image, applies normalization, binarization, a median filter, and morphological operations to locate the palm center, and then applies static and dynamic gesture recognition methods to judge the gesture.
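The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the exact pipeline, and the working depth range of 100–600 mm is an assumed value:

```python
import numpy as np

def preprocess(depth, near=100, far=600):
    """Normalize a raw depth frame to 0-255 and binarize the hand region.

    Pixels inside [near, far) millimetres (hypothetical thresholds) are
    treated as candidate hand pixels; everything else becomes background.
    """
    d = depth.astype(np.float32)
    # Min-max scale the working depth range to 0-255.
    norm = np.clip((d - near) / (far - near), 0.0, 1.0) * 255.0
    # Binarize: any pixel inside the working range is foreground (255).
    binary = np.where((depth >= near) & (depth < far), 255, 0).astype(np.uint8)
    return norm.astype(np.uint8), binary

def median3x3(img):
    """A minimal 3x3 median filter to suppress speckle noise in the mask."""
    padded = np.pad(img, 1, mode="edge")
    stack = [padded[y:y + img.shape[0], x:x + img.shape[1]]
             for y in range(3) for x in range(3)]
    return np.median(np.stack(stack), axis=0).astype(img.dtype)
```

The morphological operations used to isolate the palm are shown in the next section.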

Palm Center Detection
Here, we use morphological operations to locate the palm center. First, an opening operation erodes the hand area, as shown in Figure 2(b); then dilation expands the remaining block, as shown in Figure 2(c), which approximates the palm region. We then fit an approximate ellipse to this block and take the ellipse center as the palm center, shown as the blue point in Figure 2(d).
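The opening-then-dilation step can be sketched with plain NumPy as below. For simplicity this sketch takes the centroid of the opened blob as a stand-in for the fitted ellipse center, and the 5 × 5 structuring element is an assumed size:

```python
import numpy as np

def erode(mask, k=3):
    """Binary erosion: a pixel survives only if its whole kxk window is set."""
    pad = k // 2
    padded = np.pad(mask.astype(bool), pad, mode="constant")
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

def dilate(mask, k=3):
    """Binary dilation: a pixel is set if any pixel in its kxk window is set."""
    pad = k // 2
    padded = np.pad(mask.astype(bool), pad, mode="constant")
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def palm_center(mask, k=5):
    """Open (erode then dilate) to keep the thick palm blob and drop thin
    fingers, then take the blob centroid as an approximate palm center."""
    palm = dilate(erode(mask, k), k)
    ys, xs = np.nonzero(palm)
    return (xs.mean(), ys.mean())  # (cx, cy)
```

Erosion removes thin structures such as fingers, so what survives the opening is dominated by the palm blob, which is why its center approximates the palm center.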

Finger Detection
We use Suzuki's algorithm [15] to extract the contours of large blocks in the binarized image, as shown in Figure 3(b). When a block is larger than a certain threshold, it is judged as a possible hand area; we then use the Ramer-Douglas-Peucker algorithm [16] to reduce the number of contour points, as shown in Figure 3(c).
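The contour simplification step can be illustrated with a self-contained Ramer-Douglas-Peucker implementation (the contour extraction itself, Suzuki's algorithm, is omitted here):

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker polyline simplification: recursively keep the
    point farthest from the chord while it deviates more than epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)
    # Find the interior point with the largest perpendicular distance
    # to the chord joining the two endpoints.
    best_i, best_d = 0, 0.0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        if norm == 0.0:
            d = math.hypot(px - x1, py - y1)
        else:
            d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > best_d:
            best_i, best_d = i, d
    if best_d > epsilon:
        # Split at the farthest point and simplify both halves.
        left = rdp(points[:best_i + 1], epsilon)
        right = rdp(points[best_i:], epsilon)
        return left[:-1] + right
    # All interior points are within epsilon: keep only the endpoints.
    return [points[0], points[-1]]
```

Near-collinear runs of contour points collapse to their endpoints, while genuine corners (such as fingertips and finger valleys) survive.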

Hand Segmentation
As shown in Figure 4, at different distances from the depth camera the visible arm length differs, which changes the angle features, affects the EMD calculation results, and reduces the recognition success rate. Therefore, this study implements a hand segmentation method that removes the arm and keeps the palm, so that the feature differences across distances are narrowed. Observing the hand contour, we must find two reference points WPa and WPb at which the arm area can be cut off. We set a sliding window SW to scan from the bottom of the image, as shown in Figure 5(a), checking the width difference of the white pixels over a certain interval. As shown in Figure 5(b), the window scans from the bottom up to point P and measures the gradient change of the hand contour width; the row with the largest change yields the reference points, and the arm can then be removed, as shown in Figure 5(c).
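The sliding-window wrist search can be sketched as follows. This is a simplified, single-reference-row version of the WPa/WPb search, and the 5-row scan interval is an assumed value:

```python
import numpy as np

def wrist_row(mask, interval=5):
    """Scan upward from the image bottom and return the row where the
    white-pixel width changes the most over the scan interval -- a sketch
    of the wrist-cut reference point search."""
    widths = mask.astype(bool).sum(axis=1)  # white-pixel width per row
    best_row, best_change = None, 0
    for y in range(len(widths) - 1, interval - 1, -1):  # bottom to top
        change = abs(int(widths[y]) - int(widths[y - interval]))
        if change > best_change:
            best_row, best_change = y, change
    return best_row

def remove_arm(mask, interval=5):
    """Zero out the rows from the detected wrist row down (the arm side)."""
    cut = wrist_row(mask, interval)
    out = mask.copy()
    if cut is not None:
        out[cut:, :] = 0
    return out
```

The narrow arm meeting the wide palm produces the largest width gradient near the wrist, which is where the cut is placed.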

Feature Extraction and EMD Calculation
After hand segmentation, we compute a feature histogram of the hand area as the EMD feature set. For both the input image and each database image, we calculate the Euclidean distance and the relative angle between every contour point and the palm center as the image feature set. Suppose the contour set is {CP1, CP2, ..., CPi}, where i is the number of contour points, as shown in Figure 6(a). The Euclidean distance between contour point CPi = (xi, yi) and the palm center C = (xc, yc) is defined as Edi:
Edi = sqrt((xi - xc)^2 + (yi - yc)^2)   (1)
After calculating the Euclidean distances, they are normalized to the range 0 to 255: we record the maximum distance Edmax and the minimum distance Edmin over the contour points and normalize each distance as
NEdi = 255 * (Edi - Edmin) / (Edmax - Edmin)   (2)
After calculating the Euclidean distance, we also calculate the angle of each contour point relative to the palm center, as shown in Figure 6.
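The distance and angle features of Eqs. (1) and (2) can be sketched as follows. The emd_1d helper is only a one-dimensional illustration of the Earth Mover's Distance (the L1 distance between cumulative histograms), not the full feature-set EMD used in the study:

```python
import numpy as np

def contour_features(contour, center):
    """Per-contour-point features: Euclidean distance to the palm center,
    min-max normalized to 0-255 (Eqs. (1)-(2)), plus the relative angle
    of each point around the center."""
    pts = np.asarray(contour, dtype=np.float64)
    cx, cy = center
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    ed = np.hypot(dx, dy)                              # Eq. (1)
    ed_min, ed_max = ed.min(), ed.max()
    ned = 255.0 * (ed - ed_min) / (ed_max - ed_min)    # Eq. (2)
    angle = np.degrees(np.arctan2(dy, dx))             # relative angle
    return ned, angle

def emd_1d(h1, h2):
    """1-D Earth Mover's Distance between two equal-length histograms:
    the L1 distance between their cumulative sums."""
    c1, c2 = np.cumsum(h1), np.cumsum(h2)
    return float(np.abs(c1 - c2).sum())
```

A small EMD between the input and a database feature histogram indicates similar gestures, which is why the recognition decision thresholds the EMD value.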

Experimental Environment
The experimental environment specifications are shown in Tables 1 and 2.

Results
The study used a personal computer and a depth camera for the experiments and defined ten gestures for the experimental tests.

Gesture Experiment
The defined gesture names, real images, and depth images are shown in Figures 8 and 9. To compare the experimental results under different EMD thresholds and distances, we tested three EMD thresholds (10, 14, and 18) and three distances from the camera (15, 20, and 25 cm), as shown in Tables 3-5.
From the experiments with different EMD thresholds, conducted with the setup depicted in Figure 7, we can learn the following: increasing the EMD threshold raises the recognition rate, but once the threshold exceeds a certain value, the gain in recognition rate slows down, as shown in Figure 10 (the Y-axis is the recognition rate in percent). We then compare the same user (User1) at an EMD threshold of 18 but at different distances, with and without the hand segmentation algorithm. Table 6 lists the experiment videos at the different distances, and Tables 7-13 are the comparison tables for the different distances.
Summarizing the above experiments: at distances of 15-25 cm the gesture recognition works best, while being too close or too far degrades the image quality and lowers the recognition rate.
Table 14 gives the gesture recognition rate for multiple users, and Table 15 compares our method with the one in [4]. Our method performs hand segmentation to capture the palm region before the EMD calculation, whereas [4] uses thresholding decomposition with a fixed threshold. The average recognition success rate of our method is 84.5%, versus 69.81% for [4], which shows that the method of this study achieves a higher recognition success rate.

Conclusions
Integrating finger detection [5], EMD, and hand segmentation achieves a recognition rate of 80% across various users. When the hand is too close to or too far from the camera, the input image quality degrades and the angle features of the gesture differ more; applying the hand segmentation method improves the recognition rate by 30%. This shows that hand segmentation improves the gesture recognition success rate. Beyond the hand segmentation method, this thesis applied a depth camera in the wearable scenario and defined gesture classes applicable to that scenario, contributing to the field of wearable systems.