
1 Introduction

Gestures are one of the most common and natural ways people communicate; humans move arms, hands, fingers or even the whole body to transmit information or interact with the environment. In recent years, the development of Human-Computer Interaction systems has received great attention from the research community, with the aim of developing natural and unobtrusive interfaces that allow users to interact with a system without any hand-held device. Gesture recognition systems can be profitably used in a variety of applications [4]; among others, sign language translation, daily assistance to elderly or disabled people, security applications and gaming are probably the most relevant.

This work focuses on the development of a gesture recognition system for CAD interfaces. Although the realisation of a complete 3D model requires fine user movements that are difficult to achieve outside sophisticated traditional CAD interfaces, more intuitive and natural interactions can be useful for initial prototyping or for subsequent interaction with existing models. The widespread diffusion of low-cost RGB-D sensors (e.g. Kinect) and their ability to track users’ movements has greatly fostered research in this field. The approach proposed in this paper is based on the Leap Motion Controller (LMC) [13, 17], which provides interesting functionalities for detecting and tracking the user’s hands; since it is designed to work at short distance, hand information is provided with a noticeably higher level of precision than previous devices operating at larger distances and tracking the whole human body.

The proposed gesture recognition approach is based on a novel, compact but effective hand representation coupled with Long Short-Term Memory (LSTM) networks, a natural choice given their ability to manage sequences of inputs over time. When dealing with neural networks, the achievable accuracy is often limited by the availability of training data; although gesture recognition is by nature generally considered a small-scale problem, the training set can be enlarged with artificially generated data. A further contribution of this paper is the definition of data augmentation techniques able to produce additional training data while keeping the semantics of the gestures unaltered. Finally, a new dataset of gestures will be made available to the research community to allow for future comparisons.

The paper is organised as follows: Sect. 2 presents the state of the art, with particular reference to gesture recognition for CAD applications; Sect. 3 describes the proposed approach; the experiments are described in Sect. 4; and Sect. 5 draws some conclusions.

2 Related Works

The recent literature on human gesture recognition is huge and a comprehensive review goes beyond the scope of this work; interested readers can refer to [3, 4, 15] for recent surveys on 3D hand gesture recognition. Several solutions for natural CAD interfaces have been proposed in the literature. Many works propose contact-based solutions where the user interacts with the system by means of ad-hoc input devices. In [10] different techniques for sketch-based modeling are described, where users interact with CAD applications by means of sketches; in [19] a Virtual Reality based system is described, where an electronic data glove is suggested as the input device. Several vision-based techniques have also been proposed as an alternative to contact-based solutions, with the aim of providing the user with a more natural interface. No direct interaction with input devices is required in this case; gesture interpretation is based on data streams acquired by cameras of different nature (e.g. RGB or depth). One of the most interesting sensors in this context is the Microsoft Kinect [12, 18], a low-cost device able to capture RGB and depth data streams in parallel; its success is largely related to the skeleton representation provided by the SDK, which makes it easy to track subjects and analyze their behaviour. The use of Kinect for gesture recognition in CAD applications is proposed in some works [7, 8, 16]; however, it is worth noting that the fine hand gestures needed to precisely interact with the system are difficult to capture with Kinect due to its simplified skeleton model, where each hand is identified by a single joint (in the palm) and no information about fingers is provided. The Leap Motion Controller works at smaller distances than Kinect and offers a much more detailed hand representation, where each finger is represented by several joints. In [2] multiple LMC applications in Human-Computer Interaction are described, ranging from the medical field to human-robot interaction, and from games and gamification to sign language recognition. A CAD interface based on the LMC is described in [14], where a proof-of-concept system able to recognize a set of gestures is presented; the details of the recognition approach are not given and the dataset used for testing is not available, making a comparison with our proposal impossible. A particularly relevant work for our study is [1], where the use of the LMC coupled with recurrent neural networks is discussed for sign language and semaphoric gesture recognition. The authors adopt a complex hand model and a deep network to deal with gestures of different nature, with interesting results; we will show in our experiments that, for the specific CAD context, a simpler representation and a relatively small network are sufficient to reach fully satisfactory results.

3 Proposed Approach

This paper proposes a novel approach for gesture recognition based on the Leap Motion Controller. The Leap Motion Controller is a device designed to detect and track the user’s hands; it is usually placed on the physical desktop in front of the computer, or mounted on a headset for virtual reality. The device has two monochromatic IR cameras and three infrared LEDs. The IR light emitted by the LEDs is reflected by the user’s hands and captured by the cameras. With this setup, the device is able to perceive the user’s hands inside a hemispherical area up to a distance of 1 m, with a precision of 0.7 mm and a frame rate of up to 200 fps. The information acquired by the sensor is then used to create an internal representation of the two hands, easily accessible through the provided SDK.
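For illustration, the hedged sketch below shows how the raw hand data used in the next subsection (palm position and direction, arm direction, per-phalanx bone directions) could be read through the Python bindings of the legacy Leap Motion SDK; the attribute names follow the v2-era API and should be checked against the installed SDK version.

```python
import Leap  # legacy Leap Motion SDK (v2-era Python bindings)

# Phalanx bones considered in our representation; the thumb yields only
# two usable angles (see Sect. 3.1).
BONE_TYPES = (Leap.Bone.TYPE_PROXIMAL,
              Leap.Bone.TYPE_INTERMEDIATE,
              Leap.Bone.TYPE_DISTAL)

def read_hand(frame):
    """Return the raw geometric data of the first visible hand (None if no hand)."""
    if frame.hands.is_empty:
        return None
    hand = frame.hands[0]
    return {
        "palm_position": hand.palm_position,   # 3D point p
        "palm_direction": hand.direction,      # d_p
        "arm_direction": hand.arm.direction,   # d_a
        "bone_directions": [                   # d_{b_{f,p}}
            [finger.bone(t).direction for t in BONE_TYPES]
            for finger in hand.fingers
        ],
    }

controller = Leap.Controller()
hand_data = read_hand(controller.frame())  # most recent tracking frame
```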

3.1 Hand Representation

The hand skeleton information extracted by the LMC consists of a set of attributes providing geometric data about the user’s palm, fingers and arm, as well as high-level information such as acquisition confidence and grabbing or pinching strength. Among the different data provided, the geometric ones are the most relevant to our model. Our objective is to define a representation capturing the gesture evolution expressed by the hand pose, without including any information related to the hand shape, which is user-specific and not meaningful for gesture recognition. For this reason we neglect most of the data related to the hand position in space (except the palm position, used as a reference to evaluate hand translation over time), and we mainly rely on the directions characterizing the hand and fingers. In particular, our representation exploits (see Fig. 1a):

  • arm: described by its direction \(\mathbf {d_a}\);

  • palm: described by its 3D position \(\mathbf {p}\) and its direction \(\mathbf {d_p}\);

  • fingers: each finger is a complex object, consisting of a list of bones representing the individual phalanges. We consider the direction of each bone \({\mathbf {d_{b_{f,p}}}}\), with f being the finger index (\(f={1,..,5}\)) and p the phalanx index (\(p={1,..,3}\)).

Starting from the hand information provided by the LMC, we defined a set of numerical features able to encode the hand pose as well as its movement in space over time. The use of angle values, instead of joint positions, makes it possible to achieve a good level of invariance with respect to users’ specific hand characteristics. In this work only gestures involving a single hand are considered, but the proposed model can easily be extended to the more general case where the user exploits both hands.

Using the raw data described above, the following features are extracted for each frame \(i\):

  • the translation \(\mathbf {\Delta p}(i)\) of the palm position with respect to frame \(i-1\):

    $$\mathbf {\Delta p}(i) = \mathbf {p}(i) - \mathbf {p}(i-1)$$
  • the angle \(\omega (i)\) between the palm direction and the arm direction, computed as:

    $$\omega (i) = \arccos \left( \frac{\mathbf {d_p}(i)\cdot \mathbf {d_a}(i)}{|\mathbf {d_p}(i)|\cdot |\mathbf {d_a}(i)|} \right) $$
  • a set of angles \(\alpha _{f,p}(i)\), with \(f={1,..,5}\) and \(p={1,..,3}\), representing for each finger the angle between the palm direction and each finger phalanx:

    $$\alpha _{f,p}(i) = \arccos \left( \frac{\mathbf {d_p}(i)\cdot \mathbf {d_{b_{f,p}}}(i)}{|\mathbf {d_p}(i)|\cdot |\mathbf {d_{b_{f,p}}}(i)|} \right) $$

    Please note that for the thumb, only the \(\alpha _{f,p}\) angles of two phalanges can be computed (i.e. for \(f=1\), \(p=1,2\)).

The angles \(\alpha _{f,p}\) capture finger extension or closure, while the \(\omega \) angle captures the wrist movement during the gesture. Each angle is measured in the plane formed by the two directions involved. In order to keep track of the hand’s spatial movement, we consider only the variation of the palm centre coordinates; by considering only the point variation and not its absolute coordinates, the resulting features are invariant to the initial hand position. Each frame of the video sequence is therefore represented by an 18-dimensional vector obtained by the ordered concatenation of the values described above (3 translation values, one per axis, 1 \(\omega \) angle, and 14 \(\alpha _{f,p}\) values). The sequence length is fixed to 60 frames per gesture.
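As a minimal sketch, assuming the raw palm positions and directions have already been read from the LMC as numpy arrays, the per-frame feature extraction could be implemented as follows; the function names and data layout are illustrative, not the original implementation.

```python
import numpy as np

def angle_between(u, v):
    """Angle (in radians) between two direction vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def frame_features(palm_pos, prev_palm_pos, palm_dir, arm_dir, bone_dirs):
    """Build the 18-dimensional feature vector of one frame.

    bone_dirs: list of per-finger lists of phalanx directions
               (2 entries for the thumb, 3 for each other finger), 14 in total.
    """
    delta_p = palm_pos - prev_palm_pos            # 3 values: palm translation
    omega = angle_between(palm_dir, arm_dir)      # 1 value: palm/arm angle
    alphas = [angle_between(palm_dir, d)          # 14 values: palm/phalanx angles
              for finger in bone_dirs for d in finger]
    return np.concatenate([delta_p, [omega], alphas])   # shape (18,)

def gesture_features(palm_positions, palm_dirs, arm_dirs, bone_dirs_seq):
    """Stack the per-frame features of a 60-frame gesture into a (60, 18) array.

    For the first frame the translation is zero by construction.
    """
    return np.stack([
        frame_features(palm_positions[i], palm_positions[max(i - 1, 0)],
                       palm_dirs[i], arm_dirs[i], bone_dirs_seq[i])
        for i in range(len(palm_positions))
    ])
```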

Fig. 1.

Hand model: the palm direction (red) intersects the different phalanx directions (blue) and the arm direction (green), forming the angles used to model the hand pose (for instance, the features are shown for each phalanx \(p=1,..,3\) of finger \(f=2\)). The palm position (black dot) is used to keep track of the hand movement. (Color figure online)

Fig. 2.

Network structure unrolled through time.

3.2 Network Structure

Our approach exploits Recurrent Neural Networks (RNNs) to recognize gestures; in particular we evaluated two variants: Long Short-Term Memory (LSTM) [9] and Gated Recurrent Unit (GRU) [5]. All RNNs have internal state vectors that can store past events and process current data based on the past, but LSTM and GRU in particular are able to handle the longer-term dependencies characterising longer sequences of data. The results obtained using LSTM or GRU are often comparable in terms of accuracy [6]. We chose a many-to-one network model: the network processes all the sequence elements before returning the predicted class. We chose a fixed length of 60 frames for the sake of simplicity, because it has proved to be a sufficient time span for every gesture (about 2–3 s per gesture). The model can easily be adapted to different frame lengths or even variable lengths among samples. For our problem, we sized the network as shown in Fig. 2: the input layer has 18 neurons, corresponding to the size of the feature vectors; it is connected to two hidden layers, each composed of 200 LSTM cells. The final layer is a fully-connected layer, which takes as input the last output of the second hidden layer; this layer works as a classifier and returns the probability of each class for the current gesture. As the optimization algorithm used to minimize the loss function during the training phase, we chose the Adam optimizer because in several contexts it provides better performance than other optimizers [11]. The learning rate is fixed to 0.0005.
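The sketch below reproduces the described architecture in Keras/TensorFlow; the layer sizes, sequence length and learning rate follow the text, while the number of classes (8) comes from Sect. 4.1. It is an illustrative reconstruction, not the authors' original code.

```python
import tensorflow as tf

NUM_CLASSES = 8      # gestures defined in Sect. 4.1
SEQ_LEN = 60         # frames per gesture
NUM_FEATURES = 18    # size of the per-frame feature vector

def build_model(cell="lstm"):
    """Two recurrent layers of 200 units followed by a softmax classifier."""
    rnn = tf.keras.layers.LSTM if cell == "lstm" else tf.keras.layers.GRU
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN, NUM_FEATURES)),
        rnn(200, return_sequences=True),   # first hidden layer
        rnn(200),                          # second hidden layer, last output only
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model("lstm")
# model.fit(train_x, train_y, ...) with train_x of shape (N, 60, 18)
```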

3.3 Data Augmentation

In order to increase the data available for network training, a data augmentation technique is proposed; in particular, some transformations are applied to the original data to produce new gestures that reproduce the main gesture characteristics without introducing “unnatural” movements or hand poses.

Please note that the same random transformation is applied to the whole gesture, since applying independent variations to the single frames would produce a noisy, non-smooth pattern.

Trajectory Rotation and Scaling. The first transformation applies to the hand trajectory, described by the palm position \(\mathbf {p}_i\) across time. An affine transform is applied to produce trajectory rotation and scaling; trajectory translation would be totally ineffective, since the trajectory is finally encoded in terms of position variations (the \(\mathbf {\Delta p}_i\) features) to achieve independence from the absolute coordinates. The affine transform given in Eq. (1) produces:

  • a trajectory rotation of \(\theta _x\), \(\theta _y\) and \(\theta _z\) degrees on the X, Y and Z axis, respectively;

  • a trajectory scaling of \(s_x, s_y, s_z\) on the three axis.

The transformation parameters are randomly generated within the ranges given in Table 1. The rotation on the X axis is kept quite small, because higher values would excessively alter the nature of the gesture; larger variations can be applied on the Y and Z axes. Moreover, a uniform scaling is applied (i.e. \(s_x = s_y = s_z\)).

$$\begin{aligned} \begin{bmatrix} p_x' \\ p_y' \\ p_z' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos \theta _x & -\sin \theta _x \\ 0 & \sin \theta _x & \cos \theta _x \end{bmatrix} \begin{bmatrix} \cos \theta _y & 0 & \sin \theta _y \\ 0 & 1 & 0 \\ -\sin \theta _y & 0 & \cos \theta _y \end{bmatrix} \begin{bmatrix} \cos \theta _z & -\sin \theta _z & 0 \\ \sin \theta _z & \cos \theta _z & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & s_z \end{bmatrix} \begin{bmatrix} p_x \\ p_y \\ p_z \end{bmatrix} \end{aligned}$$
(1)
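A hedged numpy sketch of this transformation is given below; the parameter ranges are placeholders standing in for the values of Table 1.

```python
import numpy as np

def rotation_matrix(theta_x, theta_y, theta_z):
    """Compose the X, Y and Z rotations of Eq. (1) (angles in radians)."""
    cx, sx = np.cos(theta_x), np.sin(theta_x)
    cy, sy = np.cos(theta_y), np.sin(theta_y)
    cz, sz = np.cos(theta_z), np.sin(theta_z)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rx @ ry @ rz

def augment_trajectory(palm_positions, rng,
                       max_theta=(np.radians(5), np.radians(20), np.radians(20)),
                       scale_range=(0.9, 1.1)):
    """Apply one random rotation plus uniform scaling to a whole (60, 3) trajectory.

    max_theta and scale_range are illustrative placeholders; the actual ranges
    are those reported in Table 1.
    """
    thetas = [rng.uniform(-m, m) for m in max_theta]
    s = rng.uniform(*scale_range)                 # uniform scaling: s_x = s_y = s_z
    transform = rotation_matrix(*thetas) @ (s * np.eye(3))
    return palm_positions @ transform.T           # same transform for every frame
```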

Hand Pose Variation. The second transformation applies to the hand pose, represented by the \(\alpha _{f,p}\) angles. Each angle is slightly modified to generate a new pose that is still natural and realistic. In particular, to effectively emulate the natural movement of the fingers, the amplitude of the applied variation is directly proportional to the distance of the phalanx from the palm (see \(v_1, v_2\) and \(v_3\) in Table 1): the farther the phalanx is from the palm, the wider the angle resulting from the applied variation.

For this reason, the transformation factor is slightly different from phalanx to phalanx. Let \(\alpha '_{f,p}\) be the angle generated from the original \(\alpha _{f,p}\); it is computed as:

$$\alpha '_{f,p} = v_p \cdot \alpha _{f,p}$$

Variations can be applied evenly in both directions (extending the fingers or closing them further).
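A corresponding sketch of the pose variation is shown below, with placeholder factors standing in for the \(v_1, v_2, v_3\) values of Table 1; the column ordering of the angle matrix (thumb first, then the other fingers) is an assumption of this example.

```python
import numpy as np

def augment_pose(alphas, rng, v_ranges=((0.95, 1.05), (0.9, 1.1), (0.85, 1.15))):
    """Scale the alpha angles of a whole gesture, one factor per phalanx level.

    alphas:   array of shape (60, 14) with the alpha_{f,p} angles of each frame,
              ordered as thumb (2 angles) followed by the other fingers (3 each).
    v_ranges: illustrative ranges for v_1, v_2, v_3; the real values are in Table 1.
    """
    phalanx_index = [0, 1] + [0, 1, 2] * 4               # p index (0-based) per column
    v_p = [rng.uniform(*r) for r in v_ranges]            # one factor per phalanx level
    factors = np.array([v_p[p] for p in phalanx_index])  # expand to the 14 columns
    return alphas * factors                              # same factors for every frame
```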

Table 1. Transformations applied in data augmentation for trajectory rotation and scaling and for hand pose modification.
Fig. 3.

Examples of data augmentation: (a) hand pose variation (example on a single finger) and (b) gesture trajectory scaling. Solid blue lines represent the original data, orange dotted lines the derived ones. (Color figure online)

4 Experiments

4.1 Dataset

To the best of our knowledge, no public benchmarks including raw data acquired by the LMC are available. The authors of [1] share their dataset, but only in terms of extracted features; the raw data needed to derive our representation are not available, thus making an evaluation on that dataset impossible. We therefore decided to collect a new dataset including gestures that could be used in a hypothetical CAD software. In particular, starting from the interface described in [14], we defined 8 gestures:

  • Translation: using only the index finger, the user draws a straight trajectory.

  • Rotation: extending both index and middle finger, the user rotates the hand by \(180^\circ \), facing the palm upwards.

  • Extrusion: extending thumb, index and middle finger, the user draws a straight or undulating trajectory.

  • Left swipe: using only the index finger, the user moves the hand quickly from right to left.

  • Right swipe: using only the index finger, the user moves the hand quickly from left to right.

  • Close: using only the index finger, the user quickly moves the hand down.

  • Scale enlargement: starting with the thumb, index and middle finger tips close together, the user moves them apart.

  • Scale reduction: starting with the thumb, index and middle finger tips far apart, the user moves them close together.

Each of the 30 volunteers used his/her dominant hand to perform the gestures, so the dataset also contains left-handed samples; a short training on the gestures was provided by letting the volunteers watch a short video (available at https://youtu.be/ZWPTjusyaoo). They then performed the gestures at a speed of their choosing (maintaining the distinction between standard-speed gestures and quick gestures). Each person performed each gesture twice, so 16 gestures were obtained per person, for a total of 480 gesture samples. The dataset is available at http://biolab.csr.unibo.it/CADGestures.html.

4.2 Result and Discussion

The main indicator used for performance evaluation is accuracy, computed simply as the number of correct predictions C made by the network over the total number of examined instances N: \(accuracy = \frac{C}{N}\). Furthermore, to extract more precise, class-specific information about the recognition accuracy, we also analyzed the confusion matrix, where rows refer to the real gesture class and columns to the predicted one. All tests have been performed on a PC running Linux, with a GeForce GTX1070 GPU with 8 GB of dedicated memory and 16 GB of RAM. We implemented the LSTM and GRU networks using TensorFlow, while Scikit-learn was used to test the SVM.

The dataset is partitioned into a training set and a test set in an 80–20 proportion, so we have 384 gestures for network training and 96 for testing. This basic training set is referred to as \(TS_{Base}\). Moreover, to evaluate the effectiveness of data augmentation, we derived two additional training sets, \(TS_{A1}\) and \(TS_{A2}\), obtained by generating respectively 1 or 2 new gestures for each original gesture in \(TS_{Base}\); the resulting cardinalities are \(|TS_{A1}|\) = 768 and \(|TS_{A2}|\) = 1152.

We tested two versions of the proposed network, built with LSTM and GRU cells respectively; moreover, as a term of comparison, we also evaluated the proposed hand model coupled with an SVM classifier. Since SVMs are not able to process data sequences, we concatenated all the feature vectors of a sequence into a single vector (1080 features overall).
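As an illustration, the hedged sketch below flattens the (60, 18) gesture sequences into 1080-dimensional vectors and trains a scikit-learn SVM; the SVM hyperparameters are illustrative and not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

def flatten_sequences(x):
    """Concatenate the 60 per-frame feature vectors of each gesture (60 * 18 = 1080)."""
    return x.reshape(x.shape[0], -1)

def evaluate_svm(train_x, train_y, test_x, test_y):
    """Train the SVM baseline on flattened sequences; return accuracy and confusion matrix."""
    clf = SVC(kernel="rbf", C=1.0)          # illustrative hyperparameters
    clf.fit(flatten_sequences(train_x), train_y)
    pred = clf.predict(flatten_sequences(test_x))
    return accuracy_score(test_y, pred), confusion_matrix(test_y, pred)
```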

The results obtained with the base and augmented training sets are summarized in Table 2 and Fig. 4. Both LSTM and GRU reach 100% accuracy on the training set, but the former generalizes better to the test set, thus producing overall better results. SVMs are not designed to exploit the sequential nature of the input, which is significant in this particular problem; this may be the reason for their lower accuracy.

In general, even if a good test accuracy is already reached with \(TS_{Base}\), the results clearly show that data augmentation is important and significantly improves performance for all the tested classifiers (+6% accuracy for LSTM). We can therefore deduce that the proposed data augmentation produces new instances that maintain the nature and spontaneity of the performed gestures.

Table 2. Results obtained using different algorithms and training sets.

An analysis of the confusion matrices in Fig. 4 shows that the most difficult gesture to recognize is Extrusion, probably due to the similarity of its pose to that of the Rotation gesture (the only difference is the extension of the thumb), even though Extrusion requires a well-defined trajectory in space whilst Rotation is almost static. This is understandable considering that, in the proposed model, only one feature type (the palm translation) is related to the trajectory, so the pose information has a much higher influence on the final decision.

Even though a direct comparison with [1] is not possible, since different gesture datasets are used, we can observe that our compact representation, coupled with proper data augmentation techniques, reaches an overall accuracy of 93.7%, comparable to that of more complex systems such as the one proposed in [1], where the reported accuracy is 96.4% (Fig. 4).

Fig. 4.

Confusion matrices of the LSTM network (left) and the GRU network (right).

5 Conclusions

In this paper a new approach to gesture recognition has been proposed, based on LSTM recurrent networks and the Leap Motion Controller. The results obtained are overall quite satisfactory; the fine representation of the user's hand allows precise gestures to be discriminated with good accuracy. Moreover, the data augmentation technique proposed to enlarge the training set allowed a further performance improvement. An analysis of the main causes of error suggests some possible future work; in particular, the extracted features are mainly related to the hand pose, while the hand trajectory contributes only to a small extent to the whole representation. Improving this aspect would allow gestures characterized by a similar hand posture but different trajectories in space to be better discriminated.