From Kinect skeleton data to hand gesture recognition with radar

In an era where man-machine interaction increasingly uses remote sensing, gesture recognition through the micro-Doppler (mD) effect is an emerging application that has attracted great interest. It is a sensible solution, and here the authors show its potential for detecting aperiodic human movements. In this study, the authors classify ten hand gestures with a set of handcrafted features using simulated mD signatures generated from Kinect skeleton data. Data augmentation in the form of the synthetic minority oversampling technique has been applied to create synthetic samples, which were classified with the support vector machine and K-nearest neighbour classifiers, achieving classification rates of 71.1 and 51%. Finally, using weights generated by an action-pair-based one vs. one classification layer improves classification accuracy by 24.7 and 28.4%.


Introduction
With the rapid development of IoT and smart homes (e.g. smart lights or air conditioners remotely controlled by humans), the demand for systems able to recognise human motions and react accordingly has arisen. Waving a hand and drawing a circle are different physical movements, and these two actions can be treated as two instructions to control different appliances. Gesture classification helps computers recognise human motions to provide another form of man-machine interface.
Much research focuses on human activity recognition (HAR) from optical sensors such as visible light cameras [1,2]. Features extracted from silhouettes and spatial-temporal domains are widely used in HAR [3]. Another video representation is the action graph, which uses Gaussian mixture models with salient postures learnt via unsupervised learning [4]. There is a wealth of work in camera/video HAR [5]. Even though radar presents more technical challenges than video for HAR, it is a preferred sensing modality because it does not record plain images of people. It protects the user's sense of privacy, and since there is currently no legal/ethical framework regulating assisted living technologies, image rights might become a legal issue in the future [6]. Compared with optical sensors, radar can be operated in dark environments, which means that artificial light is unnecessary for HAR [7-9]. The motion of limbs leads to frequency modulations in spectrograms around the main Doppler component [10]. Classification in radar has thus been using micro-Doppler (mD) signatures from which features are extracted to recognise activities [5,11-13].
This paper reports on the work conducted in a final year project at the Glasgow College, UESTC, in an undergraduate EEE degree programme in 2017-2018, looking at the issue of classifying aperiodic motions such as hand gestures.

Kinect sensors
The first Kinect was released in 2010 for interactive games [20]. This project used the Kinect V2, which comprises an RGB camera and an IR camera capable of extracting 3D skeleton data. Table 1 lists the main features of Kinect.
The advantages of Kinect lie in its low price and depth sensing capability (0.4-8 m). Moreover, a skeleton tracking toolbox is available on MATLAB [21] and other development platforms. However, the IR camera is sensitive to external infrared sources and cannot detect highly reflective objects [15].
The Kinect sensor is of sufficient quality to simulate human mD signatures for classification purposes [15]. However, our preliminary tests have shown an instability in skeleton extraction: glitches in the skeleton (Fig. 1) would create unrealistic Doppler modulations, so the skeleton would need to be corrected before being used for mD simulation.
Given the time constraints of the project and the time it would take to design a more robust skeleton extraction routine, it was decided to work from Kinect databases that had already been 'cleaned up'.

MSR Action3D dataset
The Kinect data required for the simulation comes from the MSR Action3D Dataset [22], captured at 15 fps. It includes depth map sequences with 640 × 240 resolution and skeleton data in screen and 3D Cartesian coordinates. The database contains 20 action types, each performed 2-3 times by ten subjects. Due to high-frequency vibrations and incorrect skeleton tracking in some recordings, we picked the most reliable records of nine actions (Table 2), each having 20 samples. Only the skeleton data is used to construct the human model.

Human model reconstruction
To construct the skeleton, 17 of the detected joints are included to be consistent with the Boulic kinematic model [23] and Chen's mD modelling [10] (Fig. 2, left). The radar return from human motion was emulated with the radar cross section (RCS) of ellipsoids snapped onto the skeleton (Fig. 2, right).

Radar returns
From Fig. 3, the radar return signal is a time-delayed version of the transmitted signal. Thus, for a sequence of narrow rectangular pulses with transmitted frequency f_c, pulse width Δ, and pulse repetition interval ΔT, the radar return from a body part may be represented by (1), which is given by [10], where σ_P(t) is the RCS of the body part at time t, n_p is the total number of pulses received, R_P(t) is the distance from the radar to the body part at time t, and the rectangular function rect is defined by (2). The body is composed of spheres and ellipsoids for the RCS estimation. As spheres are particular ellipsoids, the RCS is calculated for both using (3) [24]. The spectrogram is obtained using the short-time Fourier transform of the return signals following the method in [10].
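As an illustration of the quantities named above, the following minimal sketch computes the RCS of an ellipsoid and the baseband echo of a single body part. It is not the authors' code: the closed-form ellipsoid RCS is the standard geometric-optics expression, which we assume corresponds to (3), and the function names, the angle convention (theta measured from the c-axis, phi in azimuth) and the baseband phase term exp(-j4πf_cR/c) are assumptions made for this example.

```python
import numpy as np

def ellipsoid_rcs(a, b, c, theta, phi):
    """Backscattered RCS (m^2) of an ellipsoid with semi-axes a, b, c (m).

    theta: aspect angle from the c-axis, phi: azimuth angle (radians).
    Assumed geometric-optics approximation; a sphere (a = b = c = r)
    gives pi * r**2 for every aspect angle.
    """
    num = np.pi * a**2 * b**2 * c**2
    den = (a**2 * np.sin(theta)**2 * np.cos(phi)**2
           + b**2 * np.sin(theta)**2 * np.sin(phi)**2
           + c**2 * np.cos(theta)**2) ** 2
    return num / den

def baseband_return(rcs, rng, fc=15e9, c0=3e8):
    """Baseband echo of one body part at range rng (m).

    Amplitude is sqrt(RCS); the phase comes from the two-way delay 2*rng/c0.
    fc defaults to the 15 GHz carrier used in the simulation.
    """
    return np.sqrt(rcs) * np.exp(-1j * 4 * np.pi * fc * rng / c0)
```

Summing such echoes over all body parts at each pulse gives the slow-time signal whose short-time Fourier transform yields the spectrogram.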

Simulated human mD signatures and feature extraction
Because the 15 fps Kinect dataset is too sparse to derive an accurate spectrogram [15], the skeleton data provided by [22] is interpolated to achieve a sampling rate of 2048 Hz. A pulse-Doppler radar is simulated with a carrier frequency of 15 GHz. The radar is placed at the right-hand side (10 m, 0, 0) of the subject. In other words, it is at 90° from the direct line of sight of the Kinect camera.
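The upsampling step can be sketched as follows. The interpolation method is not specified in the text, so linear interpolation is assumed here, and the function name and array layout (frames × joints × 3) are hypothetical.

```python
import numpy as np

def upsample_joints(joints, fps_in=15.0, fs_out=2048.0):
    """Interpolate joint trajectories from fps_in to fs_out (Hz).

    joints: array of shape (n_frames, n_joints, 3) holding 3D positions.
    Returns (t_out, joints_out). Linear interpolation is an assumption.
    """
    joints = np.asarray(joints, dtype=float)
    n_frames = joints.shape[0]
    t_in = np.arange(n_frames) / fps_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    flat = joints.reshape(n_frames, -1)          # one column per coordinate
    cols = [np.interp(t_out, t_in, flat[:, k]) for k in range(flat.shape[1])]
    out = np.stack(cols, axis=1)
    return t_out, out.reshape(len(t_out), *joints.shape[1:])
```

The range of each joint to the radar at (10, 0, 0) then follows from the Euclidean norm of the interpolated position minus the radar position.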
Before extracting features from the spectrogram, a single delay line canceller is applied as a moving target indicator (MTI) filter [24] to remove the torso contributions around the zero-Doppler line, as they provide little information and mask more salient mD signatures; removing them improves contrast (Fig. 4).
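A single delay line canceller subtracts each pulse from the next, which places a null at zero Doppler. A minimal sketch, assuming the return signal is stored as a complex array with slow time along the first axis (the function name is hypothetical):

```python
import numpy as np

def single_delay_canceller(pulses):
    """Single delay line canceller: y[n] = x[n] - x[n-1] along slow time.

    Stationary (zero-Doppler) returns are constant from pulse to pulse and
    cancel out; returns from moving parts change phase and are retained.
    """
    pulses = np.asarray(pulses)
    return pulses[1:] - pulses[:-1]
```

The output is one sample shorter than the input along the slow-time axis.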
Features (Table 3) were extracted from the spectrograms: entropy and skewness (information on the energy distribution), envelope and energy curve, singular value decomposition (SVD, decomposing the energy content into the time and frequency domains), time durations between peaks, and upper and lower Doppler bandwidth. Other parameters classically seen in periodic motion analysis [9] were also tested. However, after initial testing the features (Table 3) that are not greyed were abandoned as they were not salient for this classification task. The others bear a number indicating how many features are used, e.g. SVD (#12) has 12 features for classification.
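Two of these features can be sketched as below. The exact definitions used in the study are not given in the text, so the Shannon entropy of the normalised spectrogram magnitude and the standardised third moment are assumed here; the function names are hypothetical.

```python
import numpy as np

def spectrogram_entropy(S):
    """Shannon entropy (bits) of the spectrogram treated as a distribution."""
    p = np.abs(S).ravel()
    p = p / p.sum()
    p = p[p > 0]                      # skip empty cells, 0*log(0) := 0
    return float(-(p * np.log2(p)).sum())

def spectrogram_skewness(S):
    """Skewness (standardised third moment) of the spectrogram magnitudes."""
    x = np.abs(S).ravel()
    mu, sd = x.mean(), x.std()
    return float(((x - mu) ** 3).mean() / sd ** 3)
```

A spectrogram with its energy spread evenly over all cells maximises the entropy, whereas energy concentrated in a few cells lowers it.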

Data augmentation
The 20 samples of each action are too sparse for machine learning algorithms to generalise the models accurately. To augment the number of samples, the synthetic minority over-sampling technique (SMOTE) [25] was applied to create more 'synthetic' samples. It generates samples in 'feature space' instead of 'data space'. After dividing the 20 samples into 15 training and 5 test samples per action, SMOTE is applied separately to the two sets with an oversampling rate of 500% and the number of nearest neighbours K = 3. The reason for this separation is explained in the results section. The augmented dataset contains 75 training and 25 test samples, as shown in Fig. 4.
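The augmentation step can be sketched as follows. This is a simplified SMOTE-style generator written for illustration, not the implementation used in the study: the function name is hypothetical, and an oversampling rate of 500% is interpreted as five synthetic samples per original sample, which reproduces the 75 and 25 samples quoted above.

```python
import numpy as np

def smote(X, n_per_sample=5, k=3, rng=None):
    """Create n_per_sample synthetic points per row of X in feature space.

    Each synthetic point lies on the segment joining a sample to one of
    its k nearest neighbours (Euclidean distance), chosen at random.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per sample
    synth = []
    for i, x in enumerate(X):
        for _ in range(n_per_sample):
            neighbour = X[rng.choice(nn[i])]
            gap = rng.random()               # uniform in [0, 1)
            synth.append(x + gap * (neighbour - x))
    return np.array(synth)

# Split first, then augment each set on its own (placeholder random features).
features = np.random.default_rng(0).normal(size=(20, 22))
train, test = features[:15], features[15:]
train_aug = smote(train, rng=1)              # shape (75, 22)
test_aug = smote(test, rng=2)                # shape (25, 22)
```

Because the two sets are augmented independently, no synthetic test sample is interpolated from a training sample.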

Hand gesture classification results
Two classifiers, a support vector machine (SVM) with a radial basis function kernel [26,27] and K-nearest neighbour (KNN) with K = 5 [28], are used throughout.

Classification of nine actions (1 vs. all)
The features (Table 3) are used as inputs to classifiers based on supervised machine learning. After running the proposed oversampling and classification 100 times, the results were averaged and the confusion matrices generated, as shown in Fig. 5. The KNN accuracy is 72.1% and that of SVM is 51%. Both the overall accuracy and the confusion matrices show that the classification accuracy is rather low: actions are frequently misclassified as other actions. One oddity of this result is that KNN performs better than SVM. A closer look at the SMOTE algorithm shows that it relies on nearest neighbours to create 'synthetic' samples, as explained earlier. Therefore, instead of applying SMOTE first and then randomly assigning the samples to the training and test sets, the separation between training and test sets was performed prior to data augmentation. The simulations were repeated 100 times and the results averaged, using the method in Section 2.2 for data augmentation. The KNN accuracy dropped to 27.4% and that of SVM to 31.7%; the confusion matrices are shown in Fig. 6. The accuracy is as low as that of the original data without SMOTE augmentation, which means that the synthetic samples better represent reality. This method is therefore retained to pursue an optimisation of the classification accuracy.

Classification of action pairs
Owing to the relatively low classification accuracy reported above, action pair classification is carried out. The aim is to simplify the feature space, which currently has 22 features to classify 9 actions in a 1 vs. all modality. We now switch to a 1 vs. 1 classification approach to discover the salient features. The test classifies an action pair with only 1 feature involved. For each action pair, the 22 features are tested separately and the feature yielding the highest accuracy is retained, as shown in Table 4. Using feature #13, actions 1 and 2 can be classified with an accuracy of 100%.
A summary of the best classification accuracies of action pairs is shown in Fig. 7. Using the cumulative number of times a feature was picked to best classify an action pair, a table of feature weights was defined for the 22 features (Table 5).
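The selection procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: a small KNN routine stands in for the classifiers, class labels are assumed to be non-negative integers, ties between features go to the lowest index, and how the counts are mapped to the weights of Table 5 is not specified, so raw counts are returned.

```python
import numpy as np
from itertools import combinations

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples (Euclidean)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

def pairwise_feature_counts(X_train, y_train, X_test, y_test, k=5):
    """Count how often each single feature best separates an action pair."""
    n_features = X_train.shape[1]
    counts = np.zeros(n_features, dtype=int)
    for a, b in combinations(np.unique(y_train), 2):
        tr = np.isin(y_train, [a, b])        # samples of the current pair
        te = np.isin(y_test, [a, b])
        acc = np.empty(n_features)
        for f in range(n_features):
            pred = knn_predict(X_train[tr][:, [f]], y_train[tr],
                               X_test[te][:, [f]], k)
            acc[f] = np.mean(pred == y_test[te])
        counts[int(acc.argmax())] += 1       # ties go to the lowest index
    return counts
```

With nine actions there are 36 action pairs, so the counts sum to 36; a feature that is never the best one for any pair receives a count of 0.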

Classification of 1 vs. all with weighted features
To optimise classification performance, a weighted feature classification approach is applied: the weights w (Table 5) are applied to the features. After 100 iterations, the KNN accuracy is 52.1% and that of SVM is 60%. The confusion matrix is shown in Fig. 8.
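One way to apply such weights is to scale each feature column before computing distances, so that a weight of 0 removes a feature altogether. The sketch below assumes this multiplicative scheme, which the text does not spell out, and uses a hypothetical KNN routine with integer class labels.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, X_test, w, k=5):
    """KNN (majority vote) on features scaled column-wise by the weights w."""
    w = np.asarray(w, dtype=float)
    A = np.asarray(X_train, dtype=float) * w   # a zero weight removes a feature
    B = np.asarray(X_test, dtype=float) * w
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = np.asarray(y_train)[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])
```

Scaling the features in this way is equivalent to using a weighted Euclidean distance, so the same scaled features can also be passed to an SVM with a distance-based kernel.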
The KNN and SVM classifiers show similar accuracy results. The confusion between action 6 (draw x) and action 8 (draw circle) can be explained by the similarity of the movements when projected onto the radar radial direction. Actions 1, 3 and 4 are easily misclassified as almost any other action.
One noticeable result is that actions 2, 5, 7 and 9 have a high accuracy and are not misclassified with one another. On that basis, we reran the classification for these 4 actions only and got an accuracy of 94.2% with KNN and 97.7% with SVM.

Data augmentation with SMOTE
In the classification results, the SMOTE algorithm significantly increased the classification accuracy of the KNN classifier (raised from 31.1 to 72.1%), with a smaller increase in the accuracy of the SVM classifier (raised from 26.7 to 51.0%). Applying SMOTE before dividing the data into training and test samples therefore had a positive impact on the result. However, it means that the synthetic samples create a biased dataset and do not represent the original samples accurately. SMOTE generates new samples from nearest neighbours, and the KNN classifier also uses neighbours to classify samples. SMOTE is thus biased towards the neighbour-based classifier, which explains the higher accuracy, and the result cannot be trusted.
To avoid this bias, SMOTE was applied after dividing the training and test samples (Section 2.2). When SMOTE generates the two groups of data independently, the neighbours of the training data are not necessarily the neighbours of the test data, which provides a more realistic 'synthetic' dataset.
The classification results also show that SMOTE applied after dividing the samples is a reliable data augmentation technique: the result (KNN 27.4% and SVM 31.7%) does not vary much from that of the original dataset before augmentation.

One vs. one compared to 1 vs. all
The difference between the classification results for action pairs and for all nine actions is noticeable. One versus one classification with a single feature achieves a very high accuracy (>80%) for most pairs, whereas the accuracy of classifying all actions is <40%.
Such a gap between the two classifications is mainly caused by the complexity of the model (too many features were not salient) and the high similarity between actions from the radar perspective. With the proposed weighted feature approach, a simpler model was reached and the accuracy improved by ∼30% for both KNN and SVM classifiers. By setting the weights of the irrelevant features to 0 and giving the most salient features larger weights (Table 5), the overall accuracy increased from 27.4 to 52.1% for the KNN classifier and from 31.7 to 60.0% for the SVM classifier. Therefore, in feature-based classification, the old adage 'less is more' applies.

Similarity of actions
The similarity of actions is also an important factor that will lead to the degradation in classification accuracy. Similar actions are highly likely to be misclassified with one another which makes the classification more challenging as observed in Figs. 9 and 10 for confusing actions and distinguishable ones, respectively.
Radar can only sense ranging information from a target in the radial direction, so much of the spatial information in other directions is lost when using a radar to detect movements. Actions which are not alike in three dimensions may have a greater similarity when projected onto the radial direction. For example, the confusion between action 6 (draw x) and action 8 (draw circle) can be explained by the similarity of the physical movements: the projections of these two actions onto the direction of the radar are almost the same.
This problem may not be solvable with a single radar. However, placing radar sensors at different angles, as shown in [29], may help to solve it, as may studying the aspect angle dependence of the hand gestures, as in [30]. Information lost in one direction can be captured in another direction.

Table 3 Features extracted from the spectrograms (feature: description)
entropy (#1): the randomness of the energy distribution of the spectrogram
skewness (#1): asymmetry of the energy distribution of the spectrogram
centroid (mean and variance): centre of mass of the spectrogram
bandwidth (mean and variance): an estimation of the energy distribution range around the centroid
SVD (mean and variance of right and left vectors) (#12): decomposes the energy content into the time and frequency domains
upper/lower envelope (min, max, mean and standard deviation) (#2): statistical features of the upper and lower envelopes of the spectrogram
torso (bandwidth and average frequency): a measure of features related to torso movement
energy curve (mean, standard deviation and integrated) (#3): energy content of the spectrogram
time duration between peaks of the upper envelope (minimum, mean and maximum) (#3): the repetition of peaks of the spectrogram in the time domain

Comparison to the state of the art
In investigating the use of mD radar to classify human actions, most researchers have focused on classifying locomotion (e.g. walking, running, leaping, crawling and creeping) [15,19]. These motions contain a lot of periodic information and torso oscillation, and their classification uses features related to the bandwidth of the torso oscillation and the average torso velocity. This project classifies actions (e.g. high arm wave, hand clap, draw stick) that are aperiodic and carry almost no information in these two features.
The difficulty of classifying these actions using mD radar is that the discriminating features differ from one action pair to another, rather than a few features being sufficient to classify all actions, which makes the one versus all classification more challenging.

Conclusion
A Kinect-based mD radar simulator was developed and tested in this paper. Classification experiments were carried out to recognise human hand gestures based on simulated spectrograms.
For the data preparation, the limited number of records for each action in the MSR dataset led us to the SMOTE data augmentation technique in order to obtain a sufficient number of samples for machine learning. We also highlighted a problem with such techniques that should be considered when using them.
Our proposed weighted feature technique allowed an increase of ∼30% in classification accuracy. The weighted feature technique removes irrelevant features and increases the weight of salient features. Furthermore, after removing the most confusing actions, the accuracy of 4 actions increased to 97.7% with SVM showing the relevance of this technique.
Future work will look at carrying out live experiments for validation with ultra-wideband software defined radar [7,9,13,31,32].