Deep-Learning Methods for Hand-Gesture Recognition Using Ultra-Wideband Radar

Using deep-learning techniques to analyze radar signatures has opened new possibilities in the field of smart sensing, especially in hand-gesture recognition applications. In this paper, we present a deep-learning framework to classify hand-gesture signatures generated from an ultra-wideband (UWB) impulse radar. We extract the signals of 14 different hand-gestures and represent each signature as a 3-dimensional tensor consisting of a range-Doppler frame sequence. These signatures are passed to a convolutional neural network (CNN) to extract the unique features of each gesture, which are then fed to a classifier. We compare 4 different classification architectures for predicting the gesture class, namely: (i) a fully connected neural network (FCNN), (ii) k-nearest neighbours (k-NN), (iii) a support vector machine (SVM), and (iv) a long short-term memory (LSTM) network. The shape of the range-Doppler-frame tensor and the parameters of the classifiers are optimized to maximize the classification accuracy. The classification results of the proposed architectures show a high level of accuracy, above 96%, and a very low confusion probability even between similar gestures.


I. INTRODUCTION
Hand-gesture recognition is gaining significant research interest due to the wide range of envisioned applications. The use of such technology ranges from convenient device control [1] and infection prevention in clinical environments [2], to safer and quicker accessibility of features in automotive applications [3]. The common hand-gesture signal acquisition approaches today are cameras [4], infra-red sensors [5], and ultrasonic sensors [6]. On the other hand, radar sensors are newly emerging due to their superior recognition performance even in adverse lighting conditions and complex backgrounds. In addition, low-cost commercial miniature radars are becoming widely available, and are capable of capturing the signature of finer hand movements, which can yield high classification accuracy at low processing cost [7].
(The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.)

When using radar sensors for capturing hand-gestures, the type and richness of the gesture signatures depend on the architecture of the radar and on the employed waveform. The common types of waveforms used in miniature radar sensors are: (i) continuous waveform (CW), (ii) pulses, and (iii) frequency modulated continuous waveform (FMCW). Popular CW-based radars are capable of detecting micro-Doppler signatures in addition to the main Doppler components; micro-Doppler signatures are the frequency components that occur due to the motion or vibration of the non-rigid parts (fingers, knuckles, wrist) along with the main translational motion of the target (hand) [8]. However, despite their excellent ability to capture Doppler signatures, Doppler radars fail to extract the range information of the targets [9] due to the inherently narrowband nature of the waveform. Simultaneous estimation of range and Doppler is achieved using more advanced waveforms like pulsed wave and FMCW. Therefore, in addition to the Doppler variations, it is possible to distinguish each gesture by the spatial variations of the hand movements. Capturing hand movements/variations along the radial distance is shown to increase the classification accuracy of hand-gesture recognition [10].
In this study, we utilize a low-power UWB impulse radar which transmits sharp temporal pulses. The advantages of using UWB impulse radar for capturing the range-Doppler signatures compared to their counterparts are mainly; (i) low power consumption, (ii) fine range resolution, and (iii) the ability to detect very close targets [9]. These features make UWB impulse radar an excellent candidate for collecting hand-gestures. In addition, UWB radar has the ability to work reliably in interference-rich environments. The transmitted waveform is extremely short, typically in nanoseconds, hence the signal energy is spread across a very large RF bandwidth providing immunity to the interference [11]. Due to the same reason UWB radars will cause less interference to other devices.
With the collected hand-gesture signatures from the UWB impulse radar, we explore the integration of several machine-learning techniques to enhance the hand-gesture classification performance. We present an end-to-end framework for pre-processing the received gesture signals into a sequence of range-Doppler frames forming a 3-dimensional tensor. We utilize 2 different approaches for hand-gesture recognition. The first approach employs a 3D CNN for feature extraction coupled with three different classifiers as the final layer, namely: (i) FCNN, (ii) k-NN, and (iii) SVM. We further present a second approach using a 2D CNN along with an LSTM to predict the gesture class. The main contributions of this paper are:
• A framework for mapping the raw signal from a UWB impulse radar to a sequence of range-Doppler frames suitable for 3D deep-learning methods.
• Four different CNN architecture models for classifying hand-gesture signatures from a UWB impulse radar.
• Analytic formulation of key controllable parameters to optimize the classification performance.

II. BACKGROUND AND RELATED WORK
The recent developments in consumer radar have motivated numerous research efforts on integrating radar sensors with machine learning for hand-gesture recognition. The most common machine-learning approaches for radar-based hand-gesture recognition are CNN, SVM, k-NN, and LSTM. In order to classify hand-gestures using classifiers such as SVM and k-NN, one technique is the manual extraction of hand-gesture features [1], [14], [15] from the range-Doppler or time-frequency (spectrogram) maps. Manual feature extraction requires predefined characteristic features of the gesture signatures, and therefore the performance of the classifier varies significantly depending on the defined features. Other approaches utilize statistical procedures such as principal component analysis (PCA) for dimension reduction and feature extraction [1], [13]. PCA transforms the input data into a few orthogonal variables (principal components) that represent unique features with reduced dimension. A common approach in radar hand-gesture recognition is to use a CNN, which does not require predefined features; rather, the network learns the features from the input signals during the training process [18]. The majority of CNN-based hand-gesture recognition methods extract the signature from either (i) the changes in Doppler over time [18], or (ii) a snapshot of the overall range-Doppler fingerprint [19]. Both of these signal types are represented in the form of a 2D matrix (monochromatic image) that is further processed by the CNN. Our previous work [20] utilizes a two-antenna Doppler radar to represent the changes of Doppler over time as a 2D spectrogram, along with angle of arrival (AoA) information. The main drawback of 2D image methods is that they lack the 3rd dimension, which adds further information to the signature.
On the other hand, the representation of range-Doppler maps as a time/frame tensor [22], [23] is shown to increase the richness of the signatures, thus leading to a better description of the hand-gestures. For the range-Doppler-frame tensor, suitable features can be extracted using a 3D CNN, a process that is followed in [23], [26], [27]. Other recent research demonstrates that integrating CNN and LSTM tends to increase the classification performance for signatures that vary in time and space [22], [23]. LSTM networks are recurrent neural networks with feedback connections, which makes them suitable for time-sequence analysis [21]. We describe in Table 1 some of the approaches found in the literature for hand-gesture recognition using radar sensors, including the utilized machine-learning algorithms, the type of radar waveform, and the set of gesture signatures used.
To the best of the authors' knowledge, this paper is the first to provide an end-to-end framework for hand-gesture recognition using the range-Doppler-frame tensor from a UWB impulse radar in integration with multiple deep-learning methods.

III. SYSTEM DESCRIPTION
In this proposed framework, we utilize a UWB impulse radar with a single-input-single-output configuration, i.e. one antenna for transmitting and one for receiving. The reflected electromagnetic signal is captured by the receiving antenna, then sampled in the RF domain and digitally down-converted to baseband. The output of the radar module (the baseband signal) is passed to the computer as two vectors: (i) the in-phase component (I) and (ii) the quadrature component (Q). The received signal of each gesture is processed to form a 3-dimensional tensor of range-Doppler-frame, as shown in Fig. 1. This tensor represents the pattern generated by a hand-gesture as a sequence of frames, each consisting of a temporal snapshot of a range-Doppler image. Fig. 2 shows a functional block diagram of the classification framework. In the proposed framework we present 4 different classifiers and compare their performance: (i) a 3D CNN for feature extraction with an FCNN for classification, (ii) a 3D CNN feature extractor and k-NN classifier, (iii) a 3D CNN feature extractor and SVM classifier, and (iv) a 2D CNN feature extractor and LSTM classifier. The range-Doppler-frame tensor is passed to a CNN to extract the features representing each gesture, which are then passed to the classifiers illustrated in Fig. 3 to predict the classes of hand-gestures. The following subsections provide a detailed description of each component in the proposed framework, from hand-gesture collection to gesture classification.

IV. HAND-GESTURE COLLECTION AND PRE-PROCESSING
In order to collect the hand-gesture signatures, we utilize a Xethru X4M03 UWB impulse radar module from Novelda [28], which is shown in Fig. 4. The parameters of the radar are tuned to fit the requirements of the hand-gesture recognition application and are described in Table 3. The selected gestures are typical movements of hands. We utilize a total of 14 hand-gestures for the study, which are depicted in Fig. 5. The gestures are performed using the right hand. The description of the selected 14 gestures is given in Table 2. In order to introduce variations in the collected gestures, we perform the gesture collection in arbitrary radar orientations with respect to the surrounding room environment, with randomized speeds and distances. These variations increase the richness of the data set, allowing for better classification performance.

A. UWB IMPULSE RADAR
A UWB impulse radar [29] transmits a sequence of short pulses (in our case Gaussian-shaped) having a duration/width T_p in the order of nanoseconds (ns). The main difference between a UWB impulse radar and a standard pulse radar is in the utilized pulse width. A UWB impulse radar transmits short pulses of pulse width comparable with the period of the carrier waveform, whereas the typical pulse radar utilizes pulses with a pulse width spanning many periods of the carrier waveform. The short pulses of the UWB impulse radar provide a wide bandwidth, which in turn yields a high range resolution, given as

ΔR = c / (2B),

where B is the bandwidth and c is the propagation speed of light. In order to get better insights into the range-Doppler processing, we present some details on how the signal is received and processed inside the utilized radar module before being collected for further processing. Fig. 6 illustrates the architecture of the utilized UWB impulse radar module. The module employs a direct-RF synthesizer [2] to generate the Gaussian pulses in analytic signal form, where the carrier frequency f_c is modulated with a baseband Gaussian pulse envelope having an amplitude A. The reflected impulse from the hand arrives at the receiver with a time delay

τ = 2R/c + 2vt/c,

where R and v are the range and radial velocity of the target, respectively. The signal is reconstructed at the receiving side of the radar module using the swept-threshold sampling method [30]. The received signal r(t) is sampled with a high sampling rate f_s (in our application, for practical reasons, the high-rate sampling is implemented by employing 12 parallel samplers of sampling frequency f_s/12, each sampling with a slight delay, equivalently giving a sampling rate f_s [28]). The received pulse samples are thresholded to V using a comparator (the output of the comparator is 1 for signal above the threshold and 0 otherwise).
After a certain number of pulses, the threshold voltage V is stepped using a digital-to-analog converter (DAC). As a result, V is swept between [V_min, V_max]. The comparator output is also sampled at a sampling rate f_s and is distributed across N counters.
The sampled bits are summed across each counter to incrementally build a multi-bit block, which gives a cumulative distribution function of the received signal; n denotes the discrete time index of this digitally reconstructed RF signal block. A simulation of the generated Gaussian pulse is given in Fig. 7. The number of counters N is the same as the number of range bins, which gives the maximum range

R_max = N c t_s / 2,

where t_s = 1/f_s is the sampling period. Therefore, a block represents the strength of reflection located in each range bin, and several such blocks are collected as the output of the radar module. We can read out this RF data directly or enable on-chip digital down conversion to read the baseband analytic signal; we utilize the on-chip down conversion to obtain the baseband signal. After down conversion we get a complex block consisting of the in-phase (I) component, x_I[n] = Re{x[n]}, and the quadrature (Q) component, x_Q[n] = Im{x[n]}, of the baseband signal. Fig. 8 illustrates a down-converted block of the received signal. The timing control unit of the module digitally controls the entire process. A summary of the parameters employed in the radar sensor is given in Table 3.
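As a quick numeric sanity check of the range-resolution relation ΔR = c/(2B), the following sketch evaluates it for an assumed bandwidth; the bandwidth value here is illustrative, not the exact setting of the utilized module.

```python
# Minimal numeric sketch of the range-resolution relation Delta_R = c / (2B).
# The bandwidth value is an illustrative assumption, not the exact radar setting.
c = 3.0e8        # propagation speed of light (m/s)
B = 1.5e9        # assumed bandwidth of the UWB pulse (Hz)

delta_R = c / (2 * B)   # wider bandwidth -> finer range resolution
print(f"range resolution: {delta_R * 100:.0f} cm")
```

Doubling the bandwidth halves ΔR, which is why the nanosecond-scale UWB pulses translate directly into fine range bins.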

B. RANGE-DOPPLER FRAMES
As explained in Section IV-A, at a given time instance the radar interprets the scene as a block of range bins. The block rate of a UWB impulse radar is given by f_block = f_PRF / K, where f_PRF is the pulse repetition frequency and K is the number of pulses per block. We arrange M blocks into a single frame, such that the changes across the stacked blocks are used for extracting Doppler information using the fast Fourier transform (FFT). The resulting Doppler resolution is given by

Δf_D = f_block / M,

and, since the Doppler frequency of a target moving at radial velocity v is f_D = 2 v f_c / c, the corresponding velocity resolution is

Δv = c Δf_D / (2 f_c) = c f_block / (2 f_c M).

The block rate is related to the frame rate as f_frame = f_block / M. Accordingly, there is a trade-off between the frame rate and the velocity resolution:

Δv = c f_frame / (2 f_c).

Furthermore, in order to capture the changes in the range-Doppler signature over time, we stack each L frames into a 3D matrix (tensor) as indicated in Fig. 9. Therefore, we optimize f_frame and Δv to obtain the maximum classification accuracy. The range-Doppler-frame tensor has a dimension N × M × L. However, the dimension of the tensor is slightly different in practice, as we cut some unwanted values in all three dimensions, which we explain in the following paragraph.
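The block-stacking and FFT step above can be sketched in numpy as follows; the shapes and the random data are illustrative stand-ins for real baseband blocks.

```python
import numpy as np

# Sketch (assumed shapes): build one range-Doppler frame from M consecutive
# complex baseband blocks, each with N range bins.
N, M = 64, 32                       # illustrative numbers of range bins / blocks
rng = np.random.default_rng(0)
blocks = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))

# Stack blocks as rows: slow time along axis 0, range along axis 1.
# An FFT along slow time (per range bin) converts block-to-block phase
# changes into Doppler frequency.
rd_frame = np.abs(np.fft.fftshift(np.fft.fft(blocks, axis=0), axes=0))

print(rd_frame.shape)               # one range-Doppler frame, shape (M, N)
```

Stacking L such frames along a third axis then yields the N × M × L range-Doppler-frame tensor described in the text.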
We define the length of the gesture in time as the total sample time, which is kept constant for all 14 gestures. The total sample time for each gesture is set to 3 seconds. In order to collect only the useful signal (the reflected signal once the hand is present), we apply a suitable threshold to crop the received signal. Since the hand-gestures are performed in the range 0.5 to 1 m from the radar module, we take only 15 range bins covering 0.4 to 1.3 m. Thereafter, we compute the column-wise FFT of each frame to get the range-Doppler-frame tensor. In order to capture the gestures we use a block rate of f_block = 440 Hz, which gives a maximum Doppler frequency of f_block/2 = ±220 Hz. This range is much larger than the observed Doppler frequencies of the performed hand-gestures, which are within f_max = ±120 Hz. The frequency range is therefore restricted to retain only the significant signal components within ±120 Hz. The reduction of points in all three dimensions significantly reduces the computational load on the classification networks.
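The Doppler-axis cropping can be sketched as below; the number of Doppler bins per frame is an assumed value, while the 440 Hz block rate, the ±120 Hz band, and the 15 retained range bins follow the text.

```python
import numpy as np

# Sketch: crop the Doppler axis of a range-Doppler frame to +/-120 Hz.
# M is an illustrative assumption; f_block and the 15 range bins follow the text.
f_block = 440.0                          # block rate (Hz) -> Doppler span +/-220 Hz
M = 88                                   # assumed Doppler bins per frame
doppler_axis = np.fft.fftshift(np.fft.fftfreq(M, d=1.0 / f_block))

keep = np.abs(doppler_axis) <= 120.0     # retain only the significant +/-120 Hz band
frame = np.zeros((M, 15))                # 15 range bins kept (0.4 m to 1.3 m)
cropped = frame[keep, :]

print(cropped.shape)                     # fewer Doppler bins, same 15 range bins
```

Applying the same mask to every frame shrinks the whole tensor, which is the computational saving the paragraph above refers to.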

V. CLASSIFICATION ARCHITECTURES

A. 3D CNN ARCHITECTURE
Recent developments in 3D CNN have proven effective for the classification of volumetric data such as video [31], computed tomography images [32], magnetic resonance imaging [33], and ultrasound imaging [34]. A 3D CNN follows the same principles as its 2D counterpart: it is composed of a series of basic structures repeated multiple times. The basic structure primarily comprises: (i) a convolutional layer intended for feature extraction, (ii) an activation function for a non-linear transformation of the inputs, and (iii) a pooling layer to reduce the dimension and noise of the input [35]. The difference between 3D and 2D CNN is that in a 3D CNN, the mathematical operations are done using 3D matrices (tensors) [36], which naturally requires higher processing and memory capacity. In our application, the 3D CNN extracts the temporal variation along with the range-Doppler features from the 3D data tensor explained in the previous section.
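The basic structure (i)-(iii) can be illustrated with a minimal, from-scratch numpy sketch of one convolution-activation-pooling pass; all sizes and the averaging kernel are assumptions for illustration, not the actual trained network.

```python
import numpy as np

# Minimal numpy sketch of the 3D CNN basic structure on a single channel:
# (i) 3D convolution, (ii) ReLU activation, (iii) 2x2x2 max pooling.
# All sizes and the kernel are illustrative assumptions.

def conv3d_valid(x, k):
    """Naive valid-mode 3D convolution (cross-correlation) of x with kernel k."""
    D, H, W = x.shape
    d, h, w = k.shape
    out = np.empty((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(x[i:i + d, j:j + h, l:l + w] * k)
    return out

def maxpool3d(x, p=2):
    """Non-overlapping p x p x p max pooling (trims dimensions not divisible by p)."""
    D, H, W = x.shape
    return x[:D - D % p, :H - H % p, :W - W % p] \
        .reshape(D // p, p, H // p, p, W // p, p).max(axis=(1, 3, 5))

x = np.random.default_rng(0).standard_normal((8, 16, 16))  # frames x Doppler x range
k = np.ones((3, 3, 3)) / 27.0                              # assumed averaging kernel
features = maxpool3d(np.maximum(conv3d_valid(x, k), 0.0))  # ReLU, then pooling

print(features.shape)
```

In practice these operations are provided by deep-learning libraries (e.g. Keras `Conv3D`/`MaxPooling3D`); the sketch only shows why each stage shrinks the tensor while keeping the dominant responses.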
In a typical CNN architecture, once the features are extracted, the classification is performed by a fully connected layer (conventionally an FCNN [35]). In this work, along with the FCNN, we explore the performance of two more classifiers, k-NN and SVM, when integrated with the 3D CNN. In the following, we describe how these classifiers are integrated.

1) 3D CNN -FCNN (Network I)
FCNN is a feed-forward neural network with a fully connected multi-layer perceptron (MLP), having all the outputs of one layer connected to the inputs of the next layer. The summary of the employed architecture is shown in Table 4; we also refer to this architecture as Network I.

2) 3D CNN -K-NN (Network II)
k-NN classifies an input by identifying the class based on the proximity of the output pattern (from the 3D CNN) to the pre-trained classes. In particular, it takes the majority vote among its k nearest neighbours and assigns the outcome class based on the dominant number of neighbours. We optimize the value of k to obtain the maximum classification accuracy, which is achieved for k = 1, meaning the input is simply assigned to the class of the nearest neighbour. The summary of the employed 3D CNN - k-NN architecture is given in Table 5; we also refer to this architecture as Network II.
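The k = 1 decision rule can be sketched as follows; the feature vectors here are synthetic stand-ins, whereas in the actual pipeline they come from the 3D CNN.

```python
import numpy as np

# Sketch of the k-NN stage with k = 1 on CNN feature vectors:
# assign the class of the single nearest stored training feature.
# The data below is synthetic; real features come from the 3D CNN.

def knn1_predict(train_feats, train_labels, query):
    d = np.linalg.norm(train_feats - query, axis=1)  # Euclidean distances
    return train_labels[np.argmin(d)]                # nearest neighbour's class

rng = np.random.default_rng(1)
train_feats = np.vstack([rng.normal(c, 0.1, size=(20, 8)) for c in (0.0, 1.0)])
train_labels = np.array([0] * 20 + [1] * 20)

query = np.full(8, 0.95)                             # close to the class-1 cluster
print(knn1_predict(train_feats, train_labels, query))  # -> 1
```

For k > 1 the rule generalizes to a majority vote over the k smallest distances.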

3) 3D CNN -SVM (Network III)
SVM classifiers are supervised learning models that construct a set of hyper-planes in a high-dimensional space to separate the classes [37]. They utilize support vectors, which are the data points that determine the separating hyper-planes. The summary of the employed 3D CNN - SVM architecture is given in Table 6; we also refer to this architecture as Network III.
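A minimal sketch of this final SVM stage is shown below, assuming scikit-learn is available; the feature vectors and kernel choice are illustrative stand-ins, not the tuned Table 6 configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the SVM stage on CNN feature vectors; the clustered synthetic
# features and the RBF kernel are illustrative assumptions.
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(c, 0.2, size=(30, 8)) for c in (0.0, 1.0, 2.0)])
labels = np.repeat([0, 1, 2], 30)

clf = SVC(kernel="rbf", C=1.0)   # separating hyper-planes in the kernel space
clf.fit(feats, labels)

query = np.full((1, 8), 1.9)     # near the class-2 cluster
print(clf.predict(query))
```

The fitted model keeps only the support vectors (`clf.support_vectors_`), which is what makes the decision boundary compact at prediction time.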

B. 2D CNN -LSTM (Network IV)
LSTM networks come under the recurrent neural network (RNN) group, which has the ability to analyse time-series inputs [38]. The network contains cell states (memory), which make it possible to store data from previous states/time steps. Cell states carry relevant information throughout the processing of the time sequence, thereby identifying and extracting the temporal relation within the sequence [39]. Along with the cell states, an LSTM consists of 3 different gates, which are essentially neural networks that control the weights and outputs at each state [40]. These three gates are as follows:
• Forget gate: The information from the previous state and the current input is utilized to decide which cell states/memory are to be deleted/forgotten and which ones are to be kept.
• Input gate: It decides which values of the cell state need to be modified/updated.
• Output gate: The output gate utilizes the updated cell state and the current inputs to compute the weights of the current state.
These gates control the weights and outputs of the cell at each time step, establishing temporal relations. Once the temporal features are extracted, a classifier, usually an FCNN, is used for classification of the inputs. Since the hand-gestures are represented as a temporal sequence of range-Doppler frames, at each time step a 2D CNN is utilized to extract features from the corresponding range-Doppler frame. Thus, the LSTM network establishes the temporal relation between the features of the range-Doppler-frame tensor. The end-to-end network is trained using the back-propagation algorithm. A summary of the network architecture is provided in Table 7; we also refer to this architecture as Network IV.
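The per-frame 2D CNN followed by an LSTM can be sketched in Keras as below; the layer counts and sizes are assumptions for illustration, not the exact Table 7 architecture.

```python
# Illustrative Keras sketch of the 2D CNN - LSTM structure: a per-frame 2D CNN
# wrapped in TimeDistributed, followed by an LSTM and a softmax FCNN head.
# Layer sizes here are assumptions, not the exact Table 7 settings.
from tensorflow.keras import layers, models

L, M, N = 12, 49, 15                    # frames x Doppler x range bins (assumed)
model = models.Sequential([
    layers.Input(shape=(L, M, N, 1)),
    layers.TimeDistributed(layers.Conv2D(16, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.Flatten()),  # one feature vector per frame
    layers.LSTM(64),                           # temporal relation across frames
    layers.Dense(14, activation="softmax"),    # 14 gesture classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

print(model.output_shape)                      # one probability per gesture class
```

`TimeDistributed` applies the same 2D CNN to each of the L frames, so the LSTM receives a length-L sequence of per-frame feature vectors.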

VI. EXPERIMENTATION AND VALIDATION
In order to perform experimental verification of the proposed framework, we collect 250 samples of each of the 14 gestures. The samples are divided into two groups: (i) the first contains 80% of the samples and is used for training the networks with 5-fold cross validation, while (ii) the second group, with the remaining 20% of the samples, is used only for testing the networks without taking part in the training process. This grouping allows a better understanding of the performance under realistic use scenarios where recognizing unseen data is required.
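The split described above can be sketched as follows; the sample counts follow the text (250 samples per gesture, 14 gestures), while the shuffling seed is arbitrary.

```python
import numpy as np

# Sketch of the data split: 80% for training with 5-fold cross validation,
# 20% held out for testing. Counts follow the paper (250 samples x 14 gestures).
n_total = 250 * 14
rng = np.random.default_rng(3)
idx = rng.permutation(n_total)           # shuffle sample indices

n_train = int(0.8 * n_total)
train_idx, test_idx = idx[:n_train], idx[n_train:]

folds = np.array_split(train_idx, 5)     # 5-fold CV within the training group
print(len(train_idx), len(test_idx))     # training vs held-out test samples
```

The held-out 20% never enters training or cross validation, which is what makes the reported test accuracy an estimate on unseen data.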
We first optimize the dimensions of the input to obtain the best classification accuracy. Using the selected input dimensions, the parameters of the networks, such as the number of layers, the kernel sizes and the number of features in the convolutional layers, the number of FCNN layers, k in k-NN, and the kernel functions in SVM, are optimized to obtain the architecture with the maximum classification accuracy. The sensitivity of the classification networks to the other parameters is observed to be negligible. We use the Python programming language, in particular the Keras library, to build and train the networks. Fig. 10 shows the comparison of the classification accuracy obtained for each classifier for different values of M ranging from 20 to 100 blocks per frame. The value of M determines the input size of the range-Doppler-frame tensor. The tested range of M is limited because, beyond this range, the size of the input falls below the minimum size requirements of the filter sizes of the CNN network [20]. The experimental results of each classification network are detailed below. The testing classification accuracy of Network IV outperforms Network I by 3%, Network II by 4%, and Network III by 2%, as indicated in Fig. 11. The computation time for training Network I is approximately 30 minutes, Network II around 12 minutes, Network III approximately 27 minutes, and Network IV around 14 minutes, using a typical laptop with an Intel Core i5 processor. However, there is no significant difference in the recognition time (prediction time during testing, which is in milliseconds) of the gesture by the networks. The computational complexity in terms of memory usage and trainable parameters is highest for Network II, second highest for Network III, third for Network I, and lowest for Network IV, which also gives the highest classification performance.
Table 8 shows the average confusion matrix between gestures obtained from Network IV based on the training results. From the confusion matrix it can be observed that the highest confusion appears between gestures 7 and 8, and between gestures 11 and 12. The reason is the limited radial movement of these gestures, which limits their Doppler signatures and in turn affects their classification. Despite that, the overall performance of Network IV is promising in classifying similar gestures.
Given the sample size of n = 250, and selecting the confidence level as p = 90%, the error bounds are calculated [20] based on the obtained success probability estimate β̂ using

ε = z_p √(β̂(1 − β̂)/n),

where z_p denotes the inverse CDF of a standard normal distribution (quantile function) evaluated at the probability 1 − (1 − p)/2. The error bar is also shown in Fig. 11. Considering the presented results of Network IV, with a training accuracy of 98.40% and a testing accuracy of 96.15% over 14 gestures, the proposed architecture outperforms our previous work [20] based on a two-antenna Doppler radar with a CNN classifier (having an accuracy of 95.5%); it also outperforms a recent CNN-LSTM approach in the literature [23] utilizing an FMCW radar (having an accuracy of 88.1% with 6 gestures). These enhancements are primarily attributed to the higher range resolution offered by low-cost UWB radars compared to their counterparts, such as FMCW, at the same cost range. The higher range resolution is due to the large bandwidth of UWB radars, which in turn gives richer gesture signatures. The accuracy improvements also result from the optimization of both the tensor and the network parameters.
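The error-bound computation can be reproduced with the standard library alone; the sketch below evaluates it for the n = 250, p = 90% setting, with the testing accuracy plugged in as the success probability estimate.

```python
from statistics import NormalDist

# Sketch of the binomial error bound: err = z_p * sqrt(beta*(1 - beta)/n),
# with z_p the standard normal quantile at probability 1 - (1 - p)/2.
def error_bound(beta_hat, n, p=0.90):
    z_p = NormalDist().inv_cdf(1 - (1 - p) / 2)   # inverse CDF of N(0, 1)
    return z_p * (beta_hat * (1 - beta_hat) / n) ** 0.5

err = error_bound(0.9615, 250, p=0.90)            # testing accuracy of Network IV
print(f"error bound: +/-{100 * err:.2f} %")
```

For p = 90% the quantile z_p is about 1.645, giving an error bound of roughly ±2 percentage points at this sample size.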
In order to observe the behaviour of Network IV across multiple volunteers, we collected gestures from two volunteers, one female and one male, so as to increase the variations in the received signals. With the proposed architecture we trained the network using separate data sets from each volunteer. The classification accuracy obtained for the individual samples is 96.15% for volunteer 1 and 91.5% for volunteer 2. The proposed method thus shows favourable results for practical applications.

VII. CONCLUSION
This paper investigated the feasibility of using four different classifiers for recognizing hand-gestures from a UWB impulse radar. It presented a novel framework for mapping the output of the UWB impulse radar into a sequence of range-Doppler frames that are fed to CNN-based classifiers. The classification accuracies for the 14 hand-gestures were quite high (93.33%, 92.02%, 94.08%, and 96.15%) for the four different classifiers: (i) 3D CNN - FCNN, (ii) 3D CNN - k-NN, (iii) 3D CNN - SVM, and (iv) 2D CNN - LSTM. This indicates considerable promise for utilizing UWB radar in practical hand-gesture applications.