WiGR: A Practical Wi-Fi-Based Gesture Recognition System with a Lightweight Few-Shot Network

Wi-Fi sensing technology based on deep learning has contributed many breakthroughs to gesture recognition tasks. However, most methods concentrate on single-domain recognition with high computational complexity and rarely investigate cross-domain recognition with lightweight models, so they cannot meet an actual gesture recognition system's requirements of high recognition performance and low computational complexity. Inspired by few-shot learning methods, we propose WiGR, a Wi-Fi-based gesture recognition system. The key structure of WiGR is a lightweight few-shot learning network that introduces lightweight blocks to achieve lower computational complexity. Moreover, the network can learn a transferable similarity evaluation ability from the training set and apply the learned knowledge to a new domain to address domain shift problems. In addition, we built a CSI-Domain Adaptation (CSIDA) data set that includes channel state information (CSI) traces with various domain factors (i.e., environments, users, and locations) and conducted extensive experiments on two data sets (CSIDA and SignFi). The evaluation results show that WiGR can reach 87.8–94.8% cross-domain accuracy while reducing parameters and calculations by more than 50%. Extensive experiments demonstrate that WiGR can achieve excellent recognition performance using only a few samples and is thus a lightweight and practical gesture recognition system compared with state-of-the-art methods.


Introduction
With the rapid development of Internet of Things technology, various smart devices have changed people's lives. Human-computer interaction technologies, i.e., technologies for information interaction between humans and computers, have become essential. Since gestures have the advantages of being easy to learn, information-rich, and simple, gesture recognition technology [1] has become a research hotspot in recent years. Gesture recognition technology can be widely used in virtual games, automatic driving assistance systems, sign language recognition, and intelligent robot control. Currently, the main problems of existing gesture recognition methods based on wearable sensors [2,3] and cameras [4,5] are that they are not convenient enough, the required equipment is expensive, and there is a risk of privacy leakage, which limits the wide application of gesture recognition systems in practice. With the booming development of Wi-Fi sensing technologies, gesture recognition is more practical than ever and is progressively transitioning from theoretical research to practical application, thanks to its contactless manner, low cost, good privacy, and lack of reliance on line-of-sight (LoS) propagation [1]. Specifically, the development of gesture recognition systems is moving from the single domain to the cross domain and from recognizing fixed types of gestures to recognizing new types of gestures. In addition, gesture recognition systems are increasingly deployed in mobile environments, and their models have also transformed from heavyweight to lightweight to meet the requirements of mobile device deployment.
Wi-Fi sensing technologies recognize a gesture by analyzing a gesture's feature, extracted from the channel state information (CSI) of Wi-Fi signals, which are generated during the execution of the gesture. A convolution neural network [6][7][8], an important neural network model of deep learning, has excellent feature extraction capabilities. Therefore, Wi-Fi-based gesture recognition methods mainly adopt deep learning algorithms to recognize gestures [9][10][11][12]. However, these methods concentrate on single domain recognition. Once they face new types of gestures or gestures performed in a new domain, the recognition performance will dramatically degrade, and a large amount of testing data from the new domain is needed to adjust the model. This problem is called a "domain shift" and is a substantial challenge for improving the practicality of the gesture recognition system. In addition, deep-learning-based gesture recognition systems usually have a complex neural network model. Due to the limitations of storage space and computation consumption, the storage and calculation of neural network models on mobile devices are other substantial challenges. Therefore, designing a lightweight gesture recognition system with good recognition performance in the new domain using a small amount of data is an essential aspect of facilitating the application of gesture recognition technology.
Recently, there has been an increasing amount of literature adopting the transfer learning technique [13][14][15], generative adversarial networks [16], or a manually designed domain-independent feature body-coordinate velocity profile [17] to eliminate the domain shift problem. However, the excellent performance of these methods depends on high amounts of data, and the manual modeling method needs to analyze complex CSI data. Since the influence pattern of gestures on Wi-Fi signals is complicated, the model of velocity profiles is complicated as well.
In addition, inspired by the few-shot learning technique [18][19][20][21][22], Zou et al. [23] and Zhou et al. [24] combined a few-shot network and adversarial learning to remove domainrelated information. Lan et al. [25] proposed a few-shot multi-task classifier to address the domain shift problem. The basic idea is to initialize the parameters of the classifier so that the classifier can quickly adapt to a new domain. Yang et al. [26] proposed a Siamese recurrent convolutional architecture to remove structured noise and used convolution neural network (CNN)-long short-term memory (LSTM) to extract temporal-spatial features. Although these methods can eliminate the domain shift problem with a small amount of data, they require more computation. Their complex models with many parameters are not suitable for deployment.
To address the challenges mentioned above, we propose WiGR, a novel, practical Wi-Fi-based gesture recognition system. The key structure of WiGR is an improved few-shot learning network, which consists of a feature extraction subnetwork and a similarity discrimination subnetwork. The feature extraction subnetwork adopts a 2-D convolutional kernel [6] to simultaneously extract the spatial features and temporal dynamics of gestures. Similar to the relation network [22], the similarity discrimination subnetwork uses a learning-based neural network as the similarity measurement method to determine the type of gesture, which is more accurate than using fixed functions as measurement methods [18][19][20][21]. The whole network can learn a transferable similarity evaluation ability from the training set and apply the learned knowledge to the new testing domain via an episode-based training strategy [20] to eliminate the problem of domain shift. In addition, there is evidence that lightweight networks [27][28][29][30][31] play a crucial role in mobile deployment. Therefore, we introduce depthwise separable convolution and an inverted residual layer with a linear bottleneck [30,31] into the few-shot learning network to reduce model computations and parameters. Simultaneously, to reduce the complexity of the model without a corresponding drop in recognition performance, we introduce a squeeze and excitation (SE) block [32] to improve the quality of the features generated by the network by explicitly modeling the interdependence between the network's convolutional feature channels. Extensive experiments on two data sets (CSI-Domain Adaptation (CSIDA) and SignFi [10]) demonstrate that WiGR achieves excellent recognition performance in cross-domain evaluation, and our network design dramatically reduces the model computations.
Our contributions can be summarized as follows:
• We designed a novel Wi-Fi-based gesture recognition system called WiGR that is more practical than existing gesture recognition systems. The practicality is reflected in its ability to recognize new gestures, or gestures performed in new domains, using just a few new samples.
• A lightweight few-shot learning network, which consists of a feature extraction subnetwork and a similarity discrimination subnetwork, is proposed to address the hard domain shift problem. Lightweight and effective blocks are introduced in the network to achieve lower computational complexity and high performance.

Wi-Fi-Based Gesture Recognition
With the rise of Wi-Fi sensing technology, the CSI of Wi-Fi can convey rich information and enable precise tracking. There are many types of CSI-based methods for gesture recognition. For example, WiGeR [33] employs a multilevel wavelet decomposition algorithm, the short-time energy algorithm, and dynamic time warping (DTW) to recognize gestures. WiCatch [34] utilizes a support vector machine (SVM) with the MUSIC signal processing algorithm to recognize gestures. Ma et al. [10] proposed SignFi, a deep learning method with a nine-layer CNN architecture, to recognize sign gestures. However, these methods do not deal with the hard domain shift problem.
Few-shot learning methods [18][19][20][21][22] have achieved great success in addressing the domain shift problem. Zou et al. [23] proposed a few-shot domain adaptation scheme (F-CADA). F-CADA adopts adversarial learning to construct an embedding space, which requires a large amount of unlabeled target data; it then enhances the performance of the target classifier with a few labeled target samples via greedy label propagation. Zhou et al. [24] proposed three adversarial learning processes to remove the distribution discrepancy between source and target data, which increases the complexity of the system. Lan et al. [25] proposed a multi-task classifier to address the domain shift problem; the basic idea is to initialize the classifier with multi-task classifier parameters so that the classifier can quickly adapt to any new sensing domain, although it is difficult for the cross-task classifier to converge. The deep Siamese recurrent convolutional network [26] is a typical method of using a few-shot learning network to recognize gestures; it relies on a CNN-LSTM architecture to extract spatial-temporal features, which also increases the complexity of the model. Taken together, these methods ignore the problem of model computational complexity, which is not beneficial for model deployment. In this paper, our proposed system adopts a different feature extraction network, i.e., a 2-D convolutional neural network, which performs better at feature extraction than the CNN-LSTM architecture. In addition, we not only focus on the domain shift issue but also introduce lightweight blocks to meet the performance requirements of mobile deployment.

Few-Shot Learning Network
The few-shot learning method [18][19][20][21][22], the key technology used in this paper, is committed to addressing the domain shift problem using just a few support samples. This is the key difference between the few-shot learning method and other domain adaptation methods. Traditional few-shot learning methods use a fixed measurement to express the correlation between samples. For example, the Siamese network [18] is a two-branch neural network that determines whether two samples belong to the same class based on their distance. This network is fed one pair of samples at a time to calculate a contrastive loss in each iteration, which is less efficient at updating the network's weights than training on batches of samples [19]. A matching network [20] utilizes the idea of metric learning based on deep neural features and augments the neural network with external memories to achieve few-shot learning. Snell et al. [21] proposed a prototype network that measures the similarity of features with a fixed equation (e.g., negative Euclidean distance or cosine similarity). In the above methods, the similarity measurements are fixed functions that are not flexible when applied in a complex embedding space. In 2017, Sung et al. [22] proposed the relation network, which adopts a learning-based neural network as the similarity measurement; this helps determine the relationship between samples more accurately than a fixed, manually designed measurement. Therefore, we introduce the relation network as the basic model for solving the domain shift problem in our system. Additionally, we introduce some lightweight blocks into the model to make the system more suitable for mobile devices.
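To make the contrast with learned metrics concrete, a fixed-function measurement such as the prototype network's negative Euclidean distance can be sketched in a few lines. The embeddings and prototypes below are illustrative stand-ins, not outputs of any trained model:

```python
import torch

# Fixed-function similarity as used by prototype networks: the negative
# squared Euclidean distance between a query embedding and each class
# prototype. Higher score = more similar; no parameters are learned here.
def proto_scores(query, prototypes):
    # query: (d,), prototypes: (K, d)
    return -((prototypes - query) ** 2).sum(dim=1)

protos = torch.tensor([[0.0, 0.0], [3.0, 4.0]])  # two class prototypes
q = torch.tensor([0.5, 0.0])                     # one query embedding
scores = proto_scores(q, protos)
print(scores.argmax().item())  # 0: the query is closest to prototype 0
```

The relation network replaces this fixed function with a small trainable network that consumes the concatenated embeddings, which is the design our similarity discrimination subnetwork follows.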

Lightweight Network Designs
Lightweight neural networks have fewer parameters and consume fewer computing resources, so they are more suitable for deployment on mobile devices. SqueezeNet [27] reduces the network's parameters by replacing 3 × 3 convolution kernels with 1 × 1 convolution kernels and limiting the number of channels. ShuffleNet [28] adopts pointwise group convolution to reduce the model's computational cost and uses channel shuffle to improve the information presentation ability of the network. InceptionV3 [7], Xception [29], MobileNetV1 [30], and MobileNetV2 [31] adopt depthwise separable convolution instead of traditional convolution to reduce parameters and computing consumption. In addition, MobileNetV2 uses an inverted residual layer with a linear bottleneck to achieve better performance with less computing consumption. Overall, these studies prove the effectiveness of depthwise separable convolution and the inverted residual structure with a linear bottleneck. Therefore, we introduce these strategies into our network to make our system more lightweight.
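The savings from depthwise separable convolution can be checked with simple parameter-count arithmetic; the kernel and channel sizes below are illustrative, not taken from any of the cited networks:

```python
# Rough parameter-count comparison for one convolutional layer.
def standard_conv_params(k, c_in, c_out):
    # Each of the c_out filters spans all c_in input channels.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # One k x k filter per input channel, then a 1 x 1 pointwise projection.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 32, 64)        # 3*3*32*64 = 18432
sep = depthwise_separable_params(3, 32, 64)  # 288 + 2048 = 2336
print(std, sep, round(sep / std, 3))         # the separable layer needs ~13% of the parameters
```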

Overview of CSI
As a signal descriptor of the Wi-Fi signal, CSI reflects the signal information of the communication link, such as signal scattering, multipath fading, and the power decay with distance. A wireless channel usually uses the channel impulse response (CIR) to describe the multipath propagation of the signal in terms of its amplitude and phase characteristics. The measurement of CSI is mainly used to obtain CIR values [35]. The CSI measurement at each subcarrier is mathematically expressed as

X(i) = ||X(i)|| · e^(j∠X(i))

where ||X(i)|| represents the amplitude of the CSI measurement at the ith subcarrier, and ∠X(i) denotes the phase of the CSI measurement at the ith subcarrier. Since the phase information is more sensitive to environmental changes, our interest is in obtaining the CSI phase information for each subcarrier. Currently, some network interface cards (NICs) can continuously monitor the state changes of the signal frequency response in wireless signals [36], such as the Intel 5300, Atheros 9390 [37,38], and Atheros AR9580 [39]. We can obtain CSI data directly from the NICs by modifying their open-source drivers.
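In practice, the amplitude ||X(i)|| and phase ∠X(i) can be read off the complex CSI matrix directly. The sketch below assumes NumPy and uses a randomly generated stand-in with the CSIDA input shape (1800 frames × 114 subcarriers × 3 Tx-Rx pairs) in place of real CSI traces:

```python
import numpy as np

# Hypothetical complex CSI matrix: T data frames x Nc subcarriers x N antenna pairs.
rng = np.random.default_rng(0)
csi = rng.standard_normal((1800, 114, 3)) + 1j * rng.standard_normal((1800, 114, 3))

amplitude = np.abs(csi)  # ||X(i)||: per-subcarrier amplitude
phase = np.angle(csi)    # angle of X(i): per-subcarrier phase, wrapped to (-pi, pi]
```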

Problem Definition
In actual testing scenarios for gesture recognition, the testing conditions are usually different from those for training procedures. It is not feasible to collect a large amount of data in new scenarios to adapt the system to the current scenario. Therefore, a practical gesture recognition system should achieve excellent performance using just a few samples of gestures when facing new types of gestures that have not been seen in a training procedure or when gestures are performed in a new domain. Formally, our system is trained by training set D, which consists of samples with corresponding labels of the old types of gestures. We then divide the samples with corresponding labels of the new types of gestures or gestures performed in the new domain into two subsets, i.e., support subset S and testing subset Q. Our goal is to train the system by training set D and then use the transferable knowledge learned from D and the feature knowledge learned from the support subset S to identify the label y j of each sample x j in the testing subset Q.
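A minimal sketch of the K-way, G-shot support/testing split described above; the pool structure and helper name are hypothetical, and real episodes would sample gesture instances rather than integers:

```python
import random

# Hypothetical K-way, G-shot episode sampling from a labeled pool,
# mirroring the split of new-domain data into support subset S and
# testing subset Q described in the text.
def sample_episode(pool, k=5, g=3):
    # pool: dict mapping gesture label -> list of samples
    classes = random.sample(sorted(pool), k)
    support = {c: pool[c][:g] for c in classes}  # S: g labeled samples per class
    query = {c: pool[c][g:] for c in classes}    # Q: the remaining samples
    return support, query

pool = {f"g{i}": list(range(5)) for i in range(6)}  # 6 gestures, 5 samples each
support, query = sample_episode(pool, k=5, g=3)
```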

Overview of WiGR
In this section, we introduce the framework of the proposed WiGR system. As illustrated in Figure 1, WiGR mainly contains three parts: CSI data collection, data processing, and a lightweight few-shot network. First, the input of the system is CSI data containing gesture information. The CSI data collection methods are described in detail in Section 3.2.1. We explain the data processing in Section 3.2.2. The key structure of the system is the lightweight few-shot network, which is explained in Section 3.2.3. We describe the episode-based training strategy [20] used to train the lightweight few-shot network in Section 3.2.4.

CSI Data Collection
In this section, we introduce the collection method of the CSI data. The CSI data used in this paper came from two Wi-Fi data sets, i.e., our own CSIDA data set and the public SignFi data set [10], which were created via different data collection methods.
The collection of the CSIDA data set. We used two Atheros AR9580 Wi-Fi chipsets supporting the IEEE 802.11n standard as a transmitter (Tx) and a receiver (Rx) [39], respectively. Each Wi-Fi chipset was equipped with three antennas spaced 0.1 m apart. It should be noted that, considering the performance of the computer, only one transmitting antenna and three receiving antennas were used in our experiment, so there were 3 (1 × 3) Tx-Rx pairs in total. The bandwidth was 40 MHz, and the Wi-Fi frequency was 5 GHz. Since orthogonal frequency division multiplexing (OFDM) is used in the 802.11n protocol [40], many subcarriers could be obtained; one CSI datum therefore included 114 subcarriers for each Tx-Rx pair. In addition, each CSI datum was collected over 1.8 s at a sampling rate of 1000 data frames/s. Denote the number of antennas at the Tx as N_Tx, the number of antennas at the Rx as N_Rx, the number of subcarriers as N_c, and the number of sampled data frames as T. The CSI data can then be represented as a complex matrix of size T × N_c × (N_Tx × N_Rx) (i.e., 1800 × 114 × 3), which is the size of the input data of our proposed network.
We collected CSI data in two different indoor environments (Room 1 and Room 2). The layout of the indoor environments is shown in Figure 2, where the distance between the Tx and Rx is 2.6 m [41]. In Room 1, we marked three locations on which the users stood and performed predesigned gestures; in Room 2, we marked two locations. The distances between the user and the transmitter/receiver follow [17,41]. The user stood on a premarked location and watched the instructions on the screen of a computer, which was used to automatically label the CSI data generated during the execution of the gestures.

Five users performed the predesigned gestures. As shown in Figure 3, there were six types of predesigned gestures: upward, downward, leftward, rightward, circle, and zigzag, which are commonly used in the field of human-computer interaction. When collecting the data, the user stood on the premarked location and faced the computer screen. Before data collection, the screen showed the type of gesture and reminded the user to raise their hand to prepare for the action. After 3 s of preparation time, the user started performing the gesture; the duration of each gesture was 1.8 s. At the same time, the computer started collecting CSI data frames at a sampling rate of 1000 data frames/s, and each CSI datum had four labels: the identity of the user, the room number, the location number, and the gesture category. Afterward, the screen showed an instruction to stop for 2 s, and the user took a short break. This completed the CSI data collection for one gesture. We repeated the above process until the data collection was finished.

Table 1 shows a summary of the CSIDA data collection. The five users, with different body shapes, stood on five different locations (three in Room 1 and two in Room 2) to perform six predesigned gestures, and each gesture was repeated 20 times.
Therefore, there were 1800 (5 × 3 × 6 × 20) samples of gestures in Room 1 and 1200 (5 × 2 × 6 × 20) samples of gestures in Room 2.
The collection of the SignFi data set. The SignFi data set was collected using an 802.11n CSI tool based on Intel 5300 NIC [10]. The CSI collection system contained a Tx and an Rx, equipped with one and three antennas, respectively. In addition, there were 30 subcarriers, and the sampling time was 200 data frames in that system. Therefore, the size of one CSI datum inputted into the lightweight few-shot learning network was 200 × 30 × 3. The SignFi data set contains 276 sign gestures collected by five users, and each gesture was repeated 10 times. There are 14,280 gesture samples in total, which consist of 11,520 gesture samples obtained in a lab and 2760 gesture samples obtained in a home. A detailed description is given in Table 2 [10], where "Number of Samples" denotes the total number of samples (number of gestures × number of repetitions).

CSI Data Processing
Before feeding the raw CSI data into the proposed WiGR model, we needed to remove noise to improve gesture recognition accuracy. Pulse and burst noise is usually at a higher frequency than the reflected signal caused by human movement, whereas static reflectors usually introduce interference at lower frequencies [42,43]; it is therefore necessary to filter out this interference. In our experiments, we adopted a finite impulse response (FIR) filter [44] designed by the least-squares method, with the cutoff frequencies set to 2 and 80 Hz. Figure 4 shows the CSI phase waveform of the "upward" gesture over 200 data frames and its corresponding CSI radio image. As shown in Figure 4a-c, the CSI radio images have both spatial and temporal characteristics that are useful for recognition: the x-axis represents the duration of one CSI datum collection, which shows the temporal characteristics of the CSI data, and the y-axis represents the 114 subcarriers, which shows the spatial change of the CSI data. For the sake of clarity, we randomly selected one subcarrier of the CSI data to show its phase waveform in Figure 4d-f. In addition, the sampling clock and carrier frequency of the Tx and Rx are not synchronized in real-world Wi-Fi systems, which leads to sampling time offset and sampling frequency offset and thus introduces random phase shifts. As a result, the raw CSI phases were wrapped into the range [−π, π], as shown in Figure 4d, which misrepresents the changing trend of the CSI phases. We unwrapped the CSI phases to recover the lost information by removing random phase shifts [10], as shown in Figure 4e. The unwrapped CSI phases were then filtered with the FIR filter to remove noise interference, as shown in Figure 4f.
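A possible NumPy/SciPy rendering of this unwrap-then-filter chain, assuming the 1000 frames/s sampling rate and the 2–80 Hz passband mentioned above; the tap count and transition bands are illustrative choices, not the paper's exact filter design, and the input is synthetic wrapped phase rather than real CSI:

```python
import numpy as np
from scipy.signal import firls, filtfilt

fs = 1000.0    # sampling rate: 1000 data frames/s
numtaps = 129  # odd tap count for a Type I linear-phase FIR filter (illustrative)
# Least-squares band-pass design passing roughly 2-80 Hz; the 1 Hz and 90 Hz
# transition edges are assumptions for the sake of the sketch.
bands = [0.0, 1.0, 2.0, 80.0, 90.0, fs / 2]
desired = [0, 0, 1, 1, 0, 0]
taps = firls(numtaps, bands, desired, fs=fs)

# Synthetic wrapped phases: T frames x Nc subcarriers x N antenna pairs.
rng = np.random.default_rng(0)
raw_phase = np.angle(np.exp(1j * np.cumsum(rng.standard_normal((1800, 114, 3)), axis=0)))

unwrapped = np.unwrap(raw_phase, axis=0)          # undo the [-pi, pi] wrapping in time
filtered = filtfilt(taps, 1.0, unwrapped, axis=0)  # zero-phase FIR filtering per subcarrier
```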

Lightweight Few-Shot Network
The key structure of WiGR is a lightweight few-shot network that consists of a feature extraction subnetwork and a similarity discrimination subnetwork. The function of the feature extraction subnetwork is to extract advanced features of support samples and testing samples, which are then combined in depth. The function of the similarity discrimination subnetwork is to determine the relationship of the combined features and output the similarity score of these gestures; the samples with the highest score are considered to be of the same type. Additionally, we introduce depthwise separable convolution and an inverted residual layer with a linear bottleneck [30,31] into the network to reduce model computations and parameters.

Feature extraction subnetwork. As shown in Figure 5, we adopted one "conv block" and five "mobile blocks" to construct the feature extraction subnetwork. Specifically, the conv block is a normal CNN structure that consists of a convolutional layer, a normalization layer, and an activation layer. Each convolutional layer has multiple three-dimensional (3-D) convolutional kernels, and each 3-D convolutional kernel consists of multiple two-dimensional (2-D) convolutional kernels. Each 2-D convolutional kernel performs a convolution operation on the CSI data and can simultaneously extract the spatial features and temporal dynamics of the CSI data. The CSI datum of each gesture obtained from one antenna is a 2-D radio image (see Figure 4a), which can be denoted as

X ∈ R^(N_c × T)

where R denotes the real numbers, N_c is the number of subcarriers, and T is the number of data frames. As an analogy to the image recognition problem, one CSI datum is analogous to a video, where N_c plays the role of the pixels in one frame and T the number of frames.
If there are N CSI data (N = N_Tx × N_Rx), the output Q_i of the ith 3-D convolutional kernel can be denoted as

Q_i = Σ_{n=1}^{N} W_i^n ∗ X_n + b_i

where X_n is the nth 2-D CSI datum, W_i^n is the nth 2-D convolutional kernel of the ith 3-D convolutional kernel, ∗ denotes the 2-D convolution operation, and b_i is the ith bias parameter. Benefitting from the excellent feature extraction capabilities of CNNs [6][7][8], the feature extraction process of the subnetwork is effective. In addition, we use depthwise separable convolution instead of ordinary convolution in the convolutional layer to reduce the network's parameters and computing consumption. The normalization layer accelerates network training by reducing internal covariate shift. The activation layer adopts two different activation functions, i.e., H-swish [45] (HS) and ReLU [46] (RE). H-swish is an improved version of the rectified linear unit (ReLU) and can act on more of the feature range, but its computational cost is higher than ReLU's. Therefore, we used H-swish and ReLU alternately to balance the complexity and accuracy of the network.
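A depthwise separable variant of such a convolutional layer might look as follows in PyTorch; the channel sizes and input shape are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Depthwise separable 2-D convolution used in place of standard convolution:
# a per-channel (depthwise) convolution followed by a 1x1 pointwise projection,
# with batch normalization and H-swish activation as in the conv block.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size, stride,
                                   padding=kernel_size // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()  # H-swish activation

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Illustrative input: (batch, Tx-Rx pairs, subcarriers, data frames).
x = torch.randn(1, 3, 114, 128)
y = DepthwiseSeparableConv(3, 16)(x)
print(y.shape)  # torch.Size([1, 16, 114, 128])
```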
The mobile block is based on a linear bottleneck with an inverted residual structure [30], which is beneficial for deployment on mobile devices. First, the block expands a low-dimensional compressed feature to a high dimension using a pointwise convolution layer consisting of M convolution kernels of size 1 × 1. The number of kernels M is determined by the parameter fac, which changes the number of feature dimensions proportionally and thus helps reduce the model's computational complexity. The block then uses a depthwise convolution layer consisting of M convolution kernels of size 3 × 3 or 5 × 5 to further extract features and an SE module to enhance the robustness of the feature map; the SE module is optional in the mobile block. Finally, the features are projected back to a low-dimensional representation using another pointwise convolution layer.
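A sketch of one such mobile block in PyTorch, assuming an expansion factor fac and omitting the optional SE module; all channel sizes are illustrative:

```python
import torch
import torch.nn as nn

# Inverted residual block with a linear bottleneck:
# expand (1x1) -> depthwise (3x3) -> project (1x1, no activation).
class MobileBlock(nn.Module):
    def __init__(self, c_in, c_out, fac=4, kernel_size=3, stride=1):
        super().__init__()
        c_mid = c_in * fac  # fac scales the expanded feature dimension
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),                       # expand
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, kernel_size, stride,
                      padding=kernel_size // 2, groups=c_mid, bias=False),  # depthwise
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),                      # project
            nn.BatchNorm2d(c_out),  # no activation here: the linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

x = torch.randn(1, 16, 57, 64)
out = MobileBlock(16, 16)(x)
print(out.shape)  # torch.Size([1, 16, 57, 64])
```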
The SE module [32] is a lightweight attention model based on the squeeze and excitation structure. It is used to enhance the robustness of the feature map by generating relation weights for each channel of the feature map. Firstly, the SE block uses a global average pooling layer to squeeze the feature map obtained from the upper layer into a 1 × 1 × c feature channel vector, where c is the number of channels. To model the correlation between the feature channels, it then uses two fully connected layers to reduce c to c/r, where r is the reduction factor, and then restore c/r to c, obtaining a feature attention weight for the feature map; this bottleneck reduces the computational cost. The scale operation multiplies the feature attention weight with the feature map and outputs a robust feature. Table 3 shows the specifications for the feature extraction subnetwork, where #Out denotes the number of channels of the output feature maps, SE denotes whether there is an SE module in the block, NL denotes the type of activation function, and S stands for stride.

Similarity discrimination subnetwork. Similar to the feature extraction subnetwork, we adopted a CNN structure to construct the similarity discrimination subnetwork. As shown in Figure 6, we utilized a conv block to further analyze the representational information of the combined features obtained from the feature extraction subnetwork. We used an average pooling layer with a 3 × 3 kernel to reduce the number of parameters. The following convolutional layer further extracts features. A flatten layer condenses the multidimensional features into one dimension and is typically used in the transition from a convolutional layer to a fully connected layer.
The fully connected layer mapped the learned distributed feature representation to the sample labeling space with a sigmoid as an activation function and output the similarity score of gestures in the range of 0 to 1. A large score means that the combined features belong to the same type of gestures. Thus, the similarity discrimination subnetwork could determine the relationship of samples accurately. Table 4 shows the specifications for the similarity discrimination subnetwork.
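Returning to the SE module described above, its squeeze, excitation, and scale steps can be sketched in PyTorch as follows; the channel count and the reduction factor r are illustrative:

```python
import torch
import torch.nn as nn

# Squeeze-and-excitation: global-average-pool to a per-channel vector,
# bottleneck through two fully connected layers (c -> c/r -> c), then
# rescale each channel of the feature map by its learned weight.
class SEBlock(nn.Module):
    def __init__(self, c, r=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # squeeze to 1 x 1 x c
        self.excite = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(inplace=True),  # reduce c to c/r
            nn.Linear(c // r, c), nn.Sigmoid(),           # restore to c, weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # scale operation: reweight the channels

x = torch.randn(2, 16, 28, 32)
y = SEBlock(16)(x)
print(y.shape)  # torch.Size([2, 16, 28, 32])
```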

Episode-Based Training Strategy
We adopted an episode-based training strategy [20] to train the lightweight few-shot network. In each episode, we sampled K (e.g., 5) types of gestures uniformly at random from the training set D without replacement and took G (e.g., up to 5) samples from each gesture to simulate the support set S. We then took the remaining samples of each gesture to simulate the testing set Q. Subsequently, we used the feature extraction subnetwork f_Φ(·) to calculate the feature e of support sample x_i and the feature o of testing sample x_j. These features were combined in depth with the operator concat(e, o). Finally, we fed the combined features into the similarity discrimination subnetwork g_Φ(·), which produced a similarity score h_{i,j} representing the similarity between x_i and x_j. The score h_{i,j} is defined as follows:

h_{i,j} = g_Φ(concat(f_Φ(x_i), f_Φ(x_j)))

Additionally, the lightweight few-shot network adopts the mean square error loss function J, defined as follows:

J = Σ_i Σ_j (h_{i,j} − P(y_i == y_j))²

where y_i and y_j represent the labels of sample x_i and sample x_j, respectively. P(y_i == y_j) indicates whether y_i and y_j are equal: if they are equal, P(y_i == y_j) = 1; otherwise, P(y_i == y_j) = 0. Moreover, we adopted the gradual warmup learning rate scheduler [6] to minimize the loss function J.
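Under these definitions, one episode's loss might be sketched in PyTorch as follows; f_phi and g_phi stand for the two subnetworks, the pairing and concatenation details are our reading of the text, and the stand-in networks in the sanity check are illustrative only:

```python
import torch

# One episode: score every (support, query) pair with the similarity
# subnetwork and regress the scores h_{i,j} onto P(y_i == y_j) with MSE.
def episode_loss(f_phi, g_phi, support_x, support_y, query_x, query_y):
    e = f_phi(support_x)  # support features, shape (ne, C, H, W)
    o = f_phi(query_x)    # query/testing features, shape (no, C, H, W)
    ne, no = e.size(0), o.size(0)
    # Pair every query feature with every support feature, concatenated in depth.
    pairs = torch.cat([e.unsqueeze(0).expand(no, -1, -1, -1, -1),
                       o.unsqueeze(1).expand(-1, ne, -1, -1, -1)], dim=2)
    h = g_phi(pairs.flatten(0, 1)).view(no, ne)  # similarity scores h_{i,j}
    target = (query_y.unsqueeze(1) == support_y.unsqueeze(0)).float()  # P(y_i == y_j)
    return ((h - target) ** 2).mean()  # mean square error loss J

# Tiny sanity check with stand-in subnetworks (identity features, mean score).
f_phi = lambda x: x
g_phi = lambda z: torch.sigmoid(z.mean(dim=(1, 2, 3), keepdim=True))
sx = torch.randn(4, 2, 3, 3); sy = torch.tensor([0, 1, 2, 3])
qx = torch.randn(6, 2, 3, 3); qy = torch.tensor([0, 0, 1, 2, 3, 3])
loss = episode_loss(f_phi, g_phi, sx, sy, qx, qy)
```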
Based on the episode-based training strategy, in each training episode, we can randomly produce a training support set and a training query set to simulate the support set and testing set encountered in the test scenario. We repeated the above process until the model could learn a robust transfer knowledge from the labeled training set D. We then applied the learned transfer knowledge to the new testing domain to address the domain shift problem.

Results
We conducted extensive experiments on two data sets (SignFi [10] and CSIDA) to verify WiGR's effectiveness. We implemented the proposed system on a PyTorch 1.8.0 framework on an Intel(R) Xeon(R) CPU E5-2630 v4 @2.20GHz with an Nvidia Titan X Pascal GPU and 32.0 GB of RAM.
The SignFi data set and our CSIDA data set are both Wi-Fi data sets and include CSI data with various domain factors. The SignFi data set includes two domain factors, i.e., different environments and users. The CSIDA data set includes three domain factors, i.e., different environments, users, and locations.

Recognition Performance Evaluation
Recognizing new types of gestures. The ability to recognize new types of gestures is important for enhancing the scalability of a gesture recognition system. The few-shot learning method, the key technology used in this paper, can generalize the model using just a few support samples; this is the key difference between the few-shot learning method and other domain adaptation methods. To verify the ability of the proposed system to identify new types of gestures from just a few samples, we compared it with other few-shot learning methods on the SignFi and CSIDA data sets. Table 5 demonstrates that the WiGR model can achieve 98.6%, 97.2%, and 95.8% accuracy when recognizing 10, 20, and 30 new types of gestures, respectively, under the condition that 100 types of gestures are used for training and each new gesture has three support samples. Compared with the other methods, the improvement in accuracy is more than 10 percentage points. Table 6 demonstrates that our WiGR model has better recognition performance than the other few-shot learning models. When WiGR was trained with three old types of gestures, it achieved 91.4% and 84.9% recognition accuracy for two and three new types of gestures, respectively, with three support samples per new gesture. Because our CSIDA data set does not have enough gesture types for training, the recognition accuracy dropped slightly compared with the SignFi data set. In general, the accuracy of the proposed WiGR model is remarkably higher than that of the other few-shot learning models in all evaluations.
Cross-domain evaluation. To verify that the proposed WiGR system is effective for cross-domain recognition, we conducted extensive cross-domain experiments by splitting the data set according to the layout of the environment, the user who performs the gestures, and the user's location. We compared our model with traditional gesture recognition systems such as WiGeR [33], which combines a classifier with a DTW algorithm, and WiCatch [34], which employs an SVM with the MUSIC signal processing algorithm. In addition, since the proposed WiGR is built from CNN components, the selected comparison systems are either based on machine learning algorithms (i.e., WiGeR and WiCatch) or based on a simple CNN structure without cross-domain recognition capability (i.e., SignFi [10]), which allows us to verify the superiority of the CNN-based WiGR. Moreover, Siamese-LSTM [26], which uses a Siamese network consisting of a CNN and an LSTM to address domain shift, is a typical few-shot domain-adaptive method and was used as a baseline. These competitive methods help verify the effectiveness of our WiGR model in cross-domain evaluation.

• Cross-environment evaluation. For the environmental shift, we used CSI data from two different environments: all the data from one environment were used for training, while the data from the other environment were used for testing. Figure 7 shows the accuracy of recognizing gestures collected in a new environment with three support samples per gesture, where A → B denotes that A is the training set and B is the testing set. Traditional machine learning methods such as WiGeR and WiCatch, and an ordinary convolutional network such as SignFi, show almost no transfer ability when tested on samples from a totally new environment, while our proposed WiGR model achieves average accuracies of 98% and 88% on the SignFi and CSIDA data sets, respectively, and therefore remarkably outperforms the other methods.
• Cross-user evaluation. For the user shift, we evaluated all methods in the same environment to control variables, and then conducted leave-one-user-out cross-validation using CSI traces from different users. In other words, we used the CSI traces collected from some users as the training set and the CSI traces of the remaining user as the testing set. Figure 8 shows the results of recognizing a new user's gestures with three support samples per gesture. The cross-user recognition accuracies of WiGeR, WiCatch, and SignFi are no more than 80%, but are still better than their cross-environment performance, because the training set contains abundant user-domain information from which common features can be extracted. Our WiGR model achieves state-of-the-art performance, with average recognition accuracies of 92% and 91% on the SignFi and CSIDA data sets, respectively. Compared with the domain-adaptive Siamese-LSTM, our method improves performance by about 10%, which demonstrates that WiGR effectively alleviates domain shift by learning transferable knowledge from the training set and using the features extracted from the support samples to recognize gestures.
• Cross-location evaluation. For the location shift, we evaluated all methods in the same environment to control variables, and then performed leave-one-location-out cross-validation using CSI traces. As shown in Figure 9, our proposed WiGR model again shows excellent performance, with an average recognition accuracy of 90.8%, and therefore outperforms the other methods. In addition, when the testing CSI data are collected at Loc. 1 or Loc. 3, the recognition performance is slightly reduced compared with the data collected at Loc. 2. This is because a user performing gestures at Loc. 1 or Loc. 3 is very close to the Rx or Tx; in this case, the user's body blocks more signals, weakening signal propagation and in turn degrading gesture recognition performance.

Different users have different physical body conditions, gesture speeds, and hand movements for the same gestures, and the two environments have different layouts. Moreover, different locations result in different signal propagation paths. These three factors may produce different CSI signal patterns, even for the same gesture. Nevertheless, owing to the excellent feature extraction capability of the CNN, the CNN-based gesture recognition systems (i.e., WiGR and Siamese-LSTM) show superior cross-domain recognition performance compared with the systems based on traditional machine learning methods (i.e., WiGeR and WiCatch). Although SignFi also adopts CNN components, its structure is too simple to support cross-domain recognition.
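The leave-one-domain-out protocol used in the cross-environment, cross-user, and cross-location evaluations can be sketched as a simple split over labeled CSI samples. This is a generic illustration, not the paper's exact data pipeline; the tuple layout is an assumption for the example:

```python
def leave_one_domain_out(samples, held_out):
    """Split CSI samples for cross-domain evaluation.

    samples: list of (csi, gesture_label, domain_id) tuples, where
             domain_id identifies an environment, user, or location
    held_out: the domain id reserved entirely for testing
    Returns (train, test) lists of (csi, gesture_label) pairs.
    """
    train = [(x, y) for x, y, d in samples if d != held_out]
    test = [(x, y) for x, y, d in samples if d == held_out]
    return train, test
```

Rotating `held_out` over every user (or location) and averaging the per-split accuracies yields the cross-validation numbers reported in Figures 8 and 9.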
Additionally, the WiGR model can learn more robust transferable knowledge through supervised training, thereby eliminating the influence of individual, environmental, and location factors on gestures, which allows WiGR to achieve gesture recognition in a new domain with only a few samples.

Model Complexity Analysis
The complexity of a gesture recognition model, which affects its storage footprint and computational cost, plays a vital role in mobile deployment. We use two indicators, Params and MACs, to reflect model complexity. Params refers to the number of model parameters: the smaller the value, the less storage space the model requires. MACs refers to the multiply-accumulate operations required by the model: a smaller value corresponds to fewer computing resources being consumed. M is an abbreviation for million. The key network of WiGR is an improved few-shot learning model into which lightweight blocks are introduced. To verify the effectiveness of these lightweight blocks, we compared WiGR with standard few-shot learning models [18,20,21]. Table 7 shows that WiGR outperforms the other popular few-shot learning methods [18,20–22] in terms of model complexity by a clear margin, and that Params and MACs reach their smallest values when the scaling factor fac = 1/6. Thus, the value of fac also plays an important role in the model's computational complexity. The experimental results show that, with fac = 1/6, WiGR is a state-of-the-art lightweight gesture recognition model.
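How lightweight blocks shrink Params and MACs can be seen from the standard counting formulas for convolutional layers. The sketch below compares a standard convolution with a depthwise-separable one (a common lightweight building block; the paper's exact block design is not reproduced here, and the layer sizes in the usage are illustrative only):

```python
def conv_params_macs(c_in, c_out, k, h_out, w_out):
    """Standard k x k convolution: parameter count and MACs.

    Params = k*k*c_in*c_out; each output pixel reuses all parameters,
    so MACs = Params * h_out * w_out (bias terms omitted).
    """
    params = k * k * c_in * c_out
    macs = params * h_out * w_out
    return params, macs

def dw_separable_params_macs(c_in, c_out, k, h_out, w_out):
    """Depthwise-separable convolution: depthwise k x k + pointwise 1 x 1."""
    params = k * k * c_in + c_in * c_out
    macs = params * h_out * w_out
    return params, macs

# Illustrative layer: 64 -> 128 channels, 3x3 kernel, 28x28 output map.
p_std, m_std = conv_params_macs(64, 128, 3, 28, 28)        # 73,728 params
p_dw, m_dw = dw_separable_params_macs(64, 128, 3, 28, 28)  # 8,768 params
```

For this layer the separable variant needs roughly an eighth of the parameters and MACs, comfortably exceeding the >50% reduction reported above. A width-scaling factor such as fac = 1/6 further multiplies both channel counts, shrinking Params and MACs roughly quadratically.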

The Influence of The Number of Antennas
Since only some high-end mobile devices are equipped for multiple-input multiple-output (MIMO) communication with several antennas, it is necessary to study the influence of the number of antennas on the recognition performance of the WiGR model.
We conducted cross-domain and single-domain recognition evaluations with different numbers of receiving antennas. Specifically, the CSI data collected in Room 2 were selected as the test data in the cross-environment evaluation, the CSI data performed by User 5 were selected as the test data in the cross-user evaluation, and the CSI data collected at Location 3 were selected as the test data in the cross-location evaluation. In the single-domain recognition evaluation, we selected some CSI data of six gestures performed by User 1 at Location 1 of Room 1 as training data and used the remaining CSI data of each gesture as testing data. As before, three support samples were provided for each gesture. From Table 8, we can see that the larger the number of receiving antennas, the better the recognition performance: multiple receiving antennas provide richer CSI data, which helps the WiGR model recognize gestures more accurately. When only one transmitting antenna and one receiving antenna are used, the cross-domain recognition accuracy of the WiGR model only reaches 70.2–73.2%, while its single-domain recognition accuracy reaches 91.3%. To a certain extent, WiGR still shows cross-domain recognition ability and good single-domain recognition ability, although the performance is not as good as with MIMO.
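The antenna ablation above amounts to slicing the CSI tensor along its antenna axis before feeding it to the model. A minimal sketch, assuming a hypothetical CSI layout of (packets, rx antennas, subcarriers); the actual recording format of the CSIDA/SignFi data is not reproduced here:

```python
import numpy as np

def select_antennas(csi, n_rx):
    """Keep only the first n_rx receiving antennas' CSI streams."""
    return csi[:, :n_rx, :]

# Hypothetical trace: 200 packets, 3 rx antennas, 30 subcarriers.
csi = np.zeros((200, 3, 30))
single_antenna = select_antennas(csi, 1)  # shape (200, 1, 30)
```

Reducing `n_rx` from 3 to 1 discards two-thirds of the spatial CSI streams, which is consistent with the accuracy drop observed in Table 8 for the single-antenna configuration.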

Discussion
There are several limitations to our proposed WiGR, and they suggest fruitful directions for further investigation. Firstly, we only discuss the impact of a finite set of domain factors (i.e., environment, users, and locations). In fact, CSI signals are also affected by the orientation of the face [17] and by other signal sources. These factors need to be considered in future work.
Secondly, in many human-computer interaction scenarios, such as virtual games, automatic driving assistance systems, sign language recognition, and intelligent robot control, the distance between the user and the transmitter/receiver, or the distance between the transmitter and the receiver, is not fixed. Therefore, we simply set these distances according to [17,41]. In future work, we will focus on a specific application scenario (e.g., controlling a mobile phone with gestures) and choose the distances based on that scenario.
Finally, in our experiments, the gestures were performed under LoS conditions. Since Wi-Fi signals do not require LoS propagation, we are interested in extending WiGR to the NLoS scenario; for example, we could separate the transmitter and receiver with a wall and study the impact on the Wi-Fi signal in this case.

Conclusions
In this paper, we propose WiGR, a novel and practical Wi-Fi-based gesture recognition system. The system uses a lightweight few-shot network that is trained with an episode-based training strategy to eliminate the influence of domain shift. Lightweight and effective blocks are introduced into the network to achieve lower computational complexity and high performance. In addition, we made a CSIDA data set that includes CSI traces with various domain factors to verify the accuracy of the proposed WiGR in cross-domain evaluation. Extensive experiments on the SignFi [10] and CSIDA data sets show that the proposed WiGR excels in both cross-domain recognition and computational complexity. It is a practical and lightweight gesture recognition system compared with existing gesture recognition systems.