A Novel WiFi-Based Personnel Behavior Sensing With a Deep Learning Method

Human activity recognition has many applications in indoor monitoring, smart healthcare, and smart homes. Traditional methods require hardware-mounted or wearable sensors, which add cost and impose many limitations on usage, whereas WiFi devices offer low cost, wide coverage, and wall penetration, which can compensate for these limitations. A WiFi-based behavior sensing system is proposed in this study. Because personnel activities change the signal propagation paths, a path decomposition algorithm is designed so that path information can be used as the feature of an action. To fully exploit this feature information and improve recognition accuracy, the system uses the BI-AT-GRU network, in which bi-directional processing and an attention mechanism are added to the gated recurrent unit to achieve action recognition. The system is evaluated in different environments with different levels of multipath reflections and compared with other systems. The experimental results show that the recognition accuracy of this system in the two scenes is 97.4% and 93.3%, respectively.


I. INTRODUCTION
With the development of Internet of Things (IoT) technology, intelligent applications appear in many life scenarios. Personal activity recognition as an important application of IoT [1], [2], [3] plays an important role in some scenarios such as smart homes [4], [5], intelligent medical care [6], [7], and indoor monitoring [8] where the behavior of the target needs to be effectively monitored to ensure safety.
Computer vision [9] and wearable devices [10] are the two main traditional techniques for personal activity recognition. However, wearable device-based recognition methods require the user to carry the device at all times to collect information on human activity characteristics, which is inconvenient.
Computer vision-based activity recognition methods belong to passive detection and can be divided into two categories according to the sensor used. One is video recognition, which aims at extracting and classifying spatial-temporal features and includes three branches: the two-stream method [11], C3D [12], and CNN-LSTM [13]. The other is pose estimation based on human skeletal point information, which requires a Kinect camera to extract the skeletal points [14]. However, a camera can only work properly in scenes with sufficient light. If obstacles in the environment occlude the person, or the person is in a blind spot of the camera, the camera cannot capture the person's action information. In addition, in bathrooms, bedrooms, and other indoor environments, using cameras for activity recognition raises the problem of privacy intrusion.
The associate editor coordinating the review of this manuscript and approving it for publication was Chan Hwang See.
To overcome the shortcomings of the above sensing methods, WiFi signals, which are widely deployed commercially and offer wide coverage and strong penetration, have become a hot spot for human activity sensing. As shown in Fig. 1, the original indoor environment only has line-of-sight transmission paths and reflection paths from indoor objects. When human activities occur, additional reflection paths caused by human actions are introduced. Since different human activities produce different multipath changes in the WiFi signal, these multipath changes can be used to distinguish the corresponding activities.
Earlier WiFi-based activity identification methods utilized the Received Signal Strength Indicator (RSSI) as the metric [15], which can be easily obtained from commercial WiFi. In general, a large RSSI variance indicates possible personal activity in the indoor environment, whereas a small variance indicates no personal activity. However, RSSI degrades severely in complex environments due to time delay, signal reflection, and diffraction [16], making RSSI-based methods more suitable for describing coarse-grained activities.
Since finer-grained Channel State Information (CSI) can be extracted from commercial WiFi, CSI-based sensing studies have emerged [17], [18]. Compared with RSSI, CSI is a finer-grained signal feature that characterizes multipath effects through the OFDM subcarriers in the frequency domain [18]. The E-eyes system extracts CSI amplitude information to detect walking and in-place activities [19]. Based on E-eyes, Wi-Chase employs all 30 CSI subcarriers to discriminate three different personal activities [20]. To analyze the relationship between personnel activity and CSI variation, CARM [21] proposed the CSI-speed and CSI-activity models to quantify the relationship between the speed of personnel activity and CSI values. The system was designed for the homebound elderly and classifies seven of their daily activities. WiAct [22] used Doppler shift as a feature and classified 10 activities using an Extreme Learning Machine (ELM). WiMU first determines the number of executed gestures, generates virtual samples for these gestures based on pretrained working samples, and then compares the real samples with the virtual samples to determine the actions that were actually executed [23].
With the wide application of deep learning algorithms, many CSI activity recognition systems based on deep neural networks have emerged. Feng designed a three-layer LSTM network and discussed the impact on accuracy [24]. ABLSTM [25] recognized six human activities by utilizing a bi-directional LSTM network with an attention mechanism. In addition to LSTM, convolutional neural networks are also used for CSI activity recognition. Alazrai proposed a three-layer CNN to extract features and achieved an average recognition accuracy of 86.3% on the collected person interaction dataset [26]. Sheng discussed how to learn rich spatial-temporal information from CSI data automatically and proposed a deep learning framework that feeds CNN features into a multilayer LSTM as a temporal model [27]. WiDar3.0 [28] proposed a method for constructing cross-domain features so that a model trained in one scene can be applied to other scenes.
In this paper, a system for identifying daily activities in an indoor environment is designed. Because a person's activities alter the transmission paths, the system uses transmission path information as the metric to describe the person's actions, and it classifies the person's daily activities with the designed deep learning network.
In summary, the main contributions of our work are as follows:
(1) To be able to use path information to describe the actions of people, a path decomposition algorithm is proposed that uses CSI information to decompose the transmission paths.
(2) The relevant characteristics and physical models of CSI are analyzed, and a series of data processing algorithms have been designed for the specific application of CSI.
(3) A residual network architecture with channel attention and spatial attention mechanisms is designed for action recognition, and the feature information consisting of multiple links is converted into an image matrix as the input to the network using the CSI multiple-input multiple-output technique.
(4) Several factors that affect the action recognition accuracy of the system are investigated, such as the size of the sliding window.
(5) The experiments are carried out in two different scenarios and the performance of the system is evaluated.
The rest of this paper is organized as follows. Section II briefly introduces the principles of CSI-based personnel activity discrimination. In Section III, our proposed activity discrimination system is detailed. The implementation of our system and its evaluation results are discussed in Section IV. The conclusion is drawn in Section V.

II. FUNDAMENTALS
A. CSI
CSI is fine-grained information from the physical layer that represents the channel properties of the communication link, including scattering, environmental fading, and distance fading [29]. In a narrowband flat fading channel, the channel can be represented in the frequency domain as

$$ y = Hx + n \tag{1} $$

where y denotes the received vector, x denotes the transmitted vector, and n and H denote the noise and the channel matrix, respectively.

According to the 802.11n protocol, WiFi can work in the 2.4 GHz and 5 GHz frequency bands with channel bandwidths of 20 MHz and 40 MHz. Table 1 shows the distribution of subcarriers under 802.11n: with 20 MHz bandwidth there are three groups of subcarriers, and at most 56 subcarriers are available to transmit data; with 40 MHz bandwidth there are also three groups of subcarriers, and at most 114 subcarriers are available [30]. Using the CSI-Tool, 30 subcarriers can be acquired, each containing amplitude and phase information. The CSI of the kth acquired subcarrier is represented as

$$ H(f_k) = \|H(f_k)\| \, e^{j \angle H(f_k)}, \quad k = 1, \dots, 30 \tag{2} $$

where $\|H(f_k)\|$ and $\angle H(f_k)$ are the amplitude and phase of the kth subcarrier. Since WiFi uses the multiple-input multiple-output (MIMO) technique, the CSI can be represented as the matrix shown in (3):

$$ H = \begin{bmatrix} H_{11} & \cdots & H_{1 N_r} \\ \vdots & \ddots & \vdots \\ H_{N_t 1} & \cdots & H_{N_t N_r} \end{bmatrix} \tag{3} $$

where $N_t$ and $N_r$ denote the number of transmitting and receiving antennas, respectively. Each combination of transmitting and receiving antennas can be considered as a separate TX-RX data stream.

VOLUME 10, 2022
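The amplitude/phase view of a CSI packet described above can be illustrated with a minimal sketch. The packet layout (a nested list indexed by transmit antenna, receive antenna, and subcarrier) and the synthetic values are assumptions made for illustration only; they are not the CSI-Tool's actual data format.

```python
import cmath

def amplitude_and_phase(link):
    """Split the complex CSI values of one TX-RX link into amplitudes and phases (radians)."""
    amplitudes = [abs(h) for h in link]
    phases = [cmath.phase(h) for h in link]
    return amplitudes, phases

def split_links(packet):
    """Flatten an N_t x N_r grid of per-link subcarrier lists into a flat list of links."""
    return [link for tx_row in packet for link in tx_row]

# Hypothetical packet: 1 transmit antenna, 3 receive antennas, 30 subcarriers per link.
packet = [[[cmath.rect(1.0 + 0.1 * rx, 0.2 * k) for k in range(30)] for rx in range(3)]]

links = split_links(packet)                   # N_t * N_r separate TX-RX streams
amps, phases = amplitude_and_phase(links[0])  # 30 amplitudes and 30 phases
```

Each entry of `links` corresponds to one TX-RX data stream, so the number of entries equals the product of the antenna counts.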
The Fresnel Zone is an ellipsoidal region with the transmitting and receiving endpoints as the focal points and the line-of-sight path between the two points as the axis. As shown in Fig. 2, the Fresnel Zone consists of n concentric ellipses. Assuming that the transmitting end is $P_1$, the receiving end is $P_2$, and the radio wavelength is $\lambda$, any point $Q_n$ on the nth ellipse satisfies

$$ |P_1 Q_n| + |Q_n P_2| - |P_1 P_2| = n\lambda/2 \tag{4} $$

where the first Fresnel region is the central region of the geometry, and the nth Fresnel region denotes the region between the (n-1)th Fresnel boundary and the nth Fresnel boundary. For example, the second Fresnel region is the region between the first and second Fresnel boundaries. According to the Fresnel Zone principle [31], the important regions for RF transmission are the first 8 to 12 regions, and more than 70% of the energy is transmitted through the first Fresnel region.

The paths introduced by human activity are referred to as dynamic paths, and the paths that exist when the scene is unoccupied are referred to as static paths. For example, when $Q_1$ is located on the first Fresnel boundary, the difference in length between the reflected path $P_1 Q_1 P_2$ and the line-of-sight path $P_1 P_2$ is $\lambda/2$, so the phase difference between the static signal and the dynamic signal is $\pi$; when the phase shift $\pi$ caused by the reflection is added, the phase difference between the two signals is $2\pi$. Thus, the two signals are superimposed constructively and a larger amplitude is obtained. When the reflection point is located on the second Fresnel boundary, the phase difference between the two signals is $3\pi$, and the signals cancel each other to give a lower amplitude. When moving outward from the first Fresnel region to the nth Fresnel region, the phase difference increases from $2\pi$, $3\pi$, $4\pi$, and so on to $(n+1)\pi$, so the superposition of static and dynamic signals makes the CSI waveform show periodic fluctuations.
Therefore, path information can reflect the activity of the personnel.
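The phase relation on the Fresnel boundaries can be checked numerically with a short sketch. The 2-D coordinates and wavelengths below are hypothetical values chosen so that the reflection point falls exactly on a boundary; the function names are illustrative and do not come from the original system.

```python
import math

def excess_path_length(p1, p2, q):
    """Extra length of the reflected path P1-Q-P2 compared with the line-of-sight path P1-P2."""
    return math.dist(p1, q) + math.dist(q, p2) - math.dist(p1, p2)

def fresnel_zone_index(p1, p2, q, wavelength):
    """Index n of the Fresnel zone (bounded by the nth boundary) that contains q."""
    return max(1, math.ceil(excess_path_length(p1, p2, q) / (wavelength / 2)))

def reflected_phase_difference(p1, p2, q, wavelength):
    """Phase difference between dynamic and static signals, including the pi reflection shift."""
    return 2 * math.pi * excess_path_length(p1, p2, q) / wavelength + math.pi

# Hypothetical geometry: the excess path length is 5 + 5 - 6 = 4 length units.
p1, p2, q = (-3.0, 0.0), (3.0, 0.0), (0.0, 4.0)

first = reflected_phase_difference(p1, p2, q, wavelength=8.0)   # q on the 1st boundary
second = reflected_phase_difference(p1, p2, q, wavelength=4.0)  # q on the 2nd boundary
```

With these values, `first` equals 2π (constructive superposition) and `second` equals 3π (destructive superposition), matching the (n+1)π rule stated above.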

III. MOTIVATION AND RECOGNITION SYSTEM DESIGN
In this section, the challenges posed by CSI are investigated to improve the accuracy of action recognition. The architecture of the system is designed with several key constituent blocks, as shown in Fig. 3. Each block is described in detail in the following subsections.

A. OUTLIER REMOVAL
There are always some outliers in the collected CSI data that are not caused by human activities. To remove these outliers, the Hampel filter [32] is used in this study. Suppose each element in the collected CSI data can be represented as

$$ x_k = x_{nom} + e_k \tag{5} $$

where $x_{nom}$ indicates the normal data and $e_k$ represents the measurement error, which is the basis for deciding whether the data point is an outlier. The specific identification steps are as follows.
First, all data are sorted to find the median $x_{med}$ of the data set, and then the median absolute deviation (MAD) is obtained from the median. The MAD is calculated as follows.

$$ \mathrm{MAD} = \mathrm{median}\left( |x_k - x_{med}| \right) \tag{6} $$
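The discriminant threshold T is shown in (7): a data point $x_k$ is identified as an outlier when

$$ |x_k - x_{med}| > T \cdot \mathrm{MAD} \tag{7} $$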
The threshold T is generally set empirically and is set to 3 in this study. The filtering results are shown in Fig. 4, where outlier filtering is performed on the 8th and 10th subcarriers, respectively. It can be seen from the figure that the outliers are effectively removed by the Hampel filter.
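The identification steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the sliding window (`half_window`), the replacement of an outlier by the local median, and the sample amplitudes are all assumptions, and no scale factor is applied to the MAD.

```python
from statistics import median

def hampel_filter(samples, half_window=3, threshold=3.0):
    """Replace outliers by the local median; return (filtered samples, outlier indices)."""
    filtered = list(samples)
    outliers = []
    for k in range(len(samples)):
        lo = max(0, k - half_window)
        hi = min(len(samples), k + half_window + 1)
        window = samples[lo:hi]
        med = median(window)
        mad = median(abs(x - med) for x in window)  # median absolute deviation
        if abs(samples[k] - med) > threshold * mad:
            filtered[k] = med
            outliers.append(k)
    return filtered, outliers

# Hypothetical amplitudes of one subcarrier with a single spike at index 4.
amplitudes = [20, 21, 19, 20, 60, 20, 21, 19, 20, 21]
filtered, outliers = hampel_filter(amplitudes, half_window=3, threshold=3.0)
```

In this toy sequence only the spike at index 4 exceeds three times the local MAD, so it is the only sample replaced.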

B. AMPLITUDE FILTERING
Although the Hampel filter can effectively remove outliers, the filtered signal still contains noise generated by the external environment, so the CSI data cannot be used directly for action estimation. Therefore, considering the frequency of human actions, this study uses the Savitzky-Golay filter (SGF) for noise removal [33]. The Savitzky-Golay filter retains the main useful components of the signal by fitting its low-frequency content. The specific algorithm is as follows. Let $x[k]$, $k = -m, \dots, 0, \dots, m$, be a set of $2m+1$ CSI samples, and construct a polynomial fit of order n:

$$ p(k) = \sum_{i=0}^{n} b_{ni} k^{i} \tag{8} $$

The sum of squares of the residuals between the fitted data points and the original data points is:

$$ E = \sum_{k=-m}^{m} \left( p(k) - x[k] \right)^{2} \tag{9} $$

Taking the derivative of E with respect to each coefficient $b_{ni}$ and setting it to zero yields the coefficients, since m, n, and $x[k]$ are known. The fitted polynomial gives the filtered value at the window center point, and the whole sequence is filtered by moving the sliding window.
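A minimal sketch of this least-squares procedure is given below, assuming NumPy is available. The window half-width, polynomial order, and edge padding are illustrative choices that are not specified in the text; the weights are obtained from the pseudo-inverse of the polynomial design matrix, which is one standard way to solve the normal equations.

```python
import numpy as np

def savitzky_golay(x, half_window=5, order=3):
    """Smooth x by fitting an order-`order` polynomial in each window of 2*half_window+1 samples."""
    x = np.asarray(x, dtype=float)
    k = np.arange(-half_window, half_window + 1)
    # Design matrix with columns k^0, k^1, ..., k^order.
    design = np.vander(k, order + 1, increasing=True)
    # The first row of the pseudo-inverse maps a window to b_0, the fitted value at k = 0.
    weights = np.linalg.pinv(design)[0]
    # Repeat the edge samples so that the output has the same length as the input (assumption).
    padded = np.pad(x, half_window, mode="edge")
    # Reversing the weights turns the convolution into a sliding dot product.
    return np.convolve(padded, weights[::-1], mode="valid")

# A quadratic is reproduced exactly away from the edges when order >= 2.
smoothed = savitzky_golay([t ** 2 for t in range(30)], half_window=3, order=2)
```

Because a polynomial of degree no greater than the fitting order is fitted exactly, interior samples of the quadratic test signal are left unchanged, while higher-frequency noise would be attenuated.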

C. PHASE SANITIZATION
Due to the clock asynchrony and center frequency deviation between the transmitter and receiver, the collected raw phase information exhibits great randomness, which is not suitable for personnel activity detection. Therefore, in this section, the raw phase data are processed using a linear transformation to obtain usable CSI phase information [34]. The acquired CSI phase is represented as follows:

$$ \hat{\varphi}_i = \varphi_i - 2\pi \frac{k_i}{N} \delta + \beta + Z \tag{10} $$

where $\hat{\varphi}_i$ and $\varphi_i$ represent the measured and true CSI phase values of the ith subcarrier, respectively, $\delta$ denotes the time difference between transmitter and receiver, $\beta$ is the phase deviation due to center frequency desynchronization, Z is the Gaussian noise, $k_i$ denotes the index of the ith of the 30 subcarriers, and N is the size of the fast Fourier transform. To reduce the effects of $\beta$ and $\delta$, a linear correction method is applied to remove the phase deviation. The slope and offset of the linear transform are denoted by a and b, respectively:

$$ a = \frac{\hat{\varphi}_n - \hat{\varphi}_1}{k_n - k_1}, \qquad b = \frac{1}{n} \sum_{j=1}^{n} \hat{\varphi}_j \tag{11} $$
Since the frequencies of the CSI subcarriers are symmetric, $\sum_{j=1}^{n} k_j = 0$. The noise Z is a measurement error that can be ignored after several measurements. Finally, the sanitized phase $\tilde{\varphi}_i$ is obtained by calculating $\hat{\varphi}_i - a k_i - b$, which is expressed as follows:

$$ \tilde{\varphi}_i = \hat{\varphi}_i - a k_i - b = \varphi_i - \frac{\varphi_n - \varphi_1}{k_n - k_1} k_i - \frac{1}{n} \sum_{j=1}^{n} \varphi_j \tag{12} $$

The CSI phase after sanitization is shown in Fig. 5(b). Compared with the raw phase, which scatters randomly over many different angles, the calibrated phases fluctuate less and are more concentrated.
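The linear correction can be sketched in a few lines. The example assumes that the measured phases have already been unwrapped; the FFT size, time offset, and phase offset used to synthesize the data are hypothetical, while the subcarrier indices from -58 to 58 in steps of 4 follow the description in this paper.

```python
import math

def sanitize_phase(measured, indices):
    """Remove the linear phase error: subtract a*k_i + b from each measured (unwrapped) phase."""
    n = len(measured)
    a = (measured[-1] - measured[0]) / (indices[-1] - indices[0])  # slope
    b = sum(measured) / n                                          # offset
    return [phase - a * k - b for phase, k in zip(measured, indices)]

# 30 symmetric subcarrier indices, so that their sum is zero.
indices = list(range(-58, 59, 4))

# Hypothetical errors: FFT size N, time offset delta (in samples), constant offset beta.
N, delta, beta = 128, 3.7, 0.9
true_phase = [0.0] * len(indices)
measured = [p - 2 * math.pi * k / N * delta + beta for p, k in zip(true_phase, indices)]

clean = sanitize_phase(measured, indices)
```

For this synthetic input the true phase is zero on every subcarrier, so the sanitized phases are zero up to rounding error, which shows that both the slope term and the constant offset are removed.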

D. PHASE DECOMPOSITION
From the analysis above, it can be seen that the activities of personnel introduce additional paths, so path information can be used to classify personnel activities. From the definition of CSI, the CSI value of each subcarrier is the result of the superposition of multiple paths, and thus the CSI of subcarrier k can be written as:

$$ H_k = \sum_{l=1}^{L} \alpha_l \, e^{-j 2\pi (f_0 + k \Delta f) \tau_l} \tag{13} $$

where L is the number of paths, $\alpha_l$ and $\tau_l$ denote the signal amplitude and the TOF of path l, respectively, $f_0$ is the central frequency, and $\Delta f$ is the frequency interval of adjacent subcarriers.
For path l, $H_k^l$ is rewritten as follows:

$$ H_k^l = S_k^l \, \Phi_k^l \tag{14} $$

where $S_k^l = \alpha_l e^{-j 2\pi f_0 \tau_l}$ and $\Phi_k^l = e^{-j 2\pi k \Delta f \tau_l}$. With the CSI-Tool, the subcarrier indices range from −58 to 58 with an interval of 4. Supposing that there are five paths, a Hankel matrix X can be built from the CSI values of the subcarriers. X can then be rewritten in terms of a matrix V, which can be transformed into a Vandermonde matrix $V_s$, so that $X = V_s S V_s^{T}$, and the problem becomes finding a Vandermonde decomposition of X. With the decomposition algorithm, the phase of each path can be extracted. However, this phase cannot be used to calculate the TOF because it contains an unknown phase offset caused by the ADC.
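The link between the multipath model and the Hankel structure can be illustrated numerically, assuming NumPy is available. The sketch below is not the decomposition algorithm of this paper: it only synthesizes CSI from five hypothetical paths and shows that the resulting Hankel matrix has numerical rank equal to the number of paths, which is what makes a Vandermonde decomposition possible. The amplitudes, TOFs, center frequency, matrix shape, and rank tolerance are assumed values.

```python
import numpy as np

def multipath_csi(alphas, tofs, f0, delta_f, indices):
    """CSI of each subcarrier as the superposition of L paths."""
    k = np.asarray(indices, dtype=float)[:, None]   # subcarrier indices as a column
    alpha = np.asarray(alphas, dtype=float)[None, :]
    tau = np.asarray(tofs, dtype=float)[None, :]
    return np.sum(alpha * np.exp(-2j * np.pi * (f0 + k * delta_f) * tau), axis=1)

def hankel_matrix(h, rows):
    """Hankel matrix whose entry (i, j) is h[i + j]."""
    cols = len(h) - rows + 1
    return np.array([h[i:i + cols] for i in range(rows)])

indices = np.arange(-58, 59, 4)                  # 30 equally spaced subcarrier indices
alphas = [1.0, 0.6, 0.4, 0.3, 0.2]               # hypothetical path amplitudes
tofs = [20e-9, 60e-9, 110e-9, 170e-9, 240e-9]    # hypothetical TOFs in seconds
h = multipath_csi(alphas, tofs, f0=5.32e9, delta_f=312.5e3, indices=indices)

X = hankel_matrix(h, rows=15)                    # 15 x 16 Hankel matrix
singular_values = np.linalg.svd(X, compute_uv=False)
estimated_paths = int(np.sum(singular_values > 1e-8 * singular_values[0]))
```

Because the subcarrier indices are equally spaced, each path contributes a geometric sequence to `h`, so each path adds exactly one rank-one term to `X`.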

E. FEATURE IMAGE CONSTRUCTION
Since the CSI signal uses MIMO technology, the total number of links is the product of the number of transmit antennas and the number of receive antennas. For a system with 3 receive antennas and 3 transmit antennas, there are therefore up to 3 × 3 = 9 links, of which a minimum of 3 are generally present. Each link is a CSI matrix that can be considered as one channel of an image, so the input data is a multi-channel image matrix that fuses the path information and the subcarrier data.
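The construction of the multi-channel image can be sketched as follows, assuming NumPy is available. The number of links (3), the window length (200 packets), the random amplitudes, and the per-channel min-max scaling are assumptions for illustration; the text does not specify the normalization.

```python
import numpy as np

def build_feature_image(link_matrices):
    """Stack per-link (time x subcarrier) matrices into a (links, time, subcarriers) image in [0, 1]."""
    image = np.stack([np.asarray(m, dtype=float) for m in link_matrices], axis=0)
    lo = image.min(axis=(1, 2), keepdims=True)
    hi = image.max(axis=(1, 2), keepdims=True)
    # Min-max scaling per channel (assumption); constant channels are mapped to zero.
    return (image - lo) / np.where(hi > lo, hi - lo, 1.0)

# Hypothetical input: 3 links, each with 200 packets of 30 subcarrier amplitudes.
rng = np.random.default_rng(0)
link_matrices = [rng.uniform(5.0, 25.0, size=(200, 30)) for _ in range(3)]
image = build_feature_image(link_matrices)
```

Each link becomes one image channel, in the same way that a color image has one channel per color.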
A residual network architecture with channel attention and spatial attention mechanisms is designed for feature extraction. In this network, the features extracted by the CNN module can effectively represent the relationship between adjacent subcarriers and suppress invalid subcarriers. Because different link pairs are affected by the actions to different degrees, the spatial and channel attention modules locate and enhance the key links and subcarriers. Fusing the outputs of the spatial and channel attention modules by element-wise multiplication also helps to reduce the overall parameter overhead. The output of the flatten layer is then fed into a bidirectional gated recurrent unit network with an attention mechanism.
As different actions are both independent and correlated in space and time, this study uses a bidirectional gated recurrent unit network with an attention mechanism (BI-AT-GRU) for action classification in order to mine the spatial-temporal information of the action features. This network concatenates the forward and backward learned vectors as its final representation, which effectively avoids losing the spatial-temporal information before and after each time step. At the same time, the attention mechanism assigns different weights to different information so that the network pays more attention to the important features of the data, which further improves the classification accuracy. The general design of the model is shown in Fig. 6.
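The combination of bidirectional recurrence and attention can be sketched as a forward pass, assuming NumPy is available. This is a simplified illustration with random, untrained weights and no bias terms; the layer sizes, the single-vector attention scoring, and the function names are assumptions and do not reproduce the exact network of Fig. 6.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, input_dim, hidden):
    """Random weights (Wz, Uz, Wr, Ur, Wh, Uh) for one GRU direction."""
    shapes = [(hidden, input_dim), (hidden, hidden)] * 3
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def gru_pass(seq, params, hidden):
    """Run a GRU over seq (T x D) and return all hidden states (T x hidden)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(hidden)
    states = []
    for x in seq:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_new = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
        h = (1 - z) * h + z * h_new
        states.append(h)
    return np.stack(states)

def bi_at_gru_forward(seq, fwd, bwd, attn_vec, w_out, hidden):
    """Bidirectional GRU followed by attention pooling and a softmax classifier."""
    forward = gru_pass(seq, fwd, hidden)
    backward = gru_pass(seq[::-1], bwd, hidden)[::-1]        # re-align with time order
    states = np.concatenate([forward, backward], axis=1)    # T x 2*hidden
    scores = states @ attn_vec
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # attention weights over time
    context = weights @ states                               # weighted sum of states
    logits = w_out @ context
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), weights

# Hypothetical sizes: 20 time steps, 30 features, 16 hidden units, 8 activity classes.
rng = np.random.default_rng(0)
T, D, H, C = 20, 30, 16, 8
fwd, bwd = init_params(rng, D, H), init_params(rng, D, H)
attn_vec = rng.normal(scale=0.1, size=2 * H)
w_out = rng.normal(scale=0.1, size=(C, 2 * H))
probs, weights = bi_at_gru_forward(rng.normal(size=(T, D)), fwd, bwd, attn_vec, w_out, H)
```

The attention weights sum to one over the time steps, so time steps with larger scores contribute more to the context vector used for classification.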

IV. EXPERIMENTAL ANALYSIS
A. THE EXPERIMENTAL ENVIRONMENT
As shown in Fig. 7, two laboratories are chosen as the experimental scenes. The environment of Lab1 is simple, with only a few tables, chairs, and electronic devices, whereas Lab2 is crowded with tables, chairs, cabinets, and other furniture, so there are more multipath reflections.
The hardware of the system consists of a Mini-PC and a wireless access point (D-Link DIR-859), which are referred to as MP and AP, respectively. The AP, with three antennas, is used as the data transmitter and supports two center frequencies, 2.4 GHz and 5 GHz. The MP runs Ubuntu 14.04 with a modified Intel 5300 NIC driver. The AP and MP are shown in Fig. 8. As the WiFi network works under the 802.11n protocol, the 5 GHz band is used in the experiment to avoid interference from other devices working in the 2.4 GHz band in the experimental scenario.

Five volunteers with different physical conditions were engaged in data collection for one month in the two test scenarios. The volunteers completed different daily activities according to the experimental setup, and the data collected included both upper-limb activities and whole-body activities. To ensure that all 5 volunteers participated in the acquisition of the different actions in both scenes, before each day's acquisition the volunteers were divided into two groups of 2-3 people according to the pre-designed acquisition schedule, and the two groups collected data in the two scenes simultaneously. During data collection, the midpoint between the transmitter and the receiver was taken as the action area to ensure the consistency of daily collection. After the first volunteer completed the action to be collected, the next volunteer entered the action area and completed the same action. Since each action lasts for a short period of time, 3500 sets of data were collected for each action in each scene during the month of data collection. We randomly selected 2500 sets of data as the training set, and the remaining 1000 sets were used as the test set. The model was trained on a high-performance computer with a 2080 Ti graphics card. Part of the activity data acquisition is shown in Fig. 9.
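The random 2500/1000 split described above can be sketched as follows. The fixed seed and the use of integer identifiers in place of CSI samples are assumptions for illustration; the text does not state how the random selection was implemented.

```python
import random

def split_dataset(samples, n_train, seed=0):
    """Shuffle the samples with a fixed seed and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers for the 3500 sets collected for one action in one scene.
samples = list(range(3500))
train_set, test_set = split_dataset(samples, n_train=2500)
```

Fixing the seed keeps the split reproducible across runs, which is convenient when comparing optimizers or network variants on the same data.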
The final dataset obtained is shown in Table 2, which describes the specific actions and their representation in the text. Fig. 10 shows the accuracy and loss curves of the BI-AT-GRU network. Fig. 10(a) shows that the accuracy on the validation set increases with the number of iterations and gradually stabilizes at 97.3% after the number of iterations reaches 800, which means that the algorithm is able to learn the intrinsic features well. Fig. 10(b) shows that as the number of iterations increases, the loss function gradually decreases and eventually converges toward 0, which means that the training is effective. Adam, SGD, Adadelta, and RMSprop are each selected as the optimizer of our proposed network, and the accuracies with different learning rates are presented in Table 3. Compared with the other optimizers, Adam achieves the highest accuracy, with a learning rate of 0.01. When the learning rate is greater than 0.01, the accuracy of the model starts to decrease. Therefore, when applying the network, the learning rate should be no greater than 0.01.

C. OVERALL PERFORMANCE
Our system is evaluated in the Lab1 and Lab2 scenes, respectively, and the test results are shown in Fig. 11, which shows that the recognition accuracy of all activities in both scenes is above 88%. Lab1 has higher accuracy than Lab2 in classifying human activities, mainly because Lab2 has a lot of furniture, which generates many extra reflected signals that interfere with the signals reflected from the human body. In contrast, the CSI signals collected in Lab1 contain only human and wall reflections, so the accuracy of action recognition is higher.

Fig. 12 shows the confusion matrix obtained after discriminating eight personnel activities. Each row of the matrix represents the true label, i.e., the activity category, and each column represents the value predicted by the algorithm. Among them, sleeping and sitting achieve the highest accuracy because they are relatively simple, coarse-grained actions, which are easier to discriminate than the other activities. In addition, the four actions of waving, drinking, throwing, and removing a hat are all upper-limb actions that are easily confused with each other. The remaining actions are walking and running, which are particularly easy to confuse because they follow the same motion pattern and sometimes differ little in speed, so the recognition accuracy of these two activities is lower than that of the other activities.

D. ACTIVITY CLASSIFICATION RESULTS
In addition, 8 similar activities of 5 volunteers are also tested, and the results are shown in Fig. 13. The average recognition accuracy of the 5 volunteers' actions is 93.4%, which is close to the action recognition accuracy of a single volunteer.

To compare the recognition performance of different algorithms, the proposed BI-AT-GRU is compared with LSTM and BI-LSTM, with Precision, Recall, and F1-score used as the metrics. The results are shown in Fig. 14, from which it can be seen that BI-AT-GRU has the best performance, followed by BI-LSTM. This is mainly because BI-AT-GRU turns the three links into a three-channel image matrix when processing the input data, making full use of the information between the links; meanwhile, the residual network with attention can increase the network depth while ensuring the accuracy of feature extraction. In contrast, LSTM and BI-LSTM can only integrate the link information directly into one matrix, which makes incomplete use of the link information. The specific comparison results are shown in Table 4.

Our system is also compared with the HAR system [35] and the WiWave system [36] in the two scenarios. The experimental results are shown in Fig. 15. In the HAR system, KNN is used as the action recognition model and cannot represent the complex relationships between action features, so its recognition accuracy in both environments is lower than that of the WiWave system and our system. Although WiWave introduces the discrete wavelet transform into its convolutional architecture, which combines the good time-frequency locality of the wavelet transform with the self-learning ability of the neural network, it does not take into account the different characteristics of different actions over continuous time, which results in lower recognition accuracy than our system.
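For reference, the three metrics can be computed from a confusion matrix whose rows are true labels and whose columns are predictions, following the convention of Fig. 12. The two-class matrix below is a made-up example and does not reproduce any result of this paper.

```python
def per_class_metrics(confusion):
    """Return a (precision, recall, f1) tuple for each class of a square confusion matrix."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical two-class confusion matrix (rows: true labels, columns: predictions).
confusion = [[8, 2],
             [1, 9]]
metrics = per_class_metrics(confusion)
```

For the first class of this toy matrix, precision is 8/9, recall is 8/10, and the F1-score is 16/19.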

E. EFFECT OF SLIDING WINDOW SIZE
Since the data size also affects the recognition accuracy, the effect of different sliding window sizes is tested in the two environments, and the results are shown in Fig. 16. From the results, it can be seen that the recognition accuracy of person activity in both environments increases with the window size up to a threshold size of 20. This is mainly because a longer sliding window yields more feature samples, which in turn improves the classification accuracy. However, once the window reaches this threshold, the recognition accuracy is not effectively improved even if the size continues to increase.
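The segmentation of a CSI stream into sliding windows can be sketched as follows. The step size and the length of the stream are hypothetical, and the unit of the window size (samples here) is an assumption, since the text only reports the threshold value of 20.

```python
def sliding_windows(samples, window, step):
    """Split a sequence into overlapping windows of `window` samples, advancing by `step`."""
    return [samples[i:i + window] for i in range(0, len(samples) - window + 1, step)]

# Hypothetical stream of 100 CSI samples, window size 20, step 10.
stream = list(range(100))
windows = sliding_windows(stream, window=20, step=10)
```

With these values the stream yields 9 windows; a larger window gives each window more samples to describe one action, at the cost of fewer windows per recording.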

F. EFFECT OF DIFFERENT DISTANCE
In addition, the effect of different distances between the transceiver devices on the recognition is also tested in this study. The tests are conducted at distances of 0.5 m, 1 m, 1.5 m, and 2 m, respectively. Fig. 17 shows the action recognition accuracy at different distances in the two experimental environments. It can be seen that the accuracy is better at distances of 1 m and 1.5 m than at the others. When the distance is 0.5 m, the devices are so close that some actions that interfere less with the signal transmission path cause the original multi-link communication to degrade into a single communication link, which reduces the feature information obtained and lowers the recognition accuracy. When the distance increases, additional transmission paths are introduced, bringing more noise, which in turn affects the recognition accuracy.

G. EFFECT OF DIFFERENT SAMPLING FREQUENCIES
To analyze the effect of the sampling frequency on the accuracy of action recognition, different sampling frequencies are tested in the two scenes, and the test results are shown in Fig. 18. It can be seen that the recognition accuracy of the system increases as the sampling rate increases. This is mainly because, as the sampling frequency increases, more CSI data are collected per unit of time, and more CSI samples are available in a fixed time window to describe the impact of the person's activities on the transmission path more accurately. However, it can also be seen from the figure that the accuracy hardly changes once the sampling rate reaches 800 Hz, mainly because there is an upper limit to the impact of each action on the environment, and no more details are obtained at higher sampling rates.

V. CONCLUSION
In this study, a WiFi-based indoor human activity recognition system is presented. A new phase decomposition method is proposed to obtain the features of human activity. By analyzing the relevant characteristics and physical model of CSI, a series of algorithms is designed to process the data, and the BI-AT-GRU, which is built on a residual network architecture with attention mechanisms, is used for action recognition. The performance is validated in two different laboratories, and the results show mean accuracies of 97.4% and 93.3%, respectively. We also compare our system with other systems and analyze the factors that affect its recognition accuracy. The results demonstrate that our system distinguishes personnel activities more effectively than the other systems.
Since the Fresnel zone model does not fully explain the complex activity changes of personnel, our future work will focus on analyzing the characteristics of different categories of personnel activities and building a unified personnel activity model to improve the recognition accuracy and generalization ability.

XINGUO SHI was born in 1972. He received the B.S. degree in control engineering from the Shandong University of Science and Technology, in 1992. He is currently a Senior Engineer with the Information Center of Shandong Energy Zibo Mining Group Company. His research interest includes coal mine information technology.