An End-to-End Deep Learning Framework for Recognizing Human-to-Human Interactions Using Wi-Fi Signals

Channel state information (CSI)-based human activity recognition plays an essential role in various application domains, such as security, healthcare, and the Internet of Things. Most existing CSI-based activity recognition approaches rely on manually designed features that are classified using traditional classification methods. Furthermore, the use of deep learning methods for CSI-based activity recognition is still in its infancy, with most of the existing approaches focusing on recognizing single-human activities. The current study explores the feasibility of utilizing deep learning methods to recognize human-to-human interactions (HHIs) using CSI signals. Particularly, we introduce an end-to-end deep learning framework that comprises three phases, which are the input, feature extraction, and recognition phases. The input phase converts the raw CSI signals into CSI images that comprise time, frequency, and spatial information. In the feature extraction phase, a novel convolutional neural network (CNN) is designed to automatically extract deep features from the CSI images. Finally, the extracted features are fed to the recognition phase to identify the class of the HHI associated with each CSI image. The performance of our proposed framework is assessed using a publicly available CSI dataset that was acquired from 40 different pairs of subjects while performing 13 HHIs. Our proposed framework achieved an average recognition accuracy of 86.3% across all HHIs. Moreover, the experiments indicate that our proposed framework enabled significant improvements over the results achieved using three state-of-the-art pre-trained CNNs as well as the results obtained using four different conventional classifiers that employ traditional handcrafted features.

activities can be recognized by analyzing the variations in the Wi-Fi signals surrounding the users. In fact, the use of Wi-Fi signals can alleviate the limitations associated with traditional sensing technologies [1], [16]. This is due to the following attractive properties of the Wi-Fi signals. First, Wi-Fi signals have a better range of coverage compared with traditional sensing technologies, such as cameras, wearable sensors, and radars. Second, Wi-Fi signals have a noninvasive nature that preserves the privacy of the users. Third, human activity recognition approaches that rely on Wi-Fi signals are device-free approaches that do not require the users to wear any sensing devices.
Literature reveals that the channel state information (CSI), which is a fine-grained metric that captures the variations in the amplitude and phase information associated with different subcarrier frequencies of a Wi-Fi channel, has been widely used to develop human activity recognition systems [5], [16], [17]. The majority of the existing CSI-based human activity recognition approaches rely on extracting handcrafted features from the CSI signals using various signal processing techniques [18]. These handcrafted features are processed using a classifier, such as hidden Markov model (HMM) and support vector machine (SVM), to identify the human activity associated with the CSI signals. Notwithstanding the favorable results achieved using the handcrafted features, the task of manually designing new features that describe the information encapsulated in the time, frequency, and spatial domains of the CSI signals is considered challenging [7], [16], [19].
To avoid the process of manually designing features, researchers have utilized deep learning (DL) methods, such as convolutional neural networks (CNNs), to learn deep features from the input signals. In fact, DL methods have been successfully employed in several fields, such as signal and image classification [20] and computer vision [21], [22]. Motivated by the great success of DL methods in different fields, researchers have recently started to explore the feasibility of utilizing DL methods to develop CSI-based human activity recognition approaches [7], [16], [17], [19], [23]-[25]. The results obtained by these approaches indicate that the use of DL methods has significantly improved the recognition accuracy compared to other human activity recognition approaches that utilize manually designed features [7], [16], [19].
In spite of the remarkable results achieved by the existing CSI-based human activity recognition approaches, these approaches have mainly focused on recognizing single-human activities that are performed by one individual. This can limit the potential of using these approaches in real-world scenarios that involve more than one human [7], [16], [17]. In this regard, previous studies have indicated that the problem of recognizing human-to-human interactions (HHIs), which involve two interacting humans (e.g., handshaking and hugging interactions), is considered more challenging than the problem of recognizing single-human activities (e.g., walking and falling activities) [26], [27].
This stems from the following factors. First, HHIs involve causal relationships and interdependencies among the moving body-parts of the two interacting humans. Second, HHIs comprise large inter- and intra-personal variations in the performed interactions. Third, different HHIs may involve similar movements that are performed by the two interacting humans.
In light of this, the current study proposes an end-to-end deep learning framework (E2EDLF) for recognizing HHIs using CSI signals. The proposed framework comprises three phases, which are the input, feature extraction, and recognition phases. The input phase converts the raw CSI signals into a set of two-dimensional (2D) gray-scale CSI images that comprise the time, frequency, and spatial information encapsulated in the raw CSI data. The feature extraction and recognition phases are implemented using a novel CNN architecture that comprises three blocks of layers. Particularly, in the feature extraction phase, the first two blocks of layers within our proposed CNN architecture are utilized to automatically extract data-driven features from the CSI images. Specifically, the first block of layers extracts joint time-frequency features from the CSI signals associated with each transmit-receive pair of antennas, while the second block of layers extracts spatial features from the different pairs of transmit-receive antennas. In the recognition phase, the joint time, frequency, and spatial features, which are extracted at the feature extraction phase, are fed to the third block of layers within our proposed CNN architecture to recognize the performed HHI within each CSI image.
The performance of our proposed E2EDLF is assessed using a publicly available CSI dataset [28] that was introduced by our research group. This dataset contains the raw CSI signals recorded for 40 different pairs of subjects while performing 13 HHIs. Moreover, we compare the results obtained by our proposed E2EDLF with the results obtained using three state-of-the-art pre-trained CNNs. In addition, we compare the results achieved by our proposed E2EDLF with the results achieved by traditional handcrafted features that are extracted from the CSI signals and classified using four different conventional classifiers, including a multi-class support vector machine (mcSVM) classifier, k-NN classifier, naive Bayes classifier, and decision tree classifier. The results indicate that our proposed E2EDLF outperforms both the pre-trained CNNs and the conventional classifiers that employ traditional handcrafted features. Moreover, the results provided in our study demonstrate the feasibility of recognizing HHIs by analyzing the CSI signals using DL technology.
The main contributions of the current study can be summarized as follows:
• For the first time, this study investigates the possibility of recognizing HHIs by analyzing CSI signals.
• We propose a novel E2EDLF for recognizing HHIs that can extract features from the time, frequency, and spatial domains of the CSI signals. To the best of our knowledge, this is the first study that explores the feasibility of utilizing CNNs to learn features from the time, frequency, and spatial domains of the CSI signals with the goal of distinguishing between HHIs.
• Extensive experiments are performed using our publicly available CSI dataset to demonstrate the capability of the proposed framework for recognizing HHIs.
The remainder of this paper is structured as follows. Section II provides a review of the previous studies that were conducted in the field of CSI-based human activity recognition. Section III provides background knowledge about the CSI of a Wi-Fi system. Section IV presents our proposed E2EDLF for recognizing HHIs. Section V describes the publicly available CSI dataset employed in the current study, presents the experimental results, and discusses the performance of the proposed framework. Finally, the conclusion is provided in Section VI.

II. RELATED WORK
Over the past few years, researchers have proposed numerous CSI-based approaches for recognizing human activities. These approaches can be generally grouped into two categories, namely fine-grained and coarse-grained human activity recognition approaches.

A. FINE-GRAINED ACTIVITY RECOGNITION APPROACHES
The approaches within this category focus on recognizing primitive movements of human body-parts that are in the range of millimeters [7], [18], [29], such as keystroke recognition, vital sign monitoring, and gesture recognition.
In this regard, researchers have recently explored the possibility of recognizing keystrokes using CSI signals. For example, Ali et al. [30] proposed a CSI-based keystroke recognition system called WiKey. Particularly, WiKey can distinguish between the CSI variations that correspond to different hand and finger movements of a user while pressing various buttons on the keyboard. Li et al. [31] introduced a CSI-based approach that can infer keystrokes on a mobile device.
Another group of researchers has utilized the CSI signals to track human vital signs. For instance, Liu et al. [32] proposed a CSI-based system for sleep monitoring called Wi-Sleep. The Wi-Sleep system analyzes the CSI values to extract sleep-related information, such as the respiration of the user and sleeping postures. Liu et al. [33] developed a system that can track human vital signs, such as heart rates and breathing, using CSI signals. Niu et al. [29] utilized the CSI signals to detect the human respiration rate. Zhao et al. [34] employed the CSI signals to detect the heartbeats of individuals. The detected heartbeats are used to infer the user's emotional state.
Other studies focused on recognizing hand and finger gestures using CSI signals. In this vein, Abdelnasser et al. [35] proposed WiGest, which employs the CSI signals to recognize hand gestures. Li et al. [36] introduced WiFinger, which is a CSI-based system that can recognize finger gestures. Pu et al. [10] presented WiSee, which utilizes the CSI signals to recognize hand gestures.

B. COARSE-GRAINED ACTIVITY RECOGNITION APPROACHES
The approaches within this category focus on recognizing single-human activities that involve movements of different body-parts, such as walking, running, and falling. Coarse-grained human activity recognition approaches can be generally organized into two groups as per the employed classification schemes [18]: conventional approaches and deep learning-based approaches.
Conventional approaches rely on manually designing and extracting features from the time and frequency domains of the CSI signals. Then, the manually extracted features are used to construct standard classification models, such as the SVM classifier, to recognize human activities. For instance, Wang et al. [1] developed CARM, which is a CSI-based system that can recognize nine daily human activities. The proposed system comprises two models, namely the CSI-speed model and the CSI-activity model, that are used to quantify the correlation between the CSI dynamics and a particular human activity. Palipana et al. [37] proposed a CSI-based fall detection system called FallDeFi. The FallDeFi system utilizes the short-time Fourier transform to extract time-frequency features from the CSI measurements. The extracted features were used to construct an SVM classifier that can recognize the following four types of falls: loss of balance, tripping, loss of consciousness, and slipping. Wang et al. [38] designed a CSI-based location-oriented activity recognition system called E-eyes. Particularly, the E-eyes system utilizes a moving variance thresholding method to distinguish between walking activity and nine in-place daily human activities. Then, human activities are recognized using a matching algorithm that computes the similarity between the CSI measurements and a set of pre-constructed activity profiles. Xiao et al. [39] proposed SEARE, which is a CSI-based system that can recognize exercise activities. Specifically, the SEARE system extracts features from the time and frequency domains of the CSI measurements, and then utilizes the dynamic time warping technique to quantify the distance between feature vectors to recognize the performed exercise.
In contrast to the conventional approaches, DL-based approaches can automatically extract latent features from the CSI measurements, which can minimize the necessity to manually design the features. Recently, researchers have started to explore the possibility of utilizing DL methods to develop CSI-based human activity recognition approaches [18]. In this vein, Yousefi et al. [16] proposed a CSI-based DL approach that utilizes a long short-term memory (LSTM) network to recognize six daily human activities. Feng et al. [7] presented a DL approach for human activity recognition that is based on LSTM networks. The approach can automatically extract time and frequency features from the raw CSI signals to recognize three types of human activities. Sheng et al. [19] presented a DL approach for activity recognition that can automatically learn temporal-spatial features from the CSI data. The approach integrates the spatial features extracted using a CNN into the temporal model, which is realized using a bidirectional LSTM network. In another study, Gao et al. [25] converted the CSI measurements associated with multiple channels into radio images and employed a sparse auto-encoder to extract deep optimized features from the radio images and recognize human activities.
The aforementioned studies indicate that the use of DL methods has yielded remarkable performance improvements compared with conventional classification methods that utilize manually designed features [16], [19]. Nonetheless, the use of DL methods to analyze the CSI signals and recognize human activities is still in its early stages, with most of the existing approaches focusing on recognizing single-human activities. That said, our work contributes to the continuing studies in the area of CSI-based human activity recognition by presenting a novel DL framework that can automatically extract effective features from the time, frequency, and spatial domains of the CSI signals and recognize thirteen HHIs with high accuracy.

III. BACKGROUND OF CHANNEL STATE INFORMATION
Commercial off-the-shelf Wi-Fi devices that run according to the IEEE 802.11n standard utilize the multiple-input multiple-output (MIMO) technology with the orthogonal frequency-division multiplexing (OFDM) scheme to send and receive different Wi-Fi signals over multiple transmit-receive antenna pairs [18]. Specifically, the OFDM scheme divides the bandwidth of a MIMO channel into a set of orthogonal subcarrier frequencies that are transmitted in parallel [12]. The propagation of wireless signals between a transmit-receive antenna pair is characterized by the CSI metric [16], which represents the channel frequency response (CFR) measured for a transmit-receive antenna pair and a particular OFDM subcarrier frequency [5]. In particular, a Wi-Fi system that utilizes the MIMO-OFDM scheme can be modeled as follows [12], [16]:
$$B_s(i) = H_s(i)\,A_s(i) + N,$$
where s ∈ [1, · · · , N S ] represents the index of the OFDM subcarrier frequency, N S is the number of the OFDM subcarrier frequencies, i represents the index of the transmitted and received packets, A s (i) and B s (i) are the i th transmitted and received packets associated with the OFDM subcarrier frequency s, respectively, N T and N R represent the number of transmitting and receiving antennas, respectively, N represents noise, and H s (i) is a complex-valued matrix of dimensions N T × N R that comprises the CSI measurements of the MIMO channel for the OFDM subcarrier frequency s. The structure of the CSI matrix H s (i) is shown below:
$$H_s(i) = \begin{bmatrix} h_s^{1,1}(i) & \cdots & h_s^{1,N_R}(i) \\ \vdots & \ddots & \vdots \\ h_s^{N_T,1}(i) & \cdots & h_s^{N_T,N_R}(i) \end{bmatrix},$$
where $h_s^{x,y}(i)$ represents the CFR value measured for the OFDM subcarrier frequency s at the i th packet between the x th transmitting antenna, denoted as T x where x ∈ [1, · · · , N T ], and the y th receiving antenna, denoted as R y where y ∈ [1, · · · , N R ]. In this work, the employed CSI dataset was acquired using the Linux 802.11n CSI tool [40], which allows the recording of N S = 30 OFDM subcarrier frequencies for each transmit-receive antenna pair. Moreover, the numbers of transmitting and receiving antennas of the equipment used to collect our dataset are N T = 2 and N R = 3, respectively.
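To make the data layout described above concrete, the following minimal sketch builds a synthetic stand-in for the recorded CSI and indexes one CSI matrix from it. The axis order, the trial length `NUM_PACKETS`, the helper name `csi_matrix`, and the random values are assumptions made only for illustration; they are not taken from the dataset or from the Linux 802.11n CSI tool.

```python
import numpy as np

# Dimensions reported in the text for the recording setup.
N_S, N_T, N_R = 30, 2, 3
NUM_PACKETS = 1000  # hypothetical trial length, chosen only for illustration

rng = np.random.default_rng(0)

# Synthetic stand-in for recorded CSI (not real measurements).
# Assumed axis order: (packet i, transmit antenna x, receive antenna y, subcarrier s).
shape = (NUM_PACKETS, N_T, N_R, N_S)
csi = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


def csi_matrix(csi, i, s):
    """Return H_s(i): the N_T x N_R complex CSI matrix (0-based i and s)."""
    return csi[i, :, :, s]


H = csi_matrix(csi, i=0, s=4)
values_per_packet = N_S * N_T * N_R  # complex CFR values recorded per packet
print(H.shape, values_per_packet)
```

With N S = 30, N T = 2, and N R = 3, each packet carries 180 complex CFR values, which is the quantity that the input phase later arranges along the vertical axis of the CSI images.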

IV. OUR PROPOSED E2EDLF FOR RECOGNIZING HHIs
This section presents our proposed CSI-based E2EDLF for recognizing HHIs. The proposed E2EDLF comprises three phases, namely the input, feature extraction, and recognition phases. In the input phase, the raw CSI data are converted into a set of 2D gray-scale CSI images. Section IV-A provides a detailed description of the conversion procedure employed in the input phase. The feature extraction and recognition phases are implemented using a novel CNN architecture that comprises three blocks of layers. Particularly, in the feature extraction phase, the first two blocks of layers within our proposed CNN architecture are utilized to automatically analyze and extract salient features from the CSI images obtained in the input phase. Section IV-B provides a detailed description of the feature extraction phase of the proposed E2EDLF. In the recognition phase, the features extracted at the feature extraction phase are fed to the third block of layers within our proposed CNN architecture to recognize the class of the HHI associated with each CSI image. Section IV-C provides a detailed description of the recognition phase of the proposed E2EDLF. The structure of our proposed CSI-based E2EDLF for recognizing HHIs is shown in Fig. 1.

A. THE INPUT PHASE
The raw CSI data can be viewed as a four-dimensional (4D) tensor that characterizes the variations of the CFR values measured for a Wi-Fi system over the time domain (i.e., packet index), frequency domain (i.e., OFDM subcarrier frequencies), and the spatial domain (i.e., pairs of transmit-receive antennas). Figure 2(A) shows the structure of the recorded raw CSI signals included in the publicly available dataset used in this study [28]. The amplitude and phase information comprised within the raw CSI signals are affected by several factors, including the multi-path effects and the existence of moving objects and humans in the signal propagation path [18]. In this regard, the literature reveals that the amplitude information of the CSI signals has been widely used to recognize human activities [16]. This is due to the fact that the changes in the amplitude of the CSI signals are relatively more stable than the deteriorations in the phase information [16], [41]. Therefore, in this study, we employ the amplitude of the CSI values to design an E2EDLF for HHI recognition.
FIGURE 1. The structure of the proposed E2EDLF: (A) the layout of the room used to record the raw CSI signals included in the CSI dataset [28], and (B) the three phases comprised within our proposed E2EDLF along with the three blocks of layers that are comprised within the CNN used to implement the feature extraction and recognition phases.
The objective of the input phase is to convert the original 4D raw CSI signals into a set of 2D CSI images that preserve the time, frequency, and spatial information comprised within the original raw CSI signals. To construct the CSI images, we compute the amplitude (i.e., the magnitude) of the raw CSI signals acquired in each recorded trial included in the CSI dataset [28]. Each trial in the dataset comprises the CSI signals recorded for a pair of subjects while performing a particular HHI. Furthermore, the computed amplitudes of the CSI signals included in each trial are arranged into a 2D matrix of dimensions M × I , where M = N P × N S , N P = N T × N R represents the number of transmit-receive antenna pairs in the Wi-Fi system, and I represents the number of packets recorded in a particular trial.
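The arrangement of the amplitudes into an M × I matrix can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used to produce the reported results: the input axis order (packet, transmit antenna, receive antenna, subcarrier) and the function name `amplitude_matrix` are hypothetical, and the row ordering is chosen so that each transmit-receive antenna pair occupies N S consecutive rows.

```python
import numpy as np


def amplitude_matrix(csi):
    """Arrange CSI amplitudes into an (M, I) matrix with M = N_T * N_R * N_S.

    csi: complex array of shape (I, N_T, N_R, N_S) (assumed layout).
    Rows are grouped by antenna pair: pair (T_1, R_1) fills the first N_S
    rows and pair (T_NT, R_NR) fills the last N_S rows.
    """
    num_packets, n_t, n_r, n_s = csi.shape
    amp = np.abs(csi)               # amplitude (magnitude) of each CFR value
    amp = np.moveaxis(amp, 0, -1)   # -> (N_T, N_R, N_S, I)
    return amp.reshape(n_t * n_r * n_s, num_packets)


# Synthetic example with the dimensions reported in the text.
rng = np.random.default_rng(1)
shape = (500, 2, 3, 30)
csi = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
A = amplitude_matrix(csi)
print(A.shape)
```

For N T = 2, N R = 3, and N S = 30, the matrix has M = 180 rows, and row (x · N R + y) · N S + s (0-based) holds the amplitude of subcarrier s for the antenna pair (T x , R y ).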
A sliding window is utilized to divide the CSI signals comprised within the computed 2D matrix of each trial into a set of overlapping segments. The size of each segment is set to W = 256 packets and the overlap between each two consecutive segments is set to W /2. Particularly, the CSI signals associated with the OFDM subcarrier frequencies of each transmit-receive antenna pair (T x , R y ) are divided into overlapping segments. We refer to each segment as $\mathrm{CSI}^{(s,\omega)}_{(T_x,R_y)}$, where s represents the index of the OFDM subcarrier frequency and ω represents the index of the packet located at the center of the current position of the utilized sliding window. The size of the CSI segment $\mathrm{CSI}^{(s,\omega)}_{(T_x,R_y)}$ is N S × W . Then, the CSI segments obtained at each window position are normalized to be in the range of [0, 255] and converted into 2D gray-scale sub-images. We denote each of the 2D gray-scale sub-images obtained at a particular position of the utilized sliding window as $\mathrm{CSI}^{(\omega)}_{(T_x,R_y)}$. Finally, at each position of the utilized sliding window, we vertically combine the sub-images $\mathrm{CSI}^{(\omega)}_{(T_x,R_y)}$ constructed for all x ∈ [1, N T ] and y ∈ [1, N R ] to construct a new image, denoted as CSI ω , with dimensions M × W . Figure 2 illustrates the construction procedure of the image CSI ω using the raw CSI signals of one trial in the CSI dataset that was recorded for a pair of subjects while performing the handshaking interaction.
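The windowing and normalization steps described above can be sketched as follows. The sketch makes several assumptions that the text does not settle: min-max scaling is applied separately to the segment of each antenna pair, trailing packets that do not fill a complete window are dropped, and the function name `csi_images` is hypothetical.

```python
import numpy as np


def csi_images(amp, n_pairs, n_s, window=256):
    """Split an (M, I) amplitude matrix into overlapping gray-scale images.

    Returns a uint8 array of shape (num_windows, M, window). Consecutive
    windows overlap by window // 2 packets.
    """
    step = window // 2
    m, num_packets = amp.shape
    assert m == n_pairs * n_s, "rows must be grouped as n_pairs blocks of n_s"
    images = []
    for start in range(0, num_packets - window + 1, step):
        segment = amp[:, start:start + window]
        # One (n_s x window) segment per transmit-receive antenna pair.
        subs = segment.reshape(n_pairs, n_s, window)
        lo = subs.min(axis=(1, 2), keepdims=True)
        hi = subs.max(axis=(1, 2), keepdims=True)
        scale = np.where(hi > lo, hi - lo, 1.0)   # guard against flat segments
        norm = (subs - lo) / scale * 255.0        # map each segment to [0, 255]
        # Stack the sub-images vertically to form one M x window image.
        images.append(np.round(norm).astype(np.uint8).reshape(m, window))
    if not images:
        return np.empty((0, m, window), dtype=np.uint8)
    return np.stack(images)


rng = np.random.default_rng(2)
amp = rng.random((180, 1000))          # synthetic amplitudes, M = 180, I = 1000
imgs = csi_images(amp, n_pairs=6, n_s=30)
print(imgs.shape)
```

With I = 1000 packets, the window starts at packets 0, 128, ..., 640, which yields six images of size 180 × 256 in this example.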

B. THE FEATURE EXTRACTION PHASE
The 2D CSI images generated in the input phase characterize the variations in the amplitude of the CSI signals that are comprised within each window position in the time, frequency, and spatial domains. This implies the necessity to analyze the changes in the CSI signals in the time and frequency domains for each transmit-receive antenna pair as well as across different pairs of transmit-receive antennas. Therefore, the objective of the feature extraction phase is to automatically learn latent features from each CSI ω that can be used to recognize different HHIs.
In this work, the feature extraction phase is implemented using the first two blocks of layers within our proposed CNN architecture as depicted in Fig. 1(B). The first block, denoted by block 1, consists of three layers: convolutional layer (L 1,1 ), batch normalization layer (L 1,2 ), and rectified linear unit layer (L 1,3 ). The objective of the first block of layers is to extract time-frequency features from the CSI images associated with each transmit-receive antenna pair. The second block, denoted by block 2, consists of three layers: convolutional layer (L 2,1 ), batch normalization layer (L 2,2 ), and rectified linear unit layer (L 2,3 ). The objective of the second block of layers is to extract spatial features from all pairs of transmit-receive antennas.
In block 1, L 1,1 is a 2D convolutional layer with neurons that are connected to subregions of the input image CSI ω . This layer learns the features localized within these subregions while scanning the input image along the horizontal and vertical dimensions using a set of 2D filters. We refer to the number of 2D filters in L 1,1 as N L 1,1 F .
FIGURE 3. The time-frequency convolution applied at the layer L 1,1 to the input CSI image CSI ω . The number of sub-images comprised within the input CSI image CSI ω is equal to N P , where each sub-image corresponds to a transmit-receive antenna pair. The red and yellow rectangles represent the current and next positions of a particular 2D filter, respectively.
As described earlier, the input CSI image CSI ω consists of N P sub-images, where each sub-image is associated with a particular transmit-receive antenna pair and has a size of N S × W . In light of this, the aforementioned parameter selection scheme enables the analysis of each of the N P sub-images comprised within the input image CSI ω along the horizontal axis, which corresponds to the packet index (i.e., time domain), and the vertical axis, which represents the indices of the OFDM subcarrier frequencies associated with a particular transmit-receive antenna pair (i.e., frequency domain). Therefore, L 1,1 can be viewed as a time-frequency convolutional layer that analyzes the CSI sub-images associated with each transmit-receive antenna pair in the input CSI image CSI ω and produces a set of time-frequency feature maps (FMs). The number of time-frequency FMs obtained at the output of layer L 1,1 is equal to the number of 2D filters employed in this layer. In addition, the height and width of each time-frequency FM are FM L 1,1 1 = N P and FM L 1,1 2 = 15, respectively. Figure 3 demonstrates the time-frequency convolution applied in the layer L 1,1 to the input CSI image CSI ω .
The time-frequency FMs generated at the output of layer L 1,1 are propagated to the next layer in block 1, which is layer L 1,2 . Layer L 1,2 normalizes the FMs to simplify the training of the CNN and reduces the potential occurrence of overfitting [42]. The performed normalization at the layer L 1,2 does not affect the number and size of the FMs obtained at the output of layer L 1,1 . Therefore, the height and width of each FM generated at the output of layer L 1,2 are FM L 1,2 1 = N P and FM L 1,2 2 = 15, respectively. The FMs produced at the output of layer L 1,2 are propagated to the last layer in block 1, which is layer L 1,3 . Layer L 1,3 performs a threshold operation to each value in the FMs obtained from layer L 1,2 , where any value less than zero is set to zero [42]. Similar to layer L 1,2 , the performed threshold operation at layer L 1,3 does not affect the number and size of the FMs obtained at the output of layer L 1,2 . Therefore, the height and width of each FM generated at the output of layer L 1,3 are FM L 1,3 1 = N P and FM L 1,3 2 = 15, respectively. Figure 4 shows the structure of the FMs produced at the output of layer L 1,3 . Particularly, the number of rows in each FM is equal to N P . We refer to each row in each FM as sub − map p , where p ∈ [1, N P ]. Each sub-map contains the features extracted from a particular sub-image in the input CSI image CSI ω . Specifically, the sub-maps within each FM are arranged from top to bottom according to the following order: the top sub-map, denoted as sub − map 1 , contains the time-frequency features extracted from the CSI signals associated with the transmit-receive antenna pair T 1 − R 1 , while the bottom sub-map, denoted as sub − map N P , contains the time-frequency features extracted from the CSI signals associated with the transmit-receive antenna pair T N T − R N R . 
This implies that the FMs generated at the output of the first block of layers characterize the time-frequency variations of the CSI signals associated with each individual transmit-receive antenna pair without taking into consideration the variations in the CSI signals across different pairs of transmit-receive antennas.
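The relationship between the input image size and the N P × 15 feature maps can be checked with the standard output-size formula for a convolution without padding. In the sketch below, the filter height and vertical stride are set to N S so that each output row depends on exactly one sub-image, as the text describes. The filter width of 32 and the horizontal stride of 16 are hypothetical values that merely reproduce the reported width of 15; other combinations do so as well, and the values actually used are not recoverable from the text above.

```python
import numpy as np


def conv_output_size(in_size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (in_size - kernel) // stride + 1


def conv2d_valid(image, kernel, stride):
    """Naive single-filter 2D convolution (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    sh, sw = stride
    out_h = conv_output_size(image.shape[0], kh, sh)
    out_w = conv_output_size(image.shape[1], kw, sw)
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r * sh:r * sh + kh, c * sw:c * sw + kw]
            out[r, c] = np.sum(patch * kernel)
    return out


N_S, N_P, W = 30, 6, 256
KERNEL_W, STRIDE_W = 32, 16            # hypothetical width parameters

rng = np.random.default_rng(3)
image = rng.random((N_P * N_S, W))     # synthetic CSI image of size M x W
kernel = rng.standard_normal((N_S, KERNEL_W))
fm = conv2d_valid(image, kernel, stride=(N_S, STRIDE_W))
print(fm.shape)
```

Because the filter spans N S rows and moves N S rows at a time, row p of the feature map is computed only from the p th sub-image, which is why each row can be interpreted as a per-antenna-pair sub-map.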
To analyze the variations of the CSI signals across different pairs of transmit-receive antennas, we pass the FMs generated at the output of layer L 1,3 to the first layer in block 2, namely L 2,1 . In particular, layer L 2,1 is a 2D convolutional layer with neurons that are connected to subregions of the FMs generated at the output of layer L 1,3 . This layer learns the features localized within these subregions while scanning the FMs along the horizontal and vertical dimensions using a set of 2D filters. We refer to the number of 2D filters in  = 2, and N L 2,1 F = 160. As described earlier, each of the FMs obtained at the output of layer L 1,3 consists of N P sub-maps, each of which is associated with a particular transmit-receive antenna pair. Hence, the selected values of the parameters associated with layer L 2,1 enable the analysis of all sub-maps comprised within each FM, which are associated with different pairs of transmit-receive antennas, over time. Thus, layer L 2,1 can be viewed as a spatial convolutional layer that analyzes all the sub-maps associated with all the pairs of transmit-receive antennas in the FMs obtained at the output of the layer L 1,3 to generate a new set of FMs. The number of FMs obtained at the output of layer L 2,1 is equal to the number of 2D filters employed in this layer. In addition, the height and width of each FM generated at the output of layer L 2,1 are FM L 2,1 1 = 1 and FM L 2,1 2 = 6, respectively. Figure 5 demonstrates the spatial convolution applied in layer L 2,1 to each of the time-frequency FMs obtained at the output of layer L 1,3 .
The FMs generated at the output of layer L 2,1 are passed on to the next layer in block 2, namely layer L 2,2 . Layer L 2,2 normalizes the FMs and propagates its values to the next layer, namely layer L 2,3 . In layer L 2,3 , a threshold operation is applied to the values of the normalized FMs, which are obtained at the output of layer L 2,2 , by setting the negative values to zero. The performed normalization and threshold operation in layers L 2,2 and L 2,3 , respectively, do not affect the number and size of the FMs generated at the output of layer L 2,1 . Hence, the size of the FMs generated at the output of layer L 2,3 is 1×6. The FMs generated at the output of layer L 2,3 represent the features extracted from the input CSI image CSI ω . These FMs are propagated to the recognition phase to recognize the class of the HHI associated with the input CSI image CSI ω .
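The sizes reported for block 2 can be illustrated with a short sketch. A filter whose height equals N P collapses the six sub-maps into a single row. The filter width of 5 and the horizontal stride of 2 are hypothetical values that happen to reproduce the reported width of 6. The `normalize` function is a simplified stand-in for batch normalization (it omits the learnable scale and shift), and the batch size of 8 is arbitrary.

```python
import numpy as np


def conv_output_size(in_size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (in_size - kernel) // stride + 1


def normalize(fms, eps=1e-5):
    """Standardize each feature map channel over the batch and spatial axes."""
    mean = fms.mean(axis=(0, 2, 3), keepdims=True)
    var = fms.var(axis=(0, 2, 3), keepdims=True)
    return (fms - mean) / np.sqrt(var + eps)


def relu(x):
    """Threshold operation: negative values are set to zero."""
    return np.maximum(x, 0.0)


N_P, FM_W_IN, N_FILTERS = 6, 15, 160
out_h = conv_output_size(N_P, kernel=N_P, stride=1)     # spans all antenna pairs
out_w = conv_output_size(FM_W_IN, kernel=5, stride=2)   # hypothetical width/stride

# Synthetic batch of FMs at the output of the spatial convolution.
rng = np.random.default_rng(4)
batch = rng.standard_normal((8, N_FILTERS, out_h, out_w))
out = relu(normalize(batch))
print(out_h, out_w, out.shape)
```

Normalization and thresholding are element-wise with respect to the layout of the FMs, so the output keeps the 1 × 6 size of each of the 160 FMs.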

C. THE RECOGNITION PHASE
In this phase, the FMs learned at the feature extraction phase are further analyzed to recognize the class of the HHI associated with the input CSI image CSI ω , where the number of HHI classes considered in the current study is thirteen, as described in Section V-A. The recognition phase is implemented using the third block of layers within our proposed CNN architecture, denoted as block 3, which comprises four layers, namely the flatten layer (L 3,1 ), fully connected layer (L 3,2 ), softmax layer (L 3,3 ), and classification layer (L 3,4 ). Figure 6 illustrates the structure of the recognition phase.
In layer L 3,1 , the FMs obtained at the output of layer L 2,3 are rearranged into a column vector of dimensions D F . The column vector obtained at the output of layer L 3,1 is propagated to the next layer of the recognition phase, namely layer L 3,2 . Layer L 3,2 consists of neurons that are connected to all features in the column vector at the output of layer L 3,1 . The number of neurons in layer L 3,2 is selected to be the same as the number of HHI classes, which is equal to 13. Moreover, layer L 3,2 has a set of parameters, namely a weight matrix and a bias vector, that are learned during the training of the CNN. After that, the outputs of layer L 3,2 are passed on to the next layer of the recognition phase, namely layer L 3,3 . Layer L 3,3 normalizes the outputs of layer L 3,2 , such that all the values obtained at the output of layer L 3,3 are greater than zero and their sum is equal to one. Each of the thirteen normalized values obtained at the output of layer L 3,3 represents the classification probability that the input CSI image CSI ω belongs to one of the thirteen HHI classes. Finally, the normalized values obtained at the output of layer L 3,3 are passed on to the last layer in the recognition phase, namely layer L 3,4 . Layer L 3,4 assigns the input CSI image CSI ω to the HHI class that has the highest classification probability. Table 1 summarizes the details of the layers comprised within the proposed CNN architecture that is used to implement the feature extraction and recognition phases in our proposed E2EDLF.
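The four layers of the recognition phase can be sketched as a single forward pass. This is a minimal illustration with randomly initialized parameters, not the trained network: the function name `recognize` and the weight scale are hypothetical, and D F = 960 is derived here from the 160 FMs of size 1 × 6 reported for layer L 2,3 .

```python
import numpy as np

NUM_CLASSES = 13
N_F, FM_H, FM_W = 160, 1, 6
D_F = N_F * FM_H * FM_W   # length of the flattened feature vector


def softmax(z):
    """Map scores to positive values that sum to one."""
    z = z - z.max()        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def recognize(fms, weights, bias):
    """Flatten (L3,1) -> fully connected (L3,2) -> softmax (L3,3) -> argmax (L3,4)."""
    features = fms.reshape(-1)
    scores = weights @ features + bias
    probs = softmax(scores)
    return int(np.argmax(probs)), probs


rng = np.random.default_rng(5)
fms = rng.random((N_F, FM_H, FM_W))                        # synthetic FMs
weights = 0.01 * rng.standard_normal((NUM_CLASSES, D_F))   # untrained weights
bias = np.zeros(NUM_CLASSES)
label, probs = recognize(fms, weights, bias)
print(D_F, label, probs.sum())
```

In a trained network, the weight matrix and bias vector would be learned jointly with the convolutional filters; here they only demonstrate the shapes involved.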

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present the publicly available CSI dataset of HHIs that was previously published by our research group [28] and used in this work to assess the performance of our proposed E2EDLF. Furthermore, we describe the procedure used to train and test our proposed E2EDLF. After that, we describe and discuss the results achieved by our proposed E2EDLF based on the CSI dataset. Moreover, we present and discuss the runtime of our proposed E2EDLF. Finally, we compare the results achieved by our proposed E2EDLF with the results achieved using different pre-trained CNNs and the results achieved using traditional handcrafted features that are extracted from the CSI signals and classified using four conventional classifiers.

TABLE 1. Summary of the details of the layers comprised within the proposed CNN architecture that is used to implement the feature extraction and recognition phases of our proposed E2EDLF.

A. THE CSI DATASET OF HHI
A publicly available CSI dataset of HHIs [28] is used to validate the performance of our proposed E2EDLF. The dataset contains the CSI packets that were recorded for forty distinct pairs of subjects while performing different HHIs inside an office with dimensions 5.3 m × 5.3 m, as illustrated in Fig. 1(A). Each pair of subjects performed ten different trials of each of the following twelve HHIs: approaching, departing, handshaking, high five, hugging, kicking with the left leg, kicking with the right leg, pointing with the left hand, pointing with the right hand, punching with the left hand, punching with the right hand, and pushing. In addition, each of the recorded trials comprises two intervals, namely the steady state interval and the interaction interval. Specifically, throughout the steady state interval, the pair of subjects were facing each other without performing any action. During the interaction interval, the pair of subjects performed one of the aforementioned HHIs. Therefore, the total number of HHI classes incorporated in the CSI dataset is equal to thirteen, which comprise the steady state interaction and the twelve HHIs described above.
The publicly available Linux 802.11n CSI tool [40] was utilized to record the Wi-Fi signals transmitted from a commercial off-the-shelf access point (AP), namely the Sagemcom 2704, to a desktop computer that is equipped with an Intel 5300 network interface card (NIC). The constructed MIMO system comprises six different pairs of transmit-receive antennas (i.e., N P = 6). Because the CSI tool reports the CSI for 30 subcarrier groups per transmit-receive antenna pair, each packet in our MIMO-OFDM system contains 6 × 30 = 180 CSI values. A detailed description of the CSI dataset is provided in [28].

B. TRAINING AND TESTING OUR PROPOSED E2EDLF
To train and test the proposed E2EDLF, we have employed a 10-fold cross-validation (CV) procedure. Particularly, for all pairs of subjects, we apply the procedure described in subsection IV-A to transform the CSI signals recorded during each trial in the CSI dataset into a set of labeled CSI images, where the label of each CSI image can be one of the thirteen HHI classes described in subsection V-A.
The labeled CSI images obtained from all pairs of subjects, all trials, and all HHI classes are divided into ten different folds. Particularly, nine folds of the CSI images associated with the thirteen HHI classes are randomly chosen and used to train the feature extraction and recognition phases of our proposed framework, while the remaining fold of the CSI images is used for testing. The 10-fold CV procedure is repeated ten times, and the overall recognition performance is computed for each of the thirteen HHI classes by averaging the results obtained from each repetition [1], [19], [24]. During each repetition of the 10-fold CV procedure, the stochastic gradient descent (SGD) algorithm was employed to learn the weights and biases of the convolutional layers of the feature extraction phase as well as the weights and biases of the fully connected layer in the recognition phase by minimizing the categorical cross-entropy loss function. The training process was run for 50 epochs, and the learning rate of the SGD algorithm was experimentally selected to be 0.001.
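The repeated 10-fold CV procedure described above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, the split is a plain random partition of image indices (whether the original folds were stratified by class is not stated here), and the training and testing steps themselves are omitted.

```python
import random


def k_fold_indices(num_samples, num_folds=10, seed=0):
    """Shuffle the sample indices and deal them into disjoint folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::num_folds] for f in range(num_folds)]


def repeated_cv_splits(num_samples, num_folds=10, num_repeats=10):
    """Yield (repetition, fold, train_indices, test_indices) tuples."""
    for rep in range(num_repeats):
        # A different seed per repetition gives a different random partition.
        folds = k_fold_indices(num_samples, num_folds, seed=rep)
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield rep, f, train_idx, test_idx


# Toy example: 130 labeled CSI images (a hypothetical count).
splits = list(repeated_cv_splits(130))
```

Each of the 10 × 10 = 100 splits uses nine folds for training and the remaining fold for testing; the per-class results would then be averaged over the repetitions.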

1) RESULTS OF OUR PROPOSED E2EDLF
The proposed E2EDLF achieved an average recognition accuracy of 86.3% across the thirteen HHI classes. Figure 7 shows the confusion matrix of our proposed framework computed over the ten repetitions of the employed 10-fold CV procedure. The average recognition accuracies computed for each of the thirteen HHI classes, which are shown along the main diagonal of the confusion matrix presented in Fig. 7, are substantially higher than the random classification rate, which is equal to 7.7% (i.e., the reciprocal of the number of HHI classes).
The results presented in Fig. 7 show some confusion between the kicking with the left leg and kicking with the right leg interactions. Similarly, one can observe some confusion between the punching with the left hand and punching with the right hand interactions. These confusions can be attributed to the large similarity between the two kicking interactions and between the two punching interactions. In addition, the confusion matrix indicates the existence of some confusion between the steady state interaction and some HHIs, such as handshaking, high five, pointing with the left hand, and pointing with the right hand. This can be attributed to the relatively large similarity between the steady state interaction and the beginning and end of each of the previously mentioned HHIs. Particularly, at the beginning and end of the aforementioned HHIs, the pairs of subjects are standing still facing each other, which is similar to the behavior of the subjects during the steady state interaction.
We also compute the F1-Score for each of the thirteen HHI classes. The F1-Score is the harmonic mean of the recall and precision, which attains its best value at 1 and its worst value at 0. The F1-Score can be used to evaluate the recognition performance when the numbers of samples associated with different classes are imbalanced [43]-[45]. In this regard, Fig. 8 shows the number of CSI images extracted from the recorded trials of each of the thirteen interactions across all pairs of subjects in our dataset. Figure 8 illustrates that the number of constructed CSI images varies substantially across the thirteen interactions. This is due to the variation in the lengths of the trials recorded for the different interactions in our dataset. The F1-Score therefore allows us to evaluate the impact of this imbalance on the recognition performance of our proposed E2EDLF. The blue bars presented in Fig. 9 show the mean F1-Score values computed for each of the thirteen HHI classes across the ten repetitions of the 10-fold CV procedure. The F1-Score values obtained using our proposed framework for each of the thirteen HHI classes are higher than 0.8. In fact, the average F1-Score value computed across all interactions is equal to 0.86.
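As an illustration of how the per-class F1-Score follows from a confusion matrix, consider the minimal Python sketch below. The function name and the 3-class toy matrix are hypothetical and are not taken from the reported results.

```python
def per_class_f1(confusion):
    """F1-Score of each class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed samples of c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # wrongly assigned to c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return scores


# Toy 3-class confusion matrix (hypothetical counts).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
f1_scores = per_class_f1(toy_confusion)
```

For the toy matrix, the first class has 8 true positives, 2 false negatives, and 1 false positive, which gives an F1-Score of 16/19 ≈ 0.84.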
To further analyze the recognition performance obtained using our proposed framework, we have computed the Cohen's kappa score [46], [47] for each of the thirteen HHI classes. The Cohen's kappa score is used to measure the agreement between the classes of the CSI images predicted by the proposed E2EDLF and the matching true classes of these images after removing the agreements occurring by chance. Particularly, the Cohen's kappa score enables us to compare the recognition performance obtained by our proposed E2EDLF with the recognition performance obtained by random guessing according to the number of samples of each class. According to Landis et al. [48], the value of the Cohen's kappa score (κ-Score) can be interpreted as follows to determine the strength of agreement: (κ-Score ≤ 0) poor agreement, (0 < κ-Score ≤ 0.2) slight agreement, (0.2 < κ-Score ≤ 0.4) fair agreement, (0.4 < κ-Score ≤ 0.6) moderate agreement, (0.6 < κ-Score ≤ 0.8) substantial agreement, and (0.8 < κ-Score ≤ 1) almost perfect agreement.
The red bars in Fig. 9 show the κ-Score values computed for each of the thirteen HHI classes over the ten repetitions of the 10-fold CV procedure. The κ-Score values presented in Fig. 9 indicate that the strength of agreement for the kicking with the left leg, kicking with the right leg, punching with the left hand, and punching with the right hand interactions is within the substantial agreement range. Furthermore, the κ-Score values that are computed for the remaining interactions are within the almost perfect agreement range. In fact, the average κ-Score value computed across all interactions using our proposed framework is equal to 0.85. Therefore, the strength of agreement computed across all interactions is within the almost perfect agreement range. The results presented in Figs. 7 and 9 illustrate the ability of our proposed E2EDLF to accurately recognize HHIs.
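The κ-Score and the agreement ranges of [48] can be sketched in Python as follows. The function names and the toy confusion matrix are hypothetical. The one-vs-rest reduction used to obtain a per-class κ-Score is an assumption made for illustration, since the exact per-class procedure is not detailed in this section.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true classes)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of the row and column marginals per class.
    expected = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n))
        for i in range(n)
    ) / (total * total)
    # Undefined (division by zero) when expected == 1, i.e., a single class.
    return (observed - expected) / (1 - expected)


def one_vs_rest(confusion, c):
    """Collapse a multi-class matrix to 2 x 2 (class c versus all others)."""
    n = len(confusion)
    tp = confusion[c][c]
    fn = sum(confusion[c]) - tp
    fp = sum(confusion[r][c] for r in range(n)) - tp
    tn = sum(sum(row) for row in confusion) - tp - fn - fp
    return [[tp, fn], [fp, tn]]


def agreement_strength(kappa):
    """Interpretation ranges quoted from [48]."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


toy_confusion = [[8, 2, 0], [1, 9, 0], [0, 0, 10]]  # hypothetical counts
overall_kappa = cohen_kappa(toy_confusion)
class0_kappa = cohen_kappa(one_vs_rest(toy_confusion, 0))
```

For the toy matrix the observed agreement is 0.9 and the chance agreement is 1/3, so the overall κ-Score is 0.85, which falls in the almost perfect agreement range.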

2) RUNTIME OF OUR PROPOSED E2EDLF
The proposed E2EDLF was executed on a workstation with an Intel Xeon Silver-4110 2.1 GHz 16-core CPU, 64 GB RAM, and an Nvidia Quadro P6000 GPU. The runtime of our proposed framework is quantified in terms of the following three metrics: (1) the training time, which is the average ± standard deviation of the time required to train the proposed framework, computed over the ten repetitions of the employed 10-fold CV procedure; (2) the CSI image construction time, which is the average ± standard deviation of the time required to construct the input CSI image associated with a particular window position at the input phase, computed across the thirteen HHI classes; and (3) the CSI image recognition time, which is the average ± standard deviation of the time required to recognize the class of an input CSI image at the recognition phase, computed across the ten repetitions of the 10-fold CV procedure.
The average ± standard deviation values of the training time, CSI image construction time, and CSI image recognition time computed for our proposed framework were 934.27 ± 3.56 s, 0.00051 ± 0.000042 s, and 0.00022 ± 0.000018 s, respectively. Although the training time is relatively large, the training process is performed offline and the trained framework is used to recognize the testing CSI images online. The average time required to construct and recognize an input CSI image using our trained E2EDLF, which we refer to as the framework response time, is equal to 0.00073 s. The ratio between the response time of our proposed E2EDLF and the duration of each window position, where the latter is computed by dividing the number of packets in each window position (256 packets) by the number of packets received per second (320 packets/s), is approximately 0.091%.
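The response-time ratio reported above can be verified with a short calculation, shown here as a Python sketch. The variable names are illustrative; the numeric values are the averages reported in this subsection.

```python
PACKETS_PER_WINDOW = 256        # packets in each window position
PACKET_RATE_HZ = 320            # packets received per second
CONSTRUCTION_TIME_S = 0.00051   # average CSI image construction time
RECOGNITION_TIME_S = 0.00022    # average CSI image recognition time

# Duration covered by one window position: 256 / 320 = 0.8 s.
window_duration_s = PACKETS_PER_WINDOW / PACKET_RATE_HZ

# Framework response time: construction plus recognition, about 0.00073 s.
response_time_s = CONSTRUCTION_TIME_S + RECOGNITION_TIME_S

# Response time as a percentage of the window duration, about 0.091 %.
ratio_percent = 100 * response_time_s / window_duration_s
```

The response time is thus roughly three orders of magnitude shorter than the 0.8 s spanned by a window position.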
The previously described runtime analysis shows the ability of our proposed E2EDLF to recognize the class of the input CSI image associated with a particular window position before moving to the next window position. This indicates the suitability of using our proposed E2EDLF for real-time CSI-based HHI recognition.

3) COMPARISON WITH THE RESULTS OBTAINED USING OTHER STATE-OF-THE-ART PRE-TRAINED CNNs
In this section, we compare the results obtained by our proposed E2EDLF with the results obtained using three state-of-the-art pre-trained CNNs, namely the GoogleNet [49], ResNet-18 [50], and SqueezeNet [51]. Particularly, in this study, we have used the implementation provided in the MATLAB DL toolbox [52] for each of the three pre-trained CNNs. Furthermore, we have utilized the concept of transfer learning [53] to tune the three pre-trained CNNs using the CSI images extracted from the CSI dataset. A zero-padding procedure is applied to adjust the size of each input CSI image, which is equal to 180 × 256 × 1 in our proposed E2EDLF, to match the sizes of the input layers of the GoogleNet, ResNet-18, and SqueezeNet, which are equal to 224 × 224 × 3, 224 × 224 × 3, and 227 × 227 × 3, respectively. Furthermore, the number of neurons in the last fully connected layer in each one of these three pre-trained CNNs was set to 13 (i.e., the number of HHI classes in the CSI dataset). Moreover, the initial learning rate was set to 0.001, the number of epochs was set to 15, and the SGD with momentum algorithm was used to tune each of the three pre-trained CNNs. To facilitate the comparison with our proposed framework, we have computed the recognition performance for each of the three pre-trained CNNs using the same training and testing sets of CSI images that were employed to evaluate our proposed framework, where these training and testing sets of CSI images were obtained using the 10-fold CV procedure described in subsection V-B. Table 2 shows the recognition accuracy, F1-Score, and κ-Score values obtained for each of the thirteen HHI classes using each one of the three pre-trained CNNs. In particular, the average recognition accuracies computed across all HHI classes for the GoogleNet, ResNet-18, and SqueezeNet are 72.1%, 77.1%, and 76.7%, respectively.
The average F1-Score values computed across all interactions for the GoogleNet, ResNet-18, and SqueezeNet are 0.75, 0.80, and 0.79, respectively. Moreover, the average κ-Score values computed across all interactions for the GoogleNet, ResNet-18, and SqueezeNet are 0.70, 0.76, and 0.76, respectively. This implies that the strength of agreement obtained across all interactions for each of the three pre-trained CNNs is within the substantial agreement range. A comparison of the results presented in Figs. 7 and 9 with the results depicted in Table 2 indicates that our proposed framework outperforms each of the three pre-trained CNNs.
We also compute the runtime for each of the three pre-trained CNNs in terms of the training time and CSI image recognition time, as described in subsection V-B2. Table 3 shows the runtime computed for each of the three pre-trained CNNs. Compared with the runtime of our proposed framework, which is presented in subsection V-B2, each of the three pre-trained CNNs required more training time and more CSI image recognition time. This can be attributed to the fact that our proposed E2EDLF comprises a relatively smaller number of layers compared with the number of layers contained within each of the three pre-trained CNNs. This suggests that the number of free parameters in our proposed framework is considerably smaller than the number of free parameters in each of the three pre-trained CNNs. As a consequence, the runtime analysis reported in the current study suggests the feasibility of utilizing our proposed framework for developing real-time systems that can accurately recognize HHIs.

4) COMPARISON WITH THE RESULTS ACHIEVED USING HANDCRAFTED FEATURES AND CONVENTIONAL CLASSIFIERS
This section presents a comparison between the results achieved by our proposed E2EDLF and the results achieved using traditional handcrafted features that are extracted from the CSI signals. Specifically, the sliding window approach, which was described in subsection IV-A, is used to divide the CSI signals into a set of overlapping segments, where each segment contains 256 packets and the overlap between any two consecutive window positions is equal to 128 packets. At each window position, we extract a set of commonly used handcrafted features that are computed from the time and frequency domains of the CSI signals [7], [54], including the mean, minimum value, standard deviation, maximum value, skewness, kurtosis, entropy, fast Fourier transform (FFT) peak, energy, and dominant frequency ratio. The extracted features at each window position are combined to form a feature vector. The constructed feature vectors are used to train and test four conventional classifiers, including a mcSVM classifier with the radial basis function kernel [55], k-NN classifier [56] with k = 5, naive Bayes classifier [56], and decision tree classifier [56], to recognize the HHI class associated with each feature vector. To evaluate the performance of each one of the four conventional classifiers, we have utilized the 10-fold CV procedure described in subsection V-B. Table 4 shows the recognition accuracy, F1-Score, and κ-Score values obtained using each one of the four conventional classifiers for each of the thirteen interactions. Specifically, the average recognition accuracies computed across all interactions using the mcSVM, k-NN, naive Bayes, and decision tree classifiers are 58.3%, 43.9%, 24.3%, and 30.5%, respectively. Moreover, the average F1-Score / κ-Score values computed across all interactions using the mcSVM, k-NN, naive Bayes, and decision tree classifiers are 0.63/0.57, 0.48/0.42, 0.26/0.20, and 0.31/0.26, respectively. Table 5 shows the average recognition accuracies computed across all interactions using each of the four conventional classifiers in this subsection, each of the three pre-trained CNNs described in subsection V-B3, and our proposed E2EDLF. The results presented in Table 5 indicate that our proposed E2EDLF significantly outperforms the handcrafted features combined with each one of the four conventional classifiers as well as each of the three pre-trained CNNs.

TABLE 4. The results obtained using the handcrafted features and the four conventional classifiers described in subsection V-B4.

TABLE 5. Summary of the average recognition accuracies computed across all interactions using each of the four conventional classifiers described in subsection V-B4, each of the three pre-trained CNNs described in subsection V-B3, and our proposed E2EDLF.
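The sliding window segmentation and a subset of the handcrafted features can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it processes a single CSI stream rather than all 180 CSI values per packet, it covers only time-domain statistics (the entropy and the FFT-based features are omitted), it uses the population (biased) definitions of the standard deviation, skewness, and kurtosis, and the function names and the sinusoidal test signal are hypothetical.

```python
import math

WINDOW = 256   # packets per window position
OVERLAP = 128  # packets shared by two consecutive window positions


def sliding_windows(signal, window=WINDOW, overlap=OVERLAP):
    """Split a 1-D sequence into overlapping segments of equal length."""
    step = window - overlap
    return [signal[s:s + window]
            for s in range(0, len(signal) - window + 1, step)]


def time_domain_features(segment):
    """Basic time-domain statistics of one segment."""
    n = len(segment)
    mean = sum(segment) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in segment) / n)
    if std > 0:
        skewness = sum(((x - mean) / std) ** 3 for x in segment) / n
        kurtosis = sum(((x - mean) / std) ** 4 for x in segment) / n
    else:
        skewness = kurtosis = 0.0  # constant segment: moments undefined
    return {
        "mean": mean,
        "min": min(segment),
        "max": max(segment),
        "std": std,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "energy": sum(x * x for x in segment),
    }


# Toy stream: 640 samples of a sinusoid with a period of 64 samples.
stream = [math.sin(2 * math.pi * k / 64) for k in range(640)]
segments = sliding_windows(stream)
feature_vectors = [time_domain_features(seg) for seg in segments]
```

With 640 samples, a window of 256 packets, and a step of 128 packets, the sketch yields four segments, each of which is mapped to one feature vector.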

VI. CONCLUSION
In this paper, we explored the feasibility of recognizing HHIs based on CSI signals. Particularly, we presented a new E2EDLF that analyzes the time, frequency, and spatial domains of the CSI signals to recognize the class of the performed HHI. A publicly available CSI dataset of HHIs was utilized to validate the performance of our proposed E2EDLF. Moreover, we have compared the results of our proposed E2EDLF with the results achieved using three well-known pre-trained CNNs and the performance obtained using commonly used handcrafted features that were classified using four different conventional classifiers. The experimental results reported in this study illustrate the ability of our proposed E2EDLF to accurately recognize HHIs based on the analysis of CSI signals. Furthermore, the recognition accuracies achieved by our proposed E2EDLF are considerably higher than the accuracies achieved using the pre-trained CNNs and the accuracies achieved using the traditional handcrafted features.
In the future, we aim to extend our proposed E2EDLF to recognize group activities that involve more than two interacting persons. Furthermore, we intend to explore the potential of applying our proposed E2EDLF to recognize HHIs that are performed in a non-line-of-sight configuration. In addition, we plan to investigate the use of our proposed E2EDLF to recognize different types of fine-grained single-human activities, such as hand gestures and sign language. Moreover, we plan to investigate the possibility of developing CNN architectures that can directly analyze the raw CSI signals without converting them into another representation.