Environment and Person Independent Activity Recognition with a Commodity IEEE 802.11ac Access Point

Here, we propose an original approach for human activity recognition (HAR) with commercial IEEE 802.11ac (WiFi) devices, which generalizes across different persons, days and environments. To achieve this, we devise a technique to extract, clean and process the received phases from the channel frequency response (CFR) of the WiFi channel, obtaining an estimate of the Doppler shift at the receiver of the communication link. The Doppler shift reveals the presence of moving scatterers in the environment, while not being affected by (environment-specific) static objects. The proposed HAR framework is trained on data collected as a person performs four different activities and is tested on unseen setups, to assess its performance as the person, the day and/or the environment change with respect to those considered at training time. In the worst-case scenario, the proposed HAR technique reaches an average accuracy higher than 95%, validating the effectiveness of the extracted Doppler information, used in conjunction with a learning algorithm based on a neural network, in recognizing human activities in a subject- and environment-independent fashion.


I. INTRODUCTION
Human activity recognition (HAR) systems are key elements of emerging applications for smart buildings. Among others, they can be profitably used to optimize the buildings' energy consumption, implement alert systems, and provide platforms for smart entertainment. Also, in a residential scenario, they enable solutions for elderly care, to promptly detect critical situations or, for assisted living, to support physically impaired people in their daily activities [1].
In this work, we present a device-free HAR system for indoor spaces, which relies upon the analysis of scattered WiFi signals. A commercial-off-the-shelf (COTS) WiFi router is used as a sensor to obtain information about moving targets. The Doppler information at the receiving end of the WiFi link is extracted and used to build an environment- and person-independent learning-based algorithm that reliably detects human activities.
Several technologies can serve as sensors for HAR. The most accurate approach relies on wearable devices, such as inertial measurement units (IMUs), smartphones or smartwatches [2]. These analyze the changes in the accelerometer and gyroscope data to recognize human movements. However, they require the user to carry the monitoring device, thus limiting their applicability. For this reason, device-free HAR systems are often preferred. Among these, camera-based solutions have been widely investigated and adopted, but they are highly environment dependent, as either obstacles in the field of view or low light conditions can heavily impact their performance [3]. Moreover, their use poses privacy issues, as the subjects are clearly identifiable. As an alternative, detection and ranging devices based on radio waves (RADAR) and light beams (LIDAR) are emerging [4]. With them, properly designed signals are radiated and their reflections are analyzed to infer human presence and activities [5].
In the last decade, since the pioneering work of Adib and Katabi [6], wireless signals of opportunity have been extensively researched as a means to perform device-free localization and activity recognition tasks, including pattern and gesture recognition, and even estimating biological signals at a distance, such as the respiration rate or the heartbeat [7]. In this respect, WiFi signals are particularly appealing, due to the high availability of WiFi-enabled devices in most residential and working spaces [8]. While active sensing techniques allow localizing and following a user as she/he carries a WiFi-connected device, passive sensing approaches obtain such information by monitoring the changes in the WiFi channel, where users act as intermediate scatterers that modify the channel frequency response (CFR) [9].
In the present work, we exploit an IEEE 802.11ac router to perform device-free HAR. Human movement data is extracted from the channel state information (CSI), a frequency-domain estimate of the radio channel describing how the signal changes as it propagates from the transmitter to the receiver. Channel estimation is continuously performed by WiFi routers for communication purposes, and specific tools allow gathering such information from COTS devices, exploiting them for environmental sensing. So far, many approaches have been presented in the literature, e.g., [8], [9]. However, we stress that, while they work well within the specific physical environment and conditions for which they were trained, they are not able to reliably recognize human activities across different scenarios, days or persons without being re-trained with the new data. That is, they fall short in generalizing across persons and environments.
In this work, we tackle this open issue by considering the Doppler effect measured at the receiver as persons move within the environment. This Doppler shift reveals the velocities of movement of the scattering points during the transmission
• We design an environment- and person-independent framework for HAR, which exploits the Doppler effect caused by moving humans. The information collected at the four monitoring antennas is combined to identify the activity, regardless of the user's position in the monitored area. Our approach is learning based, and reaches an average accuracy higher than 95% in the identification of four activities, namely walking, running, jumping and sitting, in the most challenging scenario, i.e., when the person, the day and the environment change with respect to those in the training set.
The rest of the paper is organized as follows. The related work is reviewed in Section II. In Section III and Section IV, we respectively detail the processing of the CSI data and the learning architecture for HAR. The experimental setup is presented in Section V, while the performance of the proposed approach is evaluated in Section VI. Conclusions are drawn in Section VII. For completeness, we include Appendix A and Appendix B, which provide an overview of the OFDM WiFi channel model and the derivation of the Doppler information.

II. RELATED WORK
A. CSI based human activity recognition
HAR through CSI data from COTS WiFi devices was first studied in, e.g., E-eyes [11] and CARM [12]. More recently, several articles showed the effectiveness of machine learning techniques in building algorithms that distinguish human activities based on CSI features [13]-[19]. However, these works do not focus on the robustness to environmental changes and on the generalization capability to previously unseen environments and subjects, which are key enablers for the successful development of WiFi-based sensing systems [9]. Only a few works try to address these weaknesses. In [20], the authors use a neural network based approach to extract environment-independent features from the CSI amplitude to recognize human movements. The performance of their algorithm is promising, but remains below 80% in the best scenario. In [21], transfer learning is shown to be effective in adapting the WiFi-based HAR algorithm to different persons and days for the same environment. The algorithm presented in [22] leverages generative adversarial networks to generalize to new persons, while in [23] the matching network one-shot learning approach [24] is proposed to bridge the gap between previously seen environments and new ones. A recent work [25] addresses the problem of location and subject independent HAR through a learning architecture consisting of three deep neural networks. The algorithm is trained on the CSI amplitude collected by monitor routers placed in different positions inside a room, and is tested in the same room by changing the location of a single router. The approach presents a significant performance degradation when evaluated on other datasets (around 80% accuracy).
To the best of our knowledge, no work in the literature proposes a solution that generalizes well to unseen environments, days and subjects without any re-training step and using COTS devices. In the present work, we propose an effective solution to this problem, and we consider the popular DeepSense framework as a benchmark for comparison [13]. We selected DeepSense, as it is widely adopted and considers an activity set similar to ours.

B. Exploitation of the CFR phase
Some works, e.g., [26]-[28], consider a reference antenna and use the phase differences at the remaining ones to perform sensing. By construction, such a phase difference is not affected by systematic offsets (same value across the antennas), but it is unable to cope with offsets that are antenna-specific. Exploiting a similar idea, in [29], the authors correct the rotation errors using a reference signal, obtained by connecting one of the available monitoring antennas to the transmitter with a cable. However, the need for a reference reduces the spatial diversity that can be exploited for sensing purposes at the monitor station. The possibility of removing the unwanted phase offsets without exploiting any reference is less addressed in the literature. In [30], the authors propose to mitigate the errors by considering the average signal over a number of time instants, while in [31] a two-step approach is presented to estimate and remove some of the antenna-independent offsets.
In this work, we correct the phase offsets by devising an optimization approach that finds the optimal CFR parameters from the raw channel data.
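The reference-antenna idea discussed above can be illustrated with a small numerical sketch. It is not taken from any of the cited works: the array sizes, the offset values and the helper name `relative_phasor` are all hypothetical, and the CSI is reduced to unit-amplitude phasors. The sketch only shows that an offset shared by all antennas cancels in the phase difference, while antenna-specific offsets survive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ant, n_sub = 4, 8                        # toy sizes, not the paper's setup

true_phase = rng.uniform(-np.pi, np.pi, (n_ant, n_sub))
common = 1.3                               # offset shared by all antennas (made up)
per_antenna = np.array([0.0, 0.4, -0.9, 0.2])[:, None]  # antenna-specific (made up)

def relative_phasor(csi):
    """Unit phasors whose angle is the phase of antennas 1.. relative to antenna 0."""
    return csi[1:] * np.conj(csi[0])

clean = np.exp(1j * true_phase)
with_common = clean * np.exp(1j * common)
with_specific = with_common * np.exp(1j * per_antenna)

# The shared offset disappears from the phase difference...
common_cancels = np.allclose(relative_phasor(with_common), relative_phasor(clean))
# ...whereas the antenna-specific offsets do not.
specific_remains = not np.allclose(relative_phasor(with_specific), relative_phasor(clean))
print(common_cancels, specific_remains)
```

Comparing complex phasors rather than wrapped angles avoids spurious mismatches near the ±π boundary.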

C. Doppler based applications
A few works in the literature exploit the Doppler shift computed from WiFi CSI data. In [32]-[34], the Doppler information is used to track the movement of a user inside an environment. However, two different monitoring devices, placed in strategic positions, are used to obtain good estimates of the Doppler shift. A sensing system consisting of a single monitoring station is presented in [35], to recognize three different bodyweight exercises. In that work, the transmitter and the monitor stations must be placed in specific positions with respect to the user, so that the activity is performed perpendicularly to the radio link. Moreover, the detection algorithm is based on metrics that are manually extracted from the raw CSI data and the Doppler trace. This makes the proposed approach highly activity-dependent, as the features are specifically designed to separate the three considered physical exercises. HAR through Doppler information has been recently addressed in [36]. The authors reach a high recognition accuracy (96%) on test data collected in the same environment where the algorithm is trained, proving the effectiveness of the Doppler information for HAR purposes. However, they do not provide any performance assessment in a different scenario.

III. CSI DATA PROCESSING
WiFi systems adopt orthogonal frequency division multiplexing (OFDM), transmitting the user information over $M$ partially overlapping and orthogonal sub-channels, with $M$ even. At the OFDM receiver, the channel parameters (amplitudes and phases) are continuously estimated for all the sub-carriers. This information is computed for each received packet based on known preamble symbols, and is collected in the CSI, a large (environment-dependent) complex matrix describing the channel frequency response (CFR) for each sub-channel along every spatial stream (antenna).
For each pair of transmit and receive antennas, the CFR is a set of complex numbers $A_m e^{j\phi_m}$ specifying the attenuation $A_m$ and the phase shift $\phi_m$ for each sub-carrier $m \in \{-M/2, \dots, M/2-1\}$. Considering a multi-path propagation channel, and using index $p \in \{0, \dots, P-1\}$ to indicate the $P$ copies of the transmitted signal that are collected at the receiver, the CFR estimated on packet $n$ is written as
$$H_m(n) = \sum_{p=0}^{P-1} A_p(n)\, e^{-j2\pi (f_c + m/T)\,\tau_p(n)}, \tag{1}$$
where $f_c$ is the main carrier frequency, $T = 1/\Delta f$ is the OFDM symbol time, while $A_p(n)$ and $\tau_p(n)$ respectively represent the attenuation and the delay associated with path $p$. The complete OFDM model for the WiFi channel, along with a derivation of Eq. (1), is given in Appendix A. In this work, the CFR matrix is estimated by a monitor WiFi device running the Nexmon CSI tool [37], as detailed in Section V. We remark that the collected CFR slightly deviates from the theoretical model in Eq. (1) due to hardware artifacts, which introduce an undesired phase offset $\phi_{\text{offs},m}$, i.e.,
$$\bar{H}_m(n) = H_m(n)\, e^{j\phi_{\text{offs},m}(n)}. \tag{2}$$
Next, we present the steps that we implemented to clean the CFR matrix that is extracted by the Nexmon CSI tool. Note that a new CFR matrix is retrieved for each received packet. Thus, the interval between subsequent acquisitions of the CFR is variable but, for the sake of exposition, in the following analysis we assume that a new sample is available every $T_c$ seconds, with $T_c$ fixed. We remark that, in a real WiFi network, the traffic exchanged between already deployed access points and user devices (e.g., streaming services) can be captured and exploited for HAR purposes. When this is not possible, two WiFi access points are required to recreate the setup, using a simple network controller to transmit packets at regular intervals and a second monitor device to infer the CFR from these.

A. Phase sanitization
The undesired phase offset $\phi_{\text{offs},m}$ in Eq. (2) contains different contributions [29], [38]; see Appendix A for a detailed explanation. Some of them, namely the carrier frequency offset (CFO), the phase-locked loop offset (PPO) and the phase ambiguity (PA), although changing in time, have the same value across the sub-carriers in each spatial stream (receiving antenna). The sampling frequency offset (SFO) and the packet detection delay (PDD) are instead sub-carrier dependent. As shown in Appendix A, the offset $\phi_{\text{offs},m}$ experienced at one receiving antenna in sub-channel $m$ can be expressed as
$$\phi_{\text{offs},m} = \phi_{\text{CFO}} + \phi_{\text{PPO}} + \phi_{\text{PA}} - \frac{2\pi m\,(\tau_{\text{SFO}} + \tau_{\text{PDD}})}{T}. \tag{3}$$
The main idea of our approach for phase sanitization is that the contribution of each path in Eq. (2) is affected by the same phase shift $\phi_{\text{offs},m}$. Hence, if we were able to separate the different paths, we could use (any) one of them as a reference to remove the phase offset from the CFR. In our method, the strongest path is used to this end, as this is the path whose parameters are most reliably estimated at the receiver. The details are given below. (Note that in the following analysis the time index $n$ is omitted in the interest of readability.)
Let $\mathbf{h}$ be the $M$-dimensional vector collecting the CFR information for the $M$ sub-channels,
$$\mathbf{h} = \left[\bar{H}_{-M/2}, \dots, \bar{H}_{M/2-1}\right]^{T}, \tag{4}$$
where $\bar{H}_m$ denotes the measured (offset-affected) CFR on sub-carrier $m$. To separate the $P$ multi-path contributions, we define a grid of $P'$ possible paths with $P' > P$ and solve a minimization problem to select the $P$ components out of the $P'$ ones that contribute to the CFR. We consider the following decomposition
$$\mathbf{h} = \mathbf{T}\mathbf{r}, \tag{5}$$
where $\mathbf{T}$ is an $(M \times P')$-dimensional matrix collecting the contributions to the CFR that depend on the sub-carrier index $m$, while $\mathbf{r}$ is a $P'$-dimensional column vector representing the sub-carrier independent terms. The $P'$-dimensional row vectors $\mathbf{T}_m$ are defined through a dictionary of candidate total delays $\tau_{p,\text{tot}} = \tau_p + \tau_{\text{SFO}} + \tau_{\text{PDD}}$, with $p \in \{0, \dots, P'-1\}$, as
$$\mathbf{T}_m = \left[ e^{-j2\pi m \tau_{0,\text{tot}}/T}, \dots, e^{-j2\pi m \tau_{P'-1,\text{tot}}/T} \right]. \tag{6}$$
The column vector $\mathbf{r}$ can instead be modeled as
$$\mathbf{r} = e^{j(\phi_{\text{CFO}}+\phi_{\text{PPO}}+\phi_{\text{PA}})} \left[ A_0 e^{-j2\pi f_c \tau_0}, \dots, A_{P'-1} e^{-j2\pi f_c \tau_{P'-1}} \right]^{T} \tag{7}$$
and is obtained by solving a minimization problem over the candidate grid, referred to as P1. The non-zero entries in the solution $\mathbf{r}$ reveal the presence of a path $p$ with corresponding total delay $\tau_{p,\text{tot}}$.
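Using the decomposition in Eq. (5), the vector $\mathbf{r}$ found by solving problem P1, and Eq. (6), we are able to separate the contributions of the different paths in the CFR for each sub-carrier $m$ by applying the following Hadamard product
$$\mathbf{X}_m = \mathbf{T}_m \circ \mathbf{r}^{T}. \tag{9}$$
$\mathbf{X}_m$ can be rewritten by replacing the terms in Eq. (9), as
$$\mathbf{X}_m = e^{j\phi_{\text{offs},m}} \left[ A_0 e^{-j2\pi(f_c+m/T)\tau_0}, \dots, A_{P'-1} e^{-j2\pi(f_c+m/T)\tau_{P'-1}} \right]. \tag{10}$$
At this point, we define $p^* \in \{0, \dots, P'-1\}$ as the position where $\mathbf{r}$ has the highest amplitude, which, in turn, is associated with the strongest path. Let $X_{m,p^*}$ be the $p^*$-th entry of $\mathbf{X}_m$. By multiplying $\mathbf{X}_m$ by the complex conjugate of $X_{m,p^*}$, we remove the phase components that are constant across all the paths, including the offset, obtaining
$$\tilde{\mathbf{X}}_m = \mathbf{X}_m X^{*}_{m,p^*} = A_{p^*} \left[ A_0 e^{-j2\pi(f_c+m/T)(\tau_0 - \tau_{p^*})}, \dots, A_{P'-1} e^{-j2\pi(f_c+m/T)(\tau_{P'-1} - \tau_{p^*})} \right]. \tag{11}$$
Hence, summing up the elements of $\tilde{\mathbf{X}}_m$ (associated with the paths from $p = 0$ to $p = P'-1$), we attain a CFR estimate for each of the sub-carriers, where the phase offset is mitigated, as follows,
$$\hat{H}_m = \sum_{p=0}^{P'-1} \tilde{X}_{m,p} = A_{p^*} \sum_{p=0}^{P'-1} A_p\, e^{-j2\pi(f_c+m/T)\tilde{\tau}_p}, \tag{12}$$
where $\tilde{\tau}_p = \tau_p - \tau_{p^*}$. Note that $\hat{H}_m$ of Eq. (12) represents the CFR estimate for sub-carrier $m$, where the phase offset has been removed and the contribution of each path is modulated according to the amplitude and phase of path $p^*$.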
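The sanitization steps of this subsection can be sketched numerically as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the sparse fit is done with a plain orthogonal matching pursuit (a stand-in for problem P1, whose exact formulation and solver are not reproduced here), the carrier frequency value is assumed, the channel is noise-free, and the path delays are placed exactly on the candidate grid so that recovery is exact. The function names `cfr`, `omp` and `sanitize` are hypothetical.

```python
import numpy as np

def cfr(delays, amps, m, f_c, T, phi_const=0.0, tau_off=0.0):
    """Multi-path CFR per sub-carrier, rotated by a hardware phase offset."""
    paths = amps * np.exp(-2j * np.pi * (f_c + m[:, None] / T) * delays)
    # phi_const plays the role of CFO+PPO+PA, tau_off the role of SFO+PDD delays
    phi_offs = phi_const - 2 * np.pi * m * tau_off / T
    return paths.sum(axis=1) * np.exp(1j * phi_offs)

def omp(T_mat, h, n_paths):
    """Greedy sparse fit of h ~ T_mat @ r (swapped in for the paper's P1)."""
    support, residual = [], h.copy()
    for _ in range(n_paths):
        support.append(int(np.argmax(np.abs(T_mat.conj().T @ residual))))
        coef, *_ = np.linalg.lstsq(T_mat[:, support], h, rcond=None)
        residual = h - T_mat[:, support] @ coef
    r = np.zeros(T_mat.shape[1], dtype=complex)
    r[support] = coef
    return r

def sanitize(h, m, tau_grid, T, n_paths):
    T_mat = np.exp(-2j * np.pi * np.outer(m, tau_grid) / T)  # M x P' dictionary
    r = omp(T_mat, h, n_paths)
    X = T_mat * r                          # row m holds the Hadamard product T_m o r^T
    p_star = int(np.argmax(np.abs(r)))     # strongest path on the grid
    X_tilde = X * np.conj(X[:, [p_star]])  # removes phase terms common to all paths
    return X_tilde.sum(axis=1)             # offset-free CFR estimate per sub-carrier

M, T, f_c = 256, 1 / 312.5e3, 5.21e9       # f_c is an assumed value
m = np.arange(-M // 2, M // 2)
step = T / M                               # grid spacing making the dictionary orthogonal
tau_grid = step * np.arange(64)            # P' = 64 candidate total delays
delays = step * np.array([4, 9, 17])       # P = 3 true path delays (made up)
amps = np.array([1.0, 0.6, 0.3])

# Same channel observed with two different hardware offsets.
h_a = cfr(delays, amps, m, f_c, T, phi_const=0.7, tau_off=3 * step)
h_b = cfr(delays, amps, m, f_c, T, phi_const=-1.9, tau_off=10 * step)
H_a = sanitize(h_a, m, tau_grid, T, n_paths=3)
H_b = sanitize(h_b, m, tau_grid, T, n_paths=3)
print(np.allclose(h_a, h_b), np.allclose(H_a, H_b))
```

In this toy setting the raw vectors differ while the sanitized ones coincide, since each sanitized entry only depends on delay differences with respect to the strongest path. With noisy data or off-grid delays the recovery would only be approximate.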

B. Doppler trace computation
Fig. 1 shows how the amplitude and the phase change during a time interval of 3 seconds in two different scenarios, i.e., an empty room (left plots) and a room with a moving person (right plots). The presence of a human induces changes in the extracted channel parameters, which can be exploited by HAR algorithms. However, the CFR is affected by the whole indoor multi-path propagation environment and hence also accounts for the reflections from static objects. The reflected signals combine differently at each sub-carrier, and the delays induce sub-carrier specific phase shifts (see Eq. (1)). Such behavior is environment-specific and is clearly identifiable in the amplitude plots of Fig. 1 (see the horizontal patterns in the figure). This fact does not allow developing robust algorithms for HAR that generalize across different environments, as the CFR is strongly affected by the room configuration itself (static objects, including walls, and the room shape). Also, even considering the same indoor space, slight changes in the position of the objects therein have a non-negligible effect on the measured CFR, thus making the HAR task more challenging. As an example, Fig. 2 shows the CFR amplitude collected on two different days within the same empty room.
Due to these facts, in our framework we exploit the Doppler effect to obtain effective features for environment-independent HAR. The Doppler effect corresponds to a shift in the signal phase measured at the receiver as the geometry of the multi-path propagation changes during a transmission event. The movements of the scatterers cause variations in the time taken for the signal to reach the receiver through each of the propagation paths. This is reflected in a phase shift in the received OFDM signal and, in turn, in the CFR matrix (see Appendix A for a description of the OFDM model). Considering a path $p$, its associated delay, $\tau_p(n)$, can be expressed as the sum of two contributions. Let $\ell_p$ be the path length related to the initial position of the scattering point and $\Delta_p(n)$ be the variation caused by the movement of the point during the transmission period $nT_c$. We have that
$$\tau_p(n) = \frac{\ell_p + \Delta_p(n)}{c}, \tag{13}$$
with
$$\Delta_p(n) = -\int_{0}^{nT_c} v_p(x) \cos\alpha_p(x)\, dx, \tag{14}$$
where $c$ is the speed of light, $v_p$ indicates the speed of scatterer $p$, while $\cos\alpha_p$ results from the combination of sinusoidal functions related to the angles of motion of the scatterer, and the angles of arrival and departure of the signal. The activity-related movements of a human cause complex variations in the phase, as each body part acts as a scatterer moving at a specific velocity $v_p(x)$. This is revealed by the Doppler vector computed from the CFR matrix through a short-time Fourier transform over $N$ subsequent channel estimates, i.e., during a channel observation window. These $N$ estimates are acquired by extracting the CFR matrix at the WiFi monitor for $N$ subsequent packets, collected with sampling period $T_c$. The value of $N$ is selected so that the attenuation, the velocities, and the angles can be considered fixed during the $i$-th observation window $[iNT_c, (i+1)NT_c]$, with $i \geq 0$.
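This allows us to rewrite $\Delta_p(n)$ (Eq. (14)), for the packets $n$ within the $i$-th observation window, as
$$\Delta_p(n) = \Delta_p(iN) - v_p \cos\alpha_p\,(n - iN)\,T_c. \tag{15}$$
We define $\mathbf{H}_i$ as the $M \times N$ dimensional CFR matrix associated with the $i$-th observation window, and compute the Doppler vector as
$$d_i(u) = \sum_{m=-M/2}^{M/2-1} \left| \mathcal{F}\{\mathbf{H}_i\}(m, u) \right|^2, \tag{16}$$
where $u \in \{-N_D/2, \dots, N_D/2-1\}$ and $\mathcal{F}\{\cdot\}$ indicates the Fourier transform, computed along the time dimension. For completeness, the mathematical expression for $d_i(u)$ is derived in Appendix B. The non-zero entries $u$ in the Doppler vector reveal the presence of a scatterer with associated velocity
$$v_u = \frac{u\,c}{f_c\, T_c\, N_D}, \tag{17}$$
where $c$ is the speed of light.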
In the remainder of this paper, we refer to Doppler trace as the matrix obtained by stacking a number of subsequent Doppler vectors, computed for consecutive observation windows.
As an example, in Fig. 3 we plot three-second-long Doppler traces obtained as a person performs each of the four considered activities inside a room, compared with an empty room case. As expected, the Doppler trace related to the empty room only presents non-negligible power at the zero-velocity bin, revealing no movement in the environment. Instead, in the presence of a moving person, the power is spread across different bins, reflecting the human-related multi-path changes.
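The Doppler vector and Doppler trace described in this subsection can be sketched as below. This is an illustrative reading of the text rather than the authors' implementation: it assumes that the Doppler vector is the squared magnitude of a zero-padded FFT along time, summed over sub-carriers, and that bin $u$ maps to velocity $u\,c/(f_c T_c N_D)$. The carrier frequency, the toy signals and the function names are assumptions.

```python
import numpy as np

C = 299_792_458.0  # speed of light [m/s]

def doppler_vector(H_i, n_fft):
    """H_i: (M, N) complex CFR window. Returns n_fft powers, zero-velocity bin centered."""
    spectrum = np.fft.fft(H_i, n=n_fft, axis=1)       # zero-padded FFT along time
    spectrum = np.fft.fftshift(spectrum, axes=1)      # bins ordered -n_fft/2 .. n_fft/2-1
    return np.sum(np.abs(spectrum) ** 2, axis=0)      # sum power over sub-carriers

def velocity_bins(n_fft, f_c, T_c):
    """Velocity associated with each (shifted) Doppler bin."""
    u = np.arange(-(n_fft // 2), n_fft - n_fft // 2)
    return u * C / (f_c * T_c * n_fft)

def doppler_trace(H, N, n_fft):
    """Stack Doppler vectors from sliding windows of N samples; H is (M, n_samples)."""
    starts = range(H.shape[1] - N + 1)
    return np.stack([doppler_vector(H[:, s:s + N], n_fft) for s in starts])

f_c, T_c, N, N_D = 5.21e9, 6e-3, 31, 100              # f_c is an assumed value
n = np.arange(N)
static = np.ones((4, N), dtype=complex)               # toy "empty room": constant CFR
v_true = 1.5                                          # toy scatterer speed [m/s]
moving = np.exp(2j * np.pi * f_c * v_true * n * T_c / C) * np.ones((4, 1))

v = velocity_bins(N_D, f_c, T_c)
v_static = v[np.argmax(doppler_vector(static, N_D))]
v_moving = v[np.argmax(doppler_vector(moving, N_D))]
trace = doppler_trace(np.ones((4, 40), dtype=complex), N, N_D)
print(v_static, v_moving, trace.shape)
```

In this toy case the static channel peaks at the zero-velocity bin, while the single moving scatterer peaks within one bin of its true speed. The bin spacing, roughly 0.1 m/s with these parameters, sets the velocity resolution.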

IV. LEARNING ARCHITECTURE FOR HAR
The proposed HAR framework is conceived to fully exploit the data gathered from a commercial WiFi access point, i.e., using amplitude and phase information from the CFR, and combining this data for the available $N_{\text{ant}}$ receiving antennas. Activity recognition is performed using $N_w$ subsequent channel estimates at a time, which amounts to monitoring the channel for $N_w T_c$ seconds. Note that this can be implemented in a sliding window fashion, thus updating the activity label at every new CSI sample.
The HAR algorithm consists of two steps.
1) First, we compute the Doppler traces from the data collected at all the receiving antennas and we use them to obtain activity estimates through a neural-network based algorithm (Section IV-A). Specifically, the machine learning based prediction chain that we build is independently applied to the data stream coming from each of the available antennas.
2) As a result of the previous step, we obtain $N_{\text{ant}}$ independent predictions, one per antenna, which are combined, in a second step, through a decision fusion method that leads to the final activity estimate (Section IV-B).
We remark that an alternative approach that combines the data from the antennas (at the input of the above decision chains) would also be possible. Such combined input data (the Doppler traces coming from the antennas) would then be fed to a neural network to classify the activity. We experimentally verified that this is not a viable method to obtain an environment-invariant classifier, as the classification result depends on the antenna ordering and, in turn, on the person's location inside the room.
Fig. 4: HAR classification architecture for the single antenna system. The input of the network consists of 340 (about two seconds) long Doppler traces with 100 velocity bins.

A. Activity classifier for the single antenna system
We framed the problem as a multi-class classification task with five classes (four user activities plus empty room), tackling it through a learning-based approach. The designed neural-network classifier takes as input the Doppler trace from a single antenna element, i.e., an $N_w \times N_D$-dimensional matrix obtained by stacking $N_w$ subsequent Doppler vectors, and returns the label of the activity being performed. The classification architecture for the single-antenna system is shown in Fig. 4.
To start with, the input is fed to a simplified Inception module with dimensionality reduction. This block extracts significant features from the Doppler trace at several scales by using layers with different kernel sizes in a parallel fashion. The structure is inspired by the reduction block proposed for the Inception-v4 neural network in [39], and consists of three branches combining max-pooling (MAXPOOL) and convolutional (CONV) layers. The $N_w/2 \times N_D/2$ dimensional feature maps obtained at the output of each branch are concatenated and passed to the following convolutional filter with $1 \times 1$ kernel, used to reduce the number of feature maps from 15 to 3. Then, the output of this filter is flattened and passed to a fully connected (dense) layer with five output neurons, one for each activity class. The Dropout technique is used as a regularization strategy, randomly zeroing 20% of the elements in the flattened vector preceding the dense layer. Overall, the proposed neural network has 128,535 parameters.
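As a consistency check on the sizes quoted above, the parameter budget can be split with simple arithmetic. The breakdown below is only a sketch: it assumes that the dense layer and the $1 \times 1$ convolution both include bias terms and that no other trainable layers (e.g., normalization) are present, none of which is stated in the text.

```python
N_w, N_D = 340, 100          # Doppler trace size stated in the text
n_maps, n_classes = 3, 5     # feature maps after the 1x1 convolution, output classes
total_reported = 128_535     # overall parameter count stated in the text

flattened = n_maps * (N_w // 2) * (N_D // 2)        # 3 * 170 * 50 inputs to the dense layer
dense_params = flattened * n_classes + n_classes    # weights plus biases (bias assumed)
conv1x1_params = 15 * n_maps + n_maps               # 15 -> 3 maps, 1x1 kernel, bias assumed
inception_params = total_reported - dense_params - conv1x1_params

print(flattened, dense_params, conv1x1_params, inception_params)
```

Under these assumptions the dense layer accounts for 127,505 of the 128,535 parameters, which would leave fewer than a thousand parameters for the Inception-style branches, consistent with a deliberately lightweight convolutional front end.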
The classifier is trained in a supervised manner on data collected from a single indoor environment, using the cross-entropy loss function. The Doppler traces collected from the different antennas are used without any distinction among them, i.e., they are all added to a single training set, without keeping track of the antenna that generated them. Using the trained architecture of Fig. 4, each input trace is associated with the activity having the highest score in the five-dimensional output vector, referred to as the activity vector.

B. Decision fusion for the multiple antenna system
At runtime, the trained classification engine from the single antenna system is independently applied to the Doppler stream gathered by each antenna. This returns $N_{\text{ant}}$ independent classification outcomes that are combined as we now explain. In detail, for each antenna we obtain a five-dimensional activity vector (the classifier output) and an activity label, corresponding to the largest element in the activity vector. When at least $N_{\text{ant}} - 1$ activity labels agree on a certain activity, there is a clear winner, and the Doppler trace is associated with that activity. Otherwise, an overall decision vector is computed by summing, element-wise, the $N_{\text{ant}}$ activity vectors. The trace is then associated with the activity having the highest score in this decision vector.
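The fusion rule just described can be written compactly as follows. This is a sketch of the stated rule only; the function name `fuse_decisions` and the example scores are hypothetical, and ties in the summed decision vector are resolved by taking the first maximum, a detail the text does not specify.

```python
import numpy as np

def fuse_decisions(activity_vectors):
    """Combine per-antenna classifier outputs, given as an (N_ant, n_classes) array."""
    scores = np.asarray(activity_vectors, dtype=float)
    n_ant = scores.shape[0]
    labels = scores.argmax(axis=1)                       # one activity label per antenna
    counts = np.bincount(labels, minlength=scores.shape[1])
    if counts.max() >= n_ant - 1:                        # clear winner among the labels
        return int(counts.argmax())
    return int(scores.sum(axis=0).argmax())              # fall back to summed scores

# Three of four antennas agree on class 1: the majority label wins.
clear = [[0.1, 0.8, 0.1, 0.0, 0.0],
         [0.2, 0.7, 0.1, 0.0, 0.0],
         [0.0, 0.3, 0.7, 0.0, 0.0],
         [0.1, 0.6, 0.3, 0.0, 0.0]]
# Two-versus-two split: the element-wise sum decides (class 1 here).
split = [[0.60, 0.40, 0.0, 0.0, 0.0],
         [0.55, 0.45, 0.0, 0.0, 0.0],
         [0.10, 0.90, 0.0, 0.0, 0.0],
         [0.20, 0.80, 0.0, 0.0, 0.0]]
print(fuse_decisions(clear), fuse_decisions(split))
```

With four antennas, the agreement threshold of three labels guarantees a unique winner; with very few antennas the threshold becomes loose, which is worth keeping in mind if the rule is reused in other setups.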

V. EXPERIMENTAL SETUP
In this section we present the experimental setup designed for training and validating our HAR system.We first introduce the CSI extraction method (Section V-A), and then present the measurement scenarios and campaigns selected for building the dataset (Section V-B).

A. Nexmon extraction tool
Although most commercial WiFi chipsets can potentially generate CSI data, few manufacturers make this data available to developers and researchers, especially for modern chipsets. Hence, the majority of the WiFi-based HAR works in the literature have used outdated chipsets, for which some CSI extraction tools have been developed over the years: widely used ones are [40], [41], which target network interface cards implementing the IEEE 802.11n protocol. Recently, as part of the Nexmon project [37], [42], [43], a firmware patch has been released that allows the extraction of CSI from specific Broadcom/Cypress WiFi chipsets.
In the present work, we use the Nexmon CSI extraction tool presented in [10] to obtain CSI data from an Asus RT-AC86U IEEE 802.11ac WiFi router. The extraction tool is compatible with the very-high-throughput mode, defined by IEEE 802.11ac, working with a total bandwidth of 80 MHz. Each CSI sample results in complex-valued channel information from 242 data sub-carriers for each transmit-receive antenna pair. In our experiments, with one transmitter antenna and four receiver ones, each CSI sample corresponds to four vectors of 242 complex values. Although the total number of sub-carriers

B. Dataset acquisition and organization
In order to analyze the effects of different room geometries and static obstacles, we collected CSI samples in three different environments, a bedroom (Fig. 5-a), a living room (Fig. 5-b) and a University laboratory (Fig. 6), where one person moves within the area. Specifically, we obtain data from three volunteers (a male, P1, and two females, P2, P3) while they are walking or running around, jumping in place, or sitting somewhere in the room. The CSI samples are collected by a monitor node, implemented on an Asus router equipped with $N_{\text{ant}} = 4$ antennas and running the Nexmon firmware. To generate the signals of opportunity exploited for sensing, we set up a WiFi transmission link using two Netgear X4S AC2600 WiFi routers, equipped with Qualcomm Atheros chipsets. The packets are transmitted through a single antenna, by using a fixed modulation and coding scheme (namely, MCS 4), by disabling frame aggregation, and by setting the source rate to 173 packets per second. Since the Nexmon tool is configured for reading CSI samples on data packets only (i.e., by neglecting the acknowledgement frames sent by the receiver), a new channel estimate is generated every $T_c \approx 6 \times 10^{-3}$ s. Fig. 5 and Fig. 6 display the positions of transmitter, receiver and monitor routers in the different environments. The black boxes represent the transmitter (Tx) and the receiver (Rx), while the red boxes labeled with Mj (with $j \in \{1, \dots, 4\}$) indicate the position of the monitor station in different measurement sets. The activities are performed in the areas identified by a color and a number. Note that in the bedroom (Fig. 5-a), the direct path between Tx and M2 is occluded by the bookcase in the middle of the room (grey rectangle in the figure).
Table I: For each set Sj we specify the position of the monitor station (Mj) (see Fig. 5 and Fig. 6), the person (Pi) performing the activity, and the presence of a direct path between the transmitter and the monitor. The last column indicates whether the set is used for training the proposed HAR algorithm or testing its performance.
We performed several measurement campaigns and we grouped them into seven sets, Sj with $j \in \{1, \dots, 7\}$, each corresponding to a different triplet of environment-day-person. For each set, Table I provides the position of the monitoring station (Mj), the area where the activities take place (identified by the color), and the person performing them (Pi). We also include an indication of whether there is a direct path between the transmitter and the monitor stations. The configuration for sets S1 and S2 is the same apart from the day of measurement. Set S1 is used to train the model, while set S2 is used to test the generalization over different days. In S3 we monitor the same environment as the previous sets, but on a different day and with a different person performing the activities. For sets S4 and S5, the person is required to move in both areas 1 and 2 of the bedroom depicted in Fig. 5-a. In these configurations, the direct path between the transmitter and the monitor is disturbed by the bookcase. Sets S6 and S7 are collected on two different days and in two different environments. S7 represents the most challenging situation, in which no element (room, day, person) is in common with the scenario used for the training set.
Each measurement campaign involves 120 seconds of data for each activity, plus an additional trace of 120 seconds of data collected when the room is empty. Activities are repeated continuously by the volunteers during the trace acquisition time. The campaigns have been performed on different days over several months (April-December, 2020), resulting in a considerable time diversity. Overall, we collected nearly 120 minutes of CSI data, consisting of the CFRs estimated at all four antennas of the monitor station.

VI. EXPERIMENTAL RESULTS
The HAR algorithm has been tested on the scenarios detailed in Table I.The adopted communication and processing parameters are summarized in Table II and are presented in Section VI-A.The performance is discussed in Section VI-B.

A. Pre-processing steps and training
We considered some pre-processing operations for extracting the features of interest from the dataset.For each CSI sample, we divided the CFR values by the mean amplitude over the 242 monitored sub-carriers to remove unwanted amplifications.Then, the phase sanitization algorithm presented in Section III-A is applied.The CFR values on the three central sub-carriers are reconstructed together with the other 242 using Eq. ( 11), obtaining a CFR complex-valued vector (amplitudes and phases) of 245 components.
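The amplitude normalization mentioned above can be sketched as follows. The array layout (antennas by sub-carriers) and the function name `normalize_csi` are assumptions made for illustration, and the random data is a stand-in for a real CSI sample.

```python
import numpy as np

def normalize_csi(csi):
    """Divide each CSI vector by its mean amplitude over the sub-carriers (last axis)."""
    mean_amp = np.abs(csi).mean(axis=-1, keepdims=True)
    return csi / mean_amp

rng = np.random.default_rng(1)
sample = rng.normal(size=(4, 242)) + 1j * rng.normal(size=(4, 242))  # 4 antennas, 242 sub-carriers
amplified = 7.5 * sample                     # same channel seen with a different gain

same_after_norm = np.allclose(normalize_csi(sample), normalize_csi(amplified))
unit_mean = np.allclose(np.abs(normalize_csi(sample)).mean(axis=-1), 1.0)
print(same_after_norm, unit_mean)
```

After this step, two acquisitions that differ only by a scalar gain yield the same vector, and phases are left untouched.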
Doppler vectors are computed considering N = 31 subsequent samples of the the sanitized CFR values.The velocity resolution is increased by zero-padding the signal out to N D = 100 points before applying the Fourier transform.A threshold is used on the resulting Doppler vectors to remove noisy contributions with power smaller than 12 dB.Finally, the Doppler trace acting as input features in our HAR system is built by stacking N w = 340 consecutive Doppler vectors.Doppler vectors are generated for each CSI acquisition by using a sliding window mechanism.Therefore, the complete Doppler trace lasts roughly 340 • T c = 2 s of measurements.
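The time scales implied by these parameters follow from simple arithmetic, sketched below. It assumes that one CSI sample is obtained per transmitted data packet at the stated source rate, so that the sampling period is the reciprocal of 173 packets per second.

```python
rate = 173                      # data packets (and CSI samples) per second
N, N_w, N_D = 31, 340, 100      # window length, trace length, FFT points

T_c = 1 / rate                  # sampling period, about 5.8 ms
window_s = N * T_c              # duration of one Doppler observation window
trace_s = N_w * T_c             # duration covered by one input Doppler trace
doppler_bin_hz = 1 / (N_D * T_c)   # Doppler frequency spacing between adjacent bins
max_doppler_hz = 1 / (2 * T_c)     # largest unambiguous Doppler frequency

print(round(T_c, 5), round(window_s, 3), round(trace_s, 2),
      round(doppler_bin_hz, 2), round(max_doppler_hz, 1))
```

Each Doppler vector therefore summarizes about 0.18 s of channel measurements, and a full input trace spans just under two seconds, matching the figures quoted in the text.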
The HAR algorithm presented in Section IV has been trained by using the features extracted on set S1.More into details, 60% of data in S1 composes the training set, while the remaining is evenly divided between validation and test sets.The other measurement sets, S2-S7, are only used in the test phase.
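As an illustration of the 60/20/20 partition of S1, the sketch below splits sample indices into three contiguous ranges. Whether the split is contiguous in time or shuffled is not stated in the text, so the contiguous choice is an assumption, as is the helper name `split_indices`. The figure of 20000 samples corresponds to one 120 s trace sampled every T_c = 6 ms.

```python
def split_indices(n_samples, train_frac=0.6):
    """Return (train, validation, test) index ranges.

    The training part takes train_frac of the samples; the remainder is
    divided evenly between validation and test.
    """
    n_train = int(round(n_samples * train_frac))
    n_val = (n_samples - n_train) // 2
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_samples)
    return train, val, test


# One 120 s trace at one channel estimate every 6 ms gives 20000 samples.
train_idx, val_idx, test_idx = split_indices(20000)
```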

B. HAR performance assessment
The classification accuracy and the F1-score obtained on each of the test sets are reported in the leftmost part of Table III. Note that, for S1, the evaluation is performed on the test set only, i.e., on new data never seen during training. Overall, the mean recognition accuracy is higher than 95%, reaching almost 100% when the environment and the location of the monitor node (M1) remain the same as in the training data, regardless of the day of measurement and the person performing the activity (P1 or P2). We still obtain a good accuracy, around 97%, when the direct path between the transmitter and the monitor stations is blocked (sets S4 and S5). In this situation, the running activity is sometimes confused with the walking one, as revealed by the F1-score associated with these classes. However, this is a reasonable limit of our approach, as the activities are recognized by considering the velocity of different parts of the human body (i.e., the scattering sources), which can be very similar when people walk or run in indoor spaces. We also believe that a small confusion between the running and walking activities is admissible in almost all the fields of application of a WiFi sensing system. For example, in a residential scenario, it is more relevant to correctly distinguish between static and dynamic activities than to provide details about the type of movement.
To test the generality of our algorithm under different environments, we consider the measurement sets S6 and S7. When the person is the same as in the training data, the average HAR accuracy approaches 100%, while it decreases to 95.99% when both the person and the room change. Again, the lowest accuracy is achieved for the walking activity, which is wrongly classified as the running one, as displayed in the normalized confusion matrix in Fig. 7.
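For reference, the metrics used in this assessment can be computed from a confusion matrix as in the sketch below. The counts are made-up toy values chosen only to exercise the code; they are not the results of Table III or Fig. 7. Rows are assumed to index the true class and columns the predicted one.

```python
CLASSES = ["empty", "sitting", "walking", "running", "jumping"]


def normalize_rows(cm):
    """Row-normalize a confusion matrix (each true class sums to one)."""
    return [[count / sum(row) for count in row] for row in cm]


def accuracy(cm):
    """Fraction of samples on the main diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(map(sum, cm))


def f1_scores(cm):
    """Per-class F1 = 2 TP / (2 TP + FP + FN)."""
    scores = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(row[i] for row in cm) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy counts (NOT measured data): three walking samples labeled as running.
toy_cm = [
    [10, 0, 0, 0, 0],
    [0, 10, 0, 0, 0],
    [0, 0, 7, 3, 0],
    [0, 0, 0, 10, 0],
    [0, 0, 0, 0, 10],
]
toy_accuracy = accuracy(toy_cm)
toy_f1 = dict(zip(CLASSES, f1_scores(toy_cm)))
toy_normalized = normalize_rows(toy_cm)
```

In this toy case, a walking-to-running confusion lowers the F1-score of both classes: the recall of walking drops, and so does the precision of running. The same mechanism explains why both classes are affected in the discussion above.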
On the rightmost part of Table III, we report the metrics obtained using the state-of-the-art DeepSense approach [13] in place of the single antenna classifier described in Section IV-A (i.e., before decision fusion). DeepSense relies on the sole CSI amplitude to perform HAR. It uses a learning-based algorithm consisting of an encoder for feature extraction followed by a classifier. The encoder is trained as part of an auto-encoder structure to learn the most representative characteristics of the input signal. The code extracted from the amplitude trace is then processed by an activity classifier, which consists of convolutional and recurrent layers. Overall, the total number of parameters of DeepSense is comparable to that of our approach. For a fair comparison, we trained DeepSense with the same portion of the set S1 used for training our architecture. At run-time, the decision fusion approach in Section IV-B is applied to combine the output of the classifiers working on the single antenna data and obtain the final activity label. From Table III, we observe that the performance is in line with ours when considering the same environment as the training set, while it degrades remarkably in all the other cases, even when data are collected in the same environment used for training, but on a different day (e.g., set S2). These results demonstrate that the sole amplitude information is insufficient to provide a good level of generalization across different situations. This is particularly evident for the identification of empty rooms, which in principle should be easily characterized by the lack of moving objects. DeepSense completely fails in identifying empty rooms in four out of the six sets S2-S7. Indeed, as shown and discussed in Section III, the CFR amplitude is environment-dependent and is also affected by small scene variations over days, such as changes in the positioning of the monitor node, room obstacles, and objects. Conversely, our HAR system generalizes across different days, environments and persons, especially for distinguishing between static and dynamic environments.

VII. CONCLUDING REMARKS
In this work, a novel low-cost system for human activity recognition (HAR) in indoor spaces has been presented and experimentally validated. This system analyzes WiFi signals scattered into the environment and runs on COTS WiFi routers, which are usually available in indoor spaces for communication purposes, thus making the deployment and maintenance of an ad-hoc sensing infrastructure unnecessary. The use of WiFi routers as environmental sensors is enabled by the possibility of reliably estimating the wireless channel frequency response while receiving background data traffic.
In contrast with previous solutions, we implemented a robust HAR system, which does not require complex and periodic calibrations, i.e., once trained, it can be used at run-time in a completely unseen situation. Our approach revolves around the idea of processing the CSI data gathered by a monitor access point to quantify the Doppler effect due to the presence of moving objects over time. Indeed, the Doppler shift describes the velocity of the scattering points (i.e., parts of the human body under movement), whose temporal trace depends on the specific activity and is affected neither by the environment geometry nor by static objects. A learning-based algorithm, working on the Doppler trace, has also been proposed to distinguish among different activities.
The robustness of the proposed HAR system has been challenged by conducting several measurement campaigns, using an IEEE 802.11ac router by Asus, operating on a frequency band of 80 MHz and having four antennas. Measurements are performed for seven different configurations of environment, access point position, people and day. Four human activities, i.e., sitting, walking, running and jumping, are considered, together with the empty space recognition. Experimental results show that our system achieves a very high accuracy in all the considered scenarios: in the most challenging situation, when the person and the environment change with respect to those used at training time, the average accuracy is 96%. Moreover, the system outperforms state-of-the-art methods based on the analysis of the sole CSI amplitudes. Small errors are observed only when working with unseen persons and when distinguishing between walking and running. These errors represent the fundamental limits of the approach, because they are due to the behavioral differences among different persons, or to the movement similarities in walking and running activities.
Future research avenues include the investigation of algorithms that work across differing hardware, at both the transmitter and the monitor access point, as well as the design of a new learning-based algorithm that identifies multiple persons moving concurrently in the environment. Both extensions have important practical implications for real deployments and applications.

APPENDIX A OFDM MODEL FOR THE WIFI CHANNEL
Next, we detail the main blocks of a WiFi transmission chain, deriving the expression for the CFR in Eq. (2).

A. Transmitted signal model
In this subsection, we summarize how orthogonal frequency-division multiplexing (OFDM) is implemented by an IEEE 802.11 transmitter. OFDM uses a large number of closely spaced orthogonal sub-carriers, transmitted in parallel. Each sub-carrier is modulated with a conventional digital modulation scheme (such as QPSK, 16-QAM, etc.).
The input bits are grouped and mapped onto source data symbols, which are complex numbers representing the modulation constellation points. Such constellation points are further grouped into blocks of M elements each, i.e., the OFDM symbols. OFDM symbols are then fed to an IFFT block that transforms the data prior to transmitting it over the wireless channel, in parallel, over M sub-channels with carriers spaced apart by ∆f = 1/T Hz (T is the OFDM symbol time).
The duration of an OFDM symbol is $\tilde{T} = T + T_{CP}$, where $T_{CP}$ is the duration of the cyclic prefix, added to mitigate inter-symbol interference. Specifically, for IEEE 802.11ac the transmission bandwidth is 80 MHz, the samples are clocked out at 80 Msps, the number of sub-channels is M = 256, $T = 1/\Delta f = 3.2$ µs (i.e., $\Delta f = 312.5$ kHz), $T_{CP} = 0.8$ µs, and, in turn, $\tilde{T} = 4$ µs.
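The numerical values above follow directly from the bandwidth and the number of sub-channels, as the short check below shows. The variable names are illustrative; the derived sample counts (320 per symbol, 64 for the cyclic prefix) are simple consequences of the 80 Msps clock and are not stated in the text.

```python
BANDWIDTH_HZ = 80e6      # also the sample rate: 80 Msps
M = 256                  # number of sub-channels
T_CP = 0.8e-6            # cyclic prefix duration, in seconds

delta_f = BANDWIDTH_HZ / M           # sub-carrier spacing, in Hz
T = 1.0 / delta_f                    # useful symbol time, in seconds
T_tilde = T + T_CP                   # symbol duration including the prefix

samples_per_symbol = round(T_tilde * BANDWIDTH_HZ)
cp_samples = round(T_CP * BANDWIDTH_HZ)
```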
Let $a_k = [a_{k,-M/2}, \dots, a_{k,M/2-1}]$ be the k-th OFDM symbol, where $a_{k,m}$ is the m-th OFDM sample. After digital-to-analog conversion, the baseband OFDM signal for the k-th symbol is
$$x_k(t) = \sum_{m=-M/2}^{M/2-1} a_{k,m}\, e^{j 2\pi m t / T},$$
where $m \in \{-M/2, \dots, M/2-1\}$ is the sub-channel index. Considering K subsequent blocks, the baseband signal $x(t)$ is the concatenation of the K symbols $x_k(t)$, each occupying its own interval of duration $\tilde{T} = T + T_{CP}$, and the signal transmitted over the WiFi channel is obtained by upconverting $x(t)$ to the carrier frequency $f_c$, $s_{tx}(t) = e^{j 2\pi f_c t}\, x(t)$.
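A discrete-time version of this modulator, sampled at t = nT/M, can be sketched as follows. The example uses a toy size of M = 8 sub-channels with a two-sample cyclic prefix instead of the 256 and 64 of IEEE 802.11ac, and the function names are hypothetical. A direct sum replaces the IFFT block for readability.

```python
import cmath
import math


def ofdm_modulate(symbols, cp_len):
    """Map M constellation points to M time samples plus a cyclic prefix.

    symbols[i] is carried by sub-channel m = -M/2 + i.
    """
    M = len(symbols)
    sub_channels = range(-(M // 2), M // 2)
    body = [sum(a * cmath.exp(2j * math.pi * m * n / M)
                for a, m in zip(symbols, sub_channels))
            for n in range(M)]
    return body[M - cp_len:] + body   # prefix = copy of the last cp_len samples


def ofdm_demodulate(samples, cp_len):
    """Drop the cyclic prefix and project onto each sub-carrier."""
    body = samples[cp_len:]
    M = len(body)
    return [sum(x * cmath.exp(-2j * math.pi * m * n / M)
                for n, x in enumerate(body)) / M
            for m in range(-(M // 2), M // 2)]


# Toy QPSK block with M = 8 sub-channels (illustrative values).
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
tx_samples = ofdm_modulate(qpsk, cp_len=2)
recovered = ofdm_demodulate(tx_samples, cp_len=2)
```

Over an ideal channel, the receiver projection returns the transmitted constellation points, which is the orthogonality property that the received signal model relies on.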

B. Received signal model
At each receiver antenna, P signal copies are collected, due to the scatterers that the signal $s_{tx}(t)$ encounters (multi-path propagation). Each path p is characterized by an attenuation $A_p(t)$ and a delay $\tau_p(t)$. Neglecting the additive white Gaussian noise, the received signal $s_{rx}(t)$ is written as
$$s_{rx}(t) = \sum_{p=0}^{P-1} A_p(t)\, e^{-j 2\pi f_c \tau_p(t)}\, x(t - \tau_p(t)), \quad (23)$$
and its baseband representation $y_s(t)$ is expressed as
$$y_s(t) = s_{rx}(t)\, e^{-j 2\pi f_c t}. \quad (24)$$
A rectangular window $[k\tilde{T}, k\tilde{T} + T]$, with $\tilde{T}$ the OFDM symbol duration including the cyclic prefix, is used at the receiver to collect and decode the information carried by one OFDM symbol at a time. Without loss of generality, we assume k = 0 and hence we omit this index from the following equations. The transmitted symbol $a_m$ is recovered by computing the Fourier transform of the signal in the received window, which yields
$$a_m \sum_{p=0}^{P-1} A_p\, e^{-j 2\pi (f_c + m/T)\tau_p}, \quad (25)$$
where the sum corresponds to the frequency response of the WiFi channel,
$$H_m = \sum_{p=0}^{P-1} A_p\, e^{-j 2\pi (f_c + m/T)\tau_p}, \quad (26)$$
that is estimated based on the known preamble symbols. In Eq. (25), we consider that the path attenuation and delay remain constant over each window, i.e., $A_p(t) = A_p$ and $\tau_p(t) = \tau_p$. Also, exchanging the order of integration and summation is legitimate as we deal with finite quantities, and we used the orthogonality of the sub-carrier exponentials $e^{j 2\pi m t / T}$ over a window of duration T.
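The multi-path expression for the channel frequency response can be evaluated numerically, as in the sketch below. The path lengths and attenuations are made-up illustrative values, and the carrier frequency of 5.21 GHz is an assumption corresponding to the center of IEEE 802.11ac channel 42. The function name `cfr` is hypothetical.

```python
import cmath
import math

C = 299_792_458.0    # speed of light, in m/s
F_C = 5.21e9         # assumption: center frequency of channel 42, in Hz
T = 3.2e-6           # OFDM symbol time, in seconds
M = 256              # number of sub-channels


def cfr(paths):
    """Channel frequency response on sub-channels m = -M/2, ..., M/2 - 1.

    paths is a list of (attenuation, delay_in_seconds) pairs, assumed
    constant over the observation window.
    """
    return [sum(a * cmath.exp(-2j * math.pi * (F_C + m / T) * tau)
                for a, tau in paths)
            for m in range(-(M // 2), M // 2)]


# Illustrative geometry: a 3 m direct path and a weaker 5 m reflected path.
single_path = cfr([(1.0, 3.0 / C)])
two_paths = cfr([(1.0, 3.0 / C), (0.5, 5.0 / C)])

amp_single = [abs(h) for h in single_path]
amp_two = [abs(h) for h in two_paths]
```

With a single path, the amplitude is flat across the sub-channels and only the phase varies with m. Adding a second path makes the amplitude frequency-selective, which is how the environment geometry enters the CFR.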

C. Phase offsets in the WiFi channel estimates
Hardware artifacts make the CFR gathered from WiFi devices slightly deviate from the model in Eq. (26). These artifacts introduce offsets (rotation errors) in the phase information, among which the most significant are [29], [38]:
• carrier frequency offset (CFO), due to the difference between the carrier frequency of the transmitted signal and the one measured at the receiver. The CFO is only partially compensated for at the receiver [31].
• sampling frequency offset (SFO), due to the imperfect synchronization of the clocks between transmitter and receiver.
• packet detection delay (PDD), due to the time required to recover the transmitted modulated symbols from the received signal [31].
• phase-locked loop phase offset (PPO), due to the phase-locked loop (PLL), the entity responsible for randomly generating the initial phase at the transmitter.
• phase ambiguity (PA), due to the phase difference (multiples of π) between the antennas, that in static conditions should remain constant.
Considering these contributions, the complete expression for the phase of the p-th path in the received signal is
$$\phi_{p,m} = -2\pi (f_c + m/T)\tau_p + \phi_{CFO} - 2\pi m (\tau_{SFO} + \tau_{PDD})/T + \phi_{PPO} + \phi_{PA}. \quad (27)$$
Note that, while the CFO, SFO and PDD contributions take the same value across different antennas, the initial PLL phase (PPO) and PA are antenna specific [44], [45].

APPENDIX B DERIVATION OF EQ. (17)
In the following Eq. (28), we derive the expression of $F\{H_i\}(m, u)$, where $H_i$ is the M × N dimensional matrix in Eq. (16), collecting N subsequent channel estimates for each of the M sub-channels, and $F\{\cdot\}$ indicates the Fourier transform. According to the formulation in Section III, element (m, n) of $H_i$ is $\hat{H}_m(n)$. The models for the channel frequency response and the Doppler effect used in the computation are given in Eq. (12), Eq. (13) and Eq. (15), while $W_m$ is the Hanning function, selected for the windowing operation.
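The term $e^{j 2\pi (m v_p \cos\alpha_p\, n T_c / T)/c}$ is negligible and omitted in the final expression. Note also that, to increase the velocity resolution, the signal can be zero-padded out to $N_D$ samples before applying the Fourier transform. The resulting expression depends on the Doppler shift through the factor
$$W_m(n)\, e^{j 2\pi n (f_c v_p \cos\alpha_p T_c / c - u / N_D)}. \quad (28)$$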
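The exponent in Eq. (28) vanishes when $v_p \cos\alpha_p = u\, c / (f_c T_c N_D)$, which maps each Doppler bin u to a velocity. The sketch below evaluates this mapping with the parameters of Table II. The carrier frequency of 5.21 GHz is an assumption (center of IEEE 802.11ac channel 42), and the helper name `bin_to_velocity` is hypothetical.

```python
C = 299_792_458.0   # speed of light, in m/s
F_C = 5.21e9        # assumption: center frequency of channel 42, in Hz
T_C = 6e-3          # channel estimates interval, in seconds
N = 31              # channel estimates per Doppler vector
N_D = 100           # bins after zero-padding


def bin_to_velocity(u, n_points=N_D):
    """Velocity (m/s) whose Doppler shift falls on bin u of an n_points DFT."""
    return u * C / (F_C * T_C * n_points)


v_step_padded = bin_to_velocity(1)         # bin spacing with zero-padding
v_step_unpadded = bin_to_velocity(1, N)    # bin spacing with N points only
v_max = bin_to_velocity(N_D // 2)          # largest velocity magnitude covered
```

Under these assumptions, zero-padding refines the velocity grid from about 0.31 m/s to about 0.096 m/s per bin, and the Doppler vector covers velocities up to roughly 4.8 m/s in magnitude, a range compatible with indoor walking, running and jumping.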

Fig. 1: Amplitude (in dB scale), raw and sanitized phases (unwrapped) of CSI data for an empty room (left plots) and with a person running (right plots). Each trace is three seconds long and shows the behavior on each of the monitored sub-channels (y-axis). Note that CSI is not available on the three central sub-carriers, see Section V for details.

Fig. 3: Doppler traces related to five different setups (from left to right: empty room, sitting, walking, running and jumping).The velocity can be positive or negative depending on the way the scattering points move, changing the geometry of the multi-path propagation channel.
where the row and column indices respectively represent the OFDM sub-channels and the number of packets in the observation window. At this point, each element of the N_D-dimensional Doppler vector D_i = [d_i(−N_D/2), ..., d_i(N_D/2 − 1)]^T is obtained as

Fig. 6: Monitored environment for set S7. The desks are full of computers and monitors. Tx and Rx denote the transmitter and the receiver, respectively. M4 denotes the monitor station.
TABLE II: Summary of parameters related to the communication setup and the Doppler traces computation.
Communication and processing parameters
monitored channel                                      IEEE 802.11ac 42
OFDM sample duration, T                                3.2 × 10^-6 s
modulation and coding scheme                           MCS 4
channel estimates interval, T_c                        6 × 10^-3 s
no. OFDM sub-channels, M                               256 (245 used)
no. ch. estimates for Doppler computation, N           31
no. bins in a Doppler vector, N_D                      100
no. Doppler vectors per classification window, N_w     340
no. monitoring antennas, N_ant                         4

Fig. 7: Normalized confusion matrix for test set S7. Environment, day and person change with respect to the training.

TABLE I: Measurement conditions.

TABLE III: Performance of the proposed HAR strategy (on the left) compared with the state-of-the-art DeepSense approach (on the right). The accuracy and the F1-score are reported for each of the five classes together with their average value.