1 Introduction

Indoor positioning systems (IPS) have seen increasing demand and applications in location-based services (LBS), the Internet of Things (IoT), and intelligent healthcare [1]. In view of the diverse applications of IPS, significant research efforts have been devoted to positioning with different wireless technologies, such as Bluetooth, WiFi, and mobile networks [2]. Among them, WiFi-based positioning has attracted the most attention for at least two reasons. First, existing WiFi networks can serve as a general infrastructure for positioning purposes. With the dense deployment of WiFi devices and their built-in circuits for measurements of received signal strength (RSS) or even channel state information (CSI), a WiFi network can be easily transformed into a positioning system. Second, the orthogonal frequency division multiplexing (OFDM) technology that underpins WiFi and the fourth generation (4G) and beyond systems can provide the signature CSI of targets in space, time, and frequency, allowing for more precise positioning than using RSS readings only. In addition, the rich research results in WiFi-based positioning also lay a solid foundation for technology developments in this field [3, 4].

The idea of RSS-based positioning was first introduced in [5], and is realized with the readings of RSS indicators (RSSI) of received WiFi packets. Despite the significant research efforts in this area over the past decade, IPS with RSSI measurements remains a challenging task in practice, particularly in the consistency of positioning accuracy, which is sensitive to measurement campaigns and channel modeling in complex indoor environments [3, 4].

Compared to RSS-based positioning, which offers meter-level accuracy, positioning with instantaneous CSI has the potential to provide centimeter-level accuracy [6, 7] at the cost of a measurement density 2 to 3 orders of magnitude higher in both space and time than that for the RSS-based approach, and is thus more suitable for positioning in small-scale areas of a few square meters (\(m^2\)). Nevertheless, the combination of these two positioning technologies creates a possibility for high-precision end-to-end autonomous indoor navigation services that can use RSS-based positioning methods for coarse trajectory control and CSI-based methods for fine localization at endpoints.

From an estimation theoretical point of view, either RSS- or CSI-based positioning can be further classified into two categories: the deterministic versus the statistical ones. In the deterministic approach, a geometric model is needed to serve the positioning purpose. For RSS-based methods in this category, a radio power path-loss model is typically used to characterize the relationship of RSS measurements to radio propagation distances [5, 8, 9] between transmitters and receivers. For CSI-based methods in this category, a radio wave propagation model is employed to estimate the angle of arrival (AOA) or the time difference of arrival (TDOA) based on the phase differences in the CSI of designated receivers [10, 11]. These geometric measurements in terms of propagation distances, AOA, or TDOA can then be transformed into the location information of the target, given the predetermined locations of the receivers.

In contrast to the deterministic approach, location information of a statistical approach is directly extracted from measurement data. No specific physical model is needed for location estimation. Either the well-known pattern matching methods, e.g. [12, 13], or the more sophisticated deep learning methods, e.g. [14,15,16], fall in this category and have been extensively studied for positioning with RSSI or CSI measurements.

The aforementioned positioning methods, whether they use RSS or CSI measurements and whether they follow a deterministic or a statistical approach, are all based on reference signals transmitted or received from the target device. The physical principle underpinning the positioning algorithms is the same and results from electromagnetic (EM) wave propagation and its interactions with the environment, including reflection, diffraction, or refraction effects on obstacles. From this point of view, positioning can also be done for a target that is not actively involved in the transmission or reception of reference signals. This kind of passive positioning for a device-free target is somewhat like radar remote sensing, but perhaps more challenging due to the complicated EM wave propagation channels in indoor or outdoor urban environments. Consequently, IPS with such a device-free target localization method typically relies on a deep learning model to extract the location information of the target intrinsic in the CSI that is reflected or diffracted from the target and is received by the IPS receivers, such as [17,18,19], to name a few.

Despite the rich research results in IPS, location tracking with CSI remains a challenging task due to the complexity of representing the spatial and temporal structure of CSI with regard to moving targets without a specific physical model for it, although several recent research efforts have been made to investigate this issue with, for instance, recurrent neural networks (RNN) [20,21,22] or particle filters [22]. This problem becomes even more complicated for device-free target tracking if the transmitter of the reference signal also moves with the target. This could happen in applications where a robotic pet or a care robot is approaching or following its master. In view of the technical challenges in device-free target tracking and its great potential in applications such as care robots, automatic guided vehicles (AGVs), and intelligent patrol inspection, we study herein a device-free target following method based on neural networks that use both the CSI and RSSI of reference signals.

To exploit the spatial structures and characteristics of both CSI and RSSI, we reexamine their individual spatial correlations in indoor space and design neural networks to extract both the coarse and fine location information of the target out of RSSI and CSI. With these spatial features of the target, we further construct an RNN to learn and exploit the temporal structures and characteristics in the spatial features for continuous location tracking and target following. The performance of the proposed device-free target following system is verified by experiments on our testbed, where it achieves a positioning accuracy of 0.773 m, a \(64\%\) improvement over traditional device-free approaches. The contributions of this research are summarized as follows:

  • We reinvestigated the spatial correlations of CSI and RSSI, and developed a deep neural network (NN) structure to exploit the spatial and temporal correlations of CSI and RSSI for location tracking of moving targets.

  • Based on the proposed NN structure, we designed a common framework for IPS which not only can be applied for typical device-based or device-free location tracking, but also for device-free target tracking where reference signals are sent from a moving device that is also tracked by the IPS.

  • By experimental verifications, we showed that the proposed device-free target tracking method enjoys a \(64\%\) improvement in positioning accuracy when reference signals are sent from a device that follows the target.

This paper is organized as follows. Section 2 introduces the system model for IPS and Section 3 provides some background knowledge on data preprocessing for CSI measurements. A spatial-temporal NN structure is proposed for IPS in Section 4. Experimental results are presented in Section 5, followed by some discussions and concluding remarks drawn in Section 6.

2 System Architecture

In current WiFi and 4G/5G systems, Orthogonal Frequency Division Multiplexing (OFDM) and Multiple Input Multiple Output (MIMO) technologies [23] are used to combat the multipath effects in broadband wireless transmissions and to increase the system capacity. In MIMO transmissions, data is usually split into multiple streams, with each stream transmitted from a different transmit antenna in the same frequency band. For MIMO-OFDM systems, each data stream is further modulated with OFDM. Specifically, data are further split, modulated, and transmitted in multiple orthogonal subcarriers in the occupied frequency band to avoid inter-symbol interferences.

Due to this orthogonality property, CSI amplitude and phase of each subcarrier can be easily estimated in OFDM systems. For a legacy 802.11n WiFi that supports \(2\times 2\) MIMO, and has a total of K subcarriers, e.g. \(K=56\) across a total bandwidth of 20 MHz, the CSI matrix \(\textbf{H}\) of the \(L=4\) antenna pairs is given by

$$\begin{aligned} \textbf{H}= \left[ \begin{array}{cccc} h_{1,1} &{} h_{1,2} &{} \cdots &{} h_{1,L} \\ h_{2,1} &{} h_{2,2} &{} \cdots &{} h_{2,L} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ h_{K,1} &{} h_{K,2}&{} \cdots &{} h_{K,L} \end{array} \right] \end{aligned}$$
(1)

where \(h_{k,l}=\vert h_{k,l}\vert e^{j \angle {h_{k,l}}}\) is the CSI value on the k-th subcarrier and the l-th antenna pair, and consists of the amplitude (\(\vert h_{k,l}\vert\)) and phase (\(\angle {h_{k,l}}\)) parts. The CSI matrix \(\textbf{H}\) represents the overall channel frequency response (CFR) of the wireless channels, including the effects of small-scale channel fading and large-scale channel reflection, scattering and shadowing effects. In other words, we can sense a slight variation in an indoor space via its CSI matrix.

On the other hand, the RSS measurements, expressed as RSSI (RSS Indicator), give the average power of received OFDM symbols. In free space, their values mainly depend on the propagation loss from a transmitter (TX) to a receiver (RX), and are characterized by a log-distance path-loss model [24] on the decibel (dB) scale as

$$\begin{aligned} P_{TX}-P_{RX}=PL(d_{0})+10\gamma _d \log _{{10}}{\frac{d}{d_{0}}}+\eta _{s} \quad (dB) \end{aligned}$$
(2)

with \(PL(d_{0})\) being the path loss at the reference distance (\(d_0\)), \(\gamma _d\) being the decay exponent and \(\eta _{s}\) being a Gaussian random variable (RV) to model the shadowing effect. This model basically assumes that the radio power decays with distance according to a power law with exponent \(\gamma _d\), i.e., linearly in \(\log _{10} d\) on the dB scale. However, RSSI is also affected by the shadowing and multipath effects, such that the RSS-based positioning approach is not able to capture small-scale power variations, and has an indoor localization accuracy at the meter level [6].
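
As a rough illustration of how Eq. (2) is used for ranging in deterministic RSS-based methods, the following sketch evaluates the mean path loss and inverts it for the distance \(d\) when the shadowing term \(\eta _{s}\) is ignored. The function names and the numerical values (a reference loss of 40 dB at \(d_0=1\) m and a decay exponent of 3) are illustrative assumptions and are not parameters measured on our testbed.

```python
import math


def path_loss_db(d, pl_d0=40.0, d0=1.0, gamma=3.0):
    """Mean path loss of Eq. (2) in dB, with the shadowing term set to zero."""
    return pl_d0 + 10.0 * gamma * math.log10(d / d0)


def distance_from_path_loss(pl_db, pl_d0=40.0, d0=1.0, gamma=3.0):
    """Invert Eq. (2) for the distance d (same unit as d0), ignoring shadowing."""
    return d0 * 10.0 ** ((pl_db - pl_d0) / (10.0 * gamma))


# Round trip: the loss predicted at 5 m maps back to a 5 m distance estimate.
loss_at_5m = path_loss_db(5.0)
estimated_d = distance_from_path_loss(loss_at_5m)
```

In practice the shadowing term makes such a single-link distance estimate noisy, which is one reason why RSS-based positioning stays at the meter level.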

To improve positioning accuracy, instantaneous CSI can be used to extract the information of individual propagation paths in order to achieve a decimeter-level localization accuracy. In our proposed system, we use WiFi APs to collect the CSI and RSS measurements of a target device. The WiFi APs use a Linux-based operating system, OpenWRT [25], and can synchronously collect their RSSI readings of WiFi packets from an associated device on an assigned WiFi channel. Their CSI, however, is collected individually by their on-board Atheros network interface cards (NIC) (QCA9558) for the IEEE 802.11n standard. The Atheros-CSI-tool [26] built on top of the Atheros WiFi driver (ath9k) can extract the CSI, RSSI, and the timestamps of received WiFi packets.

Moreover, in order to monitor the CSI data from multiple devices simultaneously, the Atheros-CSI-tool can be implemented with 3 different modes: the AP-client, broadcast, and injector monitor modes. The AP-client mode is the basic configuration of this tool, which provides one-to-one communication and captures the CSI data from the associated device. The broadcast mode involves multiple clients and one broadcast AP. Clients share the same MAC address to simultaneously receive packets from the broadcast AP. In the injector monitor mode, all devices share the same MAC address and can acquire CSI from an injector. In our WiFi-based IPS, the system is implemented with the broadcast mode and uses multiple receivers to obtain CSI measurements.

The IPS is developed, deployed, and tested on the 8th floor of the EED building of National Yang Ming Chiao Tung University (NYCU), Taiwan, and controls a total of 3 WiFi APs, which serve as CSI clients to collect CSI and RSSI measurements. The floor plan and the locations of the three RX APs (marked as triangles) are illustrated in Fig. 1. The entire testing area consists of two hallways, and the dimensions of the testing area are around \(13.8\times 8.2\ m^2\). The RX APs are installed on the ceiling and configured to use WiFi channel 2 in the industrial, scientific, and medical (ISM) band.

Figure 1. The floor plan of the indoor space where our WiFi-based IPS is installed.

3 Calibration Procedure of CSI Measurements

In 802.11n WiFi networks, CSI data are obtained from the high-throughput (HT) preamble [27]. Signal processing on the reference signals of the preamble can be divided into four functional steps: start-of-packet detection, automatic gain control (AGC), frequency offset estimation, and channel estimation. In these functional steps, the AGC introduces an unknown scaling parameter in the CSI amplitude. On the other hand, the phase offset in CSI is caused by hardware mismatches in the carrier and the sampling frequencies between the transmitter and the receiver. Besides, a packet detection error also causes a CSI phase rotation error. These errors in both the CSI phase and amplitude must be calibrated if we aim for a precise positioning result.

To address the issue of CSI phase correction, many methods have been proposed [28,29,30]. In their results, CSI phase distortions are attributed to four sources: sampling frequency offset (SFO), sampling time offset (STO), carrier frequency offset (CFO), and phase-locked loop (PLL) phase offset. According to the linear phase offset assumption [29], the observed CSI phase of each CFR sample can be defined as

$$\begin{aligned} \angle {H_{k,l}} = \varphi _{k,l} + \frac{2\pi k}{N}\Delta t + \beta + Z_{k} \end{aligned}$$
(3)

where \(\varphi _{k,l}\) denotes the true value of the phase on the k-th subcarrier of the l-th antenna pair, with \(k=-28,...,-1,1,...,28\) in our system setting, \(\Delta t\) is the time delay caused by STO and SFO, \(\beta\) is the phase offset caused by CFO and PLL phase offsets, N is the OFDM dimension, and \(Z_k\) is measurement error.

To compensate for the CSI phase distortion due to the time delay and phase offset, we need to remove the effects of \(\Delta t\) and \(\beta\) accordingly. Given that k is the subcarrier index, a phase calibration method can estimate the mean \(\beta\) and the slope \(\Delta n \triangleq \frac{\Delta t}{N}\) of phase offsets from the CSI phase measurements with a least squares (LS) estimator. As a result, we can calculate the estimates as follows:

$$\begin{aligned} \hat{\beta } = \frac{1}{K} \sum _{k \in S_k} \angle {H_{k,l}} \end{aligned}$$
(4)

with the number of total subcarriers \(K=56\) and the set of available subcarriers \(S_k=\{-28,...,-1,1,...,28\}\) in our system. And

$$\begin{aligned} \Delta n = \frac{\sum _{k \in S_k} k\times (\angle {H_{k,l}}-\hat{\beta })}{2\pi \sum _{k \in S_k}k^2}. \end{aligned}$$
(5)

Based on the estimated slope (\(\Delta n\)) and mean (\(\hat{\beta }\)), the calibrated phase value is given by

$$\begin{aligned} \angle {\hat{H}_{k,l}} = \angle {H_{k,l}} - 2\pi k \Delta n - \hat{\beta }. \end{aligned}$$
(6)
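
The calibration steps in Eqs. (4)-(6) can be sketched in a few lines. The code below is a minimal illustration rather than the exact implementation used in our system: the function name is hypothetical, and the input phases are assumed to be already unwrapped along the subcarrier axis, since a linear fit is not meaningful on phases wrapped to \((-\pi ,\pi ]\).

```python
import math

# Available subcarrier indices S_k = {-28, ..., -1, 1, ..., 28}, so K = 56.
SUBCARRIERS = [k for k in range(-28, 29) if k != 0]


def calibrate_phase(phases, subcarriers=SUBCARRIERS):
    """Apply Eqs. (4)-(6) to the unwrapped phases (radians) of one antenna pair."""
    num_subcarriers = len(subcarriers)
    beta_hat = sum(phases) / num_subcarriers                      # Eq. (4)
    numerator = sum(k * (p - beta_hat) for k, p in zip(subcarriers, phases))
    denominator = 2.0 * math.pi * sum(k * k for k in subcarriers)
    delta_n = numerator / denominator                             # Eq. (5)
    calibrated = [p - 2.0 * math.pi * k * delta_n - beta_hat      # Eq. (6)
                  for k, p in zip(subcarriers, phases)]
    return calibrated, beta_hat, delta_n


# Synthetic check: a zero true phase distorted by a known slope and offset.
true_beta, true_delta_n = 0.7, 0.01
raw = [2.0 * math.pi * k * true_delta_n + true_beta for k in SUBCARRIERS]
calibrated, beta_hat, delta_n = calibrate_phase(raw)
```

Because the subcarrier set is symmetric around zero, the sample mean in Eq. (4) is not biased by the linear term, which is why the offset and the slope can be estimated separately.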

On the other hand, the amplitudes of CSI measurements are affected by the AGC in the WiFi NIC. When received WiFi packets are under different AGC settings, the scales of received signal strengths are different. Therefore, we only consider the morphology of CSI amplitudes and normalize the collected CSI amplitudes into a range of [0,1] as follows

$$\begin{aligned} \vert \hat{H}_{k,l} \vert = \frac{\vert H_{k,l}\vert -\min \limits _{q\in S_k}{\vert H_{q,l}\vert }}{\max \limits _{q\in S_k}{\vert H_{q,l}\vert }-\min \limits _{q\in S_k}{\vert H_{q,l}\vert }}. \end{aligned}$$
(7)
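
The min-max normalization of Eq. (7) is sketched below for the amplitudes of one antenna pair. The function name is hypothetical, and the handling of a perfectly flat response (all amplitudes equal) is an assumption added to avoid a division by zero; it is not specified by Eq. (7).

```python
def normalize_amplitudes(amplitudes):
    """Eq. (7): rescale the amplitudes of one antenna pair into [0, 1]."""
    lowest, highest = min(amplitudes), max(amplitudes)
    if highest == lowest:
        # Assumption: a flat response carries no shape information.
        return [0.0 for _ in amplitudes]
    return [(a - lowest) / (highest - lowest) for a in amplitudes]


# Two packets with the same amplitude shape but a 10x difference in AGC gain
# are mapped to the same normalized profile.
scaled_a = normalize_amplitudes([2.0, 4.0, 6.0])
scaled_b = normalize_amplitudes([20.0, 40.0, 60.0])
```

The example illustrates the design intent: only the shape of the amplitude profile across subcarriers is kept, while the unknown AGC scaling is discarded.
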

Figure 2. The phases and amplitudes of CSI data in the CSI calibration procedure.

With CSI calibration, stable CSI can be obtained on each subcarrier. To verify the CSI calibration performance, we test it experimentally for a fixed reference point (RP). In the experiment, the TX is placed at RP 7 in Fig. 1 and the CSI data are measured by receive AP 3 (RX 3). Figure 2(a) presents the original phases of 100 collected CSI samples. The phases after removing the offsets due to CFO and PLL are presented in subfigure (b), and the phases after further eliminating the distortions caused by STO and SFO are provided in subfigure (c). The normalized CSI amplitudes are shown in subfigure (d). As shown in the results, the calibrated CSI data, especially the phase terms, are more stable than the raw data in static environments, allowing us to use the CSI to represent the complex channel between RP 7 and RX 3.

Furthermore, we also investigated the spatial correlation among the collected CSI samples. Spatial correlation measures the continuous changes in CSI measurements at nearby locations. In the experiment, we collected CSI measurements along a straight line (the vertical hallway in Fig. 1) and calculated the correlation between pairs of sampling points. In order to avoid AGC distortion and ensure consistency in the NIC configuration, the TX was continuously transmitting WiFi packets while moving at a constant speed of 2 m/s. The collected CSI measurements are partitioned into sampling intervals of 100 ms, and the average CSI in each interval is used to calculate the spatial correlation as follows

$$\begin{aligned} \eta (\varvec{h}_{l_1},\varvec{h}_{l_2}) = \frac{\vert \langle \varvec{h}_{l_1},\varvec{h}_{l_2}\rangle \vert ^2}{\langle \varvec{h}_{l_1},\varvec{h}_{l_1}\rangle \langle \varvec{h}_{l_2},\varvec{h}_{l_2}\rangle } \end{aligned}$$
(8)

where \(\varvec{h}_{j}=[ h_{1,j},...,h_{K,j}]^T\) represents the average CFR values at the j-th sampling location, for \(j\in \{l_1,l_2\}\), and \(\langle \cdot ,\cdot \rangle\) is the inner product of two complex vectors.
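
The correlation measure of Eq. (8) can be sketched as follows. The function names and the example vectors are hypothetical; the only point of the example is that the measure is normalized to [0, 1] and is insensitive to a common complex gain, such as a residual phase offset or an AGC scaling.

```python
def inner(u, v):
    """Hermitian inner product <u, v> = sum_k u_k * conj(v_k)."""
    return sum(a * b.conjugate() for a, b in zip(u, v))


def spatial_correlation(h1, h2):
    """Eq. (8): |<h1, h2>|^2 / (<h1, h1> <h2, h2>), a value in [0, 1]."""
    numerator = abs(inner(h1, h2)) ** 2
    denominator = inner(h1, h1).real * inner(h2, h2).real
    return numerator / denominator


h_a = [1 + 1j, 2 - 1j, 0.5j]
h_b = [c * (0.3 - 0.4j) for c in h_a]   # same CFR up to a complex gain
h_c = [1 - 1j, -1 + 2j, 1 + 0j]         # an unrelated CFR

same_shape = spatial_correlation(h_a, h_b)
different = spatial_correlation(h_a, h_c)
```

By the Cauchy-Schwarz inequality the value never exceeds one, and it equals one only when the two CFR vectors are proportional.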

Figure 3. The spatial correlations in the CSI and the RSSI measurements.

Figure 3 shows the relationship of CSI correlations to distances, and compares it to the spatial correlations of RSSIs that are collected in the same experiment. We can observe that the correlations of CSI decrease much more rapidly with separation distance than those of RSSI. The effective correlation distances in terms of CSI and RSSI correlations are 0.12 m and 2 m, respectively. Similar results were also presented in [31]. In that work, the correlations of the time-reversal resonance strength (TRRS) samples, i.e., the time-domain CSI measurements, were modeled with a Bessel function, and the effective correlation distance was also about 0.1 m [31]. The results clearly show that CSI and RSSI measurements have different spatial correlation resolutions and can be combined to provide location information of mixed resolutions for indoor navigation purposes.

4 Deep Spatial-Temporal Network Architecture for Target Tracking and Following

Figure 4. The proposed neural network structure for device-free location tracking and following.

In this research, we aim to use the CSI and RSSI of reference signals received by the RX APs in our IPS to assist an agent device to continuously track and follow a target which does not carry a WiFi device with it. In contrast to typical device-free location estimation that uses a fixed WiFi signal source, the agent device carries a WiFi transmitter with it to continuously send reference signals for the IPS to track its own location as well as the target’s. As shown in the previous section, the CSI and RSSI have different correlation distances in space. RSSIs can reflect meter-level location variations, which are useful to build a large-scale radio power map for coarse localization in indoor space. CSI readings, on the other hand, can provide decimeter-level location information for fine positioning with respect to (w.r.t.) designated points. Combining these two positioning technologies allows us to provide an automatic indoor navigation service with acceptable measurement efforts for RPs of different scales of spatial resolutions in different indoor spaces.

To fulfill the goal, we need to design a deep learning model for the IPS to merge and transform the information continuously acquired from CSI and RSSI into the locations of both the agent device and the target when both are moving. This implies that the neural network, on the one hand, needs to be able to represent the spatial structure of RSSI and CSI with regard to the device locations, and, on the other hand, needs to be able to interpret the temporal variations in CSI and RSSI as continuous movements of the devices. This inspires us to construct a neural network structure as illustrated in Fig. 4.

The entire network mainly consists of two parts: a spatial feature extraction network that decodes the RSSI and CSI observations into the location features of the agent device and the target for individual signal sampling and processing intervals, and an encoding network that keeps track of and transforms the features of consecutive time intervals into the current locations of the agent device and the target. Details of them are specified in the following subsections.

4.1 Spatial Feature Extractions

Figure 5. The neural network structure for spatial feature extractions. The spatial features of CSI are obtained from CNN models and weighted by an attention network. The weighted CSI features are concatenated with RSSI features and transformed into the final output vector.

The spatial feature extraction network illustrated in Fig. 5 mainly contains three parts: a deep neural network (DNN) used to extract the coarse spatial structure of RSSI with regard to the location of the agent device that transmits a reference signal for positioning purposes, and two convolutional neural networks (CNNs) that are employed to extract the fine spatial features of CSI for the target’s location relative to the agent device. The inputs to the two CNNs are \(\angle {H_{k,l}}\) and \(\vert H_{k,l}\vert\), respectively, for \(k=-28,\dots ,-1,1,\dots ,28\) and \(l=1,\dots ,4\), collected from the three APs of our IPS as illustrated in Fig. 1. Consequently, \(\angle {H_{k,l}}\) and \(\vert H_{k,l}\vert\) are formatted as two individual maps of dimension \(56\times 12\), and fed into individual CNNs to extract the spatial features among the CSI of different subcarriers, transmit-receive antenna pairs, and APs.

Each CNN contains 5 layers of 2-dimensional (2D) spatial convolutions as illustrated in Fig. 6. The numbers of channels in the five convolutional layers are 16, 32, 32, 32, and 32, respectively, and each channel is implemented with a kernel size of \(3\times 3\) and a stride of 1 for filtering. The activation functions in the CNN are leaky ReLU. The first four convolutional layers are followed by maximum pooling layers, which use a kernel of \(2\times 2\) and a stride of 2 for pooling. The outputs of the final convolutional layer are flattened into a feature vector of dimension 128. The CNN output feature vectors are denoted by \(\varvec{f}^t_{amp}\) for CSI amplitudes \(\vert H_{k,l}\vert\) and \(\varvec{f}^t_{phase}\) for CSI phases \(\angle {H_{k,l}}\). The two vectors are concatenated to form a CSI feature vector \(\varvec{f}^t_{CSI}=\varvec{f}^t_{amp} \oplus \varvec{f}^t_{phase}\), where \(\oplus\) stands for the concatenation operator.
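
The feature dimension of 128 can be checked by tracing the feature-map size through the layers. The padding scheme is not stated above, so the sketch below assumes "same" padding: a \(3\times 3\) convolution with stride 1 keeps the map size, and a \(2\times 2\) pooling with stride 2 maps a side of length n to \(\lceil n/2 \rceil\). The function name is hypothetical. Under this assumption, the final map has 32 channels of size \(4\times 1\), which flattens to 128 values.

```python
import math


def cnn_output_shapes(height=56, width=12,
                      channels=(16, 32, 32, 32, 32), pooled_layers=4):
    """Trace (channels, height, width) after each of the five conv blocks.

    Assumes "same" padding for both the convolutions and the poolings.
    """
    shapes = []
    for index, out_channels in enumerate(channels):
        # 3x3 convolution, stride 1, same padding: spatial size is unchanged.
        if index < pooled_layers:
            # 2x2 max pooling, stride 2, same padding: ceil(n / 2) per side.
            height = math.ceil(height / 2)
            width = math.ceil(width / 2)
        shapes.append((out_channels, height, width))
    return shapes


shapes = cnn_output_shapes()
final_channels, final_height, final_width = shapes[-1]
feature_dim = final_channels * final_height * final_width
```

With pooling that discards odd remainders instead, the width would shrink from 12 to 0 after the fourth pooling layer, so some form of padding appears necessary for the \(56\times 12\) input.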

Figure 6. Architecture overview of the proposed CNN model for CSI spatial feature extractions. The model accepts CSI sample inputs prepared as \(56\times 12\) 2D images. Each convolutional layer is implemented with a \(3\times 3\) kernel and a leaky ReLU activation function, followed by a max pooling layer with stride of two. The outputs of the 5-th layer are flattened into a feature vector of dimension 128.

Given that the importance of the different feature contents may also vary with the location of the reference agent device, the CSI feature vector is weighted by the outputs of an attention network [32] in an attempt to extract the essential location information of the target relative to the agent device. The attention network is formed with a DNN network whose output is an importance vector \(\varvec{\alpha }^t\) in the range of [0, 1]. Multiplying \(\varvec{f}^t_{CSI}\) by \(\varvec{\alpha }^t\) gives the weighted CSI feature vector of

$$\begin{aligned} \hat{\varvec{f}}^t_{CSI} = \varvec{\alpha }^t \otimes \varvec{f}^t_{CSI} \end{aligned}$$
(9)

where \(\otimes\) stands for an element-wise product.
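
A minimal sketch of the weighting in Eq. (9) is given below. The text only states that the importance vector lies in [0, 1]; the use of a sigmoid on the raw outputs of the attention DNN is an assumption made here for illustration, and the function names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with values in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def apply_attention(csi_features, attention_scores):
    """Eq. (9): element-wise product of importance weights and CSI features."""
    alpha = [sigmoid(s) for s in attention_scores]   # assumed output activation
    weighted = [a * f for a, f in zip(alpha, csi_features)]
    return weighted, alpha


# A zero score gives a weight of 0.5; a large negative score suppresses a feature.
weighted, alpha = apply_attention([2.0, -4.0, 1.0], [0.0, 0.0, -20.0])
```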

Unlike the CSI inputs, the RSSI inputs \(\hat{R}^t_{p}\) to their feature extraction network only have \(P=3\) variables given by the three receiving WiFi APs in our IPS. We thus use a DNN model to extract the RSSI feature vector \(\varvec{f}^t_{RSS}\) of dimension \(16\times 1\). Concatenating this feature vector with \(\hat{\varvec{f}}^t_{CSI}\) from CSI and sending the result to another DNN generates a final feature vector \(\varvec{f}_t\) of dimension 32.

4.2 Temporal Feature Extraction

The spatial feature \(\varvec{f}_t\) serves as an observation of the locations of the agent device and the target during a certain observation time interval. The observation can serve the localization purpose well if the target and the agent device are static and the observation interval is long enough. For the target tracking problem considered in this work, however, the observation interval can be set neither too long, in order to track the dynamic location changes, nor too short, in order to combat the observation noise. To resolve this dilemma, we use a recurrent neural network (RNN) to learn and exploit the relationship among the time series of feature vectors \(\varvec{f}_t\) with regard to the time-varying locations of the target and of the agent device that follows the target. The proposed architecture is presented in Fig. 7 and is based on the well-known long short-term memory (LSTM) model.

The network consists of a layer of bidirectional LSTM (BiLSTM) [33] and a layer of unidirectional LSTM. The BiLSTM processes the time series of feature vectors \(\varvec{f}_t\) on a time-window basis, and derives a forward feature vector \(\overrightarrow{\varvec{f}}^t\) of dimension 32 and a backward feature vector \(\overleftarrow{\varvec{f}}^t\) of the same dimension for every time instant t within a time window, making use of the temporal relationship among \(\varvec{f}_t\). These two feature vectors \([\overrightarrow{\varvec{f}}^t, \overleftarrow{\varvec{f}}^t]\) are then merged by another layer of LSTM to generate a feature vector \(\varvec{y}^t\). The feature vector \(\varvec{y}^t\) is directly transformed by a DNN into the location estimate \(\hat{\varvec{p}}^{t}\).

Figure 7. The proposed RNN model, which includes a BiLSTM layer and an LSTM layer.

4.3 Overall System Architecture for IPS

Figure 8. Overview of the proposed neural network architecture for device-free target tracking. The offline phase performs network training, and the online phase estimates the target location with instantaneous CSI and RSSI measurements.

With the signal preprocessing procedure specified in Section 3 and the neural network architectures depicted in Figs. 4\(\sim\)7, we construct an IPS as illustrated in Fig. 8. The operations of the IPS system are divided into two phases: the offline training phase and the online inference phase. In the offline training phase, CSI and RSSI readings are collected for the different types of devices under test (DUT) along preplanned paths. The loss function for the training of the entire neural network is defined as the MSE of positioning, given by

$$\begin{aligned} L(\varvec{\theta }) = \frac{1}{\xi }\sum _{i=1}^{\xi }{\vert \hat{\varvec{p}}^{i}-\varvec{p}^{i} \vert } \end{aligned}$$
(10)

where \(\hat{\varvec{p}}^{i}\) and \(\varvec{p}^{i}\) are the estimated and the ground-truth locations of the DUT, \(\varvec{\theta }\) denotes the learnable parameters in the system, and \(\xi\) is the number of samples for evaluation. To improve the generalization of the neural network, training samples are randomly shuffled into multiple mini-batches of size \(\xi\). The parameters \(\varvec{\theta }\) are updated for every mini-batch. The convergence curves of the training loss functions for device-based tracking are presented in Fig. 9, where the mini-batch size is 190 and the number of mini-batches in an epoch is 24.
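
The loss of Eq. (10) and the mini-batch shuffling can be sketched as follows. Two points are assumptions made for illustration: \(\vert \cdot \vert\) in Eq. (10) is read as the Euclidean distance between two 2D locations, and the total of 4560 training samples is simply the product of the stated batch size (190) and the number of mini-batches per epoch (24). The function names are hypothetical.

```python
import math
import random


def positioning_loss(estimates, truths):
    """Eq. (10): average distance between estimated and true locations."""
    total = sum(math.dist(p_hat, p) for p_hat, p in zip(estimates, truths))
    return total / len(truths)


def mini_batches(samples, batch_size, seed=0):
    """Randomly shuffle the samples and split them into mini-batches."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


# Two samples with errors of 0 m and 5 m give an average loss of 2.5 m.
loss = positioning_loss([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])

# 4560 sample indices (assumed count) split into batches of 190 samples.
batches = mini_batches(range(4560), batch_size=190)
```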

Figure 9. The convergence curves of the loss function for device-based location tracking.

The IPS can be applied in 3 positioning scenarios: (1) device-based location tracking for an agent who is carrying a WiFi device and continuously sending reference signals for IPS to estimate its location; (2) device-free location tracking for a target in which the target’s location is continuously estimated by reference signals sent from a fixed TX in our IPS; and (3) device-free target tracking and following in which the IPS continuously tracks the target’s location based only on the reference WiFi signals sent from a moving agent device that is following the target. The dimensions of the output vector are different according to the operating scenarios. In the next section, we will present our experimental results for the three positioning scenarios and investigate their performance limitations, respectively.

5 Experimental Results

We perform experiments for the three aforementioned positioning scenarios: (1) device-based location tracking, (2) device-free target tracking, and (3) device-free target tracking and following, to analyze and compare their positioning accuracies. All experiments are carried out in a corridor of dimensions \(16\times 8\) as illustrated in Fig. 1. RSSI and CSI are sampled and processed at an interval of 100 ms, and collected individually for the three scenarios. In addition, all network models are trained with a commercial server, which has a CPU of Intel i9-9900X 3.50 GHz (10 cores), 64 GB memory, and a GPU of NVIDIA GeForce GTX2080. Details of the experimental settings and results are provided in the following subsections.

5.1 Scenario 1: Device-Based Location Tracking

In this scenario, we assume that the DUT is equipped with a WiFi TX, which is an OpenWRT WiFi AP programmed to transmit WiFi packets continuously. During the data collection period, CSI data are collected by the 3 WiFi RXs installed on the ceiling of the corridor in Fig. 1, and the DUT is moved along the 6 predefined paths at a constant speed of 0.46 m/s, as shown in the figure. For each path, we repeat the experiment 5 times and record the CSI and RSSI measurements in 5 individual data sets. Three of the data sets are used for network training and two of them are used for network validation. The DUT locations are labeled in meters in the format of \((p_x, p_y)\), and are interpolated between end locations based on a constant traveling speed of 0.46 m/s between them. Therefore, the location resolution is 4.6 cm at the system data sampling rate of 10 Hz. The total number of sampling locations is 1540 along the 6 predefined paths, which form an entire route of 70 m.
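
The labeling rule described above, i.e., linear interpolation between the end locations of a path at 0.46 m/s and 10 Hz, can be sketched as follows. The function name and the example path (a straight 4.6 m segment) are hypothetical, and the sketch assumes straight paths with a sample taken exactly at the start location.

```python
import math


def interpolate_labels(start, end, speed_mps=0.46, rate_hz=10.0):
    """Ground-truth (p_x, p_y) labels along a straight path at constant speed."""
    length = math.dist(start, end)
    if length == 0.0:
        return [start]
    step = speed_mps / rate_hz               # 0.046 m between two labels
    num_steps = int(length / step + 1e-9)    # tolerance for rounding errors
    labels = []
    for i in range(num_steps + 1):
        fraction = (i * step) / length
        labels.append((start[0] + fraction * (end[0] - start[0]),
                       start[1] + fraction * (end[1] - start[1])))
    return labels


# A 4.6 m path walked in 10 s yields 101 labels spaced 4.6 cm apart.
labels = interpolate_labels((0.0, 0.0), (0.0, 4.6))
```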

The MSEs of location tracking for each path are presented in Table 1. The MSEs of the different paths are all about 1 m, which implies that the proposed model is robust and can offer an acceptable accuracy in a device-based tracking scenario. To provide a baseline for performance comparison, we also implemented a straightforward model, which learns spatial features with a 3-layer DNN (512-256-128) and uses the same RNN model to learn the temporal correlation. The results show that the positioning accuracies of the two models are at the same level. However, the DNN-based model requires 854400 parameters to learn the spatial characteristics, which is 3.37 times that of the proposed CNN feature extraction model. As a result, the proposed model can characterize the spatial relationship with a lower complexity.
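
The figure of 854400 for the DNN baseline is consistent with counting the weights and biases of fully connected layers. The sketch below reproduces it under the assumption that the baseline takes the two \(56\times 12\) CSI maps (amplitude and phase) and the three RSSI readings as a single flat input of 1347 values; this input layout is inferred from the count and is not stated explicitly above. The function name is hypothetical.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


csi_inputs = 2 * 56 * 12    # amplitude and phase maps of 56 x 12 each
rssi_inputs = 3             # one RSSI reading per receiving AP
baseline_params = dense_parameter_count(
    [csi_inputs + rssi_inputs, 512, 256, 128])
```

Almost all of these parameters sit in the first layer, which is where the weight sharing of a convolutional front end saves the most.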

Table 1 Average MSE of positioning in meter along different paths.

The window size of the RNN model also affects the performance of the tracking algorithm. In this work, the maximum window size is set to 21 sampling intervals to exploit the spatial correlations of CSI measurements within a maximum separation distance of 46 cm. Table 1 also shows the results for different window sizes. Although the accuracy improves with larger window sizes, the improvement saturates once the window size exceeds 15, i.e., once the correlation distance reaches \(7\times 4.6 \simeq 32\) cm, which is also the distance at which the CSI correlation drops below 0.7, as shown in Fig. 3.
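The relation between window size and correlation distance can be reproduced with a few lines of arithmetic. The sketch below assumes an odd window centered on the current sample, so that the correlation distance is the reach of \((w-1)/2\) sampling intervals on either side; this reading is our assumption, chosen because it reproduces both figures quoted above (46 cm for a window of 21 and about 32 cm for a window of 15). The function name is hypothetical.

```python
RESOLUTION_CM = 4.6  # spacing between consecutive samples (0.46 m/s at 10 Hz)


def correlation_distance_cm(window_size, resolution_cm=RESOLUTION_CM):
    """Distance in cm from the window center to its farthest sample.

    Assumes an odd, centered window, so the reach on each side is
    (window_size - 1) / 2 sampling intervals.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd integer")
    half_width = (window_size - 1) // 2
    return half_width * resolution_cm


for w in (15, 21):
    print(w, round(correlation_distance_cm(w), 1))
```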

5.2 Scenario 2: Device-Free Target Tracking

In this positioning scenario, the DUT is not equipped with a WiFi TX. The reference signal for positioning is transmitted by a WiFi TX installed at a reference point (RP) in the test field. Therefore, the location of the RP is a key factor for target tracking performance. In our experimental setting, we have 7 RPs as shown in Fig. 1. The distance between TX RPs is about 1.4 m.

Table 2 Average MSE of positioning in meter for different TX RP locations.

During the data collection period, the WiFi TX is placed at a given RP while the DUT moves along three predefined paths (Path 1, Path 3, and Path 5) at a constant traveling speed of 0.46 m/s. There are a total of 740 sampling locations along the three paths, which together form a route of 34 m. For each path, we repeat the experiment 5 times and record the CSI and RSSI measurements in 5 separate data sets. Following the same setting as in scenario 1, three of the data sets are used for network training and the other two for network validation. The whole data collection procedure is repeated 7 times, once for each of the 7 WiFi TX RPs, to verify the performance of the device-free scenario w.r.t. the TX location.

Fig. 10 MSE of device-free target tracking at the true locations of the target.

The MSEs of device-free location tracking for the different TX RPs are provided in Table 2. The IPS achieves the best performance, an average MSE of 1.3 m, when the WiFi TX is placed at RP 6, and the worst, an MSE of 2.1 m, when the WiFi TX is placed at RP 1. This variation shows that device-free positioning can be severely affected by the location of the WiFi TX. To find the reason, we closely examined the results for the WiFi TX at RP 5 and RP 7, respectively, and plotted the positioning MSE at the true locations of the DUT, as shown in Fig. 10. The figure shows that the positioning MSEs are lower when the DUT is closer to the WiFi TX than when it is far from it. This is because device-free positioning relies on the channel disturbances caused by the moving DUT. If the DUT is far from the WiFi TX, its effect on the wireless channels, and hence on the measured CSI, becomes very weak and noisy, and the measurements no longer provide useful location information for the DUT. Table 2 also compares the performance of the proposed feature extraction model with that of the DNN model.
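A per-location error map of the kind plotted in Fig. 10 can be obtained by grouping the validation errors by the true position of the DUT. The sketch below is only an illustration under stated assumptions: it bins true locations into square grid cells of a hypothetical size `cell_m`, and it summarizes each cell by the root-mean-square position error so that the result is in meters. The exact error definition and plotting procedure used for the figure are not restated here, and the function name is ours.

```python
import math
from collections import defaultdict


def error_by_location(true_xy, pred_xy, cell_m=0.5):
    """Group squared position errors by the grid cell of the true location.

    Returns {cell_index: root-mean-square error in meters}, where
    cell_index = (floor(x / cell_m), floor(y / cell_m)).
    """
    sums = defaultdict(lambda: [0.0, 0])      # cell -> [sum of squared errors, count]
    for (tx, ty), (px, py) in zip(true_xy, pred_xy):
        cell = (math.floor(tx / cell_m), math.floor(ty / cell_m))
        acc = sums[cell]
        acc[0] += (px - tx) ** 2 + (py - ty) ** 2
        acc[1] += 1
    return {cell: math.sqrt(total / count)
            for cell, (total, count) in sums.items()}


# Made-up example: two samples near the origin, one farther away.
true_xy = [(0.1, 0.1), (0.2, 0.2), (3.0, 0.0)]
pred_xy = [(0.1, 1.1), (0.2, 1.2), (3.0, 0.0)]
error_map = error_by_location(true_xy, pred_xy)
print(error_map)
```

Plotting such a map against the TX position makes the distance dependence discussed above directly visible.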

5.3 Scenario 3: Device-Free Target Tracking and Following

In this scenario, the IPS tracks the locations of both a target and an agent device. The agent device, e.g., an AGV, is equipped with a WiFi TX, follows behind the device-free target at a fixed distance, and continuously transmits WiFi packets so that the IPS can track both its own location and the target’s. This makes the IPS a hybrid positioning system that provides both device-based and device-free location tracking services. To conduct experiments for this challenging scenario, we consider a case where the agent device is within 1.4 m of the target and is at one of the RPs defined in Fig. 1. This allows us to reuse the CSI and RSSI data collected in the device-free target tracking scenario for the target tracking and following problem considered here. Taking path 3 as an example, the agent device, namely the WiFi TX, moves along a sequence of RPs, \(\{RP7, RP6, RP1, RP4, RP5\}\), and the CSI and RSSI measurements collected while the target is within 1.4 m in front of the agent device are used by our IPS to position the target.
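The reuse of the device-free measurements can be illustrated by a simple distance filter. This is a minimal sketch under stated assumptions, not the procedure used to build the actual data sets: the sample layout, the RP coordinates, and the function name `select_following_samples` are hypothetical, and the filter checks only the 1.4 m distance, not whether the RP lies behind the target along the path.

```python
import math

FOLLOW_DISTANCE_M = 1.4  # maximum agent-to-target distance stated in the text


def select_following_samples(samples, rp_xy, max_dist=FOLLOW_DISTANCE_M):
    """Keep device-free samples whose target is within `max_dist` of the TX RP.

    `samples` is an iterable of (rp_id, (p_x, p_y), features) tuples and
    `rp_xy` maps rp_id -> (x, y) in meters; both layouts are hypothetical.
    """
    kept = []
    for rp_id, target_xy, features in samples:
        if math.dist(rp_xy[rp_id], target_xy) <= max_dist:
            kept.append((rp_id, target_xy, features))
    return kept


# Made-up coordinates for two RPs and three target positions.
rp_xy = {"RP7": (0.0, 0.0), "RP6": (1.4, 0.0)}
samples = [
    ("RP7", (1.0, 0.0), "csi_a"),   # 1.0 m from RP7: kept
    ("RP7", (2.0, 0.0), "csi_b"),   # 2.0 m from RP7: dropped
    ("RP6", (2.0, 0.0), "csi_c"),   # 0.6 m from RP6: kept
]
following = select_following_samples(samples, rp_xy)
print([s[2] for s in following])
```

Each retained sample then carries two labels, the target location and the location of the RP that stands in for the agent device.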

There are a total of 3 paths in our experiments, as defined in Fig. 11(a). For each path, we have 5 sets of measurement data. Three of the data sets are used for network training, and the other two for network validation. The total length of these three paths is around 18 m, yielding 401 samples in each data set. The output labels of the network consist of two parts: the location of the device-free target and the location of the agent device. Both locations are labeled at a meter-level scale in the format of \((p_x, p_y)\). The proposed method achieves an average MSE of 0.773 m for the device-free target. Compared to the device-free positioning of scenario 2, the IPS thus enjoys a \(64\%\) improvement in positioning MSE when an agent device follows the device-free target.
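As a check on the reported gain, taking as the baseline the average device-free MSE of 2.164 m for scenario 2 (the value quoted in the conclusions), the relative reduction in MSE is

\[
\frac{2.164 - 0.773}{2.164} = \frac{1.391}{2.164} \approx 0.643 \approx 64\%.
\]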

Fig. 11 Experimental settings and MSE of device-free target tracking and following.

As shown in Fig. 11(b), the positioning errors on the testing routes are smaller and more uniform than those in Fig. 10. Although the target is tracked in a device-free manner, the agent device and its TX signal produce highly spatially correlated CSI and RSSI measurements, owing to the shorter distance between the WiFi TX of the agent and the moving target. As a result, this target tracking method not only yields accurate device-based tracking of the agent, but also provides better precision for device-free tracking of the target.

6 Conclusions

The problem of device-free target tracking with RSSI and CSI measurements of wireless signals was reinvestigated in this work. Compared to existing works in this area, we are particularly interested in applications where the device-free target being tracked is followed by a device that transmits the reference signals for location tracking of both the target and the device itself. To this end, we reinvestigated the spatial correlations of RSSI and CSI measurements, and designed a deep neural network model to learn and exploit the spatial and temporal characteristics of CSI and RSSI for location tracking. The neural network architecture consists of a DNN to extract the coarse spatial features of RSSI w.r.t. the indoor space, and two CNNs to extract the fine spatial features in the phases and the amplitudes of CSI w.r.t. the device that transmits the reference signals. The multi-resolution spatial features of RSSI and CSI are then merged and transformed by an RNN model to extract their temporal features for target tracking. The proposed learning-based IPS can be applied to three positioning scenarios: (1) device-based location tracking; (2) device-free target tracking; and (3) device-free target tracking and following. Our experimental results show that the proposed IPS can provide a positioning accuracy of 0.898 meters in device-based location tracking. For device-free target tracking, however, the average MSE of positioning is 2.164 meters, and is found to be mainly affected by the propagation distance from the source of the reference signals to the target. To improve location precision, a novel device-free target tracking method was proposed in scenario (3), where the target’s location is tracked based on reference signals sent from a device that follows the target. Both the target and the device need to be positioned in this scenario. The results show that the proposed method reduces the MSE of target tracking to 0.773 meters, a significant \(64\%\) improvement over the device-free tracking method in scenario (2). This suggests that the proposed method has good potential to be extended to device-free multi-target tracking in the future.