Performance Enhancement of Wi-Fi Fingerprinting-based IPS by Accurate Parameter Estimation of Censored and Dropped Data

Abstract: In complex indoor environments, the censoring, dropping, and multi-component problems may be present in the observable data. This is due to the attenuation of signals, the unexpected operation of equipment, and the changing surrounding environment. Censoring refers to the fact that sensors on portable devices are unable to measure Received Signal Strength Indicator (RSSI) values below a certain threshold, for example, −100 dBm with typical smartphones. Dropping means that, occasionally, RSSI measurements of Wi-Fi access points are not available, even though their value is clearly above the censoring threshold. The multi-component problem occurs when the measured data varies due to obstacles, user orientation, doors being closed or open, and so forth. Taking these problems into consideration, this paper proposes a novel approach to enhance the performance of the Wi-Fi Fingerprinting-based Indoor Positioning System (WF-IPS). The proposed method is verified on both simulated data and real field data. The experimental results show that our proposal outperforms other state-of-the-art WF-IPS approaches in both positioning accuracy and computational cost.


Introduction
Modeling RSSI distribution in the WF-IPS: Nowadays, indoor positioning based on Wi-Fi fingerprinting has attracted significant interest due to its potential to obtain high accuracy at low cost [1], [2]. This method can be formulated as a pattern recognition system which operates in two phases: a training phase and an online localization/classification phase [3]. In the training phase, data measured at the reference points (RPs) from the available Wi-Fi access points (APs) are collected to build the database. In the classification phase, the online measured data is compared to the training data, and the target position is determined according to the similarity between the two. Two common methods are used in the classification phase: the deterministic approach [4], [5] and the probabilistic approach [2], [6][7][8][9]. As reported in previous studies, the probabilistic approach tends to outperform the deterministic approach. In the probabilistic approach, either a parametric model [6], [8] or a nonparametric model [7] can be used to represent the training RSSI distributions. As reported in [10], systems utilizing the parametric model perform better. The reason is that the parametric model can account for signal strength values missing from the training phase (due to a finite number of measurements) by smoothing the distribution shape, which avoids zero probability at those signal strength points. Some studies showed that the majority of RSSI histograms fit the Gaussian distribution very well when sufficient samples have been collected [6], [11], [12], while others proposed to model the RSSI distribution by the Gaussian Mixture Model (GMM) [8], [11], [13]. It should be noted that the GMM extends the single Gaussian model with the ability to represent multi-modal data. Therefore, the GMM is the most feasible parametric model for Wi-Fi RSSI data.
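As a minimal illustration of why the GMM suits multi-modal RSSI data, the sketch below evaluates a two-component mixture density in Python; the weights, means, and standard deviations are hypothetical values chosen for illustration, not taken from any cited study.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, weights, mus, sigmas):
    """Density of a univariate Gaussian mixture: sum_j w_j * N(x; mu_j, sigma_j)."""
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Hypothetical bimodal RSSI distribution (dBm), e.g. door open vs. door closed.
weights = [0.6, 0.4]
mus = [-72.0, -80.0]
sigmas = [2.0, 3.0]

# The mixture places high density at a mode and much less between the modes.
p_mode1 = gmm_pdf(-72.0, weights, mus, sigmas)
p_between = gmm_pdf(-76.0, weights, mus, sigmas)
```

A single Gaussian fitted to such data would place its mean between the modes, exactly where the mixture correctly assigns low density.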

The characteristics of the measured Wi-Fi RSSI data:
In this work, the characteristics of real field Wi-Fi RSSI data have been investigated. In [6], [9], [14], the authors recognized the censoring and dropping problems in the observable data; the Gaussian distribution was chosen as the data model throughout [6], [9]. In [8], [11], [13], the multi-component problem was noticed. In [11], the authors showed that human behavior in the measurement environment (absence, sitting/standing still, moving randomly, and moving along specific paths) led to bi-modal phenomena in the experimental data. In this case, using a Gaussian distribution to model the RSSI histogram is not appropriate. In [8], [13], the authors used the GMM to model the data, since changes in the surrounding environment obviously change the measured signal strength. However, in [8], [11], [13] the censoring and dropping problems were not considered. With respect to the above-mentioned issues, this paper proposes to utilize the GMM including censored and dropped observations (CD-GMM) [15] to model the distribution of the Wi-Fi RSSI data.
Parameter estimation and model selection: For estimating the parameters of a probabilistic model in the presence of missing data, the EM algorithm [16], [17] is one of the most feasible estimators among the available approaches. The results in [15] showed the effectiveness of the EM algorithm for the CD-GMM. However, this approach can only be used for parameter estimation of a GMM with a known number of components. In the WF-IPS, since the real training data collected at RPs from APs have different distributions, it is necessary to develop an appropriate method to estimate the number of components of the CD-GMM instead of fixing it to a specific number. In [8], the Akaike Information Criterion (AIC) is used to determine the best number of components for the GMM. The authors in [18] proposed a penalized likelihood method for model selection of finite multivariate Gaussian mixture models. This method involves a light computational load and is attractive when there are many candidate models. A model selection criterion based on the sum of weighted real parts of all log-characteristic functions (SWRLCF) was introduced in [19]. This method is suited to large-sample applications. The approaches introduced in [8], [18], [19] can select the number of components of GMMs consistently when data are complete. However, in complex indoor environments, the collected Wi-Fi data are often incomplete due to censoring and dropping [6], [9], [14].

Scalability of WF-IPS:
It is reported that the WF-IPS is low cost and easy to deploy [20]. However, once deployed at a large scale, with a large number of RPs, the execution time needed to produce the positioning results must be considered carefully [21]. Therefore, reducing execution time while maintaining positioning accuracy is a challenge when developing a WF-IPS.
The target of this research is to enhance the performance of a WF-IPS in terms of both positioning accuracy and computational cost. For the above reasons, this paper proposes a novel method to precisely estimate the number of components of a GMM, as well as their parameters, in the presence of censored and dropped data. The proposed approach combines the Bayesian Information Criterion (BIC), to determine the best number of components of the CD-GMM, with the EM algorithm, to deal with censored and dropped data.
In the following, the characteristics of Wi-Fi RSSI data collected in indoor environments are investigated; based on these, we propose a new parameter estimation and model selection algorithm in which the censoring, dropping, and multi-component problems are considered (Sec. 2). Section 3 evaluates the effectiveness of the proposed approach in the WF-IPS. The paper is concluded in Sec. 4.

Parameter Estimation and Model Selection Based on the Characteristics of Real Field Collected Wi-Fi RSSI Data
The characteristics of Wi-Fi RSSI data: According to our data investigation, we have detected three problems with the Wi-Fi RSSI data, namely censoring, dropping, and multi-component (Fig. 1), which strongly affect the accuracy of parameter estimates and, consequently, the positioning results.
In Fig. 1(a), (b) and (c), the RP where RSSI measurements were taken is close enough to the AP, hence all the RSSI values are above the limited sensitivity of the Wi-Fi chipset (in our data set, −100 dBm). The distributions of RSSI shown in Fig. 1(b), (c) appear to be drawn from more than one Gaussian component. The reason is that the measurements were gathered in varying states, such as doors opening/closing or changes in the direction of the person who handled the collecting equipment (a smartphone). In these three cases, the distribution of data can be modeled by the standard GMM with one, two and three Gaussian components, respectively.
However, a large number of readings belong to one of the three latter cases shown in Fig. 1, where unobservable data are present in the collected data. In Fig. 1(d), training data were taken at an RP far away from an AP; therefore, a certain number of samples were unobservable due to the censoring problem, as can be seen from the histogram bar at −100 dBm. Figure 1(e) shows the presence of dropped data, owing to an AP temporarily switching off for energy-saving purposes. In Fig. 1(f), one part of the histogram is missing, similarly to Fig. 1(d), but the amount of unobservable measurements appears to be larger than the missing part in Fig. 1(e), because the Wi-Fi data might experience both censoring and dropping.
Parameter estimation: Considering all the above phenomena present in Wi-Fi RSSI data, and assuming that data collected from different APs are independent, we propose to model the distribution of data gathered at an RP from each AP by the CD-GMM and to estimate its parameters by utilizing the EM algorithm as follows:

E-Step:
Let y = [y_1, …, y_N] be the set of unobservable, non-censored, non-dropped data (complete data) representing the Wi-Fi RSSIs collected at an RP from an AP, with y_n, n = 1, …, N, where N is the number of elements in y; c is the specific threshold below which a portable device (e.g., a smartphone) does not report the signal strength; x = [x_1, …, x_N] is the set of observable, censored, possibly dropped data (incomplete data), where x_n = y_n only if y_n > c and the dropping problem does not occur, while x_n = c means that y_n ≤ c or the dropping problem occurs.
Since the observations can be considered as incomplete data, instead of computing the likelihood directly, the expected value of the log-likelihood of the complete data given the observations and the previously estimated parameters, Q(Θ; Θ^(k)), is calculated in (1). For calculating Q(Θ; Θ^(k)) in (1), three cases are considered: data are observable, data are unobservable due to the censoring problem, and data are unobservable due to the dropping problem. The detailed calculation of Q(Θ; Θ^(k)) is given in (2), with the following definitions: v_n (n = 1, …, N) are hidden binary variables indicating whether y_n is unobservable. The other terms in (2) are given in (3)-(6).
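The three-case logic of the E-step can be sketched as follows, under stated assumptions: for an observable sample the standard GMM responsibilities apply, while for a sample reported at the threshold c the probability mass is split between dropping (rate λ, here `lam`) and censoring under each component via the Gaussian CDF Φ. The variable names are ours; the exact expressions are those in (2)-(6).

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def e_step(x, c, weights, mus, sigmas, lam):
    """Per sample: (component responsibilities, posterior probability of dropping)."""
    out = []
    for xn in x:
        if xn > c:  # observable sample: standard GMM responsibilities
            dens = [w * gauss_pdf(xn, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            out.append(([d / total for d in dens], 0.0))
        else:       # unobservable sample: censored under some component, or dropped
            cens = [w * phi((c - m) / s) for w, m, s in zip(weights, mus, sigmas)]
            p_drop = lam / (lam + (1.0 - lam) * sum(cens))
            total = sum(cens)  # responsibilities conditional on being censored
            out.append(([d / total for d in cens], p_drop))
    return out

# One observable reading and one reading at the threshold (hypothetical parameters).
resp = e_step([-71.0, -100.0], c=-100.0, weights=[0.5, 0.5],
              mus=[-70.0, -90.0], sigmas=[3.0, 5.0], lam=0.1)
```

Note that the censored sample is attributed mostly to the weaker (-90 dBm) component, since the stronger component is very unlikely to fall below the threshold.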

M-Step:
Re-estimated parameters at the (k + 1)-th iteration are obtained by computing the partial derivatives of Q(Θ; Θ^(k)) in (2) w.r.t. the elements of μ_j, σ_j, w_j, λ and setting them to zero; the resulting formulae are given in (7)-(10).
As can be seen in (2) and (7)-(10), all collected data, including observable, censored and dropped samples, contribute to the estimates simultaneously. This means the proposed EM algorithm can deal with all the mentioned phenomena present in the collected data. Model selection: As mentioned in the first part of this sub-section, the distribution of collected RSSIs might be drawn from one, two, three or several Gaussian components, while the presented EM algorithm must assume a number of Gaussian components (J). For this reason, an extended BIC was developed to estimate the number of components in the CD-GMM as follows. The penalty function (PF) of the BIC for the GMM is given in (13). The PF of the extended BIC, BIC_CD, in which both observable and unobservable data are considered, is given in (14). The likelihood term in (14) can be calculated as in (15), where p(x_n; θ_j, λ) is the continuous probability density function parameterized by θ_j and λ.
Let d_n (n = 1, …, N) be hidden binary variables indicating whether an observation (y_n) is dropped (d_n = 1) or not (d_n = 0). For calculating p(x_n; θ_j, λ), the same three cases are considered: data are observable, data are unobservable due to censoring, and data are unobservable due to dropping. The PF of the extended BIC in (14) is then obtained as in (20). Using the EM algorithm and the PF of the extended BIC calculated in (20), the proposed algorithm for parameter estimation and model selection is as follows. Input: a set of incomplete data (x), the convergence threshold of the EM algorithm for the CD-GMM (ε), and the maximum number of Gaussian components (J_max) for calculating the PFs.
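The model selection loop can be sketched as follows. The log-likelihood counts a censored or dropped sample through the term λ + (1 − λ) Σ_j w_j Φ_j(c). The penalty 3J log N (J − 1 weights, J means, J standard deviations, one dropping rate) is our assumption standing in for the exact PF in (20), and the two candidate fits are hypothetical placeholders rather than the output of a full EM run.

```python
import math

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def loglik_cd(x, c, weights, mus, sigmas, lam):
    """Log-likelihood of censored/dropped mixture data."""
    ll = 0.0
    for xn in x:
        if xn > c:  # observable sample
            ll += math.log((1 - lam) * sum(w * gauss_pdf(xn, m, s)
                                           for w, m, s in zip(weights, mus, sigmas)))
        else:       # censored or dropped sample
            ll += math.log(lam + (1 - lam) * sum(w * phi((c - m) / s)
                                                 for w, m, s in zip(weights, mus, sigmas)))
    return ll

def bic_cd(ll, J, N):
    """BIC-style PF: -2 log L plus a penalty on the parameter count (our assumption)."""
    return -2.0 * ll + 3 * J * math.log(N)

# Hypothetical data: two clusters plus two threshold readings, and two candidate fits.
x = [-71.0, -70.5, -69.8, -85.2, -84.7, -100.0, -100.0]
c, N = -100.0, len(x)
candidates = {
    1: ([1.0], [-78.0], [8.0], 0.1),
    2: ([0.5, 0.5], [-70.4, -85.0], [0.6, 0.5], 0.25),
}
pfs = {J: bic_cd(loglik_cd(x, c, *params), J, N) for J, params in candidates.items()}
best_J = min(pfs, key=pfs.get)  # smallest PF wins
```

For this bimodal sample the two-component candidate yields a much higher likelihood, so it is selected despite the larger penalty.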

Positioning/Classification
The parameter estimation and model selection algorithm presented in Sec. 2.1 is run for all RPs, separately for the measurements of each AP. Let Q and N_AP be the total numbers of RPs and APs, respectively; the final estimated parameters correspond to the q-th RP (q = 1, …, Q) and the i-th AP (i = 1, …, N_AP). Indoor localization can be formulated as a classification problem where the classes are the RPs. During the online classification/positioning phase, to estimate the target's position, a MAP (maximum a posteriori) based classification rule is developed as follows.
First, the posterior is calculated in (21), where ℓ_q is the position of the q-th RP; x^on is the set of online measurements, and x_i^on is the RSSI value measured from the i-th AP (i = 1, …, N_AP). We consider RSSI measurements of different APs to be independent, and the prior P(ℓ_q) to be equal for all locations.
The online measurements might suffer from censoring, dropping, or both. Therefore, the likelihood p(x_i^on | ℓ_q) in (21) can be calculated for censored and dropped Gaussian mixture data as in (22). Let K_NN be the number of nearest neighbors chosen among the RPs by taking those with the largest posteriors. The estimated position of the mobile object is then obtained by (23).
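Under the assumptions above (uniform prior, independent APs), the online MAP step can be sketched as below. The radio map, threshold, and dropping rate are hypothetical, and the posterior-weighted centroid over the K_NN best RPs is one common realization of the final combination step, whose exact form is given by (23).

```python
import math

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(x_on, c, gmm, lam):
    """Censored/dropped mixture likelihood of a single online AP reading."""
    weights, mus, sigmas = gmm
    if x_on > c:
        return (1 - lam) * sum(w * gauss_pdf(x_on, m, s)
                               for w, m, s in zip(weights, mus, sigmas))
    return lam + (1 - lam) * sum(w * phi((c - m) / s)
                                 for w, m, s in zip(weights, mus, sigmas))

def locate(x_on, c, radio_map, lam, k_nn=3):
    """Posterior per RP (equal priors cancel), then a posterior-weighted
    centroid of the k_nn RPs with the largest posteriors."""
    scored = []
    for pos, gmms in radio_map:
        post = math.prod(likelihood(x, c, g, lam) for x, g in zip(x_on, gmms))
        scored.append((post, pos))
    scored.sort(key=lambda t: t[0], reverse=True)
    top = scored[:k_nn]
    z = sum(p for p, _ in top)
    return (sum(p * pos[0] for p, pos in top) / z,
            sum(p * pos[1] for p, pos in top) / z)

# Hypothetical radio map: 4 RPs x 2 APs, one Gaussian component each.
radio_map = [
    ((0.0, 0.0), [([1.0], [-60.0], [3.0]), ([1.0], [-80.0], [3.0])]),
    ((5.0, 0.0), [([1.0], [-70.0], [3.0]), ([1.0], [-70.0], [3.0])]),
    ((0.0, 5.0), [([1.0], [-80.0], [3.0]), ([1.0], [-60.0], [3.0])]),
    ((5.0, 5.0), [([1.0], [-90.0], [3.0]), ([1.0], [-55.0], [3.0])]),
]
# Online reading very close to the fingerprint of the RP at the origin.
est = locate([-61.0, -79.0], c=-100.0, radio_map=radio_map, lam=0.05, k_nn=3)
```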

Estimating the Number of Components in the GMM
In this simulation, our proposal from Sec. 2.1 and three other approaches [8], [18], [19] were applied to estimate the number of components in a GMM from artificial data, as follows. First, a random integer referring to the true number of components of the artificial mixture data (J) was generated within the range 1 to 4. Next, according to the value of J, one of the four sets of parameters defined in Tab. 1 was selected to generate 1000 data samples (complete data). The incomplete data (x) are the censored, possibly dropped data, obtained by varying the limited sensitivity of the Wi-Fi sensors (c) and the dropping rate (λ). After 1000 experiments, the differences between the true number (J) and the estimated number (Ĵ) of Gaussian components were recorded in Tab. 2.
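A data generator following this protocol might look as follows; the two-component parameters are hypothetical stand-ins for a row of Tab. 1.

```python
import random

def generate_cd_samples(n, weights, mus, sigmas, c, lam, seed=0):
    """Draw n samples from a Gaussian mixture, then apply dropping
    (replace with c at rate lam) and censoring (values <= c are reported as c)."""
    rng = random.Random(seed)
    x = []
    for _ in range(n):
        # pick a component, then draw a sample from it
        j = rng.choices(range(len(weights)), weights=weights)[0]
        y = rng.gauss(mus[j], sigmas[j])
        if rng.random() < lam or y <= c:
            x.append(c)   # dropped or censored: nothing above c is reported
        else:
            x.append(y)
    return x

# Hypothetical 2-component parameters (cf. the role of Tab. 1)
x = generate_cd_samples(1000, [0.6, 0.4], [-75.0, -95.0], [3.0, 4.0],
                        c=-100.0, lam=0.1)
```

Note that, as in the real data, the generator makes dropped and censored samples indistinguishable in x: both appear as the threshold value c.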
As can be seen in Tab. 2, our proposed method produced far better results than the other approaches, especially when the data suffer from censoring, dropping, or both. This can be explained as follows: our proposed method utilizes the extended version of the EM algorithm, in which both observable data (x_n = y_n) and unobservable data (x_n = c) contribute to the estimates. When data are unobservable owing to the censoring and dropping problems, this algorithm produces substantially better results than the standard EM algorithm used in [8], [18], [19]. Moreover, in the PF of the AIC in [8], the PF of the BIC in [18], and the SWRLCF in [19], unobservable data make almost no practical contribution, whereas they genuinely contribute to the likelihood in the PF of our proposal, as described in Sec. 2.1 (equation (20)).

Positioning Accuracy
In order to evaluate the positioning accuracy of the proposed method compared to three state-of-the-art approaches [7][8][9] on real data, we used the Wi-Fi RSSI data measured at 25 RPs (black dots) of an office building, as shown in Fig. 3.
In the training phase, RSSI values were taken at 25 RPs (25 unobstructed positions, free of walls and furniture), roughly evenly distributed, resulting in an average distance of 2.7 m between two locations. At each RP, 400 measurements were collected from each available AP. The training method of [10] was applied to select the 4 APs with the largest mean RSSI values, which were used to build the radio map by utilizing the algorithm introduced in Sec. 2.1. The convergence threshold of the EM algorithm was set to 10^-6 (ε = 10^-6) and the maximum number of Gaussian components for calculating the PFs was set to 6 (J_max = 6).
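The AP selection step can be sketched as below, with hypothetical readings; censored and dropped samples are recorded at c = −100 dBm, so they simply pull an AP's mean down.

```python
# Hypothetical readings per AP at one RP; -100.0 marks censored/dropped samples.
readings = {
    "AP1": [-62.0, -63.5, -61.0],
    "AP2": [-88.0, -100.0, -91.0],
    "AP3": [-70.0, -71.5, -69.0],
    "AP4": [-100.0, -100.0, -95.0],
    "AP5": [-75.0, -76.0, -74.5],
}

def select_aps(readings, k=4):
    """Keep the k APs with the largest mean RSSI."""
    means = {ap: sum(v) / len(v) for ap, v in readings.items()}
    return sorted(means, key=means.get, reverse=True)[:k]

chosen = select_aps(readings, k=4)
```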
In the online phase, 100 sets (x^on) of Wi-Fi RSSI measurements were gathered at the positions of the 25 RPs (4 sets per RP) in the same scenarios as the training data. The MAP method presented in Sec. 2.2 was applied to estimate the target's position, with the number of nearest neighbors set to 3 (K_NN = 3). After 100 experiments, the positioning results were calculated and reported in Fig. 4 and Tab. 3. Figure 4 shows the Cumulative Distribution Function (CDF) of the positioning error as a function of distance for the four methods. The CDF is defined as the probability that the positioning error (e) is lower than a certain distance (d). Furthermore, the mean (μ_DE) and variance (σ²_DE) of the distance error of the four approaches were recorded in Tab. 3. As can be seen in Fig. 4 and Tab. 3, among the four probabilistic approaches, the histogram method [7], which utilizes a non-parametric model, produced the lowest positioning accuracy. The three remaining parametric-model approaches showed different levels of positioning accuracy. This can be explained as follows: in [8], the standard EM algorithm for the GMM was applied to build the radio map in the training phase. This approach can deal with the multi-component problem. However, as mentioned in [15], when censored and dropped data are present in the collected data, the standard EM algorithm produces biased estimates and hence leads to high positioning errors.
In [9], the censoring and dropping problems were considered, but the distribution of the collected Wi-Fi RSSIs was assumed to be a single Gaussian, whereas, according to our data investigation, it might be drawn from two or more Gaussian components. Therefore, this proposal produced lower positioning accuracy than our proposed method.
By utilizing the novel method presented in Sec. 2.1, the censoring, dropping and multi-component issues are handled simultaneously. For this reason, although the test area is limited, the experimental results in Fig. 4 and Tab. 3 indicate that the proposed approach is significantly better than the other approaches.
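For reference, the empirical CDF plotted in Fig. 4 and the error statistics reported in Tab. 3 can be computed as sketched below; the error values here are hypothetical.

```python
def error_cdf(errors, d):
    """Empirical CDF: fraction of positioning errors e with e <= d."""
    return sum(1 for e in errors if e <= d) / len(errors)

# Hypothetical distance errors (m) from repeated positioning trials
errors = [0.5, 0.9, 1.2, 1.8, 2.4, 2.7, 3.1, 4.2]

# Mean and (population) variance of the distance error, as in Tab. 3
mean_de = sum(errors) / len(errors)
var_de = sum((e - mean_de) ** 2 for e in errors) / len(errors)
```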

The Computational Cost
Besides positioning accuracy, the computational cost of the localization procedure is one of the most important metrics of a WF-IPS. Systems with a lower computational cost spend less time producing the target's position. According to (21)-(23), apart from the number of RPs (Q) and the number of APs (N_AP), the computational cost depends strongly on the estimated number of Gaussian components (Ĵ_q,i). In order to evaluate the positioning accuracy and the computational cost, we performed four experiments with the same collected data as in Sec. 3.2, but with different numbers of Gaussian components. In the first experiment, the number of components and the parameters of the CD-GMM were estimated by applying the algorithm introduced in Fig. 2 (J = Ĵ). In experiments 2, 3 and 4, the number of components was fixed at 2, 3 and 4 (J = 2, 3 and 4), respectively, and the parameters were estimated using the EM algorithm for the CD-GMM described in Sec. 2.1. After 100 experiments, the mean time spent estimating the target's position (t_ETP) and the mean and variance of the distance error were recorded in Tab. 4.
As can be seen in Tab. 4, the four experiments produced about the same positioning accuracy but very different t_ETP. When the proposed PF of the extended BIC was applied, t_ETP was reduced by 25%, 30% and 36% compared to fixing the number of components at 2, 3 and 4, respectively. This demonstrates that our proposed method not only improves positioning accuracy but also incurs the lowest computational cost.

Conclusion
The performance of the WF-IPS is of particular interest. In this paper, novel approaches are introduced to take into account the phenomena present in real field data. When the censoring, dropping and multi-component problems occur simultaneously, our proposed method improves the positioning results of the WF-IPS considerably. The experiment in a complex indoor environment showed that the mean distance error is at least 0.4921 m lower than that of the other available fingerprinting-based probabilistic approaches. Moreover, by applying our proposed PF of the extended BIC, both the number of components and the parameters of the CD-GMM are accurately estimated, which leads to better performance in both positioning accuracy and computational cost.
The computational cost of the WF-IPS is proportional to the number of RPs and to the number of parameters of each distribution stored in the database. While this proposal has addressed the latter, the former still remains: once the WF-IPS is deployed at a large scale, the search domain becomes large, too. Therefore, in future work, we will seek a solution to reduce the search domain, which would further reduce the execution time in the localization phase. On the other hand, some high positioning errors remained (5% of position estimates had errors higher than 4 m). These errors can be explained as follows: in a complex indoor environment, unexpected factors (for example, moving people or the unexpected operation of APs) might cause unusual fluctuations of the Wi-Fi signal strength that were not captured during the training phase. As a consequence, some unusual samples might be present in the data collected during the online positioning phase, leading to outliers reflected in a few position estimates with high errors. To solve this problem, our approach will be combined with dead reckoning in the IPS in future work.