Indoor Localization Based on Weighted Surfacing from Crowdsourced Samples

Fingerprinting-based indoor localization suffers from its time-consuming and labor-intensive site survey. As a promising solution, sample crowdsourcing has been recently promoted to exploit casually collected samples for building offline fingerprint database. However, crowdsourced samples may be annotated with erroneous locations, which raises a serious question about whether they are reliable for database construction. In this paper, we propose a cross-domain cluster intersection algorithm to weight each sample reliability. We then select those samples with higher weight to construct radio propagation surfaces by fitting polynomial functions. Furthermore, we employ an entropy-like measure to weight constructed surfaces for quantifying their different subarea consistencies and location discriminations in online positioning. Field measurements and experiments show that the proposed scheme can achieve high localization accuracy by well dealing with the sample annotation error and nonuniform density challenges.


Introduction
Fingerprinting has been extensively researched for indoor localization systems in the last decade [1][2][3][4]. The basic idea is based on the assumption that each indoor location can be identified by a unique signal feature, called fingerprint. The widely used fingerprint is a vector of the received signal strengths (RSS) from the access points (AP) of wireless local access networks. The location of a test fingerprint can be estimated to a known location with minimal signal difference. One of the key challenges to support such fingerprinting localization is to construct an indoor radio map in the offline training phase [5][6][7][8]. Normally, the indoor environment is divided into non-overlapping grid cells. Site survey is often used to collect RSS samples at each grid center by surveyors for training one grid fingerprint for each grid. However, this scheme suffers from the time-consuming and labor-intensive site survey for radio map construction.
Recently, fingerprint crowdsourcing has been promoted to relieve or even eliminate the burdensome site survey by exploiting casually collected RSS samples [6][7][8][9]. Although not collected at specified locations, crowdsourced RSS samples still need to be annotated with some location information for fingerprint database construction. To this end, a common approach is to extract RSS samples from pedestrian movement trajectory [10][11][12]. As long as a trajectory can be correctly matched to one physical route, each step position can be obtained from the floor plan to annotate the corresponding step RSS sample.
Although fingerprint crowdsourcing seems a promising approach, care must be taken to deal with the samples with erroneously annotated locations. Compared with the site survey, such erroneous location annotation of crowdsourced samples could lead to an inaccurate radio map and degrade the performance of fingerprinting-based localization. Besides annotation errors, another challenge lies in that crowdsourced samples may not be uniformly distributed in the whole environment.
In this paper, we study the indoor localization through constructing radio propagation surfaces from crowdsourced samples. For each AP, its surface takes a location as input and outputs an estimated RSS of this location. To deal with annotation errors, we propose a cross-domain cluster intersection algorithm to assign each sample a reliability weight, which exploits the sample clustering results from both the physical and signal space. We next select a subset of weighted samples to fit each AP a surface from polynomial primary functions and construct subarea fingerprints by sampling AP surfaces. Furthermore, we compute two weights for each AP surface for describing its subarea consistency and location discrimination capability in online positioning. A two-step positioning algorithm is proposed to first determine the belonging subarea for a test sample, and then a weighted surface search is exploited to estimate its location within the subarea. We conducted field measurements and experiments. Compared with the peer schemes, results validate the effectiveness and robustness of the proposed scheme in terms of its lower localization error when facing sample annotation error and nonuniform density challenges.
The rest of the paper is organized as follows. Section 2 briefly reviews the most related work as well as the proposed system. The proposed offline surface fitting algorithm is presented in Section 3. Section 4 presents our online localization algorithm. Field measures are used for experiments and the results are provided in Section 5. Finally, Section 6 concludes the paper.

Related Work
Several fingerprinting systems based on sample crowdsourcing have been proposed for indoor localization in previous studies [13][14][15][16][17][18]. For example, Chen and Wang [13] proposed using a density-based clustering technique to group crowdsourced samples to generate a cluster fingerprint and using a matching algorithm to assign each cluster fingerprint to one subarea for room-level localization. Liu et al. [14] also applied crowdsourced samples for room-level localization yet with an improved energy-efficient sampling approach. Chang et al. [15] applied a local Gaussian process to construct grid fingerprints from crowdsourced samples. Jung et al. [16] adopted a hybrid global-local optimization scheme to determine the location of fingerprint sequences based on the constraint of the indoor structure, rather than using labeled fingerprints. They also proposed an unsupervised learning method to calibrate the localization model.
In the literature, many have proposed to extract crowdsourced samples from pedestrian trajectories. The core idea is to match a trajectory to one physical route such that each sample on a trajectory can be labeled with one location in the route [19][20][21][22][23][24][25]. For example, Kim et al. [19] combined the lightweight site survey and fingerprint crowdsourcing for radio map construction. They first constructed an initial radio map according to the lightweight site survey and use the pedestrian dead reckoning (PDR) to match the the war-walking paths into the radio map. Huang et al. [20] exploited layout landmarks such as the cross points of corridors for matching pedestrian trajectories to physical routes. Zhou et al. [23] proposed to transform the indoor layout into a semantic graph to map with activity sequences contained within the trajectories. Zhou et al. [25] applied a density-based spatial clustering algorithm to determine hotspots which are then mapped to physical subareas.
For crowdsourced samples, the conventional approach is to construct a radio map for grid fingerprints. In Ref. [26], Wang et al. proposed using polynomial functionals to fit a propagation surface for each AP based on a few reference fingerprints with correct location annotations. Ye and Wang [27] applied the surfacing method to deal with the problem of non-uniformly distributed crowdsourced samples, with the objective of composing grid fingerprints for radio map construction. Unlike these approaches, this paper proposes to exploit crowdsourced samples for fitting radio propagation surfaces. As crowdsourced samples normally have inaccurate location labels, how to construct a reliable surface is rather challenging. In this paper, we propose a sample weighting algorithm and apply weighted samples to fitting surfaces.

System Overview
We divided an indoor environment into several distinct subareas, such as rooms, corridors, etc., according to their functional layout by inherent obstructions and partitions such as concrete walls. We assumed that each crowdsourced sample has been annotated with some location, though possibly with annotation errors. We attributed each sample to one subarea according to its annotated location. The proposed system also consists of the offline and online phases.
The offline phase consists of four steps: Weighting crowdsourced samples assigns each crowdsourced sample a reliability weight based on our proposed cross-domain cluster intersection algorithm. Fitting radio surfaces constructs a radio propagation surface for each AP based on the weighted samples. Weighting fitted surfaces further assigns each fitted surface with two weights for discriminating their contributions for online localization. Constructing subarea fingerprints creates an RSS fingerprint for each subarea from its fitted and weighted surfaces.
The online localization consists of two steps: Subarea determination first locates an online test fingerprint into one subarea according to our proposed weighted signal distance. Location search searches the coordinate for the test fingerprint based on the gradient search on the constructed surfaces. Figure 1 presents the main flowchart of the proposed system, and Table 1 lists the symbols used in this paper as well as their notations. The flowchart of the proposed system: In the offline phase, crowdsourced samples are each weighted according to our algorithm. For each access point and for one subarea, its radio propagation surface is firstly fitted and also weighted from those selected and weighted samples. Subarea fingerprints are then composed from fitted surfaces. In the online phase, a test sample is first compared with subarea fingerprints to determine its belonging subarea, and then a gradient search is used to estimate its exact location. The ith crowdsourced sample in S. l i The annotated location of the ith crowdsourced sample. r i The RSS vector of the ith crowdsourced sample. N The maximum number of hearable AP in S. K The number of clusters. C p The set of clusters in the physical space. C s The set of clusters in the signal space.
The cross-domain cluster coefficient of the ith sample.
The reliability weight of the ith sample.
The RSS surface function. p th The percentile threshold in sample selection method. ω th The weight threshold in sample selection method. ω The increasing order of sample reliability weight. ω k The reliability weight at the p th percentile in ω.

S
The set of select samples. A The set of hearable Aps by samples in S .
The surface coefficient of the RSS surface function.

R
The set of RSS values from an AP in S . r i The normalized elements in R. η The entropy-like quantity for each AP in A.
The set of grid cells in one subarea. G The number of grids in G, The RSS vector of a test sample. f s The sth subarea fingerprint.

A int
The set of hearable APs by both f t and f s .

D s
The weighted signal distance between the test sample and a subarea.

M g
The number of grid cells. σ The standard deviation of location offset.

S site
The set of samples from site survey.

S walk
The set of samples from pedestrian trajectories.

Weighting Crowdsourced Samples
In one subarea, e.g., a room, let S = {s i , ..., s M } denote its set of M crowdsourced samples. A sample s i = ( l i , r i ) consists of two parts: l i = (x i , y i ) is its annotated location; and r i = (r i1 , r i2 , ...r iN ) the received RSS vector where N is the maximum number of hearable APs in one subarea. For one sample s i , it is possible that not all the N APs could be heard, that is, some r ij (j < N) might not be available in s i . In this case, to allow the clustering and surfacing algorithm to run normally, we simply set it to a very small RSS value, r min = −90 dBm, which is the lower bound of the collected signal strength, during the following sample clustering and weighting process.
A crowdsourced sample s i may not be reliable in that its annotated location l i , RSS measurement r i , or both might have some errors. However, among a large number of such samples, we conjecture that some statistical relations could be extracted from the similarities between the physical and signal space. Consider the following example of two samples s i and s j . Let d p ij l i − l j and d s ij r i − r j denote the distance between the two samples in the physical space and signal space, respectively. Suppose that d p ij is small, indicating that s i and s j are close to each other according to their annotated locations. For a small d s ij , we could conjecture that both samples are reliable or both samples are unreliable. Although we could not determine which is the real case for only two samples, we might be able to infer the statistic relations from a large number of samples to discriminate unreliable samples. Motivated from such considerations, we next present a cross-domain cluster intersection (CCI) algorithm to assign each sample a reliability weight.
In both the physical and signal space, we group all samples s i ∈ S into K clusters by the classic K-means clustering algorithm.
.., C s K } denote the set of clusters in the physical and signal space, respectively. Notice that a sample s i is within one of the clusters in C p and C s simultaneously. We define a cross-domain cluster coefficient for such a sample s i based on the cluster intersection between C p a and C s b as follows: If C p a = C s b , i.e., the two clusters contain the same set of samples, then all such samples have the same coefficient and γ i = 1. According to the K-means clustering, all samples in C p a are closer to this cluster center than to other cluster centers. This is also the case for samples in C s b in the signal space in terms of their RSS vector similarities. Therefore, |C p a C s b | describes how many samples are close to each other in both the physical and signal space. A small value of γ i indicates that s i is not similar to the majority of the two clusters, which might suggest its unreliability. As the surface fitting is done in the signal space, we further normalize γ i to assign the sample weight based on the signal space clusters. For each sample s i ∈ C s b , we compute its reliability weight by where the denominator is the maximum cross-domain cluster coefficient of the samples in the cluster. Figure 2 illustrates the CCI algorithm and computes reliability weights for some samples.

Fitting Radio Surfaces
In one subarea, we construct a radio propagation surface for each hearable AP based on the weighted samples (s i , w i ). A surface function φ(x, y) takes a location as its input and outputs an estimated RSS at this location. Notice that the number of crowdsourced samples could be large and keep increasing. To reduce computational complexity and alleviate surface overfitting, we propose a percentile weight partition (PWP) method to select only a subset of weighted samples.
Define p th and ω th as the percentile and weight threshold, respectively, and p th , ω th ∈ [0, 1]. The objective is to ensure that more than p th samples have weights larger than ω th . We first sort samples according to their weights in an increasing order, denoted by ω. Let ω k denote the kth sample whose weight is at the p th percentile of ω. If ω k ≥ ω th , then no samples will be removed. Otherwise, we remove the first M−kω th 1−ω th samples from w. After the sample selection, let S denote the set of select samples and let A denote the set of hearable APs by samples in S .
In this paper, we adopt a polynomial function to fit a radio propagation surface for each AP in A as follows: where a ij s are fitting coefficients. The objective of weighted surface fitting is to To compute one fitting coefficient a er , we equate its partial derivative to zero to minimize H.
From the equation above, we can derive We define Thus, we can rewrite the equation as: a cd u cd (e, r) = v (e, r), e = 1, · · · , p, r = 1, · · · , q The matrix form of equation above is: Then, by A = U −1 V, the surface coefficient can be calculated.

Weighting Fitted Surfaces
Each AP surface is constructed based on its weighted samples. Different AP surfaces could contribute differently for describing the whole signal space. We next assign two weights to each AP surface via an entropy-like quantity computed from its samples: one is used for subarea determination and the other for location search in our online positioning.
For each AP in A, let R = {r 1 , ..., r R } denote its set of RSS values extracted from the weighted samples in S . As the samples are assumed to be crowdsourced randomly from different locations, the set R is also expected to contain the RSS values from different locations. If all elements in R have similar values, then this AP might not be very helpful for discriminating different locations in one subarea. On the other hand, such an AP may be seen as a good indication of this subarea for its RSS consistency. Motivated by such considerations, we propose to weight AP surfaces for their different subarea consistencies and location discriminations from an entropy-like viewpoint.
We first normalize the elements in R bȳ , for all r i ∈ R.
We next compute an entropy-like quantity η for each AP in A to describe its RSS distribution property by For our two-step online positioning, we compute two surface weights for each AP: ρ sub n is used in the subarea determination, while ρ loc n is used in the location search in one subarea.

Constructing Subarea Fingerprints
For each subarea, we construct a subarea fingerprint f based on its weighted surfaces φ n (n ∈ A). We adopt a grid lattice approach to sample each surface φ n uniformly in the physical space. Let G denote such a grid structure. For the gth grid, let f gn = φ n (g x , g y ) denote a sampled grid RSS value from the nth surface, where (g x , g y ) is the coordinate of the grid center. Then, f consists of subarea-averaged RSS values for all hearable APs where N = |A| is the number of hearable APs in A.

The Online Positioning Algorithm
The online positioning consists of two phases: subarea determination and location search. Subarea Determination: Let f t denote the RSS vector of a test sample, and f s the sth subarea fingerprint. Let A int denote the set of hearable APs by both f t and f s . We compute the weighted signal distance between f t and f s as: where f sn and f tn are the RSS values from the nth hearable AP in f s and f t , respectively. The test sample is then localized into a subarea with the minimum D s . Location Search: Assume that the sth subarea is selected in the first phase. We next search a space point (x,ŷ) in this subarea to minimize the weighted signal difference between f t and subarea surfaces: In this paper, we use the gradient descent search method. Instead of randomly choosing a start point, we use the localization result of a simple nearest neighbor (NN) algorithm as the initial searching point, where the grid fingerprints are spatially sampled from the fitted surfaces. We then calculate weighted signal difference as the cost function and its partial derivation to determine the search direction. The cost function is defined as The search iteration is defined by where α d is the search step. We substitute Equation (3) into Equation (18): Next, we compute the partial derivation of this cost function to gain the gradient and update the search iteration.
The gradient search will stop when the d t is too small to update the search position for the next iteration. Figure 3 plots the indoor layout of our field measurements in a typical lecture building with total area of 482 m 2 . In our work, we did not place our own APs. Instead, we employed the existing Wi-Fi infrastructure with APs deployed by different parties, such as individual laboratories, telecom operators and campus authorities. Indeed, the total number of hearable AP in our experimental environment was more than 400, while, for each sample, normally >70 APs could be heard. We note that emplying the existing Wi-Fi infrastructure makes our proposed scheme ready to be implemented in many practical scenarios. A Huawei Honor 3C smartphone was used to collect RSS samples. We conducted two batches of sample collection: The first batch S site was based on the site survey approach, containing in total 13,670 samples each collected at one grid center. The second batch S walk , containing in total 13,370 samples, was extracted from movement trajectories restricted to only those walkable routes, as illustrated by the colored area in one room in Figure 3. Note that the samples in both S site and S walk are firstly annotated true location information at collection. To emulate annotation errors, we again annotate each sample into a new location with a location offset randomly drawn from a Gaussian distribution with zero mean and σ standard deviation. The test set S test contains 5600 samples uniformly distributed in the whole environment.

Experiment Settings
Experiment Schemes: According to their annotated locations, crowdsourced samples can be assigned into different grids to construct grid fingerprints. Similarly, they can also be grouped into different clusters in the signal space to obtain cluster fingerprints. We tested the following peer localization schemes to examine these typical approaches.

•
FGrid emulates the traditional site-survey fingerprinting based on grid fingerprints, which divides the subarea into several non-overlapping grid cell to contain samples, and assigns each new sample into its nearest grid cell. For each grid cell, a grid fingerprint is composed by averaging all samples located within the grid cell, and the location of the grid fingerprint is annotated as the grid center. In the online phase, we used the nearest neighbor algorithm.

•
SGrid is similar to the FGrid to obtain grid fingerprints. We then constructed surfaces based on these fingerprints in the offline phase. In the online phase, we used the same surface search method as the one in our proposed SWSample.

•
SRaw retains the original position of every crowdsourced sample and fits propagation surfaces based on them. In the online phase, we used the same surface search method as the one in our proposed SWSample.

•
SCluster clusters the samples in signal domain only. For each cluster, we obtained a cluster fingerprint, which is the average of its cluster members' RSS vectors. The location of a cluster fingerprint is the geometric center of the cluster members. We fitted the propagation surfaces for every AP based on these cluster fingerprints. In the online phase, we used the surface search method the same as the one in our proposed SWSample.

•
SWSample is the proposed scheme.
In all the above schemes, we set the cluster number equal to the number of grids used in FGrid for a fair comparison. We also adopted the proposed two-step online positioning algorithm. We noticed that, from our experiments, the subarea hitting rate of all these schemes is not smaller than 99.58%, i.e., almost all test samples can be correctly determined to its belonging subarea. Thus, we do not report this result again in the following.

Surface Fitting Examples
Figures 4-7 plot the fitted surfaces for the four surfacing-based schemes. We chose one AP for Room A and fit its surface from 1800 samples randomly drawn from S site S walk . For the SWSample fitted surface in Figure 7, we also color the sample weight as shown by the weight color spectrum alongside the graph. It can be seen that the proposed scheme could produce a smoother surface, compared with other schemes. If we assume that this AP is located at the coordinate around the highest RSS value, then we could observe that the surface in Figure 7 is more like an attenuated sphere centered at the AP. The Keenan-Motley path loss model has been widely adopted to characterize the radio propagation in mobile cellular networks. If such a model could still be applicable in a small and open space such as a room, then our fitted surface resembles the most to this model, which might also help to explain the effectiveness of our weighted surface fitting.

Experiment Results
Uniformly distributed samples: We first considered the scenario that all crowdsourced samples are uniformly distributed in the experiment environment, that is, we used crowdsourced samples from S site S walk . Figure 8 plots the average localization error (ALE) against the number of crowdsourced samples randomly drawn from S site S walk . It was first observed that all the surfacing schemes outperform the grid fingerprinting FGrid, which validates the effectiveness of using fitted radio propagation surfaces for localization. When the number of samples increases, from about 0.33M g to 20M g with M g the number of total grid cells, the ALE of the surfacing schemes first decreases and then increases. At first, the number of samples is not large enough to well fit actual surfaces. In this case, our scheme SWSample has a slightly higher ALE than other surfacing schemes (see the first two points in Figure 8) due to its sample selection. On the other hand, if the noisy samples are too many, the surfaces may be overfitted for unreliable samples. However, ours presents a decent degradation and the ALE of using all 27,040 samples is 1.54 m, slightly higher than the best case of 1.45 m of using 3605 samples. The positioning accuracy improvement of our scheme are 36.71% over FGrid and 9.41% over SRaw, respectively. Compared with the SRaw scheme, the improvement can be attributed to our sample weighting and selection algorithm, which only chooses those reliable samples for weighted surface fitting, leading to a more accurate radio map and better positioning results.  Figure 9 presents the ALE against the standard deviation σ of location offset. Notice that σ = 0 indicating no annotation errors. It is not unexpected to see that all schemes suffer from the increasing of σ, i.e., the annotated locations farther away from true locations. However, our scheme SWSample still performs the best. Figure 10 plots the cumulative distribution function (CDF) of localization error. It is worth noting that, besides a low median localization error of 1.51 m, our SWSample has a low 90% percentile error of only 2.64 m. To provide the exact numbers, Table 2 summarizes the localization error results for three situations, namely, σ = 0.6 m, σ = 0.9 m, and σ = 1.2 m, respectively.   Non-uniformly distributed samples: It is also often the case that crowdsourced samples are not uniformly distributed in the whole environment. To examine this nonuniform density issue, we only use the samples from S walk to fit surfaces. That is, the subregion of chairs and desks in each room do not contain crowdsourced samples. However, as we intentionally include location annotation errors, some samples may still be annotated to locations within such a vacant subregion. As shown in Figures 11 and 12, it is not unexpected to observe that all schemes suffer from such a nonuniform density situation, comparing with the results in Figure 8. However, our SWSample scheme can still outperform other schemes in most of cases. The positioning accuracy improvements are 36.85% over FGrid and 18.79% over SRaw, respectively. Furthermore, the median and 90% localization errors in Figure 13 are 2.04 m and 3.24 m, respectively, which are comparable to the uniform density case. Table 2 summarizes the localization error results from three situations for non-uniformly distributed samples. It can be observed that our proposed scheme has great potential to obtain a better result in this non-uniformly distributed case, which illustrates its robustness for tackling the nonuniform density challenge.

Concluding Remarks
This paper has studied the problem of constructing radio propagation surfaces from unreliable crowdsourced samples with annotation errors. We have proposed a cross-domain cluster intersection to weight each sample reliability and an entropy-like approach to further weight the constructed surfaces. Field experiments have validated its effectiveness and robustness for dealing with the nonuniform density challenge. Our proposed method contributes to indoor localization society in its high accuracy and easy implementation.
We close this paper with some discussions about future work. This paper has applied polynomial functions for fitting radio propagation surfaces in the offline phase. Indeed, the propagation surfaces may take different forms and there could exist many other primary functions or stochastic kernels for surface fitting. How to intelligently choose the most suitable primary functions or stochastic kernels and automatically adjust their fitting parameters are worthy of further research. In this paper, we have applied the commonly used deterministic positioning algorithm in the online phase. Using some probabilistic positioning algorithms, especially when the radio propagation surfaces are modelled as stochastic processes, is also worthy of further investigation.