A Modified Inverse Distance Weighting Method for Interpolation in Open Public Places Based on Wi-Fi Probe Data

Urban open places with a public service function (e.g., urban parks) are likely to be populated in peak hours and during public events. To mitigate the risk of overcrowding and even events of stampedes, it is of considerable significance to realize a real-time full coverage estimate of the population density. The main challenge has been the limited deployment of crowd surveillance detectors in open public spaces, leading to incomplete data coverage and thus impacting the quality and reliability of the density estimation. To remedy this issue, this paper proposes a modified inverse distance weighting (IDW) method, named the inverse distance weighting based on path selection behavior (IDWPSB) method. The proposed IDWPSB method adjusts the distance decay effect according to visitors’ path selection behavior, which better characterizes the human dynamics in open spaces. By implementing the model in a real-world road network in the Shichahai scenic area in Beijing, China, the study shows a decrease in the absolute deviation by 17.62% comparing the results between the new method and the traditional IDW method, justifying the effectiveness of the new method for spatial interpolation in open public places. By considering the behavioral factor, the proposed IDWPSB method can provide insights into public safety management with the increasing availability of data derived from location-based services.


Introduction
In Chinese cities, open places serving a public service function (e.g., urban parks) are likely to be populated in peak hours and during public events. The inherent difficulties of surveillance and crowd control in these open spaces highlight the numerous evacuation problems that could arise in an emergency, including the occurrence of stampedes. This is precisely what occurred on February 5, 2004, at the Lantern Festival in Beijing's Miyun Park. In this instance, the primary viewing space was trampled and caused 37 fatalities even though the total number of visitors was far below the park's maximum capacity. Hence, controlling the total number of visitors is insufficient for comprehensive safety management in open public places. It is paramount to realize a real-time and full coverage population density estimate to prevent local overcrowding.
To date, many techniques have been applied to crowd surveillance. Video monitoring recognition is one of the most widely used techniques in public security surveillance. With the development of computer vision, studies on crowd analysis [1,2] and anomalous detection [3][4][5] based on video monitoring recognition techniques have made significant strides. Although critical improvements have been made in the video monitoring recognition techniques, it remains infeasible to count the number of people primarily due to the partial occlusions among individuals when the crowd density is extremely high [6]. This difficulty in counting high-density crowds has become a more severe issue for safety management in open public places.
There is an opportunity to overcome this issue by using location-based services (LBSs), such as cellular phone and Wi-Fi probe data. The popularity of smartphones ensures that cell phone data, including cellular signaling data and call detail record data, can fill the gap of individual-based trajectory tracking in urban open spaces [7][8][9][10]. Nonetheless, employing cellular phone data has been considered an invasion of people's privacy [11] and, consequently, its widespread
To estimate the spatial distribution of the crowd population in open public places, we have proposed a new interpolation method called the inverse distance weighting method based on path selection behavior (IDWPSB). The path selection behavior is used to adjust the distance decay effect, making the interpolation result more suitable for approximating the actual crowd characteristics. The IDWPSB has been applied to a case study in the Shichahai scenic area in Beijing, China, to verify the computational efficiency and reliability of this new method in estimating the crowd population in road networks.

Materials
2.1. Description of Wi-Fi Data. A Wi-Fi-enabled mobile device (e.g., smartphone) can initiate a connection to a Wi-Fi network by continuously broadcasting signals, also known as the probe requests. Each probe request contains a sequence of device information, including the media access control (MAC) address, device type, brand, and manufacturer. For each smart device, the MAC address is unique to the network connection. Since the probe requests are not encrypted, they can be passively captured and decoded with the help of wireless sniffers. In addition, the received signal strength (RSS) of probe requests can also be measured. All these data are uploaded to a server.
The Wi-Fi probe is a type of wireless sniffer. A description of the data collected by the Wi-Fi probe is shown in Table 1. Since there is a clear correspondence between a user and his smart device, the user's MAC address can be regarded as an identifier of a specific individual who is located within the detection range of the Wi-Fi probe. To avoid violations of privacy, the user's MAC address is anonymized by only extracting a fragment from the source record. The timestamp records the temporal information of the Wi-Fi connection. The RSS indicates the connection intensity, which mainly depends on the physical distance between the smart device and the Wi-Fi probe.

2.2. Preprocessing of Wi-Fi Data. Before being analyzed, the raw Wi-Fi data extracted from the server require several steps of preprocessing. First, the RSS needs to be resampled at a regular time step because the probe requests are bursty: the requests sent by a mobile device are not evenly distributed in time. For example, there can be a sequence of probe requests from one smart device within a few hundred milliseconds, followed by a silent period that lasts several seconds. The number of bursts depends on the working condition of the smart device. To mitigate the uncertainty of the RSS captures, we averaged the RSS values received within a one-second interval [25].
The second step is to uniformize the detection radius among the Wi-Fi probes installed in various environments by data screening. The relationship between the RSS and the physical distance from the smart device to the Wi-Fi probe is calculated by Equation (1), which involves two calibration parameters: one for the RSS and one for the distance decay effect. For a given type of Wi-Fi probe, both calibration parameters are constants. By setting a maximum detection radius for the Wi-Fi probe, the lower limit of the RSS can be calculated accordingly, and the request records with RSS values smaller than this lower limit can be excluded. In addition, to remove individuals who are not outdoors, a threshold can be set for the time duration, defined as the end time minus the start time. All the records with a time duration larger than the threshold value (i.e., one hour) are eliminated because individuals in an indoor environment stay connected longer.
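The screening steps described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the record layout `(mac, timestamp, rss)`, the function names, and the calibration values are hypothetical, and the log-distance path-loss form used in `rss_lower_limit` is a common stand-in for the RSS-distance relationship, not necessarily the calibrated model used in this study. Measuring the stay duration after the RSS screening is also an assumption.

```python
import math
from collections import defaultdict


def resample_rss(records, step=1.0):
    """Average the RSS of each device within fixed time bins of `step` seconds.

    records: iterable of (mac, timestamp_seconds, rss_dbm) tuples (assumed layout).
    Returns {(mac, bin_index): mean_rss}.
    """
    bins = defaultdict(list)
    for mac, ts, rss in records:
        bins[(mac, int(ts // step))].append(rss)
    return {key: sum(vals) / len(vals) for key, vals in bins.items()}


def rss_lower_limit(max_radius_m, rss_at_1m, decay_exponent):
    """ASSUMED model: rss(d) = rss_at_1m - 10 * decay_exponent * log10(d)."""
    return rss_at_1m - 10.0 * decay_exponent * math.log10(max_radius_m)


def screen(resampled, rss_min, max_duration_s=3600.0, step=1.0):
    """Drop samples weaker than rss_min, then drop devices that stay too long."""
    kept = {k: v for k, v in resampled.items() if v >= rss_min}
    first, last = {}, {}
    for mac, b in kept:
        first[mac] = min(first.get(mac, b), b)
        last[mac] = max(last.get(mac, b), b)
    # Duration = end time minus start time of the remaining records.
    ok = {m for m in first if (last[m] - first[m]) * step <= max_duration_s}
    return {k: v for k, v in kept.items() if k[0] in ok}


# Hypothetical records: "bb" is too weak, "cc" stays longer than one hour.
records = [
    ("aa", 0.1, -60.0), ("aa", 0.4, -62.0), ("aa", 5.2, -70.0),
    ("bb", 1.0, -95.0),
    ("cc", 0.0, -50.0), ("cc", 4000.0, -55.0),
]
resampled = resample_rss(records)
limit = rss_lower_limit(max_radius_m=50.0, rss_at_1m=-45.0, decay_exponent=2.5)
screened = screen(resampled, limit)
print(sorted(screened))
```

With these made-up values, only the two one-second bins of device "aa" survive the screening. In practice the calibration values would come from field measurements for the specific probe type.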
Third, considering that some individuals, such as children and senior citizens, do not carry a smart device, we adjusted the detected population through the detectable rate of the Wi-Fi probe. The detectable rate of the Wi-Fi probe is defined as
$$r = \frac{N_d}{N_a}$$
where $r$ is the detectable rate, $N_d$ is the population detected by the Wi-Fi probe, and $N_a$ is the actual population in the detection radius. The detectable rate of a Wi-Fi probe is related to multiple factors, including the installation environment, working conditions of the smart devices, and the population under coverage. We obtained the parameter $r$ in a field survey, in which $N_a$ was acquired by manual counting.
The local crowd density at a sampling site is calculated as
$$\rho_i = \frac{N_i}{S_i}$$
where $\rho_i$ is the local crowd density and $N_i$ is the adjusted detected population at a sampling site $i$. $S_i$ is the total area of the road segments in the detection radius of Wi-Fi probe $i$ and is calculated as
$$S_i = \sum_{k=1}^{m} l_k w_k$$
where $m$ is the number of road segments within the detection radius of Wi-Fi probe $i$, $l_k$ is the length of road segment $k$, and $w_k$ is the average width of road segment $k$.
According to the principle of the IDW, the estimated crowd density $\rho_j$ at an unsampled site (node) $j$ is calculated by Equation (6) as a weighted average of the crowd densities $\rho_i$ at the sampling nodes $i$, in which the weight of each sampling node decreases with $d_{i,j}$, the shortest path distance from node $i$ to node $j$.
A series of scenarios were set to simulate the interpolation process for open public places and validate the effectiveness of Equation (6). As shown in Figure 1, we assume that one node is the unsampled site and that two other nodes are sampling sites. We also assume that the path distances from the unsampled node to the two sampling nodes are equal. Equation (7) formulates the crowd density at the unsampled node, which is estimated based on the observed populations at the two sampling nodes and on the shortest path distances of the two segments that connect them to the unsampled node.
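The population adjustment and the density definition at a sampling site can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the survey numbers are hypothetical, and dividing the detected count by the detectable rate is an assumed form of the adjustment, since the text only states that the detected population is adjusted through the detectable rate.

```python
def detectable_rate(detected, actual):
    """Detectable rate: detected population over manually counted population."""
    return detected / actual


def adjusted_population(detected, rate):
    """ASSUMPTION: scale the detected count up by the detectable rate."""
    return detected / rate


def road_area(segments):
    """Total road area in the detection radius.

    segments: iterable of (length_m, average_width_m) pairs.
    """
    return sum(length * width for length, width in segments)


def crowd_density(detected, rate, segments):
    """Adjusted population divided by the covered road area (persons per m^2)."""
    return adjusted_population(detected, rate) / road_area(segments)


# Hypothetical field survey: 60 devices detected while 80 people were counted.
rate = detectable_rate(60, 80)
# Hypothetical probe covering two road segments (100 m x 4 m and 50 m x 6 m).
segments = [(100.0, 4.0), (50.0, 6.0)]
density = crowd_density(105, rate, segments)
print(rate, road_area(segments), density)
```

In this made-up case the rate is 0.75, the covered area is 700 square meters, and 105 detected devices correspond to an adjusted population of 140, i.e., a density of about 0.2 persons per square meter.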
To further explain Equation (7), we have designed three scenarios, as shown in Figure 1. In Figure 1 road, as shown in Figure 1(b). When individuals select roads randomly (i.e., no path selection behavior), the crowd flow will evenly split between the two segments. In this case, the two sampling nodes still contribute equally to the population estimation at the unsampled node. In Figure 1(c), when one sampling node becomes more attractive than the other, we adjust the distance weight according to the path selection behavior to reflect the attractiveness of the road segment.
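The equal-distance scenario can be reproduced with a small network IDW sketch in Python. Several things here are assumptions rather than details taken from the paper: the dictionary-of-dictionaries graph layout, the node labels, the density values, and in particular the distance exponent of 2, which is a common default for IDW and is exposed as the `power` parameter.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra shortest path distances from `source` over a weighted graph.

    graph: {node: {neighbor: edge_length_m}} (assumed layout, connected graph).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def idw_estimate(graph, target, sampled_density, power=2.0):
    """Inverse-distance-weighted density at `target` using path distances.

    power=2.0 is an ASSUMED exponent, not a value confirmed by the text.
    """
    dist = shortest_path_lengths(graph, target)
    num = den = 0.0
    for node, rho in sampled_density.items():
        if dist[node] == 0.0:
            return rho  # the target itself is a sampling node
        weight = 1.0 / dist[node] ** power
        num += weight * rho
        den += weight
    return num / den


# Hypothetical junction: unsampled node "o" lies 100 m from sampling nodes "a" and "b".
equal = {"o": {"a": 100.0, "b": 100.0}, "a": {"o": 100.0}, "b": {"o": 100.0}}
print(idw_estimate(equal, "o", {"a": 0.2, "b": 0.4}))

# Moving "b" to 200 m reduces its weight, pulling the estimate towards "a".
uneven = {"o": {"a": 100.0, "b": 200.0}, "a": {"o": 100.0}, "b": {"o": 200.0}}
print(idw_estimate(uneven, "o", {"a": 0.2, "b": 0.4}))
```

With equal distances both sampling nodes contribute equally, so the estimate is the plain average of the two densities; this is the behavior that the path selection adjustment is meant to change when one segment is more attractive than the other.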

3.2. Inverse Distance Weighting Based on the Path Selection Behavior. Because of the large number of junctions in the road networks of open public places, the distance decay effect with the spatial distance on a straight road is altered by individual preference. Through the observation of individuals' activity trajectories derived from the Wi-Fi data, the path selection behavior pattern, denoted as the transition probability, can be extracted. Adjustment of the distance decay effect according to the transition probability of the path selection behavior can improve the interpolation accuracy.
An activity trajectory is defined as a sequence of locations that an individual walks through during a period, denoted as $\{(v_k, t_k) \mid k = 1, 2, \ldots, K\}$, where $v_k$ is the $k$th node on the activity trajectory and $t_k$ is the timestamp of node $v_k$, both of which are recorded by a series of Wi-Fi probes located at different sites, and $K$ is the total number of nodes on the activity trajectory. As shown in Figure 2, a cluster of activity trajectories can reveal the diversity of the path selection behavior among the crowd. In a road network, the set of nodes that most activity trajectories contain has a stronger association, regardless of the spatial distance.
The transition probability of the path selection behavior can be calculated from a large number of activity trajectories recorded by Wi-Fi data over a period of time. For a pair of adjacent nodes, $i$-$j$, the transition probability $P_{i,j}$ is defined as the proportion of activity trajectories that contain both nodes $i$ and $j$ to those containing node $i$. It is counted over the $N$ individuals recorded by the Wi-Fi probes, and the two nodes may appear at arbitrary timestamps on an activity trajectory. For nodes that are not adjacent, the transition probability is the product of the transition probabilities of each pair of adjacent nodes on the shortest path between the unsampled node and the nonadjacent sampling node, calculated as
$$P_{i,j} = \prod_{k=0}^{m} P_{u_k,u_{k+1}}, \quad u_0 = i, \; u_{m+1} = j$$
where $m$ is the number of intermediate nodes from node $i$ to node $j$ and $u_1, \ldots, u_m$ are these intermediate nodes. It is worth noting that since an activity trajectory has a movement direction, $P_{i,j}$ may not be equal to $P_{j,i}$. Thus, the relationship between node $i$ and node $j$ can be represented as the average of the transition probabilities between nodes $i$ and $j$, denoted as
$$\bar{P}_{i,j} = \frac{P_{i,j} + P_{j,i}}{2}$$
Then, the transition probability of the path selection behavior is used to adjust the distance decay effect in the traditional IDW metric. The higher the transition probability of the path selection behavior is, the smaller the distance decay is. The adjusted path distance $d^{*}_{i,j}$ from node $i$ to node $j$ is calculated from the shortest path distance $d_{i,j}$ and the average transition probability $\bar{P}_{i,j}$ between nodes $i$ and $j$.
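A minimal Python sketch of the transition probability calculation is given below. It follows the verbal definitions only: the share of trajectories containing one node that also contain the other, the product along a path, and the average over both directions. The trajectory representation (plain node sequences with timestamps dropped), the node labels, and the function names are hypothetical, and the final step that converts the averaged probability into an adjusted distance is deliberately left out because its exact form is not reproduced here.

```python
from math import prod


def transition_probability(trajectories, i, j):
    """Share of the trajectories containing node i that also contain node j."""
    with_i = [t for t in trajectories if i in t]
    if not with_i:
        return 0.0  # node i was never visited
    return sum(1 for t in with_i if j in t) / len(with_i)


def path_probability(trajectories, path):
    """Product of transition probabilities over consecutive node pairs on `path`."""
    return prod(
        transition_probability(trajectories, u, v)
        for u, v in zip(path, path[1:])
    )


def average_probability(trajectories, path):
    """Mean of the forward and backward path probabilities."""
    forward = path_probability(trajectories, path)
    backward = path_probability(trajectories, path[::-1])
    return 0.5 * (forward + backward)


# Hypothetical trajectories over three nodes on the path a - o - b.
trajectories = [["a", "o", "b"], ["a", "o"], ["o", "b"], ["a"]]
print(transition_probability(trajectories, "o", "b"))  # 2 of 3 trajectories with "o"
print(transition_probability(trajectories, "b", "o"))  # both trajectories with "b"
print(average_probability(trajectories, ["a", "o", "b"]))
```

The example shows why the two directions are averaged: every trajectory that reaches "b" also passes "o", while only two of the three trajectories through "o" continue to "b".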
3.3. The Validation Index of the Interpolation. Cross-validation is applied to evaluate the accuracy of the interpolation results by comparing the interpolated crowd density of a sampling node with the adjusted crowd density detected by a Wi-Fi probe. As mentioned before, there is a deviation between the population detected by the Wi-Fi probe and the ground truth data because the detectable rate of the Wi-Fi probe is not 100%. For this reason, some field survey experiments were carried out to calibrate the detection rate of the Wi-Fi probe and to adjust the detected population according to Equation (3), making the adjusted population as close to the ground truth value as possible. Although there is still a deviation between the adjusted population and the ground truth data, assuming that this deviation is a systematic error for each Wi-Fi probe, it is acceptable to use the adjusted population as the ground truth because it does not affect the validation result.
The widely used accuracy indices, the mean absolute error (MAE) and the mean bias error (MBE), are used to compare two values at each node: the adjusted crowd density detected by a Wi-Fi probe on the node, which is regarded as the true value, and the estimated value obtained by the IDWPSB method on the same node.
The MAE represents the mean value of the absolute deviations between the estimated values and the true values. The smaller the MAE is, the closer the estimated values are to the true values. The MBE denotes the mean bias between the estimated values and the true values; the sign of the MBE suggests underestimation (positive value) or overestimation (negative) of the interpolation results.
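The two indices can be written as short Python functions. The sketch below takes the bias as the true value minus the estimate, which is the sign convention implied by the statement that a positive MBE indicates underestimation; the function names and the density values are hypothetical.

```python
def mean_absolute_error(true_values, estimates):
    """Mean of the absolute deviations between true and estimated values."""
    return sum(abs(t - e) for t, e in zip(true_values, estimates)) / len(true_values)


def mean_bias_error(true_values, estimates):
    """Mean of (true - estimate): positive means the estimates are too low."""
    return sum(t - e for t, e in zip(true_values, estimates)) / len(true_values)


# Hypothetical adjusted densities (treated as true) and interpolated densities.
true_values = [0.30, 0.50, 0.20]
estimates = [0.25, 0.55, 0.10]
print(mean_absolute_error(true_values, estimates))
print(mean_bias_error(true_values, estimates))
```

Here the MAE is about 0.067 and the MBE is about 0.033; the positive sign indicates that the interpolation underestimates the density on average for these made-up values.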

Study Area and Data Collection
The Shichahai scenic area is a tourist neighborhood in Northwest Beijing, covering an area of 2.31 million square meters. As an open public place in a metropolitan center, the Shichahai scenic area fulfills multiple public service functions, combining a tourist attraction with business and residential sectors. Because of its multifunctional role, the area is crowded throughout the year. On a typical holiday, the crowd flow can easily reach 20 thousand per hour. Thus, there has been a considerable safety concern and a need for crowd control in this area.
As shown in Figure 3, the road network in this area is rather complex and contains many junctions. To select the optimal locations to install the Wi-Fi probes, first, the road segments considered to be crowded were designated by experts, as illustrated by the purple lines in Figure 3. Then, the three tourist landmarks, including the Yinding Bridge, the former residence of Soong Ching Ling, and the Prince Gong's Mansion, illustrated by pink hollow stars in Figure 3, were also selected as candidates for installation. Taking these factors into account, 23 Wi-Fi probes, denoted as p1, p2, ..., p23, were installed along the main roads to collect the crowd data. By preprocessing the Wi-Fi data, the detection radius of the Wi-Fi probe was uniformized to 50 m.
During the study period from December 23, 2016, to March 16, 2017, the crowd data of more than 6,000,000 individuals were collected by the installed Wi-Fi probes. The mean value of the all-day crowd density during the study period was calculated according to Equation (3) for each sampling site and is shown in Figures 3(a) and 3(b) for a weekday and a holiday. Table 2 shows the exact values of the crowd density at each sampling site and ranks them by sorting the weekday and holiday values in descending order. It can be seen from the table that although the absolute crowd density value differs between the two days, there is no significant difference in the ranks of the crowd density. That is, whether on the weekday or the holiday, the crowding patterns are similar in the study area. The most crowded sites are near the Yinding Bridge, including p8, p11, p14, and p12, which are the core scenic spots.

Results and Discussion
5.1. Comparing the Global Crowd Density Estimates by the IDWPSB and IDW Methods. As shown in Figure 4, we selected a typical peak time (3 p.m.) on a weekday (Feb 8, 2017) and a holiday (Dec 25, 2016) as an example to show the interpolation results of the IDWPSB method. The spatial resolution of the interpolation was 5 m, generating estimates for 628 unsampled sites in total. Overall, the global crowd density on the holiday was significantly higher than that on the weekday, with more road segments highly populated. The crowd tended to aggregate towards the northeast corner, the Yinding Bridge, on both days; meanwhile, the population was relatively sparse in the southwest quadrant, which is the residential sector.
Additionally, the traditional IDW method was employed to interpolate the population density and was compared with the new IDWPSB method. As shown in Figure 5, the majority of the MAE values at the sampling sites obtained by the IDWPSB method are smaller than those obtained by the IDW method. As shown by the MBE values, the IDWPSB method is less likely to overestimate or underestimate the crowd density in the cross-validation, suggesting a more robust performance than the IDW method. Furthermore, the accuracy improvement ratio was calculated from the difference between the IDWPSB MAE and the IDW MAE. For the weekday result, the accuracy improvement ratio is 20.49%, and for the holiday, the ratio is 15.69%. In general, the IDWPSB method improves the interpolation accuracy by 17.62%, indicating a better performance than the IDW. In terms of computational efficiency, the interpolation process took 52.4 seconds using the IDW method and 74.3 seconds using the IDWPSB method under the same computing environment. These various aspects of comparison suggest that including the path selection behavior in the IDWPSB method can improve the interpolation accuracy without significantly increasing the computational demand.
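One plausible way to compute such an improvement ratio is to normalize the MAE difference by the IDW MAE. The Python sketch below uses that form as an assumption, since the text only states that the ratio is based on the difference between the two MAE values; the MAE inputs are hypothetical, while the two running times are the figures reported above.

```python
def improvement_ratio(mae_idw, mae_idwpsb):
    """ASSUMED form: relative MAE reduction of IDWPSB with respect to IDW."""
    return (mae_idw - mae_idwpsb) / mae_idw


def relative_overhead(time_baseline_s, time_new_s):
    """Extra running time of the new method relative to the baseline."""
    return (time_new_s - time_baseline_s) / time_baseline_s


# Hypothetical MAE values: a drop from 0.10 to 0.08 is a 20% improvement.
print(f"{improvement_ratio(0.10, 0.08):.2%}")

# Running times reported in the text: 52.4 s (IDW) and 74.3 s (IDWPSB).
print(f"{relative_overhead(52.4, 74.3):.1%}")
```

Expressing the running times in the same relative terms makes the accuracy gain and the additional computing time directly comparable.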

5.2. Factors Influencing the Interpolation Accuracy. Several factors may affect the interpolation accuracy and should be taken into consideration. The first factor is the number of sampling sites. We implemented the IDWPSB interpolation by using different numbers of sampling sites ranging from 10 to 22 and then randomly selecting sampling sites from the pool. Figure 6 shows the variation in the MAE value for each sampling site when the site is estimated using cross-validation. It is evident that there is a decreasing trend in the MAE when the number of sampling sites increases in all instances. In reality, the optimal number of sampling sites should be determined by both budget consideration and requirements for accuracy. The second factor is the location of the sampling sites. To evaluate the relationship of the interpolation error with respect to the location of the sampling sites, the all-day crowd density for each sampling site on the weekday and the holiday was estimated from all the other sampling sites using the IDWPSB method. The spatial distributions of the MAE and MBE are shown in Figure 7. It is worth noting that, on either day, the site with the maximum interpolation error is p8, which is the sampling site with the highest crowd density. This result is likely due to the inherent bias of the IDW method in that it fails to extrapolate the maximum value. Thus, we suggest that the deployment of the Wi-Fi probes should cover the sites with the local maximum values.
The third factor is the role of an unsampled node in the road network. We use the node centrality to indicate the importance of an unsampled node in the road network. Here, the centrality of node i is defined as the number of times that node i is on the shortest path between any other two nodes being the origin and the destination. As shown in Figure 8, we investigated the relationship between the node centrality of an unsampled node and its interpolation MAE in the cross-validation. It is evident that there is no relationship between the node centrality and the MAE, suggesting that the accuracy of the interpolation is hardly affected by the role of an unsampled node in the road network. An advantage of the new method is that it requires less consideration of the road network structure.
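The centrality definition used here can be sketched in Python by counting, for every pair of other nodes, whether a node lies on the shortest path between them. The graph layout, node labels, and function names are hypothetical, and two simplifications are assumed: each origin-destination pair is counted once regardless of direction, and when several shortest paths tie, only the one found first is counted.

```python
import heapq
from itertools import combinations


def shortest_path(graph, source, target):
    """One shortest path from source to target (Dijkstra), or [] if unreachable.

    graph: {node: {neighbor: edge_length_m}} (assumed layout).
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def node_centrality(graph):
    """Number of node pairs whose shortest path passes through each node."""
    counts = {node: 0 for node in graph}
    for s, t in combinations(graph, 2):
        for node in shortest_path(graph, s, t)[1:-1]:  # interior nodes only
            counts[node] += 1
    return counts


# Hypothetical star-shaped junction: hub "h" connects three dead-end roads.
star = {
    "h": {"x": 1.0, "y": 1.0, "z": 1.0},
    "x": {"h": 1.0}, "y": {"h": 1.0}, "z": {"h": 1.0},
}
print(node_centrality(star))
```

In the star example the hub lies on the shortest path of all three pairs of outer nodes, so its centrality is 3, while the outer nodes have a centrality of 0.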

Conclusion
A real-time full coverage estimate of crowd density is essential for safety management of urban spaces. In practice, the tradeoff between full coverage monitoring and limited budgets always exists, leading to the partial deployment of crowd surveillance detectors in open public spaces. The incomplete data coverage seriously influences the quality and reliability of the crowd density estimate. By extracting individual activity trajectories from the Wi-Fi data, this paper proposes a modified IDW method based on the path selection behaviors, named the IDWPSB method. The main improvement of the new method is to adjust the distance decay effect in the IDW method according to the transition probabilities of the path selection behavior. The case study in the Shichahai scenic area of Beijing demonstrates better performance than the traditional IDW method.
The proposed method can also be applied to urban spaces structured by networks in both indoor and outdoor environments with the availability of LBS data. However, it should be noted that the path preference is not static in time. Thus, the transition probabilities of the path selection behavior should be a function of time, which may vary over time of the day or may differ between weekdays and weekends. The time-dependent calculation will provide a valid context for improving the interpolation method and accommodating dynamic travel environments.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest in the research.