Geospatial Big Data Analytics for Quality Control of Surveys

Author Note: CDT Benjamin Leehan is a fourth-year cadet at the United States Military Academy, where he double majors in Computer Science and Mathematical Sciences. His thesis advisor, MAJ(P) Nathaniel Bastian, supported this project. We would like to thank the members of D3 Systems, particularly Mr. Matt Warshaw, Mr. Tim Van Blarcom, Mr. Jeffry Yan, and Mr. Connor Brassil, for generously sponsoring the project and providing the data necessary to conduct this research.

Abstract: Geospatial big data analytics allows survey quality control analysts to draw important conclusions about survey data quality that would otherwise take excessive time and resources. In this work, we explored two algorithmic methods that can help ensure the reliability of survey interviews by detecting geospatial outliers. Focusing on geospatial data collected from surveys, we implemented outlier detection techniques with two different distance metrics to identify statistical anomalies in real-world data sets that may have qualitative interpretations. We found that one algorithm, which considers the local distribution of points in a data set, identifies a different set of outliers than another method, which considers the global distribution of points. Since there was only a small overlap (10-19%) of flagged points between the two algorithms implemented, it may be helpful for analysts to focus on the fewer "outlier" points that are flagged by both methods rather than on all the "outlier" points flagged by each algorithm. Finally, analysts should consider the computational costs, as the algorithms differ significantly in this respect.


Background
Traditional big data analytics of geospatial data often requires an analyst to rely on data visualizations to draw qualitative conclusions from a quantitative data set. However, this task becomes infeasible as the data set grows larger and more nuanced. The ability to flag a subset of a large geospatial data set allows an analyst to narrow focus to smaller data sets identified as containing anomalous features. One particular field that benefits from such a capability is survey quality control, where human-collected surveys are examined to identify ill-collected data that might invalidate the survey. In this work, we explore and implement two outlier detection algorithms, with two different distance metrics, that can aid in the identification of geospatial anomalies in two real-world big data sets from surveys conducted and collected by humans on electronic devices.

Quality Control of Surveys
A survey seeks to capture targeted data about a population by collecting a sample that reflects a proportion of that population. A quality survey satisfies the following three characteristics (Brassil, 2021): 1) integrity - an interview or data collection occurred without fabrication (violations include artificially crafted surveys, duplicate surveys, etc.); 2) reliability - the sampling was performed with correct methodology (violations include missing the target population, biased sampling methods, etc.); 3) accuracy - the respondent's true metrics/opinions were given, recorded, and processed into the data set (violations include human/machine error while reading responses, "straight-lining," contradicting responses, etc.). After post-field collection of survey data, an analyst may be interested in verifying these characteristics to ensure the quality of the survey. Violation of any of the three characteristics may lead to survey results that do not reflect the true information regarding the population. Automation of survey quality control via statistical methods has the potential to relieve analysts of heavy workloads. For example, D3 Systems created ValkyRie, an application that automates common survey quality control tests to catch instances of duplication, straight-lining, non-response, etc. (Brassil, 2021). Their analytic tool focuses on the survey responses themselves to catch violations of integrity and accuracy. However, it is difficult to measure the reliability of a survey, as the survey itself does not show how it was performed. Analyzing the geographic locations of the executed surveys offers a promising way to help verify reliability. Modern survey collection is often performed through software on electronic tablets, which records the time and location of each survey as metadata. An analyst can display the survey locations to visualize surveyor behavior and detect violations of protocol. However, visualization becomes infeasible as the number of devices increases and surveys cover a larger geographic area. Currently, there is a lack of geospatial big data analytic tools that help automate this aspect of survey quality control.

Related Works
There is rather limited existing literature on identifying location-based outliers for survey quality control, perhaps due to the limited complexity of a purely geospatial data point. This is especially true for a geospatial data set that is restricted to a two-dimensional surface. One location-based outlier of interest is an isolated point. Breunig, Kriegel, Ng, and Sander (2000) defined an outlier factor that captures the degree of isolation relative to a surrounding clustering structure rather than to the global distribution. On the other hand, there have been numerous advances in identifying attribute-based outliers. Simple methods include performing univariate statistical analysis on an attribute among identified clusters of data. Mean- or median-based analysis is highly influenced by the distribution of the attribute or the existence of an extreme outlier. Singh and Lalitha (2017) introduced the location quotient (LQ) as an alternative to mean- and median-based analysis that may be more robust against extremities. Further, multivariate statistical analysis can help identify subtle outliers that are hard to detect when attributes are examined independently (Ben-Gal, 2005). In this work, we test the effectiveness and generalizability of some of these methods, attempting to interpret the detected outliers as possible violations of the surveys' reliability.

Methods
In this section, we briefly discuss survey data exploration and review two distance metrics and two algorithms that use geospatial data to perform outlier detection. The goal is to identify instances of the survey that display geographical anomalies within a relevant group and, thus, help identify violations of the survey's reliability.

Survey Data Exploration
The data sets explored are from real-world surveys conducted by D3 Systems in two countries: Cameroon and the Philippines. Human agents conducted surveys among the general population using electronic devices. Our data sets are the collection of metadata from the electronic devices, some of which recorded the GPS coordinates at the time of each survey. We refer to each instance of the survey as a point and to the set of points collected on the same electronic device as a group. Our algorithms aim to find points within each group that display geographic anomalies such as isolation or concentration. To properly apply the methods described in the subsequent sub-sections, we only consider groups that contain at least 10 points with properly recorded GPS locations. Table 1 highlights descriptive analysis comparing the two country data sets.

Distance Metrics for Outlier Detection
Distance metrics measure how different one point is from another. We consider two different distance metrics, Haversine and Mahalanobis, which are then used within the two outlier detection algorithms that we implemented.

Haversine Distance
Most GPS data provides geographic coordinates in terms of latitude and longitude. The haversine distance is the great-circle distance between two GPS coordinates. Suppose we know the radius of the Earth r and two points on a map are given as tuples of latitude and longitude P = (φ₁, λ₁) and Q = (φ₂, λ₂), expressed in radians. We calculate the haversine distance d(P, Q) as follows:

d(P, Q) = 2r · arcsin( √( sin²((φ₂ − φ₁)/2) + cos φ₁ · cos φ₂ · sin²((λ₂ − λ₁)/2) ) )
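As a concrete illustration, the haversine formula above can be sketched in Python. The function name and the mean Earth radius of 6,371 km are our own choices for this sketch, not details taken from the paper's implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in kilometres


def haversine(p, q, r=EARTH_RADIUS_KM):
    """Great-circle distance (km) between two (latitude, longitude)
    points given in degrees, via the haversine formula."""
    phi1, lam1 = math.radians(p[0]), math.radians(p[1])
    phi2, lam2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

For example, two points on the equator 90 degrees of longitude apart are a quarter of the Earth's circumference apart, roughly 10,008 km.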

Mahalanobis Distance
The Mahalanobis distance accounts for the distribution of the points in the group by taking the correlation among the attributes into consideration. Suppose we consider each point as a vector of attributes normalized to mean 0 and standard deviation 1 within the group to which the point belongs. Then, for two points x and y that belong to the same group G,

d(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) ),

where S is the nonsingular covariance matrix of the points in G. An advantage of the Mahalanobis distance is that we can use additional numerical attributes to measure the degree of difference between two points. However, we only consider the geospatial attributes (longitude and latitude) when calculating the Mahalanobis distance, for one-to-one comparison with the haversine distance.
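For the two-attribute case used here, the Mahalanobis distance can be sketched in pure Python with a hand-inverted 2x2 covariance matrix. This is our own minimal sketch, not the paper's code; it computes the covariance directly from the raw coordinates, which is equivalent because the Mahalanobis distance is invariant under the per-attribute normalization described above.

```python
def mahalanobis_2d(x, y, points):
    """Mahalanobis distance between 2-D points x and y, using the
    sample covariance of all points in the group (must be nonsingular)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # entries of the 2x2 sample covariance matrix S
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy  # nonzero when S is nonsingular
    # closed-form inverse of the 2x2 covariance matrix
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dx, dy = x[0] - y[0], x[1] - y[1]
    # sqrt of (x - y)^T S^-1 (x - y)
    return (dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy)) ** 0.5
```

On a group whose coordinates are uncorrelated with equal variance, this reduces to a rescaled Euclidean distance, which is a convenient sanity check.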

Outlier Detection Algorithms
Geospatial outlier detection algorithms seek to assign a scalar value to each point based on its distances to other points within a group. If the scalar values of the points in a group form a distribution centered around a mean, for example, then we can identify the points whose values lie more than a certain threshold above or below the mean as outliers. We use two standard deviations away from the mean as the threshold in this work. Here, we describe two outlier detection algorithms: the mean-distance method and the outlier factor method. Each method requires calculating the distance between every pair of points in each group. Instead of recomputing distances every time the algorithm requires them, it is more efficient (timewise) to create a dictionary data structure whose keys are pairs of points in the same group and whose values are the distances between those pairs. Thus, given a data set D, we divide D into disjoint subsets of data sharing a common group ID. Let Gr(D) be this partition; we refer to each subset as a group. For each group G ∈ Gr(D), we calculate and store d(p, q) for all p, q ∈ G, where d : G × G → ℝ⁺ ∪ {0} is the chosen metric. We note that any distance metric can be used with either algorithm. However, using a different metric will produce a different distribution of scalar values assigned to the points and hence identify different outliers.
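The grouping and precomputation step described above might look like the following Python sketch. The function name, the use of point indices as keys, and the frozenset trick (so d(p, q) and d(q, p) share one entry, since both metrics are symmetric) are our own design choices, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_distance_dicts(dataset, metric):
    """Partition (group_id, point) rows by group ID, then precompute the
    pairwise distances once per group into a dictionary.  Keys are
    frozensets of point indices within the group; values are distances."""
    groups = defaultdict(list)
    for group_id, point in dataset:
        groups[group_id].append(point)
    dist = {}
    for group_id, points in groups.items():
        table = {}
        for i, j in combinations(range(len(points)), 2):
            table[frozenset((i, j))] = metric(points[i], points[j])
        dist[group_id] = table
    return groups, dist
```

Any symmetric metric can be plugged in; for example, a Euclidean lambda for testing, or a haversine or Mahalanobis function for the real data.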

Mean-Distance Method
The mean-distance method calculates the mean of the distances from a point to all other points within the group to which the point belongs. One advantage of the mean-distance method is that we can also assign the group mean of these scalar values to each group, which helps identify groups whose points are abnormally far apart or concentrated compared to other groups. See Figure 1 for the pseudocode of Algorithm 1.
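A minimal sketch of the mean-distance method for a single group might look as follows. This is our own illustration, not the authors' Algorithm 1; the two-standard-deviation threshold follows the text above, and the choice of population (rather than sample) standard deviation is an assumption.

```python
def mean_distance_flags(points, metric, z=2.0):
    """Score each point by its mean distance to every other point in the
    group, then flag scores more than z standard deviations from the
    group mean (z = 2 in this work)."""
    n = len(points)
    scores = []
    for i in range(n):
        scores.append(sum(metric(points[i], points[j])
                          for j in range(n) if j != i) / (n - 1))
    mu = sum(scores) / n  # group mean, also usable to compare groups
    sigma = (sum((s - mu) ** 2 for s in scores) / n) ** 0.5
    flagged = [i for i, s in enumerate(scores) if abs(s - mu) > z * sigma]
    return scores, mu, flagged
```

With a tight cluster of points and one distant point, only the distant point's mean distance deviates enough from the group mean to be flagged (given enough points in the group; a single extreme point among very few points cannot exceed two standard deviations).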

Outlier Factor Method
The outlier factor method calculates the degree of local isolation of each point from a nearby cluster. This requires the introduction of a hyperparameter k, which we set to 5. The outlier factors of the points in a group form a positive distribution centered at 1. Outlier factors greater than 1 indicate points that are more isolated, whereas values less than 1 indicate points that are more concentrated relative to the other points in the group. See Figure 1 for the pseudocode of Algorithm 2.
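Assuming the outlier factor method follows the local outlier factor (LOF) of Breunig et al. (2000), a simplified pure-Python sketch for one group is given below. It is our own illustration, not the paper's Algorithm 2, and it ignores two edge cases of the full definition: distance ties at the k-th neighbour and duplicate points (which make the local reachability density degenerate).

```python
def local_outlier_factors(points, metric, k=5):
    """LOF-style outlier factors: values near 1 are typical; values well
    above 1 indicate points isolated from their k nearest neighbours."""
    n = len(points)
    d = [[metric(points[i], points[j]) for j in range(n)] for i in range(n)]
    # k nearest neighbours (indices) of each point, and its k-distance
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
           for i in range(n)]
    kdist = [d[i][knn[i][-1]] for i in range(n)]

    def reach(i, j):
        # reachability distance of i from j
        return max(kdist[j], d[i][j])

    # local reachability density of each point
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    # outlier factor: mean neighbour density relative to own density
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]
```

On a tight grid of points plus one distant point, the distant point's factor is far above 1 while the grid points stay near 1, illustrating the local (cluster-relative) nature of the score.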

Computational Experimentation and Results
In this experiment, we identify geospatial outliers among the data points of the Cameroon and Philippines data sets discussed in Section 3.1 using different combinations of the distance metrics and outlier detection algorithms described in Sections 3.2 and 3.3. Table 2 shows the proportions of points flagged by the different algorithms when the distance metric is fixed. The percentage values measure the proportions out of the total flagged.
We note that, for each metric, most of the flagged points are not flagged by both algorithms; only 10-19% of the flagged points are flagged by both. Initially, this is surprising, given that both algorithms seek to measure the degree of isolation of each point in a group. Further examination reveals that this is due to the geographical distribution of the points within each group. Many survey collection devices were used in multiple cities/towns, resulting in multiple clusters of points for each group. While the mean-distance method considers all points within each group to assign a scalar, the outlier factor method only considers the clusters nearest each point, as determined by the hyperparameter k. Therefore, a point considered an outlier by the mean-distance method need not be considered an outlier by the outlier factor method, and vice versa. Note that the outlier factor method flagged more points in the Cameroon data set (50.30-54.82%) than in the Philippines data set (24.32-30.29%). As displayed in Table 3, we now fix the outlier detection algorithm and vary the distance metric. There is greater overlap of flagged points between the different metrics than between the different algorithms. Figure 2 depicts the plots of the flagged points identified in Tables 2 and 3 on geospatial maps of Cameroon and the Philippines, respectively. Next, we analyze the computational time associated with implementing each method. Assuming amortized constant time for adding an item to a dictionary, the expected time complexity of creating a distance dictionary is at least O(mn²), where m is the number of groups and n is the maximum number of points in a group. The calculation of the Mahalanobis distance is far more costly than that of the haversine distance, as it requires computing a covariance matrix per attribute per group in addition to matrix-vector multiplication.
Table 4 displays the comparison of the computational times to create the distance dictionaries for the two distance metrics used. Since we use dictionaries to store distances instead of calculating them on demand, our choice of distance metric does not affect the computational times of the outlier detection algorithms themselves. Table 5 shows the comparison of computation times of the algorithms applied to the two data sets. While the outlier factor method is more computationally expensive than the mean-distance method, we note that most of the total computational time is spent creating the distance dictionaries rather than running the algorithms themselves. Focusing on speeding up the distance metric calculations will therefore better optimize the running time.

Verification and Validation
Verification is the process of ensuring that the outlier detection algorithms are implemented correctly, which was achieved through testing the algorithms for correctness, stability, and convergence. Algorithm code reviews and inspections were performed, in addition to baseline testing to verify that our implementations of the algorithms indeed detect relevant geographic outliers. For example, we created an artificial data set with a random distribution of points and clear instances of outliers that were visibly separated from nearby clusters. Both algorithms reliably detected the geographical outliers when using the haversine distance metric. However, the presence of an extreme outlier had a significant impact on the overall correlation between the coordinates, such that both algorithms sometimes failed to identify the desired outlier points when using the Mahalanobis metric.
Validation is the process of ensuring that the outlier detection algorithms adequately capture the phenomenon of interest (i.e., actual survey outliers) by comparison with experiments and/or observations. Further work is needed to fully validate our implemented algorithms, which will entail comparing their results against ground truth data provided by D3 Systems in which their survey analysts identified and confirmed actual outlier points.

Conclusion and Future Work
In this work, we explored and implemented two different algorithms and two different distance metrics to identify statistical anomalies among geospatial data points from two real-world surveys. By viewing a distance metric not simply as the physical distance between two points but as a measure of difference between them, one can use more than locational data in geospatial outlier detection algorithms. If the devices used to conduct interviews/surveys can collect additional attributes, such as interview duration or the distance/time since the previous interview/survey, the Mahalanobis distance may better measure how an interview differs from the others and help identify instances of policy violation when used within the outlier detection algorithms.
We noticed that the outlier factor method, which considers the local distribution of points in a data set, identifies a very different set of outliers than the mean-distance method, which does not. Since there is only a small overlap (10-19%) of flagged points between the two algorithms explored in this work, it may be helpful for survey analysts to focus on the fewer points that are flagged by both methods rather than on all the "outlier" points flagged by each algorithm. Alternatively, we suggest a visual inspection of the points within each group to decide which geospatial outlier detection algorithm to apply. If the locations of the points span multiple geographically distant jurisdictions, the outlier factor method will better identify relevant geographical outliers within each jurisdiction. However, we currently have no evidence to assert that the flagged points show instances of policy violation by the surveyors/interviewers.
Future work includes the creation of additional artificial survey data sets for continued verification of whether the algorithms catch instances of violated reliability, and to help determine which algorithm may better catch certain violations, such as straight-walking or standing still. Moreover, future work will include thorough algorithm validation, as we were limited by the lack of ground truth data on actual survey outliers. Additional future work entails building a geospatial big data analytics application with a user interface to visualize and interact with the data while allowing the user to select the algorithm and distance metric to use. The ability to plot flagged data points on a zoomed-in satellite image could help survey analysts understand why points were flagged as outliers and decide whether they warrant further investigation. While the flags can reduce dependence on visual analysis by reducing the number of points examined, this tool will help automate survey quality control, enabling analysts to ensure the reliability of surveys while drawing important conclusions in a timely fashion.

Figure 1. Descriptions and Pseudocodes of the Two Outlier Detection Algorithms

Figure 2. Flagged Points in the Cameroon/Philippines Data Sets with Various Metric/Algorithm Combinations

Table 1. Descriptive Analysis Comparing the Two Country Data Sets Explored

ISBN: 97819384962-2-6. A Regional Conference of the Society for Industrial and Systems Engineering

Table 2. Counts of Flagged Points Between the Two Algorithms for Each Distance Metric

Table 3. Comparison of Flagged Points Between the Two Distance Metrics for Each Algorithm

Table 4. Comparison of Computation Times for Creating the Distance Dictionaries (Average of 100 Iterations)

Table 5. Comparison of Computation Times for the Outlier Detection Algorithms