Exploring the effectiveness of geomasking techniques for protecting the geoprivacy of Twitter users

Abstract: With the ubiquitous use of location-based services, large-scale individual-level location data has been widely collected through location-aware devices. Geoprivacy concerns arise around user identity de-anonymization and location exposure. In this work, we investigate the effectiveness of geomasking techniques for protecting the geoprivacy of active Twitter users who frequently share geotagged tweets at their home and work locations. By analyzing over 38,000 geotagged tweets of 93 active Twitter users in three U.S. cities, we find that the two-dimensional Gaussian masking technique with proper standard deviation settings is more effective at protecting users' location privacy, at the cost of geospatial analytical resolution, than the random perturbation masking method and aggregation on traffic analysis zones. Furthermore, a three-dimensional theoretical framework that considers privacy, analytics, and uncertainty factors simultaneously is proposed to assess geomasking techniques. Our research offers insights into the geoprivacy concerns raised by social media users' georeferenced data sharing, for the future development of location-based applications and services.


Introduction
The availability of location-based services has made the collection of large-scale individual-level location data through mobile phones, GPS devices, and geotagged social media commonplace [34,58]. While such location-based big data provides new opportunities to study human mobility patterns and transportation models [6,13,21,32], complex human-environment interactions [18,26,34,39,49], socioeconomic characteristics [22,31,35], urban spatial structure and changes [20,59], disaster responses [23,54,55], and location business intelligence [48], it also introduces challenges regarding the protection of location privacy [51]. Furthermore, there are increasing concerns about the social, ethical, legal, and behavioral implications of geoprivacy caused by user identity de-anonymization and location exposure [5,27,50]. Generally speaking, geoprivacy refers to an individual's right to prevent the disclosure of personal sensitive locations, including but not limited to their home, workplace, daily activity places, and travel trips [30]. However, the majority of people are unaware of how the underlying location-related technologies work and what can be inferred from the location records collected when they use various location-based services [27]. Figure 1 shows the spatial distribution of geotagged tweets around a Twitter user's home. The home location of this individual can be identified with high confidence from his/her digital footprint on social media [21]. As a result, researchers have developed a number of statistical approaches and technical solutions that aim to protect individuals from being identified through their location records. A common practice for preserving data confidentiality is aggregation, in which detailed individual records are merged into anonymized large-group characteristics, for example by aggregating individual home locations into geographic or administrative units. Aggregating raw address points into such identical polygons makes the inference of original locations hard, and user privacy becomes a k-anonymity problem [8,14,42]. There exist several location obfuscation approaches for achieving k-anonymity [8,15,28]. However, aggregation may reduce the spatial resolution of the analysis that can be conducted and thus its effectiveness [30]. Another family of approaches is called geomasking, in which the original location is hidden or modified for geoprivacy protection while the spatial point patterns are not significantly affected.
There is a rich history of literature on leveraging geographical masking to preserve the confidentiality of health records and trajectory data. With child leukemia lymphoma data from North Humberside, England, Armstrong et al. [4] described and evaluated several types of geographical masks to protect personal privacy while still allowing valid spatial analyses. Kwan [30] examined the effects of random perturbation masks on the results of a spatial analysis using data on lung-cancer deaths. Three different random perturbation masks were implemented, each at three different levels of introduced error. Hampton et al. [17] extended existing methods of random perturbation by developing an adaptive geomasking technique known as the donut method. This method guarantees that each geocoded address is not randomly assigned on or too near its original location. Compared with the random perturbation method, the performance of k-anonymity using the proposed donut method was at least 42.7% higher in geoprivacy measures and was less than 4.8% in cluster detection measures. Seidl et al. [46] examined the grid masking and random perturbation techniques for anonymizing GPS trajectory data and tested the preservation of both privacy and spatial patterns. They found that as the distance thresholds for grid masking and random perturbation increase, the correlation between density patterns decreases.
However, the use of geographical masking methods to prevent the disclosure of sensitive locations of social media users is still not well addressed. Location-based social media data differs from other existing data sources (e.g., health surveys and GPS trajectories) due to its innate characteristics such as data sparsity and sampling bias, spatiotemporal distribution heterogeneity, and location representativeness and uncertainty [21,33]. To this end, we aim to investigate the effectiveness of geomasking techniques for protecting the geoprivacy of active Twitter users who frequently share geotagged tweets at their home or work location. To the best of our knowledge, this work is a first attempt in this direction using individual-level location-based social media data. Additionally, a theoretical framework that considers privacy, analytics, and uncertainty factors simultaneously is proposed to assess different geomasking techniques.

Related work
Geomasking has been used in public health and spatial analysis for decades in order to protect sensitive information. Much of the geomasking literature deals with data of fairly coarse spatial and temporal resolution. Twitter data, on the other hand, are frequent and may occur in a relatively small geographic area. To understand how to obfuscate geospatial data of varying density, we need to investigate both novel and traditional geomasking techniques.
Voronoi masking relies on the creation of Voronoi polygons around individual point features and then relocates each point to the nearest edge of its bounding polygon. This method is shown to be robust with lower resolution spatial data, at a population density of about 23 persons/km² [47]. It is also effective in reducing the likelihood of false identification of the true household location because the points are often relocated to the boundaries of parcels. Since Voronoi masking is not randomly generated and depends on the spatial structure of the points, it may still preserve the original locations, albeit as polygons. Given the nature of geotagged tweets, it would not benefit user privacy to create hundreds of polygons that still lie on or near the location of concern, whether it be home or work. This issue is inherent in high resolution spatial data. It may be beneficial to repeat the Voronoi masking process a second time. The nearest edge to a polygon centroid may be the nearest edge for more than one centroid, and therefore it is possible that the number of unique locations will be reduced after the initial masking. This process would lower the resolution of the dataset and possibly reduce the true location detection accuracy after two or more iterations.
On the other hand, Seidl et al. [47] show that grid masking is not an effective method for preserving spatial analysis at the aforementioned low resolution. In this case, the assignment of points along a uniform grid amounts to aggregation over the area of the grid. This may be a beneficial method at high resolutions, as we are able to set the size of the grid to a much smaller area and in essence create our own minor aggregation units without displacing the points nearly as far [47].
A multiscale geomasking technique by which locations are converted to Military Grid Reference System (MGRS) coordinates provides a unique amount of control over the adjusted locations [7]. MGRS eastings and northings provide 5 levels at which to mask data, in increments of powers of ten: 1, 10, 100, 1000, and 10000 meters. Points are displaced along axes from the original point along the grid system. Tests show the method is invertible and, after Level 3, loses almost all overlap between masked and unmasked points indicative of personal location information. These tests were conducted on 2,000 randomly generated points in GIS software. This method also resembles grid masking in that the displacement of points is done along the eastings and northings from the origin [7]. The ability to control random perturbations along MGRS eastings and northings is a form of high resolution grid masking that is worthwhile to compare with traditional grid masking.
A further consideration for the preservation of spatial characteristics as well as privacy is topology. Given a set of parcels or an easily obtained base map such as OpenStreetMap (OSM), we can ground truth residential, work, or school locations based on spatiotemporal tweet patterns. Relocating points just outside of parcels or to a road center line was shown via survey to introduce more uncertainty among participants as to the actual location points [45]. Points displaced within a parcel or along a parcel boundary induced less uncertainty. In addition to cluster detection, the reduction in map-user confidence is a unique measure for determining the effectiveness of geomasking. This method may be useful in distorting user perceptions of point clusters and reducing the likelihood of inferring a Twitter user's home or place of employment [45]. Geoprivacy is not limited to the users' geometric coordinate information [40]. User-generated social media content includes rich semantic signatures (i.e., spatial, temporal, and thematic patterns) [1,24,38,39,62], which may also reveal distinct place-based patterns and cause potential privacy risks. McKenzie et al. [40] illustrate how protecting place-based information differs from a purely spatial perspective using location-based social networking check-in data.
In the statistics and computer science communities, the trade-off between utility and the level of differential privacy guaranteed by a processing mechanism has been considered in several privacy-preserving learning approaches such as private support vector machine (SVM) learning [44] and private Bayesian inference [61]. The key concern in the study of differential privacy is whether the published aggregation information from a statistical database would disclose private individual information. Regarding location-based systems or services (LBS), a mechanism that adds random noise drawn from a planar Laplace distribution to the user's location has been proposed to guarantee geo-indistinguishability [3]. In [19], a differentially private pattern mining algorithm was proposed for geographic location discovery using a combination of region quadtree spatial decomposition and a density-based clustering algorithm. The experiments were conducted using synthetic datasets and showed the feasibility of the proposed algorithm for achieving the differential privacy goal. In addition, privacy preservation can be achieved through obfuscation, that is, by degrading the quality of information about a person's location using spatial and temporal cloaking [9,15]. A geographic graph model of obfuscation for protecting an individual's location privacy in LBS was demonstrated in [8]. Moreover, a comprehensive survey of computational location privacy for broader implications was conducted in [28].
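To make the planar Laplace idea concrete, the following is a minimal sketch and not the implementation from [3]. It assumes projected coordinates in meters; the function name and the epsilon value in the usage note are hypothetical. It relies on the fact that, in polar coordinates, planar Laplace noise has a uniformly distributed angle and a radius that follows a Gamma distribution with shape 2 and scale 1/epsilon.

```python
import math
import random

def planar_laplace_noise(x, y, epsilon, rng=random):
    """Displace (x, y) by noise drawn from a planar Laplace distribution.

    Assumption: x and y are projected coordinates in meters and epsilon is
    expressed in 1/meters (a larger epsilon means less noise).
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)   # direction, uniform on the circle
    r = rng.gammavariate(2.0, 1.0 / epsilon)  # radius, Gamma(shape=2, scale=1/epsilon)
    return x + r * math.cos(theta), y + r * math.sin(theta)
```

With epsilon = 0.01 per meter, for example, the expected displacement is 2/epsilon = 200 m.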

Methods
In this research, we are concerned with individual user coordinate information and survey the effectiveness of three popular geomasking techniques (Aggregation, Random Perturbation, and Gaussian Perturbation) for the preservation of Twitter users' location privacy [4]. One open-source geomasking implementation in R can be accessed via the GitHub repository (footnote 1).
Aggregation: merges individual geotagged tweet points into the polygons in which those points fall. Different types of administrative boundaries such as census blocks, tracts, and traffic analysis zones, or vague cognitive regions (e.g., downtown), could be candidate polygons [12]. The centroids of the polygons that contain the points are then used as the coordinates of those tweets.
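The aggregation step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the R implementation referenced above: zones are simple polygons given as vertex lists in planar coordinates, and all function names are hypothetical.

```python
def polygon_centroid(poly):
    """Area-weighted centroid of a simple polygon [(x, y), ...] (shoelace formula)."""
    area2 = cx = cy = 0.0
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)

def contains(poly, x, y):
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal ray
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def aggregate(points, zones):
    """Replace each point by the centroid of the first zone containing it.

    Points that fall in no zone are returned as None.
    """
    centroids = [polygon_centroid(z) for z in zones]
    return [
        next((c for z, c in zip(zones, centroids) if contains(z, x, y)), None)
        for x, y in points
    ]
```

All points inside the same zone collapse to a single coordinate, which is why the residual shift depends on how close the true location is to the zone center.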
Random Perturbation: is a geomasking approach in which each point is displaced in space by a randomly determined distance and direction [4,30]. A distance threshold is typically added to set the allowed maximum displacement distance in the case of uniform geomasking. As shown in Figure 2, the original posted locations of the geotagged tweets of a Twitter user are randomly displaced within a 1km distance radius.
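A minimal sketch of such a mask follows. It assumes that the displacement distance is drawn uniformly between zero and the threshold and that the direction is uniform on the circle; it also uses a flat-earth approximation to convert meters to degrees. The function name and the constant are hypothetical and chosen for illustration only.

```python
import math
import random

M_PER_DEG = 111_320.0  # assumed approximate meters per degree of latitude

def random_perturb(lon, lat, max_dist_m, rng=random):
    """Displace a lon/lat point by a random distance (<= max_dist_m) and direction."""
    r = rng.uniform(0.0, max_dist_m)          # distance, uniform within the threshold
    theta = rng.uniform(0.0, 2.0 * math.pi)   # direction, uniform on the circle
    dx, dy = r * math.cos(theta), r * math.sin(theta)
    new_lat = lat + dy / M_PER_DEG
    # A degree of longitude shrinks with the cosine of the latitude.
    new_lon = lon + dx / (M_PER_DEG * math.cos(math.radians(lat)))
    return new_lon, new_lat
```

For a 1km mask one would call random_perturb(lon, lat, 1000.0) once per tweet; under these assumptions the average shift is about half the threshold.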
Gaussian Perturbation: uses a two-dimensional isotropic (i.e., circularly symmetric) Gaussian kernel to control the random displacement process of a point set such that the distribution of those displaced points follows a two-dimensional Gaussian ("bell-shaped") form [11,60]. In this formulation, (x, y) denotes the 2D coordinates of each location after displacement, δ specifies the standard deviation (SD) of the positional error, (x̄, ȳ) is the mean center of the point set, and n is the total number of points. As shown in Figure 2, as the standard deviation increases, the displaced points spread more widely. The derived spatial point patterns with a large standard deviation may not capture the original spatial density distribution of an individual's digital footprints.
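One simple way to realize an isotropic Gaussian mask is sketched below. It is an illustration under stated assumptions, not necessarily the exact procedure used in this study: each point is displaced independently, the same SD is applied to both axes, and the SD is expressed in the units of the input coordinates. The function name is hypothetical.

```python
import random

def gaussian_perturb(points, sd, rng=random):
    """Add independent zero-mean Gaussian noise with the same SD to both axes.

    Assumption: sd is in the same units as the coordinates (e.g., decimal
    degrees if the points are lon/lat pairs).
    """
    return [(x + rng.gauss(0.0, sd), y + rng.gauss(0.0, sd)) for x, y in points]
```

Unlike a mask with a hard distance threshold, this mask has no upper bound on the displacement, so a small share of points is moved much farther than the SD.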
After the perturbation processing of the original locations, we need to further determine whether users' home or work location (two of the most sensitive places for an individual's geoprivacy) can still be identified through state-of-the-art location detection algorithms. Specifically, we explored different parameter calibrations for density-based spatial clustering of applications with noise (DBSCAN) [10,21,35], which has been widely used in spatial clustering and the identification of significant human activity places. The DBSCAN algorithm requires two parameters: the searching radius of a cluster (Eps) and the minimum number of points (MinPts) within a cluster. Different combinations of Eps and MinPts values may produce different spatial clustering results [20,37]. In the case of detecting Twitter users' home or work location, the parameter calibration may generate different candidate clusters or distance shifts from the actual location. Therefore, we have explored different scenarios with varying parameter values for the perturbation and the spatial clustering steps. In each run of the perturbation and the clustering, two representative centers (i.e., centroid and medoid) are calculated and used to compute the shift distance from the true home or work location. The centroid is the weighted sum of geotagged tweets' coordinates in a cluster and it might not be one of the original locations, while the medoid can be defined as the point of a cluster whose average distance to all the objects in the cluster is minimal [52,53].
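The detection step can be sketched as follows. This is a plain-Python DBSCAN written for illustration rather than any particular library implementation; it assumes planar coordinates in meters, an unweighted centroid, and hypothetical function names.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: -1 for noise, 0..k-1 for clusters."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

def centroid(pts):
    """Unweighted mean of the coordinates; need not be an original point."""
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n

def medoid(pts):
    """Member point with the smallest total distance to all other members."""
    return min(pts, key=lambda p: sum(math.hypot(p[0] - q[0], p[1] - q[1])
                                      for q in pts))

def shift_distance(center, truth):
    """Distance from a representative center to the ground-truth location."""
    return math.hypot(center[0] - truth[0], center[1] - truth[1])
```

Running the same functions on masked and unmasked points, and comparing shift_distance for the centroid and the medoid, mirrors the two shift measures reported in this study.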
Evaluation Measures: In addition to the shift distance to ground-truth locations, the following quality measures are also used in this study to evaluate the effectiveness of different geomasking approaches. Those measures are defined in terms of the following four cases: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [2,29,43]. TP is the number of points correctly identified as home or work locations after geomasking and cluster analysis in each period (daytime or nighttime). TN is the number of points correctly identified as not-home or not-work locations. FP is the number of points incorrectly identified as home or work locations. FN is the number of points incorrectly identified as non-home or non-work locations.
• Overall accuracy = (TN+TP)/(TN+TP+FN+FP) is the ratio between the number of correctly identified home (work) and not-home (not-work) cluster points (including both TP and TN cases) and the total number of points in each period. It serves as a general accuracy measure. Usually, the higher the value is, the worse the associated geomasking is in protecting geoprivacy.
• Precision = TP/(TP+FP) is the ratio between the number of correctly identified home or work cluster points and the total number of locations that are identified as home or work locations (all positive predictions). A high precision shows that, among all the positive predictions, the method gets more home or work cluster points that are correctly identified than home or work cluster points that are incorrectly identified.
• Sensitivity = TP/(TP+FN) (also known as Recall) is the ratio between the number of correctly identified home or work cluster points and the total number of true home or work locations. A high recall shows that the results discover a larger fraction of the home or work cluster points.
• Specificity = TN/(TN+FP) is the ratio between the number of correctly identified non-home or non-work cluster points and the total number of true non-home or non-work locations. It shows how good a method is at detecting a user's non-home or non-work location after geomasking.
• Balanced accuracy = (Sensitivity+Specificity)/2 is a measure that combines sensitivity and specificity. It accounts for the imbalance of a dataset and shows a balanced view of how accurate a method is. If the data is imbalanced, balanced accuracy is the suggested accuracy measure.
• F1-score = 2*Precision*Recall/(Precision+Recall) is a measure that combines precision and recall. It shows a balanced view of how effective a method is at detecting a user's home or work cluster location after geomasking.
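The measures above translate directly into code. The sketch below uses a hypothetical function name and one added convention that is an assumption, not part of the definitions: any ratio with a zero denominator is reported as 0.0.

```python
def masking_metrics(tp, tn, fp, fn):
    """Compute the six quality measures from confusion-matrix counts."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)   # also known as recall
    specificity = ratio(tn, tn + fp)
    return {
        "overall_accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

Because a detector attacks the masked data, lower values of these measures indicate a more protective mask.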

Experiments and results
We selected 93 active Twitter users who have frequently posted geotagged tweets in three U.S. urban areas: two metropolitan areas (Washington DC and Los Angeles) and one smaller urban area (the City of Madison, Wisconsin). Over 38,000 geotagged tweets were collected only from users' mobile phone devices such that their location information is most accurate for human mobility studies [13,22,33]. We selected these three areas (two large cities located on the east coast and the west coast respectively, and one city located in the Midwest) as representations of U.S. urban areas. Another reason was our familiarity with the geographic backgrounds of the three cities, which helped the location ground-truth labeling and validation process. Up to 3,200 tweets can be fetched for each individual Twitter user due to the API access limit. The anchor points (i.e., the locations of home and work) [16,41,56,57] are the two most important locations for an individual and are chosen as the target place type for geoprivacy protection. We manually identified users' home and work locations as the ground truth by overlaying their nighttime (8pm-7am) and daytime (9am-5pm) geotagged tweets onto the high-resolution (about 2m-4m) Digital Globe aerial images and the OpenStreetMap (OSM) points of interest layer. Another important rule for the ground-truth labeling is to check whether the same location cluster persists across multiple days. Among those users, 70 users' home location and 60 users' work location can be manually identified.
The impact of Eps and MinPts: Before applying the geomasks, we first tested how the choice of MinPts and Eps in DBSCAN would impact the effectiveness of identifying the home clusters of those Twitter users. We chose MinPts ranging from 4 points to the square root of the total number of tweets in each period (nighttime or daytime), and the search radius Eps in a range of 50m to 1000m with a step of 50m. As shown in Figure 3, we grouped the home cluster detection results based on the Eps, and each sub-boxplot represents the overall accuracy with varying MinPts in DBSCAN. Not surprisingly, the mean and median of overall accuracy are over 0.836 and 0.970, respectively, and remain high (basically over 0.8) regardless of the parameter choices. This also indicates the potential risk of location exposure for those active users, as their home location cluster can be easily identified even without parameter calibration.

[Table 1. Columns: Measures / Methods, RM (H), GM (H), TAZ (H), RM (W), GM (W), TAZ (W); first row label: Mean.]
Comparing the effectiveness of different geomasking techniques: First, we explored the impact of the random perturbation geomasking with different thresholds. Existing studies have found that the choice of Eps=200m to 300m could generate good spatial clustering results for urban areas of interest and human activity zones [20,22,33]. Therefore, we are interested in whether the geomasking process could protect users' home or work location privacy within such a distance range. However, our experiments show that small-distance (such as within 300m or even 1km) random perturbations do not help protect users' geoprivacy because their home location clusters can still be correctly identified with over 0.80 overall accuracy. Moreover, the mean of sensitivity (recall) for detecting home clusters is 0.978 and the median is even higher, at 1.0; the mean of precision for detecting home clusters is 0.789 and the median is 0.936; and the mean of F1-score is 0.836 and the median is 0.954. All these quality measures show that the users' home locations are exposed to the general public after random perturbation masking with a 1km distance threshold. Even when the displacement threshold reaches 2km, the mean of overall accuracy using the random perturbation mask is still over 0.70. The 2km random perturbation mask is effective for protecting users' home locations when the search radius is within 200m, where an overall accuracy of less than 0.5 is achieved for identifying the users' home locations (Figure 4). As for the displacement distance, as shown in Table 1, the median of shift distances from the true home location to the centroid of home clustering results is about 737m and to [...]
As for the Gaussian perturbation, we found that it is effective for protecting users' home locations with proper parameters. The mean of overall accuracy for identifying users' home clusters using two-dimensional Gaussian kernels with a standard deviation (SD) of 0.05 is less than 0.50, and the median is also less than 0.50, regardless of the DBSCAN parameter choices. Gaussian maskings with larger SD produce more dispersed spatial point patterns and have very low prediction accuracy for home location identification. However, given the sparse spatial distribution of digital footprints, the SD of a user's geotagged tweet distribution is often large (about 5~10km), and so is the distance shift from original points to displaced points after the Gaussian masking process. This is part of the reason why we could not correctly identify the home location clusters after Gaussian masking.
Results regarding geoprivacy protection of work locations during daytime (in Figure 6) differ from the home location case. The mean and median overall accuracy decreased to 0.661 and 0.722 using the random perturbation (1km) approach. The reason might be the diverse spatial patterns of human activity locations in daytime [33]. However, the high median sensitivity (1.000) and F1-score (0.898) demonstrate that a high percentage of true work location clusters can be successfully identified. The median of shift distances from the true work location to the medoid of work clustering results increased to about 90m (and the shift to the centroid: 798m). The boxplots of the F1-score, the medoid distance shift, and the centroid distance shift of work cluster detection after random perturbation (1km) can be seen in Figure 7(a), Figure 7(c), and Figure 7(e) respectively. In addition, the overall prediction accuracy of work location decreases as the Eps increases using the random perturbation method. This is mainly because of the imbalanced work location data problem (i.e., the number of exposed work locations is much smaller than the number of non-work locations during the daytime) and the decrease of true negative predictions, which will be discussed later in Section 5. To deal with the imbalanced data, the balanced accuracy is also reported to measure the accuracy of the clustering results. The mean and median of the balanced accuracy are all 0.500, and in most cases the sensitivity is 0.000 and the specificity is 1.000, which shows that, despite the high overall accuracy, no work clusters after Gaussian masking are correctly identified. It is worth noting that the displacement of each tweet location might vary between runs of the perturbation process. However, the overall quality measures for geoprivacy preservation across multiple runs did not change much (about 2% difference in our experiment), and the overall accuracy measures reported are stable.
In addition, we also conducted the traditional aggregation-based masking analysis at the traffic analysis zones (TAZs) in the three urban areas (as shown in Figure 8). As many human mobility and transportation studies using geotagged social media data are based on home-work trips at the TAZ level, such a scale meets the spatial resolution requirement for urban transportation analysis. We demonstrate here the aggregation results on the Madison area (as shown in the TAZ(H) and TAZ(W) columns of Table 1), since we are more familiar with Madison's traffic conditions and urban spatial layout. Moreover, unlike users in Washington DC and Los Angeles, many of whom are tourists and thus have a very wide range of activities (even nationwide), the Madison users are more locally active and their tweet locations match the Madison TAZs better, enabling more accurate and representative aggregation analysis. The TAZ-based aggregation method, however, still cannot protect the geoprivacy of those active Twitter users well, given the high median overall accuracy of identifying the home cluster (0.980) and the work cluster (0.989). The boxplots of overall accuracy, F1-score, the medoid distance shift, and the centroid distance shift of home cluster detection after TAZ-based aggregation can be seen in Figure 4(f), Figure 5 [...] regardless of the setting for DBSCAN search radius Eps and confirm the ineffectiveness of geoprivacy preservation at the TAZ level using aggregation. It is worth noting that the "ineffectiveness" here concerns the protection of the home cluster identity rather than the actual home location, as there is still a possible distance shift between the true home location and the centroid of a TAZ polygon. From this perspective, aggregation might be effective for protecting the true home (work) location, but the distance shift depends on the spatial position of a home (work) location within the TAZ (i.e., its proximity to the TAZ center). The results show that the median of shift distances from the true home (work) location to the centroid of the home (work) TAZ is about 485m (work: 460m) and to the medoid is about 403m (work: 360m).

Imbalanced data and geomasking performance
As shown in the results, the geomasking effectiveness differs largely between home locations and work locations due to their different spatiotemporal patterns. During the nighttime, users often stay at home, and thus most of their nighttime tweet locations can represent their home locations. However, during the daytime, users have a more diverse activity space and do not necessarily stay at their work locations all the time. Thus, many non-work tweets are posted at other places, which results in an imbalanced data problem between work locations and non-work locations. The numbers of home locations (6,700) and non-home locations (6,053) during the nighttime are almost equal (about 1:1), which is more balanced and explains why the overall accuracy is still able to reach around 0.50 with a 0.0 recall rate after the Gaussian masking. However, the numbers of work locations (3,168) and non-work locations (8,904) during the daytime differ substantially (about 1:3), and the number of true negative predictions is therefore large enough that the overall accuracy could be around 80% even when no true work location cluster is correctly detected. As the Eps (the search radius threshold for the DBSCAN cluster algorithm) increases, however, more true negative predictions become false positive predictions, resulting in a decreasing overall accuracy. In this regard, focusing on the overall accuracy alone is misleading, and we need to take both the precision and recall into account to evaluate the geomasking performance. Therefore, we further computed and drew the Precision-Recall curves (PR-curves) for different geomasking methods based on different Eps values (from 50m to 1000m) in Figure 9. The PR-curve is a comprehensive tool to measure model performance even on imbalanced data since it takes both the precision and recall into consideration at the same time. Each curve represents the performance of a geomasking method. The higher the curve stays when moving from left to right, the higher the precision of the home (work) cluster detection algorithm, and therefore the worse the geomasking method performs. As shown in Figure 9, for both home locations and work locations, the random perturbation methods (both 1km and 2km) and the TAZ-based aggregation method have higher overall precision and recall, whereas the Gaussian masking methods (SD=0.01 and 0.03) are able to suppress both the precision and recall at the same time. Note that the Gaussian masking method with 0.05 SD does not appear in the figure since it protects the geoprivacy well and thus there are no precision or recall rates for drawing its PR-curve. As such, with proper parameter settings, the Gaussian geomasking method could be more effective at protecting the location privacy of Twitter users than the random perturbation method and the TAZ-based aggregation method.
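The effect of class imbalance on overall accuracy can be illustrated with the point counts reported above. The sketch below makes one simplifying assumption that is ours, not a result of the study: the detector makes no positive predictions at all, so every point counts as a true negative or a false negative. The function name is hypothetical.

```python
def all_negative_baseline(n_pos, n_neg):
    """Scores of a detector that never predicts the positive class.

    Returns (overall accuracy, balanced accuracy).
    """
    tp, fn, tn, fp = 0, n_pos, n_neg, 0
    overall = (tp + tn) / (n_pos + n_neg)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return overall, (sensitivity + specificity) / 2

# Nighttime: home vs. non-home points; daytime: work vs. non-work points.
home = all_negative_baseline(6700, 6053)
work = all_negative_baseline(3168, 8904)
print(f"home: overall={home[0]:.3f}, balanced={home[1]:.3f}")
print(f"work: overall={work[0]:.3f}, balanced={work[1]:.3f}")
```

Under this assumption the overall accuracy is roughly 0.47 for the home case and 0.74 for the work case, while the balanced accuracy is 0.5 in both. This shows why overall accuracy alone can look high on the daytime data even when no work cluster is detected.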
In addition, we also explored the differences between the settings of the distance threshold (for random perturbation) and the standard deviation (for Gaussian geomasking) as well as their influences on the geomasking performance. As shown in the violin plot (Figure 10), the distance shifts of tweet locations (daytime and nighttime) after random perturbation are evenly distributed within the distance threshold, while after the Gaussian geomasking, the distribution of distance shifts is much closer to a normal distribution with a wider range. For the random perturbation (2km) and the Gaussian geomasking (SD=0.01), although they have a similar average distance shift (about 1081m and 1211m respectively), the latter still has a significantly better performance due to its high uncertainty level, as seen by comparing their PR-curves in Figure 9. This shows that a perturbation that follows a normal distribution would have a larger but more natural influence on the geotagged tweet locations than the random perturbation, and we believe this partially explains why the Gaussian masking method with proper standard deviation settings could perform better than the random perturbation method.

Implications among privacy, analytics, and uncertainty
The results in our experiment demonstrate that a geomasking method can effectively protect users' geoprivacy but may reduce the spatial analysis capability and introduce uncertainty to further analytics. Inspired by several theoretical frameworks in geovisualization and geospatial semantics studies [25,36], we herein present a three-dimensional visualization framework (as shown in Figure 11) including privacy, analytics, and uncertainty as a tool to evaluate and inform the selection of appropriate geomasking methods under different contexts. The first dimension is the capability to protect users' geoprivacy, from low to high. The second dimension is the spatial resolution of geospatial analytics, from coarse to fine. The third dimension is the uncertainty level, from low to high [21]. The two presented geomasking methods (Gaussian and random perturbations) with different parameter settings and the TAZ-based aggregation method in our experiments using geotagged tweets are tentatively placed in this 3D cube. It is worth noting that the placement of each method is estimated from the results of our case study shown in Table 1 (e.g., based on the accuracy measures and the distance shifts). We think that such a 3D cube visualization can also serve as an assessment tool for evaluating other geomasking methods from the three aspects simultaneously. For instance, one may add donut masking, Voronoi masking, and other geomasking techniques into this framework with a different application domain (e.g., public health). While we mainly focus on the privacy preservation aspect in this research, the exploration of the other two dimensions (i.e., spatial analytics and uncertainty) requires more investigation in future work.

Limitations
Several limitations exist in our current study. First, our manual labeling approach carries uncertainty about users' actual home or work locations because the labels were not confirmed through interviews. Thus, the labeling results may not reveal the ground truth for all the actual home or work locations of those Twitter users. However, we tried our best to ensure the quality of the labels using the set of comprehensive rules described in Section 4. Second, such an approach limits the sample size because it requires a labor-intensive labeling process. Even within the same study group, we agreed on most of the manual labels, but individual differences do exist in the concordance of the labeled training data. Third, the location cluster detection results depend heavily on the number of tweets a user posts and on their particular tweeting behavior. If a user posts a large number of tweets from home (or a work location), it is easier to identify his/her home (or work location) than for users who rarely tweet. Last but not least, the presented home (work) detection method relies only on DBSCAN spatial clustering of geotagged tweets. Other approaches also exist, such as detecting home-work locations from recurring trips. With such trip-based detection approaches, the coordinate information may not be as important as the spatial interaction frequency among those points.

Broader impacts
In fact, Twitter removed support for precise geotagging in June 2019. However, the metadata of historical tweets posted before the policy change may still reveal precise GPS coordinates. In addition, when a user deletes a geotagged tweet, Twitter does not guarantee that the information will be completely removed from all copies of the data in third-party applications or in external search results. Even though the precise GPS location is no longer available, Twitter users are still able to add place tags (e.g., a city, office building, apartment, landmark, and many other types of places) to their tweets, and these tags can be converted to GPS coordinates (often using the centroid as a representative location). This is similar to the aforementioned aggregation-based masking approach; thus, we may still be able to infer users' sensitive locations from fine-scale place tags. People should be aware that sharing or publishing this kind of location data involves geoprivacy issues. The geomasking technique provides a way to help mitigate the problem not only for Twitter users but also for users of other social media platforms such as Facebook, Flickr, Weibo, and Instagram, where geotagging or place-tagging is accessible, as well as for mobile applications that track individual locations.
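
Why fine-scale place tags remain sensitive can be seen from a minimal sketch of the centroid conversion. The bounding boxes below are hypothetical and are not taken from any Twitter place record; the helper names are also illustrative only. The sketch computes how far a true location inside a tagged place can lie from the centroid used to represent it.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bbox_centroid(min_lon, min_lat, max_lon, max_lat):
    """Representative (lat, lon) location of a place given its bounding box."""
    return (min_lat + max_lat) / 2.0, (min_lon + max_lon) / 2.0


def max_displacement_m(min_lon, min_lat, max_lon, max_lat):
    """Largest distance from the centroid to any corner of the bounding box."""
    c_lat, c_lon = bbox_centroid(min_lon, min_lat, max_lon, max_lat)
    corners = [
        (min_lat, min_lon), (min_lat, max_lon),
        (max_lat, min_lon), (max_lat, max_lon),
    ]
    return max(haversine_m(c_lat, c_lon, lat, lon) for lat, lon in corners)


# Hypothetical bounding boxes: (min_lon, min_lat, max_lon, max_lat).
building_box = (-89.4010, 43.0750, -89.4000, 43.0756)  # building-sized place tag
city_box = (-89.50, 43.00, -89.25, 43.15)              # city-sized place tag

building_error = max_displacement_m(*building_box)
city_error = max_displacement_m(*city_box)
print(f"building tag: at most {building_error:.0f} m from the centroid")
print(f"city tag: at most {city_error:.0f} m from the centroid")
```

A building-sized tag bounds the true location to within tens of meters of its centroid, so it offers little masking, whereas a city-sized tag behaves like a coarse aggregation unit.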

Conclusions and future work
In this work, we have explored the effectiveness of three popular geomasking techniques for protecting the geoprivacy of active Twitter users who frequently share geotagged tweets from their home or work locations. Based on our experiments, two-dimensional Gaussian masking with proper standard deviation settings is found to be more effective at hiding or shifting social media users' home locations than the random perturbation and aggregation masks. However, Gaussian masking may also lower the spatial resolution of geospatial analytics given the sparse nature of geotagged social media data. Our experiments show that small-distance random perturbations (such as within 1km or 2km) do not sufficiently protect users' geoprivacy, because the majority of their home or work locations can still be correctly identified with high accuracy and a very small median shift distance from the ground-truth locations. Our research offers insights into the geoprivacy concerns of social media users' georeferenced data sharing for the future development of location-based applications and services.
For future work, one direction is to examine how these geoprivacy enhancements affect the user experience compared with simply removing the benefit to the user of posting geotagged tweets. Another direction is the protection of geoprivacy using spatiotemporal information and across other types of activity places (e.g., shopping, entertainment) of social media users. In addition, we would like to extend our workflow to other cities to test whether the conclusions drawn from our case study are generalizable. Although Twitter decided to remove the precise location coordinates of each tweet while keeping the place tagging function, a precise location is critical in some application scenarios such as disaster response and crime investigation. The trade-off between the required spatial analysis resolution and the privacy preservation capability needs more research across different scenarios.

Figure 1: The spatial distribution of geotagged tweets around a Twitter user's home.

Figure 2: The Gaussian geomasking with different standard deviations (SD) and the random perturbation with 1km and 2km threshold of a user's geotagged tweets.

Figure 3: The boxplot of overall accuracy changes of home cluster detection with different DBSCAN parameters (without geomasking).

Figure 4: The boxplot of overall accuracy of home cluster detection with different DBSCAN parameters with random perturbation, Gaussian masking, and the TAZ-based aggregation.

Figure 5: The boxplot of F1-score, medoid and centroid distance shifts of home cluster detection with different DBSCAN parameters with random perturbation and the TAZ-based aggregation.

Figure 7: The boxplot of F1-score, medoid and centroid distance shift of work cluster detection with different DBSCAN parameters with random perturbation and the TAZ-based aggregation.

Figure 8: Traffic Analysis Zones (TAZs) of the three cities in this research.

Figure 10: The violin plot of distance shifts of tweet locations after geomasking.

Figure 11: A 3D-cube framework for assessing different geomasking techniques; the position of each method is estimated from the results of our case study.

Table 1: The geoprivacy effectiveness measures using different geomasking methods (random perturbation with 1km threshold and Gaussian perturbation with 0.05 SD; H: Home, W: Work, GM: Gaussian Masking, RM: Random Masking, TAZ: Aggregation by traffic analysis zones; N/A means results are not available).