A context sensitive approach to anonymizing public participation GIS data : From development to the assessment of anonymization effects on data quality

The use of Public Participation Geographic Information Systems (PPGIS) for data collection has grown significantly over the past few years in different areas of research and practice. With the growing amount of data, there is little doubt that a potentially wider community can benefit from open access to them. Additionally, open data add to the transparency of research and can be considered an essential feature of science. However, data anonymization is a complex task and the unique characteristics of PPGIS add to this complexity. PPGIS data often include personal spatial and non-spatial information, which require different approaches for anonymization. In this study, we first identify different privacy concerns and then develop a PPGIS data anonymization strategy to overcome them and make open PPGIS data possible. Specifically, this article introduces a context-sensitive spatial anonymization method to protect individual home locations while maintaining their spatial resolution for mapping purposes. Furthermore, this study empirically evaluates the effects of data anonymization on PPGIS data quality. The results indicate that a satisfactory level of anonymization can be reached using this approach. Moreover, the assessment results indicate that the environmental and home range measurements as well as their intercorrelations are not significantly biased by the anonymization. However, analytical precautions, such as the use of larger spatial units, are recommended when anonymized data are used. In this study, European data protection regulations were used as the legal guidelines. However, adaptations of the methods employed in this study may also be relevant to other countries where comparable regulations exist. Although specifically targeted at PPGIS data, what is discussed in this paper can be applicable to other similar spatial datasets as well.


Introduction
Transparency, openness, and reproducibility are widely recognized as essential features of science (McNutt, 2014; Miguel et al., 2014; Nosek et al., 2015). In theory, most scientists embrace these features as disciplinary norms and values of science (Anderson, Martinson, & De Vries, 2007). However, as widely discussed and reviewed in a number of studies (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; John, Loewenstein, & Prelec, 2012; O'Boyle, Banks, & Gonzalez-Mulé, 2017), and contrary to what one might expect, these valued features are not yet routine in the daily practice of many researchers. A likely culprit for this mismatch is an academic reward system that does not sufficiently incentivize open practices (Nosek et al., 2015). However, this disconnect can also be attributed to the rightfully ever-tightening rules and legislation related to privacy and personal data protection. Adequate consideration of such concerns poses technical and legal difficulties that may render the idea of open science more problematic than rewarding.
There are various aspects and levels of open science discussed in the literature (Nosek et al., 2015). This paper focuses on open data as one of the standards of a move toward open science, and particularly on data collected through participatory mapping methods. Participatory mapping approaches, applied in a variety of fields of research and practice, have raised increasing interest as technologies to engage the general public and stakeholders to inform participatory planning and decision-making, particularly in urban and regional development contexts (Sieber, 2006). A wider user community increasingly adopts participatory mapping applications and scholarly interest in PPGIS is growing, as evidenced by the increasing number of academic publications, conferences, workshops and journal special issues (e.g., Brown & Fagerholm, 2015; Brown & Kyttä, 2018; Mukherjee, 2015). Consequently, a large volume of PPGIS data is increasingly available. Given the extensive amount of resources required for any data collection, including PPGIS, a wider community can potentially benefit from open access to such data. However, PPGIS data typically comprise spatial and non-spatial components that may pose risks to individuals' privacy without proper anonymization.
Although to date there has not been any work done on PPGIS data anonymization, there exists a limited but valuable body of literature on data anonymization from a number of other fields. A small fraction of such literature has focused on spatial data anonymization. Moreover, the literature has rarely investigated the effects of spatial anonymization on data quality. Particularly, there is currently little knowledge available on whether and how the data will be usable after the spatial anonymization. Motivated by the existing opportunities and limitations, this study develops a PPGIS data anonymization approach and empirically evaluates how the anonymized data can be used for further processing and research.

Research objectives and paper structure
The objectives of this study are threefold. First, this study aims to explore and identify the risks as well as the opportunities in publishing PPGIS data. This is pursued by describing the common characteristics of PPGIS data, identifying the types of personal data, and evaluating the privacy concerns according to the European Union General Data Protection Regulation (GDPR) (European Parliament and Council, 2016).
Second, building on this understanding of personal data types and potential privacy concerns, this study aims to develop a safe yet practical PPGIS data anonymization approach and strategy. To do so, this study reviews the most common data anonymization approaches in the literature and builds upon them to develop a PPGIS data anonymization approach.
Third, in this study we empirically assess how data anonymization can affect data quality. To pursue this objective, we use real data obtained from a PPGIS survey and analyze how the measurements and research findings yielded from the original and anonymized data differ from each other.
The structure of this paper is in line with these research objectives. Accordingly, we will first review PPGIS data characteristics and assess the privacy concerns associated with their publishing according to the legal documents. Next, we will review the literature to find solutions on how we can tackle these privacy concerns and make open PPGIS data possible. Subsequently, we will explain our PPGIS data anonymization method. Finally, we will evaluate how this data anonymization has affected the data quality. At the end, we will discuss the findings and limitations and make some conclusions for future work in this area.

Personal data and privacy concerns
Legislation concerning data privacy varies between legislative systems. This article discusses privacy concerns and related legislation from the perspective of the European Union General Data Protection Regulation (GDPR) (European Parliament and Council, 2016), which has applied since May 2018 and supersedes the Data Protection Directive of 1995 (European Parliament and Council, 1995). However, the best practices on anonymizing and publishing PPGIS data introduced in this study may be applicable in other legislative systems when necessary modifications are considered.
The aim of the GDPR is to protect the rights of natural persons in relation to the processing of their personal data and to harmonize these rights across the EU member states. The GDPR defines personal data as any information that may lead to the direct or indirect identification of a natural person. Examples of personal data are provided, including, but not limited to, name, location data, online identifier, and factors specific to the person's physical, economic, or social identity (European Parliament and Council, 2016; Article 4). The GDPR defines the rights natural persons have concerning their personal data, including, for example, the right to be informed about the content and processing of the personal data and the right to access, rectify, or erase personal data.
However, the principles of data protection defined by the GDPR do not apply to anonymous information. Anonymity refers to a state where a person can no longer be identified or singled out from the data (European Parliament and Council, 2016; Recital 26). In other words, during an anonymization process the data must be irreversibly processed in such a way that it can no longer be used to identify a natural person by using "all the means likely reasonably to be used" by any party (European Parliament and Council, 2016). Unlike data that is pseudonymized (i.e., personal data is processed in such a manner that it cannot itself be linked to a specific person, e.g., by replacing names with number codes), anonymized data guarantees that the individual person cannot be identified when all available additional information on the subject is considered (European Parliament and Council, 2016; Recital 26).
Unless other ethical reasons apply or it has been otherwise agreed with the study participants, open sharing and publishing of research data in compliance with the GDPR requires that the preconditions of anonymized personal information are met. The European advisory body on data protection and privacy outlines three criteria for an effective anonymization (Article 29 Data Protection Working Party, 2014):
- Singling out: the anonymization must make it impossible to isolate some or all records which identify an individual in a dataset.
- Linkability: the anonymization must make it impossible to link records relating to an individual.
- Inference: the anonymization must make it impossible to infer, with significant probability, the value of an attribute from values of a set of other attributes.

Types of personal PPGIS data
A PPGIS survey may include all components of a conventional research survey, with different elements used to collect personal and non-personal information. PPGIS data stands apart from other survey-based data by including the additional component of spatial information created by the study or survey participants. As described by Brown and Kyttä (2014), PPGIS surveys employ spatial elements to locate behaviors, functions, perceptions, or evaluations. Common to these elements is that they are located in the geographic extent of the respondent's everyday life, thus capturing the context the respondents have the most knowledge about through their lived-in experiences. A typical PPGIS dataset could contain, in addition to other survey elements, spatial elements (points, polylines, or polygons) representing spatial phenomena mapped by the respondent, for example, places the respondent visits on a regular basis or perceptions of the environment, such as places with high natural value or places the respondent perceives as unsafe to visit.
Considering the nature of respondent-created spatial information and the possibility to identify an individual according to the GDPR, we identify three main types of PPGIS spatial data, namely (1) Primary personal spatial data, (2) Group-level spatial data, and (3) Thematic spatial data. These classes relate to different types of mapping tasks differing on whether the respondent may be identified from the spatial data itself or from the spatial data in conjunction with other personal information. These data types and recommendations for their treatment during an anonymization process are introduced in Table 1. These recommendations are based on our interpretation of GDPR and its implications for data types typically present in PPGIS.

Data anonymization: Existing methods and approaches
Data anonymization is a type of information sanitization process (Saygin, Hakkani-Tür, & Tür, 2005), which aims to protect individuals' privacy and to satisfy the requirement of compatibility with the legal and ethical grounds of further processing. Anonymization is a way to keep the benefits of open data while mitigating the risks. For spatial anonymization, this would mean that we provide privacy protection for individual addresses and precise geographical information while maintaining spatial resolution for mapping purposes (Allshouse et al., 2010). Although this is a sensitive and challenging task, its proper implementation can enable open data publication and greatly benefit the scientific community.
Broadly speaking, the existing anonymization techniques fall into two categories of generalization and randomization (Zhou, Pei, & Luk, 2008). The generalization approach consists of generalizing, or diluting, the attributes of data participants by modifying the respective scale or order of magnitude. For spatial anonymization this would mean that each individual map feature would be generalized into a bigger spatial region that contains at least K−1 other users (Ghinita, Zhao, Papadias, & Kalnis, 2010). For example, instead of sharing points representing individuals' homes, one would share the neighborhood, the grid cell, the region, or any other bigger corresponding spatial unit. Such techniques help prevent a data subject from being singled out by grouping them with, at least, K−1 other individuals. This is widely referred to as K-anonymity, which not only serves as an anonymization method, but also as an indicator of how effective an anonymization process is (Cassa, Grannis, Overhage, & Mandl, 2006).
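As an illustration of generalization, the following minimal sketch assigns each point to a square grid cell and reports the achieved k as the size of the smallest occupied cell. The function names, the grid size, and the sample coordinates are hypothetical; coordinates are assumed to be projected and expressed in meters, and k here counts records in the dataset rather than residents of the area.

```python
import math
from collections import Counter

def generalize_to_grid(points, cell_size):
    """Replace each (x, y) point by the index of the square grid cell containing it."""
    return [(math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points]

def achieved_k(cells):
    """k of the generalized data: the number of records in the smallest occupied cell."""
    return min(Counter(cells).values())

# Hypothetical home points in projected coordinates (meters).
homes = [(120.0, 80.0), (430.0, 310.0), (460.0, 390.0), (90.0, 20.0), (700.0, 710.0)]

fine_cells = generalize_to_grid(homes, cell_size=500.0)
coarse_cells = generalize_to_grid(homes, cell_size=1000.0)

print(achieved_k(fine_cells))    # one point sits alone in its 500 m cell
print(achieved_k(coarse_cells))  # a coarser grid groups all five points together
```

The example shows the usual tradeoff: a coarser spatial unit raises k but lowers the spatial resolution of the published data.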
Randomization refers to a family of techniques that alter the accuracy of data in order to weaken links between the data and the individuals. This is most commonly accomplished by addition of some noise to the data (Zandbergen, 2014). For spatial anonymization this would mean that a map feature, for example a point, is displaced to a new location d units of distance away from its original location. Depending on the implementation, the value of d, or its direction, or both can be randomly generated. To control the characteristics of d, the operator may impose conditions for its generation. For example, the operator may define minimum and maximum values for d to control the magnitude of displacement. In this case, the displacement area will be a donut shaped ring according to the minimum and maximum parameters (Allshouse et al., 2010) (Fig. 1).
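The donut displacement described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function name and parameter values are hypothetical, coordinates are assumed to be in meters, and the distance is drawn uniformly between the two bounds (which places relatively more points near the inner edge of the ring than an area-uniform draw would).

```python
import math
import random

def donut_displace(x, y, d_min, d_max, rng=random):
    """Move (x, y) by a random distance in [d_min, d_max] in a random direction."""
    if not 0 <= d_min <= d_max:
        raise ValueError("require 0 <= d_min <= d_max")
    d = rng.uniform(d_min, d_max)            # displacement magnitude
    theta = rng.uniform(0.0, 2.0 * math.pi)  # displacement direction in radians
    return x + d * math.cos(theta), y + d * math.sin(theta)

rng = random.Random(42)  # seeded only to make the example repeatable
new_x, new_y = donut_displace(1000.0, 2000.0, d_min=50.0, d_max=300.0, rng=rng)
moved = math.hypot(new_x - 1000.0, new_y - 2000.0)
print(round(moved, 1))  # always between 50 and 300 m
```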
It is also possible to go further in controlling d by defining a custom function for generating it. For instance, a Gaussian function can be used to make the production of smaller displacement values more likely (Cassa et al., 2006). This can help preserve the overall spatial quality of the data by generally avoiding very large displacements. Nevertheless, satisfactory anonymization may not be achieved with very small displacements. That is why it is important to use mixed approaches. One mixed approach is to use a bimodal Gaussian displacement. This is in essence a combination of the "donut" approach and the Gaussian random generation. In other words, while values are generated using the given function, minimum and maximum conditions can be defined to control the outcomes.
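One way to combine Gaussian generation with minimum and maximum conditions is rejection sampling: Gaussian offsets are drawn for x and y and redrawn until the resulting distance falls inside the allowed ring. The sketch below assumes this particular combination; the function name, parameters, and values are hypothetical and are not taken from the cited studies.

```python
import math
import random

def gaussian_donut_displace(x, y, sigma, d_min, d_max, rng=random, max_tries=10_000):
    """Displace (x, y) by Gaussian noise, redrawing until the distance is in [d_min, d_max]."""
    for _ in range(max_tries):
        dx = rng.gauss(0.0, sigma)  # offset along x
        dy = rng.gauss(0.0, sigma)  # offset along y
        if d_min <= math.hypot(dx, dy) <= d_max:
            return x + dx, y + dy
    raise RuntimeError("no admissible displacement found; check sigma, d_min and d_max")

rng = random.Random(7)  # seeded only to make the example repeatable
gx, gy = gaussian_donut_displace(0.0, 0.0, sigma=150.0, d_min=50.0, d_max=500.0, rng=rng)
print(round(math.hypot(gx, gy), 1))  # small displacements are more likely than large ones
```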
A Gaussian function is blind to the feature's context and hence may not be the most suitable approach for spatial anonymization per se. For example, if we are trying to anonymize the primary personal spatial data of individuals, represented as points in a dataset, a small displacement may suffice in dense urban areas to ensure that the individual can no longer be singled out. On the other hand, in a sparsely populated area, a larger displacement may be needed to effectively deidentify an individual. Therefore, a customized and context sensitive function for randomization is a more promising approach for spatial anonymization.
At the same time, a parallel line of research has occasionally sought alternative approaches for spatial data anonymization. Obfuscation is an example of such alternative approaches that replaces an individual's location with a nearby intersection or building to obscure the real location (Ardagna, Cremonini, Damiani, Di Vimercati, & Samarati, 2007; Duckham & Kulik, 2005). Furthermore, Zhang and colleagues (2017) developed a more complex spatial anonymization method called "location swapping", which aims to replace an original location with a masked location selected from all possible locations with similar geographic characteristics within a specified neighborhood. A common limitation of obfuscation methods is that they can significantly degrade the quality and usability of the anonymized data. This is especially problematic when there are no appropriate targets around the individual, so that the substitute location will be far from that of the individual. Additionally, the notion of "similar geographic characteristics" in methods such as location swapping can be subjective and hard to define and operationalize.
Table 1. PPGIS data and potential personal information (before anonymization).
1. Primary personal spatial data: residential location(s), second homes
Geometry: Point
Likelihood of individual identification:
- Very likely: In areas with low residential density, an individual or the individual's household could be identified from non-anonymized point data.
- Likely: Increased risk of identification when spatial data is linked to other individual-level variables.
- Unlikely: In areas with high residential density, individual may be recognized on the level of street address.
Recommendations for data anonymization: Always recommended. Increased need for anonymization when the residential location is situated in rural areas or urban areas with low population density, or when the amount of other individual-level variables increases (gender, age, occupation, etc.).
2. Group-level spatial data: locations identifiable to a limited group of individuals, e.g., place of work, university, child's kindergarten
Geometry: Point, polyline
Likelihood of individual identification:
- Unlikely: If data is presented as such.
- Likely: Increased risk of identification when spatial data is linked to other individual-level variables.
Recommendations for data anonymization: Recommended, when spatial data is linked to other individual-level variables.
3. Thematic spatial data: locations with no direct connection to the individual, e.g., environmental perceptions, places related to behavior in public or private spaces visited by a high number of people, such as shopping centers, parks, etc.
Likelihood of individual identification:
- Very unlikely: If data is presented as such.
- Likely: Increased risk of identification when spatial data is connected to other individual-level variables that can be used to infer individual behavior patterns, e.g., activity spaces.
Recommendations for data anonymization: Anonymization is rarely needed. Recommended in specific cases, when spatial data is connected to other individual-level variables and patterns derived from thematic spatial data that can be used to identify the individual.
K. Hasanzadeh, et al. Computers, Environment and Urban Systems 83 (2020) 101513

Effects of anonymization on spatial analysis results
Research on the effects of anonymization on analytical results is essential in order to determine whether an anonymization method reaches a meaningful balance between privacy protection and the ability to derive relevant results and patterns from the data (Zandbergen, 2014). Apart from a few examples involving visual perception of spatial patterns (Leitner & Curtis, 2004, 2006), the effects of spatial anonymization on data quality have mostly been examined using specific spatial analytical procedures. Examples of such technical procedures include the assessment of anonymization effects on results from clustering analysis (Cassa et al., 2006; Kwan, Casas, & Schmitz, 2004), built environmental measurements (Clifton & Gehrke, 2013), and kernel density estimation (Shi, Alford-Teaster, & Onega, 2009). Results vary between these studies, suggesting that different anonymization processes may affect the utility of different measures and analytical procedures differently. Nevertheless, all findings agree in indicating a consistent tradeoff between the amount of displacement and the accuracy of analytical results. In other words, with larger displacement distances a gradual reduction in the usability of data for certain analytical purposes can emerge. These findings are consistent with the body of literature on the effects of positional errors on geocoded data (e.g., Duncan, Castro, Blossom, Bennett, & Steven, 2011; Zandbergen, 2009).

Test data
The data was collected using an online PPGIS method that combines Internet maps with traditional questionnaires (Brown & Kyttä, 2014). A random sample of 5000 residents of Helsinki metropolitan area, in Finland, aged between 55 and 75 was obtained from Finnish Population Register Center and an invitation was sent to participants' home addresses in October 2015. The dataset included personal information as described in Table 1. In the survey (Fig. 2), respondents used an online interface to answer a number of questions about themselves and mark their everyday important places. This included their living location as well as their daily destinations such as leisure and recreational activity places, shopping, services, and sport facilities. There were 1139 responses in total. After deleting incomplete submissions, data from 844 participants was used for the study. A summary of the attributes in the data is presented in Appendix 2.

Data anonymization
In this study, we develop a unique approach to anonymize personal spatial data while maintaining the overall quality of the datasets. In this approach, group-level and thematic spatial data, such as daily destinations, are displaced using a donut spatial anonymization. Accordingly, these points are randomly displaced by a distance between a minimum a and a maximum b, in a random direction. However, for primary personal spatial data, such as home locations, a more complex approach was implemented. This is described in the following section.
All anonymization procedures described below were implemented in Python using ESRI's ArcPy module. According to the GDPR, "to ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments" (European Parliament and Council, 2016; Recital 26). Therefore, for safety reasons we do not share the exact anonymization scripts with the reader. However, the detailed description of the procedure, together with the pseudocode provided as an appendix, should make it possible for the interested reader to reimplement the algorithm.

Anonymizing primary personal spatial data
For anonymizing primary personal spatial data, such as home locations, we developed a customized bimodal Gaussian displacement algorithm. The use of Gaussian function was to ensure that the points are displaced as little as possible as long as the minimum anonymization requirements are met. To ensure all points are displaced from their original locations, the Gaussian function was coupled with donut anonymization. Hence, the algorithm ensures that all points are displaced at least by a minimum distance. To ensure deidentification of individuals in sparsely populated areas and to avoid unnecessarily large displacements in dense areas where identification of individuals is reasonably unlikely, the displacements were adjusted according to the population density and feature density around the original location of the feature. Accordingly, features located in less dense areas are more likely to deviate more from their original place than features located in dense areas. Fig. 3 shows the anonymization process.
Mathematically speaking, for each feature f, the displacement D_f is calculated from two quantities, G and CM. G is the Gaussian function output, generated separately for the x and y coordinates for a random value of σ. CM is the combined multiplier, which is calculated from the feature density multiplier (FDM), the population density multiplier (PDM), and two constants: C, a constant parameter with a non-linear effect on the model's skewness, and ∆, a constant between 0 and 1 controlling the weight of FDM in the anonymization.
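The general idea of a density-adjusted Gaussian displacement with a minimum distance can be sketched as below. This is only an illustration of the principle: the multiplier used here (the square root of a reference density divided by the local density, capped at a maximum) is a hypothetical stand-in and is not the CM, FDM, or PDM formula of this study, and all names and parameter values are assumptions.

```python
import math
import random

def density_multiplier(density, reference_density, max_multiplier=5.0):
    """Hypothetical multiplier: 1 at or above the reference density, larger in sparser areas."""
    if density <= 0:
        return max_multiplier
    return min(max_multiplier, max(1.0, math.sqrt(reference_density / density)))

def context_sensitive_displace(x, y, density, sigma, d_min, reference_density,
                               rng=random, max_tries=10_000):
    """Gaussian displacement whose spread grows as the local density falls."""
    scaled_sigma = sigma * density_multiplier(density, reference_density)
    for _ in range(max_tries):
        dx = rng.gauss(0.0, scaled_sigma)
        dy = rng.gauss(0.0, scaled_sigma)
        if math.hypot(dx, dy) >= d_min:  # enforce the minimum (donut) distance
            return x + dx, y + dy
    raise RuntimeError("no admissible displacement found; check sigma and d_min")

rng = random.Random(1)  # seeded only to make the example repeatable
# Densities are hypothetical values in persons per square kilometer.
dense = context_sensitive_displace(0.0, 0.0, density=4000.0, sigma=100.0,
                                   d_min=50.0, reference_density=4000.0, rng=rng)
sparse = context_sensitive_displace(0.0, 0.0, density=250.0, sigma=100.0,
                                    d_min=50.0, reference_density=4000.0, rng=rng)
```

With these assumed values the point in the sparse area is drawn from a distribution four times as wide as the point in the dense area, so it is more likely, though not guaranteed, to move further.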

Calculating k-anonymity
We estimate the expected level of k-anonymity for each individual point by multiplying the local population density by a circular ring area approximation of the Gaussian probability distribution function (Fig. 4) (Cassa et al., 2006). Since 68.26% of individual cases should fall, on average, within the first standard deviation, that is, within σ meters of where they were originally located, we can multiply the local population density by the area, πσ², and by the probability that the point would have been moved into that area, 0.6826. Subsequently, we add to this the next ring's population density multiplied by its area and by the probability that the point would be relocated into that ring, 0.2718. Finally, we add the area of the third ring multiplied by its local population density and by its probability, 0.0428. The sum of these three values provides an estimation of k-anonymity achieved for a specific point in a dataset (Cassa et al., 2006). Consequently, k-anonymity achieved for the whole dataset can be calculated as the minimum k-anonymity of all individual points. This, together with other basic statistics such as the average, maximum, median, and standard deviation of the k-anonymity for all individuals, can provide a computationally tractable expectation of k-anonymity and an overall assessment of the anonymization achieved for a dataset.
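The estimate described above can be sketched in a few lines. The sketch assumes that the three areas are the disc of radius σ and the rings from σ to 2σ and from 2σ to 3σ, and that the caller supplies one population density per ring in persons per square meter; the function names and the sample values are hypothetical.

```python
import math

# Probabilities of falling within the first, second, and third standard deviation.
RING_PROBABILITIES = (0.6826, 0.2718, 0.0428)

def estimate_k_anonymity(sigma, ring_densities):
    """Estimate k for one point from sigma (m) and three ring densities (persons per m^2)."""
    if len(ring_densities) != 3:
        raise ValueError("expected one density per ring (three values)")
    k = 0.0
    for i, (probability, density) in enumerate(zip(RING_PROBABILITIES, ring_densities)):
        inner, outer = i * sigma, (i + 1) * sigma       # assumed ring boundaries
        area = math.pi * (outer ** 2 - inner ** 2)       # disc for i = 0, ring otherwise
        k += density * area * probability
    return k

def dataset_k_anonymity(per_point_k):
    """k-anonymity of the whole dataset: the minimum over all individual points."""
    return min(per_point_k)

# Hypothetical example: sigma = 100 m, uniform density of 3000 persons per km^2.
k_point = estimate_k_anonymity(100.0, (0.003, 0.003, 0.003))
print(round(k_point, 1))
print(dataset_k_anonymity([k_point, 412.0, 365.5]))
```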
It should be noted that this estimation of k-anonymity is based on the assumption that no other external knowledge of an individual is available. Therefore, if for instance we know the gender of each individual, assuming that half of the population are males and the other half females, the actual achieved k-anonymity will only be half of the measured value.
Fig. 3. An overview of the spatial anonymization process developed in this study.

Anonymizing non-spatial attributes
A dual approach was taken to anonymize sensitive non-spatial attributes that can be used to single out individuals or to compromise the achieved k-anonymity beyond an acceptable threshold. First, very sensitive personal information that could be used to identify individuals was removed from the data. Examples of such attributes include open comments by individuals and specific information about their job and household characteristics. Second, some of the other attributes were kept in the data in an aggregated form to ensure they cannot be used to single out individuals. For example, the age of individuals, which was available as a numeric value in the original data, was aggregated into large categories. The categorization was adopted from Statistics Finland's public datasets and included three age ranges (<15, 15-65, and >65).
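The two steps above, removal and aggregation, can be sketched for a single survey record as follows. The field names and the set of removed fields are hypothetical; only the three age ranges come from the text.

```python
# Hypothetical names of fields that are dropped entirely.
REMOVED_FIELDS = {"open_comment", "job_description", "household_details"}

def age_group(age):
    """Aggregate a numeric age into the three ranges <15, 15-65, and >65."""
    if age < 15:
        return "<15"
    if age <= 65:
        return "15-65"
    return ">65"

def anonymize_record(record):
    """Drop very sensitive fields and replace the numeric age by its category."""
    cleaned = {key: value for key, value in record.items() if key not in REMOVED_FIELDS}
    if "age" in cleaned:
        cleaned["age_group"] = age_group(cleaned.pop("age"))
    return cleaned

respondent = {"id": 17, "age": 67, "gender": "female", "open_comment": "I live next to ..."}
print(anonymize_record(respondent))
```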

Real data implementation and evaluation of effectiveness
To evaluate its performance, the procedure described above was applied to a PPGIS dataset collected from Helsinki metropolitan area, Finland. Subsequently, we measured the statistics on how the procedure has performed in anonymizing the data. Additionally, we tested how the anonymization has affected the measurements derived from the anonymized data compared to the original dataset.
The data in this part was anonymized using the same procedure as described earlier. However, it should be noted that, as an extra safety measure, home points located in areas with a population density below 200 inhabitants per square kilometer were removed prior to the anonymization. Additionally, any point that failed to meet a minimum k-anonymity of 300 after anonymization was also removed.

Measures and variables used for evaluation
PPGIS data is widely used in urban and environmental studies to assess various levels and aspects of person-environment relationships. Therefore, in order to evaluate the usability of anonymized data for research, we used a home range model as the spatial unit of analysis and calculated a number of mobility and environmental characteristics for each individual. The home range model was adopted from an earlier study using PPGIS data (Hasanzadeh, Broberg, & Kyttä, 2017). The model is an individualized, customized minimum convex polygon containing the home location and everyday destinations of the individual. In this study, to avoid overly large polygons, destinations further than 10 km from an individual's place of residence were excluded from the modeling.
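A simplified version of such a home range can be sketched as a plain convex hull of the home point and the destinations within 10 km. This is an assumption-laden illustration, not the individualized model of Hasanzadeh et al. (2017): the function names are hypothetical, and coordinates are assumed to be projected and expressed in meters.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula (square meters here)."""
    total = 0.0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def home_range(home, destinations, max_distance=10_000.0):
    """Convex hull of the home and all destinations within max_distance of it."""
    kept = [d for d in destinations if math.dist(home, d) <= max_distance]
    return convex_hull([home] + kept)

home = (0.0, 0.0)
destinations = [(2000.0, 0.0), (2000.0, 2000.0), (0.0, 2000.0), (1000.0, 1000.0), (15000.0, 0.0)]
hull = home_range(home, destinations)
print(hull, polygon_area(hull) / 1e6, "km^2")
```

Computing the same polygon from the original and the displaced home point gives a direct way to compare areas before and after anonymization.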
Using this model as the spatial unit, a number of variables were calculated for each individual in both original and anonymized datasets for comparison. These are some common variables that are adopted from previous research (Table 2).
In addition to the calculated variables described above, a number of other variables were directly taken from the survey. These include four perceived wellbeing measures, namely health, quality of life (QoL), capability of functionality, and happiness, which were directly asked in the survey using a five-point Likert scale that ranged from very bad to very good. Additionally, four background variables, namely age, education, gender, and income, are also included in the analysis. It should be noted that all spatial analyses and simulations were conducted in ArcMap 10.6 and a Python 2.7 environment, mainly using ESRI's ArcPy module.
Fig. 4. Estimating k-anonymity. Using the dataset's standard deviation of displacements, σ, an estimate of achieved k-anonymity is calculated.

Statistical methods
Paired sample t-tests were utilized to examine whether significant differences exist between variables calculated using the two sets of data, i.e., the original and the anonymized data. The significance of the comparison results was adjusted for type I error using Bonferroni correction. To examine the associations between variables within the two datasets, a Pearson correlation analysis was performed on each dataset. Subsequently, we compared the two correlation analyses to evaluate how the anonymization has affected the correlations between variables. We did this by transforming the correlation coefficients into Z scores using Fisher's transformation. The significance of the differences was then tested using the Z test statistic. For the Z test, the null hypothesis, H0, was that the correlation coefficients from each corresponding pair in the two datasets are equal (β1 = β2). The alternative hypothesis, HA, was that β1 ≠ β2. All statistical analyses were conducted using SPSS Statistics 26 and Python 2.7.
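The comparison of two correlation coefficients via Fisher's transformation can be sketched as follows. The sketch uses the common standard error for two independent samples, which is an assumption: the original and anonymized datasets contain the same respondents, so this form should be read as an approximation rather than as the exact test used in this study. Function names and input values are hypothetical.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided Z test for the difference between two Pearson correlations."""
    z1 = math.atanh(r1)  # Fisher transformation of the first coefficient
    z2 = math.atanh(r2)  # Fisher transformation of the second coefficient
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

# Hypothetical coefficients for one variable pair in the two datasets.
z, p = fisher_z_test(0.31, 844, 0.29, 824)
threshold = bonferroni_alpha(0.05, 10)
print(round(z, 3), round(p, 3), p < threshold)
```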

Data anonymization output, key figures and stats
A total of 844 home points were available in the original data. This number dropped to 824 points after anonymization, as a number of points were removed to protect these individuals' privacy. These included points located in sparsely populated areas, as well as those that failed to meet the minimum k-anonymity requirement of 300 after data anonymization. As can be seen in Table 3, all points have been displaced by between 50 and 727 m from their original locations, and a minimum k-anonymity of 330 is achieved in the process. As illustrated in Fig. 5, the displacements have preserved the normal curve shape with a higher concentration of smaller values.
As seen in Fig. 6, the population density multiplier (PDM) has affected the anonymization process, with the greatest displacements occurring in the least populated areas. A very small value of 0.05 was used for Δ, hence the feature density multiplier (FDM) has minimal effect on the anonymization process. Fig. 7 illustrates examples of how home range models derived from the two datasets overlap with one another. On average, the anonymized and original home ranges overlap over around 92% of their areas. However, there is some variation between individuals, with overlap percentages varying from as little as 31% up to 99%, with a standard deviation of 6%. As shown in Fig. 7, the poorest overlaps occur when large displacements are applied to small spatial units. Such poor cases were most common in less densely populated areas and among individuals who had marked few or no points other than their living location.

Effects of data anonymization on measurements
The t-test analysis indicates that none of the measurements changed significantly after the anonymization (Table 4). On average, the green area percentage diverged from its original value by only 0.17%. Further, the area of the home range changed by 0.1 km² on average after anonymization. Similarly, the changes in distance, elongation, and orientation were small.
It is worth mentioning that the significance of these changes needs to be interpreted cautiously and in a context-sensitive manner. For example, the 0.1 km² average change in area seems insignificant compared to the 11.95 km² average area of the home range units in this study (a 0.8% change). However, the same change equates to a much larger proportion of a smaller spatial unit, such as a circular buffer with a 500 m radius (12%). Table 5 shows the results of the Z test on the correlation coefficients derived from the two datasets, original and anonymized, assessing how significantly they differ from each other. According to these results, the changes caused by data anonymization are insignificant for most correlation coefficients. The few significant differences are likely caused by the generalization of attribute data. For example, in the original dataset income was presented in 16 categories, which were aggregated into only two categories, below and above average, in the anonymization process.
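A paired comparison with a Bonferroni-adjusted threshold, as used for these measurements, can be sketched as below. This is an illustrative sketch only, assuming Python 3.8 or later. The function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples such as the several hundred respondents here. A statistics package would normally provide the exact test.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t(original, anonymized):
    """Paired-sample t statistic with a large-sample (normal) p-value."""
    diffs = [a - o for o, a in zip(original, anonymized)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    # Assumption: normal approximation to the t distribution (large n)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up p-values for three measurements, for illustration only.
flags = bonferroni_significant([0.001, 0.02, 0.5])
```

With three comparisons the per-test threshold becomes 0.05 / 3, so a p-value of 0.02 that would pass an uncorrected test is no longer flagged.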
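The scale dependence noted in this section can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text (a 0.1 km² average change, an 11.95 km² average home range, and a circular buffer with a 500 m radius); the function name is hypothetical.

```python
import math


def relative_change_percent(change_km2, unit_area_km2):
    """Express an absolute area change as a percentage of the unit's area."""
    return 100.0 * change_km2 / unit_area_km2


# Average home range in this study: 11.95 km2
home_range_change = relative_change_percent(0.1, 11.95)

# Circular buffer with a 500 m (0.5 km) radius: area = pi * r^2
buffer_area_km2 = math.pi * 0.5 ** 2
buffer_change = relative_change_percent(0.1, buffer_area_km2)
```

The same absolute change is below 1% of the average home range but between 12% and 13% of the buffer area, which is why the choice of spatial unit matters when anonymized data are analyzed.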

Discussion
The use of PPGIS for data collection has been growing rapidly over the past few years in different areas of research and practice. With the growing amount of data, there is little doubt that a potentially wider community can benefit from open access to them. Additionally, open data add to the transparency of research and can be considered an essential feature of science. However, open data come with significant legal and ethical challenges, as PPGIS datasets typically contain sensitive personal information, which needs to be protected prior to any publication.
PPGIS data have special characteristics that make them different from many other sources of data; hence, special measures need to be taken to anonymize them. PPGIS datasets are often collected at the individual level, which adds to the sensitivity of the data. Additionally, PPGIS data typically include both spatial and non-spatial personal information. Therefore, not a single method but a strategy comprising various methods is required to protect the different levels of personal information in these data. In response to these needs, we developed in this study a PPGIS data anonymization strategy comprising various anonymization methods to enable the opening of PPGIS data.

(Hasanzadeh, Laatikainen, & Kyttä, 2018)
Elongation: Length-to-width ratio of the smallest rectangle enclosing the home range. (Ramezani, Laatikainen, Hasanzadeh, & Kyttä, 2019)
Greenness: The percentage of home range area covered by open green spaces, e.g., forests and parks (using Corine land cover data). (Broberg, Salminen, & Kyttä, 2013)
Distance to destinations: Average distance from home to all everyday destinations for each individual. (Perchoux et al., 2014)
Population density: Average population density within the home range boundary (using Statistics Finland population grid data 2017). (Hasanzadeh, 2019)
Orientation: The orientation of the longer side of the smallest rectangle enclosing the home range. Orientation angles are measured in decimal degrees clockwise from north. (Sherman, Spencer, Preisser, Gesler, & Arcury, 2005)
Hasanzadeh, et al. Computers, Environment and Urban Systems 83 (2020) 101513
Using the anonymization algorithm developed in this study, the home locations in the test data were displaced by 137 m on average from their original locations. This is nearly half the average displacement distance reported by a previous study using a comparable bimodal Gaussian approach (Cassa et al., 2006). Because of the included context-based parameters, the largest displacements generally occurred in areas where population density and participation were lower. After anonymization, the home locations reached a minimum k-anonymity of 330 and an average k-anonymity of 53,456. This is a considerably higher k-anonymity than that reported in some previous studies. In a study carried out in three counties in the US, 17% and 23% of cases yielded k-anonymity values of less than 20 after anonymization with random noise addition and location swapping techniques, respectively (Zhang, Freundschuh, Lenzer, & Zandbergen, 2017). In another study, carried out in the Portland metropolitan region in the US, 5 out of 10 studied neighborhoods yielded minimum k-anonymity values of less than 50 when a donut approach with a maximum distance of around 1600 m (1 mile) was used (Clifton & Gehrke, 2013). It is worth noting that the k value should be interpreted with caution as a measure of anonymity. The true k-anonymity achieved by an anonymization may be considerably lower than the measured value, depending on what other attributes are included in the data. For example, the k-anonymity reported in this study does not take into account any attributes other than gender. Depending on the level of generalization, including information such as income and education for each individual may significantly lower the actual k. Therefore, it is recommended to aim for a greater k-anonymity in the anonymization process in order to ensure the confidentiality of participants.
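The effect of additional attributes on k can be illustrated with the standard attribute-based definition of k-anonymity, i.e., the size of the smallest group of records sharing the same combination of quasi-identifiers. The sketch below is a generic illustration rather than the spatial k calculation used in this study; the function name and the toy records are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Made-up records, for illustration only.
toy_records = [
    {"gender": "F", "income": "below"},
    {"gender": "F", "income": "below"},
    {"gender": "F", "income": "above"},
    {"gender": "M", "income": "below"},
    {"gender": "M", "income": "below"},
    {"gender": "M", "income": "below"},
]
k_gender_only = k_anonymity(toy_records, ["gender"])
k_gender_income = k_anonymity(toy_records, ["gender", "income"])
```

In the toy data, k drops from 3 to 1 once income is added to gender, which illustrates why each extra attribute released with the data can lower the effective k.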
We compared the measurements derived from the two datasets based on a number of home range calculations adopted from earlier studies (Hasanzadeh et al., 2017). On average, the home range of individuals after anonymization showed a roughly 92% match with the one before anonymization. This indicates that the anonymized and original home ranges substantially overlap, suggesting potentially insignificant environmental measurement biases after the anonymization. However, in a few cases the overlap percentage was as low as 30%. These individuals have relatively small home ranges and/or did not report any points other than their home locations in the survey. Therefore, the scarcity of map responses and the size of the spatial units used need to be taken into consideration in future use of the anonymized data. None of the included measurements, namely area, greenness, distance, orientation, and elongation of the home range, changed statistically significantly from their original values. Furthermore, given the small effects of anonymization observed on these measurements, the associations were, as expected, statistically unchanged in most cases. The only significant changes were caused by the generalization of non-spatial personal attributes such as income and education. Although measured by different sets of structural variables, previous research using the donut approach has generally shown more environmental measurement errors caused by the anonymization process (Clifton & Gehrke, 2013). However, these errors have previously been found to be associated with the amount of displacement as well as the structural characteristics of the area (Clifton & Gehrke, 2013).
At the same time, when interpreting these results it is important to note that the statistical differences depend on the geographical scale and the size of the spatial unit of analysis. For instance, a 0.1 km² change in the area of home ranges is insignificant compared to the large size of home ranges in this study. For a smaller spatial unit of analysis, however, the same change may turn out to be a significant deviation from the original. Therefore, when using anonymized data, it is advisable to opt for larger spatial units of analysis, such as activity spaces and home ranges (Hasanzadeh et al., 2017), rather than smaller, often home-based, units of analysis such as circular buffers (Kyttä, Broberg, Haybatollahi, & Schmidt-Thome, 2015; Seliske, Pickett, Boyce, & Janssen, 2009). This is comparable to a previous finding suggesting that a larger search radius or bandwidth needs to be used when applying kernel density estimation to geomasked data (Shi et al., 2009).
Overall, the results of this study show that PPGIS data can be safely anonymized while maintaining their overall quality for many purposes. However, data anonymization inevitably comes with some degradation of data accuracy and quality. Therefore, when reusing anonymized data, it is important first to know the quality losses, second to understand how they can affect the study results, and third to take the necessary analytical measures to mitigate these impacts. Further, it should be noted that this study has discussed the anonymization of typical PPGIS data containing certain personal information. The anonymization of PPGIS data with a clear focus on highly sensitive personal information, such as medical history or other health records, most likely requires additional steps to ensure the privacy of the respondents.
This study was conducted in the EU, and thus European data protection regulations were used as the legal guidelines. However, comparable regulations exist in other parts of the world, and thus an adaptation of the methods employed in this study may be relevant to other countries as well. Further, although the methods discussed in this paper are directly targeted at PPGIS data, many of them can be applied to other sources of spatial data as well.

Conclusions
This study developed a PPGIS data anonymization approach that can be used to protect the sensitive spatial and non-spatial information in the data. At its core, this approach offers a context-sensitive spatial anonymization method for protecting primary spatial information such as the home locations of participants.
The application of this anonymization method to real data obtained from a PPGIS survey showed that PPGIS data can be safely anonymized while maintaining their overall quality. Although some quality loss is the price we must pay for privacy protection, the data appear to remain usable for many purposes. In particular, when a large amount of individual-level spatial data is available and larger spatial units such as activity spaces are used, the anonymization is unlikely to cause significant biases in the results. Despite these forward steps in PPGIS data anonymization, future research can potentially benefit from other optimized anonymization approaches. Further, additional evidence on the effects of anonymization on data quality is needed to examine the reusability of anonymized data and to promote the use of open PPGIS data.