An anonymised longitudinal GPS location dataset to understand changes in activity-travel behaviour between pre- and post-COVID periods

Collecting GPS data using mobile devices is essential to understanding human mobility. However, obtaining this type of data is difficult because of specific features of mobile operating systems, the high power consumption of mobile devices, and users' privacy concerns. Therefore, data of this kind are rarely publicly available for scientific purposes, while private companies that own the data are often reluctant to share it. Here we present a large anonymous longitudinal dataset of Activity Point Locations (APL) generated from mobile devices' GPS tracking. The GPS data were collected using the Google Location History (GLH), accessible in the Google Maps application. Our dataset, named AnLoCOV hereafter, includes anonymised data from 338 persons with corresponding socio-demographics over approximately ten years (2012–2022), thus covering pre- and post-COVID periods, and comprises over 2 million weekly-classified APLs extracted from approximately 16 million GPS tracking points in Ecuador. Furthermore, we made our models publicly available to enable advanced analysis of human mobility and activity spaces based on the collected datasets.


© 2022 The Author(s).
• The spatial analysis of Activity Point Locations and Human Activity Spaces clarifies the relationships between the built environment/the transport system and travel behaviour in cities.
• AnLoCOV is publicly available at Mendeley Data [1]. We provide algorithms entirely based on open-source frameworks and make them publicly available on GitHub [2]. The methodological workflow can be re-used with JSON data from other applications or geographical locations.

Data Description
The ubiquity of mobile devices and the affordability of smartphone technology make it increasingly possible to collect human movement data using GPS in a secure, efficient, and inexpensive way [3]. This type of data is used in a growing number of studies on mobility patterns [4, 5], route choice modelling [6, 7], transport mode recognition [8, 9], origin-destination trip purpose [10, 11], identification of activity stop locations [12, 13, 14], sports activity identification [15], and human activity behaviour analysis [16, 17]. This data collection framework yields considerably more data than traditional methods [18], which enables in-depth and long-term research on Human Activity Spaces (HAS) [19].
Places frequently visited by people represent the Activity Point Locations (APL). People spend time in these places doing daily activities (e.g., home, work, supermarket, bus stop, gas station, traffic jam, park, church, cinema, and others). These points are also known as Points of Interest (POI) and are the basis for measuring the size of HAS [20]. The identification of APLs from mobile devices' GPS tracking has improved thanks to innovative spatial analysis software packages. Such analyses can include day-to-day activity-travel variability for estimating activity-based models of travel demand and the complexity of persons' daily activity-travel patterns (number of stops, activity-travel sequences). However, the main obstacle in the scientific community is sharing these data publicly without compromising people's privacy [21], so the data must be anonymised before they are shared for further research. The anonymisation technique must enable data access while maintaining people's privacy and keeping the data structure so that it can be analysed efficiently within the original research purpose [22, 23], despite the undeniable loss of semantic information [24]. Empirical APL data collected on a longitudinal basis are rarely publicly available, mainly because of the costs and difficulty of acquiring data over a long period of time [25].
AnLoCOV is an open, anonymised, and longitudinal dataset with spatial APLs computed on a weekly basis. This dataset stores information collected using GLH, which, once activated, accumulates GPS data from the mobile device and can be used as a mobility data acquisition tool [26] .
In addition, AnLoCOV considers different restriction levels imposed by the government of Ecuador due to COVID-19 from March 2020, when practically all cities in the world were in lockdowns to reduce mobility and prevent the spread of the disease. The Ecuadorian emergency operations committee (EOC) periodically analysed the country's health conditions. The lowest level (0) implies no restrictions, i.e., before COVID-19. The highest level (3) implies total lockdown; only priority public services such as health, food and transport are provided. The intermediate levels (1 and 2) imply vehicular prohibition during certain days of the week, curfew during the nights and capacity control in enclosed or crowded places. These restriction levels are depicted in Fig. 1 and encoded in the datasets (see Data Description section).
AnLoCOV encompasses four anonymised datasets distributed in CSV format:
• DemographicData: Contains anonymous demographic data from 338 persons. The data is ordered by Google Location History id.
• GPSTrackingData: Contains more than 16 million GPS point coordinates corresponding to the clean GPS tracking of each person. The anonymisation of this dataset is based on the gravity point of the whole GPS data. The data is ordered by Google Location History id and datetime.
• APLData: Contains more than 2 million Activity Point Location coordinates, including cluster information. The anonymisation of this dataset is based on the most visited Activity Point Location (cluster labelled as 0). The data is ordered by Google Location History id, week and trip.
• SummaryData: Contains summary measures of GPSTrackingData and APLData, like the number of GPS points, APLs, and clusters. It is ordered by Google Location History id, week, and trip.
All datasets are publicly available and licensed with the Creative Commons BY 4.0 license, which allows their use for any purpose (including commercial use) if appropriate credit for the dataset is declared.
A detailed description of data records is given in Tables 1, 2, 3 and 4.

Experimental Design, Materials, and Methods
A schematic overview of the process adopted for data preparation is given in Fig. 2 . The data stem from adult participants who provided informed consent and agreed to share their data anonymously. The data preparation work presented below ensures compliance with GDPR and University's ethical committee regulations. The data preparation framework includes four successive stages: data collection, data transformation, AnLoCOV processing and demographics survey.

First stage: data collection
Data is an essential component in the research process. We use the GLH component of the Google Maps platform, an innovative and widely used application, to get location data worldwide. Through GLH, Google Maps collects the device's locations via GPS, Wi-Fi, or mobile network connections if the GPS is active. The location data coordinates are defined in accordance with the WGS84 coordinate reference system.
Fig. 2. Schematic overview of the data preparation process. The data is first collected in JSON files. Then, it is transformed into a proper format before being processed and anonymised.
The device's hardware, operating system version, or use in indoor locations (e.g. tunnels, buildings) can reduce the accuracy of the location data. However, Macarulla Rodriguez, in his paper [27], demonstrates that this loss of accuracy is not significant, so we can assume that the device was close to a specific location. Also, a continuous Internet connection, GPS enabled, and frequent use of the Google Maps application to search for routes or move from one place to another improve the location data accuracy. The drawback is that continued use of the device's GPS may result in battery performance problems.
Once the GLH is activated in the Google account, Google tracks, when possible, the GPS-based mobility data from the mobile device. Each person can manage their location history status (pause, disable, edit, or delete) in Google Timeline. By default, GLH is disabled in all Google accounts.
The last step in this stage is to request the GLH JSON data from Google via the Google Takeout application. This activity was carried out by each participant of the study. Fig. 3 shows an extract of a JSON data file provided by Google. This file is the input for the next stage.

Second stage: data transformation
A JSON file transformation is required to manipulate data computationally. In this paper, we assume the most straightforward possible representation of location data: each observation consists of a timestamp and a location point. We use three fields of the GLH JSON file to represent data: (1) timestamp (recorded UTC date and time), (2) latitudeE7, and (3) longitudeE7 (recorded WGS84 GPS location coordinates). An algorithm in Python [28] transforms the original GLH JSON file into a comma-separated values (CSV) file with three columns (datetime, latitude and longitude). This file is then the input for the AnLoCOV processing stage.

Third stage: AnLoCOV processing
Fig. 4 represents the generic workflow for AnLoCOV processing in Python. Specific preprocessing steps were applied to (1) clean the GLH data, (2) anonymise GLH data based on a gravity point, (3) identify the trajectories and trips, (4) compute APLs and clusters, and finally (5) anonymise APL data based on a cluster point for sharing.
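The second-stage transformation (GLH JSON to CSV) can be sketched as follows. This is a minimal illustration, not the released algorithm [28]. It assumes that the export stores its records under a top-level "locations" key and that "timestamp" is an ISO 8601 UTC string ending in "Z"; the function names are hypothetical. The E7 fields are taken to hold degrees multiplied by 10^7.

```python
import csv
import json
from datetime import datetime, timezone

FIELDS = ("timestamp", "latitudeE7", "longitudeE7")


def glh_records_to_rows(records):
    """Turn GLH records into (datetime, latitude, longitude) rows."""
    rows = []
    for rec in records:
        # Skip records that lack any of the three fields used here.
        if not all(key in rec for key in FIELDS):
            continue
        # Assumption: ISO 8601 UTC timestamp such as "2020-03-16T14:05:09Z".
        ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
        rows.append((
            ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            rec["latitudeE7"] / 1e7,   # degrees * 10^7 -> decimal degrees
            rec["longitudeE7"] / 1e7,
        ))
    return rows


def glh_json_to_csv(json_path, csv_path):
    """Read a GLH JSON export and write the three-column CSV."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)["locations"]  # assumed top-level key
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["datetime", "latitude", "longitude"])
        writer.writerows(glh_records_to_rows(records))
```

Keeping the record conversion separate from file handling makes it easy to adapt the sketch to JSON exports from other applications that use different field names.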

(1) Data cleaning
The presence of noise in the data is a consequence of the accuracy loss mentioned earlier in the data collection stage. Therefore, we apply a filtering process that deletes GPS points considered noise or outliers in the trajectory. For example, when the speed implied by two consecutive GPS points is much higher than the globally defined speed limits in urban or non-urban areas, the second GPS point is considered an outlier and is removed from the dataset.
A compression step further reduces the number of GPS points while preserving the trajectory properties. It works as follows. When the Euclidean radius distance is very small between consecutive points, it implies the points are in a very close neighbourhood of the same location. Subsequently, all these points are merged into a single point whose location is given by the median of all point coordinates, while the associated timestamp corresponds to the first point. This process results in significant compression of the number of GPS locations.
The set of parameters for this step is defined in Table 5 .
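As a rough illustration of the two cleaning operations (speed filtering, then compression), consider the sketch below. It is not the released code. It uses the haversine great-circle distance as a stand-in for the distance measure, measures each point against the first point of the current group when compressing, and takes a hypothetical 150 km/h speed threshold in the usage example; only the 0.05 km compression radius comes from the parameter table.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (degrees)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def filter_by_speed(points, max_speed_kmh):
    """Drop a point when the speed from the last kept point is too high.

    points: list of (datetime, lat, lon) tuples sorted by time.
    """
    kept = points[:1]
    for t, lat, lon in points[1:]:
        t0, lat0, lon0 = kept[-1]
        hours = (t - t0).total_seconds() / 3600
        # Points with a non-increasing timestamp are dropped as well.
        if hours <= 0 or haversine_km(lat0, lon0, lat, lon) / hours > max_speed_kmh:
            continue
        kept.append((t, lat, lon))
    return kept


def merge(group):
    """Median coordinates, timestamp of the first point of the group."""
    return (group[0][0],
            median(p[1] for p in group),
            median(p[2] for p in group))


def compress(points, spatial_radius_km=0.05):
    """Merge runs of consecutive points lying within the radius."""
    compressed, group = [], []
    for p in points:
        if group and haversine_km(group[0][1], group[0][2], p[1], p[2]) >= spatial_radius_km:
            compressed.append(merge(group))
            group = []
        group.append(p)
    if group:
        compressed.append(merge(group))
    return compressed


t0 = datetime(2020, 3, 16, 8, 0, 0)
track = [
    (t0, -2.9000, -79.0000),
    (t0 + timedelta(seconds=30), -2.9001, -79.0001),  # a few metres away
    (t0 + timedelta(seconds=60), -2.0000, -79.0000),  # about 100 km in 30 s
    (t0 + timedelta(minutes=10), -2.9100, -79.0000),  # about 1 km away
]
clean = compress(filter_by_speed(track, max_speed_kmh=150))  # hypothetical limit
```

In this example the third point is discarded as an outlier and the first two points are merged, leaving two cleaned points.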

(2) Gravity GPS anonymisation
After data filtering and compression, we obtain the clean GPS points. At the end of this second step, we release the first dataset (GPSTrackingData). Nevertheless, to guarantee privacy, we need to anonymise the GPS locations. This is done as follows: Data is anonymised based on the gravity point calculated for each person's latitude and longitude data. The gravity point is set at the location 0-latitude and 0-longitude, and all GPS points are translated accordingly. This translation preserves the original shape position and distances between GPS points.

Fig. 3. Extract of a GLH JSON file used to build our dataset. Some data has been replaced by "xxxxx" on purpose.

Table 5 (excerpt). Compress consecutive GPS points if the Euclidean radius distance between points is less than 0.05 km (50 m).

Fig. 4. Workflow for generating processed and anonymised activity point locations. The names of the used Python packages are given between parentheses. In blue italic, the generated datasets.

Table 6. Trajectory-trip identification parameters.
| Trajectory-Trip identification | Parameter (value) | Description |
|---|---|---|
| Trajectory | min_length (200) | The minimum length for trajectories is 200 m |
| Trip | gap (30) | The minimum gap time in a trajectory to split into trips is 30 min |
| Trip | min_length (100) | The minimum length for trips is 100 m |
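The gravity-point anonymisation of step (2) amounts to a per-person translation of coordinates. The sketch below assumes that the gravity point is the arithmetic mean of a person's latitudes and longitudes, which the text does not state explicitly; the function name is hypothetical.

```python
def anonymise_by_gravity_point(points):
    """Translate one person's GPS points so their gravity point is (0, 0).

    points: list of (datetime, lat, lon) tuples for ONE person.
    Assumption: the gravity point is the mean latitude and mean longitude.
    """
    n = len(points)
    g_lat = sum(p[1] for p in points) / n
    g_lon = sum(p[2] for p in points) / n
    # Subtracting the same offset from every point leaves the coordinate
    # differences between points unchanged.
    return [(t, lat - g_lat, lon - g_lon) for t, lat, lon in points]
```

Because the offset is computed separately for each person, the translated tracks of different persons no longer share a common geographic frame.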
While this first dataset is the core of AnLoCOV, we provide further datasets to ease the analysis at higher levels, that is, at the level of trips and Activity Point Locations (APLs). It is important to note that we release all the code for the different steps, allowing any researcher to adapt the parameters to their needs for their own data, while providing an example of how to deal with higher-level GPS locations. However, trips and APLs must be calculated on the original, non-anonymised GPS locations because, at higher levels, we are interested in cross-person analyses. Therefore, we need to anonymise the APLs as well before releasing the subsequent datasets.

(3) Trajectory-trip identification
All consecutive GPS points with a minimum length are converted into weekly trajectories for trajectory-trip identification. The process excludes short trajectories. These weekly trajectories are split into trips with a minimum gap threshold and a minimum length.
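A possible reading of this step, using the values of Table 6 (200 m, 30 min, 100 m), is sketched below. It assumes that weeks are ISO calendar weeks and that "length" means the summed haversine distance along the path; neither is confirmed by the text, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points (degrees)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371008.8 * asin(sqrt(a))


def path_length_m(points):
    """Sum of the distances between consecutive (datetime, lat, lon) points."""
    return sum(haversine_m(a[1], a[2], b[1], b[2])
               for a, b in zip(points, points[1:]))


def weekly_trajectories(points, min_length_m=200):
    """Group points by (ISO year, ISO week) and drop short trajectories."""
    ordered = sorted(points, key=lambda p: p[0])
    result = {}
    for week, grp in groupby(ordered, key=lambda p: tuple(p[0].isocalendar())[:2]):
        trajectory = list(grp)
        if path_length_m(trajectory) >= min_length_m:
            result[week] = trajectory
    return result


def split_into_trips(trajectory, gap_min=30, min_length_m=100):
    """Start a new trip at every time gap >= gap_min; drop short trips."""
    trips, current = [], trajectory[:1]
    for prev, cur in zip(trajectory, trajectory[1:]):
        if cur[0] - prev[0] >= timedelta(minutes=gap_min):
            trips.append(current)
            current = []
        current.append(cur)
    trips.append(current)
    return [t for t in trips if path_length_m(t) >= min_length_m]


t0 = datetime(2020, 3, 16, 8, 0)
week_points = [
    (t0, -2.900, -79.0),
    (t0 + timedelta(minutes=5), -2.899, -79.0),
    (t0 + timedelta(minutes=10), -2.898, -79.0),
    (t0 + timedelta(minutes=70), -2.898, -79.0),   # 60 min gap: new trip
    (t0 + timedelta(minutes=75), -2.896, -79.0),
    (t0 + timedelta(minutes=135), -2.896, -79.0),  # isolated point
]
trajectories = weekly_trajectories(week_points)
trips = [trip for traj in trajectories.values() for trip in split_into_trips(traj)]
```

In the example, the six points fall in a single week and yield two trips; the isolated last point forms a zero-length trip and is discarded.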
The set of parameters for this step is defined in Table 6.

Table 7 (excerpt).
| | Parameter (value) | Description |
|---|---|---|
| | | The minimum time spent in a place to consider it as a stop is 5 min |
| | stop_radius_factor (1), spatial_radius_km (1) | The minimum Euclidean radius distance for a stop is 1 km (1000 m): stop_radius_factor * spatial_radius_km |
| Cluster | cluster_radius_km (0.05) | The minimum radius proximity of the points in each cluster is 0.05 km (50 m) |
| Cluster | min_samples (1) | The minimum number of GPS points for a cluster is 1 |

(4) APL identification and clustering
This step detects APLs for each trip. When the person stays a minimum number of minutes within a Euclidean radius distance from a given GPS point location during the trip, it forms an APL. The APL's time is the time of the initial GPS point, and the coordinates are the median latitude and longitude values of all the GPS points found within the specified distance.
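One way to implement this rule is the sequential scan sketched below, using the 5 min stay time and the 1 km radius (stop_radius_factor * spatial_radius_km) from the parameter table. The scan strategy, the haversine distance, and the names `detect_apls` and `min_stay_min` are assumptions made for illustration, not the released implementation.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (degrees)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * asin(sqrt(a))


def detect_apls(trip, min_stay_min=5, stop_radius_km=1.0):
    """Detect APLs in one trip given as (datetime, lat, lon) tuples.

    An APL is emitted when consecutive points stay within stop_radius_km
    of an initial point for at least min_stay_min minutes.
    """
    apls, i, n = [], 0, len(trip)
    while i < n:
        j = i + 1
        # Extend the window while points remain within the radius of trip[i].
        while j < n and haversine_km(trip[i][1], trip[i][2],
                                     trip[j][1], trip[j][2]) <= stop_radius_km:
            j += 1
        window = trip[i:j]
        if window[-1][0] - window[0][0] >= timedelta(minutes=min_stay_min):
            # APL time = time of the initial point; coordinates = medians.
            apls.append((window[0][0],
                         median(p[1] for p in window),
                         median(p[2] for p in window)))
            i = j
        else:
            i += 1
    return apls


t0 = datetime(2020, 3, 16, 8, 0)
trip = [
    (t0, -2.9000, -79.0),
    (t0 + timedelta(minutes=3), -2.9002, -79.0),
    (t0 + timedelta(minutes=6), -2.9001, -79.0),   # still at the same place
    (t0 + timedelta(minutes=10), -2.9500, -79.0),  # moving away
    (t0 + timedelta(minutes=12), -3.0500, -79.0),
]
apls = detect_apls(trip)
```

Here the first three points span six minutes within the radius and form a single APL; the two remaining points are passed through without producing a stop.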
The clustering step ranks all APLs. Each APL is labelled with a cluster number depending on its spatial proximity and the number of visits to similar locations at different times. Density-based spatial clustering analysis is conducted using the DBSCAN algorithm. Clusters are labelled sequentially, starting from 0, where cluster 0 corresponds to the most visited APL over time.
The set of parameters for this step is defined in Table 7 .
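With min_samples set to 1, every point qualifies as a DBSCAN core point, so the resulting clusters coincide with the connected components of the graph that links APLs lying within the cluster radius of each other. The dependency-free sketch below exploits this equivalence instead of calling a DBSCAN library; the haversine distance and the function name are assumptions for illustration.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (degrees)."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * asin(sqrt(a))


def cluster_apls(apls, cluster_radius_km=0.05):
    """Return one cluster label per APL; label 0 is the most visited cluster.

    apls: list of (lat, lon) tuples for ONE person.
    """
    n = len(apls)
    component = [-1] * n
    next_id = 0
    for start in range(n):
        if component[start] != -1:
            continue
        # Depth-first search over the "within radius" neighbourhood graph.
        component[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if component[j] == -1 and haversine_km(*apls[i], *apls[j]) <= cluster_radius_km:
                    component[j] = next_id
                    stack.append(j)
        next_id += 1
    # Relabel so that clusters are numbered by decreasing number of APLs.
    order = [cid for cid, _ in Counter(component).most_common()]
    rank = {cid: r for r, cid in enumerate(order)}
    return [rank[c] for c in component]


visits = [
    (-2.9000, -79.0000),  # place A
    (-2.9500, -79.0500),  # place B
    (-2.9501, -79.0500),  # place B
    (-2.9502, -79.0501),  # place B
    (-2.9001, -79.0000),  # place A
]
labels = cluster_apls(visits)
```

The pairwise search is quadratic in the number of APLs, which is acceptable for a sketch but would call for a spatial index on a dataset of this size.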

(5) Cluster APL anonymisation
The latitude and longitude coordinates of each APL are anonymised. In this case, the reference point for anonymising the coordinates of each person is set as the most visited APL over time (cluster labelled as 0). In other words, we compute the mean latitude and longitude of the APLs clustered as 0 and translate all APLs accordingly. The most visited APL ends at the location 0-latitude and 0-longitude. This translation preserves the original shape position and distances between each APL.
Finally, we summarise trip measures such as the total number of GPS points, APLs, and clusters per trip. We do not include geodesic distances because they would reveal information about the exact geographical map and contradict our aim of fully anonymising the data. After this step, we obtain the second and third datasets, i.e. APLData and SummaryData, respectively.
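The translation described in this step can be written in a few lines. The sketch below follows the stated rule (mean of the cluster-0 APLs becomes the origin); the tuple layout and the function name are hypothetical.

```python
def anonymise_by_cluster_zero(apls):
    """Translate one person's APLs so the mean of cluster 0 sits at (0, 0).

    apls: list of (lat, lon, cluster_label) tuples for ONE person; at least
    one APL must carry the label 0.
    """
    reference = [(lat, lon) for lat, lon, label in apls if label == 0]
    ref_lat = sum(lat for lat, _ in reference) / len(reference)
    ref_lon = sum(lon for _, lon in reference) / len(reference)
    # The same offset is applied to every APL, whatever its cluster.
    return [(lat - ref_lat, lon - ref_lon, label) for lat, lon, label in apls]
```

Unlike the gravity-point translation, the origin here depends only on the most visited cluster, so every person's main location ends up at the same anonymised coordinates.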

Fourth stage: demographics survey
After validating and processing GLH data, each volunteer participant was asked to give sociodemographic information by answering an online survey via Google Forms [29] and providing informed consent to use and share their anonymised GLH data. Only consenting participants are included in AnLoCOV. A copy of the survey is attached as supplementary material. After this stage, we obtain the last dataset (DemographicData).

Ethics Statements
All data has been anonymised to respect the privacy of participants. Informed consent was obtained from each one by completing the demographic data survey. In addition, because we used data from Google, the Board for Ethics and Scientific Integrity of the University of Liège confirmed that the project meets the standard ethical requirements and complies with the GDPR. The protocol number is JUR26262.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.