Intelligent GPS trace management for human mobility pattern detection

Abstract: Large volumes of volunteered GPS traces in the last decade have provided location-based services with an opportunity to become more intelligent and personalized. Individual and group mobility patterns, detected from GPS traces, can be used for this purpose. In this paper, we show the potential of GPS traces, if managed properly in the database, for detecting points of interest for individual users and even recognizing individual users from their walking patterns. However, when it comes to GPS traces, databases can be very complicated and cumbersome to populate. Databases provided by OSM and GeoLife do not effectively pave the path for data mining and machine learning techniques which require a much more detailed and organized database. A GPS trace database must provide statistics and detailed information about GPS traces not only for visualization purposes at the front-end, but also for cross checking purposes to eliminate erroneous records and to be applied in mobility pattern detection applications. This study provides the design of an interactive database management system for GPS traces whose applications in detecting points of interest and user identification are tested with GPS traces from the GeoLife project. The results show that while the accuracy of detected points of interest depends mostly on the size of data, the accuracy of user identification relies more upon the appropriate choice of input features to machine learning techniques.


Introduction
Understanding the dynamics of large-scale human mobility patterns is beneficial to urban planning, traffic and transportation management, public transport design, emergency response management, public health, disease outbreak detection, and economic forecasting. The spread of mobile devices equipped with GPS receivers (Hashemi & Malek, 2012) has contributed to the accumulation of large-scale GPS traces (Hashemi, 2017a; Hashemi & Karimi, 2014, 2016a, 2016b, 2017), which have motivated researchers from various fields to study human mobility (Liao, Patterson, Fox, & Kautz, 2007; Liu, Andris, & Ratti, 2010; Patterson, Liao, Fox, & Kautz, 2003; Zheng, Cao, et al., 2010). Song, Qu, Blumm, and Barabasi (2010) showed that human mobility is highly predictable. Azevedo, Bezerra, Campos, and Moraes (2009) found that the movement velocity and acceleration of pedestrians follow a normal distribution. Lee, Hong, Kim, Rhee, and Chong (2009) effectively modeled human mobility using gaps among fractal waypoints. González, Hidalgo, and Barabási (2008) concluded that people tend to visit few locations frequently and highlighted the contrast between the simple repeated patterns in human mobility trajectories on one side and models such as Levy flight and random walk on the other side. Phithakkitnukoon, Lorenzo, Shibasaki, and Ratti (2010) found a strong correlation in daily activity patterns of people who share the same work area's profile. Peng, Jin, Wong, Shi, and Liò (2012) developed a linear model to approximate the traffic flow between pairs of locations based on the experimentally inferred fact that people travel on workdays for three purposes: commuting between home and workplace, traveling from workplace to workplace, and others such as social activities. Li et al. (2012) used taxi traces to uncover patterns of pick-up quantity in urban hotspots and developed an ARIMA model to forecast how many passengers will be in a certain hotspot in the next time interval.
Detecting points of interest (POIs) and identifying people through their walking patterns from large-scale GPS traces are the two avenues that we explore in our work.
However, GPS traces are stored in plain text formats with no attached metadata such as transportation mode, length, or speed. This not only makes managing large volumes of GPS traces inefficient but also restricts the scale and scope of algorithms for human mobility pattern detection. OpenStreetMap (OSM), founded in the UK in 2004 and with more than 1 million registered users (Wood, 2013), is the most prominent volunteered geographic information project devoted to providing a free map of the world with an emphasis on road networks. Road networks are built upon GPS traces uploaded by registered users and can be edited or updated manually at any time. A description can be associated with a GPS trace while it is being uploaded, but there are no additional required metadata or restrictions (https://www.openstreetmap.org/traces; OpenStreetMap, n.d.). This means the transportation mode of the GPS trace (e.g. walking, motoring, or boating) cannot be known from the database, which in turn limits the database's applications. Besides, OSM does not store additional metadata, such as the average speed or total length of the GPS trace, which can be calculated automatically. Such metadata not only facilitates analyzing, mining, and visualizing large volumes of GPS traces, but also paves the path for automatic applications of GPS traces. Examples of such applications are automatic road and pedestrian network construction (Hashemi, 2017b), recognizing POIs (Bhattacharya, Kulik, & Bailey, 2015), developing intelligent location-based services (Liu & Karimi, 2006), detecting individual (Song et al., 2010) or collective (Becker et al., 2013; Harder, Nes, Jensen, Reinau, & Weber, 2012) mobility patterns, and real-time event detection, which is of great value to municipalities, police, and fire departments. This paper shows how large volumes of GPS traces can be used to detect people's mobility patterns, such as their POIs, and to associate walking patterns with people's identities.
POIs can be used to make location-based or location-aware services more intelligent and personalized. For example, Patel, Chen, Smith, and Landay (2006) personalized routes and shortened navigation directions for drivers by using their POIs. On the other hand, associating patterns in walking GPS traces with people's identities can be used, for instance, in location-based social networks (e.g. friend recommendation (Yu, Pan, Tang, Li, & Han, 2011)) or in smart cities (Chang, Liu, Chou, Chen, & Shin, 2007; Ferrari, Rosi, Mamei, & Zambonelli, 2011; Pan et al., 2013), with users' permission. However, since detecting mobility patterns is not feasible without sophisticated databases specifically designed for this purpose, we propose a structure for storing and managing crowd-sourced GPS traces.

GPS trace database management system
The entity relationship (ER) model and its corresponding relational logical model in Boyce-Codd normal form (along with functional dependencies) in the proposed GPS trace management system are shown in Figures 1 and 2. In Figure 1, rectangles show entities, rectangles with thick borders show weak entities, ovals show attributes, underlined attributes are primary keys, dashed underlined attributes are partial keys, arrows show key constraints (each entity appears in at most one instance of the relationship), and thick lines show total participation (all entities appear in at least one instance of the relationship). The attributes Owner_Username (the username of the person who uploaded the GPS trace) and Manager_Username (the username of the manager who is responsible for the user who uploaded the GPS trace) in the GPS_TRACE table are foreign keys from the tables USER and USER_MANAGER, respectively.
According to the ER model, users can register in the system and upload GPS traces. Users are identified by their account username. A user can upload any number of GPS traces, including none. The user specifies the transportation mode for a GPS trace at the time of upload. Other information stored for a GPS trace includes start time, total traveled distance, total time duration, maximum and minimum longitude, maximum and minimum latitude, average longitude, average latitude, average altitude (some GPS traces may not contain altitude), and average speed. This information is not explicitly expressed in GPS traces and must be calculated. Each GPS trace has a set of GPS points. Each GPS point has a longitude, latitude, altitude, heading, HDOP (horizontal dilution of precision), date, time, and Unix time. Some points may not have altitude, heading, and HDOP. There are two types of managers: (a) user managers, who manage a group of users, and (b) trace managers, who manage a group of GPS traces.
For the physical design, MySQL is used as the DBMS and Tomcat as the web server. The server is implemented with Servlets and the client with JSP. The HTTP POST method is used for communication between the client and server. D3.js is used to create charts in JSP pages. To produce those charts, JSON arrays are constructed out of database query results in the server and sent to the client. The structure of our client-server system is shown in Figure 3. Arrows show the correspondence between clients and servers. Figure 4 shows the JSP pages in a browser. After logging into the system, the user sees the user page, which displays a list of the user's GPS traces with the most recently uploaded GPS traces at the top of the list. Transportation mode, date, total traveled distance, time duration, and average speed are shown for each GPS trace in the list.
Two time series on the user page (Figure 4(a)) show the total traveled distance for the user and the average over all users in different years, so the user can compare his/her traveled distance with that of other users. Users can delete their GPS traces or upload new GPS traces on this page. For each new GPS trace, the user must select the transportation mode from a drop-down menu.
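To make the chart data concrete, the following is a minimal sketch of how the two yearly time series on the user page could be assembled into a JSON array from query results. It is written in Python for brevity, whereas the system described here uses Servlets; the row layout, the function name `yearly_distance_series`, and the JSON field names are hypothetical. The sketch also assumes that the average for a year is taken over the users who have at least one GPS trace in that year, which the text does not specify.

```python
import json
from collections import defaultdict
from typing import Iterable, Tuple

# Hypothetical row layout: (owner_username, year, total_distance_km) per GPS trace.
TraceRow = Tuple[str, int, float]


def yearly_distance_series(rows: Iterable[TraceRow], username: str) -> str:
    """Return a JSON array with the user's yearly distance and the all-user average."""
    per_user_year = defaultdict(float)
    for owner, year, distance_km in rows:
        per_user_year[(owner, year)] += distance_km

    series = []
    for year in sorted({y for _, y in per_user_year}):
        # Yearly totals of every user who uploaded something in this year.
        totals = [d for (_, y), d in per_user_year.items() if y == year]
        series.append({
            "year": year,
            "user_km": per_user_year.get((username, year), 0.0),
            "average_km": sum(totals) / len(totals),
        })
    return json.dumps(series)
```

A charting library on the client can then bind each object in the array to one point of each time series.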
The trace manager page (Figure 4(b)) shows a list of all GPS traces which have been assigned to this manager along with their summary statistics and the owner's username. The time series on this page show the total traveled distance over all users for different transportation modes. The user manager page (Figure 4(c)) shows a list of all users who have been assigned to this manager. A bar chart on this page shows the total traveled distance for each user.
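The per-trace metadata listed above (start time, total traveled distance, duration, bounding box, averages, and average speed) has to be derived from the raw GPS points. The sketch below shows one way to do this. The haversine formula, the mean Earth radius, the `GpsPoint` type, and the dictionary keys are assumptions of this example and are not taken from the described system.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


@dataclass
class GpsPoint:
    lon: float                   # degrees
    lat: float                   # degrees
    unix_time: float             # seconds
    alt: Optional[float] = None  # metres; some points have no altitude


def haversine_m(a: GpsPoint, b: GpsPoint) -> float:
    """Great-circle distance between two points in metres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def trace_summary(points: List[GpsPoint]) -> dict:
    """Derive per-trace metadata from a time-ordered list of GPS points."""
    if not points:
        raise ValueError("a GPS trace needs at least one point")
    distance = sum(haversine_m(p, q) for p, q in zip(points, points[1:]))
    duration = points[-1].unix_time - points[0].unix_time
    lons = [p.lon for p in points]
    lats = [p.lat for p in points]
    alts = [p.alt for p in points if p.alt is not None]
    return {
        "start_time": points[0].unix_time,
        "total_distance_m": distance,
        "duration_s": duration,
        "min_lon": min(lons), "max_lon": max(lons),
        "min_lat": min(lats), "max_lat": max(lats),
        "avg_lon": sum(lons) / len(lons),
        "avg_lat": sum(lats) / len(lats),
        "avg_alt": sum(alts) / len(alts) if alts else None,
        "avg_speed_mps": distance / duration if duration > 0 else None,
    }
```

Computing these values once at upload time, rather than at query time, is what allows the list views and charts to be served from simple queries.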

Data
OSM GPS traces cannot be used to populate the proposed database in this work because they lack transportation modes. The GeoLife project, conducted by Microsoft Research Asia, collected GPS traces from 182 users between April 2007 and August 2012 (Zheng, Chen, et al., 2010; Zheng, Li, Chen, & Xie, 2008; Zheng, Liu, Wang, & Xie, 2008). The data is available for download on their website (Microsoft Research, 2012). The data includes 17,621 GPS traces with a total distance of 1.2 million km and a total duration of over 48,000 h. However, only a small portion of this data, shown in Figure 5, is associated with a transportation mode (Microsoft Research, 2010). Since transportation mode is an integral part of our database and applications, only this small portion of GPS traces, which belongs to 32 users, qualifies for our work. All 32 users and their associated GPS traces are used in this work. In the GeoLife data-set, transportation modes are stored in a separate plain text file for each user and include walk, run, bike, car, driving meet congestion, motorcycle, taxi, bus, subway, train, railway, plane, airplane, and boat. The transportation mode file includes the start time and end time for each transportation mode. Therefore, our first task was to associate the transportation modes with GPS traces, which was complicated by multimodal GPS traces, occasional long gaps between sequential points, inconsistencies in timestamps (the time goes back), and implausible traveled distances in short times. We used 20 min as the threshold for the gap between two sequential GPS points at which a GPS trace is split into two GPS traces. GPS traces are also split whenever the transportation mode changes, to create unimodal GPS traces. GPS traces with inconsistent timestamps or out-of-range longitudes and latitudes were excluded. Tables 1 and 2 show statistics of the refined data.
The statistics listed in these tables, such as average speed, can be used as knowledge for other applications. For example, average walking speed for a person can be used to personalize pedestrian navigation systems or to compare people's mobility behaviors in different regions. Some errors in GPS traces are semantic and thus cannot be easily detected. For example, the average speed for one of the running GPS traces is 26.87 m/s, while the top running speed for people usually ranges from 6.2 m/s to 11.1 m/s (Weyand, Sternlight, Bellizzi, & Wright, 2000). The knowledge gained from these tables can help to detect such semantic inconsistencies.
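The refinement rules described in this section (cutting at mode boundaries, splitting at gaps longer than 20 min, excluding traces with backward timestamps or out-of-range coordinates) and the semantic speed check can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the tuple layouts, the function names, and the order in which the rules are applied are hypothetical, and the only speed limit taken from the text is the 11.1 m/s upper bound for running.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]      # (unix_time, lon, lat); hypothetical layout
ModeInterval = Tuple[float, float, str]  # (start_time, end_time, mode) from the mode file

MAX_GAP_S = 20 * 60        # 20 min threshold stated in the text
RUN_MAX_SPEED_MPS = 11.1   # upper end of the top running speed range quoted in the text


def is_valid(points: List[Point]) -> bool:
    """Reject traces whose time goes back or whose coordinates are out of range."""
    for _, lon, lat in points:
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            return False
    return all(q[0] >= p[0] for p, q in zip(points, points[1:]))


def split_on_gaps(points: List[Point], max_gap_s: float = MAX_GAP_S) -> List[List[Point]]:
    """Start a new trace whenever two sequential points are more than max_gap_s apart."""
    segments: List[List[Point]] = []
    current: List[Point] = []
    for p in points:
        if current and p[0] - current[-1][0] > max_gap_s:
            segments.append(current)
            current = []
        current.append(p)
    if current:
        segments.append(current)
    return segments


def unimodal_traces(points: List[Point],
                    mode_intervals: List[ModeInterval]) -> List[Tuple[str, List[Point]]]:
    """Cut the point stream at mode boundaries, then at long gaps; keep valid pieces."""
    traces = []
    for start, end, mode in mode_intervals:
        inside = [p for p in points if start <= p[0] <= end]
        for segment in split_on_gaps(inside):
            if is_valid(segment):
                traces.append((mode, segment))
    return traces


def is_plausible(mode: str, avg_speed_mps: float,
                 max_speed_by_mode: Dict[str, float]) -> bool:
    """Semantic check: flag an average speed above the known limit for the mode."""
    limit = max_speed_by_mode.get(mode)
    return limit is None or avg_speed_mps <= limit
```

With `{"run": RUN_MAX_SPEED_MPS}` as the limit table, the running GPS trace with an average speed of 26.87 m/s mentioned above would be flagged, while modes without a known limit pass unchecked.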

Detecting POIs
One of the popular applications of GPS traces, when they are available in large amounts, is finding POIs of people (Lane, Lymberopoulos, Zhao, & Campbell, 2010; Lian & Xie, 2011; Shaw, Shea, Sinha, & Hogue, 2013). These POIs can be places where a person lives, works, shops, or spends time. Knowing users' POIs helps to provide them with more intelligent location-based or location-aware services. The following steps are used to detect a specific user's POIs:
(a) Since we define POIs as points that a user usually visits for resting, relaxing, working, shopping, exercising, or similar activities, POIs must either be the origin or destination of a trip. Therefore, call the first and last GPS points in each GPS trace significant points and discard the rest.
(b) For each significant point, count the number of other significant points falling in a circle of 50 m radius around it. Call this value the significance rate. Each significant point is thus associated with a significance rate.
(c) If two significant points are closer than 100 m to each other, discard the one with the lower significance rate. This is because their 50 m radius circles overlap and significant points falling in the overlap area are counted for both of them. Moreover, having two POIs closer than 100 m to each other is not realistic.
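Steps (a) to (c) can be sketched as below. The sketch makes some assumptions that the steps leave open: distances are computed with the haversine formula, and step (c) is resolved greedily by visiting significant points in order of decreasing significance rate and keeping a point only if no already kept point lies within 100 m. The function names and the `(lon, lat)` tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

LonLat = Tuple[float, float]  # (longitude, latitude) in degrees

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(a: LonLat, b: LonLat) -> float:
    """Great-circle distance between two (lon, lat) points in metres."""
    phi1, phi2 = math.radians(a[1]), math.radians(b[1])
    dphi = phi2 - phi1
    dlmb = math.radians(b[0] - a[0])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def detect_pois(traces: List[List[LonLat]],
                count_radius_m: float = 50.0,
                min_separation_m: float = 100.0) -> List[Tuple[LonLat, int]]:
    """Return (point, significance rate) pairs for the detected POIs."""
    # (a) Keep only the first and last point of each GPS trace.
    significant: List[LonLat] = []
    for trace in traces:
        if trace:
            significant.append(trace[0])
            if len(trace) > 1:
                significant.append(trace[-1])

    # (b) Significance rate: number of other significant points within 50 m.
    rates = [
        sum(1 for j, q in enumerate(significant)
            if j != i and haversine_m(p, q) <= count_radius_m)
        for i, p in enumerate(significant)
    ]

    # (c) Greedy suppression: of two points closer than 100 m, keep the higher rate.
    order = sorted(range(len(significant)), key=lambda i: -rates[i])
    kept: List[int] = []
    for i in order:
        if all(haversine_m(significant[i], significant[k]) >= min_separation_m
               for k in kept):
            kept.append(i)
    return [(significant[i], rates[i]) for i in kept]
```

Step (b) compares every pair of significant points, so the cost grows quadratically with their number; for a single user's traces this is usually acceptable, while a spatial index would be a natural refinement for larger inputs.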
Since the user with ID 12 has the largest number of GPS traces, we use his/her GPS traces to detect his/her POIs. Figures 6-10 show his/her POIs and their significance rates for different transportation modes.
For driving, walking, running, and bus GPS traces, the POIs are mostly concentrated in two very small areas, one in the north (referred to as POI_N) and one in the south (referred to as POI_S) in Figures 6-9. These are most probably his/her working and living places. For biking GPS traces in Figure 10, on the other hand, the POIs are more spread over the city. A closer look at the biking POIs shows that the most significant one (at 40°5′, 116°20′) falls in the POI_N area, though none of them falls in the POI_S area. This can be used as an argument that the POI_N area is the living place rather than the working place. Additionally, the majority of walking and running POIs also fall in the POI_N area. Overlaying these POIs on a land-use map can reveal the names of locations and buildings.

Associating patterns in walking GPS traces with people's identities
Assume we have a walking GPS trace whose owner is unknown. In this section, we explore the possibility of finding its owner, assuming he/she is among the system's users. GPS trace recognition, like face recognition, has applications in information systems and services. For example, Facebook could suggest that you tag your friends based on their GPS traces, as it does based on their pictures.
This is a classification problem where classes are users of our system. Each walking GPS trace is a sample or observation whose feature vector includes: average speed, sampling rate, average longitude, and average latitude. To justify the selection of these four features for our classification problem, we investigate their distribution across classes using box plots, their linear independence from each other using correlation coefficients, and eigenvalues of the features' covariance matrix. Figure 11 shows the box plot of sampling rate of walking GPS traces in each of the 32 classes. Sampling rate is one of the four features. Each person is a class in this plot, represented by an individual box. The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. More overlap among boxes means that feature is less diverse and less helpful in distinguishing among classes. The box plots in Figure 11 show that the values of this feature are well diversified across different classes (little overlap among boxes) and they can be effective in recognizing classes. Figures 12-14 represent the same type of plot for the other three features: speed, longitude, and latitude. Little overlap among boxes, observed in these plots, similarly indicates their effectiveness in distinguishing among classes. Table 3 reports the correlation coefficient for pairs of features over all classes. If two features are strongly correlated (indicated by correlation coefficients close to ±1), there is not much sense in considering both of them in the feature vector. However, all correlation coefficients in Table 3 are close to zero. The almost equal eigenvalues of the features' covariance matrix (1.1, 1.0, 1.0, and 0.9) also indicate equal significance of different features in our classification problem.
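The pairwise correlation check behind Table 3 can be reproduced with a few lines of code. The sketch below computes Pearson correlation coefficients over the feature vectors of all samples; the function names and the row-per-sample layout are hypothetical, and the eigenvalue analysis of the covariance matrix is left out because it would need a linear algebra library.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (sx * sy)


def correlation_matrix(samples: List[Sequence[float]]) -> List[List[float]]:
    """samples[i] is one feature vector, e.g. (speed, sampling rate, lon, lat)."""
    columns = list(zip(*samples))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]
```

Off-diagonal entries close to zero, as in Table 3, indicate that no feature is a near-linear function of another.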
Bayesian classifiers outperform other classifiers in terms of minimizing the error probability, but this comes at the cost of requiring large amounts of training data (roughly $10^l$ samples for each class, where $l$ is the number of features, which is 4 in our case) in order to estimate the underlying probability density function (PDF) of features in each class (Theodoridis & Koutroumbas, 2009). Among the different versions of the Bayesian classifier, the Naïve Bayesian (NB) classifier is the right choice in this application because the assumption that features in each class are independent is valid here. Independence of features in our classification problem is both conceptually sensible and quantitatively shown in Table 3. With this assumption, the NB classifier reduces the size of the required training data-set from $10^l$ to approximately $10 \times l$ (Theodoridis & Koutroumbas, 2009). Most of our classes satisfy this training sample size requirement. An unlabeled walking GPS trace ($x$) is assigned to the class ($\omega_i$) with the largest posterior, which is proportional to $p(\omega_i)\,p(x \mid \omega_i)$. To avoid presumptions about the overall shape of the PDFs, the non-parametric Parzen window approach with a Gaussian kernel (Theodoridis & Koutroumbas, 2009) is applied to estimate the likelihoods $p(x \mid \omega_i)$. Two scenarios are considered for the priors $p(\omega_i)$: assumed equal for all classes, and estimated as the relative frequency of classes ($p(\omega_i) = n_i/n$, where $n_i$ is the number of training samples in class $\omega_i$ and $n$ is the total number of training samples).
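A minimal sketch of such a classifier is given below, assuming a one-dimensional Gaussian Parzen window per feature and per class, with the class likelihood taken as the product of the per-feature densities. The class name, the single shared `bandwidth` parameter, and its default value are assumptions of this example; the kernel width used in the experiments is not stated. Scores are accumulated in log space to avoid numerical underflow.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def parzen_density(x: float, samples: Sequence[float], h: float) -> float:
    """Parzen window estimate of a 1-D density at x with a Gaussian kernel of width h."""
    norm = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / len(samples)


class ParzenNaiveBayes:
    """Naive Bayes with Parzen-window likelihoods (illustrative sketch)."""

    def __init__(self, bandwidth: float = 1.0, equal_priors: bool = False):
        self.bandwidth = bandwidth        # hypothetical default; tune per data-set
        self.equal_priors = equal_priors  # True: equal priors, False: relative frequencies

    def fit(self, X: List[Sequence[float]], y: List[Hashable]) -> "ParzenNaiveBayes":
        self.by_class: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(X, y):
            self.by_class[label].append(row)
        n, k = len(y), len(self.by_class)
        self.priors = {
            c: (1.0 / k if self.equal_priors else len(rows) / n)
            for c, rows in self.by_class.items()
        }
        return self

    def log_scores(self, x: Sequence[float]) -> Dict[Hashable, float]:
        """log p(class) + sum_j log p(x_j | class), i.e. the unnormalized log posterior."""
        scores = {}
        for c, rows in self.by_class.items():
            score = math.log(self.priors[c])
            for j, xj in enumerate(x):
                density = parzen_density(xj, [r[j] for r in rows], self.bandwidth)
                score += math.log(max(density, 1e-300))  # guard against log(0)
            scores[c] = score
        return scores

    def predict(self, x: Sequence[float]) -> Hashable:
        scores = self.log_scores(x)
        return max(scores, key=scores.get)
```

Because a single bandwidth is shared by all features here, features on very different scales (for example speed and longitude) would normally be standardized first.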
Least squares (LS) and linear support vector machine (LSVM) among linear classifiers and nonlinear support vector machine (NLSVM) with a Gaussian kernel (with σ = 1) among non-linear classifiers are selected here for experimental purposes. The smoothing parameter (C) is set to 10 and 325 for LSVM and NLSVM, respectively. These are the experimentally optimized values for C, as shown in Figure 15. Because LS, LSVM, and NLSVM classifiers can distinguish between only two classes, a separate classifier is trained for each pair of classes (one-against-one approach). Since there are 31 classes (the user with ID 8 in Table 1 is excluded since he/she has only one GPS trace), there are $\binom{31}{2} = 465$ pairs of classes. Therefore, for each of the three aforementioned classifiers (LS, LSVM, and NLSVM), 465 pairwise classifiers need to be trained, and the class with the largest number of wins in pairwise comparisons is assigned to the unlabeled sample.
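The pairwise voting scheme can be sketched as follows. The binary classifiers are represented by plain callables, and `nearest_center_rule` is a hypothetical stand-in for a trained LS, LSVM, or NLSVM classifier on a single toy feature. How ties between classes with the same number of wins were resolved is not stated; this sketch keeps the class that was encountered first.

```python
from collections import Counter
from itertools import combinations
from math import comb
from typing import Callable, Dict, Hashable, Sequence, Tuple

PairRule = Callable[[float], Hashable]  # maps a sample to one of the two classes


def one_vs_one_predict(x: float,
                       classes: Sequence[Hashable],
                       pairwise: Dict[Tuple[Hashable, Hashable], PairRule]) -> Hashable:
    """Let every pairwise classifier vote and return the class with the most wins."""
    votes: Counter = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


def nearest_center_rule(a: Hashable, b: Hashable, centers: Dict[Hashable, float]) -> PairRule:
    """Hypothetical stand-in for a trained binary classifier between classes a and b."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


# 31 classes require comb(31, 2) = 465 pairwise classifiers, as stated in the text.
n_pairwise = comb(31, 2)

# Toy usage with three users and one feature (e.g. average walking speed in m/s).
centers = {"u1": 1.0, "u2": 1.5, "u3": 0.8}
pairwise = {(a, b): nearest_center_rule(a, b, centers)
            for a, b in combinations(centers, 2)}
winner = one_vs_one_predict(1.45, list(centers), pairwise)
```

In the toy usage, "u2" wins both of its pairwise comparisons and is therefore returned.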

We also define the combined classifier by combining the results of all previous classifiers using the product rule. In the product rule, the combined posterior of a class is the multiplication of the posteriors obtained for that class from different classifiers (Theodoridis & Koutroumbas, 2009). A random classifier, which randomly assigns a GPS trace to one of the 31 classes, is also considered to depict the lowest expected accuracy of any non-random classifier.
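The product rule itself is straightforward to express in code. The sketch below assumes that each classifier supplies a posterior (or a posterior-like score in [0, 1]) for every class; how such values were obtained from the LS and SVM classifiers is not described here, so that part is left to the caller. The function name and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Tuple


def product_rule(posteriors: List[Dict[Hashable, float]]
                 ) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Combine per-classifier posteriors by multiplication and pick the best class."""
    if not posteriors:
        raise ValueError("need the output of at least one classifier")
    classes = list(posteriors[0])
    combined = {c: math.prod(p[c] for p in posteriors) for c in classes}
    total = sum(combined.values())
    if total > 0:
        # Renormalize so the combined values sum to one; the winner is unaffected.
        combined = {c: v / total for c, v in combined.items()}
    return max(combined, key=combined.get), combined
```

A consequence of multiplying is that one classifier assigning a posterior near zero to a class effectively vetoes that class, whatever the other classifiers report.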
The leave-one-out cross validation method (Theodoridis & Koutroumbas, 2009) is used to evaluate the generalization accuracy of different classifiers. Figure 16 shows the overall accuracy of different classifiers, where RC stands for random classifier, LSVM stands for linear SVM, LS stands for least squares, NLSVM stands for non-linear SVM, NB1 is the naïve Bayesian with equal priors, NB2 is the naïve Bayesian with relative frequencies as priors, and CC stands for the combined classifier. Figure 17 shows the sample size vs. recall and precision (obtained from NB2 classifier which revealed the best accuracy in Figure 16) for different classes.
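The evaluation procedure can be sketched as below: each sample is held out once, the classifier is trained on the remaining samples, and the held-out sample is predicted. Per-class precision and recall are then computed from the predictions. The function names are hypothetical, and `nearest_neighbour` is only a stand-in classifier used to make the example runnable; it is not one of the classifiers evaluated in this work.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

Sample = Sequence[float]
Trainer = Callable[[List[Sample], List[Hashable], Sample], Hashable]


def leave_one_out(X: List[Sample], y: List[Hashable],
                  train_and_predict: Trainer) -> List[Hashable]:
    """Predict every sample with a classifier trained on all other samples."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predictions.append(train_and_predict(train_X, train_y, X[i]))
    return predictions


def accuracy(y: List[Hashable], predicted: List[Hashable]) -> float:
    return sum(t == p for t, p in zip(y, predicted)) / len(y)


def precision_recall(y: List[Hashable], predicted: List[Hashable]
                     ) -> Dict[Hashable, Tuple[Optional[float], float]]:
    """Per-class (precision, recall); precision is None if the class is never predicted."""
    true_pos = Counter(t for t, p in zip(y, predicted) if t == p)
    n_predicted = Counter(predicted)
    n_actual = Counter(y)
    return {
        c: (true_pos[c] / n_predicted[c] if n_predicted[c] else None,
            true_pos[c] / n_actual[c])
        for c in n_actual
    }


def nearest_neighbour(train_X: List[Sample], train_y: List[Hashable], x: Sample) -> Hashable:
    """Stand-in classifier: label of the closest training sample (squared distance)."""
    dist = lambda a: sum((u - v) ** 2 for u, v in zip(a, x))
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return train_y[best]
```

With n samples the classifier is retrained n times, which is affordable for a data-set of this size and uses all but one sample for training in every round.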
As shown in Figure 17, having more or fewer training samples from a class does not necessarily translate into higher or lower precision and recall for that class when it comes to Bayesian classifiers. Precision and recall for a class increase if that person's walking characteristics follow a specific and discriminable pattern. The more specific and different from others that pattern is, the higher the recall and precision are for that class. Therefore, choosing which characteristics of walking GPS traces to use as predictors for classification is essential, because it is the combination of these characteristics that is supposed to reveal the specific and distinguishable walking pattern of each person.

Conclusions and future directions
Gathering GeoLife GPS traces in a relational database was the most cumbersome part of this work. However, the created database facilitated the access, analysis, and management of GPS traces. The metadata provided for the GPS traces in our database were used to produce time series and bar charts on the front-end, semantically cross check the accuracy of GPS traces, and detect useful human mobility patterns. We showed how our database can be used to detect users' POIs and walking patterns, although there are many other such applications for this database, e.g. transportation network construction and traffic management. The results endorsed the database's effectiveness in both applications.
Users need to specify the transportation mode while uploading their GPS traces in our system, which causes difficulties when a GPS trace is multimodal. More sophisticated approaches are required to collect GPS traces from users' mobile devices directly and detect the transportation mode automatically or semi-automatically with the user's help. The Naïve Bayesian classifier with longitude, latitude, speed, and sampling rate as predictors achieved 47% overall accuracy in finding the owner of a walking GPS trace among 31 people (the accuracy of a random classifier is 3%). Investigating how much each predictor contributes to the overall accuracy and finding other predictors (e.g. time) which can boost this accuracy are also among our future research directions. Another important issue is protecting the users' privacy, which can be accomplished by anonymization (as applied in this study) or by mixing GPS traces from different users. Privacy can be protected at different levels, with stricter rules for people who are more concerned about their privacy and looser rules for those who do not mind revealing specific aspects of their mobility patterns (Hashemi & Malek, 2012).