Feature-First Add-On for Trajectory Simplification in Lifelog Applications.

Lifelog is a record of one’s personal experiences in daily lives. User’s location is one of the most common information for logging a human’s life. By understanding one’s spatial mobility we can figure out other pieces of context such as businesses and activities. With GPS technology we can collect accurate spatial and temporal details of a movement. However, most GPS receivers generate a huge amount of data making it difficult to process and store such data. In this paper, we develop a generic add-on algorithm, feature-first trajectory simplification, to simplify trajectory data in lifelog applications. It is based on a simple sliding window mechanism counting occurrence of certain conditions. By automatically identifying feature points such as signal lost and found, stall, and turn, the proposed scheme provides rich context more than spatio-temporal information of a trajectory. In experiments with a case study of commuting in personal vehicles, we evaluate the effectiveness of the scheme. We find the proposed scheme significantly enhances existing simplification algorithms preserving much richer context of a trajectory.


Introduction
Lifelog is a record of a person's daily life in varying amounts of detail [1][2][3]. It represents the totality of life experience and one can potentially improve work performance or find unconscious behavior through self-tracking. Since people divide living space for a specific purpose one can easily figure out a variety of context such as businesses and activities from location data. For example, a typical house consists of living room, bedroom, bathroom, kitchen, and dining room. People sleep at bedroom, watch TV at living room, and cook at kitchen but hardly eat at bathroom. User's location has a huge implication and is one of the most common information for logging a person's life [3][4][5]. By understanding one's spatial mobility we can infer the behavior and attitude of the person in various situations in everyday life.
The availability and affordability of GPS (Global Positioning System) technology provides a simple yet powerful harness for collecting mobility data [4][5][6]. With GPS, we can capture consistent and accurate spatial and temporal details of a movement. However, most GPS receivers generate a huge amount of data making it difficult to process and store them. To overcome the difficulty, various simplification algorithms for trajectory data have been proposed. The basic idea is to discard redundant or less important data points preserving the context of the original trajectory data. Most of the existing algorithms, however, mainly focus on accuracy and storage size [7][8][9][10]. An error-bounded approach tries to minimize the number of data points in simplified trajectory while it maintains a specified approximation error ε. a size-bounded approach, on the other hand, tries to minimize the approximation error of simplified trajectory with a specified number of data points. Typical trajectory simplification is a process to balance between accuracy and storage size. In the meantime, information loss from the original trajectory is unavoidable under certain criterion.

GPS Trajectory Data
GPS is a network of satellites and provides users with positioning, navigation, and timing services in a wide range of personal and commercial applications. GPS satellites continuously send out radio signals on their orbital information and onboard clock time. a GPS receiver on the ground calculates its own position based on the signals from the satellites. The positioning works on a simple concept of trilateration using the locations of orbiting GPS satellites and the distance from those satellites to the receiver on the Earth [15,16]. In order to determine the coordinates on the three-dimensional space (longitude, latitude, altitude) of the earth at least three satellites are necessary. In addition, an extra satellite is used for synchronization of clocks in use. The onboard atomic clocks of the satellites are highly accurate and are synchronized with each other. The clock at the receiver, however, is not synchronized precisely with the clocks of the satellites. As of 14 January 2020, there are 31 operational satellites in the GPS constellation ensuring that at least four satellites are visible at all time anywhere on the Earth [15].
Most GPS receivers provide data in the form of ready to be used in typical location-based applications, however, only when they can see at least four satellites. This is called a fix. a GPS receiver starts spitting out data when you turn it on even if it does not have a fix. The time-to-first-fix (TTFF) depends on the startup mode of a receiver. a GPS receiver, in general, stores its last valid position and time information with almanac and ephemeris data to predict which satellites are in visible range. If a receiver has no such information, then it is up in cold start mode taking several minutes to get a fix. If a receiver has all the information, then in hot start mode for the quickest fix taking typically dozens of seconds. In warm start mode it takes longer than a hot start but not as long as a cold start.
GPS receivers generally output information in NMEA (National Marine Electronics Association) 0183 format, which is an ASCII interface standard for marine electronic devices [17,18]. There are a few different kinds of NMEA sentence and all sentences begin with the character "$" and end with the sentence termination delimiter "<CR><LF>". Each NMEA sentence consists of an address field, a data field, and a checksum such that "$<address>, <data>*<checksum><CR><LF>". The first field of <address> consists of the talker identifier and the sentence formatter. The talker identifier indicates where the data comes from and it is "GP" for GPS. The sentence formatter specifies the number of data subfields in the sentence, the type of data they contain and the order in which the data subfields are transmitted. The data field <data> in each sentence contains the number of subfields, which is specified by the sentence formatter, separated by commas ",". Most GPS receivers provide "GPGGA" (GPS Fix Data), "GPRMC" (Recommended Minimum Specific GNSS Data), "GPGSA" (GNSS DOP and Active Satellites), "GPGSV" (GNSS Satellites in View) sentences by default. GPGGA and GPRMC are most commonly used and Figure 1 shows the format of them. You can see that there are redundant data between sentences and easily imagine the meaning of the subfields for UTC (Coordinated Universal Time) of position fix, latitude in the northern or southern hemisphere, longitude in the easterly or westerly, speed over ground in knots, course over ground in degree true. The following is a brief description of the remaining subfields: • GPS status: The data set quality (V = invalid, a = valid) • GPS quality indicator: The GPS fix type (0 = no GPS, 1 = GPS SPS, 2 = DGPS, 3 = GPS PPS). It indicates whether the GPS receiver has fixed onto satellites' data and received enough data to determine the location.  In most literatures on GPS trajectory study, only the minimum information of longitude, latitude, and time is considered. GPS receivers, however, have a processor and an antenna that directly receive data from the satellites and compute its position on the fly. The processor on a GPS chipset is responsible for all of the calculations and user interfaces, as well as analog circuits for the antenna. Users can change the configurations of a GPS receiver for sampling frequency, sentence selection, baud rate, etc. That is, GPS receivers provide much more information than the minimum by nature and it is a waste not to use this information in trajectory study.

The Feature-First Trajectory Simplification
We assume that the raw data stream consists of a series GPS data, denoted as P = {p t0 , p t1 , p t2 , . . . , p tn , . . . } where p tn = (id, x, y, t, d 1 , d 2 , . . . , d m , . . . ) is referred to as a data point. Each data point basically contains object ID, longitude x, latitude y, and time stamp t. It also has other information d m , which a GPS receiver provides by nature. In this study, we use the additional information of GPS quality indicator, number of satellites in use, HDOP from a GPGGA sentence and of GPS status, speed and course over ground from a GPRMC sentence, as well as UTC time and date, latitude, longitude for spatial and temporal information. The objective of the feature-first trajectory simplification (FFTS) is to find a subset of P denoted as P , which represents P. Each data point in P needs to have an additional attribute of feature type but may exclude extra information other than the minimum of (id, x, y, t) depending on the applications' requirements. We define six different types of a feature point for the simplification: The location, between any consecutive feature points within a trajectory, where applications demand for recording with optional requirements. Certain parameters X n may be considered depending on the requirements.
The parameters (T fixL , T fixF , D maxS , T movS , S minG , T movG ) are set according to GPS receivers' quality and configurations. For example, D maxS is required to tolerate signal noises and errors. If the positioning signal is 100% accurate without any errors, we may set D maxS = 0. The other parameters (Θ minT , X n ) are set according to applications' demands on simplification. Figure 2 shows a brief description of the FFTS algorithm. When a new data point p t is collected, FFTS tries to identify whether it belongs to one of the feature types above. If then, we add p t to P ⊂ P. Otherwise, the data point will be disregarded. To identify a feature point, it utilizes a sliding window and simply counts occurrences of certain conditions. We denote a window record that consists of the most recent K data point history Swin K . Let p tc be the data point at current time such that Swin K contains (p tc , p tc−1 , p tc−2 , . . . , p tc−(K−1) ). We keep Swin K in two portions of front-end window Swin Kf and back-end window Swin Kb , for our convenience. The bound between the two is parameterized by B such as Swin Kf = (p tc , p tc−1 , . . . , p tc−(B−1) ) and Swin Kb = (p tc−B , p tc−(B+1) , . . . , p tc−(K−1) ). Moreover, we denote p t the last data point added to the simplified trajectory P and use the symbol "←" to assign a type to its corresponding feature point. For the type assignment of a feature point we give priority LOST/FOUND > STALL/GO > TURN > eXTRA in the order. The algorithm is straightforward and easy to see that the running time of FFTS is O(1) for P = {L, F, S, G, T}. With a trajectory we first identify potential problematic data points. We need to be aware that GPS data are subject to various sources of errors including satellite orbit errors, satellite clock errors, receiver errors, tropospheric and ionospheric errors, multipath errors, etc. [15,16,19]. The GPS quality indicator in a GPGGA and the GPS status in a GPRMC are mainly used for inspecting the validity of a location data: Any data with status 'V' or fix type '0' is invalid. In addition, we further refer to the number of satellites in use and the HDOP in a GPGGA. a location data with at least '4' satellites and at most '2' HDOP is considered reliable. Problematic data results from the loss of satellite signals such that the sky is partially or completely blocked. These include when the object is in an underground area, a tunnel, or even a valley between tall buildings. It is quite common for a GPS receiver to provide a temporarily invalid data, which is followed by valid data within a single or a couple of seconds. In order to identify points of LOST we disregard these temporarily invalid data by using the parameter T fixL . If the last n L consecutive ones among the K data points of Swin K are invalid, then we add p tc−nL to P with LOST status as shown at line 13-15 in Figure 2. The value of n L is determined based on GPS receiver's sampling frequency such that n L = T fixL /f op + 1 ≤ K. At the same extent, in the status of LOST, if the last n F = T fixF /f op + 1 ≤ K consecutive data points of Swin K are valid, then we add p tc to P with FOUND status as shown at line 5-9 in Figure 2. a FOUND point will be identified only after the corresponding LOST point is identified previously. At the beginning, when the P is empty, the default status is LOST. The value of T fixF is not necessarily the same as that of T fixL .
The operation and performance of a GPS receiver greatly depends on the acceleration of the receiver. Especially, the accuracy of location data for a stationary object is marginal and the same physical location will have different GPS coordinates from time to time [19,20]. That is, a GPS receiver does not record exactly the same location when users stay at the same place for a while. High precision would be desired but it is typical for low-cost GPS receivers and is good enough for most lifelog applications. With a valid data point we next test a range of low speed by using the speed over ground field in a GPRMC sentence. Considering the inertial nature of movement we make the decision conservatively. If the speed of all the K data points of Swin K are slower than a predefined threshold S slow then it is in the range of SPDslow. It is shown at line 17 and line 38-44 in Figure 2. In this study, a speed bound of S slow = 15 km/h, which is determined empirically, is used.
To identify points of STALL we consider the distance between two location data: (lat 1 , lon 1 ) and (lat 2 , lon 2 ). The Haversine formula gives the great-circle distance between two points [17]: where R is earth's radius. a point of STALL occurs naturally when it is in the range of SPDslow. Calculate the distance d between any two consecutive data points within Swin K and, among the K − 1 calculations, simply count the case of d ≤ D maxS . If it is larger than m S ≈ T movS /f op ≤ K − 1 then we add p tc − mS to P with STALL type as shown at line 18-20 and line 45-51 in Figure 2. The value of m S is somewhat related to T movS but not exactly since in the counting we ignore the order of occurrences. It corresponds to heavy congestion or signal-related complete stops. To identify points of GO we use the speed over ground subfield in a GPRMC sentence. In the status of STALL, if m G = T movG /f op + 1 ≤ K consecutive data points show speed higher than S minG then we add p tc to P with GO type as shown at line 10-12 and line 31-37 in Figure 2. a GO point will be identified only after the corresponding STALL point is identified previously. Ideally its location is identical to the corresponding STALL point.
To identify points of TURN we consider the course over ground in a GPRMC. As GPS receivers are concerned, it is the direction that an object is moving in and has little relationship with the direction the object is pointing to. However, a turn can be detected by calculating the difference in track angles between two data points. We take and compare the median values of the two sub-windows of Swin K : Swin Kf and Swin Kb . By taking median values in track angles we eliminate instantaneous or erroneous values. If the angle difference is larger than the threshold value Θ minT then we add p tc-B to P with TURN type as shown at line 21-23 and line 51-55 in Figure 2. Note that the track angle, which is used for waypoint navigation of an object, is not accurate, especially at low speeds. Therefore, we disregard any data points in the range of SPDslow for the detection of TURN as shown at line 21 in Figure 2.
Note that when a feature point is detected, a data point among (p tc-nL , p tc , p tc−mS , p tc , p tc−B ) within Swin K is added to P depending on the type of the feature point {L, F, S, G, T}. In Figure 2 it is shown at line 4 that the FFTS does not allow to have more than a single feature point within the record of window Swin K .
As the final step, for optional requirements in the simplification we may sample extra data points between two feature points above with eXTRA type as shown at line 24-27 in Figure 2. The FFTS can collaborate with any existing trajectory simplification scheme, which is either a batch or an online algorithm, in this step. When the FFTS works by itself this final step may be skipped. In Section 4 of the case study, two well-known techniques are used as examples: Douglas-Peucker (DP) and uniform sampling (US) algorithms [10,12].

Douglas-Peucker (DP) Algorithm
It is a classic line generalization algorithm and is widely used in many geospatial applications. Initially, it takes the first and the last data points as the end points of a line segment. Next, calculate the perpendicular distance between the line segment and intermediate data points of the original trajectory. Then, add the data point with the greatest distance to the simplified one forming two new segments. Repeat the process with the new segments until the maximum distance for each line segment is less than a predefined threshold D DP . The computational complexity of DP is O(n 2 ), where n is the number of data points within a trajectory. This scheme guarantees that the error of a discarded data point is less than the threshold and shows excellent performance in general. We use the DP as a representative batch algorithm: First, FFTS splits the original trajectory P into multiple small segments. Any two consecutive feature points {L, F, S, G, T} become the end points of each segment. Next, the DP algorithm is applied to each segment independently for the feature points of type X. The optional parameter X n becomes the threshold D DP in meter. When the DP is collaborated in FFTS we denote it FFDP.

Uniform Sampling (US) Algorithm
It is a simple and straightforward scheme. With a stream of data point it down-samples at fixed time intervals. Though this scheme is trivial to implement it often results in significant information loss, especially when there are drastic changes in trajectory between sampled data points. We use the US as a representative online algorithm: Assume that we sample every i th data point using a counter. If FFTS identifies feature points {L, F, S, G, T} from the trajectory stream P it adds the data point to P and resets the counter. Otherwise, it adds the i th data point to P with type X, when the counter is terminated, and resets the counter. The optional parameter X n becomes T US = i/f op in second. When the US is collaborated in FFTS we denote it FFUS.

A Case Study
We collect user's spatial mobility data using a GPS logger, which is built around the MediaTek's MTK3339 chipset [21]. It can track up to 22 satellites on 66 channels in a −165 dBm sensitivity with a 15 × 15 × 2.5 mm built-in ceramic patch antenna. The GPS logger is configured to generate NMEA sentences at f op = 1 Hz such that user's movements are sampled at every second. a single user's commute history by a personal automobile is traced for seven different days. Figure 3 provides a summary of each trajectory data with respect to travel distance and time. Since all the trip pass through the same route there are little differences in travel distance. One-way trip consists of around 31.6 km on average (between 31.2 and 32.1 km). Any variation in the value comes from the way of measurements, in which it accumulates point-by-point from its trajectory data. However, there are noticeable differences in elapsed time depending on traffic conditions. The travel time is 3835 s on average: The trajectory data of day#5 takes the least time (2822 s) and that of day#6 takes the most (4649 s) to travel the same commute path.

Context of Trajectory by FFTS
The context, which one might want to know in lifelog applications, of the trajectory are clearly the route and elapsed time for the trip. In addition, information on places of significance and time delay around the area within the route would be desirable. The FFTS is an effort to simplify a trajectory data without losing this context. In this experimental study, we set window size K = 6 so that we can simplify trajectory data with only the last 5 s history. The parameters of (T fixL , T fixF , D maxS , T movS , S minG , T movG , Θ minT ) are set to (4 s, 2 s, 1 m, 3 s, S slow = 15 km/h, 3 s, 30º). The corresponding thresholds of (n L , n F , m S , m G ) for counting become (5,3,3,4) considering the GPS logger's sampling frequency of f op = 1 Hz. Moreover, we set B = 3 to divide Swin K into two sub-windows of Swin Kf and Swin Kb . These values are determined empirically. Using feature points of FFTS we can redraw Figure 3 for more context of the trajectory. Figure 4 provides the same summary on travel distance and time by feature types. While the travel distance for STALL, which is ideally zero, is marginally constant the travel time for STALL varies much day-by-day. We expect the same for LOST. From the graph, however, we can see that the travel distance and time for LOST of day#6 are significantly larger than others. In order to inspect further the trajectory data of day #6 we visualize its context with normalized travel distance and time. For comparison, the trajectory data of day#1 is drawn together. There are four graphs in Figure 5. The upper two are about the trajectory of day#1. The y-axis represents the type of feature points of F, T, G, S, L. The x-axis represents scale in % with respect to travel time of 3456 s for the top most line graph or scale in % with respect to travel distance of 31.2 km for the second dotted line graph. The lower two graphs are the same context but about day#6, which we want to examine. Note that the x-axis of the bottom most line graph is normalized to travel time of 3456 s of day #1 instead of 4649 s, which becomes 134.5 %, of day #6 for direct comparison between them. Figure 5 also provides the trajectory data on Google map for convenience. The commute path begins at the bottom-right point of yellow star labeled by "S (src)" and ends at the top-left point labeled by "L (dst)". In between several dots, which represent different feature types, are shown so that we can match the graphs with the path on the map.
From the graphs we can see three occurrences of LOST and FOUND period. The first one at the beginning is about start-up periods of the GPS receiver. The start-up time of day#6 takes much longer than that of day#1. The GPS logger is powered on at the point "S (src)". The first FOUND points are shown on the map by a backward-pointing arrow with the label of "F(day#1)" and by a forward-pointing arrow with the label of "F (day #6)". In Figure 6, we compare the start-up periods of the trajectory data with respect to distance and time to fix. When the GPS receiver has an estimation of current time and position it typically takes up to 3 min to acquire satellite signals. That is, the start-up of day#1 is normal and seems to be in warm start mode. On the other hand, the start-up of day #6 seems to be in cold start mode and takes more than expected. It happens from time to time and is affected by the weather condition as well. On the date of day#6 it was rain in heavy fog. The rest two LOST and FOUND periods correspond to passing tunnels. Those locations can be considered as places of significance for the commute path [6,22,23].
The occurrences of STALL or TURN might be slightly different time-by-time even on the same travel path. You may pass a traffic light without stop when it is on green and turn either gently or fiercely on a curve. However, they can be used for identifying places of significance if we have trajectory data large in volume. In addition, with the feature points of FFTS in simplification we can have much richer context of a trajectory. For example, in Figure 5 we distinguish types of road such as highway, expressway, and local lane on the commute path. We can say that, while day #6 takes more time than day #1 by 34.5 %, it spends extra time mostly on expressway and highway. These kinds of context of a trajectory data with FFTS can be used widely in trajectory data mining and lifelog applications [4,5,24].

Performance of FFTS
In order to quantitatively compare the effectiveness of FFTS we use perpendicular Euclidean distance (PED) and synchronized Euclidean distance (SED) metrics [7,10,13]. Suppose that p i and p j are two consecutive data points of a simplified trajectory P : p i ∈ P and p j ∈ P . We can consider a data point p k , which is discarded in the P , of the original trajectory P between p i and p j : p k ∈ P but p k P . PED measures the shortest distance between p k and the line segment p i p j . In contrast, SED finds a virtual data point p' k on the line segment p i p j via interpolation. Then, it measures the distance between p k and p' k . The total error is the sum of those distances for each data point of the original trajectory P.
where Figure 7 shows the performance comparison of the original DP and US algorithms by varying the threshold D DP = 200, 100, 50, 30, 20 m. For the graph we use average PED or average SED such that they can be independent of the travel time for the same commute path. The performance data of each trajectory is provided on Tables A1 and A2 in Appendix A. For a fair comparison we set the parameter T US of US by the parameter D DP of DP such that the compression rates of them are comparable. We cannot make them exactly the same because DP is an error-bounded simplification scheme while US is a size-bounded one. The compression rate of each trajectory is provided on Table A3 in Appendix A. From the graph we can see that DP significantly outperforms US, as we expected, with respect to PED. However, surprisingly it is not true with respect to SED: US is better than DP. In many literatures SED is a prefer metric to PED since it counts on both spatial and temporal aspects of a trajectory. It seems to result from several causes. First, the DP implementation in this study relies on perpendicular distance when it adds a data point to the simplified trajectory P . In SED calculations, however, it assumes that the object moves at constant speed between two consecutive sampled data points. For instance, Figure 8 shows five data points of trajectory P. Let us say that p i and p j are two sampled data points of a simplified trajectory P . The three discarded data points of p k1 , p k2 , p k3 lie unevenly on the original trajectory and their corresponding virtual data points of p' k1 , p' k2 , p' k3 lie evenly on the simplified trajectory. In this scenario, SED = a' + b' + c' is clearly larger than PED = a + b + c. Second, errors in trajectory simplifications tend to be enlarged as the number of missing points between two sampled data points increases. That is, in Figure 8, if there were a single discarded data point of p k2 , instead of three, then PED = b << a + b + c and SED = b' << a' + b' + c'. At the same compression rate, the number of missing points between two consecutive sampled data points varies largely in DP while it is constant in US. That is, in US the SED of each segment can be bounded within a certain range. However, in DP the SED of a segment can be amplified. Note that this case study is executed in real situations, which generate highly redundant trajectory data of which compression rates are easily higher than 95+%. There are many chances to amplify the SED in the simplification.    Figure 10 show the performance comparison of DP vs. FFDP and US vs. FFUS from the same perspective, respectively. a line graph is shown together, for your convenience, on the secondary y-axis. It represents a normalized distance (PED or SED) of the algorithm with FFTS (FFDP or FFUS) over the original algorithm (DP or US) without FFTS. We can see that a simplification with FFTS outperforms significantly the one without FFTS. This is true for both PED and SED: For DP the reduction is 12-45 % in PED and 31-45 % in SED. For US it is 39-52 % in PED and 33-44 % in SED. Note that FFTS is applied first segmenting the original trajectory by feature points and that the original algorithm (DP or US) is applied next to the sub-segments for the feature type X. That is, the huge improvements in performance of using FFTS naturally result from taking more data points as feature points, in advance, for the simplification. Moreover, note that the number of feature points of type {L, F, S, G, T} for a travel path is almost constant regardless the value of the parameter X n (D DP or T US ). Figure 11 shows the compression rate of different simplification schemes in the experiments. We can see that the cost of compression rate for the performance improvements is minimal, less than 0.67% point, by applying the FFTS. The decrease in compression rates for this kind of highly redundant trajectory data might be negligible. In addition, FFTS segmenting the original trajectory by feature points may take a benefit in processing time of the combined simplification algorithm.

Conclusions
Lifelogging is an activity of recording user's daily live in varying amounts of detail. User's location is a natural form for logging a person's life since it is of vital importance having an immense implication. Thanks to the abundant technology of GPS it becomes easier to get location data of moving objects. However, GPS trajectory data generally is huge in size making it difficult to process and store. In this paper, we presented a generic add-on algorithm, feature-first trajectory simplification (FFTS), for simplifying trajectory data in lifelog applications. It is based on a simple sliding window mechanism with additional information such as GPS status, speed, track angle, etc., which a GPS receiver provides by nature, as well as the location and timestamp. Experiments with a real trajectory data showed that the proposed scheme can preserve rich context more than spatio-temporal information of a trajectory. By identifying feature points within a trajectory such as signal lost and found, stall, and turn travel distance and time were analyzed by feature types and were visualized in normalized scale for extra context. Moreover, from the experimental results we showed that the simplification with FFTS outperforms significantly the one without FFTS at marginal cost in compression rate.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A
The performance data of the seven days' trajectory is provided for better understanding of the experimental results in Section 4.2.