Application of Artificial Intelligence in an Unsupervised Algorithm for Trajectory Segmentation Based on Multiple Motion Features

With the development of wireless networks, location-based services (e.g., place-of-interest recommendation) play a crucial role in daily life. However, the acquired data are noisy and massive, which makes them difficult to mine with artificial intelligence algorithms. One of the fundamental problems of trajectory knowledge discovery is trajectory segmentation. Reasonable segmentation can reduce the computing resources required and improve storage effectiveness. In this work, we propose an unsupervised algorithm for trajectory segmentation based on multiple motion features (TS-MF). The proposed algorithm consists of two steps: segmentation and mergence. The segmentation part uses the Pearson coefficient to measure the similarity of adjacent trajectory points and extracts the segmentation points from a global perspective. The merging part optimizes the minimum description length (MDL) value by merging local subtrajectories, which avoids excessive segmentation and improves the accuracy of trajectory segmentation. To demonstrate the effectiveness of the proposed algorithm, experiments are conducted on two real datasets. Comparisons with state-of-the-art algorithms indicate that the proposed method achieves the highest harmonic mean of purity and coverage.


Introduction
With the rapid development of location technology (such as GPS, the Beidou System, and AIS), it is becoming easier to obtain trajectory data of moving objects, including time, location, speed, acceleration, and heading. The analysis of trajectory data can provide valuable information for applications based on location data, such as traffic pattern detection [1], fishing detection [2,3], animal migration behavior detection [4][5][6], human behavior pattern recognition [7], and hurricane trajectory prediction.
The preprocessing step of trajectory data mining includes noise cleaning, segmentation, stop point detection, compression, and map matching [8]. Trajectory segmentation is one of the most basic of these tasks: it partitions the trajectory into disjoint parts such that the motion features within each part are uniform and two adjacent parts represent different motion modes. Segmentation reduces computational complexity and allows us to mine richer knowledge than can be learned from the entire trajectory. Furthermore, accurate segmentation methods can provide higher-quality features for further analysis of the behavior of moving objects.
Existing trajectory segmentation algorithms can solve most of the problems in the preprocessing of trajectory data mining, but the following challenges remain: (1) At present, most supervised trajectory segmentation algorithms, such as SPD [11], Warped K-means [10], and WS-II [9], require labeled data or prior information such as a time threshold, a speed threshold, or the number of trajectory segments.
(2) Semisupervised trajectory segmentation algorithms (e.g., RGRASP-SemTS) use a combination of labeled and unlabeled data to segment. However, the majority of trajectory datasets do not contain labeled data.
(3) Unsupervised trajectory segmentation algorithms do not require labeled data. However, the existing unsupervised segmentation algorithms use greedy algorithms with high time complexity, which makes them unsuitable for large trajectory datasets.
To overcome these challenges, we propose an unsupervised algorithm for trajectory segmentation based on multiple motion features (TS-MF). The algorithm includes two steps: segmentation and mergence. First, to maximize the homogeneity of the subtrajectories, the segmentation part uses the Pearson coefficient to measure the similarity of trajectory points. Second, to avoid local oversegmentation, the mergence part merges subtrajectories by minimizing a cost function. Finally, we verify the proposed algorithm on two trajectory datasets from two different domains.
The main contributions of this article are as follows: (1) The study proposes a segmentation method based on the Pearson coefficient. First, the Pearson coefficient is employed to measure the similarity of two trajectory points according to their speed, acceleration, differential position, angle, and other movement features. Then, the trajectory is segmented from a global perspective.
(2) Considering the local oversegmentation of trajectories, we propose a merging method, which merges subtrajectories by minimizing the value of a cost function.
(3) By combining the segmentation and merging methods, we obtain an unsupervised algorithm for trajectory segmentation based on multiple motion features (TS-MF).
(4) The time complexity of our proposed algorithm is O(n), which makes it suitable for the segmentation of large trajectory datasets.
The rest of this article is organized as follows: Section 2 gives the related works. Section 3 introduces the proposed trajectory segmentation algorithm. In Section 4, we verify the feasibility of our algorithm on two actual datasets. Finally, Section 5 gives our conclusions and future work.

Literature Review
In the past few years, scholars have published many papers related to trajectory segmentation. In this section, we summarize the main trajectory segmentation methods.
Supervised trajectory segmentation algorithms require labeled data or heuristic rules, such as a time threshold, speed threshold, density threshold, or angle threshold, to segment a trajectory. Mohammad et al. proposed a segmentation algorithm named WS-II [9], which requires labeled data. However, the majority of trajectory datasets do not contain such information. Zheng et al. proposed a stay point detection (SPD) algorithm to segment trajectories [11]. SPD supposes that there is a stay point between two adjacent motion modes and uses the distance threshold δd and the time threshold δt to find the stay points, which are then used to segment the trajectory. SPD was verified on the geolife dataset. Mirge and Verma define a distance threshold and an angle threshold to segment trajectories [13]. Although these two algorithms can quickly find the stay points and segment the trajectory, they require heuristic rules. In practical applications, it is difficult to obtain these rules in advance, and the values of the thresholds greatly impact the accuracy of trajectory segmentation. Leiva and Vidal proposed a trajectory segmentation algorithm named Warped K-means [10], which adds time constraints to K-means [29]. It reaches 97% accuracy on real datasets. However, the number of trajectory segments k is generally unknown.
Unsupervised segmentation algorithms are mainly clustering-based, cost-function-based, or interpolation-based. They are described in detail as follows.
Clustering-based segmentation algorithms adapt existing clustering algorithms to make them more suitable for trajectory segmentation, and many such algorithms have been proposed. CB-SMOT [27], proposed by Andrey, is an extension of the DBSCAN algorithm [30]. The algorithm uses speed characteristics to discover the stop points and move points of the trajectory. To better process spatial-temporal trajectory data, it replaces the distance threshold in DBSCAN with a time threshold. Chen et al. improved the DBSCAN algorithm and proposed a segmentation algorithm named T-DBSCAN [18], which utilizes the important spatial-temporal characteristics of the trajectory to segment it. Both algorithms achieve high accuracy on the experimental datasets. However, since CB-SMOT and T-DBSCAN are based on DBSCAN, they share its weakness: they cannot reliably detect stop points in sparse trajectories.
The cost-function-based approach segments the trajectory by minimizing a cost function. A representative is GRASP-UTS [23], proposed by Amilcar et al. in 2015. This algorithm first randomly selects the segmentation points, that is, landmarks. Then, it utilizes an adaptive greedy algorithm to optimize the landmarks and calculates the cost function. Finally, when the cost function reaches its minimum, the trajectory is segmented by the landmarks. GRASP-UTS was tested on two real datasets from different domains and achieved high accuracy. However, because the algorithm uses an adaptive greedy algorithm, its time complexity is very high, which makes it unsuitable for large datasets.
Wireless Communications and Mobile Computing
The interpolation-based trajectory segmentation algorithms use different interpolation methods, such as linear interpolation and kinematic interpolation, to generate error signals for segmentation. They include OWS [15] and SWS [19]. Mohammad et al. proposed the trajectory segmentation algorithm named Octal Window Segmentation (OWS) in 2019, and SWS is an improvement of OWS. The intuition behind the two algorithms is that when a moving object changes from one behavior to another, the change can be captured directly from its geographic location. Mohammad et al. compare the real position of the moving object with the estimated one to generate an error signal. By evaluating this error signal, the algorithms predict whether the behavior of the moving object has changed and use this information to segment the trajectory. These two algorithms outperform the benchmark algorithm in segmentation accuracy. However, part of the data is required to optimize the parameters, and different trajectory datasets require different interpolation methods.
The semisupervised segmentation algorithms mainly include RGRASP-SemTS [28], proposed by Amilcar et al. It uses the minimum description length (MDL) principle to measure the homogeneity inside segments and segments trajectories by combining a limited user labeling phase with a low number of input parameters and no predefined segmenting criteria. However, when the algorithm faces large-scale data, it is difficult to create even a partially labeled trajectory dataset.
This study proposes an efficient and accurate trajectory segmentation and merging algorithm based on multiple motion features (TS-MF) to overcome the limitations of the aforementioned methods. It is mainly composed of a segmentation method and a trajectory merging method. The TS-MF algorithm divides the trajectory from both the global and the local perspective to ensure the accuracy of segmentation.

Methodology
This section details the novel unsupervised algorithm for trajectory segmentation based on multiple motion features (TS-MF). In Section 3.1, we present the relevant definitions. Figure 1 shows the overview of TS-MF, which includes two core processes: segmentation and mergence. The first step of TS-MF is to segment the raw trajectory by the Pearson coefficient, which is detailed in Section 3.2. The second step is to merge the oversegmented subtrajectories, which is described in Section 3.3. Finally, the details of TS-MF are introduced in Section 3.4.

Definitions
3.1.1. Raw Trajectory. A raw trajectory is composed of a series of multidimensional spatial-temporal data points. It is denoted as $Traj = (p_0, p_1, \dots, p_n)$, where $p_i = (lat_i, lon_i, t_i, f_i)$; $lat_i$ and $lon_i$ represent the position coordinates at time $t_i$, and $f_i$ denotes the movement characteristics of the trajectory point at time $t_i$, such as speed, angle, and acceleration.

3.1.2. Subtrajectory. A subtrajectory is a set of consecutive trajectory points in the raw trajectory; for example, a subtrajectory can be denoted as $S = (p_i, p_{i+1}, \dots, p_j)$, $0 \le i < j \le n$.
3.1.3. Trajectory Segmentation. According to the feature similarity of trajectory points, the trajectory segmentation algorithm finds a set of segmentation points in the raw trajectory, such as $Seg = [p_0, p_1, \dots, p_k]$. These segmentation points partition the raw trajectory into several disjoint parts, for example, $Traj_i = (s_0, s_1, s_2, \dots, s_k)$, where k is the number of subtrajectories.
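As an illustration of these definitions, the following minimal Python sketch represents a trajectory point and partitions a raw trajectory at given segmentation points. The names `TrajPoint` and `split_by_segment_points` are hypothetical, and the convention that each interior segmentation point opens a new subtrajectory (with the last point closing the final one) is an assumption of the sketch, not something the definitions prescribe.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrajPoint:
    """One trajectory point p_i = (lat_i, lon_i, t_i, f_i)."""
    lat: float
    lon: float
    t: float
    features: Tuple[float, ...]  # e.g. (speed, angle, acceleration)


def split_by_segment_points(traj: Sequence[TrajPoint],
                            seg_idx: Sequence[int]) -> List[List[TrajPoint]]:
    """Partition traj into disjoint subtrajectories at the given point indices.

    The first and last indices are always treated as segmentation points.
    Assumption: an interior segmentation point starts a new subtrajectory.
    """
    if not traj:
        return []
    last = len(traj) - 1
    cuts = sorted({i for i in seg_idx if 0 <= i <= last} | {0, last})
    if len(cuts) < 2:  # single-point trajectory
        return [list(traj)]
    parts = []
    for k in range(len(cuts) - 1):
        is_final = k == len(cuts) - 2
        end = cuts[k + 1] + 1 if is_final else cuts[k + 1]
        parts.append(list(traj[cuts[k]:end]))
    return parts
```

For example, an 8-point trajectory with segmentation indices `[0, 3, 7]` is split into one part of 3 points and one part of 5 points, and every point belongs to exactly one part.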

Segmentation Method.
The intuition behind the segmentation method is that when the motion features of two adjacent trajectory points (such as longitude, latitude, velocity, angle, acceleration, and heading) vary significantly, the motion state changes at this trajectory point, that is, the point is a segmentation point. Therefore, the core of the segmentation method is to determine the segmentation points.
To accurately extract the segmentation points, it is necessary to define an index that measures the similarity of multiple motion features between two adjacent trajectory points. Since the Pearson coefficient is sensitive to variation, it is employed to calculate the similarity of adjoining trajectory points, extract the points where the motion features change, and save them to the segmentation point sequence.
The Pearson coefficient is a statistical indicator that reflects the degree of linear correlation between two variables. It can be calculated through Equation (1):
$$\rho_{F_i,F_j} = \frac{\mathrm{cov}(F_i, F_j)}{\sigma_{F_i}\sigma_{F_j}} = \frac{E\left[(F_i - \mu_{F_i})(F_j - \mu_{F_j})\right]}{\sigma_{F_i}\sigma_{F_j}}, \quad (1)$$
where $F_i$ and $F_j$ are the features of $p_i$ and $p_j$ (longitude, latitude, speed, average speed, acceleration, and angle), $\mathrm{cov}(F_i, F_j)$ is the covariance between $F_i$ and $F_j$, $\mu$ represents the mean value, and $\sigma_{F_i}$ and $\sigma_{F_j}$ are the standard deviations of $F_i$ and $F_j$. When the value equals 0, it indicates that $F_i$ and $F_j$ are irrelevant; when the value equals 1 (e.g., [1, 2, 3, 4, 5, 6] and [1, 2, 3, 4, 5, 6]), $F_i$ and $F_j$ are perfectly positively correlated; when the value equals -1 (e.g., [1, 2, 3, 4, 5, 6] and [-1, -2, -3, -4, -5, -6]), $F_i$ and $F_j$ are perfectly negatively correlated.
Generally, trajectory data reflect the motion history of moving objects, and the sampling interval is usually very short, so the features of adjacent points in the same motion state are usually nearly the same, that is, the value of the Pearson coefficient is close to 1. In contrast, the acceleration, speed, average speed, and angle of trajectory points at which the motion state changes vary markedly, so the Pearson coefficient is close to -1. For example, we calculate the Pearson coefficient of two sets of adjacent trajectory points; the result is shown in Table 1. When the features of adjacent trajectory points show no obvious variation, the value of $\rho_{F_i,F_j}$ is close to 1; otherwise, it is close to -1. Figure 2 shows how the value of $\rho_{F_i,F_j}$ changes: there are many mutation points of the Pearson coefficient, and between mutation points the value of $\rho_{F_i,F_j}$ is close to 1 and remains unchanged. Meanwhile, multiple mutation points can occur within a short time. However, the motion state of a moving object does not change within a short time, which means that some of these mutation points are outlier points. Therefore, the purpose of the segmentation method is to extract the mutation points and remove the outlier points.
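The following is a minimal sketch of how Equation (1) can be applied to adjacent trajectory points, assuming each point is described by a plain list of numeric features. The function names, the handling of constant feature vectors (returned as 0), and the default threshold δ = 0.4 (the value chosen later in the parameter settings) are assumptions of this sketch rather than details fixed by the method. It only extracts candidate mutation points; the outlier removal based on the time threshold T is not shown.

```python
import math
from typing import List, Sequence


def pearson(f_i: Sequence[float], f_j: Sequence[float]) -> float:
    """Pearson coefficient between the feature vectors of two points."""
    n = len(f_i)
    if n != len(f_j) or n < 2:
        raise ValueError("feature vectors must have equal length >= 2")
    mu_i = sum(f_i) / n
    mu_j = sum(f_j) / n
    cov = sum((a - mu_i) * (b - mu_j) for a, b in zip(f_i, f_j))
    var_i = sum((a - mu_i) ** 2 for a in f_i)
    var_j = sum((b - mu_j) ** 2 for b in f_j)
    denom = math.sqrt(var_i * var_j)
    if denom == 0.0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / denom


def adjacent_pearson(features: Sequence[Sequence[float]]) -> List[float]:
    """Entry k compares point k with point k + 1."""
    return [pearson(features[k], features[k + 1])
            for k in range(len(features) - 1)]


def candidate_points(features: Sequence[Sequence[float]],
                     delta: float = 0.4) -> List[int]:
    """Indices of points whose coefficient with the previous point is < delta."""
    return [k + 1 for k, rho in enumerate(adjacent_pearson(features))
            if rho < delta]
```

With the two example vectors from the text, `pearson([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])` evaluates to 1 and `pearson([1, 2, 3, 4, 5, 6], [-1, -2, -3, -4, -5, -6])` to -1.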
The pseudocode of the segmentation method is detailed in Algorithm 1. The proposed segmentation method first takes the raw trajectory (such as $Traj_i = (p_0, p_1, p_2, p_3, \dots, p_i, \dots, p_n)$, $0 \le i \le n$) from the database. Then, it calculates the Pearson coefficient of each pair of adjacent trajectory points and stores the values in the array pcc. The segmentation method looks for the point $p_l$ with the minimum value of $\rho_{F_i,F_j}$ from a global perspective. When $\rho_{F_i,F_j}$ is less than δ and the time interval between $p_l$ and its adjacent segmentation points is less than T, $p_l$ is added to the segmentation point sequence and removed from the array pcc. Otherwise, $p_l$ is treated as an outlier point and removed. The procedure repeats this step until the minimum value of $\rho_{F_i,F_j}$ is greater than δ.

Merging Method.
The trajectory segmentation algorithm based on the Pearson coefficient achieves high homogeneity within the subtrajectories. However, in practical applications, the collected trajectories contain some outlier points, which cause the value of $\rho_{F_i,F_j}$ to be close to -1. Although the segmentation method utilizes the time threshold T to remove outlier points, when the time interval is greater than T, outlier points may be mistakenly added to the segmentation point sequence. This may cause the raw trajectory to be oversegmented. For example, a raw trajectory containing 122 subtrajectories is finally partitioned into 187 segments. In the mergence part, the minimum description length (MDL) principle is used to construct the cost function, and the subtrajectories are merged by optimizing the cost function from a local perspective, which improves the accuracy of the final segmentation. The MDL principle was proposed by Rissanen [31] and then used and detailed by Grünwald et al. [32]. According to Grünwald et al. [32], the MDL cost consists of $L(H)$ and $L(D \mid H)$.
Here, H denotes the hypothesis and D the data. $L(H)$ is the length of the description of the hypothesis in bits, and $L(D \mid H)$ is the length of the description of the data when encoded with the hypothesis. The best hypothesis H to explain D is the one that minimizes the sum of $L(H)$ and $L(D \mid H)$.
In the problem of trajectory segmentation, a hypothesis corresponds to a subtrajectory, so finding the optimal subtrajectory means finding the best hypothesis. Given a subtrajectory $S = (p_i, p_{i+1}, \dots, p_j)$, $0 \le i < j \le n$, we formulate the cost function by Equation (2), which can be used to measure homogeneity. In Equation (2), $L(H) = \log_2(\mathrm{len}(p_i p_j))$ and $L(D \mid H) = \log_2\left(\sum_{i}^{j} d_\perp(p_i p_j, p_i p_{i+1})\right) + \log_2\left(\sum_{i}^{j} d_\theta(p_i p_j, p_i p_{i+1})\right)$, where $d_\perp(p_i p_j, p_i p_{i+1})$ is the perpendicular distance between $p_i p_j$ and $p_i p_{i+1}$, and $d_\theta(p_i p_j, p_i p_{i+1})$ is the angle distance between $p_i p_j$ and $p_i p_{i+1}$. The distances $d_\perp$ and $d_\theta$ are defined in Equations (3) and (4), which are given in [17]. Figure 3 illustrates the cost function, $d_\perp$, and $d_\theta$ for a subtrajectory that contains 5 trajectory points.
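To make the cost function concrete, the sketch below computes $L(H) + L(D \mid H)$ for a subtrajectory given as planar (x, y) points. Several choices are assumptions of the sketch and may differ from Equations (3) and (4): the perpendicular and angle distances use a common line-segment formulation ($d_\perp = (l_1^2 + l_2^2)/(l_1 + l_2)$ and $d_\theta = \lVert L \rVert \sin\theta$ for $\theta < 90°$, $\lVert L \rVert$ otherwise), the sums run over consecutive segments $p_k p_{k+1}$ of the subtrajectory, and arguments below 1 are clamped to 1 before taking $\log_2$ so that no term becomes negative or undefined. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def seg_len(a: Point, b: Point) -> float:
    return math.hypot(b[0] - a[0], b[1] - a[1])


def point_line_dist(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the infinite line through a and b."""
    length = seg_len(a, b)
    if length == 0.0:
        return seg_len(p, a)
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / length


def d_perp(a: Point, b: Point, c: Point, d: Point) -> float:
    """Perpendicular distance of segment cd from the line through ab."""
    l1, l2 = point_line_dist(c, a, b), point_line_dist(d, a, b)
    return 0.0 if l1 + l2 == 0.0 else (l1 * l1 + l2 * l2) / (l1 + l2)


def d_theta(a: Point, b: Point, c: Point, d: Point) -> float:
    """Angle distance of segment cd relative to segment ab."""
    len_ab, len_cd = seg_len(a, b), seg_len(c, d)
    if len_ab == 0.0 or len_cd == 0.0:
        return 0.0
    dot = (b[0] - a[0]) * (d[0] - c[0]) + (b[1] - a[1]) * (d[1] - c[1])
    cos_t = max(-1.0, min(1.0, dot / (len_ab * len_cd)))
    if cos_t <= 0.0:  # angle of 90 degrees or more
        return len_cd
    return len_cd * math.sqrt(1.0 - cos_t * cos_t)


def log2_clamped(x: float) -> float:
    """log2 with the argument clamped to at least 1 (assumption)."""
    return math.log2(max(x, 1.0))


def mdl_cost(points: Sequence[Point]) -> float:
    """L(H) + L(D|H) for the subtrajectory given by points."""
    if len(points) < 2:
        raise ValueError("a subtrajectory needs at least two points")
    a, b = points[0], points[-1]
    pairs = list(zip(points[:-1], points[1:]))
    l_h = log2_clamped(seg_len(a, b))
    perp = sum(d_perp(a, b, c, d) for c, d in pairs)
    ang = sum(d_theta(a, b, c, d) for c, d in pairs)
    return l_h + log2_clamped(perp) + log2_clamped(ang)
```

Under these assumptions, a straight subtrajectory such as `[(0, 0), (4, 0), (8, 0)]` has a lower cost than a bent one with the same end points such as `[(0, 0), (4, 4), (8, 0)]`, which is the behavior a homogeneity measure should show.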
Based on the aforementioned theory, the merging method is detailed in Algorithm 2. First, the procedure uses the segmentation point sequence SegPoint to segment the raw trajectory, which can be denoted as $Traj = (s_i, \dots, s_j, \dots, s_k)$, $0 \le i \le j \le k$. Then, it merges $s_i$ and $s_{i+1}$ into $s_{total}$ and calculates the cost function of $s_i$, $s_{i+1}$, and $s_{total}$; the results are denoted as $cost_{s_i}$, $cost_{s_{i+1}}$, and $cost_{s_{total}}$. When these values satisfy Equation (5), that is, $cost_{s_i} + cost_{s_{i+1}} > cost_{s_{total}}$, the two subtrajectories are oversegmented, and $s_i$ and $s_{i+1}$ are merged from the local perspective. The procedure repeats this step until the last subtrajectory.
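The merging rule can be sketched as a single left-to-right pass over the subtrajectories, assuming that a merged subtrajectory is compared with the next one in turn (the text does not state this explicitly). The cost function is passed in as a parameter; `spread_cost` below is only a hypothetical stand-in used to make the example runnable and is not the MDL cost of the method.

```python
from typing import Callable, List, Sequence


def merge_oversegmented(segments: Sequence[Sequence[float]],
                        cost: Callable[[Sequence[float]], float]
                        ) -> List[List[float]]:
    """Merge adjacent segments while cost(s_i) + cost(s_i+1) > cost(s_total)."""
    if not segments:
        return []
    merged = [list(segments[0])]
    for nxt in segments[1:]:
        current = merged[-1]
        total = current + list(nxt)
        if cost(current) + cost(nxt) > cost(total):
            merged[-1] = total         # oversegmented: keep the merged segment
        else:
            merged.append(list(nxt))   # genuine boundary: keep both segments
    return merged


def spread_cost(values: Sequence[float]) -> float:
    """Hypothetical stand-in cost: fixed overhead plus the value range."""
    return 1.0 + (max(values) - min(values))


if __name__ == "__main__":
    speeds = [[4.0, 4.1], [4.2, 4.0], [12.0, 12.5]]
    print(merge_oversegmented(speeds, spread_cost))
```

In the example, the first two segments have nearly identical speeds, so describing them jointly is cheaper and they are merged, whereas the third segment stays separate because merging it would increase the cost.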
3.4. The TS-MF Algorithm. The segmentation part and the mergence part are the two phases (global segmentation and local optimization) of TS-MF, which are described in Section 3.2 and Section 3.3. Algorithm 3 shows the pseudocode of TS-MF. The algorithm receives the following inputs: the raw trajectory $Traj_i$ ($0 \le i \le n$), a time threshold T, and the Pearson coefficient threshold δ. The output is the set of subtrajectories, which can be denoted as $(s_i, \dots, s_j, \dots, s_k)$.

Experimental Evaluation
To evaluate the effectiveness of the proposed algorithm, we verify it on two real datasets. This section first details the datasets (Section 4.1) and the evaluation metrics (Section 4.2). Then, the parameter settings and experimental results are introduced in Section 4.3 and Section 4.4, while a comparative analysis with other algorithms is presented in Section 4.5.

    ⋯
    Seg ⟵ add p_0 and p_n to Seg
    n ⟵ the number of points with ρ_{F_i,F_j} < δ
    for i = 0 ⟶ n − 1 do    * time complexity is O(n) *
        ρ_{F_{l−1},F_l}, p_l ⟵ the minimum value of ρ_{F_{l−1},F_l} and the corresponding point
        p_left, p_right ⟵ the adjacent segmentation points of p_l in SegPoint
        T_left ⟵ time interval between p_l and p_left
        T_right ⟵ time interval between p_l and p_right
        if T_left < T and T_right < T then
            add p_l to Seg
        end if
    end for
    return Seg
Algorithm 1: Trajectory segmentation algorithm based on the Pearson coefficient.

    ⋯
        if (cost_{s_i} + cost_{s_{i+1}}) > cost_{s_total} then
            merge s_i and s_{i+1} from the local perspective
        end if
    end for
    return the set of subtrajectories (s_i ⋯ s_j ⋯ s_k)
Algorithm 2: Trajectory merging method.

The geolife dataset has a mix of behaviors, such as car, bus, train, and walk. Figure 4 (right) shows part of the geolife trajectories. From these trajectories, we extracted the collected information of time, longitude, latitude, fishing, speed, and angle. We computed some trajectory features for all the points in this dataset, including mean speed and acceleration. The data description is shown in Table 2.

Evaluation Metrics.
In this work, the harmonic mean (H) of the average purity P and the average coverage C is used to evaluate the proposed algorithm. The concepts of coverage and purity were first proposed in [23], and the harmonic mean (H) was used to evaluate trajectory segmentation algorithms in [19].
The segment purity is the ratio of the number of points carrying the most frequent label in the segment to the number of all trajectory points in the segment. For example, suppose a segment has k points, and d of them carry the most frequent label; then, the segment purity is d/k. The average of the purity values of all segments is denoted as P. Coverage evaluates the completeness of the segmentation. For example, suppose that a raw trajectory segment τ is divided into τ_1 and τ_2, and τ_2 is the larger one; the coverage is then defined as τ_2/τ. The average coverage of all segments is denoted as C. The two metrics of purity (P) and coverage (C) are designed to be orthogonal, i.e., when one index increases, the other decreases. Therefore, the harmonic mean of the purity and coverage is used to evaluate the performance of TS-MF. Equation (6) gives the formulation of the harmonic mean [19]:
$$H = \frac{2PC}{P + C}. \quad (6)$$
When the harmonic mean is the highest, the purity and coverage of the segmented trajectory reach a good compromise, and the segmentation into subtrajectories is the best.
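The three metrics can be computed as in the following sketch, which assumes that every trajectory point carries a ground-truth label and the identifier of the predicted segment it belongs to, and that a true segment is a maximal run of consecutive points with the same label. The function names and this input representation are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby
from typing import Hashable, List, Sequence


def runs(ids: Sequence[Hashable]) -> List[range]:
    """Index ranges of maximal runs of equal consecutive ids."""
    out, start = [], 0
    for _, grp in groupby(ids):
        n = len(list(grp))
        out.append(range(start, start + n))
        start += n
    return out


def _avg_majority_share(split_by: Sequence[Hashable],
                        counted: Sequence[Hashable]) -> float:
    """Average, over runs of split_by, of the share of the most common value."""
    shares = []
    for run in runs(split_by):
        counts = Counter(counted[i] for i in run)
        shares.append(counts.most_common(1)[0][1] / len(run))
    return sum(shares) / len(shares)


def average_purity(pred_ids: Sequence[Hashable],
                   labels: Sequence[Hashable]) -> float:
    """P: per predicted segment, share of its most frequent label (d/k)."""
    return _avg_majority_share(pred_ids, labels)


def average_coverage(pred_ids: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """C: per true segment, share covered by its largest predicted piece."""
    return _avg_majority_share(labels, pred_ids)


def harmonic_mean(p: float, c: float) -> float:
    """H = 2PC / (P + C), Equation (6)."""
    return 0.0 if p + c == 0 else 2 * p * c / (p + c)
```

For example, with labels `["walk"] * 4 + ["bus"] * 4` and predicted segment identifiers `[0, 0, 0, 1, 1, 1, 1, 1]`, the purity values are 1 and 0.8 (P = 0.9), the coverage values are 0.75 and 1 (C = 0.875), and H is approximately 0.887.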
4.3. Parameter Settings. In the segmentation process, the threshold of the Pearson coefficient, that is, δ, is employed to find the segmentation points. In general, when $\rho_{F_i,F_j} \ge 0.8$, the two feature variables are highly positively correlated; when $0.6 \le \rho_{F_i,F_j} < 0.8$, they are moderately positively correlated; when $0.4 \le \rho_{F_i,F_j} < 0.6$, their correlation is low; when $0 \le \rho_{F_i,F_j} < 0.4$, they may be irrelevant; and $\rho_{F_i,F_j} < 0$ suggests that they are negatively correlated. Therefore, TS-MF sets δ = 0.4 to extract segmentation points. In addition, the segmentation process utilizes the threshold T to remove outlier points. Since it is difficult to know the specific duration of each state of the moving object, and the purpose of T is only to remove part of the outlier points, T can be set to the minimum duration of each movement state. The duration of the fishing activities of fishing vessels on the coast of Brazil is 6 hours, as mentioned in [33], and the shortest duration of a walk is generally 30 min. Therefore, T = 6 hours for the vessels performing fishing ⋯
The results are shown in Figures 5-8, which display the number of subtrajectories, P, C, and H under different δ on different datasets.
The results of the segmentation method are shown in Figures 5-6. They show that as δ increases, the number of subtrajectories increases, P increases, C decreases, and H increases.
TS-MF extends the segmentation method with an additional merging step: the mergence part merges the local subtrajectories produced by the segmentation method. The results of TS-MF are shown in Figures 7-8. Compared with the results of the segmentation method alone, the number of subtrajectories is lower, and C and H are better. We can also observe that, in the mergence part, more subtrajectories are merged on the fishing dataset than on the geolife dataset. The reason is that when fishing vessels engage in fishing, the speed is generally 4 miles per hour, and the heading angle is constantly changing. This leads to lower values of the Pearson coefficient, so the segmentation method may add many outlier points to the segmentation points. Geolife collected the trajectory data of 182 users, which include various motion states. The features differ strongly between different motion states and only slightly within the same motion state. Therefore, the segmentation method can accurately discover the segmentation points, that is, fewer outlier points are among the segmentation points.
Figure 9: Comparison with other segmentation algorithms on the fishing dataset (left) and the geolife dataset (right).
Overall, the results of TS-MF are better. The greater δ is, the more segmentation points the segmentation method extracts, which leads to lower C and H. However, this does not mean that the lowest δ is the best choice. As shown in Figures 7-8, when δ = 0.3, the number of subtrajectories is very low, that is, TS-MF loses many segmentation points. The results also indicate that δ = 0.4 is a feasible choice.

4.5. Comparing TS-MF with Other Baseline Algorithms. In this section, the experiments are repeated in the same environment, and TS-MF is compared with four other trajectory segmentation algorithms (CB-SMoT, GRASP-UTS, SPD, and SWS) on the fishing dataset and the geolife dataset. The results are reported in Figure 9. As shown in Figure 9, the harmonic mean reaches 90.1% and 94.28% on the two datasets, and TS-MF achieves the highest harmonic mean of purity and coverage. The results also demonstrate the feasibility of TS-MF.

Conclusions
It is envisioned that future wireless communications will be more data-driven. Mobile edge cloud, beamforming, and artificial intelligence techniques make it possible to obtain high-accuracy, long-term trajectories of moving objects. However, processing long-term location data requires huge computing resources and loses a lot of information. A segmentation algorithm designed for location data is therefore a basic step in developing location-based applications. This study proposes an unsupervised trajectory segmentation algorithm, named TS-MF, which employs the Pearson coefficient to find the segmentation points and minimizes a cost function to merge oversegmented subtrajectories. We compared the proposed segmentation algorithm against GRASP-UTS, SPD, CB-SMoT, and SWS; the results show that the proposed algorithm reaches the best harmonic mean of purity and coverage on the fishing dataset and the geolife dataset. Furthermore, the TS-MF algorithm requires no labeled data, and its time complexity is O(n), which means that it is computationally efficient and thus well suited to the segmentation of large trajectory datasets.
However, TS-MF has one limitation: when the features are similar in different movement states, the proposed segmentation algorithm may not find qualified segmentation points for the raw trajectory.
As future work, we plan to extend this work in two directions. First, we would like to analyze the trajectory motion pattern and predict the subtrajectory state, providing semantic enhancement for the raw trajectory. Second, we would like to apply the segmentation algorithm (TS-MF) to more wireless positioning data, so that more artificial intelligence technologies can be used to mine valuable information.

Data Availability
The data and codes that support the findings of this study are available with the identifier(s) at the private link https://figshare.com/s/6e6fb483b076b2a34cbe.