Assessing compression algorithms to improve the efficiency of clustering analysis on AIS vessel trajectories

Abstract In the maritime environment, the Automatic Identification System (AIS) is used to monitor vessel activity concerning security and safety ocean-wide. AIS data has been used to detect anomalous behaviors related to suspicious activities and hazardous events. Typically, clustering analysis is used to investigate anomalous events within the AIS data stream. However, the main challenge in this approach is determining and computing the dissimilarity between trajectories, since trajectories differ in length and duration. In addition, these calculations are computationally expensive and do not scale. To tackle this issue, compression algorithms, which are typically used to reduce storage and processing time, can be applied before clustering analysis. Therefore, the proposed analysis assesses how compression algorithms affect clustering results with respect to detecting anomalous vessel trajectories. The results show that a suitable compression algorithm can reduce the overall processing time with little impact on the clustering results while supporting the scalability of this type of analysis.


Introduction
In data science, anomaly detection is the process of finding events that do not present normal patterns or behaviors. Such events are referred to as outliers or anomalies (Kelleher and Tierney 2018). In the maritime environment, the Automatic Identification System (AIS) is used to monitor vessel activity concerning security and safety ocean-wide (Fu et al. 2017, Li et al. 2018). AIS provides important information related to the status of the vessel, such as the geographic location, speed, and vessel type.
Anomaly detection using AIS data is typically performed with clustering analysis, an unsupervised learning approach that requires a dissimilarity or similarity measure between instances (Pallotta et al. 2013, Liu et al. 2014, Fu et al. 2017, Li et al. 2017, Abreu et al. 2021a). TRACLUS is a density-based clustering algorithm that was developed to identify common sub-trajectories in real-world datasets (Lee et al. 2007). This algorithm partitions the trajectories into a set of line segments according to the minimum description length (MDL) principle. Next, density-based clustering is performed on the set of segments to identify common sub-trajectories. A challenge associated with the TRACLUS algorithm is its sensitivity to the selection of the input parameters. Therefore, improvements to this algorithm were proposed to tackle this issue while including other properties of the trajectories, such as the sequence of the messages (Jiashun 2012, Zhang et al. 2018). This algorithm was also extended to perform trajectory segmentation (Soares Júnior et al. 2015, Júnior et al. 2018) and to be processed using GPUs (Mustafa et al. 2021). However, most studies use the DBSCAN algorithm (Ester et al. 1996) or a variation of it, which allows other attributes, such as speed, to be taken into consideration. DBSCAN is typically used to detect outliers and noise within the dataset by defining regions of high and low density (Pallotta et al. 2013, Liu et al. 2014, Fu et al. 2017, Li et al. 2018, Pedroche et al. 2021). One drawback of DBSCAN is that it requires a predefined radius to form the clusters, which can be challenging to determine. Similar to DBSCAN, HDBSCAN (Campello et al. 2013) also detects outliers and clusters based on density. However, HDBSCAN incorporates a hierarchical approach into the DBSCAN algorithm, which allows for a visual selection of the parameters through a dendrogram that shows the levels of the hierarchical clusters.
Clustering techniques have been used to analyze ship behavior, but the scope is limited and is typically restricted to pattern and anomaly detection (May Petry et al. 2020). One of the main challenges in performing clustering analysis with AIS data is determining which dissimilarity measure is appropriate, because trajectories can have different lengths and varying time lags between messages. TRACLUS applies three different distances: (i) perpendicular distance, (ii) parallel distance, and (iii) angle distance (Lee et al. 2007). However, such distances do not include speed or time information. In addition, the TRACLUS distance is typically applied to segments of the trajectory using the Euclidean distance measure to obtain those three distances. For the whole trajectory, however, the Euclidean distance is not suitable due to the earth's curvature, which requires the use of a geographic distance measure. To address these issues, Li et al. (2017) and Fu et al. (2017) use Dynamic Time Warping (DTW), a commonly used tool in time-series analysis for computing distances (Keogh and Ratanamahatana 2005). Li et al. (2017) found that DTW is an effective method for calculating the distance between trajectories when performing clustering analysis. Likewise, Li et al. (2018) use the Merge Distance (MD) to compute the distance between trajectories, claiming that MD is more robust than DTW under sub-sampling or super-sampling of trajectories because MD is invariant under rigid movements. Additionally, the Hausdorff Distance (HD) and the Discrete Fréchet Distance (DFD) are typically applied in the literature to measure the distance between trajectories (Cao et al. 2018, Mannarini et al. 2019, Andersen et al. 2021, Wang et al. 2021, Lee et al. 2022, Xiong et al. 2022). Another challenge in clustering analysis is that those distance measures are computationally expensive and not scalable (Pedroche et al. 2021).
In an attempt to solve this problem, compression algorithms have been widely applied to trajectories to reduce storage and processing time (Liu et al. 2019, Makris et al. 2021a,b, Tang et al. 2021, Lee et al. 2022). Different algorithms have been developed to reduce the vessel trajectory size while maintaining relevant features. For instance, Makris et al. (2021a) compared several algorithms by analyzing their compression ratio and error, where the error was computed as the DTW distance between the original trajectory and the compressed one.
Although compression algorithms are widely applied to improve the storage and processing time of vessel trajectory analysis, only recently have researchers begun investigating their application to improve the processing time of clustering analysis (Vries and Someren 2010, De Vries and Van Someren 2012, Wang et al. 2021, Alam and Torgo 2022, Murray and Perera 2022). Only a handful of works have investigated the impact of compression algorithms on the clustering results (Vries and Someren 2010, De Vries and Van Someren 2012, Alam and Torgo 2022). For instance, Alam and Torgo (2022) analyzed the effect of the Ramer-Douglas-Peucker (RDP) compression technique (Ramer 1972, Douglas and Peucker 1973) on the clustering analysis process. HDBSCAN was used in conjunction with DTW to perform the clustering, and the results showed that the compression did not affect the clustering results significantly. In addition, De Vries and Van Someren (2012) investigated the impact of the Piecewise Linear Segmentation (PLS) algorithm on clustering analysis. Their approach clustered the trajectories using k-means with DTW and Edit distances. This research showed that compression algorithms do not negatively impact trajectory analysis. However, the evaluation of the impact on the clustering results is qualitative only, being based on visual aspects and the number of clusters.
Unlike the aforementioned works, this work quantitatively analyzes the effect of applying compression algorithms to improve the efficiency of clustering for anomaly detection within vessel traffic. The proposed methodology covers the compression algorithms discussed in Makris et al. (2021b): the Douglas-Peucker (DP), Time-Ratio (TR), and Speed-based (SB) algorithms, along with combinations of these techniques. These algorithms require predefined thresholds, which were also evaluated in this work. For the clustering analysis, HDBSCAN was used, which requires the minimum cluster size as input. The selection of a minimum cluster size in HDBSCAN will provide similar clustering results if the compression algorithm maintains the relative distance amongst the trajectories. That is, the clusters produced with HDBSCAN should be similar when using the same parameters if the distances obtained after compression are proportional to the distances obtained with the whole trajectory. Such behavior cannot be evaluated using DBSCAN because the definition of a radius would require that the distances amongst the trajectories be the same before and after compression.
In summary, this paper aims to assess how compression algorithms may influence clustering analyses with respect to anomaly detection of vessel trajectories. Therefore, the contributions of this work are the following: a benchmark is designed to carefully consider the relative distance among trajectories; an in-depth analysis is performed that supports the selection of an adequate compression algorithm according to the vessel movement and the distance measure applied; a set of experiments is executed that covers state-of-the-art techniques typically employed in the literature; and our results show that choosing a suitable compression algorithm for a particular scenario can reduce the overall processing time with very little impact on the clustering outcome.
The rest of this paper is organized as follows. Section 2 describes the approach used to evaluate the impact of the compression algorithm, the description of the dissimilarity measures, the compression algorithms, and the evaluation metric used. The experiments are presented in Section 3, where fishing and tanker vessels are used to detect different vessel patterns. Lastly, the conclusions are discussed in Section 4.

Methodology
The experimental framework for this study is illustrated in Figure 1. This analysis uses an open-source Digital Coast AIS dataset that covers the coast of the United States. The dataset availability is described in Section Data and Code Availability Statement together with the information on the source code. The AIS messages contain information on vessel trajectories that include: maritime mobile service identities (MMSI), latitude, longitude, speed over ground (SOG), course over ground (COG), vessel type, date, and time. Note, MMSI is typically used as a unique identifier in this type of analysis (Li et al. 2018).
Initially, a preprocessing stage is conducted on the AIS dataset to remove duplicates and invalid messages. 1 After preprocessing, the dissimilarity between the trajectories is computed to form the matrix needed for the clustering algorithm. This analysis uses Dynamic Time Warping (DTW) (Keogh and Ratanamahatana 2005, Müller 2007) and Merge Distance (MD) (Ismail and Vigneron 2015, Li et al. 2018) to calculate the distance between trajectories.
The detection of anomalous trajectories is then conducted using HDBSCAN (Campello et al. 2013). Density-based clustering techniques, such as HDBSCAN, identify clusters by locating regions of high density separated by regions of low density. The density of a cluster is determined by the number of instances it contains. HDBSCAN requires two main parameters: the minimum size of a cluster and the minimum number of instances in a neighborhood needed to define a core point. It is worth mentioning that HDBSCAN is performed with identical parameters in order to evaluate the impact of the compression algorithms.
Given that the process of computing the distances between trajectories is computationally intensive, three different compression algorithms and combinations of them are applied to the trajectories before the distance calculation: Douglas-Peucker (DP), Time-Ratio (TR), Speed-based (SB), TR + SB, SB + TR, DP + SB, SB + DP, TR + DP, and DP + TR.
Initially, the HDBSCAN algorithm is performed using both MD and DTW to obtain the ground truth, or control, which is in the blue box in Figure 1. Next, the compression algorithms are applied using different thresholds. Then, the HDBSCAN algorithm is performed using both MD and DTW for each compressed dataset. This procedure is illustrated in the orange box in Figure 1. To evaluate the impact that the compression algorithms have on the clustering results, Normalized Mutual Information (NMI) is used to compare the results. In addition, the Mantel test is used to compare the distance matrices for DTW and MD. The following sections will discuss each component of the analysis.

Dissimilarity measures
The dissimilarity between trajectories is computed using DTW (Keogh and Ratanamahatana 2005, Müller 2007), HD (Wang et al. 2021, Lee et al. 2022), DFD (Eiter and Mannila 1994, Cao et al. 2018, Tang et al. 2022), or MD (Li et al. 2018). All of these measures build on the distance between individual messages, which is typically calculated using the Euclidean distance. However, in this experiment, the haversine distance (Van Brummelen 2014) 2 was used because the messages of each trajectory may be far enough apart that the earth's curvature must be considered in the calculation.
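For reference, the haversine distance mentioned above can be sketched in a few lines of Python; the mean Earth radius used here is an assumption, as the paper does not state the value it adopted:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed; the paper does not give a value)

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

One degree of longitude along the equator then comes out to roughly 111 km, which is the usual sanity check for this formula.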

Dynamic time warping
Dynamic Time Warping measures the dissimilarity of two individual time series that may vary in time and length. This is done by aligning the time series trajectories and finding the shortest path. To do this, DTW calculates the distance between all n positions of a given trajectory Q and each position of another trajectory C of size m, creating an n × m warping matrix W. The warping path is formed using the minimum values from the warping matrix, starting at w_1 = W(1, 1) and ending at w_k = W(n, m). The distance of this shortest path is considered the final distance between the trajectories. The shortest path is calculated using:

DTW(Q, C) = min{ √( Σ_{i=1..k} w_i ) } (1)

where w contains the elements in the warping matrix (distances between positions) and k is the number of elements in the warping path.
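The recurrence behind the warping matrix can be sketched as follows. This minimal version accumulates pairwise distances directly rather than squared distances under a square root (a common equivalent formulation), and the `dist` callback (e.g. haversine over positions) is left to the caller:

```python
def dtw(Q, C, dist):
    """Classic O(n*m) DTW between two point sequences Q and C.

    W[i][j] holds the cost of the best warping path aligning the first i
    points of Q with the first j points of C.
    """
    n, m = len(Q), len(C)
    INF = float("inf")
    W = [[INF] * (m + 1) for _ in range(n + 1)]
    W[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(Q[i - 1], C[j - 1])
            # extend the cheapest of the three admissible predecessor cells
            W[i][j] = d + min(W[i - 1][j], W[i][j - 1], W[i - 1][j - 1])
    return W[n][m]
```

For long trajectories this quadratic cost is exactly why the paper compresses trajectories before computing the distance matrix.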

Hausdorff distance
Hausdorff Distance (HD) measures the distance between two sets of points (Rucklidge 1997). The symmetric HD is obtained as the maximum of the two directed Hausdorff distances. Therefore, given two trajectories Q and C, the directed HD computes, for each position x_i ∈ Q, the distance to its nearest neighbor y_n ∈ C (Rucklidge 1997). The directed HD is defined as:

h(Q, C) = max_{x ∈ Q} min_{y ∈ C} ||x − y|| (2)

where || · || is any norm. One aspect of HD is that the direction (sequence) of the trajectories is not incorporated (Eiter and Mannila 1994), which is addressed by the Fréchet Distance.
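The directed and symmetric HD translate almost literally into code; a minimal sketch, with the norm supplied as a `dist` callback:

```python
def directed_hausdorff(Q, C, dist):
    """h(Q, C): for every x in Q take the distance to its nearest y in C,
    then keep the worst (largest) of those nearest-neighbor distances."""
    return max(min(dist(x, y) for y in C) for x in Q)

def hausdorff(Q, C, dist):
    """Symmetric HD: the larger of the two directed distances."""
    return max(directed_hausdorff(Q, C, dist), directed_hausdorff(C, Q, dist))
```

Note the asymmetry of the directed variant: a point of C far from every point of Q only shows up in h(C, Q), which is why the symmetric maximum is taken.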

Fr echet distance
The Fréchet Distance was developed by Fréchet (1906) to measure the similarity between two curves in a metric space. As the Discrete Fréchet Distance (DFD) considers both the location and the order of the points that compose the curves, it is typically better than the Hausdorff distance (Eiter and Mannila 1994). Given two trajectories Q and C, the DFD computes the distance between positions according to the different sequence combinations (Eiter and Mannila 1994, Tang et al. 2022). The maximum distance D_l between messages of Q and C is computed for each possible alignment (l = 1, …, L), and the final distance is given by:

DFD(Q, C) = min_{l = 1..L} D_l (3)
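Rather than enumerating all L alignments, the minimum over them is usually obtained with the standard recurrence of Eiter and Mannila (1994); a minimal memoized sketch:

```python
from functools import lru_cache

def discrete_frechet(Q, C, dist):
    """Discrete Fréchet distance via the Eiter-Mannila recurrence."""
    @lru_cache(maxsize=None)
    def c(i, j):
        # cost of the best alignment of Q[:i+1] and C[:j+1]:
        # the current pair's distance, or the best way to reach it, whichever is worse
        d = dist(Q[i], C[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)

    return c(len(Q) - 1, len(C) - 1)
```

Because every alignment must cover every point of both curves, DFD is always at least as large as the Hausdorff distance.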

Merge distance
Merge Distance was applied by Li et al. (2018) to improve clustering analysis for anomaly detection tasks. They claim that MD is robust to sub-sampling and super-sampling, making it a suitable measure for compressed trajectories. Given two trajectories Q and C, MD defines the shortest super trajectory of Q ∪ C, that is, the shortest trajectory that visits all positions of Q and C in their original orders, where the distance between any two positions originating from Q, C, or both can be computed in constant time, and determines the lengths l(Q), l(C), and l(Q, C). The MD is calculated using:

MD(Q, C) = 2 l(Q, C) / (l(Q) + l(C)) − 1 (4)
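A possible dynamic program for the shortest super trajectory is sketched below. This is our reading of the merge recurrence rather than the exact formulation of Ismail and Vigneron (2015), and it assumes at least one trajectory has nonzero length (otherwise equation (4) divides by zero):

```python
def merge_distance(Q, C, dist):
    """Merge Distance via the shortest 'super trajectory' containing Q and C.

    A[i][j] (resp. B[i][j]) is the length of the shortest merge of the first i
    points of Q and first j points of C that ends at Q[i-1] (resp. C[j-1]).
    """
    n, m = len(Q), len(C)
    INF = float("inf")
    A = [[INF] * (m + 1) for _ in range(n + 1)]
    B = [[INF] * (m + 1) for _ in range(n + 1)]
    A[1][0] = 0.0  # merge consisting of Q[0] alone
    B[0][1] = 0.0  # merge consisting of C[0] alone
    for i in range(2, n + 1):
        A[i][0] = A[i - 1][0] + dist(Q[i - 2], Q[i - 1])
    for j in range(2, m + 1):
        B[0][j] = B[0][j - 1] + dist(C[j - 2], C[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # the point before Q[i-1] in the merge is either Q[i-2] or C[j-1]
            A[i][j] = min(
                A[i - 1][j] + (dist(Q[i - 2], Q[i - 1]) if i > 1 else INF),
                B[i - 1][j] + dist(C[j - 1], Q[i - 1]),
            )
            B[i][j] = min(
                B[i][j - 1] + (dist(C[j - 2], C[j - 1]) if j > 1 else INF),
                A[i][j - 1] + dist(Q[i - 1], C[j - 1]),
            )
    l_qc = min(A[n][m], B[n][m])
    l_q = sum(dist(Q[i], Q[i + 1]) for i in range(n - 1))
    l_c = sum(dist(C[j], C[j + 1]) for j in range(m - 1))
    return 2 * l_qc / (l_q + l_c) - 1  # equation (4)
```

With identical trajectories the merge is the trajectory itself and MD is 0, growing as the merge has to detour between the two curves.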

Compression algorithms
Compression algorithms were developed to reduce the storage and processing time.
In order to evaluate the compression algorithms in the clustering analysis, three algorithms and combinations of them were analyzed: DP, TR, and SB. All algorithms are recursively executed in a top-down approach, in which the trajectory is divided into two segments according to a predefined threshold. That is, for a given trajectory T, the start (S) and end (E) points are defined as the initial and final positions, respectively, while I is an intermediate point. If a measure computed for this segment is greater than a predefined threshold ε, then the intermediate point is kept and the process continues recursively on the two segments SI and IE (Makris et al. 2021b). If it is less than the threshold ε, only the S and E points are kept. The threshold is given by:

ε = a · μ

in which a is a predefined factor and μ is the average of all the values m_T obtained from a metric computed over the trajectory T, considering all intermediate points I in T. In summary, each algorithm applies a different measure to perform the compression, which is illustrated in Figure 2.
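The shared top-down skeleton can be sketched as follows, with the per-point metric (PD, SED, or AVS) abstracted behind a hypothetical `measure(traj, i)` callback; the threshold comparison mirrors the description above:

```python
def top_down_compress(traj, measure, threshold):
    """Generic top-down compression: keep S and E; if the worst intermediate
    point's measure exceeds the threshold, keep that point and recurse on the
    two halves it induces, otherwise drop all intermediate points."""
    if len(traj) <= 2:
        return list(traj)
    # intermediate point with the largest measure value for this segment
    idx = max(range(1, len(traj) - 1), key=lambda i: measure(traj, i))
    if measure(traj, idx) <= threshold:
        return [traj[0], traj[-1]]
    left = top_down_compress(traj[: idx + 1], measure, threshold)
    right = top_down_compress(traj[idx:], measure, threshold)
    return left[:-1] + right  # drop the duplicated split point
```

Plugging in a perpendicular-distance measure recovers the familiar Douglas-Peucker behavior; SED or AVS measures yield the TR and SB variants.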
One important aspect to mention is that the small time irregularities present in AIS data do not impact the compression algorithms because the algorithms take the time information into account. In the case of TR and SB, the time between observations is included in the measure used (SED and AVS, respectively). On the other hand, DP uses the distance between positions, which is directly related to the time and the velocity of the vessel. However, these algorithms might be affected if the time between observations is too large, missing important information about the movement behavior. Such large gaps are considered missing data, which occurs in real scenarios where AIS data is not available due to transmission problems.

Douglas-Peucker
The Douglas-Peucker algorithm uses the Perpendicular Distance (PD) to compress the trajectory, which is illustrated in Figure 3. Thus, for a given trajectory T, the S and E points form a line segment L: Ax − By + C = 0, where the coefficients A, B, and C are defined as:

A = y_E − y_S,  B = x_E − x_S,  C = x_E · y_S − x_S · y_E

For each intermediate point I in T, the algorithm computes the perpendicular distance between I and the segment L, selecting as I the point that is furthest from L. The distance between the selected I and the segment L is obtained using the following equation:

PD(I, L) = |A · x_I − B · y_I + C| / √(A² + B²)
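Under the coefficient definitions above, the PD measure might look like this in Python; the fallback for the degenerate case where S and E coincide is our own assumption:

```python
def perpendicular_distance(S, I, E):
    """PD of intermediate point I from the line through S and E (DP measure)."""
    (xs, ys), (xi, yi), (xe, ye) = S, I, E
    A = ye - ys
    B = xe - xs
    C = xe * ys - xs * ye
    denom = (A * A + B * B) ** 0.5
    if denom == 0:
        # S and E coincide: fall back to the plain point-to-point distance
        return ((xi - xs) ** 2 + (yi - ys) ** 2) ** 0.5
    return abs(A * xi - B * yi + C) / denom
```

Both S and E satisfy A·x − B·y + C = 0 under these coefficients, so the value is zero exactly when I lies on the line through them.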

Time-ratio
The Time-Ratio method uses the Synchronous Euclidean Distance (SED) error, which measures the distance between two positions at identical timestamps (Figure 4). The SED error is given by the Euclidean Distance between the intermediate point I and the temporally synchronized point I′ on L. The synchronized point I′ is defined as:

x_I′ = x_S + r (x_E − x_S),  y_I′ = y_S + r (y_E − y_S)

where r is the time ratio between segments SI and SE:

r = (t_I − t_S) / (t_E − t_S)

(Figure 3. Illustration of the PD measure used in the DP algorithm.)
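Assuming each position is an (x, y, t) tuple, the SED error can be sketched as:

```python
def sed_error(S, I, E):
    """SED: distance from I to the time-synchronized point I' on segment SE."""
    (xs, ys, ts), (xi, yi, ti), (xe, ye, te) = S, I, E
    # fraction of the segment's duration elapsed at I's timestamp
    r = (ti - ts) / (te - ts) if te != ts else 0.0
    xi_sync = xs + r * (xe - xs)
    yi_sync = ys + r * (ye - ys)
    return ((xi - xi_sync) ** 2 + (yi - yi_sync) ** 2) ** 0.5
```

Unlike PD, a point lying exactly on the chord can still have a large SED if the vessel passed through it earlier or later than constant-speed travel would predict.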

Speed based
The Speed-Based algorithm applies the Absolute Value of Speed (AVS) to compress trajectories; the velocities involved are illustrated in Figure 5. It computes the absolute value of the difference between the speeds of two subsequent segments, SI and IE, where I is an intermediate point between S and E. The speed of a segment is calculated as:

v_SI = ||S, I||₂ / (t_I − t_S)

where ||a, b||₂ represents the Euclidean Distance between a and b. The AVS error is then |v_SI − v_IE|.
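Assuming (x, y, t) positions again, a minimal sketch of the AVS measure; treating a zero-duration segment as zero speed is our assumption:

```python
def avs_error(S, I, E):
    """AVS: absolute difference of average speeds on segments SI and IE."""
    (xs, ys, ts), (xi, yi, ti), (xe, ye, te) = S, I, E
    d_si = ((xi - xs) ** 2 + (yi - ys) ** 2) ** 0.5
    d_ie = ((xe - xi) ** 2 + (ye - yi) ** 2) ** 0.5
    v_si = d_si / (ti - ts) if ti != ts else 0.0
    v_ie = d_ie / (te - ti) if te != ti else 0.0
    return abs(v_si - v_ie)
```

A point where the vessel neither turned nor changed speed scores zero under AVS and is discarded, which is why SB compresses less in regions with frequent speed changes.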

Combining compression algorithms
In order to combine two compression algorithms, the metrics of each one are computed and analyzed individually and sequentially (Makris et al. 2021b). For example, when combining TR with SB (TR + SB), the SED is computed and evaluated first; if it is lower than the threshold, the AVS is then computed and analyzed. These evaluation measures are defined by the methods being combined. The combinations included: TR + SB, SB + TR, DP + SB, SB + DP, TR + DP, and DP + TR. It is worth mentioning that the threshold values are set individually for each evaluation measure used by the algorithms TR, SB, and DP, making them independent of the other compression algorithm applied, as described at the beginning of Section 2.2.

Evaluation metrics
To evaluate the clustering results, the NMI metric was used. NMI measures how much information is preserved when comparing the clustering results of the control to the results using compression methods. This measure is commonly used to compare clustering results as it is based on the probability of points belonging to the same group (Kvålseth 2017). Therefore, the higher the NMI value, the more similar the clustering results are.
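As an illustration, NMI can be computed from two label assignments as follows. This sketch uses the arithmetic-mean normalization, which is one of several common variants, so library implementations may differ in the exact denominator:

```python
from math import log
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two clusterings of the same items,
    normalized by the arithmetic mean of the two entropies."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # mutual information from the joint and marginal label frequencies
    mi = sum(nij / n * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())

    def entropy(counts):
        return -sum(v / n * log(v / n) for v in counts.values())

    ha, hb = entropy(ca), entropy(cb)
    if ha == 0 or hb == 0:
        return 1.0 if ha == hb else 0.0
    return mi / ((ha + hb) / 2)
```

NMI is invariant to relabeling of the clusters, so a compressed run that recovers the same groups under different cluster IDs still scores 1.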
In addition, the Mantel Test (Mantel 1967) was used to evaluate how similar the distance matrices are before and after the compression algorithms were applied. It also provides the p-value that indicates if their differences are statistically significant. Therefore, the correlation and p-values provided by the Mantel test were used to evaluate the impact that the compression algorithms had on the calculation of the distances between trajectories.
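A permutation-based sketch of the Mantel test on two symmetric distance matrices; the number of permutations and the two-sided p-value convention used here are our assumptions:

```python
import random

def mantel(D1, D2, permutations=999, seed=0):
    """Mantel test: Pearson correlation between the upper triangles of two
    distance matrices, with a p-value from joint row/column permutations of D2."""
    n = len(D1)
    idx = [(i, j) for i in range(n) for j in range(i + 1, n)]

    def pearson(x, y):
        m = len(x)
        mx, my = sum(x) / m, sum(y) / m
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    x = [D1[i][j] for i, j in idx]
    r_obs = pearson(x, [D2[i][j] for i, j in idx])
    rng = random.Random(seed)
    order = list(range(n))
    count = 0
    for _ in range(permutations):
        rng.shuffle(order)  # permute rows and columns of D2 together
        y = [D2[order[i]][order[j]] for i, j in idx]
        if abs(pearson(x, y)) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (permutations + 1)
```

Permuting rows and columns jointly (rather than shuffling entries independently) preserves the matrix structure, which is the defining feature of the Mantel test.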

Experiments
For this analysis, the selected trajectories represent tanker vessels located in the Juan de Fuca Strait 4 and fishing vessels in San Francisco Bay, 5 both from April to June of 2020. Figure 6 illustrates these trajectories. The dataset has a total of 90 tanker trajectories and 55 fishing trajectories after preprocessing.
The experiments use NMI to measure the similarity of the clustering results between the compressed trajectories and the results obtained with the original trajectories (the control). Since the clustering algorithm relies on the distance between the trajectories, an analysis of the final distance matrices was performed using the Mantel test. In addition, HDBSCAN was performed using the same hyper-parameters, empirically selected for each dataset, to allow for a fair comparison. For tanker vessels, the analysis used 3 as the minimum size of a cluster, while for fishing vessels it was set to 2. Both analyses used 1 as the minimum number of samples in a neighborhood.

(Figure 6. Illustration of the tanker and fishing vessel trajectories.)
Initially, in Section 3.1, the compression ratio for each compression algorithm was explored using different factors to investigate the performance of each algorithm. Such an investigation supports the selection of a suitable compression algorithm, since the ones with a high ratio should reduce the computational cost of the framework. Then, the impact of the compression algorithms using DTW, HD, DFD, and MD was analyzed for tankers in Section 3.2. Such an analysis supports the selection of an adequate compression algorithm according to the distance measure applied, creating a trade-off between the processing time and the clustering results. In order to evaluate the impact of the compression algorithms for trajectories with more chaotic movements, the same analysis was conducted using fishing vessels, which is presented in Section 3.3.
The processing time was also computed to analyze whether the addition of the compression process reduces the overall computational cost. Python libraries with parts of their code implemented in C were used to compute the DTW distance and HD, while MD and DFD were implemented in pure Python. Due to this difference in implementation, it is not fair to compare the processing times across distance measures. Therefore, the processing time is only used to evaluate the impact of the compression algorithms on the overall computational cost of the clustering analysis.

Analysis of the compression ratio
The compression ratio of each trajectory was calculated, and the average was used to estimate the overall compression ratio for each algorithm using different factors. Such factors define the threshold for the compression algorithms, as described in Section 2.2. In this analysis, the explored factors were empirically selected and included: 1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 1.5, and 2. Figure 2 shows the compression ratio of each configuration for each dataset, where the colors represent different compression algorithms.
The DP, TR, and TR + SB algorithms maintain a high compression ratio for both high and low factor values. However, the compression ratio of the SB and SB + TR algorithms decreases as the factor decreases, indicating that the compressed trajectories still contain a large number of messages. A possible reason for SB having a low compression ratio is that its measure is related to the speed of the vessel, which can vary considerably due to the moving patterns employed in that specific region. The other algorithms rely on spatial features, which have low variability, especially for tanker vessels that tend to travel in the same direction for long periods of time. In addition, the compression ratio for fishing vessels is generally lower than for tankers. Such behavior is due to the chaotic movements that fishing vessels make, with more turns and maneuvers than a tanker would make.

Analysis for tanker vessels
The clustering results for our control experiments (results without applying compression algorithms) are illustrated in Figure 7, which shows the dendrogram of the clusters produced with HDBSCAN using 3 as the minimum size of a cluster. In this scenario, HDBSCAN produced 3 clusters using DTW, 10 when using HD, 10 when using DFD, and 5 when using MD.
In order to evaluate the impact that compression algorithms have on the clustering analysis, the trade-off between the processing time and the NMI obtained for all compression algorithms needs to be analyzed. Thus, Figure 8 shows the total processing time for each configuration, including the compression and clustering executions. DP, TR, and TR + SB maintain a low processing time, even when smaller thresholds were used. SB and SB + TR have an increased processing time as the threshold decreases, which is expected as their compression ratio is low for lower thresholds. This work shows that, on average, SB reduces the processing time by approximately 94% when using MD, 95% when using DFD, and 79% when using DTW. This result shows that SB, which has a higher processing time, still significantly reduces the processing time of the clustering analysis using MD despite its low compression ratio.
Next, the NMI was calculated for all compression algorithms and was compared to the control. Figure 9 shows the NMI results using DTW, HD, DFD, and MD, respectively. The results for DTW show that TR and SB have high NMI values, while DP had high NMI values when using HD and DFD. However, the NMI values are lower when MD is used, which shows that MD is more sensitive to compression algorithms.
One explanation for why SB had the highest NMI with DTW is that it takes the speed component into account, which is relevant information for providing a good alignment to DTW. Likewise, TR incorporates the time component, which also provides a good alignment for DTW, performing slightly better than DP. The combinations of compression algorithms did not outperform the individual algorithms, and lower compression factors provided higher NMI values. In conclusion, individual compression algorithms, especially SB and TR, are good choices when performing clustering analysis using DTW. Also, a factor lower than 1/16 provides a reasonable result (NMI larger than 0.6).
In the case of HD and DFD, DP provided the highest NMI in general, followed by combinations with SB. This result indicates that maintaining speed and time information in the compressed trajectories is not relevant for HD and DFD. This occurs because both focus more on the sequence composition of the positions than on the distance between them. Another general aspect is that SB has a low compression ratio, i.e., the trajectories are not reduced as much, which can be the reason for its generally good performance.
The obtained distance matrices were compared with the control distance matrix using the Mantel test, which provides the correlation and evaluates if they are statistically similar. The Mantel correlation for DTW, HD, DFD, and MD is presented in Figure 10.
For most configurations with DTW, the p-value was less than 0.05, which shows that the distance matrices are statistically similar. However, for DP + TR using a factor of 1.5, the p-value is higher than 0.05, which shows that they are statistically different. The correlation results increased when the threshold decreased, as presented in Figure 10(a). This behavior is expected, as compression algorithms with lower thresholds tend to keep more messages from the original trajectory. From these results, the TR and SB algorithms have higher correlation values, which shows that DTW requires speed or time information to obtain a relative distance amongst the trajectories. These results also corroborate the NMI values presented.
All the results using HD and DFD provided p-values lower than 0.05, showing that they are statistically similar. According to the Mantel correlation, the three individual compression algorithms using lower factor values are the most similar to the control. In the case of DFD, all distance matrices provided a higher correlation value; however, the NMI values indicated that they did not provide similar cluster results. Although the distance matrices seem similar, the clustering analysis is impacted by the compression algorithms when using DFD. The correlations obtained when using HD were higher for the individual algorithms. Besides, the correlation decreases as the factor value increases, which is consistent with the NMI values presented.
For the MD tests, the distance matrices are statistically similar to the control, except for the TR algorithm using a factor of 1.5, which provides a p-value higher than 0.05. The TR and SB algorithms have higher correlation values when a factor lower than 1 is used, as shown in Figure 10(d). Although the Mantel correlation shows that SB with lower factor values is the most similar to the control, MD did not perform well on compressed trajectories according to the NMI values.
In summary, although the clustering analysis provided is not identical to the control, TR and SB with a factor of 1/16 are suitable when DTW is selected to perform the clustering analysis. In this scenario, SB had a compression ratio of 0.78, which reduced the processing time by 75%, and had an NMI of 0.78. On the other hand, TR had a compression ratio of 0.99, which reduced the processing time by 99%, and had an NMI of 0.67. The Mantel correlation for SB was 0.76, while for TR it was 0.84, which means that TR is correlated more closely to the control.
Clustering analysis with HD seems to be adequate when using DP, TR, and SB with a factor of 1/64. In this scenario, DP had a compression ratio of 0.98, which reduced the processing time by 99%, and had an NMI of 0.85. TR had a compression ratio of 0.97, which reduced the processing time by 99%, and had an NMI of 0.86. And SB had a compression ratio of 0.53, which reduced the processing time by 73%, and had an NMI of 0.98. All three algorithms presented a Mantel correlation of 0.99.
In the case of DFD, clustering analysis seems to be a good approach when using DP and SB with a factor of 1/64. In this scenario, DP had a compression ratio of 0.98, which reduced the processing time by 99%, and had an NMI of 0.94. On the other hand, SB had a compression ratio of 0.53, which reduced the processing time by 81%, and had an NMI of 0.99. Both algorithms presented a Mantel correlation of 0.99.

Analysis for fishing vessels
Initially, we executed the control experiments, i.e., the clustering analysis without applying compression algorithms. Figure 11 shows the dendrogram of the clusters produced with HDBSCAN using 3 as the minimum size of a cluster. In this scenario, HDBSCAN produced 6 clusters using DTW, 12 when using HD, 7 when using DFD, and 13 when using MD.
The impact of the compression algorithms was also analyzed for fishing vessels because their movement behaviors can be more chaotic while fishing. The total processing time for this framework using fishing vessels is shown in Figure 12 for DTW, HD, DFD, and MD, respectively. Similarly to the tanker results, most of the configurations reduced the overall processing time. However, when using HD, SB and its combinations increased the processing time, meaning that the addition of the SB algorithm did not compress the trajectory enough to reduce the distance calculation.
The NMI measures for DTW, DFD, HD, and MD distances are illustrated in Figure 13. Similar to the tanker trajectories analysis, SB presented the highest NMI values for all distance measures when using lower factor values. The largest NMI was 1 when using DFD with SB for test runs with factors of 1/32, 1/64, and 1/128. In addition, DFD with TR and DP provided high NMI values for most of the factors used. Such behavior shows that the good performance of SB stems more from the low compression rate than from the speed component used by the algorithm. In the case of DTW and HD, the speed component of SB is the relevant information used to provide a similar cluster analysis, as the NMI is high for most of the factors. On the other hand, NMI values using MD were higher for fishing vessels than for tankers, indicating that MD might be suitable for the comparison of local and chaotic trajectories. According to the results, MD seems to be affected more by the compression than DTW, HD, and DFD, confirming that MD is not robust for compressed trajectories.
The Mantel correlation is presented in Figure 14 for DTW, HD, DFD, and MD distances. According to the p-value from the Mantel test (considering α = 0.05) for all the configurations using HD and DFD, the results were statistically similar to the control, meaning that they originated from the same probability distribution. In addition, the correlations provided by both are close to 1 in all configurations tested. On the other hand, TR with DTW had a p-value larger than 0.05 for factors greater than 1/4. In contrast, SB and SB + TR presented higher correlations for factors lower than 1/16. In the case of MD, most configurations resulted in higher p-values, with only the individual algorithms DP, TR, and SB being statistically similar to the control. However, only SB and TR presented a high correlation for most factors tested.
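The Mantel test correlates the entries of two distance matrices and assesses significance by permuting the rows and columns of one matrix. A minimal permutation sketch (not necessarily the exact implementation or permutation count used in the experiments):

```python
import numpy as np

def mantel(d1: np.ndarray, d2: np.ndarray, perms: int = 999, seed: int = 0):
    """Pearson correlation between two distance matrices plus a permutation p-value."""
    iu = np.triu_indices_from(d1, k=1)          # upper triangle, diagonal excluded
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    n, hits = d1.shape[0], 0
    for _ in range(perms):
        p = rng.permutation(n)                   # relabel items in one matrix
        r = np.corrcoef(d1[p][:, p][iu], d2[iu])[0, 1]
        if abs(r) >= abs(r_obs):
            hits += 1
    return r_obs, (hits + 1) / (perms + 1)

# Hypothetical check: a distance matrix against an affine rescaling of itself
# should correlate perfectly and be highly significant.
rng = np.random.default_rng(1)
pts = rng.normal(size=(12, 2))
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
r, p = mantel(d, 1.5 * d + 0.1)
print(r, p)
```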
In summary, when analyzing fishing vessels, SB, TR, and SB + TR are suitable to use when DTW is selected to perform the clustering analysis. In this scenario, SB with a factor of 1/128 had a compression ratio of 0.15, which reduced the processing time by 11% and had an NMI of 0.94. On the other hand, TR with a factor of 1/64 had a compression ratio of 0.65, which reduced the processing time by 92% and had an NMI of 0.83. The Mantel correlation for SB was 0.80, while for TR it was 0.42, meaning that SB correlates more closely with the control.
Clustering analysis with HD seems to be adequate when using TR and TR + SB with a factor of 1/64. In this scenario, TR had a compression ratio of 0.65, which reduced the processing time by 96%. On the other hand, TR + SB had a compression ratio of 0.66, which reduced the processing time by 95%. For both algorithms, the NMI was 0.92 and the Mantel correlation was 0.99.
In the case of DFD, clustering analysis seems to be a good approach when using DP, TR, and SB with a factor of 1/32. In this case, DP had a compression ratio of 0.73, which reduced the processing time by 98% and had an NMI of 0.93. On the other hand, TR had a compression ratio of 0.71, which reduced the processing time by 99% and had an NMI of 0.96. Lastly, SB had a compression ratio of 0.28, which reduced the processing time by 53% and had an NMI of 1. The three algorithms had a Mantel correlation of 0.99.

Performing clustering analysis with MD should be reasonable when using SB and SB + TR with a factor of 1/128. In this scenario, SB had a compression ratio of 0.15, which reduced the processing time by 46% and had an NMI of 0.65. On the other hand, SB + TR had a compression ratio of 0.17, which reduced the processing time by 68% and had an NMI of 0.65. Both algorithms had a Mantel correlation of 0.99.

Conclusions
The proposed methodology evaluates how compression algorithms affect the clustering results obtained by HDBSCAN with respect to anomaly detection of tanker and fishing vessel trajectories. Both NMI and the Mantel test were used to evaluate the similarity of the clustering results before and after applying the compression algorithms. In this analysis, three compression algorithms and combinations of these algorithms were explored. The experimental results confirmed that compression algorithms indeed reduce the overall processing time of clustering analysis. The time complexity of the distance measures is O(NM), where N and M are the lengths of the two trajectories under analysis. However, when using fastDTW (Salvador and Chan 2007), the complexity is reduced to O(N), providing near-optimal alignments. As the distances are dependent on the trajectories' lengths, reducing the size of the trajectories also reduces the processing time when computing the distances. As a result, the additional computation of the compression process reduces the overall processing time because the distance calculations are the most expensive process.
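The O(NM) cost comes from the dynamic-programming table that DTW fills over every pair of positions. A minimal sketch over 2-D points, using Euclidean local cost (an illustration of the classic algorithm, not necessarily the exact variant used in the experiments):

```python
import math

def dtw(a, b):
    """Classic O(N*M) dynamic time warping distance between two point sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # local (Euclidean) cost
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

traj_a = [(0, 0), (1, 1), (2, 2)]
traj_b = [(0, 0), (1, 1), (1, 1), (2, 2)]  # same path, one point repeated
print(dtw(traj_a, traj_b))  # 0.0 -- warping absorbs the repeated point
```

Compressing either trajectory shrinks N or M and hence the table, which is why the distance step dominates the savings reported above.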
NMI was used to measure how the compression algorithms affect the clustering results with DTW, HD, DFD, and MD. The obtained results show that DTW, HD, and DFD are more robust to compression algorithms than MD because they had higher NMI values. In general, SB with a factor of 1/64 seems to be the most suitable compression algorithm to apply within the clustering analysis because it provides high NMI values and still reduces the processing time. Also, SB takes the speed component into account, which is relevant to the movement patterns of vessels. In the case of DTW, TR is another suitable compression algorithm as it provided high NMI values and high compression ratios. For DFD, DP is a more adequate choice as it provides high NMI values for most of the factors tested and a high compression rate. When using HD, TR and DP are suitable for clustering analysis, providing high NMI values and a high correlation. However, HD does not consider the sequence or the distance between adjacent positions, so all three individual algorithms provided similar clustering results. Clustering analysis using the MD distance is highly impacted by the compression algorithms. As the MD distance is based on the sum of the distances between every AIS message in the trajectories, compressing the trajectories may significantly change these distances by reducing their variance. This impact is higher for tanker vessels than for fishing vessels because fishing vessel trajectories contain chaotic movements that preserve more variance (lower compression rate), yielding higher NMI values.
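NMI compares the cluster labels obtained before and after compression on the same set of trajectories. A minimal pure-Python sketch using arithmetic-mean normalization (library implementations may use a different normalization than the one used in the paper):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings of the same items."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    # Mutual information between the two label assignments
    mi = sum(c / n * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    # Entropies of each clustering, used for normalization
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    return mi / ((ha + hb) / 2) if ha and hb else 1.0

control    = [0, 0, 1, 1, 2, 2]   # hypothetical control cluster labels
compressed = [1, 1, 0, 0, 2, 2]   # same partition, merely relabeled
print(nmi(control, compressed))   # 1.0 -- identical clusterings up to label names
```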
Additionally, it was found that combining compression algorithms did not improve the outcome of the clustering analysis compared to using individual ones, indicating that combining the three algorithms brings no improvement while increasing the processing time. New strategies to combine different aspects of the compression algorithms should be explored to improve the compression process.
In general, according to the results, DTW benefits from the speed and time components included in the SB and TR algorithms, respectively, while DFD benefits from the sequence of positions preserved by the DP algorithm. Considering trajectory movement aspects, chaotic behavior makes it difficult for the compression algorithms to reduce the trajectory length, producing a low compression rate. Therefore, the clustering analysis is more similar to the control, leading to higher NMI values.
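DP's preservation of the position sequence can be seen in the classic Douglas-Peucker recursion, which keeps only points that deviate from the chord between the endpoints by more than a threshold. A minimal sketch (the paper's DP variant and threshold selection may differ):

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance of point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.dist(p, a)
    return abs(dx * (ay - py) - dy * (ax - px)) / norm

def douglas_peucker(points, eps):
    """Recursively keep only points deviating more than eps from the chord."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= eps:                       # whole span is close to the chord
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right              # splice, dropping the duplicated pivot

# Hypothetical near-straight track: a large eps collapses it to its endpoints,
# which is why chaotic (high-deviation) trajectories compress far less.
traj = [(0, 0), (1, 0.05), (2, -0.05), (3, 0.02), (4, 0)]
print(douglas_peucker(traj, 0.1))  # [(0, 0), (4, 0)]
```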
Moreover, one challenge in the experiments was the selection of the threshold, which changes the results depending on the vessel movement and the distance measure selected. Thus, in future work, an evaluation of the thresholds can be conducted to explore their impact on the clustering analysis of vessel trajectories, including adaptive factor algorithms (Liu et al. 2019). In addition, trajectory segmentation strategies can be explored as a mechanism to reduce the trajectories' lengths and the computational costs of clustering analysis for anomaly detection. Online compression algorithms can also be explored using this quantitative analysis, as they may provide information related to movement behavior more quickly. Another way to improve the quantitative evaluation is to investigate the fuzzy membership produced by HDBSCAN, which can give more detailed information on the vessel's movement and how each algorithm preserves it.

Notes
1. Invalid messages contain either a negative value for SOG or a COG value that does not fall between 0 and 359.
2. It is a geographic distance measure.
3. See Eiter and Mannila (1994) and Gudmundsson et al. (2021) for more information on the mathematical formalization.
4. A region between latitudes of [47.5, 49.3] and longitudes of [−125.5, −122.5] was selected.
5. A region between latitudes of [37.6, 39] and longitudes of [−122.9, −122.2] was selected.

Disclosure statement
No potential conflict of interest was reported by the author(s).