A Benchmark Study on Time Series Clustering

This paper presents the first time series clustering benchmark utilizing all time series datasets currently available in the University of California Riverside (UCR) archive -- the state of the art repository of time series data. Specifically, the benchmark examines eight popular clustering methods representing three categories of clustering algorithms (partitional, hierarchical and density-based) and three types of distance measures (Euclidean, dynamic time warping, and shape-based). We lay out six restrictions with special attention to making the benchmark as unbiased as possible. A phased evaluation approach was then designed for summarizing dataset-level assessment metrics and discussing the results. The benchmark study presented can be a useful reference for the research community on its own; and the dataset-level assessment metrics reported may be used for designing evaluation frameworks to answer different research questions.

latter is the single largest limitation of the archive when used for assessing clustering algorithms. Different researchers must repeat the process of implementing and benchmarking clustering algorithms over the same datasets. This may cost months or more of run time [4]; and when benchmark tests are repeated, the subjective nature of test details (e.g., pre-processing) may introduce bias that affects the objectivity and reproducibility of the test results.
The work presented in this paper aims to address the limitation associated with testing time series clustering algorithms by providing a clustering benchmark. The intent of this benchmark is similar to the classification benchmark of Dau et al. (2018) [7], that is to provide comparison with several established methods in order to reduce both the repetition of experiments and time to publication. We would add to this another goal, that is to study the impact of changing design choices that occur within a given clustering method (i.e., a combination of clustering algorithm and distance measure). Additionally, the discussion highlights the value of considering a pool of clustering methods for use in cluster analysis and provides guidance on how to select individual algorithms in such a pool. To this end, we select eight clustering methods in this benchmark study that span three types of clustering algorithms and three distance measures, and assess each while adhering to the six restrictions laid out below.
1. No pre-processing. All datasets in the archive were used without any additional pre-processing (e.g., normalization in magnitude, filtering, smoothing). The reason is that, while pre-processing is common and is shown to improve results [9], any improvement resulting from the pre-processing should not be attributed to the clustering method itself [7,10] and, even if it were, the same pre-processing may have different performance impacts on different clustering methods. (About 80% of the UCR datasets are already z-normalized.)
2. Only uniform length time series. Only datasets in which all time series have equal length are used. The reason is that some of the clustering methods used in this benchmark were designed to work only with time series of equal length. (Only 11 out of 128 datasets in the archive have varying time series length.)
3. Known number of clusters. The clustering methods used in this work require that the number of clusters, k, be provided as input. The value of k is known from the class labels annotated in the datasets. There are several techniques for estimating k [11,12,13,14], but evaluating those techniques is not part of this benchmark.
4. Minimum two classes. Only datasets with k = 2 or more classes (other than a class designated as "noise") are used, as clustering time series data that all belong to the same class (i.e., k = 1) is not meaningful. (Five datasets have fewer than two classes.)
5. Established methods. All clustering methods used in this work are well-established or have survived the test of time. They are treated with equal merit with no effort to identify one as "superior" or "inferior" to another.
6. Dataset-level assessment metrics. The assessment metrics are reported for each clustering method on each of the 112 remaining datasets. Using assessment metrics at the dataset level enables evaluation frameworks to be designed with the research questions in mind, eliminating repetitive experimentation.
The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the benchmark methods. Section 4 presents the benchmark test results. Section 5 concludes the paper.

Related work
Benchmarking, in general, has been recognized as an important step in advancing the knowledge of both supervised and unsupervised learning [10,15,16,17]. See Keogh and Kasetty [10] for a nice summary on the need to benchmark time series algorithms. They highlight many studies that use straw man algorithms to compare time series classification algorithms, and note that many of these algorithms provide little value because the levels of improvement are completely dwarfed by the variance observed when tested on real datasets or when minor unstated implementation details change. After a thorough survey of more than 350 time series data mining papers, they concluded that a median of only 1.0 (or an average of 0.91) rival methods were compared against a "novel" method (e.g., clustering algorithm, distance measure, pre-processing); and on average, each method was tested on only 1.85 datasets. While their summary is based on time series classification, the same concerns apply to time series clustering.
Works that compare time series clustering methods suggest that these comparisons have either been done qualitatively, using a theoretical approach [18,19,20], or quantitatively using an empirical approach [3,4,6]. Only the empirical approaches provide evidence of performance measured on external datasets. The UCR archive has been used for that purpose in most of the recent time series clustering comparisons [3,4,6]. However, none of them reports assessment metrics at the dataset level accounting for all datasets in the archive because the goal was to evaluate a novel method in the context of unique research questions/objectives. While it may serve individual research goals, the summarized results are often difficult and time-consuming to reproduce because of missing details (e.g., parameter settings, pre-processing details) and non-deterministic nature of the algorithm (e.g., K-means).
The absence of assessment metrics at the dataset level means that researchers must repeat experiments in order to view the tradeoffs among methods, thereby wasting precious resources and often delaying publications. The benchmark provided in this paper is intended to relax some of the burdens on researchers to foster more objective benchmark studies.

Benchmark Methods
The benchmark methods comprise clustering methods (Section 3.1) and evaluation methods (Section 3.2).

Clustering methods
There are two major design criteria in clustering methods: the clustering algorithm and the distance measure. Eight clustering methods are used in this benchmark (see Table 1). They represent three categories of clustering algorithms (partitional, density-based, and hierarchical) and three distance measures (Euclidean, dynamic time warping (DTW), and shape-based). This subsection summarizes the clustering algorithms and distance measures.

Clustering algorithms
Choice of clustering algorithms may depend on the strategy used to maximize the intra-group similarity and minimize the inter-group similarity. The algorithms considered in this benchmark cover three popularly used categories of such strategies, each described below.

Partitional
Three partitional clustering algorithms, K-means [21], K-medoids [22], and Fuzzy C-means [23], are selected based on their popularity [19] and known accuracy for time series data clustering [4]. Note that K-means with shape-based distance is K-shape [4]. These partitional algorithms generate spherical clusters that are similar in size [18], and optimize clustering by minimizing the distance between each cluster center (a.k.a. centroid) and the data points within that cluster. A centroid may or may not be an actual data point, depending on the algorithm: it is for K-medoids, and it is not for K-means and Fuzzy C-means (see Figure 1a and Figure 1b). All three of these partitional algorithms require that one input parameter be specified: the number of clusters (k). Given k, the algorithm iterates over two phases, (1) calculate centroids and (2) assign data points to their closest centroid, until some termination condition (e.g., number of iterations or convergence) is met. For all three algorithms used in this benchmark, the initial centroids are chosen at random, making the algorithm non-deterministic; all subsequent centroids are calculated so as to minimize the distance to all other data points within the given cluster.
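The two-phase iteration described above can be sketched for K-means with Euclidean distance as follows. This is a minimal illustration, not the implementation used in the benchmark; the function names, the `max_iter` default, and the convergence test (labels no longer change) are assumptions made for the example.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(series, k, max_iter=100, seed=0):
    """Minimal K-means sketch (hypothetical helper, not the benchmark code)."""
    rng = random.Random(seed)
    # Initial centroids are k randomly chosen series (source of non-determinism).
    centroids = [list(s) for s in rng.sample(series, k)]
    labels = [-1] * len(series)
    for _ in range(max_iter):
        # Assignment phase: each series goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(s, centroids[c]))
            for s in series
        ]
        if new_labels == labels:  # assumed termination condition: convergence
            break
        labels = new_labels
        # Update phase: each centroid becomes the point-wise mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


data = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 5.0, 4.9]]
labels, centroids = kmeans(data, k=2)
```

Replacing the mean in the update phase with the member that minimizes the total distance to the other members would turn this sketch into a K-medoids variant.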
While K-means and K-medoids are hard clustering algorithms (i.e., producing non-overlapping partitions), Fuzzy C-means is a soft clustering algorithm (i.e., producing overlapping partitions). In this benchmark, the Fuzzy C-means clustering results are similar to those of a hard clustering algorithm, as each data point is assigned to the cluster that has the highest probability. There are several techniques for improving the clustering accuracy of these algorithms, including performing z-score normalization on the input [24], or invoking the algorithm multiple times using different random seeds to select the clusters with the highest intra-cluster similarity and the lowest inter-cluster similarity. This benchmark excludes such techniques, per restrictions 1 and 5 (see Section 1).

Density-based
Density Peaks [5] was selected as the representative for density-based algorithms due to its recent popularity, particularly for time series clustering [6]. Unlike other density-based algorithms [25], Density Peaks is not sensitive to the "density parameter" but needs the number of clusters, k, as one of the inputs. This makes it a good fit for this benchmark, where k is assumed to be known and no assumptions are made for other input parameters.
The Density Peaks algorithm generates cluster centroids (called "density peaks") that are surrounded by neighboring data points that have lower local density (see Figure 1c) and are relatively farther from data points with a higher local density [5]. The algorithm has two phases. It first finds centroids (density peaks), and then assigns data points to the closest centroid. The algorithm requires two input parameters: the number of clusters (k) and the local neighborhood distance d (wherein the local density of a data point is calculated). While the value of k is assumed to be known in this benchmark, the value of d is determined as the distance wherein the average number of neighbors is 1 to 2% of the total number of observations in the dataset, following a rule of thumb proposed by the original authors [26].
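The two phases of Density Peaks can be sketched as below. This is a simplified illustration under stated assumptions: local density is a plain neighbor count within distance d, peaks are the k points with the largest product of density and distance to a denser point, and each remaining point inherits the label of its nearest higher-density neighbor (the assignment rule of the original Density Peaks paper). The function names and the tie-breaking by index are choices made for the example, and d is passed in directly rather than derived from the 1 to 2% rule of thumb.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_peaks(series, k, d, dist=euclidean):
    """Simplified Density Peaks sketch (hypothetical helper, O(n^2) distances)."""
    n = len(series)
    D = [[dist(series[i], series[j]) for j in range(n)] for i in range(n)]
    # Local density: number of other points closer than d.
    rho = [sum(1 for j in range(n) if j != i and D[i][j] < d) for i in range(n)]
    # Visit points from densest to sparsest (ties broken by index, an assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n            # distance to the nearest denser point
    nearest_higher = [-1] * n    # index of that denser point
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(D[i])  # densest point: use its largest distance
            continue
        j = min(order[:pos], key=lambda h: D[i][h])
        delta[i] = D[i][j]
        nearest_higher[i] = j
    # Phase 1: the k points with the largest rho * delta are the density peaks.
    peaks = sorted(range(n), key=lambda i: -(rho[i] * delta[i]))[:k]
    labels = [-1] * n
    for c, p in enumerate(peaks):
        labels[p] = c
    # Phase 2: every other point takes the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, peaks


points = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]]
labels, peaks = density_peaks(points, k=2, d=0.15)
```

Passing a DTW function as `dist` would give the DTW variant, at the cost of computing all pairwise DTW distances.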

Hierarchical
A hierarchical clustering algorithm can be Agglomerative (bottom-up) or divisive (top-down). In the former, each data point begins as its own cluster and cluster pairs are merged as the algorithm moves up the hierarchy. In the latter, all data points are initially assigned to a single cluster and clusters are split as the algorithm moves down the hierarchy. Because of its popularity over divisive clustering [18], Agglomerative clustering is used in this benchmark.
The algorithm has two phases. It first initializes each data point into its own cluster and then repeatedly merges the two nearest clusters into one until there are k clusters (see Figure 2). The value of k is an input to the algorithm. There are several options for measuring the distance between pairs of clusters. Ward's linkage, which minimizes the variance of data points in the merged clusters [27], is used in this benchmark due to its popularity and also its similarity to the optimization strategy of the partitional clustering methods. Other popular linkage criteria include single linkage (minimum distance between a pair of data points belonging to different clusters) and complete linkage (maximum distance between a pair of data points belonging to different clusters) [28].
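A naive sketch of the merge loop with Ward's linkage is shown below. It assumes Euclidean distance and uses the standard identity that merging clusters A and B increases the within-cluster sum of squares by |A||B|/(|A|+|B|) times the squared distance between their centroids. The function name is hypothetical, and this brute-force version is meant only to illustrate the procedure, not to be efficient.

```python
def ward_agglomerative(series, k):
    """Naive agglomerative clustering with Ward's linkage (illustrative only)."""
    length = len(series[0])
    # Phase 1: every series starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(series))]

    def centroid(members):
        return [sum(series[i][t] for i in members) / len(members)
                for t in range(length)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b were merged.
        ca, cb = centroid(a), centroid(b)
        sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq

    # Phase 2: repeatedly merge the cheapest pair until k clusters remain.
    while len(clusters) > k:
        i, j = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: merge_cost(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(series)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


data = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 5.0, 4.9]]
labels = ward_agglomerative(data, k=2)
```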

Distance measures
The choice of distance measure is the other criterion that has a direct impact on the clustering performance. This section discusses the three distance measures used in this benchmark.
Dynamic time warping
Figure 3: Alignment between two time series for calculating distance.
Dynamic time warping (DTW) is a mapping of points between a pair of time series, T1 and T2 (see Figure 3), designed to minimize the pairwise Euclidean distance. It is becoming recognized as one of the most accurate similarity measures for time series data [4,9,30]. The optimal mapping should adhere to three rules.
• Every point from T1 must be aligned with one or more points from T2, and vice versa.
• The first and last points of T1 and T2 must align.
• No cross-alignment is allowed, that is, the warping path must increase monotonically.
DTW is often restricted to mapping points within a moving window. In general, the window size could be optimized using supervised learning with training data; this, however, is not possible with clustering as it is an unsupervised learning task. Paparrizos and Gravano [3] found 4.5% of the time series length to be the optimal window size when clustering 48 of the time series datasets in the UCR archive; as a result, we use a fixed window size of 5% in this benchmark study.
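The three alignment rules and the window restriction can be illustrated with a standard dynamic-programming formulation. This is a sketch under stated assumptions: point-wise costs are squared differences with a square root taken at the end, the window is a band around the diagonal expressed as a fraction of the series length, and the rounding of the window to an integer is an arbitrary choice for the example. The function name and the `window_frac` parameter are hypothetical.

```python
import math


def dtw_distance(t1, t2, window_frac=0.05):
    """DTW distance with a band constraint (illustrative sketch)."""
    n, m = len(t1), len(t2)
    # Band half-width in points; widened if lengths differ so a path exists.
    w = max(int(round(window_frac * max(n, m))), abs(n - m))
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning t1[:i] with t2[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # forces the first points to align
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (t1[i - 1] - t2[j - 1]) ** 2
            # Monotonic steps only: no cross-alignment is possible.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])  # cost[n][m] forces the last points to align


a = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0, 4.0]
no_warp = dtw_distance(a, b, window_frac=0.0)    # band of width 0
full_warp = dtw_distance(a, b, window_frac=1.0)  # unconstrained
```

With a window of 0 the only admissible path is the diagonal, so the result coincides with the Euclidean distance; widening the window can only lower the value.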
Density Peaks with DTW as the distance measure can be computationally infeasible for larger datasets because the Density Peaks algorithm has O(n^2) complexity and therefore does not scale well [4]. We employ a novel pruning strategy (see TADPole [6]) to speed up the algorithm by pruning unnecessary DTW distance calculations.

Shape-based distance
Figure 4: Accuracy scores resulting from randomly assigning 1000 data points to a varying number of clusters.
Shape-based distance is both shift-invariant and scale-invariant [3], that is, not affected by the shifting or scaling of the time series data. It calculates the cross-correlation between two time series and produces a distance value between 0.0 and 2.0, with 0.0 indicating that the time series are identical and 2.0 indicating maximally different shapes. To ensure the distance measure is scale-invariant, the original time series, T, is z-normalized to T' as follows [3]:
$$T' = \frac{T - \mu}{\sigma}$$
where µ and σ are the mean and standard deviation of T, so that T' has mean 0 and standard deviation 1.
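A direct (quadratic-time) sketch of the shape-based distance is given below: both series are z-normalized, the cross-correlation is evaluated at every shift, and the distance is one minus the largest normalized cross-correlation. The function names are hypothetical, equal-length series are assumed (restriction 2), and the handling of constant series is an assumption of this example; practical implementations typically compute the cross-correlation with an FFT instead of the explicit loop.

```python
import math


def z_normalize(t):
    """Rescale a series to mean 0 and standard deviation 1."""
    n = len(t)
    mu = sum(t) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in t) / n)
    if sigma == 0:
        return [0.0] * n  # assumption: a constant series maps to all zeros
    return [(x - mu) / sigma for x in t]


def shape_based_distance(t1, t2):
    """1 - max normalized cross-correlation over all shifts (illustrative)."""
    x, y = z_normalize(t1), z_normalize(t2)
    n = len(x)
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        return 1.0  # assumption: treat a constant series as uncorrelated
    best = -1.0
    for shift in range(-(n - 1), n):
        # Pair x[i] with y[i - shift] wherever both indices are valid.
        cc = sum(x[i] * y[i - shift]
                 for i in range(max(0, shift), min(n, n + shift)))
        best = max(best, cc / norm)
    return 1.0 - best


base = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
scaled = [2.0 * v + 3.0 for v in base]  # same shape, different scale and offset
other = [3.0, 0.0, 1.0, 0.0, 2.0, 5.0]
```

Because of the z-normalization, `shape_based_distance(base, scaled)` is zero up to floating-point error, which is the scale invariance described above.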

Evaluation methods
The purpose of this benchmark study is to assess the performance of the eight clustering methods on the 112 datasets, as well as the impact of changing design choices in either clustering algorithms or distance measures. To this end, the evaluation framework and selected assessment metrics are discussed in this section.

Assessment metrics
Metrics for assessing clustering output may be external or internal. External measures are used when the class labels are available for individual data points. Examples include the Rand Index [31], Adjusted Rand Index (ARI) [32], Adjusted Mutual Information [33], Fowlkes-Mallows index [34], Homogeneity [35], and Completeness [35]. Internal measures quantify the goodness of clusters based on an optimization objective for the clustering output, without the need for class labels; examples include Silhouette score [36], Davies-Bouldin index [37], Calinski-Harabasz index [38], the I-index [39] and sum of square errors (SSE). We used all the external measures listed above in this benchmark because having the class labels provided in the UCR archive makes the evaluation independent of the algorithm's optimization function. Despite the popularity of the Rand Index (Figure 4f) in prior UCR archive studies [3,4,6], we find the adjusted measures more suitable for clustering because they are independent of the number of clusters. As demonstrated in Figure 4, the accuracy scores resulting from random cluster assignment are consistently low as the number of clusters varies for the two adjusted measures (Figures 4a and 4b), while this is not the case for the other measures. In this work, the Adjusted Rand Index was selected as the default measure.
For the partitional algorithms in this benchmark, all of which are non-deterministic, the scores reported for each external measure are the average over ten runs using randomly selected initial centroids.

Adjusted Rand Index
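The Adjusted Rand Index is the adjusted-for-chance version of the more commonly used Rand Index. Given two sets of clusters, X and Y, and a contingency table where each cell n_ij is the number of elements in both the i-th cluster of X and the j-th cluster of Y, the Adjusted Rand Index is calculated as shown in Equation 3.
$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}} \tag{3}$$
where a_i is the sum of the i-th row and b_j is the sum of the j-th column in the contingency table, and n is the total number of elements.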
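The Adjusted Rand Index can be computed directly from two label lists by building the contingency table and its row and column sums. The sketch below follows the standard definition; the function name is hypothetical, at least two data points are assumed, and returning 1.0 in the degenerate case where the denominator is zero is a convention adopted for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """ARI between two labelings of the same n >= 2 data points."""
    n = len(labels_x)
    contingency = Counter(zip(labels_x, labels_y))  # cell counts n_ij
    a = Counter(labels_x)                           # row sums a_i
    b = Counter(labels_y)                           # column sums b_j
    sum_ij = sum(comb(v, 2) for v in contingency.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)  # index expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # assumed convention for the 0/0 case
    return (sum_ij - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that differ only in the names of the clusters, so the index is 1; the second compares two partitions that split every pair, giving a negative value.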

Spread between clustering outputs
The measure of spread quantifies how much the accuracy scores of two clustering methods differ from each other over multiple datasets (see Equation 4).
where A1_i and A2_i are the accuracy scores of the two methods for dataset i; and n is the total number of datasets.

Evaluation framework
Researchers will often design an evaluation framework for assessing accuracy because what constitutes "good" with respect to the assessment metrics may vary depending on the research question. One of the simplest approaches is to rank the performance of each clustering method and tally the number of winning performances across all available (in this work 112) datasets. This approach, however, is not without bias, as it depends on the distribution of both the datasets and clustering methods. For instance, in this work there are five partitional methods and one density-based method. If one half of the datasets are amenable to partitional methods and the other half to density-based methods, this tally will favor the density-based method because the wins for the partitional methods would be split across the five methods. At the other extreme, if pairwise comparison were performed on all clustering methods, it would result in 28 (= C(8, 2), i.e., 8 choose 2) pairwise comparisons for each of the 112 datasets (i.e., 3,136 comparisons). More importantly, a pairwise comparison assumes that every algorithm is designed to achieve the same result.
Based on the above challenges, we designed a phased evaluation approach in this benchmark study. This approach first compares the eight clustering methods, and then controls for either the distance measure or clustering algorithm while evaluating the impact of changing the other.
Phase 1 All eight methods are compared using all datasets, and the resulting accuracy is averaged over all datasets for each method.
Phase 2 Partitional algorithms with Euclidean distance are compared to select the one that achieves the highest accuracy on the largest number of datasets.
Phase 3 Different distance measures are compared using the partitional algorithm selected in Phase 2.
Phase 4 Clustering algorithms belonging to different categories are compared using Euclidean distance. Among them, the partitional algorithm is the one selected in Phase 2 (i.e., K-means with Euclidean distance).
Phase 5 Density Peaks algorithm using Euclidean distance is compared with Density Peaks algorithm using DTW.
Phase 6 Density Peaks algorithm using DTW is compared with the partitional algorithm selected in Phase 2 but using DTW.
In Phase 1, we report the average ARI and standard deviation across all datasets. In each subsequent phase, we report the number of datasets (called "winning count") for which an algorithm or a distance measure achieved the highest ARI, and refine the comparison with the measure of spread (see Section 3.2.1) and the associated scatter plots. Here, datasets that result in an ARI score lower than 0.05 are excluded from winning counts since scores that approach 0.00 represent random assignment.
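The winning count described above can be sketched as follows. The ARI values in the example are made up for illustration, the function name is hypothetical, and two details are assumptions of this sketch rather than statements about the benchmark: a dataset is skipped when even its best ARI is below the 0.05 threshold, and a tie for the best score is credited to no method.

```python
def winning_counts(ari_by_method, threshold=0.05):
    """Count, per method, the datasets on which it has the single best ARI."""
    methods = list(ari_by_method)
    n_datasets = len(ari_by_method[methods[0]])
    counts = {m: 0 for m in methods}
    for i in range(n_datasets):
        best = max(ari_by_method[m][i] for m in methods)
        if best < threshold:
            continue  # assumption: near-random results yield no winner
        winners = [m for m in methods if ari_by_method[m][i] == best]
        if len(winners) == 1:  # assumption: ties are not counted
            counts[winners[0]] += 1
    return counts


# Hypothetical ARI scores of two methods on four datasets.
scores = {
    "method_a": [0.30, 0.02, 0.10, 0.40],
    "method_b": [0.20, 0.04, 0.10, 0.50],
}
counts = winning_counts(scores)
```

In this toy example the second dataset is excluded by the threshold and the third is a tie, so each method is credited with one win.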

Benchmark Test Results
This section provides the results of dataset-level assessment (Section 4.1) and the phased evaluation (Section 4.2), and discusses the results (Section 4.3).

Dataset-level assessment
Appendix A shows the Adjusted Rand Index (ARI) scores for all eight clustering methods on the 112 shortlisted datasets (see Section 1) in the UCR archive (Table 8), and the spread of ARI scores (Table 9) between each pair of clustering methods. Additionally, in line with restriction 6 (dataset-level assessment; see Section 1), the scores of each clustering method on each dataset for all six external measures (see Section 3.2.1) are available on GitHub [40] along with the source code.

Phased evaluation
Phase 1 - Ranked comparison of all methods using average ARI
Figure 5 shows the average ARIs for each of the eight clustering methods in decreasing order; and Table 2 provides corresponding detail including the standard deviation of the ARI. The highest average ARI was 0.26 for Agglomerative clustering using Ward's linkage and Euclidean distance; and the lowest average ARI was 0.16 for Density Peaks using DTW as the distance measure. The standard deviations show wide variation in method performance.
Phase 2 - Comparison of partitional algorithms using Euclidean distance
Of the partitional clustering methods that use a Euclidean distance measure, K-means had a winning count of 54 datasets, while Fuzzy C-means and K-medoids performed best on 31 and 18 datasets, respectively (see Table 3). While K-means had a higher ARI score in almost twice as many datasets, differences in score values were minor, with a spread of only 0.005 against K-medoids (see Figure 6a) and only slightly larger (0.010) against Fuzzy C-means (see Figure 6b). This result is not surprising, given the similarity of methodology (all partitional using Euclidean distance) across the three algorithms.

Phase 3 - Comparison of distance measures using selected partitional algorithm
When we examine the winning counts for K-means (i.e., the method that performed best in Phase 2) using the three distance measures, the tallies are 32, 31 and 28 for DTW, shape-based, and Euclidean, respectively (see Table 4). A pairwise comparison between the distance measures also shows the winning counts to be 45 vs. 38 between DTW and Euclidean, 52 vs. 38 between DTW and shape-based, and 45 vs. 44 between shape-based and Euclidean. The scatter plots in Figure 7 show the spreads between each of the paired distance measures. The shape-based distance has a relatively larger spread with each of the other two measures. As a side note, when the optimal DTW window size is assumed to be known, DTW will always achieve a score that is higher than or equal to that of Euclidean distance, since the two measures are equivalent when the window size is 0.
Phase 4 - Comparison of clustering algorithms using Euclidean distance
When we hold the distance measure (in this case, Euclidean distance) constant and examine the winning counts across the clustering algorithms that use this distance measure, the tallies are 45, 21, and 19 in the order of Agglomerative, K-means, and Density Peaks. A pairwise comparison is also shown in Table 5, where the winning counts are 57 vs. 26 between Agglomerative and Density Peaks, 52 vs. 30 between Agglomerative and K-means, and 60 vs. 23 between K-means and Density Peaks. Despite the difference in winning counts, the spread of ARI values between Agglomerative and K-means (see Figure 8a) is fairly small compared with the spread of either method with Density Peaks (see Figure 8b and Figure 8c).

Phase 5 - Comparison of Euclidean distance and DTW in Density Peaks algorithm
The Density Peaks algorithm achieved a higher winning count (i.e., across 45 datasets; see Table 6) when Euclidean distance was used as the distance measure compared to a count of 31 with DTW. Figure 9 shows the spread of ARI scores between Euclidean distance and DTW to be 0.021.
Figure 6: Spread of ARI scores between each pair of the three clustering algorithms with Euclidean distance in Phase 2. (a) K-medoids vs. K-means. (b) Fuzzy C-means vs. K-means. (c) K-medoids vs. Fuzzy C-means.

Phase 6 - Comparison of Density Peaks and selected partitional algorithm using DTW
Lastly, when the DTW distance measure is held constant, we may compare across the clustering algorithms that use this distance measure: Density Peaks and K-means. K-means achieved a higher winning count (i.e., winner across 60 datasets; see Table 7) compared to a winning count of 24 for Density Peaks. But while the winning count appears skewed in favor of K-means, there are still a considerable number of datasets for which Density Peaks achieved higher ARI, and the spread of ARI scores (see Figure 10) was the largest (0.052) observed in the six phases.

Discussion
This section analyzes the results of each evaluation phase and provides concluding remarks summarizing the analysis.
Phase 1 - Ranked comparison of all methods using average ARI
The high standard deviations associated with the average ARI of Table 2 suggest that accuracy is dependent on which clustering method is used on which dataset; and that it may be fair to conclude that we have no clear winner in this benchmark. This high variability also suggests that using a simple winning count of dataset-level assessment as the only means of evaluation may be very misleading. While reporting counts of win-lose-tie for clustering method accuracy has become common practice in the literature, the UCR archive authors describe it as not that useful [7]. In light of these issues, we used both winning counts and the ARI scores in this benchmark and reinforced the measures with ARI score scatter plots and the associated spreads.
Phase 2 - Comparison of partitional algorithms using Euclidean distance
When comparing the three partitional algorithms that use the Euclidean distance measure, a researcher may well select K-means based on the winning count (see Table 3), especially without adequate prior knowledge of how the algorithm performs on the individual datasets. However, the selection may change when the user has knowledge of the dataset and/or application at hand. For instance, K-medoids is more resilient to outliers, because the medoids are not as sensitive to the presence of outliers as, say, the centroids in K-means. In another example, Fuzzy C-means may be preferred over K-means given a dataset where the membership of data points is "soft", as in the case when categorical classes have numerical attribute values that overlap. As an aside, Fuzzy C-means shows a larger spread of ARI scores against K-means (Figure 6b) and K-medoids (Figure 6c), indicating that changing from K-means to the fuzzy mechanism of C-means has more impact on the final clustering than changing from means to medoids.
Phase 3 - Comparison of distance measures using selected partitional algorithm
The results in Table 4 appear to suggest that the winning count does not favor the shape-based distance measure in the same manner that it did in a prior study [4] that used 85 datasets in the UCR archive compared to the 112 datasets (and different evaluation criteria) used in this benchmark study. The larger spreads observed when one distance measure is shape-based (Figures 7b and 7c) suggest that it is the best distance measure for a nontrivial number of datasets and, therefore, should be considered in a pool of potential clustering methods. We believe the larger spread may be a result of the shape-based distance measure's lack of sensitivity to the magnitudes and shifts in time series data compared with the Euclidean measure or, for that matter, DTW (for which the underlying distance measure is also Euclidean), which therefore results in a different partitioning.
Phase 4 - Comparison of clustering algorithms using Euclidean distance
The very small spread in Figure 8a shows similar performance for the K-means and Agglomerative algorithms on most datasets in the archive. With Agglomerative clustering, this can be attributed to the use of Ward's linkage, which merges the two clusters that, when combined, provide the minimum increase in variance. This optimization using Ward's linkage has some similarity to optimizing the centroids in K-means (i.e., minimizing the total variance within cluster). Using a different linkage criterion such as "complete" linkage does not bias clusters to be as spherical as Ward's linkage (and, for that matter, K-means) does. Such a change will result in different clusters when compared to K-means. Specifically, with complete linkage, Agglomerative clustering has a measure of spread of 0.026 when compared to K-means, and an average ARI of 0.17 ± 0.24.
Phase 5 - Comparison of Euclidean distance and DTW in Density Peaks algorithm
The spread (0.021) between DTW and Euclidean (see Figure 9) in the Density Peaks algorithm is relatively consistent with the spread (0.016) between DTW and Euclidean in the K-means algorithm (see Figure 7a). These medium-to-high spread values indicate that the clusters formed with DTW differ from those formed with Euclidean distance. Density Peaks is an O(n^2) complexity algorithm (where n is the number of data points) that, when used with DTW, may become computationally infeasible for large datasets. The TADPole method [6], with its novel pruning strategy, makes Density Peaks with DTW feasible enough for use on large datasets in the archive. However, even with this accelerated TADPole, the largest 20 datasets of the archive took 32 days to cluster on a dual 20-Core Intel Xeon E5-2698 v4 2.2 GHz machine with 512 GB 2,133 MHz DDR4 RDIMM.
Phase 6 - Comparison of Density Peaks and selected partitional algorithm using DTW
When using DTW as a distance measure, K-means and Density Peaks produce different clusters, as indicated by the relatively high ARI spread of 0.052 (see Figure 10), which is consistent with the somewhat high spread of 0.036 observed between the two methods with Euclidean distance (see Phase 4, Figure 8c). This result is counter-intuitive given that both K-means and Density Peaks form spherical clusters by assigning data points to the closest centroid, and leads one to speculate that the cause may be the fundamentally different locations of the centroids in the K-means and Density Peaks algorithms (see Figure 1).

Concluding remarks
Overall, this benchmark study shows that among all methods tested, the variation in performance, as measured by the average and standard deviation of ARI (see Table 2 and Figure 5), is higher than the variation observed across winning counts (Table 3 to Table 7). Notably, no one method performs better than the others for all datasets in this benchmark, and method performance is dependent on the datasets, as well as the evaluation criteria (i.e., objective). Similar findings for time series representation methods and distance measures were made in an earlier benchmark study using the UCR archive [16]. In light of these findings, as well as noting that exploratory cluster analysis typically involves multiple clustering methods rather than a single method to identify clusters of interest, cluster analysis should be conducted by selecting a pool of methods that produce different clusters, rather than those that produce relatively similar clusters. In other words, select methods that show greater spread (i.e., combination of average accuracy scores and their spread) rather than those with higher winning counts. Methods with higher spreads of ARI are likely to provide different clusters for the same dataset, all of which may be valid depending on the target research goal. For instance, using three algorithms with higher spread values (e.g., Density Peaks (DTW), K-means (shape-based), and K-means (Euclidean) of Figure 7c, Figure 8c and Figure 10) on the same dataset is more likely to provide three dissimilar clustering outputs compared to those generated using K-means (Euclidean), K-medoids (Euclidean), and Fuzzy C-means (Euclidean) (lower spread values in Figure 6).

Conclusion
This paper reports benchmark tests from applying eight popular time series clustering methods on 112 datasets in the UCR archive. One essential goal of the benchmark is to make the results available and reusable to other researchers. In this work, we laid out six restrictions to help reduce bias. Eight popular clustering methods were selected to cover three categories of clustering algorithms (i.e., partitional, density-based, and hierarchical) and three distance measures (i.e., Euclidean, dynamic time warping, and shape-based). The dataset-level assessment metrics are reported using six external evaluation measures. Adjusted Rand Index was selected as the default measure for discussion in this paper. A phased evaluation framework was designed such that in each phase only one of the two building blocks of a clustering method (algorithm and distance measure) is varied at a time. Benchmark results show the overall performance of the eight methods to be similar, with high variance across different datasets. Discussion of the results helps highlight the importance of creating a pool of clustering methods with high spread in accuracy scores for effective exploratory analysis.
Clemins and Scott Hamshaw, Research Assistant Professors at the University of Vermont, for support in using Vermont EPSCoR's high performance computing resources. We also thank Dr. Eamonn Keogh for his invaluable feedback, and all other curators and administrators of the UCR archive, without whom this work would not have been possible.