1 Introduction

Clustering is an unsupervised analysis technique in which a set of cases, each defined by a vector of continuous or discrete variables, is grouped into clusters so that cases within a cluster are considered homogeneous, whereas cases in different clusters are considered heterogeneous [12].

Time series clustering is the act of grouping ordered time series data without recourse to a label. We use the acronym TSCL for time series clustering, to distinguish it from time series classification, which is commonly referred to as TSC. A wide range of algorithms has been proposed for TSCL. Our focus is specifically on clustering using elastic distance measures, i.e. distance measures that use some form of realignment of the series being compared. Our aim is to perform a comparative study of these algorithms that follows the basic structure of bake offs in distance-based TSC [44], univariate TSC [8] and multivariate TSC [60]. A large number of alternative transformation-based, e.g. [42], deep learning-based, e.g. [39], and statistical model-based [15] clustering approaches have been proposed for TSCL. These approaches are not the focus of this research. Our primary aim is to provide a detailed description, with examples, of nine commonly used elastic distance measures and to conduct extensive experimentation to compare their utility for TSCL. k-means is by far the most popular clustering algorithm for TSCL (for example, [33]); k-medoids is used much less frequently. Our secondary aim is to compare k-means and k-medoids distance-based clustering for TSCL.

Clustering is often the starting point for exploratory data analysis and is widely performed in all fields of data science [74]. However, clustering is harder to define than, for example, classification. There is debate about what clustering means [31] and no accepted standard definition of what constitutes good clustering or what it means for cases to be homogeneous or heterogeneous. For example, homogeneous could mean generated by some common underlying process, or it could mean that cases share some hidden variable. A clustering of patients based on some medical data might group all male patients in one cluster and all female patients in another. From one point of view the clusters are homogeneous, but that does not mean it is necessarily a good clustering. The interpretation of the usefulness of clustering a particular dataset requires domain knowledge. This makes comparing algorithms over a range of problems more difficult than performing a bake off of TSC algorithms. Nevertheless, there have been numerous comparative studies (for example, [2, 33] and [39]) which take TSC problems from the UCR archive [18] and then evaluate clusterings based on how well they recreate class labels. We aim to add to this body of knowledge with a detailed description and a reproducible evaluation of clustering with elastic distance measures (described in Sect. 3) following a similar methodology. We do this in the context of partitional clustering algorithms (see Sect. 2). We stress that our findings relate only to performance on the UCR univariate datasets and that our conclusions should be taken in that context. We think that our findings are useful for practitioners wanting a benchmark for a TSCL problem, but there are always use cases for different approaches.

Elastic distance measures are significantly more accurate on average than Euclidean distance for TSC when used with a one-nearest-neighbour classifier [44]. We evaluate whether this improvement translates to centroid-based clustering. Following experiments presented in Sect. 5.1, we conclude that, somewhat surprisingly, this is not the case with k-means clustering. Using standard default parameters, only one elastic distance measure, move–split–merge (MSM) [67], is significantly better than Euclidean distance, and five of those considered are significantly worse, including the most popular approach, dynamic time warping (DTW). The pattern of results is the same whether we cluster the raw data or normalise it first. We repeat our experiments using k-medoids (see Sect. 5.2) and find that clustering performance improves for all distances except Euclidean, and DTW is no longer significantly worse than ED. With both k-means and k-medoids, the move–split–merge distance function [67] performs best. We evaluate whether these results could be an artefact of our clustering algorithm configuration in Sect. 5.3 by comparing alternative cluster initialisation algorithms, and we find the same pattern of results in all cases. k-means with DTW is a very popular benchmark for TSCL (for example, see [33]), so the findings in Sect. 5.1 are perhaps counter to the received wisdom in the community. We investigate the performance of DTW under additional scenarios in Sect. 5.4. We try alternative parameter settings and use an unsupervised tuning algorithm for the window size. We show that reducing the window size to 5% improves k-means DTW, but only to the point where it is no longer worse than Euclidean distance. We also find that tuning does not improve on the 5% window performance. We then explore using dynamic time warping with barycentre averaging (DBA) to improve k-means [55]. DBA finds centroids by aligning cluster members and averaging over values warped to each location. We reproduce the reported improvement that DBA brings to DTW with k-means, but we observe that the improvement is not enough to make it better than Euclidean distance, and it comes with a heavy computational overhead. Finally, we assess whether tuning our best performing distance functions helps and find that, on average, we cannot improve on the default parameters. In Sect. 6, we look more closely at the performance of the distance functions for data with different characteristics. We find MSM does relatively better on problems with a large number of classes and a larger training set size. We also observe that it does better in problem domains where we would expect some phase shift within clusters, such as ECG and electric device power usage problems. Our first conclusion in Sect. 7 is that elastic distances perform better on the UCR archive when used with k-medoids rather than with k-means. Using medoids (data points in the training data) rather than centroids (averaged cluster members) overcomes the mismatch between simple averaging and elastic distances that barycentre averaging is designed to mitigate, but with far less computational overhead. Our second observation is that MSM is the best performing distance function on the UCR archive. MSM explicitly penalises longer warpings, and we observe that all distance functions that do this perform better on the UCR archive. We hope this work will raise the profile of MSM in the TSCL community.

We have provided optimised scikit-learn compatible Python implementations of the distances and clusterers in the aeon toolkit, and a repository with an associated webpage which provides notebooks on using distances and clusterers and contains all our results with guidance on how to reproduce them.

The remainder of this paper is structured as follows: Sect. 2 describes the general TSCL problem and provides a brief literature review, and Sect. 3 gives a detailed description of the nine elastic distance measures used in our experiments. For more general background on clustering, we direct the reader to [32, 62]. For background on elastic distances for classification, we recommend [1, 65].

The methodology we use in our experiments is outlined in Sect. 4. The results on UCR datasets are presented in Sect. 5, and these are further analysed in Sect. 6. Finally, Sect. 7 concludes and signposts the future direction of our research.

2 Time series clustering background and literature review

A time series \(\varvec{x}\) is a sequence of m observations, \((x_1,\ldots , x_m)\). We assume all series are equal in length. For univariate time series, \(x_i\) are scalars and for multivariate time series, \(x_i\) are vectors. A time series data set, \(D = \{\varvec{x_1}, \varvec{x_2},\ldots , \varvec{x_n}\}\), is a set of n time series cases. A clustering is some form of a grouping of cases. Broadly speaking, clustering algorithms are either partitional or hierarchical. Partitional clustering algorithms assign (possibly probabilistic) cluster membership to each time series, usually through an iterative heuristic process of optimising some objective function that measures homogeneity. Given a dataset of n time series, D, the partitional time series clustering problem is to partition D into k clusters, \(C = \{C_1, C_2,\ldots , C_k\}\) where k is the number of clusters. It is usually assumed that the number of clusters is set prior to the optimisation heuristic. If k is not set, it is commonly found through some wrapper method. We assume k is fixed in advance for all our experiments.

Clustering algorithms can be split into those that work directly with the time series, and those that employ a transformation to derive features prior to clustering. The focus of this study is on non-probabilistic partitional clustering algorithms that work directly with time series.

2.1 Partitional time series clustering algorithms

Partition-based clustering algorithms share the same basic components. Firstly, the algorithm selects example cases (which we call exemplars) that are meant to characterise a cluster. This is known as the initialisation stage. After the initial exemplars are selected, an update stage begins where the exemplars are iteratively refined until some convergence condition is met.

One such partition-based algorithm is k-means [47], also known as Lloyd’s algorithm [46]. It is the most well-known and popular partitional clustering algorithm, in both standard clustering and time series clustering. The algorithm uses k centroids as exemplars for each cluster. A centroid is a summary of the members of a cluster found through the update operation, which for standard k-means involves averaging each time point over cluster members. Each case is assigned to the cluster of its closest centroid, as defined by a distance measure.

Many enhancements to the core algorithm have been proposed. One of the most effective refinements is changing how the exemplars are initialised. By default, the initial centroids for k-means are chosen randomly, either by randomly assigning cluster membership and then averaging or by choosing random instances from the training data as initial centres. However, this risks choosing centroids that are in fact in the same cluster, making it harder to split the clusters. To address this problem, forms of restart and selection are often employed. For example, [13] propose restarting k-means over subsamples of the data and using the resulting centroids to seed the full run. Another solution is to run the algorithm multiple times and keep the model that yields the best result according to some unsupervised performance measure.

k-means assumes the number of clusters, k, is set a priori. There is a range of methods for finding k. These often involve iteratively increasing the number of clusters until some stopping condition is met. This can involve some form of elbow finding or an unsupervised quality measure, such as the silhouette value [45]. Time series-specific enhancements concern the distance measure used and the averaging technique used to recalculate centroids. Averaging series matched with an elastic distance measure will often mean that the characteristics that made the series similar are lost in the centroid. [55] describe an alternative averaging method based on pairwise DTW. This is described in detail in Sect. 3.8.

k-medoids is an alternative clustering algorithm which uses cases, or medoids, as cluster exemplars. One benefit of using instances as cluster centres is that the pairwise distance matrix of the training data is sufficient to fit the model, and this can be calculated before the iterative partitioning. This is particularly important when performing the update operation, which is the main difference between k-means and k-medoids: k-medoids chooses a cluster member as the exemplar rather than an average. The medoid is the case with the minimum total distance to the other members of the cluster. Refinements such as Partition Around Medoids (PAM) [40] avoid the complete enumeration of total distances through a swapping heuristic.
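To make the update operation concrete, the medoid of a cluster can be read directly off a precomputed pairwise distance matrix, as in the minimal NumPy sketch below. The function name and the use of a plain squared-Euclidean matrix are our own illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def medoid_index(distance_matrix, member_idx):
    """Return the index (into the full dataset) of the medoid of a cluster.

    distance_matrix : (n, n) array of precomputed pairwise distances.
    member_idx      : indices of the cases currently assigned to the cluster.
    The medoid is the member with the minimum total distance to the other members.
    """
    sub = distance_matrix[np.ix_(member_idx, member_idx)]  # distances within the cluster
    totals = sub.sum(axis=1)                               # total distance of each member to the rest
    return member_idx[int(np.argmin(totals))]

# toy example: four series of length 5, squared Euclidean pairwise distances
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
print(medoid_index(D, np.array([0, 2, 3])))
```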

Both our k-means and k-medoids clusterers extend an implementation of Lloyd's algorithm and employ the same stopping condition. The iterative process stops when cluster membership does not change, or when the change in inertia is below a certain tolerance (convergence) [27]. The inertia is a value used to determine how coherent different clusters are and equals the sum of distances between each case and the centre of its assigned cluster. If the difference in inertia between two iterations is below the (very small) threshold, convergence is assumed and the iterations stop.

Both partitional clustering algorithms suffer from the problem of initialising the clusters. The initialisation problem is twofold: determining the correct number of clusters to be generated from the given dataset and locating the position of the initial centroids [27]. Whilst there are numerous proposals on how to select the value of k (for example, [51] use medoids initialisation, [22] PCA initialisation, and [29] propose cluster elimination and division initialisation), we follow related comparative studies [33] and assume the number of clusters is known and equal to the number of classes in a classification dataset. The second problem of finding the initial cluster centres is important because poor initial centres can lead to convergence to local optima and worse clustering performance [70]. We consider three commonly used initialisation algorithms in our experimentation.

The most common initialisation technique is random initialisation [47]. This technique consists of choosing the initial centres randomly from the dataset. The rationale behind this is that random selection is likely to pick points from dense regions. Rerunning the model multiple times with random initialisation and taking the best clustering (as measured by inertia) is the most common way of initialising k-means [14] and is used in the majority of our experiments. However, we also consider Forgy's method [24], which initialises centres by assigning each data point in the dataset to one of k clusters uniformly at random. The centres are then given by the centroids of these initial clusters. A drawback of this method is that it has no theoretical basis and, as such, the random clusters have no internal homogeneity [5]. Finally, we consider the k-means++ algorithm [7]. k-means++ tries to disperse the initial centres by biasing selection towards points that are far from the centres already chosen. The first centre is randomly selected from the train data. Subsequent centres are selected with probability proportional to the squared minimum distance between a case and the previously selected centres. If md(x) denotes the minimum distance from a case x to the previously selected centres, then the probability of selecting x as the next centre is defined as

$$\begin{aligned} \frac{md(x)^{2}}{\sum ^{n}_{j=1}md(x_j)^2}. \end{aligned}$$
(1)
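A minimal sketch of this seeding scheme is given below. It works with any pairwise distance passed as a callable (plain Euclidean here for illustration) and is our own simplified reading of Eq. 1, not the exact implementation used in the experiments.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, dist, seed=None):
    """Select k initial centres from X using the k-means++ scheme of Eq. 1.

    X    : (n, m) array of n series of length m.
    dist : callable dist(a, b) -> float, e.g. Euclidean or an elastic distance.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centres = [X[rng.integers(n)]]              # first centre chosen uniformly at random
    for _ in range(k - 1):
        # md(x): minimum distance from each case to the centres chosen so far
        md = np.array([min(dist(x, c) for c in centres) for x in X])
        probs = md ** 2 / np.sum(md ** 2)       # Eq. 1: probability proportional to md(x)^2
        centres.append(X[rng.choice(n, p=probs)])
    return np.stack(centres)

# usage with plain Euclidean distance on random data
euclidean = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
X = np.random.default_rng(1).normal(size=(20, 50))
init = kmeans_plus_plus_init(X, k=3, dist=euclidean, seed=1)
```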

2.2 Literature review

Fig. 1: Time series clustering taxonomy. The following models are included: K-means [47], K-spectral centroid [69], K-DBA [55], Kernel K-means [21], K-shapes [53], K-multishapes [54], PAM [40], CLARA [36], CLARANS [52], Alternate [46], DBSCAN [23], HDBSCAN [49], OPTICS [6], BIRCH [73], Agglomerative [35], Feature K-means [59], Feature K-medoids [59], U-shapelets [71], USSL [72], RSFS [64], NDFS [43], and deep learning and dimensionality reduction approaches, see [39]

Our focus is on elastic distances-based clustering, but there are many other approaches to TSCL. The most popular TSCL approaches are summarised in Fig. 1.

The k-Shape algorithm [53] and its extension k-multishapes (k-MS) [54] are variations of the k-means algorithm that use cross-correlation to compute the similarity between time series. Cross-correlation is a measure of similarity that compares points of time-lagged signals one to one. k-MS extends k-Shape by computing multiple centroids per cluster.

K-spectral centroid (KSc) [69] computes the distance between time series over shifted ranges of lags within a k-means framework. Kernel k-means [21] operates in the Reproducing Kernel Hilbert Space associated with the chosen kernel. Further kernel-based techniques include using Fisher kernels with hidden Markov models [68].

Similarity-based techniques for TSCL compute similarity directly from the raw time series. However, working with raw time series is a non-trivial task due to the challenges they present (high dimensionality, missing values, variable length, etc.).

Another common way to cluster time series is to perform some form of feature extraction and then cluster the features. For example, [59] used a standard k-means clusterer on statistical features (skewness, standard deviation, mean, etc.) extracted from an electricity usage time series dataset. k-medoids has also been used with features extracted from time series using piecewise aggregate approximation [41]. Further feature extraction techniques specifically designed for TSCL include u-shapelets [71], Unsupervised Salient Subsequence Learning (USSL) [72], robust spectral learning for unsupervised feature selection (RSFS) [64] and nonnegative discriminative feature selection (NDFS) [43].

Finally, one of the most recent developments in TSCL is the use of deep learning. A good review of deep learning-based time series clustering can be found in [39].

3 Time series distance measures

Suppose we want to measure the distance between two equal-length univariate time series, \({\textbf{a}}=\{a_1,a_2,\ldots ,a_m\}\) and \({\textbf{b}}=\{b_1,b_2,\ldots ,b_m\}\). The Euclidean distance \(d_{\textrm{ed}}\) is the L2 norm between series,

$$\begin{aligned} d_{\textrm{ed}}({\textbf{a}}, {\textbf{b}}) = \sqrt{\sum _{i=1}^m(a_i-b_i)^2}. \end{aligned}$$
(2)

\(d_{\textrm{ed}}\) is a standard starting point for distance-based analysis. It puts no priority on the ordering of the series and, when used in TSC, is a poor benchmark for distance-based algorithms and a very long way from state of the art [50]. Elastic distance measures allow for possible error in offset by attempting to optimally align two series based on distorting indices or editing series. These have been shown to significantly improve k-nearest neighbour classifiers in comparison with \(d_{\textrm{ed}}\) [44]. Our aim is to see if we observe the same improvement with clustering. Table 1 lists the ten distance functions we describe.

Table 1 List of the ten distance functions reviewed

3.1 Dynamic time warping

Dynamic time warping (DTW) [58] is the most widely researched and used elastic distance measure. It mitigates distortions in the time axis by realigning (warping) the series to best match each other. Let \(M({\textbf{a}},{\textbf{b}})\) be the \(m \times m\) pointwise distance matrix between series \({\textbf{a}}\) and \({\textbf{b}}\), where \(M_{i,j}= (a_i-b_j)^2\). A warping path

$$\begin{aligned} P=<(e_1,f_1),(e_2,f_2),\ldots ,(e_s,f_s)> \end{aligned}$$

is a set of pairs of indices that define a traversal of matrix M. A valid warping path must start at location (1, 1), end at location (m, m) and not backtrack, i.e. \(0 \le e_{i+1}-e_{i} \le 1\) and \(0 \le f_{i+1}- f_i \le 1\) for all \(1 \le i < s\). The DTW distance between two series is defined by the path through M that minimises the total distance. The distance for any path P of length s is

$$\begin{aligned} D_P({\textbf{a}},{\textbf{b}}, M) =\sum _{i=1}^s M_{e_i,f_i}. \end{aligned}$$
(3)

If \({\mathcal {P}}\) is the space of all possible paths, the DTW path \(P^*\) is the path that has the minimum distance, and hence, the DTW distance between series is

$$\begin{aligned} d_{\textrm{dtw}}({\textbf{a}}, {\textbf{b}}) =D_{P*}({\textbf{a}},{\textbf{b}}, M). \end{aligned}$$
(4)

The optimal warping path \(P^*\) can be found exactly through a dynamic programming formulation described in Algorithm 1. This can be a time-consuming operation, and it is common to put a restriction on the amount of warping allowed. Figure 2 describes the two most frequently used bounding techniques, the Sakoe–Chiba band [61] and Itakura parallelogram [30]. In Fig. 2, each individual square represents an element of matrix M. Applying a bounding constraint (represented by the darker squares in Fig. 2) reduces the required computation. The Sakoe–Chiba band creates a warping path window that has the same width along the diagonal of M. The Itakura parallelogram allows for less warping at the start or end of the series than in the middle. Algorithm 1 assumes a Sakoe–Chiba band.

Fig. 2: The two most common bounding techniques [57]

Algorithm 1: DTW (\(\textbf{a},\textbf{b}\) (both series of length m), w (window proportion, default value \( w \leftarrow 1\)), M (pointwise distance matrix))

The DTW distance with Sakoe–Chiba window w can be expressed as Eq. 5.

$$\begin{aligned} d_{\textrm{dtw}}({\textbf{a}}, {\textbf{b}}) =D_{P*}({\textbf{a}},{\textbf{b}}, M) = DTW({\textbf{a}}, {\textbf{b}}, w, M). \end{aligned}$$
(5)

More general bands can be imposed in implementation by setting values outside the band in the matrix, M, to infinity. Figure 3 helps explain how the DTW calculations are arrived at. The squared Euclidean distance is simply the sum of the diagonal of the matrix M in Fig. 3a. DTW constructs C using M and previously found values. For example, \(C_{1,1} = M_{1,1} = 0.6\) and \(C_{1,2}\) is the minimum of \(C_{1,1}\), \(C_{0,1}\) and \(C_{0,2}\) plus \(M_{1,2}\). \(C_{0,2}\) and \(C_{0,1}\) are initialised to infinity, so the best path to get to \(C_{1,2}\) has distance \(C_{1,1}+ M_{1,2}\), which equals 0.6 \(+\) 5.25 = 5.85. Similarly, cell \(C_{10,10}\) is the minimum of the three cells \(C_{9,9}, C_{10,9}, C_{9,10}\) plus the pointwise distance \(M_{10,10}\). The optimal path is the trace back through the minimum values (shown in white in Fig. 3b). Figure 4 demonstrates the effect of constraining the warping path on DTW using the same two series as Fig. 3. The relatively extreme warping from point 0 to point 5 evident in Fig. 3a is constrained when the maximum warping allowed is 2 places (\(w=0.2\)) in Fig. 4.
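The dynamic programming recursion just described can be written compactly. The sketch below is a minimal NumPy version with a Sakoe–Chiba band, intended to illustrate the construction of C from M rather than to reproduce the optimised aeon implementation.

```python
import numpy as np

def dtw_distance(a, b, window=1.0):
    """DTW with a Sakoe-Chiba band of proportion `window` (1.0 = unconstrained)."""
    m = len(a)
    w = int(np.ceil(window * m))                   # band half-width in time points
    M = (a[:, None] - b[None, :]) ** 2             # pointwise squared distance matrix
    C = np.full((m + 1, m + 1), np.inf)            # cost matrix with an infinite border
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            # best of match (diagonal) and the two warping moves
            C[i, j] = M[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[m, m]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(dtw_distance(a, b, window=0.2))
```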

Fig. 3: An example of Euclidean and DTW distance functions for two series. The left hand matrix, (a), is the pointwise distance between the series (matrix M in Equation 5). The Euclidean distance is the sum of the diagonal path. The right hand matrix, (b), shows the DTW distances (matrix C in Equation 5) and the resulting warping path when the window size is unconstrained

Fig. 4: An example of constraining the DTW window. Using the same series as Figure 3, (a) shows the DTW distance matrix when using a bounding window that constrains warping to 20% of the series. (b) shows the resulting alignment

3.2 Derivative dynamic time warping

Keogh and Pazzani [37] proposed a modification of DTW called derivative dynamic time warping (DDTW) that first transforms the series into a difference series. The difference series of \({\textbf{a}}\) is \(\mathbf {a'}=\{a'_2,a'_3,\ldots ,a'_{m-1}\}\), where \(a'_i\) is defined as the average of the slope between \(a_{i-1}\) and \(a_i\) and the slope between \(a_{i-1}\) and \(a_{i+1}\), i.e.

$$\begin{aligned} a'_i = \frac{(a_i-a_{i-1})+(a_{i+1}-a_{i-1})/2}{2} \end{aligned}$$
(6)

for \(1<i<m\). The DDTW distance is then simply the DTW distance between the differenced series,

$$\begin{aligned} d_{\textrm{ddtw}}({\textbf{a}},{\textbf{b}})= d_{\textrm{dtw}}(\mathbf {a'},\mathbf {b'}). \end{aligned}$$
(7)
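A sketch of the difference transform of Eq. 6 is given below; any DTW implementation (such as the one sketched in Sect. 3.1) can then be applied to the transformed series. This is our own illustrative code rather than the toolkit implementation.

```python
import numpy as np

def derivative_transform(a):
    """Difference series of Eq. 6, defined for the interior points 2..m-1."""
    return ((a[1:-1] - a[:-2]) + (a[2:] - a[:-2]) / 2.0) / 2.0

a = np.array([1.0, 2.0, 4.0, 3.0, 1.0])
a_prime = derivative_transform(a)   # length m - 2
# d_ddtw(a, b) would then be dtw_distance(derivative_transform(a), derivative_transform(b))
```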

3.3 Weighted dynamic time warping

A weighted form of DTW (WDTW) was proposed by [34]. WDTW adds a multiplicative weight penalty based on the warping distance between points in the warping path. It is a smooth alternative to the cut-off point approach of using a warping window. When creating the distance matrix M, a weight penalty \(w(|i-j|)\) for a warping distance of \(|i-j|\) is applied, so that

$$\begin{aligned} M^w_{i,j}= w(|i-j|) \cdot (a_i-b_j)^2. \end{aligned}$$
(8)

A logistic weight function is proposed in [34], so that a warping of a places imposes a weighting of

$$\begin{aligned} w(a)=\frac{w_{\textrm{max}}}{1+e^{-g\cdot (a-m/2)}} \end{aligned}$$
(9)

where \(w_{\textrm{max}}\) is an upper bound on the weight (set to 1), m is the series length and g is a parameter that controls the penalty level for large warpings. The larger g is, the greater the penalty for warping. Note that WDTW does not benefit from the reduced search space a window induces. The WDTW distance is then

$$\begin{aligned} d_{\textrm{wdtw}}({\textbf{a}}, {\textbf{b}}) =D_{P*}({\textbf{a}},{\textbf{b}}, M^w) = DTW({\textbf{a}},{\textbf{b}}, M^w). \end{aligned}$$
(10)

Figure 5 shows the warping resulting from two parameter values. For this example, \(g =0.2\) gives the same warping as full window DTW, but \(g=0.3\) is more constrained.
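To make the weighting concrete, the sketch below applies the logistic weight of Eq. 9 to the pointwise matrix (Eq. 8) and then runs the same unconstrained dynamic programming recursion as DTW. It is a simplified illustration under these assumptions, not the toolkit implementation.

```python
import numpy as np

def wdtw_distance(a, b, g=0.05, w_max=1.0):
    """Weighted DTW: logistic weight on the warping amount |i - j| (Eqs. 8 and 9)."""
    m = len(a)
    i_idx, j_idx = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    weights = w_max / (1.0 + np.exp(-g * (np.abs(i_idx - j_idx) - m / 2.0)))
    Mw = weights * (a[:, None] - b[None, :]) ** 2        # weighted pointwise matrix
    C = np.full((m + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            C[i, j] = Mw[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[m, m]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(wdtw_distance(a, b, g=0.05))
```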

Fig. 5: Examples of the weighted DTW cost matrix C and resulting alignment for two weight parameters

We also investigate the derivative weighted distance (WDDTW),

$$\begin{aligned} d_{\textrm{wddtw}}({\textbf{a}}, {\textbf{b}}) = d_{\textrm{wdtw}}(\mathbf {a'}, \mathbf {b'}). \end{aligned}$$
(11)

3.4 Longest common subsequence

DTW is usually expressed as a mechanism for finding an alignment so that points are warped onto each other to form a path. An alternative way of looking at the process is as a mechanism for forming a common series between the two input series. With this view, the warping operation can be seen as inserting a gap in one series or removing an element from another series. This way of thinking derives from approaches for aligning sequences of discrete variables, such as strings or DNA. The Longest Common Subsequence (LCSS) distance for time series is derived from a solution to the problem of finding the longest common subsequence between two discrete series through edit operations. For example, if two discrete series are abaacb and bcacab, the LCSS is baab. Unlike DTW, LCSS does not provide a path from (1, 1) to (m, m). Instead, it describes edit operations to form a final sequence, and these operations are given a certain cost. So, for example, to edit abaacb into the LCSS baab requires two deletion operations. For DTW, you can think of the choice between the three values in line 5 of Algorithm 1 as \(C_{i-1,j}\) being a deletion in series \({\textbf{b}}\), \(C_{i, j-1}\) a deletion in series \({\textbf{a}}\) and \(C_{i-1, j-1}\) a match. The warping path shown in Fig. 4, the sequence of pairs \(<(1,1), (2,1), (3,2), (4,3), (4,4), (5,5), (6,5),(7,5), (8,6), (8,7), (8,8),\) \( (8,9), (9,10), (10,10)>\), can instead be expressed as the edited series \(<(1,1), (3,2), (4,3), (5,5), (8,6), (9,10)>\). With this representation, the warping operations are in fact deletions (or gaps) in series \({\textbf{a}}\) at positions 2, 6, 7 and 10 and in \({\textbf{b}}\) at positions 4, 7, 8 and 9.

With discrete series, the matches in the common subsequence have the same value. Thus, each pair in the subsequence would be, for example, a letter in common between the two series. Obviously, actual matches in real-valued series will be rare. The discrete LCSS algorithm can be extended to real-valued time series by using a distance threshold \(\epsilon \), which defines the maximum difference between a pair of values that is allowed for them to be considered a match. The length of the LCSS between two series \(\textbf{a}\) and \(\textbf{b}\) can be found using Algorithm 2. If two values are considered the same (line 4), the previously computed LCSS length is increased by one. If not, the longest LCSS seen so far is carried forward.

Algorithm 2: LCSS (\(\textbf{a},\textbf{b}\) (both series of length m), \(\epsilon \) (equality threshold))

The LCSS distance between \(\textbf{a}\) and \(\textbf{b}\) is

$$\begin{aligned} d_{\textrm{LCSS}}(\textbf{a,b}) = 1- \frac{\textrm{LCSS}(\textbf{a,b})}{m}. \end{aligned}$$
(12)

Figure 6 shows the cost match between our two example series. The longest sequence of matches is seven. These are the pairs \(<(2,1), (3,2), (4,4), (5,5), (6,8), (8,8), (9,10)>\).
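A minimal version of the thresholded LCSS recursion (Algorithm 2) and the distance of Eq. 12 is sketched below; it counts matches rather than recovering the matching pairs themselves, and is illustrative only.

```python
import numpy as np

def lcss_distance(a, b, epsilon=1.0):
    """LCSS distance of Eq. 12: 1 - (length of LCSS) / m."""
    m = len(a)
    L = np.zeros((m + 1, m + 1), dtype=int)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= epsilon:      # values within epsilon count as a match
                L[i, j] = L[i - 1, j - 1] + 1
            else:                                        # otherwise carry forward the best LCSS seen so far
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return 1.0 - L[m, m] / m

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(lcss_distance(a, b, epsilon=1.0))
```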

Fig. 6: An example of the LCSS cost matrix with \(\epsilon = 1\), where the white cells are members of the LCSS

Fig. 7: EDR and ERP example paths

3.5 Edit distance on real sequences (EDR)

LCSS was adapted in [17], where the edit distance on real sequences (EDR) is described. Like LCSS, EDR uses a distance threshold to define when two elements of a series match. However, rather than counting matches and looking for the longest sequence, EDR applies a (constant) penalty for non-matching elements where deletions, or gaps, are created in the alignment. Given a distance threshold, \(\epsilon \), the cost between two points of series \(\textbf{a}\) and \(\textbf{b}\) is given in Algorithm 3. The EDR distance between \(\textbf{a}\) and \(\textbf{b}\) is then defined by Eq. 13.

Algorithm 3: EDR (\(\textbf{a},\textbf{b}\) (both series of length m), \(\epsilon \) (equality threshold))

$$\begin{aligned} d_{\textrm{EDR}}(\textbf{a,b}) = \textrm{EDR}(\textbf{a, b}, w, \epsilon ) \end{aligned}$$
(13)

At any step, an elastic distance can use one of three costs, diagonal, horizontal or vertical, in forming an alignment. In terms of forming a subsequence from series \({\textbf{a}}\), we can characterise these as operations: match (diagonal), deletion (horizontal) and insertion (vertical). Insertion into \({\textbf{a}}\) can also be characterised as deletion from \({\textbf{b}}\), but we retain the match/delete/insert terminology for consistency and clarity. EDR does not satisfy the triangle inequality, as equality is relaxed by assuming elements are equal when the distance between them is less than or equal to \(\epsilon \). EDR is very similar to LCSS, but it directly finds a distance rather than simply counting matches, thus removing the need for the subtraction in Eq. 12. The resulting cost matrix, shown in Fig. 7a, can easily be used to find either an alignment path or a common subsequence, and EDR thus characterises the three DTW operations as match, delete and insert.
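A sketch of the standard EDR recursion follows: a match within \(\epsilon\) costs nothing, and every other edit costs one. The exact initialisation and any normalisation in Algorithm 3 may differ from this simplified version, so it should be read as an illustration only.

```python
import numpy as np

def edr_distance(a, b, epsilon=1.0):
    """Edit distance on real sequences: unit cost for gaps and mismatches."""
    m = len(a)
    D = np.zeros((m + 1, m + 1))
    D[:, 0] = np.arange(m + 1)          # deleting the first i points of a
    D[0, :] = np.arange(m + 1)          # deleting the first j points of b
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            match_cost = 0 if abs(a[i - 1] - b[j - 1]) <= epsilon else 1
            D[i, j] = min(D[i - 1, j - 1] + match_cost,   # match / substitute
                          D[i - 1, j] + 1,                # gap in b
                          D[i, j - 1] + 1)                # gap in a
    return D[m, m]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(edr_distance(a, b, epsilon=0.05))
```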

3.6 Edit distance with real penalty (ERP)

An alternative to EDR was proposed in [16], where edit distance with real penalty (ERP) was introduced. LCSS produces a subsequence that can have gaps between elements. For example, there is a gap between (3,2) and (4,4) in the subsequence shown in Fig. 6. ERP imposes a penalty when gaps occur, based on the distance to a constant parameter g. ERP uses \(d(a,b) = \sqrt{(a-b)^2}\) for the pointwise distance rather than the \(d(a,b) = (a-b)^2\) used to find M for DTW. ERP calculates a cost matrix E that is more like DTW than LCSS: it describes a path alignment of two series based on edits rather than warping. The ERP distance between two series is described in Algorithm 4. The edge terms are initialised to a large constant value (lines 5 and 7). The cost matrix E is then built from either the cost of matching, \(E_{i-1,j-1}+ d(a_i, b_j)\), where \(d(a_i, b_j)\) is the distance between two points, or the cost of inserting/deleting a term on either axis (\(E_{i-1,j}+ d(a_i, g)\) or \(E_{i,j-1}+ d(g, b_j)\)). The ERP distance between \(\textbf{a}\) and \(\textbf{b}\) is given by Eq. 14.

Algorithm 4: ERP (\(\textbf{a},\textbf{b}\) (both series of length m), g (penalty value), d (pointwise distance function))

$$\begin{aligned} d_{\textrm{ERP}}(\textbf{a,b}) = \textrm{ERP}(\textbf{a, b}, g,d) \end{aligned}$$
(14)

ERP satisfies the triangle inequality and is a metric. Figure 7b shows the cost matrix and resulting alignment for our example series with \(g=0\). \(E_{1,1}\) is 0.774, which is simply the square root of \(M_{1,1}\). \(E_{2,1}\) is the minimum of the two edge cases to the left (large constants) and \(E_{1,1}+d(b_1,g)\), where \(b_1\approx 0.42\) and \(g=0\), so \(d(b_1,g)\approx 0.424\). Hence, \(E_{2,1}= 0.774 + 0.424 \approx 1.2.\)
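The recursion just described can be sketched as follows, using the absolute difference as the pointwise distance and a large constant on the matrix edges, as described for Algorithm 4. This is an illustrative simplification, not the toolkit code.

```python
import numpy as np

def erp_distance(a, b, g=0.0):
    """Edit distance with real penalty: gaps are penalised by the distance to g."""
    m = len(a)
    big = np.inf                         # edge terms, the large constant of Algorithm 4
    E = np.full((m + 1, m + 1), big)
    E[0, 0] = 0.0
    d = lambda x, y: abs(x - y)          # pointwise distance used by ERP
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            E[i, j] = min(E[i - 1, j - 1] + d(a[i - 1], b[j - 1]),   # match
                          E[i - 1, j] + d(a[i - 1], g),              # gap, penalised relative to g
                          E[i, j - 1] + d(g, b[j - 1]))              # gap, penalised relative to g
    return E[m, m]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(erp_distance(a, b, g=0.0))
```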

3.7 Move–split–merge (MSM)

$$\begin{aligned} cost(x,y,z,c) = \left\{ \begin{array}{l} {c\, {\textbf {if}} \,y \le x \le z \, {\textbf {or}}\, y \ge x \ge z} \\ {c+min(|x-y|,|x-z|)\, {\textbf {otherwise}}.} \end{array} \right. \end{aligned}$$
(15)
Fig. 8: MSM and TWE example paths

Algorithm 5: MSM (\(\textbf{a},\textbf{b}\) (both series of length m), c (minimum cost), d (pointwise distance function))

Move–split–merge (MSM) [67] is a distance measure that is conceptually similar to ERP. The core motivation for MSM is that the cost of insertion/deletion in ERP is based on the distance of the value from some constant g, and thus ERP prefers inserting and deleting values close to g over other values. Algorithm 5 shows that the major difference is in the deletion/insertion operations on lines 10 and 11. The move operation in MSM uses the absolute difference rather than the squared distance used for matching in ERP. The insert cost in ERP, \(d(a_i, g)\), is replaced by the split operation \(C(a_i,a_{i-1},b_j, c)\), where C is the cost function given in Eq. 15. If the value being inserted, \(b_j\), is between the two values \(a_i\) and \(a_{i-1}\) being split, the cost is a constant value c. If not, the cost is c plus the minimum of the distances from \(a_i\) to \(a_{i-1}\) and from \(a_i\) to \(b_j\). The delete cost of ERP, \(d(g, b_j)\), is replaced by the merge cost \(C(b_j,b_{j-1},a_{i},c)\), which is simply the same operation applied to the second series. Thus, the cost of splitting and merging values depends on the value itself and adjacent values, rather than treating all insertions and deletions equally as ERP does. The MSM distance between \(\textbf{a}\) and \(\textbf{b}\) is given by Eq. 16 and illustrated in Fig. 8b. MSM satisfies the triangle inequality and is a metric.

$$\begin{aligned} d_{\textrm{MSM}}(\textbf{a,b}) = {\textrm{MSM}}(\textbf{a, b},c) \end{aligned}$$
(16)
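A sketch of the MSM recursion using the cost function of Eq. 15 follows. The boundary initialisation follows the original MSM paper and may differ in detail from Algorithm 5; it is an illustration, not the optimised implementation used in our experiments.

```python
import numpy as np

def _msm_cost(x, y, z, c):
    """Split/merge cost of Eq. 15."""
    if (y <= x <= z) or (y >= x >= z):
        return c
    return c + min(abs(x - y), abs(x - z))

def msm_distance(a, b, c=1.0):
    m = len(a)
    D = np.zeros((m, m))
    D[0, 0] = abs(a[0] - b[0])
    for i in range(1, m):                 # first column: repeated split operations on a
        D[i, 0] = D[i - 1, 0] + _msm_cost(a[i], a[i - 1], b[0], c)
    for j in range(1, m):                 # first row: repeated merge operations on b
        D[0, j] = D[0, j - 1] + _msm_cost(b[j], b[j - 1], a[0], c)
    for i in range(1, m):
        for j in range(1, m):
            D[i, j] = min(D[i - 1, j - 1] + abs(a[i] - b[j]),              # move
                          D[i - 1, j] + _msm_cost(a[i], a[i - 1], b[j], c),  # split
                          D[i, j - 1] + _msm_cost(b[j], b[j - 1], a[i], c))  # merge
    return D[m - 1, m - 1]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(msm_distance(a, b, c=1.0))
```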

3.7.1 Time warp edit (TWE)

Introduced in [48], time warp edit (TWE) distance is an elastic distance measure described in Algorithm 6. It encompasses characteristics from both warping and editing approaches. The warping, called stiffness, is controlled by a parameter \(\nu \). Stiffness enforces a multiplicative penalty on the distance between matched points in a way that is similar to WDTW, where \(\nu = 0\) gives no warping penalty. The stiffness is only weighted this way when considering the match option (line 11). For the delete and insert operations (lines 12 and 13), a constant stiffness (\(\nu \)) and edit (\(\lambda \)) penalty are applied since the warping is considered from consecutive points in the same series. An example is shown in Fig. 8a.

Algorithm 6: TWE (\(\textbf{a},\textbf{b}\) (both series of length m), \(\lambda \) (edit cost), \(\nu \) (warping penalty factor), d (pointwise distance function))

$$\begin{aligned} d_{\textrm{TWE}}(\textbf{a,b}) = \textrm{TWE}(\textbf{a, b},\nu , \lambda ) \end{aligned}$$
(17)
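The sketch below follows the widely used formulation of the TWE recursion (match, delete in \(\textbf{a}\), delete in \(\textbf{b}\)), with the series notionally padded with a leading zero. The pointwise distance, padding convention and exact placement of the penalties in Algorithm 6 may differ slightly, so this should be read as an illustration only.

```python
import numpy as np

def twe_distance(a, b, nu=0.05, lam=1.0):
    """Time warp edit distance (illustrative version of the standard recursion)."""
    m = len(a)
    ap = np.concatenate(([0.0], a))        # pad so that a[i-1] exists when i = 1
    bp = np.concatenate(([0.0], b))
    d = lambda x, y: abs(x - y)            # pointwise distance
    D = np.full((m + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            delete_a = D[i - 1, j] + d(ap[i - 1], ap[i]) + nu + lam      # delete in a
            delete_b = D[i, j - 1] + d(bp[j - 1], bp[j]) + nu + lam      # delete in b
            match = (D[i - 1, j - 1] + d(ap[i], bp[j]) + d(ap[i - 1], bp[j - 1])
                     + 2 * nu * abs(i - j))                              # stiffness-weighted match
            D[i, j] = min(delete_a, delete_b, match)
    return D[m, m]

a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 2.0, 3.0, 2.0])
print(twe_distance(a, b, nu=0.05, lam=1.0))
```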
Table 2 Summary of distance functions, their parameters and the default values

A summary of the distance function parameters and their default values in our implementations is given in Table 2. DTW is sensitive to the window parameter [19], and large windows can cause pathological warpings. Based on experimental results [58], we set the default warping window to 0.2. WDTW uses a default scale parameter value for g of 0.05, based on results reported in [34]. LCSS and EDR both have an \(\epsilon \) parameter that acts as a threshold for similarity: if the difference between two values is below \(\epsilon \), the observations are considered a match. The fact that the effect of this parameter depends on the values of the series is one of the arguments for normalising all series. We set the default \(\epsilon \) to 0.05. MSM has a single parameter, c, representing the cost of the move–split–merge operations. This is set to 1 based on the original paper. TWE's \(\lambda \) is analogous to the c parameter in MSM, so we also set it to one, whereas \(\nu \) is related to the weighting parameter of WDTW, so we set it to 0.05.

3.8 Averaging time series

k-means clustering requires characterising a set of time series to form an exemplar. The standard approach for k-means is simply to find the mean of the current members of a cluster over time points. This is appropriate if the distance function is Euclidean distance since the average centroid is the series with the minimal Euclidean distance to members of the cluster. However, if cluster membership is assigned based on an elastic distance measure, the average centroid may misrepresent the elements of a cluster; it is unlikely to be the series that minimises the elastic distance to cluster members. Dynamic time warping with barycentre averaging (DBA) [55] (see Algorithm 7) was proposed to overcome this limitation in the context of DTW.

Algorithm 7: DTW Barycentre Averaging (\(\textbf{c}\), the initial average sequence, \(\mathbf{X_p}\), p time series to average)

DBA is a heuristic to find a series that minimises the elastic distance, rather than the Euclidean distance, to cluster members. Starting with some initial centre, DBA works by first finding the warping path of each series in the cluster onto the centre. It then updates the centre by finding which points were warped onto each element of the centre across all members of the cluster, then recalculating the centre as the average of these warped points.

Suppose function \(f(\textbf{a}, \textbf{b})\) returns the warping path of indexes

$$\begin{aligned} P=<(e_1,f_1),(e_2,f_2),\ldots ,(e_s,f_s)> \end{aligned}$$

generated by dynamic time warping. Given an initial centre \(\textbf{c}=<c_1, \ldots , c_m>\), DBA is described by Algorithm 7. It warps each series onto c (line 4), then uses the warping path to associate each value in the series with an index of the barycentre (lines 5 and 6). Once this is done for all series, the average of the values warped to each index of the centroid is taken. This is meant to better characterise the cluster members during the iterative partitioning.
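The sketch below gives a self-contained, unoptimised version of one DBA update as described above: each series is warped onto the current centre, warped values are collected per centre index, and each centre element is replaced by the mean of the values warped onto it. The small internal DTW path helper is our own simplified code, not the aeon implementation.

```python
import numpy as np

def _dtw_path(a, b):
    """Return the full-window DTW warping path as a list of index pairs (a index, b index)."""
    m, n = len(a), len(b)
    C = np.full((m + 1, n + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i, j] = (a[i - 1] - b[j - 1]) ** 2 + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    path, (i, j) = [], (m, n)
    while i > 1 or j > 1:                       # trace back through the minimum-cost cells
        path.append((i - 1, j - 1))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: C[ij])
    path.append((0, 0))
    return path[::-1]

def dba_update(centre, cluster):
    """One DBA iteration: average the values warped onto each index of the centre."""
    assigned = [[] for _ in range(len(centre))]
    for series in cluster:
        for ci, si in _dtw_path(centre, series):
            assigned[ci].append(series[si])     # value of the series aligned to centre index ci
    return np.array([np.mean(vals) if vals else centre[k] for k, vals in enumerate(assigned)])

cluster = [np.array([1.0, 2.0, 3.0, 2.0, 1.0]), np.array([1.0, 1.0, 2.0, 3.0, 2.0])]
new_centre = dba_update(cluster[0].copy(), cluster)
```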

4 Methodology

Guided by related research, we specify the data we use, how we handle the data and the performance measures and hypothesis tests used to compare algorithms.

4.1 Data

We experiment with time series data in the University of California, Riverside (UCR) archive [18]. We restrict our attention to univariate time series, and in all experiments, we use 112 of the 128 datasets from the UCR time series archive. We exclude datasets containing series of unequal length or missing values. We also remove the Fungi data, which only provides a single train case for each class label. We report results using the six performance measures described in Sect. 4.2 on both the training sample and the test set.

Using the class labels to assess performance on some of these data may be unfair. For some problems, clustering algorithms naturally find clusters that are independent of the class labels but are nevertheless valid. For example, the GunPoint data set was created by a man and a woman drawing a gun from their belt or pretending to draw a gun. The class labels are Gun/No Gun. However, many clusterers find the Man/Woman clusters rather than Gun/No Gun. Without supervision, this is a perfectly valid clustering, but it will score approximately 50% accuracy since the man and woman cases are split evenly between Gun/No Gun. This is an inherent problem with attempting to evaluate exploratory, unsupervised algorithms by comparing them with what we know to be true a priori: if a clustering simply finds what we already know, its utility is limited. Furthermore, as observed in [39], some of the datasets have the same time series but with different labels. We aim to mitigate against these problems by using a large number of problems.

There are further problems with using the UCR data for algorithm analysis (see [26]): many of the data have been preprocessed with expert knowledge; the train/test sets have been hand crafted in some instances; there are no missing values in the data we use; all data are sampled at the same frequency; and the train sets are relatively small and the types of problem represent the research interests of the contributors. It is not a perfect model of the real world, nor representative of the type of problem faced by many data scientists in practice. Nevertheless, these faults should be put in context of the wider machine learning research. It is still common to see papers use the same 20 datasets from the UCI archive that have been employed since last century. We believe that bake offs using the UCR data still have value, but that we should continue to look to expand and diversify the archive.

4.2 Clustering metrics

Table 3 Six clustering performance measures used in experimentation

Table 3 summarises the six performance measures we use in our evaluation. Clustering accuracy (CL-ACC), like classification accuracy, is the number of correct predictions divided by the total number of cases. To determine whether a cluster prediction is correct, each cluster has to be assigned to its best matching class value. This can be done naively, taking the maximum accuracy over every permutation of cluster-to-class assignments \(S_k\).

$$\begin{aligned} \text {CL-ACC}({\textbf{y}},{\hat{\textbf{y}}}) = \max _{{\textbf{s}} \in \mathbf {S_k}} \frac{1}{|{\textbf{y}}|} \sum _{i=1}^{|{\textbf{y}}|} {\left\{ \begin{array}{ll} 1, &{} y_i = {\textbf{s}}({\hat{y}}_i) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(18)

Checking every permutation like this is prohibitively expensive, however, and the assignment can be found more efficiently using combinatorial optimisation algorithms for the assignment problem. A contingency matrix of cluster assignments and class values is created and turned into a cost matrix by subtracting each value of the contingency matrix from the maximum value. Clusters are then assigned using an algorithm such as the Hungarian algorithm [38] on the cost matrix. If the class value of a case matches the assigned class value of its cluster, the prediction is deemed correct; otherwise it is incorrect. As each class can have only one assigned cluster, all cases in clusters left unassigned because the number of clusters exceeds the number of class values are counted as incorrect.
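A sketch of this assignment-based accuracy computation is given below, using SciPy's linear_sum_assignment as the Hungarian-style solver. The contingency matrix is built directly with NumPy rather than with any particular toolkit helper, so this illustrates the idea rather than our exact evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """CL-ACC: map each cluster to its best matching class, then score as accuracy."""
    classes, y_true_idx = np.unique(y_true, return_inverse=True)
    clusters, y_pred_idx = np.unique(y_pred, return_inverse=True)
    # contingency matrix: rows are clusters, columns are classes
    contingency = np.zeros((len(clusters), len(classes)), dtype=int)
    for p, t in zip(y_pred_idx, y_true_idx):
        contingency[p, t] += 1
    # convert to a cost matrix and solve the assignment problem
    cost = contingency.max() - contingency
    row_ind, col_ind = linear_sum_assignment(cost)
    # cases in clusters that receive no class (when k exceeds the number of classes) count as incorrect
    return contingency[row_ind, col_ind].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))   # 1.0: a relabelling of the same grouping
```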

The Rand index (RI) measures the similarity between two sets of labels. This could be between the labels produced by different clustering models (thereby allowing direct comparison) or between the ground truth labels and those the model produced. The Rand index is the number of pairs of cases on which the two labellings agree divided by the total number of pairs.

One of the limiting factors of RI is that the score is inflated, especially when the number of clusters is increased. The adjusted Rand index (ARI) compensates for this by adjusting the RI based on the expected scores on a purely random model.

The mutual information (MI) is a function that measures the agreement of the two clusterings or a clustering and a true labelling, based on entropy. Normalised mutual information (NMI) rescales MI onto [0, 1], and adjusted mutual information (AMI) adjusts the MI to account for the class distribution.

The Davies–Bouldin index is an unsupervised measure that we employ for tuning clusterers. It compares the within-cluster variation with the between-cluster separation, with a lower score indicating more compact, better separated clusters.

4.3 Related clustering comparisons

There have been several reviews and summaries of the TSCL field that compare algorithms on the UCR datasets, and we use these to guide our methodology. [9] compares five DTW and ED clusterers on 5 UCR datasets using the Rand index. [33] compared eight variants of k-means, k-medoids and density peaks on the same 112 UCR data we also use. They combined train and test data, used raw series and evaluated algorithms with the adjusted Rand index. [2, 4] and [3] are literature reviews for TSCL that do not include results. [72] compares 10 clusterers on 36 time series datasets from the UCR archive. They use the Rand index and normalised mutual information to compare algorithms. They used the provided train and test splits on the raw data. [39] compared a range of deep learning-based algorithms on the UCR data using normalised mutual information, adjusted Rand index and accuracy as assessment criteria.

There are two important decisions to make about experimental design: whether to normalise all series to zero mean and unit variance and whether to merge the train and test data or train and test on separate data samples. The issue of whether to always normalise is an open question for TSC, since some discriminatory features may be in the scale or variance. Some argue that normalisation should always occur. For example, a 2012 paper that has been cited over 1000 times states that “In order to make meaningful comparisons between two time series, both must be normalised” [56]. However, in TSC whether to normalise or not is often treated as a parameter. We wish to control as many factors as possible in our experiments, so we perform identical experiments with both raw and normalised data.

Evaluating on unseen test data is essential for any classification comparison. The issue is less clear cut with clustering, which is used more as an exploratory tool than a predictive model. Many comparative studies (e.g. [33]) have combined the train and test data and performed evaluation on a single data set. More recent research has followed the required protocol for classification of evaluating estimators on data not used in training (e.g. [39]). We perform all training on the default train split provided in the archive and present the results on both the train set and test set. We have implemented the distance functions and clustering algorithms in the aeon toolkit. Details of the implementation and guidance on reproducing results are provided in “Appendix A”.

To compare multiple clusterers on multiple datasets, we use the rank ordering of the algorithms for any given performance measure. We use an adaptation of the critical difference diagram [20], replacing the post hoc Nemenyi test with a comparison of all classifiers using pairwise Wilcoxon signed-rank tests, and cliques formed using the Holm correction recommended by [11, 25]. We use \(\alpha =0.05\) for all hypothesis tests. Critical difference diagrams such as those shown in Fig. 9 display the algorithms ordered by the average rank of the statistic in question and the groups of algorithms between which there is no significant difference (cliques). So, for example, in Fig. 9c, MSM has the lowest average rank of 3.8973 and is not in a clique, so has significantly better ARI than the other algorithms. TWE, ERP, WDTW and ED are all grouped into a clique, which means there is no pairwise significant difference between them. LCSS, EDR, WDDTW and DTW form another clique, and DDTW is significantly worse than all other algorithms (Fig. 10). For consistency with some of the related research, Fig. 11 shows the same results on train data. The pattern of results is the same as on the test data, although there is less significance in the results.
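As an illustration of the comparison procedure, the following sketch performs the pairwise Wilcoxon signed-rank tests with a Holm step-down correction over a table of scores (one row per dataset, one column per clusterer). The clique construction and the critical difference diagrams themselves are produced with separate plotting code not shown here.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

def pairwise_wilcoxon_holm(scores, names, alpha=0.05):
    """scores: (n_datasets, n_estimators) array of a performance measure (e.g. ARI)."""
    pairs = list(combinations(range(scores.shape[1]), 2))
    p_values = np.array([wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs])
    # Holm correction: compare sorted p-values against alpha / (number remaining)
    order = np.argsort(p_values)
    significant = np.zeros(len(pairs), dtype=bool)
    k = len(pairs)
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (k - rank):
            significant[idx] = True
        else:
            break                                   # all remaining (larger) p-values also fail
    return [(names[i], names[j], p_values[t], significant[t])
            for t, (i, j) in enumerate(pairs)]

rng = np.random.default_rng(0)
scores = rng.random((112, 3))                        # dummy scores for three clusterers
print(pairwise_wilcoxon_holm(scores, ["msm", "ed", "dtw"]))
```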

5 Results

We report a sequence of five sets of experiments designed to detect differences in performance between distance functions and between the two clustering algorithms, k-means and k-medoids. Firstly, in Sect. 5.1 we compare k-means clustering using 10 different distance functions. Secondly, in Sect. 5.2 we report the equivalent k-medoids results and address the question of whether the same pattern of performance seen with k-means is observable with k-medoids. We also assess whether k-medoids is more effective than k-means on the UCR data. We conduct the experiments in Sects. 5.1 and 5.2 on both normalised and raw data. Thirdly, in Sect. 5.3 we evaluate the relative effect of the clustering initialisation algorithm on the performance of the distance functions. Next, in Sect. 5.4 we investigate the performance of DTW in more detail, exploring the effect of the warping window on the clustering. Finally, in Sect. 5.5 we see whether we can improve performance by tuning distance parameters.

5.1 Elastic distances with k-means

The first set of experiments compares alternative distance functions with k-means clustering, using the arithmetic mean of series to find the centroids. Our primary aim is to investigate whether there are any significant differences between the measures when used on the UCR univariate classification datasets with both raw and normalised data. For each experiment, we ran k-means clustering with the default aeon parameters (random initialisation, maximum 300 iterations, 10 restarts, centroid averaging method the mean) with Euclidean distance and the nine elastic distance measures, set up with the default parameters listed in Table 2. Figure 9 summarises the results on the test data using normalised series. Figure 10 displays the same results found using the raw data. Full results are available on the accompanying website, and implementation details with code examples are provided in “Appendix A” and on the associated GitHub repository.

Fig. 9: Critical difference diagrams for k-means clustering using Euclidean distance and nine elastic distance functions on 112 UCR problems (normalised data), assessed with six performance measures. Results are on the test data

Fig. 10: Critical difference diagrams for k-means clustering using Euclidean distance and nine elastic distance functions on 112 UCR problems (raw data), assessed with six performance measures. Results are derived from the test data

Fig. 11: Train results for k-means clustering with 10 different distance measures using normalised data

The two derivative approaches are both significantly worse than their non-derivative counterparts on the raw data, suggesting that clusters in the time domain better reflect the true classes. LCSS and EDR also perform poorly on all tests. This implies the simple edit thresholding is not sensitive enough to find clusterings that represent the class labels. The best overall performing measures are MSM and TWE. These measures are similar, in that they combine elements of both warping and editing.

5.2 Elastic distances with k-medoids clusterer

Using standard centroids means that the averaging method is not related to the distance measure used, unless employing Euclidean distance. This disconnect between the clustering stages may account for the poor performance of many of the distance measures, in particular relative to k-means with Euclidean distance. Figure 12 shows the ranked performance summary for the ten distance measures using k-medoids rather than k-means. The pattern of performance is broadly the same as with k-means (Fig. 9), with some notable differences: DTW is now no longer worse than Euclidean; ERP performs much better; and there is no overall difference between ERP, TWE and MSM as the top performing algorithms. The top performing distance functions all involve an explicit penalty for warping. As with k-means, this indicates that regularisation of path length produces better clusters on average. Figure 13 shows the equivalent results when using the raw data.

Fig. 12: Critical difference diagrams for k-medoids clustering with ten distance measures assessed on the test data (normalised data)

Fig. 13: Critical difference diagrams for k-medoids clustering with ten distance measures assessed on the test data (raw data)

Table 4 Accuracy and Davies–Bouldin averaged over 112 problems for k-means and k-medoids clustering on normalised data

Also of interest is the relative performance of k-means and k-medoids. Table 4 lists the average accuracy over all problems of k-means, k-medoids and single cluster predictions for the ten distance measures. Accuracy increases for all distances except Euclidean. This indicates that k-medoids is a stronger benchmark for TSCL algorithms than k-means on the UCR data, irrespective of distance measure. Table 4 also shows the aggregated unsupervised Davies–Bouldin (DB) scores for k-means and k-medoids, which we later use in tuning. Small values are better for DB, so the results tell a different story. We believe this is because DB is closely related to k-means, in that it averages distances within a cluster. k-means is designed to optimise a measure similar to DB. However, Table 4 illustrates this does not always lead to more accurate clustering algorithms.

5.3 Clustering algorithm initialisation

One source of variation could be our implementation and parameterisation of the clustering algorithms. Our implementations of k-means and k-medoids are simple algorithms based on the scikit-learn implementation of k-means. The three parameters that might affect performance are the maximum number of iterations given to the clustering algorithm (defaults to 300), the initialisation algorithm (defaults to random) and the number of initialisation restarts (defaults to 10). Changing the maximum number of iterations is unlikely to improve performance: for most datasets/distances, convergence happens in under 20 iterations, and given that tslearn defaults to 30 iterations, increasing beyond 300 is unlikely to have a significant effect. Restarting 10 times is computationally intensive, and increasing this further is also unlikely to improve performance. On the other hand, clustering using the alternating k-medoids algorithm is known to suffer from the initialisation problem [27] (see Sect. 2.1). We evaluate whether the difference in performance we have observed could have been caused by the initialisation algorithm. We repeat our experiments on the normalised data for DTW, Euclidean and MSM using random, Forgy and k-means++ initialisation with both k-means and k-medoids. Our primary interest is the variation caused by the distance function, so we present the relative performance of the three distances for the alternative initialisation techniques in Fig. 14. The ordering is the same throughout: MSM is better than ED, which is better than DTW. We conclude that the results observed in Sects. 5.1 and 5.2 are broadly independent of the initialisation algorithm.

Fig. 14: Critical difference diagrams for alternative clustering initialisation algorithms

5.4 DTW warping window

In Sect. 5.1, we observed that DTW with a window of 20% of the series length is significantly worse than ED. k-means with DTW is widely used as a baseline for time series clustering [33]. ED is a special case of DTW with a window size of zero, so the finding is counter to our expectations and merits further investigation. It is even more notable when we observe that WDTW is significantly better than DTW. WDTW, MSM, TWE and ERP all impose an explicit penalty for warping. The DTW penalty for warping is implicit (warping means a longer path), and the results indicate that this allows more warping than is desirable for clustering.

To test whether this result was an artefact of our implementation, we reran the clustering experiments using the Java toolkit tsml version of DTW in conjunction with the WEKA k-means clusterer, and found no significant difference in the results. Furthermore, we have checked that the aeon DTW distance implementation produces the same distances as the other implementations listed in Table 15. Begum et al. [10] also found that, for clustering, ED outperforms DTW.

If our implementation is correct, then perhaps the poor performance is due to our experimental set up. Window size is the key parameter for k-means DTW. We initially experimented with k-means with a full window and found it performed worse than DTW constrained to 20% (often called cDTW [19]). However, a window size of 5% is also popular in the literature [53]. We first repeat our experiments with a 5% window size. We then assess whether tuning the window size improves performance. Finally, we examine whether using an alternative averaging algorithm for DTW makes a significant difference.

Fig. 15: Critical difference diagrams for k-means clustering with the following DTW variants: a 20% window (dtw20); a 5% window (dtw5); a tuned window (dtw-t); and centres found with barycentre averaging (dtw-ba). Euclidean distance (ed) and move–split–merge (msm) are also included for reference

5.4.1 Smaller maximum DTW warping window

Figure 15 demonstrates that using a smaller maximum warping window improves the performance of k-means DTW to the point that it is no longer significantly worse than k-means with ED. Figure 16 shows scatter plots of the accuracy and rand index of DTW with a 5% window against a 20% window and against ED. A smaller window is better, but no window at all (i.e. ED) is equally effective. DTW with a small window is simply a better approximation of ED. These results suggest DTW is not adding much value to clustering on the UCR data.

Fig. 16 Scatter plots of accuracy and rand index for 5% window DTW versus 20% window DTW and versus Euclidean distance

5.4.2 Tuning the DTW warping window

Tuning significantly improves the performance of DTW 1-NN classification, so it is possible that it may also improve DTW-based clustering. To tune a clusterer we need an unsupervised cluster assessment measure. We use the Davies–Bouldin score, since it is popular and available in scikit-learn. We evaluate DTW at twenty window sizes evenly spaced in the range [0, 1) and use the window with the lowest Davies–Bouldin score for the final clustering (favouring smaller windows if scores are tied).
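The tuning loop can be summarised as follows. Here `cluster_with_window` is a hypothetical stand-in for fitting k-means with DTW at a given window fraction and returning the cluster labels; the Davies–Bouldin score is the one available in scikit-learn (lower is better), and the strict comparison keeps the smaller window on ties, matching the set-up described above.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def tune_dtw_window(X, cluster_with_window, n_windows=20):
    """Pick the DTW window fraction that minimises the Davies-Bouldin score.

    X is an (n_cases, n_timepoints) array. cluster_with_window(X, w) is a
    user-supplied function (hypothetical here) that runs k-means with DTW
    restricted to window fraction w and returns cluster labels.
    """
    best_w, best_score = None, np.inf
    for w in np.linspace(0.0, 1.0, n_windows, endpoint=False):  # [0, 1)
        labels = cluster_with_window(X, w)
        score = davies_bouldin_score(X, labels)   # lower is better
        if score < best_score:                    # strict < keeps smaller w on ties
            best_w, best_score = w, score
    return best_w
```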

Figure 17 shows that tuning in this way (dtw-t) is better than using a fixed window of 20% (dtw20), but does not improve on DTW with a fixed window of 5% (dtw5). The tuned clusterer is no longer significantly worse than ED, but it is still significantly worse than MSM. It is possible that a finer-grained tuning would yield better performance. However, tuning requires significant computation, and it appears that all the extra effort simply achieves a better approximation of ED.

Fig. 17 Scatter plots of accuracy and rand index for tuned DTW versus 5% window DTW and versus Euclidean distance

5.4.3 k-means DBA

Using medoids mitigates the problem of averaging centres with k-means. DTW barycentre averaging (DBA), described in Sect. 3.8, has also been proposed as a means of improving centroid finding for k-means DTW. We have repeated our experiments with the same k-means set-up, but with centroids found using the original DBA described in Algorithm 7. Figure 18a shows that DBA does indeed significantly improve DTW, a result that reproduces the findings in [55].

However, Fig. 18b illustrates that it is not significantly different to k-medoids DTW. Furthermore, Fig. 19 shows that k-means with DBA is significantly worse than the top clique, composed of k-medoids MSM, k-means MSM and k-medoids TWE.
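For reference, the sketch below gives a compact version of the barycentre update: members are aligned to the current centre with full-window DTW, and each centre point is replaced by the mean of the member values warped onto it. It is a simplified illustration of the idea, not a reproduction of Algorithm 7 or of the tslearn refinement.

```python
import numpy as np

def dtw_path(x, y):
    """Full-window DTW returning the optimal alignment as a list of (i, j) pairs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = (x[i - 1] - y[j - 1]) ** 2 + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                      # backtrack from the end
        path.append((i - 1, j - 1))
        step = np.argmin((cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dba(cluster, n_iters=10):
    """Simplified DTW barycentre averaging for a 2D array of equal-length series."""
    centre = cluster.mean(axis=0)               # start from the arithmetic mean
    for _ in range(n_iters):
        sums = np.zeros_like(centre)
        counts = np.zeros(len(centre))
        for series in cluster:
            for i, j in dtw_path(centre, series):
                sums[i] += series[j]            # member value warped onto centre point i
                counts[i] += 1
        centre = sums / counts                  # average the aligned values
    return centre
```

In k-means, an update of this kind replaces the arithmetic-mean centroid computation for every cluster at every iteration, which is where the computational overhead noted above comes from.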

Fig. 18 Scatter plots of accuracy for DTW-based k-means with DBA against standard averaging (a) and against k-medoids DTW (b)

Fig. 19 Critical difference diagrams for the best performing k-means and k-medoids distances, and for k-means with DBA (dtw-dba)

Fig. 20 Heat map, as described in [28], summarising the results used to generate Fig. 19

Our implementation of DBA is faithful to the original (Fig. 20). An alternative version of DBA was described in [63]; this version is implemented in the tslearn toolkit. We have wrapped the tslearn k-means DBA algorithm into the aeon toolkit and rerun our experiments with this version. Figure 21 compares the performance of the two DBA versions with MSM k-medoids and MSM k-means.

Fig. 21 Comparison of two DBA implementations against k-medoids and k-means with the MSM distance

Table 5 Run time (in hours): average, maximum and total over the 112 problems

Table 5 summarises the time taken to run the experiments. We ran experiments on our HPC cluster, so timing results are indicative only, and the differences are fairly small. However, we note that of the two best performing algorithms, k-medoids MSM and k-medoids TWE, MSM is the faster.

5.5 Tuning the distance functions

Table 6 Parameter ranges for tuning distance functions with k-means

Clustering performance could also be improved by tuning the distance functions (as we did with DTW in Sect. 5.4.2). We tune four distance functions using k-medoids clustering over the parameter ranges given in Table 6 to determine whether this improves performance. The ranges are taken from [44], which in turn took the suggested ranges from the original papers. Tuning is also a way of demonstrating the sensitivity of a distance measure to its parameters: if tuning makes no significant difference compared to the default parameters, we consider the clusterer robust to the parameter and the default value an appropriate one.
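As a sketch of the procedure, the snippet below tunes a single distance parameter for k-medoids with the Davies–Bouldin score. The grid, `pairwise_distance` and `fit_kmedoids` are hypothetical stand-ins: the real parameter ranges are those in Table 6 and the real clusterer is the one used throughout the paper.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def tune_distance_parameter(X, k, param_grid, pairwise_distance, fit_kmedoids):
    """Unsupervised tuning of one distance parameter for k-medoids clustering.

    pairwise_distance(X, value) returns the pairwise distance matrix for a
    candidate parameter value and fit_kmedoids(D, k) returns cluster labels
    from a precomputed matrix; both are hypothetical stand-ins for toolkit calls.
    """
    best_value, best_score = None, np.inf
    flat = X.reshape(len(X), -1)                    # DB score works on feature vectors
    for value in param_grid:
        labels = fit_kmedoids(pairwise_distance(X, value), k)
        score = davies_bouldin_score(flat, labels)  # lower is better
        if score < best_score:
            best_value, best_score = value, score
    return best_value
```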

Fig. 22 Scatter plots of tuned versus untuned k-means clustering algorithms

6 Performance analysis

The UCR archive is a diverse collection of datasets, and whilst overall performance gives some indication as to good default benchmarks, it does not provide insight into the best approaches for particular data characteristics or problem domains.

Table 7 Average rank performance split by the number of clusters

The number of classes/clusters in the archive ranges from 2 to 60. To aid analysis, we group the datasets into four groups: 2 clusters (Group A, 40 datasets); 3–5 clusters (Group B, 33 datasets); 6–10 clusters (Group C, 19 datasets); and 11 or more clusters (Group D, 13 datasets). Table 7 shows the average accuracy rank by group for the k-means and k-medoids results on normalised data presented in Sect. 5. MSM and TWE both improve relatively with more clusters. The two edit distance algorithms, EDR and LCSS, get worse with more clusters, as do the derivative methods, leading to a relative improvement for ED.

Series length varies from 15 to 3000. We group these into problems with series lengths less than 200 (Group A, 40 problems), 201–500 (Group B, 31 problems), 501–1000 (Group C, 20 problems) and more than 1000 (Group D, 21 problems). Table 8 breaks down the ranks by series length. There is no discernible pattern: series length does not seem to be a major factor in performance.

Table 8 Average rank performance split by the length of the series

Train set size may also influence performance. Train set sizes in the UCR data range from 16 to 8926. We split the data into the following four groups: fewer than 100 training cases (Group A, 41 datasets); 100–299 cases (Group B, 24 datasets); 300–499 cases (Group C, 25 datasets); and 500 or more cases (Group D, 22 datasets). Table 9 summarises the ranks of the distance functions split into these groups. As the number of training cases increases, the relative difference between MSM, WDTW and TWE increases. This is more pronounced with k-medoids MSM, which is clearly better with 300 or more training instances.

Table 9 Average rank performance split by the number of training cases

Finally, the datasets have been classified as coming from different problem types. Table 10 describes the seven categories. These categories are useful but not definitive; for example, there is overlap between DEVICE, SENSOR and MOTION. However, breaking down by problem type can yield insights. Tables 11 and 12 show the breakdown of ranks by problem type. MSM performs best on DEVICE, ECG, IMAGE and SIMULATED. These are all categories where some phase shift within a class would be expected. ED is best on SPECTRO problems. Spectrograms are ordered series in the frequency domain, and we would expect the complexity of these problems to arise from noise and dimensionality rather than phase shift. ED also does surprisingly well on MOTION problems with k-means, but less well with k-medoids.

Table 10 Dataset categories
Table 11 Average rank performance of k-means on problems split by the problem domain
Table 12 Average rank performance of k-medoids on problems split by the problem domain

The results show that MSM is generally the best performing distance. Given the prevalence of DTW-based clustering, this is perhaps surprising, and the reasons merit further investigation.

One explanation could be that it is an artefact of the clustering algorithm. There are many examples where hierarchical clustering with DTW seems to produce more intuitive clusterings than Euclidean distance [19]. Partitional clustering has the advantage over hierarchical clustering that it can be applied to group unseen test data. This is important because clustering often plays a component role in machine learning pipelines (e.g. dimensionality reduction). Evaluation of distance functions for hierarchical clustering is also more complex, and we consider it future work.

Another possible explanation is that the results are an artefact of the datasets used in the evaluation. There are many known limitations of the UCR data (see [26]), which have been preprocessed in ways that may favour Euclidean distance-based clustering. Whilst it is important to acknowledge this, we also note that we are following almost every other paper referenced here in using the UCR data. By highlighting the fact that standard DTW does not cluster well, we are not arguing that it should never be used. Our aim is to discourage its use as a straw man for new algorithms evaluated on the UCR data and to highlight that there may be better alternatives.

We think the true cause of the difference between DTW and MSM lies in the warping penalty mechanism. All elastic distances implicitly penalise warping, simply because more pointwise distances are included in the calculation. However, MSM and TWE also explicitly penalise moving off the diagonal with a data-driven cost term. We think this disincentive to warp reduces the pathological warping that is more likely to occur in an unsupervised setting such as clustering.
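To make the explicit penalty concrete, the sketch below gives a compact univariate MSM following the recurrence in the original paper [67]: a diagonal (move) step costs the pointwise difference, while every split or merge (off-diagonal) step incurs at least the constant cost c. This is our own illustrative version, not the aeon implementation, and the default value of c is shown purely for illustration.

```python
import numpy as np

def _msm_cost(new, prev, other, c):
    """Cost of a split/merge step: at least c, plus a data-driven term when
    the new value does not lie between the two values it is placed alongside."""
    if (prev <= new <= other) or (prev >= new >= other):
        return c
    return c + min(abs(new - prev), abs(new - other))

def msm_distance(x, y, c=1.0):
    """Move-split-merge distance between two univariate series; c is the constant penalty."""
    n, m = len(x), len(y)
    D = np.full((n, m), np.inf)
    D[0, 0] = abs(x[0] - y[0])
    for i in range(1, n):                                   # first column: splits only
        D[i, 0] = D[i - 1, 0] + _msm_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, m):                                   # first row: merges only
        D[0, j] = D[0, j - 1] + _msm_cost(y[j], y[j - 1], x[0], c)
    for i in range(1, n):
        for j in range(1, m):
            D[i, j] = min(
                D[i - 1, j - 1] + abs(x[i] - y[j]),                 # move (match)
                D[i - 1, j] + _msm_cost(x[i], x[i - 1], y[j], c),   # split
                D[i, j - 1] + _msm_cost(y[j], y[j - 1], x[i], c))   # merge
    return D[n - 1, m - 1]
```

Because c is added to every off-diagonal step regardless of how small it is, even short excursions from the diagonal carry a cost, which is the disincentive to warp referred to above.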

To test this hypothesis, we re-ran the experiments, recording the number of diagonal, horizontal and vertical moves made by DTW (5% and 20% warping windows), MSM and TWE when aligning each instance to its best cluster centre using k-means. The number of horizontal moves always equals the number of vertical moves, since the series are of equal length and the alignment must match the series end points. We average moves over all calculations and then over all datasets. Table 13 summarises the number of diagonal and off-diagonal moves. DTW makes a surprising number of moves off the diagonal. The average series length is 551, but the average DTW warping path length is 863 with a 20% window and 690 with a 5% window; on average, the DTW warping path is 145% of the series length. MSM and TWE make far fewer off-diagonal moves, and there is still much more warping with a 5% window than with MSM. This implies DTW is moving back and forth off the diagonal far more than MSM. The notion that "a little warping is a good thing, but too much warping is a bad thing" is established for both classification [58] and clustering [19]. These results suggest that too much warping can also mean repeated small warpings rather than single long pathological warpings. Alignments with many gaps appear to result in inferior clustering (Fig. 22).
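The move counting itself is straightforward once an alignment path is available; a sketch of the bookkeeping, assuming a path of (i, j) index pairs such as the one returned by the dtw_path sketch above, is:

```python
def count_moves(path):
    """Count diagonal and off-diagonal steps in an alignment path.

    path is a list of (i, j) index pairs from the start to the end of the
    alignment; horizontal and vertical steps are the off-diagonal warps.
    """
    diagonal = off_diagonal = 0
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if i1 == i0 + 1 and j1 == j0 + 1:
            diagonal += 1
        else:
            off_diagonal += 1
    return diagonal, off_diagonal
```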

Table 13 Average number of moves on and off the diagonal for three distances

Warping some of the time is often desirable, but DTW tends to warp too often. An example can be found in the Fish dataset, where MSM k-means performs better than DTW k-means across all metrics. When we investigated the warping paths, we found significant differences in the amount of warping between DTW and MSM. Figure 23 shows two alignment paths (DTW on the left, MSM on the right) between two time series from the Fish dataset. DTW makes many more small warps off the diagonal, whereas MSM makes fewer, smaller adjustments. The colours form a heat map of the cost matrix, with lighter shades representing higher values and darker shades lower values. The previous experiments showed that neither restricting the window nor tuning the window size reduced this over-warping enough to bridge the gap between DTW and MSM. Another possible reason for the poor performance of DTW could be a tendency to over-warp at the beginning or end of the series whilst remaining within the band, which has been shown to affect performance [66].

Fig. 23 Example warping paths for DTW and MSM on two cases from the Fish dataset

Table 14 Proportion of off-diagonal moves in each 10% segment of the series across all datasets

There are two mechanisms for constraining DTW warping at the beginning or end of a series. Firstly, the warping band can be made narrower at the beginning and end by using, for example, a parallelogram band [30] rather than a uniform band. Secondly, the prefix/suffix invariant (PSI) technique [66] can be employed; the latter is designed specifically for data that has not been preprocessed and hence is not suitable for the UCR data. Nevertheless, we ran DTW using both PSI and a parallelogram band, but neither significantly improved on standard DTW. To find out why, we recorded in which decile of the series the off-diagonal moves occur for DTW, MSM and TWE. We already know DTW warps more than MSM, so Table 14 shows the proportion of warps for each distance normalised by its total number of off-diagonal moves. DTW warps fairly consistently along the whole series but, surprisingly, MSM warps proportionately much more in the first decile and much less in the ninth and tenth. We are not sure exactly why this happens; it runs counter to the intuition that bad warpings may happen at the beginning.
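The decile breakdown in Table 14 can be produced with the same bookkeeping, binning each off-diagonal step by its position along the series; a minimal sketch, again assuming an (i, j) alignment path, is:

```python
import numpy as np

def off_diagonal_deciles(path, series_length):
    """Proportion of off-diagonal steps falling in each 10% segment of the series."""
    counts = np.zeros(10)
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):      # horizontal or vertical step
            decile = min(int(10 * i1 / series_length), 9)
            counts[decile] += 1
    total = counts.sum()
    return counts / total if total else counts
```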

7 Conclusions

We have described nine elastic time series distance measures and compared them when used to cluster the time series of the UCR archive with both k-means and k-medoids. There is a wealth of other time series clustering algorithms that we have not evaluated; we opted to provide an in-depth description, with examples and associated code, and a tightly constrained bake off, rather than attempt to include all variants of, for example, deep learning-based and transformation-based time series clusterers. Distance-based approaches are still very popular, and we believe our experimental observations will help practitioners choose distance-based TSCL algorithms more effectively.

Our first conclusion is that k-medoids is more effective than k-means for TSCL with elastic distance measures on the UCR datasets. Standard k-means has the inherent problem that cluster centres are formed by averaging series, which takes no account of the distortions that elastic distances are designed to compensate for. DBA does adjust for this problem with k-means, but it is designed specifically for DTW and comes with a high computational overhead compared to k-medoids. The improved version of DBA [63] implemented in tslearn is significantly better than the original, but it is not significantly different to either MSM clusterer and takes approximately twice as long to run. There is little difference in run time between k-means with averaging and k-medoids: k-medoids is faster but uses more memory.

Our second conclusion, based on these experiments on the UCR data, is that without any data-specific prior knowledge as to the best approach, k-medoids with either MSM or TWE is a good benchmark for distance-based clustering of time series, with MSM preferred because of its lower run time. MSM is the top ranked distance under nearly all performance measures and configurations of both clusterers. We believe it is not widely known, and that MSM and TWE should be included when assessing new clustering algorithms, although the tslearn version of DBA is also a good benchmark. TWE is slightly slower than the other distances; however, run time and memory were not a major constraint for these experiments.

More specifically, we conducted the following experiments:

1. In Sect. 5.1, we compare the 10 distance functions listed in Table 1 using k-means clustering and find that MSM produces significantly better clusters under five of the six performance measures, and that five of the nine elastic measures perform worse than Euclidean distance. We find the same pattern of results with both normalised data (Fig. 9) and raw data (Fig. 10).

2. We repeat the experiments using k-medoids in Sect. 5.2. We find that, on the test data, TWE and MSM form a top clique and that ERP performs much better than it does with k-means. We observe a similar pattern of results on the train data and show that, overall, k-medoids is better than k-means, independent of the elastic distance used.

3. In Sect. 5.3, we assess the impact of alternative cluster initialisation algorithms and find that MSM still outperforms ED and DTW for all three initialisation algorithms with both k-means and k-medoids.

4. In Sect. 5.4, we explore variants of DTW to investigate why it performs so poorly. We compare different fixed window sizes and find that whilst smaller windows better approximate ED, they do not improve performance. We find that tuning the DTW window does not improve performance either. We then look at barycentre averaging for k-means and show that whilst it significantly improves k-means DTW, it requires the specific refinement described in [63] to reach equivalent performance.

5. Finally, in Sect. 5.5, we tune the distance functions using the Davies–Bouldin measure and find that it does not improve performance.

We have released all our results and code in a dedicated repository. We would welcome contributions of new distance functions to aeon and would be happy to extend the evaluation to include them. Our next step is to extend the bake off to clustering algorithms not based on elastic distances and to investigate alternative medoid-based approaches such as partitioning around medoids [40]. Furthermore, we have not yet investigated the effect of having to set the number of clusters, k; a future experiment will test whether the relative performance remains the same when standard techniques for setting k are used. There are also several possible directions for algorithmic advancement highlighted by this research. We will try combining the clusterings in an ensemble, in a manner similar to the elastic ensemble for classification [44]. Clustering ensembles are more complex than classification ensembles, requiring some form of alignment of the labellings; nevertheless, it seems likely that some improvement could come from doing so. These are just some of the numerous open issues in TSCL research. Our study aims to put future research on a sound basis and to facilitate the accurate assessment of algorithmic improvements in a fully reproducible manner.