Abstract
Matrix Profile (MP) has been proposed as a powerful technique for knowledge extraction from time series, and several algorithms have been proposed for computing it, e.g., STAMP and STOMP. Currently, the MP is computed based on a 1NN search over all subsequences of the time series. In this paper, we claim that a kNN MP can be more useful than the 1NN MP for knowledge extraction, and propose an efficient technique to compute such an MP. We also propose an algorithm for the parallel execution of the kNN MP using multiple cores of an off-the-shelf computer. We evaluated the performance of our solution on multiple real datasets. The results illustrate the superiority of the kNN MP over the 1NN MP for knowledge discovery.
Notes
The source code and the tested datasets are publicly available at: https://sites.google.com/view/knnmatrixprofile/home
This case applies only when we do not have a single big time series, but several small time series in the database. A big time series is then formed by concatenating these small time series.
We have shared the code, datasets and instructions in: https://github.com/tanmayGIT/kNN_Matrix_Profile.
References
Balasubramanian A, Wang J, Prabhakaran B (2016) Discovering multidimensional motifs in physiological signals for personalized healthcare. J Sel Topics Signal Process 10(5):832–841
Dau HA, Bagnall A, Kamgar K, Yeh CC, Zhu Y, Gharghabi S, Ratanamahatana CA, Keogh E (2018) The UCR time series classification archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
He Y, Chu X, Wang Y (2020) Neighbor profile: Bagging nearest neighbors for unsupervised time series mining. In: 36th IEEE international conference on data engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pp 373–384
Laptev N, Amizadeh S, Billawala Y (2015) A Benchmark Dataset for Time Series Anomaly Detection. https://yahooresearch.tumblr.com/post/114590420346/a-benchmark-dataset-for-time-series-anomaly
Mercer R, Alaee S, Abdoli A, Singh S, Murillo AC, Keogh EJ (2021) Matrix profile XXIII: contrast profile: A novel time series primitive that allows real world classification. In: Bailey J, Miettinen P, Koh YS, Tao D, Wu X (eds) IEEE international conference on data mining, ICDM 2021, Auckland, New Zealand, December 7-10, 2021, pp 1240–1245
Mueen A, Hamooni H, Estrada T (2014) Time series join on subsequence correlation. In: Kumar R, Toivonen H, Pei J, Huang JZ, Wu X (eds) IEEE international conference on data mining, ICDM 2014, Shenzhen, China, December 14-17, 2014, pp 450–459
Mueen A, Keogh EJ, Young NE (2011) Logical-shapelets: an expressive primitive for time series classification. In: Apté C, Ghosh J, Smyth P (eds) ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, USA, August 21-24, 2011, pp 1154–1162
Nakamura T, Imamura M, Mercer R, Keogh EJ (2020) MERLIN: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: Plant C, Wang H, Cuzzocrea A, Zaniolo C, Wu X (eds) 20th IEEE international conference on data mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, pp 1190–1195
Rakthanmanon T, Campana BJL, Mueen A, Batista GEAPA, Westover MB, Zhu Q, Zakaria J, Keogh EJ (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Yang Q, Agarwal D, Pei J (eds) ACM SIGKDD international conference on knowledge discovery and data mining, pp 262–270
Sinha S (2002) Discriminative motifs. In: Proceedings of the sixth annual international conference on computational biology, pp 291–298
Yagoubi DE, Akbarinia R, Kolev B, Levchenko O, Masseglia F, Valduriez P, Shasha DE (2018) ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows. Data Mining Knowl Discov 32(5):1481–1507
Yeh C-CM, Herle HV, Keogh EJ (2016) Matrix profile III: The matrix profile allows visualization of salient subsequences in massive time series. In: Proceedings of the international conference on data mining (ICDM), pp 579–588
Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Zimmerman Z, Silva DF, Mueen A, Keogh E (2018) Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Min Knowl Disc 32(1):83–123
Yeh CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh EJ (2016) Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In: Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou Z, Wu X (eds) IEEE 16th international conference on data mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp 1317–1322
Zhu Y, Yeh C-CCM, Zimmerman Z, Keogh EJ (2020) Matrix Profile XVII: Indexing the matrix profile to allow arbitrary range queries. In: International conference on data engineering (ICDE), pp 1846–1849
Zhu Y, Yeh C-CM, Zimmerman Z, Kamgar K, Keogh E (2018) Matrix profile XI: SCRIMP++: Time series motif discovery at interactive speeds. In: Proceedings of the international conference on data mining (ICDM), pp 837–846
Zhu Y, Zimmerman Z, Senobari NS, Yeh CM, Funning GJ, Mueen A, Brisk P, Keogh EJ (2016) Matrix profile II: exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins. In: Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou Z, Wu X (eds) IEEE 16th international conference on data mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp 739–748
Zimmerman ZP (2019) Breaking computational barriers to perform time series pattern mining at scale and at the edge. PhD thesis, University of California, Riverside, https://escholarship.org/content/qt51z7d647/qt51z7d647.pdf
Zimmerman Z, Kamgar K, Senobari NS, Crites B, Funning GJ, Brisk P, Keogh EJ (2019) Matrix profile XIV: scaling time series motif discovery with gpus to break a quintillion pairwise comparisons a day and beyond. In: Proceedings of the ACM symposium on cloud computing, SoCC 2019, Santa Cruz, CA, USA, November 20-23, 2019, pp 74–86
Acknowledgements
We gratefully acknowledge the funding from the Safran Data Analytics Lab. The authors are grateful to the Inria Sophia Antipolis - Méditerranée "Nef" computation cluster for providing resources and support.
Additional information
Responsible editor: Eamonn Keogh.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix: I Fast calculation of distance
Mueen et al. proposed a technique, known as Mueen's Algorithm for Similarity Search (MASS) (Mueen et al. 2014, 2011), for the fast calculation of the z-normalized Euclidean distance between a query subsequence and the subsequences of a target time series, by exploiting the Fast Fourier Transform (FFT). Let us first explain Algorithm 6, which generates the dot products of a subsequence (q) and the subsequences of a target time series (T).
In Line 4, both vectors q and T are made the same length (see Section Appendix: I.1) by appending the required number of zeros \((2n-m)\) to the reversed query, so that, like \(T_a\), \(q_{ra}\) has 2n elements. This facilitates element-wise multiplication in the frequency domain. In Line 5, the Fourier transforms of \(q_{ra}\) and \(T_a\) are computed, transforming the time-domain signals into frequency-domain signals. Then, in Line 6, an element-wise multiplication is performed between the two complex-valued vectors, followed by an inverse FFT on the resulting product.
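The steps above can be sketched as follows, assuming NumPy is available (the function name and the index arithmetic are ours, not taken verbatim from Algorithm 6):

```python
import numpy as np

def sliding_dot_product(q, T):
    """Dot product of query q (length m) with every length-m
    subsequence of T (length n), via FFT, as in MASS."""
    n, m = len(T), len(q)
    Ta = np.concatenate([T, np.zeros(n)])                  # append n zeros -> 2n elements
    q_ra = np.concatenate([q[::-1], np.zeros(2 * n - m)])  # reversed q, padded to 2n
    # element-wise multiplication in the frequency domain == convolution in time
    qt = np.fft.ifft(np.fft.fft(Ta) * np.fft.fft(q_ra)).real
    # the n - m + 1 valid dot products sit at indices m-1 .. n-1
    return qt[m - 1:n]

T = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([1.0, 2.0])
print(sliding_dot_product(q, T))  # dot products 5, 8, 11, 14
```

Reversing the query turns the convolution computed by the FFT into the desired sliding correlation, which is why Line 4 reverses q before padding.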
Appendix: I.1 Relation between convolution in the time domain and the frequency domain
The time-domain convolution of two signals \(T = [t_1, t_2,\ldots , t_n]\) and \(q = [q_1, q_2,\ldots , q_m]\), \(m \ll n\), can be calculated by sliding q over T. For implementation, we would need a padding of \(\frac{m}{2}\) zeros at the beginning and at the end of the original T vector. The convolution of the two vectors, denoted \((T * q)\), is a vector \({\textbf {c}} = [c_1, c_2,\ldots ,c_{n+m-1}]\) whose elements are given by \(c_p = \sum _{u} t_u \, q_{p-u+1}\), where u ranges over all valid subscripts of \(t_u\) and \(q_{p-u+1}\), i.e., \(\max \left( 1, p-m+1 \right) \le u \le \min \left( p, n \right) \).
The time-domain convolution can be quickly calculated by element-wise multiplication in the frequency domain: take the Fourier transform of the two signals, multiply them element-wise, and apply the inverse Fourier transform. A full convolution (in either the time or frequency domain) of two 1D signals (the same holds in 2D) of sizes n and m produces an output of \((n+m-1)\) elements. Usually, the two signals are made the same size by padding zeros at the end to facilitate the element-wise multiplication.
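This relation can be checked directly, assuming NumPy (the zero-padding to \(n+m-1\) is handled by the FFT's length argument; the function name is ours):

```python
import numpy as np

def conv_via_fft(T, q):
    """Full linear convolution via the frequency domain: FFT both signals
    (zero-padded to n + m - 1), multiply element-wise, inverse FFT."""
    L = len(T) + len(q) - 1
    return np.fft.ifft(np.fft.fft(T, L) * np.fft.fft(q, L)).real

# agrees with direct time-domain convolution
print(np.allclose(conv_via_fft([1, 2, 3], [0, 1, 0.5]),
                  np.convolve([1, 2, 3], [0, 1, 0.5])))  # True
```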
Appendix: II Fast calculation of mean and standard deviation
The fast calculation of the mean \((\mu )\) and standard deviation \((\sigma )\) of a vector of elements (x) was proposed by Rakthanmanon et al. (2012). The technique needs only one scan through the series to compute the mean and standard deviation of all the subsequences. The means of the subsequences can be calculated by keeping two running sums of the long time series that have a lag of exactly m values.
In the same manner, the sums of squares of the subsequences can be calculated, which are used to compute the standard deviations of all the subsequences by using Eq. (2).
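A minimal stdlib sketch of this one-pass technique (the function name is ours): cumulative sums of the values and of the squared values yield every windowed mean and standard deviation through differences with a lag of exactly m:

```python
def running_mean_std(x, m):
    """Mean and standard deviation of every length-m subsequence of x
    in a single scan, via two cumulative sums (values and squared values)
    differenced with a lag of exactly m."""
    cum, cum2, s, s2 = [0.0], [0.0], 0.0, 0.0
    for v in x:
        s += v
        s2 += v * v
        cum.append(s)
        cum2.append(s2)
    mus, sigmas = [], []
    for i in range(len(x) - m + 1):
        mu = (cum[i + m] - cum[i]) / m               # windowed sum / m
        var = (cum2[i + m] - cum2[i]) / m - mu * mu  # E[x^2] - E[x]^2
        mus.append(mu)
        sigmas.append(max(var, 0.0) ** 0.5)          # clamp round-off
    return mus, sigmas

print(running_mean_std([1, 2, 3, 4], 2))  # means 1.5, 2.5, 3.5; stds all 0.5
```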
Appendix: III Brief description of the MASS algorithm
The MASS algorithm is given in Algorithm 7. In Line 1 of this algorithm, the sliding dot product is calculated using Algorithm 6.
The z-normalized Euclidean distance (\(D_i\)) between two time series subsequences q and \(T_{i}\) is calculated using the dot product between them (\(qT_i\)), as shown below:

\(D_i = \sqrt{2m\left( 1-\dfrac{qT_i - m\,\mu _q\,\mu _{T_i}}{m\,\sigma _q\,\sigma _{T_i}}\right) }\)    (3)

In Eq. (3), m is the subsequence length, \(\mu _q\) is the mean of the query subsequence q, \(\mu _{T_i}\) is the mean of \(T_{i}\), \(\sigma _q\) is the standard deviation of q, and \(\sigma _{T_i}\) is the standard deviation of \(T_{i}\).
Appendix: IV Brief description of STAMP algorithm
The Scalable Time Series Anytime MP (STAMP) algorithm, proposed by Yeh et al. (2018) and outlined in Algorithm 8, calculates the closest match (1NN) of every subsequence in a time series T, based on the calculated distances (called the distance profile) between each particular subsequence and all the remaining subsequences in T.
The pseudo-code is given in Algorithm 8. A for loop in Line 6 extracts each subsequence (considered as a query, for ease of understanding). Then, a distance vector (\(Dist_{cutQuery}\)) is computed by calculating the distances between the query and all other subsequences in T. In each iteration, the smallest distance and its corresponding index are stored in the \(P_T\) and \(I_T\) vectors.
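To make the loop structure concrete, here is a minimal, self-contained sketch of the STAMP idea (for brevity the distance profile is computed naively rather than with MASS; the function names and the exclusion-zone width are our assumptions, not from Algorithm 8):

```python
import math

def znorm(s):
    """z-normalize a subsequence (constant subsequences map to zeros)."""
    mu = sum(s) / len(s)
    sd = math.sqrt(max(sum(v * v for v in s) / len(s) - mu * mu, 0.0))
    return [(v - mu) / sd for v in s] if sd > 0 else [0.0] * len(s)

def stamp(T, m):
    """1NN matrix profile P and index vector I: for each query subsequence,
    the smallest z-normalized distance to any non-trivial match in T."""
    n = len(T) - m + 1
    excl = m // 2                                  # exclusion zone half-width
    subs = [znorm(T[i:i + m]) for i in range(n)]
    P, I = [math.inf] * n, [-1] * n
    for i in range(n):                             # each subsequence as query
        for j in range(n):                         # its distance profile
            if abs(i - j) <= excl:                 # skip trivial self-matches
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < P[i]:
                P[i], I[i] = d, j                  # keep the smallest distance
    return P, I

# the repeated motif [0, 1, 2] is found as its own nearest neighbour
P, I = stamp([0, 1, 2, 0, 1, 2, 5, 6, 9], 3)
print(I[0], round(P[0], 6))  # 3 0.0
```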
Appendix: IV.1 Matching two independent time series using STOMP
In Section 4.1.1, we explained the general principle of the STOMP algorithm for computing the MP of a single time series. For two independent time series, the basic STOMP algorithm needs to be marginally modified. The pseudo-code of the modified STOMP algorithm, named IndependentSTOMP, is given in Algorithm 9.
When called, this algorithm calculates the distances from the 2nd query subsequence up to the last one, i.e., Idxs (see Line 2), because the distances between the 1st query subsequence and the target subsequence t have already been computed before calling this algorithm. Now let us concentrate on the principal part of STOMP. Recall that the dot product profile (QT) of any particular subsequence can be derived from the dot product profile of its previous subsequence (see Sect. 4.1.1). So, in Line 3 we repeatedly calculate the dot product between each query subsequence (from the 2nd onward) and the target subsequence (t). The 1st term, \(QT[j-1]\), on the right-hand side of Line 3 holds the dot product of the previous query subsequence with the previous target subsequence (relative to t). In the 2nd term, \(Q[j-1]\) represents the 1st element of the previous query subsequence and t[1] the 1st element of the target subsequence (t). In the same manner, the terms \(Q[j+m-1]\) and \(t[1+m-1]\) represent the last element of the jth (current) query subsequence and the last element of the subsequence t. In Line 4, the dot product between the first query subsequence \((j=1)\) and the specific target subsequence (t) is copied into QT[1]; this value is given as input to the algorithm (i.e., tqSingleVal). The distance profile is calculated in Line 5 using the dot product profile (QT), the mean \((\mu _t)\) and standard deviation \((\sigma _{t})\) of the target subsequence t, and the mean \((\mu _Q)\) and standard deviation \((\sigma _{Q})\) vectors of the query time series. Finally, the distance vector \((Dist_{cutQuery})\) and the dot product vector (QT) are returned as the output of the algorithm.
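The O(1) dot-product update that Line 3 relies on can be sketched as follows (a hypothetical helper showing the diagonal recurrence on two arbitrary series, not the exact 1-based indexing of Algorithm 9):

```python
def diagonal_dot_products(A, B, m):
    """STOMP-style update: the dot product of A[i:i+m] with B[i:i+m] follows
    from the previous diagonal entry in O(1), by subtracting the product of
    the two leading elements and adding the product of the two trailing ones."""
    n = min(len(A), len(B)) - m + 1
    qt = [0.0] * n
    qt[0] = float(sum(A[k] * B[k] for k in range(m)))  # only this entry is O(m)
    for i in range(1, n):
        qt[i] = qt[i - 1] - A[i - 1] * B[i - 1] + A[i + m - 1] * B[i + m - 1]
    return qt

print(diagonal_dot_products([1, 2, 3, 4], [2, 0, 1, 3], 2))  # [2.0, 3.0, 15.0]
```

Only the first dot product costs O(m); every subsequent one is derived in constant time, which is the source of STOMP's speedup over STAMP.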
The time complexity of classical STOMP is \(\mathcal {O}(n^2)\), an \(\mathcal {O}(\log n)\)-factor improvement over the \(\mathcal {O}(n^2 \log n)\) STAMP algorithm. This improvement is highly useful for computing the MP over big time series.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mondal, T., Akbarinia, R. & Masseglia, F. kNN matrix profile for knowledge discovery from time series. Data Min Knowl Disc 37, 1055–1089 (2023). https://doi.org/10.1007/s10618-022-00883-8