
kNN matrix profile for knowledge discovery from time series

Abstract

Matrix Profile (MP) has been proposed as a powerful technique for knowledge extraction from time series, and several algorithms have been proposed for computing it, e.g., STAMP and STOMP. Currently, the MP is computed based on a 1NN search over all subsequences of the time series. In this paper, we claim that a kNN MP can be more useful than the 1NN MP for knowledge extraction, and we propose an efficient technique to compute such an MP. We also propose an algorithm for parallel execution of the kNN MP using the multiple cores of an off-the-shelf computer. We evaluated the performance of our solution on several real datasets. The results illustrate the superiority of the kNN MP over the 1NN MP for knowledge discovery.

Notes

  1. The source code and the tested datasets are publicly available at: https://sites.google.com/view/knnmatrixprofile/home

  2. This case applies only when the database does not contain a single big time series but several small time series; a big time series is then formed by concatenating these small time series.

  3. We have shared the code, datasets and instructions in: https://github.com/tanmayGIT/kNN_Matrix_Profile.

References

  • Balasubramanian A, Wang J, Prabhakaran B (2016) Discovering multidimensional motifs in physiological signals for personalized healthcare. J Sel Topics Signal Process 10(5):832–841

  • Dau HA, Bagnall A, Kamgar K, Yeh CC, Zhu Y, Gharghabi S, Ratanamahatana CA, Keogh E (2018) The UCR time series classification archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/

  • He Y, Chu X, Wang Y (2020) Neighbor profile: Bagging nearest neighbors for unsupervised time series mining. In: 36th IEEE international conference on data engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pp 373–384

  • Laptev N, Amizadeh S, Billawala Y (2015) A Benchmark Dataset for Time Series Anomaly Detection. https://yahooresearch.tumblr.com/post/114590420346/a-benchmark-dataset-for-time-series-anomaly

  • Mercer R, Alaee S, Abdoli A, Singh S, Murillo AC, Keogh EJ (2021) Matrix profile XXIII: contrast profile: A novel time series primitive that allows real world classification. In: Bailey J, Miettinen P, Koh YS, Tao D, Wu X (eds) IEEE international conference on data mining, ICDM 2021, Auckland, New Zealand, December 7-10, 2021, pp 1240–1245

  • Mueen A, Hamooni H, Estrada T (2014) Time series join on subsequence correlation. In: Kumar R, Toivonen H, Pei J, Huang JZ, Wu X (eds) IEEE international conference on data mining, ICDM 2014, Shenzhen, China, December 14-17, 2014, pp 450–459

  • Mueen A, Keogh EJ, Young NE (2011) Logical-shapelets: an expressive primitive for time series classification. In: Apté C, Ghosh J, Smyth P (eds) ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, USA, August 21-24, 2011, pp 1154–1162

  • Nakamura T, Imamura M, Mercer R, Keogh EJ (2020) MERLIN: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: Plant C, Wang H, Cuzzocrea A, Zaniolo C, Wu X (eds) 20th IEEE international conference on data mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, pp 1190–1195

  • Rakthanmanon T, Campana BJL, Mueen A, Batista GEAPA, Westover MB, Zhu Q, Zakaria J, Keogh EJ (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Yang Q, Agarwal D, Pei J (eds) ACM SIGKDD international conference on knowledge discovery and data mining, pp 262–270

  • Sinha S (2002) Discriminative motifs. In: Proceedings of the sixth annual international conference on computational biology, pp 291–298

  • Yagoubi DE, Akbarinia R, Kolev B, Levchenko O, Masseglia F, Valduriez P, Shasha DE (2018) ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows. Data Mining Knowl Discov 32(5):1481–1507

  • Yeh C-CM, Herle HV, Keogh EJ (2016) Matrix profile III: The matrix profile allows visualization of salient subsequences in massive time series. In: Proceedings of the international conference on data mining (ICDM), pp 579–588

  • Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Zimmerman Z, Silva DF, Mueen A, Keogh E (2018) Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Min Knowl Discov 32(1):83–123

  • Yeh CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh EJ (2016) Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In: Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou Z, Wu X (eds) IEEE 16th international conference on data mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp 1317–1322

  • Zhu Y, Yeh C-CCM, Zimmerman Z, Keogh EJ (2020) Matrix Profile XVII: Indexing the matrix profile to allow arbitrary range queries. In: International conference on data engineering (ICDE), pp 1846–1849

  • Zhu Y, Yeh C-CM, Zimmerman Z, Kamgar K, Keogh E (2018) Matrix profile XI: SCRIMP++: Time series motif discovery at interactive speeds. In: Proceedings of the international conference on data mining (ICDM), pp 837–846

  • Zhu Y, Zimmerman Z, Senobari NS, Yeh CM, Funning GJ, Mueen A, Brisk P, Keogh EJ (2016) Matrix profile II: exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou Z, Wu X (eds) IEEE 16th international conference on data mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp 739–748

  • Zimmerman ZP (2019) Breaking computational barriers to perform time series pattern mining at scale and at the edge. PhD thesis, University of California, Riverside, https://escholarship.org/content/qt51z7d647/qt51z7d647.pdf

  • Zimmerman Z, Kamgar K, Senobari NS, Crites B, Funning GJ, Brisk P, Keogh EJ (2019) Matrix profile XIV: scaling time series motif discovery with gpus to break a quintillion pairwise comparisons a day and beyond. In: Proceedings of the ACM symposium on cloud computing, SoCC 2019, Santa Cruz, CA, USA, November 20-23, 2019, pp 74–86

Acknowledgements

We gratefully acknowledge the funding from the Safran Data Analytics Lab. The authors are grateful to the Inria Sophia Antipolis - Méditerranée "Nef" computation cluster for providing resources and support.

Author information

Correspondence to Tanmoy Mondal.

Additional information

Responsible editor: Eamonn Keogh.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 4346 KB)

Appendices

Appendix: I Fast calculation of distance

Mueen et al. proposed a technique, known as Mueen's Algorithm for Similarity Search (MASS) (Mueen et al. 2014, 2011), for the fast calculation of the z-normalized Euclidean distance between a query subsequence and the subsequences of a target time series, by exploiting the Fast Fourier Transform (FFT). Let us first explain Algorithm 6, which computes the dot products between a subsequence (q) and the subsequences of a target time series (T).

[Algorithm 6: sliding dot product pseudo-code]

In Line 4, both vectors q and T are made the same length (see Appendix I.1) by appending the required number of zeros \((2n-m)\) to the reversed query, so that, like \(T_a\), \(q_{ra}\) has 2n elements. This is done to facilitate element-wise multiplication in the frequency domain. In Line 5, the Fourier transform of \(q_{ra}\) and \(T_a\) is computed to transform the time-domain signals into frequency-domain signals. Then, in Line 6, an element-wise multiplication is performed between the two complex-valued vectors, followed by an inverse FFT on the resulting product.
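To make the FFT-based computation concrete, here is a minimal NumPy sketch of the sliding dot product described above; the function name and the use of NumPy are our own illustration, not the original MASS implementation.

import numpy as np

def sliding_dot_product(q, t):
    # FFT-based sliding dot product of query q against every
    # length-m subsequence of time series t (sketch of Algorithm 6).
    n, m = len(t), len(q)
    t_a = np.concatenate([t, np.zeros(n)])                  # pad t with n zeros (2n elements)
    q_ra = np.concatenate([q[::-1], np.zeros(2 * n - m)])   # reversed query padded to 2n elements
    # element-wise multiplication in the frequency domain, then inverse FFT
    qt = np.fft.irfft(np.fft.rfft(t_a) * np.fft.rfft(q_ra), n=2 * n)
    # valid dot products sit at (0-based) positions m-1 .. n-1
    return qt[m - 1:n]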

Appendix: I.1 Relation between convolution in time domain and frequency domain

The time-domain convolution of two signals \(T = [t_1, t_2,\ldots , t_n]\) and \(q = [q_1, q_2,\ldots , q_m]\), with \(m \ll n\), can be calculated by sliding q over T. For implementation, \((m-1)\) zeros need to be padded at the beginning and at the end of the original vector T. The convolution of these two vectors, denoted \((T * q)\), is a vector \({\textbf {c}} = [c_1, c_2,\ldots ,c_{n+m-1}]\) whose elements are given by \(c_p = \sum _{u} t_u \, q_{p-u+1}\), where u ranges over all subscripts for which both \(t_u\) and \(q_{p-u+1}\) are defined, i.e., \(u = \max \left( 1,p-m+1 \right) ,\ldots , \min \left( p,n \right) \).

The convolution in the time domain can be computed quickly by element-wise multiplication in the frequency domain: take the Fourier transform of the two signals, multiply them element-wise, and then apply the inverse Fourier transform. Performing full convolution (in either domain) between two 1D signals (the same holds for 2D) of sizes n and m produces an output of \((n+m-1)\) elements. Usually, the two signals are made the same size by padding zeros at the end to facilitate the element-wise multiplication.
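This equivalence can be checked with a short sketch (NumPy assumed for illustration): the frequency-domain route yields the same \((n+m-1)\)-element result as the direct time-domain convolution.

import numpy as np

def fft_convolve(t, q):
    # full linear convolution via the frequency domain: zero-pad both
    # signals to length n + m - 1, multiply their spectra, invert the transform
    size = len(t) + len(q) - 1
    return np.fft.irfft(np.fft.rfft(t, size) * np.fft.rfft(q, size), n=size)

t = np.random.rand(100)
q = np.random.rand(8)
assert np.allclose(fft_convolve(t, q), np.convolve(t, q))  # matches the time-domain convolution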

Appendix: II Fast calculation of mean and standard deviation

The fast calculation of the mean \((\mu )\) and standard deviation \((\sigma )\) of a vector of elements (x) was proposed by Rakthanmanon et al. (2012). The technique needs only one scan through the time series to compute the mean and standard deviation of all its subsequences. The mean of each subsequence can be calculated by keeping two running sums of the long time series that lag each other by exactly m values.

$$\begin{aligned} \mu = \frac{1}{m}\left( \sum _{i=1}^{k}x_i - \sum _{i=1}^{k-m}x_i \right) \quad \sigma ^2 = \frac{1}{m} \left( \sum _{i=1}^{k} x_i^2 - \sum _{i=1}^{k-m}x_i^2 \right) - \mu ^2 \end{aligned}$$
(2)

In the same manner, a running sum of squares can be maintained, which is then used to compute the standard deviation of all the subsequences by using Eq. (2).
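A minimal one-pass sketch of this computation, assuming NumPy and with an illustrative function name (rolling_mean_std), could look as follows:

import numpy as np

def rolling_mean_std(x, m):
    # mean and standard deviation of every length-m subsequence of x,
    # obtained from two running sums that lag each other by m values (Eq. 2)
    x = np.asarray(x, dtype=float)
    csum = np.concatenate([[0.0], np.cumsum(x)])
    csum_sq = np.concatenate([[0.0], np.cumsum(x * x)])
    win_sum = csum[m:] - csum[:-m]           # sum over each window
    win_sum_sq = csum_sq[m:] - csum_sq[:-m]  # sum of squares over each window
    mu = win_sum / m
    sigma = np.sqrt(win_sum_sq / m - mu * mu)
    return mu, sigma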

Appendix: III Brief description of the MASS algorithm

The MASS algorithm is outlined in Algorithm 7. In Line 1 of this algorithm, the sliding dot product is calculated using Algorithm 6.

[Algorithm 7: MASS pseudo-code]

The z-normalized Euclidean distance (\(D_i\)) between the query subsequence q and the subsequence \(T_{i}\) is calculated by using the dot product between them (\(qT_i\)). The formula for the distance \((D_i)\) is shown below.

$$\begin{aligned} D[i] = \sqrt{2m\left( 1- \frac{qT_i - m\mu _q ~\mu _{T_i}}{m\sigma _q~\sigma _{T_i}}\right) } \end{aligned}$$
(3)

In Eq. (3), m is the subsequence length, \(\mu _q\) is the mean of the query subsequence q, \(\mu _{T_i}\) is the mean of \(T_{i}\), \(\sigma _q\) is the standard deviation of q, and \(\sigma _{T_i}\) is the standard deviation of \(T_{i}\).
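Putting the pieces together, a minimal sketch of MASS could look as follows; it reuses the illustrative sliding_dot_product and rolling_mean_std helpers sketched above and is not the authors' implementation.

import numpy as np

def mass(q, t):
    # distance profile of query q against every subsequence of t (Algorithm 7)
    m = len(q)
    qt = sliding_dot_product(q, t)           # sliding dot products (Algorithm 6 sketch)
    mu_t, sigma_t = rolling_mean_std(t, m)   # per-subsequence statistics of t (Eq. 2 sketch)
    mu_q, sigma_q = np.mean(q), np.std(q)
    # Eq. (3); tiny negative values from floating-point error are clipped to zero
    dist_sq = 2 * m * (1 - (qt - m * mu_q * mu_t) / (m * sigma_q * sigma_t))
    return np.sqrt(np.maximum(dist_sq, 0.0))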

Appendix: IV Brief description of STAMP algorithm

The Scalable Time Series Anytime MP (STAMP) algorithm, proposed by Yeh et al. (2018) and outlined in Algorithm 8, computes the closest match (1NN) of every subsequence in a time series T, based on the distances (called the distance profile) between a given subsequence and all the remaining subsequences in T.

[Algorithm 8: STAMP pseudo-code]

The pseudo-code is given in Algorithm 8. A for loop in Line 6 extracts each subsequence (viewed as a query, for ease of understanding). Then, a distance vector (\(Dist_{cutQuery}\)) is computed by calculating the distances between the query and all other subsequences in T. In each iteration, the smallest distance and its corresponding index are stored in the \(P_T\) and \(I_T\) vectors, respectively; a sketch of this loop is given below.
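The following sketch (NumPy assumed) illustrates the STAMP loop using the mass helper sketched above; the width of the exclusion zone that suppresses trivial self-matches is our own assumption, not taken from Algorithm 8.

import numpy as np

def stamp_1nn(t, m):
    # STAMP-style 1NN matrix profile: for every query subsequence, compute its
    # distance profile with MASS and keep the smallest distance and its index
    n_sub = len(t) - m + 1
    profile = np.full(n_sub, np.inf)    # P_T
    index = np.zeros(n_sub, dtype=int)  # I_T
    for i in range(n_sub):
        dist = mass(t[i:i + m], t)
        # exclusion zone around the query itself (width is an assumption)
        lo, hi = max(0, i - m // 2), min(n_sub, i + m // 2 + 1)
        dist[lo:hi] = np.inf
        index[i] = int(np.argmin(dist))
        profile[i] = dist[index[i]]
    return profile, index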

Appendix: IV.1 Two independent time series matching using STOMP

In Section 4.1.1, we explained the general principle of the STOMP algorithm for computing the MP of a single time series. For two independent time series, the basic STOMP algorithm needs to be slightly modified. The pseudo-code of this modified algorithm, named IndependentSTOMP, is given in Algorithm 9.

When called, this algorithm calculates the distances from the 2nd query subsequence up to the last query subsequence, i.e., Idxs (see Line 2), because the distances between the 1st query subsequence and the target subsequence t have already been computed before this algorithm is called. Now let us concentrate on the principal part of STOMP. Recall that the dot product profile (QT) of any particular subsequence can be derived from the dot product profile of its previous subsequence (see Sect. 4.1.1). So, in Line 3 we iteratively calculate the dot product between each query subsequence (from the 2nd one onward) and the target subsequence (t). The 1st term on the right-hand side of Line 3, \(QT[j-1]\), holds the dot product between the previous, i.e., \((j-1)\)th, query subsequence and t. In the 2nd term, \(Q[j-1]\) is the 1st element of the previous query subsequence and t[1] is the 1st element of the target subsequence (t). In the same manner, \(Q[j+m-1]\) and \(t[1+m-1]\) are the last element of the jth (current) query subsequence and the last element of the subsequence t, respectively. In Line 4, the dot product between the first query subsequence \((j=1)\) and the specific target subsequence (t), which is given as an input to this algorithm (i.e., tqSingleVal), is copied into QT[1]. The distance profile is calculated in Line 5 by using the dot product profile (QT), the mean \((\mu _t)\) and standard deviation \((\sigma _{t})\) of the target subsequence t, and the mean \((\mu _Q)\) and standard deviation \((\sigma _{Q})\) vectors of the query time series. Finally, the distance vector \((Dist_{cutQuery})\) and the dot product vector (QT) are returned as the output of the algorithm. A minimal sketch of this incremental update follows Algorithm 9.

[Algorithm 9: IndependentSTOMP pseudo-code]
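The following sketch (NumPy assumed, argument names only loosely following Algorithm 9) illustrates the incremental dot-product update for one target subsequence. Here mu_Q and sigma_Q are the per-subsequence mean and standard deviation vectors of the query time series, and tq_single_val is the precomputed dot product of the 1st query subsequence with t.

import numpy as np

def independent_stomp_column(query_ts, t_sub, tq_single_val, mu_Q, sigma_Q):
    # distances between one target subsequence t_sub (length m) and every
    # subsequence of query_ts, reusing each previous dot product in O(1)
    m = len(t_sub)
    n_sub = len(query_ts) - m + 1
    mu_t, sigma_t = np.mean(t_sub), np.std(t_sub)
    qt = np.empty(n_sub)
    qt[0] = tq_single_val
    for j in range(1, n_sub):
        # Line 3: drop the element leaving the window, add the one entering it
        qt[j] = qt[j - 1] - query_ts[j - 1] * t_sub[0] + query_ts[j + m - 1] * t_sub[m - 1]
    # Line 5: Eq. (3) applied to the whole dot-product profile
    dist_sq = 2 * m * (1 - (qt - m * mu_Q * mu_t) / (m * sigma_Q * sigma_t))
    return np.sqrt(np.maximum(dist_sq, 0.0)), qt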

The time complexity of the classical STOMP is \(\mathcal {O}(n^2)\), which is an improvement by a factor of \(\mathcal {O}(\log n)\) over the STAMP algorithm. This improvement is highly useful for computing the MP over big time series.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mondal, T., Akbarinia, R. & Masseglia, F. kNN matrix profile for knowledge discovery from time series. Data Min Knowl Disc 37, 1055–1089 (2023). https://doi.org/10.1007/s10618-022-00883-8
