Abstract
Time series motifs are sets of very similar subsequences of a long time series. They are of interest in their own right and also serve as inputs to several higher-level data mining algorithms, including classification, clustering, rule discovery and summarization. Despite extensive research in recent years, finding time series motifs exactly in massive databases remains an open problem. Previous efforts either found approximate motifs or considered relatively small datasets residing in main memory. In this work, we build on previous work on pivot-based indexing to introduce a disk-aware algorithm that finds time series motifs exactly in multi-gigabyte databases containing on the order of tens of millions of time series. We evaluate our algorithm on datasets from diverse areas, including medicine, anthropology, computer networking and image processing, and show that we can find interesting and meaningful motifs in datasets that are many orders of magnitude larger than anything considered before.
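To make the problem statement concrete, the following is a minimal in-memory sketch, not the disk-aware algorithm introduced in the article. It assumes the motif is the closest pair of equal-length series under Euclidean distance, and it illustrates the triangle-inequality lower bound that pivot-based methods generally rely on: for any pivot r, |d(a, r) - d(b, r)| <= d(a, b). The function names `brute_force_motif` and `pivot_pruned_motif`, and the choice of a single pivot, are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_motif(series):
    """Exact closest pair by checking every pair: returns (distance, i, j)."""
    best = (math.inf, -1, -1)
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            d = euclidean(series[i], series[j])
            if d < best[0]:
                best = (d, i, j)
    return best


def pivot_pruned_motif(series, pivot):
    """Exact closest pair using one pivot to skip distance computations.

    Returns ((distance, i, j), number_of_distances_computed).
    """
    # Distance from every series to the (hypothetical) pivot.
    ref = [euclidean(s, pivot) for s in series]
    # Visit the series in order of increasing distance to the pivot.
    order = sorted(range(len(series)), key=ref.__getitem__)

    best = (math.inf, -1, -1)
    computed = 0
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            # Triangle inequality: ref[j] - ref[i] is a lower bound on d(i, j).
            # Gaps only grow for later b, so the inner loop can stop here.
            if ref[j] - ref[i] >= best[0]:
                break
            d = euclidean(series[i], series[j])
            computed += 1
            if d < best[0]:
                best = (d, min(i, j), max(i, j))
    return best, computed
```

Both functions return the same closest-pair distance; the pruned version simply avoids computing distances for pairs whose lower bound already exceeds the best distance found so far. How much work is saved depends on the data and on the pivot chosen.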
Supporting webpage: www.cs.ucr.edu/~mueen/DAME/index.html
Additional information
Responsible editor: Bart Goethals.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Mueen, A., Keogh, E., Zhu, Q. et al. A disk-aware algorithm for time series motif discovery. Data Min Knowl Disc 22, 73–105 (2011). https://doi.org/10.1007/s10618-010-0176-8
DOI: https://doi.org/10.1007/s10618-010-0176-8