ABSTRACT
Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or "don't care" symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
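The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way a probabilistic, anytime search with "don't care" positions could look. It assumes the time series has already been discretized into equal-length symbolic words (one per subsequence) and uses random projection, a technique known from biosequence motif finding: each iteration keeps a random subset of positions, treats the rest as "don't care", and counts which pairs of words land in the same bucket. The function name `projection_collisions` and its parameters are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def projection_collisions(words, mask_size, iterations, seed=0):
    """Count, for each pair of word indices, how many random projections
    map both words to the same bucket. All words must have equal length."""
    rng = random.Random(seed)
    word_len = len(words[0])
    counts = defaultdict(int)
    for _ in range(iterations):
        # Keep a random subset of positions; all others are "don't care".
        kept = sorted(rng.sample(range(word_len), mask_size))
        buckets = defaultdict(list)
        for idx, word in enumerate(words):
            key = "".join(word[p] for p in kept)
            buckets[key].append(idx)
        # Every pair sharing a bucket collides once in this iteration.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                counts[(i, j)] += 1
    return counts


# Hypothetical symbolic words, one per subsequence of a longer series.
demo_words = ["abba", "abca", "cacb", "abba"]
demo_counts = projection_collisions(demo_words, mask_size=2, iterations=20)
# Pairs with the highest collision counts are the likeliest motif candidates.
ranked = sorted(demo_counts.items(), key=lambda kv: kv[1], reverse=True)
```

Because the collision counts are usable after any number of iterations, a sketch like this can report candidate pairs early and refine the ranking as more projections are drawn, which is the sense in which such a search behaves as an anytime procedure. Words that differ in only a few positions still collide whenever those positions fall outside the kept subset, which is how noise is tolerated.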
Index Terms
- Probabilistic discovery of time series motifs