Article

Probabilistic discovery of time series motifs

Authors:
Bill Chiu, University of California - Riverside, Riverside, CA
Eamonn Keogh, University of California - Riverside, Riverside, CA
Stefano Lonardi, University of California - Riverside, Riverside, CA

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003, Pages 493–498
https://doi.org/10.1145/956750.956808

Published: 24 August 2003

ABSTRACT

Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or "don't care" symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
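
The abstract does not spell out the mechanics of the algorithm, so the following is only a minimal sketch of one technique from biosequence pattern discovery that fits the description: random projection over symbolic words. It assumes each sliding window is first discretized into a short word, then repeatedly hashes the words on a random subset of positions (the masked positions act as "don't care" symbols) and counts how often two windows collide. The function names `to_word` and `candidate_motifs`, and every parameter and default value, are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict
from itertools import combinations
from statistics import NormalDist, mean, pstdev


def to_word(window, word_len, alphabet_size):
    """Discretize one window into a tuple of integer symbols.

    Assumed scheme: z-normalize, average into word_len equal segments
    (trailing points that do not fill a segment are ignored), then bin
    each segment mean using equiprobable standard-normal breakpoints.
    """
    mu, sigma = mean(window), pstdev(window)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in window]
    seg = len(z) // word_len
    means = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_len)]
    cuts = [NormalDist().inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]
    return tuple(sum(m > c for c in cuts) for m in means)


def candidate_motifs(series, window, word_len=4, alphabet_size=3,
                     mask_size=2, iterations=10, seed=0):
    """Rank pairs of window start positions by projection collisions."""
    rng = random.Random(seed)
    words = [to_word(series[i:i + window], word_len, alphabet_size)
             for i in range(len(series) - window + 1)]
    collisions = defaultdict(int)
    for _ in range(iterations):
        # Keep a random subset of positions; the rest are "don't care".
        keep = sorted(rng.sample(range(word_len), word_len - mask_size))
        buckets = defaultdict(list)
        for pos, word in enumerate(words):
            buckets[tuple(word[k] for k in keep)].append(pos)
        for members in buckets.values():
            for a, b in combinations(members, 2):
                if b - a >= window:  # ignore overlapping windows
                    collisions[(a, b)] += 1
    # Pairs that collide most often are the likeliest motif candidates.
    return sorted(collisions.items(), key=lambda kv: -kv[1])


# Demo: the same shape planted at both ends of a noisy series.
pattern = [0, 1, 2, 3, 4, 3, 2, 1]
noise_rng = random.Random(1)
series = pattern + [noise_rng.uniform(-1, 1) for _ in range(20)] + pattern
ranked = candidate_motifs(series, window=len(pattern))
```

Because the collision counts are usable after any number of iterations, a loop like this can report its current best candidates early and refine the ranking as more projections are drawn, which is one way to read the "anytime" property described above. The pairwise counting inside each bucket is quadratic in bucket size, so the sketch is only suitable for short series.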

Index Terms

Probabilistic discovery of time series motifs
1. Information systems
  1. Information systems applications

Published in
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN:1581137370
DOI:10.1145/956750
Conference Chair: Lise Getoor, University of Maryland, College Park
General Chair: Ted Senator, DARPA
Program Chairs: Pedro Domingos, University of Washington; Christos Faloutsos, Carnegie Mellon University
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
motifs
randomized algorithms
time series
Qualifiers
- Article
Conference

Acceptance Rates
KDD '03 Paper Acceptance Rate: 46 of 298 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
Article Metrics
- Total Citations: 364
- Total Downloads: 3,302
- Downloads (Last 12 months): 60
- Downloads (Last 6 weeks): 9