DOI: 10.1145/956750.956808
Article

Probabilistic discovery of time series motifs

Published: 24 August 2003

ABSTRACT

Several important time series data mining problems reduce to the core task of finding approximately repeated subsequences in a longer time series. In an earlier work, we formalized the idea of approximately repeated subsequences by introducing the notion of time series motifs. Two limitations of this work were the poor scalability of the motif discovery algorithm, and the inability to discover motifs in the presence of noise. Here we address these limitations by introducing a novel algorithm inspired by recent advances in the problem of pattern discovery in biosequences. Our algorithm is probabilistic in nature, but as we show empirically and theoretically, it can find time series motifs with very high probability even in the presence of noise or "don't care" symbols. Not only is the algorithm fast, but it is an anytime algorithm, producing likely candidate motifs almost immediately, and gradually improving the quality of results over time.
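
To make the core task concrete, the following is a minimal sketch of what "finding approximately repeated subsequences" can mean in practice. It is not the probabilistic algorithm the abstract describes; it is a naive exhaustive baseline, written under assumptions that the abstract does not state: windows are z-normalized, similarity is Euclidean distance, and overlapping windows are excluded as trivial matches. The names `znorm` and `closest_pair` and the toy series are hypothetical.

```python
import math


def znorm(seq):
    """Rescale a subsequence to zero mean and unit variance (assumed preprocessing)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        # A constant window has no shape; map it to all zeros.
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def closest_pair(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping length-m windows.

    Exhaustive search over all window pairs: quadratic in the series length.
    """
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        # Starting j at i + m skips windows that overlap window i.
        for j in range(i + m, len(windows)):
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# Toy series: a small bump appears twice, ten samples apart,
# with the second copy slightly perturbed (3.1 instead of 3).
series = [0, 0, 1, 3, 1, 0, 0, 5, 2, 7, 4, 0, 1, 3.1, 1, 0, 9, 6]
dist, i, j = closest_pair(series, 4)
print(f"closest windows start at {i} and {j}, distance {dist:.4f}")
```

The quadratic number of comparisons in this baseline is one way to see why scalability matters for motif discovery on long series.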

Published in

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions (15%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)
