DOI: 10.1145/332306.332321

Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space

Published: 08 April 2000

ABSTRACT

Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case.
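To make the cost figures above concrete, the following is a minimal Python sketch of a variable-memory model built the naive way. It is not the construction of [8] or of this paper: the function names `train` and `log_likelihood`, the occurrence threshold `min_count`, and the additive smoothing constant `pseudo` are all assumptions chosen for illustration. Training visits every position once per context length up to `max_len`, and prediction re-derives the longest usable context from scratch at every query position, which is where the superlinear behavior comes from.

```python
import math
from collections import Counter, defaultdict


def train(sequences, max_len, min_count=2):
    """Collect next-symbol counts for every context of length 0..max_len.

    Deliberately naive: each position is visited once per context length,
    and each slice costs time proportional to its length.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(i, max_len) + 1):
                counts[seq[i - k:i]][sym] += 1
    # Keep the empty context plus every context seen at least min_count times.
    # A suffix of a kept context is seen at least as often, so the kept set
    # is closed under taking suffixes.
    model = {ctx: c for ctx, c in counts.items()
             if ctx == "" or sum(c.values()) >= min_count}
    model.setdefault("", Counter())  # guard for empty training data
    return model


def log_likelihood(query, model, alphabet, pseudo=1.0):
    """Score a query using, at each position, the longest kept context."""
    total = 0.0
    for i, sym in enumerate(query):
        k = 0
        # Lengthen the context one symbol at a time while it is still kept.
        while k < i and query[i - k - 1:i] in model:
            k += 1
        ctx = model[query[i - k:i]]
        # Additive smoothing (an assumption) keeps unseen symbols from
        # producing log(0).
        p = (ctx[sym] + pseudo) / (sum(ctx.values()) + pseudo * len(alphabet))
        total += math.log(p)
    return total


model = train(["abab", "abab"], max_len=2)
print(log_likelihood("ab", model, "ab") > log_likelihood("aa", model, "ab"))
```

In this toy run the training data always follows `a` with `b`, so the query `ab` scores higher than `aa`.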

The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties:

  • learning the automaton takes O(n) time.

  • prediction of a string of m symbols by the automaton takes O(m) time.
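One way to see why O(m) prediction is achievable once the set of contexts is fixed is to precompute every transition, in the style of the Aho-Corasick automaton [1]. The sketch below is only an illustration under that assumption and is not the automaton constructed in this paper; `build_context_automaton` and `contexts_along` are hypothetical names. It assumes every context and every query symbol is drawn from `alphabet`.

```python
from collections import deque


def build_context_automaton(contexts, alphabet):
    """Aho-Corasick-style automaton over a set of context strings.

    Returns (delta, longest): delta[state][symbol] is the next state, and
    longest[state] is the longest context that is a suffix of the text read
    so far when the automaton is in that state.
    """
    goto = [{}]      # trie edges
    ctx_at = [""]    # context ending at each node; the root is the empty one
    for ctx in contexts:
        s = 0
        for ch in ctx:
            if ch not in goto[s]:
                goto.append({})
                ctx_at.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        ctx_at[s] = ctx

    fail = [0] * len(goto)
    longest = [""] * len(goto)
    delta = [{} for _ in goto]
    queue = deque([0])
    while queue:  # breadth-first, so failure targets are finished first
        s = queue.popleft()
        longest[s] = ctx_at[s] if ctx_at[s] is not None else longest[fail[s]]
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = delta[fail[s]][ch] if s else 0
                delta[s][ch] = t
                queue.append(t)
            else:
                delta[s][ch] = delta[fail[s]][ch] if s else 0
    return delta, longest


def contexts_along(query, delta, longest):
    """Yield, for each position i, the context used to predict query[i]."""
    state = 0
    for ch in query:
        yield longest[state]
        state = delta[state][ch]  # one table lookup per query symbol


delta, longest = build_context_automaton(["a", "b", "ab", "ba"], "ab")
print(list(contexts_along("abab", delta, longest)))
```

Because every transition is tabulated in advance, scanning a query performs a constant number of lookups per symbol. The price in this sketch is a table with one entry per state and alphabet symbol.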

Along the way, the paper presents an evolving learning scheme, and addresses notions of empirical probability and related efficient computation, possibly a by-product of more general interest.

References

  1. Aho, A.V. and M.E. Corasick, Efficient String Matching: an Aid to Bibliographic Search, CACM, 18, 333-340 (1975).
  2. Apostolico, A. and Z. Galil (Eds.), Pattern Matching Algorithms, Oxford University Press, New York (1997).
  3. Bejerano, G. and G. Yona, Modeling Protein Families Using Probabilistic Suffix Trees. Proceedings of RECOMB99 (S. Istrail, P. Pevzner and M. Waterman, eds.), 15-24, Lyon, France, ACM Press (April 1999).
  4. Bejerano, G. and G. Yona, Variations on Probabilistic Suffix Trees - A New Tool for Modeling and Prediction of Protein Families. Submitted to Bioinformatics (October 1999).
  5. Blumer, A., J. Blumer, A. Ehrenfeucht, D. Haussler, M.T. Chen and J. Seiferas, The Smallest Automaton Recognizing the Subwords of a Text, Theoretical Computer Science, 40, 31-55 (1985).
  6. Kirkpatrick, S. and C.D. Gelatt Jr., Optimization by Simulated Annealing. Science, 220, 671-680 (1983).
  7. McCreight, E.M., A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272 (April 1976).
  8. Ron, D., Y. Singer and N. Tishby, The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning, 25:117-150 (1996).
  9. Ukkonen, E., On-line Construction of Suffix Trees. Algorithmica, 14(3):249-260 (1995).
  10. Weiner, P., Linear Pattern Matching Algorithm. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1-11, Washington, DC (1973).
      • Published in

        RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology
        April 2000
        329 pages
        ISBN:1581131860
        DOI:10.1145/332306

        Copyright © 2000 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate: 148 of 538 submissions, 28%
