ABSTRACT
Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case.
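To make the variable-memory idea concrete, the following is a minimal sketch, not the PST construction of [8]: it keeps every context of length up to L without pruning, and it scores a query by searching backward, at each position, for the longest context seen in training. The function names, the add-one smoothing, and the unpruned context table are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_len):
    """Count next-symbol occurrences after every context of length <= max_len.

    Simplified stand-in for PST learning: no pruning of contexts is done.
    """
    counts = defaultdict(lambda: defaultdict(int))
    counts[""]  # make sure the empty context (the root) always exists
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(max_len, i) + 1):
                counts[seq[i - k:i]][sym] += 1
    return counts


def log_likelihood(query, counts, alphabet, max_len):
    """Score a query using, at each position, the longest known context."""
    total = 0.0
    for i, sym in enumerate(query):
        # Naive backward search for the longest suffix of query[:i]
        # that was observed as a context during training.
        k = min(max_len, i)
        while query[i - k:i] not in counts:
            k -= 1
        ctx = counts[query[i - k:i]]
        n = sum(ctx.values())
        # Add-one smoothing is an assumption of this sketch.
        p = (ctx.get(sym, 0) + 1) / (n + len(alphabet))
        total += math.log(p)
    return total


counts = train_counts(["abab"], max_len=2)
score = log_likelihood("ab", counts, alphabet="ab", max_len=2)
```

The repeated backward search inside `log_likelihood` is the kind of per-position work that a linear-time predictor has to avoid.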
The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties:
- Learning the automaton takes O(n) time.
- Predicting a string of m symbols with the automaton takes O(m) time.
Along the way, the paper presents an evolving learning scheme, and addresses notions of empirical probability and related efficient computation, possibly a by-product of more general interest.
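One standard way to find, for every prefix of a query, the longest stored context that is a suffix of that prefix in O(m) total time (for dictionary-based transitions) is an Aho-Corasick-style automaton with failure links [1]. The sketch below illustrates that technique only; it is not claimed to be the construction used in this paper, and the names `build_matcher` and `longest_context_lengths` are hypothetical.

```python
from collections import deque


def build_matcher(contexts):
    """Build an Aho-Corasick-style automaton over a set of context strings."""
    goto = [{}]      # goto[state][symbol] -> child state in the trie
    fail = [0]       # failure link: longest proper suffix that is a trie node
    depth = [0]      # length of the string spelled by each state
    is_ctx = [True]  # the root stands for the empty context
    for ctx in contexts:
        s = 0
        for ch in ctx:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                depth.append(depth[s] + 1)
                is_ctx.append(False)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        is_ctx[s] = True

    # best[state] = length of the longest context that is a suffix of the state
    best = [0] * len(goto)
    queue = deque()
    for t in goto[0].values():
        best[t] = depth[t] if is_ctx[t] else 0
        queue.append(t)
    while queue:  # breadth-first, so shorter states are finished first
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            best[t] = depth[t] if is_ctx[t] else best[fail[t]]
            queue.append(t)
    return goto, fail, best


def longest_context_lengths(query, matcher):
    """Entry i is the length of the longest context that ends query[:i]."""
    goto, fail, best = matcher
    s = 0
    lengths = [best[0]]
    for ch in query:
        # Failure steps only shorten the current state, so their total
        # number over the whole scan is bounded by len(query).
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        lengths.append(best[s])
    return lengths


matcher = build_matcher(["a", "b", "ab", "ba"])
lengths = longest_context_lengths("abba", matcher)
```

With the context length known at every position in a single left-to-right pass, the per-position probability lookup no longer needs a backward search.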
- 1. Aho, A.V. and M.E. Corasick, Efficient String Matching: an Aid to Bibliographic Search, CACM, 18, 333-340 (1975).
- 2. Apostolico, A., and Z. Galil (Eds.), Pattern Matching Algorithms, Oxford University Press, New York (1997).
- 3. Bejerano, G. and G. Yona, Modeling Protein Families Using Probabilistic Suffix Trees. Proceedings of RECOMB99 (S. Istrail, P. Pevzner and M. Waterman, eds.), 15-24, Lyon, France, ACM Press (April 1999).
- 4. Bejerano, G. and G. Yona, Variations on Probabilistic Suffix Trees - A New Tool for Modeling and Prediction of Protein Families. Submitted to Bioinformatics (October 1999).
- 5. Blumer, A., J. Blumer, A. Ehrenfeucht, D. Haussler, M.T. Chen and J. Seiferas, The Smallest Automaton Recognizing the Subwords of a Text, Theoretical Computer Science, 40, 31-55 (1985).
- 6. Kirkpatrick, S. and C. D. Gelatt Jr., Optimization by Simulated Annealing. Science, 220, 671-680, 1983.
- 7. McCreight, E.M., A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, April 1976.
- 8. Ron, D., Y. Singer and N. Tishby, The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning, 25:117-150 (1996).
- 9. Ukkonen, E., On-line Construction of Suffix Trees. Algorithmica, 14(3):249-260, 1995.
- 10. Weiner, P., Linear Pattern Matching Algorithm. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1-11, Washington, DC, 1973.
- Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space