Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space

Authors:
Alberto Apostolico

Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN and Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy

Gill Bejerano

Institute of Computer Science, The Hebrew University, Jerusalem 91904, Israel


RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology, April 2000, Pages 25–32, https://doi.org/10.1145/332306.332321

Published: 08 April 2000

ABSTRACT

Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case.
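To make the variable-memory idea concrete, the following is a minimal Python sketch, not the construction of [8] or of this paper. It stores next-symbol counts for every context of up to `max_len` symbols (no pruning of states, unlike a real PST) and scores a query by falling back from the longest context to shorter ones at each position. The names `train_counts`, `log_likelihood`, and the additive smoothing parameter `pseudo` are assumptions introduced only for illustration. Because each query position may scan all of its candidate contexts, this naive scheme exhibits the kind of superlinear prediction cost the abstract describes when the context length is not bounded by a small constant.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_len):
    """Count next-symbol occurrences after every context of length 0..max_len.

    Illustrative only: a real PST keeps a pruned subset of these contexts.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(max_len, i) + 1):
                counts[seq[i - k:i]][sym] += 1
    return counts


def log_likelihood(query, counts, alphabet, max_len, pseudo=1.0):
    """Score a query using, at each position, the longest context seen in training."""
    total = 0.0
    for i, sym in enumerate(query):
        # Fall back from the longest candidate context to the empty context.
        for k in range(min(max_len, i), -1, -1):
            ctx = query[i - k:i]
            if ctx in counts:
                break
        nxt = counts[ctx]
        # Additive smoothing (an assumption) so unseen symbols keep nonzero mass.
        prob = (nxt.get(sym, 0) + pseudo) / (sum(nxt.values()) + pseudo * len(alphabet))
        total += math.log(prob)
    return total
```

For example, `log_likelihood("ab", train_counts(["abab"], 2), "ab", 2)` scores the query against a model trained on a single toy sequence over the alphabet `{a, b}`.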

The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties:

learning the automaton takes O (n) time.
prediction of a string of m symbols by the automaton takes O (m) time.

Along the way, the paper presents an evolving learning scheme, and addresses notions of empirical probability and related efficient computation, possibly a by-product of more general interest.
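As a rough illustration of what an empirical probability of a substring can mean, the sketch below uses one common convention: the number of (possibly overlapping) occurrences of a word divided by the number of positions at which it could start. This convention and the function name `empirical_probability` are assumptions; the abstract does not state the definition the paper adopts. The computation here is deliberately naive and makes no attempt at the efficient computation the paper refers to.

```python
def empirical_probability(word, sequences):
    """Occurrences of `word` divided by the number of positions where it could start.

    Assumed convention for illustration; overlapping occurrences are counted.
    """
    occurrences = 0
    positions = 0
    for seq in sequences:
        slots = len(seq) - len(word) + 1
        if slots <= 0:
            continue  # the word is longer than this sequence
        positions += slots
        occurrences += sum(1 for i in range(slots) if seq.startswith(word, i))
    return occurrences / positions if positions else 0.0
```

Each call rescans the training sequences, so evaluating many candidate words this way is costly, which is the kind of overhead that motivates a more careful approach.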

References

1. Aho, A.V. and M.E. Corasick, Efficient String Matching: an Aid to Bibliographic Search, CACM, 18, 333-340 (1975).
2. Apostolico, A., and Z. Galil (Eds.), Pattern Matching Algorithms, Oxford University Press, New York (1997).
3. Bejerano, G. and G. Yona, Modeling Protein Families Using Probabilistic Suffix Trees. Proceedings of RECOMB99 (S. Istrail, P. Pevzner and M. Waterman, eds.), 15-24, Lyon, France, ACM Press (April 1999).
4. Bejerano, G. and G. Yona, Variations on Probabilistic Suffix Trees - A New Tool for Modeling and Prediction of Protein Families. Submitted to Bioinformatics (October 1999).
5. Blumer, A., J. Blumer, A. Ehrenfeucht, D. Haussler, M.T. Chen and J. Seiferas, The Smallest Automaton Recognizing the Subwords of a Text, Theoretical Computer Science, 40, 31-55 (1985).
6. Kirkpatrick, S. and C. D. Gelatt Jr., Optimization by Simulated Annealing. Science, 220, 671-680, 1983.
7. McCreight, E.M., A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, April 1976.
8. Ron, D., Y. Singer and N. Tishby, The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning, 25:117-150 (1996).
9. Ukkonen, E., On-line Construction of Suffix Trees. Algorithmica, 14(3):249-260, 1995.
10. Weiner, P., Linear Pattern Matching Algorithm. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1-11, Washington, DC, 1973.


Published in
RECOMB '00: Proceedings of the fourth annual international conference on Computational molecular biology
April 2000
329 pages
ISBN: 1581131860
DOI: 10.1145/332306
Editors: Ron Shamir (Tel-Aviv Univ., Israel), Satoru Miyano (Univ. of Tokyo, Tokyo, Japan), Sorin Istrail (Sandia National Labs), Pavel Pevzner (Univ. of Southern California), Michael Waterman (Univ. of Southern California)
Copyright © 2000 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 148 of 538 submissions, 28%