Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space

Authors:
Alberto Apostolico;Gill Bejerano
Affiliations:
Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN and Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy;Institute of Computer Science, The Hebrew University, Jerusalem 91904, Israel
Venue:
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
Year:
2000

Citing 5
Cited 4

The power of amnesia: learning probabilistic automata with variable memory length

Machine Learning - Special issue on COLT '94
Pattern matching algorithms

Pattern matching algorithms
Modeling protein families using probabilistic suffix trees

RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Efficient string matching: an aid to bibliographic search

Communications of the ACM

String pattern matching for a deluge survival kit

Handbook of massive data sets
Notes on Learning Probabilistic Automata

DCC '00 Proceedings of the Conference on Data Compression
Towards Automatic Clustering of Protein Sequences

CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Protein structure abstraction and automatic clustering using secondary structure element sequences

ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part II

Abstract

Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: (i) learning the automaton takes O(n) time; (ii) prediction of a string of m symbols by the automaton takes O(m) time. Along the way, the paper presents an evolving learning scheme, and addresses notions of empirical probability and related efficient computation, possibly a by-product of more general interest.
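
To make the idea of a variable-memory model concrete, the following is a minimal Python sketch of a naive predictor that conditions each symbol on the longest recently seen context of length at most L. It is not the linear-time automaton introduced in the paper, nor the PST construction of [8] or [3, 4]: it counts substrings by brute force and backs off one symbol at a time. The function names (train_counts, next_symbol_prob, log_likelihood), the add-one smoothing, and the min_count threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(sequences, max_len):
    """Count every substring of length 1..max_len+1 (brute force, not linear time)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            longest = min(i + max_len + 1, len(seq))
            for j in range(i + 1, longest + 1):
                counts[seq[i:j]] += 1
    return counts


def next_symbol_prob(counts, alphabet, context, symbol, max_len, min_count=1):
    """Estimate P(symbol | context) from the longest sufficiently frequent suffix."""
    ctx = context[-max_len:] if max_len > 0 else ""
    while True:
        # Number of times ctx was observed followed by any symbol.
        total = sum(counts[ctx + a] for a in alphabet)
        if total >= min_count or not ctx:
            break
        ctx = ctx[1:]  # forget the oldest symbol and retry with a shorter memory
    # Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
    return (counts[ctx + symbol] + 1) / (total + len(alphabet))


def log_likelihood(counts, alphabet, query, max_len):
    """Sum of log-probabilities of each query symbol given its preceding context."""
    return sum(
        math.log(next_symbol_prob(counts, alphabet, query[:i], query[i], max_len))
        for i in range(len(query))
    )


if __name__ == "__main__":
    counts = train_counts(["abababab"], max_len=2)
    print(log_likelihood(counts, "ab", "abab", max_len=2))
    print(log_likelihood(counts, "ab", "aabb", max_len=2))
```

Trained on the toy sequence "abababab", this sketch assigns a higher log-likelihood to "abab" than to "aabb". The backoff loop revisits up to L contexts per query symbol, which is the kind of repeated work that the automata described in the abstract are designed to avoid.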