The power of amnesia: learning probabilistic automata with variable memory length
Machine Learning - Special issue on COLT '94
Pattern matching algorithms
Modeling protein families using probabilistic suffix trees
RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM (JACM)
Efficient string matching: an aid to bibliographic search
Communications of the ACM
String pattern matching for a deluge survival kit
Handbook of massive data sets
Notes on Learning Probabilistic Automata
DCC '00 Proceedings of the Conference on Data Compression
Towards Automatic Clustering of Protein Sequences
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Protein structure abstraction and automatic clustering using secondary structure element sequences
ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part II
Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and their use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost Θ(m²) time in the worst case.

The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties:

- learning the automaton takes O(n) time;
- prediction of a string of m symbols by the automaton takes O(m) time.

Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and their efficient computation, possibly a by-product of more general interest.
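To make the variable-memory prediction rule concrete, the following is a minimal sketch of a PST-style model: for each symbol of a query, the longest stored context (a suffix of the preceding history, up to length L) selects the next-symbol distribution. All class and method names here are illustrative, and the construction shown is of the naive Θ(Ln²) flavor the abstract contrasts with, not the paper's O(n) algorithm.

```python
import math
from collections import defaultdict


class PST:
    """Illustrative probabilistic suffix tree over a set of training strings.

    Stores, for every context of length 0..L seen in training, the
    empirical next-symbol distribution; prediction uses the longest
    stored context (variable memory length). Naive construction,
    roughly Theta(L n) counting work plus quadratic-style scans,
    unlike the paper's linear-time scheme.
    """

    def __init__(self, training, L=3, min_count=1):
        counts = defaultdict(lambda: defaultdict(int))
        for s in training:
            for i, c in enumerate(s):
                # credit symbol c to every context of length 0..L ending at i
                for k in range(0, min(L, i) + 1):
                    counts[s[i - k:i]][c] += 1
        self.L = L
        self.probs = {}
        for ctx, dist in counts.items():
            total = sum(dist.values())
            if total >= min_count:  # keep only sufficiently supported contexts
                self.probs[ctx] = {c: n / total for c, n in dist.items()}

    def _context(self, history):
        # longest stored suffix of the history -- the "variable memory"
        for k in range(min(self.L, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            if ctx in self.probs:
                return ctx
        return ""

    def log_likelihood(self, query, eps=1e-6):
        # sum of log P(symbol | longest matching context); eps smooths
        # symbols never observed after that context
        ll = 0.0
        for i, c in enumerate(query):
            ctx = self._context(query[:i])
            ll += math.log(self.probs.get(ctx, {}).get(c, eps))
        return ll
```

For example, a model trained on the string "abababab" assigns a much higher log-likelihood to the query "abab" than to "bbbb", since the contexts "a" and "b" deterministically predict the alternating pattern. The per-symbol scan over all suffixes of the history in `_context` is exactly the cost the paper's equivalent automata eliminate to reach O(m) prediction.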