Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space

  • Authors:
  • Alberto Apostolico;Gill Bejerano

  • Affiliations:
  • Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN and Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy; Institute of Computer Science, The Hebrew University, Jerusalem 91904, Israel

  • Venue:
  • RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
  • Year:
  • 2000

Abstract

Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. In [8], much more compact, tree-shaped variants of probabilistic automata are built which assume an underlying Markov process of variable memory length. In [3, 4], these variants, called Probabilistic Suffix Trees (PSTs), were successfully applied to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties:

  • Learning the automaton takes O(n) time.
  • Prediction of a string of m symbols by the automaton takes O(m) time.

Along the way, the paper presents an evolving learning scheme, and addresses notions of empirical probability and related efficient computation, possibly a by-product of more general interest.
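To make the variable-memory idea in the abstract concrete, the following is a minimal, naive sketch of prediction with a variable-length Markov model. It is not the linear-time automaton the paper introduces, nor a pruned PST: it simply stores next-symbol counts for every context of length up to L and, at each query position, rescans for the longest suffix of the preceding text seen in training. That per-position rescanning is the kind of repeated work a linear-time automaton avoids. The function names, the `max_context` parameter (standing in for L), and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_context):
    """Count next-symbol occurrences after every context of length 0..max_context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # Contexts are the k symbols immediately preceding position i.
            for k in range(0, min(max_context, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    return counts


def next_symbol_probability(counts, context, symbol, alphabet_size, max_context):
    """Estimate P(symbol | context) from the longest suffix of context seen in training."""
    for k in range(min(max_context, len(context)), -1, -1):
        suffix = context[len(context) - k:]
        if suffix in counts:
            table = counts[suffix]
            total = sum(table.values())
            # Add-one smoothing is an assumption, used here to avoid zero probabilities.
            return (table.get(symbol, 0) + 1) / (total + alphabet_size)
    # Nothing was trained at all: fall back to a uniform distribution.
    return 1.0 / alphabet_size


def log_likelihood(counts, query, alphabet_size, max_context):
    """Sum of log-probabilities of each query symbol given its preceding text."""
    return sum(
        math.log(next_symbol_probability(counts, query[:i], query[i],
                                         alphabet_size, max_context))
        for i in range(len(query))
    )
```

For example, after `counts = train_counts(["abab", "abab"], 2)`, the call `log_likelihood(counts, "abab", 2, 2)` scores higher than `log_likelihood(counts, "bbaa", 2, 2)`, since the first query follows the alternating pattern of the training set.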