A generalization of the PST algorithm: modeling the sparse nature of protein sequences

  • Authors:
  • Florencia G. Leonardi

  • Affiliations:
  • Instituto de Matemática e Estatística, Universidade de São Paulo, Rua do Matão 1010, CEP 05508-090, São Paulo, Brazil

  • Venue:
  • Bioinformatics
  • Year:
  • 2006

Abstract

Motivation: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable length Markov chains (VLMC) are a promising class of models that can effectively classify proteins into families and they can be estimated in linear time and space.

Results: We introduce a new algorithm, called Sparse Probabilistic Suffix Trees (SPST), that identifies equivalences between the contexts of a VLMC. We show that, in many cases, the identification of these equivalences can improve the classification rate of the classical Probabilistic Suffix Trees (PST) algorithm. We also show that better classification can be achieved by identifying representative fingerprints in the amino acid chains, and this variation in the SPST algorithm is called F-SPST.

Availability: The SPST algorithm can be freely downloaded from the site http://www.ime.usp.br/~leonardi/spst/

Contact: leonardi@ime.usp.br

Supplementary information: Supplementary data are available at Bioinformatics online.
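
To make the modeling idea in the abstract concrete, the following is a minimal toy sketch, not the SPST or PST algorithm itself. It predicts each residue from the longest stored suffix of its history (the variable-length context idea) and optionally merges contexts whose symbols are declared equivalent. The names `ContextModel`, `train`, `prob`, `score`, the `groups` map, the count-based pruning rule, and the pseudo-count smoothing are all assumptions made for illustration. The abstract does not say how SPST finds its equivalences, so here they are simply supplied by hand.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


@dataclass
class ContextModel:
    contexts: dict   # context string -> {next symbol: count}
    max_depth: int   # longest context length kept
    groups: dict     # hypothetical equivalence map applied to context symbols


def canon(ctx, groups):
    """Replace each context symbol by its class representative (if any)."""
    return "".join(groups.get(c, c) for c in ctx)


def train(sequences, max_depth=2, min_count=2, groups=None):
    """Count next-symbol frequencies for every context of length 0..max_depth."""
    groups = groups or {}
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for d in range(min(max_depth, i) + 1):
                counts[canon(seq[i - d:i], groups)][sym] += 1
    # Crude pruning (an assumption): keep contexts seen at least min_count times.
    kept = {ctx: dict(nxt) for ctx, nxt in counts.items()
            if ctx == "" or sum(nxt.values()) >= min_count}
    return ContextModel(kept, max_depth, groups)


def prob(model, history, sym, pseudo=0.01):
    """P(sym | longest stored suffix of history), with pseudo-count smoothing."""
    tail = history[-model.max_depth:] if model.max_depth else ""
    hist = canon(tail, model.groups)
    for d in range(len(hist), -1, -1):
        nxt = model.contexts.get(hist[len(hist) - d:])
        if nxt is not None:
            total = sum(nxt.values())
            return (nxt.get(sym, 0) + pseudo) / (total + pseudo * len(ALPHABET))
    return 1.0 / len(ALPHABET)  # untrained model: fall back to uniform


def score(model, seq):
    """Average log-probability per residue of seq under the model."""
    return sum(math.log(prob(model, seq[:i], s)) for i, s in enumerate(seq)) / len(seq)


# Toy usage: one model per family, assign a query to the best-scoring family.
family_a = train(["ACACACAC", "ACACAC"])
family_b = train(["GGTGGTGG", "GTGGTG"])
query = "ACAC"
best = max([("a", family_a), ("b", family_b)], key=lambda m: score(m[1], query))[0]
```

Passing, for example, `groups={"I": "L", "V": "L"}` to `train` pools the statistics of contexts that differ only by those residues, so each merged context is estimated from more observations. That is the intuition behind identifying equivalent contexts, shown here with a hand-picked grouping rather than one learned from data.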