On Integrating Peptide Sequence Analysis and Relational Distance-Based Indexing

  • Authors:
  • Weijia Xu;Rui Mao;Shu Wang;Daniel P. Miranker

  • Affiliations:
  • University of Texas at Austin;University of Texas at Austin;University of Texas at Austin;University of Texas at Austin

  • Venue:
  • BIBE '06 Proceedings of the Sixth IEEE Symposium on BioInformatics and BioEngineering
  • Year:
  • 2006

Abstract

Managing data with distance-based indexing methods has the potential to provide scalability and integration with relational database management systems and the SQL programming model. We previously demonstrated the advantages of such an approach for nucleotide sequences using Hamming distance (mismatch). However, the larger alphabet of peptide sequences increases the dimensionality of the problem, making comparable algorithmic results harder to achieve. The development of a metric-PAM (mPAM) substitution matrix enables metric-distance-based indexing for peptide sequences. The performance of distance-based indexing for homologous protein retrieval entails a trade-off among accuracy, scalability, and computational cost. We investigate the application of the multi-vantage point (MVP) tree algorithm to index peptide k-mers based on global mPAM alignment. We show that k-mer retrieval can still maintain accuracy when k is at least 6, which creates a domain of over 60 million key values and enables scalability sufficient for effective performance on large disk-resident sequence databases.
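
The arithmetic and the pruning idea behind the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses Hamming distance as a stand-in metric (the mPAM matrix values are not given here), a single pivot rather than a full MVP tree, and the names `kmer_domain_size`, `kmers`, `hamming`, and `range_query` are hypothetical. It assumes the 20 standard amino acids, under which the key domain for k = 6 is 20^6.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard residues


def kmer_domain_size(k, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


def kmers(seq, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions; a metric on equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def range_query(keys, pivot, pivot_dists, query, radius, dist=hamming):
    """Return keys within `radius` of `query`, pruning with one pivot.

    pivot_dists[i] must equal dist(keys[i], pivot). By the triangle
    inequality, |d(q, p) - d(p, x)| <= d(q, x), so any key whose
    precomputed pivot distance differs from d(q, p) by more than the
    radius cannot be a hit and is skipped without computing d(q, x).
    """
    dq = dist(query, pivot)
    hits = []
    for key, dp in zip(keys, pivot_dists):
        if abs(dq - dp) > radius:
            continue  # pruned: cannot lie within the radius
        if dist(query, key) <= radius:
            hits.append(key)
    return hits


if __name__ == "__main__":
    print(kmer_domain_size(6))
    keys = kmers("MKVLAAGIVGLK", 6)
    pivot = keys[0]
    pivot_dists = [hamming(key, pivot) for key in keys]
    print(range_query(keys, pivot, pivot_dists, "KVLAAG", 1))
```

The pruning step is valid only because the distance is a metric, which is why a metric form of the PAM matrix is needed before peptide k-mers can be indexed this way. A tree-based index such as the MVP tree applies the same triangle-inequality test recursively with several vantage points.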