SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management
Empirical evaluation of excluded middle vantage point forest on biological sequences workload
Proceedings of the 1st Workshop on New Trends in Similarity Search
Managing data with distance-based indexing methods has the potential to provide scalability and integration with relational database management systems and the SQL programming model. We previously demonstrated the advantages of such an approach for nucleotide sequences using Hamming distance (mismatch). However, the larger alphabet of peptide sequences increases the dimensionality of the problem, making the algorithmic problem more challenging. The development of a metric PAM (mPAM) substitution matrix enables metric-distance-based indexing for peptide sequences. The performance of distance-based indexing for homologous protein retrieval entails a trade-off among accuracy, scalability, and computational cost. We investigate the application of the multi-vantage point (MVP) tree algorithm to index peptide k-mers based on global mPAM alignment. We show that k-mer retrieval can still maintain accuracy when k is at least 6, which creates a domain of over 60 million key values and enables scalability sufficient for effective performance on large disk-resident sequence databases.
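As a rough illustration of the idea in the abstract, the following is a minimal sketch of metric indexing over peptide k-mers. It is not the authors' implementation: it uses a plain single-vantage-point tree rather than an MVP tree, and it substitutes Hamming distance for the mPAM distance, since the mPAM matrix values are not given here. All names (`kmers`, `hamming`, `build`, `range_search`) are hypothetical. The only property the search relies on is that the distance is a metric, so the triangle inequality can be used to prune subtrees.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Size of the 6-mer key domain over a 20-letter alphabet: 20**6 = 64,000,000.
DOMAIN_SIZE_K6 = len(AMINO_ACIDS) ** 6


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Mismatch count between equal-length strings (assumed stand-in for mPAM)."""
    return sum(x != y for x, y in zip(a, b))


def build(points, dist, rng):
    """Build a vantage-point tree as nested tuples (vantage, mu, inside, outside)."""
    points = list(points)
    if not points:
        return None
    # Pick a random vantage point and remove it from the working set.
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return (vantage, 0, None, None)
    dists = [dist(vantage, p) for p in points]
    mu = sorted(dists)[len(dists) // 2]  # median distance used as split radius
    inside = [p for p, d in zip(points, dists) if d <= mu]
    outside = [p for p, d in zip(points, dists) if d > mu]
    return (vantage, mu, build(inside, dist, rng), build(outside, dist, rng))


def range_search(node, query, radius, dist, out):
    """Append to out every indexed point within radius of query."""
    if node is None:
        return out
    vantage, mu, inside, outside = node
    d = dist(query, vantage)
    if d <= radius:
        out.append(vantage)
    # Triangle inequality: a hit p with dist(vantage, p) <= mu needs d - radius <= mu.
    if d - radius <= mu:
        range_search(inside, query, radius, dist, out)
    # Likewise, a hit p with dist(vantage, p) > mu needs d + radius > mu.
    if d + radius > mu:
        range_search(outside, query, radius, dist, out)
    return out
```

A typical use would be to index the distinct 6-mers of a database sequence with `build(set(kmers(seq, 6)), hamming, random.Random(0))` and then call `range_search` with a query 6-mer and a small radius. With many tied distances, as with Hamming distance on short k-mers, the tree can become unbalanced, which is one reason multi-vantage-point variants and finer-grained metrics are of interest.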