On Optimizing Distance-Based Similarity Search for Biological Databases

Authors:
Rui Mao;Weijia Xu;Smriti Ramakrishnan;Glen Nuckolls;Daniel P. Miranker
Affiliations:
University of Texas at Austin;University of Texas at Austin;University of Texas at Austin;University of Texas at Austin;University of Texas at Austin
Venue:
CSB '05 Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference
Year:
2005

Abstract

Similarity search built on distance-based index structures is increasingly used in both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance, and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high-dimensional feature vectors. We develop results showing that the character of these biological workloads differs from that of multimedia workloads. In particular, they are not intrinsically very high dimensional and therefore deserve different optimization heuristics. Based on MVP-trees, we develop a center-seeking pivot selection heuristic and show that it outperforms the most widely used corner-seeking heuristic. Similarly, we develop a data partitioning approach that is sensitive to the actual data distribution, in lieu of median splits.
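
As a rough illustration of how pivots help distance-based indexing in general, the sketch below runs a range query over DNA k-mers under Hamming distance and uses the triangle inequality to skip candidates. It is a generic pivot-table filter, not the MVP-tree or the pivot selection heuristic of the paper; the function names (`hamming`, `build_pivot_table`, `range_query`), the sample k-mers, and the hand-picked pivots are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(data: Sequence[str], pivots: Sequence[str]) -> List[List[int]]:
    """Precompute the distance from every data item to every pivot."""
    return [[hamming(x, p) for p in pivots] for x in data]


def range_query(
    query: str,
    radius: int,
    data: Sequence[str],
    pivots: Sequence[str],
    table: Sequence[Sequence[int]],
) -> Tuple[List[str], int]:
    """Return (matches, number of data items whose distance was computed)."""
    q_to_pivots = [hamming(query, p) for p in pivots]
    matches: List[str] = []
    evaluated = 0
    for x, row in zip(data, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x).
        # If the left side exceeds the radius for any pivot, x cannot match.
        if any(abs(dq - dx) > radius for dq, dx in zip(q_to_pivots, row)):
            continue
        evaluated += 1
        if hamming(query, x) <= radius:
            matches.append(x)
    return matches, evaluated


# Hypothetical sample data; pivots are chosen by hand for illustration only.
data = ["ACGTAC", "ACGTAA", "TTTTTT", "GGGGGG", "ACGAAC"]
pivots = ["ACGTAC", "TTTTTT"]
table = build_pivot_table(data, pivots)
matches, evaluated = range_query("ACGTAC", 1, data, pivots, table)
```

The filter never discards a true match, because it relies only on the triangle inequality; how many candidates it prunes depends on which pivots are chosen, which is the kind of choice a pivot selection heuristic addresses.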