Finding the Nearest Neighbors in Biological Databases Using Less Distance Computations

Authors:
Jianjun Zhou;Jorg Sander;Zhipeng Cai;Lusheng Wang;Guohui Lin
Affiliations:
University of Alberta, Edmonton;University of Alberta, Edmonton;University of Alberta, Edmonton;City University of Hong Kong, Hong Kong;University of Alberta, Edmonton
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2010

Citing 18
Cited 1

A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements

Pattern Recognition Letters
A cost model for similarity queries in metric spaces

PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
The choice of reference points in best-match file searching

Communications of the ACM
Searching in metric spaces

ACM Computing Surveys (CSUR)
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Similarity Search without Tears: The OMNI Family of All-purpose Access Methods

Proceedings of the 17th International Conference on Data Engineering
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Searching in metric spaces by spatial approximation

The VLDB Journal — The International Journal on Very Large Data Bases
Comparison of AESA and LAESA search algorithms using string and tree-edit-distances

Pattern Recognition Letters
Pivot selection techniques for proximity searching in metric spaces

Pattern Recognition Letters
Approximate searches: k-neighbors + precision

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Index-driven similarity search in metric spaces (Survey Article)

ACM Transactions on Database Systems (TODS)
Query-sensitive embeddings

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
iDistance: An adaptive B+-tree based indexing method for nearest neighbor search

ACM Transactions on Database Systems (TODS)
Speedup Clustering with Hierarchical Ranking

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient

The VLDB Journal — The International Journal on Very Large Data Bases
Nucleotide composition string selection in HIV-1 subtyping using whole genomes

Bioinformatics

Trying to outperform a well-known index with a sequential scan

Proceedings of the Joint EDBT/ICDT 2013 Workshops

Quantified Score

Hi-index	0.00

Visualization

Abstract

Modern biological applications usually involve the similarity comparison between two objects, which is often computationally very expensive, such as whole genome pairwise alignment and protein 3D structure alignment. Nevertheless, being able to quickly identify the closest neighboring objects from very large databases for a newly obtained sequence or structure can provide timely hints to its functions and more. This paper presents a substantial speedup technique for the well-studied k-nearest neighbor (k-nn) search, based on novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. The new method is able to dynamically locate virtual pivots, according to the query, with increasing pruning ability. Using the same or less amount of database preprocessing effort, the new method outperformed the second best method by using no more than 40 percent distance computations per query, on a database of 10,000 gene sequences, compared to several best known k-nn search methods including M-Tree, OMNI, SA-Tree, and LAESA. We demonstrated the use of this method on two biological sequence data sets, one of which is for HIV-1 viral strain computational genotyping.