Quality and efficiency in high dimensional nearest neighbor search

Authors:
Yufei Tao;Ke Yi;Cheng Sheng;Panos Kalnis
Affiliations:
Chinese University of Hong Kong, New Territories, Hong Kong;Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong;Chinese University of Hong Kong, New Territories, Hong Kong;King Abdullah University of Science and Technology, Saudi Arabia
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

Nearest neighbor (NN) search in high-dimensional space is an important problem in many applications. Ideally, a practical solution should (i) be implementable in a relational database, and (ii) have a query cost that grows sub-linearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements except locality sensitive hashing (LSH). Existing LSH implementations are either rigorous or ad hoc. Rigorous-LSH ensures good quality of query results, but incurs high space and query cost. Adhoc-LSH is more efficient, but it abandons quality control: the neighbor it outputs can be arbitrarily bad. As a result, no current method ensures both quality and efficiency in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. Combining several LSB-trees yields a structure called the LSB-forest, which ensures the same result quality as rigorous-LSH but dramatically reduces its space and query cost. The LSB-forest also outperforms adhoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself serves as an effective index that consumes linear space and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art in exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, while returning neighbors of much better quality.
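
For readers unfamiliar with LSH, the following is a minimal sketch of a generic hash-table LSH index for Euclidean distance, using the common projection hash h(v) = floor((a . v + b) / w) with Gaussian a. It is not the LSB-tree or LSB-forest, and it is not claimed to match either the rigorous or the adhoc variant discussed in the abstract. The class name `LSHIndex`, the helper `make_hash`, and the defaults for `num_tables`, `k`, and `w` are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, k, w, rng):
    """Return a function mapping a point to a tuple of k bucket ids."""
    # Each of the k hashes is h(v) = floor((a . v + b) / w), with the
    # entries of a drawn from a standard Gaussian and b uniform in [0, w).
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(k)]

    def h(v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
            for a, b in params
        )

    return h


class LSHIndex:
    """Illustrative LSH index: num_tables hash tables, k hashes per table."""

    def __init__(self, dim, num_tables=8, k=4, w=4.0, seed=0):
        rng = random.Random(seed)
        self.hashes = [make_hash(dim, k, w, rng) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def insert(self, v):
        """Store v in every table and return its integer id."""
        idx = len(self.points)
        self.points.append(tuple(v))
        for h, table in zip(self.hashes, self.tables):
            table[h(v)].append(idx)
        return idx

    def query(self, q):
        """Return the id of the closest colliding point, or None."""
        candidates = set()
        for h, table in zip(self.hashes, self.tables):
            candidates.update(table.get(h(q), ()))
        if not candidates:
            return None  # no collision in any table: no answer, no guarantee
        # Only the colliding candidates are checked with exact distances.
        return min(candidates, key=lambda i: math.dist(self.points[i], q))
```

A query inspects only the points that share a bucket with it in at least one table, which is where the sub-linear cost comes from. With fixed parameters like these, the returned neighbor may be far from the true nearest neighbor, or no candidate may be found at all. This is the quality-versus-efficiency tension the abstract describes.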