Fast Approximate Similarity Search in Extremely High-Dimensional Data Sets

Authors:
Michael E. Houle; Jun Sakuma
Affiliations:
National Institute of Informatics; Tokyo Institute of Technology
Venue:
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Year:
2005

Abstract

This paper introduces a practical index for approximate similarity queries on large multi-dimensional data sets: the spatial approximation sample hierarchy (SASH). A SASH is a multi-level structure of random samples, recursively constructed by building a SASH on a large randomly selected sample of data objects, and then connecting each remaining object to several of its approximate nearest neighbors from within the sample. Queries are processed by first locating approximate neighbors within the sample, and then using the pre-established connections to discover neighbors within the remainder of the data set. The SASH index relies on a pairwise distance measure, but otherwise makes no assumptions regarding the representation of the data. Experimental results are provided for query-by-example operations on protein sequence, image, and text data sets, including one consisting of more than 1 million vectors spanning more than 1.1 million terms, far in excess of what spatial search indices can handle efficiently. For sets of this size, the SASH can return a large proportion of the true neighbors roughly two orders of magnitude faster than sequential search.
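
The construction and query procedure described in the abstract can be illustrated with a minimal Python sketch. It is a simplified sample hierarchy written for illustration only, not the authors' SASH implementation: the function names (`build_hierarchy`, `descend`, `approx_knn`), the doubling level sizes, and the parameters `parents_per_node` and `width` are all assumptions introduced here, and details of the published structure beyond what the abstract states are omitted. The recursion "build on a sample, then connect the remaining objects" is unrolled into a top-down loop over levels.

```python
import random

def build_hierarchy(items, dist, parents_per_node=3, seed=0):
    """Build a simplified sample hierarchy (illustrative, not the paper's exact SASH)."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    # Split the shuffled indices into levels of size 1, 2, 4, ...; all levels
    # above the last one together form a random sample of about half the data.
    levels, start, size = [], 0, 1
    while start < len(order):
        levels.append(order[start:start + size])
        start += size
        size *= 2
    children = {i: [] for i in order}
    for depth in range(1, len(levels)):
        for obj in levels[depth]:
            # Search the part of the structure built so far for approximate
            # neighbors of obj, and link obj to them as a child.
            near = descend(items, dist, levels, children, items[obj],
                           width=parents_per_node, stop=depth)[-1]
            for parent in near[:parents_per_node]:
                children[parent].append(obj)
    return levels, children

def descend(items, dist, levels, children, query, width, stop=None):
    """Walk down from the root, keeping the `width` closest candidates per step."""
    stop = len(levels) if stop is None else stop
    frontier = list(levels[0])
    per_level = [frontier]
    for _ in range(1, stop):
        # Follow the pre-established links from the current candidates.
        candidates = {c for p in frontier for c in children[p]}
        if not candidates:
            break
        frontier = sorted(candidates, key=lambda i: dist(items[i], query))[:width]
        per_level.append(frontier)
    return per_level

def approx_knn(items, dist, levels, children, query, k, width=None):
    """Return indices of up to k approximate nearest neighbors of `query`."""
    width = width or 4 * k  # assumed default; larger widths trade speed for recall
    visited = descend(items, dist, levels, children, query, width)
    seen = {i for level in visited for i in level}
    return sorted(seen, key=lambda i: dist(items[i], query))[:k]

# Usage with a toy 1-D data set; only a pairwise distance function is required.
points = [float(i) for i in range(50)]
absdiff = lambda a, b: abs(a - b)
levels, children = build_hierarchy(points, absdiff)
neighbors = approx_knn(points, absdiff, levels, children, 17.3, k=3)
```

As in the abstract, the sketch touches the data only through the `dist` callable, so the same code would accept strings with an edit distance or vectors with a cosine distance. The result is approximate: a small `width` can miss true neighbors, while setting `width` to the data set size makes the descent exhaustive.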