Locality-sensitive hashing scheme based on dynamic collision counting

Authors:
Junhao Gan;Jianlin Feng;Qiong Fang;Wilfred Ng
Affiliations:
Sun Yat-Sen University, Guangzhou, China;Sun Yat-Sen University, Guangzhou, China;Hong Kong University of Science and Technology, Hong Kong, China;Hong Kong University of Science and Technology, Hong Kong, China
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Citing 14
Cited 5

K-d trees for semidynamic point sets

SCG '90 Proceedings of the sixth annual symposium on Computational geometry
Two algorithms for nearest-neighbor search in high dimensions

STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Concrete Mathematics: A Foundation for Computer Science

Concrete Mathematics: A Foundation for Computer Science
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Efficient similarity search and classification via rank aggregation

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Locality-sensitive hashing scheme based on p-stable distributions

SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
iDistance: An adaptive B+-tree based indexing method for nearest neighbor search

ACM Transactions on Database Systems (TODS)
Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling)

Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling)
Multi-probe LSH: efficient indexing for high-dimensional similarity search

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

ACM Transactions on Database Systems (TODS)
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
HashFile: An efficient index structure for multimedia data

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering

Inter-media hashing for large-scale retrieval from heterogeneous data sources

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Effective hashing for large-scale multimedia search

Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
HmSearch: an efficient hamming distance query processing algorithm

Proceedings of the 25th International Conference on Scientific and Statistical Database Management
Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Approximate high-dimensional nearest neighbor queries using R-forests

Proceedings of the 17th International Database Engineering & Applications Symposium

Quantified Score

Hi-index	0.00

Visualization

Abstract

Locality-Sensitive Hashing (LSH) and its variants are well-known methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH functions are concatenated to form a "static" compound hash function for building a hash table. In this paper, we propose to use a base of m single LSH functions to construct "dynamic" compound hash functions, and define a new LSH scheme called Collision Counting LSH (C2LSH). If the number of LSH functions under which a data object o collides with a query object q is greater than a pre-specified collision threhold l, then o can be regarded as a good candidate of c-approximate NN of q. This is the basic idea of C2LSH. Our theoretical studies show that, by appropriately choosing the size of LSH function base m and the collision threshold l, C2LSH can have a guarantee on query quality. Notably, the parameter m is not affected by dimensionality of data objects, which makes C2LSH especially good for high dimensional NN search. The experimental studies based on synthetic datasets and four real datasets have shown that C2LSH outperforms the state of the art method LSB-forest in high dimensional space.