Modeling LSH for performance tuning

Authors:
Wei Dong; Zhe Wang; William Josephson; Moses Charikar; Kai Li
Affiliations:
Princeton University, Princeton, NJ, USA; Princeton University, Princeton, NJ, USA; Princeton University, Princeton, NJ, USA; Princeton University, Princeton, NJ, USA; Princeton University, Princeton, NJ, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical, partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though it has obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process. To address this problem, we present a statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm that determines the probing parameter dynamically for each query. The adaptive probing method addresses the problem that, even when the parameters are tuned for optimal average performance, the variance of the performance is extremely high. We experimented with three different datasets, covering audio, images, and 3D shapes, to evaluate our methods. The results show the accuracy of the proposed model: the predicted recall is within 5% of the real value in most cases; the adaptive search method reduces the standard deviation of recall by about 50% compared with the existing method.
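
The idea of choosing the probing parameter per query can be illustrated with a minimal sketch. This is not the paper's model: the function name `adaptive_probe_count`, its parameters, and the per-bucket probabilities are hypothetical, and the sketch simply assumes that some estimate is available of how likely each probed bucket is to hold a true neighbor of the query. Since a neighbor falls in exactly one bucket, the estimates add up, and probing stops as soon as the running total reaches the target recall.

```python
from typing import Sequence


def adaptive_probe_count(
    bucket_probs: Sequence[float],  # hypothetical per-bucket hit estimates, in probing order
    target_recall: float,           # desired estimated recall for this query
    max_probes: int,                # hard cap on the number of probes
) -> int:
    """Return the smallest number of probes whose estimated recall reaches the target.

    Assumption for illustration only: bucket_probs[i] estimates the probability
    that a true neighbor of the query lies in the i-th probed bucket.
    """
    estimated_recall = 0.0
    for probes, p in enumerate(bucket_probs[:max_probes], start=1):
        estimated_recall += p  # buckets are disjoint, so probabilities add
        if estimated_recall >= target_recall:
            return probes
    # Target not reachable within the cap: use every probe that is allowed.
    return min(len(bucket_probs), max_probes)


# An "easy" query concentrates its probability mass in the first buckets,
# while a "hard" query spreads it out and therefore needs more probes.
easy_query = [0.6, 0.2, 0.1, 0.05]
hard_query = [0.3, 0.2, 0.15, 0.1, 0.1, 0.05]
print(adaptive_probe_count(easy_query, 0.7, 10))
print(adaptive_probe_count(hard_query, 0.7, 10))
```

With a fixed probe count, the easy query would be over-probed and the hard query under-probed; letting the count vary per query is one way to narrow the spread in recall that the abstract describes.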