LSH Forest: self-tuning indexes for similarity search

Authors:
Mayank Bawa; Tyson Condie; Prasanna Ganesan
Affiliations:
Stanford University, Stanford, CA; U. C. Berkeley, Berkeley, CA; Stanford University, Stanford, CA
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Abstract

We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
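
The abstract does not spell out the data structure, so the following is only a minimal sketch of the general idea: each point gets a long LSH label per tree, and a query matches on a label prefix whose length adapts to the data rather than on a fixed, hand-tuned number of hash bits. Several things here are assumptions made for illustration and are not taken from the abstract: the random-hyperplane hash family (suited to cosine similarity), the use of sorted label lists in place of real prefix trees, and every name (`LSHForestSketch`, `num_trees`, `k_max`, `insert`, `query`). The query loop is also a simplification that favors clarity over efficiency.

```python
import bisect
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class LSHForestSketch:
    """Toy index: each 'tree' is a sorted list of (k_max-bit label, point id)."""

    def __init__(self, dim, num_trees=8, k_max=24, seed=0):
        rng = random.Random(seed)
        self.k_max = k_max  # upper bound on label length, not a tuned value
        # Assumed hash family: k_max random hyperplanes per tree.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k_max)]
            for _ in range(num_trees)
        ]
        self.trees = [[] for _ in range(num_trees)]
        self.points = []

    def _label(self, tree, vec):
        # Bit i is "1" when vec lies on the non-negative side of hyperplane i.
        return "".join(
            "1" if sum(p * x for p, x in zip(plane, vec)) >= 0 else "0"
            for plane in self.planes[tree]
        )

    def insert(self, vec):
        pid = len(self.points)
        self.points.append(vec)
        for t, tree in enumerate(self.trees):
            bisect.insort(tree, (self._label(t, vec), pid))
        return pid

    def _with_prefix(self, tree, prefix):
        # Labels sharing a prefix are contiguous in sorted order; "2" sorts
        # after both "0" and "1", so prefix + "2" bounds the run from above.
        lo = bisect.bisect_left(tree, (prefix, -1))
        hi = bisect.bisect_left(tree, (prefix + "2", -1))
        return [pid for _, pid in tree[lo:hi]]

    def query(self, vec, m=3):
        labels = [self._label(t, vec) for t in range(len(self.trees))]
        candidates = set()
        # Shorten the prefix in all trees together until m candidates exist.
        for depth in range(self.k_max, -1, -1):
            for t, tree in enumerate(self.trees):
                candidates.update(self._with_prefix(tree, labels[t][:depth]))
            if len(candidates) >= m:
                break
        # Rank the candidates by their true similarity to the query.
        ranked = sorted(
            candidates,
            key=lambda pid: cosine(vec, self.points[pid]),
            reverse=True,
        )
        return ranked[:m]
```

In this sketch the effective prefix length is chosen per query: dense regions are matched on long prefixes and sparse regions fall back to shorter ones, which is one way to read the "self-tuning" claim. A caller would build the index with repeated `insert(vec)` calls and then ask for the `m` most similar stored points with `query(vec, m)`.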