We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
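To make the self-tuning idea concrete, here is a minimal main-memory sketch in Python. It is not the paper's implementation, and the abstract does not specify these details. The sketch assumes random-hyperplane hashing (an LSH family for cosine similarity), stores one fixed-length bit label per point in each of several sorted arrays that stand in for prefix trees, and answers a query by shortening the matched label prefix until enough candidates are found. The names `PrefixLSHIndex`, `num_trees`, `max_bits`, and `m` are hypothetical.

```python
import bisect
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class PrefixLSHIndex:
    """Illustrative prefix-based LSH index (hypothetical, simplified)."""

    def __init__(self, dim, num_trees=8, max_bits=16, seed=0):
        rng = random.Random(seed)
        self.max_bits = max_bits
        # One independent set of random hyperplanes per "tree".
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(max_bits)]
            for _ in range(num_trees)
        ]
        # Each tree is a sorted list of (label, key) pairs.
        self.trees = [[] for _ in range(num_trees)]
        self.points = {}

    def _label(self, t, vec):
        # Bit i records on which side of hyperplane i the vector falls.
        return "".join(
            "1" if sum(a * b for a, b in zip(plane, vec)) >= 0 else "0"
            for plane in self.planes[t]
        )

    def add(self, key, vec):
        self.points[key] = vec
        for t, tree in enumerate(self.trees):
            bisect.insort(tree, (self._label(t, vec), key))

    def _with_prefix(self, tree, prefix):
        # Labels sharing a prefix are contiguous in sorted order.
        i = bisect.bisect_left(tree, (prefix,))
        keys = []
        while i < len(tree) and tree[i][0].startswith(prefix):
            keys.append(tree[i][1])
            i += 1
        return keys

    def query(self, vec, m=3):
        """Return up to m keys, ranked by true cosine similarity."""
        labels = [self._label(t, vec) for t in range(len(self.trees))]
        candidates = set()
        # Start from the longest prefix and relax it one bit at a time,
        # so no label length has to be chosen in advance.
        for length in range(self.max_bits, -1, -1):
            for tree, label in zip(self.trees, labels):
                candidates.update(self._with_prefix(tree, label[:length]))
            if len(candidates) >= m:
                break
        ranked = sorted(
            candidates, key=lambda k: cosine(vec, self.points[k]), reverse=True
        )
        return ranked[:m]


index = PrefixLSHIndex(dim=3)
index.add("a", [1.0, 0.0, 0.0])
index.add("b", [0.0, 1.0, 0.0])
index.add("c", [0.9, 0.1, 0.0])
index.add("d", [-1.0, 0.0, 0.0])
print(index.query([1.0, 0.0, 0.0], m=2))
```

The point of the sketch is that the effective label length is chosen per query rather than fixed up front, which is one way to avoid hand-tuning a data-dependent hash length. For brevity it rescans every tree at each prefix length, which a real index would avoid.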