Direct methods for sparse matrices
Direct methods for sparse matrices
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Multidimensional binary search trees used for associative searching
Communications of the ACM
ACM Computing Surveys (CSUR)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
A Study of Index Structures for Main Memory Database Management Systems
VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Robust and fast similarity search for moving object trajectories
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Substructure similarity search in graph databases
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Efficient algorithms for substring near neighbor problem
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Effective keyword search in relational databases
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Google news personalization: scalable online collaborative filtering
Proceedings of the 16th international conference on World Wide Web
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs
Proceedings of the VLDB Endowment
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Streaming first story detection with application to Twitter
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Fast locality-sensitive hashing
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Fast GPU-based locality sensitive hashing for k-nearest neighbor computation
Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Cache-Conscious collision resolution in string hash tables
SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
WHAM: A High-Throughput Sequence Alignment Method
ACM Transactions on Database Systems (TODS)
Efficient distributed locality sensitive hashing
Proceedings of the 21st ACM international conference on Information and knowledge management
Modeling the impact of lifestyle on health at scale
Proceedings of the sixth ACM international conference on Web search and data mining
Hi-index | 0.00 |
Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects. In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.