Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Journal of the ACM (JACM)
Multidimensional binary search trees used for associative searching
Communications of the ACM
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Information Retrieval
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Identifying Representative Trends in Massive Time Series Data Sets Using Sketches
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
A Replacement for Voronoi Diagrams of Near Linear Size
FOCS '01 Proceedings of the 42nd IEEE symposium on Foundations of Computer Science
Navigating nets: simple algorithms for proximity search
SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
On demand classification of data streams
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Algorithms for advanced packet classification with ternary CAMs
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
Entropy based nearest neighbor search in high dimensions
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform
Proceedings of the thirty-eighth annual ACM symposium on Theory of computing
Lower bounds on locality sensitive hashing
Proceedings of the twenty-second annual symposium on Computational geometry
Cover trees for nearest neighbor
ICML '06 Proceedings of the 23rd international conference on Machine learning
On the Optimality of the Dimensionality Reduction Method
FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions
FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Multi-probe LSH: efficient indexing for high-dimensional similarity search
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
A Geometric Approach to Lower Bounds for Approximate Near-Neighbor Search and Partial Match
FOCS '08 Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer Science
International Journal of Approximate Reasoning
Quality and efficiency in high dimensional nearest neighbor search
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
LSH banding for large-scale retrieval with memory and recall constraints
ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Nearest neighbor pattern classification
IEEE Transactions on Information Theory
Small subset queries and bloom filters using ternary associative memories, with applications
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Virtually cool ternary content addressable memory
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
A resistive TCAM accelerator for data-intensive computing
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Software defined traffic measurement with OpenSketch
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
AC-DIMM: associative computing with STT-MRAM
Proceedings of the 40th Annual International Symposium on Computer Architecture
Hi-index | 0.00 |
Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high dimensional spaces, best known solutions offer little improvement over brute force search and thus are unsuitable for large scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial in n, where n is the size of the database. In this work we present a new technique of solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the data base. TCAMs are high performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH) wherein we hash database entries represented by vectors in the Euclidean space into {0,1,*}. By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We validate our claims with extensive simulations using both real world (Wikipedia) as well as synthetic (but illustrative) datasets. We observe that using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of size 1 million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform 1.5 million queries per second per 1Gb/s port. We believe that this work can open new avenues in very high speed data mining.