Query evaluation: strategies and optimizations
Information Processing and Management: an International Journal
Self-indexing inverted files for fast text retrieval
ACM Transactions on Information Systems (TOIS)
Optimization of inverted vector searches
SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Multidimensional binary search trees used for associative searching
Communications of the ACM
Efficient query evaluation using a two-level retrieval process
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Pattern Recognition and Machine Learning (Information Science and Statistics)
Pattern Recognition and Machine Learning (Information Science and Statistics)
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
G-hash: towards fast kernel-based similarity search in large graph databases
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Hash Kernels for Structured Data
The Journal of Machine Learning Research
The Journal of Machine Learning Research
Hi-index | 0.00 |
In this paper, we propose a novel method to efficiently compute the top-K most similar items given a query item, where similarity is defined by the set of items that have the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem, and is widely applicable in a number of domains such as information retrieval, online advertising and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to query lengths of 100,000s of terms. Our algorithm uses a generalized Holder's inequality to upper bound the inner product with the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset, and demonstrate that, with a memory overhead of 21%, our method can provide 1-3 orders of magnitude improvement in query run-time compared to naive methods and state of the art competing methods. Our median top-10 word query time is 25 us on 7.5 million words and 2.3 million documents.