Fast top-k similarity queries via matrix compression

Authors:
Yucheng Low; Alice X. Zheng
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA; Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

In this paper, we propose a novel method to efficiently compute the top-k most similar items for a given query item, where the most similar items are those with the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem and is widely applicable in domains such as information retrieval, online advertising, and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to queries with hundreds of thousands of terms. Our algorithm uses a generalized Hölder's inequality to upper-bound the inner product by the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset and demonstrate that, with a memory overhead of 21%, our method provides a 1-3 orders of magnitude improvement in query run-time over naive methods and state-of-the-art competing methods. Our median top-10 word query time is 25 µs on 7.5 million words and 2.3 million documents.
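
To illustrate the bounding idea only, the following is a minimal sketch of top-k inner-product search with norm-based pruning. It assumes the plain Hölder inequality |<q, x>| <= ||q||_p * ||x||_r with 1/p + 1/r = 1, not the generalized form or the compression scheme the abstract refers to, whose details are not given here. The sparse-dictionary representation and the names `p_norm`, `dot`, and `top_k_pruned` are hypothetical choices made for this example.

```python
import heapq

INF = float("inf")


def p_norm(vec, p):
    """p-norm of a sparse vector stored as {index: value}; p >= 1 or INF."""
    if not vec:
        return 0.0
    if p == INF:
        return max(abs(v) for v in vec.values())
    return sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)


def dot(a, b):
    """Inner product of two sparse vectors, iterating over the shorter one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


def top_k_pruned(query, items, k, p=2.0):
    """Return (top-k list of (score, item_index), number of exact dot products).

    Assumption: plain Hoelder bound |<q, x>| <= ||q||_p * ||x||_r,
    where r is the conjugate exponent of p (1/p + 1/r = 1).
    """
    if k <= 0:
        return [], 0
    if p == 1:
        r = INF
    elif p == INF:
        r = 1.0
    else:
        r = p / (p - 1.0)
    q_norm = p_norm(query, p)
    # Item norms are recomputed here for brevity; an index would precompute them.
    bounds = sorted(
        ((q_norm * p_norm(x, r), idx) for idx, x in enumerate(items)),
        reverse=True,
    )
    heap = []  # min-heap holding the best k (score, item_index) pairs so far
    n_exact = 0
    for bound, idx in bounds:
        # Bounds are visited in decreasing order, so once one cannot beat the
        # current k-th best score, no remaining item can either.
        if len(heap) == k and bound <= heap[0][0]:
            break
        score = dot(query, items[idx])
        n_exact += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, idx))
    return sorted(heap, reverse=True), n_exact
```

Under these assumptions, calling `top_k_pruned(query, items, k)` returns the same scores as a brute-force scan whenever the bound is computed exactly, while skipping the exact inner product for every item whose bound falls at or below the current k-th best score. In floating point the bound can be off by rounding error, so a production version would likely add a small safety margin before pruning.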