Similarity caching

Authors:
Flavio Chierichetti; Ravi Kumar; Sergei Vassilvitskii
Affiliations:
Sapienza University of Rome, Dipartimento di Informatica, Roma, Italy; Yahoo! Research, Sunnyvale, CA, USA; Yahoo! Research, Sunnyvale, CA, USA
Venue:
Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
Year:
2009

Abstract

We introduce the similarity caching problem, a variant of classical caching in which an algorithm can return an element from the cache that is similar, but not necessarily identical, to the query element. We are motivated by buffer management questions in approximate nearest-neighbor applications, especially in the context of caching targeted advertisements on the web. Formally, we assume the queries lie in a metric space, with distance function d(.,.). A query p is considered a cache hit if there is a point q in the cache that is sufficiently close to p, i.e., for a threshold radius r, we have d(p,q) ≤ r. The goal is then to minimize the number of cache misses, vis-à-vis the optimal algorithm. As with classical caching, we use the competitive ratio to measure the performance of different algorithms. While similarity caching is a strict generalization of classical caching, we show that unless the algorithm is allowed extra power (either in the size of the cache or the threshold r) over the optimal offline algorithm, the problem is intractable. We then proceed to quantify the hardness as a function of the complexity of the underlying metric space. We show that the problem becomes easier as we proceed from general metric spaces to those of bounded doubling dimension, and to Euclidean metrics. Finally, we investigate several extensions of the problem: dependence of the threshold r on the query and a smoother trade-off between the cache-miss cost and the query-query similarity.
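The hit rule in the abstract (a query p is a hit if some cached point q satisfies d(p,q) ≤ r) can be illustrated with a small sketch. This is not the algorithm analyzed in the paper: the class name `SimilarityCache`, the choice of Euclidean distance, the LRU eviction policy, and the linear scan over cached points are all assumptions made here for illustration only.

```python
import math
from collections import OrderedDict


class SimilarityCache:
    """Toy similarity cache over points given as tuples of floats.

    Assumptions (not from the paper): Euclidean distance as the metric d,
    LRU eviction on a miss, and a linear scan to find a close cached point.
    """

    def __init__(self, capacity, radius):
        self.capacity = capacity      # maximum number of cached points (>= 1)
        self.radius = radius          # threshold radius r
        self.points = OrderedDict()   # cached points, least recently used first
        self.misses = 0

    def query(self, p):
        """Return a cached point q with d(p, q) <= r (hit), or None (miss)."""
        hit = next(
            (q for q in self.points if math.dist(p, q) <= self.radius),
            None,
        )
        if hit is not None:
            self.points.move_to_end(hit)  # mark the hit point as recently used
            return hit
        # Miss: count it, evict the least recently used point if the cache
        # is full, then cache the query point itself.
        self.misses += 1
        if len(self.points) >= self.capacity:
            self.points.popitem(last=False)
        self.points[p] = None
        return None


cache = SimilarityCache(capacity=2, radius=1.0)
queries = [(0.0, 0.0), (0.5, 0.5), (5.0, 5.0), (0.2, 0.0), (9.0, 9.0), (5.3, 5.0)]
results = [cache.query(p) for p in queries]
print(results)       # hits return the nearby cached point, misses return None
print(cache.misses)  # number of cache misses on this sequence
```

Note that the second query, (0.5, 0.5), is served by the cached point (0.0, 0.0) even though the two are not identical; under classical (exact-match) caching it would be a miss. Comparing `cache.misses` against the miss count of an optimal offline algorithm on the same sequence is what the competitive ratio measures.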