Permutation indexing: fast approximate retrieval from large corpora

Authors:
Maxim Gurevich;Tamás Sarlós
Affiliations:
RelateIQ, Palo Alto, CA, USA;Google Inc., Mountain View, CA, USA
Venue:
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Year:
2013

Citing 20
Cited 0

Query evaluation: strategies and optimizations

Information Processing and Management: an International Journal
Static index pruning for information retrieval systems

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Cell-probe lower bounds for the partial match problem

Journal of Computer and System Sciences - Special issue: STOC 2003
Three-level caching for efficient query processing in large Web search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
A picture of search

InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
A document-centric approach to static index pruning in text retrieval systems

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
An exploration of proximity measures in information retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Efficiency-quality tradeoffs for vector score aggregation

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Design trade-offs for search engine caching

ACM Transactions on the Web (TWEB)
Challenges in building large-scale information retrieval systems: invited talk

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Learning in a pairwise term-term proximity framework for information retrieval

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
On Finding Dense Subgraphs

ICALP '09 Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I
The anatomy of an ad: structured indexing and retrieval for sponsored search

Proceedings of the 19th international conference on World wide web
Compact set representation for information retrieval

SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
How good is a span of terms?: exploiting proximity to improve web retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Stochastic query covering

Proceedings of the fourth ACM international conference on Web search and data mining
A cascade ranking model for efficient ranked retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Efficiently encoding term co-occurrences in inverted indexes

Proceedings of the 20th ACM international conference on Information and knowledge management
Retrieval models for audience selection in display advertising

Proceedings of the 20th ACM international conference on Information and knowledge management
Fast top-k retrieval for model based recommendation

Proceedings of the fifth ACM international conference on Web search and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Inverted indexing is a ubiquitous technique used in retrieval systems including web search. Despite its popularity, it has a drawback - query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids. Then, query results are retrieved by fetching few small chunks of these sequences. There is a price though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, recall, is controlled by the level of redundancy. The more space is allocated for the permutation index the higher is the recall. We analyze permutation indexing both theoretically under simplified document and query models, and empirically on a realistic document and query collections. We show that although permutation indexing can not replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.