Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, in which retrieval cost is strictly bounded and depends only logarithmically on the corpus size. Our approach is based on two novel techniques: (a) partitioning the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure that compactly encodes the results of all queries composed of terms in a cluster as contiguous sequences of document ids. Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, the recall, is controlled by the level of redundancy: the more space is allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
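To make the idea of answering a query with a contiguous chunk of a document-id sequence concrete, here is a minimal toy sketch in Python. It is not the construction from the paper: the ordering rule (sort documents by which cluster terms they contain), the helper names (`order_docs`, `longest_run`, `build_index`, `retrieve`, `recall`), and the tiny collection `DOCS` are all assumptions made for illustration. It only shows the trade-off described above: each query is answered by one stored range, some matching documents may fall outside that range, and storing more orderings (more redundancy) raises recall.

```python
# Toy illustration only: every name and the ordering rule are assumptions,
# not the data structure proposed in the paper.

DOCS = {  # document id -> terms it contains (one cluster of terms: a, b, c)
    0: {"a", "b"},
    1: {"b"},
    2: {"a"},
    3: {"a", "b", "c"},
    4: {"c"},
    5: {"b", "c"},
}
QUERIES = [frozenset(q) for q in ({"a"}, {"b"}, {"c"}, {"a", "b"}, {"b", "c"})]


def order_docs(docs, term_order):
    """Order doc ids so that documents containing earlier terms come first."""
    def key(doc_id):
        flags = tuple(0 if t in docs[doc_id] else 1 for t in term_order)
        return flags, doc_id
    return sorted(docs, key=key)


def longest_run(sequence, docs, query):
    """Return (start, end) of the longest contiguous run of matching documents."""
    best_start, best_end, start = 0, 0, None
    for i, doc_id in enumerate(sequence):
        if query <= docs[doc_id]:  # document contains every query term
            if start is None:
                start = i
            if i + 1 - start > best_end - best_start:
                best_start, best_end = start, i + 1
        else:
            start = None
    return best_start, best_end


def build_index(docs, term_orders, queries):
    """Store one sequence per term order and, per query, its best range."""
    sequences = [order_docs(docs, order) for order in term_orders]
    ranges = {}
    for query in queries:
        candidates = [(s,) + longest_run(seq, docs, query)
                      for s, seq in enumerate(sequences)]
        ranges[query] = max(candidates, key=lambda c: c[2] - c[1])
    return sequences, ranges


def retrieve(index, query):
    """Answer a query by slicing a single stored sequence."""
    sequences, ranges = index
    s, start, end = ranges[query]
    return sequences[s][start:end]


def recall(index, docs, query):
    """Fraction of the true result set returned by retrieve()."""
    truth = {d for d, terms in docs.items() if query <= terms}
    return len(retrieve(index, query)) / len(truth) if truth else 1.0


one_order = build_index(DOCS, [("a", "b", "c")], QUERIES)
two_orders = build_index(DOCS, [("a", "b", "c"), ("b", "c", "a")], QUERIES)
```

In this toy collection, `recall(one_order, DOCS, frozenset({"b"}))` is 0.5, because the documents containing `b` are split into two runs under the single ordering, while `recall(two_orders, DOCS, frozenset({"b"}))` is 1.0, because the second ordering places them contiguously. The extra stored sequence is the space cost of the higher recall.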