To index or not to index: time-space trade-offs in search engines with positional ranking functions

Authors:
Diego Arroyuelo; Senén González; Mauricio Marin; Mauricio Oyarzún; Torsten Suel
Affiliations:
Universidad Técnica Federico Santa María - Yahoo! Labs Santiago, Chile, Santiago, Chile; University of Chile - Yahoo! Labs Santiago, Chile, Santiago, Chile; University of Santiago, Chile - Yahoo! Labs Santiago, Chile, Santiago, Chile; University of Santiago, Chile - Yahoo! Labs Santiago, Chile, Santiago, Chile; Polytechnic Institute of NYU, Brooklyn, NY, 11201, USA
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Abstract

Positional ranking functions, widely used in Web search engines, improve result quality by exploiting the positions of the query terms within documents. However, positional indexes are well known to demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, in turn, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data, and we aim to answer whether positional data should be indexed at all. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that both positional and textual data can be stored using about 71% of the space used by traditional positional indexes, with a minor increase in query time. This yields considerable space savings and outperforms, in both space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which use about half the space of current state-of-the-art alternatives with little impact on query processing time.
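
The contrast between index-based and non-index-based access to positional data can be illustrated with a small sketch. This is not the paper's implementation: the whitespace tokenizer, the uncompressed in-memory dictionaries, and all function names below are assumptions made for illustration only. The first lookup reads positions from a precomputed positional index (extra space, fast lookup); the second recomputes them by scanning the stored document text, which is the same text a snippet generator would need anyway.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_positional_index(docs):
    # Index-based alternative: term -> {doc_id -> [word positions]}.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def positions_from_index(index, term, doc_id):
    # Look up precomputed positions; costs extra space for the index.
    return index.get(term, {}).get(doc_id, [])


def positions_from_text(docs, term, doc_id):
    # Non-index-based alternative: scan the stored text at query time.
    return [pos for pos, t in enumerate(tokenize(docs[doc_id])) if t == term]


docs = [
    "to index or not to index",
    "positional index and text snippets",
]
index = build_positional_index(docs)

# Both routes yield the same positions for a term in a document.
print(positions_from_index(index, "index", 0))
print(positions_from_text(docs, "index", 0))
```

In this toy setting both routes return identical position lists; the trade-off studied in the abstract is whether the space spent on the precomputed structure is worth the time saved over extracting positions from the (compressed) text.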