Distributed search based on self-indexed compressed text

Authors:
Diego Arroyuelo;Veronica Gil-Costa;SenéN GonzáLez;Mauricio Marin;Mauricio OyarzúN
Affiliations:
Yahoo! Research Latin America, Santiago, Chile;Yahoo! Research Latin America, Santiago, Chile and CONICET, National University of San Luis, Argentina;Yahoo! Research Latin America, Santiago, Chile;Yahoo! Research Latin America, Santiago, Chile and Department of Informatics Engineering, University of Santiago of Chile, Chile;Department of Informatics Engineering, University of Santiago of Chile, Chile
Venue:
Information Processing and Management: an International Journal
Year:
2012

Citing 15
Cited 1

Performance issues in distributed shared-nothing information-retrieval systems

Information Processing and Management: an International Journal
Modern Information Retrieval

Modern Information Retrieval
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Three-level caching for efficient query processing in large Web search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Load balancing for term-distributed parallel retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Mining query logs to optimize index partitioning in parallel web search engines

Proceedings of the 2nd international conference on Scalable information systems
Query-driven indexing for scalable peer-to-peer text retrieval

Future Generation Computer Systems
Scheduling Intersection Queries in Term Partitioned Inverted Files

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Practical Rank/Select Queries over Arbitrary Sequences

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Inverted index compression and query processing with optimized document ordering

Proceedings of the 18th international conference on World wide web
Improved techniques for result caching in web search engines

Proceedings of the 18th international conference on World wide web
Compressing term positions in web indexes

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
New caching techniques for web search engines

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Information Retrieval: Implementing and Evaluating Search Engines

Information Retrieval: Implementing and Evaluating Search Engines
Compressed self-indices supporting conjunctive queries on document collections

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

To index or not to index: time-space trade-offs in search engines with positional ranking functions

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting lists caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out the posting lists generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.