A term-based inverted index partitioning model for efficient distributed query processing

Authors:
B. Barla Cambazoglu;Enver Kayaaslan;Simon Jonassen;Cevdet Aykanat
Affiliations:
Yahoo Labs;Yahoo Labs;Yahoo Labs;Bilkent University
Venue:
ACM Transactions on the Web (TWEB)
Year:
2013



Abstract

In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is partitioned either by document or by term. The choice depends on the properties of the underlying hardware infrastructure, the query traffic distribution, and performance and availability constraints. In retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead caused by transferring large amounts of data from the index servers forms a major performance bottleneck that limits the scalability of the entire distributed retrieval system. To alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, index entries that are accessed concurrently are assigned to the same index servers, based on inverted index access patterns extracted from past query logs. The model aims to minimize the communication overhead that future queries will incur while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in the literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.
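
The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's model. It assumes one plausible reading in which terms are the vertices of a hypergraph and each logged query is a net connecting its terms. Communication is approximated by the number of servers a query touches minus one, and load by the total posting-list length a server scans over the log. The function names, the `slack` parameter, and the greedy heuristic are hypothetical stand-ins for a real hypergraph partitioner.

```python
from collections import defaultdict


def communication_cost(queries, assignment):
    """Sum over queries of (distinct servers touched - 1); an assumed proxy."""
    return sum(len({assignment[t] for t in q}) - 1 for q in queries)


def server_loads(queries, assignment, posting_len, num_servers):
    """Total posting-list length each server scans over the whole query log."""
    loads = [0] * num_servers
    for q in queries:
        for t in q:
            loads[assignment[t]] += posting_len[t]
    return loads


def greedy_partition(queries, posting_len, num_servers, slack=0.2):
    """Greedy stand-in for a hypergraph partitioner (assumption, not the paper's
    algorithm): place heavy terms first, preferring the server that already
    holds the most co-queried terms, subject to a per-server load cap."""
    weight = defaultdict(int)                      # term -> load it generates
    cooc = defaultdict(lambda: defaultdict(int))   # term -> co-queried term -> count
    for q in queries:
        for t in q:
            weight[t] += posting_len[t]
            for u in q:
                if u != t:
                    cooc[t][u] += 1

    cap = (1 + slack) * sum(weight.values()) / num_servers
    loads = [0] * num_servers
    assignment = {}
    for t in sorted(weight, key=weight.get, reverse=True):
        # Affinity of term t to each server, from already-placed co-queried terms.
        affinity = [0] * num_servers
        for u, count in cooc[t].items():
            if u in assignment:
                affinity[assignment[u]] += count
        candidates = [s for s in range(num_servers) if loads[s] + weight[t] <= cap]
        if not candidates:
            # Every server would exceed the cap; fall back to the least loaded one.
            candidates = [min(range(num_servers), key=loads.__getitem__)]
        best = max(candidates, key=lambda s: (affinity[s], -loads[s]))
        assignment[t] = best
        loads[best] += weight[t]
    return assignment


# Toy query log: two term pairs that are always queried together.
queries = [("a", "b"), ("a", "b"), ("c", "d"), ("c", "d")]
posting_len = {"a": 10, "b": 10, "c": 10, "d": 10}

grouped = greedy_partition(queries, posting_len, num_servers=2)
scattered = {"a": 0, "b": 1, "c": 0, "d": 1}  # ignores co-access patterns

print(communication_cost(queries, grouped), server_loads(queries, grouped, posting_len, 2))
print(communication_cost(queries, scattered), server_loads(queries, scattered, posting_len, 2))
```

On this toy log both assignments give equal server loads, but the co-access-aware one keeps every query on a single server, whereas the scattered one forces each query to contact both servers. A production system would replace the greedy step with a dedicated hypergraph partitioning tool and a cost model calibrated to actual data transfer volumes.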