Mining query logs to optimize index partitioning in parallel web search engines

Authors:
Claudio Lucchese;Salvatore Orlando;Raffaele Perego;Fabrizio Silvestri
Affiliations:
Università Ca' Foscari di Venezia, Venezia, Italy and ISTI-CNR, Pisa, Italy;Università Ca' Foscari di Venezia, Venezia, Italy;ISTI-CNR, Pisa, Italy;ISTI-CNR, Pisa, Italy
Venue:
Proceedings of the 2nd international conference on Scalable information systems
Year:
2007

Abstract

Large-scale parallel Web Search Engines (WSEs) need to adopt a strategy for partitioning the inverted index among a set of parallel server nodes. In this paper we are interested in devising an effective term-partitioning strategy, according to which the global vocabulary of terms and the associated inverted lists are split into disjoint subsets and assigned to distinct servers. Because the skewed distribution of terms in user queries causes workload imbalance, finding an effective partitioning strategy is considered a very complex task. We first formally introduce Term Partitioning as a new optimization problem. We then show how knowledge mined from past WSE query logs can be profitably used to discover good solutions to this problem. Finally, we report experimental results showing that we can effectively reduce both the average number of servers activated per query and the workload imbalance. Experiments are conducted on large query logs of real WSEs.
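
The two quantities named in the abstract, the average number of servers activated per query and the workload imbalance, can be illustrated with a small sketch. This is not the authors' formulation or algorithm: the function name `evaluate_partitioning`, the toy query log, and both term-to-server assignments are hypothetical, and server load is approximated here simply as the number of query terms each server handles (an assumption; a real cost model would likely account for inverted list lengths).

```python
def evaluate_partitioning(queries, term_to_server, num_servers):
    """Return (average servers activated per query, load imbalance).

    Assumption: a server's load is the number of query terms routed to it.
    Imbalance is max load / mean load, so 1.0 means perfectly balanced.
    """
    load = [0] * num_servers
    activated_total = 0
    for query in queries:
        servers = set()
        for term in query:
            server = term_to_server[term]
            load[server] += 1
            servers.add(server)
        activated_total += len(servers)
    avg_servers = activated_total / len(queries)
    mean_load = sum(load) / num_servers
    imbalance = max(load) / mean_load if mean_load else 0.0
    return avg_servers, imbalance


# Hypothetical toy query log: each query is a tuple of terms.
queries = [
    ("cheap", "flights"),
    ("cheap", "hotels"),
    ("cheap", "flights", "rome"),
    ("weather", "rome"),
]

# Baseline: terms spread round-robin in alphabetical order over 2 servers.
vocabulary = sorted({term for query in queries for term in query})
round_robin = {term: i % 2 for i, term in enumerate(vocabulary)}

# Hand-built alternative: terms that co-occur in the log share a server.
cooccurrence_aware = {
    "cheap": 0, "flights": 0, "hotels": 0,
    "rome": 1, "weather": 1,
}

rr_avg, rr_imbalance = evaluate_partitioning(queries, round_robin, 2)
co_avg, co_imbalance = evaluate_partitioning(queries, cooccurrence_aware, 2)
print(f"round-robin:        avg servers={rr_avg:.2f}, imbalance={rr_imbalance:.3f}")
print(f"co-occurrence-aware: avg servers={co_avg:.2f}, imbalance={co_imbalance:.3f}")
```

On this toy log, grouping co-occurring terms lowers the average number of servers activated per query (1.25 versus 1.75) but raises the imbalance, because the frequent term "cheap" concentrates load on one server. That tension between the two objectives is why the abstract treats term partitioning as an optimization problem rather than a simple assignment.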