Load-balancing and caching for collection selection architectures

Authors:
Diego Puppin;Fabrizio Silvestri;Raffaele Perego;Ricardo Baeza-Yates
Affiliations:
ISTI-CNR, Pisa;ISTI-CNR, Pisa;ISTI-CNR, Pisa;Yahoo! Research, Barcelona/Santiago
Venue:
Proceedings of the 2nd international conference on Scalable information systems
Year:
2007


Abstract

To cope with the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, in which the collection of indexed documents is partitioned among several servers and query answering is performed as a parallel and distributed task. Collection selection can reduce the overall computing load by trading off the quality of the retrieved results against the cost of solving queries. In this paper, we analyze how the collection selection strategy affects load balancing and the caching subsystem, by exploring the design space of a distributed search engine based on collection selection. In particular, we propose a strategy that performs collection selection in a load-driven way, and a novel caching policy that incrementally refines the effectiveness of the returned results on each subsequent cache hit. The combination of load-driven collection selection and incremental caching allows our system to retrieve two thirds of the top-ranked results returned by a baseline centralized index with only one fifth of the computing workload.