Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

Authors:
Diego Puppin;Fabrizio Silvestri;Raffaele Perego;Ricardo Baeza-Yates
Affiliations:
ISTI “A. Faedo”, CNR, Pisa, Italy;ISTI “A. Faedo”, CNR, Pisa, Italy;ISTI “A. Faedo”, CNR, Pisa, Italy;Yahoo! Research, Barcelona, Spain
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2010

Abstract

This article introduces an architecture for a document-partitioned search engine based on load-driven routing, a novel approach that combines collection selection and load balancing. By exploiting the query-vector document model and the incremental caching technique, our architecture can compute very high-quality results for any query with only a fraction of the computational load used in a typical document-partitioned architecture. By trading off a small fraction of the results, our technique strongly reduces the computing pressure on a search engine back-end: we are able to retrieve more than 2/3 of the top-5 results for a given query with only 10% of the computing load needed by a configuration where the query is processed by each index partition. Alternatively, we can increase the load slightly, up to 25%, to improve precision and retrieve more than 80% of the top-5 results. The flexibility of our system allows a wide range of configurations, so that it can easily respond to different needs in result quality or restrictions in computing power. More importantly, the system configuration can be adjusted dynamically to cope with unexpected query peaks or unpredictable failures. This article consolidates recent work by the authors, presenting results from tests conducted on 6 million documents and 2,800,000 queries, with real query cost timing as measured on an actual index.
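
The abstract does not spell out the mechanics of load-driven routing or incremental caching, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a collection-selection step has already ranked the index partitions for a query, that each partition accepts queries only while its load is below a fixed capacity, and that the cache remembers which partitions have already answered a query so that a repeated query is sent only to partitions not yet polled. All names (Partition, route, IncrementalCache, capacity, max_partitions) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """One index partition (hypothetical model, not the authors' code)."""
    name: str
    capacity: int  # assumed: max queries accepted per time window
    load: int = 0  # queries accepted so far in the current window

    def search(self, query: str) -> list[tuple[float, str]]:
        # Placeholder for real retrieval: returns (score, doc_id) pairs.
        self.load += 1
        return [(1.0, f"{self.name}:{query}")]


def route(ranked: list[Partition], max_partitions: int) -> list[Partition]:
    """Pick up to max_partitions servers in selection order,
    skipping any server whose load has reached its capacity."""
    chosen = []
    for p in ranked:
        if len(chosen) == max_partitions:
            break
        if p.load < p.capacity:
            chosen.append(p)
    return chosen


class IncrementalCache:
    """Caches merged results together with the set of partitions already polled."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[set[str], list[tuple[float, str]]]] = {}

    def answer(self, query: str, ranked: list[Partition],
               max_partitions: int, k: int = 5) -> list[tuple[float, str]]:
        polled, results = self._entries.get(query, (set(), []))
        # On a repeated query, only partitions not yet polled are candidates.
        fresh = [p for p in ranked if p.name not in polled]
        for p in route(fresh, max_partitions):
            results = results + p.search(query)
            polled = polled | {p.name}
        # Keep the k best results seen so far across all polled partitions.
        results = sorted(results, reverse=True)[:k]
        self._entries[query] = (polled, results)
        return results
```

Under these assumptions, lowering max_partitions or capacity reduces the back-end load at the cost of result coverage, which mirrors the trade-off described above; a repeated query gradually accumulates results from more partitions instead of re-polling the same ones.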