On the feasibility of multi-site web search engines

Authors:
Ricardo Baeza-Yates;Aristides Gionis;Flavio Junqueira;Vassilis Plachouras;Luca Telloli
Affiliations:
Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009



Abstract

Web search engines are often implemented as centralized systems. Designing and implementing a Web search engine in a distributed environment is a challenging engineering task that encompasses many interesting research questions. However, distributing a search engine across multiple sites has several advantages, such as using fewer computing resources and exploiting data locality. In this paper, we investigate the cost-effectiveness of building a distributed Web search engine. We propose a model for assessing the total cost of a distributed Web search engine that includes the computational costs and the communication cost among all distributed sites. We then present a query-processing algorithm that maximizes the number of queries answered locally, without sacrificing the quality of the results compared to a centralized search engine. We simulate the algorithm on real document collections and query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive with a centralized architecture with respect to real cost.
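
The accounting described in the abstract (computational cost at each site plus communication cost between sites) can be illustrated with a toy calculation. The sketch below is not the paper's model: the `Site` fields, the `total_cost` function, the per-query costs, and the assumption that only non-local queries incur communication cost are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # All fields are illustrative assumptions, not parameters from the paper.
    name: str
    queries: int            # queries arriving at this site
    local_fraction: float   # share of queries answered without other sites
    cost_per_query: float   # assumed compute cost of processing one query


def total_cost(sites, cost_per_forward):
    """Toy total: compute cost at every site plus a flat communication
    cost for each query that cannot be answered locally."""
    compute = sum(s.queries * s.cost_per_query for s in sites)
    forwarded = sum(s.queries * (1.0 - s.local_fraction) for s in sites)
    return compute + forwarded * cost_per_forward


# Two hypothetical sites with made-up workloads.
sites = [
    Site("A", queries=1000, local_fraction=0.75, cost_per_query=1.0),
    Site("B", queries=500, local_fraction=0.5, cost_per_query=1.0),
]
distributed = total_cost(sites, cost_per_forward=0.5)
print(distributed)
```

In this toy setting, raising `local_fraction` lowers the communication term, which is the intuition behind maximizing the number of queries answered locally.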