Distributed web retrieval

  • Author: Ricardo Baeza-Yates
  • Affiliation: Yahoo! Research, Barcelona, Spain
  • Venue: Proceedings of the 20th International Conference Companion on World Wide Web
  • Year: 2011

Abstract

In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly (over 270 million at the beginning of 2011) and there are currently more than 20 billion indexed pages. At the same time, there are over one billion Internet users, and hundreds of millions of queries are issued each day. In the near future, centralized systems are likely to become less effective against such a data-query load, suggesting the need for fully distributed search engines. Such engines must maintain high-quality answers, fast response times, high query throughput, high availability, and scalability in spite of network latency and scattered data. In this tutorial we present the architecture of current search engines and explore the main challenges behind the design of all the processes of a distributed Web retrieval system: crawling, indexing, and query processing.
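
To make the distributed query-processing idea concrete, here is a minimal sketch, in Python, of document-partitioned retrieval: each shard builds an inverted index over its own slice of the collection, and a broker broadcasts the query to all shards and merges the partial top-k lists. This sketch is not from the tutorial; the class names (IndexShard, Broker), the raw term-frequency scoring, and the in-process "shards" are illustrative assumptions, whereas a real engine would rank with a function such as BM25 and dispatch queries to shards over the network.

  # Illustrative sketch of document-partitioned query processing (assumed
  # names and scoring, not the tutorial's code): each shard indexes one
  # partition; a broker broadcasts the query and merges partial top-k lists.
  from collections import defaultdict
  import heapq

  class IndexShard:
      """Inverted index over one partition of the document collection."""

      def __init__(self, docs):
          # docs: dict mapping doc_id -> text
          self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
          for doc_id, text in docs.items():
              for term in text.lower().split():
                  self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

      def search(self, query, k):
          """Score local documents by summed term frequency; return local top-k."""
          scores = defaultdict(int)
          for term in query.lower().split():
              for doc_id, tf in self.postings.get(term, {}).items():
                  scores[doc_id] += tf
          return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

  class Broker:
      """Broadcasts a query to every shard and merges the partial results."""

      def __init__(self, shards):
          self.shards = shards

      def search(self, query, k=10):
          partial = []
          for shard in self.shards:  # in a real engine these are remote calls
              partial.extend(shard.search(query, k))
          return heapq.nlargest(k, partial, key=lambda item: item[1])

  if __name__ == "__main__":
      shard_a = IndexShard({"d1": "distributed web search engines",
                            "d2": "web crawling at scale"})
      shard_b = IndexShard({"d3": "query processing in distributed systems",
                            "d4": "inverted index partitioning"})
      broker = Broker([shard_a, shard_b])
      print(broker.search("distributed web query"))

Document partitioning is only one of the design choices the tutorial covers; the same broker/shard split also applies, with different trade-offs, to term-partitioned indexes and to the crawling and indexing pipelines that feed them.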