On the feasibility of geographically distributed web crawling

Authors:
B. Barla Cambazoglu;Vassilis Plachouras;Flavio Junqueira;Luca Telloli
Affiliations:
Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain
Venue:
Proceedings of the 3rd international conference on Scalable information systems
Year:
2008

Abstract

We identify the issues that are important in the design of a geographically distributed Web crawler and discuss them from both a "benefit" and a "challenge" point of view. More specifically, we focus on the effect of the geographical locality of Web sites on crawling performance and, as a practical study, investigate the feasibility of a distributed crawler in terms of network costs. For this purpose, we conduct various experiments to collect network access statistics about the servers in the educational domains of eight countries (USA, Canada, Chile, Brazil, Spain, Portugal, Turkey, and Greece). We gather the statistics with echoping from four sites located in the USA, Brazil, Spain, and Turkey. The results favor geographically distributed Web crawling in terms of crawling throughput.