Sitemaps: above and beyond the crawl of duty
WWW '09: Proceedings of the 18th International Conference on World Wide Web
Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web: the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol, supported jointly by the major search engines, that helps content creators unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps on key measures such as coverage and freshness, both over representative websites and over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well and offer different tradeoffs.
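To make the comparison concrete, below is a minimal sketch (not taken from the paper) of the two URL-discovery channels it contrasts: sitemap-based crawling, where a site enumerates its URLs in an XML file following the Sitemaps protocol (http://www.sitemaps.org), and "classic" discovery crawling, where the crawler extracts outgoing links from pages it has already fetched. The example.com URLs and the helper names are hypothetical; a production crawler would add politeness, scheduling, and deduplication on top of this.

import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# Namespace used by sitemap files per the Sitemaps 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url):
    """Fetch a sitemap and return (loc, lastmod) pairs for each <url> entry."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    entries = []
    for url_el in root.iter(SITEMAP_NS + "url"):
        loc = url_el.findtext(SITEMAP_NS + "loc")
        lastmod = url_el.findtext(SITEMAP_NS + "lastmod")  # optional in the protocol
        if loc:
            entries.append((loc.strip(), lastmod))
    return entries

class LinkExtractor(HTMLParser):
    """Collect <a href=...> targets: the raw material of discovery crawling."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def urls_by_discovery(page_url):
    """Fetch one page and return the outgoing links it exposes."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    extractor = LinkExtractor(page_url)
    extractor.feed(html)
    return extractor.links

if __name__ == "__main__":
    # Sitemaps can surface URLs directly, even pages no crawled page links to...
    for loc, lastmod in urls_from_sitemap("https://example.com/sitemap.xml")[:5]:
        print(loc, lastmod)
    # ...while discovery only reaches pages that already-fetched pages link to.
    for link in urls_by_discovery("https://example.com/")[:5]:
        print(link)

The sketch also hints at the tradeoff the abstract alludes to: a sitemap entry can carry a last-modification time, which helps freshness, whereas a discovered link carries the implicit endorsement of an incoming link, which correlates with page importance.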