We have developed a web-repository crawler for reconstructing websites when backups are unavailable. The crawler retrieves web resources from the Internet Archive, Google, Yahoo, and MSN. We examine the challenges of crawling web repositories and discuss strategies for overcoming some of these obstacles. We propose three crawling policies that can be used to reconstruct websites, and we evaluate their effectiveness by reconstructing 24 websites and comparing the results with the live versions of those websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
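As an illustrative aside only (this is not the crawler described above, which also queried the Google, Yahoo, and MSN caches), the sketch below shows one way to look up an archived copy of a lost URL in a single web repository, the Internet Archive, using the present-day Wayback Machine availability API. The helper name closest_snapshot, the example URL, and the chosen timestamp are assumptions for illustration.

import json
import urllib.parse
import urllib.request

# Hypothetical helper (illustration only): ask the Internet Archive's
# Wayback Machine availability API for the snapshot of `url` closest
# to `timestamp` (YYYYMMDD), and return that snapshot's URL or None.
def closest_snapshot(url, timestamp=None):
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# Example: look for a copy of a missing page near January 2006.
print(closest_snapshot("http://www.example.com/", "20060101"))

A full web-repository crawler would repeat such lookups for every URL discovered while reconstructing a site and choose among repositories according to its crawling policy; the single-repository lookup here is only the smallest building block of that process.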