Backup of websites is often not considered until after a catastrophic event has occurred to either the website or its webmaster. We introduce "lazy preservation": digital preservation performed as a result of the normal operation of web crawlers and caches. Lazy preservation is especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes. We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler. Because any one repository holds only a partial copy of a website, our reconstructions drew resources from four different web repositories: Google (44%), MSN (30%), Internet Archive (19%), and Yahoo (7%). We also measured the time required for web resources to be discovered and cached (10-103 days) as well as how long they remained in cache after deletion (7-61 days).
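To make the idea concrete, the sketch below shows the core step a web-repository crawler performs for each lost URL: asking a repository whether it holds a cached or archived copy. This is not Warrick itself, only a minimal illustration against one of the four repositories mentioned above (the Internet Archive, via its public Wayback Machine availability API); the search-engine caches would require their own query interfaces. The example URL and function name are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def find_archived_copy(url, timestamp=None):
    """Ask the Wayback Machine for the closest archived snapshot of `url`.

    Returns a dict with the snapshot's `url` and `timestamp`, or None if the
    repository reports no available copy.
    """
    query = {"url": url}
    if timestamp:
        # e.g. "20060601" to prefer captures near the time the site was lost
        query["timestamp"] = timestamp
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode(query)
    with urllib.request.urlopen(api, timeout=30) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot if snapshot and snapshot.get("available") else None


if __name__ == "__main__":
    # Hypothetical lost page; a full reconstruction would repeat this for
    # every URL of the missing site and then crawl links inside the
    # recovered pages to discover further resources.
    copy = find_archived_copy("http://example.com/", timestamp="20060601")
    if copy:
        print("Recoverable from:", copy["url"], "captured", copy["timestamp"])
    else:
        print("No archived copy found in this repository.")
```

In practice a reconstruction tool would issue such queries against several repositories and keep the best-dated copy of each resource, since, as the figures above suggest, no single repository is complete on its own.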