Factors affecting website reconstruction from the web infrastructure

Authors:
Frank McCown;Norou Diawara;Michael L. Nelson
Affiliations:
Old Dominion University, Norfolk, VA;Old Dominion University, Norfolk, VA;Old Dominion University, Norfolk, VA
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Citing 20
Cited 7

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
An analysis of Web page and Web site constancy and permanence

Journal of the American Society for Information Science
How dynamic is the Web?

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Persistence of Web References in Scientific Research

Computer
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
What's new on the web?: the evolution of the web from a search engine perspective

Proceedings of the 13th international conference on World Wide Web
Impact of search engines on page popularity

Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Managing distributed collections: evaluating web page changes, movement, and replacement

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Search engine coverage bias: evidence and possible causes

Information Processing and Management: an International Journal
The indexable web is more than 11.5 billion pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Search Engine Coverage of the OAI-PMH Corpus

IEEE Internet Computing
Random sampling from a search engine's index

Proceedings of the 15th international conference on World Wide Web
Just-in-time recovery of missing web pages

Proceedings of the seventeenth conference on Hypertext and hypermedia
Evaluation of crawling policies for a web-repository crawler

Proceedings of the seventeenth conference on Hypertext and hypermedia
Web crawling ethics revisited: Cost, privacy, and denial of service

Journal of the American Society for Information Science and Technology
Efficient, automatic web resource harvesting

WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Lazy preservation: reconstructing websites by crawling the crawlers

WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Agreeing to disagree: search engines and their public interfaces

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Rate of change and other metrics: a live study of the world wide web

USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems

Detecting age of page content

Proceedings of the 9th annual ACM international workshop on Web information and data management
Recovering a website's server components from the web infrastructure

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Usage analysis of a public website reconstruction tool

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
What can history tell us?: towards different models of interaction with document histories

Proceedings of the nineteenth ACM conference on Hypertext and hypermedia
Unsupervised creation of small world networks for the preservation of digital objects

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Why web sites are lost (and how they're sometimes found)

Communications of the ACM - Scratch Programming for All
Losing my revolution: how many resources shared on social media have been lost?

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries

Abstract

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.