Recovering a website's server components from the web infrastructure

  • Authors:
  • Frank McCown; Michael L. Nelson

  • Affiliations:
  • Harding University, Searcy, AR, USA; Old Dominion University, Norfolk, VA, USA

  • Venue:
  • Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008)
  • Year:
  • 2008

Abstract

Our previous research has shown that the collective behavior of search engine caches (e.g., Google, Yahoo, Live Search) and web archives (e.g., the Internet Archive) results in the uncoordinated but large-scale refreshing and migrating of web resources. Interacting with these caches and archives, which we call the Web Infrastructure (WI), allows entire websites to be reconstructed in an approach we call lazy preservation. Unfortunately, the WI captures only the client-side view of a web resource. While this may be useful for recovering much of the content of a website, it is not helpful for restoring the scripts, web server configuration, databases, and other server-side components responsible for constructing the website's resources. This paper proposes a novel technique for storing and recovering the server-side components of a website from the WI. By using erasure codes to embed the server-side components as HTML comments throughout the website, we can reconstruct all of the server components even when only a portion of the client-side resources has been extracted from the WI. We present the results of a preliminary study that baselines the lazy preservation of ten EPrints repositories and then examines the preservation of an EPrints repository that uses the erasure code technique to store the server-side EPrints software throughout the website. We found that nearly 100% of the EPrints components were recoverable from the WI just two weeks after the repository came online, and that they remained recoverable four months after the repository was "lost".
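
The abstract only sketches the embedding mechanism, so the following is a minimal illustration of the idea, not the authors' implementation. It substitutes a toy single-parity XOR code for the paper's actual erasure codes (a real (n, k) code such as Reed-Solomon tolerates the loss of n - k blocks, not just one): the server-side payload is split into k data blocks plus one parity block, each block is serialized into an HTML comment that would be injected into a different page of the site, and the payload is rebuilt from any k of the k + 1 comments later recovered from the WI. The comment format, function names, and parameters are all illustrative assumptions:

    import base64
    import re
    from functools import reduce

    BLOCK_RE = re.compile(r"<!--SRV-BLOCK (\d+)/(\d+) len=(\d+) data=([A-Za-z0-9+/=]+)-->")

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        """Bytewise XOR of two equal-length blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def embed(payload: bytes, k: int) -> list[str]:
        """Split payload into k data blocks plus one XOR parity block, wrapping
        each block in an HTML comment destined for a different page of the site."""
        size = -(-len(payload) // k)                 # ceil(len(payload) / k)
        padded = payload.ljust(size * k, b"\0")      # zero-pad so all k blocks are equal
        blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
        blocks.append(reduce(xor_bytes, blocks))     # block k = XOR parity of blocks 0..k-1
        return ["<!--SRV-BLOCK %d/%d len=%d data=%s-->"
                % (i, k, len(payload), base64.b64encode(b).decode("ascii"))
                for i, b in enumerate(blocks)]

    def recover(comments: list[str]) -> bytes:
        """Rebuild the payload from any k of the k+1 comments pulled from the WI."""
        found, k, length = {}, 0, 0
        for c in comments:
            m = BLOCK_RE.search(c)
            if m:
                i, k, length = int(m[1]), int(m[2]), int(m[3])
                found[i] = base64.b64decode(m[4])
        missing = [i for i in range(k) if i not in found]
        if len(missing) > 1 or (missing and k not in found):
            raise ValueError("a single-parity code cannot repair this many lost blocks")
        if missing:
            # XOR of all surviving blocks (data and parity) restores the lost one
            found[missing[0]] = reduce(xor_bytes, found.values())
        return b"".join(found[i] for i in range(k))[:length]

    # Demo: spread a mock server-side payload over five pages, lose one page's
    # comment (e.g., never crawled or cached), and still recover the payload.
    payload = b"imagine the tarred and gzipped EPrints server software here" * 50
    comments = embed(payload, k=4)
    surviving = comments[:2] + comments[3:]          # the page holding block 2 was lost
    assert recover(surviving) == payload

Swapping a real (n, k) erasure code in place of the XOR parity keeps the same embed/recover flow but lets the payload survive the loss of up to n - k pages, which is what makes recovery feasible when the WI has cached only a portion of the site.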