Our previous research has shown that the collective behavior of search engine caches (e.g., Google, Yahoo, Live Search) and web archives (e.g., Internet Archive) results in the uncoordinated but large-scale refreshing and migrating of web resources. Interacting with these caches and archives, which we call the Web Infrastructure (WI), allows entire websites to be reconstructed in an approach we call lazy preservation. Unfortunately, the WI only captures the client-side view of a web resource. While this may be useful for recovering much of the content of a website, it is not helpful for restoring the scripts, web server configuration, databases, and other server-side components responsible for constructing the website's resources. This paper proposes a novel technique for storing and recovering the server-side components of a website from the WI. By using erasure codes to embed the server-side components as HTML comments throughout the website, we can reconstruct all of a website's server components even when only a portion of its client-side resources can be extracted from the WI. We present the results of a preliminary study that baselines the lazy preservation of ten EPrints repositories and then examines the preservation of an EPrints repository that uses the erasure code technique to store the server-side EPrints software throughout the website. We found that nearly 100% of the EPrints components were recoverable from the WI just two weeks after the repository came online, and they remained recoverable four months after the repository was "lost".
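To illustrate the idea, here is a minimal sketch of erasure-coded embedding and recovery. It is not the implementation used in the study: it assumes a small Reed-Solomon-style (n, k) code over GF(2^8), a single archive of the server-side files as input, and an invented comment label ("serverside") and helper names (erasure_encode, erasure_decode, make_comment, recover).

```python
# Minimal sketch (assumed details, not the paper's implementation): a small
# Reed-Solomon-style (n, k) erasure code over GF(2^8) plus HTML-comment embedding.
# The comment label "serverside" and all helper names are illustrative assumptions.
import base64, re

# GF(2^8) log/antilog tables (primitive polynomial 0x11d)
EXP, LOG = [0] * 512, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

def poly_eval(coeffs, x):                      # Horner's rule, lowest-order coefficient first
    y = 0
    for c in reversed(coeffs):
        y = gf_mul(y, x) ^ c
    return y

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def erasure_encode(data, k, n):
    """Split data into n equal-length blocks (n <= 255); any k of them reconstruct it."""
    if len(data) % k:
        data += b"\x00" * (k - len(data) % k)  # pad; the true length travels in each comment
    blocks = [bytearray() for _ in range(n)]
    for off in range(0, len(data), k):
        chunk = data[off:off + k]              # k data bytes = polynomial coefficients
        for i in range(n):
            blocks[i].append(poly_eval(chunk, i + 1))  # block i holds evaluations at x = i+1
    return [bytes(b) for b in blocks]

def erasure_decode(blocks, k, length):
    """blocks: dict {block_index: block_bytes} holding at least k entries."""
    items = sorted(blocks.items())[:k]
    xs = [i + 1 for i, _ in items]
    out = bytearray()
    for pos in range(len(items[0][1])):
        ys = [blk[pos] for _, blk in items]
        coeffs = [0] * k                       # Lagrange interpolation recovers the k data bytes
        for j in range(k):
            basis, denom = [1], 1
            for m in range(k):
                if m != j:
                    basis = poly_mul(basis, [xs[m], 1])
                    denom = gf_mul(denom, xs[j] ^ xs[m])
            scale = gf_div(ys[j], denom)
            for i in range(k):
                coeffs[i] ^= gf_mul(basis[i], scale)
        out.extend(coeffs)
    return bytes(out[:length])

# Embed one block per page as an HTML comment; recover by scanning whatever pages come back.
def make_comment(idx, block, k, length):
    payload = base64.b64encode(block).decode()
    return "<!-- serverside block=%d k=%d len=%d data=%s -->" % (idx, k, length, payload)

COMMENT_RE = re.compile(r"<!-- serverside block=(\d+) k=(\d+) len=(\d+) data=(\S+) -->")

def recover(recovered_pages):
    """Decode the server-side archive once any k distinct blocks appear in recovered pages."""
    found, k, length = {}, None, None
    for html in recovered_pages:
        for m in COMMENT_RE.finditer(html):
            idx, k, length = int(m[1]), int(m[2]), int(m[3])
            found[idx] = base64.b64decode(m[4])
    if k is not None and len(found) >= k:
        return erasure_decode(found, k, length)
    return None                                # not enough of the site recovered yet
```

In this sketch, a site would archive its server-side files, call erasure_encode with n set to roughly the number of pages it serves, and append make_comment(i, blocks[i], k, len(archive)) to page i; after a loss, recover() runs over whatever subset of pages a reconstruction tool pulls back from the WI and succeeds as soon as any k distinct blocks are found. Choosing k well below n is what tolerates the WI capturing only a portion of the site's client-side resources.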