When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for the missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed with our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI): the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including the birth rate, decay, and age of their resources. We evaluate the reconstructions against the corresponding crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, the number of hops to a resource, and the resource's age were the three most significant factors in determining whether a resource would be recovered from the WI.
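To make the probing step concrete, the sketch below shows one way a reconstruction tool might query a single WI member for a lost URL. It uses the Internet Archive's Wayback Machine availability API as an illustrative example; this is an assumption for illustration only, not Warrick's actual implementation, which also queried search engine caches. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Illustrative endpoint: the Wayback Machine availability API.
# (Warrick itself queried multiple WI members, including search engine caches.)
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build an availability-API request URL for a missing resource.

    `timestamp` (YYYYMMDD) optionally asks for the snapshot closest
    to a given date, e.g. the date the site was lost.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)

def closest_snapshot(api_response):
    """Extract the closest archived copy's URL from a parsed JSON
    response, or return None if this WI member holds no copy."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Parsing a sample (hand-written) API response:
sample = json.loads("""
{"archived_snapshots": {"closest": {"available": true,
 "url": "http://web.archive.org/web/2006/http://example.com/"}}}
""")
recovered = closest_snapshot(sample)
```

A full reconstruction would repeat this probe for every missing resource across each WI member and keep the best-available copy; resources held by no member are unrecoverable, which is why the study reports an average recovery rate of 61% rather than 100%.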