We have developed a web-repository crawler for reconstructing websites when backups are unavailable. The crawler retrieves web resources from the Internet Archive, Google, Yahoo, and MSN. We examine the challenges of crawling web repositories and discuss strategies for overcoming some of these obstacles. We propose three crawling policies that can be used to reconstruct websites, and we evaluate their effectiveness by reconstructing 24 websites and comparing the results with the live versions of those websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
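As an illustrative aside only (this is not the crawler described above, which also queried the Google, Yahoo, and MSN caches), the sketch below shows one way to look up an archived copy of a lost URL in a single web repository, the Internet Archive, using the present-day Wayback Machine availability API. The helper name closest_snapshot, the example URL, and the chosen timestamp are assumptions for illustration.

import json
import urllib.parse
import urllib.request

# Hypothetical helper (illustration only): ask the Internet Archive's
# Wayback Machine availability API for the snapshot of `url` closest
# to `timestamp` (YYYYMMDD), and return that snapshot's URL or None.
def closest_snapshot(url, timestamp=None):
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# Example: look for a copy of a missing page near January 2006.
print(closest_snapshot("http://www.example.com/", "20060101"))

A full web-repository crawler would repeat such lookups for every URL discovered while reconstructing a site and choose among repositories according to its crawling policy; the single-repository lookup here is only the smallest building block of that process.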