Due to the growing importance of the Web, several archiving institutions (national libraries, the Internet Archive, etc.) harvest sites to preserve (a part of) the Web for future generations. A major issue archivists face is preserving the quality of web archives. One way to assess an archive's quality is to quantify its completeness and the coherence of its page versions. Because of the large number of pages to capture and the limited resources available (storage space, bandwidth, etc.), a complete archive, containing all versions of all pages, is impossible. Nor can the coherence of all captured versions be guaranteed, since pages change frequently while a site is being crawled. Nonetheless, the quality of an archive can be maximized by adjusting the web crawler's strategy. Our idea is (i) to improve the completeness of the archive by downloading the most important versions and (ii) to keep the most important versions as coherent as possible. To this end, we introduce a pattern model that describes how the importance of page changes evolves over time. Based on these patterns, we propose a crawl strategy that improves both the completeness and the coherence of web archives. Experiments based on real patterns show the usefulness and the effectiveness of our approach.
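
The abstract does not spell out the crawl strategy, but the pattern-driven idea can be illustrated with a small sketch. The following Python snippet is a hypothetical illustration, not the authors' algorithm: the names Pattern, schedule_crawls, and budget_per_slot are invented, and the greedy per-slot ranking is only one plausible way to "download the most important versions" under a resource budget.

from dataclasses import dataclass
from typing import Dict, List
import heapq

@dataclass
class Pattern:
    # Hypothetical pattern: importance of a page's changes per time slot
    # (e.g., per hour of day), as the abstract's "behavior of the importance
    # of page changes over time".
    url: str
    importance: List[float]

def schedule_crawls(patterns: List[Pattern], slots: int,
                    budget_per_slot: int) -> Dict[int, List[str]]:
    # For each time slot, spend the limited crawl budget on the pages whose
    # changes are expected to matter most in that slot (greedy completeness).
    schedule: Dict[int, List[str]] = {}
    for t in range(slots):
        best = heapq.nlargest(budget_per_slot, patterns,
                              key=lambda p: p.importance[t])
        schedule[t] = [p.url for p in best if p.importance[t] > 0]
    return schedule

if __name__ == "__main__":
    pages = [
        Pattern("http://example.org/news",  [0.9, 0.8, 0.2, 0.1]),
        Pattern("http://example.org/blog",  [0.1, 0.3, 0.7, 0.6]),
        Pattern("http://example.org/about", [0.0, 0.0, 0.1, 0.0]),
    ]
    for slot, urls in sorted(schedule_crawls(pages, 4, 1).items()):
        print(f"slot {slot}: crawl {urls}")

Note that this sketch optimizes completeness only; a scheduler in the spirit of the paper would additionally try to capture related versions within a change-free window so that the most important versions stay mutually coherent.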