Archiving the web using page changes patterns: a case study
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Due to the growing importance of the World Wide Web, archiving it has become crucial for preserving a valuable source of information. To keep a web archive up to date, crawlers harvest the web by iteratively downloading new versions of documents. However, crawlers frequently retrieve pages with only unimportant changes, such as advertisements that are continually updated. Web archive systems therefore waste time and space indexing and storing useless page versions, and querying the archive takes longer because of the large number of useless versions stored. An effective method is thus required to determine accurately when and how often important changes occur between versions, so that web pages can be archived efficiently. Our work addresses this requirement through a new web archiving approach that detects important changes between page versions. The approach consists of archiving the visual layout structure of a web page, represented as semantic blocks. This work describes the proposed approach and examines various related issues, such as using the importance of changes between versions to optimize web crawl scheduling. We conclude by introducing the major research questions that we would like to address in the future.
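The idea of weighting changes by the importance of the semantic block they occur in can be sketched as follows. This is a minimal illustration under assumed names: the block identifiers (`main`, `ads`), the importance weights, and the exact-match comparison are all hypothetical simplifications, not the paper's actual segmentation or change-detection method.

```python
def change_importance(old_blocks, new_blocks, weights):
    """Score the change between two versions of a page.

    old_blocks / new_blocks map a block id to its text content;
    weights maps a block id to an assumed importance in [0, 1].
    Returns the importance-weighted fraction of blocks that changed.
    """
    score = 0.0
    for block, weight in weights.items():
        old = old_blocks.get(block, "")
        new = new_blocks.get(block, "")
        if old != new:
            score += weight  # this block changed; count its importance
    total = sum(weights.values())
    return score / total if total else 0.0

# Hypothetical example: only the (unimportant) advertising block changed,
# so the version is scored low and a crawler could deprioritize it.
old = {"main": "Breaking story ...", "ads": "Buy now!"}
new = {"main": "Breaking story ...", "ads": "Sale today!"}
w = {"main": 0.9, "ads": 0.1}
print(change_importance(old, new, w))  # → 0.1
```

A crawl scheduler could then revisit pages in decreasing order of such scores, spending its download budget on pages whose important blocks change often rather than on pages with only cosmetic updates.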