Archiving the web using page changes patterns: a case study
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Due to the growing importance of the World Wide Web, archiving it has become crucial for preserving a valuable source of information. To keep a web archive up to date, crawlers harvest the web by iteratively downloading new versions of documents. However, crawlers frequently retrieve pages with only unimportant changes, such as advertisements that are continually updated. Web archive systems therefore waste time and space indexing and storing useless page versions, and querying the archive takes longer because of the large number of useless versions stored. An effective method is thus required to determine accurately when and how often important changes occur between versions, so that web pages can be archived efficiently. Our work addresses this requirement through a new web archiving approach that detects important changes between page versions. The approach consists of archiving the visual layout structure of a web page, represented as semantic blocks. This work describes the proposed approach and examines various related issues, such as using the importance of changes between versions to optimize web crawl scheduling. We conclude by introducing the major research questions that we would like to address in the future.
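The idea of weighting changes by the importance of the semantic block they occur in can be sketched as follows. This is a minimal illustration under assumed names: the block identifiers (`main`, `ads`), the importance weights, and the exact-match comparison are all hypothetical simplifications, not the paper's actual segmentation or change-detection method.

```python
def change_importance(old_blocks, new_blocks, weights):
    """Score the change between two versions of a page.

    old_blocks / new_blocks map a block id to its text content;
    weights maps a block id to an assumed importance in [0, 1].
    Returns the importance-weighted fraction of blocks that changed.
    """
    score = 0.0
    for block, weight in weights.items():
        old = old_blocks.get(block, "")
        new = new_blocks.get(block, "")
        if old != new:
            score += weight  # this block changed; count its importance
    total = sum(weights.values())
    return score / total if total else 0.0

# Hypothetical example: only the (unimportant) advertising block changed,
# so the version is scored low and a crawler could deprioritize it.
old = {"main": "Breaking story ...", "ads": "Buy now!"}
new = {"main": "Breaking story ...", "ads": "Sale today!"}
w = {"main": 0.9, "ads": 0.1}
print(change_importance(old, new, w))  # → 0.1
```

A crawl scheduler could then revisit pages in decreasing order of such scores, spending its download budget on pages whose important blocks change often rather than on pages with only cosmetic updates.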