Structural and visual comparisons for web page archiving

  • Authors:
  • Marc Teva Law;Nicolas Thome;Stéphane Gançarski;Matthieu Cord

  • Affiliations:
  • LIP6, UPMC - Sorbonne University, Paris, France;LIP6, UPMC - Sorbonne University, Paris, France;LIP6, UPMC - Sorbonne University, Paris, France;LIP6, UPMC - Sorbonne University, Paris, France

  • Venue:
  • Proceedings of the 2012 ACM symposium on Document engineering
  • Year:
  • 2012

Abstract

In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To decide whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores computed between successive versions of pages. The trained model then classifies a pair of versions, represented by its vector of similarity scores, as similar or dissimilar. Experiments on real archives validate our approach.
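
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three features, the toy scores, and the function names are all hypothetical, and a small linear SVM trained by subgradient descent on the hinge loss stands in for whatever SVM formulation and kernel the paper actually uses. Each training example is one pair of successive page versions, described by a vector of similarity scores and labeled similar (+1) or dissimilar (-1).

```python
# Toy data (invented for illustration). Each row describes one pair of
# successive versions by three hypothetical similarity scores in [0, 1]:
# [link-set overlap, DOM-structure similarity, screenshot similarity].
X = [
    [0.95, 0.90, 0.88], [0.85, 0.92, 0.80],
    [0.90, 0.85, 0.93], [0.80, 0.88, 0.86],
    [0.20, 0.35, 0.10], [0.30, 0.25, 0.15],
    [0.15, 0.40, 0.22], [0.35, 0.30, 0.28],
]
# +1 = versions judged similar, -1 = versions judged dissimilar.
Y = [1, 1, 1, 1, -1, -1, -1, -1]


def decision(w, b, x):
    """Signed score of the linear model for one similarity vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, Y, lam=0.01, lr=0.05, epochs=500):
    """Linear SVM fitted by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Check the margin before touching the weights.
            violated = y * decision(w, b, x) < 1.0
            # The L2 penalty shrinks the weights at every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss subgradient step toward the correct side.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (similar) or -1 (dissimilar) for a pair of versions."""
    return 1 if decision(w, b, x) >= 0.0 else -1


w, b = train_linear_svm(X, Y)
print(predict(w, b, [0.95, 0.95, 0.95]))  # a pair with uniformly high scores
print(predict(w, b, [0.10, 0.10, 0.10]))  # a pair with uniformly low scores
```

In practice one would more likely use an established SVM library and real similarity measures; the point of the sketch is only that the classifier never sees the pages themselves, only the vector of scores comparing two versions.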