A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations

  • Authors:
  • Hassan Artail;Kassem Fawaz

  • Affiliations:
  • Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon;Department of Electrical and Computer Engineering, American University of Beirut, P.O. Box 11-0236, Riad El-Solh, Beirut 1107 2020, Lebanon

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper describes a fast HTML web page detection approach that saves computation time by limiting the similarity computations between two versions of a web page to nodes having the same HTML tag type, and by hashing the web page in order to provide direct access to node information. This efficient approach is suitable as a client application and for implementing server applications that could serve the needs of users in monitoring modifications to HTML web pages made over time, and that allow for reporting and visualizing changes and trends in order to gain insight about the significance and types of such changes. The detection of changes across two versions of a page is accomplished by performing similarity computations after transforming the web page into an XML-like structure in which a node corresponds to an open-close HTML tag. Performance and detection reliability results were obtained, and showed speed improvements when compared to the results of a previous approach.