Accurate and Efficient HTML Differencing

  • Authors: Rimon Mikhaiel; Eleni Stroulia
  • Affiliations: University of Alberta, Canada; University of Alberta, Canada
  • Venue: STEP '05 Proceedings of the 13th IEEE International Workshop on Software Technology and Engineering Practice
  • Year: 2005

Abstract

Recognizing the differences between subsequent versions of HTML documents is an important problem. It is useful for managers of multi-authored web sites who need to review and approve changes to their web-site content. It is also necessary for users who want to easily recognize changes to the pages they visit regularly. Comparing HTML documents at the lexical level, as if they were regular text documents, is neither informative nor intuitive. Instead, their internal tree structure has to be taken into account. In this paper, we discuss VDiff, an algorithm we have developed for HTML differencing, based on the Zhang-Shasha tree-edit distance algorithm. Our algorithm reports which nodes in the two compared documents match, which have been deleted from the original document, which have been inserted in the subsequent document, and which have been moved in the HTML structure. We have evaluated the accuracy and performance of our algorithm with a case study.
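
To make the idea of tree-level comparison concrete, the following is a minimal Python sketch of an ordered tree-edit distance over two small HTML fragments. It is not VDiff and not the Zhang-Shasha algorithm itself: it uses a plain memoized forest recursion that yields the same unit-cost distance but without the keyroot optimization, so it is only practical for small inputs. The names `parse`, `forest_distance`, and `tree_distance`, the `#root` and `#text:` labels, and the unit costs are all assumptions made for illustration. The parser also assumes well-formed markup in which every opened tag is explicitly closed, and it ignores attributes.

```python
from functools import lru_cache
from html.parser import HTMLParser


class _TreeBuilder(HTMLParser):
    """Builds a (label, children) tuple tree; assumes every tag is closed."""

    def __init__(self):
        super().__init__()
        self.stack = [("#root", [])]  # hypothetical synthetic root node

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, []))  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        label, kids = self.stack.pop()
        self.stack[-1][1].append((label, tuple(kids)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1][1].append(("#text:" + text, ()))


def parse(html):
    """Parse well-formed HTML into nested, hashable tuples."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    label, kids = builder.stack[0]
    return (label, tuple(kids))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, paying 1 if their labels differ.
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (f_label != g_label))
    return min(delete, insert, match)


def tree_distance(a, b):
    return forest_distance((a,), (b,))


old = parse("<ul><li>a</li><li>b</li></ul>")
new = parse("<ul><li>a</li><li>c</li><li>d</li></ul>")
print(tree_distance(old, new))  # 3: relabel "b" to "c", insert <li> and "d"
```

A line-based diff of the same two fragments would flag the whole line as changed, whereas the tree distance isolates one relabeled text node and one inserted list item. Note that this sketch returns only a number; reporting which nodes matched, or detecting moved nodes as the abstract describes, would require recovering the edit mapping, which is not shown here.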