Approximate Structure-Preserving Semantic Matching
OTM '08 Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. Part II on On the Move to Meaningful Internet Systems
Recognizing the differences between successive versions of HTML documents is an important problem. It is useful for managers of multi-authored web sites who need to review and approve changes to their web-site content. It is also necessary for users who want to easily recognize changes to the pages they visit regularly. Comparing HTML documents at the lexical level, as if they were regular text documents, is neither informative nor intuitive. Instead, their internal tree structure has to be taken into account. In this paper, we discuss VDiff, an algorithm we have developed for HTML differencing, based on the Zhang-Shasha tree-edit distance algorithm. Our algorithm reports which nodes in the two compared documents match, which have been deleted from the original document or inserted into the subsequent one, and which have been moved within the HTML structure. We have evaluated the accuracy and performance of our algorithm with a case study.
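To make the notion of tree-edit distance concrete, the following is a minimal sketch of the ordered tree-edit distance between two small HTML-like trees. It is not VDiff and not the Zhang-Shasha dynamic program itself: it uses a plain memoized recursion over forests, which yields the same distance on small inputs but is far less efficient. The unit costs, the tuple-based tree representation, and the names `node`, `forest_dist`, and `tree_edit_distance` are assumptions made for illustration. The sketch returns only the distance; it does not report the node mapping or detect moves.

```python
from functools import lru_cache

# Hypothetical representation: a tree is (label, children), where children
# is a tuple of trees. A forest is a tuple of trees.
def node(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Edit distance between ordered forests f and g with unit costs."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children become roots.
        _, kids = g[-1]
        return forest_dist(f, g[:-1] + kids) + 1
    if not g:
        # Delete the rightmost root of f; its children become roots.
        _, kids = f[-1]
        return forest_dist(f[:-1] + kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    return min(
        # Delete the rightmost root of f.
        forest_dist(f[:-1] + f_kids, g) + 1,
        # Insert the rightmost root of g.
        forest_dist(f, g[:-1] + g_kids) + 1,
        # Match the two rightmost roots (cost 1 if the labels differ),
        # then solve their subtrees and the remaining forests separately.
        forest_dist(f_kids, g_kids)
        + forest_dist(f[:-1], g[:-1])
        + (f_label != g_label),
    )

def tree_edit_distance(a, b):
    return forest_dist((a,), (b,))

old = node("body", node("p"), node("ul", node("li"), node("li")))
new = node("body", node("h1"), node("ul", node("li")))
print(tree_edit_distance(old, new))  # one relabel (p to h1) plus one deleted li
```

A differencing tool in the spirit described above would additionally keep track of which choice produced the minimum at each step, so that matched, deleted, and inserted nodes can be reported rather than just counted.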