Structural matching and discovery in document databases

  • Authors:
  • Jason Tsong-Li Wang; Dennis Shasha; George J. S. Chang; Liam Relihan; Kaizhong Zhang; Girish Patel

  • Affiliations:
  • Computer and Information Science, New Jersey Institute of Technology; Courant Institute, New York University; Computer and Information Science, New Jersey Institute of Technology; Piercom Ltd., Inter. Business Center, National Tech. Park, Limerick, Ireland; Computer Science Department, Univ. of Western Ontario, Canada; Computer and Information Science, New Jersey Institute of Technology

  • Venue:
  • SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
  • Year:
  • 1997

Abstract

Structural matching and discovery in documents such as SGML and HTML are important for data warehousing [6], version management [7, 11], hypertext authoring, digital libraries [4] and Internet databases. As an example, a user of the World Wide Web may be interested in knowing what has changed in an HTML document [2, 5, 10]. Such changes can be detected by comparing the old and new versions of the document (referred to as structural matching of documents). As another example, in hypertext authoring, a user may wish to find the common portions in the history list of a document or in a database of documents (referred to as structural discovery of documents). In the SIGMOD '95 demo sessions, we exhibited a software package, called TreeDiff [13], for comparing two LaTeX documents and showing their differences. Given two documents, the tool represents them as ordered labeled trees and finds an optimal sequence of edit operations that transforms one document (tree) into the other. An edit operation can be an insertion, deletion, or change of a node in the trees. The tool is so named because documents are represented and compared using approximate tree matching techniques [9, 12, 14].
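
The idea of comparing documents as ordered labeled trees can be illustrated with a minimal Python sketch. It is not the TreeDiff implementation: it assumes unit cost for every insert, delete, and change, uses a plain memoized recursion over forests rather than the more efficient algorithms in the cited tree matching literature, and returns only the cost of an optimal edit sequence, not the sequence itself. The names `node`, `size`, `forest_distance`, and `tree_distance` are hypothetical and chosen for illustration.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees; sibling order is significant.


def node(label, *children):
    """Build an ordered labeled tree node."""
    return (label, tuple(children))


def size(forest):
    """Count all nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning forest f1 into forest f2."""
    if not f1:
        return size(f2)  # insert every remaining node of f2
    if not f2:
        return size(f1)  # delete every remaining node of f1

    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]

    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, changing the label if it differs.
    change = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + int(label1 != label2)
    )
    return min(delete, insert, change)


def tree_distance(t1, t2):
    """Edit distance between two ordered labeled trees."""
    return forest_distance((t1,), (t2,))


# Two versions of a small, made-up document outline.
old = node("document", node("section", node("para")), node("section"))
new = node("document", node("section", node("para"), node("para")), node("chapter"))

print(tree_distance(old, new))  # 2: one inserted "para", one "section" changed to "chapter"
```

Trees are encoded as nested tuples so that forests are hashable and the recursion can be cached with `functools.lru_cache`. This keeps the sketch short, but it is practical only for small trees.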