Evaluate structure similarity in XML documents with merge-edit-distance

  • Authors:
  • Chong Zhou;Yansheng Lu;Lei Zou;Rong Hu

  • Affiliations:
  • College of Computer Science and Technology, HuaZhong University of Science and Technology, Wuhan, P.R. China;College of Computer Science and Technology, HuaZhong University of Science and Technology, Wuhan, P.R. China;College of Computer Science and Technology, HuaZhong University of Science and Technology, Wuhan, P.R. China;College of Computer Science and Technology, HuaZhong University of Science and Technology, Wuhan, P.R. China

  • Venue:
  • PAKDD'07 Proceedings of the 2007 international conference on Emerging technologies in knowledge discovery and data mining
  • Year:
  • 2007

Abstract

XML is widely used as a standard for data representation and exchange among Web applications. In recent years, much effort has been devoted to querying, integrating and clustering XML documents. Measuring the similarity among XML documents is the foundation of such applications. In this paper, we propose a new similarity measure for XML documents based on Merge-Edit-Distance (MED). MED preserves the distribution information of the common tree within the XML document trees. We argue that this distribution information is useful for determining the similarity of XML documents. We also propose a novel algorithm to calculate MED: given two XML document trees A and B, it compresses the two trees into one merge tree C and then transforms C into the common tree of A and B using defined operations such as "Delete", "Reduce", and "Combine". The cost of the operation sequence is defined as the MED. Experiments on real datasets provide evidence that the proposed similarity measure is effective.
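The merge-then-transform idea can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define "Reduce" or "Combine", so only a unit-cost "Delete" is modelled here. The names `MergeNode`, `merge`, and `delete_cost` are hypothetical, and collapsing same-tag siblings into one node is a simplifying assumption made for illustration.

```python
import xml.etree.ElementTree as ET


class MergeNode:
    """Node of a toy merge tree; `sources` records which input trees contain it."""

    def __init__(self, tag):
        self.tag = tag
        self.sources = set()   # subset of {"A", "B"}
        self.children = {}     # tag -> MergeNode (same-tag siblings collapsed: assumption)


def add_tree(node, elem, source):
    """Overlay an XML element subtree onto the merge tree, tagging nodes with `source`."""
    node.sources.add(source)
    for child in elem:
        child_node = node.children.setdefault(child.tag, MergeNode(child.tag))
        add_tree(child_node, child, source)


def merge(xml_a, xml_b):
    """Build one merge tree C from two XML documents that share a root tag."""
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    if root_a.tag != root_b.tag:
        raise ValueError("this sketch assumes both documents share a root tag")
    root = MergeNode(root_a.tag)
    add_tree(root, root_a, "A")
    add_tree(root, root_b, "B")
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


def delete_cost(node):
    """Unit-cost deletions needed to prune C down to nodes present in both A and B."""
    if node.sources != {"A", "B"}:
        return count_nodes(node)  # the whole subtree belongs to only one document
    return sum(delete_cost(c) for c in node.children.values())


doc_a = "<book><title/><author/><year/></book>"
doc_b = "<book><title/><author/><price><amount/></price></book>"
merged = merge(doc_a, doc_b)
print(delete_cost(merged))  # 3: year, price, amount
```

In this toy version the common tree is simply the set of merge-tree nodes tagged with both sources, and the cost is the number of nodes removed to reach it. A faithful MED implementation would additionally need the paper's definitions and costs for "Reduce" and "Combine".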