Evaluate structure similarity in XML documents with merge-edit-distance

Authors:
Chong Zhou;Yansheng Lu;Lei Zou;Rong Hu
Affiliations:
College of Computer Science and Technology, HuaZhong University of Science and Technology, Wuhan, P.R. China (all authors)
Venue:
PAKDD'07 Proceedings of the 2007 international conference on Emerging technologies in knowledge discovery and data mining
Year:
2007

Abstract

XML is widely used as a standard for data representation and exchange among Web applications. In recent years, much effort has gone into querying, integrating, and clustering XML documents. Measuring the similarity among XML documents is the foundation of such applications. In this paper, we propose a new similarity measure for XML documents based on Merge-Edit-Distance (MED). MED preserves the distribution information of the common tree in XML document trees. We argue that this distribution information is useful for determining the similarity of XML documents. A novel algorithm is also proposed to calculate MED. Given two XML document trees A and B, it compresses the two trees into one merge tree C and then transforms C into the common tree of A and B using defined operations such as "Delete", "Reduce", and "Combine". The cost of the operation sequence is defined as the MED. Experiments on real datasets provide evidence that the proposed similarity measure is effective.
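
The merge-then-transform idea in the abstract can be illustrated with a rough sketch. The abstract does not define the "Delete", "Reduce", and "Combine" operations or their costs, so the code below is not the authors' MED algorithm. It is a simplified stand-in that assumes sibling elements with the same tag collapse into one merge-tree node, models only a unit-cost delete, and uses hypothetical names (`build_merge_tree`, `delete_only_cost`) chosen for illustration.

```python
import xml.etree.ElementTree as ET


def merge_into(node, element, source):
    """Fold an element's subtree into the merge tree, tagging nodes with their source."""
    node["sources"].add(source)
    for child in element:
        # Assumption: same-tag siblings collapse into a single merge-tree node.
        slot = node["children"].setdefault(
            child.tag, {"sources": set(), "children": {}}
        )
        merge_into(slot, child, source)


def build_merge_tree(xml_a, xml_b):
    """Merge two XML documents into one tree under a virtual root (hypothetical helper)."""
    merged = {"sources": set(), "children": {}}
    for source, text in (("A", xml_a), ("B", xml_b)):
        root = ET.fromstring(text)
        slot = merged["children"].setdefault(
            root.tag, {"sources": set(), "children": {}}
        )
        merge_into(slot, root, source)
    return merged


def count_nodes(node):
    """Return (total, shared) node counts below the given node."""
    total = shared = 0
    for child in node["children"].values():
        total += 1
        if child["sources"] == {"A", "B"}:
            shared += 1
        sub_total, sub_shared = count_nodes(child)
        total += sub_total
        shared += sub_shared
    return total, shared


def delete_only_cost(xml_a, xml_b):
    """Unit-cost deletions needed to shrink the merge tree to the nodes shared by A and B."""
    total, shared = count_nodes(build_merge_tree(xml_a, xml_b))
    return total - shared


doc_a = "<book><title/><author/><year/></book>"
doc_b = "<book><title/><author/><publisher/></book>"
print(delete_only_cost(doc_a, doc_b))  # 2
```

In this toy version, the nodes tagged with both sources form the common tree, and the cost counts the nodes that appear in only one document. Because it deletes nodes without recording where they sat relative to the common tree, it does not capture the distribution information that the abstract presents as MED's distinguishing feature.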