Simple fast algorithms for the editing distance between trees and related problems
SIAM Journal on Computing
Preparations for Semantics-Based XML Mining
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Declarative Data Cleaning: Language, Model, and Algorithms
Proceedings of the 27th International Conference on Very Large Data Bases
A methodology for clustering XML documents by structure
Information Systems
A new sequential mining approach to XML document similarity computation
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
Querying XML documents from a relational database in the presence of DTDs
ICDCIT'04 Proceedings of the First international conference on Distributed Computing and Internet Technology
Mining frequent association tag sequences for clustering XML documents
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Hi-index | 0.00 |
XML language is widely used as a standard for data representation and exchange among Web applications. In recent years, many efforts have been spent in querying, integrating and clustering XML documents. Measuring the similarity among XML documents is the foundation of such applications. In this paper, we propose a new similarity measure method among the XML documents, which is based on Merge-Edit-Distance (MED). MED upholds the distribution information of the common tree in XML document trees. We urge the distribution information is useful for determining the similarity of XML documents. A novel algorithm is also proposed to calculate MED as follows. Given two XML document trees A and B, it compresses the two trees into one merge tree C and then transforms the tree C to the common tree of A and B with the defined operations such as "Delete", "Reduce", "Combine". The cost of the operation sequence is defined as MED. The experiments on real datasets give the evidence that the proposed similarity measure is effective.