In this paper, we describe a system that incorporates an improved technique for detecting the similarity of two XML documents, based on content and structure, using keys. The technique consists of three major components: a subtree generator and validator, a key generator, and similarity components that compare the content and structure of the XML documents. First, an XML document is stored in a relational database and extracted into small subtrees using leaf-node parents. Each leaf-node parent is treated as the root of a subtree, which is then recursively traversed bottom-up for matching. Second, one or more candidate keys are identified so that XML subtrees from the two documents can be matched efficiently. Key matching dramatically reduces the number of comparisons. In addition, the number of subtrees to be processed is reduced in the subtree validation phase using instance statistics and a taxonomic analyzer. The subtrees are matched by the key(s) first, and the remaining subtrees are matched by their degree of similarity in content and structure. To improve the similarity comparison results, XML element names are transformed according to their semantic similarity. The results show that the clustering points are selected appropriately and that the overall execution time is reduced dramatically.
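The key-first matching order described above can be illustrated with a minimal sketch. It is not the system's implementation and rests on several assumptions: the key tag is supplied by hand rather than discovered by a key generator, documents are parsed in memory instead of being stored in a relational database, subtree validation and semantic renaming of element names are omitted, and a plain Jaccard overlap of (tag, text) pairs stands in for the content and structure similarity measure. All function names, the threshold, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_parent_subtrees(xml_text):
    """Return one {leaf tag: leaf text} dict per element that has leaf children.

    Simplification: repeated leaf tags under one parent overwrite each other.
    """
    root = ET.fromstring(xml_text)
    subtrees = []
    for elem in root.iter():
        leaves = {c.tag: (c.text or "").strip() for c in elem if len(c) == 0}
        if leaves:
            subtrees.append(leaves)
    return subtrees


def jaccard(a, b):
    """Stand-in similarity (an assumption): overlap of (tag, text) pairs."""
    pa, pb = set(a.items()), set(b.items())
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0


def match_subtrees(doc_a, doc_b, key_tag, threshold=0.5):
    """Match subtrees by key value first, then pair leftovers by similarity."""
    subs_a = leaf_parent_subtrees(doc_a)
    subs_b = leaf_parent_subtrees(doc_b)
    # Index the second document by key value so key lookups are O(1).
    index_b = {s[key_tag]: i for i, s in enumerate(subs_b) if key_tag in s}
    used_b, matches, leftovers = set(), [], []
    for sa in subs_a:
        j = index_b.get(sa.get(key_tag))
        if j is not None and j not in used_b:
            used_b.add(j)
            matches.append((sa, subs_b[j], "key"))
        else:
            leftovers.append(sa)
    # Only subtrees without a key match need pairwise comparison.
    for sa in leftovers:
        best = max(
            (j for j in range(len(subs_b)) if j not in used_b),
            key=lambda j: jaccard(sa, subs_b[j]),
            default=None,
        )
        if best is not None and jaccard(sa, subs_b[best]) >= threshold:
            used_b.add(best)
            matches.append((sa, subs_b[best], "similarity"))
    return matches


# Hypothetical sample documents; "isbn" plays the role of the key.
DOC_A = (
    "<lib><book><isbn>1</isbn><title>XML</title></book>"
    "<book><title>Trees</title><year>2009</year></book></lib>"
)
DOC_B = (
    "<lib><book><isbn>1</isbn><title>XML Basics</title></book>"
    "<book><title>Trees</title><year>2009</year><pages>10</pages></book></lib>"
)

if __name__ == "__main__":
    for left, right, how in match_subtrees(DOC_A, DOC_B, "isbn"):
        print(how, left, right)
```

In this sketch the first pair of book subtrees is matched through the shared key value even though the titles differ, and only the subtree without a key falls through to the pairwise similarity comparison, which is where the reduction in comparisons would come from.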