A system for detecting XML similarity in content and structure using relational database

Authors:
Waraporn Viyanon;Sanjay Kumar Madria
Affiliations:
Missouri University of Science and Technology, Rolla, MO, USA;Missouri University of Science and Technology, Rolla, MO, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing 9

Simple fast algorithms for the editing distance between trees and related problems
SIAM Journal on Computing

XRel: a path-based approach to storage and retrieval of XML documents using relational databases
ACM Transactions on Internet Technology (TOIT)

An Information-Theoretic Definition of Similarity
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning

An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
IEEE Transactions on Knowledge and Data Engineering

Verbs semantics and lexical selection
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics

Approximate matching of hierarchical data using pq-grams
VLDB '05 Proceedings of the 31st international conference on Very large data bases

A Path-sequence Based Discrimination for Subtree Matching in Approximate XML Joins
ICDEW '06 Proceedings of the 22nd International Conference on Data Engineering Workshops

Using information content to evaluate semantic similarity in a taxonomy
IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1

LAX: an efficient approximate XML join based on clustered leaf nodes for XML data integration
BNCOD'05 Proceedings of the 22nd British National conference on Databases: enterprise, Skills and Innovation

Cited 4

XML-SIM: Structure and Content Semantic Similarity Detection Using Keys
OTM '09 Proceedings of the Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009 on On the Move to Meaningful Internet Systems: Part II

XML-SIM-CHANGE: structure and content semantic similarity detection among XML document versions
OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II

Duplicate detection through structure optimization
Proceedings of the 20th ACM international conference on Information and knowledge management

Temporal and multi-versioned XML documents: A survey
Information Processing and Management: an International Journal

Abstract

In this paper, we describe a system that incorporates an improved technique for detecting the similarity of two XML documents in both content and structure using keys. The technique has three major components: a subtree generator and validator, a key generator, and similarity components that compare the content and structure of the XML documents. First, each XML document is stored in a relational database and split into small subtrees using leaf-node parents. Each leaf-node parent is treated as the root of a subtree, which is then recursively traversed bottom-up for matching. Second, one or more possible keys are identified so that XML subtrees from the two documents can be matched efficiently; key matching dramatically reduces the number of comparisons. In addition, the subtree validation phase reduces the number of subtrees to be processed using instance statistics and a taxonomic analyzer. Subtrees are matched by key first, and the remaining subtrees are matched by computing their degrees of similarity in content and structure. To improve the similarity comparison results, XML element names are transformed according to their semantic similarity. The results show that the clustering points are selected appropriately and that the overall execution time is reduced dramatically.
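
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the `leaf` table layout, the function names, the sample documents, the rule that a key is a leaf tag whose values are unique within each document, and the use of Jaccard overlap with a 0.5 threshold as a stand-in for the paper's content and structure similarity measures. Subtree validation and the semantic transformation of element names are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents; only `id` is unique within each document.
DOC_A = ("<db><paper><id>p1</id><venue>VLDB</venue><year>2005</year></paper>"
         "<paper><id>p2</id><venue>ICDE</venue><year>2005</year></paper>"
         "<paper><id>p3</id><venue>VLDB</venue><year>2006</year></paper></db>")
DOC_B = ("<lib><article><id>p2</id><venue>ICDE</venue><year>2005</year></article>"
         "<article><venue>VLDB</venue><year>2006</year></article></lib>")


def leaf_parent_subtrees(xml_text):
    """One {leaf_tag: text} dict per leaf-node parent, in document order.
    Repeated leaf tags under one parent overwrite each other (a simplification)."""
    subtrees = []
    for elem in ET.fromstring(xml_text).iter():
        leaves = [child for child in elem if len(child) == 0]
        if leaves:
            subtrees.append({c.tag: (c.text or "").strip() for c in leaves})
    return subtrees


def load(conn, doc_id, xml_text):
    """Store every leaf of every subtree as one relational row."""
    subtrees = leaf_parent_subtrees(xml_text)
    conn.executemany(
        "INSERT INTO leaf (doc, subtree, tag, value) VALUES (?, ?, ?, ?)",
        [(doc_id, i, t, v) for i, st in enumerate(subtrees) for t, v in st.items()],
    )
    return subtrees


def candidate_keys(conn):
    """Assumed key rule: tag occurs in both documents and never repeats a
    value inside one document."""
    rows = conn.execute(
        """SELECT tag FROM leaf GROUP BY tag
           HAVING COUNT(DISTINCT doc) = 2
              AND COUNT(*) = COUNT(DISTINCT doc || '|' || value)
           ORDER BY tag"""
    )
    return [r[0] for r in rows]


def match_by_key(conn, key):
    """Pair subtrees of A and B that share the same value for the key tag."""
    return conn.execute(
        """SELECT a.subtree, b.subtree FROM leaf AS a JOIN leaf AS b
             ON a.tag = b.tag AND a.value = b.value
           WHERE a.doc = 'A' AND b.doc = 'B' AND a.tag = ?
           ORDER BY a.subtree""",
        (key,),
    ).fetchall()


def jaccard(s1, s2):
    """Overlap of (tag, value) pairs; a stand-in similarity measure."""
    a, b = set(s1.items()), set(s2.items())
    return len(a & b) / len(a | b) if a | b else 0.0


def match_rest(subs_a, subs_b, keyed, threshold=0.5):
    """Greedily match subtrees that were not paired by key."""
    used_a = {i for i, _ in keyed}
    used_b = {j for _, j in keyed}
    pairs = []
    for i, sa in enumerate(subs_a):
        if i in used_a:
            continue
        scored = [(jaccard(sa, sb), j) for j, sb in enumerate(subs_b) if j not in used_b]
        if scored:
            score, j = max(scored)
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
                used_b.add(j)
    return pairs


def compare(xml_a, xml_b):
    """Return (candidate keys, key matches, similarity matches)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leaf (doc TEXT, subtree INTEGER, tag TEXT, value TEXT)")
    subs_a, subs_b = load(conn, "A", xml_a), load(conn, "B", xml_b)
    keys = candidate_keys(conn)
    keyed = match_by_key(conn, keys[0]) if keys else []
    rest = match_rest(subs_a, subs_b, keyed)
    conn.close()
    return keys, keyed, rest


if __name__ == "__main__":
    print(compare(DOC_A, DOC_B))
```

In this sketch the key join resolves the `p2` pair with a single equality lookup, so only the subtrees left unmatched are scored pairwise. That is the sense in which key matching cuts the number of comparisons.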