A system for detecting xml similarity in content and structure using relational database

  • Authors:
  • Waraporn Viyanon;Sanjay Kumar Madria

  • Affiliations:
  • Missouri University of Sc. and Tech, Rolla, MO, USA;Missouri University of Sc amd Tech, Rolla, MO, USA

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we describe a system incorporating an improved technique that detects the similarity of two XML documents based on content and structure similarity using keys. The technique consists of three major components: a subtree generator and validator, a key generator, and similarity components that compare content and structure of the XML documents. First, an XML document is stored in a relational database and extracted into small subtrees using leaf-node parents. The leaf-node parents are considered as a root of a subtree which is then recursively traversed bottom-up for matching. Second, a possible key(s) is identified in order to match XML subtrees from two documents efficiently. Key matchings help in reducing the number of comparisons dramatically. In addition, the number of subtrees to be processed is reduced in the subtree validation phase using instance statistics and taxonomic analyzer. The subtrees are matched by the key(s) first and the remaining subtrees are matched by finding degrees of similarity in content and structure. To obtain improved similarity comparison results, XML element names are transformed according to their semantic similarity. The results show that the clustering points are selected appropriately and the overall execution time is reduced dramatically.