LAX: an efficient approximate XML join based on clustered leaf nodes for XML data integration

  • Authors:
  • Wenxin Liang; Haruo Yokota

  • Affiliations:
  • Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan; Global Scientific Information and Computer Center, Tokyo Institute of Technology, Tokyo, Japan

  • Venue:
  • BNCOD'05 Proceedings of the 22nd British National Conference on Databases: Enterprise, Skills and Innovation
  • Year:
  • 2005

Abstract

An increasing amount of data is published and exchanged as XML on the Internet. However, different XML data sources might contain the same data but have different structures. An efficient method is therefore required to integrate such XML data sources so that users can conveniently access more complete and useful information. The tree edit distance is regarded as an effective metric for evaluating structural similarity between XML documents. However, it is extremely expensive to compute, and the traditional wisdom in join algorithms cannot easily be applied to it. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items, and the similarity between them is determined by calculating a similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering the XML documents for LAX. We show that the traditional wisdom in join algorithms is easy to apply to LAX and that the join result contains the complete information of the two documents. We then conduct experiments to compare LAX with the tree edit distance and evaluate its performance using both synthetic and real data sets. Our experimental results show that LAX is more efficient and more effective than the tree edit distance for measuring the approximate similarity between XML documents.
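
The idea of comparing subtrees by their leaf nodes can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives neither the clustering procedure nor the exact similarity formula, so the sketch assumes that each child of the root is one clustered subtree and that the similarity degree is the number of shared leaf (tag, text) pairs divided by the leaf count of the smaller subtree. The function names, the threshold, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET


def leaves(elem):
    """Return the set of (tag, text) pairs for the leaf elements under elem."""
    return {
        (node.tag, (node.text or "").strip())
        for node in elem.iter()
        if len(node) == 0  # an element with no children is a leaf
    }


def leaf_similarity(a, b):
    """Assumed similarity degree: shared leaves / leaves of the smaller subtree."""
    la, lb = leaves(a), leaves(b)
    if not la or not lb:
        return 0.0
    return len(la & lb) / min(len(la), len(lb))


def approximate_join(xml_a, xml_b, threshold):
    """Nested-loop join over the subtrees of two documents.

    Assumption: every direct child of the root is one clustered subtree.
    Returns (index_a, index_b, similarity) for pairs at or above threshold.
    """
    subtrees_a = list(ET.fromstring(xml_a))
    subtrees_b = list(ET.fromstring(xml_b))
    result = []
    for i, sa in enumerate(subtrees_a):
        for j, sb in enumerate(subtrees_b):
            sim = leaf_similarity(sa, sb)
            if sim >= threshold:
                result.append((i, j, sim))
    return result


# Two hypothetical sources holding the same item in different structures.
doc_a = (
    "<books>"
    "<book><title>XML Joins</title><author>Liang</author><year>2005</year></book>"
    "<book><title>Other</title><author>Smith</author></book>"
    "</books>"
)
doc_b = (
    "<library>"
    "<item><info><title>XML Joins</title><year>2005</year></info>"
    "<writer>Liang</writer></item>"
    "</library>"
)

matches = approximate_join(doc_a, doc_b, threshold=0.5)
print(matches)
```

Because the comparison looks only at leaves, the extra `info` level in the second document does not affect the score, whereas the renamed `writer` element simply fails to match. The nested loop is used here for brevity; since the similarity is a plain per-pair function, it could be replaced by other conventional join strategies.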