Approximate joins for XML using g-string

  • Authors:
  • Fei Li;Hongzhi Wang;Cheng Zhang;Liang Hao;Jianzhong Li;Hong Gao

  • Affiliations:
  • The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology

  • Venue:
  • XSym'10 Proceedings of the 7th international XML database conference on Database and XML technologies
  • Year:
  • 2010

Abstract

When integrating XML documents from autonomous databases, exact joins often fail because data items representing the same real-world object may not be exactly the same. Thus the join must be approximate. Tree-edit-distance-based join methods have high join quality but low efficiency. By comparison, more efficient methods cannot perform the join as effectively as tree edit distance does. To balance efficiency and effectiveness, in this paper we propose a novel method to approximately join XML documents. In our method, trees are transformed into g-strings, in which each entry is a tiny subtree. The distance between two trees is then evaluated as the g-string distance between their corresponding g-strings. To make the g-string-based join method scale to large XML databases, we propose the g-bag distance as a lower bound of the g-string distance. With the g-bag distance, only a very small fraction of the g-string distances needs to be computed directly, so the whole join process can be done very efficiently. We theoretically analyze the properties of the g-string distance. Experiments with synthetic and various real-world data confirm the effectiveness and efficiency of our method and suggest that our technique is both scalable and useful.
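
The filter-then-verify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the g-string distance or the g-bag distance, so this sketch substitutes two well-known stand-ins: unit-cost edit distance over sequences of tokens (each token standing for one tiny subtree) and the classic bag distance, which is a lower bound of that edit distance. All function names and the tuple encoding of subtrees are hypothetical and are not taken from the paper.

```python
from collections import Counter


def edit_distance(a, b):
    """Unit-cost edit distance between two token sequences (stand-in, not the paper's g-string distance)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def bag_distance(a, b):
    """Classic bag (multiset) distance; never exceeds the unit-cost edit distance above."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, i.e. the multiset difference.
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def approximate_join(left, right, threshold):
    """Return matching index pairs and the number of full distance computations performed."""
    results, verified = [], 0
    for i, s in enumerate(left):
        for j, t in enumerate(right):
            # Cheap filter: if the lower bound already exceeds the threshold,
            # the true distance must exceed it too, so the pair is skipped.
            if bag_distance(s, t) > threshold:
                continue
            verified += 1
            if edit_distance(s, t) <= threshold:
                results.append((i, j))
    return results, verified


# Hypothetical encoding: each token is a tuple naming a tiny subtree.
left = [[("book", "title"), ("author",), ("year",)]]
right = [
    [("book", "title"), ("author",), ("price",)],
    [("cd", "artist"), ("label",), ("track",)],
]
pairs, verified = approximate_join(left, right, threshold=1)
# pairs == [(0, 0)] and verified == 1: the second candidate is pruned by the filter.
```

Because the lower bound is computed from token counts alone, it costs time linear in the sequence lengths, whereas the full distance is quadratic. Only pairs that survive the filter pay the quadratic cost.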