pq-hash: an efficient method for approximate XML joins

  • Authors:
  • Fei Li;Hongzhi Wang;Liang Hao;Jianzhong Li;Hong Gao

  • Affiliations:
  • The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology

  • Venue:
  • WAIM'10 Proceedings of the 2010 international conference on Web-age information management
  • Year:
  • 2010

Abstract

Approximate matching between large tree sets is broadly used in applications such as data integration and XML deduplication. However, most existing methods suffer from low efficiency and thus do not scale to large tree sets. pq-gram is a widely used method that yields high-quality matches. In this paper, we propose pq-hash as an improvement to pq-gram. As the basis of pq-hash, a randomized data structure, pq-array, is developed. With pq-array, large trees are represented as small fixed-size arrays. Sort-merge and hash join techniques are applied to these pq-arrays to avoid a nested-loop join. Theoretical analysis and experimental results show that pq-hash achieves much higher efficiency than pq-gram while retaining high join quality.
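
The abstract does not say how a pq-array is built, so the sketch below is only an illustration of the general idea of reducing a tree to a small fixed-size array. It extracts pq-grams following the usual definition (an anchor node with its p-1 ancestors, plus q consecutive children, padded with `*`) and then, as an assumption, substitutes a generic min-hash signature for the paper's pq-array. The names `pq_grams`, `signature`, and `estimated_similarity`, the `(label, children)` tree encoding, and the parameter `k` are all hypothetical, and grams are kept as a set rather than a bag for brevity.

```python
import hashlib


def pq_grams(tree, p=2, q=3):
    """Return the set of pq-grams of a tree encoded as (label, [children])."""
    grams = set()

    def visit(node, ancestors):
        label, children = node
        # Stem: the anchor node preceded by its p-1 nearest ancestors, '*'-padded.
        stem = (ancestors + [label])[-p:]
        stem = ["*"] * (p - len(stem)) + stem
        if not children:
            # A leaf contributes a single gram whose base is q dummy nodes.
            grams.add(tuple(stem + ["*"] * q))
        else:
            # Slide a window of width q over the '*'-padded child labels.
            padded = ["*"] * (q - 1) + [c[0] for c in children] + ["*"] * (q - 1)
            for i in range(len(padded) - q + 1):
                grams.add(tuple(stem + padded[i:i + q]))
        for child in children:
            visit(child, ancestors + [label])

    visit(tree, [])
    return grams


def signature(grams, k=16):
    """Assumed stand-in for a pq-array: k min-hash values over the gram set."""
    sig = []
    for seed in range(k):
        # Each seed acts as a different hash function; keep the smallest digest.
        sig.append(min(
            hashlib.blake2b(repr((seed, g)).encode(), digest_size=8).digest()
            for g in grams
        ))
    return sig


def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing positions, a min-hash estimate of set overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
print(len(pq_grams(t1)))
print(estimated_similarity(signature(pq_grams(t1)), signature(pq_grams(t2))))
```

Because every tree maps to an array of the same length, the arrays can be sorted or used as hash keys position by position, which is the kind of property that lets a join avoid comparing every pair of trees. Whether the paper's pq-array works this way is not stated in the abstract.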