pq-hash: an efficient method for approximate XML joins

Authors:
Fei Li;Hongzhi Wang;Liang Hao;Jianzhong Li;Hong Gao
Affiliations:
The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology;The School of Computer Science and Technology, Harbin Institute of Technology
Venue:
WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Year:
2010



Abstract

Approximate matching between large tree sets is widely used in applications such as data integration and XML deduplication. However, most existing methods suffer from low efficiency and therefore do not scale to large tree sets. pq-gram is a widely used method that produces high-quality matches. In this paper, we propose pq-hash as an improvement on pq-gram. As the basis of pq-hash, we develop a randomized data structure, the pq-array, which represents large trees as small fixed-size arrays. Sort-merge and hash join techniques are then applied to these pq-arrays to avoid nested-loop joins. Theoretical analysis and experimental results show that pq-hash retains high join quality while achieving much higher efficiency than pq-gram.
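
The abstract does not say how a pq-array is built or how the join is executed, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes that a pq-array is a min-wise hash signature of a tree's pq-gram set, that the similarity of two trees is estimated as the fraction of array positions that agree, and that the hash join probes an index keyed by (position, value). The function names `pq_grams`, `pq_array` and `hash_join`, the parameters `k` and `threshold`, and the use of a set rather than a bag of pq-grams are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict

def pq_grams(tree, p=2, q=3):
    """Return the set of pq-grams of a tree given as (label, [children]).
    Simplification: a set is used here, although pq-gram profiles are bags."""
    grams = set()

    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)          # shift the node label into the stem
        base = ("*",) * q                   # dummy nodes pad the base
        if not children:
            grams.add(stem + base)
            return
        for child in children:
            base = base[1:] + (child[0],)   # slide the window over the children
            grams.add(stem + base)
            visit(child, stem)
        for _ in range(q - 1):              # trailing windows padded with dummies
            base = base[1:] + ("*",)
            grams.add(stem + base)

    visit(tree, ("*",) * p)
    return grams

def pq_array(grams, k=64):
    """Assumed construction: a fixed-size array of k min-hash values."""
    encoded = ["\x1f".join(g).encode() for g in grams]
    array = []
    for i in range(k):
        salt = i.to_bytes(8, "little")      # one salted hash function per slot
        array.append(min(
            int.from_bytes(
                hashlib.blake2b(e, digest_size=8, salt=salt).digest(), "big")
            for e in encoded))
    return array

def hash_join(left, right, threshold=0.5):
    """Join two {tree_id: pq_array} maps without comparing every pair."""
    index = defaultdict(list)
    for lid, arr in left.items():           # build phase
        for pos, val in enumerate(arr):
            index[(pos, val)].append(lid)
    hits = defaultdict(int)
    for rid, arr in right.items():          # probe phase
        for pos, val in enumerate(arr):
            for lid in index.get((pos, val), ()):
                hits[(lid, rid)] += 1
    k = len(next(iter(left.values())))
    return {pair: n / k for pair, n in hits.items() if n / k >= threshold}

t1 = ("a", [("b", []), ("c", [("d", [])])])
t2 = ("x", [("y", []), ("z", [])])
left = {"L1": pq_array(pq_grams(t1))}
right = {"R1": pq_array(pq_grams(t1)), "R2": pq_array(pq_grams(t2))}
result = hash_join(left, right)
```

In this sketch only pairs that share at least one array slot are ever counted, which is what lets the join skip the nested loop over all tree pairs. The array length `k` trades accuracy of the similarity estimate against space and join cost.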