Approximate matching between large tree sets is widely used in applications such as data integration and XML deduplication. However, most existing methods suffer from low efficiency and therefore do not scale to large tree sets. pq-gram is a widely used method that yields high-quality matches. In this paper, we propose pq-hash as an improvement on pq-gram. As the basis of pq-hash, we develop a randomized data structure, the pq-array, with which large trees are represented as small fixed-size arrays. Sort-merge and hash join techniques are then applied to these pq-arrays to avoid a nested-loop join. Theoretical analysis and experimental results show that pq-hash retains high join quality while being much more efficient than pq-gram.
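The abstract does not spell out how a pq-array is built, so the following is only a minimal sketch of the general idea of representing a tree by a small fixed-size array. It assumes the usual pq-gram definition (an anchor node with its p-1 ancestors and q consecutive children, padded with `*`) and substitutes a plain bottom-k min-hash sketch for the pq-array; that substitution is an assumption, not the paper's construction. The tree encoding `(label, [children])` and the names `pq_grams`, `gram_hash`, `bottom_k_sketch`, and `estimate_similarity` are hypothetical.

```python
import hashlib

STAR = "*"  # dummy label used to pad pq-grams


def pq_grams(tree, p=2, q=3):
    """Return the list of pq-grams of a tree given as (label, [children])."""
    grams = []

    def visit(node, ancestors):
        label, children = node
        stem = ancestors[1:] + (label,)   # anchor plus its p-1 ancestors
        base = (STAR,) * q                # sliding window over the children
        if not children:
            grams.append(stem + base)
            return
        for child in children:
            base = base[1:] + (child[0],)
            grams.append(stem + base)
            visit(child, stem)
        for _ in range(q - 1):            # slide the window past the last child
            base = base[1:] + (STAR,)
            grams.append(stem + base)

    visit(tree, (STAR,) * p)
    return grams


def gram_hash(gram):
    """Map a pq-gram to a stable 64-bit integer."""
    data = "\x1f".join(gram).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bottom_k_sketch(tree, k=16, p=2, q=3):
    """Assumed stand-in for a pq-array: the k smallest distinct pq-gram hashes, sorted.

    Simplification: duplicate pq-grams are collapsed, so this works on sets, not bags.
    """
    hashes = {gram_hash(g) for g in pq_grams(tree, p, q)}
    return sorted(hashes)[:k]


def estimate_similarity(a, b, k=16):
    """Estimate Jaccard similarity by merging two sorted sketches."""
    i = j = seen = shared = 0
    while seen < k and i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
        seen += 1
    # Leftover entries of the longer sketch belong to the union but are not shared.
    seen += min(k - seen, (len(a) - i) + (len(b) - j))
    return shared / seen if seen else 1.0


t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
similarity = estimate_similarity(bottom_k_sketch(t1), bottom_k_sketch(t2))
```

Because every sketch is sorted and holds at most `k` entries, comparing two trees is a single linear merge whose cost does not depend on tree size. That is the property that makes sort-merge or hash-based joins over such arrays attractive compared with a nested-loop comparison of full pq-gram profiles.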