When integrating XML documents from autonomous databases, exact joins often fail because the data items representing the same real-world object may not be identical. The join must therefore be approximate. Join methods based on tree edit distance achieve high join quality but are inefficient, while more efficient methods cannot match the join quality of tree edit distance. To balance efficiency and effectiveness, in this paper we propose a novel method for approximately joining XML documents. In our method, trees are transformed into g-strings, in which each entry is a tiny subtree. The distance between two trees is then evaluated as the g-string distance between their corresponding g-strings. To make the g-string-based join method scale to large XML databases, we propose the g-bag distance as a lower bound of the g-string distance. With the g-bag distance, only a very small fraction of the g-string distances needs to be computed directly, so the whole join process can be done very efficiently. We theoretically analyze the properties of the g-string distance. Experiments with synthetic and various real-world data confirm the effectiveness and efficiency of our method and suggest that our technique is both scalable and useful.
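The filter-and-verify idea described above can be illustrated with a minimal sketch. The abstract does not define g-strings or the g-bag distance, so the code below makes loud assumptions: each tree is taken to be already flattened into a sequence of hashable tokens (standing in for the tiny subtrees), the sequence distance is plain unit-cost edit distance, and the lower bound is the standard bag (multiset) distance, which never exceeds the edit distance. The names `edit_distance`, `bag_distance`, and `approximate_join` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def edit_distance(a, b):
    """Unit-cost edit distance between two token sequences (assumed stand-in
    for the g-string distance; the paper's definition may differ)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def bag_distance(a, b):
    """Standard bag distance: a cheap lower bound on edit_distance(a, b)
    (assumed stand-in for the g-bag distance)."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts (multiset difference).
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def approximate_join(left, right, tau):
    """Return index pairs whose edit distance is <= tau, plus the number of
    pairs that survived the filter and needed the expensive computation."""
    matches, verified = [], 0
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if bag_distance(a, b) > tau:
                continue  # lower bound already exceeds tau: safe to prune
            verified += 1
            if edit_distance(a, b) <= tau:
                matches.append((i, j))
    return matches, verified


if __name__ == "__main__":
    left = [("a", "b", "c")]
    right = [("a", "b", "d"), ("x", "y", "z")]
    print(approximate_join(left, right, tau=1))
```

Because the bag distance is a lower bound, pruning on it cannot discard a true match; it only reduces how often the quadratic-time distance is evaluated. This sketch still compares every pair with the cheap filter, so it shows the pruning principle rather than a scalable join.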