Approximate joins for XML using g-string

Authors:
Fei Li;Hongzhi Wang;Cheng Zhang;Liang Hao;Jianzhong Li;Hong Gao
Affiliations:
The School of Computer Science and Technology, Harbin Institute of Technology (all authors)
Venue:
XSym'10 Proceedings of the 7th international XML database conference on Database and XML technologies
Year:
2010

Abstract

When integrating XML documents from autonomous databases, exact joins often fail because data items representing the same real-world object may not be exactly the same. Thus the join must be approximate. Join methods based on tree edit distance achieve high join quality but low efficiency, whereas more efficient methods cannot perform the join as effectively as tree edit distance does. To balance efficiency and effectiveness, in this paper we propose a novel method for approximately joining XML documents. In our method, trees are transformed into g-strings, in which each entry is a tiny subtree. The distance between two trees is then evaluated as the g-string distance between their corresponding g-strings. To make the g-string-based join method scale to large XML databases, we propose the g-bag distance as a lower bound of the g-string distance. With the g-bag distance, only a very small part of the g-string distances needs to be computed directly, so the whole join process can be done very efficiently. We theoretically analyze the properties of the g-string distance. Experiments with synthetic and various real-world data confirm the effectiveness and efficiency of our method and suggest that our technique is both scalable and useful.
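
The filter-and-verify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the g-string entries or either distance precisely, so everything below is an assumption made for illustration, not the authors' construction: a tree is a hypothetical `(label, children)` tuple, each gram is taken to be a node label together with its child labels, the g-string distance is taken to be unit-cost edit distance over the gram sequence, and the g-bag distance is the standard bag (multiset) distance, which never exceeds the unit-cost edit distance.

```python
from collections import Counter

# ASSUMPTION: a tree is (label, [children]); this is a hypothetical
# representation, not one taken from the paper.

def to_gstring(tree):
    """Preorder list of grams. ASSUMPTION: a gram is a node label plus
    the tuple of its child labels (a 'tiny subtree' of depth one)."""
    label, children = tree
    grams = [(label, tuple(child[0] for child in children))]
    for child in children:
        grams.extend(to_gstring(child))
    return grams

def gstring_distance(a, b):
    """ASSUMPTION: unit-cost edit distance between two gram sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]

def gbag_distance(a, b):
    """Bag distance: order is ignored, so it is cheap to compute and is a
    lower bound of the unit-cost edit distance above."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))

def approximate_join(left, right, threshold):
    """Return index pairs (i, j) whose g-string distance is <= threshold."""
    left_g = [to_gstring(t) for t in left]
    right_g = [to_gstring(t) for t in right]
    result = []
    for i, a in enumerate(left_g):
        for j, b in enumerate(right_g):
            # Filter: if the lower bound already exceeds the threshold,
            # the exact distance cannot be within it.
            if gbag_distance(a, b) > threshold:
                continue
            # Verify: compute the exact distance only for survivors.
            if gstring_distance(a, b) <= threshold:
                result.append((i, j))
    return result
```

Because the bag distance is a lower bound, skipping pairs that fail the filter cannot drop any pair that the exact distance would have accepted; it only avoids the quadratic-time computation for pairs that are clearly too far apart.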