The power of two min-hashes for similarity search among hierarchical data objects
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A cluster-based approach to XML similarity joins
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
The pq-gram distance between ordered labeled trees
ACM Transactions on Database Systems (TODS)
Exact and efficient proximity graph computation
ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
Approximate joins for XML using g-string
XSym'10 Proceedings of the 7th international XML database conference on Database and XML technologies
pq-hash: an efficient method for approximate XML joins
WAIM'10 Proceedings of the 2010 international conference on Web-age information management
PG-Skip: proximity graph based clustering of long strings
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
Ingredients for accurate, fast, and robust XML similarity joins
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part II
RTED: a robust algorithm for the tree edit distance
Proceedings of the VLDB Endowment
Measuring structural similarity of semistructured data based on information-theoretic approaches
The VLDB Journal — The International Journal on Very Large Data Bases
Efficient processing of containment queries on nested sets
Proceedings of the 16th International Conference on Extending Database Technology
RWS-Diff: flexible and efficient change detection in hierarchical data
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Clustering with Proximity Graphs: Exact and Efficient Algorithms
International Journal of Knowledge-Based Organizations
Hi-index | 0.00 |
In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.